test2/source/blender/compositor/shaders/compositor_parallel_reduction.glsl

/* SPDX-FileCopyrightText: 2022-2023 Blender Authors
 *
 * SPDX-License-Identifier: GPL-2.0-or-later */

/* This shader reduces the given texture into a smaller texture of a size equal to the number of
 * work groups. In particular, each work group reduces its contents into a single value and writes
 * that value to a single pixel in the output image. The shader can be dispatched multiple times to
 * eventually reduce the image into a single pixel.
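 *
 * As a worked example, assuming a work group size of 16x16 (an assumption, the size is set
 * outside this file) and a group count rounded up to cover the whole image, a 1920x1080 input
 * would shrink as follows:
 *
 *     1920x1080 -> 120x68 -> 8x5 -> 1x1
 *
 * That is, three dispatches would be needed, since ceil(1920 / 16) = 120, ceil(1080 / 16) = 68,
 * ceil(120 / 16) = 8, ceil(68 / 16) = 5, and an 8x5 texture fits in a single work group.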
 *
 * The shader works by loading the whole data of each work group into a linear array, then it
 * reduces the second half of the array onto the first half of the array, then it reduces the
 * second quarter of the array onto the first quarter of the array, and so on until only one
 * element remains. The following figure illustrates the process for sum reduction on 8 elements.
 *
 *     .---. .---. .---. .---. .---. .---. .---. .---.
 *     | 0 | | 1 | | 2 | | 3 | | 4 | | 5 | | 6 | | 7 |  Original data.
 *     '---' '---' '---' '---' '---' '---' '---' '---'
 *       |.____|_____|_____|_____|     |     |     |
 *       ||    |.____|_____|___________|     |     |
 *       ||    ||    |.____|_________________|     |
 *       ||    ||    ||    |.______________________|  <--First reduction. Stride = 4.
 *       ||    ||    ||    ||
 *     .---. .---. .---. .----.
 *     | 4 | | 6 | | 8 | | 10 |                       <--Data after first reduction.
 *     '---' '---' '---' '----'
 *       |.____|_____|     |
 *       ||    |.__________|                          <--Second reduction. Stride = 2.
 *       ||    ||
 *     .----. .----.
 *     | 12 | | 16 |                                  <--Data after second reduction.
 *     '----' '----'
 *       |.____|
 *       ||                                           <--Third reduction. Stride = 1.
 *     .----.
 *     | 28 |
 *     '----'                                         <--Data after third reduction.
 *
 *
 * The shader is generic enough to implement many types of reductions. This is done by using macros
 * that the developer should define to implement a certain reduction operation. Those macros are
 * TYPE, IDENTITY, INITIALIZE, LOAD, REDUCE, and WRITE. See the implementation below for more
 * information, as well as compositor_parallel_reduction_infos.hh for example reduction
 * operations. */
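
/* As an illustration only, a sum of luminance reduction could define the macros roughly as
 * follows. This is a sketch and is not copied from compositor_parallel_reduction_infos.hh, and
 * `luminance_coefficients` is a hypothetical push constant assumed for the example.
 *
 *   #define TYPE float
 *   #define IDENTITY 0.0
 *   #define INITIALIZE(value) dot((value).rgb, luminance_coefficients)
 *   #define LOAD(value) (value).x
 *   #define REDUCE(lhs, rhs) ((lhs) + (rhs))
 *
 * WRITE is left undefined in this sketch, so the result would be stored through the
 * float4(reduction_data[0]) fallback at the end of main(). */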

/* Doing the reduction in shared memory is faster, so create a shared array where the whole data
 * of the work group will be loaded and reduced. The 2D structure of the work group is irrelevant
 * for reduction, so we just load the data in a 1D array to simplify reduction. The developer is
 * expected to define the TYPE macro to be a float or a vec4, depending on the type of data being
 * reduced. */

#include "gpu_shader_compositor_texture_utilities.glsl"
#include "gpu_shader_math_vector_lib.glsl"
#include "gpu_shader_math_vector_reduce_lib.glsl"
#include "gpu_shader_utildefines_lib.glsl"

#define reduction_size (gl_WorkGroupSize.x * gl_WorkGroupSize.y)
shared TYPE reduction_data[reduction_size];

void main()
{
  int2 texel = int2(gl_GlobalInvocationID.xy);

  /* Initialize the shared array for out of bound invocations using the `IDENTITY` value. The
   * developer is expected to define the `IDENTITY` macro to be a value of type `TYPE` that does
   * not affect the output of the reduction. For instance, sum reductions have an identity of 0.0,
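   * while max value reductions need an identity that is no larger than any input, for instance
   * -FLT_MAX for arbitrary floats. Note that FLT_MIN is the smallest positive normalized float,
   * so it is only suitable when every input is known to be at least FLT_MIN. */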
  if (any(lessThan(texel, int2(0))) || any(greaterThanEqual(texel, texture_size(input_tx)))) {
    reduction_data[gl_LocalInvocationIndex] = IDENTITY;
  }
  else {
    float4 value = texture_load_unbound(input_tx, texel);

    /* Initialize the shared array given the previously loaded value. This step can be different
     * depending on whether this is the initial reduction pass or a later one. Indeed, the input
     * texture for the initial reduction is the source texture itself, while the input texture to a
     * later reduction pass is an intermediate texture after one or more reductions have happened.
     * This is significant because the data being reduced might be computed from the original data
     * and differ from it. For instance, when summing the luminance of an image, the original
     * data is a vec4 color, while the reduced data is a float luminance value. So for the initial
     * reduction pass, the luminance will be computed from the color, reduced, then stored into an
     * intermediate float texture. On the other hand, for later reduction passes, the luminance
     * will be loaded directly and reduced without extra processing. So the developer is expected
     * to define the INITIALIZE and LOAD macros to be expressions that derive the needed value from
     * the loaded value for the initial reduction pass and later ones respectively. */
    reduction_data[gl_LocalInvocationIndex] = is_initial_reduction ? INITIALIZE(value) :
                                                                     LOAD(value);
  }

  /* Reduce the reduction data by half on every iteration until only one element remains. See the
   * above figure for an intuitive understanding of the stride value. */
  for (uint stride = reduction_size / 2; stride > 0; stride /= 2) {
    barrier();

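    /* Only the threads whose index is below the current stride should be active, as can be seen
     * in the diagram above. Inactive threads skip to the next iteration, so they still reach the
     * barrier at the top of the loop. */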
    if (gl_LocalInvocationIndex >= stride) {
      continue;
    }

    /* Reduce each pair of elements that are stride apart, writing the result to the element with
     * the lower index, as can be seen in the diagram above. The developer is expected to define
     * the REDUCE macro to be a commutative and associative binary operator suitable for parallel
     * reduction. */
    reduction_data[gl_LocalInvocationIndex] = REDUCE(
        reduction_data[gl_LocalInvocationIndex], reduction_data[gl_LocalInvocationIndex + stride]);
  }
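
  /* Note that the halving above only folds in every element if `reduction_size` is a power of
   * two. As a hypothetical trace for a size of 6: stride 3 reduces elements 3..5 onto 0..2, then
   * stride 1 reduces element 1 onto 0 and the loop ends, so element 2 never reaches element 0.
   * A 16x16 work group, for instance, gives 256 elements and halves cleanly down to 1. */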

  /* Finally, the result of the reduction is available as the first element in the reduction data,
   * so write it to the pixel corresponding to the work group, making sure only one thread writes
   * it. */
  barrier();
  if (gl_LocalInvocationIndex == 0) {
    /* If no WRITE macro is provided, we assume the reduction type can be passed to the float4
     * constructor. If not, WRITE is expected to be defined to construct the output value. */
#if defined(WRITE)
    imageStore(output_img, int2(gl_WorkGroupID.xy), WRITE(reduction_data[0]));
#else
    imageStore(output_img, int2(gl_WorkGroupID.xy), float4(reduction_data[0]));
#endif
  }
}