/* SPDX-FileCopyrightText: 2017-2022 Blender Foundation
 *
 * SPDX-License-Identifier: Apache-2.0 */

#ifndef __UTIL_RECT_H__
#define __UTIL_RECT_H__

#include "util/types.h"

CCL_NAMESPACE_BEGIN

/* Rectangles are represented as an int4 containing the coordinates of the lower-left and
 * upper-right corners in the order (x0, y0, x1, y1). */
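
/* Construct a rect from its lower-left corner (x0, y0) and size (w, h). */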
ccl_device_inline int4 rect_from_shape(int x0, int y0, int w, int h)
{
  return make_int4(x0, y0, x0 + w, y0 + h);
}
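
/* Expand the rect by d pixels on all four sides. */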
ccl_device_inline int4 rect_expand(int4 rect, int d)
{
  return make_int4(rect.x - d, rect.y - d, rect.z + d, rect.w + d);
}

/* Returns the intersection of two rects. */
ccl_device_inline int4 rect_clip(int4 a, int4 b)
{
  return make_int4(max(a.x, b.x), max(a.y, b.y), min(a.z, b.z), min(a.w, b.w));
}
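
/* Returns whether the rect contains at least one pixel. */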
ccl_device_inline bool rect_is_valid(int4 rect)
{
  return (rect.z > rect.x) && (rect.w > rect.y);
}

/* Returns the local row-major index of the pixel inside the rect. */
ccl_device_inline int coord_to_local_index(int4 rect, int x, int y)
{
  int w = rect.z - rect.x;
  return (y - rect.y) * w + (x - rect.x);
}
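
/* Example: in the rect (2, 3, 5, 7), w = 3 and the pixel (3, 4) gets the local
 * index (4 - 3) * 3 + (3 - 2) = 4; local_index_to_coord below maps index 4
 * back to (3, 4). */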

/* Finds the coordinates of a pixel given by its row-major index in the rect,
 * and returns whether the pixel is inside it. */
ccl_device_inline bool local_index_to_coord(int4 rect,
                                            int idx,
                                            ccl_private int *x,
                                            ccl_private int *y)
{
  int w = rect.z - rect.x;
  *x = (idx % w) + rect.x;
  *y = (idx / w) + rect.y;
  return (*y < rect.w);
}
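
/* Returns the number of pixels inside the rect. */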
ccl_device_inline int rect_size(int4 rect)
{
  return (rect.z - rect.x) * (rect.w - rect.y);
}
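
/* Illustrative usage sketch (not part of the header): clip a rect against
 * another and visit its pixels with the index helpers above. The names `tile`
 * and `window` are hypothetical.
 *
 *   int4 region = rect_clip(tile, window);
 *   if (rect_is_valid(region)) {
 *     for (int idx = 0; idx < rect_size(region); idx++) {
 *       int x, y;
 *       local_index_to_coord(region, idx, &x, &y);
 *       // process pixel (x, y)
 *     }
 *   }
 */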

CCL_NAMESPACE_END

#endif /* __UTIL_RECT_H__ */