Previously, the NLM kernels would be launched once per offset with one thread per pixel.
However, with the smaller tile sizes that are now feasible, there wasn't enough work to fully occupy GPUs which results in a significant slowdown.
Therefore, the kernels are now launched in a single call that handles all offsets at once.
This has two downsides: Memory accesses to accumulating buffers are now atomic, and more importantly, the temporary memory now has to be allocated for every shift at once, increasing the required memory.
On the other hand, of course, the smaller tiles significantly reduce the size of the memory.
The main bottleneck right now is the construction of the transformation - there is nothing to be parallelized there, one thread per pixel is the maximum.
I tried to parallelize the SVD implementation by storing the matrix in shared memory and launching one block per pixel, but that wasn't really going anywhere.
To make the new code somewhat readable, the handling of rectangular regions was cleaned up a bit and commented, it should be easier to understand what's going on now.
Also, some variables have been renamed to make the difference between buffer width and stride more apparent, in addition to some general style cleanup.
Goal is to reduce OpenCL kernel recompilations.
Currently viewport renders are still set to use 64 closures as this seems to
be faster and we don't want to cause a performance regression there. Needs
to be investigated.
Reviewed By: brecht
Differential Revision: https://developer.blender.org/D2775
This is a prequisite for getting host memory allocation to work. There appears
to be no support for 3D textures using host memory. The original version of
this code was written by Stefan Werner for D2056.
Some drivers may report very large allocation sizes, which could cause
unnecessary memory usage. This is now limited to 2gb which should
still be enough to get the needed performance benefits without waste.
* Remove tex_* and pixels_* functions, replace by mem_*.
* Add MEM_TEXTURE and MEM_PIXELS as memory types recognized by devices.
* No longer create device_memory and call mem_* directly, always go
through device_only_memory, device_vector and device_pixels.
CPU rendering will be restricted to a BVH2, which is not ideal for raytracing
performance but can be shared with the GPU. Decoupled volume shading will be
disabled to match GPU volume sampling.
The number of CPU rendering threads is reduced to leave one core dedicated to
each GPU. Viewport rendering will also only use GPU rendering still. So along
with the BVH2 usage, perfect scaling should not be expected.
Go to User Preferences > System to enable the CPU to render alongside the GPU.
Differential Revision: https://developer.blender.org/D2873
This change affects CUDA GPUs not connected to a display or connected to a
display but supporting compute preemption so that the display does not
freeze. I couldn't find an official list, but compute preemption seems to be
only supported with GTX 1070+ and Linux (not GTX 1060- or Windows).
This helps improve small tile rendering performance further if there are
sufficient samples x number of pixels in a single tile to keep the GPU busy.
Best guess is that cuInit() somehow interferes with the AMD graphics driver
on Windows, and switching the initialization order to do OpenCL first seems
to solve the issue.