This patch places threads that render the same pixel next to each
other, rather than threads that render the same sample. Threads
within a warp are therefore more coherent in memory access and
control flow, which improves performance in most of the scenes tested.
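
The difference between the two orderings can be illustrated with a small
host-side sketch. This is not the code from the patch: the names
(WorkItem, map_sample_major, map_pixel_major, distinct_pixels_in_warp)
and the flat thread index are hypothetical, chosen only to show how the
thread-to-work mapping changes which pixels end up in one warp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical work item: which pixel and which sample a thread renders.
struct WorkItem {
  uint32_t pixel;   // linear pixel index within the tile
  uint32_t sample;  // sample index within the launch
};

// Sample-major order: neighbouring threads render the same sample of
// different pixels.
WorkItem map_sample_major(uint32_t thread, uint32_t num_pixels, uint32_t num_samples) {
  assert(num_pixels > 0 && num_samples > 0);
  return {thread % num_pixels, thread / num_pixels};
}

// Pixel-major order: neighbouring threads render different samples of
// the same pixel.
WorkItem map_pixel_major(uint32_t thread, uint32_t num_pixels, uint32_t num_samples) {
  assert(num_pixels > 0 && num_samples > 0);
  return {thread / num_samples, thread % num_samples};
}

// Count how many distinct pixels the first warp (threads 0..warp_size-1)
// touches under a given mapping. Fewer pixels per warp suggests more
// coherent memory access and control flow.
template <typename Map>
std::size_t distinct_pixels_in_warp(Map map, uint32_t warp_size,
                                    uint32_t num_pixels, uint32_t num_samples) {
  std::set<uint32_t> pixels;
  for (uint32_t t = 0; t < warp_size; ++t) {
    pixels.insert(map(t, num_pixels, num_samples).pixel);
  }
  return pixels.size();
}
```

With an assumed 1024-pixel tile, 8 samples per launch and a warp size of
32, the sample-major mapping spreads one warp over 32 pixels, while the
pixel-major mapping keeps it on 4.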
Example benchmarks on a Quadro RTX 4000 (WDDM) on Windows 10:
Koro: 4:23 -> 3:46
BMW: 1:18 -> 1:25
Barbershop Interior: 17:52 -> 14:55
Classroom: 4:37 -> 3:45
Performance differences on OpenCL/AMD were hit and miss: some scenes
became faster, others became significantly slower. Therefore, this is
kept as a CUDA-only change for now.