Commit Graph

60 Commits

Author SHA1 Message Date
Michael Jones
5f61eca7af Cycles: Exploit non-uniform threadgroup sizes on Metal
This patch replaces `dispatchThreadgroups` with `dispatchThreads` which takes care of non-uniform threadgroup bounds. This allows us to remove the bounds guards in the integrator kernel entry points.

Pull Request: https://projects.blender.org/blender/blender/pulls/106217
2023-03-29 21:46:11 +02:00
Julian Eisel
30e517c3ca Merge branch 'blender-v3.5-release' 2023-03-15 13:07:26 +01:00
Michael Jones
089e8a1887 Cycles: Fix Metal API validation error (use uint instead of ushort)
This PR fixes an error that is given when Metal API validation is enabled. The compute grid can exceed 65536 threads so `ushort` is not sufficient for `metal_grid_id [[threadgroup_position_in_grid]]`.

This PR also fixes OS version warnings ([Cycles Metal: Unguarded access to newer macOS features #105630](https://projects.blender.org/blender/blender/issues/105630))

Pull Request: https://projects.blender.org/blender/blender/pulls/105763
2023-03-14 22:05:55 +01:00
Brecht Van Lommel
02c2970983 Cycles: add NanoVDB support for Metal on Apple Silicon
Contributed by Yulia Kuznetcova at Apple.

NanoVDB is patched to give add address spaces required by Metal. We hope that
in the future Metal will support the generic address space.

For AMD and Intel this is currently not available since it causes a performance
regression also on scenes without volumes.

Pull Request #104837
2023-02-21 15:03:52 +01:00
Campbell Barton
91346755ce Cleanup: use '#' prefix for issues instead of 'T'
Match the convention from Gitea instead of Phabricator's T for tasks.
2023-02-12 14:56:05 +11:00
Brecht Van Lommel
773a36d2f8 Fix Cycles OneAPI build error after recent changes 2023-02-06 15:36:49 +01:00
Brecht Van Lommel
9ad3a85f8b Fix Cycles GPU binaries build error after recent changes for Metal 2023-02-06 13:17:57 +01:00
Michael Jones
654e1e901b Cycles: Use local atomics for faster shader sorting (enabled on Metal)
This patch adds two new kernels: SORT_BUCKET_PASS and SORT_WRITE_PASS. These replace PREFIX_SUM and SORTED_PATHS_ARRAY on supported devices (currently implemented on Metal, but will be trivial to enable on the other backends). The new kernels exploit sort partitioning (see D15331) by sorting each partition separately using local atomics. This can give an overall render speedup of 2-3% depending on architecture. As before, we fall back to the original non-partitioned sorting when the shader count is "too high".

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D16909
2023-02-06 11:18:26 +00:00
Nikita Sirgienko
858fffc2df Cycles: oneAPI: add support for SYCL host task
This functionality is related only to debugging of SYCL implementation
via single-threaded CPU execution and is disabled by default.
Host device has been deprecated in SYCL 2020 spec and we removed it
in 305b92e05f.
Since this is still very useful for debugging, we're restoring a
similar functionality here through SYCL 2020 Host Task.
2023-01-03 20:47:24 +01:00
Brecht Van Lommel
222b64fcdc Fix Cycles CUDA crash when building kernels without optimizations (for debug)
In this case the blocksize may not the one we requested, which was assumed to be
the case. Instead get the effective block size from the compiler as was already
done for Metal and OneAPI.
2022-11-30 21:46:17 +01:00
Xavier Hallade
305b92e05f Cycles: oneAPI: remove use of SYCL host device
Host device is deprecated in SYCL 2020 spec, cpu device or standard C++
should be used instead.
2022-10-21 15:36:48 +02:00
Brecht Van Lommel
cf57624764 Cleanup: refactoring of kernel film function names and organization 2022-09-02 17:13:28 +02:00
Brecht Van Lommel
bb376da6df Fix Cycles MetalRT error after recent specialization changes 2022-07-15 18:28:13 +02:00
Xavier Hallade
a02992f131 Cycles: Add support for rendering on Intel GPUs using oneAPI
This patch adds a new Cycles device with similar functionality to the
existing GPU devices.  Kernel compilation and runtime interaction happen
via oneAPI DPC++ compiler and SYCL API.

This implementation is primarly focusing on Intel® Arc™ GPUs and other
future Intel GPUs.  The first supported drivers are 101.1660 on Windows
and 22.10.22597 on Linux.

The necessary tools for compilation are:
- A SYCL compiler such as oneAPI DPC++ compiler or
  https://github.com/intel/llvm
- Intel® oneAPI Level Zero which is used for low level device queries:
  https://github.com/oneapi-src/level-zero
- To optionally generate prebuilt graphics binaries: Intel® Graphics
  Compiler All are included in Linux precompiled libraries on svn:
  https://svn.blender.org/svnroot/bf-blender/trunk/lib The same goes for
  Windows precompiled binaries but for the graphics compiler, available
  as "Intel® Graphics Offline Compiler for OpenCL™ Code" from
  https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html,
  for which path can be set as OCLOC_INSTALL_DIR.

Being based on the open SYCL standard, this implementation could also be
extended to run on other compatible non-Intel hardware in the future.

Reviewed By: sergey, brecht

Differential Revision: https://developer.blender.org/D15254

Co-authored-by: Nikita Sirgienko <nikita.sirgienko@intel.com>
Co-authored-by: Stefan Werner <stefan.werner@intel.com>
2022-06-29 12:58:04 +02:00
Brecht Van Lommel
ff1883307f Cleanup: renaming and consistency for kernel data
* Rename "texture" to "data array". This has not used textures for a long time,
  there are just global memory arrays now. (On old CUDA GPUs there was a cache
  for textures but not global memory, so we used to put all data in textures.)
* For CUDA and HIP, put globals in KernelParams struct like other devices.
* Drop __ prefix for data array names, no possibility for naming conflict now that
  these are in a struct.
2022-06-20 12:30:48 +02:00
Michael Jones
4412e14708 Cycles: Useful Metal backend debug & profiling functionality
This patch adds some useful debugging & profiling env vars to the Metal backend:

- `CYCLES_METAL_PROFILING`: output a per-kernel timing report at the end of the render
- `CYCLES_METAL_DEBUG`: enable per-dispatch tracing (very verbose)
- `CYCLES_DEBUG_METAL_CAPTURE_KERNEL`: enable programatic .gputrace capture for a specified kernel index

Here's an example of the timing report with `CYCLES_METAL_PROFILING` enabled:

```
---------------------------------------------------------------------------------------------------
Kernel name                                 Total threads   Dispatches     Avg. T/D    Time   Time%
---------------------------------------------------------------------------------------------------
integrator_init_from_camera                   657,407,232          161    4,083,274   0.24s   0.51%
integrator_intersect_closest                1,629,288,440          681    2,392,494  15.18s  32.12%
integrator_intersect_shadow                   751,652,291          470    1,599,260   5.80s  12.28%
integrator_shade_background                   304,612,074          263    1,158,220   1.16s   2.45%
integrator_shade_surface                    1,159,764,041          676    1,715,627  20.57s  43.52%
integrator_shade_shadow                       598,885,847          418    1,432,741   1.27s   2.69%
integrator_queued_paths_array               2,969,650,130          805    3,689,006   0.35s   0.74%
integrator_queued_shadow_paths_array          593,936,619          379    1,567,115   0.14s   0.29%
integrator_terminated_paths_array              22,205,417          155      143,260   0.05s   0.10%
integrator_sorted_paths_array               2,517,140,043          676    3,723,579   1.65s   3.50%
integrator_compact_paths_array                648,912,748          155    4,186,533   0.03s   0.07%
integrator_compact_states                      20,872,687          155      134,662   0.14s   0.29%
integrator_terminated_shadow_paths_array      374,100,675          438      854,111   0.16s   0.33%
integrator_compact_shadow_paths_array         503,768,657          438    1,150,156   0.05s   0.10%
integrator_compact_shadow_states               37,664,941          202      186,460   0.23s   0.50%
integrator_reset                               25,165,824            6    4,194,304   0.06s   0.12%
film_convert_combined_half_rgba                 3,110,400            6      518,400   0.00s   0.01%
prefix_sum                                            676          676            1   0.19s   0.40%
---------------------------------------------------------------------------------------------------
                                                                 6,760               47.27s 100.00%
---------------------------------------------------------------------------------------------------
```

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D15044
2022-06-07 11:08:39 +01:00
Brecht Van Lommel
610619c203 Merge branch 'blender-v3.2-release' 2022-05-31 17:35:16 +02:00
Brecht Van Lommel
f2cd7e08fe Fix Cycles MNEE not working for Metal
Move MNEE to own kernel, separate from shader ray-tracing. This does introduce
the limitation that a shader can't use both MNEE and AO/bevel, but that seems
like the better trade-off for now.

We can experiment with bigger kernel organization changes later.

Differential Revision: https://developer.blender.org/D15070
2022-05-31 17:24:43 +02:00
Patrick Mours
a8c81ffa83 Cycles: Add half precision float support for volumes with NanoVDB
This patch makes it possible to change the precision with which to
store volume data in the NanoVDB data structure (as float, half, or
using variable bit quantization) via the previously unused precision
field in the volume data block.
It makes it possible to further reduce memory usage during
rendering, at a slight cost to the visual detail of a volume.

Differential Revision: https://developer.blender.org/D10023
2022-05-23 19:08:01 +02:00
Sergey Sharybin
9bb4bf5748 Fix missing 64bit casts when calculating Cycles render buffer offset
Found those missing casts while looking into a crash report made in
the Blender Chat. Was unable to reproduce the crash, but the casts
should totally be there to avoid integer overflow.
2022-05-23 15:59:52 +02:00
Stefan Werner
65dcb5ebd3 Cycles: Semantically separate 2D and 3D texture objects
Currently there are no functional changes.

Preparing for an upcoming oneAPI integration where such separation
in types is needed.
2022-04-01 19:44:31 +02:00
Stefan Werner
9c6dff70c8 Cycles: Introduce postfix for kernel body definition
Increases flexibility of code-generation for kernel entry points.

Currently no functional changes, preparing for integration with oneAPI.
2022-04-01 19:44:02 +02:00
Brecht Van Lommel
076079454f Cleanup: remove some unused Cycles GPU code
To make porting to other architectures easier, clarifying that this does not
need to be supported. The unused parallel_reduce implementation assumed warp
size 32, but is easy to update if we ever need it in the future.
2022-03-16 18:05:08 +01:00
Ethan-Hall
3902bebf18 Cycles: make smart interpolation fallback to cubic for GPU
Matching CPU and Eevee behavior.

Differential Revision: https://developer.blender.org/D14296
2022-03-11 18:27:58 +01:00
Brecht Van Lommel
a9a05d5597 Merge branch 'blender-v3.1-release' 2022-02-15 01:05:47 +01:00
Brecht Van Lommel
facd9d8268 Cleanup: clang-format 2022-02-15 01:05:25 +01:00
Brecht Van Lommel
35c261dfcf Merge branch 'blender-v3.1-release' 2022-02-11 23:58:41 +01:00
Michael Jones
27d3140b13 Cycles: Fix Metal kernel compilation for AMD GPUs
Workaround for a compilation issue preventing kernels compiling for AMD GPUs: Avoid problematic use of templates on Metal by making `gpu_parallel_active_index_array` a wrapper macro, and moving `blocksize` to be a macro parameter.

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D14081
2022-02-11 22:52:48 +00:00
Brecht Van Lommel
9cfc7967dd Cycles: use SPDX license headers
* Replace license text in headers with SPDX identifiers.
* Remove specific license info from outdated readme.txt, instead leave details
  to the source files.
* Add list of SPDX license identifiers used, and corresponding license texts.
* Update copyright dates while we're at it.

Ref D14069, T95597
2022-02-11 17:47:34 +01:00
Brecht Van Lommel
0cf2fafd81 Fix T94050, T94570, T94527: Cycles Bevel and AO nodes not working with Metal
Workaround what may be a compiler bug, solution found by Michael Jones.
2022-01-13 10:40:41 +01:00
Michael Jones
efe3d60a2c Cycles: Fix Metal build
This patch fixes a couple of new Metal kernel compilation errors: 1) a kernel parameter count overflow, and 2) missing address space qualifiers.

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D13763
2022-01-07 16:19:31 +00:00
Patrick Mours
8393ccd076 Cycles: Add OptiX temporal denoising support
Enables the `bpy.ops.cycles.denoise_animation()` operator again and modifies it to support
temporal denoising with OptiX. This requires renders that were done with both the "Vector"
and "Denoising Data" passes.

Differential Revision: https://developer.blender.org/D11442
2022-01-05 15:58:36 +01:00
Michael Jones
9558fa5196 Cycles: Metal host-side code
This patch adds the Metal host-side code:

- Add all core host-side Metal backend files (device_impl, queue, etc)
- Add MetalRT BVH setup files
- Integrate with Cycles device enumeration code
- Revive `path_source_replace_includes` in util/path (required for MSL compilation)

This patch also includes a couple of small kernel-side fixes:

- Add an implementation of `lgammaf` for Metal [Nemes, Gergő (2010), "New asymptotic expansion for the Gamma function", Archiv der Mathematik](https://users.renyi.hu/~gergonemes/)
- include "work_stealing.h" inside the Metal context class because it accesses state now

Ref T92212

Reviewed By: brecht

Maniphest Tasks: T92212

Differential Revision: https://developer.blender.org/D13423
2021-12-07 15:52:21 +00:00
Michael Jones
f613c4c095 Cycles: MetalRT support (kernel side)
This patch adds MetalRT support to Cycles kernel code. It is mostly additive in nature or confined to Metal-specific code, however there are a few areas where this interacts with other code:

- MetalRT closely follows the Optix implementation, and in some cases (notably handling of transforms) it makes sense to extend Optix special-casing to MetalRT. For these generalisations we now have `__KERNEL_GPU_RAYTRACING__` instead of `__KERNEL_OPTIX__`.
- MetalRT doesn't support primitive offsetting (as with `primitiveIndexOffset` in Optix), so we define and populate a new kernel texture, `__object_prim_offset`, containing per-object primitive / curve-segment offsets. This is referenced and applied in MetalRT intersection handlers.
- Two new BVH layout enum values have been added: `BVH_LAYOUT_METAL` and `BVH_LAYOUT_MULTI_METAL_EMBREE` for XPU mode). Some host-side enum case handling has been updated where it is trivial to do so.

Ref T92212

Reviewed By: brecht

Maniphest Tasks: T92212

Differential Revision: https://developer.blender.org/D13353
2021-11-29 15:20:26 +00:00
Michael Jones
eb7827e797 Cycles: Fix film convert address space mismatch on Metal
This patch fixes an address space mismatch in the film convert kernels on Metal. The `film_get_pass_pixel_...` functions take a `ccl_private` result pointer, but the film convert kernels pass a `ccl_global` memory pointer. Specialising the pass-fetch functions with templates results in compilation errors on Visual Studio, so instead this patch just adds an intermediate local on Metal.

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D13350
2021-11-26 13:58:48 +00:00
William Leeson
c49d2cbe92 Merge branch 'blender-v3.0-release' to bring in D13042:
Fix performance decrease with Scrambling Distance on
2021-11-25 09:41:03 +01:00
Alaska
b41c72b710 Fix performance decrease with Scrambling Distance on
With the current code in master, scrambling distance is enabled on non-hardware accelerated ray tracing devices see a measurable performance decrease when compared scrambling distance on vs off. From testing, this performance decrease comes from the large tile sizes scheduled in `tile.cpp`.

This patch attempts to address the performance decrease by using different algorithms to calculate the tile size for devices with hardware accelerated ray traversal and devices without. Large tile sizes for hardware accelerated devices and small tile sizes for others.

Most of this code is based on proposals from @brecht and @leesonw

Reviewed By: brecht, leesonw

Differential Revision: https://developer.blender.org/D13042
2021-11-25 09:32:26 +01:00
Michael Jones
d1f944c186 Cycles: declare constants at program scope on Metal
MSL requires that constant address space literals be declared at program
scope. This patch moves the `blackbody_table_r/g/b` and `cie_colour_match`
constants into separate files so they can be declared at the appropriate scope.

Ref T92212

Differential Revision: https://developer.blender.org/D13241
2021-11-18 14:38:05 +01:00
Michael Jones
d19e35873f Cycles: several small fixes and additions for MSL
This patch contains many small leftover fixes and additions that are
required for Metal-enablement:

- Address space fixes and a few other small compile fixes
- Addition of missing functionality to the Metal adapter headers
- Addition of various scattered `__KERNEL_METAL__` blocks (e.g. for
  atomic support & maths functions)

Ref T92212

Differential Revision: https://developer.blender.org/D13263
2021-11-18 14:38:02 +01:00
Sergey Sharybin
ce395c84a3 Merge branch 'blender-v3.0-release' 2021-11-11 15:29:35 +01:00
Sergey Sharybin
d26d3cfe19 Fix T92868: Cycles catcher with transparency crashes
The issue was caused by splitting happening twice.

Fixed by checking for split flag which is assigned to the both states
during split.

The tricky part was to write catcher data at the moment of split: the
transparency and shadow catcher sample count is to be accumulated at
that point. Now it is happening in the `intersect_closest` kernel.
The downside is that render buffer is to be passed to the kernel, but
the benefit is that extra split bounce check is not needed now.

Had to move the passes write to shadow catcher header, since include
of `film/passes.h` causes all the fun of requirement to have BSDF
data structures available.

Differential Revision: https://developer.blender.org/D13177
2021-11-11 15:21:35 +01:00
Brecht Van Lommel
3fa86f4b28 Merge branch 'blender-v3.0-release' 2021-11-10 20:19:09 +01:00
Brecht Van Lommel
6b0008129e Fix T92972: Cycles HIP wrong render display after a recent refactor
It's unclear why this fails. Maybe the size of half4 is not the expected
8 bytes and adjacent pixels are overwritten. Or there is some bug in the
HIP compiler writing a struct into global memory, which we probably don't
do elsewhere in the kernel.

Thanks to Thomas, William and Jeroen for helping investigate this.
2021-11-10 20:03:07 +01:00
Patrick Mours
f565620435 Fix T92985: CUDA errors with Cycles film convert kernels
rB3a4c8f406a3a3bf0627477c6183a594fa707a6e2 changed the macros that create the film
convert kernel entry points, but in the process accidentally changed the parameter definition
to one of those (which caused CUDA launch and misaligned address errors) and changed the
implementation as well. This restores the correct implementation from before.

In addition, the `ccl_gpu_kernel_threads` macro did not work as intended and caused the
generated launch bounds to end up with an incorrect input for the second parameter (it was
set to "thread_num_registers", rather than the result of the block number calculation). I'm
not entirely sure why, as the macro definition looked sound to me. Decided to simply go with
two separate macros instead, to simplify and solve this.

Also changed how state is captured with the `ccl_gpu_kernel_lambda` macro slightly, to avoid
a compiler warning (expression has no effect) that otherwise occurred.

Maniphest Tasks: T92985

Differential Revision: https://developer.blender.org/D13175
2021-11-10 15:49:50 +01:00
Michael Jones
3a4c8f406a Cycles: Adapt shared kernel/device/gpu layer for MSL
This patch adapts the shared kernel entrypoints so that they can be compiled as MSL (Metal Shading Language). Where possible, the adaptations avoid changes in common code.

In MSL, kernel function inputs are explicitly bound to resources. In the case of argument buffers, we declare a struct containing the kernel arguments, accessible via device pointer. This differs from CUDA and HIP where kernel function arguments are declared as traditional C-style function parameters. This patch adapts the entrypoints declared in kernel.h so that they can be translated via a new `ccl_gpu_kernel_signature` macro into the required parameter struct + kernel entrypoint pairing for MSL.

MSL buffer attribution must be applied to function parameters or non-static class data members. To allow universal access to the integrator state, kernel data, and texture fetch adapters, we wrap all of the shared kernel code in a `MetalKernelContext` class. This is achieved by bracketing the appropriate kernel headers with "context_begin.h" and "context_end.h" on Metal. When calling deeper into the kernel code, we must reference the context class (e.g. `context.integrator_init_from_camera`). This extra prefixing is performed by a set of defines in "context_end.h". These will require explicit maintenance if entrypoints change. We invite discussion on more maintainable ways to enforce correctness.

Lambda expressions are not supported on MSL, so a new `ccl_gpu_kernel_lambda` macro generates an inline function object and optionally capturing any required state. This yields the same behaviour. This approach is applied to all parallel_... implementations which are templated by operation. The lambda expressions in the film_convert... kernels don't adapt cleanly to use function objects. However, these entrypoints can be macro-generated more concisely to avoid lambda expressions entirely, instead relying on constant folding to handle the pixel/channel conversions.

A separate implementation of `gpu_parallel_active_index_array` is provided for Metal to workaround some subtle differences in SIMD width, and also to encapsulate some required thread parameters which must be declared as explicit entrypoint function parameters.

Ref T92212

Reviewed By: brecht

Maniphest Tasks: T92212

Differential Revision: https://developer.blender.org/D13109
2021-11-09 21:43:10 +00:00
Patrick Mours
440a3475b8 Cycles: Improve OptiX denoising with dark images and fix crash when denoiser is destroyed
Adds a pass before denoising that calculates the intensity of the image, which can be
passed into the OptiX denoiser for more optimal results for very dark or very bright images.

In addition this also fixes a crash that sometimes occurred on exit. The OptiX denoiser object
has to be destroyed before the OptiX device context object (since it references that). But in
C++ the destructor function of a class is called before its fields are destructed, so
"~OptiXDevice" was always called before "OptiXDevice::~Denoiser" and therefore
"optixDeviceContextDestroy" was called before "optixDenoiserDestroy", hence the crash.

Differential Revision: https://developer.blender.org/D13160
2021-11-09 14:49:00 +01:00
Brecht Van Lommel
fd25e883e2 Cycles: remove prefix from source code file names
Remove prefix of filenames that is the same as the folder name. This used
to help when #includes were using individual files, but now they are always
relative to the cycles root directory and so the prefixes are redundant.

For patches and branches, git merge and rebase should be able to detect the
renames and move over code to the right file.
2021-10-26 15:37:04 +02:00
Brecht Van Lommel
d7d40745fa Cycles: changes to source code folders structure
* Split render/ into scene/ and session/. The scene/ folder now contains the
  scene and its nodes. The session/ folder contains the render session and
  associated data structures like drivers and render buffers.
* Move top level kernel headers into new folders kernel/camera/, kernel/film/,
  kernel/light/, kernel/sample/, kernel/util/
* Move integrator related kernel headers into kernel/integrator/
* Move OSL shaders from kernel/shaders/ to kernel/osl/shaders/

For patches and branches, git merge and rebase should be able to detect the
renames and move over code to the right file.
2021-10-26 15:36:39 +02:00
Brecht Van Lommel
282516e53e Cleanup: refactor float/half conversions for clarity 2021-10-22 13:03:03 +02:00
Brecht Van Lommel
df00463764 Cycles: add shadow path compaction for GPU rendering
Similar to main path compaction that happens before adding work tiles, this
compacts shadow paths before launching kernels that may add shadow paths.

Only do it when more than 50% of space is wasted.

It's not a clear win in all scenes, some are up to 1.5% slower. Likely caused
by different order of scheduling kernels having an unpredictable performance
impact. Still feels like compaction is just the right thing to avoid cases
where a few shadow paths can hold up a lot of main paths.

Differential Revision: https://developer.blender.org/D12944
2021-10-21 15:38:03 +02:00