Commit Graph

21 Commits

Author SHA1 Message Date
Brecht Van Lommel
ecd54ba4e4 Cycles: Metal graphics interop
This is trivial with unified memory, and avoids one memory copy.

Pull Request: https://projects.blender.org/blender/blender/pulls/137363
2025-04-28 11:38:56 +02:00
Michael Jones
c23c4ae6ba Cycles: Fix issue affecting Metal kernel profiling (normally disabled)
This issue only affects profiling mode (`CYCLES_METAL_PROFILING=1`). There's a modest limit to the number of concurrent counter sampling buffers per device, so instead of creating one per device queue, we create one per device that can be reused by successive device queues.
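
A minimal sketch of the "one buffer per device, reused by queues" idea described above. All type and member names here (`CounterSampleBuffer`, `ProfiledDevice`, `ProfiledQueue`) are hypothetical stand-ins, not the actual Cycles or Metal types:

```cpp
#include <memory>

// Hypothetical stand-in for a GPU counter sample buffer (a scarce per-device resource).
struct CounterSampleBuffer {
  int id;
};

// Hypothetical device: lazily creates one buffer and hands the same one to every queue.
class ProfiledDevice {
 public:
  std::shared_ptr<CounterSampleBuffer> counter_sample_buffer()
  {
    if (!buffer_) {
      buffer_ = std::make_shared<CounterSampleBuffer>(CounterSampleBuffer{++buffers_created_});
    }
    return buffer_;
  }

  int buffers_created() const
  {
    return buffers_created_;
  }

 private:
  std::shared_ptr<CounterSampleBuffer> buffer_;
  int buffers_created_ = 0;
};

// Hypothetical queue: borrows the device's buffer instead of allocating its own.
class ProfiledQueue {
 public:
  explicit ProfiledQueue(ProfiledDevice &device) : buffer_(device.counter_sample_buffer()) {}

  const CounterSampleBuffer *buffer() const
  {
    return buffer_.get();
  }

 private:
  std::shared_ptr<CounterSampleBuffer> buffer_;
};
```

With this shape, creating any number of queues on one device allocates a single buffer, which keeps the count under the per-device limit mentioned in the message.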

Authored by Emma Liu.

Pull Request: https://projects.blender.org/blender/blender/pulls/136248
2025-03-21 12:47:15 +01:00
Michael Jones
584f19a5af Cycles: Apple Silicon tidy: Remove non-UMA codepaths (v2)
This PR removes a bunch of dead code following #123551 (removal of AMD and Intel GPU support). It is safe to assume that UMA will be available, so a lot of codepaths that dealt with copying between CPU and GPU are now just clutter.

Pull Request: https://projects.blender.org/blender/blender/pulls/136146
2025-03-19 12:53:01 +01:00
Brecht Van Lommel
ab3204e251 Revert "Cycles: Apple Silicon tidy: Remove non-UMA codepaths"
This reverts commit 1a93dfe4fc.

This is hitting asserts in the tests, revert until it's fixed.

Ref #136117
2025-03-18 20:37:23 +01:00
Michael Jones
1a93dfe4fc Cycles: Apple Silicon tidy: Remove non-UMA codepaths
This PR removes a bunch of dead code following #123551 (removal of AMD and Intel GPU support). It is safe to assume that UMA will be available, so a lot of codepaths that dealt with copying between CPU and GPU are now just clutter.

Pull Request: https://projects.blender.org/blender/blender/pulls/136117
2025-03-18 19:09:25 +01:00
Brecht Van Lommel
dd51c8660b Refactor: Cycles: Add const keyword where possible, using clang-tidy
Check was misc-const-correctness, combined with readability-isolate-declaration
as suggested by the docs.

The clang-format option "QualifierAlignment: Left" was used temporarily to get
consistency with the prevailing order of keywords.

Pull Request: https://projects.blender.org/blender/blender/pulls/132361
2025-01-03 10:23:20 +01:00
Brecht Van Lommel
d0c2e68e5f Refactor: Cycles: Automated clang-tidy fixups in Cycles
* Use .empty() and .data()
* Use nullptr instead of 0
* No else after return
* Simple class member initialization
* Add override for virtual methods
* Include C++ instead of C headers
* Remove some unused includes
* Use default constructors
* Always use braces
* Consistent names in definition and declaration
* Change typedef to using
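
For illustration only, a small self-contained snippet written in the style these checks produce. The types and the function (`NameList`, `Shape`, `Triangle`, `first_name`) are invented for this example and are not taken from Cycles:

```cpp
#include <string>
#include <vector>

using NameList = std::vector<std::string>;  // `using` instead of `typedef`

struct Shape {
  virtual ~Shape() = default;  // defaulted instead of an empty body
  virtual int sides() const = 0;
};

struct Triangle : Shape {
  int sides() const override  // explicit `override` on the virtual method
  {
    return 3;
  }
};

// Returns a pointer to the first name, or nullptr when the list is empty.
inline const std::string *first_name(const NameList &names)
{
  if (names.empty()) {  // `.empty()` instead of `size() == 0`, always braced
    return nullptr;     // `nullptr` instead of 0
  }
  return names.data();  // no `else` after `return`; `.data()` instead of `&names[0]`
}
```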

Pull Request: https://projects.blender.org/blender/blender/pulls/132361
2025-01-03 10:22:55 +01:00
Michael Jones
5a29be3c75 Cycles: Fix #116243, #122022 - MetalRT live viewport stability issues
This PR fixes live viewport stability issues on Mac when MetalRT is enabled.

There were two sources of instability:

1) `MTLAccelerationStructure` instances were not being correctly retained, meaning that use-after-free crashes could occur following a geometry sync.
2) `MTLIntersectionFunctionTable` objects could be unsafely shared between multiple `MetalDeviceQueue` instances (in this case, `setBuffer` being the unsafe mutation).

The solution to 2 involves creating a new `MetalDispatchPipeline` type which is strictly used by only 1 `MetalDeviceQueue` instance.
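
The ownership split behind the fix for 2 can be sketched roughly as follows. This is an assumption-laden illustration in plain C++: `CompiledPipeline`, `FunctionTable`, `DispatchPipeline` and `PipelineQueue` are hypothetical names, and the real `MetalDispatchPipeline` may be structured differently:

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical immutable compiled kernel state, safe to share between queues.
struct CompiledPipeline {
  int kernel_index;
};

// Hypothetical mutable binding table, standing in for an intersection function table.
struct FunctionTable {
  std::vector<const void *> buffers;

  void set_buffer(const void *buffer, std::size_t index)
  {
    if (buffers.size() <= index) {
      buffers.resize(index + 1, nullptr);
    }
    buffers[index] = buffer;
  }
};

// Per-queue wrapper: shares the compiled state, owns its own mutable table.
struct DispatchPipeline {
  std::shared_ptr<const CompiledPipeline> compiled;
  FunctionTable table;
};

// Hypothetical queue that never exposes its table to another queue.
class PipelineQueue {
 public:
  explicit PipelineQueue(std::shared_ptr<const CompiledPipeline> compiled)
      : pipeline_{std::move(compiled), FunctionTable{}}
  {
  }

  DispatchPipeline &pipeline()
  {
    return pipeline_;
  }

 private:
  DispatchPipeline pipeline_;
};
```

Two queues built from the same compiled state can then bind different buffers without one queue's `set_buffer` call being visible to the other.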

Pull Request: https://projects.blender.org/blender/blender/pulls/124055
2024-07-08 16:18:34 +02:00
Michael Jones
9b833fdeba Cycles: Use more accurate GPU counter timestamps for profiling in Metal
This PR replaces the existing CPU wall-clock based profiling mechanism with more precise GPU counter based timestamps. As before, it is enabled by setting the env var `CYCLES_METAL_PROFILING=1`. Original implementation by Morteza Mostajabodaveh.

Pull Request: https://projects.blender.org/blender/blender/pulls/121208
2024-04-29 15:25:32 +02:00
Stefan Werner
31d55e87f9 Cycles: Metal support for OpenImageDenoise
This is supported on Apple Silicon GPUs and macOS 13.0+.

Co-authored-by: Stefan Werner <stefan.werner@intel.com>
Co-authored-by: Attila Afra <attila.t.afra@intel.com>
Pull Request: https://projects.blender.org/blender/blender/pulls/116124
2024-02-06 21:13:23 +01:00
Campbell Barton
c12994612b License headers: use SPDX-FileCopyrightText in intern/cycles
2023-06-14 16:53:23 +10:00
Michael Jones
654e1e901b Cycles: Use local atomics for faster shader sorting (enabled on Metal)
This patch adds two new kernels: SORT_BUCKET_PASS and SORT_WRITE_PASS. These replace PREFIX_SUM and SORTED_PATHS_ARRAY on supported devices (currently implemented on Metal, but will be trivial to enable on the other backends). The new kernels exploit sort partitioning (see D15331) by sorting each partition separately using local atomics. This can give an overall render speedup of 2-3% depending on architecture. As before, we fall back to the original non-partitioned sorting when the shader count is "too high".
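
A serial CPU sketch of the two-pass idea (count per bucket, then scatter), applied one partition at a time. This is not the kernel code: the function name and parameters are hypothetical, and a plain vector stands in for the threadgroup-local memory and atomics a GPU version would use:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sorts path indices by shader key, one partition at a time (serial CPU sketch).
// Assumes partition_size > 0 and every key is in [0, num_shaders).
inline std::vector<int> sort_by_shader_partitioned(const std::vector<int> &shader_keys,
                                                   std::size_t num_shaders,
                                                   std::size_t partition_size)
{
  std::vector<int> sorted(shader_keys.size());
  for (std::size_t begin = 0; begin < shader_keys.size(); begin += partition_size) {
    const std::size_t end = std::min(begin + partition_size, shader_keys.size());

    // Bucket pass: count how many paths in this partition use each shader.
    std::vector<std::size_t> buckets(num_shaders, 0);
    for (std::size_t i = begin; i < end; i++) {
      buckets[shader_keys[i]]++;
    }

    // Exclusive prefix sum turns the counts into write offsets for this partition.
    std::size_t offset = begin;
    for (std::size_t s = 0; s < num_shaders; s++) {
      const std::size_t count = buckets[s];
      buckets[s] = offset;
      offset += count;
    }

    // Write pass: scatter each path index to its slot, keeping input order per shader.
    for (std::size_t i = begin; i < end; i++) {
      sorted[buckets[shader_keys[i]]++] = static_cast<int>(i);
    }
  }
  return sorted;
}
```

Because each partition only touches its own small bucket array, the counters can live in fast local memory, which is the property the message attributes the speedup to.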

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D16909
2023-02-06 11:18:26 +00:00
Michael Jones
8dd7b5b26b Cycles: Metal integrator state size tuning
This patch tunes the integrator state sizing for Metal (`num_concurrent_states` and `num_concurrent_busy_states`).

On all GPU architectures, we adjust the busy:total states ratio to be 1:4, which gives better rendering performance than the previous 1:16 ratio (independent of total state count). This gives a small performance uplift (e.g. 2-3% on M1 Ultra).

Additionally for M2 architectures, we double the overall state size if there is available headroom. Inclusive of the first change, we can expect uplift of close to 10% in future, as this results in larger dispatch sizes and minimises work submission overheads. In order to make an accurate determination of available headroom, we defer the calculation of `num_concurrent_states` and `num_concurrent_busy_states` until the time of integrator state allocation (i.e. after all of the scene data has been allocated). We also refactor `alloc_integrator_soa` to calculate an *exact* single-state-size in a first pass, right before allocating the integrator SoA buffers in a second pass.
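
The sizing rule can be sketched as below. Only the 1:4 ratio and the "double if there is headroom" idea come from the message; the function name, the parameters, and the exact headroom test (the doubled state must fit in the free memory) are assumptions made for illustration:

```cpp
#include <cstdint>

struct StateCounts {
  std::uint64_t num_concurrent_states;
  std::uint64_t num_concurrent_busy_states;
};

// Hypothetical sizing rule: optionally double the state count when the doubled
// integrator state still fits in the free memory, then derive busy states at 1:4.
inline StateCounts size_integrator_states(std::uint64_t base_states,
                                          std::uint64_t single_state_bytes,
                                          std::uint64_t free_bytes,
                                          bool allow_doubling)
{
  std::uint64_t total = base_states;
  if (allow_doubling && 2 * total * single_state_bytes <= free_bytes) {
    total *= 2;
  }
  return {total, total / 4};
}
```

Running this only after scene data has been allocated is what makes `free_bytes` meaningful, which matches the deferral described in the message.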

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D16313
2022-10-24 17:14:33 +01:00
Brecht Van Lommel
523bbf7065 Cycles: generalize shader sorting / locality heuristic to all GPU devices
This was added for Metal, but also gives good results with CUDA and OptiX.
Also enable it for future Apple GPUs instead of only M1 and M2, since this has
been shown to help across multiple GPUs, so the better bet seems to be enabling
it rather than disabling it.

Also moves some of the logic outside of the Metal device code, and always
enables the code in the kernel since other devices don't do dynamic compile.

Time per sample with OptiX + RTX A6000:
                                         new                  old
barbershop_interior                      0.0730s              0.0727s
bmw27                                    0.0047s              0.0053s
classroom                                0.0428s              0.0464s
fishy_cat                                0.0102s              0.0108s
junkshop                                 0.0366s              0.0395s
koro                                     0.0567s              0.0578s
monster                                  0.0206s              0.0223s
pabellon                                 0.0158s              0.0174s
sponza                                   0.0088s              0.0100s
spring                                   0.1267s              0.1280s
victor                                   0.0524s              0.0531s
wdas_cloud                               0.0817s              0.0816s

Ref D15331, T87836
2022-07-15 13:42:47 +02:00
Michael Jones
4b1d315017 Cycles: Improve cache usage on Apple GPUs by chunking active indices
This patch partitions the active indices into chunks prior to sorting by material in order to trade off some material coherence for better locality. On Apple Silicon GPUs (particularly higher end M1-family GPUs), we observe overall render time speedups of up to 15%. The partitioning is implemented by repeating the range of `shader_sort_key` for each partition, and encoding a "locator" key which distributes the indices into sorted chunks.
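
One way to read "repeating the range of `shader_sort_key` for each partition" is the key encoding sketched here. The function name, the parameters, and the choice of deriving the partition from the state index are assumptions, not the commit's actual code:

```cpp
#include <cstdint>

// Hypothetical key encoding: each partition gets its own copy of the shader key range,
// so sorting by the combined key groups by partition first and by shader second.
// Assumes partition_size > 0 and shader_sort_key < num_shaders.
inline std::uint32_t partitioned_sort_key(std::uint32_t state_index,
                                          std::uint32_t shader_sort_key,
                                          std::uint32_t num_shaders,
                                          std::uint32_t partition_size)
{
  const std::uint32_t partition = state_index / partition_size;
  return partition * num_shaders + shader_sort_key;
}
```

With 8 shaders and partitions of 1024 states, state 1500 using shader 3 lands in partition 1 and gets key 11, while state 10 with the same shader keeps key 3.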

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D15331
2022-07-14 14:26:18 +01:00
Michael Jones
19e0b60f3e Cycles: MetalDeviceQueue - capture of multiple dispatches, and some tidying
This patch adds a new mode of gpu capture (env var `CYCLES_DEBUG_METAL_CAPTURE_SAMPLES`) to capture a block of dispatches between "reset" calls. It also fixes member data naming inconsistencies and adds some missing OS version checks.

Screenshot showing .gputrace capture in Xcode 14.0 beta (using `CYCLES_DEBUG_METAL_CAPTURE_SAMPLES="1"` and `CYCLES_DEBUG_METAL_CAPTURE_LIMIT="10"`):

{F13155703}

Reviewed By: sergey, brecht

Differential Revision: https://developer.blender.org/D15179
2022-06-13 13:42:07 +01:00
Michael Jones
4412e14708 Cycles: Useful Metal backend debug & profiling functionality
This patch adds some useful debugging & profiling env vars to the Metal backend:

- `CYCLES_METAL_PROFILING`: output a per-kernel timing report at the end of the render
- `CYCLES_METAL_DEBUG`: enable per-dispatch tracing (very verbose)
- `CYCLES_DEBUG_METAL_CAPTURE_KERNEL`: enable programmatic .gputrace capture for a specified kernel index
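
A minimal sketch of how environment switches like these could be read with the standard library. The helper names and the parsing rule (unset, empty, or "0" means off) are assumptions; the commit does not state how Cycles interprets the values:

```cpp
#include <cstdlib>
#include <string>

// Assumed rule: a flag is on when the value is present, non-empty and not "0".
inline bool flag_value_enabled(const char *value)
{
  return value != nullptr && value[0] != '\0' && std::string(value) != "0";
}

// Reads a boolean switch such as CYCLES_METAL_PROFILING from the environment.
inline bool env_flag_enabled(const char *name)
{
  return flag_value_enabled(std::getenv(name));
}

// Reads an integer setting such as a kernel index, or returns `fallback` when unset.
inline int env_int(const char *name, int fallback)
{
  const char *value = std::getenv(name);
  return (value != nullptr && value[0] != '\0') ? std::atoi(value) : fallback;
}
```

Under this assumed rule, `env_flag_enabled("CYCLES_METAL_PROFILING")` would be true after launching with `CYCLES_METAL_PROFILING=1`, and `env_int("CYCLES_DEBUG_METAL_CAPTURE_KERNEL", -1)` would return -1 when no kernel index was requested.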

Here's an example of the timing report with `CYCLES_METAL_PROFILING` enabled:

```
---------------------------------------------------------------------------------------------------
Kernel name                                 Total threads   Dispatches     Avg. T/D    Time   Time%
---------------------------------------------------------------------------------------------------
integrator_init_from_camera                   657,407,232          161    4,083,274   0.24s   0.51%
integrator_intersect_closest                1,629,288,440          681    2,392,494  15.18s  32.12%
integrator_intersect_shadow                   751,652,291          470    1,599,260   5.80s  12.28%
integrator_shade_background                   304,612,074          263    1,158,220   1.16s   2.45%
integrator_shade_surface                    1,159,764,041          676    1,715,627  20.57s  43.52%
integrator_shade_shadow                       598,885,847          418    1,432,741   1.27s   2.69%
integrator_queued_paths_array               2,969,650,130          805    3,689,006   0.35s   0.74%
integrator_queued_shadow_paths_array          593,936,619          379    1,567,115   0.14s   0.29%
integrator_terminated_paths_array              22,205,417          155      143,260   0.05s   0.10%
integrator_sorted_paths_array               2,517,140,043          676    3,723,579   1.65s   3.50%
integrator_compact_paths_array                648,912,748          155    4,186,533   0.03s   0.07%
integrator_compact_states                      20,872,687          155      134,662   0.14s   0.29%
integrator_terminated_shadow_paths_array      374,100,675          438      854,111   0.16s   0.33%
integrator_compact_shadow_paths_array         503,768,657          438    1,150,156   0.05s   0.10%
integrator_compact_shadow_states               37,664,941          202      186,460   0.23s   0.50%
integrator_reset                               25,165,824            6    4,194,304   0.06s   0.12%
film_convert_combined_half_rgba                 3,110,400            6      518,400   0.00s   0.01%
prefix_sum                                            676          676            1   0.19s   0.40%
---------------------------------------------------------------------------------------------------
                                                                 6,760               47.27s 100.00%
---------------------------------------------------------------------------------------------------
```

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D15044
2022-06-07 11:08:39 +01:00
Sergey Sharybin
eccc9d8eba Cleanup: Remove unused function in Cycles queue
Noticed while looking into oneAPI patch.

Seems to be unused, without clear indication why/when it might be
needed. Removing the function simplifies adding the new backend.

Differential Revision: https://developer.blender.org/D14652
2022-04-19 10:32:07 +02:00
Brecht Van Lommel
9cfc7967dd Cycles: use SPDX license headers
* Replace license text in headers with SPDX identifiers.
* Remove specific license info from outdated readme.txt, instead leave details
  to the source files.
* Add list of SPDX license identifiers used, and corresponding license texts.
* Update copyright dates while we're at it.

Ref D14069, T95597
2022-02-11 17:47:34 +01:00
Michael Jones
e23b54a59f Cycles: Fix OS version warnings
This patch suppresses OS version warnings and hides currently unsupported Metal GPUs when enumerating devices.

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D13506
2021-12-08 15:08:12 +00:00
Michael Jones
9558fa5196 Cycles: Metal host-side code
This patch adds the Metal host-side code:

- Add all core host-side Metal backend files (device_impl, queue, etc)
- Add MetalRT BVH setup files
- Integrate with Cycles device enumeration code
- Revive `path_source_replace_includes` in util/path (required for MSL compilation)

This patch also includes a couple of small kernel-side fixes:

- Add an implementation of `lgammaf` for Metal [Nemes, Gergő (2010), "New asymptotic expansion for the Gamma function", Archiv der Mathematik](https://users.renyi.hu/~gergonemes/)
- Include "work_stealing.h" inside the Metal context class because it accesses state now
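
For orientation, here is one closed-form approximation of ln Γ(x) that is commonly attributed to Nemes. Whether the commit's `lgammaf` uses this exact variant is not stated here, so treat the function below (`lgamma_nemes`, a hypothetical name) as a sketch of the general approach rather than the Cycles implementation:

```cpp
#include <cmath>

// Approximates ln(Gamma(x)) for x > 0 using
//   ln Gamma(x) ~ 0.5 * (ln(2*pi) - ln(x)) + x * (ln(x + 1 / (12x - 1 / (10x))) - 1)
// Accuracy improves as x grows; small x would need a recurrence or reflection step.
inline double lgamma_nemes(double x)
{
  const double log_2pi = 1.8378770664093453;  // ln(2*pi)
  return 0.5 * (log_2pi - std::log(x)) +
         x * (std::log(x + 1.0 / (12.0 * x - 1.0 / (10.0 * x))) - 1.0);
}
```

A quick sanity check is to compare against the exact value at an integer, for example ln Γ(5) = ln 24.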

Ref T92212

Reviewed By: brecht

Maniphest Tasks: T92212

Differential Revision: https://developer.blender.org/D13423
2021-12-07 15:52:21 +00:00