This commit introduces proper handling of ROCm 5 and ROCm 6 runtimes on
Linux, based on the version of the ROCm compiler used at build time.
Previously, HIPEW (the HIP equivalent of Cuda Wrangler) defaulted to
loading the ROCm 5 runtime. If ROCm 5 was unavailable, it would attempt
to load ROCm 6. However, ROCm 6 introduces changes in certain
structures and functions that are not backward compatible, leading to
potential issues when kernels compiled with the ROCm 6 compiler are
executed on the ROCm 5 runtime.
### Summary of Changes:
**Separation of Structures and Functions:**
Structures and functions are now separated into hipew5 and hipew6 to
accommodate the differences between ROCm versions.
**Build-Time Version Detection:**
The ROCm version is determined during build time, and the corresponding
hipew5 or hipew6 is included accordingly.
**Runtime Default to ROCm 6:**
By default, HIPEW now loads the ROCm 6 runtime and
includes hipew6 (Linux only).
**JIT Compilation Behavior:**
Since ROCm 6 is the default version, JIT compilation is supported only
when the ROCm 6 compiler is detected at runtime.
**HIP-RT Update:**
HIP-RT has been updated to load the ROCm 6 runtime by default.
These changes ensure compatibility and stability when switching
between ROCm versions, avoiding issues caused by runtime
and compiler mismatches.
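As a rough illustration of the build-time selection described above, the sketch below picks one of two declaration sets with a preprocessor check. The macro `SKETCH_ROCM_MAJOR`, the `hipew_sketch` namespace, and the library file names are assumptions made for this example, not the identifiers used in HIPEW.

```cpp
#include <cassert>

// Hypothetical stand-in for the ROCm major version detected by the build system.
#ifndef SKETCH_ROCM_MAJOR
#  define SKETCH_ROCM_MAJOR 6
#endif

namespace hipew_sketch {

#if SKETCH_ROCM_MAJOR >= 6
// Declarations matching the ROCm 6 structures would live here (the "hipew6" side).
constexpr int abi_major = 6;
constexpr const char *runtime_library = "libamdhip64.so.6";  // assumed file name
#else
// Declarations matching the ROCm 5 structures would live here (the "hipew5" side).
constexpr int abi_major = 5;
constexpr const char *runtime_library = "libamdhip64.so.5";  // assumed file name
#endif

// True when the loaded runtime matches the version the code was built against.
inline bool runtime_matches(int loaded_major)
{
  assert(loaded_major > 0);
  return loaded_major == abi_major;
}

}  // namespace hipew_sketch
```

The point of the sketch is only that the structure layout and the runtime to load are chosen together, so they cannot disagree.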
Co-authored-by: Alaska <alaskayou01@gmail.com>
Co-authored-by: Sergey Sharybin <sergey@blender.org>
Pull Request: https://projects.blender.org/blender/blender/pulls/130153
When a scene contains distant lights and local lights, the first step
of the light tree traversal is to compute the importance of
distant lights vs local lights and pick one based on a random number.
In the specific case where there is only one distant light,
the line of code changed in this commit
effectively reduced to:
`min_importance = fast_cosf(x) < cosf(x) ? 0.0 : compute_min_importance`
Since `fast_cosf(x)` is only an approximation of `cosf(x)`, the result of
this comparison depended on the hardware, compiler, and the specific value
being tested, so different configurations could take different code paths.
This commit fixes this issue by turning the comparison into
`fast_cosf(x) < fast_cosf(x)`.
---
Why does `cos_theta_plus_theta_u < cosf(bcone.theta_e - bcone.theta_o)`
reduce to `fast_cosf(x) < cosf(x)` in this specific case?
- `cos_theta_plus_theta_u` is computed as
`cos_theta * cos_theta_u - sin_theta * sin_theta_u`
- `cos_theta` is always 1.0 in the case of a single distant light.
- `cos_theta_u` is computed earlier as `fast_cosf(theta_e)` in
`distant_light_tree_parameters()`
- `sin_theta` is zero, so the second term of the expression vanishes.
This reduces `cos_theta_plus_theta_u` to `fast_cosf(theta_e)`.
`cosf(bcone.theta_e - bcone.theta_o)` reduces to `cosf(bcone.theta_e)`
because for the case of a single distant light `theta_o` is always 0.
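
The reduction above can be reproduced in a small standalone sketch. `approx_cosf()` is a hypothetical polynomial standing in for `fast_cosf()`; it is not the Cycles implementation, and the function names below are made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical approximation standing in for fast_cosf(); not the Cycles code.
inline float approx_cosf(float x)
{
  const float x2 = x * x;
  return 1.0f - x2 * (0.5f - x2 * (1.0f / 24.0f));
}

// cos(theta + theta_u) through the angle-sum identity, as in the list above.
inline float cos_theta_plus_theta_u(float cos_theta, float sin_theta, float theta_u)
{
  assert(std::isfinite(theta_u));
  const float cos_theta_u = approx_cosf(theta_u);
  const float sin_theta_u = std::sin(theta_u);
  return cos_theta * cos_theta_u - sin_theta * sin_theta_u;
}

// Old comparison: approximate cosine on the left, exact cosine on the right.
// Its result depends on the sign of the approximation error.
inline bool old_comparison(float theta_e)
{
  return cos_theta_plus_theta_u(1.0f, 0.0f, theta_e) < std::cos(theta_e);
}

// New comparison: the same approximation on both sides, so with a single
// distant light (cos_theta = 1, sin_theta = 0, theta_o = 0) it is never true.
inline bool new_comparison(float theta_e)
{
  return cos_theta_plus_theta_u(1.0f, 0.0f, theta_e) < approx_cosf(theta_e);
}
```

With `cos_theta = 1` and `sin_theta = 0` the left-hand side is exactly `approx_cosf(theta_e)`, which is why only the choice of cosine on the right-hand side decides the branch.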
Pull Request: https://projects.blender.org/blender/blender/pulls/131932
On Linux, Cycles HIP has a JIT compilation feature.
This feature is used when Cycles cannot find a precompiled kernel
for your GPU, which is most common when using hardware that had not
been released yet when that version of Blender shipped.
There were various issues with this JIT compilation system; this commit
aims to solve them. The changes include:
- Enable `WITH_NANOVDB` when Blender is built with NanoVDB.
  - This fixes an issue where VDB objects would not render.
- Enable some extra debug options for developers when desired
  (this matches the CUDA implementation of the same feature).
- Reduce the optimization level from -O3 to the default.
  - This is to avoid any extra issues that may occur as a result
    of an increased optimization level that isn't tested with
    precompiled kernels.
- Reduce the optimization level even further to -O1 for Vega.
  - This was done on precompiled kernels to work around some issues,
    so I decided to apply it to JIT kernels as well.
  - Note: Although Vega is not officially supported, this may help
    people that unofficially use Vega.
- Added some previously missing compiler arguments and fixed errors that
  were introduced when enabling these compiler arguments.
- Fixed an issue where JIT compilation would fail if Blender was
  installed in a path that had a space in it.
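
To illustrate the last item, here is a minimal sketch of why a space in the install path can break a compile command and how quoting avoids it. `quote_path()` and `build_compile_command()` are hypothetical helpers for this example; the commit message does not say how the actual fix is implemented.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: wrap a path in double quotes so that a shell treats it
// as a single argument even when it contains spaces.
inline std::string quote_path(const std::string &path)
{
  // This sketch does not handle paths that already contain a double quote.
  assert(path.find('"') == std::string::npos);
  return "\"" + path + "\"";
}

// Hypothetical command builder; the compiler name is a placeholder.
inline std::string build_compile_command(const std::string &compiler,
                                         const std::string &source,
                                         const std::string &output)
{
  return quote_path(compiler) + " " + quote_path(source) + " -o " + quote_path(output);
}
```

Without the quotes, a source path such as `/opt/my blender/kernel.cpp` would be split into two arguments by the shell.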
Pull Request: https://projects.blender.org/blender/blender/pulls/131853
The previous implementation for 5 < d < 50 was taken from the main paper;
fits for smaller sizes are found in the supplemental material. They are less
forward-scattering.
Pull Request: https://projects.blender.org/blender/blender/pulls/130234
`CLOSURE_WEIGHT_CUTOFF` avoids allocating a closure when its weight is
too small. It makes sense for surface closures, but for volume closures
the contribution also depends on the object size/ray length, so such a
cutoff is arbitrary and causes problems in atmospheric scattering.
Therefore, remove the cutoff for volumes and only make sure the weight is
positive.
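
A minimal sketch of the two acceptance rules, for illustration only. The cutoff value, the `Weight3` type, and both function names are assumptions; the real check lives in the Cycles kernel and may be structured differently.

```cpp
#include <algorithm>
#include <cassert>

// Assumed value for illustration; see the Cycles kernel for the real constant.
constexpr float SKETCH_CLOSURE_WEIGHT_CUTOFF = 1e-5f;

// Hypothetical RGB weight type.
struct Weight3 {
  float r, g, b;
};

inline float max_component(const Weight3 &w)
{
  return std::max(w.r, std::max(w.g, w.b));
}

// Surface closures: skip allocation when the weight is below the cutoff.
inline bool keep_surface_closure(const Weight3 &w)
{
  return max_component(w) >= SKETCH_CLOSURE_WEIGHT_CUTOFF;
}

// Volume closures: a tiny coefficient can still matter over a long ray,
// so only require the weight to be positive.
inline bool keep_volume_closure(const Weight3 &w)
{
  return max_component(w) > 0.0f;
}
```

For example, a scattering weight of `1e-7` per unit length is rejected by the surface rule but kept by the volume rule, where it can accumulate over a long ray.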
Pull Request: https://projects.blender.org/blender/blender/pulls/131696
The original paper uses the single scattering albedo `sigma_s/sigma_t`
to pick a channel for sampling the scattering distance. However, this
only considers the situation where there is scattering inside the volume.
If some channel has an extinction coefficient of zero, the light passes
through without attenuation for that channel. We assign such channels
a weight of 1 instead of 0 to make sure they can be sampled.
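
The weighting rule can be sketched as follows. The function name and the use of `std::array` are assumptions for this example and do not mirror the kernel code.

```cpp
#include <array>
#include <cassert>

// Per-channel weight used to pick a channel for distance sampling:
// the single scattering albedo sigma_s / sigma_t, or 1 when sigma_t is zero
// so that a channel which passes through unattenuated can still be sampled.
inline std::array<float, 3> channel_sampling_weights(const std::array<float, 3> &sigma_s,
                                                     const std::array<float, 3> &sigma_t)
{
  std::array<float, 3> weights{};
  for (int i = 0; i < 3; i++) {
    assert(sigma_s[i] >= 0.0f && sigma_t[i] >= 0.0f);
    weights[i] = (sigma_t[i] > 0.0f) ? sigma_s[i] / sigma_t[i] : 1.0f;
  }
  return weights;
}
```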
Pull Request: https://projects.blender.org/blender/blender/pulls/131741
The cause is numerical issues with `fast_sinf()`. While fixing
`fast_sinf()` would ultimately fix the problem, it involves more
complications in other code paths, and it is safer to clamp the
integration range anyway.
Pull Request: https://projects.blender.org/blender/blender/pulls/131689
The original OSL Shading System API was stateful: You'd create a shader
group, configure it, and then end it. However, this means that only one
group can be created at a time. Further, since Cycles reuses the
Shading System for multiple instances (e.g. viewport render and
material preview), a process-wide mutex is needed.
However, for years now OSL has had a better interface, where you
explicitly provide the group you refer to. With this, we can not only
get rid of the mutex, but actually multi-thread the shader setup even
within one instance.
Realistically, most time is still spent in the JIT stage, but it's
still better than nothing.
Pull Request: https://projects.blender.org/blender/blender/pulls/130133
Before, we'd just zero out the memory of the struct and then set the
defaults afterwards, but that:
- Prevents us from storing non-POD types
- Silently assumes that array<float> is safe to zero out (it currently
is, but that is still ugly and risky)
- Bloats the code since every non-zero entry now needs two lines
So, just make use of C++11 here. All the default values that were
previously unset are taken from the Blender-side defaults.
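
For illustration, this is the general pattern with C++11 default member initializers. The struct and its fields are hypothetical and not taken from the Cycles code.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical parameter struct: every default sits next to its declaration,
// and non-POD members are initialized by their own constructors.
struct SketchParams {
  int samples = 128;
  float exposure = 1.0f;
  bool use_denoising = false;
  std::vector<float> curve;             // starts out empty
  std::string pass_name = "combined";
};

// Zeroing such a struct with memset() would not be valid.
static_assert(!std::is_trivially_copyable<SketchParams>::value,
              "SketchParams holds non-POD members");
```

A default-constructed `SketchParams` then carries all defaults without a separate initialization step.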
Pull Request: https://projects.blender.org/blender/blender/pulls/130870
This PR fixes #130641. The bug was caused by a missing self-object constraint when performing SSS in scenes with motion blur. `scene_intersect_local` tests were erroneously hitting other objects, and out-of-range primitive IDs were causing spurious downstream behavior.
Pull Request: https://projects.blender.org/blender/blender/pulls/131156
This logic is copied from surface shader, so that the sampled closure
does not need to be evaluated twice when summing all the closures, but
it is not used in volume.
Based on #123377 by @brecht, but Gitea doesn't like the rebase,
so here's a new PR.
The purpose here is to switch to fused OptiX programs for OSL execution
on CUDA. On the one hand, this makes the code simpler, but there's
also another advantage: how memory allocation is managed.
OSL shaders need memory to store intermediate values, but how much is
needed depends on the complexity of the shader. With the split program
approach, Cycles had to provide that memory, so we had to allocate a
certain amount (2 KiB, to be precise) statically and show an error if
the shader would need more. If the shader used less (which is the case
for the vast majority), the memory was just wasted.
By switching to fused kernels, OSL knows the required amount during JIT
codegen, so it can allocate only what's required, which avoids this
waste. One still needs to set a maximum, and in theory, OSL would also
support spilling over into a Cycles-provided alternative memory region.
However, we currently don't implement that - instead, we default to the
same 2048 limit as before and let advanced users override it via the
CYCLES_OSL_GROUPDATA_ALLOC environment variable if really needed.
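
A minimal sketch of such an override, assuming the variable holds a plain positive integer. The parsing and validation rules and both function names are assumptions; only the variable name and the 2048 default come from this description.

```cpp
#include <cassert>
#include <cstdlib>
#include <limits>

constexpr int SKETCH_DEFAULT_GROUPDATA_ALLOC = 2048;

// Parse the override value; fall back to the default when it is missing or invalid.
inline int parse_groupdata_alloc(const char *value)
{
  if (value == nullptr || *value == '\0') {
    return SKETCH_DEFAULT_GROUPDATA_ALLOC;
  }
  char *end = nullptr;
  const long parsed = std::strtol(value, &end, 10);
  if (*end != '\0' || parsed <= 0 || parsed > std::numeric_limits<int>::max()) {
    return SKETCH_DEFAULT_GROUPDATA_ALLOC;
  }
  return static_cast<int>(parsed);
}

// Read the limit from the environment.
inline int groupdata_alloc_from_environment()
{
  return parse_groupdata_alloc(std::getenv("CYCLES_OSL_GROUPDATA_ALLOC"));
}
```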
Co-authored-by: Brecht Van Lommel <brecht@blender.org>
Pull Request: https://projects.blender.org/blender/blender/pulls/130149
The `poll()` function for `CYCLES_PT_context_material` was using the
legacy `GPENCIL` identifier as opposed to `GREASEPENCIL`. This caused
duplicated material templates to show in the material tab.
Pull Request: https://projects.blender.org/blender/blender/pulls/130962
Turns out that with `-fassociative-math`, GCC turns
`(1.0f - cos_NH2) + alpha2 * cos_NH2` into
`cos_NH2 * (alpha2 - 1.0f) + 1.0f`.
Not sure why since the operation count is the same, but if alpha2 is very
small, `alpha2 - 1.0f` will be exactly -1.0f, which then causes issues.
Luckily, having one_minus_cos_NH2 as its own variable appears to be enough to
make GCC keep the original formulation.
Just to be safe, I've also used one_minus_cos_NH2 in the other branch to
hopefully reduce the chance of it being folded in again. Also turns a
division into a reciprocal, which is in theory slightly faster.
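
The precision loss can be reproduced in isolation. In this sketch the two formulations are written out by hand, and `volatile` is used only to stop the compiler from re-folding them; the function names are made up for the example.

```cpp
#include <cassert>

// Original formulation: (1 - cos_NH2) + alpha2 * cos_NH2.
inline float original_form(float cos_NH2, float alpha2)
{
  assert(alpha2 >= 0.0f);
  volatile float one_minus_cos_NH2 = 1.0f - cos_NH2;
  return one_minus_cos_NH2 + alpha2 * cos_NH2;
}

// Reassociated formulation: cos_NH2 * (alpha2 - 1) + 1.
inline float reassociated_form(float cos_NH2, float alpha2)
{
  // For very small alpha2 this rounds to exactly -1.0f in single precision.
  volatile float alpha2_minus_one = alpha2 - 1.0f;
  return cos_NH2 * alpha2_minus_one + 1.0f;
}
```

With `cos_NH2 = 1.0f` and `alpha2 = 1e-10f`, the original form returns `alpha2` while the reassociated form returns exactly zero, so the `alpha2` contribution is lost.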
Pull Request: https://projects.blender.org/blender/blender/pulls/130469
The function table symbol declared in the headers was renamed starting
in OptiX 8.1, from `g_optixFunctionTable` to
`g_optixFunctionTable_<ABI version>`. This adds support for that by
using the new macro for the name when available (after OptiX 8.1) and
falling back to the old name when it is not (before OptiX 8.1).
Pull Request: https://projects.blender.org/blender/blender/pulls/130451
The `ProjectionTransform` type has no trivial copy-assignment operator.
This results in the following warning on `gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0`:
```
/.../blender-git/blender/intern/cycles/kernel/../util/projection.h: In function ‘ccl::ProjectionTransform ccl::projection_inverse(ProjectionTransform)’:
/.../blender-git/blender/intern/cycles/kernel/../util/projection.h:219:9: warning: ‘void* memcpy(void*, const void*, size_t)’ writing to an object of type ‘ccl::ProjectionTransform’ {aka ‘struct ccl::ProjectionTransform’} with no trivial copy-assignment; use copy-assignment or copy-initialization instead [-Wclass-memaccess]
219 | memcpy(&tfmR, R, sizeof(R));
| ~~~~~~^~~~~~~~~~~~~~~~~~~~~
/.../blender-git/blender/intern/cycles/kernel/../util/projection.h:67:16: note: ‘ccl::ProjectionTransform’ {aka ‘struct ccl::ProjectionTransform’} declared here
67 | typedef struct ProjectionTransform {
| ^~~~~~~~~~~~~~~~~~~
```
To fix the warning, cast the pointer to `(void *)`.
Pull Request: https://projects.blender.org/blender/blender/pulls/130321
In 891d71a4d4 this keyword was
dropped due to a performance regression after
fdc2962beb, but the current code
no longer shows this performance degradation, and in fact
there is a minor performance improvement on Lunar Lake GPUs,
along with an expected improvement in compile time.
However, this change brings a minor performance regression to
the `shade_surface` kernel on Intel Arc and Meteor Lake GPUs, which
will be solved later by disabling this keyword for
these platforms only.
Pull Request: https://projects.blender.org/blender/blender/pulls/130299
NOTE: This also required some changes to the Cycles code itself, which now
directly includes `BKE_image.hh` instead of declaring a few prototypes
of these functions in its `blender/utils.h` header (due to C++ function
name mangling, this was not working anymore).
Pull Request: https://projects.blender.org/blender/blender/pulls/130174
This change makes it so only kernels of the same vendor are compiled in
parallel. For example for the release builds it will be:
1. All CUDA kernels
2. All OptiX kernels
3. All HIP kernels
4. All OneAPI kernels
This potentially leads to a lower CPU utilization, but it makes it much
easier to manage memory usage and tweak per-vendor concurrency.
The goal of this change is to solve occasional out-of-memory failures
during the GPU kernel compilation step on the CI/CD farm.
This change also includes tweaks to the parallel jobs for HIP-RT and
oneAPI. The tweak is based on measuring the apparent peak memory usage on
Linux when doing single-thread compilation, and giving some safe margin
from the available memory on the buildbot.
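
The kind of calculation described here can be sketched as below. The helper name and all numbers are hypothetical; the actual limits are set in the build configuration and are not given in this description.

```cpp
#include <cassert>

// Hypothetical helper: how many compile jobs of one vendor fit into memory,
// given a measured per-job peak and a safety margin kept free.
inline int max_parallel_jobs(double available_gib, double safety_margin_gib,
                             double peak_per_job_gib)
{
  assert(peak_per_job_gib > 0.0);
  const double usable_gib = available_gib - safety_margin_gib;
  if (usable_gib <= peak_per_job_gib) {
    return 1;  // always allow at least one job
  }
  return static_cast<int>(usable_gib / peak_per_job_gib);
}
```

For instance, with 64 GiB available, an 8 GiB margin and a 14 GiB peak per job, this yields 4 parallel jobs.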
Pull Request: https://projects.blender.org/blender/blender/pulls/129945
All the OSL matrix functions had been implemented using the
`Transform` utility of Cycles, but that's built around a 4x3 matrix,
whereas the OSL matrix functions work with 4x4 matrices.
This resulted in them not producing results consistent with the
CPU implementation.
This fixes that by making use of the `ProjectionTransform` utility
of Cycles instead, because it's built around a 4x4 matrix. Since
matrix inversion is required, I had to make a few more utility
functions available on the GPU (except Metal, due to use of
references/pointers without specification) that were previously
CPU-only.
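
To illustrate why the reduced matrix loses information, here is a standalone sketch. The types are simplified stand-ins, not the Cycles `Transform` and `ProjectionTransform` types, and the implied last row of (0, 0, 0, 1) is an assumption of this example.

```cpp
#include <array>
#include <cassert>

using Row4 = std::array<float, 4>;
using Matrix44 = std::array<Row4, 4>;  // full 4x4 matrix
using Affine34 = std::array<Row4, 3>;  // three stored rows, last row implied

// Drop the fourth row, as an affine-only representation would.
inline Affine34 to_affine(const Matrix44 &m)
{
  Affine34 a{};
  for (int i = 0; i < 3; i++) {
    a[i] = m[i];
  }
  return a;
}

// Rebuild a 4x4 matrix, assuming the missing row is (0, 0, 0, 1).
inline Matrix44 from_affine(const Affine34 &a)
{
  Matrix44 m{};
  for (int i = 0; i < 3; i++) {
    m[i] = a[i];
  }
  m[3] = {0.0f, 0.0f, 0.0f, 1.0f};
  return m;
}

// True only when the matrix survives the trip through the reduced form.
inline bool round_trips(const Matrix44 &m)
{
  return from_affine(to_affine(m)) == m;
}
```

An affine matrix round-trips unchanged, while any matrix with a different last row (for example a perspective projection) does not.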
Co-authored-by: Brecht Van Lommel <brecht@blender.org>
Pull Request: https://projects.blender.org/blender/blender/pulls/110102
Don't try to use MetalRT by default unless the device explicitly reports that RT is supported. We shouldn't just rely on an assumption that it's supported for M3 and beyond, ad infinitum.
Pull Request: https://projects.blender.org/blender/blender/pulls/129688