Rather than treating all ray types equally, we now always render 1 glossy
bounce and unlimited transmission bounces. This makes it possible to get
good looking results with low AO bounces settings, making it useful to
speed up interior renders for example.
Reviewed By: brecht
Differential Revision: https://developer.blender.org/D2818
Previously we used a 1D sequence to select a light, and another 2D sequence
to sample a point on the light. For multiple lights this meant each light
would get a random subset of a 2D stratified sequence, which is not
guaranteed to be stratified anymore.
Now we use only a 2D sequence, split into segments along the X axis, one for
each light. The samples that fall within a segment then each are a stratified
sequence, at least in the limit. So for example for two lights, we split up
the unit square into two segments [0,0.5[ x [0,1[ and [0.5,1[ x [0,1[.
This doesn't make much difference in most scenes, mainly helps if you have a
few large area lights or some types of HDR backgrounds.
This causes render differences in some scenes, for example fishy_cat
and pabellon scenes render brighter in a few spots. This is an old
bug, not due to recent RR changes.
Disabled forceinline for those architectures, which seems to be compiling
successfully more often.
There might be ~3% slowdown based on quick tests, but better be rendering
something rather than failing to compile kernels again and again.
Those architectures will be doomed for abandon once we'll switch to toolkit 9.
Empty BVH nodes are set to NaN which must be preserved all the way to the
tnear <= tfar test which can then give false for empty nodes. This needs
strict semantices and careful argument ordering for min() and max(), so
the second argument is used if either of the arguments is NaN.
Fixes T52635: crash in BVH traversal with SSE4.1.
Differential Revision: https://developer.blender.org/D2828
One problem is that it was always using __mm_blendv_ps emulation even if the
instruction was supported. The other that the emulation function was wrong.
Thanks a lot to Ray Molenkamp for tracking this one down.
Don't use quick sort for small arrays, bubble sort works way faster for small
arrays due to cache coherency. This is what qsort() from libc is doing actually.
We can also experiment unrolling some extra small arrays, for example 3 and 4
element arrays.
This reduces tangent space calculation for dragon from 3.1sec to 2.9sec.
Brings tangent space calculation from 4.6sec to 3.1sec for dragon model in BI.
Cycles is also somewhat faster, but it has other bottlenecks.
Funny thing, using simple `static inline` already gives a lot of speedup here.
That's just answering question whether it's OK to leave decision on what to
inline up to a compiler..