A major bottleneck of the current implementation is the call to create_bindings() for basically every drawcall.
This is due to the VAO being tagged dirty when assigning a new shader to the Batch, defeating the purpose of the Batch (reusing it for drawing).
Since managing hundreds of batches in the DrawManager and DrawCache did not seem fun enough to me, I preferred rewriting the batches themselves.
--- Batch changes ---
For this to happen I needed to change the instancing to be part of the Batch rather than being another batch supplied at draw time.
The Gwn_VertBuffers are copied from the batch to be instanced, and a new Gwn_VertBuffer is supplied for the instancing attribs.
This means a VAO can be generated and cached for this instancing case.
A Batch can be rendered with instancing, without instancing attribs and without the need for a new VAO, by using GWN_batch_draw_range_ex with the force_instance parameter set to true.
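A minimal sketch of what such a call might look like (illustrative only, not the actual Blender code; it assumes that with force_instance the range arguments are interpreted as the instance range):

/* Illustrative sketch: draw `instance_count` instances of an existing
 * batch without binding per-instance attribute buffers.
 * Assumption: with force_instance = true, the range arguments are
 * treated as the instance range. */
void draw_batch_instanced(Gwn_Batch *batch, int instance_count)
{
    GWN_batch_draw_range_ex(batch, 0, instance_count, true);
}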
--- Draw manager changes ---
The downside with this approach is that we must track the validity of the instanced batch (the original one). For this, the only way I could think of is to set a callback for when the batch is getting freed.
This means a bit of refactoring in the DrawManager, with the separation of batching and instancing Batches.
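A hypothetical sketch of the callback idea (the type, struct and function names below are illustrative, not the actual API):

#include <stdbool.h>

struct Gwn_Batch;

/* Hypothetical callback type; the real registration function and its
 * signature may differ. */
typedef void (*BatchFreeCallback)(struct Gwn_Batch *batch, void *user_data);

typedef struct InstancingBatch {
    struct Gwn_Batch *instancing;  /* batch holding the copied VBOs + instance attribs */
    bool original_is_valid;        /* set to false once the source batch is freed */
} InstancingBatch;

/* DrawManager side: when the original batch is freed, mark the instancing
 * copy invalid so it is rebuilt or discarded before the next draw. */
static void on_original_batch_free(struct Gwn_Batch *original, void *user_data)
{
    InstancingBatch *ibatch = user_data;
    (void)original;
    ibatch->original_is_valid = false;
}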
--- VAO cache ---
Each VAO is generated for a given ShaderInterface. This means we can keep it alive as long as the shader interface lives.
If a ShaderInterface is discarded, it needs to destroy every VAO associated with it. Otherwise, a new ShaderInterface with the same address could be generated and reuse the same VAO with incorrect bindings.
The VAO cache itself uses a mix of a static array of VAOs and a dynamic array that is used when there is not enough space in the static one.
Using this hybrid approach is a bit more performant than the dynamic array alone.
The array will not resize down, but empty entries will be filled up again. It's unlikely we get a buffer overflow from this; resizing could be done on the next allocation if needed.
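A hedged sketch of this hybrid cache (the names and the static size are illustrative, not the actual implementation):

#include <GL/glew.h>

#define VAO_STATIC_LEN 3

struct ShaderInterface;

typedef struct VaoCache {
    /* Static part covers the common case of a few shader interfaces per batch. */
    const struct ShaderInterface *static_interfaces[VAO_STATIC_LEN];
    GLuint static_vaos[VAO_STATIC_LEN];
    /* Dynamic part is only allocated when the static one is full. */
    int dynamic_len;
    const struct ShaderInterface **dynamic_interfaces;
    GLuint *dynamic_vaos;
} VaoCache;

/* Return the cached VAO for this interface, or 0 if a new one must be
 * created (and then stored in the first empty entry). */
static GLuint vao_cache_lookup(const VaoCache *cache, const struct ShaderInterface *interface)
{
    for (int i = 0; i < VAO_STATIC_LEN; i++) {
        if (cache->static_interfaces[i] == interface) {
            return cache->static_vaos[i];
        }
    }
    for (int i = 0; i < cache->dynamic_len; i++) {
        if (cache->dynamic_interfaces[i] == interface) {
            return cache->dynamic_vaos[i];
        }
    }
    return 0;
}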
--- Results ---
Using cached VAOs means that we are no longer querying each vertex attrib of each VBO for each drawcall on every redraw!
In a CPU-limited test scene (10000 cubes in the Clay engine) I get a reduction of CPU drawing time from ~20ms to 13ms.
The only area that is not caching VAOs is the instancing from particles (see the comment in DRW_shgroup_instance_batch).
The cache design also allows allocation of VAOs from different OpenGL contexts and threads, as long as the drawing happens in the same context.
Allocation is thread safe as long as we abide by the "one OpenGL context per thread" rule.
We can still free from any thread; the actual freeing will occur at the next VAO allocation or the next context binding.
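A hedged sketch of such a deferred-free scheme (names and the fixed-size orphan list are illustrative only, not the actual code):

#include <pthread.h>
#include <GL/glew.h>

#define ORPHAN_MAX 64

static GLuint orphaned_vaos[ORPHAN_MAX];
static int orphaned_len = 0;
static pthread_mutex_t orphan_mutex = PTHREAD_MUTEX_INITIALIZER;

/* May be called from any thread: just queue the VAO id. */
void vao_free_deferred(GLuint vao)
{
    pthread_mutex_lock(&orphan_mutex);
    if (orphaned_len < ORPHAN_MAX) {
        orphaned_vaos[orphaned_len++] = vao;
    }
    pthread_mutex_unlock(&orphan_mutex);
}

/* Called from the thread owning the context, on a new VAO allocation or
 * when the context is (re)bound. */
void vao_orphans_delete(void)
{
    pthread_mutex_lock(&orphan_mutex);
    glDeleteVertexArrays(orphaned_len, orphaned_vaos);
    orphaned_len = 0;
    pthread_mutex_unlock(&orphan_mutex);
}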
It still seems to be useful in cases where the particles are distributed in
a particular order or pattern, to colorize them along with that. This isn't
really well defined, but we might as well avoid breaking backwards compatibility
for now.
These batches keep their memory chunk allocated after transfer, so it can be reused and updated very often.
NOTE: This commit breaks instancing in DRW (it's fixed in the next commit).
This is about the only way to add variety to hair that is created
using simple children; it is used here for the hair.
Maybe not ideal, but time will tell.
Burley SSS does something a bit strange where the albedo and closure weight are
different, which makes the subsurface color act a bit like a subsurface radius
indirectly, due to the way the Burley SSS profile works.
This can't work for random walk SSS though, and it's not clear to me that this
is actually a good idea since it's really the subsurface radius that is supposed
to control this. For now I'll leave Burley SSS working the same to not break
backwards compatibility.
Instead of cloning the window to create dummyHWNDs and dummyHDCs to avoid calling SetPixelFormat more than once on the same window, use the original window and HDC and do not call SetPixelFormat again.
In addition to avoiding a lot of unnecessary calls, this simplifies the code and makes it match the other OSes.
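A hedged sketch of the rule this relies on (a window's pixel format may only be set once, so only call SetPixelFormat when none is set yet; illustrative only, not the GHOST implementation):

#include <windows.h>

/* Illustrative only: set the pixel format only if the window's DC does
 * not already have one. */
static BOOL ensure_pixel_format(HDC hdc, const PIXELFORMATDESCRIPTOR *pfd)
{
    if (GetPixelFormat(hdc) != 0) {
        return TRUE;  /* Already set: do not set it again. */
    }
    int format = ChoosePixelFormat(hdc, pfd);
    if (format == 0) {
        return FALSE;
    }
    return SetPixelFormat(hdc, format, pfd);
}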
It is basically brute force volume scattering within the mesh, but implemented
as part of the SSS code for better performance. The main difference from actual
volume scattering is that we assume the boundaries are diffuse and that
all lighting is coming through this boundary from outside the volume.
This gives much more accurate results for thin features and low density.
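As a rough, self-contained illustration of the principle (a random walk with exponential free-flight distances inside a unit sphere standing in for the mesh; a toy sketch with illustrative parameters, not the Cycles kernel code):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy brute-force volume scattering: random walks inside a unit sphere,
 * exponential free-flight distances, isotropic scattering. */
static float rand01(void) { return (float)rand() / ((float)RAND_MAX + 1.0f); }

int main(void)
{
    const float sigma_t = 8.0f;  /* extinction coefficient (1 / mean free path) */
    const float albedo = 0.9f;   /* energy kept at each scattering event */
    const int nwalks = 100000;
    double exited = 0.0;

    for (int i = 0; i < nwalks; i++) {
        float p[3] = {0.0f, 0.0f, -0.999f};  /* start just inside the surface */
        float d[3] = {0.0f, 0.0f, 1.0f};     /* initial direction: inward */
        float weight = 1.0f;

        for (int bounce = 0; bounce < 256; bounce++) {
            /* Sample a free-flight distance from the exponential distribution. */
            float t = -logf(1.0f - rand01()) / sigma_t;
            p[0] += d[0] * t; p[1] += d[1] * t; p[2] += d[2] * t;
            /* The walk left the object: light exits through the boundary. */
            if (p[0] * p[0] + p[1] * p[1] + p[2] * p[2] >= 1.0f) {
                exited += weight;
                break;
            }
            /* Scatter: lose some energy, pick a new isotropic direction. */
            weight *= albedo;
            float z = 1.0f - 2.0f * rand01();
            float r = sqrtf(fmaxf(0.0f, 1.0f - z * z));
            float phi = 2.0f * 3.14159265f * rand01();
            d[0] = r * cosf(phi); d[1] = r * sinf(phi); d[2] = z;
        }
    }
    printf("fraction of energy exiting: %f\n", exited / nwalks);
    return 0;
}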
Some challenges remain however:
* Significantly more noisy than BSSRDF. Adding Dwivedi sampling may help
here, but it's still unclear how much it helps in real world cases.
* Due to this being a volumetric method, geometry like eyes or mouth can
darken the skin on the outside. We may be able to reduce this effect,
or users can compensate for it by reducing the scattering radius in
such areas.
* Sharp corners are quite bright. This matches actual volume rendering
and the results of some other renderers, but maybe not so much real world
objects.
Differential Revision: https://developer.blender.org/D3054
- Read-only access can often use EvaluationContext.object_mode
- Write access should go to WorkSpace.object_mode.
- Some TODO's remain (marked as "TODO/OBMODE")
- Add-ons will need updating
(context.active_object.mode -> context.workspace.object_mode)
- There will be small/medium issues that still need resolving;
  this does work on a basic level though.
See D3037
This brings separate initialization for libcuda and libnvrtc, which
fixes Cycles nvrtc compilation not working on build machines without
CUDA hardware available.
Differential Revision: https://developer.blender.org/D3045
We should actually be using CL_DEVICE_MEM_BASE_ADDR_ALIGN for sub-buffers;
the previous change in this code was incorrect. Renamed the function now to
make the specific purpose of this alignment clear: it's not required for
data types in general.
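A hedged sketch of the query this refers to (clGetDeviceInfo and CL_DEVICE_MEM_BASE_ADDR_ALIGN are standard OpenCL; the helper names are illustrative):

#include <CL/cl.h>

/* CL_DEVICE_MEM_BASE_ADDR_ALIGN is reported in bits; sub-buffer origins
 * must be aligned to this value. */
static size_t sub_buffer_alignment_in_bytes(cl_device_id device)
{
    cl_uint align_bits = 0;
    clGetDeviceInfo(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN,
                    sizeof(align_bits), &align_bits, NULL);
    return align_bits / 8;
}

/* Round a byte offset up to the required sub-buffer alignment. */
static size_t align_up(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}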