This uniforms can be used to have a unique id for each drawcall of a shgrp.
This only works for standard shgroups and is an exception for the outline
drawing.
The issue was mainly visible when copy-on-write was enabled. This was forcing
lots of meshes to be freed from multiple thread, causing all sorts of race
conditions in Gawain's VAO code.
OpenGL resources seems already to be doing deferred deletion, need to do the
same for CPU side arrays.
I left a flag to quickly debug if something is wrong.
But now that everything uses shader, it seems to be alright since a shader
is always set active before drawing.
This allows us to specify a the number of vertices to upload to the gpu.
This is to keep the same allocation on the System Memory but send the least
amount of data to the GPU/Driver.
We revert to the malloc/realloc and manually manage the upload.
There seems to be a performance penalty from using glMapBuffer on some
hardware, prefering way is glBufferData(NULL) with glBufferSubData.
- Upload the data to the GPU directly when creating the element buffer in
GWN_indexbuf_build_in_place().
- Convert data in place when squeezing the indices and removing the need
for another allocation.
- GWN_indexbuf_build_in_place() can be used with already used element
buffers and reupload their data without changing vbo id (keeping vaos
up to date).
We now alloc a vbo id on creation and let OpenGL manage its memory directly.
We use glMapBuffer to get this memory location.
This enables us to reuse and modify any vertex buffer directly without
destroying it with its associated Batches.
This commit does not really improve performance but will let us implement
more optimizations in the future.
We can also resize the buffer even if this can be slow if we need to keep
the existing data.
The addition of the usage hint makes dynamic buffers not a special case
anymore, simplifying things a bit.
Even if they are for safety they are not free to use !
On my system (Mesa + AMD Vega GPU) calling:
glBindVertexArray(1);
glDrawArrays(GL_TRIANGLES, 0, 3);
glBindVertexArray(0);
in a loop, shows the same overhead as a full vao switching (which is more
or less 10 times slower than just calling glDrawArrays)
Moreover, now that we use OpenGL 3.3 binding a VAO is REQUIRED to issue a
drawcall so it is garanted to be overwritten before the next drawcall.
Problem can only happen if someone draws directly with opengl commands.
This allows to draw multiple primitive of the type
GWN_PRIM_LINE_STRIP
GWN_PRIM_LINE_LOOP
GWN_PRIM_TRI_STRIP
GWN_PRIM_TRI_FAN
GWN_PRIM_LINE_STRIP_ADJ
with only one drawcall. This should speed up some areas that are really
sensitive to drawcall counts : UV drawing, Hair drawing...
Instead of creating a new instancing shading group without attrib, we now have instancing calls. The benefits is that they can be culled.
They can be used in conjuction with the standard and generate calls but shader must support it (which is generally not the case).
We store a pointer to the actual count so that the number can be tweaked between redraw.
This will makes multi layer rendering more efficient.
Turns out to be the call that was destroying performance.
I get 18ms->6ms improvement of drawing time with 10 000 unique objects.
And we can still improve upon this!
This is not an ideal solution but blender freeing system is already well tangled.
So tracking and clearing vao caches when destroying contexts does prevent bad behaviour.
A major bottleneck of current implementation is the call to create_bindings() for basically every drawcalls.
This is due to the VAO being tagged dirty when assigning a new shader to the Batch, defeating the purpose of the Batch (reuse it for drawing).
Since managing hundreds of batches in DrawManager and DrawCache seems not fun enough to me, I prefered rewritting the batches itself.
--- Batch changes ---
For this to happen I needed to change the Instancing to be part of the Batch rather than being another batch supplied at drawtime.
The Gwn_VertBuffers are copied from the batch to be instanciated and a new Gwn_VertBuffer is supplied for instancing attribs.
This mean a VAO can be generated and cached for this instancing case.
A Batch can be rendered with instancing, without instancing attribs and without the need for a new VAO using the GWN_batch_draw_range_ex with the force_instance parameter set to true.
--- Draw manager changes ---
The downside with this approach is that we must track the validity of the instanced batch (the original one). For this the only way (I could think of) is to set a callback for when the batch is getting free.
This means a bit of refactor in the DrawManager with the separation of batching and instancing Batches.
--- VAO cache ---
Each VAO is generated for a given ShaderInterface. This means we can keep it alive as long as the shader interface lives.
If a ShaderInterface is discarded, it needs to destroy every VAO associated to it. Otherwise, a new ShaderInterface with the same adress could be generated and reuse the same VAO with incorrect bindings.
The VAO cache itself is using a mix between a static array of VAO and a dynamic array if the is not enough space in the static.
Using this hybrid approach is a bit more performant than the dynamic array alone.
The array will not resize down but empty entries will be filled up again. It's unlikely we get a buffer overflow from this. Resizing could be done on next allocation if needed.
--- Results ---
Using Cached VAOs means that we are not querying each vertex attrib for each vbo for each drawcall, every redraw!
In a CPU limited test scene (10000 cubes in Clay engine) I get a reduction of CPU drawing time from ~20ms to 13ms.
The only area that is not caching VAOs is the instancing from particles (see comment DRW_shgroup_instance_batch).
This allows allocation of VAOs from different opengl contexts and thread as long as the drawing happens in the same context.
Allocation is thread safe as long as we abide by the "one opengl context per thread" rule.
We can still free from any thread and actual freeing will occur at new vao allocation or next context binding.
Theses batches keeps their memory chuck allocated after transfer to be reused and updated very often.
NOTE: This commit break instancing in DRW. (it's fixed in the next commit)
Everything was fine if one batch is always used with instancing. But problem arise if the next drawcall for this batch is not using instancing as the attrib divisor stays set to 1 in th VAO.
As instancing is less used than normal drawing I prefer to reset the divisor after drawing as it is reset before drawing instances.
Tried 101 but it gives colisions.
I think 257 is enough now that we dont have thousands of uniforms.
This gives some noticeable performance improvement.
Could be refined further.