This PR adds recycling of descriptor pools. Currently descriptor pools
are discarded when full or context is flushed. This PR allows
descriptor pools to be discarded for reuse.
It is also more conservative and only discard
Descriptor pools when they are full or fragmented.
When using the Vulkan backend a small amount of descriptor memory can leak. Even
when we clean up all resources, drivers can still keep data around on the
GPU. Eventually this can lead to out of memory issues depending on how
the GPU driver actually manages descriptor sets.
When the descriptor sets of the descriptor pool aren't used anymore
the VKDiscardPool will recycle the pools back to its original VKDescriptorPools.
It needs to be the same instance as descriptor pools/sets are owned by
a single thread.
Pull Request: https://projects.blender.org/blender/blender/pulls/144992
This PR adds HDR support for Windows for `VK_COLOR_SPACE_EXTENDED_SRGB_LINEAR_EXT`
on `VK_FORMAT_R16G16B16A16_SFLOAT` swapchains .
For nonlinear surface formats (sRGB and extended sRGB) the back buffer is blit into the swapchain,
When VK_COLOR_SPACE_EXTENDED_SRGB_LINEAR_EXT is used as surface format a compute shader
is used to flip and invert the gamma.
SDR white level is updated from a few window event changes, but actually
none of them immediately respond to SDR white level changes in the system.
That requires using the WinRT API, which we don't do so far.
Current limitations:
- Intel GPU support
- Dual GPU support
In the future we may add controls inside Blender for absolute HDR nits,
across different platforms. But this makes behavior closer to macOS.
See !144565 for details
Co-authored-by: Brecht Van Lommel <brecht@blender.org>
Pull Request: https://projects.blender.org/blender/blender/pulls/144717
This PR moves the responsibility of destroying discarded resources to
the submission thread. Previous implementation could be blocked and
would not always run.
This solves memory leak when rendering in background and keeps the
overall memory usage lower as all is done in a single location.
Pull Request: https://projects.blender.org/blender/blender/pulls/144440
Swapchain handling of minimized windows wasn't correct. On some
platforms it still tried to create images with no surface.
This PR will discard swapchains of minimized windows, but still being
able to flush the render graph and free resources.
Pull Request: https://projects.blender.org/blender/blender/pulls/144189
Resetting and recycling of immediate drawing buffers was never done and
would leak memory as the buffers were only destroyed when Blender
exited.
This is solved by not recycling or resetting the buffers and rely on the
discard pools. Additional cleanup of removing unused code-paths is also
part of this change so it can be backported to 4.5.
Pull Request: https://projects.blender.org/blender/blender/pulls/143995
This PR moves Wayland/HDR support out of experimental.
This allows more people to test and provide feedback. We
can always decide later to disable it for the release, but so
far we only got positive feedback.
Pull Request: https://projects.blender.org/blender/blender/pulls/141666
This change enables HDR support for wayland as an experimental feature.
It supports both non-linear extended sRGB and un-clamped sRGB.
Windows isn't supported as the HDR settings are not accessible via an
API and would require similar settings that games use to configure the
monitor. Adding those sliders isn't what we would like to add.
Vulkan (working group) is working on new extensions that might change
the shortcomings. It isn't clear yet what the extension will do and what
the impact is for applications that want to use it. When the extension
is out we should review at the situation again.
Pull Request: https://projects.blender.org/blender/blender/pulls/133159
Depth navigation sends many small render graphs to the device. It can be
that a subsequent render graph uses the same shader as the previous one
with the same descriptor set tracker. The descriptor set tracker didn't
cleared its full state and a subsequent render graph was generating
commands assuming that the device was in a certain state.
However it wasn't and a command to bind a descriptor set was skipped
resulting in a device out of bound write. Depending on the platform this
could overwrite any data on the GPU, including shader programs as the
select shader writes to a storage buffer. This clarifies why the issue
resulted in very odd and none consistent behavior.
This PR fixes this by clearing the VKPipelineData and
VKDescriptorTracker.
Pull Request: https://projects.blender.org/blender/blender/pulls/140526
This has limited use cases since it doesn't
profile the heavy part of the vulkan backend.
Almost 1:1 port of the metal implementation from #139551.
Doesn't cover rendergraph submission nor GPU timings.
Pull Request: https://projects.blender.org/blender/blender/pulls/139899
Descriptor sets/pools are known to be troublesome as it doesn't match
how GPUs work, or how application want to work, adding more complexity
than needed. This results is quite an overhead allocating and
deallocating descriptor sets.
This PR will use descriptor buffers when they are available. Most platforms
support descriptor buffers. When not available descriptor pools/sets
will be used.
Although this is a feature I would like to land it in 4.5 due to the API changes.
This makes it easier to fix issues when 4.5 is released.
The feature can easily be disabled by setting the feature to false if it has
to many problems.
Pull Request: https://projects.blender.org/blender/blender/pulls/138266
In this case a triangle shader was used to render points.
This change entails:
- Using point shaders in this case
- Add support for `GPU_point_size` to update the uniform of the point
shader.
Pull Request: https://projects.blender.org/blender/blender/pulls/139574
This allows multiple threads to request different specializations without
locking usage of all specialized shaders program when a new specialization
is being compiled.
The specialization constants are bundled in a structure that is being
passed to the `Shader::bind()` method. The structure is owned by the
calling thread and only used by the `Shader::bind()`.
Only querying for the specialized shader (Map lookup) is locking the shader
usage.
The variant compilation is now also locking and ensured that
multiple thread trying to compile the same variant will never result
in race condition.
Note that this removes the `is_dirty` optimization. This can be added
back if this becomes a bottleneck in the future. Otherwise, the
performance impact is not noticeable.
Pull Request: https://projects.blender.org/blender/blender/pulls/136991
Importing memory is done to often. when memory doens't change the
previous imported memory can be used.
The idea is to keep track of the last used buffer and keep reusing
it until the view/resolution has changed. This should not happen during
a session.
Pull Request: https://projects.blender.org/blender/blender/pulls/138984
- Reduce artifacts during resizing to also recreate the swapchain
when acquire image is suboptimal
- Do not stretch when backbuffer and swapchain have a different size
Pull Request: https://projects.blender.org/blender/blender/pulls/138925
Part of #136993.
Share as much of the ShaderCompiler implementations as possible.
Remove the ShaderCompiler/ShaderCompilerGeneric split and make most of
its functions non virtual.
Move the `get_compiler` function from `Context` to `GPUBackend` and
creation/deletion to `GPUBackend::init/delete_resources`.
Add a `batch_cancel` function to `ShaderCompiler` (needed for the
GPUPass refactor).
As a nice extra, the multithreaded OpenGL compilation has become faster
too.
The barbershop materials + EEVEE static shaders have gone from 27s to
22s.
I have not observed any performance difference on Vulkan or Metal.
Pull Request: https://projects.blender.org/blender/blender/pulls/136676
Descriptor pools were never discarded, leading to out of memory issues
when running for a long time. This PR discards used descriptor pool when
the render graph is submitted to the device.
- Detected that to descriptor sets could be uploaded multiple times
however once was always empty.
- When render graph is flushed all descriptor pools are discarded.
- Improved debugging of discard pools.
Pull Request: https://projects.blender.org/blender/blender/pulls/137521
A couple of memory leak fixes for the vulkan backend.
We increment the submission_id on render_graphs upon reset. This
triggers cleanup of anything tracked as a VKResourceTracker. Notably
uniform buffers created for push constant fallbacks. This fixes a memory
leak that was accumulating VKUniformBuffers every frame without cleaning
them up.
Reset resource pools when a swapchain image is presented. This ends up
calling vkResetDescriptorPool, freeing up descriptor set resources. This
fixes a memory leak that was accumulate descriptor sets and pools over
time without freeing them.
Pull Request: https://projects.blender.org/blender/blender/pulls/137305
Incorrect data format was selected when using CPU data transfers in
OpenXR. It always used `GPU_DATA_HALF_FLOAT`, also when the swapchains
where `GPU_RGBA8`. This resulted in black screens in release mode, and
asserts in debud mode.
Fixed by selecting the correct data transfer data type based on the
swapchain format.
Co-authored-by: jeroen@blender.org <Jeroen Bakker>
Pull Request: https://projects.blender.org/blender/blender/pulls/137269
This PR add support to use a win32 handle to perform share render
result with the OpenXR vulkan instance. This is only possible when
the GPU matches. Otherwise a CPU roundtrip will be performed.
Pull Request: https://projects.blender.org/blender/blender/pulls/137093
Update the `ShaderCompilerGeneric` to support deferred compilation
using the batch compilation API, so we can get rid of
`drw_manager_shader`.
This approach also allows supporting non-blocking compilation
for static shaders.
This shouldn't cause any behavior changes at the moment, since batch
compilation is not yet used when parallel compilation is disabled.
This adds a `GPUWorker` and a `GPUSecondaryContext` as an easy to use
wrapper for managing secondary GPU contexts.
(Part of #133674)
Pull Request: https://projects.blender.org/blender/blender/pulls/136518
Current implementation uses a CPU roundtrip to transfer render result
to the Xr Swapchain. This PR adds support for sharing the render result
on Linux systems by using file descriptors.
To extend this solution to win32 or dx handles can be done by extending
the data transfer modes, register the correct extensions. When not
using the same GPU between Blender and OpenXR the CPU roundtrip
will still be used.
Solution has been validated with monado simulator and seems to be as
fast as OpenGL.
Performance can be improved by using GPU based synchronization.
Current API is limited as we cannot chain the different renders and
swapchains.
Pull Request: https://projects.blender.org/blender/blender/pulls/136933
This PR refactors the way how swapchains are used.
Allow scaling of the swapchain content to the actual resolution of the swapchain.
can reduce artefacts when resizing windows when supported.
When frame rate is to fast the previous implementation could use a semaphore
that were still in use, leading to unwanted stuttering on certain platforms. Waiting
when the rendering has finished (GHOST_Frame.submission_fence), before the
next image is acquired from the swap chain.
Mailbox has been disabled as it can calculate more frames then actually been
presented, leading to a lag and increased power usage on others.
Pull Request: https://projects.blender.org/blender/blender/pulls/136603
After reviewing the locations where `GPU_flush()` are used it doesn't seem
to be harmfull to include these for the Vulkan backend as well. Hopefully
will save some lag that can happen when submitting one huge render graph.
Improved playback of rain_restaurant.blend where frames could be dropped
resulting into UI lag.
Pull Request: https://projects.blender.org/blender/blender/pulls/136654
The Blender's VkInstance cannot be shared with OpenXR VkInstance. The
reason is a chicken and egg problem where OpenXR needs to be started
before Vulkan. OpenXR can add special vulkan specific requirements
(instance&device) that are only available when the user starts an OpenXR
session.
The goal implementation is to share memory between both instances using
[VK_KHR_external_memory](https://registry.khronos.org/vulkan/specs/latest/man/html/VK_KHR_external_memory.html) and related extensions. However this seems
to be a bridge to far as a initial step. Reason: There are not that many
samples/ guides and documentation to be found to handle the workflow that
we require. We want to do a smaller step by step approach to gain the needed
knowledge.
For that reason this PR does the most stupidest thing that can be done to
share memory between instances. Download the render result to CPU RAM share
the host pointer with the OpenXR instance which copies it to the swap chain.
Also the synchronization is done using wait idle commands.
<video src="attachments/32a0d69b-c3fa-4272-aea0-d207609afaaf" title="Screencast From 2025-03-18 11-16-17.webm" controls></video>
**Gaining knowledge**
- Experiment with `VK_KHR_external_memory_host` extension for uploading vertex buffers (not related to OpenXR).
- Import host pointer with `VK_KHR_external_memory_host`. This reduces the additional
memcpy on OpenXR side.
- Export host pointer from Blender side from a mappable buffer.
- Replace host pointers with fd/dmabuf/winhandle
- Remove mappable buffer.
Ref #133718
Pull Request: https://projects.blender.org/blender/blender/pulls/133824
This PR adds swapchain synchronization. When the swapchain swaps the
buffers it can add a wait semaphore/signal semaphore to support GPU
based synchronization
10 times playback of `rain_restaurant.blend` on AMD RX 7700
Before: 10 × Animation playback: 72347.5540 ms, average: 7234.75539684 ms
After: 10 × Animation playback: 41523.2441 ms, average: 4152.32441425 ms
Getting around the OpenGL performance target.
Pull Request: https://projects.blender.org/blender/blender/pulls/136259
When swap chain is updated the logic could select an incorrect
framebuffer. This isn't actually the case during normal usage, but has
been detected during the development of OpenXR support. Here it did
matter.
Pull Request: https://projects.blender.org/blender/blender/pulls/136115
This PR implements a new the threading model for building render graphs
based on tests performed last month. For out workload multithreaded
command building will block in the driver or device. So better to use a
single thread for command building.
Details of the internal working is documented at https://developer.blender.org/docs/features/gpu/vulkan/render_graph/
- When a context is activated on a thread the context asks for a
render graph it can use by calling `VKDevice::render_graph_new`.
- Parts of the GPU backend that requires GPU commands will add a
specific render graph node to the render graph. The nodes also
contains a reference to all resources it needs including the
access it needs and the image layout.
- When the context is flushed the render graph is submitted to the
device by calling `VKDevice::render_graph_submit`.
- The device puts the render graph in `VKDevice::submission_pool`.
- There is a single background thread that gets the next render
graph to send to the GPU (`VKDevice::submission_runner`).
- Reorder the commands of the render graph to comply with Vulkan
specific command order rules and reducing possible bottlenecks.
(`VKScheduler`)
- Generate the required barriers `VKCommandBuilder::groups_extract_barriers`.
This is a separate step to reduce resource locking giving other
threads access to the resource states when they are building
the render graph nodes.
- GPU commands and pipeline barriers are recorded to a VkCommandBuffer.
(`VKCommandBuilder::record_commands`)
- When completed the command buffer can be submitted to the device
queue. `vkQueueSubmit`
- Render graphs that have been submitted can be reused by a next
thread. This is done by pushing the render graph to the
`VKDevice::unused_render_graphs` queue.
Pull Request: https://projects.blender.org/blender/blender/pulls/132681
Images used to be tracked with ownership in order to reset swap chain
images to its original layout. This isn't used anymore as we always mark
them in VK_IMAGE_LAYOUT_UNDEFINED to make the first pipeline barrier a
nop.
This change reduces unneeded complexity and safe a few CPU cycles.
Pull Request: https://projects.blender.org/blender/blender/pulls/133197
Blender always updates all pixels of the swap chain. As an optimization
we can skip the initial layout transition from present to transfer
destination as all pixels will be rewritten.
Pull Request: https://projects.blender.org/blender/blender/pulls/133061
Framebuffers are getting freed in the GPUContext base class destructor. But
the framebuffer destructors use the MTL/VK/GLContext derived class, whose
destructor has already completed at this point. So these contexts are no
longer valid to use.
Now free the framebuffers earlier.
This caused ASAN warnings, it's not known to cause actual bugs.
Pull Request: https://projects.blender.org/blender/blender/pulls/132504
Its not standard how `Present Engines` return images for presentation, and
currently is expected that they cycle between swap-chain images with each
`vkAcquireNextImageKHR` call.
However present engines could return any available image, that can mean
to reuse the last presented one if available. (This seem to be the behavior
using `Layered on DXGI Swapchain` the default `Present Method` used
with latest NVIDIA drivers on Windows).
Since resource pools expects to images to cycle in a sequential order, if any
present engine always return the same image for presentation only a single
resource pool would be used for each rendered frame, and since resources
are only released by cycling between resource pools, this resource pool would
overflow since it never releases any resource.
This changes makes resource pools to cycle each time a image is presented.
Pull Request: https://projects.blender.org/blender/blender/pulls/131129
When lighting baking is used in a background render the resources are
freed to early. The cause is that light baking does some initialization
within a context, that isn't send to the GPU. The first iteration of
light baking is expecting that it can free resources, what leads to GPU
resources to be deleted that are still used by commands that are
scheduled to be send to the GPU.
This PR fixes this by using multiple resource pools when background
rendering and ensure that contexts are send to the GPU when rendering
ends.
Pull Request: https://projects.blender.org/blender/blender/pulls/131094
Dynamic rendering is a Vulkan 1.3 feature. Most platforms have support
for them, but there are several legacy platforms that don't support dynamic
rendering or have driver bugs that don't allow us to use it.
This change will make dynamic rendering optional allowing legacy
platforms to use Vulkan.
**Limitations**
`GPU_LOADACTION_CLEAR` is implemented as clear attachments.
Render passes do support load clear, but adding support to it would
add complexity as it required multiple pipeline variations to support
suspend/resume rendering. It isn't clear when which variation should
be used what lead to compiling to many pipelines and branches in the
codebase. Using clear attachments doesn't require the complexity
for what is expected to be only used by platforms not supported by
the GPU vendors.
Subpass inputs and dual source blending are not supported as
Subpass inputs can alter the exact binding location of attachments.
Fixing this would add code complexity that is not used.
Ref: #129063
**Current state**

Pull Request: https://projects.blender.org/blender/blender/pulls/129062
When copying the window to the swap chain the image needs to be copied
upside down to match Vulkan/OpenGL image coordinate differences.
There was an of by 1 error when copying resulting in minor drawing
glitch which was noticeable when looking at the viewport grid.
Pull Request: https://projects.blender.org/blender/blender/pulls/130328
WITH_VULKAN_GUARDEDALLOC is a development option to use Blenders guarded
allocator when allocating internal vulkan driver resources. It does not provide any benefits
as this should be covered by vulkan validation and drivers are often ignoring this. This
change will remove the option from cmake and source code.
Pull Request: https://projects.blender.org/blender/blender/pulls/129039
Resources are shared, when running multiple contexts on the same thread.
Cycles uses the same context on multiple threads and expected same resources.
This change will introduce a single render graph per context and an updated
resource management. Render graphs are not shared anymore; Resource pools
are still shared, but garbage collection depends on the thread and if
background rendering is used.
Pull Request: https://projects.blender.org/blender/blender/pulls/128983
When performing preview job rendering the memory wasn't recycled leading
to a memory leak. For background rendering we already recycled memory in
a correct way. This change enables the same branch during preview
rendering.
Also adds a better `VKDevice::debug_print` to see the resources being
tracked by the different threads and resource pools.
Pull Request: https://projects.blender.org/blender/blender/pulls/128377
Immediate mode uses the old 'resource tracker' which has been replaced
by swap chain resource pools. This PR optimizes immediate mode buffers
by utilizing resource pools.
Pull Request: https://projects.blender.org/blender/blender/pulls/128188
Resource binding was over-complicated as I didn't understood the state
manager and vulkan to make the correct decisions at that time. This
refactor will remove a lot of the complexity and improves the performance.
**Performance**
The performance improvement is noticeable in complex grease pencil
scenes.
Grease pencil benchmark file picknick:
- `NVIDIA Quadro RTX 6000` 17 fps -> 24 fps
- `Intel(R) Arc(tm) A750 Graphics (DG2)` 6 -> 21 fps
**Bottle-neck**
The performance improvements originates from moving the update entry
point from state manager to shader interface. The previous implementation
(state manager) had to loop over all the bound resources and find in the
shader interface where it was located in the descriptor set. Ignoring
resources that were not used by the shader. But also making it hard
to determine if descriptor sets actually changed. Previous implementation
assumed descriptor sets always changed.
When descriptor set changed a new descriptor set needed to be allocated.
Most drivers this is a fast operation, but on Intel/Mesa this was measurable
slow. Using an allocation pool doesn't fit the Vulkan API as you are only
able to reuse when the layout matches exactly. Of course doable, but requires
another structure to keep track of the actual layouts.
**Solution**
By using the shader interface as entry point we can:
1. Keep track if there are any changes in the state manager. If not and the
layout is the same, the previous shader can be reused.
2. In stead of looping over each bound resource, we loop over bind points.
**Future extensions**
Bundle all descriptor set uploads just before use. This would be more
in line with how 'modern' Vulkan should be implemented. This PR already
separates the uploading from the updating and technically allows to upload
more than one descriptor set.
Instead of looking 1 set back we should measure if we can handle multiple
or keep track of the different layouts resources to improve the performance
even further.
Optional use `VK_KHR_descriptor_buffer` when available.
Pull Request: https://projects.blender.org/blender/blender/pulls/128068
Since parallel compilations was introduced, a validation error
was signalling that push constants for compute shaders didn't have
the correct pipeline binding. The root cause was that the pipeline
binding was determined, before the type of shader was known.
This PR fixes this by detemining if a shader is a compute shader up
front. It also removes some code that could lead to issues.
Pull Request: https://projects.blender.org/blender/blender/pulls/128010
This PR introduces parallel shader compilation for Vulkan shader
modules. This will improve shader compilation when switching to material
preview or EEVEE render preview. It also improves material compilation.
However in order to measure the differences shaderc needs to be updated.
PR has been created so we can already start with the code review. This
PR doesn't include SPIR-V caching, what will land in a separate PR as
it needs more validation.
Parallel shader compilation has been tested on AMD/NVIDIA on Linux.
Testing on other platforms is planned in the upcoming days.
**Performance**
```
AMD Ryzen™ 9 7950X × 32, 64GB Ram
Operating system: Linux-6.8.0-44-generic-x86_64-with-glibc2.39 64 Bits, X11 UI
Graphics card: Quadro RTX 6000/PCIe/SSE2 NVIDIA Corporation 4.6.0 NVIDIA 550.107.02
```
*Test*: Start blender, open barbershop_interior.blend and wait until the viewport
has fully settled.
| Backend | Test | Duration |
| ------- | ------------------------- | -------- |
| OpenGL | Coldstart/No subprocesses | 1:52 |
| OpenGL | Coldstart/8 Subprocesses | 0:54 |
| OpenGL | Warmstart/8 Subprocesses | 0:06 |
| Vulkan | Coldstart Without PR | 0:59 |
| Vulkan | Warmstart Without PR | 0:58 |
| Vulkan | Coldstart With PR | 0:33 |
| Vulkan | Warmstart With PR | 0:08 |
The difference in time (why OpenGL is faster in a warm start is that all
shaders are cached). Vulkan in this case doesn't cache anything and all
shaders are recompiled each time. Caching the shaders will be part of
a future PR. Main reason not to add it to this PR directly is that SPIR-V
cannot easily be validated and would require a sidecar to keep SPIR-V
compatible with external tools..
**NOTE**:
- This PR was extracted from #127418
- This PR requires #127564 to land and libraries to update. Linux lib
is available as attachment in this PR. It works without, but is as slow as
single threaded compilation.
Pull Request: https://projects.blender.org/blender/blender/pulls/127698