The Blender's VkInstance cannot be shared with OpenXR VkInstance. The
reason is a chicken and egg problem where OpenXR needs to be started
before Vulkan. OpenXR can add special vulkan specific requirements
(instance&device) that are only available when the user starts an OpenXR
session.
The goal implementation is to share memory between both instances using
[VK_KHR_external_memory](https://registry.khronos.org/vulkan/specs/latest/man/html/VK_KHR_external_memory.html) and related extensions. However this seems
to be a bridge to far as a initial step. Reason: There are not that many
samples/ guides and documentation to be found to handle the workflow that
we require. We want to do a smaller step by step approach to gain the needed
knowledge.
For that reason this PR does the most stupidest thing that can be done to
share memory between instances. Download the render result to CPU RAM share
the host pointer with the OpenXR instance which copies it to the swap chain.
Also the synchronization is done using wait idle commands.
<video src="attachments/32a0d69b-c3fa-4272-aea0-d207609afaaf" title="Screencast From 2025-03-18 11-16-17.webm" controls></video>
**Gaining knowledge**
- Experiment with `VK_KHR_external_memory_host` extension for uploading vertex buffers (not related to OpenXR).
- Import host pointer with `VK_KHR_external_memory_host`. This reduces the additional
memcpy on OpenXR side.
- Export host pointer from Blender side from a mappable buffer.
- Replace host pointers with fd/dmabuf/winhandle
- Remove mappable buffer.
Ref #133718
Pull Request: https://projects.blender.org/blender/blender/pulls/133824
This PR adds swapchain synchronization. When the swapchain swaps the
buffers it can add a wait semaphore/signal semaphore to support GPU
based synchronization
10 times playback of `rain_restaurant.blend` on AMD RX 7700
Before: 10 × Animation playback: 72347.5540 ms, average: 7234.75539684 ms
After: 10 × Animation playback: 41523.2441 ms, average: 4152.32441425 ms
Getting around the OpenGL performance target.
Pull Request: https://projects.blender.org/blender/blender/pulls/136259
When swap chain is updated the logic could select an incorrect
framebuffer. This isn't actually the case during normal usage, but has
been detected during the development of OpenXR support. Here it did
matter.
Pull Request: https://projects.blender.org/blender/blender/pulls/136115
When using vec3[] as push constants it selected the incorrect
branch resulting in uploading incorrect data to the shader.
This resulted in not seeing the clipping bounds in vulkan.
Ref: #131111
Since the introduction of storage buffers in Blender, the calling
code has been responsible for ensuring the buffer meets allocation
requirements. All backends require the allocation size to be divisible
by 16 bytes. Until now, this was sufficient, but with GPU subdivision
changes, an external library must also adhere to these requirements.
For OpenSubdiv (OSD), some buffers are not 16-byte aligned, leading
to potential misallocation. Currently, this is mitigated by allocating
a few extra bytes, but this approach has the drawback of potentially
reading unintended bytes beyond the source buffer.
This PR adopts a similar approach to vertex buffers: the backend handles
extra byte allocation while ensuring data uploads and downloads function
correctly without requiring those additional bytes.
No changes were needed for Metal, as its allocation size is already
aligned to 256 bytes.
**Alternative solutions considered**:
- Copying the CPU buffer to a larger buffer when needed (performance impact).
- Modifying OSD buffers to allocate extra space (requires changes to an external library).
- Implementing GPU_storagebuf_update_sub.
Ref #135873
Pull Request: https://projects.blender.org/blender/blender/pulls/135716
Vulkan handles are currently only requested once. In the future OpenXR
also needs acces to these handles and additional handles will be needed
when introducing copy queues and async compute.
This PR will collect the handles in a struct to ensure we don't need to
alter the GHOST interface for every change.
Pull Request: https://projects.blender.org/blender/blender/pulls/135905
`GPU_vertbuf_update_sub` is used by GPU based subdivision to integrate
quads, triangles and edges. This is just an implementation to make it
work as we are planning bigger changes to improve performance of
uploading data to the GPU.
Pull Request: https://projects.blender.org/blender/blender/pulls/135774
It has been confirmed that the latest release of AMD drivers has fixed
issues for both OpenGL and Vulkan. Users should use AMD driver 25.3.1
or later. Removing the workaround as it has performance penalties on
RDNA2 based GPUs.
Reference: #135516
Pull Request: https://projects.blender.org/blender/blender/pulls/135630
Followup to 48e26c3afe, and discussions in !134771 about keeping
'C-style' and 'C++ template type-safe style' implementations of our
guardedalloc separated. And it makes `MEM_freeN<T>` code simpler.
Also skip type-checking in `MEM_freeN<T>` only with MSVC, as clang-cl on
windows-arm64 does work fine with DNA structs using
`DNA_DEFINE_CXX_METHODS`.
Pull Request: https://projects.blender.org/blender/blender/pulls/134861
The vulkan backend was implemented with async in mind, however the one place
where Blender uses for async was implemented blocking. This PR splits the
readback into flushing the command and waiting for readback.
**Performance**
Improvement of animation playback performance of shader balls.blend is around 10%.
Shader balls.blend frame: 1-100, 10 x animation playback
| Branch | Total time | Average time |
| -------------------- | ---------- | ------------ |
| blender-v4.4-release | 26851 ms | 2685 ms |
| This PR | 23675 ms | 2367 ms |
Pull Request: https://projects.blender.org/blender/blender/pulls/134227
Previous implementation used the resource state tracker which is a hash
table lookup. `is_link_to_buffer` is a bit cheaper as it is compares
already loaded data.
This PR changes the resource locking when reordering render graph
nodes. Reordering could be done without locking resources. No measurable
speedup detected.
Pull Request: https://projects.blender.org/blender/blender/pulls/134032
This PR fixes an issue that shaders compilation could stall. This could
be seen in the viewport (sometime not showing first EEVEE render) but
was more prominent when running test cases.
Pull Request: https://projects.blender.org/blender/blender/pulls/134020
VkBufferViews could be used after they were freed. The reason is that
they were not managed by the discard pool. Detected when looking in
failing render tests (pointcloud_motion.blend).
This part of the API is used by motion blur in EEVEE. Fixes the next
render tests
- `eevee_next_motion_blur_vulkan`
- `eevee_next_pointcloud_vulkan`
- `eevee_next_hair_vulkan`
Related: #133546
Pull Request: https://projects.blender.org/blender/blender/pulls/133856
In renderdoc the debug stack got corrupted when render graphs where
reused. The previous usage didn't clear the stack. This PR clears
the debug stack when render graphs are reset.
This PR implements a new the threading model for building render graphs
based on tests performed last month. For out workload multithreaded
command building will block in the driver or device. So better to use a
single thread for command building.
Details of the internal working is documented at https://developer.blender.org/docs/features/gpu/vulkan/render_graph/
- When a context is activated on a thread the context asks for a
render graph it can use by calling `VKDevice::render_graph_new`.
- Parts of the GPU backend that requires GPU commands will add a
specific render graph node to the render graph. The nodes also
contains a reference to all resources it needs including the
access it needs and the image layout.
- When the context is flushed the render graph is submitted to the
device by calling `VKDevice::render_graph_submit`.
- The device puts the render graph in `VKDevice::submission_pool`.
- There is a single background thread that gets the next render
graph to send to the GPU (`VKDevice::submission_runner`).
- Reorder the commands of the render graph to comply with Vulkan
specific command order rules and reducing possible bottlenecks.
(`VKScheduler`)
- Generate the required barriers `VKCommandBuilder::groups_extract_barriers`.
This is a separate step to reduce resource locking giving other
threads access to the resource states when they are building
the render graph nodes.
- GPU commands and pipeline barriers are recorded to a VkCommandBuffer.
(`VKCommandBuilder::record_commands`)
- When completed the command buffer can be submitted to the device
queue. `vkQueueSubmit`
- Render graphs that have been submitted can be reused by a next
thread. This is done by pushing the render graph to the
`VKDevice::unused_render_graphs` queue.
Pull Request: https://projects.blender.org/blender/blender/pulls/132681
Vulkan shader compiler accesses the cache folder via multiple threads.
GHOST part isn't thread safe and can return and overwrite the returned
cache path. This resulted into crashes when performing background
rendering and failing test cases, loading of incorrect shaders etc.
This PR fixes this to cache the cache folder location in the
VKShaderCompiler, which is loaded via the main thread when the vulkan
backend is initialized.
Pull Request: https://projects.blender.org/blender/blender/pulls/133535
Memory areas was requested to be preferable host visible. On some
platforms this would fail to allocate. Best is to not add preferable
host visible for typically large allocations.
This PR also gives the caller the responsibility to set the allocation flags.
Pull Request: https://projects.blender.org/blender/blender/pulls/133528
Cycles uses pixel buffers to update the display. Due to making things
work the vulkan backend downloaded the GPU allocated pixel buffer to the
CPU, Copied it to a GPU allocated staging buffer and update the display
texture using the staging buffer. Needless to say that a (CPU->)GPU->CPU->GPU
roundtrip is a bottleneck.
This PR fixes this by allowing the pixel buffer to act as a staging
buffer as well.
Viewport and final image rendering performance is now also similar.
| **Render** | **GPU Backend** | **Path tracing** | **Display** |
| ---------- | --------------- | ---------------- | ----------- |
| Viewport | OpenGL | 2.7 | 0.06 |
| Viewport | Vulkan | 2.7 | 0.04 |
| Image | OpenGL | 3.9 | 0.02 |
| Image | Vulkan | 3.9 | 0.02 |
Tested on:
```
Operating system: Linux-6.8.0-49-generic-x86_64-with-glibc2.39 64 Bits, X11 UI
Graphics card: AMD Radeon Pro W7700 (RADV NAVI32) Advanced Micro Devices radv Mesa 24.3.1 - kisak-mesa PPA Vulkan Backend
```
Pull Request: https://projects.blender.org/blender/blender/pulls/133485
VKRenderGraphNode is 892 bytes and most of the bytes are used for
specific nodes. By storing large structs in separate vectors we can
reduce the needed memory and improve cache pre-fetching.
With this change the VKRenderGraphNode is reduced to 64 bytes.
On a (50 frames shader_balls.blend) the end user performance is improved by
2%.
| **Platform** | **Before** | **After** |
| ---------------- | ---------- | --------- |
| AMD W7700 | 1409 ms | 1383 ms |
| NVIDIA RTX 6000 | 1443 ms | 1428 ms |
Pull Request: https://projects.blender.org/blender/blender/pulls/133317
Images used to be tracked with ownership in order to reset swap chain
images to its original layout. This isn't used anymore as we always mark
them in VK_IMAGE_LAYOUT_UNDEFINED to make the first pipeline barrier a
nop.
This change reduces unneeded complexity and safe a few CPU cycles.
Pull Request: https://projects.blender.org/blender/blender/pulls/133197
This fixes a rendering issue when local read enabled.
Before fix, the output image is too bright. This is due to incorrect load/store.
With this fix, the logic for attachment load/store ops with local_read on matches the logic with local_read off inside subpass_transition_impl...
Pull Request: https://projects.blender.org/blender/blender/pulls/133111