Previously, the `UserData` and `LocalUserData` classes were only supposed to be
used by the lazy-function system. However, they are generic enough so that they
can also be used by the multi-function system. Therefore, this patch extracts
them into a separate header that can be used in both evaluation systems.
I'm doing this in preparation for being able to pass the geometry nodes logger
to multi-functions, to be able to report errors from there.
Pull Request: https://projects.blender.org/blender/blender/pulls/138861
This patch adds a new `BLI_mutex.hh` header which adds `blender::Mutex` as alias
for either `tbb::mutex` or `std::mutex` depending on whether TBB is enabled.
Description copied from the patch:
```
/**
* blender::Mutex should be used as the default mutex in Blender. It implements a subset of the API
* of std::mutex but has overall better guaranteed properties. It can be used with RAII helpers
* like std::lock_guard. However, it is not compatible with e.g. std::condition_variable. So one
* still has to use std::mutex for that case.
*
* The mutex provided by TBB has these properties:
* - It's as fast as a spin-lock in the non-contended case, i.e. when no other thread is trying to
* lock the mutex at the same time.
* - In the contended case, it spins a couple of times but then blocks to avoid draining system
* resources by spinning for a long time.
* - It's only 1 byte large, compared to e.g. 40 bytes when using the std::mutex of GCC. This makes
* it more feasible to have many smaller mutexes which can improve scalability of algorithms
* compared to using fewer larger mutexes. Also it just reduces "memory slop" across Blender.
* - It is *not* a fair mutex, i.e. it's not guaranteed that a thread will ever be able to lock the
* mutex when there are always more than one threads that try to lock it. In the majority of
* cases, using a fair mutex just causes extra overhead without any benefit. std::mutex is not
* guaranteed to be fair either.
*/
```
The performance benchmark suggests that the impact is negilible in almost
all cases. The only benchmarks that show interesting behavior are the once
testing foreach zones in Geometry Nodes. These tests are explicitly testing
overhead, which I still have to reduce over time. So it's not unexpected that
changing the mutex has an impact there. What's interesting is that on macos the
performance improves a lot while on linux it gets worse. Since that overhead
should eventually be removed almost entirely, I don't really consider that
blocking.
Links:
* Documentation of different mutex flavors in TBB:
https://www.intel.com/content/www/us/en/docs/onetbb/developer-guide-api-reference/2021-12/mutex-flavors.html
* Older implementation of a similar mutex by me:
https://archive.blender.org/developer/differential/0016/0016711/index.html
* Interesting read regarding how a mutex can be this small:
https://webkit.org/blog/6161/locking-in-webkit/
Pull Request: https://projects.blender.org/blender/blender/pulls/138370
This reduces verbosity when using `LinearAllocator` or `ResourceScope` to
allocate values for a `CPPType`. Now, this is simplified and one also does not
have to manually add a destructor call anymore.
Pull Request: https://projects.blender.org/blender/blender/pulls/137685
This makes accessing these properties more convenient. Since we only ever have
const references to `CPPType`, there isn't really a benefit to using methods to
avoid mutation.
Pull Request: https://projects.blender.org/blender/blender/pulls/137482
This implements bundles and closures which are described in more detail in this
blog post: https://code.blender.org/2024/11/geometry-nodes-workshop-october-2024/
tl;dr:
* Bundles are containers that allow storing multiple socket values in a single
value. Each value in the bundle is identified by a name. Bundles can be
nested.
* Closures are functions that are created with the Closure Zone and can be
evaluated with the Evaluate Closure node.
To use the patch, the `Bundle and Closure Nodes` experimental feature has to be
enabled. This is necessary, because these features are not fully done yet and
still need iterations to improve the workflow before they can be officially
released. These iterations are easier to do in `main` than in a separate branch
though. That's because this patch is quite large and somewhat prone to merge
conflicts. Also other work we want to do, depends on this.
This adds the following new nodes:
* Combine Bundle: can pack multiple values into one.
* Separate Bundle: extracts values from a bundle.
* Closure Zone: outputs a closure zone for use in the `Evaluate Closure` node.
* Evaluate Closure: evaluates the passed in closure.
Things that will be added soon after this lands:
* Fields in bundles and closures. The way this is done changes with #134811, so
I rather implement this once both are in `main`.
* UI features for keeping sockets in sync (right now there are warnings only).
One bigger issue is the limited support for lazyness. For example, all inputs of
a Combine Bundle node will be evaluated, even if they are not all needed. The
same is true for all captured values of a closure. This is a deeper limitation
that needs to be resolved at some point. This will likely be done after an
initial version of this patch is done.
Pull Request: https://projects.blender.org/blender/blender/pulls/128340
This adds a new type of zone to Geometry Nodes that allows executing some nodes
for each element in a geometry.
## Features
* The `Selection` input allows iterating over a subset of elements on the set
domain.
* Fields passed into the input node are available as single values inside of the
zone.
* The input geometry can be split up into separate (completely independent)
geometries for each element (on all domains except face corner).
* New attributes can be created on the input geometry by outputting a single
value from each iteration.
* New geometries can be generated in each iteration.
* All of these geometries are joined to form the final output.
* Attributes from the input geometry are propagated to the output
geometries.
## Evaluation
The evaluation strategy is similar to the one used for repeat zones. Namely, it
dynamically builds a `lazy_function::Graph` once it knows how many iterations
are necessary. It contains a separate node for each iteration. The inputs for
each iteration are hardcoded into the graph. The outputs of each iteration a
passed to a separate lazy-function that reduces all the values down to the final
outputs. This final output can have a huge number of inputs and that is not
ideal for multi-threading yet, but that can still be improved in the future.
## Performance
There is a non-neglilible amount of overhead for each iteration. The overhead is
way larger than the per-element overhead when just doing field evaluation.
Therefore, normal field evaluation should be preferred when possible. That can
partially still be optimized if there is only some number crunching going on in
the zone but that optimization is not implemented yet.
However, processing many small geometries (e.g. each hair of a character
separately) will likely **always be slower** than working on fewer larger
geoemtries. The additional flexibility you get by processing each element
separately comes at the cost that Blender can't optimize the operation as well.
For node groups that need to handle lots of geometry elements, we recommend
trying to design the node setup so that iteration over tiny sub-geometries is
not required.
An opposite point is true as well though. It can be faster to process more
medium sized geometries in parallel than fewer very large geometries because of
more multi-threading opportunities. The exact threshold between tiny, medium and
large geometries depends on a lot of factors though.
Overall, this initial version of the new zone does not implement all
optimization opportunities yet, but the points mentioned above will still hold
true later.
Pull Request: https://projects.blender.org/blender/blender/pulls/127331
The depsgraph CoW mechanism is a bit of a misnomer. It creates an
evaluated copy for data-blocks regardless of whether the copy will
actually be written to. The point is to have physical separation between
original and evaluated data. This is in contrast to the commonly used
performance improvement of keeping a user count and copying data
implicitly when it needs to be changed. In Blender code we call this
"implicit sharing" instead. Importantly, the dependency graph has no
idea about the _actual_ CoW behavior in Blender.
Renaming this functionality in the despgraph removes some of the
confusion that comes up when talking about this, and will hopefully
make the depsgraph less confusing to understand initially too. Wording
like "the evaluated copy" (as opposed to the original data-block) has
also become common anyway.
Pull Request: https://projects.blender.org/blender/blender/pulls/118338
This refactors `SocketValueVariant` with the following goals in mind:
* Support type erasure so that not all users of `SocketValueVariant` have
to know about all the types sockets can have.
* Move towards supporting "rainbow sockets" which are sockets whoose
type is only known at run-time.
* Reduce complexity when dealing with socket values in general. Previously,
one had to use `SocketValueVariantCPPType` a lot to manage uninitialized
memory. This is better abstracted away now.
One related change that I had to do that I didn't see coming at first was that
I had to refactor `set_default_remaining_outputs` because now the default value
of a `SocketValueVariant` would not contain any value. Previously, it was
initialized the zero-value of the template parameter. Similarly, I had to change
how implicit conversions are created, because comparing the `CPPType` of linked
sockets was not enough anymore to determine if a conversion is necessary.
We could potentially use `SocketValueVariant` for the remaining socket types in the
future as well. Not entirely sure if that helps yet. `SocketValueVariant` can easily be
adapted to make that work though. That would also justify the name
"SocketValueVariant" better.
Pull Request: https://projects.blender.org/blender/blender/pulls/116231
NDEBUG is part of the C standard and disables asserts. Only this will
now be used to decide if asserts are enabled.
DEBUG was a Blender specific define, that has now been removed.
_DEBUG is a Visual Studio define for builds in Debug configuration.
Blender defines this for all platforms. This is still used in a few
places in the draw code, and in external libraries Bullet and Mantaflow.
Pull Request: https://projects.blender.org/blender/blender/pulls/115774
This struct is currently defined in the `functions` module but not actually used there. It's only used by the geometry nodes module, with an indirect dependency from blenkernel via simulation zone baking. This scope is problematic when adding grids as socket data, which should not be part of the functions module.
The `ValueOrField` struct is now moved to blenkernel, so it can be more easily extended to other kinds of data that might be passed around by geometry nodes sockets in future. No functional changes.
Pull Request: https://projects.blender.org/blender/blender/pulls/115087
Nodes that are scheduled can be executed in any order in theory. So when
there are many scheduled nodes, it can be benefitial to start evaluating
them in parallel.
Note that it is not very common that many nodes are scheduled at the
same time in typical setups because the evaluator uses a depth-first heuristic
to decide in which order to evaluate nodes. It can happen more easily in
generated node trees though.
Also, this change only has an affect in practice if none of the scheduled nodes
uses multi-threading internally, as this would also trigger the user of multiple
threads in the graph executor.
This idea is of remapping parameters of lazy-functions is useful
not only for the repeat zone. For example, it could be used for the
for-each zone as well.
Also, moving it to a more general place indicates that there is no
repeat-zone specific stuff in it.
The main goal of this refactor is to simplify how a geometry node group is executed.
Previously, there was duplicated logic that turned the lazy-function graph of a node
group into a single lazy-function. Now this is done only in one place and others can
just execute the lazy-function directly, without having to worry about the underlying graph.
Pull Request: https://projects.blender.org/blender/blender/pulls/112482
Goals of the refactor:
* Simplify adding (named) graph inputs and outputs.
* Add ability to refer to a graph input or output with an index.
* Get rid of the "dummy" terminology which doesn't really help.
Previously, one would add "dummy nodes" which can then serve as input
and output nodes of the graph. Now one directly adds input and outputs
using `Graph.add_input` and `Graph.add_output`. There is one interface
node that contains all inputs and another one that contains all outputs.
Being able to refer to a graph input or output with an index makes it
more efficient to implement some algorithms. E.g. one could have a
bit span for a socket that contains all the information what graph
inputs this socket depends on.
Pull Request: https://projects.blender.org/blender/blender/pulls/112474
This is a light weight solution to passing in some extra context into
a lazy-function that is invoked by the graph executor.
The new functionality is used by #112421.
By storing a raw pointer instead of a `Span`, we save 16 bytes
per node state. I measured a ~5% speedup in my setup with
a simple repeat zone.
5c450aea05 added some additional asserts to check for valid
indices. Generally, index-errors in this area lead to wrong
behaviors of geometry nodes very quickly.
There are many small allocations when the graph executor is
initialized (e.g. all the node/sockets have to be allocated). Those
were already combined into a few allocations by making use
of `LinearAllocator`. However, even better performance can be
achieved by making one larger allocation and then using
preprocessed offsets into that buffer.
I measured up to 20% speedup in geometry nodes with a simple
repeat zone.
Including <iostream> or similar headers is quite expensive, since it
also pulls in things like <locale> and so on. In many BLI headers,
iostreams are only used to implement some sort of "debug print",
or an operator<< for ostream.
Change some of the commonly used places to instead include <iosfwd>,
which is the standard way of forward-declaring iostreams related
classes, and move the actual debug-print / operator<< implementations
into .cc files.
This is not done for templated classes though (it would be possible
to provide explicit operator<< instantiations somewhere in the
source file, but that would lead to hard-to-figure-out linker error
whenever someone would add a different template type). There, where
possible, I changed from full <iostream> include to only the needed
<ostream> part.
For Span<T>, I just removed print_as_lines since it's not used by
anything. It could be moved into a .cc file using a similar approach
as above if needed.
Doing full blender build changes include counts this way:
- <iostream> 1986 -> 978
- <sstream> 2880 -> 925
It does not affect the total build time much though, mostly because
towards the end of it there's just several CPU cores finishing
compiling OpenVDB related source files.
Pull Request: https://projects.blender.org/blender/blender/pulls/111046
Listing the "Blender Foundation" as copyright holder implied the Blender
Foundation holds copyright to files which may include work from many
developers.
While keeping copyright on headers makes sense for isolated libraries,
Blender's own code may be refactored or moved between files in a way
that makes the per file copyright holders less meaningful.
Copyright references to the "Blender Foundation" have been replaced with
"Blender Authors", with the exception of `./extern/` since these this
contains libraries which are more isolated, any changed to license
headers there can be handled on a case-by-case basis.
Some directories in `./intern/` have also been excluded:
- `./intern/cycles/` it's own `AUTHORS` file is planned.
- `./intern/opensubdiv/`.
An "AUTHORS" file has been added, using the chromium projects authors
file as a template.
Design task: #110784
Ref !110783.
Add a quaternion attribute type that will be used in combination with
rotation sockets for geometry nodes to give a more intuitive experience
and better performance when using rotations.
The most interesting part is probably the interpolation, the rest is
the same as the last attribute type addition, 988f23cec3.
We need to interpolate multiple values with different weights.
Based on Sybren's suggestion, this uses the `expmap` methods from
4805a54525 for that.
This also refactors `SimpleMixerWithAccumulationType` to use a
function rather than a cast to convert to the accumulation type.
See #92967
Pull Request: https://projects.blender.org/blender/blender/pulls/108678
A lot of files were missing copyright field in the header and
the Blender Foundation contributed to them in a sense of bug
fixing and general maintenance.
This change makes it explicit that those files are at least
partially copyrighted by the Blender Foundation.
Note that this does not make it so the Blender Foundation is
the only holder of the copyright in those files, and developers
who do not have a signed contract with the foundation still
hold the copyright as well.
Another aspect of this change is using SPDX format for the
header. We already used it for the license specification,
and now we state it for the copyright as well, following the
FAQ:
https://reuse.software/faq/
Goals of this refactor:
* Reduce memory consumption of `IndexMask`. The old `IndexMask` uses an
`int64_t` for each index which is more than necessary in pretty much all
practical cases currently. Using `int32_t` might still become limiting
in the future in case we use this to index e.g. byte buffers larger than
a few gigabytes. We also don't want to template `IndexMask`, because
that would cause a split in the "ecosystem", or everything would have to
be implemented twice or templated.
* Allow for more multi-threading. The old `IndexMask` contains a single
array. This is generally good but has the problem that it is hard to fill
from multiple-threads when the final size is not known from the beginning.
This is commonly the case when e.g. converting an array of bool to an
index mask. Currently, this kind of code only runs on a single thread.
* Allow for efficient set operations like join, intersect and difference.
It should be possible to multi-thread those operations.
* It should be possible to iterate over an `IndexMask` very efficiently.
The most important part of that is to avoid all memory access when iterating
over continuous ranges. For some core nodes (e.g. math nodes), we generate
optimized code for the cases of irregular index masks and simple index ranges.
To achieve these goals, a few compromises had to made:
* Slicing of the mask (at specific indices) and random element access is
`O(log #indices)` now, but with a low constant factor. It should be possible
to split a mask into n approximately equally sized parts in `O(n)` though,
making the time per split `O(1)`.
* Using range-based for loops does not work well when iterating over a nested
data structure like the new `IndexMask`. Therefor, `foreach_*` functions with
callbacks have to be used. To avoid extra code complexity at the call site,
the `foreach_*` methods support multi-threading out of the box.
The new data structure splits an `IndexMask` into an arbitrary number of ordered
`IndexMaskSegment`. Each segment can contain at most `2^14 = 16384` indices. The
indices within a segment are stored as `int16_t`. Each segment has an additional
`int64_t` offset which allows storing arbitrary `int64_t` indices. This approach
has the main benefits that segments can be processed/constructed individually on
multiple threads without a serial bottleneck. Also it reduces the memory
requirements significantly.
For more details see comments in `BLI_index_mask.hh`.
I did a few tests to verify that the data structure generally improves
performance and does not cause regressions:
* Our field evaluation benchmarks take about as much as before. This is to be
expected because we already made sure that e.g. add node evaluation is
vectorized. The important thing here is to check that changes to the way we
iterate over the indices still allows for auto-vectorization.
* Memory usage by a mask is about 1/4 of what it was before in the average case.
That's mainly caused by the switch from `int64_t` to `int16_t` for indices.
In the worst case, the memory requirements can be larger when there are many
indices that are very far away. However, when they are far away from each other,
that indicates that there aren't many indices in total. In common cases, memory
usage can be way lower than 1/4 of before, because sub-ranges use static memory.
* For some more specific numbers I benchmarked `IndexMask::from_bools` in
`index_mask_from_selection` on 10.000.000 elements at various probabilities for
`true` at every index:
```
Probability Old New
0 4.6 ms 0.8 ms
0.001 5.1 ms 1.3 ms
0.2 8.4 ms 1.8 ms
0.5 15.3 ms 3.0 ms
0.8 20.1 ms 3.0 ms
0.999 25.1 ms 1.7 ms
1 13.5 ms 1.1 ms
```
Pull Request: https://projects.blender.org/blender/blender/pulls/104629
The main goal here is to reduce the number of times thread-local data has
to be looked up using e.g. `EnumerableThreadSpecific.local()`. While this
isn't a bottleneck in many cases, it is when the action performed on the local
data is very short and that happens very often (e.g. logging used sockets
during geometry nodes evaluation).
The solution is to simply pass the thread-local data as parameter to many
functions that use it, instead of looking it up in those functions which
generally is more costly.
The lazy-function graph executor now only looks up the local data if
it knows that it might be on a new thread, otherwise it uses the local data
retrieved earlier.
Alongside with `UserData` there is `LocalUserData` now. This allows users
of the lazy-function evaluation (such as geometry nodes) to have custom
thread-local data that is passed to all the lazy-functions automatically.
This is used for logging now.
This type will be used to store mesh edges in #106638, but it could
be used for anything else too. This commit adds support for:
- The new type in the Python API
- Editing the type in the edit mode "Attribute Set" operator
- Rendering the type in EEVEE and Cycles for all geometry types
- Geometry nodes attribute interpolation and mixing
- Viewing the type in the spreadsheet and using row filters
The attribute uses the `blender::int2` type in most code, and
the `vec2i` DNA type in C code when necessary. The enum names
are based on `INT32_2D` for consistency with `INT8` and `INT32`.
Pull Request: https://projects.blender.org/blender/blender/pulls/106677
For example
```
OIIOOutputDriver::~OIIOOutputDriver()
{
}
```
becomes
```
OIIOOutputDriver::~OIIOOutputDriver() {}
```
Saves quite some vertical space, which is especially handy for
constructors.
Pull Request: https://projects.blender.org/blender/blender/pulls/105594
Straightforward port. I took the oportunity to remove some C vector
functions (ex: copy_v2_v2).
This makes some changes to DRWView to accomodate the alignement
requirements of the float4x4 type.
Straightforward port. I took the oportunity to remove some C vector
functions (ex: `copy_v2_v2`).
This makes some changes to DRWView to accomodate the alignement
requirements of the float4x4 type.
This can improve performance in some circumstances when there are
vectorized and/or unrolled loops. I especially noticed that this helps
a lot while working on D16970 (got a 10-20% speedup there by avoiding
running into the non-vectorized fallback loop too often).
In most cases it is currently not used, so always having it there
causes unnecessary overhead. In my test file that causes
a 2 % performance improvement.