test2

Author	SHA1	Message	Date
Campbell Barton	57dd9c21d3	Cleanup: spelling in comments	2024-03-21 10:02:53 +11:00
Jacques Lucke	b99c1abc3a	BLI: speedup memory bandwidth bound tasks by reducing threading This improves performance by reducing the number of threads used for tasks which require a high memory bandwidth. This works because the underlying hardware has a certain maximum memory bandwidth. If that is used up by a few threads already, any additional threads wanting to use a lot of memory will just cause more contention which actually slows things down. By reducing the number of threads that can perform certain tasks, the remaining threads are also not locked up doing work that they can't do efficiently. It's best if there is enough scheduled work so that these threads can do more compute intensive tasks instead. To use this new functionality, one has to put the parallel code in question into a `threading::memory_bandwidth_bound_task(...)` block. Additionally, one also has to provide a (very) rough approximation for how many bytes are accessed. If the number is low, the number of threads shouldn't be reduced because it's likely that all touched memory can be in L3 cache which generally has a much higher bandwidth than main memory. The exact number of threads that are allowed to do bandwidth bound tasks at the same time is generally highly context and hardware dependent. It's also not really possible to measure reliably because it depends on so many static and dynamic factors. The thread count is now hardcoded to 8. It seems that this many threads are easily capable of maxing out the bandwidth capacity. With this technique I can measure surprisingly good performance improvements: * Generating a 3000x3000 grid: 133ms -> 103ms. * Generating a mesh line with 100'000'000 vertices: 212ms -> 189ms. * Realize mesh instances resulting in ~27'000'000 vertices: 460ms -> 305ms. In all of these cases, only 8 instead of 24 threads are used. The remaining threads are idle in these cases, but they could do other work if available. 
Pull Request: https://projects.blender.org/blender/blender/pulls/118939	2024-03-19 18:23:56 +01:00
Campbell Barton	a8cc6bb75b	Cleanup: spelling in comments	2024-02-26 10:23:52 +11:00
Jacques Lucke	9a3ceb79de	BLI: add weighted parallel for function The standard `threading::parallel_for` function tries to split the range into uniformly sized subranges. This is great if each element takes approximately the same amount of time to compute. However, there are also situations where the time required to do the work for a single index differs significantly between different indices. In such a case, it's better to split the tasks into segments while taking the size of each task into account. This patch implements `threading::parallel_for_weighted` which allows passing in an additional callback that returns the size of each task. Pull Request: https://projects.blender.org/blender/blender/pulls/118348	2024-02-25 15:01:05 +01:00
Jacques Lucke	50709ca253	BLI: add named constructors for IndexRange Unless you're very familiar with `IndexRange`, it's often hard to know what e.g. `IndexRange(10, 15)` means. Without more context, one could think that it means `10-14`, `10-15` or `10-24`. This patch adds named constructors to `IndexRange` to make the behavior more obvious when writing and when reading the code. With those one can use `IndexRange::from_begin_end(10, 15)`, `IndexRange::from_begin_end_inclusive(10, 15)` or `IndexRange::from_begin_size(10, 15)` respectively. While being a bit more verbose, the explicitness makes code easier to understand and also allows abstracting away some common index computations. The old unnamed constructor that takes a begin and size is not removed by this patch, as that would make the patch significantly bigger. I think it's reasonable to generally use the named constructors going forward and to change the existing usages of the old constructor over time. Pull Request: https://projects.blender.org/blender/blender/pulls/118606	2024-02-22 12:57:10 +01:00
Campbell Barton	e955c94ed3	License Headers: Set copyright to "Blender Authors", add AUTHORS Listing the "Blender Foundation" as copyright holder implied the Blender Foundation holds copyright to files which may include work from many developers. While keeping copyright on headers makes sense for isolated libraries, Blender's own code may be refactored or moved between files in a way that makes the per-file copyright holders less meaningful. Copyright references to the "Blender Foundation" have been replaced with "Blender Authors", with the exception of `./extern/` since this contains libraries which are more isolated; any changes to license headers there can be handled on a case-by-case basis. Some directories in `./intern/` have also been excluded: - `./intern/cycles/` its own `AUTHORS` file is planned. - `./intern/opensubdiv/`. An "AUTHORS" file has been added, using the Chromium project's authors file as a template. Design task: #110784 Ref !110783.	2023-08-16 00:20:26 +10:00
Sergey Sharybin	c1bc70b711	Cleanup: Add a copyright notice to files and use SPDX format A lot of files were missing copyright field in the header and the Blender Foundation contributed to them in a sense of bug fixing and general maintenance. This change makes it explicit that those files are at least partially copyrighted by the Blender Foundation. Note that this does not make it so the Blender Foundation is the only holder of the copyright in those files, and developers who do not have a signed contract with the foundation still hold the copyright as well. Another aspect of this change is using SPDX format for the header. We already used it for the license specification, and now we state it for the copyright as well, following the FAQ: https://reuse.software/faq/	2023-05-31 16:19:06 +02:00
Jacques Lucke	2cfcb8b0b8	BLI: refactor IndexMask for better performance and memory usage Goals of this refactor: * Reduce memory consumption of `IndexMask`. The old `IndexMask` uses an `int64_t` for each index which is more than necessary in pretty much all practical cases currently. Using `int32_t` might still become limiting in the future in case we use this to index e.g. byte buffers larger than a few gigabytes. We also don't want to template `IndexMask`, because that would cause a split in the "ecosystem", or everything would have to be implemented twice or templated. * Allow for more multi-threading. The old `IndexMask` contains a single array. This is generally good but has the problem that it is hard to fill from multiple threads when the final size is not known from the beginning. This is commonly the case when e.g. converting an array of bool to an index mask. Currently, this kind of code only runs on a single thread. * Allow for efficient set operations like join, intersect and difference. It should be possible to multi-thread those operations. * It should be possible to iterate over an `IndexMask` very efficiently. The most important part of that is to avoid all memory access when iterating over continuous ranges. For some core nodes (e.g. math nodes), we generate optimized code for the cases of irregular index masks and simple index ranges. To achieve these goals, a few compromises had to be made: * Slicing of the mask (at specific indices) and random element access is `O(log #indices)` now, but with a low constant factor. It should be possible to split a mask into n approximately equally sized parts in `O(n)` though, making the time per split `O(1)`. * Using range-based for loops does not work well when iterating over a nested data structure like the new `IndexMask`. Therefore, `foreach_` functions with callbacks have to be used. To avoid extra code complexity at the call site, the `foreach_` methods support multi-threading out of the box. 
The new data structure splits an `IndexMask` into an arbitrary number of ordered `IndexMaskSegment`. Each segment can contain at most `2^14 = 16384` indices. The indices within a segment are stored as `int16_t`. Each segment has an additional `int64_t` offset which allows storing arbitrary `int64_t` indices. This approach has the main benefits that segments can be processed/constructed individually on multiple threads without a serial bottleneck. Also it reduces the memory requirements significantly. For more details see comments in `BLI_index_mask.hh`. I did a few tests to verify that the data structure generally improves performance and does not cause regressions: * Our field evaluation benchmarks take about as much as before. This is to be expected because we already made sure that e.g. add node evaluation is vectorized. The important thing here is to check that changes to the way we iterate over the indices still allows for auto-vectorization. * Memory usage by a mask is about 1/4 of what it was before in the average case. That's mainly caused by the switch from `int64_t` to `int16_t` for indices. In the worst case, the memory requirements can be larger when there are many indices that are very far away. However, when they are far away from each other, that indicates that there aren't many indices in total. In common cases, memory usage can be way lower than 1/4 of before, because sub-ranges use static memory. * For some more specific numbers I benchmarked `IndexMask::from_bools` in `index_mask_from_selection` on 10.000.000 elements at various probabilities for `true` at every index: ``` Probability Old New 0 4.6 ms 0.8 ms 0.001 5.1 ms 1.3 ms 0.2 8.4 ms 1.8 ms 0.5 15.3 ms 3.0 ms 0.8 20.1 ms 3.0 ms 0.999 25.1 ms 1.7 ms 1 13.5 ms 1.1 ms ``` Pull Request: https://projects.blender.org/blender/blender/pulls/104629	2023-05-24 18:11:41 +02:00
Jacques Lucke	64c33871bd	Cleanup: add missing inline This is necessary for correctness of the code to avoid duplicate symbols. In practice, this wasn't necessary yet, because usually we pass lambdas into these functions which cause every instantiation to have a different signature.	2023-05-22 09:32:35 +02:00
Jacques Lucke	7725bacd6a	BLI: support aligned parallel reduce Alignment here means that the size of the range passed into callback is a multiple of the alignment value (which has to be a power of two). This can help with performance when loops in the callback are unrolled and/or vectorized. Otherwise, it can potentially reduce performance by splitting work into more unequally sized chunks. For example, chunk sizes might be 4 and 8 instead of 6 and 6 when alignment is 4.	2023-05-22 09:30:51 +02:00
Jacques Lucke	f6d824bca6	BLI: move tbb part of parallel_for to implementation file Previously, `tbb::parallel_for` was instantiated every time `threading::parallel_for` is used. However, when actual parallelism is used, the overhead of a function call is negligible. Therefore it is possible to move that part out of the header without causing noticeable performance regressions. This reduces the size of the Blender binary from 308.2 to 303.5 MB, which is a reduction of about 1.5%.	2023-05-21 13:31:32 +02:00
Jacques Lucke	3f1886d0b7	Functions: align chunk sizes in multi-function evaluation This can improve performance in some circumstances when there are vectorized and/or unrolled loops. I especially noticed that this helps a lot while working on D16970 (got a 10-20% speedup there by avoiding running into the non-vectorized fallback loop too often).	2023-01-22 00:03:25 +01:00
Jacques Lucke	85908e9edf	Geometry Nodes: new Interpolate Curves node This adds a new `Interpolate Curves` node. It allows generating new curves between a set of existing guide curves. This is essential for procedural hair. Usage: - One has to provide a set of guide curves and a set of root positions for the generated curves. New curves are created starting from these root positions. The N closest guide curves are used for the interpolation. - An additional up vector can be provided for every guide curve and root position. This is typically a surface normal or nothing. This allows generating child curves that are properly oriented based on the surface orientation. - Sometimes a point should only be interpolated using a subset of the guides. This can be achieved using the `Guide Group ID` and `Point Group ID` inputs. The curve generated at a specific point will only take the guides with the same id into account. This allows e.g. for hair parting. - The `Max Neighbors` input limits how many guide curves are taken into account for every interpolated curve. Differential Revision: https://developer.blender.org/D16642	2023-01-20 12:09:38 +01:00
Jacques Lucke	c29c61f840	Fix T102292: deadlock in geometry nodes evaluation with task isolation As described in the comment on `BLI_task_isolate`, deadlocks can happen when isolation is used with threading primitives that separate spawning tasks from executing them. All threads are waiting for the tasks to complete but no thread is able to continue working due to task isolation. The fix is to not pass lazy-threading hints through task isolations. This way isolated regions can't create new tasks in a scheduler further up the call stack. This may lead to minor slowdowns because less threading may be used. It's generally possible to get rid of the slowdown again by sending the lazy-threading hint before entering the isolated region.	2022-11-06 15:07:32 +01:00
Jacques Lucke	5c81d3bd46	Geometry Nodes: improve evaluator with lazy threading In large node setups the threading overhead was sometimes very significant. That's especially true when most nodes do very little work. This commit improves the scheduling by not using multi-threading in many cases unless it's likely that it will be worth it. For more details see the comments in `BLI_lazy_threading.hh`. Differential Revision: https://developer.blender.org/D15976	2022-09-20 11:08:05 +02:00
Iliay Katueshenock	c94ca54cda	BLI: add use_threading parameter to parallel_invoke `parallel_invoke` allows executing functions on separate threads. However, creating tasks in tbb has a measurable amount of overhead. Therefore, it can be beneficial to disable parallelization when the amount of work done per function is small. See D15539 for some benchmark results. Differential Revision: https://developer.blender.org/D15539	2022-07-26 11:10:16 +02:00
Hans Goudey	c2737913db	BLI: Avoid invoking tbb for small parallel_reduce calls Apply a change similar to `e130903060` for `parallel_reduce`, just like `parallel_for`. I measured a performance improvement in viewport FPS of at least 10% with 1 million small instances (one bottleneck was computing many small bounding boxes).	2022-05-09 18:21:50 +02:00
Campbell Barton	c434782e3a	File headers: SPDX License migration Use a shorter/simpler license convention, stops the header taking so much space. Follow the SPDX license specification: https://spdx.org/licenses - C/C++/objc/objc++ - Python - Shell Scripts - CMake, GNUmakefile While most of the source tree has been included - `./extern/` was left out. - `./intern/cycles` & `./intern/atomic` are also excluded because they use different header conventions. doc/license/SPDX-license-identifiers.txt has been added to list all used SPDX identifiers. See P2788 for the script that automated these edits. Reviewed By: brecht, mont29, sergey Ref D14069	2022-02-11 09:14:36 +11:00
Jacques Lucke	7c10e364b2	BLI: wrap parallel_invoke from tbb	2022-02-09 13:08:04 +01:00
Jacques Lucke	e130903060	BLI: avoid invoking tbb for small workloads We often call `parallel_for` in places with very variable sized workloads. When many elements are processed, using multi-threading is great, but when processing few elements (possibly many times) using `parallel_for` can result in significant overhead. I measured that this improves performance by >20% in the refactored realize instances code I'm working on separately. The change might also help with debugging sometimes, because the stack trace is smaller and contains fewer irrelevant symbols.	2021-12-02 12:56:47 +01:00
Campbell Barton	3c3669894f	Cleanup: use system includes	2021-10-04 13:14:58 +11:00
Erik Abrahamsson	ceff86aafe	Various Exact Boolean parallelizations and optimizations. From patch D11780 from Erik Abrahamsson. It parallelizes making the vertices, destruction of map entries, finding if the result is PWN, finding triangle adjacencies, and finding the ambient cell. The latter needs a parallel_reduce from tbb, so added one into BLI_task.hh so that if WITH_TBB is false, the code will still work. On Erik's 6-core machine, the elapsed time went from 17.5s to 11.8s (33% faster) on an intersection of two spheres with 3.1M faces. On Howard's 24-core machine, the elapsed time went from 18.7s to 10.8s for the same test.	2021-07-05 18:09:36 -04:00
Jacques Lucke	b37093de7b	BLI: add C++ wrapper for task isolation This makes it easier to use task isolation in c++ code. Previously, one either had to check `WITH_TBB` (possibly indirectly through `WITH_OPENVDB`) or one had to use the C function which is less convenient.	2021-06-16 16:29:21 +02:00
Jacques Lucke	45d59e0df5	BLI: add threading namespace This namespace groups threading related functions/classes. This avoids adding more threading related stuff to the blender namespace. Also it makes naming a bit easier, e.g. the c++ version of BLI_task_isolate could become blender::threading::isolate_task or something similar. Differential Revision: https://developer.blender.org/D11624	2021-06-16 16:14:02 +02:00
Brecht Van Lommel	677e63d518	TBB: fix deprecation warnings with newer TBB versions * USD and OpenVDB headers use deprecated TBB headers, suppress all deprecation warnings there since we have no control over them. * For our own TBB includes, use the individual headers rather than the tbb.h that includes everything to avoid warnings, rather than suppressing all. This is in anticipation of the TBB 2020 upgrade in D10359. Ref D10361.	2021-02-10 19:32:24 +01:00
Ray Molenkamp	626a79204e	MSVC: Fix build warning If a define of NOMINMAX was made before BLI_task.hh was included, the compiler would emit warning C4005: 'NOMINMAX': macro redefinition. To work around this, only define it if it is not already defined, and only undefine it if we were the ones that made the define earlier.	2020-11-10 08:48:18 -07:00
Campbell Barton	fa0ceb4959	Cleanup: spelling	2020-10-16 11:46:48 +11:00
Ray Molenkamp	00f7b572d9	Windows: Fix build issue on windows TBB includes Windows.h which defines a min/max macro leading to issues when you want to use std::min and std::max. This change prevents Windows.h from defining them sidestepping the issue.	2020-10-15 17:14:57 -06:00
Jacques Lucke	309c919ee9	BKE: parallelize BKE_mesh_calc_edges `BKE_mesh_calc_edges` was the main performance bottleneck in D9141. While openvdb only needed ~115ms, calculating the edges afterwards took ~960ms. Now with some parallelization this is reduced to ~210ms. Parallelizing `BKE_mesh_calc_edges` is not entirely trivial, because it has to perform deduplication and some other things that have to happen in a certain order. Even though the multithreading improves performance with more threads, there are diminishing returns when too many threads are used in this function. The speedup is mainly achieved by having multiple hash tables that are filled in parallel. The distribution of the edges to hash tables is based on a hash (that is different from the hash used in the actual hash tables). I moved the function to C++, because that made it easier for me to optimize it. Furthermore, I added `BLI_task.hh` which contains some light tbb wrappers for parallelization. Reviewers: campbellbarton Differential Revision: https://developer.blender.org/D9151	2020-10-09 11:56:12 +02:00
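
The three readings of `IndexRange(10, 15)` listed in commit 50709ca253 can be made concrete with a small sketch. `RangeSketch` below is a hypothetical stand-in written for illustration; it is not Blender's `IndexRange`, and only the three constructor names are taken from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for illustration only; not Blender's `IndexRange`.
class RangeSketch {
  int64_t start_;
  int64_t size_;

  // Private, so callers have to pick one of the named constructors.
  constexpr RangeSketch(int64_t start, int64_t size) : start_(start), size_(size) {}

 public:
  // [begin, end): `end` is excluded.
  static constexpr RangeSketch from_begin_end(int64_t begin, int64_t end)
  {
    return RangeSketch(begin, end - begin);
  }
  // [begin, last]: `last` is included.
  static constexpr RangeSketch from_begin_end_inclusive(int64_t begin, int64_t last)
  {
    return RangeSketch(begin, last - begin + 1);
  }
  // `size` elements starting at `begin`.
  static constexpr RangeSketch from_begin_size(int64_t begin, int64_t size)
  {
    return RangeSketch(begin, size);
  }

  constexpr int64_t first() const { return start_; }
  constexpr int64_t last() const { return start_ + size_ - 1; }
  constexpr int64_t size() const { return size_; }
};

// The three readings of "(10, 15)" named in the commit message.
static_assert(RangeSketch::from_begin_end(10, 15).last() == 14, "10-14");
static_assert(RangeSketch::from_begin_end_inclusive(10, 15).last() == 15, "10-15");
static_assert(RangeSketch::from_begin_size(10, 15).last() == 24, "10-24");

// Runtime variant of the same idea, checking the element counts.
inline void check_sizes()
{
  assert(RangeSketch::from_begin_end(10, 15).size() == 5);
  assert(RangeSketch::from_begin_end_inclusive(10, 15).size() == 6);
  assert(RangeSketch::from_begin_size(10, 15).size() == 15);
}
```

The sketch makes the raw constructor private to force a named choice. The commit itself keeps the old unnamed begin/size constructor available.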

29 Commits
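
The "4 and 8 instead of 6 and 6" example from commit 7725bacd6a (aligned parallel reduce) can be reproduced with a minimal sketch. `split_sizes` is a hypothetical helper and the rounding rule is an assumption chosen to match that example; the actual splitting in Blender and TBB may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration; not the code from the commit.
// Splits [0, total) into `parts` chunks. Interior boundaries are rounded
// down to a multiple of `alignment`, which must be a power of two.
inline std::vector<int64_t> split_sizes(int64_t total, int64_t parts, int64_t alignment)
{
  assert(total >= 0 && parts > 0);
  assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
  std::vector<int64_t> sizes;
  int64_t begin = 0;
  for (int64_t i = 1; i <= parts; i++) {
    // Boundary an even split would use, before alignment is applied.
    const int64_t even = total * i / parts;
    // The final boundary is always `total`, so no element is dropped.
    const int64_t end = (i == parts) ? total : (even & ~(alignment - 1));
    sizes.push_back(end - begin);
    begin = end;
  }
  return sizes;
}
```

With these assumptions, `split_sizes(12, 2, 1)` gives chunks of 6 and 6, while `split_sizes(12, 2, 4)` gives 4 and 8. In this sketch only the last chunk can have a size that is not a multiple of the alignment, and chunks can be empty when `total` is small relative to `alignment`.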