Mesh: Further optimize topology map creation

We need a separate array that we can change during the parallel
group construction. That array tracks the position in each group
where the next index is added. Building this array is expensive,
since constructing a new `Array` fills its elements serially. There
are two possible solutions:

1. Use a copy of the offsets to increment result indices directly
2. Rely on OS-optimized `calloc` instead of `malloc` and a copy/fill

Both depend on using `fetch_and_add` instead of `add_and_fetch`:
because `fetch_and_add` returns the value *before* the increment, the
per-group counters can start at zero (or at the raw offsets) rather
than at -1, as sketched below.
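
The difference between the two atomic conventions can be shown with a
minimal standalone sketch; plain `std::atomic` stands in here for
Blender's `atomic_ops` functions, and the variable names are
illustrative:

```cpp
#include <atomic>
#include <cassert>

int main()
{
  /* Zero-initialized counter, exactly as `calloc` would leave it. */
  std::atomic<int> count{0};

  /* Fetch-and-add returns the value *before* the increment, so the
   * first element of a group lands in slot 0... */
  const int slot_a = count.fetch_add(1);
  /* ...and the second in slot 1. */
  const int slot_b = count.fetch_add(1);
  assert(slot_a == 0 && slot_b == 1);

  /* Add-and-fetch semantics (emulated as `fetch_add(1) + 1`) return
   * the value *after* the increment, so the counter would have to
   * start at -1 to yield slot 0 first, and -1 is not something
   * `calloc` or a plain offsets copy can provide. */
  return 0;
}
```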

The vertex-to-corner and edge-to-corner map creation is optimized
by this commit, though the approach should be useful elsewhere in
the future.
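
For context, a minimal serial sketch of the map being built here, with
`std::vector` standing in for Blender's `Array` and `OffsetIndices`,
and hypothetical example data:

```cpp
#include <cstdio>
#include <vector>

int main()
{
  /* Corner -> vertex map for two triangles sharing an edge:
   * (0, 1, 2) and (0, 2, 3). The vertex is the "group". */
  const std::vector<int> group_indices = {0, 1, 2, 0, 2, 3};

  /* Prefix sum of per-vertex corner counts (2, 1, 2, 1): vertex `v`
   * owns the result slots [offsets[v], offsets[v + 1]). */
  const std::vector<int> offsets = {0, 2, 3, 5, 6};

  /* The array in question: the next free slot in each group. */
  std::vector<int> counts(offsets.size() - 1, 0);
  std::vector<int> results(group_indices.size());
  for (int i = 0; i < int(group_indices.size()); i++) {
    const int group_index = group_indices[i];
    const int index_in_group = counts[group_index]++;
    results[offsets[group_index] + index_in_group] = i;
  }

  /* Prints "0 3 | 1 | 2 4 | 5 |": the corners using each vertex. */
  for (int v = 0; v < int(offsets.size()) - 1; v++) {
    for (int j = offsets[v]; j < offsets[v + 1]; j++) {
      std::printf("%d ", results[j]);
    }
    std::printf("| ");
  }
  std::printf("\n");
  return 0;
}
```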

|          | Before  | 1. offsets copy | 2. calloc       |
| -------- | ------- | --------------- | --------------- |
| Grid 1m  | 3.1 ms  | 1.9 ms (1.63x)  | 1.8 ms (1.72x)  |
| Grid 16m | 51.8 ms | 33.3 ms (1.55x) | 32.7 ms (1.58x) |

This commit implements the calloc solution, since it's slightly faster
and simpler. In the future, `Array` could do this optimization itself
when it detects that its fill value is just zero bytes.
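
A hypothetical sketch of that `Array` optimization, assuming a
trivially copyable element type (`allocate_filled` is an illustrative
name, not Blender API):

```cpp
#include <cstdlib>
#include <cstring>
#include <type_traits>

/* Allocate `size` elements initialized to `fill_value`. When the fill
 * value is all zero bytes, `calloc` can hand back pre-zeroed pages
 * from the OS instead of touching every element serially. */
template<typename T>
T *allocate_filled(const std::size_t size, const T &fill_value)
{
  static_assert(std::is_trivially_copyable_v<T>);
  const char zero_bytes[sizeof(T)] = {};
  if (std::memcmp(&fill_value, zero_bytes, sizeof(T)) == 0) {
    return static_cast<T *>(std::calloc(size, sizeof(T)));
  }
  T *data = static_cast<T *>(std::malloc(size * sizeof(T)));
  for (std::size_t i = 0; i < size; i++) {
    data[i] = fill_value;
  }
  return data;
}
```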

Pull Request: https://projects.blender.org/blender/blender/pulls/112065
Author: Hans Goudey
Date: 2023-09-08 16:18:38 +02:00
Committed by: Hans Goudey
Parent: be68db8ff9
Commit: 98e33adac2


```diff
@@ -328,12 +328,18 @@ static Array<int> reverse_indices_in_groups(const Span<int> group_indices,
   }
   BLI_assert(*std::max_element(group_indices.begin(), group_indices.end()) < offsets.size());
   BLI_assert(*std::min_element(group_indices.begin(), group_indices.end()) >= 0);
-  Array<int> counts(offsets.size(), -1);
+  /* `counts` keeps track of how many elements have been added to each group, and is incremented
+   * atomically by many threads in parallel. `calloc` can be measurably faster than a parallel fill
+   * of zero. Alternatively the offsets could be copied and incremented directly, but the cost of
+   * the copy is slightly higher than the cost of `calloc`. */
+  int *counts = MEM_cnew_array<int>(size_t(offsets.size()), __func__);
+  BLI_SCOPED_DEFER([&]() { MEM_freeN(counts); })
   Array<int> results(group_indices.size());
   threading::parallel_for(group_indices.index_range(), 1024, [&](const IndexRange range) {
     for (const int64_t i : range) {
       const int group_index = group_indices[i];
-      const int index_in_group = atomic_add_and_fetch_int32(&counts[group_index], 1);
+      const int index_in_group = atomic_fetch_and_add_int32(&counts[group_index], 1);
       results[offsets[group_index][index_in_group]] = int(i);
     }
   });
```