Mesh: Further optimize topology map creation

We need a separate array that we can change during the parallel
group construction. That array tracks the position in each group
where the next index is added. Building this array is expensive,
since constructing a new `Array` fills its elements serially. There
are two possible solutions:

1. Use a copy of the offsets to increment result indices directly
2. Rely on OS-optimized `calloc` instead of `malloc` and a copy/fill

Both depend on using `fetch_and_add` instead of `add_and_fetch`:
because `fetch_and_add` returns the value *before* the increment, the
per-group counters can start at zero (or at the raw offsets) rather
than at -1, as sketched below.
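
The difference between the two atomic conventions can be shown with a
minimal standalone sketch; plain `std::atomic` stands in here for
Blender's `atomic_ops` functions, and the variable names are
illustrative:

```cpp
#include <atomic>
#include <cassert>

int main()
{
  /* Zero-initialized counter, exactly as `calloc` would leave it. */
  std::atomic<int> count{0};

  /* Fetch-and-add returns the value *before* the increment, so the
   * first element of a group lands in slot 0... */
  const int slot_a = count.fetch_add(1);
  /* ...and the second in slot 1. */
  const int slot_b = count.fetch_add(1);
  assert(slot_a == 0 && slot_b == 1);

  /* Add-and-fetch semantics (emulated as `fetch_add(1) + 1`) return
   * the value *after* the increment, so the counter would have to
   * start at -1 to yield slot 0 first, and -1 is not something
   * `calloc` or a plain offsets copy can provide. */
  return 0;
}
```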

The vertex-to-corner and edge-to-corner map creation is optimized
by this commit, though the approach should be useful elsewhere in
the future.
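
For context, a minimal serial sketch of the map being built here, with
`std::vector` standing in for Blender's `Array` and `OffsetIndices`,
and hypothetical example data:

```cpp
#include <cstdio>
#include <vector>

int main()
{
  /* Corner -> vertex map for two triangles sharing an edge:
   * (0, 1, 2) and (0, 2, 3). The vertex is the "group". */
  const std::vector<int> group_indices = {0, 1, 2, 0, 2, 3};

  /* Prefix sum of per-vertex corner counts (2, 1, 2, 1): vertex `v`
   * owns the result slots [offsets[v], offsets[v + 1]). */
  const std::vector<int> offsets = {0, 2, 3, 5, 6};

  /* The array in question: the next free slot in each group. */
  std::vector<int> counts(offsets.size() - 1, 0);
  std::vector<int> results(group_indices.size());
  for (int i = 0; i < int(group_indices.size()); i++) {
    const int group_index = group_indices[i];
    const int index_in_group = counts[group_index]++;
    results[offsets[group_index] + index_in_group] = i;
  }

  /* Prints "0 3 | 1 | 2 4 | 5 |": the corners using each vertex. */
  for (int v = 0; v < int(offsets.size()) - 1; v++) {
    for (int j = offsets[v]; j < offsets[v + 1]; j++) {
      std::printf("%d ", results[j]);
    }
    std::printf("| ");
  }
  std::printf("\n");
  return 0;
}
```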

|          | Before  | 1. offsets copy | 2. calloc       |
| -------- | ------- | --------------- | --------------- |
| Grid 1m  | 3.1 ms  | 1.9 ms (1.63x)  | 1.8 ms (1.72x)  |
| Grid 16m | 51.8 ms | 33.3 ms (1.55x) | 32.7 ms (1.58x) |

This commit implements the calloc solution, since it's slightly faster
and simpler. In the future, `Array` could do this optimization itself
when it detects that its fill value is just zero bytes.
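
A hypothetical sketch of that `Array` optimization, assuming a
trivially copyable element type (`allocate_filled` is an illustrative
name, not Blender API):

```cpp
#include <cstdlib>
#include <cstring>
#include <type_traits>

/* Allocate `size` elements initialized to `fill_value`. When the fill
 * value is all zero bytes, `calloc` can hand back pre-zeroed pages
 * from the OS instead of touching every element serially. */
template<typename T>
T *allocate_filled(const std::size_t size, const T &fill_value)
{
  static_assert(std::is_trivially_copyable_v<T>);
  const char zero_bytes[sizeof(T)] = {};
  if (std::memcmp(&fill_value, zero_bytes, sizeof(T)) == 0) {
    return static_cast<T *>(std::calloc(size, sizeof(T)));
  }
  T *data = static_cast<T *>(std::malloc(size * sizeof(T)));
  for (std::size_t i = 0; i < size; i++) {
    data[i] = fill_value;
  }
  return data;
}
```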

Pull Request: https://projects.blender.org/blender/blender/pulls/112065
Author: Hans Goudey
Date: 2023-09-08 16:18:38 +02:00
Committed by: Hans Goudey
Parent: be68db8ff9
Commit: 98e33adac2


```diff
@@ -328,12 +328,18 @@ static Array<int> reverse_indices_in_groups(const Span<int> group_indices,
   }
   BLI_assert(*std::max_element(group_indices.begin(), group_indices.end()) < offsets.size());
   BLI_assert(*std::min_element(group_indices.begin(), group_indices.end()) >= 0);
-  Array<int> counts(offsets.size(), -1);
+  /* `counts` keeps track of how many elements have been added to each group, and is incremented
+   * atomically by many threads in parallel. `calloc` can be measurably faster than a parallel fill
+   * of zero. Alternatively the offsets could be copied and incremented directly, but the cost of
+   * the copy is slightly higher than the cost of `calloc`. */
+  int *counts = MEM_cnew_array<int>(size_t(offsets.size()), __func__);
+  BLI_SCOPED_DEFER([&]() { MEM_freeN(counts); })
   Array<int> results(group_indices.size());
   threading::parallel_for(group_indices.index_range(), 1024, [&](const IndexRange range) {
     for (const int64_t i : range) {
       const int group_index = group_indices[i];
-      const int index_in_group = atomic_add_and_fetch_int32(&counts[group_index], 1);
+      const int index_in_group = atomic_fetch_and_add_int32(&counts[group_index], 1);
       results[offsets[group_index][index_in_group]] = int(i);
     }
   });
```