Instead of storing a 24 byte struct for every face corner we must do
calculations for, just gather the face corner index in the first single
threaded loop. And don't fill them in chunks and use the task pool API.
Instead just fill vectors and use standard "parallel_for" threading.
The check that avoided threading for tiny meshes becomes redundant this
way too, which simplifies the code more. Overall this removes over
100 lines of code.
On a Ryzen 7950x, face corner ("split"/"loop") normal calculation in a
small armature modifier setup with custom normals went from 4.1 ms to
2.8 ms. Calculation for a 12 million face mesh generated with curve to
mesh with end caps went from 776 ms to 568 ms.
Similar commits:
- 9e9ebcdd72
- 9338ab1d62