Use `std::partition` instead of a hand-written equivalent. This
is much easier to understand, and it is also faster and requires
less memory during the build.
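
As a rough illustration of the idea (a minimal sketch, not the actual
Blender code): the names `partition_faces`, `face_indices`, `centroids`,
`axis` and `split` are hypothetical, and the split criterion (centroid
below a split position along one axis) is an assumption. The point is
only that one `std::partition` call reorders the indices in place and
returns the boundary between the two children.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

/* Hypothetical helper: reorder `face_indices` in place so that every face
 * whose centroid lies below `split` along `axis` comes first.
 * Returns the number of faces that ended up in the first (left) group. */
std::size_t partition_faces(std::vector<int> &face_indices,
                            const std::vector<std::array<float, 3>> &centroids,
                            const int axis,
                            const float split)
{
  /* std::partition works in place, so no temporary index array is needed. */
  const auto middle = std::partition(
      face_indices.begin(), face_indices.end(), [&](const int face) {
        return centroids[face][axis] < split;
      });
  return static_cast<std::size_t>(middle - face_indices.begin());
}
```

Note that `std::partition` does not preserve the relative order of the
elements within each group, which should not matter when the result is
only used to assign faces to child nodes.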
Building a 16 million face BVH went from 492 ms to 389 ms on a
Ryzen 9 7950X, about a 1.27x improvement.
This `std::partition` call is not multithreaded. I expect that
multithreading this step would give some further improvement, at
least for the first few splits.
Currently this only applies to Mesh sculpting.
Pull Request: https://projects.blender.org/blender/blender/pulls/127332