Reduce the overhead of copying attribute data into GPU buffers when the
PBVH is active. The existing lambda with a FunctionRef callback had
significant overhead. While 25917f0165 already reduced that, even
turning the `foreach_faces` lambda into a template left significant
overhead compared to simpler loops. Instead, separate value conversion
from iteration over visible triangles, so that the compiler can
optimize the code more easily.
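
As a rough illustration of that separation (a minimal sketch, not the code
from this commit): `Float3`, `Short4`, `convert_value` and
`extract_visible_tris` are hypothetical names, and `std::vector` stands in
for Blender's own containers. Conversion is a small overloaded function, and
iteration is one plain loop that skips hidden triangles.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

/* Hypothetical stand-ins for the real vector types. */
struct Float3 { float x, y, z; };
struct Short4 { std::int16_t x, y, z, w; };

/* Value conversion: small overloads that know nothing about iteration. */
inline float convert_value(const float value) { return value; }
inline Short4 convert_value(const Float3 &n)
{
  /* Assumes normalized input in the [-1, 1] range. */
  auto to_short = [](const float v) { return static_cast<std::int16_t>(v * 32767.0f); };
  return {to_short(n.x), to_short(n.y), to_short(n.z), 0};
}

/* Iteration: a plain loop over triangles that writes one converted value per
 * corner of every visible triangle. Returns the number of values written. */
template<typename In, typename Out>
std::size_t extract_visible_tris(const std::vector<In> &vert_values,
                                 const std::vector<std::array<int, 3>> &tris,
                                 const std::vector<bool> &hide_tri,
                                 Out *dst)
{
  assert(hide_tri.size() == tris.size());
  Out *cursor = dst;
  for (std::size_t i = 0; i < tris.size(); i++) {
    if (hide_tri[i]) {
      continue;
    }
    for (const int vert : tris[i]) {
      *cursor++ = convert_value(vert_values[vert]);
    }
  }
  return static_cast<std::size_t>(cursor - dst);
}
```

With no callback in the loop body, the conversion call resolves at compile
time and is a straightforward candidate for inlining.
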

According to the GPU module, it's also better to use raw data access
than `GPU_vertbuf_raw_step`, since the data format strides aren't
meant to vary by platform, and the runtime stride can have a
noticeable performance impact.
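
The difference between the two access patterns can be sketched as follows.
This is an assumption-laden illustration, not the GPU module's API:
`RawStepCursor` and `raw_step` are hypothetical stand-ins for a stepping
accessor whose stride is only known at run time, and `Float3` is a
hypothetical vector type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

/* Hypothetical stand-in for the real vector type. */
struct Float3 { float x, y, z; };

/* Hypothetical stepping accessor: the stride is a run-time value, so every
 * step performs pointer arithmetic the compiler cannot fold away. */
struct RawStepCursor {
  unsigned char *data;
  std::size_t stride;
};

inline void *raw_step(RawStepCursor &cursor)
{
  unsigned char *current = cursor.data;
  cursor.data += cursor.stride;
  return current;
}

void fill_with_runtime_stride(unsigned char *buffer,
                              const std::size_t stride,
                              const std::vector<Float3> &src)
{
  assert(stride >= sizeof(Float3));
  RawStepCursor cursor{buffer, stride};
  for (const Float3 &value : src) {
    std::memcpy(raw_step(cursor), &value, sizeof(Float3));
  }
}

/* Raw access: the element type fixes the stride at compile time, so the loop
 * is a simple contiguous copy. */
void fill_with_typed_pointer(Float3 *dst, const std::vector<Float3> &src)
{
  for (std::size_t i = 0; i < src.size(); i++) {
    dst[i] = src[i];
  }
}
```

Both functions write the same bytes when the stride equals
`sizeof(Float3)`; only the second lets the compiler see that fact.
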

Also avoid recalculating face normals, because they are already
available: they have been used to calculate vertex normals since
ac02f94caf.

I tested the runtime of the initial data upload after entering sculpt
mode with a 16 million vertex mesh. Before, that took 1350 ms; after,
it took 680 ms, which is almost a 2x improvement. In my tests, the
performance improvement was only observable for the initial data
upload, though in theory the change applies more generally.

It's possible that a similar optimization could be applied to multires
or dynamic topology sculpting, but that can be looked at later too.

Pull Request: https://projects.blender.org/blender/blender/pulls/110621