This commit significantly speeds up many of the attribute nodes when
multiple threads are available in linear situations when parallelism
cannot be achieved elsewhere.
See the differential for a table of timing comparisons tested on a
Ryzen 3700x. For an attribute with 4 million elements, the nodes were
about 3 to 9 times faster.
The changes are not exhaustive, other nodes could still be parallelized
in the future. Also, it would be possible to further optimize the grain
size in `parallel_for`, but I'd rather make sure it isn't too small.
I tested some different values, but also relied on intuition--
increasing grain size for less complex operations and vice versa.
Differential Revision: https://developer.blender.org/D11139