347ec1acd7 switched to iterating over nodes with index
masks but introduced nested lambdas with the inner one slicing
the index mask with a range provided by `parallel_for`. That was
done to avoid calling `.local()` more than necessary but in practice
that doesn't have an effect, especially with a grain size of 1.
That nested iteration is cumbersome and error-prone. For the
simple parallel loops with thread local data, switch to just using the
`GrainSize` argument to `mask.foreach_index` instead.
Pull Request: https://projects.blender.org/blender/blender/pulls/128221