This time, the code of all the tools themselves.
Not much to say, except that we can also get rid of that ugly OMP caching pre-process
for multires smoothing.
Together with the previous commit, we get about 5% average speedup on stroke execution
(though this varies a lot: up to 30% speedup in rare cases, and in even rarer cases some
odd massive slowdowns...).
Tech note: we may want to add a feature similar to OMP's 'guided' schedule to our BLI_task
threaded loop. I suspect its absence could explain the random massive slowdowns of the new
code (very rare, but annoying...).
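
For reference, a minimal sketch of what 'guided'-like chunking means, assuming the usual
OMP behavior: each chunk is the remaining iteration count divided by the number of threads,
never smaller than a minimum chunk size. The names below (guided_chunk_size, guided_split)
are hypothetical and are not part of BLI_task; the split is simulated in a single thread,
while a real threaded loop would hand out chunks through an atomic counter or a lock.

```c
#include <stddef.h>

/* Hypothetical helper, NOT an existing BLI_task function.
 * Returns the size of the next chunk for a 'guided'-style schedule:
 * remaining / num_threads, clamped to [min_chunk, remaining]. */
static size_t guided_chunk_size(size_t remaining, size_t num_threads, size_t min_chunk)
{
    size_t chunk;

    if (remaining == 0) {
        return 0;
    }
    /* Guard against degenerate parameters. */
    if (num_threads == 0) {
        num_threads = 1;
    }
    if (min_chunk == 0) {
        min_chunk = 1;
    }

    chunk = remaining / num_threads;
    if (chunk < min_chunk) {
        chunk = min_chunk;
    }
    if (chunk > remaining) {
        chunk = remaining;
    }
    return chunk;
}

/* Single-threaded simulation of how a range of 'total' iterations would be
 * cut into chunks. Stores at most 'max_chunks' sizes in 'chunk_sizes' and
 * returns the number of chunks stored. */
static size_t guided_split(size_t total,
                           size_t num_threads,
                           size_t min_chunk,
                           size_t *chunk_sizes,
                           size_t max_chunks)
{
    size_t remaining = total;
    size_t n = 0;

    while (remaining > 0 && n < max_chunks) {
        const size_t chunk = guided_chunk_size(remaining, num_threads, min_chunk);
        chunk_sizes[n++] = chunk;
        remaining -= chunk;
    }
    return n;
}
```

With 100 iterations, 4 threads and a minimum chunk of 4, this gives chunks of
25, 18, 14, 10, 8, 6, 4, 4, 4, 4 and 3 iterations. The large early chunks keep scheduling
overhead low, and the small late ones let threads that finish early pick up the leftovers,
which is the load-balancing effect the tech note above is after.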