Previously, `tbb::parallel_for` was instantiated every time `threading::parallel_for` is used. However, when actual parallelism is used, the overhead of a function call is negilible. Therefor it is possible to move that part out of the header without causing noticable performance regressions. This reduces the size of the Blender binary from 308.2 to 303.5 MB, which is a reduction of about 1.5%.