This extracts the inner loops into a separate function.
There are two main reasons for this:
* It allows using `__restrict` to indicate that no other parameter
aliases the output array. The compiler can then optimize the loops more
aggressively, for example by vectorizing them without runtime overlap checks.
* It makes the generated assembly code easier to find,
especially in combination with `BLI_NOINLINE`.
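
As a rough illustration of both points, here is a minimal sketch of the pattern. The names `EXAMPLE_NOINLINE`, `add_arrays_inner` and `add_arrays` are hypothetical and exist only for this example; `EXAMPLE_NOINLINE` is an assumed stand-in for `BLI_NOINLINE`, whose actual definition is not shown here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for `BLI_NOINLINE` (assumption: the real macro
// expands to a similar compiler-specific "do not inline" attribute).
#if defined(_MSC_VER)
#  define EXAMPLE_NOINLINE __declspec(noinline)
#else
#  define EXAMPLE_NOINLINE __attribute__((noinline))
#endif

// The inner loop lives in its own function. `__restrict` on `dst` promises
// the compiler that neither `a` nor `b` overlaps the output array, so it
// does not have to reload the inputs after every store.
EXAMPLE_NOINLINE static void add_arrays_inner(const float *a,
                                              const float *b,
                                              float *__restrict dst,
                                              const std::size_t size)
{
  for (std::size_t i = 0; i < size; i++) {
    dst[i] = a[i] + b[i];
  }
}

// Hypothetical caller: does the setup, then hands the hot loop to the
// extracted function. The caller is responsible for passing a `dst`
// buffer that does not overlap the inputs.
static std::vector<float> add_arrays(const std::vector<float> &a,
                                     const std::vector<float> &b)
{
  const std::size_t size = a.size() < b.size() ? a.size() : b.size();
  std::vector<float> dst(size);
  add_arrays_inner(a.data(), b.data(), dst.data(), size);
  return dst;
}
```

Because the inner function is not inlined, it should normally appear under its own (mangled) symbol name in the disassembly, which makes the loop easy to locate. Note that passing overlapping buffers to a `__restrict` parameter is undefined behavior, so the promise has to hold at every call site.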