cpuProcessorApply_predivide was doing, for each pixel:
- Un-premultiply pixel to straight alpha
- Call OCIO processor on that one pixel
- Premultiply pixel back
This is not great due to just function call overhead, and probably
prevents whatever "batch processing SIMD optimizations" that OCIO
migth have.
Instead, do this:
- Un-premultiply whole input image,
- Call OCIO on the whole image to do whatever it does,
- Premultiply whole image back.
Doing cpuProcessorApply_predivide on a 4K resolution, float4 image
on Ryzen 5950X (Win10/VS2022) on one thread: 128ms -> 69ms
Pull Request: https://projects.blender.org/blender/blender/pulls/127307