Is not visible on any of the officially platforms, as everywhere
SSE2 is available (on Apple Silicon via sse2neon).
Only got noticed by some intermittent issue during development
which made BLI_HAVE_SSE2 unaccessible.
Seems that transpose was done a bit wrong. Not sure if worth trying
to fold the equation into C++ types, as that requires extra memory
transfers for transpose. Opted for a more naive folding, which
avoids extra copies.
Added a regression test for it, verified against numpy, the BLI
SSE2 implementation.