There is some sort of problem with the SSE2 code path, but I couldn't find the cause. It may be a compiler bug triggered by the large amount of inlining. For now I've disabled SSE2 optimizations in 32-bit GCC builds.
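
A guard like the one described could look roughly like the sketch below. This is not the actual change: `USE_SSE2_PATH` and `sse2_path_enabled` are hypothetical names, and excluding Clang (which also defines `__GNUC__`) is an assumption on my part. It relies only on the compiler-predefined macros `__SSE2__`, `__GNUC__`, `__clang__` and `__i386__`.

```c
/* Sketch only: USE_SSE2_PATH is a hypothetical switch, not the project's real one.
 * __SSE2__ is predefined by GCC and Clang when SSE2 code generation is enabled.
 * __i386__ is predefined by GCC when targeting 32-bit x86. */
#if defined(__SSE2__) && \
    !(defined(__GNUC__) && !defined(__clang__) && defined(__i386__))
#define USE_SSE2_PATH 1   /* SSE2 available and not a 32-bit GCC build */
#else
#define USE_SSE2_PATH 0   /* fall back to the plain C code path */
#endif

/* Returns 1 if the SSE2 code path would be compiled in, 0 otherwise. */
static int sse2_path_enabled(void)
{
    return USE_SSE2_PATH;
}
```

With a guard of this shape, 64-bit GCC builds keep the SSE2 path, and the 32-bit exclusion can be dropped in one place once the cause is found.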