Previous fix didn't quite work well. For some reason everything worked fine when
using native nvcc in 32bit environment, but cross-compiling from 64bit platform
it was still running out of memory.
For now just made it so all the kernels are slower on 32bit CUDA as a temporary
solution. Either it'll be solved in next CUDA releases (by dropped 32bit? =\) or
we'll find better workaround.