To make porting to other architectures easier, this clarifies that the unused parallel_reduce implementation does not need to be supported. That implementation assumed a warp size of 32, but it would be easy to update if we ever need it in the future.