This speeds up the node ~20% in common cases, e.g. when only the
X axis is used. The main optimization comes from not writing to memory
that's not used afterwards anymore anyway.
The "optimal code" for just extracting the x axis in a separate loop was
not faster for me. That indicates that the node is bottlenecked by
memory bandwidth, which seems reasonable.