Each node's index field is supposed to equal its actual position in the heap, so there is no need to read it in swap. On a 64-bit processor i and j are already in registers, so this removes two memory reads per swap. In addition, cache the tree pointer, use branch hints, and put the most frequently accessed field, 'value', at offset 0. Together these changes produced a 20% FPS improvement on a 50% heap-heavy workload.
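A minimal C sketch of the ideas above, assuming a binary min-heap that stores pointers to nodes. Every name here (heap_node, heap, heap_swap, sift_up, sift_down, heap_push, heap_pop, LIKELY, UNLIKELY) is hypothetical and not taken from the actual change. "Cache the tree pointer" is read as loading h->nodes into a local once per sift, and the branch hints use the GCC/Clang extension __builtin_expect. Where the hints go is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Branch hints: __builtin_expect is a GCC/Clang extension; fall back to no-ops. */
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Hypothetical node layout: the hot 'value' field sits at offset 0. */
typedef struct heap_node {
    double value;  /* compared on every sift step */
    size_t index;  /* invariant: always equals this node's slot in the heap */
} heap_node;

typedef struct heap {
    heap_node **nodes;  /* caller-provided storage with enough capacity */
    size_t count;
} heap;

/* Swap slots i and j. Because node->index always equals its slot, we write
   i and j (already in registers) instead of reading the old index fields. */
static inline void heap_swap(heap_node **nodes, size_t i, size_t j) {
    heap_node *a = nodes[i];
    heap_node *b = nodes[j];
    nodes[i] = b;
    nodes[j] = a;
    b->index = i;
    a->index = j;
}

static void sift_up(heap *h, size_t i) {
    heap_node **nodes = h->nodes;  /* cached once instead of reloading h->nodes */
    while (LIKELY(i > 0)) {
        size_t parent = (i - 1) / 2;
        if (nodes[parent]->value <= nodes[i]->value)
            break;
        heap_swap(nodes, i, parent);
        i = parent;
    }
}

static void sift_down(heap *h, size_t i) {
    heap_node **nodes = h->nodes;  /* cached once instead of reloading h->nodes */
    size_t n = h->count;
    for (;;) {
        size_t left = 2 * i + 1;
        if (UNLIKELY(left >= n))
            break;
        size_t smallest = left;
        size_t right = left + 1;
        if (right < n && nodes[right]->value < nodes[left]->value)
            smallest = right;
        if (nodes[i]->value <= nodes[smallest]->value)
            break;
        heap_swap(nodes, i, smallest);
        i = smallest;
    }
}

static void heap_push(heap *h, heap_node *node) {
    size_t i = h->count++;
    h->nodes[i] = node;
    node->index = i;  /* establish the invariant on insertion */
    sift_up(h, i);
}

static heap_node *heap_pop(heap *h) {
    assert(h->count > 0);
    heap_node *top = h->nodes[0];
    h->count--;
    if (h->count > 0) {
        h->nodes[0] = h->nodes[h->count];  /* move the last node to the root */
        h->nodes[0]->index = 0;
        sift_down(h, 0);
    }
    return top;
}
```

Usage: give heap a caller-owned array of heap_node pointers, push nodes with heap_push, and heap_pop returns them in ascending order of value. Since every write path keeps node->index equal to the slot, a decrease-key can start sift_up from node->index directly.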