Detected when testing mr_elephant on an Intel HD520. When copying
the velocity buffer using the copy shader, the number of scheduled
workgroups could exceed the maximum supported by the device.
This PR fixes the issue by copying multiple vertices per thread when
the work size cannot cover all the pixels.
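
The approach can be sketched as below. This is a minimal illustration,
not the code from this PR: `CopyDispatch`, `plan_copy_dispatch` and all
parameter names are hypothetical, and the limit value in the usage note
is only an example of what a device might report (for instance through
Vulkan's `maxComputeWorkGroupCount`). When one item per thread would
need more workgroups than the device allows, each thread is assumed to
loop over several items instead.

```cpp
#include <cstdint>

// Hypothetical result type: how many workgroups to dispatch and how many
// items each thread has to copy so that all items are covered.
struct CopyDispatch {
  uint32_t group_count;
  uint32_t items_per_thread;
};

// Integer division rounding up. Uses 64 bit math to avoid overflow.
inline uint64_t divide_ceil(uint64_t a, uint64_t b)
{
  return (a + b - 1) / b;
}

// Plan a 1D dispatch for `item_count` items with `group_size` threads per
// workgroup, never exceeding `max_group_count` workgroups.
// Assumes group_size > 0 and max_group_count > 0.
inline CopyDispatch plan_copy_dispatch(uint32_t item_count,
                                       uint32_t group_size,
                                       uint32_t max_group_count)
{
  const uint64_t groups_needed = divide_ceil(item_count, group_size);
  if (groups_needed <= max_group_count) {
    // Common case: one item per thread fits within the device limit.
    return {uint32_t(groups_needed), 1u};
  }
  // Not enough threads available: let every thread copy several items.
  const uint64_t max_threads = uint64_t(max_group_count) * group_size;
  const uint64_t items_per_thread = divide_ceil(item_count, max_threads);
  const uint64_t group_count = divide_ceil(item_count, uint64_t(group_size) * items_per_thread);
  return {uint32_t(group_count), uint32_t(items_per_thread)};
}
```

With an assumed limit of 65535 workgroups and 64 threads per workgroup,
`plan_copy_dispatch(10000000, 64, 65535)` yields 52084 workgroups with
3 items per thread. The shader side would then need a matching loop and
a bounds check, since the last threads may run past `item_count`.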
Pull Request: https://projects.blender.org/blender/blender/pulls/120915