This can improve performance in some circumstances when loops are vectorized and/or unrolled. I noticed the benefit most clearly while working on D16970, where it gave a 10-20% speedup by hitting the non-vectorized fallback loop less often.
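
For readers unfamiliar with the pattern, here is a minimal sketch of the general shape of such a loop. It is not the code from D16970 or from this change. `SumBytes`, `SumResult` and `kBlock` are hypothetical names, and the block width of 8 is an assumption chosen only for illustration. The sketch shows a main loop over fixed-size blocks, which a compiler can unroll or vectorize, followed by a scalar fallback loop for the leftover elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical block width; real vector widths depend on the target.
constexpr std::size_t kBlock = 8;

struct SumResult {
  std::uint64_t sum;         // total of all processed bytes
  std::size_t tailElements;  // elements handled by the scalar fallback loop
};

// Hypothetical helper: sums `len` bytes starting at `data`.
SumResult SumBytes(const std::uint8_t* data, std::size_t len) {
  assert(data != nullptr || len == 0);

  std::uint64_t sum = 0;
  std::size_t i = 0;

  // Main loop: whole blocks of kBlock elements. The fixed trip count of the
  // inner loop is what makes it a candidate for unrolling or vectorization.
  for (; i + kBlock <= len; i += kBlock) {
    for (std::size_t j = 0; j < kBlock; ++j) {
      sum += data[i + j];
    }
  }

  // Fallback loop: whatever did not fill a whole block, one element at a time.
  const std::size_t tail = len - i;
  for (; i < len; ++i) {
    sum += data[i];
  }

  return {sum, tail};
}
```

With this layout, a length that is a multiple of `kBlock` never enters the fallback loop, while other lengths leave up to `kBlock - 1` elements to the slower scalar path.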