This implements a basic render time pass,
using hardware counters to minimize the impact on render time.
x86-64 uses the RDTSC instruction (which reads the time stamp counter, TSC) for
timing, while ARM64 reads the cntvct_el0 register. In theory the TSC is not
always reliable (e.g. on old CPUs its rate was tied to the current clock rate),
but for somewhat recent CPU models it should be fine. If neither is available,
it falls back to `std::chrono::steady_clock`, which should still be very fast.
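
The selection between the three sources could look roughly like the sketch
below. This is only an illustration of the approach described above, not the
code from the patch: the name `read_ticks` is hypothetical, and the exact
preprocessor checks are an assumption.

```cpp
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) || defined(_M_X64)
#  if defined(_MSC_VER)
#    include <intrin.h>
#  else
#    include <x86intrin.h>
#  endif
#endif

/* Hypothetical helper: returns a raw tick count from the cheapest
 * available source. The unit differs per platform. */
inline uint64_t read_ticks()
{
#if defined(__x86_64__) || defined(_M_X64)
  /* x86-64: read the time stamp counter via RDTSC. */
  return __rdtsc();
#elif defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
  /* ARM64: read the virtual counter register. */
  uint64_t value;
  asm volatile("mrs %0, cntvct_el0" : "=r"(value));
  return value;
#else
  /* Fallback: steady_clock ticks since its epoch. */
  return uint64_t(std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}
```

The returned values are raw ticks, so only the difference between two reads is
meaningful, and it still has to be scaled by the counter frequency to obtain
milliseconds.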
The output is in milliseconds of CPU-time per pixel.
Pull Request: https://projects.blender.org/blender/blender/pulls/125933