Our own implementation in fact performs the same as the one in fast_math from OpenShadingLanguage. The fast_math version, however, uses an explicit madd function, which increases the chance that the compiler will decide to use intrinsics.
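To illustrate the idea of an explicit madd function, here is a minimal sketch and not the actual fast_math code. The helper `madd` and the functions `poly3_madd` and `poly3_fma` are hypothetical names written for this example. It assumes the approximation is a polynomial evaluated with Horner's scheme, where each step is one multiply followed by one add.

```cpp
#include <cmath>

// Hypothetical helper for illustration; not copied from any library.
// Keeping the multiply and the add in a single expression leaves the
// compiler free to contract them into one fused multiply-add when the
// target and the compiler flags allow it.
inline float madd(float a, float b, float c) {
    return a * b + c;
}

// Horner evaluation of c0 + c1*x + c2*x^2 + c3*x^3.
// Every step has the shape a * b + c, so every step goes through madd.
inline float poly3_madd(float x, float c0, float c1, float c2, float c3) {
    return madd(x, madd(x, madd(x, c3, c2), c1), c0);
}

// The same polynomial with std::fma, which always computes a * b + c
// with a single rounding. Without hardware support it may fall back to
// a slower software routine.
inline float poly3_fma(float x, float c0, float c1, float c2, float c3) {
    return std::fma(x, std::fma(x, std::fma(x, c3, c2), c1), c0);
}
```

Whether `madd` actually compiles to a fused instruction depends on the compiler, the target architecture, and options such as floating-point contraction settings, so inspecting the generated assembly is the only reliable way to confirm it.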