I still think utf8_buf can be 5 (4 bytes + '\0'), but even 6 may not be enough to what is coming next (NFC - precomposedStringWithCanonicalMapping)
incorporating ascii as a subset of utf8. I don't think we need to re-encode it.
U+0000 ~ U+00FF - latin1 set