New code is actually much, much better than first version, using 'fetch_and_add' atomic op here allows us to get rid of the loop etc. The broken CAS issue remains on windows, to be investigated...