Together with the extended loop callback and userdata_chunk, this allows to perform
cumulative tasks (like aggregation) in a lockfree way using local userdata_chunk to store temp data,
and once all workers have finished, to merge those userdata_chunks in the finalize callback
(from calling thread, so no need to lock here either).
Note that this changes how userdata_chunk is handled (now fully from 'main' thread,
which means a given worker thread will always get the same userdata_chunk, without
being re-initialized anymore to init value at start of each iter chunk).