Both, the guarded and lockfree allocator, are keeping track of current
and peak memory usage. Even the lockfree allocator used to use a
global atomic variable for the memory usage. When multiple threads
use the allocator at the same time, this variable is highly contended.
This can result in significant slowdowns as presented in D16862.
While specific cases could always be optimized by reducing the number
of allocations, having this synchronization point in functions used by
almost every part of Blender is not great.
The solution is use thread-local memory counters which are only added
together when the memory usage is actually requested. For more details
see in-code comments and D16862.
Differential Revision: https://developer.blender.org/D16862