- Re-arrange locks so that no actual memory allocation
  (which is relatively slow) happens from inside
  the lock. The operating system will take care of any
  locking it needs there on its own.
- Use a spin lock instead of a mutex: only list
  operations happen inside the lock, so a mutex is
  unnecessary here.
- Use atomic operations for the memory-in-use and
  total-used-blocks counters.
This makes the guarded allocator almost the same speed
as the non-guarded one on files from the Tube project.
There is still the MemHead/MemTail overhead, which might
be bad for CPU cache utilization.
TODO: We need a smarter 32/64-bit compile-time check;
currently I'm afraid only the x86 CPU family is
detected reliably.
This replaces the following code (pseudo-code):
spin_lock();
update_child_dag_nodes();
schedule_new_nodes();
spin_unlock();
with:
update_child_dag_nodes_with_atomic_ops();
schedule_new_nodes();
The reason for this is that scheduling new nodes implies
a mutex lock, and spinning around a mutex is a bad idea.
An alternative would have been to use a spinlock around
the child-node update only, but that would imply either
a per-node spinlock or collecting nodes that are ready
for scheduling into an array.
Neither alternative seemed appealing; using atomic
operations makes the code much easier to follow and
keeps the data flow on the CPU nice.
The same atomic ops might be used in other
performance-critical areas later.
The atomic ops implementation is taken from the
jemalloc project.