Apparently, some routines in armature deformation code were using static arrays. This is probably just an optimization thing, but it's very bad for threading. Now made it so bbone matrices array is allocating in callee function stack. This required exposing MAX_BBONE_SUBDIV to an external API, This is not so much crappy from code side, and it shall be the same fast as before.