Use a grain size for the final tree creation/balancing/lookup that
depends on the average size of each tree. When the trees are larger,
fewer trees are processed on each thread and vice versa. I didn't notice
a difference when there are hundreds of thousands of groups, but
when there are few (i.e. around the number of cores), I noticed a 6x
performance improvement, from over 1 second to around 0.2 s.
Note that generally the performance is better with many small groups,
because the creation and balancing of trees is single threaded.