Our recent post about the merits of accurate garbage collection over reference counting prompted Tezka to ask for measurements demonstrating the performance differences between an accurate GC and the reference counted shared_ptr smart pointers provided by the Boost C++ library.
The benchmarks we have been using to prototype allocators for our HLVM project provide the perfect setting for such an experiment. We already have C++ code that solves the n-queens problem using logic programming with a variety of different allocation strategies. Adding another version of our benchmark using Boost's shared_ptr gave the following remarkable results:
These new results highlight just how slow reference counting can be. The C++ code using Boost's reference counted smart pointers is running up to 10× slower than OCaml!
Note also the new "stack" section that is the first C++ to beat OCaml, albeit cheating by exploiting the fact that this implementation of this benchmark always happens to allocate and free in FIFO order. So this approach is problem-specific and cannot be used as a general-purpose allocator. However, it is worth noting that the only allocation strategy to beat C++ also (like OCaml's nursery) exploits the ability to collect many freed values in constant time by resetting the stack pointer rather than using repeated pops. Therefore, the only allocation strategies likely to approach OCaml's performance on these kinds of problems are those that exploit sparsity. For example, by manipulating free lists in run-length encoded form.
We have yet to find a general-purpose allocation strategy that can match the performance of OCaml's generational garbage collector for these kinds of applications but we can make two interesting observations from these new results:
Reference counting is the only difference between the "free" results (using malloc and free in FIFO order) and these new "shared_ptr" results, so around 80% of the time is spent handling reference counts, far more than is spent in actual allocation and deallocation!
HLVM currently performs poorly on this benchmark because around 70% of its time is spent maintaining the shadow stack, a comparable overhead to the use of reference counting (but one that handles the collection of cycles correctly).
Finally, we should note that our "region 12" results are also new and note that they are significantly worse than the previous "region" results. The difference is that the previous results allocated and deallocated in FIFO order explicitly in the inner loops of the benchmark whereas these new results are a more accurate reflection of the allocation and deallocate patterns that the current version of HLVM generates, specifically the inner loops only maintain the mark bits and deallocation is now performed by sweeps triggered by sufficient allocation. Sweeping results in much worse performance (from 20% slower than OCaml to 2× slower than OCaml). We have invented two ways to attack this problem. Firstly, the thread-local current region could be swept locally, either using a remembered set or just back to the previous write barrier. Secondly, runs of consecutively-allocated unreachable values are likely to appear near the end of the allocated list, so sparse free lists will hopefully accelerate the collection of contiguous sequences of unreachable values.