| <HTML> |
| <HEAD> |
| <TITLE>Garbage collector scalability</TITLE> |
| </HEAD> |
| <BODY> |
| <H1>Garbage collector scalability</h1> |
| In its default configuration, the Boehm-Demers-Weiser garbage collector |
| is not thread-safe. It can be made thread-safe for a number of environments |
| by building the collector with the appropriate |
<TT>-D</tt><I>XXX</i><TT>_THREADS</tt> compilation
| flag. This has primarily two effects: |
| <OL> |
| <LI> It causes the garbage collector to stop all other threads when |
| it needs to see a consistent memory state. |
| <LI> It causes the collector to acquire a lock around essentially all |
| allocation and garbage collection activity. |
| </ol> |
Since a single lock is used for all allocation-related activity, only one
thread can be allocating or collecting at any given time. This inherently
| limits performance of multi-threaded applications on multiprocessors. |
| <P> |
| On most platforms, the allocator/collector lock is implemented as a |
| spin lock with exponential back-off. Longer wait times are implemented |
| by yielding and/or sleeping. If a collection is in progress, the pure |
spinning stage is skipped. This has the advantage that uncontested lock
acquisitions, and thus most lock acquisitions on a uniprocessor, are very
cheap. It has the
| disadvantage that the application may sleep for small periods of time |
| even when there is work to be done. And threads may be unnecessarily |
| woken up for short periods. Nonetheless, this scheme empirically |
| outperforms native queue-based mutual exclusion implementations in most |
| cases, sometimes drastically so. |
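<P>
The following is a minimal sketch of this spin-then-yield scheme, assuming
GCC's <TT>__sync</tt> atomic builtins; the names, spin count, and back-off
policy are invented for the example and differ from the collector's actual
implementation.
<PRE>
#include &lt;sched.h&gt;
#include &lt;unistd.h&gt;

static volatile int alloc_lock = 0;             /* 0 = free, 1 = held      */
static volatile int collection_in_progress = 0;

static int try_lock(void)
{
    /* Atomic test-and-set; a nonzero result means we now hold the lock. */
    return __sync_lock_test_and_set(&amp;alloc_lock, 1) == 0;
}

static void acquire_alloc_lock(void)
{
    int attempts;
    long pause = 1;

    /* Pure spinning only pays off when the holder finishes quickly, so   */
    /* it is skipped while a (long) collection is in progress.            */
    if (!collection_in_progress) {
        for (attempts = 10; attempts > 0; --attempts) {
            long i;
            if (try_lock()) return;
            for (i = pause; i > 0; --i) { /* busy-wait */ }
            pause *= 2;                   /* exponential back-off          */
        }
    }
    /* Longer waits: yield the processor, then sleep briefly. */
    while (!try_lock()) {
        sched_yield();
        usleep(1000);
    }
}

static void release_alloc_lock(void)
{
    __sync_lock_release(&amp;alloc_lock);
}
</pre>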
| <H2>Options for enhanced scalability</h2> |
| Version 6.0 of the collector adds two facilities to enhance collector |
| scalability on multiprocessors. As of 6.0alpha1, these are supported |
| only under Linux on X86 and IA64 processors, though ports to other |
| otherwise supported Pthreads platforms should be straightforward. |
| They are intended to be used together. |
| <UL> |
| <LI> |
| Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to |
| run the mark phase in parallel in multiple threads, and thus on multiple |
| processors. The mark phase typically consumes the large majority of the |
| collection time. Thus this largely parallelizes the garbage collector |
| itself, though not the allocation process. Currently the marking is |
| performed by the thread that triggered the collection, together with |
| <I>N</i>-1 dedicated |
| threads, where <I>N</i> is the number of processors detected by the collector. |
| The dedicated threads are created once at initialization time. |
| <P> |
| A second effect of this flag is to switch to a more concurrent |
| implementation of <TT>GC_malloc_many</tt>, so that free lists can be |
| built, and memory can be cleared, by more than one thread concurrently. |
| <LI> |
Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support for thread
| local allocation. It does not, by itself, cause thread local allocation |
| to be used. It simply allows the use of the interface in |
| <TT>gc_local_alloc.h</tt>. |
| <P> |
| Memory returned from thread-local allocators is completely interchangeable |
| with that returned by the standard allocators. It may be used by other |
| threads. The only difference is that, if the thread allocates enough |
| memory of a certain kind, it will build a thread-local free list for |
| objects of that kind, and allocate from that. This greatly reduces |
locking. The thread-local free lists are refilled using
<TT>GC_malloc_many</tt>, as illustrated in the sketch following this list.
| <P> |
An important side effect of this flag is to replace the default
spin-then-sleep lock with a spin-then-queue-based implementation.
| This <I>reduces performance</i> for the standard allocation functions, |
| though it usually improves performance when thread-local allocation is |
| used heavily, and thus the number of short-duration lock acquisitions |
| is greatly reduced. |
| </ul> |
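<P>
The following sketch shows how a client (or the thread-local allocator
itself) can use <TT>GC_malloc_many</tt> to refill a private free list with a
single lock acquisition. It assumes the <TT>GC_malloc_many</tt>/<TT>GC_NEXT</tt>
interface declared in <TT>gc.h</tt>; the free-list handling around it is
illustrative only, and in a multi-threaded client the list would have to be
thread-private.
<PRE>
#include "gc.h"

#define OBJ_SIZE 32              /* illustrative object size */

static void *my_free_list = 0;   /* thread-private in a real client */

static void *my_alloc(void)
{
    void *result;

    if (my_free_list == 0) {
        /* One lock acquisition refills the whole list; the objects are */
        /* linked through their first word.                             */
        my_free_list = GC_malloc_many(OBJ_SIZE);
        if (my_free_list == 0) return 0;   /* out of memory */
    }
    result = my_free_list;
    my_free_list = GC_NEXT(result);   /* unlink the head               */
    GC_NEXT(result) = 0;              /* clear the link word before use */
    return result;
}
</pre>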
| <P> |
| The easiest way to switch an application to thread-local allocation is to |
| <OL> |
| <LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>, |
| and then include the <TT>gc.h</tt> |
| header in each client source file. |
| <LI> Invoke <TT>GC_thr_init()</tt> before any allocation. |
| <LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>, |
| and/or <TT>GC_GCJ_MALLOC</tt>. |
| </ol> |
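<P>
For example, a client source file might look roughly like the following
(a minimal sketch, assuming a GC 6.x pthreads build; the allocation sizes
are arbitrary).
<PRE>
/* Step 1: GC_REDIRECT_TO_LOCAL must be visible before gc.h is included  */
/* in every client source file (e.g. -DGC_REDIRECT_TO_LOCAL).            */
#define GC_REDIRECT_TO_LOCAL
#include "gc.h"

int main(void)
{
    void *p;
    char *s;

    GC_thr_init();                 /* step 2: before any allocation       */

    p = GC_MALLOC(64);             /* step 3: may contain pointers        */
    s = GC_MALLOC_ATOMIC(128);     /* known to be pointer-free            */
    /* GC_GCJ_MALLOC would be used for objects with gcj mark descriptors. */

    return (p == 0 || s == 0);
}
</pre>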
| <H2>The Parallel Marking Algorithm</h2> |
| We use an algorithm similar to |
| <A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by |
| Endo, Taura, and Yonezawa</a> at the University of Tokyo. |
| However, the data structures and implementation are different, |
| and represent a smaller change to the original collector source, |
| probably at the expense of extreme scalability. Some of |
| the refinements they suggest, <I>e.g.</i> splitting large |
objects, were also incorporated into our approach.
| <P> |
| The global mark stack is transformed into a global work queue. |
| Unlike the usual case, it never shrinks during a mark phase. |
| The mark threads remove objects from the queue by copying them to a |
| local mark stack and changing the global descriptor to zero, indicating |
| that there is no more work to be done for this entry. |
| This removal |
| is done with no synchronization. Thus it is possible for more than |
| one worker to remove the same entry, resulting in some work duplication. |
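<P>
The following C fragment illustrates the idea; the entry layout and names
are invented for the example and do not correspond to the collector's actual
data structures.
<PRE>
struct mark_entry {
    void *base;            /* object (or range) to be scanned        */
    unsigned long descr;   /* mark descriptor; 0 means nothing to do */
};

/* Each marker thread scans the shared queue and claims entries */
/* without taking any lock.                                     */
static void drain_global_queue(struct mark_entry *global, int n,
                               struct mark_entry *local, int *local_top)
{
    int i;

    for (i = 0; i != n; ++i) {
        if (global[i].descr == 0) continue;   /* empty or already claimed */
        local[*local_top] = global[i];        /* copy to the local stack  */
        *local_top += 1;
        global[i].descr = 0;                  /* then zero the descriptor */
        /* Two markers can both copy the same entry before either zeroes  */
        /* it; the work is then merely duplicated, never lost.            */
    }
}
</pre>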
| <P> |
| The global work queue grows only if a marker thread decides to |
| return some of its local mark stack to the global one. This |
| is done if the global queue appears to be running low, or if |
| the local stack is in danger of overflowing. It does require |
| synchronization, but should be relatively rare. |
| <P> |
| The sequential marking code is reused to process local mark stacks. |
| Hence the amount of additional code required for parallel marking |
| is minimal. |
| <P> |
| It should be possible to use generational collection in the presence of the |
| parallel collector, by calling <TT>GC_enable_incremental()</tt>. |
| This does not result in fully incremental collection, since parallel mark |
| phases cannot currently be interrupted, and doing so may be too |
| expensive. |
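<P>
For example (a minimal sketch; <TT>GC_enable_incremental</tt> is declared in
<TT>gc.h</tt>, but dirty-bit support varies by platform):
<PRE>
#include "gc.h"

int main(void)
{
    void *p;

    GC_enable_incremental();   /* request generational collection early */
    p = GC_MALLOC(1024);
    return (p == 0);
}
</pre>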
| <P> |
| Gcj-style mark descriptors do not currently mix with the combination |
| of local allocation and incremental collection. They should work correctly |
| with one or the other, but not both. |
| <P> |
| The number of marker threads is set on startup to the number of |
| available processors (or to the value of the <TT>GC_NPROCS</tt> |
| environment variable). If only a single processor is detected, |
| parallel marking is disabled. |
| <P> |
Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions inside
| the collector to immediately yield the processor instead of busy waiting |
| first. In the case of a multiprocessor and a client with multiple |
| simultaneously runnable threads, this may have disastrous performance |
| consequences (e.g. a factor of 10 slowdown). |
| <H2>Performance</h2> |
| We conducted some simple experiments with a version of |
| <A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to |
| run multiple concurrent client threads in the same address space. |
| Each client thread does the same work as the original benchmark, but they share |
| a heap. |
| This benchmark involves very little work outside of memory allocation. |
| This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine |
| under Linux 2.2.12. |
| <P> |
| Running with a thread-unsafe collector, the benchmark ran in 9 |
| seconds. With the simple thread-safe collector, |
| built with <TT>-DLINUX_THREADS</tt>, the execution time |
| increased to 10.3 seconds, or 23.5 elapsed seconds with two clients. |
(The times for the <TT>malloc</tt>/<TT>free</tt> version
| with glibc <TT>malloc</tt> |
| are 10.51 (standard library, pthreads not linked), |
| 20.90 (one thread, pthreads linked), |
| and 24.55 seconds respectively. The benchmark favors a |
| garbage collector, since most objects are small.) |
| <P> |
| The following table gives execution times for the collector built |
| with parallel marking and thread-local allocation support |
| (<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested |
| the client using either one or two marker threads, and running |
| one or two client threads. Note that the client uses thread local |
allocation exclusively. With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
switches to a locking strategy that is better tuned to less frequent
lock acquisition. The standard allocation primitives thus perform
slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
avoided in time-critical code.
| <P> |
| (The results using <TT>pthread_mutex_lock</tt> |
| directly for allocation locking would have been worse still, at |
| least for older versions of linuxthreads. |
With <TT>THREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
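<P>
A rough sketch of that strategy follows; the attempt count and the yield
between attempts are illustrative, not the values the collector actually
uses.
<PRE>
#include &lt;pthread.h&gt;
#include &lt;sched.h&gt;

static void lock_for_allocation(pthread_mutex_t *lock)
{
    int attempts;

    for (attempts = 16; attempts > 0; --attempts) {
        if (pthread_mutex_trylock(lock) == 0) return;   /* got it cheaply */
        sched_yield();             /* brief back-off between attempts     */
    }
    pthread_mutex_lock(lock);      /* give up and block in the kernel     */
}
</pre>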
| <P> |
| These measurements do not use incremental collection, nor was prefetching |
| enabled in the marker. We used the C version of the benchmark. |
| All measurements are in elapsed seconds on an unloaded machine. |
| <P> |
| <TABLE BORDER ALIGN="CENTER"> |
<TR><TH>Number of client threads</th><TH>1 marker thread (secs.)</th>
| <TH>2 marker threads (secs.)</th></tr> |
| <TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td> |
| <TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td> |
| </table> |
<P>
The execution time for the single-threaded case is slightly worse than with
simple locking. However, even the single-threaded benchmark runs faster than
the thread-unsafe version if a second processor is available.
The execution time for two clients with thread-local allocation is
only 1.4 times the sequential execution time for a single thread in a
thread-unsafe environment, even though it involves twice the client work.
| That represents close to a |
| factor of 2 improvement over the 2 client case with the old collector. |
| The old collector clearly |
| still suffered from some contention overhead, in spite of the fact that the |
| locking scheme had been fairly well tuned. |
| <P> |
| Full linear speedup (i.e. the same execution time for 1 client on one |
| processor as 2 clients on 2 processors) |
| is probably not achievable on this kind of |
| hardware even with such a small number of processors, |
| since the memory system is |
| a major constraint for the garbage collector, |
| the processors usually share a single memory bus, and thus |
| the aggregate memory bandwidth does not increase in |
| proportion to the number of processors. |
| <P> |
| These results are likely to be very sensitive to both hardware and OS |
| issues. Preliminary experiments with an older Pentium Pro machine running |
| an older kernel were far less encouraging. |
| |
| </body> |
| </html> |