| <HTML> |
| <HEAD> |
| <TITLE>Garbage collector scalability</TITLE> |
| </HEAD> |
| <BODY> |
| <H1>Garbage collector scalability</h1> |
| In its default configuration, the Boehm-Demers-Weiser garbage collector |
| is not thread-safe. It can be made thread-safe for a number of environments |
| by building the collector with the appropriate |
<TT>-D</tt><I>XXX</i><TT>_THREADS</tt> compilation
| flag. This has primarily two effects: |
| <OL> |
| <LI> It causes the garbage collector to stop all other threads when |
| it needs to see a consistent memory state. |
| <LI> It causes the collector to acquire a lock around essentially all |
| allocation and garbage collection activity. |
| </ol> |
Since a single lock is used for all allocation-related activity, only one
thread can be allocating or collecting at any given time. This inherently
| limits performance of multi-threaded applications on multiprocessors. |
| <P> |
| On most platforms, the allocator/collector lock is implemented as a |
| spin lock with exponential back-off. Longer wait times are implemented |
| by yielding and/or sleeping. If a collection is in progress, the pure |
spinning stage is skipped. This has the advantage that uncontested lock
acquisitions, and thus most lock acquisitions on a uniprocessor, are very
cheap. It has the
| disadvantage that the application may sleep for small periods of time |
| even when there is work to be done. And threads may be unnecessarily |
| woken up for short periods. Nonetheless, this scheme empirically |
| outperforms native queue-based mutual exclusion implementations in most |
| cases, sometimes drastically so. |
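<P>
The following is a minimal sketch of this spin-then-yield scheme, assuming
GCC's <TT>__sync</tt> atomic builtins; the names, spin count, and back-off
policy are invented for the example and differ from the collector's actual
implementation.
<PRE>
#include &lt;sched.h&gt;
#include &lt;unistd.h&gt;

static volatile int alloc_lock = 0;             /* 0 = free, 1 = held      */
static volatile int collection_in_progress = 0;

static int try_lock(void)
{
    /* Atomic test-and-set; a nonzero result means we now hold the lock. */
    return __sync_lock_test_and_set(&amp;alloc_lock, 1) == 0;
}

static void acquire_alloc_lock(void)
{
    int attempts;
    long pause = 1;

    /* Pure spinning only pays off when the holder finishes quickly, so   */
    /* it is skipped while a (long) collection is in progress.            */
    if (!collection_in_progress) {
        for (attempts = 10; attempts > 0; --attempts) {
            long i;
            if (try_lock()) return;
            for (i = pause; i > 0; --i) { /* busy-wait */ }
            pause *= 2;                   /* exponential back-off          */
        }
    }
    /* Longer waits: yield the processor, then sleep briefly. */
    while (!try_lock()) {
        sched_yield();
        usleep(1000);
    }
}

static void release_alloc_lock(void)
{
    __sync_lock_release(&amp;alloc_lock);
}
</pre>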
| <H2>Options for enhanced scalability</h2> |
| Version 6.0 of the collector adds two facilities to enhance collector |
| scalability on multiprocessors. As of 6.0alpha1, these are supported |
| only under Linux on X86 and IA64 processors, though ports to other |
| otherwise supported Pthreads platforms should be straightforward. |
| They are intended to be used together. |
| <UL> |
| <LI> |
| Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to |
| run the mark phase in parallel in multiple threads, and thus on multiple |
| processors. The mark phase typically consumes the large majority of the |
| collection time. Thus this largely parallelizes the garbage collector |
| itself, though not the allocation process. Currently the marking is |
| performed by the thread that triggered the collection, together with |
| <I>N</i>-1 dedicated |
| threads, where <I>N</i> is the number of processors detected by the collector. |
| The dedicated threads are created once at initialization time. |
| <P> |
| A second effect of this flag is to switch to a more concurrent |
| implementation of <TT>GC_malloc_many</tt>, so that free lists can be |
| built, and memory can be cleared, by more than one thread concurrently. |
| <LI> |
Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support for thread
| local allocation. It does not, by itself, cause thread local allocation |
| to be used. It simply allows the use of the interface in |
| <TT>gc_local_alloc.h</tt>. |
| <P> |
| Memory returned from thread-local allocators is completely interchangeable |
| with that returned by the standard allocators. It may be used by other |
| threads. The only difference is that, if the thread allocates enough |
| memory of a certain kind, it will build a thread-local free list for |
| objects of that kind, and allocate from that. This greatly reduces |
locking. The thread-local free lists are refilled using
<TT>GC_malloc_many</tt>, as illustrated in the sketch following this list.
| <P> |
An important side effect of this flag is to replace the default
spin-then-sleep lock with a spin-then-queue-based implementation.
| This <I>reduces performance</i> for the standard allocation functions, |
| though it usually improves performance when thread-local allocation is |
| used heavily, and thus the number of short-duration lock acquisitions |
| is greatly reduced. |
| </ul> |
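<P>
The following sketch shows how a client (or the thread-local allocator
itself) can use <TT>GC_malloc_many</tt> to refill a private free list with a
single lock acquisition. It assumes the <TT>GC_malloc_many</tt>/<TT>GC_NEXT</tt>
interface declared in <TT>gc.h</tt>; the free-list handling around it is
illustrative only, and in a multi-threaded client the list would have to be
thread-private.
<PRE>
#include "gc.h"

#define OBJ_SIZE 32              /* illustrative object size */

static void *my_free_list = 0;   /* thread-private in a real client */

static void *my_alloc(void)
{
    void *result;

    if (my_free_list == 0) {
        /* One lock acquisition refills the whole list; the objects are */
        /* linked through their first word.                             */
        my_free_list = GC_malloc_many(OBJ_SIZE);
        if (my_free_list == 0) return 0;   /* out of memory */
    }
    result = my_free_list;
    my_free_list = GC_NEXT(result);   /* unlink the head               */
    GC_NEXT(result) = 0;              /* clear the link word before use */
    return result;
}
</pre>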
| <P> |
| The easiest way to switch an application to thread-local allocation is to |
| <OL> |
| <LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>, |
| and then include the <TT>gc.h</tt> |
| header in each client source file. |
| <LI> Invoke <TT>GC_thr_init()</tt> before any allocation. |
| <LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>, |
| and/or <TT>GC_GCJ_MALLOC</tt>. |
| </ol> |
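<P>
For example, a client source file might look roughly like the following
(a minimal sketch, assuming a GC 6.x pthreads build; the allocation sizes
are arbitrary).
<PRE>
/* Step 1: GC_REDIRECT_TO_LOCAL must be visible before gc.h is included  */
/* in every client source file (e.g. -DGC_REDIRECT_TO_LOCAL).            */
#define GC_REDIRECT_TO_LOCAL
#include "gc.h"

int main(void)
{
    void *p;
    char *s;

    GC_thr_init();                 /* step 2: before any allocation       */

    p = GC_MALLOC(64);             /* step 3: may contain pointers        */
    s = GC_MALLOC_ATOMIC(128);     /* known to be pointer-free            */
    /* GC_GCJ_MALLOC would be used for objects with gcj mark descriptors. */

    return (p == 0 || s == 0);
}
</pre>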
| <H2>The Parallel Marking Algorithm</h2> |
| We use an algorithm similar to |
| <A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by |
| Endo, Taura, and Yonezawa</a> at the University of Tokyo. |
| However, the data structures and implementation are different, |
| and represent a smaller change to the original collector source, |
| probably at the expense of extreme scalability. Some of |
| the refinements they suggest, <I>e.g.</i> splitting large |
objects, were also incorporated into our approach.
| <P> |
| The global mark stack is transformed into a global work queue. |
| Unlike the usual case, it never shrinks during a mark phase. |
| The mark threads remove objects from the queue by copying them to a |
| local mark stack and changing the global descriptor to zero, indicating |
| that there is no more work to be done for this entry. |
| This removal |
| is done with no synchronization. Thus it is possible for more than |
| one worker to remove the same entry, resulting in some work duplication. |
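<P>
The following C fragment illustrates the idea; the entry layout and names
are invented for the example and do not correspond to the collector's actual
data structures.
<PRE>
struct mark_entry {
    void *base;            /* object (or range) to be scanned        */
    unsigned long descr;   /* mark descriptor; 0 means nothing to do */
};

/* Each marker thread scans the shared queue and claims entries */
/* without taking any lock.                                     */
static void drain_global_queue(struct mark_entry *global, int n,
                               struct mark_entry *local, int *local_top)
{
    int i;

    for (i = 0; i != n; ++i) {
        if (global[i].descr == 0) continue;   /* empty or already claimed */
        local[*local_top] = global[i];        /* copy to the local stack  */
        *local_top += 1;
        global[i].descr = 0;                  /* then zero the descriptor */
        /* Two markers can both copy the same entry before either zeroes  */
        /* it; the work is then merely duplicated, never lost.            */
    }
}
</pre>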
| <P> |
| The global work queue grows only if a marker thread decides to |
| return some of its local mark stack to the global one. This |
| is done if the global queue appears to be running low, or if |
| the local stack is in danger of overflowing. It does require |
| synchronization, but should be relatively rare. |
| <P> |
| The sequential marking code is reused to process local mark stacks. |
| Hence the amount of additional code required for parallel marking |
| is minimal. |
| <P> |
| It should be possible to use generational collection in the presence of the |
| parallel collector, by calling <TT>GC_enable_incremental()</tt>. |
| This does not result in fully incremental collection, since parallel mark |
| phases cannot currently be interrupted, and doing so may be too |
| expensive. |
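<P>
For example (a minimal sketch; <TT>GC_enable_incremental</tt> is declared in
<TT>gc.h</tt>, but dirty-bit support varies by platform):
<PRE>
#include "gc.h"

int main(void)
{
    void *p;

    GC_enable_incremental();   /* request generational collection early */
    p = GC_MALLOC(1024);
    return (p == 0);
}
</pre>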
| <P> |
| Gcj-style mark descriptors do not currently mix with the combination |
| of local allocation and incremental collection. They should work correctly |
| with one or the other, but not both. |
| <P> |
| The number of marker threads is set on startup to the number of |
| available processors (or to the value of the <TT>GC_NPROCS</tt> |
| environment variable). If only a single processor is detected, |
| parallel marking is disabled. |
| <P> |
Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions inside
| the collector to immediately yield the processor instead of busy waiting |
| first. In the case of a multiprocessor and a client with multiple |
| simultaneously runnable threads, this may have disastrous performance |
| consequences (e.g. a factor of 10 slowdown). |
| <H2>Performance</h2> |
| We conducted some simple experiments with a version of |
| <A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to |
| run multiple concurrent client threads in the same address space. |
| Each client thread does the same work as the original benchmark, but they share |
| a heap. |
| This benchmark involves very little work outside of memory allocation. |
| This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine |
| under Linux 2.2.12. |
| <P> |
| Running with a thread-unsafe collector, the benchmark ran in 9 |
| seconds. With the simple thread-safe collector, |
| built with <TT>-DLINUX_THREADS</tt>, the execution time |
| increased to 10.3 seconds, or 23.5 elapsed seconds with two clients. |
(The times for the <TT>malloc</tt>/<TT>free</tt> version
| with glibc <TT>malloc</tt> |
| are 10.51 (standard library, pthreads not linked), |
| 20.90 (one thread, pthreads linked), |
| and 24.55 seconds respectively. The benchmark favors a |
| garbage collector, since most objects are small.) |
| <P> |
| The following table gives execution times for the collector built |
| with parallel marking and thread-local allocation support |
| (<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested |
| the client using either one or two marker threads, and running |
| one or two client threads. Note that the client uses thread local |
allocation exclusively. With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
switches to a locking strategy that is better tuned to less frequent
lock acquisition. The standard allocation primitives thus perform
slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
avoided in time-critical code.
| <P> |
| (The results using <TT>pthread_mutex_lock</tt> |
| directly for allocation locking would have been worse still, at |
| least for older versions of linuxthreads. |
With <TT>THREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
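<P>
A rough sketch of that strategy follows; the attempt count and the yield
between attempts are illustrative, not the values the collector actually
uses.
<PRE>
#include &lt;pthread.h&gt;
#include &lt;sched.h&gt;

static void lock_for_allocation(pthread_mutex_t *lock)
{
    int attempts;

    for (attempts = 16; attempts > 0; --attempts) {
        if (pthread_mutex_trylock(lock) == 0) return;   /* got it cheaply */
        sched_yield();             /* brief back-off between attempts     */
    }
    pthread_mutex_lock(lock);      /* give up and block in the kernel     */
}
</pre>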
| <P> |
| These measurements do not use incremental collection, nor was prefetching |
| enabled in the marker. We used the C version of the benchmark. |
| All measurements are in elapsed seconds on an unloaded machine. |
| <P> |
| <TABLE BORDER ALIGN="CENTER"> |
<TR><TH>Number of client threads</th><TH>1 marker thread (secs.)</th>
| <TH>2 marker threads (secs.)</th></tr> |
| <TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td> |
| <TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td> |
| </table> |
<P>
The execution time for the single-threaded case is slightly worse than with
simple locking. However, even the single-threaded benchmark runs faster than
the thread-unsafe version if a second processor is available.
The execution time for two clients with thread-local allocation is
only 1.4 times the sequential execution time for a single thread in a
thread-unsafe environment, even though it involves twice the client work.
| That represents close to a |
| factor of 2 improvement over the 2 client case with the old collector. |
| The old collector clearly |
| still suffered from some contention overhead, in spite of the fact that the |
| locking scheme had been fairly well tuned. |
| <P> |
| Full linear speedup (i.e. the same execution time for 1 client on one |
| processor as 2 clients on 2 processors) |
| is probably not achievable on this kind of |
| hardware even with such a small number of processors, |
| since the memory system is |
| a major constraint for the garbage collector, |
| the processors usually share a single memory bus, and thus |
| the aggregate memory bandwidth does not increase in |
| proportion to the number of processors. |
| <P> |
| These results are likely to be very sensitive to both hardware and OS |
| issues. Preliminary experiments with an older Pentium Pro machine running |
| an older kernel were far less encouraging. |
| |
| </body> |
| </html> |