Portable, mostly-concurrent, mostly-copying GC for multi-processors
Posted 17-Jan-2016
Portable, mostly-concurrent, mostly-copying GC for multi-processors
Tony Hosking
Secure Software Systems Lab, Purdue University
Platform assumptions
• Symmetric multi-processor (SMP/CMP)
• Multiple mutator threads
• (Large heaps)
Desirable properties
• Maximize throughput
• Minimize collector pauses
• Scalability
Exploiting parallelism
• Avoid contention
• (Mostly-)Concurrent allocation
• (Mostly-)Concurrent collection
Concurrent allocation
• Use thread-private allocation “pages”
• Threads contend for free pages
• Each thread allocates from its own page
  • multiple small objects per page, or
  • multiple pages per large object
Concurrent collection: the tricolour abstraction
• Black
  • “live”
  • scanned
  • cannot refer to white
• Grey
  • “live” wavefront
  • still to be scanned
  • may refer to any colour
• White
  • hypothetical garbage
Garbage collection
• White = whole heap
• Shade root targets grey
• While grey nonempty:
  • Shade one grey object black
  • Shade its white children grey
• At end, white objects are garbage
Copying collection
• Partition white from black by copying
• Reclaim white partition wholesale
• At next GC, “flip” black to white
[Figure: timeline diagrams contrasting incremental collection (GC work interleaved with the mutator threads) with concurrent collection (a background GC thread running alongside the mutators)]
Concurrent mutators
• Mutation changes reachability during GC
• Loss of a black/grey reference is safe
  • A non-white object losing its last reference will be garbage at the next GC
• A new reference from black to white is not
  • The new reference may make its target live
  • The collector may never see the new reference
• Mutations may require compensation
Compensation options
• Prevent the mutator from creating black-to-white references
  • write barrier on black, or
  • read barrier on grey, to prevent the mutator obtaining white refs
• Prevent destruction of any path from a grey object to a white object without telling the GC
  • write barrier on grey
Mostly-copying GC [Bartlett]
• Copying collection with ambiguous roots
  • Uncooperative compilers
  • Untidy references
  • Explicit pinning
• Pin ambiguously-referenced objects
  • Shade their page grey without copying
• Assume heap accuracy
  • Copy remaining heap-referenced objects
Incremental MCGC [DeTreville]
• Enforce grey mutator invariant
  • STW greys ambiguously-referenced pages
  • Read barrier on grey using VM page protection
• Read barrier:
  • Stop mutator threads
  • Unprotect page
  • Copy white targets to grey
  • Shade page black
  • Restart threads
• Atomic system-call wrappers unprotect parameter targets (otherwise traps in the OS return an error)
Concurrent MCGC?
• Stopping all threads at each increment is prohibitive on an SMP and impedes concurrency
• BUT barriers are difficult to place on ambiguous references with uncooperative compilers
• ALSO preemptive scheduling may break wrapper atomicity
Mostly-concurrent MCGC
• Enforce black mutator invariant
  • STW blackens ambiguously-referenced pages
• Read barrier on load of an accurate (tidy) grey reference:
  • Blacken grey references as they are loaded
• No system call wrappers: arguments are always black
Read barrier on load of grey
• Object header bit marks grey objects
• Inline fast path checks the grey bit in the target header, calls out to the slow path if set
• Out-of-line slow path:
  • Lock heap meta-data
  • For each (grey) source object in the target page:
    • Copy white targets to grey
    • Clear grey header bit
  • Shade target page black
  • Unlock heap meta-data
Coherence for fast path
• STW phase synchronizes mutators’ views of heap state
• Grey bits are set only in newly-copied objects (i.e., in newly-allocated grey pages) since the most recent STW
• Mutators can never see a cleared grey header unless the page is also black
• Seeing a spurious grey header due to weak ordering is benign: slow path will synchronize
Implementation
• Modula-3:
  • gcc-based compiler back-end
  • No tricky target-specific stack-maps
  • Compiler front-end emits barriers
  • M3 threads map to preemptively-scheduled POSIX pthreads
• Stop/start threads: signals + semaphores, or OS primitives if available
• Simple to port: Darwin (OS X), Linux, Solaris, Alpha/OSF
Experiments
• Parallelized the GCOld benchmark to permit throughput measurements with multiple mutators
• Measures steady-state GC throughput
• 2 platforms:
  • 2 × 2.3 GHz PowerPC Macintosh Xserve running OS X 10.4.4
  • 8 × 700 MHz Intel Pentium 3 SMP running Linux 2.6
[Figure: Read barriers, STW; 1 user-level mutator thread, work=1. Elapsed time (s) vs GC ratio (0.1–8), hardware vs software barriers.]
[Figure: Elapsed time (s); 1 system-level mutator thread, work=1. Elapsed time vs GC ratio (0.1–8), STW vs INC.]
[Figure: Heap size; 1 system-level mutator thread. Maximum heap (MB) vs GC ratio (0.1–8), STW vs INC.]
[Figure: BMU; 1 system-level mutator thread, work=1000, ratio=1.]
[Figure: Scalability; work=1000, ratio=1, 8×P3. Elapsed time (s) vs mutator threads (1–16), STW vs INC.]
[Figure: Java HotSpot server; work=1000, 8×P3. Elapsed time (s) vs mutator threads (1–8), Serial vs Concurrent MS.]
Conclusions
• Mostly-concurrent, mostly-copying collection is feasible for multi-processors (proof of existence)
• Performance is good (scalable)
• Portable: changes only to the compiler front-end to introduce barriers, and to the GC run-time system
• Compiler back-end unchanged: full-blown optimizations enabled, no stack-map overheads
Future work
• Convert read barrier to “clean” only target object instead of whole page
[Figure: Scalability; work=10, ratio=1, 8×P3. Elapsed time (s) vs mutator threads (1–16), STW vs INC.]
[Figure: Java HotSpot server; work=10, 8×P3. Elapsed time (s) vs mutator threads (1–8), Serial vs Concurrent MS.]