Efficient Concurrent Mark-Sweep Cycle Collection
Daniel Frampton, Stephen Blackburn, Luke Quinane and John Zigman
(Pending submission)
Presented by Jose Joao
CS395T - Mar 23, 2009
Outline
• Motivation
  – Backup tracing
  – Trial deletion
• Mark-Sweep Cycle Detection (MSCD)
• Results
  – What worked and what didn’t
• Discussion
Motivation
• Reference counting can directly (i.e. locally) identify garbage
  – Low pause times
  – Reasonable throughput (deferred, coalescing, ulterior)
  – But it cannot reclaim circular garbage
• Existing general solutions are expensive:
  – Trace the whole heap (backup tracing)
  – Temporarily delete an object and see if the cycle collapses (trial deletion)
Trial deletion
• Is a partial mark-sweep (no roots required): find objects that are alive only because they are reachable from themselves
• Three phases:
  – Assume the candidate object is dead, then mark and decrement its children recursively
  – Trace again from the candidate object, marking and incrementing wherever some RC is not zero, i.e. wherever the object is externally reachable
  – Sweep objects with a zero count
• Bacon and Rajan: process candidates en masse, avoid acyclic objects, concurrent algorithm
• Usually less efficient than concurrent tracing
Backup tracing
• Trace all live objects and sweep the entire heap
• Shortcomings:
  – Increases pause times
  – Concurrency for low pause times requires synchronization, e.g. a write barrier
  – Visits all objects, although some cannot be part of a cycle
MSCD: base algorithm
1. Add roots to mark queue
2. Mark until the mark queue is empty
   2.1. Pop from queue and process (mark, scan and add children to queue)
   2.2. Enqueue objects subject to races (fixup set)
3. Sweep
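The three steps of the base algorithm can be sketched in a few lines of Python. This is a stop-the-world toy, not the authors' implementation: the heap is modelled as a hypothetical dict from object name to a list of child names, and the `fixup` argument stands in for whatever set of race-affected objects step 2.2 would supply.

```python
from collections import deque

def mark_sweep(heap, roots, fixup=()):
    """Toy model: heap maps object name -> list of child names (assumed layout)."""
    marked = set()
    queue = deque(roots)                 # step 1: add roots to the mark queue
    fixup = deque(fixup)
    while queue or fixup:                # step 2: mark until the queue is empty
        if not queue:
            queue.extend(fixup)          # step 2.2: enqueue the fixup set
            fixup.clear()
        obj = queue.popleft()            # step 2.1: pop and process
        if obj in marked:
            continue
        marked.add(obj)                  # mark
        queue.extend(heap[obj])          # scan: add children to the queue
    garbage = sorted(o for o in heap if o not in marked)
    for o in garbage:                    # step 3: sweep unmarked objects
        del heap[o]
    return marked, garbage
```

With a root `r` pointing into a live cycle `a ↔ b` and an unreachable cycle `x ↔ y`, only `x` and `y` are swept.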
MSCD: concurrency
• Builds on top of coalescing RC with a snapshot-at-the-beginning write barrier:
  – Atomic state update, so that each object is processed only once
  – 1) Record all pre-mutation pointers for deferred RC decrements
  – 2) Record the object as mutated
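A minimal sketch of such a barrier, assuming a simplified object model. The names `Obj`, `Barrier`, `dec_buffer` and `mod_buffer` are hypothetical, and a lock stands in for the atomic state update; a real barrier would use a header bit and a compare-and-swap.

```python
import threading

class Obj:
    def __init__(self, name, fields=None):
        self.name = name
        self.fields = dict(fields or {})   # field name -> referent (or None)
        self.logged = False                # "already recorded" state bit

class Barrier:
    def __init__(self):
        self._lock = threading.Lock()      # stands in for an atomic CAS (assumption)
        self.dec_buffer = []               # pre-mutation referents, decremented later
        self.mod_buffer = []               # objects mutated since the last collection

    def write(self, src, field, new_target):
        if not src.logged:                 # fast path: object already logged
            with self._lock:
                if not src.logged:         # only the first mutation is recorded
                    # 1) record all pre-mutation pointers
                    self.dec_buffer.extend(
                        t for t in src.fields.values() if t is not None)
                    # 2) record the object as mutated
                    self.mod_buffer.append(src)
                    src.logged = True
        src.fields[field] = new_target     # the actual pointer store
```

Writing the same field twice logs the object once, which is the coalescing effect: intermediate referents never reach the decrement buffer.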
MSCD: concurrency
• Black: marked and scanned
• Grey: marked, not yet scanned
• White: not yet visited
[Figures: three example races. In each one, C is never visited and is incorrectly collected.]
Necessary conditions for a race:
• Create a pointer from a black object to a white object C
• Destroy the last path from a grey object to that white object C
[Figure annotations: RC(C): 1 → 2 → 1; RC(E): 2 → 1]
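The race can be replayed in a toy tri-colour marker. The heap layout, the object names and the single-step interleaving below are assumptions chosen to match the two necessary conditions; `use_fixup` adds C to a fixup set because its count was decremented to a non-zero value.

```python
from collections import deque

def shade(obj, marked, queue):
    """White -> grey: mark the object and queue it for scanning."""
    if obj not in marked:
        marked.add(obj)
        queue.append(obj)

def scan(heap, marked, queue, steps=None):
    """Grey -> black: scan up to `steps` queued objects (all if None)."""
    while queue and (steps is None or steps > 0):
        obj = queue.popleft()
        for child in heap[obj]:
            shade(child, marked, queue)
        if steps is not None:
            steps -= 1

def run(use_fixup):
    heap = {"A": ["B"], "B": ["C"], "C": []}
    marked, queue = set(), deque()
    shade("A", marked, queue)
    scan(heap, marked, queue, steps=1)   # A is black, B is grey, C is white
    # The mutator runs here, concurrently with the collector:
    heap["A"].append("C")                # black -> white pointer, RC(C): 1 -> 2
    heap["B"].remove("C")                # last grey path destroyed, RC(C): 2 -> 1
    fixup = ["C"] if use_fixup else []   # C was decremented to a non-zero count
    scan(heap, marked, queue)            # finish marking
    for obj in fixup:
        shade(obj, marked, queue)
    scan(heap, marked, queue)            # final marking from the fixup set
    return sorted(o for o in heap if o not in marked)  # what a sweep would free
```

Without the fixup set the sweep would free C even though A still points to it; with it, nothing live is freed.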
MSCD: concurrency
Key insight: how to reduce the size of the fixup set?
• Use the set of objects with RC decremented to a non-zero value
  – These decrements are a necessary condition for cyclic garbage
  – These decrements are uncommon
  – Easy to identify while processing the decrement buffer (after increments)
  – Robust to coalescing of reference counts
  – These are the purple objects, or candidates for trial deletion (Bacon & Rajan)
  – It’s enough to compute this set at tracing time
  – Trade-offs?
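A sketch of how the purple set falls out of buffer processing, assuming reference counts live in a plain dict. `process_buffers` is a hypothetical helper, and the recursive decrement of a freed object's children is left out for brevity.

```python
def process_buffers(rc, increments, decrements):
    """Apply buffered increments, then decrements; return (freed, purple)."""
    for obj in increments:                # increments are applied first
        rc[obj] = rc.get(obj, 0) + 1
    freed, purple = [], set()
    for obj in decrements:
        rc[obj] -= 1
        if rc[obj] == 0:
            freed.append(obj)             # plain RC garbage, no cycle involved
            purple.discard(obj)
        else:
            purple.add(obj)               # decremented to non-zero: candidate
    return freed, purple
```

Using the counts from the race examples, RC(C): 1 → 2 → 1 and RC(E): 2 → 1 both leave a non-zero count, so C and E become purple, while an object going 1 → 0 is simply freed.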
MSCD: marking
• Statically determine acyclic classes:
  – No pointer fields, or
  – Can point only to acyclic classes
• Set green bit in header of acyclic objects at allocation time
• Ignore green objects for the fixup set (step 2.2 of base algorithm?)
  – Why only step 2.2? How about step 2.1?
  – The sweep phase also has to consider green objects as marked
• How about green objects pointed to only by non-green objects in a cycle?
• Trade-offs?
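The two rules for acyclic classes amount to a small fixpoint computation. The sketch below assumes each class is described by the exact classes its pointer fields may refer to, and it ignores subtyping, which a real analysis would have to treat conservatively.

```python
def acyclic_classes(fields):
    """fields: class name -> list of classes its pointer fields may refer to."""
    green = set()
    changed = True
    while changed:                        # iterate until no class can be added
        changed = False
        for cls, targets in fields.items():
            # No pointer fields, or every field points only to acyclic classes.
            if cls not in green and all(t in green for t in targets):
                green.add(cls)
                changed = True
    return green
```

A class that refers to itself, or to any class outside the green set, never becomes green, so the result errs on the side of treating classes as potentially cyclic.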
MSCD: sweeping
• Sweep only potentially cyclic objects and their children
• Start with all purple objects
• Trade-offs?
  – Much cheaper than scanning the heap
  – Requires keeping the set of all purple objects identified since the last cycle detection, not only during tracing
    • Space overhead
    • Time overhead of filtering RC-collected objects out of the purple set
    • Overhead increases with time between cycle detections!
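One way to picture a sweep that starts from the purple set, as a rough sketch only: the slides do not spell out the traversal, so the details here are assumptions. The heap is again a dict of child lists, and purple entries already freed by RC are filtered out first.

```python
def sweep_from_purple(heap, marked, purple):
    """Free unmarked objects reachable from purple objects (assumed traversal)."""
    freed = set()
    # Filter out purple objects that RC has already collected.
    stack = [o for o in purple if o in heap]
    while stack:
        obj = stack.pop()
        if obj in marked or obj in freed:
            continue                      # live, or already handled
        freed.add(obj)                    # unmarked after tracing: cyclic garbage
        stack.extend(c for c in heap[obj] if c in heap)  # visit its children
    for obj in freed:
        del heap[obj]
    return freed
```

Only objects reachable from the purple set are visited, which is where the saving over a whole-heap sweep would come from.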
MSCD: implementation
• Interaction with the reference counter
  – Establish roots atomically
  – Add complete fixup set to mark queue
  – RC must not free objects pointed to by MSCD (mark queue and fixup queue): free buffer
• Invocation heuristics
  – When RC is unable to free enough memory (?)
  – Heap fullness threshold
  – Size of the purple set
  – Can do trial deletion or backup tracing instead of MSCD
MSCD: possible timing
[Timeline figure: mutator and RC phases alternate over time. Alongside them, MSCD proceeds through roots, marking, fixup, final marking and sweeping. The label "New (grey)" appears during marking.]
Methodology and Results
• Jikes RVM 2.3.4+CVS, MMTk
• DaCapo beta050224, SPECjvm98 and pseudojbb
• Stop-the-world (i.e. limit) throughput:
  – Trial deletion is about 70% worse than Backup MS, while MSCD is about 20% better than Backup MS.
  – MSCD visits only 12% fewer nodes:
    • green objects on the fringe still have to be visited,
    • green objects are short lived (many allocated, fewer on the heap at a given time)
  – MSCD has about 7% cheaper cost per visited node:
    • green objects not scanned,
    • sweep optimization
More Results
• Concurrent throughput:
  – Bug in base and MSCD running on SMT (why not CMP?)
  – Time-slicing (i.e. single-context uniprocessor): no benefit from the concurrency optimization → fixup is too small
• Overall performance (stop-the-world CD triggered by insufficient reclamation by RC):
  – MSCD with mark opt. is better than MSCD with both mark and sweep opt., due to the overhead of maintaining the purple set
  – Overhead of grey bit and green bit
  – Heuristics to trigger CD matter, especially on tight heaps
  – Generations (e.g. ulterior RC) could reduce cycle detection load
Discussion
• Main ideas: reduce the cost of backup MS by:
  – stopping the mark at the green-object frontier,
  – starting the sweep from purple objects,
  – reusing the concurrency mechanism from coalescing RC
• Figure 6 shows about 50% of the total time is GC+CD (!)
• Baseline is non-generational deferred/coalescing RC.
• Why not test concurrency on CMP in addition to, or instead of, SMT?
• Synchronization is still required in the write barrier, although they claim the guard can be removed (?)
Open questions
• Invocation heuristics (trade-offs?)
  – When running out of heap
  – At some heap occupancy threshold
  – Some form of estimating that there is enough cyclic garbage to trigger CD?
  – Hints from programmer/compiler?
• Can we do better with CMPs?
Questions for the authors
• Old version of Jikes RVM. Why? Does it matter?
• For xalan and compress, green% + cycle% > 100%
• Table 2 and Figure 5 don’t agree