Getting Real, Getting Dirty (without getting real dirty)
Ron K. Cytron
Joint work with Krishna Kavi
University of Alabama at Huntsville
April 2001
Dante Cannarozzi, Sharath Cholleti, Morgan Deters, Steve Donahue
Mark Franklin, Matt Hampton, Michael Henrichs, Nicholas Leidenfrost, Jonathan Nye, Michael Plezbert, Conrad Warmbold
Center for Distributed Object Computing
Department of Computer Science
Washington University
Funded by the National Science Foundation under grant 0081214
Funded by DARPA under contract F33615-00-C-1697
Outline
• Motivation
• Allocation
• Collection
• Conclusion
Traditional architecture and object-oriented programs
• Caches are still biased toward Fortran-like behavior
• CPU is still responsible for storage management
• Object-management activity invalidates caches
– GC disruptive
– Compaction
[Figure: CPU + cache connected through an L2 cache to conventional memory banks]
An OO-biased design using IRAMs (with Krishna Kavi)
• CPU and cache stay the same, off-the-shelf
• Memory system redesigned to support OO programs
[Figure: CPU + cache and L2 cache unchanged; the memory banks become an IRAM with on-chip logic]
[Figure: the CPU sends malloc across the IRAM interface; the IRAM returns an address]
Stable address for an object allows better cache behavior
Object can be relocated within IRAM, but its address to the CPU is constant
[Figure: putfield/getfield cross the IRAM interface; the value is returned to the CPU]
Object referencing, tracked inside the IRAM, supports garbage collection
[Figure: gc, compact, and prefetch operate inside the IRAM]
Goal: relegate storage-management functions to IRAM
Macro accesses
p.getLeft().getNext()
*(*(p+12)+32)
Observe: code sequences contain common gestures (superoperators)
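The lowering above can be sketched in C. The offsets 12 and 32 come straight from the slide's `*(*(p+12)+32)`; the helper names are my own, illustrative only:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the gesture above: p.getLeft().getNext() lowers to two
   dependent loads, *(*(p+12)+32). Offsets come from the slide; the
   helper names are assumptions. */
typedef uintptr_t word;

static word load(word addr) {          /* one round trip to memory */
    word v;
    memcpy(&v, (void *)addr, sizeof v);  /* unaligned-safe load */
    return v;
}

/* Executed as a superoperator inside the IRAM, both loads run next
   to the DRAM and only the final value crosses the bus. */
static word m143(word p) {
    return load(load(p + 12) + 32);    /* *(*(p+12)+32) */
}
```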
Gesture abstraction
p.getLeft().getNext()
*(*(p+12)+32)
M143(x):
*(*(x+12)+32)
Goal: decrease traffic between CPU and storage
Gesture application
Macro 143 (p)
M143(x):
*(*(x+12)+32)
p.getLeft().getNext()
Automatic prefetching
Goal: decrease traffic between CPU and storage
[Figure: while the CPU works on p.getLeft().getNext(), the IRAM sees "Fetch p" and pushes p ahead of use]
Challenges
• Algorithmic
– Bounded-time methods for allocation and collection
– Good average performance as well
• Architectural
– Lean interface between the CPU and IRAM
– Efficient realization
Storage Allocation (Real Time)
• Not necessarily fast
• Necessarily predictable
• Able to satisfy any reasonable request
– Developer should know “maxlive” characteristics of the application
– This is true for non-embedded systems as well
How much storage?
• curlive—the number of objects live at a point in time
• curspace—the number of bytes live at a point in time
[Figure: handles point into object space; the objects concurrently live determine how much object space is needed]
Storage Allocation—Free List
• Linked list of free blocks
• Search for desired fit
• Worst case O(n) for n blocks in the list
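A minimal first-fit free list makes the O(n) cost concrete. This is a sketch: the names are mine, and it deliberately omits coalescing and block splitting:

```c
#include <stddef.h>

/* Minimal first-fit free list (sketch). malloc scans for the first
   block big enough, so a long list means a long worst-case search. */
typedef struct block {
    size_t size;
    struct block *next;
} block;

static block *free_list;

static void *ff_alloc(size_t want) {
    block **prev = &free_list;
    for (block *b = free_list; b; prev = &b->next, b = b->next) {
        if (b->size >= want) {   /* first fit */
            *prev = b->next;     /* unlink the whole block */
            return b;
        }
    }
    return NULL;                 /* O(n) blocks inspected before failing */
}

static void ff_free(void *p, size_t size) {
    block *b = p;
    b->size = size;
    b->next = free_list;         /* push; no coalescing in this sketch */
    free_list = b;
}
```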
Worst-case free-list behavior
• The longer the free-list, the more pronounced the effect
• No a priori bound on how much worse the list-based scheme could get
• Average performance similar
Slowdown of List-Based Allocator
[Chart: worst-case slowdown vs. number of objects allocated: 3.15x at 180 objects, 72.84x at 3000]
Knuth’s Buddy System
• Free-list segregated by size
• All requests rounded up to a power of 2
[Figure: free lists segregated by size: 256, 128, 64, 32, 16, 8, 4, 2, 1]
Knuth’s Buddy System (1)
• Begin with one large block
• Suppose we want a block of size 16
Knuth’s Buddy System (2)
• Begin with one large block
• Recursively subdivide
Knuth’s Buddy System (3)
• Begin with one large block
• Recursively subdivide
Knuth’s Buddy System (4)
• Begin with one large block
• Recursively subdivide
Knuth’s Buddy System (5)
• Begin with one large block
• Yield 2 blocks size 16
Knuth’s Buddy System (6)
• One of those blocks can be given to the program
• Begin with one large block
• Yield: 2 blocks size 16
Speedup of Buddy over List
[Chart: speedup of Buddy over List vs. number of objects allocated: worst case 3.15x at 180 objects and 72.84x at 3000; average 0.89x and 0.91x]
Spec Benchmark Results
[Chart: speedup of Buddy over List on each Spec benchmark, on a 0 to 1.2 scale]
Buddy System
• If a block can be found, it can be found in O(log N) time, where N is the size of the heap
• The application cannot make that worse
Defragmentation
• To keep up with the diversity of requested block sizes, an allocator may have to reorganize smaller blocks into larger ones
Defragmentation—Free List
• Free-list permutes adjacent blocks
• Storage becomes fragmented, with many small blocks and no large ones
• Two issues:
– Join adjacent blocks
– Reorganize holes (move live storage)
• Organization by address can help [Kavi]
[Figure: blocks in memory and the free list that threads through them]
Buddies—joining adjacent blocks
• The blocks resulting from subdivision are viewed as “buddies”
• Their addresses differ by exactly one bit
• The address of a block of size 2^n differs from its buddy’s address at bit n
[Figure: two buddy addresses, identical except at bit n]
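Because the two addresses differ only at bit n, locating a buddy is a single XOR. A sketch over offsets within the heap:

```c
#include <stdint.h>

/* A block of size 2^n and its buddy have offsets identical except at
   bit n, so one XOR locates the buddy. */
static uintptr_t buddy_of(uintptr_t off, unsigned n) {
    return off ^ ((uintptr_t)1 << n);
}
```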
Knuth’s Buddy System (6)
Knuth’s Buddy System (5)
• When a block becomes free, it tries to rejoin its buddy
• A bit in its buddy tells whether the buddy is free
• If so, they glue together and make a block twice as big
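The rejoining step can be sketched with a per-level free bitmap; the bitmap representation and offset bookkeeping are my assumptions:

```c
/* Gluing on free (sketch): offsets within a 2^MAX-byte arena, with a
   free bit per (level, offset). Freeing a block repeatedly glues it
   with its buddy for as long as the buddy is also free. */
enum { MAX = 8 };
static unsigned char is_free[MAX + 1][1 << MAX];

static unsigned buddy_release(unsigned off, unsigned level) {
    for (;;) {
        unsigned b = off ^ (1u << level);      /* buddy's offset */
        if (level == MAX || !is_free[level][b]) {
            is_free[level][off] = 1;           /* buddy busy: stop here */
            return level;                      /* final level of the block */
        }
        is_free[level][b] = 0;                 /* glue: buddy leaves its list */
        off &= ~(1u << level);                 /* merged block starts lower */
        level++;                               /* ...and is twice as big */
    }
}
```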
Knuth’s Buddy System (4)–(1)
[Animation: the glued block rejoins its buddy at each successive level, rebuilding the original large block]
Two problems
• Oscillation—Buddy looks like it may split, glue, split, glue—isn’t this wasted effort?
• Fragmentation—What happens when Buddy can’t glue but has space it would like to combine?
Buddy—oscillation
[Animation: a block is split, glued, split, glued as allocations and deallocations alternate at the same size]
Problem is lack of hysteresis
• Some programs allocate objects which are almost immediately deallocated.
– Continuous, incremental approaches to garbage collection only make this worse!
• Oscillation is expensive: blocks are glued only to be quickly subdivided again
Estranged Buddy System
• Variant of Knuth’s idea
• When deallocated, blocks are not eager to rejoin their buddies
• Evidence of value [Kaufman, TOPLAS ’84]
• Slight improvement on spec benchmarks
• Algorithmic improvement over Kaufman
Buddy-Busy and Buddy-Free
[Figure: at each block size 2^k, two lists: blocks whose buddies are busy, and blocks whose buddies are free]
Estranged Buddy—Allocation
Allocation heuristic
1. Buddy-busy
2. Buddy-free
3. Glue one level below, buddy-free
4. Search up (Knuth)
5. Glue below
[Figure: buddy-busy and buddy-free lists at sizes 2^(k-1), 2^k, and 2^(k+1)]
How well does Estranged Buddy do? (contrived example)
[Chart: allocation times for a contrived run of size-8 allocations, Knuth vs. Estranged]
Estranged Buddy on Spec
[Chart: speedup of Estranged Buddy over Knuth for compress, jess, raytrace, db, javac, mpegaudio, mtrt, and jack, on a 0 to 1.6 scale]
Recall: two problems
• Oscillation—Buddy looks like it may split, glue, split, glue—isn’t this wasted effort?
– Typically not, but can be
• Fragmentation—What happens when Buddy can’t glue but has space it would like to combine?
Buddy System—Fragmentation
• Internal fragmentation from rounding-up of requests to powers of two
• Not really a concern these days
• Assume a program can run in maxlive bytes
• How much storage needed so Buddy never has to defragment?
• What is a good algorithm for Buddy defragmentation?
Buddy Configurations
[Figure: a heap with allocated and free blocks at sizes 8, 4, 2, 1]
Heap Full
[Figure: the entire heap allocated]
Buddy can’t allocate size-2 block
[Figure: free space remains, but not as a size-2 block]
How Big a Heap for Non-Blocking Buddy (M = maxlive)?
• Easy bound: M log M
• Better bound: M × k, where k is the number of distinct sizes to be allocated
• Sounds like a good bound, but it isn’t
• Defragmentation may be necessary
[Figure: M bytes at each level from 1 to 256]
Managing object relocation
• Every object has a stable handle, whose address does not change
• Every handle points to its object’s current location
• All references to objects are indirect, through a handle
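The indirection above can be sketched as a handle table; the table shape and names are my own:

```c
/* Handle indirection (sketch): the CPU holds a handle, never a raw
   object address. The IRAM may relocate the object and update the
   single table slot; the handle the CPU sees never changes. */
typedef struct { void *loc; } handle_t;
static handle_t table[1024];

static void *deref(int h)               { return table[h].loc; }
static void  relocate(int h, void *dst) { table[h].loc = dst; }  /* IRAM side */
```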
Buddy Defragmentation
• When stuck at level k
– No blocks free above level k
– No glueable blocks free below level k
– Assume maxlive still suffices
• Example: k=6, size 64 not available
Defragmentation Algorithm
[Animation: a swap relocates a size-16 block so two size-32 buddies become free; the buddies are then glued into a size-64 block]
Defragmentation Algorithm
• Recursively visit below to develop two buddies that can be glued
• Analogous to the recursive allocation algorithm
• Still, choices to be made… studies underway
[Figure: need 4 bytes; move 3 bytes, or move 1 byte?]
Recall: two problems
• Oscillation—Buddy looks like it may split, glue, split, glue—isn’t this wasted effort?
– Typically not, but can be
• Fragmentation—What happens when Buddy can’t glue but has space it would like to combine?
– New algorithm to defragment Buddy
– Selective approach—should beat List
– Optimizations needed
Towards an IRAM implementation
• VHDL of Buddy System complete
– DRAM clocked at 150 MHz
– 10 cycles per DRAM access
• Need 7 accesses per level to split blocks
• For a 16-Mbyte heap: 24 levels
– 1680 cycles worst case: 11 µs
– 168x slower than a read
• Can we do better?
Two tricks
• Find a suitable free block quickly
• Return its address quickly
Finding a suitable free block
• No space at 16, but 16 points to the level above it that has a block to offer
Finding a suitable free block
• Every level points to the level above it that has a block to offer
• Pointers are maintained using Tarjan’s path-compression
• Locating pointers are not stored in DRAM
Alternative free-block finder
• Path-compression may be too complex for hardware
• Instead, track the largest available free block
• Tends to break up large blocks and favor formation of small ones
Fast return for malloc
• Want 16 bytes
• Zip to the 64 display
• WLOG we return the first part of that block immediately to the requestor
• Adjustment to the structures happens in parallel with the return
Improved IRAM allocator
• ~10 cycles fast return
• ~1000 cycles to recover, worst case
• Is this good enough?
– Compare software implementation:
• ~1000 cycles worst case
• ~600 cycles average on spec benchmarks
– Hardware can be much faster
– Depends on recover time
Do programs allow us to recover?
• Run of jack—JVM instructions between requests: min 3, median 181, max 174,053
• 56% of requests separated by at least 100 JVM instructions
• Assume 10x expansion, JVM to native code
• For the 56%, we return in 10 cycles
• Code motion might improve others
Garbage Collection
• While allocators are needed for most modern languages, garbage collection is not universally accepted
• Generational and incremental approaches help most applications
• Embedded and real-time need assurances of bounded behavior
Why not garbage collect?
• Some programmers want ultimate control over storage
• Real-Time applications need bounded-time overhead
– RT Java spec relegates allocation and collection to user control
– Isn’t this a step back from Java?
Marking Phase—the problem
• To discover the dead objects, we use calculatus eliminatus
– Find live objects
– All others are dead
Marking Phase—the problem
• To discover the dead objects, we
– Find live objects
• Pointers from the stack to the heap make objects live
Marking Phase—the problem
• To discover the dead objects, we
– Find live objects
• Pointers from the stack to the heap make objects live
• These objects make other objects live
Marking Phase—the problem
• To discover the dead objects, we
– Find live objects
– Sweep all others away as dead
Marking Phase—the problem
• To discover the dead objects, we
– Find live objects
– Sweep all others away as dead
– Perhaps compact the heap
Problems with “mark” phase
• Takes an unbounded amount of time
• Can limit it using generational collection, but then it’s not clear what will get collected
• We seek an approach that spends a constant amount of time per program operation and collects objects continuously
Two Approaches
• Variation on reference counting
• Contaminated garbage collection [PLDI00]
Reference Counting
• An integer is associated with every object, summing
– Stack references
– Heap references
• Objects with reference count of zero are dead
[Figure: stack and heap objects labeled with their reference counts; unreachable objects show count 0]
Problems with Reference Counting
• Standard problem is that objects in cycles (and those touched by such objects) cannot be collected
[Figure: a heap cycle retains nonzero counts even when unreachable from the stack]
Problems with Reference Counting
• Contaminated gc will collect such objects
• Overhead of counting can be high
• Untyped stack complicates things
The Untyped Stack
• The stack is a collection of untyped cells
• In JVM, safety is verified at class-load time
• No need to tag stack locations with what they contain
• Leads to imprecision in all gc methods
[Figure: an untyped stack cell: heap address, or just an integer?]
Idea
• When a stack frame pops, all of its cells are dead
• Don’t worry about tracking cell pointers
• Instead, associate an object with the last stack frame that can reference the object
Reference Counting Approach
• s is zero or one, indicating none or at least one stack reference to the object
• h precisely reflects the number of heap references to the object
• If s+h=0 object is dead
[Figure: each object carries its (s, h) counts]
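The (s, h) pair can be sketched as a two-field object header; the field widths and names are my choices:

```c
#include <stdbool.h>

/* Per-object counts (sketch): s is a single bit recording whether any
   stack frame may still reference the object; h counts heap references
   exactly. The object is dead exactly when s + h == 0. */
typedef struct obj {
    unsigned s : 1;
    unsigned h : 31;
} obj;

static bool is_dead(const obj *o) { return o->s + o->h == 0; }

/* putfield: constant work per instruction, as claimed above */
static void putfield(obj **slot, obj *tgt) {
    if (*slot) (*slot)->h--;     /* old pointed-to object loses a ref */
    if (tgt) tgt->h++;           /* new pointed-to object gains one  */
    *slot = tgt;
}
```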
Our treatment of stack activity
• Object is associated with the last-to-be-popped frame that can reference the object
Our treatment of stack activity
• Object is associated with the last-to-be-popped frame that can reference the object
• When that frame pops
– If object is returned, the receiving frame owns the object
Our treatment of stack activity
• Object is associated with the last-to-be-popped frame that can reference the object
• When that frame pops
– Otherwise the object is dead
Our reference-counting implementation
• The objects associated with the frame are linked together
• When a stack frame pops, all of its cells no longer can point at anything
• When an object’s heap count becomes zero, the object is scheduled for deletion in that frame
• When the frame pops, all such objects are dead
[Animation: as frames pop, one object is unlinked but still thought to be live; another is dead and is collected; another is also dead; one is still live; an object that was linked to its frame all along dies when that frame finally pops]
Reference Counting
• Predictable, constant overhead for each JVM instruction
– putfield decreases count at old pointed-to object, increases count at new pointed-to object
– areturn associates object with stack frame if not already associated below
• How well does it do? We shall see!
Contaminated Garbage Collection
• Need to collect objects involved in reference cycles without resorting to marking live objects
• Idea
– Associate each object with a stack frame such that when that frame returns, the object is known to be dead
– Like escape analysis, but dynamic
Contaminated garbage collection
• Initially each object is associated with the frame in which it is instantiated
• When B references A, A becomes as live as B
• Now A, B, and C are as live as C
• Even though D is less live than C, it gets contaminated
• Should something reference D later, all will be affected
• Static finger of life: now all objects appear to live forever, even if E points away!
[Animation: objects A through E, each tied to a stack frame, are progressively contaminated by cross-references]
Contaminated garbage collection
• Every object is a member of an equilive set
– All objects in a set are scheduled for deallocation at the same time
– Sets are maintained using Tarjan’s disjoint union/find algorithm
• Nearly constant amount of overhead per operation
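The equilive sets can be sketched with union/find and path compression. The frame-depth bookkeeping (shallower frame pops later, so the merged set keeps the shallower owner) is my reading of the scheme, and the names are assumptions:

```c
/* Equilive sets via union/find with path compression (sketch). Each
   root records the depth of the frame that owns the set; contamination
   unions two sets, and the merged set is as live as its livest member. */
enum { N = 8 };
static int parent[N];
static int owner[N];     /* meaningful at roots: owning frame depth */

static int find(int x) {
    while (parent[x] != x)
        x = parent[x] = parent[parent[x]];   /* path compression */
    return x;
}

static void contaminate(int a, int b) {      /* a references b, or vice versa */
    int ra = find(a), rb = find(b);
    if (ra == rb) return;
    parent[ra] = rb;                         /* union the equilive sets */
    if (owner[ra] < owner[rb])               /* keep the shallower frame, */
        owner[rb] = owner[ra];               /* which pops later */
}
```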
Contaminated GC
• Each equilive set is associated with a frame
• Suppose an object in one set references an object in another set (in either direction)
• Contamination! The sets are unioned
• When a frame pops, objects associated with it are dead
[Animation: cross-references union equilive sets; popping a frame deallocates its sets]
Summary of methods
Reference counting
• Can’t handle cycles
• Handles pointing at and then away
Contaminated GC
• Tolerates cycles
• Can’t track pointing and then pointing away
Both techniques:
• Incur cost at putfield, areturn
• (Nearly) constant overhead per operation
Implementation details
• SUN JDK 1.1 interpreter version
• Many subtle places where references are generated: String.intern(), ldc instruction, class loader, JNI
• Each gc method took about 3 months to implement
• Can run either method or both in concert
• Fairly optimized, more is possible
Spec benchmark effectiveness
[Chart: “Size 1 Absolute”: per-benchmark results for compress, jess, raytrace, db, javac, mpegaudio, mtrt, jack, and checkit under None, RefCount, Both, and CGC]
[Chart: “size 10 Absolute”: the same comparison at size 10]
Exactness of Equilive Sets
[Charts: per-benchmark histograms for raytrace, javac, mpegaudio, jack, db, and jess, binned 1, 2, 3, 4, 5, 6–10, >10]
Distance to die in frames
[Charts: per-benchmark histograms for jess, raytrace, db, javac, mpegaudio, and jack, binned 0, 1, 2, 3, 4, 5, >5]
Speed of CGC
[Chart: per-benchmark factors between 1.04 and 1.24 for compress, jess, raytrace, db, javac, mpegaudio, mtrt, and jack, measured over JDK with a big heap and over JDK with the same heap]
Speedups of Mark-Free Approaches
[Chart: speedups of RefCount and CGC on compress, jess, raytrace, db, javac, mpegaudio, mtrt, and jack, on a 0 to 1.4 scale]
Future Plans
• VHDL simulation of more efficient buddy allocator
• VHDL simulation of garbage collection methods
• Better buddy defragmentation
• Experiment with informed allocation
• Comparison/integration with other IRAM-based methods (with Krishna Kavi)
Informed Storage Management
• Evidence that programs allocate many objects of the same size
[Charts: number of requests (log) vs. buddy block-size (log) for jack (20% fragmentation), raytrace (12% fragmentation), and compress (34% fragmentation)]
Informed Storage Management
• Evidence that programs allocate many objects of the same size
• Not surprising, in Java: same type, same size
• In C and C++ programmers brew their own allocators to take advantage of this
• What can we do automatically?
Informed Storage Management
• Capture program malloc requests by phase
• Generate a .class file and put it in CLASSPATH
• Load the .class file and inform the allocator
Different phases, different distributions
[Chart: raytrace: fraction of requests at block sizes 8, 16, 24, and 32 for phases 1 through 6]
How long is a phase?
[Chart: number of allocations in phases 1–3, phase 4, phase 5, and phase 6]
• Phases 1–3 are common to all programs—JVM startup
• Phases are keyed to allocations, not time, for portability
Questions?