Getting Real, Getting Dirty (without getting real dirty)
Ron K. Cytron
Joint work with Krishna Kavi
University of Alabama at Huntsville
April 2001
Dante Cannarozzi, Sharath Cholleti, Morgan Deters, Steve Donahue
Mark Franklin, Matt Hampton, Michael Henrichs, Nicholas Leidenfrost, Jonathan Nye, Michael Plezbert, Conrad Warmbold
Center for Distributed Object Computing
Department of Computer Science
Washington University
Funded by the National Science Foundation under grant 0081214
Funded by DARPA under contract F33615-00-C-1697
Outline
• Motivation
• Allocation
• Collection
• Conclusion
Traditional architecture and object-oriented programs
• Caches are still biased toward Fortran-like behavior
• CPU is still responsible for storage management
• Object-management activity invalidates caches
– GC disruptive
– Compaction
[Figure: CPU + cache connected through an L2 cache to conventional memory banks]
An OO-biased design using IRAMs (with Krishna Kavi)
• CPU and cache stay the same, off-the-shelf
• Memory system redesigned to support OO programs
[Figure: CPU + cache and L2 cache unchanged; the memory banks become an IRAM with on-chip logic]
[Figure: the CPU sends malloc across the IRAM interface; the IRAM returns an address]
Stable address for an object allows better cache behavior
Object can be relocated within IRAM, but its address to the CPU is constant
[Figure: putfield/getfield cross the IRAM interface; the value is returned to the CPU]
Object referencing, tracked inside the IRAM, supports garbage collection
[Figure: gc, compact, and prefetch operate inside the IRAM]
Goal: relegate storage-management functions to IRAM
Macro accesses
p.getLeft().getNext()
*(*(p+12)+32)
Observe: code sequences contain common gestures (superoperators)
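The lowering above can be sketched in C. The offsets 12 and 32 come straight from the slide's `*(*(p+12)+32)`; the helper names are my own, illustrative only:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the gesture above: p.getLeft().getNext() lowers to two
   dependent loads, *(*(p+12)+32). Offsets come from the slide; the
   helper names are assumptions. */
typedef uintptr_t word;

static word load(word addr) {          /* one round trip to memory */
    word v;
    memcpy(&v, (void *)addr, sizeof v);  /* unaligned-safe load */
    return v;
}

/* Executed as a superoperator inside the IRAM, both loads run next
   to the DRAM and only the final value crosses the bus. */
static word m143(word p) {
    return load(load(p + 12) + 32);    /* *(*(p+12)+32) */
}
```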
Gesture abstraction
p.getLeft().getNext()
*(*(p+12)+32)
M143(x):
*(*(x+12)+32)
Goal: decrease traffic between CPU and storage
Gesture application
Macro 143 (p)
M143(x):
*(*(x+12)+32)
p.getLeft().getNext()
Automatic prefetching
Goal: decrease traffic between CPU and storage
[Figure: while the CPU works on p.getLeft().getNext(), the IRAM sees "Fetch p" and pushes p ahead of use]
Challenges
• Algorithmic
– Bounded-time methods for allocation and collection
– Good average performance as well
• Architectural
– Lean interface between the CPU and IRAM
– Efficient realization
Storage Allocation (Real Time)
• Not necessarily fast
• Necessarily predictable
• Able to satisfy any reasonable request
– Developer should know “maxlive” characteristics of the application
– This is true for non-embedded systems as well
How much storage?
• curlive—the number of objects live at a point in time
• curspace—the number of bytes live at a point in time
[Figure: handles point into object space; the objects concurrently live determine how much object space is needed]
Storage Allocation—Free List
• Linked list of free blocks
• Search for desired fit
• Worst case O(n) for n blocks in the list
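A minimal first-fit free list makes the O(n) cost concrete. This is a sketch: the names are mine, and it deliberately omits coalescing and block splitting:

```c
#include <stddef.h>

/* Minimal first-fit free list (sketch). malloc scans for the first
   block big enough, so a long list means a long worst-case search. */
typedef struct block {
    size_t size;
    struct block *next;
} block;

static block *free_list;

static void *ff_alloc(size_t want) {
    block **prev = &free_list;
    for (block *b = free_list; b; prev = &b->next, b = b->next) {
        if (b->size >= want) {   /* first fit */
            *prev = b->next;     /* unlink the whole block */
            return b;
        }
    }
    return NULL;                 /* O(n) blocks inspected before failing */
}

static void ff_free(void *p, size_t size) {
    block *b = p;
    b->size = size;
    b->next = free_list;         /* push; no coalescing in this sketch */
    free_list = b;
}
```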
Worst-case free-list behavior
• The longer the free-list, the more pronounced the effect
• No a priori bound on how much worse the list-based scheme could get
• Average performance similar
Slowdown of List-Based Allocator
[Chart: worst-case slowdown vs. number of objects allocated: 3.15x at 180 objects, 72.84x at 3000]
Knuth’s Buddy System
• Free-list segregated by size
• All requests rounded up to a power of 2
[Figure: free lists segregated by size: 256, 128, 64, 32, 16, 8, 4, 2, 1]
Knuth’s Buddy System (1)
• Begin with one large block
• Suppose we want a block of size 16
Knuth’s Buddy System (2)
• Begin with one large block
• Recursively subdivide
Knuth’s Buddy System (3)
• Begin with one large block
• Recursively subdivide
Knuth’s Buddy System (4)
• Begin with one large block
• Recursively subdivide
Knuth’s Buddy System (5)
• Begin with one large block
• Yield 2 blocks size 16
Knuth’s Buddy System (6)
• One of those blocks can be given to the program
• Begin with one large block
• Yield: 2 blocks size 16
Speedup of Buddy over List
[Chart: speedup of Buddy over List vs. number of objects allocated: worst case 3.15x at 180 objects and 72.84x at 3000; average 0.89x and 0.91x]
Spec Benchmark Results
[Chart: speedup of Buddy over List on each Spec benchmark, on a 0 to 1.2 scale]
Buddy System
• If a block can be found, it can be found in O(log N) time, where N is the size of the heap
• The application cannot make that worse
Defragmentation
• To keep up with the diversity of requested block sizes, an allocator may have to reorganize smaller blocks into larger ones
Defragmentation—Free List
• Free-list permutes adjacent blocks
• Storage becomes fragmented, with many small blocks and no large ones
• Two issues:
– Join adjacent blocks
– Reorganize holes (move live storage)
• Organization by address can help [Kavi]
[Figure: blocks in memory and the free list that threads through them]
Buddies—joining adjacent blocks
• The blocks resulting from subdivision are viewed as “buddies”
• Their addresses differ by exactly one bit
• The address of a block of size 2^n differs from its buddy’s address at bit n
[Figure: two buddy addresses, identical except at bit n]
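Because the two addresses differ only at bit n, locating a buddy is a single XOR. A sketch over offsets within the heap:

```c
#include <stdint.h>

/* A block of size 2^n and its buddy have offsets identical except at
   bit n, so one XOR locates the buddy. */
static uintptr_t buddy_of(uintptr_t off, unsigned n) {
    return off ^ ((uintptr_t)1 << n);
}
```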
Knuth’s Buddy System (6)
Knuth’s Buddy System (5)
• When a block becomes free, it tries to rejoin its buddy
• A bit in its buddy tells whether the buddy is free
• If so, they glue together and make a block twice as big
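The rejoining step can be sketched with a per-level free bitmap; the bitmap representation and offset bookkeeping are my assumptions:

```c
/* Gluing on free (sketch): offsets within a 2^MAX-byte arena, with a
   free bit per (level, offset). Freeing a block repeatedly glues it
   with its buddy for as long as the buddy is also free. */
enum { MAX = 8 };
static unsigned char is_free[MAX + 1][1 << MAX];

static unsigned buddy_release(unsigned off, unsigned level) {
    for (;;) {
        unsigned b = off ^ (1u << level);      /* buddy's offset */
        if (level == MAX || !is_free[level][b]) {
            is_free[level][off] = 1;           /* buddy busy: stop here */
            return level;                      /* final level of the block */
        }
        is_free[level][b] = 0;                 /* glue: buddy leaves its list */
        off &= ~(1u << level);                 /* merged block starts lower */
        level++;                               /* ...and is twice as big */
    }
}
```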
Knuth’s Buddy System (4)–(1)
[Animation: the glued block rejoins its buddy at each successive level, rebuilding the original large block]
Two problems
• Oscillation—Buddy looks like it may split, glue, split, glue—isn’t this wasted effort?
• Fragmentation—What happens when Buddy can’t glue but has space it would like to combine?
Buddy—oscillation
[Animation: a block is split, glued, split, glued as allocations and deallocations alternate at the same size]
Problem is lack of hysteresis
• Some programs allocate objects which are almost immediately deallocated.
– Continuous, incremental approaches to garbage collection only make this worse!
• Oscillation is expensive: blocks are glued only to be quickly subdivided again
Estranged Buddy System
• Variant of Knuth’s idea
• When deallocated, blocks are not eager to rejoin their buddies
• Evidence of value [Kaufman, TOPLAS ’84]
• Slight improvement on spec benchmarks
• Algorithmic improvement over Kaufman
Buddy-Busy and Buddy-Free
[Figure: at each block size 2^k, two lists: blocks whose buddies are busy, and blocks whose buddies are free]
Estranged Buddy—Allocation
Allocation heuristic
1. Buddy-busy
2. Buddy-free
3. Glue one level below, buddy-free
4. Search up (Knuth)
5. Glue below
[Figure: buddy-busy and buddy-free lists at sizes 2^(k-1), 2^k, and 2^(k+1)]
How well does Estranged Buddy do? (contrived example)
[Chart: allocation times for a contrived run of size-8 allocations, Knuth vs. Estranged]
Estranged Buddy on Spec
[Chart: speedup of Estranged Buddy over Knuth for compress, jess, raytrace, db, javac, mpegaudio, mtrt, and jack, on a 0 to 1.6 scale]
Recall: two problems
• Oscillation—Buddy looks like it may split, glue, split, glue—isn’t this wasted effort?
– Typically not, but can be
• Fragmentation—What happens when Buddy can’t glue but has space it would like to combine?
Buddy System—Fragmentation
• Internal fragmentation from rounding-up of requests to powers of two
• Not really a concern these days
• Assume a program can run in maxlive bytes
• How much storage needed so Buddy never has to defragment?
• What is a good algorithm for Buddy defragmentation?
Buddy Configurations
[Figure: a heap with allocated and free blocks at sizes 8, 4, 2, 1]
Heap Full
[Figure: the entire heap allocated]
Buddy can’t allocate size-2 block
[Figure: free space remains, but not as a size-2 block]
How Big a Heap for Non-Blocking Buddy (M = maxlive)?
• Easy bound: M log M
• Better bound: M × k, where k is the number of distinct sizes to be allocated
• Sounds like a good bound, but it isn’t
• Defragmentation may be necessary
[Figure: M bytes at each level from 1 to 256]
Managing object relocation
• Every object has a stable handle, whose address does not change
• Every handle points to its object’s current location
• All references to objects are indirect, through a handle
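The indirection above can be sketched as a handle table; the table shape and names are my own:

```c
/* Handle indirection (sketch): the CPU holds a handle, never a raw
   object address. The IRAM may relocate the object and update the
   single table slot; the handle the CPU sees never changes. */
typedef struct { void *loc; } handle_t;
static handle_t table[1024];

static void *deref(int h)               { return table[h].loc; }
static void  relocate(int h, void *dst) { table[h].loc = dst; }  /* IRAM side */
```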
Buddy Defragmentation
• When stuck at level k
– No blocks free above level k
– No glueable blocks free below level k
– Assume maxlive still suffices
• Example: k=6, size 64 not available
Defragmentation Algorithm
[Animation: a swap relocates a size-16 block so two size-32 buddies become free; the buddies are then glued into a size-64 block]
Defragmentation Algorithm
• Recursively visit below to develop two buddies that can be glued
• Analogous to the recursive allocation algorithm
• Still, choices to be made… studies underway
[Figure: need 4 bytes; move 3 bytes, or move 1 byte?]
Recall: two problems
• Oscillation—Buddy looks like it may split, glue, split, glue—isn’t this wasted effort?
– Typically not, but can be
• Fragmentation—What happens when Buddy can’t glue but has space it would like to combine?
– New algorithm to defragment Buddy
– Selective approach—should beat List
– Optimizations needed
Towards an IRAM implementation
• VHDL of Buddy System complete
– DRAM clocked at 150 MHz
– 10 cycles per DRAM access
• Need 7 accesses per level to split blocks
• For a 16-Mbyte heap: 24 levels
– 1680 cycles worst case: 11 µs
– 168x slower than a read
• Can we do better?
Two tricks
• Find a suitable free block quickly
• Return its address quickly
Finding a suitable free block
• No space at 16, but 16 points to the level above it that has a block to offer
Finding a suitable free block
• Every level points to the level above it that has a block to offer
• Pointers are maintained using Tarjan’s path-compression
• Locating pointers are not stored in DRAM
Alternative free-block finder
• Path-compression may be too complex for hardware
• Instead, track the largest available free block
• Tends to break up large blocks and favor formation of small ones
Fast return for malloc
• Want 16 bytes
• Zip to the 64 display
• WLOG we return the first part of that block immediately to the requestor
• Adjustment to the structures happens in parallel with the return
Improved IRAM allocator
• ~10 cycles fast return
• ~1000 cycles to recover, worst case
• Is this good enough?
– Compare software implementation:
• ~1000 cycles worst case
• ~600 cycles average on spec benchmarks
– Hardware can be much faster
– Depends on recover time
Do programs allow us to recover?
• Run of jack—JVM instructions between requests: min 3, median 181, max 174,053
• 56% of requests separated by at least 100 JVM instructions
• Assume 10x expansion, JVM to native code
• For the 56%, we return in 10 cycles
• Code motion might improve others
Garbage Collection
• While allocators are needed for most modern languages, garbage collection is not universally accepted
• Generational and incremental approaches help most applications
• Embedded and real-time need assurances of bounded behavior
Why not garbage collect?
• Some programmers want ultimate control over storage
• Real-Time applications need bounded-time overhead
– RT Java spec relegates allocation and collection to user control
– Isn’t this a step back from Java?
Marking Phase—the problem
• To discover the dead objects, we use calculatus eliminatus
– Find live objects
– All others are dead
Marking Phase—the problem
• To discover the dead objects, we
– Find live objects
• Pointers from the stack to the heap make objects live
Marking Phase—the problem
• To discover the dead objects, we
– Find live objects
• Pointers from the stack to the heap make objects live
• These objects make other objects live
Marking Phase—the problem
• To discover the dead objects, we
– Find live objects
– Sweep all others away as dead
Marking Phase—the problem
• To discover the dead objects, we
– Find live objects
– Sweep all others away as dead
– Perhaps compact the heap
Problems with “mark” phase
• Takes an unbounded amount of time
• Can limit it using generational collection, but then it’s not clear what will get collected
• We seek an approach that spends a constant amount of time per program operation and collects objects continuously
Two Approaches
• Variation on reference counting
• Contaminated garbage collection [PLDI00]
Reference Counting
• An integer is associated with every object, summing
– Stack references
– Heap references
• Objects with reference count of zero are dead
[Figure: stack and heap objects labeled with their reference counts; unreachable objects show count 0]
Problems with Reference Counting
• Standard problem is that objects in cycles (and those touched by such objects) cannot be collected
[Figure: a heap cycle retains nonzero counts even when unreachable from the stack]
Problems with Reference Counting
• Contaminated gc will collect such objects
• Overhead of counting can be high
• Untyped stack complicates things
The Untyped Stack
• The stack is a collection of untyped cells
• In JVM, safety is verified at class-load time
• No need to tag stack locations with what they contain
• Leads to imprecision in all gc methods
[Figure: an untyped stack cell: heap address, or just an integer?]
Idea
• When a stack frame pops, all of its cells are dead
• Don’t worry about tracking cell pointers
• Instead, associate an object with the last stack frame that can reference the object
Reference Counting Approach
• s is zero or one, indicating none or at least one stack reference to the object
• h precisely reflects the number of heap references to the object
• If s+h=0 object is dead
[Figure: each object carries its (s, h) counts]
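The (s, h) pair can be sketched as a two-field object header; the field widths and names are my choices:

```c
#include <stdbool.h>

/* Per-object counts (sketch): s is a single bit recording whether any
   stack frame may still reference the object; h counts heap references
   exactly. The object is dead exactly when s + h == 0. */
typedef struct obj {
    unsigned s : 1;
    unsigned h : 31;
} obj;

static bool is_dead(const obj *o) { return o->s + o->h == 0; }

/* putfield: constant work per instruction, as claimed above */
static void putfield(obj **slot, obj *tgt) {
    if (*slot) (*slot)->h--;     /* old pointed-to object loses a ref */
    if (tgt) tgt->h++;           /* new pointed-to object gains one  */
    *slot = tgt;
}
```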
Our treatment of stack activity
• Object is associated with the last-to-be-popped frame that can reference the object
Our treatment of stack activity
• Object is associated with the last-to-be-popped frame that can reference the object
• When that frame pops
– If object is returned, the receiving frame owns the object
Our treatment of stack activity
• Object is associated with the last-to-be-popped frame that can reference the object
• When that frame pops
– Otherwise the object is dead
Our reference-counting implementation
• The objects associated with the frame are linked together
• When a stack frame pops, all of its cells no longer can point at anything
• When an object’s heap count becomes zero, the object is scheduled for deletion in that frame
• When the frame pops, all such objects are dead
[Animation: as frames pop, one object is unlinked but still thought to be live; another is dead and is collected; another is also dead; one is still live; an object that was linked to its frame all along dies when that frame finally pops]
Reference Counting
• Predictable, constant overhead for each JVM instruction
– putfield decreases count at old pointed-to object, increases count at new pointed-to object
– areturn associates object with stack frame if not already associated below
• How well does it do? We shall see!
Contaminated Garbage Collection
• Need to collect objects involved in reference cycles without resorting to marking live objects
• Idea
– Associate each object with a stack frame such that when that frame returns, the object is known to be dead
– Like escape analysis, but dynamic
Contaminated garbage collection
• Initially each object is associated with the frame in which it is instantiated
• When B references A, A becomes as live as B
• Now A, B, and C are as live as C
• Even though D is less live than C, it gets contaminated
• Should something reference D later, all will be affected
• Static finger of life: now all objects appear to live forever, even if E points away!
[Animation: objects A through E, each tied to a stack frame, are progressively contaminated by cross-references]
Contaminated garbage collection
• Every object is a member of an equilive set
– All objects in a set are scheduled for deallocation at the same time
– Sets are maintained using Tarjan’s disjoint union/find algorithm
• Nearly constant amount of overhead per operation
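The equilive sets can be sketched with union/find and path compression. The frame-depth bookkeeping (shallower frame pops later, so the merged set keeps the shallower owner) is my reading of the scheme, and the names are assumptions:

```c
/* Equilive sets via union/find with path compression (sketch). Each
   root records the depth of the frame that owns the set; contamination
   unions two sets, and the merged set is as live as its livest member. */
enum { N = 8 };
static int parent[N];
static int owner[N];     /* meaningful at roots: owning frame depth */

static int find(int x) {
    while (parent[x] != x)
        x = parent[x] = parent[parent[x]];   /* path compression */
    return x;
}

static void contaminate(int a, int b) {      /* a references b, or vice versa */
    int ra = find(a), rb = find(b);
    if (ra == rb) return;
    parent[ra] = rb;                         /* union the equilive sets */
    if (owner[ra] < owner[rb])               /* keep the shallower frame, */
        owner[rb] = owner[ra];               /* which pops later */
}
```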
Contaminated GC
• Each equilive set is associated with a frame
• Suppose an object in one set references an object in another set (in either direction)
• Contamination! The sets are unioned
• When a frame pops, objects associated with it are dead
[Animation: cross-references union equilive sets; popping a frame deallocates its sets]
Summary of methods
Reference counting
• Can’t handle cycles
• Handles pointing at and then away
Contaminated GC
• Tolerates cycles
• Can’t track pointing and then pointing away
Both techniques:
• Incur cost at putfield, areturn
• (Nearly) constant overhead per operation
Implementation details
• SUN JDK 1.1 interpreter version
• Many subtle places where references are generated: String.intern(), ldc instruction, class loader, JNI
• Each gc method took about 3 months to implement
• Can run either method or both in concert
• Fairly optimized, more is possible
Spec benchmark effectiveness
[Chart: “Size 1 Absolute”: per-benchmark results for compress, jess, raytrace, db, javac, mpegaudio, mtrt, jack, and checkit under None, RefCount, Both, and CGC]
[Chart: “size 10 Absolute”: the same comparison at size 10]
Exactness of Equilive Sets
[Charts: per-benchmark histograms for raytrace, javac, mpegaudio, jack, db, and jess, binned 1, 2, 3, 4, 5, 6–10, >10]
Distance to die in frames
[Charts: per-benchmark histograms for jess, raytrace, db, javac, mpegaudio, and jack, binned 0, 1, 2, 3, 4, 5, >5]
Speed of CGC
[Chart: per-benchmark factors between 1.04 and 1.24 for compress, jess, raytrace, db, javac, mpegaudio, mtrt, and jack, measured over JDK with a big heap and over JDK with the same heap]
Speedups of Mark-Free Approaches
[Chart: speedups of RefCount and CGC on compress, jess, raytrace, db, javac, mpegaudio, mtrt, and jack, on a 0 to 1.4 scale]
Future Plans
• VHDL simulation of more efficient buddy allocator
• VHDL simulation of garbage collection methods
• Better buddy defragmentation
• Experiment with informed allocation
• Comparison/integration with other IRAM-based methods (with Krishna Kavi)
Informed Storage Management
• Evidence that programs allocate many objects of the same size
[Charts: number of requests (log) vs. buddy block-size (log) for jack (20% fragmentation), raytrace (12% fragmentation), and compress (34% fragmentation)]
Informed Storage Management
• Evidence that programs allocate many objects of the same size
• Not surprising, in Java: same type, same size
• In C and C++ programmers brew their own allocators to take advantage of this
• What can we do automatically?
Informed Storage Management
• Capture program malloc requests by phase
• Generate a .class file and put it in CLASSPATH
• Load the .class file and inform the allocator
Different phases, different distributions
[Chart: raytrace: fraction of requests at block sizes 8, 16, 24, and 32 for phases 1 through 6]
How long is a phase?
[Chart: number of allocations in phases 1–3, phase 4, phase 5, and phase 6]
• Phases 1–3 are common to all programs—JVM startup
• Phases are keyed to allocations, not time, for portability
Questions?