
Page 1

CS380 C lecture 20

• Last time
– Linear scan register allocation
– Classic compilation techniques
– On to a modern context

• Today
– Jenn Sartor
– Experimental evaluation for managed languages with JIT compilation and garbage collection

Page 2

Wake Up and Smell the Coffee: Performance Analysis Methodologies for the 21st Century

Kathryn S. McKinley
Department of Computer Sciences
University of Texas at Austin

Page 3

Shocking News!

In 2000, Java overtook C and C++ as the most popular programming language [TIOBE 2000--2008]

Page 4

Systems Research in Industry and Academia

ISCA 2006:
• 20 papers use C and/or C++
• 5 papers are orthogonal to the programming language
• 2 papers use specialized programming languages
• 2 papers use Java and C from SPEC
• 1 paper uses only Java from SPEC

Page 5


What is Experimental Computer Science?

Page 6


What is Experimental Computer Science?

• An idea

• An implementation in some system

• An evaluation

Page 7

The success of most systems innovation hinges on evaluation methodologies.

1. Benchmarks reflect current and, ideally, future reality
2. Experimental design is appropriate
3. Statistical data analysis

Page 8

The success of most systems innovation hinges on experimental methodologies.

1. Benchmarks reflect current and, ideally, future reality [DaCapo Benchmarks 2006]
2. Experimental design is appropriate
3. Statistical data analysis [Georges et al. 2006] (sketched below)
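To make "statistical data analysis" concrete: Georges et al. argue for reporting a mean with a confidence interval rather than a single best run. A minimal sketch in Java, assuming five timed runs of one benchmark (the numbers are illustrative); with this few runs a Student's t value should replace the z value 1.96 used here.

```java
import java.util.Arrays;

public class ConfidenceInterval {
    public static void main(String[] args) {
        // Illustrative per-run execution times (ms) for one benchmark.
        double[] runsMs = {9523.97, 9490.57, 9516.32, 9599.98, 9624.97};

        double mean = Arrays.stream(runsMs).average().orElse(Double.NaN);
        double variance = Arrays.stream(runsMs)
                .map(x -> (x - mean) * (x - mean))
                .sum() / (runsMs.length - 1);            // sample variance
        double halfWidth = 1.96 * Math.sqrt(variance / runsMs.length);

        System.out.printf("mean = %.1f ms, 95%% CI = [%.1f, %.1f]%n",
                mean, mean - halfWidth, mean + halfWidth);
    }
}
```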

Page 9

• We're not in Kansas anymore!
– JIT compilation, GC, dynamic checks, etc.
• Methodology has not adapted
– Needs to be updated and institutionalized

"…this sophistication provides a significant challenge to understanding complete system performance, not found in traditional languages such as C or C++" [Hauswirth et al., OOPSLA '04]

Experimental Design

Page 10

Experimental Design

• Comprehensive comparison (the best-of-five timing protocol is sketched below)
– 3 state-of-the-art JVMs
– Best of 5 executions
– 19 benchmarks
– Platform: 2 GHz Pentium-M, 1 GB RAM, Linux 2.6.15
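A hedged sketch of the "best of 5 executions" protocol: in the study each execution is a fresh JVM invocation; for brevity this sketch times five in-process runs of a hypothetical stand-in workload and keeps the minimum.

```java
public class BestOfFive {
    interface Benchmark { void run(); }

    // Time n executions of b and return the fastest, in nanoseconds.
    static long bestOf(Benchmark b, int n) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < n; i++) {
            long start = System.nanoTime();
            b.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical workload; a real harness would run a DaCapo benchmark.
        long ns = bestOf(() -> { for (int i = 0; i < 1_000_000; i++); }, 5);
        System.out.printf("best of 5: %.2f ms%n", ns / 1e6);
    }
}
```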

Page 11

Experimental Design

[Bar chart: relative performance of Sun JDK 1.6, IBM J9, and BEA JRockit 1.6 on antlr, bloat, chart, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, sunflow, and xalan, plus geomean; y-axis 0.0–1.1, with off-scale bars labeled 2.394, 1.248, 1.246, and 1.158]

Page 12

Experimental Design

[Bar charts (two panels): relative performance of Sun JDK 1.6, IBM J9, and BEA JRockit 1.6 on the same benchmarks and geomean; y-axis 0.0–1.1]

Page 13

Experimental Design

[Bar charts (three panels): relative performance of Sun JDK 1.6, IBM J9, and BEA JRockit 1.6 on the same benchmarks and geomean; y-axis 0.0–1.1]

Page 14

Experimental Design

[Bar charts (three panels, labeled First Iteration, Second Iteration, and Third Iteration): relative performance of Sun JDK 1.6, IBM J9, and BEA JRockit 1.6 on the same benchmarks and geomean; y-axis 0.0–1.1]

Page 15

Experimental Design

Another Experiment
• Compare two garbage collectors
– Semispace Full Heap Garbage Collector
– Marksweep Full Heap Garbage Collector

Page 16

Experimental Design

Another Experiment
• Compare two garbage collectors
– Semispace Full Heap Garbage Collector
– Marksweep Full Heap Garbage Collector
• Experimental Design
– Same JVM, same compiler settings
– Second iteration for both
– Best of 5 executions
– One benchmark: SPEC _209_db
– Platform: 2 GHz Pentium-M, 1 GB RAM, Linux 2.6.15

Page 17

Marksweep vs Semispace

[Bar chart: SPEC _209_db performance, normalized time (y-axis 1.1–1.35) for Marksweep and Semispace]

Page 18

Marksweep vs Semispace

[Bar chart: SPEC _209_db performance, normalized time (y-axis 0.95–1.2) for Marksweep and Semispace]

Page 19

Marksweep vs Semispace

[Charts: SPEC _209_db performance; a line chart of normalized time (1.0–1.3) versus heap size (20–120 MB) for Semispace and Marksweep, shown alongside the two bar charts from the previous slides]

Page 20


Experimental Design

Page 21

Experimental Design: Best Practices

• Measuring JVM innovations

• Measuring JIT innovations

• Measuring GC innovations

• Measuring Architecture innovations

Page 22

JVM Innovation Best Practices

• Examples:
– Thread scheduling
– Performance monitoring
• Workload triggers differences
– real workloads & perhaps microbenchmarks
– e.g., force frequency of thread switching
• Measure & report multiple iterations (a harness is sketched below)
– start up
– steady state (aka server mode)
– never configure the VM to use completely unoptimized code!
• Use a modest or multiple heap sizes computed as a function of maximum live size of the application
• Use & report multiple architectures
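As a sketch of "measure & report multiple iterations": an in-process harness, in the style of DaCapo's multi-iteration mode, that reports the first iteration as start-up and averages later iterations as a steady-state estimate. runWorkload() is a hypothetical stand-in for the benchmark body.

```java
public class IterationTimer {
    static void runWorkload() { /* hypothetical benchmark body */ }

    public static void main(String[] args) {
        int iterations = 10;
        double[] ms = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            long t0 = System.nanoTime();
            runWorkload();
            ms[i] = (System.nanoTime() - t0) / 1e6;
        }
        // Iteration 1 includes JIT compilation: report it as start-up.
        System.out.printf("start-up: %.1f ms%n", ms[0]);
        // Average the last few iterations as a steady-state estimate.
        double steady = (ms[iterations - 3] + ms[iterations - 2] + ms[iterations - 1]) / 3;
        System.out.printf("steady state (mean of last 3): %.1f ms%n", steady);
    }
}
```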

Page 23

Best Practices

[Bar charts (three panels: Pentium M, AMD Athlon, SPARC): performance relative to best for the 1st, 2nd, and 3rd iterations of JVM A and JVM B on antlr, bloat, chart, eclipse, fop, hsqldb, jython, lusearch, luindex, pmd, and xalan, plus min, max, and geomean]

Page 24

JIT Innovation Best Practices

Example: new compiler optimization
– Code quality: Does it improve the application code?
– Compile time: How much compile time does it add?
– Total time: compiler and application time together
– Problem: adaptive compilation responds to compilation load
– Question: How do we tease all these effects apart?

Page 25

JIT Innovation Best Practices

Teasing apart compile time and code quality requires multiple experiments
• Total time: Mix methodology
– Run adaptive system as intended
• Result: mixture of optimized and unoptimized code
– First & second iterations (that include compile time)
– Set and/or report the heap size as a function of maximum live size of the application
– Report: average and show statistical error
• Code quality (the "Better" option is sketched below)
– OK: Run iterations until performance stabilizes on "best", or
– Better: Run several iterations of the benchmark, turn off the compiler, and measure a run guaranteed to have no compilation
– Best: Replay mix compilation
• Compile time
– Requires the compiler to be deterministic
– Replay mix compilation
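One way to read the "Better" option, as a sketch: warm up until compilation has mostly finished, stop further compilation, then time a compile-free run. Stock JVMs expose no such switch from application code, so disableCompilation() below is a hypothetical hook of the kind a research VM such as Jikes RVM provides.

```java
public class CodeQualityRun {
    // Hypothetical stand-in for a research VM's "stop compiling" hook.
    static class VMHooks {
        static void disableCompilation() { /* no-op stub in this sketch */ }
    }

    static void runWorkload() { /* hypothetical benchmark body */ }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) runWorkload(); // warm up: let the JIT finish
        VMHooks.disableCompilation();              // freeze the compiled code
        long t0 = System.nanoTime();
        runWorkload();                             // timed, compile-free run
        System.out.printf("code quality: %.1f ms%n", (System.nanoTime() - t0) / 1e6);
    }
}
```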

Page 26

Replay Compilation

Force the JIT to produce a deterministic result
• Make a compilation profiler & replayer
Profiler
– Profile first or later iterations with adaptive JIT, pick best or average
– Record profiling information used in compilation decisions, e.g., dynamic profiles of edges, paths, &/or dynamic call graph
– Record compilation decisions, e.g., compile method bar at level two, inline method foo into bar
– Mix of optimized and unoptimized, or all optimized/unoptimized
Replayer
– Reads in profile
– As the system loads each class, apply profile +/- innovation
• Result (the data flow is sketched below)
– controlled experiments with deterministic compiler behavior
– reduces statistical variance in measurements
• Still not a perfect methodology for inlining
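A minimal sketch of the profile-then-replay data flow, assuming Java 16+ records; the names are illustrative, since real replay machinery lives inside the VM (e.g., Jikes RVM's advice files), not in application code.

```java
import java.util.List;
import java.util.Map;

public class ReplayDemo {
    // One recorded decision: compile `method` at `optLevel`, inlining `inlinees`.
    record CompileDecision(String method, int optLevel, List<String> inlinees) {}

    // Profile captured in a prior adaptive run (best or average of several).
    static final Map<String, CompileDecision> PROFILE = Map.of(
            "Foo.bar()", new CompileDecision("Foo.bar()", 2, List.of("Foo.foo()")));

    // The replayer consults the profile as each method is loaded and applies
    // the recorded decision instead of making an online adaptive one.
    static CompileDecision adviceFor(String method) {
        return PROFILE.get(method);   // null => leave the method unoptimized
    }

    public static void main(String[] args) {
        System.out.println(adviceFor("Foo.bar()"));
    }
}
```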

Page 27

GC Innovation Best Practices

• Requires more than one experiment...
• Use & report a range of fixed heap sizes (a sweep is sketched below)
– Explore the space-time tradeoff
– Measure heap size with respect to the maximum live size of the application
– VMs should report total memory, not just application memory
• Different GC algorithms vary in the meta-data they require
• JIT and VM use memory...
• Measure time with a constant workload
– Do not measure throughput
• Best: run two experiments
– mix with adaptive methodology: what users are likely to see in practice
– replay: hold the compiler activity constant
• Choose a profile with "best" application performance in order to keep from hiding mutator overheads in bad code.
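A sketch of the heap-size sweep recommended above: run the benchmark in a fresh JVM at fixed heap sizes expressed as multiples of the measured maximum live size. The 20 MB live size, the multipliers, and benchmark.jar are assumptions for illustration.

```java
import java.io.IOException;

public class HeapSweep {
    public static void main(String[] args) throws IOException, InterruptedException {
        int maxLiveMB = 20;  // maximum live size, measured in a separate run
        for (double factor : new double[]{1.5, 2.0, 3.0, 4.0, 6.0}) {
            int heapMB = (int) (factor * maxLiveMB);
            // Fix -Xms == -Xmx so the heap cannot grow or shrink mid-run.
            Process p = new ProcessBuilder(
                    "java", "-Xms" + heapMB + "m", "-Xmx" + heapMB + "m",
                    "-jar", "benchmark.jar")
                    .inheritIO()
                    .start();
            p.waitFor();
            System.out.println("completed run with heap = " + heapMB + " MB");
        }
    }
}
```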

Page 28

Architecture Innovation Best Practices

• Requires more than one experiment...
• Use more than one VM
• Set a modest heap size and/or report heap size as a function of maximum live size
• Use a mixture of optimized and uncompiled code
• Simulator needs the "same" code in many cases to perform comparisons
• Best for microarchitecture-only changes:
– Multiple traces from live system with adaptive methodology
• start up and steady state with compiler turned off
• what users are likely to see in practice
• Won't work if architecture change requires recompilation, e.g., a new sampling mechanism
– Use replay to make the code as similar as possible

Page 29

There are lies, damn lies, and statistics [Disraeli] … and benchmarks

Quotes from recent research papers:

"sometimes more than twice as fast"
"our …. is better or almost as good as …. across the board"
"garbage collection degrades performance by 70%"
"speedups of 1.2x to 6.4x on a variety of benchmarks"
"our prototype has usable performance"
"the overhead …. is on average negligible"
"…demonstrating high efficiency and scalability"
"our algorithm is highly efficient"
"can reduce garbage collection time by 50% to 75%"
"speedups…. are very significant (up to 54-fold)"
"speed up by 10-25% in many cases…" "…about 2x in two cases…" "…more than 10x in two small benchmarks"
"…improves throughput by up to 41x"

Page 30

Conclusions

• Methodology includes
– Benchmarks
– Experimental design
– Statistical analysis [OOPSLA 2007]
• Poor methodology
– can focus or misdirect innovation and energy
• We have a unique opportunity
– Transactional memory, multicore performance, dynamic languages
• What we can do
– Enlist VM builders to include replay
– Fund and broaden participation in benchmarking
• Research and industrial partnerships
• Funding through NSF, ACM, SPEC, industry, or ??
– Participate in building community workloads

Page 31

CS380 C

• More on Java Benchmarking
– www.dacapobench.org
• Alias analysis
– Read: A. Diwan, K. S. McKinley, and J. E. B. Moss, Using Types to Analyze and Optimize Object-Oriented Programs, ACM Transactions on Programming Languages and Systems, 23(1):30-72, January 2001.

Page 32

Suggested Readings: Performance Evaluation of JVMs

• How Java Programs Interact with Virtual Machines at the Microarchitectural Level, Lieven Eeckhout, Andy Georges, and Koen De Bosschere, The 18th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA'03), Oct. 2003.

• Method-Level Phase Behavior in Java Workloads, Andy Georges, Dries Buytaert, Lieven Eeckhout, and Koen De Bosschere, The 19th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA'04), Oct. 2004.

• Myths and Realities: The Performance Impact of Garbage Collection, S. M. Blackburn, P. Cheng, and K. S. McKinley, ACM SIGMETRICS Conference on Measurement & Modeling of Computer Systems, pp. 25--36, New York, NY, June 2004.

• The DaCapo Benchmarks: Java Benchmarking Development and Analysis, S. M. Blackburn et al., The ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), Portland, OR, pp. 191--208, October 2006.

• Statistically Rigorous Java Performance Evaluation, A. Georges, D. Buytaert, and L. Eeckhout, The ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), Montreal, Canada, Oct. 2007. To appear.