maximizing intel® compiler performance using iterative feedback directed optimization (ifdo) ilya...

27
Maximizing Intel® Compiler Performance using Iterative Feedback Directed Optimization (IFDO) Ilya Cherny Software Manager Software and Services Group Thanks to Leonid Brusencov, Sergey Ermolaev, Artem Chirtsov, Sergey Grebenkin! October 23, 2008

Upload: russell-grant

Post on 14-Jan-2016

261 views

Category:

Documents


2 download

TRANSCRIPT

Maximizing Intel® Compiler Performance using

Iterative Feedback Directed Optimization (IFDO)

Ilya ChernySoftware Manager

Software and Services Group

Thanks to Leonid Brusencov, Sergey Ermolaev, Artem Chirtsov, Sergey Grebenkin!October 23, 2008

22

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 2

SECR 2008 October 23

Agenda

•Why IFDO

•What is IFDO

•Alternative approaches

•Experimental Results

•Future plans

33

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 3

SECR 2008 October 23

A Performance Experiment•Take 3 nested loops for matrix multiplication

for(i = 0; i < N; i++) {

for(j = 0; j < N; j++) {

c[i][j] = 0;

for(k = 0; k < N; k++) {

c[i][j] += a[i][k] * b[k][j];

} } }

•Compile with maximum optimization−‘icl –O3 matmul.c’

•Run and measure time – 20 sec

•Compile with enforced vectorization−‘icl -O3 –mP2OPT_vec_always=T’

•Run and measure time – 11 sec

Why don’t switch on vectorization always?

>icl –help

..

/O3 optimize for maximum speed and enable high-level optimizations..

44

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 4

SECR 2008 October 23

A Performance Experiment II•Switch ‘i’ and ‘j’ loops, move initialization loop outside

for(i = 0; i < N; i++)

for(j = 0; j < N; j++)

c[i][j] = 0;

for(j = 0; j < N; j++) {

for(i = 0; i < N; i++) {

for(k = 0; k < N; k++) {

c[i][j] += a[i][k] * b[k][j];

} } }

•Compiler with O3 optimization

•Measure run time – 18 sec

•Compile with enforced vectorization

•Measure run time – 21 sec

Thoughtless optimization loses 15%

55

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 5

SECR 2008 October 23

Why Optimizations Do Not Always Win?

Source code adding 4 integers for(i=0;i<4;i++) A[i]=A[i]+B[i]

loop_start:

mov ebx, B[edi]

add A[edi], ebx

inc edi

loop loop_start

movdqa xmm1, A

paddd xmm1, B

movdqa A, xmm1

•Vector code could be 2-4x times faster, but..

−..what if array size is not a multiple of 4?

−..what if A is not aligned by 16 bytes?

−..what if A and B aligned differently?

•Final code has an overhead of these 3 ‘if’

Compile to vector codeCompile to scalar code

66

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 6

SECR 2008 October 23

Heuristics for Parameterization of Compiler Optimizations

•Compiler has hundreds of performance optimizations

•Each optimization has at least 1 Boolean parameter

•Compiler has heuristics to make a decision about each optimization

•But still code produced by the compiler is far from optimal

−Each application has unique behavior (loop counts, memory accesses)

−Each machine has unique characteristics (latencies, caches)

−It is impossible to target ALL compiler heuristics for ALL machines for ALL applications

Compiler optimizations have considerableperformance headroom if parameterized right

77

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 7

SECR 2008 October 23

Agenda

•Why IFDO

•What is IFDO

•Alternative approaches

•Experimental Results

•Future plans

88

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 8

SECR 2008 October 23

IFDO process

1. Optimization: Compile executable from the sources

2.Feedback:Run executable with profiling

3.Directed:Analyze results, define options for the next compilation

4.Iterative:Repeat while time permits

99

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 9

SECR 2008 October 23

Search Algorithms

Search algorithms find the maximum much faster than after 2n iterations

•Implemented 6 algorithms to do search in the options space [20]

−exhaustive search with priorities

−batch elimination

−iterative elimination

−combined elimination

−genetic algorithm

−statistical selection [23]

1010

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 10

SECR 2008 October 23

Compiler Optimization Options•Selected 6 undocumented options which seem to have the

highest impact

−loop vectorization

−loop fusion

−loop distribution

−loop unrolling

−data blocking

−memory prefetch

•Number of total independent option values is 13

•Search space for these 6 options has 7000 combinations

•Extracted all compiler options from the sources and just started the runs for more than 1000 options

1111

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 11

SECR 2008 October 23

IFDO Tool

IFDO tool automates iteration process

•User defines how to build and execute his application

•IFDO tool outputs the best binary and all performance data

•Based on IFDO results user can

−change compilation options or pragmas

−modify sources

−improve compiler

::User

1: Build application

application launcher

IFDO_compiler

application builder

application executable log_data [iteration]

1212

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 12

SECR 2008 October 23

IFDO Sample Output - SPEC CPU2000 173.applu

41% performance gain!

iter stat proc ticks impr% noise% parameter val parameter val

-------------------------------------------------------------------------------

1 ... *Total* 9571493442 +0.00 0.21

1 ... _SSOR 7335205487 +0.00 0.26 VectorizeP 1 BlockP 1

1 ... _RHS 2218915958 +0.00 0.02 VectorizeP 1 BlockP 1

2 ... *Total* 6234529813 +34.86 0.01 2 ... SSOR 4048766558 +44.80 0.01 VectorizeP 1 BlockP 2

2 ... _RHS 2168605738 +2.27 0.05 VectorizeP 1 BlockP 2

3 ... *Total* 12889959745 -34.67 0.06

3 ... SSOR 11309558116 -54.18 0.07 VectorizeP 2 BlockP 1

3 ... RHS 1563970212 +29.52 0.01 VectorizeP 2 BlockP 1

4 ... *Total* 10806270474 -12.90 0.15

4 ... _SSOR 8657609979 -18.03 0.10 VectorizeP 3 BlockP 1

4 ... _RHS 2102104477 +5.26 0.01 VectorizeP 3 BlockP 1

5 *.. *Total* 5602632636 +41.47 0.00 5 *.. SSOR 4043669682 +44.87 0.02 VectorizeP 1 BlockP 2

5 *.. RHS 1542238426 +30.50 0.06 VectorizeP 2 BlockP 1

But Intel® Compiler improved since 10.0 & SPEC 173.applu has 8% only headroom now!

1313

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 13

SECR 2008 October 23

Agenda

•Why IFDO

•What is IFDO

•Alternative approaches

•Experimental Results

•Future plans

1414

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 14

SECR 2008 October 23

Comparison to other publications

Our work has two novel characteristics

•There are 23 references to the similar papers

•No publications have compiler with procedure level granularity

•All publications were limited by about 50 options, while we started exploring 1000 of undocumented compiler options

1515

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 15

SECR 2008 October 23

Comparison to other toolsGranularity Profiling Conclusion

Manual options search

-

whole application or source changes

-/+

any profiler, but manually

user time consuming

PGOprof_gen/prof_use)

+

basic block level

-

basic block counters only

2 iterations only

PathScale PathOpt2

-

whole application

-

whole application

40% less results

IFDO +

function or loop

+

instrumentation, VTune

not available in the product

1616

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 16

SECR 2008 October 23

Agenda

•Why IFDO

•What is IFDO

•Alternative approaches

•Experimental Results

•Future plans

1717

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 17

SECR 2008 October 23

0

1

2

3

4

5

1 11 21 31 41 51 61 71 81 91 101

Number of iteration

Per

form

an

ce g

ain

, % ES

BE

IE

CE

GA

SS

Search Algorithms – Performance Growth dependency on Iteration

All algorithms gain 2.5%-4% in CPU2000 total time

•BE works for independent options only, but just 14 iterations

•IE and CE get 99% in 30 iterations – the most effective

1818

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 18

SECR 2008 October 23

Procedure vs. Application granularity

8 from 22 benchmarks gain from procedure level

0

1

2

3

4

5

6

7

8

9

10

164

168

173

175

176

177

179

181

183

186

187

188

189

191

197

252

253

254

255

256

300

301

SPEC CPU2000 Benchmark

Per

form

ance

gai

n, %

Procedure

Application

1919

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 19

SECR 2008 October 23

Options & Values Contribution to Performance Increase

Each option gives about 2 percent

0

0.5

1

1.5

2

2.5

3

Block

P{2}

Distri

butio

nP{2}

Distri

butio

nP{3}

Distri

butio

nP{4}

FusionP

{2}

FusionP

{3}

HLOUnro

llFac

torP

{1}

HLOUnro

llFac

torP

{2}

HLOUnro

llFac

torP

{3}

HLOUnro

llFac

torP

{4}

Unrol

lAndJ

amP{2

}

Vecto

rizeP

{2}

Vecto

rizeP

{3}

Options

Pe

rfo

rma

nc

e in

cre

as

e, %

2020

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 20

SECR 2008 October 23

0.00

100.00

200.00

300.00

400.00

500.00

600.00

700.00

800.00

900.00

1000.00

< -100%

-80% -100%

-80% -40%

-40% -20%

-20% -10%

-10% -5%

-5%-2%

-2%-1%

-1% 0%

0% 1%

1% 2%

2% 5%

5% 10%

10% 20%

20% 40%

40% 80%

80% 100%

>100%

Ranges of performance impact, %

Nu

mb

er

of

op

tio

ns

1000 Options Impact on Performance

Only 600 from 3000 options have zero impact

2121

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 21

SECR 2008 October 23

Agenda

•Why IFDO

•What is IFDO

•Alternative approaches

•Experimental Results

•Future plans

2222

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 22

SECR 2008 October 23

Future Plans

Useful to both compiler developers and users

•Make experiments with all undocumented options

−3000 values if no impact on each other

−more than 23000 combinations!

•Implement storing of application properties

−number of FP expressions, number of loops, etc.

•Implement expert/machine learning system

−suggest options according to application properties

−may decrease number of iterations down to 1

−can substitute existing compiler heuristics?

2323

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 23

SECR 2008 October 23

Summary

•If performance’s critical, try at least ‘icl –O3’!

Спасибо!

Thanks!

2525

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 25

SECR 2008 October 23

References 1-13[1] F. Bodin, T. Kisuki, P. Knijnenburg, M. O’Boyle, and E. Rohou, “Iterative compilation in a non-linear optimization space”, In Proc. ACM

Workshop on Profile and Feedback Directed Compilation, 1998, Organized in conjunction with PACT98.

[2] K. Cooper, D. Subramanian, and L. Torczon, “Adaptive optimizing compilers for the 21st century”, J. of Supercomputing, 32(1), 2002.

[3] J. Bilmes, K. Asanovic, C. Chin, and J. Demmel, “Optimizing matrix multiply using PHiPAC: A portable, high- performance, ANSI C coding methodology”, In Proc. ICS, pages 340-347, 1997.

[4] M. Stephenson and S. Amarasinghe, “Prediction unroll factor using supervised classification”, In ERRR/ACM International Symposium on Code Generation and Optimization (CGO 2005), ERRR Computer Society, 2005.

[5] Yom-Toy, J. Thomson, O. Temam, A. Zaks, H. Leather, C. Miranda, M. Namolaru, E. Bonilla, Saclay, B. Mendelson, C. Williams, Haifa, M. O’Boyle, P. Barnard, E. Ashton, E. Courtois, F. Bodin “MILEPOST GCC: machine learning based research compiler”, ARC, International, UK, CAPS Enterprise, France, 2007.

[6] K. Hoste, L. Eeckhout, “COLE: Compiler Optimization Level Exploration”, ELIS Department, Ghent University, Sing-Pietersnieuwstraat 41, B-9000 Gent, Belgium, 2008.

[7] K. Deb, “Multi-Objective Optimization using Evolutionary Algorithms”, Wiley, 2001.

[8] G. Fursin, J. Cavazos, M. O’Boyle, and O. Temam, “MiDataSets: Creating the Conditions for a More Realistic Evaluation of Iterative Optimization”, ALCHEMY Group, INRIA Futurs and LRI, Paris-Sud University, France, 2007.

[9] M. Byler, M. Wolfe, J.R.B. Davies, C. Huson, and B. Leasure, “Multiple version loops.” In ICPP, 1987, pages 312-318, 2005.

[10] K. D. Cooper, M. W. Hall, and K. Kennedy, “Procedure cloning”, In Proceedings of the 1992 IEEE International Conference on Computer Language, pages 99-105, 1992.

[11] P. Diniz and M. Rinard. “Dynamic feedback: An effective technique for adaptive computing”, In Proc. PLDI, pages 71-84, 1997.

[12] G. Fursin, C. Miranda, S. Pop, A. Cohen, O. Temam, “Practical Run-time Adaptation with Procedure Cloning to Enable Continuous Collective Compilation”, Alchemy group, INRIA Futurs and LRI, Paris-Sud 11 University, Orsay, France, 2007.

[13] V. Bala, E. Duesterwald, and S. Banerjia, “Dynamo: A transparent dynamic optimization system”, In ACM SIGPLAN Notices, 2000.

2626

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 26

SECR 2008 October 23

References 14-23[14] R. H. Saavedra and D. Park, “Improving the effectiveness of software prefetching with adaptive execution”, In Conference on

Parallel Architectures and Compilation Techniques (PACT’96), 1996.

[15] M. Voss and R. Eigemann, “High-level adaptive program optimization with adapt”, In Proceedings of the Symposium on Principles and practices of parallel programming, 2001.

[16] G. Fursin, A. Cohen, M. O’Boyle, and O. Temam, “A Practical Method For Quickly Evaluating Program Optimizations”, Institute for Computing Systems Architecture, University of Edinburgh, UK, 2005.

[17] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically characterizing large scale program behavior”, In 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.

[18] J. Lau, S. Schoenmackers, and B. Calder, “Transition phase classification and prediction”, In International Symposium on High Performance Computer Architecture, 2005.

[19] Z. Pan, R. Eignmann, “Fast and Effective Orchestration of Compiler Optimizations for Automatic Performance Tuning.”, Proceedings of the International Symposium on Code Generation and Optimization, 2006.

[20] S. Triantafyllis, M.J. Bridges, E. Raman, G. Ottoni and D. August, “A Framework for Unrestricted Whole-Program Optimization.”, Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2006.

[21] Z. Pan, R. Eignmann, “Fast, Automatic, Procedure-Level Performance Tuning.”, Proceedings of the 15th International Conference on Parallel Architecture and Compilation Techniques, 2006.

[22] H. Feltl, “Ein Genetischer Algorithmus fuer das Generalized Assignment Problem”, Diplomarbeit, 2003.

[23] M. Haneda, P. Knijnenburg, H. Wijshoff “Automatic Selection of Compiler Options Using Non-Parametristic Inferential Statistics.”, Proceedings of the 14th International Conference on Parallel Architecture and Compilation Techniques, 2005.

2727

Maximizing Intel® Compiler

Performance using IFDOSoftware and Services Group 27

SECR 2008 October 23

Standard Take-Away Banner. Add Text Here!

Basic Foil with Take-Away BannerUse Verdana Bold for Main Body Subheadings

•Use Verdana regular for main body text.

•Use charcoal gray (RGB 51 – 51 – 51) color as the default

•Text size can vary. Use these minimum recommended font sizes

−Slide title: 32 pt

−Main body subheadings: 20 pt

−Bullet points: 18 pt (with sub-bullets reducing by 2 pt each – 16 pt, 14 pt, etc)

−Tables: 14 pt

−Diagram and chart labels: 12 pt

•Emphasize with italics, bold or color (blue)

•Use text boxes to highlight content

•Primary text should be on background color, not photos, etc.

•Use bullets same color as text