Transcript
Page 1: LLVM-based Communication Optimizations for PGAS Programsllvm-hpc2-workshop.github.io/slides/Hayashi.pdf · LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi

LLVM-based Communication Optimizations for PGAS Programs

Akihiro Hayashi (Rice University)

Jisheng Zhao (Rice University) Michael Ferguson (Cray Inc.) Vivek Sarkar (Rice University)

2nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15


Page 2:

A Big Picture

[Figure: diagram with PGAS language logos (X10, Habanero-UPC++, …), a "Communication Optimization" stage, and system photos (© Argonne National Lab., © RIKEN AICS, © Berkeley Lab.).]

Photo Credits: http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/

Page 3:

PGAS Languages

- High-productivity features:
  - Global view
  - Task parallelism
  - Data distribution
  - Synchronization

[Figure: logos of PGAS languages, including X10, CAF, and Habanero-UPC++.]

Photo Credits: http://chapel.cray.com/logo.html, http://upc.lbl.gov/

Page 4:

Communication is Implicit in some PGAS Programming Models

- Global Address Space
  - The compiler and runtime are responsible for performing communication across nodes

Remote Data Access in Chapel:

    var x = 1;          // on Node 0
    on Locales[1] {     // on Node 1
      … = x;            // DATA ACCESS
    }

Page 5:

Communication is Implicit in some PGAS Programming Models (Cont’d)

Remote Data Access:

    var x = 1;          // on Node 0
    on Locales[1] {     // on Node 1
      … = x;            // DATA ACCESS
    }

Runtime affinity handling:

    if (x.locale == MYLOCALE) {
      *(x.addr) = 1;
    } else {
      gasnet_get(…);
    }

OR

Compiler Optimization:

    var x = 1;
    on Locales[1] {
      … = 1;
    }
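To make the runtime affinity check concrete, here is a minimal Python sketch. It is a toy model, not the Chapel runtime: `GlobalPtr`, `Runtime`, and the `remote_gets` counter are hypothetical names introduced for illustration, and the "remote GET" is only simulated (it stands in for a call such as `gasnet_get(…)`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalPtr:
    """Hypothetical global pointer: owning locale plus an index into its memory."""
    locale: int
    addr: int


class Runtime:
    """Toy runtime: one memory list per locale and a counter of simulated remote GETs."""

    def __init__(self, num_locales, words_per_locale):
        self.memory = [[0] * words_per_locale for _ in range(num_locales)]
        self.remote_gets = 0

    def read(self, here, ptr):
        # Runtime affinity check: dereference locally if the data lives here.
        if ptr.locale == here:
            return self.memory[here][ptr.addr]
        # Otherwise perform a (simulated) remote GET.
        self.remote_gets += 1
        return self.memory[ptr.locale][ptr.addr]


rt = Runtime(num_locales=2, words_per_locale=4)
x = GlobalPtr(locale=0, addr=0)
rt.memory[0][0] = 1                      # var x = 1;  on locale 0
value_on_1 = rt.read(here=1, ptr=x)      # ... = x;    on locale 1, remote GET
value_on_0 = rt.read(here=0, ptr=x)      # same read on locale 0, local access
```

Every read pays for the locale comparison, and reads from another locale also pay for a message. The compiler optimization on this slide removes both costs when the value is known at compile time.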

Page 6:

Communication Optimization is Important

[Figure: latency (ms) versus transferred bytes for a synthetic Chapel program on Intel Xeon CPU X5660 clusters with QDR InfiniBand, comparing "Optimized (Bulk Transfer)" with "Unoptimized". Axis ticks run from 1 to 10,000,000. Annotated gaps: 1,500x and 59x. Lower is better.]

Page 7:

PGAS Optimizations are language-specific

[Figure: each PGAS language (Chapel, UPC, X10, Habanero-UPC++, …) goes through its own compiler (Chapel Compiler, UPC Compiler, X10 Compiler, Habanero-C Compiler). System photos: © Argonne National Lab., © Berkeley Lab., © RIKEN AICS.]

Page 8:

Our goal

[Figure: diagram with language logos (X10, Habanero-UPC++, …), the LLVM logo, and system photos (© Argonne National Lab., © RIKEN AICS, © Berkeley Lab.).]

Page 9:

Why LLVM?

- Widely used language-agnostic compiler

[Figure: LLVM architecture. Frontends (Clang for C/C++; dragonegg for C/C++, Fortran, Ada, Objective-C; a Chapel frontend; a UPC++ frontend) emit LLVM Intermediate Representation (LLVM IR). Analysis & Optimizations run on the IR. Backends (x86, PowerPC, ARM, PTX) produce x86, PPC, ARM, and GPU binaries.]

Page 10:

Summary & Contributions

- Our observations:
  - Many PGAS languages share semantically similar constructs
  - PGAS optimizations are language-specific
- Contributions:
  - Built a compilation framework that can uniformly optimize PGAS programs (initial focus: communication)
    - Enabling existing LLVM passes for communication optimizations
    - PGAS-aware communication optimizations

Page 11:

Overview of our framework

[Figure: framework pipeline. Chapel, UPC++, X10, and CAF programs pass through their frontends (Chapel-LLVM, UPC++-LLVM, X10-LLVM, CAF-LLVM) to LLVM IR, then through the LLVM-based communication optimization passes and a lowering pass.]

The LLVM IR in the pipeline is:
1. Vanilla LLVM IR
2. Using the address space feature to express communication

Annotations on the figure:
- "Need to be implemented when supporting a new language/runtime"
- "Generally language-agnostic"

Page 12:

How optimizations work

Chapel:

    // x is possibly remote
    x = 1;

UPC++:

    shared_var<int> x;
    x = 1;

LLVM IR:

    store i64 1, i64 addrspace(100)* %x, …

Address space-aware optimizations:
1. Existing LLVM optimizations (treat remote access as if it were local access)
2. PGAS-aware optimizations

Runtime-specific lowering then produces communication API calls.

Page 13:

LLVM-based Communication Optimizations for Chapel

1. Enabling existing LLVM passes
   - Loop invariant code motion (LICM)
   - Scalar replacement, …
2. Aggregation
   - Combine sequences of loads/stores on adjacent memory locations into a single memcpy

These are already implemented in the standard Chapel compiler.

Page 14:

An Optimization Example: LICM for Communication Optimizations

LICM = Loop Invariant Code Motion

    for i in 1..100 {
      %x = load i64 addrspace(100)* %xptr
      A(i) = %x;
    }

LICM by LLVM: the load from %xptr does not change across iterations, so it can be moved out of the loop.
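The effect of hoisting the possibly-remote load can be sketched in Python. This is a toy model rather than the LLVM pass itself: `RemoteCell` and its `gets` counter are hypothetical names that simulate one GET per load, and the hoisting is valid only because nothing in the loop writes to the loaded variable.

```python
class RemoteCell:
    """Toy stand-in for a possibly-remote variable; counts simulated GETs."""

    def __init__(self, value):
        self._value = value
        self.gets = 0

    def get(self):
        self.gets += 1
        return self._value


def fill_unoptimized(cell, n):
    """Load the remote value inside the loop: one GET per iteration."""
    a = [0] * n
    for i in range(n):
        a[i] = cell.get()
    return a


def fill_hoisted(cell, n):
    """Loop-invariant load hoisted out of the loop: a single GET in total."""
    x = cell.get()
    a = [0] * n
    for i in range(n):
        a[i] = x
    return a


before = RemoteCell(42)
after = RemoteCell(42)
a_before = fill_unoptimized(before, 100)
a_after = fill_hoisted(after, 100)
```

Both versions fill the array with the same values; only the number of simulated GETs differs (100 versus 1 for the 100-iteration loop on the slide).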

Page 15:

An Optimization Example: Aggregation

    // p is possibly remote
    sum = p.x + p.y;

Before (two GETs, one for x and one for y):

    load i64 addrspace(100)* %pptr+0
    load i64 addrspace(100)* %pptr+4

After (one GET covering both x and y):

    llvm.memcpy(…);
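A small Python sketch of the same idea follows. It is an illustration under stated assumptions: `RemoteMemory`, `get`, and `get_bulk` are hypothetical names, offsets are counted in words rather than bytes, and each call simply increments a message counter instead of talking to a real network.

```python
class RemoteMemory:
    """Toy remote memory: a flat list of words; counts simulated GET messages."""

    def __init__(self, words):
        self.words = list(words)
        self.gets = 0

    def get(self, offset):
        """Fetch one word: one message."""
        self.gets += 1
        return self.words[offset]

    def get_bulk(self, offset, count):
        """Fetch `count` adjacent words: still one message."""
        self.gets += 1
        return self.words[offset:offset + count]


def sum_fields_separate(mem, base):
    """Two loads on adjacent locations, each issued as its own GET."""
    return mem.get(base) + mem.get(base + 1)


def sum_fields_aggregated(mem, base):
    """The two adjacent loads combined into a single bulk GET."""
    x, y = mem.get_bulk(base, 2)
    return x + y


mem_a = RemoteMemory([10, 32])
mem_b = RemoteMemory([10, 32])
sum_a = sum_fields_separate(mem_a, 0)
sum_b = sum_fields_aggregated(mem_b, 0)
```

The aggregated version returns the same sum with half the messages, which is the point of replacing the two loads with one memcpy.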

Page 16:

LLVM-based Communication Optimizations for Chapel

3. Locality Optimization
   - Infer the locality of data and convert possibly-remote accesses to definitely-local accesses at compile time where possible
4. Coalescing
   - Remote array access vectorization

These are implemented, but not in the standard Chapel compiler.

Page 17:

An Optimization Example: Locality Optimization

    proc habanero(ref x, ref y, ref z) {
      var p: int = 0;
      var A:[1..N] int;
      local { p = z; }
      z = A(0) + z;
    }

1. A is definitely local
2. p and z are definitely local
3. Definitely-local access! (avoids runtime affinity checking)

Page 18:

An Optimization Example: Coalescing

Before:

    for i in 1..N {
      … = A(i);
    }

After:

    localA = A;          // perform bulk transfer
    for i in 1..N {
      … = localA(i);     // converted to definitely-local access
    }
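Why one bulk transfer beats N element-wise accesses can be seen from a simple cost model. The sketch below is an assumption, not a measurement from the talk: it uses a generic latency-plus-bandwidth model in which every message costs a fixed `alpha` plus `beta` per byte, and the parameter values are hypothetical.

```python
def elementwise_cost(n, alpha, beta, word_bytes=8):
    """N separate messages: each pays the fixed per-message cost alpha."""
    return n * (alpha + beta * word_bytes)


def bulk_cost(n, alpha, beta, word_bytes=8):
    """One message carrying all N words: alpha is paid only once."""
    return alpha + beta * n * word_bytes


# Hypothetical parameters, chosen only to illustrate the shape of the model.
ALPHA = 2.0   # fixed cost per message
BETA = 0.5    # cost per byte
N = 1000

saved = elementwise_cost(N, ALPHA, BETA) - bulk_cost(N, ALPHA, BETA)
```

In this model the saving is exactly `(N - 1) * alpha`: the bytes moved are the same, and coalescing only removes the repeated per-message overhead.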

Page 19:

Performance Evaluations: Benchmarks

    Application        Size
    Smith-Waterman     185,600 x 192,000
    Cholesky Decomp    10,000 x 10,000
    NPB EP             CLASS = D
    Sobel              48,000 x 48,000
    SSCA2 Kernel 4     SCALE = 16
    Stream EP          2^30

Page 20:

Performance Evaluations: Platforms

- Cray XC30™ Supercomputer @ NERSC
  - Node
    - Intel Xeon E5-2695 @ 2.40GHz x 24 cores
    - 64 GB of RAM
  - Interconnect
    - Cray Aries interconnect with Dragonfly topology
- Westmere Cluster @ Rice
  - Node
    - Intel Xeon CPU X5660 @ 2.80GHz x 12 cores
    - 48 GB of RAM
  - Interconnect
    - Quad data rate (QDR) InfiniBand

Page 21:

Performance Evaluations: Details of Compiler & Runtime

- Compiler
  - Chapel Compiler version 1.9.0
  - LLVM 3.3
- Runtime
  - GASNet-1.22.0
    - Cray XC: aries
    - Westmere Cluster: ibv-conduit
  - Qthreads-1.10
    - Cray XC: 2 shepherds, 24 workers / shepherd
    - Westmere Cluster: 2 shepherds, 6 workers / shepherd

Page 22:

BRIEF SUMMARY OF PERFORMANCE EVALUATIONS

Performance Evaluation


Page 23:

Results on the Cray XC (LLVM-unopt vs. LLVM-allopt)

[Figure: bar chart of performance improvement over LLVM-unopt (y-axis 0.0 to 5.0) for SW, Cholesky, Sobel, StreamEP, EP, and SSCA2, with legend entries Coalescing, Locality Opt, Aggregation, and Existing. Bar labels in extraction order: 2.1x, 19.5x, 1.1x, 2.4x, 1.4x, 1.3x. Higher is better.]

- 4.6x performance improvement on average relative to LLVM-unopt on the same number of locales (1, 2, 4, 8, 16, 32, 64 locales)

Page 24:

Results on Westmere Cluster (LLVM-unopt vs. LLVM-allopt)

[Figure: bar chart of performance improvement over LLVM-unopt (y-axis 0.0 to 5.0) for SW, Cholesky, Sobel, StreamEP, EP, and SSCA2, with legend entries Coalescing, Locality Opt, Aggregation, and Existing. Bar labels in extraction order: 2.3x, 16.9x, 1.1x, 2.5x, 1.3x, 2.3x.]

- 4.4x performance improvement on average relative to LLVM-unopt on the same number of locales (1, 2, 4, 8, 16, 32, 64 locales)
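The two headline averages can be cross-checked against the six bar labels on each results slide. The sketch below assumes the "average" is the arithmetic mean of those six per-benchmark factors; the slides do not state how it was computed, so this is only a consistency check.

```python
# Bar labels as they appear on the Cray XC and Westmere results slides.
cray_xc = [2.1, 19.5, 1.1, 2.4, 1.4, 1.3]
westmere = [2.3, 16.9, 1.1, 2.5, 1.3, 2.3]


def mean(values):
    """Arithmetic mean (assumed to be the averaging method used on the slides)."""
    return sum(values) / len(values)


cray_mean = round(mean(cray_xc), 1)
westmere_mean = round(mean(westmere), 1)
```

Under that assumption the means round to 4.6 and 4.4, matching the slides. Both averages are dominated by the single largest factor (19.5x and 16.9x).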

Page 25:

DETAILED RESULTS & ANALYSIS OF CHOLESKY DECOMPOSITION

Performance Evaluation


Page 26:

Cholesky Decomposition

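[Figure: triangular tile layout with dependencies between tiles; each tile is labeled with the node that owns it (Node 0 to Node 3).]

    0 0 0 1 1 2 2 2 3 3
    0 0 1 1 2 2 2 3 3
    0 1 1 2 2 2 3 3
    1 1 2 2 2 3 3
    1 2 2 2 3 3
    2 2 2 3 3
    2 2 3 3
    2 3 3
    3 3
    3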

Page 27:

Metrics

1. Performance & scalability
   - Baseline (LLVM-unopt)
   - LLVM-based optimizations (LLVM-allopt)
2. The dynamic number of communication API calls
3. Analysis of optimized code
4. Performance comparison
   - Conventional C-backend vs. LLVM-backend

Page 28:

Performance Improvement by LLVM (Cholesky on the Cray XC)

[Figure: bar chart of speedup over LLVM-unopt on 1 locale (y-axis 0 to 5) for 1, 2, 4, 8, 16, and 32 locales. Data labels in extraction order: LLVM-unopt 1, 0.1, 0.1, 0.2, 0.2, 0.3; LLVM-allopt 2.6, 2.7, 3.7, 4.1, 4.3, 4.5.]

- LLVM-based communication optimizations show scalability

Page 29:

Communication API calls elimination by LLVM (Cholesky on the Cray XC)

Dynamic number of communication API calls (normalized to LLVM-unopt):

    Call type     LLVM-unopt   LLVM-allopt   Improvement
    LOCAL_GET     100.0%       12.1%         8.3x
    REMOTE_GET    100.0%       0.2%          500x
    LOCAL_PUT     100.0%       89.2%         1.1x
    REMOTE_PUT    100.0%       100.0%
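The improvement factors follow from the normalized call counts as reciprocals. A short Python check, assuming the percentages pair with the call types in the order LOCAL_GET, REMOTE_GET, LOCAL_PUT, REMOTE_PUT:

```python
# LLVM-allopt call counts, normalized to LLVM-unopt (= 1.0), from the slide.
normalized = {
    "LOCAL_GET": 0.121,
    "REMOTE_GET": 0.002,
    "LOCAL_PUT": 0.892,
    "REMOTE_PUT": 1.0,
}

# Improvement factor = unoptimized count / optimized count = 1 / normalized count.
improvement = {name: 1.0 / frac for name, frac in normalized.items()}
```

This reproduces the 8.3x, 500x, and 1.1x annotations; REMOTE_PUT is unchanged, which is consistent with it having no annotation.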

Page 30:

Analysis of optimized code

LLVM-unopt:

    for jB in zero..tileSize-1 {
      for kB in zero..tileSize-1 {
        4 GETs
        for iB in zero..tileSize-1 {
          9 GETs + 1 PUT
        }
      }
    }

LLVM-allopt:

    1. ALLOCATE LOCAL BUFFER
    2. PERFORM BULK TRANSFER
    for jB in zero..tileSize-1 {
      for kB in zero..tileSize-1 {
        1 GET
        for iB in zero..tileSize-1 {
          1 GET + 1 PUT
        }
      }
    }
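The per-iteration counts above translate into totals for a whole tile. The Python sketch below just replays the two loop nests and counts operations; `tile_size` is a free parameter, and the single bulk transfer before the optimized loop nest is counted separately (how it is implemented is not shown on the slide).

```python
def comm_counts_unopt(tile_size):
    """Count GETs and PUTs issued by the LLVM-unopt loop nest."""
    gets = puts = 0
    for _jb in range(tile_size):
        for _kb in range(tile_size):
            gets += 4                  # 4 GETs per (jB, kB) iteration
            for _ib in range(tile_size):
                gets += 9              # 9 GETs + 1 PUT per innermost iteration
                puts += 1
    return gets, puts


def comm_counts_allopt(tile_size):
    """Count GETs, PUTs, and bulk transfers for the LLVM-allopt loop nest."""
    gets = puts = 0
    bulk_transfers = 1                 # one bulk transfer into the local buffer
    for _jb in range(tile_size):
        for _kb in range(tile_size):
            gets += 1                  # 1 GET per (jB, kB) iteration
            for _ib in range(tile_size):
                gets += 1              # 1 GET + 1 PUT per innermost iteration
                puts += 1
    return gets, puts, bulk_transfers


unopt_gets, unopt_puts = comm_counts_unopt(4)
opt_gets, opt_puts, opt_bulk = comm_counts_allopt(4)
```

In closed form the GET count drops from `4*T^2 + 9*T^3` to `T^2 + T^3` for tile size `T`, while the PUT count stays at `T^3`.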

Page 31:

Performance comparison with C-backend

[Figure: bar chart of speedup over LLVM-unopt on 1 locale (y-axis 0 to 5) for 1, 2, 4, 8, 16, 32, and 64 locales. Data labels in extraction order: C-backend 2.8, 0.1, 0.1, 0.1, 0.3, 0.7, 0.4; LLVM-unopt 1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.4; LLVM-allopt 2.6, 2.7, 3.7, 4.1, 4.3, 4.5, 4.5. Annotation: "C-backend is faster!"]

Page 32:

Current limitation

- In LLVM 3.3, many optimizations assume that the pointer size is the same across all address spaces

For LLVM code generation: 64-bit packed pointer

    | Locale (16 bit) | addr (48 bit) |

    ptr >> 48;            // locale
    ptr & 48BITS_MASK;    // addr

For C code generation: 128-bit struct pointer

    ptr.locale;
    ptr.addr;

Drawbacks of the packed pointer:
1. Needs more instructions
2. Loses opportunities for alias analysis
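The 64-bit packed layout can be illustrated with a few lines of Python. This is a sketch of the bit layout only (16-bit locale in the high bits, 48-bit address in the low bits); the function names are hypothetical and not taken from the Chapel compiler.

```python
LOCALE_BITS = 16
ADDR_BITS = 48
ADDR_MASK = (1 << ADDR_BITS) - 1      # low 48 bits set


def pack(locale, addr):
    """Pack a locale ID and a local address into one 64-bit integer."""
    if not 0 <= locale < (1 << LOCALE_BITS):
        raise ValueError("locale does not fit in 16 bits")
    if not 0 <= addr <= ADDR_MASK:
        raise ValueError("address does not fit in 48 bits")
    return (locale << ADDR_BITS) | addr


def locale_of(ptr):
    """Extract the locale: shift the high 16 bits down."""
    return ptr >> ADDR_BITS


def addr_of(ptr):
    """Extract the address: mask off the locale bits with a bitwise AND."""
    return ptr & ADDR_MASK


packed = pack(3, 0x7FFF_DEAD_BEEF)
```

Each field access costs a shift or a mask, whereas the 128-bit struct form reads a field directly. That extra work is the "needs more instructions" point on the slide.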

Page 33:

Conclusions

- LLVM-based communication optimizations for PGAS programs
  - Promising way to optimize PGAS programs in a language-agnostic manner
  - Preliminary evaluation with 6 Chapel applications
    - Cray XC30 Supercomputer: 4.6x average performance improvement
    - Westmere Cluster: 4.4x average performance improvement

Page 34:

Future work

- Extend LLVM IR to support parallel programs with PGAS and explicit task parallelism
  - Higher-level IR

[Figure: proposed flow. Parallel programs (Chapel, X10, CAF, HC, …) -> LLVM runtime-independent optimizations (1. RI-PIR generation, 2. analysis, 3. transformation; e.g. task parallel constructs) -> LLVM runtime-specific optimizations (1. RS-PIR generation, 2. analysis, 3. transformation; e.g. GASNet API) -> binary.]

Page 35:

Acknowledgements

- Special thanks to
  - Brad Chamberlain (Cray)
  - Rafael Larrosa Jimenez (UMA)
  - Rafael Asenjo Plaza (UMA)
  - Habanero Group at Rice

Page 36:

Backup slides


Page 37:

Compilation Flow

[Figure: two compilation paths from Chapel programs after AST generation and optimizations. LLVM path: LLVM IR generation -> LLVM IR -> LLVM optimizations -> binary. C path: C-code generation -> C programs -> backend compiler's optimizations (e.g. gcc -O3) -> binary.]

