Cell processor implementation of a MILC lattice QCD application


Page 1: Cell processor implementation of a MILC lattice QCD application

National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Cell processor implementation of a MILC lattice QCD application

Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb

Page 2: Cell processor implementation of a MILC lattice QCD application


Presentation outline

• Introduction
  1. Our view of MILC applications
  2. Introduction to the Cell Broadband Engine
• Implementation on the Cell/B.E.
  1. PPE performance and STREAM benchmark
  2. CPU profile and kernels to be ported
  3. Different approaches
• Performance
• Conclusion

Page 3: Cell processor implementation of a MILC lattice QCD application


Introduction

• Our target: the MIMD Lattice Computation (MILC) Collaboration code, dynamical clover fermions (clover_dynamical) using the hybrid molecular dynamics R algorithm
• Our view of the MILC applications: a sequence of communication and computation blocks (an MPI sketch of this pattern follows the figure below)

[Figure: original CPU-based implementation, an alternating chain of communication and computation blocks: MPI scatter/gather, compute loop 1, MPI scatter/gather for loop 2, compute loop 2, MPI scatter/gather for loop 3, ..., compute loop n, MPI scatter/gather for loop n+1]
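To make this pattern concrete, here is a minimal, self-contained sketch of such a compute/communicate sequence in MPI-based C. It is illustrative only, not the MILC driver: the field array, loop count, and one-value boundary exchange are invented for the example.

    /* Sketch of the alternating communication/computation structure:
     * each compute loop is preceded by an MPI exchange of boundary data
     * with neighboring ranks. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NLOOPS 3
    #define NSITES 1024                 /* sites per rank, illustrative only */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *field = calloc(NSITES, sizeof(double));
        double boundary_in = 0.0, boundary_out = 0.0;
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        for (int loop = 0; loop < NLOOPS; loop++) {
            /* "MPI scatter/gather" phase: exchange boundary data */
            boundary_out = field[NSITES - 1];
            MPI_Sendrecv(&boundary_out, 1, MPI_DOUBLE, right, 0,
                         &boundary_in,  1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* "compute loop" phase: purely local work over all sites */
            field[0] += boundary_in;
            for (int i = 0; i < NSITES; i++)
                field[i] = 0.5 * field[i] + 1.0;
        }

        free(field);
        MPI_Finalize();
        return 0;
    }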

Page 4: Cell processor implementation of a MILC lattice QCD application


Introduction

• Cell/B.E. processor: one Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs); each SPE has 256 KB of local store
• 3.2 GHz clock
• 25.6 GB/s processor-to-memory bandwidth
• >200 GB/s sustained aggregate EIB bandwidth
• Theoretical peak performance: 204.8 GFLOPS (SP) and 14.63 GFLOPS (DP); a quick derivation of the SP figure follows
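The single-precision peak follows from each SPE issuing one 4-wide single-precision fused multiply-add per cycle (my arithmetic, not from the slides):

  8 SPEs x 4 SP lanes x 2 flops per FMA x 3.2 GHz = 204.8 GFLOPS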

Page 5: Cell processor implementation of a MILC lattice QCD application


Presentation outline

• Introduction
  1. Our view of MILC applications
  2. Introduction to the Cell Broadband Engine
• Implementation on the Cell/B.E.
  1. PPE performance and STREAM benchmark
  2. CPU profile and kernels to be ported
  3. Different approaches
• Performance
• Conclusion

Page 6: Cell processor implementation of a MILC lattice QCD application


Performance in PPE

• Step 1: run the unmodified code on the PPE
• On the PPE it runs roughly 2-3x slower than on a modern CPU
• MILC is bandwidth-bound
• This agrees with what we see in the STREAM benchmark (charts below; a minimal Triad kernel is sketched after them)

[Charts: application run time (seconds) on the CPU vs. the PPE for the 8x8x16x16 and 16x16x16x16 lattices; STREAM benchmark bandwidth (Copy, Scale, Add, Triad) on the CPU vs. the PPE]
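For reference, the Triad kernel that STREAM reports is essentially the loop below. This is a generic sketch of the measurement, not the benchmark's actual source; the array length and timing harness are chosen only for illustration.

    /* Generic sketch of the STREAM "Triad" kernel: a = b + scalar * c.
     * Sustained bandwidth is estimated as bytes moved / elapsed time. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 22)        /* array length, large enough to exceed the caches */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        const double scalar = 3.0;

        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];          /* 2 loads + 1 store per element */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs   = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        double gbytes = 3.0 * N * sizeof(double) / 1e9;   /* total bytes moved */
        printf("Triad bandwidth: %.2f GB/s (%.4f s)\n", gbytes / secs, secs);

        free(a); free(b); free(c);
        return 0;
    }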

Page 7: Cell processor implementation of a MILC lattice QCD application


Execution profile and kernels to be ported

[Chart: per-kernel run time (% of total) and cumulative run time (%) on the 8x8x16x16 and 16x16x16x16 lattices; the profiled kernels include udadu_mu_nu, dslash_w_site_special, su3mat_copy, mult_su3_nn, mult_su3_na, mult_this_ldu_site, udadu_mat_mu_nu, single action, su3_adjoint, general strided gather, f_mu_nu, scalar_mult_add_wvec, scalar_multi_su3_matrix_addi, d_congrad2_cl, set_neighbor, mult_su3_an, update_u, compute_clov, update_h_cl, scalar_mult_add_wvec_magsq, realtrace_su3_nn, add_su3_matrix, wp_shrink, magsq_wvec_task, gauge_action, set su3_matrix to zero, and reunitarize]

• 10 of these subroutines are responsible for >90% of the overall run time
• All of the profiled kernels together account for 98.8%

Page 8: Cell processor implementation of a MILC lattice QCD application


Kernel memory access pattern

• Kernel code must be SIMDized
• Performance is determined by how fast data can be DMAed in and out, not by the SIMDized compute code
• In each iteration only small data elements are accessed:
  • lattice site structure: 1832 bytes
  • su3_matrix: 72 bytes
  • wilson_vector: 96 bytes
• Challenge: how do we get data into the SPEs as fast as possible?
• Cell/B.E. DMA performs best when data is aligned to 128 bytes and the transfer size is a multiple of 128 bytes (see the DMA sketch at the end of this slide)
• The MILC data layout meets neither requirement

#define FORSOMEPARITY(i,s,choice) \
    for( i=((choice)==ODD ? even_sites_on_node : 0 ), \
         s= &(lattice[i]); \
         i< ( (choice)==EVEN ? even_sites_on_node : sites_on_node); \
         i++,s++)

FORSOMEPARITY(i,s,parity) {
    mult_adj_mat_wilson_vec( &(s->link[nu]), ((wilson_vector *)F_PT(s,rsrc)), &rtemp );
    mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]), &rtemp, &(s->tmp) );
    mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]), &(s->tmp), &rtemp );
    mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
    mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
    su3_projector_w( &rtemp, ((wilson_vector *)F_PT(s,lsrc)), ((su3_matrix *)F_PT(s,mat)) );
}

The code above is one sample kernel from the udadu_mu_nu() routine. [Figure: its data accesses, showing fields read from lattice site 0 and data fetched from neighboring sites]
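On the SPE side, the 128-byte-aligned, 128-byte-sized transfer referred to above is a plain tag-managed DMA. A minimal sketch using the SDK's spu_mfcio.h interface is shown below; the padded 128-byte matrix type and the effective-address argument are illustrative assumptions, not MILC code.

    /* SPE-side sketch: DMA one 128-byte element from main memory into
     * local store and block until it has arrived. */
    #include <spu_mfcio.h>
    #include <stdint.h>

    typedef struct {
        float c[4][4][2];             /* 4x4 complex, single precision: 128 bytes */
    } padded_su3_matrix;

    /* local-store destination buffer, 128-byte aligned as the DMA engine prefers */
    static padded_su3_matrix ls_buf __attribute__((aligned(128)));

    void fetch_one_matrix(uint64_t ea_matrix)   /* ea_matrix: 128-byte aligned
                                                   effective address passed by the PPE */
    {
        const unsigned int tag = 1;   /* any tag id in 0..31 */

        /* start the DMA: (local-store address, effective address, size, tag, tid, rid) */
        mfc_get(&ls_buf, ea_matrix, sizeof(ls_buf), tag, 0, 0);

        /* wait until every DMA issued with this tag has completed */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }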

Page 9: Cell processor implementation of a MILC lattice QCD application


Approach I: packing and unpacking

• Good DMA performance
• Packing and unpacking are expensive on the PPE (a packing sketch follows the figure below)

[Figure: on the PPE side, fields are packed out of the struct site array in main memory into contiguous buffers, DMAed to the SPEs, and the results are DMAed back and unpacked into the struct site array]
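A minimal PPE-side sketch of the packing and unpacking steps, under a deliberately simplified site layout: the struct below is a stand-in for MILC's struct site, and the field and function names are invented for the example.

    /* Approach I sketch: gather one field out of the array-of-structs
     * lattice into a contiguous buffer the SPEs can DMA efficiently,
     * then scatter the results back afterwards. */
    typedef struct { float c[3][3][2]; } su3_matrix;   /* 72 bytes */

    typedef struct {
        su3_matrix link[4];       /* gauge links, one per direction */
        /* ... many other per-site fields in the real code ... */
    } site;

    /* PPE side: copy link[dir] of every site into pack_buf
     * (pack_buf is assumed to be allocated 128-byte aligned). */
    void pack_links(const site *lattice, int nsites, int dir, su3_matrix *pack_buf)
    {
        for (int i = 0; i < nsites; i++)
            pack_buf[i] = lattice[i].link[dir];
    }

    /* PPE side: scatter the SPE results back into the struct site array. */
    void unpack_links(site *lattice, int nsites, int dir, const su3_matrix *pack_buf)
    {
        for (int i = 0; i < nsites; i++)
            lattice[i].link[dir] = pack_buf[i];
    }

Both loops run on the PPE, which is why this approach trades good DMA behavior for significant PPE overhead.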

Page 10: Cell processor implementation of a MILC lattice QCD application


Approach II: Indirect memory access

• Replace elements in struct site with pointers
• The pointers point to contiguous memory regions
• PPE overhead due to the indirect memory access (a sketch follows the figure below)

[Figure: in the modified lattice, struct site holds pointers into contiguous memory regions in main memory, which the SPEs access directly with DMA operations]
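A sketch of Approach II under the same simplified layout as before: the per-site field is moved out of struct site into one contiguous, 128-byte-aligned array, and the site keeps only a pointer into it. Names are illustrative, not the real MILC definitions.

    /* Approach II sketch: indirect access through pointers into a
     * contiguous, DMA-friendly array. */
    #include <stdlib.h>

    typedef struct { float c[3][3][2]; } su3_matrix;   /* 72 bytes */

    typedef struct {
        su3_matrix *link;         /* points into the contiguous links[] array */
        /* ... other fields, likewise replaced by pointers ... */
    } site_indirect;

    int init_lattice(site_indirect *lattice, int nsites)
    {
        su3_matrix *links;
        /* one contiguous region holding link[0..3] for every site */
        if (posix_memalign((void **)&links, 128,
                           (size_t)nsites * 4 * sizeof(su3_matrix)))
            return -1;

        for (int i = 0; i < nsites; i++)
            lattice[i].link = &links[4 * i];   /* indirect access from the site */
        return 0;
    }

The SPEs can now stream the links[] array directly, while every PPE access to a link pays for the extra pointer dereference.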

Page 11: Cell processor implementation of a MILC lattice QCD application


Approach III: Padding and small memory DMAs

• Pad the elements, and struct site itself, to appropriate sizes
• Good bandwidth at the cost of padding overhead (a sketch follows the figure below)
• su3_matrix: from a 3x3 complex matrix (72 bytes) to a 4x4 complex matrix (128 bytes); bandwidth efficiency lost: 44%
• wilson_vector: from 4x3 complex (96 bytes) to 4x4 complex (128 bytes); bandwidth efficiency lost: 23%

[Figure: the lattice after padding is transferred directly between the PPE's main memory and the SPEs with DMA operations, without a packing step]
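A sketch of the padded element type used in Approach III; the type name is illustrative, not MILC's actual definition.

    /* Approach III sketch: store the 3x3 complex su3_matrix (72 bytes)
     * as a 4x4 complex matrix so that every element is exactly 128 bytes
     * and can be kept 128-byte aligned for DMA; the extra row and column
     * are unused payload. */
    #include <assert.h>

    typedef struct {
        float c[4][4][2];        /* only c[0..2][0..2] carry real data */
    } __attribute__((aligned(128))) padded_su3_matrix;

    int main(void)
    {
        /* 4 * 4 * 2 * sizeof(float) = 128 bytes: one DMA-friendly unit */
        assert(sizeof(padded_su3_matrix) == 128);
        return 0;
    }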

Page 12: Cell processor implementation of a MILC lattice QCD application


Struct site Padding

• DMA with a 128-byte-granularity stride performs differently depending on the stride length (measurements below)
• This is due to the 16 banks in main memory
• Odd stride counts always reach peak bandwidth
• We therefore pad struct site to 2688 bytes (21 x 128); a sketch of this padding rule follows the measurements

Measured bandwidth vs. stride (in multiples of 128 bytes):

  stride (x128 bytes)   bandwidth (GB/s)
   1                    25.38
   2                    12.69
   3                    25.36
   4                     8.60
   5                    25.49
   6                    12.89
   7                    25.48
   8                     4.26
   9                    25.50
  10                    12.93
  11                    25.34
  12                     8.58
  13                    25.51
  14                    12.89
  15                    25.47
  16                     2.13
  17                    25.34
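The choice of an odd multiple of 128 can be expressed as a small rule: round up to 128-byte blocks, and if the block count is even, add one more block so that successive sites sweep across all 16 banks. This is my formulation of the slide's reasoning; the intermediate, element-padded site size is not given on the slides, so the example input below is purely illustrative.

    /* Round a size up to an odd multiple of 128 bytes.
     * 2688 bytes = 21 * 128 is the value chosen for struct site. */
    #include <assert.h>
    #include <stddef.h>

    static size_t pad_to_odd_multiple_of_128(size_t raw_size)
    {
        size_t blocks = (raw_size + 127) / 128;   /* round up to 128-byte blocks */
        if (blocks % 2 == 0)
            blocks += 1;                          /* even block count: add one block */
        return blocks * 128;
    }

    int main(void)
    {
        /* illustrative input: a site whose padded fields total 2500 bytes
         * lands on 20 blocks, which the rule bumps to 21 * 128 = 2688 */
        assert(pad_to_odd_multiple_of_128(2500) == 2688);
        return 0;
    }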

Page 13: Cell processor implementation of a MILC lattice QCD application


Presentation outline

• Introduction
  1. Our view of MILC applications
  2. Introduction to the Cell Broadband Engine
• Implementation on the Cell/B.E.
  1. PPE performance and STREAM benchmark
  2. CPU profile and kernels to be ported
  3. Different approaches
• Performance
• Conclusion

Page 14: Cell processor implementation of a MILC lattice QCD application


Kernel performance

• GFLOPS are low for all kernels
• Bandwidth is around 80% of peak for most kernels
• Kernel speedup over the CPU is between 10x and 20x for most kernels
• The set_memory_to_zero kernel reaches ~40x speedup, and su3mat_copy() exceeds 15x

[Charts: percentage of peak GFLOPS, percentage of peak bandwidth, and speedup over the CPU for each ported kernel (the same kernel set as in the profile on Page 7), on the 8x8x16x16 and 16x16x16x16 lattices]

Page 15: Cell processor implementation of a MILC lattice QCD application


Application performance

• Single-Cell application speedup: ~8-10x compared to a single Xeon core
• Cell blade application speedup: 1.5-4.1x compared to a two-socket, 8-core Xeon system
• Profile on the Xeon: 98.8% of the run time is in parallel code (which speeds up on the SPEs), 1.2% in serial code (which slows down on the PPE)
• On the Cell, the SPE kernels account for 38-67% of the overall run time and the PPE for the remaining 33-62%
• The PPE stands in the way of further improvement

[Charts: execution time (seconds) for the 8x8x16x16 and 16x16x16x16 lattices in five execution modes: 1 Xeon core, 8 Xeon cores, 1 PPE + 8 SPEs, 1 PPE + 16 SPEs (NUMA), and 2 PPEs with 8 SPEs each (MPI); the SPE contribution to each Cell run time is shown separately]

Page 16: Cell processor implementation of a MILC lattice QCD application


Application performance on two blades

Execution time breakdown (seconds):

                           54 kernels considered for    Rest of the code (PPE portion    Total
                           the SPE implementation       in the Cell/B.E. case)
  Two Intel Xeon blades    110.3 s                      27.1 s (24.5 s due to MPI)       137.3 s
  Two Cell/B.E. blades     15.9 s                       67.9 s (47.6 s due to MPI)       83.8 s

• For this comparison, the two Intel Xeon blades and the two Cell/B.E. blades were connected through Gigabit Ethernet
• More data are needed for Cell blades connected through InfiniBand

Page 17: Cell processor implementation of a MILC lattice QCD application


Application performance: a fair comparison

Execution times (seconds) and speedups:

                                                  8x8x16x16 lattice                 16x16x16x16 lattice
                                                  Xeon time  Cell/B.E.  speedup     Xeon time  Cell/B.E.  speedup
  Single-core Xeon vs. Cell/B.E. PPE              38.7       73.2       0.5         168.6      412.8      0.4
  Single-core Xeon vs. Cell/B.E. PPE + 1 SPE      38.7       21.9       1.8         168.6      86.9       1.9
  Quad-core Xeon vs. Cell/B.E. PPE + 8 SPEs       15.4       4.5        3.4         100.2      17.5       5.7
  Xeon blade vs. Cell/B.E. blade                  5.5        3.6        1.5         55.8       13.7       4.1

• The PPE alone is slower than a Xeon core
• PPE + 1 SPE is ~2x faster than a Xeon core
• A Cell blade is 1.5-4.1x faster than an 8-core Xeon blade

Page 18: Cell processor implementation of a MILC lattice QCD application


Conclusion

• We achieved reasonably good performance: 4.5-5.0 GFLOPS on one Cell processor for the whole application
• We maintained the MPI framework: since we do not assume the code runs on a single Cell processor, certain optimizations, e.g. loop fusion, cannot be applied
• The current site-centric data layout forces us to take the padding approach
  • 23-44% of bandwidth efficiency is lost
  • Fix: a field-centric data layout is desirable (a sketch of the two layouts follows)
• The PPE slows down the serial part, which is a problem for further improvement
  • Fix: IBM plans to put a full-featured Power core in future versions of the Cell/B.E.
• The PPE may also pose problems when scaling to multiple Cell blades
  • Tests with Cell blades connected over InfiniBand are needed
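To illustrate the layout change suggested above, here is a minimal comparison of site-centric (array of structs) and field-centric (one array per field) storage. The types and field names are simplified stand-ins, not MILC's actual definitions.

    /* Site-centric layout (current MILC): one struct per site, so the
     * values of any given field are scattered through memory with a
     * large, poorly aligned stride. */
    typedef struct { float c[3][3][2]; } su3_matrix;      /* 72 bytes */
    typedef struct { float d[4][3][2]; } wilson_vector;   /* 96 bytes */

    typedef struct {
        su3_matrix    link[4];
        wilson_vector psi;
        /* ... other per-site fields ... */
    } site;                    /* stored as:  site lattice[NSITES]; */

    /* Field-centric layout: each field is its own contiguous, alignable
     * array, so the SPEs can stream it with large, 128-byte-friendly
     * DMAs and no per-element padding. */
    typedef struct {
        su3_matrix    *link[4];   /* link[dir][i] for site i */
        wilson_vector *psi;       /* psi[i] for site i */
        /* ... other fields, one array per field ... */
    } lattice_fields;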