Cell processor implementation of a MILC lattice QCD application
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
Cell processor implementation of a MILC lattice QCD application
Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb
2
Presentation outline
• Introduction
  1. Our view of MILC applications
  2. Introduction to Cell Broadband Engine
• Implementation in Cell/B.E.
  1. PPE performance and STREAM benchmark
  2. CPU profile and kernels to be ported
  3. Different approaches
• Performance
• Conclusion
3
Introduction
• Our target
  • MIMD Lattice Computation (MILC) Collaboration code – dynamical clover fermions (clover_dynamical) using the hybrid molecular dynamics R algorithm
• Our view of the MILC applications
  • A sequence of communication and computation blocks
[Diagram: original CPU-based implementation, a sequence alternating MPI scatter/gather steps with compute loops 1 through n on the CPU]
4
Introduction
• Cell/B.E. processor
  • One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs); each SPE has 256 KB of local store
  • 3.2 GHz clock rate
  • 25.6 GB/s processor-to-memory bandwidth
  • > 200 GB/s EIB sustained aggregate bandwidth
  • Theoretical peak performance: 204.8 GFLOPS (SP) and 14.63 GFLOPS (DP)
5
Presentation outline
• Introduction
  1. Our view of MILC applications
  2. Introduction to Cell Broadband Engine
• Implementation in Cell/B.E.
  1. PPE performance and STREAM benchmark
  2. CPU profile and kernels to be ported
  3. Different approaches
• Performance
• Conclusion
6
Performance in PPE
• Step 1: try to run the application on the PPE
• On the PPE it runs approximately 2–3x slower than a modern CPU
• MILC is bandwidth-bound
• This agrees with what we see with the STREAM benchmark
[Chart: run time (seconds) on CPU and PPE for the 8x8x16x16 and 16x16x16x16 lattices]
[Chart: STREAM benchmark bandwidth (Copy, Scale, Add, Triad) on CPU and PPE]
7
Execution profile and kernels to be ported
[Chart: per-kernel run time (%) and cumulative run time (%) on the 8x8x16x16 and 16x16x16x16 lattices for the profiled kernels, including udadu_mu_nu, dslash_w_site_special, su3mat_copy, mult_su3_nn, mult_su3_na, mult_this_ldu_site, udadu_mat_mu_nu, and others]
• 10 of these subroutines are responsible for >90% of the overall runtime
• All kernels together account for 98.8%
8
Kernel memory access pattern
• Kernel code must be SIMDized
• Performance is determined by how fast you can DMA the data in and out, not by the SIMDized code
• In each iteration, only small elements are accessed
  • lattice site (struct site): 1832 bytes
  • su3_matrix: 72 bytes
  • wilson_vector: 96 bytes
• Challenge: how to get data into the SPUs as fast as possible? (see the DMA sketch after the sample kernel below)
  • Cell/B.E. has the best DMA performance when data is aligned to 128 bytes and the size is a multiple of 128 bytes
  • The data layout in MILC meets neither condition
#define FORSOMEPARITY(i,s,choice) \
    for( i=((choice)==ODD ? even_sites_on_node : 0 ), s= &(lattice[i]); \
         i< ( (choice)==EVEN ? even_sites_on_node : sites_on_node); \
         i++,s++)

FORSOMEPARITY(i,s,parity) {
    mult_adj_mat_wilson_vec( &(s->link[nu]), ((wilson_vector *)F_PT(s,rsrc)), &rtemp );
    mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]), &rtemp, &(s->tmp) );
    mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]), &(s->tmp), &rtemp );
    mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
    mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
    su3_projector_w( &rtemp, ((wilson_vector *)F_PT(s,lsrc)), ((su3_matrix*)F_PT(s,mat)) );
}
One sample kernel from the udadu_mu_nu() routine; each iteration touches fields of the local lattice site plus data gathered from neighbor sites (the gen_pt pointers).
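To make the 128-byte DMA constraint concrete, here is a minimal, hypothetical SPU-side sketch (the buffer name, its size, and the function are ours, not from the MILC code) of pulling an aligned block from main memory into local store with the MFC intrinsics:

#include <spu_mfcio.h>

/* Local-store buffer: 128-byte aligned and a multiple of 128 bytes in size,
   the layout the MFC transfers most efficiently. */
static volatile char ls_buf[4 * 128] __attribute__((aligned(128)));

void fetch_block(unsigned long long ea)   /* ea: effective address in main memory */
{
    const unsigned int tag = 0;
    mfc_get(ls_buf, ea, sizeof(ls_buf), tag, 0, 0);   /* start the DMA get */
    mfc_write_tag_mask(1 << tag);                     /* select this tag group */
    mfc_read_tag_status_all();                        /* block until the DMA completes */
}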
9
Approach I: packing and unpacking
• Good performance in DMA operations
• Packing and unpacking are expensive on the PPE

[Diagram: the PPE packs struct site data from main memory into contiguous buffers, the SPEs DMA those buffers in and out, and the PPE unpacks the results back into the lattice]
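As a rough sketch of the packing step in Approach I (function and buffer names are illustrative, not the actual implementation), the PPE gathers one field from every struct site into a contiguous, DMA-friendly buffer before the SPEs fetch it, and scatters the results back afterwards:

/* Hypothetical PPE-side pack/unpack of the nu-th link matrix of every site.
   'lattice' and 'sites_on_node' are the usual MILC globals. */
void pack_links(su3_matrix *packed, int nu)
{
    int i;
    for (i = 0; i < sites_on_node; i++)
        packed[i] = lattice[i].link[nu];   /* copy out of the scattered struct site */
}

void unpack_links(const su3_matrix *packed, int nu)
{
    int i;
    for (i = 0; i < sites_on_node; i++)
        lattice[i].link[nu] = packed[i];   /* copy results back into the lattice */
}

These per-element copies are exactly the PPE overhead the slide points out.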
10
Approach II: Indirect memory access
• Replace elements in struct site with pointers
• The pointers point to contiguous memory regions
• PPE overhead due to indirect memory access

[Diagram: the modified lattice keeps pointers in struct site that refer to contiguous memory regions in main memory; the SPEs DMA directly from those regions]
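A hedged sketch of what Approach II changes in the data layout (the type below is our illustration, not the actual modified MILC struct):

/* Bulky fields move out of struct site into separate contiguous allocations;
   the site keeps only pointers to them, so the SPEs can DMA whole field
   arrays directly from the contiguous regions. */
typedef struct {
    /* ... scalar per-site bookkeeping stays inline ... */
    su3_matrix    *link;   /* points into one contiguous array of link matrices */
    wilson_vector *tmp;    /* points into one contiguous array of Wilson vectors */
} site_indirect;

The PPE now has to follow a pointer to reach a field instead of finding it inline in the struct, which is the indirect-access overhead noted above.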
11
Approach III: Padding and small memory DMAs
• Pad elements to an appropriate size
• Pad struct site to an appropriate size
• Gained good bandwidth performance at the cost of padding overhead
• su3_matrix: from a 3x3 complex matrix to a 4x4 complex matrix
  • 72 bytes → 128 bytes
  • Bandwidth efficiency lost: 44%
• wilson_vector: from 4x3 complex to 4x4 complex
  • 96 bytes → 128 bytes
  • Bandwidth efficiency lost: 23%

[Diagram: lattice after padding; each padded element is a full, aligned DMA unit transferred between the PPE's main memory and the SPEs]
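A minimal sketch of the element padding in Approach III (the type names are ours; MILC defines its own complex and su3_matrix types):

#include <complex.h>

/* A 3x3 complex matrix (72 bytes in single precision) padded out to 4x4
   complex (128 bytes) so that each matrix is one aligned DMA unit; only
   e[0..2][0..2] carry data, so part of each transfer is wasted padding. */
typedef struct {
    float complex e[4][4];
} __attribute__((aligned(128))) su3_matrix_padded;

/* A 4x3 wilson_vector (96 bytes) is padded to 4x4 (128 bytes) the same way. */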
12
Struct site Padding
• 128-byte strided access performs differently for different stride sizes
• This is due to the 16 banks in main memory
• Odd stride counts always reach peak bandwidth
• We chose to pad struct site to 2688 bytes (21 × 128)
[Chart data: bandwidth (GB/s) vs. stride (× 128 bytes)]
  1: 25.38    2: 12.69    3: 25.36    4:  8.60    5: 25.49    6: 12.89
  7: 25.48    8:  4.26    9: 25.50   10: 12.93   11: 25.34   12:  8.58
 13: 25.51   14: 12.89   15: 25.47   16:  2.13   17: 25.34
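The bank-conflict pattern above suggests a simple rule, sketched here as a hypothetical helper (not part of MILC): round the site size up to whole 128-byte blocks and force the block count to be odd, so consecutive sites map to different banks of the 16-bank memory. The chosen value of 2688 bytes = 21 × 128 is such an odd multiple.

#include <stddef.h>

/* Round 'raw' up to the next odd multiple of 128 bytes. */
static size_t padded_site_size(size_t raw)
{
    size_t blocks = (raw + 127) / 128;   /* round up to whole 128-byte blocks */
    if (blocks % 2 == 0)
        blocks += 1;                     /* an odd block count avoids repeated bank hits */
    return blocks * 128;
}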
13
Presentation outline
• Introduction
  1. Our view of MILC applications
  2. Introduction to Cell Broadband Engine
• Implementation in Cell/B.E.
  1. PPE performance and STREAM benchmark
  2. CPU profile and kernels to be ported
  3. Different approaches
• Performance
• Conclusion
14
Kernel performance
• GFLOPS are low for all kernels
• Bandwidth is around 80% of peak for most kernels
• Kernel speedup over the CPU is between 10x and 20x for most kernels
• The set_memory_to_zero kernel has ~40x speedup; su3mat_copy() speedup is >15x
[Chart: per-kernel performance as a percentage of peak GFLOPS, 8x8x16x16 and 16x16x16x16 lattices]
[Chart: per-kernel achieved bandwidth as a percentage of peak, 8x8x16x16 and 16x16x16x16 lattices]
[Chart: per-kernel speedup over the CPU, 8x8x16x16 and 16x16x16x16 lattices]
15
Application performance
• Single Cell application performance speedup
  • ~8–10x, compared to a single Xeon core
• Cell blade application performance speedup
  • 1.5–4.1x, compared to a 2-socket, 8-core Xeon
• Profile on Xeon
  • 98.8% parallel code (sped up by the SPEs), 1.2% serial code (slowed down on the PPE)
• Kernel SPU time is 67–38% and PPU time 33–62% of the overall runtime on the Cell
• The PPE is standing in the way of further improvement
[Charts: execution time (seconds) by execution mode (1 core Xeon, 8 cores Xeon, 1 PPE + 8 SPEs, 1 PPE + 16 SPEs (NUMA), 2 PPEs with 8 SPEs per PPE (MPI)) with the SPE contribution highlighted, for the 8x8x16x16 and 16x16x16x16 lattices]
16
Application performance on two blades
                        Execution time of the 54 kernels        Execution time of the rest of the code    Total
                        considered for the SPE implementation   (PPE portion in the case of Cell/B.E.)
Two Intel Xeon blades   110.3 seconds                           27.1 seconds (24.5 seconds due to MPI)    137.3 seconds
Two Cell/B.E. blades    15.9 seconds                            67.9 seconds (47.6 seconds due to MPI)    83.8 seconds
• For this comparison, we ran two Intel Xeon blades and two Cell/B.E. blades connected through Gigabit Ethernet
• More data is needed for Cell blades connected through InfiniBand
17
Application performance: a fair comparison
                                              8x8x16x16 lattice              16x16x16x16 lattice
                                              Xeon time  Cell time  speedup  Xeon time  Cell time  speedup
Single core Xeon vs. Cell/B.E. PPE              38.7       73.2       0.5      168.6      412.8      0.4
Single core Xeon vs. Cell/B.E. PPE + 1 SPE      38.7       21.9       1.8      168.6       86.9      1.9
Quad core Xeon vs. Cell/B.E. PPE + 8 SPEs       15.4        4.5       3.4      100.2       17.5      5.7
Xeon blade vs. Cell/B.E. blade                   5.5        3.6       1.5       55.8       13.7      4.1
(times in seconds)
• The PPE is slower than the Xeon
• PPE + 1 SPE is ~2x faster than a single Xeon core
• A Cell blade is 1.5–4.1x faster than an 8-core Xeon blade
18
Conclusion
• We achieved reasonably good performance
  • 4.5–5.0 GFLOPS on one Cell processor for the whole application
• We maintained the MPI framework
  • Without the assumption that the code runs on a single Cell processor, certain optimizations (e.g. loop fusion) cannot be done
• The current site-centric data layout forces us to take the padding approach
  • 23–44% of bandwidth efficiency is lost
  • Fix: a field-centric data layout is desired
• The PPE slows down the serial part, which is a problem for further improvement
  • Fix: IBM putting a full Power core in the Cell/B.E.
• The PPE may also pose problems in scaling to multiple Cell blades
  • A PPE-over-InfiniBand test is needed