Cell processor implementation of a MILC lattice QCD application
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
Cell processor implementation of a MILC lattice QCD application
Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb
2
Presentation outline
• Introduction
  1. Our view of MILC applications
  2. Introduction to Cell Broadband Engine
• Implementation in Cell/B.E.
  1. PPE performance and STREAM benchmark
  2. CPU profile and kernels to be ported
  3. Different approaches
• Performance
• Conclusion
3
Introduction
• Our target
  • MIMD Lattice Computation (MILC) Collaboration code – dynamical clover fermions (clover_dynamical) using the hybrid molecular dynamics R algorithm
• Our view of the MILC applications
  • A sequence of communication and computation blocks
[Diagram: original CPU-based implementation, a sequence alternating MPI scatter/gather steps with compute loops 1 through n on the CPU]
4
Introduction
• Cell/B.E. processor
  • One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs); each SPE has 256 KB of local store
  • 3.2 GHz clock rate
  • 25.6 GB/s processor-to-memory bandwidth
  • > 200 GB/s EIB sustained aggregate bandwidth
  • Theoretical peak performance: 204.8 GFLOPS (SP) and 14.63 GFLOPS (DP)
5
Presentation outline
• Introduction
  1. Our view of MILC applications
  2. Introduction to Cell Broadband Engine
• Implementation in Cell/B.E.
  1. PPE performance and STREAM benchmark
  2. CPU profile and kernels to be ported
  3. Different approaches
• Performance
• Conclusion
6
Performance in PPE
• Step 1: try to run the application on the PPE
• On the PPE it runs approximately 2–3x slower than a modern CPU
• MILC is bandwidth-bound
• This agrees with what we see with the STREAM benchmark
[Chart: run time (seconds) on CPU and PPE for the 8x8x16x16 and 16x16x16x16 lattices]
[Chart: STREAM benchmark bandwidth (Copy, Scale, Add, Triad) on CPU and PPE]
7
Execution profile and kernels to be ported
[Chart: per-kernel run time (%) and cumulative run time (%) on the 8x8x16x16 and 16x16x16x16 lattices for the profiled kernels, including udadu_mu_nu, dslash_w_site_special, su3mat_copy, mult_su3_nn, mult_su3_na, mult_this_ldu_site, udadu_mat_mu_nu, and others]
• 10 of these subroutines are responsible for >90% of the overall runtime
• All kernels together account for 98.8%
8
Kernel memory access pattern
• Kernel code must be SIMDized
• Performance is determined by how fast you can DMA the data in and out, not by the SIMDized code
• In each iteration, only small elements are accessed
  • lattice site (struct site): 1832 bytes
  • su3_matrix: 72 bytes
  • wilson_vector: 96 bytes
• Challenge: how to get data into the SPUs as fast as possible? (see the DMA sketch after the sample kernel below)
  • Cell/B.E. has the best DMA performance when data is aligned to 128 bytes and the size is a multiple of 128 bytes
  • The data layout in MILC meets neither condition
#define FORSOMEPARITY(i,s,choice) \
    for( i=((choice)==ODD ? even_sites_on_node : 0 ), s= &(lattice[i]); \
         i< ( (choice)==EVEN ? even_sites_on_node : sites_on_node); \
         i++,s++)

FORSOMEPARITY(i,s,parity) {
    mult_adj_mat_wilson_vec( &(s->link[nu]), ((wilson_vector *)F_PT(s,rsrc)), &rtemp );
    mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]), &rtemp, &(s->tmp) );
    mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]), &(s->tmp), &rtemp );
    mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
    mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
    su3_projector_w( &rtemp, ((wilson_vector *)F_PT(s,lsrc)), ((su3_matrix*)F_PT(s,mat)) );
}
One sample kernel from the udadu_mu_nu() routine; each iteration touches fields of the local lattice site plus data gathered from neighbor sites (the gen_pt pointers).
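To make the 128-byte DMA constraint concrete, here is a minimal, hypothetical SPU-side sketch (the buffer name, its size, and the function are ours, not from the MILC code) of pulling an aligned block from main memory into local store with the MFC intrinsics:

#include <spu_mfcio.h>

/* Local-store buffer: 128-byte aligned and a multiple of 128 bytes in size,
   the layout the MFC transfers most efficiently. */
static volatile char ls_buf[4 * 128] __attribute__((aligned(128)));

void fetch_block(unsigned long long ea)   /* ea: effective address in main memory */
{
    const unsigned int tag = 0;
    mfc_get(ls_buf, ea, sizeof(ls_buf), tag, 0, 0);   /* start the DMA get */
    mfc_write_tag_mask(1 << tag);                     /* select this tag group */
    mfc_read_tag_status_all();                        /* block until the DMA completes */
}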
9
Approach I: packing and unpacking
• Good performance in DMA operations
• Packing and unpacking are expensive on the PPE

[Diagram: the PPE packs struct site data from main memory into contiguous buffers, the SPEs DMA those buffers in and out, and the PPE unpacks the results back into the lattice]
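As a rough sketch of the packing step in Approach I (function and buffer names are illustrative, not the actual implementation), the PPE gathers one field from every struct site into a contiguous, DMA-friendly buffer before the SPEs fetch it, and scatters the results back afterwards:

/* Hypothetical PPE-side pack/unpack of the nu-th link matrix of every site.
   'lattice' and 'sites_on_node' are the usual MILC globals. */
void pack_links(su3_matrix *packed, int nu)
{
    int i;
    for (i = 0; i < sites_on_node; i++)
        packed[i] = lattice[i].link[nu];   /* copy out of the scattered struct site */
}

void unpack_links(const su3_matrix *packed, int nu)
{
    int i;
    for (i = 0; i < sites_on_node; i++)
        lattice[i].link[nu] = packed[i];   /* copy results back into the lattice */
}

These per-element copies are exactly the PPE overhead the slide points out.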
10
Approach II: Indirect memory access
• Replace elements in struct site with pointers
• The pointers point to contiguous memory regions
• PPE overhead due to indirect memory access

[Diagram: the modified lattice keeps pointers in struct site that refer to contiguous memory regions in main memory; the SPEs DMA directly from those regions]
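A hedged sketch of what Approach II changes in the data layout (the type below is our illustration, not the actual modified MILC struct):

/* Bulky fields move out of struct site into separate contiguous allocations;
   the site keeps only pointers to them, so the SPEs can DMA whole field
   arrays directly from the contiguous regions. */
typedef struct {
    /* ... scalar per-site bookkeeping stays inline ... */
    su3_matrix    *link;   /* points into one contiguous array of link matrices */
    wilson_vector *tmp;    /* points into one contiguous array of Wilson vectors */
} site_indirect;

The PPE now has to follow a pointer to reach a field instead of finding it inline in the struct, which is the indirect-access overhead noted above.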
11
Approach III: Padding and small memory DMAs
• Pad elements to an appropriate size
• Pad struct site to an appropriate size
• Gained good bandwidth performance at the cost of padding overhead
• su3_matrix: from a 3x3 complex matrix to a 4x4 complex matrix
  • 72 bytes → 128 bytes
  • Bandwidth efficiency lost: 44%
• wilson_vector: from 4x3 complex to 4x4 complex
  • 96 bytes → 128 bytes
  • Bandwidth efficiency lost: 23%

[Diagram: lattice after padding; each padded element is a full, aligned DMA unit transferred between the PPE's main memory and the SPEs]
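A minimal sketch of the element padding in Approach III (the type names are ours; MILC defines its own complex and su3_matrix types):

#include <complex.h>

/* A 3x3 complex matrix (72 bytes in single precision) padded out to 4x4
   complex (128 bytes) so that each matrix is one aligned DMA unit; only
   e[0..2][0..2] carry data, so part of each transfer is wasted padding. */
typedef struct {
    float complex e[4][4];
} __attribute__((aligned(128))) su3_matrix_padded;

/* A 4x3 wilson_vector (96 bytes) is padded to 4x4 (128 bytes) the same way. */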
12
Struct site Padding
• 128-byte strided access performs differently for different stride sizes
• This is due to the 16 banks in main memory
• Odd stride counts always reach peak bandwidth
• We chose to pad struct site to 2688 bytes (21 × 128)
[Chart data: bandwidth (GB/s) vs. stride (× 128 bytes)]
  1: 25.38    2: 12.69    3: 25.36    4:  8.60    5: 25.49    6: 12.89
  7: 25.48    8:  4.26    9: 25.50   10: 12.93   11: 25.34   12:  8.58
 13: 25.51   14: 12.89   15: 25.47   16:  2.13   17: 25.34
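The bank-conflict pattern above suggests a simple rule, sketched here as a hypothetical helper (not part of MILC): round the site size up to whole 128-byte blocks and force the block count to be odd, so consecutive sites map to different banks of the 16-bank memory. The chosen value of 2688 bytes = 21 × 128 is such an odd multiple.

#include <stddef.h>

/* Round 'raw' up to the next odd multiple of 128 bytes. */
static size_t padded_site_size(size_t raw)
{
    size_t blocks = (raw + 127) / 128;   /* round up to whole 128-byte blocks */
    if (blocks % 2 == 0)
        blocks += 1;                     /* an odd block count avoids repeated bank hits */
    return blocks * 128;
}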
13
Presentation outline
• Introduction
  1. Our view of MILC applications
  2. Introduction to Cell Broadband Engine
• Implementation in Cell/B.E.
  1. PPE performance and STREAM benchmark
  2. CPU profile and kernels to be ported
  3. Different approaches
• Performance
• Conclusion
14
Kernel performance
• GFLOPS are low for all kernels
• Bandwidth is around 80% of peak for most kernels
• Kernel speedup over the CPU is between 10x and 20x for most kernels
• The set_memory_to_zero kernel has ~40x speedup; su3mat_copy() speedup is >15x
[Chart: per-kernel performance as a percentage of peak GFLOPS, 8x8x16x16 and 16x16x16x16 lattices]
[Chart: per-kernel achieved bandwidth as a percentage of peak, 8x8x16x16 and 16x16x16x16 lattices]
[Chart: per-kernel speedup over the CPU, 8x8x16x16 and 16x16x16x16 lattices]
15
Application performance
• Single Cell application performance speedup
  • ~8–10x, compared to a single Xeon core
• Cell blade application performance speedup
  • 1.5–4.1x, compared to a 2-socket, 8-core Xeon
• Profile on Xeon
  • 98.8% parallel code (sped up by the SPEs), 1.2% serial code (slowed down on the PPE)
• Kernel SPU time is 67–38% and PPU time 33–62% of the overall runtime on the Cell
• The PPE is standing in the way of further improvement
[Charts: execution time (seconds) by execution mode (1 core Xeon, 8 cores Xeon, 1 PPE + 8 SPEs, 1 PPE + 16 SPEs (NUMA), 2 PPEs with 8 SPEs per PPE (MPI)) with the SPE contribution highlighted, for the 8x8x16x16 and 16x16x16x16 lattices]
16
Application performance on two blades
                        Execution time of the 54 kernels        Execution time of the rest of the code    Total
                        considered for the SPE implementation   (PPE portion in the case of Cell/B.E.)
Two Intel Xeon blades   110.3 seconds                           27.1 seconds (24.5 seconds due to MPI)    137.3 seconds
Two Cell/B.E. blades    15.9 seconds                            67.9 seconds (47.6 seconds due to MPI)    83.8 seconds
• For this comparison, we ran two Intel Xeon blades and two Cell/B.E. blades connected through Gigabit Ethernet
• More data is needed for Cell blades connected through InfiniBand
17
Application performance: a fair comparison
                                              8x8x16x16 lattice              16x16x16x16 lattice
                                              Xeon time  Cell time  speedup  Xeon time  Cell time  speedup
Single core Xeon vs. Cell/B.E. PPE              38.7       73.2       0.5      168.6      412.8      0.4
Single core Xeon vs. Cell/B.E. PPE + 1 SPE      38.7       21.9       1.8      168.6       86.9      1.9
Quad core Xeon vs. Cell/B.E. PPE + 8 SPEs       15.4        4.5       3.4      100.2       17.5      5.7
Xeon blade vs. Cell/B.E. blade                   5.5        3.6       1.5       55.8       13.7      4.1
(times in seconds)
• The PPE is slower than the Xeon
• PPE + 1 SPE is ~2x faster than a single Xeon core
• A Cell blade is 1.5–4.1x faster than an 8-core Xeon blade
18
Conclusion
• We achieved reasonably good performance
  • 4.5–5.0 GFLOPS on one Cell processor for the whole application
• We maintained the MPI framework
  • Without the assumption that the code runs on a single Cell processor, certain optimizations (e.g. loop fusion) cannot be done
• The current site-centric data layout forces us to take the padding approach
  • 23–44% of bandwidth efficiency is lost
  • Fix: a field-centric data layout is desired
• The PPE slows down the serial part, which is a problem for further improvement
  • Fix: IBM putting a full Power core in the Cell/B.E.
• The PPE may also pose problems in scaling to multiple Cell blades
  • A PPE-over-InfiniBand test is needed