![Page 1: CellSs: A Programming Model for the Cell BE Architecture Pieter Bellens, Josep M. Perez, Rosa M. Badia, Jesus Labarta Barcelona Supercomputing Center (BSC-CNS)](https://reader035.vdocument.in/reader035/viewer/2022070412/56649ef25503460f94c04580/html5/thumbnails/1.jpg)
CellSs: A Programming Model for the Cell BE
Architecture
Pieter Bellens, Josep M. Perez, Rosa M. Badia, Jesus Labarta
Barcelona Supercomputing Center (BSC-CNS)
Technical University of Catalonia (UPC)
Index
•Motivation
•Programming models
•CellSs sample codes
•Compilation environment
•Execution behavior
•Results
•Related work
•Conclusions & ongoing work
Motivation
* Source: Platform 2015: Intel® Processor and Platform Evolution for the Next Decade, Intel White Paper
Motivation
User point of view
So, what is the Cell BE?
Architecture point of view
PPE + 8 SPEs
• Separate address spaces
• Tiny local memory
• Bandwidth
• Thin processor
• SMT
• Hard to optimize
Programmer's point of view
Programming models
Concepts mapping:

| Superscalar processor | Cell BE | Grid |
|---|---|---|
| Instructions | Block operations | Full binary |
| Functional units | SPEs | Remote machines |
| Fetch & decode unit | PPE | Local machine |
| Registers (name space) | Main memory | Files |
| Registers (storage) | SPU memory | Files |
| Granularity: ns | 100 µs | minutes/hours |

Standard sequential languages:
• On standard processors the program runs sequentially
• On Cell the same program runs in parallel
Constraint: block algorithms
CellSs sample code: Matrix multiply
int main(int argc, char **argv) {
    int i, j, k;
    …
    initialize(A, B, C);

    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
    ...
}

static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;

    /* The loop bound is the block dimension BS; B names the matrix argument. */
    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}
[Figure: matrix of NB × NB blocks, each block of B × B elements]
CellSs sample code: Matrix multiply
int main(int argc, char **argv) {
    int i, j, k;
    …
    initialize(A, B, C);

    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
    ...
}

#pragma css task input(A, B) inout(C)
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;

    /* The loop bound is the block dimension BS; B names the matrix argument. */
    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}
[Slide annotations: "SPE" and "unroll"; block matrix figure with NB × NB blocks of B × B elements]
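The claim that the annotated program still runs sequentially on a standard processor can be illustrated with a small sketch. It assumes tiny, hypothetical values for NB and BS, and the helpers `matmul_blocked` and `element` are not part of the slides. A standard C compiler does not recognize `#pragma css task` and ignores it (possibly with a warning), so `block_addmultiply` is compiled as an ordinary function.

```c
#include <assert.h>

#define NB 2   /* blocks per matrix dimension (assumed value for this sketch) */
#define BS 4   /* elements per block dimension (assumed value for this sketch) */

typedef float block_t[BS][BS];

/* Under CellSs this would become a task; a standard compiler ignores the
 * unknown pragma and treats this as a plain function. */
#pragma css task input(A, B) inout(C)
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS])
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* C += A * B over NB x NB blocks, using the same loop nest as the slide. */
static void matmul_blocked(block_t C[NB][NB], block_t A[NB][NB], block_t B[NB][NB])
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
}

/* Hypothetical helper: element (row, col) of a blocked matrix. */
static float element(block_t M[NB][NB], int row, int col)
{
    assert(row >= 0 && row < NB * BS && col >= 0 && col < NB * BS);
    return M[row / BS][col / BS][row % BS][col % BS];
}
```

Calling `matmul_blocked(C, A, B)` with A set to the identity and C zeroed leaves C equal to B, which is a quick way to check the blocked indexing.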
CellSs sample code: Sparse LU
int main(int argc, char **argv) {
    int ii, jj, kk;
    …
    for (kk = 0; kk < NB; kk++) {
        lu0(A[kk][kk]);
        for (jj = kk + 1; jj < NB; jj++)
            if (A[kk][jj] != NULL)
                fwd(A[kk][kk], A[kk][jj]);
        for (ii = kk + 1; ii < NB; ii++)
            if (A[ii][kk] != NULL) {
                bdiv(A[kk][kk], A[ii][kk]);
                for (jj = kk + 1; jj < NB; jj++)
                    if (A[kk][jj] != NULL) {
                        if (A[ii][jj] == NULL)
                            A[ii][jj] = allocate_clean_block();
                        bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
                    }
            }
    }
}

void lu0(float *diag);
void bdiv(float *diag, float *row);
void bmod(float *row, float *col, float *inner);
void fwd(float *diag, float *col);
CellSs sample code: Sparse LU
int main(int argc, char **argv) {
    int ii, jj, kk;
    …
    for (kk = 0; kk < NB; kk++) {
        lu0(A[kk][kk]);
        for (jj = kk + 1; jj < NB; jj++)
            if (A[kk][jj] != NULL)
                fwd(A[kk][kk], A[kk][jj]);
        for (ii = kk + 1; ii < NB; ii++)
            if (A[ii][kk] != NULL) {
                bdiv(A[kk][kk], A[ii][kk]);
                for (jj = kk + 1; jj < NB; jj++)
                    if (A[kk][jj] != NULL) {
                        if (A[ii][jj] == NULL)
                            A[ii][jj] = allocate_clean_block();
                        bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
                    }
            }
    }
}

#pragma css task inout(diag[B][B])
void lu0(float *diag);

#pragma css task input(diag[B][B]) inout(row[B][B])
void bdiv(float *diag, float *row);

#pragma css task input(row[B][B],col[B][B]) inout(inner[B][B])
void bmod(float *row, float *col, float *inner);

#pragma css task input(diag[B][B]) inout(col[B][B])
void fwd(float *diag, float *col);
CellSs sample code: Sparse LU
Same code as on the previous slide, with two annotations:
• Data dependent parallelism (the A[...][...] != NULL tests decide which tasks are created)
• Dynamic main memory allocation (allocate_clean_block() creates missing blocks at run time)
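The Sparse LU code calls `allocate_clean_block()` without showing it. A minimal sketch of what such a helper could look like follows; the body, the values of `B` and `NB`, and the wrapper `ensure_block` are assumptions made for illustration, not the authors' code. It relies on `calloc` returning zeroed bytes, which represent 0.0f on the usual IEEE 754 platforms.

```c
#include <assert.h>
#include <stdlib.h>

#define NB 3   /* blocks per matrix dimension (assumed value) */
#define B  4   /* elements per block dimension (assumed value) */

/* Hypothetical implementation: a zero-filled B x B block in main memory,
 * or NULL if the allocation fails. */
static float *allocate_clean_block(void)
{
    return calloc((size_t)B * B, sizeof(float));
}

/* Hypothetical wrapper around the slide's "if (A[ii][jj] == NULL)" step:
 * makes sure block (ii, jj) exists before bmod updates it.
 * Returns 0 on success and -1 if memory could not be allocated. */
static int ensure_block(float *A[NB][NB], int ii, int jj)
{
    assert(ii >= 0 && ii < NB && jj >= 0 && jj < NB);
    if (A[ii][jj] == NULL)
        A[ii][jj] = allocate_clean_block();
    return A[ii][jj] != NULL ? 0 : -1;
}
```

An existing block is left untouched, so repeated calls for the same (ii, jj) neither leak memory nor reset accumulated values.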
CellSs sample code: Checking LU
int main(int argc, char* argv[]) {
    ...
    copy_mat(A, origA);
    LU(A);
    split_mat(A, L, U);
    clean_mat(A);
    sparse_matmult(L, U, A);
    compare_mat(origA, A);
}

#pragma css task input(Src) out(Dst)
void copy_block(float Src[BS][BS], float Dst[BS][BS]);

void copy_mat(float *Src, float *Dst) {
    ...
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++)
            ...
            copy_block(Src[ii][jj], block);
    ...
}

#pragma css task input(A) out(L,U)
void split_block(float A[BS][BS], float L[BS][BS], float U[BS][BS]);

void split_mat(float *LU[NB][NB], float *L[NB][NB], float *U[NB][NB]) {
    ...
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++) {
            ...
            split_block(LU[ii][ii], L[ii][ii], U[ii][ii]);
            ...
        }
}
Compilation environment
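app.c → CSS compiler → app_spe.c + app_ppe.c
app_spe.c → SPE Compiler → app_spe.o → SPE Linker (with llib_css-spe.so) → SPE executable → SPE Embedder → PPE Object
app_ppe.c → PPE Compiler → app_ppe.o → PPE Linker (with PPE Object and llib_css-ppe.so) → Cell executable
SDK: SPE Compiler, SPE Linker, SPE Embedder, PPE Compiler, PPE Linker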
Execution behavior
PPU
• User main program on top of the CellSs PPU lib
• Main thread: data dependence, data renaming
• Helper thread: scheduling, work assignment, synchronization
Memory
• User data
• Renaming
• Task graph
SPU0, SPU1, SPU2
• Original task code on top of the CellSs SPU lib
• DMA in, task execution, DMA out, synchronization
Between PPU and SPUs: tasks, finalization signal
Between memory and SPUs: stage in/out data
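The "data dependence" step that builds the task graph can be sketched from the `input`/`inout` clauses alone. The code below is an illustration under assumptions, not the CellSs runtime: the `dep_tracker` type and all function names are hypothetical, and only true (read-after-write) dependences are recorded. False dependences, which the slide handles through renaming, are not modeled.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 64
#define MAX_DEPS    8
#define NO_TASK   (-1)

/* Hypothetical tracker: remembers which task last wrote each block. */
typedef struct {
    const void *addr[MAX_BLOCKS];
    int last_writer[MAX_BLOCKS];
    int nblocks;
    int ntasks;
} dep_tracker;

static void tracker_init(dep_tracker *t) { t->nblocks = 0; t->ntasks = 0; }

/* Index of the block in the table, adding it on first sight. */
static int find_or_add(dep_tracker *t, const void *addr)
{
    for (int i = 0; i < t->nblocks; i++)
        if (t->addr[i] == addr) return i;
    assert(t->nblocks < MAX_BLOCKS);
    t->addr[t->nblocks] = addr;
    t->last_writer[t->nblocks] = NO_TASK;
    return t->nblocks++;
}

/* Appends writer to deps unless it is NO_TASK or already present. */
static int add_dep(int deps[], int ndeps, int writer)
{
    if (writer == NO_TASK) return ndeps;
    for (int i = 0; i < ndeps; i++)
        if (deps[i] == writer) return ndeps;
    assert(ndeps < MAX_DEPS);
    deps[ndeps] = writer;
    return ndeps + 1;
}

/* Registers one task invocation. Tasks get ids 0, 1, 2, ... in call order.
 * Fills deps with the ids of the tasks it must wait for and returns the count. */
static int add_task(dep_tracker *t, const void *const in[], int nin,
                    const void *const inout[], int ninout, int deps[MAX_DEPS])
{
    int id = t->ntasks++;
    int ndeps = 0;
    for (int i = 0; i < nin; i++)
        ndeps = add_dep(deps, ndeps, t->last_writer[find_or_add(t, in[i])]);
    for (int i = 0; i < ninout; i++) {
        int b = find_or_add(t, inout[i]);
        ndeps = add_dep(deps, ndeps, t->last_writer[b]);
        t->last_writer[b] = id;   /* this task is now the latest writer */
    }
    return ndeps;
}
```

Feeding it the first Sparse LU calls (lu0, fwd, bdiv, bmod on a 2 × 2 block matrix) yields no predecessor for lu0, lu0 as the predecessor of both fwd and bdiv, and both bdiv and fwd as predecessors of bmod, so fwd and bdiv may run concurrently.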
Execution behavior: Matrix multiply
...
#pragma css task input(A, B) inout(C)
block_addmultiply(C[i][j], A[i][k], B[k][j])
C[i][j]
A[i][k] B[k][j]
• For each operation, two blocks of data are fetched (DMA get) from PPE memory into the SPE local storage
• Clusters of dependent tasks are scheduled to the same SPE
• The inout block is kept in the local storage and only put back to PPE memory once (reuse)
Execution behavior: Matrix multiply
Clustering
Chain of 7 block multiplies (270 µs)
Size of block: 64x64 floats
Stage in/out
Reuse
Main thread: task generation
Helper thread
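The effect of clustering and reuse on data movement can be estimated with simple arithmetic. The sketch below uses the 64x64 float blocks and the chain of 7 multiplies from the trace; the function names are hypothetical, and the count assumes the inout block is fetched once at the start of the chain. The 256 KB figure is the SPE local store size.

```c
#include <assert.h>
#include <stddef.h>

#define BS 64                               /* block dimension from the trace */
#define LOCAL_STORE_BYTES (256u * 1024u)    /* SPE local store size */

typedef struct {
    int gets;   /* blocks moved from main memory to local storage */
    int puts;   /* blocks moved from local storage to main memory */
} dma_count;

/* Size in bytes of one BS x BS block of floats. */
static size_t block_bytes(void)
{
    return (size_t)BS * BS * sizeof(float);
}

/* Whether nblocks blocks fit in the local store, ignoring code and stack. */
static int fits_in_local_store(int nblocks)
{
    assert(nblocks >= 0);
    return (size_t)nblocks * block_bytes() <= LOCAL_STORE_BYTES;
}

/* Chain of dependent block_addmultiply tasks on one SPE: the inout block is
 * fetched once, kept in local storage, and written back once at the end. */
static dma_count transfers_with_reuse(int chain_len)
{
    assert(chain_len > 0);
    dma_count c = { 2 * chain_len + 1, 1 };
    return c;
}

/* Same chain without reuse: every task fetches three blocks and writes one. */
static dma_count transfers_without_reuse(int chain_len)
{
    assert(chain_len > 0);
    dma_count c = { 3 * chain_len, chain_len };
    return c;
}
```

For a chain of 7 tasks this gives 15 gets and 1 put with reuse against 21 gets and 7 puts without, at 16 KB per block; three such blocks occupy 48 KB of the local store.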
Execution behavior: Matrix multiply
Waiting for SPE availability
Schedule & dispatch
Execution behavior: Matrix multiply
Stage out and notification
Task generation
Dispatch
Schedule
Graph update
Execution behavior: Sparse LU
Priority hints
#pragma css task highpriority …
Increase parallelism / support scheduling
Support reuse
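How a scheduler might honor the `highpriority` hint can be sketched as a selection rule over the list of ready tasks. This is only an assumption about one plausible policy; the slides do not describe the actual CellSs scheduler, and `ready_task` and `pick_next` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor of a task whose dependences are all satisfied. */
typedef struct {
    int id;
    int high_priority;   /* 1 if the task was declared with highpriority */
} ready_task;

/* Picks the oldest high-priority task if there is one, otherwise the oldest
 * ready task. Returns its index in the list, or -1 if the list is empty. */
static int pick_next(const ready_task ready[], int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        if (ready[i].high_priority)
            return i;
    return n > 0 ? 0 : -1;
}
```

Under such a rule, marking the tasks on the critical path as high priority lets them start as soon as they are ready, so the tasks that depend on them are released earlier.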
Execution behavior: J_Check_LU
copy_mat (A, origA);
LU (A);
split_mat (A, L, U);
clean_mat(A);
sparse_matmult (L, U, A);
compare_mat (origA, A);
[Traces: without CellSs vs. with CellSs]
...
Execution behavior: J_Check
Execution behavior: Other views
Stage in bandwidth
Stage out bandwidth
Task generation lookahead
Full unrolling before execution
Overlapped generation/execution
Scalability
[Figures: three plots of speedup against number of SPUs (x-axis 0 to 9): Matmul speedup results, SparseLU speedup results, and Matmul speedup results (v2). Slide annotation: "Faster tasks (pre-fetching data)".]
Related work
•Sequoia
• Just presented!
•Charm++
• Runtime tailored to Cell BE
• Offload API
•Octopiler (IBM)
• Auto-SIMDization
• OpenMP as programming model
• Single shared-memory abstraction
Conclusions & Ongoing work
•Cell Superscalar offers a simple programming model for the Cell BE
• Allows easy porting of applications
• General
• Constraints:
• Blocking
•Ongoing work
• Run Time optimization: overheads, halos, scheduling algs, overlap phases, overlays, speculation, short-circuits, more helper threads, lazy renaming, …
• Garbage collection
•Applications
• Bio
• Engineering
•To be distributed as open source soon
THANKS!
Visit us at BSC booth #1800 for further information