Minimizing Communication in Numerical Linear Algebra: Multicore and GPU Implementations
Jim Demmel, EECS & Math Departments, UC Berkeley (www.cs.berkeley.edu/~demmel)

Page 1: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Jim Demmel, EECS & Math Departments, UC Berkeley

[email protected]

Minimizing Communication in Numerical Linear Algebra

www.cs.berkeley.edu/~demmel

Multicore and GPU Implementations

Page 2: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

2

Outline of Dense Linear Algebra

• Multicore implementations
  - DAG scheduling to minimize synchronization
• Optimization on GPUs
  - Minimizing data movement between CPU and GPU memories

Summer School Lecture 5

Page 3: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Multicore: Expressing Parallelism with a DAG

• DAG = Directed Acyclic Graph
  - S1 → S2 means statement S2 "depends on" statement S1
  - Can execute in parallel any Si without input dependencies
• For simplicity, consider Cholesky A = LL^T, not LU
  - N-by-N matrix, numbered from A(0,0) to A(N-1,N-1)
  - "Left looking" code

3

for k = 0 to N-1
    for n = 0 to k-1
        A(k,k) = A(k,k) – A(k,n)*A(k,n)
    A(k,k) = sqrt(A(k,k))
    for m = k+1 to N-1
        for n = 0 to k-1
            A(m,k) = A(m,k) – A(m,n)*A(k,n)
        A(m,k) = A(m,k) / A(k,k)
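For concreteness, here is the same left-looking loop nest as a small, self-contained C routine (a sketch: the helper name cholesky_left and the row-major indexing macro are illustrative, not from the slides).

#include <math.h>

/* Left-looking Cholesky: overwrite the lower triangle of the N-by-N
   SPD matrix A (row-major, leading dimension N) with the factor L.
   Returns 0 on success, -1 if a non-positive pivot is encountered. */
static int cholesky_left(double *A, int N)
{
#define IDX(i, j) ((i) * N + (j))
    for (int k = 0; k < N; k++) {
        /* Update the diagonal entry with previously computed columns. */
        for (int n = 0; n < k; n++)
            A[IDX(k, k)] -= A[IDX(k, n)] * A[IDX(k, n)];
        if (A[IDX(k, k)] <= 0.0)
            return -1;                      /* not numerically SPD */
        A[IDX(k, k)] = sqrt(A[IDX(k, k)]);

        /* Update and scale the rest of column k. */
        for (int m = k + 1; m < N; m++) {
            for (int n = 0; n < k; n++)
                A[IDX(m, k)] -= A[IDX(m, n)] * A[IDX(k, n)];
            A[IDX(m, k)] /= A[IDX(k, k)];
        }
    }
#undef IDX
    return 0;
}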

Page 4: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Expressing Parallelism with a DAG - Cholesky

for k = 0 to N-1
    for n = 0 to k-1
        S1(k,n):   A(k,k) = A(k,k) – A(k,n)*A(k,n)
    S2(k):         A(k,k) = sqrt(A(k,k))
    for m = k+1 to N-1
        for n = 0 to k-1
            S3(k,m,n):  A(m,k) = A(m,k) – A(m,n)*A(k,n)
        S4(k,m):        A(m,k) = A(m,k) · A(k,k)^(-1)

[Figure: the DAG over indices (k, m, n), with vertices S1(k,n), S2(k), S3(k,m,n), S4(k,m)]

DAG has N³/6 vertices:
    S1(k,n) → S2(k)          for n = 0 : k-1
    S3(k,m,n) → S4(k,m)      for n = 0 : k-1
    S2(k) → S4(k,m)          for m = k+1 : N-1
    S4(k,m) → S3(k',m,k)     for k' > k
    S4(k,m) → S3(k,m',k)     for m' > m

Page 5: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Expressing Parallelism with a DAG – Block Cholesky

for k = 0 to N/b-1
    for n = 0 to k-1
        S1:  A[k,k] = A[k,k] – A[k,n]*A[k,n]^T
    S2:      A[k,k] = unblocked_Cholesky(A[k,k])
    for m = k+1 to N/b-1
        for n = 0 to k-1
            S3:  A[m,k] = A[m,k] – A[m,n]*A[k,n]^T
        S4:      A[m,k] = A[m,k] · A[k,k]^(-1)

• Each A[i,j] is a b-by-b block
• Same DAG as before, but only (N/b)³/6 vertices
• Each statement is now a BLAS/LAPACK kernel on blocks: S1 = SYRK, S2 = POTRF, S3 = GEMM, S4 = TRSM
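One way to turn this blocked loop nest into dependency-defined tasks, in the spirit of the pragma-annotated schedulers on the following slides (SMPSs), is sketched below using standard OpenMP task depend clauses as a widely available analogue. The tile kernels are stubs standing in for POTRF/TRSM/SYRK/GEMM on one b-by-b tile, and the tile layout A[i*NT+j] is an assumption of this sketch, not of the slides.

#include <omp.h>

/* Stubs standing in for the b-by-b tile kernels; in practice these would
   call LAPACK dpotrf and BLAS dtrsm/dsyrk/dgemm on a single tile. */
static void potrf_tile(double *Akk) { (void)Akk; }
static void trsm_tile(double *Amk, const double *Akk) { (void)Amk; (void)Akk; }
static void syrk_tile(double *Akk, const double *Akn) { (void)Akk; (void)Akn; }
static void gemm_tile(double *Amk, const double *Amn, const double *Akn)
{ (void)Amk; (void)Amn; (void)Akn; }

/* Tiled left-looking Cholesky over NT x NT tiles; A[i*NT+j] points to tile
   (i,j).  The depend() clauses encode exactly the S1-S4 dependences, and
   the OpenMP runtime builds and schedules the DAG. */
void cholesky_tiled(double **A, int NT)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        for (int n = 0; n < k; n++) {                        /* S1: SYRK  */
            #pragma omp task depend(in: A[k*NT+n][0]) depend(inout: A[k*NT+k][0])
            syrk_tile(A[k*NT+k], A[k*NT+n]);
        }
        #pragma omp task depend(inout: A[k*NT+k][0])         /* S2: POTRF */
        potrf_tile(A[k*NT+k]);

        for (int m = k + 1; m < NT; m++) {
            for (int n = 0; n < k; n++) {                    /* S3: GEMM  */
                #pragma omp task depend(in: A[m*NT+n][0], A[k*NT+n][0]) \
                                 depend(inout: A[m*NT+k][0])
                gemm_tile(A[m*NT+k], A[m*NT+n], A[k*NT+n]);
            }
            #pragma omp task depend(in: A[k*NT+k][0]) depend(inout: A[m*NT+k][0])
            trsm_tile(A[m*NT+k], A[k*NT+k]);                 /* S4: TRSM  */
        }
    }
}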

Page 6: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

• Note implied order of summation from left to right

• Not necessary for correctness, but it does reflect what the sequential code does

• Can process DAG in any order respecting dependences

03/02/2009 6

Sample Cholesky DAG with #blocks in any row or column = N/b = 5

Slide courtesy of Jakub Kurzak, UTK

Page 7: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Scheduling options

• Static (pre-assign tasks to processors) or dynamic (idle processors grab ready jobs from a work queue)
  - If dynamic, does the scheduler take user hints/priorities?
• Respect locality (e.g., a processor must have some task data in its cache) vs. not
• Build and store the entire DAG to schedule it (which may be very large, (N/b)³), or build just the next few "levels" at a time (smaller, but less information for the scheduler)
• Programmer builds DAG & schedule vs. depending on compiler or run-time system
  - Ease of programming vs. not exploiting user knowledge
  - If compiler, how conservative is detection of parallelism?

Summer School Lecture 5 7

Page 8: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Schedulers tested

8

• Cilk
  - programmer-defined parallelism
  - spawn – creates independent tasks
  - sync – synchronizes a sub-branch of the tree
• SMPSs
  - dependency-defined parallelism
  - pragma-based annotation of tasks (directionality of the parameters)
• PLASMA (Static Pipeline)
  - programmer-defined (hard-coded)
  - a priori processing order
  - progress tracking
  - stalling on dependencies

Slide courtesy of Jakub Kurzak, UTK

Page 9: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Measured Results for Tiled Cholesky

Summer School Lecture 5 9

PLASMA:

Slide courtesy of Jakub Kurzak, UTK

• Measured on Intel Tigerton 2.4 GHz
• Cilk 1D: one task is a whole panel, but with "look ahead"
• Cilk 2D: tasks are blocks, scheduler steals work, little locality
• PLASMA works best

Page 10: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

More Measured Results for Tiled Cholesky

• Measured on Intel Tigerton 2.4 GHz

[Charts for Cilk, SMPSs, and PLASMA (Static Pipeline)]

Slide courtesy of Jakub Kurzak, UTK

Page 11: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Still More Measured Results for Tiled Cholesky

quad-socket, quad-core (16 cores total) Intel Tigerton 2.4 GHz

• PLASMA (static pipeline) – best

• SMPSs – somewhat worse
• Cilk 2D – inferior
• Cilk 1D – still worse

Slide courtesy of Jakub Kurzak, UTK

Page 12: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

12

Intel’s Clovertown Quad Core

[Chart: Mflop/s vs. problem size (1,000–15,000)]

Three implementations of LU factorization; quad core with 2 sockets per board, 8 threads (8-core experiments):
1. LAPACK (BLAS fork-join parallelism)
2. ScaLAPACK (message passing using memory copy)
3. DAG-based (dynamic scheduling)

Source: Jack Dongarra

Page 13: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Scheduling on Multicore – Next Steps

• PLASMA 2.0.0 released
  - Just Cholesky, QR, LU, using static scheduling
  - LU does not do partial pivoting – stability?
  - http://icl.cs.utk.edu/plasma/
• Future of PLASMA
  - Add dynamic scheduling, similar to SMPSs
• DAGs for eigenproblems are too complicated to do by hand
  - Depend on user annotations and API, not compiler
  - Still assume homogeneity of available cores
• What about GPUs, or mixtures of CPUs and GPUs?
  - MAGMA

Summer School Lecture 5 13

Page 14: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

QR Factorization on Intel 16 cores, Tall Skinny Matrices

[Chart: Gflop/s vs. N, where the matrix is M x N with M = 51200; curves: Theoretical Peak, DGEMM Peak, MKL (10.1), LAPACK (3.2)]

Summer School Lecture 5

14

Page 15: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

LAPACK QR

Step 1 Step 2 Step 3 Step 4 . . .

Summer School Lecture 5 15

Page 16: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Parallel Tasks in QR

• Break into smaller tasks and remove dependencies

Summer School Lecture 5

Page 17: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Parallel Tasks in QR

Step 1: QR of block 1,1

Summer School Lecture 5 17

Page 18: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Parallel Tasks in QR

Step 1: QR of block 1,1

Step 2: Use R to zero A1,2

Summer School Lecture 5 18

Page 19: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Parallel Tasks in QR

Step 1: QR of block 1,1

Step 2: Use R to zero A1,2

Summer School Lecture 5 19

Page 20: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Parallel Tasks in QR

Step 3: Use R to zero A1,3

.

.

.

Step 1: QR of block 1,1

Step 2: Use R to zero A1,2

Summer School Lecture 5 20


Page 23: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

QR Factorization on Intel 16 cores, Tall Skinny Matrices

[Chart: Gflop/s vs. N, where the matrix is M x N with M = 51200; curves: Theoretical Peak, DGEMM Peak, PLASMA (2.1), MKL (10.1), LAPACK (3.2)]

Summer School Lecture 5

23

Page 24: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Communication Reducing QR Factorization

Courtesy Jack Dongarra

TS matrix MT=6 and NT=3 split into 2 domains

3 overlapped steps

panel factorization

updating the trailing submatrix

merge the domains

Summer School Lecture 5 24
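In code, the communication-reducing idea for a tall-skinny panel can be sketched as follows (a flat, one-level variant of the domain scheme pictured above; qr_factor is a hypothetical helper for a small in-place QR, e.g. LAPACK dgeqrf, and the column-major layout and divisibility assumptions are mine, not from the slides): each domain is factored independently, and the stacked R factors are merged by one more small QR.

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: in-place QR of an m-by-n column-major block with
   leading dimension lda; afterwards the upper triangle holds R. */
void qr_factor(double *A, int m, int n, int lda);

/* One-level TSQR/CAQR panel sketch: factor each of the D row domains
   independently (these calls can run in parallel), stack their n-by-n R
   factors, and merge them with one more small QR.  Returns the n-by-n R
   of the whole m-by-n panel in R (column-major, leading dimension n).
   Assumes m is divisible by D and m/D >= n. */
void tsqr_panel(double *A, int m, int n, int lda, int D, double *R)
{
    int rows = m / D;
    double *stack = malloc((size_t)(D * n) * n * sizeof *stack);

    /* Step 1: independent panel factorizations, one per domain. */
    for (int d = 0; d < D; d++)
        qr_factor(A + d * rows, rows, n, lda);

    /* Step 2: copy each domain's upper-triangular R into a stacked
       (D*n)-by-n matrix. */
    memset(stack, 0, (size_t)(D * n) * n * sizeof *stack);
    for (int d = 0; d < D; d++)
        for (int j = 0; j < n; j++)
            for (int i = 0; i <= j; i++)
                stack[(d * n + i) + j * (D * n)] = A[(d * rows + i) + j * lda];

    /* Step 3: merge the domains; the R of the stack is the R of the panel. */
    qr_factor(stack, D * n, n, D * n);
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            R[i + j * n] = (i <= j) ? stack[i + j * (D * n)] : 0.0;

    free(stack);
}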


Page 38: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

TS matrix MT=6 and NT=3 split into 2 domains

3 overlapped steps

panel factorization

updating the trailing submatrix

merge the domains

Final R computed

Communication Reducing QR Factorization

Courtesy Jack Dongarra
Summer School Lecture 5 38

Page 39: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Example with 4 and 8 Domains

A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.

Courtesy Jack Dongarra
Summer School Lecture 5 39

Page 40: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Execution Trace

16 core run

Courtesy Jack Dongarra
Summer School Lecture 5 40

Page 41: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Communication Reducing QR Factorization

Quad-socket, quad-core machine Intel Xeon EMT64 E7340 at 2.39 GHz. Theoretical peak is  153.2 Gflop/s with 16 cores.

Matrix size 51200 by 3200

Sequential PLASMA
Courtesy Jack Dongarra 41

Page 42: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Cluster Experiment

• grig.sinrg.cs.utk.edu
• 61 nodes
  - Two CPUs per node
  - Intel Xeon 3.20 GHz
  - Peak performance 6.4 GFLOPS
  - Myrinet interconnection (MX 1.0.0)
• Goto BLAS 1.26
  - DGEMM performance 5.57 GFLOPS (87%)
• MPICH-MX
• gcc 64 bits

Courtesy Jack Dongarra
Summer School Lecture 5 43


Page 45: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Weak Scalability (8 columns of tiles)

On 1 CPU, the matrix size is 64x8 tiles
On k CPUs, the matrix size is (k*64)x8 tiles
Tiles are 200x200 blocks

Courtesy Jack Dongarra 46

Page 46: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Scalability of CAQR on the Grig Cluster (8 tiles per row)

[Chart: GFLOPS per core vs. number of cores (1, 2, 4, 8, 16, 32, 64); curves: Distributed CAQR, ScaLAPACK, peak, DGEMM]

Weak Scalability (8 columns of tiles)

Courtesy Jack Dongarra 47

Page 47: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Dense Linear Algebra on GPUs

• Source: Vasily Volkov's SC08 paper
  - Best Student Paper Award
• New challenges
  - More complicated memory hierarchy
  - Not like "L1 inside L2 inside …"
    • Need to choose which memory to use carefully
    • Need to move data manually
  - GPU does some operations much faster than CPU, but not all
  - CPU and GPU like different data layouts

Summer School Lecture 5 48

Page 48: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Motivation

• Goal: understand bottlenecks in the dense linear algebra kernels
• Requires detailed understanding of the GPU architecture
• Result 1: new coding recommendations for high performance on GPUs
• Result 2: new, fast variants of LU, QR, Cholesky, and other routines

• NVIDIA released CUBLAS 1.0 in 2007, which is BLAS for GPUs
• This enables a straightforward port of LAPACK to the GPU

• Consider single precision only

[Bar chart (Gflop/s, 0–350), 2007 results, GeForce 8800 GTX vs. Core2 Quad 2.4 GHz: LAPACK SGETRF (naive GPU port vs. MKL 10.0), BLAS SGEMM (CUBLAS 1.1 vs. MKL 10.0), and peak in a*b+c. Takeaways: disappointing performance in (naive) LU factorization, not so great in matrix-matrix multiply, impressive sheer compute power.]

Page 49: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

GPU Memory Hierarchy

50

• Register file is the fastest and the largest on-chip memory
  - Constrained to vector operations only
• Shared memory permits indexed and shared access
  - However, 2–4x smaller and 4x lower bandwidth than registers
• Only 1 operand in shared memory is allowed versus 4 register operands
  - Some instructions run slower if using shared memory

[Diagram: 64 KB vector register file (64 lanes) and 16 KB store (16 lanes), connected by a crossbar]

Page 50: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Memory Latency on GeForce 8800 GTX

Repeat k = A[k] where A[k] = (k + stride) mod array_size

[Chart: latency (cycles and ns) vs. stride (4 bytes to 16 MB), with curves for array sizes of 5 KB, 5.5 KB, 20 KB, 192 KB, 224 KB, 768 KB, 8 MB, 16 MB, 32 MB, non-cached 128 MB, and local memory (8 KB)]
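The access pattern behind this measurement is easy to reproduce; a C sketch of the same pointer chase, timed here on the CPU for illustration (the slide's numbers come from an equivalent loop running on the GeForce 8800 GTX; the array size, stride, and iteration count below are arbitrary example values):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Pointer-chase latency probe: A[k] = (k + stride) mod array_size, then
   repeatedly execute k = A[k].  Each load depends on the previous one, so
   the loop time divided by the iteration count approximates the average
   memory latency for the chosen array size and stride. */
int main(void)
{
    const size_t array_size = 1u << 24;        /* elements (64 MB of int)   */
    const size_t stride     = 1u << 10;        /* elements between accesses */
    const long   iters      = 10 * 1000 * 1000;

    int *A = malloc(array_size * sizeof *A);
    if (!A) return 1;
    for (size_t k = 0; k < array_size; k++)
        A[k] = (int)((k + stride) % array_size);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int k = 0;
    for (long i = 0; i < iters; i++)
        k = A[k];                              /* serialized, latency-bound loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average latency: %.1f ns (k = %d)\n", ns / iters, k);
    free(A);
    return 0;
}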

Page 51: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

(Some new) NVIDIA coding recommendations

• Minimize communication with CPU memory
• Keep as much data in registers as possible
  - Largest, fastest on-GPU memory
  - Vector-only operations
• Use as little shared memory as possible
  - Smaller, slower than registers; use for communication and sharing only
  - Speed limit: 66% of peak with one shared memory argument
• Use vector length VL = 64, not the max VL = 512
  - Strip mine longer vectors into shorter ones (see the sketch after this list)

• Final matmul code similar to Cray X1 or IBM 3090 vector codes

52
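Strip mining in its simplest form, as a C sketch (the SAXPY example and the VL constant are illustrative; on the GPU each length-64 chip would map onto one 64-thread block):

/* Strip mining: process a long vector in fixed-length chips of VL = 64,
   so each chip fits one vector register set / 64-thread block. */
#define VL 64
void saxpy_stripmined(int n, float a, const float *x, float *y)
{
    for (int i0 = 0; i0 < n; i0 += VL) {
        int len = (n - i0 < VL) ? (n - i0) : VL;   /* last, partial chip */
        for (int i = 0; i < len; i++)
            y[i0 + i] += a * x[i0 + i];
    }
}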

Page 52: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

__global__ void sgemmNN( const float *A, int lda, const float *B, int ldb,
                         float *C, int ldc, int k, float alpha, float beta )
{
    // Compute pointers to the data
    A += blockIdx.x * 64 + threadIdx.x + threadIdx.y * 16;
    B += threadIdx.x + ( blockIdx.y * 16 + threadIdx.y ) * ldb;
    C += blockIdx.x * 64 + threadIdx.x + ( threadIdx.y + blockIdx.y * ldc ) * 16;

    // Declare the on-chip storage
    __shared__ float bs[16][17];
    float c[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

    const float *Blast = B + k;
    do
    {
        // Read next B's block into shared memory
        #pragma unroll
        for( int i = 0; i < 16; i += 4 )
            bs[threadIdx.x][threadIdx.y + i] = B[i*ldb];
        B += 16;
        __syncthreads();

        // The bottleneck: read A's columns and do rank-1 updates
        #pragma unroll
        for( int i = 0; i < 16; i++, A += lda )
        {
            c[0]  += A[0]*bs[i][0];  c[1]  += A[0]*bs[i][1];  c[2]  += A[0]*bs[i][2];  c[3]  += A[0]*bs[i][3];
            c[4]  += A[0]*bs[i][4];  c[5]  += A[0]*bs[i][5];  c[6]  += A[0]*bs[i][6];  c[7]  += A[0]*bs[i][7];
            c[8]  += A[0]*bs[i][8];  c[9]  += A[0]*bs[i][9];  c[10] += A[0]*bs[i][10]; c[11] += A[0]*bs[i][11];
            c[12] += A[0]*bs[i][12]; c[13] += A[0]*bs[i][13]; c[14] += A[0]*bs[i][14]; c[15] += A[0]*bs[i][15];
        }
        __syncthreads();
    } while( B < Blast );

    // Store C's block to memory
    for( int i = 0; i < 16; i++, C += ldc )
        C[0] = alpha*c[i] + beta*C[0];
}

53

Page 53: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Our code vs. CUBLAS 1.1

54

Performance in multiplying two NxN matrices on GeForce 8800 GTX:

[Chart: fraction of peak vs. N (64–4096): CUBLAS 1.1 reaches about 37%, our implementation about 60%, and multiply-and-add with an operand in shared memory sets the limit at 66%]

Page 54: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

The Progress So Far

• We achieved predictable performance in SGEMM
  - Which does the O(N³) work in LU factorization
• But LU factorization (naive SGETRF) still underperforms
  - Must be due to the rest of the work, O(N²), done in BLAS1 and BLAS2
  - Why does the O(N²) work take so much time?

[Bar chart (Gflop/s, 0–350), GeForce 8800 GTX vs. Core2 Quad: peak in a*b+c (in registers vs. using shared memory; arithmetic runs slower if using shared memory), BLAS SGEMM (our implementation, now in CUBLAS 2.0, vs. CUBLAS 1.1), and LAPACK SGETRF (naive vs. with CUBLAS 2.0). SGEMM looks good compared to the new, smaller peak; for SGETRF the question is where the time goes.]
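A back-of-the-envelope model shows why the O(N²) part can dominate even though it is asymptotically negligible in flops: it runs at an overhead- and bandwidth-bound rate far below the SGEMM rate. A C sketch, where both rates and the O(N²) coefficient are illustrative assumptions rather than measured figures:

#include <stdio.h>

/* Rough time model for LU: ~(2/3) N^3 flops run at the SGEMM rate, while
   the remaining ~O(N^2) BLAS1/BLAS2 work runs at a much lower rate. */
int main(void)
{
    const double r_gemm  = 180e9;   /* assumed SGEMM rate, flop/s       */
    const double r_small = 2e9;     /* assumed BLAS1/BLAS2 rate, flop/s */

    for (double N = 1024; N <= 16384; N *= 2) {
        double t3 = (2.0 / 3.0) * N * N * N / r_gemm;   /* matrix-matrix part */
        double t2 = 4.0 * N * N / r_small;              /* panel/pivot part   */
        printf("N = %6.0f: O(N^3) part %.3f s, O(N^2) part %.3f s (%.0f%% of total)\n",
               N, t3, t2, 100.0 * t2 / (t3 + t2));
    }
    return 0;
}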

Page 55: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Row-Pivoting in LU Factorization

56

Exchange two rows of an NxN matrix (SSWAP in CUBLAS 2.0):

Row pivoting in column-major layout on the GPU is very slow. This alone consumes half of the runtime in naive SGETRF.

[Chart: microseconds (4–4000, log scale) vs. N (up to 16384): the matrix in column-major layout (stride-N access) is about 40x slower to pivot than the matrix in row-major layout (aligned unit-stride access)]
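The 40x gap is purely an access-stride effect; a minimal C sketch of the same row exchange in the two layouts (illustrative helper names, single-threaded CPU code rather than the CUBLAS SSWAP measured above):

/* Swap rows r1 and r2 of an N-by-N matrix stored row-major:
   two contiguous, unit-stride streams of N elements each. */
static void swap_rows_rowmajor(float *A, int N, int r1, int r2)
{
    for (int j = 0; j < N; j++) {
        float t = A[r1 * N + j];
        A[r1 * N + j] = A[r2 * N + j];
        A[r2 * N + j] = t;
    }
}

/* Same swap with the matrix stored column-major: every access jumps by
   N elements (stride-N), so each element touches a different cache line
   or memory segment, which is what makes pivoting slow in this layout. */
static void swap_rows_colmajor(float *A, int N, int r1, int r2)
{
    for (int j = 0; j < N; j++) {
        float t = A[r1 + j * N];
        A[r1 + j * N] = A[r2 + j * N];
        A[r2 + j * N] = t;
    }
}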

Page 56: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

BLAS1 Performance

• Peak bandwidth of these GPUs differs by a factor of 4.4
• But runtimes are similar
• Small tasks on the GPU are overhead bound

57

Scale a column of an NxN matrix that fits in the GPU memory (assumes aligned, unit-stride access):

[Chart: microseconds (0–8) vs. N (up to 16384) for GeForce 8600 GTS (peak = 32 GB/s) and GeForce GTX 280 (peak = 141 GB/s)]

Page 57: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Panel Factorization

• Invoking small BLAS operations on the GPU from the CPU is slow
• Can we call a sequence of BLAS operations from the GPU?
• Requires barrier synchronization after each parallel BLAS operation
• Barrier is possible but requires sequential consistency for correctness

58

Factorizing Nx64 matrix in GPU memory using LAPACK’s SGETF2:

[Chart: Gflop/s (0–25) vs. N, compared against a bound that assumes 4 μs of overhead per BLAS call and 127 GB/s bandwidth in memory access (these are the best sustained numbers)]
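The bound in the plot follows from a simple additive model, time >= (number of BLAS calls) x overhead + (bytes moved) / bandwidth. A C sketch using the slide's two constants, where the per-column call count and the bytes-moved and flop estimates are rough illustrative assumptions of mine:

#include <stdio.h>

/* Performance bound for factorizing an N-by-64 panel when every BLAS call
   costs a fixed overhead and all data movement runs at a fixed bandwidth. */
int main(void)
{
    const double overhead  = 4e-6;    /* 4 microseconds per BLAS call (slide) */
    const double bandwidth = 127e9;   /* 127 GB/s sustained (slide)           */
    const int    nb        = 64;      /* panel width                          */

    for (int N = 1024; N <= 16384; N *= 2) {
        /* Illustrative assumptions: ~4 BLAS calls per column of the panel
           (pivot search, swap, scale, rank-1 update), and each column step
           streams the panel once (read + write, 4-byte floats). */
        double n_calls = 4.0 * nb;
        double bytes   = 2.0 * 4.0 * (double)N * nb * nb;   /* rough estimate */
        double flops   = (double)N * nb * nb;               /* ~N*nb^2 flops  */

        double t = n_calls * overhead + bytes / bandwidth;
        printf("N = %5d: bound = %6.2f Gflop/s\n", N, flops / t / 1e9);
    }
    return 0;
}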

Page 58: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Optimizing Matrix Factorizations

• Use the GPU to compute matrix-matrix multiplies only
• Factorize panels on the CPU
• Use look-ahead to overlap computations on CPU and GPU (see the sketch after this list)
• Batch pivoting
• Use right-looking algorithms to have more threads in SGEMM
  - Better load balance in the GPU workload, better latency hiding
• Use row-major layout on the GPU in LU factorization
  - Requires an extra (but fast) matrix transpose for each CPU-GPU transfer
• Substitute triangular solves of LX = B (TRSM) with multiplication by L^(-1)
  - At worst squares the pivot growth factor in the error bound (assumed small anyway)
  - Can check ||L^(-1)|| and use TRSM if it is too large
• Use two-level and variable-size blocking as finer tuning
  - Thicker blocks impose lower bandwidth requirements in SGEMM
  - Variable-size blocking improves CPU/GPU load balance
• Use column-cyclic layout when computing with two GPUs
  - Requires no data exchange between GPUs in pivoting
  - Cyclic layout is used on the GPUs only, so it does not affect panel factorization

59
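The look-ahead item above corresponds to a loop structure like the following sketch; all helper functions are hypothetical stand-ins (this is not the actual Volkov/MAGMA API), and GPU calls are assumed to be asynchronous launches on a single in-order stream:

/* Hypothetical helpers: panel factorization on the CPU, trailing-matrix
   updates (SGEMM/TRSM) on the GPU, and explicit CPU<->GPU transfers. */
void factor_panel_on_cpu(int j);
void download_panel(int j);                 /* GPU -> CPU copy            */
void upload_panel(int j);                   /* CPU -> GPU copy            */
void gpu_update_next_panel(int j);          /* apply panel j to panel j+1 */
void gpu_update_rest(int j);                /* apply panel j to the rest  */
void gpu_synchronize(void);

/* Right-looking factorization with one level of look-ahead: the CPU
   factors panel j+1 while the GPU is still applying panel j to the
   remainder of the trailing matrix, hiding the slow panel work behind
   the fast GPU updates. */
void factorize_with_lookahead(int npanels)
{
    factor_panel_on_cpu(0);
    upload_panel(0);

    for (int j = 0; j + 1 < npanels; j++) {
        gpu_update_next_panel(j);       /* update column j+1 first          */
        download_panel(j + 1);

        gpu_update_rest(j);             /* big update, launched async       */
        factor_panel_on_cpu(j + 1);     /* overlapped CPU work              */

        gpu_synchronize();              /* rest-update must finish before   */
        upload_panel(j + 1);            /* panel j+1 is applied next round  */
    }
}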

Page 59: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Performance Results

[Chart: Gflop/s (0–350) vs. order of matrix (64–16384) for QR, Cholesky, and LU, annotated with 78%, 49%, and 51%]

Our solution runs at 50% of the system's peak (shown on the right). It is bound by SGEMM, which runs at 60% of the GPU-only peak.

60

Page 60: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Speedup of Factorizations on GPU over CPU

61

[Chart: speedup vs. Core2 Quad (0–4.5x) vs. order of matrix (64–16384) for QR, Cholesky, and LU; up to 4.4x on the GTX 280 and 2.7x on the 8800 GTX]

GPU only useful on large enough matrices

Page 61: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Slowdown when omitting one optimization from LU on GeForce 8800 GTX

62

[Chart: slowdown (0.9–2.0x) vs. order of matrix (64–16384) when omitting each optimization: CPU/GPU overlap, matrix transpose, TRSM via GEMM, batch pivoting]

Page 62: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Time Breakdown for LU on GeForce 8800 GTX

63

[Chart: fraction of time (0–100%) vs. order of matrix (448–11264), broken down into CPU-GPU transfer, transpose, look-ahead, CPU/GPU overlap, GPU work, and CPU work]

Page 63: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

LU Factorization using Two GPUs

[Chart: Gflop/s (0–550) vs. order of matrix (up to 22500); annotated rates include 538, 309, 298, and 179 Gflop/s]

• Second GPU allows 1.7x higher rates
• More than half a teraflop using two GPUs

Page 64: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Results for matmul, LU on NVIDIA

• What we've achieved:
  - Identified realistic peak speed of the GPU architecture
  - Achieved a large fraction of this peak in matrix multiply
  - Achieved a large fraction of the matrix multiply rate in dense factorizations

65

[Bar chart (Gflop/s, 0–350), GeForce 8800 GTX vs. Core2 Quad: peak in a*b+c (in registers vs. if using shared memory), BLAS SGEMM (our implementation, now in CUBLAS 2.0, vs. CUBLAS 1.1), and LAPACK SGETRF (our implementation vs. naive)]

Page 65: Jim  Demmel EECS & Math Departments, UC Berkeley demmel@cs.berkeley.edu

Summer School Lecture 5 66

Extra Slides