
Page 1: Runtime Data Flow  Graph Scheduling of  Matrix Computations

THE UNIVERSITY OF TEXAS AT AUSTIN

Runtime Data Flow Graph Scheduling of Matrix Computations

Ernie Chan

Page 2: Runtime Data Flow  Graph Scheduling of  Matrix Computations

NEC Labs talk, December 15, 2010

Teaser

[Figure: performance plot; surviving labels: "Theoretical Peak", "Performance", "Better"]

Page 3: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Goals

• Programmability
  – Use tools provided by FLAME
• Parallelism
  – Directed acyclic graph (DAG) scheduling

Page 4: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Outline

• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion

Page 5: Runtime Data Flow  Graph Scheduling of  Matrix Computations

SuperMatrix

• Formal Linear Algebra Method Environment (FLAME)
  – High-level abstractions for expressing linear algebra algorithms
• Cholesky Factorization

Page 6: Runtime Data Flow  Graph Scheduling of  Matrix Computations

SuperMatrix

    FLA_Part_2x2( A, &ATL, &ATR,
                     &ABL, &ABR, 0, 0, FLA_TL );

    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
    {
      b = min( FLA_Obj_length( ABR ), nb_alg );

      FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                          /* ******** */     /* **************** */
                                               &A10, /**/ &A11, &A12,
                             ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                             b, b, FLA_BR );
      /*-----------------------------------------------*/
      FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
      FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
      FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
      /*-----------------------------------------------*/
      FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                              /* ********** */    /* ************* */
                                &ABL, /**/ &ABR,    A20, A21, /**/ A22,
                                FLA_TL );
    }

Page 7: Runtime Data Flow  Graph Scheduling of  Matrix Computations

SuperMatrix

• Cholesky Factorization
  – Iteration 1 and Iteration 2 each apply the same three operations to the current partitioning:
    CHOL: $A_{11} := \mathrm{Chol}(A_{11})$
    TRSM: $A_{21} := A_{21} A_{11}^{-T}$
    SYRK: $A_{22} := A_{22} - A_{21} A_{21}^{T}$

Page 8: Runtime Data Flow  Graph Scheduling of  Matrix Computations

SuperMatrix

• LAPACK-style Implementation

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose',
     $               'Non-unit', N-J-JB+1, JB, ONE,
     $               A( J, J ), LDA, A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose',
     $               N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO

Page 9: Runtime Data Flow  Graph Scheduling of  Matrix Computations

SuperMatrix

• FLASH
  – Storage-by-blocks, algorithm-by-blocks
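Storage-by-blocks can be illustrated with a small index calculation. The sketch below is an assumption for illustration only, not the FLASH data structure: the type `BlockedLayout` and both functions are hypothetical, the matrix dimension is assumed to be a multiple of the block size, and blocks are assumed to be laid out in column-major block order with each block stored contiguously.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout (not the FLASH data structure): an n x n matrix is cut
   into nb x nb blocks. Blocks follow one another in column-major block
   order, and each block is itself column-major and contiguous. */
typedef struct {
    size_t n;   /* matrix dimension, assumed to be a multiple of nb */
    size_t nb;  /* block dimension */
} BlockedLayout;

/* Offset of the first element of block (bi, bj). */
size_t block_offset(BlockedLayout m, size_t bi, size_t bj)
{
    size_t blocks_per_dim = m.n / m.nb;
    assert(bi < blocks_per_dim && bj < blocks_per_dim);
    return (bj * blocks_per_dim + bi) * m.nb * m.nb;
}

/* Offset of element (i, j) within the blocked storage. */
size_t element_offset(BlockedLayout m, size_t i, size_t j)
{
    assert(i < m.n && j < m.n);
    size_t in_block = (j % m.nb) * m.nb + (i % m.nb);
    return block_offset(m, i / m.nb, j / m.nb) + in_block;
}
```

Because every block is contiguous, a task that operates on one block touches one compact region of memory, which is what makes a block a natural unit of data for an algorithm-by-blocks.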

Page 10: Runtime Data Flow  Graph Scheduling of  Matrix Computations

SuperMatrix

    FLA_Part_2x2( A, &ATL, &ATR,
                     &ABL, &ABR, 0, 0, FLA_TL );

    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
    {
      FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                          /* ******** */     /* **************** */
                                               &A10, /**/ &A11, &A12,
                             ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                             1, 1, FLA_BR );
      /*-----------------------------------------------*/
      FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
      FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                  FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                  FLA_ONE, A11, A21 );
      FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                  FLA_MINUS_ONE, A21, FLA_ONE, A22 );
      /*-----------------------------------------------*/
      FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                    A10, A11, /**/ A12,
                              /* ********** */    /* ************* */
                                &ABL, /**/ &ABR,    A20, A21, /**/ A22,
                                FLA_TL );
    }

Page 11: Runtime Data Flow  Graph Scheduling of  Matrix Computations

SuperMatrix

• Cholesky Factorization
  – Iteration 1
    CHOL0: $A_{0,0} := \mathrm{Chol}(A_{0,0})$

Page 12: Runtime Data Flow  Graph Scheduling of  Matrix Computations

SuperMatrix

• Cholesky Factorization
  – Iteration 1
    CHOL0: $A_{0,0} := \mathrm{Chol}(A_{0,0})$
    TRSM1: $A_{1,0} := A_{1,0} A_{0,0}^{-T}$
    TRSM2: $A_{2,0} := A_{2,0} A_{0,0}^{-T}$

Page 13: Runtime Data Flow  Graph Scheduling of  Matrix Computations

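SuperMatrix

• Cholesky Factorization
  – Iteration 1
    CHOL0: $A_{0,0} := \mathrm{Chol}(A_{0,0})$
    TRSM1: $A_{1,0} := A_{1,0} A_{0,0}^{-T}$
    TRSM2: $A_{2,0} := A_{2,0} A_{0,0}^{-T}$
    SYRK3: $A_{1,1} := A_{1,1} - A_{1,0} A_{1,0}^{T}$
    GEMM4: $A_{2,1} := A_{2,1} - A_{2,0} A_{1,0}^{T}$
    SYRK5: $A_{2,2} := A_{2,2} - A_{2,0} A_{2,0}^{T}$

[Figure: DAG built so far, with nodes CHOL0; TRSM1, TRSM2; SYRK3, GEMM4, SYRK5]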

Page 14: Runtime Data Flow  Graph Scheduling of  Matrix Computations

SuperMatrix

• Cholesky Factorization
  – Iteration 2
    CHOL6: $A_{1,1} := \mathrm{Chol}(A_{1,1})$
    TRSM7: $A_{2,1} := A_{2,1} A_{1,1}^{-T}$
    SYRK8: $A_{2,2} := A_{2,2} - A_{2,1} A_{2,1}^{T}$

[Figure: DAG built so far, with nodes CHOL0; TRSM1, TRSM2; SYRK3, GEMM4, SYRK5; CHOL6; TRSM7; SYRK8]

Page 15: Runtime Data Flow  Graph Scheduling of  Matrix Computations

SuperMatrix

• Cholesky Factorization
  – Iteration 3
    CHOL9: $A_{2,2} := \mathrm{Chol}(A_{2,2})$

[Figure: complete DAG, with nodes CHOL0; TRSM1, TRSM2; SYRK3, GEMM4, SYRK5; CHOL6; TRSM7; SYRK8; CHOL9]
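The task numbering used on these slides (CHOL0 through CHOL9 for a 3 x 3 matrix of blocks) can be reproduced with a short enumeration. This is a sketch under assumptions: the `Task` record, the `Op` enum, and `enumerate_cholesky_tasks` are hypothetical names, and the code only lists tasks in sequential order; it performs no arithmetic and is not the SuperMatrix analyzer.

```c
#include <assert.h>

#define MAX_TASKS 256

typedef enum { CHOL, TRSM, SYRK, GEMM } Op;

typedef struct {
    Op  op;           /* operation applied by the task */
    int out_i, out_j; /* block written by the task */
} Task;

/* Lists the tasks of a right-looking Cholesky factorization on an n x n
   matrix of blocks, in the order the sequential algorithm issues them.
   `tasks` must have room for MAX_TASKS entries. Returns the task count. */
int enumerate_cholesky_tasks(int n, Task *tasks)
{
    int count = 0;
    for (int k = 0; k < n; k++) {
        assert(count < MAX_TASKS);
        tasks[count++] = (Task){ CHOL, k, k };
        for (int i = k + 1; i < n; i++) {      /* panel below the diagonal */
            assert(count < MAX_TASKS);
            tasks[count++] = (Task){ TRSM, i, k };
        }
        for (int j = k + 1; j < n; j++) {      /* trailing submatrix update */
            for (int i = j; i < n; i++) {
                assert(count < MAX_TASKS);
                tasks[count++] = (Task){ i == j ? SYRK : GEMM, i, j };
            }
        }
    }
    return count;
}
```

Calling `enumerate_cholesky_tasks(3, tasks)` yields ten tasks whose positions match the indices on the slides; for example, entry 4 is the GEMM that updates block (2, 1).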

Page 16: Runtime Data Flow  Graph Scheduling of  Matrix Computations

SuperMatrix

• Cholesky Factorization
  – matrix of blocks

Page 17: Runtime Data Flow  Graph Scheduling of  Matrix Computations

SuperMatrix

• Separation of Concerns
  – Analyzer
    • Decomposes subproblems into component tasks
    • Stores tasks sequentially in a global task queue
    • Internally calculates all dependencies between tasks, which form a DAG, using only the input and output parameters of each task
  – Dispatcher
    • Spawns threads
    • Schedules and dispatches tasks to threads in parallel
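The analyzer's rule, that dependencies follow from each task's input and output parameters alone, can be sketched as below. The `Task` record, the integer block ids, and the functions `depends_on` and `build_dag` are hypothetical names introduced for illustration; the sketch assumes one output block per task and checks the three classic hazards (flow, anti, output).

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_INPUTS 3
#define MAX_DAG    16

/* Hypothetical task record: blocks are identified by integer ids. */
typedef struct {
    int inputs[MAX_INPUTS]; /* blocks the task reads */
    int n_inputs;
    int output;             /* block the task overwrites */
} Task;

static bool reads(const Task *t, int block)
{
    for (int i = 0; i < t->n_inputs; i++)
        if (t->inputs[i] == block) return true;
    return false;
}

/* True if `later` must wait for `earlier`, where `earlier` precedes `later`
   in the sequential task order. */
bool depends_on(const Task *earlier, const Task *later)
{
    bool flow   = reads(later, earlier->output);     /* read after write  */
    bool anti   = reads(earlier, later->output);     /* write after read  */
    bool output = earlier->output == later->output;  /* write after write */
    return flow || anti || output;
}

/* Sets dag[i][j] to true when task j depends on task i (i < j). */
void build_dag(const Task *tasks, int n, bool dag[][MAX_DAG])
{
    assert(n <= MAX_DAG);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            dag[i][j] = (i < j) && depends_on(&tasks[i], &tasks[j]);
}
```

This pairwise comparison also records edges that are implied transitively; a production analyzer would more likely track the last writer and the readers of each block rather than compare every pair of tasks.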

Page 18: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Outline

• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion

Page 19: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Scheduling

• Dispatcher

    foreach task in DAG do
      if task is ready then
        Enqueue task
      end
    end
    while tasks are available do
      Dequeue task
      Execute task
      foreach dependent task do
        Update dependent task
        if dependent task is ready then
          Enqueue dependent task
        end
      end
    end
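The dispatcher loop can be traced with a single-threaded walk-through. This is only a sketch under assumptions: the `Dag` structure, `dag_add_edge`, and `dispatch` are hypothetical names, the ready queue is a plain FIFO, and "executing" a task just records its id. The real dispatcher runs the loop from several threads at once, which needs synchronization that is omitted here.

```c
#include <assert.h>

#define MAX_TASKS 64

/* Hypothetical DAG representation for a single-threaded walk-through. */
typedef struct {
    int n;                            /* number of tasks */
    int n_deps[MAX_TASKS];            /* unfinished predecessors of each task */
    int n_succ[MAX_TASKS];            /* number of dependent tasks */
    int succ[MAX_TASKS][MAX_TASKS];   /* ids of dependent tasks */
} Dag;

void dag_add_edge(Dag *g, int from, int to)
{
    assert(from >= 0 && from < g->n && to >= 0 && to < g->n);
    assert(g->n_succ[from] < MAX_TASKS);
    g->succ[from][g->n_succ[from]++] = to;
    g->n_deps[to]++;
}

/* Runs the dispatcher loop with a FIFO ready queue and writes the order in
   which tasks would execute to `order`. Consumes the n_deps counters.
   Returns the number of tasks executed. */
int dispatch(Dag *g, int *order)
{
    int queue[MAX_TASKS], head = 0, tail = 0, done = 0;
    assert(g->n <= MAX_TASKS);

    for (int t = 0; t < g->n; t++)        /* enqueue initially ready tasks */
        if (g->n_deps[t] == 0) queue[tail++] = t;

    while (head < tail) {                 /* tasks are available */
        int t = queue[head++];            /* dequeue */
        order[done++] = t;                /* "execute" */
        for (int s = 0; s < g->n_succ[t]; s++) {
            int d = g->succ[t][s];
            if (--g->n_deps[d] == 0)      /* update dependent; ready now? */
                queue[tail++] = d;
        }
    }
    return done;
}
```

A task enters the queue exactly once, when its last predecessor finishes, so the queue never needs more than one slot per task.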

Page 21: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Scheduling

• Supermarket
  – One line for each cashier
  – Efficient enqueue and dequeue
  – Schedule depends on task-to-thread assignment
• Bank
  – One line for all tellers
  – Enqueue and dequeue become bottlenecks
  – Dynamic dispatching of tasks to threads

Page 22: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Scheduling

• Single Queue
  – Set of all ready and available tasks
  – FIFO, priority

[Figure: one queue with Enqueue and Dequeue operations, shared by processing elements PE0, PE1, …, PEp-1]

Page 23: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Scheduling

• Multiple Queues
  – Work stealing, data affinity

[Figure: multiple queues with Enqueue and Dequeue operations, serving processing elements PE0, PE1, …, PEp-1]

Page 24: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Scheduling

• Data Affinity
  – Assign all tasks that write to a particular block to the same thread
  – Owner computes rule
  – 2D block cyclic distribution

      0 1 0
      2 3 2
      0 1 0

• Execution Trace
  – Cholesky factorization
  – Total time: 2D data affinity ~ FIFO queue
  – Idle threads: 2D ≈ 27% and FIFO ≈ 17%
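The 3 x 3 grid of thread ids on this slide is consistent with a 2 x 2 arrangement of four threads dealt out cyclically over the blocks. A minimal sketch, assuming threads are numbered row by row in a `pr` x `pc` grid (the function name `block_owner` is hypothetical):

```c
#include <assert.h>

/* Thread that owns block (i, j) when blocks are dealt out cyclically over
   a pr x pc grid of threads, numbered row by row. */
int block_owner(int i, int j, int pr, int pc)
{
    assert(i >= 0 && j >= 0 && pr > 0 && pc > 0);
    return (i % pr) * pc + (j % pc);
}
```

Under the owner computes rule, every task that writes block (i, j) would be placed in the queue of thread `block_owner(i, j, 2, 2)`, so a block stays with one thread for the whole factorization.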

Page 25: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Scheduling

• Data Granularity
  – Cost of task >> cost of enqueue and dequeue
• Single vs. Multiple Queues
  – FIFO queue improves load balance
  – 2D data affinity decreases data communication
  – Combine best aspects of both!

Page 26: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Scheduling

• Cache Affinity
  – Single priority queue sorted by task height
  – Software cache
    • LRU
    • Line = block
    • Fully associative

[Figure: one priority queue with Enqueue and Dequeue operations, shared by PE0, PE1, …, PEp-1, each with its own software cache $0, $1, …, $p-1]
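A fully associative LRU cache whose lines are whole blocks needs very little bookkeeping. The following is a sketch under assumptions, not the SuperMatrix implementation: `SoftCache`, `cache_contains`, and `cache_touch` are hypothetical names, blocks are identified by non-negative integer ids, and the capacity of four lines is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

#define CACHE_LINES 4

/* Hypothetical per-thread software cache: lines[0] is the most recently
   used block id, lines[count - 1] the least recently used. */
typedef struct {
    int lines[CACHE_LINES];
    int count;
} SoftCache;

bool cache_contains(const SoftCache *c, int block)
{
    for (int i = 0; i < c->count; i++)
        if (c->lines[i] == block) return true;
    return false;
}

/* Records an access to `block`. Returns the evicted block id, or -1 if
   nothing was evicted. */
int cache_touch(SoftCache *c, int block)
{
    assert(block >= 0);
    int pos = 0, evicted = -1;
    while (pos < c->count && c->lines[pos] != block) pos++;

    if (pos == c->count) {                /* miss */
        if (c->count == CACHE_LINES) {    /* full: drop the LRU line */
            evicted = c->lines[CACHE_LINES - 1];
            pos = CACHE_LINES - 1;
        } else {
            c->count++;
        }
    }
    for (int i = pos; i > 0; i--)         /* shift to make room at the front */
        c->lines[i] = c->lines[i - 1];
    c->lines[0] = block;
    return evicted;
}
```

The linear scan is acceptable here because the number of lines is small compared with the cost of a task, in line with the granularity argument made earlier in the talk.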

Page 27: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Scheduling

• Cache Affinity
  – Enqueue
    • Insert task
    • Sort queue via task heights
  – Dequeue
    • Search queue for task with output block in software cache
    • If found, return task
    • Otherwise, return head task
  – Dispatcher
    • Update software cache via cache coherency protocol with write invalidation
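The enqueue and dequeue rules above can be sketched as follows. Several details are assumptions: task height is taken to mean the longest path from a task to a sink of the DAG, the queue is kept sorted with the largest height first, and the calling thread's software cache is reduced to a boolean array indexed by block id. The names `Task`, `ReadyQueue`, `enqueue`, and `dequeue` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_QUEUE 64

typedef struct {
    int id;
    int height;  /* assumed: longest path from the task to a sink of the DAG */
    int output;  /* id of the block the task writes */
} Task;

typedef struct {
    Task tasks[MAX_QUEUE];  /* kept sorted, largest height first */
    int  count;
} ReadyQueue;

/* Inserts a task so that the queue stays sorted by descending height;
   tasks of equal height keep their arrival order. */
void enqueue(ReadyQueue *q, Task t)
{
    assert(q->count < MAX_QUEUE);
    int pos = q->count++;
    while (pos > 0 && q->tasks[pos - 1].height < t.height) {
        q->tasks[pos] = q->tasks[pos - 1];
        pos--;
    }
    q->tasks[pos] = t;
}

/* Removes and returns the first task whose output block is marked as cached
   for the calling thread; falls back to the head task otherwise. */
Task dequeue(ReadyQueue *q, const bool *block_is_cached)
{
    assert(q->count > 0);
    int pick = 0;
    for (int i = 0; i < q->count; i++)
        if (block_is_cached[q->tasks[i].output]) { pick = i; break; }

    Task t = q->tasks[pick];
    for (int i = pick; i < q->count - 1; i++)
        q->tasks[i] = q->tasks[i + 1];
    q->count--;
    return t;
}
```

Falling back to the head task keeps the load balance of a single queue, while the search step recovers some of the locality of data affinity.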

Page 28: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Scheduling

• Multiple Graphics Processing Units
  – View a GPU as a single accelerator as opposed to being composed of hundreds of streaming processors
  – Must explicitly transfer data from main memory to GPU
  – No hardware cache coherency provided
• Hybrid Execution Model
  – Execute tasks on both CPU and GPU

Page 29: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Scheduling

• Software Managed Cache Coherency
  – Use software caches developed for cache affinity to handle data transfers!
  – Allow a block to be dirty on a GPU until it is requested by another GPU
  – Apply any scheduling algorithm when utilizing GPUs, particularly cache affinity
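The bookkeeping behind such a protocol can be sketched without any GPU code. Everything below is an assumption for illustration: the `Coherency` structure and its functions are hypothetical, no data is moved and no CUDA call is made, transfers are only counted, and a write is assumed to read the block first (as the in-place updates in the Cholesky example do).

```c
#include <assert.h>
#include <stdbool.h>

#define N_DEVICES 4   /* GPUs; main memory is the backing store */
#define N_BLOCKS  16

/* Hypothetical bookkeeping only: no data is moved by this sketch. */
typedef struct {
    bool valid[N_BLOCKS][N_DEVICES]; /* device holds an up-to-date copy */
    int  dirty_on[N_BLOCKS];         /* device with a modified copy, or -1 */
    int  transfers;                  /* copies that would have been issued */
} Coherency;

void coherency_init(Coherency *c)
{
    for (int b = 0; b < N_BLOCKS; b++) {
        c->dirty_on[b] = -1;
        for (int d = 0; d < N_DEVICES; d++) c->valid[b][d] = false;
    }
    c->transfers = 0;
}

/* Makes block b readable on device d. */
void acquire_read(Coherency *c, int b, int d)
{
    assert(b >= 0 && b < N_BLOCKS && d >= 0 && d < N_DEVICES);
    if (c->valid[b][d]) return;              /* hit in the software cache */
    if (c->dirty_on[b] >= 0) {               /* flush the dirty copy first */
        c->transfers++;                      /* device -> main memory */
        c->dirty_on[b] = -1;
    }
    c->transfers++;                          /* main memory -> device d */
    c->valid[b][d] = true;
}

/* Makes block b writable on device d and invalidates all other copies. */
void acquire_write(Coherency *c, int b, int d)
{
    acquire_read(c, b, d);
    for (int o = 0; o < N_DEVICES; o++)
        if (o != d) c->valid[b][o] = false;  /* write invalidation */
    c->dirty_on[b] = d;
}
```

A block written repeatedly by the same GPU causes no further transfers in this model; the write-back to main memory happens only when another device asks for the block.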

Page 30: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Outline

• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion

Page 31: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

• CPU Target Architecture
  – 4 socket 2.66 GHz Intel Dunnington
    • 24 cores
    • Linux and Windows
    • 16 MB shared L3 cache per socket
  – OpenMP
    • Intel compiler 11.1
  – BLAS
    • Intel MKL 10.2

Page 32: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

• Implementations
  – SuperMatrix + serial MKL
    • FIFO queue, cache affinity
  – FLAME + multithreaded MKL
  – Multithreaded MKL
  – PLASMA + serial MKL
  – Double precision real floating point arithmetic
  – Tuned block size

Page 33: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

Page 34: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

Page 35: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

• Inversion of a Symmetric Positive Definite Matrix
  – Cholesky factorization (CHOL)
  – Inversion of a triangular matrix (TRINV)
  – Triangular matrix multiplication by its transpose (TTMM)
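The formulas for the three stages did not survive in this transcript. As a reconstruction from the standard identities (the symbols $L$ and $R$ are chosen here and are not taken from the slides), the stages compose as follows for a symmetric positive definite $A$ with lower triangular Cholesky factor $L$:

```latex
\begin{align*}
\text{CHOL:}  \quad & A = L L^{T} \\
\text{TRINV:} \quad & R = L^{-1} \\
\text{TTMM:}  \quad & A^{-1} = (L L^{T})^{-1} = L^{-T} L^{-1} = R^{T} R
\end{align*}
```

Each stage can overwrite its input in place, so the three operations can be chained on the same matrix of blocks.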

Page 36: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

• Inversion of an SPD Matrix

Page 37: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

Page 38: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

• Generalized Eigenproblem
  $A x = \lambda B x$
  where $A, B \in \mathbb{R}^{n \times n}$, $A$ is symmetric, and $B$ is symmetric positive definite
• Cholesky Factorization
  $B = L L^{T}$
  where $L$ is a lower triangular matrix so that
  $A x = \lambda L L^{T} x$

Page 39: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

  then multiply the equation by $L^{-1}$:
  $L^{-1} A L^{-T} (L^{T} x) = \lambda L^{T} x$
• Standard Form
  $C y = \lambda y$
  where $C = L^{-1} A L^{-T}$ and $y = L^{T} x$
• Reduction from Symmetric Definite Generalized Eigenproblem to Standard Form
  $C := L^{-1} A L^{-T}$

Page 40: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

• Reduction from …

Page 41: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

Page 42: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

• GPU Target Architecture
  – 2 socket 2.82 GHz Intel Harpertown with NVIDIA Tesla S1070
    • Four 602 MHz Tesla C1060 GPUs
    • 4 GB DDR memory per GPU
    • Linux
  – CUDA
    • CUBLAS 3.0
  – Single precision real floating point arithmetic

Page 43: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

Page 44: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Performance

• Results
  – Cache affinity vs. FIFO queue
  – SuperMatrix out-of-order vs. PLASMA in-order
  – High variability of work stealing vs. predictable cache affinity performance
  – Strong scalability on CPU and GPU
  – Representative performance of other dense linear algebra operations

Page 45: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Outline

• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion

Page 46: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Conclusion

• Separation of Concerns
  – Allows us to experiment with different scheduling algorithms
  – Port runtime system to multiple GPUs
• Locality, Locality, Locality
  – Data communication is as important as load balance for scheduling matrix computations

Page 47: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Current Work

• Intel Single-chip Cloud Computer
  – 48 cores on a single die
  – Cores communicate via message passing buffer
    • RCCE_send
    • RCCE_recv
  – Software managed cache coherency for off-chip shared memory
    • RCCE_shmalloc

Page 48: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Acknowledgments

• We thank the other members of the FLAME team for their support
• Funding
  – Intel
  – Microsoft
  – NSF grants
    • CCF–0540926
    • CCF–0702714

Page 49: Runtime Data Flow  Graph Scheduling of  Matrix Computations

Conclusion

• More Information
  http://www.cs.utexas.edu/~flame

[email protected]