
The University of Texas at Austin

Runtime Data Flow Graph Scheduling of Matrix Computations

Ernie Chan

Intel MKL talk, November 22, 2010

Teaser

[Figure labels: Better; Theoretical Peak Performance]


Goals

• Programmability
  – Use tools provided by FLAME
• Parallelism
  – Directed acyclic graph (DAG) scheduling


Outline

• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion

[Figure: DAG with nodes labeled, level by level: 7; 5, 6; 3, 4, 5; 4; 3; 2; 1]


SuperMatrix

• Formal Linear Algebra Method Environment (FLAME)
  – High-level abstractions for expressing linear algebra algorithms

• Cholesky Factorization

SuperMatrix

• Cholesky Factorization
  – Iteration 1
    • CHOL0: Chol( $A_{0,0}$ )
    • TRSM1: $A_{1,0} A_{0,0}^{-T}$
    • TRSM2: $A_{2,0} A_{0,0}^{-T}$
    • SYRK3: $A_{1,1} - A_{1,0} A_{1,0}^{T}$
    • GEMM4: $A_{2,1} - A_{2,0} A_{1,0}^{T}$
    • SYRK5: $A_{2,2} - A_{2,0} A_{2,0}^{T}$
  – Iteration 2
    • CHOL6: Chol( $A_{1,1}$ )
    • TRSM7: $A_{2,1} A_{1,1}^{-T}$
    • SYRK8: $A_{2,2} - A_{2,1} A_{2,1}^{T}$
  – Iteration 3
    • CHOL9: Chol( $A_{2,2}$ )

[Figure: task DAG, level by level: CHOL0; TRSM2, TRSM1; SYRK5, GEMM4, SYRK3; CHOL6; TRSM7; SYRK8; CHOL9]


SuperMatrix

• Cholesky Factorization
  – Matrix of blocks
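The task sequence CHOL0 through CHOL9 shown for the 3 × 3 example can be reproduced with a short loop nest over the matrix of blocks. The following is a minimal sketch, not the FLAME or SuperMatrix code: the function name `cholesky_tasks` and the tuple layout (name, output block, input blocks) are assumptions made for illustration.

```python
def cholesky_tasks(n):
    """List the tasks of a blocked Cholesky factorization on an n x n matrix of blocks.

    Each task is (name, output_block, input_blocks); a block is (row, col).
    The numeric suffix is the task's position in sequential order.
    """
    tasks = []

    def add(kind, out, ins):
        tasks.append((f"{kind}{len(tasks)}", out, ins))

    for k in range(n):
        add("CHOL", (k, k), [])                            # Chol(A[k,k])
        for i in range(k + 1, n):
            add("TRSM", (i, k), [(k, k)])                  # A[i,k] A[k,k]^-T
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                if i == j:
                    add("SYRK", (i, i), [(i, k)])          # A[i,i] - A[i,k] A[i,k]^T
                else:
                    add("GEMM", (i, j), [(i, k), (j, k)])  # A[i,j] - A[i,k] A[j,k]^T
    return tasks


names = [name for name, _, _ in cholesky_tasks(3)]
print(names)
```

For n = 3 the names come out in the same order as in the slides, which is the sequential order in which tasks would be stored before any scheduling takes place.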



SuperMatrix

• Separation of Concerns
  – Analyzer
    • Decomposes subproblems into component tasks
    • Stores tasks in global task queue sequentially
    • Internally calculates all dependencies between tasks, which form a DAG, using only the input and output parameters of each task
  – Dispatcher
    • Spawns threads
    • Schedules and dispatches tasks to threads in parallel
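To make the analyzer's job concrete, here is a minimal sketch of deriving dependencies purely from the blocks each task reads and writes, with tasks visited in sequential order. It is an illustration under stated assumptions, not the SuperMatrix implementation: `build_dag`, the block names such as "A10", and the (name, reads, writes) layout are all hypothetical. Output blocks are updated in place, so listing them as writes is enough to order successive updates of the same block.

```python
def build_dag(tasks):
    """tasks: list of (name, reads, writes) in sequential program order.

    Returns {name: set of earlier tasks that must finish first}.
    """
    last_writer = {}   # block -> most recent task that wrote it
    readers = {}       # block -> tasks that read it since that write
    deps = {}
    for name, reads, writes in tasks:
        wait_for = set()
        for block in reads:
            if block in last_writer:
                wait_for.add(last_writer[block])      # flow: read after write
        for block in writes:
            if block in last_writer:
                wait_for.add(last_writer[block])      # output: write after write
            wait_for.update(readers.get(block, ()))   # anti: write after read
        wait_for.discard(name)
        deps[name] = wait_for
        for block in reads:
            readers.setdefault(block, set()).add(name)
        for block in writes:
            last_writer[block] = name
            readers[block] = set()
    return deps


TASKS = [
    ("CHOL0", [], ["A00"]),
    ("TRSM1", ["A00"], ["A10"]),
    ("TRSM2", ["A00"], ["A20"]),
    ("SYRK3", ["A10"], ["A11"]),
    ("GEMM4", ["A20", "A10"], ["A21"]),
    ("SYRK5", ["A20"], ["A22"]),
    ("CHOL6", [], ["A11"]),
    ("TRSM7", ["A11"], ["A21"]),
    ("SYRK8", ["A21"], ["A22"]),
    ("CHOL9", [], ["A22"]),
]
deps = build_dag(TASKS)
for name, wait_for in deps.items():
    print(name, sorted(wait_for))
```

The resulting edges are consistent with the levels of the Cholesky DAG shown earlier: for instance TRSM7 waits for both CHOL6 and GEMM4.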




Scheduling

• Dispatcher
    foreach task in DAG do
      if task is ready then
        Enqueue task
      end
    end
    while tasks are available do
      Dequeue task
      Execute task
      foreach dependent task do
        Update dependent task
        if dependent task is ready then
          Enqueue dependent task
        end
      end
    end
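The dispatcher loop can be sketched in a few lines of Python. This is a serial simulation under stated assumptions: in the actual runtime every thread would run the while loop concurrently against a shared queue, whereas here a single loop stands in for all threads. The names `dispatch` and `DEPS` are hypothetical, and `DEPS` hard-codes the dependencies of the 3 × 3 Cholesky example.

```python
from collections import deque

# Dependencies of the 3 x 3 blocked Cholesky example: task -> tasks it waits for.
DEPS = {
    "CHOL0": set(),
    "TRSM1": {"CHOL0"},
    "TRSM2": {"CHOL0"},
    "SYRK3": {"TRSM1"},
    "GEMM4": {"TRSM1", "TRSM2"},
    "SYRK5": {"TRSM2"},
    "CHOL6": {"SYRK3"},
    "TRSM7": {"CHOL6", "GEMM4"},
    "SYRK8": {"TRSM7", "SYRK5"},
    "CHOL9": {"SYRK8"},
}


def dispatch(deps, execute=lambda task: None):
    """Run every task once, never before the tasks it waits for."""
    remaining = {t: len(d) for t, d in deps.items()}   # unmet dependency counts
    dependents = {t: [] for t in deps}
    for t, d in deps.items():
        for p in d:
            dependents[p].append(t)

    # foreach task in DAG: enqueue it if it is ready
    ready = deque(t for t in deps if remaining[t] == 0)
    order = []
    while ready:                        # while tasks are available
        task = ready.popleft()          # dequeue (FIFO)
        execute(task)                   # execute
        order.append(task)
        for dep in dependents[task]:
            remaining[dep] -= 1         # update dependent task
            if remaining[dep] == 0:
                ready.append(dep)       # enqueue dependent task once ready
    return order


order = dispatch(DEPS)
print(order)
```

Swapping the `deque` for a priority queue changes the scheduling policy without touching the rest of the loop, which is the point of separating the dispatcher from the analyzer.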



Scheduling

• Supermarket
  – One line for each cashier
  – Efficient enqueue and dequeue
  – Schedule depends on task to thread assignment
• Bank
  – One line for all tellers
  – Enqueue and dequeue become bottlenecks
  – Dynamic dispatching of tasks to threads



Scheduling

• Single Queue
  – Set of all ready and available tasks
  – FIFO, priority

[Figure: single queue with Enqueue and Dequeue, shared by PE0, PE1, …, PEp-1]



Scheduling

• Multiple Queues
  – Work stealing, data affinity

[Figure: one queue per processing element PE0, PE1, …, PEp-1, with Enqueue and Dequeue]

Scheduling

• Work Stealing
    foreach task in DAG do
      if task is ready then
        Enqueue task
      end
    end
    while tasks are available do
      Dequeue task
      if task ≠ Ø then
        Execute task
        Update dependent tasks …
      else
        Steal task
      end
    end
  – Enqueue
    • Place all dependent tasks on queue of same thread that executes task
  – Steal
    • Select random thread and remove a task from tail of its queue
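The work stealing policy can be illustrated with one deque per thread. The sketch below is a deterministic, round-based simulation rather than real multithreading, and several details are assumptions: the names `work_stealing` and `DEPS`, the round-robin placement of the initially ready tasks, and the rule that a failed steal simply idles for that round.

```python
import random
from collections import deque

# Dependencies of the 3 x 3 blocked Cholesky example: task -> tasks it waits for.
DEPS = {
    "CHOL0": set(), "TRSM1": {"CHOL0"}, "TRSM2": {"CHOL0"},
    "SYRK3": {"TRSM1"}, "GEMM4": {"TRSM1", "TRSM2"}, "SYRK5": {"TRSM2"},
    "CHOL6": {"SYRK3"}, "TRSM7": {"CHOL6", "GEMM4"},
    "SYRK8": {"TRSM7", "SYRK5"}, "CHOL9": {"SYRK8"},
}


def work_stealing(deps, num_threads, seed=0):
    """Simulate work stealing; returns a trace of (thread, task) pairs."""
    rng = random.Random(seed)
    remaining = {t: len(d) for t, d in deps.items()}
    dependents = {t: [] for t in deps}
    for t, d in deps.items():
        for p in d:
            dependents[p].append(t)

    queues = [deque() for _ in range(num_threads)]
    initially_ready = [t for t in deps if remaining[t] == 0]
    for n, t in enumerate(initially_ready):
        queues[n % num_threads].append(t)       # assumption: round-robin start

    trace = []
    while len(trace) < len(deps):
        for me in range(num_threads):           # one round: each thread takes a step
            if queues[me]:
                task = queues[me].popleft()     # dequeue from head of own queue
            else:
                victim = rng.randrange(num_threads)   # steal: pick a random thread
                if victim == me or not queues[victim]:
                    continue                    # nothing to steal this round
                task = queues[victim].pop()     # remove from tail of victim's queue
            trace.append((me, task))
            for dep in dependents[task]:
                remaining[dep] -= 1
                if remaining[dep] == 0:
                    queues[me].append(dep)      # enqueue on the executing thread
    return trace


trace = work_stealing(DEPS, num_threads=4)
print(trace)
```

Because the victim is chosen at random, different seeds give different task to thread assignments, which is one way to see where the run-to-run variability of work stealing comes from.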


Scheduling

• Data Affinity
  – Assign all tasks that write to a particular block to the same thread
  – Owner computes rule
  – 2D block cyclic distribution
• Execution Trace
  – Cholesky factorization
  – Total time: 2D data affinity ~ FIFO queue
  – Idle threads: 2D ≈ 27% and FIFO ≈ 17%

[Figure: thread that owns each block of a 3 × 3 matrix of blocks]

    0 1 0
    2 3 2
    0 1 0
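The thread numbers 0 1 0 / 2 3 2 / 0 1 0 on this slide are what a 2D block cyclic distribution over a 2 × 2 grid of threads produces for a 3 × 3 matrix of blocks. A minimal sketch follows; the function name `owner` and the row-major numbering of threads are assumptions chosen to match those numbers.

```python
def owner(i, j, rows=2, cols=2):
    """Thread that owns block (i, j) on a rows x cols grid of threads."""
    return (i % rows) * cols + (j % cols)


grid = [[owner(i, j) for j in range(3)] for i in range(3)]
for row in grid:
    print(*row)

# Owner computes: a task is queued on the thread that owns its output block.
output_block = {"SYRK5": (2, 2), "TRSM7": (2, 1), "SYRK8": (2, 2), "CHOL9": (2, 2)}
queue_of = {task: owner(*block) for task, block in output_block.items()}
print(queue_of)
```

Every task that writes block (2, 2) lands on thread 0, so that block never has to move between threads, at the cost of thread 0 serializing those tasks.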


Scheduling

• Data Granularity
  – Cost of task >> enqueue and dequeue
• Single vs. Multiple Queues
  – FIFO queue improves load balance
  – 2D data affinity decreases data communication
  – Combine best aspects of both!


Scheduling

• Cache Affinity
  – Single priority queue sorted by task height
  – Software cache
    • LRU
    • Line = block
    • Fully associative

[Figure: single queue with Enqueue and Dequeue, shared by PE0, PE1, …, PEp-1, each with a software cache $0, $1, …, $p-1]
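The slides do not define task height. The sketch below assumes it is the number of tasks on the longest path from a task to the end of the DAG, counting the task itself. Under that assumption the 3 × 3 Cholesky DAG yields exactly the numbers 7; 5 6; 3 4 5; 4; 3; 2; 1 that accompany the DAG figure elsewhere in this talk. The names `task_heights` and `DEPENDENTS` are hypothetical.

```python
import heapq
from functools import lru_cache

# Successors in the 3 x 3 blocked Cholesky DAG: task -> tasks that wait for it.
DEPENDENTS = {
    "CHOL0": ["TRSM1", "TRSM2"],
    "TRSM1": ["SYRK3", "GEMM4"],
    "TRSM2": ["GEMM4", "SYRK5"],
    "SYRK3": ["CHOL6"],
    "GEMM4": ["TRSM7"],
    "SYRK5": ["SYRK8"],
    "CHOL6": ["TRSM7"],
    "TRSM7": ["SYRK8"],
    "SYRK8": ["CHOL9"],
    "CHOL9": [],
}


def task_heights(dependents):
    """Height = number of tasks on the longest path from a task to a sink."""
    @lru_cache(maxsize=None)
    def height(task):
        return 1 + max((height(s) for s in dependents[task]), default=0)

    return {task: height(task) for task in dependents}


HEIGHTS = task_heights(DEPENDENTS)
print(HEIGHTS)

# Priority queue of ready tasks: heapq is a min-heap, so heights are negated.
ready = []
for task in ("SYRK5", "SYRK3", "GEMM4"):
    heapq.heappush(ready, (-HEIGHTS[task], task))
first = heapq.heappop(ready)[1]
print(first)
```

Sorting by height favors tasks on the critical path: among SYRK3, GEMM4 and SYRK5, the queue hands out SYRK3 first because four more tasks still depend on it in sequence.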

Scheduling

• Cache Affinity
  – Dequeue
    • Search queue for task with output block in software cache
    • If found return task
    • Otherwise return head task
  – Enqueue
    • Insert task
    • Sort queue via task heights
  – Dispatcher
    • Update software cache via cache coherency protocol with write invalidation
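The dequeue rule and the software cache can be sketched together. This is an illustration under stated assumptions, not the SuperMatrix data structures: `SoftwareCache`, `dequeue` and `after_write` are hypothetical names, the queue is a plain list already sorted by decreasing height, and cache capacity is counted in blocks.

```python
from collections import OrderedDict


class SoftwareCache:
    """Fully associative LRU cache whose lines are matrix blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def touch(self, block):
        self.lines.pop(block, None)
        self.lines[block] = True               # most recently used goes last
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)     # evict the least recently used

    def invalidate(self, block):
        self.lines.pop(block, None)

    def __contains__(self, block):
        return block in self.lines


def dequeue(queue, cache):
    """queue: list of (height, name, output_block), highest height first."""
    for n, (_, _, out) in enumerate(queue):
        if out in cache:
            return queue.pop(n)                # task whose output block is cached
    return queue.pop(0) if queue else None     # otherwise the head task


def after_write(thread, block, caches):
    """Write invalidation: only the writing thread keeps the block."""
    for other, cache in enumerate(caches):
        if other != thread:
            cache.invalidate(block)
    caches[thread].touch(block)


caches = [SoftwareCache(capacity=2), SoftwareCache(capacity=2)]
caches[1].touch((2, 1))
queue = [(5, "SYRK3", (1, 1)), (4, "GEMM4", (2, 1)), (3, "SYRK5", (2, 2))]

hit = dequeue(queue, caches[1])     # thread 1 has block (2, 1) cached
miss = dequeue(queue, caches[0])    # thread 0 has nothing cached
after_write(0, (2, 1), caches)      # thread 0 then writes block (2, 1)
print(hit[1], miss[1], (2, 1) in caches[1])
```

Thread 1 skips past the head of the queue to take GEMM4 because its output block is already cached, while thread 0 falls back to the head task SYRK3.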

Scheduling

• Multiple Graphics Processing Units
  – View a GPU as a single accelerator as opposed to being composed of hundreds of streaming processors
  – Must explicitly transfer data from main memory to GPU
  – No hardware cache coherency provided
• Hybrid Execution Model
  – Execute tasks on both CPU and GPU

Scheduling

• Software Managed Cache Coherency
  – Use software caches developed for cache affinity to handle data transfers!
  – Allow a block to be dirty on a GPU until it is requested by another GPU
  – Apply any scheduling algorithm when utilizing GPUs, particularly cache affinity
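A small bookkeeping model shows how software managed coherency decides when a block has to move. The sketch is an assumption-laden illustration, not the runtime's code: `BlockDirectory` and its methods are hypothetical, transfers are only logged rather than performed, and data is assumed to travel between GPUs by way of main memory.

```python
class BlockDirectory:
    """Tracks which memories hold a valid copy of each block."""

    HOST = "host"

    def __init__(self):
        self.valid = {}       # block -> set of locations with a valid copy
        self.dirty = {}       # block -> GPU holding the only up-to-date copy
        self.transfers = []   # log of (block, source, destination)

    def _copy(self, block, src, dst):
        self.transfers.append((block, src, dst))
        self.valid[block].add(dst)

    def read(self, block, dev):
        holders = self.valid.setdefault(block, {self.HOST})
        if dev in holders:
            return                                   # already resident: no transfer
        writer = self.dirty.pop(block, None)
        if writer is not None:
            self._copy(block, writer, self.HOST)     # flush only when requested
        self._copy(block, self.HOST, dev)

    def write(self, block, dev):
        self.read(block, dev)         # output blocks are updated in place
        self.valid[block] = {dev}     # write invalidation of all other copies
        self.dirty[block] = dev       # block stays dirty on this GPU


d = BlockDirectory()
d.write("A00", "gpu0")    # first touch: host -> gpu0
d.write("A00", "gpu0")    # still resident and dirty on gpu0: nothing moves
d.read("A00", "gpu1")     # request by another GPU: gpu0 -> host -> gpu1
print(d.transfers)
```

Repeated updates of a block by the same GPU cost no transfers, which is why pairing this scheme with cache affinity scheduling reduces traffic to and from the GPUs.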





Performance

• CPU Target Architecture
  – 4 socket 2.66 GHz Intel Dunnington
    • 24 cores
    • Linux and Windows
    • 16 MB shared L3 cache per socket
  – OpenMP
    • Intel compiler 11.1
  – BLAS
    • Intel MKL 10.2


Performance

• Implementations
  – SuperMatrix + serial MKL
    • FIFO queue, cache affinity
  – FLAME + multithreaded MKL
  – Multithreaded MKL
  – PLASMA + serial MKL
  – Double precision real floating point arithmetic
  – Tuned block size




Performance

• Inversion of a Symmetric Positive Definite Matrix
  – Cholesky factorization (CHOL)
  – Inversion of a triangular matrix (TRINV)
  – Triangular matrix multiplication by its transpose (TTMM)
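The three operations compose into the inverse as follows. This is the standard derivation written under two assumptions that the slide text does not state: the Cholesky factor is taken to be lower triangular, and the symbol $R$ for the inverted factor is introduced here only for illustration.

```latex
\begin{aligned}
\text{CHOL:}  &\quad A = L L^{T} \\
\text{TRINV:} &\quad R = L^{-1} \\
\text{TTMM:}  &\quad A^{-1} = (L L^{T})^{-1} = L^{-T} L^{-1} = R^{T} R
\end{aligned}
```

Each step can overwrite its operand in place, so the three sweeps together form one longer sequence of tasks for the scheduler.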


Performance

• Inversion of an SPD Matrix


Performance

• Generalized Eigenproblem
    $A x = \lambda B x$
  where $A$ is symmetric and $B$ is symmetric positive definite
• Cholesky Factorization
    $B = L L^{T}$
  where $L$ is a lower triangular matrix so that
    $A x = \lambda L L^{T} x$
  then multiply the equation by $L^{-1}$
• Standard Form
    $C y = \lambda y$
  where $C = L^{-1} A L^{-T}$ and $y = L^{T} x$
• Reduction from Symmetric Definite Generalized Eigenproblem to Standard Form


Performance


• Reduction from …



Performance

• GPU Target Architecture
  – 2 socket 2.82 GHz Intel Harpertown with NVIDIA Tesla S1070
    • Four 602 MHz Tesla C1060 GPUs
    • 4 GB DDR memory per GPU
    • Linux
  – CUDA
    • CUBLAS 3.0
  – Single precision real floating point arithmetic



Performance

• Results
  – Cache affinity vs. FIFO queue
  – SuperMatrix out-of-order vs. PLASMA in-order
  – High variability of work stealing vs. predictable cache affinity performance
  – Strong scalability on CPU and GPU
  – Representative performance of other dense linear algebra operations




Conclusion

• Separation of Concerns
  – Allows us to experiment with different scheduling algorithms
  – Port runtime system to multiple GPUs
• Locality, Locality, Locality
  – Data communication is as important as load balance for scheduling matrix computations


Current Work

• Intel Single-chip Cloud Computer
  – 48 cores on a single die
  – Cores communicate via message passing buffer
    • RCCE_send
    • RCCE_recv
  – Software managed cache coherency for off-chip shared memory
    • RCCE_shmalloc



Acknowledgments

• We thank the other members of the FLAME team for their support

• Funding
  – Intel
  – Microsoft
  – NSF grants
    • CCF–0540926
    • CCF–0702714

Conclusion

• More Information
  http://www.cs.utexas.edu/~flame
• [email protected]
