Runtime Data Flow Graph Scheduling of Matrix Computations
DESCRIPTION
Runtime Data Flow Graph Scheduling of Matrix Computations. Ernie Chan. Programmability: use the tools provided by FLAME. Parallelism: directed acyclic graph (DAG) scheduling.
TRANSCRIPT
The University of Texas at Austin
Runtime Data Flow Graph Scheduling of Matrix Computations
Ernie Chan
Introduction
• Programmability
  – Use the tools provided by FLAME
• Parallelism
  – Directed acyclic graph (DAG) scheduling
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
SuperMatrix
• Formal Linear Algebra Methods Environment (FLAME)
  – High-level abstractions for expressing linear algebra algorithms
• Cholesky Factorization
  – A = L L^T, where L is lower triangular
SuperMatrix
• Cholesky Factorization – each iteration performs the same block operations on a shrinking trailing submatrix (iteration 1, iteration 2, …):
  CHOL:  A11 := Chol( A11 )
  TRSM:  A21 := A21 A11^-T
  SYRK:  A22 := A22 - A21 A21^T
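For reference, a minimal NumPy sketch of this right-looking blocked algorithm; the function and its block-size argument are illustrative, not the FLAME/libflame implementation:

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_cholesky(A, nb):
    """Right-looking blocked Cholesky (lower triangular), updating A in place."""
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # CHOL: A11 := Chol( A11 )
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            # TRSM: A21 := A21 A11^-T  (solve A11 X^T = A21^T for X)
            A[e:, k:e] = solve_triangular(A[k:e, k:e], A[e:, k:e].T, lower=True).T
            # SYRK: A22 := A22 - A21 A21^T
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)

# Quick check against the unblocked factorization.
G = np.random.randn(9, 9)
S = G @ G.T + 9 * np.eye(9)            # symmetric positive definite
assert np.allclose(blocked_cholesky(S.copy(), 3), np.linalg.cholesky(S))
```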
SuperMatrix
• Cholesky Factorization – Iteration 1 enqueues six tasks; the DAG grows as each task is analyzed:
  CHOL0: A0,0 := Chol( A0,0 )
  TRSM1: A1,0 := A1,0 A0,0^-T
  TRSM2: A2,0 := A2,0 A0,0^-T
  SYRK3: A1,1 := A1,1 - A1,0 A1,0^T
  GEMM4: A2,1 := A2,1 - A2,0 A1,0^T
  SYRK5: A2,2 := A2,2 - A2,0 A2,0^T
SuperMatrix
• Cholesky Factorization – Iteration 2 adds three tasks to the same DAG:
  CHOL6: A1,1 := Chol( A1,1 )
  TRSM7: A2,1 := A2,1 A1,1^-T
  SYRK8: A2,2 := A2,2 - A2,1 A2,1^T
SuperMatrix
• Cholesky Factorization – Iteration 3 adds the final task:
  CHOL9: A2,2 := Chol( A2,2 )
SuperMatrix
• Cholesky Factorization – performed on a matrix of blocks
SuperMatrix
• Separation of Concerns
  – Analyzer
    • Decomposes subproblems into component tasks
    • Stores tasks sequentially in a global task queue
    • Internally calculates all dependencies between tasks, which form a DAG, using only the input and output parameters of each task
  – Dispatcher
    • Spawns threads
    • Schedules and dispatches tasks to threads in parallel
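The walkthrough above (CHOL0 through CHOL9) is exactly what this analysis produces for a 3x3 matrix of blocks. A minimal Python sketch of the analyzer's dependence analysis, assuming a toy task representation rather than libflame's actual C interface:

```python
# Each task names the blocks it reads and the block it writes; a task
# depends on the last writer of every block it touches (flow and output
# dependences; a real runtime must also handle anti-dependences).

def cholesky_dag(n):
    tasks = []    # (name, deps) in sequential program order
    writer = {}   # writer[(i, j)] = id of the last task that wrote block (i, j)

    def add(name, reads, writes):
        tid = len(tasks)
        deps = sorted({writer[b] for b in reads + [writes] if b in writer})
        tasks.append((f"{name}{tid}", deps))
        writer[writes] = tid

    for k in range(n):
        add("CHOL", [], (k, k))                 # A[k,k] := Chol( A[k,k] )
        for i in range(k + 1, n):
            add("TRSM", [(k, k)], (i, k))       # A[i,k] := A[i,k] A[k,k]^-T
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                if j == i:                      # A[i,i] -= A[i,k] A[i,k]^T
                    add("SYRK", [(i, k)], (i, i))
                else:                           # A[i,j] -= A[i,k] A[j,k]^T
                    add("GEMM", [(i, k), (j, k)], (i, j))
    return tasks

for name, deps in cholesky_dag(3):
    print(name, "<-", deps)
# CHOL0 <- [], TRSM1 <- [0], TRSM2 <- [0], SYRK3 <- [1], GEMM4 <- [1, 2],
# SYRK5 <- [2], CHOL6 <- [3], TRSM7 <- [4, 6], SYRK8 <- [5, 7], CHOL9 <- [8]
```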
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
Scheduling
• Dispatcher
    foreach task in DAG do
      if task is ready then
        Enqueue task
      end
    end
    while tasks are available do
      Dequeue task
      Execute task
      foreach dependent task do
        Update dependent task
        if dependent task is ready then
          Enqueue dependent task
        end
      end
    end
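A runnable, simplified Python rendering of this dispatcher loop, reusing the (name, deps) task list from the analyzer sketch above; the thread count and helper names are illustrative:

```python
import queue, threading

def dispatch(tasks, execute, nthreads=4):
    """Toy dispatcher for a list of (name, deps) tasks; not the libflame code."""
    dependents = [[] for _ in tasks]       # reverse edges: task -> its dependents
    deps_left = []
    for tid, (_, deps) in enumerate(tasks):
        deps_left.append(len(deps))
        for d in deps:
            dependents[d].append(tid)

    ready = queue.Queue()                  # single FIFO queue of ready tasks
    for tid, n in enumerate(deps_left):
        if n == 0:
            ready.put(tid)

    lock = threading.Lock()
    remaining = [len(tasks)]

    def worker():
        while True:
            tid = ready.get()
            if tid is None:                # sentinel: every task has completed
                return
            execute(tasks[tid][0])
            with lock:
                remaining[0] -= 1
                finished = remaining[0] == 0
                for dep in dependents[tid]:    # update each dependent task
                    deps_left[dep] -= 1
                    if deps_left[dep] == 0:    # dependent became ready: enqueue
                        ready.put(dep)
            if finished:
                for _ in range(nthreads):
                    ready.put(None)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads: t.start()
    for t in threads: t.join()

# Example: run the 3x3 blocked Cholesky DAG from the analyzer sketch.
dispatch(cholesky_dag(3), lambda name: print("executing", name))
```

Here queue.Queue stands in for the single FIFO ready queue; a priority queue or the cache-affinity dequeue described below could be swapped in without touching the loop.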
Scheduling
• Supermarket
  – One line per cashier
  – Efficient enqueue and dequeue
  – Schedule depends on the task-to-thread assignment
• Bank
  – One line for all tellers
  – Enqueue and dequeue become bottlenecks
  – Dynamic dispatching of tasks to threads
…
Scheduling
• Single Queue
  – Set of all ready and available tasks
  – FIFO or priority ordering
[Figure: processing elements PE0 … PEp-1 enqueue to and dequeue from one shared queue]
Scheduling
• Cache Affinity
  – Single priority queue sorted by task height
  – Software cache per processing element
    • LRU replacement
    • Line = block
    • Fully associative
[Figure: PE0 … PEp-1 share the queue; each PE has its own software cache $0 … $p-1]
Scheduling
• Cache Affinity
  – Enqueue
    • Insert task
    • Sort queue by task heights
  – Dequeue
    • Search the queue for a task whose output block is in this thread's software cache
    • If found, return that task
    • Otherwise, return the head task
  – Dispatcher
    • Updates the software caches via a cache coherency protocol with write invalidation
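A sketch of these enqueue/dequeue rules under assumed toy data structures (the real runtime is C; the Task fields and helper names here are mine):

```python
from collections import OrderedDict, namedtuple

Task = namedtuple("Task", ["name", "height", "output"])  # output = a block id

class SoftwareCache:
    """Per-PE fully associative LRU software cache of block identifiers."""
    def __init__(self, nlines):
        self.nlines, self.lines = nlines, OrderedDict()

    def touch(self, block):
        """Record that this PE just read or wrote the block (LRU update)."""
        self.lines.pop(block, None)
        self.lines[block] = True
        if len(self.lines) > self.nlines:
            self.lines.popitem(last=False)   # evict the least recently used

def enqueue(ready, task):
    """Insert, keeping the queue sorted by task height."""
    ready.append(task)
    ready.sort(key=lambda t: -t.height)

def dequeue(ready, cache):
    """Prefer a task whose output block this PE already caches."""
    for i, task in enumerate(ready):
        if task.output in cache.lines:
            return ready.pop(i)
    return ready.pop(0)                      # otherwise return the head task

def invalidate(caches, writer, block):
    """Write invalidation: remove the block from every other PE's cache."""
    for i, c in enumerate(caches):
        if i != writer:
            c.lines.pop(block, None)
```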
Scheduling
• Multiple Graphics Processing Units
  – View a GPU as a single accelerator rather than as hundreds of streaming processors
  – Data must be explicitly transferred from main memory to the GPU
  – No hardware cache coherency is provided
• Hybrid Execution Model
  – Execute tasks on both the CPU and GPUs
Scheduling
• Software Managed Cache Coherency
  – Reuse the software caches developed for cache affinity to handle data transfers
  – Allow a block to remain dirty on a GPU until it is requested by another GPU
  – Any scheduling algorithm can then be applied when utilizing GPUs, particularly cache affinity
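A sketch of this lazy-transfer idea with a single-owner simplification; the class and the transfer callback are hypothetical placeholders for cudaMemcpy-style calls, not the libflame runtime's interface:

```python
class SoftwareCoherence:
    """Toy software-managed coherency for blocks shared by a CPU and GPUs.

    owner[block] is where the current (possibly dirty) copy lives:
    "host" or a GPU id.
    """
    def __init__(self, transfer):
        self.owner, self.transfer = {}, transfer   # transfer(block, src, dst)

    def acquire(self, block, dev):
        """Before a task on dev touches the block: fetch it only if the
        current copy is elsewhere (on-demand flush of dirty blocks)."""
        src = self.owner.get(block, "host")
        if src != dev:
            self.transfer(block, src, dev)
            self.owner[block] = dev

    def release(self, block, dev):
        """After a task on dev writes the block: it stays dirty on dev
        until another processor acquires it."""
        self.owner[block] = dev

# Example wiring for a task on GPU 0 computing out := f(in1, in2):
#   coh.acquire(in1, 0); coh.acquire(in2, 0); coh.acquire(out, 0)
#   ... launch kernel ...
#   coh.release(out, 0)
```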
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
Performance
• CPU Target Architecture
  – 4-socket 2.66 GHz Intel Dunnington
    • 24 cores
    • 16 MB shared L3 cache per socket
    • Linux and Windows
  – OpenMP: Intel compiler 11.1
  – BLAS: Intel MKL 10.2
Performance
• Implementations
  – SuperMatrix + serial MKL (FIFO queue, cache affinity)
  – FLAME + multithreaded MKL
  – Multithreaded MKL
  – PLASMA + serial MKL
• Double precision real floating point arithmetic
• Tuned block size
Performance
• PLASMA
  – v2.1.0 uses static pipelining for scheduling and does not construct a DAG
  – v2.2.0 uses dynamic scheduling that attains roughly the same performance as the FIFO queue
• MAGMA
  – v1.0 only supports single-GPU execution
  – Does not attempt to minimize data transfers
Performance
[Performance graphs: not reproduced in this transcript.]
Performance
• Generalized Eigenproblem
  A x = λ B x,  where A is symmetric and B is symmetric positive definite
• Cholesky Factorization
  B = L L^T,  where L is a lower triangular matrix, so that A x = λ L L^T x
Performance
• Reduction from Symmetric Definite Generalized Eigenproblem to Standard Form
  – Multiply the equation by L^-1:  L^-1 A L^-T ( L^T x ) = λ ( L^T x )
• Standard Form
  C y = λ y,  where C = L^-1 A L^-T and y = L^T x
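A quick NumPy check of this algebra (illustrative only), using two triangular solves rather than forming L^-1 explicitly:

```python
import numpy as np
from scipy.linalg import cholesky, eigh, solve_triangular

n = 6
A = np.random.randn(n, n)
A = A + A.T                                    # symmetric A
G = np.random.randn(n, n)
B = G @ G.T + n * np.eye(n)                    # symmetric positive definite B

L = cholesky(B, lower=True)                    # B = L L^T
M = solve_triangular(L, A, lower=True)         # M = L^-1 A
C = solve_triangular(L, M.T, lower=True).T     # C = L^-1 A L^-T

# Eigenvalues of the standard problem C y = lambda y match those of the
# generalized problem A x = lambda B x.
assert np.allclose(np.linalg.eigvalsh(C), eigh(A, B, eigvals_only=True))
```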
Performance
[Performance graphs for the reduction to standard form: not reproduced in this transcript.]
Performance
• GPU Target Architecture
  – 2-socket 2.82 GHz Intel Harpertown with an NVIDIA Tesla S1070
    • 4 × 602 MHz Tesla C1060 GPUs
    • 4 GB DDR memory per GPU
    • Linux
  – CUDA: CUBLAS 3.0
  – Single precision real floating point arithmetic
Performance
[GPU performance graphs: not reproduced in this transcript.]
Performance
• Results
  – Cache affinity vs. FIFO queue
  – SuperMatrix out-of-order scheduling vs. PLASMA in-order scheduling
  – Strong scalability on both CPU and GPU
  – Typically use a block size of 896 on the GPU
  – Performance is representative of other dense linear algebra operations
Outline
• Introduction
• SuperMatrix
• Scheduling
• Performance
• Conclusion
Conclusion
• Separation of Concerns
  – Allows us to experiment with different scheduling algorithms
  – Made it possible to port the runtime system to multiple GPUs
• Locality, Locality, Locality
  – Data communication is as important as load balance for scheduling matrix computations
Acknowledgments
• We thank the other members of the FLAME team for their support
• Funding from NSF, Microsoft, and Intel
• SuperMatrix is implemented within the open source library libflame, released under the LGPL
Conclusion
• More Information: http://www.cs.utexas.edu/~flame