CS 267 Dense Linear Algebra:
Possible Class Projects
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr09
Kinds of class projects
• Try tuning existing (widely used) codes in LAPACK, ScaLAPACK, or possible future versions
  - Possible impact: help many people run faster
• Add missing functionality to these libraries
  - Possible impact: lots of users want it
• Experiment with algorithms on new architectures
  - Possible impact: What do we need to do differently for performance on these platforms? Are there any bottlenecks or other problems in the architecture? Could they be fixed?
• Experiment with new software approaches
  - Possible impact: Is it easier to write these algorithms while getting most of the performance? Should we produce future versions of the libraries this way?
• Experiment with new algorithms
  - Possible impact: find a better one!
Challenges to Libraries (and parallel SW in general)
• Minimizing communication costs
  - Cost of bandwidth and latency (to main memory or over a network) is growing exponentially compared to arithmetic
• Heterogeneous platforms
  - Different communication costs depending on destination
    • Same chip vs. different socket vs. different board …
  - CPU + GPU
    • Perform different operations at very different rates
• Dynamic scheduling & load balancing
  - Can't always assume each core/processor makes constant progress on your task
  - May be faster to grab the next available task than to use a predesigned "perfectly balanced" schedule
  - OS may give or take away resources on the fly
• Fault tolerance – how to recover when one processor fails
Strassen's Matmul on Multicore or GPU
• Why is there no Strassen in most libraries?
  - See "Baleful Effect of Benchmarks…" by Prof. Kahan
• Likely to be faster for modest-to-large matrix sizes
  - Where is the crossover?
• May want a hybrid: switch to the O(n^3) algorithm below a certain size (see the sketch after this list)
  - Autotuning?
• Lots of "blocking" opportunities, as for standard matmul
  - What is the least amount of data movement possible?
• How well does it work for the rectangular matmuls in LU, QR, and Cholesky?
  - Do we need to modify LU, QR, or Cholesky to take advantage of Strassen (by using a variant that multiplies different-size matrices)?
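A minimal NumPy sketch of such a hybrid, assuming square matrices whose dimension is a power of two and an illustrative (untuned) cutoff of 128; finding the real crossover and the best cutoff is exactly the autotuning question above:

import numpy as np

def strassen(A, B, cutoff=128):
    """Hybrid Strassen matmul (sketch): recurse while n > cutoff, otherwise
    fall back to the classical O(n^3) product.  Assumes square matrices with
    power-of-two dimension; the cutoff is an assumption, not a tuned value."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                      # classical algorithm below the crossover
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # 7 recursive products instead of 8
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

The extra additions and the temporaries created at each level are where the "blocking" and data-movement questions above come in.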
Review: Alternative recursive GE formulation
• Toledo (1997)
  - Described here without pivoting, for simplicity
  - "Do the left half of the matrix, then the right half"
function [L,U] = RLU (A)                              … assume A is m by n
  if (n = 1)
    L = A / A(1,1),  U = A(1,1)
  else
    [L1,U1] = RLU( A(1:m, 1:n/2) )                    … do left half of A
                                                      … let L11 denote top n/2 rows of L1
    A(1:n/2, n/2+1:n) = L11^(-1) * A(1:n/2, n/2+1:n)  … update top n/2 rows of right half of A
    A(n/2+1:m, n/2+1:n) = A(n/2+1:m, n/2+1:n)
                        - A(n/2+1:m, 1:n/2) * A(1:n/2, n/2+1:n)
                                                      … update rest of right half of A
    [L2,U2] = RLU( A(n/2+1:m, n/2+1:n) )              … do right half of A
    return [ L1, [0;L2] ]  and  [ U1, [ A(.,.) ; U2 ] ]
A = L * U
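A runnable NumPy version of the same recursion, as a sketch; it assumes m >= n, n a power of two, and nonzero pivots, since pivoting is omitted as in the pseudocode above:

import numpy as np

def rlu(A):
    """Recursive LU without pivoting (NumPy sketch of the RLU pseudocode above).
    Assumes A is m-by-n with m >= n, n a power of two, and no zero pivots."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    if n == 1:
        return A / A[0, 0], A[:1, :1].copy()
    k = n // 2
    # Do left half of A: A(:, 1:k) = L1 * U1
    L1, U1 = rlu(A[:, :k])
    L11, L21 = L1[:k, :], L1[k:, :]
    # Update top k rows of the right half: U12 = L11^(-1) * A(1:k, k+1:n)
    U12 = np.linalg.solve(L11, A[:k, k:])
    # Update the rest of the right half (Schur complement), then factor it
    L2, U2 = rlu(A[k:, k:] - L21 @ U12)
    # Assemble L = [L1, [0; L2]] and U = [[U1, U12]; [0, U2]]
    L = np.hstack([L1, np.vstack([np.zeros((k, n - k)), L2])])
    U = np.block([[U1, U12], [np.zeros((n - k, k)), U2]])
    return L, U

A = np.random.rand(8, 8) + 8 * np.eye(8)   # diagonally dominant, so pivoting is not needed
L, U = rlu(A)
print(np.max(np.abs(L @ U - A)))            # residual should be at round-off level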
Register-file resident Linear Algebra on GPUs
• Vasily's results for LU, QR, and Cholesky on the GPU target single large matrices, too large to fit just in the "fast memory" (shared memory + registers) of the GPU
• There is also demand for solving many smaller problems in parallel, e.g. A(i) * x(i) = b(i) for many different A(1),…,A(k) and b(1),…,b(k)
• Project: design linear algebra algorithms that operate on many different matrices in parallel, each small enough to fit in the 64 KB register set of each multiprocessor (a batched-solve sketch follows below)
  - e.g. a single-precision square matrix of dimension n = 128
• Question: Does the possible need to branch differently on each multiprocessor (because of different pivot orders) matter? If so, is QR better than LU?
• Question: Do we need BLAS3 code versions for such small matrices, or is BLAS2 enough?
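As a CPU-side point of reference for the batched case, here is a minimal NumPy sketch; the batch size k = 1000 and the diagonal shift (used only to keep the test matrices well conditioned) are illustrative assumptions. A GPU solution would aim to keep each A(i) resident in one multiprocessor's register file:

import numpy as np

# k independent systems A(i) x(i) = b(i), each small (n = 128, single precision)
k, n = 1000, 128
A = np.random.rand(k, n, n).astype(np.float32) + n * np.eye(n, dtype=np.float32)
b = np.random.rand(k, n, 1).astype(np.float32)

# np.linalg.solve broadcasts over the leading "batch" dimension,
# factoring and solving each n-by-n system independently.
x = np.linalg.solve(A, b)            # shape (k, n, 1)
print(np.max(np.abs(A @ x - b)))     # residual check over the whole batch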
Extend Vasily's GPU analysis and code to ATI
• Vasily's Best Student Paper Award from SC08 had two parts:
  - Analyzed bottlenecks and speedup possibilities in the NVIDIA architecture
  - Applied the lessons to a reorganization of LU, QR, and Cholesky
• What about the ATI GPU?
  - Both of the above aspects are interesting
  - An ATI GPU is available in ParLab
  - What are the pros and cons of the ATI and NVIDIA architectures? Others?
  - Do we need to reorganize algorithms differently for each, or does one algorithm (perhaps with different block sizes and other parameters) work for both (which would be simpler)?
• Other BLAS-like operations on the GPU
  - Needed for finite-element analysis
Missing Drivers in Sca/LAPACK

                                                LAPACK           ScaLAPACK
Linear Equations     LU                         xGESV            PxGESV
                     Cholesky                   xPOSV            PxPOSV
                     LDL^T                      xSYSV            missing
Least Squares (LS)   QR                         xGELS            PxGELS
                     QR + pivot                 xGELSY           missing driver
                     SVD/QR                     xGELSS           missing driver
                     SVD/D&C                    xGELSD           missing (intent)
                     SVD/MRRR                   missing (oops)   missing (oops)
                     QR + iterative refine.     missing          missing
Generalized LS       LS + equality constr.      xGGLSE           missing
                     Generalized LM             xGGGLM           missing
                     Above + iterative ref.     missing          missing
More missing drivers

                                                LAPACK           ScaLAPACK
Symmetric EVD        QR / Bisection+Invit       xSYEV / X        PxSYEV / X
                     D&C                        xSYEVD           missing (intent)
                     MRRR                       xSYEVR           missing
Nonsymmetric EVD     Schur form                 xGEES / X        missing driver
                     Vectors too                xGEEV / X        missing driver
SVD                  QR                         xGESVD           PxGESVD
                     D&C                        xGESDD           missing (intent)
                     MRRR                       missing (oops)   missing (oops)
                     Jacobi                     xGESVJ           missing
Generalized          QR / Bisection+Invit       xSYGV / X        PxSYGV / X
Symmetric EVD        D&C                        xSYGVD           missing (intent)
                     MRRR                       missing          missing
Generalized          Schur form                 xGGES / X        missing
Nonsymmetric EVD     Vectors too                xGGEV / X        missing
Generalized SVD      Kogbetliantz               xGGSVD           missing (intent)
                     MRRR                       missing (oops)   missing (oops)
Missing matrix types in ScaLAPACK
• Symmetric, Hermitian, triangular
  - Band, packed
• Positive definite
  - Packed
• Orthogonal, unitary
  - Packed
[Figure: Execution time (seconds) of PDGESV for various processor grid shapes (1x60, 2x30, 3x20, 4x15, 5x12, 6x10) and problem sizes from 1000 to 10000.]
Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory.
Speedups for using a 2D processor grid range from 2x to 8x.
Tuning the data layout
• Layout depends on the block size b and the processor grid Pr x Pc
• Simple layouts are easy for the user, but bad for performance
[Figure: For the optimal grid (6x10) for PDGESV, comparison of calculation time vs. the time to redistribute data from a linear grid, in seconds (log scale), over a range of problem sizes. Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory.]
• Cost of redistributing the matrix to the optimal layout is small
• Cost of tuning the data layout, compared to the runtime?
• Possible project: build a "wrapper" that chooses the fastest layout, decides whether to convert back and forth, and hides the details from the user (a simple grid-shape heuristic is sketched below).
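A minimal sketch of one ingredient such a wrapper might start from, assuming the simple heuristic of picking the most nearly square factorization of P processors; a real wrapper would instead time (or model) PDGESV on the candidate shapes and also weigh the redistribution cost shown above:

def candidate_grids(p):
    """All ways to factor p processors into a Pr x Pc grid."""
    return [(pr, p // pr) for pr in range(1, p + 1) if p % pr == 0]

def nearly_square_grid(p):
    """Heuristic choice: the Pr x Pc factorization closest to square.
    This is an illustrative starting point, not the tuned choice the
    project's wrapper would produce."""
    return min(candidate_grids(p), key=lambda g: abs(g[0] - g[1]))

print(candidate_grids(60))     # [(1, 60), (2, 30), (3, 20), (4, 15), (5, 12), (6, 10), ...]
print(nearly_square_grid(60))  # (6, 10) -- matches the optimal grid in the measurements above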
Parallel Eigenvalue Algorithms on GPU
• Harder to use all BLAS3 than for solving Ax = b or least squares
• Symmetric eigenvalue problem for A = A^T (SVD is similar):
  - Find orthogonal Q to transform A = Q T Q^T, where T = T^T is tridiagonal (nonzero only on the main diagonal and right above and below it)
  - Find eigenvalues Λ = diag(λ1,…,λn) and orthogonal eigenvectors U of T, i.e. T = U Λ U^T
    • Good parallel algorithms exist; cheaper than the first step
  - Then A = (QU) Λ (QU)^T, so the orthogonal eigenvectors are QU and the eigenvalues are Λ
• Computing A = Q T Q^T is the proposed challenge
  - Use "Successive Band Reduction" (Sun, Bischof et al.)
  - Go from A to a wide band matrix B via A = V B V^T, V orthogonal
    • All BLAS3, fast on the GPU
  - Go from B to tridiagonal T via B = W T W^T, W orthogonal
    • BLAS1 and BLAS2; do it on the CPU
  - Find T = U Λ U^T as above; then A = (VWU) Λ (VWU)^T (the overall two-phase structure is sketched below)
• Prospect of minimizing communication in theory
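A small NumPy/SciPy sketch of the overall two-phase structure (reduce to tridiagonal form, solve the tridiagonal eigenproblem, compose the orthogonal factors). Here scipy.linalg.hessenberg stands in for the whole reduction step; the project would split that step into the GPU band reduction and the CPU band-to-tridiagonal stage described above:

import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(0)
n = 300
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                         # symmetric test matrix

# Phase 1: orthogonal reduction A = Q T Q^T (T is tridiagonal because A is symmetric).
T, Q = hessenberg(A, calc_q=True)

# Phase 2: eigendecomposition of the tridiagonal matrix, T = U Lambda U^T
lam, U = eigh_tridiagonal(np.diag(T), np.diag(T, 1))

# Compose: eigenvectors of A are Q @ U, eigenvalues are lam
V = Q @ U
print(np.max(np.abs(A @ V - V * lam)))    # residual should be at round-off level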
Experiment with PLASMA for Multicore
• PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
  - icl.cs.utk.edu/plasma/
[Figure: Fork-Join vs. Dynamic Execution on Multicore – fork-join execution with parallel BLAS vs. DAG-based dynamic scheduling; the DAG-based schedule saves time. Experiments on Intel's quad-core Clovertown with 2 sockets / 8 threads. Source: Jack Dongarra.]
• Possible experiments with PLASMA (a toy DAG-scheduling sketch follows below):
  - Implement other factorizations
  - Compare performance
    • to LAPACK with parallel BLAS
    • to ScaLAPACK
  - Evaluate expressiveness for eigenvalue problems
  - Study the interaction of its scheduler with the higher-level scheduler being designed in ParLab
    • Can PLASMA "gracefully" accept and give up resources?
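To make the DAG-based dynamic scheduling idea concrete, here is a toy scheduler sketch in Python: each task starts as soon as all of its predecessors finish, rather than waiting at a fork-join barrier after every step. The task names, dependency structure, and kernel stand-in are invented for illustration and are not PLASMA's API:

import concurrent.futures as cf

# A made-up task DAG: each task maps to the set of tasks it depends on.
deps = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B"},
    "E": {"B", "C"},
    "F": {"D", "E"},
}

def kernel(name):
    # Stand-in for a real tile kernel (e.g. a BLAS call on one block).
    return name

def run_dag(deps, workers=4):
    """Dynamically launch every task whose predecessors have completed."""
    done, futures = set(), {}
    with cf.ThreadPoolExecutor(max_workers=workers) as pool:
        while len(done) < len(deps):
            for t, d in deps.items():
                if t not in done and t not in futures and d <= done:
                    futures[t] = pool.submit(kernel, t)
            finished, _ = cf.wait(futures.values(), return_when=cf.FIRST_COMPLETED)
            for t in list(futures):
                if futures[t] in finished:
                    done.add(t)
                    del futures[t]
    return done

print(run_dag(deps))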
Perform analogous experiments with UPC, Titanium or other PGAS languages
Investigate role of "Dense Motif" in ParLab Apps
• An initial study showed dense linear algebra appearing in the Image, Speech, and Music applications
• Determine what is really needed: functions, problem sizes, performance requirements
• What do we still need to optimize?