A framework for practical fast matrix multiplication

DESCRIPTION
Slides from my talk at the ICME Colloquium.

TRANSCRIPT
A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION
Austin Benson ([email protected]), ICME, Stanford
Grey Ballard, Sandia National Laboratories
ICME Colloquium, October 20, 2014
github.com/arbenson/fast-matmul
arXiv: 1409.2908
Fast matrix multiplication: bridging theory and practice
• There are a number of Strassen-like algorithms for matrix multiplication that have only been “discovered” recently. [Smirnov13], [Benson&Ballard14]
• We show that they can achieve higher performance than the Intel Math Kernel Library (MKL).
• We use code generation for extensive prototyping.
[Figure: exponent of matrix multiplication: 3 (classical) → 2.81 [Strassen69] → 2.37 [Williams12]]
Strassen’s algorithm
Key ingredients of Strassen’s algorithm
1. Block partitioning of matrices (<2, 2, 2>)
2. Seven linear combinations of sub-blocks of A
3. Seven linear combinations of sub-blocks of B
4. Seven matrix multiplies to form Mr (recursive)
5. Linear combinations of Mr to form Cij
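The five steps above can be written out directly; here is a minimal NumPy sketch (the cutoff size n0 is an illustrative choice, and n is assumed to be a power of 2 times n0):

```python
import numpy as np

def strassen(A, B, n0=64):
    """Multiply square matrices with Strassen's <2, 2, 2> algorithm.

    Recurses while the dimension exceeds n0, then falls back to np.dot.
    """
    n = A.shape[0]
    if n <= n0:
        return A @ B

    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Steps 2-4: seven linear combinations of the A- and B-blocks,
    # then seven recursive multiplies M1..M7.
    M1 = strassen(A11 + A22, B11 + B22, n0)
    M2 = strassen(A21 + A22, B11, n0)
    M3 = strassen(A11, B12 - B22, n0)
    M4 = strassen(A22, B21 - B11, n0)
    M5 = strassen(A11 + A12, B22, n0)
    M6 = strassen(A21 - A11, B11 + B12, n0)
    M7 = strassen(A12 - A22, B21 + B22, n0)

    # Step 5: linear combinations of the Mr form the blocks of C.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```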
Key ingredients of fast matmul algorithms
1. Block partitioning of matrices (<M, K, N>)
2. R linear combinations of sub-blocks of A
3. R linear combinations of sub-blocks of B
4. R matrix multiplies to form Mr (recursive); R < MKN means faster than classical
5. Linear combinations of Mr to form Cij
Key idea: Trade N^3 computation (matmul) for more N^2 computation (adding matrices together)
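The payoff of R < MKN can be quantified: composing an <M, K, N> algorithm with its cyclic permutations yields a square algorithm with asymptotic exponent 3·log(R)/log(MKN), a standard back-of-the-envelope:

```python
import math

def exponent(M, K, N, R):
    """Asymptotic exponent of an <M, K, N> fast algorithm with R multiplies.

    R < M*K*N gives an exponent below 3 (classical matmul);
    R = M*K*N recovers the classical exponent exactly.
    """
    return 3 * math.log(R) / math.log(M * K * N)

print(exponent(2, 2, 2, 8))  # classical: 3.0
print(exponent(2, 2, 2, 7))  # Strassen: ~2.807
```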
“Outer product” fast algorithm
• <4, 2, 4> partitioning
• R = 26 multiplies (< 4 * 2 * 4 = 32)
• 23% speedup per recursive step (if everything else were free)
• Linear combinations of Aij to form Sr: 68 terms
• Linear combinations of Bij to form Tr: 52 terms
• Linear combinations of Mr to form Cij: 69 terms
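The quoted 23% is just the ratio of classical to fast multiply counts per step; a quick check (sketch):

```python
def speedup_per_step(M, K, N, R):
    """Ideal speedup of one <M, K, N> recursive step over classical
    matmul, assuming the linear combinations (additions) are free."""
    return M * K * N / R - 1.0

print(f"{speedup_per_step(4, 2, 4, 26):.0%}")  # ~23%
print(f"{speedup_per_step(2, 2, 2, 7):.0%}")   # Strassen: ~14%
```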
Discovering fast algorithms is a numerical challenge
• Low-rank tensor decompositions lead to fast algorithms
• Tensors are small, but we need exact decompositions: NP-hard
• Use alternating least squares with regularization and rounding tricks [Smirnov13], [Benson&Ballard14]
• We have around 10 fast algorithms for <M, K, N> decompositions. Also have permutations like <K, M, N>.
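A rank-R decomposition of the matmul tensor is exactly the coefficient data [U, V, W] of a fast algorithm. A sketch of that correspondence, with Strassen's rank-7 <2, 2, 2> decomposition hard-coded for illustration (block order is row-major: 11, 12, 21, 22):

```python
import numpy as np

# Column r of U, V, W gives the coefficients of the A-blocks in S_r,
# the B-blocks in T_r, and of M_r in the output blocks of C.
U = np.array([[1, 0, 1, 0, 1, -1,  0],
              [0, 0, 0, 0, 1,  0,  1],
              [0, 1, 0, 0, 0,  1,  0],
              [1, 1, 0, 1, 0,  0, -1]], dtype=float)
V = np.array([[1, 1,  0, -1, 0, 1, 0],
              [0, 0,  1,  0, 0, 1, 0],
              [0, 0,  0,  1, 0, 0, 1],
              [1, 0, -1,  0, 1, 0, 1]], dtype=float)
W = np.array([[1,  0, 0, 1, -1, 0, 1],
              [0,  0, 1, 0,  1, 0, 0],
              [0,  1, 0, 1,  0, 0, 0],
              [1, -1, 1, 0,  0, 1, 0]], dtype=float)

def one_step(U, V, W, A, B):
    """One step of the fast algorithm on 2x2 matrices (blocks = scalars):
    S = U^T a, T = V^T b, M = S * T elementwise, then c = W M."""
    a = A.reshape(4)                # row-major block vector of A
    b = B.reshape(4)
    m = (U.T @ a) * (V.T @ b)       # the R = 7 scalar multiplies
    return (W @ m).reshape(2, 2)
```

Any exact rank-R triple [U, V, W] plugged into `one_step` yields a correct multiply; finding such triples with small R is the NP-hard part.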
[Figure: comparison of fast algorithms, [Smirnov13] and [Strassen69]]
Code generation lets us prototype algorithms quickly
• We have a compact representation of many fast algorithms:
  1. dimensions of block partitioning (<M, K, N>)
  2. linear combinations of sub-blocks (Sr, Tr)
  3. linear combinations of Mr to form Cij
• We use code generation to rapidly prototype fast algorithms
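From that compact representation, emitting the addition code is just a walk over the coefficient columns. A toy generator sketch (the real generator at github.com/arbenson/fast-matmul is far more involved):

```python
def gen_combinations(coeffs, blocks, out):
    """Emit one C-like assignment per column of coeffs, computing
    out1..outR as linear combinations of the named sub-blocks."""
    lines = []
    for r, col in enumerate(zip(*coeffs), start=1):   # iterate columns
        terms = []
        for c, name in zip(col, blocks):
            if c == 1:
                terms.append(f"+ {name}" if terms else name)
            elif c == -1:
                terms.append(f"- {name}")
            elif c != 0:
                terms.append(f"+ {c}*{name}" if terms else f"{c}*{name}")
        lines.append(f"{out}{r} = {' '.join(terms)};")
    return lines

# Strassen's A-side coefficients: rows = A11, A12, A21, A22; cols = S1..S7.
U = [[1, 0, 1, 0, 1, -1,  0],
     [0, 0, 0, 0, 1,  0,  1],
     [0, 1, 0, 0, 0,  1,  0],
     [1, 1, 0, 1, 0,  0, -1]]
for line in gen_combinations(U, ["A11", "A12", "A21", "A22"], "S"):
    print(line)  # S1 = A11 + A22; ... S7 = A12 - A22;
```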
Sequential performance

Effective GFLOPS for an M x K x N multiply = 1e-9 * 2 * MKN / (time in seconds)

[Plot: effective GFLOPS vs. problem size; dashed line marks the true peak]
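Note that the metric counts classical flops regardless of the algorithm used, so a fast algorithm (which does fewer actual flops) can score above the machine's true peak. As a sketch:

```python
def effective_gflops(M, K, N, seconds):
    """Effective GFLOPS: classical flop count 2*M*K*N over wall time.
    Fast algorithms do fewer real flops, so this can exceed hardware peak."""
    return 1e-9 * 2 * M * K * N / seconds

# e.g., a hypothetical 4000 x 4000 x 4000 multiply finishing in 1.6 s:
print(effective_gflops(4000, 4000, 4000, 1.6))  # 80.0
```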
Sequential performance
• All algorithms beat MKL on large problems
• Strassen's algorithm is hard to beat
Sequential performance
• Almost all algorithms beat MKL
• <4, 2, 4> and <3, 2, 3> tend to perform the best
Sequential performance
• Almost all algorithms beat MKL
• <4, 3, 3> and <4, 2, 3> tend to perform the best
Practical issues
• When to stop recursion?
• How to implement the linear combinations?
• Can we eliminate redundant linear combinations? Is this useful?
• How to parallelize? How to model parallel performance?
When to stop recursion? Look at the gemm curve.
Basic idea: take another recursive step if the sub-problems will still operate at high performance. [Plot: dgemm performance curve, <M, K, N> = <4, 2, 3>]
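One way to encode that idea, sketched here with a hypothetical measured dgemm curve and an illustrative 90% threshold (neither is the talk's tuned value):

```python
import bisect

# Hypothetical measured dgemm curve: (matrix dimension, GFLOPS).
GEMM_CURVE = [(128, 8.0), (256, 14.0), (512, 20.0),
              (1024, 24.0), (2048, 25.5), (4096, 26.0)]
PEAK = 26.0

def gemm_gflops(n):
    """Step lookup of measured dgemm performance at dimension n."""
    sizes = [s for s, _ in GEMM_CURVE]
    i = min(bisect.bisect_left(sizes, n), len(sizes) - 1)
    return GEMM_CURVE[i][1]

def should_recurse(M, K, N, m, k, n, threshold=0.90):
    """Take another <M, K, N> recursive step only if the sub-problems
    (dimensions m/M, k/K, n/N) would still run near peak dgemm speed."""
    sub = min(m // M, k // K, n // N)
    return gemm_gflops(sub) >= threshold * PEAK

print(should_recurse(4, 2, 3, 8000, 8000, 8000))  # True
print(should_recurse(4, 2, 3, 800, 800, 800))     # False
```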
Understanding parallel performance
• 6 cores: similar performance to sequential
• 24 cores: MKL best for large problems
Bandwidth problems
• We rely on matrix multiplications being much more expensive than matrix additions
• Parallel dgemm on 24 cores: easily get 50-75% of peak
• STREAM benchmark: < 6x speedup in read/write performance on 24 cores
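The imbalance shows up in arithmetic intensity: classical matmul does O(n) flops per word moved, while a matrix addition does O(1), so the additions hit the bandwidth wall first. A back-of-the-envelope sketch (double precision, ignoring cache effects):

```python
def intensity_matmul(n):
    """Flops per byte for classical n x n matmul:
    2*n^3 flops over 3*n^2 doubles (8 bytes each) moved."""
    return 2 * n**3 / (3 * n**2 * 8)

def intensity_addition(n):
    """Flops per byte for n x n matrix addition:
    n^2 flops over 3*n^2 doubles moved."""
    return n**2 / (3 * n**2 * 8)

print(intensity_matmul(4096))    # ~341 flops/byte: compute bound
print(intensity_addition(4096))  # ~0.04 flops/byte: bandwidth bound
```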
[Diagram: C assembled as a linear combination of M1, M2, ..., M7]
High-level conclusions
• For square matrix multiplication, Strassen's is best
• For rectangular matrix multiplication, use a fast algorithm that "matches the shape"
• Bandwidth limits the performance of shared-memory parallel fast matrix multiplication; this should be less of an issue in distributed memory

Future work:
• Numerical stability
• Using fast matmul as a kernel for other algorithms in numerical linear algebra