High Performance Dense Linear Algebra on Spatially Distributed Processors
Jeffrey Diamond and Behnam Robatmili
Stephen Keckler, Robert van de Geijn, Kazushige Goto*, Doug Burger
Department of Computer ScienceUniversity of Texas at Austin
*Texas Advanced Computing CenterUniversity of Texas at Austin
Trends in Chip Level Parallelism
- Emerging architectures are more fine grained: on-chip networks, precise control over communication, tight orchestration of computation across ALUs
- Algorithmic insight comes from the most fine-grained case

Granularity spectrum, coarse to fine: Quad Core (MIMD), Cell, Tilera, TRIPS (SDU)
Parallel Programming Paradigms
- Programming occurs at many levels; the trend is toward an optimized-library model
- Special low-level APIs exist for high performance; we're interested in these low-level APIs

From high-level API to low-level API: Haskell, F#, Sequoia, CUDA, Ct, UPC, etc.; dynamic run times / compilation; classic multithreading; high-performance, low-level libraries
Case Study: Matrix Multiply
- Implementing full-scale DGEMM
- High-performance dense linear algebra libraries (Level 3 BLAS) are layered on top of high-performance matrix multiply kernels: SYMM, SYRK, TRSM, TRMM, etc.
- Core LAPACK: LU with partial pivoting, Cholesky, QR factorization, matrix inversion, reduction to tridiagonal/Hessenberg/bidiagonal form
- Control theory: Sylvester equation, Lyapunov equation, and many, many others...
- The regular operation is very amenable to algorithmic transformations and easy to reason about
Talk Outline
- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
  - High Level Memory Management
  - Low Level Blocking
  - Inner Kernel
- Optimizing Inner Kernel
- Results
- Conclusion
Spatially Distributed Uniprocessors (SDUs)
- Single-threaded scalability issues for architectures and implementation technology: wire delay, power, issue width, memory bandwidth...
- Solution: SDU - partitioned register banks, functional units, ...
- Still executing a single thread across multiple ALUs; where an instruction executes matters
- The program statically determines the location of instructions
- Examples include advanced VLIW processors in the embedded market
- TRIPS partitions most aspects of a single core into tiles:
  - Tiles connected by an on-chip 2-D network
  - Large number of distributed ALUs, registers, data ports
  - Enormous aggregate bandwidth to registers and data, but...
  - Communication between ALUs must go through the network
TRIPS - a modern SDU
TRIPS - a modern SDU
Core 1
Core 2
Shared L2
TRIPS - a modern SDU

[Figure: register banks, L1 banks, L2 banks, and the grid of ALUs.]
Talk Outline
- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
  - High Level Memory Management
  - Low Level Blocking
  - Inner Kernel
- Optimizing Inner Kernel
- Results
- Conclusion
Implementing Matrix Multiply

- Outer level: the Goto streaming algorithm, the heart of the GotoBLAS linear algebra libraries; licensed by many of the top computer vendors and used by many supercomputers in the Top 500 list
- Mid level: an enhanced Goto algorithm with a new hierarchical blocking layer to leverage the SDU topology
- Inner kernel: a novel algorithm suited to SDUs
Goto Streaming Algorithm
- Classical blocking algorithm (C += AB): break the matrices into square blocks just big enough for a, b, and c to fit in the L1 cache
- Goto: the L2 cache is actually fast enough to access directly from the inner kernel
- Instead of small, square matrix blocks, use huge block-panel multiplies
- Traversal order chosen to maximize reuse
- Stream full-sized panels of B and C directly out of DRAM
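The block-panel structure described above can be sketched in pure Python (an illustration of the loop structure only, not the hand-tuned kernel; the block sizes `MC` and `KC` are illustrative stand-ins for the cache-derived values):

```python
def gemm_block_panel(A, B, C, MC=4, KC=4):
    """Sketch of the Goto block-panel structure for C += A * B.

    A is m x k, B is k x n, C is m x n (lists of lists of floats).
    A is consumed one MC x KC block at a time (the block that would
    live in L2), while full-width panels of B and C stream past it.
    """
    m, k, n = len(A), len(B), len(B[0])
    for p0 in range(0, k, KC):          # pick a KC-thick panel of B
        for i0 in range(0, m, MC):      # pick an MC x KC block of A ("in L2")
            # block-panel multiply: C[i0:i0+MC, :] += A_block * B_panel
            for i in range(i0, min(i0 + MC, m)):
                for p in range(p0, min(p0 + KC, k)):
                    a = A[i][p]
                    for j in range(n):  # stream across the full B/C panel
                        C[i][j] += a * B[p][j]
    return C
```

The key contrast with the classical scheme is that the A block is square-ish and cache-resident while B and C are traversed as full-width panels.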
Goto: High Level Blocking
[Figure: high-level blocking. Original problem (dimensions in the thousands): C += A B. Blocked (hundreds): C' (DRAM) += A' (L2) x B' (DRAM/L1), traversed in panel slices.]
Enhancing Goto Algorithm

- 128 registers hold non-trivially sized blocks
- The 2-D mesh network has high bandwidth in orthogonal directions (like a systolic array)
- Additionally store blocks of A in registers
- Bring in elements of A and B simultaneously to maximize bandwidth; use both the horizontal and vertical network links
- But to amortize the use of elements of A in registers, we need to add another level of low-level blocking to the hierarchy
B’, C’ panel slices broken into mini-panels b’, c’ a’-block broken into mini-blocks, a’
a’ block and c mini panel held in registers 4x4 a’ amortized over 4x16 b’
Careful ordering of data movement preserves computational properties of larger block-panel multiply B slice stays in L1 for a LONG time, A stays even longer
A’C’ B’
(L2) (L1)(DRAM)
16 16444 4
+=Hundreds Hundreds
Low Level Blocking Scheme
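The register-level mini block-panel multiply can be sketched as follows (a plain-Python illustration of the data reuse, with the comments mapping arrays to the memory levels named above):

```python
def micro_kernel_4x16(ap, bp, cp):
    """Sketch of the register-level mini block-panel multiply.

    cp (4x16, 'registers') += ap (4x4, 'registers') * bp (4x16, 'L1').
    Each element of the 4x4 ap is reused across all 16 columns of bp,
    which is how the register-resident block of A is amortized.
    """
    for i in range(4):
        for p in range(4):
            a = ap[i][p]          # one element of a', reused 16 times
            for j in range(16):
                cp[i][j] += a * bp[p][j]
    return cp
```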
How do we traverse?

- B' slice fits in L1 cache; A' block fits in L2 cache; C' streams from DRAM
- Load the c' and a' blocks into registers

[Figure: C' += A' x B', with 128- and 512-element block dimensions and 4x4 a' / 4x16 c' tiles highlighted; repeated in the following slides as the traversal advances.]
A’C’
B’
128
512
128
1616
512
X
Stream b’(4x16) from L1 & multiply by a’(4x4)(Reuse a’ four times!)
+= B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
How do we traverse?
4 4
A’C’
B’
128
512
128
512
X
+= B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
How do we traverse?
16164 4
![Page 19: High Performance Dense Linear Algebra on Spatially Distributed Processors Jeffrey Diamond and Behnam Robatmili Stephen Keckler, Robert van de Geijn, Kazushige](https://reader035.vdocument.in/reader035/viewer/2022062801/56649e2a5503460f94b188a3/html5/thumbnails/19.jpg)
19
A’C’
B’
128
512
128
512
X
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
16164 4
![Page 20: High Performance Dense Linear Algebra on Spatially Distributed Processors Jeffrey Diamond and Behnam Robatmili Stephen Keckler, Robert van de Geijn, Kazushige](https://reader035.vdocument.in/reader035/viewer/2022062801/56649e2a5503460f94b188a3/html5/thumbnails/20.jpg)
20
A’C’
B’
128
512
128
512
X
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
16164 4
A’C’
B’
128
512
128
161651
2
X
Reuse register c’, next a’ right, next b’ below:
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
A’C’
B’
128
512
128
161651
2
X
Repeat until at bottom of B slice, right of A row
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
A’C’
B’
128
512
128
161651
2
X
Save c’s, load next row of a’ and c’, reuse entire B’ slice’
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
A’C’
B’
128
512
128
161651
2
X
Repeat process over slice of B’
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
A’C’
B’
128
512
128
161651
2
X
Continue over entire block of A’ and C’
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
C’
B’
A’C’
B’
128
512
128
1616
X
Fetch next slice of B’ and move into next slice of C’
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
A’C’
B’
128
512
128
1616
X
Complete B’, C’ Panels, load next A’ and repeat…
C’
B
C’
B
+=
How do we traverse?
B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM
Defined Inner Kernel
[Figure: the complete blocking hierarchy. Original problem (thousands): C += A B. High-level blocking (hundreds): C' (DRAM) += A' (L2) x B' (DRAM/L1), traversed in panel slices. Mini block-panel (inner kernel): c' (4x16, REG) += a' (4x4, REG) x b' (4x16, L1).]
Talk Outline
- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
- Optimizing Inner Kernel
- Results
- Conclusion
Optimizing the Inner Kernel
- We developed several optimization principles, and were the first to apply them to TRIPS
- Avoiding network contention is critical!
  - A single overscheduled link can cut performance in half
  - Avoided by datapath routing, direction-oriented computation (DOC), register mirroring, and data interleaving - a 5x jump in instructions per cycle, exceeding 10 IPC
- Load-balance every resource in the system
  - In a loop, total performance is limited by the most used wire link or execution slot
  - The loop body is scaled to match register and data usage and to minimize architectural overheads
- This results in the "fragility" of optimization typical of spatial architectures with shared resources
Simplified Schedule

- Step 1: read A from the register files
- Step 2: load B and broadcast it across the rows
- Step 3: do the multiply, then add across the columns
- Step 4: write the results back to C

[Figure: 4x4 ALU grid with register banks R0-R3 along the top, data tiles D0-D3 along the left, and the global tile GT; operands flow across rows and partial sums reduce along columns.]
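The four-step schedule can be modeled functionally as a broadcast followed by a reduction (one plausible mapping for illustration; the actual TRIPS instruction placement is far more involved):

```python
def schedule_one_column(a_tile, b_col):
    """Sketch of the simplified schedule for one column of c'.

    a_tile: a 4x4 tile of A held across the ALU grid (step 1: read from
    the register files). b_col: 4 elements of B (step 2: loaded and
    broadcast, one per grid lane). Step 3: each ALU multiplies its pair
    and the partial products are reduced. Step 4: the caller writes the
    4 results back toward C.
    """
    # step 3a: every grid position forms one partial product
    products = [[a_tile[r][p] * b_col[p] for p in range(4)] for r in range(4)]
    # step 3b: reduce the partial products into one result per lane
    return [sum(products[r]) for r in range(4)]
```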
What are the complications?

- Every register use must be retrieved across the network
- Every load and store needs to get an address
- Need to interleave prefetching, writing, and updating pointers and counters
- Need to account for data-movement instructions
Talk Outline
- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
- Optimizing Inner Kernel
- Results
- Conclusion
Comparison of FPC across major processors
[Chart: kernel FPC and DGEMM FPC (floating point operations per cycle, 0-7) for Opteron, P4, Core 2 Duo, POWER5, Itanium, and TRIPS.]

Execution bottlenecks: integer/network ops vs. FLOPS; single operand per cycle

Enhancement opportunities: SIMD instruction set, larger instruction window, more network bandwidth
* Results from K. Goto and R. A. van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software, 34(3), May 2008.
Performance vs Matrix Size

[Chart: FPC (0-6) vs. matrix size (0-4096) for DGEMM, the C kernel with Goto blocking, and the C kernel without Goto blocking.]
Role of the Compiler
- The kernel has 8x the performance of TRIPS C compiler output
- Did exhaustive empirical studies to determine the individual performance contributions of the optimizations and their interaction with the TRIPS compiler
- The TRIPS compiler does scheduling as a post-process
- Determined that the existing scheduler can handle orchestration well if the algorithm matches the topology: given the assembly for the inner loop, the scheduler obtained 75% of total performance
- Lesson: orchestration is not the difficult part
  - Need to consider basic topology during compilation
  - Blocking compilers and register clustering are active topics of research
  - Annotations / hints to the compiler?
Conclusions
- Fine-grained architectures can boost single-thread performance
- The optimization principles we learned can be applied at many levels of architectural granularity, but are critical for fine-grained architectures
- In the future, high performance will depend on algorithms that incorporate both the memory hierarchy and the topology of the processing/communication substrate
Thank You :)
Any Questions?
Back Up Slides
Just a list for now:
- Comparison of GotoBLAS against ATLAS/LAPACK
- More detailed diagrams of the algorithm
- Other performance graphs
- Systolic array
- Diagrams of other canonical processors
Future work
- Explore the applicability of the optimization principles beyond dense linear algebra, to irregular, control-intensive algorithms
- Quantify the degree to which the principles apply to coarser-grained architectures (CMPs) and to different memory topologies
Trends in Chip Level Parallelism
- Multiple ways to exploit parallelism:
  - Instruction/thread/data-level parallelism
  - Coarse grained vs. fine grained
- What's the programming model?
  - High-level paradigm of your choice...
  - Dynamic compilation and run-time systems
  - Low-level APIs for writing optimized libraries
  - Likely need to rewrite applications
Trends in Computer Architecture
- Emerging architectures are trending towards more fine-grained control, e.g. Intel Terascale, RAW, Tilera: tightly orchestrated computation, on-chip networks, precise control over communication
- These represent steps down a path
- Algorithmic insight can be gained by looking at the most fine-grained examples
Spatially Distributed Uniprocessors

- Scalability issues for both architectures and the underlying technology: wire delay, power, issue width...
- More and more components of microprocessors are becoming distributed: partitioned register banks, functional units, ...
- An SDU partitions all aspects of a single core into tiles:
  - Tiles connected by an on-chip 2-D network
  - Large number of distributed registers and data ports
  - Enormous aggregate bandwidth to registers and data, but...
  - Communication between ALUs must go through the network
- Key performance characteristic: where an instruction executes matters!
TRIPS - a modern SDU
- Grid of ALUs (16)
- Large number of distributed registers
- Large number of data ports
- On-chip 2-D mesh network
- S-NUCA distributed L1 and L2 cache
TRIPS - a modern SDU
- Potential advantages for matrix multiply: large number of ALUs, precise placement of instructions
- Not a MIMD machine: the model of execution is block dataflow graphs, brought in and executed one at a time
- Must also deal with data movement, registers, data bandwidth, and control
Classical Matrix Multiply
- Need to compute C = AB + C
- Once, we just used a triply nested loop...
- Want to amortize the O(n^2) data movement over the 2n^3 computation of matrix multiply
- Break the A, B, and C matrices into square blocks just small enough to fit A, B, and C in the L1 cache
- The inner kernel computes a block of C by caching elements of C in registers and using values of A and B from the L1 cache
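The classical square-block scheme above can be sketched in Python (illustrative only; `NB` stands in for the L1-derived block size, and the matrices are assumed square with the size a multiple of `NB`):

```python
def classical_blocked_gemm(A, B, C, NB=2):
    """Sketch of the classical square-block algorithm for C = A * B + C.

    All three operands are tiled into NB x NB blocks sized so that one
    block of each fits together in L1; the inner kernel keeps its c
    block cached "in registers" while reading A and B blocks from "L1".
    """
    n = len(A)
    for i0 in range(0, n, NB):
        for j0 in range(0, n, NB):
            # c block cached "in registers", seeded from C
            c = [[C[i0 + i][j0 + j] for j in range(NB)] for i in range(NB)]
            for p0 in range(0, n, NB):     # walk blocks of A and B
                for i in range(NB):
                    for p in range(NB):
                        a = A[i0 + i][p0 + p]
                        for j in range(NB):
                            c[i][j] += a * B[p0 + p][j0 + j]
            for i in range(NB):            # write the finished block back
                for j in range(NB):
                    C[i0 + i][j0 + j] = c[i][j]
    return C
```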
Performance for thin panels
[Chart: "Performance vs Panel Thickness" - FPC (0-6) vs. k (0-4096, with m = n = 4096) for C_{m x n} = A_{m x k} x B_{k x n}.]
Goto’s Streaming Algorithm
- The classical algorithm breaks the matrices into blocks just big enough for A, B, and C to fit in the L1 cache
- Goto realized the L2 cache is actually fast enough to access directly from the inner kernel!
  - Use most of the L2 cache for a giant block of A
  - The inner kernel uses all levels of the memory hierarchy simultaneously
  - Cache large slices of the B panel in the L1 cache; cache a small piece of C in registers
- Instead of square matrix blocks, use block-panel multiplies, with a traversal order that maximizes reuse
  - Stream full-sized contiguous panels of B and C directly out of DRAM
- Use extremely optimized, hand-tuned assembly
Methodology
So we compiled code using the TRIPS compiler, and we ran it on a hardware prototype. We kept making changes and seeing how fast it ran. We made notes of the changes. We made graphs from the notes. We made slides based on the graphs. We made conclusions based on the slides. It's 130nm and 366 MHz, but that's OK.
Controlling The Cache
[Figure: C += A x B, with 128- and 512-element block dimensions and 16-wide slices.]

- B slice fits in L1 cache; A block fits in L2 cache; C chunks come from L2
- How do we keep B in the L1 cache while streaming all of A through?
A Buffer Size
[Chart: "Effect of dimensions of the A buffer (same area)" - FPC (0-6) vs. m = n = k (0-4096) for A-buffer shapes 512x128, 256x256, and 128x512.]
Block Panel Multiply
[Figure: block-panel multiply C += A x B, animated over the following frames.]

Doing multiple GEMDOTS in parallel.