Lecture 6: Memory Hierarchy and Cache (Continued)
Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory
Cache: A safe place for hiding and storing things. Webster’s New World Dictionary (1976)
Homework Assignment
• Implement, in Fortran or C, the six different ways to perform matrix multiplication by interchanging the loops. (Use 64-bit arithmetic.) Make each implementation a subroutine, like:
  – subroutine ijk ( a, m, n, lda, b, k, ldb, c, ldc )
  – subroutine ikj ( a, m, n, lda, b, k, ldb, c, ldc )
  – …
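6 Variations of Matrix Multiply

for _ = 1:n
  for _ = 1:n
    for _ = 1:n
      $C_{i,j} \leftarrow C_{i,j} + A_{i,k} B_{k,j}$
    end
  end
end

Loop orderings: ijk, ikj, kij, kji, jki, jik
[Figure: access patterns of C(i,j), A(i,k), and B(k,j) for each loop ordering; labels: Fortran, C]
However, this is only part of the story.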
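As a minimal sketch of two of the six orderings, assume square n-by-n matrices stored by row, as in C. The names matmul_ijk, matmul_ikj, and the IDX macro are illustrative only and are not the subroutine interface required by the homework.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major index of element (i,j) in an n-by-n matrix (illustrative helper). */
#define IDX(i, j, n) ((size_t)(i) * (size_t)(n) + (size_t)(j))

/* ijk order: the inner loop is a dot product of row i of A and column j of B. */
void matmul_ijk(int n, const double *a, const double *b, double *c)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[IDX(i, j, n)] += a[IDX(i, k, n)] * b[IDX(k, j, n)];
}

/* ikj order: the inner loop sweeps row i of C and row k of B with unit stride. */
void matmul_ikj(int n, const double *a, const double *b, double *c)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = a[IDX(i, k, n)];
            for (int j = 0; j < n; j++)
                c[IDX(i, j, n)] += aik * b[IDX(k, j, n)];
        }
}
```

Both routines compute the same C = C + A*B; only the order in which memory is touched differs. The other four orderings follow by permuting the three loop headers.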
SUN Ultra 2 200 MHz (L1=16KB, L2=1MB)
[Figure: performance plot with legend entries ijk, jki, kij, dgemm, jik, kji, ikj]
Matrices in Cache
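• L1 cache 16 KB: $\sqrt{16\,\mathrm{KB}/8} \approx 45$
• L2 cache 1 MB: $\sqrt{1\,\mathrm{MB}/8} \approx 362$
(order of the largest n-by-n matrix of 8-byte words that fits in each cache)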
Optimizing Matrix Addition for Caches
• Dimension A(n,n), B(n,n), C(n,n)
• A, B, C stored by column (as in Fortran)
• Algorithm 1:
  – for i=1:n, for j=1:n, A(i,j) = B(i,j) + C(i,j)
• Algorithm 2:
  – for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j)
• What is the “memory access pattern” for Algs 1 and 2?
• Which is faster?
• What if A, B, C are stored by row (as in C)?
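The two algorithms can be sketched in C, where matrices are stored by row. This is an illustration under that assumption; the function names add_i_outer and add_j_outer are made up for the example. With row storage, Algorithm 1 walks memory with stride 1 and Algorithm 2 with stride n; with column storage (Fortran) the roles are reversed.

```c
#include <assert.h>
#include <stddef.h>

/* Algorithm 1: i outer, j inner. In row-major storage the inner loop
   touches consecutive addresses (stride 1). */
void add_i_outer(int n, double *a, const double *b, const double *c)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            size_t p = (size_t)i * n + j;
            a[p] = b[p] + c[p];
        }
}

/* Algorithm 2: j outer, i inner. In row-major storage the inner loop
   jumps n elements between consecutive accesses (stride n). */
void add_j_outer(int n, double *a, const double *b, const double *c)
{
    assert(n >= 0);
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            size_t p = (size_t)i * n + j;
            a[p] = b[p] + c[p];
        }
}
```

Both functions produce identical results; any difference in speed comes only from the access pattern.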
Using a Simpler Model of Memory to Optimize
• Assume just 2 levels in the hierarchy, fast and slow
• All data initially in slow memory
  – m = number of memory elements (words) moved between fast and slow memory
  – tm = time per slow memory operation
  – f = number of arithmetic operations
  – tf = time per arithmetic operation < tm
  – q = f/m = average number of flops per slow element access
• Minimum possible Time = f*tf, when all data in fast memory
• Actual Time = f*tf + m*tm = f*tf*(1 + (tm/tf)*(1/q))
• Larger q means Time closer to minimum f*tf
Simple example using memory model
• To see results of changing q, consider a simple computation:
    s = 0
    for i = 1, n
      s = s + h(X[i])
• Assume arithmetic on fast memory runs at 1 Mflop/s, i.e. tf = 1 (microsecond per flop)
• Assume moving data is tm = 10
• Assume h takes q flops
• Assume array X is in slow memory
• So m = n and f = q*n
• Time = read X + compute = 10*n + q*n
• Mflop/s = f/t = q/(10 + q)
• As q increases, this approaches the “peak” speed of 1 Mflop/s
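The two-level model can be written down directly. The sketch below assumes times are measured in microseconds, so that a rate in flops per microsecond equals Mflop/s; the function names model_time and model_rate are illustrative.

```c
#include <assert.h>

/* Two-level memory model: Time = f*tf + m*tm.
   f  = number of arithmetic operations, tf = time per arithmetic operation,
   m  = number of words moved from slow memory, tm = time per slow access. */
double model_time(double f, double m, double tf, double tm)
{
    assert(tf > 0.0 && tm >= 0.0);
    return f * tf + m * tm;
}

/* Achieved rate = f / Time (flops per time unit). */
double model_rate(double f, double m, double tf, double tm)
{
    return f / model_time(f, m, tf, tm);
}
```

For the example above with n = 1000 and q = 10, f = 10000 and m = 1000, so model_rate(10000, 1000, 1, 10) gives 10/(10 + 10) = 0.5 Mflop/s, half of the 1 Mflop/s peak.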
Simple Example (continued)
• Algorithm 1
    s1 = 0; s2 = 0
    for j = 1 to n
      s1 = s1 + h1(X(j))
      s2 = s2 + h2(X(j))
• Algorithm 2
    s1 = 0; s2 = 0
    for j = 1 to n
      s1 = s1 + h1(X(j))
    for j = 1 to n
      s2 = s2 + h2(X(j))
• Which is faster?
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
2 misses per access to a & c vs. one miss per access; improve spatial locality
Optimizing Matrix Multiply for Caches
• Several techniques for making this faster on modern processors
  – heavily studied
• Some optimizations are done automatically by the compiler, but you can do much better
• In general, you should use optimized libraries (often supplied by the vendor) for this and other very common linear algebra operations
  – BLAS = Basic Linear Algebra Subprograms
• Other algorithms you may want are not going to be supplied by the vendor, so you need to know these techniques
Warm up: Matrix-vector multiplication y = y + A*x
for i = 1:n
  for j = 1:n
    y(i) = y(i) + A(i,j)*x(j)
[Figure: y(i) = y(i) + A(i,:) * x(:)]
Warm up: Matrix-vector multiplication y = y + A*x
{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
  {read row i of A into fast memory}
  for j = 1:n
    y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}
• m = number of slow memory refs = 3*n + n^2
• f = number of arithmetic operations = 2*n^2
• q = f/m ~= 2
• Matrix-vector multiplication is limited by slow memory speed
Matrix Multiply C=C+A*B
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
Matrix Multiply C=C+A*B (unblocked, or untiled)
for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}
[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
Matrix Multiply (unblocked, or untiled)
Number of slow memory references on unblocked matrix multiply:
m = n^3          read each column of B n times
  + n^2          read each row of A once
  + 2*n^2        read and write each element of C once
  = n^3 + 3*n^2
So q = f/m = (2*n^3)/(n^3 + 3*n^2) ~= 2 for large n, no improvement over matrix-vector multiply
[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
Here q = ops per slow memory reference.
Matrix Multiply (blocked, or tiled)
Consider A, B, C to be N by N matrices of b by b subblocks, where b=n/N is called the blocksize
for i = 1 to N
  for j = 1 to N
    {read block C(i,j) into fast memory}
    for k = 1 to N
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}
[Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j)]
Matrix Multiply (blocked or tiled)
Why is this algorithm correct?
Number of slow memory references on blocked matrix multiply:
m = N*n^2        read each block of B N^3 times (N^3 * n/N * n/N)
  + N*n^2        read each block of A N^3 times
  + 2*n^2        read and write each block of C once
  = (2*N + 2)*n^2
So q = f/m = 2*n^3 / ((2*N + 2)*n^2) ~= n/N = b for large n
So we can improve performance by increasing the blocksize b. This can be much faster than matrix-vector multiply (q=2).
Limit: All three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ~= b <= sqrt(M/3)
Theorem (Hong, Kung, 1981): Any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(M))
Here q = ops per slow memory reference.
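A minimal C sketch of the blocked (tiled) algorithm follows. It assumes row-major n-by-n matrices; the parameter bs plays the role of the blocksize b, and the names matmul_blocked and min_int are illustrative. Edge blocks are clipped so that n does not have to be a multiple of bs. The code does not copy blocks explicitly; it relies on the cache to keep the three active blocks resident.

```c
#include <assert.h>
#include <stddef.h>

static int min_int(int x, int y) { return x < y ? x : y; }

/* C = C + A*B on n-by-n row-major matrices, processed in bs-by-bs blocks. */
void matmul_blocked(int n, int bs, const double *a, const double *b, double *c)
{
    assert(n >= 0 && bs > 0);
    for (int ii = 0; ii < n; ii += bs)
        for (int jj = 0; jj < n; jj += bs)
            for (int kk = 0; kk < n; kk += bs) {
                int imax = min_int(ii + bs, n);
                int jmax = min_int(jj + bs, n);
                int kmax = min_int(kk + bs, n);
                /* Block update: C(ii,jj) += A(ii,kk) * B(kk,jj). */
                for (int i = ii; i < imax; i++)
                    for (int k = kk; k < kmax; k++) {
                        double aik = a[(size_t)i * n + k];
                        for (int j = jj; j < jmax; j++)
                            c[(size_t)i * n + j] += aik * b[(size_t)k * n + j];
                    }
            }
}
```

Following the limit 3*b^2 <= M above, bs would be chosen so that three bs-by-bs blocks of doubles fit in the targeted cache level.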
Model
• As much as possible will be overlapped
• Dot Product:
    ACC = 0
    do i = 1,n
      ACC = ACC + x(i)*y(i)
    end do
• Experiments done on an IBM RS6000/530
  – 25 MHz
  – 2 cycles to complete an FMA, which can be pipelined
    » => 50 Mflop/s peak
  – one cycle from cache
DOT Operation - Data in Cache
      DO 10 I = 1, n
        T = T + X(I)*Y(I)
   10 CONTINUE
• Theoretically, 2 loads for X(I) and Y(I), one FMA operation, no re-use of data
• Pseudo-assembler
         LOAD   fp0,T
  label: LOAD   fp1,X(I)
         LOAD   fp2,Y(I)
         FMA    fp0,fp0,fp1,fp2
         BRANCH label
[Figure: pipeline timeline of Load x, Load y, FMA for successive iterations]
1 result per cycle = 25 Mflop/s
Matrix-Vector Product
• DOT version
      DO 20 I = 1, M
        DO 10 J = 1, N
          Y(I) = Y(I) + A(I,J)*X(J)
   10   CONTINUE
   20 CONTINUE
• From Cache = 22.7 Mflops
• From Memory = 12.4 Mflops
Loop Unrolling
      DO 20 I = 1, M, 2
        T1 = Y(I  )
        T2 = Y(I+1)
        DO 10 J = 1, N
          T1 = T1 + A(I,J  )*X(J)
          T2 = T2 + A(I+1,J)*X(J)
   10   CONTINUE
        Y(I  ) = T1
        Y(I+1) = T2
   20 CONTINUE
• 3 loads, 4 flops
• Speed of y=y+A^T x, N=48 (Mflop/s)

  Depth       1      2      3      4      ∞
  Speed      25     33.3   37.5   40     50
  Measured   22.7   30.5   34.3   36.5
  Memory     12.4   12.7   12.7   12.6

• unroll 1: 2 loads : 2 ops per 2 cycles
• unroll 2: 3 loads : 4 ops per 3 cycles
• unroll 3: 4 loads : 6 ops per 4 cycles
• …
• unroll n: n+1 loads : 2n ops per n+1 cycles
• problem: only so many registers
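The depth-2 unrolling above translates to C roughly as follows. This is a sketch assuming a row-major m-by-n matrix; the name matvec_unroll2 is illustrative, and the cleanup branch for odd m is an addition that the Fortran fragment does not show.

```c
#include <assert.h>
#include <stddef.h>

/* y = y + A*x for an m-by-n row-major matrix, with the row loop unrolled by 2
   so that each loaded x[j] is used for two multiply-adds. */
void matvec_unroll2(int m, int n, const double *a, const double *x, double *y)
{
    assert(m >= 0 && n >= 0);
    int i = 0;
    for (; i + 1 < m; i += 2) {
        double t1 = y[i];
        double t2 = y[i + 1];
        const double *r1 = a + (size_t)i * n;   /* row i   */
        const double *r2 = r1 + n;              /* row i+1 */
        for (int j = 0; j < n; j++) {
            double xj = x[j];                   /* 3 loads, 4 flops per pass */
            t1 += r1[j] * xj;
            t2 += r2[j] * xj;
        }
        y[i] = t1;
        y[i + 1] = t2;
    }
    if (i < m) {                                /* leftover row when m is odd */
        double t = y[i];
        for (int j = 0; j < n; j++)
            t += a[(size_t)i * n + j] * x[j];
        y[i] = t;
    }
}
```

Keeping t1 and t2 in local scalars gives the compiler the chance to hold them in registers across the inner loop, which is the point of the transformation.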
Matrix Multiply
• DOT version - 25 Mflops in cache
      DO 30 J = 1, M
        DO 20 I = 1, M
          DO 10 K = 1, L
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
   10     CONTINUE
   20   CONTINUE
   30 CONTINUE
How to Get Near Peak
      DO 30 J = 1, M, 2
        DO 20 I = 1, M, 2
          T11 = C(I,  J  )
          T12 = C(I,  J+1)
          T21 = C(I+1,J  )
          T22 = C(I+1,J+1)
          DO 10 K = 1, L
            T11 = T11 + A(I,  K)*B(K,J  )
            T12 = T12 + A(I,  K)*B(K,J+1)
            T21 = T21 + A(I+1,K)*B(K,J  )
            T22 = T22 + A(I+1,K)*B(K,J+1)
   10     CONTINUE
          C(I,  J  ) = T11
          C(I,  J+1) = T12
          C(I+1,J  ) = T21
          C(I+1,J+1) = T22
   20   CONTINUE
   30 CONTINUE
• Inner loop:
  – 4 loads, 8 operations, optimal.
• In practice we have measured 48.1 out of a peak of 50 Mflop/s when in cache
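The same 2-by-2 register blocking can be sketched in C. The assumptions here are row-major storage, square n-by-n matrices, and even n (checked by an assert); the name matmul_2x2 is illustrative. No timing claim is made for this version.

```c
#include <assert.h>
#include <stddef.h>

/* C = C + A*B on n-by-n row-major matrices, n even.
   A 2-by-2 block of C is accumulated in four scalars. */
void matmul_2x2(int n, const double *a, const double *b, double *c)
{
    assert(n >= 0 && n % 2 == 0);
    for (int i = 0; i < n; i += 2)
        for (int j = 0; j < n; j += 2) {
            double *c1 = c + (size_t)i * n + j;   /* row i   of the C block */
            double *c2 = c1 + n;                  /* row i+1 of the C block */
            double t11 = c1[0], t12 = c1[1];
            double t21 = c2[0], t22 = c2[1];
            for (int k = 0; k < n; k++) {
                /* 4 loads feed 4 multiply-adds (8 flops). */
                double a1 = a[(size_t)i * n + k];
                double a2 = a[(size_t)(i + 1) * n + k];
                double b1 = b[(size_t)k * n + j];
                double b2 = b[(size_t)k * n + j + 1];
                t11 += a1 * b1;
                t12 += a1 * b2;
                t21 += a2 * b1;
                t22 += a2 * b2;
            }
            c1[0] = t11; c1[1] = t12;
            c2[0] = t21; c2[1] = t22;
        }
}
```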
BLAS -- Introduction
• Clarity: code is shorter and easier to read,
• Modularity: gives programmer larger building blocks,
• Performance: manufacturers will provide tuned machine-specific BLAS,
• Program portability: machine dependencies are confined to the BLAS
Memory Hierarchy
[Figure: memory hierarchy, top to bottom: Registers, L1 Cache, L2 Cache, Local Memory, Remote Memory, Secondary Memory]
• Key to high performance is effective use of the memory hierarchy
• True on all architectures
Level 1, 2 and 3 BLAS
• Level 1 BLAS: Vector-Vector operations
• Level 2 BLAS: Matrix-Vector operations
• Level 3 BLAS: Matrix-Matrix operations
[Figure: schematics of the vector-vector, matrix-vector, and matrix-matrix operations]
More on BLAS (Basic Linear Algebra Subprograms)
• Industry standard interface (evolving)
• Vendors, others supply optimized implementations
• History
  – BLAS1 (1970s):
    » vector operations: dot product, saxpy (y=α*x+y), etc
    » m=2*n, f=2*n, q ~1 or less
  – BLAS2 (mid 1980s)
    » matrix-vector operations: matrix vector multiply, etc
    » m=n^2, f=2*n^2, q~2, less overhead
    » somewhat faster than BLAS1
  – BLAS3 (late 1980s)
    » matrix-matrix operations: matrix matrix multiply, etc
    » m >= 4n^2, f=O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2
• Good algorithms use BLAS3 when possible (LAPACK)
• www.netlib.org/blas, www.netlib.org/lapack
Why Higher Level BLAS?
• Can only do arithmetic on data at the top of the hierarchy
• Higher level BLAS lets us do this

  BLAS                 Memory Refs   Flops   Flops/Memory Refs
  Level 1  y=y+αx      3n            2n      2/3
  Level 2  y=y+Ax      n^2           2n^2    2
  Level 3  C=C+AB      4n^2          2n^3    n/2

[Figure: memory hierarchy: Registers, L1 Cache, L2 Cache, Local Memory, Remote Memory, Secondary Memory]
BLAS for Performance
• Development of blocked algorithms important for performance
[Figure: IBM RS/6000-590 (66 MHz, 264 Mflop/s peak): Mflop/s (0 to 250) versus order of vectors/matrices (10 to 500) for Level 3, Level 2, and Level 1 BLAS]
BLAS for Performance
• Development of blocked algorithms important for performance
[Figure: Alpha EV 5/6 500 MHz (1 Gflop/s peak): Mflop/s (0 to 700) versus order of vectors/matrices (10 to 500) for Level 3, Level 2, and Level 1 BLAS]
BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2 (n-by-n matrix vector multiply) vs BLAS 1 (saxpy of n vectors)
Fast linear algebra kernels: BLAS
• Simple linear algebra kernels such as matrix-matrix multiply
• More complicated algorithms can be built from these basic kernels.
• The interfaces of these kernels have been standardized as the Basic Linear Algebra Subprograms (BLAS).
• Early agreement on standard interface (~1980)
• Led to portable libraries for vector and shared memory parallel machines.
• On distributed memory, there is a less-standard interface called the PBLAS
Level 1 BLAS
• Operate on vectors or pairs of vectors
  – perform O(n) operations;
  – return either a vector or a scalar.
• saxpy
  – y(i) = a * x(i) + y(i), for i=1 to n.
  – s stands for single precision, daxpy is for double precision, caxpy for complex, and zaxpy for double complex
• sscal: x = a * x, for scalar a and vector x
• sdot computes $s = \sum_{i=1}^{n} x(i)\,y(i)$
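To make the three Level 1 operations concrete, here are plain reference loops in double precision. They are illustrations only: the names ref_daxpy, ref_dscal, and ref_ddot are made up, the loops assume unit stride, and the real BLAS routines take additional increment arguments and are tuned per machine.

```c
#include <assert.h>

/* y = alpha*x + y  (axpy): 2n flops. */
void ref_daxpy(int n, double alpha, const double *x, double *y)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* x = alpha*x  (scal): n flops. */
void ref_dscal(int n, double alpha, double *x)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        x[i] *= alpha;
}

/* s = sum over i of x(i)*y(i)  (dot): 2n flops, returns a scalar. */
double ref_ddot(int n, const double *x, const double *y)
{
    assert(n >= 0);
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}
```

Each loop performs O(n) arithmetic on O(n) data, so there is essentially no reuse of anything brought in from slow memory.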
Level 2 BLAS
• Operate on a matrix and a vector;
  – return a matrix or a vector;
  – O(n^2) operations
• sgemv: matrix-vector multiply
  – y = y + A*x
  – where A is m-by-n, x is n-by-1 and y is m-by-1.
• sger: rank-one update
  – A = A + y*x^T, i.e., A(i,j) = A(i,j)+y(i)*x(j)
  – where A is m-by-n, y is m-by-1, x is n-by-1
• strsv: triangular solve
  – solves y=T*x for x, where T is triangular
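The matrix-vector multiply and the rank-one update can be sketched as reference loops. The assumptions are double precision, a row-major m-by-n matrix, and unit-stride vectors; ref_dgemv and ref_dger are illustrative names and do not reproduce the argument lists of the real BLAS routines.

```c
#include <assert.h>
#include <stddef.h>

/* y = y + A*x, with A m-by-n, x of length n, y of length m. */
void ref_dgemv(int m, int n, const double *a, const double *x, double *y)
{
    assert(m >= 0 && n >= 0);
    for (int i = 0; i < m; i++) {
        double t = y[i];
        for (int j = 0; j < n; j++)
            t += a[(size_t)i * n + j] * x[j];
        y[i] = t;
    }
}

/* A = A + y*x^T, i.e. A(i,j) = A(i,j) + y(i)*x(j). */
void ref_dger(int m, int n, const double *y, const double *x, double *a)
{
    assert(m >= 0 && n >= 0);
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            a[(size_t)i * n + j] += y[i] * x[j];
}
```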
Level 3 BLAS
• Operate on pairs or triples of matrices
  – returning a matrix;
  – complexity is O(n^3).
• sgemm: Matrix-matrix multiplication
  – C = C + A*B,
  – where C is m-by-n, A is m-by-k, and B is k-by-n
• strsm: multiple triangular solve
  – solves Y = T*X for X,
  – where T is a triangular matrix, and X is a rectangular matrix.
Optimizing in practice
• Tiling for registers
  – loop unrolling, use of named “register” variables
• Tiling for multiple levels of cache
• Exploiting fine-grained parallelism within the processor
  – super scalar
  – pipelining
• Complicated compiler interactions
• Hard to do by hand (but you’ll try)
• Automatic optimization an active research area
  – PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac
  – www.cs.berkeley.edu/~iyer/asci_slides.ps
  – ATLAS: www.netlib.org/atlas/index.html
BLAS -- References
• BLAS software and documentation can be obtained via:
  – WWW: http://www.netlib.org/blas,
  – (anonymous) ftp ftp.netlib.org: cd blas; get index
  – email [email protected] with the message: send index from blas
• Comments and questions can be addressed to: [email protected]
BLAS Papers
• C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, 5:308--325, 1979.
• J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, An Extended Set of Fortran Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 14(1):1--32, 1988.
• J. Dongarra, J. Du Croz, I. Duff, S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 16(1):1--17, 1990.
Performance of BLAS
• BLAS are specially optimized by the vendor
  – Sun BLAS uses features in the Ultrasparc
• Big payoff for algorithms that can be expressed in terms of the BLAS3 instead of BLAS2 or BLAS1.
• The top speed of the BLAS3
• Algorithms like Gaussian elimination organized so that they use BLAS3
How To Get Performance From Commodity Processors?
• Today’s processors can achieve high performance, but this requires extensive machine-specific hand tuning.
• Routines have a large design space with many parameters
  – blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
  – Complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors.
• A few months ago no tuned BLAS for Pentium for Linux.
• Need for quick/dynamic deployment of optimized routines.
• ATLAS - Automatically Tuned Linear Algebra Software
  – PhiPac from Berkeley
Adaptive Approach for Level 3
[Figure: C (M by N) = A (M by K) * B (K by N), with blocking factor NB]
• Do a parameter study of the operation on the target machine, done once.
• Only generated code is on-chip multiply
• BLAS operation written in terms of generated on-chip multiply
• All transpose cases coerced through data copy to 1 case of on-chip multiply
  – Only 1 case generated per platform
Code Generation Strategy
• Code is iteratively generated & timed until optimal case is found. We try:
  – Differing NBs
  – Breaking false dependencies
  – M, N and K loop unrolling
• On-chip multiply optimizes for:
  – TLB access
  – L1 cache reuse
  – FP unit usage
  – Memory fetch
  – Register reuse
  – Loop overhead minimization
• Takes a couple of hours to run.
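The "generate and time" idea can be illustrated with a toy search over block sizes only. This is not ATLAS's code generator: the functions gemm_nb and pick_block_size are hypothetical, a single fixed blocked kernel is timed with the standard clock() call, and each candidate is measured just once, so the result is noisy on a real machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

static int min_int(int x, int y) { return x < y ? x : y; }

/* Blocked C = C + A*B on n-by-n row-major matrices with block size nb. */
static void gemm_nb(int n, int nb, const double *a, const double *b, double *c)
{
    assert(n >= 0 && nb > 0);
    for (int ii = 0; ii < n; ii += nb)
        for (int jj = 0; jj < n; jj += nb)
            for (int kk = 0; kk < n; kk += nb)
                for (int i = ii; i < min_int(ii + nb, n); i++)
                    for (int k = kk; k < min_int(kk + nb, n); k++) {
                        double aik = a[(size_t)i * n + k];
                        for (int j = jj; j < min_int(jj + nb, n); j++)
                            c[(size_t)i * n + j] += aik * b[(size_t)k * n + j];
                    }
}

/* Time every candidate block size once and return the fastest one.
   Returns -1 if the work arrays cannot be allocated. */
int pick_block_size(int n, const int *cands, int ncands)
{
    assert(n > 0 && ncands > 0);
    size_t len = (size_t)n * (size_t)n;
    double *a = malloc(len * sizeof *a);
    double *b = malloc(len * sizeof *b);
    double *c = malloc(len * sizeof *c);
    int best = -1;
    if (a && b && c) {
        clock_t best_ticks = 0;
        for (size_t t = 0; t < len; t++) { a[t] = 1.0; b[t] = 2.0; }
        for (int s = 0; s < ncands; s++) {
            for (size_t t = 0; t < len; t++) c[t] = 0.0;
            clock_t t0 = clock();
            gemm_nb(n, cands[s], a, b, c);
            clock_t ticks = clock() - t0;
            if (best < 0 || ticks < best_ticks) {
                best = cands[s];
                best_ticks = ticks;
            }
        }
    }
    free(a);
    free(b);
    free(c);
    return best;
}
```

A real tuner would repeat each measurement, vary unrolling and other parameters as listed above, and generate a specialized kernel per candidate rather than reuse one generic loop nest.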
500x500 Double Precision Matrix-Matrix Multiply Across Multiple Architectures
[Figure: bar chart of Mflops (0 to 700) by system, comparing Vendor Matrix Multiply with ATLAS Matrix Multiply. Systems: DEC Alpha 21164a-433, HP PA8000 180MHz, HP 9000/735/125, IBM Power2-135, IBM PowerPC 604e-332, Pentium MMX-150, Pentium Pro-200, Pentium II-266, SGI R4600, SGI R5000, SGI R8000ip21, SGI R10000ip27, Sun Microsparc II Model 70, Sun Darwin-270, Sun Ultra2 Model 2200]
500 x 500 Double Precision LU Factorization Performance Across Multiple Architectures
[Figure: bar chart of MFLOPS (0 to 600) by system, comparing LU w/Vendor BLAS with LU w/ATLAS & GEMM-based BLAS. Systems: DCG LX 21164a-533, DEC Alpha 21164a-433, HP PA8000, IBM Power2-135, IBM PowerPC 604e-332, Pentium Pro-200, Pentium II-266, SGI R5000, SGI R10000ip27, Sun Darwin-270, Sun Ultra2 Model 2200]
500x500 gemm-based BLAS on SGI R10000ip28
[Figure: bar chart of MFLOPS (0 to 300) for DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, DTRSM, comparing Vendor BLAS, ATLAS/SSBLAS, and Reference BLAS]
500x500 gemm-based BLAS on UltraSparc 2200
[Figure: bar chart of MFLOPS (0 to 300) by Level 3 BLAS routine (DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, DTRSM), comparing Vendor BLAS, ATLAS/GEMM-based BLAS, and Reference BLAS]
Recursive Approach for Other Level 3 BLAS
• Recur down to L1 cache block size
• Need kernel at bottom of recursion
– Use gemm-based kernel for portability
[Figure: Recursive TRMM]
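The recursion described above can be sketched for one Level 3 routine. The code below is a minimal illustration, not the ATLAS implementation: it computes B := L*B in place for a lower-triangular L by splitting L into quadrants, recursing on the two triangular blocks, and handing the rectangular block to a gemm-style update. The names `trmm_recursive`, `gemm_update`, `trmm_small`, and the parameter `block` (standing in for the L1 cache block size) are hypothetical.

```python
def gemm_update(L, B, r0, r1, k0, k1):
    """Gemm-style kernel: B[r0:r1, :] += L[r0:r1, k0:k1] * B[k0:k1, :]."""
    ncols = len(B[0])
    for i in range(r0, r1):
        for k in range(k0, k1):
            lik = L[i][k]
            for j in range(ncols):
                B[i][j] += lik * B[k][j]


def trmm_small(L, B, lo, hi):
    """Direct in-place B[lo:hi, :] = L[lo:hi, lo:hi] * B[lo:hi, :], L lower triangular."""
    ncols = len(B[0])
    # Work bottom-up so every row read on the right-hand side is still the old value.
    for i in range(hi - 1, lo - 1, -1):
        for j in range(ncols):
            B[i][j] = sum(L[i][k] * B[k][j] for k in range(lo, i + 1))


def trmm_recursive(L, B, lo=0, hi=None, block=2):
    """In-place B := L * B, recursing until the triangle fits in `block` rows."""
    if hi is None:
        hi = len(L)
    if hi - lo <= block:
        trmm_small(L, B, lo, hi)          # kernel at the bottom of the recursion
        return
    mid = (lo + hi) // 2
    trmm_recursive(L, B, mid, hi, block)  # B2 := L22 * B2
    gemm_update(L, B, mid, hi, lo, mid)   # B2 += L21 * B1 (B1 not yet overwritten)
    trmm_recursive(L, B, lo, mid, block)  # B1 := L11 * B1
```

The order of the three steps matters: the lower block of B must be finished before the upper block is overwritten, because the gemm update reads the original upper block.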
500x500 Level 2 BLAS DGEMV
[Figure: bar chart of MFLOPS (0 to 300) by architecture, comparing Vendor NoTrans, ATLAS NoTrans, and F77 NoTrans]
Multi-Threaded DGEMM, Intel PIII 550 MHz
[Figure: Mflop/s vs. Size (tick labels 0 to 800) for Intel BLAS 1 proc, ATLAS 1 proc, Intel BLAS 2 proc, and ATLAS 2 proc]
ATLAS
• Keep a repository of kernels for specific machines.
• Develop a means of dynamically downloading code
• Extend work to allow sparse matrix operations
• Extend work to include arbitrary code segments
• See: http://www.netlib.org/atlas/
BLAS Technical Forum
http://www.netlib.org/utk/papers/blast-forum.html
• Established a Forum to consider expanding the BLAS in light of modern software, language, and hardware developments.
• Minutes available from each meeting
• Working proposals for the following:
– Dense/Band BLAS
– Sparse BLAS
– Extended Precision BLAS
– Distributed Memory BLAS
– C and Fortran90 interfaces to Legacy BLAS
Strassen’s Matrix Multiply
• The traditional algorithm (with or without tiling) has O(n^3) flops
• Strassen discovered an algorithm with asymptotically lower flops
– O(n^2.81)
• Consider a 2x2 matrix multiply, normally 8 multiplies; the scheme below uses 7

Let M = [m11 m12] = [a11 a12] * [b11 b12]
        [m21 m22]   [a21 a22]   [b21 b22]

Let p1 = (a12 - a22) * (b21 + b22)    p5 = a11 * (b12 - b22)
    p2 = (a11 + a22) * (b11 + b22)    p6 = a22 * (b21 - b11)
    p3 = (a11 - a21) * (b11 + b12)    p7 = (a21 + a22) * b11
    p4 = (a11 + a12) * b22

Then m11 = p1 + p2 - p4 + p6
     m12 = p4 + p5
     m21 = p6 + p7
     m22 = p2 - p3 + p5 - p7

Extends to nxn by divide&conquer
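As a rough sketch of the divide-and-conquer extension, the code below applies the seven products p1..p7 to matrix quadrants instead of scalars. It is an illustration under simple assumptions (square matrices stored as lists of lists, fallback to the ordinary triple loop when the order is odd or at most `cutoff`), not a tuned implementation; the helper names are hypothetical.

```python
def mat_add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def naive_mul(A, B):
    """Ordinary O(n^3) product, used as the base case."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def quadrants(M):
    """Split a square matrix of even order into its four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, cutoff=1):
    n = len(A)
    if n <= cutoff or n % 2 != 0:
        return naive_mul(A, B)  # small or odd order: no further splitting
    a11, a12, a21, a22 = quadrants(A)
    b11, b12, b21, b22 = quadrants(B)
    # The seven block products, numbered as on the slide.
    p1 = strassen(mat_sub(a12, a22), mat_add(b21, b22), cutoff)
    p2 = strassen(mat_add(a11, a22), mat_add(b11, b22), cutoff)
    p3 = strassen(mat_sub(a11, a21), mat_add(b11, b12), cutoff)
    p4 = strassen(mat_add(a11, a12), b22, cutoff)
    p5 = strassen(a11, mat_sub(b12, b22), cutoff)
    p6 = strassen(a22, mat_sub(b21, b11), cutoff)
    p7 = strassen(mat_add(a21, a22), b11, cutoff)
    m11 = mat_add(mat_sub(mat_add(p1, p2), p4), p6)
    m12 = mat_add(p4, p5)
    m21 = mat_add(p6, p7)
    m22 = mat_sub(mat_add(mat_sub(p2, p3), p5), p7)
    top = [left + right for left, right in zip(m11, m12)]
    bottom = [left + right for left, right in zip(m21, m22)]
    return top + bottom
```

In practice `cutoff` would be set well above 1 so that small blocks go to a fast conventional kernel.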
Strassen (continued)
T(n) = Cost of multiplying nxn matrices
     = 7*T(n/2) + 18*(n/2)^2 = O(n^(log_2 7)) = O(n^2.81)
° Available in several libraries
° Up to several times faster if n large enough (100s)
° Needs more memory than standard algorithm
° Can be less accurate because of roundoff error
° Current world’s record is O(n^2.376...)
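To make the recurrence concrete, the sketch below evaluates T(n) = 7*T(n/2) + 18*(n/2)^2 directly and compares it with a classical flop count. The classical count 2n^3 - n^2 (n^3 multiplies plus n^3 - n^2 adds), the base case, and the `cutoff` parameter are assumptions made for illustration; n is assumed to be a power of two, and only arithmetic is counted, not memory traffic.

```python
def classical_flops(n):
    """Flops for the ordinary triple loop: n^3 multiplies and n^3 - n^2 adds."""
    return 2 * n**3 - n**2


def strassen_flops(n, cutoff=1):
    """Evaluate T(n) = 7*T(n/2) + 18*(n/2)^2, switching to the classical
    count once the order drops to `cutoff` (n assumed a power of two)."""
    if n <= cutoff:
        return classical_flops(n)
    half = n // 2
    return 7 * strassen_flops(half, cutoff) + 18 * half**2
```

Under these assumptions, recursing all the way down to 1x1 beats the classical count only from n = 1024 on (among powers of two), while stopping the recursion at a moderate cutoff already helps for much smaller n. That is consistent with the slide's remark that n must be large enough.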
Summary
• Performance programming on uniprocessors requires
– understanding of memory system
» levels, costs, sizes
– understanding of fine-grained parallelism in processor to produce good instruction mix
• Blocking (tiling) is a basic approach that can be applied to many matrix algorithms
• Applies to uniprocessors and parallel processors
– The technique works for any architecture, but choosing the blocksize b and other details depends on the architecture
• Similar techniques are possible on other data structures
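As a reminder of what blocking looks like in code, here is a minimal sketch of a tiled matrix multiply. The function name `blocked_matmul` is hypothetical, and the default blocksize `b=2` is only a placeholder: as the summary says, a good b depends on the architecture (roughly, three b x b tiles should fit in the cache level being targeted).

```python
def blocked_matmul(A, B, b=2):
    """C = A * B for square n x n matrices, computed tile by tile."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, b):              # tile row of C
        for jj in range(0, n, b):          # tile column of C
            for kk in range(0, n, b):      # tile along the inner dimension
                # Multiply one b x b tile of A by one b x b tile of B
                # and accumulate into the matching tile of C.
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + b, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

The `min(...)` bounds let the routine handle orders that are not a multiple of b. The arithmetic is identical to the unblocked triple loop; only the order in which memory is touched changes.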
Summary: Memory Hierarchy
• Virtual memory was controversial at the time: can SW automatically manage 64KB across many programs?
– 1000X DRAM growth removed the controversy
• Today VM allows many processes to share single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy
• Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms?
BLAS                   Memory Refs   Flops   Flops/Memory Refs
Level 1: y = y + αx    3n            2n      2/3
Level 2: y = y + Ax    n^2           2n^2    2
Level 3: C = C + AB    4n^2          2n^3    n/2
Performance = Effective Use of Memory Hierarchy
• Can only do arithmetic on data at the top of the hierarchy
• Higher level BLAS lets us do this
• Development of blocked algorithms important for performance
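The ratios in the table can be checked with a few lines of code. This is only a sketch that plugs n into the memory-reference and flop counts listed above; the function name `blas_flops_per_ref` is hypothetical.

```python
def blas_flops_per_ref(level, n):
    """Flops per memory reference for the three BLAS levels, using the
    counts from the table above (vectors of length n, n x n matrices)."""
    if level == 1:      # vector update
        refs, flops = 3 * n, 2 * n
    elif level == 2:    # matrix-vector product
        refs, flops = n * n, 2 * n * n
    elif level == 3:    # matrix-matrix product
        refs, flops = 4 * n * n, 2 * n**3
    else:
        raise ValueError("level must be 1, 2, or 3")
    return flops / refs
```

Only the Level 3 ratio grows with n, which is why Level 3 routines can keep data near the top of the hierarchy busy while Level 1 and Level 2 routines stay limited by memory bandwidth.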
[Figure: Level 1, 2 & 3 BLAS on Intel PII 450MHz. Mflop/s (0 to 400) vs. order of vectors/matrices (10 to 500)]
Engineering: SUN Enterprise
• Proc + mem card - I/O card
– 16 cards of either type
– All memory accessed over bus, so symmetric
– Higher bandwidth, higher latency bus
[Diagram: CPU/mem cards (two processors P, each with $ and $2 caches, a Mem ctrl, and a Bus interface/switch) and I/O cards (Bus interface, three SBUS, 2 FiberChannel, 100bT, SCSI) attached to the Gigaplane bus (256 data, 41 address, 83 MHz)]
Engineering: Cray T3E
– Scale up to 1024 processors, 480MB/s links
– Memory controller generates request message for non-local references
– No hardware mechanism for coherence
» SGI Origin etc. provide this
[Diagram: T3E node with P and $, Mem, Mem ctrl and NI, a Switch with X, Y, Z links, and External I/O]
Evolution of Message-Passing Machines
• Early machines: FIFO on each link
– HW close to prog. model
– synchronous ops
– topology central (hypercube algorithms)
[Figure: hypercube with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111]
CalTech Cosmic Cube (Seitz, CACM Jan 85)
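To show why topology was central on these machines, the sketch below works with the binary node labels of a hypercube: two nodes are neighbors when their labels differ in exactly one bit, and the hop count between two nodes is the number of differing bits. The function names are hypothetical, and the dimension-ordered (e-cube) route is a standard textbook scheme used here for illustration, not something stated on the slide.

```python
def hypercube_neighbors(node, dim):
    """Labels of the `dim` neighbors of `node`: flip one bit at a time."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hypercube_hops(src, dst):
    """Minimum hop count: the number of bit positions where the labels differ."""
    return bin(src ^ dst).count("1")


def ecube_route(src, dst, dim):
    """Dimension-ordered route: correct differing bits from lowest to highest."""
    path = [src]
    current = src
    for bit in range(dim):
        if (current ^ dst) & (1 << bit):
            current ^= 1 << bit
            path.append(current)
    return path
```

For the 3-cube in the figure, `hypercube_neighbors(0b000, 3)` yields nodes 001, 010, and 100, and no pair of nodes is more than 3 hops apart.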
Diminishing Role of Topology
• Shift to general links
– DMA, enabling non-blocking ops
» Buffered by system at destination until recv
– Store&forward routing
• Diminishing role of topology
– Any-to-any pipelined routing
– node-network interface dominates communication time
– Simplifies programming
– Allows richer design space
» grids vs hypercubes
Store&forward vs pipelined message time (H hops, n bytes, bandwidth B, per-hop delay Δ):
H x (T0 + n/B) vs T0 + H*Δ + n/B
Intel iPSC/1 -> iPSC/2 -> iPSC/860
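The two expressions on this slide can be compared numerically. The sketch below reads H as the number of hops, T0 as a fixed per-message overhead, n as the message size in bytes, and B as the link bandwidth; that reading is an assumption, as is the `delta` parameter, a hypothetical per-hop delay whose default of 1 reproduces the H term exactly as written.

```python
def store_and_forward_time(hops, t0, nbytes, bandwidth):
    """Each hop receives the whole message before forwarding: H x (T0 + n/B)."""
    return hops * (t0 + nbytes / bandwidth)


def pipelined_time(hops, t0, nbytes, bandwidth, delta=1.0):
    """Pipelined routing: overhead and transfer time are paid once, plus a
    small per-hop term (H scaled by the assumed per-hop delay `delta`)."""
    return t0 + hops * delta + nbytes / bandwidth
```

For example, with 4 hops, a 10 microsecond overhead, a 1000-byte message, 100 MB/s links, and an assumed 300 ns per hop, the pipelined estimate is a little over a quarter of the store-and-forward one. The hop count barely matters once routing is pipelined, which is the sense in which topology plays a diminishing role.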
Example Intel Paragon
[Diagram: Intel Paragon node. Two i860 processors, each with an L1 $, plus NI, DMA, Driver, and Mem ctrl on the memory bus (64-bit, 50 MHz), with 4-way interleaved DRAM]
2D grid network with processing node attached to every switch; links are 8 bits, 175 MHz, bidirectional
[Photo: Sandia’s Intel Paragon XP/S-based Supercomputer]
Building on the mainstream: IBM SP-2
• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus (bw limited by I/O bus)
[Diagram: IBM SP-2 node. Power 2 CPU with L2 $ on the memory bus, with a memory controller and 4-way interleaved DRAM; a MicroChannel bus carries I/O, DMA, and the NIC (i860, NI, DRAM); nodes connect through a general interconnection network formed from 8-port switches]
Berkeley NOW
• 100 Sun Ultra2 workstations
• Intelligent network interface
– proc + mem
• Myrinet Network
– 160 MB/s per link
– 300 ns per hop
Thanks
• These slides came in part from courses taught by the following people:
– Kathy Yelick, UC, Berkeley
– Dave Patterson, UC, Berkeley
– Randy Katz, UC, Berkeley
– Craig Douglas, U of Kentucky
• Computer Architecture: A Quantitative Approach, Chapter 8, Hennessy and Patterson, Morgan Kaufmann Pub.