University of Amsterdam
Computer Systems – the impact of caches Arnoud Visser 1
Computer Systems
the impact of caches
Introduction
Different sorts of memory (approximate access time in CPU cycles):
• On-die: 0 / 1 / 10 cycles
• On-board: 100
• On-disk: 10,000
• Off-machine: 1,000,000
The CPU-Memory Gap
• The gap between disk, DRAM, SRAM and CPU speeds keeps increasing.
[Figure: time in ns (log scale, 1 to 100,000,000) versus year (1980 to 2000) for disk seek time, DRAM access time, SRAM access time and CPU cycle time]
Storage trends: bigger, not faster
(Culled from back issues of Byte and PC Magazine)
DRAM
metric              1980    1985    1990    1995    2000    2000:1980
$/MB                8,000   880     100     30      1       8,000
access (ns)         375     200     100     70      60      6
typical size (MB)   0.064   0.256   4       16      64      1,000

Disk
metric              1980    1985    1990    1995    2000    2000:1980
$/MB                500     100     8       0.30    0.05    10,000
access (ms)         87      75      28      10      8       11
typical size (MB)   1       10      160     1,000   9,000   9,000
SRAM
metric              1980     1985    1990    1995    2000    2000:1980
$/MB                19,200   2,900   320     256     100     190
access (ns)         300      150     35      15      2       100
typical size (MB)   0.008    0.016   0.032

Processor trends: faster
                    1980     1985    1990    1995     2000    2000:1980
processor           8080     286     386     Pentium  P-III
clock rate (MHz)    1        6       20      150      750     750
cycle time (ns)     1,000    166     50      6        1.6     750
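Dividing an access time from these tables by the CPU cycle time of the same year expresses the CPU-memory gap in cycles. A minimal C sketch using the 1980 and 2000 columns; the macro and function names are illustrative and not part of the slides:

```c
/* Figures copied from the DRAM, disk and processor tables above. */
#define DRAM_ACCESS_NS_1980 375.0
#define DRAM_ACCESS_NS_2000 60.0
#define DISK_ACCESS_NS_2000 8.0e6   /* 8 ms expressed in ns */
#define CPU_CYCLE_NS_1980   1000.0
#define CPU_CYCLE_NS_2000   1.6

/* Cost of one access expressed in CPU cycles of the same year. */
double access_in_cycles(double access_ns, double cycle_ns) {
    return access_ns / cycle_ns;
}
```

With these figures a DRAM access costs 0.375 cycles in 1980 but about 37.5 cycles in 2000, and a disk access costs about five million cycles in 2000.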
Intel processors: cache (SRAM)
processor                 year        L1 (instruction / data)   L2
486                       1989-1994   8K                        -
Pentium                   1993        8K / 8K                   -
Pentium Pro               1995-1999   8K / 8K                   256K-1M
Pentium II                1997        16K / 16K                 512K (at ½ core clock)
Celeron A                 1998        16K / 16K                 128K
Pentium III Coppermine    2000        16K / 16K                 256K
Pentium 4 Willamette      2000        12K / 8K                  256K
Pentium 4 Northwood       2002        12K / 8K                  512K
http://www.intel.com/pressroom/kits/quickreffam.htm
Memory Hierarchy
Smaller, faster, and costlier (per byte) storage devices at the top:
• L0: Registers. CPU registers hold words retrieved from cache memory.
• L1: On-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache.
• L2: Off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from memory.
• L3: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
• L4: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
• L5: Remote secondary storage (distributed file systems, Web servers).
Larger, slower, and cheaper (per byte) storage devices at the bottom.
Pay the price
• To access large amounts of data in a cost-effective manner, the bulk of the data must be stored on disk
– SRAM: 4 MB for ~$500
– DRAM: 1 GB for ~$200
– Disk: 80 GB for ~$110
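Dividing these approximate prices by capacity shows why the bulk of the data ends up on disk. A small C sketch, assuming 1 GB = 1024 MB; the names are illustrative only:

```c
/* Approximate prices from this slide; 1 GB is taken as 1024 MB. */
#define SRAM_PRICE 500.0            /* dollars for 4 MB */
#define SRAM_MB    4.0
#define DRAM_PRICE 200.0            /* dollars for 1 GB */
#define DRAM_MB    1024.0
#define DISK_PRICE 110.0            /* dollars for 80 GB */
#define DISK_MB    (80.0 * 1024.0)

/* Cost per megabyte of a storage device. */
double dollars_per_mb(double price_dollars, double size_mb) {
    return price_dollars / size_mb;
}
```

This gives about $125/MB for SRAM, about $0.20/MB for DRAM and about $0.0013/MB for disk.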
Locality
• Principle of locality:
– Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
– Temporal locality: Recently referenced items are likely to be referenced in the near future.
– Spatial locality: Items with nearby addresses tend to be referenced close together in time.
Locality Example
    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

• Data
– Reference array elements in succession (stride-1 reference pattern): spatial locality
– Reference sum each iteration: temporal locality
• Instructions
– Reference instructions in sequence: spatial locality
– Cycle through loop repeatedly: temporal locality
Power Programmer
• Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.
• Does this function have good locality?

    /* M and N are compile-time constants */
    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }
Stride-N example
• Question: Does this function have good locality?

    /* M and N are compile-time constants */
    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }
Matrix M=2, N=3 (row-major layout, 4-byte int)

sumarrayrows():
Address        0    4    8    12   16   20
Contents       a00  a01  a02  a10  a11  a12
Access order   1    2    3    4    5    6

sumarraycols():
Address        0    4    8    12   16   20
Contents       a00  a01  a02  a10  a11  a12
Access order   1    3    5    2    4    6
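The two access-order tables can be reproduced by taking the address of each element of a real C array in the order each loop visits it. A sketch, assuming M = 2 and N = 3 as on the slide; the helper names are hypothetical:

```c
#include <stddef.h>

#define M 2
#define N 3

static int a[M][N];  /* contents do not matter, only the memory layout */

/* Byte offset of a[i][j] from the start of the array (C is row-major). */
static size_t offset_of(size_t i, size_t j) {
    return (size_t)((const char *)&a[i][j] - (const char *)&a[0][0]);
}

/* Offsets in the order sumarrayrows() visits the elements: stride 1. */
void row_order_offsets(size_t out[M * N]) {
    size_t k = 0;
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++)
            out[k++] = offset_of(i, j);
}

/* Offsets in the order sumarraycols() visits the elements: stride N. */
void col_order_offsets(size_t out[M * N]) {
    size_t k = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < M; i++)
            out[k++] = offset_of(i, j);
}
```

With 4-byte ints, row_order_offsets yields 0, 4, 8, 12, 16, 20 and col_order_offsets yields 0, 12, 4, 16, 8, 20: consecutive accesses in the column loop are N * 4 = 12 bytes apart.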
Expect: stride-1 is better! Working set: 32 bytes
[Figure: throughput (MB/s) versus stride (1 to 16 words) for int A[2][4]]
Reality: small matrices fit in cache. Working set: 4 KB
[Figure: throughput (MB/s) versus stride (1 to 16 words) for int A[32][32]]
Reality: the performance drop from L1 to L2 cache is not dramatic. Working set: 128 KB
[Figure: throughput (MB/s) versus stride (1 to 16 words) for int A[180][180]]
Reality: only when DRAM is accessed can the penalty be seen. Working set: 1 MB
[Figure: throughput (MB/s) versus stride (1 to 16 words) for int A[512][512]]
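The slides do not show the benchmark behind these curves. One plausible kernel, offered only as a hypothetical sketch, reads all n elements of an array in stride order: stride 1 is the sumarrayrows() pattern, and stride N on an N-column matrix is the sumarraycols() pattern.

```c
#include <stddef.h>

/* Hypothetical kernel: sum all n elements of a, visiting them with a fixed
   stride (in words). Every element is read exactly once, so the result does
   not depend on the stride; only the access order does. */
long sum_with_stride(const int *a, size_t n, size_t stride) {
    long sum = 0;
    if (stride == 0)
        stride = 1;  /* treat stride 0 as sequential access */
    for (size_t start = 0; start < stride && start < n; start++)
        for (size_t i = start; i < n; i += stride)
            sum += a[i];
    return sum;
}
```

Because the same elements are summed for every stride, any difference in measured throughput comes from the access order and hence from cache behavior.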
Memory Mountain
[Figure: read throughput (MB/s, 0 to 5000) as a function of stride (s1 to s15 words) and working set size (2 KB to 8 MB), measured on a Pentium 4 at 2.4 GHz with an 8 KB L1 d-cache, a 12 KB L1 i-cache and a 512 KB L2 cache. Ridges of temporal locality mark the L1, L2 and main-memory (Mem) regions; slopes of spatial locality run along the stride axis.]
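A memory mountain is built by timing a strided read over working sets of different sizes. A rough C sketch of a single measurement point follows. It uses standard clock() as the timer, which is coarser than a cycle counter, and `long` as a stand-in for a word; both choices and the function names are assumptions, not the setup used for the figure.

```c
#include <stddef.h>
#include <time.h>

/* Read every stride-th element of the first elems elements of data. */
static long read_with_stride(const long *data, size_t elems, size_t stride) {
    long acc = 0;
    for (size_t i = 0; i < elems; i += stride)
        acc += data[i];
    return acc;
}

/* Rough read throughput in MB/s for one (size, stride) point.
   size is the working set in bytes; data must hold at least size bytes. */
double mountain_point(const long *data, size_t size, size_t stride) {
    size_t elems = size / sizeof(long);
    if (elems == 0 || stride == 0)
        return 0.0;
    /* One untimed pass warms the caches; volatile keeps the reads alive. */
    volatile long sink = read_with_stride(data, elems, stride);
    long reps = 0;
    clock_t start = clock();
    clock_t elapsed;
    do {  /* repeat until at least ~50 ms of CPU time has passed */
        sink += read_with_stride(data, elems, stride);
        reps++;
        elapsed = clock() - start;
    } while (elapsed < CLOCKS_PER_SEC / 20);
    double seconds = (double)elapsed / CLOCKS_PER_SEC;
    double reads_per_pass = (double)((elems + stride - 1) / stride);
    double bytes_read = (double)reps * reads_per_pass * (double)sizeof(long);
    return bytes_read / seconds / 1e6;
}
```

Sweeping size from 2 KB to 8 MB and stride from 1 to 15 words, with data large enough for the biggest size, yields one mountain.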
Summary
• As long as your data fits in the cache and your program shows good locality, you can expect good performance.
Assignment
• Practice Problem 6.9 (p. 624): 'Order the three functions by the spatial locality enjoyed by each.'
• Practice Problem 6.22 (p. 659): 'Estimate the time, in CPU cycles, to read an 8-byte word from the different L1-d of an i7 processor