CMPUT229 - Fall 2003. Topic D: The Memory Hierarchy. José Nelson Amaral.
Bryant, Randal E., O’Hallaron, David, Computer Systems: A Programmer’s Perspective, Prentice Hall, 2003. (B&H). Reading Assignment: Chapter 6: The Memory Hierarchy.
CMPUT 229 - Computer Organization and Architecture I
CMPUT229 - Fall 2003
Topic D: The Memory Hierarchy
José Nelson Amaral
Reading Assignment
Bryant, Randal E., O’Hallaron, David, Computer Systems: A Programmer’s Perspective, Prentice Hall, 2003. (B&H)
Chapter 6: The Memory Hierarchy
Types of Memories
Read/Write Memory (RWM): we can store and retrieve data.
Random Access Memory (RAM): the time required to read or write a bit of memory is independent of the bit’s location.
Static Random Access Memory (SRAM): once a word is written to a location, it remains stored as long as power is applied to the chip, unless the location is written again.
Dynamic Random Access Memory (DRAM): the data stored at each location must be refreshed periodically by reading it and then writing it back again, or else it disappears.
[Figure, repeated over four slides: internal structure of an 8 x 4 SRAM. A 3-to-8 decoder driven by address inputs A2-A0 selects one of eight rows (0 to 7) of four memory cells; each cell has IN, OUT, SEL, and WR pins. Data inputs are DIN3-DIN0 and data outputs are DOUT3-DOUT0. Signals shown: WE_L, CS_L, OE_L, WR_L, and IOE_L.]
Refreshing the Memory
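[Figure: capacitor voltage Vcap (from 0 V to VCC, with the HIGH and LOW levels marked) versus time, showing a stored 0, a written 1, and the subsequent refreshes.]
The solution is to periodically refresh the memory cells by reading and writing back each one of them.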
SRAM with Bi-directional Data Bus
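[Figure: a row of four SRAM cells (IN, OUT, SEL, WR) whose inputs and outputs share the bi-directional data pins DIO3-DIO0, which connect to the microprocessor. Signals shown: WE_L, CS_L, OE_L, WR_L, and IOE_L.]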
DRAM High Level View
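[Figure: a DRAM chip organized as a 4 x 4 array of supercells (rows 0 to 3, columns 0 to 3) with an internal row buffer. The memory controller, which connects to the CPU, drives a 2-bit addr bus to the chip and exchanges data over an 8-bit data bus.]
Bryant/O’Hallaron, p. 459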
DRAM RAS Request
[Figure: the memory controller sends RAS = 2 on the 2-bit addr bus; the DRAM chip copies row 2 of the 4 x 4 supercell array into its internal row buffer.]
RAS = Row Address Strobe
Bryant/O’Hallaron, p. 460
DRAM CAS Request
[Figure: the memory controller sends CAS = 1 on the 2-bit addr bus; the DRAM chip returns supercell (2,1) from the internal row buffer over the 8-bit data bus.]
CAS = Column Address Strobe
Bryant/O’Hallaron, p. 460
Memory Modules
[Figure: a 64 MB memory module consisting of eight 8M x 8 DRAMs. The memory controller sends addr (row = i, col = j) to all eight chips. Supercell (i,j) of DRAM 0 supplies bits 0-7, DRAM 1 supplies bits 8-15, and so on up to DRAM 7, which supplies bits 56-63, of the 64-bit doubleword at main memory address A. The memory controller sends the 64-bit doubleword to the CPU chip.]
Bryant/O’Hallaron, p. 461
Read Cycle on an Asynchronous DRAM
Step 1: Apply the row address.
Step 2: RAS goes from high to low and remains low.
Step 3: Apply the column address.
Step 4: WE must be high.
Step 5: CAS goes from high to low and remains low.
Step 6: OE goes low.
Step 7: Data appears.
Step 8: RAS and CAS return to high.
Improved DRAMs
Central Idea: Each read to a DRAM actually reads a complete row of bits, or word line, from the DRAM core into an array of sense amps.
A traditional asynchronous DRAM interface then selects a small number of these bits to be delivered to the cache/microprocessor.
All the other bits already extracted from the DRAM cells into the sense amps are wasted.
Fast Page Mode DRAMs
In a DRAM with Fast Page Mode, a page is defined as all memory addresses that have the same row address.
To read in fast page mode, all the steps from 1 to 7 of a standard read cycle are performed.
Then OE and CAS are switched high, but RAS remains low.
Then steps 3 to 7 (providing a new column address, asserting CAS and OE) are performed for each new memory location to be read.
A Fast Page Mode Read Cycle on an Asynchronous DRAM
Extended Data Output RAMs (EDO-RAM)
The process to read multiple locations in an EDO-RAM is very similar to Fast Page Mode.
The difference is that the output drivers are not disabled when CAS goes high.
This distinction allows the data from the current read cycle to be present at the outputs while the next cycle begins.
As a result, faster read cycle times are allowed.
An Extended Data Output Read Cycle on an Asynchronous DRAM
Synchronous DRAMs (SDRAM)
A Synchronous DRAM (SDRAM) has a clock input. It operates in a similar fashion to fast page mode and EDO DRAM. However, consecutive data is output synchronously on the falling/rising edge of the clock, instead of on command by CAS.
How many data elements will be output (the length of the burst) is programmable up to the maximum size of the row.
The clock period of an SDRAM is typically one order of magnitude shorter than the access time for individual accesses.
DDR SDRAM
A Double Data Rate (DDR) SDRAM is an SDRAM that allows data transfers on both the rising and the falling edge of the clock.
Thus the effective data transfer rate of a DDR SDRAM is two times the data transfer rate of a standard SDRAM with the same clock frequency.
The Rambus DRAM (RDRAM)
Multiple memory arrays (banks).
Rambus DRAMs are synchronous and transfer data on both edges of the clock.
SDRAM Memory Systems
Complex circuits for RAS/CAS/OE.
Each DIMM is connected in parallel with the memory controller (DIMM = Dual In-line Memory Module).
Often requires buffering.
Needs the whole clock cycle to establish valid data.
Making the bus wider is mechanically complicated.
RDRAM Memory Systems
Bus Structure
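[Figure: the CPU chip (register file, ALU, bus interface) connects through the system bus to the I/O bridge. The I/O bridge connects through the memory bus to main memory and through the I/O bus to the USB controller (mouse, keyboard), the graphics adapter (monitor), the disk controller (disk), and expansion slots for other devices such as network adapters.]
Bryant/O’Hallaron, p. 472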
DMA Request
[Figure: the bus structure diagram (CPU chip, I/O bridge, main memory, and the USB controller, graphics adapter, and disk controller on the I/O bus), illustrating the DMA request.]
DMA = Direct Memory Access
Bryant/O’Hallaron, p. 473
DMA Transfer
[Figure: the bus structure diagram (CPU chip, I/O bridge, main memory, and the USB controller, graphics adapter, and disk controller on the I/O bus), illustrating the DMA transfer.]
DMA = Direct Memory Access
Bryant/O’Hallaron, p. 473
DMA Completion Notification
[Figure: the bus structure diagram (CPU chip, I/O bridge, main memory, and the USB controller, graphics adapter, and disk controller on the I/O bus), illustrating the completion notification, which is labeled Interrupt.]
DMA = Direct Memory Access
Bryant/O’Hallaron, p. 474
Locality
We say that a computer program exhibits good locality if the program tends to reference data that is near other recently referenced data, or data that has itself been referenced recently.
Because a program might do one of these things but not the other, the principle of locality is separated into two flavors:
Temporal locality: a memory location that is referenced once is likely to be referenced multiple times in the near future.
Spatial locality: if a memory location is referenced once, then nearby locations are likely to be referenced in the near future.
Bryant/O’Hallaron, p. 478
Examples
In the Sampler function below, RandInt returns a randomly selected integer within the specified interval. Which program has better locality?

    int SumVec(int v[], int N)
    {
        int i;
        int sum = 0;

        for (i = 0; i < N; i = i + 1)
            sum += v[i];
        return sum;
    }

    int RandInt(int lo, int hi);   /* assumed to be defined elsewhere */

    int Sampler(int v[], int N, int K)
    {
        int i, j;
        int sum = 0;

        for (i = 0; i < K; i = i + 1)
        {
            j = RandInt(0, N-1);
            sum += v[j];
        }
        return sum/K;
    }

Bryant/O’Hallaron, p. 479
Memory Hierarchy
L0: Registers. CPU registers hold words retrieved from cache memory.
L1: On-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache.
L2: Off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from memory.
L3: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L4: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
L5: Remote secondary storage (distributed file systems, Web servers).
Toward L0: smaller, faster, and costlier (per byte) storage devices.
Toward L5: larger, slower, and cheaper (per byte) storage devices.
Bryant/O’Hallaron, p. 483
Caching Principle
Level k: the smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (in the figure, blocks 4, 9, 14, and 3).
Level k+1: the larger, slower, cheaper storage device at level k+1 is partitioned into blocks (in the figure, blocks 0 to 15).
Data is copied between levels in block-sized transfer units.
Bryant/O’Hallaron, p. 484
Cache Misses
Cold misses, or compulsory misses, occur the first time that a data item is referenced.
Conflict misses occur when two memory references have to occupy the same cache line. They can occur even when the remainder of the cache is not in use.
Capacity misses occur when there are no more free lines in the cache.
L1 and L2 Bus System
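[Figure: the CPU chip contains the register file, the ALU, the L1 cache, and the bus interface. The cache bus connects the bus interface to the L2 cache, the system bus connects it to the I/O bridge, and the memory bus connects the I/O bridge to main memory.]
Bryant/O’Hallaron, p. 488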
Cache Organization
A cache is organized as S = 2^s sets (set 0 to set S-1).
Each set holds E lines.
Each line holds 1 valid bit, t tag bits, and a cache block of B = 2^b bytes (bytes 0 to B-1).
Cache size: C = B x E x S data bytes.
Bryant/O’Hallaron, p. 488
Address Partition
An m-bit address (bit m-1 down to bit 0) is divided into three fields, from most significant to least significant:
Tag (t bits): compared with tags in the cache to find a match.
Set index (s bits): used to find the set where the data might be found in the cache.
Block offset (b bits): selects which word, inside the block, is referenced.
Bryant/O’Hallaron, p. 488
Multi-Level Cache Organization
[Figure: the CPU with its registers (Regs), an L1 i-cache, and an L1 d-cache, followed by the unified L2 cache, main memory, and disk.]
Bryant/O’Hallaron, p. 504
Writing Cache-Conscious Programs
Problem: Write C code for a function that computes the sum of the elements of a two-dimensional array, a[M][N], of integers.

    /* int a[][] is not valid C. The dimensions are passed first so that
       the array parameter can be declared as a[M][N] (C99). */
    int SumArray(int M, int N, int a[M][N])

    /* Row-wise traversal: the inner loop walks along a row. */
    int SumArrayRows(int M, int N, int a[M][N])
    {
        int i, j;
        int sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Column-wise traversal: the inner loop walks down a column. */
    int SumArrayCols(int M, int N, int a[M][N])
    {
        int i, j;
        int sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Bryant/O’Hallaron, p. 508
SumArrayRows Data Access Order
[Figure, repeated over four slides together with the SumArrayRows code: memory layout of the array a. The elements are stored in row-major order (a[0][0], a[0][1], ..., a[0][5], a[1][0], ..., a[3][4], ...) in consecutive 4-byte words at addresses 0x8000 4000, 0x8000 4004, ..., 0x8000 4058, ...]
Bryant/O’Hallaron, p. 508
SumArrayCols Data Access Order
[Figure, repeated over four slides together with the SumArrayCols code: memory layout of the array a. The elements are stored in row-major order (a[0][0], a[0][1], ..., a[0][5], a[1][0], ..., a[3][4], ...) in consecutive 4-byte words at addresses 0x8000 4000, 0x8000 4004, ..., 0x8000 4058, ...]
Bryant/O’Hallaron, p. 508
Read Bandwidth
The rate at which a program reads data from the memory system is called the read throughput or the read bandwidth.
The read throughput of a program depends on the memory hierarchy level from which the data is retrieved.
The read throughput is measured in bytes per second, or more commonly in Mbytes/s.
We can write a program that forces the data to come from each of the various levels of the hierarchy in order to estimate the read throughput of that level.
Measuring Read Bandwidth
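    #define MAXELEMS (1 << 21)   /* assumption: 2M ints (8 MB with 4-byte ints) */
    int data[MAXELEMS];          /* the array to be traversed */

    /* Read elems/stride elements of data, stride elements apart. */
    void test(int elems, int stride)
    {
        int i;
        int result = 0;
        volatile int sink;

        for (i = 0; i < elems; i += stride)
            result += data[i];
        sink = result;   /* to prevent compiler from optimizing away the loop */
    }

Bryant/O’Hallaron, p. 508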
Pentium III Xeon Memory Mountain
[Figure: the memory mountain, a surface plot of read throughput (MB/s, 0 to 1200) against stride (words, s1 to s15) and working set size (bytes, 2k to 8m). Labeled features: ridges of temporal locality (L1, L2, Mem) and slopes of spatial locality.]
Machine: Pentium III Xeon, 550 MHz, 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, 512 KB off-chip unified L2 cache.
Bryant/O’Hallaron, p. 514
Temporal Locality (stride = 1)
[Figure: read throughput (MB/s, 0 to 1200) versus working set size (bytes, from 8m down to 1k), with the main memory region, the L2 cache region, and the L1 cache region marked.]
Spatial Locality Slope (size = 256 KB)
[Figure: read throughput (MB/s, 0 to 800) versus stride (words, s1 to s16). Annotation: one access per cache line.]