EECC722 - Shaaban, #1 lec # 12, Fall 2001, 10-29-2001
Computer System Components
SDRAM (PC100/PC133): 100-133 MHz, 64-128 bits wide, 2-way interleaved, ~900 MB/sec (64-bit)
Double Data Rate (DDR) SDRAM (PC2100): 266 MHz, 64-128 bits wide, 4-way interleaved, ~2.1 GB/sec (64-bit)
RAMbus DRAM (RDRAM): 400-800 MHz, 16 bits wide, ~1.6 GB/sec
[Figure: computer system components]
• CPU core: 500 MHz - 2.0 GHz, 4-way superscalar RISC or RISC-core (x86); deep instruction pipelines, dynamic scheduling, multiple FP and integer FUs, dynamic branch prediction, hardware speculation
• Caches (L1, L2, L3), all non-blocking: L1 16-64K, 1-2 way set associative (on chip), separate or unified; L2 128K-1M, 4-16 way set associative (on chip), unified; L3 1-16M, 8-16 way set associative (off chip), unified
• System bus examples: Alpha, AMD K7: EV6, 200-266 MHz; Intel PII, PIII: GTL+, 100 MHz; Intel P4: 400 MHz
• Memory controller, memory bus, main memory
• I/O buses, example: PCI, 33 MHz, 32 bits wide, 133 MB/sec
• I/O devices: controllers and adapters for disks, displays, keyboards; NICs for networks
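The peak-bandwidth figures quoted above all follow from the same arithmetic: bytes per transfer times transfers per second. A minimal sketch (the function name is illustrative, and treating DDR and RDRAM as two transfers per clock is an assumption about how the slide's effective rates were derived):

```python
def peak_bandwidth_mb_s(bus_width_bits, clock_mhz, transfers_per_clock=1):
    """Peak bandwidth in MB/sec = bytes per transfer x transfers per second."""
    return (bus_width_bits // 8) * clock_mhz * transfers_per_clock

# PCI: 32 bits at 33 MHz -> 132 (the slide's 133 MB/sec uses the exact 33.3 MHz clock)
print(peak_bandwidth_mb_s(32, 33))      # 132
# PC133 SDRAM: 64 bits at 133 MHz
print(peak_bandwidth_mb_s(64, 133))     # 1064
# PC2100 DDR SDRAM: 64 bits, 133 MHz clock, 2 transfers/clock (the slide's "266 MHz")
print(peak_bandwidth_mb_s(64, 133, 2))  # 2128  (~2.1 GB/sec)
# Direct RDRAM: 16 bits, 400 MHz clock, 2 transfers/clock (the slide's "800 MHz")
print(peak_bandwidth_mb_s(16, 400, 2))  # 1600  (~1.6 GB/sec)
```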
Main Memory
• Main memory generally utilizes Dynamic RAM (DRAM), which uses a single transistor to store each bit but requires a periodic data refresh, accomplished by reading every row (~every 8 msec).
• Static RAM may be used if the added expense, lower density, power consumption, and complexity are acceptable (e.g., Cray vector supercomputers).
• Main memory performance is affected by:
– Memory latency: affects cache miss penalty. Measured by:
• Access time: the time between when a memory access request is issued to main memory and when the requested information is available to the cache/CPU.
• Cycle time: the minimum time between requests to memory (greater than access time in DRAM, to allow address lines to stabilize).
– Memory bandwidth: the sustained data transfer rate between main memory and cache/CPU.
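Latency connects to overall performance through the standard average memory access time formula, AMAT = hit time + miss rate x miss penalty. A small sketch (all numbers are illustrative assumptions, not values from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time (cycles) = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Assumed values: 1-cycle hit, 3% miss rate, 100-cycle penalty to main memory.
print(amat(1, 0.03, 100))  # roughly 4.0 cycles
```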
Processor-Memory (DRAM) Performance Gap
[Figure: processor vs. DRAM performance, 1980-2000, log scale (1 to 1000). Processor performance improves ~60%/yr while DRAM performance improves ~10%/yr, so the processor-memory performance gap grows ~50%/yr.]
Current memory access latency: 100 or more CPU cycles.
X86 CPU Cache/Memory Performance Gap Example: AMD Athlon T-Bird vs. Intel PIII vs. P4
• AMD Athlon T-Bird, 1 GHz: L1 64K INST + 64K DATA (3-cycle latency), both 2-way; L2 256K, 16-way, 64-bit, latency 7 cycles; L1, L2 on-chip
• Intel PIII, 1 GHz: L1 16K INST + 16K DATA (3-cycle latency), both 4-way; L2 256K, 8-way, 256-bit, latency 7 cycles; L1, L2 on-chip
• Intel P4, 1.5 GHz: L1 8K INST + 8K DATA (2-cycle latency), both 4-way, plus 96KB execution trace cache; L2 256K, 8-way, 256-bit, latency 7 cycles; L1, L2 on-chip
Source: http://www1.anandtech.com/showdoc.html?i=1360&p=15
High L1 and L2 data miss rates mean main memory is accessed for data.
Logical DRAM Organization
[Figure: a 2,048 x 2,048 memory array with a row decoder driven by address lines A0...A10, a column decoder, sense amps & I/O, data pin D, word lines, and storage cells.]
• Each of the RAS (row) and CAS (column) addresses selects the square root of the bits in the array.
CPU/Memory Performance Gap Reduction Techniques
• Latency Reduction/Higher Bandwidth:
– Memory hierarchy with one or more levels of cache + software/hardware cache performance enhancement techniques (a high level of locality is essential).
– Low-latency, wider, faster system bus.
– Independent memory bank interleaving.
– Lower-latency, high-bandwidth memory interfaces:
• Current: DDR SDRAM, Direct Rambus (DRDRAM).
• Future: Magnetic RAM (MRAM).
– Chip-level integration of memory controller/main memory (IRAM-like).
• Latency-tolerant Architectures:
– Simultaneous Multithreaded (SMT) architectures.
– Decoupled architectures: separate memory access from normal processor operation (e.g., HiDISC).
Latency Reduction/Higher Bandwidth: The Memory Hierarchy
[Figure: the memory hierarchy. Farther away from the CPU: lower cost/bit, higher capacity, increased access time/latency, lower throughput.]
• Registers: part of the on-chip CPU datapath, 16-256 registers
• Cache (static RAM), one or more levels: Level 1 on-chip 16-64K; Level 2 on- or off-chip 128-512K; Level 3 off-chip 128K-8M
• Main memory: DRAM, RDRAM, 16M-16G
• Magnetic disk: 4G-100G; interfaces: SCSI, RAID, IDE, 1394
• Optical disk or magnetic tape
A Typical Memory Hierarchy (With Two Levels of Cache)
[Figure: processor (control + datapath) with registers, on-chip level-one cache (L1), second-level cache (L2, SRAM), main memory (DRAM), virtual memory/secondary storage (disk), tertiary storage (tape). Toward the processor: faster; away from it: larger capacity.]
Speed (ns): registers 1s; caches 10s; main memory 100s; disk 10,000,000s (10s ms); tape 10,000,000,000s (10s sec)
Size (bytes): caches Ks; main memory Ms; disk Gs; tape Ts
Improving Cache Performance
• Miss Rate Reduction Techniques:
* Increased cache capacity
* Larger block size
* Higher associativity
* Victim caches
* Hardware prefetching of instructions and data
* Pseudo-associative caches
* Compiler-controlled prefetching
* Compiler optimizations
* Trace cache
• Cache Miss Penalty Reduction Techniques:
* Giving priority to read misses over writes
* Sub-block placement
* Early restart and critical word first
* Non-blocking caches
* Second-level cache (L2)
• Cache Hit Time Reduction Techniques:
* Small and simple caches
* Avoiding address translation during cache indexing
* Pipelining writes for fast write hits
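The effect of the miss-penalty techniques, notably the second-level cache, can be quantified with the standard two-level AMAT decomposition. A sketch where every number is an illustrative assumption (only the 7-cycle L2 hit time echoes the x86 examples earlier):

```python
def amat_two_level(ht_l1, mr_l1, ht_l2, local_mr_l2, mem_penalty):
    """AMAT = L1 hit time + L1 miss rate x (L2 hit time
    + L2 local miss rate x main-memory penalty), all in cycles."""
    return ht_l1 + mr_l1 * (ht_l2 + local_mr_l2 * mem_penalty)

# Assumed: 1-cycle L1 hit, 5% L1 miss rate, 7-cycle L2 hit,
# 40% L2 local miss rate, 100-cycle main-memory penalty.
print(amat_two_level(1, 0.05, 7, 0.40, 100))  # about 3.35 cycles
```

Without the L2 (every L1 miss going straight to memory) the same numbers give 1 + 0.05 x 100 = 6 cycles, which is the gain the technique targets.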
Cache Optimization Summary
Technique                           MR   MP   HT   Complexity
Larger Block Size                   +    –         0
Higher Associativity                +         –    1
Victim Caches                       +              2
Pseudo-Associative Caches           +              2
HW Prefetching of Instr/Data        +              2
Compiler Controlled Prefetching     +              3
Compiler Reduce Misses              +              0
Trace Cache                         +              3
Priority to Read Misses                  +         1
Subblock Placement                       +    +    1
Early Restart & Critical Word 1st        +         2
Non-Blocking Caches                      +         3
Second Level Caches                      +         2
Small & Simple Caches               –         +    0
Avoiding Address Translation                  +    2
Pipelining Writes                             +    1
(MR = miss rate, MP = miss penalty, HT = hit time)
Latency Reduction/Higher Bandwidth
• Wider Main Memory System Bus: memory width is increased to a number of words (usually the size of a cache block). Memory bandwidth is proportional to memory width; e.g., doubling the width of cache and memory doubles memory bandwidth.
• Simple Interleaved Memory: memory is organized as a number of banks, each one word wide.
– Simultaneous multiple-word memory reads or writes are accomplished by sending memory addresses to several memory banks at once.
– Interleaving factor: refers to the mapping of memory addresses to memory banks; e.g., using 4 banks, bank 0 has all words whose address satisfies (word address) mod 4 = 0.
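The bank mapping described above takes only a few lines to sketch (the function name is illustrative):

```python
def bank_of(word_address, num_banks=4):
    """Low-order interleaving: bank = word address mod number of banks,
    so consecutive word addresses land in consecutive banks."""
    return word_address % num_banks

# With 4 banks, bank 0 holds every word whose address mod 4 == 0:
print([a for a in range(16) if bank_of(a) == 0])  # [0, 4, 8, 12]
# A sequential 4-word block touches each bank exactly once:
print([bank_of(a) for a in range(8, 12)])         # [0, 1, 2, 3]
```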
Three examples of bus width, memory width, and memory interleaving to achieve higher memory bandwidth:
[Figure] (a) Simplest design: everything is the width of one word. (b) Wider memory, bus, and cache. (c) Narrow bus and cache with interleaved memory.
Memory Interleaving
Memory Width, Interleaving: An Example
Given the following system parameters with a single cache level L1:
Block size = 1 word; memory bus width = 1 word; miss rate = 3%; miss penalty = 32 cycles
(4 cycles to send the address, 24 cycles access time per word, 4 cycles to send a word)
Memory accesses per instruction = 1.2; ideal CPI (ignoring cache misses) = 2
Miss rate (block size = 2 words) = 2%; miss rate (block size = 4 words) = 1%
• The CPI of the base machine with 1-word blocks = 2 + (1.2 x 0.03 x 32) = 3.15
• Increasing the block size to two words gives the following CPI:
– 32-bit bus and memory, no interleaving = 2 + (1.2 x 0.02 x 2 x 32) = 3.54
– 32-bit bus and memory, interleaved = 2 + (1.2 x 0.02 x (4 + 24 + 8)) = 2.86
– 64-bit bus and memory, no interleaving = 2 + (1.2 x 0.02 x 1 x 32) = 2.77
• Increasing the block size to four words; resulting CPI:
– 32-bit bus and memory, no interleaving = 2 + (1.2 x 0.01 x 4 x 32) = 3.54
– 32-bit bus and memory, interleaved = 2 + (1.2 x 0.01 x (4 + 24 + 16)) = 2.53
– 64-bit bus and memory, no interleaving = 2 + (1.2 x 0.01 x 2 x 32) = 2.77
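All seven CPI figures above follow mechanically from CPI = ideal CPI + accesses/instruction x miss rate x miss penalty, with the miss penalty built from the 4/24/4-cycle parameters. A sketch (function names are illustrative):

```python
def cpi(ideal_cpi, accesses_per_instr, miss_rate, miss_penalty):
    """CPI including memory stalls."""
    return ideal_cpi + accesses_per_instr * miss_rate * miss_penalty

SEND_ADDR, ACCESS, SEND_WORD = 4, 24, 4  # cycles, from the slide's parameters

def penalty_no_interleave(block_words, bus_words):
    # Each bus-width transfer pays the full address + access + transfer sequence.
    return (block_words // bus_words) * (SEND_ADDR + ACCESS + SEND_WORD)

def penalty_interleaved(block_words):
    # Banks overlap their access time; words still cross the 1-word bus one at a time.
    return SEND_ADDR + ACCESS + block_words * SEND_WORD

print(round(cpi(2, 1.2, 0.03, 32), 2))                           # 3.15  base, 1-word blocks
print(round(cpi(2, 1.2, 0.02, penalty_no_interleave(2, 1)), 2))  # 3.54  2 words, 32-bit
print(round(cpi(2, 1.2, 0.02, penalty_interleaved(2)), 2))       # 2.86  2 words, interleaved
print(round(cpi(2, 1.2, 0.02, penalty_no_interleave(2, 2)), 2))  # 2.77  2 words, 64-bit
print(round(cpi(2, 1.2, 0.01, penalty_no_interleave(4, 1)), 2))  # 3.54  4 words, 32-bit
print(round(cpi(2, 1.2, 0.01, penalty_interleaved(4)), 2))       # 2.53  4 words, interleaved
print(round(cpi(2, 1.2, 0.01, penalty_no_interleave(4, 2)), 2))  # 2.77  4 words, 64-bit
```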
Simplified Asynchronous DRAM Read Timing
Source: http://arstechnica.com/paedia/r/ram_guide/ram_guide.part2-1.html
DRAM memory interfaces: Page Mode DRAM Operation
DRAM memory interfaces
Simplified Asynchronous Fast Page Mode (FPM) DRAM Read Timing
Typical timing at 66 MHz: 5-3-3-3. For bus width = 64 bits = 8 bytes and cache block size = 32 bytes, it takes 5 + 3 + 3 + 3 = 14 memory cycles, or 15 ns x 14 = 210 ns, to read a 32-byte block. Read miss penalty for a CPU running at 1 GHz = 14 x 15 = 210 CPU cycles.
FPM DRAM speed is rated using tRAC ~ 50-70 ns.
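This timing arithmetic, here and on the EDO and SDRAM slides that follow, fits one pattern: sum the burst cycle counts, multiply by the bus cycle time, and convert to CPU cycles. A sketch (function name is illustrative):

```python
def block_read(timing, bus_cycle_ns, cpu_ghz=1.0):
    """For a burst pattern like 5-3-3-3 (first access, then each later transfer),
    return (memory cycles, nanoseconds, CPU cycles) to read one cache block."""
    mem_cycles = sum(timing)
    ns = mem_cycles * bus_cycle_ns
    cpu_cycles = ns * cpu_ghz  # at 1 GHz, 1 CPU cycle per ns
    return mem_cycles, ns, cpu_cycles

# FPM DRAM at 66 MHz (~15 ns/cycle), 5-3-3-3, 32-byte block on a 64-bit bus:
print(block_read((5, 3, 3, 3), 15))    # (14, 210, 210.0)
# EDO DRAM at 66 MHz, 5-2-2-2:
print(block_read((5, 2, 2, 2), 15))    # (11, 165, 165.0)
# PC133 SDRAM (7.5 ns/cycle); a 5-1-1-1 burst gives the slide's 8 cycles / 60 ns:
print(block_read((5, 1, 1, 1), 7.5))   # (8, 60.0, 60.0)
```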
Simplified Asynchronous Extended Data Out (EDO) DRAM Read Timing
• Extended Data Out DRAM operates in a similar fashion to Fast Page Mode DRAM, except that the data from one read is on the output pins at the same time the column address for the next read is being latched in.
Source: http://arstechnica.com/paedia/r/ram_guide/ram_guide.part2-1.html
Typical timing at 66 MHz: 5-2-2-2. For bus width = 64 bits = 8 bytes, max bandwidth = 8 x 66 / 2 = 264 MB/sec. It takes 5 + 2 + 2 + 2 = 11 memory cycles, or 15 ns x 11 = 165 ns, to read a 32-byte cache block. Read miss penalty for a CPU running at 1 GHz = 11 x 15 = 165 CPU cycles.
EDO DRAM speed is rated using tRAC ~ 40-60 ns.
Characteristics of Synchronous DRAM Interface Architectures
Synchronous Dynamic RAM (SDRAM) Organization
DDR SDRAM: similar organization, but using four banks to allow data transfer on both rising and falling edges of the clock.
Simplified SDRAM Read Timing
Typical timing at 133 MHz (PC133 SDRAM): 5-1-1-1. For bus width = 64 bits = 8 bytes, max bandwidth = 133 x 8 = 1064 MB/sec. It takes 5 + 1 + 1 + 1 = 8 memory cycles, or 7.5 ns x 8 = 60 ns, to read a 32-byte cache block. Read miss penalty for a CPU running at 1 GHz = 8 x 7.5 = 60 CPU cycles.