cse 237a memory - university of california, san...
TRANSCRIPT
CSE 237A: Memory
Tajana Simunic Rosing, Department of Computer Science and Engineering, University of California, San Diego
Last time:
  HW#2 due today
  Finished MOCs
  CPUs: RISC, CISC, VLIW, FPGA, simple controllers
Coming up:
  Memories
  Mini project due Thursday, 2/7
    Meet in ES lab in CSE
    Email report by due date
    Be ready with a short demo/presentation
Hardware platform architecture
3Fall 2005
Memory hierarchy
Want inexpensive, fast memory
Main memory
  Large, inexpensive, slow, stores all data
Cache
  Small, expensive, fast memory; stores a copy of likely accessed parts
  L1, L2
[Figure: hierarchy of processor, registers, cache, main memory, disk, backup]
Memory
Efficiency is a concern:
  speed (latency and throughput); predictable timing
  energy efficiency
  size
  cost
  other attributes (volatile vs. persistent, etc.)
Access times
[Figure: speed vs. years, annotated "≥ 2x every 2 years"]
[P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]
Access times & energy vs. memory size
"Currently, the size of some applications is doubling every 10 months" [STMicroelectronics, Medea+ Workshop, Stuttgart, Nov. 2003]
MPEG2 video example
StrongARM platform with SRAM
[Figure: energy per cycle (mWhr), 0 to 6.E-09, vs. cycles, 0 to 800000; components: Processor, FLASH, SRAM, Battery, Pins & Interconnect, DC-DC Converter]
Caches and CPUs
[Figure: CPU, cache controller, cache and main memory, connected by address and data lines]
Caches and CPUs - MPSoC
Higher-end systems: L1 & L2 cache on chip
Cache
Designed with SRAM, usually on same chip as processor
Cache operation:
  Request for main memory access (read or write)
  First, check cache for copy
    cache hit
    cache miss
Design choices
  cache mapping
    Direct: each memory location maps onto exactly one cache entry
    Fully associative: a memory location can go into any cache entry; rarely implemented because of the comparator cost
    Set-associative: each memory location can go into one of the n entries of one set
  write techniques
    Write-through: write to main memory at each update
    Write-back: write to main memory only when a "dirty" block is replaced
  replacement policies
    Random
    LRU: least-recently used
    FIFO: first-in-first-out
[Figure: address split into Tag, Index, Offset; each cache entry holds Valid (V), Tag (T), Data (D); the stored tag is compared (=) with the address tag and combined with Valid to signal a hit; shown for one entry and for two ways in parallel]
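The tag/index/offset lookup on this slide can be sketched for the direct-mapped case as follows. This is a minimal illustration, not any particular processor's cache: the field widths (`OFFSET_BITS`, `INDEX_BITS`) and the names `split_address`, `Entry` and `DirectMappedCache` are assumptions made for the example, and the data block itself is not modelled.

```python
from dataclasses import dataclass

OFFSET_BITS = 4   # assumed: 16-byte blocks
INDEX_BITS = 6    # assumed: 64 cache entries


def split_address(addr):
    """Split an address into its (tag, index, offset) fields."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


@dataclass
class Entry:
    valid: bool = False   # V bit
    tag: int = 0          # T field; D (the data block) is omitted here


class DirectMappedCache:
    def __init__(self):
        self.entries = [Entry() for _ in range(1 << INDEX_BITS)]

    def access(self, addr):
        """Return True on a hit; on a miss, fill the entry and return False."""
        tag, index, _ = split_address(addr)
        entry = self.entries[index]
        if entry.valid and entry.tag == tag:
            return True
        # Direct mapping: the new block replaces whatever was in this entry.
        entry.valid, entry.tag = True, tag
        return False
```

Two addresses that share an index but differ in tag evict each other in this model, which is the conflict behaviour that set-associative mapping reduces.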
Cache impact on system performance
Most important parameters in terms of performance:
  Total size of cache (data and control info: tags, etc.)
  Degree of associativity
  Data block size
Larger caches -> lower miss rates, higher access cost
Average memory access time (h1 = L1 hit rate, h2 = fraction of accesses that hit in L1 or L2):
  $t_{av} = h_1 t_{L1} + (h_2 - h_1) t_{L2} + (1 - h_2) t_{main}$
e.g., if miss cost = 20 cycles:
  2 Kbyte: miss rate = 20%, hit cost = 2 cycles, access 5.6 cycles
  4 Kbyte: miss rate = 10%, hit cost = 3 cycles, access 4.7 cycles
  8 Kbyte: miss rate = 8%, hit cost = 4 cycles, access 4.8 cycles
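The access-time arithmetic on this slide can be checked with a short sketch. It assumes the single-level figures are computed as (1 - miss rate) x hit cost + miss rate x miss cost, and that h2 counts accesses that hit in either L1 or L2, so that the three weights in the two-level formula sum to one. The function names are illustrative only.

```python
def avg_access_single(miss_rate, hit_cost, miss_cost):
    """Average access time in cycles for a single cache level."""
    return (1.0 - miss_rate) * hit_cost + miss_rate * miss_cost


def avg_access_two_level(h1, h2, t_l1, t_l2, t_main):
    """Average access time with L1 and L2.

    h1: fraction of accesses that hit in L1
    h2: fraction of accesses that hit in L1 or L2 (assumed cumulative)
    """
    return h1 * t_l1 + (h2 - h1) * t_l2 + (1.0 - h2) * t_main
```

Under this assumption the 2 Kbyte and 4 Kbyte rows reproduce exactly (5.6 and 4.7 cycles). The 8 Kbyte row evaluates to 5.28 cycles for an 8% miss rate; 4.8 cycles would correspond to a 5% miss rate.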
Cache performance trade-offs
Improving cache hit rate without increasing size
  Increase line size
  Change set-associativity
[Figure: % cache miss (0 to 0.16) vs. cache size (1 Kb, 2 Kb, 4 Kb, 8 Kb, 16 Kb, 32 Kb, 64 Kb, 128 Kb) for 1 way, 2 way, 4 way and 8 way]
Influence of the associativity
[P. Marwedel et al., ASPDAC, 2004]
Predictability
Embedded systems are often real-time:
  Have to guarantee meeting timing constraints.
Pre run-time scheduling: predictability
  Time-triggered, statically scheduled operating systems
Predictable cache design?
Scratch pad memories (SPM)
Hierarchy: processor, main memory, SPM
Example: ARM7TDMI cores, well-known for low power consumption
[Figure: address space from 0 to FFF.., with the scratch pad memory mapped into part of it; no tag memory]
Why not just use a cache?
Worst case execution time (WCET) may be large
[P. Marwedel et al., ASPDAC, 2004]
Why not just use a cache?
Energy for parallel access of sets, in comparators, muxes.
[Figure: energy per access [nJ] (0 to 9) vs. memory size (256, 512, 1024, 2048, 4096, 8192, 16384) for: Scratch pad; Cache, 2way, 4GB space; Cache, 2way, 16 MB space; Cache, 2way, 1 MB space]
[R. Banakar, S. Steinke, B.-S. Lee, 2001]
Scratchpad vs. main memory currents
Example: Atmel ARM-Evaluation board
Current, 32 Bit-Load Instruction (Thumb); current reduction: factor of 3.02
[Figure: processor with SPM (on-chip) and main memory (on board); bar chart of current in mA, axis 0 to 200]
Configuration                 Core+On-Chip-Memory Current (mA)   Off-Chip-Memory Current (mA)
Prog Off-Chip/Data Off-Chip   48.2                               116
Prog Off-Chip/Data On-Chip    50.9                               77.2
Prog On-Chip/Data Off-Chip    44.4                               82.2
Prog On-Chip/Data On-Chip     53.1                               1.16
Scratchpad vs. main memory energy
Example: Atmel ARM-Evaluation board
Energy, 32 Bit-Load Instruction (Thumb)
"Main" memory access takes longer
energy reduction: factor of 7.06 (savings 86%)
100% predictable
[Figure: bar chart of energy in nJ, axis 0 to 140]
Configuration                 Energy (nJ)
Prog Off-Chip/Data Off-Chip   115.8
Prog Off-Chip/Data On-Chip    51.6
Prog On-Chip/Data Off-Chip    76.5
Prog On-Chip/Data On-Chip     16.4
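The reduction factor and savings quoted on this slide follow from the largest and smallest bars (115.8 nJ and 16.4 nJ). A small sketch of that arithmetic, with illustrative function names:

```python
def reduction_factor(before, after):
    """How many times smaller the 'after' value is."""
    return before / after


def savings_fraction(before, after):
    """Fraction saved when going from 'before' to 'after'."""
    return 1.0 - after / before


ENERGY_ALL_OFF_CHIP_NJ = 115.8   # value read from the slide
ENERGY_ALL_ON_CHIP_NJ = 16.4     # value read from the slide
```

Rounded to two decimals the factor is 7.06, and the savings round to 86%, matching the figures on the slide.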
Memory management units
Memory management unit (MMU) translates addresses:
[Figure: CPU sends a logical address to the memory management unit, which sends a physical address to main memory]
Memory Management Unit (MMU)
Duties of MMU
  Handles DRAM refresh, bus interface & arbitration
  Takes care of memory sharing among multiple CPUs
  Translates logical memory addresses from processor to physical memory addresses of DRAM
Modern CPUs often come with MMU built-in
Single-purpose processors can be used
Address translation
Mapping logical to physical addresses.
Two basic schemes:
  Segmented
    memory footprint can change dynamically
    usually only a few segments per process; e.g. data and stack
  Paged
    size preassigned
  can be combined (x86).
[Figure: memory holding segment 1, segment 2, page 1, page 2]
SEGMENTATION          PAGING
Involves programmer   Transparent to programmer
Separate compiling    No separate compiling
Separate protection   No separate protection
Shared                No sharing
ARM memory management
Memory region types:
  section: 1 Mbyte block;
  large page: 64 kbytes;
  small page: 4 kbytes.
An address is marked as section-mapped or page-mapped.
Two-level translation scheme.
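ARM address translation
[Figure: virtual address split into 1st index, 2nd index and offset; the translation table base register is concatenated with the 1st index to fetch a descriptor from the 1st level table; that descriptor is concatenated with the 2nd index to fetch a descriptor from the 2nd level table; that descriptor is concatenated with the offset to form the physical address]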
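The two-level scheme can be sketched in software as below. This is a simplified model, not the real ARM descriptor format: tables are Python dictionaries, descriptor type bits are replaced by a string tag, large pages are left out, and the bit positions (12-bit first index, 8-bit second index, 12-bit offset for small pages, 20-bit offset for sections) are assumed from the 1 Mbyte section and 4 kbyte small page sizes. The name `translate` is hypothetical.

```python
SECTION_SHIFT = 20      # 1 Mbyte sections: low 20 bits are the offset
SMALL_PAGE_SHIFT = 12   # 4 kbyte small pages: low 12 bits are the offset


def translate(first_level, vaddr):
    """Translate a 32-bit virtual address; raises KeyError if unmapped.

    first_level maps a 1st index to ("section", physical_base) or to
    ("page_table", second_level), where second_level maps a 2nd index
    to the physical base of a small page.
    """
    first_index = vaddr >> SECTION_SHIFT
    kind, target = first_level[first_index]
    if kind == "section":
        # Section-mapped: concatenate section base with a 20-bit offset.
        return target | (vaddr & ((1 << SECTION_SHIFT) - 1))
    # Page-mapped: a second lookup selects the small page.
    second_index = (vaddr >> SMALL_PAGE_SHIFT) & 0xFF
    page_base = target[second_index]
    return page_base | (vaddr & ((1 << SMALL_PAGE_SHIFT) - 1))
```

A section-mapped address costs one table lookup and a page-mapped address costs two, which is the trade-off between coarse and fine mapping granularity.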
Memory: basic concepts
Stores large number of bits
  m x n: m words of n bits each
  k = log2(m) address input signals, or m = 2^k words
  e.g., 4,096 x 8 memory:
    32,768 bits
    12 address input signals
    8 input/output data signals
Memory access
  r/w: selects read or write
  enable: read or write only when asserted
  multiport: multiple accesses to different locations simultaneously
[Figure: m × n memory, m words of n bits per word; memory external view: 2^k × n read and write memory with inputs r/w, enable, A0 … Ak-1 and data lines Qn-1 … Q0]
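The sizing arithmetic for an m x n memory can be written out as a short sketch; the function name and the dictionary keys are illustrative.

```python
def memory_dimensions(m, n):
    """Return bit count and signal counts for an m-word by n-bit memory."""
    # Smallest k with 2**k >= m; equals log2(m) when m is a power of two.
    k = (m - 1).bit_length()
    return {"total_bits": m * n, "address_lines": k, "data_lines": n}
```

For the 4,096 x 8 example this gives 32,768 bits, 12 address lines and 8 data lines.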
Write ability / storage permanence
Traditional ROM/RAM distinctions
  ROM: read only, bits stored without power
  RAM: read and write, lose stored bits without power
Distinctions blurred
  Advanced ROMs can be written to, e.g., EEPROM
  Advanced RAMs can hold bits without power, e.g., NVRAM
[Figure: Write ability and storage permanence of memories, showing relative degrees along each axis (not to scale).
  Storage permanence axis: near zero; battery life (10 years); tens of years; life of product. Regions marked nonvolatile and in-system programmable.
  Write ability axis: during fabrication only; external programmer, one time only; external programmer, 1,000s of cycles; external programmer OR in-system, 1,000s of cycles; external programmer OR in-system, block-oriented writes, 1,000s of cycles; in-system, fast writes, unlimited cycles.
  Memories plotted: Mask-programmed ROM, OTP ROM, EPROM, EEPROM, FLASH, NVRAM, SRAM/DRAM, ideal memory]
ROM: "Read-Only" Memory
Nonvolatile, read only
"Programmed" before inserting to embedded system
  Mask-programmed: at fabrication
  One-time programmable (OTP ROM)
    Programmed by user; fuse/antifuse tech., cheaper
  Erasable ROM
    With UV light: EPROM
    With electricity: EEPROM
Uses
  Store software program for general-purpose processor
  Store constant data needed by system
  Implement combinational circuit
[Figure: external view of a 2^k × n ROM with inputs enable, A0 … Ak-1 and outputs Qn-1 … Q0]
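The last use, implementing a combinational circuit, amounts to storing the circuit's truth table: the inputs form the address and the stored word holds the outputs. As a minimal sketch, a full adder (chosen here purely as an example) fits in a 2^3 x 2 ROM. The function names are illustrative.

```python
def build_full_adder_rom():
    """Contents of an 8-word x 2-bit ROM implementing a full adder.

    Address bits: a (bit 2), b (bit 1), carry-in (bit 0).
    Word bits:    carry-out (bit 1), sum (bit 0).
    """
    rom = []
    for address in range(8):
        a = (address >> 2) & 1
        b = (address >> 1) & 1
        carry_in = address & 1
        rom.append(a + b + carry_in)   # the 2-bit total is carry-out:sum
    return rom


def read_rom(rom, address, enable=True):
    """Return the addressed word, or None when enable is not asserted."""
    if not enable:
        return None
    return rom[address]
```

Reading address 0b111 returns 0b11 (carry-out 1, sum 1), with no logic gates evaluated at read time.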
EEPROM: Electrically erasable programmable ROM
Programmed and erased electronically
  higher than normal voltage
  can program and erase individual words
  can be erased and programmed 1000s of times
Better write ability
  can be in-system programmable with built-in circuit to provide higher than normal voltage
  writes very slow due to erasing and programming
Similar storage permanence to EPROM (about 10 years)
Far more convenient than EPROMs, but more expensive
Flash Memory
Extension of EEPROM
  Same floating gate principle
  Same write ability and storage permanence
Fast erase
  Large blocks of memory erased at once, rather than one word at a time
  Blocks typically several thousand bytes large
Writes to single words may be slower
  Entire block must be read, word updated, then entire block written back
Used with embedded systems storing large data items in nonvolatile memory
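The read, update, write-back sequence for a single-word write can be sketched with a toy model. The 4096-byte block size, the class name `FlashModel` and the erase counter are assumptions for illustration; real devices differ in block size and programming interface.

```python
BLOCK_SIZE = 4096   # assumed block size in bytes


class FlashModel:
    def __init__(self, num_blocks):
        # Erased flash reads as all ones.
        self.blocks = [bytearray(b"\xff" * BLOCK_SIZE) for _ in range(num_blocks)]
        self.erase_count = [0] * num_blocks

    def read_byte(self, addr):
        block_no, offset = divmod(addr, BLOCK_SIZE)
        return self.blocks[block_no][offset]

    def write_byte(self, addr, value):
        block_no, offset = divmod(addr, BLOCK_SIZE)
        buffer = bytearray(self.blocks[block_no])            # 1. read entire block
        buffer[offset] = value                               # 2. update the word in RAM
        self.blocks[block_no][:] = b"\xff" * BLOCK_SIZE      # 3. erase the whole block
        self.erase_count[block_no] += 1
        self.blocks[block_no][:] = buffer                    # 4. write entire block back
```

Each single-byte write costs one full block erase in this model, which is why scattered small writes are slow and consume the limited erase cycles.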
RAM: "Random-access" memory
Volatile, read/write at run time
Internal structure more complex
  a word consists of several memory cells, each storing 1 bit
  each input and output data line connects to each cell in its column
  rd/wr connected to every cell
SRAM: Static RAM
  Memory cell uses flip-flop to store bit
  Requires 6 transistors
  Holds data as long as power supplied
DRAM: Dynamic RAM
  Memory cell uses MOS transistor and capacitor to store bit
  More compact than SRAM
  "Refresh" required due to capacitor leak
    word's cells refreshed when read
  Typical refresh interval 15.625 microsec.
  Slower to access than SRAM
[Figure: external view of a 2^k × n read and write memory with inputs r/w, enable, A0 … Ak-1 and data lines Qn-1 … Q0; internal view of a 4×4 RAM with a 2×4 decoder (A0, A1, enable), inputs I3 I2 I1 I0, outputs Q3 Q2 Q1 Q0 and rd/wr to every cell; SRAM memory cell with Data, Data' and W lines; DRAM memory cell with Data and W lines]
Generic SRAM timing
[Figure: timing diagram over time with signals CE', R/W', Adrs, Data; a read (data from SRAM) followed by a write (data from CPU)]
Generic DRAM timing
[Figure: timing diagram over time with signals CE', R/W', RAS', CAS', Adrs (row adrs, then col adrs), Data (data)]
Page mode access
[Figure: timing diagram over time with signals CE', R/W', RAS', CAS', Adrs (one row adrs followed by several col adrs), Data (one data item per col adrs)]
RAM variations
PSRAM: Pseudo-static RAM
  DRAM with built-in memory refresh controller
  Popular low-cost high-density alternative to SRAM
NVRAM: Nonvolatile RAM
  Holds data after external power removed
  Battery-backed RAM
    SRAM with own permanently connected battery
    writes as fast as reads
    no limit on number of writes, unlike nonvolatile ROM-based memory
  SRAM with EEPROM or flash
    stores complete RAM contents on EEPROM or flash before power turned off
Extended data out DRAM
Improvement of FPM (fast page mode) DRAM
Extra latch before output buffer
  allows strobing of cas before data read operation completed
Reduces read/write latency by additional cycle
Speedup through overlap
[Figure: timing diagram with signals ras, cas, address (row, col, col, col), data (data, data, data)]
(S)ynchronous and Enhanced Synchronous (ES) DRAM
SDRAM latches data on active edge of clock
Eliminates time to detect ras/cas and rd/wr signals
A counter is initialized to column address, then incremented on active edge of clock to access consecutive memory locations
ESDRAM improves SDRAM
  added buffers enable overlapping of column addressing
  faster clocking and lower read/write latency possible
[Figure: timing diagram with signals clock, ras, cas, address (row, col), data (data, data, data)]
Rambus DRAM (RDRAM)
More of a bus interface architecture than DRAM architecture
Data is latched on both rising and falling edge of clock
Broken into 4 banks, each with own row decoder
  can have 4 pages open at a time
Capable of very high throughput
DRAM integration problem
SRAM easily integrated with CPU
DRAM more difficult
  Different chip making process between DRAM and conventional logic
  Goal of conventional logic (IC) designers:
    minimize parasitic capacitance to reduce signal propagation delays and power consumption
  Goal of DRAM designers:
    create capacitor cells to retain stored information
SRAM vs. DRAM
SRAM:
  Faster.
  Easier to integrate with logic.
  Higher active power consumption per bit.
DRAM:
  Denser.
  Must be refreshed.
Design example: SmartBadge
Active power: 3.5 W
Idle power: 2.2 W
Standby: 0.2 W
Sleep time: 1 ms
Wake-up time: 150 ms
[Figure: block diagram with StrongARM SA-1100, UCB1200, analog & digital sensors, microphone and speakers, display, RF, memory (Flash 1MB, SRAM 1MB), DC-DC converter, battery]
SmartBadge HW for MPEG2 Video
Memory Architectures
Name       Initial Access (ns)   Burst Access (ns)   Active Power (mW)   Idle Power (mW)   Interconnect Capacitance (pF/line)   I/O Pin Capacitance (pF/pin)   Manufacturer
FLASH      80                    N/A                 75                  0.5               4.8                                  10                             Intel
BFLASH     80                    40.00               600                 2.5               4.8                                  10                             TI
SRAM       90                    N/A                 185                 0.1               8                                    8                              Toshiba
BSRAM      90                    45.00               365                 1.7               8                                    8                              Micron
BSDRAM     30                    15.00               430                 10                8                                    8                              Micron
L2 Cache   20.00                 10                  1985                330               3.2                                  5                              Motorola
Hardware Configurations
Name          Instruction Memory   Data Memory   L2 Cache Present
Original      FLASH                SRAM          no
L2 Cache      FLASH                BSDRAM        yes
Burst SRAM    BFLASH               BSRAM         no
Burst SDRAM   BFLASH               BSDRAM        no
[Figure: Energy Consumption: energy (mWhr), 0.00 to 0.30, for the configurations Original, L2 Cache, Burst SRAM, Burst SDRAM; components: Processor, Instruction Memory, Data Memory, L2 Cache, Interconnect & Pins, DC-DC Converter]
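To see why the burst-capable parts matter for streaming workloads such as video, the access times from the memory table can be turned into a rough read-time estimate. The timing model is an assumption made for illustration: the first word of a burst costs the initial access time and each further word costs the burst access time, while parts without a burst mode pay the initial access time for every word. The names `MEMORIES` and `read_time_ns` are hypothetical.

```python
# (initial access ns, burst access ns or None), copied from the memory table
MEMORIES = {
    "FLASH": (80, None),
    "BFLASH": (80, 40),
    "SRAM": (90, None),
    "BSRAM": (90, 45),
    "BSDRAM": (30, 15),
}


def read_time_ns(name, words):
    """Estimated time to read 'words' consecutive words (assumed model)."""
    initial, burst = MEMORIES[name]
    if words <= 0:
        return 0
    if burst is None:
        return words * initial               # no burst mode: full access each time
    return initial + (words - 1) * burst     # first word, then burst accesses
```

Under this model an 8-word read takes 720 ns from SRAM, 405 ns from BSRAM and 135 ns from BSDRAM. Access time is only one side of the comparison; the active and idle power columns pull in the other direction.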
Memory bus design
Bus signals are usually tri-stated.
Address and data lines may be multiplexed.
Bus master controls operations on the bus.
  CPU is default bus master.
  Other devices may request bus mastership -> handshaking lines.
Every device on the bus must be able to drive the maximum bus load
Bus may include clock signal; timing is relative to it
[Figure: bus with Address (m lines), Data (n lines), Control (c lines)]
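Busses as communicating machines
[Figure: two state machines, M1 and M2, connected by the signals enq and ack; M1 has states enq = 1 and enq = 0 with transitions on ack (0, 1); M2 has states ack = 0 and ack = 1 with transitions on enq (0, 1)]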
Fixed-delay memory access
[Figure: CPU and memory state machines connected by R/W, adrs and data.
  CPU: read = 1, adrs = A; then reg = data.
  Memory: on R, data = mem[adrs]; on W, mem[adrs] = data]
Variable-delay memory access
[Figure: CPU and memory state machines connected by R/W, adrs, data and done.
  CPU: read = 1, adrs = A; wait while done is not asserted (n); when done (y), reg = data.
  Memory: done = 0; on R, data = mem[adrs], done = 1; on W, mem[adrs] = data, done = 1]
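The variable-delay protocol can be sketched as a small cycle-by-cycle simulation in which the CPU starts an access and waits until the memory raises done. The class `SlowMemory`, its fixed `latency` parameter and the helper functions are assumptions for illustration; a real memory's delay would vary from access to access.

```python
class SlowMemory:
    def __init__(self, size, latency):
        self.cells = [0] * size
        self.latency = latency      # cycles before done is asserted (assumed fixed)
        self.done = 1
        self.data = 0
        self._pending = None
        self._remaining = 0

    def start(self, read, adrs, data=0):
        """CPU side: present read/write, adrs (and data for a write)."""
        self.done = 0
        self._pending = (read, adrs, data)
        self._remaining = self.latency

    def tick(self):
        """Advance the memory by one clock cycle."""
        if self._pending is None:
            return
        self._remaining -= 1
        if self._remaining <= 0:
            read, adrs, data = self._pending
            if read:
                self.data = self.cells[adrs]    # data = mem[adrs]
            else:
                self.cells[adrs] = data         # mem[adrs] = data
            self.done = 1
            self._pending = None


def cpu_read(memory, adrs):
    """Return (value, cycles waited); the CPU loops until done is asserted."""
    memory.start(read=True, adrs=adrs)
    cycles = 0
    while not memory.done:
        memory.tick()
        cycles += 1
    return memory.data, cycles


def cpu_write(memory, adrs, value):
    """Return the number of cycles the CPU waited for the write."""
    memory.start(read=False, adrs=adrs, data=value)
    cycles = 0
    while not memory.done:
        memory.tick()
        cycles += 1
    return cycles
```

Because the CPU only samples done, the same CPU-side code works with memories of different speeds, which is the point of the variable-delay scheme compared with the fixed-delay one.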
Typical bus access
[Figure: timing diagram over time with signals clock, R/W', Address enable, Address (adrs), Data Ready', Data (data); a read followed by a write]
DMA operation
CPU sets up DMA transfer:
  e.g. start address, length, block size
DMA controller performs transfer, signals when done
[Figure: CPU, DMA controller, memory and I/O connected by a bus]
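The division of labour between CPU and DMA controller can be sketched as follows. The model is deliberately simplified and its names are hypothetical: it copies between two regions of one `bytearray` instead of between memory and an I/O device, and the done flag stands in for the completion signal (typically an interrupt).

```python
class DMAController:
    def __init__(self, memory):
        self.memory = memory    # shared bytearray standing in for the bus target
        self.done = True

    def setup(self, src, dst, length, block_size):
        """CPU side: program start addresses, length and block size."""
        self.src, self.dst = src, dst
        self.remaining = length
        self.block_size = block_size
        self.done = length == 0

    def step(self):
        """Controller side: move one block while it holds the bus."""
        if self.done:
            return
        n = min(self.block_size, self.remaining)
        self.memory[self.dst:self.dst + n] = self.memory[self.src:self.src + n]
        self.src += n
        self.dst += n
        self.remaining -= n
        self.done = self.remaining == 0     # signals completion to the CPU


def run_transfer(dma):
    """Run the transfer to completion; return the number of blocks moved."""
    blocks = 0
    while not dma.done:
        dma.step()
        blocks += 1
    return blocks
```

After `setup` the CPU is free to do other work. The block size bounds how long the controller holds the bus in one go: smaller blocks let the CPU back on the bus sooner at the cost of more arbitration.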
Summary
Memory hierarchy
  Needs: speed, low power, predictable
Cache design
  Mapping, replacement & write policies
Memory types
  ROM vs RAM, types of ROM/RAM
Memory bus design
Sources and References
Frank Vahid, Tony Givargis, "Embedded System Design," Wiley, 2002.
Wayne Wolf, "Computers as Components," Morgan Kaufmann, 2001.
Peter Marwedel, "Embedded Systems Design," 2004.