Dezső Sima
September 2008
(Ver. 1.0) Sima Dezső, 2008
2. Challenges/limiters of parallel connected synchronous memories
Overview
1. Key challenges facing main memories
2. Main limiters of increasing the transfer rate of main memories - Overview
3. Challenges in increasing the rate of sourcing/sinking data from/to the memory cell array
4. Challenges in increasing the transfer rate between the memory controller and the DRAM parts
5. Challenges in increasing the rate of capturing data in the memory controller/DRAM parts
6. Main limiters of increasing the memory size
7. References
1. Key challenges facing main memories
Key challenges facing main memories
• Increasing (single core) processor performance (the past)
1. Key challenges facing main memories (1)
Figure 1.2: Integer performance growth of Intel’s x86 processors (SPECint92 on a logarithmic scale versus the year of introduction, 1979-2005, from the 8088/5 up to the Pentium 4 Prescott). The growth rate is ~ 100×/10 years, levelling off with the last Pentium 4 models (Northwood B, Prescott (1M), Prescott (2M)).
1. Key challenges facing main memories (2)
Integer performance grows
Key challenges facing main memories
• Increasing (single core) processor performance (the past)
• Multicore/manycore processors with core numbers doubling about every two years (the present and near future)
1. Key challenges facing main memories (3)
Figure: Evolution of Intel’s process technology [1]
1. Key challenges facing main memories (4)
Evolution of Intel’s process technology
Shrinking: ~ 0.7/2 Years
Figure: The actual rise of IC complexity in DRAMs and microprocessors [2]
1. Key challenges facing main memories (5)
The evolution of IC complexity (Moore’s law)
Figure: Rapid spreading of Intel’s multicore processors
1. Key challenges facing main memories (6)
Rapid spreading of multicore processors in Intel’s processor portfolio
Figure: Block diagram of the Cell BE [3]
EIB: Element Interconnect Bus
SPE: Synergistic Processing Element
SPU: Synergistic Processor Unit
SXU: Synergistic Execution Unit
LS: Local Store of 256 KB
SMF: Synergistic Memory Flow Unit
PPE: Power Processing Element
PPU: Power Processing Unit
PXU: POWER Execution Unit
MIC: Memory Interface Controller
BIC: Bus Interface Controller
XDR: Rambus DRAM
1. Key challenges facing main memories (7)
The Cell BE (2006)
Assuming that the IC process technology will evolve in the near future at a similar rate as now (shrinking of characteristic feature sizes at a rate of ~ 0.7/2 years)
the number of cores will double also about every two years.
1. Key challenges facing main memories (8)
Higher processor performance/more cores lead to
higher memory performance requirements in terms of
• larger memory size
• higher memory bandwidth
• lower memory latency
Depends on
• characteristics of the application
• cache architecture
• ...
Interesting research area
Limitations of recent implementations
2. Main limiters of increasing the transfer rate of main memories - Overview
Main components of the main memory
Figure: Main components of the main memory (memory controller; DRAM device with memory cell array and I/O buffers)
Main limitations of recent commodity DRAMs (synchronous main memories) in increasing transfer rates
• The rate of sourcing/sinking data from/to the memory array (problem of reducing the column cycle time of the memory cell array),
• the rate of transmitting data between the memory controller and the memory modules (transmission line termination problem),
• the rate of capturing data at the memory controller/memory module (signaling and synchronization problem).
Figure: Schematic view of the structure of the main memory (sourcing/sinking between the memory cell array and the I/O buffers, transferring between the DRAM device and the memory controller, capturing at both the DRAM device and the memory controller)
The most serious of these limitations constrains the achievable transfer rate.
3. Challenges in increasing the rate of sourcing/sinking data from/to the memory cell array
Basic operation speed of recent synchronous DRAMs
The memory cell array sources/sinks data to/from the I/O buffers at a rate of T (at a data width of x4/x8/x16):
$T = \frac{1}{t_{CCD}} \times FW$
with
tCCD: min. column cycle time of the memory cell array
FW: fetch width of the memory cell array
3. The rate of sourcing/sinking data (1)
The min. column cycle time (tCCD) of the memory cell array
tCCD (core column delay) is the min. time interval between consecutive Reads or Writes.
Figure: The interpretation of tCCD [4]
Remark
tCCD is also designated as the Read/Write command to Read/Write command delay.
Figure: The evolution of the column cycle time (tCCD) in different SDRAM types (ns) [5]
Note: The min. column cycle time (tCCD) of synchronous DRAMs is:
SDRAM: 7.5 ns
DDR/DDR2/DDR3: 5 ns
The fetch width (FW) of the memory cell array
The fetch width specifies how many times more bits the cell array fetches per column cycle than the data width of the device.
E.g. an x4 DRAM chip with a fetch width of 4 (actually a DDR2 DRAM) fetches 4 × 4, that is 16 bits, from the memory cell array per column cycle.
The fetch width (FW) of the memory cell array of synchronous DRAMs is typically:
DRAM type   FW
SDRAM       1
DDR         2
DDR2        4
DDR3        8
3. The rate of sourcing/sinking data (5)
Figure: Fetch width of synchronous DRAM generations
(In each example the DRAM core clock is 100 MHz; n is the data width of the device.)
SDRAM-100: clock (CK) 100 MHz; the memory cell array passes n bits per core cycle (core running at fCK) to the I/O buffers; data transfer on the rising edges of CK over the data lines (DQ0 - DQn-1); 100 MT/s.
DDR-200: clock (CK/CK#) 100 MHz, data strobe (DQS) 100 MHz; the memory cell array passes 2×n bits per core cycle (core running at fCK); data transfer on both edges of DQS (2 × fCK) over the data lines (DQ0 - DQn-1); 200 MT/s.
DDR2-400: clock (CK/CK#) 200 MHz, data strobe (DQS) 200 MHz; the memory cell array passes 4×n bits per core cycle (core running at fCK/2); data transfer on both edges of DQS (2 × fCK) over the data lines (DQ0 - DQn-1); 400 MT/s.
DDR3-800: clock (CK/CK#) 400 MHz, data strobe (DQS) 400 MHz; the memory cell array passes 8×n bits per core cycle (core running at fCK/4); data transfer on both edges of DQS (2 × fCK) over the data lines (DQ0 - DQn-1); 800 MT/s.
According to Tmax = 1/tCCD × FW, the peak rates of sourcing/sinking data to/from the I/O buffers are:
SDRAM: 1/7.5 ns × 1 = 133 MT/s
DDR: 1/5 ns × 2 = 400 MT/s
DDR2: 1/5 ns × 4 = 800 MT/s
DDR3: 1/5 ns × 8 = 1600 MT/s (not yet achieved)
3. The rate of sourcing/sinking data (6)
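The peak rates listed above follow directly from Tmax = 1/tCCD × FW. A minimal Python sketch of that arithmetic is given here; the function name `peak_rate_mts` and the dictionary layout are illustrative choices, not part of the original slides.

```python
# Peak rate at which the memory cell array can source/sink data,
# following Tmax = (1 / tCCD) x FW from the text.
# The function name and data layout are illustrative assumptions.

def peak_rate_mts(tccd_ns: float, fetch_width: int) -> float:
    """Return the peak rate in MT/s for a column cycle time given in ns."""
    return fetch_width / tccd_ns * 1000.0  # 1/ns corresponds to 1000 MHz

# (tCCD in ns, fetch width) per DRAM generation, as quoted in the slides
GENERATIONS = {
    "SDRAM": (7.5, 1),
    "DDR": (5.0, 2),
    "DDR2": (5.0, 4),
    "DDR3": (5.0, 8),
}

for name, (tccd_ns, fw) in GENERATIONS.items():
    print(f"{name}: {peak_rate_mts(tccd_ns, fw):.0f} MT/s")
```

With tCCD fixed at 5 ns, the only remaining lever in this formula is the fetch width, which is why each DDR generation doubles it.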
The main limitation in increasing the rates of sourcing/sinking data from/to the memory array is tCCD (the column cycle time).
The column cycle time (tCCD) resulting from a DRAM design depends on a number of architectural choices, like column decoder layout, array block size, array partitioning, decisions to share resources between array banks etc. [32]. Its reduction below 5 ns is an intricate circuit design task that is out of the scope of our discussion. For an insight into the subject see [32].
Remark
GDDR3 and GDDR4 devices, with peak transfer rates of 1.6 and 2.5 GT/s, respectively, achieve min. column cycle times (tCCD) of 2.5 and 3.2 ns, respectively [32].
4. Challenges in increasing the transfer rate between the memory controller and the DRAM parts
4. The transfer rate between the MC and the DRAM parts (1)
Memory controller
Memory modules
Motherboard trace
Figure: The dataway connecting the memory controller and the DRAM chips (based on [6])
The dataway connecting the memory controller and the DRAM chips
For higher data rates PCB traces behave as transmission lines.
Basic behaviour of transmission lines (TL)
TL
Driver Receiver
Principle of operation
• A signal front given at the input of the TL travels down the TL from the driver side to the receiver side.
• Arriving at the receiver side the signal becomes reflected back to the driver side, then
• at the driver side, the signal will be reflected again toward the receiver side etc.
4. The transfer rate between the MC and the DRAM parts (3)
Transmission lines (TL)
Above ~ 100 MT/s, PC board traces (microstrips) behave like transmission lines with
• a characteristic impedance (ZO) and
• a trace velocity.
4. The transfer rate between the MC and the DRAM parts (4)
Characteristic impedance of PCB traces (ZO) [7]
Table: Typical characteristic impedance values of PCB traces [8]
4. The transfer rate between the MC and the DRAM parts (5)
Trace velocity
Table: Typical trace velocity values of PCB traces [8]
Remark
With 1 ft = 30.48 cm, the equivalent values in cm/ns are:
1.6 ns/ft equals ~ 19 cm/ns
2.0 ns/ft equals ~ 15 cm/ns
2.2 ns/ft equals ~ 14 cm/ns
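The conversion used in the remark (a trace delay in ns/ft turned into a velocity in cm/ns) can be sketched in a few lines of Python. The function name is an illustrative assumption; the three delay values are the ones quoted in the remark.

```python
# Convert a PCB trace delay given in ns/ft into a trace velocity in cm/ns.
# The function name is an illustrative assumption.

CM_PER_FT = 30.48  # 1 ft = 30.48 cm

def trace_velocity_cm_per_ns(delay_ns_per_ft: float) -> float:
    """Velocity is the distance covered (one foot, in cm) divided by the delay."""
    return CM_PER_FT / delay_ns_per_ft

for delay in (1.6, 2.0, 2.2):
    print(f"{delay} ns/ft -> {trace_velocity_cm_per_ns(delay):.1f} cm/ns")
```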
Behaviour of an ideal TL
Ideal TL: no attenuation, no capacitive or inductive loading.
Figure: Equivalent circuit of an ideal transmission line (neglecting attenuation along the TL as well as capacitive and inductive loading of the TL)
with
ZD: internal impedance of the driver
ZO: characteristic impedance of the TL
ZT: impedance terminating the TL
T: flight time over the TL
VO(t): generator voltage
VD(t): voltage at the driver output
VrD(t): reflected voltage at the driver
VR(t): voltage at the receiver
VrR(t): reflected voltage at the receiver
4. The transfer rate between the MC and the DRAM parts (7)
Characteristic equations describing the reflections and the driver/receiver side voltages (based on [9])

At t = 0
Driver side:
$V_O(t=0) = V_O$
$V_D(t=0) = V_D(0) = V_O \cdot \frac{Z_O}{Z_O + Z_D}$
$V_{rD}(t=0) = V_D(t=0)$

At t = T (T: propagation time across the TL), i.e. for n = 1
Receiver side:
$V_R(nT) = V_D((n-1)T) \cdot (1 + r_R)$
$V_{rR}(nT) = V_D((n-1)T) \cdot r_R$
where
$r_R = \frac{Z_T - Z_O}{Z_T + Z_O}$

Characteristic equations (cont.)
At t = nT (n > 1)
Receiver side:
$V_R(nT) = V_R((n-2)T) + V_{rD}((n-1)T) \cdot (1 + r_R)$
$V_{rR}(nT) = V_{rD}((n-1)T) \cdot r_R$
Driver side:
$V_D((n+1)T) = V_D((n-1)T) + V_{rR}(nT) \cdot (1 + r_D)$
$V_{rD}((n+1)T) = V_{rR}(nT) \cdot r_D$
where
$r_D = \frac{Z_D - Z_O}{Z_D + Z_O}$

At t → ∞ (steady state)
Receiver side:
$V_R(t \to \infty) = V_O \cdot \frac{Z_T}{Z_T + Z_D}$
which for a matched line (ZT = ZO) equals $V_O \cdot \frac{Z_O}{Z_O + Z_D}$ and for an open ended line (ZT >> ZO) approaches $V_O$.
Example 1: Open ended ideal TL
VO(t=0) = 2 V, ZD = 25 Ω, ZO = 50 Ω, ZT >> ZO
Figure: Equivalent circuit of an open ended ideal TL
Figure: Ladder diagram and VD(t), VR(t) waveforms of an open ended ideal TL (based on [6])
Wavefront amplitudes in the ladder diagram (V), each travelling to the receiver and being reflected back to the driver: 1.333, -0.444, 0.148, -0.049
VD(t) at t = 0, 2T, 4T, 6T (V): 1.333, 2.222, 1.926, 2.025
VR(t) at t = T, 3T, 5T, 7T (V): 2.666, 1.778, 2.074, 1.976
4. The transfer rate between the MC and the DRAM parts (11)
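The ladder diagram of Example 1 can be reproduced by tracking a single wavefront that bounces between the two ends of the ideal TL. The sketch below is a minimal Python illustration under the assumptions of the ideal TL (no attenuation, no loading); the function name `ladder` and its return format are illustrative, not taken from the slides.

```python
import math

def ladder(v_o, z_d, z_o, z_t, steps=8):
    """Simulate the ladder diagram of an ideal TL.

    Returns two lists of (time in units of T, voltage) pairs:
    the driver side voltages VD and the receiver side voltages VR.
    """
    r_d = (z_d - z_o) / (z_d + z_o)                    # driver reflection coefficient
    # An open ended line (ZT -> infinity) reflects the full wavefront.
    r_r = 1.0 if math.isinf(z_t) else (z_t - z_o) / (z_t + z_o)
    wave = v_o * z_o / (z_o + z_d)                     # wavefront launched at t = 0
    v_d, v_r = wave, 0.0
    driver, receiver = [(0, v_d)], []
    for n in range(1, steps + 1):
        if n % 2 == 1:                                 # wavefront arrives at the receiver
            v_r += wave * (1 + r_r)
            wave *= r_r                                # part reflected toward the driver
            receiver.append((n, v_r))
        else:                                          # wavefront arrives at the driver
            v_d += wave * (1 + r_d)
            wave *= r_d                                # part reflected toward the receiver
            driver.append((n, v_d))
    return driver, receiver

# Example 1: VO = 2 V, ZD = 25 Ohm, ZO = 50 Ohm, open ended line
driver, receiver = ladder(v_o=2.0, z_d=25.0, z_o=50.0, z_t=math.inf)
for n, v in driver:
    print(f"VD({n}T) = {v:.3f} V")
for n, v in receiver:
    print(f"VR({n}T) = {v:.3f} V")
```

Calling the same function with `z_t=50.0` models a line terminated by its characteristic impedance: the receiver reflection coefficient becomes zero and both voltages stay at their first value.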
Figure: Open ended real TL (differential connection) [10]
(D: Driver, R: Receiver, O: Output, I: Input)
Reflections at both ends (R-end, D-end)
4. The transfer rate between the MC and the DRAM parts (12)
Figure: Reflections shown on an eye diagram due to termination mismatch [11]
4. The transfer rate between the MC and the DRAM parts (13)
Implications of the reflections on a TL
Reflections limit the max. data transfer rate of a TL.
• When a data signal is applied at the driver side of the TL, a signal wavefront travels down the TL and is ping-ponged between both ends of the TL until the steady state is reached.
• Until the signal has at least nearly settled, no further wavefront can be applied to the TL, else inter-symbol interference (ISI) arises.
Example
Open ended TL with a length of 10 cm
Assumptions:
• Signal velocity on the TL is 20 cm/ns, so the flight time is T = 0.5 ns.
• Reflections settle to an acceptable level after three round trips (6T).
Then the wavefront of a signal nearly settles after 6 × 0.5 ns = 3 ns.
If this settling time of 3 ns is taken as ½ of the min. cycle time, the min. cycle time is 6 ns and the max. transfer rate of the above open ended TL is ~ 166 MHz.
The max. data transfer rate is limited primarily by the time until the signal settles, that is, it depends both on
• the number of signal round trips until the signal settles, and
• the length of the TL.
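The estimate in the example above can be sketched in Python as follows. The function name and parameter names are illustrative assumptions; the rule that the settling time is taken as half of the min. cycle time is the one used in the example.

```python
# Rough upper bound on the clock rate of an open ended TL,
# following the reasoning of the example (illustrative names).

def max_clock_mhz(length_cm: float, velocity_cm_per_ns: float, round_trips: int) -> float:
    flight_time_ns = length_cm / velocity_cm_per_ns   # T
    settle_ns = 2 * round_trips * flight_time_ns      # e.g. three round trips = 6T
    min_cycle_ns = 2 * settle_ns                      # settling time taken as half a cycle
    return 1000.0 / min_cycle_ns

# 10 cm trace, 20 cm/ns, settled after three round trips
print(f"{max_clock_mhz(10.0, 20.0, 3):.1f} MHz")
```

Halving the trace length or the number of round trips needed for settling doubles the bound, which matches the two dependencies named in the text.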
Open ended TLs may be used only for
• relatively low transfer rates (up to ~ 100 MHz), that is up to SDRAM devices, and
• short distances (up to ~ 10 cm).
For higher transfer rates or longer distances the TL needs to be terminated by its characteristic impedance ZO.
4. The transfer rate between the MC and the DRAM parts (16)
4. The transfer rate between the MC and the DRAM parts (17)
Reducing reflections by a series resistor
A series resistor placed before the TL reduces reflections, resulting in improved signal integrity and higher transfer rates.
4. The transfer rate between the MC and the DRAM parts (18)
Figure: Equivalent circuit of an open ended TL with a series resistor (R3 in the figure) included between the driver and the TL (Micro-Cap 9.0.5.0)
Example 2: Using series resistors to reduce reflections
4. The transfer rate between the MC and the DRAM parts (19)
Figure: Driver (Vout) and receiver (Vin) voltages of an open ended TL with a series resistor R3. The value of R3 is varied from 0 Ω to 25 Ω.
Figure: Series resistors on an SDRAM module inserted into the DQ, DQS, DM lines (RS = 10 or 22 Ω). The memory controller drives the Command/Control/Address and the DQ, DQS, DM lines of two SDR DIMMs (Slot 1, Slot 2) using LVTTL signaling.
4. The transfer rate between the MC and the DRAM parts (20)
Matched TLs
Needed above ~ 100 MHz (i.e. for DDR/DDR2/DDR3 memories).
Basic scheme for unidirectional signals (assuming SSTL signaling)
Figure: Termination of a TL with its characteristic impedance [12] (transmitter, TL with ZO = 50 Ohm, receiver with the termination resistor RT connected to VT)
RT: 50 Ohm
VREF: 0.5 × output voltage
VT: termination voltage = VREF
SSTL: Stub Series Termination Logic
Example 3: Perfectly terminated ideal TL
VO(t=0) = 2 V, ZD = 25 Ω, ZO = 50 Ω, ZT = 50 Ω
Figure: Equivalent circuit of a perfectly terminated ideal TL
Figure: Ladder diagram and waveforms VD(t), VR(t) of a perfectly matched ideal TL (based on [6])
A single wavefront of 1.33 V travels to the receiver and is not reflected (reflected wavefront: 0). VD(t) = 1.33 V from t = 0 and VR(t) = 1.33 V from t = T onwards.
4. The transfer rate between the MC and the DRAM parts (23)
Figure: Perfectly matched real TL (differential connection) [10]
RT = ZO
No reflections from the receiver end
4. The transfer rate between the MC and the DRAM parts (24)
The problem of TL inhomogeneity
• The TL connecting the memory controller and the DRAM devices is not homogeneous; it consists of multiple sections.
• Between different TL sections there are discontinuities that give rise to reflections.
Figure: Discontinuities of TLs connecting the memory controller and the memory modules (based on [6])
Figure: Discontinuities of TLs connecting the slot to the particular DRAM devices assuming stub-bus topology and a registered memory module [5]
4. The transfer rate between the MC and the DRAM parts (27)
Addressing the problem of TL discontinuities
SSTL termination (Stub Series Termination Logic)
Principle
Use both perfect termination and a series resistor (RS) to increase the TL attenuation and thus reduce reflections from the memory module back to the memory controller [6].
Used in DDR/DDR2/DDR3 devices.
Figure: SSTL termination of a unidirectional signal [12] (transmitter, series resistor RS, TL with ZO = 50 Ohm, receiver with the termination resistor RT connected to VT)
RS: 22/25 Ohm
RT: 50 Ohm
VREF: 0.5 × output voltage
VT: termination voltage = VREF
4. The transfer rate between the MC and the DRAM parts (29)
Figure: Equivalent circuit of two TLs (T1, T2) with slightly different characteristic impedances and a series resistor (R3); T2 is terminated by 50 Ohm and 3 pF.
4. The transfer rate between the MC and the DRAM parts (30)
Discontinuities of the transmission line generate reflections
Figure: Driver (Vout) and receiver (Vin) voltages of the previous equivalent circuit
Higher series resistor values attenuate reflections but lower the steady state output voltage
Figure: Driver (Vout) and receiver (Vin) voltages of the previous equivalent circuit. The value of R3 is varied from 0 Ω to 25 Ω.
Higher output capacitance values lower the reflections
Figure: Driver (Vout) and receiver (Vin) voltages of the previous equivalent circuit. The value of C3 is varied from 0 pF to 9 pF.
Note
With increasing value of RS (from 2 Ohm to 22 Ohm) the amplitude of the reflected voltage at the receiver side clearly decreases.
4. The transfer rate between the MC and the DRAM parts (33)
Example 1: Line terminations in a DDR memory
Figure: Line terminations in a DDR memory (memory controller and two DDR DIMMs in Slot 1 and Slot 2; Command/Control/Address and DQ, DQS/#, DM lines with SSTL_2 signaling, series resistors RS1 and RS2, termination resistors RT connected to VTT)
(RS1: 7.5 Ω for 4 devices, 5.1 Ω for 8 devices, 3 Ω for 16 devices; RS2 = 22 Ω; RT = 56 Ω)
4. The transfer rate between the MC and the DRAM parts (34)
In order to achieve higher transfer rates, more and more sophisticated line terminations are needed.
Examples: Synchronous DRAMs (commodity DRAMs)
Example 2: Line terminations in a DDR2 memory
Figure: Line terminations in a DDR2 memory (memory controller and two DDR2 DIMMs in Slot 1 and Slot 2; Command/Control/Address and DQ, DQS/#, DM lines with SSTL_18 signaling, series resistors RS1 and RS2, termination resistors RTT connected to VTT, On-Die Termination (ODT, resistors R1/R2) in the DRAM devices)
(RS1: 10 Ω for 4 devices, 5.1-10 Ω for 8 devices, 7.5 Ω for 16 devices; RS2 = 22 Ω; RTT = 47 Ω)
On-Die Termination (ODT)
4. The transfer rate between the MC and the DRAM parts (36)
Example 3: Line terminations in a DDR3 memory
Figure: Line terminations in a DDR3 memory (memory controller and two DDR3 DIMMs; Command/Control/Address and DQ, DQS/#, DM lines with SSTL_15 signaling, series resistors Rs, termination resistors RT connected to VTT, dynamic ODT (resistors R1/R2) and ZQ pins with RZQ connected to Vss in the DRAM devices)
(Rs = 10-15 Ω, RT = 36-39 Ω, RZQ = 240 Ω ±1%)
Remark: Due to the fly-by module topology no series resistors are needed for the Command/Control/Address lines.
Dynamic On-Die Termination (ODT) option: to optimize the termination resistors along with each write command.
ZQ calibration: to adjust the "on" and the "termination" impedances of the merged drivers every 128 ms.
4. The transfer rate between the MC and the DRAM parts (37)
Table: Implementation details of SDRAM types
(C/C/A: Command/Control/Address, DQ: Data, DQM: Data Mask, DQS: Data Strobe)
Feature | SDRAM | DDR SDRAM | DDR2 SDRAM | DDR3 SDRAM
Signaling, C/C/A | LVTTL | SSTL_2 | SSTL_18 | SSTL_15
Signaling, Clock (CLK/CK) | LVTTL | SSTL_2 Diff. | SSTL_18 Diff. | SSTL_15 Diff.
Signaling, DQ, DQM | LVTTL | SSTL_2 | SSTL_18 | SSTL_15
Signaling, DQS | -- | SSTL_2 | SSTL_18 / SSTL_18 Diff. | SSTL_15 Diff.
Terminations, C/C/A, RS | No RS | RS on module | RS on module | No RS
Terminations, C/C/A, RT | No RT | RT on board | RT on board | RT on module
Terminations, DQ/DQS/DM, RS | RS on module | RS on module | RS on module | RS on module
Terminations, DQ/DQS/DM, RT | -- | RT on board | ODT (RT on die) | Dyn. ODT (RT on die)
Driver architecture | Separate output/termination drivers | Separate output/termination drivers | Separate output/termination drivers | Merged output/termination drivers with ZQ calibration (during power-up/periodically)
Synchronization, basic scheme | Central clock | Source synchronization | Source synchronization | Source synchronization
Synchronization, aligning DQS with CK | No DQS | DLL | DLL | DLL + read/write leveling to compensate fly-time skews between DQS and CK (during power-up)
Posted reads/writes | No | No | Yes | Yes
Reset pin | No | No | No | Yes
DIMM topology | Stub architecture | Stub architecture | Stub architecture | Fly-by architecture
Packaging | TSOP-54 | TSOP-54, BGA-60 | BGA-60 for x4/x8, BGA-84 for x16 | BGA-78 for x4/x8, BGA-96 for x16
Line terminations of recent commodity DRAMs have already achieved a rather high grade of sophistication; there is not too much headroom remaining for further improvements.
4. The transfer rate between the MC and the DRAM parts (39)
5. Challenges in increasing the rate of capturing data in the memory controller/DRAM parts
5.1 Coping with capturing data
5.2 Using more advanced signaling
5.3 Using more advanced synchronization
Basics of capturing data
• Data and commands are latched by D flip-flops.
• For correctly capturing data or commands, input signals need to be held valid for specified periods of time before and after the clock pulse, termed the setup time (tS) and the hold time (tH), as shown in the figure.
Figure: Temporal requirements for correctly capturing data (D flip-flop with Data/Commands and Clock inputs and output Q; tS before and tH after the clock edge)
5.1 Coping with capturing data (1)
Hold Time (tH)
the minimum time interval for which the input signal must remain valid (high or low) following the clock edge in order to capture the data bit correctly.
Setup time (tS)
the minimum time interval for which the input signal must remain valid (high or low) prior to the clock edge in order to capture the data bit correctly.
5.1 Coping with capturing data (2)
Specification of the setup time (tS) and the hold time (tH)
In device datasheets, e.g. in case of a DDR-400 device:
Table: Excerpt from the specification of the dynamic parameters of a DDR-400 device [13]
Note: A DDR-400 device is clocked at 200 MHz, so its clock period is 5 ns and its half clock period is 2.5 ns. By contrast, its setup and hold times are 0.4 ns each (designated as tDS, tDH in the table).
5.1 Coping with capturing data (3)
Minimum data valid window (DVW)
the minimum time interval for which the input signal must remain valid (high or low) before and after the clock edge in order to capture the data bits correctly.
The minimum DVW has two characteristics:
• a size, that is the sum of the setup time (tS) and the hold time (tH), and
• a correct phase related to the clock edge, to satisfy both the tS and tH requirements.
Figure: Interpretation of the minimum DVW for ideal signals
If tS = tH, the clock edge needs to be center aligned with the DVW, as indicated below.
Figure: Center aligned clock edge within the min. DVW
Example
In a DDR-400 SDRAM tS = tH = 0.4 ns [13], then
• the min. DVW is 0.8 ns, i.e. roughly 1/3 of the half clock period (2.5 ns at a clock rate of 200 MHz), and
• the clock edge needs to be center aligned in the min. DVW.
5.1 Coping with capturing data (5)
Available DVW
the time interval for which the input signal remains valid (high or low).
Figure: Interpretation of the minimum DVW and the available DVW for ideal signals
For correctly capturing data, two requirements need to be fulfilled:
• the available DVW ≥ the min. DVW, and
• the clock edge needs to be properly aligned (usually center aligned) within the available DVW.
Note
Assuming tS = tH (as usual), for the highest transfer rate the clock signal needs to be center aligned with the data.
Reduction of the available DVW in real systems
In real systems the available DVW is reduced due to
• skews and
• jitter.
Skew
is a time offset of the signal edges
• between different occurrences of the same signal, such as a clock, at different locations on a chip or a PC board (as shown in the figure below), or
• between different bit lines of a parallel bus at a given location.
Figure: Skew due to propagation delay [15]
Skews arise mainly due to
- propagation delays in the PC-board traces, also termed time of flight (TOF) (about 170 ps/inch), as indicated above [14],
- capacitive loading of a PC-board trace (about 50 ps per pF), as indicated in the subsequent figure [14],
- SSO (Simultaneous Switching Output), occurring due to parasitic inductances when a number of bit lines change their output states simultaneously.
Figure: Skew due to capacitive loading of signal lines [14] (skew between CK-1 and CK-2)
5.1 Coping with capturing data (10)
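The two rule-of-thumb figures quoted above (about 170 ps/inch of trace and about 50 ps per pF of load) allow a rough skew estimate between two lines. The following Python sketch is an illustration only; the function names and the two example lines (trace length, load capacitance) are hypothetical values, not data from the slides.

```python
# Rough skew estimate between two signal lines from trace length and load.
# The constants are the rule-of-thumb values quoted in the text;
# function names and the example lines are illustrative assumptions.

TOF_PS_PER_INCH = 170.0  # propagation delay of a PC-board trace
LOAD_PS_PER_PF = 50.0    # additional delay per pF of capacitive loading

def arrival_delay_ps(trace_len_inch: float, load_pf: float) -> float:
    """Delay of a signal edge caused by time of flight and capacitive loading."""
    return trace_len_inch * TOF_PS_PER_INCH + load_pf * LOAD_PS_PER_PF

def skew_ps(line_a, line_b) -> float:
    """Skew is the time offset between the edges arriving over the two lines."""
    return abs(arrival_delay_ps(*line_a) - arrival_delay_ps(*line_b))

# Hypothetical lines: (trace length in inches, load in pF)
print(f"{skew_ps((4.0, 3.0), (5.0, 5.0)):.0f} ps")
```

SSO effects are not modelled here, since the text gives no numeric rule for them.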
Reduction of operational tolerances due to skews
Figure: Reduction of operational tolerances due to clock skew (ideal signals assumed); left: center aligned clock, right: skewed clock within the available DVW
A larger than indicated skew would even jeopardize or prevent correct operation.
Deskewing of clock distribution is needed.
5.1 Coping with capturing data (11)
Jitter
• phase uncertainty causing ambiguity in the rising and falling edges of a signal, as shown in the figure below,
• has a stochastic nature.
Figure: Jitter of signal edges [15]
5.1 Coping with capturing data (12)
The main sources of jitter are
• Crosstalk caused by coupling adjacent traces on the board or in the DRAM device,
• ISI (Inter-Symbol Interference) caused by cycling the bus faster than it can settle,
• Reflection noise due to mismatching termination of signal lines,
• EMI (Electromagnetic Interference) caused by electromagnetic radiation emitted from external sources.
Jitter obviously narrows the available DVW, as shown in the following example for DDR-200 devices.
(DDR-200 devices are clocked at 100 MHz, thus their half clock period is 5 ns.)
5.1 Coping with capturing data (13)
Narrowing the available DVW due to jitter
Figure: Narrowing the available DVW due to jitter (DQ signal with a half clock period of ~ 5 ns; available DVW with and without jitter)
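The half clock period used in this example is the time available per transferred bit on a double data rate interface. A small Python sketch of the relation is given below; the function names are illustrative assumptions.

```python
# Half clock period of a DDR device, i.e. the ideal bit time on the data lines.
# Function names are illustrative assumptions.

def ddr_clock_mhz(transfer_rate_mts: float) -> float:
    """DDR devices transfer data on both clock edges: clock = transfer rate / 2."""
    return transfer_rate_mts / 2.0

def half_clock_period_ns(clock_mhz: float) -> float:
    """Half of the clock period, in ns."""
    return 0.5 * 1000.0 / clock_mhz

# DDR-200: 200 MT/s -> 100 MHz clock -> 5 ns half clock period
clock = ddr_clock_mhz(200.0)
print(f"clock = {clock:.0f} MHz, half clock period = {half_clock_period_ns(clock):.1f} ns")
```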
The timing budget of the available DVW
The available DVW needs to cover
• the min. requested DVW (tS + tH),
• all possible sources of skews,
• all possible sources of jitter.
Figure: Interpretation of the timing budget of the available DVW (the min. DVW, flanked by skews/jitters on both sides, within the available DVW)
Note
The white areas before and after the min. DVW represent available timing margins.
5.1 Coping with capturing data (14)
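The timing budget described above amounts to a simple subtraction: whatever remains of the available DVW after the min. DVW, the skews and the jitter is the timing margin. The Python sketch below illustrates this; the function name and all numeric values in the usage example are hypothetical and are not taken from any datasheet.

```python
# Timing margin left in the available DVW (all values in ns).
# Function name and example numbers are illustrative assumptions.

def timing_margin_ns(available_dvw, t_setup, t_hold, skews, jitters):
    """Margin = available DVW - min. DVW (tS + tH) - all skews - all jitter."""
    min_dvw = t_setup + t_hold
    return available_dvw - min_dvw - sum(skews) - sum(jitters)

# Hypothetical budget: 2.5 ns available, tS = tH = 0.4 ns,
# two skew contributions and one jitter contribution
margin = timing_margin_ns(2.5, 0.4, 0.4, skews=[0.3, 0.2], jitters=[0.25])
print(f"margin = {margin:.2f} ns, budget met: {margin >= 0}")
```

A negative result would mean that the budget is not met and data cannot be captured reliably at the given rate.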
Example
Timing budget of a DDR-266 memory
Table: Timing budget of a DDR-266 memory [16]
Remark
The table uses partly different terminology, as follows:
Total skew: Available DVW
Transmitter skew: Setup time
Receiver skew: Hold time
VREF noise: OSS
CIN mismatch: Skew due to different capacitive loading
5.1 Coping with capturing data (15)
Note
The crucial sources and actual extent of occurring skews and jitters depend on
• the frequency range in question,
• DRAM type used,
• mainboard and memory module implementation details.
Timing budget tuning is a main task of developing DRAM devices/modules and mainboards.
5.1 Coping with capturing data (16)
Shrinking the available DVW for higher transfer rates
Higher data rates → shorter clock periods → shorter available DVWs
This is one of the key problems to be handled for achieving higher data rates.
Figure: Shrinking the available DVW while raising the data rate from PC-133 to DDR-400 and DDR2-800 [17]
tDV: width of the available DVW
5.1 Coping with capturing data (17)
Addressing the problem of shrinking (available) DVWs in order to raise DRAM speed
Reducing skews and jitters by
• using more advanced signaling techniques, such as SSTL (Stub Series Terminated Logic) or LVDS (Low Voltage Differential Signaling), instead of open-ended LVTTL (Low Voltage TTL),
• using more efficient synchronisation schemes than central clocking, such as source-synchronous synchronisation,
• using DLLs/PLLs to align clock or data strobe edges.
Using more advanced signaling techniques
5.2 Using more advanced signaling (1)
Signal types
Figure: Overview of signal types (achievable data rates increase from ground referenced to differential signals)
Ground referenced
• TTL (5 V): PCI
• LVTTL (3.3 V): SDRAM, PCI, PCI-X, AGP1.0
Voltage referenced, single ended (signal compared to VREF)
• SSTL single ended signals: SSTL_2 (2.5 V) (DDR), SSTL_18 (1.8 V) (DDR2), SSTL_15 (1.5 V) (DDR3)
• AGP2.0 (1.5 V)
• AGP3.0 (0.8 V)
Differential (signal pair S+, S- around VCM)
• LVDS: HyperTransport, SATA, Ultra-2 SCSI and later, PCI-E
• HVDS: SCSI-1
• SSTL differential signals
LVTTL: Low Voltage TTL
LVDS: Low Voltage Differential Signaling
HVDS: High Voltage Differential Signaling
SSTL: Stub Series Terminated Logic
VREF: Reference Voltage
VCM: Common Mode Voltage
Signal types used in mainstream DRAMs
Figure: Signal types used in mainstream DRAMs (earliest DRAMs (1K/4K) omitted)
Signal type | TTL | LVTTL (Low Voltage TTL) | SSTL (Stub Series Termination Logic)
Voltage | 5 V | 3.3 V | 2.5/1.8/1.5 V
Reference | Ground referenced | Ground referenced | Voltage referenced, single ended/differential
Termination | Open ended | Open ended | Terminated
Used in the DRAM types | Page Mode, FPM, EDO | FPM, EDO, SDRAM | DDR, DDR2, DDR3
5.2 Using more advanced signaling (3)
Table: Signal types of the main signal groups in synchronous DRAM devices
Signal group                                    SDRAM    DDR SDRAM      DDR2 SDRAM                DDR3 SDRAM
Comm./Control/Addr./Data (DQ)/Data Mask (DM)    LVTTL    SSTL_2         SSTL_18                   SSTL_15
Clock (CLK/CK)                                  LVTTL    SSTL_2 Diff.   SSTL_18 Diff.             SSTL_15 Diff.
Data Strobe (DQS)                               --       SSTL_2         SSTL_18 / SSTL_18 Diff.   SSTL_15 Diff.
5.2 Using more advanced signaling (4)
Figure: Input/output characteristics of TTL signals as used in PM/FPM/EDO devices (based on [6])
(TTL inverter, Vout plotted against Vin over 0 to 5.0 V; marked levels: VIL max = 0.8 V, VIH min = 2.0 V, VOL max = 0.4 V, VOH min = 2.4 V)
5.2 Using more advanced signaling (5)
Figure: Input/output characteristics of LVTTL signals (based on [6])
(LVTTL inverter, Vout plotted against Vin over 0 to 3.3 V; marked levels: VIL max = 0.8 V, VIH min = 2.0 V, VOL max = 0.4 V, VOH min = 2.4 V)
5.2 Using more advanced signaling (6)
Stub Series Terminated Logic (SSTL)
Three generations
• SSTL_2: VDDQ = 2.5 V, JESD8-9 (Sept. 1998), used in DDR SDRAMs
• SSTL_18: VDDQ = 1.8 V, JESD8-15A (Oct. 2002), used in DDR2 SDRAMs
• SSTL_15: VDDQ = 1.5 V, used in DDR3 SDRAMs
5.2 Using more advanced signaling (7)
SSTL signals
• Single ended (signal compared to VREF), used as
  • Command/Control/Address,
  • Data (DQ), Data Mask (DM),
  • Data Strobe (DQS) in DDR/DDR2
• Differential (signal pair S+/S- around VCM), used as
  • Clock (CK),
  • Data Strobe (DQS) in DDR2/3
Figure: Types of SSTL signals
5.2 Using more advanced signaling (8)
The static view
Figure: Input/output characteristics of single ended SSTL signals (based on [6])
(SSTL inverter with reference input VREF, Vout plotted against Vin over 0 to 2.5 V; marked input levels: VREF = 1.25 V, VIL max = VREF - 150 mV, VIH min = VREF + 150 mV; output levels VOL max and VOH min marked, with Vout axis ticks at 0.375 V and 2.125 V)
5.2 Using more advanced signaling (9)
The dynamic view
Figure: Interpretation of characteristic input levels of single ended SSTL signals [18] (state changes are marked in the figure)
AC values: define the timing specifications the receiver needs to meet (e.g. slew rate).
DC values: define the final logic state.
A certain amount of time (hold time) after the input has crossed the DC threshold and then also the AC threshold, the device will switch state and will not switch back as long as the input stays beyond the DC threshold [18].
Figure: Using AC values for defining the falling and rising slew rates of single ended SSTL signals [19]
5.2 Using more advanced signaling (10)
Table: Characteristic input levels of single ended SSTL signals in DDR/DDR2/DDR3 devices [20], [21], [22]
                 DDR              DDR2             DDR3
VDDQ             2.5 V            1.8 V            1.5 V
VREF             1.25 V           0.9 V            0.75 V
VIH (ac) min.    VREF + 310 mV    VREF + 250 mV    VREF + 175 mV
VIH (dc) min.    VREF + 150 mV    VREF + 125 mV    VREF + 100 mV
VIL (dc) max.    VREF - 150 mV    VREF - 125 mV    VREF - 100 mV
VIL (ac) max.    VREF - 310 mV    VREF - 250 mV    VREF - 175 mV
VSS              Ground           Ground           Ground
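The input levels in the table above can be turned into a small lookup that classifies a static input voltage. This is a minimal sketch: the function names and the result labels are hypothetical, and the time-dependent behaviour (slew rate, the time spent beyond a threshold) is deliberately ignored.

```python
# Threshold offsets from VREF in volts, copied from the table above.
SSTL_LEVELS = {
    "DDR":  {"vref": 1.25, "ac": 0.310, "dc": 0.150},
    "DDR2": {"vref": 0.90, "ac": 0.250, "dc": 0.125},
    "DDR3": {"vref": 0.75, "ac": 0.175, "dc": 0.100},
}


def input_thresholds(generation):
    """Return the four absolute input thresholds (in volts) of a generation."""
    lv = SSTL_LEVELS[generation]
    return {
        "VIH(ac)min": lv["vref"] + lv["ac"],
        "VIH(dc)min": lv["vref"] + lv["dc"],
        "VIL(dc)max": lv["vref"] - lv["dc"],
        "VIL(ac)max": lv["vref"] - lv["ac"],
    }


def classify(generation, vin):
    """Classify a static input voltage against the AC and DC levels."""
    t = input_thresholds(generation)
    if vin >= t["VIH(ac)min"]:
        return "high (ac)"
    if vin >= t["VIH(dc)min"]:
        return "high (dc)"
    if vin <= t["VIL(ac)max"]:
        return "low (ac)"
    if vin <= t["VIL(dc)max"]:
        return "low (dc)"
    return "undefined"


print(classify("DDR2", 1.2))   # above VREF + 250 mV
print(classify("DDR3", 0.8))   # between the two DC thresholds
```

Note how the band between the DC thresholds narrows from 300 mV (DDR) to 200 mV (DDR3) as VDDQ drops.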
5.2 Using more advanced signaling (11)
Figure: Interpretation of characteristic input levels of differential SSTL signals [19]
(VTR: true level, VCP: complementary level)
Table: Characteristic input levels of differential SSTL signals (CK/CK#, DQS/DQS#) in DDR/DDR2/DDR3 devices [20], [22], [19]
         DDR       DDR2      DDR3
VDDQ     2.5 V     1.8 V     1.5 V
VREF     1.25 V    0.9 V     0.75 V
VID      620 mV    500 mV    400 mV
VIX      VREF      VREF      VREF
VSS      Ground    Ground    Ground
5.2 Using more advanced signaling (12)
Skew reduction by differential data strobes (DQS, DQS#)
Figure: Skew reduction while using differential strobes instead of single ended strobes [23]
5.2 Using more advanced signaling (13)
The eye diagram
Visualizes both signal traces (belonging to the H and L levels) by overlapping subsequent symbols in time, as indicated below for both an ideal and a real signal.
Figure: Eye diagram of an ideal and a real signal (the real signal shows jitter and reflections)
5.2 Using more advanced signaling (14)
The eye diagram is a favorable way
• to visualize reflections and jitter, and
• to contrast expected and available values both for the DVW and voltage levels.
Visualizing both min. and available DVWs and voltage margins by means of an eye diagram
5.2 Using more advanced signaling (15)
Figure: Eye diagram of an ideal signal showing both min. and available DVW and voltage levels
(marked: VIH min, VIL max, the min. DVW made up of tS and tH, and the timing and voltage margins between the min. DVW and the data eye)
5.2 Using more advanced signaling (16)
Figure: Eye diagram of a real signal showing both min. and available DVW and voltage levels [24]
For correct operation:
available DVW and voltage values ≥ required values.
Stable operation needs reasonable temporal margins (timing budget) and voltage margins.
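The condition above can be written as a small margin check. This is only a sketch under stated assumptions: the required DVW is taken as setup time plus hold time (tS + tH, as the labels of the eye diagram suggest), the function name is hypothetical, and all numbers in the example call are made up for illustration.

```python
def capture_margins(available_dvw_ps, t_setup_ps, t_hold_ps,
                    eye_high_mv, eye_low_mv, vih_min_mv, vil_max_mv):
    """Return timing and voltage margins; negative values mean a violation."""
    required_dvw_ps = t_setup_ps + t_hold_ps  # assumed: min. DVW = tS + tH
    margins = {
        "timing_ps": available_dvw_ps - required_dvw_ps,
        "voltage_high_mv": eye_high_mv - vih_min_mv,
        "voltage_low_mv": vil_max_mv - eye_low_mv,
    }
    # Operation is correct only if no margin is negative.
    margins["ok"] = all(value >= 0 for value in margins.values())
    return margins


# Illustrative numbers only (not taken from any datasheet).
print(capture_margins(600, 200, 200, 1150, 700, 1025, 775))
```

In practice the margins should not just be non-negative but large enough to absorb temperature, voltage and manufacturing variations.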
5.2 Using more advanced signaling (17)
5.3 Using more advanced synchronisation (1)
Figure: Contrasting central clocking (SDRAMs) and source synchronised clocking (DDR SDRAMs) while writing random data [25], [13]
(tDQSS: Write command to first DQS latching transition)
• Central clock synchronization (SDRAMs): address, command and data lines are latched by the central clock (CLK).
• Source synchronization (DDR SDRAMs): command and address lines are latched by the differential clock (CK, CK#), but data are latched by the source synchronous data strobe DQS.
5.3 Using more advanced synchronisation (5)
Improving the basic synchronisation scheme
Basic synchronisation schemes
Central clock synchronisation
A central clock is used to latch (capture) addresses, commands and data from the respective buses or to send fetched data.
• Leads to high skews due to propagation delays (time of flight), different path lengths, different loading of the traces etc.
• SDRAMs and earlier DRAMs are centrally clocked.
Source synchronisation
An extra data strobe signal (DQS) is provided to accompany data sent from the driving unit to the receiving unit.
• The data strobe signal travels along with the data lines and thus largely removes the propagation delay differences between the capturing edge and the data.
• The data strobe signal (DQS) is bidirectional to reduce pin count.
• DDR SDRAMs are source synchronised.
Figure: Basic synchronisation schemes
5.3 Using more advanced synchronisation (6)
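The difference between the two schemes can be illustrated with a toy calculation. This is a strongly simplified sketch: the function name and all arrival times are hypothetical, and effects such as jitter, loading and the clock period itself are not modelled. It only shows that a capturing edge routed together with the data sees much less misalignment than one distributed separately.

```python
def worst_case_skew_ps(data_arrivals_ps, capture_edge_ps):
    """Largest distance between the capturing edge and any data arrival."""
    return max(abs(t - capture_edge_ps) for t in data_arrivals_ps)


# Hypothetical arrival times (ps) of four data lines at the receiver.
dq_arrivals_ps = [980, 1000, 1020, 1040]

# Central clocking: the clock edge reaches the receiver on its own path.
central = worst_case_skew_ps(dq_arrivals_ps, capture_edge_ps=400)

# Source synchronisation: DQS is routed alongside the data lines.
source_sync = worst_case_skew_ps(dq_arrivals_ps, capture_edge_ps=1010)

print(central, source_sync)  # 640 30
```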
Required phase alignments for synchronous DRAM devices, controllers and modules
In case of SDRAM devices
• Memory controllers need to perform the following alignments:
  • for all commands: center align address, command and control signals with the clock (CLK),
  • for data writes: center align data signals (DQ) with the clock (CLK),
  • for data reads: SDRAM devices do not perform any alignment on the data sent to the controller; it is the task of the controller to shift the CLK edge to the center of the data eye.
• SDRAM devices do not need to perform any phase alignments; however,
  • for data reads they have to guarantee that the required minimal data hold time (tOH) is satisfied, see Figure.
• SDRAM modules need to perform clock deskewing for the clock (CLK) distribution circuitry.
In case of DDRx SDRAM devices
Figure: Required phase alignments in case of DDRx devices
• Memory controllers need to perform the following alignments:
  • for all commands: center align address, command and control signals with the clock (CK),
  • for data writes: center align data signals (DQ) with the data strobe (DQS),
  • for data reads: DDRx devices send data strobe signals (DQS) edge aligned with the data signals (DQ). It is then the task of the controller to shift the DQS edge to the center of the data eye.
• DDRx devices perform the following alignment:
  • for data reads: they edge align the data strobe signal (DQS) with the data signal (DQ).
5.3 Using more advanced synchronisation (15)
Example: Shifting DQS into the center of DQ
Figure: DDR2 write operation at 800 MT/s showing 90° shift of the differential DQS into the center of the data eye [27]
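For the 800 MT/s example, the size of a 90° shift can be worked out from the clock period. The sketch below assumes double data rate operation (two transfers per clock period), so that 800 MT/s corresponds to a 400 MHz clock; the function name is hypothetical.

```python
def dqs_shift_ps(transfer_rate_mt_s, phase_deg=90.0):
    """Delay in picoseconds that corresponds to a phase shift of the clock."""
    clock_mhz = transfer_rate_mt_s / 2.0   # DDR: two transfers per clock
    t_ck_ps = 1e6 / clock_mhz              # clock period in picoseconds
    return t_ck_ps * phase_deg / 360.0


print(dqs_shift_ps(800))  # 625.0
```

A 90° shift of the 2.5 ns clock period is 625 ps, which is half of the 1.25 ns bit time and therefore places the strobe edge in the middle of the data eye.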
5.3 Using more advanced synchronisation (17)
The rationale of this alignment scheme is
• to keep DRAM devices as simple as possible and put complexity into the memory controller [27],
• by centralizing the DLL circuitry needed to accomplish alignments in a single place, that is, in the memory controller, and thus avoiding the need to replicate DLLs in every DRAM device (except the DLLs needed in the DRAMs to edge align the DQS with CK for reads) [26].
5.3 Using more advanced synchronisation (18)
Furthermore
• DDRx modules need to perform clock deskewing for the clock (CK) distribution circuitry.
DLLs (Delay Locked Loops)
used to
• edge align or deskew two signals, or
• center align the data strobe signal (DQS) with the data signal (DQ).
5.3 Using more advanced synchronisation (19)
Figure: Deskewing the CLK signal with reference to the CLKREF signal by means of a DLL
(signals shown: CLKREF, CLK, the delayed clock CLKD and CLKOUT)
Figure: Shifting the data strobe (DQS) to the center of the data signal (DQ) by means of a DLL
Simplified block diagram and principle of operation of a DLL
Figure: Block diagram and principle of operation of the DLL by deskewing the clock signal CLK
(blocks shown: delay line built of delay elements, phase delay control, clock distribution network; signals: CLK, CLKREF, CLKOUT)
The DLL is built up mainly of a delay line and a phase delay control unit. The phase delay control unit inserts delay on the clock signal (CLK) until the rising edge of the clock signal (CLK) is in phase with the rising edge of the reference clock signal (CLKREF).
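The locking behaviour described above can be mimicked with a few lines of code. This is a behavioural sketch only, not a circuit model: it assumes a delay line adjustable in fixed taps and a clock distribution network with a constant delay, and the function name, the tap size and the example values are all hypothetical.

```python
def lock_dll(period_ps, network_delay_ps, tap_ps=25, max_taps=256):
    """Step the delay line until the distributed clock is in phase with CLKREF.

    Returns (taps, inserted_delay_ps, residual_error_ps).
    """
    for taps in range(max_taps):
        inserted_ps = taps * tap_ps
        # Phase of the delayed clock edge relative to the reference edge.
        phase_error_ps = (inserted_ps + network_delay_ps) % period_ps
        if phase_error_ps < tap_ps:  # aligned to within one tap
            return taps, inserted_ps, phase_error_ps
    raise RuntimeError("delay line too short to lock")


taps, inserted, residual = lock_dll(period_ps=2500, network_delay_ps=1800)
print(taps, inserted, residual)  # 28 700 0
```

With a 2.5 ns period and 1.8 ns of network delay the loop inserts 700 ps, so that the total delay equals exactly one clock period and the edges coincide.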
5.3 Using more advanced synchronisation (20)
based on [28]
„Warm up” time of DLLs
In a DRAM device the DLL will be activated during initialization (power-up procedure). After being enabled, however, the DLL needs about 200 clock cycles to lock [13]; only then can a read command be issued.
Remark [6]
• PLLs and DLLs fulfill similar tasks. However,
  • PLLs include a voltage controlled oscillator (VCO) that generates a new clock signal, whose phase is adjustable.
  • DLLs include a delay line that inserts a voltage controlled phase delay between the input and output signal.
  While DLLs just delay the incoming signal to achieve a phase alignment, PLLs actually synthesize a new clock signal, whose phase is adjustable.
• Since DLLs do not incorporate a VCO, they are cheaper to implement than PLLs.
Memory controllers and DRAM devices of synchronous DRAMs make use of DLLs to implement phase alignments. In contrast, memory modules use PLLs to deskew clock distribution networks.
5.3 Using more advanced synchronisation (21)
6. Main limiters of increasing the memory size
6. Main limiters of increasing the memory size (1)
The memory size is given basically by the amount of memory installed in the memory system:
Memory size (CM)
CM = nCU × nCH × nM × nR × CR
with
nCU: No. of north bridges/memory control units
nCH: No. of memory channels per north bridge/control unit
nM: No. of memory modules per channel
nR: No. of ranks per memory module
CR: Rank capacity (device density × no. of DRAM devices per rank)
E.g. the Core 2 based P35 chipset supports up to two memory channels with up to two dual-ranked memory modules per channel, with eight x8 devices of 512 Mb or 1 Gb density per rank.
The resulting maximum memory capacity is:
CMmax = 1 × 2 × 2 × 2 × (8 × 1 Gb) = 64 Gb = 8 GB
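The P35 example can be recomputed with a direct transcription of the formula. In this sketch the function and parameter names are hypothetical, and 1 Gb is taken as 2^30 bits.

```python
GBIT = 2**30  # bits in 1 Gb (binary interpretation, an assumption)


def memory_size_bits(n_cu, n_ch, n_m, n_r, devices_per_rank, density_bits):
    """CM = nCU x nCH x nM x nR x CR, with CR = devices per rank x density."""
    rank_capacity_bits = devices_per_rank * density_bits
    return n_cu * n_ch * n_m * n_r * rank_capacity_bits


# P35: 1 control unit, 2 channels, 2 modules/channel, 2 ranks/module,
# eight x8 devices of 1 Gb per rank.
size_bits = memory_size_bits(1, 2, 2, 2, 8, 1 * GBIT)
print(size_bits // 8 // 2**30, "GB")  # 8 GB
```

With 512 Mb devices the same configuration yields 4 GB.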
6. Main limiters of increasing the memory size (2)
Crucial factors limiting the maximum size of main memories
• nM: No. of memory modules supported per memory channel
• CR: Rank capacity (device density × no. of DRAM devices/rank).
Remark
Beyond the max. installable memory, the max. memory size may be limited by particular constraints, such as the supported max. addressable space due to the number of address pins on the FSB, like in the 925X and 925XE desktop chipsets [31].
Number of memory modules supported per memory channel
• Modules connected via a parallel bus (SDRAM, DDR, DDR2, DDR3 modules): 1-4 memory modules.
  Higher transfer rates limit the number of memory modules, typically to one or two.
• Modules connected via a serial bus (FBDIMM modules): 6-8 memory modules.
Figure: Number of memory modules supported by memory channel
E.g.
6. Main limiters of increasing the memory size (3)
Figure: Max. number of supported memory modules (slots)/channel in Intel’s desktop chipsets
(x-axis: transfer rate, 133 to 1600 MT/s; y-axis: slots/channel, 1 to 4)
6. Main limiters of increasing the memory size (4)
Figure: Max. number of supported memory modules (slots)/channel in Intel’s server chipsets
(x-axis: transfer rate, 133 to 800 MT/s; y-axis: slots/channel, 1 to 4; values are marked as supported at introduction or later)
6. Main limiters of increasing the memory size (5)
Notes
1. Servers prefer memory size over memory speed. E.g.
• current desktop chipsets support speed grades of up to DDR3-1333 (even DDR3-1600 with strong size restriction) and memory sizes of up to 4 GB/channel,
• current server chipsets using parallel connected main memory support speed grades of up to DDR2-667 but memory sizes of up to 16/24 GB/channel.
2. Servers expect registered memory modules rather than unbuffered modules as desktops do. Registered modules provide buffering for the address and control lines, and by reducing signal loading they increase the number of supported memory slots (memory modules) and thus the supported memory size.
3. At higher transfer rates the next wavefront arrives earlier on the transmission line → less time remains for settling the reflections of the previous wavefront → inter-symbol interference (ISI) rises.
Thus, at higher frequencies reflections, as well as skews and jitter, impair signal integrity more and more. This limits the number of supported memory modules/channel.
Recent desktop chipsets typically support 1-2, whereas server chipsets with a parallel communication path typically support 2-3 memory modules (slots)/channel.
6. Main limiters of increasing the memory size (6)
Rank capacity (CR)
CR = nD x D
with nD: Number of DRAM devices/rank
D: Device density
Number of DRAM devices/rank
A 64-bit wide rank consists of eight x8 or sixteen x4 devices, and usually occupies one module side.
E.g. a one-sided (single rank) DDR3 memory module is built up of 8 devices.
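The relation between device width, device count and rank capacity can be sketched as follows. The function names are hypothetical, and the 1 Gb density used in the example is only an illustration.

```python
def devices_per_rank(device_width_bits, rank_width_bits=64):
    """Number of devices (nD) needed to fill the data width of one rank."""
    if rank_width_bits % device_width_bits != 0:
        raise ValueError("device width must divide the rank width")
    return rank_width_bits // device_width_bits


def rank_capacity_mbyte(device_width_bits, density_mbit):
    """CR = nD x D, converted from Mbit to MByte."""
    return devices_per_rank(device_width_bits) * density_mbit // 8


for width in (8, 4):
    print(f"x{width}: {devices_per_rank(width)} devices, "
          f"{rank_capacity_mbyte(width, 1024)} MB per rank with 1 Gb devices")
```

For a given density, x4 devices double the rank capacity compared with x8 devices, at the cost of twice as many devices per rank.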
6. Main limiters of increasing the memory size (7)
Remark
A few Intel server chipsets, such as the E7500 and E7501, supported stacked devices as well. E.g. the E7500 server chipset supported double-sided dual rank DIMMs with 16 stacked devices (a rank) mounted on each side, yielding a total module size of 2 GB.
Figure: Double sided DDR SDRAM DIMM with 16 stacked devices on each side [30]
6. Main limiters of increasing the memory size (8)
Figure: Evolution of DRAM densities (Mbit) and no. of units shipped/year (Based on [29])
(x-axis: year, 1980 to 2015; y-axis: units shipped in millions, up to 2000; one curve per device density from 16K to 1G; density growth: ~4×/4 years)
6. Main limiters of increasing the memory size (9)
Device density
Figure: Supported max. device size and max memory size/channel in Intel’s desktop chipsets
(upper panel: max. device size, 512 Mb or 1 Gb, against the transfer rate; lower panel: max. memory size/channel, 1 to 4 GB, against the transfer rate; transfer rates from 133 to 1600 MT/s; chipsets shown: 845 (1/02), 875P (4/03), 925X (6/04), 975X (11/05), P35 (6/07), X48 (3/08))
6. Main limiters of increasing the memory size (10)
Figure: Supported max. device size and max memory size/channel in Intel’s server chipsets
(upper panel: max. device size, 512 Mb, 1 Gb or 2 Gb, against the transfer rate; lower panel: max. memory size/channel, 8, 16 or 24 GB, against the transfer rate; transfer rates from 133 to 800 MT/s; chipsets shown: E7501 (12/02), E7520 (8/04), 5100 (1/08); values are marked as supported at introduction or later)
6. Main limiters of increasing the memory size (11)
Notes
1. As the figures indicate, recent desktops provide up to 4 GB/channel memory size, whereas recent servers (with parallel bus attachment) offer 4-8 times larger sizes.
2. Servers achieve larger memory sizes by
• supporting more memory modules (with registering expected) than desktop chipsets do, and
• using higher density DRAM devices at the same speed grade than desktop chipsets do (e.g. 1 Gb devices instead of 512 Mb devices or 2 Gb devices instead of 1 Gb devices).
3. Recent server chipsets supporting main memories with serial bus attachment (like Intel’s 5000 and 7000 DP and MP-family chipsets) support both more channels and more modules/channel, providing much higher main memory sizes of up to 192 GB or more (see Section Main memories with serial bus attachment).
6. Main limiters of increasing the memory size (12)
The rate of increasing DRAM densities
In accordance with Moore’s law (saying that the transistor count per chip doubles about every 24 months), DRAM densities evolve about 4×/4 years.
For the same numbers of control units/modules/ranks, the maximum size of main memories would also increase about 4×/4 years.
6. Main limiters of increasing the memory size (13)
But as the number of modules/channel decreases with higher transfer rates, the maximum size of main memories increases at a rate < 4×/4 years.
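The effect can be made concrete with a short calculation. The sketch assumes a steady exponential density growth of 4× per 4 years; the function names and the module counts in the example (a drop from 4 to 2 modules per channel) are hypothetical.

```python
def density_growth(years, factor=4.0, per_years=4.0):
    """Growth of device density over a time span, assuming 4x per 4 years."""
    return factor ** (years / per_years)


def memory_size_growth(years, modules_before, modules_after):
    """Net growth of max. memory size when the module count changes too."""
    return density_growth(years) * modules_after / modules_before


print(density_growth(2))            # doubling in two years
print(memory_size_growth(4, 4, 2))  # density x4, but modules/channel halved
```

In this example the maximum memory size only doubles in four years although the device density quadruples.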
7. References (1)
[1]: Bhandarkar D., „The Dawn of a New Era”, 11. EMEA, May 2006
[2]: Moore G. E., No Exponential is Forever... ISSCC 2003, ftp://download.intel.com/research/silicon/Gordon_Moore_ISSCC_021003.pdf
[3]: Gschwind M., „Chip Multiprocessing and the Cell BE,” ACM Computing Frontiers, 2006, http://beatys1.mscd.edu/compfront//2006/cf06-gschwind.pdf
[4]: 16 Mb Synchronous DRAM, MT48LC4M4A1/A2, MT48LC2M8A1/A2, Micron, http://datasheet.digchip.com/297/297-04447-0-MT48LC2M8A1.pdf
[5]: Rhoden D., „The Evolution of DDR”, Via Technology Forum, 2005, http://www.via.com.tw/en/downloads/presentations/events/vtf2005/vtf05hd_inphi.pdf
[6]: Jacob B., Ng S. W., Wang D. T., Memory Systems, Elsevier, 2008
[7]: Backplane Designer’s Guide, Section 9 - Layout Considerations, Fairchild Semiconductor, Apr. 2002, http://www.fairchildsemi.com/ms/MS/MS-569.pdf
[8]: PC133 SDRAM Registered DIMM Design Specification, Rev. 1.1, Aug. 1999, IBM & Reliance Computer Corp., http://www.simmtester.com/PAGE/memory/techdata_pc133rev1_1.pdf
[9]: Horna O. A., „Pulse Reflection in Transmission Lines,” IEEE Transactions on Computers, Vol. C-20, No. 12, Dec. 1971, pp. 1558-1563
7. References (2)
[10]: Vo J., „A Comparison of Differential Termination Techniques,” Application Note 903, Aug. 1993, National Semiconductor, http://www1.control.com/PLCArchive/RS485_3.pdf
[11]: Allan G., „The outlook for DRAMs in consumer electronics”, EETIMES Europe Online, 01/12/2007, http://eetimes.eu/showArticle.jhtml?articleID=196901366&queryText=calibrated
[12]: Interfacing to DDR SDRAM with CoolRunner-II CPLDs, Application Note XAPP384, Febr. 2003, Xilinx Inc.
[13]: Double Data Rate (DDR) SDRAM MT46V128M4, MT46V64M8, MT46V32M16, Micron Techn. Inc, 2000, http://download.micron.com/pdf/datasheets/dram/ddr/512MBDDRx4x8x16.pdf
[14]: Kirstein B., „Practical timing analysis for 100-MHz digital design,” EDN, Aug. 8, 2002, www.edn.com
[15]: Ebeling C., Koontz T., Krueger R., „System Clock Management Simplified with Virtex-II Pro FPGAs”, WP190, Febr. 25 2003, Xilinx, http://www.xilinx.com/support/documentation/white_papers/wp190.pdf
[16]: DDR Simulation Process Introduction, TN-46-11, July 2005, Micron, http://download.micron.com/pdf/technotes/DDR/TN4611.pdf
[17]: Allan G., „DDR Integration,” Chip Design Magazine, June/July 2007
7. References (3)
[18]: Stub Series Terminated Logic for 2.5 Volts (SSTL-2), EIA/JEDEC Standard JESD8-9, Sept. 1998
[19]: Stub Series Terminated Logic for 1.8 Volts (SSTL-18), JEDEC Standard JESD8-15A, Sept. 2003
[20]: Double Data Rate (DDR) SDRAM Specification, JEDEC Standard JESD79E, May 2005
[21]: DDR2 SDRAM Specification, JEDEC Standard JESD79-2, May 2006
[22]: DDR3 SDRAM Standard, JEDEC Standard JESD79-3, June 2007
[23]: DDR2 (Point-to-Point) Features and Functionality, TN-47-19, Micron, 2003, http://download.micron.com/pdf/technotes/ddr2/TN4719.pdf
[24]: Ahn J.-H., „Memory Design Overview,” March 2007, Hynix, http://netro.ajou.ac.kr/~jungyol/memory2.pdf
[25]: Micron Synchronous DRAM, 64 Mbit, MT48LC16M4A2, MT48LC16M8A2, MT48LC16M16A2, Micron Technology, Inc., Oct. 2000, http://www.micron.com/products/dram/sdram/partlist.aspx
[26]: General DDR SDRAM Functionality, TN-46-05, Micron Techn. Inc., July 2001, http://download.micron.com/pdf/technotes/TN4605.pdf
[27]: Haskill, „The Love/Hate relationship with DDR SDRAM Controllers,” Mosaid, Oct. 2006, http://www.mosaid.com/corporate/products-services/ip/SDRAM_Controller_whitepaper_Oct_2006.pdf
7. References (4)
[28]: Introduction to Xilinx, Xilinx FPGA Design Workshop, http://www.eas.asu.edu/~kchatha/cse320_f07/xilinx_intro.ppt
[29]: DRAM Pricing – A White Paper, Tachyon Semiconductors, http://www.tachyonsemi.com/about/papers/dram_pricing.pdf
[30]: Intel E7500 MCH A2 x4/x8 DDR Memory Limitations, Application Note AP-722, March 2002, Intel
[31]: Intel 925X/925XE Express Chipset, Datasheet, Rev. 001, Jun. 2004, Intel
[32]: Keeth B., Baker R. J., Johnson B., Lin F., DRAM Circuit Design, Wiley-Interscience, 2008