DDR4: Designing for Power and Performance

Upload: dylan-joao-colaco

Post on 02-Oct-2015


  • DDR4: Designing for Power and Performance

  • Agenda

    Comparison between DDR3 and DDR4
    Designing for power
      DDR4 power savings
    Designing for performance
      Creating a data valid window
      Good layout practices for DDR4
      Board debug tools to minimize issues
    Looking ahead and conclusion

  • Comparison Between DDR3 and DDR4


  • DRAM Technology Comparison

    Feature                     | DDR3                   | DDR4                                      | GDDR5
    Voltage                     | 1.5 V / 1.35 V         | 1.2 V                                     | 1.5 V / 1.35 V
    Strobe                      | Bi-directional differential | Bi-directional differential          | Free-running differential WRITE clock
    Strobe Configuration        | Per byte               | Per byte                                  | Per word
    READ Data Capture           | Strobe based           | Strobe based                              | Clock data recovery
    Data Termination            | VDDQ/2                 | VDDQ                                      | VDDQ
    Address/Command Termination | VDDQ/2                 | VDDQ/2                                    | VDDQ
    Burst Length                | BC4, 8                 | BC4, 8                                    | 8
    Bank Grouping               | No                     | 4                                         | 4
    On-Chip Error Detection     | No                     | Command/address parity, CRC for data bus  | CRC for data bus
    Configuration               | x4, x8, x16            | x4, x8, x16                               | x16, x32
    Package                     | 78-ball / 96-ball FBGA | 78-ball / 96-ball FBGA                    | 170-ball FBGA
    Data Rate (Mbps/pin)        | 800 to 2,133           | 1,600 to 3,200+                           | 4,000 to 7,000
    Component Density           | 1 Gb to 8 Gb           | 2 Gb to 16 Gb                             | 512 Mb to 2 Gb
    Stacking Options            | DDP, QDP               | Up to 8H (128-Gb stack); single load      | No

  • DDR4 Power Savings


  • DDR4 Power Savings Features

    DDR4 voltage is 1.2 V (up to 40% savings)
      Lower voltage than DDR3 (1.5 V)
      On-die VREF
      Pseudo-open drain I/Os
    Manages refreshes (up to 20% savings)
      Based on temperature
    New DDR4 low-power auto self-refresh (LPASR) capability
      Changes refresh rate based on temperature
    Only refreshes the parts of the array that are in use
      Controller must allow fine-granularity refresh based on memory utilization
    Supports data bus inversion
      Limits the number of signals transitioning, reducing simultaneous switching output (SSO) and saving power
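
    The slide does not spell out the data bus inversion rule. A commonly described form for DDR4, assumed here rather than taken from the slides, inverts a byte lane when more than four of its eight bits are low, because a driven low is what draws current through the VDDQ termination. The function names below are hypothetical; this is a minimal sketch, not a controller implementation.

```python
from typing import Tuple


def dbi_encode(byte: int) -> Tuple[int, bool]:
    """Return (value driven on the lane, inverted flag) for one 8-bit lane."""
    byte &= 0xFF
    zeros = 8 - bin(byte).count("1")
    # Assumed rule: invert when more than four of the eight bits are low,
    # so that at most four lines are driven low in any transfer.
    if zeros > 4:
        return (~byte) & 0xFF, True
    return byte, False


def dbi_decode(driven: int, inverted: bool) -> int:
    """Recover the original byte at the receiver from the lane and the flag."""
    return (~driven) & 0xFF if inverted else driven & 0xFF
```

    With this rule, a byte of 0x00 is sent as 0xFF with the inversion flag set, while 0x0F (exactly four low bits) is sent unchanged.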

  • Creating a Data Valid Window


  • Timing Margins Are Shrinking

    Shrinking timing margins, in picoseconds (400 Mbps to 3,200 Mbps)

    Generation | Data Valid Window | DRAM Margin | Package/Board Margin | Chip Margin
    DDR1       | 2,500             | 900         | 800                  | 800
    DDR2       | 938               | 425         | 256                  | 256
    DDR3       | 469               | 188         | 140                  | 140
    DDR4       | 313               | 125         | 93                   | 93
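
    The figures above can be cross-checked with a little arithmetic. The sketch below assumes that the data valid window column is one unit interval (one bit time) at the per-pin data rate, and that the three margin columns are allocations that add up to that window; both readings are inferred from the numbers, not stated on the slide.

```python
def unit_interval_ps(data_rate_mbps: float) -> float:
    """One bit time in picoseconds for a per-pin data rate in Mbps."""
    return 1e6 / data_rate_mbps


# (data valid window, DRAM, package/board, chip) in ps, copied from the slide.
MARGINS_PS = {
    "DDR1": (2500, 900, 800, 800),
    "DDR2": (938, 425, 256, 256),
    "DDR3": (469, 188, 140, 140),
    "DDR4": (313, 125, 93, 93),
}


def budget_residual_ps(generation: str) -> int:
    """Data valid window minus the sum of the three margin allocations."""
    window, dram, board, chip = MARGINS_PS[generation]
    return window - (dram + board + chip)
```

    Here unit_interval_ps(400) gives 2500.0 and unit_interval_ps(3200) gives 312.5, matching the DDR1 and DDR4 windows, and every residual is within 2 ps of zero, which is consistent with rounding in the slide.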

  • Shrinking the Window Even More: DDR4 VREF Training (1/2)

    DDR4 VREF training
      Training: sweep VREF setting, find maximum passing window
      Lump sum of DCD, RX offset, etc.
      Resolution error is the combination of (VREF, PI, or delay chain)
    Margin loss calculation
      VREF step size: from 0.5% VDDQ to 0.8% VDDQ
      VREF set tolerance: 1.625% or 0.15%
      Calibration error: 1 step size
        0.8% * VDDQ = 0.8% * 1.2 V = 9.6 mV
      Margin loss (due to VREF calibration error)
        9.6 mV * 2 / slew_rate = 4.8 ps (assume slew rate = 4 V/ns)
      Calibration error = half step size

    Parameter          | Symbol       | Min     | Typ   | Max    | Unit | Notes
    VREF Step Size     | Vref step    | 0.50%   | 0.65% | 0.80%  | VDDQ | 2
    VREF Set Tolerance | Vref_set_tol | -1.625% | 0.00% | 1.625% | VDDQ | 3, 4, 6
                       |              | -0.15%  | 0.00% | 0.15%  | VDDQ | 3, 5, 7
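
    The margin loss arithmetic on this slide can be reproduced as follows. The function name and the error_steps parameter are hypothetical; the factor of two is taken directly from the slide's formula, and the 4 V/ns slew rate is the slide's own assumption.

```python
def vref_margin_loss_ps(step_fraction: float, vddq_v: float,
                        slew_v_per_ns: float, error_steps: float = 1.0) -> float:
    """Timing margin lost to a VREF calibration error, in picoseconds.

    step_fraction is the VREF step as a fraction of VDDQ (0.008 for 0.8%).
    error_steps is the calibration error expressed in steps (1.0 or 0.5).
    """
    error_mv = step_fraction * vddq_v * 1000.0 * error_steps
    # 1 V/ns is the same as 1 mV/ps, so mV divided by V/ns yields ps.
    return 2.0 * error_mv / slew_v_per_ns
```

    With a 0.8% step, VDDQ of 1.2 V, and 4 V/ns, this returns about 4.8 ps for a one-step error and about 2.4 ps for a half-step error.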

  • Shrinking the Window Even More: DDR4 VREF Training (2/2)

    Discussion with JEDEC members
      DDR4 specification section 13.4: any DRAM component-level variation must be accounted for within the DRAM RX mask.
      This means that the VREF calibration error is included in VdIVW_total.
      VREF_DQ internally aligns to VCENT_DQs with training.
      VCENT_DQs has variation.
      VREF_DQ training error should increase with this variation, internal voltage noise, etc.

  • Shrinking the Window Even More: Duty Cycle Error

    DDR4 specification is +/-2% tCK = +/-0.04 UI
    IPD current budget: +/-3% tCK
    Margin loss is 4% tCK
    With proper link timing calibration
      2% tCK margin loss
    Assume same for read

    [Figure: DQS and DQ waveforms, each annotated with +/-2% duty cycle error]

    Timing Parameters by Speed Bin for DDR4-2400 to DDR4-3200

    Parameter                               | Symbol        | DDR4-2400 (MIN / MAX) | DDR4-2666 (MIN / MAX) | DDR4-3200 (MIN / MAX) | Units     | Note
    Clock Timing
    Minimum Clock Cycle Time (DLL Off Mode) | tCK (DLL_OFF) | 8 / -                 | 8 / -                 | 8 / -                 | ns        | 22
    Average Clock Period                    | tCK (avg)     | TBD                   | TBD                   | TBD                   | ps        |
    Average High Pulse Width                | tCH (avg)     | 0.48 / 0.52           | 0.48 / 0.52           | 0.48 / 0.52           | tCK (avg) |
    Average Low Pulse Width                 | tCL (avg)     | 0.48 / 0.52           | 0.48 / 0.52           | 0.48 / 0.52           | tCK (avg) |
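
    To put the percentages on this slide into picoseconds, the sketch below assumes a double data rate interface, so one clock period tCK spans two unit intervals. The helper names are hypothetical, and DDR4-3200 is used only as an example speed bin.

```python
def tck_ps(data_rate_mbps: float) -> float:
    """Clock period in ps; two bits are transferred per clock (double data rate)."""
    return 2.0 * 1e6 / data_rate_mbps


def duty_cycle_error_ps(data_rate_mbps: float, percent_tck: float) -> float:
    """Convert an error given as a percentage of tCK into picoseconds."""
    return tck_ps(data_rate_mbps) * percent_tck / 100.0


def percent_tck_to_ui(percent_tck: float) -> float:
    """Convert a percentage of tCK into a fraction of one unit interval."""
    return 2.0 * percent_tck / 100.0
```

    At 3,200 Mbps, tCK is 625 ps, so the 2% figure corresponds to 12.5 ps (0.04 UI) and the 4% figure to 25 ps.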

  • Shrinking the Window Even More: Calculating the PLL Jitter

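    [Figure: the current profile I(f), PDN impedance Z(f), jitter sensitivity S(f), and PSRR of the PLL P(f) combine into the jitter spectrum J(f); an inverse FFT (iFFT) gives the TIE jitter j(t), from which the p-p jitter is read]

    $$ I(f)\,Z(f)\,S(f)\,P(f) = J(f) \;\xrightarrow{\text{iFFT}}\; j_{\mathrm{TIE}}(t) $$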
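
    The flow on this slide can be sketched numerically. The example below uses a naive inverse DFT from the standard library in place of an FFT, and it assumes all four spectra are sampled on the same frequency bins and that the product is conjugate-symmetric so the time waveform is real. All names are hypothetical and units are left abstract.

```python
import cmath
from typing import List, Sequence


def jitter_spectrum(i_f: Sequence[complex], z_f: Sequence[complex],
                    s_f: Sequence[complex], p_f: Sequence[complex]) -> List[complex]:
    """Bin-by-bin product J(f) = I(f) * Z(f) * S(f) * P(f)."""
    return [i * z * s * p for i, z, s, p in zip(i_f, z_f, s_f, p_f)]


def inverse_dft(spectrum: Sequence[complex]) -> List[float]:
    """Naive O(n^2) inverse DFT; returns the real part of each time sample."""
    n = len(spectrum)
    samples = []
    for t in range(n):
        acc = sum(x * cmath.exp(2j * cmath.pi * k * t / n)
                  for k, x in enumerate(spectrum))
        samples.append((acc / n).real)
    return samples


def peak_to_peak(samples: Sequence[float]) -> float:
    """Peak-to-peak value of the time-domain jitter waveform."""
    return max(samples) - min(samples)
```

    For a single tone (equal values in bin 1 and its mirror bin), the result is a cosine whose peak-to-peak value is twice its amplitude.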

  • DDR4 Bank Group Timing

    Different timing within a group and between groups (tCCD, tWTR, tRRD)
      Long timing: bank-to-bank within a group
      Short timing: access to different bank groups
    Maintain array timing requirements within bank group
    Maintain speed between different bank groups

    [Figure: four bank groups (Bank Group 0 to Bank Group 3), each containing Bank 0 to Bank 3; long timings apply within a bank group, short timings between bank groups]
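
    A controller scheduler has to pick the long or short spacing per command pair. The sketch below shows the idea for back-to-back column commands only; the class and function names are hypothetical, and the clock counts used in the usage note are illustrative values, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CasToCasTimings:
    tccd_s: int  # clocks between column commands to different bank groups
    tccd_l: int  # clocks between column commands within the same bank group


def cas_gap(prev_group: int, next_group: int, t: CasToCasTimings) -> int:
    """Minimum spacing, in clocks, between two consecutive column commands."""
    return t.tccd_l if prev_group == next_group else t.tccd_s


def sequence_clocks(groups: Sequence[int], t: CasToCasTimings) -> int:
    """Total spacing, in clocks, for a sequence of commands by bank group."""
    return sum(cas_gap(a, b, t) for a, b in zip(groups, groups[1:]))
```

    With an illustrative short timing of 4 clocks and long timing of 6 clocks, four accesses rotated across groups 0, 1, 2, 3 need 12 clocks of spacing, while four accesses to the same group need 18, which is why interleaving across bank groups maintains speed.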

  • Calibration Is Critical to Shrinking Margins

    [Figure: bar chart of margin (ns), on an axis from -0.1 to 0.5, showing FPGA effects, external effects, calibration effects, and calibration uncertainty. Annotation: no margin without calibration]

  • What is Calibration?

    Resync Calibration
      Benefit: accurate strobe placement, more resync margin
      [Figure: valid data window across DQ0 to DQ71, on an axis from 0 to 360]
    VT Compensation
      Voltage and temperature tracking
      Data shifts due to VT variations
      Benefit: dynamic phase adjustment to match the shifting data valid window; robust over VT
    Capture Calibration (De-skew)
      Benefit: reduce skew across the data group, more capture margin
      [Figure: DQ0 to DQ7 on an axis from 0 to 180. Before de-skew: small valid capture window. After de-skew: maximized valid capture window]
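
    The effect of per-bit de-skew on the capture window can be illustrated with a small model. It assumes each DQ bit has a known valid window given as (start, end) in arbitrary delay units and that delay can only be added, so every bit is aligned to the latest window center. The names are hypothetical and this is not the calibration algorithm from the slides.

```python
from typing import List, Sequence, Tuple

Window = Tuple[float, float]  # (start, end) of one bit's valid window


def common_window(windows: Sequence[Window]) -> float:
    """Width of the interval in which every bit is valid at the same time."""
    start = max(w[0] for w in windows)
    end = min(w[1] for w in windows)
    return max(0.0, end - start)


def deskew_delays(windows: Sequence[Window]) -> List[float]:
    """Delay to add to each bit so all window centers line up with the latest."""
    centers = [(s + e) / 2 for s, e in windows]
    latest = max(centers)
    return [latest - c for c in centers]


def apply_delays(windows: Sequence[Window], delays: Sequence[float]) -> List[Window]:
    """Shift each bit's window later by its assigned delay."""
    return [(s + d, e + d) for (s, e), d in zip(windows, delays)]
```

    For three bits with windows (0, 100), (30, 130), and (60, 160), the common window is 40 units before de-skew and 100 units after the computed delays of 60, 30, and 0 are applied.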

  • High-Level Output Topology

    Calibration knobs
      DQ-out1 and DQ-out2 delay: control the delay applied to outgoing DQ pins
      DQS-out1 and DQS-out2 delay: control the delay applied to outgoing DQS pins
      Write leveling output: changes the delay on both DQ and DQS relative to the memory clock, in phase taps

    [Figure: output path from CLK to DQS and DQ, showing X phase and X+90 phase selection (ptap control), DQS OUT1 and OUT2 delays (DQS out dtap1 and dtap2 control), and DQ OUT1 and OUT2 delays (DQ out dtap1 and dtap2 control)]

  • High-Level Input Topology

    Calibration knobs
      DQ-in delay: controls the delay applied to incoming DQ pins
      DQS-in delay: controls the delay applied to incoming DQS pins
      LFIFO: controls the number of cycles after the read command that data is read out of the LFIFO
      DQS-En phase: controls the delay on DQS enable, in phase taps
      DQS-En delay: controls the delay on DQS enable, in dtaps
      VFIFO: adjusts the delay, in cycles, applied to the controller-provided DQS burst signal to generate DQS enable

    [Figure: input path for DQS and DQ, showing DQS IN delay and DQS delay chain (DQS in dtap control), DQ IN delay (DQ in dtap control), DDIO in, DQS enable with X phase (dqs_en ptap control) and DQS En delay (DQS en dtap control), VFIFO (vfifo control), and LFIFO (lfifo control)]

  • Calibration Stages

    DQS-enable calibration
      Calibrate DQS enable (delayed read data valid) relative to DQS
    Post-amble tracking
      Track DQS-enable across temperature variation
    Read data deskew
      Calibrate DQS relative to read command (read leveling)
      Calibrate DQ versus DQS (per-bit deskew) for reads
    LFIFO training
      Calibrate LFIFO delay cycles (read latency)
    Write leveling
      Calibrate DQS and DM to write command (write leveling)
    Write data deskew
      Calibrate DQ versus DQS (per-bit deskew) for writes
    Address/command training (leveling and deskew)
      Calibrate CS, CAS, RAS, and ODT versus memory clock
    VREF training (FPGA and memory)
      Calibrates receiver voltage threshold (for DDR4 with pseudo open drain DQs)

    [Flowchart: Start, then wait for PLL/DLL locking. Calibration loop: initialize INST/AC ROM for all pins on this memory interface, initialize the memory (mode registers, etc.), calibrate the memory interface, and repeat until all memory interfaces are calibrated. User mode loop: if a user command is found in DPRIO, process the DPRIO user command; if a user command is found in RAM, process the RAM user command]
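
    The VREF training stage is described earlier in the deck as sweeping a setting and finding the maximum passing window. A minimal sketch of that search step follows; it assumes the sweep has already produced one pass/fail result per setting, and the function name is hypothetical. Whether the other stages use the same search is not stated in the slides.

```python
from typing import Optional, Sequence, Tuple


def widest_passing_window(results: Sequence[bool]) -> Optional[Tuple[int, int]]:
    """Return (center index, width) of the longest run of passing settings.

    Returns None when no setting passes. Ties go to the earliest run, and
    for an even width the center is the lower of the two middle settings.
    """
    best_start, best_len = -1, 0
    run_start, run_len = 0, 0
    for i, ok in enumerate(results):
        if ok:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    if best_len == 0:
        return None
    return best_start + (best_len - 1) // 2, best_len
```

    Settling on the center of the widest window, rather than on the first passing setting, leaves margin on both sides for later voltage and temperature drift.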

  • Calibration Is Critical to Shrinking Margins

    [Figure: bar chart of margin (ns), on an axis from -0.1 to 0.5, showing FPGA effects, external effects, calibration effects, and calibration uncertainty. Annotation: no margin without calibration]

  • Good Layout Practices for DDR4


  • DDR4 Output Driver

    [Figure: DDR3 push-pull output driver versus DDR4 pseudo open drain output driver]

    Content Courtesy of Micron

  • Unadjusted, Non-Terminated Data Eye

    [Figure: unadjusted, non-terminated data eye, annotated with jitter, overshoot, and undershoot relative to VDD and VSS]

    Content Courtesy of Micron

  • Terminated Data Eye

    [Figure: terminated data eye, annotated with VIHac, VILac, VREF, VIHdc, VILdc, hi-ringback, lo-ringback, overshoot, and undershoot]

    Content Courtesy of Micron

  • OCT from the Controller Standpoint

    DQ and CA pins are terminated differently in DDR4

    Specification                       | DDR3                                      | DDR4
    Density / Speed                     | 512 Mb ~ 8 Gb, 1.6 ~ 2.1 Gbps             | 2 Gb ~ 16 Gb, 1.6 ~ 3.2 Gbps
    Interface
    Voltage (VDD / VDDQ / VPP)          | 1.5 V / 1.5 V / NA (1.35 V / 1.35 V / NA) | 1.2 V / 1.2 V / 2.5 V
    VREF                                | External VREF (VDD / 2)                   | Internal VREF (need training)
    Data I/Os                           | CTT (34 ohm)                              | POD (34 ohm)
    CMD/ADDR I/Os                       | CTT                                       | CTT
    Strobe                              | Bi-directional / differential             | Bi-directional / differential
    Core Architecture
    Number of banks                     | 8                                         | 16 (4 bank groups)
    Page size (x4 / x8 / x16)           | 1 KB / 1 KB / 2 KB                        | 512 B / 1 KB / 2 KB
    Number of prefetch                  | 8 bits                                    | 8 bits
    Added function                      | RESET / ZQ / Dynamic ODT                  | + CRC / DBI / Multi preamble
    Physical
    Package type / balls (x4, x8 / x16) | 78 / 96 BGA                               | 78 / 96 BGA
    DIMM type                           | R, LR, U, SoDIMM                          | + ECC SoDIMM
    DIMM pins                           | 240 (R, LR, U) / 204 (So)                 | 284 (R, LR, U) / 256 (So)

  • OCT Calibration Scheme to Support DDR4

    OCT can calibrate 2 times with 2 sets of pins (DQ/CA)
    DQ and CA pins will have 2 different sets of codes in DDR4

    [Figure: OCT calibration scheme, DDR3 versus DDR4]

  • General Layout Concerns

    Avoid crossing splits in the power plane
    SSO on controller collapsed strobes/clocks
      Separate supplies and/or flip-chip packaging helps
      Low-pass VREF filtering on controller helps
    Minimize VREF noise
    Minimize intersymbol interference (ISI)
    Minimize crosstalk

    Content Courtesy of Micron

  • Layout and Termination (1/12)

    Signal integrity review
    Importance of transmission line theory
      Today's clock rates are too fast to ignore
      Matched impedance line is important for good signaling
      Mismatched impedance lines result in reflections
      Termination schemes are used to reduce / eliminate reflections
    Good power bussing is paramount to reducing SSO
      SSO reduces voltage and timing margins
    Decoupling capacitor needs and requirements
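
    The link between impedance mismatch and reflections can be made concrete with the standard transmission-line reflection coefficient, which is textbook theory rather than something given on the slide. The function name and the resistance values in the usage note are illustrative.

```python
def reflection_coefficient(z_load_ohm: float, z0_ohm: float) -> float:
    """Fraction of the incident voltage wave reflected at a resistive load.

    Uses the standard relation gamma = (ZL - Z0) / (ZL + Z0) for a line of
    characteristic impedance Z0 terminated in a load ZL.
    """
    return (z_load_ohm - z0_ohm) / (z_load_ohm + z0_ohm)
```

    A 50-ohm line terminated in 50 ohms reflects nothing, while terminating the same line in 100 ohms reflects one third of the incident wave, which is the kind of reflection a termination scheme is meant to remove.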

    Content Courtesy of Micron

  • Layout and Termination (2/12)

    Signal integrity analysis is paramount to developing cost-effective high-speed memory systems
      Develop timing budget for proof of concept
      Use models to simulate
      Board skews are important and should be accounted for
        ISI, crosstalk, VREF noise, path length matching, Cin and RTT mismatch
        Employ industry practices and assumptions
      Model vias too
      Eliminate return path discontinuities (RPDs)
      Minimize SSO effects
        Difficult to model

    Content Courtesy of Micron

  • Layout and Termination (3/12)

    DRAM and controller package parasitics are fixed
      SSO effects already contained in their specified timings
      However, these are to test conditions with specific decoupling
    Power delivery network (PDN) for the controller and DRAM needs to be properly designed
    Lowering power supply inductance minimizes signaling variations between devices
      Use power and ground planes wherever possible
      Make all power and ground traces as wide as possible
      Couple power and ground as much as possible
        Lowers inductance (mutual effects)

    Content Courtesy of Micron

  • Layout and Termination (4/12)

    SSO
      Timing and noise issues generated due to rapid changes in voltage and current caused by multiple circuits switching simultaneously in the same direction
    Problems caused by SSO
      False triggers due to power/ground bounce
      Reduced timing margin due to SSO-induced skew
      Reduced voltage margin due to power/ground noise
      Slew rate variation

    Content Courtesy of Micron

  • Layout and Termination (5/12)

    Good power bussing is paramount to reducing SSO

    $$ V = L \, \frac{dI}{dt} $$

    Reduce L (power delivery effective inductance)
      Use planes for power and ground distribution
      Proper routing of power and ground traces to devices
      Proper use of decoupling capacitance
        Locate as close as possible to the component pins
    Reduce dI/dt (switching current slew rate)
      Use the slowest drive edge that will work
      Use reduced drive strength instead of full drive where possible
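
    A quick calculation shows why both levers on this slide matter. The sketch below applies V = L dI/dt to several outputs switching in the same direction at once, assuming they share one effective supply inductance and that the current ramps linearly over the edge time. The function name and all component values in the usage note are illustrative, not from the slides.

```python
def supply_bounce_v(inductance_h: float, n_outputs: int,
                    delta_i_a: float, edge_time_s: float) -> float:
    """Estimate supply or ground bounce from V = L * dI/dt.

    n_outputs drivers each change their current by delta_i_a amps over
    edge_time_s seconds through a shared inductance of inductance_h henries.
    """
    di_dt = n_outputs * delta_i_a / edge_time_s
    return inductance_h * di_dt
```

    With 1 nH of shared inductance and eight outputs each switching 20 mA in 0.5 ns, the estimate is about 0.32 V. Halving the inductance or doubling the edge time each cuts that to about 0.16 V.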

    Content Courtesy of Micron

  • Layout and Termination (6/12)

    RPDs induce board noise and are difficult to model
      Splits/holes in reference planes
      Connector discontinuities
      Layer changes
    Avoid RPDs if at all possible
      Avoid crossing holes/splits in reference plane
      Route signals so they reference the proper domain
      Add power/ground vias to board
        Especially in dense layer-change areas
      Place decoupling capacitors near connectors

    [Figure: solid return path versus split return path]

    Content Courtesy of Micron

  • Layout and Termination (7/12)

    VREF noise
      Induces strobe-to-data skews and reduces voltage margins
      Power/ground plane noise
      Crosstalk
    Minimize VREF noise
      Use widest trace practical to route
        From chip to decoupling capacitor
      Use large spacing between VREF and neighboring traces

    Content Courtesy of Micron

  • Layout and Termination (8/12)

    ISI
      Occurs when data is random
        Clocks do not have ISI
      Multiple bits on the bus at the same time
        Bus cannot settle from bit #1 before bit #2, etc.
      Signal edges jitter due to previous bits' energy still on the bus
      Ringing due to impedance mismatches
      Low-pass structures can cause ISI
    Minimize ISI
      Optimize layout
      Keep board/DIMM impedances matched
        Drive impedance should be the same as Zo of transmission line
      Terminate nets
        Termination values should be the same as Zo of transmission line
      Select high-quality connector
        Matched to board/DIMM impedance
        Low mutual coupling

    Content Courtesy of Micron

  • Layout and Termination (9/12)

    Crosstalk
      Coupling on board, package, and connector from other signals, including RPDs
      Inductive coupling is typically stronger than capacitive coupling
    When aggressors fire at the same time as victim (e.g., data-to-data coupling)
      Victim edge speeds up or slows down, causing jitter
    When aggressors do not fire at the same time as victim (e.g., data-to-command/address coupling)
      Noise couples onto victim at time of aggressor switching

    Content Courtesy of Micron

  • Layout and Termination (10/12)

    Minimize crosstalk
      Keep bits that switch on same clock edge routed together
        Route data bits next to other data bits; never next to CMD/ADDR bits
      Isolate sensitive bits (strobes)
        If need be, route next to signals that rarely switch
      Separate traces by at least two to three (preferred) conductor widths (more accurately, one would define by trace pitch and height above reference plane)
        Example: 5-mil trace located 5 mils from a reference plane should have a 15-mil gap to its nearest neighbors to minimize crosstalk
      Choose a high-quality connector
      Run traces as stripline (as opposed to microstrip)
        Not at the cost of additional vias
      Maintain good references for signals and their return paths
        Avoid RPDs
      Keep driver, BD Zo, and ODT selections well matched
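
    The spacing guideline above reduces to a one-line rule of thumb. The helper below is a hypothetical sketch of that rule only; as the slide notes, a more accurate check would use trace pitch and height above the reference plane.

```python
def min_trace_gap_mils(trace_width_mils: float, widths: float = 3.0) -> float:
    """Gap to the nearest neighbor as a multiple of the conductor width.

    widths defaults to 3, the preferred end of the two-to-three range.
    """
    return widths * trace_width_mils
```

    For the slide's example of a 5-mil trace, the preferred gap is 15 mils and the minimum of the range (two widths) is 10 mils.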

    Content Courtesy of Micron

  • Layout and Termination (11/12)

    Cin mismatch
      Differing input capacitances on receiver pins
      Adds skew to input timings
    RTT mismatch
      Termination resistors not at nominal value
      Internal ODT on data pins has smaller variation than on DDR2
        They are calibrated (so is the DRAM's Ron)
      External termination resistor variation must be accounted for
        Consider one-percent resistors

    Content Courtesy of Micron

  • Layout and Termination (12/12)

    High-speed signals must maintain a solid reference plane
      Reference plane may be either VDD or ground
        For DDR3 UDIMM systems, the DQ busses are referenced to ground while the ADDR/CMD and clock are referenced to VDD
        All signals may be referenced to ground if the layout allows
    Best signaling is obtained when a constant reference plane is maintained
      If this is not possible, try to make the transitions near decoupling capacitors

    [Figure: a signal changing reference between a power plane and a ground plane, with a decoupling capacitor (Cap) at the transition]

    Content Courtesy of Micron

  • Board Debug Tools to Minimize Issues


  • TimeQuest DDR Timing: Read Capture

    [Figure: read capture timing margin before and after calibration, with these annotations]
      Before calibration is the standard timing analysis
      Calibrating to the FPGA variations (deskew + pessimism removal)
      Calibrating out some of the process variation in the memory
      Errors in the calibration algorithm
      Effects of temperature and voltage changes on the calibration
      Total margin after calibration

  • EMIF Debug Toolkit Features

    Reports results of the last calibration to the user
      Reports interface details, margins observed before calibration, settings made during calibration, and post-calibration margins
      In the case of a calibration failure, toolkit reports the stage at which calibration failed and the group
    Provides eye monitor support
    Provides loopback support
    Allows user interaction with memory interface
      Send commands to the memory interface to recalibrate, mask groups and ranks
      Eye monitor support of data valid window
      Loopback support for bit error rate (BER) testing

  • [Screenshot: TimeQuest-like GUI interface with a tasks section, a reports section, and a console showing the commands run]

  • On-Chip EMIF Debug Toolkit

    Core access to calibration data
      Access same calibration data as the EMIF toolkit, now via FPGA logic
      Via Avalon Memory-Mapped (Avalon-MM) interface

  • Looking Ahead and Conclusion


  • Will There Be a DDR5?

    Very unlikely
      SI for a parallel bus of 2 GHz and above would be very difficult
      Timing budget would be consumed in the package
        PDN noise
        Package skew
    Transition to stacked memory
      Hybrid Memory Cube and serialized memory
      3D memories integrated into ASICs

  • Conclusion

    DDR4 has many ways to reduce overall system power
      ~50% lower power than DDR3 at 1.5 V
    DDR4 is 33% faster than DDR3 2133
    But there are challenges
      Shrinking data valid window
      Increased signal integrity and power integrity concerns
    These can be overcome by good controller design
      Innovative calibration
      Good ODT
      Careful board design
      Good board debug tools

  • Thank You