Embedded Memories

Dinesh Somasekhar

Outline

Embedded Memories

• SRAM

• eDRAM

Metrics

Selected Literature

Embedded Memory Metrics

Power

• Power to Read and Write

• Power to retain state

Performance

• Bandwidth - Cycle Time

Area

• Directly translates to Cost

PPA - Power Performance Area

• Empirically 1/Power(W) x 1/Tcycle(s) x 1/Area(mm²)

Static Random Access Memory

SRAM – Embedded Memory of Choice for Logic Processes

From: Eric Karl, “A 0.6V, 1.5GHz 84Mb SRAM Design in 14nm FinFET CMOS Technology,” ISSCC 2015, Paper 17.1

SRAM – Embedded Memory of Choice

Compatibility with Logic Process Technology

• Simple integration – Uses existing logic devices of a technology

• Lowest cost and complexity

Performance is related to Logic Transistor Performance

• Benefits from transistor performance enhancement

Highest Performance Memory

• Cycle Times compatible with modern digital logic (GHz capable)

• Unchallenged in speed – read/write performance

• Unchallenged in speed – read/write performance

Embedded Memory - Scaling

Intel – P. Kolar, et al., “A 32 nm High-k Metal Gate SRAM With Adaptive Dynamic Stability Enhancement for Low-Voltage Operation,” IEEE JSSC, Jan 2011

Classic SRAM Bit-Cell

6 Transistors

• 2 Pass Device

• 2 Inverters

‒ 2NMOS

‒ 2PMOS

1980 SRAM Cell

1700um2

Source: Kelin Kuhn, 2nd Intl. Variability Conference 2009

Bit-Cell Evolution – 90nm

Physical Topology has evolved over the years

Classical 6T Topology

• Tight Diffusion breaks

• Bi-Directional devices

‒ Close proximity

• 2 Poly cross-couple

Litho Friendly Bitcell

INTEL - Presskit

IBM – H. Pilo ISSCC

Spacer definition of bit-cell

Taejoong Song et al., “A 10nm FinFET 128Mb SRAM with Assist Adjustment System for Power, Performance, Area Optimization,” ISSCC 2016, Paper 17.1

Bit-Cell Evolution – 22nm

Fully Gridded

• 3D device based

Tight Cuts

• Double poly cut

Eric Karl, “A 4.6GHz 162Mb SRAM Design in 22nm Tri-Gate CMOS Technology with Integrated Active VMIN-Enhancing Assist Circuitry,” ISSCC 2012

1-1-1 Cell (Pu-Pa-Pd) 1-1-2 Cell (Pu-Pa-Pd)

Next highest speed class of Memory

Bridges the gap between DRAM and SRAM

• Tcycle of 40ns for DRAM and 1ns for SRAM

• Density midway between SRAM and DRAM

Logic process based

• However needs process optimized for leakage

Best energy/bit (pJ/b) for large arrays

• Measured at equal capacity

eDRAM - Cell

Single Transistor – Single Capacitor

F. Hamzaoglu et.al. “A 1Gb 2GHz Embedded DRAM in 22nm Tri-Gate CMOS Technology”, ISSCC 2014, Paper 13.1

eDRAM - Cell

Open bit-line structure shown

• At the cross of WL and BL there is a cell

• Needs 2 poly tracks in this cell

eDRAM - Cell

Bit-line pickups along with capacitor pickup

• Note: Metal pitch needs 1½ metal tracks per cell

eDRAM - Cell

Capacitor is vertically integrated

• Multiple levels of metal

• Capacitor over BL (COB)

eDRAM - Cell

Capacitors are vertically integrated

• Plate connection is common to multiple capacitors

Plate Connection

Functionality

Cell Functionality

SRAM Cells

• State Retention

• Read Stability

• Write Stability

DRAM Cells

• Retention Time

Key Difference:

• SRAM ratios transistor strengths for read and write.

SRAMs are non-destructive in read-out

SRAM Cell Retention Stability

Eye opening – metric of stability

• Defined by transistor parameter variations, strength

• Retention Vccmin

SRAM Cell Read Stability

Idsat of access device competes with Idlin of pull-down device

• NMOS to NMOS ratio determines stability

SRAM Cell Write Stability

Access device Idsat to PMOS Idsat ratio

• NMOS to PMOS ratio

DRAM Cell Functionality

Failure of cell is in loss of state with time

DRO (destructive read out) non-ratioed write and read

Special access transistor with very high Vt

e.g. a 15fF capacitor with 0.1ms retention requires leakage on the order of pA per device

Leakage mechanism – subthreshold, junction, adjacent cells, capacitor, gate and defects
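The retention example follows from I = C x ΔV / t. A minimal sketch, assuming a tolerable signal loss of 0.1V on the storage node (the slide does not give this value; it is a placeholder):

```python
def max_leakage_amps(cap_farads, droop_volts, retention_seconds):
    """Largest constant leakage that loses at most droop_volts over retention_seconds."""
    return cap_farads * droop_volts / retention_seconds

C_CELL = 15e-15        # 15 fF storage capacitor (from the slide)
T_RETENTION = 0.1e-3   # 0.1 ms retention time (from the slide)
DROOP = 0.1            # assumed tolerable voltage loss in volts (not from the slide)

i_leak = max_leakage_amps(C_CELL, DROOP, T_RETENTION)
print(f"Leakage budget: {i_leak * 1e12:.1f} pA per cell")
```

With these inputs the budget lands in the tens of pA, which is why the access transistor needs a very high Vt.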

[Figure: write disturb 0 and write disturb 1 cases on adjacent word-lines (WL)]

Functionality - Assisted 6T operation

Shigeki Ohbayashi, et al., “A 65-nm SoC Embedded 6T-SRAM Designed for Manufacturability With Read and Write Operation Stabilizing Circuits,” JSSC, April 2007

Read Assist

Write Assist

Collapse Array Vdd

Negative BL

Raise WL in steps Under-drive WL

Suppress BL Raise Vss

Adaptive WL Read-Assist

Dynamic WL under-drive

• Skew corner tracking

• Applied to die which are read-stability limited

Intel – P. Kolar, et al., “A 32 nm High-k Metal Gate SRAM With Adaptive Dynamic Stability Enhancement for Low-Voltage Operation,” IEEE JSSC, Jan 2011

Modulating Bit-Line Voltage

IBM: H. Pilo, et al., “64Mb SRAM in 32nm High-k Metal-Gate SOI Technology with 0.7V Operation Enabled by Stability, Write-Ability and Read-Ability Enhancements,” ISSCC 2011

Read-Assist

Precharge of BL to lower than Vdd

Write-Assist

Capacitive negative BL drive

Vdd SRAM Based Assist

IBM: Pilo et al., “SRAM Design in 65nm Featuring Read-Write Assist Circuits,” JSSC 2007

Per BL Sense Dynamic BL read assist

Performance

Frequency relationship

Modern designs span the V-F range

• Circuit optimality – only possible at a single V

Memory arrays designed not to be the Freq limiter

[Figure: gate delay vs. voltage, 0.2V to 1.4V; curves: gate-delay equation and wire loaded]

$$\text{Gate Delay} = \frac{V_{dd}/2}{(V_{dd} - V_t)^{2}}$$
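The gate-delay expression can be tabulated to see how steeply delay grows at low voltage. A small sketch, assuming Vt = 0.3V (a hypothetical value, not given in the slides) and normalizing to the delay at 1.0V:

```python
def gate_delay(vdd, vt):
    """Relative gate delay (Vdd/2) / (Vdd - Vt)^2; only meaningful for Vdd > Vt."""
    if vdd <= vt:
        raise ValueError("Vdd must exceed Vt")
    return (vdd / 2) / (vdd - vt) ** 2

VT = 0.3  # assumed threshold voltage in volts (hypothetical)

reference = gate_delay(1.0, VT)
for vdd in (0.6, 0.8, 1.0, 1.2):
    relative = gate_delay(vdd, VT) / reference
    print(f"Vdd = {vdd:.1f} V  relative delay = {relative:.2f}")
```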

Small Signal SRAM I/O Slice

[Figure: design targets vs. N, memory size (bits), 1 to 1E+08; curves: cell target, Ysel target, SA target; DPM=10]

Half of a SRAM I/O Slice

[Figure: half of an SRAM I/O slice; recoverable signal labels: WL[0], WL[1], SAO_B, SAO_R]

SRAM Cycle – 2 Clk Array

[Timing diagram over one Tcycle; signal labels include SAPCH, SAE, PCH, YSEL, SAO]

1. Functional Race: WL to SAE

2. Power Race: WL to PCH

Performance specification: clock cycles, frequency

BitLine Development

Speed dependent on Iread and bit-line cap

Power is not the classical CV²F

• Linear dependence on V, may be independent of F

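$$V_{bl} = \frac{I_{read} \times T}{C_{bl}}$$

$$Power = I_{read} \times V_{dd} \times \frac{T}{T_{cycle}}$$

$$Power = C_{bl} \times V_{bl} \times V_{dd} \times F$$

$$Power = C_{bl} \times \frac{I_{read} \times T}{C_{bl}} \times V_{dd} \times F$$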

Example:

Assumption - Iread = 10uA

Cbl = 20fF

T = 1/(2GHz) = 500ps

Vbl = Iread x T / Cbl = 0.25V
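The example numbers can be plugged into the bit-line relations directly. The sketch below reproduces Vbl and also evaluates the power per bit-line; Vdd = 0.8V and Tcycle = 1ns (2 clocks at 2GHz) are assumptions taken for illustration, not part of this example slide.

```python
def bitline_swing(i_read, t_develop, c_bl):
    """Vbl = Iread x T / Cbl."""
    return i_read * t_develop / c_bl

def bitline_power(i_read, vdd, t_develop, t_cycle):
    """Power = Iread x Vdd x T / Tcycle (linear in Vdd)."""
    return i_read * vdd * t_develop / t_cycle

I_READ = 10e-6    # 10 uA (from the example)
C_BL = 20e-15     # 20 fF (from the example)
T_DEV = 500e-12   # 500 ps development time (from the example)
VDD = 0.8         # assumed supply voltage
T_CYCLE = 1e-9    # assumed cycle time: 2 clocks at 2 GHz

v_bl = bitline_swing(I_READ, T_DEV, C_BL)
p_bl = bitline_power(I_READ, VDD, T_DEV, T_CYCLE)
# Same power written as Cbl x Vbl x Vdd x F
p_check = C_BL * v_bl * VDD * (1 / T_CYCLE)

print(f"Vbl = {v_bl:.2f} V, power per bit-line = {p_bl * 1e6:.1f} uW")
```

Both power forms agree because Cbl cancels, which is the sense in which the power is linear in V rather than CV²F.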

Sense-Amplifier Basics

o BL_B

Pairwise change device by sigma target

Cycle Time Target

Array Performance

Performance is in GHz range

• Shorter bit-lines to achieve logic performance compatibility

2 Cycle Arrays

Small Signal DRAM I/O Slice

Folded bit-line architecture

only one of the bit-lines in the pair is activated

Sense-amplifier per bit-line pair

sense-amplifier does a read followed by write-back

Post sense-amp YSEL mux

Write is done by loading the sense-amplifier

Example 256rows X 2sectors X 32 = 2KBytes

[Figure: small-signal DRAM I/O slice; recoverable labels: WL[0], WL[1], WL[2], WL[3], BL_B, Vcc/2]

DRAM Cycle

Interleaving is used to approach throughput of SRAM

[Timing diagram over Tcycle, clock cycles 0 to 3; signals: cell node, BL/BL_B, SAN, SAP, YSEL (read/write)]

Lower Frequency

• Longer cycle time to accommodate write-back

• 3 cycles, 4 or higher

DRAM Sense operation

Signal does not increase with time

Offset causes functional failure

• Cell and bit-line determines Tsa

eDRAM – Cycle Time

Paper 13.1 – 6 cycles to complete

eDRAM – Array Performance

Slower than SRAM ( ~6X) but clocks in the GHz

• 3X more clock cycles, 2X lower freq. at same voltage

From I/O Slice to SubArray

Xdecoder - Word-Line driver predecode one-hot addresses are used to activate a single word-line

Power saving through sleep devices activating a group of word-line drivers

WL & SAE split at POD (point of divergence)

(32+1) I/O Slices

SRAM 256 cells per WL

DRAM 1024 cells per WL

16KB SRAM

64KB DRAM

32 dout + 1 redundancy

[Figure: word-line decode, I x J x K; signal labels: PreAddrL[I], PreAddrM[I], PreAddrH[I], SAGate, WLGate]

POD Timing

Sub-Array to Unit

Data Chunking

bus runs at full frequency

Redundancy shift-mux structure for column group replacement

row replacement, block replacement for other dimension

ECC multi-cycle operation

SECDED, DECTED, transparent ECC

32 SubArrays

512KB SRAM

2MB DRAM

Multi-Level BUS routes

[Pipeline timing diagram, cycles 0 to 13; stages shown: Ecc Gen, Bus Transit, chunk 0/1, decode, WL, SA, chunk 0/1, Bus Transit, Ecc Fix]

Area – Cell Size

14nm Technology

• Cpp = 70nm, FinPitch = 42nm

• HDC = 140nm x 357nm, LVC = 140nm x 420nm

Contacted Poly Pitch: Cpp = 70nm

Fin Pitch: FP = 42nm

HDC = 2 Cpp x 8.5 FP

X – Bit-Line direction

Y – Word-Line direction

LVC = 2 Cpp x 10 FP

Eric Karl “A 0.6V, 1.5GHz 84Mb SRAM Design in 14nm FinFET CMOS Technology,” ISSCC 2015, Paper 17.1

Area – Cell Size

2 Cpp x 2x(4/3) FP

22nm Process

Cpp = 90nm

Fp = 60nm

8.5 / 2.667

3.2X denser

Area – Cell Level

Example Calculation of bit-cell area

• Note: BL direction same size (2 x Cpp)

• WL direction 3.2X smaller

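| Tech | Cpp | FP | HD SRAM 8.5 FP | HD SRAM 2 Cpp | HD SRAM Area (um2) | LV SRAM 10 FP | LV SRAM 2 Cpp | LV SRAM Area (um2) | eDRAM 2.67 FP | eDRAM 2 Cpp | eDRAM Area (um2) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 22nm | 90nm | 60nm | 510nm | 180nm | 0.0918 | 600nm | 180nm | 0.108 | 160nm | 180nm | 0.0288 |
| 14nm | 70nm | 42nm | 357nm | 140nm | 0.05 | 420nm | 140nm | 0.0588 | | | |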

EDRAM Reference – ISSCC 2014, 13.1 Intel “A 1Gb 2GHz Embedded DRAM in 22nm Tri-Gate CMOS Technology”

SRAM Reference – ISSCC 2015, 17.1 Intel “A 0.6V, 1.5GHz 84Mb SRAM Design in 14nm FinFET CMOS Technology”
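The cell areas in the table follow from the two pitches: the word-line dimension is a number of fin pitches and the bit-line dimension is 2 x Cpp. A short sketch that recomputes the tabulated entries (function and variable names are illustrative):

```python
def cell_area_um2(cpp_nm, fp_nm, fin_pitches):
    """Bit-cell area: (fin_pitches x FP) in the WL direction times (2 x Cpp) in the BL direction."""
    wl_dim_nm = fin_pitches * fp_nm
    bl_dim_nm = 2 * cpp_nm
    return wl_dim_nm * bl_dim_nm * 1e-6  # nm^2 -> um^2

EDRAM_FINS = 2 * 4 / 3  # 2 x (4/3) FP, about 2.667

hd_22 = cell_area_um2(90, 60, 8.5)
lv_22 = cell_area_um2(90, 60, 10)
edram_22 = cell_area_um2(90, 60, EDRAM_FINS)
hd_14 = cell_area_um2(70, 42, 8.5)
lv_14 = cell_area_um2(70, 42, 10)

print(f"22nm: HD {hd_22:.4f}, LV {lv_22:.4f}, eDRAM {edram_22:.4f} um2")
print(f"14nm: HD {hd_14:.4f}, LV {lv_14:.4f} um2")
print(f"eDRAM vs HD SRAM density at 22nm: {hd_22 / edram_22:.2f}X")
```

The density ratio is just 8.5 / 2.667 because both cells share the 2 x Cpp bit-line dimension.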

Area - LVC SubArray

SubArray efficiency (73%) – Logic is 36% of bit-cell area

bitcells

10 FP x 4 = 1.68um

FP = 42nm

Cpp = 70nm

ovhd =

Assume:

+25% in IO Col

+10% in WL

(snap to 2FP, 2Cpp)

Bitcells = 114.24um

+ Row Decoder (+10%) = 125.664um

Area – eDRAM SubArray

Substantially similar (2X bits with 1 node gap)

• 13.1 mentions 65% sub-array efficiency.

ovhd =

Substantially more IO column overhead – per bit sense and write

Example calculation

Bitcells = 165.12um

+ Row Decoder (+10%) = 181.68um
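Both sub-array widths can be reproduced by adding the 10% row-decoder overhead and then rounding up to the layout grid. This is a sketch under an assumption: that the "(snap to 2FP, 2Cpp)" note applies to the word-line direction with a 2 x FP grid (84nm at 14nm, 120nm at 22nm). Integer nanometres are used to avoid floating-point rounding at the grid boundary.

```python
def padded_width_nm(bitcell_width_nm, overhead_percent, grid_nm):
    """Add a percentage overhead, then round up to the next multiple of grid_nm."""
    # Ceiling division written with integer arithmetic only
    padded = -(-bitcell_width_nm * (100 + overhead_percent) // 100)
    return -(-padded // grid_nm) * grid_nm

# 14nm LVC SRAM: 114.24 um of bit-cells, FP = 42 nm
sram_nm = padded_width_nm(114_240, 10, 2 * 42)
# 22nm eDRAM: 165.12 um of bit-cells, FP = 60 nm
edram_nm = padded_width_nm(165_120, 10, 2 * 60)

print(f"SRAM  sub-array width: {sram_nm / 1000:.3f} um")
print(f"eDRAM sub-array width: {edram_nm / 1000:.3f} um")
```

Under this assumption the SRAM value needs no snapping, while the eDRAM value moves from 181.632um up to the next 120nm grid point.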

Area – Block Level

Minimum Usable Structure

• Note: Multiple SubArrays

• Data chunking

Logic approx. 43% of bit-cell

MidLogic

MidLogic – Equiv. Pipe Stage

+11% Area Overhead

rray X

ovhd =

125.664um

Efficiency 69.8%

Area - Top

ECC in modern arrays forces line accesses

• Area Efficient Arrays are relatively large

‒ Example 65% logical array efficiency, ~70% physical efficiency

Explains 40%-50% efficiency of caches.

Need 8 Blocks for a cacheline

| Item | Composition | Bits |
| --- | --- | --- |
| Data Bits | 64 x 8b | 512 |
| DECTED | 2 x 10b + 1 | 21 |
| Spare/Red | ~1 per SubA | 11 |
| Total | | 544 |
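The bit accounting for one cacheline is simple arithmetic; the sketch below only restates it and derives the resulting storage overhead relative to the 512 data bits.

```python
DATA_BITS = 64 * 8        # 64 bytes of data
DECTED_BITS = 2 * 10 + 1  # DECTED check bits as listed
SPARE_BITS = 11           # spare/redundancy bits as listed

total_bits = DATA_BITS + DECTED_BITS + SPARE_BITS
overhead = total_bits / DATA_BITS

print(f"{total_bits} bits per cacheline, {overhead:.4f}x the data bits")
```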

CacheLine

Logical to Physical

Adders:

ECC +4%

Col Red. +2%

Row Red. +.7%

1005.312um

Cost of Walking Bits

Energy related to cost of switching wires

• Cap per unit mm – roughly invariant (geometry related)

Relative Permittivity ~3.0

Wikipedia is your friend – Low-K Dielectric

W = S = H

Cap per unit mm

Combination of parallel plate Capacitor and Coaxial Cap

2 x Pi x Eo x Er / ln(2.2)

Cwire ~ 200fF/mm

Modulated by width and spacing (within limits)
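Evaluating the coaxial expression with the stated relative permittivity shows where the ~200fF/mm figure comes from. A minimal sketch using the vacuum permittivity constant; the function name is illustrative.

```python
import math

EPS0 = 8.854e-12  # vacuum permittivity in F/m

def coax_cap_per_metre(eps_r, radius_ratio):
    """Coaxial capacitance per unit length: 2 x pi x eps0 x eps_r / ln(ratio)."""
    return 2 * math.pi * EPS0 * eps_r / math.log(radius_ratio)

c_f_per_m = coax_cap_per_metre(eps_r=3.0, radius_ratio=2.2)
c_ff_per_mm = c_f_per_m * 1e15 / 1e3  # F/m -> fF/mm

print(f"Cwire is about {c_ff_per_mm:.0f} fF/mm")
```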

Calculating Power

Power = Cdyn x Vswitch x Vsupply x Freq

• For logic circuits Vswitch = Vsupply = Vdd

• Power = Cdyn x Vdd² x F

$$C_{dyn} = \frac{Power}{V_{dd}^{2} \times Freq}$$

Leakage Power

• Physical bit-cell count

• Leakage per bit-cell

• Uplift to account for distribution of bit-cell leakage

‒ Assumes that Logic can be extensively power-gated

Cost of Walking Bits

Repeater Segment length depends on wire resistance

Cdyn per unit length is roughly invariant

Wire length: L

No. of segments : N

Cseg Cseg/4

Cseg = Cwire / N

Switched Cap per mm = N x Cwire/N x ( 1 + ¼ )

Roughly Cwire x 1.25

Buses: $C_{dyn} = 0.5 \times 0.5 \times C_{wire} \times 1.25$

• The 0.5 factors are the NRZ effect and the Average Activity Factor (0.38 if encoded)

• Data Cdyn = 0.0625pF/mm

Clock Wires: $C_{dyn} = C_{wire} \times 1.25$
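The per-mm Cdyn values follow from Cwire, the 1.25 repeater factor, and the two 0.5 factors for data buses. The sketch below recomputes them; treating 0.38 as a replacement for the activity factor on an encoded bus is an interpretation of the slide annotation.

```python
C_WIRE_FF_PER_MM = 200.0  # wire capacitance from the earlier estimate
REPEATER_FACTOR = 1.25    # 1 + 1/4 for the repeater load per segment

def bus_cdyn_ff_per_mm(nrz_factor, activity_factor):
    """Switched capacitance per mm of a data wire."""
    return nrz_factor * activity_factor * C_WIRE_FF_PER_MM * REPEATER_FACTOR

data_cdyn = bus_cdyn_ff_per_mm(0.5, 0.5)      # plain data bus
encoded_cdyn = bus_cdyn_ff_per_mm(0.5, 0.38)  # assumed: encoded bus with AF 0.38
clock_cdyn = C_WIRE_FF_PER_MM * REPEATER_FACTOR

print(f"data {data_cdyn} fF/mm, encoded {encoded_cdyn} fF/mm, clock {clock_cdyn} fF/mm")
```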

Wire Cdyn

Repeater placement and sizing are generally near optimal

[Plot: Relative Cdyn/mm, 80% to 160%; one region annotated "Repeater segment too short"]

Global Walks

Data Movement Power accounted in terms of BW

1005um 376u

Avg. ~880um

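Recap: 2GHz – 14nm design

0.5MB arrays, 32B din + 32B dout, 2GHz, 2 clock Tcycle

| Parameter | Value |
| --- | --- |
| Freq | 2GHz |
| Vdd | 0.8V |
| BW | 128GB/s |

| Global | Value |
| --- | --- |
| Avg. Walk | 878.5um |
| Act. Fact. | |
| Wire/Leaf Scale | 1.04 (533b/512b) |
| Cdyn | 57.2fF |
| Edyn at 0.8V | 36.6fJ |
| Power at 128GB/s | 37mW |

Power = GB/s x 8 x Cdyn x V²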
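The global data-walk numbers chain together as Cdyn, then Edyn = Cdyn x V², then Power = GB/s x 8 x Edyn. A sketch that recomputes them from the recap values, assuming the 0.0625pF/mm data Cdyn derived earlier applies to this walk:

```python
DATA_CDYN_FF_PER_MM = 62.5  # 0.0625 pF/mm for data wires
AVG_WALK_MM = 0.8785        # 878.5 um average walk
WIRE_LEAF_SCALE = 533 / 512 # wires per delivered data bit
VDD = 0.8                   # volts
BW_BYTES_PER_S = 128e9      # 128 GB/s

cdyn_ff = DATA_CDYN_FF_PER_MM * AVG_WALK_MM * WIRE_LEAF_SCALE
edyn_fj = cdyn_ff * VDD ** 2                            # energy per data bit
power_mw = BW_BYTES_PER_S * 8 * edyn_fj * 1e-15 * 1e3   # W -> mW

print(f"Cdyn {cdyn_ff:.1f} fF, Edyn {edyn_fj:.1f} fJ, Power {power_mw:.1f} mW")
```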

Clock, Address, Control

At the global level data movement dominates

1005um 376u

Ctrl, Clk ~2574um

13 address bits per group

3 control bits, 1 clock

gated in this example

| | Data | Address | Control | Clock |
| --- | --- | --- | --- | --- |
| Level | Global | Global | Global | Global |
| Avg. Walk | 878.5um | 1381um | 2574um | 2574um |
| Wire/Leaf Scale | 1.04 | 0.03 (13b/512b) | 0.006 (3b/512b) | 1.5 |
| Cdyn | 57.2fF | 0.96fF | 0.9fF | 1.0pF |
| Edyn at 0.8V | 36.6fJ | 0.61fJ | 0.6fJ | |
| Power at 128GB/s | 37mW | 1mW | 1mW | |
| Clk Power at 2GHz | | | | 1.2mW |

Act. Fact.: 0.5, 0.5, 0.21875, 0.5, 1

Addr ~1381um

Clk may not be BW dependent

Address 1-hot

Address Encoding

Sparse Encoding – powerful technique to reduce control and address power

Can be applied to data-buses – DBI drops AF to

[Figure: three 3-to-8 decoders with one-hot outputs 00001000, 00000001, 00000001]

Activity factor ~21% -- (7/8 x 2/8)
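The 7/8 x 2/8 figure can be checked by brute force: enumerate every pair of consecutive 3-bit addresses and count how many of the 8 one-hot wires toggle. The sketch assumes uniformly random, independent consecutive addresses; the binary-encoded case is included only for comparison of the per-wire factor.

```python
from itertools import product

def one_hot(value, width=8):
    """One-hot code: exactly one of `width` wires is high."""
    return [1 if i == value else 0 for i in range(width)]

def binary(value, bits=3):
    """Plain binary code, least significant bit first."""
    return [(value >> i) & 1 for i in range(bits)]

def per_wire_activity(encode, n_values):
    """Average fraction of wires that toggle between two random codes."""
    wires = len(encode(0))
    toggles = 0
    for a, b in product(range(n_values), repeat=2):
        toggles += sum(x != y for x, y in zip(encode(a), encode(b)))
    return toggles / (n_values ** 2 * wires)

af_one_hot = per_wire_activity(one_hot, 8)
af_binary = per_wire_activity(binary, 8)

print(f"one-hot per-wire AF = {af_one_hot}, binary per-wire AF = {af_binary}")
```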

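| | Data | Address | Control | Clock |
| --- | --- | --- | --- | --- |
| Level | Block | Block | Block | Block |
| Avg. Walk | 98.56um | 143.4um | 143.4um | 1479um |
| Wire/Leaf Scale | 1.06 | 0.19 | 0.047 | 16 |
| Cdyn | 6.5fF | 0.74fF | 0.42fF | 5.9pF |
| Edyn at 0.8V | 4.2fJ | 0.47fJ | 0.27fJ | |
| Power at 128GB/s | 4mW | 0.5mW | 0.3mW | |
| Clk Power at 2GHz | | | | 7.6mW |

Act. Fact.: 0.5, 0.5, 0.21875, 0.5, 1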

Block Walk

Data movement dominates at local level

• Clock power tapers as we go to global level

Active

1 subarray + ½ mid 1.5 subarray + ½

12x8/512

68x8/512

3x8/512

18um x 68 + 126 +

1.5 Subarray

SubArray Power

Lower-bound by computing bit-line power

• Accounting for pseudo-read columns

Pseudo Read

Compute on IO Column basis

Effectively

1 selected BL pair

3 pseudo BL pair

Bitlines - Multi-driven Nets

Back-end design is a strong influence on BL cap

• L: Wire Length

• Cwire: Wire Cap per um

• Ci: Cell cap per um

• W: Width of device

• N: Number of cells

[Figure: bit-line with N cells, each loading W x Ci, plus wire cap L x Cwire]

[Plot: Delay Sensitivity and Relative Energy vs. Ctot / Cwire, 1 to 3]

Ctot = N x W x Ci + L x Cwire

Assumption Ctot = 2 x L x Cwire

Example

Cbl = 2 x 89.6um/2 x 200fF/mm = 17.9fF
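Under the assumption Ctot = 2 x L x Cwire, the bit-line capacitance depends only on length. The sketch below evaluates it with Cwire = 200fF/mm (0.2fF/um) for the selected bit-line and for a pseudo-read bit-line; the pseudo-read length of 256 cells x 2 Cpp (140nm at 14nm) is an assumption chosen to match the "bit-cell dimension only" note.

```python
C_WIRE_FF_PER_UM = 0.2  # 200 fF/mm expressed per um

def bitline_cap_ff(length_um):
    """Ctot = 2 x L x Cwire: device loading assumed equal to wire loading."""
    return 2 * length_um * C_WIRE_FF_PER_UM

selected_len_um = 89.6 / 2  # half of the sub-array height
pseudo_len_um = 256 * 0.14  # assumed: 256 cells at 2 x Cpp each

c_selected = bitline_cap_ff(selected_len_um)
c_pseudo = bitline_cap_ff(pseudo_len_um)

print(f"selected BL {c_selected:.1f} fF, pseudo-read BL {c_pseudo:.1f} fF")
```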

SubArray Power

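Embedded memories – sub-array power is non-negligible

| | Read: Selected | Read: Pseudo | Write: Selected | Write: Pseudo |
| --- | --- | --- | --- | --- |
| BL Cap | 17.9fF | 14.3fF | 17.9fF | 14.3fF |
| Avg BL Swing | 0.2V | 0.25V | 0.8V | 0.25V |
| Vsw/Vdd | 25% | 31% | 100% | 31% |
| Wire Scale | 1.0625 | 3.1875 | 1.0625 | 3.1875 |
| Effective Cap | 4.8fF | 14.3fF | 19.0fF | 14.3fF |

| | Read | Write |
| --- | --- | --- |
| Total Effective Cap | 19.0fF | 33.3fF |
| Power at 128GB/s | 19mW | 34mW |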

BL Length selected: ½ Subarray, 45um

Pseudo read length accounts only for bit-cell dimension, 36um

Scale swing based on cap

1 pair for every di/do

3 pseudo pairs

* Accounts only for bit-line power – actual power higher (upper bound ~2X)

Unaccounted: Word-line, Control, Timer, Write Assist, Read Assist

[Pie chart: power breakdown]

Slices: 37.5mW, 40%; 0.6mW, 1%; 4.3mW, 5%; 0.5mW, 0%; 0.3mW, 0%; 8.8mW, 9%; 26.8mW, 29%; 14.4mW, 15%

Legend: Global - Data; Global - Address; Global - Control; Local - Data; Local - Address; Local - Control; Subarray - Rd=Wr; Leakage (1nA)

Energy Efficient Storage (example 0.1pJ/b)

• Leakage at 1nA/b

• Clock is underestimated

• SubArray 50% Rd – 50% Wr
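The ~0.1pJ/b figure can be cross-checked by summing the listed slice values and dividing by the bit rate at 128GB/s. A small sketch; it uses only the slice values as listed and does not depend on which legend entry each slice belongs to.

```python
SLICES_MW = [37.5, 0.6, 4.3, 0.5, 0.3, 8.8, 26.8, 14.4]  # slice values as listed
BW_BYTES_PER_S = 128e9                                    # 128 GB/s

total_mw = sum(SLICES_MW)
bits_per_s = BW_BYTES_PER_S * 8
energy_pj_per_bit = total_mw * 1e-3 / bits_per_s * 1e12

print(f"total {total_mw:.1f} mW -> {energy_pj_per_bit:.3f} pJ/b")
```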

Literature Survey

SRAM array – 22nm example

Lower voltage operation

• Heavily relies on circuits to avoid contention during write and read processes

Eric Karl, “A 4.6GHz 162Mb SRAM Design in 22nm Tri-Gate CMOS Technology with Integrated Active VMIN-Enhancing Assist Circuitry”

SRAM array – 22nm

Classical – 6T assembled into 4 quadrant sub-array

22nm SRAM – Write Assist

Satish Dhamaraju, “A 22nm IA CPU GPU on die”

Eric Karl, “A 4.6GHz 162Mb SRAM Design in 22nm Tri-Gate CMOS Technology with Integrated Active VMIN-Enhancing Assist Circuitry,” ISSCC 2012

22nm SRAM – Read Assist

Vmin focus – multiple supply rails

• Importance of supply collaterals

ISSCC 2016 17.1 10nm FinFET SRAM

A 10nm FinFET 128Mb SRAM with Assist Adjustment System for Power, Performance, Area Optimization

10nm node – 0.040um2 (HD) 0.049um2 (HC)

WordLine Under Drive

Suppressed Bit-Line

Negative Bit-Line WordLine Over-Drive

WordLine Collapse

Dual Transient Word-Line

ISSCC 2015 – 14nm SRAM

Eric Karl, “A 0.6V, 1.5GHz 84Mb SRAM Design in 14nm FinFET CMOS Technology,” ISSCC 2015, Paper 17.1

172Mb SRAM Test Vehicle

Highlights – Write Assist (TVC)

ISSCC 2014 – 22nm eDRAM

Notables: Supply collaterals – positive pumps, negative pumps, mid-voltage generators,

Two level sense – local sense followed by global sense, high-voltage word-line drivers.

ISSCC 2014 – 22nm eDRAM

evaluate restore precharge

2Mb ½ bank

Micro 2016

Questions
