hybrid vc-mtj/cmos non-volatile stochastic logic for...

27
Shaodi Wang , Saptadeep Pal, Tianmu Li, Andrew Pan, Cecile Grezes, Pedram Khalili-Amiri, Kang L. Wang, and Puneet Gupta Department of Electrical Engineering, University of California, Los Angeles, CA 90095 USA Hybrid VC-MTJ/CMOS Non-volatile Stochastic Logic for Efficient Computing

Upload: tranduong

Post on 16-Mar-2018

218 views

Category:

Documents


2 download

TRANSCRIPT

Shaodi Wang, Saptadeep Pal, Tianmu Li, Andrew Pan, Cecile Grezes, Pedram Khalili-Amiri,

Kang L. Wang, and Puneet Gupta

Department of Electrical Engineering, University of California, Los Angeles, CA 90095 USA

Hybrid VC-MTJ/CMOS Non-volatile StochasticLogic for Efficient Computing

• Introduction of stochastic computing (SC)

• Introduction of voltage-controlled magnetic tunnel junctions (VC-MTJ) and negative differential resistance (NDR)

• Stochastic logic gates designed by VC-MTJ and NDR

• Stochastic bitstream (SBS) generation design

• Design evaluation

Guidelines

4 April 2017 Shaodi Wang / UCLA 2

• Advantages• Efficiency in additions and multiplications• Parallelism

• Disadvantages• High leakage from massive registers• Inefficiency for high-precision applications• Inefficient pseudo stochastic bitstream generation

4 April 2017 Shaodi Wang / UCLA

Stochastic computing (SC)

• Advantages• Low leakage• Truly random SBS generation

• Disadvantages• SBS generation is limited by precision• Correlation caused by process variation • Data copy between CMOS and NVM• Amended by the proposed work

Non-volatile SC with STT-MTJ and memristor

4 April 2017 Shaodi Wang / UCLA 4

Figures from Knag, Phil, Wei Lu, and Zhengya Zhang. "A native stochastic computing architecture enabled by memristors." IEEE Transactions

on Nanotechnology 13.2 (2014): 283-293.

• The proposed voltage-controlled magnetic tunnel junctions (VC-MTJ) based SC

• VC-MTJ based NV SBS registers

• Computations on VC-MTJs• Skipping the data copy between non-volatile memory and CMOS logics

• Energy-efficient computation

• Efficient and reliable truly stochastic bitstream (SBS) generator• Free from process and temperature variation impact

Highlights

4 April 2017 Shaodi Wang / UCLA 5

Utilizing Voltage-controlled magnetic tunnel junctions (VC-MTJ) for SBS generation and storing

4 April 2017 Shaodi Wang / UCLA 6

• Anti-parallel (AP) and parallel (P) resistance >50k Ω low switching energy

• Non-deterministic switching need a simple hardware to enable deterministic switching

V = 0

EB

Free

Fixed

MgO V = 0

V =VC

MgOV=VC>0

• Zero read disturbance • But need a simple

hardware to enable the use of register

MgOV < 0

V < 0

Precessional switching

Introduction – Negative differential resistance (NDR)

4 April 2017 Shaodi Wang / UCLA 7

VNDR

Cu

rren

t

VAPP

Unstable

Increase VAPP from 0Decrease VAPP from high to low

RRAM resistance from HRS to LRS

Peak

Valley

VAPP

VN

DR

T1

T2

T3

in

GND

bias

intIn

Bias

GND

VN

DR

IN

DR

Current drops 40X

• Reset operation

• Read operation

Bitwise operation using MTJ and NDR

4 April 2017Shaodi Wang / UCLA 8

VAPP

VMTJ

RMTJrAPrP

rP

Reset AP-MTJ Reset P-MTJ

VAPP

VMTJ

RMTJrAP

rP

Write mode

Anode

Cathode

VAPP

bias

Read AP-MTJ Read P-MTJ0

0

0

0

VAP P

Anode

Cathode

Read mode

bias

Bitwise operation (cont’d)

4 April 2017 Shaodi Wang / UCLA 9

Flip xx

x

0.6ns

Long pulse (5 ns)Randomization x0.5

x

Vread

x

y

XNOR y x y+ Short pulse

Flip MTJ data

XNOR

RandomizationVC-MTJ switching probability

Reset + Flip = Write-1 (AP)

Bitwise operation (cont’d)

4 April 2017 Shaodi Wang / UCLA 10

y

Vand

x

VCC

x y

AND

NDR peak current design

prerequisite:

VCC / (2rAP) < Ipeak < VCC / (rAP + rP)

Scaled addition

Vread

Vread

VDD

s

x

y

Short pulse

s x+(1-s) y

Stochastic bitstream generation

4 April 2017 Shaodi Wang / UCLA

11

=1 ⋅1

2

1+ 0 ⋅

1

2

2+ 1 ⋅

1

2

3= 1 + 0 + 1 ⋅

1

2⋅1

2⋅1

2

0 0 0 0 0 0 0 0

SBS0

SBS1

SBS2

Randomization1

2

Scaled copy x/2(if SBS0[i]==1 then Randomize SBS1[i])

0 1 1 0 1 0 1 1

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 1 0 0 0 1 0

1 0 1 0 1 1 1 0

0 1

2

0 1

4

0 5

8

copy and rand (x+1)/2

(if SBS1[i]==1, then flip SBS2[i], else randomize SBS1[i])

• Generation can be pipelinedBinary fixed point 0.101

• SBS generator• HSPICE simulation with experimentally verified 50nm VC-MTJ Verilog-A

model

• 55X lower energy than CMOS linear-feedback shift register (LFSR)

• Standard adder and multiplier• HSPICE simulation to extract power and delay

• Multiplier: 8 transistors per bit (this work) vs. 160 transistor per bit for a multiplier (CMOS binary)

• Register: 3 transistors + 1 VC-MTJ (this work) vs 16 transistors (CMOS binary)

Design simulation

4 April 2017 Shaodi Wang / UCLA 12

• Two representative design benchmarks• Finite impulse response (FIR) filter

• Adaboost machine learning accelerator

• Ratio of SBS generation to multiplication and additions• FIR > Adaboost

Evaluation benchmarks

4 April 2017 Shaodi Wang / UCLA 13

FIR filterAdaboost

Input image

h1 h2 hn

H

α1 α2 αn

• Experimental setup• Precision: 32-bit SBS to 256-bit SBS corresponding to 5-bit to 8-bit binary

fixed-point number• Energy is accounted for computation and SBS generation separately• Energy/output is the comparison metric

• SC is easily pipelined, so there is no delay comparison

• Evaluation procedure• CMOS binary and CMOS SC implementation

• Synthesis and place-route with 45nm commercials library• This work

• HSPICE with 45nm CMOS commercial library and VC-MTJ experimentally verified model

Evaluation (cont’d)

4 April 2017 Shaodi Wang / UCLA 14

• Energy efficiency comparisons• FIR: 3~7X (256~32-bit precision) lower energy than CMOS binary design

• Adaboost: 12~25X (256~32-bit precision) lower energy than CMOS binary design

• This work < CMOS binary < CMOS SC in terms of energy

• SC is more advantageous in low-precision applications

Evaluation (cont’d)

4 April 2017 Shaodi Wang / UCLA 15

Thank you!

4 April 2017 Shaodi Wang / UCLA 16

Backup slides

4 April 2017 Shaodi Wang / UCLA 17

NanoCAD Lab [email protected] UCLA

Voltage-controlled magnetic tunnel junctions (VC-MTJ)

18

• Zero read disturbance• Anti-parallel (AP) and parallel (P)

resistance (>50k Ω) low switching energy

• Efficient switching (computing): ~fJ per bit operation

MTJ stacks

Precesionnal switching

V = 0

EB

V =VC

Free

Fixed

MgO MgO

V = 0 V=VC>0 MgO

V < 0

V < 0

NanoCAD Lab [email protected] UCLA

Experimental Setup for NDR and MRAM

19

50nm perpendicular VC-MTJ

NDR

NanoCAD Lab [email protected] UCLA

Experimental NDR and NDR-assisted write and read

demonstration

20

T1

T2

T3

in

GND

bias

intIn

Bias

GND

VN

DR

IN

DR

NDR implementation NDR I-V curves vs. biasPeak-to-valley current ratio of 10~1000

AP P

Write mode

Anode

Cathode

VCC

bias

NDR assisted VC-MTJ write Write current drops 40X, while write voltage drops 50X

rAP/rp = 1.3X

NanoCAD Lab [email protected] UCLA

Experimental NDR and NDR-assisted write and read

demonstration

• “Reset” VC-MTJ to P

state (0 state) w/ NDR

– Reset error rate < 10-6

• NDR-assisted non-

destructive VC-MTJ

read

– 0 read disturbance

– Full voltage output

swing through NDR to

detect MTJ states

21

bias

Cathode

Read mode

VCC

Anode

NanoCAD Lab [email protected] UCLA

Stochastic bitstream generation

• Conversion from binary input to fraction

– 0. 𝑛2𝑛1𝑛0(𝑏𝑖𝑛𝑎𝑟𝑦) = 𝑛2 ⋅1

2

1+ 𝑛1 ⋅

1

2

2+ 𝑛0 ⋅

1

2

3

• Conversion from faction to stochastic bitstream (SBS)

– 𝑛21

2

1+ 𝑛1

1

2

2+ 𝑛0

1

2

3= 𝑛2 + 𝑛1 + 𝑛0 ⋅

1

2⋅1

2⋅1

2

– xi is 0 or 1

– 1 + 𝑥 ⋅1

2and (0 + 𝑥) ⋅

1

2, where x is a SBS

– With long pulse, VC-MTJ switching creates perfect 1

2

NanoCAD Lab [email protected] UCLA

Stochastic bitstream generation (cont’d)

• Copy (y=x)

– Reset SBS y to 0

– For each xi=1, flip yi to 1

• Scaled copy (y=x/2)

– Reset SBS y to 0

– For each xi=1, random yi

• Copy and rand (y=(1+x)/2)

– Reset SBS y to 0

– For each xi=1, flip yi to 1

– For each xi=0, random yi

23

Scaled copy y = 0x/2

Vread

x

y

Copy y = 0x

Short pulse Long pulse

NanoCAD Lab [email protected] UCLA

Stochastic bitstream generation (cont’d)

24

STEP SBS0 IN0 SBS1 IN1 SBS2 IN2

1

2

3

4

0.1 0 11 01Binary input[1]:

0 1 00.Binary input[2]: 010

1 1 10.Binary input[3]: 1 1

0110100101001000

00000000 01101011

0111010010101100 01010000

11101101

Rules:INi==1: copy and rand y=(x+1)/2

INi==0: scaled copy y=x/2

Pipelined SBS generation

IN2IN1IN0

NanoCAD Lab [email protected] UCLA

Stochastic bitstream generation (cont’d)

25

• Hardware designs:

– SBSodd is written while SBSeven is being read

– Every clock cycle half of SBS is changed

VDD

GND

VDD

GND

write_even

write_odd write_odd

VDD

GND

write_odd

write_odd

write_odd

write_even

write_odd

write_even

write_odd

write_even

write_odd &

reset_odd

B[0] C[0] ...

write_even

GND

reset

resetreset

reset

reset

reset

reset

reset

reset

reset

reset

Long pulseshort pulse

write_even

write_even

write_odd

reset

reset

reset

reset

write_oddreset

Long pulse

A[2] B[2] ...

B[1] C[1] ...

Long pulse

short pulse

short pulse

short pulse

Long pulse

short pulse

Long pulsereset

VC-MTJ array (nx2n)

Peripheral control

SBS0

SBS1

SBS2

Write

Write

Read

Read

NanoCAD Lab [email protected] UCLA

Stochastic bitstream generation (cont’d)

26

Reset SBS0

Randomize SBS0

Reset SBS1 and copy from SBS0

NanoCAD Lab [email protected] UCLA

On-going works – automatic synthesis

• Target: synthesizing algorithm description to circuit

• Optimizing targets

– Correlation, area and latency minimization

• Proposed solutions

– Factorization and computing depth minimization

– Examples:

• 𝐹 = 𝑎𝑏𝑔 + 𝑎𝑐𝑔 + 𝑎𝑑𝑓 + 𝑎𝑒𝑓 + 𝑎𝑓𝑔 + 𝑏𝑑 + 𝑐𝑒 + 𝑏𝑒 + 𝑐𝑑 −→ 𝑏 + 𝑐 𝑑 + 𝑒 + 𝑎𝑔 +𝑑 + 𝑒 + 𝑔 𝑎𝑓

27