Neural Inspired Computational Elements
Ultra-Efficient Neural Algorithm Accelerator Using Processing With Memory
Matthew J. Marinella, Sapan Agarwal, Robin Jacobs-Gedrim, Alex Hsia, David Hughart, Elliot Fuller, Steve J. Plimpton, A. Alec Talin, and Conrad D. James
Sandia National Laboratories, *[email protected]
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP
[Schematic: Pt/TaOx/Ta device cross-section showing positively charged vacancies and negative charges migrating under an applied top-electrode voltage VTE.]
Why do we need more efficient computers?
Google Deep Learning Study
- 16,000-core, 1,000-machine GPU cluster
- Trained on 10 million 200×200-pixel images; training required 3 days
- Training dataset size: no larger than what can be trained in 1 week

What would they like to do?
- ~2 billion photos uploaded to the internet per day (2014)
- Can we train a deep net on one day of image data?
- Assume a 1000×1000 nominal image size and linear scaling (both assumptions are unrealistically optimistic)
- Requires 5 ZettaIPS to train in 3 days (1 ZettaIPS = 10^21 IPS; ~5 billion modern GPU cores); a back-of-envelope sketch of this scaling follows below
- The world does not produce enough power for this, and data is increasing exponentially with time
- Need >10^16-10^18 instructions per second on a single IC
- Less than 10 fJ per instruction energy budget
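To make the scaling argument concrete, here is a minimal back-of-envelope sketch in Python; the image counts, image sizes, and the assumption of linear scaling come from the slide, and the variable names are illustrative:

```python
# Back-of-envelope scaling of the Google deep-learning study (Q. Le, ICASSP 2013).
# All inputs are the slide's deliberately optimistic assumptions.

base_images = 10e6             # images in the original 3-day training run
base_pixels = 200 * 200        # pixels per image in the original run

target_images = 2e9            # ~photos uploaded to the internet per day (2014)
target_pixels = 1000 * 1000    # assumed nominal image size

# Linear scaling in both image count and image size (unrealistically optimistic).
scale = (target_images / base_images) * (target_pixels / base_pixels)
print(f"workload scale-up: {scale:,.0f}x")   # -> 5,000x the original workload

# The slide quotes the scaled workload at ~5 ZettaIPS (5e21 instructions/s)
# to finish in 3 days, i.e. on the order of 5 billion modern GPU cores.
```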
Q. Le, IEEE ICASSP 2013
Evolution of Computing Machinery (inspired by Hasler and Marr, Frontiers in Neuroscience, 2013)
“Let physics do the computation.” Our brain is the ultimate example of this paradigm.
[Chart: energy per multiply-accumulate (MAC) versus time, from >100 J/MAC for the 1946 ENIAC down a log scale running 100 nJ/MAC to 1 aJ/MAC, across the 1980s-2000s Moore's Law era (density scaling and Dennard power-density scaling), the 2010-present heterogeneous integration era (close integration of emerging memory, low-voltage logic, and photonics), and projections for 2025 and 2035. Marked points include a modern computer, a single GPU card (late 2016), the exascale system goal, biological neurons*, and a neuromorphic target efficiency that is only possible with a new computing paradigm.]
*Caveat: biological neurons probably do not perform MACs.
Metal Oxide Resistive RAM (ReRAM)
[Figure: I-V characteristics. One sweep spans -0.5 to 4.5 V with currents up to ~6×10^-6 A; a second spans -2.5 to +1.0 V with currents up to ~6×10^-4 A, annotated with SET, RESET, the read window, and the highest-current switching process.]
[Schematic: TiN / Ta (15 nm) / TaOx (5-10 nm) / TiN stack with a switching channel; positively charged oxygen vacancies and O2- anion exchange under the applied top-electrode voltage VTE.]
Example: Sandia TiN/Ta/TaOx/TiN device
- Starts as an insulating MIM structure
- Forming: O2- removal via soft breakdown
- Bipolar resistance modulation
- Excellent memory attributes: switching in less than 1 ns, less than 1 pJ per switch demonstrated, scaling to 5 nm, >10^12 write cycles
[Figure: pre-form I/V (currents on the order of 10^-9 A over -1.0 to +1.0 V) compared with the SET-RESET I/V after forming.]
Crossbar Theoretical Limits
- Potential for 100 Tbit of ReRAM on a chip
- If each device can perform 1 million computations of interest per second (1 M-op): 10^12 active devices/chip × 10^6 cycles per second = 10^18 computations per second per chip
- Exascale computations per second on one chip!
- To avoid melting the chip, total power must be limited to ~100 W
- Allowed energy per operation = P × t/op = 100 W / 10^18 ops/s = 10^-16 J = 100 aJ/operation (sketched below)
- A 10 nm line has ~10 aF of capacitance, so it can be charged to 1 V with 10 aJ
- Drawback: “only” ~100 B transistors/chip
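A minimal sketch of the power-budget arithmetic above; all numbers are the slide's assumptions:

```python
# Energy budget for a hypothetical 100-Tbit ReRAM chip (slide assumptions).

devices = 1e12           # active crosspoint devices per chip
ops_per_device = 1e6     # computations of interest per device per second
ops_per_sec = devices * ops_per_device        # 1e18 ops/s: exascale on one chip

power_limit = 100.0      # watts, so the chip does not melt
energy_per_op = power_limit / ops_per_sec     # joules per operation
print(f"{energy_per_op:.0e} J/op")            # -> 1e-16 J = 100 aJ/operation

# Consistency check against wire charging: a 10 nm line at ~10 aF
# charged to 1 V takes E = C * V^2 = 10 aJ, inside the 100 aJ budget.
line_energy = 10e-18 * 1.0**2                 # 1e-17 J = 10 aJ
```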
[Schematic: 4×4 resistive crossbar with rows r1-r4, columns c1-c4, and a resistance R_{i,j} at each crosspoint.]
ReRAM Density vs. Minimum Feature Size (assumes a 4F² cell, 1 bit per cell)
[Plot: density (Tbit/cm², log scale from 0.01 to 1000) versus minimum feature size F (nm, roughly 1 to 40) for 1, 24, and 96 layers.]
Electronic Vector-Matrix Multiply
How does a crossbar perform a useful computation per device?

Mathematical: a weight matrix multiplies an input vector, V^T W = I, with elements I_j = Σ_i V_i W_{i,j}.
Electrical: each crosspoint conductance G_{i,j} stores a weight W_{i,j}. Applying voltages V_i to the rows produces column currents I_j = Σ_i V_i G_{i,j}: Ohm's law at each device, Kirchhoff current summation down each column (a numeric sketch follows below).
[Schematic: a 3×3 weight matrix W_{i,j} alongside a 3×3 conductance crossbar G_{i,j}; row voltages V1-V3 drive column currents I1, I2, I3.]
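As a minimal numeric sketch of this mapping (the conductance and voltage values are illustrative, not from the slides):

```python
import numpy as np

# Analog vector-matrix multiply: column current I_j = sum_i V_i * G[i, j]
# (Ohm's law at each crosspoint, Kirchhoff current summation down each column).

G = 1e-4 * np.array([[1.0, 0.5, 0.2],   # crosspoint conductances (siemens),
                     [0.3, 1.2, 0.7],   # one per stored weight
                     [0.9, 0.1, 0.4]])

V = np.array([0.10, 0.20, 0.05])        # row (input) voltages in volts

I = V @ G                               # column currents = the full VMM,
print(I)                                # one useful multiply per device
```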
Mapping Backprop to a Crossbar
Vector-matrix multiply and rank-1 update: the key kernels used in many algorithms (a numeric sketch of the rank-1 update follows below)
- Convert analog inputs to variable-length voltage pulses
- Integrate current to get an analog output value
[Schematic: a crossbar whose rows receive inputs from the previous layer and whose columns drive outputs to the next layer; the backpropagated error from the following layer is applied on the columns, and the backpropagated error to the previous layer is read out on the rows.]
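A minimal numeric sketch of the rank-1 update kernel (the learning rate, activations, and errors are illustrative):

```python
import numpy as np

# Rank-1 (outer-product) update: dW[i, j] = eta * x[i] * d[j].
# In the crossbar, row pulses encode x and column pulses encode d, so every
# crosspoint is updated by its pulse coincidence in one parallel step.

eta = 0.1                            # learning rate
x = np.array([0.2, 0.5, 0.1])        # activations from the previous layer (rows)
d = np.array([0.05, -0.02, 0.30])    # backpropagated errors (columns)

G = np.full((3, 3), 5e-4)            # current conductance-encoded weights
G += eta * np.outer(x, d)            # whole-array rank-1 update at once
```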
Accelerator Architecture
Device to Algorithm Model
[Diagram: a hierarchy of models linking device to algorithm, traversed both top-down ("What device properties are needed?") and bottom-up ("How do specific devices work in a system?"):
- Neural algorithm level model (CrossSim, SST)
- Computer architecture level model (McPAT*)
- Circuit level models (NVSim, CrossSim)
- Device level models (compact models, calibrated by electrical test)
The stack evaluates algorithm accuracy as well as energy and latency.]
Experimental Device Nonidealities
Ideally, a weight would increase and decrease linearly, in proportion to the learning-rule result. Experimental devices exhibit several nonidealities: write variability, write nonlinearity, asymmetry, and read noise (a toy model follows below). Circuits also add A/D and D/A noise and parasitics.
[Plots: conductance (G ∝ W) versus pulse number (V_write = 1 V, t_pulse = 1 µs), comparing the ideal response with write variability and nonlinearity, and a symmetric/linear update curve with an asymmetric/nonlinear one; read current (I ∝ G ∝ W) versus time at V_read = 100 mV, fluctuating within I0 ± ΔI due to read noise.]
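A toy model of these write nonidealities; the functional forms and constants below are illustrative assumptions, not fitted device data:

```python
import numpy as np

rng = np.random.default_rng(0)

def pulse_update(g, direction, g_min=1e-5, g_max=1e-3,
                 alpha=0.05, asym=2.0, sigma=0.02):
    """Apply one write pulse to a toy analog conductance g.

    direction: +1 for SET (increase g), -1 for RESET (decrease g).
    Nonlinearity: steps shrink as g approaches its bound (saturation).
    Asymmetry: RESET steps are `asym` times larger than SET steps.
    Write variability: multiplicative Gaussian noise on each step.
    """
    if direction > 0:
        step = alpha * (g_max - g)            # saturating SET step
    else:
        step = -asym * alpha * (g - g_min)    # larger, saturating RESET step
    g = g + step * (1.0 + sigma * rng.standard_normal())
    return float(np.clip(g, g_min, g_max))

g = 5e-4
for _ in range(100):   # 100 SET pulses: g rises nonlinearly toward g_max
    g = pulse_update(g, +1)
```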
ReRAM Analog Characterization
- Use as a neuromorphic weight requires precise analog tuning
- The dataset requires 1000 repeated SET pulses and 1000 repeated RESET pulses
- Nominal pulse values: SET +1 V, 10 ns RT/PW/FT; RESET -1 V, 10 ns RT/PW/FT; READ 100 mV, 1 ms RT/PW/FT
[Waveform: write pulse followed by a read pulse during which current is measured.]
[Plots: conductance (S), roughly 3×10^-4 to 6×10^-4 S, versus pulse number for the SET sequence (pulses 0-1000) and the RESET sequence (pulses 1000-2000).]
Repeated Pulsed Cycling
[Plots: SET and RESET conductance versus pulse number for 10 ns, 100 ns, and 1 µs pulses over 100 on/off cycles (200k pulses total).]
TaOx ReRAM in Backprop Training
How can training accuracy be improved?
[Plots: backprop training accuracy with TaOx ReRAM devices for networks of increasing size.]
Li-Ion Synaptic Transistor for Analog Computation (LISTA)
[Plot: G-V characteristic of the LISTA transistor.]
E. Fuller et al., Adv. Mater., accepted 2017
Analog State Characterization
[Plots: measured analog conductance states for a PCM array (G. W. Burr et al., IEEE TED 2015), TaOx ReRAM, and LISTA, which exhibits >200 states (E. Fuller et al., Adv. Mater., accepted 2017).]
LISTA Device Performance for the Backprop Algorithm
(columns: data set | # training examples | # test examples | network size)
- UCI Small Digits [1]: 3,823 | 1,797 | 64×36×10
- File Types [2]: 4,501 | 900 | 256×512×9
- MNIST Large Digits [3]: 60,000 | 10,000 | 784×300×10
(Network size increases down the table.)
E. Fuller et al., Adv. Mater., accepted 2017. See poster for details!
Electrochemical Neuromorphic Organic Device (eNode)
van de Burgt … Salleo, Nature Mater., in press (2017). See poster for details!
Circuit-Level Improvement
- The multi-synapse approach brings accuracy much closer to ideal, even with the high-variability TaOx device
- LISTA achieves essentially perfect accuracy
- This trades energy and latency for accuracy; the exact tradeoff depends on algorithm requirements
[Plots: four panels of training accuracy (TaOx - File Types, TaOx - MNIST, LISTA - File Types, LISTA - MNIST, the last spanning 95-99%), each comparing single device, multi-synapse, and ideal numeric results; the LISTA - MNIST panel also shows ideal with A/D.]
Agarwal et al., submitted 2017
Energy and Latency Analysis
(columns: SRAM | digital ReRAM | analog ReRAM)
- Equivalent area, ~450 1k×1k matrices: 400 mm² | 32 mm² | 11 mm²
- Total time per 1-layer cycle (3 ops: 2 reads, 1 write): ~100 µs, transpose-read dominated | ~60 µs, update dominated (10 ns write) | ~5 µs, temporal-coding dominated (256 levels)
- Total energy per 1-layer cycle: ~1000 nJ, multiply dominated | ~700 nJ, multiply dominated | ~15 nJ
- Matrix storage area: 95% | 50% | 17%
- Periphery area: 5% | 50% | 100% (crossbar sits on top of the periphery)
- Matrices per 400 mm² chip: ~450 | ~5,500 | ~15,000
- Energy per operation: 330 fJ | 230 fJ | 5 fJ
- Operations per second: 14 TeraOps/s | 270 TeraOps/s | 9,000 TeraOps/s
Energy and Latency Analysis
Matrix storage, 1024×1024 [values are per array]
- Area: 800,000 µm² (SRAM); 35,000 µm² (digital ReRAM); analog ReRAM crossbar: 4,300 µm² array + 8,460 µm² periphery
- Read: 30 nJ / 15 µs (SRAM); 15 nJ / 4 µs (digital ReRAM); ~3 nJ / ~1.5 µs (analog)
- Read transpose: 300 nJ / 65 µs (SRAM); 15 nJ / 4 µs (digital ReRAM); ~3 nJ / ~1.5 µs (analog)
- Write: 30 nJ / 15 µs (SRAM); 50 nJ / 45 µs (digital ReRAM); 3 nJ / ~1.5 µs (analog)

Multiply accumulators [256 in parallel; FREE in the analog crossbar, where the multiplies occur in the array]
- Area: 19,000 µm²; run of 1M ops: 200 nJ / 4 µs

Output LUT [uses digital methods]
- Area: 1,400 µm²; read: 1 nJ / 1 µs

Input/output buffers [8 bits; use digital methods]
- Area: 13,000 µm²; per run: ~0.1 nJ

Vector cache*, 16 entries of 1024×8 bits [values are per 1024×8-bit vector]
- Area: 11,250 µm² (SRAM) vs. 500 µm² (ReRAM)
- Read: ~0.1 nJ / ~0.2 µs (SRAM) vs. ~1 nJ / 4 ns (ReRAM)
- Write: ~0.1 nJ / ~0.2 µs (SRAM) vs. ~1 nJ / 50 ns (ReRAM)
Conclusion
- Dennard (constant-power-density) scaling has ceased, and Moore's law is slowing
- New paradigms like neuromorphic computing will be required for sub-fJ computing
- We now require a device-through-system design mentality, the motivation behind CrossSim (see the poster for more detail on CrossSim)
- Oxide-based resistive memory offers intriguing device options for both eras
- The novel LISTA and eNode devices offer significant potential for developing a low-energy neural accelerator (see the LISTA and eNode posters for more detail)
Thank you!
Acknowledgements
This work is funded by Sandia's Laboratory Directed Research and Development program as part of the Hardware Acceleration of Adaptive Neural Algorithms Grand Challenge Project.
Many shared ideas among collaborators:
- Alberto Salleo, Yoeri van de Burgt, Stanford
- Engin Ipek, U Rochester
- Dave Mountain, Mark McLean, US Government
- Stan Williams, John Paul Strachan, HPL
- Dhireesha Kudithipudi, RIT
- Jianhua Yang, U Mass
- Hugh Barnaby, Mike Kozicki, Shimeng Yu, ASU
- Tarek Taha, U Dayton
- Paul Franzon, NC State University
- Sayeef Salahuddin, UC Berkeley
- Dozens of others…
We are especially interested in collaborations on CrossSim!
ENIAC
Electronic Numerical Integrator And Computer, developed by the US Army and U Penn, 1946
150 kW at 357 FLOPS: roughly 400 J per (10-bit) FLOP
Where Are We Today?
Single unit: Nvidia Tesla P100 GPU
- Most advanced GPU specs at release, late 2016; targets deep learning and neural applications
- 20 TFLOPS (16-bit) peak performance with peak power dissipation of 300 W
- 70 GFLOPS/W, or about 15 pJ per (16-bit) FLOP
Supercomputer: Sunway TaihuLight (China)
- Top supercomputer in the world, built on the ShenWei processor
- 90 PFLOPS peak at 15 MW power
- 6 GFLOPS/W, or about 170 pJ/FLOP
Need >1000× improvement to tackle internet-scale problems
Basics of Neural Networks
Basic building block: the neuron
- Inputs x_1, x_2, ..., x_n are multiplied by weights w_1, w_2, ..., w_n and summed: z = Σ_{i=0..n} w_i x_i
- A sigmoid activation gives the output: y = 1 / (1 + e^(-z))
Simple network: backpropagation
- Inputs feed a hidden layer, which feeds the outputs
- For a labeled example, a correct output (e.g., 1 0 0 0 for class 0 of outputs 0-3) needs no adjustment; an incorrect output (e.g., 0 0 0 1) triggers a weight adjustment
(A numeric sketch of the neuron follows below.)
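A minimal sketch of this building block (the input and weight values are illustrative):

```python
import numpy as np

def neuron(x, w):
    """Weighted sum z = sum_i w_i * x_i followed by the sigmoid y = 1/(1+e^-z)."""
    z = np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1, 0.9])    # inputs from the previous layer
w = np.array([0.4, -0.6, 0.2])   # learned weights
y = neuron(x, w)                 # output passed to the next layer
```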
Another Analogy
Mathematical neuron: inputs x_1 ... x_n, weights w_1 ... w_n, summation z = Σ_{i=0..n} w_i x_i, activation y = 1 / (1 + e^(-z)).
Electrical analog: input voltages V_1 ... V_n, conductances G_1 ... G_n as the weights, and a summed current I = Σ_{i=0..n} G_i V_i.
Why is it essential to cram so many computations on a single chip?
[Diagram: a chip tiled with many crossbar accelerator cores.]
Can you simply connect millions of ultra-efficient chips? Yes, but every time data leaves the chip it is elevated in the communication hierarchy, and the energy efficiency per operation is reduced.
TaOx ReRAM in Backprop Training
(columns: data set | # training examples | # test examples | network size)
- UCI Small Digits [1]: 3,823 | 1,797 | 64×36×10
- File Types [2]: 4,501 | 900 | 256×512×9
- MNIST Large Digits [3]: 60,000 | 10,000 | 784×300×10
(Network size increases down the table.)
(1 µs pulses)
[Plot: measured write-pulse waveform, voltage (V) from -0.2 to 1.2 over a time window of roughly ±4×10^-8 s.]
ReRAM Measurements
- DC current-voltage "loop" sweeps are not time-controlled: they cause excessive heating and early wearout, and they provide no information on switching dynamics
- Physical switching takes < 10 ns, so a pseudo-RF setup is needed to measure it: ground/signal probes on a conductor-backed substrate
- Agilent B1530 module: 10 ns rise/fall time, 10 ns pulse width; 1 V nominal amplitude with ~140 mV overshoot
[Schematic: the TiN / Ta (15 nm) / TaOx (5-10 nm) / TiN stack shown earlier, with its switching channel, positively charged vacancies, and O2- anion exchange under the applied top-electrode voltage VTE.]
Measured pulse: rise = 12.8 ns, fall = 11.4 ns, amplitude = 1.14 V
Effect of Pulse Width and Edge Time
- Shorter pulses may be employed to lower the conductance switching range
- Linearity is qualitatively similar across pulse width (PW) and edge time (ET): best for SET at 100 ns and for RESET at 1 µs
- The relative conductance change increased with shorter pulse width / edge time
Nominal pulse voltages: SET +1 V, RESET -1 V
[Plots, each versus pulse number (0-2000) for 1 µs, 100 ns, and 10 ns pulses: conductance (S, log scale from 10^-3 to 10^-2), normalized conductance (0-100%), and conductance change (100-200%).]
Normalizations used above: Y = (G - G_min) / (G_max - G_min) × 100 and Y = G / G_min × 100 (a small sketch follows below).
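A small sketch of the two normalizations applied to a conductance trace (the trace here is synthetic):

```python
import numpy as np

def normalized_conductance(G):
    """Y = (G - Gmin) / (Gmax - Gmin) * 100: percent of the full window."""
    return (G - G.min()) / (G.max() - G.min()) * 100.0

def conductance_change(G):
    """Y = G / Gmin * 100: conductance relative to the minimum, in percent."""
    return G / G.min() * 100.0

G = np.linspace(3e-4, 6e-4, 2000)      # synthetic stand-in for a measured trace
print(normalized_conductance(G)[-1])   # 100.0
print(conductance_change(G)[-1])       # 200.0
```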
Analog Core: Forward Propagation
[Diagram: layers i → j → k. The analog crossbar storing weights w_ij computes z_j = Σ_i y_i × w_ij, O(N²) operations in one parallel step; the digital core then applies the neuron function y_j = 1 / (1 + e^(-z_j)), O(N) operations.]
Analog Core: Back Propagation
[Diagram: the error δ_k from layer k is backpropagated through the crossbar as Δ_j = Σ_k w_jk × δ_k, O(N²) read operations; the digital core computes δ_j = y'(z_j) × Δ_j, O(N) operations; the weights are then updated with the outer product w_ij += η × y_i × δ_j, O(N²) write operations.]
(A numeric sketch of one such training step follows below.)
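Putting the three crossbar kernels together, here is a minimal sketch of one training step for a single layer; the layer sizes and values are illustrative:

```python
import numpy as np

# One layer's training step using the three crossbar operations above:
# read (forward VMM), transpose read (error backprop), write (rank-1 update).

rng = np.random.default_rng(0)
eta = 0.1

W = rng.uniform(-0.1, 0.1, size=(64, 36))   # weights between layers i and j

y_i = rng.random(64)                         # activations arriving from layer i

# Forward: analog VMM, O(N^2), then digital neuron function, O(N)
z_j = y_i @ W
y_j = 1.0 / (1.0 + np.exp(-z_j))

# Backward: an error delta_j arrives from the layers above (illustrative here)
delta_j = 0.01 * rng.standard_normal(36)

Delta_i = W @ delta_j                        # transpose read, O(N^2)
delta_i = y_i * (1.0 - y_i) * Delta_i        # digital y'(z) * Delta, O(N),
                                             # treating y_i as sigmoid outputs

W += eta * np.outer(y_i, delta_j)            # rank-1 write update, O(N^2)
```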