marinella presentation - src
TRANSCRIPT
Neural Inspired Computational Elements
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP
Matthew J. Marinella, Sapan Agarwal, Robin Jacobs-Gedrim, Alex Hsia, David Hughart, Elliot Fuller, Steve J. Plimpton, A. Alec Talin, and Conrad D. James
Sandia National Laboratories; *[email protected]
Ultra-Efficient Neural Algorithm Accelerator Using Processing With Memory
[Figure: Pt / TaOx / Ta device cross-section showing (+) charged vacancies and negative charges under the top-electrode voltage V_TE]
Why do we need more efficient computers? Google Deep Learning Study
- 16,000-core, 1,000-machine GPU cluster
- Trained on 10 million 200x200-pixel images; training required 3 days
- Training dataset size: no larger than what can be trained in 1 week
What would they like to do?
- ~2 billion photos uploaded to the internet per day (2014)
- Can we train a deep net on one day of image data?
- Assume a 1000x1000 nominal image size and linear scaling (both assumptions are unrealistically optimistic)
- Requires 5 ZettaIPS to train in 3 days (1 ZettaIPS = 10^21 IPS; ~5 billion modern GPU cores)
- The world doesn't produce enough power for this!
- Data is increasing exponentially with time
- Need >10^16-10^18 instructions per second on one IC, at less than 10 fJ per instruction
Q. Le, IEEE ICASSP 2013
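The back-of-envelope scaling above can be checked in a few lines of Python. This is a sketch: the ~15 pJ/instruction GPU efficiency figure is borrowed from the backup slides and used here as an illustrative assumption.

```python
# Scale factor of the proposed training set relative to the 2013 study:
# 2 billion images/day vs. 10 million, 1000x1000 pixels vs. 200x200.
scale = (2e9 / 10e6) * ((1000 * 1000) / (200 * 200))
print(scale)  # 5000.0: five-thousand-fold more raw pixel data

# Power implied by 5 ZettaIPS (5e21 IPS) at ~15 pJ per instruction,
# roughly the efficiency of a late-2016 GPU (illustrative assumption)
power_w = 5e21 * 15e-12
print(f"{power_w:.1e} W")  # 7.5e+10 W: tens of gigawatts
```

Even with these optimistic inputs, the power implied by conventional hardware is far beyond any practical budget, which motivates the fJ-per-instruction target.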
Inspired by Hasler and Marr, Frontiers in Neuroscience, 2013
"Let physics do the computation": our brain is the ultimate example of this paradigm
Moore's Law Era: density scaling and Dennard power-density scaling
[Figure: Evolution of Computing Machinery: energy per MAC (axis from 1 aJ/MAC to 100 nJ/MAC) versus era, from >100 J/MAC for the 1946 ENIAC through the 1980s, 1990s, 2000s, and 2010-present, with projections for 2025 and 2035. Marked points: modern computer, single GPU card (late 2016), exascale system goal, and biological neurons*. The neuromorphic target efficiency is only possible with a new computing paradigm.]
*Caveat: Biological neurons probably do not perform MACs
Heterogeneous Integration Era: close integration of emerging memory, low-voltage logic, and photonics
Metal Oxide Resistive RAM (ReRAM)
Highest current switching process
[Figure: DC I-V sweeps of the SET process (0 to ~4.5 V, current rising to ~6 µA) and the RESET process (0 to ~-2.5 V, current reaching ~-0.6 mA), with the read window indicated]
[Figure: TiN / Ta (15 nm) / TaOx (5-10 nm) / TiN stack with a switching channel; O^2- anions exchange with (+) charged vacancies under the top-electrode voltage V_TE]
Example: Sandia TiN/Ta/TaOx/TiN device
- Starts as an insulating MIM structure
- Forming: removes O^2- via soft breakdown
- Bipolar resistance modulation
- Excellent memory attributes: switching in less than 1 ns, less than 1 pJ demonstrated, scaling to 5 nm, >10^12 write cycles
[Figure: pre-form I/V with nA-scale currents, compared with the SET-RESET switching characteristic after forming]
Crossbar Theoretical Limits
- Potential for 100 Tbit of ReRAM on chip
- If each device can perform 1M computations of interest per second (1 M-op): 10^12 active devices/chip x 10^6 cycles per second = 10^18 computations per second per chip
- Exascale computations per second on one chip!
- To avoid melting the chip, total chip power must be limited to ~100 W
- Allowed energy per operation = P x t/op = 100 W / 10^18 ops/s = 10^-16 J = 100 aJ/operation
- A 10 nm line capacitance is ~10 aF; the line can be charged to 1 V with 10 aJ
- Drawback: "only" ~100B transistors/chip
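The power-budget arithmetic above, written out as a minimal sketch restating the slide's numbers:

```python
# 1e12 active devices, each performing 1e6 computations per second
total_ops_per_s = 1e12 * 1e6        # 1e18 ops/s: exascale on one chip
power_budget_w = 100.0              # keep the whole chip under ~100 W

energy_per_op = power_budget_w / total_ops_per_s
print(energy_per_op)                # 1e-16 J = 100 aJ per operation

# Charging a 10 nm line (~10 aF) to 1 V takes roughly C*V^2 of energy
line_energy = 10e-18 * 1.0 ** 2     # 1e-17 J = 10 aJ, within the budget
```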
[Figure: resistive crossbar array with row lines r1-r4, column lines c1-c4, and a resistor R_i,j at each crosspoint]
ReRAM Density vs. Minimum Feature Size (assumes a 4F^2, 1-bit cell)
[Figure: density, 0.01 to 1000 Tbit/cm^2 on a log scale, versus minimum feature size F in nm, for 1, 24, and 96 layers]
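The density curve follows directly from the 4F^2 cell assumption; a minimal sketch:

```python
# Ideal areal density of a 4F^2, 1-bit ReRAM cell (the plot's assumption).
def density_tbit_per_cm2(f_nm: float, layers: int = 1) -> float:
    cell_area_cm2 = 4 * (f_nm * 1e-7) ** 2   # 1 nm = 1e-7 cm
    return layers / cell_area_cm2 / 1e12     # bits/cm^2 -> Tbit/cm^2

print(density_tbit_per_cm2(10))              # 0.25 Tbit/cm^2 for one layer
print(density_tbit_per_cm2(10, layers=96))   # 24 Tbit/cm^2 with 96 layers
```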
Electronic Vector Matrix Multiply
How does a crossbar perform a useful computation per device?
[Figure: two equivalent pictures. Mathematical: a weight matrix W with inputs V_1, V_2, V_3 computes I_j = Σ_i V_i W_i,j, i.e. V^T W = I. Electrical: a crossbar of conductances G_1,1 ... G_3,3 driven by row voltages V_i produces column currents I_j = Σ_i V_i G_i,j]
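In code, the electrical picture is just a vector-matrix product; a NumPy sketch with arbitrary conductance and voltage values:

```python
import numpy as np

# Row voltages V_i drive conductances G_i,j; by Ohm's and Kirchhoff's laws
# each column current is I_j = sum_i V_i * G_i,j, i.e. I = V^T G.
G = np.array([[1.0, 2.0, 0.5],
              [0.2, 1.5, 1.0],
              [0.8, 0.1, 2.0]])   # conductances (arbitrary units)
V = np.array([0.1, 0.2, 0.05])   # row voltages

I = V @ G                        # all column currents in one analog step
assert np.isclose(I[0], sum(V[i] * G[i, 0] for i in range(3)))
print(I)
```

The crossbar performs all N^2 multiply-accumulates in a single parallel step, which is the source of the efficiency claim.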
Mapping Backprop to a Crossbar
Vector matrix multiply and rank-1 update: the key kernels used in many algorithms
Convert analog inputs to varying-length voltage pulses; integrate the output current to get an analog output value
[Figure: crossbar with inputs from the previous layer driving the rows and outputs to the next layer on the columns; the backpropagated error from the following layer is applied to the columns, and the backpropagated error to the previous layer is read from the rows]
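The pulse-length encoding can be sketched numerically: each input becomes a fixed-amplitude pulse whose duration is proportional to its value, and integrating the column currents over time recovers the vector-matrix product. This is a toy discretization; the amplitude, step count, and conductance values are illustrative.

```python
import numpy as np

G = np.array([[1.0, 0.5],
              [0.2, 2.0]])             # crossbar conductances
x = np.array([0.25, 0.75])             # analog inputs in [0, 1]
v_read, steps, dt = 0.1, 100, 1e-6     # pulse amplitude, time resolution, step

charge = np.zeros(2)
for t in range(steps):
    on = (x > t / steps).astype(float)   # which input pulses are still high
    charge += (on * v_read) @ G * dt     # integrate the column currents

# The integrated charge is proportional to the vector-matrix product x @ G
assert np.allclose(charge, (x @ G) * v_read * steps * dt)
```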
Accelerator Architecture
Device to Algorithm Model
[Figure: modeling stack spanning top-down and bottom-up flows. Top down: a neural-algorithm-level model, a computer-architecture-level model (McPAT*), circuit-level models (NVSim, CrossSim), and device-level models (CrossSim, compact models, electrical test) answer what device properties are needed for algorithm accuracy, energy, and latency. Bottom up: electrical test data feed compact models, CrossSim, and SST to answer how specific devices work in a system]
Experimental Device Nonidealities
- Ideally, the weight would increase and decrease linearly, in proportion to the learning-rule result
- Experimental devices have several nonidealities: write variability, write nonlinearity, asymmetry, and read noise
- Circuits also add A/D and D/A noise and parasitics
[Figure: conductance G versus pulse number (V_write = 1 V, t_pulse = 1 µs), comparing the ideal symmetric, linear response with write variability and nonlinear, asymmetric responses; and read current I ∝ G versus time (V_read = 100 mV), showing read noise I_0 ± ΔI]
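These nonidealities can be mimicked with a toy update rule. The saturating step model, the asymmetry factor, and the noise level below are illustrative assumptions, not fits to the measured device.

```python
import numpy as np

rng = np.random.default_rng(0)

def pulse(g, direction, alpha=0.05, g_min=0.0, g_max=1.0, sigma=0.01):
    """Apply one SET (+1) or RESET (-1) pulse to a nonideal analog weight."""
    if direction > 0:
        dg = alpha * (g_max - g)        # nonlinear: step shrinks near g_max
    else:
        dg = -2 * alpha * (g - g_min)   # asymmetric: RESET steps are larger
    dg += sigma * rng.standard_normal() # cycle-to-cycle write variability
    return float(np.clip(g + dg, g_min, g_max))

g = 0.2
for _ in range(100):        # repeated SET pulses trace a saturating curve
    g = pulse(g, +1)
print(g)                    # approaches g_max rather than rising linearly
```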
ReRAM Analog Characterization
- Use as a neuromorphic weight requires precise analog tuning
- Each dataset comprises 1000 repeated SET and 1000 repeated RESET pulses
- Nominal pulse values: SET +1 V, 10 ns rise time / pulse width / fall time (RT/PW/FT); RESET -1 V, 10 ns RT/PW/FT; READ 100 mV, 1 ms RT/PW/FT
- Each write pulse is followed by a read pulse with a current measurement
[Figure: conductance, 3.0x10^-4 to 6.0x10^-4 S, versus pulse number for the SET pulses (0-1000) and RESET pulses (1000-2000)]
Repeated Pulsed Cycling
[Figure: conductance under repeated SET/RESET pulsed cycling with 10 ns, 100 ns, and 1 µs pulses; 100 on/off cycles (200k pulses)]
TaOx ReRAM in Backprop Training
How can training accuracy be improved?
Increasing Network Size
Li-Ion Synaptic Transistor for Analog Computation (LISTA)
[Figure: G-V characteristic of the LISTA transistor]
E. Fuller et al., Adv. Mater., accepted 2017
Analog State Characterization
[Figure: analog conductance states for three technologies: LISTA (>200 states; E. Fuller et al., Adv. Mater., accepted 2017), a PCM array (G. W. Burr et al., IEEE TED 2015), and TaOx ReRAM]
LISTA-Device Performance for the Backprop Algorithm

Data set               | # Training Examples | # Test Examples | Network Size
UCI Small Digits [1]   | 3,823               | 1,797           | 64x36x10
File Types [2]         | 4,501               | 900             | 256x512x9
MNIST Large Digits [3] | 60,000              | 10,000          | 784x300x10

Increasing Network Size
E. Fuller et al., Adv. Mater., accepted 2017. See poster for detail!
Electrochemical Neuromorphic Organic Device (eNode)
van de Burgt ... Salleo, Nature Mater., 2017, in press
Electrochemical Neuromorphic Organic Device (eNode)
See poster for detail! van de Burgt ... Salleo, Nature Mater., 2017, in press
Circuit-Level Improvement
- Multi-synapse circuits allow accuracy much closer to ideal, even with the high-variability TaOx device
- LISTA achieves essentially perfect accuracy
- Requires trading off energy/latency for accuracy; the exact tradeoff depends on algorithm requirements
Agarwal et al., submitted 2017
[Figure: four training-accuracy panels (TaOx on File Types, TaOx on MNIST, LISTA on MNIST spanning roughly 95-99%, LISTA on File Types), each comparing ideal numeric, multi-synapse, and single-device results; the LISTA-MNIST panel also shows ideal with A/D]
Energy and Latency Analysis

                                       | SRAM                          | Digital ReRAM                          | Analog ReRAM
Equivalent area, ~450 1k x 1k matrices | 400 mm^2                      | 32 mm^2                                | 11 mm^2
Total time per 1-layer cycle (3 ops: 2 reads, 1 write) | ~100 µs, transpose-read dominated | ~60 µs, update dominated (10 ns write) | ~5 µs, temporal-coding dominated (256 levels)
Total energy per 1-layer cycle         | ~1000 nJ, multiply dominated  | ~700 nJ, multiply dominated            | ~15 nJ
Matrix storage area                    | 95%                           | 50%                                    | 17%
Periphery area                         | 5%                            | 50%                                    | 100% (crossbar on top of periphery)
Matrices per 400 mm^2 chip             | ~450                          | ~5,500                                 | ~15,000
Energy / operation                     | 330 fJ                        | 230 fJ                                 | 5 fJ
Operations / second                    | 14 TeraOps/s                  | 270 TeraOps/s                          | 9,000 TeraOps/s
Energy and Latency Analysis

Matrix storage, 1024x1024 [values are per array]:
               | SRAM           | Digital ReRAM | Analog ReRAM Crossbar
Area           | 800,000 µm^2   | 35,000 µm^2   | Array: 4,300 µm^2; Periphery: 8,460 µm^2
Read           | 30 nJ / 15 µs  | 15 nJ / 4 µs  | ~3 nJ / ~1.5 µs
Read transpose | 300 nJ / 65 µs | 15 nJ / 4 µs  | ~3 nJ / ~1.5 µs
Write          | 30 nJ / 15 µs  | 50 nJ / 45 µs | 3 nJ / ~1.5 µs

Multiply accumulators [256 in parallel]: area 19,000 µm^2; run of 1M ops: 200 nJ / 4 µs (FREE in the analog crossbar)
Output LUT (uses digital methods): area 1,400 µm^2; read: 1 nJ / 1 µs
Input/output buffers, 8 bits (uses digital methods): area 13,000 µm^2; ~0.1 nJ per run
Vector cache*, 16 entries of 1024x8 bits (uses digital methods) [values are per 1024x8 vector]:
Area:  11,250 µm^2       | 500 µm^2
Read:  ~0.1 nJ / ~0.2 µs | ~1 nJ / 4 ns
Write: ~0.1 nJ / ~0.2 µs | ~1 nJ / 50 ns
Conclusion
- Dennard (constant power density) scaling has ceased and Moore's law is slowing
- New paradigms like neuromorphic computing will be required for sub-fJ computing
- We now require a device-through-system design mentality: the motivation behind CrossSim (see poster for more detail on CrossSim)
- Oxide-based resistive memory offers intriguing device options for both eras
- Novel LISTA and eNode devices offer significant potential in the development of a low-energy neural accelerator (see the LISTA and eNode posters for more detail)
Thank you!
Acknowledgements
This work is funded by Sandia's Laboratory Directed Research and Development program as part of the Hardware Acceleration of Adaptive Neural Algorithms Grand Challenge Project.
Many shared ideas among collaborators:
- Alberto Salleo, Yoeri van de Burgt, Stanford
- Engin Ipek, U Rochester
- Dave Mountain, Mark McLean, US Government
- Stan Williams, John Paul Strachan, HPL
- Dhireesha Kudithipudi, RIT
- Jianhua Yang, U Mass
- Hugh Barnaby, Mike Kozicki, Shimeng Yu, ASU
- Tarek Taha, U Dayton
- Paul Franzon, NC State University
- Sayeef Salahuddin, UC Berkeley
- Dozens of others...
We are especially interested in collaborations on CrossSim!
ENIAC
Electronic Numerical Integrator And Computer, developed by US Army / U Penn, 1946
150 kW, 357 FLOPs: ~400 J/FLOP (10-bit)
Where Are We Today?
Single unit: NVIDIA Tesla P100 GPU
- Most advanced GPU processor specs, released late 2016
- Targets deep learning and neural applications
- 20 TFLOPs (16-bit) peak performance with peak power dissipation of 300 W
- 70 GFLOPs/W, or about 15 pJ/FLOP (16-bit)
Supercomputer: Sunway TaihuLight (China)
- Top supercomputer in the world; ShenWei processor
- 90 PFLOPs peak, 15 MW power
- 6 GFLOPs/W, or about 170 pJ/FLOP
Need >1000x improvement to tackle internet-scale problems
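Both efficiency figures follow directly from the headline specs on the slide; a quick check:

```python
# NVIDIA Tesla P100: 20 TFLOPs (16-bit) at ~300 W peak
p100_j_per_flop = 300 / 20e12
print(p100_j_per_flop * 1e12)        # ~15 pJ/FLOP

# Sunway TaihuLight: 90 PFLOPs peak at 15 MW
taihu_j_per_flop = 15e6 / 90e15
print(taihu_j_per_flop * 1e12)       # ~167 pJ/FLOP, i.e. about 170
```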
Basics of Neural Networks
Neuron: the basic building block
Inputs x_1, x_2, ..., x_n and weights w_1, w_2, ..., w_n
z = Σ_{i=0}^{n} w_i x_i
y = 1 / (1 + e^(-z))
Simple network: backpropagation
[Figure: network with inputs, a hidden layer, and outputs; when the output pattern is correct, no adjustment is made; when incorrect, the weights are adjusted]
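The two equations above, as a runnable sketch (the input and weight values are illustrative):

```python
import numpy as np

def neuron(x, w):
    """Weighted sum z = sum_i w_i * x_i followed by a sigmoid."""
    z = np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))   # y = 1 / (1 + e^-z)

x = np.array([0.5, -1.0, 2.0])        # illustrative inputs
w = np.array([0.1, 0.4, 0.2])         # illustrative weights
y = neuron(x, w)
print(y)                              # an activation strictly between 0 and 1
```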
Another Analogy
Mathematical: a neuron computes z = Σ_{i=0}^{n} w_i x_i and y = 1 / (1 + e^(-z))
Electrical: a conductance array computes I = Σ_{i=0}^{n} G_i V_i
[Figure: the mathematical neuron with weights w_i and inputs x_i beside its electrical analogue with conductances G_i and input voltages V_i]
Why is it essential to cram so many computations onto a single chip?
Can you simply connect millions of ultra-efficient chips? Yes, but every time data leaves the chip it is elevated in the communication hierarchy, and energy efficiency per operation is reduced.
[Figure: crossbar accelerator cores]
TaOx ReRAM in Backprop Training

Data set               | # Training Examples | # Test Examples | Network Size
UCI Small Digits [1]   | 3,823               | 1,797           | 64x36x10
File Types [2]         | 4,501               | 900             | 256x512x9
MNIST Large Digits [3] | 60,000              | 10,000          | 784x300x10

Increasing Network Size
1 µs pulses
[Figure: measured write-pulse waveform, voltage (V) versus time over roughly ±40 ns]
ReRAM Measurements
- DC current-voltage "loop" sweeps are not time-controlled: they cause excessive heating and early wearout, and do not provide information on dynamics
- Physical switching is < 10 ns; a pseudo-RF setup (ground/signal probes, conductor-backed) is needed to measure it
- Agilent B1530 module: 10 ns RT/FT, 10 ns PW; 1 V nominal with ~140 mV overshoot
[Figure: TiN / Ta (15 nm) / TaOx (5-10 nm) / TiN stack with switching channel; O^2- anions exchange with (+) charged vacancies under V_TE. Measured pulse: rise = 12.8 ns, fall = 11.4 ns, amplitude = 1.14 V]
Effect of Pulse Width and Edge Time
- Shorter pulses may be employed to lower the conductance switching range
- Linearity is qualitatively similar across pulse width (PW) and edge time (ET): best for SET at 100 ns, best for RESET at 1 µs
- Relative conductance change increases with shorter pulse width / edge time
- Nominal pulse voltages: SET +1 V, RESET -1 V
[Figure: conductance versus pulse number (0-2000) for 1 µs, 100 ns, and 10 ns pulses, plotted three ways: absolute conductance (S, log scale 10^-3 to 10^-2); normalized conductance Y = (G - G_min) / (G_max - G_min) x 100 (%); and conductance change Y = (G / G_min) x 100 (%), ranging from 100% to 200%]
Analog Core: Forward Propagation
z_j = Σ_i y_i w_ij  (O(N^2) operations, performed in the analog array)
y_j = 1 / (1 + e^(-z_j))  (O(N) operations, neuron function in the digital core)
[Figure: crossbar holding weights w_ij between layers i and j; inputs y_i drive the rows, the array produces z_j, and the digital core applies the neuron function to pass y_j on to layer k]
Analog Core: Back Propagation
Δ_j = Σ_k δ_k w_kj  (O(N^2) read operations, transpose read of the array)
δ_j = y'(z_j) · Δ_j  (O(N) operations in the digital core)
Δw_ij = η · δ_j · y_i  (O(N^2) write operations, parallel rank-1 update)
[Figure: crossbar with weights w_ij; the error δ_k from layer k propagates back through the array, and each weight is updated using the stored input y_i]
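The forward and backward kernels can be put together as a NumPy sketch. The O(N^2) steps are the ones a crossbar performs in parallel, while the O(N) neuron math stays in the digital core; all values below are illustrative random data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W = 0.1 * rng.standard_normal((4, 3))  # weights w_ij stored as conductances
y_i = rng.standard_normal(4)           # outputs y_i of the previous layer
err_j = rng.standard_normal(3)         # Delta_j fed back from the next layer
eta = 0.05                             # learning rate

z_j = y_i @ W                          # O(N^2): forward read, z_j = sum_i y_i w_ij
y_j = sigmoid(z_j)                     # O(N): digital neuron function
delta_j = y_j * (1 - y_j) * err_j      # O(N): delta_j = y'(z_j) * Delta_j
Delta_i = W @ delta_j                  # O(N^2): transpose read to previous layer
W += eta * np.outer(y_i, delta_j)      # O(N^2): parallel rank-1 weight update
```

Note that in the analog core the final line is a single parallel write of all N^2 weights, which is what makes the update O(1) in time even though it is O(N^2) in operations.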