Article
https://doi.org/10.1038/s42256-018-0001-4

Long short-term memory networks in memristor crossbar arrays
Can Li1,5, Zhongrui Wang1, Mingyi Rao1, Daniel Belkin1, Wenhao Song1, Hao Jiang1, Peng Yan1, Yunning Li1, Peng Lin1, Miao Hu2, Ning Ge3, John Paul Strachan2, Mark Barnell4, Qing Wu4, R. Stanley Williams2, J. Joshua Yang1* and Qiangfei Xia1*
1Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA, USA. 2Hewlett Packard Labs, Palo Alto, CA, USA. 3HP Labs, HP Inc., Palo Alto, CA, USA. 4Air Force Research Laboratory, Information Directorate, Rome, NY, USA. 5Present address: Hewlett Packard Labs, Palo Alto, CA, USA. *e-mail: jjyang@umass.edu; qxia@umass.edu
SUPPLEMENTARY INFORMATION
In the format provided by the authors and unedited.
Nature Machine Intelligence | www.nature.com/natmachintell
Supplementary Figures
Supplementary Figure 1. The two-pulse memristor conductance update scheme. a, The two-pulse scheme for decreasing a memristor conductance, which includes a complete RESET cycle (to a lower conductance state) followed by a SET cycle (to a higher conductance state). The change in the gate voltage determines the change in conductance. b, The scheme for increasing a memristor conductance, where the RESET cycle in (a) is skipped. c, 20 cycles of potentiation/depression (overlaid), each cycle comprising 200 updates, for two memristors (blue and red, respectively). Although the intercepts do not match (we intentionally picked two memristors with a more noticeable mismatch), the slopes (ΔG/ΔVgate) of the conductance changes match closely. d, Weight update error, expressed as the normalized standard deviation, during the two-pulse conductance update. The analysis covers all devices in a 128×64 array, two of which are shown in (c).
[Figure panels: a,b, pulse waveforms on the clk, TE (top electrode), BE (bottom electrode) and Gate lines during the RESET (Vreset) and SET (Vset, Vg,pre+ΔVg, VDD) cycles; c, conductance (μS) versus pulse number and gate voltage (V); d, s.d. of G update error / G range (0-5%) versus target conductance (μS).]
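In a crossbar array simulator, the update behaviour described above can be modelled as a deterministic target conductance set by the gate voltage, plus a Gaussian write error. Below is a minimal Python sketch under that assumption; the function name, default range and error magnitude are illustrative, not taken from the released MATLAB code.

```python
import numpy as np

def two_pulse_update(g_target, g_min=0.0, g_max=800.0,
                     error_sd_frac=0.01, rng=None):
    """Model one two-pulse update (RESET then SET, or SET only when the
    conductance increases): the SET gate voltage selects the target
    conductance, and a Gaussian term models the measured write noise,
    expressed as a fraction of the conductance range (units: microsiemens)."""
    rng = rng or np.random.default_rng(0)
    noise = rng.normal(0.0, error_sd_frac * (g_max - g_min))
    # the achievable conductance is bounded by the device's dynamic range
    return float(np.clip(g_target + noise, g_min, g_max))
```

With `error_sd_frac=0` the model is exact; nonzero values reproduce the few-percent update errors plotted in panel (d).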
Supplementary Figure 2: The software architecture developed for the experiment. The backend virtual class defines the matrix multiplication operations for the forward and backward passes, as well as the weight update operation. The software backend performs the matrix multiplication and weight update with MATLAB built-in functions, with weights represented by 32-bit floating-point numbers. Simu Array performs the operations in a crossbar array simulator, in which non-ideal factors such as random conductance update errors, conductance upper and lower bounds, and array wire resistance can be taken into consideration. Finally, Real Array performs the matrix multiplication and weight (represented by memristor conductance) update experimentally in the memristor crossbar array. The code is deposited at: http://github.com/lican81/memNN
[Figure: class diagram. Model → Layers (Dense, Recurrent, LSTM), Optimizer (SGD, RMSprop), Loss (MSE, Cross Entropy). Backend (weight/conductance matrix; multiply(vector), multiply_reverse(vector), update(delta_weight_matrix)) → Software, or Crossbar (Real Array / Simu Array) via firmware on the MCU.]
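The backend interface in the diagram can be rendered as an abstract class. The released code is MATLAB (github.com/lican81/memNN); the Python sketch below is only an illustration of the same structure, with a floating-point backend standing in for the software path.

```python
from abc import ABC, abstractmethod
import numpy as np

class Backend(ABC):
    """Interface mirroring the class diagram: a weight/conductance matrix
    with forward multiply, backward multiply, and an in-place update."""
    def __init__(self, shape):
        self.weight = np.zeros(shape)      # weight / conductance matrix

    @abstractmethod
    def multiply(self, vector): ...             # forward pass: W @ x
    @abstractmethod
    def multiply_reverse(self, vector): ...     # backward pass: W.T @ delta
    @abstractmethod
    def update(self, delta_weight_matrix): ...  # apply weight increment

class SoftwareBackend(Backend):
    """FP32-style software backend using built-in matrix routines."""
    def multiply(self, vector):
        return self.weight @ vector
    def multiply_reverse(self, vector):
        return self.weight.T @ vector
    def update(self, delta_weight_matrix):
        self.weight += delta_weight_matrix
```

A simulated or real crossbar backend would implement the same three methods, so the model and optimizer code is unchanged when swapping backends.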
Supplementary Figure 3: Additional data for the regression experiment. a, Conductance map of the 34×60 memristor array in the LSTM layer after the in-situ training. b, Measured conductances of the 32 memristors in the fully-connected layer after training. c, Map of synaptic weights calculated from the conductances shown in (a). d, Synaptic weights calculated from the conductances shown in (b).
Supplementary Figure 4: Pre-processing of the gait identification dataset. a, One frame from the raw video. b, The silhouette extracted from the video (ref. 39 in the main text), which was further converted to a width profile vector. Each dimension of the width profile vector represents the width of the silhouette at the corresponding height. c, The width profile vectors for each frame of the video. d, The total width of the width profile vector in each frame shows a periodic trend; after low-pass filtering of its spectrum and an inverse Fourier transformation, it is used to detect the gait cycles. e, One video is divided into multiple samples according to the gait cycles.
Supplementary Figure 5: Additional data for the classification experiment. a, Conductance map of the 128×56 memristors in the LSTM layer after the in-situ training. b, Conductance map of the 28×8 fully-connected layer after training. c, Map of synaptic weights calculated from the conductances shown in (a). The LSTM synaptic weights comprise the weights (Wa, Wi, Wf and Wo) connected to the input and the recurrent weights (Ua, Ui, Uf and Uo) connected to the LSTM outputs from the previous time step. d, Map of synaptic weights calculated from the conductances shown in (b).
Supplementary Figure 6: Output from the human identification inference. a, Raw electrical current output after the in-situ training. Different curves represent the current output from different columns (col1 to col8). The column with the maximum current output is identified as the inference result of the memristor RNN. b, The Bayesian probability computed from the data in (a) with the softmax function.
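The Bayesian probabilities in Supplementary Figure 6b are obtained by applying a softmax to the eight column output currents. A minimal sketch (the scale factor mapping currents onto logits is an illustrative assumption, not a value from the paper):

```python
import numpy as np

def column_softmax(currents, scale=1.0):
    """Map raw column output currents to softmax probabilities.
    `scale` converts current units into logits; subtracting the maximum
    keeps the exponentials numerically stable."""
    z = scale * np.asarray(currents, dtype=float)
    z -= z.max()                 # stabilize: exp of non-positive values only
    p = np.exp(z)
    return p / p.sum()
```

The predicted person is then simply the column with the largest probability, which coincides with the largest raw current.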
Supplementary Figure 7: Comparisons among training on the experimental crossbar array, software training with 32-bit floating point, and training on simulated crossbar arrays with various weight update errors. a, Comparison of the mean square error (MSE) between the prediction made by the LSTM network and the ground truth, used as the loss for the airline passenger number regression experiment. The dashed horizontal line represents the result acquired from the experiment after 800 epochs of training. The MSE loss becomes less predictable and, in general, larger with increasing conductance update error. b, Accuracy comparison between the software training and the crossbar array simulations with various random weight update errors. Larger memristor conductance update errors yield worse prediction accuracy, but the effect of update errors smaller than 0.6% is not statistically significant. Both boxplots show the statistics of 50 runs of software/simulated training with the parameters indicated on the x-axis. The red pluses indicate outliers beyond ±2.7 s.d.
Supplementary Table
Supplementary Table 1: The summary of on-crossbar and off-crossbar operations and memories

                          Inference only                          Training
                          M=14, N=50       M=512, N=256           M=14, N=50                     M=512, N=256
                          (This work)      (Larger network)       (This work)                    (Larger network)
On-crossbar operations    7,168 AOP        3,146 kAOP             7,168 AOP                      3,146 kAOP
Off-crossbar operations   70 AOP +         2.56 kAOP +            294 AOP (bp) +                 11 kAOP (bp) +
                          70 tanh/sigmoid  2.56k tanh/sigmoid     3,584 AOP (outer product)      1,573 kAOP (outer product)
On-crossbar memory        7,168 Byte       3 MByte                7,168 Byte                     3 MByte
Off-crossbar memory       14 Byte          256 Byte               7,168 Byte (gate voltage) +    3 MByte (gate voltage) +
                                                                  3,000 Byte (I/O history)       68.5 kByte (I/O history)

Note: AOP stands for analog operation; bp, backpropagation.
Supplementary Notes
Supplementary Note 1: Summary of on-crossbar and off-crossbar operations and memories
We have summarized the number of operations and the memory required for networks of different sizes (Supplementary Table 1). We assume that all parameters, including the input/output (I/O) histories and the gate voltage matrices, are stored in 8 bits (1 byte). For an LSTM layer with an input dimension of N and M hidden units, the size of the input vector is (M + N + 1) and the size of the output vector is 4M. Because two memristors are used to represent one weight to allow for negative values, the size of the memristor crossbar array is 2(M + N + 1) × (4M). For each temporal step, the inference described in Eq. 1 (in the main text, as are all the following equations) involves (M + N + 1) × (4M) multiply-accumulates (MACs) performed on-crossbar, and Eq. 2 involves 3M sigmoid operations, 2M tanh operations, 3M multiplications and 2M additions, all performed off-crossbar. As the problem size increases, the on-crossbar operations dominate the computation, since their complexity is O(M²) while the off-crossbar operations scale as O(M). Most model parameters, O(M²), are stored in the crossbar array, with O(M) sample-and-hold (S/H) circuits or registers storing the intermediate signals.
The training process is more complicated. In the present work, only Eq. 18 can be performed in the crossbar array, while all the others are performed in software. Among those, Eqs. 12-17 involve 16M multiplications and 5M subtractions in software. Specifically:
Eq. 12: 2M multiplications, M derivatives of the sigmoid function (= M subtractions + M multiplications).
Eq. 13: 2M multiplications, M derivatives of the tanh function (= M subtractions + M multiplications), and M additions.
Eq. 14: 2M multiplications, M derivatives of the tanh function (= M subtractions + M multiplications).
Eq. 15: 2M multiplications, M derivatives of the sigmoid function (= M subtractions + M multiplications).
Eq. 16: 2M multiplications, M derivatives of the sigmoid function (= M subtractions + M multiplications).
Eq. 17: M multiplications.
Eq. 18: error backpropagation, (M + N + 1) × (4M) MACs, performed on-crossbar.
Eq. 20: weight gradient calculated by an outer product, (M + N + 1) × (4M) multiplications, off-crossbar. (Eq. 19 is for the fully connected layer.)
For SGD without momentum, Eqs. 21-22 are the accumulation of the weight gradient calculated by Eq. 20, scaled by the learning rate. They can be combined and accumulated directly into the gate voltage matrix through another scaling with a coefficient ΔVgate/ΔG. In this case, there are 2(M + N + 1) × (4M) multiplication operations, and an auxiliary memory (which could potentially be implemented in the memristor array) is required to store 2(M + N + 1) × (4M) gate voltage values. In addition, the history of the input values and the activations also needs to be stored, which requires (N + 5M) × T memory, where T is the depth of the temporal sequence, given that we update the array after each inference (batch size = 1).
It is clear that for an inference-only system, most operations are performed and most parameters stored on-crossbar, whereas training requires a large number of off-crossbar operations.
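As a cross-check of Supplementary Table 1, the counts above can be tabulated programmatically. The printed totals correspond to the array size without the bias row, i.e. 2(M + N) × (4M); the sketch below (illustrative Python, with T = 25 reproducing the 3,000-byte I/O-history figure) follows that convention.

```python
def lstm_crossbar_counts(M, N, T=25):
    """On-/off-crossbar operation and memory counts for an LSTM layer with
    N inputs and M hidden units, two memristors per weight (differential
    pairs), 1 byte per stored value, and temporal depth T."""
    cells = 2 * (M + N) * 4 * M          # memristor cells in the array
    return {
        "on_crossbar_aop": cells,        # one analog MAC per cell
        "off_infer_aop": 5 * M,          # 3M multiplications + 2M additions
        "off_infer_nonlinear": 5 * M,    # 3M sigmoid + 2M tanh
        "off_train_bp_aop": 21 * M,      # Eqs. 12-17: 16M mult + 5M sub
        "outer_product_mults": (M + N) * 4 * M,
        "on_crossbar_bytes": cells,      # conductances, 1 byte each
        "gate_voltage_bytes": cells,     # auxiliary gate voltage matrix
        "io_history_bytes": (N + 5 * M) * T,
    }
```

Calling it with (M=14, N=50) reproduces the "This work" column of the table; (M=512, N=256) reproduces the "Larger network" column.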
Supplementary Note 2: Additional discussion on the output current for a large array
The size of the memristor array may be limited by the output current, but there are several potential solutions, each with its own tradeoffs. First, the read current can be lowered by reducing the amplitude (or the pulse duty cycle) of the input voltage applied to the row wires, as long as the input signal-to-noise ratio (SNR) remains acceptable. The current should, however, not be reduced much below a microampere, because sensing smaller currents generally increases the sensing overhead. A higher sensing current level, on the other hand, does not necessarily lead to a greatly increased energy consumption, because the total sensing time is reduced. Second, the memristor conductance can be lowered by device/materials engineering, subject to the constraint that the I-V characteristic should remain as linear as possible, or different weight update schemes should be employed. There are also architectural solutions that tile many smaller crossbar arrays to handle larger computations (ref. S1).
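To put the output-current concern in numbers, the sketch below computes the worst-case current one column amplifier would have to sink (illustrative Python; the 800 μS maximum conductance and 0.2 V read voltage are assumed values for the estimate, not measurements from the paper):

```python
def worst_case_column_current_mA(rows, g_max_uS, v_read):
    """Worst-case current (mA) into one crossbar column: every device at
    the maximum conductance and every row driven at the read voltage
    simultaneously (Ohm's law summed over the column)."""
    return rows * g_max_uS * 1e-6 * v_read * 1e3
```

For example, 128 rows at 800 μS and 0.2 V give roughly 20.5 mA per column, which illustrates why lowering the read voltage, reducing device conductance, or tiling smaller arrays helps.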
Supplementary Note 3: Additional discussion on weight update overhead
In our programming scheme, we estimate that the actual time consumed for updating the entire 128×64 array is around 1 s for RESET and about 2 s for SET. Most of that time is spent on the serial communication between the off-chip computer and the microcontroller. An individual device can be switched within 5 ns (ref. S2) or even faster. Therefore, for an on-chip integrated system, a parallel weight update for a 128×64 array would take less than 5 ns × 64 (RESET) + 5 ns × 128 (SET) = 960 ns, functionally equivalent to 7.95 Gbps (for a 1-bit cell) or 47.68 Gbps (considering that a memristor is a 6-bit-equivalent cell, ref. S3). Thus, the weight update scheme is not a limiting factor for the throughput. The 5 ns switching time cited above was limited by our available measurement apparatus. It has been reported that a memristor can be switched within 85 ps (ref. S4), which would reduce the update time for a 128×64 array to 16.32 ns, equivalent to 0.46 Tbps (1-bit) or 2.74 Tbps (6-bit).
Considering an average conductance of 500 μS, the power consumption for the entire 128×64 array is (2.5 V)² × 500 μS × 64 = 0.18 W for SET and (1.7 V)² × 500 μS × 128/2 = 0.09 W for RESET. The energy required to update one cell, however, is very low: (2.5 V)² × 500 μS × 5 ns = 15.6 pJ per 6-bit cell for SET and (1.7 V)² × 500 μS × 5 ns = 7.2 pJ per 6-bit cell for RESET. As a comparison, DRAM and SRAM at the 16 nm technology node require >100 pJ per single-bit flip (ref. S5). The energy can be further reduced by lowering the operating voltage, shortening the write pulse, or lowering the conductance in an improved memristor device.
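The timing, throughput and energy figures above can be reproduced as follows (an illustrative sketch; the 1024³ divisor for "Gbps" is inferred from the printed 7.95/47.68 values):

```python
def parallel_update_metrics(rows=128, cols=64, t_switch=5e-9, bits_per_cell=6):
    """Parallel weight update for a rows x cols array: one RESET step per
    column plus one SET step per row, each taking t_switch. Returns total
    time (s), 1-bit equivalent throughput (Gbps, binary prefix), multi-bit
    throughput (Gbps), and per-cell SET energy (J) at 2.5 V and 500 uS."""
    t_total = t_switch * (cols + rows)            # 960 ns at 5 ns/switch
    gbps_1bit = rows * cols / t_total / 1024**3   # cells updated per second
    e_set = 2.5**2 * 500e-6 * t_switch            # (2.5 V)^2 x 500 uS x 5 ns
    return t_total, gbps_1bit, gbps_1bit * bits_per_cell, e_set
```

Substituting t_switch = 85 ps similarly yields the 16.32 ns update time quoted for the faster devices.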
Supplementary References
S1. Shafiee, A., Nag, A., Muralimanohar, N., Balasubramonian, R., Strachan, J. P., Hu, M., Williams, R. S. & Srikumar, V. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proc. International Symposium on Computer Architecture (ISCA), Seoul, South Korea, 18-22 June 2016 (IEEE).
S2. Jiang, H. et al. Sub-10 nm Ta channel responsible for superior performance of a HfO2 memristor. Sci. Rep. 6, 28525 (2016).
S3. Li, C. et al. Analogue signal and image processing with large memristor crossbars. Nat. Electron. 1, 52 (2018).
S4. Choi, B. J. et al. High-speed and low-energy nitride memristors. Adv. Funct. Mater. 26, 5290 (2016).
S5. Wojcicki, T. VLSI: Circuits for Emerging Applications 414 (CRC Press, 2014).