

Inference of Long-Short Term Memory networks at software-equivalent accuracy using 2.5M analog Phase Change Memory devices

H. Tsai, S. Ambrogio, C. Mackin, P. Narayanan, R. M. Shelby, K. Rocki, A. Chen and G. W. Burr
IBM Research–Almaden, 650 Harry Road, San Jose, CA 95120, Tel: (408) 927–2073, E-mail: [email protected]

Abstract

We report accuracy for forward inference of long-short-term-memory (LSTM) networks using weights programmed into the conductances of >2.5M phase-change memory (PCM) devices. We demonstrate strategies for software weight-mapping and programming of hardware analog conductances that provide accurate weight programming despite significant device variability. Inference accuracy very close to software-model baselines is achieved on several language modeling tasks.

Keywords: PCM, LSTM, forward inference, in-memory computing

Introduction

Deep neural network (DNN) computations in analog memory have made significant progress, recently achieving software-equivalent accuracy with weights stored in non-volatile memory arrays [1], [2]. Fully-connected (FC) networks, composed of the FC-layers that are particularly well-suited for acceleration using in-memory analog computing [3], have been mostly eclipsed in modern DNN applications. But recurrent neural networks (RNN), such as Long-Short-Term-Memory (LSTM), are widely used — for language modeling, speech recognition, translation, and sequence classification — and primarily consist of FC-layers (plus a few element-wise vector operations). Due to their recurrent nature, analog device requirements are more stringent for LSTMs [4] than for FC-networks. To date, experimental demonstrations of LSTMs in analog memory have been limited by hardware constraints to very small networks [5]. In this paper, we study the inference accuracy of larger LSTM networks using a mixed-hardware-software experiment [1], where synaptic weights are programmed into a 90nm-node phase change memory (PCM) array with 4M mushroom-cell devices, while all neuron functionalities are simulated in software. We use the weight mapping scheme in Fig. 1 to map LSTM networks for language modeling tasks using two datasets, including the Penn Tree Bank (PTB) dataset, a widely-used LSTM software benchmark.

LSTM Network and Dataset

The 2-layer LSTM network used in this paper is shown in Fig. 2. A fully connected embedding layer converts characters or words, using a ‘one-hot’ encoding x0(t), into an embedding vector, x(t), with the same size as the hidden layer. x(t) is then passed to a 2-layer LSTM with a recurrent hidden state, h(t), in each layer. Output y(t) is then computed from the hidden state in the second layer, h2(t), through a fully-connected output layer. Fig. 3a shows weight distributions for software-trained models using two datasets: the book ‘Alice in Wonderland’ (character-based) with a smaller model (50 hidden units) and the PTB (word-based) dataset (hidden layer size of 200). Since software weights span different ranges, different scaling factors are chosen to map the weights into the same ‘sensitive region’ of analog conductances [6].
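A minimal PyTorch sketch of the architecture just described (embedding, 2-layer LSTM, fully-connected output) is given below. Default sizes follow the character-level Alice in Wonderland configuration; the authors' exact training hyperparameters are not reproduced here, so treat this as an illustrative stand-in rather than the baseline model itself.

```python
import torch
import torch.nn as nn

class TwoLayerLSTMLM(nn.Module):
    """Embedding -> 2-layer LSTM -> fully-connected output, as in Fig. 2a.
    Defaults follow the Alice in Wonderland model (256-symbol input,
    50 hidden units); the PTB model uses 10,000 / 200."""
    def __init__(self, vocab_size=256, hidden_size=50, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)   # x0(t) -> x(t)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers,
                            batch_first=True)                # h1(t), h2(t)
        self.out = nn.Linear(hidden_size, vocab_size)        # h2(t) -> y(t)

    def forward(self, tokens, state=None):
        x = self.embed(tokens)              # (batch, seq, hidden)
        h, state = self.lstm(x, state)      # hidden state of the top layer
        return self.out(h), state           # logits y(t) at every time step
```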

To assess the impact of weight-programming errors, conductance values are individually measured from the hardware, and LSTM activations are calculated in software (using the equations in Figs. 1 and 4). Future in-memory analog computing circuitry will efficiently and rapidly perform these multiply-accumulate operations in the analog domain at the weight locations, but expected DNN performance for such hardware can be evaluated by the present experiment, as follows. For language modeling, as each character/word is presented, the LSTM network predicts the probability — quantified by softmax(y(t)) — of the next character/word in the sequence. This prediction can be compared to the actual next input vector, x0(t+1). ‘Accuracy’ of the model is quantified by cross-entropy loss or perplexity: lower loss (perplexity) indicates that correct answers are being predicted with higher probability.
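This evaluation can be made concrete with a short sketch: score each predicted softmax(y(t)) against the actual next token and report the mean cross-entropy loss and its exponential, the perplexity. The function and variable names are illustrative, not the authors' code.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model, tokens):
    """tokens: 1-D LongTensor of token indices for the held-out text."""
    inputs = tokens[:-1].unsqueeze(0)                  # x0(t)
    targets = tokens[1:].unsqueeze(0)                  # x0(t+1), the answers
    logits, _ = model(inputs)                          # y(t) at every step
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))        # mean -log p(correct)
    return loss.item(), math.exp(loss.item())          # cross-entropy, perplexity
```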

Programming PCM Conductances and Weights

Fig. 5 compares two strategies for efficient programming in the presence of device variability. One strategy applies SET pulses with steadily increasing compliance current to reach the target analog conductance. With this method, PCM conductance is non-monotonic, decreasing sharply as each device reaches its RESET condition. In contrast, applying RESET pulses with steadily decreasing compliance current is more tolerant of device variability and offers better precision at low conductance values. Iterating these sequences with different pulse widths helps further address outlying devices (Fig. 6b). Conductance programming of a single target conductance (Fig. 6) and of a simple target pattern with 32 different levels (Fig. 7) shows promising results.
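The closed-loop idea behind the RESET-based strategy can be sketched as follows: apply RESET pulses of steadily decreasing compliance current, reading back after each pulse, and stop once the conductance falls inside a confidence band around the target. The device response below is a toy stand-in with made-up numbers, not a model of the actual PCM physics or array interface, which the paper does not detail at this level.

```python
import random

def toy_reset_pulse(i_comp_uA):
    # Purely illustrative device response: a stronger RESET (higher compliance
    # current) quenches a larger amorphous volume, leaving lower conductance.
    level_uS = max(0.0, 15.0 - 0.04 * i_comp_uA)
    return max(0.0, level_uS + random.gauss(0.0, 0.4))

def program_conductance(target_uS, band_uS=0.5, currents_uA=range(300, 40, -10)):
    """One sequence of RESET pulses with steadily decreasing compliance current,
    reading back after each pulse until the conductance lands within band_uS of
    the target (Fig. 6b iterates such sequences with varied pulse width)."""
    g_uS = 0.0
    for i_comp in currents_uA:
        g_uS = toy_reset_pulse(i_comp)          # program, then read back
        if abs(g_uS - target_uS) <= band_uS:
            break                               # inside the confidence band: stop
    return g_uS

print(round(program_conductance(target_uS=4.0), 2))
```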

Fig. 8 compares weight programming results when using 2 PCM and then 4 PCM devices for weight mapping. Target weight distributions, scaled from software weights into conductance units by α = 2.5 µS, overlap (Fig. 8a,d) and strongly correlate (Fig. 8b,e) with weights programmed into 2 or 4 PCM devices, with low weight programming error (Fig. 8c,f). Successive programming of two conductance pairs with F = 4 (e,f) significantly reduces weight errors compared to a single pair (b,c). Simple target patterns (Fig. 9) also clearly show the benefits of the weight programming strategies introduced here.
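The two-pair mapping of Figs. 1 and 8 can be sketched as below, using the paper's α = 2.5 µS and F = 4. The particular convention for splitting a signed value onto a (positive, negative) device pair, and the helper names, are assumptions for illustration rather than the authors' procedure.

```python
ALPHA_US = 2.5   # software-weight -> conductance scale factor (paper value)
F = 4            # significance factor of the least-significant pair (paper value)

def split_signed(w_uS):
    """Encode a signed conductance-unit value on a (positive, negative) pair."""
    return (w_uS, 0.0) if w_uS >= 0 else (0.0, -w_uS)

def map_weight(w_software, measured_msp_uS=None):
    """Return MSP targets (G+, G-) and LSP targets (g+, g-).
    After programming the MSP, pass its read-back value (G+ - G-) as
    measured_msp_uS so the LSP can absorb the residual error."""
    target_uS = ALPHA_US * w_software
    Gp, Gm = split_signed(target_uS)
    if measured_msp_uS is None:
        measured_msp_uS = Gp - Gm                 # assume ideal MSP programming
    residual_uS = target_uS - measured_msp_uS     # error left for the LSP
    gp, gm = split_signed(F * residual_uS)        # LSP is read out divided by F
    return (Gp, Gm), (gp, gm)

# Example: software weight -0.8 with 0.3 uS of MSP programming error.
print(map_weight(-0.8, measured_msp_uS=-2.3))
```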

Impact on LSTM Performance

Fig. 10a summarizes LSTM accuracy results for the Alice in Wonderland dataset, compared to software baselines (PyTorch & MATLAB). Customized experiments in MATLAB provided an alternate weight-set for programming, obtained by clipping weight-magnitudes during training to a level that left cross-entropy loss unaffected. The mapping of this reduced software-weight range better utilizes the available PCM conductance levels, and reduces the relative impact of read and write noise. Cross-entropy loss after programming is thus reduced, despite slightly higher software loss in the clipped model. Programming only the weights of the 2-layer LSTM into PCM brings cross-entropy loss even closer to baseline. Fig. 10b shows inference results for the PTB dataset, compared to the PyTorch baseline [7]. (Here only the 2-layer LSTM fits within the available PCM array.) LSTM network and dataset sizes are summarized in Fig. 11a. Fig. 11b puts the inference perplexity of PTB in context with state-of-the-art software models [8]. For comparison with the full mixed-hardware-software results (Fig. 10), inference results with both constant (Fig. 11c) and weight-dependent (Fig. 11d, σ values extracted from Fig. 9c) added Gaussian noise are shown. Finally, conductance drift of PCM devices causes LSTM accuracy to degrade over time (Fig. 11e), highlighting the importance of PCM device structures known to offer lower drift [9, 10] as well as potentially better conductance-programming accuracy [10].
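The noise study of Fig. 11c-d amounts to re-evaluating the software model after perturbing every programmed weight with additive Gaussian noise, either constant or weight-dependent. The sketch below uses placeholder σ values and a placeholder weight dependence, not the values fitted from Fig. 9c.

```python
import copy
import torch

def add_weight_noise(model, sigma_const=0.02, weight_dependent=False):
    """Return a copy of the model with additive Gaussian noise on every
    parameter: constant sigma, or a (placeholder) weight-dependent sigma."""
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            sigma = sigma_const * (1.0 + p.abs()) if weight_dependent else sigma_const
            p.add_(torch.randn_like(p) * sigma)
    return noisy

# Typical use: loss, ppl = evaluate(add_weight_noise(model), test_tokens)
```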

Conclusions

We demonstrated analog conductance programming and mapping of LSTM weights into PCM arrays for inference. Write noise was found to be the primary limiting factor. Training and mapping procedures were shown that can help avoid large weights prone to programming error. Despite non-ideality and variability in PCM arrays, inference results close to software baselines were achieved on LSTM networks of competitive size.

References

[1] G. W. Burr et al., IEDM, T29-5 (2014).
[2] S. Ambrogio et al., Nature, 558, 60-67 (2018).
[3] T. Gokmen et al., Frontiers in Neuroscience, 10, 333 (2016).
[4] T. Gokmen et al., Frontiers in Neuroscience, 12, 745 (2018).
[5] C. Li et al., Nat. Machine Intelligence, 1, 49-57 (2019).
[6] C. Mackin et al., Adv. Electr. Mater., submitted (2019).
[7] https://github.com/deeplearningathome/pytorch-language-model
[8] S. Merity et al., https://arxiv.org/abs/1708.02182 (2017).
[9] S. Kim et al., IEDM, 30-7 (2013).
[10] I. Giannopoulos et al., IEDM, 27-7 (2018).


Fig. 1 Schematic for mapping weights into two pairs of phase change memory (PCM) devices: a most-significant pair (MSP, G+ and G-) and a least-significant pair (LSP, g+ and g-). The total weight (in effective conductance, e.g. µS) is the weighted sum of the 4 PCM conductance values, W = G+ - G- + (g+ - g-)/F.

Fig. 2 (a) Network used in this paper with fully-connected Embedding and Output layers and 2 LSTM layers. (b) LSTM weights and activation structure. (c) Eventual mapping of fully-connected layers inside the LSTM block to an analog memory array.

Fig. 3 (a) Weight distributions from software for Alice in Wonderland, PTB, and PTB (LSTM layers only). (b) The majority of model weights should be mapped into the ‘sensitive region’ of analog conductance, avoiding the ‘saturated regions’.

Forget gate:        h_f(t) = σ(W_f · x(t) + U_f · h(t-1) + b_f)
Input gate:         h_i(t) = σ(W_i · x(t) + U_i · h(t-1) + b_i)
Output gate:        h_o(t) = σ(W_o · x(t) + U_o · h(t-1) + b_o)
Input activation:   h_c(t) = tanh(W_c · x(t) + U_c · h(t-1) + b_c)
Cell state:         c(t) = h_f(t) .* c(t-1) + h_i(t) .* h_c(t)
LSTM output:        h(t) = h_o(t) .* tanh(c(t))
Output layer:       y(t) = W_y · h(t) + b_y
Probability:        probability(t) = softmax(y(t))

Cross-Entropy Loss = -log(probability for correct answer); Perplexity = exp(Cross-Entropy Loss) = 1/(probability for correct answer)

Fig. 4 LSTM equations. Prediction probability is the softmax of y(t); the ‘timestep’ t indicates position in the sequence. Lower cross-entropy loss means that correct answers are being predicted with higher probability. The figure also illustrates character-based and word-based prediction on the example sequence x(t) = "And so it was indeed: she was now only ten […]".
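For reference, the Fig. 4 equations transcribe directly into NumPy. The parameter dictionary (W*, U*, b*, plus the output-layer Wy, by) is assumed to be supplied, e.g. read back from measured conductances and rescaled by 1/α; the names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; p is a dict of weight matrices and bias vectors."""
    h_f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])   # forget gate
    h_i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])   # input gate
    h_o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])   # output gate
    h_c = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])   # input activation
    c = h_f * c_prev + h_i * h_c                              # cell state c(t)
    h = h_o * np.tanh(c)                                      # LSTM output h(t)
    return h, c

def predict(h, p):
    y = p["Wy"] @ h + p["by"]            # fully-connected output layer y(t)
    e = np.exp(y - y.max())
    return e / e.sum()                   # probability(t) = softmax(y(t))
```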

Fig. 5 Two programming strategies: (a) iteratively apply SET pulses with steadily increasing compliance current; (b) iteratively apply RESET pulses with decreasing compliance current.

Fig. 6 Conductance CDFs after (a) 1 sequence of programming and (b) 10 sequences of variable pulse-width (further optimization of devices above and below target).

Fig. 7 Heat maps of programmed vs. target conductance for 32 target levels. Analog conductance values in the ‘sensitive region’ can be programmed more accurately than those in the ‘saturated region’.

Fig. 8 Correlation between target and programmed weights (in both effective-conductance and software units) for PTB weights (LSTM only), encoding each weight (a)-(c) with 2 PCM devices (W = G+ - G-) and (d)-(f) with 4 PCM devices (W = G+ - G- + (g+ - g-)/F).

Fig. 9 Heat maps of programmed vs. target weight for 32 target levels, encoding each weight using 4 PCM devices. Weight programming is significantly more accurate than the conductance programming shown in Fig. 7.

Fig. 10 Inference results for LSTM networks with the mixed-hardware-software experiment: (a) cross-entropy loss for Alice in Wonderland against PyTorch and MATLAB baselines (full network and LSTM cells only, including clipped-weight and 2-PCM variants); (b) perplexity for Penn Tree Bank against the PyTorch baseline. 4 PCM devices are used to map each weight unless otherwise specified.

Network and dataset sizes (Fig. 11a):
- Alice in Wonderland: 135k/15k train/test characters; layer sizes 256 (input) / 50 (hidden); 40,400 weights in the 2-layer LSTM only or 66,256 in all layers, requiring 161,600 or 265,024 PCM devices, respectively.
- Penn Treebank (PTB): 929k/73k/82k train/valid/test words; layer sizes 10,000 (input) / 200 (hidden); 641,600 weights in the 2-layer LSTM only (2,566,400 PCM devices) or 4,651,600 weights in all layers (exceeds hardware size).

Inference with constant Gaussian noise (Fig. 11c): LSTM weights only: Alice 1.949, Alice clipped 1.971, PTB 99.07; all weights: Alice 2.053, Alice clipped 1.997, PTB N/A.

Fig. 11 (a) LSTM network and dataset sizes. (b) Perplexity results for the PTB dataset in context of other published software models. (c) Inference results with simulated constant Gaussian noise. (d) Inference results with simulated noise from weight CDFs. (e) Loss evolution over time for the Alice in Wonderland model.