angel xuan chang | angel xuan chang - recurrent neural networks · 2020. 12. 2. · recurrent...
TRANSCRIPT
![Page 1: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/1.jpg)
Recurrent Neural Networks
Fall 20202020-10-16
CMPT 413 / 825: Natural Language Processing
How to model sequences using neural networks?
(Some slides adapted from Chris Manning, Abigail See, Andrej Karpathy)
SFUNatLangLab
Adapted from slides from Anoop Sarkar, Danqi Chen, Karthik Narasimhan, and Justin Johnson1
![Page 2: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/2.jpg)
Overview
• What is a recurrent neural network (RNN)? • Simple RNNs • Backpropagation through time • Long short-term memory networks (LSTMs) • Applications • Variants: Stacked RNNs, Bidirectional RNNs
2
![Page 3: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/3.jpg)
Recurrent neural networks (RNNs)
A class of neural networks allowing to handle variable length inputs
A function: y = RNN(x1, x2, …, xn) ∈ ℝd
where x1, …, xn ∈ ℝdin
3
![Page 4: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/4.jpg)
Recurrent neural networks (RNNs)
Proven to be an highly effective approach to language modeling, sequence tagging as well as text classification tasks:
Language modeling Sequence tagging
The movie sucks .
👎
Text classification
4
![Page 5: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/5.jpg)
Recurrent neural networks (RNNs)
Form the basis for the modern approaches to machine translation, question answering and dialogue:
5
![Page 6: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/6.jpg)
Why variable-length?Recall the feedfoward neural LMs we learned:
The dogs are barking
the dogs in the neighborhood are ___
x = [ethe, edogs, eare] 2 R3d<latexit sha1_base64="o+E5BkBVqzV8FU9SF1g20jIaEIM=">AAACVnicbVFNT+MwEHXCskD3gwBHLtZWK+1hVSULEntBQnDZIyD6ITXZynEmrYXjRPZkRRXlT8KF/SlcEG4bJGg7kqXn9+Zpxs9xIYVB3//vuBsfNj9ube+0Pn3+8nXX29vvmbzUHLo8l7kexMyAFAq6KFDCoNDAslhCP769mOn9f6CNyNUNTguIMjZWIhWcoaVGXhZmDCdxWt3V9JQOX29Qj6oQ4Q4rnEBd/6SrQpKPzXqFaWuJaChUI8bVdf23Okrqkdf2O/686CoIGtAmTV2OvPswyXmZgUIumTHDwC8wshNQcAl1KywNFIzfsjEMLVQsAxNV81hq+t0yCU1zbY9COmffOiqWGTPNYts5W9MsazNynTYsMf0dVUIVJYLii0FpKSnmdJYxTYQGjnJqAeNa2F0pnzDNONqfaNkQguUnr4Ler07gd4Kr4/bZeRPHNjkk38gPEpATckb+kEvSJZw8kCfHdTacR+fZ3XS3Fq2u03gOyLtyvRfjObdo</latexit><latexit sha1_base64="o+E5BkBVqzV8FU9SF1g20jIaEIM=">AAACVnicbVFNT+MwEHXCskD3gwBHLtZWK+1hVSULEntBQnDZIyD6ITXZynEmrYXjRPZkRRXlT8KF/SlcEG4bJGg7kqXn9+Zpxs9xIYVB3//vuBsfNj9ube+0Pn3+8nXX29vvmbzUHLo8l7kexMyAFAq6KFDCoNDAslhCP769mOn9f6CNyNUNTguIMjZWIhWcoaVGXhZmDCdxWt3V9JQOX29Qj6oQ4Q4rnEBd/6SrQpKPzXqFaWuJaChUI8bVdf23Okrqkdf2O/686CoIGtAmTV2OvPswyXmZgUIumTHDwC8wshNQcAl1KywNFIzfsjEMLVQsAxNV81hq+t0yCU1zbY9COmffOiqWGTPNYts5W9MsazNynTYsMf0dVUIVJYLii0FpKSnmdJYxTYQGjnJqAeNa2F0pnzDNONqfaNkQguUnr4Ler07gd4Kr4/bZeRPHNjkk38gPEpATckb+kEvSJZw8kCfHdTacR+fZ3XS3Fq2u03gOyLtyvRfjObdo</latexit><latexit sha1_base64="o+E5BkBVqzV8FU9SF1g20jIaEIM=">AAACVnicbVFNT+MwEHXCskD3gwBHLtZWK+1hVSULEntBQnDZIyD6ITXZynEmrYXjRPZkRRXlT8KF/SlcEG4bJGg7kqXn9+Zpxs9xIYVB3//vuBsfNj9ube+0Pn3+8nXX29vvmbzUHLo8l7kexMyAFAq6KFDCoNDAslhCP769mOn9f6CNyNUNTguIMjZWIhWcoaVGXhZmDCdxWt3V9JQOX29Qj6oQ4Q4rnEBd/6SrQpKPzXqFaWuJaChUI8bVdf23Okrqkdf2O/686CoIGtAmTV2OvPswyXmZgUIumTHDwC8wshNQcAl1KywNFIzfsjEMLVQsAxNV81hq+t0yCU1zbY9COmffOiqWGTPNYts5W9MsazNynTYsMf0dVUIVJYLii0FpKSnmdJYxTYQGjnJqAeNa2F0pnzDNONqfaNkQguUnr4Ler07gd4Kr4/bZeRPHNjkk38gPEpATckb+kEvSJZw8kCfHdTacR+fZ3XS3Fq2u03gOyLtyvRfjObdo</latexit><latexit sha1_base64="o+E5BkBVqzV8FU9SF1g20jIaEIM=">AAACVnicbVFNT+MwEHXCskD3gwBHLtZWK+1hVSULEntBQnDZIyD6ITXZynEmrYXjRPZkRRXlT8KF/SlcEG4bJGg7kqXn9+Zpxs9xIYVB3//vuBsfNj9ube+0Pn3+8nXX29vvmbzUHLo8l7kexMyAFAq6KFDCoNDAslhCP769mOn9f6CNyNUNTguIMjZWIhWcoaVGXhZmDCdxWt3V9JQOX29Qj6oQ4Q4rnEBd/6SrQpKPzXqFaWuJaChUI8bVdf23Okrqkdf2O/686CoIGtAmTV2OvPswyXmZgUIumTHDwC8wshNQcAl1KywNFIzfsjEMLVQsAxNV81hq+t0yCU1zbY9COmffOiqWGTPNYts5W9MsazNynTYsMf0dVUIVJYLii0FpKSnmdJYxTYQGjnJqAeNa2F0pnzDNONqfaNkQguUnr4Ler07gd4Kr4/bZeRPHNjkk38gPEpATckb+kEvSJZw8kCfHdTacR+fZ3XS3Fq2u03gOyLtyvRfjObdo</latexit>
(fixed-window size = 3)
6
![Page 7: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/7.jpg)
RNNs vs Feedforward NNs
7
![Page 8: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/8.jpg)
RNNs for sequence processing
Vanilla Neural
Networks
Sequence Tagging
Neural Machine Translation
Text ClassificationText Generation
8
![Page 9: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/9.jpg)
Recurrent Neural Language Models
9
![Page 10: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/10.jpg)
Language Modeling
Predict probability of sequence of words
<latexit sha1_base64="8u8r93p2B+o46uuE0CaGrvKG1tw=">AAACJHicbVBNSwMxFMzWr/pd9eglWIQWStkVRQ8tFL14rNCq0NYlm03b0OxmSd5ayrp/xoO/xYOXKh68+FtMaw9qHQgMM/N4eeNFgmuw7Q8rs7C4tLySXV1b39jc2s7t7F5rGSvKmlQKqW49opngIWsCB8FuI8VI4Al24w0uJv7NPVOay7ABo4h1AtILeZdTAkZyc5V6QRdxFdcLQ9cp4bbwJegSHrqNidqOlPTdBKpOepc00mkK8IOxkwqkRTeXt8v2FHieODOSRzPU3dy47UsaBywEKojWLceOoJMQBZwKlq61Y80iQgekx1qGhiRgupNMr0zxoVF83JXKvBDwVP05kZBA61HgmWRAoK//ehPxP68VQ/esk/AwioGF9HtRNxYYJJ5Uhn2uGAUxMoRQxc1fMe0TRSiYYk0Hzt+L58n1Udk5KdtXx/na+ayNLNpHB6iAHHSKaugS1VETUfSIntEYvVpP1ov1Zr1/RzPWbGYP/YL1+QXWcqIq</latexit>
P (s) = P (w1, . . . , wT ) =TY
t=1
P (wt|w<t)
10
with n-grams
with HMMs
<latexit sha1_base64="te4UgyxB6fuTxCWXFdKYrmQYx24=">AAACF3icbVBNS8NAEN3U7++qRy+LRVC0NRFFDx6KXjxWsB/QhrDZbnTpZhN2J2qJ+Rke/C0evKh47dF/47b2UFsfDDzem2Fmnh8LrsG2v63c1PTM7Nz8wuLS8srqWn59o6ajRFFWpZGIVMMnmgkuWRU4CNaIFSOhL1jd71z2/fo9U5pH8ga6MXNDcit5wCkBI3n5w8rugwf4CT946Tlke7hF4lhFj3hEh6Lcdw4wFJ1sz8sX7JI9AJ4kzpAU0BAVL99rtSOahEwCFUTrpmPH4KZEAaeCZYutRLOY0A65ZU1DJQmZdtPBYxneMUobB5EyJQEP1NGJlIRad0PfdIYE7vS41xf/85oJBGduymWcAJP0d1GQCAwR7qeE21wxCqJrCKGKm1sxvSOKUDBZmgyc8Y8nSe2o5JyU7OvjQvlimMY82kLbaBc56BSV0RWqoCqi6Bm9onf0Yb1Yb9an9fXbmrOGM5voD6zeD3itnc4=</latexit>
P (wt|w<t) ⇡ P (wt|wt�n+1,t�1)
<latexit sha1_base64="1W+OCq4/tF5nlGADlRrX14petk8=">AAACJXicbVDLSgMxFM34tr6qLt0Ei9CiLTOi6EJBdOOygtVCW4ZMmukEM5khuaOWsV/jwm9x4cIHgit/xbQdQasHAueecy8393ix4Bps+8MaG5+YnJqemc3NzS8sLuWXVy50lCjKajQSkap7RDPBJasBB8HqsWIk9AS79K5O+v7lNVOaR/IcujFrhaQjuc8pASO5+cNq8cYFfIdv3PQAeiXcJHGsolv8rQculEwRZEUKZbnpbGEoO6bZzRfsij0A/kucjBRQhqqbf262I5qETAIVROuGY8fQSokCTgXr5ZqJZjGhV6TDGoZKEjLdSgdn9vCGUdrYj5R5EvBA/TmRklDrbuiZzpBAoEe9vvif10jA32+lXMYJMEmHi/xEYIhwPzPc5opREF1DCFXc/BXTgChCwSRrMnBGL/5LLrYrzm7FPtspHB1nacygNbSOishBe+gInaIqqiGK7tEjekGv1oP1ZL1Z78PWMSubWUW/YH1+AYpxol4=</latexit>
P (wt|w<t) ⇡ P (wt|ht)P (ht|ht�n+1,t�1)
with neural networks
with fixed window
with RNNs
<latexit sha1_base64="M5BrvPdeSmcyfagEkftBYaSLsTY=">AAACLHicbVDNSgMxGMz6b/2revQSLIKFWnZF0YMHUQ8eFWwrdMuSzaY2mN2E5Fu1LPtCHnwWBS8qXn0Os20Pah0ITGbmI/kmVIIbcN03Z2Jyanpmdm6+tLC4tLxSXl1rGplqyhpUCqmvQ2KY4AlrAAfBrpVmJA4Fa4W3p4XfumPacJlcQV+xTkxuEt7llICVgvKZ2r4PAPsxj/B9kB1BXsU+UUrLB1xYGeRD01c9bu9eDfsikmBqRRx2vLxaLQXlilt3B8DjxBuRChrhIii/+JGkacwSoIIY0/ZcBZ2MaOBUsLzkp4YpQm/JDWtbmpCYmU422DbHW1aJcFdqexLAA/XnREZiY/pxaJMxgZ756xXif147he5hJ+OJSoEldPhQNxUYJC6qwxHXjILoW0Ko5vavmPaIJhRswbYD7+/G46S5W/f26+7lXuX4ZNTGHNpAm2gbeegAHaNzdIEaiKJH9Ize0Lvz5Lw6H87nMDrhjGbW0S84X98OMqZu</latexit>
p(wt | w<t) ⇡ p(wt | �(w1, . . . , wt�1))
<latexit sha1_base64="nVLTfTeRZqZQDE/Bjafx+3dZXyk=">AAACHXicbZDLSgMxFIYzXmu9jbp0EyyCRS0zoujCRdGNywr2Am0ZMmlqg5lMSM5Yy9gnceGzuHCjIrgS38b0Inj7IfDxn3M4OX+oBDfgeR/OxOTU9MxsZi47v7C4tOyurFZMnGjKyjQWsa6FxDDBJSsDB8FqSjMShYJVw6vTQb16zbThsbyAnmLNiFxK3uaUgLUC96C01Q0A3+JukB5DP48bRCkd3+Avv6E63GIKu3Lb38Gw6/fz+cDNeQVvKPwX/DHk0FilwH1rtGKaREwCFcSYuu8paKZEA6eC9bONxDBF6BW5ZHWLkkTMNNPheX28aZ0WbsfaPgl46H6fSElkTC8KbWdEoGN+1wbmf7V6Au2jZsqlSoBJOlrUTgSGGA+ywi2uGQXRs0Co5vavmHaIJhRsojYD//fFf6GyV/APCt75fq54Mk4jg9bRBtpCPjpERXSGSqiMKLpDD+gJPTv3zqPz4ryOWiec8cwa+iHn/ROp5J/4</latexit>
P (wt|w<t) ⇡ P (wt|�(wt�n+1,t�1))
<latexit sha1_base64="UPWhrdY1LIvfIYfgqE5xkjBxR3o=">AAACQ3icbVDLSgMxFM34flt16SZYhBa0zIiiC4WiG5cVrC10hiGTZmxo5kFyRy3jfJwLP8ClX+DCjYpbwUxboVYPBE7Ouffm5nix4ApM89mYmJyanpmdm19YXFpeWS2srV+pKJGU1WkkItn0iGKCh6wOHARrxpKRwBOs4XXPcr9xw6TiUXgJvZg5AbkOuc8pAS25hVatdOsCvse3bnoMWRnbJI5ldId/dDsg0PH8tJO5UN75dcUn2C+NCCnsWtlOPiknZewWimbF7AP/JdaQFNEQNbfwZLcjmgQsBCqIUi3LjMFJiQROBcsW7ESxmNAuuWYtTUMSMOWk/RAyvK2VNvYjqU8IuK+OdqQkUKoXeLoyX1mNe7n4n9dKwD9yUh7GCbCQDh7yE4EhwnmiuM0loyB6mhAqud4V0w6RhILOXWdgjf/4L7naq1gHFfNiv1g9HaYxhzbRFiohCx2iKjpHNVRHFD2gF/SG3o1H49X4MD4HpRPGsGcD/YLx9Q0qD7AU</latexit>
P (wt|w<t) ⇡ P (wt|ht),ht = f(ht�1, wt�1)
![Page 11: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/11.jpg)
Simple (vanilla) RNNs
h0 ∈ ℝd is an initial state
ht = fW(ht−1, xt) ∈ ℝd
ht = g(Whhht−1 + Uxt + b) ∈ ℝd
Simple (vanilla) RNNs:
Whh ∈ ℝd×d, U ∈ ℝd×din, b ∈ ℝd
: nonlinearity (e.g. tanh),g
ht : hidden states which store information from to x1 xt
x
RNN
y
Output label for each time step: Denote , yt = softmax(Woht) Wo ∈ ℝ|V|×d
new state old state input at time t
function with weights W
11
![Page 12: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/12.jpg)
Simple RNNs
Key idea: apply the same weights repeatedlyW
ht = g(Wht−1 + Uxt + b) ∈ ℝd
12
![Page 13: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/13.jpg)
Recurrent Neural Language Models (RNNLMs)
P(w1, w2, …, wn) = P(w1) × P(w2 ∣ w1) × P(w3 ∣ w1, w2) × … × P(wn ∣ w1, w2, …, wn−1)
= P(w1 ∣ h0) × P(w2 ∣ h1) × P(w3 ∣ h2) × … × P(wn ∣ hn−1)
• Denote , yt = softmax(Woht) Wo ∈ ℝ|V|×d
• Cross-entropy loss:
L(θ) = −1n
n
∑t=1
log yt−1(wt)
the students opened their …exams
…
θ = {W, U, b, Wo, E}
13
![Page 14: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/14.jpg)
Progress on language models
On the Penn Treebank (PTB) dataset Metric: perplexity
(Mikolov and Zweig, 2012): Context dependent recurrent neural network language model
KN5: Kneser-Ney 5-gram
14
![Page 15: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/15.jpg)
Progress on language models
(Yang et al, 2018): Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
On the Penn Treebank (PTB) dataset Metric: perplexity
dropping perplexity
15
![Page 16: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/16.jpg)
Training the RNN
16
![Page 17: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/17.jpg)
RNN Computation Graph
h0 fW h1
x1
17
![Page 18: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/18.jpg)
h0 fW h1 fW h2
x2x1
RNN Computation Graph
18
![Page 19: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/19.jpg)
h0 fW h1 fW h2 fW h3
x3
…x2x1
hT
RNN Computation Graph
19
![Page 20: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/20.jpg)
h0 fW h1 fW h2 fW h3
x3
yT
…x2x1W
hT
y3y2y1
RNN Computation Graph
20
![Page 21: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/21.jpg)
h0 fW h1 fW h2 fW h3
x3
yT
…x2x1W
hT
y3y2y1 L1 L2 L3 LT
RNN Computation Graph
21
![Page 22: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/22.jpg)
h0 fW h1 fW h2 fW h3
x3
yT
…x2x1W
hT
y3y2y1 L1 L2 L3 LT
L
RNN Computation Graph
22
![Page 23: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/23.jpg)
Training RNNLMs
• Backpropagation? Yes, but not that simple!
• The algorithm is called Backpropagation Through Time (BPTT).
23
![Page 24: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/24.jpg)
Backpropagation through time
h1 = g(Wh0 + Ux1 + b)
h2 = g(Wh1 + Ux2 + b)
h3 = g(Wh2 + Ux3 + b)
L3 = − log y3(w4)
You should know how to compute: ∂L3
∂h3
∂L3
∂W=
∂L3
∂h3
∂h3
∂W+
∂L3
∂h3
∂h3
∂h2
∂h2
∂W+
∂L3
∂h3
∂h3
∂h2
∂h2
∂h1
∂h1
∂W
∂L∂W
= −1n
n
∑t=1
t
∑k=1
∂Lt
∂ht
t
∏j=k+1
∂hj
∂hj−1
∂hk
∂W
24
![Page 25: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/25.jpg)
Loss
Forward through entire sequence to compute loss, then backward through entire sequence to compute gradient
25
![Page 26: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/26.jpg)
Truncated backpropagation through time
• Backpropagation is very expensive if you handle long sequences
• Run forward and backward through chunks of the sequence instead of whole sequence
• Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
26
![Page 27: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/27.jpg)
Vanishing gradients
27
![Page 28: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/28.jpg)
ht-1
xt
W
stack
tanh
ht
Bengio et al, “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, 1994Pascanu et al, “On the difficulty of training recurrent neural networks”, ICML 2013
28
![Page 29: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/29.jpg)
ht-1
xt
W
stack
tanh
ht
Bengio et al, “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, 1994Pascanu et al, “On the difficulty of training recurrent neural networks”, ICML 2013
29
![Page 30: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/30.jpg)
Vanishing gradients
30
![Page 31: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/31.jpg)
Vanishing Gradients
31
![Page 32: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/32.jpg)
Computing gradient of involves many factors of (and repeated tanh)
h0
W
Largest singular value > 1: Exploding gradients
Largest singular value < 1: Vanishing gradients
Gradient clipping: Scale gradient if its norm is too big
Change RNN architecture32
![Page 33: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/33.jpg)
Different RNN cells
33
![Page 34: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/34.jpg)
Long Short-term Memory (LSTM)• A type of RNN proposed by Hochreiter and Schmidhuber
in 1997 as a solution to the vanishing gradients problem
ht = f(ht−1, xt) ∈ ℝd
• Work extremely well in practice
• Basic idea: turning multiplication into addition
• Use “gates” to control how much information to add/erase
• At each timestep, there is a hidden state and also a cell state
• stores long-term information
• We write/erase after each step
• We read from
ht ∈ ℝd ct ∈ ℝd
ct
ct
ht ct
34
![Page 35: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/35.jpg)
Long Short-term Memory (LSTM)
There are 4 gates:
• Input gate (how much to write): it = σ(W(i)ht−1 + U(i)xt + b(i)) ∈ ℝd
• Forget gate (how much to erase): ft = σ(W( f )ht−1 + U( f )xt + b( f )) ∈ ℝd
• Output gate (how much to reveal): ot = σ(W(o)ht−1 + U(o)xt + b(o)) ∈ ℝd
• New memory cell (what to write): gt = tanh(W(c)ht−1 + U(c)xt + b(c)) ∈ ℝd
How many parameters in total?
• Final memory cell: ct = ft ⊙ ct−1 + it ⊙ gt
• Final hidden cell: ht = ot ⊙ tanh(ct)
element-wise product
Backpropagation from to only element wise multiplication by , no matrix multiply by
ct ct−1
f W
4 × (d2 + ddin + 1)35
![Page 36: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/36.jpg)
LSTM cell intuitively
36
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
![Page 37: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/37.jpg)
Long Short-term Memory (LSTM)
• LSTM doesn’t guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies
• LSTMs were invented in 1997 but finally got working from 2013-2015.
37
![Page 38: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/38.jpg)
Is the LSTM architecture optimal?
(Jozefowicz et al, 2015): An Empirical Exploration of Recurrent Network Architectures38
![Page 39: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/39.jpg)
Simple RNN vs GRU vs LSTM
Simple RNN GRU LSTM
39
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
![Page 40: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/40.jpg)
Batching
• Apply RNNs to batches of sequences
• Present data as 3D tensor of ( )T × B × F
40
![Page 41: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/41.jpg)
Batching
• Use mask matrix to aid with computations that ignore padded zeros
1 1 1 1 0 0
1 0 0 0 0 0
1 1 1 1 1 1
1 1 1 0 0 0
41
![Page 42: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/42.jpg)
Batching
• Sorting (partially) can help to create more efficient mini-batches • However, the input is less randomized
Unsorted
Sorted
42
![Page 43: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/43.jpg)
Overview
• What is a recurrent neural network (RNN)? • Simple RNNs • Backpropagation through time • Long short-term memory networks (LSTMs) • Applications • Variants: Stacked RNNs, Bidirectional RNNs
43
![Page 44: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/44.jpg)
Application: Text Generation
You can generate text by repeated sampling. Sampled output is next step’s input.44
![Page 45: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/45.jpg)
Fun with RNNs
Andrej Karpathy “The Unreasonable Effectiveness of Recurrent Neural Networks”
Obama speeches Latex generation
45
![Page 46: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/46.jpg)
Application: Sequence Tagging
P(yi = k) = softmaxk(Wohi) Wo ∈ ℝC×d
L = −1n
n
∑i=1
log P(yi = k)
Input: a sentence of n words: x1, …, xn
Output: y1, …, yn, yi ∈ {1,…C}
46
![Page 47: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/47.jpg)
Application: Text Classification
the movie was terribly exciting !
hn
P(y = k) = softmaxk(Wohn) Wo ∈ ℝC×d
Input: a sentence of n words
Output: y ∈ {1,2,…, C}
47
![Page 48: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/48.jpg)
Multi-layer RNNs
• RNNs are already “deep” on one dimension (unroll over time steps)
• We can also make them “deep” in another dimension by applying multiple RNNs
• Multi-layer RNNs are also called stacked RNNs.
48
![Page 49: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/49.jpg)
Multi-layer RNNs
The hidden states from RNN layer are the inputs to RNN layer
ii + 1
• In practice, using 2 to 4 layers is common (usually better than 1 layer) • Transformer-based networks can be up to 24 layers with lots of skip-
connections.
49
![Page 50: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/50.jpg)
Bidirectional RNNs
• Bidirectionality is important in language representations:
terribly: • left context “the movie was” • right context “exciting !”
50
![Page 51: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/51.jpg)
Bidirectional RNNs
ht = f(ht−1, xt) ∈ ℝd
h t = f1(h t−1, xt), t = 1,2,…n
h t = f2(h t+1, xt), t = n, n − 1,…1
ht = [h t, h t] ∈ ℝ2d
51
![Page 52: Angel Xuan Chang | Angel Xuan Chang - Recurrent Neural Networks · 2020. 12. 2. · Recurrent Neural Networks Fall 2020 2020-10-16 CMPT 413 / 825: Natural Language Processing How](https://reader036.vdocument.in/reader036/viewer/2022071506/61271858785b3b4cdc514f8a/html5/thumbnails/52.jpg)
Bidirectional RNNs
• Sequence tagging: Yes! • Text classification: Yes! With slight modifications.
• Text generation: No. Why?
terribly exciting !the movie wasterribly exciting !the movie was
Sentence encoding
element-wise mean/max element-wise mean/max
52