Deep Sequence Models: Context Representation, Regularization, and Application to Language
Adji Bousso Dieng
All Data Are Born Sequential
“Time underlies many interesting human behaviors.”– Elman, 1990.
Why model these data?
→ to help in decision making
→ to generate more of it
→ to predict and forecast
→ ... for science
How do we model these data?
→ need to capture all the dependencies
→ need to account for dimensionality
→ need to account for seasonality
... It’s complicated.
Recurrent Neural Networks: Successes
→ Image generation (Gregor+, 2015)
→ Text generation (Graves, 2013)
→ Machine translation (Sutskever+, 2014)
Recurrent Neural Networks: Challenges
s_t = f_W(x_t, s_{t−1})
s_t = g(s_0, x_t, x_{t−1}, ..., x_0) and g = f(f(f(...)))
o_t = softmax(V s_t)
→ Vanishing and exploding gradients.
→ V can be very high-dimensional (it maps hidden states to the vocabulary).
→ Hidden state has limited capacity.
→ The RNN is trying to do too many things at once.
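The recurrence above can be sketched in a few lines of NumPy. All sizes, weights, and inputs below are illustrative, not from the slides; the point is only to show how unrolling composes f with itself.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, V_size = 8, 16, 100  # input, hidden, and vocabulary sizes (illustrative)

W_in = rng.normal(scale=0.1, size=(H, D))    # input-to-hidden weights
W_rec = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden weights
V = rng.normal(scale=0.1, size=(V_size, H))  # hidden-to-output projection

def rnn_step(x_t, s_prev):
    """One step of the recurrence: s_t = f_W(x_t, s_{t-1}), o_t = softmax(V s_t)."""
    s_t = np.tanh(W_in @ x_t + W_rec @ s_prev)
    logits = V @ s_t
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return s_t, exp / exp.sum()

# Unrolling composes f with itself, so s_t = g(s_0, x_t, ..., x_0) = f(f(f(...)))
s = np.zeros(H)
for x in rng.normal(size=(5, D)):
    s, o = rnn_step(x, s)
```

The nested composition is exactly why gradients vanish or explode: backpropagating through the loop multiplies many Jacobians of f together.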
Context Representation
What Is Context?
The U.S. presidential race is not only drawing attention and controversy in the United States – it is being closely watched across the globe. But what does the rest of the world think about a campaign that has already thrown up one surprise after another? CNN asked 10 journalists for their take on the race so far, and what their country might be hoping for in America's next –
→ local context:
the few words preceding the word to predict
order matters.
defines syntax.
What Is Context?
The U.S. presidential race is not only drawing attention
and controversy in the United States – it is being closely watched across the globe. But what does the rest of the world think about a campaign that has already thrown up one surprise after another? CNN asked 10 journalists for their take on the race so far, and what their country might be hoping for in America's next –
→ global context:
words in the same document as the word to predict
order does not matter.
defines semantics.
Topics As Context (1/3)
source: David Blei
Topics As Context (2/3)
source: David Blei
Topics As Context (3/3)
source: David Blei
θ_d ∼ Dir(α) ; β_k ∼ Dir(η) ; z_{dn} ∼ Multinomial(θ_d)
w_{dn} ∼ Multinomial(β_{z_{dn}})
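This generative process can be sampled directly. The sizes K, V, N and the hyperparameters α, η below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 5, 50, 20     # topics, vocabulary size, words per document (illustrative)
alpha, eta = 0.1, 0.01  # Dirichlet hyperparameters (illustrative)

beta = rng.dirichlet(np.full(V, eta), size=K)  # beta_k ~ Dir(eta): topic-word distributions
theta = rng.dirichlet(np.full(K, alpha))       # theta_d ~ Dir(alpha): document's topic mix

z = rng.choice(K, size=N, p=theta)             # z_dn ~ Multinomial(theta_d): topic per word
words = np.array([rng.choice(V, p=beta[k]) for k in z])  # w_dn ~ Multinomial(beta_{z_dn})
```

Each document gets its own θ_d, so the same topics β are shared across the corpus while their proportions vary per document.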
Composing Topics And RNNs (1/3)
source: Wang+, 2017
→ RNN focuses on capturing local correlations (syntax model)
→ Topic model captures global dependencies (semantic model)
→ Combine both to make predictions
Composing Topics And RNNs (2/3)
[Model diagram: RNN unrolled over time, with weight matrices U, V, W and topic matrix B]
source: Dieng+, 2017
h_t = f_W(x_t, h_{t−1}) ; l_t ∼ Bernoulli(σ(Γ^⊤ h_t))
y_t ∼ softmax(V^⊤ h_t + (1 − l_t) B^⊤ θ)
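A minimal sketch of this output step, with illustrative sizes and random weights (none of these values come from the paper). The stop-word indicator l_t switches the topic bias B^⊤θ off:

```python
import numpy as np

rng = np.random.default_rng(0)
H, V_size, K = 16, 100, 5  # hidden size, vocabulary size, number of topics (illustrative)

V = rng.normal(scale=0.1, size=(V_size, H))  # word projection of the hidden state
B = rng.normal(scale=0.1, size=(V_size, K))  # word projection of the topic vector
Gamma = rng.normal(scale=0.1, size=H)        # stop-word classifier weights

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def topicrnn_output(h_t, theta):
    """Sample l_t and the word distribution for y_t from h_t and theta."""
    l_t = rng.random() < sigmoid(Gamma @ h_t)  # l_t ~ Bernoulli(sigma(Gamma^T h_t))
    logits = V @ h_t + (0.0 if l_t else 1.0) * (B @ theta)  # topic bias off for stop words
    exp = np.exp(logits - logits.max())
    return l_t, exp / exp.sum()

h = rng.normal(size=H)
theta = rng.dirichlet(np.full(K, 0.1))
l, probs = topicrnn_output(h, theta)
```

The design lets the RNN handle syntax (including stop words) while θ shifts probability mass toward semantically relevant content words.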
Composing Topics And RNNs (3/3)
[Model diagram: X_c (bag-of-words, stop words excluded) feeds the inference network; X (the full document, stop words included) feeds the RNN through U and W; the target document Y is predicted through V and B]
source: Dieng+, 2017
→ Choose q(θ | X_c) to be a Gaussian parameterized by an MLP
→ Choose p(θ) to be standard Gaussian: θ = g(N (0, IK))
→ Maximize the ELBO:
ELBO = E_{q(θ | X_c)} [ Σ_{t=1}^T log p(y_t, l_t | θ; h_t) ] − KL( q(θ | X_c) ‖ p(θ) )
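A Monte Carlo estimate of this bound can be sketched as follows, assuming a diagonal-Gaussian q with illustrative parameters and a toy stand-in for the likelihood term (the real one would sum log p(y_t, l_t | θ; h_t) over the document):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # latent dimensionality (illustrative)

# Stand-ins for the inference network's output on one document
mu = 0.1 * rng.normal(size=K)
log_sigma = np.full(K, -1.0)

def elbo_estimate(log_lik_fn, n_samples=64):
    """Monte Carlo ELBO: E_q[log-likelihood] minus an analytic Gaussian KL."""
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=(n_samples, K))
    theta = mu + sigma * eps  # reparameterization: theta = g(N(0, I_K))
    expected_ll = np.mean([log_lik_fn(th) for th in theta])
    # KL(N(mu, diag(sigma^2)) || N(0, I_K)) in closed form
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * log_sigma)
    return expected_ll - kl

# Toy stand-in for sum_t log p(y_t, l_t | theta; h_t)
bound = elbo_estimate(lambda th: -0.5 * np.sum((th - 1.0) ** 2))
```

The reparameterization θ = μ + σ ⊙ ε is what makes the expectation differentiable with respect to the inference network's parameters.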
Composing Topics And RNNs (3/3)
source: Wang+, 2017
→ has been extended to mixture of experts (Wang+, 2017)
→ has been applied to conversation modeling (Wen+, 2017)
Some Results On Language Modeling (1)
source: Dieng+, 2017
→ Perplexity on Penn Treebank dataset (the lower the better)
→ Three different network capacities
→ Adding topic features is always better
→ Doing so jointly is even better
Some Results On Language Modeling (2)
[Figure: three panels, each titled "Inferred Topic Distribution from TopicGRU"; x-axis: topic index (0–50), y-axis: topic proportion]
source: Dieng+, 2017
→ Inferred topic distributions for 3 different documents with TopicGRU
→ Different topics get picked up for different documents
Some Results On Language Modeling (3)
source: Wang+, 2017
→ Topics for three different datasets
→ Shows top five words of ten random topics
Some Results On Document Classification
source: Dieng+, 2017
→ Sentiment classification on IMDB
→ Feature extraction: concatenate RNN feature and Topic feature
→ PCA + K-Means
Some Results On Document Classification
source: Dieng+, 2017
Regularization
Co-adaptation
“When a neural network overfits badly during training, its hidden states depend very heavily on each other.”
– Hinton, 2012
Noise As Regularizer
→ Define a noise-injected RNN as:
ε_{1:T} ∼ ϕ(· ; μ, γ) ; z_t = g_W(x_t, z_{t−1}, ε_t) and p(y_t | y_{1:t−1}) = p(y_t | z_t)
→ The likelihood p(y_t | z_t) is in the exponential family
→ Different noise ε at each layer
[Figure: training and validation perplexity vs. epochs for RNN and RNN + NOISIN]
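One noise-injected update can be sketched as follows, using additive zero-mean Gaussian noise as one concrete choice of g_W (all sizes and weights below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16  # input and hidden sizes (illustrative)
gamma = 0.5   # noise scale, playing the role of phi's parameter

W_in = rng.normal(scale=0.1, size=(H, D))
W_rec = rng.normal(scale=0.1, size=(H, H))

def noisy_step(x_t, z_prev):
    """z_t = g_W(x_t, z_{t-1}, eps_t): the deterministic update plus injected noise."""
    f = np.tanh(W_in @ x_t + W_rec @ z_prev)  # the underlying RNN f_W
    eps_t = rng.normal(0.0, gamma, size=H)    # eps_t ~ phi(.; mu=0, gamma)
    return f + eps_t                          # additive injection keeps E[z_t] = f

z = np.zeros(H)
for x in rng.normal(size=(5, D)):
    z = noisy_step(x, z)
```

Fresh noise is drawn at every step, so the hidden units cannot rely on each other's exact values, which is the co-adaptation the quote above describes.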
Dropout
→ For the LSTM this is:
f_t = σ(W_{x1}^⊤ x_{t−1} ⊙ ε^x_{ft} + W_{h1}^⊤ h_{t−1} ⊙ ε^h_{ft})
i_t = σ(W_{x2}^⊤ x_{t−1} ⊙ ε^x_{it} + W_{h2}^⊤ h_{t−1} ⊙ ε^h_{it})
o_t = σ(W_{x4}^⊤ x_{t−1} ⊙ ε^x_{ot} + W_{h4}^⊤ h_{t−1} ⊙ ε^h_{ot})
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_{x3}^⊤ x_{t−1} ⊙ ε^x_{ct} + W_{h3}^⊤ h_{t−1} ⊙ ε^h_{ct})
z_t^{dropout} = o_t ⊙ tanh(c_t).
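These gate equations can be sketched directly. The binary masks below play the role of the ε terms; all sizes, weights, and the keep probability are illustrative, and the usual 1/p rescaling is omitted, as in the equations above:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16   # input and hidden sizes (illustrative)
p_keep = 0.75  # dropout keep probability (illustrative)

# One (W_x, W_h) pair per gate: forget, input, cell, output
Wx = {g: rng.normal(scale=0.1, size=(H, D)) for g in "fico"}
Wh = {g: rng.normal(scale=0.1, size=(H, H)) for g in "fico"}

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mask(n):
    """Binary dropout mask, playing the role of the eps terms in the equations."""
    return (rng.random(n) < p_keep).astype(float)

def dropout_lstm_step(x, h, c):
    gate = lambda g, act: act(Wx[g] @ (x * mask(D)) + Wh[g] @ (h * mask(H)))
    f = gate("f", sigmoid)                  # forget gate
    i = gate("i", sigmoid)                  # input gate
    o = gate("o", sigmoid)                  # output gate
    c_new = f * c + i * gate("c", np.tanh)  # cell update
    return o * np.tanh(c_new), c_new        # z_t^{dropout} = o_t * tanh(c_t)

h, c = dropout_lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H))
```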
NOISIN: Unbiased Noise Injection
→ Strong unbiasedness condition
E_{p(z_t(ε_{1:t}) | z_{t−1})} [z_t(ε_{1:t})] = s_t
→ Weak unbiasedness condition
E_{p(z_t(ε_{1:t}) | z_{t−1})} [z_t(ε_{1:t})] = f_W(x_{t−1}, z_{t−1})
→ Under unbiasedness the underlying RNN is preserved
→ Examples: additive and multiplicative noise
g_W(x_{t−1}, z_{t−1}, ε_t) = f_W(x_{t−1}, z_{t−1}) + ε_t
g_W(x_{t−1}, z_{t−1}, ε_t) = f_W(x_{t−1}, z_{t−1}) ⊙ (1 + ε_t)
→ Dropout does not meet this requirement; it is biased
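A quick numerical check of the unbiasedness claim, contrasting the two noise schemes with a dropout mask; the vector f and all constants below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.array([0.3, -0.7, 0.5])  # stand-in for f_W(x_{t-1}, z_{t-1})
eps = rng.normal(0.0, 0.2, size=(200_000, 3))  # zero-mean noise samples

additive = f + eps                # g = f + eps
multiplicative = f * (1.0 + eps)  # g = f * (1 + eps), elementwise

# Both schemes are unbiased: the noisy state averages back to f
assert np.allclose(additive.mean(axis=0), f, atol=1e-2)
assert np.allclose(multiplicative.mean(axis=0), f, atol=1e-2)

# A dropout mask with keep probability p shrinks the mean to p * f instead
keep = rng.random(size=(200_000, 3)) < 0.8
dropped = keep * f  # E[dropped] = 0.8 * f, not f
```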
NOISIN: The Objective
→ NOISIN maximizes the following objective
L = E_{p(ε_{1:T})} [log p(x_{1:T} | z_{1:T}(ε_{1:T}))]
→ In more detail this is
L = Σ_{t=1}^T E_{p(ε_{1:t})} [log p(x_t | z_t(ε_{1:t}))]
→ Notice this objective is a Jensen bound on the marginal log-likelihood of the data,
L ≤ log E_{p(ε_{1:T})} [p(x_{1:T} | z_{1:T}(ε_{1:T}))] = log p(x_{1:T})
NOISIN: Connections
L = Σ_{t=1}^T E_{p(ε_{1:t})} [log p(x_t | z_t(ε_{1:t}))]
→ Ensemble method
average the predictions of infinitely many RNNs at each time step
→ Empirical Bayes
estimate the parameters of the prior on the hidden states
Some Results On Language Modeling (1/2)
→ Perplexity on the Penn Treebank (the lower the better)
→ D + Distribution is Dropout-LSTM with NOISIN
→ Studied many noise distributions: only variance matters
→ Noise is scaled to enjoy unbounded variance
Some Results On Language Modeling (2/2)
→ Perplexity on Wikitext-2 (the lower the better)
→ D + Distribution is Dropout-LSTM with NOISIN
→ Studied many noise distributions: only variance matters
→ Noise is scaled to enjoy unbounded variance
Lessons Learned So Far
Context representation
→ Need to rethink long-term dependencies (for language)
→ Combine a syntax model and a semantic model
→ Topic models are good semantic models
→ TopicRNN is a deep generative model that uses topics as context for RNNs
Regularization
→ Noise can be used to avoid co-adaptation
→ It should be injected unbiasedly into the hidden units of the RNN
→ This is some form of model averaging and is like empirical Bayes
→ NOISIN is simple yet significantly improves RNN-based models
More Challenges to Tackle
→ Scalability
→ Incorporating prior knowledge
→ Improving generation