Deep Sequence Models: Context Representation, Regularization, and Application to Language
Adji Bousso Dieng


Page 1:

Deep Sequence Models

Context Representation, Regularization, and Application to Language

Adji Bousso Dieng

Page 2:

All Data Are Born Sequential

“Time underlies many interesting human behaviors.” – Elman, 1990.

Page 3:

Why model these data?

→ to help in decision making

→ to generate more of it

→ to predict and forecast

→ ... for science

Page 4:

How do we model these data?

→ need to capture all the dependencies

→ need to account for dimensionality

→ need to account for seasonality

... It’s complicated.

Page 5:

Recurrent Neural Networks: Successes

→ Image generation (Gregor+, 2015)

→ Text generation (Graves, 2013)

→ Machine translation (Sutskever+, 2014)

Page 6:

Recurrent Neural Networks: Challenges

s_t = f_W(x_t, s_{t-1})

s_t = g(s_0, x_t, x_{t-1}, \dots, x_0), \quad \text{where } g = f(f(f(\cdots)))

o_t = \mathrm{softmax}(V s_t)

→ Vanishing and exploding gradients.

→ V can be very high-dimensional.

→ Hidden state has limited capacity.

→ The RNN is trying to do too many things at once.
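To make the recurrence above concrete, here is a minimal NumPy sketch (illustrative only, not from the slides): f_W is taken to be a tanh cell, and all variable names and shapes are assumptions.

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_step(x_t, s_prev, W_x, W_s, b, V):
    # s_t = f_W(x_t, s_{t-1}); a tanh cell is one common choice for f_W
    s_t = np.tanh(W_x @ x_t + W_s @ s_prev + b)
    # o_t = softmax(V s_t): V maps the hidden state to the (possibly huge) vocabulary
    o_t = softmax(V @ s_t)
    return s_t, o_t

def rnn_forward(xs, s0, W_x, W_s, b, V):
    # Unrolling gives s_t = g(s_0, x_t, ..., x_0) with g = f(f(f(...))),
    # the deep composition behind vanishing and exploding gradients.
    s, outputs = s0, []
    for x_t in xs:
        s, o_t = rnn_step(x_t, s, W_x, W_s, b, V)
        outputs.append(o_t)
    return outputs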

Page 7:

Context Representation

Page 8:

What Is Context?

The U.S. presidential race is not only drawing attention and controversy in the United States – it is being closely watched across the globe. But what does the rest of the world think about a campaign that has already thrown up one surprise after another? CNN asked 10 journalists for their take on the race so far, and what their country might be hoping for in America’s next –

→ local context:

few words preceding the word to predict

order matters.

defines syntax

Page 9:

What Is Context?

The U.S. presidential race is not only drawing attention and controversy in the United States – it is being closely watched across the globe. But what does the rest of the world think about a campaign that has already thrown up one surprise after another? CNN asked 10 journalists for their take on the race so far, and what their country might be hoping for in America’s next –

→ global context:

words in the same document as the word to predict

order does not matter.

defines semantics

Page 10:

Topics As Context (1/3)

source: David Blei

Page 11:

Topics As Context (2/3)

source: David Blei

Page 12:

Topics As Context (3/3)

source: David Blei

\theta_d \sim \mathrm{Dir}(\alpha); \quad \beta_k \sim \mathrm{Dir}(\eta); \quad z_{dn} \sim \mathrm{Multinomial}(\theta_d)

w_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})
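As a sanity check on the generative story above, a small NumPy sketch of ancestral sampling from LDA; the corpus sizes, hyperparameter values, and function name are illustrative assumptions.

import numpy as np

def lda_sample_corpus(D=5, N=50, K=10, V=1000, alpha=0.1, eta=0.01,
                      rng=np.random.default_rng(0)):
    # beta_k ~ Dir(eta): one distribution over the vocabulary per topic
    beta = rng.dirichlet(np.full(V, eta), size=K)
    docs = []
    for d in range(D):
        # theta_d ~ Dir(alpha): topic proportions for document d
        theta_d = rng.dirichlet(np.full(K, alpha))
        doc = []
        for n in range(N):
            z_dn = rng.choice(K, p=theta_d)      # z_dn ~ Multinomial(theta_d)
            w_dn = rng.choice(V, p=beta[z_dn])   # w_dn ~ Multinomial(beta_{z_dn})
            doc.append(w_dn)
        docs.append(doc)
    return docs, beta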

Page 13:

Composing Topics And RNNs (1/3)

source: Wang+, 2017

→ RNN focuses on capturing local correlations (syntax model)

→ Topic model captures global dependencies (semantic model)

→ Combine both to make predictions

Page 14:

Composing Topics And RNNs (2/3)

[Figure: TopicRNN unrolled over time, with repeated weight matrices U, V, W and topic matrix B]

source: Dieng+, 2017

h_t = f_W(x_t, h_{t-1}); \quad l_t \sim \mathrm{Bernoulli}(\sigma(\Gamma^\top h_t))

y_t \sim \mathrm{softmax}(V^\top h_t + (1 - l_t)\, B^\top \theta)
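A NumPy sketch of the output layer defined by the two equations above; h_t and θ are assumed given, and the shapes of Γ, V, B as well as the function name are illustrative assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def topicrnn_emit(h_t, theta, Gamma, V, B, rng=np.random.default_rng(0)):
    # l_t ~ Bernoulli(sigma(Gamma^T h_t)): stop-word indicator
    l_t = rng.binomial(1, sigmoid(Gamma @ h_t))
    # y_t ~ softmax(V^T h_t + (1 - l_t) B^T theta):
    # the topic vector theta contributes only when l_t = 0
    probs = softmax(V.T @ h_t + (1 - l_t) * (B.T @ theta))
    y_t = rng.choice(len(probs), p=probs)
    return l_t, y_t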

Page 15:

Composing Topics And RNNs (3/3)

[Figure: TopicRNN inference setup. The inference network takes Xc (bag-of-words, stop words excluded); the RNN reads X (the full document, stop words included) and predicts Y (the target document); weight matrices U, V, W and topic matrix B.]

source: Dieng+, 2017

→ Choose q(θ | Xc) to be a Gaussian parameterized by an MLP (the inference network)

→ Choose p(θ) to be a standard Gaussian: θ = g(ε) with ε ∼ N(0, I_K)

→ Maximize the ELBO:

\mathrm{ELBO} = \mathbb{E}_{q(\theta \mid X_c)}\left[\sum_{t=1}^{T} \log p(y_t, l_t \mid \theta; h_t)\right] - \mathrm{KL}\big(q(\theta \mid X_c) \,\|\, p(\theta)\big)
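A sketch of a one-sample Monte Carlo estimate of this ELBO under the choices above, using the Gaussian reparameterization θ = μ + σ ⊙ ε; the encoder and per-step log-likelihood are stand-in callables (assumptions, not the authors' code).

import numpy as np

def elbo_one_sample(Xc, encoder_mlp, step_log_lik, rng=np.random.default_rng(0)):
    # q(theta | Xc): diagonal Gaussian whose mean / log-variance come from an MLP
    mu, log_var = encoder_mlp(Xc)
    # Reparameterize: theta = mu + sigma * eps, with eps ~ N(0, I_K)
    eps = rng.standard_normal(mu.shape)
    theta = mu + np.exp(0.5 * log_var) * eps
    # E_q[ sum_t log p(y_t, l_t | theta; h_t) ], estimated with this one theta
    reconstruction = sum(step_log_lik(theta))
    # KL( q(theta | Xc) || N(0, I_K) ), closed form for diagonal Gaussians
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return reconstruction - kl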

Page 16:

Composing Topics And RNNs (3/3)

source: Wang+, 2017

→ has been extended to a mixture of experts (Wang+, 2017)

→ has been applied to conversation modeling (Wen+, 2017)

Page 17:

Some Results On Language Modeling (1)

source: Dieng+, 2017

→ Perplexity on the Penn Treebank dataset (lower is better)

→ Three different network capacities

→ Adding topic features is always better

→ Doing so jointly is even better

Page 18:

Some Results On Language Modeling (2)

[Figure: inferred topic distributions from TopicGRU for three different documents]

source: Dieng+, 2017

→ Inferred topic distributions for three different documents with TopicGRU

→ Different topics get picked up for different documents

Page 19:

Some Results On Language Modeling (3)

source: Wang+, 2017

→ Topics for three different datasets

→ Shows the top five words of ten random topics

Page 20:

Some Results On Document Classification

source: Dieng+, 2017

→ Sentiment classification on IMDB

→ Feature extraction: concatenate the RNN features and the topic features

→ PCA + K-Means
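A possible reading of this pipeline in scikit-learn, assuming the per-document RNN features and topic features have already been extracted; the array names, component count, and cluster count are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def project_and_cluster(rnn_feats, topic_feats, n_components=2, n_clusters=2):
    # Feature extraction: concatenate the RNN feature and the topic feature
    feats = np.concatenate([rnn_feats, topic_feats], axis=1)
    # PCA for a low-dimensional projection, then K-Means on that projection
    proj = PCA(n_components=n_components).fit_transform(feats)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(proj)
    return proj, labels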

Page 21:

Some Results On Document Classification

source: Dieng+, 2017

Page 22:

Regularization

Page 23:

Co-adaptation

“When a neural network overfits badly during training, its hidden states depend very heavily on each other.”

– Hinton, 2012

Page 24:

Noise As Regularizer

→ Define a noise-injected RNN as:

\epsilon_{1:T} \sim \varphi(\cdot\,; \mu, \gamma); \quad z_t = g_W(x_t, z_{t-1}, \epsilon_t); \quad p(y_t \mid y_{1:t-1}) = p(y_t \mid z_t)

→ The likelihood p(y_t | z_t) is in the exponential family

→ Different noise ε at each layer

[Figure: training and validation perplexity vs. epochs for RNN and RNN + NOISIN]

Page 25:

Dropout

→ For the LSTM this is:

f_t = \sigma(W_{x1}^\top x_{t-1} \odot \epsilon^x_{ft} + W_{h1}^\top h_{t-1} \odot \epsilon^h_{ft})

i_t = \sigma(W_{x2}^\top x_{t-1} \odot \epsilon^x_{it} + W_{h2}^\top h_{t-1} \odot \epsilon^h_{it})

o_t = \sigma(W_{x4}^\top x_{t-1} \odot \epsilon^x_{ot} + W_{h4}^\top h_{t-1} \odot \epsilon^h_{ot})

c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{x3}^\top x_{t-1} \odot \epsilon^x_{ct} + W_{h3}^\top h_{t-1} \odot \epsilon^h_{ct})

z_t^{\mathrm{dropout}} = o_t \odot \tanh(c_t).
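A NumPy sketch of dropout applied inside the LSTM gates as written above (the masks ε multiply the projected input and hidden-state terms); the weight layout, keep probability, and 1/keep mask scaling are assumptions for illustration.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_dropout_step(x_prev, h_prev, c_prev, Wx, Wh, keep=0.75,
                      rng=np.random.default_rng(0)):
    # Wx = (Wx1, Wx2, Wx3, Wx4), Wh likewise; one fresh Bernoulli mask pair
    # (eps^x, eps^h) per gate, here scaled by 1/keep (inverted dropout).
    def mask():
        return rng.binomial(1, keep, size=h_prev.shape) / keep

    def gate(Wxi, Whi, act):
        return act((Wxi.T @ x_prev) * mask() + (Whi.T @ h_prev) * mask())

    f_t = gate(Wx[0], Wh[0], sigmoid)                       # forget gate
    i_t = gate(Wx[1], Wh[1], sigmoid)                       # input gate
    o_t = gate(Wx[3], Wh[3], sigmoid)                       # output gate
    c_t = f_t * c_prev + i_t * gate(Wx[2], Wh[2], np.tanh)  # cell state
    z_t = o_t * np.tanh(c_t)                                # dropout-LSTM output
    return z_t, c_t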

Page 26:

NOISIN: Unbiased Noise Injection

→ Strong unbiasedness condition:

\mathbb{E}_{p(z_t(\epsilon_{1:t}) \mid z_{t-1})}[z_t(\epsilon_{1:t})] = s_t

→ Weak unbiasedness condition:

\mathbb{E}_{p(z_t(\epsilon_{1:t}) \mid z_{t-1})}[z_t(\epsilon_{1:t})] = f_W(x_{t-1}, z_{t-1})

→ Under unbiasedness the underlying RNN is preserved

→ Examples: additive and multiplicative noise

g_W(x_{t-1}, z_{t-1}, \epsilon_t) = f_W(x_{t-1}, z_{t-1}) + \epsilon_t

g_W(x_{t-1}, z_{t-1}, \epsilon_t) = f_W(x_{t-1}, z_{t-1}) \odot (1 + \epsilon_t)

→ Dropout does not meet this requirement; it is biased
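A minimal sketch of the two unbiased injections above; f_W is any deterministic RNN transition, and the zero-mean Gaussian draw for ε is one choice among the many distributions the method allows (names and defaults are assumptions).

import numpy as np

def noisin_step(x_prev, z_prev, f_W, sigma=0.3, multiplicative=False,
                rng=np.random.default_rng(0)):
    s_t = f_W(x_prev, z_prev)                            # underlying RNN transition
    eps_t = sigma * rng.standard_normal(np.shape(s_t))   # zero-mean noise, E[eps_t] = 0
    if multiplicative:
        # g_W = f_W * (1 + eps): unbiased because E[1 + eps_t] = 1
        return s_t * (1.0 + eps_t)
    # g_W = f_W + eps: unbiased because E[eps_t] = 0
    return s_t + eps_t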

Page 27:

NOISIN: The Objective

→ NOISIN maximizes the following objective

\mathcal{L} = \mathbb{E}_{p(\epsilon_{1:T})}[\log p(x_{1:T} \mid z_{1:T}(\epsilon_{1:T}))]

→ In more detail this is

\mathcal{L} = \sum_{t=1}^{T} \mathbb{E}_{p(\epsilon_{1:t})}[\log p(x_t \mid z_t(\epsilon_{1:t}))]

→ Notice this objective is a Jensen bound on the marginal log-likelihood of the data:

\mathcal{L} \le \log \mathbb{E}_{p(\epsilon_{1:T})}[p(x_{1:T} \mid z_{1:T}(\epsilon_{1:T}))] = \log p(x_{1:T})
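One simple way to estimate this objective is with a single draw of ε_{1:T} per sequence; a sketch follows, where g_W and the per-step log-likelihood are stand-in callables (assumptions, not the authors' implementation).

import numpy as np

def noisin_objective_estimate(xs, z0, g_W, log_lik, rng=np.random.default_rng(0)):
    # One-sample Monte Carlo estimate of
    #   L = sum_t E_{p(eps_{1:t})}[ log p(x_t | z_t(eps_{1:t})) ]
    # eps_t is drawn standard normal here purely for illustration.
    z, total = z0, 0.0
    for t in range(1, len(xs)):
        eps_t = rng.standard_normal(np.shape(z))
        z = g_W(xs[t - 1], z, eps_t)        # noisy hidden state z_t(eps_{1:t})
        total += log_lik(xs[t], z)          # log p(x_t | z_t)
    return total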

Page 28:

NOISIN: Connections

\mathcal{L} = \sum_{t=1}^{T} \mathbb{E}_{p(\epsilon_{1:t})}[\log p(x_t \mid z_t(\epsilon_{1:t}))]

→ Ensemble method: average the predictions of infinitely many RNNs at each time step

→ Empirical Bayes: estimate the parameters of the prior on the hidden states

Page 29:

Some Results On Language Modeling (1/2)

→ Perplexity on the Penn Treebank (lower is better)

→ “D + Distribution” denotes Dropout-LSTM combined with NOISIN

→ Studied many noise distributions: only the variance matters

→ Noise is scaled to have unbounded variance

Page 30:

Some Results On Language Modeling (2/2)

→ Perplexity on Wikitext-2 (lower is better)

→ “D + Distribution” denotes Dropout-LSTM combined with NOISIN

→ Studied many noise distributions: only the variance matters

→ Noise is scaled to have unbounded variance

Page 31:

Lessons Learned So Far

Context representation

→ Need to rethink long-term dependencies (for language)

→ Combine a syntax model and a semantic model

→ Topic models are good semantic models

→ TopicRNN is a deep generative model that uses topics as context for RNNs

Regularization

→ Noise can be used to avoid co-adaptation

→ It should be injected in an unbiased way into the hidden units of the RNN

→ This is a form of model averaging, akin to empirical Bayes

→ NOISIN is simple yet significantly improves RNN-based models

Page 32:

More Challenges to Tackle

→ Scalability

→ Incorporating prior knowledge

→ Improving generation