
Deep Sequence Models: Context Representation, Regularization, and Application to Language

Adji Bousso Dieng

All Data Are Born Sequential

“Time underlies many interesting human behaviors.”– Elman, 1990.

Why model these data?

→ to help in decision making

→ to generate more of it

→ to predict and forecast

→ ... for science

How do we model these data?

→ need to capture all the dependencies

→ need to account for dimensionality

→ need to account for seasonality

... It’s complicated.

Recurrent Neural Networks: Successes

→ Image generation (Gregor+, 2015)

→ Text generation (Graves, 2013)

→ Machine translation (Sutskever+, 2014)

Recurrent Neural Networks: Challenges

$s_t = f_W(x_t, s_{t-1})$

$s_t = g(s_0, x_t, x_{t-1}, \ldots, x_0)$ and $g = f(f(f(\cdots)))$

$o_t = \mathrm{softmax}(V s_t)$

→ Vanishing and exploding gradients.

→ V can be very high-dimensional

→ Hidden state has limited capacity.

→ The RNN is trying to do too many things at once.
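
The recurrence above can be made concrete with a minimal NumPy sketch of a single vanilla-RNN step; the tanh nonlinearity, toy sizes, and random weights are illustrative assumptions rather than the exact model from the slides.

```python
# A single step of the vanilla RNN above: s_t = f_W(x_t, s_{t-1}), o_t = softmax(V s_t).
import numpy as np

def softmax(a):
    a = a - a.max()                 # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def rnn_step(x_t, s_prev, W_x, W_s, V):
    """Return the new hidden state and the output distribution over the vocabulary."""
    s_t = np.tanh(W_x @ x_t + W_s @ s_prev)   # s_t = f_W(x_t, s_{t-1})
    o_t = softmax(V @ s_t)                    # o_t = softmax(V s_t)
    return s_t, o_t

rng = np.random.default_rng(0)
d_x, d_s, vocab = 16, 32, 100
W_x = rng.normal(scale=0.1, size=(d_s, d_x))
W_s = rng.normal(scale=0.1, size=(d_s, d_s))
V = rng.normal(scale=0.1, size=(vocab, d_s))

s = np.zeros(d_s)
for t in range(10):                 # unrolling composes f repeatedly: g = f(f(f(...)))
    s, o = rnn_step(rng.normal(size=d_x), s, W_x, W_s, V)
```

Unrolling the loop shows why gradients vanish or explode: the state is a deep composition of the same function, so gradients pass through the recurrent weights many times.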

Context Representation

What Is Context?

The U.S. presidential race is not only drawing attention and controversy in the United States – it is being closely watched across the globe. But what does the rest of the world think about a campaign that has already thrown up one surprise after another? CNN asked 10 journalists for their take on the race so far, and what their country might be hoping for in America’s next –

→ local context:

few words preceding the word to predict

order matters.

defines syntax

What Is Context?

The U.S. presidential race is not only drawing attention

and controversy in the United States – it is being closely watched across the globe. But what does the rest of the world think about a campaign that has already thrown up one surprise after another? CNN asked 10 journalists for their take on the race so far, and what their country might be hoping for in America’s next –

→ global context:

words in the same document as the word to predict

order does not matter.

defines semantics
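
As a toy illustration of the two notions of context, the sketch below contrasts an ordered window of preceding words with an unordered bag-of-words; the four-word window and whitespace tokenization are arbitrary assumptions.

```python
# Toy contrast between local context (ordered window) and global context (bag-of-words).
from collections import Counter

doc = ("the u.s. presidential race is not only drawing attention and "
       "controversy in the united states").split()

i = len(doc)                                 # predict the word that comes next
local_context = doc[max(0, i - 4):i]         # preceding words, order preserved (syntax)
global_context = Counter(doc)                # whole-document counts, order discarded (semantics)

print(local_context)
print(global_context.most_common(3))
```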

Topics As Context (1/3)

source: David Blei

Topics As Context (2/3)

source: David Blei

Topics As Context (3/3)

source: David Blei

$\theta_d \sim \mathrm{Dir}(\alpha)$ ; $\beta_k \sim \mathrm{Dir}(\eta)$ ; $z_{dn} \sim \mathrm{Multinomial}(\theta_d)$

$w_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})$
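
The generative process above can be simulated directly; this is a minimal NumPy sketch with assumed toy values for the number of topics $K$, vocabulary size $V$, document length, and Dirichlet hyperparameters.

```python
# Simulating the LDA generative process for one document.
import numpy as np

rng = np.random.default_rng(0)
K, V, N_d, alpha, eta = 5, 50, 20, 0.1, 0.01

beta = rng.dirichlet(np.full(V, eta), size=K)   # beta_k ~ Dir(eta), one distribution per topic
theta_d = rng.dirichlet(np.full(K, alpha))      # theta_d ~ Dir(alpha), topic proportions

words = []
for n in range(N_d):
    z_dn = rng.choice(K, p=theta_d)             # z_dn ~ Multinomial(theta_d)
    w_dn = rng.choice(V, p=beta[z_dn])          # w_dn ~ Multinomial(beta_{z_dn})
    words.append(w_dn)
```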

Composing Topics And RNNs (1/3)

source: Wang+, 2017

→ RNN focuses on capturing local correlations (syntax model)

→ Topic model captures global dependencies (semantic model)

→ Combine both to make predictions

Composing Topics And RNNs (2/3)

[Diagram: unrolled TopicRNN with input weights $U$, recurrent weights $W$, output weights $V$, and topic matrix $B$]

source: Dieng+, 2017

$h_t = f_W(x_t, h_{t-1})$ ; $l_t \sim \mathrm{Bernoulli}(\sigma(\Gamma^\top h_t))$

$y_t \sim \mathrm{softmax}(V^\top h_t + (1 - l_t)\, B^\top \theta)$
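
A sketch of the output step defined above: the stop-word indicator $l_t$ gates the topic contribution $B^\top \theta$ into the word logits. The random stand-in for $h_t$, the Dirichlet draw for $\theta$, and all shapes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, K, vocab = 32, 5, 100

h_t = np.tanh(rng.normal(size=d_h))            # stands in for h_t = f_W(x_t, h_{t-1})
theta = rng.dirichlet(np.full(K, 0.1))         # document topic proportions
Gamma = rng.normal(scale=0.1, size=d_h)        # stop-word classifier weights
V_out = rng.normal(scale=0.1, size=(d_h, vocab))
B = rng.normal(scale=0.1, size=(K, vocab))

p_stop = 1.0 / (1.0 + np.exp(-Gamma @ h_t))    # sigma(Gamma^T h_t)
l_t = rng.binomial(1, p_stop)                  # l_t ~ Bernoulli(sigma(Gamma^T h_t))

logits = V_out.T @ h_t + (1 - l_t) * (B.T @ theta)   # topic term only for non-stop words
p_y = np.exp(logits - logits.max())
p_y /= p_y.sum()
y_t = rng.choice(vocab, p=p_y)                 # y_t ~ softmax(V^T h_t + (1 - l_t) B^T theta)
```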

Composing Topics And RNNs (3/3)

[Diagram: the inference network takes $X_c$ (bag-of-words, stop words excluded); the RNN with weights $U$, $V$, $W$, $B$ models $X$ (full document, stop words included) and the target document $Y$]

source: Dieng+, 2017

→ Choose $q(\theta \mid X_c)$ to be an MLP

→ Choose $p(\theta)$ to be standard Gaussian: $\theta = g(\mathcal{N}(0, I_K))$

→ Maximize the ELBO:

$\mathrm{ELBO} = \mathbb{E}_{q(\theta \mid X_c)}\left[\sum_{t=1}^{T} \log p(y_t, l_t \mid \theta; h_t)\right] - \mathrm{KL}\left(q(\theta \mid X_c) \,\|\, p(\theta)\right)$
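
A minimal sketch of this variational setup: an assumed small MLP plays the role of $q(\theta \mid X_c)$, $\theta$ is reparameterized from $\mathcal{N}(0, I_K)$, and the KL to the standard Gaussian prior is analytic. The placeholder likelihood term stands in for the RNN sum $\sum_t \log p(y_t, l_t \mid \theta; h_t)$.

```python
import numpy as np

rng = np.random.default_rng(2)
V_bow, K, H = 50, 5, 16

W1 = rng.normal(scale=0.1, size=(H, V_bow)); b1 = np.zeros(H)
W_mu = rng.normal(scale=0.1, size=(K, H)); W_lv = rng.normal(scale=0.1, size=(K, H))

def encoder(x_c):
    """q(theta | Xc): bag-of-words counts -> Gaussian mean and log-variance."""
    h = np.tanh(W1 @ x_c + b1)
    return W_mu @ h, W_lv @ h

def log_lik(theta):
    """Placeholder for sum_t log p(y_t, l_t | theta; h_t)."""
    return -0.5 * np.sum(theta ** 2)

x_c = rng.integers(0, 3, size=V_bow).astype(float)        # toy bag-of-words (stop words excluded)
mu, log_var = encoder(x_c)
theta = mu + np.exp(0.5 * log_var) * rng.normal(size=K)   # theta = g(N(0, I_K))

kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)   # KL(q(theta | Xc) || N(0, I))
elbo = log_lik(theta) - kl
```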

Composing Topics And RNNs (3/3)

source: Wang+, 2017

→ has been extended to mixture of experts (Wang+, 2017)

→ has been applied to conversation modeling (Wen+, 2017)

Some Results On Language Modeling (1)

source: Dieng+, 2017

→ Perplexity on Penn Treebank dataset (the lower the better)

→ Three different network capacities

→ Adding topic features is always better

→ Doing so jointly is even better

Some Results On Language Modeling (2)

[Figure: three bar plots of the inferred topic distribution from TopicGRU (50 topics on the x-axis, probability on the y-axis), one per document]

source: Dieng+, 2017

→ Document distribution for 3 different documents with TopicGRU

→ Different topics get picked up for different documents

Some Results On Language Modeling (3)

source: Wang+, 2017

→ Topics for three different datasets

→ Shows top five words of ten random topics

Some Results On Document Classification

source: Dieng+, 2017

→ Sentiment classification on IMDB

→ Feature extraction: concatenate RNN feature and Topic feature

→ PCA + K-Means
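
A sketch of this feature-extraction pipeline with scikit-learn; the random arrays stand in for real RNN and topic features, and the two clusters mirror the binary sentiment labels (an assumption about the intended setup).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n_docs, d_rnn, K = 200, 128, 50

rnn_feat = rng.normal(size=(n_docs, d_rnn))                 # e.g. final hidden state per document
topic_feat = rng.dirichlet(np.full(K, 0.1), size=n_docs)    # inferred theta per document

features = np.concatenate([rnn_feat, topic_feat], axis=1)   # concatenated representation
reduced = PCA(n_components=2).fit_transform(features)       # project to 2-D
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(reduced)
```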

Some Results On Document Classification

source: Dieng+, 2017

Regularization

Co-adaptation

“When a neural network overfits badly during training, its hidden states depend very heavily on each other.”

– Hinton, 2012

Noise As Regularizer

→ Define a noise-injected RNN as:

$\varepsilon_{1:T} \sim \varphi(\cdot; \mu, \gamma)$ ; $z_t = g_W(x_t, z_{t-1}, \varepsilon_t)$ and $p(y_t \mid y_{1:t-1}) = p(y_t \mid z_t)$

→ The likelihood $p(y_t \mid z_t)$ is in the exponential family

→ Different noise ε at each layer
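
A minimal sketch of such a noise-injected step with additive zero-mean Gaussian noise as one choice of $\varphi$; the tanh transition and toy sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d_x, d_z, noise_scale = 16, 32, 0.1

W_x = rng.normal(scale=0.1, size=(d_z, d_x))
W_z = rng.normal(scale=0.1, size=(d_z, d_z))

def noisy_step(x_t, z_prev):
    eps_t = rng.normal(scale=noise_scale, size=d_z)      # eps_t ~ phi(.; mu, gamma)
    return np.tanh(W_x @ x_t + W_z @ z_prev) + eps_t     # z_t = g_W(x_t, z_{t-1}, eps_t)

z = np.zeros(d_z)
for t in range(10):
    z = noisy_step(rng.normal(size=d_x), z)
```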

[Plot: training and validation perplexity versus epochs for a plain RNN and for RNN + NOISIN]

Dropout

→ For the LSTM this is:

$f_t = \sigma(W_{x1}^\top x_{t-1} \odot \varepsilon_x^f + W_{h1}^\top h_{t-1} \odot \varepsilon_h^f)$

$i_t = \sigma(W_{x2}^\top x_{t-1} \odot \varepsilon_x^i + W_{h2}^\top h_{t-1} \odot \varepsilon_h^i)$

$o_t = \sigma(W_{x4}^\top x_{t-1} \odot \varepsilon_x^o + W_{h4}^\top h_{t-1} \odot \varepsilon_h^o)$

$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{x3}^\top x_{t-1} \odot \varepsilon_x^c + W_{h3}^\top h_{t-1} \odot \varepsilon_h^c)$

$z_t^{\text{dropout}} = o_t \odot \tanh(c_t).$
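
A hand-rolled NumPy sketch of these equations: Bernoulli masks are applied elementwise to the input and the previous hidden state in every gate. The per-gate masks, the inverted-dropout rescaling by the keep probability, and the toy shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
d_x, d_h, keep = 8, 16, 0.8

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

W_x = {g: rng.normal(scale=0.1, size=(d_h, d_x)) for g in "fioc"}
W_h = {g: rng.normal(scale=0.1, size=(d_h, d_h)) for g in "fioc"}

def dropout_lstm_step(x, h, c):
    pre = {}
    for g in "fioc":
        eps_x = rng.binomial(1, keep, size=d_x) / keep   # mask on the input
        eps_h = rng.binomial(1, keep, size=d_h) / keep   # mask on the previous hidden state
        pre[g] = W_x[g] @ (x * eps_x) + W_h[g] @ (h * eps_h)
    f, i, o = sigm(pre["f"]), sigm(pre["i"]), sigm(pre["o"])
    c_new = f * c + i * np.tanh(pre["c"])
    h_new = o * np.tanh(c_new)                           # z_t^dropout
    return h_new, c_new

h, c = np.zeros(d_h), np.zeros(d_h)
h, c = dropout_lstm_step(rng.normal(size=d_x), h, c)
```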

NOISIN: Unbiased Noise Injection

→ Strong unbiasedness condition

$\mathbb{E}_{p(z_t(\varepsilon_{1:t}) \mid z_{t-1})}\left[z_t(\varepsilon_{1:t})\right] = s_t$

→ Weak unbiasedness condition

$\mathbb{E}_{p(z_t(\varepsilon_{1:t}) \mid z_{t-1})}\left[z_t(\varepsilon_{1:t})\right] = f_W(x_{t-1}, z_{t-1})$

→ Under unbiasedness the underlying RNN is preserved

→ Examples: additive and multiplicative noise

$g_W(x_{t-1}, z_{t-1}, \varepsilon_t) = f_W(x_{t-1}, z_{t-1}) + \varepsilon_t$

$g_W(x_{t-1}, z_{t-1}, \varepsilon_t) = f_W(x_{t-1}, z_{t-1}) \odot (1 + \varepsilon_t)$

→ Dropout does not meet this requirement; it is biased
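
A quick Monte Carlo check of the unbiasedness condition above: averaging over additive zero-mean or multiplicative mean-one noise recovers the noiseless update, while masking inside a saturating nonlinearity (dropout-style) shifts it. The toy vector, noise scales, and keep probability are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 5, 100_000

f_val = np.tanh(rng.normal(size=d))              # stands in for f_W(x_{t-1}, z_{t-1})
eps = rng.normal(scale=0.3, size=(n, d))

additive = f_val + eps                           # g_W = f_W + eps_t
multiplicative = f_val * (1.0 + eps)             # g_W = f_W * (1 + eps_t)
print(np.abs(additive.mean(0) - f_val).max())        # ~0: unbiased
print(np.abs(multiplicative.mean(0) - f_val).max())  # ~0: unbiased

pre = rng.normal(size=d)                         # a pre-activation vector
mask = rng.binomial(1, 0.8, size=(n, d)) / 0.8
dropped = np.tanh(pre * mask)                    # dropout masks inside the nonlinearity
print(np.abs(dropped.mean(0) - np.tanh(pre)).max())  # clearly > 0: biased
```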

NOISIN: The Objective

→ NOISIN maximizes the following objective

$\mathcal{L} = \mathbb{E}_{p(\varepsilon_{1:T})}\left[\log p(x_{1:T} \mid z_{1:T}(\varepsilon_{1:T}))\right]$

→ In more detail this is

$\mathcal{L} = \sum_{t=1}^{T} \mathbb{E}_{p(\varepsilon_{1:t})}\left[\log p(x_t \mid z_t(\varepsilon_{1:t}))\right]$

→ Notice this objective is a Jensen bound on the marginal log-likelihood of the data,

$\mathcal{L} \leq \log \mathbb{E}_{p(\varepsilon_{1:T})}\left[p(x_{1:T} \mid z_{1:T}(\varepsilon_{1:T}))\right] = \log p(x_{1:T})$
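
The Jensen bound can be verified numerically on a toy example; the Gaussian-in-the-noise log-likelihood below is an arbitrary stand-in for $\log p(x_t \mid z_t(\varepsilon_{1:t}))$.

```python
import numpy as np

rng = np.random.default_rng(7)
eps = rng.normal(size=100_000)

def log_p(z):
    """Toy per-step log-likelihood log p(x_t | z_t(eps))."""
    return -0.5 * (1.0 + 0.5 * z) ** 2

objective = np.mean(log_p(eps))                        # L = E_p(eps)[log p(...)]
log_marginal = np.log(np.mean(np.exp(log_p(eps))))     # log E_p(eps)[p(...)] = log p(x)
print(objective, log_marginal, objective <= log_marginal)   # the objective is the lower bound
```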

NOISIN: Connections

$\mathcal{L} = \sum_{t=1}^{T} \mathbb{E}_{p(\varepsilon_{1:t})}\left[\log p(x_t \mid z_t(\varepsilon_{1:t}))\right]$

→ Ensemble method: average the predictions of infinitely many RNNs at each time step

→ Empirical Bayes: estimate the parameters of the prior on the hidden states

Some Results On Language Modeling (1/2)

→ Perplexity on the Penn Treebank (lower the better)

→ D + Distribution is Dropout-LSTM with NOISIN

→ Studied many noise distributions: only variance matters

→ Noise is scaled to enjoy unbounded variance

Some Results On Language Modeling (2/2)

→ Perplexity on the Wikitext-2 (lower the better)

→ D + Distribution is Dropout-LSTM with NOISIN

→ Studied many noise distributions: only variance matters

→ Noise is scaled to enjoy unbounded variance

Lessons Learned So Far

Context representation

→ Need to rethink long-term dependencies (for language)

→ Combine a syntax model and a semantic model

→ Topic models are good semantic models

→ TopicRNN is a deep generative model that uses topics as context for RNNs

Regularization

→ Noise can be used to avoid co-adaptation

→ It should be injected unbiasedly into the hidden units of the RNN

→ This is some form of model averaging and is like empirical Bayes

→ NOISIN is simple yet significantly improves RNN-based models

More Challenges to Tackle

→ Scalability

→ Incorporating prior knowledge

→ Improving generation
