
Lecture 12: Recurrent Neural Networks II

CMSC 35246: Deep Learning

Shubhendu Trivedi & Risi Kondor

University of Chicago

May 03, 2017


Recap: Plain Vanilla RNNs

[Figure: an RNN unrolled over time, with inputs $x_1, x_2, x_3, \dots$, hidden states $h_1, h_2, h_3, \dots$, outputs $y_1, y_2, y_3, \dots$, and the weight matrices $U$, $V$, $W$ shared across time steps.]


Recap: BPTT

[Figure: the same unrolled RNN, now annotated with the gradient contributions $\frac{\partial L}{\partial U}$, $\frac{\partial L}{\partial W}$, $\frac{\partial L}{\partial V}$ accumulated at every time step during backpropagation through time.]


Challenge of Long-Term Dependencies


Challenge of Long-Term Dependencies

Basic problem: Gradients propagated over many stages tend to vanish (most of the time) or explode (relatively rarely)

- Blow up ⇒ network parameters oscillate
- Vanishing ⇒ no learning

Problem first analyzed by Hochreiter and Schmidhuber, 1991 and Bengio et al., 1993

Reference: Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich, 1991


Why do gradients explode or vanish?

Recall the expression for $h_t$ in RNNs:

$h_t = \tanh(W h_{t-1} + V x_t)$

$L$ is our loss, so by the chain rule we have:

$$\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T}\,\frac{\partial h_T}{\partial h_t} = \frac{\partial L}{\partial h_T}\prod_{k=t}^{T-1}\frac{\partial h_{k+1}}{\partial h_k} = \frac{\partial L}{\partial h_T}\prod_{k=t}^{T-1} D_{k+1} W_k^{\top}$$


Why do gradients explode or vanish?

$$\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T}\prod_{k=t}^{T-1} D_{k+1} W_k^{\top}$$

Reminder: $D_{k+1} = \operatorname{diag}\big(1 - \tanh^2(W h_k + V x_{k+1})\big)$ is the Jacobian matrix of the pointwise nonlinearity

The quantity of interest is the norm of the gradient $\big\|\frac{\partial L}{\partial h_t}\big\|$, which is simply:

$$\left\|\frac{\partial L}{\partial h_t}\right\| = \left\|\frac{\partial L}{\partial h_T}\prod_{k=t}^{T-1} D_{k+1} W_k^{\top}\right\|$$

Note: $\|\cdot\|$ represents the $L_2$ norm for a vector and the spectral norm for a matrix


Why do gradients explode or vanish?

Given that for any matrices $A, B$ and vector $v$: $\|Av\| \leq \|A\|\,\|v\|$ and $\|AB\| \leq \|A\|\,\|B\|$, we have the trivial bound:

$$\left\|\frac{\partial L}{\partial h_t}\right\| = \left\|\frac{\partial L}{\partial h_T}\prod_{k=t}^{T-1} D_{k+1} W_k^{\top}\right\| \leq \left\|\frac{\partial L}{\partial h_T}\right\| \prod_{k=t}^{T-1} \left\|D_{k+1} W_k^{\top}\right\|$$

Given that $\|A\|$ is the spectral norm (largest singular value $\sigma_A$):

$$\left\|\frac{\partial L}{\partial h_t}\right\| \leq \left\|\frac{\partial L}{\partial h_T}\right\| \prod_{k=t}^{T-1} \left\|D_{k+1} W_k^{\top}\right\| \leq \left\|\frac{\partial L}{\partial h_T}\right\| \prod_{k=t}^{T-1} \sigma_{D_{k+1}}\,\sigma_{W_k}$$

The above tells us that the gradient norm can shrink to zero or blow up exponentially fast depending on the gain $\sigma$
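To make the exponential behaviour concrete, here is a minimal numerical sketch (my own illustration, not from the slides): it backpropagates a random vector through $T$ factors of the form $D_{k+1} W^{\top}$ and prints the resulting norm. The hidden size, sequence length, and scale of $W$ are arbitrary choices.

```python
import numpy as np

# Sketch: track the norm of a backpropagated vector through T factors D_{k+1} W^T.
rng = np.random.default_rng(0)
n, T = 50, 100

def final_gradient_norm(scale):
    W = rng.normal(0.0, scale / np.sqrt(n), size=(n, n))     # recurrent weights
    g = rng.normal(size=n)                                   # stand-in for dL/dh_T
    for _ in range(T):
        D = np.diag(1.0 - np.tanh(rng.normal(size=n)) ** 2)  # Jacobian of tanh
        g = D @ W.T @ g                                      # one backprop step
    return np.linalg.norm(g)

print("small-gain W:", final_gradient_norm(0.5))  # norm collapses toward zero
print("large-gain W:", final_gradient_norm(2.0))  # norm blows up
```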


Simplified Model

Consider the recurrence relation:

$h^{(t)} = W^{\top} h^{(t-1)}$

This could be thought of as a very simple recurrent neural network without a nonlinear activation and lacking inputs $x$

Essentially describes the power method:

$h^{(t)} = (W^t)^{\top} h^{(0)}$

If $W$ admits a decomposition $W = Q \Lambda Q^{\top}$ with orthogonal $Q$

The recurrence becomes:

$h^{(t)} = (W^t)^{\top} h^{(0)} = Q \Lambda^t Q^{\top} h^{(0)}$


Simplified Model

$h^{(t)} = (W^t)^{\top} h^{(0)} = Q \Lambda^t Q^{\top} h^{(0)}$

Eigenvalues are raised to the power $t$: they quickly decay to zero (if $|\lambda| < 1$) or explode (if $|\lambda| > 1$)

This problem is particular to RNNs, since the same $W$ is applied at every time step

Can be avoided in feedforward networks (at least in principle), where each layer has its own weights
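A small numerical sketch of the power-method behaviour (my own illustration, not from the slides): repeatedly applying $W^{\top}$ shrinks or amplifies $h^{(0)}$ depending on the spectral radius of $W$.

```python
import numpy as np

# Sketch: h^{(t)} = (W^t)^T h^{(0)} decays or explodes with the spectral radius of W.
rng = np.random.default_rng(1)
n, T = 20, 50

def final_state_norm(spectral_radius):
    W = rng.normal(size=(n, n))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))  # rescale eigenvalues
    h = rng.normal(size=n)
    for _ in range(T):
        h = W.T @ h
    return np.linalg.norm(h)

print("rho = 0.9:", final_state_norm(0.9))  # decays toward zero
print("rho = 1.1:", final_state_norm(1.1))  # blows up exponentially
```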


Some Solutions


Idea 1: Skip Connections

Add connections from the distant past to the present

Plain Vanilla RNNs: Recurrence goes from a unit at time $t$ to a unit at time $t+1$

Gradients vanish/explode exponentially w.r.t. the number of time steps $\tau$

With recurrent connections that have a time delay of $d$, gradients explode/vanish exponentially as a function of $\tau/d$ rather than $\tau$


Idea 2: Leaky Units

Keep a running average for a hidden unit by adding a linear self-connection:

$h_t \leftarrow \alpha h_{t-1} + (1 - \alpha) h_t$

Such hidden units are called leaky units

Ensures hidden units can easily access values from the past
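A minimal sketch of a leaky update wrapped around an ordinary tanh RNN step (my own illustration; the weight names and the value of $\alpha$ are placeholders): with $\alpha$ close to 1, information from the past lingers in the hidden state.

```python
import numpy as np

def leaky_rnn_step(h_prev, x, W, V, alpha=0.9):
    """One leaky-unit update: blend the previous state with the new candidate."""
    h_candidate = np.tanh(W @ h_prev + V @ x)            # ordinary RNN update
    return alpha * h_prev + (1.0 - alpha) * h_candidate  # running average
```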


Idea 3: Echo State Networks

Idea: Set the recurrent weights such that they do a good job of capturing past history, and learn only the output weights

Methods: Echo State Networks, Liquid State Machines

The general methodology is called Reservoir Computing

How to choose the recurrent weights?


Echo State Networks: Motivation

Choose recurrent weights such that the hidden-to-hidden transition Jacobian has eigenvalues close to 1

In particular, we pay attention to the spectral radius of $J_t$

Consider a gradient $g$: after one step of backpropagation it would be $Jg$, and after $n$ steps it would be $J^n g$

Now consider a perturbed version of $g$, i.e. $g + \delta v$; after $n$ steps we will have $J^n(g + \delta v)$

In fact, if $v$ is a unit eigenvector of $J$ with eigenvalue $\lambda$, the separation between the two is exactly $\delta |\lambda|^n$

When $|\lambda| > 1$, $\delta |\lambda|^n$ grows exponentially large; when $|\lambda| < 1$, it shrinks exponentially


Echo State Networks

For a vector $h$, when a linear map $W$ always shrinks $h$, the mapping is said to be contractive

The strategy of echo state networks is to make use of this intuition

The Jacobian is chosen such that the spectral radius corresponds to stable dynamics

Then we only learn the output weights!

Can be used to initialize a fully trainable RNN
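Below is a minimal sketch of the echo-state recipe as described above (my own illustration; the sizes, the spectral radius of 0.9, and the random stand-in data are arbitrary assumptions): fix a random reservoir, rescale its recurrent matrix to a chosen spectral radius, run it over the inputs, and fit only the output weights, e.g. by least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_res, n_out, T = 3, 100, 2, 500       # illustrative sizes

# Fixed, random reservoir weights, rescaled to a chosen spectral radius (< 1).
W_in = rng.normal(scale=0.1, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

X = rng.normal(size=(T, n_in))               # stand-in input sequence
Y = rng.normal(size=(T, n_out))              # stand-in targets

# Run the reservoir; its weights are never trained.
H = np.zeros((T, n_res))
h = np.zeros(n_res)
for t in range(T):
    h = np.tanh(W @ h + W_in @ X[t])
    H[t] = h

# Only the output weights are learned (here by ordinary least squares).
W_out, *_ = np.linalg.lstsq(H, Y, rcond=None)
predictions = H @ W_out
```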


Echo State Networks

Figure: Scholarpedia

Solid arrows represent fixed, random connections. Dashed arrows represent learnable weights


A Popular Solution: Gated Architectures


Back to Plain Vanilla RNN

Figure: Chris Olah


Long Short Term Memory

Figure: Chris Olah

Proposed by Hochreiter and Schmidhuber (1997)

Now let’s try to understand each memory cell!


Long Short Term Memory

[Diagram: a cell taking $x_t$ and $h_{t-1}$ through a tanh unit]

$h_t = \tanh(W h_{t-1} + U x_t)$


Long Short Term Memory

[Diagram: the cell now carries a cell state running from $c_{t-1}$ to $c_t$]

$\tilde{c}_t = \tanh(W h_{t-1} + U x_t)$

$c_t = c_{t-1} + \tilde{c}_t$


Long Short Term Memory

[Diagram: a sigmoid forget gate is added on the cell-state path]

$\tilde{c}_t = \tanh(W h_{t-1} + U x_t)$

$c_t = f_t \odot c_{t-1} + \tilde{c}_t$

$f_t = \sigma(W_f h_{t-1} + U_f x_t)$


Long Short Term Memory

[Diagram: a sigmoid input gate is added alongside the forget gate]

$\tilde{c}_t = \tanh(W h_{t-1} + U x_t)$

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$

$f_t = \sigma(W_f h_{t-1} + U_f x_t)$

$i_t = \sigma(W_i h_{t-1} + U_i x_t)$


Long Short Term Memory

[Diagram: the full cell, with forget, input, and output gates and a tanh applied to the cell state]

$\tilde{c}_t = \tanh(W h_{t-1} + U x_t)$

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$

$h_t = o_t \odot \tanh(c_t)$

$f_t = \sigma(W_f h_{t-1} + U_f x_t)$

$i_t = \sigma(W_i h_{t-1} + U_i x_t)$

$o_t = \sigma(W_o h_{t-1} + U_o x_t)$
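A minimal sketch of one LSTM step following the equations above (my own illustration; bias terms are omitted, as on the slides, and the parameters are passed in a plain dictionary).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step using the slide equations (no bias terms)."""
    W, U = params["W"], params["U"]          # candidate cell-state weights
    Wf, Uf = params["Wf"], params["Uf"]      # forget gate weights
    Wi, Ui = params["Wi"], params["Ui"]      # input gate weights
    Wo, Uo = params["Wo"], params["Uo"]      # output gate weights

    c_tilde = np.tanh(W @ h_prev + U @ x_t)  # candidate cell state
    f_t = sigmoid(Wf @ h_prev + Uf @ x_t)    # forget gate
    i_t = sigmoid(Wi @ h_prev + Ui @ x_t)    # input gate
    o_t = sigmoid(Wo @ h_prev + Uo @ x_t)    # output gate

    c_t = f_t * c_prev + i_t * c_tilde       # updated cell state
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t
```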


LSTM: Further Intuition

The Cell State

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ with $\tilde{c}_t = \tanh(W h_{t-1} + U x_t)$

Useful to think of the cell state as a conveyor belt (Olah), which runs across time, interrupted only by linear interactions

The memory cell can add or delete information from the cell state using gates

Gates are constructed using a sigmoid and a pointwise multiplication


LSTM: Further Intuition

The Forget Gate

$f_t = \sigma(W_f h_{t-1} + U_f x_t)$

Helps to decide what information to throw away from the cell state

Once we have thrown away what we want from the cell state, we want to update it


LSTM: Further Intuition

First we decide how much of the input we want to store in the updated cell state via the Input Gate:

$i_t = \sigma(W_i h_{t-1} + U_i x_t)$

We then update the cell state:

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$

We then need to output, and use the output gate $o_t = \sigma(W_o h_{t-1} + U_o x_t)$ to pass on a filtered version:

$h_t = o_t \odot \tanh(c_t)$


Gated Recurrent Unit

Let $\tilde{h}_t = \tanh(W h_{t-1} + U x_t)$ and $h_t = \tilde{h}_t$ (the plain vanilla RNN)

Reset gate: $r_t = \sigma(W_r h_{t-1} + U_r x_t)$

New $\tilde{h}_t = \tanh(W (r_t \odot h_{t-1}) + U x_t)$

Find: $z_t = \sigma(W_z h_{t-1} + U_z x_t)$

Update $h_t = z_t \odot \tilde{h}_t$

Finally: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

Comes from attempting to factor the LSTM and reduce the number of gates

Example: One gate controls forgetting as well as decides whether the state needs to be updated
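A minimal sketch of one GRU step following the equations above (again my own illustration, without bias terms).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU step using the slide equations (no bias terms)."""
    W, U = params["W"], params["U"]          # candidate state weights
    Wr, Ur = params["Wr"], params["Ur"]      # reset gate weights
    Wz, Uz = params["Wz"], params["Uz"]      # update gate weights

    r_t = sigmoid(Wr @ h_prev + Ur @ x_t)            # reset gate
    z_t = sigmoid(Wz @ h_prev + Uz @ x_t)            # update gate
    h_tilde = np.tanh(W @ (r_t * h_prev) + U @ x_t)  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde      # blended new state
```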


Gated Recurrent Unit

[Diagram: a cell taking $x_t$ and $h_{t-1}$ through a tanh unit to produce $h_t$]

$\tilde{h}_t = \tanh(W h_{t-1} + U x_t)$

$h_t = \tilde{h}_t$


Gated Recurrent Unit

[Diagram: a sigmoid reset gate is added on the path from $h_{t-1}$]

$\tilde{h}_t = \tanh(W (r_t \odot h_{t-1}) + U x_t)$

$h_t = \tilde{h}_t$

$r_t = \sigma(W_r h_{t-1} + U_r x_t)$


Gated Recurrent Unit

[Diagram: a sigmoid update gate is added]

$\tilde{h}_t = \tanh(W (r_t \odot h_{t-1}) + U x_t)$

$h_t = z_t \odot \tilde{h}_t$

$r_t = \sigma(W_r h_{t-1} + U_r x_t)$

$z_t = \sigma(W_z h_{t-1} + U_z x_t)$


Gated Recurrent Unit

[Diagram: the full GRU cell, with the update gate blending $h_{t-1}$ and $\tilde{h}_t$ via $1 - z_t$]

$\tilde{h}_t = \tanh(W (r_t \odot h_{t-1}) + U x_t)$

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

$r_t = \sigma(W_r h_{t-1} + U_r x_t)$

$z_t = \sigma(W_z h_{t-1} + U_z x_t)$


Attention Models


Attention Models

To illustrate the fundamental idea of attention, we will look at two classic papers on the topic

Machine Translation: Neural Machine Translation by Jointly Learning to Align and Translate by Bahdanau et al.

Image Caption Generation: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015, by Xu et al.

Let us consider Machine Translation first


Attention Models: Motivation

Recall our encoder-decoder model for machine translation

Figure: Goodfellow et al.


Attention Models: Motivation

Let’s look at the steps for translation again:

The input sentence $x_1, \dots, x_n$ is encoded, via hidden unit activations $h_1, \dots, h_n$, into the thought vector $C$

Using $C$, the decoder then generates the output sentence $y_1, \dots, y_p$

We stop when we sample a terminating token, i.e. $\langle \text{END} \rangle$

A Problem? For long sentences, it might not be useful to only give the decoder access to the vector $C$


Attention Models: Motivation

When we ourselves are translating a sentence from one language to another, we don't consider the whole sentence at all times

Intuition: Every word in the output only depends on a word or a group of words in the input sentence

We can help the decoding process by allowing the decoder to refer to the input sentence

We would like the decoder, while it is about to generate the next word, to attend to the group of words in the input sentence most relevant to predicting the right next word

Maybe it would be more efficient to also be able to attend to these words while decoding


Machine Translation Using Attention

First Observation: For each word we had a hidden unit, which encoded a representation of that word

Let us first try to incorporate both forward and backward context for each word by using a bidirectional RNN and concatenating the resulting representations

We have already seen why using a bidirectional RNN is useful

Figure: Roger Grosse


Machine Translation Using Attention

The decoder generates the sentence one word at a time, conditioned on $C$

We can instead have a context vector for every time step. These vectors $C^{(t)}$ learn to attend to specific words of the input sentence

Figure: Roger Grosse

How do we learn the $C^{(t)}$'s so that they attend to relevant words?

First: Let the representations of the bidirectional RNN for each word be $h_i$

Define $C^{(t)}$ to be the weighted average of the encoder's representations:

$$C^{(t)} = \sum_i \alpha_{ti}\, h_i$$

$\alpha_t$ defines a probability distribution over the input words
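
As a sanity check on this formula, here is a short NumPy sketch of the weighted average; H stacks the encoder states $h_i$ as rows and alpha_t is the attention distribution for one decoder step (both names are assumptions for illustration).

    import numpy as np

    def context_vector(alpha_t, H):
        # C(t) = sum_i alpha_ti * h_i, with alpha_t a probability distribution
        # over the input words and H of shape (num_words, hidden_dim).
        assert np.isclose(alpha_t.sum(), 1.0)
        return alpha_t @ H                               # shape: (hidden_dim,)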

$\alpha_{ti}$ is a function of the representations of the words, as well as the previous state of the decoder:

$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_k \exp(e_{tk})}, \qquad \text{with } e_{ti} = a(s^{(t-1)}, h_i)$$

This is a form of content-based addressing

Example: The language model says the next word should be an adjective; give me an adjective from the input
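
The slides leave the score function a(·, ·) abstract; one common choice is the additive form of Bahdanau et al., sketched below. The parameter names W_s, W_h, v and the helper functions are illustrative assumptions.

    import numpy as np

    def softmax(e):
        e = e - e.max()                                  # for numerical stability
        z = np.exp(e)
        return z / z.sum()

    def attention_step(s_prev, H, W_s, W_h, v):
        # e_ti = a(s(t-1), h_i) with an additive score: v . tanh(W_s s + W_h h_i);
        # alpha_t = softmax over the input words; C(t) = alpha_t-weighted average.
        scores = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h_i) for h_i in H])
        alpha_t = softmax(scores)
        return alpha_t @ H, alpha_t                      # context C(t) and the weights

Because every operation here is differentiable, the attention weights can be trained jointly with the rest of the encoder-decoder by backpropagation.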

For each word in the translation, the matrix of attention weights gives the degree of focus on all of the input words

A linear order is not forced, but the model figures out that the translation is approximately linear
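
A plot like this can be reproduced by stacking the alpha_t vectors, one per output word, into a matrix and displaying it. A small matplotlib sketch, with hypothetical word lists passed in by the caller:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_attention(alphas, src_words, tgt_words):
        # alphas: (num_output_words, num_input_words); brighter cells = more focus.
        fig, ax = plt.subplots()
        ax.imshow(np.asarray(alphas), cmap="gray")
        ax.set_xticks(range(len(src_words)))
        ax.set_xticklabels(src_words, rotation=90)
        ax.set_yticks(range(len(tgt_words)))
        ax.set_yticklabels(tgt_words)
        ax.set_xlabel("input words")
        ax.set_ylabel("output words")
        plt.show()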

Attention in Computer Vision

We will look at only one illustrative example: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015

Attention can also be used to understand images

Humans don't process a visual scene all at once. The fovea gives high-resolution vision in only a tiny region of our field of view

A series of glimpses is then integrated

Caption Generation using Attention

Here we have an encoder and decoder as well:

Encoder: A trained network like ResNet that extracts features for an input image

Decoder: An attention-based RNN, which is like the decoder in the translation model of Bahdanau

While generating the caption, at every time step, the decoder must decide which region of the image to attend to

The decoder here too receives a context vector, which is the weighted average of the convolutional network features

The α's here define a distribution over the pixels, indicating which pixels we would like to focus on to predict the next word
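
A minimal sketch of this step, assuming the same additive score function as in the translation model, but now applied over the spatial locations of the convolutional feature map; feats, W_s, W_f and v are illustrative names rather than the paper's exact parameterization:

    import numpy as np

    def caption_context(s_prev, feats, W_s, W_f, v):
        # feats: (num_locations, feat_dim) -- one feature vector per spatial
        # location of the conv feature map. alpha is a distribution over
        # locations; C(t) is the alpha-weighted average of their features.
        scores = np.array([v @ np.tanh(W_s @ s_prev + W_f @ f) for f in feats])
        scores = scores - scores.max()
        alpha = np.exp(scores) / np.exp(scores).sum()
        return alpha @ feats, alpha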

Caption Generation without Attention

Figure: Andrej Karpathy

Caption Generation with Attention

Figure: Andrej Karpathy

Caption Generation using Attention

Not only does it generate good captions, but we also get to see where the decoder is looking in the image

We can also see the network's mistakes

Next Time: Neural Networks with Explicit Memory
