
Lecture 11: Recurrent Neural Networks I

CMSC 35246: Deep Learning

Shubhendu Trivedi & Risi Kondor

University of Chicago

May 01, 2017


Introduction

Sequence Learning with Neural Networks


Some Sequence Tasks

Figure credit: Andrej Karpathy


Problems with MLPs for Sequence Tasks

The "API" is too limited.

MLPs only accept an input of fixed dimensionality and map it to an output of fixed dimensionality

Great e.g.: Inputs - Images, Output - Categories

Bad e.g.: Inputs - Text in one language, Output - Text in another language

MLPs treat every example independently. How is this problematic?

Need to re-learn the rules of language from scratch each time

Another example: Classify events after a fixed number of frames in a movie

Need to reuse knowledge about the previous events to help in classifying the current one.


Recurrent Networks

Recurrent Neural Networks (Rumelhart, 1986) are a family of neural networks for handling sequential data

Sequential data: Each example consists of a pair of sequences, and different examples can have different lengths

Need to take advantage of an old idea in Machine Learning: Share parameters across different parts of a model

Makes it possible to extend the model and apply it to sequences of lengths not seen during training

Without parameter sharing it would not be possible to share statistical strength and generalize to sequence lengths not seen during training

Recurrent networks share parameters: Each output is a function of the previous outputs, with the same update rule applied


Recurrence

Consider the classical form of a dynamical system:

s(t) = f(s(t−1); θ)

This is recurrent because the definition of s at time t refers back to the same definition at time t − 1

For some finite number of time steps τ, the graph represented by this recurrence can be unfolded by applying the definition τ − 1 times. For example, when τ = 3:

s(3) = f(s(2); θ) = f(f(s(1); θ); θ)

This expression does not involve any recurrence and can be represented by a traditional directed acyclic computational graph
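To make the unfolding concrete, here is a minimal Python sketch (the particular transition function f, its parameters θ, and the initial state are made-up placeholders): applying the same f with the same θ twice is exactly the unrolled computation s(3) = f(f(s(1); θ); θ).

```python
import numpy as np

def f(s, theta):
    # Hypothetical transition function: a fixed linear map followed by a nonlinearity.
    return np.tanh(theta @ s)

theta = np.array([[0.9, -0.2],
                  [0.1, 0.8]])      # the same parameters are used at every step
s1 = np.array([1.0, 0.0])           # initial state s(1)

s2 = f(s1, theta)                   # s(2) = f(s(1); theta)
s3 = f(s2, theta)                   # s(3) = f(s(2); theta) = f(f(s(1); theta); theta)
print(s3)
```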


Recurrent Networks

Consider another dynamical system, one that is driven by an external signal x(t):

s(t) = f(s(t−1), x(t); θ)

The state now contains information about the whole past sequence

RNNs can be built in various ways: Just as any function can be considered a feedforward network, any function involving a recurrence can be considered a recurrent neural network


We can consider the states to be the hidden units of the network, so we replace s(t) by h(t):

h(t) = f(h(t−1), x(t); θ)

This system can be drawn in two ways: as a circuit diagram with a recurrent connection, or as a graph unfolded over time steps

We can have additional architectural features, such as output layers that read information from h to make predictions


When the task is to predict the future from the past, the network learns to use h(t) as a summary of the task-relevant aspects of the past sequence up to time t

This summary is lossy because it maps an arbitrary-length sequence (x(t), x(t−1), . . . , x(2), x(1)) to a fixed-size vector h(t)

Depending on the training criterion, the summary might selectively keep some aspects of the past sequence with more precision (e.g. statistical language modeling)

Most demanding situation for h(t): Approximately recover the input sequence


Design Patterns of Recurrent Networks

Plain Vanilla RNN: Produce an output at each time step and have recurrent connections between hidden units

This family is in fact Turing complete (Siegelmann, 1991, 1995)


Plain Vanilla Recurrent Network

[Diagram: a recurrent cell with input xt, hidden state ht, and output yt]


Recurrent Connections

[Diagram: input xt connects to hidden state ht through weights U, ht connects to itself across time through weights W, and ht connects to output yt through weights V]

ht = ψ(U xt + W ht−1)

yt = φ(V ht)

ψ can be tanh and φ can be softmax
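As a concrete illustration of these two update equations, here is a minimal NumPy sketch of a single recurrent step with ψ = tanh and φ = softmax; the sizes and the random initialization are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

hidden, inputs, outputs = 4, 3, 5          # illustrative sizes
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (hidden, inputs))   # input-to-hidden weights
W = rng.normal(0, 0.1, (hidden, hidden))   # hidden-to-hidden weights
V = rng.normal(0, 0.1, (outputs, hidden))  # hidden-to-output weights

def step(x_t, h_prev):
    # h_t = psi(U x_t + W h_{t-1}),   y_t = phi(V h_t)
    h_t = np.tanh(U @ x_t + W @ h_prev)
    y_t = softmax(V @ h_t)
    return h_t, y_t

h0 = np.zeros(hidden)
x1 = rng.normal(size=inputs)
h1, y1 = step(x1, h0)
print(h1, y1)
```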


Unrolling the Recurrence

[Diagram: the recurrence unrolled over time steps x1, x2, x3, . . . , xτ. Each input xt feeds hidden state ht through weights U, each ht feeds ht+1 through weights W, and each ht produces output yt through weights V; the same U, W, V are reused at every time step]

Feedforward Propagation

This is an RNN where the input and output sequences are of the same length

Feedforward operation proceeds from left to right

Update Equations:

at = b + W ht−1 + U xt

ht = tanh(at)

ot = c + V ht

yt = softmax(ot)
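A minimal sketch of the full left-to-right forward pass implied by these update equations (all names, shapes, and initializations are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(xs, U, W, V, b, c, h0):
    """Run the RNN over a whole input sequence xs = [x1, ..., x_tau]."""
    h, hs, ys = h0, [], []
    for x_t in xs:
        a_t = b + W @ h + U @ x_t      # a_t = b + W h_{t-1} + U x_t
        h = np.tanh(a_t)               # h_t = tanh(a_t)
        o_t = c + V @ h                # o_t = c + V h_t
        ys.append(softmax(o_t))        # y_t = softmax(o_t)
        hs.append(h)
    return hs, ys

# Illustrative usage with random data
rng = np.random.default_rng(0)
hidden, inputs, outputs, tau = 4, 3, 5, 6
U = rng.normal(0, 0.1, (hidden, inputs))
W = rng.normal(0, 0.1, (hidden, hidden))
V = rng.normal(0, 0.1, (outputs, hidden))
b, c = np.zeros(hidden), np.zeros(outputs)
xs = [rng.normal(size=inputs) for _ in range(tau)]
hs, ys = forward(xs, U, W, V, b, c, np.zeros(hidden))
```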


Feedforward Propagation

Loss would just be the sum of losses over time steps

If Lt is the negative log-likelihood of yt given x1, . . . , xt, then:

L({x1, . . . , xt}, {y1, . . . , yt}) = ∑t Lt

with:

∑t Lt = −∑t log pmodel(yt | {x1, . . . , xt})

Observation: Forward propagation takes time O(t); it can't be parallelized
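A small sketch of this summed negative log-likelihood, assuming the targets are integer class indices and ys holds the per-step softmax outputs from a forward pass like the one sketched above (hypothetical names):

```python
import numpy as np

def sequence_nll(ys, targets):
    """Sum of per-step negative log-likelihoods: L = sum_t -log p(y_t | x_1..x_t)."""
    return sum(-np.log(y_t[target_t]) for y_t, target_t in zip(ys, targets))

# Illustrative usage: targets are made-up integer labels, one per time step.
# loss = sequence_nll(ys, [0, 2, 1, 4, 3, 0])
```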


Backward Propagation

Need to find: ∇V L, ∇W L, ∇U L

And the gradients w.r.t. the biases: ∇c L and ∇b L

Treat the recurrent network as a usual multilayer network and apply backpropagation on the unrolled network

We move from right to left: this is called Backpropagation Through Time (BPTT)

Also takes time O(t)


BPTT

[Diagram: the unrolled network for time steps 1 to 5, with inputs x1, . . . , x5, hidden states h1, . . . , h5, outputs y1, . . . , y5 and shared weights U, W, V; the gradients ∂L/∂V, ∂L/∂U, ∂L/∂W are accumulated while sweeping from the last time step back to the first]

Gradient Computation

∇V L = ∑t (∇ot L) ht^T

Where:

(∇ot L)i = ∂L/∂ot(i) = (∂L/∂Lt)(∂Lt/∂ot(i)) = yt(i) − 1i,yt
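In code, the per-step term ∇ot L is the softmax output with 1 subtracted at the target index, and ∇V L accumulates its outer product with ht; a minimal sketch under the same illustrative assumptions as the earlier snippets:

```python
import numpy as np

def grad_V(ys, hs, targets, n_outputs):
    """dL/dV = sum_t (y_hat_t - onehot(target_t)) h_t^T."""
    dV = np.zeros((n_outputs, len(hs[0])))
    for y_t, h_t, target_t in zip(ys, hs, targets):
        d_o = y_t.copy()
        d_o[target_t] -= 1.0          # (grad_{o_t} L) = y_hat_t - 1_{i, y_t}
        dV += np.outer(d_o, h_t)
    return dV
```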



Gradient Computation

∇W L = ∑t diag(1 − ht^2) (∇ht L) ht−1^T

Where, for t = τ (one descendant):

(∇hτ L) = V^T (∇oτ L)

For t < τ (two descendants):

(∇ht L) = (∂ht+1/∂ht)^T (∇ht+1 L) + (∂ot/∂ht)^T (∇ot L) = W^T diag(1 − ht+1^2) (∇ht+1 L) + V^T (∇ot L)


Gradient Computation

∇U L = ∑t diag(1 − ht^2) (∇ht L) xt^T
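Putting the three gradient formulas above together, here is a minimal backpropagation-through-time sketch for the tanh/softmax network described earlier; it sweeps from t = τ back to t = 1, maintaining ∇ht L and accumulating ∇V L, ∇W L and ∇U L (names, shapes and the integer targets are the same illustrative assumptions as before):

```python
import numpy as np

def bptt(xs, hs, ys, targets, U, W, V, h0):
    """Backprop through time for h_t = tanh(U x_t + W h_{t-1} + b), y_t = softmax(V h_t + c)."""
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(W.shape[0])            # grad_{h_{t+1}} L, zero beyond t = tau
    for t in reversed(range(len(xs))):
        d_o = ys[t].copy()
        d_o[targets[t]] -= 1.0                # grad_{o_t} L = y_hat_t - onehot(y_t)
        dV += np.outer(d_o, hs[t])            # grad_V L accumulation
        # grad_{h_t} L: contribution from o_t plus (if t < tau) from h_{t+1}
        dh = V.T @ d_o
        if t + 1 < len(xs):
            dh += W.T @ ((1.0 - hs[t + 1] ** 2) * dh_next)
        da = (1.0 - hs[t] ** 2) * dh          # diag(1 - h_t^2)(grad_{h_t} L)
        h_prev = hs[t - 1] if t > 0 else h0
        dW += np.outer(da, h_prev)            # grad_W L accumulation
        dU += np.outer(da, xs[t])             # grad_U L accumulation
        dh_next = dh
    return dU, dW, dV
```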


Recurrent Neural Networks

But the weights are shared across different time steps. How is this constraint enforced?

Train the network as if there were no constraints, obtain the weights at the different time steps, and average them


Design Patterns of Recurrent Networks

Summarization: Produce a single output at the end and have recurrent connections between hidden units

Useful for summarizing a sequence (e.g. sentiment analysis)


Design Patterns: Fixed vector as input

We have considered RNNs in the context of a sequence of vectors x(t) with t = 1, . . . , τ as input

Sometimes we are interested in taking only a single, fixed-size vector x as input, which generates the y sequence

Some common ways to provide an extra input to an RNN are:

– As an extra input at each time step
– As the initial state h(0)
– Both


Design Patterns: Fixed vector as input

The first option (extra input at each time step) is the most common:

Maps a fixed vector x into a distribution over sequences Y (xTR effectively acts as a new bias parameter for each hidden unit)
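A minimal sketch of this first option, assuming the fixed vector x (say, an image feature) is injected through an extra weight matrix R at every step alongside the usual recurrence; R, the shapes, and the plain tanh cell are illustrative assumptions:

```python
import numpy as np

def generate_states(x, R, W, b, h0, n_steps):
    """Hidden updates when a fixed vector x is fed at every step:
    h_t = tanh(W h_{t-1} + R x + b), so R x acts like an extra per-unit bias."""
    h, hs = h0, []
    extra_bias = R @ x                 # same contribution at every time step
    for _ in range(n_steps):
        h = np.tanh(W @ h + extra_bias + b)
        hs.append(h)
    return hs
```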


Application: Caption Generation

[Figure: caption generation example]


Design Patterns: Bidirectional RNNs

The RNNs considered so far all have a causal structure: the state at time t only captures information from the past x(1), . . . , x(t−1) and the present x(t)

Sometimes we are interested in an output y(t) which may depend on the whole input sequence

Example: Interpretation of a current sound as a phoneme may depend on the next few sounds due to co-articulation

Basically, in many cases we are interested in looking into the future as well as the past to disambiguate interpretations

Bidirectional RNNs were introduced to address this need (Schuster and Paliwal, 1997), and have been used in handwriting recognition (Graves 2012, Graves and Schmidhuber 2009), speech recognition (Graves and Schmidhuber 2005) and bioinformatics (Baldi 1999)
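A minimal sketch of the bidirectional idea under simple assumptions: two independent tanh recurrences run over the sequence in opposite directions and their hidden states are concatenated at each time step, so the combined state at time t sees both past and future inputs (all names and shapes are illustrative):

```python
import numpy as np

def bidirectional_states(xs, Uf, Wf, Ub, Wb):
    """Concatenate a left-to-right and a right-to-left pass over the sequence."""
    hf, hb = np.zeros(Wf.shape[0]), np.zeros(Wb.shape[0])
    forward, backward = [], []
    for x_t in xs:                         # forward pass: past -> future
        hf = np.tanh(Uf @ x_t + Wf @ hf)
        forward.append(hf)
    for x_t in reversed(xs):               # backward pass: future -> past
        hb = np.tanh(Ub @ x_t + Wb @ hb)
        backward.append(hb)
    backward.reverse()
    # At each t the combined state depends on the whole input sequence
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```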



Design Patterns: Encoder-Decoder

How do we map input sequences to output sequences that are not necessarily of the same length?

Example: Input - "Kerem jojjenek maskor es kulonosen mashoz." Output - "Please come rather at another time and to another person."

Other example applications: Speech recognition, question answering, etc.

The input to this RNN is called the context; we want to find a representation of the context, C

C could be a vector or a sequence that summarizes X = {x(1), . . . , x(nx)}
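A minimal encoder-decoder sketch under simple assumptions: the encoder's final hidden state serves as the context C, and the decoder is a second RNN initialized from C that emits a fixed number of output distributions (a real decoder would also condition on the previously emitted token and stop at an end-of-sequence symbol; all names and shapes are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def encode(xs, U_enc, W_enc):
    """Summarize the input sequence into a context vector C (final hidden state)."""
    h = np.zeros(W_enc.shape[0])
    for x_t in xs:
        h = np.tanh(U_enc @ x_t + W_enc @ h)
    return h                                # C

def decode(C, W_dec, V_dec, n_steps):
    """Generate an output sequence from the context by initializing h(0) = C."""
    h, outputs = C, []
    for _ in range(n_steps):
        h = np.tanh(W_dec @ h)
        outputs.append(softmax(V_dec @ h))  # distribution over output tokens
    return outputs
```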


Far more complicated mappings are possible with this architecture (shown in the accompanying figure)


In the context of Machine Translation, C is called a thought vector
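
A minimal sequence-to-sequence sketch, under the simplifying assumptions that the context C is just the encoder's final hidden state and that the decoder is unrolled for a fixed number of steps with no output feedback; the weight names below are illustrative, not the lecture's notation.

    import numpy as np

    def encode(xs, W, U):
        """Encoder RNN: compress the whole input sequence into a single context vector C."""
        h = np.zeros(W.shape[0])
        for x in xs:
            h = np.tanh(W @ h + U @ x)
        return h                      # C: final hidden state summarizes x^(1), ..., x^(n_x)

    def decode(C, W, V, n_steps):
        """Decoder RNN: unroll from the context C for n_steps, emitting one output per step."""
        h = C.copy()
        outputs = []
        for _ in range(n_steps):
            h = np.tanh(W @ h)        # purely context-driven; no teacher forcing in this sketch
            outputs.append(V @ h)     # readout at each decoder step
        return outputs

    rng = np.random.default_rng(0)
    xs = [rng.standard_normal(5) for _ in range(7)]           # input sequence of length 7
    C = encode(xs, rng.standard_normal((8, 8)) * 0.1, rng.standard_normal((8, 5)) * 0.1)
    ys = decode(C, rng.standard_normal((8, 8)) * 0.1, rng.standard_normal((4, 8)) * 0.1, n_steps=3)
    # The output sequence (length 3) need not match the input length (7).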


Deep Recurrent Networks

The computations in an RNN can be decomposed into three blocks of parameters and associated transformations:

– Input to hidden state
– Previous hidden state to the next
– Hidden state to the output

Until now, each of these transforms was a learned affine transformation followed by a fixed nonlinearity

Introducing depth in each of these operations is advantageous (Graves et al. 2013, Pascanu et al. 2014)

The intuition for why depth should be more useful is quite similar to that in deep feed-forward networks

Optimization can become much harder, but this can be mitigated by tricks such as introducing skip connections


In the accompanying figure, (b) lengthens the shortest paths linking different time steps, and (c) mitigates this by introducing skip connections
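
One illustrative way to add such depth (loosely in the spirit of Graves et al. 2013 / Pascanu et al. 2014, not their exact architectures) is to insert an extra nonlinear layer into the hidden-to-hidden transition while keeping a direct skip term from the previous hidden state, so the shortest path between time steps stays short. Names and shapes below are assumptions for the sketch.

    import numpy as np

    def deep_rnn_step(h_prev, x, params):
        """One time step with a two-layer hidden-to-hidden transition and a skip connection."""
        W1, W2, U, W_skip = params
        z = np.tanh(W1 @ h_prev + U @ x)       # first layer of the deeper transition
        h = np.tanh(W2 @ z + W_skip @ h_prev)  # second layer; skip term keeps the t-1 -> t path short
        return h

    rng = np.random.default_rng(0)
    dim_h, dim_x = 6, 3
    params = (rng.standard_normal((dim_h, dim_h)) * 0.1,
              rng.standard_normal((dim_h, dim_h)) * 0.1,
              rng.standard_normal((dim_h, dim_x)) * 0.1,
              rng.standard_normal((dim_h, dim_h)) * 0.1)

    h = np.zeros(dim_h)
    for x in [rng.standard_normal(dim_x) for _ in range(4)]:
        h = deep_rnn_step(h, x, params)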


Recursive Neural Networks

The computational graph is structured as a deep tree rather than as a chain, as in an RNN


First introduced by Pollack (1990); used in Machine Reasoning by Bottou (2011)

Successfully used to process data structures as input to neural networks (Frasconi et al 1997), in Natural Language Processing (Socher et al 2011) and in Computer Vision (Socher et al 2011)

Advantage: For sequences of length τ, the number of compositions of nonlinear operations can be reduced from τ to O(log τ)

The choice of tree structure is not obvious:

• A balanced binary tree, which does not depend on the structure of the data, has been used in many applications

• Sometimes domain knowledge can be used: parse trees produced by a parser in NLP (Socher et al 2011)

The computation performed by each node need not be the usual neuron computation - it could instead be tensor operations, etc. (Socher et al 2013)
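
A minimal recursive-network sketch over a balanced binary tree: the same composition function merges pairs of child vectors bottom-up, so a sequence of τ leaves is reduced with roughly O(log τ) nonlinear compositions along any root-to-leaf path. The concatenate-then-tanh composition and all parameter names are illustrative assumptions (Socher et al.'s models use parse trees and, later, tensor compositions).

    import numpy as np

    def compose(left, right, W, b):
        """Merge two child representations into one parent representation."""
        return np.tanh(W @ np.concatenate([left, right]) + b)

    def recursive_encode(leaves, W, b):
        """Reduce a list of leaf vectors pairwise (balanced-binary-tree style) to a single root vector."""
        nodes = list(leaves)
        while len(nodes) > 1:
            nxt = []
            for i in range(0, len(nodes) - 1, 2):
                nxt.append(compose(nodes[i], nodes[i + 1], W, b))
            if len(nodes) % 2 == 1:          # an odd leftover node passes through unchanged
                nxt.append(nodes[-1])
            nodes = nxt
        return nodes[0]

    rng = np.random.default_rng(0)
    d = 4
    W = rng.standard_normal((d, 2 * d)) * 0.1
    b = np.zeros(d)
    leaves = [rng.standard_normal(d) for _ in range(7)]   # tau = 7 leaves
    root = recursive_encode(leaves, W, b)                 # about log2(7) levels of composition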


Long-Term Dependencies


Challenge of Long-Term Dependencies

Basic problem: Gradients propagated over many stages tend to vanish (most of the time) or explode (relatively rarely)

Difficulty with long-term interactions (involving the multiplication of many Jacobians) arises due to their exponentially smaller weights compared to short-term interactions

The problem was first analyzed by Hochreiter and Schmidhuber 1991 and Bengio et al 1993


Recurrent networks involve the composition of the same function multiple times, once per time step

The function composition in RNNs somewhat resembles matrix multiplication

Consider the recurrence relationship:

h^(t) = W^T h^(t−1)

This could be thought of as a very simple recurrent neural network without a nonlinear activation and lacking x

This recurrence essentially describes the power method and can be written as:

h^(t) = (W^t)^T h^(0)


If W admits an eigendecomposition W = QΛQ^T with orthogonal Q

The recurrence becomes:

h^(t) = (W^t)^T h^(0) = Q^T Λ^t Q h^(0)

The eigenvalues are raised to the power t: they quickly decay to zero or explode

This problem is particular to RNNs, since the same weight matrix is reused at every time step
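
A tiny numerical illustration of this point (not from the lecture): build W from an orthogonal Q and chosen eigenvalues, apply the recurrence many times, and watch the component along the eigenvalue with |λ| > 1 explode while the |λ| < 1 component vanishes.

    import numpy as np

    # Build W = Q diag(lambda) Q^T with an orthogonal Q and hand-picked eigenvalues.
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))    # random orthogonal matrix
    Lam = np.diag([1.1, 0.9])                           # one eigenvalue above 1, one below
    W = Q @ Lam @ Q.T

    h0 = rng.standard_normal(2)
    for t in [10, 50, 100]:
        ht = np.linalg.matrix_power(W.T, t) @ h0        # h^(t) = (W^t)^T h^(0)
        print(t, np.linalg.norm(ht))
    # The norm is dominated by 1.1**t (explodes), while the 0.9**t component vanishes.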


Solution 1: Echo State Networks

Idea: Set the recurrent weights such that they do a good job of capturing past history, and learn only the output weights

Methods: Echo State Networks, Liquid State Machines

The general methodology is called reservoir computing

How to choose the recurrent weights?


Original idea: Choose recurrent weights such that the hidden-to-hidden transition Jacobian has eigenvalues close to 1

In particular, we pay attention to the spectral radius of the Jacobian J

Consider a gradient g: after one step of backpropagation it would be Jg, and after n steps it would be J^n g

Now consider a perturbed version of g, i.e. g + δv: after n steps we will have J^n (g + δv)

In fact, if v is a unit eigenvector of J with eigenvalue λ, the separation between the two is exactly δ|λ|^n

When |λ| > 1, δ|λ|^n grows exponentially large; when |λ| < 1, it shrinks to zero exponentially fast
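
A quick numerical check of the separation claim, using a diagonal J so that its unit eigenvectors are simply the coordinate axes (purely illustrative choices):

    import numpy as np

    J = np.diag([1.05, 0.8])
    lam = 1.05
    v = np.array([1.0, 0.0])        # unit eigenvector of J with eigenvalue lam
    g = np.array([0.3, -0.7])       # some gradient vector
    delta = 1e-3

    n = 40
    Jn = np.linalg.matrix_power(J, n)
    separation = np.linalg.norm(Jn @ (g + delta * v) - Jn @ g)
    print(separation, delta * abs(lam) ** n)   # the two numbers agree: separation = delta * |lam|^n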


For a vector h, when a linear map W always shrinks h, the mapping is said to be contractive

The strategy of echo state networks is to make use of this intuition

The Jacobian is chosen such that its spectral radius corresponds to stable dynamics
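
A minimal reservoir-computing sketch under the following assumptions: the recurrent matrix is random and rescaled to a chosen spectral radius slightly below 1 (contractive, stable dynamics), only the linear readout is fit (here by plain least squares), and the toy task of next-step prediction of a sine wave is purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n_res, n_in = 100, 1

    # Random reservoir, rescaled so its spectral radius is a chosen value (0.9 here).
    W = rng.standard_normal((n_res, n_res))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
    U = rng.standard_normal((n_res, n_in)) * 0.5

    def run_reservoir(xs):
        """Collect reservoir states; the recurrent weights stay fixed and are never trained."""
        h = np.zeros(n_res)
        states = []
        for x in xs:
            h = np.tanh(W @ h + U @ x)
            states.append(h.copy())
        return np.array(states)

    # Toy task: predict the next value of a sine wave from the current one.
    xs = np.sin(np.linspace(0, 20, 400)).reshape(-1, 1)
    H = run_reservoir(xs[:-1])
    y = xs[1:, 0]
    W_out, *_ = np.linalg.lstsq(H, y, rcond=None)   # only the output weights are learned
    pred = H @ W_out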


Other Ideas

Skip connections (direct connections between time steps more than one step apart, which shorten the paths gradients must travel)

Leaky units (hidden units with linear self-connections whose weight is close to 1, so that information from the past decays slowly)
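
For leaky units, the usual formulation is a hidden unit with a linear self-connection whose weight α is close to 1, so information from the past decays slowly; the sketch below is a generic illustration of that idea, not the lecture's exact equations.

    import numpy as np

    def leaky_step(h_prev, x, alpha, U):
        """Leaky unit: a running average of its own past state and the new input signal."""
        return alpha * h_prev + (1.0 - alpha) * np.tanh(U @ x)

    rng = np.random.default_rng(0)
    U = rng.standard_normal((4, 3)) * 0.1
    h = np.zeros(4)
    for x in [rng.standard_normal(3) for _ in range(10)]:
        h = leaky_step(h, x, alpha=0.95, U=U)   # alpha near 1 => long memory; near 0 => short memory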


Long Short Term Memory

Start from the vanilla RNN update:

h_t = tanh(W h_{t−1} + U x_t)


Now introduce a cell state c_t that accumulates information additively:

c̃_t = tanh(W h_{t−1} + U x_t)

c_t = c_{t−1} + c̃_t


Add a forget gate that controls how much of the previous cell state is kept:

f_t = σ(W_f h_{t−1} + U_f x_t)

c_t = f_t ⊙ c_{t−1} + c̃_t


Add an input gate that controls how much of the candidate c̃_t is written into the cell:

i_t = σ(W_i h_{t−1} + U_i x_t)

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t


Finally, an output gate controls how much of the cell state is exposed as the hidden state. The full LSTM update is:

c̃_t = tanh(W h_{t−1} + U x_t)

f_t = σ(W_f h_{t−1} + U_f x_t)

i_t = σ(W_i h_{t−1} + U_i x_t)

o_t = σ(W_o h_{t−1} + U_o x_t)

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t

h_t = o_t ⊙ tanh(c_t)
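
The equations above transcribe directly into code; the sketch below is one step of such a cell in numpy, with biases omitted (as on the slide), element-wise products written with *, and all shapes chosen arbitrarily for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(h_prev, c_prev, x, p):
        """One LSTM step following the slide's equations."""
        c_tilde = np.tanh(p["W"]  @ h_prev + p["U"]  @ x)   # candidate cell content
        f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x)         # forget gate
        i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x)         # input gate
        o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x)         # output gate
        c = f * c_prev + i * c_tilde
        h = o * np.tanh(c)
        return h, c

    rng = np.random.default_rng(0)
    dh, dx = 5, 3
    p = {k: rng.standard_normal((dh, dh if k.startswith("W") else dx)) * 0.1
         for k in ["W", "U", "Wf", "Uf", "Wi", "Ui", "Wo", "Uo"]}
    h, c = np.zeros(dh), np.zeros(dh)
    for x in [rng.standard_normal(dx) for _ in range(4)]:
        h, c = lstm_step(h, c, x, p)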


Gated Recurrent Unit

Start with the vanilla update: let h̃_t = tanh(W h_{t−1} + U x_t) and h_t = h̃_t

Reset gate: r_t = σ(W_r h_{t−1} + U_r x_t)

New candidate: h̃_t = tanh(W (r_t ⊙ h_{t−1}) + U x_t)

Update gate: z_t = σ(W_z h_{t−1} + U_z x_t)

Update: h_t = z_t ⊙ h̃_t

Finally: h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

The GRU comes from attempting to factor the LSTM and reduce the number of gates
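
The same kind of sketch for the GRU equations above; note there is no separate cell state, only h_t. Biases are again omitted and all shapes are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(h_prev, x, p):
        """One GRU step: reset gate r, update gate z, candidate h_tilde, convex combination."""
        r = sigmoid(p["Wr"] @ h_prev + p["Ur"] @ x)
        z = sigmoid(p["Wz"] @ h_prev + p["Uz"] @ x)
        h_tilde = np.tanh(p["W"] @ (r * h_prev) + p["U"] @ x)
        return (1.0 - z) * h_prev + z * h_tilde

    rng = np.random.default_rng(0)
    dh, dx = 5, 3
    p = {k: rng.standard_normal((dh, dh if k.startswith("W") else dx)) * 0.1
         for k in ["W", "U", "Wr", "Ur", "Wz", "Uz"]}
    h = np.zeros(dh)
    for x in [rng.standard_normal(dx) for _ in range(4)]:
        h = gru_step(h, x, p)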
