TRANSCRIPT
Deep Learning: Recurrent Networks
10/11/2017
Which open source project?
Related math. What is it talking about?
And a Wikipedia page explaining it all
The unreasonable effectiveness of recurrent neural networks..
• All previous examples were generated blindly by a recurrent neural network..
• http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Modelling Series
• In many situations one must consider a series of inputs to produce an output
– Outputs may also be a series
• Examples: ..
Should I invest..
• Stock market
– Must consider the series of stock values in the past several days to decide if it is wise to invest today
• Ideally consider all of history
• Note: Inputs are vectors. Output may be scalar or vector
– Should I invest, vs. should I invest in X
[Figure: stock value series over dates 8/03–15/03 — to invest or not to invest?]
Representational shortcut
• Input at each time is a vector
• Each layer has many neurons
– The output layer too may have many neurons
• But we will represent everything with simple boxes
– Each box actually represents an entire layer with many units
The stock predictor
• The sliding predictor
– Look at the last few days
– This is just a convolutional neural net applied to series data
• Also called a Time-Delay Neural Network
[Figure: stock vectors X(t) … X(t+7) over time; the same network slides over the series, producing Y(t+3), Y(t+4), …, Y(t+6)]
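A minimal sketch of the sliding predictor, assuming a window of K past days and one hidden layer (all names and sizes here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

K, d, H = 3, 4, 16          # window width, stock-vector size, hidden units (illustrative)
W1 = rng.normal(0, 0.1, (H, K * d))   # hidden weights, shared across window positions
b1 = np.zeros(H)
w2 = rng.normal(0, 0.1, H)            # scalar output: "should I invest?"

def predict(window):
    """One application of the sliding predictor to K consecutive stock vectors."""
    h = np.tanh(W1 @ window.reshape(-1) + b1)
    return w2 @ h

X = rng.normal(size=(10, d))          # 10 days of stock vectors
# Slide the same network over the series: this weight sharing over time
# is exactly what makes it a convolutional / time-delay neural network.
outputs = [predict(X[t:t + K]) for t in range(len(X) - K + 1)]
print(np.round(outputs, 3))
```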
Finite-response model
• This is a finite response system
– Something that happens today only affects the output of the system for N days into the future
• N is the width of the system
Y_t = f(X_t, X_{t−1}, …, X_{t−N})
Finite response
• Problem: Increasing the "history" makes the network more complex
– No worries, we have the CPU and memory
• Or do we?
Systems often have long-term dependencies
• Longer-term trends:
– Weekly trends in the market
– Monthly trends in the market
– Annual trends
– Though longer history tends to affect us less than more recent events..
We want infinite memory
• Required: Infinite response systems
– What happens today can continue to affect the output forever
• Possibly with weaker and weaker influence
Y_t = f(X_t, X_{t−1}, …, X_{t−∞})
Examples of infinite response systems
Y_t = f(X_t, Y_{t−1})
– Required: Define the initial state Y_{t−1} for t = 0
– An input X_0 at t = 0 produces Y_0; Y_0 produces Y_1, which produces Y_2, and so on until Y_∞, even if X_1 … X_∞ are 0
• i.e. even if there are no further inputs!
• This is an instance of a NARX network
– "nonlinear autoregressive network with exogenous inputs"
– Y_t = f(X_{0:t}, Y_{0:t−1})
• Output contains information about the entire past
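A sketch of the one-tap recursion Y_t = f(X_t, Y_{t−1}); the particular f (a tanh layer) and the weights are illustrative assumptions:

```python
import numpy as np

def narx_step(x_t, y_prev, Wx, Wy, b):
    # One-tap NARX recursion: the previous *output* is fed back as an input.
    return np.tanh(Wx @ x_t + Wy @ y_prev + b)

rng = np.random.default_rng(1)
d, m = 3, 2
Wx, Wy, b = rng.normal(size=(m, d)), rng.normal(size=(m, m)), np.zeros(m)

y = np.zeros(m)                      # initial state Y_{-1}, part of the model definition
X = np.zeros((20, d)); X[0] = 1.0    # a single impulse at t = 0, zero input afterwards
for x_t in X:
    y = narx_step(x_t, y, Wx, Wy, b)
    # y keeps evolving even though the input is zero: an infinite-response system
print(y)
```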
A one-tap NARX network
• A NARX net with recursion from the output
[Figure: inputs X(t) over time; the output Y(t) is fed back as an input at the next step]
A more complete representation
• A NARX net with recursion from the output
• Showing all computations
• All columns are identical
• An input at t=0 affects outputs forever
[Figure: brown boxes show output nodes; yellow boxes are outputs]
A more generic NARX network
• The output Y_t at time t is computed from the past K outputs Y_{t−1}, …, Y_{t−K} and the current and past L inputs X_t, …, X_{t−L}
A "complete" NARX network
• The output Y_t at time t is computed from all past outputs and all inputs until time t
– Not really a practical model
NARX Networks
• Very popular for time-series prediction
– Weather
– Stock markets
– As alternate system models in tracking systems
• Any phenomena with distinct "innovations" that "drive" an output
An alternate model for infinite response systems: the state-space model
h_t = f(x_t, h_{t−1})
y_t = g(h_t)
• h_t is the state of the network
• Need to define the initial state h_{−1}
• This is a recurrent neural network
• The state summarizes information about the entire past
The simple state-space model
• The state (green) at any time is determined by the input at that time, and the state at the previous time
• An input at t=0 affects outputs forever
• Also known as a recurrent neural net
[Figure: unrolled network over time with inputs X(t), outputs Y(t), and initial state h(−1) at t=0]
An alternate model for infinite response systems: the state-space model
h_t = f(x_t, h_{t−1})
y_t = g(h_t)
• The state can be arbitrarily complex
Single hidden layer RNN
• Recurrent neural network
• All columns are identical
• An input at t=0 affects outputs forever
Multiple recurrent layer RNN
• Recurrent neural network
• All columns are identical
• An input at t=0 affects outputs forever
A more complex state
• All columns are identical
• An input at t=0 affects outputs forever
Or the network may be even more complicated
• Shades of NARX
• All columns are identical
• An input at t=0 affects outputs forever
Generalization with other recurrences
• All columns (including incoming edges) are identical
State dependencies may be simpler
• Recurrent neural network
• All columns are identical
• An input at t=0 affects outputs forever
Multiple recurrent layer RNN
• We can also have skips..
A Recurrent Neural Network
• Simplified models are often drawn
• The loops imply recurrence
[Figure: the detailed unrolled version of the simplified looped representation]
Equations
• Note the superscript in the indexing, which indicates the layer of the network from which inputs are obtained
• Assuming a vector activation at the output, e.g. softmax
• The state node activation f_1() is typically tanh()
• Every neuron also has a bias input
Y_i(t) = f_2( Σ_j w_ij^(1) h_j^(1)(t) + b_i^(1) ),  i = 1..M
h_i^(1)(t) = f_1( Σ_j w_ij^(0) X_j(t) + Σ_j w_ij^(11) h_j^(1)(t−1) + b_i^(1) )
h_i^(1)(−1) = part of network parameters
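A sketch of these equations as a forward pass, assuming f_1 = tanh and f_2 = softmax; sizes and initializations are illustrative:

```python
import numpy as np

def rnn_forward(X, W0, W11, W1, b_h, b_y, h_init):
    """Forward pass for the single-hidden-layer RNN equations above:
    h(t) = f1(W0 X(t) + W11 h(t-1) + b_h),  Y(t) = f2(W1 h(t) + b_y)."""
    h, ys = h_init, []
    for x_t in X:
        h = np.tanh(W0 @ x_t + W11 @ h + b_h)                        # f1 = tanh
        z = W1 @ h + b_y
        ys.append(np.exp(z - z.max()) / np.exp(z - z.max()).sum())   # f2 = softmax
    return np.array(ys)

rng = np.random.default_rng(0)
d, H, M, T = 5, 8, 3, 6                              # illustrative sizes
params = [rng.normal(0, 0.1, s) for s in [(H, d), (H, H), (M, H)]]
Y = rnn_forward(rng.normal(size=(T, d)), *params, np.zeros(H), np.zeros(M), np.zeros(H))
print(Y.shape)   # (T, M): one output distribution per time step
```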
Equations
Y_i(t) = f_3( Σ_j w_ij^(2) h_j^(2)(t) + b_i^(3) ),  i = 1..M
h_i^(2)(t) = f_2( Σ_j w_ij^(1) h_j^(1)(t) + Σ_j w_ij^(22) h_j^(2)(t−1) + b_i^(2) )
h_i^(1)(t) = f_1( Σ_j w_ij^(0) X_j(t) + Σ_j w_ij^(11) h_j^(1)(t−1) + b_i^(1) )
h_i^(1)(−1), h_i^(2)(−1) = part of network parameters
• Assuming a vector activation at the output, e.g. softmax f_3()
• The state node activations f_k() are typically tanh()
• Every neuron also has a bias input
Equations
Y_i(t) = f_3( Σ_j w_ij^(2) h_j^(2)(t) + Σ_j w_ij^(1,3) h_j^(1)(t) + b_i^(3) ),  i = 1..M
h_i^(2)(t) = f_2( Σ_j w_ij^(1,2) h_j^(1)(t) + Σ_j w_ij^(0,2) X_j(t) + Σ_j w_ij^(2,2) h_j^(2)(t−1) + b_i^(2) )
h_i^(1)(t) = f_1( Σ_j w_ij^(0,1) X_j(t) + Σ_j w_ij^(1,1) h_j^(1)(t−1) + b_i^(1) )
h_i^(1)(−1), h_i^(2)(−1) = part of network parameters
Variants on recurrent nets
• 1: Conventional MLP
• 2: Sequence generation, e.g. image to caption
• 3: Sequence-based prediction or classification, e.g. speech recognition, text classification
Images from Karpathy
Variants
• 2: Sequence to sequence, e.g. stock problem, label prediction
• 1: Delayed sequence to sequence
• Etc…
Images from Karpathy
How do we train the network
• Back propagation through time (BPTT)
• Given a collection of sequence inputs
– (X_i, D_i), where
– X_i = X_{i,0}, …, X_{i,T}
– D_i = D_{i,0}, …, D_{i,T}
• Train network parameters to minimize the error between the output of the network Y_i = Y_{i,0}, …, Y_{i,T} and the desired outputs
– This is the most generic setting. In other settings we just "remove" some of the input or output entries
[Figure: unrolled network with inputs X(0) … X(T), outputs Y(0) … Y(T), and initial state h(−1)]
Training: Forward pass
• For each training input:
• Forward pass: pass the entire data sequence through the network, generate outputs
Training: Computing gradients
• For each training input:
• Backward pass: Compute gradients via backpropagation
– Back Propagation Through Time
Back Propagation Through Time
• Will only focus on one training instance
• All subscripts represent components and not the training instance index
• The divergence DIV(Y(0..T), D(0..T)) is computed between the sequence of outputs produced by the network and the desired sequence of outputs
• This is not just the sum of the divergences at individual times
▪ Unless we explicitly define it that way
Back Propagation Through Time
• First step of backprop: Compute ∂DIV/∂Y_i(T) for all i
• In general we will be required to compute ∂DIV/∂Y_i(t) for all i and t, as we will see. This can be a source of significant difficulty in many scenarios.
• Special case, when the overall divergence is a simple combination of local divergences Div(t) at each time, we will usually get:
∂DIV/∂Y_i(t) = ∂Div(t)/∂Y_i(t)
Back Propagation Through Time
• First step of backprop: Compute ∂DIV/∂Y_i(T) for all i, then propagate into the output pre-activations z^(1)(T):
∂DIV/∂z_i^(1)(T) = [∂DIV/∂Y_i(T)] [∂Y_i(T)/∂z_i^(1)(T)]
OR, for a vector output activation (e.g. softmax):
∂DIV/∂z_i^(1)(T) = Σ_j [∂DIV/∂Y_j(T)] [∂Y_j(T)/∂z_i^(1)(T)]
Back Propagation Through Time
∂DIV/∂z_i^(1)(T) = [∂Div(T)/∂Y_i(T)] [∂Y_i(T)/∂z_i^(1)(T)]
∂DIV/∂h_i(T) = Σ_j [∂DIV/∂z_j^(1)(T)] [∂z_j^(1)(T)/∂h_i(T)] = Σ_j w_ij^(1) [∂DIV/∂z_j^(1)(T)]
Back Propagation Through Time
∂DIV/∂w_ij^(1) = [∂DIV/∂z_j^(1)(T)] [∂z_j^(1)(T)/∂w_ij^(1)] = [∂DIV/∂z_j^(1)(T)] h_i(T)
∂DIV/∂h_i(T) = Σ_j w_ij^(1) [∂DIV/∂z_j^(1)(T)]
Back Propagation Through Time
∂DIV/∂z_i^(0)(T) = [∂DIV/∂h_i(T)] [∂h_i(T)/∂z_i^(0)(T)]
Back Propagation Through Time
∂DIV/∂w_ij^(0) = [∂DIV/∂z_j^(0)(T)] X_i(T)
Back Propagation Through Time
∂DIV/∂w_ij^(11) = [∂DIV/∂z_j^(0)(T)] h_i(T−1)
∂DIV/∂w_ij^(0) = [∂DIV/∂z_j^(0)(T)] X_i(T)
Back Propagation Through Time
∂DIV/∂z_i^(1)(T−1) = [∂DIV/∂Y_i(T−1)] [∂Y_i(T−1)/∂z_i^(1)(T−1)]
OR, for a vector output activation:
∂DIV/∂z_i^(1)(T−1) = Σ_j [∂DIV/∂Y_j(T−1)] [∂Y_j(T−1)/∂z_i^(1)(T−1)]
Back Propagation Through Time
∂DIV/∂h_i(T−1) = Σ_j w_ij^(1) [∂DIV/∂z_j^(1)(T−1)] + Σ_j w_ij^(11) [∂DIV/∂z_j^(0)(T)]
Back Propagation Through Time
∂DIV/∂h_i(T−1) = Σ_j w_ij^(1) [∂DIV/∂z_j^(1)(T−1)] + Σ_j w_ij^(11) [∂DIV/∂z_j^(0)(T)]
∂DIV/∂w_ij^(1) += [∂DIV/∂z_j^(1)(T−1)] h_i(T−1)   (Note the addition)
Back Propagation Through Time
∂DIV/∂z_i^(0)(T−1) = [∂DIV/∂h_i(T−1)] [∂h_i(T−1)/∂z_i^(0)(T−1)]
Back Propagation Through Time
∂DIV/∂z_i^(0)(T−1) = [∂DIV/∂h_i(T−1)] [∂h_i(T−1)/∂z_i^(0)(T−1)]
∂DIV/∂w_ij^(0) += [∂DIV/∂z_j^(0)(T−1)] X_i(T−1)   (Note the addition)
Back Propagation Through Time
∂DIV/∂w_ij^(0) += [∂DIV/∂z_j^(0)(T−1)] X_i(T−1)
∂DIV/∂w_ij^(11) += [∂DIV/∂z_j^(0)(T−1)] h_i(T−2)   (Note the addition)
Back Propagation Through Time
∂DIV/∂h_{−1} = Σ_j w_ij^(11) [∂DIV/∂z_j^(0)(0)]
Back Propagation Through Time
• In general, for the kth hidden layer:
∂DIV/∂z_i^(k)(t) = [∂DIV/∂h_i^(k)(t)] f_k′(z_i^(k)(t))
∂DIV/∂h_i^(k)(t) = Σ_j w_{i,j}^(k) [∂DIV/∂z_j^(k+1)(t)] + Σ_j w_{i,j}^(k,k) [∂DIV/∂z_j^(k)(t+1)]
(Not showing derivatives at output neurons)
Back Propagation Through Time
• Finally, accumulate over time:
∂DIV/∂w_ij^(0) = Σ_t [∂DIV/∂z_j^(0)(t)] X_i(t)
∂DIV/∂w_ij^(11) = Σ_t [∂DIV/∂z_j^(0)(t)] h_i(t−1)
∂DIV/∂h_{−1} = Σ_j w_ij^(11) [∂DIV/∂z_j^(0)(0)]
BPTT
• Can be generalized to any architecture
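A compact sketch of BPTT for a single-hidden-layer RNN with a linear output and the "simple combination" L2 divergence; this is an illustration of the recursions above, not a production implementation:

```python
import numpy as np

def bptt(X, D, W0, W11, W1, h_init):
    """BPTT for a tanh-hidden, linear-output RNN with
    DIV = sum_t ||Y(t) - D(t)||^2 / 2 (the special case above)."""
    T = len(X)
    hs, ys, h = [], [], h_init
    for t in range(T):                       # forward pass
        h = np.tanh(W0 @ X[t] + W11 @ h)
        hs.append(h); ys.append(W1 @ h)
    dW0, dW11, dW1 = np.zeros_like(W0), np.zeros_like(W11), np.zeros_like(W1)
    dh_next = np.zeros_like(h)               # contribution to dDIV/dh(t) from time t+1
    for t in reversed(range(T)):             # backward pass through time
        dy = ys[t] - D[t]                    # dDIV/dY(t) for the local L2 divergence
        dh = W1.T @ dy + dh_next             # gradient from the output AND the future
        dz = dh * (1 - hs[t] ** 2)           # through tanh: f'(z) = 1 - h^2
        dW1 += np.outer(dy, hs[t])           # note the accumulation (+=) over time
        dW0 += np.outer(dz, X[t])
        h_prev = hs[t - 1] if t > 0 else h_init
        dW11 += np.outer(dz, h_prev)
        dh_next = W11.T @ dz                 # passed back to time t-1
    return dW0, dW11, dW1

rng = np.random.default_rng(0)
d, H, m, T = 3, 4, 2, 5
W0, W11, W1 = (rng.normal(0, 0.5, s) for s in [(H, d), (H, H), (m, H)])
grads = bptt(rng.normal(size=(T, d)), rng.normal(size=(T, m)), W0, W11, W1, np.zeros(H))
print([g.shape for g in grads])
```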
Extensions to the RNN: Bidirectional RNN
• RNN with both forward and backward recursion
– Explicitly models the fact that just as the future can be predicted from the past, the past can be deduced from the future
Bidirectional RNN
• A forward net processes the data from t=0 to t=T
• A backward net processes it backward from t=T down to t=0
[Figure: forward chain with initial state hf(−1) and backward chain with initial state hb(∞), both over inputs X(0) … X(T), jointly producing Y(0) … Y(T)]
Bidirectional RNN: Processing an input string
• The forward net processes the data from t=0 to t=T
– Only computing the hidden states, initially
• The backward net processes the input data in reverse time, end to beginning
– Initially only the hidden state values are computed
– Clearly, this is not an online process and requires the entire input data
– Note: This is not the backward pass of backprop.
Bidirectional RNN: Processing an input string
• The computed states of both networks are used to compute the final output at each time
Backpropagation in BRNNs
• Forward pass: Compute both forward and backward networks and the final output
Backpropagation in BRNNs
• Backward pass: Define a divergence Div(d_1..d_T) from the desired output
• Separately perform back propagation on both nets
– From t=T down to t=0 for the forward net
– From t=0 up to t=T for the backward net
So how did this happen?
RNNs..
• Excellent models for time-series analysis tasks
– Time-series prediction
– Time-series classification
– Sequence prediction..
– They can even simplify some problems that are difficult for MLPs
Recall: A Recurrent Neural Network
MLPs vs RNNs
• The addition problem: Add two N-bit numbers to produce an (N+1)-bit number
– Input is binary
– Will require a large number of training instances
• Output must be specified for every pair of inputs
• Weights that generalize will make errors
– A network trained for N-bit numbers will not work for (N+1)-bit numbers
[Figure: an MLP presented with all bits of both numbers in parallel]
MLPs vs RNNs
• The addition problem: Add two N-bit numbers to produce an (N+1)-bit number
• RNN solution: Very simple, can add two numbers of any size
[Figure: an RNN unit adds one bit pair per step, feeding the carry back as the "previous carry" input and emitting a carry out]
MLP: The parity problem
• Is the number of "ones" even or odd?
• Network must be complex to capture all patterns
– At least one hidden layer of size N plus an output neuron
– Fixed input size
[Figure: an MLP mapping N input bits to a single parity output]
RNN: The parity problem
• Trivial solution
• Generalizes to input of any size
[Figure: an RNN unit XORs each incoming bit with its previous output]
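A sketch of the trivial recurrence the slide describes (XOR itself is realizable by a tiny network; here it is written directly):

```python
def parity(bits):
    # The recurrent solution: the state is the running parity,
    # updated each step as state = state XOR bit.
    # One "RNN unit" step per input bit; works for input of any length.
    state = 0
    for b in bits:
        state ^= b
    return state

print(parity([1, 0, 0, 0, 1, 1, 0, 0, 1, 0]))   # 0: an even number of ones
```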
RNNs..
• Excellent models for time-series analysis tasks
– Time-series prediction
– Time-series classification
– Sequence prediction..
– They can even simplify problems that are difficult for MLPs
• But first – a problem..
The vanishing gradient problem
• A particular problem with training deep networks..
– The gradient of the error with respect to weights is unstable..
Some useful preliminary math: The problem with training deep networks
• A multilayer perceptron is a nested function
Y = f_N( W_{N−1} f_{N−1}( W_{N−2} f_{N−2}( … W_0 X ) ) )
• W_k is the weights matrix at the kth layer
• The error for X can be written as
Div(X) = D( f_N( W_{N−1} f_{N−1}( W_{N−2} f_{N−2}( … W_0 X ) ) ) )
Training deep networks
• Vector derivative chain rule: for any f(Wg(X)):
d f(Wg(X))/dX = [d f(Wg(X))/d Wg(X)] [d Wg(X)/d g(X)] [d g(X)/dX]
∇_X f = ∇_Z f · W · ∇_X g
• Where
– Z = Wg(X)
– ∇_X g is the Jacobian matrix of g(X) w.r.t. X
• Using the notation ∇_X g instead of J_g(X) for consistency (poor notation)
Training deep networks
• For
Div(X) = D( f_N( W_{N−1} f_{N−1}( W_{N−2} f_{N−2}( … W_0 X ) ) ) )
• We get:
∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
• Where
– ∇_{f_k} Div is the gradient of the error Div(X) w.r.t. the output of the kth layer of the network
• Needed to compute the gradient of the error w.r.t. W_{k−1}
– ∇f_k is the Jacobian of f_k() w.r.t. its current input
– All blue terms are matrices
The Jacobian of the hidden layers
• ∇f_t() is the derivative of the output of the (layer of) hidden recurrent neurons with respect to their input
– A diagonal matrix whose diagonal entries are the derivatives of the activation of the recurrent hidden layer
h_i^(1)(t) = f_1(z_i^(1)(t))
∇f_t(z) = diag( f′_{t,1}(z_1), f′_{t,2}(z_2), …, f′_{t,N}(z_N) )
The Jacobian
• The derivative (or subgradient) of the activation function is always bounded
– The diagonal entries of the Jacobian are bounded
• There is a limit on how much multiplying a vector by the Jacobian will scale it
The derivative of the hidden state activation
• The most common activation functions, such as sigmoid, tanh() and RELU, have derivatives that are always less than 1
• The most common activation for the hidden units in an RNN is tanh()
– The derivative of tanh() is always less than 1
• Multiplication by the Jacobian is always a shrinking operation
Training deep networks
• As we go back in layers, the Jacobians of the activations constantly shrink the derivative
– After a few instants the derivative of the divergence at any time is totally "forgotten"
∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
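A small numerical sketch of this shrinkage: walking a gradient vector back through repeated (Jacobian × weight) factors of a tanh recurrence. The sizes and the weight scale are arbitrary; changing the scale of W flips the behavior from vanishing to exploding.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
grad = rng.normal(size=N)                       # stand-in for the divergence gradient
W = rng.normal(0, 1.0 / np.sqrt(N), (N, N))     # one shared recurrent weight matrix
h = rng.normal(size=N)

for layer in range(50):                         # walk the gradient back 50 steps
    z = W @ h
    J = np.diag(1 - np.tanh(z) ** 2)            # Jacobian of tanh: diagonal, entries < 1
    grad = grad @ (J @ W)                       # one factor of the chain product per step
    h = np.tanh(z)
    if layer % 10 == 9:
        print(layer + 1, np.linalg.norm(grad))  # norm typically decays toward 0
```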
What about the weights
∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
• In a single-layer RNN, the weight matrices are identical
• The chain product for ∇_{f_k} Div will
– Expand ∇D along directions in which the singular values of the weight matrices are greater than 1
– Shrink ∇D in directions where the singular values are less than 1
– Exploding or vanishing gradients
Exploding/Vanishing gradients
∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
• Every blue term is a matrix
• ∇D is proportional to the actual error
– Particularly for L2 and KL divergences
• The chain product for ∇_{f_k} Div will
– Expand ∇D in directions where each stage has singular values greater than 1
– Shrink ∇D in directions where each stage has singular values less than 1
Gradient problems in deep networks
• The gradients in the lower/earlier layers can explode or vanish
– Resulting in insignificant or unstable gradient descent updates
– Problem gets worse as network depth increases
∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
Vanishing gradient examples..
• 19-layer MNIST model
– Different activations: exponential linear units (ELU), RELU, sigmoid, tanh
– Each layer is 1024 neurons wide
– Gradients shown at initialization
• Will actually decrease with additional training
• Figures show log ‖∇_{W_neuron} E‖ where W_neuron is the vector of incoming weights to each neuron
– I.e. the gradient of the loss w.r.t. the entire set of weights to each neuron
[Figures, plotted from output layer to input layer: ELU activation (batch gradients), RELU activation (batch gradients), sigmoid activation (batch gradients), tanh activation (batch gradients), ELU activation (individual instances)]
Vanishing gradients
• ELU activations maintain gradients longest
• But in all cases gradients effectively vanish after about 10 layers!
– Your results may vary
• Both batch gradients and gradients for individual instances disappear
– In reality a tiny number may actually blow up.
Recurrent nets are very deep nets
∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
• The relation between X(0) and Y(T) is that of a very deep network
– Gradients from errors at t = T will vanish by the time they're propagated to t = 0
Vanishing stuff..
• Not merely during back propagation
• Stuff gets forgotten in the forward pass too
– Otherwise outputs would saturate or blow up
– Typically forgotten in a dozen or so timesteps
The long-term dependency problem
"Jane had a quick lunch in the bistro. Then she.."
PATTERN1 [……………………..] PATTERN 2
• Any other pattern of any length can happen between pattern 1 and pattern 2
– The RNN will "forget" pattern 1 if the intermediate stuff is too long
– "Jane" → the next pronoun referring to her will be "she"
• Can learn such dependencies in theory; in practice will not
– Vanishing gradient problem
Exploding/Vanishing gradients
∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
• Can we replace this with something that doesn't fade or blow up?
• ∇_{f_k} Div = ∇D · C · σ_N · C · σ_{N−1} · C … W_k
Enter – the constant error carousel
• History is carried through uncompressed
– No weights, no nonlinearities
– The only scaling is through the σ "gating" term that captures other triggers
– E.g. "Have I seen Pattern2"?
[Figure: states h(t) … h(t+4) carried forward over time, each step scaled only by a gate ×σ(t+1) … ×σ(t+4)]
Enter – the constant error carousel
• The actual non-linear work is done by other portions of the network
[Figure: the "other stuff" computes the gates σ(t+1) … σ(t+4) that scale the carousel]
Enter the LSTM
• Long Short-Term Memory
• Explicitly latch information to prevent decay / blowup
• The following notes borrow liberally from
• http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Standard RNN
• Recurrent neurons receive past recurrent outputs and the current input as inputs
• Processed through a tanh() activation function
– As mentioned earlier, tanh() is the generally used activation for the hidden layer
• Current recurrent output passed to the next higher layer and the next time instant
Long Short-Term Memory
• The σ() are multiplicative gates that decide if something is important or not
• Remember, every line actually represents a vector
LSTM: Constant Error Carousel
• Key component: a remembered cell state
LSTM: CEC
• C_t is the linear history carried by the constant-error carousel
• Carries information through, only affected by a gate
– And the addition of history, which too is gated..
LSTM: Gates
• Gates are simple sigmoidal units with outputs in the range (0,1)
• They control how much of the information is to be let through
LSTM: Forget gate
• The first gate determines whether to carry over the history or to forget it
– More precisely, how much of the history to carry over
– Also called the "forget" gate
– Note, we're actually distinguishing between the cell memory C and the state h that is carried over time! They're related, though
LSTM: Input gate
• The second gate has two parts
– A perceptron layer that determines if there's something interesting in the input
– A gate that decides if it's worth remembering
– If so, it's added to the current memory cell
LSTM: Memory cell update
• The gated history and the gated new input are added to obtain the updated memory cell: C_t = f_t ∘ C_{t−1} + i_t ∘ C̃_t
LSTM: Output and Output gate
• The output of the cell
– Simply compress it with tanh to make it lie between −1 and 1
• Note that this compression no longer affects our ability to carry memory forward
– While we're at it, let's toss in an output gate
• To decide if the memory contents are worth reporting at this time
LSTM: The "Peephole" Connection
• Why not just let the cell directly influence the gates while at it
– Party!!
The complete LSTM unit
• With input, output, and forget gates and the peephole connection..
[Figure: LSTM unit with inputs x_t, h_{t−1}, C_{t−1}; sigmoid gates f_t, i_t, o_t; tanh candidate C̃_t; outputs h_t, C_t]
Backpropagation rules: Forward
• Forward rules:
[Figure: the LSTM unit annotated with its gates and variables; the forward equations are listed on the "LSTM Equations" slide below]
Backpropagation rules: Backward
• Accumulating terms one connection at a time, for the peephole LSTM above (z_t denotes the final output at time t; z_{g,t} the pre-activation of gate or input g at time t; W_{C·} and W_{h·} the weights from the cell and the state into each gate or input):
∇_{C_t} Div = ∇_{h_t} Div ∘ ( o_t ∘ tanh′(C_t) + tanh(C_t) ∘ σ′(z_{o,t}) W_{Co} )
  + ∇_{C_{t+1}} Div ∘ ( f_{t+1} + C_t ∘ σ′(z_{f,t+1}) W_{Cf} + C̃_{t+1} ∘ σ′(z_{i,t+1}) W_{Ci} )
∇_{h_t} Div = ∇_{z_t} Div · ∇_{h_t} z_t
  + ∇_{C_{t+1}} Div ∘ ( C_t ∘ σ′(z_{f,t+1}) W_{hf} + C̃_{t+1} ∘ σ′(z_{i,t+1}) W_{hi} + i_{t+1} ∘ tanh′(z_{C̃,t+1}) W_{hC} )
  + ∇_{h_{t+1}} Div ∘ tanh(C_{t+1}) ∘ σ′(z_{o,t+1}) W_{ho}
• Not explicitly deriving the derivatives w.r.t. weights; left as an exercise
Gated Recurrent Units: Let's simplify the LSTM
• A simplified LSTM which addresses some of your concerns about why the LSTM is so complicated
• Combine the forget and input gates
– If new input is to be remembered, then this means old memory is to be forgotten
• Why compute twice?
• Don't bother to separately maintain compressed and regular memories
– Pointless computation!
• But compress it before using it to decide on the usefulness of the current input!
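For reference, the standard GRU equations realize exactly these simplifications (these are not on the slide; the notation follows the LSTM equations on the next slide):
• z = σ(x_t U^z + s_{t−1} W^z)   — update gate: the merged forget/input gate
• r = σ(x_t U^r + s_{t−1} W^r)   — reset gate: compresses the old state before using it
• g = tanh(x_t U^g + (r ∘ s_{t−1}) W^g)
• s_t = (1 − z) ∘ s_{t−1} + z ∘ g   — one gate both forgets and admits new input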
LSTM Equations
• i = σ(x_t U^i + s_{t−1} W^i)
• f = σ(x_t U^f + s_{t−1} W^f)
• o = σ(x_t U^o + s_{t−1} W^o)
• g = tanh(x_t U^g + s_{t−1} W^g)
• c_t = c_{t−1} ∘ f + g ∘ i
• s_t = tanh(c_t) ∘ o
• y = softmax(V s_t)
• i: input gate, how much of the new information will be let through to the memory cell
• f: forget gate, responsible for what information should be thrown away from the memory cell
• o: output gate, how much of the information will be exposed to the next time step
• g: self-recurrent unit, computed as in a standard RNN
• c_t: internal memory of the memory cell
• s_t: hidden state
• y: final output
[Figure: LSTM Memory Cell]
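A sketch of one step of these equations (no peepholes); the shapes, initialization, and dictionary packing of the weights are illustrative assumptions:

```python
import numpy as np

def lstm_step(x_t, s_prev, c_prev, U, W, V):
    """One step of the LSTM equations above.
    U and W are dicts of input / recurrent weights for gates i, f, o and input g."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    i = sig(x_t @ U['i'] + s_prev @ W['i'])    # input gate
    f = sig(x_t @ U['f'] + s_prev @ W['f'])    # forget gate
    o = sig(x_t @ U['o'] + s_prev @ W['o'])    # output gate
    g = np.tanh(x_t @ U['g'] + s_prev @ W['g'])
    c_t = c_prev * f + g * i                   # constant error carousel update
    s_t = np.tanh(c_t) * o                     # hidden state
    z = s_t @ V
    y = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # softmax output
    return s_t, c_t, y

rng = np.random.default_rng(0)
d, H, M = 5, 8, 3
U = {k: rng.normal(0, 0.1, (d, H)) for k in 'ifog'}
W = {k: rng.normal(0, 0.1, (H, H)) for k in 'ifog'}
s, c = np.zeros(H), np.zeros(H)
s, c, y = lstm_step(rng.normal(size=d), s, c, U, W, rng.normal(0, 0.1, (H, M)))
print(y.round(3), y.sum().round(3))            # a distribution summing to 1
```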
LSTM architectures example
• Each green box is now an entire LSTM or GRU unit
• Also keep in mind each box is an array of units
Bidirectional LSTM
• Like the BRNN, but now the hidden nodes are LSTM units.
• Can have multiple layers of LSTM units in either direction
– It's also possible to have MLP feed-forward layers between the hidden layers..
• The output nodes (orange boxes) may be complete MLPs
Some typical problem settings
• Let's consider a few typical problems
• Issues:
– How to define the divergence()
– How to compute the gradient
– How to backpropagate
– Specific problem: The constant error carousel..
Time series prediction using NARX nets
• NARX networks are commonly used for scalar time-series prediction
– All boxes are scalar
– Sigmoid activations are commonly used in the hidden layer(s)
• Linear activation in the output layer
• The network is trained to minimize the L2 divergence between desired and actual output
– NARX networks are less susceptible to vanishing gradients than conventional RNNs
– Training often uses methods other than backprop/gradient descent, e.g. simulated annealing or genetic algorithms
Example of NARX Network
• "Solar and wind forecasting by NARX neural networks," Piazza, Piazza and Vitale, 2016
• Data: hourly global solar irradiation (MJ/m²), hourly wind speed (m/s) measured at two meters above ground level, and the mean hourly temperature recorded during seven years, from 2002 to 2008
• Target: Predict solar irradiation and wind speed from temperature readings
• Inputs may use either past predicted output values, past true values, or the past error in prediction
Example of NARX Network: Results
• Used a GA to train the net.
• NARX nets are generally the structure of choice for time-series prediction problems
Which open source project?
Language modelling using RNNs
• Problem: Given a sequence of words (or characters), predict the next one
Four score and seven years ???
A B R A H A M L I N C O L ??
Language modelling: Representing words
• Represent words as one-hot vectors
– Pre-specify a vocabulary of N words in fixed (e.g. lexical) order
• E.g. [ A AARDVARK AARON ABACK ABACUS … ZZYP ]
– Represent each word by an N-dimensional vector with N−1 zeros and a single 1 (in the position of the word in the ordered list of words)
• Characters can be similarly represented
– English will require about 100 characters, to include both cases, special characters such as commas, hyphens, apostrophes, etc., and the space character
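A minimal sketch of the one-hot encoding over a toy vocabulary (the vocabulary itself is illustrative):

```python
import numpy as np

vocab = ["A", "AARDVARK", "AARON", "ABACK", "ABACUS"]   # toy vocabulary in fixed order
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, N=len(vocab)):
    v = np.zeros(N)
    v[index[word]] = 1.0       # a single 1 at the word's position, N-1 zeros elsewhere
    return v

print(one_hot("AARON"))        # [0. 0. 1. 0. 0.]
```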
Predicting words
• Given one-hot representations of W_1 … W_{t−1}, predict W_t
• Dimensionality problem: All inputs W_1 … W_{t−1} are both very high-dimensional and very sparse
W_t = f(W_1, W_2, …, W_{t−1})
[Figure: N×1 one-hot vectors W_1, W_2, …, W_{t−1} fed into f() to predict W_t]
The one-hot representation
• The one-hot representation uses only N corners of the 2^N corners of a unit cube
– Actual volume of space used = 0
• (1, ε, δ) has no meaning except for ε = δ = 0
– Density of points: O(N / 2^N)
• This is a tremendously inefficient use of dimensions
[Figure: unit cube with one-hot corners (1,0,0), (0,1,0), (0,0,1)]
Why one-hot representation
• The one-hot representation makes no assumptions about the relative importance of words
– All word vectors are the same length
• It makes no assumptions about the relationships between words
– The distance between every pair of words is the same
Solution to dimensionality problem
• Project the points onto a lower-dimensional subspace
– The volume used is still 0, but the density can go up by many orders of magnitude
• Density of points: O(N / 2^M)
– If properly learned, the distances between projected points will capture semantic relations between the words
Solution to dimensionality problem
• This will also require a linear transformation (stretching/shrinking/rotation) of the subspace
The Projected word vectors
• Project the N-dimensional one-hot word vectors into a lower-dimensional space
– Replace every one-hot vector W_i by PW_i
– P is an M × N matrix
– PW_i is now an M-dimensional vector
– Learn P using an appropriate objective
• Distances in the projected space will reflect relationships imposed by the objective
W_t = f(PW_1, PW_2, …, PW_{t−1})
"Projection"
• P is a simple linear transform
• A single transform can be implemented as a layer of M neurons with linear activation
• The transforms that apply to the individual inputs are all M-neuron linear-activation subnets with tied weights
W_t = f(PW_1, PW_2, …, PW_{t−1})
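A sketch showing why the projection layer is just a lookup: multiplying a one-hot vector by P selects one column of P (sizes and values are illustrative; in practice P is learned):

```python
import numpy as np

N, M = 5, 2                          # vocabulary size, projected dimensionality
rng = np.random.default_rng(0)
P = rng.normal(0, 0.1, (M, N))       # the projection matrix, learned in practice

w = np.zeros(N); w[2] = 1.0          # one-hot vector for word number 2
# The projection is a linear layer with tied weights, i.e. an embedding table:
assert np.allclose(P @ w, P[:, 2])
print(P @ w)                         # the M-dimensional representation of the word
```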
Predicting words: The TDNN model
• Predict each word based on the past N words
– "A neural probabilistic language model", Bengio et al. 2003
– Hidden layer has tanh() activation, output is softmax
• One of the outcomes of learning this model is that we also learn low-dimensional representations PW of words
[Figure: each word W_5 … W_10 is predicted from the preceding words W_1 … W_9, all projected through the shared matrix P]
Alternative models to learn projections
• Soft bag of words: Predict a word based on the words in its immediate context
– Without considering specific position
• Skip-grams: Predict adjacent words based on the current word
• More on these in a future lecture
[Figure: context words W_4 … W_10 projected through P and mean-pooled to predict W_7; color indicates shared parameters]
Generating Language: The model
• The hidden units are (one or more layers of) LSTM units
• Trained via backpropagation from a lot of text
[Figure: words W_1 … W_9, projected through P, feed an LSTM that predicts the next word W_2 … W_10 at each step]
Generating Language: Synthesis
• On the trained model: Provide the first few words
– One-hot vectors
• After the last input word, the network generates a probability distribution over words
– Outputs an N-valued probability distribution rather than a one-hot vector
• Draw a word from the distribution
– And set it as the next word in the series
Generating Language: Synthesis
• Feed the drawn word as the next word in the series
– And draw the next word from the output probability distribution
• Continue this process until we terminate generation
– In some cases, e.g. generating programs, there may be a natural termination
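A sketch of this synthesis loop; the `step` model here is a toy stand-in (a fixed random next-word table), purely so the loop runs — any trained recurrent model with the same `(state, input) -> (state, probs)` interface would slot in:

```python
import numpy as np

def synthesize(step, project, seed_words, n_words, rng):
    """Feed the seed words, then repeatedly draw the next word from the
    output distribution and feed it back in as the next input."""
    state, probs = None, None
    for w in seed_words:
        state, probs = step(state, project(w))
    out = []
    for _ in range(n_words):
        w = rng.choice(len(probs), p=probs)     # draw a word from the distribution
        out.append(w)
        state, probs = step(state, project(w))  # set it as the next word in the series
    return out

rng = np.random.default_rng(0)
N = 6
T_mat = rng.dirichlet(np.ones(N), size=N)       # toy next-word distributions
step = lambda state, x: (None, T_mat[x])        # stand-in "trained model"
print(synthesize(step, lambda w: w, seed_words=[0, 3], n_words=8, rng=rng))
```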
Which open source project?
• Trained on Linux source code
• Actually uses a character-level model (predicts character sequences)
Composing music with RNN
http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/
Speech recognition using Recurrent Nets
• Recurrent neural networks (with LSTMs) can be used to perform speech recognition
– Input: Sequences of audio feature vectors
– Output: Phonetic label of each vector
[Figure: feature vectors X(t) in, phoneme labels P_1 … P_7 out, one per time step]
Speech recognition using Recurrent Nets
• Alternative: Directly output phoneme, character or word sequences
• Challenge: How to define the loss function to optimize for training
– Future lecture
– Also homework
CNN-LSTM-DNN for speech recognition
• Ensembles of RNN/LSTM, DNN, and ConvNets (CNN):
• T. Sainath, O. Vinyals, A. Senior, H. Sak. "Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks," ICASSP 2015.
Translating Videos to Natural Language Using Deep Recurrent Neural Networks
• Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. North American Chapter of the Association for Computational Linguistics, Denver, Colorado, June 2015.
Summary
• Recurrent neural networks are more powerful than MLPs
– Can use causal (one-direction) or non-causal (bidirectional) context to make predictions
– Potentially Turing complete
• LSTM structures are more powerful than vanilla RNNs
– Can "hold" memory for arbitrary durations
• Many applications
– Language modelling
• And generation
– Machine translation
– Speech recognition
– Time-series prediction
– Stock prediction
– Many others..
Not explained
• Can be combined with CNNs
– Lower-layer CNNs to extract features for the RNN
• Can be used in tracking
– Incremental prediction