Introduction to Fundamentals of RNNs
TRANSCRIPT
Icebreaker
Can we predict the future based on our current decisions?
“Which research direction should I take?”
Elvis Saravia
Outline
●Part 1: Review Neural Network Essentials
●Part 2: Sequential Modeling
●Part 3: Introduction to Recurrent Neural Networks
●Part 4: RNNs with TensorFlow
Perceptron Forward Pass
[Diagram: inputs x₀, x₁, …, xₙ are multiplied by weights w₀, w₁, …, wₙ, summed together with a bias b, and passed through a non-linearity f to produce the output y]

Computing output:

y = f( Σᵢ₌₀ᴺ xᵢ · wᵢ + b )

Vector form:

y = f(XW + b), where X = [x₀, x₁, …, xₙ] and W = [w₀, w₁, …, wₙ]

Non-linearity f: tanh, ReLU, sigmoid
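The forward pass can be sketched in a few lines of NumPy (a minimal illustration; the input, weight, and bias values below are made up):

```python
import numpy as np

def perceptron_forward(x, w, b, f=np.tanh):
    """Compute y = f(sum_i x_i * w_i + b), i.e. the vector form y = f(X.W + b)."""
    return f(np.dot(x, w) + b)

x = np.array([1.0, 0.5, -0.5])   # inputs x0..xn (made-up values)
w = np.array([0.2, -0.4, 0.1])   # weights w0..wn
b = 0.05                         # bias

y = perceptron_forward(x, w, b)  # tanh non-linearity
```

Swapping `f` for `np.maximum(0, ...)` (ReLU) or a sigmoid only changes the non-linearity; the weighted sum stays the same.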
Multi-Layer Perceptron (MLP)
[Diagram: an input layer (x₀–x₃), one hidden layer of ∑f neurons, and an output layer]
Deep Neural Networks (DNN)
[Diagram: an input layer (x₀–x₃), multiple hidden layers of ∑f neurons, and an output layer]
Neural Network (Summary)
• Neural networks learn features as the input is fed through the hidden layers, updating the weights through backpropagation (SGD)*.
• Data and activations flow in one direction through the hidden layers, and neurons within a layer never interact with each other.
* Stochastic Gradient Descent (SGD)
Drawbacks of NNs
●Lack sequence-modeling capabilities: they keep no record of past information (i.e., no memory), which is essential for modeling data with a sequential nature.
● Input is of fixed size (more on this later)
Sequences
“I took the bus this morning because it was very cold.”
Examples of sequences: stock price, speech waveform, sentence.
●Current values depend on previous values (e.g., melody notes, language rules)
●Order needs to be maintained to preserve meaning
●Sequences usually vary in length
Modeling Sequences
How to represent a sequence?
I love the coldness [ 1 1 0 1 0 0 1 ]
Bag of Words (BOW)
What’s the problem with the BOW representation?
Problem with BOW
Bag of words does not preserve order, so the order-dependent semantics of a sentence cannot be captured
“The food was good, not bad at all”
“The food was bad, not good at all”
vs
How to differentiate meaning of both sentences?
[ 1 1 0 1 1 0 1 0 1 0 0 1 1 ]
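The problem shows up directly in code: a simple bag-of-words count (a minimal sketch using Python's `Counter`, with a tiny vocabulary built from the sentences themselves) maps both sentences to exactly the same vector:

```python
from collections import Counter

def bow(sentence, vocab):
    """Count word occurrences, ignoring order entirely."""
    counts = Counter(sentence.lower().replace(",", "").split())
    return [counts[word] for word in vocab]

s1 = "The food was good, not bad at all"
s2 = "The food was bad, not good at all"
vocab = sorted(set(s1.lower().replace(",", "").split()))

v1, v2 = bow(s1, vocab), bow(s2, vocab)
print(v1 == v2)  # True: order is lost, so both sentences look identical
```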
One-Hot Encoding
● Preserves order by keeping each word's position within the feature vector
On Monday it was raining
[ 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 ]
We preserved order but what is the problem here?
Problem with One-Hot Encoding
● One-hot encoding cannot deal with variations of the same sequence.
On Monday it was raining
It was raining on Monday
[ 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 ]
[ 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 ]
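A sketch of this encoding (hypothetical five-word vocabulary, one one-hot block per word position) makes the issue concrete: order is preserved, but reordering the same words yields an entirely different vector:

```python
def one_hot_sequence(sentence, vocab):
    """Concatenate one one-hot vector per word, preserving word order."""
    vectors = []
    for word in sentence.lower().split():
        vectors.extend(1 if word == v else 0 for v in vocab)
    return vectors

vocab = ["it", "monday", "on", "raining", "was"]
a = one_hot_sequence("On Monday it was raining", vocab)
b = one_hot_sequence("It was raining on Monday", vocab)
print(a != b)  # True: same words, completely different 25-dimensional vectors
```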
Solution
Solution: we need a model that does not have to relearn the rules of language at each point in the sentence in order to preserve meaning.
On Monday it was raining
It was raining on Monday
No idea of state and what comes next!
What was learned at the beginning of the sentence will have to be relearned at the end of the sentence.
Markov Models
State and transitions can be modeled, so no matter where in the sentence we are, we have an idea of what comes next based on the transition probabilities.
Problem: Each state depends only on the last state. We can’t model long-term dependencies!
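A bigram Markov model can be sketched in a few lines (toy corpus and probabilities for illustration only); note that the prediction uses nothing but the current word:

```python
from collections import Counter, defaultdict

def bigram_model(corpus):
    """Estimate P(next word | current word) from bigram counts."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for cur, nxt in zip(words, words[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

corpus = ["on monday it was raining", "it was raining on monday"]
model = bigram_model(corpus)
print(model["it"])  # {'was': 1.0} -- the next state depends only on the current one
```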
Long-term dependencies
We need information from the far past and future to accurately model sequences.
In Italy, I had a great time and I learnt some of the _____ language
Part 3
Recurrent Neural Networks
(RNNs)
Recurrent Neural Networks (RNNs)
●RNNs model sequential information by assuming long-term dependencies between elements of a sequence.
●RNNs maintain word order and share parameters across the sequence (i.e., no need to relearn rules).
●RNNs are recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations.
●RNNs memorize information that has been computed so far, which in principle lets them handle long-term dependencies.
Applications of RNN
●Analyze time series data to predict stock market
●Speech recognition (e.g., Emotion Recognition from Acoustic features)
●Autonomous driving
●Natural Language Processing (e.g., Machine Translation, Question Answering)
Some examples
Google Magenta Project (Melody composer) – (https://magenta.tensorflow.org)
Sentence Generator - (http://goo.gl/onkPNd)
Image Captioning – (http://goo.gl/Nwx7Kh)
RNNs Main Components
- Recurrent neurons
- Unrolling recurrent neurons
- Layer of recurrent neurons
- Memory cell containing hidden state
Recurrent Neurons
[Diagram: a single recurrent neuron ∑f with input x and output y, feeding its output back into itself through hidden state h]

A simple recurrent neuron:
• receives input
• produces output
• sends output back to itself

Legend:
- x: input
- y: output (usually a vector of probabilities)
- ∑: sum(W · x) + bias
- f: activation function (e.g., ReLU, tanh, sigmoid)
- Wx: weights for inputs
- Wy: weights for outputs of the previous time step
- h: function of the current inputs and the previous time step
Unrolling/Unfolding recurrent neuron
[Diagram: the recurrent neuron on the left is unrolled through time into a chain of copies, one per time step (frame): at each step t, input xₜ and the previous hidden state hₜ₋₁ produce output yₜ and the new hidden state hₜ; the same weights Wx and Wy are applied at every step.]
RNNs remember previous state
[Diagram: the recurrent neuron at t=0, with input x₀ and output y₀]

x₀: vector representing the first word, "it"
Wx, Wy: weight matrices
y₀: output at t=0

h₀ = tanh(Wx·x₀ + Wy·h₋₁)   (t=0)

Can remember things from the past.
RNNs remember previous state
28
[Diagram: the recurrent neuron at t=1, with input x₁ and output y₁]

x₁: vector representing the second word, "was"
Wx, Wy: weight matrices stay the same, so they are shared across the sequence
y₁: output at t=1

h₀ = tanh(Wx·x₀ + Wy·h₋₁)
h₁ = tanh(Wx·x₁ + Wy·h₀)   (t=1)

Can remember things from t=0.
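The two steps can be reproduced with NumPy (a minimal sketch: the word vectors and weight matrices are random made-up values, but note that Wx and Wy are the same objects at both time steps):

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_units = 4, 3

Wx = rng.normal(size=(n_inputs, n_units))  # input weights, shared across time
Wy = rng.normal(size=(n_units, n_units))   # recurrent weights, shared across time

x0 = rng.normal(size=n_inputs)  # made-up vector for the first word, "it"
x1 = rng.normal(size=n_inputs)  # made-up vector for the second word, "was"
h_prev = np.zeros(n_units)      # h_{-1}: no past information at t=0

h0 = np.tanh(x0 @ Wx + h_prev @ Wy)  # t=0
h1 = np.tanh(x1 @ Wx + h0 @ Wy)      # t=1: h1 carries information from t=0
```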
Overview
[Diagram: the unrolled recurrent neuron fed with a mini-batch at each time step, e.g. X₀ = [[0,1,2], [3,4,5], [6,7,8], [9,0,1]] and X₁ = [[9,8,7], [0,0,0], [6,5,4], [3,2,1]]; each output yₜ (a scalar per instance) feeds into the next step.]

yₜ = f(Wx·xₜ + Wy·yₜ₋₁ + b)
Code Example
Where all the magic happens!
yₜ = f(Wx·xₜ + Wy·yₜ₋₁ + b)
Layer of Recurrent Neurons
[Diagram: a layer of recurrent neurons: the input x feeds every ∑f neuron in the layer, each neuron feeds its output back into the layer through hidden state h, and the output y is now a vector rather than a scalar.]
Unrolling Layer
[Diagram: the layer unrolled through time over inputs x₀, x₁, x₂ (mini-batches such as [1,2,3], [4,5,6], [7,8,9], [0,0,0]) producing outputs y₀, y₁, y₂, with weights Wx, Wy and hidden states h₀, h₁, h₂ carried across time steps.]

Vector form:

Yₜ = f([Xₜ Yₜ₋₁]·W + b),  where W = [Wx; Wy] stacked vertically
Yₜ = f(Xₜ·Wx + Yₜ₋₁·Wy + b)
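The two vector forms are equivalent, which can be verified numerically (a sketch with made-up shapes and random values):

```python
import numpy as np

rng = np.random.default_rng(1)
batch, n_inputs, n_neurons = 4, 3, 5
Xt = rng.normal(size=(batch, n_inputs))
Yprev = rng.normal(size=(batch, n_neurons))
Wx = rng.normal(size=(n_inputs, n_neurons))
Wy = rng.normal(size=(n_neurons, n_neurons))
b = rng.normal(size=n_neurons)

# form 1: separate weight matrices
Y_a = np.tanh(Xt @ Wx + Yprev @ Wy + b)

# form 2: concatenate [X_t  Y_{t-1}] and stack W = [Wx; Wy]
W = np.vstack([Wx, Wy])
Y_b = np.tanh(np.hstack([Xt, Yprev]) @ W + b)

print(np.allclose(Y_a, Y_b))  # True
```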
Variations of RNNs: Input / Output
●Sequence to sequence: stock price prediction and other time series
●Sequence to vector: sentiment analysis ([−1,+1]) and other classification tasks; the output is a vector of probabilities over classes (a.k.a. softmax)
●Vector to sequence: image captioning
●Encoder-Decoder: translation (encoder: sequence to vector; decoder: vector to sequence)
Training
[Diagram: the network unrolled over five time steps X₀…X₄ with shared parameters W, b at every step; a forward pass produces Y₀…Y₄, and the cost C(Y₂, Y₃, Y₄) with per-step losses J₂, J₃, J₄ is computed from the last outputs.]

- Forward pass
- Compute loss via cost function C
- Minimize loss by backpropagation through time (BPTT)

Summing the loss over time steps:

∂J/∂W = Σₜ ∂Jₜ/∂W

For one single time step t (shown here for t = 4):

∂J₄/∂W = Σₖ₌₀⁴ (∂J₄/∂y₄)(∂y₄/∂h₄)(∂h₄/∂hₖ)(∂hₖ/∂W)

Counting the contributions of W in previous time steps to the error at time step t (using the chain rule).
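The BPTT sum can be sketched end to end in NumPy (a toy cost J = ½Σ‖hₜ‖² and random weights, purely for illustration), with the analytic gradient checked against a numerical one:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_h = 5, 2, 3
xs = rng.normal(size=(T, n_in))
Wx = rng.normal(size=(n_in, n_h)) * 0.5

def forward(Wh):
    """Run the RNN forward and return hidden states plus a toy loss."""
    hs = [np.zeros(n_h)]
    for t in range(T):
        hs.append(np.tanh(xs[t] @ Wx + hs[-1] @ Wh))
    loss = 0.5 * sum(float(h @ h) for h in hs[1:])
    return hs, loss

Wh = rng.normal(size=(n_h, n_h)) * 0.5
hs, loss = forward(Wh)

# BPTT: walk backwards, accumulating each time step's contribution to dJ/dWh
dWh = np.zeros_like(Wh)
dh_next = np.zeros(n_h)
for t in reversed(range(T)):
    dh = hs[t + 1] + dh_next          # direct loss gradient + gradient from t+1
    dz = dh * (1 - hs[t + 1] ** 2)    # back through tanh
    dWh += np.outer(hs[t], dz)        # contribution of W at time step t
    dh_next = dz @ Wh.T               # pass gradient on to h_{t-1}

# sanity-check one entry against a numerical gradient
eps = 1e-5
Wp, Wm = Wh.copy(), Wh.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
num = (forward(Wp)[1] - forward(Wm)[1]) / (2 * eps)
print(np.isclose(dWh[0, 0], num))  # True
```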
Drawbacks
Main problem: Vanishing gradients (gradients get too small)
Intuition: As sequences get longer, gradients tend to get too small during the backpropagation process.
∂Jₙ/∂W = Σₖ₌₀ⁿ (∂Jₙ/∂yₙ)(∂yₙ/∂hₙ)(∂hₙ/∂hₖ)(∂hₖ/∂W)

where the factor ∂hₙ/∂hₖ expands into a long product of Jacobians, e.g. for k = 0:

∂hₙ/∂hₖ = (∂hₙ/∂hₙ₋₁)(∂hₙ₋₁/∂hₙ₋₂) ··· (∂h₂/∂h₁)(∂h₁/∂h₀)
We are just multiplying a lot of small numbers together
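The intuition can be made concrete: the derivative of tanh, 1 − tanh²(z), is at most 1 and usually much smaller, so a product of many such factors, one per time step, collapses toward zero (a toy illustration with random pre-activations):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=50)         # 50 made-up pre-activations, one per time step
factors = 1 - np.tanh(z) ** 2   # each term plays the role of dh_t/dh_{t-1}

product = np.prod(factors)      # stands in for dh_n/dh_0 over a 50-step sequence
print(product)                  # a vanishingly small number
```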
Solutions
Long Short-Term Memory (LSTM) networks:
Deal with the vanishing gradient problem, and are therefore more reliable for modeling long-term dependencies, especially for very long sequences.
RNN Extensions
Extended Readings:
○ Bidirectional RNNs – passing states in both directions
○ Deep (Bidirectional) RNNs – stacking RNNs
○ LSTM networks – Adaptation of RNNs
In general
●RNNs are great for analyzing sequences of any arbitrary length.
●RNNs are considered “anticipatory” models.
●RNNs are also considered creative learning models: they can, for example, predict a set of musical notes to play next in a melody and select an appropriate one.
Demo
●Building RNNs in TensorFlow
●Training RNNs in TensorFlow
● Image Classification
●Text Classification
References
● Introduction to RNNs - http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
●NTHU Machine Learning Course - https://goo.gl/B4EqMi
●Hands-On Machine Learning with Scikit-Learn and TensorFlow (Book)