
Page 1:

CENG 783

Special Topics in Deep Learning

Week 13

Sinan Kalkan

© AlchemyAPI

Page 2:

Why do we embed words?

• 1-of-n (one-hot) encoding is not suitable to learn from:

It is sparse

Similar words have different representations

Compare the pixel-based representation of images: Similar images/objects have similar pixels

• Embedding words in a map allows

Encoding them with fixed-length vectors

“Similar” words having similar representations

Complex reasoning between words, e.g.:

king - man + woman = queen

Table: https://devblogs.nvidia.com/parallelforall/understanding-natural-language-deep-neural-networks-using-torch/
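The analogy above can be checked directly with vector arithmetic and cosine similarity. The sketch below is a minimal numpy illustration; the tiny vecs table is made up for the example and stands in for a real pre-trained embedding matrix (e.g., word2vec or GloVe).

import numpy as np

# Toy stand-in for a real embedding table (e.g., word2vec/GloVe vectors).
vecs = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "man":   np.array([0.7, 0.1, 0.0]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.6, 1.0]),
    "apple": np.array([0.0, 0.9, 0.4]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# king - man + woman should land closest to queen.
query = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, vecs[w]))
print(best)  # expected: queen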

Page 3:

Two different ways to train

1. Using context to predict a target word (~ continuous bag-of-words)

2. Using a word to predict a target context (skip-gram)

Produces more accurate results on large datasets

• If the vector for a word cannot predict the context, the mapping to the vector space is adjusted

• Since similar words should predict the same or similar contexts, their vector representations should end up being similar

http://deeplearning4j.org/word2vec
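Both training schemes are available off the shelf; a minimal sketch using gensim's Word2Vec is given below, assuming gensim >= 4 (the sg flag switches between CBOW and skip-gram). The toy corpus is a placeholder.

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (a real corpus would be much larger).
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "walks"],
    ["a", "woman", "walks"],
]

# sg=0: CBOW (context predicts target word); sg=1: skip-gram (word predicts context).
cbow      = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skip_gram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(skip_gram.wv.most_similar("king", topn=3))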

Page 4:

Example: Image Captioning

Fig: https://github.com/karpathy/neuraltalk2

Page 5:

Overview

Pre-trained word embedding is also used

Pre-trained CNN (e.g., on ImageNet)

Image: Karpathy

Page 6:

Example: Neural Machine Translation

Page 7:

Neural Machine Translation

• Model

Sutskever et al. 2014 Haitham Elmarakeby

Each box is an LSTM or GRU cell.

Page 8:

Today

• Finalize RNNs

Echo State Networks

Time Delay Networks

• Neural Turing Machines

• Autoencoders

• NOTES:

Final Exam date: 16 January, 17:00

Lecture on 31st of December (Monday)

Project deadline (w/o Incomplete period): 26th of January

Project deadline (w Incomplete period): 2nd of February

Page 9:

On deep learning this week

• https://arxiv.org/pdf/1812.08775.pdf

Page 10:

On deep learning this week

• https://openreview.net/pdf?id=B1l6qiR5F7

Page 11:

On deep learning this week

• https://openreview.net/pdf?id=S1xq3oR5tQ

Page 12:

Echo State Networks

Reservoir Computing

Page 13:

Motivation

• “Schiller and Steil (2005) also showed that in traditional training methods for RNNs, where all weights (not only the output weights) are adapted, the dominant changes are in the output weights. In cognitive neuroscience, a related mechanism has been investigated by Peter F. Dominey in the context of modelling sequence processing in mammalian brains, especially speech recognition in humans (e.g., Dominey 1995, Dominey, Hoen and Inui 2006). Dominey was the first to explicitly state the principle of reading out target information from a randomly connected RNN. The basic idea also informed a model of temporal input discrimination in biological neural networks (Buonomano and Merzenich 1995).”

http://www.scholarpedia.org/article/Echo_state_network

Page 14:

Echo State Networks (ESN)

• Reservoir of a set of neurons

Randomly initialized and fixed

Run input sequence through the network and keep the activations of the reservoir neurons

Calculate the “readout” weights using linear regression.

• Has the benefits of recurrent connections/networks

• No problem of vanishing gradients (Li et al., 2015).

Page 15:

The reservoir

• Provides non-linear expansion

This provides a “kernel” trick.

• Acts as a memory

• Parameters:

𝑊𝑖𝑛, 𝑊 and 𝛼 (leaking rate).

• Global parameters:

Number of neurons: The more the better.

Sparsity: Connect a neuron to a fixed but small number of neurons.

Distribution of the non-zero elements: Uniform or Gaussian distribution. 𝑊𝑖𝑛 is denser than 𝑊.

Spectral radius of W: Maximum absolute eigenvalue of 𝑊, or the width of the distribution of its non-zero elements.

Should be less than 1. Otherwise, chaotic, periodic or multiple fixed-point behavior may be observed.

For problems with large memory requirements, it should be bigger than 1.

Scale of the input weights.

Page 16:
Page 17:

Training ESN

Overfitting (regularization): the readout fit can be regularized, e.g., with ridge (Tikhonov) regression, as in the sketch below.
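The state-update and readout equations were on the figure-only slides; the sketch below is a minimal leaky-integrator ESN with a ridge-regression readout, following the standard formulation (e.g., Lukoševičius's practical guide). The reservoir size, leak rate, spectral radius and regularization constant are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, alpha, rho = 1, 200, 0.3, 0.9   # input size, reservoir size, leak rate, spectral radius

# Random, fixed weights: W_in (dense) and W (sparse recurrent matrix).
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in + 1))            # +1 for a bias input
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= (rng.random((n_res, n_res)) < 0.1)                      # sparsity: ~10% of the connections kept
W *= rho / max(abs(np.linalg.eigvals(W)))                    # rescale to the desired spectral radius

def run_reservoir(u_seq):
    """Collect reservoir states for an input sequence u_seq of shape (T, n_in)."""
    x = np.zeros(n_res)
    states = []
    for u in u_seq:
        pre = W_in @ np.concatenate(([1.0], u)) + W @ x
        x = (1 - alpha) * x + alpha * np.tanh(pre)            # leaky-integrator update
        states.append(x.copy())
    return np.array(states)

# Toy task: predict the next value of a sine wave.
t = np.arange(0, 60, 0.1)
u_seq = np.sin(t)[:-1, None]
target = np.sin(t)[1:]

X = run_reservoir(u_seq)                                      # (T, n_res)
beta = 1e-6                                                   # ridge regularization (against overfitting)
W_out = np.linalg.solve(X.T @ X + beta * np.eye(n_res), X.T @ target)
print("train MSE:", np.mean((X @ W_out - target) ** 2))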

Page 18:

Beyond echo state networks

• Good aspects of ESNs: They can be trained very fast because they just fit a linear model.

• They demonstrate that it's very important to initialize weights sensibly.

• They can do impressive modeling of one-dimensional time-series, but they cannot compete seriously for high-dimensional data.

• Bad aspects of ESNs: They need many more hidden units for a given task than an RNN that learns the hidden-to-hidden weights.

Slide: Hinton

Page 19:

Similar models

• Liquid State Machines (Maass et al., 2002)

A spiking version of Echo-state networks

• Extreme Learning Machines

Feed-forward network with a hidden layer.

Input-to-hidden weights are randomly initialized and never updated

Page 20:

Time Delay Neural Networks

Page 21:

Fig: https://www.willamette.edu/~gorr/classes/cs449/Temporal/tappedDelay.htm

Page 22:

Skipped points

Page 23:

Skipping

• Stability

• Continuous-time recurrent networks

• Attractor networks

Page 24:
Page 25:
Page 26:

Neural Turing Machines

Page 27:

Why need other mechanisms?

• We mentioned before that RNNs are Turing Complete, right?

• The issues are:

The vanishing/exploding gradients (LSTM and other tricks address these issues)

However, the number of parameters in LSTMs increases with the number of layers

Despite its advantages, LSTM still fails to generalize to sequences longer than the training sequences

The answer to addressing bigger networks with fewer parameters is a better abstraction of the computational components, e.g., in a form similar to Turing machines

Weston et al., 2015

Page 28:

Turing Machine

Fig: Ucoluk & Kalkan, 2012

Wikipedia:

Page 29:

Neural Turing Machines

• If we make every component differentiable, we can train such a complex machine

• Accessing only a part of the network is problematic: unlike a computer (a TM), we need a 'blurry' access mechanism

Graves et al., 2014

Page 30:

Neural Turing Machines: Reading

• Let memory $\mathbf{M}$ be an $N \times M$ matrix

$N$: the number of "rows" (memory locations)

$M$: the size of each row (vector)

• Let $\mathbf{M}_t$ be the memory state at time $t$

• $\mathbf{w}_t$: a vector of weightings over the $N$ locations emitted by the read head at time $t$. The weights are normalized:

$$\sum_i w_t(i) = 1, \qquad 0 \le w_t(i) \le 1 \ \ \forall i$$

• $\mathbf{r}_t$: the read vector of length $M$:

$$\mathbf{r}_t \leftarrow \sum_i w_t(i)\, \mathbf{M}_t(i)$$

which is differentiable, and therefore, trainable.

Page 31:

Neural Turing Machines: Writing

• Writing = erasing content + adding new content. Inspired by LSTM's forgetting and addition gates.

• Erasing: Multiply with an erase vector $\mathbf{e}_t \in [0,1]^M$:

$$\mathbf{M}_t(i) \leftarrow \mathbf{M}_{t-1}(i)\,\big[\mathbf{1} - w_t(i)\,\mathbf{e}_t\big]$$

$\mathbf{1}$: vector of ones. The multiplication here is pointwise.

• Adding: Add an add vector $\mathbf{a}_t \in [0,1]^M$:

$$\mathbf{M}_t(i) \leftarrow \mathbf{M}_t(i) + w_t(i)\,\mathbf{a}_t$$
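A minimal numpy sketch of the read and erase-then-add write operations defined above; the memory size, weighting and example vectors are arbitrary.

import numpy as np

N, M = 8, 4                       # number of memory rows, row size
memory = np.zeros((N, M))
w = np.full(N, 1.0 / N)           # attention weights over rows (sum to 1)

def read(memory, w):
    # r_t <- sum_i w_t(i) * M_t(i)
    return w @ memory

def write(memory, w, erase, add):
    # Erase: M_t(i) <- M_{t-1}(i) * (1 - w_t(i) * e_t), then Add: + w_t(i) * a_t
    memory = memory * (1.0 - np.outer(w, erase))
    memory = memory + np.outer(w, add)
    return memory

erase = np.ones(M)                # fully erase wherever the head attends
add = np.array([1.0, 2.0, 3.0, 4.0])
memory = write(memory, w, erase, add)
print(read(memory, w))            # a blurry (weighted) read over all rows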

Page 32:

Neural Turing Machines: Addressing

• Content-based addressing

• Location-based addressing

In a sense, use variable “names” to access content

Page 33:

Neural Turing Machines: Content-based Addressing

• Each head (a reading or writing head) produces a length-$M$ key vector $\mathbf{k}_t$. $\mathbf{k}_t$ is compared to each vector $\mathbf{M}_t(i)$ using a similarity measure $K(\cdot,\cdot)$, e.g., cosine similarity:

$$K(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$$

• From these similarity measures, we obtain a vector of "addressing" weights:

$$w_t^c(i) \leftarrow \frac{\exp\big(\beta_t K(\mathbf{k}_t, \mathbf{M}_t(i))\big)}{\sum_j \exp\big(\beta_t K(\mathbf{k}_t, \mathbf{M}_t(j))\big)}$$

$\beta_t$: amplifies or attenuates the precision of the focus
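The content-addressing equation transcribes almost directly into code; below is a small numpy sketch (the memory contents, key and beta are arbitrary).

import numpy as np

def cosine_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

def content_addressing(memory, key, beta):
    # w^c_t(i) = softmax_i( beta_t * K(k_t, M_t(i)) )
    scores = beta * np.array([cosine_sim(key, row) for row in memory])
    e = np.exp(scores - scores.max())        # numerically stable softmax
    return e / e.sum()

memory = np.random.default_rng(0).normal(size=(8, 4))
key = memory[3] + 0.1                              # a key close to row 3
print(content_addressing(memory, key, beta=5.0))   # weights should peak at index 3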

Page 34:

Neural Turing Machines: Location-based Addressing

• Important for e.g. iteration over memory locations, or jumping to an arbitrary memory location

• First: Interpolation between the addressing schemes using an "interpolation gate" $g_t$:

$$\mathbf{w}_t^g \leftarrow g_t\, \mathbf{w}_t^c + (1 - g_t)\, \mathbf{w}_{t-1}$$

If $g_t = 1$: the weight from the content-addressable component is used

If $g_t = 0$: the weight from the previous step is used

• Second: rotationally shift the weighting to achieve location-based addressing, using a circular convolution:

$$\hat{w}_t(i) \leftarrow \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j)$$

$\mathbf{s}_t$: shift amount. Three elements for how "much" to shift left, shift right, or keep as it is.

The weighting needs to be "sharp". To keep it sharp, each head emits a scalar $\gamma_t \ge 1$:

$$w_t(i) \leftarrow \frac{\hat{w}_t(i)^{\gamma_t}}{\sum_j \hat{w}_t(j)^{\gamma_t}}$$
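Continuing the sketch for the location-based part: interpolation, circular shift and sharpening. A 3-element shift kernel over the shifts {-1, 0, +1} is assumed, which is one common choice; all the example values are arbitrary.

import numpy as np

def location_addressing(w_c, w_prev, g, s, gamma):
    # 1) Interpolate: w^g = g * w^c + (1 - g) * w_{t-1}
    w_g = g * w_c + (1 - g) * w_prev
    # 2) Rotational (circular) shift: w_hat(i) = sum_j w^g(j) s(i - j)
    N = len(w_g)
    w_hat = np.zeros(N)
    for i in range(N):
        for k, shift in zip(s, (-1, 0, +1)):     # s = [s(-1), s(0), s(+1)]
            w_hat[i] += w_g[(i - shift) % N] * k
    # 3) Sharpen: w(i) = w_hat(i)^gamma / sum_j w_hat(j)^gamma
    w_sharp = w_hat ** gamma
    return w_sharp / w_sharp.sum()

w_c = np.array([0.05, 0.8, 0.05, 0.05, 0.05])    # content weights peaked at index 1
w_prev = np.full(5, 0.2)
s = np.array([0.0, 0.0, 1.0])                    # "shift right by one"
print(location_addressing(w_c, w_prev, g=1.0, s=s, gamma=2.0))  # peak moves to index 2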

Page 35:

Neural Turing Machines: Controller Network

• Free parameters

The size of the memory

Number of read-write heads

Range of allowed rotation shifts

Type of the neural network for controller

• Alternatives:

A recurrent network such as LSTM with its own memory

These memory units might be considered like “registers” on the CPU

A feed-forward network

Can use the memory to achieve recurrence

More transparent

Page 36:

Neural Turing Machines: Training

• Binary targets: Logistic sigmoid output layers

Cross-entropy loss

• Other schemes possible

• Tasks: Copy from input to output

Repeat Copy: Make n copies of the input

Associative recall: Present a part of a sequence to recall the remaining part

N-gram: Learn distribution of 6-grams and make predictions for the next bit based on this distribution

Priority sort: Associate a priority with each input vector; the target is the sequence ordered according to the priorities

Page 37:

Neural Turing Machines: Training

Page 38:

Reinforced version

Page 39:

Other variants/attempts

Page 40:
Page 41:

Neural Programmer

Page 42:

Pointer networks

Vinyals et al., 2015

Page 43:

Memory Networks

Page 44:

Universal Turing Machine

2015

Page 45:

2017

Page 46:

Newer studies

• https://deepmind.com/blog/differentiable-neural-computers/

• Differentiable Neural Computers

Page 47:

Unsupervised pre-training with Auto-encoders

Page 48:

Now

• Manifold Learning

Principal Component Analysis

Independent Component Analysis

• Autoencoders

• Sparse autoencoders

• K-sparse autoencoders

• Denoising autoencoders

• Contractive autoencoders

Page 49:

Manifold Learning

Page 50:

Manifold Learning

• Discovering the "hidden" structure in the high-dimensional space

• Manifold: "hidden" structure.

• Non-linear dimensionality reduction

http://www.convexoptimization.com/dattorro/manifold_learning.html

Page 51:

Manifold Learning

• Many approaches:

Self-Organizing Map (Kohonen map/network)

Auto-encoders

Principal curves & manifolds: Extension of PCA

Kernel PCA, Nonlinear PCA

Curvilinear Component Analysis

Isomap: Floyd-Warshall + Multidimensional scaling

Data-driven high-dimensional scaling

Locally-linear embedding

Page 52:

Manifold learning

• Autoencoders learn lower-dimensional manifolds embedded in higher-dimensional manifolds

• Assumption: “Natural data in high dimensional spaces concentrates close to lower dimensional manifolds”

Natural images occupy a very small fraction in a space of possible images

(Pascal Vincent)

Page 53:

Manifold Learning

• Many approaches:

Self-Organizing Map (Kohonen map/network)

https://en.wikipedia.org/wiki/Self-organizing_map

Page 54:

(Pascal Vincent)

Page 55:

(Pascal Vincent)

Page 56:

Principal Component Analysis (PCA)

• Principal Components: Orthogonal directions with the most variance

Eigenvectors of the covariance matrix

• Mathematical background:

Orthogonality: Two vectors $\vec{u}$ and $\vec{v}$ are orthogonal iff $\vec{u} \cdot \vec{v} = 0$

Variance:

$$\sigma_X^2 = \mathrm{Var}(X) = E[(X-\mu)^2] = \sum_i p(x_i)\,(x_i - \mu)^2$$

where the (weighted) mean is $\mu = E[X] = \sum_i p(x_i)\, x_i$.

If $p(x_i) = 1/N$:

$$\mathrm{Var}(X) = \frac{1}{N}\sum_i (x_i - \mu)^2, \qquad \mu = \frac{1}{N}\sum_i x_i$$

(Ole Winther)

[Pearson 1901] [Hotelling 1933]

Page 57:

Mathematical background for PCA: Covariance

• Co-variance:

Measures how two random variables change with respect to each other:

$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = \frac{1}{N}\sum_i (x_i - E[X])(y_i - E[Y])$$

If big values of X & big values of Y "co-occur" and small values of X & small values of Y "co-occur": high co-variance.

Otherwise, small co-variance.

Page 58:

Mathematical background for PCA: Covariance Matrix

• Co-variance Matrix:

Denoted usually by Σ

For an 𝑛-dimensional space:

$$\Sigma_{ij} = \mathrm{Cov}(X_i, X_j) = E\big[(X_i - \mu_i)(X_j - \mu_j)\big]$$

Page 59:

Mathematical background for PCA: Covariance Matrix

• Co-variance Matrix: $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j) = E\big[(X_i - \mu_i)(X_j - \mu_j)\big]$

• Properties

(Wikipedia)

Page 60:

Mathematical background for PCA: Eigenvectors & Eigenvalues

• Eigenvectors and eigenvalues:

$\vec{v}$ is an eigenvector of a square matrix $A$ if

$$A\vec{v} = \lambda\vec{v}$$

where $\lambda$ is the eigenvalue (scalar) associated with $\vec{v}$.

• Interpretation:

The "transformation" $A$ does not change the direction of the vector; it only changes the vector's scale, by the eigenvalue.

• Solution:

$(A - \lambda I)\vec{v} = 0$ has a non-trivial solution when the determinant $|A - \lambda I|$ is zero.

Find the eigenvalues, then plug in those values to get the eigenvectors.

Page 61:

Mathematical background for PCA: Eigenvectors & Eigenvalues Example

• Setting the determinant $|A - \lambda I|$ to zero gives the characteristic polynomial.

• The roots: $\lambda = 1$ and $\lambda = 3$

• Plugging $\lambda = 1$ back in gives $\mathbf{v}_1 = \{1, -1\}$; plugging in $\lambda = 3$ gives $\mathbf{v}_2 = \{1, 1\}$.

Example from Wikipedia.
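The matrix itself appears only as an image on the slide; assuming it is the standard 2x2 example from the Wikipedia article, A = [[2, 1], [1, 2]] (which indeed has eigenvalues 1 and 3 with eigenvectors proportional to {1, -1} and {1, 1}), the result can be checked numerically:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])        # assumed matrix; the slide shows it only as an image

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                 # [3. 1.] (order may vary)
print(eigenvectors)                # columns proportional to [1, 1] and [1, -1]

# Check A v = lambda v for each pair.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)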

Page 62:

PCA also allows dimensionality reduction

• Discard components whose eigenvalue is negligible.

See the following tutorial for more on PCA:

http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
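Putting the previous slides together (centering, covariance, eigendecomposition, discarding components with negligible eigenvalues), a minimal PCA sketch on synthetic data:

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3-D data that really lives close to a 2-D plane.
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(500, 3))

# 1) Center the data, 2) compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = (Xc.T @ Xc) / (len(Xc) - 1)           # same as np.cov(Xc, rowvar=False)

# 3) Eigendecomposition; eigh is used because the covariance matrix is symmetric.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]        # sort by decreasing variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
print("explained variance ratios:", eigenvalues / eigenvalues.sum())

# 4) Dimensionality reduction: keep the k components with non-negligible eigenvalues.
k = 2
Z = Xc @ eigenvectors[:, :k]                 # projected, 2-D representation
print(Z.shape)                               # (500, 2)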

Page 63:

Independent Component Analysis (ICA)

• PCA assumes Gaussianity:

Data along a component should be explainable by a mean and a variance.

This may be violated by real signals in nature.

• ICA:

Blind-source separation of non-Gaussian and mutually-independent signals.

• Mutual independence:

https://cnx.org/contents/-[email protected]:gFEtO206@1/Independent-Component-Analysis

Page 64:

Autoencoders

Page 65:

Autoencoders

• Universal approximators (so are Restricted Boltzmann Machines)

• Unsupervised learning

• Dimensionality reduction

• $\mathbf{x} \in \mathbb{R}^D \Rightarrow \mathbf{h} \in \mathbb{R}^M$ s.t. $M < D$

Page 66:

(Pascal Vincent)

Page 67:

(Pascal Vincent)

Page 68:

(Pascal Vincent)

Page 69:

(Pascal Vincent)

Page 70:

(Pascal Vincent)

Page 71:

Stacking autoencoders: learn the first layer

http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders

Page 72:

Stacking autoencoders: learn the second layer

http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders

Page 73:

Stacking autoencoders: add, e.g., a softmax layer for mapping to the output

http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders

Page 74:

Stacking autoencoders: Overall

http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders

Page 75:

(Pascal Vincent)

Page 76:

(Pascal Vincent)

Page 77:

(Pascal Vincent)

Page 78:

Making auto-encoders learn over-complete representations (that are not one-to-one mappings)

Page 79:

Wait, what do we mean by over-complete?

• Remember distributed representations?

Figure credit: Moontae Lee

Distributed vs. not distributed

Page 80:

• Four categories could also be represented by two neurons:

Distributed vs. undercomplete vs. overcomplete representations

Distributed (undercomplete) | Not distributed (overcomplete) | Distributed (overcomplete)

Page 81:

Over-complete = sparse (in distributed representations)

• Why sparsity?

1. Because our brain relies on sparse coding. Why does it do so?

a. Because it is adapted to an environment which is composed of and can be sensed through the combination of primitive items/entities.

b. "Sparse coding may be a general strategy of neural systems to augment memory capacity. To adapt to their environments, animals must learn which stimuli are associated with rewards or punishments and distinguish these reinforced stimuli from similar but irrelevant ones. Such task requires implementing stimulus-specific associative memories in which only a few neurons out of a population respond to any given stimulus and each neuron responds to only a few stimuli out of all possible stimuli."

– Wikipedia

c. Theoretically, it has been shown that sparsity increases the capacity of memory.

Page 82:

Over-complete = sparse (in distributed representations)

• Why sparsity?

2. Because of information theoretical aspects:

Sparse codes have lower entropy compared to non-sparse ones.

3. It is easier for the consecutive layers to learn from sparse codes, compared to non-sparse ones.

Page 83:

Olshausen & Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?", 1997

Page 84:

Mechanisms for enforcing over-completeness

• Use stochastic gradient descent

• Add a sparsity constraint: into the loss function (sparse autoencoder), or in a hard manner (k-sparse autoencoder)

• Add stochasticity / randomness (add noise): Denoising Autoencoders, Contractive Autoencoders

Restricted Boltzmann Machines

Page 85:

Auto-encoders with SGD

Page 86:

Simple neural network

• Input: $\mathbf{x} \in \mathbb{R}^n$

• Hidden layer: $\mathbf{h} \in \mathbb{R}^m$

$\mathbf{h} = f_1(W_1 \mathbf{x})$

• Output layer: $\mathbf{y} \in \mathbb{R}^n$

$\mathbf{y} = f_2(W_2 f_1(W_1 \mathbf{x}))$

• Squared-error loss:

$$L = \frac{1}{2} \sum_{d \in D} \|\mathbf{x}_d - \mathbf{y}_d\|^2$$

• For training, use SGD.

• You may try different activation functions for $f_1$ and $f_2$.

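A minimal PyTorch sketch of this autoencoder, with a sigmoid for f1, a linear f2, squared-error loss and plain mini-batch SGD. The sizes and the random training data are placeholders, and nn.Linear adds bias terms that the equations above omit.

import torch
import torch.nn as nn

n, m = 20, 8                                         # input dim, hidden (bottleneck) dim
f1 = nn.Sequential(nn.Linear(n, m), nn.Sigmoid())    # h = f1(W1 x)
f2 = nn.Linear(m, n)                                 # y = f2(W2 h), linear decoder
model = nn.Sequential(f1, f2)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()                               # squared-error reconstruction loss

X = torch.rand(256, n)                               # placeholder training data
for epoch in range(100):
    for batch in X.split(32):                        # mini-batch SGD
        y = model(batch)
        loss = loss_fn(y, batch)                     # reconstruct the input itself
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
print(loss.item())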

Page 87:

Sparse Autoencoders

Page 88:

Sparse autoencoders

• Input: $\mathbf{x} \in \mathbb{R}^n$

• Hidden layer: $\mathbf{h} \in \mathbb{R}^m$

$\mathbf{h} = f_1(W_1 \mathbf{x})$

• Output layer: $\mathbf{y} \in \mathbb{R}^n$

$\mathbf{y} = f_2(W_2 f_1(W_1 \mathbf{x}))$

Over-completeness and sparsity:

• Require $m > n$, and

hidden neurons that produce only little activation for any input, i.e., sparsity.

• How to enforce sparsity?

Page 89:

Enforcing sparsity: alternatives

• How?

• Solution 1: an L1 penalty $\lambda |w|$. We have seen before that this enforces sparsity; however, it is not strong enough.

• Solution 2: Limit the average total activation of a neuron throughout training.

• Solution 3: Kurtosis:

$$\frac{\mu_4}{\sigma^4} = \frac{E[(X-\mu)^4]}{\big(E[(X-\mu)^2]\big)^2}$$

Calculated over the activations of the whole network. High kurtosis → sparse activations.

"Kurtosis has only been studied for response distributions of model neurons where negative responses are allowed. It is unclear whether kurtosis is actually a sensible measure for realistic, non-negative response distributions." - http://www.scholarpedia.org/article/Sparse_coding

• And many, many other ways…

Page 90:

Enforcing sparsity: a popular choice

• Limit the amount of total activation for a neuron throughout training!

• Use $\rho_i^{(d)}$ to denote the activation of hidden neuron $i$ on training sample $d$. The average activation of the neuron over a training set of $m$ samples:

$$\hat{\rho}_i = \frac{1}{m} \sum_{d=1}^{m} \rho_i^{(d)}$$

• Now, to enforce sparsity, we limit $\hat{\rho}_i$ to $\rho_0$.

• 𝜌0: A small value. Yet another hyperparameter which may be tuned.

typical value: 0.05.

• The neuron must be inactive most of the time to keep its activations under the limit.

Page 91:

Enforcing sparsity

$$\hat{\rho}_i = \frac{1}{m} \sum_{d=1}^{m} \rho_i^{(d)}$$

• How do we limit $\hat{\rho}_i$ to $\rho_0$? How do we integrate this as a penalty term into the loss function?

$\rho_0$ is called the sparsity parameter.

• Use the Kullback-Leibler divergence:

$$\sum_i KL(\rho_0 \,\|\, \hat{\rho}_i)$$

or, equivalently (since this is the divergence between two Bernoulli variables with means $\rho_0$ and $\hat{\rho}_i$):

$$\sum_i \left[ \rho_0 \log\frac{\rho_0}{\hat{\rho}_i} + (1 - \rho_0)\log\frac{1 - \rho_0}{1 - \hat{\rho}_i} \right]$$

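A small numpy sketch of this penalty and of its gradient with respect to the average activations (the natural-log version, which is the one used on the backpropagation slide that follows); rho_hat would be computed from a forward pass over the training set, and the values of rho_0 and beta here are arbitrary.

import numpy as np

def kl_sparsity_penalty(rho_hat, rho0=0.05, beta=3.0):
    """S = beta * sum_i [ rho0*log(rho0/rho_hat_i) + (1-rho0)*log((1-rho0)/(1-rho_hat_i)) ]"""
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)          # avoid log(0)
    return beta * np.sum(rho0 * np.log(rho0 / rho_hat)
                         + (1 - rho0) * np.log((1 - rho0) / (1 - rho_hat)))

def kl_sparsity_grad(rho_hat, rho0=0.05, beta=3.0):
    """dS/d(rho_hat_i) = beta * ( -rho0/rho_hat_i + (1-rho0)/(1-rho_hat_i) )"""
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)
    return beta * (-rho0 / rho_hat + (1 - rho0) / (1 - rho_hat))

# Average activation of each hidden unit over the training set (placeholder values).
rho_hat = np.array([0.04, 0.05, 0.30, 0.90])
print(kl_sparsity_penalty(rho_hat))    # large penalty mainly due to the 0.30 and 0.90 units
print(kl_sparsity_grad(rho_hat))       # pushes over-active units' activations down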

Page 92:

Backpropagation and training

Reminder:

- For each hidden unit $h$, calculate its error term $\delta_h$:

$$\delta_h = o_h (1 - o_h) \sum_{k \in \text{outputs}} w_{kh}\,\delta_k$$

- Update every weight $w_{ji}$: $w_{ji} = w_{ji} + \eta\,\delta_j\,x_{ji}$

The sparsity penalty:

$$S = \beta \sum_i \left[ \rho_0 \log\frac{\rho_0}{\hat{\rho}_i} + (1 - \rho_0)\log\frac{1 - \rho_0}{1 - \hat{\rho}_i} \right]$$

• With base-10 logarithms:

$$\frac{dS}{d\hat{\rho}_i} = \beta \left( -\rho_0 \frac{1}{\hat{\rho}_i \ln 10} + (1 - \rho_0) \frac{1}{(1 - \hat{\rho}_i)\ln 10} \right)$$

• If you use $\ln$ in the KL term:

$$\frac{dS}{d\hat{\rho}_i} = \beta \left( -\frac{\rho_0}{\hat{\rho}_i} + \frac{1 - \rho_0}{1 - \hat{\rho}_i} \right)$$

• So, if we integrate it into the original error term:

$$\delta_h = o_h(1 - o_h) \left[ \sum_k w_{kh}\,\delta_k + \beta\left( -\frac{\rho_0}{\hat{\rho}_h} + \frac{1 - \rho_0}{1 - \hat{\rho}_h} \right) \right]$$

• Need to change 𝑜ℎ(1 − 𝑜ℎ) if you use a different activation function.

Page 93:

Backpropagation and training

$$S = \beta \sum_i \left[ \rho_0 \log\frac{\rho_0}{\hat{\rho}_i} + (1 - \rho_0)\log\frac{1 - \rho_0}{1 - \hat{\rho}_i} \right]$$

• Do you see a problem here?

• ො𝜌𝑖 should be calculated over the training set.

• In other words, we need to go through the whole dataset (or batch) once to calculate ො𝜌𝑖.

Page 94:

Loss & decoders & encoders

• Be careful about the range of your activations and the range of the output

• Real-valued input:

Encoder: use sigmoid

Decoder: no need for non-linearity.

Loss: Squared-error Loss

• Binary-valued input:

Encoder: use sigmoid.

Decoder: use sigmoid.

Loss: use cross-entropy loss:

Page 95:

Loss & decoders & encoders

• Kullback-Leibler divergence assumes that the variables are in the range [0,1]. I.e., you are bound to use a sigmoid for the hidden layer if you use KL to limit the activations of the hidden units.

Page 96:

k-Sparse Autoencoder

Page 97:

• Note that it doesn’t have an activation function!

• Non-linearity comes from k-selection.

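A minimal sketch of the k-sparse encoding step (Makhzani & Frey's k-sparse autoencoder): a linear encoding followed by keeping only the k largest activations (here selected by magnitude) and zeroing the rest, which is where the non-linearity comes from. All sizes and data are placeholders.

import numpy as np

def k_sparse_encode(x, W, b, k):
    """Linear encoding z = W x + b, then keep only the k largest activations."""
    z = W @ x + b
    support = np.argsort(np.abs(z))[-k:]     # indices of the k largest-magnitude units
    h = np.zeros_like(z)
    h[support] = z[support]                  # everything else is set to zero
    return h

rng = np.random.default_rng(0)
n, m, k = 20, 50, 5                          # input dim, (over-complete) hidden dim, sparsity level
W, b = rng.normal(size=(m, n)) * 0.1, np.zeros(m)
x = rng.normal(size=n)

h = k_sparse_encode(x, W, b, k)
print(np.count_nonzero(h))                   # exactly k active hidden units
# Decoding could use tied weights: x_hat = W.T @ h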

Page 98:

http://www.ericlwilkinson.com/blog/2014/11/19/deep-learning-sparse-autoencoders

Page 99:

Denoising Auto-encoders (DAE)

Page 100:

Denoising Auto-encoders

• Simple idea:

randomly corrupt some of the inputs (as many as half of them) – e.g., set them to zero.

Train the autoencoder to reconstruct the input from a corrupted version of it.

The auto-encoder is to predict the corrupted (i.e. missing) values from the uncorrupted values.

This requires capturing the joint distribution between a set of variables

• A stochastic version of the auto-encoder.

Page 101:

(Pascal Vincent)

Page 102:

(Pascal Vincent)

Page 103:

Loss in DAE

• You may give extra emphasis on “corrupted” dimensions:


Or, in cross-entropy-based loss:

Page 104:

Denoising Auto-encoders

• To undo the effect of a corruption induced by the noise, the network needs to capture the statistical dependencies between the inputs.

• This can be interpreted from many perspectives (see Vincent et al., 2008): the manifold learning perspective, the stochastic operator perspective.

Page 105:

(Pascal Vincent)

Page 106:

(Pascal Vincent)

Page 107:

Types of corruption

• Gaussian Noise (additive, isotropic)

• Masking Noise

Set a randomly selected subset of input to zero for each sample (the fraction ratio is constant, a parameter)

• Salt-and-pepper Noise:

Set a randomly selected subset of input to maximum or minimum for each sample (the fraction ratio is constant, a parameter)

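The three corruption types as simple numpy functions; the corruption fraction p and the noise scale are the per-type parameters mentioned above. During DAE training, the corrupted input is fed to the encoder while the loss is still computed against the clean input.

import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(x, sigma=0.1):
    # Additive, isotropic Gaussian noise
    return x + rng.normal(scale=sigma, size=x.shape)

def masking_noise(x, p=0.5):
    # Set a random fraction p of the inputs to zero
    mask = rng.random(x.shape) >= p
    return x * mask

def salt_and_pepper(x, p=0.3, lo=0.0, hi=1.0):
    # Set a random fraction p of the inputs to the minimum or maximum value
    out = x.copy()
    corrupt = rng.random(x.shape) < p
    out[corrupt] = np.where(rng.random(x.shape) < 0.5, lo, hi)[corrupt]
    return out

x = rng.random(10)            # a clean input (e.g., pixel intensities in [0, 1])
x_tilde = masking_noise(x)    # feed x_tilde to the encoder, reconstruct x in the loss
print(x, x_tilde, sep="\n")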

Page 108:

(Vincent et al., 2010) Weight decay: L2 regularization.

Page 109:

(Vincent et al., 2010)

Page 110:

Training DAE

• Training algorithm does not change

However, you may give different emphasis on the error of reconstruction of the corrupted input.

• SGD is a popular choice

• Sigmoid is a suitable choice unless you know what you are doing.

Page 111:

Contractive Auto-encoder

Page 112:

(Pascal Vincent)

Page 113:

(Pascal Vincent)

Page 114:

(Pascal Vincent)

Page 115:

(Pascal Vincent)

Page 116:

(Pascal Vincent)

Page 117:

Convolutional AE

• Encoder:

Standard convolutional layer

You may use pooling (e.g., max-pooling)

Pooling is shown to regularize the features in the encoder (Masci et al., 2011)

• Decoder:

Deconvolution

• Loss is MSE.


https://github.com/vdumoulin/conv_arithmetic
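A minimal convolutional autoencoder sketch along these lines in PyTorch: a conv + max-pooling encoder and a transposed-convolution ("deconvolution") decoder trained with MSE. The layer sizes assume 1-channel 28x28 inputs and are illustrative only.

import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1x28x28 -> 16x28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                              # -> 16x14x14
    nn.Conv2d(16, 8, kernel_size=3, padding=1),   # -> 8x14x14
    nn.ReLU(),
    nn.MaxPool2d(2),                              # -> 8x7x7
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(8, 16, kernel_size=2, stride=2),   # 8x7x7 -> 16x14x14
    nn.ReLU(),
    nn.ConvTranspose2d(16, 1, kernel_size=2, stride=2),   # -> 1x28x28
    nn.Sigmoid(),
)
model = nn.Sequential(encoder, decoder)

x = torch.rand(32, 1, 28, 28)            # placeholder batch of images
loss = nn.MSELoss()(model(x), x)         # reconstruction loss (MSE)
loss.backward()
print(loss.item())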

Page 118:

Principles other than ‘sparsity’?

Page 119:

Slowness

http://www.scholarpedia.org/article/Slow_feature_analysis

Page 120:

Slow Feature Analysis (SFA), from Wiskott et al.

http://www.scholarpedia.org/article/Slow_feature_analysis

Page 121:

Slow Feature Analysis (SFA)

http://www.scholarpedia.org/article/Slow_feature_analysis

Optimal stimuli for the slowest components extracted from natural image sequences.

Page 122:

Visualizing the layers

Page 123:

Visualizing the layers

• Question: What is the input that activates a hidden unit ℎ𝑖 most?

i.e., we are after

$$\mathbf{x}^* = \arg\max_{\mathbf{x} \ \text{s.t.} \ \|\mathbf{x}\| = \rho} h_i(W, \mathbf{x})$$

• For the first layer:

$$x_j = \frac{w_{ij}}{\sqrt{\sum_k w_{ik}^2}}$$

where we assume that $\sum_i x_i^2 \le 1$, and hence normalize the weights to match the range of the input values.

• How about the following layers? Gradient ascent (not descent): find the gradient of $h_i(W, \mathbf{x})$ w.r.t. $\mathbf{x}$ and move $\mathbf{x}$ in the direction of the gradient, since we want to maximize $h_i(W, \mathbf{x})$.

Page 124:

Visualizing the layers

• Activation maximization: Gradient ascent to maximize $h_i(W, \mathbf{x})$.

Start with randomly generated input and move towards the gradient.

Luckily, different random initializations yield very similar filter-like responses.

• Applicable to any network for which we can calculate the gradient $\partial h_i(W, \mathbf{x}) / \partial\mathbf{x}$

• Need to tune parameters:

Learning rate

Stopping criteria

• Has the same problems as gradient descent:

The space is non-convex

Local maxima etc.

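A sketch of activation maximization by gradient ascent on the input, for an arbitrary toy network in PyTorch; the model, the chosen unit index, the learning rate, the number of steps and the clamping range are all placeholder choices.

import torch
import torch.nn as nn

# Toy network; any differentiable model works.
model = nn.Sequential(nn.Linear(100, 50), nn.Sigmoid(), nn.Linear(50, 10), nn.Sigmoid())
unit = 3                                    # index of the unit h_i to maximize

x = torch.randn(1, 100, requires_grad=True)     # random starting input
optimizer = torch.optim.SGD([x], lr=0.1)

for step in range(200):
    activation = model(x)[0, unit]
    optimizer.zero_grad()
    (-activation).backward()                # gradient *ascent*: minimize the negative activation
    optimizer.step()
    with torch.no_grad():
        x.clamp_(-1.0, 1.0)                 # keep x in a bounded range (a crude norm constraint)

print(model(x)[0, unit].item())             # activation of the chosen unit after ascent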

Page 125:

Activation Maximization results

Erhan et al., "Understanding Representations Learned in Deep Architectures", 2010.

Page 126:

New studies

• Unsupervised pretraining with setting noise as target:

• https://arxiv.org/abs/1704.05310

Page 127:

A recent study

• https://openreview.net/pdf?id=HkNDsiC9KQ