
Page 1:

Deep Learning

Boyang “Albert” Li, Jie “Jay” Tan

Page 2:

An Unrelated Video

A bicycle controller learned using NEAT (Stanley).

Page 3:

What do you mean, deep?

Shallow                                    Deep
Hidden Markov models                       Stacked Restricted Boltzmann Machines
ANNs with one hidden layer                 ANNs with multiple hidden layers
Manually selected and designed features    Learning complex features

Page 4:

Algorithms of Deep Learning

- Recurrent Neural Networks
- Stacked Autoencoders (i.e. deep neural networks)
- Stacked Restricted Boltzmann Machines (i.e. deep belief networks)
- Convolutional Deep Belief Networks
- … a growing list

Page 5:

But… What’s Wrong with Shallow?

Needs more nodes/computing units and weights [Bengio, Y., et al. (2007). Greedy Layer-Wise Training of Deep Networks]:
Boolean functions (such as the function that computes the multiplication of two numbers from their d-bit representations) are expressible by O(log d) layers of combinatorial logic with O(d) elements in each layer, but need O(2^d) elements when expressed with only two layers.

Reliance on manually selected features:
Automatically learning the features; disentangling interacting factors, creating invariant features (we will come back to that).

Page 6:

Disentangling factors

Page 7:

Is the brain deep, too?

http://thebrain.mcgill.ca/flash/a/a_02/a_02_cr/a_02_cr_vis/a_02_cr_vis.html
Eric R. Kandel. (2012). The Age of Insight: The Quest to Understand the Unconscious in Art, Mind and Brain from Vienna 1900 to the Present.

Page 8:

A general algorithm for the brain?

One part of the brain can learn the function of another part:
- If the visual input is sent to the auditory cortex of a newborn ferret, the "auditory" cells learn to do vision. (Sharma, Angelucci, and Sur. Nature 2000)
- People blinded at a young age can hear better, possibly because their brain can still adapt. (Gougoux et al. Nature 2004)

Different regions of the brain look similar.

Page 9:

Feature Learning vs. Deep Neural Network

pixels

Page 10:

Feature Learning vs. Deep Neural Network

pixels → edges

Page 11:

Feature Learning vs. Deep Neural Network

pixels → edges → object parts

Page 12:

Feature Learning vs. Deep Neural Network

pixels → edges → object parts → object models

Page 13:

Artificial Neural Networks

[Diagram: input layer → hidden layer → output layer, mapping input x to output y.]

y = h_W(x)
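As a concrete illustration, here is a minimal numpy sketch of this forward pass; the sigmoid activation, the layer sizes, and the omission of bias terms are assumptions made for brevity, not choices from the slides:

```python
import numpy as np

def f(z):
    # Sigmoid activation (an assumed choice; bias terms are omitted).
    return 1.0 / (1.0 + np.exp(-z))

def h_W(x, W1, W2):
    # One-hidden-layer network: input x -> hidden layer -> output y.
    a2 = f(W1 @ x)      # hidden activations
    return f(W2 @ a2)   # output y = h_W(x)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 4))  # 4 inputs -> 3 hidden units
W2 = rng.normal(scale=0.1, size=(2, 3))  # 3 hidden units -> 2 outputs
print(h_W(rng.normal(size=4), W1, W2))
```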

Page 14:

Backpropagation

Minimize

J(W) = \frac{1}{2} \sum_i \| y_i - h_W(x_i) \|^2

Gradient computation, e.g. for an output-layer weight w^{(2)}_{11} (a^{(l)} denotes the activations of layer l and f the activation function):

\frac{\partial J}{\partial w^{(2)}_{11}}
= \sum_i \big( a^{(3)} - y_i \big) \frac{\partial a^{(3)}}{\partial w^{(2)}_{11}}
= \sum_i \big( a^{(3)} - y_i \big) \, f'\Big( \sum_j w^{(2)}_{1j} a^{(2)}_j \Big) \, a^{(2)}_1

Page 15:

Backpropagation

For a first-layer weight w^{(1)}_{11}, the chain rule extends through the hidden layer:

\frac{\partial J}{\partial w^{(1)}_{11}}
= \sum_i \big( a^{(3)} - y_i \big) \, f'\Big( \sum_j w^{(2)}_{1j} a^{(2)}_j \Big) \, w^{(2)}_{11} \, f'\Big( \sum_j w^{(1)}_{1j} x_j \Big) \, x_1
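A minimal numpy sketch of these gradient computations, assuming a sigmoid f (so f'(z) = f(z)(1 - f(z))) and omitting biases; `backprop_step` is a hypothetical helper name:

```python
import numpy as np

def f(z):
    # Sigmoid; its derivative is f'(z) = f(z) * (1 - f(z)).
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, W2, lr=0.1):
    # Forward pass.
    a2 = f(W1 @ x)                 # hidden activations a^(2)
    a3 = f(W2 @ a2)                # output a^(3) = h_W(x)
    # Backward pass for J = 0.5 * ||a3 - y||^2.
    delta3 = (a3 - y) * a3 * (1 - a3)
    delta2 = (W2.T @ delta3) * a2 * (1 - a2)
    # dJ/dW(l)_ij = delta(l+1)_i * a(l)_j; take one gradient step.
    W2 -= lr * np.outer(delta3, a2)
    W1 -= lr * np.outer(delta2, x)
    return 0.5 * np.sum((a3 - y) ** 2)
```

Repeating this step over the training pairs (x_i, y_i) is plain gradient descent on J.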

Page 16:

More than one hidden layer? I thought of that, too. Didn’t work!

- Lack of data and computational power
- Weight initialization
- Poor local minima
- Diffusion of gradient
- Overfitting: a multi-layer model is too powerful/complex

Page 17:

Diffusion of Gradient

The error signal is propagated backwards through the recursion

\delta^{(l)}_i = \Big( \sum_{j=1}^{s_{l+1}} w^{(l)}_{ji} \, \delta^{(l+1)}_j \Big) f'\big( z^{(l)}_i \big)

and each weight's gradient is

\frac{\partial J}{\partial w^{(l)}_{ij}} = a^{(l)}_j \, \delta^{(l+1)}_i
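Each backward step multiplies the error signal by the weights and by f'; for a sigmoid, f' = a(1 - a) ≤ 0.25, so the signal tends to shrink layer by layer. A quick numerical illustration, with assumed layer sizes and weight scales:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 100, 20
delta = rng.normal(size=n)                   # error signal at the top layer
for l in range(depth):
    W = rng.normal(scale=0.1, size=(n, n))   # small random weights
    a = rng.uniform(size=n)                  # sigmoid activations in (0, 1)
    delta = (W.T @ delta) * a * (1 - a)      # the recursion above
    print(l, np.linalg.norm(delta))          # the norm shrinks layer by layer
```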


Page 19:

Prevention of Overfitting

- Generative pre-training: a way to initialize the weights; learning p(x) or p(x, h) instead of p(y|x)
- Early stopping (see the sketch after this list)
- Weight sharing
- … and many other methods
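Of the listed methods, early stopping is the simplest to sketch. This is a generic outline, not a specific library API; the four callables are assumptions supplied by the caller:

```python
def train_with_early_stopping(run_epoch, val_error, get_weights, set_weights,
                              patience=10):
    # Generic early stopping: train until validation error has not
    # improved for `patience` consecutive epochs, then roll back to the
    # best weights seen.
    best_err, best_weights, bad_epochs = float("inf"), None, 0
    while bad_epochs < patience:
        run_epoch()                # one pass of training (e.g. backprop)
        err = val_error()          # error on held-out validation data
        if err < best_err:
            best_err, best_weights, bad_epochs = err, get_weights(), 0
        else:
            bad_epochs += 1
    set_weights(best_weights)      # keep the weights from the best epoch
    return best_err
```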

Page 20:

Autoencoders

An autoencoder learns a mapping h_W(x) ≈ x:

W^* = \arg\min_W \sum_i \| x^{(i)} - h_W(x^{(i)}) \|^2
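Written as code, the objective is a straightforward reconstruction error to be minimized over the weights by the backpropagation rules from the earlier slides; this sketch assumes a sigmoid f, a single hidden layer, and no bias terms:

```python
import numpy as np

def f(z):
    # Sigmoid activation (an assumed choice; the slides leave f implicit).
    return 1.0 / (1.0 + np.exp(-z))

def reconstruction_loss(W1, W2, X):
    # X holds one training sample x^(i) per column.
    H = f(W1 @ X)         # encode: hidden representation
    X_hat = f(W2 @ H)     # decode: h_W(x), the reconstruction
    return np.sum((X - X_hat) ** 2)
```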

Page 21:

Sparse Autoencoder

[Diagram: x → h_W(x), an autoencoder with a sparsity constraint on the hidden layer.]

Page 22:

Sparse Autoencoder

[Figure: hidden activations for an input x: a^{(2)}_1 ≈ 0, a^{(2)}_2, …, a^{(2)}_n, with most units inactive.]

Page 23:

Sparse Autoencoder

W^* = \arg\min_W \sum_i \big( \| x^{(i)} - h_W(x^{(i)}) \|^2 + S(a^{(i)}) \big)

where S is a sparsity penalty on the hidden activations.

Page 24:

Sparsity Regularizer

L0 norm: S(a) = \sum_i I(a_i \neq 0)

Page 25:

Sparsity Regularizer

L0 norm: S(a) = \sum_i I(a_i \neq 0)
L1 norm: S(a) = \sum_i |a_i|

Page 26:

Sparsity Regularizer

L0 norm: S(a) = \sum_i I(a_i \neq 0)
L1 norm: S(a) = \sum_i |a_i|
L2 norm: S(a) = \sum_i a_i^2
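The three penalties translate directly into numpy; note that the L0 penalty is a count, so it is not differentiable and is usually replaced by L1 in practice:

```python
import numpy as np

def S_l0(a):
    # L0 "norm": the number of non-zero activations (not differentiable).
    return np.count_nonzero(a)

def S_l1(a):
    # L1 norm: sum of absolute values; pushes activations exactly to zero.
    return np.sum(np.abs(a))

def S_l2(a):
    # Squared L2 norm: sum of squares; shrinks activations smoothly
    # but rarely makes them exactly zero.
    return np.sum(a ** 2)
```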


Page 28:

L1 vs. L2 Regularizer

An L1 penalty drives many activations exactly to zero, while an L2 penalty only shrinks them toward zero, so L1 is the better sparsity regularizer of the two.

Page 29:

Efficient sparse coding

[Figure: the L1 penalty |a| as a function of the activation a.]

Lee et al. (2006). Efficient sparse coding algorithms. NIPS.

Page 30:

Dimension Reduction vs. Sparsity

Page 31:

Visualize a Trained Autoencoder

Suppose the autoencoder is trained on 10 × 10 images:

a^{(2)}_i = f\Big( \sum_{j=1}^{100} W_{ij} x_j \Big)

Page 32:

Visualize a Trained Autoencoder

What image will maximally activate a^{(2)}_i? Less formally, what is the feature that hidden unit i is looking for?

\max_x f\Big( \sum_{j=1}^{100} W_{ij} x_j \Big)

Page 33:

Visualize a Trained Autoencoder

What image will maximally activate a^{(2)}_i? Less formally, what is the feature that hidden unit i is looking for?

\max_x f\Big( \sum_{j=1}^{100} W_{ij} x_j \Big) \quad \text{s.t.} \quad \sum_{j=1}^{100} x_j^2 \le 1

Page 34:

Visualize a Trained Autoencoder

What image will maximally activate a^{(2)}_i? Less formally, what is the feature that hidden unit i is looking for?

\max_x f\Big( \sum_{j=1}^{100} W_{ij} x_j \Big) \quad \text{s.t.} \quad \sum_{j=1}^{100} x_j^2 \le 1

The solution is the normalized weight vector:

x_j = \frac{W_{ij}}{\sqrt{\sum_{j=1}^{100} W_{ij}^2}}
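Since the solution is just the normalized weight vector, the visualization is one line of numpy per hidden unit; this sketch assumes W holds the first-layer weights as rows, one per hidden unit, for 10 × 10 inputs:

```python
import numpy as np

def maximally_activating_images(W):
    # W: (hidden_units, 100) weight matrix from a 10x10-image autoencoder.
    # Row i, scaled to unit length, is the x that solves
    #   max_x f(sum_j W_ij x_j)  s.t.  sum_j x_j**2 <= 1.
    norms = np.sqrt(np.sum(W ** 2, axis=1, keepdims=True))
    return (W / norms).reshape(-1, 10, 10)  # one 10x10 image per hidden unit
```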

Page 35:

Visualize a Trained Autoencoder

Page 36:

Train a Deep Autoencoder

[Diagram: x encoded and decoded back to x through the first autoencoder layer.]

Page 37:

Train a Deep Autoencoder

Page 38:

Train a Deep Autoencoder

Page 39:

Train a Deep Autoencoder: Fine Tuning

[Diagram: the stacked layers are trained jointly to reconstruct x from x.]

Page 40:

Train a Deep Autoencoder

[Diagram: x is mapped through the trained encoder layers to a feature vector.]
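A sketch of the greedy layer-wise procedure behind these slides; `train_autoencoder` and `encode` are hypothetical caller-supplied helpers standing in for the single-layer training shown earlier:

```python
def greedy_pretrain(X, layer_sizes, train_autoencoder, encode):
    # Greedy layer-wise pretraining: fit a one-hidden-layer autoencoder
    # on the data, then fit the next one on the codes it produces, etc.
    # `train_autoencoder` and `encode` are hypothetical helpers
    # implementing the single-layer training shown on earlier slides.
    weights, H = [], X
    for size in layer_sizes:
        W = train_autoencoder(H, hidden=size)  # fit one layer
        weights.append(W)
        H = encode(W, H)                       # e.g. H = f(W @ H)
    # `weights` initializes the deep net for fine-tuning with backprop;
    # H is the top-level feature vector for each input.
    return weights, H
```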

Page 41:

Train an Image Classifier

[Diagram: x → image label (car or people).]

Page 42:

Visualize a Trained Autoencoder

Page 43:

Learning Independent Features?

- Le, Zou, Yeung, and Ng. CVPR 2011
- Invariant features, disentangling factors
- Introducing independence to improve the results

Page 44:

Results

Page 45:

Recurrent Neural Networks

Sutskever, Martens, and Hinton. 2011. Generating Text with Recurrent Neural Networks. ICML.

[Diagram: a recurrent network mapping input x to output y.]

Page 46:

RNN to predict characters

[Diagram: character (1-of-86) → 1500 hidden units → 1500 hidden units → softmax → predicted distribution for the next character.]

It’s a lot easier to predict 86 characters than 100,000 words.
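A minimal numpy sketch of one step of this character-level RNN, with the 86-way input and 1500 hidden units from the slide; tanh, the softmax output, and the weight scales are assumed choices:

```python
import numpy as np

V, H = 86, 1500                         # 1-of-86 characters, 1500 hidden units
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.01, size=(H, V))   # character -> hidden
W_hh = rng.normal(scale=0.01, size=(H, H))   # hidden -> hidden
W_hy = rng.normal(scale=0.01, size=(V, H))   # hidden -> output logits

def rnn_step(x_onehot, h):
    # One time step: combine the current character with the previous
    # hidden state, then softmax over the 86 possible next characters.
    h_new = np.tanh(W_xh @ x_onehot + W_hh @ h)
    logits = W_hy @ h_new
    p = np.exp(logits - logits.max())
    return h_new, p / p.sum()           # predicted distribution for next char
```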

Page 47:

A sub-tree in the tree of all character strings

[Diagram: the node …fix branches on the next character (i, e, n) to …fixi, …fixe, …fixin.]

There are exponentially many nodes in the tree of all character strings of length N.

If the nodes are implemented as hidden states in an RNN, different nodes can share structure because they use distributed representations. In an RNN, each node is a hidden state vector; the next character must transform it to a new node. The next hidden representation therefore needs to depend on the conjunction of the current character and the current hidden representation.

Page 48:

Multiplicative connections

Instead of using the inputs to the recurrent net to provide additive extra input to the hidden units, we could use the current input character to choose the whole hidden-to-hidden weight matrix. But this requires 86 × 1500 × 1500 parameters, which could make the net overfit.

Can we achieve the same kind of multiplicative interaction using fewer parameters? We want a different transition matrix for each of the 86 characters, but we want these 86 character-specific weight matrices to share parameters (the characters 9 and 8 should have similar matrices).

Page 49:

Using factors to implement multiplicative interactions

We can get groups a and b to interact multiplicatively by using “factors”. Each factor first computes a weighted sum for each of its input groups. Then it sends the product of the weighted sums to its output group:

c = \sum_f (b^\top w_f)(a^\top u_f) \, v_f

where (a^\top u_f) is the scalar input to factor f from group a, (b^\top w_f) is the scalar input to factor f from group b, and v_f carries factor f's contribution to the vector of inputs c to group c.
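The factor computation in numpy; the group and factor sizes here are illustrative assumptions:

```python
import numpy as np

def factored_interaction(a, b, U, W_f, V):
    # Each factor f computes a weighted sum of each input group, then
    # sends the product of the two scalars along its output vector v_f:
    #   c = sum_f (a . u_f)(b . w_f) v_f
    s_a = U.T @ a              # scalar input to each factor from group a
    s_b = W_f.T @ b            # scalar input to each factor from group b
    return V @ (s_a * s_b)     # vector of inputs to group c

rng = np.random.default_rng(0)
na, nb, nc, nf = 5, 4, 3, 10   # illustrative group and factor sizes
a, b = rng.normal(size=na), rng.normal(size=nb)
U = rng.normal(size=(na, nf))
W_f = rng.normal(size=(nb, nf))
V = rng.normal(size=(nc, nf))
print(factored_interaction(a, b, U, W_f, V))
```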

Page 50:

Text generated by the model:

He was elected President during the Revolutionary War and forgave Opus Paul at Rome. The regime of his crew of England, is now Arab women's icons in and the demons that use something between the characters' sisters in lower coil trains were always operated on the line of the ephemerable street, respectively, the graphic or other facility for deformation of a given proportion of large segments at RTUS). The B every chord was a "strongly cold internal palette pour even the white blade.”

Page 51:

The meaning of life is … 42?

The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger.

Page 52:

Is RNN deep enough?

This deep structure provides memory, not hierarchical processing.

Adding hierarchical processing: Pascanu, Gulcehre, Cho, and Bengio (2013).

Page 53:

Why Unsupervised Pre-training Works (from Bengio’s talk)

Optimization Hypothesis:
- Unsupervised training initializes weights near localities of better minima than random initialization can.

Regularization Hypothesis (prevents over-fitting):
- The unsupervised pre-training dataset is larger.
- Features extracted from the unsupervised set are more general and have better discriminative power.

Page 54:

Why Unsupervised Pre-training Works

Bengio: learning P(x) or P(x, h) helps you with P(y|x).

Structures and features that can generate the inputs (whether or not a probabilistic formulation is used) also happen to be useful for your supervised task. This requires P(x) and P(y|x) to be related, i.e. similar-looking x produce similar y. This is probably more true for vision/audio than for text.

Page 55:

Conclusion

- Motivation for deep learning
- Backpropagation
- Autoencoders and sparsity
- Generative, layer-wise pre-training (stacked autoencoders)
- Recurrent Neural Networks
- Speculation on why these things work