TRANSCRIPT
Deep Learning
Boyang “Albert” Li, Jie “Jay” Tan
An Unrelated Video: a bicycle controller learned using NEAT (Stanley)
What do you mean, deep?
Shallow: Hidden Markov models; ANNs with one hidden layer; manually selected and designed features
Deep: Stacked Restricted Boltzmann Machines; ANNs with multiple hidden layers; learning complex features
Algorithms of Deep Learning
Recurrent Neural Networks
Stacked Autoencoders (i.e. deep neural networks)
Stacked Restricted Boltzmann Machines (i.e. deep belief networks)
Convolutional Deep Belief Networks
… a growing list
But… What's Wrong with Shallow?
Needs more nodes / computing units and weights.
[Bengio, Y., et al. (2007). Greedy layer-wise training of deep networks]
Boolean functions (such as the function that computes the multiplication of two numbers from their d-bit representation) expressible by O(log d) layers of combinatorial logic with O(d) elements in each layer may require O(2^d) elements when expressed with only 2 layers.
Reliance on manually selected features.
Instead: automatically learning the features; disentangling interacting factors, creating invariant features (we will come back to that).
Disentangling factors
Is the brain deep, too?
http://thebrain.mcgill.ca/flash/a/a_02/a_02_cr/a_02_cr_vis/a_02_cr_vis.html
Eric R. Kandel. (2012). The Age of Insight: The Quest to Understand the Unconscious in Art, Mind and Brain from Vienna 1900 to the Present
A general algorithm for the brain?
One part of the brain can learn the function of another part.
If the visual input is sent to the auditory cortex of a newborn ferret, the "auditory" cells learn to do vision. (Sharma, Angelucci, and Sur. Nature 2000)
People blinded at a young age can hear better, possibly because their brain can still adapt. (Gougoux et al. Nature 2004)
Different regions of the brain look similar.
Feature Learning vs. Deep Neural Network
pixels → edges → object parts → object models
Artificial Neural Networks
[Figure: Input Layer → Hidden Layer → Output Layer]
The network maps an input x to an output $y = h_W(x)$.
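Below is a minimal sketch (not from the slides) of the forward pass $y = h_W(x)$ for a one-hidden-layer network like the one drawn above; the sigmoid activation and all names and shapes are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    a2 = sigmoid(W1 @ x + b1)      # hidden layer: a^(2) = f(W^(1) x + b^(1))
    a3 = sigmoid(W2 @ a2 + b2)     # output layer: y = h_W(x) = f(W^(2) a^(2) + b^(2))
    return a3

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)   # 4 hidden units -> 2 outputs
y = forward(rng.standard_normal(3), W1, b1, W2, b2)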
Backpropagation
Minimize $J(W) = \frac{1}{2} \sum_i \| h_W(x^{(i)}) - y^{(i)} \|^2$
Gradient computation, e.g. for a weight into the output layer, with $a^{(3)} = f\big(\sum_j W^{(2)}_{1j} a^{(2)}_j\big)$:
$\frac{\partial J}{\partial W^{(2)}_{11}} = \frac{\partial}{\partial W^{(2)}_{11}} \frac{1}{2}\| h_W(x) - y \|^2 = \big(a^{(3)} - y\big)\, f'\Big(\sum_j W^{(2)}_{1j} a^{(2)}_j\Big)\, a^{(2)}_1$
Backpropagation
One layer further back, the chain rule picks up an extra factor for every layer it passes through:
$\frac{\partial J}{\partial W^{(1)}_{11}} = \frac{\partial}{\partial W^{(1)}_{11}} \frac{1}{2}\| h_W(x) - y \|^2 = \big(a^{(3)} - y\big)\, f'\Big(\sum_j W^{(2)}_{1j} a^{(2)}_j\Big)\, W^{(2)}_{11}\, f'\Big(\sum_j W^{(1)}_{1j} x_j\Big)\, x_1$
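A minimal sketch of this gradient computation for a single training pair (x, y), using the same sigmoid network as in the earlier sketch; the vectorized form of the two chain-rule expressions is my assumption about the slides' notation, not a verbatim transcription.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    # forward pass, keeping intermediate activations
    a2 = sigmoid(W1 @ x + b1)
    a3 = sigmoid(W2 @ a2 + b2)                  # h_W(x)
    # output-layer error: (a^(3) - y) f'(z^(3)); for the sigmoid, f'(z) = a(1 - a)
    delta3 = (a3 - y) * a3 * (1.0 - a3)
    # propagate one layer back: (W^(2)^T delta^(3)) f'(z^(2))
    delta2 = (W2.T @ delta3) * a2 * (1.0 - a2)
    # dJ/dW^(1)_ij = delta^(2)_i x_j,  dJ/dW^(2)_ij = delta^(3)_i a^(2)_j
    return np.outer(delta2, x), np.outer(delta3, a2)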
More than one hidden layer? I thought of that, too. Didn't work!
Lack of data and computational power
Weight initialization
Poor local minima
Diffusion of gradient
Overfitting: a multi-layer model is too powerful / complex
Diffusion of Gradient
The error term $\delta$ for layer $l$ is computed from the layer above ($z^{(l)}$ denotes the pre-activations):
$\delta^{(l)}_i = \Big(\sum_{j=1}^{s_{l+1}} W^{(l)}_{ji}\, \delta^{(l+1)}_j\Big)\, f'\big(z^{(l)}_i\big), \qquad \frac{\partial J}{\partial W^{(l)}_{ij}} = a^{(l)}_j\, \delta^{(l+1)}_i$
Each additional layer multiplies the error signal by another weight matrix and another $f'$, which is typically small, so the gradient shrinks (diffuses) as it propagates toward the early layers.
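A tiny numerical illustration (my own, not from the slides) of the diffusion: propagating an error signal back through several sigmoid layers keeps multiplying it by weight matrices and by f'(z) ≤ 0.25, so its norm shrinks rapidly. The layer width, depth, and weight scale below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n_units, n_layers = 50, 8
delta = rng.standard_normal(n_units)               # error signal at the top layer
for l in range(n_layers, 0, -1):
    W = 0.1 * rng.standard_normal((n_units, n_units))
    a = rng.uniform(0.1, 0.9, size=n_units)        # stand-in sigmoid activations
    delta = (W.T @ delta) * a * (1.0 - a)          # delta^(l) = (W^T delta^(l+1)) f'(z^(l))
    print(f"layer {l}: ||delta|| = {np.linalg.norm(delta):.2e}")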
Prevention of Overfitting
Generative pre-training: a way to initialize the weights; learning p(x) or p(x, h) instead of p(y|x)
Early stopping
Weight sharing
… and many other methods
Autoencoders
Learn to reconstruct the input: $x \rightarrow \hat{x} = h_W(x)$
$W^* = \arg\min_W \sum_i \| x^{(i)} - h_W(x^{(i)}) \|^2$
Sparse Autoencoder
Same reconstruction $x \rightarrow \hat{x} = h_W(x)$, but most hidden activations are pushed toward zero ($a^{(2)}_1 \approx 0$, $a^{(2)}_2$, …, $a^{(2)}_n$).
$W^* = \arg\min_W \sum_i \big(\| x^{(i)} - h_W(x^{(i)}) \|^2 + S(a^{(i)})\big)$
where $S(\cdot)$ is a sparsity penalty on the hidden activations.
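A minimal sketch of the sparse autoencoder objective above, assuming a single sigmoid hidden layer, a tied linear decoder, and an L1 sparsity penalty S(a) = λ Σ|a_i|; these specific choices are illustrative, not prescribed by the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_loss(X, W, b_enc, b_dec, lam=0.1):
    A = sigmoid(X @ W + b_enc)                  # hidden activations a^(2), one row per example
    X_hat = A @ W.T + b_dec                     # reconstruction h_W(x), tied weights
    reconstruction = np.sum((X - X_hat) ** 2)   # sum_i ||x^(i) - h_W(x^(i))||^2
    sparsity = lam * np.sum(np.abs(A))          # S(a): L1 penalty on the activations
    return reconstruction + sparsity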
Sparsity Regularizer
L0 norm: $S(a) = \sum_i I(a_i \neq 0)$
L1 norm: $S(a) = \sum_i |a_i|$
L2 norm: $S(a) = \sum_i a_i^2$
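The three penalties written out directly for a vector of hidden activations (a trivial sketch; note that the L0 count is not differentiable, which is one reason the L1 penalty is the usual practical choice).

import numpy as np

def l0_penalty(a):
    return np.count_nonzero(a)     # S(a) = sum_i I(a_i != 0)

def l1_penalty(a):
    return np.sum(np.abs(a))       # S(a) = sum_i |a_i|

def l2_penalty(a):
    return np.sum(a ** 2)          # S(a) = sum_i a_i^2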
L1 vs. L2 Regularizer
Efficient sparse coding
[Figure: sparse codes a, learned with an L1 penalty |a|]
Lee et al. (2006). Efficient sparse coding algorithms. NIPS
Dimension Reduction vs. Sparsity
Visualize a Trained Autoencoder
Suppose the autoencoder is trained on 10 × 10 images:
$a^{(2)}_i = f\Big(\sum_{j=1}^{100} W_{ij}\, x_j\Big)$
Visualize a Trained Autoencoder
What image will maximally activate $a^{(2)}_i$? Less formally, what is the feature that hidden unit i is looking for?
$\max_x\; f\Big(\sum_{j=1}^{100} W_{ij}\, x_j\Big) \quad \text{s.t.} \quad \sum_{j=1}^{100} x_j^2 \le 1$
The maximizing image is the unit's weight vector, normalized to unit length:
$x_j = \frac{W_{ij}}{\sqrt{\sum_{j=1}^{100} W_{ij}^2}}$
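A minimal sketch of that visualization, assuming an encoder weight matrix W of shape (n_hidden, 100) learned on 10 × 10 patches; the names are illustrative.

import numpy as np

def maximally_activating_images(W):
    # x_j = W_ij / sqrt(sum_j W_ij^2): normalize each hidden unit's weight vector
    X = W / np.linalg.norm(W, axis=1, keepdims=True)
    # reshape each unit's optimal input back into a 10x10 image for display
    return X.reshape(-1, 10, 10)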
Visualize a Trained Autoencoder
Train a Deep Autoencoder
Train the first autoencoder to reconstruct x; then train a second autoencoder on the first one's hidden activations, and so on, stacking layers greedily.
Fine Tuning: stack the trained layers and train the whole deep autoencoder end-to-end with backpropagation.
The top hidden layer then serves as a feature vector for x.
Train an Image Classifier
x → Image Label (car or people)
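A minimal sketch of that pipeline: greedy layer-wise pre-training of a stack of autoencoders, then a supervised classifier on top of the resulting feature vector. The train_autoencoder helper is hypothetical (any single-layer autoencoder trainer would do), and end-to-end fine-tuning of the stack by backpropagation is omitted.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_stack(X, layer_sizes, train_autoencoder):
    """Greedily train one autoencoder per layer on the previous layer's codes."""
    weights, H = [], X
    for size in layer_sizes:
        W, b = train_autoencoder(H, size)   # fit a single autoencoder to reconstruct H
        weights.append((W, b))
        H = sigmoid(H @ W + b)              # its codes become the next layer's input
    return weights, H                        # H: one feature vector per example

# The feature vectors H can then feed a classifier (e.g. logistic regression on
# car-vs-people labels), optionally followed by end-to-end fine-tuning.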
Visualize a Trained Autoencoder
Learning Independent Features?
Le, Zou, Yeung, and Ng, CVPR 2011
Invariant features, disentangling factors; introducing independence to improve the results
Results
Recurrent Neural Networks
Sutskever, Martens, Hinton. 2011. Generating Text with Recurrent Neural Networks. ICML
[Architecture: input character x as a 1-of-86 encoding → 1500 hidden units → 1500 hidden units → softmax → y, the predicted distribution for the next character]
It's a lot easier to predict 86 characters than 100,000 words.
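A minimal sketch of one step of a character-level RNN of the kind sketched above: a 1-of-86 input, a recurrent hidden layer, and a softmax over the next character. The tanh nonlinearity, the single hidden layer, and the weight scales are my simplifying assumptions (the actual model uses multiplicative connections, described below).

import numpy as np

def rnn_step(x_onehot, h_prev, W_xh, W_hh, W_hy):
    # the next hidden state depends on the current character AND the previous state
    h = np.tanh(W_xh @ x_onehot + W_hh @ h_prev)
    logits = W_hy @ h
    p = np.exp(logits - logits.max())          # softmax over the 86 characters
    return h, p / p.sum()

n_chars, n_hidden = 86, 1500
rng = np.random.default_rng(0)
W_xh = 0.01 * rng.standard_normal((n_hidden, n_chars))
W_hh = 0.01 * rng.standard_normal((n_hidden, n_hidden))
W_hy = 0.01 * rng.standard_normal((n_chars, n_hidden))
x = np.zeros(n_chars); x[0] = 1.0              # one-hot current character
h, p_next = rnn_step(x, np.zeros(n_hidden), W_xh, W_hh, W_hy)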
RNN to predict characters
A sub-tree in the tree of all character strings.
[Figure: …fix branches on the next character, e.g. to …fixi (then …fixin) or to …fixe]
There are exponentially many nodes in the tree of all character strings of length N.
If the nodes are implemented as hidden states in an RNN, different nodes can share structure because they use distributed representations.
In an RNN, each node is a hidden state vector. The next character must transform this to a new node.
The next hidden representation needs to depend on the conjunction of the current character and the current hidden representation.
Multiplicative connections
Instead of using the inputs to the recurrent net to provide additive extra input to the hidden units, we could use the current input character to choose the whole hidden-to-hidden weight matrix.
But this requires 86 × 1500 × 1500 parameters, which could make the net overfit.
Can we achieve the same kind of multiplicative interaction using fewer parameters?
We want a different transition matrix for each of the 86 characters, but we want these 86 character-specific weight matrices to share parameters (the characters 9 and 8 should have similar matrices).
Using factors to implement multiplicative interactions
We can get groups a and b to interact multiplicatively by using “factors”. Each factor first computes a weighted sum for each of its input groups. Then it sends the product of the weighted sums to its output group.
$c_f = \big(b^\top w_f\big)\big(a^\top u_f\big)\, v_f$
where $b^\top w_f$ is the scalar input to factor f from group b, $a^\top u_f$ is the scalar input to factor f from group a, and $c_f$ is factor f's vector of inputs to group c.
[Figure: groups a and b feed each factor f through weight vectors $u_f$ and $w_f$; the factor's output vector $v_f$, scaled by the product of the two weighted sums, feeds group c]
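A minimal sketch of this factored computation, with U, W, V holding the per-factor vectors u_f, w_f, v_f as rows; the group sizes and number of factors are illustrative.

import numpy as np

def factor_input_to_c(a, b, U, W, V):
    s_a = U @ a                 # a^T u_f for every factor f
    s_b = W @ b                 # b^T w_f for every factor f
    # sum over factors of c_f = (b^T w_f)(a^T u_f) v_f
    return (s_a * s_b) @ V

rng = np.random.default_rng(0)
a = rng.standard_normal(1500)                 # group a: previous hidden state
b = np.zeros(86); b[4] = 1.0                  # group b: 1-of-86 current character
U = 0.01 * rng.standard_normal((512, 1500))
W = 0.01 * rng.standard_normal((512, 86))
V = 0.01 * rng.standard_normal((512, 1500))
c = factor_input_to_c(a, b, U, W, V)          # input to group c (next hidden state)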
Sample text generated by the character-level RNN:
He was elected President during the Revolutionary War and forgave Opus Paul at Rome. The regime of his crew of England, is now Arab women's icons in and the demons that use something between the characters‘ sisters in lower coil trains were always operated on the line of the ephemerable street, respectively, the graphic or other facility for deformation of a given proportion of large segments at RTUS). The B every chord was a "strongly cold internal palette pour even the white blade.”
The meaning of life is … 42?
The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger.
Is the RNN deep enough?
This deep structure provides memory, not hierarchical processing.
Adding hierarchical processing
Pascanu, Gulcehre, Cho, and Bengio (2013)
Why Unsupervised Pre-training Works (from Bengio's talk)
Optimization Hypothesis: unsupervised training initializes the weights near better local minima than random initialization can reach.
Regularization Hypothesis (prevents over-fitting): the unsupervised pre-training dataset is larger; features extracted from the unsupervised set are more general and have better discriminative power.
Why Unsupervised Pre-training Works
Bengio: learning P(x) or P(x, h) helps you with P(y|x).
Structures and features that can generate the inputs (whether or not a probabilistic formulation is used) also happen to be useful for your supervised task.
This requires P(x) and P(y|x) to be related, i.e. similar-looking x produce similar y.
This is probably more true for vision / audio than for text.
Conclusion
Motivation for deep learning
Backpropagation
Autoencoders and sparsity
Generative, layer-wise pre-training (Stacked Autoencoder)
Recurrent Neural Networks
Speculation about why these things work