TRANSCRIPT
Deep Learning
Boyang “Albert” Li, Jie “Jay” Tan
An Unrelated Video: a bicycle controller learned using NEAT (Stanley)
What do you mean, deep?
Shallow: Hidden Markov models; ANNs with one hidden layer; manually selected and designed features
Deep: Stacked Restricted Boltzmann Machines; ANNs with multiple hidden layers; learning complex features
Algorithms of Deep Learning
Recurrent Neural Networks
Stacked Autoencoders (i.e. deep neural networks)
Stacked Restricted Boltzmann Machines (i.e. deep belief networks)
Convolutional Deep Belief Networks
… a growing list
But… What's Wrong with Shallow?
Needs more nodes / computing units and weights.
[Bengio, Y., et al. (2007). Greedy layer-wise training of deep networks]
Boolean functions (such as the function that computes the multiplication of two numbers from their d-bit representation) expressible by O(log d) layers of combinatorial logic with O(d) elements in each layer may require O(2^d) elements when expressed with only 2 layers.
Reliance on manually selected features.
Instead: automatically learning the features; disentangling interacting factors, creating invariant features (we will come back to that).
Disentangling factors
Is the brain deep, too?
http://thebrain.mcgill.ca/flash/a/a_02/a_02_cr/a_02_cr_vis/a_02_cr_vis.html
Eric R. Kandel. (2012). The Age of Insight: The Quest to Understand the Unconscious in Art, Mind and Brain from Vienna 1900 to the Present
A general algorithm for the brain?
One part of the brain can learn the function of another part.
If the visual input is sent to the auditory cortex of a newborn ferret, the "auditory" cells learn to do vision. (Sharma, Angelucci, and Sur. Nature 2000)
People blinded at a young age can hear better, possibly because their brain can still adapt. (Gougoux et al. Nature 2004)
Different regions of the brain look similar.
Feature Learning vs. Deep Neural Network
pixels → edges → object parts → object models
Artificial Neural Networks
[Figure: Input Layer → Hidden Layer → Output Layer]
The network maps an input x to an output $y = h_W(x)$.
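Below is a minimal sketch (not from the slides) of the forward pass $y = h_W(x)$ for a one-hidden-layer network like the one drawn above; the sigmoid activation and all names and shapes are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    a2 = sigmoid(W1 @ x + b1)      # hidden layer: a^(2) = f(W^(1) x + b^(1))
    a3 = sigmoid(W2 @ a2 + b2)     # output layer: y = h_W(x) = f(W^(2) a^(2) + b^(2))
    return a3

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)   # 4 hidden units -> 2 outputs
y = forward(rng.standard_normal(3), W1, b1, W2, b2)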
Backpropagation
Minimize $J(W) = \frac{1}{2} \sum_i \| h_W(x^{(i)}) - y^{(i)} \|^2$
Gradient computation, e.g. for a weight into the output layer, with $a^{(3)} = f\big(\sum_j W^{(2)}_{1j} a^{(2)}_j\big)$:
$\frac{\partial J}{\partial W^{(2)}_{11}} = \frac{\partial}{\partial W^{(2)}_{11}} \frac{1}{2}\| h_W(x) - y \|^2 = \big(a^{(3)} - y\big)\, f'\Big(\sum_j W^{(2)}_{1j} a^{(2)}_j\Big)\, a^{(2)}_1$
Backpropagation
One layer further back, the chain rule picks up an extra factor for every layer it passes through:
$\frac{\partial J}{\partial W^{(1)}_{11}} = \frac{\partial}{\partial W^{(1)}_{11}} \frac{1}{2}\| h_W(x) - y \|^2 = \big(a^{(3)} - y\big)\, f'\Big(\sum_j W^{(2)}_{1j} a^{(2)}_j\Big)\, W^{(2)}_{11}\, f'\Big(\sum_j W^{(1)}_{1j} x_j\Big)\, x_1$
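A minimal sketch of this gradient computation for a single training pair (x, y), using the same sigmoid network as in the earlier sketch; the vectorized form of the two chain-rule expressions is my assumption about the slides' notation, not a verbatim transcription.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    # forward pass, keeping intermediate activations
    a2 = sigmoid(W1 @ x + b1)
    a3 = sigmoid(W2 @ a2 + b2)                  # h_W(x)
    # output-layer error: (a^(3) - y) f'(z^(3)); for the sigmoid, f'(z) = a(1 - a)
    delta3 = (a3 - y) * a3 * (1.0 - a3)
    # propagate one layer back: (W^(2)^T delta^(3)) f'(z^(2))
    delta2 = (W2.T @ delta3) * a2 * (1.0 - a2)
    # dJ/dW^(1)_ij = delta^(2)_i x_j,  dJ/dW^(2)_ij = delta^(3)_i a^(2)_j
    return np.outer(delta2, x), np.outer(delta3, a2)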
More than one hidden layer? I thought of that, too. Didn't work!
Lack of data and computational power
Weight initialization
Poor local minima
Diffusion of gradient
Overfitting: a multi-layer model is too powerful / complex
Diffusion of Gradient
The error term $\delta$ for layer $l$ is computed from the layer above ($z^{(l)}$ denotes the pre-activations):
$\delta^{(l)}_i = \Big(\sum_{j=1}^{s_{l+1}} W^{(l)}_{ji}\, \delta^{(l+1)}_j\Big)\, f'\big(z^{(l)}_i\big), \qquad \frac{\partial J}{\partial W^{(l)}_{ij}} = a^{(l)}_j\, \delta^{(l+1)}_i$
Each additional layer multiplies the error signal by another weight matrix and another $f'$, which is typically small, so the gradient shrinks (diffuses) as it propagates toward the early layers.
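A tiny numerical illustration (my own, not from the slides) of the diffusion: propagating an error signal back through several sigmoid layers keeps multiplying it by weight matrices and by f'(z) ≤ 0.25, so its norm shrinks rapidly. The layer width, depth, and weight scale below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n_units, n_layers = 50, 8
delta = rng.standard_normal(n_units)               # error signal at the top layer
for l in range(n_layers, 0, -1):
    W = 0.1 * rng.standard_normal((n_units, n_units))
    a = rng.uniform(0.1, 0.9, size=n_units)        # stand-in sigmoid activations
    delta = (W.T @ delta) * a * (1.0 - a)          # delta^(l) = (W^T delta^(l+1)) f'(z^(l))
    print(f"layer {l}: ||delta|| = {np.linalg.norm(delta):.2e}")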
Prevention of Overfitting
Generative pre-training: a way to initialize the weights; learning p(x) or p(x, h) instead of p(y|x)
Early stopping
Weight sharing
… and many other methods
Autoencoders
Learn to reconstruct the input: $x \rightarrow \hat{x} = h_W(x)$
$W^* = \arg\min_W \sum_i \| x^{(i)} - h_W(x^{(i)}) \|^2$
Sparse Autoencoder
Same reconstruction $x \rightarrow \hat{x} = h_W(x)$, but most hidden activations are pushed toward zero ($a^{(2)}_1 \approx 0$, $a^{(2)}_2$, …, $a^{(2)}_n$).
$W^* = \arg\min_W \sum_i \big(\| x^{(i)} - h_W(x^{(i)}) \|^2 + S(a^{(i)})\big)$
where $S(\cdot)$ is a sparsity penalty on the hidden activations.
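A minimal sketch of the sparse autoencoder objective above, assuming a single sigmoid hidden layer, a tied linear decoder, and an L1 sparsity penalty S(a) = λ Σ|a_i|; these specific choices are illustrative, not prescribed by the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_loss(X, W, b_enc, b_dec, lam=0.1):
    A = sigmoid(X @ W + b_enc)                  # hidden activations a^(2), one row per example
    X_hat = A @ W.T + b_dec                     # reconstruction h_W(x), tied weights
    reconstruction = np.sum((X - X_hat) ** 2)   # sum_i ||x^(i) - h_W(x^(i))||^2
    sparsity = lam * np.sum(np.abs(A))          # S(a): L1 penalty on the activations
    return reconstruction + sparsity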
Sparsity Regularizer
L0 norm: $S(a) = \sum_i I(a_i \neq 0)$
L1 norm: $S(a) = \sum_i |a_i|$
L2 norm: $S(a) = \sum_i a_i^2$
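The three penalties written out directly for a vector of hidden activations (a trivial sketch; note that the L0 count is not differentiable, which is one reason the L1 penalty is the usual practical choice).

import numpy as np

def l0_penalty(a):
    return np.count_nonzero(a)     # S(a) = sum_i I(a_i != 0)

def l1_penalty(a):
    return np.sum(np.abs(a))       # S(a) = sum_i |a_i|

def l2_penalty(a):
    return np.sum(a ** 2)          # S(a) = sum_i a_i^2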
L1 vs. L2 Regularizer
Efficient sparse coding
[Figure: sparse codes a, learned with an L1 penalty |a|]
Lee et al. (2006). Efficient sparse coding algorithms. NIPS
Dimension Reduction vs. Sparsity
Visualize a Trained Autoencoder
Suppose the autoencoder is trained on 10 × 10 images:
$a^{(2)}_i = f\Big(\sum_{j=1}^{100} W_{ij}\, x_j\Big)$
Visualize a Trained Autoencoder
What image will maximally activate $a^{(2)}_i$? Less formally, what is the feature that hidden unit i is looking for?
$\max_x\; f\Big(\sum_{j=1}^{100} W_{ij}\, x_j\Big) \quad \text{s.t.} \quad \sum_{j=1}^{100} x_j^2 \le 1$
The maximizing image is the unit's weight vector, normalized to unit length:
$x_j = \frac{W_{ij}}{\sqrt{\sum_{j=1}^{100} W_{ij}^2}}$
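A minimal sketch of that visualization, assuming an encoder weight matrix W of shape (n_hidden, 100) learned on 10 × 10 patches; the names are illustrative.

import numpy as np

def maximally_activating_images(W):
    # x_j = W_ij / sqrt(sum_j W_ij^2): normalize each hidden unit's weight vector
    X = W / np.linalg.norm(W, axis=1, keepdims=True)
    # reshape each unit's optimal input back into a 10x10 image for display
    return X.reshape(-1, 10, 10)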
Visualize a Trained Autoencoder
Train a Deep Autoencoder
Train the first autoencoder to reconstruct x; then train a second autoencoder on the first one's hidden activations, and so on, stacking layers greedily.
Fine Tuning: stack the trained layers and train the whole deep autoencoder end-to-end with backpropagation.
The top hidden layer then serves as a feature vector for x.
Train an Image Classifier
x → Image Label (car or people)
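A minimal sketch of that pipeline: greedy layer-wise pre-training of a stack of autoencoders, then a supervised classifier on top of the resulting feature vector. The train_autoencoder helper is hypothetical (any single-layer autoencoder trainer would do), and end-to-end fine-tuning of the stack by backpropagation is omitted.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_stack(X, layer_sizes, train_autoencoder):
    """Greedily train one autoencoder per layer on the previous layer's codes."""
    weights, H = [], X
    for size in layer_sizes:
        W, b = train_autoencoder(H, size)   # fit a single autoencoder to reconstruct H
        weights.append((W, b))
        H = sigmoid(H @ W + b)              # its codes become the next layer's input
    return weights, H                        # H: one feature vector per example

# The feature vectors H can then feed a classifier (e.g. logistic regression on
# car-vs-people labels), optionally followed by end-to-end fine-tuning.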
Visualize a Trained Autoencoder
Learning Independent Features?
Le, Zou, Yeung, and Ng, CVPR 2011
Invariant features, disentangling factors; introducing independence to improve the results
Results
Recurrent Neural Networks
Sutskever, Martens, Hinton. 2011. Generating Text with Recurrent Neural Networks. ICML
[Architecture: input character x as a 1-of-86 encoding → 1500 hidden units → 1500 hidden units → softmax → y, the predicted distribution for the next character]
It's a lot easier to predict 86 characters than 100,000 words.
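A minimal sketch of one step of a character-level RNN of the kind sketched above: a 1-of-86 input, a recurrent hidden layer, and a softmax over the next character. The tanh nonlinearity, the single hidden layer, and the weight scales are my simplifying assumptions (the actual model uses multiplicative connections, described below).

import numpy as np

def rnn_step(x_onehot, h_prev, W_xh, W_hh, W_hy):
    # the next hidden state depends on the current character AND the previous state
    h = np.tanh(W_xh @ x_onehot + W_hh @ h_prev)
    logits = W_hy @ h
    p = np.exp(logits - logits.max())          # softmax over the 86 characters
    return h, p / p.sum()

n_chars, n_hidden = 86, 1500
rng = np.random.default_rng(0)
W_xh = 0.01 * rng.standard_normal((n_hidden, n_chars))
W_hh = 0.01 * rng.standard_normal((n_hidden, n_hidden))
W_hy = 0.01 * rng.standard_normal((n_chars, n_hidden))
x = np.zeros(n_chars); x[0] = 1.0              # one-hot current character
h, p_next = rnn_step(x, np.zeros(n_hidden), W_xh, W_hh, W_hy)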
RNN to predict characters
A sub-tree in the tree of all character strings.
[Figure: …fix branches on the next character, e.g. to …fixi (then …fixin) or to …fixe]
There are exponentially many nodes in the tree of all character strings of length N.
If the nodes are implemented as hidden states in an RNN, different nodes can share structure because they use distributed representations.
In an RNN, each node is a hidden state vector. The next character must transform this to a new node.
The next hidden representation needs to depend on the conjunction of the current character and the current hidden representation.
Multiplicative connections
Instead of using the inputs to the recurrent net to provide additive extra input to the hidden units, we could use the current input character to choose the whole hidden-to-hidden weight matrix.
But this requires 86 × 1500 × 1500 parameters, which could make the net overfit.
Can we achieve the same kind of multiplicative interaction using fewer parameters?
We want a different transition matrix for each of the 86 characters, but we want these 86 character-specific weight matrices to share parameters (the characters 9 and 8 should have similar matrices).
Using factors to implement multiplicative interactions
We can get groups a and b to interact multiplicatively by using “factors”. Each factor first computes a weighted sum for each of its input groups. Then it sends the product of the weighted sums to its output group.
$c_f = \big(b^\top w_f\big)\big(a^\top u_f\big)\, v_f$
where $b^\top w_f$ is the scalar input to factor f from group b, $a^\top u_f$ is the scalar input to factor f from group a, and $c_f$ is factor f's vector of inputs to group c.
[Figure: groups a and b feed each factor f through weight vectors $u_f$ and $w_f$; the factor's output vector $v_f$, scaled by the product of the two weighted sums, feeds group c]
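A minimal sketch of this factored computation, with U, W, V holding the per-factor vectors u_f, w_f, v_f as rows; the group sizes and number of factors are illustrative.

import numpy as np

def factor_input_to_c(a, b, U, W, V):
    s_a = U @ a                 # a^T u_f for every factor f
    s_b = W @ b                 # b^T w_f for every factor f
    # sum over factors of c_f = (b^T w_f)(a^T u_f) v_f
    return (s_a * s_b) @ V

rng = np.random.default_rng(0)
a = rng.standard_normal(1500)                 # group a: previous hidden state
b = np.zeros(86); b[4] = 1.0                  # group b: 1-of-86 current character
U = 0.01 * rng.standard_normal((512, 1500))
W = 0.01 * rng.standard_normal((512, 86))
V = 0.01 * rng.standard_normal((512, 1500))
c = factor_input_to_c(a, b, U, W, V)          # input to group c (next hidden state)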
Sample text generated by the character-level RNN:
He was elected President during the Revolutionary War and forgave Opus Paul at Rome. The regime of his crew of England, is now Arab women's icons in and the demons that use something between the characters‘ sisters in lower coil trains were always operated on the line of the ephemerable street, respectively, the graphic or other facility for deformation of a given proportion of large segments at RTUS). The B every chord was a "strongly cold internal palette pour even the white blade.”
The meaning of life is … 42?
The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger.
Is the RNN deep enough?
This deep structure provides memory, not hierarchical processing.
Adding hierarchical processing
Pascanu, Gulcehre, Cho, and Bengio (2013)
Why Unsupervised Pre-training Works (from Bengio's talk)
Optimization Hypothesis: unsupervised training initializes the weights near better local minima than random initialization can reach.
Regularization Hypothesis (prevents over-fitting): the unsupervised pre-training dataset is larger; features extracted from the unsupervised set are more general and have better discriminative power.
Why Unsupervised Pre-training Works
Bengio: learning P(x) or P(x, h) helps you with P(y|x).
Structures and features that can generate the inputs (whether or not a probabilistic formulation is used) also happen to be useful for your supervised task.
This requires P(x) and P(y|x) to be related, i.e. similar-looking x produce similar y.
This is probably more true for vision / audio than for text.
Conclusion
Motivation for deep learning
Backpropagation
Autoencoders and sparsity
Generative, layer-wise pre-training (Stacked Autoencoder)
Recurrent Neural Networks
Speculation about why these things work