
Page 1: Restricted Boltzmann Machine and Deep Belief Net

Restricted Boltzmann Machine and Deep Belief Net
Wanli Ouyang, [email protected]

Animation is available for illustration.

Page 2: Restricted Boltzmann Machine and Deep Belief Net

Outline
• Short introduction on deep learning
• Short introduction on statistical models and graphical models
• Restricted Boltzmann Machine (RBM) and contrastive divergence
• Deep belief net (DBN)

RBM and DBN are statistical models.
The deep belief net is trained using RBM and CD.
The deep belief net provides an unsupervised training algorithm for deep neural networks.

Page 3: Restricted Boltzmann Machine and Deep Belief Net

Good learning resources
• Webpages:
– Geoffrey E. Hinton's readings (with source code available for DBN): http://www.cs.toronto.edu/~hinton/csc2515/deeprefs.html
– Notes on Deep Belief Networks: http://www.quantumg.net/dbns.php
– MLSS Tutorial, October 2010, ANU Canberra, Marcus Frean: http://videolectures.net/mlss2010au_frean_deepbeliefnets/
– Deep Learning Tutorials: http://deeplearning.net/tutorial/
– Hinton's Tutorial: http://videolectures.net/mlss09uk_hinton_dbn/
– Fergus's Tutorial: http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
– CUHK MMlab project: http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
• People:
– Geoffrey E. Hinton: http://www.cs.toronto.edu/~hinton
– Andrew Ng: http://www.cs.stanford.edu/people/ang/index.html
– Ruslan Salakhutdinov: http://www.utstat.toronto.edu/~rsalakhu/
– Yee-Whye Teh: http://www.gatsby.ucl.ac.uk/~ywteh/
– Yoshua Bengio: www.iro.umontreal.ca/~bengioy
– Yann LeCun: http://yann.lecun.com/
– Marcus Frean: http://ecs.victoria.ac.nz/Main/MarcusFrean
– Rob Fergus: http://cs.nyu.edu/~fergus/pmwiki/pmwiki.php
• Acknowledgement
– Many materials in this presentation are from these papers and tutorials (especially Hinton's and Frean's); sorry for not listing them in full detail.

Dumitru Erhan, Aaron Courville, Yoshua Bengio. Understanding Representations Learned in Deep Architectures. Technical Report.

Page 4: Restricted Boltzmann Machine and Deep Belief Net

Neural network, back propagation (1986):
• Solves general learning problems
• Tied with the biological system
But it was given up, because it was:
• Hard to train
• Short of computational resources
• Short of training data
• Not working well

Until 2006, the alternatives dominated: SVM, boosting, decision trees, KNN, …
• Loose tie with biological systems
• Shallow models
• Specific methods for specific tasks: hand-crafted features (GMM-HMM, SIFT, LBP, HOG) [Kruger et al. TPAMI'13]

Deep belief net (Science, 2006):
• Unsupervised and layer-wise pre-training
• Better designs for modeling and training (normalization, nonlinearity, dropout)
• Feature learning
• New developments in computer architectures: GPUs, multi-core computer systems
• Large-scale databases

2011-2012: deep learning results in speech recognition; "How many computers to identify a cat?" (16,000 CPU cores); object recognition over 1,000,000 images and 1,000 categories (2 GPUs).

Page 5: Restricted Boltzmann Machine and Deep Belief Net

Outline
• Short introduction on deep learning
• Short introduction on statistical models and graphical models
• Restricted Boltzmann Machine (RBM) and contrastive divergence
• Deep belief net (DBN)

Page 6: Restricted Boltzmann Machine and Deep Belief Net

Graphical models for statistics
• Express conditional independence between random variables.
• Given C, A and B are independent: P(A, B|C) = P(A|C)P(B|C)
• P(A,B,C) = P(A,B|C)P(C) = P(A|C)P(B|C)P(C)
• Each node is conditionally independent of its non-descendants given the values of its parents.
http://www.eecs.qmul.ac.uk/~norman/BBNs/Independence_and_conditional_independence.htm

Example: C = Smoker?, A = Has lung cancer, B = Has bronchitis. (Figure: node C with children A and B.)
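As a quick sanity check, here is a small, self-contained Python sketch of this factorization; the probability values are made up for illustration, and only the structure P(A,B,C) = P(A|C)P(B|C)P(C) comes from the slide:

```python
import itertools

# Hypothetical CPTs for the smoker example: C = smoker, A = lung cancer, B = bronchitis.
p_c = {1: 0.3, 0: 0.7}            # P(C)
p_a_given_c = {1: 0.1, 0: 0.01}   # P(A=1 | C)
p_b_given_c = {1: 0.2, 0: 0.05}   # P(B=1 | C)

def joint(a, b, c):
    """P(A,B,C) = P(A|C) P(B|C) P(C)."""
    pa = p_a_given_c[c] if a else 1 - p_a_given_c[c]
    pb = p_b_given_c[c] if b else 1 - p_b_given_c[c]
    return pa * pb * p_c[c]

# Check P(A,B|C) == P(A|C) P(B|C) for every configuration.
for a, b, c in itertools.product([0, 1], repeat=3):
    p_ab_given_c = joint(a, b, c) / p_c[c]
    pa = p_a_given_c[c] if a else 1 - p_a_given_c[c]
    pb = p_b_given_c[c] if b else 1 - p_b_given_c[c]
    assert abs(p_ab_given_c - pa * pb) < 1e-12
print("A and B are conditionally independent given C")
```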

Page 7: Restricted Boltzmann Machine and Deep Belief Net

Directed and undirected graphical models
• Directed graphical model:
– P(A,B,C) = P(A|C)P(B|C)P(C)
– Each node is conditionally independent of its non-descendants given the values of its parents.
– Example with four nodes: P(A,B,C,D) = P(D|A,B)P(B|C)P(A|C)P(C)
• Undirected graphical model:
– P(A,B,C) = φ(B,C)φ(A,C)/Z, a normalized product of potential functions
– Also called a Markov Random Field (MRF)

Page 8: Restricted Boltzmann Machine and Deep Belief Net

Modeling an undirected model

• Probability:
P(x; θ) = f(x; θ) / Z(θ),  Z(θ) = Σ_x f(x; θ)
where Z(θ) is the partition function, which normalizes the distribution so that Σ_x P(x; θ) = 1.
• Example for P(A,B,C) = φ(B,C)φ(A,C)/Z, with nodes C = "is smoker?", B = "is healthy", A = "has lung cancer":
P(A,B,C; w1, w2) = exp(w1·BC) exp(w2·AC) / Z(w1, w2)
Z(w1, w2) = Σ_{A,B,C} exp(w1·BC) exp(w2·AC)
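A minimal sketch (binary nodes, made-up weights w1, w2) of how the partition function of this three-node model is computed by brute force; for large models this sum over all configurations is exactly what becomes intractable:

```python
import itertools
import math

w1, w2 = 0.8, -0.5  # hypothetical pairwise weights

def f(A, B, C):
    """Unnormalized potential f(x; w) = exp(w1*B*C) * exp(w2*A*C)."""
    return math.exp(w1 * B * C) * math.exp(w2 * A * C)

# Partition function: sum of f over all 2^3 binary configurations.
Z = sum(f(A, B, C) for A, B, C in itertools.product([0, 1], repeat=3))

def P(A, B, C):
    return f(A, B, C) / Z

# The normalized probabilities sum to one.
total = sum(P(A, B, C) for A, B, C in itertools.product([0, 1], repeat=3))
print(Z, total)  # total is 1.0 up to floating point error
```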

Page 9: Restricted Boltzmann Machine and Deep Belief Net

More directed and undirected models

(Figures: a hidden Markov model with hidden states h1, h2, h3 and observations y1, y2, y3; and a 2D MRF grid over nodes A-I.)

Page 10: Restricted Boltzmann Machine and Deep Belief Net

More directed and undirected models
• A directed model over A, B, C, D:
P(A,B,C,D) = P(A)P(B)P(C|B)P(D|A,B,C)
• A hidden Markov model:
P(y1, y2, y3, h1, h2, h3) = P(h1)P(h2|h1)P(h3|h2)P(y1|h1)P(y2|h2)P(y3|h3)

Page 11: Restricted Boltzmann Machine and Deep Belief Net

More directed and undirected models

(Figure: (a) an HMM with visible v and hidden h; (b) an RBM with weight matrix W between visible x and hidden h; (c) a DBN with layers x, h1, h2, h3 and weights W0, W1, W2; and our deep model with layers x, h1, h2, h3 and weights W1, W2.)

Page 12: Restricted Boltzmann Machine and Deep Belief Net

Extended reading on graphical models
• Zoubin Ghahramani's video lecture on graphical models: http://videolectures.net/mlss07_ghahramani_grafm/

Page 13: Restricted Boltzmann Machine and Deep Belief Net

Outline
• Short introduction on deep learning
• Short introduction on statistical models and graphical models
• Restricted Boltzmann machine and contrastive divergence
– Product of experts
– Contrastive divergence
– Restricted Boltzmann Machine
• Deep belief net

(Callout: RBM and contrastive divergence together give a training algorithm for the deep belief net.)

Page 14: Restricted Boltzmann Machine and Deep Belief Net

Outline
• Short introduction on deep learning
• Short introduction on statistical models and graphical models
• Restricted Boltzmann machine and contrastive divergence
– Product of experts
– Contrastive divergence
– Restricted Boltzmann Machine
• Deep belief net

(Callout: the RBM is a specific, useful case of a product of experts.)

Page 15: Restricted Boltzmann Machine and Deep Belief Net

Outline
• Short introduction on deep learning
• Short introduction on statistical models and graphical models
• Restricted Boltzmann machine and contrastive divergence
– Product of experts
– Contrastive divergence
– Restricted Boltzmann Machine
• Deep belief net

Page 16: Restricted Boltzmann Machine and Deep Belief Net

Product of Experts

P(x; θ) = Π_m f_m(x; θ_m) / Z(θ) = Π_m e^(−E_m(x; θ_m)) / Z(θ)
Z(θ) = Σ_x Π_m f_m(x; θ_m)

where E_m(x; θ_m) = −log f_m(x; θ_m) is the energy function and Z(θ) is the partition function.

Example, an MRF in 2D over nodes A-I arranged in a grid:
E(x; w) = w1·AB + w2·BC + w3·AD + w4·BE + w5·CF + …

Page 17: Restricted Boltzmann Machine and Deep Belief Net

Product of Experts

P(x) ∝ Π_i (1 + e^(x'u_i + c_i))

Each factor (1 + e^(x'u_i + c_i)) is one expert; this is the form of expert that arises when the hidden units of an RBM are summed out.

Page 18: Restricted Boltzmann Machine and Deep Belief Net

Products of experts versus mixture models
• Product of experts:
– an "and" operation
– sharper than a mixture
– each expert can constrain a different subset of dimensions
P(x; θ) = Π_m f_m(x; θ_m) / Z(θ)
• Mixture model, e.g. Gaussian mixture model:
– an "or" operation
– a weighted sum of many density functions, P(x; θ) = Σ_m α_m f_m(x; θ_m)
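A small illustrative sketch (made-up one-dimensional Gaussian experts) contrasting the two combination rules; the product is renormalized by its partition function and comes out sharper than either expert, while the mixture spreads its mass:

```python
import numpy as np

xs = np.linspace(-6.0, 6.0, 2001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f1 = gauss(xs, -1.0, 2.0)   # expert 1
f2 = gauss(xs, +1.0, 2.0)   # expert 2

mixture = 0.5 * f1 + 0.5 * f2       # "or": weighted sum of densities
prod = f1 * f2                      # "and": pointwise product ...
poe = prod / (prod.sum() * dx)      # ... renormalized by the partition function Z

def std(p):
    mean = (xs * p).sum() * dx
    return np.sqrt(((xs - mean) ** 2 * p).sum() * dx)

print("std of mixture:", std(mixture))  # wider
print("std of PoE:    ", std(poe))      # narrower (sharper)
```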

Page 19: Restricted Boltzmann Machine and Deep Belief Net

Outline
• Basic background on statistical learning and graphical models
• Contrastive divergence and restricted Boltzmann machine
– Product of experts
– Contrastive divergence
– Restricted Boltzmann Machine
• Deep belief net

Page 20: Restricted Boltzmann Machine and Deep Belief Net

Contrastive Divergence (CD)
• Probability: P(x; θ) = f(x; θ) / Z(θ), with partition function Z(θ) = Σ_x Π_m f_m(x; θ).
• Maximum likelihood and gradient ascent. Given training data X = {x^(k)}, k = 1, …, K:
max_θ P(X; θ) = max_θ Π_k P(x^(k); θ) = max_θ L(X; θ)
L(X; θ) = (1/K) Σ_k log P(x^(k); θ) = (1/K) Σ_k log f(x^(k); θ) − log Z(θ)
• Gradient:
∂L(X; θ)/∂θ = (1/K) Σ_k ∂log f(x^(k); θ)/∂θ − Σ_x p(x; θ) ∂log f(x; θ)/∂θ
(the first term is an expectation under the data distribution, the second under the model distribution)
• Learn by gradient ascent, θ^(t+1) = θ^(t) + η ∂L(X; θ^(t))/∂θ, or by solving ∂L(X; θ)/∂θ = 0.

Page 21: Restricted Boltzmann Machine and Deep Belief Net

Contrastive Divergence (CD)
• Gradient of the likelihood:
∂L(X; θ)/∂θ = (1/K) Σ_k ∂log f(x^(k); θ)/∂θ − Σ_x p(x; θ) ∂log f(x; θ)/∂θ

The first term is tractable: it is easy to compute from the data. The second term, the expectation under the model, is intractable; it is approximated by Gibbs sampling from p(x; θ). Contrastive divergence runs the sampler for only T = 1 step, giving an approximate but fast gradient instead of the accurate but slow gradient of maximum likelihood.

θ^(t+1) = θ^(t) + η ∂L(X; θ^(t))/∂θ |_CD

Page 22: Restricted Boltzmann Machine and Deep Belief Net

Gibbs sampling for graphical models

(Figure: a graphical model with visible nodes x1, x2, x3 and hidden nodes h1-h5.)

More information on Gibbs sampling: Pattern Recognition and Machine Learning (PRML).
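A minimal sketch of block Gibbs sampling for a model of this shape, assuming the factorized RBM-style conditionals stated later on the RBM slides (σ is the logistic sigmoid; W, b, c are made-up parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_visible, n_hidden = 3, 5
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))  # hypothetical weights
b = np.zeros(n_visible)                                # visible biases
c = np.zeros(n_hidden)                                 # hidden biases

x = rng.integers(0, 2, size=n_visible).astype(float)   # random initial state

# Alternate: sample all h given x, then all x given h (block Gibbs).
for step in range(1000):
    h = (rng.random(n_hidden) < sigmoid(c + W @ x)).astype(float)     # h_i ~ P(h_i=1|x)
    x = (rng.random(n_visible) < sigmoid(b + W.T @ h)).astype(float)  # x_j ~ P(x_j=1|h)
print(x, h)
```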

Page 23: Restricted Boltzmann Machine and Deep Belief Net

Convergence of contrastive divergence (CD)
• The fixed points of ML are not fixed points of CD, and vice versa.
– CD is a biased learning algorithm.
– But the bias is typically very small.
– CD can be used for getting close to the ML solution, and then ML learning can be used for fine-tuning.
• It is not clear whether CD learning converges (to a stable fixed point). As of 2005, no proof was available.
• Further theoretical results? Please inform us.

M. A. Carreira-Perpiñán and G. E. Hinton. On Contrastive Divergence Learning. Artificial Intelligence and Statistics, 2005.

Page 24: Restricted Boltzmann Machine and Deep Belief Net

Outline
• Basic background on statistical learning and graphical models
• Contrastive divergence and restricted Boltzmann machine
– Product of experts
– Contrastive divergence
– Restricted Boltzmann Machine
• Deep belief net

Page 25: Restricted Boltzmann Machine and Deep Belief Net

Boltzmann Machine

• An undirected graphical model with hidden nodes.
• Energy defined over all pairs of units, with parameters θ: {w_ij, θ_i}:
E(x; θ) = Σ_{i,j} w_ij x_i x_j + Σ_i θ_i x_i
• Probability:
P(x; θ) = f(x; θ) / Z(θ) = e^(−E(x; θ)) / Z(θ)
• With hidden units h, the Boltzmann machine energy is
E(x,h) = b'x + c'h + h'Wx + x'Ux + h'Vh

Page 26: Restricted Boltzmann Machine and Deep Belief Net

Restricted Boltzmann Machine (RBM)

• Undirected, loopy, layered: a bipartite graph with no connections within a layer.
• E(x,h) = b'x + c'h + h'Wx
(compare the full Boltzmann machine: E(x,h) = b'x + c'h + h'Wx + x'Ux + h'Vh)
• Joint and marginal probabilities, where the denominator is the partition function:
P(x,h) = e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h))
P(x) = Σ_h e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h))
• Because of the bipartite structure, the conditionals factorize:
P(h|x) = Π_i P(h_i|x),  P(x|h) = Π_j P(x_j|h)
P(x_j = 1|h) = σ(b_j + W'•j · h)
P(h_i = 1|x) = σ(c_i + W_i• · x)

(Figure: visible units x1, x2, x3 and hidden units h1-h5 connected by weights W.)

Read the manuscript for details.
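A short numerical sketch of this factorization (W, b, c are made-up parameters). Note the sign convention: with P ∝ e^(−E), the energy in code carries an explicit minus sign so that the stated sigmoid conditionals hold. The brute-force conditional computed from the joint matches the factorized formula:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

nv, nh = 3, 2                   # small enough to enumerate exactly
W = rng.normal(size=(nh, nv))   # hypothetical parameters
b = rng.normal(size=nv)
c = rng.normal(size=nh)

def energy(x, h):
    # E(x,h) = -(b'x + c'h + h'Wx), so that P(x,h) is proportional to e^{-E}
    return -(b @ x + c @ h + h @ W @ x)

x = np.array([1.0, 0.0, 1.0])

# Brute force: P(h_0 = 1 | x) from the joint over all hidden configurations.
hs = [np.array(h, dtype=float) for h in itertools.product([0, 1], repeat=nh)]
weights = np.array([np.exp(-energy(x, h)) for h in hs])
p_h0_brute = weights[[i for i, h in enumerate(hs) if h[0] == 1]].sum() / weights.sum()

# Factorized formula: P(h_i = 1 | x) = sigmoid(c_i + W_i . x)
p_h0_formula = sigmoid(c[0] + W[0] @ x)
print(p_h0_brute, p_h0_formula)   # the two agree
```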

Page 27: Restricted Boltzmann Machine and Deep Belief Net

Restricted Boltzmann Machine (RBM)

• E(x,h) = b'x + c'h + h'Wx
• x = [x1 x2 …]', h = [h1 h2 …]'
• Marginalizing over h gives
P(x; θ) = f(x; θ) / Z(θ) = Σ_h e^(−E(x,h)) / Z(θ)
• Parameter learning: maximum log-likelihood,
max_θ L(X; θ) = max_θ (1/K) Σ_k log P(x^(k); θ)
(equivalently, minimize the negative log-likelihood)

Geoffrey E. Hinton. "Training Products of Experts by Minimizing Contrastive Divergence." Neural Computation 14, 1771-1800 (2002).

Page 28: Restricted Boltzmann Machine and Deep Belief Net

CD for RBM

• CD for the RBM is very fast!
• With P(x; θ) = f(x; θ)/Z(θ) = Σ_h e^(−E(x,h)) / Z(θ), the likelihood gradient for a weight w_ij is
∂L(X; θ)/∂w_ij = (1/K) Σ_k ∂log f(x^(k); θ)/∂w_ij − Σ_x p(x; θ) ∂log f(x; θ)/∂w_ij
= ⟨x_j h_i⟩_data − ⟨x_j h_i⟩_model
≈ ⟨x_j h_i⟩_0 − ⟨x_j h_i⟩_1
(contrastive divergence replaces the model expectation with the expectation after one step of Gibbs sampling)
• Update: w^(t+1) = w^(t) + η ∂L(X; θ^(t))/∂w
• Sampling uses the factorized conditionals:
P(x_j = 1|h) = σ(b_j + W'•j · h),  P(h_i = 1|x) = σ(c_i + W_i• · x)
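A minimal CD-1 training sketch for a Bernoulli RBM on a toy dataset (the data, layer sizes, and hyperparameters are all illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

nv, nh = 6, 4
W = 0.01 * rng.normal(size=(nh, nv))   # weights
b = np.zeros(nv)                        # visible biases
c = np.zeros(nh)                        # hidden biases
lr = 0.1

# Toy binary data: two repeated patterns.
X = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1]] * 50, dtype=float)

for epoch in range(100):
    for x0 in X:
        # Positive phase: statistics <x_j h_i>_0 under P(h|x0)
        ph0 = sigmoid(c + W @ x0)
        h0 = (rng.random(nh) < ph0).astype(float)
        # One Gibbs step: x1 ~ P(x|h0), then hidden probabilities for x1
        x1 = (rng.random(nv) < sigmoid(b + W.T @ h0)).astype(float)
        ph1 = sigmoid(c + W @ x1)
        # CD-1 update: <x_j h_i>_0 - <x_j h_i>_1
        W += lr * (np.outer(ph0, x0) - np.outer(ph1, x1))
        b += lr * (x0 - x1)
        c += lr * (ph0 - ph1)

# After training, the RBM reconstructs the training patterns well.
h = (rng.random(nh) < sigmoid(c + W @ X[0])).astype(float)
print(np.round(sigmoid(b + W.T @ h), 2))   # close to X[0]
```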

Page 29: Restricted Boltzmann Machine and Deep Belief Net

CD for RBM

∂L(X; θ)/∂w_ij = ⟨x_j h_i⟩_0 − ⟨x_j h_i⟩_1

(Figure: the alternating Gibbs chain on a small RBM with visible units x1, x2 and hidden units h1, h2. Starting from a data vector x^0, sample the hidden units with P(h_i = 1|x) = σ(c_i + W_i• · x), sample the visible units with P(x_j = 1|h) = σ(b_j + W'•j · h), and sample the hidden units once more.)

Page 30: Restricted Boltzmann Machine and Deep Belief Net

RBM for classification

• y: classification label

Hugo Larochelle and Yoshua Bengio. Classification using Discriminative Restricted Boltzmann Machines. ICML 2008.

Page 31: Restricted Boltzmann Machine and Deep Belief Net

RBM itself has many applications

• Multiclass classification
• Collaborative filtering
• Motion capture modeling
• Information retrieval
• Modeling natural images
• Segmentation

Y. Li, D. Tarlow, R. Zemel. Exploring compositional high order pattern potentials for structured output learning. CVPR 2013.
V. Mnih, H. Larochelle, G. E. Hinton. Conditional Restricted Boltzmann Machines for Structured Output Prediction. Uncertainty in Artificial Intelligence, 2011.
Larochelle, H., & Bengio, Y. Classification using discriminative restricted Boltzmann machines. ICML 2008.
Salakhutdinov, R., Mnih, A., & Hinton, G. E. Restricted Boltzmann machines for collaborative filtering. ICML 2007.
Salakhutdinov, R., & Hinton, G. E. Replicated softmax: an undirected topic model. NIPS 2009.
Osindero, S., & Hinton, G. E. Modeling image patches with a directed hierarchy of Markov random fields. NIPS 2008.

Page 32: Restricted Boltzmann Machine and Deep Belief Net

Outline
• Basic background on statistical learning and graphical models
• Contrastive divergence and restricted Boltzmann machine
• Deep belief net (DBN)
– Why deep learning?
– Learning and inference
– Applications

Page 33: Restricted Boltzmann Machine and Deep Belief Net

Belief Nets

• A belief net is a directed acyclic graph composed of random variables.

(Figure: random hidden causes at the top, with directed edges down to visible effects.)

Page 34: Restricted Boltzmann Machine and Deep Belief Net

Deep Belief Net

• A belief net that is deep.
• A generative model:
P(x, h1, …, hl) = p(x|h1) p(h1|h2) … p(hl−2|hl−1) p(hl−1, hl)
• Used for unsupervised training of multi-layer deep models.
• Example with three hidden layers (the top two layers keep an undirected joint):
P(x, h1, h2, h3) = p(x|h1) p(h1|h2) p(h2, h3)
• Pixels => edges => local shapes => object parts

(Figure: a DBN with layers x, h1, h2, h3.)
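A sketch of generating from this factorization (weights and layer sizes are hypothetical, and biases are omitted for brevity): the top joint p(h2, h3) is sampled by a few steps of Gibbs in the top-level RBM, then each lower layer is sampled top-down:

```python
import numpy as np

rng = np.random.default_rng(7)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

# Hypothetical pre-trained weights: W3 is the top RBM (h2 <-> h3),
# W2 and W1 are used as top-down directed weights (h2 -> h1 -> x).
sizes = {"x": 784, "h1": 500, "h2": 500, "h3": 2000}
W1 = 0.01 * rng.normal(size=(sizes["h1"], sizes["x"]))
W2 = 0.01 * rng.normal(size=(sizes["h2"], sizes["h1"]))
W3 = 0.01 * rng.normal(size=(sizes["h3"], sizes["h2"]))

# 1) Sample (h2, h3) from the top-level RBM joint p(h2, h3) by Gibbs.
h2 = sample(0.5 * np.ones(sizes["h2"]))
for _ in range(50):
    h3 = sample(sigmoid(W3 @ h2))
    h2 = sample(sigmoid(W3.T @ h3))

# 2) Directed top-down passes: h1 ~ p(h1|h2), then x ~ p(x|h1).
h1 = sample(sigmoid(W2.T @ h2))
x = sample(sigmoid(W1.T @ h1))
print(x.shape)
```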

Page 35: Restricted Boltzmann Machine and Deep Belief Net

Why deep learning?

• The mammal brain is organized in a deep architecture, with a given input percept represented at multiple levels of abstraction, each level corresponding to a different area of cortex.
• An architecture with insufficient depth can require many more computational elements, potentially exponentially more (with respect to input size), than architectures whose depth is matched to the task.
• Since the number of computational elements one can afford depends on the number of training examples available to tune or select them, the consequences are not just computational but also statistical: poor generalization may be expected when using an insufficiently deep architecture for representing some functions.
• Pixels => edges => local shapes => object parts

T. Serre et al. "A quantitative theory of immediate visual recognition." Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, vol. 165, pp. 33-56, 2007.
Yoshua Bengio. "Learning Deep Architectures for AI." Foundations and Trends in Machine Learning, 2009.

Page 36: Restricted Boltzmann Machine and Deep Belief Net

Why deep learning?

• Linear regression, logistic regression: depth 1
• Kernel SVM: depth 2
• Decision tree: depth 2
• Boosting: depth 2
• The basic conclusion these results suggest is that when a function can be compactly represented by a deep architecture, it might need a very large architecture to be represented by an insufficiently deep one. (Example: logic gates, multi-layer NN with linear threshold units and positive weights.)

Yoshua Bengio. "Learning Deep Architectures for AI." Foundations and Trends in Machine Learning, 2009.

Page 37: Restricted Boltzmann Machine and Deep Belief Net

Example: sum-product network (SPN)

(Figure: a deep SPN over variables X1-X5 and their negations.)
• A shallow (two-layer) representation needs on the order of 2^(N−1) terms, i.e. N·2^(N−1) parameters.
• A deep SPN needs only O(N) parameters.

Page 38: Restricted Boltzmann Machine and Deep Belief Net

Depth of existing approaches

• Boosting (2 layers)
– Layer 1: base learners
– Layer 2: vote or linear combination of layer 1
• Decision tree, LLE, KNN, kernel SVM (2 layers)
– Layer 1: matching degree to a set of local templates, e.g. b + Σ_i α_i K(x, x_i)
– Layer 2: combine these degrees
• Brain: 5-10 layers

Page 39: Restricted Boltzmann Machine and Deep Belief Net

Why does the decision tree have depth 2?

• It relies on a partition of the input space.
• It is a local estimator: it relies on a partition of the input space and uses separate parameters for each region. Each region is associated with a leaf.
• It needs as many training samples as there are variations of interest in the target function, so it is not good for highly varying functions.
• The number of training samples needed is exponential in the number of dimensions in order to achieve a fixed error rate.

Page 40: Restricted Boltzmann Machine and Deep Belief Net

Outline
• Basic background on statistical learning and graphical models
• Contrastive divergence and restricted Boltzmann machine
• Deep belief net (DBN)
– Why DBN?
– Learning and inference
– Applications

Page 41: Restricted Boltzmann Machine and Deep Belief Net

Deep Belief Net

• Inference problem: infer the states of the unobserved variables.
• Learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.

P(x, h1, h2, h3) = p(x|h1) p(h1|h2) p(h2, h3)
(Figure: a DBN with layers x, h1, h2, h3.)

Page 42: Restricted Boltzmann Machine and Deep Belief Net

Deep Belief Net

• Inference problem (the problem of explaining away):
– In a directed model with a common parent C, P(A,B|C) = P(A|C)P(B|C).
– For a DBN layer the situation is reversed: given the visible unit, the hidden causes compete to explain it, so
P(h11, h12 | x1) ≠ P(h11 | x1) P(h12 | x1)
(An example is given in the manuscript.)
• Solution: complementary prior.

Page 43: Restricted Boltzmann Machine and Deep Belief Net

Deep Belief Net

Inference problem (the problem of explaining away). Solution: complementary prior.

(Figure: a deep belief net with layers of 2000, 1000 and 500 units and a 30-unit top layer.)

Page 44: Restricted Boltzmann Machine and Deep Belief Net

Deep Belief Net

• Explaining-away problem of inference (see the manuscript).
– Solution: complementary prior, see the manuscript.
• Learning problem:
– Greedy layer-by-layer RBM training (optimizing a lower bound), followed by fine-tuning; a code sketch follows below.
– Contrastive divergence for RBM training, using P(h_i = 1|x) = σ(c_i + W_i• · x).

(Figure: each adjacent pair of layers, (x, h1), then (h1, h2), then (h2, h3), is trained as an RBM in turn.)
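A greedy layer-wise pretraining sketch (illustrative only; train_rbm_cd1 is a minimal one-layer CD-1 trainer as in the earlier CD sketch, and the toy data and layer sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm_cd1(data, n_hidden, epochs=10, lr=0.1):
    """Train one RBM with CD-1 and return (W, b, c). Minimal, unoptimized."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_hidden, n_visible))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for x0 in data:
            ph0 = sigmoid(c + W @ x0)
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            x1 = (rng.random(n_visible) < sigmoid(b + W.T @ h0)).astype(float)
            ph1 = sigmoid(c + W @ x1)
            W += lr * (np.outer(ph0, x0) - np.outer(ph1, x1))
            b += lr * (x0 - x1)
            c += lr * (ph0 - ph1)
    return W, b, c

# Greedy stacking: train layer 1 on the data, then layer 2 on layer 1's
# hidden activations, and so on; each new RBM models P(h) of the layer below.
X = rng.integers(0, 2, size=(200, 20)).astype(float)   # toy binary data
layer_sizes = [15, 10, 5]
rbms, layer_input = [], X
for n_hidden in layer_sizes:
    W, b, c = train_rbm_cd1(layer_input, n_hidden)
    rbms.append((W, b, c))
    layer_input = sigmoid(layer_input @ W.T + c)        # feed activations upward
print([r[0].shape for r in rbms])
```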

Page 45: Restricted Boltzmann Machine and Deep Belief Net

Code reading
• It is much easier to read the DeepLearningToolbox code for understanding the DBN.


Page 47: Restricted Boltzmann Machine and Deep Belief Net

Deep Belief Net

• Why does greedy layer-wise learning work?
• It optimizes a lower bound on the log-likelihood:
log P(x) = log Σ_{h1} P(x, h1)
≥ Σ_{h1} Q(h1|x) [log P(h1) + log P(x|h1)] − Σ_{h1} Q(h1|x) log Q(h1|x)   (1)
• When we fix the parameters for layer 1 and optimize the parameters for layer 2, we are optimizing P(h1) in (1).

(Figure: the stacked layers (x, h1), (h1, h2), (h2, h3).)

Page 48: Restricted Boltzmann Machine and Deep Belief Net

Deep Belief Net and RBM

• An RBM can be considered as a DBN with infinitely many layers.

(Figure: the RBM with weight matrix W between x0 and h0 is equivalent to an infinite directed net whose layers x0, h0, x1, h1, x2, … alternate the weights W and W'.)

Page 49: Restricted Boltzmann Machine and Deep Belief Net

Pre-train, fine-tune and inference (autoencoder)

(Figure: pre-training with stacked RBMs, then fine-tuning with back propagation (BP).)

Page 50: Restricted Boltzmann Machine and Deep Belief Net

Pre-train, fine-tune and inference - 2

(Figure: pretraining, then fine-tuning; y is the identity or rotation degree.)

Page 51: Restricted Boltzmann Machine and Deep Belief Net

How many layers should we use?

• There might be no universally right depth:
– Bengio suggests that several layers are better than one.
– Results are robust against changes in the size of a layer, but the top layer should be big.
– Depth is a parameter; it depends on your task.
– With enough narrow layers, we can model any distribution over binary vectors [1].

Copied from http://videolectures.net/mlss09uk_hinton_dbn/
[1] Sutskever, I. and Hinton, G. E. Deep Narrow Sigmoid Belief Networks are Universal Approximators. Neural Computation, 2007.

Page 52: Restricted Boltzmann Machine and Deep Belief Net

Effect of unsupervised pre-training

(Figure: results from Erhan et al., AISTATS 2009.)

Page 53: Restricted Boltzmann Machine and Deep Belief Net

Effect of depth

(Figure: error as depth increases, with pre-training vs. without pre-training.)

Page 54: Restricted Boltzmann Machine and Deep Belief Net

Why unsupervised pre-training makes sense

(Figure: two generative stories relating hidden "stuff", the image, and the label.)

• If image-label pairs were generated with the label computed directly from the image, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?
• If image-label pairs are generated by hidden "stuff" that causes the image through a high-bandwidth pathway and the label through a low-bandwidth pathway, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.

Page 55: Restricted Boltzmann Machine and Deep Belief Net

Beyond layer-wise pretraining

• Layer-wise pretraining is efficient but not optimal.
• It is possible to train the parameters for all layers jointly using a wake-sleep algorithm:
– bottom-up, in a layer-wise manner;
– top-down, refitting the earlier models.

Page 56: Restricted Boltzmann Machine and Deep Belief Net

Fine-tuning with a contrastive version of the "wake-sleep" algorithm

After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass.
– Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM.
– Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass.
– Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

Page 57: Restricted Boltzmann Machine and Deep Belief Net

Include lateral connections

• The RBM has no connections within a layer.
• This can be generalized.
• Lateral connections for the first layer [1]:
– Sampling from P(h|x) is still easy, but sampling from P(x|h) becomes more difficult.
• Lateral connections at multiple layers [2]:
– Generate more realistic images.
– CD is still applicable, with small modifications.

[1] B. A. Olshausen and D. J. Field. "Sparse coding with an overcomplete basis set: a strategy employed by V1?" Vision Research, vol. 37, pp. 3311-3325, December 1997.
[2] S. Osindero and G. E. Hinton. "Modeling image patches with a directed hierarchy of Markov random fields." NIPS 2007.

Page 58: Restricted Boltzmann Machine and Deep Belief Net

Without lateral connections

Page 59: Restricted Boltzmann Machine and Deep Belief Net

With lateral connections

Page 60: Restricted Boltzmann Machine and Deep Belief Net

My data is real valued …

• Make it [0, 1] linearly: x = ax + b
• Or use another distribution for the visible units
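A one-line min-max rescaling sketch for the linear option (per-feature a and b chosen so the data lands in [0, 1]; the guard against constant columns is my addition):

```python
import numpy as np

def rescale_01(X):
    """Map each column of X linearly into [0, 1]: x <- a*x + b."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.maximum(hi - lo, 1e-12)  # avoid division by zero

X = np.array([[3.0, -2.0], [7.0, 0.0], [5.0, 2.0]])
print(rescale_01(X))
```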

Page 61: Restricted Boltzmann Machine and Deep Belief Net

My data has temporal dependency …

• Static model
• Temporal model


Page 63: Restricted Boltzmann Machine and Deep Belief Net

I consider the DBN as…

• A statistical model that is used for unsupervised training of fully connected deep models.
• A directed graphical model that is approximated by fast learning and inference algorithms.
• A directed graphical model that is fine-tuned using a mature neural network learning approach: back propagation (BP).

Page 64: Restricted Boltzmann Machine and Deep Belief Net

Outline
• Basic background on statistical learning and graphical models
• Contrastive divergence and restricted Boltzmann machine
• Deep belief net (DBN)
– Why DBN?
– Learning and inference
– Applications

Page 65: Restricted Boltzmann Machine and Deep Belief Net

Applications of deep learning

• Handwritten digit recognition
• Dimensionality reduction
• Information retrieval
• Segmentation
• Denoising
• Phone recognition
• Object recognition
• Object detection
• …

Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.
Welling, M., et al. Exponential Family Harmoniums with an Application to Information Retrieval. NIPS 2004.
A. R. Mohamed, et al. Deep Belief Networks for phone recognition. NIPS 2009 workshop on deep learning for speech recognition.
Nair, V. and Hinton, G. E. 3-D Object recognition with deep belief nets. NIPS 2009.

Page 66: Restricted Boltzmann Machine and Deep Belief Net

Object recognition
• NORB error rates:
– logistic regression 19.6%, kNN (k=1) 18.4%, Gaussian kernel SVM 11.6%, convolutional neural net 6.0%, convolutional net + SVM hybrid 5.9%, DBN 6.5%.
– With the extra unlabeled data (and the same amount of labeled data as before), the DBN achieves 5.2%.

Page 67: Restricted Boltzmann Machine and Deep Belief Net

Object recognition: ImageNet

Rank  Name         Error rate  Description
1     U. Toronto   0.15315     Deep learning
2     U. Tokyo     0.26172     Hand-crafted features and learning models. Bottleneck.
3     U. Oxford    0.26979
4     Xerox/INRIA  0.27058

Page 68: Restricted Boltzmann Machine and Deep Belief Net

Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)

Page 69: Restricted Boltzmann Machine and Deep Belief Net

The training and test sets

• 11,000 unlabeled cases; 100, 500, or 1,000 labeled cases.
• Test set: face patches from new people.

Page 70: Restricted Boltzmann Machine and Deep Belief Net

The root mean squared error in the orientation when combining GPs with deep belief nets:

                                            100 labels  500 labels  1000 labels
GP on the pixels                            22.2        17.9        15.2
GP on top-level features                    17.2        12.7        7.2
GP on top-level features with fine-tuning   16.3        11.2        6.4

Conclusion: the deep features are much better than the pixels. Fine-tuning helps a lot.

Page 71: Restricted Boltzmann Machine and Deep Belief Net

Deep Autoencoders (Hinton & Salakhutdinov, 2006)

• They always looked like a really nice way to do non-linear dimensionality reduction:
– But it is very difficult to optimize deep autoencoders using backpropagation.
• We now have a much better way to optimize them:
– First train a stack of 4 RBMs.
– Then "unroll" them.
– Then fine-tune with backprop.

(Figure: encoder 28x28 -> 1000 -> 500 -> 250 -> 30 linear units with weights W1…W4, unrolled into a decoder with transposed weights W4'…W1' back to 28x28.)
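A sketch of the "unroll" step (assuming rbms is a list of (W, b, c) tuples from greedy pretraining, as in the earlier stacking sketch): the decoder reuses transposed encoder weights as its initialization before joint backprop fine-tuning. For simplicity every layer here is logistic, whereas the 30-unit code layer in the original was linear:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def unroll(rbms):
    """Build (encoder, decoder) weight lists from a stack of pretrained RBMs."""
    encoder = [(W, c) for (W, b, c) in rbms]              # x -> ... -> code
    decoder = [(W.T, b) for (W, b, c) in reversed(rbms)]  # code -> ... -> x_hat
    return encoder, decoder

def autoencode(x, encoder, decoder):
    code = x
    for W, c in encoder:
        code = sigmoid(W @ code + c)
    x_hat = code
    for Wt, b in decoder:
        x_hat = sigmoid(Wt @ x_hat + b)
    return code, x_hat

# Demo with random (untrained) weights matching the slide's layer sizes.
rng = np.random.default_rng(0)
rbms = [(0.01 * rng.normal(size=(h, v)), np.zeros(v), np.zeros(h))
        for v, h in [(784, 1000), (1000, 500), (500, 250), (250, 30)]]
enc, dec = unroll(rbms)
code, x_hat = autoencode(rng.random(784), enc, dec)
print(code.shape, x_hat.shape)  # (30,) (784,)

# After unrolling, all weights are fine-tuned jointly with backprop to
# minimize reconstruction error, e.g. ||x - x_hat||^2.
```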

Page 72: Restricted Boltzmann Machine and Deep Belief Net

Deep Autoencoders (Hinton & Salakhutdinov, 2006)

(Figure rows: real data; 30-D deep autoencoder; 30-D PCA.)

Page 73: Restricted Boltzmann Machine and Deep Belief Net

A comparison of methods for compressing digit images to 30 real numbers.

(Figure rows: real data; 30-D deep autoencoder; 30-D logistic PCA; 30-D PCA.)

Page 74: Restricted Boltzmann Machine and Deep Belief Net

Representation of DBN

Page 75: Restricted Boltzmann Machine and Deep Belief Net

Our works

http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html

Page 76: Restricted Boltzmann Machine and Deep Belief Net

Pedestrian Detection

(Figures: results from CVPR'12, CVPR'13, and two ICCV'13 papers.)

Page 77: Restricted Boltzmann Machine and Deep Belief Net

Facial keypoint detection, CVPR'13 (2% average error on LFPW)

Face parsing, CVPR’12

Pedestrian parsing, ICCV’13

Page 78: Restricted Boltzmann Machine and Deep Belief Net

Face Recognition and Face Attribute Recognition (LFW: 96.45%)

Face verification, ICCV'13. Recovering Canonical-View Face Images, ICCV'13.

Face attribute recognition, ICCV’13

Page 79: Restricted Boltzmann Machine and Deep Belief Net

Summary
• Deep belief net (DBN):
– is a network with deep layers, which provides strong representation power;
– is a generative model;
– can be learned layer-wise as RBMs using contrastive divergence;
– has many applications, and more applications are yet to be found.

Generative models explicitly or implicitly model the distribution of inputs and outputs. Discriminative models model the posterior probabilities directly.

Page 80: Restricted Boltzmann Machine and Deep Belief Net

DBN vs. SVM
• A very controversial topic.
• Model:
– DBN is generative, SVM is discriminative. But the fine-tuning of a DBN is discriminative.
• Application:
– SVM is widely applied.
– Researchers are expanding the application areas of DBN.
• Learning:
– DBN is non-convex and slow.
– SVM is convex and fast (in the linear case).
• Which one is better?
– Time will tell.
– You can contribute.

Hinton: The superior classification performance of discriminative learning methods holds only for domains in which it is not possible to learn a good generative model. This set of domains is being eroded by Moore's law.
