16 RBM lecture - wordpress.com · 4/16/2017 · Département d'informatique, Université de...
TRANSCRIPT
Neural networks: Restricted Boltzmann machine - definition
UNSUPERVISED LEARNING 2
Topics: unsupervised learning
• Unsupervised learning: only use the inputs for learning
‣ automatically extract meaningful features from your data
‣ leverage the availability of unlabeled data
‣ add a data-dependent regularizer to training
• We will see 3 neural networks for unsupervised learning:
‣ restricted Boltzmann machines
‣ autoencoders
‣ sparse coding model
Restricted Boltzmann Machines
Hugo Larochelle, Département d'informatique, Universite de...
[email protected]
October 10, 2012
Abstract: Math for my slides "Restricted Boltzmann Machines".
• x^{(t)}, -log p(x^{(t)})
RESTRICTED BOLTZMANN MACHINE 3
Topics: RBM, visible layer, hidden layer, energy function
• hidden layer h (binary units), visible layer x (binary units)
• W: connection weights; b_j: hidden-unit biases; c_k: visible-unit biases
• Energy function: E(x,h)
• Distribution: p(x,h) = exp(-E(x,h))/Z, where Z is the partition function (intractable)
E(x,h) = -h^T W x - c^T x - b^T h  (1)
       = -\sum_j \sum_k W_{j,k} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j  (2)

p(x,h) = exp(-E(x,h))/Z  (3)
       = exp(h^T W x + c^T x + b^T h)/Z  (4)
       = exp(h^T W x) exp(c^T x) exp(b^T h)/Z  (5)

With hidden units h_1, h_2, ..., h_{H-1}, h_H and visible units x_1, x_2, ..., x_D, this factorizes as

p(x,h) = (1/Z) \prod_j \prod_k exp(W_{j,k} h_j x_k) \prod_k exp(c_k x_k) \prod_j exp(b_j h_j)  (6)-(9)
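The energy function and its factorized form can be checked numerically. A minimal NumPy sketch, assuming hypothetical sizes D = 6 and H = 4 and randomly initialized parameters (only the symbols W, b, c, E and the formulas come from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

D, H = 6, 4                              # hypothetical layer sizes
W = rng.normal(scale=0.1, size=(H, D))   # connection weights W_{j,k}
b = np.zeros(H)                          # hidden biases b_j
c = np.zeros(D)                          # visible biases c_k

def energy(x, h):
    """E(x,h) = -h^T W x - c^T x - b^T h (matrix form, eq. 1)."""
    return -h @ W @ x - c @ x - b @ h

x = rng.integers(0, 2, size=D).astype(float)
h = rng.integers(0, 2, size=H).astype(float)

# The double-sum form (eq. 2) gives the same value as the matrix form.
double_sum = -(np.sum(W * np.outer(h, x)) + c @ x + b @ h)
assert np.isclose(energy(x, h), double_sum)

# exp(-E(x,h)) is the unnormalized joint probability; dividing by Z would
# require summing over all 2^(D+H) configurations, which is intractable.
p_unnorm = np.exp(-energy(x, h))
```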
MARKOV NETWORK VIEW 4
Topics: Markov network (with vector nodes)
• The notation based on an energy function is simply an alternative to the representation as a product of factors
MARKOV NETWORK VIEW 5
Topics: Markov network (with scalar nodes)
• The scalar visualization is more informative about the structure within the vectors
(figure: scalar-node Markov network over h_1, ..., h_H and x_1, ..., x_D; pair-wise factors exp(W_{j,k} h_j x_k) connect each pair (h_j, x_k), and unary factors exp(c_k x_k) and exp(b_j h_j) attach to the individual nodes)
RESTRICTED BOLTZMANN MACHINE 6
Topics: RBM, visible layer, hidden layer, energy function
INFERENCE 7
Topics: conditional distributions

p(h|x) = \prod_j p(h_j|x)
p(x|h) = \prod_k p(x_k|h)

p(h_j = 1|x) = 1 / (1 + exp(-(b_j + W_{j.} x))) = sigm(b_j + W_{j.} x)   (W_{j.}: j-th row of W)
p(x_k = 1|h) = 1 / (1 + exp(-(c_k + h^T W_{.k}))) = sigm(c_k + h^T W_{.k})   (W_{.k}: k-th column of W)
8

p(h|x) = p(x,h) / \sum_{h'} p(x,h')  (14)
= [exp(h^T Wx + c^T x + b^T h)/Z] / [\sum_{h' \in \{0,1\}^H} exp(h'^T Wx + c^T x + b^T h')/Z]  (15)
= exp(\sum_j h_j W_{j.}x + b_j h_j) / [\sum_{h'_1 \in \{0,1\}} \cdots \sum_{h'_H \in \{0,1\}} exp(\sum_j h'_j W_{j.}x + b_j h'_j)]  (16)
= \prod_j exp(h_j W_{j.}x + b_j h_j) / [\sum_{h'_1 \in \{0,1\}} \cdots \sum_{h'_H \in \{0,1\}} \prod_j exp(h'_j W_{j.}x + b_j h'_j)]  (17)
= \prod_j exp(h_j W_{j.}x + b_j h_j) / [(\sum_{h'_1 \in \{0,1\}} exp(h'_1 W_{1.}x + b_1 h'_1)) \cdots (\sum_{h'_H \in \{0,1\}} exp(h'_H W_{H.}x + b_H h'_H))]  (18)
= \prod_j exp(h_j W_{j.}x + b_j h_j) / \prod_j (\sum_{h'_j \in \{0,1\}} exp(h'_j W_{j.}x + b_j h'_j))  (19)
= \prod_j exp(h_j W_{j.}x + b_j h_j) / \prod_j (1 + exp(b_j + W_{j.}x))  (20)
= \prod_j [exp(h_j W_{j.}x + b_j h_j) / (1 + exp(b_j + W_{j.}x))]  (21)
= \prod_j p(h_j|x)  (22)

p(h_j = 1|x) = exp(b_j + W_{j.}x) / (1 + exp(b_j + W_{j.}x))  (23)
= 1 / (1 + exp(-b_j - W_{j.}x))  (24)
= sigm(b_j + W_{j.}x)  (25)
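The factorization derived above can be verified by brute force on a tiny model: enumerate all h in {0,1}^H and check that the marginal of each h_j under p(h|x) equals the sigmoid. A NumPy sketch (sizes and parameter values are hypothetical):

```python
import numpy as np
from itertools import product

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(x, W, b):
    """p(h_j = 1 | x) = sigm(b_j + W_{j.} x), independently per j (eq. 25)."""
    return sigm(b + W @ x)

def p_x_given_h(h, W, c):
    """By the symmetric derivation: p(x_k = 1 | h) = sigm(c_k + h^T W_{.k})."""
    return sigm(c + W.T @ h)

# Brute-force check: p(h|x) is proportional to exp(h^T W x + b^T h)
# (the exp(c^T x) factor cancels between numerator and denominator).
rng = np.random.default_rng(1)
H, D = 3, 4
W = rng.normal(size=(H, D)); b = rng.normal(size=H); c = rng.normal(size=D)
x = rng.integers(0, 2, size=D).astype(float)

weights = {hh: np.exp(np.array(hh) @ W @ x + b @ np.array(hh))
           for hh in product([0, 1], repeat=H)}
Z_x = sum(weights.values())
marginal = np.array([sum(w for hh, w in weights.items() if hh[j] == 1)
                     for j in range(H)]) / Z_x
assert np.allclose(marginal, p_h_given_x(x, W, b))
```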
LOCAL MARKOV PROPERTY 10
Topics: local Markov property
• In general, we have the following property:
‣ z_i is any variable in the Markov network (an x_k or an h_j in an RBM)
‣ Ne(z_i) are the neighbors of z_i in the Markov network
p(z_i | z_1, ..., z_V) = p(z_i | Ne(z_i))  (10)
= p(z_i, Ne(z_i)) / \sum_{z'_i} p(z'_i, Ne(z_i))  (11)
= \prod_{f involving z_i and any Ne(z_i)} f(z_i, Ne(z_i)) / [\sum_{z'_i} \prod_{f involving z_i and any Ne(z_i)} f(z'_i, Ne(z_i))]  (12)
RESTRICTED BOLTZMANN MACHINE 11
Topics: RBM, visible layer, hidden layer, energy function
FREE ENERGY 12
Topics: free energy
• What about p(x)?
p(x) = \sum_{h \in \{0,1\}^H} p(x,h) = \sum_{h \in \{0,1\}^H} exp(-E(x,h))/Z  (26)
= exp(c^T x + \sum_{j=1}^H log(1 + exp(b_j + W_{j.}x)))/Z  (27)
= exp(-F(x))/Z  (28)

where F(x) is called the free energy:

p(x) = exp(c^T x + \sum_{j=1}^H log(1 + exp(b_j + W_{j.}x)))/Z  (29)
= exp(c^T x + \sum_{j=1}^H softplus(b_j + W_{j.}x))/Z  (30)
13

p(x) = \sum_{h \in \{0,1\}^H} exp(h^T Wx + c^T x + b^T h)/Z  (32)
= exp(c^T x) \sum_{h_1 \in \{0,1\}} \cdots \sum_{h_H \in \{0,1\}} exp(\sum_j h_j W_{j.}x + b_j h_j)/Z  (33)
= exp(c^T x) (\sum_{h_1 \in \{0,1\}} exp(h_1 W_{1.}x + b_1 h_1)) \cdots (\sum_{h_H \in \{0,1\}} exp(h_H W_{H.}x + b_H h_H))/Z  (34)
= exp(c^T x) (1 + exp(b_1 + W_{1.}x)) \cdots (1 + exp(b_H + W_{H.}x))/Z  (35)
= exp(c^T x) exp(log(1 + exp(b_1 + W_{1.}x))) \cdots exp(log(1 + exp(b_H + W_{H.}x)))/Z  (36)
= exp(c^T x + \sum_{j=1}^H log(1 + exp(b_j + W_{j.}x)))/Z  (37)
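This derivation is what makes p(x) computable up to Z: the sum over the 2^H hidden configurations collapses into H softplus terms. A NumPy sketch comparing the free-energy form to the explicit sum, with tiny hypothetical sizes so the brute-force sum stays feasible:

```python
import numpy as np
from itertools import product

def free_energy(x, W, b, c):
    """F(x) = -c^T x - sum_j softplus(b_j + W_{j.} x), so p(x) = exp(-F(x))/Z.
    np.logaddexp(0, a) computes log(1 + exp(a)) = softplus(a) stably."""
    return -(c @ x) - np.sum(np.logaddexp(0.0, b + W @ x))

rng = np.random.default_rng(2)
H, D = 3, 4
W = rng.normal(size=(H, D)); b = rng.normal(size=H); c = rng.normal(size=D)
x = rng.integers(0, 2, size=D).astype(float)

# Explicit sum over h in {0,1}^H (eq. 32), without the 1/Z factor:
brute = sum(np.exp(np.array(h) @ W @ x + c @ x + b @ np.array(h))
            for h in product([0, 1], repeat=H))
assert np.isclose(brute, np.exp(-free_energy(x, W, b, c)))
```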
RESTRICTED BOLTZMANN MACHINE 14
Topics: free energy
(figure: the softplus(·) function)
• each term softplus(b_j + W_{j.}x) detects a "feature" expected in x
• c^T x biases the probability of each x_i
• each b_j is the bias of each feature
RESTRICTED BOLTZMANN MACHINE 15
Topics: RBM, visible layer, hidden layer, energy function
TRAINING 16
Topics: training objective
• To train an RBM, we'd like to minimize the average negative log-likelihood (NLL):
(1/T) \sum_t l(f(x^{(t)})) = (1/T) \sum_t -log p(x^{(t)})
• We'd like to proceed by stochastic gradient descent:
\partial(-log p(x^{(t)}))/\partial\theta = E_h[\partial E(x^{(t)},h)/\partial\theta | x^{(t)}] - E_{x,h}[\partial E(x,h)/\partial\theta]
(first term: positive phase; second term: negative phase, hard to compute)
CONTRASTIVE DIVERGENCE (CD) (HINTON, NEURAL COMPUTATION, 2002) 17
Topics: contrastive divergence, negative sample
• Idea:
1. replace the expectation by a point estimate at x̃
2. obtain the point x̃ by Gibbs sampling
3. start the sampling chain at x^{(t)}
Gibbs chain: x^{(t)} ~ p(h|x) ~ p(x|h) ... x^1 ... x^k = x̃  (the negative sample)
CONTRASTIVE DIVERGENCE (CD) (HINTON, NEURAL COMPUTATION, 2002) 18
Topics: contrastive divergence, negative sample
• the expectations of \partial E/\partial\theta are replaced by point evaluations:
E_h[\partial E(x^{(t)},h)/\partial\theta | x^{(t)}] ≈ \partial E(x^{(t)}, h̃^{(t)})/\partial\theta, where h̃^{(t)} = h^0 is sampled from p(h|x^{(t)})
E_{x,h}[\partial E(x,h)/\partial\theta] ≈ \partial E(x̃, h̃)/\partial\theta
CONTRASTIVE DIVERGENCE (CD) (HINTON, NEURAL COMPUTATION, 2002) 19
Topics: contrastive divergence, negative sample
(figure: the pair (x̃, h̃) serves as an approximate sample from the model distribution p(x,h))
TRAINING 20
Topics: training objective
DERIVATION OF THE LEARNING RULE 21
Topics: contrastive divergence
• Derivation of \partial E(x,h)/\partial\theta for \theta = W_{jk}:
\partial E(x,h)/\partial W_{jk} = \partial/\partial W_{jk} (-\sum_{jk} W_{jk} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j)
= -\partial/\partial W_{jk} \sum_{jk} W_{jk} h_j x_k
= -h_j x_k
DERIVATION OF THE LEARNING RULE 22
Topics: contrastive divergence
• Derivation of E_h[\partial E(x,h)/\partial\theta | x] for \theta = W_{jk}:
E_h[\partial E(x,h)/\partial W_{jk} | x] = E_h[-h_j x_k | x]
= \sum_{h_j \in \{0,1\}} -h_j x_k p(h_j|x)
= -x_k p(h_j = 1|x)
where h(x) := (p(h_1 = 1|x), ..., p(h_H = 1|x))^T = sigm(b + Wx)
DERIVATION OF THE LEARNING RULE 23
Topics: contrastive divergence
• Given x̃, and using \nabla_W E(x,h) = -h x^T and E_h[\nabla_W E(x,h) | x] = -h(x) x^T, the learning rule for \theta = W becomes:
W ⇐ W - \alpha (\nabla_W -log p(x^{(t)}))  (14)
⇐ W - \alpha (E_h[\nabla_W E(x^{(t)},h) | x^{(t)}] - E_{x,h}[\nabla_W E(x,h)])  (15)
⇐ W - \alpha (E_h[\nabla_W E(x^{(t)},h) | x^{(t)}] - E_h[\nabla_W E(x̃,h) | x̃])  (16)
⇐ W + \alpha (h(x^{(t)}) x^{(t)T} - h(x̃) x̃^T)  (17)
CD-K: PSEUDOCODE 24
Topics: contrastive divergence
1. For each training example x^{(t)}:
i. generate a negative sample x̃ using k steps of Gibbs sampling, starting at x^{(t)}
ii. update parameters:
W ⇐ W + \alpha (h(x^{(t)}) x^{(t)T} - h(x̃) x̃^T)  (44)
b ⇐ b + \alpha (h(x^{(t)}) - h(x̃))  (45)
c ⇐ c + \alpha (x^{(t)} - x̃)  (46)
2. Go back to 1 until stopping criterion
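The pseudocode above maps directly to code. A minimal NumPy sketch of one CD-k update for a binary RBM (the learning rate, sizes and initialization are hypothetical; h(x) = sigm(b + Wx) as in the derivation):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_step(x_t, W, b, c, rng, k=1, alpha=0.01):
    """One CD-k update for a binary RBM; W, b, c are modified in place."""
    # i. negative sample via k steps of Gibbs sampling, starting at x^(t)
    x_tilde = x_t
    for _ in range(k):
        h_samp = (rng.random(b.shape) < sigm(b + W @ x_tilde)).astype(float)
        x_tilde = (rng.random(c.shape) < sigm(c + W.T @ h_samp)).astype(float)
    # ii. parameter updates (eqs. 44-46), with h(x) = sigm(b + W x)
    h_data = sigm(b + W @ x_t)
    h_model = sigm(b + W @ x_tilde)
    W += alpha * (np.outer(h_data, x_t) - np.outer(h_model, x_tilde))
    b += alpha * (h_data - h_model)
    c += alpha * (x_t - x_tilde)
    return x_tilde

rng = np.random.default_rng(3)
H, D = 4, 6
W = rng.normal(scale=0.1, size=(H, D)); b = np.zeros(H); c = np.zeros(D)
x_t = rng.integers(0, 2, size=D).astype(float)
x_tilde = cd_k_step(x_t, W, b, c, rng, k=1)
```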
CONTRASTIVE DIVERGENCE (CD) (HINTON, NEURAL COMPUTATION, 2002) 25
• CD-k: contrastive divergence with k iterations of Gibbs sampling
• In general, the bigger k is, the less biased the estimate of the gradient will be
• In practice, k=1 works well for pre-training
Topics: contrastive divergence
PERSISTENT CD (PCD) (TIELEMAN, ICML 2008) 26
Topics: persistent contrastive divergence
• Idea: instead of initializing the chain at x^{(t)}, initialize the chain at the negative sample x̃ of the last iteration
Gibbs chain: x^1 ~ p(h|x) ~ p(x|h) ... x^k = x̃  (x^1 comes from the previous iteration; x̃ is the negative sample)
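The only change relative to CD-k is where the Gibbs chain starts; the update itself is identical. A sketch under the same hypothetical binary-RBM setup as the CD-k sketch, with the chain state threaded through the calls:

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def pcd_step(x_t, chain, W, b, c, rng, k=1, alpha=0.01):
    """One PCD update: the negative chain starts at `chain`, the negative
    sample of the previous iteration, not at the training example x_t."""
    x_tilde = chain
    for _ in range(k):
        h_samp = (rng.random(b.shape) < sigm(b + W @ x_tilde)).astype(float)
        x_tilde = (rng.random(c.shape) < sigm(c + W.T @ h_samp)).astype(float)
    h_data, h_model = sigm(b + W @ x_t), sigm(b + W @ x_tilde)
    W += alpha * (np.outer(h_data, x_t) - np.outer(h_model, x_tilde))
    b += alpha * (h_data - h_model)
    c += alpha * (x_t - x_tilde)
    return x_tilde  # persisted: becomes `chain` at the next iteration

rng = np.random.default_rng(4)
H, D = 4, 6
W = rng.normal(scale=0.1, size=(H, D)); b = np.zeros(H); c = np.zeros(D)
chain = rng.integers(0, 2, size=D).astype(float)
for x_t in rng.integers(0, 2, size=(5, D)).astype(float):
    chain = pcd_step(x_t, chain, W, b, c, rng)
```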
RESTRICTED BOLTZMANN MACHINE 27
Topics: RBM, visible layer, hidden layer, energy function
EXAMPLE OF DATA SET: MNIST 28
(LAROCHELLE, BENGIO, LOURADOUR AND LAMBLIN)
Figure 5: Samples from the MNIST digit recognition data set. Here, a black pixel corresponds to an input value of 0 and a white pixel corresponds to 1 (the inputs are scaled between 0 and 1).
FILTERS (LAROCHELLE ET AL., JMLR 2009) 29
Figure 6: Display of the input weights of a random subset of the hidden units, learned by an RBM when trained on samples from the MNIST data set. The activation of units of the first hidden layer is obtained by a dot product of such a weight "image" with the input image. In these images, a black pixel corresponds to a weight smaller than −3 and a white pixel to a weight larger than 3, with the different shades of gray corresponding to different weight values uniformly between −3 and 3.
Figure 7: Input weights of a random subset of the hidden units, learned by an autoassociator when trained on samples from the MNIST data set. The display setting is the same as for Figure 6.
DEBUGGING 30

Topics: stochastic reconstruction, filters
• Unfortunately, we can't debug with a comparison against finite differences
• We instead rely on approximate ‘‘tricks’’
‣ we plot the average stochastic reconstruction error \|x^{(t)} - \tilde{x}\|^2 and check whether it tends to decrease
‣ for inputs that correspond to images, we visualize the connections coming into each hidden unit as if they were an image
- gives an idea of the type of visual feature each hidden unit detects
‣ we can also try to approximate the partition function Z and see whether the (approximated) NLL decreases
- On the Quantitative Analysis of Deep Belief Networks. Ruslan Salakhutdinov and Iain Murray, 2008
• p(x)

• Training objective (average negative log-likelihood):
\frac{1}{T}\sum_t l(f(x^{(t)})) = \frac{1}{T}\sum_t -\log p(x^{(t)})

• Gradient of the NLL (positive phase minus negative phase):
\frac{\partial -\log p(x^{(t)})}{\partial\theta} = \mathbb{E}_h\!\left[\frac{\partial E(x^{(t)},h)}{\partial\theta}\,\middle|\,x^{(t)}\right] - \mathbb{E}_{x,h}\!\left[\frac{\partial E(x,h)}{\partial\theta}\right]

• Contrastive divergence: starting a Gibbs chain at x^{(t)}, draw \tilde{h}^{(t)}, \tilde{x}, \tilde{h}, and replace the expectations by point estimates at the sampled configurations:
\mathbb{E}_h\!\left[\frac{\partial E(x^{(t)},h)}{\partial\theta}\,\middle|\,x^{(t)}\right] \approx \frac{\partial E(x^{(t)},\tilde{h}^{(t)})}{\partial\theta}, \qquad \mathbb{E}_{x,h}\!\left[\frac{\partial E(x,h)}{\partial\theta}\right] \approx \frac{\partial E(\tilde{x},\tilde{h})}{\partial\theta}

• Gradients of the energy with respect to W:
\nabla_W E(x,h) = -h\,x^\top, \qquad \mathbb{E}_h[\nabla_W E(x,h)\,|\,x] = -h(x)\,x^\top

• Derivation of the update for W:
W \Leftarrow W - \alpha\left(\nabla_W -\log p(x^{(t)})\right) \quad (14)
\Leftarrow W - \alpha\left(\mathbb{E}_h\!\left[\nabla_W E(x^{(t)},h)\,\middle|\,x^{(t)}\right] - \mathbb{E}_{x,h}[\nabla_W E(x,h)]\right) \quad (15)
\Leftarrow W - \alpha\left(\mathbb{E}_h\!\left[\nabla_W E(x^{(t)},h)\,\middle|\,x^{(t)}\right] - \mathbb{E}_h[\nabla_W E(\tilde{x},h)\,|\,\tilde{x}]\right) \quad (16)
\Leftarrow W + \alpha\left(h(x^{(t)})\,x^{(t)\top} - h(\tilde{x})\,\tilde{x}^\top\right) \quad (17)

• Resulting contrastive divergence updates:
W \Leftarrow W + \alpha\left(h(x^{(t)})\,x^{(t)\top} - h(\tilde{x})\,\tilde{x}^\top\right) \quad (19)
b \Leftarrow b + \alpha\left(h(x^{(t)}) - h(\tilde{x})\right) \quad (20)
c \Leftarrow c + \alpha\left(x^{(t)} - \tilde{x}\right) \quad (21)

• Stochastic reconstruction error: \|x^{(t)} - \tilde{x}\|^2
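The update equations (19)-(21) can be sketched as a single CD-1 step for a binary RBM. The array shapes, the learning rate, and the use of one Gibbs step are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def cd1_update(x, W, b, c, lr=0.05):
    """One CD-1 step on a single binary input x, updating W, b, c in place.
    W: (H, D) weights, b: hidden biases (H,), c: visible biases (D,)."""
    h_pos = sigmoid(W @ x + b)                         # h(x) = p(h=1|x)
    # one Gibbs step: sample h ~ p(h|x), then x_tilde ~ p(x|h)
    h_samp = (rng.random(h_pos.shape) < h_pos).astype(float)
    x_tilde = (rng.random(x.shape) < sigmoid(W.T @ h_samp + c)).astype(float)
    h_neg = sigmoid(W @ x_tilde + b)                   # h(x_tilde)
    # updates (19)-(21): positive statistics minus negative statistics
    W += lr * (np.outer(h_pos, x) - np.outer(h_neg, x_tilde))
    b += lr * (h_pos - h_neg)
    c += lr * (x - x_tilde)
    return np.sum((x - x_tilde) ** 2)                  # reconstruction error

# usage sketch on one random binary input
D, H = 6, 4
W = rng.normal(0, 0.01, (H, D)); b = np.zeros(H); c = np.zeros(D)
x = (rng.random(D) < 0.5).astype(float)
err = cd1_update(x, W, b, c)
```

Tracking the returned reconstruction error across training steps is exactly the "plot the average stochastic reconstruction" debugging trick mentioned above.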
RESTRICTED BOLTZMANN MACHINE 31

Topics: RBM, visible layer, hidden layer, energy function
• hidden layer h = (h_1, \dots, h_H) (binary units), visible layer x = (x_1, \dots, x_D) (binary units), connections W, biases b_j (hidden) and c_k (visible)

Energy function:
E(x,h) = -h^\top W x - c^\top x - b^\top h \quad (1)
= -\sum_j \sum_k W_{j,k} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j \quad (2)

Distribution:
p(x,h) = \exp(-E(x,h))/Z \quad (3)
= \exp(h^\top W x + c^\top x + b^\top h)/Z \quad (4)
= \exp(h^\top W x)\exp(c^\top x)\exp(b^\top h)/Z \quad (5)

where Z is the (intractable) partition function. Equivalently, as a product of factors:
p(x,h) = \frac{1}{Z} \prod_j \prod_k \exp(W_{j,k} h_j x_k) \prod_k \exp(c_k x_k) \prod_j \exp(b_j h_j) \quad (6)
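For a small RBM, the energy and joint distribution above can be checked numerically, since Z is only intractable for large layers. A sketch with illustrative sizes:

```python
import numpy as np
from itertools import product

def energy(x, h, W, b, c):
    # E(x,h) = -h^T W x - c^T x - b^T h
    return -(h @ W @ x) - c @ x - b @ h

rng = np.random.default_rng(0)
D, H = 3, 2                                   # small enough to enumerate
W = rng.normal(size=(H, D)); b = rng.normal(size=H); c = rng.normal(size=D)

# brute-force partition function over all 2^D * 2^H binary configurations
Z = sum(np.exp(-energy(np.array(x), np.array(h), W, b, c))
        for x in product([0, 1], repeat=D)
        for h in product([0, 1], repeat=H))

def p_joint(x, h):
    # p(x,h) = exp(-E(x,h)) / Z
    return np.exp(-energy(x, h, W, b, c)) / Z
```

Summing `p_joint` over every configuration returns 1, confirming that the exponentiated negative energies, once divided by Z, form a valid distribution.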
GAUSSIAN-BERNOULLI RBM 32

Topics: Gaussian-Bernoulli RBM
• Inputs x are unbounded reals
‣ add a quadratic term to the energy function
‣ the only thing that changes is that p(x|h) is now a Gaussian distribution with mean c + W^\top h and identity covariance matrix
‣ it is recommended to normalize the training set by
- subtracting the mean of each input
- dividing each input by the training set standard deviation
‣ a smaller learning rate than in the regular RBM should be used
• The contrastive divergence derivation and the updates for W, b and c are unchanged.

• Energy function with the quadratic term:
E(x,h) = -h^\top W x - c^\top x - b^\top h + \frac{1}{2} x^\top x

• p(x|h) is then Gaussian with mean \mu = c + W^\top h and identity covariance.

• In general, for a Markov network over variables z_1, \dots, z_V, where \mathrm{Ne}(z_i) denotes the neighbors of z_i (for an RBM, the neighbors of each x_k are all the h_j, and vice versa):
p(z_i | z_1, \dots, z_V) = p(z_i | \mathrm{Ne}(z_i)) \quad (10)
= \frac{p(z_i, \mathrm{Ne}(z_i))}{\sum_{z'_i} p(z'_i, \mathrm{Ne}(z_i))} \quad (11)
= \frac{\prod_{f \text{ involving } z_i} f(z_i, \mathrm{Ne}(z_i))}{\sum_{z'_i} \prod_{f \text{ involving } z_i} f(z'_i, \mathrm{Ne}(z_i))} \quad (12)
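A minimal sketch of the Gaussian-Bernoulli changes: standardize the training set, keep p(h|x) as in the binary RBM, and sample p(x|h) as a Gaussian with mean c + W^\top h and identity covariance. Sizes and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# recommended preprocessing: subtract the mean of each input and divide
# by the training set standard deviation
X = rng.normal(5.0, 2.0, size=(100, 8))        # illustrative real-valued data
X = (X - X.mean(axis=0)) / X.std(axis=0)

D, H = X.shape[1], 4
W = rng.normal(0, 0.01, (H, D)); b = np.zeros(H); c = np.zeros(D)

x = X[0]
# p(h|x) is unchanged from the binary RBM
h = (rng.random(H) < sigmoid(W @ x + b)).astype(float)
# p(x|h) is now Gaussian with mean c + W^T h and identity covariance
x_tilde = c + W.T @ h + rng.normal(size=D)
```

With this reconstruction rule swapped in, the contrastive divergence updates proceed exactly as before, ideally with a smaller learning rate.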
FILTERS (LAROCHELLE ET AL., JMLR 2009) 33

Figure 12: Input weights of a random subset of the hidden units, learned by an RBM with Gaussian visible units, when trained on samples from the MNIST data set. The top and bottom images correspond to the same filters but with different color scales. On top, the display setup is the same as for Figures 6 and 7 and, on the bottom, black and white pixels correspond to weights smaller than −10 and larger than 10 respectively.
OTHER TYPES OF OBSERVATIONS 34

Topics: extensions to other observations
• Extensions support other types:
‣ real-valued observations: Gaussian-Bernoulli RBM
‣ binomial observations:
- Rate-coded Restricted Boltzmann Machines for Face Recognition. Yee Whye Teh and Geoffrey Hinton, 2001
‣ multinomial observations:
- Replicated Softmax: an Undirected Topic Model. Ruslan Salakhutdinov and Geoffrey Hinton, 2009
- Training Restricted Boltzmann Machines on Word Observations. George Dahl, Ryan Adams and Hugo Larochelle, 2012
‣ and more (see course website)
BOLTZMANN MACHINE 35

Topics: Boltzmann machine
• The original Boltzmann machine has lateral connections in each layer
‣ when only one layer has lateral connections, it's a semi-restricted Boltzmann machine

Energy function, with lateral connection matrices V (visible-visible) and U (hidden-hidden):
E(x,h) = -h^\top W x - c^\top x - b^\top h - \frac{1}{2} x^\top V x - \frac{1}{2} h^\top U h \quad (23)
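The Boltzmann machine energy with lateral terms can be written directly. In the sketch below, making V and U symmetric with zero diagonal is an assumption consistent with standard Boltzmann machines; the sizes are illustrative:

```python
import numpy as np

def bm_energy(x, h, W, b, c, V, U):
    """E(x,h) = -h^T W x - c^T x - b^T h - 1/2 x^T V x - 1/2 h^T U h"""
    return (-(h @ W @ x) - c @ x - b @ h
            - 0.5 * (x @ V @ x) - 0.5 * (h @ U @ h))

rng = np.random.default_rng(0)
D, H = 4, 3
W = rng.normal(size=(H, D)); b = rng.normal(size=H); c = rng.normal(size=D)
# lateral connections: symmetric with zero diagonal (assumed convention)
V = rng.normal(size=(D, D)); V = (V + V.T) / 2; np.fill_diagonal(V, 0)
U = rng.normal(size=(H, H)); U = (U + U.T) / 2; np.fill_diagonal(U, 0)

x = (rng.random(D) < 0.5).astype(float)
h = (rng.random(H) < 0.5).astype(float)
E = bm_energy(x, h, W, b, c, V, U)
```

Setting V = U = 0 recovers the RBM energy, and setting only one of them to zero gives the semi-restricted case.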