Neural networks
Restricted Boltzmann machine - definition

Hugo Larochelle
Département d'informatique
Université de Sherbrooke
hugo.larochelle@usherbrooke.ca
October 10, 2012



UNSUPERVISED LEARNING

Topics: unsupervised learning
• Unsupervised learning: only use the inputs x^(t) for learning
  ‣ automatically extract meaningful features for your data
  ‣ leverage the availability of unlabeled data
  ‣ add a data-dependent regularizer to training (−log p(x^(t)))
• We will see 3 neural networks for unsupervised learning
  ‣ restricted Boltzmann machines
  ‣ autoencoders
  ‣ sparse coding model


RESTRICTED BOLTZMANN MACHINE

Topics: RBM, visible layer, hidden layer, energy function
• hidden layer h (binary units) and visible layer x (binary units), with W connections and biases b_j (hidden) and c_k (visible)

Energy function:

    E(x,h) = −h^T W x − c^T x − b^T h                                  (1)
           = −∑_j ∑_k W_{j,k} h_j x_k − ∑_k c_k x_k − ∑_j b_j h_j      (2)

Distribution:

    p(x,h) = exp(−E(x,h))/Z                                            (3)
           = exp(h^T W x + c^T x + b^T h)/Z                            (4)
           = exp(h^T W x) exp(c^T x) exp(b^T h)/Z                      (5)

where Z is the partition function (intractable). Equivalently,

    p(x,h) = (1/Z) ∏_j ∏_k exp(W_{j,k} h_j x_k)                        (6)
                   · ∏_k exp(c_k x_k)                                  (7)
                   · ∏_j exp(b_j h_j)                                  (8)
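To make the definition concrete, here is a minimal NumPy sketch of the energy and the unnormalized joint. The names follow the slide's notation (W is H×D, b the hidden bias, c the visible bias); the shapes and function names are assumptions made for illustration:

    import numpy as np

    def energy(x, h, W, b, c):
        # E(x,h) = -h^T W x - c^T x - b^T h
        return -h @ W @ x - c @ x - b @ h

    def unnormalized_p(x, h, W, b, c):
        # exp(-E(x,h)); dividing by the (intractable) partition
        # function Z would give the actual probability p(x,h).
        return np.exp(-energy(x, h, W, b, c))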

MARKOV NETWORK VIEW

Topics: Markov network (with vector nodes)
• The notation based on an energy function is simply an alternative to the representation as a product of factors:

    p(x,h) = exp(−E(x,h))/Z                                            (1)
           = exp(h^T W x + b^T h + c^T x)/Z                            (2)
           = exp(h^T W x) exp(b^T h) exp(c^T x)/Z                      (3)

where each exponential is one factor of the Markov network over the vector nodes x and h.

MARKOV NETWORK VIEW

Topics: Markov network (with scalar nodes)
• The scalar visualization is more informative of the structure within the vectors: with scalar nodes h_1, h_2, ..., h_{H−1}, h_H and x_1, x_2, ..., x_D,

    p(x,h) = (1/Z) ∏_j ∏_k exp(W_{j,k} h_j x_k)          (pair-wise factors)
                   · ∏_k exp(c_k x_k) ∏_j exp(b_j h_j)   (unary factors)


INFERENCE

Topics: conditional distributions

    p(h|x) = ∏_j p(h_j|x)
    p(x|h) = ∏_k p(x_k|h)

    p(h_j = 1|x) = 1 / (1 + exp(−(b_j + W_{j·} x)))
                 = sigm(b_j + W_{j·} x)        (W_{j·}: j-th row of W)

    p(x_k = 1|h) = 1 / (1 + exp(−(c_k + h^T W_{·k})))
                 = sigm(c_k + h^T W_{·k})      (W_{·k}: k-th column of W)
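These conditionals translate directly into code; a minimal sketch under the same shape assumptions as before:

    import numpy as np

    def sigm(a):
        return 1.0 / (1.0 + np.exp(-a))

    def p_h_given_x(x, W, b):
        # Vector of p(h_j = 1 | x) = sigm(b_j + W_{j.} x), all j at once.
        return sigm(b + W @ x)

    def p_x_given_h(h, W, c):
        # Vector of p(x_k = 1 | h) = sigm(c_k + h^T W_{.k}), all k at once.
        return sigm(c + W.T @ h)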

Derivation of the factorization of p(h|x):

    p(h|x) = p(x,h) / ∑_{h'} p(x,h')                                                  (14)
           = [exp(h^T W x + c^T x + b^T h)/Z]
             / [∑_{h'∈{0,1}^H} exp(h'^T W x + c^T x + b^T h')/Z]                      (15)
           = exp(∑_j h_j W_{j·} x + b_j h_j)
             / [∑_{h'_1∈{0,1}} ··· ∑_{h'_H∈{0,1}} exp(∑_j h'_j W_{j·} x + b_j h'_j)]  (16)
           = ∏_j exp(h_j W_{j·} x + b_j h_j)
             / [∑_{h'_1∈{0,1}} ··· ∑_{h'_H∈{0,1}} ∏_j exp(h'_j W_{j·} x + b_j h'_j)]  (17)
           = ∏_j exp(h_j W_{j·} x + b_j h_j)
             / [(∑_{h'_1∈{0,1}} exp(h'_1 W_{1·} x + b_1 h'_1)) ···
                (∑_{h'_H∈{0,1}} exp(h'_H W_{H·} x + b_H h'_H))]                       (18)
           = ∏_j exp(h_j W_{j·} x + b_j h_j)
             / ∏_j (∑_{h'_j∈{0,1}} exp(h'_j W_{j·} x + b_j h'_j))                     (19)
           = ∏_j exp(h_j W_{j·} x + b_j h_j) / ∏_j (1 + exp(b_j + W_{j·} x))          (20)
           = ∏_j [exp(h_j W_{j·} x + b_j h_j) / (1 + exp(b_j + W_{j·} x))]            (21)
           = ∏_j p(h_j|x)                                                             (22)

    p(h_j = 1|x) = exp(b_j + W_{j·} x) / (1 + exp(b_j + W_{j·} x))                    (23)
                 = 1 / (1 + exp(−b_j − W_{j·} x))                                     (24)
                 = sigm(b_j + W_{j·} x)                                               (25)

LOCAL MARKOV PROPERTY

Topics: local Markov property
• In general, we have the following property:
  ‣ z_i is any variable in the Markov network (x_k or h_j in an RBM)
  ‣ Ne(z_i) are the neighbors of z_i in the Markov network

    p(z_i | z_1, ..., z_V) = p(z_i | Ne(z_i))                                         (10)
      = p(z_i, Ne(z_i)) / ∑_{z'_i} p(z'_i, Ne(z_i))                                   (11)
      = ∏_{f involving z_i and any Ne(z_i)} f(z_i, Ne(z_i))
        / [∑_{z'_i} ∏_{f involving z_i and any Ne(z_i)} f(z'_i, Ne(z_i))]             (12)


FREE ENERGY

Topics: free energy
• What about p(x)?

    p(x) = ∑_{h∈{0,1}^H} p(x,h) = ∑_{h∈{0,1}^H} exp(−E(x,h))/Z                        (26)
         = exp( c^T x + ∑_{j=1}^H log(1 + exp(b_j + W_{j·} x)) ) / Z                  (27)
         = exp(−F(x))/Z                                                               (28)

where F(x) is called the free energy. Written with the softplus function:

    p(x) = exp( c^T x + ∑_{j=1}^H log(1 + exp(b_j + W_{j·} x)) ) / Z                  (29)
         = exp( c^T x + ∑_{j=1}^H softplus(b_j + W_{j·} x) ) / Z                      (30)
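A minimal sketch of the free energy under the usual shape assumptions; log1p gives a slightly more stable softplus:

    import numpy as np

    def softplus(a):
        return np.log1p(np.exp(a))

    def free_energy(x, W, b, c):
        # F(x) = -c^T x - sum_j softplus(b_j + W_{j.} x),
        # so that p(x) = exp(-F(x)) / Z.
        return -(c @ x) - softplus(b + W @ x).sum()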

Derivation of the free energy (marginalizing h analytically):

    p(x) = ∑_{h∈{0,1}^H} exp(h^T W x + c^T x + b^T h)/Z                               (32)
         = exp(c^T x) ∑_{h_1∈{0,1}} ··· ∑_{h_H∈{0,1}}
                       exp(∑_j h_j W_{j·} x + b_j h_j)/Z                              (33)
         = exp(c^T x) (∑_{h_1∈{0,1}} exp(h_1 W_{1·} x + b_1 h_1)) ···
                      (∑_{h_H∈{0,1}} exp(h_H W_{H·} x + b_H h_H))/Z                   (34)
         = exp(c^T x) (1 + exp(b_1 + W_{1·} x)) ··· (1 + exp(b_H + W_{H·} x))/Z       (35)
         = exp(c^T x) exp(log(1 + exp(b_1 + W_{1·} x))) ···
                      exp(log(1 + exp(b_H + W_{H·} x)))/Z                             (36)
         = exp( c^T x + ∑_{j=1}^H log(1 + exp(b_j + W_{j·} x)) ) / Z                  (37)
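For a small RBM this marginalization can be checked by brute force. The sketch below (hypothetical sizes, random parameters) compares the closed-form free-energy expression against explicit enumeration of all 2^H hidden configurations:

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    H, D = 3, 4  # small enough to enumerate
    W, b, c = rng.normal(size=(H, D)), rng.normal(size=H), rng.normal(size=D)
    x = rng.integers(0, 2, size=D).astype(float)

    # Unnormalized p(x): explicit sum over all hidden configurations...
    brute = sum(np.exp(h @ W @ x + c @ x + b @ h)
                for h in map(np.array, itertools.product([0.0, 1.0], repeat=H)))
    # ...versus exp(-F(x)) from the closed form above.
    closed = np.exp(c @ x + np.log1p(np.exp(b + W @ x)).sum())
    assert np.isclose(brute, closed)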

RESTRICTED BOLTZMANN MACHINE

Topics: free energy

    p(x) = exp( c^T x + ∑_j softplus(b_j + W_{j·} x) ) / Z

• the term c^T x biases the probability of each x_i
• each row W_{j·} is a "feature" expected in x, and b_j is the bias of each feature
[Figure: plot of softplus(·) on the range −5 to 5, near 0 for negative inputs and growing linearly for positive inputs]


TRAINING

Topics: training objective
• To train an RBM, we'd like to minimize the average negative log-likelihood (NLL):

    (1/T) ∑_t l(f(x^(t))) = (1/T) ∑_t −log p(x^(t))

• We'd like to proceed by stochastic gradient descent:

    ∂(−log p(x^(t)))/∂θ = E_h[ ∂E(x^(t),h)/∂θ | x^(t) ] − E_{x,h}[ ∂E(x,h)/∂θ ]
                            (positive phase)               (negative phase: hard to compute)


CONTRASTIVE DIVERGENCE (CD) (HINTON, NEURAL COMPUTATION, 2002)

Topics: contrastive divergence, negative sample
• Idea:
  1. replace the expectation by a point estimate at x̃
  2. obtain the point x̃ by Gibbs sampling
  3. start the sampling chain at x^(t)

    x^(t) → x^1 → ··· → x^k = x̃    (alternating h ~ p(h|x) and x ~ p(x|h))

x̃ is called the negative sample.
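A minimal sketch of this Gibbs chain, sampling binary states from the Bernoulli conditionals derived earlier (shapes as in the previous sketches):

    import numpy as np

    rng = np.random.default_rng(0)
    sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

    def gibbs_steps(x, W, b, c, k=1):
        # k full Gibbs steps: h ~ p(h|x), then x ~ p(x|h);
        # returns the negative sample x~.
        for _ in range(k):
            h = (rng.random(b.shape) < sigm(b + W @ x)).astype(float)    # p(h_j=1|x)
            x = (rng.random(c.shape) < sigm(c + W.T @ h)).astype(float)  # p(x_k=1|h)
        return x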

CONTRASTIVE DIVERGENCE (CD) (HINTON, NEURAL COMPUTATION, 2002)

Topics: contrastive divergence, negative sample
• With h̃^(t) = h^0 (the hidden configuration sampled at x^(t)) and (x̃, h̃) the negative sample at the end of the chain, the two expectations are replaced by point estimates:

    E_h[ ∂E(x^(t),h)/∂θ | x^(t) ] ≈ ∂E(x^(t), h̃^(t))/∂θ

    E_{x,h}[ ∂E(x,h)/∂θ ] ≈ ∂E(x̃, h̃)/∂θ

[Figures: the energy function E(x,h) with the points (x^(t), h̃^(t)) and (x̃, h̃) marked, and the corresponding joint p(x,h)]


DERIVATION OF THE LEARNING RULE

Topics: contrastive divergence
• Derivation of ∂E(x,h)/∂θ for θ = W_{jk}:

    ∂E(x,h)/∂W_{jk} = ∂/∂W_{jk} ( −∑_{jk} W_{jk} h_j x_k − ∑_k c_k x_k − ∑_j b_j h_j )
                    = −∂/∂W_{jk} ∑_{jk} W_{jk} h_j x_k
                    = −h_j x_k

DERIVATION OF THE LEARNING RULE

Topics: contrastive divergence
• Derivation of E_h[ ∂E(x,h)/∂θ | x ] for θ = W_{jk}:

    E_h[ ∂E(x,h)/∂W_{jk} | x ] = E_h[ −h_j x_k | x ]
                               = ∑_{h_j∈{0,1}} −h_j x_k p(h_j|x)
                               = −x_k p(h_j = 1|x)

• Collecting all entries, define

    h(x) := [ p(h_1 = 1|x), ..., p(h_H = 1|x) ]^T = sigm(b + W x)

DERIVATION OF THE LEARNING RULE

Topics: contrastive divergence
• Given x̃ and ∇_W E(x,h) = −h x^T, so that E_h[ ∇_W E(x,h) | x ] = −h(x) x^T, the learning rule for θ = W becomes:

    W ⇐ W − α ( ∇_W −log p(x^(t)) )                                                   (39)
      ⇐ W − α ( E_h[ ∇_W E(x^(t),h) | x^(t) ] − E_{x,h}[ ∇_W E(x,h) ] )               (40)
      ⇐ W − α ( E_h[ ∇_W E(x^(t),h) | x^(t) ] − E_h[ ∇_W E(x̃,h) | x̃ ] )              (41)
      ⇐ W + α ( h(x^(t)) x^(t)T − h(x̃) x̃^T )                                         (42)

CD-K: PSEUDOCODE

Topics: contrastive divergence
1. For each training example x^(t)
   i. generate a negative sample x̃ using k steps of Gibbs sampling, starting at x^(t)
   ii. update parameters:

        W ⇐ W + α ( h(x^(t)) x^(t)T − h(x̃) x̃^T )                                     (44)
        b ⇐ b + α ( h(x^(t)) − h(x̃) )                                                (45)
        c ⇐ c + α ( x^(t) − x̃ )                                                      (46)

2. Go back to 1 until stopping criterion
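Putting the pseudocode together, here is a minimal NumPy sketch of one CD-k update on a single example; the learning rate α, the shapes, and the function name are assumptions, and h(x) is the vector of p(h_j = 1|x) defined above:

    import numpy as np

    rng = np.random.default_rng(0)
    sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

    def cd_k_update(x_t, W, b, c, alpha=0.01, k=1):
        # One CD-k step; updates W, b, c in place.
        # Negative sample via k steps of Gibbs sampling, starting at x_t.
        x_neg = x_t
        for _ in range(k):
            h = (rng.random(b.shape) < sigm(b + W @ x_neg)).astype(float)
            x_neg = (rng.random(c.shape) < sigm(c + W.T @ h)).astype(float)
        h_data, h_neg = sigm(b + W @ x_t), sigm(b + W @ x_neg)  # h(x^(t)), h(x~)
        W += alpha * (np.outer(h_data, x_t) - np.outer(h_neg, x_neg))
        b += alpha * (h_data - h_neg)
        c += alpha * (x_t - x_neg)
        return x_neg  # useful for monitoring reconstructions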

CONTRASTIVE DIVERGENCE (CD) (HINTON, NEURAL COMPUTATION, 2002)

Topics: contrastive divergence
• CD-k: contrastive divergence with k iterations of Gibbs sampling
• In general, the bigger k is, the less biased the estimate of the gradient will be
• In practice, k = 1 works well for pre-training

PERSISTENT CD (PCD) (TIELEMAN, ICML 2008)

Topics: persistent contrastive divergence
• Idea: instead of initializing the chain to x^(t), initialize the chain to the negative sample x̃ of the last iteration:

    x̃_prev → x^1 → ··· → x^k = x̃    (alternating h ~ p(h|x) and x ~ p(x|h))

where x̃_prev, the start of the chain, comes from the previous iteration. x̃ is again the negative sample.
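A sketch of the PCD variant of the update above: the only change from CD-k is that the Gibbs chain is resumed from a persistent state rather than restarted at x^(t) (names and shapes are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

    def pcd_update(x_t, x_chain, W, b, c, alpha=0.01, k=1):
        # x_chain is the persistent negative sample returned by the last call.
        for _ in range(k):  # resume the chain at x_chain, not at x_t
            h = (rng.random(b.shape) < sigm(b + W @ x_chain)).astype(float)
            x_chain = (rng.random(c.shape) < sigm(c + W.T @ h)).astype(float)
        h_data, h_neg = sigm(b + W @ x_t), sigm(b + W @ x_chain)
        W += alpha * (np.outer(h_data, x_t) - np.outer(h_neg, x_chain))
        b += alpha * (h_data - h_neg)
        c += alpha * (x_t - x_chain)
        return x_chain  # pass back in on the next call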


EXAMPLE OF DATA SET: MNIST

[Figure 5 (Larochelle, Bengio, Louradour and Lamblin): Samples from the MNIST digit recognition data set. Here, a black pixel corresponds to an input value of 0 and a white pixel corresponds to 1 (the inputs are scaled between 0 and 1).]

FILTERS (LAROCHELLE ET AL., JMLR 2009)

[Figure 6: Display of the input weights of a random subset of the hidden units, learned by an RBM when trained on samples from the MNIST data set. The activation of units of the first hidden layer is obtained by a dot product of such a weight "image" with the input image. In these images, a black pixel corresponds to a weight smaller than −3 and a white pixel to a weight larger than 3, with the different shades of gray corresponding to different weight values uniformly between −3 and 3.]

[Figure 7: Input weights of a random subset of the hidden units, learned by an autoassociator when trained on samples from the MNIST data set. The display setting is the same as for Figure 6.]

DEBUGGING

Topics: stochastic reconstruction, filters
• Unfortunately, we can't debug the gradient estimate with a finite-difference comparison
• We instead rely on approximate "tricks"
  ‣ we plot the average stochastic reconstruction ||x^(t) − x̃||^2 and see if it tends to decrease
  ‣ for inputs that correspond to images, we visualize the connections coming into each hidden unit as if they were an image
    - gives an idea of the type of visual feature each hidden unit detects
  ‣ we can also try to approximate the partition function Z and see whether the (approximated) NLL decreases
    - On the Quantitative Analysis of Deep Belief Networks. Ruslan Salakhutdinov and Iain Murray, 2008
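As a small illustration of the first trick, the stochastic reconstruction error can be tracked per example and averaged over each epoch (a sketch; cd_k_update is the hypothetical helper defined earlier, which returns the negative sample):

    import numpy as np

    def reconstruction_error(x_t, x_neg):
        # Squared stochastic reconstruction ||x^(t) - x~||^2 for one example;
        # average this over an epoch and check that it tends to decrease.
        return float(np.sum((x_t - x_neg) ** 2))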


GAUSSIAN-BERNOULLI RBM

Topics: Gaussian-Bernoulli RBM
• Inputs x are unbounded reals
  ‣ add a quadratic term to the energy function:

        E(x,h) = −h^T W x − c^T x − b^T h + (1/2) x^T x

  ‣ the only thing that changes is that p(x|h) is now a Gaussian distribution with mean μ = c + W^T h and identity covariance matrix
  ‣ recommended to normalize the training set by
    - subtracting the mean of each input
    - dividing each input by the training set standard deviation
  ‣ should use a smaller learning rate than in the regular RBM
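A sketch of how the Gibbs step changes for the Gaussian-Bernoulli energy above: only the visible sampling differs, drawing x from N(c + W^T h, I) instead of a Bernoulli (shapes and function name are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

    def gibbs_step_gaussian(x, W, b, c):
        # One Gibbs step for a Gaussian-Bernoulli RBM: binary h, real-valued x.
        h = (rng.random(b.shape) < sigm(b + W @ x)).astype(float)  # p(h|x) unchanged
        x = c + W.T @ h + rng.standard_normal(c.shape)             # x ~ N(c + W^T h, I)
        return x, h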

FILTERS (LAROCHELLE ET AL., JMLR 2009)

[Figure 12: Input weights of a random subset of the hidden units, learned by an RBM with Gaussian visible units, when trained on samples from the MNIST data set. The top and bottom images correspond to the same filters but with different color scale. On top, the display setup is the same as for Figures 6 and 7 and, on the bottom, a black and white pixel correspond to weights smaller than −10 and larger than 10 respectively.]

OTHER TYPES OF OBSERVATIONS

Topics: extensions to other observations
• Extensions support other types:
  ‣ real-valued observations: Gaussian-Bernoulli RBM
  ‣ binomial observations:
    - Rate-coded Restricted Boltzmann Machines for Face Recognition. Yee Whye Teh and Geoffrey Hinton, 2001
  ‣ multinomial observations:
    - Replicated Softmax: an Undirected Topic Model. Ruslan Salakhutdinov and Geoffrey Hinton, 2009
    - Training Restricted Boltzmann Machines on Word Observations. George Dahl, Ryan Adams and Hugo Larochelle, 2012
  ‣ and more (see course website)

BOLTZMANN MACHINE

Topics: Boltzmann machine
• The original Boltzmann machine has lateral connections in each layer (V between visible units, U between hidden units):

    E(x,h) = −h^T W x − c^T x − b^T h − (1/2) x^T V x − (1/2) h^T U h

  ‣ when only one layer has lateral connections, it's a semi-restricted Boltzmann machine