16 RBM lecture - wordpress.com · 4/16/2017 · Département d'informatique, Université de...
TRANSCRIPT
Neural networks: Restricted Boltzmann machine - definition
UNSUPERVISED LEARNING 2
Topics: unsupervised learning
• Unsupervised learning: only use the inputs for learning
‣ automatically extract meaningful features from your data
‣ leverage the availability of unlabeled data
‣ add a data-dependent regularizer to training
• We will see 3 neural networks for unsupervised learning:
‣ restricted Boltzmann machines
‣ autoencoders
‣ sparse coding model
Restricted Boltzmann Machines
Hugo Larochelle, Département d'informatique, Universite de...
[email protected]
October 10, 2012
Abstract: Math for my slides "Restricted Boltzmann Machines".
• x^{(t)}, -log p(x^{(t)})
RESTRICTED BOLTZMANN MACHINE 3
Topics: RBM, visible layer, hidden layer, energy function
• hidden layer h (binary units), visible layer x (binary units)
• W: connection weights; b_j: hidden-unit biases; c_k: visible-unit biases
• Energy function: E(x,h)
• Distribution: p(x,h) = exp(-E(x,h))/Z, where Z is the partition function (intractable)
E(x,h) = -h^T W x - c^T x - b^T h  (1)
       = -\sum_j \sum_k W_{j,k} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j  (2)

p(x,h) = exp(-E(x,h))/Z  (3)
       = exp(h^T W x + c^T x + b^T h)/Z  (4)
       = exp(h^T W x) exp(c^T x) exp(b^T h)/Z  (5)

With hidden units h_1, h_2, ..., h_{H-1}, h_H and visible units x_1, x_2, ..., x_D, this factorizes as

p(x,h) = (1/Z) \prod_j \prod_k exp(W_{j,k} h_j x_k) \prod_k exp(c_k x_k) \prod_j exp(b_j h_j)  (6)-(9)
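The energy function and its factorized form can be checked numerically. A minimal NumPy sketch, assuming hypothetical sizes D = 6 and H = 4 and randomly initialized parameters (only the symbols W, b, c, E and the formulas come from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

D, H = 6, 4                              # hypothetical layer sizes
W = rng.normal(scale=0.1, size=(H, D))   # connection weights W_{j,k}
b = np.zeros(H)                          # hidden biases b_j
c = np.zeros(D)                          # visible biases c_k

def energy(x, h):
    """E(x,h) = -h^T W x - c^T x - b^T h (matrix form, eq. 1)."""
    return -h @ W @ x - c @ x - b @ h

x = rng.integers(0, 2, size=D).astype(float)
h = rng.integers(0, 2, size=H).astype(float)

# The double-sum form (eq. 2) gives the same value as the matrix form.
double_sum = -(np.sum(W * np.outer(h, x)) + c @ x + b @ h)
assert np.isclose(energy(x, h), double_sum)

# exp(-E(x,h)) is the unnormalized joint probability; dividing by Z would
# require summing over all 2^(D+H) configurations, which is intractable.
p_unnorm = np.exp(-energy(x, h))
```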
MARKOV NETWORK VIEW 4
Topics: Markov network (with vector nodes)
• The notation based on an energy function is simply an alternative to the representation as a product of factors
MARKOV NETWORK VIEW 5
Topics: Markov network (with scalar nodes)
• The scalar visualization is more informative about the structure within the vectors
(figure: scalar-node Markov network over h_1, ..., h_H and x_1, ..., x_D; pair-wise factors exp(W_{j,k} h_j x_k) connect each pair (h_j, x_k), and unary factors exp(c_k x_k) and exp(b_j h_j) attach to the individual nodes)
RESTRICTED BOLTZMANN MACHINE 6
Topics: RBM, visible layer, hidden layer, energy function
INFERENCE 7
Topics: conditional distributions

p(h|x) = \prod_j p(h_j|x)
p(x|h) = \prod_k p(x_k|h)

p(h_j = 1|x) = 1 / (1 + exp(-(b_j + W_{j.} x))) = sigm(b_j + W_{j.} x)   (W_{j.}: j-th row of W)
p(x_k = 1|h) = 1 / (1 + exp(-(c_k + h^T W_{.k}))) = sigm(c_k + h^T W_{.k})   (W_{.k}: k-th column of W)
8

p(h|x) = p(x,h) / \sum_{h'} p(x,h')  (14)
= [exp(h^T Wx + c^T x + b^T h)/Z] / [\sum_{h' \in \{0,1\}^H} exp(h'^T Wx + c^T x + b^T h')/Z]  (15)
= exp(\sum_j h_j W_{j.}x + b_j h_j) / [\sum_{h'_1 \in \{0,1\}} \cdots \sum_{h'_H \in \{0,1\}} exp(\sum_j h'_j W_{j.}x + b_j h'_j)]  (16)
= \prod_j exp(h_j W_{j.}x + b_j h_j) / [\sum_{h'_1 \in \{0,1\}} \cdots \sum_{h'_H \in \{0,1\}} \prod_j exp(h'_j W_{j.}x + b_j h'_j)]  (17)
= \prod_j exp(h_j W_{j.}x + b_j h_j) / [(\sum_{h'_1 \in \{0,1\}} exp(h'_1 W_{1.}x + b_1 h'_1)) \cdots (\sum_{h'_H \in \{0,1\}} exp(h'_H W_{H.}x + b_H h'_H))]  (18)
= \prod_j exp(h_j W_{j.}x + b_j h_j) / \prod_j (\sum_{h'_j \in \{0,1\}} exp(h'_j W_{j.}x + b_j h'_j))  (19)
= \prod_j exp(h_j W_{j.}x + b_j h_j) / \prod_j (1 + exp(b_j + W_{j.}x))  (20)
= \prod_j [exp(h_j W_{j.}x + b_j h_j) / (1 + exp(b_j + W_{j.}x))]  (21)
= \prod_j p(h_j|x)  (22)

p(h_j = 1|x) = exp(b_j + W_{j.}x) / (1 + exp(b_j + W_{j.}x))  (23)
= 1 / (1 + exp(-b_j - W_{j.}x))  (24)
= sigm(b_j + W_{j.}x)  (25)
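The factorization derived above can be verified by brute force on a tiny model: enumerate all h in {0,1}^H and check that the marginal of each h_j under p(h|x) equals the sigmoid. A NumPy sketch (sizes and parameter values are hypothetical):

```python
import numpy as np
from itertools import product

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(x, W, b):
    """p(h_j = 1 | x) = sigm(b_j + W_{j.} x), independently per j (eq. 25)."""
    return sigm(b + W @ x)

def p_x_given_h(h, W, c):
    """By the symmetric derivation: p(x_k = 1 | h) = sigm(c_k + h^T W_{.k})."""
    return sigm(c + W.T @ h)

# Brute-force check: p(h|x) is proportional to exp(h^T W x + b^T h)
# (the exp(c^T x) factor cancels between numerator and denominator).
rng = np.random.default_rng(1)
H, D = 3, 4
W = rng.normal(size=(H, D)); b = rng.normal(size=H); c = rng.normal(size=D)
x = rng.integers(0, 2, size=D).astype(float)

weights = {hh: np.exp(np.array(hh) @ W @ x + b @ np.array(hh))
           for hh in product([0, 1], repeat=H)}
Z_x = sum(weights.values())
marginal = np.array([sum(w for hh, w in weights.items() if hh[j] == 1)
                     for j in range(H)]) / Z_x
assert np.allclose(marginal, p_h_given_x(x, W, b))
```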
LOCAL MARKOV PROPERTY 10
Topics: local Markov property
• In general, we have the following property:
‣ z_i is any variable in the Markov network (an x_k or an h_j in an RBM)
‣ Ne(z_i) are the neighbors of z_i in the Markov network
p(z_i | z_1, ..., z_V) = p(z_i | Ne(z_i))  (10)
= p(z_i, Ne(z_i)) / \sum_{z'_i} p(z'_i, Ne(z_i))  (11)
= \prod_{f involving z_i and any Ne(z_i)} f(z_i, Ne(z_i)) / [\sum_{z'_i} \prod_{f involving z_i and any Ne(z_i)} f(z'_i, Ne(z_i))]  (12)
RESTRICTED BOLTZMANN MACHINE 11
Topics: RBM, visible layer, hidden layer, energy function
FREE ENERGY 12
Topics: free energy
• What about p(x)?
p(x) = \sum_{h \in \{0,1\}^H} p(x,h) = \sum_{h \in \{0,1\}^H} exp(-E(x,h))/Z  (26)
= exp(c^T x + \sum_{j=1}^H log(1 + exp(b_j + W_{j.}x)))/Z  (27)
= exp(-F(x))/Z  (28)

where F(x) is called the free energy:

p(x) = exp(c^T x + \sum_{j=1}^H log(1 + exp(b_j + W_{j.}x)))/Z  (29)
= exp(c^T x + \sum_{j=1}^H softplus(b_j + W_{j.}x))/Z  (30)
13

p(x) = \sum_{h \in \{0,1\}^H} exp(h^T Wx + c^T x + b^T h)/Z  (32)
= exp(c^T x) \sum_{h_1 \in \{0,1\}} \cdots \sum_{h_H \in \{0,1\}} exp(\sum_j h_j W_{j.}x + b_j h_j)/Z  (33)
= exp(c^T x) (\sum_{h_1 \in \{0,1\}} exp(h_1 W_{1.}x + b_1 h_1)) \cdots (\sum_{h_H \in \{0,1\}} exp(h_H W_{H.}x + b_H h_H))/Z  (34)
= exp(c^T x) (1 + exp(b_1 + W_{1.}x)) \cdots (1 + exp(b_H + W_{H.}x))/Z  (35)
= exp(c^T x) exp(log(1 + exp(b_1 + W_{1.}x))) \cdots exp(log(1 + exp(b_H + W_{H.}x)))/Z  (36)
= exp(c^T x + \sum_{j=1}^H log(1 + exp(b_j + W_{j.}x)))/Z  (37)
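This derivation is what makes p(x) computable up to Z: the sum over the 2^H hidden configurations collapses into H softplus terms. A NumPy sketch comparing the free-energy form to the explicit sum, with tiny hypothetical sizes so the brute-force sum stays feasible:

```python
import numpy as np
from itertools import product

def free_energy(x, W, b, c):
    """F(x) = -c^T x - sum_j softplus(b_j + W_{j.} x), so p(x) = exp(-F(x))/Z.
    np.logaddexp(0, a) computes log(1 + exp(a)) = softplus(a) stably."""
    return -(c @ x) - np.sum(np.logaddexp(0.0, b + W @ x))

rng = np.random.default_rng(2)
H, D = 3, 4
W = rng.normal(size=(H, D)); b = rng.normal(size=H); c = rng.normal(size=D)
x = rng.integers(0, 2, size=D).astype(float)

# Explicit sum over h in {0,1}^H (eq. 32), without the 1/Z factor:
brute = sum(np.exp(np.array(h) @ W @ x + c @ x + b @ np.array(h))
            for h in product([0, 1], repeat=H))
assert np.isclose(brute, np.exp(-free_energy(x, W, b, c)))
```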
RESTRICTED BOLTZMANN MACHINE 14
Topics: free energy
(figure: the softplus(·) function)
• each term softplus(b_j + W_{j.}x) detects a "feature" expected in x
• c^T x biases the probability of each x_i
• each b_j is the bias of each feature
RESTRICTED BOLTZMANN MACHINE 15
Topics: RBM, visible layer, hidden layer, energy function
TRAINING 16
Topics: training objective
• To train an RBM, we'd like to minimize the average negative log-likelihood (NLL):
(1/T) \sum_t l(f(x^{(t)})) = (1/T) \sum_t -log p(x^{(t)})
• We'd like to proceed by stochastic gradient descent:
\partial(-log p(x^{(t)}))/\partial\theta = E_h[\partial E(x^{(t)},h)/\partial\theta | x^{(t)}] - E_{x,h}[\partial E(x,h)/\partial\theta]
(first term: positive phase; second term: negative phase, hard to compute)
CONTRASTIVE DIVERGENCE (CD) (HINTON, NEURAL COMPUTATION, 2002) 17
Topics: contrastive divergence, negative sample
• Idea:
1. replace the expectation by a point estimate at x̃
2. obtain the point x̃ by Gibbs sampling
3. start the sampling chain at x^{(t)}
Gibbs chain: x^{(t)} ~ p(h|x) ~ p(x|h) ... x^1 ... x^k = x̃  (the negative sample)
CONTRASTIVE DIVERGENCE (CD) (HINTON, NEURAL COMPUTATION, 2002) 18
Topics: contrastive divergence, negative sample
• the expectations of \partial E/\partial\theta are replaced by point evaluations:
E_h[\partial E(x^{(t)},h)/\partial\theta | x^{(t)}] ≈ \partial E(x^{(t)}, h̃^{(t)})/\partial\theta, where h̃^{(t)} = h^0 is sampled from p(h|x^{(t)})
E_{x,h}[\partial E(x,h)/\partial\theta] ≈ \partial E(x̃, h̃)/\partial\theta
CONTRASTIVE DIVERGENCE (CD) (HINTON, NEURAL COMPUTATION, 2002) 19
Topics: contrastive divergence, negative sample
(figure: the pair (x̃, h̃) serves as an approximate sample from the model distribution p(x,h))
TRAINING 20
Topics: training objective
DERIVATION OF THE LEARNING RULE 21
Topics: contrastive divergence
• Derivation of \partial E(x,h)/\partial\theta for \theta = W_{jk}:
\partial E(x,h)/\partial W_{jk} = \partial/\partial W_{jk} (-\sum_{jk} W_{jk} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j)
= -\partial/\partial W_{jk} \sum_{jk} W_{jk} h_j x_k
= -h_j x_k
DERIVATION OF THE LEARNING RULE 22
Topics: contrastive divergence
• Derivation of E_h[\partial E(x,h)/\partial\theta | x] for \theta = W_{jk}:
E_h[\partial E(x,h)/\partial W_{jk} | x] = E_h[-h_j x_k | x]
= \sum_{h_j \in \{0,1\}} -h_j x_k p(h_j|x)
= -x_k p(h_j = 1|x)
where h(x) := (p(h_1 = 1|x), ..., p(h_H = 1|x))^T = sigm(b + Wx)
DERIVATION OF THE LEARNING RULE 23
Topics: contrastive divergence
• Given x̃, and using \nabla_W E(x,h) = -h x^T and E_h[\nabla_W E(x,h) | x] = -h(x) x^T, the learning rule for \theta = W becomes:
W ⇐ W - \alpha (\nabla_W -log p(x^{(t)}))  (14)
⇐ W - \alpha (E_h[\nabla_W E(x^{(t)},h) | x^{(t)}] - E_{x,h}[\nabla_W E(x,h)])  (15)
⇐ W - \alpha (E_h[\nabla_W E(x^{(t)},h) | x^{(t)}] - E_h[\nabla_W E(x̃,h) | x̃])  (16)
⇐ W + \alpha (h(x^{(t)}) x^{(t)T} - h(x̃) x̃^T)  (17)
CD-K: PSEUDOCODE 24
Topics: contrastive divergence
1. For each training example x^{(t)}:
i. generate a negative sample x̃ using k steps of Gibbs sampling, starting at x^{(t)}
ii. update parameters:
W ⇐ W + \alpha (h(x^{(t)}) x^{(t)T} - h(x̃) x̃^T)  (44)
b ⇐ b + \alpha (h(x^{(t)}) - h(x̃))  (45)
c ⇐ c + \alpha (x^{(t)} - x̃)  (46)
2. Go back to 1 until stopping criterion
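The pseudocode above maps directly to code. A minimal NumPy sketch of one CD-k update for a binary RBM (the learning rate, sizes and initialization are hypothetical; h(x) = sigm(b + Wx) as in the derivation):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_step(x_t, W, b, c, rng, k=1, alpha=0.01):
    """One CD-k update for a binary RBM; W, b, c are modified in place."""
    # i. negative sample via k steps of Gibbs sampling, starting at x^(t)
    x_tilde = x_t
    for _ in range(k):
        h_samp = (rng.random(b.shape) < sigm(b + W @ x_tilde)).astype(float)
        x_tilde = (rng.random(c.shape) < sigm(c + W.T @ h_samp)).astype(float)
    # ii. parameter updates (eqs. 44-46), with h(x) = sigm(b + W x)
    h_data = sigm(b + W @ x_t)
    h_model = sigm(b + W @ x_tilde)
    W += alpha * (np.outer(h_data, x_t) - np.outer(h_model, x_tilde))
    b += alpha * (h_data - h_model)
    c += alpha * (x_t - x_tilde)
    return x_tilde

rng = np.random.default_rng(3)
H, D = 4, 6
W = rng.normal(scale=0.1, size=(H, D)); b = np.zeros(H); c = np.zeros(D)
x_t = rng.integers(0, 2, size=D).astype(float)
x_tilde = cd_k_step(x_t, W, b, c, rng, k=1)
```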
CONTRASTIVE DIVERGENCE (CD) (HINTON, NEURAL COMPUTATION, 2002) 25
• CD-k: contrastive divergence with k iterations of Gibbs sampling
• In general, the bigger k is, the less biased the estimate of the gradient will be
• In practice, k=1 works well for pre-training
Topics: contrastive divergence
PERSISTENT CD (PCD) (TIELEMAN, ICML 2008) 26
Topics: persistent contrastive divergence
• Idea: instead of initializing the chain at x^{(t)}, initialize the chain at the negative sample x̃ of the last iteration
Gibbs chain: x^1 ~ p(h|x) ~ p(x|h) ... x^k = x̃  (x^1 comes from the previous iteration; x̃ is the negative sample)
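The only change relative to CD-k is where the Gibbs chain starts; the update itself is identical. A sketch under the same hypothetical binary-RBM setup as the CD-k sketch, with the chain state threaded through the calls:

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def pcd_step(x_t, chain, W, b, c, rng, k=1, alpha=0.01):
    """One PCD update: the negative chain starts at `chain`, the negative
    sample of the previous iteration, not at the training example x_t."""
    x_tilde = chain
    for _ in range(k):
        h_samp = (rng.random(b.shape) < sigm(b + W @ x_tilde)).astype(float)
        x_tilde = (rng.random(c.shape) < sigm(c + W.T @ h_samp)).astype(float)
    h_data, h_model = sigm(b + W @ x_t), sigm(b + W @ x_tilde)
    W += alpha * (np.outer(h_data, x_t) - np.outer(h_model, x_tilde))
    b += alpha * (h_data - h_model)
    c += alpha * (x_t - x_tilde)
    return x_tilde  # persisted: becomes `chain` at the next iteration

rng = np.random.default_rng(4)
H, D = 4, 6
W = rng.normal(scale=0.1, size=(H, D)); b = np.zeros(H); c = np.zeros(D)
chain = rng.integers(0, 2, size=D).astype(float)
for x_t in rng.integers(0, 2, size=(5, D)).astype(float):
    chain = pcd_step(x_t, chain, W, b, c, rng)
```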
RESTRICTED BOLTZMANN MACHINE 27
Topics: RBM, visible layer, hidden layer, energy function
EXAMPLE OF DATA SET: MNIST 28
(LAROCHELLE, BENGIO, LOURADOUR AND LAMBLIN)
Figure 5: Samples from the MNIST digit recognition data set. Here, a black pixel corresponds to an input value of 0 and a white pixel corresponds to 1 (the inputs are scaled between 0 and 1).
FILTERS (LAROCHELLE ET AL., JMLR 2009) 29
Figure 6: Display of the input weights of a random subset of the hidden units, learned by an RBM when trained on samples from the MNIST data set. The activation of units of the first hidden layer is obtained by a dot product of such a weight "image" with the input image. In these images, a black pixel corresponds to a weight smaller than −3 and a white pixel to a weight larger than 3, with the different shades of gray corresponding to different weight values uniformly between −3 and 3.
Figure 7: Input weights of a random subset of the hidden units, learned by an autoassociator when trained on samples from the MNIST data set. The display setting is the same as for Figure 6.
DEBUGGING 30

Topics: stochastic reconstruction, filters
• Unfortunately, we can't debug with a comparison against finite differences
• We instead rely on approximate ‘‘tricks’’
‣ we plot the average stochastic reconstruction error \|x^{(t)} - \tilde{x}\|^2 and check whether it tends to decrease
‣ for inputs that correspond to images, we visualize the connections coming into each hidden unit as if they were an image
- gives an idea of the type of visual feature each hidden unit detects
‣ we can also try to approximate the partition function Z and see whether the (approximated) NLL decreases
- On the Quantitative Analysis of Deep Belief Networks. Ruslan Salakhutdinov and Iain Murray, 2008
• p(x)

• Training objective (average negative log-likelihood):
\frac{1}{T}\sum_t l(f(x^{(t)})) = \frac{1}{T}\sum_t -\log p(x^{(t)})

• Gradient of the NLL (positive phase minus negative phase):
\frac{\partial -\log p(x^{(t)})}{\partial\theta} = \mathbb{E}_h\!\left[\frac{\partial E(x^{(t)},h)}{\partial\theta}\,\middle|\,x^{(t)}\right] - \mathbb{E}_{x,h}\!\left[\frac{\partial E(x,h)}{\partial\theta}\right]

• Contrastive divergence: starting a Gibbs chain at x^{(t)}, draw \tilde{h}^{(t)}, \tilde{x}, \tilde{h}, and replace the expectations by point estimates at the sampled configurations:
\mathbb{E}_h\!\left[\frac{\partial E(x^{(t)},h)}{\partial\theta}\,\middle|\,x^{(t)}\right] \approx \frac{\partial E(x^{(t)},\tilde{h}^{(t)})}{\partial\theta}, \qquad \mathbb{E}_{x,h}\!\left[\frac{\partial E(x,h)}{\partial\theta}\right] \approx \frac{\partial E(\tilde{x},\tilde{h})}{\partial\theta}

• Gradients of the energy with respect to W:
\nabla_W E(x,h) = -h\,x^\top, \qquad \mathbb{E}_h[\nabla_W E(x,h)\,|\,x] = -h(x)\,x^\top

• Derivation of the update for W:
W \Leftarrow W - \alpha\left(\nabla_W -\log p(x^{(t)})\right) \quad (14)
\Leftarrow W - \alpha\left(\mathbb{E}_h\!\left[\nabla_W E(x^{(t)},h)\,\middle|\,x^{(t)}\right] - \mathbb{E}_{x,h}[\nabla_W E(x,h)]\right) \quad (15)
\Leftarrow W - \alpha\left(\mathbb{E}_h\!\left[\nabla_W E(x^{(t)},h)\,\middle|\,x^{(t)}\right] - \mathbb{E}_h[\nabla_W E(\tilde{x},h)\,|\,\tilde{x}]\right) \quad (16)
\Leftarrow W + \alpha\left(h(x^{(t)})\,x^{(t)\top} - h(\tilde{x})\,\tilde{x}^\top\right) \quad (17)

• Resulting contrastive divergence updates:
W \Leftarrow W + \alpha\left(h(x^{(t)})\,x^{(t)\top} - h(\tilde{x})\,\tilde{x}^\top\right) \quad (19)
b \Leftarrow b + \alpha\left(h(x^{(t)}) - h(\tilde{x})\right) \quad (20)
c \Leftarrow c + \alpha\left(x^{(t)} - \tilde{x}\right) \quad (21)

• Stochastic reconstruction error: \|x^{(t)} - \tilde{x}\|^2
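The update equations (19)-(21) can be sketched as a single CD-1 step for a binary RBM. The array shapes, the learning rate, and the use of one Gibbs step are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def cd1_update(x, W, b, c, lr=0.05):
    """One CD-1 step on a single binary input x, updating W, b, c in place.
    W: (H, D) weights, b: hidden biases (H,), c: visible biases (D,)."""
    h_pos = sigmoid(W @ x + b)                         # h(x) = p(h=1|x)
    # one Gibbs step: sample h ~ p(h|x), then x_tilde ~ p(x|h)
    h_samp = (rng.random(h_pos.shape) < h_pos).astype(float)
    x_tilde = (rng.random(x.shape) < sigmoid(W.T @ h_samp + c)).astype(float)
    h_neg = sigmoid(W @ x_tilde + b)                   # h(x_tilde)
    # updates (19)-(21): positive statistics minus negative statistics
    W += lr * (np.outer(h_pos, x) - np.outer(h_neg, x_tilde))
    b += lr * (h_pos - h_neg)
    c += lr * (x - x_tilde)
    return np.sum((x - x_tilde) ** 2)                  # reconstruction error

# usage sketch on one random binary input
D, H = 6, 4
W = rng.normal(0, 0.01, (H, D)); b = np.zeros(H); c = np.zeros(D)
x = (rng.random(D) < 0.5).astype(float)
err = cd1_update(x, W, b, c)
```

Tracking the returned reconstruction error across training steps is exactly the "plot the average stochastic reconstruction" debugging trick mentioned above.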
RESTRICTED BOLTZMANN MACHINE 31

Topics: RBM, visible layer, hidden layer, energy function
• hidden layer h = (h_1, \dots, h_H) (binary units), visible layer x = (x_1, \dots, x_D) (binary units), connections W, biases b_j (hidden) and c_k (visible)

Energy function:
E(x,h) = -h^\top W x - c^\top x - b^\top h \quad (1)
= -\sum_j \sum_k W_{j,k} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j \quad (2)

Distribution:
p(x,h) = \exp(-E(x,h))/Z \quad (3)
= \exp(h^\top W x + c^\top x + b^\top h)/Z \quad (4)
= \exp(h^\top W x)\exp(c^\top x)\exp(b^\top h)/Z \quad (5)

where Z is the (intractable) partition function. Equivalently, as a product of factors:
p(x,h) = \frac{1}{Z} \prod_j \prod_k \exp(W_{j,k} h_j x_k) \prod_k \exp(c_k x_k) \prod_j \exp(b_j h_j) \quad (6)
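For a small RBM, the energy and joint distribution above can be checked numerically, since Z is only intractable for large layers. A sketch with illustrative sizes:

```python
import numpy as np
from itertools import product

def energy(x, h, W, b, c):
    # E(x,h) = -h^T W x - c^T x - b^T h
    return -(h @ W @ x) - c @ x - b @ h

rng = np.random.default_rng(0)
D, H = 3, 2                                   # small enough to enumerate
W = rng.normal(size=(H, D)); b = rng.normal(size=H); c = rng.normal(size=D)

# brute-force partition function over all 2^D * 2^H binary configurations
Z = sum(np.exp(-energy(np.array(x), np.array(h), W, b, c))
        for x in product([0, 1], repeat=D)
        for h in product([0, 1], repeat=H))

def p_joint(x, h):
    # p(x,h) = exp(-E(x,h)) / Z
    return np.exp(-energy(x, h, W, b, c)) / Z
```

Summing `p_joint` over every configuration returns 1, confirming that the exponentiated negative energies, once divided by Z, form a valid distribution.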
GAUSSIAN-BERNOULLI RBM 32

Topics: Gaussian-Bernoulli RBM
• Inputs x are unbounded reals
‣ add a quadratic term to the energy function
‣ the only thing that changes is that p(x|h) is now a Gaussian distribution with mean c + W^\top h and identity covariance matrix
‣ it is recommended to normalize the training set by
- subtracting the mean of each input
- dividing each input by the training set standard deviation
‣ a smaller learning rate than in the regular RBM should be used
• The contrastive divergence derivation and the updates for W, b and c are unchanged.

• Energy function with the quadratic term:
E(x,h) = -h^\top W x - c^\top x - b^\top h + \frac{1}{2} x^\top x

• p(x|h) is then Gaussian with mean \mu = c + W^\top h and identity covariance.

• In general, for a Markov network over variables z_1, \dots, z_V, where \mathrm{Ne}(z_i) denotes the neighbors of z_i (for an RBM, the neighbors of each x_k are all the h_j, and vice versa):
p(z_i | z_1, \dots, z_V) = p(z_i | \mathrm{Ne}(z_i)) \quad (10)
= \frac{p(z_i, \mathrm{Ne}(z_i))}{\sum_{z'_i} p(z'_i, \mathrm{Ne}(z_i))} \quad (11)
= \frac{\prod_{f \text{ involving } z_i} f(z_i, \mathrm{Ne}(z_i))}{\sum_{z'_i} \prod_{f \text{ involving } z_i} f(z'_i, \mathrm{Ne}(z_i))} \quad (12)
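A minimal sketch of the Gaussian-Bernoulli changes: standardize the training set, keep p(h|x) as in the binary RBM, and sample p(x|h) as a Gaussian with mean c + W^\top h and identity covariance. Sizes and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# recommended preprocessing: subtract the mean of each input and divide
# by the training set standard deviation
X = rng.normal(5.0, 2.0, size=(100, 8))        # illustrative real-valued data
X = (X - X.mean(axis=0)) / X.std(axis=0)

D, H = X.shape[1], 4
W = rng.normal(0, 0.01, (H, D)); b = np.zeros(H); c = np.zeros(D)

x = X[0]
# p(h|x) is unchanged from the binary RBM
h = (rng.random(H) < sigmoid(W @ x + b)).astype(float)
# p(x|h) is now Gaussian with mean c + W^T h and identity covariance
x_tilde = c + W.T @ h + rng.normal(size=D)
```

With this reconstruction rule swapped in, the contrastive divergence updates proceed exactly as before, ideally with a smaller learning rate.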
FILTERS (LAROCHELLE ET AL., JMLR 2009) 33

Figure 12: Input weights of a random subset of the hidden units, learned by an RBM with Gaussian visible units, when trained on samples from the MNIST data set. The top and bottom images correspond to the same filters but with different color scales. On top, the display setup is the same as for Figures 6 and 7 and, on the bottom, black and white pixels correspond to weights smaller than −10 and larger than 10 respectively.
OTHER TYPES OF OBSERVATIONS 34

Topics: extensions to other observations
• Extensions support other types:
‣ real-valued observations: Gaussian-Bernoulli RBM
‣ binomial observations:
- Rate-coded Restricted Boltzmann Machines for Face Recognition. Yee Whye Teh and Geoffrey Hinton, 2001
‣ multinomial observations:
- Replicated Softmax: an Undirected Topic Model. Ruslan Salakhutdinov and Geoffrey Hinton, 2009
- Training Restricted Boltzmann Machines on Word Observations. George Dahl, Ryan Adams and Hugo Larochelle, 2012
‣ and more (see course website)
BOLTZMANN MACHINE 35

Topics: Boltzmann machine
• The original Boltzmann machine has lateral connections in each layer
‣ when only one layer has lateral connections, it's a semi-restricted Boltzmann machine

Energy function, with lateral connection matrices V (visible-visible) and U (hidden-hidden):
E(x,h) = -h^\top W x - c^\top x - b^\top h - \frac{1}{2} x^\top V x - \frac{1}{2} h^\top U h \quad (23)
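The Boltzmann machine energy with lateral terms can be written directly. In the sketch below, making V and U symmetric with zero diagonal is an assumption consistent with standard Boltzmann machines; the sizes are illustrative:

```python
import numpy as np

def bm_energy(x, h, W, b, c, V, U):
    """E(x,h) = -h^T W x - c^T x - b^T h - 1/2 x^T V x - 1/2 h^T U h"""
    return (-(h @ W @ x) - c @ x - b @ h
            - 0.5 * (x @ V @ x) - 0.5 * (h @ U @ h))

rng = np.random.default_rng(0)
D, H = 4, 3
W = rng.normal(size=(H, D)); b = rng.normal(size=H); c = rng.normal(size=D)
# lateral connections: symmetric with zero diagonal (assumed convention)
V = rng.normal(size=(D, D)); V = (V + V.T) / 2; np.fill_diagonal(V, 0)
U = rng.normal(size=(H, H)); U = (U + U.T) / 2; np.fill_diagonal(U, 0)

x = (rng.random(D) < 0.5).astype(float)
h = (rng.random(H) < 0.5).astype(float)
E = bm_energy(x, h, W, b, c, V, U)
```

Setting V = U = 0 recovers the RBM energy, and setting only one of them to zero gives the semi-restricted case.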