School of Computer Science
Probabilistic Graphical Models
Factor Analysis and State Space Models
Eric Xing, Lecture 11, February 19, 2014
Reading: See class website
© Eric Xing @ CMU, 2005-2014
A road map to more complex dynamic models
[Figure: a grid of graphical models organized by discrete vs. continuous latent state X and observation Y. Non-sequential models: mixture model (e.g., mixture of multinomials), mixture model (e.g., mixture of Gaussians), factor analysis. Sequential models: HMM (for discrete sequential data, e.g., text), HMM (for continuous sequential data, e.g., speech signal), state space model, factorial HMM, switching SSM.]
Recall multivariate Gaussian

Multivariate Gaussian density:

  p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)

A joint Gaussian:

  p(x_1, x_2 \mid \mu, \Sigma) = \mathcal{N}\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\Big|\, \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)

How to write down p(x_1), p(x_1|x_2) or p(x_2|x_1) using the block elements in \mu and \Sigma? Formulas to remember:

Marginal:  p(x_2) = \mathcal{N}(x_2 \mid m_2, V_2),  with  m_2 = \mu_2,  V_2 = \Sigma_{22}

Conditional:  p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2}),  with

  m_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(x_2 - \mu_2)
  V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
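As a quick numeric sanity check of the conditioning formulas, here is the scalar-block (bivariate) case; the parameter values are made up for illustration:

```python
# Numeric check of the Gaussian conditioning formulas in the bivariate case:
#   m_{1|2} = mu1 + S12/S22 * (x2 - mu2),   V_{1|2} = S11 - S12^2/S22
mu1, mu2 = 1.0, -2.0
S11, S12, S22 = 4.0, 1.2, 2.0   # joint covariance blocks (illustrative values)

x2 = 0.5                                  # observed value of the second coordinate
m_cond = mu1 + (S12 / S22) * (x2 - mu2)   # conditional mean of x1 given x2
V_cond = S11 - S12 ** 2 / S22             # conditional variance of x1 given x2

print(m_cond, V_cond)   # 2.5, 3.28
```

Note that V_cond < S11: conditioning on x2 can only shrink the uncertainty about x1.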
Review: The matrix inverse lemma

Consider a block-partitioned matrix:

  M = \begin{bmatrix} E & F \\ G & H \end{bmatrix}

First we diagonalize M:

  \begin{bmatrix} I & -F H^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} E & F \\ G & H \end{bmatrix} \begin{bmatrix} I & 0 \\ -H^{-1} G & I \end{bmatrix} = \begin{bmatrix} E - F H^{-1} G & 0 \\ 0 & H \end{bmatrix}

Schur complement:  M/H = E - F H^{-1} G

Then we invert, using the formula (XYZ)^{-1} = Z^{-1} Y^{-1} X^{-1}:

  M^{-1} = \begin{bmatrix} I & 0 \\ -H^{-1} G & I \end{bmatrix} \begin{bmatrix} (M/H)^{-1} & 0 \\ 0 & H^{-1} \end{bmatrix} \begin{bmatrix} I & -F H^{-1} \\ 0 & I \end{bmatrix}
         = \begin{bmatrix} (M/H)^{-1} & -(M/H)^{-1} F H^{-1} \\ -H^{-1} G (M/H)^{-1} & H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1} \end{bmatrix}

Matrix inverse lemma:

  (E - F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H - G E^{-1} F)^{-1} G E^{-1}
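The lemma can be sanity-checked numerically even with 1x1 "blocks", where every matrix inverse is an ordinary reciprocal; the numbers below are arbitrary:

```python
# Scalar sanity check of the matrix inversion lemma
#   (E - F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H - G E^{-1} F)^{-1} G E^{-1}
E, F, G, H = 4.0, 1.0, 2.0, 3.0   # illustrative 1x1 "blocks"

lhs = 1.0 / (E - F * (1.0 / H) * G)
rhs = 1.0 / E + (1.0 / E) * F * (1.0 / (H - G * (1.0 / E) * F)) * G * (1.0 / E)
print(lhs, rhs)   # both 0.3
```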
Review: Some matrix algebra

Trace:

  \mathrm{tr}[A] \stackrel{\text{def}}{=} \sum_i a_{ii}

Cyclical permutations:

  \mathrm{tr}[ABC] = \mathrm{tr}[CAB] = \mathrm{tr}[BCA]

Derivatives:

  \frac{\partial}{\partial A} \mathrm{tr}[BA] = B^T
  \frac{\partial}{\partial A} \mathrm{tr}[x^T A x] = \frac{\partial}{\partial A} \mathrm{tr}[x x^T A] = x x^T

Determinants and derivatives:

  \frac{\partial}{\partial A} \log|A| = A^{-T}
Factor analysis

An unsupervised linear regression model:

  p(x) = \mathcal{N}(x; 0, I)
  p(y \mid x) = \mathcal{N}(y; \mu + \Lambda x, \Psi)

where \Lambda is called the factor loading matrix, and \Psi is diagonal.

Geometric interpretation: to generate data, first generate a point within the manifold, then add noise. The coordinates of the point are the components of the latent variable.
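The two-stage generative process above can be sketched directly; the dimensions (p = 3, k = 2) and all parameter values here are illustrative, not from the lecture:

```python
import random

# Sampling from the FA generative model: p(x) = N(0, I), p(y|x) = N(mu + Lambda x, Psi).
random.seed(0)
mu = [1.0, -1.0, 0.5]                       # manifold center (p = 3)
Lam = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]  # factor loading matrix (p x k, k = 2)
psi = [0.5, 0.5, 0.5]                       # diagonal of Psi (observation noise)

def sample_y():
    x = [random.gauss(0.0, 1.0) for _ in range(2)]   # latent point on the manifold
    return [mu[i] + sum(Lam[i][j] * x[j] for j in range(2))
            + random.gauss(0.0, psi[i] ** 0.5) for i in range(3)]

ys = [sample_y() for _ in range(20000)]
mean = [sum(y[i] for y in ys) / len(ys) for i in range(3)]
print(mean)   # close to mu, since E[y] = mu
```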
Marginal data distribution

A marginal Gaussian (e.g., p(x)) times a conditional Gaussian (e.g., p(y|x)) is a joint Gaussian. Any marginal (e.g., p(y)) of a joint Gaussian (e.g., p(x,y)) is also a Gaussian. Since the marginal is Gaussian, we can determine it by just computing its mean and variance. (Assume the noise is uncorrelated with the data.)

Writing y = \mu + \Lambda x + w, where w \sim \mathcal{N}(0, \Psi):

  E[y] = E[\mu + \Lambda x + w] = \mu + \Lambda E[x] + E[w] = \mu

  Var[y] = E[(y - \mu)(y - \mu)^T]
         = E[(\Lambda x + w)(\Lambda x + w)^T]
         = \Lambda E[x x^T] \Lambda^T + E[w w^T]
         = \Lambda \Lambda^T + \Psi
FA = Constrained-Covariance Gaussian

Marginal density for factor analysis (y is p-dim, x is k-dim):

  p(y) = \mathcal{N}(y; \mu, \Lambda \Lambda^T + \Psi)

So the effective covariance is the low-rank outer product of two long skinny matrices plus a diagonal matrix. In other words, factor analysis is just a constrained Gaussian model. (If \Psi were not diagonal then we could model any Gaussian and it would be pointless.)
FA joint distribution

Model:

  p(x) = \mathcal{N}(x; 0, I)
  p(y \mid x) = \mathcal{N}(y; \mu + \Lambda x, \Psi)

Covariance between x and y (assume the noise w is uncorrelated with the data or the latent variables):

  Cov[x, y] = E[(x - 0)(y - \mu)^T] = E[x(\Lambda x + w)^T] = E[x x^T] \Lambda^T = \Lambda^T

Hence the joint distribution of x and y:

  p\left( \begin{bmatrix} x \\ y \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} x \\ y \end{bmatrix} ; \begin{bmatrix} 0 \\ \mu \end{bmatrix}, \begin{bmatrix} I & \Lambda^T \\ \Lambda & \Lambda \Lambda^T + \Psi \end{bmatrix} \right)
Inference in Factor Analysis

Apply the Gaussian conditioning formulas to the joint distribution we derived above:

  m_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(x_2 - \mu_2),   V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}

We can now derive the posterior of the latent variable x given the observation y, p(x \mid y) = \mathcal{N}(x \mid m_{x|y}, V_{x|y}), where

  V_{x|y} = I - \Lambda^T (\Lambda \Lambda^T + \Psi)^{-1} \Lambda
  m_{x|y} = \Lambda^T (\Lambda \Lambda^T + \Psi)^{-1} (y - \mu)

Applying the matrix inversion lemma

  (E - F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H - G E^{-1} F)^{-1} G E^{-1}

gives

  V_{x|y} = (I + \Lambda^T \Psi^{-1} \Lambda)^{-1}
  m_{x|y} = V_{x|y} \Lambda^T \Psi^{-1} (y - \mu)

Here we only need to invert a matrix of size |x| \times |x|, instead of |y| \times |y|.
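The equivalence of the two forms of the posterior mean can be checked numerically. With k = 1 latent dimension and p = 2 observed dimensions (toy numbers below), every |x|-by-|x| inverse is a scalar reciprocal, while the direct form needs a 2x2 inverse:

```python
# Posterior inference in FA, two equivalent forms of the posterior mean:
#   lemma form:   V = (I + L^T Psi^{-1} L)^{-1},  m = V L^T Psi^{-1} (y - mu)
#   direct form:  m = L^T (L L^T + Psi)^{-1} (y - mu)
L = [2.0, 1.0]        # loading vector Lambda (p x 1), illustrative
psi = [0.5, 0.5]      # diagonal of Psi
r = [1.0, 2.0]        # residual y - mu

# Lemma form: only a 1x1 inverse is needed.
V = 1.0 / (1.0 + sum(L[i] ** 2 / psi[i] for i in range(2)))
m_lemma = V * sum(L[i] * r[i] / psi[i] for i in range(2))

# Direct form: invert the 2x2 marginal covariance L L^T + Psi.
S = [[L[0] * L[0] + psi[0], L[0] * L[1]],
     [L[1] * L[0], L[1] * L[1] + psi[1]]]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv_r = [(S[1][1] * r[0] - S[0][1] * r[1]) / det,
          (-S[1][0] * r[0] + S[0][0] * r[1]) / det]
m_direct = sum(L[i] * Sinv_r[i] for i in range(2))

print(m_lemma, m_direct)   # both 8/11 ≈ 0.7273
```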
Geometric interpretation: inference is linear projection

The posterior is:

  p(x \mid y) = \mathcal{N}(x \mid m_{x|y}, V_{x|y}),   V_{x|y} = (I + \Lambda^T \Psi^{-1} \Lambda)^{-1},   m_{x|y} = V_{x|y} \Lambda^T \Psi^{-1} (y - \mu)

The posterior covariance does not depend on the observed data y! Computing the posterior mean is just a linear operation.
Learning FA

Now assume that we are given only {y_n} (the observations of the high-dimensional data).

We have derived how to estimate x_n from p(X|Y). How can we learn the model: the loading matrix \Lambda, the manifold center \mu, and the variance \Psi?
EM for Factor Analysis

Incomplete data log likelihood function (marginal density of y):

  \ell(\theta; D) = -\frac{N}{2} \log|\Lambda \Lambda^T + \Psi| - \frac{1}{2} \sum_n (y_n - \mu)^T (\Lambda \Lambda^T + \Psi)^{-1} (y_n - \mu)
                  = -\frac{N}{2} \log|\Lambda \Lambda^T + \Psi| - \frac{N}{2} \mathrm{tr}\left[(\Lambda \Lambda^T + \Psi)^{-1} S\right],

  where  S = \frac{1}{N} \sum_n (y_n - \mu)(y_n - \mu)^T

Estimating \mu is trivial:  \hat{\mu}_{ML} = \frac{1}{N} \sum_n y_n. The parameters \Lambda and \Psi are coupled nonlinearly in the log-likelihood.

Complete log likelihood:

  \ell_c(\theta; D) = \sum_n \log p(x_n, y_n) = \sum_n \log p(x_n) + \sum_n \log p(y_n \mid x_n)
    = -\frac{N}{2} \log|I| - \frac{1}{2} \sum_n x_n^T x_n - \frac{N}{2} \log|\Psi| - \frac{1}{2} \sum_n (y_n - \Lambda x_n)^T \Psi^{-1} (y_n - \Lambda x_n)
    = -\frac{1}{2} \sum_n x_n^T x_n - \frac{N}{2} \log|\Psi| - \frac{N}{2} \mathrm{tr}[S \Psi^{-1}],

  where  S = \frac{1}{N} \sum_n (y_n - \Lambda x_n)(y_n - \Lambda x_n)^T
E-step for Factor Analysis

Compute \langle \ell_c(\theta; D) \rangle_{p(X|Y)}:

  \langle \ell_c(\theta; D) \rangle = -\frac{1}{2} \sum_n \langle x_n^T x_n \rangle - \frac{N}{2} \log|\Psi| - \frac{N}{2} \mathrm{tr}[\langle S \rangle \Psi^{-1}]

  \langle S \rangle = \frac{1}{N} \sum_n \left( y_n y_n^T - \Lambda \langle x_n \rangle y_n^T - y_n \langle x_n \rangle^T \Lambda^T + \Lambda \langle x_n x_n^T \rangle \Lambda^T \right)

We need \langle x_n \rangle = E[X_n \mid y_n] and \langle x_n x_n^T \rangle = Var[X_n \mid y_n] + E[X_n \mid y_n] E[X_n \mid y_n]^T. Recall that we have derived:

  V_{x|y} = (I + \Lambda^T \Psi^{-1} \Lambda)^{-1},   m_{x|y}^n = V_{x|y} \Lambda^T \Psi^{-1} (y_n - \mu)

so

  \langle x_n \rangle = m_{x|y}^n   and   \langle x_n x_n^T \rangle = V_{x|y} + m_{x|y}^n (m_{x|y}^n)^T
M-step for Factor Analysis

Take the derivatives of the expected complete log likelihood w.r.t. the parameters, using the trace and determinant derivative rules:

  \langle \ell_c \rangle = -\frac{1}{2} \sum_n \langle x_n^T x_n \rangle - \frac{N}{2} \log|\Psi| - \frac{N}{2} \mathrm{tr}[\langle S \rangle \Psi^{-1}]

Setting \partial \langle \ell_c \rangle / \partial \Lambda = 0:

  \Lambda^{t+1} = \left( \sum_n y_n \langle x_n \rangle^T \right) \left( \sum_n \langle x_n x_n^T \rangle \right)^{-1}

Setting \partial \langle \ell_c \rangle / \partial \Psi^{-1} = 0:

  \frac{\partial \langle \ell_c \rangle}{\partial \Psi^{-1}} = \frac{N}{2} \Psi - \frac{N}{2} \langle S \rangle = 0
  \;\Rightarrow\; \Psi^{t+1} = \mathrm{diag}\left[ \frac{1}{N} \sum_n \left( y_n y_n^T - \Lambda^{t+1} \langle x_n \rangle y_n^T \right) \right]
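The E- and M-steps above can be put together in a minimal EM loop. The sketch below uses a single latent factor (k = 1) and p = 2 observed dimensions so that all posterior quantities are scalars; the data, parameter values, and number of iterations are all illustrative. The key EM property, a non-decreasing marginal log-likelihood, is easy to check:

```python
import math, random

# Minimal EM for single-factor FA (k = 1, p = 2); synthetic data and toy values.
random.seed(1)
true_lam, true_psi, true_mu = [2.0, 1.0], [0.3, 0.4], [0.5, -0.5]
Y = []
for _ in range(200):
    x = random.gauss(0, 1)
    Y.append([true_mu[i] + true_lam[i] * x + random.gauss(0, true_psi[i] ** 0.5)
              for i in range(2)])

N = len(Y)
mu = [sum(y[i] for y in Y) / N for i in range(2)]   # ML estimate of mu
Rc = [[y[0] - mu[0], y[1] - mu[1]] for y in Y]      # centered data

def loglik(lam, psi):
    # incomplete-data log-likelihood: N(y; mu, lam lam^T + Psi) on centered data
    S = [[lam[0] ** 2 + psi[0], lam[0] * lam[1]],
         [lam[1] * lam[0], lam[1] ** 2 + psi[1]]]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    ll = 0.0
    for r in Rc:
        q = (S[1][1] * r[0] ** 2 - 2 * S[0][1] * r[0] * r[1] + S[0][0] * r[1] ** 2) / det
        ll += -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + q)
    return ll

lam, psi = [1.0, 1.0], [1.0, 1.0]   # arbitrary initialization
lls = []
for _ in range(30):
    lls.append(loglik(lam, psi))
    # E-step: posterior moments of the scalar latent x_n
    V = 1.0 / (1.0 + sum(lam[i] ** 2 / psi[i] for i in range(2)))
    m = [V * sum(lam[i] * r[i] / psi[i] for i in range(2)) for r in Rc]
    Exx = [V + mn ** 2 for mn in m]
    # M-step: closed-form updates for lam and the diagonal Psi
    lam = [sum(Rc[n][i] * m[n] for n in range(N)) / sum(Exx) for i in range(2)]
    psi = [sum(Rc[n][i] ** 2 - lam[i] * m[n] * Rc[n][i] for n in range(N)) / N
           for i in range(2)]

print(lls[0], lls[-1])   # log-likelihood improves monotonically
```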
Model Invariance and Identifiability

There is degeneracy in the FA model. Since \Lambda only appears as the outer product \Lambda \Lambda^T, the model is invariant to rotations and axis flips of the latent space. We can replace \Lambda with \Lambda Q for any orthonormal matrix Q and the model remains the same: (\Lambda Q)(\Lambda Q)^T = \Lambda (Q Q^T) \Lambda^T = \Lambda \Lambda^T.

This means that there is no "one best" setting of the parameters; an infinite number of parameter settings all achieve the ML score. Such models are called unidentifiable, since two people both fitting ML parameters to identical data are not guaranteed to identify the same parameters.
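The rotation invariance is easy to verify numerically; the 3x2 loading matrix and rotation angle below are arbitrary:

```python
import math

# Rotation invariance of FA: replacing Lambda by Lambda Q for an orthonormal Q
# leaves Lambda Lambda^T (and hence the marginal p(y)) unchanged.
Lam = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]   # illustrative p = 3, k = 2 loadings
th = 0.7
Q = [[math.cos(th), -math.sin(th)], [math.sin(th), math.cos(th)]]   # 2x2 rotation

LamQ = [[sum(Lam[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
        for i in range(3)]

def outer(A):   # A A^T for a 3x2 matrix A
    return [[sum(A[i][k] * A[j][k] for k in range(2)) for j in range(3)]
            for i in range(3)]

O1, O2 = outer(Lam), outer(LamQ)
diff = max(abs(O1[i][j] - O2[i][j]) for i in range(3) for j in range(3))
print(diff)   # numerically zero
```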
A road map to more complex dynamic models
[Figure: the model taxonomy from the earlier road map slide, repeated: mixture models, factor analysis, HMMs, state space model, factorial HMM, switching SSM.]
State space models (SSM)

A sequential FA, or a continuous-state HMM:

  x_1 \sim \mathcal{N}(0, \Sigma_0)
  x_t = A x_{t-1} + G w_t,   w_t \sim \mathcal{N}(0, Q)
  y_t = C x_t + v_t,   v_t \sim \mathcal{N}(0, R)

This is a linear dynamic system. In general,

  x_t = f(x_{t-1}) + G w_t
  y_t = g(x_t) + v_t

where f is an (arbitrary) dynamic model, and g is an (arbitrary) observation model.
LDS for 2D tracking

Dynamics: new position = old position + velocity + noise (constant velocity model, Gaussian noise):

  \begin{bmatrix} x_t^1 \\ x_t^2 \\ \dot{x}_t^1 \\ \dot{x}_t^2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{t-1}^1 \\ x_{t-1}^2 \\ \dot{x}_{t-1}^1 \\ \dot{x}_{t-1}^2 \end{bmatrix} + \text{noise}

Observation: project out the first two components (we observe the Cartesian position of the object, which is linear!):

  \begin{bmatrix} y_t^1 \\ y_t^2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_t^1 \\ x_t^2 \\ \dot{x}_t^1 \\ \dot{x}_t^2 \end{bmatrix} + \text{noise}
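To see what the constant-velocity matrices do, here is the model run without noise so the trajectory is exact; the initial state is an arbitrary choice:

```python
# Constant-velocity dynamics from the slide: state (x1, x2, dx1, dx2),
# observations are the first two components. No noise, so results are exact.
A = [[1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
C = [[1, 0, 0, 0],
     [0, 1, 0, 0]]

def step(s):
    return [sum(A[i][j] * s[j] for j in range(4)) for i in range(4)]

s = [0.0, 0.0, 1.0, 2.0]   # start at the origin with velocity (1, 2)
for _ in range(5):
    s = step(s)
y = [sum(C[i][j] * s[j] for j in range(4)) for i in range(2)]
print(y)   # after 5 steps the observed position is (5, 10)
```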
The inference problem 1

Filtering: given y_1, ..., y_t, estimate p(x_t \mid y_{1:t}).

The Kalman filter is a way to perform exact online inference (sequential Bayesian updating) in an LDS. It is the Gaussian analog of the forward algorithm for HMMs:

  \alpha_t^i \stackrel{\text{def}}{=} p(X_t = i \mid y_{1:t}) \propto p(y_t \mid X_t = i) \sum_j p(X_t = i \mid X_{t-1} = j)\, \alpha_{t-1}^j
The inference problem 2

Smoothing: given y_1, ..., y_T, estimate p(x_t \mid y_{1:T}) for t < T.

The Rauch-Tung-Striebel smoother is a way to perform exact off-line inference in an LDS. It is the Gaussian analog of the forwards-backwards (alpha-gamma) algorithm for HMMs:

  \gamma_t^i \stackrel{\text{def}}{=} p(X_t = i \mid y_{1:T}) = \alpha_t^i \sum_j \frac{p(X_{t+1} = j \mid X_t = i)\, \gamma_{t+1}^j}{\sum_k \alpha_t^k\, p(X_{t+1} = j \mid X_t = k)}
2D tracking
[Figure: filtered state estimates for the 2D tracking example, plotted against the X1 coordinate.]
Kalman filtering in the brain?
Kalman filtering derivation

Since all CPDs are linear Gaussian, the system defines a large multivariate Gaussian, hence all marginals are Gaussian. We can therefore represent the belief state p(X_t \mid y_{1:t}) as a Gaussian with mean \hat{x}_{t|t} and covariance P_{t|t}. It is common to work with the inverse covariance (precision) matrix P_{t|t}^{-1}; this is called information form.

Kalman filtering is a recursive procedure to update the belief state:

Predict step: compute p(X_{t+1} \mid y_{1:t}) from the prior belief p(X_t \mid y_{1:t}) and the dynamical model p(X_{t+1} \mid X_t) --- the time update.

Update step: compute the new belief p(X_{t+1} \mid y_{1:t+1}) from the prediction p(X_{t+1} \mid y_{1:t}), the observation y_{t+1}, and the observation model p(y_{t+1} \mid X_{t+1}) --- the measurement update.
Predict step

Dynamical model:  x_{t+1} = A x_t + G w_t,  w_t \sim \mathcal{N}(0, Q)

One-step-ahead prediction of the state:

  \hat{x}_{t+1|t} = E[X_{t+1} \mid y_1, ..., y_t] = A \hat{x}_{t|t}

  P_{t+1|t} = E[(X_{t+1} - \hat{x}_{t+1|t})(X_{t+1} - \hat{x}_{t+1|t})^T \mid y_1, ..., y_t]
            = E[(A X_t + G w_t - A \hat{x}_{t|t})(A X_t + G w_t - A \hat{x}_{t|t})^T \mid y_1, ..., y_t]
            = A P_{t|t} A^T + G Q G^T

Observation model:  y_t = C x_t + v_t,  v_t \sim \mathcal{N}(0, R)

One-step-ahead prediction of the observation:

  \hat{y}_{t+1|t} = E[Y_{t+1} \mid y_1, ..., y_t] = C \hat{x}_{t+1|t}
  E[(Y_{t+1} - \hat{y}_{t+1|t})(Y_{t+1} - \hat{y}_{t+1|t})^T \mid y_1, ..., y_t] = C P_{t+1|t} C^T + R
  E[(Y_{t+1} - \hat{y}_{t+1|t})(X_{t+1} - \hat{x}_{t+1|t})^T \mid y_1, ..., y_t] = C P_{t+1|t}
Update step

Summarizing the results from the previous slide, we have p(X_{t+1}, Y_{t+1} \mid y_{1:t}) \sim \mathcal{N}(m_{t+1}, V_{t+1}), where

  m_{t+1} = \begin{bmatrix} \hat{x}_{t+1|t} \\ C \hat{x}_{t+1|t} \end{bmatrix},
  V_{t+1} = \begin{bmatrix} P_{t+1|t} & P_{t+1|t} C^T \\ C P_{t+1|t} & C P_{t+1|t} C^T + R \end{bmatrix}

Remember the formulas for conditional Gaussian distributions: if

  p(x_1, x_2) = \mathcal{N}\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\Big|\, \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)

then p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2}) with

  m_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(x_2 - \mu_2)
  V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
Kalman Filter

Time updates:

  \hat{x}_{t+1|t} = A \hat{x}_{t|t}
  P_{t+1|t} = A P_{t|t} A^T + G Q G^T

Measurement updates:

  \hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}(y_{t+1} - C \hat{x}_{t+1|t})
  P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t}

where K_{t+1} is the Kalman gain matrix:

  K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1}

K_t can be pre-computed, since it is independent of the data.
Example of KF in 1D

Consider noisy observations of a 1D particle doing a random walk:

  x_t = x_{t-1} + w,   w \sim \mathcal{N}(0, \sigma_x^2)
  z_t = x_t + v,   v \sim \mathcal{N}(0, \sigma_z^2)

KF equations (A = C = G = 1, writing \sigma_t^2 for P_{t|t}):

  \hat{x}_{t+1|t} = \hat{x}_{t|t},   P_{t+1|t} = \sigma_t^2 + \sigma_x^2

  \hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}(z_{t+1} - \hat{x}_{t+1|t})
                    = \frac{(\sigma_t^2 + \sigma_x^2)\, z_{t+1} + \sigma_z^2\, \hat{x}_{t|t}}{\sigma_t^2 + \sigma_x^2 + \sigma_z^2}

  K_{t+1} = \frac{\sigma_t^2 + \sigma_x^2}{\sigma_t^2 + \sigma_x^2 + \sigma_z^2}
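The 1D example is small enough to run in full. Taking \sigma_x^2 = \sigma_z^2 = 1 (an illustrative choice), the variance recursion becomes P' = (P + 1)/(P + 2), whose fixed point is the golden-ratio value (\sqrt{5} - 1)/2 ≈ 0.618:

```python
import random

# 1D random-walk Kalman filter with sigma_x^2 = sigma_z^2 = 1.
random.seed(0)

x = 0.0
xhat, P = 0.0, 1.0
for _ in range(50):
    # simulate the true walk and a noisy observation
    x += random.gauss(0, 1)
    z = x + random.gauss(0, 1)
    # predict step (A = G = 1, Q = 1)
    P_pred = P + 1.0
    # measurement update (C = 1, R = 1)
    K = P_pred / (P_pred + 1.0)
    xhat = xhat + K * (z - xhat)
    P = (1.0 - K) * P_pred

print(P)   # converges to (sqrt(5) - 1)/2 ≈ 0.6180
```

Note that P (and hence K) evolves independently of the data, which is why the gain can be pre-computed.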
KF intuition

The KF update of the mean is

  \hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}(z_{t+1} - C \hat{x}_{t+1|t})

The term (z_{t+1} - C \hat{x}_{t+1|t}) is called the innovation.

The new belief is a convex combination of the prediction and the observation, weighted by the Kalman gain matrix:

  K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1}

If the observation is unreliable (\sigma_z^2, i.e., R, is large), then K_{t+1} is small, so we pay more attention to the prediction. If the old prior is unreliable (\sigma_t^2 is large) or the process is very unpredictable (\sigma_x^2 is large), we pay more attention to the observation.
Complexity of one KF step

Let X_t \in \mathbb{R}^{N_x} and Y_t \in \mathbb{R}^{N_y}.

Computing P_{t+1|t} = A P_{t|t} A^T + G Q G^T takes O(N_x^2) time, assuming dense P and dense A.

Computing K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1} takes O(N_y^3) time.

So the overall time is, in general, O(max{N_x^2, N_y^3}).
Rauch-Tung-Striebel smoother

General structure: KF results + the difference between the "smoothed" and predicted results of the next step:

  \hat{x}_{t|T} = \hat{x}_{t|t} + L_t(\hat{x}_{t+1|T} - \hat{x}_{t+1|t})
  P_{t|T} = P_{t|t} + L_t(P_{t+1|T} - P_{t+1|t})L_t^T
  L_t = P_{t|t} A^T P_{t+1|t}^{-1}

Backward computation: pretend to know things at t+1; such conditioning makes things simple, and we can remove this condition at the end.

The difficulty:

  \hat{x}_{t|T} \stackrel{\text{def}}{=} E[X_t \mid y_1, ..., y_T]
    = E\left[ E[X_t \mid X_{t+1}, y_1, ..., y_T] \,\big|\, y_1, ..., y_T \right]
    = E\left[ E[X_t \mid X_{t+1}, y_1, ..., y_t] \,\big|\, y_1, ..., y_T \right]

where the last step uses the Markov structure: given X_{t+1}, the future observations carry no extra information about X_t.

The trick (laws of total expectation and total variance):

  E[X \mid Z] = E\left[ E[X \mid Y, Z] \,\big|\, Z \right]
  Var[X \mid Z] = Var\left[ E[X \mid Y, Z] \,\big|\, Z \right] + E\left[ Var[X \mid Y, Z] \,\big|\, Z \right]   (Hw!)

The same applies to P_{t|T}.
RTS derivation

Following the results from the previous slide, we need p(X_t, X_{t+1} \mid y_{1:t}) \sim \mathcal{N}(m, V), where

  m = \begin{bmatrix} \hat{x}_{t|t} \\ \hat{x}_{t+1|t} \end{bmatrix},
  V = \begin{bmatrix} P_{t|t} & P_{t|t} A^T \\ A P_{t|t} & P_{t+1|t} \end{bmatrix}

All the quantities here are available after a forward KF pass.

Remember the formulas for conditional Gaussian distributions:

  m_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(x_2 - \mu_2),   V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}

The RTS smoother: with L_t = P_{t|t} A^T P_{t+1|t}^{-1},

  \hat{x}_{t|T} \stackrel{\text{def}}{=} E\left[ E[X_t \mid X_{t+1}, y_{1:t}] \,\big|\, y_{1:T} \right] = \hat{x}_{t|t} + L_t(\hat{x}_{t+1|T} - \hat{x}_{t+1|t})

  P_{t|T} = Var\left[ E[X_t \mid X_{t+1}, y_{1:t}] \,\big|\, y_{1:T} \right] + E\left[ Var[X_t \mid X_{t+1}, y_{1:t}] \,\big|\, y_{1:T} \right] = P_{t|t} + L_t(P_{t+1|T} - P_{t+1|t})L_t^T
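For the 1D random walk the forward KF pass and the backward RTS pass fit in a few lines; the noise variances are all set to 1 as an illustrative choice. A property worth checking is that the smoothed variances P_{t|T} never exceed the filtered variances P_{t|t}:

```python
import random

# Forward KF pass then backward RTS pass for the 1D random walk
# (A = G = Q = C = R = 1).
random.seed(0)
T = 30
x, zs = 0.0, []
for _ in range(T):
    x += random.gauss(0, 1)
    zs.append(x + random.gauss(0, 1))

# Forward pass: store filtered and one-step-ahead predicted quantities.
xf, Pf, xp, Pp = [], [], [], []
xhat, P = 0.0, 1.0
for z in zs:
    x_pred, P_pred = xhat, P + 1.0        # time update
    K = P_pred / (P_pred + 1.0)           # measurement update
    xhat = x_pred + K * (z - x_pred)
    P = (1.0 - K) * P_pred
    xp.append(x_pred); Pp.append(P_pred); xf.append(xhat); Pf.append(P)

# Backward pass: x_{t|T} = x_{t|t} + L_t (x_{t+1|T} - x_{t+1|t}),
#                P_{t|T} = P_{t|t} + L_t^2 (P_{t+1|T} - P_{t+1|t})
xs, Ps = xf[:], Pf[:]
for t in range(T - 2, -1, -1):
    L = Pf[t] / Pp[t + 1]                 # L_t = P_{t|t} A / P_{t+1|t}
    xs[t] = xf[t] + L * (xs[t + 1] - xp[t + 1])
    Ps[t] = Pf[t] + L * L * (Ps[t + 1] - Pp[t + 1])

print(max(Ps[t] - Pf[t] for t in range(T)))   # <= 0: smoothing only tightens
```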
Learning SSMs

Complete log likelihood:

  \ell_c(\theta; D) = \log \prod_n p(x_n, y_n)
    = \sum_n \log p(x_{n,1}; \Sigma_0) + \sum_n \sum_{t=2}^{T} \log p(x_{n,t} \mid x_{n,t-1}; A, G, Q) + \sum_n \sum_{t=1}^{T} \log p(y_{n,t} \mid x_{n,t}; C, R)

EM:

E-step: compute \langle X_t \rangle, \langle X_t X_t^T \rangle, and \langle X_t X_{t-1}^T \rangle under p(X \mid y_{1:T}). These quantities can be inferred via the KF and RTS passes, e.g.,

  \langle X_t \rangle = \hat{x}_{t|T},   \langle X_t X_t^T \rangle = Var[X_t \mid y_{1:T}] + \langle X_t \rangle \langle X_t \rangle^T = P_{t|T} + \hat{x}_{t|T} \hat{x}_{t|T}^T

M-step: MLE of A, G, Q, C, R, and \Sigma_0 using \langle \ell_c(\theta; D) \rangle; c.f. the M-step in factor analysis.
Nonlinear systems

In robotics and other problems, the motion model and the observation model are often nonlinear:

  x_t = f(x_{t-1}) + w_t,   y_t = g(x_t) + v_t

An optimal closed-form solution to the filtering problem is no longer possible.

The nonlinear functions f and g are sometimes represented by neural networks (multi-layer perceptrons or radial basis function networks). The parameters of f and g may be learned offline using EM, where we do gradient descent (back propagation) in the M-step, c.f. learning an MRF/CRF with hidden nodes. Or we may learn the parameters online by adding them to the state space: x_t' = (x_t, \theta). This makes the problem even more nonlinear.
Extended Kalman Filter (EKF)

The basic idea of the EKF is to linearize f and g using a first-order Taylor expansion, and then apply the standard KF; i.e., we approximate a stationary nonlinear system with a non-stationary linear system:

  x_t \approx f(\hat{x}_{t-1|t-1}) + A_t(x_{t-1} - \hat{x}_{t-1|t-1}) + w_t
  y_t \approx g(\hat{x}_{t|t-1}) + C_t(x_t - \hat{x}_{t|t-1}) + v_t

where

  \hat{x}_{t|t-1} = f(\hat{x}_{t-1|t-1}),
  A_t \stackrel{\text{def}}{=} \left. \frac{\partial f}{\partial x} \right|_{\hat{x}_{t-1|t-1}},
  C_t \stackrel{\text{def}}{=} \left. \frac{\partial g}{\partial x} \right|_{\hat{x}_{t|t-1}}

The noise covariances (Q and R) are not changed; i.e., the additional error due to linearization is not modeled.
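One sanity check on the EKF construction: when f and g happen to be linear, the Jacobians are constant and an EKF step coincides exactly with a standard KF step. The scalar system and its parameter values below are illustrative:

```python
# When f and g are linear, one EKF step equals one KF step exactly.
a, c, Q, R = 0.9, 1.0, 0.2, 0.5   # toy scalar system
f = lambda x: a * x               # linear dynamics
g = lambda x: c * x               # linear observation
df = lambda x: a                  # Jacobian of f (constant)
dg = lambda x: c                  # Jacobian of g (constant)

def ekf_step(xhat, P, y):
    A = df(xhat)
    x_pred = f(xhat)              # propagate the mean through f
    P_pred = A * P * A + Q
    C = dg(x_pred)                # linearize g at the predicted mean
    K = P_pred * C / (C * P_pred * C + R)
    return x_pred + K * (y - g(x_pred)), (1 - K * C) * P_pred

def kf_step(xhat, P, y):
    x_pred = a * xhat
    P_pred = a * P * a + Q
    K = P_pred * c / (c * P_pred * c + R)
    return x_pred + K * (y - c * x_pred), (1 - K * c) * P_pred

e = ekf_step(1.0, 1.0, 0.7)
k = kf_step(1.0, 1.0, 0.7)
print(e, k)   # identical mean and covariance
```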
Online vs offline inference
KF, RLS and LMS

The KF update of the mean is

  \hat{x}_{t+1|t+1} = A \hat{x}_{t|t} + K_{t+1}(y_{t+1} - C A \hat{x}_{t|t})

Consider the special case where the hidden state is a constant, x_t = \theta, but the "observation matrix" C is a time-varying vector, C_t = x_t^T. The observation model at each time slice,

  y_t = x_t^T \theta + v_t

is then a linear regression. We can estimate \theta recursively using the Kalman filter:

  \hat{\theta}_{t+1} = \hat{\theta}_t + P_{t+1} R^{-1} x_{t+1} (y_{t+1} - x_{t+1}^T \hat{\theta}_t)

This is called the recursive least squares (RLS) algorithm.

If we approximate the gain P_{t+1} R^{-1} by a scalar constant \eta, we get the least mean squares (LMS) algorithm:

  \hat{\theta}_{t+1} = \hat{\theta}_t + \eta\, x_{t+1} (y_{t+1} - x_{t+1}^T \hat{\theta}_t)

We can also adapt \eta_t online using stochastic approximation theory.
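The LMS special case fits in a few lines; the input sequence, step size, and true weight below are made-up illustrative values. With noiseless data generated from a true weight of 2, the estimate should converge close to 2:

```python
# LMS sketch: treat the regression weight theta as a constant hidden state and
# replace the Kalman gain by a fixed scalar step size eta.
eta, theta = 0.1, 0.0
xs = [0.5 + 0.01 * (t % 100) for t in range(300)]   # scalar inputs in [0.5, 1.49]
for x in xs:
    y = 2.0 * x                                     # noiseless observation, true theta = 2
    theta += eta * x * (y - x * theta)              # LMS update
print(theta)   # close to 2
```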