
Page 1:

School of Computer Science

Probabilistic Graphical Models

Factor Analysis and State Space Models

Eric Xing
Lecture 11, February 19, 2014

Reading: See class website

Page 2:

A road map to more complex dynamic models

[Diagram: a grid of graphical models, organized by whether the latent variable X and the observation Y are discrete or continuous.]

Single data point (X → Y):
Mixture model, e.g., mixture of multinomials (discrete X, discrete Y)
Mixture model, e.g., mixture of Gaussians (discrete X, continuous Y)
Factor analysis (continuous X, continuous Y)

Sequential data (X_1 → X_2 → … → X_N with emissions Y_t):
HMM (for discrete sequential data, e.g., text)
HMM (for continuous sequential data, e.g., speech signal)
State space model (continuous states, continuous observations)

Richer dynamic variants: factorial HMM, switching SSM.

Page 3:

Recall multivariate Gaussian

Multivariate Gaussian density:
$$p(\mathbf{x}\mid\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\Big\{-\tfrac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\Big\}$$

A joint Gaussian:
$$p\!\left(\begin{bmatrix}\mathbf{x}_1\\ \mathbf{x}_2\end{bmatrix}\Big|\,\mu,\Sigma\right)=\mathcal{N}\!\left(\begin{bmatrix}\mathbf{x}_1\\ \mathbf{x}_2\end{bmatrix}\,\Big|\,\begin{bmatrix}\mu_1\\ \mu_2\end{bmatrix},\begin{bmatrix}\Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\end{bmatrix}\right)$$

How do we write down p(x_1), p(x_1|x_2) or p(x_2|x_1) using the block elements of μ and Σ? Formulas to remember:

Marginal:
$$p(\mathbf{x}_2)=\mathcal{N}(\mathbf{x}_2\mid m_2^m,V_2^m),\qquad m_2^m=\mu_2,\quad V_2^m=\Sigma_{22}$$

Conditional:
$$p(\mathbf{x}_1\mid\mathbf{x}_2)=\mathcal{N}(\mathbf{x}_1\mid m_{1|2},V_{1|2}),\qquad
m_{1|2}=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2-\mu_2),\quad
V_{1|2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$
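The conditioning formulas are easy to check numerically. Below is a minimal NumPy sketch (not from the lecture; all names and the random test values are illustrative) that partitions a joint Gaussian's μ and Σ into blocks and computes the conditional mean and covariance of x1 given x2.

```python
import numpy as np

def condition_gaussian(mu, Sigma, k, x2):
    """Condition a joint Gaussian N(mu, Sigma) on x2 (the last dims),
    returning the mean and covariance of x1 (the first k dims)."""
    mu1, mu2 = mu[:k], mu[k:]
    S11, S12 = Sigma[:k, :k], Sigma[:k, k:]
    S21, S22 = Sigma[k:, :k], Sigma[k:, k:]
    S22_inv = np.linalg.inv(S22)
    m_1g2 = mu1 + S12 @ S22_inv @ (x2 - mu2)   # m_{1|2}
    V_1g2 = S11 - S12 @ S22_inv @ S21          # V_{1|2}
    return m_1g2, V_1g2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 5, 2                      # total dim, dim of x1
    A = rng.standard_normal((d, d))
    Sigma = A @ A.T + d * np.eye(d)  # random SPD covariance
    mu = rng.standard_normal(d)
    x2 = rng.standard_normal(d - k)
    m, V = condition_gaussian(mu, Sigma, k, x2)
    print("conditional mean:", m)
    print("conditional covariance:\n", V)
```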

Page 4:

Review: the matrix inverse lemma

Consider a block-partitioned matrix:
$$M=\begin{bmatrix}E&F\\ G&H\end{bmatrix}$$

First we diagonalize M:
$$\begin{bmatrix}I&-FH^{-1}\\ 0&I\end{bmatrix}\begin{bmatrix}E&F\\ G&H\end{bmatrix}\begin{bmatrix}I&0\\ -H^{-1}G&I\end{bmatrix}=\begin{bmatrix}E-FH^{-1}G&0\\ 0&H\end{bmatrix}$$

Schur complement: $M/H \stackrel{\text{def}}{=} E-FH^{-1}G$.

Then we invert, using the formula $(XYZ)^{-1}=Z^{-1}Y^{-1}X^{-1}$:
$$M^{-1}=\begin{bmatrix}I&0\\ -H^{-1}G&I\end{bmatrix}\begin{bmatrix}(M/H)^{-1}&0\\ 0&H^{-1}\end{bmatrix}\begin{bmatrix}I&-FH^{-1}\\ 0&I\end{bmatrix}
=\begin{bmatrix}(M/H)^{-1}&-(M/H)^{-1}FH^{-1}\\ -H^{-1}G(M/H)^{-1}&H^{-1}+H^{-1}G(M/H)^{-1}FH^{-1}\end{bmatrix}$$
$$\phantom{M^{-1}}=\begin{bmatrix}E^{-1}+E^{-1}F(M/E)^{-1}GE^{-1}&-E^{-1}F(M/E)^{-1}\\ -(M/E)^{-1}GE^{-1}&(M/E)^{-1}\end{bmatrix}$$

Matrix inverse lemma (equating the top-left blocks):
$$(E-FH^{-1}G)^{-1}=E^{-1}+E^{-1}F(H-GE^{-1}F)^{-1}GE^{-1}$$
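As a quick sanity check (not part of the original slides), the lemma can be verified numerically on random blocks; the snippet below assumes E and H are well conditioned so that all the required inverses exist.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 4, 3
E = rng.standard_normal((p, p)) + p * np.eye(p)   # boosted diagonal for conditioning
H = rng.standard_normal((q, q)) + q * np.eye(q)
F = rng.standard_normal((p, q))
G = rng.standard_normal((q, p))

Einv, Hinv = np.linalg.inv(E), np.linalg.inv(H)

# Left-hand side: (E - F H^{-1} G)^{-1}
lhs = np.linalg.inv(E - F @ Hinv @ G)
# Right-hand side: E^{-1} + E^{-1} F (H - G E^{-1} F)^{-1} G E^{-1}
rhs = Einv + Einv @ F @ np.linalg.inv(H - G @ Einv @ F) @ G @ Einv

print("max abs difference:", np.max(np.abs(lhs - rhs)))  # tiny, ~1e-12
```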

Page 5:

Review: some matrix algebra

Trace:
$$\operatorname{tr}[A]\stackrel{\text{def}}{=}\sum_i a_{ii}$$

Cyclical permutations:
$$\operatorname{tr}[ABC]=\operatorname{tr}[CAB]=\operatorname{tr}[BCA]$$

Derivatives of traces:
$$\frac{\partial}{\partial A}\operatorname{tr}[BA]=B^T,\qquad
\frac{\partial}{\partial A}\operatorname{tr}[\mathbf{x}^T A\mathbf{x}]=\frac{\partial}{\partial A}\operatorname{tr}[\mathbf{x}\mathbf{x}^T A]=\mathbf{x}\mathbf{x}^T$$

Determinants and derivatives:
$$\frac{\partial}{\partial A}\log|A|=A^{-T}\;(=A^{-1}\text{ for symmetric }A)$$

Page 6:

Factor analysis

An unsupervised linear regression model:
$$p(\mathbf{x})=\mathcal{N}(\mathbf{x};\,0,\,I),\qquad p(\mathbf{y}\mid\mathbf{x})=\mathcal{N}(\mathbf{y};\,\mu+\Lambda\mathbf{x},\,\Psi)$$
where Λ is called the factor loading matrix and Ψ is diagonal.

Geometric interpretation: to generate data, first generate a point within the (low-dimensional) manifold, then add noise. The coordinates of the point are the components of the latent variable.

[Graphical model: latent X → observed Y.]

Page 7:

Marginal data distribution

A marginal Gaussian (e.g., p(x)) times a conditional Gaussian (e.g., p(y|x)) is a joint Gaussian.

Any marginal (e.g., p(y)) of a joint Gaussian (e.g., p(x,y)) is also a Gaussian.

Since the marginal is Gaussian, we can determine it by just computing its mean and variance (assuming the noise is uncorrelated with the data). Write $Y=\mu+\Lambda X+W$, where $W\sim\mathcal{N}(0,\Psi)$. Then
$$E[Y]=E[\mu+\Lambda X+W]=\mu+\Lambda E[X]+E[W]=\mu$$
$$\operatorname{Var}[Y]=E\big[(Y-\mu)(Y-\mu)^T\big]=E\big[(\Lambda X+W)(\Lambda X+W)^T\big]=\Lambda E[XX^T]\Lambda^T+E[WW^T]=\Lambda\Lambda^T+\Psi$$

[Graphical model: latent X → observed Y.]

Page 8:

FA = constrained-covariance Gaussian

Marginal density for factor analysis (y is p-dimensional, x is k-dimensional):
$$p(\mathbf{y}\mid\theta)=\mathcal{N}(\mathbf{y};\,\mu,\,\Lambda\Lambda^T+\Psi)$$

So the effective covariance is the low-rank outer product of two long skinny matrices plus a diagonal matrix.

In other words, factor analysis is just a constrained Gaussian model. (If Ψ were not diagonal then we could model any Gaussian and it would be pointless.)

Page 9:

FA joint distribution

Model:
$$p(\mathbf{x})=\mathcal{N}(\mathbf{x};\,0,\,I),\qquad p(\mathbf{y}\mid\mathbf{x})=\mathcal{N}(\mathbf{y};\,\mu+\Lambda\mathbf{x},\,\Psi)$$

Covariance between x and y (assuming the noise W is uncorrelated with the data and the latent variables):
$$\operatorname{Cov}[X,Y]=E\big[(X-0)(Y-\mu)^T\big]=E\big[X(\Lambda X+W)^T\big]=E[XX^T]\Lambda^T+E[XW^T]=\Lambda^T$$

Hence the joint distribution of x and y:
$$p\!\left(\begin{bmatrix}\mathbf{x}\\ \mathbf{y}\end{bmatrix}\right)=\mathcal{N}\!\left(\begin{bmatrix}\mathbf{x}\\ \mathbf{y}\end{bmatrix}\,\Big|\,\begin{bmatrix}0\\ \mu\end{bmatrix},\begin{bmatrix}I&\Lambda^T\\ \Lambda&\Lambda\Lambda^T+\Psi\end{bmatrix}\right)$$

Page 10:

Inference in factor analysis

Apply the Gaussian conditioning formulas to the joint distribution we derived above, where
$$\mu_1=0,\quad \mu_2=\mu,\quad \Sigma_{11}=I,\quad \Sigma_{12}=\Lambda^T,\quad \Sigma_{21}=\Lambda,\quad \Sigma_{22}=\Lambda\Lambda^T+\Psi.$$

We can now derive the posterior of the latent variable x given an observation y, $p(\mathbf{x}\mid\mathbf{y})=\mathcal{N}(\mathbf{x}\mid m_{1|2},V_{1|2})$, where
$$V_{1|2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}=I-\Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}\Lambda$$
$$m_{1|2}=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(\mathbf{y}-\mu_2)=\Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}(\mathbf{y}-\mu)$$

Applying the matrix inversion lemma
$$(E-FH^{-1}G)^{-1}=E^{-1}+E^{-1}F(H-GE^{-1}F)^{-1}GE^{-1}$$
we get
$$V_{1|2}=(I+\Lambda^T\Psi^{-1}\Lambda)^{-1},\qquad m_{1|2}=V_{1|2}\Lambda^T\Psi^{-1}(\mathbf{y}-\mu).$$

Here we only need to invert a matrix of size |x|×|x| instead of |y|×|y|.
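A small NumPy sketch (illustrative; the dimensions and noise values are arbitrary) comparing the two equivalent forms of the posterior: the direct conditioning form, which inverts a |y|×|y| matrix, and the inversion-lemma form, which only inverts a |x|×|x| matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 8, 2
mu = rng.standard_normal(p)
Lam = rng.standard_normal((p, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=p))
y = rng.standard_normal(p)

# Direct form: invert the p x p matrix (Lam Lam^T + Psi).
Syy_inv = np.linalg.inv(Lam @ Lam.T + Psi)
V_direct = np.eye(k) - Lam.T @ Syy_inv @ Lam
m_direct = Lam.T @ Syy_inv @ (y - mu)

# Inversion-lemma form: only invert a k x k matrix.
Psi_inv = np.diag(1.0 / np.diag(Psi))
V_lemma = np.linalg.inv(np.eye(k) + Lam.T @ Psi_inv @ Lam)
m_lemma = V_lemma @ Lam.T @ Psi_inv @ (y - mu)

print(np.allclose(V_direct, V_lemma), np.allclose(m_direct, m_lemma))  # True True
```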

Page 11:

Geometric interpretation: inference is a linear projection

The posterior is:
$$p(\mathbf{x}\mid\mathbf{y})=\mathcal{N}(\mathbf{x};\,m_{1|2},\,V_{1|2}),\qquad
V_{1|2}=(I+\Lambda^T\Psi^{-1}\Lambda)^{-1},\quad
m_{1|2}=V_{1|2}\Lambda^T\Psi^{-1}(\mathbf{y}-\mu)$$

The posterior covariance does not depend on the observed data y! Computing the posterior mean is just a linear operation on y.

Page 12:

Learning FA

Now assume that we are given only {y_n}, the observations of the high-dimensional data.

We have derived how to estimate x_n from p(X|Y).

How can we learn the model parameters: the loading matrix Λ, the manifold center μ, and the variance Ψ?

Page 13:

EM for factor analysis

Incomplete-data log likelihood function (from the marginal density of y):
$$\ell(\theta;D)=-\frac{N}{2}\log|\Lambda\Lambda^T+\Psi|-\frac{1}{2}\sum_n(\mathbf{y}_n-\mu)^T(\Lambda\Lambda^T+\Psi)^{-1}(\mathbf{y}_n-\mu)
=-\frac{N}{2}\log|\Lambda\Lambda^T+\Psi|-\frac{N}{2}\operatorname{tr}\!\big[(\Lambda\Lambda^T+\Psi)^{-1}S\big],$$
where $S=\frac{1}{N}\sum_n(\mathbf{y}_n-\mu)(\mathbf{y}_n-\mu)^T$.

Estimating μ is trivial: $\hat{\mu}_{ML}=\frac{1}{N}\sum_n\mathbf{y}_n$. The parameters Λ and Ψ are coupled nonlinearly in the log-likelihood.

Complete-data log likelihood (taking the data to be centered, i.e., with μ subtracted):
$$\ell_c(\theta;D)=\sum_n\log p(\mathbf{x}_n,\mathbf{y}_n)=\sum_n\big[\log p(\mathbf{x}_n)+\log p(\mathbf{y}_n\mid\mathbf{x}_n)\big]$$
$$=-\frac{N}{2}\log|I|-\frac{1}{2}\sum_n\mathbf{x}_n^T\mathbf{x}_n-\frac{N}{2}\log|\Psi|-\frac{1}{2}\sum_n(\mathbf{y}_n-\Lambda\mathbf{x}_n)^T\Psi^{-1}(\mathbf{y}_n-\Lambda\mathbf{x}_n)$$
$$=-\frac{1}{2}\sum_n\mathbf{x}_n^T\mathbf{x}_n-\frac{N}{2}\log|\Psi|-\frac{N}{2}\operatorname{tr}\!\big[S\Psi^{-1}\big],\qquad
S=\frac{1}{N}\sum_n(\mathbf{y}_n-\Lambda\mathbf{x}_n)(\mathbf{y}_n-\Lambda\mathbf{x}_n)^T.$$

Page 14:

E-step for factor analysis

Compute the expected complete log likelihood $\langle\ell_c(\theta;D)\rangle_{p(\mathbf{x}\mid\mathbf{y})}$:
$$\langle\ell_c(\theta;D)\rangle=-\frac{1}{2}\sum_n\operatorname{tr}\langle\mathbf{x}_n\mathbf{x}_n^T\rangle-\frac{N}{2}\log|\Psi|-\frac{N}{2}\operatorname{tr}\!\big[\langle S\rangle\Psi^{-1}\big]$$
$$\langle S\rangle=\frac{1}{N}\sum_n\Big(\mathbf{y}_n\mathbf{y}_n^T-\mathbf{y}_n\langle X_n\rangle^T\Lambda^T-\Lambda\langle X_n\rangle\mathbf{y}_n^T+\Lambda\langle X_nX_n^T\rangle\Lambda^T\Big)$$
where
$$\langle X_n\rangle=E[X_n\mid\mathbf{y}_n],\qquad
\langle X_nX_n^T\rangle=\operatorname{Var}[X_n\mid\mathbf{y}_n]+E[X_n\mid\mathbf{y}_n]\,E[X_n\mid\mathbf{y}_n]^T.$$

Recall that we have derived:
$$V_{1|2}=(I+\Lambda^T\Psi^{-1}\Lambda)^{-1},\qquad m_{1|2}=V_{1|2}\Lambda^T\Psi^{-1}(\mathbf{y}-\mu)$$
so
$$\langle X_n\rangle=m_{x_n|y_n}=V_{1|2}\Lambda^T\Psi^{-1}(\mathbf{y}_n-\mu),\qquad
\langle X_nX_n^T\rangle=V_{1|2}+m_{x_n|y_n}m_{x_n|y_n}^T.$$

Page 15:

M-step for factor analysis

Take the derivatives of the expected complete log likelihood with respect to the parameters, using the trace and determinant derivative rules:
$$\langle\ell_c\rangle=-\frac{1}{2}\sum_n\operatorname{tr}\langle\mathbf{x}_n\mathbf{x}_n^T\rangle+\frac{N}{2}\log|\Psi^{-1}|-\frac{N}{2}\operatorname{tr}\!\big[\langle S\rangle\Psi^{-1}\big]$$

Setting $\partial\langle\ell_c\rangle/\partial\Psi^{-1}=\frac{N}{2}\Psi-\frac{N}{2}\langle S\rangle=0$ gives
$$\Psi^{t+1}=\langle S\rangle=\frac{1}{N}\sum_n\Big(\mathbf{y}_n\mathbf{y}_n^T-\mathbf{y}_n\langle X_n\rangle^T\Lambda^T-\Lambda\langle X_n\rangle\mathbf{y}_n^T+\Lambda\langle X_nX_n^T\rangle\Lambda^T\Big)$$
(keeping only its diagonal, since Ψ is diagonal).

Setting $\partial\langle\ell_c\rangle/\partial\Lambda=0$ gives
$$\Lambda^{t+1}=\Big(\sum_n\mathbf{y}_n\langle X_n\rangle^T\Big)\Big(\sum_n\langle X_nX_n^T\rangle\Big)^{-1}.$$
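Putting the E-step and M-step together, here is a compact NumPy sketch of EM for factor analysis (illustrative, not taken from the lecture; it assumes the data have already been centered so μ = 0, and keeps only the diagonal of ⟨S⟩ when updating Ψ).

```python
import numpy as np

def fa_em(Y, k, n_iter=100, seed=0):
    """EM for factor analysis on centered data Y (N x p).
    Returns the loading matrix Lam (p x k) and diagonal noise Psi (p,)."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    Lam = rng.standard_normal((p, k))
    Psi = np.ones(p)
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables.
        V = np.linalg.inv(np.eye(k) + Lam.T @ (Lam / Psi[:, None]))  # (I + Lam^T Psi^-1 Lam)^-1
        M = Y @ (Lam / Psi[:, None]) @ V       # N x k, row n is <x_n>^T
        Exx = N * V + M.T @ M                  # sum_n <x_n x_n^T>
        # M-step: update Lam and the diagonal Psi.
        YX = Y.T @ M                           # sum_n y_n <x_n>^T   (p x k)
        Lam = YX @ np.linalg.inv(Exx)
        S = (Y * Y).sum(axis=0) - (Lam * YX).sum(axis=1)  # diag of sum_n <(y_n - Lam x_n)(...)^T>
        Psi = S / N
    return Lam, Psi

# Usage sketch: Yc = Y - Y.mean(axis=0); Lam, Psi = fa_em(Yc, k=2)
```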

Page 16:

Model invariance and identifiability

There is a degeneracy in the FA model. Since Λ only appears as the outer product ΛΛ^T, the model is invariant to rotations and axis flips of the latent space.

We can replace Λ with ΛQ for any orthonormal matrix Q and the model remains the same: $(\Lambda Q)(\Lambda Q)^T=\Lambda(QQ^T)\Lambda^T=\Lambda\Lambda^T$.

This means that there is no single "best" setting of the parameters: an infinite number of parameter settings all give the same ML score!

Such models are called unidentifiable, since two people fitting ML parameters to identical data are not guaranteed to identify the same parameters.
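The rotational degeneracy is easy to demonstrate numerically. The short sketch below (illustrative) rotates Λ by a random orthonormal Q and shows that the implied marginal covariance is unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
p, k = 6, 3
Lam = rng.standard_normal((p, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=p))

Q, _ = np.linalg.qr(rng.standard_normal((k, k)))   # random orthonormal matrix
Lam_rot = Lam @ Q

cov1 = Lam @ Lam.T + Psi
cov2 = Lam_rot @ Lam_rot.T + Psi
print(np.allclose(cov1, cov2))   # True: same model, different parameters
```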

Page 17:

A road map to more complex dynamic models

[The same road map diagram as on Page 2: mixture models, factor analysis, HMMs for discrete and continuous sequential data, the state space model, and the richer factorial HMM and switching SSM variants.]

Page 18:

State space models (SSM)

A sequential FA, or a continuous-state HMM:
$$\mathbf{x}_t=A\mathbf{x}_{t-1}+G\mathbf{w}_t,\qquad \mathbf{w}_t\sim\mathcal{N}(0,Q)$$
$$\mathbf{y}_t=C\mathbf{x}_t+\mathbf{v}_t,\qquad \mathbf{v}_t\sim\mathcal{N}(0,R)$$
$$\mathbf{x}_0\sim\mathcal{N}(0,\Sigma_0)$$
This is a linear dynamical system.

In general,
$$\mathbf{x}_t=f(\mathbf{x}_{t-1})+G\mathbf{w}_t,\qquad \mathbf{y}_t=g(\mathbf{x}_t)+\mathbf{v}_t,$$
where f is an (arbitrary) dynamic model and g is an (arbitrary) observation model.

[Graphical model: a chain X_1 → X_2 → … → X_N with emissions X_t → Y_t.]

Page 19:

LDS for 2D tracking

Dynamics: new position = old position + velocity + noise (constant-velocity model with a unit time step, Gaussian noise). With state $\mathbf{x}_t=(x_t^1,\,x_t^2,\,\dot x_t^1,\,\dot x_t^2)^T$:
$$\begin{bmatrix}x_t^1\\ x_t^2\\ \dot x_t^1\\ \dot x_t^2\end{bmatrix}=
\begin{bmatrix}1&0&1&0\\ 0&1&0&1\\ 0&0&1&0\\ 0&0&0&1\end{bmatrix}
\begin{bmatrix}x_{t-1}^1\\ x_{t-1}^2\\ \dot x_{t-1}^1\\ \dot x_{t-1}^2\end{bmatrix}+\text{noise}$$

Observation: project out the first two components (we observe the Cartesian position of the object, so the observation model is linear!):
$$\begin{bmatrix}y_t^1\\ y_t^2\end{bmatrix}=
\begin{bmatrix}1&0&0&0\\ 0&1&0&0\end{bmatrix}
\begin{bmatrix}x_t^1\\ x_t^2\\ \dot x_t^1\\ \dot x_t^2\end{bmatrix}+\text{noise}$$
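A minimal NumPy sketch (illustrative; the process and observation noise scales are arbitrary assumptions, not values from the lecture) that simulates this constant-velocity model with a unit time step and records the noisy position observations.

```python
import numpy as np

A = np.array([[1., 0., 1., 0.],    # x1 <- x1 + velocity1
              [0., 1., 0., 1.],    # x2 <- x2 + velocity2
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])
C = np.array([[1., 0., 0., 0.],    # observe Cartesian position only
              [0., 1., 0., 0.]])
Q = 0.001 * np.eye(4)              # process noise (assumed scale)
R = 0.1 * np.eye(2)                # observation noise (assumed scale)

def simulate_lds(T, x0, rng):
    xs, ys = [], []
    x = x0
    for _ in range(T):
        x = A @ x + rng.multivariate_normal(np.zeros(4), Q)
        y = C @ x + rng.multivariate_normal(np.zeros(2), R)
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(5)
xs, ys = simulate_lds(50, np.array([0., 0., 1., 0.5]), rng)
print(ys[:3])   # noisy positions along a roughly straight track
```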

Page 20:

The inference problem 1

Filtering: given y_1, …, y_t, estimate x_t, i.e., compute $P(x_t\mid y_{1:t})$.

The Kalman filter is a way to perform exact online inference (sequential Bayesian updating) in an LDS. It is the Gaussian analog of the forward (alpha) algorithm for HMMs:
$$\alpha_t(i)=p(X_t=i\mid y_{1:t})\;\propto\;p(y_t\mid X_t=i)\sum_j p(X_t=i\mid X_{t-1}=j)\,p(X_{t-1}=j\mid y_{1:t-1})$$

[Graphical model: HMM chain X_1 → … → X_t with emissions Y_1, …, Y_t and forward messages α_0, α_1, …, α_t.]

Page 21:

The inference problem 2

Smoothing: given y_1, …, y_T, estimate x_t (t < T).

The Rauch-Tung-Striebel (RTS) smoother is a way to perform exact offline inference in an LDS. It is the Gaussian analog of the forwards-backwards (alpha-gamma) algorithm for HMMs:
$$\gamma_t(i)=p(X_t=i\mid y_{1:T})=\sum_j p(X_{t+1}=j\mid y_{1:T})\,P(X_t=i\mid X_{t+1}=j,\,y_{1:t})$$

[Graphical model: HMM chain with forward messages α_0, …, α_t and smoothed messages γ_0, …, γ_t.]

Page 22:

2D tracking

[Figure: filtered estimates for the 2D tracking example.]

Page 23:

Kalman filtering in the brain?

Page 24:

Kalman filtering derivation

Since all CPDs are linear Gaussian, the system defines one large multivariate Gaussian, so all marginals are Gaussian. Hence we can represent the belief state p(X_t|y_{1:t}) as a Gaussian with mean $\hat{x}_{t|t}$ and covariance $P_{t|t}$. It is common to work with the inverse covariance (precision) matrix $P_{t|t}^{-1}$; this is called the information form.

Kalman filtering is a recursive procedure to update the belief state:

Predict step: compute p(X_{t+1}|y_{1:t}) from the prior belief p(X_t|y_{1:t}) and the dynamical model p(X_{t+1}|X_t) --- the time update.

Update step: compute the new belief p(X_{t+1}|y_{1:t+1}) from the prediction p(X_{t+1}|y_{1:t}), the observation y_{t+1}, and the observation model p(y_{t+1}|X_{t+1}) --- the measurement update.

[Graphical model: the LDS chain before and after incorporating the new observation Y_{t+1}.]

Page 25:

Kalman filtering derivation (continued)

Kalman filtering is a recursive procedure to update the belief state:

Predict step: compute p(X_{t+1}|y_{1:t}) from the prior belief p(X_t|y_{1:t}) and the dynamical model p(X_{t+1}|X_t) --- the time update.

Update step: compute the new belief p(X_{t+1}|y_{1:t+1}) from the prediction p(X_{t+1}|y_{1:t}), the observation y_{t+1}, and the observation model p(y_{t+1}|X_{t+1}) --- the measurement update.

[Graphical model: the LDS chain before and after incorporating the new observation Y_{t+1}.]

Page 26:

Predict step

Dynamical model:
$$\mathbf{x}_{t+1}=A\mathbf{x}_t+G\mathbf{w}_t,\qquad \mathbf{w}_t\sim\mathcal{N}(0,Q)$$
One-step-ahead prediction of the state (derived on the next page).

Observation model:
$$\mathbf{y}_t=C\mathbf{x}_t+\mathbf{v}_t,\qquad \mathbf{v}_t\sim\mathcal{N}(0,R)$$
One-step-ahead prediction of the observation (derived on the next page).

Page 27:

Predict step

Dynamical model:
$$\mathbf{x}_{t+1}=A\mathbf{x}_t+G\mathbf{w}_t,\qquad \mathbf{w}_t\sim\mathcal{N}(0,Q)$$

One-step-ahead prediction of the state:
$$\hat{x}_{t+1|t}\;\stackrel{\text{def}}{=}\;E[X_{t+1}\mid y_1,\dots,y_t]=A\hat{x}_{t|t}$$
$$P_{t+1|t}\;\stackrel{\text{def}}{=}\;E\big[(X_{t+1}-\hat{x}_{t+1|t})(X_{t+1}-\hat{x}_{t+1|t})^T\mid y_1,\dots,y_t\big]
=E\big[(AX_t+Gw_t-A\hat{x}_{t|t})(AX_t+Gw_t-A\hat{x}_{t|t})^T\mid y_1,\dots,y_t\big]
=AP_{t|t}A^T+GQG^T$$

Observation model:
$$\mathbf{y}_t=C\mathbf{x}_t+\mathbf{v}_t,\qquad \mathbf{v}_t\sim\mathcal{N}(0,R)$$

One-step-ahead prediction of the observation:
$$\hat{y}_{t+1|t}\;\stackrel{\text{def}}{=}\;E[Y_{t+1}\mid y_1,\dots,y_t]=CE[X_{t+1}\mid y_1,\dots,y_t]=C\hat{x}_{t+1|t}$$
$$E\big[(Y_{t+1}-\hat{y}_{t+1|t})(Y_{t+1}-\hat{y}_{t+1|t})^T\mid y_1,\dots,y_t\big]=CP_{t+1|t}C^T+R$$
$$E\big[(X_{t+1}-\hat{x}_{t+1|t})(Y_{t+1}-\hat{y}_{t+1|t})^T\mid y_1,\dots,y_t\big]=P_{t+1|t}C^T$$

Page 28:

Update step

Summarizing the results from the previous slide, we have p(X_{t+1}, Y_{t+1} | y_{1:t}) ~ N(m_{t+1}, V_{t+1}), where
$$m_{t+1}=\begin{bmatrix}\hat{x}_{t+1|t}\\ C\hat{x}_{t+1|t}\end{bmatrix},\qquad
V_{t+1}=\begin{bmatrix}P_{t+1|t}&P_{t+1|t}C^T\\ CP_{t+1|t}&CP_{t+1|t}C^T+R\end{bmatrix}$$

Remember the formulas for conditional Gaussian distributions:
$$p\!\left(\begin{bmatrix}\mathbf{x}_1\\ \mathbf{x}_2\end{bmatrix}\Big|\,\mu,\Sigma\right)=\mathcal{N}\!\left(\begin{bmatrix}\mathbf{x}_1\\ \mathbf{x}_2\end{bmatrix}\,\Big|\,\begin{bmatrix}\mu_1\\ \mu_2\end{bmatrix},\begin{bmatrix}\Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\end{bmatrix}\right)$$
$$p(\mathbf{x}_2)=\mathcal{N}(\mathbf{x}_2\mid m_2^m,V_2^m),\qquad m_2^m=\mu_2,\quad V_2^m=\Sigma_{22}$$
$$p(\mathbf{x}_1\mid\mathbf{x}_2)=\mathcal{N}(\mathbf{x}_1\mid m_{1|2},V_{1|2}),\qquad
m_{1|2}=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2-\mu_2),\quad
V_{1|2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$

Page 29:

Kalman filter

Measurement updates:
$$\hat{x}_{t+1|t+1}=\hat{x}_{t+1|t}+K_{t+1}\big(y_{t+1}-C\hat{x}_{t+1|t}\big),\qquad
P_{t+1|t+1}=P_{t+1|t}-K_{t+1}CP_{t+1|t},$$
where
$$K_{t+1}=P_{t+1|t}C^T\big(CP_{t+1|t}C^T+R\big)^{-1}$$
is the Kalman gain matrix.

Time updates:
$$\hat{x}_{t+1|t}=A\hat{x}_{t|t},\qquad P_{t+1|t}=AP_{t|t}A^T+GQG^T$$

K_t can be pre-computed, since it is independent of the data.
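The updates above translate directly into code. Below is a minimal NumPy Kalman filter (illustrative; G is folded into Q, and the model matrices A, C, Q, R are assumed given), which could be run, for example, on the 2D tracking model from Page 19.

```python
import numpy as np

def kalman_filter(ys, A, C, Q, R, x0, P0):
    """Run a Kalman filter over observations ys (T x ny).
    Returns filtered means (T x nx) and covariances (T x nx x nx)."""
    x, P = x0, P0
    means, covs = [], []
    for y in ys:
        # Time update (predict).
        x_pred = A @ x
        P_pred = A @ P @ A.T + Q
        # Measurement update (correct).
        S = C @ P_pred @ C.T + R              # innovation covariance
        K = P_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
        x = x_pred + K @ (y - C @ x_pred)
        P = P_pred - K @ C @ P_pred
        means.append(x)
        covs.append(P)
    return np.array(means), np.array(covs)

# Usage sketch: means, covs = kalman_filter(ys, A, C, Q, R, x0=np.zeros(4), P0=np.eye(4))
```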

Page 30:

Example of the KF in 1D

Consider noisy observations of a 1D particle doing a random walk:
$$x_{t+1}=x_t+w_t,\quad w_t\sim\mathcal{N}(0,\sigma_x^2),\qquad
z_t=x_t+v_t,\quad v_t\sim\mathcal{N}(0,\sigma_z^2)$$

KF equations (here A = C = G = 1):
$$\hat{x}_{t+1|t}=\hat{x}_{t|t},\qquad \sigma_{t+1|t}^2=\sigma_{t|t}^2+\sigma_x^2$$
$$K_{t+1}=\frac{\sigma_{t|t}^2+\sigma_x^2}{\sigma_{t|t}^2+\sigma_x^2+\sigma_z^2},\qquad
\hat{x}_{t+1|t+1}=\hat{x}_{t+1|t}+K_{t+1}\big(z_{t+1}-\hat{x}_{t+1|t}\big)
=\frac{(\sigma_{t|t}^2+\sigma_x^2)\,z_{t+1}+\sigma_z^2\,\hat{x}_{t|t}}{\sigma_{t|t}^2+\sigma_x^2+\sigma_z^2}$$
$$\sigma_{t+1|t+1}^2=(1-K_{t+1})\,\sigma_{t+1|t}^2
=\frac{(\sigma_{t|t}^2+\sigma_x^2)\,\sigma_z^2}{\sigma_{t|t}^2+\sigma_x^2+\sigma_z^2}$$
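A quick numeric check of the scalar formulas (illustrative; all numbers are arbitrary): one predict/update step, comparing the gain-based update against the closed-form weighted average of prediction and observation.

```python
# Scalar KF step for the 1D random-walk example (arbitrary test values).
x_hat, var = 0.0, 1.0          # current belief: mean and variance sigma_{t|t}^2
sigma_x2, sigma_z2 = 0.5, 2.0  # process and observation noise variances
z_next = 1.3                   # next observation

var_pred = var + sigma_x2                          # predict
K = var_pred / (var_pred + sigma_z2)               # Kalman gain
x_new = x_hat + K * (z_next - x_hat)               # gain-based update
x_closed = (var_pred * z_next + sigma_z2 * x_hat) / (var_pred + sigma_z2)
var_new = (1 - K) * var_pred

print(x_new, x_closed)   # identical: 0.557..., 0.557...
print(var_new)           # (sigma^2 + sigma_x^2) * sigma_z^2 / (sigma^2 + sigma_x^2 + sigma_z^2)
```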

Page 31:

KF intuition

The KF update of the mean is
$$\hat{x}_{t+1|t+1}=\hat{x}_{t+1|t}+K_{t+1}\big(z_{t+1}-C\hat{x}_{t+1|t}\big)
=\frac{(\sigma_{t|t}^2+\sigma_x^2)\,z_{t+1}+\sigma_z^2\,\hat{x}_{t|t}}{\sigma_{t|t}^2+\sigma_x^2+\sigma_z^2}\quad\text{(in the 1D example)}.$$

The term $(z_{t+1}-C\hat{x}_{t+1|t})$ is called the innovation.

The new belief is a convex combination of the prediction from the prior and the observation, weighted by the Kalman gain matrix
$$K_{t+1}=P_{t+1|t}C^T\big(CP_{t+1|t}C^T+R\big)^{-1}.$$

If the observation is unreliable, σ_z (i.e., R) is large, so K_{t+1} is small and we pay more attention to the prediction.

If the old prior is unreliable (large σ_t) or the process is very unpredictable (large σ_x), we pay more attention to the observation.

Page 32:

Complexity of one KF step

Let $X_t\in\mathbb{R}^{N_x}$ and $Y_t\in\mathbb{R}^{N_y}$.

Computing the predicted covariance $P_{t+1|t}=AP_{t|t}A^T+GQG^T$ takes $O(N_x^2)$ time, assuming dense P and dense A.

Computing the Kalman gain $K_{t+1}=P_{t+1|t}C^T(CP_{t+1|t}C^T+R)^{-1}$ takes $O(N_y^3)$ time.

So the overall time is, in general, $O(\max\{N_x^2,\,N_y^3\})$.

Page 33:

The inference problem 2 (recap)

Smoothing: given y_1, …, y_T, estimate x_t (t < T).

The Rauch-Tung-Striebel (RTS) smoother performs exact offline inference in an LDS; it is the Gaussian analog of the forwards-backwards (alpha-gamma) algorithm for HMMs:
$$\gamma_t(i)=p(X_t=i\mid y_{1:T})=\sum_j p(X_{t+1}=j\mid y_{1:T})\,P(X_t=i\mid X_{t+1}=j,\,y_{1:t})$$

Page 34:

Rauch-Tung-Striebel smoother

General structure: the KF results + the difference between the "smoothed" and the predicted results of the next step:
$$\hat{x}_{t|T}=\hat{x}_{t|t}+L_t\big(\hat{x}_{t+1|T}-\hat{x}_{t+1|t}\big)$$
$$P_{t|T}=P_{t|t}+L_t\big(P_{t+1|T}-P_{t+1|t}\big)L_t^T$$
$$L_t=P_{t|t}A^TP_{t+1|t}^{-1}$$

Backward computation: pretend to know things at t+1; such conditioning makes things simple, and we can remove this condition at the end.

The difficulty:
$$\hat{x}_{t|T}\;\stackrel{\text{def}}{=}\;E[X_t\mid y_1,\dots,y_T]
=E\big[E[X_t\mid X_{t+1},y_1,\dots,y_T]\mid y_1,\dots,y_T\big]
=E\big[E[X_t\mid X_{t+1},y_1,\dots,y_t]\mid y_1,\dots,y_T\big]$$

The trick (law of total expectation and variance):
$$E[X\mid Z]=E\big[E[X\mid Y,Z]\mid Z\big],\qquad
\operatorname{Var}[X\mid Z]=\operatorname{Var}\big[E[X\mid Y,Z]\mid Z\big]+E\big[\operatorname{Var}[X\mid Y,Z]\mid Z\big]\quad\text{(Hw!)}$$

The same applies for P_{t|T}.

Page 35:

RTS derivation

Following the results from the previous slide, we need to derive p(X_t, X_{t+1} | y_{1:t}) ~ N(m, V), where
$$m=\begin{bmatrix}\hat{x}_{t|t}\\ \hat{x}_{t+1|t}\end{bmatrix},\qquad
V=\begin{bmatrix}P_{t|t}&P_{t|t}A^T\\ AP_{t|t}&P_{t+1|t}\end{bmatrix}.$$
All of the quantities here are available after a forward KF pass.

Remember the formulas for conditional Gaussian distributions:
$$p(\mathbf{x}_2)=\mathcal{N}(\mathbf{x}_2\mid m_2^m,V_2^m),\qquad m_2^m=\mu_2,\quad V_2^m=\Sigma_{22}$$
$$p(\mathbf{x}_1\mid\mathbf{x}_2)=\mathcal{N}(\mathbf{x}_1\mid m_{1|2},V_{1|2}),\qquad
m_{1|2}=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2-\mu_2),\quad
V_{1|2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$

The RTS smoother:
$$\hat{x}_{t|T}\;\stackrel{\text{def}}{=}\;E\big[E[X_t\mid X_{t+1},y_{1:t}]\mid y_{1:T}\big]
=\hat{x}_{t|t}+L_t\big(\hat{x}_{t+1|T}-\hat{x}_{t+1|t}\big)$$
$$P_{t|T}=\operatorname{Var}\big[E[X_t\mid X_{t+1},y_{1:t}]\mid y_{1:T}\big]+E\big[\operatorname{Var}[X_t\mid X_{t+1},y_{1:t}]\mid y_{1:T}\big]
=P_{t|t}+L_t\big(P_{t+1|T}-P_{t+1|t}\big)L_t^T,$$
with $L_t=P_{t|t}A^TP_{t+1|t}^{-1}$.

Page 36:

Learning SSMs

Complete log likelihood:
$$\ell_c(\theta;D)=\sum_n\log p(\mathbf{x}_n,\mathbf{y}_n)
=\sum_n\Big[\log p(x_{n,1})+\sum_t\log p(x_{n,t}\mid x_{n,t-1})+\sum_t\log p(y_{n,t}\mid x_{n,t})\Big]$$
$$=\sum_n\Big[f_1(x_{n,1};\Sigma_0)+\sum_t f_2(x_{n,t},x_{n,t-1};A,G,Q)+\sum_t f_3(y_{n,t},x_{n,t};C,R)\Big]$$

EM:

E-step: compute $\langle\ell_c(\theta;D)\rangle$, which requires the expected sufficient statistics $\langle X_t\rangle$, $\langle X_tX_t^T\rangle$ and $\langle X_tX_{t-1}^T\rangle$ given $y_{1:T}$. These quantities can be inferred via the KF and the RTS smoother, e.g.,
$$\langle X_t\rangle=\hat{x}_{t|T},\qquad \operatorname{Var}(X_t)=P_{t|T},\qquad
\langle X_tX_t^T\rangle=P_{t|T}+\hat{x}_{t|T}\hat{x}_{t|T}^T.$$

M-step: MLE of the parameters (A, G, Q; C, R) using $\langle\ell_c(\theta;D)\rangle$; c.f. the M-step in factor analysis.

Page 37:

Nonlinear systems

In robotics and other problems, the motion model and the observation model are often nonlinear:
$$\mathbf{x}_{t+1}=f(\mathbf{x}_t)+\mathbf{w}_t,\qquad \mathbf{y}_t=g(\mathbf{x}_t)+\mathbf{v}_t$$

An optimal closed-form solution to the filtering problem is no longer possible.

The nonlinear functions f and g are sometimes represented by neural networks (multi-layer perceptrons or radial basis function networks).

The parameters of f and g may be learned offline using EM, where we do gradient descent (back-propagation) in the M-step, c.f. learning an MRF/CRF with hidden nodes.

Or we may learn the parameters online by adding them to the state space: $\mathbf{x}_t'=(\mathbf{x}_t,\theta)$. This makes the problem even more nonlinear.

Page 38:

Extended Kalman filter (EKF)

The basic idea of the EKF is to linearize f and g using a first-order Taylor expansion and then apply the standard KF; i.e., we approximate a stationary nonlinear system with a non-stationary linear system:
$$\mathbf{x}_{t+1}\approx f(\hat{x}_{t|t})+A_{\hat{x}_{t|t}}\,(\mathbf{x}_t-\hat{x}_{t|t})+\mathbf{w}_t$$
$$\mathbf{y}_{t+1}\approx g(\hat{x}_{t+1|t})+C_{\hat{x}_{t+1|t}}\,(\mathbf{x}_{t+1}-\hat{x}_{t+1|t})+\mathbf{v}_{t+1}$$
where
$$A_{\hat{x}}\;\stackrel{\text{def}}{=}\;\frac{\partial f}{\partial\mathbf{x}}\Big|_{\mathbf{x}=\hat{x}},\qquad
C_{\hat{x}}\;\stackrel{\text{def}}{=}\;\frac{\partial g}{\partial\mathbf{x}}\Big|_{\mathbf{x}=\hat{x}},\qquad
\hat{x}_{t+1|t}=f(\hat{x}_{t|t}).$$

The noise covariance (Q and R) is not changed; i.e., the additional error due to linearization is not modeled.
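A minimal EKF sketch (illustrative; f and g are arbitrary user-supplied functions, and the Jacobians are taken numerically by finite differences rather than analytically as on the slide).

```python
import numpy as np

def num_jacobian(fn, x, eps=1e-6):
    """Finite-difference Jacobian of fn at x."""
    fx = fn(x)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (fn(x + dx) - fx) / eps
    return J

def ekf_step(x, P, y, f, g, Q, R):
    """One predict + update step of the extended Kalman filter."""
    # Predict: propagate the mean through f, linearize for the covariance.
    A = num_jacobian(f, x)
    x_pred = f(x)
    P_pred = A @ P @ A.T + Q
    # Update: linearize g around the predicted state.
    C = num_jacobian(g, x_pred)
    S = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - g(x_pred))
    P_new = P_pred - K @ C @ P_pred
    return x_new, P_new
```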

Page 39:

Online vs offline inference

Page 40:

KF, RLS and LMS

The KF update of the mean is

$$\hat{x}_{t+1|t+1}=A\hat{x}_{t|t}+K_{t+1}\big(y_{t+1}-C\hat{x}_{t+1|t}\big).$$

Consider the special case where the hidden state is a constant, $x_t=\theta$, but the "observation matrix" C is a time-varying vector, $C_t=\mathbf{x}_t^T$. Hence the observation model at each time slice,
$$y_t=\mathbf{x}_t^T\theta+v_t,$$
is a linear regression.

We can estimate θ recursively using the Kalman filter:
$$\hat{\theta}_{t+1}=\hat{\theta}_t+K_{t+1}\big(y_{t+1}-\mathbf{x}_{t+1}^T\hat{\theta}_t\big),\qquad
K_{t+1}=P_{t+1}\,\mathbf{x}_{t+1}R^{-1}.$$
This is called the recursive least squares (RLS) algorithm.

We can approximate $P_{t+1}R^{-1}$ by a scalar constant $\eta_t$, giving the update $\hat{\theta}_{t+1}=\hat{\theta}_t+\eta_t\,\mathbf{x}_{t+1}\big(y_{t+1}-\mathbf{x}_{t+1}^T\hat{\theta}_t\big)$. This is called the least mean squares (LMS) algorithm.

We can adapt $\eta_t$ online using stochastic approximation theory.
