School of Computer Science
Probabilistic Graphical Models
Factor Analysis and State Space Models
Eric Xing, Lecture 11, February 19, 2014
Reading: See class website
© Eric Xing @ CMU, 2005-2014
A road map to more complex dynamic models
[Figure: a grid of graphical models organized by discrete vs. continuous latent state X and observation Y. Non-sequential models: mixture model (e.g., mixture of multinomials), mixture model (e.g., mixture of Gaussians), factor analysis. Sequential models: HMM (for discrete sequential data, e.g., text), HMM (for continuous sequential data, e.g., speech signal), state space model, factorial HMM, switching SSM.]
Recall multivariate Gaussian

Multivariate Gaussian density:

  p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)

A joint Gaussian:

  p(x_1, x_2 \mid \mu, \Sigma) = \mathcal{N}\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\Big|\, \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)

How to write down p(x_1), p(x_1|x_2) or p(x_2|x_1) using the block elements in \mu and \Sigma? Formulas to remember:

Marginal:  p(x_2) = \mathcal{N}(x_2 \mid m_2, V_2),  with  m_2 = \mu_2,  V_2 = \Sigma_{22}

Conditional:  p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2}),  with

  m_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(x_2 - \mu_2)
  V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
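As a quick numeric sanity check of the conditioning formulas, here is the scalar-block (bivariate) case; the parameter values are made up for illustration:

```python
# Numeric check of the Gaussian conditioning formulas in the bivariate case:
#   m_{1|2} = mu1 + S12/S22 * (x2 - mu2),   V_{1|2} = S11 - S12^2/S22
mu1, mu2 = 1.0, -2.0
S11, S12, S22 = 4.0, 1.2, 2.0   # joint covariance blocks (illustrative values)

x2 = 0.5                                  # observed value of the second coordinate
m_cond = mu1 + (S12 / S22) * (x2 - mu2)   # conditional mean of x1 given x2
V_cond = S11 - S12 ** 2 / S22             # conditional variance of x1 given x2

print(m_cond, V_cond)   # 2.5, 3.28
```

Note that V_cond < S11: conditioning on x2 can only shrink the uncertainty about x1.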
Review: The matrix inverse lemma

Consider a block-partitioned matrix:

  M = \begin{bmatrix} E & F \\ G & H \end{bmatrix}

First we diagonalize M:

  \begin{bmatrix} I & -F H^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} E & F \\ G & H \end{bmatrix} \begin{bmatrix} I & 0 \\ -H^{-1} G & I \end{bmatrix} = \begin{bmatrix} E - F H^{-1} G & 0 \\ 0 & H \end{bmatrix}

Schur complement:  M/H = E - F H^{-1} G

Then we invert, using the formula (XYZ)^{-1} = Z^{-1} Y^{-1} X^{-1}:

  M^{-1} = \begin{bmatrix} I & 0 \\ -H^{-1} G & I \end{bmatrix} \begin{bmatrix} (M/H)^{-1} & 0 \\ 0 & H^{-1} \end{bmatrix} \begin{bmatrix} I & -F H^{-1} \\ 0 & I \end{bmatrix}
         = \begin{bmatrix} (M/H)^{-1} & -(M/H)^{-1} F H^{-1} \\ -H^{-1} G (M/H)^{-1} & H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1} \end{bmatrix}

Matrix inverse lemma:

  (E - F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H - G E^{-1} F)^{-1} G E^{-1}
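The lemma can be sanity-checked numerically even with 1x1 "blocks", where every matrix inverse is an ordinary reciprocal; the numbers below are arbitrary:

```python
# Scalar sanity check of the matrix inversion lemma
#   (E - F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H - G E^{-1} F)^{-1} G E^{-1}
E, F, G, H = 4.0, 1.0, 2.0, 3.0   # illustrative 1x1 "blocks"

lhs = 1.0 / (E - F * (1.0 / H) * G)
rhs = 1.0 / E + (1.0 / E) * F * (1.0 / (H - G * (1.0 / E) * F)) * G * (1.0 / E)
print(lhs, rhs)   # both 0.3
```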
Review: Some matrix algebra

Trace:

  \mathrm{tr}[A] \stackrel{\text{def}}{=} \sum_i a_{ii}

Cyclical permutations:

  \mathrm{tr}[ABC] = \mathrm{tr}[CAB] = \mathrm{tr}[BCA]

Derivatives:

  \frac{\partial}{\partial A} \mathrm{tr}[BA] = B^T
  \frac{\partial}{\partial A} \mathrm{tr}[x^T A x] = \frac{\partial}{\partial A} \mathrm{tr}[x x^T A] = x x^T

Determinants and derivatives:

  \frac{\partial}{\partial A} \log|A| = A^{-T}
Factor analysis

An unsupervised linear regression model:

  p(x) = \mathcal{N}(x; 0, I)
  p(y \mid x) = \mathcal{N}(y; \mu + \Lambda x, \Psi)

where \Lambda is called the factor loading matrix, and \Psi is diagonal.

Geometric interpretation: to generate data, first generate a point within the manifold, then add noise. The coordinates of the point are the components of the latent variable.
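The two-stage generative process above can be sketched directly; the dimensions (p = 3, k = 2) and all parameter values here are illustrative, not from the lecture:

```python
import random

# Sampling from the FA generative model: p(x) = N(0, I), p(y|x) = N(mu + Lambda x, Psi).
random.seed(0)
mu = [1.0, -1.0, 0.5]                       # manifold center (p = 3)
Lam = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]  # factor loading matrix (p x k, k = 2)
psi = [0.5, 0.5, 0.5]                       # diagonal of Psi (observation noise)

def sample_y():
    x = [random.gauss(0.0, 1.0) for _ in range(2)]   # latent point on the manifold
    return [mu[i] + sum(Lam[i][j] * x[j] for j in range(2))
            + random.gauss(0.0, psi[i] ** 0.5) for i in range(3)]

ys = [sample_y() for _ in range(20000)]
mean = [sum(y[i] for y in ys) / len(ys) for i in range(3)]
print(mean)   # close to mu, since E[y] = mu
```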
Marginal data distribution

A marginal Gaussian (e.g., p(x)) times a conditional Gaussian (e.g., p(y|x)) is a joint Gaussian. Any marginal (e.g., p(y)) of a joint Gaussian (e.g., p(x,y)) is also a Gaussian. Since the marginal is Gaussian, we can determine it by just computing its mean and variance. (Assume the noise is uncorrelated with the data.)

Writing y = \mu + \Lambda x + w, where w \sim \mathcal{N}(0, \Psi):

  E[y] = E[\mu + \Lambda x + w] = \mu + \Lambda E[x] + E[w] = \mu

  Var[y] = E[(y - \mu)(y - \mu)^T]
         = E[(\Lambda x + w)(\Lambda x + w)^T]
         = \Lambda E[x x^T] \Lambda^T + E[w w^T]
         = \Lambda \Lambda^T + \Psi
FA = Constrained-Covariance Gaussian

Marginal density for factor analysis (y is p-dim, x is k-dim):

  p(y) = \mathcal{N}(y; \mu, \Lambda \Lambda^T + \Psi)

So the effective covariance is the low-rank outer product of two long skinny matrices plus a diagonal matrix. In other words, factor analysis is just a constrained Gaussian model. (If \Psi were not diagonal then we could model any Gaussian and it would be pointless.)
FA joint distribution

Model:

  p(x) = \mathcal{N}(x; 0, I)
  p(y \mid x) = \mathcal{N}(y; \mu + \Lambda x, \Psi)

Covariance between x and y (assume the noise w is uncorrelated with the data or the latent variables):

  Cov[x, y] = E[(x - 0)(y - \mu)^T] = E[x(\Lambda x + w)^T] = E[x x^T] \Lambda^T = \Lambda^T

Hence the joint distribution of x and y:

  p\left( \begin{bmatrix} x \\ y \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} x \\ y \end{bmatrix} ; \begin{bmatrix} 0 \\ \mu \end{bmatrix}, \begin{bmatrix} I & \Lambda^T \\ \Lambda & \Lambda \Lambda^T + \Psi \end{bmatrix} \right)
Inference in Factor Analysis

Apply the Gaussian conditioning formulas to the joint distribution we derived above:

  m_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(x_2 - \mu_2),   V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}

We can now derive the posterior of the latent variable x given the observation y, p(x \mid y) = \mathcal{N}(x \mid m_{x|y}, V_{x|y}), where

  V_{x|y} = I - \Lambda^T (\Lambda \Lambda^T + \Psi)^{-1} \Lambda
  m_{x|y} = \Lambda^T (\Lambda \Lambda^T + \Psi)^{-1} (y - \mu)

Applying the matrix inversion lemma

  (E - F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H - G E^{-1} F)^{-1} G E^{-1}

gives

  V_{x|y} = (I + \Lambda^T \Psi^{-1} \Lambda)^{-1}
  m_{x|y} = V_{x|y} \Lambda^T \Psi^{-1} (y - \mu)

Here we only need to invert a matrix of size |x| \times |x|, instead of |y| \times |y|.
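The equivalence of the two forms of the posterior mean can be checked numerically. With k = 1 latent dimension and p = 2 observed dimensions (toy numbers below), every |x|-by-|x| inverse is a scalar reciprocal, while the direct form needs a 2x2 inverse:

```python
# Posterior inference in FA, two equivalent forms of the posterior mean:
#   lemma form:   V = (I + L^T Psi^{-1} L)^{-1},  m = V L^T Psi^{-1} (y - mu)
#   direct form:  m = L^T (L L^T + Psi)^{-1} (y - mu)
L = [2.0, 1.0]        # loading vector Lambda (p x 1), illustrative
psi = [0.5, 0.5]      # diagonal of Psi
r = [1.0, 2.0]        # residual y - mu

# Lemma form: only a 1x1 inverse is needed.
V = 1.0 / (1.0 + sum(L[i] ** 2 / psi[i] for i in range(2)))
m_lemma = V * sum(L[i] * r[i] / psi[i] for i in range(2))

# Direct form: invert the 2x2 marginal covariance L L^T + Psi.
S = [[L[0] * L[0] + psi[0], L[0] * L[1]],
     [L[1] * L[0], L[1] * L[1] + psi[1]]]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv_r = [(S[1][1] * r[0] - S[0][1] * r[1]) / det,
          (-S[1][0] * r[0] + S[0][0] * r[1]) / det]
m_direct = sum(L[i] * Sinv_r[i] for i in range(2))

print(m_lemma, m_direct)   # both 8/11 ≈ 0.7273
```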
Geometric interpretation: inference is linear projection

The posterior is:

  p(x \mid y) = \mathcal{N}(x \mid m_{x|y}, V_{x|y}),   V_{x|y} = (I + \Lambda^T \Psi^{-1} \Lambda)^{-1},   m_{x|y} = V_{x|y} \Lambda^T \Psi^{-1} (y - \mu)

The posterior covariance does not depend on the observed data y! Computing the posterior mean is just a linear operation.
Learning FA

Now assume that we are given only {y_n} (the observations of the high-dimensional data).

We have derived how to estimate x_n from p(X|Y). How can we learn the model: the loading matrix \Lambda, the manifold center \mu, and the variance \Psi?
EM for Factor Analysis

Incomplete data log likelihood function (marginal density of y):

  \ell(\theta; D) = -\frac{N}{2} \log|\Lambda \Lambda^T + \Psi| - \frac{1}{2} \sum_n (y_n - \mu)^T (\Lambda \Lambda^T + \Psi)^{-1} (y_n - \mu)
                  = -\frac{N}{2} \log|\Lambda \Lambda^T + \Psi| - \frac{N}{2} \mathrm{tr}\left[(\Lambda \Lambda^T + \Psi)^{-1} S\right],

  where  S = \frac{1}{N} \sum_n (y_n - \mu)(y_n - \mu)^T

Estimating \mu is trivial:  \hat{\mu}_{ML} = \frac{1}{N} \sum_n y_n. The parameters \Lambda and \Psi are coupled nonlinearly in the log-likelihood.

Complete log likelihood:

  \ell_c(\theta; D) = \sum_n \log p(x_n, y_n) = \sum_n \log p(x_n) + \sum_n \log p(y_n \mid x_n)
    = -\frac{N}{2} \log|I| - \frac{1}{2} \sum_n x_n^T x_n - \frac{N}{2} \log|\Psi| - \frac{1}{2} \sum_n (y_n - \Lambda x_n)^T \Psi^{-1} (y_n - \Lambda x_n)
    = -\frac{1}{2} \sum_n x_n^T x_n - \frac{N}{2} \log|\Psi| - \frac{N}{2} \mathrm{tr}[S \Psi^{-1}],

  where  S = \frac{1}{N} \sum_n (y_n - \Lambda x_n)(y_n - \Lambda x_n)^T
E-step for Factor Analysis

Compute \langle \ell_c(\theta; D) \rangle_{p(X|Y)}:

  \langle \ell_c(\theta; D) \rangle = -\frac{1}{2} \sum_n \langle x_n^T x_n \rangle - \frac{N}{2} \log|\Psi| - \frac{N}{2} \mathrm{tr}[\langle S \rangle \Psi^{-1}]

  \langle S \rangle = \frac{1}{N} \sum_n \left( y_n y_n^T - \Lambda \langle x_n \rangle y_n^T - y_n \langle x_n \rangle^T \Lambda^T + \Lambda \langle x_n x_n^T \rangle \Lambda^T \right)

We need \langle x_n \rangle = E[X_n \mid y_n] and \langle x_n x_n^T \rangle = Var[X_n \mid y_n] + E[X_n \mid y_n] E[X_n \mid y_n]^T. Recall that we have derived:

  V_{x|y} = (I + \Lambda^T \Psi^{-1} \Lambda)^{-1},   m_{x|y}^n = V_{x|y} \Lambda^T \Psi^{-1} (y_n - \mu)

so

  \langle x_n \rangle = m_{x|y}^n   and   \langle x_n x_n^T \rangle = V_{x|y} + m_{x|y}^n (m_{x|y}^n)^T
M-step for Factor Analysis

Take the derivatives of the expected complete log likelihood w.r.t. the parameters, using the trace and determinant derivative rules:

  \langle \ell_c \rangle = -\frac{1}{2} \sum_n \langle x_n^T x_n \rangle - \frac{N}{2} \log|\Psi| - \frac{N}{2} \mathrm{tr}[\langle S \rangle \Psi^{-1}]

Setting \partial \langle \ell_c \rangle / \partial \Lambda = 0:

  \Lambda^{t+1} = \left( \sum_n y_n \langle x_n \rangle^T \right) \left( \sum_n \langle x_n x_n^T \rangle \right)^{-1}

Setting \partial \langle \ell_c \rangle / \partial \Psi^{-1} = 0:

  \frac{\partial \langle \ell_c \rangle}{\partial \Psi^{-1}} = \frac{N}{2} \Psi - \frac{N}{2} \langle S \rangle = 0
  \;\Rightarrow\; \Psi^{t+1} = \mathrm{diag}\left[ \frac{1}{N} \sum_n \left( y_n y_n^T - \Lambda^{t+1} \langle x_n \rangle y_n^T \right) \right]
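The E- and M-steps above can be put together in a minimal EM loop. The sketch below uses a single latent factor (k = 1) and p = 2 observed dimensions so that all posterior quantities are scalars; the data, parameter values, and number of iterations are all illustrative. The key EM property, a non-decreasing marginal log-likelihood, is easy to check:

```python
import math, random

# Minimal EM for single-factor FA (k = 1, p = 2); synthetic data and toy values.
random.seed(1)
true_lam, true_psi, true_mu = [2.0, 1.0], [0.3, 0.4], [0.5, -0.5]
Y = []
for _ in range(200):
    x = random.gauss(0, 1)
    Y.append([true_mu[i] + true_lam[i] * x + random.gauss(0, true_psi[i] ** 0.5)
              for i in range(2)])

N = len(Y)
mu = [sum(y[i] for y in Y) / N for i in range(2)]   # ML estimate of mu
Rc = [[y[0] - mu[0], y[1] - mu[1]] for y in Y]      # centered data

def loglik(lam, psi):
    # incomplete-data log-likelihood: N(y; mu, lam lam^T + Psi) on centered data
    S = [[lam[0] ** 2 + psi[0], lam[0] * lam[1]],
         [lam[1] * lam[0], lam[1] ** 2 + psi[1]]]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    ll = 0.0
    for r in Rc:
        q = (S[1][1] * r[0] ** 2 - 2 * S[0][1] * r[0] * r[1] + S[0][0] * r[1] ** 2) / det
        ll += -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + q)
    return ll

lam, psi = [1.0, 1.0], [1.0, 1.0]   # arbitrary initialization
lls = []
for _ in range(30):
    lls.append(loglik(lam, psi))
    # E-step: posterior moments of the scalar latent x_n
    V = 1.0 / (1.0 + sum(lam[i] ** 2 / psi[i] for i in range(2)))
    m = [V * sum(lam[i] * r[i] / psi[i] for i in range(2)) for r in Rc]
    Exx = [V + mn ** 2 for mn in m]
    # M-step: closed-form updates for lam and the diagonal Psi
    lam = [sum(Rc[n][i] * m[n] for n in range(N)) / sum(Exx) for i in range(2)]
    psi = [sum(Rc[n][i] ** 2 - lam[i] * m[n] * Rc[n][i] for n in range(N)) / N
           for i in range(2)]

print(lls[0], lls[-1])   # log-likelihood improves monotonically
```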
Model Invariance and Identifiability

There is degeneracy in the FA model. Since \Lambda only appears as the outer product \Lambda \Lambda^T, the model is invariant to rotations and axis flips of the latent space. We can replace \Lambda with \Lambda Q for any orthonormal matrix Q and the model remains the same: (\Lambda Q)(\Lambda Q)^T = \Lambda (Q Q^T) \Lambda^T = \Lambda \Lambda^T.

This means that there is no "one best" setting of the parameters; an infinite number of parameter settings all achieve the ML score. Such models are called unidentifiable, since two people both fitting ML parameters to identical data are not guaranteed to identify the same parameters.
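The rotation invariance is easy to verify numerically; the 3x2 loading matrix and rotation angle below are arbitrary:

```python
import math

# Rotation invariance of FA: replacing Lambda by Lambda Q for an orthonormal Q
# leaves Lambda Lambda^T (and hence the marginal p(y)) unchanged.
Lam = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]   # illustrative p = 3, k = 2 loadings
th = 0.7
Q = [[math.cos(th), -math.sin(th)], [math.sin(th), math.cos(th)]]   # 2x2 rotation

LamQ = [[sum(Lam[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
        for i in range(3)]

def outer(A):   # A A^T for a 3x2 matrix A
    return [[sum(A[i][k] * A[j][k] for k in range(2)) for j in range(3)]
            for i in range(3)]

O1, O2 = outer(Lam), outer(LamQ)
diff = max(abs(O1[i][j] - O2[i][j]) for i in range(3) for j in range(3))
print(diff)   # numerically zero
```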
A road map to more complex dynamic models
[Figure: the model taxonomy from the earlier road map slide, repeated: mixture models, factor analysis, HMMs, state space model, factorial HMM, switching SSM.]
State space models (SSM)

A sequential FA, or a continuous-state HMM:

  x_1 \sim \mathcal{N}(0, \Sigma_0)
  x_t = A x_{t-1} + G w_t,   w_t \sim \mathcal{N}(0, Q)
  y_t = C x_t + v_t,   v_t \sim \mathcal{N}(0, R)

This is a linear dynamic system. In general,

  x_t = f(x_{t-1}) + G w_t
  y_t = g(x_t) + v_t

where f is an (arbitrary) dynamic model, and g is an (arbitrary) observation model.
LDS for 2D tracking

Dynamics: new position = old position + velocity + noise (constant velocity model, Gaussian noise):

  \begin{bmatrix} x_t^1 \\ x_t^2 \\ \dot{x}_t^1 \\ \dot{x}_t^2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{t-1}^1 \\ x_{t-1}^2 \\ \dot{x}_{t-1}^1 \\ \dot{x}_{t-1}^2 \end{bmatrix} + \text{noise}

Observation: project out the first two components (we observe the Cartesian position of the object, which is linear!):

  \begin{bmatrix} y_t^1 \\ y_t^2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_t^1 \\ x_t^2 \\ \dot{x}_t^1 \\ \dot{x}_t^2 \end{bmatrix} + \text{noise}
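To see what the constant-velocity matrices do, here is the model run without noise so the trajectory is exact; the initial state is an arbitrary choice:

```python
# Constant-velocity dynamics from the slide: state (x1, x2, dx1, dx2),
# observations are the first two components. No noise, so results are exact.
A = [[1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
C = [[1, 0, 0, 0],
     [0, 1, 0, 0]]

def step(s):
    return [sum(A[i][j] * s[j] for j in range(4)) for i in range(4)]

s = [0.0, 0.0, 1.0, 2.0]   # start at the origin with velocity (1, 2)
for _ in range(5):
    s = step(s)
y = [sum(C[i][j] * s[j] for j in range(4)) for i in range(2)]
print(y)   # after 5 steps the observed position is (5, 10)
```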
The inference problem 1

Filtering: given y_1, ..., y_t, estimate p(x_t \mid y_{1:t}).

The Kalman filter is a way to perform exact online inference (sequential Bayesian updating) in an LDS. It is the Gaussian analog of the forward algorithm for HMMs:

  \alpha_t^i \stackrel{\text{def}}{=} p(X_t = i \mid y_{1:t}) \propto p(y_t \mid X_t = i) \sum_j p(X_t = i \mid X_{t-1} = j)\, \alpha_{t-1}^j
The inference problem 2

Smoothing: given y_1, ..., y_T, estimate p(x_t \mid y_{1:T}) for t < T.

The Rauch-Tung-Striebel smoother is a way to perform exact off-line inference in an LDS. It is the Gaussian analog of the forwards-backwards (alpha-gamma) algorithm for HMMs:

  \gamma_t^i \stackrel{\text{def}}{=} p(X_t = i \mid y_{1:T}) = \alpha_t^i \sum_j \frac{p(X_{t+1} = j \mid X_t = i)\, \gamma_{t+1}^j}{\sum_k \alpha_t^k\, p(X_{t+1} = j \mid X_t = k)}
2D tracking
[Figure: filtered state estimates for the 2D tracking example, plotted against the X1 coordinate.]
Kalman filtering in the brain?
Kalman filtering derivation

Since all CPDs are linear Gaussian, the system defines a large multivariate Gaussian, hence all marginals are Gaussian. We can therefore represent the belief state p(X_t \mid y_{1:t}) as a Gaussian with mean \hat{x}_{t|t} and covariance P_{t|t}. It is common to work with the inverse covariance (precision) matrix P_{t|t}^{-1}; this is called information form.

Kalman filtering is a recursive procedure to update the belief state:

Predict step: compute p(X_{t+1} \mid y_{1:t}) from the prior belief p(X_t \mid y_{1:t}) and the dynamical model p(X_{t+1} \mid X_t) --- the time update.

Update step: compute the new belief p(X_{t+1} \mid y_{1:t+1}) from the prediction p(X_{t+1} \mid y_{1:t}), the observation y_{t+1}, and the observation model p(y_{t+1} \mid X_{t+1}) --- the measurement update.
Predict step

Dynamical model:  x_{t+1} = A x_t + G w_t,  w_t \sim \mathcal{N}(0, Q)

One-step-ahead prediction of the state:

  \hat{x}_{t+1|t} = E[X_{t+1} \mid y_1, ..., y_t] = A \hat{x}_{t|t}

  P_{t+1|t} = E[(X_{t+1} - \hat{x}_{t+1|t})(X_{t+1} - \hat{x}_{t+1|t})^T \mid y_1, ..., y_t]
            = E[(A X_t + G w_t - A \hat{x}_{t|t})(A X_t + G w_t - A \hat{x}_{t|t})^T \mid y_1, ..., y_t]
            = A P_{t|t} A^T + G Q G^T

Observation model:  y_t = C x_t + v_t,  v_t \sim \mathcal{N}(0, R)

One-step-ahead prediction of the observation:

  \hat{y}_{t+1|t} = E[Y_{t+1} \mid y_1, ..., y_t] = C \hat{x}_{t+1|t}
  E[(Y_{t+1} - \hat{y}_{t+1|t})(Y_{t+1} - \hat{y}_{t+1|t})^T \mid y_1, ..., y_t] = C P_{t+1|t} C^T + R
  E[(Y_{t+1} - \hat{y}_{t+1|t})(X_{t+1} - \hat{x}_{t+1|t})^T \mid y_1, ..., y_t] = C P_{t+1|t}
Update step

Summarizing the results from the previous slide, we have p(X_{t+1}, Y_{t+1} \mid y_{1:t}) \sim \mathcal{N}(m_{t+1}, V_{t+1}), where

  m_{t+1} = \begin{bmatrix} \hat{x}_{t+1|t} \\ C \hat{x}_{t+1|t} \end{bmatrix},
  V_{t+1} = \begin{bmatrix} P_{t+1|t} & P_{t+1|t} C^T \\ C P_{t+1|t} & C P_{t+1|t} C^T + R \end{bmatrix}

Remember the formulas for conditional Gaussian distributions: if

  p(x_1, x_2) = \mathcal{N}\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\Big|\, \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)

then p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2}) with

  m_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(x_2 - \mu_2)
  V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
Kalman Filter

Time updates:

  \hat{x}_{t+1|t} = A \hat{x}_{t|t}
  P_{t+1|t} = A P_{t|t} A^T + G Q G^T

Measurement updates:

  \hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}(y_{t+1} - C \hat{x}_{t+1|t})
  P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t}

where K_{t+1} is the Kalman gain matrix:

  K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1}

K_t can be pre-computed, since it is independent of the data.
Example of KF in 1D

Consider noisy observations of a 1D particle doing a random walk:

  x_t = x_{t-1} + w,   w \sim \mathcal{N}(0, \sigma_x^2)
  z_t = x_t + v,   v \sim \mathcal{N}(0, \sigma_z^2)

KF equations (A = C = G = 1, writing \sigma_t^2 for P_{t|t}):

  \hat{x}_{t+1|t} = \hat{x}_{t|t},   P_{t+1|t} = \sigma_t^2 + \sigma_x^2

  \hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}(z_{t+1} - \hat{x}_{t+1|t})
                    = \frac{(\sigma_t^2 + \sigma_x^2)\, z_{t+1} + \sigma_z^2\, \hat{x}_{t|t}}{\sigma_t^2 + \sigma_x^2 + \sigma_z^2}

  K_{t+1} = \frac{\sigma_t^2 + \sigma_x^2}{\sigma_t^2 + \sigma_x^2 + \sigma_z^2}
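The 1D example is small enough to run in full. Taking \sigma_x^2 = \sigma_z^2 = 1 (an illustrative choice), the variance recursion becomes P' = (P + 1)/(P + 2), whose fixed point is the golden-ratio value (\sqrt{5} - 1)/2 ≈ 0.618:

```python
import random

# 1D random-walk Kalman filter with sigma_x^2 = sigma_z^2 = 1.
random.seed(0)

x = 0.0
xhat, P = 0.0, 1.0
for _ in range(50):
    # simulate the true walk and a noisy observation
    x += random.gauss(0, 1)
    z = x + random.gauss(0, 1)
    # predict step (A = G = 1, Q = 1)
    P_pred = P + 1.0
    # measurement update (C = 1, R = 1)
    K = P_pred / (P_pred + 1.0)
    xhat = xhat + K * (z - xhat)
    P = (1.0 - K) * P_pred

print(P)   # converges to (sqrt(5) - 1)/2 ≈ 0.6180
```

Note that P (and hence K) evolves independently of the data, which is why the gain can be pre-computed.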
KF intuition

The KF update of the mean is

  \hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}(z_{t+1} - C \hat{x}_{t+1|t})

The term (z_{t+1} - C \hat{x}_{t+1|t}) is called the innovation.

The new belief is a convex combination of the prediction and the observation, weighted by the Kalman gain matrix:

  K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1}

If the observation is unreliable (\sigma_z^2, i.e., R, is large), then K_{t+1} is small, so we pay more attention to the prediction. If the old prior is unreliable (\sigma_t^2 is large) or the process is very unpredictable (\sigma_x^2 is large), we pay more attention to the observation.
Complexity of one KF step

Let X_t \in \mathbb{R}^{N_x} and Y_t \in \mathbb{R}^{N_y}.

Computing P_{t+1|t} = A P_{t|t} A^T + G Q G^T takes O(N_x^2) time, assuming dense P and dense A.

Computing K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1} takes O(N_y^3) time.

So the overall time is, in general, O(max{N_x^2, N_y^3}).
Rauch-Tung-Striebel smoother

General structure: KF results + the difference between the "smoothed" and predicted results of the next step:

  \hat{x}_{t|T} = \hat{x}_{t|t} + L_t(\hat{x}_{t+1|T} - \hat{x}_{t+1|t})
  P_{t|T} = P_{t|t} + L_t(P_{t+1|T} - P_{t+1|t})L_t^T
  L_t = P_{t|t} A^T P_{t+1|t}^{-1}

Backward computation: pretend to know things at t+1; such conditioning makes things simple, and we can remove this condition at the end.

The difficulty:

  \hat{x}_{t|T} \stackrel{\text{def}}{=} E[X_t \mid y_1, ..., y_T]
    = E\left[ E[X_t \mid X_{t+1}, y_1, ..., y_T] \,\big|\, y_1, ..., y_T \right]
    = E\left[ E[X_t \mid X_{t+1}, y_1, ..., y_t] \,\big|\, y_1, ..., y_T \right]

where the last step uses the Markov structure: given X_{t+1}, the future observations carry no extra information about X_t.

The trick (laws of total expectation and total variance):

  E[X \mid Z] = E\left[ E[X \mid Y, Z] \,\big|\, Z \right]
  Var[X \mid Z] = Var\left[ E[X \mid Y, Z] \,\big|\, Z \right] + E\left[ Var[X \mid Y, Z] \,\big|\, Z \right]   (Hw!)

The same applies to P_{t|T}.
RTS derivation

Following the results from the previous slide, we need p(X_t, X_{t+1} \mid y_{1:t}) \sim \mathcal{N}(m, V), where

  m = \begin{bmatrix} \hat{x}_{t|t} \\ \hat{x}_{t+1|t} \end{bmatrix},
  V = \begin{bmatrix} P_{t|t} & P_{t|t} A^T \\ A P_{t|t} & P_{t+1|t} \end{bmatrix}

All the quantities here are available after a forward KF pass.

Remember the formulas for conditional Gaussian distributions:

  m_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(x_2 - \mu_2),   V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}

The RTS smoother: with L_t = P_{t|t} A^T P_{t+1|t}^{-1},

  \hat{x}_{t|T} \stackrel{\text{def}}{=} E\left[ E[X_t \mid X_{t+1}, y_{1:t}] \,\big|\, y_{1:T} \right] = \hat{x}_{t|t} + L_t(\hat{x}_{t+1|T} - \hat{x}_{t+1|t})

  P_{t|T} = Var\left[ E[X_t \mid X_{t+1}, y_{1:t}] \,\big|\, y_{1:T} \right] + E\left[ Var[X_t \mid X_{t+1}, y_{1:t}] \,\big|\, y_{1:T} \right] = P_{t|t} + L_t(P_{t+1|T} - P_{t+1|t})L_t^T
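For the 1D random walk the forward KF pass and the backward RTS pass fit in a few lines; the noise variances are all set to 1 as an illustrative choice. A property worth checking is that the smoothed variances P_{t|T} never exceed the filtered variances P_{t|t}:

```python
import random

# Forward KF pass then backward RTS pass for the 1D random walk
# (A = G = Q = C = R = 1).
random.seed(0)
T = 30
x, zs = 0.0, []
for _ in range(T):
    x += random.gauss(0, 1)
    zs.append(x + random.gauss(0, 1))

# Forward pass: store filtered and one-step-ahead predicted quantities.
xf, Pf, xp, Pp = [], [], [], []
xhat, P = 0.0, 1.0
for z in zs:
    x_pred, P_pred = xhat, P + 1.0        # time update
    K = P_pred / (P_pred + 1.0)           # measurement update
    xhat = x_pred + K * (z - x_pred)
    P = (1.0 - K) * P_pred
    xp.append(x_pred); Pp.append(P_pred); xf.append(xhat); Pf.append(P)

# Backward pass: x_{t|T} = x_{t|t} + L_t (x_{t+1|T} - x_{t+1|t}),
#                P_{t|T} = P_{t|t} + L_t^2 (P_{t+1|T} - P_{t+1|t})
xs, Ps = xf[:], Pf[:]
for t in range(T - 2, -1, -1):
    L = Pf[t] / Pp[t + 1]                 # L_t = P_{t|t} A / P_{t+1|t}
    xs[t] = xf[t] + L * (xs[t + 1] - xp[t + 1])
    Ps[t] = Pf[t] + L * L * (Ps[t + 1] - Pp[t + 1])

print(max(Ps[t] - Pf[t] for t in range(T)))   # <= 0: smoothing only tightens
```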
Learning SSMs

Complete log likelihood:

  \ell_c(\theta; D) = \log \prod_n p(x_n, y_n)
    = \sum_n \log p(x_{n,1}; \Sigma_0) + \sum_n \sum_{t=2}^{T} \log p(x_{n,t} \mid x_{n,t-1}; A, G, Q) + \sum_n \sum_{t=1}^{T} \log p(y_{n,t} \mid x_{n,t}; C, R)

EM:

E-step: compute \langle X_t \rangle, \langle X_t X_t^T \rangle, and \langle X_t X_{t-1}^T \rangle under p(X \mid y_{1:T}). These quantities can be inferred via the KF and RTS passes, e.g.,

  \langle X_t \rangle = \hat{x}_{t|T},   \langle X_t X_t^T \rangle = Var[X_t \mid y_{1:T}] + \langle X_t \rangle \langle X_t \rangle^T = P_{t|T} + \hat{x}_{t|T} \hat{x}_{t|T}^T

M-step: MLE of A, G, Q, C, R, and \Sigma_0 using \langle \ell_c(\theta; D) \rangle; c.f. the M-step in factor analysis.
Nonlinear systems

In robotics and other problems, the motion model and the observation model are often nonlinear:

  x_t = f(x_{t-1}) + w_t,   y_t = g(x_t) + v_t

An optimal closed-form solution to the filtering problem is no longer possible.

The nonlinear functions f and g are sometimes represented by neural networks (multi-layer perceptrons or radial basis function networks). The parameters of f and g may be learned offline using EM, where we do gradient descent (back propagation) in the M-step, c.f. learning an MRF/CRF with hidden nodes. Or we may learn the parameters online by adding them to the state space: x_t' = (x_t, \theta). This makes the problem even more nonlinear.
Extended Kalman Filter (EKF)

The basic idea of the EKF is to linearize f and g using a first-order Taylor expansion, and then apply the standard KF; i.e., we approximate a stationary nonlinear system with a non-stationary linear system:

  x_t \approx f(\hat{x}_{t-1|t-1}) + A_t(x_{t-1} - \hat{x}_{t-1|t-1}) + w_t
  y_t \approx g(\hat{x}_{t|t-1}) + C_t(x_t - \hat{x}_{t|t-1}) + v_t

where

  \hat{x}_{t|t-1} = f(\hat{x}_{t-1|t-1}),
  A_t \stackrel{\text{def}}{=} \left. \frac{\partial f}{\partial x} \right|_{\hat{x}_{t-1|t-1}},
  C_t \stackrel{\text{def}}{=} \left. \frac{\partial g}{\partial x} \right|_{\hat{x}_{t|t-1}}

The noise covariances (Q and R) are not changed; i.e., the additional error due to linearization is not modeled.
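One sanity check on the EKF construction: when f and g happen to be linear, the Jacobians are constant and an EKF step coincides exactly with a standard KF step. The scalar system and its parameter values below are illustrative:

```python
# When f and g are linear, one EKF step equals one KF step exactly.
a, c, Q, R = 0.9, 1.0, 0.2, 0.5   # toy scalar system
f = lambda x: a * x               # linear dynamics
g = lambda x: c * x               # linear observation
df = lambda x: a                  # Jacobian of f (constant)
dg = lambda x: c                  # Jacobian of g (constant)

def ekf_step(xhat, P, y):
    A = df(xhat)
    x_pred = f(xhat)              # propagate the mean through f
    P_pred = A * P * A + Q
    C = dg(x_pred)                # linearize g at the predicted mean
    K = P_pred * C / (C * P_pred * C + R)
    return x_pred + K * (y - g(x_pred)), (1 - K * C) * P_pred

def kf_step(xhat, P, y):
    x_pred = a * xhat
    P_pred = a * P * a + Q
    K = P_pred * c / (c * P_pred * c + R)
    return x_pred + K * (y - c * x_pred), (1 - K * c) * P_pred

e = ekf_step(1.0, 1.0, 0.7)
k = kf_step(1.0, 1.0, 0.7)
print(e, k)   # identical mean and covariance
```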
Online vs offline inference
KF, RLS and LMS

The KF update of the mean is

  \hat{x}_{t+1|t+1} = A \hat{x}_{t|t} + K_{t+1}(y_{t+1} - C A \hat{x}_{t|t})

Consider the special case where the hidden state is a constant, x_t = \theta, but the "observation matrix" C is a time-varying vector, C_t = x_t^T. The observation model at each time slice,

  y_t = x_t^T \theta + v_t

is then a linear regression. We can estimate \theta recursively using the Kalman filter:

  \hat{\theta}_{t+1} = \hat{\theta}_t + P_{t+1} R^{-1} x_{t+1} (y_{t+1} - x_{t+1}^T \hat{\theta}_t)

This is called the recursive least squares (RLS) algorithm.

If we approximate the gain P_{t+1} R^{-1} by a scalar constant \eta, we get the least mean squares (LMS) algorithm:

  \hat{\theta}_{t+1} = \hat{\theta}_t + \eta\, x_{t+1} (y_{t+1} - x_{t+1}^T \hat{\theta}_t)

We can also adapt \eta_t online using stochastic approximation theory.
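The LMS special case fits in a few lines; the input sequence, step size, and true weight below are made-up illustrative values. With noiseless data generated from a true weight of 2, the estimate should converge close to 2:

```python
# LMS sketch: treat the regression weight theta as a constant hidden state and
# replace the Kalman gain by a fixed scalar step size eta.
eta, theta = 0.1, 0.0
xs = [0.5 + 0.01 * (t % 100) for t in range(300)]   # scalar inputs in [0.5, 1.49]
for x in xs:
    y = 2.0 * x                                     # noiseless observation, true theta = 2
    theta += eta * x * (y - x * theta)              # LMS update
print(theta)   # close to 2
```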