
Page 1:

Introduction to Graphical Models

Wei-Lun (Harry) Chao

June 10, 2010

aMMAI, spring 2010

1

Page 2:

Outline

• Graphical model fundamentals

[Directed]

• General structure: 3 connections, chain, and tree

• Graphical model examples

• Inference and Learning

[Undirected]

• Markov Random Field and its Applications

2

Page 3:

Main References

• "An introduction to graphical models," Kevin Murphy, 2001

• "Learning Low-Level Vision," Freeman, IJCV, 2000

• Chapter 16: Graphical Models, in "Introduction to Machine Learning," 2nd edition, Ethem Alpaydin

3

Page 4:

What are graphical models?

4

If we know all P(C,S,R,W), we know everything in this graph

Page 5:

What are graphical models?

5

If we know all P(C,R,S,W), we can apply marginalization and Bayes' rule to obtain any probability distribution of interest. (ex: P(R), P(R|S), P(R,W|S,C))

P(C,R,S,W)   C   R   S   W
0.2          T   T   T   T
0.11         T   T   T   F
......       ..  ..  ..  ..
0.06         F   F   F   F

[General decomposition]

P(C,R,S,W) = P(W|C,R,S) P(C,R,S)
           = P(W|C,R,S) P(S|R,C) P(R,C)
           = P(W|C,R,S) P(S|R,C) P(R|C) P(C)
(Totally 1+2+4+8 = 15 terms recorded)

[Induce conditional independence]

P(C,R,S,W) = P(W|R,S) P(S|C) P(R|C) P(C)
(Totally 1+2+2+4 = 9 terms recorded)

[Probability decomposition in graphical models]

P(X_1, ..., X_d) = ∏_{i=1}^{d} P(X_i | parents(X_i))

Page 6:

What are graphical models?

6

From MMAI 09

Page 7:

Outline

• Graphical model fundamentals

[Directed]

• General structure: 3 connections, chain, and tree

• Graphical model examples

• Inference and Learning

[Undirected]

• Markov Random Field and its Applications

7

Page 8:

1. Graphical model fundamentals (1/3)

• Graphical models are a marriage between probability theory and graph theory

• Solving two problems: Uncertainty and Complexity

(Ex: text retrieval, object recognition, ……)

• General structure: Modularity

• Conditional independencies result in local calculations

• Issues: Representation, Inference, Learning, and Decision Theory

8

Page 9:

1. Graphical model fundamentals (2/3)

• Two structural factors:

Node (Variable)

Arc (Dependence)

• Two kinds of models:

Undirected: Markov random fields (MRFs), with potential functions Ψ(x_i, x_j), Φ(y_i, x_i) [2]

Directed: Bayesian networks (BNs), with conditional probabilities P(X), P(Y|X), P(Z|Y)

9

Page 10:

1. Graphical model fundamentals (3/3)

• Conditional Independence

• Need to know: Structure and Parameters

• Want to know: Variables (Observed and Unobserved)

10

Ex: P(Y = y | X = x; θ) ~ N(Wx, Σ)

P(X_1, ..., X_d) = ∏_{i=1}^{d} P(X_i | parents(X_i))

Page 11:

Outline

• Graphical model fundamentals

[Directed]

• General structure: 3 connections, chain, and tree

• Graphical model examples

• Inference and Learning

[Undirected]

• Markov Random Field and its Applications

11

Page 12:

2. General structure (1/7)

• 3 Connections

Head-to-tail:  P(X), P(Y|X), P(Z|Y)

Tail-to-tail:  P(X), P(Y|X), P(Z|X)

Head-to-head:  P(X), P(Y), P(Z|X,Y)

12

Page 13:

2. General structure (2/7)

• Head-to-tail: P(X), P(Y|X), P(Z|Y)

P(X,Y,Z) = P(X) P(Y|X) P(Z|Y,X)
P(Z|X,Y) = P(Z,X|Y) / P(X|Y) = P(Z|Y) P(X|Y) / P(X|Y) = P(Z|Y)
⇒ P(X,Y,Z) = P(X) P(Y|X) P(Z|Y)

• Example (chain C → R → W):

P(R) = P(R|C) P(C) + P(R|~C) P(~C) = 0.38
P(W) = P(W|R) P(R) + P(W|~R) P(~R) = 0.47
P(W|C) = P(W|R) P(R|C) + P(W|~R) P(~R|C) = 0.76   [Prediction]
P(C|W) = P(W|C) P(C) / P(W) = 0.65   [Diagnosis]

13
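A minimal sketch of the prediction and diagnosis computations above. The CPT values below are inferred so that the results reproduce the slide's numbers; the actual table appears only in the slide figure.

```python
# Chain C -> R -> W (head-to-tail). CPT values are inferred so that the
# four results match the slide (0.38, 0.47, 0.76, 0.65).
p_c = 0.4
p_r = {True: 0.8, False: 0.1}    # P(R=T | C)
p_w = {True: 0.9, False: 0.2}    # P(W=T | R)

p_rain = p_r[True] * p_c + p_r[False] * (1 - p_c)         # P(R) = 0.38
p_wet = p_w[True] * p_rain + p_w[False] * (1 - p_rain)    # P(W) ~ 0.47
p_w_c = (p_w[True] * p_r[True]
         + p_w[False] * (1 - p_r[True]))                  # P(W|C) = 0.76 [Prediction]
p_c_w = p_w_c * p_c / p_wet                               # P(C|W) ~ 0.65 [Diagnosis]
print(p_rain, p_wet, p_w_c, p_c_w)
```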

Page 14:

2. General structure (3/7)

• Tail-to-tail: P(X), P(Y|X), P(Z|X)

P(Y,Z|X) = P(Y|X) P(Z|X)
P(X,Y,Z) = P(X) P(Y|X) P(Z|X)

• Example 1:

P(R) = P(R|C) P(C) + P(R|~C) P(~C) = 0.45
P(C|R) = P(R|C) P(C) / P(R) = P(R|C) P(C) / [P(R|C) P(C) + P(R|~C) P(~C)] = 0.89
P(R|S) = Σ_C P(R,C|S) = P(R|C) P(C|S) + P(R|~C) P(~C|S) = ...... = 0.22 (≠ P(R))

14
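The same computations in a short sketch; the CPT values are inferred to reproduce the slide's numbers (0.45, 0.89, 0.22), since the original table is in the slide figure.

```python
# Tail-to-tail: S <- C -> R. CPT values inferred to match the slide.
p_c = 0.5
p_r = {True: 0.8, False: 0.1}    # P(R=T | C)
p_s = {True: 0.1, False: 0.5}    # P(S=T | C)

p_rain = p_r[True] * p_c + p_r[False] * (1 - p_c)        # P(R) = 0.45
p_c_r = p_r[True] * p_c / p_rain                         # P(C|R) ~ 0.89
p_sprk = p_s[True] * p_c + p_s[False] * (1 - p_c)        # P(S) = 0.30
p_c_s = p_s[True] * p_c / p_sprk                         # P(C|S) ~ 0.17
p_r_s = p_r[True] * p_c_s + p_r[False] * (1 - p_c_s)     # P(R|S) ~ 0.22
print(p_rain, p_c_r, p_r_s)   # P(R|S) != P(R): R and S are coupled through C
```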

Page 15:

2. General structure (4/7)

• Example 2: PLSA

• How to determine the structure?

Based on which probability model we already know!

Ex: Regression vs. Generative model

[Original]      P(d,w) = Σ_z P(w|z) P(z|d) P(d)
[Modification]  P(d,w) = Σ_z P(w|z) P(d|z) P(z)

Regression: P(Y = y | X = x) ~ N(Wx, Σ)
Generative models: P(Y|X) = P(X|Y) P(Y) / P(X)

15

Page 16:

2. General structure (5/7)

• Head-to-head: [Different structure]

When Z is observed, X and Y are not independent!!

• Example:

P(X), P(Y), P(Z|X,Y)

P(X,Y) = P(X) P(Y)
P(X,Y,Z) = P(X) P(Y) P(Z|X,Y)

P(W) = Σ_{R,S} P(W,R,S) = 0.52
P(W|S) = Σ_R P(W,R|S) = 0.92
P(S|W) = P(W|S) P(S) / P(W) = 0.35
P(S|R,W) = P(S,R|W) / P(R|W) = 0.21   [Explaining away]

16
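The explaining-away effect can be checked numerically. The priors and the P(W|S,R) table below are inferred to reproduce the slide's numbers; they are assumptions standing in for the table in the slide figure.

```python
from itertools import product

# Head-to-head: S -> W <- R, with S and R marginally independent.
# Values inferred to match the slide (0.52, 0.92, 0.35, 0.21).
p_s, p_r = 0.2, 0.4
p_w = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.10}   # P(W=T | S, R)

def bern(p, v):
    return p if v else 1.0 - p

def joint(s, r, w):   # P(S,R,W) = P(S) P(R) P(W|S,R)
    return bern(p_s, s) * bern(p_r, r) * bern(p_w[(s, r)], w)

p_wet = sum(joint(s, r, True) for s, r in product([True, False], repeat=2))  # 0.52
p_w_s = sum(p_w[(True, r)] * bern(p_r, r) for r in [True, False])            # 0.92
p_s_w = p_w_s * p_s / p_wet                                                  # ~0.35
p_rw = sum(joint(s, True, True) for s in [True, False])
p_s_rw = joint(True, True, True) / p_rw                                      # ~0.21
print(p_wet, p_w_s, p_s_w, p_s_rw)
# Seeing W raises P(S) from 0.2 to ~0.35; additionally seeing R
# "explains away" the wet grass and drops P(S|R,W) back to ~0.21.
```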

Page 17:

2. General structure (6/7)

• Combination:

• Memory saving:

• New representation:

No explicit input / output

Blurry difference between supervised / unsupervised learning

Hidden Nodes

Causality

2^4 − 1 = 15 entries for the full joint  →  1+2+2+4 = 9 entries with the factorization

P(X_1, ..., X_d) = ∏_{i=1}^{d} P(X_i | parents(X_i))

17

Page 18:

2. General structure (7/7)

• Chain:

• Tree:

• Loop:

18

Page 19:

Outline

• Graphical model fundamentals

[Directed]

• General structure: 3 connections, chain, and tree

• Graphical model examples

• Inference and Learning

[Undirected]

• Markov Random Field and its Applications

19

Page 20:

3. Graphical model examples (1/10)

• Generative discrimination vs. Gaussian mixture

• PCA, ICA, and all that

• Hidden Markov Model

• Naive Bayes’ classifier

• Linear regression

• Generative model for generative model

• Applications

• Notation:

Square (discrete), Circle (continuous)

Shaded (observed), Clear (hidden)

20

Page 21:

3. Graphical model examples (2/10)

• Generative discrimination vs. Gaussian mixture

P(Q = i), P(Y = i): multinomial distribution
P(X = x | Q = i), P(X = x | Y = i) = N(x; μ_i, Σ_i)

[Supervised learning]: at the training phase, the class-label variable is observable
[Inference]: there must be some latent (hidden) variables during testing (prediction)

21

Page 22:

3. Graphical model examples (3/10)

• PCA and Factor Analysis

P(X = x) = N(x; 0, I)
P(Y = y | X = x) = N(y; Wx, Ψ), Ψ is diagonal
usually assume n < m, where X ∈ R^n, Y ∈ R^m

[Further simplification]

(1) Isotropic noise: Ψ = σ²I (eigen problem)

(2) Classical PCA: σ² → 0

[Usage]

            | Models P(y)          | Not probabilistic
Subspace    | Factor analysis      | PCA
Clustering  | Mixture of Gaussians | K-means

[1]

22
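A generative sketch of the factor-analysis model above; the dimensions and parameter values are assumptions for illustration.

```python
import numpy as np

# Factor analysis generative model (assumed dims: n=2 latent, m=5 observed):
# latent x ~ N(0, I), observation y ~ N(Wx, Psi) with Psi diagonal.
rng = np.random.default_rng(0)
n, m = 2, 5
W = rng.normal(size=(m, n))                       # loading matrix
psi = np.diag(rng.uniform(0.1, 0.5, size=m))      # diagonal noise covariance

x = rng.normal(size=n)                            # P(x) = N(0, I)
y = W @ x + rng.multivariate_normal(np.zeros(m), psi)  # P(y|x) = N(Wx, Psi)
# Marginally y ~ N(0, W W^T + Psi); with Psi = sigma^2 I this becomes
# probabilistic PCA, and classical PCA in the limit sigma^2 -> 0.
print(y)
```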

Page 23:

3. Graphical model examples (4/10)

• Mixture of factor analysis (nonlinear, no W)

• Independent factor analysis (IFA), and ICA

[1]

IFA: chain graph with Ψ not diagonal

ICA: P(X) is non-Gaussian

[1]

23

Page 24:

3. Graphical model examples (5/10)

• Hidden Markov Model: [dynamic, discrete state]

[4 stages]

• Parameters: [repeat] homogeneous Markov chain

Parameters could be estimated by inference or by learning

[1]

[1]

Properties: P(Q_1), P(Q_{t+1} | Q_t), P(Y_t | Q_t)

Q_t : unobserved (hidden)

24

Page 25:

3. Graphical model examples (6/10)

• Variations of HMM:

[Input-Output HMM] [Factorial HMM]

*Pedigree: parent-child

[Coupled HMM]

*Speech recognition: spoken words, lip images

[Gaussian mixture HMM] [Switching HMM] [Linear dynamic system: Kalman filter]

[1] [1] [1]

[1]

25

Page 26:

3. Graphical model examples (7/10)

• Naive Bayes’ classifier

• Linear Regression

Naive Bayes:

P(Y_1, Y_2, ......, Y_d | X) = P(Y_1 | X) P(Y_2 | X) ...... P(Y_d | X)

d: size of the word dictionary

26

Linear regression:

P(w) ~ N(0, α⁻¹ I)
Learning (inference): P(ε) ~ N(0, β⁻¹)
P(r^t | x^t, w) ~ N(w^T x^t, β⁻¹)
Prediction: E[r | x, w]
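A minimal Bayesian linear-regression sketch under the prior and noise model above; the data and the values of alpha and beta are assumptions for illustration.

```python
import numpy as np

# Prior w ~ N(0, alpha^-1 I), observation noise ~ N(0, beta^-1).
rng = np.random.default_rng(0)
alpha, beta = 1.0, 25.0
x = rng.uniform(-1, 1, size=20)
Phi = np.column_stack([np.ones_like(x), x])       # design matrix [1, x]
r = 0.5 + 2.0 * x + rng.normal(0, beta ** -0.5, size=20)

# Learning (inference): posterior P(w | X, r) = N(m, S),
# the standard conjugate-Gaussian update.
S = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m = beta * S @ Phi.T @ r

# Prediction: E[r | x', data], plus the predictive variance.
phi_new = np.array([1.0, 0.3])
print(m @ phi_new, 1.0 / beta + phi_new @ S @ phi_new)
```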

Page 27:

27

[1]

Page 28:

3. Graphical model examples (9/10)

• Generative model for generative model

28

Page 29:

3. Graphical model examples (10/10)

• PLSA and LDA

• Object Recognition

29

[4] [4]

[5],[6]

Page 30:

Outline

• Graphical model fundamentals

[Directed]

• General structure: 3 connections, chain, and tree

• Graphical model examples

• Inference and Learning

[Undirected]

• Markov Random Field and its Applications

30

Page 31:

4. Inference and Learning (1/6)

• The definition of inference and learning:

[inference]: assume the structure and the parameters have been determined; based on some observations, we want to infer some unobserved variables.

[learning]: to estimate the structure and parameters of the graphical model!

PS: Each node has its corresponding probability function and parameters, but the parameters of some nodes are determined without learning!

For these nodes, even if the variables are unobserved during training, we don't need to use the EM algorithm.

31

Ex: P(Y = y | X = x; θ) ~ N(Wx, Σ)

Page 32:

4. Inference (2/6)

• The main goal of inference:

To estimate the values of the hidden nodes (variables), given the observed nodes (after the structure and parameters are fixed)

• Problem:

32

(1) posterior = conditional likelihood × prior / likelihood

(2) Computationally intractable: marginalization (summation / integral) over unobserved variables

(3) Solution: conditional independence

Page 33:

4. Inference (3/6)

• Variable elimination:

Push the sums (integrals) in as far as possible

Distributing sums over products: FFT and the Viterbi algorithm

33

[1]

[1]
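A sketch of variable elimination on the sprinkler network, reusing the assumed CPTs from the earlier sketches: instead of summing the full joint over all 2^3 configurations of (C,S,R) at once, the sum over C is pushed inside first.

```python
p_c = 0.5
p_s = {True: 0.1, False: 0.5}    # P(S=T | C)
p_r = {True: 0.8, False: 0.1}    # P(R=T | C)
p_w = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.10}   # P(W=T | S, R)

def bern(p, v):
    return p if v else 1.0 - p

# P(W=T) = sum_S sum_R P(W|S,R) * [ sum_C P(S|C) P(R|C) P(C) ]
# Eliminate C first, producing an intermediate factor f(S, R):
f = {(s, r): sum(bern(p_s[c], s) * bern(p_r[c], r) * bern(p_c, c)
                 for c in (True, False))
     for s in (True, False) for r in (True, False)}
p_wet = sum(p_w[sr] * f[sr] for sr in f)   # then eliminate S and R
print(p_wet)
```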

Page 34:

4. Inference (4/6)

(1) Dynamic programming: avoid the redundant computation involved in repeated variable eliminations

(2) Acyclic graphs (trees, chains): local message passing. Ex: the forwards-backwards algorithm for HMMs

(3) Cyclic graphs (loops): cluster nodes together to form a tree (junction trees)

34

π(X) = P(X | E⁺),  λ(X) = P(E⁻ | X)

P(X | E) = P(E | X) P(X) / P(E)
         = P(E⁺, E⁻ | X) P(X) / P(E)
         = P(E⁻ | X) P(E⁺ | X) P(X) / P(E)
         = P(E⁻ | X) P(X | E⁺) P(E⁺) / P(E)
         ∝ P(E⁻ | X) P(X | E⁺) = λ(X) π(X)
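A forwards-backwards sketch for a small discrete HMM, with toy parameters assumed for illustration; alpha carries the evidence from the past (the π(X) role above) and beta the evidence from the future (the λ(X) role), and their renormalized product is the smoothed posterior.

```python
import numpy as np

pi0 = np.array([0.6, 0.4])                  # P(Q_1)
A = np.array([[0.7, 0.3], [0.2, 0.8]])      # A[i, j] = P(Q_{t+1}=j | Q_t=i)
B = np.array([[0.9, 0.1], [0.3, 0.7]])      # B[i, k] = P(Y_t=k | Q_t=i)
obs = [0, 1, 1, 0]

T, S = len(obs), len(pi0)
alpha, beta = np.zeros((T, S)), np.ones((T, S))
alpha[0] = pi0 * B[:, obs[0]]
for t in range(1, T):                       # forward pass
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
for t in range(T - 2, -1, -1):              # backward pass
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

posterior = alpha * beta                    # P(Q_t | Y_1..T), up to a constant
posterior /= posterior.sum(axis=1, keepdims=True)
print(posterior)
```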

Page 35:

4. Inference (5/6)

• Approximate inference: used when the induced width (the largest cluster) is high, or when the integrals have no closed form

(1) Sampling (Monte Carlo) methods: MCMC

(2) Variational methods. Mean-field approximation: law of large numbers!!

Decoupling all the nodes, and introducing variational parameters for each node

Iteratively updating these variational parameters so as to minimize the cross-entropy (KL divergence) between the approximate and the true probability distributions

The mean-field approximation produces a lower bound on the likelihood

(3) Laplace approximation

(4) Loopy belief propagation: turbo codes

35
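A Gibbs-sampling sketch (one MCMC method) that estimates P(R = T | W = T) in the sprinkler network, reusing the assumed CPTs from the earlier sketches: W stays clamped to its observed value while the hidden nodes are resampled in turn from their conditionals given everything else.

```python
import random

p_c = 0.5
p_s = {True: 0.1, False: 0.5}
p_r = {True: 0.8, False: 0.1}
p_w = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.10}

def bern(p, v):
    return p if v else 1.0 - p

def joint(c, s, r, w):
    return (bern(p_c, c) * bern(p_s[c], s)
            * bern(p_r[c], r) * bern(p_w[(s, r)], w))

random.seed(0)
c, s, r, w = True, False, True, True        # W clamped to True (observed)
hits, sweeps = 0, 20000
for _ in range(sweeps):
    for var in "csr":                       # resample each hidden node
        p1 = joint(*{"c": (True, s, r, w), "s": (c, True, r, w),
                     "r": (c, s, True, w)}[var])
        p0 = joint(*{"c": (False, s, r, w), "s": (c, False, r, w),
                     "r": (c, s, False, w)}[var])
        val = random.random() < p1 / (p1 + p0)
        if var == "c":   c = val
        elif var == "s": s = val
        else:            r = val
    hits += r
print(hits / sweeps)    # Monte Carlo estimate of P(R=T | W=T)
```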

Page 36:

4. Inference (6/6)

• Variational methods

36

[Example]

F = ∫ f(θ) dθ = ∫∫ g(S, θ) dS dθ = ∫∫ q(S, θ) [g(S, θ) / q(S, θ)] dS dθ

log F = log ∫∫ q(S, θ) [g(S, θ) / q(S, θ)] dS dθ ≥ ∫∫ q(S, θ) log [g(S, θ) / q(S, θ)] dS dθ

providing ∫∫ q(S, θ) dS dθ = 1

assume q(S, θ) = q(θ) q(S)

iteratively optimize q(θ) and q(S) by EM to maximize the lower bound

Page 37:

5. Learning (1/7)

• Two things to learn: Parameters and Structures

• Variables in learning: Full or partial observability

• Point estimation vs. Bayesian estimation

37

                    | Fully observed | Partially observed
Known structure     | Closed form    | Expectation Maximization
Unknown structure   | Local search   | Structural EM

Page 38:

5. Learning (2/7)

38

[Prediction]

(1) P(X = x; θ) or P(X = x | θ)

(2) Bayesian estimation: P(X = x | X) = ∫ P(X = x | θ) P(θ | X) dθ
    (here X = {x^t} denotes the training data)

[Learning]

(1) Want to maximize P(θ | X) = P(X | θ) P(θ) / P(X)

    ML:  θ_ML = argmax_θ P(X | θ) = argmax_θ P(X; θ)
    MAP: θ_MAP = argmax_θ P(θ | X)
    Bayes' estimation: E[θ | X] = ∫ θ P(θ | X) dθ

(2) Want to get P(θ | X)

    A conjugate prior could simplify the structure:
    P(θ | X) = P(X | θ) P(θ) / P(X), where P(θ | X) and P(θ) are of the same distribution family
    Ex: P(X | θ) is multinomial and P(θ) is Dirichlet
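A minimal sketch of the multinomial-Dirichlet conjugacy: the posterior is again a Dirichlet whose parameters are the prior pseudo-counts plus the observed counts. The prior and the data below are illustrative assumptions.

```python
import numpy as np

alpha = np.array([2.0, 2.0, 2.0])        # Dirichlet prior over 3 categories
counts = np.array([10.0, 3.0, 7.0])      # observed category counts from {x^t}

post = alpha + counts                    # P(theta | X) = Dirichlet(post)
theta_ml = counts / counts.sum()                   # ML estimate
theta_map = (post - 1) / (post - 1).sum()          # MAP estimate
theta_bayes = post / post.sum()                    # E[theta | X]
print(theta_ml, theta_map, theta_bayes)
```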

Page 39:

5. Learning (3/7)

• Known structure, full observability:

ML: find the maximum likelihood estimate of the parameters of each CPD
MAP: include prior information about the parameters

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | Pa(X_i))

39

Page 40:

5. Learning (4/7)

• Known structure, partial observability: some nodes are hidden during training, so we can use the EM (Expectation Maximization) algorithm to find a (locally) optimal ML estimate

[E-step]: compute the expected values of all the hidden nodes using an inference algorithm

[M-step]: treat these expected values as though they were observed and do ML estimation

The EM algorithm is iterative, and is related to the Baum-Welch algorithm used for training HMMs, to gradient ascent, and to coordinate ascent.

Inference becomes a subroutine which is called by the learning procedure!

40

Page 41:

5. Learning (5/7)

• EM algorithm 1

41

(1) P(X, Z) = P(X | Z) P(Z)

    max_θ P(X^t; θ) = max_θ Σ_Z P(X^t, Z; θ) = max_θ Σ_Z Q(Z) [P(X^t, Z; θ) / Q(Z)]

(2) Jensen's inequality:

    log E[f(X)] ≥ E[log f(X)]
    The equality holds when f(X) is a constant

(3) max P(X^t; θ) is equivalent to max log P(X^t; θ):

    log P(X^t; θ) = log Σ_Z Q(Z) [P(X^t, Z; θ) / Q(Z)] = log E_{Q(Z)}[P(X^t, Z; θ) / Q(Z)]
                  ≥ E_{Q(Z)}[log (P(X^t, Z; θ) / Q(Z))] = Σ_Z Q(Z) log [P(X^t, Z; θ) / Q(Z)]

Page 42:

5. Learning (6/7)

• EM algorithm 2

42

(4) log P(X^t; θ) ≥ Σ_Z Q(Z) log [P(X^t, Z; θ) / Q(Z)]

(5) The equality holds when P(X^t, Z; θ) / Q(Z) is a constant:

    P(X^t, Z; θ) / Q(Z) = P(Z | X^t; θ) P(X^t; θ) / Q(Z), where P(X^t; θ) is fixed for the whole training phase

    Then Q(Z) = P(Z | X^t; θ)

(6) The EM algorithm

    [E-step]: set Q(Z) = P(Z | X^t; θ), i.e. modify Q(Z)

    [M-step]: maximize Σ_Z Q(Z) log [P(X^t, Z; θ) / Q(Z)], i.e. modify θ
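A compact EM sketch for a two-component 1-D Gaussian mixture, following the E-step/M-step recipe above; the data and initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

w = np.array([0.5, 0.5])                 # mixing weights
mu = np.array([-1.0, 1.0])               # component means
var = np.array([1.0, 1.0])               # component variances
for _ in range(50):
    # E-step: Q(z) = P(z | x; theta), the responsibilities.
    dens = (w / np.sqrt(2 * np.pi * var)
            * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate theta, using responsibilities as soft counts.
    n_k = resp.sum(axis=0)
    w = n_k / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / n_k
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
print(w, mu, var)
```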

Page 43:

5. Learning (7/7)

• Bayesian estimation

• Hidden nodes

43

P(X = x; θ)

P(X = x | X) = ∫ P(X = x | θ) P(θ | X) dθ,   with posterior P(θ | X)

Page 44:

Take a Break!!!

44

Page 45:

Outline

• Graphical model fundamentals

[Directed]

• General structure: 3 connections, chain, and tree

• Graphical model examples

• Inference and Learning

[Undirected]

• Markov Random Field and its Applications

45

Page 46:

1. MRFs fundamentals

• Markov random field: [Potential (compatibility) functions]

• Learning the parameters: Maximum likelihood estimates of the clique potentials can be computed

using iterative proportional fitting (IPF)

46

[2]

P(x, y) = Ψ(x_1, x_2) Ψ(x_1, x_3) Ψ(x_2, x_4) Ψ(x_3, x_4) ∏_{i=1}^{4} Φ(x_i, y_i)
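A brute-force sketch of this 2x2 MRF: it evaluates the product of potentials for every joint scene state, normalizes, and finds the MAP scene by enumeration. The potential tables are illustrative assumptions, not values from [2].

```python
import numpy as np
from itertools import product

psi = np.array([[2.0, 0.5], [0.5, 2.0]])   # Psi: smoothness, neighbors agree
phi = np.array([[3.0, 1.0], [1.0, 3.0]])   # Phi: evidence, x_i matches y_i
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]   # x1-x2, x1-x3, x2-x4, x3-x4
y = [0, 0, 1, 1]                           # observed image states

def unnorm(x):                             # product of all potentials
    pair = np.prod([psi[x[i], x[j]] for i, j in edges])
    return pair * np.prod([phi[x[i], y[i]] for i in range(4)])

states = list(product([0, 1], repeat=4))
Z = sum(unnorm(x) for x in states)         # normalization constant
best = max(states, key=unnorm)             # MAP scene by enumeration
print(best, unnorm(best) / Z)
```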

Page 47:

2. Low-level vision (1/2)

• Low-level vision: [scene, image] how might a visual system interpret images?

(1) Super-resolution

(2) Shading and reflection estimation

(3) Motion estimation

• Previous works: the probability model was not obtained by learning

47

P(x | y) = P(y | x) P(x) / P(y)

P(y | x) is usually defined as a "noise model"
P(x) is usually defined as sparseness or smoothness

Page 48:

2. Low-level vision (2/2)

• The proposed approach: VISTA (Vision by Image/Scene TrAining)

• The proposed structure: [learning]

We have scene/image pairs, both of which have been divided into patches! We learn the relationships between local regions of images and scenes, and between neighboring local scene regions

Long-range interaction: multi-scale pyramid

[Estimating the scene]

The best scene estimate is the mean (minimum mean squared error, MMSE) or the mode (maximum a posteriori, MAP) of the posterior probability

48

[2]

P(x | y) = P(x, y) / P(y)

Page 49:

3. MRFs inference (1/3)

• Joint probability:

• MMSE:

• MAP:

49

Page 50:

3. MRFs inference (2/3)

• No loop MMSE: [Inference]

• No loop MAP: [message passing]

50

[2]

[Two indices: state and location]

M_j^k is a column vector with the same dimension as x_j

Φ(x_i, y_i) is a column vector indexed by the different possible states of x_i, the scene at node i

Ψ(x_i, x_j) is a matrix indexed by the different possible states of x_i and x_j, the scenes at nodes i and j
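A sum-product message-passing sketch on a tiny chain MRF, using the Φ/Ψ/M notation above; the potential tables and the three-node layout are illustrative assumptions.

```python
import numpy as np

# Chain MRF x1 - x2 - x3 with binary states; the message a neighbor sends
# to node 2 plays the role of the column vector M above.
psi = np.array([[2.0, 0.5], [0.5, 2.0]])   # Psi(x_i, x_j), symmetric
phi = [np.array([3.0, 1.0]),               # Phi(x_i, y_i) for the fixed y
       np.array([1.0, 1.0]),
       np.array([1.0, 4.0])]

m_1_to_2 = psi.T @ phi[0]    # sum over x1 of Psi(x1, x2) Phi(x1, y1)
m_3_to_2 = psi.T @ phi[2]    # sum over x3 of Psi(x3, x2) Phi(x3, y3)

belief = phi[1] * m_1_to_2 * m_3_to_2
belief /= belief.sum()       # marginal P(x2 | y); its mean gives the MMSE estimate
print(belief)
# For the MAP estimate, replace the sums inside the messages with max
# (max-product message passing).
```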

Page 51:

3. MRFs inference (3/3)

• No loop MAP: [probability chain decomposition]

• With loop: Turbo codes

Still use the message passing algorithm above for MRFs with loops

51

Page 52:

4. MRFs Representation

52

(1) PCA is used to find a set of lower-dimensional basis functions for the patches of image and scene pixels

(2) Model Φ(x_i, y_i), Ψ(x_i, x_j), and M_j^k (could be Gaussian mixture distributions)

(3) Continuous probabilities are hard to propagate, so we prefer a discrete representation!

During the scene estimation phase, each scene patch is represented by 10 to 20 close candidates, which are selected based on the evidence y_i at each node

[2]

Page 53:

5. MRFs Learning

• Method 1: [message passing]

• Method 2: [proper probability factorization-Overlap]

53

[Method 1]

Gaussian mixture: Ψ(x_i, x_j), Φ(x_i, y_i)

P(x_k^l | x_j^m) = P(x_k^l, x_j^m) / P(x_j^m) : a matrix
P(y_k | x_k^l) : a column vector indexed by "l"

[Method 2]

Use the overlap between neighboring scene patches to estimate Ψ(x_i, x_j)

d_{jk}^l : the vector of pixels of the l-th candidate for scene patch x_k which lie in the overlap region with patch j

[2]

Page 54:

6. Super-resolution (1/6)

• Laplacian pyramid

54

[2]

[7]

Page 55:

6. Super-resolution (2/6)

• Training: [message passing]

55

[2]

Page 56:

6. Super-resolution (3/6)

• Result 1: [message passing]

56

[2]

Page 57:

6. Super-resolution (4/6)

• Result 2:

57

[2]

Page 58:

6. Super-resolution (5/6)

• Result 3:

58

[2]

Page 59:

6. Super-resolution (6/6)

• Result 4:

59

[2]

Page 60:

7. Shading and reflection estimation (1/5)

• Shape: shading intensity

• Reflection: surface intensity

• Scene: two pixel arrays (reflection, shape)

• Rendering: based on the estimated scene

60

[2]

Page 61:

7. Shading and reflection estimation (2/5)

• Patch selection:

61

[2]

Page 62:

7. Shading and reflection estimation (3/5)

• Training: [overlap]

62

[2]

Page 63:

7. Shading and reflection estimation (4/5)

• Result 1:

63

[2]

Page 64:

7. Shading and reflection estimation (5/5)

• Result 2:

64

[2]

Page 65:

References

[1] Kevin Murphy, "An introduction to graphical models," 2001

[2] Freeman, "Learning Low-Level Vision," IJCV, 2000

[3] Chapter 16: Graphical Models, in "Introduction to Machine Learning," 2nd edition, Ethem Alpaydin

[4] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, 3:993–1022, January 2003

[5] R. Fergus, P. Perona, and A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," in Proc. CVPR, Jun 2003

[6] L. Fei-Fei, R. Fergus, P. Perona, "A Bayesian approach to unsupervised learning of object categories," in Proc. Int. Conf. on Computer Vision, 2003, pp. 1134–1141

[7] Laplacian pyramid: http://sepwww.stanford.edu/~morgan/texturematch/paper_html/node3.html

[8] Bayesian network toolbox: http://code.google.com/p/bnt/

65

Page 66:

Thanks for listening

66