Introduction to Graphical Models - National Taiwan University (disp.ee.ntu.edu.tw/~pujols/introduction to...)
Introduction to Graphical Models
Wei-Lun (Harry) Chao
June 10, 2010
aMMAI, spring 2010
Outline
• Graphical model fundamentals
[Directed]
• General structure: 3 connections, chain, and tree
• Graphical model examples
• Inference and Learning
[Undirected]
• Markov Random Fields and their Applications
Main References
• "An introduction to graphical models," Kevin Murphy, 2001
• "Learning Low-Level Vision," Freeman, IJCV, 2000
• Chapter 16: "Graphical Models" in "Introduction to Machine Learning, 2nd edition," Ethem Alpaydin
What are graphical models?
If we know all P(C,S,R,W), we know everything in this graph
What are graphical models?
If we know all P(C,R,S,W), we can apply marginalization and Bayes' rule to obtain
any probability distribution of interest (e.g., P(R), P(R|S), P(R,W|S,C)).

P(C,R,S,W) | C | R | S | W
0.2        | T | T | T | T
0.11       | T | T | T | F
......     | ...
0.06       | F | F | F | F

[General decomposition]
P(C,R,S,W) = P(W|C,R,S) P(C,R,S) = P(W|C,R,S) P(S|R,C) P(R,C) = P(W|C,R,S) P(S|R,C) P(R|C) P(C)
(Totally 1+2+4+8 = 15 terms recorded)

[Induced conditional independence]
P(C,R,S,W) = P(W|R,S) P(S|C) P(R|C) P(C)
(Totally 1+2+2+4 = 9 terms recorded)

[Probability decomposition in graphical models]
P(X_1, ..., X_d) = Π_{i=1}^{d} P(X_i | parents(X_i))
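The 9-term factorization above can be checked numerically. A minimal sketch of the cloudy/sprinkler/rain/wet-grass network; the slide's table values are not given, so the CPT numbers below are assumptions (chosen so that P(R) matches the 0.38 used on a later slide):

```python
from itertools import product

# Sprinkler network: C -> S, C -> R, (S, R) -> W.
# CPT values are illustrative assumptions, not the slide's table.
P_C = 0.4
P_S = {True: 0.1, False: 0.5}                    # P(S=T | C)
P_R = {True: 0.8, False: 0.1}                    # P(R=T | C)
P_W = {(True, True): 0.95, (True, False): 0.9,   # P(W=T | S, R)
       (False, True): 0.9, (False, False): 0.1}

def joint(c, s, r, w):
    # P(C,R,S,W) = P(C) P(S|C) P(R|C) P(W|S,R): only 1+2+2+4 = 9 stored
    # numbers instead of the 15 a full joint table would need.
    p = P_C if c else 1 - P_C
    p *= P_S[c] if s else 1 - P_S[c]
    p *= P_R[c] if r else 1 - P_R[c]
    p *= P_W[(s, r)] if w else 1 - P_W[(s, r)]
    return p

# Marginalization: P(R=T) = sum over all the other variables
p_r = sum(joint(c, s, True, w) for c, s, w in product([True, False], repeat=3))
```

Because every CPT is normalized, the 9 stored numbers implicitly define all 16 joint entries, and any marginal or conditional follows by summation.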
What are graphical models?
From MMAI 09
Outline
• Graphical model fundamentals
[Directed]
• General structure: 3 connections, chain, and tree
• Graphical model examples
• Inference and Learning
[Undirected]
• Markov Random Field and its Applications
1. Graphical model fundamentals (1/3)
• Graphical models are a marriage between probability theory and graph theory
• Solving two problems: Uncertainty and Complexity
(Ex: text retrieval, object recognition, ……)
• General structure: Modularity
• Conditional independencies result in local calculations
• Issues: Representation, Inference, Learning, and Decision Theory
1. Graphical model fundamentals (2/3)
• Two structural factors:
Node (variable)
Arc (dependence)
• Two kinds of models:
Undirected: Markov random fields (MRFs), with potentials Ψ(x_i, x_j), Φ(y_i, x_i)
Directed: Bayesian networks (BNs), with factors P(X), P(Y|X), P(Z|Y)
[2]
1. Graphical model fundamentals (3/3)
• Conditional Independence
• Need to know: Structure and Parameters
• Want to know: Variables (Observed and Unobserved)
Ex: P(Y = y | X = x; θ) = N(Wx, Σ)

P(X_1, ..., X_d) = Π_{i=1}^{d} P(X_i | parents(X_i))
Outline
• Graphical model fundamentals
[Directed]
• General structure: 3 connections, chain, and tree
• Graphical model examples
• Inference and Learning
[Undirected]
• Markov Random Field and its Applications
2. General structure (1/7)
• 3 Connections
Head-to-tail: P(X), P(Y|X), P(Z|Y)
Tail-to-tail: P(X), P(Y|X), P(Z|X)
Head-to-head: P(X), P(Y), P(Z|X,Y)
2. General structure (2/7)
• Head-to-tail: P(X), P(Y|X), P(Z|Y)

P(X,Y,Z) = P(X) P(Y|X) P(Z|Y,X)
P(Z,X|Y) = P(Z|Y) P(X|Y)  =>  P(Z|X,Y) = P(Z|Y)
P(X,Y,Z) = P(X) P(Y|X) P(Z|Y)

• Example:
P(R) = P(R|C) P(C) + P(R|~C) P(~C) = 0.38
P(W) = P(W|R) P(R) + P(W|~R) P(~R) = 0.47
P(W|C) = P(W|R) P(R|C) + P(W|~R) P(~R|C) = 0.76  [Prediction]
P(C|W) = P(W|C) P(C) / P(W) = 0.65  [Diagnosis]
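These four quantities can be reproduced with a small script on the chain C -> R -> W. The CPT values below are assumptions, chosen to be consistent with the slide's results (0.38, 0.47, 0.76, 0.65):

```python
# Chain C -> R -> W; CPT values assumed (consistent with the slide's numbers).
p_c = 0.4
p_r_given_c = {True: 0.8, False: 0.1}   # P(R=T | C)
p_w_given_r = {True: 0.9, False: 0.2}   # P(W=T | R)

# Marginals by summing out the parent
pr = p_r_given_c[True] * p_c + p_r_given_c[False] * (1 - p_c)   # P(R=T)
pw = p_w_given_r[True] * pr + p_w_given_r[False] * (1 - pr)     # P(W=T)

# Prediction: P(W=T | C=T) = sum_R P(W|R) P(R|C=T)
pw_c = (p_w_given_r[True] * p_r_given_c[True]
        + p_w_given_r[False] * (1 - p_r_given_c[True]))

# Diagnosis: P(C=T | W=T) = P(W|C) P(C) / P(W)  (Bayes' rule)
pc_w = pw_c * p_c / pw
```

Prediction runs with the arrows; diagnosis runs against them via Bayes' rule.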
2. General structure (3/7)
• Tail-to-tail: P(X), P(Y|X), P(Z|X)

P(Y,Z|X) = P(Y|X) P(Z|X)
P(X,Y,Z) = P(X) P(Y|X) P(Z|X)

• Example 1:
P(R) = P(R|C) P(C) + P(R|~C) P(~C) = 0.45
P(C|R) = P(R|C) P(C) / P(R) = P(R|C) P(C) / [P(R|C) P(C) + P(R|~C) P(~C)] = 0.89
P(R|S) = Σ_C P(R,C|S) = P(R|C) P(C|S) + P(R|~C) P(~C|S) = ...... = 0.22
2. General structure (4/7)
• Example 2: PLSA
[Original]      P(d,w) = Σ_z P(w|z) P(z|d) P(d)
[Modification]  P(d,w) = Σ_z P(w|z) P(d|z) P(z)

• How to determine the structure?
Based on what probability model we have known!
Ex: Regression vs. generative model
Regression: P(Y = y | X = x) = N(Wx, Σ)
Generative models: P(Y|X) = P(X|Y) P(Y) / P(X)
2. General structure (5/7)
• Head-to-head: P(X), P(Y), P(Z|X,Y)  [Different structure]

P(X,Y) = P(X) P(Y)
P(X,Y,Z) = P(X) P(Y) P(Z|X,Y)

When Z is observed, X and Y are not independent!!
• Example:
P(W) = Σ_{R,S} P(W,R,S) = 0.52
P(W|S) = Σ_R P(W,R|S) = 0.92
P(S|W) = P(W|S) P(S) / P(W) = 0.35
P(S|R,W) = P(S,R|W) / P(R|W) = 0.21  [Explaining away]
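The explaining-away effect on this v-structure can be verified numerically. The CPT values below are assumptions (the slide's 0.52/0.92/0.35/0.21 come from its own, unstated tables), but the qualitative effect is the same:

```python
from itertools import product

# v-structure R -> W <- S, with R and S a priori independent.
# All numbers are illustrative assumptions.
p_r, p_s = 0.4, 0.3
p_w = {(True, True): 0.95, (True, False): 0.9,   # P(W=T | R, S)
       (False, True): 0.9, (False, False): 0.05}

def joint(r, s, w):
    pr = p_r if r else 1 - p_r
    ps = p_s if s else 1 - p_s
    pw = p_w[(r, s)] if w else 1 - p_w[(r, s)]
    return pr * ps * pw

# Marginally, R tells us nothing about S: P(S=T | R=T) = P(S=T)
p_s_given_r = (sum(joint(True, True, w) for w in (True, False))
               / sum(joint(True, s, w) for s, w in product((True, False), repeat=2)))

# Observing W=T raises belief in S...
num = sum(joint(r, True, True) for r in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
p_s_given_w = num / den

# ...but additionally observing R=T "explains away" S
p_s_given_wr = joint(True, True, True) / sum(joint(True, s, True) for s in (True, False))
```

The key inequality is P(S | W, R) < P(S | W): once rain is known, the sprinkler is less needed to explain the wet grass.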
2. General structure (6/7)
• Combination:
• Memory saving:
• New representation:
No explicit input / output
Blurry difference between supervised / unsupervised learning
Hidden Nodes
Causality
Memory saving: 2^4 - 1 = 15 entries reduced to 9

P(X_1, ..., X_d) = Π_{i=1}^{d} P(X_i | parents(X_i))
2. General structure (7/7)
• Chain:
• Tree:
• Loop:
Outline
• Graphical model fundamentals
[Directed]
• General structure: 3 connections, chain, and tree
• Graphical model examples
• Inference and Learning
[Undirected]
• Markov Random Field and its Applications
3. Graphical model examples (1/10)
• Generative discrimination vs. Gaussian mixture
• PCA, ICA, and all that
• Hidden Markov Model
• Naive Bayes' classifier
• Linear regression
• Generative model for generative model
• Applications
• Notation:
Square (discrete), Circle (continuous)
Shaded (observed), Clear (hidden)
3. Graphical model examples (2/10)
• Generative discrimination vs. Gaussian mixture
P(Q = i), P(Y = i): multinomial distribution
P(X = x | Q = i), P(X = x | Y = i) = N(x; μ_i, Σ_i)

[Supervised learning]: at the training phase, the class label variable is observable.
[Inference]: there must be some latent (hidden) variables during testing (prediction).
3. Graphical model examples (3/10)
• PCA and Factor Analysis
P(X = x) = N(x; 0, I)
P(Y = y | X = x) = N(y; Wx, Ψ), Ψ is diagonal
usually assume n < m, where X ∈ R^n, Y ∈ R^m

[Further simplification]
(1) Isotropic noise: Ψ = σ²I (eigen problem)
(2) Classical PCA: Ψ → 0

[Usage]
           | Model P(y)           | Not probabilistic
Subspace   | Factor analysis      | PCA
Clustering | Mixture of Gaussians | K-means
[1]
3. Graphical model examples (4/10)
• Mixture of factor analysis (nonlinear, no W)
• Independent factor analysis (IFA), and ICA
[1]
IFA: chain graph with Ψ not diagonal
ICA: P(X) is non-Gaussian
[1]
3. Graphical model examples (5/10)
• Hidden Markov Model: [dynamic, discrete state]
[4 stages]
• Parameters: [repeat] homogeneous Markov chain
Parameters could be estimated by inference or by learning
[1]
[1]
Properties: P(Q_1), P(Q_t | Q_{t-1}), P(Y_t | Q_t)
Q_t is unobserved (hidden)
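The three local distributions P(Q_1), P(Q_t | Q_{t-1}), P(Y_t | Q_t) are all an HMM needs to score an observation sequence; the forward recursion does it in O(T k²). The parameters below are assumptions for illustration:

```python
# Forward algorithm for a 2-state homogeneous HMM (illustrative parameters).
# alpha_t(q) = P(y_1..y_t, Q_t = q); summing alpha_T gives P(y_1..y_T).
pi = [0.6, 0.4]                      # P(Q_1)
A = [[0.7, 0.3], [0.4, 0.6]]         # A[p][q] = P(Q_t = q | Q_{t-1} = p)
B = [[0.9, 0.1], [0.2, 0.8]]         # B[q][y] = P(Y_t = y | Q_t = q)

def forward(obs):
    # initialize with P(Q_1) P(y_1 | Q_1)
    alpha = [pi[q] * B[q][obs[0]] for q in range(2)]
    # recurse: alpha_t(q) = [sum_p alpha_{t-1}(p) A[p][q]] B[q][y_t]
    for y in obs[1:]:
        alpha = [sum(alpha[p] * A[p][q] for p in range(2)) * B[q][y]
                 for q in range(2)]
    return sum(alpha)                # P(y_1..y_T)

likelihood = forward([0, 1, 0])
```

Since the model defines a proper distribution, the likelihoods of all possible sequences of a fixed length sum to 1, which is a useful sanity check.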
3. Graphical model examples (6/10)
• Variations of HMM:
[Input-Output HMM]  [Factorial HMM]
*Pedigree: parent-child
[Coupled HMM]
*Speech recognition: spoken words, lip images
[Gaussian mixture HMM] [Switching HMM] [Linear dynamic system: Kalman filter]
[1] [1] [1]
[1]
3. Graphical model examples (7/10)
• Naive Bayes’ classifier
• Linear Regression
P(Y_1, Y_2, ......, Y_d | X) = P(Y_1|X) P(Y_2|X) ...... P(Y_d|X)
d: size of the word dictionary
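The naive Bayes factorization above can be exercised directly, using log-probabilities for numerical stability. The two-class word-probability table is an assumption for illustration:

```python
import math

# Naive Bayes over d binary word features: P(Y_1..Y_d | X) = prod_i P(Y_i | X).
# Toy two-class parameters (assumed for illustration).
prior = {"spam": 0.4, "ham": 0.6}
p_word = {"spam": [0.8, 0.6, 0.1],   # P(Y_i = 1 | X = class)
          "ham":  [0.1, 0.4, 0.7]}

def posterior(y):
    # accumulate log P(X) + sum_i log P(Y_i | X), then normalize
    score = {}
    for c in prior:
        logp = math.log(prior[c])
        for yi, p in zip(y, p_word[c]):
            logp += math.log(p if yi else 1 - p)
        score[c] = logp
    z = sum(math.exp(s) for s in score.values())
    return {c: math.exp(s) / z for c, s in score.items()}

post = posterior([1, 1, 0])
```

Only d numbers per class are stored, instead of the 2^d entries a full conditional table over the words would need.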
Learning (inference):
P(w) ~ N(0, αI)
P(ε) ~ N(0, σ²)
P(r^t | x^t, w) ~ N(w^T x^t, σ²)
Prediction: r' = E[r' | x', w]
[1]
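With a Gaussian prior on w and Gaussian noise as above, the posterior over w is Gaussian in closed form. A minimal sketch for a scalar weight (1-D inputs, no bias); the prior scale alpha, the noise variance sigma2, and the data are all assumptions for illustration:

```python
# Bayesian linear regression with a scalar weight:
# prior P(w) ~ N(0, alpha), likelihood P(r | x, w) ~ N(w*x, sigma2).
alpha, sigma2 = 1.0, 0.25
xs = [0.0, 1.0, 2.0, 3.0]
rs = [0.1, 0.9, 2.1, 2.9]            # roughly r = 1.0 * x + noise

# Gaussian prior x Gaussian likelihood -> Gaussian posterior over w:
# posterior precision = 1/alpha + sum(x^2)/sigma2
# posterior mean      = (sum(x*r)/sigma2) / precision
precision = 1 / alpha + sum(x * x for x in xs) / sigma2
w_mean = (sum(x * r for x, r in zip(xs, rs)) / sigma2) / precision
w_var = 1 / precision

# Predictive mean for a new input x': E[r' | x'] = w_mean * x'
pred = w_mean * 1.5
```

The posterior mean shrinks the least-squares solution toward the prior mean 0; with alpha large (a flat prior) it approaches plain least squares.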
3. Graphical model examples (9/10)
• Generative model for generative model
3. Graphical model examples (10/10)
• PLSA and LDA
• Object Recognition
[4] [4]
[5],[6]
Outline
• Graphical model fundamentals
[Directed]
• General structure: 3 connections, chain, and tree
• Graphical model examples
• Inference and Learning
[Undirected]
• Markov Random Field and its Applications
4. Inference and Learning (1/6)
• The definition of inference and learning:
[inference]: Assume the structure and the parameters have been determined; based on some observations, we want to infer some unobserved variables.
[learning]: To estimate the structure and parameters of the graphical
model!
PS: Each node has its corresponding probability function and parameters, while the parameters of some of them are determined without learning!
For these nodes, even if the variables are unobserved during training, we don’t need to use EM algorithm.
Ex: P(Y = y | X = x; θ) = N(Wx, Σ)
4. Inference (2/6)
• The main goal of inference:
To estimate the values of hidden nodes (variable), given the observed nodes (after the structure and parameters are fixed)
• Problem:
(1) posterior = conditional likelihood × prior / likelihood
(2) Computationally intractable: marginalization (summation / integral) over the unobserved variables
(3) Solution: conditional independence
4. Inference (3/6)
• Variable elimination:
Push the sums (integrals) in as far as possible
Distributing sums over products: FFT and Viterbi algorithm
[1]
[1]
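The "push the sums in" idea can be made concrete on a chain: instead of summing the full joint over all assignments, eliminate one variable at a time. All distributions below are randomly generated placeholders:

```python
import random
from itertools import product

# Variable elimination on a chain X1 -> X2 -> ... -> Xn over k states.
# Pushing sums inside the product turns O(k^n) marginalization into
# n-1 small matrix-vector products, O(n k^2).
random.seed(0)
k, n = 3, 6

def rand_dist(m):
    w = [random.random() for _ in range(m)]
    s = sum(w)
    return [x / s for x in w]

p1 = rand_dist(k)                                             # P(X1)
T = [[rand_dist(k) for _ in range(k)] for _ in range(n - 1)]  # transition CPTs

def eliminate():
    # sum out X1, then X2, ...: a message passed along the chain
    msg = p1
    for t in range(n - 1):
        msg = [sum(msg[i] * T[t][i][j] for i in range(k)) for j in range(k)]
    return msg                                                # P(Xn)

def brute_force():
    # naive sum over all k^n joint assignments
    out = [0.0] * k
    for xs in product(range(k), repeat=n):
        p = p1[xs[0]]
        for t in range(n - 1):
            p *= T[t][xs[t]][xs[t + 1]]
        out[xs[-1]] += p
    return out

fast, slow = eliminate(), brute_force()
```

Replacing sum with max in the same recursion gives the Viterbi-style most-probable-assignment computation.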
4. Inference (4/6)
(1) Dynamic programming: avoid the redundant computation involved in repeated variable eliminations
(2) Acyclic (tree, chain): local message passing. Ex: forward-backward algorithm for HMMs
(3) Cyclic or loopy: cluster nodes together to form a tree (junction trees)
π(X) = P(X | E+), λ(X) = P(E- | X)

P(X | E) = P(E | X) P(X) / P(E)
         = P(E-, E+ | X) P(X) / P(E)
         = P(E- | X) P(E+ | X) P(X) / P(E)
         = P(E- | X) P(X | E+) P(E+) / P(E)
         ∝ λ(X) π(X)
4. Inference (5/6)
• Approximate inference: used to solve high induced width (largest cluster) and integral operations
(1) Sampling (Monte Carlo) methods: MCMC
(2) Variational methods - Mean-field approximation: law of large numbers!!
Decoupling all the nodes, and introducing variational parameters for each node
Iteratively updating these variational parameters so as to minimize the cross-entropy (KL divergence) between the approximate and the true probability distribution
The mean-field approximation produces a lower bound on the likelihood
(3) Laplacian variational
(4) Loopy belief propagation: turbo codes
4. Inference (6/6)
• Variational methods
[Example]
F = ∫ g(S, θ) dS dθ = ∫ q(S, θ) [g(S, θ) / q(S, θ)] dS dθ

log F = log ∫ q(S, θ) [g(S, θ) / q(S, θ)] dS dθ ≥ ∫ q(S, θ) log [g(S, θ) / q(S, θ)] dS dθ

providing ∫ q(S, θ) dS dθ = 1
assume q(S, θ) = q_θ(θ) q_S(S)
iteratively optimize q_θ(θ) and q_S(S) by EM to maximize the lower bound
5. Learning (1/7)
• Two things to learn: Parameters and Structures
• Variables in learning: Full or partial observability
• Point estimation vs. Bayesian estimation

                  | Fully observed | Partially observed
Known structure   | Closed form    | Expectation Maximization
Unknown structure | Local search   | Structural EM
5. Learning (2/7)
[Prediction]
(1) P(X = x; θ) or P(X = x | θ)
(2) Bayesian estimation: P(X = x | X^t) = ∫ P(X = x | θ) P(θ | X^t) dθ

[Learning]
(1) Want to maximize P(θ | X^t) = P(X^t | θ) P(θ) / P(X^t)
ML: θ_ML = argmax_θ P(X^t | θ) = argmax_θ P(X^t; θ)
MAP: θ_MAP = argmax_θ P(θ | X^t)
Bayes' estimation: θ_Bayes = E[θ | X^t] = ∫ θ P(θ | X^t) dθ

(2) Want to get P(θ | X^t)
A conjugate prior can simplify the structure:
P(θ | X^t) = P(X^t | θ) P(θ) / P(X^t), where P(θ | X^t) and P(θ) are of the same distribution
Ex: P(X^t | θ) is multinomial and P(θ) is Dirichlet
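The multinomial-Dirichlet conjugacy named in the example keeps the posterior in the prior's family, so updating reduces to adding counts. A minimal sketch with assumed prior parameters:

```python
# Conjugacy: multinomial likelihood + Dirichlet(alpha) prior gives a
# Dirichlet posterior whose parameters are prior pseudo-counts + data counts.
alpha = [1.0, 2.0, 3.0]          # Dirichlet prior (assumed values)
counts = [5, 0, 2]               # observed multinomial counts

post = [a + n for a, n in zip(alpha, counts)]   # Dirichlet posterior params

# Posterior mean E[theta_i | X] = (alpha_i + n_i) / sum_j (alpha_j + n_j)
total = sum(post)
theta_mean = [a / total for a in post]
```

Note how category 1, never observed, still gets nonzero posterior mass from its pseudo-counts; this is the Bayesian smoothing that a pure ML estimate lacks.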
5. Learning (3/7)
• Known structure, full observability:
ML: find the maximum likelihood estimate of the parameters of each CPD
MAP: include prior information of the parameters

P(X_1, ..., X_n) = Π_{i=1}^{n} P(X_i | Pa(X_i))
5. Learning (4/7)
• Known structure, partial observability: some nodes are hidden during training; we can then use the EM (Expectation Maximization) algorithm to find a (locally) optimal ML estimate
[E-step]: we compute the expected values of all the hidden nodes using an inference algorithm
[M-step]: treat these expected values as though they were observed and do ML estimation
The EM algorithm is iterative, and is similar in spirit to the Baum-Welch algorithm used for training HMMs, gradient ascent, and coordinate ascent.
Inference becomes a subroutine which is called by the learning procedure!
5. Learning (5/7)
• EM algorithm 1
(1) P(X, Z) = P(X | Z) P(Z)
max_θ P(X^t; θ) = max_θ Σ_Z P(X^t, Z; θ) = max_θ Σ_Z Q(Z) [P(X^t, Z; θ) / Q(Z)]

(2) Jensen's inequality:
log E[f(X)] ≥ E[log(f(X))]
The equality holds when f(X) is a constant

(3) max_θ P(X^t; θ) is equivalent to max_θ log P(X^t; θ)

log P(X^t; θ) = log Σ_Z Q(Z) [P(X^t, Z; θ) / Q(Z)] = log E_{Q(Z)} [P(X^t, Z; θ) / Q(Z)]
             ≥ E_{Q(Z)} [log (P(X^t, Z; θ) / Q(Z))] = Σ_Z Q(Z) log [P(X^t, Z; θ) / Q(Z)]
5. Learning (6/7)
• EM algorithm 2
(4) log P(X^t; θ) ≥ Σ_Z Q(Z) log [P(X^t, Z; θ) / Q(Z)]

(5) The equality holds when P(X^t, Z; θ) / Q(Z) is a constant:
P(X^t, Z; θ) / Q(Z) = P(Z | X^t; θ) P(X^t; θ) / Q(Z),
where P(X^t; θ) is fixed for the whole training phase
Then Q(Z) = P(Z | X^t; θ)

(6) The EM algorithm
[E-step]
expect Q(Z) = P(Z | X^t; θ), modify Q(Z)
[M-step]
maximize Σ_Z Q(Z) log [P(X^t, Z; θ) / Q(Z)], modify θ
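The E/M alternation in (6) can be sketched on a toy model where both steps are closed-form: a 1-D mixture of two unit-variance Gaussians with unknown means and mixing weight. The data and initialization are assumptions for illustration:

```python
import math
import random

# EM for a 1-D mixture of two unit-variance Gaussians.
# E-step: Q(Z) = P(Z | X; theta); M-step: maximize the lower bound over theta.
random.seed(1)
data = ([random.gauss(-2, 1) for _ in range(200)]
        + [random.gauss(3, 1) for _ in range(200)])

w, mu = 0.5, [-1.0, 1.0]         # initial guesses (assumed)

def normal(x, m):
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

for _ in range(50):
    # E-step: responsibility q_i = Q(Z_i = 0) = P(Z_i = 0 | x_i; w, mu)
    q = [w * normal(x, mu[0])
         / (w * normal(x, mu[0]) + (1 - w) * normal(x, mu[1]))
         for x in data]
    # M-step: closed-form updates (weighted averages of the data)
    w = sum(q) / len(data)
    mu[0] = sum(qi * x for qi, x in zip(q, data)) / sum(q)
    mu[1] = sum((1 - qi) * x for qi, x in zip(q, data)) / (len(data) - sum(q))
```

Each iteration raises the lower bound of (4), so the data log-likelihood never decreases; convergence is to a local optimum, which is why initialization matters.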
5. Learning (7/7)
• Bayesian estimation
• Hidden nodes

P(X = x; θ)
P(X = x | X^t) = ∫ P(X = x | θ) P(θ | X^t) dθ, using the posterior P(θ | X^t)
Take a Break!!!
Outline
• Graphical model fundamentals
[Directed]
• General structure: 3 connections, chain, and tree
• Graphical model examples
• Inference and Learning
[Undirected]
• Markov Random Field and its Applications
1. MRFs fundamentals
• Markov random field: [potential (compatibility) functions]
• Learning the parameters: maximum likelihood estimates of the clique potentials can be computed using iterative proportional fitting (IPF)
[2]

P(x, y) = Ψ(x_1, x_2) Ψ(x_1, x_3) Ψ(x_2, x_4) Ψ(x_3, x_4) Π_{i=1}^{4} Φ(x_i, y_i)
2. Low-level vision (1/2)
• Low-level vision: [scene, image] How might a visual system interpret images?
(1) Super-resolution
(2) Shading and reflection estimation
(3) Motion estimation
• Previous works: the probability model was not obtained by learning

P(x | y) = P(y | x) P(x) / P(y)
P(y | x) is usually defined as a "noise model"
P(x) is usually defined as sparseness or smoothness
2. Low-level vision (2/2)
• The proposed approach: VISTA (Vision by Image/Scene TrAining)
• The proposed structure: [learning]
We have scene / image pairs, both divided into patches! We learn the relationships between local regions of images and scenes, and between local scene regions
Long-range interaction: multi-scale pyramid
[Estimating the scene]
The best scene estimate is the mean (minimum mean squared error, MMSE) or the mode (maximum a posteriori, MAP) of the posterior probability
[2]

P(x | y) = P(x, y) / P(y)
P y
3. MRFs inference (1/3)
• Joint probability:
• MMSE:
• MAP:

3. MRFs inference (2/3)
• No-loop MMSE: [inference]
• No-loop MAP: [message passing]
[2]
[Two indices: state and location]
M_j^k is a column vector with the same dimension as x_j
Φ(x_i, y_i) is a column vector indexed by the different possible states of x_i, the scene at node i
Ψ(x_i, x_j) is a matrix indexed by the different possible states of x_i and x_j, the scenes at nodes i and j
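The column-vector/matrix message structure above can be made concrete on a tiny chain MRF: sum-product belief propagation at the middle node matches brute-force marginalization. All tables below are assumptions for illustration:

```python
from itertools import product

# 3-node chain MRF x1 - x2 - x3 with local evidence phi and pairwise psi.
# Illustrative tables (assumed), binary states.
k = 2
phi = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]   # phi[i][s]: evidence at node i
psi = [[1.0, 0.3], [0.3, 1.0]]               # smoothness between neighbors

def joint(x):
    p = 1.0
    for i in range(3):
        p *= phi[i][x[i]]
    for i in range(2):
        p *= psi[x[i]][x[i + 1]]
    return p

# Sum-product: messages into the middle node from both ends,
# each a column vector indexed by the state of the receiving node.
m_left = [sum(phi[0][a] * psi[a][b] for a in range(k)) for b in range(k)]
m_right = [sum(phi[2][c] * psi[b][c] for c in range(k)) for b in range(k)]
belief = [phi[1][b] * m_left[b] * m_right[b] for b in range(k)]
z = sum(belief)
marginal = [b / z for b in belief]

# Brute-force marginal at the middle node for comparison
bf = [0.0] * k
for x in product(range(k), repeat=3):
    bf[x[1]] += joint(x)
s = sum(bf)
bf = [b / s for b in bf]
```

Replacing each sum with a max gives the max-product (MAP) variant; on loop-free graphs both are exact, which is what the brute-force check confirms here.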
3. MRFs inference (3/3)
• No-loop MAP: [probability chain decomposition]
• With loops: turbo codes
Still use the message passing algorithm above for MRFs with loops
4. MRFs Representation
(1) PCA is used to find a set of lower-dimensional basis functions for the patches of image and scene pixels
(2) Model Φ(x_i, y_i), Ψ(x_i, x_j), and M^k (could be Gaussian mixture distributions)
(3) Continuous probabilities are hard to propagate, so we prefer a discrete representation!
During the scene estimation phase, each scene patch is represented by 10 to 20 candidates,
which are selected based on the evidence y_i at each node
[2]
[2]
5. MRFs Learning
• Method 1: [message passing]
Gaussian mixtures model Φ(x_i, y_i) and Ψ(x_i, x_j)
P(x_k^l | x_j^m) = P(x_k^l, x_j^m) / P(x_j^m): a matrix
P(y_k | x_k^l): a column vector indexed by l
• Method 2: [proper probability factorization - overlap]
Scene-patch overlap between neighboring scene patches is used to estimate Ψ(x_i, x_j)
d_jk^l: the vector of pixels of the l-th candidate for scene patch x_j which lie in the overlap region with patch k
[2]
6. Super-resolution (1/6)
• Laplacian pyramid
[2]
[7]
6. Super-resolution (2/6)
• Training: [message passing]
[2]
6. Super-resolution (3/6)
• Result 1: [message passing]
[2]
6. Super-resolution (4/6)
• Result 2:
[2]
6. Super-resolution (5/6)
• Result 3:
[2]
6. Super-resolution (6/6)
• Result 4:
[2]
7. Shading and reflection estimation (1/5)
• Shape: shading intensity
• Reflection: surface intensity
• Scene: two pixel arrays (reflection, shape)
• Rendering: based on the estimated scene
[2]
7. Shading and reflection estimation (2/5)
• Patch selection:
[2]
7. Shading and reflection estimation (3/5)
• Training: [overlap]
[2]
7. Shading and reflection estimation (4/5)
• Result 1:
[2]
7. Shading and reflection estimation (5/5)
• Result 2:
[2]
References
[1] "An introduction to graphical models," Kevin Murphy, 2001
[2] "Learning Low-Level Vision," Freeman, IJCV, 2000
[3] Chapter 16: "Graphical Models," in "Introduction to Machine Learning, 2nd edition," Ethem Alpaydin
[4] "Latent Dirichlet allocation," D. Blei, A. Ng, and M. Jordan, Journal of Machine Learning Research, 3:993–1022, January 2003
[5] R. Fergus, P. Perona, and A. Zisserman, “Object class recognition by unsupervised scale-invariant learning,” In Proc. CVPR, Jun 2003
[6] L. Fei-Fei, R. Fergus, P. Perona, “A Bayesian approach to unsupervised learning of object categories,” in: Proc. Int. Conf. on Computer Vision, 2003, pp. 1134–1141
[7] Laplacian pyramid: http://sepwww.stanford.edu/~morgan/texturematch/paper_html/node3.html
[8] Bayesian network toolbox:
http://code.google.com/p/bnt/
Thanks for listening