Introduction to Graphical Models
Yung-Kyun Noh
Seoul National University
Pattern Recognition and Machine Learning Winter School
Seoul National University Museum, Seoul
2018. 1. 24-25. (Wed.-Thu.)
Probabilistic Assumption and Bayes Classification
• Bayes Error

[Figure: two class-conditional densities p1(x) and p2(x) over the data space]
Learning
• Data: $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
• Learning: learn a prediction function $f \in \mathcal{H}$ from data ($\mathcal{H}$: hypothesis set/candidate set)
• Prediction: predict the output for unseen data, relying on regularity
Quantify the Evaluation
• Measure of quality: expected loss
  $R(f) = \mathbb{E}[\,l(f(\mathbf{x}), y)\,]$
• Estimated error:
  $\hat{R}_N(f) = \frac{1}{N} \sum_{i=1}^{N} l(f(\mathbf{x}_i), y_i)$
• Examples
  – Classification: 0-1 loss, $l(f(\mathbf{x}), y) = \mathbf{1}[f(\mathbf{x}) \neq y]$
  – Regression: squared loss, $l(f(\mathbf{x}), y) = (f(\mathbf{x}) - y)^2$
($l$: loss function)
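As a minimal sketch of the estimated error (function and variable names are mine, not the slides'), the empirical loss is just the average loss over the $N$ training pairs:

```python
import numpy as np

def empirical_error(y_pred, y_true, loss="zero_one"):
    """Estimated error: (1/N) * sum_i l(f(x_i), y_i)."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    if loss == "zero_one":           # classification (0-1 loss)
        return np.mean(y_pred != y_true)
    if loss == "squared":            # regression (squared loss)
        return np.mean((y_pred - y_true) ** 2)
    raise ValueError(loss)

# One of four predictions is wrong -> 0-1 error 0.25
print(empirical_error([1, 0, 1, 1], [1, 0, 0, 1]))
```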
Consistent Learner
• For each fixed $f$, the estimated error satisfies $\hat{R}_N(f) \to R(f)$ as $N \to \infty$ (law of large numbers).
• Caution:
  – The definition of consistency is not this pointwise convergence; what is required is uniform convergence over the hypothesis set:
    $\sup_{f \in \mathcal{H}} |\hat{R}_N(f) - R(f)| \to 0$  <Uniform convergence>
Quiz 1
• Consider a hypothesis set $\mathcal{H}$ which satisfies $\hat{R}_N(f) \to R(f)$ for every $f \in \mathcal{H}$.
Explain why learning with $\mathcal{H}$ is not necessarily consistent
even though it satisfies this convergence.
What is the possible problem in this case?
Consistency and Bayes Error
• Minimizing expected error (objective) vs.
minimizing estimated error

[Figure: error versus classifier complexity for, e.g., a linear classifier with regularization. The horizontal axis runs from simpler to more complex classifiers; the curves show the expected error and the error for the training data (train error), with the Bayes error as the lower bound; $\hat{f}$: optimal parameter with $N$ data.]
Consistency and Bayes Error
• Consistent learner with many data

[Figure: with many data, the train error and the expected error both approach the minimum error of the consistent learner within $\mathcal{H}$ (e.g., a linear classifier with regularization), which can still lie above the Bayes error; $\hat{f}$: optimal parameter.]

Consistency does not necessarily imply achieving the Bayes error!
Related Terms with Confining $\mathcal{H}$
• Linear model
  – VC-dimension for classification = dimensionality + 1
• Small number of parameters
• Large margin
• Regularization
• Bias-variance trade-off
• Generalization ability, overfitting
Many terms are theoretically connected.
Learning for Prediction
• Method:
  – Learn from labeled examples so that the classifier can label an unseen datum
  [Based on the assumption of regularity]

[Figure: labeled example images (car, plane, ship, horse) are used to learn a classifier, which then assigns a label to a new unseen datum (a horse).]
Representation of Data
• Each datum is one point in a data space: $\mathbf{x} = [1, 2, 5, 10, \ldots]^{T}$ ($D$ elements)

[Figure: a datum plotted as a single point in the data space]
Bayes Optimal Classifier
• Our ultimate goal is not zero error.

[Figure: overlapping class-conditional densities; the overlap gives the (optimal) Bayes error. Figure credit: Masashi Sugiyama]
Gaussian Random Variable
Principal axes of the density contours are the eigenvector directions of the covariance matrix $\Sigma$ ($\lambda_i$: eigenvalues of $\Sigma$).

[Figure: Gaussian equal-density ellipse with principal axes along the eigenvectors of $\Sigma$]
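A small NumPy check of this fact (the covariance matrix below is an illustrative assumption): the eigenvectors of $\Sigma$ are the principal axes, and the eigenvalues $\lambda_i$ are the variances along them.

```python
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])        # example covariance matrix

lams, U = np.linalg.eigh(Sigma)       # Sigma = U diag(lams) U^T
print("eigenvalues (variances along principal axes):", lams)
print("principal axes (columns):\n", U)

# Sanity check: the eigendecomposition reconstructs Sigma
assert np.allclose(U @ np.diag(lams) @ U.T, Sigma)
```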
Covariance Matrix and Projection
For data with covariance $\Sigma$ ($\lambda_i$: eigenvalues), the projection onto a direction $\mathbf{w}$ with $\|\mathbf{w}\| = 1$ has variance $\mathbf{w}^{\top} \Sigma\, \mathbf{w}$.
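A quick numerical check of this identity on synthetic samples: the variance of the projections $\mathbf{w}^{\top}\mathbf{x}$ matches $\mathbf{w}^{\top} \Sigma\, \mathbf{w}$.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[3.0, 1.0], [1.0, 2.0]])
X = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=200_000)

w = np.array([1.0, 1.0])
w /= np.linalg.norm(w)          # ||w|| = 1

print(np.var(X @ w))            # empirical variance of the projected samples
print(w @ Sigma @ w)            # w^T Sigma w -- should be close
```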
Parameter Estimation
• Maximum Likelihood Estimation
  Data: $\{\mathbf{x}_i\}_{i=1}^{N}$
  Mean vector: $\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i$
  Covariance matrix: $\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^{\top}$
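Both estimators are one-liners in NumPy; a minimal sketch on synthetic data (note the MLE divides by $N$, not $N-1$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # N = 1000 data points, D = 3

mu_hat = X.mean(axis=0)               # (1/N) sum_i x_i
Xc = X - mu_hat
Sigma_hat = (Xc.T @ Xc) / len(X)      # (1/N) sum_i (x_i - mu)(x_i - mu)^T

print(mu_hat)
print(Sigma_hat)
```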
Naïve Bayes As An Extreme Case
• Naïve Bayes
  – Simply ignores every correlation and dependency between variables:
    $p(\mathbf{x}) = \prod_{d=1}^{D} p_d(x_d)$
• True decomposition (chain rule):
    $p(\mathbf{x}) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_D \mid x_1, \ldots, x_{D-1})$
  Naïve Bayes approximates this by $p_1(x_1)\, p_2(x_2) \cdots p_D(x_D)$.
Number of Parameters
• $D$ = 1000
  – Number of parameters of a Gaussian:
    $1000 + \frac{1000 \cdot 1001}{2} = 501{,}500$
Independence
• $p(\mathbf{x}) = p_1(\mathbf{x}_1)\, p_2(\mathbf{x}_2)$, with $D = D_1 + D_2$
• $D_1$ = 500, $D_2$ = 500
  – Number of parameters:
    $\left(500 + \frac{500 \cdot 501}{2}\right) + \left(500 + \frac{500 \cdot 501}{2}\right) = 251{,}500$
• Incorporating one independence can cut the number of parameters roughly in half.
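A quick check of the two counts above (the helper name is mine): a $D$-dimensional Gaussian needs $D$ mean parameters plus $D(D+1)/2$ covariance parameters.

```python
def gaussian_params(D):
    """Mean (D) plus symmetric covariance (D * (D + 1) / 2)."""
    return D + D * (D + 1) // 2

print(gaussian_params(1000))                        # 501500: one joint Gaussian
print(gaussian_params(500) + gaussian_params(500))  # 251500: two independent blocks
```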
Independency
• Correlation and independency: $\mathbf{x}_a \perp\!\!\!\perp \mathbf{x}_b$

[Figure: scatter plots of uncorrelated data; which model fits, Naïve Bayes or a mixture of Gaussians?]

For jointly Gaussian variables, independency is equivalent to no correlation.
Graphical Models
• We utilize probabilities that are represented by a graph structure (directed and undirected), using the probabilistic independencies and conditional independencies that the graph structure can capture.
Directed Graphical Models
• Factorization of a (large) joint pdf:
  $p(x_1, \ldots, x_D) = \prod_{d=1}^{D} p(x_d \mid \mathrm{pa}(x_d))$  ($\mathrm{pa}(x_d)$: parents of $x_d$ in the graph)
• For given data, make a model for each decomposed probability, then estimate the parameters separately.
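A toy sketch of this recipe, assuming a hypothetical two-variable chain $X_1 \to X_2$ with binary variables: each factor of $p(x_1, x_2) = p(x_1)\,p(x_2 \mid x_1)$ is estimated separately from counts.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic binary data generated from a chain X1 -> X2
x1 = rng.random(10_000) < 0.3
x2 = np.where(x1, rng.random(10_000) < 0.9, rng.random(10_000) < 0.2)

# Estimate each decomposed probability separately
p_x1 = x1.mean()                                   # P(X1 = 1)
p_x2_given_x1 = [x2[~x1].mean(), x2[x1].mean()]    # P(X2=1 | X1=0), P(X2=1 | X1=1)

print(p_x1)             # ~0.3
print(p_x2_given_x1)    # ~[0.2, 0.9]
```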
D-Separations
• Causal path ($X_1 \to X_2 \to X_3$): $X_1 \perp\!\!\!\perp X_3 \mid X_2$
• Common cause ($X_2 \leftarrow X_1 \to X_3$): $X_2 \perp\!\!\!\perp X_3 \mid X_1$
• Common effect ($X_1 \to X_3 \leftarrow X_2$): $X_1 \perp\!\!\!\perp X_2$, but $X_1 \not\perp\!\!\!\perp X_2 \mid X_3$
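The common-effect case is the counterintuitive one. A small simulation (all variables synthetic) shows $X_1$ and $X_2$ uncorrelated marginally but strongly dependent once the effect $X_3$ is observed:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100_000)
x2 = rng.normal(size=100_000)                    # independent of x1
x3 = x1 + x2 + 0.1 * rng.normal(size=100_000)    # common effect of x1 and x2

print(np.corrcoef(x1, x2)[0, 1])                 # ~0: marginally independent

sel = np.abs(x3) < 0.1                           # condition on x3 near 0
print(np.corrcoef(x1[sel], x2[sel])[0, 1])       # strongly negative: dependent given x3
```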
Continuous Random Variables
• Jointly Gaussian with 6 variables
  – Need $6 + \frac{6 \cdot 7}{2} = 27$ parameters
• Using a graphical model:
  – Need to obtain the parameters of each conditional density in the factorization
  – We need to estimate only 19 parameters
Filtering
• Linear Dynamical Systems (LDS) with Gaussian noise

[Figure: state chain $X_1 \to X_2 \to X_3 \to \cdots \to X_{k-1} \to X_k$, with an observation $Y_i$ emitted from each state $X_i$]
Kalman Filter
• Filtering is a "conditional marginalization": marginalizing $x_0, \ldots, x_{T-1}$, i.e., marginalization from the left of the chain.
Kalman Filter
• The unconstrained distribution: the distribution over the states before conditioning on observations; also the joint density if necessary.
Kalman Filter
• Marginalizing $x_0$: the constrained distribution takes the same form, and its parameters come from the unconstrained distribution. What matters is the covariance.
Kalman Filter
• Marginalizing $x_1, \ldots, x_{T-1}$: what matters is how we can find the covariance of the joint density function!
Kalman Filter
• Filtering as "conditional marginalization": marginalization from the left.
• Why filtering? Once we know the filtered distribution at the current step, we don't have to know (or keep) the past states and observations.
Kalman Filter
• Time update
  – Similar to the unconstrained-distribution calculation!
• Measurement update
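A compact sketch of the two updates in the standard LDS notation, which is an assumption here since the slide's symbols are not shown: transition $A$, observation $C$, and noise covariances $Q$ and $R$. The time update predicts ($\mu_{k|k-1} = A\mu_{k-1}$, $P_{k|k-1} = A P_{k-1} A^{\top} + Q$), and the measurement update corrects via the Kalman gain.

```python
import numpy as np

def kalman_step(mu, P, y, A, C, Q, R):
    """One Kalman filter recursion: time update, then measurement update."""
    # Time update (predict): same calculation as for the unconstrained distribution
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Measurement update (correct)
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ C) @ P_pred
    return mu_new, P_new

# 1-D toy example: a random walk observed with noise
A = C = np.eye(1)
Q, R = np.array([[0.01]]), np.array([[1.0]])
mu, P = np.zeros(1), np.eye(1)
for y in [0.9, 1.1, 1.0]:
    mu, P = kalman_step(mu, P, np.array([y]), A, C, Q, R)
print(mu, P)
```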
Kalman Filter
Applications and history:
http://afrl.dodlive.mil/2012/04/03/the-kalman-filter-a-tool-we-use-everyday/
http://www.wpafb.af.mil/News/Article-Display/Article/401306/computer-mouse-kalman-filter-trace-origins-to-air-force-basic-research-funding/
Markov Random Field
• Undirected graph

[Figure: undirected graph over $X_1, \ldots, X_5$ with edges $X_1$–$X_2$, $X_1$–$X_3$, $X_3$–$X_4$, $X_3$–$X_5$, $X_4$–$X_5$]

If there is a direct edge between $X_i$ and $X_j$: $X_i \not\perp\!\!\!\perp X_j \mid X_{\setminus \{i,j\}}$
If there is no direct edge between $X_i$ and $X_j$: $X_i \perp\!\!\!\perp X_j \mid X_{\setminus \{i,j\}}$

Ex.: $X_1 \not\perp\!\!\!\perp X_3 \mid X_2, X_4, X_5$, while $X_1 \perp\!\!\!\perp X_5 \mid X_3$
Joint Distribution
• Product of functions on cliques:

$P(X_1, X_2, X_3, X_4, X_5) = \frac{1}{Z}\, \psi_{1,2}(X_1, X_2)\, \psi_{1,3}(X_1, X_3)\, \psi_{3,4,5}(X_3, X_4, X_5)$

$\left( Z = \sum_{X_1, X_2, X_3, X_4, X_5} \psi_{1,2}(X_1, X_2)\, \psi_{1,3}(X_1, X_3)\, \psi_{3,4,5}(X_3, X_4, X_5) \right)$

Cliques: $\{X_1, X_2\}$, $\{X_1, X_3\}$, $\{X_3, X_4, X_5\}$

The set of (strictly positive) distributions satisfying the MRF conditions (Markov random field)
= the set of distributions decomposed by cliques (Gibbs random field)
(Hammersley-Clifford theorem)
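A brute-force sketch of this factorization for five binary variables; only the clique structure comes from the slide, and the potential values below are made-up assumptions.

```python
import itertools
import numpy as np

# Hypothetical clique potentials on the cliques {1,2}, {1,3}, {3,4,5}
psi12 = lambda a, b: 2.0 if a == b else 1.0
psi13 = lambda a, b: 3.0 if a == b else 1.0
psi345 = lambda a, b, c: 1.0 + a * b * c

def unnorm(x1, x2, x3, x4, x5):
    """Unnormalized joint: product of the clique potentials."""
    return psi12(x1, x2) * psi13(x1, x3) * psi345(x3, x4, x5)

# Partition function Z: sum over all 2^5 configurations
Z = sum(unnorm(*x) for x in itertools.product([0, 1], repeat=5))
P = lambda *x: unnorm(*x) / Z

print(Z, P(1, 1, 1, 1, 1))
assert np.isclose(sum(P(*x) for x in itertools.product([0, 1], repeat=5)), 1.0)
```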
More Fancy Models
• Latent Variable Model
• Filtering

[Figure: a latent chain $Z_1 \to Z_2 \to \cdots \to Z_{k-1} \to Z_k$ emitting observations $Y_1, \ldots, Y_k$, together with small directed models over variables $Z$, $Y$, $X$]
Topic Models
http://machinelearning.snu.ac.kr/NIPS2015/NIPS2015_Accepted/nipsnice.html
http://cs.stanford.edu/people/karpathy/nipspreview/
Naïve Bayes for Classification

$P(X, Y) = P(Y)\, P(X_1 \mid Y) \cdots P(X_D \mid Y)$

[Figure: label $Y$ with children $X_1, \ldots, X_D$]
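A minimal count-based sketch of this classifier on synthetic binary features (all names and numbers are illustrative): estimate $P(Y)$ and each $P(X_d \mid Y)$ separately, then classify by the largest posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5000, 4
y = (rng.random(N) < 0.4).astype(int)                # labels
theta_true = np.where(y[:, None] == 1, 0.8, 0.3)     # true P(X_d = 1 | Y)
X = (rng.random((N, D)) < theta_true).astype(int)

# Estimate P(Y) and each P(X_d | Y) separately (with Laplace smoothing)
prior = np.array([(y == c).mean() for c in (0, 1)])
theta = np.array([(X[y == c].sum(0) + 1) / ((y == c).sum() + 2) for c in (0, 1)])

def predict(x):
    # log P(Y=c) + sum_d log P(X_d = x_d | Y=c), for c = 0, 1
    logp = np.log(prior) + (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(1)
    return int(np.argmax(logp))

print(predict(np.ones(D, int)), predict(np.zeros(D, int)))   # likely 1, 0
```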
Undirected Graphical Models
• Potential functions on cliques:
  $P(X) = \frac{1}{Z} \prod_{c} \psi_c(\mathbf{X}_c)$  ($\mathbf{X}_1, \mathbf{X}_2, \ldots$: maximal cliques)
• Partition function:
  – Discrete: $Z = \sum_{X} \prod_{c} \psi_c(\mathbf{X}_c)$
  – Continuous: $Z = \int \prod_{c} \psi_c(\mathbf{X}_c)\, dX$
Undirected Graphical Models
Ex. 2-cliques: a potential function assigns a value to each configuration of its clique, e.g., on the clique $\{X_3, X_4\}$:

X3  X4 | Φ(X3, X4)
 1   1 |  1
 1   0 |  0
 0   1 |  0
 0   0 |  3

and on a three-variable configuration table (variable names here are placeholders):

Xa  Xb  Xc | value
 1   1   1 |  2
 1   1   0 |  0
 1   0   1 |  0
 1   0   0 |  1
 0   1   1 |  2
 0   1   0 |  0
 0   0   1 |  0
 0   0   0 |  1
Estimating Parameters
Ex. 2-cliques over four binary variables, with one three-variable clique and one two-variable clique:
– potential table on the three-variable clique: $2^3 = 8$ entries
– potential table on the two-variable clique: $2^2 = 4$ entries
– total: 12 parameters
Without the graphical model, the full joint table over four binary variables needs $2^4 - 1 = 15$ parameters.
Conditional Independency
Ex. 2-cliques over four binary variables, with cliques $\{X_1, X_2, X_3\}$ and $\{X_3, X_4\}$: compute $P(X_4 \mid X_1, X_2, X_3)$ with $X_3$ fixed (say $X_3 = 1$) for every configuration of $(X_1, X_2)$.
All answers are the same: $X_4 \perp\!\!\!\perp X_1, X_2 \mid X_3$.
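This can be verified by brute force. Below, the two clique tables are random (hypothetical) potentials sharing only $X_3$; the conditional $P(X_4 = 1 \mid X_1, X_2, X_3 = 1)$ comes out identical for every $(X_1, X_2)$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
psi123 = rng.random((2, 2, 2)) + 0.5   # hypothetical clique table on (X1, X2, X3)
psi34 = rng.random((2, 2)) + 0.5       # hypothetical clique table on (X3, X4)

def p(x1, x2, x3, x4):
    # Unnormalized joint; the partition function cancels in the ratio below
    return psi123[x1, x2, x3] * psi34[x3, x4]

# P(X4 = 1 | X1, X2, X3 = 1) for every (X1, X2): all answers are the same
for x1, x2 in itertools.product([0, 1], repeat=2):
    num = p(x1, x2, 1, 1)
    den = p(x1, x2, 1, 0) + p(x1, x2, 1, 1)
    print(x1, x2, num / den)
```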
Introducing Latent Variables
• Issue: how can a model be simplified as much as possible while keeping enough flexibility to incorporate the true dependency?
Expectation-Maximization Algorithm
• Parameter estimation with latent variables
  – We don't have data for the latent variables
• E-step:
  – Data for the latent variables are obtained by expectation under the current parameter values.
• M-step:
  – With the expected latent variables, the parameters are obtained by maximizing the likelihood.
• The E-step and M-step are repeated back and forth until the likelihood converges.
Expectation-Maximization Algorithm
• Gaussian mixture model:
  $p(\mathbf{x}) = \sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \Sigma_j)$

Parameters: $\pi_j, \boldsymbol{\mu}_j, \Sigma_j$ for $j = 1, \ldots, K$
Unknown (latent) variables: $z_i$ for $i = 1, \ldots, N$
We are given data $\mathbf{x}_i$ for $i = 1, \ldots, N$

E-step: distribution of $z_i$ (responsibilities) using current parameters:
  $\gamma_{ij} = \frac{\pi_j\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \Sigma_k)}$
Expectation-Maximization Algorithm
M-step: estimate the parameters using the responsibilities, for $j = 1, \ldots, K$:
  $\pi_j = \frac{N_j}{N}, \quad \boldsymbol{\mu}_j = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij}\, \mathbf{x}_i, \quad \Sigma_j = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij}\, (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^{\top}, \quad N_j = \sum_{i=1}^{N} \gamma_{ij}$

Iterate until the likelihood converges.
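A compact EM sketch for a one-dimensional mixture with $K = 2$, following the updates above (data and initial values are synthetic assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])  # N = 500
N, K = len(X), 2

pi, mu, var = np.full(K, 1 / K), np.array([-1.0, 1.0]), np.ones(K)
for _ in range(100):
    # E-step: responsibilities gamma_ij proportional to pi_j * N(x_i | mu_j, var_j)
    dens = np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    gamma = pi * dens
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: maximize the expected log-likelihood
    Nk = gamma.sum(axis=0)
    pi, mu = Nk / N, (gamma * X[:, None]).sum(0) / Nk
    var = (gamma * (X[:, None] - mu) ** 2).sum(0) / Nk

print(pi, mu, var)   # should approach (0.6, 0.4), (-2, 3), (1, 0.25)
```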