Introduction to Graphical Models
Yung-Kyun Noh
Seoul National University
Pattern Recognition and Machine Learning Winter School
Seoul National University Museum, Seoul
2018. 1. 24-25. (Wed.-Thu.)
Probabilistic Assumption and Bayes Classification
• Bayes Error

[Figure: two class-conditional densities p1(x) and p2(x) over the data space]
Learning
• Data: $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
• Learning: learn a prediction function $f \in \mathcal{H}$ from data ($\mathcal{H}$: hypothesis set/candidate set)
• Prediction: predict the output for unseen data, relying on regularity
Quantify the Evaluation
• Measure of quality: expected loss
  $R(f) = \mathbb{E}[\,l(f(\mathbf{x}), y)\,]$
• Estimated error:
  $\hat{R}_N(f) = \frac{1}{N} \sum_{i=1}^{N} l(f(\mathbf{x}_i), y_i)$
• Examples
  – Classification: 0-1 loss, $l(f(\mathbf{x}), y) = \mathbf{1}[f(\mathbf{x}) \neq y]$
  – Regression: squared loss, $l(f(\mathbf{x}), y) = (f(\mathbf{x}) - y)^2$
($l$: loss function)
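As a minimal sketch of the estimated error (function and variable names are mine, not the slides'), the empirical loss is just the average loss over the $N$ training pairs:

```python
import numpy as np

def empirical_error(y_pred, y_true, loss="zero_one"):
    """Estimated error: (1/N) * sum_i l(f(x_i), y_i)."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    if loss == "zero_one":           # classification (0-1 loss)
        return np.mean(y_pred != y_true)
    if loss == "squared":            # regression (squared loss)
        return np.mean((y_pred - y_true) ** 2)
    raise ValueError(loss)

# One of four predictions is wrong -> 0-1 error 0.25
print(empirical_error([1, 0, 1, 1], [1, 0, 0, 1]))
```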
Consistent Learner
• For each fixed $f$, the estimated error satisfies $\hat{R}_N(f) \to R(f)$ as $N \to \infty$ (law of large numbers).
• Caution:
  – The definition of consistency is not this pointwise convergence; what is required is uniform convergence over the hypothesis set:
    $\sup_{f \in \mathcal{H}} |\hat{R}_N(f) - R(f)| \to 0$  <Uniform convergence>
Quiz 1
• Consider a hypothesis set $\mathcal{H}$ which satisfies $\hat{R}_N(f) \to R(f)$ for every $f \in \mathcal{H}$.
Explain why learning with $\mathcal{H}$ is not necessarily consistent
even though it satisfies this convergence.
What is the possible problem in this case?
Consistency and Bayes Error
• Minimizing expected error (objective) vs.
minimizing estimated error

[Figure: error versus classifier complexity for, e.g., a linear classifier with regularization. The horizontal axis runs from simpler to more complex classifiers; the curves show the expected error and the error for the training data (train error), with the Bayes error as the lower bound; $\hat{f}$: optimal parameter with $N$ data.]
Consistency and Bayes Error
• Consistent learner with many data

[Figure: with many data, the train error and the expected error both approach the minimum error of the consistent learner within $\mathcal{H}$ (e.g., a linear classifier with regularization), which can still lie above the Bayes error; $\hat{f}$: optimal parameter.]

Consistency does not necessarily imply achieving the Bayes error!
Related Terms with Confining $\mathcal{H}$
• Linear model
  – VC-dimension for classification = dimensionality + 1
• Small number of parameters
• Large margin
• Regularization
• Bias-variance trade-off
• Generalization ability, overfitting
Many terms are theoretically connected.
Learning for Prediction
• Method:
  – Learn from labeled examples so that the classifier can label an unseen datum
  [Based on the assumption of regularity]

[Figure: labeled example images (car, plane, ship, horse) are used to learn a classifier, which then assigns a label to a new unseen datum (a horse).]
Representation of Data
• Each datum is one point in a data space: $\mathbf{x} = [1, 2, 5, 10, \ldots]^{T}$ ($D$ elements)

[Figure: a datum plotted as a single point in the data space]
Bayes Optimal Classifier
• Our ultimate goal is not zero error.

[Figure: overlapping class-conditional densities; the overlap gives the (optimal) Bayes error. Figure credit: Masashi Sugiyama]
Gaussian Random Variable
Principal axes of the density contours are the eigenvector directions of the covariance matrix $\Sigma$ ($\lambda_i$: eigenvalues of $\Sigma$).

[Figure: Gaussian equal-density ellipse with principal axes along the eigenvectors of $\Sigma$]
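A small NumPy check of this fact (the covariance matrix below is an illustrative assumption): the eigenvectors of $\Sigma$ are the principal axes, and the eigenvalues $\lambda_i$ are the variances along them.

```python
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])        # example covariance matrix

lams, U = np.linalg.eigh(Sigma)       # Sigma = U diag(lams) U^T
print("eigenvalues (variances along principal axes):", lams)
print("principal axes (columns):\n", U)

# Sanity check: the eigendecomposition reconstructs Sigma
assert np.allclose(U @ np.diag(lams) @ U.T, Sigma)
```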
Covariance Matrix and Projection
For data with covariance $\Sigma$ ($\lambda_i$: eigenvalues), the projection onto a direction $\mathbf{w}$ with $\|\mathbf{w}\| = 1$ has variance $\mathbf{w}^{\top} \Sigma\, \mathbf{w}$.
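A quick numerical check of this identity on synthetic samples: the variance of the projections $\mathbf{w}^{\top}\mathbf{x}$ matches $\mathbf{w}^{\top} \Sigma\, \mathbf{w}$.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[3.0, 1.0], [1.0, 2.0]])
X = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=200_000)

w = np.array([1.0, 1.0])
w /= np.linalg.norm(w)          # ||w|| = 1

print(np.var(X @ w))            # empirical variance of the projected samples
print(w @ Sigma @ w)            # w^T Sigma w -- should be close
```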
Parameter Estimation
• Maximum Likelihood Estimation
  Data: $\{\mathbf{x}_i\}_{i=1}^{N}$
  Mean vector: $\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i$
  Covariance matrix: $\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^{\top}$
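Both estimators are one-liners in NumPy; a minimal sketch on synthetic data (note the MLE divides by $N$, not $N-1$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # N = 1000 data points, D = 3

mu_hat = X.mean(axis=0)               # (1/N) sum_i x_i
Xc = X - mu_hat
Sigma_hat = (Xc.T @ Xc) / len(X)      # (1/N) sum_i (x_i - mu)(x_i - mu)^T

print(mu_hat)
print(Sigma_hat)
```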
Naïve Bayes As An Extreme Case
• Naïve Bayes
  – Simply ignores every correlation and dependency between variables:
    $p(\mathbf{x}) = \prod_{d=1}^{D} p_d(x_d)$
• True decomposition (chain rule):
    $p(\mathbf{x}) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_D \mid x_1, \ldots, x_{D-1})$
  Naïve Bayes approximates this by $p_1(x_1)\, p_2(x_2) \cdots p_D(x_D)$.
Number of Parameters
• $D$ = 1000
  – Number of parameters of a Gaussian:
    $1000 + \frac{1000 \cdot 1001}{2} = 501{,}500$
Independence
• $p(\mathbf{x}) = p_1(\mathbf{x}_1)\, p_2(\mathbf{x}_2)$, with $D = D_1 + D_2$
• $D_1$ = 500, $D_2$ = 500
  – Number of parameters:
    $\left(500 + \frac{500 \cdot 501}{2}\right) + \left(500 + \frac{500 \cdot 501}{2}\right) = 251{,}500$
• Incorporating one independence can cut the number of parameters roughly in half.
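A quick check of the two counts above (the helper name is mine): a $D$-dimensional Gaussian needs $D$ mean parameters plus $D(D+1)/2$ covariance parameters.

```python
def gaussian_params(D):
    """Mean (D) plus symmetric covariance (D * (D + 1) / 2)."""
    return D + D * (D + 1) // 2

print(gaussian_params(1000))                        # 501500: one joint Gaussian
print(gaussian_params(500) + gaussian_params(500))  # 251500: two independent blocks
```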
Independency
• Correlation and independency: $\mathbf{x}_a \perp\!\!\!\perp \mathbf{x}_b$

[Figure: scatter plots of uncorrelated data; which model fits, Naïve Bayes or a mixture of Gaussians?]

For jointly Gaussian variables, independency is equivalent to no correlation.
Graphical Models
• We utilize probabilities that are represented by a graph structure (directed and undirected), using the probabilistic independencies and conditional independencies that the graph structure can capture.
Directed Graphical Models
• Factorization of a (large) joint pdf:
  $p(x_1, \ldots, x_D) = \prod_{d=1}^{D} p(x_d \mid \mathrm{pa}(x_d))$  ($\mathrm{pa}(x_d)$: parents of $x_d$ in the graph)
• For given data, make a model for each decomposed probability, then estimate the parameters separately.
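A toy sketch of this recipe, assuming a hypothetical two-variable chain $X_1 \to X_2$ with binary variables: each factor of $p(x_1, x_2) = p(x_1)\,p(x_2 \mid x_1)$ is estimated separately from counts.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic binary data generated from a chain X1 -> X2
x1 = rng.random(10_000) < 0.3
x2 = np.where(x1, rng.random(10_000) < 0.9, rng.random(10_000) < 0.2)

# Estimate each decomposed probability separately
p_x1 = x1.mean()                                   # P(X1 = 1)
p_x2_given_x1 = [x2[~x1].mean(), x2[x1].mean()]    # P(X2=1 | X1=0), P(X2=1 | X1=1)

print(p_x1)             # ~0.3
print(p_x2_given_x1)    # ~[0.2, 0.9]
```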
D-Separations
• Causal path ($X_1 \to X_2 \to X_3$): $X_1 \perp\!\!\!\perp X_3 \mid X_2$
• Common cause ($X_2 \leftarrow X_1 \to X_3$): $X_2 \perp\!\!\!\perp X_3 \mid X_1$
• Common effect ($X_1 \to X_3 \leftarrow X_2$): $X_1 \perp\!\!\!\perp X_2$, but $X_1 \not\perp\!\!\!\perp X_2 \mid X_3$
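The common-effect case is the counterintuitive one. A small simulation (all variables synthetic) shows $X_1$ and $X_2$ uncorrelated marginally but strongly dependent once the effect $X_3$ is observed:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100_000)
x2 = rng.normal(size=100_000)                    # independent of x1
x3 = x1 + x2 + 0.1 * rng.normal(size=100_000)    # common effect of x1 and x2

print(np.corrcoef(x1, x2)[0, 1])                 # ~0: marginally independent

sel = np.abs(x3) < 0.1                           # condition on x3 near 0
print(np.corrcoef(x1[sel], x2[sel])[0, 1])       # strongly negative: dependent given x3
```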
Continuous Random Variables
• Jointly Gaussian with 6 variables
  – Need $6 + \frac{6 \cdot 7}{2} = 27$ parameters
• Using a graphical model:
  – Need to obtain the parameters of each conditional density in the factorization
  – We need to estimate only 19 parameters
Filtering
• Linear Dynamical Systems (LDS) with Gaussian noise

[Figure: state chain $X_1 \to X_2 \to X_3 \to \cdots \to X_{k-1} \to X_k$, with an observation $Y_i$ emitted from each state $X_i$]
Kalman Filter
• Filtering is a "conditional marginalization": marginalizing $x_0, \ldots, x_{T-1}$, i.e., marginalization from the left of the chain.
Kalman Filter
• The unconstrained distribution: the distribution over the states before conditioning on observations; also the joint density if necessary.
Kalman Filter
• Marginalizing $x_0$: the constrained distribution takes the same form, and its parameters come from the unconstrained distribution. What matters is the covariance.
Kalman Filter
• Marginalizing $x_1, \ldots, x_{T-1}$: what matters is how we can find the covariance of the joint density function!
Kalman Filter
• Filtering as "conditional marginalization": marginalization from the left.
• Why filtering? Once we know the filtered distribution at the current step, we don't have to know (or keep) the past states and observations.
Kalman Filter
• Time update
  – Similar to the unconstrained-distribution calculation!
• Measurement update
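A compact sketch of the two updates in the standard LDS notation, which is an assumption here since the slide's symbols are not shown: transition $A$, observation $C$, and noise covariances $Q$ and $R$. The time update predicts ($\mu_{k|k-1} = A\mu_{k-1}$, $P_{k|k-1} = A P_{k-1} A^{\top} + Q$), and the measurement update corrects via the Kalman gain.

```python
import numpy as np

def kalman_step(mu, P, y, A, C, Q, R):
    """One Kalman filter recursion: time update, then measurement update."""
    # Time update (predict): same calculation as for the unconstrained distribution
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Measurement update (correct)
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ C) @ P_pred
    return mu_new, P_new

# 1-D toy example: a random walk observed with noise
A = C = np.eye(1)
Q, R = np.array([[0.01]]), np.array([[1.0]])
mu, P = np.zeros(1), np.eye(1)
for y in [0.9, 1.1, 1.0]:
    mu, P = kalman_step(mu, P, np.array([y]), A, C, Q, R)
print(mu, P)
```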
Kalman Filter
Applications and history:
http://afrl.dodlive.mil/2012/04/03/the-kalman-filter-a-tool-we-use-everyday/
http://www.wpafb.af.mil/News/Article-Display/Article/401306/computer-mouse-kalman-filter-trace-origins-to-air-force-basic-research-funding/
Markov Random Field
• Undirected graph

[Figure: undirected graph over $X_1, \ldots, X_5$ with edges $X_1$–$X_2$, $X_1$–$X_3$, $X_3$–$X_4$, $X_3$–$X_5$, $X_4$–$X_5$]

If there is a direct edge between $X_i$ and $X_j$: $X_i \not\perp\!\!\!\perp X_j \mid X_{\setminus \{i,j\}}$
If there is no direct edge between $X_i$ and $X_j$: $X_i \perp\!\!\!\perp X_j \mid X_{\setminus \{i,j\}}$

Ex.: $X_1 \not\perp\!\!\!\perp X_3 \mid X_2, X_4, X_5$, while $X_1 \perp\!\!\!\perp X_5 \mid X_3$
Joint Distribution
• Product of functions on cliques:

$P(X_1, X_2, X_3, X_4, X_5) = \frac{1}{Z}\, \psi_{1,2}(X_1, X_2)\, \psi_{1,3}(X_1, X_3)\, \psi_{3,4,5}(X_3, X_4, X_5)$

$\left( Z = \sum_{X_1, X_2, X_3, X_4, X_5} \psi_{1,2}(X_1, X_2)\, \psi_{1,3}(X_1, X_3)\, \psi_{3,4,5}(X_3, X_4, X_5) \right)$

Cliques: $\{X_1, X_2\}$, $\{X_1, X_3\}$, $\{X_3, X_4, X_5\}$

The set of (strictly positive) distributions satisfying the MRF conditions (Markov random field)
= the set of distributions decomposed by cliques (Gibbs random field)
(Hammersley-Clifford theorem)
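A brute-force sketch of this factorization for five binary variables; only the clique structure comes from the slide, and the potential values below are made-up assumptions.

```python
import itertools
import numpy as np

# Hypothetical clique potentials on the cliques {1,2}, {1,3}, {3,4,5}
psi12 = lambda a, b: 2.0 if a == b else 1.0
psi13 = lambda a, b: 3.0 if a == b else 1.0
psi345 = lambda a, b, c: 1.0 + a * b * c

def unnorm(x1, x2, x3, x4, x5):
    """Unnormalized joint: product of the clique potentials."""
    return psi12(x1, x2) * psi13(x1, x3) * psi345(x3, x4, x5)

# Partition function Z: sum over all 2^5 configurations
Z = sum(unnorm(*x) for x in itertools.product([0, 1], repeat=5))
P = lambda *x: unnorm(*x) / Z

print(Z, P(1, 1, 1, 1, 1))
assert np.isclose(sum(P(*x) for x in itertools.product([0, 1], repeat=5)), 1.0)
```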
More Fancy Models
• Latent Variable Model
• Filtering

[Figure: a latent chain $Z_1 \to Z_2 \to \cdots \to Z_{k-1} \to Z_k$ emitting observations $Y_1, \ldots, Y_k$, together with small directed models over variables $Z$, $Y$, $X$]
Topic Models
http://machinelearning.snu.ac.kr/NIPS2015/NIPS2015_Accepted/nipsnice.html
http://cs.stanford.edu/people/karpathy/nipspreview/
Naïve Bayes for Classification

$P(X, Y) = P(Y)\, P(X_1 \mid Y) \cdots P(X_D \mid Y)$

[Figure: label $Y$ with children $X_1, \ldots, X_D$]
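A minimal count-based sketch of this classifier on synthetic binary features (all names and numbers are illustrative): estimate $P(Y)$ and each $P(X_d \mid Y)$ separately, then classify by the largest posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5000, 4
y = (rng.random(N) < 0.4).astype(int)                # labels
theta_true = np.where(y[:, None] == 1, 0.8, 0.3)     # true P(X_d = 1 | Y)
X = (rng.random((N, D)) < theta_true).astype(int)

# Estimate P(Y) and each P(X_d | Y) separately (with Laplace smoothing)
prior = np.array([(y == c).mean() for c in (0, 1)])
theta = np.array([(X[y == c].sum(0) + 1) / ((y == c).sum() + 2) for c in (0, 1)])

def predict(x):
    # log P(Y=c) + sum_d log P(X_d = x_d | Y=c), for c = 0, 1
    logp = np.log(prior) + (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(1)
    return int(np.argmax(logp))

print(predict(np.ones(D, int)), predict(np.zeros(D, int)))   # likely 1, 0
```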
Undirected Graphical Models
• Potential functions on cliques:
  $P(X) = \frac{1}{Z} \prod_{c} \psi_c(\mathbf{X}_c)$  ($\mathbf{X}_1, \mathbf{X}_2, \ldots$: maximal cliques)
• Partition function:
  – Discrete: $Z = \sum_{X} \prod_{c} \psi_c(\mathbf{X}_c)$
  – Continuous: $Z = \int \prod_{c} \psi_c(\mathbf{X}_c)\, dX$
Undirected Graphical Models
Ex. 2-cliques: a potential function assigns a value to each configuration of its clique, e.g., on the clique $\{X_3, X_4\}$:

X3  X4 | Φ(X3, X4)
 1   1 |  1
 1   0 |  0
 0   1 |  0
 0   0 |  3

and on a three-variable configuration table (variable names here are placeholders):

Xa  Xb  Xc | value
 1   1   1 |  2
 1   1   0 |  0
 1   0   1 |  0
 1   0   0 |  1
 0   1   1 |  2
 0   1   0 |  0
 0   0   1 |  0
 0   0   0 |  1
Estimating Parameters
Ex. 2-cliques over four binary variables, with one three-variable clique and one two-variable clique:
– potential table on the three-variable clique: $2^3 = 8$ entries
– potential table on the two-variable clique: $2^2 = 4$ entries
– total: 12 parameters
Without the graphical model, the full joint table over four binary variables needs $2^4 - 1 = 15$ parameters.
Conditional Independency
Ex. 2-cliques over four binary variables, with cliques $\{X_1, X_2, X_3\}$ and $\{X_3, X_4\}$: compute $P(X_4 \mid X_1, X_2, X_3)$ with $X_3$ fixed (say $X_3 = 1$) for every configuration of $(X_1, X_2)$.
All answers are the same: $X_4 \perp\!\!\!\perp X_1, X_2 \mid X_3$.
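This can be verified by brute force. Below, the two clique tables are random (hypothetical) potentials sharing only $X_3$; the conditional $P(X_4 = 1 \mid X_1, X_2, X_3 = 1)$ comes out identical for every $(X_1, X_2)$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
psi123 = rng.random((2, 2, 2)) + 0.5   # hypothetical clique table on (X1, X2, X3)
psi34 = rng.random((2, 2)) + 0.5       # hypothetical clique table on (X3, X4)

def p(x1, x2, x3, x4):
    # Unnormalized joint; the partition function cancels in the ratio below
    return psi123[x1, x2, x3] * psi34[x3, x4]

# P(X4 = 1 | X1, X2, X3 = 1) for every (X1, X2): all answers are the same
for x1, x2 in itertools.product([0, 1], repeat=2):
    num = p(x1, x2, 1, 1)
    den = p(x1, x2, 1, 0) + p(x1, x2, 1, 1)
    print(x1, x2, num / den)
```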
Introducing Latent Variables
• Issue: how can a model be simplified as much as possible while keeping enough flexibility to incorporate the true dependency?
Expectation-Maximization Algorithm
• Parameter estimation with latent variables
  – We don't have data for the latent variables
• E-step:
  – Data for the latent variables are obtained by expectation under the current parameter values.
• M-step:
  – With the expected latent variables, the parameters are obtained by maximizing the likelihood.
• The E-step and M-step are repeated back and forth until the likelihood converges.
Expectation-Maximization Algorithm
• Gaussian mixture model:
  $p(\mathbf{x}) = \sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \Sigma_j)$

Parameters: $\pi_j, \boldsymbol{\mu}_j, \Sigma_j$ for $j = 1, \ldots, K$
Unknown (latent) variables: $z_i$ for $i = 1, \ldots, N$
We are given data $\mathbf{x}_i$ for $i = 1, \ldots, N$

E-step: distribution of $z_i$ (responsibilities) using current parameters:
  $\gamma_{ij} = \frac{\pi_j\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \Sigma_k)}$
Expectation-Maximization Algorithm
M-step: estimate the parameters using the responsibilities, for $j = 1, \ldots, K$:
  $\pi_j = \frac{N_j}{N}, \quad \boldsymbol{\mu}_j = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij}\, \mathbf{x}_i, \quad \Sigma_j = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij}\, (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^{\top}, \quad N_j = \sum_{i=1}^{N} \gamma_{ij}$

Iterate until the likelihood converges.
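A compact EM sketch for a one-dimensional mixture with $K = 2$, following the updates above (data and initial values are synthetic assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])  # N = 500
N, K = len(X), 2

pi, mu, var = np.full(K, 1 / K), np.array([-1.0, 1.0]), np.ones(K)
for _ in range(100):
    # E-step: responsibilities gamma_ij proportional to pi_j * N(x_i | mu_j, var_j)
    dens = np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    gamma = pi * dens
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: maximize the expected log-likelihood
    Nk = gamma.sum(axis=0)
    pi, mu = Nk / N, (gamma * X[:, None]).sum(0) / Nk
    var = (gamma * (X[:, None] - mu) ** 2).sum(0) / Nk

print(pi, mu, var)   # should approach (0.6, 0.4), (-2, 3), (1, 0.25)
```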