
Page 1: Orasmaa Liin Chapter 6


MTAT.05.113 Bayesian Networks

Parameter Estimation

in Bayesian Networks

Siim Orasmaa, Krista Liin

Page 2: Orasmaa Liin Chapter 6


Introduction

We have

a Bayesian network structure S with parameters θ 

access to a database of cases D

We want to

estimate the parameters of the model (conditional probabilities) from the given cases

Two approaches:

Estimate the parameters once and for all

Adapt the model as each new case arrives

Page 3: Orasmaa Liin Chapter 6


Outline

Parameter estimation

Maximum likelihood estimation

Bayesian estimation

Incomplete data and EM algorithm

Adaptation

Type variables

Fractional updating

Fading

Tuning

Page 4: Orasmaa Liin Chapter 6


Complete data

Complete case:

a configuration over all the variables in the model

We assume the parameters can be learned

independently:

Global independence

 parameters for various variables are independent 

Local independence

uncertainties of the parameters for different parent configurations are independent

Page 5: Orasmaa Liin Chapter 6


Maximum likelihood estimation

Likelihood of model M given data D:

$L(M \mid D) = \prod_{d \in D} P(d \mid M)$

Choose the parameter set that maximizes the likelihood:

$\hat{\theta} = \arg\max_{\theta} L(M \mid D)$

Page 6: Orasmaa Liin Chapter 6


Maximum likelihood estimation II

MLE for probability matrices:

The intuition of using frequencies as estimates

Example, estimate for P(A=a | B=b, C=c):

$\hat{P}(A=a \mid B=b, C=c) = \frac{N(A=a,\, B=b,\, C=c)}{N(B=b,\, C=c)} = \frac{\text{positive counts}}{\text{total counts}}$

Drawback:

Outcomes with zero counts → zero probabilities
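A minimal sketch of this frequency-count estimate in Python (the variable names and the case format are assumptions for illustration, not from the slides):

    from collections import Counter

    def mle_conditional(cases, child, parents):
        """Maximum likelihood estimate of P(child | parents) from complete cases.
        Each case is a dict mapping variable name -> observed value."""
        joint = Counter()    # N(child = a, parents = pa)
        parent = Counter()   # N(parents = pa)
        for case in cases:
            pa = tuple(case[p] for p in parents)
            joint[(case[child], pa)] += 1
            parent[pa] += 1
        # P(child = a | parents = pa) = N(a, pa) / N(pa); unseen outcomes get probability zero
        return {(a, pa): n / parent[pa] for (a, pa), n in joint.items()}

    # e.g. estimate P(A | B, C) from three complete cases
    cases = [{"A": "a1", "B": "b1", "C": "c1"},
             {"A": "a2", "B": "b1", "C": "c1"},
             {"A": "a1", "B": "b1", "C": "c1"}]
    print(mle_conditional(cases, "A", ["B", "C"]))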

Page 7: Orasmaa Liin Chapter 6


Bayesian estimation

Alternative principle to MLE:

start with a prior distribution, and use experience (the database) to update the distribution

collapse the posterior distribution to the mean value and use this as the final value of the parameter

Even prior distribution:

e.g. add 1 virtual count to all variable state occurrences

Page 8: Orasmaa Liin Chapter 6


Bayesian estimation II

In case of binary variables:

Let X be a binary variable (yes, no) and suppose we have performed a number of independent experiments, of which n turned up yes and m turned up no. Then, starting with an even prior distribution for θ, the Bayesian estimate for P(X = yes) is

$\frac{n + 1}{n + m + 2}$
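A minimal sketch contrasting this estimate with the MLE from the previous slides (the function names are illustrative assumptions):

    def mle_estimate(n, m):
        # maximum likelihood: P(X = yes) = n / (n + m); zero counts give zero probabilities
        return n / (n + m)

    def bayes_estimate(n, m):
        # even prior = one virtual count per state: P(X = yes) = (n + 1) / (n + m + 2)
        return (n + 1) / (n + m + 2)

    print(mle_estimate(0, 5))    # 0.0
    print(bayes_estimate(0, 5))  # 0.142... -- the yes state is not ruled out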

Page 9: Orasmaa Liin Chapter 6


Incomplete data

Some of the cases in the database may contain missing values

How the data can be missing:

Missing at random (MAR)

the probability that a value is missing depends only on the observed (existing) values

Missing completely at random (MCAR)

probability is independent of the observed values

Neither MAR nor MCAR 

Page 10: Orasmaa Liin Chapter 6


EM Algorithm

Expectation-Maximization algorithm: to find maximum likelihood estimates for θ when the given dataset is incomplete (assuming MAR)

Starts with even or random probability distributions

Alternates between two steps:

Expectation step

„complete“ the data set by using the current parameter 

estimates (calculate expectations for missing values)

Maximization step

Use the „completed“ data set to find a new maximum likelihood estimate for the parameters

Page 11: Orasmaa Liin Chapter 6


EM Algorithm II

Calculating expected counts for a configuration:

If a case is inconsistent with the configuration, then it counts as 0

If a case contains the entire configuration, then it counts as 1

If the value for a variable is missing in a case, then it contributes a fractional count corresponding to the probability of seeing the configuration

If more than one value is missing, we need to calculate a joint probability → use the junction tree structure

Page 12: Orasmaa Liin Chapter 6


EM Algorithm III

The expected counts are then used in the M-step as if they were „real“ counts – the maximum likelihood estimate for a conditional probability is calculated as:

$\frac{\text{estimated positive counts}}{\text{estimated total counts}}$

The alternating procedure continues until the probabilities no longer change or until another termination criterion is met.
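A minimal sketch of the E/M alternation for a single conditional probability P(B = yes | A) with B sometimes missing; it enumerates the missing value directly instead of using a junction tree, and every name, the starting distribution and the termination threshold are assumptions for illustration:

    def em_p_b_given_a(cases, max_iter=100, tol=1e-6):
        """cases: list of (a, b) pairs; b is None when the value is missing (MAR)."""
        theta = {"yes": 0.5, "no": 0.5}   # theta[a] = current estimate of P(B = yes | A = a), even start
        for _ in range(max_iter):
            exp_yes = {"yes": 0.0, "no": 0.0}   # estimated positive counts
            total = {"yes": 0.0, "no": 0.0}     # estimated total counts
            for a, b in cases:                  # E-step: "complete" the data set
                total[a] += 1.0
                if b == "yes":
                    exp_yes[a] += 1.0           # observed case counts as 1
                elif b is None:
                    exp_yes[a] += theta[a]      # missing value: fractional count P(B = yes | A = a)
            # M-step: use the expected counts as if they were real counts
            new_theta = {a: exp_yes[a] / total[a] for a in theta if total[a] > 0}
            done = all(abs(new_theta[a] - theta[a]) < tol for a in new_theta)
            theta.update(new_theta)
            if done:
                break
        return theta

    cases = [("yes", "yes"), ("yes", None), ("yes", "no"), ("no", None), ("no", "no")]
    print(em_p_b_given_a(cases))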

Page 13: Orasmaa Liin Chapter 6


EM Algorithm IV

The EM algorithm can also be generalized to estimate the maximum a posteriori parameters

Virtual counts are added to both the numerator and the denominator in the M-step
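As an illustration of this (the virtual-count symbol $\alpha_k$ and the exact form are an assumption, not taken from the slides), the MAP version of the M-step could look like:

$\hat{\theta}_k = \frac{E[N_k] + \alpha_k}{\sum_j \left( E[N_j] + \alpha_j \right)}$

where $E[N_k]$ is the expected count of state $k$ and $\alpha_k$ its virtual count from the prior.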

Page 14: Orasmaa Liin Chapter 6


* EM Algorithm more formally

What we have:

$y_i$ – observable variables

$z_i$ – latent variables

$\theta$ – all possible parameters in the model

Main goal is to find:

$P(\theta \mid y_1 \ldots y_n) \propto P(y_1 \ldots y_n \mid \theta)\, P(\theta) \propto P(y_1 \ldots y_n \mid \theta)$

As $P(y_1 \ldots y_n \mid \theta) = \int P(y_1 \ldots y_n, z_1 \ldots z_n \mid \theta)\, dz$ is difficult to calculate and optimize, we use the auxiliary function

$Q(\theta \mid \theta^t) = \int P(z_1 \ldots z_n \mid \theta^t, y_1 \ldots y_n)\, \log P(\theta, z \mid y_1 \ldots y_n)\, dz$

Page 15: Orasmaa Liin Chapter 6


* EM Algorithm more formally II

What the EM algorithm does:

start from a random starting point $\theta^0$

E-step

Find the probabilities $p(z_1 \ldots z_n \mid \theta^t, y_1 \ldots y_n)$ for $z_1 \ldots z_n$ if all parameters are fixed to $\theta^t$

M-step

Now that $p(z_1 \ldots z_n \mid \theta^t, y_1 \ldots y_n)$ is fixed, find the $\theta^{t+1}$ that maximizes the integral:

$\theta^{t+1} = \arg\max_{\theta} Q(\theta \mid \theta^t)$

Page 16: Orasmaa Liin Chapter 6


Adaptation

Adapt to observed data – change the probability tables to better suit newly observed values

Why do we need adaptation:

New situation for an existing network

To have more accurate probabilities

Second-order uncertainty – when we are not sure how probable something is

General solution: allow each probability to range within an interval

Second-order uncertainty needs to be reduced


Page 17: Orasmaa Liin Chapter 6


Adaptation: Type variables

Main idea: add a modifiable parent node to the node which you are uncertain about.

Example: milk test

When adding type variables to the network, keep the new nodes d-separated


Page 18: Orasmaa Liin Chapter 6


Adaptation: Fractional updating

Main idea: iteratively update probability distributions

Example: $P(A \mid b_i, c_j)$

Prior distribution:

$P(A \mid b_i, c_j) = \left(\frac{n_1}{s}, \frac{n_2}{s}, \frac{n_3}{s}\right)$

1) new case $e = (a_1, b_i, c_j)$:

$P(A \mid b_i, c_j) = \left(\frac{n_1 + 1}{s + 1}, \frac{n_2}{s + 1}, \frac{n_3}{s + 1}\right)$

2) new case $e = (?, b_i, c_j)$, with $P(A \mid b_i, c_j, e) = (y_1, y_2, y_3)$:

$P(A \mid b_i, c_j) = \left(\frac{n_1 + y_1}{s + 1}, \frac{n_2 + y_2}{s + 1}, \frac{n_3 + y_3}{s + 1}\right)$

3) new case $e = (a_1, ?, ?)$, with $P(b_i, c_j \mid e) = z$:

$P(A \mid b_i, c_j) = \left(\frac{n_1 + z}{s + z}, \frac{n_2}{s + z}, \frac{n_3}{s + z}\right)$


Page 19: Orasmaa Liin Chapter 6


Adaptation: Fractional updating

In general:

$x_k = \frac{n_k + z \cdot y_k}{s + z} = \frac{n_k + P(a_k, b_i, c_j \mid e)}{s + P(b_i, c_j \mid e)}$

Drawback: overestimates the sample size
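A minimal sketch of this update for one parent configuration (the function and argument names are assumptions; in practice $P(a_k, b_i, c_j \mid e)$ and $P(b_i, c_j \mid e)$ would be obtained by propagating the case $e$ in the network):

    def fractional_update(counts, s, p_joint, p_parent):
        """counts: n_k for the states of A given (b_i, c_j); s: sample size.
        p_joint[k] = P(a_k, b_i, c_j | e); p_parent = P(b_i, c_j | e) for the new case e."""
        new_counts = [n_k + p_k for n_k, p_k in zip(counts, p_joint)]
        new_s = s + p_parent
        new_dist = [x_k / new_s for x_k in new_counts]   # x_k = (n_k + P(a_k,b_i,c_j|e)) / (s + P(b_i,c_j|e))
        return new_counts, new_s, new_dist

    # fully observed case e = (a_1, b_i, c_j): P(a_1,b_i,c_j|e) = 1 and P(b_i,c_j|e) = 1
    print(fractional_update([2, 1, 1], 4, [1.0, 0.0, 0.0], 1.0))
    # -> counts (3, 1, 1), sample size 5, distribution (0.6, 0.2, 0.2)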


Page 20: Orasmaa Liin Chapter 6


Adaptation: Fading

Main idea: older evidence is given less weight by multiplying it with a fading factor

Fading factor $q \in (0, 1)$

If new evidence arrives (case 1, a case supporting the first state):

$s := s \cdot q + 1;\quad x_1 := x_1 \cdot q + 1;\quad x_2 := x_2 \cdot q;\quad x_3 := x_3 \cdot q$

Effective sample size:

$s' = \frac{1}{1 - q}$


Page 21: Orasmaa Liin Chapter 6


Adaptation: Fading

Sample size

The sample size can be used to determine the fading factor:

$q' = \frac{s' - 1}{s'}$

The bigger the sample size, the more resistant the network is to change

The sample size can be different for each node in the network
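A minimal sketch of fading combined with fractional updating for fully observed cases (the names, the loop and the concrete numbers are assumptions for illustration):

    def fade_and_update(counts, s, q, observed_state):
        """Multiply the old counts and sample size by the fading factor q, then add the new case."""
        faded = [x * q for x in counts]
        faded[observed_state] += 1.0     # the new case supports this state
        return faded, s * q + 1.0

    counts, s = [2.0, 1.0, 1.0], 4.0
    for _ in range(1000):                # keep observing the first state
        counts, s = fade_and_update(counts, s, q=0.99, observed_state=0)
    print(s)                             # close to the effective sample size s' = 1 / (1 - q) = 100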


Page 22: Orasmaa Liin Chapter 6


Adaptation: alternative models

Main idea: if we're not sure whether our network's structure is correct, use several structures

Adjusting probabilities helps, but might not be enough to solve structural problems

Solutions

Gather data and learn a new structure

Use probabilities calculated over several weighted structures, recalculating the weights for new evidence

$P(A \mid e) = \sum_k w_k \cdot P(A \mid M_k, e)$

$w_k = P(M_k \mid e) = \frac{P(e \mid M_k) \cdot w_k}{\sum_j P(e \mid M_j) \cdot w_j}$
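A minimal sketch of the weight update and the averaged prediction (a hedged example; the numbers and the two-model setting are assumptions):

    def update_weights(weights, likelihoods):
        """weights: current w_k = P(M_k); likelihoods: P(e | M_k) for the new evidence e."""
        unnorm = [w * l for w, l in zip(weights, likelihoods)]
        z = sum(unnorm)
        return [u / z for u in unnorm]   # new w_k = P(M_k | e)

    def averaged_probability(weights, predictions):
        """predictions[k] = P(A | M_k, e); returns P(A | e) = sum_k w_k * P(A | M_k, e)."""
        return sum(w * p for w, p in zip(weights, predictions))

    w = update_weights([0.5, 0.5], [0.2, 0.05])   # the evidence favours the first structure
    print(w)                                      # [0.8, 0.2]
    print(averaged_probability(w, [0.7, 0.3]))    # 0.62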


Page 23: Orasmaa Liin Chapter 6


Tuning

We know what we want the probabilities of certain variables to be, and want to tune an existing network to suit it.

We have

Bayesian network BN

Evidence e

Existing probability $x = P(A \mid e) = (x_1, \ldots, x_n)$

Probabilities we want to have $y = (y_1, \ldots, y_n)$

Parameters we can adjust $t = (t_1, \ldots, t_m)$ with an initial set of values $t_0$

To do: change the values of t until x is close enough to y


Page 24: Orasmaa Liin Chapter 6


Tuning

To measure the difference: Euclidean distance

$\mathrm{dist}(x, y) = \sum_{i=1}^{n} (x_i - y_i)^2$

Method of gradient descent:

Calculate grad dist(x, y) with respect to the parameters t.

Give $t_0$ a displacement $\Delta t$ in the direction opposite to the direction of the gradient grad dist(x, y)$(t_0)$; that is, choose a step size $\alpha > 0$ and let $\Delta t = -\alpha\, \mathrm{grad}\, \mathrm{dist}(x, y)(t_0)$.

Iterate this procedure until the gradient is close to 0
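A minimal sketch of the descent loop with a numerically approximated gradient; the network is abstracted into a caller-supplied function x_of_t that returns P(A | e) for given parameter values t, and all names, the step size and the stopping thresholds are assumptions:

    def dist(x, y):
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

    def tune(x_of_t, t0, y, alpha=0.1, eps=1e-5, tol=1e-6, max_iter=1000):
        """Gradient descent on dist(x(t), y) over the tunable parameters t."""
        t = list(t0)
        for _ in range(max_iter):
            base = dist(x_of_t(t), y)
            grad = []
            for i in range(len(t)):              # finite-difference approximation of grad dist(x, y)
                t_eps = list(t)
                t_eps[i] += eps
                grad.append((dist(x_of_t(t_eps), y) - base) / eps)
            if all(abs(g) < tol for g in grad):  # stop when the gradient is close to 0
                break
            t = [ti - alpha * gi for ti, gi in zip(t, grad)]   # displacement opposite to the gradient
        return t

    # toy stand-in for P(A | e) controlled by a single parameter t_1
    x_of_t = lambda t: (t[0], 1.0 - t[0])
    print(tune(x_of_t, [0.2], y=(0.7, 0.3)))     # moves t_1 towards 0.7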


Page 25: Orasmaa Liin Chapter 6


Thank you for listening