
Page 1: Orasmaa Liin Chapter 6


MTAT.05.113 Bayesian Networks

Parameter Estimation

in Bayesian Networks

Siim Orasmaa, Krista Liin

Page 2: Orasmaa Liin Chapter 6


Introduction

We have

a Bayesian network structure S with parameters θ 

access to a database of cases D

We want to

estimate the parameters of the model (conditional probabilities) from the given cases

Two approaches:

Estimate the parameters once and for all

Adapt the model as each new case arrives

Page 3: Orasmaa Liin Chapter 6


Outline

Parameter estimation

Maximum likelihood estimation

Bayesian estimation

Incomplete data and EM algorithm

Adaptation

Type variables

Fractional updating

Fading

Tuning

Page 4: Orasmaa Liin Chapter 6


Complete data

Complete case:

a configuration over all the variables in the model

We assume the parameters can be learned

independently:

Global independence

 parameters for various variables are independent 

Local independence

uncertainties of the parameters for different parent configurations are independent

Page 5: Orasmaa Liin Chapter 6


Maximum likelihood estimation

Likelihood of model M given data D:

$L(M \mid D) = \prod_{d \in D} P(d \mid M)$

Choose the parameter set that maximizes the likelihood:

$\hat{\theta} = \arg\max_{\theta} L(M \mid D)$

Page 6: Orasmaa Liin Chapter 6


Maximum likelihood estimation II

MLE for probability matrices:

The intuition of using frequencies as estimates

Example, estimate for P(A=a | B=b, C=c):

$\hat{P}(A=a \mid B=b, C=c) = \frac{N(A=a,\, B=b,\, C=c)}{N(B=b,\, C=c)} = \frac{\text{positive counts}}{\text{total counts}}$

Drawback:

Outcomes with zero counts → zero probabilities
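A minimal sketch of this frequency-count estimate in Python (the variable names and the case format are assumptions for illustration, not from the slides):

    from collections import Counter

    def mle_conditional(cases, child, parents):
        """Maximum likelihood estimate of P(child | parents) from complete cases.
        Each case is a dict mapping variable name -> observed value."""
        joint = Counter()    # N(child = a, parents = pa)
        parent = Counter()   # N(parents = pa)
        for case in cases:
            pa = tuple(case[p] for p in parents)
            joint[(case[child], pa)] += 1
            parent[pa] += 1
        # P(child = a | parents = pa) = N(a, pa) / N(pa); unseen outcomes get probability zero
        return {(a, pa): n / parent[pa] for (a, pa), n in joint.items()}

    # e.g. estimate P(A | B, C) from three complete cases
    cases = [{"A": "a1", "B": "b1", "C": "c1"},
             {"A": "a2", "B": "b1", "C": "c1"},
             {"A": "a1", "B": "b1", "C": "c1"}]
    print(mle_conditional(cases, "A", ["B", "C"]))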

Page 7: Orasmaa Liin Chapter 6


Bayesian estimation

Alternative principle to MLE:

start with a prior distribution, and use experience (the database) to update the distribution

collapse the posterior distribution to the mean value and use this as the final value of the parameter

Even prior distribution:

e.g. add 1 virtual count to all variable state occurrences

Page 8: Orasmaa Liin Chapter 6


Bayesian estimation II

In case of binary variables:

Let X be a binary variable (yes, no) and suppose we have performed a number of independent experiments, of which n turned up yes and m turned up no. Then, starting with an even prior distribution for θ, the Bayesian estimate for P(X = yes) is

$\frac{n + 1}{n + m + 2}$
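A minimal sketch contrasting this estimate with the MLE from the previous slides (the function names are illustrative assumptions):

    def mle_estimate(n, m):
        # maximum likelihood: P(X = yes) = n / (n + m); zero counts give zero probabilities
        return n / (n + m)

    def bayes_estimate(n, m):
        # even prior = one virtual count per state: P(X = yes) = (n + 1) / (n + m + 2)
        return (n + 1) / (n + m + 2)

    print(mle_estimate(0, 5))    # 0.0
    print(bayes_estimate(0, 5))  # 0.142... -- the yes state is not ruled out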

Page 9: Orasmaa Liin Chapter 6


Incomplete data

Some of the cases in the database may contain missing values

How the data can be missing:

Missing at random (MAR)

the probability that a value is missing depends only on the observed (existing) values

Missing completely at random (MCAR)

probability is independent of the observed values

Neither MAR nor MCAR 

Page 10: Orasmaa Liin Chapter 6


EM Algorithm

Expectation-Maximization algorithm: to find maximum likelihood estimates for θ when the given dataset is incomplete (assuming MAR)

Starts with even or random probability distributions

Alternates between two steps:

Expectation step

„complete“ the data set by using the current parameter 

estimates (calculate expectations for missing values)

Maximization step

Use the „completed“ data set to find a new maximum likelihood estimate for the parameters

Page 11: Orasmaa Liin Chapter 6


EM Algorithm II

Calculating expected counts for a configuration:

If a case is inconsistent with the configuration, then it counts as 0

If a case contains the entire configuration, then it counts as 1

If the value for a variable is missing in a case, then it contributes a fractional count corresponding to the probability of seeing the configuration

If more than one value is missing, we need to calculate a joint probability → use the junction tree structure

Page 12: Orasmaa Liin Chapter 6


EM Algorithm III

The expected counts are then used in the M-step as if they were „real“ counts – the maximum likelihood estimate for a conditional probability is calculated as:

$\frac{\text{estimated positive counts}}{\text{estimated total counts}}$

The alternating procedure continues until the probabilities no longer change or until another termination criterion is met.
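A minimal sketch of the E/M alternation for a single conditional probability P(B = yes | A) with B sometimes missing; it enumerates the missing value directly instead of using a junction tree, and every name, the starting distribution and the termination threshold are assumptions for illustration:

    def em_p_b_given_a(cases, max_iter=100, tol=1e-6):
        """cases: list of (a, b) pairs; b is None when the value is missing (MAR)."""
        theta = {"yes": 0.5, "no": 0.5}   # theta[a] = current estimate of P(B = yes | A = a), even start
        for _ in range(max_iter):
            exp_yes = {"yes": 0.0, "no": 0.0}   # estimated positive counts
            total = {"yes": 0.0, "no": 0.0}     # estimated total counts
            for a, b in cases:                  # E-step: "complete" the data set
                total[a] += 1.0
                if b == "yes":
                    exp_yes[a] += 1.0           # observed case counts as 1
                elif b is None:
                    exp_yes[a] += theta[a]      # missing value: fractional count P(B = yes | A = a)
            # M-step: use the expected counts as if they were real counts
            new_theta = {a: exp_yes[a] / total[a] for a in theta if total[a] > 0}
            done = all(abs(new_theta[a] - theta[a]) < tol for a in new_theta)
            theta.update(new_theta)
            if done:
                break
        return theta

    cases = [("yes", "yes"), ("yes", None), ("yes", "no"), ("no", None), ("no", "no")]
    print(em_p_b_given_a(cases))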

Page 13: Orasmaa Liin Chapter 6


EM Algorithm IV

The EM algorithm can also be generalized to estimate the maximum a posteriori parameters

Virtual counts are added to both the numerator and the denominator in the M-step
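As an illustration of this (the virtual-count symbol $\alpha_k$ and the exact form are an assumption, not taken from the slides), the MAP version of the M-step could look like:

$\hat{\theta}_k = \frac{E[N_k] + \alpha_k}{\sum_j \left( E[N_j] + \alpha_j \right)}$

where $E[N_k]$ is the expected count of state $k$ and $\alpha_k$ its virtual count from the prior.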

Page 14: Orasmaa Liin Chapter 6


* EM Algorithm more formally

What we have:

$y_i$ – observable variables

$z_i$ – latent variables

$\theta$ – all possible parameters in the model

Main goal is to find:

$P(\theta \mid y_1 \ldots y_n) \propto P(y_1 \ldots y_n \mid \theta)\, P(\theta) \propto P(y_1 \ldots y_n \mid \theta)$

As $P(y_1 \ldots y_n \mid \theta) = \int P(y_1 \ldots y_n, z_1 \ldots z_n \mid \theta)\, dz$ is difficult to calculate and optimize, we use the auxiliary function

$Q(\theta \mid \theta^t) = \int P(z_1 \ldots z_n \mid \theta^t, y_1 \ldots y_n)\, \log P(\theta, z \mid y_1 \ldots y_n)\, dz$

Page 15: Orasmaa Liin Chapter 6


* EM Algorithm more formally II

What the EM algorithm does:

start from a random starting point $\theta^0$

E-step

Find the probabilities $p(z_1 \ldots z_n \mid \theta^t, y_1 \ldots y_n)$ for $z_1 \ldots z_n$ if all parameters are fixed to $\theta^t$

M-step

Now that $p(z_1 \ldots z_n \mid \theta^t, y_1 \ldots y_n)$ is fixed, find the $\theta^{t+1}$ that maximizes the integral:

$\theta^{t+1} = \arg\max_{\theta} Q(\theta \mid \theta^t)$

Page 16: Orasmaa Liin Chapter 6


Adaptation

Adapt to observed data – change the probability tables to better suit newly observed values

Why do we need adaptation:

New situation for an existing network

To have more accurate probabilities

Second-order uncertainty – when we are not sure how probable something is

General solution: allow each probability to range within an interval

Second-order uncertainty needs to be reduced


Page 17: Orasmaa Liin Chapter 6


Adaptation: Type variables

Main idea: add a modifiable parent node to the node which you are uncertain about.

Example: milk test

When adding type variables to the network, keep the new nodes d-separated


Page 18: Orasmaa Liin Chapter 6


Adaptation: Fractional updating

Main idea: iteratively update probability distributions

Example: $P(A \mid b_i, c_j)$

Prior distribution:

$P(A \mid b_i, c_j) = \left(\frac{n_1}{s}, \frac{n_2}{s}, \frac{n_3}{s}\right)$

1) new case $e = (a_1, b_i, c_j)$:

$P(A \mid b_i, c_j) = \left(\frac{n_1 + 1}{s + 1}, \frac{n_2}{s + 1}, \frac{n_3}{s + 1}\right)$

2) new case $e = (?, b_i, c_j)$, with $P(A \mid b_i, c_j, e) = (y_1, y_2, y_3)$:

$P(A \mid b_i, c_j) = \left(\frac{n_1 + y_1}{s + 1}, \frac{n_2 + y_2}{s + 1}, \frac{n_3 + y_3}{s + 1}\right)$

3) new case $e = (a_1, ?, ?)$, with $P(b_i, c_j \mid e) = z$:

$P(A \mid b_i, c_j) = \left(\frac{n_1 + z}{s + z}, \frac{n_2}{s + z}, \frac{n_3}{s + z}\right)$


Page 19: Orasmaa Liin Chapter 6


Adaptation: Fractional updating

In general:

$x_k = \frac{n_k + z \cdot y_k}{s + z} = \frac{n_k + P(a_k, b_i, c_j \mid e)}{s + P(b_i, c_j \mid e)}$

Drawback: overestimates the sample size
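A minimal sketch of this update for one parent configuration (the function and argument names are assumptions; in practice $P(a_k, b_i, c_j \mid e)$ and $P(b_i, c_j \mid e)$ would be obtained by propagating the case $e$ in the network):

    def fractional_update(counts, s, p_joint, p_parent):
        """counts: n_k for the states of A given (b_i, c_j); s: sample size.
        p_joint[k] = P(a_k, b_i, c_j | e); p_parent = P(b_i, c_j | e) for the new case e."""
        new_counts = [n_k + p_k for n_k, p_k in zip(counts, p_joint)]
        new_s = s + p_parent
        new_dist = [x_k / new_s for x_k in new_counts]   # x_k = (n_k + P(a_k,b_i,c_j|e)) / (s + P(b_i,c_j|e))
        return new_counts, new_s, new_dist

    # fully observed case e = (a_1, b_i, c_j): P(a_1,b_i,c_j|e) = 1 and P(b_i,c_j|e) = 1
    print(fractional_update([2, 1, 1], 4, [1.0, 0.0, 0.0], 1.0))
    # -> counts (3, 1, 1), sample size 5, distribution (0.6, 0.2, 0.2)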


Page 20: Orasmaa Liin Chapter 6


Adaptation: Fading

Main idea: older evidence is given less weight by multiplying it with a fading factor

Fading factor $q \in (0, 1)$

If new evidence arrives (case 1, a case supporting the first state):

$s := s \cdot q + 1;\quad x_1 := x_1 \cdot q + 1;\quad x_2 := x_2 \cdot q;\quad x_3 := x_3 \cdot q$

Effective sample size:

$s' = \frac{1}{1 - q}$


Page 21: Orasmaa Liin Chapter 6


Adaptation: Fading

Sample size

The sample size can be used to determine the fading factor:

$q' = \frac{s' - 1}{s'}$

The bigger the sample size, the more resistant the network is to change

The sample size can be different for each node in the network
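A minimal sketch of fading combined with fractional updating for fully observed cases (the names, the loop and the concrete numbers are assumptions for illustration):

    def fade_and_update(counts, s, q, observed_state):
        """Multiply the old counts and sample size by the fading factor q, then add the new case."""
        faded = [x * q for x in counts]
        faded[observed_state] += 1.0     # the new case supports this state
        return faded, s * q + 1.0

    counts, s = [2.0, 1.0, 1.0], 4.0
    for _ in range(1000):                # keep observing the first state
        counts, s = fade_and_update(counts, s, q=0.99, observed_state=0)
    print(s)                             # close to the effective sample size s' = 1 / (1 - q) = 100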


Page 22: Orasmaa Liin Chapter 6


Adaptation: alternative models

Main idea: if we're not sure whether our network's structure is correct, use several structures

Adjusting probabilities helps, but might not be enough to solve structural problems

Solutions

Gather data and learn a new structure

Use probabilities calculated over several weighted structures, recalculating the weights for new evidence

$P(A \mid e) = \sum_k w_k \cdot P(A \mid M_k, e)$

$w_k = P(M_k \mid e) = \frac{P(e \mid M_k) \cdot w_k}{\sum_j P(e \mid M_j) \cdot w_j}$
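A minimal sketch of the weight update and the averaged prediction (a hedged example; the numbers and the two-model setting are assumptions):

    def update_weights(weights, likelihoods):
        """weights: current w_k = P(M_k); likelihoods: P(e | M_k) for the new evidence e."""
        unnorm = [w * l for w, l in zip(weights, likelihoods)]
        z = sum(unnorm)
        return [u / z for u in unnorm]   # new w_k = P(M_k | e)

    def averaged_probability(weights, predictions):
        """predictions[k] = P(A | M_k, e); returns P(A | e) = sum_k w_k * P(A | M_k, e)."""
        return sum(w * p for w, p in zip(weights, predictions))

    w = update_weights([0.5, 0.5], [0.2, 0.05])   # the evidence favours the first structure
    print(w)                                      # [0.8, 0.2]
    print(averaged_probability(w, [0.7, 0.3]))    # 0.62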


Page 23: Orasmaa Liin Chapter 6


Tuning

We know what we want the probabilities of certain variables to be, and want to tune an existing network to suit it.

We have

Bayesian network BN

Evidence e

Existing probability $x = P(A \mid e) = (x_1, \ldots, x_n)$

Probabilities we want to have $y = (y_1, \ldots, y_n)$

Parameters we can adjust $t = (t_1, \ldots, t_m)$ with an initial set of values $t_0$

To do: change the values of t until x is close enough to y


Page 24: Orasmaa Liin Chapter 6


Tuning

To measure the difference: Euclidean distance

$\mathrm{dist}(x, y) = \sum_{i=1}^{n} (x_i - y_i)^2$

Method of gradient descent:

Calculate grad dist(x, y) with respect to the parameters t.

Give $t_0$ a displacement $\Delta t$ in the direction opposite to the direction of the gradient grad dist(x, y)$(t_0)$; that is, choose a step size $\alpha > 0$ and let $\Delta t = -\alpha\, \mathrm{grad}\, \mathrm{dist}(x, y)(t_0)$.

Iterate this procedure until the gradient is close to 0
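A minimal sketch of the descent loop with a numerically approximated gradient; the network is abstracted into a caller-supplied function x_of_t that returns P(A | e) for given parameter values t, and all names, the step size and the stopping thresholds are assumptions:

    def dist(x, y):
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

    def tune(x_of_t, t0, y, alpha=0.1, eps=1e-5, tol=1e-6, max_iter=1000):
        """Gradient descent on dist(x(t), y) over the tunable parameters t."""
        t = list(t0)
        for _ in range(max_iter):
            base = dist(x_of_t(t), y)
            grad = []
            for i in range(len(t)):              # finite-difference approximation of grad dist(x, y)
                t_eps = list(t)
                t_eps[i] += eps
                grad.append((dist(x_of_t(t_eps), y) - base) / eps)
            if all(abs(g) < tol for g in grad):  # stop when the gradient is close to 0
                break
            t = [ti - alpha * gi for ti, gi in zip(t, grad)]   # displacement opposite to the gradient
        return t

    # toy stand-in for P(A | e) controlled by a single parameter t_1
    x_of_t = lambda t: (t[0], 1.0 - t[0])
    print(tune(x_of_t, [0.2], y=(0.7, 0.3)))     # moves t_1 towards 0.7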


Page 25: Orasmaa Liin Chapter 6


Thank you for listening