Parameter Estimation: Bayesian Estimation
Chapter 3 (Duda et al.) – Sections 3.3-3.7
CS479/679 Pattern Recognition, Dr. George Bebis
Bayesian Estimation
• Assumes that the parameters are random variables that have some known a-priori distribution p(θ).
• Estimates a distribution rather than making point estimates like ML:

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta$$

• The BE solution might not be of the parametric form assumed.
The Role of Training Examples in Computing P(ωi/x)
• If p(x/ωi) and P(ωi) are known, Bayes’ rule allows us to compute the posterior probabilities P(ωi/x):

$$P(\omega_i/\mathbf{x}) = \frac{p(\mathbf{x}/\omega_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}/\omega_j)\,P(\omega_j)}$$

• Emphasize the role of the training examples D by introducing them in the computation of the posterior probabilities: P(ωi/x, D).
The Role of Training Examples (cont’d)
• Applying the chain rule and marginalizing over the classes:

$$P(\omega_i/\mathbf{x}, D) = \frac{p(\mathbf{x}, \omega_i/D)}{p(\mathbf{x}/D)} = \frac{p(\mathbf{x}/\omega_i, D)\,P(\omega_i/D)}{\sum_j p(\mathbf{x}/\omega_j, D)\,P(\omega_j/D)}$$

(numerator: chain rule; denominator: marginalize over ωj, then apply the chain rule again)
• Using only the samples from class i, i.e., p(x/ωi, D) = p(x/ωi, Di):

$$P(\omega_i/\mathbf{x}, D) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i/D)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j/D)}$$
The Role of Training Examples (cont’d)
• The training examples Di can help us determine both the class-conditional densities and the prior probabilities:

$$P(\omega_i/\mathbf{x}, D) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i/D)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j/D)}$$

• For simplicity, replace P(ωi/D) with P(ωi):

$$P(\omega_i/\mathbf{x}, D) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j)}$$
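As a concrete illustration of this rule, here is a minimal Python sketch; the univariate Gaussian class-conditional densities (standing in for densities estimated from each Di) and the priors are made-up values:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x/w_i, D_i): univariate Gaussians
# whose parameters would have been estimated from each class's samples D_i.
class_densities = [norm(loc=0.0, scale=1.0), norm(loc=3.0, scale=1.5)]
priors = [0.6, 0.4]  # P(w_i), assumed known

def posterior(x):
    """P(w_i/x, D) = p(x/w_i, D_i) P(w_i) / sum_j p(x/w_j, D_j) P(w_j)."""
    joint = np.array([d.pdf(x) * P for d, P in zip(class_densities, priors)])
    return joint / joint.sum()

print(posterior(1.2))  # two posterior probabilities summing to 1
```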
Bayesian Estimation (BE)
• Need to estimate p(x/ωi, Di) for every class ωi:

$$P(\omega_i/\mathbf{x}, D) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j)}$$

• If the samples in Dj give no information about θi (for i ≠ j), we need to solve c independent problems of the following form:
“Given D, estimate p(x/D)”
BE Approach
• Estimate p(x/D) as follows:

$$p(\mathbf{x}/D) = \int p(\mathbf{x}, \theta/D)\,d\theta = \int p(\mathbf{x}/\theta, D)\,p(\theta/D)\,d\theta$$

• Since p(x/θ, D) = p(x/θ) (x is independent of D given θ), we have:

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta$$

• Important equation: it links p(x/D) with p(θ/D).
BE Main Steps
$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta$$

(1) Compute p(θ/D):

$$p(\theta/D) = \frac{p(D/\theta)\,p(\theta)}{p(D)} = \alpha \prod_{k=1}^{n} p(\mathbf{x}_k/\theta)\,p(\theta)$$

where $p(D/\theta) = \prod_{k=1}^{n} p(\mathbf{x}_k/\theta)$ for independently drawn samples, and α = 1/p(D) is a normalization constant.

(2) Compute p(x/D):

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta$$
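A grid-based sketch of these two steps for a scalar θ; the likelihood p(x/θ) = N(θ, 1), the prior, and the data below are illustrative assumptions, and the integrals are approximated by sums over the grid:

```python
import numpy as np
from scipy.stats import norm

D = np.array([1.1, 0.7, 1.6, 0.9])   # observed samples (made up)
theta = np.linspace(-5, 5, 2001)      # grid over the parameter
dtheta = theta[1] - theta[0]

# (1) p(theta/D) = alpha * prod_k p(x_k/theta) * p(theta)
likelihood = np.prod([norm.pdf(xk, loc=theta, scale=1.0) for xk in D], axis=0)
prior = norm.pdf(theta, loc=0.0, scale=2.0)
post = likelihood * prior
post /= post.sum() * dtheta           # normalize so it integrates to 1

# (2) p(x/D) = integral of p(x/theta) p(theta/D) dtheta
def p_x_given_D(x):
    return np.sum(norm.pdf(x, loc=theta, scale=1.0) * post) * dtheta

print(p_x_given_D(1.0))
```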
Interpretation of BE Solution
• Suppose p(θ/D) peaks sharply at θ̂; then p(x/D) can be approximated as follows (assuming that p(x/θ) is smooth):

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta \approx p(\mathbf{x}/\hat{\theta})$$

(i.e., the best estimate is obtained by setting θ = θ̂)
Interpretation of BE Solution (cont’d)
• If we are less certain about the exact value of θ, we should consider a weighted average of p(x/θ) over the possible values of θ:

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta$$

• The samples exert their influence on p(x/D) through p(θ/D).
Relation to ML solution
• If p(D/θ) peaks sharply at θ̂, then p(θ/D) will, in general, peak sharply at θ̂ too (i.e., close to the ML solution):

$$p(\theta/D) = \frac{p(D/\theta)\,p(\theta)}{p(D)}$$

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta \approx p(\mathbf{x}/\hat{\theta})$$
Case 1: Univariate Gaussian, Unknown μ
D = {x1, x2, …, xn} (independently drawn)
Assume p(x/μ) ~ N(μ, σ²) and p(μ) ~ N(μ0, σ0²), where σ², μ0, and σ0² are known.
(1) Compute p(μ/D):

$$p(\mu/D) = \alpha \prod_{k=1}^{n} p(x_k/\mu)\,p(\mu)$$
Case 1: Univariate Gaussian, Unknown μ (cont’d)
• Substituting the Gaussian expressions for p(xk/μ) and p(μ) and absorbing all factors that do not depend on μ into constants, it can be shown that p(μ/D) has the form:

$$p(\mu/D) \sim N(\mu_n, \sigma_n^2)$$

(p(μ/D) peaks at μn)
Case 1: Univariate Gaussian, Unknown μ (cont’d)
$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the ML estimate.
• μn → μ̂n (the ML estimate) as n → ∞, and μn → μ0 as n → 0; in general, μn lies between them.
• σn² → 0 as n → ∞ (larger n implies more samples, hence a sharper posterior).
Case 1: Univariate Gaussian, Unknown μ (cont’d)
[Figure: Bayesian learning – as n increases, p(μ/D) becomes sharper and its peak moves toward the ML estimate]
Case 1: Univariate Gaussian, Unknown μ (cont’d)
(2) Compute p(x/D):

$$p(x/D) = \int p(x/\mu)\,p(\mu/D)\,d\mu \sim N(\mu_n, \sigma^2 + \sigma_n^2)$$

(the factors not dependent on μ can be pulled out of the integral)
• As the number of samples increases, p(x/D) converges to p(x/μ).
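A small sketch of these closed-form updates; the sample values and the prior hyperparameters μ0, σ0² below are illustrative:

```python
import numpy as np

D = np.array([2.1, 1.7, 2.5, 1.9, 2.2])   # samples (made up)
sigma2 = 1.0                               # known variance sigma^2
mu0, sigma0_2 = 0.0, 4.0                   # prior: p(mu) ~ N(mu0, sigma0^2)

n = len(D)
mu_hat = D.mean()                          # ML estimate

# Posterior p(mu/D) ~ N(mu_n, sigma_n^2)
mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * mu_hat \
     + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
sigma_n2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)

# Predictive density p(x/D) ~ N(mu_n, sigma^2 + sigma_n^2)
print(mu_n, sigma_n2, sigma2 + sigma_n2)
```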
Case 2: Multivariate Gaussian, Unknown μ
D = {x1, x2, …, xn} (independently drawn)
Assume p(x/μ) ~ N(μ, Σ) and p(μ) ~ N(μ0, Σ0), where Σ, μ0, and Σ0 are known.
(1) Compute p(μ/D):

$$p(\boldsymbol{\mu}/D) = \frac{p(D/\boldsymbol{\mu})\,p(\boldsymbol{\mu})}{p(D)} = \alpha \prod_{k=1}^{n} p(\mathbf{x}_k/\boldsymbol{\mu})\,p(\boldsymbol{\mu})$$
Case 2: Multivariate Gaussian, Unknown μ (cont’d)
• Substituting the expressions for p(xk/μ) and p(μ):

$$p(\boldsymbol{\mu}/D) = c_n \exp\!\left[-\frac{1}{2}(\boldsymbol{\mu}-\boldsymbol{\mu}_n)^t\,\boldsymbol{\Sigma}_n^{-1}\,(\boldsymbol{\mu}-\boldsymbol{\mu}_n)\right]$$

where

$$\boldsymbol{\mu}_n = \boldsymbol{\Sigma}_0\left(\boldsymbol{\Sigma}_0 + \tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\hat{\boldsymbol{\mu}}_n + \tfrac{1}{n}\boldsymbol{\Sigma}\left(\boldsymbol{\Sigma}_0 + \tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\boldsymbol{\mu}_0$$

$$\boldsymbol{\Sigma}_n = \boldsymbol{\Sigma}_0\left(\boldsymbol{\Sigma}_0 + \tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\tfrac{1}{n}\boldsymbol{\Sigma}$$

and $\hat{\boldsymbol{\mu}}_n = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$ is the ML estimate.
Case 2: Multivariate Gaussian (cont’d)
(2) Compute p(x/D):

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\boldsymbol{\mu})\,p(\boldsymbol{\mu}/D)\,d\boldsymbol{\mu} \sim N(\boldsymbol{\mu}_n, \boldsymbol{\Sigma} + \boldsymbol{\Sigma}_n)$$
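A sketch of these updates with NumPy; the data, Σ, and prior hyperparameters are made-up values:

```python
import numpy as np

D = np.random.default_rng(0).normal([1.0, -1.0], 1.0, size=(20, 2))  # made-up data
Sigma = np.eye(2)                              # known Sigma
mu0, Sigma0 = np.zeros(2), 4.0 * np.eye(2)     # prior p(mu) ~ N(mu0, Sigma0)

n = len(D)
mu_hat = D.mean(axis=0)                        # ML estimate of the mean

# Posterior p(mu/D) ~ N(mu_n, Sigma_n)
A = np.linalg.inv(Sigma0 + Sigma / n)
mu_n = Sigma0 @ A @ mu_hat + (Sigma / n) @ A @ mu0
Sigma_n = Sigma0 @ A @ (Sigma / n)

# Predictive density p(x/D) ~ N(mu_n, Sigma + Sigma_n)
print(mu_n, Sigma + Sigma_n)
```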
Recursive Bayes Learning
• Develop an incremental learning algorithm. Let Dⁿ = (x1, x2, …, xn−1, xn), so that Dⁿ⁻¹ = (x1, x2, …, xn−1).
• Rewrite p(Dⁿ/θ) as follows:

$$p(D^n/\theta) = \prod_{k=1}^{n} p(\mathbf{x}_k/\theta) = p(\mathbf{x}_n/\theta)\,p(D^{n-1}/\theta)$$
Recursive Bayes Learning (cont’d)
• Then, p(θ/Dⁿ) can be written recursively as follows:

$$p(\theta/D^n) = \frac{p(D^n/\theta)\,p(\theta)}{\int p(D^n/\theta)\,p(\theta)\,d\theta} = \frac{p(\mathbf{x}_n/\theta)\,p(D^{n-1}/\theta)\,p(\theta)}{\int p(\mathbf{x}_n/\theta)\,p(D^{n-1}/\theta)\,p(\theta)\,d\theta} = \frac{p(\mathbf{x}_n/\theta)\,p(\theta/D^{n-1})}{\int p(\mathbf{x}_n/\theta)\,p(\theta/D^{n-1})\,d\theta}, \qquad n = 1, 2, \ldots$$

where $p(\theta/D^0) = p(\theta)$.
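For the Gaussian Case 1, this recursion has a closed form: after each new sample, the old posterior N(μ, σₙ²) simply plays the role of the prior. A sketch with illustrative values:

```python
# Recursive update of p(mu/D^n) for Case 1: after each new sample x_n,
# the previous posterior N(mu_prev, s2_prev) acts as the prior.
sigma2 = 1.0                       # known data variance
mu_prev, s2_prev = 0.0, 4.0        # p(mu/D^0) = prior (illustrative values)

for x_n in [2.1, 1.7, 2.5, 1.9]:   # made-up sample stream
    s2_new = s2_prev * sigma2 / (s2_prev + sigma2)
    mu_new = (s2_prev * x_n + sigma2 * mu_prev) / (s2_prev + sigma2)
    mu_prev, s2_prev = mu_new, s2_new
    print(f"posterior after x_n={x_n}: N({mu_prev:.3f}, {s2_prev:.3f})")
```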
Recursive Bayes Learning – Example
• Assume p(x/θ) ~ U(0, θ) with prior p(θ/D⁰) = p(θ) ~ U(0, 10), and apply the recursion:

$$p(\theta/D^n) = \frac{p(x_n/\theta)\,p(\theta/D^{n-1})}{\int p(x_n/\theta)\,p(\theta/D^{n-1})\,d\theta}, \qquad \text{where } p(\theta/D^0) = p(\theta)$$
Recursive Bayes Learning (cont’d)
[Figure: p(θ/Dⁿ) over the iterations, starting from the flat prior p(θ/D⁰); the last sample is x₄ = 8]
• In general:

$$p(\theta/D^n) \propto \frac{1}{\theta^n} \qquad \text{for } \max_k[x_k] \le \theta \le 10$$

• ML estimate: p(θ/D⁴) peaks at θ̂ = 8, so

$$p(x/\hat{\theta}) \sim U(0, 8)$$

• Bayesian estimate:

$$p(x/D) = \int p(x/\theta)\,p(\theta/D)\,d\theta$$
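A grid-based sketch of this example; the slides only state that x₄ = 8, so the earlier samples below are hypothetical (chosen with max 8):

```python
import numpy as np

theta = np.linspace(0.01, 10.0, 1000)   # grid over theta in (0, 10]
dtheta = theta[1] - theta[0]
post = np.full_like(theta, 1.0 / 10.0)  # p(theta/D^0) = p(theta) ~ U(0, 10)

# Hypothetical sample sequence; only x4 = 8 is given on the slide.
for x_n in [4.0, 7.0, 2.0, 8.0]:
    lik = np.where(theta >= x_n, 1.0 / theta, 0.0)   # p(x/theta) ~ U(0, theta)
    post = lik * post
    post /= post.sum() * dtheta                       # normalize the posterior

print(theta[np.argmax(post)])   # the posterior peaks at max_k x_k = 8
```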
Multiple Peaks
• For most choices of p(x/θ), p(θ/Dⁿ) will peak strongly at θ̂ given enough samples; in this case:

$$p(\mathbf{x}/D) \approx p(\mathbf{x}/\hat{\theta})$$

• There might be cases, however, where θ cannot be determined uniquely from p(x/θ); in this case, p(θ/Dⁿ) will contain multiple peaks.
• The solution p(x/D) must then be obtained by integration:

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta$$
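A toy illustration of such ambiguity, assuming (purely for illustration) a likelihood that depends on θ only through |θ|, so the data cannot distinguish +θ from −θ and the posterior has two peaks:

```python
import numpy as np
from scipy.stats import norm

theta = np.linspace(-6, 6, 2001)
dtheta = theta[1] - theta[0]

# Likelihood depends on theta only through |theta|: p(x/theta) = N(x; |theta|, 1).
# Samples near 2 make theta = +2 and theta = -2 equally plausible.
D = [1.8, 2.2, 2.1]
post = norm.pdf(theta, loc=0.0, scale=4.0)   # broad prior
for x in D:
    post *= norm.pdf(x, loc=np.abs(theta), scale=1.0)
post /= post.sum() * dtheta                  # two peaks: near -2 and +2

# p(x/D) must average p(x/theta) over both peaks:
def p_x_given_D(x):
    return np.sum(norm.pdf(x, loc=np.abs(theta), scale=1.0) * post) * dtheta

print(p_x_given_D(2.0))
```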
ML vs Bayesian Estimation
• Number of training data
– The two methods are equivalent assuming an infinite number of training data (and prior distributions that do not exclude the true solution).
– For small training data sets, they give different results in most cases.
• Computational complexity
– ML uses differential calculus or gradient search to maximize the likelihood.
– Bayesian estimation requires complex multidimensional integration techniques.
ML vs Bayesian Estimation (cont’d)
• Solution complexity
– ML solutions are easier to interpret (i.e., they must be of the assumed parametric form).
– A Bayesian estimation solution might not be of the parametric form assumed.
• Prior distribution
– If the prior distribution p(θ) is uniform, Bayesian estimation solutions are equivalent to ML solutions.
– In general, the two methods will give different solutions.
ML vs Bayesian Estimation (cont’d)
• General comments
– There are strong theoretical and methodological arguments supporting Bayesian estimation.
– In practice, ML estimation is simpler and can lead to comparable performance.
Computational Complexity
(dimensionality: d, # training data: n, # classes: c)

$$g(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\hat{\boldsymbol{\mu}})^t\,\hat{\boldsymbol{\Sigma}}^{-1}(\mathbf{x}-\hat{\boldsymbol{\mu}}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\boldsymbol{\Sigma}}| + \ln P(\omega)$$

ML estimation
• Learning complexity
– μ̂: O(dn)
– Σ̂: O(d²n) (it has d(d+1)/2 distinct entries)
– Σ̂⁻¹: O(d³)
– ln|Σ̂|: O(d²)
– (d/2)ln 2π: O(1); ln P(ω): O(n)
– Overall: the O(d²n) term dominates (since n > d).
• These computations must be repeated c times!
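A sketch of the learning-phase computations with NumPy on made-up data (shapes chosen so that n > d); the cost of each step is noted in the comments:

```python
import numpy as np

n, d = 1000, 5
X = np.random.default_rng(1).normal(size=(n, d))   # made-up training data

mu_hat = X.mean(axis=0)                            # O(dn)
Xc = X - mu_hat
Sigma_hat = (Xc.T @ Xc) / n                        # O(d^2 n); d(d+1)/2 distinct entries
Sigma_inv = np.linalg.inv(Sigma_hat)               # O(d^3)
sign, logdet = np.linalg.slogdet(Sigma_hat)        # ln|Sigma_hat|
```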
Computational Complexity (cont’d)
(dimensionality: d, # training data: n, # classes: c)

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\hat{\boldsymbol{\mu}}_i)^t\,\hat{\boldsymbol{\Sigma}}_i^{-1}(\mathbf{x}-\hat{\boldsymbol{\mu}}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\boldsymbol{\Sigma}}_i| + \ln P(\omega_i)$$

• Classification complexity
– Quadratic term (x − μ̂i)ᵗ Σ̂i⁻¹ (x − μ̂i): O(d²); remaining terms: O(1).
– These computations must be repeated c times and take the max.
• Bayesian estimation: higher learning complexity, same classification complexity.
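A sketch of the classification phase, evaluating gi(x) for each class and taking the max; the two-class parameters below are illustrative placeholders:

```python
import numpy as np

d = 3
# Per-class parameters (mu_hat, Sigma_inv, ln|Sigma|, ln P), as produced by learning.
params = [
    (np.zeros(d), np.eye(d), 0.0, np.log(0.5)),
    (np.ones(d),  np.eye(d), 0.0, np.log(0.5)),
]

def g(x, mu_hat, Sigma_inv, logdet, logprior):
    diff = x - mu_hat                            # O(d)
    quad = -0.5 * diff @ Sigma_inv @ diff        # O(d^2): dominant term
    return quad - 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet + logprior  # O(1)

x = np.array([0.2, 0.1, 0.4])
label = np.argmax([g(x, *p) for p in params])    # repeat c times and take max
print(label)
```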
Main Sources of Error in Classifier Design
• Bayes error
– The error due to overlapping densities p(x/ωi).
• Model error
– The error due to choosing an incorrect model.
• Estimation error
– The error due to incorrectly estimated parameters (e.g., due to a small number of training examples):

$$P(\omega_i/\mathbf{x}, D) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j)}$$
Overfitting
• When the number of training examples is inadequate, the solution obtained might not be optimal.
• Consider the problem of fitting a curve to some data (see the sketch after this list):
– Points were selected from a parabola (plus noise).
– A 10th-degree polynomial fits the data perfectly but does not generalize well.
– A greater error on training data might improve generalization!
• Need more training examples than the number of model parameters!
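A sketch of this curve-fitting illustration on a made-up noisy parabola; the 10th-degree fit interpolates the training points but degrades off the training range:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 11)
y = 2 * x**2 + rng.normal(scale=0.1, size=x.shape)   # parabola plus noise

p2 = np.polyfit(x, y, 2)     # matches the true model
p10 = np.polyfit(x, y, 10)   # fits the 11 training points (near-)perfectly

x_test = np.linspace(-1.2, 1.2, 5)    # slightly outside the training range
print(np.polyval(p2, x_test))         # stays close to 2x^2
print(np.polyval(p10, x_test))        # blows up: poor generalization
```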
Overfitting (cont’d)
• Control model complexity:
– Assume a diagonal covariance matrix (i.e., uncorrelated features).
– Use the same covariance matrix for all classes and consolidate the data.
– Shrinkage technique (see the sketch after the formulas):

Shrink individual covariance matrices to the same (common) covariance:

$$\boldsymbol{\Sigma}_i(\alpha) = \frac{(1-\alpha)\,n_i\,\boldsymbol{\Sigma}_i + \alpha\,n\,\boldsymbol{\Sigma}}{(1-\alpha)\,n_i + \alpha\,n}, \qquad (0 < \alpha < 1)$$

Shrink the common covariance matrix to the identity matrix:

$$\boldsymbol{\Sigma}(\beta) = (1-\beta)\,\boldsymbol{\Sigma} + \beta\,\mathbf{I}, \qquad (0 < \beta < 1)$$
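A sketch of both shrinkage formulas; the covariance matrices, sample counts, and shrinkage parameters below are illustrative:

```python
import numpy as np

def shrink_to_common(Sigma_i, Sigma, n_i, n, alpha):
    """Sigma_i(alpha): shrink a class covariance toward the common covariance."""
    return ((1 - alpha) * n_i * Sigma_i + alpha * n * Sigma) / \
           ((1 - alpha) * n_i + alpha * n)

def shrink_to_identity(Sigma, beta):
    """Sigma(beta): shrink the common covariance toward the identity."""
    return (1 - beta) * Sigma + beta * np.eye(Sigma.shape[0])

# Illustrative use with made-up covariances:
Sigma_i = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma = np.eye(2)
print(shrink_to_common(Sigma_i, Sigma, n_i=20, n=100, alpha=0.3))
print(shrink_to_identity(Sigma_i, beta=0.2))
```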