Parameter Estimation: Bayesian Estimation Chapter 3 (Duda et al.) – Sections 3.3-3.7 CS479/679 Pattern Recognition Dr. George Bebis


Page 1: Parameter Estimation: Bayesian Estimation

Chapter 3 (Duda et al.) – Sections 3.3-3.7

CS479/679 Pattern Recognition, Dr. George Bebis

Page 2: Bayesian Estimation

• Assumes that the parameters are random variables that have some known a priori distribution p(θ).

• Estimates a distribution rather than making point estimates like ML:

p(x/D) = \int p(x/\theta)\, p(\theta/D)\, d\theta

• The BE solution might not be of the parametric form assumed.

Page 3: The Role of Training Examples in Computing P(ωi/x)

• If p(x/ωi) and P(ωi) are known, Bayes' rule allows us to compute the posterior probabilities P(ωi/x):

P(\omega_i/x) = \frac{p(x/\omega_i)\, P(\omega_i)}{\sum_j p(x/\omega_j)\, P(\omega_j)}

• Emphasize the role of the training examples D by introducing them in the computation of the posterior probabilities: P(ωi/x, D).
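As a concrete illustration, here is a minimal numeric sketch of this computation; the two-class priors and Gaussian class-conditional parameters are made-up values, not ones from the lecture.

```python
# A minimal sketch of Bayes' rule; the priors and Gaussian class-conditional
# parameters below are illustrative assumptions.
import numpy as np
from scipy.stats import norm

x = 1.0
priors = np.array([0.6, 0.4])                              # P(w1), P(w2), assumed known
likelihoods = np.array([norm.pdf(x, loc=0.0, scale=1.0),   # p(x/w1)
                        norm.pdf(x, loc=2.0, scale=1.0)])  # p(x/w2)

# P(w_i/x) = p(x/w_i) P(w_i) / sum_j p(x/w_j) P(w_j)
posteriors = likelihoods * priors
posteriors /= posteriors.sum()
print(posteriors)                                          # sums to 1
```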

Page 4: The Role of Training Examples (cont'd)

P(\omega_i/x, D) = \frac{p(x, \omega_i/D)}{p(x/D)} = \frac{p(x/\omega_i, D)\, P(\omega_i/D)}{\sum_j p(x/\omega_j, D)\, P(\omega_j/D)}

(chain rule in the numerator; marginalize, then chain rule again, in the denominator)

• Using only the samples from class i:

P(\omega_i/x, D) = \frac{p(x/\omega_i, D_i)\, P(\omega_i/D)}{\sum_j p(x/\omega_j, D_j)\, P(\omega_j/D)}

Page 5: The Role of Training Examples (cont'd)

• The training examples Di can help us determine both the class-conditional densities and the prior probabilities:

P(\omega_i/x, D) = \frac{p(x/\omega_i, D_i)\, P(\omega_i/D)}{\sum_j p(x/\omega_j, D_j)\, P(\omega_j/D)}

• For simplicity, replace P(ωi/D) with P(ωi):

P(\omega_i/x, D) = \frac{p(x/\omega_i, D_i)\, P(\omega_i)}{\sum_j p(x/\omega_j, D_j)\, P(\omega_j)}

Page 6: Bayesian Estimation (BE)

• Need to estimate p(x/ωi, Di) for every class ωi:

P(\omega_i/x, D) = \frac{p(x/\omega_i, D_i)\, P(\omega_i)}{\sum_j p(x/\omega_j, D_j)\, P(\omega_j)}

• If the samples in Dj give no information about θi (i ≠ j), we need to solve c independent problems of the following form: "Given D, estimate p(x/D)."

Page 7: BE Approach

• Estimate p(x/D) as follows:

p(x/D) = \int p(x, \theta/D)\, d\theta = \int p(x/\theta, D)\, p(\theta/D)\, d\theta

• Since p(x/\theta, D) = p(x/\theta), we have:

p(x/D) = \int p(x/\theta)\, p(\theta/D)\, d\theta

• Important equation: it links p(x/D) with p(θ/D).

Page 8: BE Main Steps

(1) Compute p(θ/D):

p(\theta/D) = \frac{p(D/\theta)\, p(\theta)}{p(D)} = \alpha \prod_{k=1}^{n} p(x_k/\theta)\, p(\theta), \quad where \quad \alpha = \frac{1}{p(D)}

(2) Compute p(x/D):

p(x/D) = \int p(x/\theta)\, p(\theta/D)\, d\theta
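These two steps can be sketched numerically on a discretized parameter grid. In the sketch below, the Gaussian likelihood with known unit variance, the N(0, 2²) prior, and the sample values are all illustrative assumptions, not the lecture's.

```python
# A minimal sketch of the two BE steps on a discretized 1-D parameter grid.
import numpy as np
from scipy.stats import norm

D = np.array([1.2, 0.7, 1.9, 1.1])                 # training samples (illustrative)
theta = np.linspace(-5.0, 5.0, 2001)               # grid over the parameter
dtheta = theta[1] - theta[0]
prior = norm.pdf(theta, loc=0.0, scale=2.0)        # p(theta), assumed N(0, 2^2)

# Step (1): p(theta/D) = alpha * prod_k p(x_k/theta) * p(theta)
log_lik = norm.logpdf(D[:, None], loc=theta[None, :], scale=1.0).sum(axis=0)
posterior = np.exp(log_lik) * prior
posterior /= posterior.sum() * dtheta              # normalization: alpha = 1/p(D)

# Step (2): p(x/D) = integral of p(x/theta) p(theta/D) dtheta
def p_x_given_D(x):
    return (norm.pdf(x, loc=theta, scale=1.0) * posterior).sum() * dtheta

print(p_x_given_D(1.0))
```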

Page 9: Interpretation of BE Solution

• Suppose p(θ/D) peaks sharply at \hat{\theta}; then p(x/D) can be approximated as follows:

p(x/D) = \int p(x/\theta)\, p(\theta/D)\, d\theta \approx p(x/\hat{\theta})

(assuming that p(x/θ) is smooth)

• That is, the best estimate is obtained by setting \theta = \hat{\theta}.

Page 10: Interpretation of BE Solution (cont'd)

• If we are less certain about the exact value of θ, we should consider a weighted average of p(x/θ) over the possible values of θ:

p(x/D) = \int p(x/\theta)\, p(\theta/D)\, d\theta

• The samples exert their influence on p(x/D) through p(θ/D).

Page 11: Relation to ML Solution

• If p(D/θ) peaks sharply at \hat{\theta}, then

p(\theta/D) = \frac{p(D/\theta)\, p(\theta)}{p(D)}

will, in general, peak sharply at \hat{\theta} too (i.e., close to the ML solution):

p(x/D) = \int p(x/\theta)\, p(\theta/D)\, d\theta \approx p(x/\hat{\theta})

Page 12: Case 1: Univariate Gaussian, Unknown μ

• Assume p(x/μ) ~ N(μ, σ²), where σ² is known, and a prior p(μ) ~ N(μ0, σ0²).

• D = {x1, x2, …, xn} (independently drawn)

(1) Compute p(μ/D).

Page 13: Case 1: Univariate Gaussian, Unknown μ (cont'd)

• It can be shown that p(μ/D) has the form:

p(\mu/D) \sim N(\mu_n, \sigma_n^2)

• p(μ/D) peaks at μn.

Page 14: Case 1: Univariate Gaussian, Unknown μ (cont'd)

\mu_n = \frac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2}\, \hat{\mu}_n + \frac{\sigma^2}{n \sigma_0^2 + \sigma^2}\, \mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n \sigma_0^2 + \sigma^2}

• μn is a weighted average of the ML estimate \hat{\mu}_n (the sample mean) and the prior mean μ0 (i.e., it lies between them).

• \sigma_n^2 \to 0 as n \to \infty (larger n implies more samples), so \mu_n \to \hat{\mu}_n, the ML estimate.
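A minimal sketch of these closed-form updates; the data and the prior values (μ0, σ0²) below are illustrative assumptions.

```python
# A minimal sketch of the posterior parameters mu_n and sigma_n^2 for the
# univariate Gaussian with known variance.
import numpy as np

sigma2 = 1.0                                   # known variance sigma^2 (assumed)
mu0, sigma0_2 = 0.0, 4.0                       # prior p(mu) ~ N(mu0, sigma0^2)
D = np.array([0.8, 1.4, 1.1, 0.9, 1.6])        # illustrative samples
n, mu_hat = len(D), D.mean()                   # mu_hat: ML estimate (sample mean)

mu_n = (n * sigma0_2 * mu_hat + sigma2 * mu0) / (n * sigma0_2 + sigma2)
sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)
print(mu_n, sigma_n2)  # mu_n lies between mu_hat and mu0; sigma_n2 -> 0 as n grows
```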

Page 15: Case 1: Univariate Gaussian, Unknown μ (cont'd)

Page 16: Case 1: Univariate Gaussian, Unknown μ (cont'd)

[Figure: Bayesian learning]

Page 17: Case 1: Univariate Gaussian, Unknown μ (cont'd)

(2) Compute p(x/D):

p(x/D) = \int p(x/\mu)\, p(\mu/D)\, d\mu \sim N(\mu_n, \sigma^2 + \sigma_n^2)

(the remaining factor in the integrand is not dependent on μ)

• As the number of samples increases, p(x/D) converges to p(x/μ).
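A sketch of the resulting predictive density, reusing mu_n and sigma_n2 from the sketch on Page 14.

```python
# A minimal sketch of the predictive density p(x/D) ~ N(mu_n, sigma^2 + sigma_n^2).
import numpy as np
from scipy.stats import norm

def predictive_pdf(x, mu_n, sigma2, sigma_n2):
    # The extra term sigma_n2 reflects the remaining uncertainty about mu;
    # as n -> infinity, sigma_n2 -> 0 and p(x/D) converges to p(x/mu).
    return norm.pdf(x, loc=mu_n, scale=np.sqrt(sigma2 + sigma_n2))
```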

Page 18: Case 2: Multivariate Gaussian, Unknown μ

• Assume p(x/μ) ~ N(μ, Σ), where Σ is known, and p(μ) ~ N(μ0, Σ0).

• D = {x1, x2, …, xn} (independently drawn)

(1) Compute p(μ/D):

p(\mu/D) = \frac{p(D/\mu)\, p(\mu)}{p(D)} = \alpha \prod_{k=1}^{n} p(x_k/\mu)\, p(\mu)

Page 19: Case 2: Multivariate Gaussian, Unknown μ (cont'd)

• Substituting the expressions for p(xk/μ) and p(μ):

p(\mu/D) = c \exp\left[-\frac{1}{2} (\mu - \mu_n)^t \Sigma_n^{-1} (\mu - \mu_n)\right]

where

\mu_n = \Sigma_0 \left(\Sigma_0 + \frac{1}{n}\Sigma\right)^{-1} \hat{\mu}_n + \frac{1}{n}\Sigma \left(\Sigma_0 + \frac{1}{n}\Sigma\right)^{-1} \mu_0

\Sigma_n = \Sigma_0 \left(\Sigma_0 + \frac{1}{n}\Sigma\right)^{-1} \frac{1}{n}\Sigma

Page 20: Case 2: Multivariate Gaussian (cont'd)

(2) Compute p(x/D):

p(x/D) = \int p(x/\mu)\, p(\mu/D)\, d\mu \sim N(\mu_n, \Sigma + \Sigma_n)
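A minimal sketch of the multivariate update under these formulas; the data, Σ, and the prior (μ0, Σ0) are illustrative, and linear solves stand in for explicit matrix inverses.

```python
# A minimal sketch of the multivariate Gaussian mean update.
import numpy as np

def posterior_params(X, Sigma, mu0, Sigma0):
    n = X.shape[0]
    mu_hat = X.mean(axis=0)                            # ML estimate of the mean
    A = Sigma0 + Sigma / n                             # Sigma0 + (1/n) Sigma
    # mu_n = Sigma0 A^-1 mu_hat + (1/n) Sigma A^-1 mu0
    mu_n = Sigma0 @ np.linalg.solve(A, mu_hat) + (Sigma / n) @ np.linalg.solve(A, mu0)
    Sigma_n = Sigma0 @ np.linalg.solve(A, Sigma / n)   # shrinks to 0 as n grows
    return mu_n, Sigma_n                               # p(x/D) ~ N(mu_n, Sigma + Sigma_n)

X = np.random.default_rng(0).normal(size=(50, 2)) + np.array([1.0, -1.0])
mu_n, Sigma_n = posterior_params(X, np.eye(2), np.zeros(2), 4.0 * np.eye(2))
```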

Page 21: Recursive Bayes Learning

• Develop an incremental learning algorithm: D^n = {x1, x2, …, x_{n-1}, x_n}, so that D^{n-1} = {x1, …, x_{n-1}}.

• Rewrite p(D^n/\theta) = \prod_{k=1}^{n} p(x_k/\theta) as follows:

p(D^n/\theta) = p(x_n/\theta)\, p(D^{n-1}/\theta)

Page 22: Recursive Bayes Learning (cont'd)

• Then, p(\theta/D) = \frac{p(D/\theta)\, p(\theta)}{p(D)} can be written recursively as follows:

p(\theta/D^n) = \frac{p(D^n/\theta)\, p(\theta)}{p(D^n)} = \frac{p(x_n/\theta)\, p(D^{n-1}/\theta)\, p(\theta)}{\int p(x_n/\theta)\, p(D^{n-1}/\theta)\, p(\theta)\, d\theta} = \frac{p(x_n/\theta)\, p(\theta/D^{n-1})}{\int p(x_n/\theta)\, p(\theta/D^{n-1})\, d\theta}

where p(\theta/D^0) = p(\theta), for n = 1, 2, …
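The recursion translates directly into a grid-based update; the Gaussian likelihood with unit variance and the N(0, 2²) prior below are illustrative assumptions.

```python
# A minimal sketch of recursive Bayes learning on a discretized grid.
import numpy as np
from scipy.stats import norm

theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]
posterior = norm.pdf(theta, loc=0.0, scale=2.0)    # p(theta/D^0) = p(theta)

for x_n in [1.2, 0.7, 1.9, 1.1]:                   # process one sample at a time
    posterior = norm.pdf(x_n, loc=theta, scale=1.0) * posterior
    posterior /= posterior.sum() * dtheta          # the denominator integral

print(theta[np.argmax(posterior)])                 # posterior mode after n samples
```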

Page 23: Recursive Bayes Learning – Example

• Example model: p(x/θ) ~ U(0, θ) with prior p(θ) ~ U(0, 10); apply the recursion:

p(\theta/D^n) = \frac{p(x_n/\theta)\, p(\theta/D^{n-1})}{\int p(x_n/\theta)\, p(\theta/D^{n-1})\, d\theta}, \quad where \quad p(\theta/D^0) = p(\theta)

Page 24: Recursive Bayesian Learning (cont'd)

• After observing x1, …, xn (e.g., x4 = 8), in general:

p(\theta/D^n) \propto \frac{1}{\theta^n} \quad for \quad \max_k[D^n] \le \theta \le 10

Page 25: Recursive Bayesian Learning (cont'd)

• ML estimate: p(θ/D^4) peaks at \hat{\theta} = 8, giving p(x/\hat{\theta}) ~ U(0, 8).

• Bayesian estimate: starting from p(θ/D^0) and iterating,

p(x/D) = \int p(x/\theta)\, p(\theta/D)\, d\theta
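A sketch contrasting the two estimates; the sample values are assumed for illustration (chosen so that max_k x_k = 8, matching x4 = 8 on the slide).

```python
# A minimal sketch of the U(0, theta) example with a U(0, 10) prior.
import numpy as np

D = [4.0, 7.0, 2.0, 8.0]                       # illustrative samples, max = 8
theta = np.linspace(0.01, 10.0, 2000)
dtheta = theta[1] - theta[0]

# p(theta/D^n) is proportional to 1/theta^n for max_k[D^n] <= theta <= 10
post = np.where(theta >= max(D), theta ** (-len(D)), 0.0)
post /= post.sum() * dtheta

# ML predictive: U(0, 8).  Bayesian predictive: integrate over theta.
x = np.linspace(0.0, 10.0, 1001)
lik = np.where(x[:, None] <= theta[None, :], 1.0 / theta[None, :], 0.0)
p_x_given_D = (lik * post[None, :]).sum(axis=1) * dtheta
# Unlike the ML solution U(0, 8), p(x/D) keeps some mass on 8 < x <= 10.
```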

Page 26: Multiple Peaks

• For most choices of p(x/θ), p(θ/D^n) will peak strongly at \hat{\theta} given enough samples; in this case:

p(x/D) \approx p(x/\hat{\theta})

• There might be cases, however, where \theta cannot be determined uniquely from p(x/θ); in this case, p(θ/D^n) will contain multiple peaks.

• The solution p(x/D) should then be obtained by integration:

p(x/D) = \int p(x/\theta)\, p(\theta/D)\, d\theta

Page 27: ML vs Bayesian Estimation

• Number of training data
– The two methods are equivalent assuming an infinite number of training data (and prior distributions that do not exclude the true solution).
– For small training data sets, they give different results in most cases.

• Computational complexity
– ML uses differential calculus or gradient search for maximizing the likelihood.
– Bayesian estimation requires complex multidimensional integration techniques.

Page 28: ML vs Bayesian Estimation (cont'd)

• Solution complexity
– ML solutions are easier to interpret (i.e., they must be of the assumed parametric form).
– A Bayesian estimation solution might not be of the parametric form assumed.

• Prior distribution
– If the prior distribution p(θ) is uniform, Bayesian estimation solutions are equivalent to ML solutions.
– In general, the two methods will give different solutions.

Page 29: ML vs Bayesian Estimation (cont'd)

• General comments
– There are strong theoretical and methodological arguments supporting Bayesian estimation.
– In practice, ML estimation is simpler and can lead to comparable performance.

Page 30: Computational Complexity

• Learning complexity (ML estimation), for dimensionality d, n training examples (n > d), and c classes. The discriminant is:

g(x) = -\frac{1}{2}(x - \hat{\mu})^t \hat{\Sigma}^{-1} (x - \hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)

– Estimating \hat{\mu}: O(dn)
– Estimating \hat{\Sigma} (d(d+1)/2 independent entries): O(d^2 n)
– Inverting \hat{\Sigma} and computing |\hat{\Sigma}|: O(d^3)
– The constant term \frac{d}{2}\ln 2\pi: O(1)
– Estimating P(\omega): O(n)

• Since n > d, learning is dominated by the O(d^2 n) covariance estimation.

• These computations must be repeated c times!
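A sketch of the learning step with the per-term costs annotated; the function names are illustrative, and the costs are the standard ones for these operations.

```python
# A minimal sketch of ML learning for the Gaussian discriminant.
import numpy as np

def learn_gaussian(X):                           # X: n x d samples from one class
    n, d = X.shape
    mu_hat = X.mean(axis=0)                      # O(dn)
    Xc = X - mu_hat
    Sigma_hat = (Xc.T @ Xc) / n                  # O(d^2 n); d(d+1)/2 free parameters
    Sigma_inv = np.linalg.inv(Sigma_hat)         # O(d^3)
    sign, logdet = np.linalg.slogdet(Sigma_hat)  # O(d^3)
    return mu_hat, Sigma_inv, logdet

def g(x, mu_hat, Sigma_inv, logdet, log_prior):
    # Classification: the quadratic form dominates at O(d^2) per class.
    d = len(mu_hat)
    diff = x - mu_hat
    return (-0.5 * diff @ Sigma_inv @ diff
            - 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet + log_prior)
```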

Page 31: Computational Complexity (cont'd)

• Classification complexity, using the same discriminant:

g(x) = -\frac{1}{2}(x - \hat{\mu})^t \hat{\Sigma}^{-1} (x - \hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)

– Evaluating the quadratic form: O(d^2); the remaining terms: O(1).

• These computations must be repeated c times, and we take the max.

• Bayesian estimation: higher learning complexity, same classification complexity.

Page 32: Main Sources of Error in Classifier Design

• Bayes error
– The error due to overlapping densities p(x/ωi).

• Model error
– The error due to choosing an incorrect model.

• Estimation error
– The error due to incorrectly estimated parameters (e.g., due to a small number of training examples).

Page 33: Overfitting

• When the number of training examples is inadequate, the solution obtained might not be optimal.

• Consider the problem of fitting a curve to some data:
– Points were selected from a parabola (plus noise).
– A 10th-degree polynomial fits the data perfectly but does not generalize well.
– A greater error on the training data might improve generalization!

• Need more training examples than the number of model parameters!

Page 34: Overfitting (cont'd)

• Control model complexity:
– Assume a diagonal covariance matrix (i.e., uncorrelated features).
– Use the same covariance matrix for all classes and consolidate the data.
– Use a shrinkage technique (see the sketch after the formulas below).

Shrink individual covariance matrices toward a common covariance:

\Sigma_i(\alpha) = \frac{(1 - \alpha)\, n_i \Sigma_i + \alpha\, n \Sigma}{(1 - \alpha)\, n_i + \alpha\, n}, \quad 0 < \alpha < 1

Shrink the common covariance matrix toward the identity matrix:

\Sigma(\beta) = (1 - \beta)\, \Sigma + \beta I, \quad 0 < \beta < 1
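A minimal sketch of the two shrinkage estimators; the function names and the choice of weights are illustrative.

```python
# A minimal sketch of shrinkage (regularized) covariance estimation.
import numpy as np

def shrink_to_common(Sigma_i, n_i, Sigma, n, alpha):
    # Sigma_i(alpha) = [(1-a) n_i Sigma_i + a n Sigma] / [(1-a) n_i + a n]
    return (((1 - alpha) * n_i * Sigma_i + alpha * n * Sigma)
            / ((1 - alpha) * n_i + alpha * n))

def shrink_to_identity(Sigma, beta):
    # Sigma(beta) = (1 - b) Sigma + b I
    return (1 - beta) * Sigma + beta * np.eye(Sigma.shape[0])
```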