Parameter Estimation: Bayesian Estimation
Chapter 3 (Duda et al.) – Sections 3.3-3.7
CS479/679 Pattern Recognition, Dr. George Bebis
Bayesian Estimation
• Assumes that the parameters are random variables that have some known a-priori distribution p(θ).
• Estimates a distribution rather than making point estimates like ML:

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta$$

• The BE solution might not be of the parametric form assumed.
The Role of Training Examples in Computing P(ωi/x)
• If p(x/ωi) and P(ωi) are known, Bayes’ rule allows us to compute the posterior probabilities P(ωi/x):

$$P(\omega_i/\mathbf{x}) = \frac{p(\mathbf{x}/\omega_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}/\omega_j)\,P(\omega_j)}$$

• Emphasize the role of the training examples D by introducing them in the computation of the posterior probabilities: P(ωi/x, D).
The Role of Training Examples (cont’d)
• Applying the chain rule and marginalizing over the classes:

$$P(\omega_i/\mathbf{x}, D) = \frac{p(\mathbf{x}, \omega_i/D)}{p(\mathbf{x}/D)} = \frac{p(\mathbf{x}/\omega_i, D)\,P(\omega_i/D)}{\sum_j p(\mathbf{x}/\omega_j, D)\,P(\omega_j/D)}$$

(numerator: chain rule; denominator: marginalize over ωj, then apply the chain rule again)
• Using only the samples from class i, i.e., p(x/ωi, D) = p(x/ωi, Di):

$$P(\omega_i/\mathbf{x}, D) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i/D)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j/D)}$$
The Role of Training Examples (cont’d)
• The training examples Di can help us determine both the class-conditional densities and the prior probabilities:

$$P(\omega_i/\mathbf{x}, D) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i/D)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j/D)}$$

• For simplicity, replace P(ωi/D) with P(ωi):

$$P(\omega_i/\mathbf{x}, D) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j)}$$
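As a concrete illustration of this rule, here is a minimal Python sketch; the univariate Gaussian class-conditional densities (standing in for densities estimated from each Di) and the priors are made-up values:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x/w_i, D_i): univariate Gaussians
# whose parameters would have been estimated from each class's samples D_i.
class_densities = [norm(loc=0.0, scale=1.0), norm(loc=3.0, scale=1.5)]
priors = [0.6, 0.4]  # P(w_i), assumed known

def posterior(x):
    """P(w_i/x, D) = p(x/w_i, D_i) P(w_i) / sum_j p(x/w_j, D_j) P(w_j)."""
    joint = np.array([d.pdf(x) * P for d, P in zip(class_densities, priors)])
    return joint / joint.sum()

print(posterior(1.2))  # two posterior probabilities summing to 1
```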
Bayesian Estimation (BE)
• Need to estimate p(x/ωi, Di) for every class ωi:

$$P(\omega_i/\mathbf{x}, D) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j)}$$

• If the samples in Dj give no information about θi (for i ≠ j), we need to solve c independent problems of the following form:
“Given D, estimate p(x/D)”
BE Approach
• Estimate p(x/D) as follows:

$$p(\mathbf{x}/D) = \int p(\mathbf{x}, \theta/D)\,d\theta = \int p(\mathbf{x}/\theta, D)\,p(\theta/D)\,d\theta$$

• Since p(x/θ, D) = p(x/θ) (x is independent of D given θ), we have:

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta$$

• Important equation: it links p(x/D) with p(θ/D).
BE Main Steps
$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta$$

(1) Compute p(θ/D):

$$p(\theta/D) = \frac{p(D/\theta)\,p(\theta)}{p(D)} = \alpha \prod_{k=1}^{n} p(\mathbf{x}_k/\theta)\,p(\theta)$$

where $p(D/\theta) = \prod_{k=1}^{n} p(\mathbf{x}_k/\theta)$ for independently drawn samples, and α = 1/p(D) is a normalization constant.

(2) Compute p(x/D):

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta$$
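A grid-based sketch of these two steps for a scalar θ; the likelihood p(x/θ) = N(θ, 1), the prior, and the data below are illustrative assumptions, and the integrals are approximated by sums over the grid:

```python
import numpy as np
from scipy.stats import norm

D = np.array([1.1, 0.7, 1.6, 0.9])   # observed samples (made up)
theta = np.linspace(-5, 5, 2001)      # grid over the parameter
dtheta = theta[1] - theta[0]

# (1) p(theta/D) = alpha * prod_k p(x_k/theta) * p(theta)
likelihood = np.prod([norm.pdf(xk, loc=theta, scale=1.0) for xk in D], axis=0)
prior = norm.pdf(theta, loc=0.0, scale=2.0)
post = likelihood * prior
post /= post.sum() * dtheta           # normalize so it integrates to 1

# (2) p(x/D) = integral of p(x/theta) p(theta/D) dtheta
def p_x_given_D(x):
    return np.sum(norm.pdf(x, loc=theta, scale=1.0) * post) * dtheta

print(p_x_given_D(1.0))
```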
Interpretation of BE Solution
• Suppose p(θ/D) peaks sharply at θ̂; then p(x/D) can be approximated as follows (assuming that p(x/θ) is smooth):

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta \approx p(\mathbf{x}/\hat{\theta})$$

(i.e., the best estimate is obtained by setting θ = θ̂)
Interpretation of BE Solution (cont’d)
• If we are less certain about the exact value of θ, we should consider a weighted average of p(x/θ) over the possible values of θ:

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta$$

• The samples exert their influence on p(x/D) through p(θ/D).
Relation to ML solution
• If p(D/θ) peaks sharply at θ̂, then p(θ/D) will, in general, peak sharply at θ̂ too (i.e., close to the ML solution):

$$p(\theta/D) = \frac{p(D/\theta)\,p(\theta)}{p(D)}$$

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta \approx p(\mathbf{x}/\hat{\theta})$$
Case 1: Univariate Gaussian, Unknown μ
D = {x1, x2, …, xn} (independently drawn)
Assume p(x/μ) ~ N(μ, σ²) and p(μ) ~ N(μ0, σ0²), where σ², μ0, and σ0² are known.
(1) Compute p(μ/D):

$$p(\mu/D) = \alpha \prod_{k=1}^{n} p(x_k/\mu)\,p(\mu)$$
Case 1: Univariate Gaussian, Unknown μ (cont’d)
• Substituting the Gaussian expressions for p(xk/μ) and p(μ) and absorbing all factors that do not depend on μ into constants, it can be shown that p(μ/D) has the form:

$$p(\mu/D) \sim N(\mu_n, \sigma_n^2)$$

(p(μ/D) peaks at μn)
Case 1: Univariate Gaussian, Unknown μ (cont’d)
$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the ML estimate.
• μn → μ̂n (the ML estimate) as n → ∞, and μn → μ0 as n → 0; in general, μn lies between them.
• σn² → 0 as n → ∞ (larger n implies more samples, hence a sharper posterior).
Case 1: Univariate Gaussian, Unknown μ (cont’d)
[Figure: Bayesian learning – as n increases, p(μ/D) becomes sharper and its peak moves toward the ML estimate]
Case 1: Univariate Gaussian, Unknown μ (cont’d)
(2) Compute p(x/D):

$$p(x/D) = \int p(x/\mu)\,p(\mu/D)\,d\mu \sim N(\mu_n, \sigma^2 + \sigma_n^2)$$

(the factors not dependent on μ can be pulled out of the integral)
• As the number of samples increases, p(x/D) converges to p(x/μ).
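A small sketch of these closed-form updates; the sample values and the prior hyperparameters μ0, σ0² below are illustrative:

```python
import numpy as np

D = np.array([2.1, 1.7, 2.5, 1.9, 2.2])   # samples (made up)
sigma2 = 1.0                               # known variance sigma^2
mu0, sigma0_2 = 0.0, 4.0                   # prior: p(mu) ~ N(mu0, sigma0^2)

n = len(D)
mu_hat = D.mean()                          # ML estimate

# Posterior p(mu/D) ~ N(mu_n, sigma_n^2)
mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * mu_hat \
     + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
sigma_n2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)

# Predictive density p(x/D) ~ N(mu_n, sigma^2 + sigma_n^2)
print(mu_n, sigma_n2, sigma2 + sigma_n2)
```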
Case 2: Multivariate Gaussian, Unknown μ
D = {x1, x2, …, xn} (independently drawn)
Assume p(x/μ) ~ N(μ, Σ) and p(μ) ~ N(μ0, Σ0), where Σ, μ0, and Σ0 are known.
(1) Compute p(μ/D):

$$p(\boldsymbol{\mu}/D) = \frac{p(D/\boldsymbol{\mu})\,p(\boldsymbol{\mu})}{p(D)} = \alpha \prod_{k=1}^{n} p(\mathbf{x}_k/\boldsymbol{\mu})\,p(\boldsymbol{\mu})$$
Case 2: Multivariate Gaussian, Unknown μ (cont’d)
• Substituting the expressions for p(xk/μ) and p(μ):

$$p(\boldsymbol{\mu}/D) = c_n \exp\!\left[-\frac{1}{2}(\boldsymbol{\mu}-\boldsymbol{\mu}_n)^t\,\boldsymbol{\Sigma}_n^{-1}\,(\boldsymbol{\mu}-\boldsymbol{\mu}_n)\right]$$

where

$$\boldsymbol{\mu}_n = \boldsymbol{\Sigma}_0\left(\boldsymbol{\Sigma}_0 + \tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\hat{\boldsymbol{\mu}}_n + \tfrac{1}{n}\boldsymbol{\Sigma}\left(\boldsymbol{\Sigma}_0 + \tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\boldsymbol{\mu}_0$$

$$\boldsymbol{\Sigma}_n = \boldsymbol{\Sigma}_0\left(\boldsymbol{\Sigma}_0 + \tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\tfrac{1}{n}\boldsymbol{\Sigma}$$

and $\hat{\boldsymbol{\mu}}_n = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$ is the ML estimate.
Case 2: Multivariate Gaussian (cont’d)
(2) Compute p(x/D):

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\boldsymbol{\mu})\,p(\boldsymbol{\mu}/D)\,d\boldsymbol{\mu} \sim N(\boldsymbol{\mu}_n, \boldsymbol{\Sigma} + \boldsymbol{\Sigma}_n)$$
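A sketch of these updates with NumPy; the data, Σ, and prior hyperparameters are made-up values:

```python
import numpy as np

D = np.random.default_rng(0).normal([1.0, -1.0], 1.0, size=(20, 2))  # made-up data
Sigma = np.eye(2)                              # known Sigma
mu0, Sigma0 = np.zeros(2), 4.0 * np.eye(2)     # prior p(mu) ~ N(mu0, Sigma0)

n = len(D)
mu_hat = D.mean(axis=0)                        # ML estimate of the mean

# Posterior p(mu/D) ~ N(mu_n, Sigma_n)
A = np.linalg.inv(Sigma0 + Sigma / n)
mu_n = Sigma0 @ A @ mu_hat + (Sigma / n) @ A @ mu0
Sigma_n = Sigma0 @ A @ (Sigma / n)

# Predictive density p(x/D) ~ N(mu_n, Sigma + Sigma_n)
print(mu_n, Sigma + Sigma_n)
```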
Recursive Bayes Learning
• Develop an incremental learning algorithm. Let Dⁿ = (x1, x2, …, xn−1, xn), so that Dⁿ⁻¹ = (x1, x2, …, xn−1).
• Rewrite p(Dⁿ/θ) as follows:

$$p(D^n/\theta) = \prod_{k=1}^{n} p(\mathbf{x}_k/\theta) = p(\mathbf{x}_n/\theta)\,p(D^{n-1}/\theta)$$
Recursive Bayes Learning (cont’d)
• Then, p(θ/Dⁿ) can be written recursively as follows:

$$p(\theta/D^n) = \frac{p(D^n/\theta)\,p(\theta)}{\int p(D^n/\theta)\,p(\theta)\,d\theta} = \frac{p(\mathbf{x}_n/\theta)\,p(D^{n-1}/\theta)\,p(\theta)}{\int p(\mathbf{x}_n/\theta)\,p(D^{n-1}/\theta)\,p(\theta)\,d\theta} = \frac{p(\mathbf{x}_n/\theta)\,p(\theta/D^{n-1})}{\int p(\mathbf{x}_n/\theta)\,p(\theta/D^{n-1})\,d\theta}, \qquad n = 1, 2, \ldots$$

where $p(\theta/D^0) = p(\theta)$.
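For the Gaussian Case 1, this recursion has a closed form: after each new sample, the old posterior N(μ, σₙ²) simply plays the role of the prior. A sketch with illustrative values:

```python
# Recursive update of p(mu/D^n) for Case 1: after each new sample x_n,
# the previous posterior N(mu_prev, s2_prev) acts as the prior.
sigma2 = 1.0                       # known data variance
mu_prev, s2_prev = 0.0, 4.0        # p(mu/D^0) = prior (illustrative values)

for x_n in [2.1, 1.7, 2.5, 1.9]:   # made-up sample stream
    s2_new = s2_prev * sigma2 / (s2_prev + sigma2)
    mu_new = (s2_prev * x_n + sigma2 * mu_prev) / (s2_prev + sigma2)
    mu_prev, s2_prev = mu_new, s2_new
    print(f"posterior after x_n={x_n}: N({mu_prev:.3f}, {s2_prev:.3f})")
```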
Recursive Bayes Learning – Example
• Assume p(x/θ) ~ U(0, θ) with prior p(θ/D⁰) = p(θ) ~ U(0, 10), and apply the recursion:

$$p(\theta/D^n) = \frac{p(x_n/\theta)\,p(\theta/D^{n-1})}{\int p(x_n/\theta)\,p(\theta/D^{n-1})\,d\theta}, \qquad \text{where } p(\theta/D^0) = p(\theta)$$
Recursive Bayes Learning (cont’d)
[Figure: p(θ/Dⁿ) over the iterations, starting from the flat prior p(θ/D⁰); the last sample is x₄ = 8]
• In general:

$$p(\theta/D^n) \propto \frac{1}{\theta^n} \qquad \text{for } \max_k[x_k] \le \theta \le 10$$

• ML estimate: p(θ/D⁴) peaks at θ̂ = 8, so

$$p(x/\hat{\theta}) \sim U(0, 8)$$

• Bayesian estimate:

$$p(x/D) = \int p(x/\theta)\,p(\theta/D)\,d\theta$$
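A grid-based sketch of this example; the slides only state that x₄ = 8, so the earlier samples below are hypothetical (chosen with max 8):

```python
import numpy as np

theta = np.linspace(0.01, 10.0, 1000)   # grid over theta in (0, 10]
dtheta = theta[1] - theta[0]
post = np.full_like(theta, 1.0 / 10.0)  # p(theta/D^0) = p(theta) ~ U(0, 10)

# Hypothetical sample sequence; only x4 = 8 is given on the slide.
for x_n in [4.0, 7.0, 2.0, 8.0]:
    lik = np.where(theta >= x_n, 1.0 / theta, 0.0)   # p(x/theta) ~ U(0, theta)
    post = lik * post
    post /= post.sum() * dtheta                       # normalize the posterior

print(theta[np.argmax(post)])   # the posterior peaks at max_k x_k = 8
```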
Multiple Peaks
• For most choices of p(x/θ), p(θ/Dⁿ) will peak strongly at θ̂ given enough samples; in this case:

$$p(\mathbf{x}/D) \approx p(\mathbf{x}/\hat{\theta})$$

• There might be cases, however, where θ cannot be determined uniquely from p(x/θ); in this case, p(θ/Dⁿ) will contain multiple peaks.
• The solution p(x/D) must then be obtained by integration:

$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\theta)\,p(\theta/D)\,d\theta$$
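A toy illustration of such ambiguity, assuming (purely for illustration) a likelihood that depends on θ only through |θ|, so the data cannot distinguish +θ from −θ and the posterior has two peaks:

```python
import numpy as np
from scipy.stats import norm

theta = np.linspace(-6, 6, 2001)
dtheta = theta[1] - theta[0]

# Likelihood depends on theta only through |theta|: p(x/theta) = N(x; |theta|, 1).
# Samples near 2 make theta = +2 and theta = -2 equally plausible.
D = [1.8, 2.2, 2.1]
post = norm.pdf(theta, loc=0.0, scale=4.0)   # broad prior
for x in D:
    post *= norm.pdf(x, loc=np.abs(theta), scale=1.0)
post /= post.sum() * dtheta                  # two peaks: near -2 and +2

# p(x/D) must average p(x/theta) over both peaks:
def p_x_given_D(x):
    return np.sum(norm.pdf(x, loc=np.abs(theta), scale=1.0) * post) * dtheta

print(p_x_given_D(2.0))
```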
ML vs Bayesian Estimation
• Number of training data
– The two methods are equivalent assuming an infinite number of training data (and prior distributions that do not exclude the true solution).
– For small training data sets, they give different results in most cases.
• Computational complexity
– ML uses differential calculus or gradient search to maximize the likelihood.
– Bayesian estimation requires complex multidimensional integration techniques.
ML vs Bayesian Estimation (cont’d)
• Solution complexity
– ML solutions are easier to interpret (i.e., they must be of the assumed parametric form).
– A Bayesian estimation solution might not be of the parametric form assumed.
• Prior distribution
– If the prior distribution p(θ) is uniform, Bayesian estimation solutions are equivalent to ML solutions.
– In general, the two methods will give different solutions.
ML vs Bayesian Estimation (cont’d)
• General comments
– There are strong theoretical and methodological arguments supporting Bayesian estimation.
– In practice, ML estimation is simpler and can lead to comparable performance.
Computational Complexity
(dimensionality: d, # training data: n, # classes: c)

$$g(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\hat{\boldsymbol{\mu}})^t\,\hat{\boldsymbol{\Sigma}}^{-1}(\mathbf{x}-\hat{\boldsymbol{\mu}}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\boldsymbol{\Sigma}}| + \ln P(\omega)$$

ML estimation
• Learning complexity
– μ̂: O(dn)
– Σ̂: O(d²n) (it has d(d+1)/2 distinct entries)
– Σ̂⁻¹: O(d³)
– ln|Σ̂|: O(d²)
– (d/2)ln 2π: O(1); ln P(ω): O(n)
– Overall: the O(d²n) term dominates (since n > d).
• These computations must be repeated c times!
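A sketch of the learning-phase computations with NumPy on made-up data (shapes chosen so that n > d); the cost of each step is noted in the comments:

```python
import numpy as np

n, d = 1000, 5
X = np.random.default_rng(1).normal(size=(n, d))   # made-up training data

mu_hat = X.mean(axis=0)                            # O(dn)
Xc = X - mu_hat
Sigma_hat = (Xc.T @ Xc) / n                        # O(d^2 n); d(d+1)/2 distinct entries
Sigma_inv = np.linalg.inv(Sigma_hat)               # O(d^3)
sign, logdet = np.linalg.slogdet(Sigma_hat)        # ln|Sigma_hat|
```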
Computational Complexity (cont’d)
(dimensionality: d, # training data: n, # classes: c)

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\hat{\boldsymbol{\mu}}_i)^t\,\hat{\boldsymbol{\Sigma}}_i^{-1}(\mathbf{x}-\hat{\boldsymbol{\mu}}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\boldsymbol{\Sigma}}_i| + \ln P(\omega_i)$$

• Classification complexity
– Quadratic term (x − μ̂i)ᵗ Σ̂i⁻¹ (x − μ̂i): O(d²); remaining terms: O(1).
– These computations must be repeated c times and take the max.
• Bayesian estimation: higher learning complexity, same classification complexity.
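A sketch of the classification phase, evaluating gi(x) for each class and taking the max; the two-class parameters below are illustrative placeholders:

```python
import numpy as np

d = 3
# Per-class parameters (mu_hat, Sigma_inv, ln|Sigma|, ln P), as produced by learning.
params = [
    (np.zeros(d), np.eye(d), 0.0, np.log(0.5)),
    (np.ones(d),  np.eye(d), 0.0, np.log(0.5)),
]

def g(x, mu_hat, Sigma_inv, logdet, logprior):
    diff = x - mu_hat                            # O(d)
    quad = -0.5 * diff @ Sigma_inv @ diff        # O(d^2): dominant term
    return quad - 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet + logprior  # O(1)

x = np.array([0.2, 0.1, 0.4])
label = np.argmax([g(x, *p) for p in params])    # repeat c times and take max
print(label)
```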
Main Sources of Error in Classifier Design
• Bayes error
– The error due to overlapping densities p(x/ωi).
• Model error
– The error due to choosing an incorrect model.
• Estimation error
– The error due to incorrectly estimated parameters (e.g., due to a small number of training examples):

$$P(\omega_i/\mathbf{x}, D) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j)}$$
Overfitting
• When the number of training examples is inadequate, the solution obtained might not be optimal.
• Consider the problem of fitting a curve to some data (see the sketch after this list):
– Points were selected from a parabola (plus noise).
– A 10th-degree polynomial fits the data perfectly but does not generalize well.
– A greater error on training data might improve generalization!
• Need more training examples than the number of model parameters!
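A sketch of this curve-fitting illustration on a made-up noisy parabola; the 10th-degree fit interpolates the training points but degrades off the training range:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 11)
y = 2 * x**2 + rng.normal(scale=0.1, size=x.shape)   # parabola plus noise

p2 = np.polyfit(x, y, 2)     # matches the true model
p10 = np.polyfit(x, y, 10)   # fits the 11 training points (near-)perfectly

x_test = np.linspace(-1.2, 1.2, 5)    # slightly outside the training range
print(np.polyval(p2, x_test))         # stays close to 2x^2
print(np.polyval(p10, x_test))        # blows up: poor generalization
```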
Overfitting (cont’d)
• Control model complexity:
– Assume a diagonal covariance matrix (i.e., uncorrelated features).
– Use the same covariance matrix for all classes and consolidate the data.
– Shrinkage technique (see the sketch after the formulas):

Shrink individual covariance matrices to the same (common) covariance:

$$\boldsymbol{\Sigma}_i(\alpha) = \frac{(1-\alpha)\,n_i\,\boldsymbol{\Sigma}_i + \alpha\,n\,\boldsymbol{\Sigma}}{(1-\alpha)\,n_i + \alpha\,n}, \qquad (0 < \alpha < 1)$$

Shrink the common covariance matrix to the identity matrix:

$$\boldsymbol{\Sigma}(\beta) = (1-\beta)\,\boldsymbol{\Sigma} + \beta\,\mathbf{I}, \qquad (0 < \beta < 1)$$
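A sketch of both shrinkage formulas; the covariance matrices, sample counts, and shrinkage parameters below are illustrative:

```python
import numpy as np

def shrink_to_common(Sigma_i, Sigma, n_i, n, alpha):
    """Sigma_i(alpha): shrink a class covariance toward the common covariance."""
    return ((1 - alpha) * n_i * Sigma_i + alpha * n * Sigma) / \
           ((1 - alpha) * n_i + alpha * n)

def shrink_to_identity(Sigma, beta):
    """Sigma(beta): shrink the common covariance toward the identity."""
    return (1 - beta) * Sigma + beta * np.eye(Sigma.shape[0])

# Illustrative use with made-up covariances:
Sigma_i = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma = np.eye(2)
print(shrink_to_common(Sigma_i, Sigma, n_i=20, n=100, alpha=0.3))
print(shrink_to_identity(Sigma_i, beta=0.2))
```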