Chapter 3: Maximum-Likelihood Parameter Estimation
Outline:
- Introduction
- Maximum-Likelihood Estimation
- Multivariate Case: unknown μ, known Σ
- Univariate Case: unknown μ and unknown σ²
- Bias
- Appendix: Maximum-Likelihood Problem Statement
Introduction

Data availability in a Bayesian framework:
- We could design an optimal classifier if we knew the priors P(ωi) and the class-conditional densities P(x | ωi)
- Unfortunately, we rarely have this complete information!

Designing a classifier from a training sample:
- Estimating the priors is no problem
- Samples are often too small to estimate the class-conditional densities (large dimension of the feature space!)
A priori information about the problem:
- Normality of P(x | ωi): P(x | ωi) ~ N(μi, Σi), characterized by the parameters μi and Σi
- Estimation techniques: Maximum-Likelihood and Bayesian estimation. The results are nearly identical, but the approaches are different. We will not cover the details of Bayesian estimation.
- In Maximum-Likelihood estimation, the parameters are viewed as fixed but unknown! The best parameters are obtained by maximizing the probability of obtaining the samples actually observed.
- Bayesian methods view the parameters as random variables having some known distribution.
- In either approach, we use P(ωi | x) for our classification rule!
Maximum-Likelihood Estimation

- Has good convergence properties as the sample size increases
- Is simpler than alternative techniques

General principle: assume we have c classes and

$$P(x \mid \omega_j) \sim N(\mu_j, \Sigma_j)$$

$$P(x \mid \omega_j) \equiv P(x \mid \omega_j, \theta_j)$$

where θj consists of the components of μj and Σj:

$$\theta_j = (\mu_j, \Sigma_j) = \bigl(\mu_j^1, \mu_j^2, \ldots, \sigma_j^{11}, \sigma_j^{22}, \ldots, \operatorname{cov}(x_j^m, x_j^n), \ldots\bigr)$$
Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category.

Suppose that D contains n samples, x1, x2, …, xn, drawn independently. Then

$$P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta) = F(\theta)$$

P(D | θ) is called the likelihood of θ with respect to the set of samples. The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ): "it is the value of θ that best agrees with the actually observed training sample."
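To make the definition concrete, here is a minimal Python sketch (the sample values are made up for illustration, and NumPy/SciPy are assumed to be available) that evaluates the likelihood as a product of univariate Gaussian densities and, equivalently, the log-likelihood as a sum; the log form is what is used in practice because the raw product underflows for large n.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical training sample D = {x1, ..., xn}.
D = np.array([1.2, 0.8, 1.5, 0.9, 1.1])

def likelihood(D, mu, sigma):
    """P(D | theta) = product over k of P(x_k | theta)."""
    return np.prod(norm.pdf(D, loc=mu, scale=sigma))

def log_likelihood(D, mu, sigma):
    """l(theta) = sum over k of ln P(x_k | theta); numerically safer."""
    return np.sum(norm.logpdf(D, loc=mu, scale=sigma))

# Compare two candidate values of theta = (mu, sigma):
print(likelihood(D, mu=1.0, sigma=0.3))   # larger: agrees with the sample
print(likelihood(D, mu=2.0, sigma=0.3))   # smaller: poor agreement
```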
[Figure: the likelihood P(D | θ) and the log-likelihood l(θ) plotted as functions of θ for a fixed set of training samples: the samples are fixed, θ is unknown.]
Optimal Estimation

Let θ = (θ1, θ2, …, θp)^t and let ∇θ denote the gradient operator,

$$\nabla_\theta = \left( \frac{\partial}{\partial\theta_1}, \ldots, \frac{\partial}{\partial\theta_p} \right)^t$$

We define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ).

New problem statement: determine the θ that maximizes the log-likelihood,

$$\hat\theta = \arg\max_\theta l(\theta)$$
The set of necessary conditions for an optimum is:

$$\nabla_\theta l = 0, \qquad \text{where } l(\theta) = \sum_{k=1}^{n} \ln P(x_k \mid \theta)$$

and n is the number of training samples.
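When no closed-form solution exists, the same condition can be attacked numerically. A minimal sketch, assuming SciPy and reusing the hypothetical sample from above: minimize the negative log-likelihood with scipy.optimize.minimize and check that the optimum matches the sample mean and (ML) standard deviation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

D = np.array([1.2, 0.8, 1.5, 0.9, 1.1])  # hypothetical sample, as before

def neg_log_likelihood(theta):
    mu, log_sigma = theta  # optimize log(sigma) so that sigma stays positive
    return -np.sum(norm.logpdf(D, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # matches D.mean() and D.std(ddof=0)
```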
Multivariate Gaussian: unknown μ, known Σ

Samples are drawn from a multivariate Gaussian population, P(xk | μ) ~ N(μ, Σ). For a single sample,

$$\ln P(x_k \mid \mu) = -\frac{1}{2} \ln\!\bigl[(2\pi)^d |\Sigma|\bigr] - \frac{1}{2} (x_k - \mu)^t \Sigma^{-1} (x_k - \mu)$$

and therefore

$$\nabla_\mu \ln P(x_k \mid \mu) = \Sigma^{-1} (x_k - \mu)$$

The ML estimate for μ must therefore satisfy:

$$\sum_{k=1}^{n} \Sigma^{-1} (x_k - \hat\mu) = 0$$
Multiplying by Σ and rearranging, we obtain:

$$\hat\mu = \frac{1}{n} \sum_{k=1}^{n} x_k$$

This is just the arithmetic average of the training samples!

Conclusion: if P(xk | ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification!
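A minimal sketch of that conclusion in Python (NumPy assumed; the two-class data, the shared known Σ, and the equal priors are all invented for illustration): estimate each class mean as the sample average, then classify a point with the Gaussian discriminant.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.eye(2)                       # known, shared covariance (assumed)
mu_true = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 2.0])}

# Training samples for each class (invented data).
X = {c: rng.multivariate_normal(mu_true[c], Sigma, size=50) for c in (0, 1)}

# ML estimate: the arithmetic average of each class's samples.
mu_hat = {c: X[c].mean(axis=0) for c in (0, 1)}

Sigma_inv = np.linalg.inv(Sigma)

def classify(x):
    """Pick the class with the larger Gaussian log-density (equal priors)."""
    scores = {c: -0.5 * (x - mu_hat[c]) @ Sigma_inv @ (x - mu_hat[c])
              for c in (0, 1)}
    return max(scores, key=scores.get)

print(classify(np.array([1.8, 2.1])))   # expected: class 1
```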
Univariate Gaussian: unknown μ, unknown σ²

Samples are drawn from a univariate Gaussian population, P(xk | θ) ~ N(μ, σ²), with θ = (θ1, θ2) = (μ, σ²). For a single sample,

$$l = \ln P(x_k \mid \theta) = -\frac{1}{2} \ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$$

Setting the gradient to zero,

$$\nabla_\theta l = \begin{pmatrix} \dfrac{\partial}{\partial\theta_1}\bigl(\ln P(x_k \mid \theta)\bigr) \\[1ex] \dfrac{\partial}{\partial\theta_2}\bigl(\ln P(x_k \mid \theta)\bigr) \end{pmatrix} = 0$$

yields the two conditions

$$\frac{1}{\theta_2}(x_k - \theta_1) = 0, \qquad -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^2}{2\theta_2^2} = 0$$
Summation over all n samples gives:

$$\sum_{k=1}^{n} \frac{1}{\hat\theta_2}(x_k - \hat\theta_1) = 0 \quad (1)$$

$$-\sum_{k=1}^{n} \frac{1}{\hat\theta_2} + \sum_{k=1}^{n} \frac{(x_k - \hat\theta_1)^2}{\hat\theta_2^2} = 0 \quad (2)$$

Combining (1) and (2), one obtains:

$$\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k \; ; \qquad \hat\sigma^2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat\mu)^2$$
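These two formulas are a one-liner each in Python (NumPy assumed; the sample is the hypothetical one used earlier). Note that NumPy's var with ddof=0 computes exactly the ML estimate σ̂².

```python
import numpy as np

D = np.array([1.2, 0.8, 1.5, 0.9, 1.1])    # hypothetical sample

mu_hat = D.mean()                           # (1/n) * sum of x_k
sigma2_hat = np.mean((D - mu_hat) ** 2)     # (1/n) * sum of (x_k - mu_hat)^2

assert np.isclose(sigma2_hat, D.var(ddof=0))   # ddof=0 is the ML estimate
print(mu_hat, sigma2_hat)
```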
Bias

The Maximum-Likelihood estimate for σ² is biased:

$$E\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$

An elementary unbiased estimator for Σ is the sample covariance matrix:

$$C = \frac{1}{n-1}\sum_{k=1}^{n} (x_k - \hat\mu)(x_k - \hat\mu)^t$$
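The bias is easy to see empirically. A minimal simulation sketch (NumPy assumed; the true μ = 0, σ² = 1 and the sample size n = 5 are arbitrary choices): averaged over many trials, the ML estimate comes out near (n−1)/n · σ², while the 1/(n−1) version comes out near σ².

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000
samples = rng.normal(loc=0.0, scale=1.0, size=(trials, n))  # true sigma^2 = 1

ml_est = samples.var(axis=1, ddof=0)       # biased ML estimate (divides by n)
unbiased = samples.var(axis=1, ddof=1)     # divides by n - 1

print(ml_est.mean())     # ~ (n - 1)/n = 0.8
print(unbiased.mean())   # ~ 1.0
```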
Appendix: Maximum-Likelihood Problem Statement

Let D = {x1, x2, …, xn} with |D| = n. Since the samples are drawn independently,

$$P(x_1, \ldots, x_n \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$$

Our goal is to determine θ̂, the value of θ that makes this sample the most representative!
[Figure: the n training samples x1, …, xn partitioned into c subsets D1, …, Dk, …, Dc; the samples in each Dj are drawn independently according to the class-conditional density P(x | ωj) ~ N(μj, Σj).]
θ = (θ1, θ2, …, θc)

Problem: find θ̂ such that:

$$\max_\theta P(D \mid \theta) = \max_\theta P(x_1, \ldots, x_n \mid \theta) = \max_\theta \prod_{k=1}^{n} P(x_k \mid \theta)$$
Sources of final-system classification error (sec. 3.5.1)

- Bayes error: error due to overlapping densities for different classes (inherent error, can never be eliminated)
- Model error: error due to having an incorrect model
- Estimation error: error from estimating parameters from a finite sample
Problems of Dimensionality (sec. 3.7)

Accuracy, dimension, and training sample size: classification accuracy depends upon the dimensionality and the amount of training data. For the case of two classes, multivariate normal with the same covariance:

$$P(\text{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du, \qquad \text{where } r^2 = (\mu_1 - \mu_2)^t \Sigma^{-1} (\mu_1 - \mu_2)$$

$$\lim_{r \to \infty} P(\text{error}) = 0$$
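This error probability is just a Gaussian tail, so it is one call in Python. A minimal sketch (NumPy and SciPy assumed; the two means and the covariance are invented numbers): compute the Mahalanobis distance r, then P(error) = 1 − Φ(r/2) via scipy.stats.norm.sf.

```python
import numpy as np
from scipy.stats import norm

# Invented class means and shared covariance.
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

diff = mu1 - mu2
r = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)   # Mahalanobis distance

p_error = norm.sf(r / 2)   # (1/sqrt(2 pi)) * integral_{r/2}^inf e^{-u^2/2} du
print(r, p_error)          # the error shrinks as r grows
```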
If the features are independent, then Σ = diag(σ1², σ2², …, σd²) and

$$r^2 = \sum_{i=1}^{d} \left( \frac{\mu_{i1} - \mu_{i2}}{\sigma_i} \right)^2$$

- The most useful features are the ones for which the difference between the means is large relative to the standard deviation.
- It appears that adding new features improves accuracy: every term in the sum is non-negative, so each added feature can only increase r (see the sketch below).
- Yet it has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model!
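A minimal sketch of that theoretical claim (NumPy and SciPy assumed; the per-feature mean differences and standard deviations are invented): as independent features are added, r grows monotonically and the theoretical P(error) falls.

```python
import numpy as np
from scipy.stats import norm

# Invented per-feature mean differences and standard deviations.
mean_diff = np.array([1.5, 0.8, 0.4, 0.2, 0.1])
sigma = np.ones(5)

for d in range(1, 6):
    r = np.sqrt(np.sum((mean_diff[:d] / sigma[:d]) ** 2))
    print(d, r, norm.sf(r / 2))   # r grows, so the theoretical error shrinks
```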
Computational Complexity

Maximum-Likelihood estimation with Gaussian models in d dimensions, with n training samples for each of c classes: for each category we have to compute the discriminant function

$$g(x) = -\frac{1}{2}(x - \hat\mu)^t \hat\Sigma^{-1} (x - \hat\mu) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat\Sigma| + \ln P(\omega)$$

Estimating μ̂ costs O(d·n) and estimating Σ̂ costs O(n·d²), which dominates the remaining terms, so the total per category is O(d²·n) and the total for c classes is O(c·d²·n) = O(d²·n). The cost grows quickly when d and n are large!
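A minimal sketch of this per-class computation in Python (NumPy assumed; the data are invented), with the dominant costs noted inline. The cost annotations follow the counts quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))     # invented samples for one class
prior = 0.5                     # assumed prior P(omega)

mu_hat = X.mean(axis=0)                    # O(d*n)
diff = X - mu_hat
Sigma_hat = diff.T @ diff / n              # O(n*d^2): the dominant cost
Sigma_inv = np.linalg.inv(Sigma_hat)       # done once, independent of n

def g(x):
    """Gaussian discriminant for this class; O(d^2) per evaluation."""
    v = x - mu_hat
    return (-0.5 * v @ Sigma_inv @ v
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma_hat))
            + np.log(prior))

print(g(np.zeros(d)))
```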
Overfitting

The number of training samples n can be inadequate for estimating the parameters. What can be done?

- Simplify the model to reduce the number of parameters: assume all classes share the same covariance matrix, or assume statistical independence of the features (see the sketch below)
- Reduce the number of features d: Principal Component Analysis, etc.
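A minimal sketch of the first remedy (NumPy assumed; the data and class labels are invented): pooling one covariance across classes, or keeping only its diagonal, cuts the count of covariance parameters from c·d(d+1)/2 to d(d+1)/2 or d.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
X = {c: rng.normal(size=(10, d)) for c in (0, 1)}   # invented small samples

# Shared (pooled) covariance: one d x d matrix instead of one per class.
centered = np.vstack([X[c] - X[c].mean(axis=0) for c in (0, 1)])
Sigma_pooled = centered.T @ centered / len(centered)

# Statistical independence: keep only the diagonal (d parameters).
Sigma_diag = np.diag(np.diag(Sigma_pooled))

print(Sigma_pooled.shape)                 # (4, 4): d*(d+1)/2 free parameters
print(np.count_nonzero(Sigma_diag))       # 4: just the d variances
```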