ML.ppt - SVCL. Microsoft PowerPoint - ML.ppt [Compatibility Mode]. Author: nuno. Created: 10/11/2007 12:01:42 PM
TRANSCRIPT
[Page 1]
Maximum likelihood estimation
Nuno Vasconcelos, UCSD
[Page 2]
Bayesian decision theory
• recall that we have
– Y – state of the world
– X – observations
– g(x) – decision function
– L[g(x),y] – loss of predicting y with g(x)
• Bayes decision rule is the rule that minimizes the risk

    Risk = E_{X,Y}[ L(g(X), Y) ]

• for the "0-1" loss

    L[g(x), y] = 1 if g(x) ≠ y,  0 if g(x) = y

• optimal decision rule is the maximum a-posteriori probability rule
[Page 3]
MAP rule
• we have shown that it can be implemented in any of the three following ways
– 1) i^*(x) = \arg\max_i P_{Y|X}(i|x)
– 2) i^*(x) = \arg\max_i [ P_{X|Y}(x|i) P_Y(i) ]
– 3) i^*(x) = \arg\max_i [ \log P_{X|Y}(x|i) + \log P_Y(i) ]
• by introducing a "model" for the class-conditional distributions we can express this as a simple equation
– e.g. for the multivariate Gaussian

    P_{X|Y}(x|i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}
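The equivalence of forms 2) and 3) follows from the monotonicity of the log. A minimal sketch with two 1-D Gaussian classes (the priors, means, and sigmas below are made up for illustration, not from the slides):

```python
import math

# Two hypothetical 1-D Gaussian classes; all numbers are illustrative.
priors = [0.6, 0.4]
means  = [0.0, 2.0]
sigmas = [1.0, 1.0]

def gaussian(x, mu, sigma):
    # class-conditional density P_X|Y(x|i) under the Gaussian model
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def map_form2(x):
    # form 2): argmax_i P_X|Y(x|i) P_Y(i)
    scores = [gaussian(x, means[i], sigmas[i]) * priors[i] for i in range(2)]
    return scores.index(max(scores))

def map_form3(x):
    # form 3): argmax_i [log P_X|Y(x|i) + log P_Y(i)]
    scores = [math.log(gaussian(x, means[i], sigmas[i])) + math.log(priors[i])
              for i in range(2)]
    return scores.index(max(scores))
```

Because the log is strictly increasing, the two forms pick the same class for every x.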
[Page 4]
The Gaussian classifier
• the solution is

    i^*(x) = \arg\min_i [ d_i(x, \mu_i) + \alpha_i ]

  with

    d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y)
    \alpha_i = \log (2\pi)^d |\Sigma_i| - 2 \log P_Y(i)

[Figure: discriminant where P_{Y|X}(1|x) = 0.5]

• the optimal rule is to assign x to the closest class
• closest is measured with the Mahalanobis distance d_i(x,y)
• can be further simplified in special cases
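The rule above can be sketched directly in code. This is a minimal 2-D version with made-up class parameters (not from the slides); it computes the Mahalanobis distance plus the α_i correction for each class and picks the minimizer:

```python
import math

# Sketch of i*(x) = argmin_i [ d_i(x, mu_i) + alpha_i ] for 2-D Gaussians.

def inv2(S):
    # inverse and determinant of a 2x2 matrix
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    inv = [[ S[1][1] / det, -S[0][1] / det],
           [-S[1][0] / det,  S[0][0] / det]]
    return inv, det

def mahalanobis(x, mu, Sinv):
    # d_i(x, mu) = (x - mu)^T Sigma_i^{-1} (x - mu)
    d = [x[0] - mu[0], x[1] - mu[1]]
    return (d[0] * (Sinv[0][0] * d[0] + Sinv[0][1] * d[1]) +
            d[1] * (Sinv[1][0] * d[0] + Sinv[1][1] * d[1]))

def classify(x, mus, Sigmas, priors):
    best_score, best_i = None, None
    for i, (mu, S, p) in enumerate(zip(mus, Sigmas, priors)):
        Sinv, det = inv2(S)
        # alpha_i = log (2 pi)^d |Sigma_i| - 2 log P_Y(i), with d = 2
        alpha = math.log((2 * math.pi) ** 2 * det) - 2 * math.log(p)
        score = mahalanobis(x, mu, Sinv) + alpha
        if best_score is None or score < best_score:
            best_score, best_i = score, i
    return best_i
```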
[Page 5]
Geometric interpretation
• for Gaussian classes with equal covariance \Sigma = \sigma^2 I, the boundary is the hyperplane through x_0 with normal w, where

    w = \mu_i - \mu_j
    x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P_Y(i)}{P_Y(j)} \, (\mu_i - \mu_j)

[Figure: the separating hyperplane between \mu_i and \mu_j, crossing at x_0 with normal w]
[Page 6]
Geometric interpretation
• for Gaussian classes with equal but arbitrary covariance \Sigma, the boundary is the hyperplane through x_0 with normal w, where

    w = \Sigma^{-1} (\mu_i - \mu_j)
    x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\log [P_Y(i)/P_Y(j)]}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \, (\mu_i - \mu_j)

[Figure: the separating hyperplane between \mu_i and \mu_j, crossing at x_0 with normal w]
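A quick numerical sanity check of the x_0 formula: at x_0 the two class discriminants (Mahalanobis distance minus 2 log prior, with the shared-Σ terms dropped) should tie exactly, i.e. x_0 lies on the boundary. The parameters below are made up, and Σ is diagonal so its inverse is trivial:

```python
import math

# Illustrative parameters (not from the slides); Sigma = I, stored as diagonal entries.
mu_i, mu_j = [0.0, 0.0], [2.0, 0.0]
Sigma_inv = [1.0, 1.0]
P_i, P_j = 0.7, 0.3

dm = [mu_i[0] - mu_j[0], mu_i[1] - mu_j[1]]
# (mu_i - mu_j)^T Sigma^{-1} (mu_i - mu_j)
quad = sum(Sigma_inv[k] * dm[k] ** 2 for k in range(2))

w  = [Sigma_inv[k] * dm[k] for k in range(2)]
x0 = [(mu_i[k] + mu_j[k]) / 2 - math.log(P_i / P_j) / quad * dm[k] for k in range(2)]

def g(x, mu, P):
    # per-class discriminant: Mahalanobis distance minus 2 log prior
    return sum(Sigma_inv[k] * (x[k] - mu[k]) ** 2 for k in range(2)) - 2 * math.log(P)
```

At x0 the discriminants agree up to floating-point rounding, confirming that x0 sits on the decision boundary.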
[Page 7]
Bayesian decision theory
• advantages:
– BDR is optimal and cannot be beaten
– Bayes keeps you honest
– models reflect causal interpretation of the problem, this is how we think
– natural decomposition into "what we knew already" (prior) and "what the data tells us" (CCD)
– no need for heuristics to combine these two sources of info
– BDR is, almost invariably, intuitive
– Bayes rule, chain rule, and marginalization enable modularity, and scalability to very complicated models and problems
• problems:
– BDR is optimal only insofar as the models are correct
[Page 8]
Implementation
• we do have an optimal solution

    w = \Sigma^{-1} (\mu_i - \mu_j)
    x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\log [P_Y(i)/P_Y(j)]}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \, (\mu_i - \mu_j)

• but in practice we do not know the values of the parameters \mu, \Sigma, P_Y(i)
– we have to somehow estimate these values
– this is OK, we can come up with an estimate from a training set
– e.g. use the average value as an estimate for the mean

    \hat{w} = \hat\Sigma^{-1} (\hat\mu_i - \hat\mu_j)
    \hat{x}_0 = \frac{\hat\mu_i + \hat\mu_j}{2} - \frac{\log [\hat{P}_Y(i)/\hat{P}_Y(j)]}{(\hat\mu_i - \hat\mu_j)^T \hat\Sigma^{-1} (\hat\mu_i - \hat\mu_j)} \, (\hat\mu_i - \hat\mu_j)
[Page 9]
Important
• warning: at this point all optimality claims for the BDR cease to be valid!!
• the BDR is guaranteed to achieve the minimum loss when we use the true probabilities
• when we "plug in" the probability estimates, we could be implementing a classifier that is quite distant from the optimal
– e.g. if the P_{X|Y}(x|i) look like the example above, I could never approximate them well by parametric models (e.g. Gaussian)
[Page 10]
Maximum likelihood
• this seems pretty serious
– how should I get these probabilities then?
• we rely on the maximum likelihood (ML) principle
• this has three steps:
– 1) we choose a parametric model for all probabilities
– to make this clear we denote the vector of parameters by \Theta and the class-conditional distributions by

    P_{X|Y}(x|i; \Theta)

– note that this means that \Theta is NOT a random variable (otherwise it would have to show up as a subscript)
– it is simply a parameter, and the probabilities are a function of this parameter
[Page 11]
Maximum likelihood
• three steps:
– 2) we assemble a collection of datasets
    D^{(i)} = {x_1^{(i)}, ..., x_n^{(i)}}, a set of examples drawn independently from class i
– 3) we select the parameters of class i to be the ones that maximize the probability of the data from that class

    \Theta_i^* = \arg\max_\Theta P_{X|Y}(D^{(i)}|i; \Theta)
               = \arg\max_\Theta \log P_{X|Y}(D^{(i)}|i; \Theta)

– like before, it does not really make any difference to maximize probabilities or their logs
[Page 12]
Maximum likelihood
• since
– each sample D^{(i)} is considered independently
– parameter \Theta_i is estimated only from sample D^{(i)}
• we simply have to repeat the procedure for all classes
• so, from now on we omit the class variable

    \Theta^* = \arg\max_\Theta P_X(D; \Theta)
             = \arg\max_\Theta \log P_X(D; \Theta)

• the function P_X(D; \Theta) is called the likelihood of the parameter \Theta with respect to the data
• or simply the likelihood function
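The likelihood function can be made concrete with a small sketch: fix a made-up sample and variance (both illustrative, not from the slides), treat the Gaussian log-likelihood as a function of the mean parameter, and scan it over a grid. The maximizer is the sample mean:

```python
import math

# Made-up 1-D sample; variance held fixed so the likelihood is a function of mu only.
D = [1.0, 2.0, 4.0, 5.0]
sigma2 = 1.0

def log_likelihood(mu):
    # log P_X(D; mu) for iid Gaussian data
    return sum(-0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
               for x in D)

# scan candidate means on a grid; the peak sits at the sample mean (3.0 here)
candidates = [i / 10 for i in range(0, 61)]
best = max(candidates, key=log_likelihood)
```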
[Page 13]
Maximum likelihood
• note that the likelihood function is a function of the parameters \Theta
• it does not have the same shape as the density itself
• e.g. the likelihood function of a Gaussian is not bell-shaped
• the likelihood is defined only after we have a sample

    P_X(d; \Theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(d - \mu)^2}{2\sigma^2} \right\}
[Page 14]
Maximum likelihood
• given a sample, to obtain the ML estimate we need to solve

    \Theta^* = \arg\max_\Theta P_X(D; \Theta)

• when \Theta is a scalar this is high-school calculus
• we have a maximum when
– the first derivative is zero
– the second derivative is negative
[Page 15]
The gradient
• in higher dimensions, the generalization of the derivative is the gradient
• the gradient of a function f(w) at z is

    \nabla f(z) = \left( \frac{\partial f}{\partial w_0}(z), \ldots, \frac{\partial f}{\partial w_{n-1}}(z) \right)^T

• the gradient has a nice geometric interpretation
– it points in the direction of maximum growth of the function
– which makes it perpendicular to the contours where the function is constant

[Figure: contours of f(x,y) with gradients \nabla f(x_0, y_0) and \nabla f(x_1, y_1) perpendicular to them]
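The perpendicularity claim can be checked numerically with a central-difference gradient. The function below is a made-up example whose contours are circles, so the tangent direction at any point is known exactly:

```python
def grad(f, x, y, h=1e-6):
    # central-difference approximation of the gradient of f at (x, y)
    return ((f(x + h, y) - f(x - h, y)) / (2 * h),
            (f(x, y + h) - f(x, y - h)) / (2 * h))

f = lambda x, y: x * x + y * y   # contours are circles centered at the origin
gx, gy = grad(f, 1.0, 0.0)
# at (1, 0) the contour's tangent direction is (0, 1); the gradient must be
# perpendicular to it, and must point outward (direction of maximum growth)
```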
[Page 16]
The gradient
• note that if \nabla f = 0
– there is no direction of growth
– also -\nabla f = 0, and there is no direction of decrease
– we are either at a local minimum or maximum or "saddle" point
• conversely, at a local min or max or saddle point
– there is no direction of growth or decrease
– \nabla f = 0
• this shows that we have a critical point if and only if \nabla f = 0
• to determine which type we need second-order conditions

[Figure: surfaces illustrating a max, a min, and a saddle point]
[Page 17]
The Hessian
• the extension of the second-order derivative is the Hessian matrix

    \nabla^2 f(x) = \begin{bmatrix}
      \frac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_0 \partial x_{n-1}}(x) \\
      \vdots & & \vdots \\
      \frac{\partial^2 f}{\partial x_{n-1} \partial x_0}(x) & \cdots & \frac{\partial^2 f}{\partial x_{n-1}^2}(x)
    \end{bmatrix}

– at each point x, it gives us the quadratic function

    x^T \nabla^2 f(x) x

  that best approximates f(x)
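The Hessian and its role in classifying critical points can be sketched numerically: approximate the 2-D Hessian by finite differences, then read off definiteness from the determinant and trace (a 2x2 shortcut). The test functions are the standard made-up examples:

```python
def hessian2(f, x, y, h=1e-4):
    # finite-difference approximation of the 2x2 Hessian of f at (x, y)
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return [[fxx, fxy], [fxy, fyy]]

def classify_critical(H):
    # for a 2x2 symmetric matrix: det > 0, trace < 0 -> negative definite, etc.
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    tr = H[0][0] + H[1][1]
    if det > 0 and tr < 0:
        return "max"      # Hessian negative definite
    if det > 0 and tr > 0:
        return "min"      # Hessian positive definite
    return "saddle"       # indefinite (or degenerate)
```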
[Page 18]
The Hessian
• this means that, when the gradient is zero at x, we have
– a maximum when the function can be approximated by an "upwards-facing" quadratic
– a minimum when the function can be approximated by a "downwards-facing" quadratic
– a saddle point otherwise

[Figure: surfaces illustrating a max, a min, and a saddle point]
[Page 19]
The Hessian
• for any matrix M, the function

    x^T M x

• is
– an upwards-facing quadratic when M is negative definite
– a downwards-facing quadratic when M is positive definite
– a saddle otherwise
• hence, all that matters is the positive definiteness of the Hessian
• we have a maximum when the Hessian is negative definite
[Page 20]
Maximum likelihood
• in summary, given a sample, we need to solve

    \Theta^* = \arg\max_\Theta P_X(D; \Theta)

• the solutions are the parameters such that

    \nabla_\Theta P_X(D; \Theta) = 0
    \theta^T \nabla_\Theta^2 P_X(D; \Theta) \, \theta \leq 0, \quad \forall \theta \in \mathbb{R}^n

• note that you always have to check the second-order condition!
[Page 21]
Maximum likelihood
• let's consider the Gaussian example
• given a sample {T_1, ..., T_N} of independent points
• the likelihood is
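The likelihood formula on this slide did not survive extraction as text. For N iid points under the Gaussian model of the previous slides it takes the standard product form:

```latex
P_X(D; \Theta) = \prod_{j=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\left\{ -\frac{(T_j - \mu)^2}{2\sigma^2} \right\},
\qquad \Theta = (\mu, \sigma^2)
```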
[Page 22]
Maximum likelihood
• and the log-likelihood is
• the derivative with respect to the mean is zero when
• or
• note that this is just the sample mean
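The equations on this slide were lost in extraction. Under the iid Gaussian model, the standard derivation the bullets describe is:

```latex
\log P_X(D; \Theta) = -\frac{N}{2}\log(2\pi\sigma^2)
  - \sum_{j=1}^{N} \frac{(T_j - \mu)^2}{2\sigma^2}

\frac{\partial}{\partial \mu} \log P_X(D; \Theta)
  = \sum_{j=1}^{N} \frac{T_j - \mu}{\sigma^2} = 0
\quad\Longrightarrow\quad
\hat\mu = \frac{1}{N} \sum_{j=1}^{N} T_j
```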
[Page 23]
Maximum likelihood
• and the log-likelihood is
• the derivative with respect to the variance is zero when
• or
• note that this is just the sample variance
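The equations on this slide were lost in extraction. The standard variance step the bullets describe, under the same log-likelihood, is:

```latex
\frac{\partial}{\partial \sigma^2} \log P_X(D; \Theta)
  = -\frac{N}{2\sigma^2} + \sum_{j=1}^{N} \frac{(T_j - \mu)^2}{2\sigma^4} = 0
\quad\Longrightarrow\quad
\hat\sigma^2 = \frac{1}{N} \sum_{j=1}^{N} (T_j - \hat\mu)^2
```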
[Page 24]
Maximum likelihood
• example:
– if the sample is {10,20,30,40,50}
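The ML estimates for this slide's sample can be computed directly (the numeric results on the slide itself did not survive extraction, so they are reproduced here from the formulas):

```python
# ML estimates for the sample on this slide
sample = [10, 20, 30, 40, 50]
N = len(sample)
mu_hat = sum(sample) / N                                  # sample mean
var_hat = sum((x - mu_hat) ** 2 for x in sample) / N      # ML (1/N) variance
```

This gives a sample mean of 30 and an ML variance of 200 (note the 1/N, not 1/(N-1), normalization).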
[Page 25]
Homework
• show that the Hessian is negative definite

    \theta^T \nabla_\Theta^2 P_X(D; \Theta) \, \theta \leq 0, \quad \forall \theta \in \mathbb{R}^n

• show that these formulas can be generalized to the vector case
– D^{(i)} = {x_1^{(i)}, ..., x_n^{(i)}}, a set of examples from class i
– the ML estimates are

    \hat\mu_i = \frac{1}{n} \sum_j x_j^{(i)}
    \hat\Sigma_i = \frac{1}{n} \sum_j (x_j^{(i)} - \hat\mu_i)(x_j^{(i)} - \hat\mu_i)^T

• note that the ML solution is usually intuitive
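The vector-case formulas can be sketched on a tiny made-up 2-D dataset (the four points below are illustrative, not from the slides):

```python
# Vector ML estimates: mu_hat = (1/n) sum_j x_j,
#                      Sigma_hat = (1/n) sum_j (x_j - mu)(x_j - mu)^T
data = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
n, d = len(data), 2

mu = [sum(x[k] for x in data) / n for k in range(d)]

Sigma = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in data) / n
          for b in range(d)] for a in range(d)]
```

For this symmetric square of points the estimates come out to mean (1, 1) and the identity covariance.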
[Page 26]
Estimators
• when we talk about estimators, it is important to keep in mind that
– an estimate is a number
– an estimator is a random variable

    \hat\theta = f(X_1, \ldots, X_n)

• an estimate is the value of the estimator for a given sample
• if D = {x_1, ..., x_n}, when we say

    \hat\mu = \frac{1}{n} \sum_j x_j

  what we mean is

    \hat\mu = f(X_1, \ldots, X_n)\big|_{X_1 = x_1, \ldots, X_n = x_n}
    \quad\text{with}\quad
    f(X_1, \ldots, X_n) = \frac{1}{n} \sum_j X_j

  where the X_j are random variables
[Page 27]
Bias and variance
• we know how to produce estimators (by ML)
• how do we evaluate an estimator?
• Q1: is the expected value equal to the true value?
• this is measured by the bias
– if

    \hat\theta = f(X_1, \ldots, X_n)

  then

    Bias(\hat\theta) = E_{X_1,\ldots,X_n}\left[ f(X_1, \ldots, X_n) \right] - \theta

– an estimator that has bias will usually not converge to the perfect estimate \theta, no matter how large the sample is
– e.g. if \theta is negative and the estimator is

    f(X_1, \ldots, X_n) = \frac{1}{n} \sum_j X_j^2

  the bias is clearly non-zero
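The example above can be simulated: the estimator (1/n) Σ X_j² averages squares, so it can never output a negative number, and for a negative θ its bias cannot vanish no matter how large n gets. The numbers below are made up:

```python
import random

# Simulated demonstration; theta and the noise level are illustrative.
random.seed(0)
theta = -2.0                    # true (negative) quantity being estimated
n = 1000
# the estimator (1/n) sum X_j^2 applied to a sample centered at theta
estimate = sum(random.gauss(theta, 1.0) ** 2 for _ in range(n)) / n
```

The estimate is always non-negative, so it stays bounded away from the negative target θ.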
[Page 28]
Bias and variance
• such an estimator is said to be biased
– this means that it is not expressive enough to approximate the true value arbitrarily well
– this will be clearer when we talk about density estimation
• Q2: assuming that the estimator converges to the true value, how many sample points do we need?
– this can be measured by the variance

    Var(\hat\theta) = E_{X_1,\ldots,X_n}\left\{ \left( f(X_1,\ldots,X_n) - E_{X_1,\ldots,X_n}\left[ f(X_1,\ldots,X_n) \right] \right)^2 \right\}

– the variance usually decreases as one collects more training examples
[Page 29]
Example
• ML estimator for the mean of a Gaussian N(\mu, \sigma^2)

    Bias(\hat\mu) = E_{X_1,\ldots,X_n}[\hat\mu] - \mu
                  = E_{X_1,\ldots,X_n}\left[ \frac{1}{n} \sum_i X_i \right] - \mu
                  = \frac{1}{n} \sum_i E_{X_i}[X_i] - \mu
                  = \frac{1}{n} \sum_i \mu - \mu
                  = 0

• the estimator is unbiased
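The zero-bias conclusion can be illustrated by simulation: averaging the sample-mean estimator over many independently drawn samples recovers the true mean. All parameters below are made up:

```python
import random

# Simulated check that the sample mean is unbiased; parameters are illustrative.
random.seed(1)
mu_true, sigma = 1.5, 2.0
n, trials = 10, 5000

sample_means = [sum(random.gauss(mu_true, sigma) for _ in range(n)) / n
                for _ in range(trials)]
avg = sum(sample_means) / trials   # empirical E[mu_hat], should approach mu_true
```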