Lecture Slides for
INTRODUCTION TO MACHINE LEARNING, 3RD EDITION
ETHEM ALPAYDIN © The MIT Press, 2014
Modified by Prof. Carolina Ruiz for CS539 Machine Learning at WPI
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 4: PARAMETRIC METHODS
Parametric Estimation
X = {x^t}_t where x^t ~ p(x)  ("~" reads "distributed as"; p(x) is a probability density function)
Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = {μ, σ²}
Remember that a probability density function is defined as: p(x₀) ≡ lim_{ε→0} P(x₀ ≤ x < x₀ + ε) / ε
Maximum Likelihood Estimation
Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)
Log likelihood: L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)
Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ|X)
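As a quick illustration (my own sketch, not part of the original slides), the MLE can be found numerically as the argmax of the log likelihood over a grid of candidate θ values. Here p(x|θ) is assumed to be a Gaussian with unknown mean θ and known σ = 1, so the grid argmax should agree with the sample mean:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=100)         # sample {x^t}

thetas = np.linspace(0.0, 4.0, 401)                  # candidate values of theta
logL = np.array([np.sum(-0.5 * (X - th) ** 2) for th in thetas])  # sum_t log p(x^t|theta), up to a constant
theta_star = thetas[np.argmax(logL)]                 # theta* = argmax_theta L(theta|X)
print(theta_star, X.mean())                          # the two values should be close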
Examples: Bernoulli/Multinomial
Bernoulli: two states, failure/success, x in {0, 1}
P(x) = p_o^x (1 − p_o)^(1−x)
L(p_o|X) = log ∏_t p_o^(x^t) (1 − p_o)^(1−x^t)
MLE: p_o = ∑_t x^t / N
Multinomial: K > 2 states, x_i in {0, 1}
P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i)
L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)
MLE: p_i = ∑_t x_i^t / N
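A minimal sketch of the closed-form MLEs above (the data and variable names are mine, chosen for illustration):

import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])           # Bernoulli sample, x^t in {0,1}
p_o = x.sum() / len(x)                      # MLE: p_o = sum_t x^t / N

# Multinomial: each row is a one-hot vector x^t with x_i^t in {0,1}
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
p = X.sum(axis=0) / X.shape[0]              # MLE: p_i = sum_t x_i^t / N
print(p_o, p)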
Gaussian (Normal) Distribution
p(x) = N(μ, σ²):
p(x) = [1 / (√(2π) σ)] exp[ −(x − μ)² / (2σ²) ]
MLE for μ and σ²:
m = ∑_t x^t / N
s² = ∑_t (x^t − m)² / N
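A short sketch of the Gaussian MLEs above; note that s² uses the 1/N (biased) normalization, matching the formula on the slide:

import numpy as np

x = np.random.default_rng(1).normal(5.0, 2.0, size=1000)
m = x.sum() / len(x)                        # m = sum_t x^t / N
s2 = ((x - m) ** 2).sum() / len(x)          # s^2 = sum_t (x^t - m)^2 / N  (same as np.var(x, ddof=0))
print(m, s2)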
Bias and Variance
Unknown parameter θ; estimator d_i = d(X_i) on sample X_i
Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]
Mean square error: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance
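An illustrative simulation (not from the slides) that estimates the bias, variance, and mean square error of the sample-mean estimator d(X) for a Gaussian mean θ; the decomposition MSE ≈ Bias² + Variance can be checked numerically:

import numpy as np

rng = np.random.default_rng(2)
theta, N, n_samples = 3.0, 20, 10000
d = np.array([rng.normal(theta, 1.0, N).mean() for _ in range(n_samples)])  # d_i = d(X_i)

bias = d.mean() - theta                     # b_theta(d) = E[d] - theta  (close to 0 here)
variance = ((d - d.mean()) ** 2).mean()     # E[(d - E[d])^2]
mse = ((d - theta) ** 2).mean()             # E[(d - theta)^2]
print(bias, variance, mse, bias**2 + variance)   # mse matches bias^2 + variance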
Bayes' Estimator
Treat θ as a random variable with prior p(θ)
Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)
Full Bayes: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
Maximum a Posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
Maximum Likelihood (ML): θ_ML = argmax_θ p(X|θ)
Bayes': θ_Bayes' = E[θ|X] = ∫ θ p(θ|X) dθ
Comparing ML, MAP, and Bayes'
Let Θ be the set of all possible solutions θ.
Maximum a Posteriori (MAP): selects the θ that satisfies θ_MAP = argmax_θ p(θ|X)
Maximum Likelihood (ML): selects the θ that satisfies θ_ML = argmax_θ p(X|θ)
Bayes': constructs the "weighted average" over all θ's in Θ: θ_Bayes' = E[θ|X] = ∫ θ p(θ|X) dθ
Note that if the θ's in Θ are uniformly distributed then θ_MAP = θ_ML, since
θ_MAP = argmax_θ p(θ|X) = argmax_θ p(X|θ) p(θ)/p(X) = argmax_θ p(X|θ) = θ_ML
Bayes' Estimator: Example
x^t ~ N(θ, σ_o²) and θ ~ N(μ, σ²)
θ_ML = m = ∑_t x^t / N
θ_MAP = θ_Bayes' = E[θ|X] = [ (N/σ_o²) / (N/σ_o² + 1/σ²) ] m + [ (1/σ²) / (N/σ_o² + 1/σ²) ] μ
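A small sketch of the example above, with assumed values for σ_o, σ, and μ; it shows how the Bayes'/MAP estimate shrinks the ML estimate m toward the prior mean μ:

import numpy as np

rng = np.random.default_rng(3)
sigma_o, mu, sigma = 1.0, 0.0, 2.0          # x^t ~ N(theta, sigma_o^2), theta ~ N(mu, sigma^2)
x = rng.normal(1.5, sigma_o, size=50)
N, m = len(x), x.mean()

w = (N / sigma_o**2) / (N / sigma_o**2 + 1 / sigma**2)   # weight on the sample mean
theta_bayes = w * m + (1 - w) * mu          # theta_MAP = theta_Bayes' = E[theta|X]
print(m, theta_bayes)                       # the posterior mean lies between m and mu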
Parametric Classification
Discriminant: g_i(x) = p(x|C_i) P(C_i), or equivalently g_i(x) = log p(x|C_i) + log P(C_i)
With Gaussian class-conditional densities, p(x|C_i) = [1 / (√(2π) σ_i)] exp[ −(x − μ_i)² / (2σ_i²) ],
g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)
Given the sample X = {x^t, r^t}_{t=1}^N, where
r_i^t = 1 if x^t ∈ C_i, and r_i^t = 0 if x^t ∈ C_j, j ≠ i
ML estimates are:
P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t x^t r_i^t / ∑_t r_i^t
s_i² = ∑_t (x^t − m_i)² r_i^t / ∑_t r_i^t
Discriminant:
g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2 s_i²) + log P̂(C_i)
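A hedged sketch of these ML estimates and the resulting discriminant for two one-dimensional Gaussian classes; the data and labels below are invented for illustration:

import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(3.0, 0.5, 40)])   # x^t
r = np.zeros((100, 2)); r[:60, 0] = 1; r[60:, 1] = 1                       # r_i^t (one-hot labels)

P = r.sum(axis=0) / len(x)                                   # P̂(C_i) = sum_t r_i^t / N
m = (x[:, None] * r).sum(axis=0) / r.sum(axis=0)             # m_i
s2 = (((x[:, None] - m) ** 2) * r).sum(axis=0) / r.sum(axis=0)   # s_i^2

def g(x0):                                                    # g_i(x0) for a new point x0
    return -0.5 * np.log(2 * np.pi) - 0.5 * np.log(s2) - (x0 - m) ** 2 / (2 * s2) + np.log(P)

print(np.argmax(g(0.5)), np.argmax(g(2.8)))                   # choose the class C_i with the largest g_i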
Equal variances
Single boundary at halfway between the means
Variances are different
Two boundaries
Regression
r = f(x) + ε, where ε ~ N(0, σ²)
p(r|x) ~ N(g(x|θ), σ²)
estimator: g(x|θ)
L(θ|X) = log ∏_{t=1}^N p(x^t, r^t)
        = log ∏_{t=1}^N p(r^t|x^t) + log ∏_{t=1}^N p(x^t)
The second term can be ignored since it does not depend on θ.
Regression: From LogL to Error
L(θ|X) = log ∏_{t=1}^N [1 / (√(2π) σ)] exp[ −(r^t − g(x^t|θ))² / (2σ²) ]
        = −N log(√(2π) σ) − [1 / (2σ²)] ∑_{t=1}^N [r^t − g(x^t|θ)]²
E(θ|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|θ)]²
Maximizing the log likelihood above is the same as minimizing the error.
The θ that minimizes the error is called the "least squares estimate".
Trick: when maximizing a likelihood l that contains exponentials, instead minimize the error E = −log l.
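A small numerical check of this equivalence (my own sketch): over a grid of slope values w, the w that maximizes the Gaussian log likelihood is the same w that minimizes the squared error:

import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 50)
r = 2.0 * x + rng.normal(0, 0.1, 50)                  # r = f(x) + noise, with f(x) = 2x

ws = np.linspace(0, 4, 401)
sse  = np.array([0.5 * np.sum((r - w * x) ** 2) for w in ws])            # E(w|X)
logL = np.array([np.sum(-0.5 * (r - w * x) ** 2 / 0.1**2) for w in ws])  # log likelihood, up to constants
print(ws[np.argmin(sse)], ws[np.argmax(logL)])        # same w in both cases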
Linear Regression
g(x^t | w_1, w_0) = w_1 x^t + w_0
Minimizing the squared error gives the normal equations:
∑_t r^t = N w_0 + w_1 ∑_t x^t
∑_t r^t x^t = w_0 ∑_t x^t + w_1 ∑_t (x^t)²
In matrix form A w = y, where
A = [ N         ∑_t x^t
      ∑_t x^t   ∑_t (x^t)² ],   w = [w_0, w_1]^T,   y = [∑_t r^t, ∑_t r^t x^t]^T
and so w = A⁻¹ y
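A minimal sketch of solving the normal equations A w = y for simple linear regression, with made-up data:

import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 5, 30)
r = 1.0 + 0.7 * x + rng.normal(0, 0.2, 30)

N = len(x)
A = np.array([[N,        x.sum()],
              [x.sum(),  (x**2).sum()]])
y = np.array([r.sum(), (r * x).sum()])
w0, w1 = np.linalg.solve(A, y)              # w = A^-1 y (solve is preferable to forming the inverse)
print(w0, w1)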
Polynomial Regression
g(x^t | w_k, ..., w_2, w_1, w_0) = w_k (x^t)^k + ... + w_2 (x^t)² + w_1 x^t + w_0
D = [ 1  x^1  (x^1)²  ...  (x^1)^k
      1  x^2  (x^2)²  ...  (x^2)^k
      ...
      1  x^N  (x^N)²  ...  (x^N)^k ],   r = [r^1, r^2, ..., r^N]^T
w = (D^T D)⁻¹ D^T r
which is again of the form A w = y, where A = D^T D and y = D^T r.
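A short sketch of the design-matrix solution above for an assumed cubic fit (k = 3); np.linalg.lstsq solves the same least-squares problem as w = (DᵀD)⁻¹Dᵀr without forming the inverse explicitly:

import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, 40)
r = np.sin(3 * x) + rng.normal(0, 0.1, 40)
k = 3

D = np.vander(x, k + 1, increasing=True)    # rows [1, x^t, (x^t)^2, ..., (x^t)^k]
w, *_ = np.linalg.lstsq(D, r, rcond=None)   # solves D^T D w = D^T r in the least-squares sense
print(w)                                    # [w_0, w_1, ..., w_k]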
Maximum Likelihood and Least Squares
Taken from Tom Mitchell’s Machine Learning textbook:
“… under certain assumptions (*) any learning algorithm that minimizes the squared error
between the output hypothesis predictions and the training data will output a maximum likelihood
hypothesis.”
(*) Assumption: "the observed training target values are generated by adding random noise to the true target value, where the random noise is drawn independently for each example from a Normal distribution with zero mean."
Other Error Measures
Square Error: E(θ|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|θ)]²
Relative Square Error (ERSE): E(θ|X) = ∑_{t=1}^N [r^t − g(x^t|θ)]² / ∑_{t=1}^N (r^t − r̄)²
Absolute Error: E(θ|X) = ∑_t |r^t − g(x^t|θ)|
ε-sensitive Error: E(θ|X) = ∑_t 1(|r^t − g(x^t|θ)| > ε) (|r^t − g(x^t|θ)| − ε)
R², Coefficient of Determination: R² = 1 − ERSE (see next slide)
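A compact sketch computing the error measures above for given targets r and predictions g; the value of ε is an arbitrary choice for illustration:

import numpy as np

r = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.1, 1.8, 3.3, 3.9])          # g(x^t|theta), however it was obtained
eps = 0.25

square   = 0.5 * np.sum((r - g) ** 2)
relative = np.sum((r - g) ** 2) / np.sum((r - r.mean()) ** 2)    # ERSE
absolute = np.sum(np.abs(r - g))
eps_sens = np.sum(np.where(np.abs(r - g) > eps, np.abs(r - g) - eps, 0.0))
r2 = 1 - relative                            # coefficient of determination
print(square, relative, absolute, eps_sens, r2)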
Coefficient of Determination: R2
Figure adapted from Wikipedia (Sept. 2015), "Coefficient of Determination" by Orzetto, own work, licensed under CC BY-SA 3.0 via Commons: https://commons.wikimedia.org/wiki/File:Coefficient_of_Determination.svg#/media/File:Coefficient_of_Determination.svg
R² = 1 − ∑_{t=1}^N [r^t − g(x^t|θ)]² / ∑_{t=1}^N (r^t − r̄)²
[Figure: left graph shows the deviations of the r^t around their mean r̄; right graph shows the residuals r^t − g(x^t|θ) around the fitted g]
The closer R² is to 1 the better, as this means that the g(.) estimates (on the right graph) fit the data well in comparison to the simple estimate given by the average value (on the left graph).
Bias and Variance
Given X = {x^t, r^t} drawn from an unknown joint pdf p(x, r), the expected square error of the g(.) estimate at x is
E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
  noise: variance of the added noise; does not depend on g(.) or X
  squared error: g(x)'s deviation from E[r|x]; depends on g(.) and X
It may be that for one sample X, g(.) is a very good fit, and for another sample it is not.
Given samples X, all of size N, drawn from the same joint density p(x, r), the expected value averaged over the X's is
E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²] = bias² + variance
  bias: how much g(.) is wrong, disregarding the effect of varying samples
  variance: how much g(.) fluctuates as the sample varies
Estimating Bias and Variance
M samples X_i = {x^t_i, r^t_i}, i = 1, ..., M, are used to fit g_i(x), i = 1, ..., M
Bias²(g) = (1/N) ∑_t [ḡ(x^t) − f(x^t)]²
Variance(g) = [1/(N M)] ∑_t ∑_i [g_i(x^t) − ḡ(x^t)]²
where ḡ(x) = (1/M) ∑_i g_i(x)
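A sketch of these bias²/variance estimates: M samples are drawn, each fit with an order-1 polynomial g_i, and the true function f is assumed known here only so that the bias can be computed:

import numpy as np

rng = np.random.default_rng(8)
f = lambda x: np.sin(2 * x)
x = np.linspace(0, 3, 25)                   # fixed evaluation points x^t
M, N = 10, len(x)

G = np.empty((M, N))
for i in range(M):                          # fit g_i on sample X_i
    r_i = f(x) + rng.normal(0, 0.3, N)
    G[i] = np.polyval(np.polyfit(x, r_i, 1), x)

g_bar = G.mean(axis=0)                      # ḡ(x) = (1/M) sum_i g_i(x)
bias2 = np.mean((g_bar - f(x)) ** 2)        # (1/N) sum_t [ḡ(x^t) - f(x^t)]^2
variance = np.mean((G - g_bar) ** 2)        # (1/(N M)) sum_t sum_i [g_i(x^t) - ḡ(x^t)]^2
print(bias2, variance)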
Bias/Variance Dilemma
Examples:
g_i(x) = 2 has no variance and high bias
g_i(x) = ∑_t r^t_i / N has lower bias but higher variance
As we increase the complexity of g(.), bias decreases (a better fit to the data) and variance increases (the fit varies more with the data).
Bias/Variance dilemma (Geman et al., 1992)
[Figure: the true function f, the fitted g_i's, and their average ḡ, illustrating bias and variance]
Added noise ~ N(0, 1). A linear regression for each of 5 samples.
Polynomial regression of order 3 (left) and order 5 (right) for each of 5 samples.
The dotted red line in each plot is the average of the five fits.
Polynomial Regression
Best fit “min error”
Same settings as plots on the previous slide, but using 100 models instead of 5
Best fit, “elbow”
Settings as plots on previous slides, but using training and validation sets (50 instances each)
Fitted polynomials of orders 1, …, 8 on training set
Model Selection (1 of 2 slides): Methods to fine-tune model complexity
Cross-validation: measures generalization accuracy by testing on data unused during training.
Regularization: penalizes complex models. E' = error on data + λ · model complexity (where λ is a penalty weight); the lower the better.
Other measures of "goodness of fit" with a complexity penalty:
Akaike's information criterion (AIC): AIC ≡ log p(X|θ_ML, M) − k(M)
Bayesian information criterion (BIC): BIC ≡ log p(X|θ_ML, M) − k(M) log(N)/2
where M is a model; log p(X|θ_ML, M) is the log likelihood of M, with M's parameters θ_ML estimated using maximum likelihood; k(M) is the number of adjustable parameters in θ_ML of the model M; and N is the size of the sample X.
For both AIC and BIC, the higher the better (a small numeric sketch follows).
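A sketch of computing AIC and BIC as defined above for polynomial models of increasing order; the Gaussian log likelihood is evaluated at the ML noise variance, and counting that noise variance among the adjustable parameters is an assumption of this example:

import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0, 1, 50)
r = 1 + 2 * x + rng.normal(0, 0.1, 50)
N = len(x)

for order in range(1, 6):
    g = np.polyval(np.polyfit(x, r, order), x)
    s2 = np.mean((r - g) ** 2)                              # ML estimate of the noise variance
    logL = -0.5 * N * (np.log(2 * np.pi * s2) + 1)          # log p(X|theta_ML, M)
    k = order + 2                                           # adjustable parameters: coefficients + variance
    aic = logL - k
    bic = logL - k * np.log(N) / 2
    print(order, round(aic, 1), round(bic, 1))              # higher is better for both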
Model Selection (2 of 2 slides): Methods to fine-tune model complexity
Minimum description length (MDL): based on Kolmogorov complexity, the shortest description of the data.
Given a dataset X, MDL(M) = description length of the model M + description length of the data in X not correctly described by M; the lower the better.
Structural risk minimization (SRM): uses a set of models and their complexities (usually measured by their number of free parameters or their VC dimension), and selects the model that is simplest in terms of order and best in terms of empirical error on the data.
Bayesian Model Selection
p(model|data) = p(data|model) p(model) / p(data)
Used when we have a prior on models, p(model).
Regularization corresponds to a prior that favors simpler models.
Bayes: take the MAP of the posterior p(model|data), or average over a number of models with high posterior (voting, ensembles: Chapter 17).
Regression example
Coefficients increase in magnitude as order increases:
order 1: [-0.0769, 0.0016]
order 2: [0.1682, -0.6657, 0.0080]
order 3: [0.4238, -2.5778, 3.4675, -0.0002]
order 4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]
Regularization (L2): E(w|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|w)]² + λ ∑_i w_i²
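A minimal sketch of the L2-regularized fit above: minimizing the penalized error gives w = (DᵀD + λI)⁻¹Dᵀr; λ, the polynomial order, and the data are arbitrary illustration values (and, following the slide's formula, w_0 is penalized along with the other coefficients):

import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(-1, 1, 30)
r = np.sin(3 * x) + rng.normal(0, 0.1, 30)
k, lam = 6, 0.1

D = np.vander(x, k + 1, increasing=True)                     # polynomial design matrix
w_ridge = np.linalg.solve(D.T @ D + lam * np.eye(k + 1), D.T @ r)   # regularized solution
w_plain = np.linalg.lstsq(D, r, rcond=None)[0]               # unregularized least squares
print(np.abs(w_ridge).max(), np.abs(w_plain).max())          # the penalty shrinks the coefficient magnitudes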