Introduction to Machine Learning, 3rd Edition (Ethem Alpaydin)
Lecture Slides for Chapter 4: Parametric Methods
Modified by Prof. Carolina Ruiz, © The MIT Press, 2014, for CS539 Machine Learning at WPI

TRANSCRIPT

Page 1:

Lecture Slides for
INTRODUCTION TO MACHINE LEARNING, 3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014
Modified by Prof. Carolina Ruiz for CS539 Machine Learning at WPI
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e

Page 2:

CHAPTER 4: PARAMETRIC METHODS

Page 3:

Parametric Estimation

X = { x^t }_t where x^t ~ p(x)

Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X

e.g., N(μ, σ²), where θ = {μ, σ²}

Remember that a probability density function is defined as: p(x₀) ≡ lim_{ε→0} P(x₀ ≤ x < x₀ + ε) / ε

(Here "~" reads "distributed as", and p(x) is the probability density function.)

Page 4:

Maximum Likelihood Estimation

Likelihood of θ given the sample X: $l(\theta|X) = p(X|\theta) = \prod_t p(x^t|\theta)$

Log likelihood: $L(\theta|X) = \log l(\theta|X) = \sum_t \log p(x^t|\theta)$

Maximum likelihood estimator (MLE): $\theta^* = \arg\max_\theta L(\theta|X)$
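As an illustration of the argmax above (not part of the original slides), here is a minimal NumPy/SciPy sketch that finds θ* numerically for a hypothetical exponential density p(x|θ) = θ·exp(−θx); the data and the chosen density are assumptions made only to keep the example concrete:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical sample, assumed drawn from an exponential density p(x|theta) = theta * exp(-theta*x)
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=100)      # true theta = 1/scale = 0.5

def neg_log_likelihood(theta):
    # L(theta|X) = sum_t log p(x^t|theta); we minimize its negative
    return -np.sum(np.log(theta) - theta * X)

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print("theta* =", result.x, " 1/mean(X) =", 1 / X.mean())   # the two should agree
```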

Page 5:

Examples: Bernoulli/Multinomial

Bernoulli: two states, failure/success, x ∈ {0, 1}

$P(x) = p_o^x (1 - p_o)^{1 - x}$

$L(p_o|X) = \log \prod_t p_o^{x^t} (1 - p_o)^{1 - x^t}$

MLE: $p_o = \frac{\sum_t x^t}{N}$

Multinomial: K > 2 states, $x_i \in \{0, 1\}$

$P(x_1, x_2, \ldots, x_K) = \prod_i p_i^{x_i}$

$L(p_1, p_2, \ldots, p_K|X) = \log \prod_t \prod_i p_i^{x_i^t}$

MLE: $p_i = \frac{\sum_t x_i^t}{N}$
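A minimal sketch of these closed-form MLEs, assuming X is stored as a 0/1 array for the Bernoulli case and as one-hot rows for the multinomial case (hypothetical data, not from the slides):

```python
import numpy as np

# Bernoulli MLE: p_o = sum_t x^t / N
x = np.array([1, 0, 1, 1, 0, 1])              # hypothetical 0/1 outcomes
p_o = x.mean()                                # fraction of successes

# Multinomial MLE: p_i = sum_t x_i^t / N
X = np.array([[1, 0, 0],                      # hypothetical one-hot sample, K = 3 states
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
p = X.mean(axis=0)                            # per-state relative frequencies

print(p_o, p)
```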

Page 6:

Gaussian (Normal) Distribution

$p(x) = N(\mu, \sigma^2)$:

$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$

MLE for μ and σ²:

$m = \frac{\sum_t x^t}{N}$

$s^2 = \frac{\sum_t (x^t - m)^2}{N}$
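A short sketch of the Gaussian MLEs m and s², assuming a hypothetical i.i.d. sample:

```python
import numpy as np

# Hypothetical 1-D sample, assumed i.i.d. Gaussian
rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=500)

m = x.mean()                  # MLE of mu:      m   = sum_t x^t / N
s2 = ((x - m) ** 2).mean()    # MLE of sigma^2: s^2 = sum_t (x^t - m)^2 / N  (divides by N, so it is biased)
print(m, s2)
```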

Page 7:

Bias and Variance

Unknown parameter θ
Estimator $d_i = d(X_i)$ on sample $X_i$

Bias: $b_\theta(d) = E[d] - \theta$
Variance: $E[(d - E[d])^2]$

Mean square error: $r(d, \theta) = E[(d - \theta)^2] = (E[d] - \theta)^2 + E[(d - E[d])^2] = \text{Bias}^2 + \text{Variance}$
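The decomposition can be checked by simulation. A sketch, assuming a hypothetical Gaussian sample and two variance estimators (the MLE with divisor N and the unbiased version with divisor N − 1):

```python
import numpy as np

# Monte Carlo check of MSE = Bias^2 + Variance for two variance estimators (hypothetical setup)
rng = np.random.default_rng(2)
true_var = 4.0
N, trials = 10, 20000

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))
d_mle = samples.var(axis=1, ddof=0)        # divides by N   (the MLE s^2, biased)
d_unb = samples.var(axis=1, ddof=1)        # divides by N-1 (unbiased)

for name, d in [("MLE", d_mle), ("unbiased", d_unb)]:
    bias = d.mean() - true_var
    variance = d.var()
    mse = ((d - true_var) ** 2).mean()
    print(f"{name}: bias={bias:.3f}  variance={variance:.3f}  mse={mse:.3f}  bias^2+var={bias**2 + variance:.3f}")
```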

Page 8:

Bayes’ Estimator

Treat θ as a random variable with prior p(θ)

Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)

Full (predictive) density: p(x|X) = ∫ p(x|θ) p(θ|X) dθ

Maximum a Posteriori (MAP): θMAP = argmaxθ p(θ|X)

Maximum Likelihood (ML): θML = argmaxθ p(X|θ)

Bayes': θBayes' = E[θ|X] = ∫ θ p(θ|X) dθ

Page 9:

Comparing ML, MAP, and Bayes’

Let Θ be the set of all possible solutions θ.

Maximum a Posteriori (MAP): selects the θ that satisfies θMAP = argmaxθ p(θ|X)

Maximum Likelihood (ML): selects the θ that satisfies θML = argmaxθ p(X|θ)

Bayes': constructs the "weighted average" over all θ's in Θ: θBayes' = E[θ|X] = ∫ θ p(θ|X) dθ

Note that if the θ's in Θ are uniformly distributed, then θMAP = θML, since
θMAP = argmaxθ p(θ|X) = argmaxθ p(X|θ) p(θ)/p(X) = argmaxθ p(X|θ) = θML

Page 10:

Bayes’ Estimator: Example

$x^t \sim N(\theta, \sigma_o^2)$ and $\theta \sim N(\mu, \sigma^2)$

$\theta_{ML} = m$

$\theta_{MAP} = \theta_{Bayes'} = E[\theta|X] = \frac{N/\sigma_o^2}{N/\sigma_o^2 + 1/\sigma^2}\, m + \frac{1/\sigma^2}{N/\sigma_o^2 + 1/\sigma^2}\, \mu$
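A sketch of this posterior mean, with hypothetical values for σo, μ, and σ chosen only to show how the estimate shrinks the sample mean m toward the prior mean μ:

```python
import numpy as np

# Posterior mean of theta for x^t ~ N(theta, sigma_o^2) with prior theta ~ N(mu, sigma^2)
# (hypothetical numbers, chosen only to show the weighting between data and prior)
rng = np.random.default_rng(3)
sigma_o, mu, sigma = 2.0, 0.0, 1.0
x = rng.normal(1.5, sigma_o, size=20)
N, m = len(x), x.mean()

w_data  = (N / sigma_o**2) / (N / sigma_o**2 + 1 / sigma**2)
w_prior = (1 / sigma**2)   / (N / sigma_o**2 + 1 / sigma**2)
theta_bayes = w_data * m + w_prior * mu     # shrinks the ML estimate m toward the prior mean mu
print(m, theta_bayes)
```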

Page 11:

Parametric Classification

$g_i(x) = p(x|C_i)\, P(C_i)$, or equivalently $g_i(x) = \log p(x|C_i) + \log P(C_i)$

$p(x|C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{(x - \mu_i)^2}{2\sigma_i^2}\right]$

$g_i(x) = -\frac{1}{2}\log 2\pi - \log \sigma_i - \frac{(x - \mu_i)^2}{2\sigma_i^2} + \log P(C_i)$

Page 12:

Given the sample $X = \{x^t, r^t\}_{t=1}^{N}$, where

$r_i^t = 1$ if $x^t \in C_i$ and $r_i^t = 0$ if $x^t \in C_j,\ j \neq i$

ML estimates are:

$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}$

$m_i = \frac{\sum_t r_i^t\, x^t}{\sum_t r_i^t}$

$s_i^2 = \frac{\sum_t r_i^t (x^t - m_i)^2}{\sum_t r_i^t}$

Discriminant:

$g_i(x) = -\frac{1}{2}\log 2\pi - \log s_i - \frac{(x - m_i)^2}{2 s_i^2} + \log \hat{P}(C_i)$
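A minimal sketch of this plug-in classifier for two classes on hypothetical 1-D data; the constant −(1/2) log 2π is dropped since it is shared by all classes:

```python
import numpy as np

# Two-class 1-D Gaussian classifier from plug-in ML estimates (hypothetical data)
rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-1.0, 1.0, 60), rng.normal(2.0, 1.5, 40)])
r = np.concatenate([np.zeros(60, dtype=int), np.ones(40, dtype=int)])   # class labels 0/1

priors, means, variances = [], [], []
for i in (0, 1):
    xi = x[r == i]
    priors.append(len(xi) / len(x))      # P_hat(C_i) = sum_t r_i^t / N
    means.append(xi.mean())              # m_i
    variances.append(xi.var())           # s_i^2  (MLE, divides by the class count)

def g(x0, i):
    # g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P_hat(C_i)
    # (the shared constant -(1/2) log 2*pi is dropped)
    return (-0.5 * np.log(variances[i])
            - (x0 - means[i]) ** 2 / (2 * variances[i])
            + np.log(priors[i]))

x0 = 0.5
print("choose C1" if g(x0, 1) > g(x0, 0) else "choose C0")
```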

Page 13:

Equal variances

Single boundary at halfway between the means

Page 14:

Variances are different

Two boundaries

Page 15:

Page 16:

Regression

$r = f(x) + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$

Estimator: $g(x|\theta)$, so $p(r|x) \sim N(g(x|\theta), \sigma^2)$

$L(\theta|X) = \log \prod_{t=1}^{N} p(x^t, r^t) = \log \prod_{t=1}^{N} p(r^t|x^t) + \log \prod_{t=1}^{N} p(x^t)$

(the second term can be ignored since it does not depend on θ)

Page 17:

Regression: From LogL to Error

$L(\theta|X) = \log \prod_{t=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{[r^t - g(x^t|\theta)]^2}{2\sigma^2}\right]$
$\qquad\quad = -N \log(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_{t=1}^{N} [r^t - g(x^t|\theta)]^2$

$E(\theta|X) = \frac{1}{2} \sum_{t=1}^{N} [r^t - g(x^t|\theta)]^2$

Maximizing the log likelihood above is the same as minimizing the error.

The θ that minimizes the error is called the "least squares estimate".

Trick: when maximizing a likelihood l that contains exponents, instead minimize the error E = −log l.
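A small sketch of this equivalence on hypothetical data with known Gaussian noise: over a grid of candidate slopes, the slope maximizing the log likelihood is the same one minimizing the squared error.

```python
import numpy as np

# For Gaussian noise, the slope maximizing the log likelihood over a grid is the same
# slope that minimizes the squared error (hypothetical data, g(x|w1) = w1 * x, known sigma)
rng = np.random.default_rng(5)
sigma = 0.3
x = rng.uniform(0, 1, 50)
r = 2.0 * x + rng.normal(0, sigma, 50)        # true slope 2

w_grid = np.linspace(0, 4, 401)
sse = np.array([np.sum((r - w * x) ** 2) for w in w_grid])                   # E(w|X)
loglik = -len(x) * np.log(np.sqrt(2 * np.pi) * sigma) - sse / (2 * sigma**2)

print(w_grid[np.argmin(sse)], w_grid[np.argmax(loglik)])                     # identical
```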

Page 18:

Linear Regression

$g(x^t|w_1, w_0) = w_1 x^t + w_0$

Normal equations:

$\sum_t r^t = N w_0 + w_1 \sum_t x^t$
$\sum_t r^t x^t = w_0 \sum_t x^t + w_1 \sum_t (x^t)^2$

In matrix form, with

$A = \begin{bmatrix} N & \sum_t x^t \\ \sum_t x^t & \sum_t (x^t)^2 \end{bmatrix}$, $\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$, $\mathbf{y} = \begin{bmatrix} \sum_t r^t \\ \sum_t r^t x^t \end{bmatrix}$,

$A\mathbf{w} = \mathbf{y}$ and so $\mathbf{w} = A^{-1}\mathbf{y}$.
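A sketch that forms A and y from a hypothetical sample and solves A w = y (using a linear solve rather than forming A^{-1} explicitly):

```python
import numpy as np

# Linear regression by solving A w = y (hypothetical data)
rng = np.random.default_rng(6)
x = rng.uniform(0, 5, 30)
r = 1.5 + 0.8 * x + rng.normal(0, 0.2, 30)

N = len(x)
A = np.array([[N,       x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])
w0, w1 = np.linalg.solve(A, y)        # a linear solve is preferable to forming A^-1 explicitly
print(w0, w1)
```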

Page 19:

Polynomial Regression

$g(x^t|w_k, \ldots, w_2, w_1, w_0) = w_k (x^t)^k + \cdots + w_2 (x^t)^2 + w_1 x^t + w_0$

$D = \begin{bmatrix} 1 & x^1 & (x^1)^2 & \cdots & (x^1)^k \\ 1 & x^2 & (x^2)^2 & \cdots & (x^2)^k \\ \vdots & & & & \vdots \\ 1 & x^N & (x^N)^2 & \cdots & (x^N)^k \end{bmatrix}$, $\mathbf{r} = \begin{bmatrix} r^1 \\ r^2 \\ \vdots \\ r^N \end{bmatrix}$

then $\mathbf{w} = (D^T D)^{-1} D^T \mathbf{r}$,

which is $\mathbf{w} = A^{-1}\mathbf{y}$ where $A = D^T D$ and $\mathbf{y} = D^T \mathbf{r}$.
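A sketch of the same computation with a design matrix D built by np.vander, on hypothetical data:

```python
import numpy as np

# Polynomial regression of order k via w = (D^T D)^-1 D^T r (hypothetical data)
rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, 40)
r = np.sin(3 * x) + rng.normal(0, 0.1, 40)

k = 3
D = np.vander(x, k + 1, increasing=True)     # rows [1, x^t, (x^t)^2, ..., (x^t)^k]
w = np.linalg.solve(D.T @ D, D.T @ r)        # solves (D^T D) w = D^T r
print(w)                                     # coefficients w_0, w_1, ..., w_k
```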

Page 20:

Maximum Likelihood and Least Squares

Taken from Tom Mitchell's Machine Learning textbook:

"… under certain assumptions (*) any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood hypothesis."

(*) Assumption: "the observed training target values are generated by adding random noise to the true target value, where the random noise is drawn independently for each example from a Normal distribution with zero mean."

Page 21:

Other Error Measures

Square Error: $E(\theta|X) = \frac{1}{2} \sum_{t=1}^{N} [r^t - g(x^t|\theta)]^2$

Relative Square Error ($E_{RSE}$): $E_{RSE}(\theta|X) = \frac{\sum_{t=1}^{N} [r^t - g(x^t|\theta)]^2}{\sum_{t=1}^{N} [r^t - \bar{r}]^2}$

Absolute Error: $E(\theta|X) = \sum_t |r^t - g(x^t|\theta)|$

ε-sensitive Error: $E(\theta|X) = \sum_t 1(|r^t - g(x^t|\theta)| > \varepsilon)\,(|r^t - g(x^t|\theta)| - \varepsilon)$

R², Coefficient of Determination: $R^2 = 1 - E_{RSE}$ (see next slide)
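A sketch computing each of these measures for a hypothetical fit g; the data, the fitted line, and ε are assumptions for illustration only:

```python
import numpy as np

# Computing the listed error measures for a hypothetical fit g (here a fixed line)
rng = np.random.default_rng(8)
x = rng.uniform(0, 1, 100)
r = 3 * x + rng.normal(0, 0.2, 100)
g = 3.1 * x - 0.05                       # hypothetical predictions g(x^t|theta)

residual = r - g
square_error   = 0.5 * np.sum(residual ** 2)
relative_sq    = np.sum(residual ** 2) / np.sum((r - r.mean()) ** 2)     # E_RSE
absolute_error = np.sum(np.abs(residual))
eps = 0.1
eps_sensitive  = np.sum(np.where(np.abs(residual) > eps, np.abs(residual) - eps, 0.0))
r_squared      = 1 - relative_sq
print(square_error, relative_sq, absolute_error, eps_sensitive, r_squared)
```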

Page 22:

Coefficient of Determination: R2

Figure adapted from Wikipedia (Sept. 2015): "Coefficient of Determination" by Orzetto, own work. Licensed under CC BY-SA 3.0 via Commons: https://commons.wikimedia.org/wiki/File:Coefficient_of_Determination.svg#/media/File:Coefficient_of_Determination.svg

$R^2 = 1 - \frac{\sum_{t=1}^{N} [r^t - g(x^t|\theta)]^2}{\sum_{t=1}^{N} [r^t - \bar{r}]^2}$

(Figure: left panel shows the data's squared deviations from the average $\bar{r}$; right panel shows squared deviations from the regression estimates $g(x^t)$.)

The closer R² is to 1, the better, as this means that the g(·) estimates (on the right graph) fit the data well in comparison to the simple estimate given by the average value (on the left graph).

Page 23:

Bias and Variance

Given X = {x^t, r^t} drawn from an unknown joint pdf p(x, r), the expected square error of the g(·) estimate at x is

$E[(r - g(x))^2 \mid x] = E[(r - E[r|x])^2 \mid x] + (E[r|x] - g(x))^2$

noise: the variance of the added noise; it does not depend on g(·) or X
squared error: g(x)'s deviation from E[r|x]; it depends on g(·) and X

It may be that for one sample X, g(·) is a very good fit, while for another sample it is not. Given samples X, all of size N and drawn from the same joint density p(x, r), the expected value averaged over samples is

$E_X[(E[r|x] - g(x))^2 \mid x] = (E[r|x] - E_X[g(x)])^2 + E_X[(g(x) - E_X[g(x)])^2]$

bias: how much g(·) is wrong, disregarding the effect of varying samples
variance: how much g(·) fluctuates as the sample varies

Page 24:

Estimating Bias and Variance

M samples $X_i = \{x_i^t, r_i^t\}$, $i = 1, \ldots, M$, are used to fit $g_i(x)$, $i = 1, \ldots, M$:

$\bar{g}(x) = \frac{1}{M} \sum_{i} g_i(x)$

$\text{Bias}^2(g) = \frac{1}{N} \sum_t \left[\bar{g}(x^t) - f(x^t)\right]^2$

$\text{Variance}(g) = \frac{1}{N M} \sum_t \sum_i \left[g_i(x^t) - \bar{g}(x^t)\right]^2$
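A Monte Carlo sketch of these estimates for polynomial fits of order k, assuming a hypothetical target f(x) = sin(3x) and M samples of size N:

```python
import numpy as np

# Monte Carlo estimate of Bias^2 and Variance for order-k polynomial fits
# (hypothetical setup: f(x) = sin(3x), M samples of size N each)
rng = np.random.default_rng(9)
f = lambda x: np.sin(3 * x)
x_eval = np.linspace(-1, 1, 100)           # the points x^t at which bias/variance are measured
M, N, k = 50, 25, 3

g = np.empty((M, len(x_eval)))
for i in range(M):
    xi = rng.uniform(-1, 1, N)
    ri = f(xi) + rng.normal(0, 0.3, N)
    w = np.polyfit(xi, ri, k)              # least-squares fit g_i(x) on sample X_i
    g[i] = np.polyval(w, x_eval)

g_bar = g.mean(axis=0)                               # g_bar(x) = (1/M) sum_i g_i(x)
bias2 = np.mean((g_bar - f(x_eval)) ** 2)            # (1/N) sum_t [g_bar(x^t) - f(x^t)]^2
variance = np.mean((g - g_bar) ** 2)                 # (1/(N M)) sum_t sum_i [g_i(x^t) - g_bar(x^t)]^2
print(bias2, variance)
```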

Page 25:

Bias/Variance Dilemma

Examples:
g_i(x) = 2 has no variance and high bias
g_i(x) = Σ_t r_i^t / N has lower bias but higher variance

As we increase the complexity of g(·), bias decreases (a better fit to the data) and variance increases (the fit varies more with the data).

Bias/Variance dilemma (Geman et al., 1992)

Page 26:

(Figure: the true function f, the individual fits g_i, and their average ḡ, with the bias and variance indicated.)

Added noise ~ N(0,1). A linear regression for each of 5 samples.
Polynomial regression of order 3 (left) and order 5 (right) for each of 5 samples.
The dotted red line in each plot is the average of the five fits.

Page 27:

Polynomial Regression

Best fit “min error”

Same settings as plots on the previous slide, but using 100 models instead of 5

Page 28:

Best fit, “elbow”

Same settings as the plots on the previous slides, but using training and validation sets (50 instances each)

Fitted polynomials of orders 1, …, 8 on training set

Page 29:

Model Selection (1 of 2 slides)
Methods to fine-tune model complexity

Cross-validation: measures generalization accuracy by testing on data unused during training

Regularization: penalizes complex models: E' = error on data + λ · model complexity (where λ is a penalty weight); the lower E', the better

Other measures of "goodness of fit" with a complexity penalty:

Akaike's information criterion (AIC): AIC ≡ log p(X|θML, M) – k(M)

Bayesian information criterion (BIC): BIC ≡ log p(X|θML, M) – k(M) log(N)/2

where:
M is a model
log p(X|θML, M) is the log likelihood of M, where M's parameters θML have been estimated using maximum likelihood
k(M) is the number of adjustable parameters in θML of the model M
N is the size of the sample X

For both AIC and BIC, the higher the better.
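A sketch scoring polynomial models of increasing order with these AIC and BIC definitions, assuming a Gaussian noise model whose variance is estimated by maximum likelihood for each candidate order (hypothetical data and parameter count):

```python
import numpy as np

# Scoring polynomial models of increasing order with the AIC and BIC defined above
# (Gaussian noise model; sigma^2 is the ML estimate for each candidate order)
rng = np.random.default_rng(10)
x = rng.uniform(-1, 1, 60)
r = np.sin(3 * x) + rng.normal(0, 0.2, 60)
N = len(x)

for k in range(1, 7):
    w = np.polyfit(x, r, k)
    resid = r - np.polyval(w, x)
    sigma2 = np.mean(resid ** 2)                      # ML estimate of the noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)
    n_params = (k + 1) + 1                            # polynomial coefficients plus sigma^2
    aic = loglik - n_params
    bic = loglik - n_params * np.log(N) / 2
    print(f"order {k}: AIC = {aic:.1f}  BIC = {bic:.1f}")
```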

Page 30:

Model Selection (2 of 2 slides)
Methods to fine-tune model complexity

Minimum description length (MDL): Kolmogorov complexity, shortest description of the data
Given a dataset X, MDL(M) = description length of model M + description length of the data in X not correctly described by M
(the lower, the better)

Structural risk minimization (SRM)
Uses a set of models and their complexities (measured usually using their number of free parameters or their VC-dimension)
Selects the simplest model in terms of order and the best in terms of empirical error on the data

Page 31:

Bayesian Model Selection

$p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})}$

Used when we have a prior on models, p(model)

Regularization, when the prior favors simpler models
Bayes: MAP of the posterior, p(model|data)
Average over a number of models with high posterior (voting, ensembles: Chapter 17)

Page 32:

Regression example

Coefficients increase in magnitude as order increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]

Regularization (L2): $E(\mathbf{w}|X) = \frac{1}{2}\sum_{t=1}^{N}\left[r^t - g(x^t|\mathbf{w})\right]^2 + \lambda \sum_i w_i^2$
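A sketch of this L2-regularized (ridge) fit on hypothetical data, showing how larger λ shrinks the coefficient magnitudes; note that this simple version also penalizes the intercept w_0:

```python
import numpy as np

# L2-regularized (ridge) polynomial fit: larger lambda shrinks the coefficients
# (hypothetical data; this simple version also penalizes the intercept w_0)
rng = np.random.default_rng(11)
x = rng.uniform(-1, 1, 40)
r = np.sin(3 * x) + rng.normal(0, 0.1, 40)

k = 5
D = np.vander(x, k + 1, increasing=True)             # columns 1, x, ..., x^k
for lam in (0.0, 0.1, 1.0):
    w = np.linalg.solve(D.T @ D + lam * np.eye(k + 1), D.T @ r)
    print(f"lambda = {lam}: max |w_i| = {np.abs(w).max():.2f}")
```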