
Lecture 3: Statistical Decision Theory (Part II)

Wenbin Lu

Department of Statistics, North Carolina State University

Fall 2019


Outline of This Note

Part I: Statistical Decision Theory (Classical Statistical Perspective - "Estimation")

loss and risk

MSE and bias-variance tradeoff

Bayes risk and minimax risk

Part II: Learning Theory for Supervised Learning (Machine Learning Perspective - "Prediction")

optimal learner

empirical risk minimization

restricted estimators


Part II: Learning Theory for Supervised Learning - Loss, Risk, and Optimal Learner

Supervised Learning

Any supervised learning problem has three components:

Input vector X ∈ R^d.

Output Y, either discrete or continuous valued.

Probability framework:

(X, Y) ∼ P(X, Y).

The goal is to estimate the relationship between X and Y, described by f(X), for future prediction and decision-making.

For regression, f: R^d → R.

For K-class classification, f: R^d → {1, ..., K}.


Learning Loss Function

As in classical decision theory, we use a loss function L

to measure the discrepancy between Y and f(X)

and to penalize errors in predicting Y.

Examples:

squared error loss (used in mean regression)

L(Y, f(X)) = (Y − f(X))^2

absolute error loss (used in median regression)

L(Y, f(X)) = |Y − f(X)|

0-1 loss function (used in binary classification)

L(Y, f(X)) = I(Y ≠ f(X))


Risk, Expected Prediction Error (EPE)

The risk of f is

R(f) = E_{X,Y} L(Y, f(X)) = ∫ L(y, f(x)) dP(x, y).

In machine learning, R(f) is called the Expected Prediction Error (EPE).

For squared error loss,

R(f) = EPE(f) = E_{X,Y} [Y − f(X)]^2.

For 0-1 loss,

R(f) = EPE(f) = E_{X,Y} I(Y ≠ f(X)) = Pr(Y ≠ f(X)),

which is the probability of making a mistake.


Optimal Learner

We seek the f which minimizes the average (long-term) loss over the entire population.

Optimal Learner: the minimizer of R(f) is

f* = arg min_f R(f).

The solution f* depends on the loss L.

f* is not achievable without knowing P(x, y) or the entire population.

Bayes Risk: the optimal risk, i.e., the risk of the optimal learner,

R(f*) = E_{X,Y} L(Y, f*(X)), e.g., E_{X,Y}(Y − f*(X))^2 under squared error loss.


How to Find The Optimal Learner

Note that

R(f) = E_{X,Y} L(Y, f(X)) = ∫ L(y, f(x)) dP(x, y) = E_X { E_{Y|X} L(Y, f(X)) }.

One can do the following:

for each fixed x, find f*(x) by solving

f*(x) = arg min_c E_{Y|X=x} L(Y, c).


Example 1: Regression with Squared Error Loss

For regression, consider L(Y, f(X)) = (Y − f(X))^2. The risk is

R(f) = E_{X,Y}(Y − f(X))^2 = E_X E_{Y|X}[(Y − f(X))^2 | X].

For each fixed x, the best learner solves the problem

min_c E_{Y|X}[(Y − c)^2 | X = x].

The solution is E(Y|X = x), so

f*(x) = E(Y|X = x) = ∫ y dP(y|x).

So the optimal learner is f*(X) = E(Y|X).
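The claim that the conditional mean minimizes the conditional expected squared loss can be checked numerically. Below is a minimal Python sketch, not from the slides; the data-generating model Y | X = x0 ∼ sin(2πx0) + N(0, 0.25) is purely an illustrative assumption.

```python
# At a fixed x0, search over constants c for the smallest average squared loss
# on draws from P(Y | X = x0); the minimizer matches the (conditional) mean.
import numpy as np

rng = np.random.default_rng(0)
x0 = 0.3                                                            # fixed query point
y = np.sin(2 * np.pi * x0) + 0.5 * rng.standard_normal(100_000)     # draws from P(Y | X = x0)

cs = np.linspace(-2, 2, 401)                  # candidate constants c
risk = [np.mean((y - c) ** 2) for c in cs]    # empirical version of E[(Y - c)^2 | X = x0]

c_star = cs[np.argmin(risk)]
print(f"grid minimizer : {c_star:.3f}")
print(f"sample mean    : {y.mean():.3f}")
print(f"true E(Y|X=x0) : {np.sin(2 * np.pi * x0):.3f}")
```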


Decomposition of EPE: Under Squared Error Loss

For any learner f, its EPE under the squared error loss is decomposed as

EPE(f) = E_{X,Y}[Y − f(X)]^2 = E_X E_{Y|X}([Y − f(X)]^2 | X)

       = E_X E_{Y|X}([Y − E(Y|X) + E(Y|X) − f(X)]^2 | X)

       = E_X E_{Y|X}([Y − E(Y|X)]^2 | X) + E_X [f(X) − E(Y|X)]^2

       = E_{X,Y}[Y − f*(X)]^2 + E_X [f(X) − f*(X)]^2

       = EPE(f*) + MSE

       = Bayes Risk + MSE.

The Bayes risk (the gold standard) is a lower bound on the risk of any learner.
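For reference, the same decomposition written out in LaTeX, including the step (implicit above) where the cross term vanishes because E_{Y|X}[Y − E(Y|X)] = 0:

```latex
\begin{align*}
\mathrm{EPE}(f)
 &= \mathbb{E}_X\,\mathbb{E}_{Y|X}\big[\{Y - E(Y|X)\} + \{E(Y|X) - f(X)\}\big]^2 \\
 &= \mathbb{E}_X\Big[\mathbb{E}_{Y|X}\{Y - E(Y|X)\}^2
    + 2\,\{E(Y|X) - f(X)\}\underbrace{\mathbb{E}_{Y|X}\{Y - E(Y|X)\}}_{=\,0}
    + \{E(Y|X) - f(X)\}^2\Big] \\
 &= \underbrace{\mathbb{E}_{X,Y}\{Y - f^*(X)\}^2}_{\text{Bayes risk}}
    + \underbrace{\mathbb{E}_X\{f(X) - f^*(X)\}^2}_{\text{MSE}}.
\end{align*}
```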


Example 2: Classification with 0-1 Loss

In binary classification, assume Y ∈ {0, 1}. The learning rule is f: R^d → {0, 1}.

Using the 0-1 loss L(Y, f(X)) = I(Y ≠ f(X)),

the expected prediction error is

EPE(f) = E I(Y ≠ f(X))

       = Pr(Y ≠ f(X))

       = E_X [P(Y = 1|X) I(f(X) = 0) + P(Y = 0|X) I(f(X) = 1)].


Bayes Classification Rule for Binary Problems

For each fixed x, the minimizer of EPE(f) is given by

f*(x) = 1 if P(Y = 1|X = x) > P(Y = 0|X = x),
        0 if P(Y = 1|X = x) < P(Y = 0|X = x).

It is called the Bayes classifier, denoted by φ_B.

The Bayes rule can be written as

φ_B(x) = f*(x) = I[P(Y = 1|X = x) > 0.5].

To obtain the Bayes rule, we need to know P(Y = 1|X = x), or at least whether P(Y = 1|X = x) is larger than 0.5 or not.
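As an illustration of the Bayes rule, here is a hedged Python sketch under an assumed Gaussian mixture model, X | Y=1 ∼ N(1, 1), X | Y=0 ∼ N(−1, 1), Pr(Y=1) = 0.5, for which P(Y = 1 | x) is available in closed form; none of these distributional choices come from the slides.

```python
# Evaluate the Bayes classifier phi_B(x) = I[P(Y=1|x) > 0.5] on simulated data and
# compare its simulated error rate with the analytic Bayes risk.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 200_000
y = rng.binomial(1, 0.5, n)
x = np.where(y == 1, rng.normal(1.0, 1.0, n), rng.normal(-1.0, 1.0, n))

# posterior P(Y = 1 | x) by Bayes' theorem
p1 = 0.5 * norm.pdf(x, 1.0, 1.0)
p0 = 0.5 * norm.pdf(x, -1.0, 1.0)
post = p1 / (p1 + p0)

bayes_pred = (post > 0.5).astype(int)          # equivalent to the rule x > 0 here
print("simulated error of Bayes rule:", np.mean(bayes_pred != y))
print("analytic Bayes risk Phi(-1)  :", norm.cdf(-1.0))
```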


Part II: Learning Theory for Supervised Learning - Learning Process: ERM

Probability Framework

Assume the training set

(X_1, Y_1), ..., (X_n, Y_n) i.i.d. ∼ P(X, Y).

We aim to find the best function (optimal solution) from the model class F for decision making.

However, the optimal learner f* is not attainable without knowledge of P(x, y).

The training samples (x_1, y_1), ..., (x_n, y_n) generated from P(x, y) carry information about P(x, y).

We approximate the integral over P(x, y) by the average over the training samples.


Empirical Risk Minimization

Key Idea: We cannot compute R(f) = E_{X,Y} L(Y, f(X)), but we can compute its empirical version using the data!

Empirical Risk (also known as the training error):

R_emp(f) = (1/n) Σ_{i=1}^n L(y_i, f(x_i)).

A reasonable estimator f̂ should be close to f*, or converge to f* as more data are collected.

By the law of large numbers, R_emp(f) → EPE(f) as n → ∞, for each fixed f.
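A small Python sketch of this law-of-large-numbers statement for one fixed learner; the linear data model below is an assumption made only for the illustration.

```python
# For the fixed (not data-dependent) learner f(x) = 2x and data Y = 3X + eps,
# the training error R_emp(f) approaches EPE(f) = 1 + 1/3 as n grows.
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: 2.0 * x

def emp_risk(n):
    x = rng.uniform(0, 1, n)
    y = 3.0 * x + rng.standard_normal(n)     # truth: E(Y|X) = 3x, noise variance 1
    return np.mean((y - f(x)) ** 2)

# EPE(f) = E[(Y - 2X)^2] = Var(eps) + E[(3X - 2X)^2] = 1 + E[X^2] = 1 + 1/3
for n in [10, 100, 10_000, 1_000_000]:
    print(n, round(emp_risk(n), 4), "-> EPE(f) =", round(1 + 1 / 3, 4))
```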


Learning Based on ERM

We construct the estimator of f* by

f̂ = arg min_f R_emp(f).

This principle is called Empirical Risk Minimization (ERM).

For regression,

the empirical risk is known, up to the constant factor 1/n (which does not change the minimizer), as the Residual Sum of Squares (RSS):

RSS(f) = Σ_{i=1}^n [y_i − f(x_i)]^2.

ERM gives the least squares fit:

f̂_ols = arg min_f RSS(f) = arg min_f Σ_{i=1}^n [y_i − f(x_i)]^2.


Issues with ERM

Minimizing R_emp(f) over all possible functions is not a sensible strategy.

Any function passing through all (x_i, y_i) has RSS = 0. Example:

f_1(x) = y_i if x = x_i for some i = 1, ..., n,
         1 otherwise.

It is easy to check that RSS(f_1) = 0.

However, it does not amount to any form of learning. For future points not equal to any x_i, it will always predict y = 1. This phenomenon is called "overfitting".

Overfitting is undesirable, since a small R_emp(f) does not imply a small R(f) = ∫ L(y, f(x)) dP(x, y).
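A minimal Python sketch of the interpolating rule f_1 above, under an assumed sine-plus-noise data model: the training RSS is exactly zero, yet the risk on fresh data is far from the Bayes risk.

```python
# Memorize the training points, predict 1 everywhere else; compare training RSS
# with the average squared error on an independent test set.
import numpy as np

rng = np.random.default_rng(3)
n = 30
x_tr = rng.uniform(0, 1, n)
y_tr = np.sin(2 * np.pi * x_tr) + 0.3 * rng.standard_normal(n)

def f1(x, x_tr=x_tr, y_tr=y_tr):
    """Return y_i where x matches a training x_i, and 1 otherwise."""
    out = np.ones_like(x)
    for xi, yi in zip(x_tr, y_tr):
        out[np.isclose(x, xi)] = yi
    return out

x_te = rng.uniform(0, 1, 10_000)
y_te = np.sin(2 * np.pi * x_te) + 0.3 * rng.standard_normal(10_000)

print("training RSS:", np.sum((y_tr - f1(x_tr)) ** 2))     # 0 by construction
print("test risk   :", np.mean((y_te - f1(x_te)) ** 2))    # far above the Bayes risk 0.09
```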


Part II: Learning Theory for Supervised Learning - Restricted Model Class F

Class of Restricted Estimators

We need to restrict the estimation process to a set of functions F in order to control the complexity of the fitted functions and hence avoid overfitting.

Choose a simple model class F:

f̂_ols = arg min_{f ∈ F} RSS(f) = arg min_{f ∈ F} Σ_{i=1}^n [y_i − f(x_i)]^2.

Or choose a large model class F, but use a roughness penalty to control model complexity: find f ∈ F to minimize the penalized RSS,

f̂ = arg min_{f ∈ F} { RSS(f) + λ J(f) }.

The solution is called the regularized empirical risk minimizer (see the ridge sketch below for a concrete instance).

How should we choose the model class F: a simple one or a complex one?
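A minimal Python sketch of the penalized-RSS idea with a linear class and the ridge penalty J(f) = ||β||^2; the simulated data and this particular penalty are illustrative assumptions, not prescriptions from the slides.

```python
# Ridge regression: minimize RSS(beta) + lambda * ||beta||^2; larger lambda trades
# training fit for a less complex (smaller-norm) coefficient vector.
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 20
X = rng.standard_normal((n, d))
beta_true = np.zeros(d); beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.standard_normal(n)

def ridge(X, y, lam):
    """Closed-form minimizer of RSS(beta) + lam * ||beta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.0, 1.0, 10.0, 100.0]:
    b = ridge(X, y, lam)
    rss = np.sum((y - X @ b) ** 2)
    print(f"lambda={lam:6.1f}  RSS={rss:8.2f}  ||beta||={np.linalg.norm(b):.2f}")
```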


Parametric Learners

Some model classes can be parametrized as

F = {f(X; α), α ∈ Λ},

where α is the index parameter for the class.

Parametric Model Class: the form of f is known except for finitely many unknown parameters.

Linear models: linear regression, linear SVM, LDA

f(X; β) = β_0 + X'β_1, β_1 ∈ R^d,

f(X; β) = β_0 + β_1 sin(X_1) + β_2 X_2^2, X ∈ R^2.

Nonlinear models (a fit of the first form is sketched below)

f(x; β) = β_1 x^{β_2},

f(X; β) = β_1 + β_2 exp(β_3 X^{β_4}).
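As a sketch, ERM within the nonlinear parametric class f(x; β) = β_1 x^{β_2} amounts to nonlinear least squares over (β_1, β_2); the simulated data and the use of scipy.optimize.curve_fit are assumptions for illustration.

```python
# Fit the two-parameter power-law model by minimizing the residual sum of squares.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(5)
x = rng.uniform(0.1, 2.0, 200)
y = 1.5 * x ** 0.7 + 0.1 * rng.standard_normal(200)

def f(x, b1, b2):
    return b1 * x ** b2

beta_hat, _ = curve_fit(f, x, y, p0=[1.0, 1.0])     # nonlinear least squares
print("estimated (beta1, beta2):", np.round(beta_hat, 3))   # close to (1.5, 0.7)
```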


Advantages of Linear Models

Why are linear models in wide use?

If the linear assumptions are correct, a linear fit is more efficient than nonparametric fits.

convenient and easy to fit

easy to interpret

provides a first-order Taylor approximation to f(X)

Example: In linear regression, F = {all linear functions of x}.

ERM is then the same as solving ordinary least squares:

(β̂_0^ols, β̂^ols) = arg min_{(β_0, β)} Σ_{i=1}^n [y_i − β_0 − x_i'β]^2.
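A short numpy sketch of the OLS problem above (simulated data assumed); the intercept β_0 is handled by adding a column of ones to the design matrix.

```python
# Solve the least-squares problem for (beta0, beta) directly with np.linalg.lstsq.
import numpy as np

rng = np.random.default_rng(6)
n, d = 100, 3
X = rng.standard_normal((n, d))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(n)

design = np.column_stack([np.ones(n), X])           # rows [1, x_i']
coef, *_ = np.linalg.lstsq(design, y, rcond=None)   # minimizes the sum of squared residuals
print("beta0_hat:", round(coef[0], 3))
print("beta_hat :", np.round(coef[1:], 3))
```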


Drawbacks of Parametric Models

We never know whether the linear (or other parametric) assumptions are appropriate.

Parametric fits are not flexible in practice: they have derivatives of all orders and the same functional form everywhere.

Individual observations can have a large influence on remote parts of the fitted curve.

The polynomial degree cannot be controlled continuously.


More Flexible Learners

Nonparametric Model Class F: the form of f is unspecified, and the parameter set can be infinite-dimensional.

F = {all continuous functions of x}.

F = the second-order Sobolev space over [0, 1].

F is the space spanned by a set of basis functions (dictionary functions):

F = {f : f_θ(x) = Σ_{m=1}^∞ θ_m h_m(x)}.

Nearest-neighbor methods (see the sketch below): f̂(x) = k^{-1} Σ_{x_i ∈ N_k(x)} y_i, where N_k(x) is the neighborhood of x defined by the k closest points x_i in the training sample.

Kernel methods and local regression: F contains all piecewise-constant or piecewise-linear functions.
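A minimal sketch of the k-nearest-neighbor average f̂(x) defined above, on assumed sine-plus-noise training data; the choice k = 15 is arbitrary and made only for the illustration.

```python
# Predict at a query point by averaging the responses of its k nearest training points.
import numpy as np

rng = np.random.default_rng(7)
n, k = 200, 15
x_tr = rng.uniform(0, 1, n)
y_tr = np.sin(2 * np.pi * x_tr) + 0.3 * rng.standard_normal(n)

def knn_predict(x0, k=k):
    """Average the y_i of the k training points closest to x0."""
    idx = np.argsort(np.abs(x_tr - x0))[:k]
    return y_tr[idx].mean()

for x0 in [0.1, 0.25, 0.5, 0.75]:
    print(f"x={x0:.2f}  f_hat={knn_predict(x0):+.3f}  E(Y|X=x)={np.sin(2 * np.pi * x0):+.3f}")
```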


Examples of Nonparametric Methods

For regression problems,

splines, wavelets, kernel estimators, locally weighted polynomial regression, GAM

projection pursuit, regression trees, k-nearest neighbors

For classification problems,

kernel support vector machines, k-nearest neighbors

classification trees ...

For density estimation,

histogram, kernel density estimation


Semiparametric Models: a compromise between linear models and nonparametric models

partially linear models, e.g., f(X) = X_1'β + g(X_2), where X = (X_1, X_2), β is a finite-dimensional parameter, and g is an unspecified function


Nonparametric Models

Motivation: the underlying regression function is so complicated that no reasonable parametric model would be adequate.

Advantages:

does NOT assume any specific functional form

assume less, thus discover more

"nonparametric" means infinitely many parameters, not no parameters

local: the fit relies more heavily on neighboring observations

outliers and influential points have only a local effect on the fit

let the data speak for themselves and show us the appropriate functional form!

Price we pay: more computational effort, and lower efficiency and slower convergence rates if a linear model happens to be adequate.


Nonparametric vs Parametric Models

They are not mutually exclusive competitors.

A nonparametric regression estimate may suggest a simple parametric model.

In practice, parametric models are often preferred for simplicity.

A linear regression line is an infinitely smooth fit.

method          # parameters   estimate   outliers
parametric      finite         global     sensitive
nonparametric   infinite       local      robust


[Figure: four panels plotting y against x on [0, 1], showing a linear fit, a 2nd-order polynomial fit, a 3rd-order polynomial fit, and a smoothing spline (lambda = 0.85).]


Interpretation of Risk

Let F denote the restricted model space.

Denote f* = arg min_f EPE(f), the optimal learner; the minimization is taken over all measurable functions.

Denote f̃ = arg min_{f ∈ F} EPE(f), the best learner in the restricted space F.

Denote f̂ = arg min_{f ∈ F} R_emp(f), the learner based on limited data in the restricted space F.

EPE(f̂) − EPE(f*) = [EPE(f̂) − EPE(f̃)] + [EPE(f̃) − EPE(f*)]

                 = estimation error + approximation error.


Estimation Error and Approximation Error

The gap EPE(f̂) − EPE(f̃) is called the estimation error, which is due to the limited training data.

The gap EPE(f̃) − EPE(f*) is called the approximation error. It is incurred from approximating f* with a restricted model space or via regularization, since it is possible that f* ∉ F.

This decomposition is similar to the bias-variance trade-off, with

estimation error playing the role of variance

approximation error playing the role of bias.
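To make the two gaps concrete, here is a hedged simulation sketch under assumptions not in the slides: the truth is f*(x) = x^2 with X ∼ Uniform(−1, 1) and Gaussian noise, while F is the class of linear functions; by symmetry the best linear learner is the constant f̃(x) = 1/3, the approximation error equals 4/45, and the estimation error shrinks as n grows.

```python
# Estimate EPE(f_hat), EPE(f_tilde), EPE(f*) by Monte Carlo and report the two gaps.
import numpy as np

rng = np.random.default_rng(8)
sigma = 0.1                                          # noise sd, so EPE(f*) = sigma**2

def epe(predict, m=200_000):                         # Monte Carlo EPE under squared loss
    x = rng.uniform(-1, 1, m)
    y = x ** 2 + sigma * rng.standard_normal(m)
    return np.mean((y - predict(x)) ** 2)

epe_star = sigma ** 2                                # Bayes risk
epe_tilde = epe(lambda x: np.full_like(x, 1 / 3))    # best linear learner is the constant 1/3

for n in [10, 50, 200]:
    x = rng.uniform(-1, 1, n)
    y = x ** 2 + sigma * rng.standard_normal(n)
    b1, b0 = np.polyfit(x, y, 1)                     # ERM over F: least-squares line
    epe_hat = epe(lambda t: b0 + b1 * t)
    print(f"n={n:4d}  estimation error ~ {epe_hat - epe_tilde:.4f}  "
          f"approximation error ~ {epe_tilde - epe_star:.4f}")
```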
