bias and variance of the estimator prml 3.2 ethem chp. 4

Bias and Variance of the Estimator

PRML 3.2Ethem Chp. 4

In previous lectures we showed how to build classifiers when the underlying densities are known Bayesian Decision Theory introduced the general formulation

In most situations, however, the true distributions are unknown and must be estimated from data. Parameter Estimation (we saw the Maximum Likelihood Method)

Assume a particular form for the density (e.g. Gaussian), so only the parameters (e.g., mean and variance) need to be estimated

Maximum Likelihood Bayesian Estimation

Non-parametric Density Estimation (not covered) Assume NO knowledge about the density Kernel Density Estimation Nearest Neighbor Rule

2

3

The blue is the distributionof the estimates (e.g. Mean)

4

How Good is an Estimator

Assume our dataset X is sampled from a population specified up to the parameter how good is an estimator d(X) as an estimate for

Notice that the estimate depends on sample set X If we take an expectation of d(X) over different datasets X,

E[(d(X)-)2], and expand using E[dE[d(X), we get:

X

E[(d(X)-)2]= E[(d(X)-E[d)2] + (E[d]- )2

variance bias sq.

of the estimator

Using a simpler notation (dropping the dependence on X from the

notation – but knowing it exists):

E[(d-)2] = E[(d-E[d)2] + (E[d]- )2

variance bias sq.

of the estimator

5

Properties of and

ML is an unbiased estimator

ML is biased

Use instead:

6

The Bias-Variance Decomposition (1)

Recall the expected squared loss,

where

We said that the second term corresponds to the noise inherent in the random variable t.

What about the first term?

7


Suppose we were given multiple data sets, each of size N.

Any particular data set, D, will give a particular function y(x; D).

We then have:

8


Taking the expectation over D yields

9


Thus we can write

where

10

Bias measures how much the prediction (averaged over all data sets) differs from the desired regression function.

Variance measures how much the predictions for individual data sets vary around their average.

There is a trade-off between bias and variance As we increase model complexity,

bias decreases (a better fit to data) and variance increases (fit varies more with data)

11

bia

s

variance

f

gi g

f

Reminder: Overfitting

Concepts: Polynomial curve fitting, overfitting, regularization

13

Polynomial Curve Fitting

14

Sum-of-Squares Error Function

15

0th Order Polynomial

16

1st Order Polynomial

17

3rd Order Polynomial

18

9th Order Polynomial

19

Over-fitting

Root-Mean-Square (RMS) Error:

20

Polynomial Coefficients

Large weights results in large changes in y(x) for small changes in x.

Regularization

One solution to control complexity is to penalize complex models -> regularization.

22

Regularization

Penalize large coefficient values:

23

Regularization:

24

Regularization:

25

Regularization: vs.

26

Polynomial Coefficients

Small lambda = Large lambda = no regularization too much regularization

27


Example: 100 data sets, each with 25 data points from the sinusoidal h(x) = sin(2px), varying the degree of regularization, .

28


Regularization constant= exp{-0.31}.

29


Regularization constant= exp{-2.4}.

30

The Bias-Variance Trade-off

From these plots, we note that;

an over-regularized model (large ) will have a high bias

while an under-regularized model (small ) will have a high variance.

Minimum value of bias2+variance is around =-0.31This is close to the value that gives the minimum error on the test data.

31

bias

variance

g

f

32

Model Selection Procedures

Cross validation: Measure the total error, rather than bias/variance, on a validation set. Train/Validation sets K-fold cross validation Leave-One-Out No prior assumption about the models

33

Model Selection Procedures

2. Regularization (Breiman 1998): Penalize the augmented error: error on data + l.model complexity

1. If l is too large, we risk introducing bias2. Use cross validation to optimize for l

3. Structural Risk Minimization (Vapnik 1995): 1. Use a set of models ordered in terms of their complexities

1. Number of free parameters2. VC dimension,…

2. Find the best model w.r.t empirical error and model complexity.

4. Minimum Description Length Principle

5. Bayesian Model Selection: If we have some prior knowledge about the approximating function, it can be incorporated into the Bayesian approach in the form of p(model).

bias and variance of the estimator prml 3.2 ethem chp. 4

Documents

regularization slide

data slide

estimator slide

mean slide

variance bias

th order polynomial

polynomial curve fitting

rd order polynomial