Model validation: Introduction to statistical considerations
Miguel Nakamura
Centro de Investigación en Matemáticas (CIMAT), Guanajuato, Mexico
[email protected]
Warsaw, November 2007


TRANSCRIPT

Page 1: Model validation: Introduction to statistical considerations Miguel Nakamura Centro de Investigación…

Model validation: Introduction to statistical considerations

Miguel Nakamura
Centro de Investigación en Matemáticas (CIMAT), Guanajuato, Mexico
[email protected]

Warsaw, November 2007

Page 2:

Starting point: Algorithmic modeling culture

[Diagram: the environment generates data (X, Y); a model f(X) maps X to the prediction Y* = f(X)]

Validation = examination of predictive accuracy.

Page 3:

Some key concepts

• Model complexity
• Loss
• Error

Page 4:

Model assessment: estimate prediction error on new data for a chosen model.

• Observed data (X_1, Y_1), (X_2, Y_2), …, (X_N, Y_N) used to construct model, f(X).
• Our intention is to use the model at NEW values, X*_1, X*_2, … by using f(X*_1), f(X*_2), …
• Model is good if values of Y*_1, Y*_2, … are "close" to f(X*_1), f(X*_2), …
• Big problem: we do not know the values Y*_1, Y*_2, … (If we knew the values, we wouldn't be resorting to models!)

Page 5:

Measuring error: Loss Function

L(Y, f(X)) is a loss function that measures "closeness" between the prediction f(X) and the observed Y.

Note: Either implicitly or explicitly, consciously or unconsciously, we specify a way of determining whether a prediction is close to reality or not.

Page 6:

Measuring error: examples of loss functions

Squared loss: L(Y, f(X)) = (Y - f(X))^2
Absolute loss: L(Y, f(X)) = |Y - f(X)|
An asymmetric loss: L(Y, f(X)) = 2(Y - f(X)) if Y > f(X), and f(X) - Y if Y ≤ f(X)

Note: There are several ways of measuring how good a prediction is relative to reality. Which way is relevant to adopt is not a mathematical question, but rather a question for the user.
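The three losses above can be sketched directly; the function names below are my own, and the factor of 2 in the asymmetric loss follows the slide's example of penalizing under-prediction twice as heavily as over-prediction.

```python
def squared_loss(y, fx):
    # (Y - f(X))^2
    return (y - fx) ** 2

def absolute_loss(y, fx):
    # |Y - f(X)|
    return abs(y - fx)

def asymmetric_loss(y, fx):
    # Under-prediction (Y > f(X)) costs twice as much as over-prediction.
    return 2 * (y - fx) if y > fx else (fx - y)

print(squared_loss(3.0, 1.0))     # 4.0
print(absolute_loss(3.0, 1.0))    # 2.0
print(asymmetric_loss(3.0, 1.0))  # 4.0 (under-prediction)
print(asymmetric_loss(1.0, 3.0))  # 2.0 (over-prediction)
```

The asymmetry is visible in the last two calls: the same absolute error of 2 costs 4 in one direction and 2 in the other.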

Page 7:

Examples of loss functions for binary outcomes

0-1 loss: L(Y, f(X)) = 0 if Y = f(X), 1 if Y ≠ f(X)

        f(X)=0   f(X)=1
Y=0       0        1
Y=1       1        0

Asymmetric loss: L(Y, f(X)) = 0 if Y = f(X), a if Y = 1 and f(X) = 0, b if Y = 0 and f(X) = 1

        f(X)=0   f(X)=1
Y=0       0        b
Y=1       a        0
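Both binary losses can be written as small functions (my own naming); printing them over Y ∈ {0, 1} and f(X) ∈ {0, 1} reproduces the two tables above.

```python
def zero_one_loss(y, fx):
    # 0 when the prediction matches the observation, 1 otherwise.
    return 0 if y == fx else 1

def asymmetric_binary_loss(y, fx, a, b):
    # a = cost of a false absence (Y=1, f(X)=0);
    # b = cost of a false presence (Y=0, f(X)=1).
    if y == fx:
        return 0
    return a if y == 1 else b

# Rows are Y=0 and Y=1; columns are f(X)=0 and f(X)=1 (here a=2, b=1).
for y in (0, 1):
    print([zero_one_loss(y, fx) for fx in (0, 1)],
          [asymmetric_binary_loss(y, fx, a=2, b=1) for fx in (0, 1)])
```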

Page 8:

Expected loss

f(X) is the model obtained using data. Test data X* is assumed to be coming at random. L(Y*, f(X*)) is random: it makes sense to consider the "expected loss", E{L(Y*, f(X*))}. This represents the "typical" difference between Y* and f(X*).

E{L(Y*, f(X*))} becomes a key concept for model evaluation.

Possible criticism or warning: if expected loss is computed/estimated under an assumed distribution for X*, but the model will be used under another distribution for X*, then expected loss may be irrelevant or misleading.
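Under an assumed distribution for (X*, Y*), the expected loss can be approximated by Monte Carlo averaging. The model f, the Uniform(0, 1) inputs, and the Gaussian noise below are all my own illustrative assumptions, not from the slides.

```python
import random

random.seed(0)

def f(x):
    # Hypothetical fitted model: the true regression line in this toy setup.
    return 2.0 * x

def expected_squared_loss(n_draws=100_000):
    # Average squared loss over simulated (X*, Y*) pairs:
    # X* ~ Uniform(0, 1), Y* = 2 X* + Gaussian noise with sd 0.5.
    total = 0.0
    for _ in range(n_draws):
        x = random.random()
        y = 2.0 * x + random.gauss(0.0, 0.5)
        total += (y - f(x)) ** 2
    return total / n_draws

est = expected_squared_loss()
print(round(est, 2))  # close to the noise variance, 0.25
```

Because f here matches the true mean exactly, the expected squared loss reduces to the noise variance; the Monte Carlo estimate should land near 0.25.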

Page 9:

Expected 0-1 loss

Expected loss = probability of misclassification:

E{L(f(X*), Y*)} = 0 · P{L(f(X*), Y*) = 0} + 1 · P{L(f(X*), Y*) = 1} = P{L(f(X*), Y*) = 1}

Page 10:

Expected asymmetric loss

E{L(f(X*), Y*)} = 0 · P{L(f(X*), Y*) = 0} + a · P{L(f(X*), Y*) = a} + b · P{L(f(X*), Y*) = b}
                = a · P{f(X*) = 0, Y* = 1} + b · P{f(X*) = 1, Y* = 0}

Expected loss = a × P(false absence) + b × P(false presence) = a × (omission rate) + b × (commission rate)
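Given labeled test data, this expected loss can be estimated by plugging in the empirical omission and commission rates. A minimal sketch (function name and toy data are my own):

```python
def expected_asymmetric_loss(y_true, y_pred, a, b):
    # a * (empirical omission rate) + b * (empirical commission rate),
    # with rates taken as fractions of all test points.
    n = len(y_true)
    false_absence = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    false_presence = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    return a * false_absence / n + b * false_presence / n

# One false absence and one false presence out of six test points:
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(expected_asymmetric_loss(y_true, y_pred, a=4, b=1))  # 4/6 + 1/6 = 5/6
```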

Page 11:

The validation challenge

To compute E{L(Y, f(X))}, given that we DO NOT KNOW the value of Y.

In the data modeling culture, E{L(Y, f(X))} can be computed mathematically. (Example to follow.)

In the algorithmic modeling culture, E{L(Y, f(X))} must be estimated in some way.

Page 12:

Example (in statistical estimation): When expected loss can be calculated

Object to predict is μ, the mean of a population.
Data is a set of n observations, X_1, X_2, …, X_n.
Prediction of μ is the sample mean, X̄_n.
Loss function is squared error loss: L(μ, X̄_n) = (X̄_n - μ)^2.
Expected Loss is called Mean Square Error.

Mathematics says: Expected Loss = Var(X)/n.

Note: This calculation for expected loss holds even if the parameter is unknown! But this can be calculated theoretically because assumptions are being made regarding the probability distribution of the observed data.
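The identity Expected Loss = Var(X)/n can be checked numerically. The sketch below (my own construction) draws many samples of size n from Uniform(0, 1), where μ = 0.5 and Var(X) = 1/12, and averages the squared error of the sample mean.

```python
import random

random.seed(1)

n, reps = 10, 200_000
mu, var = 0.5, 1.0 / 12.0

total = 0.0
for _ in range(reps):
    # One sample of size n and its sample mean.
    xbar = sum(random.random() for _ in range(n)) / n
    total += (xbar - mu) ** 2

mse = total / reps
print(round(mse, 4), round(var / n, 4))  # both approximately 0.0083
```

The simulated Mean Square Error and the theoretical Var(X)/n = 1/120 agree to within Monte Carlo error.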

Page 13:

Estimating expected loss

Expected loss, E{L(Y, f(X))}: average error over all values where prediction is required. This is what we are interested in.

Training error: (1/N) Σ_{i=1}^{N} L(Y_i, f(X_i)) — average error over the training sample.

Unfortunately, training error is a very bad estimator of expected loss.

Need an independent test sample: the notion of data-splitting is born.

(X_1, Y_1), …, (X_M, Y_M) used for model construction.
(X_{M+1}, Y_{M+1}), …, (X_N, Y_N) used for estimating expected loss.

Test error: (1/(N-M)) Σ_{i=M+1}^{N} L(Y_i, f(X_i)) — average error over the test sample.
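Data-splitting can be sketched end to end: fit a least-squares line on the first M points, then compute training error on those points and test error on the held-out remainder. The data-generating line and noise level below are my own assumptions.

```python
import random

random.seed(2)

# Simulated data: Y = 1 + 3X + Gaussian noise.
N, M = 60, 40
xs = [random.uniform(0, 1) for _ in range(N)]
ys = [1.0 + 3.0 * x + random.gauss(0.0, 0.3) for x in xs]

# Least-squares fit using only the training half.
xt, yt = xs[:M], ys[:M]
xbar, ybar = sum(xt) / M, sum(yt) / M
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xt, yt))
         / sum((x - xbar) ** 2 for x in xt))
intercept = ybar - slope * xbar

def f(x):
    return intercept + slope * x

# Squared-error loss averaged over each half of the split.
train_err = sum((y - f(x)) ** 2 for x, y in zip(xs[:M], ys[:M])) / M
test_err = sum((y - f(x)) ** 2 for x, y in zip(xs[M:], ys[M:])) / (N - M)
print(round(train_err, 3), round(test_err, 3))
```

Only the test error is an honest estimate of expected loss here, because the held-out points played no role in choosing the slope and intercept.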

Page 14:

[Figure: a simple model fitted to a training sample, plotted as Y against X; the truth, test points, training error, and test error are indicated.]

Page 15:

[Figure: the same setting with a complex model fitted to the training sample; the truth, test points, training error, and test error are indicated.]

Page 16:

Comparisons

[Figure: side-by-side plots of Y against X comparing the simple model and the complex model.]

Page 17:

[Figure: prediction error versus model complexity (low to high), with one curve for the training sample and one for the test sample. From Hastie et al. (2001).]
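The divergence between the two curves in this figure can be illustrated with a toy k-nearest-neighbour regression (my own construction, not from the slides), where a smaller k means a more complex model: training error always improves with complexity, down to exactly zero at k = 1, while test error need not.

```python
import random

random.seed(3)

def make_data(n):
    # Toy data: Y = X^2 + Gaussian noise, X ~ Uniform(0, 1).
    pts = []
    for _ in range(n):
        x = random.uniform(0.0, 1.0)
        pts.append((x, x * x + random.gauss(0.0, 0.2)))
    return pts

def knn_predict(x, train, k):
    # Average the y-values of the k training points nearest to x.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_sq_err(data, train, k):
    return sum((y - knn_predict(x, train, k)) ** 2 for x, y in data) / len(data)

train, test = make_data(30), make_data(30)
for k in (15, 5, 1):  # left to right: increasing model complexity
    print(k,
          round(mean_sq_err(train, train, k), 3),   # training error
          round(mean_sq_err(test, train, k), 3))    # test error
```

At k = 1 each training point is its own nearest neighbour, so training error vanishes even though test error stays strictly positive: the left curve in the figure keeps falling while the right one does not.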

Page 18:

Model complexity

In GARP: the number of layers and/or the convergence criteria. In Maxent: the regularization parameter, β.

Page 19:

Things could get worse in the preceding plots:

• Sampling bias may induce artificially smaller or larger errors.
• Randomness around the "truth" may be present in observations, due to measurement error or other issues.
• X is multidimensional (as in niche models).
• Predictions may be needed where observed values of X are scarce.
• f(X) itself may be random for some algorithms (the same X yields different f(X)).

Page 20:

Notes regarding Loss Function

Is any loss function universal, i.e. reasonable for all problems and all types of applications?

Once the loss function is fixed, the notion of optimality follows!

There is no such thing as an "optimal" method. A method is only optimal relative to the given loss function. Thus it is optimal for the type of problems for which that particular loss function is appropriate.

Page 21:

Some references

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York.

Duda, R.O., Hart, P.E., and Stork, D.G. (2001), Pattern Classification, Wiley, New York.

Wahba, G. (1990), Spline Models for Observational Data, SIAM, Philadelphia.

Stone, M. (1974), "Cross-validatory choice and assessment of statistical predictions", Journal of the Royal Statistical Society, Series B, 36, 111–147.

Breiman, L., and Spector, P. (1992), "Submodel selection and evaluation in regression: the X-random case", International Statistical Review, 60, 291–319.

Efron, B. (1986), "How biased is the apparent error rate of a prediction rule?", Journal of the American Statistical Association, 81, 461–470.