Machine Learning Methods: An Overview

A. Colin Cameron, U.C.-Davis
Presented at the University of California - Riverside, April 23, 2019
cameron.econ.ucdavis.edu/e240f/machlearn2019_Riverside_1.pdf

Abstract

This talk summarizes key machine learning methods and presents them in a language familiar to economists.

The focus is on regression of y on x. The talk is based on two books:

- Introduction to Statistical Learning (ISL)
- Elements of Statistical Learning (ESL).

It emphasizes the ML methods most used currently in economics.

Introduction

Machine learning methods include data-driven algorithms to predict y given x.

- there are many machine learning (ML) methods
- the best ML method varies with the particular data application
- ML methods guard against in-sample overfitting.

The main goal of the machine learning literature is prediction of y
- and not estimation of β.

For economics, prediction of y is sometimes a goal
- e.g. do not provide a hip transplant to individuals with low predicted one-year survival probability
- then the main issue is which standard ML method is best.

But often economics is interested in estimation of β or of a partial effect such as an average treatment effect
- then new methods using ML are being developed by econometricians.

Econometrics for Machine Learning

The separate slides on ML methods in economics consider the following microeconomics examples.

Treatment effects under an unconfoundedness assumption
- estimate β in the model y = βx1 + g(x2) + u
- the assumption that x1 is exogenous is more plausible with a better g(x2)
- machine learning methods can lead to a good choice of g(x2).

Treatment effects under endogeneity using instrumental variables
- now x1 is endogenous in the model y = β1x1 + g(x2) + u
- given instruments x3 and x2 there is a potential many-instruments problem
- machine learning methods can lead to a good choice of instruments.

Average treatment effects for heterogeneous treatment
- ML methods may lead to better regression imputation and better propensity score matching.

Econometrics for Machine Learning (continued)

ML methods involve data mining
- using traditional methods, data mining leads to the complications of pre-test bias and multiple testing.

In the preceding econometrics examples, the ML methods are used in such a way that these complications do not arise
- an asymptotic distribution for the estimates of β1 or the ATE is obtained
- furthermore this distribution is Gaussian.

Overview

1. Terminology
2. Model selection - especially cross-validation
3. Shrinkage methods
   1. Variance-bias trade-off
   2. Ridge regression, LASSO, elastic net
4. Dimension reduction - principal components
5. High-dimensional data
6. Nonparametric and semiparametric regression
7. Flexible regression (splines, sieves, neural networks, ...)
8. Regression trees and random forests
9. Classification - including support vector machines
10. Unsupervised learning (no y) - k-means clustering

1. Terminology

The topic is called machine learning or statistical learning or data learning or data analytics, where the data may be big or small.

Supervised learning = Regression
- we have both an outcome y and regressors (or features) x
- 1. Regression: y is continuous
- 2. Classification: y is categorical.

Unsupervised learning
- we have no outcome y - only several x
- 3. Cluster analysis: e.g. determine five types of individuals given many psychometric measures.

This talk
- focuses on 1
- briefly mentions 2
- even more briefly mentions 3.

Terminology (continued)

Consider two types of data sets
- 1. training data set (or estimation sample)
  - used to fit a model
- 2. test data set (or hold-out sample or validation set)
  - additional data used to determine how good the model fit is
  - a test observation (x0, y0) is a previously unseen observation.

2. Model Selection
2.1 Model selection: Forwards, backwards and best subsets

Forwards selection (or specific to general)
- start with the simplest model (intercept-only) and in turn include the variable that is most statistically significant or most improves fit
- requires up to $p + (p-1) + \cdots + 1 = p(p+1)/2$ regressions, where p is the number of regressors.

Backwards selection (or general to specific)
- start with the most general model and in turn drop the variable that is least statistically significant or least improves fit
- requires up to $p(p+1)/2$ regressions.

Best subsets
- for k = 1, ..., p find the best-fitting model with k regressors
- in theory requires $\binom{p}{0} + \binom{p}{1} + \cdots + \binom{p}{p} = 2^p$ regressions
- but the leaps and bounds procedure makes this much quicker
- p < 40 is manageable, though recent work suggests p in the thousands.

2.2 Selection using Statistical Significance

Use p < 0.05 as the criterion
- and do forward selection or backward selection
- common in biostatistics.

Problem:
- pre-testing changes the distribution of $\hat{\beta}$
- this problem is typically ignored (but should not be).

2.3 Goodness-of-fit measures

We wish to predict y given x = (x1, ..., xp). A training data set d yields the prediction rule $\hat{f}(x)$
- we predict y at point $x_0$ using $\hat{y}_0 = \hat{f}(x_0)$
- e.g. for OLS $\hat{y}_0 = x_0'(X'X)^{-1}X'y$.

For regression consider squared error loss $(y - \hat{y})^2$
- some methods adapt to other loss functions
  - e.g. absolute error loss and log-likelihood loss
- the loss function for classification is $1(y \neq \hat{y})$.

We wish to estimate the true prediction error
- $\mathrm{Err}_d = E_F[(y_0 - \hat{y}_0)^2]$
- for a test data set point $(x_0, y_0) \sim F$.

Models overfit in sample

We want to estimate the true prediction error
- $E_F[(y_0 - \hat{y}_0)^2]$ for a test data set point $(x_0, y_0) \sim F$.

The obvious criterion is the in-sample mean squared error
- $MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$.

Problem: in-sample MSE under-estimates the true prediction error
- intuitively, models "overfit" within sample.

Example: suppose $y = X\beta + u$
- then $\hat{u} = (y - X\hat{\beta}_{OLS}) = (I - M)u$ where $M = X(X'X)^{-1}X'$
  - so $|\hat{u}_i| < |u_i|$ (the OLS residual is smaller than the true unknown error)
- and we use $\hat{\sigma}^2 = s^2 = \frac{1}{n-k}\sum_{i=1}^n (y_i - \hat{y}_i)^2$ and not $\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$.

Two solutions:
- penalize for overfitting, e.g. $\bar{R}^2$, AIC, BIC, Cp
- use out-of-estimation-sample prediction (cross-validation).

2.4 Penalized Goodness-of-fit Measures

Standard measures for a general parametric model are
- AIC: Akaike's information criterion (small penalty)
  - AIC = $-2\ln L + 2k$
- BIC: Bayesian information criterion (larger penalty)
  - BIC = $-2\ln L + (\ln n) \times k$.

Models with smaller AIC and BIC are preferred
- for regression with i.i.d. errors, replace $-2\ln L$ with $n\ln\hat{\sigma}^2$ plus a constant.

For OLS other measures are
- $\bar{R}^2$: adjusted R-squared
- Mallows Cp
- AICC = AIC $+\ 2(K+1)(K+2)/(N-K-2)$.

Stata Example

These slides use Stata
- most machine learning code is initially done in R.

Generated data: n = 40.

Three correlated regressors:

$$\begin{bmatrix} x_{1i} \\ x_{2i} \\ x_{3i} \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{bmatrix} \right)$$

But only x1 determines y
- $y_i = 2 + x_{1i} + u_i$ where $u_i \sim N(0, 3^2)$.
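
A minimal Stata sketch of how such data could be generated (the seed and the use of drawnorm are my assumptions; the slides do not show the data-generation code):

* hypothetical re-creation of the generated data (n = 40)
clear
set obs 40
set seed 10101
matrix C = (1, 0.5, 0.5 \ 0.5, 1, 0.5 \ 0.5, 0.5, 1)
drawnorm x1 x2 x3, corr(C)      // three correlated standard normal regressors
gen y = 2 + x1 + rnormal(0, 3)  // only x1 enters the DGP; sd of u is 3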

Fitted OLS regression

As expected, only x1 is statistically significant at 5%
- though due to randomness this is not guaranteed.

. * OLS regression of y on x1-x3
. regress y x1 x2 x3, vce(robust)

Linear regression                               Number of obs     =         40
                                                F(3, 36)          =       4.91
                                                Prob > F          =     0.0058
                                                R-squared         =     0.2373
                                                Root MSE          =     3.0907

                    Robust
           y       Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          x1    1.555582   .5006152     3.11   0.004     .5402873    2.570877
          x2    .4707111   .5251826     0.90   0.376    -.5944086    1.535831
          x3   -.0256025   .6009393    -0.04   0.966    -1.244364    1.193159
       _cons    2.531396   .5377607     4.71   0.000     1.440766    3.622025

Example of penalty measures

We will consider all 8 possible models based on x1, x2 and x3.

. * Regressor lists for all possible models
. global xlist1
. global xlist2 x1
. global xlist3 x2
. global xlist4 x3
. global xlist5 x1 x2
. global xlist6 x2 x3
. global xlist7 x1 x3
. global xlist8 x1 x2 x3

. * Full sample estimates with AIC, BIC, Cp, R2adj penalties
. quietly regress y $xlist8
. scalar s2full = e(rmse)^2  // Needed for Mallows Cp

Example of penalty measures (continued)

Manually get the various measures
- all (but MSE) favor the model with just x1.

. forvalues k = 1/8 {
  2.   quietly regress y ${xlist`k'}
  3.   scalar mse`k' = e(rss)/e(N)
  4.   scalar r2adj`k' = e(r2_a)
  5.   scalar aic`k' = -2*e(ll) + 2*e(rank)
  6.   scalar bic`k' = -2*e(ll) + e(rank)*ln(e(N))
  7.   scalar cp`k' =  e(rss)/s2full - e(N) + 2*e(rank)
  8.   display "Model " "${xlist`k'}" _col(15) " MSE=" %6.3f mse`k'  ///
>    " R2adj=" %6.3f r2adj`k' "  AIC=" %7.2f aic`k'  ///
>    " BIC=" %7.2f bic`k' " Cp=" %6.3f cp`k'
  9. }
Model          MSE=11.272 R2adj= 0.000  AIC= 212.41 BIC= 214.10 Cp= 9.199
Model x1       MSE= 8.739 R2adj= 0.204  AIC= 204.23 BIC= 207.60 Cp= 0.593
Model x2       MSE= 9.992 R2adj= 0.090  AIC= 209.58 BIC= 212.96 Cp= 5.838
Model x3       MSE=10.800 R2adj= 0.017  AIC= 212.70 BIC= 216.08 Cp= 9.224
Model x1 x2    MSE= 8.598 R2adj= 0.196  AIC= 205.58 BIC= 210.64 Cp= 2.002
Model x2 x3    MSE= 9.842 R2adj= 0.080  AIC= 210.98 BIC= 216.05 Cp= 7.211
Model x1 x3    MSE= 8.739 R2adj= 0.183  AIC= 206.23 BIC= 211.29 Cp= 2.592
Model x1 x2 x3 MSE= 8.597 R2adj= 0.174  AIC= 207.57 BIC= 214.33 Cp= 4.000

2.5 Cross-validation

Begin with single-split validation
- for pedagogical reasons.

Then present K-fold cross-validation
- used extensively in machine learning
- generalizes to loss functions other than MSE, such as $\frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$
- though it requires more computation than e.g. BIC.

And present leave-one-out cross-validation
- widely used for bandwidth choice in nonparametric regression.

Given a selected model, the final estimation is on the full dataset
- the usual inference ignores the data mining.

Single split validation

Randomly divide the available data into two parts
- 1. the model is fit on the training set
- 2. MSE is computed for predictions in the validation set.

Example: estimate all 8 possible models with x1, x2 and x3
- for each model, estimate on the training set to get the $\hat{\beta}$'s, predict on the validation set and compute the MSE in the validation set
- choose the model with the lowest validation set MSE.

Problems with this single-split validation
- 1. We lose precision due to the smaller training set
  - so we may actually overestimate the test error rate (MSE) of the model.
- 2. Results depend a lot on the particular single split.

Single split validation example

Randomly form
- a training sample (n = 14)
- a test sample (n = 26).

. * Form indicator that determines training and test datasets
. set seed 10101
. gen dtrain = runiform() > 0.5  // 1 for training set and 0 for test data
. count if dtrain == 1
  14

Single split validation example (continued)

In-sample (training sample) MSE is minimized with x1, x2, x3.

Out-of-sample (test sample) MSE is minimized with only x1.

. * Split sample validation - training and test MSE for the 8 possible models
. forvalues k = 1/8 {
  2.   quietly reg y ${xlist`k'} if dtrain==1
  3.   qui predict y`k'hat
  4.   qui gen y`k'errorsq = (y`k'hat - y)^2
  5.   qui sum y`k'errorsq if dtrain == 1
  6.   scalar mse`k'train = r(mean)
  7.   qui sum y`k'errorsq if dtrain == 0
  8.   qui scalar mse`k'test = r(mean)
  9.   display "Model " "${xlist`k'}" _col(16)  ///
>     " Training MSE = " %7.3f mse`k'train " Test MSE = " %7.3f mse`k'test
 10. }
Model           Training MSE =  10.496 Test MSE =  11.852
Model x1        Training MSE =   7.625 Test MSE =   9.568
Model x2        Training MSE =  10.326 Test MSE =  10.490
Model x3        Training MSE =   7.943 Test MSE =  13.052
Model x1 x2     Training MSE =   7.520 Test MSE =  10.350
Model x2 x3     Training MSE =   7.798 Test MSE =  15.270
Model x1 x3     Training MSE =   7.138 Test MSE =  10.468
Model x1 x2 x3  Training MSE =   6.830 Test MSE =  12.708

K-fold cross-validation

K-fold cross-validation
- splits the data into K mutually exclusive folds of roughly equal size
- for j = 1, ..., K, fit using all folds but fold j and predict on fold j
- standard choices are K = 5 and K = 10.

The following shows the case K = 5.

          Fit on folds    Test on fold
  j = 1     2,3,4,5             1
  j = 2     1,3,4,5             2
  j = 3     1,2,4,5             3
  j = 4     1,2,3,5             4
  j = 5     1,2,3,4             5

The K-fold CV estimate is $CV_K = \frac{1}{K}\sum_{j=1}^K MSE_{(j)}$, where $MSE_{(j)}$ is the MSE for fold j.
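
A minimal do-file sketch of K-fold CV computed by hand for the model with only x1 (a hedged illustration; the fold assignment by runiform() and the variable names are my own, and the published example instead uses the crossfold command shown on the next slide):

* hypothetical manual 5-fold cross-validation for the model y on x1
set seed 10101
gen fold = ceil(5*runiform())        // assign each observation to a fold 1-5
gen double cverr = .
forvalues j = 1/5 {
    quietly regress y x1 if fold != `j'              // fit on all folds but j
    quietly predict double yhatj if fold == `j', xb
    quietly replace cverr = (y - yhatj)^2 if fold == `j'
    drop yhatj
}
quietly sum cverr
display "CV5 estimate for the model with x1 = " r(mean)

This averages squared errors over all held-out observations, which equals the average of the fold MSEs when the folds are of equal size.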

K-fold cross-validation example

The user-written crossfold command (Daniels 2012) implements this
- do so for the model with all three regressors and K = 5
- set the seed for replicability.

. * Five-fold cross validation example for model with all regressors
. set seed 10101
. crossfold regress y x1 x2 x3, k(5)

                 RMSE
        est1  3.739027
        est2  2.549458
        est3  3.059801
        est4  2.532469
        est5  3.498511

. matrix RMSEs  = r(est)
. svmat RMSEs, names(rmse)
. quietly generate mse = rmse^2
. quietly sum mse
. display _n "CV5 (average MSE in 5 folds) = " r(mean) " with st. dev. = " r(sd)

CV5 (average MSE in 5 folds) = 9.6990836 with st. dev. = 3.3885108

K-fold cross-validation example (continued)

Now do so for all eight models with K = 5
- the model with only x1 has the lowest CV(5).

. * Five-fold cross validation measure for all possible models
. drop rm*     // Drop variables created by previous crossfold
. drop _est*   // Drop variables created by previous crossfold
. forvalues k = 1/8 {
  2.   set seed 10101
  3.   quietly crossfold regress y ${xlist`k'}, k(5)
  4.   matrix RMSEs`k' = r(est)
  5.   svmat RMSEs`k', names(rmse`k')
  6.   quietly generate mse`k' = rmse`k'^2
  7.   quietly sum mse`k'
  8.   scalar cv`k' = r(mean)
  9.   scalar sdcv`k' = r(sd)
 10.   display "Model " "${xlist`k'}" _col(16) "  CV5 = " %7.3f cv`k' ///
>      " with st. dev. = " %7.3f sdcv`k'
 11. }
Model            CV5 =  11.960 with st. dev. =   3.561
Model x1         CV5 =   9.138 with st. dev. =   3.069
Model x2         CV5 =  10.407 with st. dev. =   4.139
Model x3         CV5 =  11.776 with st. dev. =   3.272
Model x1 x2      CV5 =   9.173 with st. dev. =   3.367
Model x2 x3      CV5 =  10.872 with st. dev. =   4.221
Model x1 x3      CV5 =   9.639 with st. dev. =   2.985
Model x1 x2 x3   CV5 =   9.699 with st. dev. =   3.389

Leave-one-out Cross Validation (LOOCV)

Use a single observation for validation and (n-1) for training
- $\hat{y}_{(-i)}$ is the prediction of $y_i$ after OLS on observations 1, ..., i-1, i+1, ..., n
- cycle through all n observations doing this.

Then the LOOCV measure is
$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^n MSE_{(-i)} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_{(-i)})^2$$

Requires n regressions in general
- except for OLS one can show $CV_{(n)} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2$
  - where $\hat{y}_i$ is the fitted value from OLS on the full training sample
  - and $h_{ii}$ is the i-th diagonal entry of the hat matrix $X(X'X)^{-1}X'$.

Used for bandwidth choice in local nonparametric regression
- such as k-nearest neighbors, kernel and local linear regression
- but not used for machine learning (see below).
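
A minimal sketch of the OLS shortcut formula in Stata, using the leverage values after regress (the variable names are illustrative; this code is not part of the original slides):

* hypothetical LOOCV for OLS of y on x1 via the leverage shortcut
quietly regress y x1
predict double res, residuals
predict double h, hat                 // leverage h_ii from the hat matrix
gen double looerr = (res/(1 - h))^2
quietly sum looerr
display "LOOCV (CV_n) for the model with x1 = " r(mean)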

How many folds?

LOOCV is the special case of K-fold CV with K = n
- it has little bias
  - as all but one observation is used to fit
- but large variance
  - as the n predicted $\hat{y}_{(-i)}$ are based on very similar samples
  - so the subsequent averaging does not reduce variance much.

The choice K = 5 or K = 10 is found to be a good compromise
- neither high bias nor high variance.

Remember: For replicability set the seed as this determines the folds.

3. Shrinkage Estimation

Consider the linear regression model with p potential regressors, where p is too large.

Methods that reduce the model complexity are
- choose a subset of regressors
- shrink regression coefficients towards zero
  - ridge, LASSO, elastic net
- reduce the dimension of the regressors
  - principal components analysis.

Linear regression may predict well if we include interactions and powers as potential regressors.

And the methods can be adapted to alternative loss functions for estimation.

3.1 Variance-bias trade-off

Consider the regression model
$$y = f(x) + u \text{ with } E[u] = 0 \text{ and } u \perp x.$$

For an out-of-estimation-sample point $(y_0, x_0)$ the true prediction error is
$$E[(y_0 - \hat{f}(x_0))^2] = Var[\hat{f}(x_0)] + \{Bias(\hat{f}(x_0))\}^2 + Var(u)$$

The last term Var(u) is called the irreducible error
- we can do nothing about this.

So we need to minimize the sum of the variance and the bias squared!
- more flexible models have less bias (good) and more variance (bad)
- this trade-off is fundamental to machine learning.

Variance-bias trade-off and shrinkage

Shrinkage is one method that is biased, but the bias may lead to lower squared error loss
- first show this for estimation of a parameter β
- then show this for prediction of y.

The mean squared error of a scalar estimator $\tilde{\beta}$ is
$$MSE(\tilde{\beta}) = E[(\tilde{\beta} - \beta)^2]
= E[\{(\tilde{\beta} - E[\tilde{\beta}]) + (E[\tilde{\beta}] - \beta)\}^2]
= E[(\tilde{\beta} - E[\tilde{\beta}])^2] + (E[\tilde{\beta}] - \beta)^2 + 2 \times 0
= Var(\tilde{\beta}) + Bias^2(\tilde{\beta})$$
- as the cross-product term $2 \times E[(\tilde{\beta} - E[\tilde{\beta}])(E[\tilde{\beta}] - \beta)] = \text{constant} \times E[(\tilde{\beta} - E[\tilde{\beta}])] = 0$.

Bias can reduce estimator MSE: a shrinkage example

Suppose the scalar estimator $\hat{\beta}$ is unbiased for β with
- $E[\hat{\beta}] = \beta$ and $Var[\hat{\beta}] = v$, so $MSE(\hat{\beta}) = v$.

Consider the shrinkage estimator
- $\tilde{\beta} = a\hat{\beta}$ where $0 \le a \le 1$.

Bias: $Bias(\tilde{\beta}) = E[\tilde{\beta}] - \beta = a\beta - \beta = (a-1)\beta$.
Variance: $Var[\tilde{\beta}] = Var[a\hat{\beta}] = a^2 Var(\hat{\beta}) = a^2 v$.

$$MSE(\tilde{\beta}) = Var[\tilde{\beta}] + Bias^2(\tilde{\beta}) = a^2 v + (a-1)^2\beta^2$$
$$MSE(\tilde{\beta}) < MSE[\hat{\beta}] \text{ if } \beta^2 < \frac{1+a}{1-a}v.$$

So $MSE(\tilde{\beta}) < MSE[\hat{\beta}]$ for a = 0 if $\beta^2 < v$
- and for a = 0.9 if $\beta^2 < 19v$.

The ridge estimator shrinks towards zero.
The LASSO estimator selects and shrinks towards zero.
The James-Stein estimator in $y_i \sim N(\mu_i, 1)$ has lower MSE than the MLE!
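
A small simulation sketch of this point (the values v = 4, β = 1 and a = 0.5 are my own illustrative choices, not from the slides); they satisfy $\beta^2 < \frac{1+a}{1-a}v = 12$, so the shrinkage estimator should have lower simulated MSE:

* hypothetical Monte Carlo check that a*bhat can have lower MSE than bhat
clear
set seed 10101
set obs 1000
gen bhat   = 1 + sqrt(4)*rnormal()     // unbiased estimator: beta = 1, Var = 4
gen btilde = 0.5*bhat                  // shrinkage estimator with a = 0.5
gen se_bhat   = (bhat - 1)^2
gen se_btilde = (btilde - 1)^2
sum se_bhat se_btilde                  // means approximate MSE of 4 and 1.25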

Bias can therefore reduce predictor MSE

Now consider prediction of $y_0 = \beta x_0 + u$ where $E[u] = 0$
- using $\tilde{y}_0 = \tilde{\beta} x_0$, where the scalar $x_0$ is treated as fixed.

Then the mean squared error of the prediction $\tilde{y}_0$ is
$$MSE(\tilde{y}_0) = x_0^2\{Var(\tilde{\beta}) + Bias^2(\tilde{\beta})\} + Var(u) = x_0^2\, MSE(\tilde{\beta}) + Var(u).$$

So bias in $\tilde{\beta}$ that reduces $MSE(\tilde{\beta})$ also reduces $MSE(\tilde{y}_0)$.

3.2 Shrinkage Methods

Shrinkage estimators minimize the RSS (residual sum of squares) with a penalty for model size
- this shrinks the parameter estimates towards zero.

The extent of shrinkage is determined by a tuning parameter
- this is determined by cross-validation or e.g. AIC.

Ridge, LASSO and elastic net are not invariant to rescaling of the regressors, so first standardize
- so $x_{ij}$ below is actually $(x_{ij} - \bar{x}_j)/s_j$
- and demean $y_i$, so below $y_i$ is actually $y_i - \bar{y}$
- $x_i$ does not include an intercept, nor does the data matrix X
- we can recover the intercept $\beta_0$ as $\hat{\beta}_0 = \bar{y}$.

So work with $y = x'\beta + \varepsilon = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$.

3.3 Ridge Regression

The ridge estimator $\hat{\beta}_\lambda$ of β minimizes
$$Q_\lambda(\beta) = \sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^p \beta_j^2 = RSS + \lambda (||\beta||_2)^2$$
- where $\lambda \ge 0$ is a tuning parameter to be determined
- $||\beta||_2 = \sqrt{\sum_{j=1}^p \beta_j^2}$ is the L2 norm.

Equivalently the ridge estimator minimizes
$$\sum_{i=1}^n (y_i - x_i'\beta)^2 \text{ subject to } \sum_{j=1}^p \beta_j^2 \le s.$$

The ridge estimator is $\hat{\beta}_\lambda = (X'X + \lambda I)^{-1}X'y$.

Derivation: $Q(\beta) = (y - X\beta)'(y - X\beta) + \lambda\beta'\beta$
- $\partial Q(\beta)/\partial\beta = -2X'(y - X\beta) + 2\lambda\beta = 0$
- $\Rightarrow X'X\beta + \lambda I\beta = X'y$
- $\Rightarrow \hat{\beta}_\lambda = (X'X + \lambda I)^{-1}X'y$.
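
A minimal sketch of the closed form using Stata matrix commands (λ = 2, the z-scoring of the regressors and all variable names here are my own illustrative choices):

* hypothetical closed-form ridge fit: bridge = (X'X + lambda*I)^(-1) X'y
foreach v of varlist x1 x2 x3 {
    egen double z`v' = std(`v')      // standardize regressors, as recommended
}
quietly sum y
gen double yd = y - r(mean)          // demean y
matrix accum XpX = zx1 zx2 zx3, noconstant        // X'X
matrix vecaccum ypX = yd zx1 zx2 zx3, noconstant  // y'X
matrix bridge = invsym(XpX + 2*I(3)) * ypX'       // here lambda = 2
matrix list bridge

Setting λ = 0 reproduces OLS on the standardized data, and larger λ shrinks the coefficients towards zero.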

More on Ridge

Features
- $\hat{\beta}_\lambda \to \hat{\beta}_{OLS}$ as $\lambda \to 0$ and $\hat{\beta}_\lambda \to 0$ as $\lambda \to \infty$
- best when many predictors are important with coefficients of similar size
- best when LS has high variance
- algorithms exist to quickly compute $\hat{\beta}_\lambda$ for many values of λ
- then choose λ by cross-validation.

Hoerl and Kennard (1970) proposed ridge as a way to reduce the MSE of $\tilde{\beta}$.

Ridge is the posterior mean for $y \sim N(X\beta, \sigma^2 I)$ with prior $\beta \sim N(0, \gamma^2 I)$
- though γ is a specified prior parameter whereas λ is data-determined.

For a scalar regressor and no intercept, $\hat{\beta}_\lambda = a\hat{\beta}_{OLS}$ where $a = \frac{\sum_i x_i^2}{\sum_i x_i^2 + \lambda}$
- like the earlier example of $\tilde{\beta} = a\hat{\beta}$.

3.4 LASSO (Least Absolute Shrinkage And Selection)

The LASSO estimator $\hat{\beta}_\lambda$ of β minimizes
$$Q_\lambda(\beta) = \sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^p |\beta_j| = RSS + \lambda ||\beta||_1$$
- where $\lambda \ge 0$ is a tuning parameter to be determined
- $||\beta||_1 = \sum_{j=1}^p |\beta_j|$ is the L1 norm.

Equivalently the LASSO estimator minimizes
$$\sum_{i=1}^n (y_i - x_i'\beta)^2 \text{ subject to } \sum_{j=1}^p |\beta_j| \le s.$$

Features
- best when a few regressors have $\beta_j \neq 0$ and most $\beta_j = 0$
- leads to a more interpretable model than ridge.

LASSO versus Ridge (key figure from ISL)

LASSO is likely to set some coefficients to zero.

3.5 Elastic net

There are many extensions of LASSO.

Elastic net combines ridge regression and LASSO with objective function
$$Q_{\lambda,\alpha}(\beta) = \sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^p \{\alpha|\beta_j| + (1-\alpha)\beta_j^2\}.$$
- the ridge penalty averages correlated variables
- the LASSO penalty leads to sparsity.

Here I use the elasticregress package (Townsend 2018)
- ridgeregress (alpha=0)
- lassoregress (alpha=1)
- elasticregress.

K-fold cross-validation is used with default K = 10
- set the seed for replicability.

In part 3 I instead use the lassopack package.

3.6 LASSO example

OLS was $\hat{y} = 1.556x_1 + 0.471x_2 - 0.026x_3 + 2.532$.
OLS on x1 and x2 is $\hat{y} = 1.554x_1 + 0.468x_2 + 2.534$.

. * LASSO with lambda determined by cross validation
. set seed 10101
. lassoregress y x1 x2 x3, numfolds(5)

LASSO regression                       Number of observations     =         40
                                       R-squared                  =     0.2293
                                       alpha                      =     1.0000
                                       lambda                     =     0.2594
                                       Cross-validation MSE       =     9.3871
                                       Number of folds            =          5
                                       Number of lambda tested    =        100

           y       Coef.
          x1      1.3505
          x2    .2834002
          x3           0
       _cons    2.621573

4. Dimension Reduction

Reduce from p regressors to M < p linear combinations of the regressors
- form $X^* = XA$ where A is $p \times M$ and $M < p$
- $y = \beta_0 + X^*\delta + u$ after dimension reduction
- $y = \beta_0 + X\beta + u$ where $\beta = A\delta$.

Two methods mentioned in ISL
- 1. Principal components
  - use only X to form A (unsupervised)
- 2. Partial least squares
  - also use the relationship between y and X to form A (supervised)
  - I have not seen this used in practice.

For both, the regressors should be standardized as the methods are not scale invariant.

And cross-validation is often used to determine M.

4.1 Principal Components Analysis (PCA)

Suppose X is normalized to have zero means, so the ij-th entry is $x_{ij} - \bar{x}_j$.

The first principal component has the largest sample variance among all normalized linear combinations of the columns of the $n \times p$ matrix X
- the first component is $Xh_1$ where $h_1$ is $p \times 1$
- normalize $h_1$ so that $h_1'h_1 = 1$
- then $h_1$ maximizes $Var(Xh_1) = h_1'X'Xh_1$ subject to $h_1'h_1 = 1$
- the maximum is the largest eigenvalue of $X'X$ and $h_1$ is the corresponding eigenvector.

The second principal component has the largest variance subject to being orthogonal to the first, and so on.
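
A minimal sketch of how the first principal component score (pc1) used on the next slide could be constructed in Stata (an assumption about how it was built; the slides do not show this step):

* hypothetical construction of the first principal component of x1-x3
pca x1 x2 x3            // eigen-decomposition of the correlation matrix
predict pc1, score      // score on the first principal component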

Principal Components Analysis Example

Compare the correlation coefficient from OLS on the first principal component (r = 0.4444) with OLS on all three regressors (r = 0.4871) and on each single regressor.

. * Compare R from OLS on all three regressors, on pc1, on x1, on x2, on x3
. quietly regress y x1 x2 x3
. predict yhat
(option xb assumed; fitted values)
. correlate y yhat pc1 x1 x2 x3
(obs=40)

                    y     yhat      pc1       x1       x2       x3
           y   1.0000
        yhat   0.4871   1.0000
         pc1   0.4219   0.8661   1.0000
          x1   0.4740   0.9732   0.8086   1.0000
          x2   0.3370   0.6919   0.7322   0.5077   1.0000
          x3   0.2046   0.4200   0.7824   0.4281   0.2786   1.0000

Principal Components Analysis (continued)

PCA is unsupervised, so it seems unrelated to y, but
- Elements of Statistical Learning says it does well in practice
- PCA has the smallest variance of any estimator that estimates the model $y = X\beta + u$ with i.i.d. errors subject to the constraint $C\beta = c$ where dim[C] ≤ dim[X]
- PCA discards the p-M smallest-eigenvalue components whereas ridge does not, though ridge shrinks towards zero the most for the smallest-eigenvalue components (ESL p.79).

Partial least squares is supervised
- partial least squares turns out to be similar to PCA
- especially if R² is low.

5. High-Dimensional Models

High-dimensional simply means that p is large relative to n
- in particular p > n
- n could be large or small.

Problems with p > n:
- Cp, AIC, BIC and adjusted R² cannot be used
- due to multicollinearity we cannot identify the best model, just one of many good models
- we cannot use regular statistical inference on the training set.

Solutions
- forward stepwise, ridge, LASSO and PCA are useful in training
- evaluate models using cross-validation or independent test data
  - using e.g. MSE or R².

6. Nonparametric and Semiparametric Regression
6.1 Nonparametric regression

Nonparametric regression is the most flexible approach
- but it is not practical for high p due to the curse of dimensionality.

Consider explaining y with a scalar regressor x
- we want $\hat{f}(x_0)$ for a range of values $x_0$.

With many observations with $x_i = x_0$ we would just use the average of y for those observations
$$\hat{f}(x_0) = \frac{1}{n_0}\sum_{i: x_i = x_0} y_i = \frac{\sum_{i=1}^n 1[x_i = x_0]\, y_i}{\sum_{i=1}^n 1[x_i = x_0]}.$$

Rewrite this as
$$\hat{f}(x_0) = \sum_{i=1}^n w(x_i, x_0)\, y_i, \text{ where } w(x_i, x_0) = \frac{1[x_i = x_0]}{\sum_{j=1}^n 1[x_j = x_0]}.$$

Kernel-weighted local regression

In practice there are not many observations with $x_i = x_0$.

Nonparametric regression methods borrow from nearby observations
- k-nearest neighbors
  - average $y_i$ for the k observations with $x_i$ closest to $x_0$
- kernel-weighted local regression
  - use a weighted average of $y_i$ with weights declining as $|x_i - x_0|$ increases.

The original kernel regression estimate is
$$\hat{f}(x_0) = \sum_{i=1}^n w(x_i, x_0, \lambda)\, y_i,$$
- where $w(x_i, x_0, \lambda) = w\left(\frac{x_i - x_0}{\lambda}\right)$ are kernel weights
- and λ is a bandwidth parameter to be determined.
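
A minimal sketch of this estimator computed by hand at a single point, assuming an outcome y and scalar regressor z as in the local linear example two slides below (the Epanechnikov kernel, the bandwidth 0.5 and the evaluation point 0 are my own illustrative choices):

* hypothetical manual kernel (Nadaraya-Watson) estimate of f(x0) at x0 = 0
scalar lam = 0.5                                 // bandwidth
gen double u0  = (z - 0)/lam
gen double kw  = 0.75*(1 - u0^2)*(abs(u0) < 1)   // Epanechnikov kernel weights
gen double kwy = kw*y
quietly sum kwy
scalar numer = r(sum)
quietly sum kw
display "fhat(0) = " numer/r(sum)                // kernel-weighted average of y
drop u0 kw kwy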

Local constant and local linear regression

The local constant estimator is $\hat{f}(x_0) = \hat{\alpha}_0$ where $\alpha_0$ minimizes
$$\sum_{i=1}^n w(x_i, x_0, \lambda)(y_i - \alpha_0)^2$$
- this yields $\hat{\alpha}_0 = \sum_{i=1}^n w(x_i, x_0, \lambda)\, y_i$ (with the weights normalized to sum to one).

The local linear estimator is $\hat{f}(x_0) = \hat{\alpha}_0$ where $\alpha_0$ and $\beta_0$ minimize
$$\sum_{i=1}^n w(x_i, x_0, \lambda)\{y_i - \alpha_0 - \beta_0(x_i - x_0)\}^2.$$

This can be generalized to local maximum likelihood, which maximizes over $\theta_0$
$$\sum_{i=1}^n w(x_i, x_0, \lambda) \ln f(y_i, x_i, \theta_0).$$

Local linear example

lpoly y z, degree(1)

[Figure: local polynomial smooth of y on z; kernel = epanechnikov, degree = 1, bandwidth = .49]

6.2 Curse of Dimensionality

Nonparametric methods do not extend well to multiple regressors.

Consider p-dimensional x broken into bins
- for p = 1 we might average y in each of 10 bins of x
- for p = 2 we may need to average over 10² bins of (x1, x2)
- and so on.

On average there may be few to no points with high-dimensional $x_i$ close to $x_0$
- this is called the curse of dimensionality.

Formally, for local constant kernel regression with bandwidth λ
- the bias is $O(\lambda^2)$ and the variance is $O(1/(n\lambda^p))$
- the optimal bandwidth is $O(n^{-1/(p+4)})$
  - this gives asymptotic bias, so standard confidence intervals are not properly centered
- the convergence rate is then $n^{-2/(p+4)} << n^{-0.5}$.

6.3 Semiparametric Models

Semiparametric models provide some structure to reduce the nonparametric component from K dimensions to fewer dimensions.
- Econometricians focus on partially linear models and on single-index models.
- Statisticians use generalized additive models and projection pursuit regression.

Machine learning methods can outperform nonparametric and semiparametric methods
- so wherever econometricians use nonparametric and semiparametric regression in higher-dimensional models it may be useful to use ML methods.

Partially linear model

A partially linear model specifies
$$y_i = f(x_i, z_i) + u_i = x_i'\beta + g(z_i) + u_i$$
- in the simplest case z (or x) is scalar, but they could be vectors
- the nonparametric component has the dimension of z.

The differencing estimator of Robinson (1988) provides a root-n consistent, asymptotically normal $\hat{\beta}$ as follows
- $E[y|z] = E[x|z]'\beta + g(z)$ as $E[u|z] = 0$ given $E[u|x,z] = 0$
- $y - E[y|z] = (x - E[x|z])'\beta + u$, subtracting
- so estimate by OLS: $y - \hat{m}_y = (x - \hat{m}_x)'\beta + \text{error}$.

Robinson proposed nonparametric kernel regression of y on z for $\hat{m}_y$ and of x on z for $\hat{m}_x$
- recent econometrics articles instead use a machine learner such as LASSO
- in general we need $\hat{m}$ to converge at a rate of at least $n^{1/4}$.
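
A minimal sketch of the differencing estimator using kernel regression for the two conditional means, assuming a dataset with outcome y, a scalar regressor of interest x1 and a scalar control z (the variable names, the local-constant lpoly choice and the default bandwidths are my own; this is an illustration, not the code behind the slides):

* hypothetical Robinson (1988) differencing estimator for y = x1*b + g(z) + u
lpoly y z,  at(z) generate(my) degree(0) nograph   // nonparametric E[y|z]
lpoly x1 z, at(z) generate(mx) degree(0) nograph   // nonparametric E[x1|z]
gen double ytil = y  - my
gen double xtil = x1 - mx
regress ytil xtil, noconstant       // OLS on the differenced data estimates b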

Single-index model

Single-index models specify
$$f(x_i) = g(x_i'\beta)$$
- with $g(\cdot)$ determined nonparametrically
- this reduces the nonparametrics to one dimension.

We can obtain a $\hat{\beta}$ that is root-n consistent and asymptotically normal
- provided the nonparametric $\hat{g}(\cdot)$ converges at rate $n^{-1/4}$.

The recent economics ML literature has instead focused on the partially linear model.

Generalized additive models and projection pursuit

Generalized additive models specify f(x) as a linear combination of scalar functions
$$f(x_i) = \alpha + \sum_{j=1}^p f_j(x_{ij})$$
- where $x_j$ is the j-th regressor and $f_j(\cdot)$ is (usually) determined by the data
- the advantage is interpretability (due to each regressor appearing additively)
- it can be made more nonlinear by including interactions such as $x_{i1} \times x_{i2}$ as a separate regressor.

Projection pursuit regression is additive in linear combinations of the x's
$$f(x_i) = \sum_{m=1}^M g_m(x_i'\omega_m)$$
- additive in the derived features $x'\omega_m$ rather than in the $x_j$'s
- the $g_m(\cdot)$ functions are unspecified and nonparametrically estimated
- this is a multi-index model, with the case M = 1 being a single-index model.

How can machine learning methods do better?

In theory there is scope for improving nonparametric methods.

k-nearest neighbors usually uses a fixed number of neighbors
- but it may be better to vary the number of neighbors with data sparsity.

Kernel-weighted local regression methods usually use a fixed bandwidth
- but it may be better to vary the bandwidth with data sparsity.

There may be an advantage to basing neighbors in part on the relationship with y.

7. Flexible Regression

Basis function models
- global polynomial regression
- splines: step functions, regression splines, smoothing splines
- wavelets
- the polynomial is global while the others break the range of x into pieces.

Other methods
- neural networks.

7.1 Basis Functions

Also called series expansions and sieves.

General approach (scalar x for simplicity)

$$y_i = \beta_0 + \beta_1 b_1(x_i) + \cdots + \beta_K b_K(x_i) + \varepsilon_i$$
- where $b_1(\cdot), ..., b_K(\cdot)$ are basis functions that are fixed and known.

Global polynomial regression sets $b_j(x_i) = x_i^j$
- typically K is 3 or 4
- it fits globally and can overfit at the boundaries.

Step functions: separately fit y in each interval $x \in (c_j, c_{j+1})$
- could be piecewise constant or piecewise polynomial.

Splines smooth the fit so that it is not discontinuous at the cut points.

Wavelets are also basis functions, richer than Fourier series.

7.2 Regression Splines

The cubic regression spline is the standard.

Piecewise cubic model with K knots
- require f(x), f'(x) and f''(x) to be continuous at the K knots.

Then we can do OLS with
$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - c_1)^3_+ + \cdots + \beta_{3+K}(x - c_K)^3_+$$
- where $h_+(z) = z$ if $z > 0$ and $h_+(z) = 0$ otherwise.

This is the lowest-degree regression spline for which the graph of $\hat{f}(x)$ on x seems smooth and continuous to the naked eye.

There is no real benefit to a higher-order spline.

Regression splines overfit at the boundaries.
- A natural or restricted cubic spline is an adaptation that restricts the relationship to be linear past the lower and upper boundaries of the data.
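
A minimal sketch of building the truncated-power basis above by hand for two illustrative knots and then running OLS (the knot locations -1 and 1 and the variable names are my own; the mkspline command on the next slide automates this, with the natural-spline restriction added):

* hypothetical manual cubic regression spline basis with knots c1 = -1, c2 = 1
gen double z2  = z^2
gen double z3  = z^3
gen double zk1 = cond(z > -1, (z + 1)^3, 0)   // (z - c1)^3_+
gen double zk2 = cond(z >  1, (z - 1)^3, 0)   // (z - c2)^3_+
regress y z z2 z3 zk1 zk2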

Spline Example

Natural or restricted cubic spline with five knots at the 5, 27.5, 50, 72.5 and 95 percentiles
- mkspline zspline = z, cubic nknots(5) displayknots
- regress y zspline*

[Figure: natural cubic spline fit, y = a + f(z) + u, with f(z) plotted against z]

7.3 Wavelets

Wavelets are used especially for signal processing and extraction
- they are richer than a Fourier series basis
- they can handle both smooth sections and bumpy sections of a series
- they are not used in cross-section econometrics but may be useful for some time series.

Start with a mother or father wavelet function ψ(x)
- an example is the Haar function
$$\psi(x) = \begin{cases} 1 & 0 \le x < \tfrac{1}{2} \\ -1 & \tfrac{1}{2} < x < 1 \\ 0 & \text{otherwise} \end{cases}$$

Then both translate by b and scale by a to give the basis functions $\psi_{ab}(x) = |a|^{-1/2}\psi\left(\frac{x-b}{a}\right)$.

7.4 Neural Networks

A neural network is a richer model for $f(x_i)$ than projection pursuit
- but unlike projection pursuit all the functions are specified
- only the parameters need to be estimated.

A neural network involves a series of nested logit regressions.

A single-hidden-layer neural network explaining y by x has
- y depending on the z's (a hidden layer)
- the z's depending on the x's.

A neural network with two hidden layers explaining y by x has
- y depending on the w's (a hidden layer)
- the w's depending on the z's (a hidden layer)
- the z's depending on the x's.

Two-layer neural network

y depends on M z's and the z's depend on the p x's:
- $f(x) = \beta_0 + z'\beta$
- $z_m = \frac{1}{1 + \exp[-(\alpha_{0m} + x'\alpha_m)]}$, m = 1, ..., M.

More generally we may use
- $f(x) = h(T)$, usually $h(T) = T$
- $T = \beta_0 + z'\beta$
- $z_m = g(\alpha_{0m} + x'\alpha_m)$, usually $g(v) = 1/(1 + e^{-v})$.

This yields the nonlinear model
$$f(x_i) = \beta_0 + \sum_{m=1}^M \beta_m \times \frac{1}{1 + \exp[-(\alpha_{0m} + x_i'\alpha_m)]}$$
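
A minimal sketch of this forward calculation for M = 2 hidden units and the three regressors x1-x3 (all parameter values below are made-up illustrations, not estimates):

* hypothetical single-hidden-layer prediction with M = 2 units
* invlogit(v) = 1/(1 + exp(-v)) is Stata's logistic function
gen double z1  = invlogit(0.1 + 0.5*x1 - 0.3*x2 + 0.2*x3)   // hidden unit 1
gen double z2  = invlogit(-0.2 + 0.1*x1 + 0.4*x2 - 0.6*x3)  // hidden unit 2
gen double fnn = 0.5 + 1.2*z1 - 0.8*z2                      // f(x) = b0 + z'b

In practice the α's and β's are estimated by penalized least squares, as discussed on the next slide.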

Neural Networks (continued)

Neural nets are good for prediction
- especially in speech recognition (Google Translate), image recognition, ...
- but they are very difficult (impossible) to interpret.

They require a lot of fine tuning - they are not off-the-shelf
- we need to determine the number of hidden layers and the number M of hidden units within each layer, and estimate the α's, β's, ....

Minimize the sum of squared residuals, but a penalty on the α's is needed to avoid overfitting
- since a penalty is introduced, standardize the x's to (0,1)
- it is best to have too many hidden units and then avoid overfitting using the penalty
- initially back propagation was used
- now gradient methods are used with different starting values, averaging the results or using bagging.

Deep learning uses nonlinear transformations such as neural networks
- deep nets are an improvement on the original neural networks.

Neural Networks Example

This example uses the user-written Stata command brain (Doherr).

. * Example from help file for user-written brain command
. clear
. set obs 200
number of observations (_N) was 0, now 200
. gen x = 4*_pi/200 *_n
. gen y = sin(x)
. brain define, input(x) output(y) hidden(20)
Defined matrices:
    input[4,1]   output[4,1]  neuron[1,22]  layer[1,3]   brain[1,61]
. quietly brain train, iter(500) eta(2)
. brain think ybrain
. sort x
. twoway (scatter y x) (lfit y x) (line ybrain x)

Neural Networks Example (continued)

We obtain the figure below.

[Figure: scatter of y against x, together with the OLS fitted values and the ybrain prediction.]


This figure from ESL is for classification with K categories.

8. Regression Trees and Random Forests

8. Regression Trees and Random Forests: Overview

Regression trees sequentially split regressors x into regions that best predict y
- e.g., the first split is income < or > $12,000
  the second split is on gender if income > $12,000
  the third split is income < or > $30,000 (if female and income > $12,000).

Trees do not predict well
- due to high variance
- e.g. split the data in two, then can get quite different trees
- e.g. the first split determines future splits (a greedy method).

Better methods are then given
- bagging (bootstrap averaging) computes regression trees for different samples obtained by bootstrap and averages the predictions
- random forests use only a subset of the predictors in each bootstrap sample
- boosting grows trees based on residuals from the previous stage
- bagging and boosting are general methods (not just for trees).

8. Regression Trees and Random Forests 8.1 Regression Trees

8.1 Regression Trees

Regression trees
- sequentially split the x's into rectangular regions in a way that reduces RSS
- then ŷ_i is the average of the y's in the region that x_i falls in
- with J blocks, RSS = ∑_{j=1}^{J} ∑_{i∈R_j} (y_i − ȳ_{R_j})².

Need to determine both the regressor j to split and the split point s (a sketch of this search follows below).
- For any regressor j and split point s, define the pair of half-planes
  R1(j,s) = {X | X_j < s} and R2(j,s) = {X | X_j ≥ s}.
- Find the values of j and s that minimize

    ∑_{i: x_i∈R1(j,s)} (y_i − ȳ_{R1})² + ∑_{i: x_i∈R2(j,s)} (y_i − ȳ_{R2})²

  where ȳ_{R1} is the mean of y in region R1 (and similarly ȳ_{R2} for R2).
- Once this first split is found, split both R1 and R2 and repeat.
- Each split is the one that reduces RSS the most.
- Stop when e.g. fewer than five observations remain in each region.
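A minimal NumPy sketch of the greedy search for the first split (j, s); the function name best_split and the simulated data are illustrative only:

    import numpy as np

    def best_split(X, y):
        """Greedy search over (regressor j, split point s) minimizing RSS
        over the two half-planes R1 = {X_j < s}, R2 = {X_j >= s}."""
        n, p = X.shape
        best = (None, None, np.inf)          # (j, s, RSS)
        for j in range(p):
            for s in np.unique(X[:, j]):
                left, right = y[X[:, j] < s], y[X[:, j] >= s]
                if len(left) == 0 or len(right) == 0:
                    continue
                rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
                if rss < best[2]:
                    best = (j, s, rss)
        return best

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = 2.0 * (X[:, 0] > 0.5) + 0.1 * rng.normal(size=100)   # data generated with a split on regressor 0 at 0.5
    print(best_split(X, y))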


Tree example from ISL page 308

(1) split X1 in two; (2) split the lowest X1 values on the basis of X2 into R1 and R2; (3) split the highest X1 values into two regions (R3 and R4/R5); (4) split the highest X1 values on the basis of X2 into R4 and R5.


Tree example from ISL (continued)

The left figure gives the tree.

The right figure shows the predicted values of y.


Regression tree (continued)

The model is of the form f(x) = ∑_{j=1}^{J} ȳ_j × 1[x ∈ R_j].

The approach is a top-down greedy approach
- top-down as we start at the top of the tree
- greedy as at each step the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

This leads to overfitting, so prune (the criterion is sketched below)
- use cost complexity pruning (or weakest link pruning)
- this penalizes for having too many terminal nodes
- see ISL equation (8.4).
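For reference, the cost-complexity criterion that ISL equation (8.4) refers to is, roughly, to minimize over subtrees T with |T| terminal nodes

  ∑_{m=1}^{|T|} ∑_{i: x_i ∈ R_m} (y_i − ŷ_{R_m})² + α|T|,

where α ≥ 0 is a tuning parameter chosen by cross validation; a larger α prunes to a smaller tree.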


Tree as alternative to k-NN or kernel regression

Figure from Athey and Imbens (2019), "Machine Learning Methods Economists Should Know About"
- the axes are x1 and x2
- note that the tree uses the explanation of y in determining neighbors
- the tree may not do so well near the boundaries of a region
  - random forests form many trees, so an observation is not always near a boundary.


Improvements to regression trees

Regression trees are easy to understand if there are few regressors.

But they do not predict as well as the methods given so far
- due to high variance (e.g. split the data in two, then can get quite different trees).

Better methods are given next
- bagging
  - bootstrap aggregating averages regression trees over many samples
- random forests
  - averages regression trees over many sub-samples
- boosting
  - trees build on preceding trees.

8. Regression Trees and Random Forests 8.2 Bagging

8.2 Bagging (Bootstrap Aggregating)

Bagging is a general method for improving prediction that works especially well for regression trees.

The idea is that averaging reduces variance, so average regression trees over many samples (a minimal sketch of the procedure follows below)
- the different samples are obtained by bootstrap resampling with replacement (so they are not completely independent of each other)
- for each sample b obtain a large tree and prediction f_b(x)
- average all these predictions: f_bag(x) = (1/B) ∑_{b=1}^{B} f_b(x).

Get the test sample error by using the out-of-bag (OOB) observations not in the bootstrap sample
- Pr[i-th obs not in resample] = (1 − 1/n)^n → e^{−1} = 0.368 ≈ 1/3
- this replaces cross validation.

Interpretation of the trees is now difficult, so
- record the total amount that RSS is decreased due to splits over a given predictor, averaged over all B trees
- a large value indicates an important predictor.
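A minimal Python sketch of bagging (bootstrap resamples, one large tree per resample, average the predictions); scikit-learn's DecisionTreeRegressor stands in for the tree grower and the data are simulated, so treat this as illustrative rather than the talk's Stata workflow:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bagged_trees(X, y, B=200, seed=0):
        """Fit B large trees on bootstrap resamples; predict with the average f_bag(x)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        trees = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)          # resample with replacement
            tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
            trees.append(tree.fit(X[idx], y[idx]))
        return lambda Xnew: np.mean([t.predict(Xnew) for t in trees], axis=0)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 5))
    y = X[:, 0] ** 2 + rng.normal(size=300)
    predict_bag = bagged_trees(X, y)
    yhat = predict_bag(X)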

8. Regression Trees and Random Forests 8.3 Random Forests

8.3 Random Forests

The B bagging estimates are correlated
- e.g. if a regressor is important it will appear near the top of the tree in each bootstrap sample
- so the trees look similar from one resample to the next.

Random forests get bootstrap resamples (like bagging)
- but within each bootstrap sample use only a random sample of m < p predictors in deciding each split
- usually m ≈ √p
- this reduces the correlation across bootstrap resamples (see the sketch below).

Simple bagging is a random forest with m = p.
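A hedged scikit-learn sketch of a random forest with m ≈ √p predictors per split and the OOB error in place of cross validation; the simulated data and tuning values are illustrative assumptions:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 5))
    y = X[:, 0] ** 2 + rng.normal(size=300)

    # m = sqrt(p) predictors tried at each split; 500 bootstrap trees; OOB error replaces CV
    rf = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                               min_samples_leaf=5, oob_score=True, random_state=0)
    rf.fit(X, y)
    print(rf.oob_score_)              # out-of-bag R-squared
    print(rf.feature_importances_)    # RSS-reduction-based variable importance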


Random Forests (continued)

Random forests are related to kernel and k-nearest neighbors methods
- as they use a weighted average of nearby observations
- but with a data-driven way of determining which nearby observations get weight
- see Lin and Jeon (JASA, 2006).

Susan Athey and coauthors are big on random forests.

8. Regression Trees and Random Forests 8.4 Boosting

8.4 Boosting

Boosting is also a general method for improving prediction.

Regression trees use a greedy algorithm.

Boosting uses a slower algorithm to generate a sequence of trees
- each tree is grown using information from previously grown trees
- and is fit on a modified version of the original data set
- boosting does not involve bootstrap sampling.

Specifically (with λ a shrinkage penalty parameter; a minimal sketch follows below)
- given the current model b, fit a decision tree f_b to model b's residuals (rather than the outcome y)
- then update f(x) = previous f(x) + λ f_b(x)
- then update the residuals r_i = previous r_i − λ f_b(x_i)
- the boosted model is f(x) = ∑_{b=1}^{B} λ f_b(x).

The Stata add-on boost includes the file boost64.dll, which needs to be manually copied into c:\ado\plus.
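A minimal Python sketch of the boosting update above (fit a small tree to the current residuals, shrink by λ, repeat); the names, simulated data and tuning values are illustrative, and this is not the Stata boost add-on:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boost_trees(X, y, B=500, lam=0.01, depth=2, seed=0):
        """Boosting for squared error loss: repeatedly fit a small tree to the
        current residuals and add lam times its prediction to the model."""
        r = y.copy()                        # start with f(x) = 0, residuals r = y
        trees = []
        for _ in range(B):
            tree = DecisionTreeRegressor(max_depth=depth, random_state=seed)
            tree.fit(X, r)
            r = r - lam * tree.predict(X)   # update residuals
            trees.append(tree)
        return lambda Xnew: lam * np.sum([t.predict(Xnew) for t in trees], axis=0)

    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 5))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
    yhat = boost_trees(X, y)(X)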


Comparison of in-sample predictions

. * Compare various predictions in sample

. quietly regress ltotexp $zlist

. predict yh_ols
(option xb assumed; fitted values)

. summarize ltotexp yh*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     ltotexp |     2,955    8.059866    1.367592   1.098612   11.74094
       yh_rf |     2,955    8.060393    .7232039    5.29125   10.39143
  yh_rf_half |     2,955    8.053512    .7061742   5.540289   9.797389
    yh_boost |     2,955    7.667966    .4974654   5.045572   8.571422
      yh_ols |     2,955    8.059866     .654323   6.866516   10.53811

. correlate ltotexp yh*
(obs=2,955)

             |  ltotexp    yh_rf yh_rf_~f yh_boost   yh_ols
-------------+-----------------------------------------------
     ltotexp |   1.0000
       yh_rf |   0.7377   1.0000
  yh_rf_half |   0.6178   0.9212   1.0000
    yh_boost |   0.5423   0.8769   0.8381   1.0000
      yh_ols |   0.4784   0.8615   0.8580   0.8666   1.0000

9. Classification

9. Classification: Overview

The y's are now categorical
- example: binary if two categories.

Interest lies in predicting y using ŷ (classification)
- whereas economists usually want an estimate of Pr[y = j | x].

Use a (0,1) loss function rather than MSE or ln L
- 0 if correct classification
- 1 if misclassified.

Many machine learning applications are in settings where one can classify well
- e.g. reading car license plates
- unlike many economics applications.


Classification: Overview (continued)

Regression methods predict probabilities
- logistic regression, multinomial regression, k-nearest neighbors
- assign to the class with the highest predicted probability (Bayes classifier)
  - in the binary case ŷ = 1 if p̂ ≥ 0.5 and ŷ = 0 if p̂ < 0.5.

Discriminant analysis additionally assumes a normal distribution for the x's
- use Bayes theorem to get Pr[Y = k | X = x].

Support vector classifiers and support vector machines
- directly classify (no probabilities)
- are more nonlinear so may classify better
- use separating hyperplanes of X and extensions.

Regression trees, bagging, random forests and boosting can also be used for categorical data.

9. Classification 9.1 Loss Function

9.1 A Different Loss Function: Error Rate

Instead of MSE we use the error rate
- the proportion of observations that are misclassified

    Error rate = (1/n) ∑_{i=1}^{n} 1[y_i ≠ ŷ_i],

  - where for K categories y_i = 0, ..., K−1 and ŷ_i = 0, ..., K−1
  - and the indicator 1[A] = 1 if event A happens and = 0 otherwise.

The test error rate is for the n_0 observations in the test sample

    Ave(1[y_0 ≠ ŷ_0]) = (1/n_0) ∑_{i=1}^{n_0} 1[y_{0i} ≠ ŷ_{0i}].

Cross validation uses the number of misclassified observations, e.g. LOOCV is

    CV_(n) = (1/n) ∑_{i=1}^{n} Err_i = (1/n) ∑_{i=1}^{n} 1[y_i ≠ ŷ_{(−i)}].

A minimal computation of the error rate is sketched below.
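As a trivial illustration (Python, not part of the talk's Stata workflow), the error rate is just the sample mean of the misclassification indicator:

    import numpy as np

    def error_rate(y, yhat):
        """Proportion misclassified: (1/n) * sum of 1[y_i != yhat_i]."""
        y, yhat = np.asarray(y), np.asarray(yhat)
        return np.mean(y != yhat)

    print(error_rate([0, 1, 1, 0, 2], [0, 1, 0, 0, 2]))   # one of five misclassified: 0.2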

9. Classification 9.2 Logit

9.2 Logit

Directly model p(x) = Pr[y = 1 | x].

Logistic (logit) regression for the binary case obtains the MLE for

    ln[ p(x) / (1 − p(x)) ] = x'β.

The logit model is a linear (in x) classifier (a sketch follows below)
- ŷ = 1 if p̂(x) > 0.5
- i.e. if x'β̂ > 0.
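A hedged scikit-learn sketch of the logit classifier; a very large C approximates unpenalized MLE, and the simulated data are illustrative:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 3))
    y = (X @ np.array([1.0, -1.0, 0.5]) + rng.logistic(size=500) > 0).astype(int)

    # large C means essentially no penalty, approximating plain MLE logit
    logit = LogisticRegression(C=1e6).fit(X, y)
    phat = logit.predict_proba(X)[:, 1]
    yhat = (phat >= 0.5).astype(int)        # classify at phat = 0.5
    print(np.mean(y != yhat))               # in-sample error rate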

9. Classification 9.3 Discriminant Analysis

9.3 Linear and Quadratic Discriminant Analysis

Developed for classification problems such as whether a skull is Neanderthal or Homo Sapiens, given various measures of the skull.

Discriminant analysis specifies a joint distribution for (Y, X).

Linear discriminant analysis with K categories
- assumes X|Y = k is N(µ_k, Σ) with density f_k(x) = Pr[X = x | Y = k]
- and lets π_k = Pr[Y = k].

Upon simplification, assignment to the class k with the largest Pr[Y = k | X = x] is equivalent to choosing the class with the largest discriminant function

    δ_k(x) = x'Σ^{−1}µ_k − (1/2) µ_k'Σ^{−1}µ_k + ln π_k

- in practice use µ̂_k = x̄_k, Σ̂ = V̂ar[x_k] and π̂_k = (1/N) ∑_{i=1}^{N} 1[y_i = k].

It is called linear discriminant analysis as δ_k(x) is linear in x (a computational sketch follows below).

Quadratic discriminant analysis instead assumes X|Y = k is N(µ_k, Σ_k).
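A minimal NumPy sketch of the linear discriminant rule above, using a pooled within-class covariance estimate; the function name and the simulated data are illustrative:

    import numpy as np

    def lda_discriminants(X, y, Xnew):
        """delta_k(x) = x'S^{-1}m_k - 0.5 m_k'S^{-1}m_k + ln pi_k; assign to argmax_k."""
        classes = np.unique(y)
        n, p = X.shape
        means = np.array([X[y == k].mean(axis=0) for k in classes])
        pis = np.array([np.mean(y == k) for k in classes])
        # pooled within-class covariance estimate
        S = sum(np.cov(X[y == k], rowvar=False) * (np.sum(y == k) - 1) for k in classes) / (n - len(classes))
        Sinv = np.linalg.inv(S)
        delta = Xnew @ Sinv @ means.T - 0.5 * np.sum(means @ Sinv * means, axis=1) + np.log(pis)
        return classes[np.argmax(delta, axis=1)]

    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.repeat([0, 1], 50)
    print(lda_discriminants(X, y, X)[:5])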


ISL Figure 4.9: Linear and Quadratic Boundaries

LDA uses a linear boundary to classify and QDA a quadratic boundary.

9. Classification 9.4 Support Vector Machines

9.4 Support Vector Classifier

Builds on the LDA idea of a linear boundary to classify when K = 2.

Maximal margin classifier
- classify using a separating hyperplane (a linear combination of X)
- if perfect classification is possible then there are an infinite number of such hyperplanes
- so use the separating hyperplane that is furthest from the training observations
- this distance is called the maximal margin.

Support vector classifier (a sketch follows below)
- generalizes the maximal margin classifier to the nonseparable case
- this adds slack variables to allow some y's to be on the wrong side of the margin
- max over β, ε of M (the margin - the distance from the separator to the training X's) subject to β'β = 1, y_i(β_0 + x_i'β) ≥ M(1 − ε_i), ε_i ≥ 0 and ∑_{i=1}^{n} ε_i ≤ C.
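A hedged scikit-learn sketch of a linear support vector classifier on simulated data; note that sklearn's C penalizes margin violations, which is a different parameterization from the budget C in the constraint above:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
    y = np.repeat([0, 1], 100)

    # linear support vector classifier
    svc = SVC(kernel="linear", C=1.0).fit(X, y)
    yhat = svc.predict(X)
    print(np.mean(y != yhat))     # in-sample error rate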


Support Vector Machines

The support vector classifier has a linear boundary
- f(x_0) = β_0 + ∑_{i=1}^{n} α_i x_0'x_i, where x_0'x_i = ∑_{j=1}^{p} x_{0j} x_{ij}.

The support vector machine has nonlinear boundaries (a radial-kernel sketch follows below)
- f(x_0) = β_0 + ∑_{i=1}^{n} α_i K(x_0, x_i) where K(·) is a kernel
- polynomial kernel: K(x_0, x_i) = (1 + ∑_{j=1}^{p} x_{0j} x_{ij})^d
- radial kernel: K(x_0, x_i) = exp(−γ ∑_{j=1}^{p} (x_{0j} − x_{ij})²).

Can extend to K > 2 classes (see ISL ch. 9.4)
- one-versus-one or all-pairs approach
- one-versus-all approach.
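A short follow-on sketch with the radial kernel, again with illustrative tuning values (gamma, C) and simulated data in which no linear boundary can separate the classes:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(6)
    X = rng.normal(size=(300, 2))
    y = (np.sum(X ** 2, axis=1) > 1.5).astype(int)   # classes separated by a circle

    # radial (RBF) kernel K(x0, xi) = exp(-gamma * sum_j (x0j - xij)^2)
    svm_rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
    print(1 - svm_rbf.score(X, y))    # in-sample error rate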


ISL Figure 9.9: Support Vector Machine

In this example a linear or quadratic classifier won't work, whereas SVM does.


Comparison of model predictions

The following compares the various category predictions.

SVM does best, but these are in-sample predictions
- especially for SVM we should have separate training and test samples.

. * Compare various in-sample predictions
. correlate suppins yh_logit yh_knn yh_lda yh_qda yh_svm
(obs=3,064)

             |  suppins yh_logit   yh_knn   yh_lda   yh_qda   yh_svm
-------------+-------------------------------------------------------
     suppins |   1.0000
    yh_logit |   0.2505   1.0000
      yh_knn |   0.3604   0.3575   1.0000
      yh_lda |   0.2395   0.6955   0.3776   1.0000
      yh_qda |   0.2294   0.6926   0.2762   0.5850   1.0000
      yh_svm |   0.5344   0.3966   0.6011   0.3941   0.3206   1.0000

10. Unsupervised Learning

10. Unsupervised Learning

A challenging area: no y, only x.

An example is determining several types of individual based on responses to many psychological questions.

Principal components analysis
- already presented earlier.

Clustering methods
- k-means clustering
- hierarchical clustering.

10. Unsupervised Learning 10.1 Cluster Analysis

10.1 Cluster Analysis: k-Means Clustering

The goal is to find homogeneous subgroups among the X.

K-means splits the observations into K distinct clusters such that the within-cluster variation is minimized.

Let W(C_k) be a measure of within-cluster variation
- minimize over C_1, ..., C_K the total ∑_{k=1}^{K} W(C_k)
- with Euclidean distance, W(C_k) = (1/n_k) ∑_{i,i'∈C_k} ∑_{j=1}^{p} (x_{ij} − x_{i'j})².

A global optimum would require examining on the order of K^n partitions.

Instead use algorithm 10.1 (ISL p.388), which finds a local optimum (a sketch follows below)
- run the algorithm multiple times with different starting seeds
- choose the optimum with the smallest ∑_{k=1}^{K} W(C_k).
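A hedged scikit-learn sketch of k-means with multiple random starts (n_init), keeping the solution with the smallest within-cluster sum of squares; the simulated data and the choice K = 3 are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(7)
    X = np.vstack([rng.normal(0, 1, (100, 2)),
                   rng.normal(4, 1, (100, 2)),
                   rng.normal((0, 4), 1, (100, 2))])

    # n_init restarts the algorithm from different seeds and keeps the best local optimum
    km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
    print(km.inertia_)             # total within-cluster sum of squares at the chosen optimum
    print(np.bincount(km.labels_)) # cluster sizes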


ISL Figure 10.5

Data are (x1, x2) with K = 2, 3 and 4 clusters identified.


Hierarchical Clustering

Do not specify K .

Instead begin with n clusters (leaves) and combine clusters into branches up towards the trunk
- represented by a dendrogram
- eyeball the dendrogram to decide the number of clusters.

Need a dissimilarity measure between clusters (a complete-linkage sketch follows below)
- four types of linkage: complete, average, single and centroid.

For any clustering method
- unsupervised learning is a difficult problem
- results can change a lot with small changes in method
- clustering on subsets of the data can provide a sense of robustness.
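A minimal SciPy sketch of agglomerative (hierarchical) clustering with complete linkage, cutting the dendrogram into two clusters; the simulated data are illustrative:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(8)
    X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])

    # agglomerative clustering with complete linkage; cut the dendrogram into 2 clusters
    Z = linkage(X, method="complete")
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(np.bincount(labels))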

11. Conclusions

11. Conclusions

Guard against overfitting
- use K-fold cross validation or penalty measures such as AIC.

Biased estimators can be better predictors
- shrinkage towards zero, such as ridge and LASSO.

For flexible models popular choices are
- neural nets
- random forests.

Though which method is best varies with the application
- and best are ensemble forecasts that combine different methods.

Machine learning methods can outperform nonparametric and semiparametric methods
- so wherever econometricians use nonparametric and semiparametric regression in higher-dimensional models it may be useful to use ML methods
- though the underlying theory still relies on assumptions such as sparsity.

12. References

12. References

Undergraduate / Masters level book
- ISL: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2013), An Introduction to Statistical Learning: with Applications in R, Springer.
  - free legal pdf at http://www-bcf.usc.edu/~gareth/ISL/
  - $25 hardcopy via http://www.springer.com/gp/products/books/mycopy

Masters / PhD level book
- ESL: Trevor Hastie, Robert Tibshirani and Jerome Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer.
  - free legal pdf at http://statweb.stanford.edu/~tibs/ElemStatLearn/index.html
  - $25 hardcopy via http://www.springer.com/gp/products/books/mycopy


References (continued)

A recent book is
- EH: Bradley Efron and Trevor Hastie (2016), Computer Age Statistical Inference: Algorithms, Evidence and Data Science, Cambridge University Press.

Interesting book: Cathy O'Neil (2016), Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy.

My website has some material (including these slides)
- http://cameron.econ.ucdavis.edu/e240f/machinelearning.html
