Machine Learning Methods: An Overview

A. Colin Cameron, U.C.-Davis
Presented at the University of California - Riverside, April 23, 2019
cameron.econ.ucdavis.edu/e240f/machlearn2019_Riverside_1.pdf

Abstract

This talk summarizes key machine learning methods and presents them in a language familiar to economists.

The focus is on regression of y on x. The talk is based on two books:

- Introduction to Statistical Learning (ISL)
- Elements of Statistical Learning (ESL).

It emphasizes the ML methods most used currently in economics.

Introduction

Machine learning methods include data-driven algorithms to predict y given x.

- there are many machine learning (ML) methods
- the best ML method varies with the particular data application
- ML methods guard against in-sample overfitting.

The main goal of the machine learning literature is prediction of y
- and not estimation of β.

For economics, prediction of y is sometimes a goal
- e.g. do not provide a hip transplant to individuals with low predicted one-year survival probability
- then the main issue is which standard ML method is best.

But often economics is interested in estimation of β or of a partial effect such as an average treatment effect
- then new methods using ML are being developed by econometricians.

Econometrics for Machine Learning

The separate slides on ML methods in economics consider the following microeconomics examples.

Treatment effects under an unconfoundedness assumption
- estimate β in the model y = βx1 + g(x2) + u
- the assumption that x1 is exogenous is more plausible with a better g(x2)
- machine learning methods can lead to a good choice of g(x2).

Treatment effects under endogeneity using instrumental variables
- now x1 is endogenous in the model y = β1x1 + g(x2) + u
- given instruments x3 and x2 there is a potential many-instruments problem
- machine learning methods can lead to a good choice of instruments.

Average treatment effects for heterogeneous treatment
- ML methods may lead to better regression imputation and better propensity score matching.

Econometrics for Machine Learning (continued)

ML methods involve data mining
- using traditional methods, data mining leads to the complications of pre-test bias and multiple testing.

In the preceding econometrics examples, the ML methods are used in such a way that these complications do not arise
- an asymptotic distribution for the estimates of β1 or the ATE is obtained
- furthermore this distribution is Gaussian.

Overview

1. Terminology
2. Model selection - especially cross-validation
3. Shrinkage methods
   1. Variance-bias trade-off
   2. Ridge regression, LASSO, elastic net
4. Dimension reduction - principal components
5. High-dimensional data
6. Nonparametric and semiparametric regression
7. Flexible regression (splines, sieves, neural networks, ...)
8. Regression trees and random forests
9. Classification - including support vector machines
10. Unsupervised learning (no y) - k-means clustering

1. Terminology

The topic is called machine learning or statistical learning or data learning or data analytics, where the data may be big or small.

Supervised learning = Regression
- we have both an outcome y and regressors (or features) x
- 1. Regression: y is continuous
- 2. Classification: y is categorical.

Unsupervised learning
- we have no outcome y - only several x
- 3. Cluster analysis: e.g. determine five types of individuals given many psychometric measures.

This talk
- focuses on 1
- briefly mentions 2
- even more briefly mentions 3.

Terminology (continued)

Consider two types of data sets
- 1. training data set (or estimation sample)
  - used to fit a model
- 2. test data set (or hold-out sample or validation set)
  - additional data used to determine how good the model fit is
  - a test observation (x0, y0) is a previously unseen observation.

2. Model Selection
2.1 Model selection: Forwards, backwards and best subsets

Forwards selection (or specific to general)
- start with the simplest model (intercept-only) and in turn include the variable that is most statistically significant or most improves fit
- requires up to $p + (p-1) + \cdots + 1 = p(p+1)/2$ regressions, where p is the number of regressors.

Backwards selection (or general to specific)
- start with the most general model and in turn drop the variable that is least statistically significant or least improves fit
- requires up to $p(p+1)/2$ regressions.

Best subsets
- for k = 1, ..., p find the best-fitting model with k regressors
- in theory requires $\binom{p}{0} + \binom{p}{1} + \cdots + \binom{p}{p} = 2^p$ regressions
- but the leaps and bounds procedure makes this much quicker
- p < 40 is manageable, though recent work suggests p in the thousands.

2.2 Selection using Statistical Significance

Use p < 0.05 as the criterion
- and do forward selection or backward selection
- common in biostatistics.

Problem:
- pre-testing changes the distribution of $\hat{\beta}$
- this problem is typically ignored (but should not be).

2.3 Goodness-of-fit measures

We wish to predict y given x = (x1, ..., xp). A training data set d yields the prediction rule $\hat{f}(x)$
- we predict y at point $x_0$ using $\hat{y}_0 = \hat{f}(x_0)$
- e.g. for OLS $\hat{y}_0 = x_0'(X'X)^{-1}X'y$.

For regression consider squared error loss $(y - \hat{y})^2$
- some methods adapt to other loss functions
  - e.g. absolute error loss and log-likelihood loss
- the loss function for classification is $1(y \neq \hat{y})$.

We wish to estimate the true prediction error
- $\mathrm{Err}_d = E_F[(y_0 - \hat{y}_0)^2]$
- for a test data set point $(x_0, y_0) \sim F$.

Models overfit in sample

We want to estimate the true prediction error
- $E_F[(y_0 - \hat{y}_0)^2]$ for a test data set point $(x_0, y_0) \sim F$.

The obvious criterion is the in-sample mean squared error
- $MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$.

Problem: in-sample MSE under-estimates the true prediction error
- intuitively, models "overfit" within sample.

Example: suppose $y = X\beta + u$
- then $\hat{u} = (y - X\hat{\beta}_{OLS}) = (I - M)u$ where $M = X(X'X)^{-1}X'$
  - so $|\hat{u}_i| < |u_i|$ (the OLS residual is smaller than the true unknown error)
- and we use $\hat{\sigma}^2 = s^2 = \frac{1}{n-k}\sum_{i=1}^n (y_i - \hat{y}_i)^2$ and not $\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$.

Two solutions:
- penalize for overfitting, e.g. $\bar{R}^2$, AIC, BIC, Cp
- use out-of-estimation-sample prediction (cross-validation).

2.4 Penalized Goodness-of-fit Measures

Standard measures for a general parametric model are
- AIC: Akaike's information criterion (small penalty)
  - AIC = $-2\ln L + 2k$
- BIC: Bayesian information criterion (larger penalty)
  - BIC = $-2\ln L + (\ln n) \times k$.

Models with smaller AIC and BIC are preferred
- for regression with i.i.d. errors, replace $-2\ln L$ with $n\ln\hat{\sigma}^2$ plus a constant.

For OLS other measures are
- $\bar{R}^2$: adjusted R-squared
- Mallows Cp
- AICC = AIC $+\ 2(K+1)(K+2)/(N-K-2)$.

Stata Example

These slides use Stata
- most machine learning code is initially done in R.

Generated data: n = 40.

Three correlated regressors:

$$\begin{bmatrix} x_{1i} \\ x_{2i} \\ x_{3i} \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{bmatrix} \right)$$

But only x1 determines y
- $y_i = 2 + x_{1i} + u_i$ where $u_i \sim N(0, 3^2)$.
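
A minimal Stata sketch of how such data could be generated (the seed and the use of drawnorm are my assumptions; the slides do not show the data-generation code):

* hypothetical re-creation of the generated data (n = 40)
clear
set obs 40
set seed 10101
matrix C = (1, 0.5, 0.5 \ 0.5, 1, 0.5 \ 0.5, 0.5, 1)
drawnorm x1 x2 x3, corr(C)      // three correlated standard normal regressors
gen y = 2 + x1 + rnormal(0, 3)  // only x1 enters the DGP; sd of u is 3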

Fitted OLS regression

As expected, only x1 is statistically significant at 5%
- though due to randomness this is not guaranteed.

. * OLS regression of y on x1-x3
. regress y x1 x2 x3, vce(robust)

Linear regression                               Number of obs     =         40
                                                F(3, 36)          =       4.91
                                                Prob > F          =     0.0058
                                                R-squared         =     0.2373
                                                Root MSE          =     3.0907

                    Robust
           y       Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          x1    1.555582   .5006152     3.11   0.004     .5402873    2.570877
          x2    .4707111   .5251826     0.90   0.376    -.5944086    1.535831
          x3   -.0256025   .6009393    -0.04   0.966    -1.244364    1.193159
       _cons    2.531396   .5377607     4.71   0.000     1.440766    3.622025

Example of penalty measures

We will consider all 8 possible models based on x1, x2 and x3.

. * Regressor lists for all possible models
. global xlist1
. global xlist2 x1
. global xlist3 x2
. global xlist4 x3
. global xlist5 x1 x2
. global xlist6 x2 x3
. global xlist7 x1 x3
. global xlist8 x1 x2 x3

. * Full sample estimates with AIC, BIC, Cp, R2adj penalties
. quietly regress y $xlist8
. scalar s2full = e(rmse)^2  // Needed for Mallows Cp

Example of penalty measures (continued)

Manually get the various measures
- all (but MSE) favor the model with just x1.

. forvalues k = 1/8 {
  2.   quietly regress y ${xlist`k'}
  3.   scalar mse`k' = e(rss)/e(N)
  4.   scalar r2adj`k' = e(r2_a)
  5.   scalar aic`k' = -2*e(ll) + 2*e(rank)
  6.   scalar bic`k' = -2*e(ll) + e(rank)*ln(e(N))
  7.   scalar cp`k' =  e(rss)/s2full - e(N) + 2*e(rank)
  8.   display "Model " "${xlist`k'}" _col(15) " MSE=" %6.3f mse`k'  ///
>    " R2adj=" %6.3f r2adj`k' "  AIC=" %7.2f aic`k'  ///
>    " BIC=" %7.2f bic`k' " Cp=" %6.3f cp`k'
  9. }
Model          MSE=11.272 R2adj= 0.000  AIC= 212.41 BIC= 214.10 Cp= 9.199
Model x1       MSE= 8.739 R2adj= 0.204  AIC= 204.23 BIC= 207.60 Cp= 0.593
Model x2       MSE= 9.992 R2adj= 0.090  AIC= 209.58 BIC= 212.96 Cp= 5.838
Model x3       MSE=10.800 R2adj= 0.017  AIC= 212.70 BIC= 216.08 Cp= 9.224
Model x1 x2    MSE= 8.598 R2adj= 0.196  AIC= 205.58 BIC= 210.64 Cp= 2.002
Model x2 x3    MSE= 9.842 R2adj= 0.080  AIC= 210.98 BIC= 216.05 Cp= 7.211
Model x1 x3    MSE= 8.739 R2adj= 0.183  AIC= 206.23 BIC= 211.29 Cp= 2.592
Model x1 x2 x3 MSE= 8.597 R2adj= 0.174  AIC= 207.57 BIC= 214.33 Cp= 4.000

2.5 Cross-validation

Begin with single-split validation
- for pedagogical reasons.

Then present K-fold cross-validation
- used extensively in machine learning
- generalizes to loss functions other than MSE, such as $\frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$
- though it requires more computation than e.g. BIC.

And present leave-one-out cross-validation
- widely used for bandwidth choice in nonparametric regression.

Given a selected model, the final estimation is on the full dataset
- the usual inference ignores the data mining.

Single split validation

Randomly divide the available data into two parts
- 1. the model is fit on the training set
- 2. MSE is computed for predictions in the validation set.

Example: estimate all 8 possible models with x1, x2 and x3
- for each model, estimate on the training set to get the $\hat{\beta}$'s, predict on the validation set and compute the MSE in the validation set
- choose the model with the lowest validation set MSE.

Problems with this single-split validation
- 1. We lose precision due to the smaller training set
  - so we may actually overestimate the test error rate (MSE) of the model.
- 2. Results depend a lot on the particular single split.

Single split validation example

Randomly form
- a training sample (n = 14)
- a test sample (n = 26).

. * Form indicator that determines training and test datasets
. set seed 10101
. gen dtrain = runiform() > 0.5  // 1 for training set and 0 for test data
. count if dtrain == 1
  14

Single split validation example (continued)

In-sample (training sample) MSE is minimized with x1, x2, x3.

Out-of-sample (test sample) MSE is minimized with only x1.

. * Split sample validation - training and test MSE for the 8 possible models
. forvalues k = 1/8 {
  2.   quietly reg y ${xlist`k'} if dtrain==1
  3.   qui predict y`k'hat
  4.   qui gen y`k'errorsq = (y`k'hat - y)^2
  5.   qui sum y`k'errorsq if dtrain == 1
  6.   scalar mse`k'train = r(mean)
  7.   qui sum y`k'errorsq if dtrain == 0
  8.   qui scalar mse`k'test = r(mean)
  9.   display "Model " "${xlist`k'}" _col(16)  ///
>     " Training MSE = " %7.3f mse`k'train " Test MSE = " %7.3f mse`k'test
 10. }
Model           Training MSE =  10.496 Test MSE =  11.852
Model x1        Training MSE =   7.625 Test MSE =   9.568
Model x2        Training MSE =  10.326 Test MSE =  10.490
Model x3        Training MSE =   7.943 Test MSE =  13.052
Model x1 x2     Training MSE =   7.520 Test MSE =  10.350
Model x2 x3     Training MSE =   7.798 Test MSE =  15.270
Model x1 x3     Training MSE =   7.138 Test MSE =  10.468
Model x1 x2 x3  Training MSE =   6.830 Test MSE =  12.708

K-fold cross-validation

K-fold cross-validation
- splits the data into K mutually exclusive folds of roughly equal size
- for j = 1, ..., K, fit using all folds but fold j and predict on fold j
- standard choices are K = 5 and K = 10.

The following shows the case K = 5.

          Fit on folds    Test on fold
  j = 1     2,3,4,5             1
  j = 2     1,3,4,5             2
  j = 3     1,2,4,5             3
  j = 4     1,2,3,5             4
  j = 5     1,2,3,4             5

The K-fold CV estimate is $CV_K = \frac{1}{K}\sum_{j=1}^K MSE_{(j)}$, where $MSE_{(j)}$ is the MSE for fold j.
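
A minimal do-file sketch of K-fold CV computed by hand for the model with only x1 (a hedged illustration; the fold assignment by runiform() and the variable names are my own, and the published example instead uses the crossfold command shown on the next slide):

* hypothetical manual 5-fold cross-validation for the model y on x1
set seed 10101
gen fold = ceil(5*runiform())        // assign each observation to a fold 1-5
gen double cverr = .
forvalues j = 1/5 {
    quietly regress y x1 if fold != `j'              // fit on all folds but j
    quietly predict double yhatj if fold == `j', xb
    quietly replace cverr = (y - yhatj)^2 if fold == `j'
    drop yhatj
}
quietly sum cverr
display "CV5 estimate for the model with x1 = " r(mean)

This averages squared errors over all held-out observations, which equals the average of the fold MSEs when the folds are of equal size.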

K-fold cross-validation example

The user-written crossfold command (Daniels 2012) implements this
- do so for the model with all three regressors and K = 5
- set the seed for replicability.

. * Five-fold cross validation example for model with all regressors
. set seed 10101
. crossfold regress y x1 x2 x3, k(5)

                 RMSE
        est1  3.739027
        est2  2.549458
        est3  3.059801
        est4  2.532469
        est5  3.498511

. matrix RMSEs  = r(est)
. svmat RMSEs, names(rmse)
. quietly generate mse = rmse^2
. quietly sum mse
. display _n "CV5 (average MSE in 5 folds) = " r(mean) " with st. dev. = " r(sd)

CV5 (average MSE in 5 folds) = 9.6990836 with st. dev. = 3.3885108

K-fold cross-validation example (continued)

Now do so for all eight models with K = 5
- the model with only x1 has the lowest CV(5).

. * Five-fold cross validation measure for all possible models
. drop rm*     // Drop variables created by previous crossfold
. drop _est*   // Drop variables created by previous crossfold
. forvalues k = 1/8 {
  2.   set seed 10101
  3.   quietly crossfold regress y ${xlist`k'}, k(5)
  4.   matrix RMSEs`k' = r(est)
  5.   svmat RMSEs`k', names(rmse`k')
  6.   quietly generate mse`k' = rmse`k'^2
  7.   quietly sum mse`k'
  8.   scalar cv`k' = r(mean)
  9.   scalar sdcv`k' = r(sd)
 10.   display "Model " "${xlist`k'}" _col(16) "  CV5 = " %7.3f cv`k' ///
>      " with st. dev. = " %7.3f sdcv`k'
 11. }
Model            CV5 =  11.960 with st. dev. =   3.561
Model x1         CV5 =   9.138 with st. dev. =   3.069
Model x2         CV5 =  10.407 with st. dev. =   4.139
Model x3         CV5 =  11.776 with st. dev. =   3.272
Model x1 x2      CV5 =   9.173 with st. dev. =   3.367
Model x2 x3      CV5 =  10.872 with st. dev. =   4.221
Model x1 x3      CV5 =   9.639 with st. dev. =   2.985
Model x1 x2 x3   CV5 =   9.699 with st. dev. =   3.389

Leave-one-out Cross Validation (LOOCV)

Use a single observation for validation and (n-1) for training
- $\hat{y}_{(-i)}$ is the prediction of $y_i$ after OLS on observations 1, ..., i-1, i+1, ..., n
- cycle through all n observations doing this.

Then the LOOCV measure is
$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^n MSE_{(-i)} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_{(-i)})^2$$

Requires n regressions in general
- except for OLS one can show $CV_{(n)} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2$
  - where $\hat{y}_i$ is the fitted value from OLS on the full training sample
  - and $h_{ii}$ is the i-th diagonal entry of the hat matrix $X(X'X)^{-1}X'$.

Used for bandwidth choice in local nonparametric regression
- such as k-nearest neighbors, kernel and local linear regression
- but not used for machine learning (see below).
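
A minimal sketch of the OLS shortcut formula in Stata, using the leverage values after regress (the variable names are illustrative; this code is not part of the original slides):

* hypothetical LOOCV for OLS of y on x1 via the leverage shortcut
quietly regress y x1
predict double res, residuals
predict double h, hat                 // leverage h_ii from the hat matrix
gen double looerr = (res/(1 - h))^2
quietly sum looerr
display "LOOCV (CV_n) for the model with x1 = " r(mean)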

How many folds?

LOOCV is the special case of K-fold CV with K = n
- it has little bias
  - as all but one observation is used to fit
- but large variance
  - as the n predicted $\hat{y}_{(-i)}$ are based on very similar samples
  - so the subsequent averaging does not reduce variance much.

The choice K = 5 or K = 10 is found to be a good compromise
- neither high bias nor high variance.

Remember: For replicability set the seed as this determines the folds.

3. Shrinkage Estimation

Consider the linear regression model with p potential regressors, where p is too large.

Methods that reduce the model complexity are
- choose a subset of regressors
- shrink regression coefficients towards zero
  - ridge, LASSO, elastic net
- reduce the dimension of the regressors
  - principal components analysis.

Linear regression may predict well if we include interactions and powers as potential regressors.

And the methods can be adapted to alternative loss functions for estimation.

3.1 Variance-bias trade-off

Consider the regression model
$$y = f(x) + u \text{ with } E[u] = 0 \text{ and } u \perp x.$$

For an out-of-estimation-sample point $(y_0, x_0)$ the true prediction error is
$$E[(y_0 - \hat{f}(x_0))^2] = Var[\hat{f}(x_0)] + \{Bias(\hat{f}(x_0))\}^2 + Var(u)$$

The last term Var(u) is called the irreducible error
- we can do nothing about this.

So we need to minimize the sum of the variance and the bias squared!
- more flexible models have less bias (good) and more variance (bad)
- this trade-off is fundamental to machine learning.

Variance-bias trade-off and shrinkage

Shrinkage is one method that is biased, but the bias may lead to lower squared error loss
- first show this for estimation of a parameter β
- then show this for prediction of y.

The mean squared error of a scalar estimator $\tilde{\beta}$ is
$$MSE(\tilde{\beta}) = E[(\tilde{\beta} - \beta)^2]
= E[\{(\tilde{\beta} - E[\tilde{\beta}]) + (E[\tilde{\beta}] - \beta)\}^2]
= E[(\tilde{\beta} - E[\tilde{\beta}])^2] + (E[\tilde{\beta}] - \beta)^2 + 2 \times 0
= Var(\tilde{\beta}) + Bias^2(\tilde{\beta})$$
- as the cross-product term $2 \times E[(\tilde{\beta} - E[\tilde{\beta}])(E[\tilde{\beta}] - \beta)] = \text{constant} \times E[(\tilde{\beta} - E[\tilde{\beta}])] = 0$.

Bias can reduce estimator MSE: a shrinkage example

Suppose the scalar estimator $\hat{\beta}$ is unbiased for β with
- $E[\hat{\beta}] = \beta$ and $Var[\hat{\beta}] = v$, so $MSE(\hat{\beta}) = v$.

Consider the shrinkage estimator
- $\tilde{\beta} = a\hat{\beta}$ where $0 \le a \le 1$.

Bias: $Bias(\tilde{\beta}) = E[\tilde{\beta}] - \beta = a\beta - \beta = (a-1)\beta$.
Variance: $Var[\tilde{\beta}] = Var[a\hat{\beta}] = a^2 Var(\hat{\beta}) = a^2 v$.

$$MSE(\tilde{\beta}) = Var[\tilde{\beta}] + Bias^2(\tilde{\beta}) = a^2 v + (a-1)^2\beta^2$$
$$MSE(\tilde{\beta}) < MSE[\hat{\beta}] \text{ if } \beta^2 < \frac{1+a}{1-a}v.$$

So $MSE(\tilde{\beta}) < MSE[\hat{\beta}]$ for a = 0 if $\beta^2 < v$
- and for a = 0.9 if $\beta^2 < 19v$.

The ridge estimator shrinks towards zero.
The LASSO estimator selects and shrinks towards zero.
The James-Stein estimator in $y_i \sim N(\mu_i, 1)$ has lower MSE than the MLE!
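
A small simulation sketch of this point (the values v = 4, β = 1 and a = 0.5 are my own illustrative choices, not from the slides); they satisfy $\beta^2 < \frac{1+a}{1-a}v = 12$, so the shrinkage estimator should have lower simulated MSE:

* hypothetical Monte Carlo check that a*bhat can have lower MSE than bhat
clear
set seed 10101
set obs 1000
gen bhat   = 1 + sqrt(4)*rnormal()     // unbiased estimator: beta = 1, Var = 4
gen btilde = 0.5*bhat                  // shrinkage estimator with a = 0.5
gen se_bhat   = (bhat - 1)^2
gen se_btilde = (btilde - 1)^2
sum se_bhat se_btilde                  // means approximate MSE of 4 and 1.25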

Bias can therefore reduce predictor MSE

Now consider prediction of $y_0 = \beta x_0 + u$ where $E[u] = 0$
- using $\tilde{y}_0 = \tilde{\beta} x_0$, where the scalar $x_0$ is treated as fixed.

Then the mean squared error of the prediction $\tilde{y}_0$ is
$$MSE(\tilde{y}_0) = x_0^2\{Var(\tilde{\beta}) + Bias^2(\tilde{\beta})\} + Var(u) = x_0^2\, MSE(\tilde{\beta}) + Var(u).$$

So bias in $\tilde{\beta}$ that reduces $MSE(\tilde{\beta})$ also reduces $MSE(\tilde{y}_0)$.

3.2 Shrinkage Methods

Shrinkage estimators minimize the RSS (residual sum of squares) with a penalty for model size
- this shrinks the parameter estimates towards zero.

The extent of shrinkage is determined by a tuning parameter
- this is determined by cross-validation or e.g. AIC.

Ridge, LASSO and elastic net are not invariant to rescaling of the regressors, so first standardize
- so $x_{ij}$ below is actually $(x_{ij} - \bar{x}_j)/s_j$
- and demean $y_i$, so below $y_i$ is actually $y_i - \bar{y}$
- $x_i$ does not include an intercept, nor does the data matrix X
- we can recover the intercept $\beta_0$ as $\hat{\beta}_0 = \bar{y}$.

So work with $y = x'\beta + \varepsilon = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$.

3.3 Ridge Regression

The ridge estimator $\hat{\beta}_\lambda$ of β minimizes
$$Q_\lambda(\beta) = \sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^p \beta_j^2 = RSS + \lambda (||\beta||_2)^2$$
- where $\lambda \ge 0$ is a tuning parameter to be determined
- $||\beta||_2 = \sqrt{\sum_{j=1}^p \beta_j^2}$ is the L2 norm.

Equivalently the ridge estimator minimizes
$$\sum_{i=1}^n (y_i - x_i'\beta)^2 \text{ subject to } \sum_{j=1}^p \beta_j^2 \le s.$$

The ridge estimator is $\hat{\beta}_\lambda = (X'X + \lambda I)^{-1}X'y$.

Derivation: $Q(\beta) = (y - X\beta)'(y - X\beta) + \lambda\beta'\beta$
- $\partial Q(\beta)/\partial\beta = -2X'(y - X\beta) + 2\lambda\beta = 0$
- $\Rightarrow X'X\beta + \lambda I\beta = X'y$
- $\Rightarrow \hat{\beta}_\lambda = (X'X + \lambda I)^{-1}X'y$.
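
A minimal sketch of the closed form using Stata matrix commands (λ = 2, the z-scoring of the regressors and all variable names here are my own illustrative choices):

* hypothetical closed-form ridge fit: bridge = (X'X + lambda*I)^(-1) X'y
foreach v of varlist x1 x2 x3 {
    egen double z`v' = std(`v')      // standardize regressors, as recommended
}
quietly sum y
gen double yd = y - r(mean)          // demean y
matrix accum XpX = zx1 zx2 zx3, noconstant        // X'X
matrix vecaccum ypX = yd zx1 zx2 zx3, noconstant  // y'X
matrix bridge = invsym(XpX + 2*I(3)) * ypX'       // here lambda = 2
matrix list bridge

Setting λ = 0 reproduces OLS on the standardized data, and larger λ shrinks the coefficients towards zero.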

More on Ridge

Features
- $\hat{\beta}_\lambda \to \hat{\beta}_{OLS}$ as $\lambda \to 0$ and $\hat{\beta}_\lambda \to 0$ as $\lambda \to \infty$
- best when many predictors are important with coefficients of similar size
- best when LS has high variance
- algorithms exist to quickly compute $\hat{\beta}_\lambda$ for many values of λ
- then choose λ by cross-validation.

Hoerl and Kennard (1970) proposed ridge as a way to reduce the MSE of $\tilde{\beta}$.

Ridge is the posterior mean for $y \sim N(X\beta, \sigma^2 I)$ with prior $\beta \sim N(0, \gamma^2 I)$
- though γ is a specified prior parameter whereas λ is data-determined.

For a scalar regressor and no intercept, $\hat{\beta}_\lambda = a\hat{\beta}_{OLS}$ where $a = \frac{\sum_i x_i^2}{\sum_i x_i^2 + \lambda}$
- like the earlier example of $\tilde{\beta} = a\hat{\beta}$.

3.4 LASSO (Least Absolute Shrinkage And Selection)

The LASSO estimator $\hat{\beta}_\lambda$ of β minimizes
$$Q_\lambda(\beta) = \sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^p |\beta_j| = RSS + \lambda ||\beta||_1$$
- where $\lambda \ge 0$ is a tuning parameter to be determined
- $||\beta||_1 = \sum_{j=1}^p |\beta_j|$ is the L1 norm.

Equivalently the LASSO estimator minimizes
$$\sum_{i=1}^n (y_i - x_i'\beta)^2 \text{ subject to } \sum_{j=1}^p |\beta_j| \le s.$$

Features
- best when a few regressors have $\beta_j \neq 0$ and most $\beta_j = 0$
- leads to a more interpretable model than ridge.

LASSO versus Ridge (key figure from ISL)

LASSO is likely to set some coefficients to zero.

3.5 Elastic net

There are many extensions of LASSO.

Elastic net combines ridge regression and LASSO with objective function
$$Q_{\lambda,\alpha}(\beta) = \sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^p \{\alpha|\beta_j| + (1-\alpha)\beta_j^2\}.$$
- the ridge penalty averages correlated variables
- the LASSO penalty leads to sparsity.

Here I use the elasticregress package (Townsend 2018)
- ridgeregress (alpha=0)
- lassoregress (alpha=1)
- elasticregress.

K-fold cross-validation is used with default K = 10
- set the seed for replicability.

In part 3 I instead use the lassopack package.

3.6 LASSO example

OLS was $\hat{y} = 1.556x_1 + 0.471x_2 - 0.026x_3 + 2.532$.
OLS on x1 and x2 is $\hat{y} = 1.554x_1 + 0.468x_2 + 2.534$.

. * LASSO with lambda determined by cross validation
. set seed 10101
. lassoregress y x1 x2 x3, numfolds(5)

LASSO regression                       Number of observations     =         40
                                       R-squared                  =     0.2293
                                       alpha                      =     1.0000
                                       lambda                     =     0.2594
                                       Cross-validation MSE       =     9.3871
                                       Number of folds            =          5
                                       Number of lambda tested    =        100

           y       Coef.
          x1      1.3505
          x2    .2834002
          x3           0
       _cons    2.621573

4. Dimension Reduction

Reduce from p regressors to M < p linear combinations of the regressors
- form $X^* = XA$ where A is $p \times M$ and $M < p$
- $y = \beta_0 + X^*\delta + u$ after dimension reduction
- $y = \beta_0 + X\beta + u$ where $\beta = A\delta$.

Two methods mentioned in ISL
- 1. Principal components
  - use only X to form A (unsupervised)
- 2. Partial least squares
  - also use the relationship between y and X to form A (supervised)
  - I have not seen this used in practice.

For both, the regressors should be standardized as the methods are not scale invariant.

And cross-validation is often used to determine M.

4.1 Principal Components Analysis (PCA)

Suppose X is normalized to have zero means, so the ij-th entry is $x_{ij} - \bar{x}_j$.

The first principal component has the largest sample variance among all normalized linear combinations of the columns of the $n \times p$ matrix X
- the first component is $Xh_1$ where $h_1$ is $p \times 1$
- normalize $h_1$ so that $h_1'h_1 = 1$
- then $h_1$ maximizes $Var(Xh_1) = h_1'X'Xh_1$ subject to $h_1'h_1 = 1$
- the maximum is the largest eigenvalue of $X'X$ and $h_1$ is the corresponding eigenvector.

The second principal component has the largest variance subject to being orthogonal to the first, and so on.
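
A minimal sketch of how the first principal component score (pc1) used on the next slide could be constructed in Stata (an assumption about how it was built; the slides do not show this step):

* hypothetical construction of the first principal component of x1-x3
pca x1 x2 x3            // eigen-decomposition of the correlation matrix
predict pc1, score      // score on the first principal component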

Principal Components Analysis Example

Compare the correlation coefficient from OLS on the first principal component (r = 0.4444) with OLS on all three regressors (r = 0.4871) and on each single regressor.

. * Compare R from OLS on all three regressors, on pc1, on x1, on x2, on x3
. quietly regress y x1 x2 x3
. predict yhat
(option xb assumed; fitted values)
. correlate y yhat pc1 x1 x2 x3
(obs=40)

                    y     yhat      pc1       x1       x2       x3
           y   1.0000
        yhat   0.4871   1.0000
         pc1   0.4219   0.8661   1.0000
          x1   0.4740   0.9732   0.8086   1.0000
          x2   0.3370   0.6919   0.7322   0.5077   1.0000
          x3   0.2046   0.4200   0.7824   0.4281   0.2786   1.0000

Principal Components Analysis (continued)

PCA is unsupervised, so it seems unrelated to y, but
- Elements of Statistical Learning says it does well in practice
- PCA has the smallest variance of any estimator that estimates the model $y = X\beta + u$ with i.i.d. errors subject to the constraint $C\beta = c$ where dim[C] ≤ dim[X]
- PCA discards the p-M smallest-eigenvalue components whereas ridge does not, though ridge shrinks towards zero the most for the smallest-eigenvalue components (ESL p.79).

Partial least squares is supervised
- partial least squares turns out to be similar to PCA
- especially if R² is low.

5. High-Dimensional Models

High-dimensional simply means that p is large relative to n
- in particular p > n
- n could be large or small.

Problems with p > n:
- Cp, AIC, BIC and adjusted R² cannot be used
- due to multicollinearity we cannot identify the best model, just one of many good models
- we cannot use regular statistical inference on the training set.

Solutions
- forward stepwise, ridge, LASSO and PCA are useful in training
- evaluate models using cross-validation or independent test data
  - using e.g. MSE or R².

6. Nonparametric and Semiparametric Regression
6.1 Nonparametric regression

Nonparametric regression is the most flexible approach
- but it is not practical for high p due to the curse of dimensionality.

Consider explaining y with a scalar regressor x
- we want $\hat{f}(x_0)$ for a range of values $x_0$.

With many observations with $x_i = x_0$ we would just use the average of y for those observations
$$\hat{f}(x_0) = \frac{1}{n_0}\sum_{i: x_i = x_0} y_i = \frac{\sum_{i=1}^n 1[x_i = x_0]\, y_i}{\sum_{i=1}^n 1[x_i = x_0]}.$$

Rewrite this as
$$\hat{f}(x_0) = \sum_{i=1}^n w(x_i, x_0)\, y_i, \text{ where } w(x_i, x_0) = \frac{1[x_i = x_0]}{\sum_{j=1}^n 1[x_j = x_0]}.$$

Kernel-weighted local regression

In practice there are not many observations with $x_i = x_0$.

Nonparametric regression methods borrow from nearby observations
- k-nearest neighbors
  - average $y_i$ for the k observations with $x_i$ closest to $x_0$
- kernel-weighted local regression
  - use a weighted average of $y_i$ with weights declining as $|x_i - x_0|$ increases.

The original kernel regression estimate is
$$\hat{f}(x_0) = \sum_{i=1}^n w(x_i, x_0, \lambda)\, y_i,$$
- where $w(x_i, x_0, \lambda) = w\left(\frac{x_i - x_0}{\lambda}\right)$ are kernel weights
- and λ is a bandwidth parameter to be determined.
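
A minimal sketch of this estimator computed by hand at a single point, assuming an outcome y and scalar regressor z as in the local linear example two slides below (the Epanechnikov kernel, the bandwidth 0.5 and the evaluation point 0 are my own illustrative choices):

* hypothetical manual kernel (Nadaraya-Watson) estimate of f(x0) at x0 = 0
scalar lam = 0.5                                 // bandwidth
gen double u0  = (z - 0)/lam
gen double kw  = 0.75*(1 - u0^2)*(abs(u0) < 1)   // Epanechnikov kernel weights
gen double kwy = kw*y
quietly sum kwy
scalar numer = r(sum)
quietly sum kw
display "fhat(0) = " numer/r(sum)                // kernel-weighted average of y
drop u0 kw kwy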

Local constant and local linear regression

The local constant estimator is $\hat{f}(x_0) = \hat{\alpha}_0$ where $\alpha_0$ minimizes
$$\sum_{i=1}^n w(x_i, x_0, \lambda)(y_i - \alpha_0)^2$$
- this yields $\hat{\alpha}_0 = \sum_{i=1}^n w(x_i, x_0, \lambda)\, y_i$ (with the weights normalized to sum to one).

The local linear estimator is $\hat{f}(x_0) = \hat{\alpha}_0$ where $\alpha_0$ and $\beta_0$ minimize
$$\sum_{i=1}^n w(x_i, x_0, \lambda)\{y_i - \alpha_0 - \beta_0(x_i - x_0)\}^2.$$

This can be generalized to local maximum likelihood, which maximizes over $\theta_0$
$$\sum_{i=1}^n w(x_i, x_0, \lambda) \ln f(y_i, x_i, \theta_0).$$

Local linear example

lpoly y z, degree(1)

[Figure: local polynomial smooth of y on z; kernel = epanechnikov, degree = 1, bandwidth = .49]

6.2 Curse of Dimensionality

Nonparametric methods do not extend well to multiple regressors.

Consider p-dimensional x broken into bins
- for p = 1 we might average y in each of 10 bins of x
- for p = 2 we may need to average over 10² bins of (x1, x2)
- and so on.

On average there may be few to no points with high-dimensional $x_i$ close to $x_0$
- this is called the curse of dimensionality.

Formally, for local constant kernel regression with bandwidth λ
- the bias is $O(\lambda^2)$ and the variance is $O(1/(n\lambda^p))$
- the optimal bandwidth is $O(n^{-1/(p+4)})$
  - this gives asymptotic bias, so standard confidence intervals are not properly centered
- the convergence rate is then $n^{-2/(p+4)} << n^{-0.5}$.

6.3 Semiparametric Models

Semiparametric models provide some structure to reduce the nonparametric component from K dimensions to fewer dimensions.
- Econometricians focus on partially linear models and on single-index models.
- Statisticians use generalized additive models and projection pursuit regression.

Machine learning methods can outperform nonparametric and semiparametric methods
- so wherever econometricians use nonparametric and semiparametric regression in higher-dimensional models it may be useful to use ML methods.

Partially linear model

A partially linear model specifies
$$y_i = f(x_i, z_i) + u_i = x_i'\beta + g(z_i) + u_i$$
- in the simplest case z (or x) is scalar, but they could be vectors
- the nonparametric component has the dimension of z.

The differencing estimator of Robinson (1988) provides a root-n consistent, asymptotically normal $\hat{\beta}$ as follows
- $E[y|z] = E[x|z]'\beta + g(z)$ as $E[u|z] = 0$ given $E[u|x,z] = 0$
- $y - E[y|z] = (x - E[x|z])'\beta + u$, subtracting
- so estimate by OLS: $y - \hat{m}_y = (x - \hat{m}_x)'\beta + \text{error}$.

Robinson proposed nonparametric kernel regression of y on z for $\hat{m}_y$ and of x on z for $\hat{m}_x$
- recent econometrics articles instead use a machine learner such as LASSO
- in general we need $\hat{m}$ to converge at a rate of at least $n^{1/4}$.
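
A minimal sketch of the differencing estimator using kernel regression for the two conditional means, assuming a dataset with outcome y, a scalar regressor of interest x1 and a scalar control z (the variable names, the local-constant lpoly choice and the default bandwidths are my own; this is an illustration, not the code behind the slides):

* hypothetical Robinson (1988) differencing estimator for y = x1*b + g(z) + u
lpoly y z,  at(z) generate(my) degree(0) nograph   // nonparametric E[y|z]
lpoly x1 z, at(z) generate(mx) degree(0) nograph   // nonparametric E[x1|z]
gen double ytil = y  - my
gen double xtil = x1 - mx
regress ytil xtil, noconstant       // OLS on the differenced data estimates b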

Single-index model

Single-index models specify
$$f(x_i) = g(x_i'\beta)$$
- with $g(\cdot)$ determined nonparametrically
- this reduces the nonparametrics to one dimension.

We can obtain a $\hat{\beta}$ that is root-n consistent and asymptotically normal
- provided the nonparametric $\hat{g}(\cdot)$ converges at rate $n^{-1/4}$.

The recent economics ML literature has instead focused on the partially linear model.

Generalized additive models and projection pursuit

Generalized additive models specify f(x) as a linear combination of scalar functions
$$f(x_i) = \alpha + \sum_{j=1}^p f_j(x_{ij})$$
- where $x_j$ is the j-th regressor and $f_j(\cdot)$ is (usually) determined by the data
- the advantage is interpretability (due to each regressor appearing additively)
- it can be made more nonlinear by including interactions such as $x_{i1} \times x_{i2}$ as a separate regressor.

Projection pursuit regression is additive in linear combinations of the x's
$$f(x_i) = \sum_{m=1}^M g_m(x_i'\omega_m)$$
- additive in the derived features $x'\omega_m$ rather than in the $x_j$'s
- the $g_m(\cdot)$ functions are unspecified and nonparametrically estimated
- this is a multi-index model, with the case M = 1 being a single-index model.

How can machine learning methods do better?

In theory there is scope for improving nonparametric methods.

k-nearest neighbors usually uses a fixed number of neighbors
- but it may be better to vary the number of neighbors with data sparsity.

Kernel-weighted local regression methods usually use a fixed bandwidth
- but it may be better to vary the bandwidth with data sparsity.

There may be an advantage to basing neighbors in part on the relationship with y.

7. Flexible Regression

Basis function models
- global polynomial regression
- splines: step functions, regression splines, smoothing splines
- wavelets
- the polynomial is global while the others break the range of x into pieces.

Other methods
- neural networks.

7.1 Basis Functions

Also called series expansions and sieves.

General approach (scalar x for simplicity)

$$y_i = \beta_0 + \beta_1 b_1(x_i) + \cdots + \beta_K b_K(x_i) + \varepsilon_i$$
- where $b_1(\cdot), ..., b_K(\cdot)$ are basis functions that are fixed and known.

Global polynomial regression sets $b_j(x_i) = x_i^j$
- typically K is 3 or 4
- it fits globally and can overfit at the boundaries.

Step functions: separately fit y in each interval $x \in (c_j, c_{j+1})$
- could be piecewise constant or piecewise polynomial.

Splines smooth the fit so that it is not discontinuous at the cut points.

Wavelets are also basis functions, richer than Fourier series.

7.2 Regression Splines

The cubic regression spline is the standard.

Piecewise cubic model with K knots
- require f(x), f'(x) and f''(x) to be continuous at the K knots.

Then we can do OLS with
$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - c_1)^3_+ + \cdots + \beta_{3+K}(x - c_K)^3_+$$
- where $h_+(z) = z$ if $z > 0$ and $h_+(z) = 0$ otherwise.

This is the lowest-degree regression spline for which the graph of $\hat{f}(x)$ on x seems smooth and continuous to the naked eye.

There is no real benefit to a higher-order spline.

Regression splines overfit at the boundaries.
- A natural or restricted cubic spline is an adaptation that restricts the relationship to be linear past the lower and upper boundaries of the data.
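
A minimal sketch of building the truncated-power basis above by hand for two illustrative knots and then running OLS (the knot locations -1 and 1 and the variable names are my own; the mkspline command on the next slide automates this, with the natural-spline restriction added):

* hypothetical manual cubic regression spline basis with knots c1 = -1, c2 = 1
gen double z2  = z^2
gen double z3  = z^3
gen double zk1 = cond(z > -1, (z + 1)^3, 0)   // (z - c1)^3_+
gen double zk2 = cond(z >  1, (z - 1)^3, 0)   // (z - c2)^3_+
regress y z z2 z3 zk1 zk2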

Spline Example

Natural or restricted cubic spline with five knots at the 5, 27.5, 50, 72.5 and 95 percentiles
- mkspline zspline = z, cubic nknots(5) displayknots
- regress y zspline*

[Figure: natural cubic spline fit, y = a + f(z) + u, with f(z) plotted against z]

7.3 Wavelets

Wavelets are used especially for signal processing and extraction
- they are richer than a Fourier series basis
- they can handle both smooth sections and bumpy sections of a series
- they are not used in cross-section econometrics but may be useful for some time series.

Start with a mother or father wavelet function ψ(x)
- an example is the Haar function
$$\psi(x) = \begin{cases} 1 & 0 \le x < \tfrac{1}{2} \\ -1 & \tfrac{1}{2} < x < 1 \\ 0 & \text{otherwise} \end{cases}$$

Then both translate by b and scale by a to give the basis functions $\psi_{ab}(x) = |a|^{-1/2}\psi\left(\frac{x-b}{a}\right)$.

7.4 Neural Networks

A neural network is a richer model for $f(x_i)$ than projection pursuit
- but unlike projection pursuit all the functions are specified
- only the parameters need to be estimated.

A neural network involves a series of nested logit regressions.

A single-hidden-layer neural network explaining y by x has
- y depending on the z's (a hidden layer)
- the z's depending on the x's.

A neural network with two hidden layers explaining y by x has
- y depending on the w's (a hidden layer)
- the w's depending on the z's (a hidden layer)
- the z's depending on the x's.

Two-layer neural network

y depends on M z's and the z's depend on the p x's:
- $f(x) = \beta_0 + z'\beta$
- $z_m = \frac{1}{1 + \exp[-(\alpha_{0m} + x'\alpha_m)]}$, m = 1, ..., M.

More generally we may use
- $f(x) = h(T)$, usually $h(T) = T$
- $T = \beta_0 + z'\beta$
- $z_m = g(\alpha_{0m} + x'\alpha_m)$, usually $g(v) = 1/(1 + e^{-v})$.

This yields the nonlinear model
$$f(x_i) = \beta_0 + \sum_{m=1}^M \beta_m \times \frac{1}{1 + \exp[-(\alpha_{0m} + x_i'\alpha_m)]}$$
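
A minimal sketch of this forward calculation for M = 2 hidden units and the three regressors x1-x3 (all parameter values below are made-up illustrations, not estimates):

* hypothetical single-hidden-layer prediction with M = 2 units
* invlogit(v) = 1/(1 + exp(-v)) is Stata's logistic function
gen double z1  = invlogit(0.1 + 0.5*x1 - 0.3*x2 + 0.2*x3)   // hidden unit 1
gen double z2  = invlogit(-0.2 + 0.1*x1 + 0.4*x2 - 0.6*x3)  // hidden unit 2
gen double fnn = 0.5 + 1.2*z1 - 0.8*z2                      // f(x) = b0 + z'b

In practice the α's and β's are estimated by penalized least squares, as discussed on the next slide.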

Neural Networks (continued)

Neural nets are good for prediction
- especially in speech recognition (Google Translate), image recognition, ...
- but they are very difficult (impossible) to interpret.

They require a lot of fine tuning - they are not off-the-shelf
- we need to determine the number of hidden layers and the number M of hidden units within each layer, and estimate the α's, β's, ....

Minimize the sum of squared residuals, but a penalty on the α's is needed to avoid overfitting
- since a penalty is introduced, standardize the x's to (0,1)
- it is best to have too many hidden units and then avoid overfitting using the penalty
- initially back propagation was used
- now gradient methods are used with different starting values, averaging the results or using bagging.

Deep learning uses nonlinear transformations such as neural networks
- deep nets are an improvement on the original neural networks.

Neural Networks Example

This example uses the user-written Stata command brain (Doherr).

. * Example from help file for user-written brain command
. clear
. set obs 200
number of observations (_N) was 0, now 200
. gen x = 4*_pi/200 *_n
. gen y = sin(x)
. brain define, input(x) output(y) hidden(20)
Defined matrices:
    input[4,1]   output[4,1]  neuron[1,22]  layer[1,3]   brain[1,61]
. quietly brain train, iter(500) eta(2)
. brain think ybrain
. sort x
. twoway (scatter y x) (lfit y x) (line ybrain x)

Neural Networks Example (continued)

We obtain the figure below.

[Figure: scatter of y against x, together with the OLS fitted values and the ybrain prediction.]


This figure from ESL is for classification with K categories.

8. Regression Trees and Random Forests

8. Regression Trees and Random Forests: Overview

Regression trees sequentially split regressors x into regions that best predict y
- e.g., the first split is income < or > $12,000
  the second split is on gender if income > $12,000
  the third split is income < or > $30,000 (if female and income > $12,000).

Trees do not predict well
- due to high variance
- e.g. split the data in two, then can get quite different trees
- e.g. the first split determines future splits (a greedy method).

Better methods are then given
- bagging (bootstrap averaging) computes regression trees for different samples obtained by bootstrap and averages the predictions
- random forests use only a subset of the predictors in each bootstrap sample
- boosting grows trees based on residuals from the previous stage
- bagging and boosting are general methods (not just for trees).

8. Regression Trees and Random Forests 8.1 Regression Trees

8.1 Regression Trees

Regression trees
- sequentially split the x's into rectangular regions in a way that reduces RSS
- then ŷ_i is the average of the y's in the region that x_i falls in
- with J blocks, RSS = ∑_{j=1}^{J} ∑_{i∈R_j} (y_i − ȳ_{R_j})².

Need to determine both the regressor j to split and the split point s (a sketch of this search follows below).
- For any regressor j and split point s, define the pair of half-planes
  R1(j,s) = {X | X_j < s} and R2(j,s) = {X | X_j ≥ s}.
- Find the values of j and s that minimize

    ∑_{i: x_i∈R1(j,s)} (y_i − ȳ_{R1})² + ∑_{i: x_i∈R2(j,s)} (y_i − ȳ_{R2})²

  where ȳ_{R1} is the mean of y in region R1 (and similarly ȳ_{R2} for R2).
- Once this first split is found, split both R1 and R2 and repeat.
- Each split is the one that reduces RSS the most.
- Stop when e.g. fewer than five observations remain in each region.
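A minimal NumPy sketch of the greedy search for the first split (j, s); the function name best_split and the simulated data are illustrative only:

    import numpy as np

    def best_split(X, y):
        """Greedy search over (regressor j, split point s) minimizing RSS
        over the two half-planes R1 = {X_j < s}, R2 = {X_j >= s}."""
        n, p = X.shape
        best = (None, None, np.inf)          # (j, s, RSS)
        for j in range(p):
            for s in np.unique(X[:, j]):
                left, right = y[X[:, j] < s], y[X[:, j] >= s]
                if len(left) == 0 or len(right) == 0:
                    continue
                rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
                if rss < best[2]:
                    best = (j, s, rss)
        return best

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = 2.0 * (X[:, 0] > 0.5) + 0.1 * rng.normal(size=100)   # data generated with a split on regressor 0 at 0.5
    print(best_split(X, y))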


Tree example from ISL page 308

(1) split X1 in two; (2) split the lowest X1 values on the basis of X2 into R1 and R2; (3) split the highest X1 values into two regions (R3 and R4/R5); (4) split the highest X1 values on the basis of X2 into R4 and R5.


Tree example from ISL (continued)

The left figure gives the tree.

The right figure shows the predicted values of y.


Regression tree (continued)

The model is of the form f(x) = ∑_{j=1}^{J} ȳ_j × 1[x ∈ R_j].

The approach is a top-down greedy approach
- top-down as we start at the top of the tree
- greedy as at each step the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

This leads to overfitting, so prune (the criterion is sketched below)
- use cost complexity pruning (or weakest link pruning)
- this penalizes for having too many terminal nodes
- see ISL equation (8.4).
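For reference, the cost-complexity criterion that ISL equation (8.4) refers to is, roughly, to minimize over subtrees T with |T| terminal nodes

  ∑_{m=1}^{|T|} ∑_{i: x_i ∈ R_m} (y_i − ŷ_{R_m})² + α|T|,

where α ≥ 0 is a tuning parameter chosen by cross validation; a larger α prunes to a smaller tree.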


Tree as alternative to k-NN or kernel regression

Figure from Athey and Imbens (2019), "Machine Learning Methods Economists Should Know About"
- the axes are x1 and x2
- note that the tree uses the explanation of y in determining neighbors
- the tree may not do so well near the boundaries of a region
  - random forests form many trees, so an observation is not always near a boundary.


Improvements to regression trees

Regression trees are easy to understand if there are few regressors.

But they do not predict as well as the methods given so far
- due to high variance (e.g. split the data in two, then can get quite different trees).

Better methods are given next
- bagging
  - bootstrap aggregating averages regression trees over many samples
- random forests
  - averages regression trees over many sub-samples
- boosting
  - trees build on preceding trees.

8. Regression Trees and Random Forests 8.2 Bagging

8.2 Bagging (Bootstrap Aggregating)

Bagging is a general method for improving prediction that works especially well for regression trees.

The idea is that averaging reduces variance, so average regression trees over many samples (a minimal sketch of the procedure follows below)
- the different samples are obtained by bootstrap resampling with replacement (so they are not completely independent of each other)
- for each sample b obtain a large tree and prediction f_b(x)
- average all these predictions: f_bag(x) = (1/B) ∑_{b=1}^{B} f_b(x).

Get the test sample error by using the out-of-bag (OOB) observations not in the bootstrap sample
- Pr[i-th obs not in resample] = (1 − 1/n)^n → e^{−1} = 0.368 ≈ 1/3
- this replaces cross validation.

Interpretation of the trees is now difficult, so
- record the total amount that RSS is decreased due to splits over a given predictor, averaged over all B trees
- a large value indicates an important predictor.
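A minimal Python sketch of bagging (bootstrap resamples, one large tree per resample, average the predictions); scikit-learn's DecisionTreeRegressor stands in for the tree grower and the data are simulated, so treat this as illustrative rather than the talk's Stata workflow:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bagged_trees(X, y, B=200, seed=0):
        """Fit B large trees on bootstrap resamples; predict with the average f_bag(x)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        trees = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)          # resample with replacement
            tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
            trees.append(tree.fit(X[idx], y[idx]))
        return lambda Xnew: np.mean([t.predict(Xnew) for t in trees], axis=0)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 5))
    y = X[:, 0] ** 2 + rng.normal(size=300)
    predict_bag = bagged_trees(X, y)
    yhat = predict_bag(X)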

8. Regression Trees and Random Forests 8.3 Random Forests

8.3 Random Forests

The B bagging estimates are correlated
- e.g. if a regressor is important it will appear near the top of the tree in each bootstrap sample
- so the trees look similar from one resample to the next.

Random forests get bootstrap resamples (like bagging)
- but within each bootstrap sample use only a random sample of m < p predictors in deciding each split
- usually m ≈ √p
- this reduces the correlation across bootstrap resamples (see the sketch below).

Simple bagging is a random forest with m = p.
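A hedged scikit-learn sketch of a random forest with m ≈ √p predictors per split and the OOB error in place of cross validation; the simulated data and tuning values are illustrative assumptions:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 5))
    y = X[:, 0] ** 2 + rng.normal(size=300)

    # m = sqrt(p) predictors tried at each split; 500 bootstrap trees; OOB error replaces CV
    rf = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                               min_samples_leaf=5, oob_score=True, random_state=0)
    rf.fit(X, y)
    print(rf.oob_score_)              # out-of-bag R-squared
    print(rf.feature_importances_)    # RSS-reduction-based variable importance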


Random Forests (continued)

Random forests are related to kernel and k-nearest neighbors methods
- as they use a weighted average of nearby observations
- but with a data-driven way of determining which nearby observations get weight
- see Lin and Jeon (JASA, 2006).

Susan Athey and coauthors are big on random forests.

8. Regression Trees and Random Forests 8.4 Boosting

8.4 Boosting

Boosting is also a general method for improving prediction.

Regression trees use a greedy algorithm.

Boosting uses a slower algorithm to generate a sequence of trees
- each tree is grown using information from previously grown trees
- and is fit on a modified version of the original data set
- boosting does not involve bootstrap sampling.

Specifically (with λ a shrinkage penalty parameter; a minimal sketch follows below)
- given the current model b, fit a decision tree f_b to model b's residuals (rather than the outcome y)
- then update f(x) = previous f(x) + λ f_b(x)
- then update the residuals r_i = previous r_i − λ f_b(x_i)
- the boosted model is f(x) = ∑_{b=1}^{B} λ f_b(x).

The Stata add-on boost includes the file boost64.dll, which needs to be manually copied into c:\ado\plus.
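A minimal Python sketch of the boosting update above (fit a small tree to the current residuals, shrink by λ, repeat); the names, simulated data and tuning values are illustrative, and this is not the Stata boost add-on:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boost_trees(X, y, B=500, lam=0.01, depth=2, seed=0):
        """Boosting for squared error loss: repeatedly fit a small tree to the
        current residuals and add lam times its prediction to the model."""
        r = y.copy()                        # start with f(x) = 0, residuals r = y
        trees = []
        for _ in range(B):
            tree = DecisionTreeRegressor(max_depth=depth, random_state=seed)
            tree.fit(X, r)
            r = r - lam * tree.predict(X)   # update residuals
            trees.append(tree)
        return lambda Xnew: lam * np.sum([t.predict(Xnew) for t in trees], axis=0)

    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 5))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
    yhat = boost_trees(X, y)(X)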


Comparison of in-sample predictions

. * Compare various predictions in sample

. quietly regress ltotexp $zlist

. predict yh_ols
(option xb assumed; fitted values)

. summarize ltotexp yh*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     ltotexp |     2,955    8.059866    1.367592   1.098612   11.74094
       yh_rf |     2,955    8.060393    .7232039    5.29125   10.39143
  yh_rf_half |     2,955    8.053512    .7061742   5.540289   9.797389
    yh_boost |     2,955    7.667966    .4974654   5.045572   8.571422
      yh_ols |     2,955    8.059866     .654323   6.866516   10.53811

. correlate ltotexp yh*
(obs=2,955)

             |  ltotexp    yh_rf yh_rf_~f yh_boost   yh_ols
-------------+-----------------------------------------------
     ltotexp |   1.0000
       yh_rf |   0.7377   1.0000
  yh_rf_half |   0.6178   0.9212   1.0000
    yh_boost |   0.5423   0.8769   0.8381   1.0000
      yh_ols |   0.4784   0.8615   0.8580   0.8666   1.0000

9. Classification

9. Classification: Overview

The y's are now categorical
- example: binary if two categories.

Interest lies in predicting y using ŷ (classification)
- whereas economists usually want an estimate of Pr[y = j | x].

Use a (0,1) loss function rather than MSE or ln L
- 0 if correct classification
- 1 if misclassified.

Many machine learning applications are in settings where one can classify well
- e.g. reading car license plates
- unlike many economics applications.


Classification: Overview (continued)

Regression methods predict probabilities
- logistic regression, multinomial regression, k-nearest neighbors
- assign to the class with the highest predicted probability (Bayes classifier)
  - in the binary case ŷ = 1 if p̂ ≥ 0.5 and ŷ = 0 if p̂ < 0.5.

Discriminant analysis additionally assumes a normal distribution for the x's
- use Bayes theorem to get Pr[Y = k | X = x].

Support vector classifiers and support vector machines
- directly classify (no probabilities)
- are more nonlinear so may classify better
- use separating hyperplanes of X and extensions.

Regression trees, bagging, random forests and boosting can also be used for categorical data.

9. Classification 9.1 Loss Function

9.1 A Different Loss Function: Error Rate

Instead of MSE we use the error rate
- the proportion of observations that are misclassified

    Error rate = (1/n) ∑_{i=1}^{n} 1[y_i ≠ ŷ_i],

  - where for K categories y_i = 0, ..., K−1 and ŷ_i = 0, ..., K−1
  - and the indicator 1[A] = 1 if event A happens and = 0 otherwise.

The test error rate is for the n_0 observations in the test sample

    Ave(1[y_0 ≠ ŷ_0]) = (1/n_0) ∑_{i=1}^{n_0} 1[y_{0i} ≠ ŷ_{0i}].

Cross validation uses the number of misclassified observations, e.g. LOOCV is

    CV_(n) = (1/n) ∑_{i=1}^{n} Err_i = (1/n) ∑_{i=1}^{n} 1[y_i ≠ ŷ_{(−i)}].

A minimal computation of the error rate is sketched below.
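As a trivial illustration (Python, not part of the talk's Stata workflow), the error rate is just the sample mean of the misclassification indicator:

    import numpy as np

    def error_rate(y, yhat):
        """Proportion misclassified: (1/n) * sum of 1[y_i != yhat_i]."""
        y, yhat = np.asarray(y), np.asarray(yhat)
        return np.mean(y != yhat)

    print(error_rate([0, 1, 1, 0, 2], [0, 1, 0, 0, 2]))   # one of five misclassified: 0.2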

9. Classification 9.2 Logit

9.2 Logit

Directly model p(x) = Pr[y = 1 | x].

Logistic (logit) regression for the binary case obtains the MLE for

    ln[ p(x) / (1 − p(x)) ] = x'β.

The logit model is a linear (in x) classifier (a sketch follows below)
- ŷ = 1 if p̂(x) > 0.5
- i.e. if x'β̂ > 0.
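A hedged scikit-learn sketch of the logit classifier; a very large C approximates unpenalized MLE, and the simulated data are illustrative:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 3))
    y = (X @ np.array([1.0, -1.0, 0.5]) + rng.logistic(size=500) > 0).astype(int)

    # large C means essentially no penalty, approximating plain MLE logit
    logit = LogisticRegression(C=1e6).fit(X, y)
    phat = logit.predict_proba(X)[:, 1]
    yhat = (phat >= 0.5).astype(int)        # classify at phat = 0.5
    print(np.mean(y != yhat))               # in-sample error rate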

9. Classification 9.3 Discriminant Analysis

9.3 Linear and Quadratic Discriminant Analysis

Developed for classification problems such as whether a skull is Neanderthal or Homo Sapiens, given various measures of the skull.

Discriminant analysis specifies a joint distribution for (Y, X).

Linear discriminant analysis with K categories
- assumes X|Y = k is N(µ_k, Σ) with density f_k(x) = Pr[X = x | Y = k]
- and lets π_k = Pr[Y = k].

Upon simplification, assignment to the class k with the largest Pr[Y = k | X = x] is equivalent to choosing the class with the largest discriminant function

    δ_k(x) = x'Σ^{−1}µ_k − (1/2) µ_k'Σ^{−1}µ_k + ln π_k

- in practice use µ̂_k = x̄_k, Σ̂ = V̂ar[x_k] and π̂_k = (1/N) ∑_{i=1}^{N} 1[y_i = k].

It is called linear discriminant analysis as δ_k(x) is linear in x (a computational sketch follows below).

Quadratic discriminant analysis instead assumes X|Y = k is N(µ_k, Σ_k).
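A minimal NumPy sketch of the linear discriminant rule above, using a pooled within-class covariance estimate; the function name and the simulated data are illustrative:

    import numpy as np

    def lda_discriminants(X, y, Xnew):
        """delta_k(x) = x'S^{-1}m_k - 0.5 m_k'S^{-1}m_k + ln pi_k; assign to argmax_k."""
        classes = np.unique(y)
        n, p = X.shape
        means = np.array([X[y == k].mean(axis=0) for k in classes])
        pis = np.array([np.mean(y == k) for k in classes])
        # pooled within-class covariance estimate
        S = sum(np.cov(X[y == k], rowvar=False) * (np.sum(y == k) - 1) for k in classes) / (n - len(classes))
        Sinv = np.linalg.inv(S)
        delta = Xnew @ Sinv @ means.T - 0.5 * np.sum(means @ Sinv * means, axis=1) + np.log(pis)
        return classes[np.argmax(delta, axis=1)]

    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.repeat([0, 1], 50)
    print(lda_discriminants(X, y, X)[:5])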


ISL Figure 4.9: Linear and Quadratic Boundaries

LDA uses a linear boundary to classify and QDA a quadratic boundary.

9. Classification 9.4 Support Vector Machines

9.4 Support Vector Classifier

Builds on the LDA idea of a linear boundary to classify when K = 2.

Maximal margin classifier
- classify using a separating hyperplane (a linear combination of X)
- if perfect classification is possible then there are an infinite number of such hyperplanes
- so use the separating hyperplane that is furthest from the training observations
- this distance is called the maximal margin.

Support vector classifier (a sketch follows below)
- generalizes the maximal margin classifier to the nonseparable case
- this adds slack variables to allow some y's to be on the wrong side of the margin
- max over β, ε of M (the margin - the distance from the separator to the training X's) subject to β'β = 1, y_i(β_0 + x_i'β) ≥ M(1 − ε_i), ε_i ≥ 0 and ∑_{i=1}^{n} ε_i ≤ C.
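A hedged scikit-learn sketch of a linear support vector classifier on simulated data; note that sklearn's C penalizes margin violations, which is a different parameterization from the budget C in the constraint above:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
    y = np.repeat([0, 1], 100)

    # linear support vector classifier
    svc = SVC(kernel="linear", C=1.0).fit(X, y)
    yhat = svc.predict(X)
    print(np.mean(y != yhat))     # in-sample error rate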


Support Vector Machines

The support vector classifier has a linear boundary
- f(x_0) = β_0 + ∑_{i=1}^{n} α_i x_0'x_i, where x_0'x_i = ∑_{j=1}^{p} x_{0j} x_{ij}.

The support vector machine has nonlinear boundaries (a radial-kernel sketch follows below)
- f(x_0) = β_0 + ∑_{i=1}^{n} α_i K(x_0, x_i) where K(·) is a kernel
- polynomial kernel: K(x_0, x_i) = (1 + ∑_{j=1}^{p} x_{0j} x_{ij})^d
- radial kernel: K(x_0, x_i) = exp(−γ ∑_{j=1}^{p} (x_{0j} − x_{ij})²).

Can extend to K > 2 classes (see ISL ch. 9.4)
- one-versus-one or all-pairs approach
- one-versus-all approach.
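A short follow-on sketch with the radial kernel, again with illustrative tuning values (gamma, C) and simulated data in which no linear boundary can separate the classes:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(6)
    X = rng.normal(size=(300, 2))
    y = (np.sum(X ** 2, axis=1) > 1.5).astype(int)   # classes separated by a circle

    # radial (RBF) kernel K(x0, xi) = exp(-gamma * sum_j (x0j - xij)^2)
    svm_rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
    print(1 - svm_rbf.score(X, y))    # in-sample error rate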


ISL Figure 9.9: Support Vector Machine

In this example a linear or quadratic classifier won't work, whereas SVM does.


Comparison of model predictions

The following compares the various category predictions.

SVM does best, but these are in-sample predictions
- especially for SVM we should have separate training and test samples.

. * Compare various in-sample predictions
. correlate suppins yh_logit yh_knn yh_lda yh_qda yh_svm
(obs=3,064)

             |  suppins yh_logit   yh_knn   yh_lda   yh_qda   yh_svm
-------------+-------------------------------------------------------
     suppins |   1.0000
    yh_logit |   0.2505   1.0000
      yh_knn |   0.3604   0.3575   1.0000
      yh_lda |   0.2395   0.6955   0.3776   1.0000
      yh_qda |   0.2294   0.6926   0.2762   0.5850   1.0000
      yh_svm |   0.5344   0.3966   0.6011   0.3941   0.3206   1.0000

10. Unsupervised Learning

10. Unsupervised Learning

A challenging area: no y, only x.

An example is determining several types of individual based on responses to many psychological questions.

Principal components analysis
- already presented earlier.

Clustering methods
- k-means clustering
- hierarchical clustering.

10. Unsupervised Learning 10.1 Cluster Analysis

10.1 Cluster Analysis: k-Means Clustering

The goal is to find homogeneous subgroups among the X.

K-means splits the observations into K distinct clusters such that the within-cluster variation is minimized.

Let W(C_k) be a measure of within-cluster variation
- minimize over C_1, ..., C_K the total ∑_{k=1}^{K} W(C_k)
- with Euclidean distance, W(C_k) = (1/n_k) ∑_{i,i'∈C_k} ∑_{j=1}^{p} (x_{ij} − x_{i'j})².

A global optimum would require examining on the order of K^n partitions.

Instead use algorithm 10.1 (ISL p.388), which finds a local optimum (a sketch follows below)
- run the algorithm multiple times with different starting seeds
- choose the optimum with the smallest ∑_{k=1}^{K} W(C_k).
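A hedged scikit-learn sketch of k-means with multiple random starts (n_init), keeping the solution with the smallest within-cluster sum of squares; the simulated data and the choice K = 3 are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(7)
    X = np.vstack([rng.normal(0, 1, (100, 2)),
                   rng.normal(4, 1, (100, 2)),
                   rng.normal((0, 4), 1, (100, 2))])

    # n_init restarts the algorithm from different seeds and keeps the best local optimum
    km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
    print(km.inertia_)             # total within-cluster sum of squares at the chosen optimum
    print(np.bincount(km.labels_)) # cluster sizes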


ISL Figure 10.5

Data are (x1, x2) with K = 2, 3 and 4 clusters identified.


Hierarchical Clustering

Do not specify K .

Instead begin with n clusters (leaves) and combine clusters into branches up towards the trunk
- represented by a dendrogram
- eyeball the dendrogram to decide the number of clusters.

Need a dissimilarity measure between clusters (a complete-linkage sketch follows below)
- four types of linkage: complete, average, single and centroid.

For any clustering method
- unsupervised learning is a difficult problem
- results can change a lot with small changes in method
- clustering on subsets of the data can provide a sense of robustness.
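A minimal SciPy sketch of agglomerative (hierarchical) clustering with complete linkage, cutting the dendrogram into two clusters; the simulated data are illustrative:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(8)
    X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])

    # agglomerative clustering with complete linkage; cut the dendrogram into 2 clusters
    Z = linkage(X, method="complete")
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(np.bincount(labels))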

11. Conclusions

11. Conclusions

Guard against overfitting
- use K-fold cross validation or penalty measures such as AIC.

Biased estimators can be better predictors
- shrinkage towards zero, such as ridge and LASSO.

For flexible models popular choices are
- neural nets
- random forests.

Though which method is best varies with the application
- and best are ensemble forecasts that combine different methods.

Machine learning methods can outperform nonparametric and semiparametric methods
- so wherever econometricians use nonparametric and semiparametric regression in higher-dimensional models it may be useful to use ML methods
- though the underlying theory still relies on assumptions such as sparsity.

12. References

12. References

Undergraduate / Masters level book
- ISL: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2013), An Introduction to Statistical Learning: with Applications in R, Springer.
  - free legal pdf at http://www-bcf.usc.edu/~gareth/ISL/
  - $25 hardcopy via http://www.springer.com/gp/products/books/mycopy

Masters / PhD level book
- ESL: Trevor Hastie, Robert Tibshirani and Jerome Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer.
  - free legal pdf at http://statweb.stanford.edu/~tibs/ElemStatLearn/index.html
  - $25 hardcopy via http://www.springer.com/gp/products/books/mycopy


References (continued)

A recent book is
- EH: Bradley Efron and Trevor Hastie (2016), Computer Age Statistical Inference: Algorithms, Evidence and Data Science, Cambridge University Press.

Interesting book: Cathy O'Neil (2016), Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy.

My website has some material (including these slides)
- http://cameron.econ.ucdavis.edu/e240f/machinelearning.html
