
Day 6: Nonlinear models and tree-based methods

ME314: Introduction to Data Science and Machine Learning

LSE Methods Summer Programme 2019

6 August 2019


Day 6 Outline

Moving Beyond Linearity
- Polynomial Regression
- Step Functions
- Splines
- Local Regression
- Generalized Additive Models

Tree-based Methods

Bagging

Random Forest


Moving Beyond Linearity


Linearity and social reality

- The social world is almost never linear.

- Often the linearity assumption is good enough.

- When linearity doesn't hold, we can use
  - polynomials,
  - step functions,
  - splines,
  - local regression, and
  - generalized additive models.

- These models offer a lot of flexibility, without losing the ease and interpretability of linear models.


Polynomial regression

$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \cdots + \beta_d x_i^d + \varepsilon_i$

[Figure: Degree-4 polynomial fit of Wage on Age (left) and of Pr(Wage > 250 | Age) (right).]

Details

- Create new variables $X_1 = X$, $X_2 = X^2$, etc., and then treat as multiple linear regression.

- We are not really interested in the coefficients; we are more interested in the fitted function values at any value $x_0$:

  $\hat{f}(x_0) = \hat\beta_0 + \hat\beta_1 x_0 + \hat\beta_2 x_0^2 + \hat\beta_3 x_0^3 + \hat\beta_4 x_0^4.$

- Since $\hat{f}(x_0)$ is a linear function of the $\hat\beta_\ell$, we can get a simple expression for the pointwise variance $\mathrm{Var}[\hat{f}(x_0)]$ at any value $x_0$.

- In the figure we have computed the fit and pointwise standard errors on a grid of values for $x_0$. We show $\hat{f}(x_0) \pm 2 \cdot \mathrm{se}[\hat{f}(x_0)]$.

- We either fix the degree $d$ at some reasonably low value, or else use cross-validation to choose $d$.


Details continued

- Logistic regression follows naturally. For example, in the figure we model

  $\Pr(y_i > 250 \mid x_i) = \dfrac{\exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d)}{1 + \exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d)}.$

- To get confidence intervals, compute upper and lower bounds on the logit scale, and then invert to get onto the probability scale.

- Can do separately on several variables – just stack the variables into one matrix, and separate out the pieces afterwards (see GAMs later).

- Caveat: polynomials have notorious tail behavior – very bad for extrapolation.

- Can fit in R using y ~ poly(x, degree = 3) in the formula (a short sketch follows below).
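To make this concrete, here is a minimal R sketch of both fits, assuming the Wage data from the ISLR package (an assumption; any data frame with age and wage columns would work the same way):

```r
# Degree-4 polynomial fit of wage on age (ISLR's Wage data assumed)
library(ISLR)

fit <- lm(wage ~ poly(age, degree = 4), data = Wage)

# Fitted values and pointwise standard errors on a grid of ages
age_grid <- seq(min(Wage$age), max(Wage$age))
pred <- predict(fit, newdata = data.frame(age = age_grid), se.fit = TRUE)
bands <- cbind(pred$fit - 2 * pred$se.fit, pred$fit + 2 * pred$se.fit)

plot(Wage$age, Wage$wage, col = "darkgrey", xlab = "Age", ylab = "Wage")
lines(age_grid, pred$fit, lwd = 2, col = "blue")
matlines(age_grid, bands, lty = 2, col = "blue")

# Logistic version for Pr(wage > 250 | age): confidence bounds are computed
# on the logit scale and then inverted onto the probability scale
logit_fit <- glm(I(wage > 250) ~ poly(age, degree = 4),
                 family = binomial, data = Wage)
lpred <- predict(logit_fit, newdata = data.frame(age = age_grid), se.fit = TRUE)
prob  <- plogis(lpred$fit)
lower <- plogis(lpred$fit - 2 * lpred$se.fit)
upper <- plogis(lpred$fit + 2 * lpred$se.fit)
```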


Step Functions

Another way of creating transformations of a variable – cut the variable into distinct regions.

$C_1(X) = I(X < 35), \quad C_2(X) = I(35 \le X < 50), \quad \ldots, \quad C_3(X) = I(X \ge 65)$

[Figure: Piecewise constant fit of Wage on Age (left) and of Pr(Wage > 250 | Age) (right).]

Step functions continued

- Easy to work with. Creates a series of dummy variables representing each group.

- Useful way of creating interactions that are easy to interpret. For example, an interaction effect of Year and Age,
  $I(\text{Year} < 2005) \cdot \text{Age}, \quad I(\text{Year} \ge 2005) \cdot \text{Age},$
  would allow for different linear functions in each age category.

- In R: I(year < 2005) or cut(age, c(18, 25, 40, 65, 90)) (a short sketch follows below).

- Choice of cutpoints or knots can be problematic. For creating nonlinearities, smoother alternatives such as splines are available.
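A minimal sketch of the step-function fit, again assuming the ISLR Wage data:

```r
# Step-function (piecewise constant) fit of wage on age via cut()
library(ISLR)

# cut() turns age into a factor with one level per interval;
# lm() then creates the corresponding dummy variables automatically
fit_step <- lm(wage ~ cut(age, breaks = c(18, 25, 40, 65, 90), include.lowest = TRUE),
               data = Wage)
summary(fit_step)

# Piecewise-constant predictions over a grid of ages
age_grid <- seq(20, 80)
pred <- predict(fit_step, newdata = data.frame(age = age_grid))
plot(Wage$age, Wage$wage, col = "darkgrey", xlab = "Age", ylab = "Wage")
lines(age_grid, pred, lwd = 2, col = "darkgreen")
```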


Piecewise polynomials

- Instead of a single polynomial in $X$ over its whole domain, we can use different polynomials in regions defined by knots, e.g. (see figure)

  $y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \varepsilon_i & \text{if } x_i < c; \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \varepsilon_i & \text{if } x_i \ge c. \end{cases}$

- Better to add constraints to the polynomials, e.g. continuity.

- Splines have the “maximum” amount of continuity.

[Figure: Piecewise cubic, continuous piecewise cubic, cubic spline, and linear spline fits of Wage on Age.]

Linear Splines

- A linear spline with knots at $\xi_k$, $k = 1, \ldots, K$, is a piecewise linear polynomial that is continuous at each knot.

- We can represent this model as

  $y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+1} b_{K+1}(x_i) + \varepsilon_i,$

  where the $b_k$ are basis functions:

  $b_1(x_i) = x_i$
  $b_{k+1}(x_i) = (x_i - \xi_k)_+, \quad k = 1, \ldots, K.$

  Here $(\cdot)_+$ means positive part, i.e.

  $(x_i - \xi_k)_+ = \begin{cases} x_i - \xi_k & \text{if } x_i > \xi_k; \\ 0 & \text{otherwise.} \end{cases}$


Cubic splines

- A cubic spline with knots at $\xi_k$, $k = 1, \ldots, K$, is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot.

- We can represent this model with truncated power basis functions:

  $y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+3} b_{K+3}(x_i) + \varepsilon_i,$

  where the $b_k$ are basis functions:

  $b_1(x_i) = x_i; \quad b_2(x_i) = x_i^2; \quad b_3(x_i) = x_i^3;$
  $b_{k+3}(x_i) = (x_i - \xi_k)^3_+, \quad k = 1, \ldots, K,$

  where

  $(x_i - \xi_k)^3_+ = \begin{cases} (x_i - \xi_k)^3 & \text{if } x_i > \xi_k; \\ 0 & \text{otherwise.} \end{cases}$


Natural cubic splines

- A natural cubic spline extrapolates linearly beyond the boundary knots. The extra constraints near the boundaries free up degrees of freedom and typically give more stable fits at the ends of the data range.

[Figure: Natural cubic spline and cubic spline fits of Wage on Age.]

- Fitting splines in R is easy: bs(x, ...) for splines of any degree, and ns(x, ...) for natural cubic splines, both in the splines package (a short sketch follows below).
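For example, a minimal sketch on the Wage data (ISLR and splines packages assumed):

```r
# Regression splines: bs() for a cubic spline with chosen knots,
# ns() for a natural cubic spline with a chosen number of df
library(ISLR)
library(splines)

fit_bs <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)  # cubic spline
fit_ns <- lm(wage ~ ns(age, df = 4), data = Wage)                 # natural cubic spline

age_grid <- seq(min(Wage$age), max(Wage$age))
pred_bs <- predict(fit_bs, newdata = data.frame(age = age_grid), se.fit = TRUE)
pred_ns <- predict(fit_ns, newdata = data.frame(age = age_grid))

plot(Wage$age, Wage$wage, col = "darkgrey", xlab = "Age", ylab = "Wage")
lines(age_grid, pred_bs$fit, col = "blue", lwd = 2)
lines(age_grid, pred_bs$fit + 2 * pred_bs$se.fit, col = "blue", lty = 2)
lines(age_grid, pred_bs$fit - 2 * pred_bs$se.fit, col = "blue", lty = 2)
lines(age_grid, pred_ns, col = "red", lwd = 2)
```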

[Figure: Natural cubic spline fit of Wage on Age (left) and of Pr(Wage > 250 | Age) (right).]

Knot placement

- One strategy is to decide $K$, the number of knots, and then place them at appropriate quantiles of the observed $X$.

- A cubic spline with $K$ knots has $K + 4$ parameters or degrees of freedom.

- A natural spline with $K$ knots has $K$ degrees of freedom.

- The figure compares a degree-14 polynomial and a natural cubic spline, each with 15 df: ns(age, df = 14) and poly(age, deg = 14).

[Figure: Knot placement – natural cubic spline vs. degree-14 polynomial fits of Wage on Age.]

Local regression

[Figure: Local regression on simulated data, showing the locally weighted linear fits.]

- With a sliding weight function, we fit separate linear fits over the range of $X$ by weighted least squares.

- See the text for more details, and the loess() function in R (a short sketch follows below).
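A minimal loess() sketch on the Wage data (ISLR package assumed):

```r
# Local regression of wage on age; span is the fraction of points
# used in each local fit (larger span = smoother fit)
library(ISLR)

fit_lo <- loess(wage ~ age, span = 0.5, data = Wage)

age_grid <- seq(min(Wage$age), max(Wage$age))
pred <- predict(fit_lo, newdata = data.frame(age = age_grid))

plot(Wage$age, Wage$wage, col = "darkgrey", xlab = "Age", ylab = "Wage")
lines(age_grid, pred, col = "red", lwd = 2)
```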


Generalized Additive Models

Allows for flexible nonlinearities in several variables, but retains the additive structure of linear models:

$y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \varepsilon_i.$

[Figure: Fitted GAM functions f1(year), f2(age), and f3(education) for the Wage data.]

GAM details

- Can fit a GAM simply using, e.g., natural splines: lm(wage ~ ns(year, df = 5) + ns(age, df = 5) + education).

- The coefficients are not that interesting; the fitted functions are. The previous plot was produced using plot.gam.

- Can mix terms – some linear, some nonlinear – and use anova() to compare models.

- Can use smoothing splines or local regression as well: gam(wage ~ s(year, df = 5) + lo(age, span = .5) + education).

- GAMs are additive, although low-order interactions can be included in a natural way using, e.g., bivariate smoothers or interactions of the form ns(age, df = 5):ns(year, df = 5). A short sketch follows below.
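A minimal sketch putting these pieces together on the Wage data (ISLR, splines, and gam packages assumed):

```r
library(ISLR)
library(splines)
library(gam)

# GAM via natural splines, fit by ordinary least squares
fit_ls <- lm(wage ~ ns(year, df = 5) + ns(age, df = 5) + education, data = Wage)

# GAM with a smoothing spline for year and local regression for age
fit_gam <- gam(wage ~ s(year, df = 5) + lo(age, span = 0.5) + education, data = Wage)

# One panel per fitted function
par(mfrow = c(1, 3))
plot(fit_gam, se = TRUE, col = "blue")

# Mix linear and nonlinear terms and compare with anova()
fit_lin_year <- gam(wage ~ year + s(age, df = 5) + education, data = Wage)
anova(fit_lin_year, fit_gam, test = "F")
```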


GAMs for classification

$\log\!\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p).$

[Figure: Fitted functions f1(year), f2(age), and f3(education) of the logistic GAM for Pr(Wage > 250).]

gam(I(wage > 250) ~ year + s(age, df = 5) + education, family = binomial)
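The same call, made runnable on the Wage data (ISLR and gam packages assumed):

```r
library(ISLR)
library(gam)

fit_lr <- gam(I(wage > 250) ~ year + s(age, df = 5) + education,
              family = binomial, data = Wage)

# One panel per fitted function, plotted on the logit scale
par(mfrow = c(1, 3))
plot(fit_lr, se = TRUE, col = "darkgreen")
```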


Tree-based Methods


Tree-based Methods

- Here we describe tree-based methods for regression and classification.

- These involve stratifying or segmenting the predictor space into a number of simple regions.

- Since the set of splitting rules used to segment the predictor space can be summarised in a tree, these types of approaches are known as decision-tree methods.


Pros and Cons

- Tree-based methods are simple and useful for interpretation.

- However, they typically are not competitive with the best supervised learning approaches in terms of prediction accuracy.

- Hence we also discuss bagging and random forests. These methods grow multiple trees which are then combined to yield a single consensus prediction.

- Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss of interpretability.


The Basics of Decision Trees

- Decision trees can be applied to both regression and classification problems.

- We first consider regression problems, and then move on to classification.


Baseball salary data

[Figure: Regression tree for log salary, with splits Years < 4.5 and Hits < 117.5 and leaf means 5.11, 6.00, and 6.74.]

Details of previous figure

- For the Hitters data, a regression tree for predicting the log salary of a baseball player, based on the number of years that he has played in the major leagues and the number of hits that he made in the previous year.

- At a given internal node, the label (of the form $X_j < t_k$) indicates the left-hand branch emanating from that split, and the right-hand branch corresponds to $X_j \ge t_k$. For instance, the split at the top of the tree results in two large branches. The left-hand branch corresponds to Years < 4.5, and the right-hand branch corresponds to Years ≥ 4.5.

- The tree has two internal nodes and three terminal nodes, or leaves. The number in each leaf is the mean of the response for the observations that fall there.


Results

- Overall, the tree stratifies or segments the players into three regions of predictor space: $R_1 = \{X \mid \text{Years} < 4.5\}$, $R_2 = \{X \mid \text{Years} \ge 4.5,\ \text{Hits} < 117.5\}$, and $R_3 = \{X \mid \text{Years} \ge 4.5,\ \text{Hits} \ge 117.5\}$.

[Figure: The corresponding partition of the (Years, Hits) predictor space into regions R1, R2, and R3, with cutpoints Years = 4.5 and Hits = 117.5.]

Terminology for Trees

- In keeping with the tree analogy, the regions $R_1$, $R_2$, and $R_3$ are known as terminal nodes.

- Decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree.

- The points along the tree where the predictor space is split are referred to as internal nodes.

- In the Hitters tree, the two internal nodes are indicated by the text Years < 4.5 and Hits < 117.5.


Interpretation of Results

- Years is the most important factor in determining Salary, and players with less experience earn lower salaries than more experienced players.

- Given that a player is less experienced, the number of Hits that he made in the previous year seems to play little role in his Salary.

- But among players who have been in the major leagues for five or more years, the number of Hits made in the previous year does affect Salary, and players who made more Hits last year tend to have higher salaries.

- Surely an over-simplification, but compared to a regression model, it is easy to display, interpret and explain (a short fitting sketch follows below).
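A minimal fitting sketch using the tree package on ISLR's Hitters data (an assumption; rpart would work similarly):

```r
library(ISLR)
library(tree)

Hitters2 <- na.omit(Hitters)                 # drop players with missing Salary
fit_tree <- tree(log(Salary) ~ Years + Hits, data = Hitters2)

plot(fit_tree)
text(fit_tree, pretty = 0)                   # label splits and leaf means
```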


Details of the tree-building process

1. We divide the predictor space – that is, the set of possible values for $X_1, X_2, \ldots, X_p$ – into $J$ distinct and non-overlapping regions, $R_1, R_2, \ldots, R_J$.

2. For every observation that falls into the region $R_j$, we make the same prediction, which is simply the mean of the response values for the training observations in $R_j$.


More details of the tree-building process

- In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model.

- The goal is to find boxes $R_1, \ldots, R_J$ that minimize the RSS, given by

  $\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2,$

  where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box.


More details of the tree-building process

- Unfortunately, it is not computationally feasible to consider every possible partition of the feature space into $J$ boxes.

- For this reason, we take a top-down, greedy approach that is known as recursive binary splitting.

- The approach is top-down because it begins at the top of the tree and then successively splits the predictor space; each split is indicated via two new branches further down on the tree.

- It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.


Details – Continued

- We first select the predictor $X_j$ and the cutpoint $s$ such that splitting the predictor space into the regions $\{X \mid X_j < s\}$ and $\{X \mid X_j \ge s\}$ leads to the greatest possible reduction in RSS (a toy sketch of this search follows below).

- Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions.

- However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions.

- Again, we look to split one of these three regions further, so as to minimize the RSS. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.
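The following toy sketch (not the implementation used by the tree or rpart packages) shows the exhaustive search for a single best split, minimizing the combined RSS of the two resulting regions:

```r
# Find the single best (predictor, cutpoint) pair by exhaustive search
best_split <- function(X, y) {
  best <- list(rss = Inf, var = NA, cut = NA)
  for (j in seq_len(ncol(X))) {
    for (s in unique(X[[j]])) {
      left  <- y[X[[j]] <  s]
      right <- y[X[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, var = names(X)[j], cut = s)
    }
  }
  best
}

# Example: first split for log salary on the Hitters data (ISLR package assumed)
library(ISLR)
Hitters2 <- na.omit(Hitters)
best_split(Hitters2[, c("Years", "Hits")], log(Hitters2$Salary))
```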


Predictions

- We predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs.


Pruning a tree

- The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance.

- A smaller tree with fewer splits (that is, fewer regions $R_1, \ldots, R_J$) might lead to lower variance and better interpretation at the cost of a little bias.

- One possible alternative to the process described above is to grow the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold.

- This strategy will result in smaller trees, but is too short-sighted: a seemingly worthless split early on in the tree might be followed by a very good split – that is, a split that leads to a large reduction in RSS later on.


Pruning a tree – continued

- A better strategy is to grow a very large tree $T_0$, and then prune it back in order to obtain a subtree.

- Cost complexity pruning – also known as weakest link pruning – is used to do this.

- We consider a sequence of trees indexed by a non-negative tuning parameter $\alpha$. For each value of $\alpha$ there corresponds a subtree $T \subset T_0$ such that

  $\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$

  is as small as possible. Here $|T|$ indicates the number of terminal nodes of the tree $T$, $R_m$ is the rectangle (i.e. the subset of predictor space) corresponding to the $m$th terminal node, and $\hat{y}_{R_m}$ is the mean of the training observations in $R_m$.


Choosing the best subtree

- The tuning parameter $\alpha$ controls a trade-off between the subtree's complexity and its fit to the training data.

- We select an optimal value $\hat\alpha$ using cross-validation.

- We then return to the full data set and obtain the subtree corresponding to $\hat\alpha$.


Summary: tree algorithm

1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.

2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of $\alpha$.

3. Use K-fold cross-validation to choose $\alpha$. For each $k = 1, \ldots, K$:

   3.1 Repeat Steps 1 and 2 on the $\frac{K-1}{K}$th fraction of the training data, excluding the $k$th fold.

   3.2 Evaluate the mean squared prediction error on the data in the left-out $k$th fold, as a function of $\alpha$. Average the results, and pick $\alpha$ to minimize the average error.

4. Return the subtree from Step 2 that corresponds to the chosen value of $\alpha$. (A runnable sketch of this workflow follows below.)
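A runnable sketch of this workflow with the tree package on the Hitters data (an assumption; the fold count and seed are arbitrary):

```r
library(ISLR)
library(tree)

Hitters2 <- na.omit(Hitters)
Hitters2$Salary <- log(Hitters2$Salary)        # work with log salary
set.seed(1)
train <- sample(nrow(Hitters2), nrow(Hitters2) / 2)

# 1. Grow a large tree on the training data
big_tree <- tree(Salary ~ ., data = Hitters2, subset = train)

# 2.-3. Cost complexity pruning, with the penalty chosen by cross-validation
cv_out <- cv.tree(big_tree, K = 6)
best_size <- cv_out$size[which.min(cv_out$dev)]

# 4. Return the subtree of the chosen size and assess test error
pruned <- prune.tree(big_tree, best = best_size)
test_pred <- predict(pruned, newdata = Hitters2[-train, ])
mean((test_pred - Hitters2$Salary[-train])^2)
```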


Baseball example continued

- First, we randomly divided the data set in half, yielding 132 observations in the training set and 131 observations in the test set.

- We then built a large regression tree on the training data and varied $\alpha$ in order to create subtrees with different numbers of terminal nodes.

- Finally, we performed six-fold cross-validation in order to estimate the cross-validated MSE of the trees as a function of $\alpha$.


Baseball example continued

[Figure: The full, unpruned regression tree grown on the Hitters training data, with splits on Years, Hits, RBI, Putouts, Walks, and Runs.]

Baseball example continued

[Figure: Training, cross-validation, and test mean squared error as a function of tree size.]

Classification Trees

- Very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one.

- For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.


Details of classification trees

- Just as in the regression setting, we use recursive binary splitting to grow a classification tree.

- In the classification setting, RSS cannot be used as a criterion for making the binary splits.

- A natural alternative to RSS is the classification error rate. This is simply the fraction of the training observations in that region that do not belong to the most common class:

  $E = 1 - \max_k (\hat{p}_{mk}).$

  Here $\hat{p}_{mk}$ represents the proportion of training observations in the $m$th region that are from the $k$th class.

- However, classification error is not sufficiently sensitive for tree-growing, and in practice two other measures are preferable.


Gini index and Deviance

- The Gini index is defined by

  $G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}),$

  a measure of total variance across the $K$ classes. The Gini index takes on a small value if all of the $\hat{p}_{mk}$'s are close to zero or one.

- For this reason the Gini index is referred to as a measure of node purity – a small value indicates that a node contains predominantly observations from a single class.

- An alternative to the Gini index is cross-entropy, given by

  $D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}.$

- Both the Gini index and the cross-entropy are very similar numerically (a small numerical sketch follows below).
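A tiny numerical sketch of the two measures for a single node:

```r
# Node impurity measures for a vector p of class proportions
gini <- function(p) sum(p * (1 - p))
cross_entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))  # treat 0*log(0) as 0

gini(c(0.95, 0.05)); cross_entropy(c(0.95, 0.05))  # nearly pure node: both small
gini(c(0.50, 0.50)); cross_entropy(c(0.50, 0.50))  # evenly mixed node: both larger
```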


[Figure: Heart data – the unpruned classification tree (splits on Thal, Ca, MaxHR, ChestPain, and others); training, cross-validation, and test error as a function of tree size; and the pruned tree.]

Trees Versus Linear Models

[Figure: Two-dimensional classification examples (axes X1 and X2), 2 × 2 panel.]

- Top row: true linear boundary; bottom row: true non-linear boundary.

- Left column: linear model; right column: tree-based model.


Advantages and Disadvantages of Trees

- Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression.

- Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous days.

- Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).

- Trees can easily handle qualitative predictors without the need to create dummy variables.

- Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches we've seen previously.

However, by aggregating many decision trees, the predictive performance of trees can be substantially improved. We introduce these concepts next.


Bagging


Bagging

- Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.

- Recall that given a set of $n$ independent observations $Z_1, \ldots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given by $\sigma^2 / n$.

- In other words, averaging a set of observations reduces variance. Of course, this is not practical because we generally do not have access to multiple training sets.


Bagging continued

- Instead, we can bootstrap, by taking repeated samples from the (single) training data set.

- In this approach we generate $B$ different bootstrapped training data sets. We then train our method on the $b$th bootstrapped training set in order to get $\hat{f}^{*b}(x)$, the prediction at a point $x$. We then average all the predictions to obtain

  $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).$

- This is called bagging.


Bagging classification trees

- The above prescription applies to regression trees.

- For classification trees: for each test observation, we record the class predicted by each of the $B$ trees, and take a majority vote: the overall prediction is the most commonly occurring class among the $B$ predictions (a short sketch follows below).
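A minimal bagging sketch using the randomForest package with mtry set to the number of predictors; reading the Heart data from a Heart.csv file (as distributed on the ISLR website) is an assumption here:

```r
library(randomForest)

# Assumed Heart.csv layout: a row-index column X, 13 predictors, and the response AHD
Heart <- na.omit(read.csv("Heart.csv", stringsAsFactors = TRUE))
Heart$X <- NULL

set.seed(1)
train <- sample(nrow(Heart), nrow(Heart) / 2)
p <- ncol(Heart) - 1                       # number of predictors

# Bagging = random forest with all p predictors tried at every split
bag_fit <- randomForest(AHD ~ ., data = Heart, subset = train,
                        mtry = p, ntree = 300, importance = TRUE)

# Majority-vote predictions on the held-out half
pred <- predict(bag_fit, newdata = Heart[-train, ])
mean(pred != Heart$AHD[-train])            # test error rate
```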


Out-of-Bag Error Estimation

- There is a very straightforward way to estimate the test error of a bagged model.

- Recall that the key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. On average, each bagged tree makes use of around two-thirds of the observations.

- The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations.

- We can predict the response for the $i$th observation using each of the trees in which that observation was OOB. This will yield around $B/3$ predictions for the $i$th observation, which we average.

- This estimate is essentially the LOOCV error for bagging, if $B$ is large.


Bagging the heart data

[Figure: Bagging and random forests on the Heart data – test error and OOB error as a function of the number of trees.]

- The test error (black and orange) is shown as a function of $B$, the number of bootstrapped training sets used.

- Random forests were applied with $m = \sqrt{p}$.

- The dashed line indicates the test error resulting from a single classification tree.

- The green and blue traces show the OOB error, which in this case is considerably lower.


Random Forest


Random Forests

- Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. This reduces the variance when we average the trees.

- As in bagging, we build a number of decision trees on bootstrapped training samples.

- But when building these decision trees, each time a split in a tree is considered, a random selection of $m$ predictors is chosen as split candidates from the full set of $p$ predictors. The split is allowed to use only one of those $m$ predictors.

- A fresh selection of $m$ predictors is taken at each split, and typically we choose $m \approx \sqrt{p}$ – that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors (4 out of the 13 for the Heart data). A short sketch follows below.
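A matching random-forest sketch, identical to the bagging code above except that only about √p predictors are tried at each split (Heart.csv assumed as before):

```r
library(randomForest)

Heart <- na.omit(read.csv("Heart.csv", stringsAsFactors = TRUE))
Heart$X <- NULL
set.seed(1)
train <- sample(nrow(Heart), nrow(Heart) / 2)
p <- ncol(Heart) - 1

rf_fit <- randomForest(AHD ~ ., data = Heart, subset = train,
                       mtry = floor(sqrt(p)), ntree = 300, importance = TRUE)

rf_pred <- predict(rf_fit, newdata = Heart[-train, ])
mean(rf_pred != Heart$AHD[-train])         # compare with the bagging error above
```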


Example: gene expression data

- Random forests applied to a high-dimensional biological data set consisting of expression measurements of 4,718 genes measured on tissue samples from 349 patients.

- There are around 20,000 genes in humans, and individual genes have different levels of activity, or expression, in particular cells, tissues, and biological conditions.

- Each of the patient samples has a qualitative label with 15 different levels: either normal or one of 14 different types of cancer.

- Random forests are used to predict cancer type based on the 500 genes that have the largest variance in the training set.

- Observations are randomly divided into a training and a test set, and random forests are applied to the training set for three different values of the number of splitting variables $m$.


Results: gene expression data

[Figure: Test classification error as a function of the number of trees, for m = p, m = p/2, and m = √p.]

- Results from random forests for the fifteen-class gene expression data set with $p = 500$ predictors.

- The test error is displayed as a function of the number of trees. Each colored line corresponds to a different value of $m$, the number of predictors available for splitting at each interior tree node.

- Random forests ($m < p$) lead to a slight improvement over bagging ($m = p$). A single classification tree has an error rate of 45.7%.


Variable importance measure

- For bagged/RF regression trees, we record the total amount that the RSS is decreased due to splits over a given predictor, averaged over all $B$ trees. A large value indicates an important predictor.

- Similarly, for bagged/RF classification trees, we add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all $B$ trees.

[Figure: Variable importance plot for the Heart data. Thal, Ca, and ChestPain have the largest importance, followed by Oldpeak and MaxHR.]
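A plot like this can be produced from the random-forest fit in the sketch above (rf_fit; randomForest package assumed):

```r
# Per-predictor importance: mean decrease in accuracy and in Gini index
importance(rf_fit)
varImpPlot(rf_fit)   # dot chart similar to the figure above
```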


Summary

- Decision trees are simple and interpretable models for regression and classification.

- However, they are often not competitive with other methods in terms of prediction accuracy.

- Bagging and random forests are good methods for improving the prediction accuracy of trees. They work by growing many trees on the training data and then combining the predictions of the resulting ensemble of trees.

- Random forest is among the (new) methods that are “democratising” machine learning.