Nonlinear Regression (Part 2)
Christof Seiler
Stanford University, Spring 2016, STATS 205
Overview
Last time:
- The bias-variance tradeoff
  - Bias: the estimator cannot explain the true function
  - Variance: the estimator changes every time we see a new sample
- The curse of dimensionality

Today:
- Linear Smoothers
- Local Averages
- Local Regression
- Penalized Regression
Nonlinear Regression
- We are given n pairs of observations (x_1, Y_1), ..., (x_n, Y_n)
- The response variable is related to the covariate by

  Y_i = r(x_i) + ε_i,   E(ε_i) = 0,   i = 1, ..., n

  with r being the regression function
- For now, assume that the variance Var(ε_i) = σ² does not depend on x
- The covariates x_i are fixed
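To make the later estimators concrete, here is a minimal simulation sketch of this model. The regression function, noise level, and design points are illustrative assumptions, and the x and Y defined here are reused in the usage notes of the sketches that follow.

```python
import numpy as np

# Minimal sketch: simulate data from Y_i = r(x_i) + eps_i with fixed x_i.
# The regression function r, the noise level sigma, and n are assumptions.
rng = np.random.default_rng(205)

def r(x):
    return np.sin(2 * np.pi * x)         # assumed true regression function

n = 100
x = np.linspace(0, 1, n)                 # fixed design points in [0, 1]
sigma = 0.3                              # Var(eps_i) = sigma^2, constant in x
Y = r(x) + rng.normal(0, sigma, size=n)  # observed responses
```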
Linear Smoothers

- All the nonparametric estimators that we will treat in this class are linear smoothers
- An estimator r̂_n is a linear smoother if, for each x, there exists a vector l(x) = (l_1(x), ..., l_n(x))^T such that

  r̂_n(x) = ∑_{i=1}^n l_i(x) Y_i

- Define the vector of fitted values

  r̂ = (r̂_n(x_1), ..., r̂_n(x_n))^T

- Then we can write this in matrix form (with L_ij = l_j(x_i)) as

  r̂ = L Y

- The ith row of L contains the weights given to each Y_i in forming the estimate r̂_n(x_i)
Regressogram Estimator
- Suppose that a ≤ x_i ≤ b for i = 1, ..., n
- Divide (a, b) into m equally spaced bins denoted by B_1, B_2, ..., B_m
- Define the estimator as

  r̂_n(x) = (1/k_j) ∑_{i : x_i ∈ B_j} Y_i   for x ∈ B_j

  where k_j is the number of points in B_j
- In this case,

  l(x)^T = (0, 0, ..., 1/k_j, ..., 1/k_j, 0, ..., 0)

- This estimator is a step function
Regressogram Estimator
- For example, n = 9, m = 3 (the sketch below rebuilds this matrix):

  L = 1/3 ×
      [ 1 1 1 0 0 0 0 0 0 ]
      [ 1 1 1 0 0 0 0 0 0 ]
      [ 1 1 1 0 0 0 0 0 0 ]
      [ 0 0 0 1 1 1 0 0 0 ]
      [ 0 0 0 1 1 1 0 0 0 ]
      [ 0 0 0 1 1 1 0 0 0 ]
      [ 0 0 0 0 0 0 1 1 1 ]
      [ 0 0 0 0 0 0 1 1 1 ]
      [ 0 0 0 0 0 0 1 1 1 ]
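A small sketch of how such a smoothing matrix can be built; `regressogram_matrix` and its arguments are hypothetical names, and for 9 equally spaced points with m = 3 bins it reproduces the matrix above.

```python
import numpy as np

# Sketch: regressogram smoothing matrix for equally spaced bins on (a, b).
def regressogram_matrix(x, a, b, m):
    n = len(x)
    bins = np.clip(((x - a) / (b - a) * m).astype(int), 0, m - 1)  # bin of each x_i
    L = np.zeros((n, n))
    for j in range(m):
        idx = np.where(bins == j)[0]                 # points falling into bin B_j
        if len(idx) > 0:
            L[np.ix_(idx, idx)] = 1.0 / len(idx)     # weight 1/k_j within the bin
    return L

x9 = (np.arange(9) + 0.5) / 9                        # 9 points, three per bin
L = regressogram_matrix(x9, a=0.0, b=1.0, m=3)       # reproduces the 9x9 matrix above
```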
Local Averages Estimator
- Fix a bandwidth h > 0 and let B_x = {i : |x_i − x| ≤ h}
- Let n_x be the number of points in B_x
- The estimator is

  r̂_n(x) = (1/n_x) ∑_{i ∈ B_x} Y_i

- This is a special case of the kernel estimator that we will discuss next
- In this case, l_i(x) = 1/n_x if |x_i − x| ≤ h and l_i(x) = 0 otherwise
Local Averages Estimator
- For example, n = 9, x_i = i/9, h = 1/9 (the sketch below rebuilds this matrix):

  L = [ 1/2 1/2  0   0   0   0   0   0   0  ]
      [ 1/3 1/3 1/3  0   0   0   0   0   0  ]
      [  0  1/3 1/3 1/3  0   0   0   0   0  ]
      [  0   0  1/3 1/3 1/3  0   0   0   0  ]
      [  0   0   0  1/3 1/3 1/3  0   0   0  ]
      [  0   0   0   0  1/3 1/3 1/3  0   0  ]
      [  0   0   0   0   0  1/3 1/3 1/3  0  ]
      [  0   0   0   0   0   0  1/3 1/3 1/3 ]
      [  0   0   0   0   0   0   0  1/2 1/2 ]
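A corresponding sketch for the local average smoothing matrix; `local_average_matrix` is a hypothetical helper, and with n = 9, x_i = i/9, h = 1/9 it reproduces the matrix above.

```python
import numpy as np

# Sketch: smoothing matrix of the local average estimator with bandwidth h.
def local_average_matrix(x, h):
    n = len(x)
    L = np.zeros((n, n))
    for i in range(n):
        Bx = np.abs(x - x[i]) <= h + 1e-12   # neighbourhood B_{x_i} (float tolerance)
        L[i, Bx] = 1.0 / Bx.sum()            # l_j(x_i) = 1/n_x for neighbours, 0 otherwise
    return L

x9 = np.arange(1, 10) / 9
L = local_average_matrix(x9, h=1/9)          # reproduces the 9x9 matrix above
```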
Linear Smoothers
- Each row gives a weighted average, with the constraint ∑_{i=1}^n l_i(x) = 1
- Define the effective degrees of freedom by

  ν = tr(L)

- The effective degrees of freedom behave very much like the number of parameters in a linear regression model
- For the regressogram example: L_ii = 1/k_j, so ν = ∑_j k_j · (1/k_j) = m, the number of bins
- For local averages: L_ii ≈ 1/#neighbors, so ν ≈ n/#neighbors (see the numeric check below)
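A quick numeric check of ν = tr(L) on the two toy matrices above, built inline here so the snippet stands alone.

```python
import numpy as np

# Sketch: effective degrees of freedom nu = tr(L) for the two toy examples.
L_reg = np.kron(np.eye(3), np.full((3, 3), 1 / 3))   # 9x9 regressogram matrix, m = 3 bins
print(np.trace(L_reg))                               # 3.0 = m bins

x9 = np.arange(1, 10) / 9
L_loc = np.zeros((9, 9))
for i in range(9):
    nb = np.abs(x9 - x9[i]) <= 1 / 9 + 1e-12         # local average neighbourhood, h = 1/9
    L_loc[i, nb] = 1 / nb.sum()
print(np.trace(L_loc))                               # ~3.33, roughly n / #neighbours
```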
Local Regression
- We still use the regression model

  Y_i = r(x_i) + ε_i,   E(ε_i) = 0,   i = 1, ..., n

- But now we consider weighted averages of the Y_i's, giving higher weight to points close to x
- One option is the kernel regression estimator, called the Nadaraya–Watson kernel estimator,

  r̂_n(x) = ∑_{i=1}^n l_i(x) Y_i

  with kernel K and weights l_i(x) given by

  l_i(x) = K((x − x_i)/h) / ∑_{j=1}^n K((x − x_j)/h)
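A sketch of the Nadaraya–Watson estimator at a single point with a Gaussian kernel; the function names and the bandwidth in the usage note are illustrative choices.

```python
import numpy as np

# Sketch: Nadaraya-Watson kernel estimator at a point x0 with a Gaussian kernel.
def gaussian_kernel(u):
    return np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)

def nadaraya_watson(x0, x, Y, h):
    w = gaussian_kernel((x0 - x) / h)   # K((x0 - x_i) / h)
    l = w / w.sum()                     # weights l_i(x0), summing to one
    return l @ Y                        # r_hat(x0) = sum_i l_i(x0) Y_i

# Usage with the simulated x, Y from the first sketch (hypothetical data):
# x_grid = np.linspace(0, 1, 200)
# r_hat = np.array([nadaraya_watson(x0, x, Y, h=0.05) for x0 in x_grid])
```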
Local Regression
- For example, the Gaussian kernel K(x) = (1/√(2π)) e^{−x²/2}
- Think of the kernels as basis functions anchored at the observation locations x_i

Source: Wasserman (2006)
Local Regression
- For points x_1, ..., x_n drawn from some density f
- Let h → 0 and nh → ∞
- The bias-variance tradeoff for the Nadaraya–Watson kernel estimator is

  R(r̂_n, r) ≈ (h⁴/4) Bias² + (1/(nh)) Variance

- It depends on the first and second derivatives of the density f:

  Bias² = (∫ x² K(x) dx)² ∫ ( r''(x) + 2 r'(x) f'(x)/f(x) )² dx

- The term 2 r'(x) f'(x)/f(x) is called the design bias
- It depends on the distribution of the points x_1, ..., x_n
Local Regression
- The bias term 2 r'(x) f'(x)/f(x) has two properties:
  - it is large if f'(x) is non-zero, and
  - it is large if f(x) is small
- The dependence of the bias on the density f(x) is called design bias
- It can also be shown that the bias is large if x is close to the boundary of the support of f
- These biases can be reduced by using a refinement called local polynomial regression
Local Polynomials
Source: Hastie, Tibshirani, Friedman (2009)
Local Polynomials
- Define the weight function w_i(x) = K((x_i − x)/h) and estimate a ≡ r̂_n(x) by minimizing

  ∑_{i=1}^n w_i(x) (Y_i − a)²

- Taking the derivative with respect to a and setting it to zero gives

  â = ∑_{i=1}^n w_i(x) Y_i / ∑_{i=1}^n w_i(x)

- Thus the kernel estimator is a locally constant estimator, obtained from locally weighted least squares
Local Polynomials

- What if we fit a local polynomial of degree p instead of a local constant?
- Let x be some fixed value at which we want to estimate r(x)
- For values u in a neighborhood of x, define the polynomial

  P_x(u; a) = a_0 + a_1(u − x) + (a_2/2!)(u − x)² + ··· + (a_p/p!)(u − x)^p

- We approximate the regression function r(u) in a neighborhood of x by

  r(u) ≈ P_x(u; a)

- Estimate â = (â_0, ..., â_p)^T by minimizing the locally weighted sum of squares

  ∑_{i=1}^n w_i(x) (Y_i − P_x(x_i; a))²
Local Polynomials
- First construct the matrices (shown here for p = 1)

  X_x = [ 1  x_1 − x ]        W_x = [ w_1(x)    0    ···    0    ]
        [ 1  x_2 − x ]              [   0    w_2(x)  ···    0    ]
        [ ⋮     ⋮    ]              [   ⋮       ⋮     ⋱     ⋮    ]
        [ 1  x_n − x ]              [   0       0    ···  w_n(x) ]

- Then rewrite the weighted sum of squares in matrix form:

  ∑_{i=1}^n w_i(x) (Y_i − P_x(x_i; a))² = (Y − X_x a)^T W_x (Y − X_x a)

- Take the gradient with respect to a and set it to zero:

  â = (X_x^T W_x X_x)^{-1} X_x^T W_x Y
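A sketch that evaluates this closed form at a single point x0 for a general degree p; `local_polynomial`, the Gaussian weights, and the bandwidth in the usage note are illustrative choices.

```python
import math
import numpy as np

# Sketch: local polynomial estimate of degree p at a single point x0.
def local_polynomial(x0, x, Y, h, p=1):
    u = x - x0                                           # x_i - x0
    Xx = np.column_stack([u ** j / math.factorial(j)     # columns (x_i - x0)^j / j!
                          for j in range(p + 1)])
    w = np.exp(-0.5 * (u / h) ** 2)                      # Gaussian kernel weights w_i(x0)
    XtW = Xx.T * w                                       # equals Xx.T @ diag(w)
    a = np.linalg.solve(XtW @ Xx, XtW @ Y)               # a = (Xx^T Wx Xx)^(-1) Xx^T Wx Y
    return a[0]                                          # r_hat(x0) = a_0

# Usage (hypothetical x, Y from the simulation sketch):
# r_hat = np.array([local_polynomial(x0, x, Y, h=0.1, p=1) for x0 in np.linspace(0, 1, 200)])
```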
Local Polynomials
- p = 1 is the most popular case; it is called local linear regression
- p = 0 gives back the kernel estimator
- This is again a linear smoother: the estimate r̂_n(x) is the first component of

  â = (X_x^T W_x X_x)^{-1} X_x^T W_x Y

  so the fitted values can be written as r̂ = L Y
- We can choose the bandwidth h by cross-validation (a sketch follows below)
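A sketch of leave-one-out cross-validation for the bandwidth, using the standard shortcut for linear smoothers CV(h) = (1/n) ∑_i ((Y_i − r̂(x_i))/(1 − L_ii))². It is written here for the kernel (p = 0) smoothing matrix, but the same formula applies to any linear smoother r̂ = LY; the helper names and the grid of bandwidths are assumptions.

```python
import numpy as np

# Sketch: choose the bandwidth h by leave-one-out cross-validation,
# using the linear-smoother shortcut CV(h) = mean(((Y - L Y) / (1 - diag(L)))^2).
def nw_smoothing_matrix(x, h):
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)   # K((x_i - x_j) / h)
    return W / W.sum(axis=1, keepdims=True)                   # rows l(x_i)^T sum to one

def loocv_score(x, Y, h):
    L = nw_smoothing_matrix(x, h)
    resid = Y - L @ Y
    return np.mean((resid / (1 - np.diag(L))) ** 2)

# Usage (hypothetical x, Y from the simulation sketch):
# hs = np.linspace(0.02, 0.3, 40)
# h_best = hs[np.argmin([loocv_score(x, Y, h) for h in hs])]
```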
Local Polynomials
- Comparing the bias-variance tradeoff

  R(r̂_n, r) ≈ (h⁴/4) Bias² + (1/(nh)) Variance

- for the Nadaraya–Watson kernel estimator (depends on the first and second derivatives of the density f)

  Bias² = (∫ x² K(x) dx)² ∫ ( r''(x) + 2 r'(x) f'(x)/f(x) )² dx

- and for the local linear estimator (no dependence on the density f)

  Bias² = (∫ x² K(x) dx)² ∫ r''(x)² dx
Penalized Regression
- An alternative to local averaging
- Minimize the penalized sums of squares

  M(λ) = ∑_{i=1}^n (Y_i − r̂_n(x_i))² + λ J(r)

- with a roughness penalty, for example

  J(r) = ∫ (r''(x))² dx

- The parameter λ controls the trade-off between fit and smoothness:
  - for λ = 0, r̂_n is an interpolating function
  - as λ → ∞, r̂_n converges to the least squares line
  - for 0 < λ < ∞, r̂_n is a spline (a piecewise polynomial)
Penalized Regression
Source: Wasserman (2006)
Penalized Regression
- Theorem: The function r̂_n(x) that minimizes M(λ) with the penalty from the previous slide is a natural cubic spline with knots at the data points. The estimator r̂_n is called a smoothing spline.
- This theorem doesn't give an explicit form of r̂_n
- However, we can build an explicit basis using B-splines:

  r̂_n(x) = ∑_{j=1}^N β_j B_j(x)

  where B_1, ..., B_N are a B-spline basis with N = n + 4
- Thus, we only need to find the coefficients β = (β_1, ..., β_N)^T
Penalized Regression
- Rewriting M(λ) in terms of β gives

  (Y − Bβ)^T (Y − Bβ) + λ β^T Ω β

  with B_ij = B_j(x_i) and Ω_jk = ∫ B_j''(x) B_k''(x) dx
- Taking the derivative with respect to β and setting it to zero, we find

  β̂ = (B^T B + λ Ω)^{-1} B^T Y

- This is another linear smoother,

  r̂ = B (B^T B + λ Ω)^{-1} B^T Y = L Y

  with fitted values r̂ a smoothed version of the original observations Y
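A sketch of this penalized fit with a cubic B-spline basis from SciPy; the knot layout (and hence the basis size) is a simplification relative to the N = n + 4 basis above, and the penalty matrix Ω is approximated by quadrature rather than computed exactly.

```python
import numpy as np
from scipy.interpolate import BSpline

# Sketch: penalized B-spline fit beta = (B^T B + lam * Omega)^(-1) B^T Y.
def fit_penalized_spline(x, Y, lam, n_knots=20):
    k = 3                                                    # cubic B-splines
    knots = np.linspace(x.min(), x.max(), n_knots)
    t = np.r_[[x.min()] * k, knots, [x.max()] * k]           # padded knot vector
    N = len(t) - k - 1                                       # number of basis functions

    B = BSpline.design_matrix(x, t, k).toarray()             # B_ij = B_j(x_i)

    # Omega_jk = int B_j''(u) B_k''(u) du, approximated on a fine grid
    grid = np.linspace(x.min(), x.max(), 2000)
    D2 = np.column_stack([BSpline(t, np.eye(N)[j], k)(grid, nu=2)
                          for j in range(N)])
    Omega = D2.T @ D2 * (grid[1] - grid[0])

    beta = np.linalg.solve(B.T @ B + lam * Omega, B.T @ Y)   # penalized LS coefficients
    return B @ beta                                          # fitted values r_hat = L Y

# Usage (hypothetical x, Y from the simulation sketch):
# r_hat = fit_penalized_spline(x, Y, lam=1e-4)
```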
References
- Wasserman (2006). All of Nonparametric Statistics
- Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning