Nonlinear Regression (Part 2)
Christof Seiler
Stanford University, Spring 2016, STATS 205
Overview
Last time:
- The bias-variance tradeoff
  - Bias: the estimator cannot explain the true function
  - Variance: the estimator changes every time we see a new sample
- The curse of dimensionality

Today:
- Linear Smoothers
- Local Averages
- Local Regression
- Penalized Regression
Nonlinear Regression
- We are given n pairs of observations (x_1, Y_1), ..., (x_n, Y_n)
- The response variable is related to the covariate by

  Y_i = r(x_i) + ε_i,   E(ε_i) = 0,   i = 1, ..., n

  with r being the regression function
- For now, assume that the variance Var(ε_i) = σ² does not depend on x
- The covariates x_i are fixed
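To make the later estimators concrete, here is a minimal simulation sketch of this model. The regression function, noise level, and design points are illustrative assumptions, and the x and Y defined here are reused in the usage notes of the sketches that follow.

```python
import numpy as np

# Minimal sketch: simulate data from Y_i = r(x_i) + eps_i with fixed x_i.
# The regression function r, the noise level sigma, and n are assumptions.
rng = np.random.default_rng(205)

def r(x):
    return np.sin(2 * np.pi * x)         # assumed true regression function

n = 100
x = np.linspace(0, 1, n)                 # fixed design points in [0, 1]
sigma = 0.3                              # Var(eps_i) = sigma^2, constant in x
Y = r(x) + rng.normal(0, sigma, size=n)  # observed responses
```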
Linear Smoothers

- All the nonparametric estimators that we will treat in this class are linear smoothers
- An estimator r̂_n is a linear smoother if, for each x, there exists a vector l(x) = (l_1(x), ..., l_n(x))^T such that

  r̂_n(x) = ∑_{i=1}^n l_i(x) Y_i

- Define the vector of fitted values

  r̂ = (r̂_n(x_1), ..., r̂_n(x_n))^T

- Then we can write this in matrix form (with L_ij = l_j(x_i)) as

  r̂ = L Y

- The ith row of L contains the weights given to each Y_i in forming the estimate r̂_n(x_i)
Regressogram Estimator
- Suppose that a ≤ x_i ≤ b for i = 1, ..., n
- Divide (a, b) into m equally spaced bins denoted by B_1, B_2, ..., B_m
- Define the estimator as

  r̂_n(x) = (1/k_j) ∑_{i : x_i ∈ B_j} Y_i   for x ∈ B_j

  where k_j is the number of points in B_j
- In this case,

  l(x)^T = (0, 0, ..., 1/k_j, ..., 1/k_j, 0, ..., 0)

- This estimator is a step function
Regressogram Estimator
- For example, n = 9, m = 3 (the sketch below rebuilds this matrix):

  L = 1/3 ×
      [ 1 1 1 0 0 0 0 0 0 ]
      [ 1 1 1 0 0 0 0 0 0 ]
      [ 1 1 1 0 0 0 0 0 0 ]
      [ 0 0 0 1 1 1 0 0 0 ]
      [ 0 0 0 1 1 1 0 0 0 ]
      [ 0 0 0 1 1 1 0 0 0 ]
      [ 0 0 0 0 0 0 1 1 1 ]
      [ 0 0 0 0 0 0 1 1 1 ]
      [ 0 0 0 0 0 0 1 1 1 ]
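A small sketch of how such a smoothing matrix can be built; `regressogram_matrix` and its arguments are hypothetical names, and for 9 equally spaced points with m = 3 bins it reproduces the matrix above.

```python
import numpy as np

# Sketch: regressogram smoothing matrix for equally spaced bins on (a, b).
def regressogram_matrix(x, a, b, m):
    n = len(x)
    bins = np.clip(((x - a) / (b - a) * m).astype(int), 0, m - 1)  # bin of each x_i
    L = np.zeros((n, n))
    for j in range(m):
        idx = np.where(bins == j)[0]                 # points falling into bin B_j
        if len(idx) > 0:
            L[np.ix_(idx, idx)] = 1.0 / len(idx)     # weight 1/k_j within the bin
    return L

x9 = (np.arange(9) + 0.5) / 9                        # 9 points, three per bin
L = regressogram_matrix(x9, a=0.0, b=1.0, m=3)       # reproduces the 9x9 matrix above
```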
Local Averages Estimator
- Fix a bandwidth h > 0 and let B_x = {i : |x_i − x| ≤ h}
- Let n_x be the number of points in B_x
- The estimator is

  r̂_n(x) = (1/n_x) ∑_{i ∈ B_x} Y_i

- This is a special case of the kernel estimator that we will discuss next
- In this case, l_i(x) = 1/n_x if |x_i − x| ≤ h and l_i(x) = 0 otherwise
Local Averages Estimator
- For example, n = 9, x_i = i/9, h = 1/9 (the sketch below rebuilds this matrix):

  L = [ 1/2 1/2  0   0   0   0   0   0   0  ]
      [ 1/3 1/3 1/3  0   0   0   0   0   0  ]
      [  0  1/3 1/3 1/3  0   0   0   0   0  ]
      [  0   0  1/3 1/3 1/3  0   0   0   0  ]
      [  0   0   0  1/3 1/3 1/3  0   0   0  ]
      [  0   0   0   0  1/3 1/3 1/3  0   0  ]
      [  0   0   0   0   0  1/3 1/3 1/3  0  ]
      [  0   0   0   0   0   0  1/3 1/3 1/3 ]
      [  0   0   0   0   0   0   0  1/2 1/2 ]
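A corresponding sketch for the local average smoothing matrix; `local_average_matrix` is a hypothetical helper, and with n = 9, x_i = i/9, h = 1/9 it reproduces the matrix above.

```python
import numpy as np

# Sketch: smoothing matrix of the local average estimator with bandwidth h.
def local_average_matrix(x, h):
    n = len(x)
    L = np.zeros((n, n))
    for i in range(n):
        Bx = np.abs(x - x[i]) <= h + 1e-12   # neighbourhood B_{x_i} (float tolerance)
        L[i, Bx] = 1.0 / Bx.sum()            # l_j(x_i) = 1/n_x for neighbours, 0 otherwise
    return L

x9 = np.arange(1, 10) / 9
L = local_average_matrix(x9, h=1/9)          # reproduces the 9x9 matrix above
```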
Linear Smoothers
- Each row gives a weighted average, with the constraint ∑_{i=1}^n l_i(x) = 1
- Define the effective degrees of freedom by

  ν = tr(L)

- The effective degrees of freedom behave very much like the number of parameters in a linear regression model
- For the regressogram example: L_ii = 1/k_j, so ν = ∑_j k_j · (1/k_j) = m, the number of bins
- For local averages: L_ii ≈ 1/#neighbors, so ν ≈ n/#neighbors (see the numeric check below)
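A quick numeric check of ν = tr(L) on the two toy matrices above, built inline here so the snippet stands alone.

```python
import numpy as np

# Sketch: effective degrees of freedom nu = tr(L) for the two toy examples.
L_reg = np.kron(np.eye(3), np.full((3, 3), 1 / 3))   # 9x9 regressogram matrix, m = 3 bins
print(np.trace(L_reg))                               # 3.0 = m bins

x9 = np.arange(1, 10) / 9
L_loc = np.zeros((9, 9))
for i in range(9):
    nb = np.abs(x9 - x9[i]) <= 1 / 9 + 1e-12         # local average neighbourhood, h = 1/9
    L_loc[i, nb] = 1 / nb.sum()
print(np.trace(L_loc))                               # ~3.33, roughly n / #neighbours
```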
Local Regression
- We still use the regression model

  Y_i = r(x_i) + ε_i,   E(ε_i) = 0,   i = 1, ..., n

- But now we consider weighted averages of the Y_i's, giving higher weight to points close to x
- One option is the kernel regression estimator, called the Nadaraya–Watson kernel estimator,

  r̂_n(x) = ∑_{i=1}^n l_i(x) Y_i

  with kernel K and weights l_i(x) given by

  l_i(x) = K((x − x_i)/h) / ∑_{j=1}^n K((x − x_j)/h)
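A sketch of the Nadaraya–Watson estimator at a single point with a Gaussian kernel; the function names and the bandwidth in the usage note are illustrative choices.

```python
import numpy as np

# Sketch: Nadaraya-Watson kernel estimator at a point x0 with a Gaussian kernel.
def gaussian_kernel(u):
    return np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)

def nadaraya_watson(x0, x, Y, h):
    w = gaussian_kernel((x0 - x) / h)   # K((x0 - x_i) / h)
    l = w / w.sum()                     # weights l_i(x0), summing to one
    return l @ Y                        # r_hat(x0) = sum_i l_i(x0) Y_i

# Usage with the simulated x, Y from the first sketch (hypothetical data):
# x_grid = np.linspace(0, 1, 200)
# r_hat = np.array([nadaraya_watson(x0, x, Y, h=0.05) for x0 in x_grid])
```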
Local Regression
- For example, the Gaussian kernel K(x) = (1/√(2π)) e^{−x²/2}
- Think of the kernels as basis functions anchored at the observation locations x_i

Source: Wasserman (2006)
Local Regression
- For points x_1, ..., x_n drawn from some density f
- Let h → 0 and nh → ∞
- The bias-variance tradeoff for the Nadaraya–Watson kernel estimator is

  R(r̂_n, r) ≈ (h⁴/4) Bias² + (1/(nh)) Variance

- It depends on the first and second derivatives of the density f:

  Bias² = (∫ x² K(x) dx)² ∫ ( r''(x) + 2 r'(x) f'(x)/f(x) )² dx

- The term 2 r'(x) f'(x)/f(x) is called the design bias
- It depends on the distribution of the points x_1, ..., x_n
Local Regression
- The bias term 2 r'(x) f'(x)/f(x) has two properties:
  - it is large if f'(x) is non-zero, and
  - it is large if f(x) is small
- The dependence of the bias on the density f(x) is called design bias
- It can also be shown that the bias is large if x is close to the boundary of the support of f
- These biases can be reduced by using a refinement called local polynomial regression
Local Polynomials
Source: Hastie, Tibshirani, Friedman (2009)
Local Polynomials
- Define the weight function w_i(x) = K((x_i − x)/h) and estimate a ≡ r̂_n(x) by minimizing

  ∑_{i=1}^n w_i(x) (Y_i − a)²

- Taking the derivative with respect to a and setting it to zero gives

  â = ∑_{i=1}^n w_i(x) Y_i / ∑_{i=1}^n w_i(x)

- Thus the kernel estimator is a locally constant estimator, obtained from locally weighted least squares
Local Polynomials

- What if we fit a local polynomial of degree p instead of a local constant?
- Let x be some fixed value at which we want to estimate r(x)
- For values u in a neighborhood of x, define the polynomial

  P_x(u; a) = a_0 + a_1(u − x) + (a_2/2!)(u − x)² + ··· + (a_p/p!)(u − x)^p

- We approximate the regression function r(u) in a neighborhood of x by

  r(u) ≈ P_x(u; a)

- Estimate â = (â_0, ..., â_p)^T by minimizing the locally weighted sum of squares

  ∑_{i=1}^n w_i(x) (Y_i − P_x(x_i; a))²
Local Polynomials
- First construct the matrices (shown here for p = 1)

  X_x = [ 1  x_1 − x ]        W_x = [ w_1(x)    0    ···    0    ]
        [ 1  x_2 − x ]              [   0    w_2(x)  ···    0    ]
        [ ⋮     ⋮    ]              [   ⋮       ⋮     ⋱     ⋮    ]
        [ 1  x_n − x ]              [   0       0    ···  w_n(x) ]

- Then rewrite the weighted sum of squares in matrix form:

  ∑_{i=1}^n w_i(x) (Y_i − P_x(x_i; a))² = (Y − X_x a)^T W_x (Y − X_x a)

- Take the gradient with respect to a and set it to zero:

  â = (X_x^T W_x X_x)^{-1} X_x^T W_x Y
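A sketch that evaluates this closed form at a single point x0 for a general degree p; `local_polynomial`, the Gaussian weights, and the bandwidth in the usage note are illustrative choices.

```python
import math
import numpy as np

# Sketch: local polynomial estimate of degree p at a single point x0.
def local_polynomial(x0, x, Y, h, p=1):
    u = x - x0                                           # x_i - x0
    Xx = np.column_stack([u ** j / math.factorial(j)     # columns (x_i - x0)^j / j!
                          for j in range(p + 1)])
    w = np.exp(-0.5 * (u / h) ** 2)                      # Gaussian kernel weights w_i(x0)
    XtW = Xx.T * w                                       # equals Xx.T @ diag(w)
    a = np.linalg.solve(XtW @ Xx, XtW @ Y)               # a = (Xx^T Wx Xx)^(-1) Xx^T Wx Y
    return a[0]                                          # r_hat(x0) = a_0

# Usage (hypothetical x, Y from the simulation sketch):
# r_hat = np.array([local_polynomial(x0, x, Y, h=0.1, p=1) for x0 in np.linspace(0, 1, 200)])
```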
Local Polynomials
- p = 1 is the most popular case; it is called local linear regression
- p = 0 gives back the kernel estimator
- This is again a linear smoother: the estimate r̂_n(x) is the first component of

  â = (X_x^T W_x X_x)^{-1} X_x^T W_x Y

  so the fitted values can be written as r̂ = L Y
- We can choose the bandwidth h by cross-validation (a sketch follows below)
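A sketch of leave-one-out cross-validation for the bandwidth, using the standard shortcut for linear smoothers CV(h) = (1/n) ∑_i ((Y_i − r̂(x_i))/(1 − L_ii))². It is written here for the kernel (p = 0) smoothing matrix, but the same formula applies to any linear smoother r̂ = LY; the helper names and the grid of bandwidths are assumptions.

```python
import numpy as np

# Sketch: choose the bandwidth h by leave-one-out cross-validation,
# using the linear-smoother shortcut CV(h) = mean(((Y - L Y) / (1 - diag(L)))^2).
def nw_smoothing_matrix(x, h):
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)   # K((x_i - x_j) / h)
    return W / W.sum(axis=1, keepdims=True)                   # rows l(x_i)^T sum to one

def loocv_score(x, Y, h):
    L = nw_smoothing_matrix(x, h)
    resid = Y - L @ Y
    return np.mean((resid / (1 - np.diag(L))) ** 2)

# Usage (hypothetical x, Y from the simulation sketch):
# hs = np.linspace(0.02, 0.3, 40)
# h_best = hs[np.argmin([loocv_score(x, Y, h) for h in hs])]
```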
Local Polynomials
- Comparing the bias-variance tradeoff

  R(r̂_n, r) ≈ (h⁴/4) Bias² + (1/(nh)) Variance

- for the Nadaraya–Watson kernel estimator (depends on the first and second derivatives of the density f)

  Bias² = (∫ x² K(x) dx)² ∫ ( r''(x) + 2 r'(x) f'(x)/f(x) )² dx

- and for the local linear estimator (no dependence on the density f)

  Bias² = (∫ x² K(x) dx)² ∫ r''(x)² dx
Penalized Regression
- An alternative to local averaging
- Minimize the penalized sums of squares

  M(λ) = ∑_{i=1}^n (Y_i − r̂_n(x_i))² + λ J(r)

- with a roughness penalty, for example

  J(r) = ∫ (r''(x))² dx

- The parameter λ controls the trade-off between fit and smoothness:
  - for λ = 0, r̂_n is an interpolating function
  - as λ → ∞, r̂_n converges to the least squares line
  - for 0 < λ < ∞, r̂_n is a spline (a piecewise polynomial)
Penalized Regression
Source: Wasserman (2006)
Penalized Regression
- Theorem: The function r̂_n(x) that minimizes M(λ) with the penalty from the previous slide is a natural cubic spline with knots at the data points. The estimator r̂_n is called a smoothing spline.
- This theorem doesn't give an explicit form of r̂_n
- However, we can build an explicit basis using B-splines:

  r̂_n(x) = ∑_{j=1}^N β_j B_j(x)

  where B_1, ..., B_N are a B-spline basis with N = n + 4
- Thus, we only need to find the coefficients β = (β_1, ..., β_N)^T
Penalized Regression
- Rewriting M(λ) in terms of β gives

  (Y − Bβ)^T (Y − Bβ) + λ β^T Ω β

  with B_ij = B_j(x_i) and Ω_jk = ∫ B_j''(x) B_k''(x) dx
- Taking the derivative with respect to β and setting it to zero, we find

  β̂ = (B^T B + λ Ω)^{-1} B^T Y

- This is another linear smoother,

  r̂ = B (B^T B + λ Ω)^{-1} B^T Y = L Y

  with fitted values r̂ a smoothed version of the original observations Y
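A sketch of this penalized fit with a cubic B-spline basis from SciPy; the knot layout (and hence the basis size) is a simplification relative to the N = n + 4 basis above, and the penalty matrix Ω is approximated by quadrature rather than computed exactly.

```python
import numpy as np
from scipy.interpolate import BSpline

# Sketch: penalized B-spline fit beta = (B^T B + lam * Omega)^(-1) B^T Y.
def fit_penalized_spline(x, Y, lam, n_knots=20):
    k = 3                                                    # cubic B-splines
    knots = np.linspace(x.min(), x.max(), n_knots)
    t = np.r_[[x.min()] * k, knots, [x.max()] * k]           # padded knot vector
    N = len(t) - k - 1                                       # number of basis functions

    B = BSpline.design_matrix(x, t, k).toarray()             # B_ij = B_j(x_i)

    # Omega_jk = int B_j''(u) B_k''(u) du, approximated on a fine grid
    grid = np.linspace(x.min(), x.max(), 2000)
    D2 = np.column_stack([BSpline(t, np.eye(N)[j], k)(grid, nu=2)
                          for j in range(N)])
    Omega = D2.T @ D2 * (grid[1] - grid[0])

    beta = np.linalg.solve(B.T @ B + lam * Omega, B.T @ Y)   # penalized LS coefficients
    return B @ beta                                          # fitted values r_hat = L Y

# Usage (hypothetical x, Y from the simulation sketch):
# r_hat = fit_penalized_spline(x, Y, lam=1e-4)
```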
References
- Wasserman (2006). All of Nonparametric Statistics
- Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning