Basis Expansions
Niels Richard Hansen (Univ. Copenhagen), Statistics Learning, October 6, 2011
Source: web.math.ku.dk/~richard/courses/statlearn2011/…

  • Basis Expansions

    With X ∈ R^p and Y ∈ R the function

        f(x) = E(Y | X = x)

    is typically globally a non-linear function. We discuss situations where p is small or moderate, but where the function is complicated.

    A basis function expansion of f is an expansion

        f(x) = \sum_{m=1}^{M} \beta_m h_m(x)

    with h_m : R^p → R for m = 1, ..., M.

    The basis functions are chosen and fixed, and the parameters β_m for m = 1, ..., M are estimated. This is a linear model in the derived variables h_1(X), ..., h_M(X).

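    The remark that a basis expansion is just a linear model in the derived variables h_1(X), ..., h_M(X) translates directly into code. A minimal sketch in Python/NumPy, assuming a fixed list of basis functions and least squares estimation (the particular basis and the simulated data are illustrative, not from the slides):

        import numpy as np

        # Fixed, chosen basis functions h_1, ..., h_M (here p = 1).
        basis = [lambda x: np.ones_like(x),   # h_1(x) = 1
                 lambda x: x,                 # h_2(x) = x
                 lambda x: np.sin(x),         # h_3, h_4: illustrative choices
                 lambda x: np.cos(x)]

        def fit_basis_expansion(x, y, basis):
            """Least squares in the derived variables h_1(x), ..., h_M(x)."""
            H = np.column_stack([h(x) for h in basis])    # N x M design matrix
            beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # minimize ||y - H beta||^2
            return beta

        rng = np.random.default_rng(0)
        x = rng.uniform(0, 6, size=200)
        y = np.sin(x) + rng.normal(scale=0.3, size=200)   # simulated data
        beta_hat = fit_basis_expansion(x, y, basis)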

  • Polynomial Bases

    Monomials are classical basis functions:

        h_m(x) = x_1^{r_1} x_2^{r_2} \cdots x_p^{r_p}

    with r_i ∈ {0, ..., d} and r_1 + ... + r_p ≤ d. This basis spans the polynomials of degree ≤ d.

    If the linear models provide first order Taylor approximations of the function, expansions in the degree d polynomials provide order d Taylor approximations.

    However, the number of basis functions is \binom{p+d}{d}, which grows rapidly with both p and d (of order p^d for fixed d).

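    To see the growth concretely, the admissible exponent vectors (r_1, ..., r_p) can be enumerated by brute force; the count is \binom{p+d}{d}. A small sketch (the enumeration scheme is illustrative):

        from itertools import product
        from math import comb

        def monomial_exponents(p, d):
            """All (r_1, ..., r_p) with r_i in {0, ..., d} and r_1 + ... + r_p <= d."""
            return [r for r in product(range(d + 1), repeat=p) if sum(r) <= d]

        # The count equals binomial(p + d, d) and grows quickly:
        for p, d in [(2, 3), (3, 4), (5, 3)]:
            exps = monomial_exponents(p, d)
            assert len(exps) == comb(p + d, d)
            print(p, d, len(exps))   # 10, 35, 56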

  • Indicators

    A completely different, non-differentiable idea is to approximate f locally as a constant. Box-type basis functions are

        h_m(x) = 1(l_1 ≤ x_1 ≤ r_1) \cdots 1(l_p ≤ x_p ≤ r_p)

    with l_i ≤ r_i and l_i, r_i ∈ [−∞, ∞] for i = 1, ..., p.

    If the boxes are disjoint, the columns in the X-matrix for the derived variables are orthogonal:

        X_{im} = h_m(x_i) ∈ {0, 1}.

    We can think of these columns as dummy variables representing the boxes. Consequently, with least squares estimation,

        \hat{\beta}_m = \frac{1}{N_m} \sum_{i : h_m(x_i) = 1} y_i,  \qquad  N_m = \sum_{i=1}^{N} 1(h_m(x_i) = 1).

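    Because the dummy columns are orthogonal, least squares reduces to the box-wise averages above. A small sketch with p = 1 and disjoint intervals as boxes (the interval endpoints and simulated data are illustrative):

        import numpy as np

        rng = np.random.default_rng(1)
        x = rng.uniform(0, 3, size=300)
        y = np.sin(x) + rng.normal(scale=0.2, size=300)

        # Disjoint boxes [l_m, r_m) covering the range of the data.
        edges = np.array([0.0, 1.0, 2.0, 3.0])

        beta_hat = []
        for l, r in zip(edges[:-1], edges[1:]):
            in_box = (l <= x) & (x < r)         # h_m(x_i) = 1(l <= x_i < r)
            beta_hat.append(y[in_box].mean())   # least squares = box average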

  • Basis Strategies

    The size of the typical set of basis functions increases rapidly with p. What are feasible strategies for basis selection?

    Restriction: Choose a priori only special basis functions.
    Additivity, with h_{mj} : R → R:

        h_m(x) = \sum_{j=1}^{p} h_{mj}(x_j)

    Radial basis functions:

        h_m(x) = D\left( \frac{\|x - \xi_m\|}{\lambda_m} \right)

    Selection: As in variable selection – implement exhaustive or step-wise inclusions/exclusions of basis functions.

    Penalization: As in ridge regression – keep the entire set of basis functions but penalize the size of the parameter vector. (A small sketch follows below.)

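    A sketch of the restriction and penalization strategies combined: radial basis functions with D taken to be the Gaussian shape exp(−t^2/2), fitted by ridge regression. The choice of D, the centres ξ_m, the common scale, and the penalty are all illustrative assumptions:

        import numpy as np

        def rbf_design(x, centers, scale):
            """Columns h_m(x) = D(|x - xi_m| / lambda) with Gaussian D."""
            d = np.abs(x[:, None] - centers[None, :]) / scale
            return np.exp(-0.5 * d**2)

        def ridge_fit(H, y, penalty):
            """Keep all basis functions; penalize ||beta||^2."""
            M = H.shape[1]
            return np.linalg.solve(H.T @ H + penalty * np.eye(M), H.T @ y)

        rng = np.random.default_rng(2)
        x = rng.uniform(0, 6, size=200)
        y = np.sin(x) + rng.normal(scale=0.3, size=200)
        centers = np.linspace(0, 6, 25)   # a generous set of basis functions ...
        beta = ridge_fit(rbf_design(x, centers, scale=0.5), y, penalty=1.0)  # ... tamed by the penalty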

  • Figure 5.1


  • Figure 5.2


  • Splines – p = 1

    Define h_1(x) = 1, h_2(x) = x and

        h_{m+2}(x) = (x − \xi_m)_+,  \qquad  t_+ = \max\{0, t\},

    for ξ_1, ..., ξ_K the knots. Then

        f(x) = \sum_{m=1}^{K+2} \beta_m h_m(x)

    is a piecewise linear, continuous function. More generally, one order-R spline basis with knots ξ_1, ..., ξ_K is

        h_1(x) = 1, \; \ldots, \; h_R(x) = x^{R-1},  \qquad  h_{R+l}(x) = (x − \xi_l)_+^{R-1},  \quad l = 1, \ldots, K.

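    The order-R truncated power basis is a few lines of code (knots and evaluation points illustrative):

        import numpy as np

        def truncated_power_basis(x, knots, R=4):
            """h_1, ..., h_R: 1, x, ..., x^{R-1}; h_{R+l}(x) = (x - xi_l)_+^{R-1}."""
            cols = [x**j for j in range(R)]
            cols += [np.maximum(0.0, x - xi)**(R - 1) for xi in knots]
            return np.column_stack(cols)

        x = np.linspace(0, 1, 101)
        H = truncated_power_basis(x, knots=[0.25, 0.5, 0.75], R=4)  # cubic splines
        print(H.shape)   # (101, 7): R polynomial columns + K truncated powers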

  • Figure 5.3


  • Natural Cubic Splines

    Splines of order R are polynomials of degree R − 1 beyond the boundary knots ξ_1 and ξ_K. The natural cubic splines are the splines of order 4 that are linear beyond the two boundary knots. With

        f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{k=1}^{K} \theta_k (x − \xi_k)_+^3

    the restriction is that β_2 = β_3 = 0 and

        \sum_{k=1}^{K} \theta_k = \sum_{k=1}^{K} \theta_k \xi_k = 0.

    The functions

        N_1(x) = 1,  \qquad  N_2(x) = x

    and

        N_{2+l}(x) = \frac{(x − \xi_l)_+^3 − (x − \xi_K)_+^3}{\xi_K − \xi_l} − \frac{(x − \xi_{K-1})_+^3 − (x − \xi_K)_+^3}{\xi_K − \xi_{K-1}}

    for l = 1, ..., K − 2 form a basis.
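    The basis formula for N_{2+l} codes up directly. A sketch, writing d_l(x) = ((x − ξ_l)_+^3 − (x − ξ_K)_+^3)/(ξ_K − ξ_l) so that N_{2+l} = d_l − d_{K−1}:

        import numpy as np

        def natural_cubic_basis(x, knots):
            """Columns N_1(x) = 1, N_2(x) = x, N_{2+l}(x) = d_l(x) - d_{K-1}(x)."""
            xi = np.asarray(knots, dtype=float)
            K = len(xi)
            def d(l):  # 0-based: l = 0, ..., K - 2 corresponds to d_1, ..., d_{K-1}
                return (np.maximum(0, x - xi[l])**3
                        - np.maximum(0, x - xi[K - 1])**3) / (xi[K - 1] - xi[l])
            cols = [np.ones_like(x), x] + [d(l) - d(K - 2) for l in range(K - 2)]
            return np.column_stack(cols)

        x = np.linspace(0, 1, 101)
        N = natural_cubic_basis(x, knots=[0.1, 0.3, 0.5, 0.7, 0.9])
        print(N.shape)   # (101, K) with K = 5 knots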

  • B-Splines (Basis Splines, or YASB)

    Yet Another Spline Basis ...

    Defined by a recursion in the order r:

        B_{k,1}(x) = \begin{cases} 1 & \text{if } \tau_k \le x \le \tau_{k+1} \\ 0 & \text{otherwise} \end{cases}

    with the augmented knot sequence

        \tau_1 \le \cdots \le \tau_R = \xi_0 < \tau_{R+1} = \xi_1 < \cdots < \tau_{R+K} = \xi_K < \tau_{R+K+1} = \xi_{K+1} \le \cdots \le \tau_{2R+K}

    and, for r > 1,

        B_{k,r}(x) = \frac{x − \tau_k}{\tau_{k+r-1} − \tau_k} B_{k,r-1}(x) + \frac{\tau_{k+r} − x}{\tau_{k+r} − \tau_{k+1}} B_{k+1,r-1}(x)

    for k = 1, ..., K + 2R − r.

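    The recursion translates directly into code. A sketch (two standard implementation conventions that the slide leaves implicit: intervals are taken half-open so each point falls in one box, and 0/0 terms at repeated knots are treated as 0):

        import numpy as np

        def bspline(x, tau, k, r):
            """B_{k,r}(x) by the Cox-de Boor recursion; k is 0-based here."""
            if r == 1:
                return np.where((tau[k] <= x) & (x < tau[k + 1]), 1.0, 0.0)
            def safe(num, den):  # treat 0/0 as 0 at repeated knots
                return np.where(den > 0, num / np.where(den > 0, den, 1.0), 0.0)
            return (safe(x - tau[k], tau[k + r - 1] - tau[k]) * bspline(x, tau, k, r - 1)
                    + safe(tau[k + r] - x, tau[k + r] - tau[k + 1]) * bspline(x, tau, k + 1, r - 1))

        tau = np.array([0, 0, 0, 0, 0.25, 0.5, 0.75, 1, 1, 1, 1])  # R = 4, K = 3
        x = np.linspace(0, 0.999, 200)
        B = np.column_stack([bspline(x, tau, k, 4) for k in range(len(tau) - 4)])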

  • Figure 5.20 – B-splines


  • Knot Placing Strategies

    How do you determine the knots?

    Fix the number (the complexity parameter) and spread them uniformly over the whole range of the data.

    Fix the number and spread them according to the empirical distribution (see the sketch below).

    Adaptive selection of the number and/or the locations – ranging from ad hoc adaptation to a full-fledged, complete estimation from data.

    Smoothing methods determine their locations automatically.

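    A one-line version of the second strategy, placing a fixed number of interior knots at empirical quantiles (a sketch; the data are illustrative):

        import numpy as np

        rng = np.random.default_rng(3)
        x = rng.exponential(size=500)   # skewed data
        n_knots = 5
        # Knots at equally spaced empirical quantiles:
        knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
        print(knots)   # denser where the data are denser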

  • Smoothing Splines

    Allowing E(Y | X = x) = f(x) to be an arbitrary, but twice differentiable, function, define the penalized residual sum of squares

        RSS(f, \lambda) = \sum_{i=1}^{N} (y_i − f(x_i))^2 + \lambda \int_a^b f''(t)^2 \, dt.

    If f_λ is a minimizer of RSS(f, λ), the natural cubic splines with knots in x_1, ..., x_N have the properties that

    they can interpolate: there is a natural cubic spline f_0 with f_0(x_i) = f_λ(x_i),

    and among all such interpolants f, f_0 attains the smallest value of \int_a^b f''(t)^2 \, dt.

    The solution f_λ(x) = \sum_{i=1}^{N} \theta_i N_i(x) is therefore a natural cubic spline.

  • Smoothing Splines

    In vector notation,

        f = N\theta

    with N_{ij} = N_j(x_i), and

        RSS(f, \lambda) = (y − f)^T (y − f) + \lambda \int_a^b f''(t)^2 \, dt
                        = (y − N\theta)^T (y − N\theta) + \lambda \theta^T \Omega_N \theta

    with

        \Omega_{N,ij} = \int_a^b N_i''(t) N_j''(t) \, dt.

    This generalized ridge regression problem has the solution

        \hat{\theta} = (N^T N + \lambda \Omega_N)^{-1} N^T y

    and the fitted values are

        \hat{f} = N (N^T N + \lambda \Omega_N)^{-1} N^T y.

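    Given the basis matrix N and the penalty matrix Ω_N, the estimate, the fitted values, and – anticipating the next slide – the effective degrees of freedom are a few lines of linear algebra. A sketch (constructing N and Ω_N for an actual natural cubic spline basis is omitted here):

        import numpy as np

        def smoothing_spline_fit(N, Omega, y, lam):
            """Generalized ridge: theta_hat = (N^T N + lam * Omega)^{-1} N^T y."""
            A = N.T @ N + lam * Omega
            theta_hat = np.linalg.solve(A, N.T @ y)
            fitted = N @ theta_hat                 # f_hat = S_lambda y
            S = N @ np.linalg.solve(A, N.T)        # smoother matrix S_lambda
            df = np.trace(S)                       # effective degrees of freedom
            return theta_hat, fitted, df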

  • Degrees Of Freedom

    Writing

        S_\lambda = N (N^T N + \lambda \Omega_N)^{-1} N^T

    and arguing by analogy with projection matrices, the effective degrees of freedom is

        df_\lambda = \mathrm{trace}(S_\lambda).

    The value of df_λ decreases monotonically from N towards 2 as λ increases from 0 to ∞; in the limit only the unpenalized linear part of the fit remains, since linear functions have zero second derivative.

    The matrix S_λ is known as a spline smoother, and in practice it is common to specify the degrees of freedom instead of λ.


  • Figure 5.8 – Smoother Matrix


  • Multidimensional Splines

    Two multivariate versions.

    Tensor products. Consider a basis consisting of the products

        B_{i_1,R}(x_1) B_{i_2,R}(x_2) \cdots B_{i_p,R}(x_p)

    – compare with the monomial basis for polynomials.

    Thin plate splines. If p = 2, consider minimizing

        \sum_{i=1}^{N} (y_i − f(x_i))^2 + \lambda \int_A (\partial_1^2 f)^2 + 2(\partial_1 \partial_2 f)^2 + (\partial_2^2 f)^2.

    The solution is a function

        f(x) = \beta_0 + x^T \beta + \sum_{i=1}^{N} \alpha_i \eta(\|x − x_i\|)

    with η(z) = z^2 log(z^2) – thus a radial basis function expansion.

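    For the tensor product construction with p = 2, the design matrix contains all pairwise products of two univariate bases. A sketch (H1 and H2 would be built with, e.g., the B-spline code above):

        import numpy as np

        def tensor_product_design(H1, H2):
            """All products h_i(x_1) * h_j(x_2); H1 is N x M1, H2 is N x M2."""
            N, M1 = H1.shape
            _, M2 = H2.shape
            return (H1[:, :, None] * H2[:, None, :]).reshape(N, M1 * M2)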

  • Figure 5.10 – Tensor Products of B-splines


  • Data acquisition – and interpretations

    In this course we consider observational data. Roughly we have:

    Observational data: both X and Y are sampled from an (imaginary) population.

    Non-observational data: e.g. a designed experiment where we fix X by the design and sample Y.

    For observational data, how should we interpret Y | X?


  • Example

    In toxicology we are interested in measuring the effect of a (toxic) compound on a plant, say.

    Consider a naturally occurring compound A and a plant Z.

    Full observational study: On N randomly selected fields we measure Y = the amount of plant Z and X = the amount of compound A.

    Semi-observational study: On each of N randomly selected fields we plant R plants Z. After T days we measure Y = the amount of plant Z and X = the amount of compound A.

    Designed experiment: On each of N identical fields we plant R plants Z. We add, according to a design scheme, the amount X_i of compound A to field i. After T days we measure Y = the amount of plant Z.


  • Causality

    In toxicology – as in most parts of science – the basic question is one of causal relations.

    Is the compound A toxic? Does it actually kill plant Z?

    The pragmatic farmer: Can I grow plant Z on my soil?

    The former question can only be answered by the designed experiment. The latter may be answered by prediction of the yield based on a measurement of compound A.

    The latter prediction is not justified by causality – only by correlation.


  • Probability Models and Causality

    Probability theory is completely blind to causation!

    From a technical point of view, the regression of Y on X is carried out in precisely the same manner whether the data are observational or from a designed experiment. The conditional probability model is the same.

    For the ideal designed experiment we control X, and all systematic variation in Y can only be ascribed to X.

    For the observational study we observe the pair (X, Y). Systematic variation in Y can be due to X, but there is no evidence of causality.


  • Interventions

    Many, many studies are observational, and many, many conclusions are causal.

    If the children in Gentofte get higher grades compared to Copenhagen, should I put my child in one of their schools?

    If the children in large schools get higher grades compared to children in small schools, should we build larger schools?

    If people on night-shifts get more ill than those with a regular job, is it then dangerous to take night-shifts? Should I not take a night-shift job?

    If smokers more frequently get lung cancer, is that because they smoke? Should I stop smoking?

    All four final questions are phrased as interventions. Data from an observational study do not alone provide information on the result of an intervention.


  • What if Y | X then?

    For observational data we must think of Y | X as an observational conditional distribution, meaning that (X, Y) must be sampled in exactly the same way as (x_1, y_1), ..., (x_N, y_N) were.

    Then, if X = x but Y has not been disclosed to us, Y | X = x is a sensible conditional distribution of Y.

    If we remember to gather data using the same principles as when we later want to use Y | X for predictions, we can expect that Y | X is useful for predictions – even if there is no alternative evidence of causation.

    Violation of a consistent sampling scheme is the Achilles heel of predictions based on observational data. And we cannot trust predictions if we make interventions.
