Basis Expansions
Niels Richard Hansen (Univ. Copenhagen), Statistics Learning, October 6, 2011
Source: web.math.ku.dk/~richard/courses/statlearn2011/…

  • Basis Expansions

    With X ∈ R^p and Y ∈ R the function

        f(x) = E(Y | X = x)

    is typically globally a non-linear function. We discuss situations where p is small or moderate, but where the function is complicated.

    A basis function expansion of f is an expansion

        f(x) = \sum_{m=1}^{M} \beta_m h_m(x)

    with h_m : R^p → R for m = 1, ..., M.

    The basis functions are chosen and fixed, and the parameters β_m for m = 1, ..., M are estimated. This is a linear model in the derived variables h_1(X), ..., h_M(X).

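    The remark that a basis expansion is just a linear model in the derived variables h_1(X), ..., h_M(X) translates directly into code. A minimal sketch in Python/NumPy, assuming a fixed list of basis functions and least squares estimation (the particular basis and the simulated data are illustrative, not from the slides):

        import numpy as np

        # Fixed, chosen basis functions h_1, ..., h_M (here p = 1).
        basis = [lambda x: np.ones_like(x),   # h_1(x) = 1
                 lambda x: x,                 # h_2(x) = x
                 lambda x: np.sin(x),         # h_3, h_4: illustrative choices
                 lambda x: np.cos(x)]

        def fit_basis_expansion(x, y, basis):
            """Least squares in the derived variables h_1(x), ..., h_M(x)."""
            H = np.column_stack([h(x) for h in basis])    # N x M design matrix
            beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # minimize ||y - H beta||^2
            return beta

        rng = np.random.default_rng(0)
        x = rng.uniform(0, 6, size=200)
        y = np.sin(x) + rng.normal(scale=0.3, size=200)   # simulated data
        beta_hat = fit_basis_expansion(x, y, basis)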

  • Polynomial Bases

    Monomials are classical basis functions:

        h_m(x) = x_1^{r_1} x_2^{r_2} \cdots x_p^{r_p}

    with r_i ∈ {0, ..., d} and r_1 + ... + r_p ≤ d. This basis spans the polynomials of degree ≤ d.

    If the linear models provide first order Taylor approximations of the function, expansions in the degree d polynomials provide order d Taylor approximations.

    However, the number of basis functions is \binom{p+d}{d}, which grows rapidly with both p and d (of order p^d for fixed d).

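    To see the growth concretely, the admissible exponent vectors (r_1, ..., r_p) can be enumerated by brute force; the count is \binom{p+d}{d}. A small sketch (the enumeration scheme is illustrative):

        from itertools import product
        from math import comb

        def monomial_exponents(p, d):
            """All (r_1, ..., r_p) with r_i in {0, ..., d} and r_1 + ... + r_p <= d."""
            return [r for r in product(range(d + 1), repeat=p) if sum(r) <= d]

        # The count equals binomial(p + d, d) and grows quickly:
        for p, d in [(2, 3), (3, 4), (5, 3)]:
            exps = monomial_exponents(p, d)
            assert len(exps) == comb(p + d, d)
            print(p, d, len(exps))   # 10, 35, 56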

  • Indicators

    A completely different, non-differentiable idea is to approximate f locally as a constant. Box-type basis functions are

        h_m(x) = 1(l_1 ≤ x_1 ≤ r_1) \cdots 1(l_p ≤ x_p ≤ r_p)

    with l_i ≤ r_i and l_i, r_i ∈ [−∞, ∞] for i = 1, ..., p.

    If the boxes are disjoint, the columns in the X-matrix for the derived variables are orthogonal:

        X_{im} = h_m(x_i) ∈ {0, 1}.

    We can think of these columns as dummy variables representing the boxes. Consequently, with least squares estimation,

        \hat{\beta}_m = \frac{1}{N_m} \sum_{i : h_m(x_i) = 1} y_i,  \qquad  N_m = \sum_{i=1}^{N} 1(h_m(x_i) = 1).

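    Because the dummy columns are orthogonal, least squares reduces to the box-wise averages above. A small sketch with p = 1 and disjoint intervals as boxes (the interval endpoints and simulated data are illustrative):

        import numpy as np

        rng = np.random.default_rng(1)
        x = rng.uniform(0, 3, size=300)
        y = np.sin(x) + rng.normal(scale=0.2, size=300)

        # Disjoint boxes [l_m, r_m) covering the range of the data.
        edges = np.array([0.0, 1.0, 2.0, 3.0])

        beta_hat = []
        for l, r in zip(edges[:-1], edges[1:]):
            in_box = (l <= x) & (x < r)         # h_m(x_i) = 1(l <= x_i < r)
            beta_hat.append(y[in_box].mean())   # least squares = box average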

  • Basis Strategies

    The size of the typical set of basis functions increases rapidly with p. What are feasible strategies for basis selection?

    Restriction: Choose a priori only special basis functions.
    Additivity, with h_{mj} : R → R:

        h_m(x) = \sum_{j=1}^{p} h_{mj}(x_j)

    Radial basis functions:

        h_m(x) = D\left( \frac{\|x - \xi_m\|}{\lambda_m} \right)

    Selection: As in variable selection – implement exhaustive or step-wise inclusions/exclusions of basis functions.

    Penalization: As in ridge regression – keep the entire set of basis functions but penalize the size of the parameter vector. (A small sketch follows below.)

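    A sketch of the restriction and penalization strategies combined: radial basis functions with D taken to be the Gaussian shape exp(−t^2/2), fitted by ridge regression. The choice of D, the centres ξ_m, the common scale, and the penalty are all illustrative assumptions:

        import numpy as np

        def rbf_design(x, centers, scale):
            """Columns h_m(x) = D(|x - xi_m| / lambda) with Gaussian D."""
            d = np.abs(x[:, None] - centers[None, :]) / scale
            return np.exp(-0.5 * d**2)

        def ridge_fit(H, y, penalty):
            """Keep all basis functions; penalize ||beta||^2."""
            M = H.shape[1]
            return np.linalg.solve(H.T @ H + penalty * np.eye(M), H.T @ y)

        rng = np.random.default_rng(2)
        x = rng.uniform(0, 6, size=200)
        y = np.sin(x) + rng.normal(scale=0.3, size=200)
        centers = np.linspace(0, 6, 25)   # a generous set of basis functions ...
        beta = ridge_fit(rbf_design(x, centers, scale=0.5), y, penalty=1.0)  # ... tamed by the penalty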

  • Figure 5.1


  • Figure 5.2


  • Splines – p = 1

    Define h_1(x) = 1, h_2(x) = x and

        h_{m+2}(x) = (x − \xi_m)_+,  \qquad  t_+ = \max\{0, t\},

    for ξ_1, ..., ξ_K the knots. Then

        f(x) = \sum_{m=1}^{K+2} \beta_m h_m(x)

    is a piecewise linear, continuous function. More generally, one order-R spline basis with knots ξ_1, ..., ξ_K is

        h_1(x) = 1, \; \ldots, \; h_R(x) = x^{R-1},  \qquad  h_{R+l}(x) = (x − \xi_l)_+^{R-1},  \quad l = 1, \ldots, K.

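    The order-R truncated power basis is a few lines of code (knots and evaluation points illustrative):

        import numpy as np

        def truncated_power_basis(x, knots, R=4):
            """h_1, ..., h_R: 1, x, ..., x^{R-1}; h_{R+l}(x) = (x - xi_l)_+^{R-1}."""
            cols = [x**j for j in range(R)]
            cols += [np.maximum(0.0, x - xi)**(R - 1) for xi in knots]
            return np.column_stack(cols)

        x = np.linspace(0, 1, 101)
        H = truncated_power_basis(x, knots=[0.25, 0.5, 0.75], R=4)  # cubic splines
        print(H.shape)   # (101, 7): R polynomial columns + K truncated powers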

  • Figure 5.3


  • Natural Cubic Splines

    Splines of order R are polynomials of degree R − 1 beyond the boundary knots ξ_1 and ξ_K. The natural cubic splines are the splines of order 4 that are linear beyond the two boundary knots. With

        f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{k=1}^{K} \theta_k (x − \xi_k)_+^3

    the restriction is that β_2 = β_3 = 0 and

        \sum_{k=1}^{K} \theta_k = \sum_{k=1}^{K} \theta_k \xi_k = 0.

    The functions

        N_1(x) = 1,  \qquad  N_2(x) = x

    and

        N_{2+l}(x) = \frac{(x − \xi_l)_+^3 − (x − \xi_K)_+^3}{\xi_K − \xi_l} − \frac{(x − \xi_{K-1})_+^3 − (x − \xi_K)_+^3}{\xi_K − \xi_{K-1}}

    for l = 1, ..., K − 2 form a basis.
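    The basis formula for N_{2+l} codes up directly. A sketch, writing d_l(x) = ((x − ξ_l)_+^3 − (x − ξ_K)_+^3)/(ξ_K − ξ_l) so that N_{2+l} = d_l − d_{K−1}:

        import numpy as np

        def natural_cubic_basis(x, knots):
            """Columns N_1(x) = 1, N_2(x) = x, N_{2+l}(x) = d_l(x) - d_{K-1}(x)."""
            xi = np.asarray(knots, dtype=float)
            K = len(xi)
            def d(l):  # 0-based: l = 0, ..., K - 2 corresponds to d_1, ..., d_{K-1}
                return (np.maximum(0, x - xi[l])**3
                        - np.maximum(0, x - xi[K - 1])**3) / (xi[K - 1] - xi[l])
            cols = [np.ones_like(x), x] + [d(l) - d(K - 2) for l in range(K - 2)]
            return np.column_stack(cols)

        x = np.linspace(0, 1, 101)
        N = natural_cubic_basis(x, knots=[0.1, 0.3, 0.5, 0.7, 0.9])
        print(N.shape)   # (101, K) with K = 5 knots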

  • B-Splines (Basis Splines, or YASB)

    Yet Another Spline Basis ...

    Defined by a recursion in the order r:

        B_{k,1}(x) = \begin{cases} 1 & \text{if } \tau_k \le x \le \tau_{k+1} \\ 0 & \text{otherwise} \end{cases}

    with the augmented knot sequence

        \tau_1 \le \cdots \le \tau_R = \xi_0 < \tau_{R+1} = \xi_1 < \cdots < \tau_{R+K} = \xi_K < \tau_{R+K+1} = \xi_{K+1} \le \cdots \le \tau_{2R+K}

    and, for r > 1,

        B_{k,r}(x) = \frac{x − \tau_k}{\tau_{k+r-1} − \tau_k} B_{k,r-1}(x) + \frac{\tau_{k+r} − x}{\tau_{k+r} − \tau_{k+1}} B_{k+1,r-1}(x)

    for k = 1, ..., K + 2R − r.

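    The recursion translates directly into code. A sketch (two standard implementation conventions that the slide leaves implicit: intervals are taken half-open so each point falls in one box, and 0/0 terms at repeated knots are treated as 0):

        import numpy as np

        def bspline(x, tau, k, r):
            """B_{k,r}(x) by the Cox-de Boor recursion; k is 0-based here."""
            if r == 1:
                return np.where((tau[k] <= x) & (x < tau[k + 1]), 1.0, 0.0)
            def safe(num, den):  # treat 0/0 as 0 at repeated knots
                return np.where(den > 0, num / np.where(den > 0, den, 1.0), 0.0)
            return (safe(x - tau[k], tau[k + r - 1] - tau[k]) * bspline(x, tau, k, r - 1)
                    + safe(tau[k + r] - x, tau[k + r] - tau[k + 1]) * bspline(x, tau, k + 1, r - 1))

        tau = np.array([0, 0, 0, 0, 0.25, 0.5, 0.75, 1, 1, 1, 1])  # R = 4, K = 3
        x = np.linspace(0, 0.999, 200)
        B = np.column_stack([bspline(x, tau, k, 4) for k in range(len(tau) - 4)])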

  • Figure 5.20 – B-splines


  • Knot Placing Strategies

    How do you determine the knots?

    Fix the number (the complexity parameter) and spread them uniformly over the whole range of the data.

    Fix the number and spread them according to the empirical distribution (see the sketch below).

    Adaptive selection of the number and/or the locations – ranging from ad hoc adaptation to a full-fledged, complete estimation from data.

    Smoothing methods determine their locations automatically.

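    A one-line version of the second strategy, placing a fixed number of interior knots at empirical quantiles (a sketch; the data are illustrative):

        import numpy as np

        rng = np.random.default_rng(3)
        x = rng.exponential(size=500)   # skewed data
        n_knots = 5
        # Knots at equally spaced empirical quantiles:
        knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
        print(knots)   # denser where the data are denser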

  • Smoothing Splines

    Allowing E(Y | X = x) = f(x) to be an arbitrary, but twice differentiable, function, define the penalized residual sum of squares

        RSS(f, \lambda) = \sum_{i=1}^{N} (y_i − f(x_i))^2 + \lambda \int_a^b f''(t)^2 \, dt.

    If f_λ is a minimizer of RSS(f, λ), the natural cubic splines with knots in x_1, ..., x_N have the properties that

    they can interpolate: there is a natural cubic spline f_0 with f_0(x_i) = f_λ(x_i),

    and among all such interpolants f, f_0 attains the smallest value of \int_a^b f''(t)^2 \, dt.

    The solution f_λ(x) = \sum_{i=1}^{N} \theta_i N_i(x) is therefore a natural cubic spline.

  • Smoothing Splines

    In vector notation,

        f = N\theta

    with N_{ij} = N_j(x_i), and

        RSS(f, \lambda) = (y − f)^T (y − f) + \lambda \int_a^b f''(t)^2 \, dt
                        = (y − N\theta)^T (y − N\theta) + \lambda \theta^T \Omega_N \theta

    with

        \Omega_{N,ij} = \int_a^b N_i''(t) N_j''(t) \, dt.

    This generalized ridge regression problem has the solution

        \hat{\theta} = (N^T N + \lambda \Omega_N)^{-1} N^T y

    and the fitted values are

        \hat{f} = N (N^T N + \lambda \Omega_N)^{-1} N^T y.

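    Given the basis matrix N and the penalty matrix Ω_N, the estimate, the fitted values, and – anticipating the next slide – the effective degrees of freedom are a few lines of linear algebra. A sketch (constructing N and Ω_N for an actual natural cubic spline basis is omitted here):

        import numpy as np

        def smoothing_spline_fit(N, Omega, y, lam):
            """Generalized ridge: theta_hat = (N^T N + lam * Omega)^{-1} N^T y."""
            A = N.T @ N + lam * Omega
            theta_hat = np.linalg.solve(A, N.T @ y)
            fitted = N @ theta_hat                 # f_hat = S_lambda y
            S = N @ np.linalg.solve(A, N.T)        # smoother matrix S_lambda
            df = np.trace(S)                       # effective degrees of freedom
            return theta_hat, fitted, df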

  • Degrees Of Freedom

    Writing

        S_\lambda = N (N^T N + \lambda \Omega_N)^{-1} N^T

    and arguing by analogy with projection matrices, the effective degrees of freedom is

        df_\lambda = \mathrm{trace}(S_\lambda).

    The value of df_λ decreases monotonically from N towards 2 as λ increases from 0 to ∞; in the limit only the unpenalized linear part of the fit remains, since linear functions have zero second derivative.

    The matrix S_λ is known as a spline smoother, and in practice it is common to specify the degrees of freedom instead of λ.


  • Figure 5.8 – Smoother Matrix


  • Multidimensional Splines

    Two multivariate versions.

    Tensor products. Consider a basis consisting of the products

        B_{i_1,R}(x_1) B_{i_2,R}(x_2) \cdots B_{i_p,R}(x_p)

    – compare with the monomial basis for polynomials.

    Thin plate splines. If p = 2, consider minimizing

        \sum_{i=1}^{N} (y_i − f(x_i))^2 + \lambda \int_A (\partial_1^2 f)^2 + 2(\partial_1 \partial_2 f)^2 + (\partial_2^2 f)^2.

    The solution is a function

        f(x) = \beta_0 + x^T \beta + \sum_{i=1}^{N} \alpha_i \eta(\|x − x_i\|)

    with η(z) = z^2 log(z^2) – thus a radial basis function expansion.

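    For the tensor product construction with p = 2, the design matrix contains all pairwise products of two univariate bases. A sketch (H1 and H2 would be built with, e.g., the B-spline code above):

        import numpy as np

        def tensor_product_design(H1, H2):
            """All products h_i(x_1) * h_j(x_2); H1 is N x M1, H2 is N x M2."""
            N, M1 = H1.shape
            _, M2 = H2.shape
            return (H1[:, :, None] * H2[:, None, :]).reshape(N, M1 * M2)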

  • Figure 5.10 – Tensor Products of B-splines


  • Data acquisition – and interpretations

    In this course we consider observational data. Roughly we have:

    Observational data: both X and Y are sampled from an (imaginary) population.

    Non-observational data: e.g. a designed experiment where we fix X by the design and sample Y.

    For observational data, how should we interpret Y | X?


  • Example

    In toxicology we are interested in measuring the effect of a (toxic) compound on a plant, say.

    Consider a naturally occurring compound A and a plant Z.

    Full observational study: On N randomly selected fields we measure Y = the amount of plant Z and X = the amount of compound A.

    Semi-observational study: On each of N randomly selected fields we plant R plants Z. After T days we measure Y = the amount of plant Z and X = the amount of compound A.

    Designed experiment: On each of N identical fields we plant R plants Z. We add, according to a design scheme, the amount X_i of compound A to field i. After T days we measure Y = the amount of plant Z.


  • Causality

    In toxicology – as in most parts of science – the basic question is one of causal relations.

    Is the compound A toxic? Does it actually kill plant Z?

    The pragmatic farmer: Can I grow plant Z on my soil?

    The former question can only be answered by the designed experiment. The latter may be answered by prediction of the yield based on a measurement of compound A.

    The latter prediction is not justified by causality – only by correlation.


  • Probability Models and Causality

    Probability theory is completely blind to causation!

    From a technical point of view, the regression of Y on X is carried out in precisely the same manner whether the data are observational or from a designed experiment. The conditional probability model is the same.

    For the ideal designed experiment we control X, and all systematic variation in Y can only be ascribed to X.

    For the observational study we observe the pair (X, Y). Systematic variation in Y can be due to X, but there is no evidence of causality.


  • Interventions

    Many, many studies are observational, and many, many conclusions are causal.

    If the children in Gentofte get higher grades compared to Copenhagen, should I put my child in one of their schools?

    If the children in large schools get higher grades compared to children in small schools, should we build larger schools?

    If people on night-shifts get more ill than those with a regular job, is it then dangerous to take night-shifts? Should I not take a night-shift job?

    If smokers more frequently get lung cancer, is that because they smoke? Should I stop smoking?

    All four final questions are phrased as interventions. Data from an observational study do not alone provide information on the result of an intervention.


  • What if Y | X then?

    For observational data we must think of Y | X as an observational conditional distribution, meaning that (X, Y) must be sampled in exactly the same way as (x_1, y_1), ..., (x_N, y_N) were.

    Then, if X = x but Y has not been disclosed to us, Y | X = x is a sensible conditional distribution of Y.

    If we remember to gather data using the same principles as when we later want to use Y | X for predictions, we can expect that Y | X is useful for predictions – even if there is no alternative evidence of causation.

    Violation of a consistent sampling scheme is the Achilles heel of predictions based on observational data. And we cannot trust predictions if we make interventions.
