Smoothing Spline ANOVA (users.stat.umn.edu/~helwig/notes/ssanova-Notes.pdf)


Smoothing Spline ANOVA

Nathaniel E. Helwig

Assistant Professor of Psychology and Statistics
University of Minnesota (Twin Cities)

Updated 04-Jan-2017

Nathaniel E. Helwig (U of Minnesota) Smoothing Spline ANOVA Updated 04-Jan-2017 : Slide 1


Copyright

Copyright © 2017 by Nathaniel E. Helwig


Outline of Notes

1) Introduction: parametric regression; nonparametric regression; smoothing splines

2) Background Theory: averaging operators; Hilbert spaces; reproducing kernels

3) Estimation & Inference: penalized least squares; smoothing parameter selection; Bayesian confidence intervals

4) SSANOVA in Practice: one-way SSANOVA; two-way SSANOVA (additive); two-way SSANOVA (interactive)

For a thorough treatment see:

Gu, C. (2013). Smoothing spline ANOVA models, 2nd edition. New York: Springer-Verlag.


Introduction


Introduction Parametric Regression

Parametric Regression Model: Scalar Form

The multiple linear regression model has the form

yi = ∑_{j=1}^p bj xij + ei

for i ∈ {1, . . . , n}, where
yi ∈ R is the real-valued response for the i-th observation
bj ∈ R is the j-th predictor's regression slope
xij ∈ R is the j-th predictor for the i-th observation
ei ∼iid N(0, σ²) is Gaussian measurement error

Implies that (yi | xi1, . . . , xip) ∼ind N(∑_{j=1}^p bj xij, σ²)


Parametric Regression Model: Matrix Form

The multiple linear regression model has the form

y = Xb + e

where
y = (y1, . . . , yn)′ ∈ Rⁿ is the n × 1 response vector
X = [x1, . . . , xp] ∈ R^{n×p} is the n × p design matrix, where xj = (x1j, . . . , xnj)′ ∈ Rⁿ is the j-th predictor vector (n × 1)
b = (b1, . . . , bp)′ ∈ Rᵖ is the p × 1 vector of coefficients
e = (e1, . . . , en)′ ∈ Rⁿ is the n × 1 error vector

Implies that (y|x) ∼ N(Xb, σ²I_n)


Ordinary Least Squares Solution

The ordinary least squares (OLS) problem is

min_{b∈Rᵖ} (1/n) ‖y − Xb‖²  ←→  min_{b∈Rᵖ} (1/n) ∑_{i=1}^n (yi − ŷi)²

where ‖·‖ denotes the Euclidean norm and ŷi = ∑_{j=1}^p bj xij.

The OLS solution has the form

b̂ = (X′X)⁻¹X′y

and the fitted values corresponding to b̂ are given by

ŷ = Xb̂ = Hy

where H = X(X′X)⁻¹X′ is the hat matrix.
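These formulas can be checked numerically. The sketch below (Python with NumPy; an illustration added to these notes, not part of the original slides) verifies that the normal-equations solution matches the fitted values Hy, and that the hat matrix is symmetric and idempotent:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))              # n x p design matrix
b_true = np.array([1.0, -2.0, 0.5])
y = X @ b_true + rng.normal(scale=0.1, size=n)

b_hat = np.linalg.solve(X.T @ X, X.T @ y)      # (X'X)^{-1} X'y
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix X(X'X)^{-1}X'
y_hat = H @ y

assert np.allclose(H, H.T)               # H is symmetric
assert np.allclose(H @ H, H)             # ...and idempotent
assert np.allclose(X @ b_hat, y_hat)     # fitted values agree

e = y - y_hat                            # residuals
sigma2_hat = e @ e / (n - p)             # MSE estimate of sigma^2
print(b_hat, sigma2_hat)
```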


Summary of Results

Using the model assumption (y|x) ∼ N(Xb, σ²I_n), we have

b̂ ∼ N(b, σ²(X′X)⁻¹)

ŷ ∼ N(Xb, σ²H)

ê ∼ N(0, σ²(I_n − H))

where ê = y − ŷ is the residual vector.

Typically σ² is unknown, so we use the MSE σ̂² = (1/(n − p)) ∑_{i=1}^n êi².


Introduction Nonparametric Regression

Nonparametric Regression Model

The Gaussian nonparametric regression model has the form

yi = η(xi) + ei

for i ∈ {1, . . . , n}, where
yi ∈ R is the real-valued response for the i-th observation
xi ∈ Rᵖ is the predictor vector for the i-th observation
η : Rᵖ → R is an unknown smooth function
ei ∼iid N(0, σ²) is Gaussian measurement error

Implies that (yi | xi1, . . . , xip) ∼ind N(η(xi), σ²)


Additive versus Interactive Models

Suppose that xi = (xi1, xi2) with xi1 ∈ X1 and xi2 ∈ X2.

We could fit one of two possible models:

Additive : η(xi) = η0 + η1(xi1) + η2(xi2)

Interaction : η(xi) = η0 + η1(xi1) + η2(xi2) + η12(xi1, xi2)

where
η0 is a constant function
η1 is the main effect of the first predictor
η2 is the main effect of the second predictor
η12 is the interaction effect


Example 1: Continuous and Nominal Covariates

xi = (xi1, xi2) with xi1 ∈ [0,1] and xi2 ∈ {a,b}.

[Figure: two panels ("Additive" and "Interaction") plotting y against x1 ∈ [0, 1], with separate curves for x2 = a (solid) and x2 = b (dashed).]


Example 1: R Code

addfun = function(x1, x2){
  funval = sin(2*pi*x1)
  idx = which(x2 == "a")
  funval[idx] = funval[idx] + 2
  funval
}
intfun = function(x1, x2){
  funval = sin(2*pi*x1)
  idx = which(x2 == "a")
  funval[idx] = funval[idx] + 2 + sin(4*pi*x1[idx])
  funval
}

dev.new(width=12, height=6, noRStudioGD=TRUE)
par(mfrow=c(1,2))
x1 = seq(0, 1, length=200)
plot(x1, addfun(x1, rep("a",200)), type="l", ylim=c(-2,4), main="Additive",
     ylab="y", cex.axis=1.25, cex.lab=1.5, cex.main=3)
lines(x1, addfun(x1, rep("b",200)), lty=2)
legend("bottomleft", legend=c(expression(x[2]*" = "*a), expression(x[2]*" = "*b)),
       lty=1:2, bty="n", cex=1.5)
plot(x1, intfun(x1, rep("a",200)), type="l", ylim=c(-2,4), main="Interaction",
     ylab="y", cex.axis=1.25, cex.lab=1.5, cex.main=3)
lines(x1, intfun(x1, rep("b",200)), lty=2)
legend("bottomleft", legend=c(expression(x[2]*" = "*a), expression(x[2]*" = "*b)),
       lty=1:2, bty="n", cex=1.5)


Example 2: Two Continuous Covariates

xi = (xi1, xi2) with xi1, xi2 ∈ [0,1].

[Figure: two image plots ("Additive" and "Interaction") of the function surface over x1, x2 ∈ [0, 1].]


Example 2: R Code

addfun = function(x1, x2){
  sin(2*pi*x1) + cos(4*pi*x2*(1-x2))
}
intfun = function(x1, x2){
  sin(2*pi*x1) + cos(4*pi*x2*(1-x2)) + 2*sin(pi*(x1-x2))
}

xs = seq(0, 1, length=50)
xg = expand.grid(xs, xs)
dev.new(width=12, height=6, noRStudioGD=TRUE)
par(mfrow=c(1,2))
zmat = matrix(addfun(xg[,1], xg[,2]), 50, 50)
image(xs, xs, zmat, xlab="x1", ylab="x2", main="Additive",
      cex.axis=1.25, cex.lab=1.5, cex.main=3)
zmat = matrix(intfun(xg[,1], xg[,2]), 50, 50)
image(xs, xs, zmat, xlab="x1", ylab="x2", main="Interaction",
      cex.axis=1.25, cex.lab=1.5, cex.main=3)


Introduction Smoothing Splines

Smoothing Splines on {1, . . . , K}

Suppose xi ∈ {1, . . . , K} and note that ηf is a vector of length K.
f = (f1, . . . , fK)′ ∈ Rᴷ is the vector corresponding to ηf:
ηf(1) = f1, ηf(2) = f2, . . . , ηf(K) = fK
Let η̄f = ∑_{x=1}^K ηf(x)/K denote the mean.

A nominal smoothing spline is the ηλ ∈ Rᴷ that minimizes

(1/n) ∑_{i=1}^n (yi − ηf(xi))² + λ J(ηf)

where λ ≥ 0 is the smoothing parameter and J(ηf) is a roughness penalty:
J(ηf) = ∑_{x=1}^K (ηf(x) − η̄f)² to shrink towards a constant
J(ηf) = ∑_{x=1}^K ηf(x)² to shrink towards zero
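A minimal numeric sketch of the shrink-towards-constant version (Python with NumPy; the function name and setup are illustrative assumptions, not from the original notes): with an indicator design matrix Z and penalty matrix P = I − (1/K)11′, the minimizer solves (Z′Z/n + λP)f = Z′y/n.

```python
import numpy as np

def nominal_spline(y, x, K, lam):
    """Penalized LS on {1,...,K}: minimize
    (1/n) sum_i (y_i - f[x_i])^2 + lam * sum_k (f_k - mean(f))^2."""
    n = len(y)
    Z = np.zeros((n, K))
    Z[np.arange(n), x - 1] = 1.0            # indicator (one-hot) design matrix
    P = np.eye(K) - np.ones((K, K)) / K     # penalizes deviations from the mean
    return np.linalg.solve(Z.T @ Z / n + lam * P, Z.T @ y / n)

rng = np.random.default_rng(1)
x = rng.integers(1, 4, size=90)             # K = 3 nominal levels
y = np.array([0.0, 2.0, 4.0])[x - 1] + rng.normal(scale=0.5, size=90)

f0 = nominal_spline(y, x, 3, lam=0.0)       # lam = 0: ordinary group means
fbig = nominal_spline(y, x, 3, lam=1e6)     # lam large: shrunk to a constant
print(f0, fbig)
```

At λ = 0 the estimate reduces to the per-level sample means; as λ grows the levels are pulled together toward the grand mean, exactly the trade-off the penalty describes.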


Polynomial Smoothing Splines on [0,1]

Suppose xi ∈ [0, 1] and let C^(m)[0,1] = {η : η^(m) ∈ L₂[0,1]}.
η^(m) = dᵐη/dxᵐ denotes the m-th derivative of η
L₂[0,1] = {η : ∫₀¹ η² dx < ∞}

A polynomial smoothing spline is the ηλ ∈ C^(m)[0,1] that minimizes

(1/n) ∑_{i=1}^n (yi − η(xi))² + λ ∫₀¹ (η^(m))² dx

where λ ≥ 0 is the smoothing parameter and m is the spline order.
Related to the natural spline in the numerical analysis literature.


Cubic Smoothing Splines

Setting m = 2 results in the classic cubic smoothing spline.
x1 < x2 < · · · < xq are the “knots” (distinct xi values)
ηλ is a piecewise cubic polynomial, and is linear beyond x1 and xq
ηλ is twice continuously differentiable; its third derivative jumps at the knots
As λ → 0, ηλ approaches the minimum curvature interpolant
As λ → ∞, ηλ approaches the simple linear regression fit

One can also view the cubic smoothing spline as the solution to

min_η (1/n) ∑_{i=1}^n (yi − η(xi))²  subject to  ∫₀¹ (η″)² dx ≤ ρ

for some ρ ≥ 0, which is least squares with a soft constraint on the roughness.
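A discrete analogue conveys the same trade-off: replace ∫(η″)² with the sum of squared second differences of the fitted vector (a Whittaker-style smoother; this Python sketch is an illustrative stand-in, not the exact cubic smoothing spline):

```python
import numpy as np

def discrete_smoother(y, lam):
    """Minimize ||y - f||^2 + lam * ||D2 f||^2, where D2 f is the vector of
    second differences of f (a rough analogue of the integral penalty)."""
    n = len(y)
    D2 = np.diff(np.eye(n), n=2, axis=0)         # (n-2) x n difference matrix
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 41)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=41)

f_interp = discrete_smoother(y, lam=0.0)   # lam -> 0: interpolates the data
f_linear = discrete_smoother(y, lam=1e8)   # lam -> inf: (nearly) linear fit
print(np.max(np.abs(np.diff(f_linear, n=2))))
```

The two limits mirror the slide: λ → 0 returns the data exactly, while a huge λ forces the second differences toward zero, i.e. a straight-line fit.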


Example with R’s spline Function

yi = sin(2πxi) + ei where xi = i/20 for i ∈ {0, 1, 2, . . . , 20} and
No Noise: ei = 0 ∀ i
Some Noise: ei ∼iid N(0, 0.15²)
More Noise: ei ∼iid N(0, 0.25²)

[Figure: three panels ("No Noise", "Some Noise", "More Noise") plotting y against x ∈ [0, 1], showing the data points, the true function sin(2πx) (dashed), and the interpolating natural spline fit (solid).]


spline Function (R code)

dev.new(width=12, height=4, noRStudioGD=TRUE)
par(mfrow=c(1,3))
x = seq(0, 1, length=21)
y = sin(2*pi*x)
mysp = spline(x, y, method="natural")
plot(x, y, main="No Noise", ylim=c(-1.5,1.5))
lines(x, sin(2*pi*x), lty=2)
lines(mysp)

set.seed(1)
x = seq(0, 1, length=21)
y = sin(2*pi*x) + rnorm(21, sd=0.15)
mysp = spline(x, y, method="natural")
plot(x, y, main="Some Noise", ylim=c(-1.5,1.5))
lines(x, sin(2*pi*x), lty=2)
lines(mysp)

set.seed(1)
x = seq(0, 1, length=21)
y = sin(2*pi*x) + rnorm(21, sd=0.25)
mysp = spline(x, y, method="natural")
plot(x, y, main="More Noise", ylim=c(-1.5,1.5))
lines(x, sin(2*pi*x), lty=2)
lines(mysp)


Same Example with R’s smooth.spline Function

yi = sin(2πxi) + ei where xi = i/20 for i ∈ {0, 1, 2, . . . , 20} and
No Noise: ei = 0 ∀ i
Some Noise: ei ∼iid N(0, 0.15²)
More Noise: ei ∼iid N(0, 0.25²)

[Figure: three panels ("No Noise", "Some Noise", "More Noise") plotting y against x ∈ [0, 1], showing the data points, the true function sin(2πx) (dashed), and the smooth.spline fit (solid).]


smooth.spline Function (R code)

dev.new(width=12, height=4, noRStudioGD=TRUE)
par(mfrow=c(1,3))
set.seed(1)
x = seq(0, 1, length=21)
y = sin(2*pi*x)
mysp = smooth.spline(x, y)
plot(x, y, main="No Noise", ylim=c(-1.5,1.5))
lines(x, sin(2*pi*x), lty=2)
lines(x, mysp$y)

set.seed(1)
x = seq(0, 1, length=21)
y = sin(2*pi*x) + rnorm(21, sd=0.15)
mysp = smooth.spline(x, y)
plot(x, y, main="Some Noise", ylim=c(-1.5,1.5))
lines(x, sin(2*pi*x), lty=2)
lines(x, mysp$y)

set.seed(1)
x = seq(0, 1, length=21)
y = sin(2*pi*x) + rnorm(21, sd=0.25)
mysp = smooth.spline(x, y)
plot(x, y, main="More Noise", ylim=c(-1.5,1.5))
lines(x, sin(2*pi*x), lty=2)
lines(x, mysp$y)


Background Theory


Background Theory Averaging Operators

One-Way ANOVA Decomposition

Consider the standard one-way ANOVA model

yij = µj + eij

for i ∈ {1, . . . ,nj} and j ∈ {1, . . . ,K}.

Typically, we want to decompose the treatment effects such as

µj = µ+ αj

where µ is the overall mean and αj is the treatment effect, subject to either
α1 = 0 if the first group is the control, or
∑_{j=1}^K αj = 0 if using effect coding


One-Way ANOVA and Averaging Operators

Consider the standard one-way ANOVA model using a smoothing spline on xi ∈ {1, . . . , K}

yi = η(xi) + ei

for i ∈ {1, . . . , n} where n = ∑_{j=1}^K nj.

The ANOVA decomposition µj = µ+ αj can be written as

η = Aη + (I − A)η = η0 + ηc

where A “averages out” η to return a constant η0:
α1 = 0 corresponds to Aη = η(1)
∑_{j=1}^K αj = 0 corresponds to Aη = ∑_{x=1}^K η(x)/K


Averaging Operators on Continuous Domain

For a continuous domain X = [a,b] we can decompose η such as

η = Aη + (I − A)η = η0 + ηc

where A “averages out” η to return a constant η0.
We need an averaging operator A defined such that A(Aη) = Aη = η0
We need an identity operator I defined such that Iη = η

Note that η0 is overall constant, and ηc is treatment (contrast) effect.

For a function defined on X = [0, 1], we could define
Aη = η(0), or
Aη = ∫₀¹ η(z) dz
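A quick numeric check of the integral averaging operator (a Python sketch added for illustration): approximating Aη = ∫₀¹ η(z) dz on a grid, the decomposition η = η0 + ηc satisfies A(Aη) = Aη and Aηc = 0.

```python
import numpy as np

def trap(f, z):
    """Trapezoidal approximation of the integral of f over the grid z."""
    return np.sum((f[1:] + f[:-1]) * np.diff(z) / 2)

z = np.linspace(0, 1, 10001)
eta = np.sin(2 * np.pi * z) + z ** 2     # an arbitrary smooth function

eta0 = trap(eta, z)                      # A eta = ∫ eta dz (the constant part)
eta_c = eta - eta0                       # (I - A) eta (the contrast part)

# A(A eta) = A eta (averaging a constant returns it), and A eta_c = 0
assert np.isclose(trap(np.full_like(z, eta0), z), eta0)
assert np.isclose(trap(eta_c, z), 0.0, atol=1e-8)
print(eta0)    # close to 1/3, since ∫ sin(2πz) dz = 0 and ∫ z^2 dz = 1/3
```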


Two-Way ANOVA Decomposition

Consider the standard two-way ANOVA model

yijk = µjk + eijk

for i ∈ {1, . . . ,njk}, j ∈ {1, . . . ,a}, and k ∈ {1, . . . ,b}.

Typically, we want to decompose the treatment effects such as

µjk = µ+ αj + βk + γjk

where µ is the overall mean and
αj is the main effect of Factor A such that ∑_{j=1}^a αj = 0
βk is the main effect of Factor B such that ∑_{k=1}^b βk = 0
γjk is the interaction effect such that ∑_{j=1}^a γjk = ∑_{k=1}^b γjk = 0 ∀ j, k


Two-Way ANOVA and Averaging Operators

Consider the standard two-way ANOVA model using a smoothing spline on xi = (xi1, xi2) ∈ X1 × X2 = {1, . . . , a} × {1, . . . , b}

yi = η(xi) + ei

for i ∈ {1, . . . , n} where n = ∑_{j=1}^a ∑_{k=1}^b njk.

The ANOVA decomposition µjk = µ+ αj + βk + γjk can be written as

η = [A_X1 + (I − A_X1)][A_X2 + (I − A_X2)]η
  = A_X1 A_X2 η + (I − A_X1)A_X2 η + A_X1(I − A_X2)η + (I − A_X1)(I − A_X2)η
  =     η0      +        η1        +        η2        +          η12

where A_X1 and A_X2 are averaging operators such that
A_X1(A_X1 η) = A_X1 η is constant for all xi1 ∈ X1
A_X2(A_X2 η) = A_X2 η is constant for all xi2 ∈ X2
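On a finite grid, these averaging operators are just row and column means (the effect-coding choice), and the four-term decomposition can be verified directly (a Python sketch added for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = 4, 5
eta = rng.normal(size=(a, b))     # eta evaluated on the grid {1..a} x {1..b}

# Averaging operators as row/column means (the effect-coding choice)
A1 = lambda f: np.broadcast_to(f.mean(axis=0, keepdims=True), f.shape)  # A_X1
A2 = lambda f: np.broadcast_to(f.mean(axis=1, keepdims=True), f.shape)  # A_X2

eta0  = A1(A2(eta))                               # constant term
eta1  = A2(eta) - A1(A2(eta))                     # (I - A_X1) A_X2 eta
eta2  = A1(eta) - A1(A2(eta))                     # A_X1 (I - A_X2) eta
eta12 = eta - A1(eta) - A2(eta) + A1(A2(eta))     # (I - A_X1)(I - A_X2) eta

# The four pieces reconstruct eta and satisfy sum-to-zero constraints
assert np.allclose(eta0 + eta1 + eta2 + eta12, eta)
assert np.allclose(eta1.mean(axis=0), 0)          # alpha_j sum to zero
assert np.allclose(eta2.mean(axis=1), 0)          # beta_k sum to zero
assert np.allclose(eta12.mean(axis=0), 0)
assert np.allclose(eta12.mean(axis=1), 0)
print(eta0[0, 0])
```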


Background Theory Hilbert Spaces

Linear Spaces and Functionals

Suppose that η, φ ∈ L where the set L satisfies:
η + φ ∈ L
aη ∈ L for any scalar a

If these two conditions are met, we say that L is a linear space.

A functional L in L operates on η ∈ L and returns a real number.
Linear functional: L(η + φ) = Lη + Lφ and L(aη) = aLη
Bilinear functional: a functional of two variables that is linear in each variable:
- J(aη + bφ, ψ) = aJ(η, ψ) + bJ(φ, ψ)
- J(η, aφ + bψ) = aJ(η, φ) + bJ(η, ψ)
Symmetry: J(η, φ) = J(φ, η) for all η, φ ∈ L
Positive definite: J(η) = J(η, η) > 0 for all nonzero η ∈ L
Non-negative definite: J(η) = J(η, η) ≥ 0 for all η ∈ L
Quadratic: bilinear, symmetric, and non-negative definite


Inner Products and Norms

In a linear space L, an inner-product is a positive definite bilinear form. We will use the notation 〈·, ·〉 to denote an inner-product.

The inner-product defines a norm in L, which provides a metric to measure the distance between two objects η, φ ∈ L.

We will use the notation ‖η‖ = √〈η, η〉 to denote the norm of η.

We will use the notation D[η, φ] = ‖η − φ‖ to denote the distance between η and φ in L.

In any inner-product space L we have the following two rules:
Cauchy-Schwarz: |〈η, φ〉| ≤ ‖η‖ ‖φ‖
Triangle: ‖η + φ‖ ≤ ‖η‖ + ‖φ‖


Null Spaces, Semi-Inner Products, and Semi-Norms

The null space of a non-negative definite bilinear form J in a linear space L is defined as N_J = {η : J(η, η) = 0, η ∈ L}, and note that
N_J = {0} if J is positive definite
N_J contains 0 and nonzero elements otherwise

A non-negative definite bilinear form J in a linear space L defines a semi-inner-product in L.
It induces a semi-norm √J(η) = √J(η, η) in L.
Similar to a norm, but J(η) = 0 does not imply η = 0.


Hilbert Spaces and Projections

A Hilbert space is a complete inner-product linear space.
A sequence with lim_{m,n→∞} ‖ηm − ηn‖ = 0 is a Cauchy sequence.
A linear space L is complete if every Cauchy sequence in L converges to some element in L.

Any closed linear subspace of H (denoted G ⊂ H) is a Hilbert space.
The distance between η ∈ H and G is D[η, G] = inf_{φ∈G} ‖η − φ‖.
There exists ηG ∈ G such that D[η, G] = ‖η − ηG‖; this ηG is the unique projection of η onto G.


Tensor Sum Decompositions

Given η ∈ H and G ⊂ H, we have that 〈η − ηG, φ〉 = 0 for all φ ∈ G.
Gᶜ = {η : 〈η, φ〉 = 0 ∀ φ ∈ G} is the orthogonal complement of G.
Tensor sum decomposition: H = G ⊕ Gᶜ and η = ηG + ηGᶜ.

If Hn and Hc are Hilbert spaces with inner products 〈·, ·〉n and 〈·, ·〉c, and if Hn ∩ Hc = {0}, then H = Hn ⊕ Hc is a Hilbert space with inner-product 〈·, ·〉 = 〈·, ·〉n + 〈·, ·〉c.

Consider a null space N_J corresponding to a semi-inner-product J in the space H, and define a second bilinear form J̃(·, ·) such that
1. J̃(·, ·) defines a full inner product in the space N_J
2. for every η ∈ H there exists φ ∈ N_J such that J̃(η − φ) = 0
Then (J + J̃)(η, φ) defines a full inner product in H.


Hilbert Space Example: Rᴷ

Note that a Hilbert space is a generalization of the Euclidean space Rᴷ.

For any vectors x, y ∈ Rᴷ, the inner product is 〈x, y〉 = x′y = ∑_{i=1}^K xi yi.

〈x, y〉 = 〈x, y〉n + 〈x, y〉c = x′[ (1/K)1_K 1′_K + (I_K − (1/K)1_K 1′_K) ]y

Hn = {η : η(1) = · · · = η(K)} and Hc = {η : ∑_{x=1}^K η(x) = 0}

This corresponds to the classic one-way ANOVA decomposition

µj = µ + αj

with the constraint ∑_j αj = 0.


Background Theory Reproducing Kernels

Riesz Representation Theorem

For every φ in a Hilbert space H, the functional Lφη = 〈φ, η〉 defines a continuous linear functional Lφ.
L is continuous if lim_{n→∞} Lηn = Lη whenever lim_{n→∞} ηn = η.

Every continuous linear functional L in H has a representation Lη = 〈φL, η〉 for some φL ∈ H, which is called the representer of L.

Theorem. For every continuous linear functional L in a Hilbert space H, there exists a unique φL ∈ H such that Lη = 〈φL, η〉 for all η ∈ H.


Reproducing Kernel Hilbert Spaces

To estimate an SSANOVA, we need to evaluate η for different x ∈ X.
We need continuity of the evaluation functional [x]η = η(x).

Consider a Hilbert space H of functions on the domain X.
If the evaluation functional [x]η = η(x) is continuous in H for all x ∈ X, then we say that H is a reproducing kernel Hilbert space (RKHS).
By the Riesz Representation Theorem, there exists ρx ∈ H, which is the representer of the evaluation functional [x]η = η(x).

The symmetric bivariate function ρ(x, y) = ρx(y) = 〈ρx, ρy〉 has the reproducing property 〈ρ(x, ·), η(·)〉 = η(x).

Consequently, ρ is called the reproducing kernel of the space H.


Examples of Reproducing Kernel Hilbert Spaces

Consider the Euclidean space Rᴷ, which is an RKHS:
The inner product is defined as 〈x, y〉 = ∑_{i=1}^K xi yi
The RK is defined as ρ(x, y) = I{x=y}, the indicator function

Consider the space L₂[0,1] = {η : ∫₀¹ η² dx < ∞}:
Elements of L₂[0,1] are defined via equivalence classes (not via individual functions)
NOT an RKHS because the evaluation functional is not well-defined

Consider the space C^(m)[0,1] = {η : η^(m) ∈ L₂[0,1]}:
Elements of C^(m)[0,1] are defined via individual functions
The evaluation functional is continuous, so we have an RKHS


Tensor Sum Decompositions of RKHS

Given the tensor sum decomposition H = Hn ⊕Hc, we have

ρ = ρn + ρc

where ρ is the RK of H, ρn is the RK of Hn, and ρc is the RK of Hc.

Furthermore, if ρ is the RK of H and ρ = ρn + ρc, where
ρn, ρc ∈ H are non-negative definite for all x ∈ X, and
〈ρn(x, ·), ρc(y, ·)〉 = 0 for all x, y ∈ X,
then the spaces Hn and Hc form a tensor sum decomposition of H.


Reproducing Kernel for Nominal Smoothing Splines

Suppose that xi ∈ X = {1, . . . , K} and η ∈ H = Rᴷ.

For any elements η, φ ∈ H, we have:
〈η, φ〉 = η′φ = ∑_{x=1}^K η(x)φ(x)
ρ(x, y) = I{x=y}, where I{·} is the indicator function

Using the averaging operator Aη = ∑_{x=1}^K η(x)/K:
〈η, φ〉 = 〈η, φ〉n + 〈η, φ〉c = η′[ (1/K)1_K 1′_K + (I_K − (1/K)1_K 1′_K) ]φ
ρ(x, y) = ρn(x, y) + ρc(x, y) = 1/K + (I{x=y} − 1/K)
Hn = {η : η(1) = · · · = η(K)} and Hc = {η : ∑_{x=1}^K η(x) = 0}
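The decomposition ρ = ρn + ρc can be verified numerically (a Python sketch added for illustration): as K × K matrices, ρn = (1/K)11′ reproduces constant functions and ρc = I − (1/K)11′ reproduces sum-to-zero functions.

```python
import numpy as np

K = 5
rho_n = np.ones((K, K)) / K        # rho_n(x, y) = 1/K
rho_c = np.eye(K) - rho_n          # rho_c(x, y) = I{x=y} - 1/K
assert np.allclose(rho_n + rho_c, np.eye(K))   # rho(x, y) = I{x=y}

eta_n = np.full(K, 2.5)                          # constant: lies in H_n
eta_c = np.array([1.0, -2.0, 0.5, 3.0, -2.5])    # sums to zero: lies in H_c
assert np.isclose(eta_c.sum(), 0.0)

# Each marginal kernel reproduces its own subspace under <eta, phi> = eta'phi
assert np.allclose(rho_n @ eta_n, eta_n)   # <rho_n(x, .), eta_n> = eta_n(x)
assert np.allclose(rho_c @ eta_c, eta_c)   # <rho_c(x, .), eta_c> = eta_c(x)

# ...and annihilates the complementary subspace
assert np.allclose(rho_n @ eta_c, 0.0)
assert np.allclose(rho_c @ eta_n, 0.0)
```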


Reproducing Kernel for Polynomial Smoothing Splines

Suppose that xi ∈ X = [0, 1] and η ∈ H = C^(m)[0,1].

Using the averaging operator Aη = ∫₀¹ η dx:

〈η, φ〉 = 〈η, φ〉n + 〈η, φ〉c = ∑_{ν=0}^{m−1} (∫₀¹ η^(ν) dx)(∫₀¹ φ^(ν) dx) + ∫₀¹ η^(m) φ^(m) dx

ρ(x, y) = ρn(x, y) + ρc(x, y) = ∑_{ν=0}^{m−1} kν(x)kν(y) + (−1)^{m−1} k_{2m}(|x − y|)

where kν(x) is a scaled Bernoulli polynomial.

Hn = {η : η^(m) = 0} and Hc = {η : ∫₀¹ η^(ν) dx = 0 for ν = 0, . . . , m − 1, and η^(m) ∈ L₂[0,1]}

Using the averaging operator Aη = η(0):

〈η, φ〉 = 〈η, φ〉n + 〈η, φ〉c = ∑_{ν=0}^{m−1} η^(ν)(0) φ^(ν)(0) + ∫₀¹ η^(m) φ^(m) dx

ρ(x, y) = ρn(x, y) + ρc(x, y) = ∑_{ν=0}^{m−1} (xᵛ/ν!)(yᵛ/ν!) + ∫₀¹ [(x − u)₊^{m−1}/(m − 1)!][(y − u)₊^{m−1}/(m − 1)!] du

Hn = {η : η^(m) = 0} and Hc = {η : η^(ν)(0) = 0 for ν = 0, . . . , m − 1, and η^(m) ∈ L₂[0,1]}


Tensor Product RKHS

Suppose that xi ∈ X where X = X1 × · · · × Xp is a product domain.
Suppose H_Xj is an RKHS of functions with RK ρ_Xj for all xj ∈ Xj.
Note that the marginal RKs have the form ρ_Xj = ρ_nj + ρ_cj.

We can define ρ_X = ∏_{j=1}^p ρ_Xj = ∏_{j=1}^p (ρ_nj + ρ_cj):
ρ_X is non-negative definite for all x ∈ X
ρ_X is the RK of the tensor product RKHS H = H_X1 ⊗ · · · ⊗ H_Xp

We can form functional spaces for any number of covariates.
We can constrain and/or remove subspaces to fit different models.


Need for Additional Smoothing Parameters

Given a tensor product RKHS H = H_X1 ⊗ · · · ⊗ H_Xp we have that:
H = ⊗_{j=1}^p (H_nj ⊕ H_cj) = ⊕_{k=1}^s H_k is a tensor sum decomposition
Each subspace H_k has inner product 〈·, ·〉k and RK ρk
The inner-products and RKs have different metrics

We can introduce additional smoothing parameters into the inner product:

〈·, ·〉 = ∑_{k=1}^s θk⁻¹ 〈·, ·〉k

which corresponds to the tensor product RK

ρ = ∑_{k=1}^s θk ρk


Estimation and Inference


Estimation and Inference Penalized Least Squares

Tensor Product Smoothing Spline

Given xi ∈ X = X1 × · · · × Xp, a tensor product smoothing spline is the ηλ ∈ H = H_X1 ⊗ · · · ⊗ H_Xp that minimizes

(1/n) ∑_{i=1}^n (yi − η(xi))² + λ J(η)

where
λ ≥ 0 is the overall (global) smoothing parameter
J is a quadratic functional quantifying the roughness of η
Additional smoothing parameters θ = (θ1, . . . , θs) exist within J


Estimation and Inference Penalized Least Squares

Representation of η

Let H = Hn ⊕ Hc denote the tensor sum decomposition of the tensor product RKHS H = ⊗_{j=1}^{p} HXj

Note that H has RK ρ = ρn + ρc where ρc = ∑_{k=1}^{s} θk ρk

Given fixed smoothing parameters θ, the η ∈ H that minimizes the penalized least-squares functional can be written as

η(x) = ∑_{v=1}^{m} dv φv(x) + ∑_{i=1}^{n} ci ρc(xi, x)    (1)

where {φv}_{v=1}^{m} is a set of known functions spanning Hn, ρc is the reproducing kernel (RK) of Hc, and d ≡ {dv}_{m×1} and c ≡ {ci}_{n×1} are the (unknown) basis function coefficient vectors


Estimation and Inference Penalized Least Squares

Penalty of η

Given the tensor sum decomposition H = Hn ⊕ Hc for some tensor product RKHS H = ⊗_{j=1}^{p} HXj, we define the penalty functional

J(η) = 〈η, η〉c

which is a semi-inner-product with null space Hn.

Using the representation of η(x) on the previous slide, we have

〈η, η〉c = 〈∑_{v=1}^{m} dv φv(x) + ∑_{i=1}^{n} ci ρc(xi, x), ∑_{v=1}^{m} dv φv(x) + ∑_{i=1}^{n} ci ρc(xi, x)〉c
         = 〈∑_{i=1}^{n} ci ρc(xi, x), ∑_{i=1}^{n} ci ρc(xi, x)〉c
         = ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj 〈ρc(xi, x), ρc(xj, x)〉c
         = ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj ρc(xi, xj)


Estimation and Inference Penalized Least Squares

Penalized Least Squares Problem

Using {x*_u}_{u=1}^{q} ⊂ {xi}_{i=1}^{n} as knots, the penalized least-squares functional can be approximated as

‖y − Kd − Jθ c‖² + nλ c′ Qθ c

where
y = (y1, . . . , yn)′ is the response vector
K = {φv(xi)}_{n×m} is the null space basis function matrix
Jθ = {ρc(xi, x*_u)}_{n×q} is the contrast space basis function matrix
  Note: Jθ = ∑_{k=1}^{s} θk Jk where Jk = {ρk(xi, x*_u)}_{n×q}
Qθ = {ρc(x*_t, x*_u)}_{q×q} is the penalty matrix
  Note: Qθ = ∑_{k=1}^{s} θk Qk where Qk = {ρk(x*_t, x*_u)}_{q×q}
d = (d1, . . . , dm)′ and c = (c1, . . . , cq)′ are the unknown coefficients


Estimation and Inference Penalized Least Squares

Coefficients and Smoothing Matrix

The coefficients minimizing the penalized least-squares function are

( d )   ( K′K    K′Jθ          )† ( K′  )
( c ) = ( Jθ′K   Jθ′Jθ + nλQθ  )  ( Jθ′ ) y

where (·)† denotes the Moore–Penrose pseudoinverse.

The fitted values are given by ŷ = Kd + Jθc = Sλ y where

Sλ = ( K  Jθ ) ( K′K    K′Jθ          )† ( K′  )
               ( Jθ′K   Jθ′Jθ + nλQθ  )  ( Jθ′ )

is the smoothing matrix, which depends on λ = (λ/θ1, . . . , λ/θs).
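The displays above can be checked numerically. The sketch below (Python/NumPy rather than the R used elsewhere in these notes, with a generic stand-in kernel min(x, z) in place of ρc and arbitrarily chosen knots) builds K, Jθ, and Qθ, solves for (d, c) via the pseudoinverse, and verifies that the fitted values equal Sλ y:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, q, lam = 100, 2, 10, 1e-3
x = np.linspace(0, 1, n)
y = 2 + x + np.sin(2*np.pi*x) + rng.standard_normal(n)
knots = np.linspace(0.1, 1, q)

# Stand-in bases (a generic kernel min(a, b), not the notes' spline RK):
# K spans the null space {1, x}; J and Q evaluate the kernel at the knots.
K = np.column_stack([np.ones(n), x])          # n x m
J = np.minimum.outer(x, knots)                # n x q
Q = np.minimum.outer(knots, knots)            # q x q penalty matrix

# Pseudoinverse solution for the stacked coefficients (d, c)
A = np.block([[K.T @ K, K.T @ J],
              [J.T @ K, J.T @ J + n*lam*Q]])
B = np.column_stack([K, J])                   # n x (m+q)
dc = np.linalg.pinv(A) @ (B.T @ y)
d, c = dc[:m], dc[m:]

# Smoothing matrix: the fitted values are a linear function of y
S = B @ np.linalg.pinv(A) @ B.T
print(np.allclose(S @ y, K @ d + J @ c))  # True
```

The same algebra applies for any choice of bases; only the construction of K, Jθ, and Qθ changes when the actual spline RKs are used.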


Estimation and Inference Smoothing Parameter Selection

Smoothing Parameter Goldilocks Phenomenon

The selection of λ is the crucial step when fitting an SSANOVA model!

If λk ≡ λ/θk is too large, the penalty corresponding to Hk will be too severe, making it difficult to estimate ηk.
  Oversmooths the k-th contrast space

If λk ≡ λ/θk is too small, the penalty corresponding to Hk will be too lenient, making it difficult to estimate ηk (assuming noisy data).
  Undersmooths the k-th contrast space


Estimation and Inference Smoothing Parameter Selection

Cross-Validation

If σ² is unknown, a reasonable loss function for selecting λ is the cross-validated loss function

CV(λ | y, X, w) = (1/n) ∑_{i=1}^{n} wi (yi − ηλ^[i](xi))²

where wi > 0 is some weight, and ηλ^[i] is the function φ ∈ H that minimizes the delete-the-i-th-observation functional:

(1/n) ∑_{j≠i} (yj − φ(xj))² + λ J(φ)


Estimation and Inference Smoothing Parameter Selection

Cross-Validation (continued)

The form of the CV loss function might suggest that it is necessary to fit n different models (to obtain ηλ^[i] for i ∈ {1, . . . , n}).

However, the CV function can be rewritten as

CV(λ | y, X, w) = (1/n) ∑_{i=1}^{n} wi (yi − ηλ(xi))² / (1 − sii(λ))²

where sii(λ) is the i-th diagonal element of Sλ, which implies that the CV function can be minimized using the results of the full model.
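This leaving-out-one shortcut holds for linear smoothers obtained from a penalized least-squares fit with fixed smoothing parameters. A small check (Python/NumPy, using ridge regression as a stand-in linear smoother, where the identity is exact) compares direct leave-one-out refits against the shortcut formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 40, 5, 0.5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Ridge is a linear smoother: yhat = S y with S = X (X'X + lam I)^{-1} X'
S = X @ np.linalg.solve(X.T @ X + lam*np.eye(p), X.T)
resid = y - S @ y

# Direct leave-one-out refits: drop observation i, refit, predict y_i
loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta = np.linalg.solve(X[keep].T @ X[keep] + lam*np.eye(p),
                           X[keep].T @ y[keep])
    loo[i] = y[i] - X[i] @ beta

# Shortcut: LOO residual = ordinary residual / (1 - s_ii)
print(np.allclose(loo, resid / (1 - np.diag(S))))  # True
```

So one fit of the full model yields all n leave-one-out residuals, which is what makes CV cheap to evaluate over a grid of λ values.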


Estimation and Inference Smoothing Parameter Selection

Generalized Cross-Validation

Defining wi ≡ (1 − sii(λ))² / [n⁻¹ tr(In − Sλ)]² replaces each sii(λ) with its average value, producing the generalized cross-validation (GCV) criterion of Craven and Wahba (1979):

GCV(λ | y, X) = (1/n) ∑_{i=1}^{n} (yi − ηλ(xi))² / [n⁻¹ tr(In − Sλ)]²
             = (1/n) ‖(In − Sλ)y‖² / [1 − tr(Sλ)/n]²    (2)

The λ that minimizes the GCV score produces good estimates of η.
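The GCV criterion only needs the smoothing matrix of a linear smoother, so it can be evaluated on a grid of λ values from a single pass. A sketch (Python/NumPy, with a penalized polynomial fit standing in for the spline smoother) scans a grid and picks the GCV minimizer:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = np.linspace(0, 1, n)
y = 2 + x + np.sin(2*np.pi*x) + rng.standard_normal(n)

# A penalized polynomial basis stands in for the spline smoother; the GCV
# formula only needs the smoothing matrix S_lambda of a linear smoother.
B = np.vander(x, 8, increasing=True)

def gcv(lam):
    S = B @ np.linalg.solve(B.T @ B + n*lam*np.eye(B.shape[1]), B.T)
    resid = y - S @ y
    return (resid @ resid / n) / (1 - np.trace(S)/n)**2

grid = 10.0 ** np.arange(-8.0, 1.0)
scores = [gcv(lam) for lam in grid]
best = grid[int(np.argmin(scores))]
print(best)  # the grid value minimizing the GCV score
```

In practice the minimization is done by a proper optimizer rather than a coarse grid, but the criterion evaluated is the same.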


Estimation and Inference Bayesian Confidence Intervals

Gaussian Process Definition

A Gaussian process is a stochastic process {η(x) : x ∈ X} such that η(x) ∼ N(µx, σx²) for all x ∈ X where

µx = E(η(x)) is the mean function
γx,x′ = Cov(η(x), η(x′)) is the covariance function
σx² = Cov(η(x), η(x)) is the variance function

Note η(x) is a random variable that is normally distributed for all x ∈ X
Use the notation η(x) ∼ N(µx, σx²) for all x ∈ X
Mean and variance differ for each x ∈ X


Estimation and Inference Bayesian Confidence Intervals

Bayesian Interpretation of Smoothing Spline

Let η = ηn + ηc denote the null and contrast space functions and assume the following prior distributions:

ηn has a diffuse (vague) prior with mean zero
ηc is a zero-mean Gaussian process with covariance function proportional to ρc

Using these prior assumptions. . .
η̂ can be interpreted as the posterior mean of η given the data y
we can derive the posterior variance Var(η|y)


Estimation and Inference Bayesian Confidence Intervals

Bayesian Confidence Intervals

Using the Bayesian interpretation, we can form confidence intervals

η̂(x) ± Zα/2 √Var(η(x)|y)

where Zα/2 is the critical value from the standard normal distribution.

Bayesian CIs have approximate "across-the-function coverage" when the smoothing parameters are selected according to GCV.

On average they contain 100(1 − α)% of the true function realizations
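The interval construction can be sketched numerically. Below (Python/NumPy), a zero-mean Gaussian process prior with an assumed stand-in covariance — not the notes' ρc — yields a posterior mean and pointwise posterior variance, from which ±Zα/2 intervals are formed:

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2 = 80, 1.0
x = np.linspace(0.01, 1, n)
y = 2 + x + np.sin(2*np.pi*x) + rng.standard_normal(n)

# GP-regression sketch of the Bayesian interval (assumed Brownian-motion
# covariance stands in for rho_c; this is not the spline posterior itself).
Kxx = np.minimum.outer(x, x)
Ainv = np.linalg.inv(Kxx + sigma2*np.eye(n))
post_mean = Kxx @ Ainv @ y
post_var = np.clip(np.diag(Kxx - Kxx @ Ainv @ Kxx), 0, None)

z = 1.959964  # Z_{alpha/2} for alpha = 0.05
lo = post_mean - z*np.sqrt(post_var)
hi = post_mean + z*np.sqrt(post_var)
print(np.all(lo <= hi) and np.all(np.isfinite(post_mean)))  # True
```

The intervals are pointwise: the coverage statement on this slide is about average coverage across x, not simultaneous coverage of the whole curve.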


SSANOVA in Practice



SSANOVA in Practice One-Way SSANOVA

Unidimensional Smoothing Splines in R

Many options for unidimensional smoothing splines in R:
smooth.spline function (in stats package)
bigspline function (in bigsplines package)
bigssa function (in bigsplines package)
ssanova function (in gss package)
gam function (in mgcv package)

For unidimensional smoothing, we will focus on the smooth.spline and bigspline functions, which have simple syntax.


SSANOVA in Practice One-Way SSANOVA

smooth.spline: Overview

> set.seed(1)
> x = seq(0,1,length=100)
> eta = 2 + x + sin(2*pi*x)
> y = eta + rnorm(100)
> plot(x,y)
> smsp = smooth.spline(x,y)
> lines(x,smsp$y)
> lines(x,eta,lty=2)

[Figure: scatterplot of y versus x with the smooth.spline fit (solid) and the true η (dashed)]


SSANOVA in Practice One-Way SSANOVA

smooth.spline: Changing Smoothing Parameter

[Figure: three panels titled "spar=0.25", "spar=0.75", and "spar=1", each showing y versus x with the fitted spline (solid) and true η (dashed)]

R code for leftmost plot:
> smsp = smooth.spline(x,y,spar=0.25)
> plot(x,y,main="spar=0.25")
> lines(x,smsp$y)
> lines(x,eta,lty=2)


SSANOVA in Practice One-Way SSANOVA

smooth.spline: Changing Number of Knots

[Figure: three panels titled "spar=0.5, nknots=10", "spar=0.5, nknots=20", and "spar=0.5, nknots=30", each showing y versus x with the fitted spline (solid) and true η (dashed)]

R code for leftmost plot:
> smsp = smooth.spline(x,y,spar=0.5,nknots=10)
> plot(x,y,main="spar=0.5, nknots=10")
> lines(x,smsp$y)
> lines(x,eta,lty=2)


SSANOVA in Practice One-Way SSANOVA

smooth.spline: CV versus GCV

[Figure: two panels titled "nknots=20, cv=TRUE" and "nknots=20, cv=FALSE", each showing y versus x with the fitted spline (solid) and true η (dashed)]

R code for leftmost plot:
> smsp = smooth.spline(x,y,nknots=20,cv=TRUE)
> plot(x,y,main="nknots=20, cv=TRUE")
> lines(x,smsp$y)
> lines(x,eta,lty=2)


SSANOVA in Practice One-Way SSANOVA

smooth.spline: Number of Knots (revisited)

[Figure: three panels titled "cv=FALSE, nknots=10", "cv=FALSE, nknots=20", and "cv=FALSE, nknots=30", each showing y versus x with the fitted spline (solid) and true η (dashed)]

R code for leftmost plot:
> smsp = smooth.spline(x,y,nknots=10)
> plot(x,y,main="cv=FALSE, nknots=10")
> lines(x,smsp$y)
> lines(x,eta,lty=2)


SSANOVA in Practice One-Way SSANOVA

smooth.spline: Predicting for New Data

Given η̂ we can predict for a new sequence of data:

> set.seed(1)
> x = seq(0,1,length=100)
> eta = 2 + x + sin(2*pi*x)
> y = eta + rnorm(100)
> plot(x,y,main="Prediction")
> smsp = smooth.spline(x,y)
> newdata = seq(0,1,length=200)
> yhat = predict(smsp,newdata)
> lines(yhat)
> lines(x,eta,lty=2)

[Figure: "Prediction" — scatterplot of y versus x with the predicted curve (solid) and true η (dashed)]


SSANOVA in Practice One-Way SSANOVA

bigspline: Overview

For smoothing large samples. . .

> set.seed(1)
> x = seq(0,1,length=100)
> eta = 2 + x + sin(2*pi*x)
> y = eta + rnorm(100)
> plot(x,y)
> bigsp = bigspline(x,y)
> lines(x,bigsp$fitted)
> lines(x,eta,lty=2)

[Figure: scatterplot of y versus x with the bigspline fit (solid) and the true η (dashed)]


SSANOVA in Practice One-Way SSANOVA

bigspline: Changing Smoothing Parameter

[Figure: three panels titled "lambdas=10^-9", "lambdas=10^-5", and "lambdas=1", each showing y versus x with the fitted spline (solid) and true η (dashed)]

R code for leftmost plot:
> bigsp = bigspline(x,y,lambdas=10^-9)
> plot(x,y,main="lambdas=10^-9")
> lines(x,bigsp$fitted)
> lines(x,eta,lty=2)


SSANOVA in Practice One-Way SSANOVA

bigspline: Changing Number of Knots

[Figure: three panels titled "lambdas=10^-5, nknots=10", "lambdas=10^-5, nknots=20", and "lambdas=10^-5, nknots=30", each showing y versus x with the fitted spline (solid) and true η (dashed)]

R code for leftmost plot:
> bigsp = bigspline(x,y,lambdas=10^-5,nknots=10)
> plot(x,y,main="lambdas=10^-5, nknots=10")
> lines(x,bigsp$fitted)
> lines(x,eta,lty=2)


SSANOVA in Practice One-Way SSANOVA

bigspline: Number of Knots (revisited)

[Figure: three panels titled "GCV, nknots=10", "GCV, nknots=20", and "GCV, nknots=30", each showing y versus x with the fitted spline (solid) and true η (dashed)]

R code for leftmost plot:
> bigsp = bigspline(x,y,nknots=10)
> plot(x,y,main="GCV, nknots=10")
> lines(x,bigsp$fitted)
> lines(x,eta,lty=2)


SSANOVA in Practice One-Way SSANOVA

bigspline: Predicting for New Data

Given η̂ we can predict for a new sequence of data:

> set.seed(1)
> x = seq(0,1,length=100)
> eta = 2 + x + sin(2*pi*x)
> y = eta + rnorm(100)
> plot(x,y,main="Prediction")
> bigsp = bigspline(x,y)
> newdata = seq(0,1,length=200)
> yhat = predict(bigsp,newdata)
> lines(newdata,yhat)
> lines(x,eta,lty=2)

[Figure: "Prediction" — scatterplot of y versus x with the predicted curve (solid) and true η (dashed)]


SSANOVA in Practice One-Way SSANOVA

bigspline: Predicting Linear and Non-Linear Effects

[Figure: three panels — "Full Prediction" (y versus x with the full fit), "Linear Effect" (true effect 2 + x), and "Non-Linear Effect" (true effect sin(2πx)) — with predicted effects (solid) and true effects (dashed)]

R code for center and rightmost plots:
> newdata = seq(0,1,length=200)
> plot(x,2+x,main="Linear Effect",type="l",lty=2)
> yhat = predict(bigsp,newdata,effect="0") + predict(bigsp,newdata,effect="lin")
> lines(newdata,yhat)
> plot(x,sin(2*pi*x),main="Non-Linear Effect",type="l",lty=2)
> yhat = predict(bigsp,newdata,effect="non")
> lines(newdata,yhat)


SSANOVA in Practice One-Way SSANOVA

bigspline: Bayesian Confidence Intervals

> dev.new(width=6,height=6,noRStudioGD=TRUE)
> set.seed(1)
> x = seq(0,1,length=100)
> eta = 2 + x + sin(2*pi*x)
> y = eta + rnorm(100)
> bigsp = bigspline(x,y,se.fit=TRUE)
> cilo = bigsp$fit - qnorm(0.975)*bigsp$se
> cihi = bigsp$fit + qnorm(0.975)*bigsp$se
> plot(x,y)
> lines(x,eta)
> lines(bigsp$xunique,cilo,lty=2)
> lines(bigsp$xunique,cihi,lty=2)
> sum(eta>=cilo & eta<=cihi)/length(x)
[1] 1

[Figure: scatterplot of y versus x with the true η (solid) and the 95% Bayesian confidence interval limits (dashed)]


SSANOVA in Practice One-Way SSANOVA

Comparing smooth.spline and bigspline

Consider the model yi = 2 + xi + sin(2πxi) + ei where ei ~iid N(0, σ²).

Suppose that xi = i/n for i ∈ {0, . . . , n} and σ² = 1 so that ei ~iid N(0, 1).

Median true MSE = (1/n) ∑_{i=1}^{n} (η̂(xi) − η(xi))² using q = 20 knots:

                    n     100    1000   10000   1e+05   1e+06
    smooth.spline      0.13836 0.00504 0.00113   1e-04   2e-05
    bigspline          0.14030 0.00497 0.00110   1e-04   2e-05

Median runtimes (seconds) using q = 20 knots:

                    n     100    1000   10000   1e+05   1e+06
    smooth.spline        0.001   0.002   0.021  0.1965   2.233
    bigspline            0.009   0.009   0.011  0.0120   0.094


SSANOVA in Practice One-Way SSANOVA

R Code for Simulation (on previous slide)

nsamp = 10^c(2:6)
simresults = NULL
xnew = seq(0,1,length=200)
set.seed(1)
for(j in 1:5){
  for(k in 1:10){

    x = seq(0,1,length=nsamp[j])
    eta = 2 + x + sin(2*pi*x)
    y = eta + rnorm(nsamp[j])

    tic = proc.time()
    ssmod = smooth.spline(x,y,nknots=20)
    toc = proc.time() - tic
    tmse = sum( (ssmod$y - eta)^2 ) / nsamp[j]
    simsp = data.frame(method="smsp",n=nsamp[j],time=toc[3],tmse=tmse,row.names=k)

    tic = proc.time()
    ssmod = bigspline(x,y,nknots=20)
    toc = proc.time() - tic
    tmse = sum( (predict(ssmod) - eta)^2 ) / nsamp[j]
    simbig = data.frame(method="big",n=nsamp[j],time=toc[3],tmse=tmse,row.names=k+1)

    simresults = rbind(simresults,simsp,simbig)
  }
}

round(tapply(simresults$tmse,list(simresults$method,simresults$n),median),5)
round(tapply(simresults$time,list(simresults$method,simresults$n),median),5)


SSANOVA in Practice One-Way SSANOVA

bigspline: Linear and Non-Linear Effects (revisited)

[Figure: eight panels — "Linear: n = 100, 1000, 10000, 1e+05" (true effect 2 + xnew) and "Non-linear: n = 100, 1000, 10000, 1e+05" (true effect sin(2π·xnew)) — comparing estimated and true effects]


SSANOVA in Practice Two-Way SSANOVA (Additive)

Multidimensional Smoothing Splines in R

A few options for multidimensional smoothing splines in R:
bigssa function (in bigsplines package)
bigssp function (in bigsplines package)
ssanova function (in gss package)
gam function (in mgcv package)

We will focus on the ssanova and bigssa (or bigssp) functions, which fit tensor product smoothing splines.

Note that the gam function handles interactions in a different manner.


SSANOVA in Practice Two-Way SSANOVA (Additive)

Additive Function: Definition

Suppose we have the following function defined for x = (x1, x2) ∈ [0,1] × {a,b}:

addfun = function(x1,x2){
  funval = sin(2*pi*x1)
  idx = which(x2=="a")
  funval[idx] = funval[idx] + 2
  funval
}

Note that the function is
η(x1, x2) = 2 + sin(2πx1) if x2 = a
η(x1, x2) = sin(2πx1) if x2 ≠ a


SSANOVA in Practice Two-Way SSANOVA (Additive)

Additive Function: Visualization

[Figure: η(x1, x2) versus x1 for x2 = a and x2 = b]


SSANOVA in Practice Two-Way SSANOVA (Additive)

Additive Function: bigssa fitting

> n = 100
> set.seed(55455)
> x1v = seq(0,1,length=n)
> x2v = factor(sample(letters[1:2],n,replace=TRUE))
> eta = addfun(x1v,x2v)
> y = eta + rnorm(n)
> idx = binsamp(cbind(x1v,x2v),nmbin=c(20,2))
> ssint = bigssa(y~x1v*x2v, type=list(x1v="cub",x2v="nom"), nknots=idx)
> sum((ssint$fitted-eta)^2) / length(eta)
[1] 0.04605668
> ssadd = bigssa(y~x1v+x2v, type=list(x1v="cub",x2v="nom"), nknots=idx)
> sum((ssadd$fitted-eta)^2) / length(eta)
[1] 0.03305623
> fitstats = rbind(ssint$info,ssadd$info)
> rownames(fitstats) = c("int","add")
> fitstats
         gcv       rsq      aic      bic
int 1.441561 0.6258559 319.3529 344.6341
add 1.386159 0.6134636 316.0127 332.6986


SSANOVA in Practice Two-Way SSANOVA (Additive)

Additive Function: bigssa prediction

> dev.new(width=12,height=6,noRStudioGD=TRUE)
> par(mfrow=c(1,2))
> newdata = expand.grid(x1v=seq(0,1,length=100),x2v=c("a","b"))
> yint = predict(ssint,newdata)
> yadd = predict(ssadd,newdata)
> plot(newdata[1:100,1],yint[1:100],main="Interaction",
+      type="l",ylim=c(-2,4))
> lines(newdata[101:200,1],yint[101:200],lty=2)
> plot(newdata[1:100,1],yadd[1:100],main="Additive",
+      type="l",ylim=c(-2,4))
> lines(newdata[101:200,1],yadd[101:200],lty=2)

[Figure: two panels titled "Interaction" and "Additive", each plotting the predicted curves for the two levels of x2 (solid and dashed) over x1]


SSANOVA in Practice Two-Way SSANOVA (Additive)

Additive Function: ssanova fitting

> n = 100
> set.seed(55455)
> x1v = seq(0,1,length=n)
> x2v = factor(sample(letters[1:2],n,replace=TRUE))
> eta = addfun(x1v,x2v)
> y = eta + rnorm(n)
> idx = binsamp(cbind(x1v,x2v),nmbin=c(20,2))
> ssint = ssanova(y~x1v*x2v,type=list(x1v="cubic",x2v="nominal"),
+                 id.basis=idx)
> newdata = data.frame(x1v=x1v,x2v=x2v)
> sum((predict(ssint,newdata)-eta)^2) / length(eta)
[1] 0.01449173
> ssadd = ssanova(y~x1v+x2v,type=list(x1v="cubic",x2v="nominal"),
+                 id.basis=idx)
> sum((predict(ssadd,newdata)-eta)^2) / length(eta)
[1] 0.01432404


SSANOVA in Practice Two-Way SSANOVA (Additive)

Additive Function: ssanova prediction

> dev.new(width=12,height=6,noRStudioGD=TRUE)
> par(mfrow=c(1,2))
> newdata = expand.grid(x1v=seq(0,1,length=100),x2v=c("a","b"))
> yint = predict(ssint,newdata)
> yadd = predict(ssadd,newdata)
> plot(newdata[1:100,1],yint[1:100],main="Interaction",
+      type="l",ylim=c(-2,4))
> lines(newdata[101:200,1],yint[101:200],lty=2)
> plot(newdata[1:100,1],yadd[1:100],main="Additive",
+      type="l",ylim=c(-2,4))
> lines(newdata[101:200,1],yadd[101:200],lty=2)

[Figure: two panels titled "Interaction" and "Additive", each plotting the predicted curves for the two levels of x2 (solid and dashed) over x1]


SSANOVA in Practice Two-Way SSANOVA (Interaction)

Interaction Function: Definition

Suppose we have the following function defined for x = (x1, x2) ∈ [0,1] × {a,b}:

intfun = function(x1,x2){
  funval = sin(2*pi*x1)
  idx = which(x2=="a")
  funval[idx] = funval[idx] + 2 + sin(4*pi*x1[idx])
  funval
}

Note that the function is
η(x1, x2) = 2 + sin(2πx1) + sin(4πx1) if x2 = a
η(x1, x2) = sin(2πx1) if x2 ≠ a


SSANOVA in Practice Two-Way SSANOVA (Interaction)

Interaction Function: Visualization

[Figure: η(x1, x2) versus x1 for x2 = a and x2 = b]


SSANOVA in Practice Two-Way SSANOVA (Interaction)

Interaction Function: bigssa fitting

> n = 100
> set.seed(55455)
> x1v = seq(0,1,length=n)
> x2v = factor(sample(letters[1:2],n,replace=TRUE))
> eta = intfun(x1v,x2v)
> y = eta + rnorm(n)
> idx = binsamp(cbind(x1v,x2v),nmbin=c(20,2))
> ssint = bigssa(y~x1v*x2v,type=list(x1v="cub",x2v="nom"),nknots=idx)
> sum((ssint$fitted-eta)^2) / length(eta)
[1] 0.1081747
> ssadd = bigssa(y~x1v+x2v,type=list(x1v="cub",x2v="nom"),nknots=idx)
> sum((ssadd$fitted-eta)^2) / length(eta)
[1] 0.1858098
> fitstats = rbind(ssint$info,ssadd$info)
> rownames(fitstats) = c("int","add")
> fitstats
         gcv       rsq      aic      bic
int 1.522061 0.6509204 324.0861 356.6680
add 1.510616 0.6097741 324.5034 343.1147


SSANOVA in Practice Two-Way SSANOVA (Interaction)

Interaction Function: bigssa prediction

> dev.new(width=12,height=6,noRStudioGD=TRUE)
> par(mfrow=c(1,2))
> newdata = expand.grid(x1v=seq(0,1,length=100),x2v=c("a","b"))
> yint = predict(ssint,newdata)
> yadd = predict(ssadd,newdata)
> plot(newdata[1:100,1],yint[1:100],main="Interaction",
+      type="l",ylim=c(-2,4))
> lines(newdata[101:200,1],yint[101:200],lty=2)
> plot(newdata[1:100,1],yadd[1:100],main="Additive",
+      type="l",ylim=c(-2,4))
> lines(newdata[101:200,1],yadd[101:200],lty=2)

[Figure: two panels titled "Interaction" and "Additive", each plotting the predicted curves for the two levels of x2 (solid and dashed) over x1]


SSANOVA in Practice Two-Way SSANOVA (Interaction)

Interaction Function: ssanova fitting

> n = 100
> set.seed(55455)
> x1v = seq(0,1,length=n)
> x2v = factor(sample(letters[1:2],n,replace=TRUE))
> eta = intfun(x1v,x2v)
> y = eta + rnorm(n)
> idx = binsamp(cbind(x1v,x2v),nmbin=c(20,2))
> ssint = ssanova(y~x1v*x2v,type=list(x1v="cubic",x2v="nominal"),
+                 id.basis=idx)
> newdata = data.frame(x1v=x1v,x2v=x2v)
> sum((predict(ssint,newdata)-eta)^2) / length(eta)
[1] 0.1624814
> ssadd = ssanova(y~x1v+x2v,type=list(x1v="cubic",x2v="nominal"),
+                 id.basis=idx)
> sum((predict(ssadd,newdata)-eta)^2) / length(eta)
[1] 0.1802812

Nathaniel E. Helwig (U of Minnesota) Smoothing Spline ANOVA Updated 04-Jan-2017 : Slide 84

Page 85: Smoothing Spline ANOVA - Statisticsusers.stat.umn.edu/~helwig/notes/ssanova-Notes.pdf · Smoothing Spline ANOVA Nathaniel E. Helwig Assistant Professor of Psychology and Statistics

SSANOVA in Practice Two-Way SSANOVA (Interaction)

Interaction Function: ssanova prediction

> dev.new(width=12,height=6,noRStudioGD=TRUE)
> par(mfrow=c(1,2))
> newdata = expand.grid(x1v=seq(0,1,length=100),x2v=c("a","b"))
> yint = predict(ssint,newdata)
> yadd = predict(ssadd,newdata)
> plot(newdata[1:100,1],yint[1:100],main="Interaction",
+      type="l",ylim=c(-2,4))
> lines(newdata[101:200,1],yint[101:200],lty=2)
> plot(newdata[1:100,1],yadd[1:100],main="Additive",
+      type="l",ylim=c(-2,4))
> lines(newdata[101:200,1],yadd[101:200],lty=2)

[Figure: predicted curves for x2v = "a" (solid) and x2v = "b" (dashed). Left panel "Interaction" (yint), right panel "Additive" (yadd); x-axis from 0 to 1, y-axis from -2 to 4.]

Nathaniel E. Helwig (U of Minnesota) Smoothing Spline ANOVA Updated 04-Jan-2017 : Slide 85

Page 86: Smoothing Spline ANOVA - Statisticsusers.stat.umn.edu/~helwig/notes/ssanova-Notes.pdf · Smoothing Spline ANOVA Nathaniel E. Helwig Assistant Professor of Psychology and Statistics

SSANOVA in Practice Two-Way SSANOVA (Interaction)

Interaction Function: Fitting with More Data

> n = 1000
> set.seed(55455)
> x1v = seq(0,1,length=n)
> x2v = factor(sample(letters[1:2],n,replace=TRUE))
> eta = intfun(x1v,x2v)
> y = eta + rnorm(n)
> idx = binsamp(cbind(x1v,x2v),nmbin=c(20,2))
> ssint = bigssa(y~x1v*x2v,type=list(x1v="cub",x2v="nom"),nknots=idx)
> sum((ssint$fitted-eta)^2) / length(eta)
[1] 0.03178251
> ssadd = bigssa(y~x1v+x2v,type=list(x1v="cub",x2v="nom"),nknots=idx)
> sum((ssadd$fitted-eta)^2) / length(eta)
[1] 0.1311356
> fitstats = rbind(ssint$info,ssadd$info)
> rownames(fitstats) = c("int","add")
> fitstats

         gcv       rsq      aic      bic
int 1.016522 0.6479397 2854.060 2923.757
add 1.081167 0.6236793 2915.779 2973.402

Nathaniel E. Helwig (U of Minnesota) Smoothing Spline ANOVA Updated 04-Jan-2017 : Slide 86
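The gcv column above is the generalized cross-validation score used for smoothing parameter selection. As a reminder of what is being compared, for a linear smoother with hat matrix H the score is GCV = n * RSS / (n - tr(H))^2, which a hedged stand-alone sketch (not the bigssa internals) could compute as:

```r
## Illustrative GCV computation for a linear smoother:
## GCV = n * RSS / (n - tr(H))^2, where tr(H) is the
## effective degrees of freedom (edf) of the fit.
gcv_score <- function(y, fitted, edf) {
  n <- length(y)
  rss <- sum((y - fitted)^2)   # residual sum of squares
  n * rss / (n - edf)^2
}
gcv_score(y = c(1, 2, 3), fitted = c(1, 2, 2), edf = 1)
```

Smaller GCV is better, so with n = 1000 the interaction model is preferred on every criterion in the table, consistent with the data being generated from an interaction function.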

Page 87: Smoothing Spline ANOVA - Statisticsusers.stat.umn.edu/~helwig/notes/ssanova-Notes.pdf · Smoothing Spline ANOVA Nathaniel E. Helwig Assistant Professor of Psychology and Statistics

SSANOVA in Practice Two-Way SSANOVA (Interaction)

Interaction Function: Predicting with More Data

> dev.new(width=12,height=6,noRStudioGD=TRUE)
> par(mfrow=c(1,2))
> newdata = expand.grid(x1v=seq(0,1,length=100),x2v=c("a","b"))
> yint = predict(ssint,newdata)
> yadd = predict(ssadd,newdata)
> plot(newdata[1:100,1],yint[1:100],main="Interaction",
+      type="l",ylim=c(-2,4))
> lines(newdata[101:200,1],yint[101:200],lty=2)
> plot(newdata[1:100,1],yadd[1:100],main="Additive",
+      type="l",ylim=c(-2,4))
> lines(newdata[101:200,1],yadd[101:200],lty=2)

[Figure: predicted curves for x2v = "a" (solid) and x2v = "b" (dashed). Left panel "Interaction" (yint), right panel "Additive" (yadd); x-axis from 0 to 1, y-axis from -2 to 4.]

Nathaniel E. Helwig (U of Minnesota) Smoothing Spline ANOVA Updated 04-Jan-2017 : Slide 87