
Page 1:

Model-based boosting in R

Package mboost

Benjamin Hofner

[email protected]

Institut für Medizininformatik, Biometrie und Epidemiologie (IMBE)

Friedrich-Alexander-Universität Erlangen-Nürnberg

Statistical Computing 2011

Page 2:

Getting Started

Page 3:

Prerequisites

How to get started?

Start a recent version of R (≥ 2.9.0).

Install package mboost¹,² [Hothorn et al., 2010a,b]:

R> install.packages("mboost",
+                   repos = "http://R-Forge.R-project.org")

Load package mboost:

R> library("mboost")

¹ Installs the current development version of mboost, which is required to have all the features discussed in this tutorial; features only available in mboost ≥ 2.1-0 are indicated by a "2.1-0" badge.

² To be able to install mboost from R-Forge you need to run the current version of R, i.e., R 2.13.

Page 4:

glmboost

Page 5:

glmboost

glmboost

glmboost() can be used to fit linear models via component-wise boosting.

→ Note that each column of the design matrix is fitted and selected separately using a simple linear model (i.e., a linear base-learner).

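As a language-independent illustration of this mechanism (a minimal sketch only, not mboost's actual implementation), component-wise L2 boosting with linear base-learners can be written as:

```python
import numpy as np

def componentwise_l2boost(X, y, mstop=100, nu=0.1):
    """Minimal sketch of component-wise L2 boosting.

    In each iteration every column of X is fitted separately to the
    current negative gradient (= residuals for the squared error loss),
    and only the best-fitting column is updated by a damped step nu.
    """
    n, p = X.shape
    offset = y.mean()                  # model offset, as reported by mboost
    coef = np.zeros(p)
    fit = np.full(n, offset)
    for _ in range(mstop):
        resid = y - fit                # negative gradient of 0.5 * (y - f)^2
        # per-column least-squares slope (linear base-learner, no intercept)
        betas = X.T @ resid / (X ** 2).sum(axis=0)
        rss = ((resid[:, None] - X * betas) ** 2).sum(axis=0)
        j = rss.argmin()               # select the best-fitting base-learner
        coef[j] += nu * betas[j]
        fit += nu * betas[j] * X[:, j]
    return offset, coef
```

Because only one coefficient is updated per iteration, stopping early (a small mstop) performs variable selection and shrinkage at the same time.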

Page 6:

glmboost

glmboost(formula, data = list(), weights = NULL, center = TRUE,

control = boost_control(), ...)

Only selected options are displayed throughout this talk!

Usage in essence very similar to lm() and glm().

formula: a formula as in lm(), e.g. formula = y ~ x1 + x2 + x3

data: a data set containing the variables of the formula.

weights: (optional) vector of weights used for weighted regression.

center: a logical variable indicating whether the predictor variables are centered before fitting (default = TRUE) (→ Slide 7).

control: parameters controlling the algorithm (→ Slide 8).

Indirectly available (via "..."; for possible parameters see ?mboost_fit):

family³: used to specify the fitting problem. For example:
family = Gaussian()  # default
family = Binomial()  # logit model

³ details → see the next talks

Page 7:

glmboost

center = TRUE

This is very important (at least for quick convergence). Each of the columns of the design matrix specifies one base-learner, which is fitted separately to the negative gradient. The single least-squares fit does not contain an intercept, i.e., models of the type lm(y ~ -1 + x1) are fitted. The model is thus forced through the origin (0, 0), which is only reasonable if the origin is the center of the data → centering!

[Figure: two scatter plots. Left (center = FALSE): the through-origin fit misses the center of the data. Right (center = TRUE): after centering, the fit passes through the center of the data at (0, 0).]
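The distortion can be checked numerically. The following sketch (hypothetical data, not taken from the slides) shows how a through-origin least-squares fit goes wrong when the predictor is far from the origin, and how centering repairs it:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(2, 6, size=100)          # predictor far away from the origin
y = 1.5 * (x - x.mean()) + rng.normal(scale=0.1, size=100)

def slope_no_intercept(x, y):
    """Least-squares slope of a model forced through the origin."""
    return (x @ y) / (x @ x)

# Without centering, the through-origin fit is strongly attenuated ...
b_raw = slope_no_intercept(x, y)
# ... while centering x recovers the underlying slope of 1.5.
b_centered = slope_no_intercept(x - x.mean(), y)
```

b_raw is biased toward zero, while b_centered recovers the slope of 1.5; this is exactly why center = TRUE matters for the intercept-free linear base-learners.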

Page 8:

glmboost

control = boost_control(mstop = 100, nu = 0.1,

trace = FALSE)

mstop: number of initial boosting iterations.

nu: step size (∈ (0, 1])

trace: should status information be printed during the fitting process?

Page 9:

glmboost

Example

R> mod.cars <- glmboost(dist ~ speed, data = cars)

R> mod.cars

Generalized Linear Models Fitted via Gradient Boosting

Call:

glmboost.formula(formula = dist ~ speed, data = cars)

Squared Error (Regression)

Loss function: (y - f)^2

Number of boosting iterations: mstop = 100

Step size: 0.1

Offset: 42.98

Coefficients:

(Intercept) speed

-60.557486 3.932304

attr(,"offset")

[1] 42.98

Page 10:

glmboost

R> plot(dist ~ speed, data = cars)

R> cf <- coef(mod.cars,

+ off2int = TRUE)

R> abline(cf)

R> abline(lm(dist ~ speed, data = cars),

+ col = "gray", lty = 2)

[Figure: scatter plot of dist against speed for the cars data, with the boosted fit (abline(cf)) and the lm() fit (gray, dashed) overlaid.]

R> plot(mod.cars)

[Figure: coefficient paths of (Intercept) and speed plotted against the number of boosting iterations.]

Page 11:

glmboost

glmboost with Matrix Interface

glmboost(x, y, center = TRUE, control = boost_control(), ...)

This is especially useful for high-dimensional data (such as gene expression data).

x: (design-) matrix

y: response variable (a vector)

center: a logical variable indicating whether the predictor variables are centered before fitting (default = TRUE) (→ Slide 7).

control: parameters controlling the algorithm (→ Slide 8).

Page 12:

glmboost

Alternative specification using paste

An alternative specification via the formula interface using paste() exists. Here is a (low-dimensional) example:

R> rhs <- paste(colnames(data[, colnames(data) != "y"]),
+               collapse = " + ")

R> fm <- as.formula(paste("y ~", rhs))

R> fm

y ~ x1 + x2 + x3 + x4

R> mod <- glmboost(fm, data = data)

This is very useful for high-dimensional non-linear models or models containing factors. In these cases, the model matrix is built internally without requiring further input by the user.

Page 13:

gamboost

Page 14:

gamboost

gamboost

gamboost() can be used to fit linear models or (nonlinear) additive models via component-wise boosting.

→ Base-learners need to be specified more explicitly.

In general: interface very similar to glmboost().

Page 15:

gamboost

gamboost(formula, data = list(), ...)

formula: a formula specifying the model (see later).

data: a data set containing the variables of the formula.

Indirectly available (via "..."):

weights: (optional) vector of weights used for weighted regression.

control: parameters controlling the algorithm (→ Slide 8).

family: used to specify the fitting problem.

Page 16:

Special Base-learners

Page 17:

gamboost Base-learners

bols() – ordinary least squares

Allows the definition of (penalized) ordinary least squares base-learners, including

1) linear effects (xβ)

2) categorical effects (z⊤β)

3) linear effects for groups of variables x = (x1, x2, …, xp)⊤ (x⊤β)

4) ridge-penalized effects for 2) or 3)

5) a special penalty for ordinal variables (x <- as.ordered(x))

6) varying coefficient terms (i.e., interactions)

7) …

Page 18:

gamboost Base-learners

bols(..., by = NULL, intercept = TRUE, df = NULL, lambda = 0)

...: variables are specified here (comma-separated)

by: additional variable that defines varying coefficients (optional)

intercept: if intercept = TRUE an intercept is added to the design matrix. If intercept = FALSE, continuous covariates should be (mean-) centered.

df: initial degrees of freedom for penalized estimates (→ Slide 42)

lambda: smoothing penalty, computed from df when df specified

Page 19:

gamboost Base-learners

Examples for bols()

[Figure: two panels showing bols() fits against the true effects. Left: linear effect of x1, true effect y = 0.5·x1. Right: categorical effect of x2 with four levels, true effect y = 0·x2(1) − 1·x2(2) + 0.5·x2(3) + 3·x2(4).]

Page 20:

gamboost Base-learners

Some calls using bols()

bols(x)
    linear effect → x⊤β (with x⊤ = (1, x))

bols(z)
    OLS fit with factor z (i.e., linear effect after dummy coding)

bols(x1, x2, x3)
    one base-learner for three variables (group-wise selection)

bols(x, by = z)
    interaction (with z a continuous or factor variable) → β · x · z

bols(x, intercept = FALSE)
    linear effect without intercept → β · x

Page 21:

gamboost Base-learners

bbs() – B-splines (with penalty)

Allows the definition of smooth effects based on B-splines with difference penalty (i.e., P-splines). Examples include:

1) smooth effects (f(x))

2) bivariate smooth effects (e.g., spatial effects f(x, y))

3) varying coefficient terms (f(x) · z)

4) cyclic effects (= periodic effects)

5) …

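The difference penalty that turns B-splines into P-splines is easy to construct explicitly. A small generic sketch (not mboost code): for k spline coefficients b, a second-order difference matrix D yields the penalty matrix K = D⊤D, and λ · b⊤Kb penalizes rough coefficient vectors:

```python
import numpy as np

k = 8
D = np.diff(np.eye(k), n=2, axis=0)    # second-order differences, (k-2) x k
K = D.T @ D                            # P-spline penalty matrix

b_smooth = np.arange(k, dtype=float)   # linear coefficient sequence
b_rough = (-1.0) ** np.arange(k)       # alternating coefficient sequence

pen_smooth = b_smooth @ K @ b_smooth   # 0: linear trends are unpenalized
pen_rough = b_rough @ K @ b_rough      # large: wiggliness is penalized
```

A second-order penalty leaves constant and linear trends unpenalized; separating this unpenalized polynomial part from the penalized remainder is the idea behind the center = TRUE decomposition mentioned on Slide 24.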

Page 22:

gamboost Base-learners

bbs(..., by = NULL, knots = 20, boundary.knots = NULL,

degree = 3, differences = 2, df = 4, lambda = NULL,

center = FALSE, cyclic = FALSE)

...: variables are specified here (comma-separated, max. 2 variables).

knots: number of equidistant knots (a), or positions of the interior knots (b), or a named list (with (a) or (b))

boundary.knots (≥ 2.1-0): boundary knot locations (default: range of the data)

degree: degree of the B-spline basis

differences: difference order of the penalty (∈ {0, 1, 2, 3})

center: reparameterization of P-splines (→ Slide 24)

cyclic (≥ 2.1-0): if cyclic = TRUE the fitted values coincide at the boundary knots

by: additional variable that defines varying coefficients (optional)

df: initial degrees of freedom (→ Slide 42)

lambda: smoothing penalty, computed from df when df specified

Page 23:

gamboost Base-learners

Examples for bbs()

[Figure: two panels showing bbs() fits with data and true effect. Left: true effect f(x1) = 3·sin(x1). Right: true effect f(x2) = x2².]

Page 24:

gamboost Base-learners

Some calls using bbs()

bbs(x, by = z)
    varying coefficient → f(x) · z = β(x) · z

bbs(x, knots = 10)
    smooth effect with 10 inner knots

bbs(x, boundary.knots = c(0, 2 * pi), cyclic = TRUE)
    cyclic effect, i.e., a periodic function with period 2π
    → important to specify the boundary knots (which are the points where the function is joined)

bbs(x, center = TRUE, df = 1)
    P-spline decomposition (center = TRUE), which is needed to specify arbitrarily small df for P-splines
    → in this case you (usually) need to add bols(x, intercept = FALSE) to the model formula; see Kneib et al. [2009]

Page 25:

gamboost Base-learners

Example: Cyclic Splines

[Figure: two panels comparing fits with cyclic = FALSE and cyclic = TRUE for the true effect f(x) = sin(x); only the cyclic fit joins at the boundary knots.]

Page 26:

gamboost Base-learners

bspatial(..., df = 6)

Allows one to specify smooth spatial effects (via bivariate P-splines). bspatial() is just a wrapper for bbs() with redefined degrees of freedom. Note that df = 4 was changed to df = 6 in mboost 2.1-0.

Page 27:

gamboost Base-learners

Examples for bspatial()

True function: f(x1, x2) = sin(x1) · sin(x2)

[Figure: heat map and perspective plot of the fitted bivariate surface over (x1, x2) ∈ [−3, 3]².]

Page 28:

gamboost Base-learners

brandom(..., df = 4)

Allows one to specify (quasi) random effects (via ridge-penalized linear effects). brandom() is just a wrapper for bols() with redefined degrees of freedom.

For examples of brandom(), see below.

Page 29:

gamboost Base-learners

Additional Base-learners in a Nutshell

btree(): tree-based base-learner (per default: stumps)

brad() (≥ 2.1-0): radial basis functions (e.g., for spatial effects)

bmono() (≥ 2.1-0): monotonic effects of continuous or ordered categorical variables

bmrf() (≥ 2.1-0): Markov random fields (for spatial effects of regions)

buser() (≥ 2.1-0): user-specified base-learner (advanced users only!)

Page 30:

gamboost Base-learners

Building a Model: How to combine different base-learners

Simple formula:

R> gamboost(y ~ bols(x1) + bbs(x2) + bspatial(s1, s2) + brandom(id),
+           data = example)

Options for some base-learners:

R> gamboost(y ~ bols(int, intercept = FALSE) +
+           bols(x1, intercept = FALSE) +
+           bbs(x2) +
+           bspatial(s1, s2, knots = list(s1 = 10, s2 = 20)) +
+           brandom(id) + brandom(id, by = x1),
+           data = example)

- separate intercept (where example$int = rep(1, length(y)))
- base-learner for x1 without intercept (i.e., x1 · β1)
- 10 inner knots in the s1 direction and 20 inner knots in the s2 direction, i.e., 10 · 20 = 200 inner knots in total; usually something like knots = 10 is used (→ 10 knots per coordinate)
- random slope for x1 (in the groups of id)

Page 31:

cvrisk and subset method

Page 32:

cvrisk

Optimal stopping iteration

model needs tuning to prevent overfitting

determine optimal stopping iteration

Use cross-validated estimates of the empirical risk to choose an appropriate number of boosting iterations.

→ aims at optimizing the prognosis on new data

→ infrastructure in mboost exists to compute bootstrap estimates, k-fold cross-validation estimates, and sub-sampling estimates, and to do this in parallel

Page 33:

cvrisk

cvrisk: How to determine mstop

cvrisk(object, folds = cv(model.weights(object)),

grid = 1:mstop(object),

papply = if (require("multicore")) mclapply else lapply)

object: the mboost model

folds: weight matrix that determines the cross-validation samples (→ see cv() on the next slide)

grid: compute the empirical (out-of-bag) risk on this grid of mstop values.

papply: use mclapply if multicore is available, else run sequentially. Alternative: clusterApplyLB (package snow) can be used here (with some further setup needed → see ?cvrisk).

Page 34:

cvrisk

cv(weights, type = c("bootstrap", "kfold", "subsampling"),

B = ifelse(type == "kfold", 10, 25))

weights: (original) weights of the model; one can use model.weights(mod) to extract the weights from mod.

type: use bootstrap (default), k-fold cross-validation, or sub-sampling

B: number of folds, per default 25 for bootstrap and subsampling and 10 for kfold

→ returns a matrix with B columns that determines the weights for each cross-validation run

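A sketch of how such a weight matrix can be built (illustrative Python, not the cv() source): one column per cross-validation run, with zero weights marking held-out observations for kfold and resampling counts for bootstrap.

```python
import numpy as np

def cv_folds(n, type="kfold", B=10, seed=0):
    """Sketch of a cv()-style weight matrix with one column per fold.

    kfold: observation i gets weight 0 in the column of its held-out
    fold and 1 otherwise; bootstrap: each column contains multinomial
    resampling counts.
    """
    rng = np.random.default_rng(seed)
    if type == "kfold":
        size = int(np.ceil(n / B))
        fold = rng.permutation(np.repeat(np.arange(B), size)[:n])
        return (fold[:, None] != np.arange(B)).astype(float)
    # bootstrap: count how often each observation was drawn per column
    return rng.multinomial(n, np.ones(n) / n, size=B).T.astype(float)
```

Passing such a matrix as folds makes each cvrisk() run refit the model with one column of weights and evaluate the risk on the zero-weight (out-of-bag) observations.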

Page 35:

cvrisk

Example: Determine mstop

R> model <- glmboost(DEXfat ~ ., data = bodyfat, center = TRUE,

control = boost_control(trace = TRUE))

R> ## 10-fold cross-validation

R> cv10f <- cv(model.weights(model), type = "kfold")

R> cvm <- cvrisk(model, folds = cv10f, papply = lapply)

R> cvm

Cross-validated Squared Error (Regression)

glmboost.formula(formula = DEXfat ~ ., data = bodyfat, center = TRUE,

control = boost_control(trace = TRUE, mstop = 10))

1 2 3 4 5 6

105.22746 89.82040 77.04107 66.23743 58.54920 50.82022

(...)

97 98 99 100

12.94416 12.93962 12.93397 12.94333

Optimal number of boosting iterations: 86

Page 36:

cvrisk

R> plot(cvm)

R> mstop(cvm)

[1] 86

[Figure: cross-validated squared error (10-fold kfold) plotted against the number of boosting iterations (1–100).]

Page 37:

cvrisk Subset

Subset: How (not) to get rid of spare boosting steps

To increase or reduce the number of boosting steps for the model mod, one can use the indexing/subsetting operator:

mod[i]

Attention, non-standard behavior:

mod[i] directly changes the model mod

→ no need to save mod under a new name

Good news: nothing gets lost!

Example:

- fit an initial model mod with mstop = 100
- mod[10] sets mstop = 10 (in the model object)
- increasing to mod[40] requires no re-computation of the model; internally everything was kept in storage!
- But pay attention: coef(mod) now returns the coefficients of the 40th boosting iteration!

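Why mod[i] is cheap can be sketched as follows (illustrative Python, not mboost's internals): if the model stores its whole coefficient path, reducing mstop only moves an index, and increasing it continues fitting from the stored state.

```python
import numpy as np

class BoostPath:
    """Sketch of mod[i]-style subsetting: the model keeps the full
    coefficient path, so changing mstop never refits earlier steps and
    truncated steps stay in storage."""

    def __init__(self, X, y, nu=0.1):
        self.X, self.y, self.nu = X, y, nu
        self.path = [np.zeros(X.shape[1])]   # coefficients after 0 steps
        self.mstop = 0

    def __getitem__(self, m):
        while len(self.path) <= m:           # extend: continue boosting
            coef = self.path[-1].copy()
            resid = self.y - self.X @ coef
            betas = self.X.T @ resid / (self.X ** 2).sum(axis=0)
            rss = ((resid[:, None] - self.X * betas) ** 2).sum(axis=0)
            j = rss.argmin()
            coef[j] += self.nu * betas[j]
            self.path.append(coef)
        self.mstop = m                       # truncate: just move the index
        return self

    def coef(self):
        return self.path[self.mstop]
```

After mod[10], a later mod[100] returns the identical coefficients without any refitting, mirroring the "nothing gets lost" behavior above.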

Page 38:

cvrisk Subset

Example (ctd.): Determine mstop

R> model[mstop(cvm)]

Generalized Linear Models Fitted via Gradient Boosting

Call:

glmboost.formula(formula = DEXfat ~ ., data = bodyfat, center = TRUE,

control = boost_control(trace = TRUE))

Squared Error (Regression)

Loss function: (y - f)^2

Number of boosting iterations: mstop = 86

Step size: 0.1

Offset: 30.78282

(...)

→ Model reduced to 86 steps (before: per default 100)

Page 39:

cvrisk Subset

R> ### To see that nothing got lost we now increase mstop to 200:

R> model[200]

[ 101] ............................ -- risk: 671.8004

[ 131] ............................ -- risk: 671.2759

[ 161] ............................ -- risk: 670.7738

[ 191] ........

Final risk: 670.6114

Generalized Linear Models Fitted via Gradient Boosting

Call:

glmboost.formula(formula = DEXfat ~ ., data = bodyfat, center = TRUE, control = boost_control(trace = TRUE))

Squared Error (Regression)

Loss function: (y - f)^2

Number of boosting iterations: mstop = 200

Step size: 0.1

Offset: 30.78282

(...)

→ Continues at step 101! (Remember that we initially fitted 100 steps.)

Page 40:

Summary

Summary: Key Messages

mboost . . .

. . . has a modular nature.

. . . is relatively easy to use.

. . . offers a well tested back-end.

. . . is able to specify a wide range of possible models (more to come!).

. . . has a rich infrastructure (e.g., plot functions, cvrisk, mod[i],. . . ; more to come!)

Comments on bugs, on improvements in the manual, etc. are always welcome (→ email us!).

Page 41:

References

References

Benjamin Hofner, Torsten Hothorn, Thomas Kneib, and Matthias Schmid. A framework for unbiased model selection based on boosting. Journal of Computational and Graphical Statistics, 2010. To appear.

Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid, and Benjamin Hofner. Model-based boosting 2.0. Journal of Machine Learning Research, 11:2109–2113, 2010a.

Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid, and Benjamin Hofner. mboost: Model-Based Boosting, 2010b. URL https://r-forge.r-project.org/projects/mboost/. R package version 2.1-0.

Thomas Kneib, Torsten Hothorn, and Gerhard Tutz. Variable selection and model choice in geoadditive regression models. Biometrics, 65:626–634, 2009. doi: 10.1111/j.1541-0420.2008.01112.x.

Page 42:

Controlling Smoothness: Degrees of Freedom ↔ Smoothing Parameter

The degrees of freedom df and the smoothing parameter λ have a one-to-one relationship.

However, multiple definitions of degrees of freedom exist. Let S = X(X⊤X + λK)⁻¹X⊤ be the smoother matrix.

Standard definition: df = trace(S)   (1)

Alternative definition

df = trace(2S − S⊤S)   (2)

Use

R> options(mboost_dftraceS = FALSE)

to switch to definition (2) (default: TRUE).

Definition (2) is more suitable for boosting [see Hofner et al., 2010].

NB: Define equal degrees of freedom (with definition (2)) for all base-learners to obtain a "less biased" selection.
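Both definitions are easy to compare numerically. A sketch with a generic ridge-type smoother on hypothetical data (K = I for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
lam, K = 10.0, np.eye(5)

# Smoother matrix S = X (X'X + lambda K)^{-1} X'
S = X @ np.linalg.solve(X.T @ X + lam * K, X.T)

df1 = np.trace(S)                  # definition (1): trace(S)
df2 = np.trace(2 * S - S.T @ S)    # definition (2): trace(2S - S'S)
```

For each eigenvalue s ∈ (0, 1) of S, definition (2) contributes s(2 − s) ≥ s, so df from definition (2) is never smaller than trace(S); both coincide with the unpenalized degrees of freedom as λ → 0.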
