
Noname manuscript No. (will be inserted by the editor)

Stability selection for component-wise gradient boosting in multiple dimensions

Janek Thomas · Andreas Mayr · Bernd Bischl · Matthias Schmid · Adam Smith · Benjamin Hofner

the date of receipt and acceptance should be inserted later

Abstract We present a new algorithm for boosting generalized additive models for location, scale and shape (GAMLSS) that allows for the incorporation of stability selection, an increasingly popular way to obtain stable sets of covariates while controlling the per-family error rate (PFER). The model is fitted repeatedly to subsampled data and variables with high selection frequencies are extracted. To apply stability selection to boosted GAMLSS, we develop a new “noncyclical” fitting algorithm that incorporates an additional selection step of the best-fitting distribution parameter in each iteration. This new algorithm has the additional advantage that optimizing the tuning parameters of boosting is reduced from a multi-dimensional to a one-dimensional problem with vastly decreased complexity. The performance of the novel algorithm is evaluated in an extensive simulation study. We apply this new algorithm to a study to estimate the abundance of common eider in Massachusetts, USA, featuring excess zeros, overdispersion, non-linearity and spatio-temporal structures. Eider abundance is estimated via boosted GAMLSS, allowing both mean and overdispersion to be regressed on covariates. Stability selection is used to obtain a sparse set of stable predictors.

Keywords boosting · additive models · GAMLSS · gamboostLSS · stability selection

J. Thomas · B. Bischl
Department of Statistics, Ludwig-Maximilians-Universität München, Ludwigstrasse 33, 80539 Munich, Germany
Tel.: +4989-2180-3196
E-mail: [email protected]

A. Mayr
Department of Medical Informatics, Biometry and Epidemiology, FAU Erlangen-Nürnberg, Germany

M. Schmid · A. Mayr
Department of Medical Biometry, Informatics and Epidemiology, RFWU Bonn, Germany

A. Smith
U.S. Fish & Wildlife Service, National Wildlife Refuge System, Southeast Inventory & Monitoring Branch, USA

B. Hofner
Section Biostatistics, Paul-Ehrlich-Institute, Langen, Germany

1 Introduction

In view of the growing size and complexity of modern databases, statistical modeling is increasingly faced with heteroscedasticity issues and a large number of available modeling options. In ecology, for example, it is often observed that outcome variables do not only show differences in mean conditions but also tend to be highly variable across different geographical features or states of a combination of covariates (e.g., [33]). In addition, ecological databases typically contain large numbers of correlated predictor variables that need to be carefully chosen for possible incorporation in a statistical regression model [1,8,31].

A convenient approach to address both heteroscedasticity and variable selection in statistical regression models is the combination of GAMLSS modeling with gradient boosting algorithms. GAMLSS, which refer to “generalized additive models for location, scale and shape” [34], are a modeling technique that relates not only the mean but all parameters of the outcome distribution to the available covariates. Consequently, GAMLSS simultaneously fit different submodels for the location, scale and shape parameters of the conditional distribution. Gradient boosting, on the other hand, has become a popular tool for data-driven variable selection in generalized additive models [4].


The most important feature of gradient boosting is the ability of the algorithm to perform variable selection in each iteration, so that model fitting and variable selection are accomplished in a single algorithmic procedure. To combine GAMLSS with gradient boosting, we have developed the gamboostLSS algorithm [25] and have implemented this procedure in the R add-on package gamboostLSS [17,15].

A remaining problem of gradient boosting is the tendency of boosting algorithms to select a relatively high number of false positive variables and to include too many non-informative covariates in a statistical regression model. This issue, which has been raised in several previous articles [5,7,21], is particularly relevant for model building in the GAMLSS framework, as the inclusion of non-informative false positives in the submodels for the scale and shape parameters may result in overfitting with a highly inflated variance. As a consequence, it is crucial to include only those covariates in these submodels that show a relevant effect on the outcome parameter of interest. From an algorithmic point of view, this problem is aggravated by the conventional fitting procedure of gamboostLSS: although the fitting procedure proposed in [25] incorporates different iteration numbers for each of the involved submodels, the algorithm starts with mandatory updates of each submodel at the beginning of the procedure. Consequently, due to the tendency of gradient boosting to include relatively high numbers of non-informative covariates, false positive variables may enter a GAMLSS submodel at a very early stage, even before the iteration number of the submodel is finally reached.

To address these issues and to enforce sparsity in GAMLSS, we propose a novel procedure that incorporates stability selection [28] in gamboostLSS. Stability selection is a generic method that investigates the importance of covariates in a statistical model by repeatedly subsampling the data. Sparsity is enforced by including only the most “stable” covariates in the final statistical model. Importantly, under appropriate regularity conditions, stability selection can be tuned such that the expected number of false positive covariates is controlled in a mathematically rigorous way. As will be demonstrated in Section 3 of this paper, the same property holds in the gamboostLSS framework.

To combine gamboostLSS with stability selection, we present an improved “noncyclical” fitting procedure for gamboostLSS that addresses the problem of possible false positive inclusions at early stages of the algorithm. In contrast to the original “cyclical” gamboostLSS algorithm presented in Mayr et al. [25], the new version of gamboostLSS not only performs variable selection in each iteration but additionally an iteration-wise selection of the best submodel (location, scale, or shape) that leads to the largest improvement in model fit. As a consequence, sparsity is not only enforced by the inclusion of the most “stable” covariates in the GAMLSS submodels but also by a data-driven choice of iteration-wise submodel updates. It is this procedure that theoretically justifies and thus enables the use of stability selection in gamboostLSS.

A further advantage of “noncyclical” fitting is that the maximum number of boosting iterations does not have to be specified individually for each submodel (as in the originally proposed “cyclical” variant); instead, only the overall number of iterations must be chosen optimally. Tuning the complete model is thus reduced from a multi-dimensional to a one-dimensional optimization problem, regardless of the number of submodels, which drastically reduces the runtime needed for model selection.

A similar approach for noncyclical fitting of multi-parameter models was recently suggested by [29] for the specific application of ensemble post-processing for weather forecasts. Our proposed method generalizes this approach, allowing gamboostLSS to be combined with stability selection in a generic way that applies to a large number of outcome distributions.

The rest of this paper is organized as follows: In Section 2 we describe the gradient boosting, GAMLSS and stability selection techniques, and show how to combine the three approaches in a single algorithm. In addition, we provide details on the new gamboostLSS fitting procedure. Results of extensive simulation studies are presented in Section 3. They demonstrate that combining gamboostLSS with stability selection results in prediction models that are both easy to interpret and show a favorable behavior with regard to variable selection. They also show that the new gamboostLSS fitting procedure results in a large decrease in runtime while showing similar empirical convergence rates as the traditional gamboostLSS procedure. We present an application of the proposed algorithm to a spatio-temporal data set on sea duck abundance in Nantucket Sound, USA, in Section 4. Section 5 summarizes the main findings and provides details on the implementation of the proposed methodology in the R package gamboostLSS [15].

2 Methods

2.1 Gradient boosting

Gradient boosting is a supervised learning technique that combines an ensemble of base-learners to estimate complex statistical dependencies.


Base-learners should be weak in the sense that they only possess moderate prediction accuracy, usually assumed to be at least slightly better than a random predictor, but on the other hand easy and fast to fit. Base-learners can be, for example, simple linear regression models, regression splines with low degrees of freedom, or stumps (i.e., trees with only one split) [4]. One base-learner by itself will usually not be enough to fit a well-performing statistical model to the data, but a boosted combination of a large number can compete with other state-of-the-art algorithms on many tasks, e.g., classification [22] or image recognition [32].

Let D = {(x(i), y(i))}, i = 1, ..., n, be a learning data set sampled i.i.d. from a distribution over the joint space X × Y, with a p-dimensional input space X = (X1 × X2 × ... × Xp) and a usually one-dimensional output space Y. The response variable is estimated through an additive model where E(y|x) = g^{-1}(η(x)), with link function g and additive predictor η : X → R,

$$\eta(x) = \beta_0 + \sum_{j=1}^{J} f_j(x \mid \beta_j), \qquad (1)$$

with a constant intercept coefficient β0 and additive effects fj(x|βj) derived from the pre-defined set of base-learners. These are usually (semi-)parametric effects, e.g., splines, with parameter vector βj. Note that some effects may later be estimated as 0, i.e., fj(x|βj) = 0. In many cases, each base-learner is defined on exactly one element Xj of X and thus Equation 1 simplifies to

$$\eta(x) = \beta_0 + \sum_{j=1}^{p} f_j(x_j \mid \beta_j). \qquad (2)$$

To estimate the parameters β1, ..., βJ of the additive predictor, the boosting algorithm minimizes the empirical risk R, which is the loss ρ : Y × R → R summed over all training data:

$$R = \sum_{i=1}^{n} \rho\bigl(y^{(i)}, \eta(x^{(i)})\bigr). \qquad (3)$$

The loss function measures the discrepancy between the true outcome y(i) and the additive predictor η(x(i)). Examples are the absolute loss |y(i) − η(x(i))|, leading to a regression model for the median; the quadratic loss (y(i) − η(x(i)))^2, leading to the conventional (mean) regression model; or the binomial loss −y(i)η(x(i)) + log(1 + exp(η(x(i)))), often used in classification of binary outcomes y(i) ∈ {0, 1}. Very often the loss is derived from the negative log-likelihood of the distribution of Y, depending on the desired model [10].

While there exist different types of gradient boosting algorithms [23,24], in this article we will focus on component-wise gradient boosting [6,4]. The main idea is to fit simple regression-type base-learners h(·) one by one to the negative gradient vector of the loss, u = (u(1), ..., u(n)), instead of to the true outcomes y = (y(1), ..., y(n)). Base-learners are chosen in such a way that they approximate the effect f(x|βj) = Σ_m h_j(·).

The negative gradient vector in iteration m, evaluated at the estimated additive predictor η[m−1](x(i)), is defined as

$$u = \left( -\left.\frac{\partial}{\partial \eta}\,\rho(y, \eta)\right|_{\eta = \eta^{[m-1]}(x^{(i)}),\; y = y^{(i)}} \right)_{i=1,\ldots,n}.$$

In every boosting iteration, each base-learner is fitted separately to the negative gradient vector by least-squares or penalized least-squares regression. The best-fitting base-learner is selected based on the residual sum of squares with respect to u,

$$j^* = \operatorname*{arg\,min}_{j \in 1,\ldots,J} \sum_{i=1}^{n} \bigl(u^{(i)} - h_j(x^{(i)})\bigr)^2. \qquad (4)$$

Only the best-performing base-learner h_{j*}(x) will be used to update the current additive predictor,

$$\eta^{[m]} = \eta^{[m-1]} + \mathrm{sl} \cdot h_{j^*}(x), \qquad (5)$$

where 0 < sl ≪ 1 denotes the step-length (learning rate; usually sl = 0.1). The choice of sl is not of critical importance as long as it is sufficiently small [36].

The main tuning parameter for gradient boosting algorithms is the number of iterations m that are performed before the algorithm is stopped (denoted as mstop). The selection of mstop has a crucial influence on the prediction performance of the model. If mstop is set too small, the model cannot fully incorporate the influence of the effects on the response and will consequently have a poor performance. On the other hand, too many iterations will result in overfitting, which hampers the interpretation and generalizability of the model.
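To make the selection and update steps above concrete, the following self-contained R sketch runs component-wise gradient boosting for the quadratic loss with simple univariate linear base-learners; the simulated data, the helper boost_step and all object names are purely illustrative and not taken from any package.

```r
## component-wise boosting with quadratic loss and linear base-learners (sketch)
set.seed(1)
n <- 200; p <- 5
X <- matrix(runif(n * p, -1, 1), nrow = n)
y <- 2 * X[, 1] - X[, 3] + rnorm(n, sd = 0.5)

sl  <- 0.1               # step-length (learning rate), cf. Eq. (5)
eta <- rep(mean(y), n)   # additive predictor, initialized with an offset

boost_step <- function(eta) {
  u <- y - eta           # negative gradient of the quadratic loss
  # fit every base-learner (one univariate linear model per covariate) to u
  fits <- lapply(seq_len(p), function(j) lm(u ~ X[, j]))
  rss  <- vapply(fits, function(f) sum(residuals(f)^2), numeric(1))
  j_star <- which.min(rss)              # best-fitting base-learner, cf. Eq. (4)
  eta + sl * fitted(fits[[j_star]])     # update the additive predictor, cf. Eq. (5)
}

for (m in 1:100) eta <- boost_step(eta)   # mstop = 100 iterations
```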

2.2 GAMLSS

In classical generalized additive models (GAM, [11]) it is assumed that the conditional distribution of Y depends only on one parameter, usually the conditional mean. If the distribution has multiple parameters, all but one are considered to be constant nuisance parameters. This assumption will not always hold and should be critically examined, e.g., the assumption of constant variance is not adequate for heteroscedastic data.


Potential dependency of the scale (and shape) parameter(s) of a distribution on predictors can be modeled in a similar way to the conditional mean (i.e., location parameter). This extended model class is called generalized additive models for location, scale and shape (GAMLSS, [34]).

The framework hence fits different prediction functions to multiple distribution parameters θ = (θ1, ..., θk), k = 1, ..., 4. Given a conditional density p(y|θ), one estimates additive predictors (cf. Equation 1) for each of the parameters θk,

$$\eta_{\theta_k} = \beta_{0\theta_k} + \sum_{j=1}^{J_k} f_{j\theta_k}(x \mid \beta_{j\theta_k}), \qquad k = 1, \ldots, 4. \qquad (6)$$

Typically these models are estimated via penalized likelihood. For details on the fitting algorithm see [35]. Even though these models can be applied to a large number of different situations, and the available fitting algorithms are extremely powerful, they still inherit some shortcomings from the penalized likelihood approach:

(1) It is not possible to estimate models with more covariates than observations.

(2) Maximum likelihood estimation does not feature an embedded variable selection procedure. For GAMLSS models the standard AIC has been expanded to the generalized AIC (GAIC) in [34] to be applied to multidimensional prediction functions. Variable selection via information criteria has several shortcomings, for example the inclusion of too many non-informative variables [2].

(3) Whether to model predictors in a linear or nonlinear fashion is not trivial. Unnecessary complexity increases the danger of overfitting as well as computation time. Again, a generalized criterion like GAIC must be used to choose between linear and nonlinear terms.
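In the classical framework these additive predictors are fitted jointly by penalized maximum likelihood, for example with the R package gamlss. The following minimal sketch, with simulated data and purely illustrative formulas, shows such a fit with separate linear predictors for the location and the scale of a normal response.

```r
## classical penalized-likelihood GAMLSS fit (sketch with simulated data)
library(gamlss)

set.seed(1)
d <- data.frame(x1 = runif(500, -1, 1), x2 = runif(500, -1, 1))
d$y <- rnorm(500, mean = 2 * d$x1, sd = exp(0.5 * d$x2))

# separate additive predictors for the location (mu) and the scale (sigma)
m <- gamlss(y ~ x1 + x2, sigma.formula = ~ x1 + x2,
            family = NO(), data = d)
summary(m)
```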

2.3 Boosted GAMLSS

To deal with these issues, gradient boosting can be used to fit the model instead of the standard maximum likelihood algorithm. Based on an approach proposed in [37] to fit zero-inflated count models, the authors of [25] developed a general algorithm to fit multidimensional prediction functions with component-wise gradient boosting (see Algorithm 1). The basic idea is to cycle through the distribution parameters θ in the fitting process. Partial derivatives with respect to each of the additive predictors are used as response vectors. In each iteration of the algorithm, the best-fitting base-learner is determined for each distribution parameter while all other parameters stay fixed.

For a four-parametric distribution, the update in boosting iteration m + 1 may be sketched as follows:

$$
\begin{aligned}
\frac{\partial}{\partial \eta_{\theta_1}}\,\rho\bigl(y,\, \theta_1^{[m]},\, \theta_2^{[m]},\, \theta_3^{[m]},\, \theta_4^{[m]}\bigr) \;&\xrightarrow{\;\text{update}\;}\; \eta_{\theta_1}^{[m+1]} \\
\frac{\partial}{\partial \eta_{\theta_2}}\,\rho\bigl(y,\, \theta_1^{[m+1]},\, \theta_2^{[m]},\, \theta_3^{[m]},\, \theta_4^{[m]}\bigr) \;&\xrightarrow{\;\text{update}\;}\; \eta_{\theta_2}^{[m+1]} \\
\frac{\partial}{\partial \eta_{\theta_3}}\,\rho\bigl(y,\, \theta_1^{[m+1]},\, \theta_2^{[m+1]},\, \theta_3^{[m]},\, \theta_4^{[m]}\bigr) \;&\xrightarrow{\;\text{update}\;}\; \eta_{\theta_3}^{[m+1]} \\
\frac{\partial}{\partial \eta_{\theta_4}}\,\rho\bigl(y,\, \theta_1^{[m+1]},\, \theta_2^{[m+1]},\, \theta_3^{[m+1]},\, \theta_4^{[m]}\bigr) \;&\xrightarrow{\;\text{update}\;}\; \eta_{\theta_4}^{[m+1]}
\end{aligned}
$$

Unfortunately, separate stopping values for each distribution parameter have to be specified, as the prediction functions will most likely require different levels of complexity and hence a different number of boosting iterations. In case of multi-dimensional boosting the different mstop,k values are not independent of each other and have to be jointly optimized. The usually applied grid search scales exponentially with the number of distribution parameters and can quickly become computationally demanding or even infeasible.
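As an illustration of this tuning burden, the sketch below fits a two-parameter Gaussian GAMLSS with the gamboostLSS package in the cyclical fashion, with one mstop value per distribution parameter; the simulated data and the concrete mstop values are arbitrary, and the argument names follow the gamboostLSS tutorial [17], so they may differ between package versions.

```r
## cyclical boosted GAMLSS: one mstop value per distribution parameter (sketch)
library(gamboostLSS)

set.seed(1)
dat <- data.frame(x1 = runif(500, -1, 1), x2 = runif(500, -1, 1),
                  x3 = runif(500, -1, 1))
dat$y <- rnorm(500, mean = 2 * dat$x1, sd = exp(0.5 * dat$x3))

# linear base-learners; the mstop values for mu and sigma must in general be
# tuned jointly, e.g., on a two-dimensional grid (see Section 3.1, "Runtime")
mod_cyc <- glmboostLSS(y ~ x1 + x2 + x3, data = dat,
                       families = GaussianLSS(),
                       control = boost_control(mstop = c(mu = 400, sigma = 100),
                                               nu = 0.1))
```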

2.4 Stability selection

Selecting an optimal subset of explanatory variables is a crucial step in almost every supervised data analysis problem. Especially in situations with a large number of covariates it is often almost impossible to get meaningful results without automatic, or at least semi-automatic, selection of the most relevant predictors. Selection of covariate subsets based on modified R2 criteria (e.g., the AIC) can be unstable, see for example [9], and tends to select too many covariates (see, e.g., [27]).

Component-wise boosting algorithms are one solution to select predictors in high dimensions and/or p > n problems. As only the best-fitting base-learner is selected to update the model in each boosting step, as discussed above, variable selection can be obtained by stopping the algorithm early enough. Usually this is done via cross-validation methods, selecting the stopping iteration that optimizes the empirical risk on test data (predictive risk). Hence, boosting with early stopping via cross-validation offers a way to perform variable selection while fitting the model. Nonetheless, boosted models stopped early via cross-validation still have a tendency to include too many noise variables, particularly in rather low-dimensional settings with few possible predictors and many observations (n > p) [3].

2.4.1 Stability selection for boosted GAM models

To circumvent the problems mentioned above, the stability selection approach was developed [28,38].


Algorithm 1 “Cyclical” component-wise gradient boosting in multiple dimensions [25]

Initialize

(1) Initialize the additive predictors η[0] = (η[0]_{θ1}, η[0]_{θ2}, η[0]_{θ3}, η[0]_{θ4}) with offset values.

(2) For each distribution parameter θk, k = 1, ..., 4, specify a set of base-learners, i.e., for parameter θk define h_{k1}(x(i)), ..., h_{kJk}(x(i)), where Jk is the cardinality of the set of base-learners specified for θk.

Boosting in multiple dimensions

For m = 1 to max(mstop,1, ..., mstop,4):

(3) For k = 1 to 4:

(a) If m > mstop,k, set η[m]_{θk} := η[m−1]_{θk} and skip this iteration. Else compute the negative partial derivative −∂/∂η_{θk} ρ(y, η) and plug in the current estimates η[m−1](·):

$$u_k = \left( -\left.\frac{\partial}{\partial \eta_{\theta_k}}\,\rho(y, \eta)\right|_{\eta = \eta^{[m-1]}(x^{(i)}),\; y = y^{(i)}} \right)_{i=1,\ldots,n}$$

(b) Fit each of the base-learners specified for the distribution parameter θk in step (2) to the negative gradient vector uk.

(c) Select the component j* that best fits the negative partial-derivative vector according to the residual sum of squares, i.e., select the base-learner h_{kj*} defined by

$$j^* = \operatorname*{arg\,min}_{j \in 1,\ldots,J_k} \sum_{i=1}^{n} \bigl(u_k^{(i)} - h_{kj}(x^{(i)})\bigr)^2.$$

(d) Update the additive predictor η_{θk},

$$\eta^{[m]}_{\theta_k} = \eta^{[m-1]}_{\theta_k} + \mathrm{sl} \cdot h_{kj^*}(x),$$

where sl is the step-length (typically sl = 0.1), and update the current estimates for step (3a):

$$\eta^{[m-1]}_{\theta_k} := \eta^{[m]}_{\theta_k}.$$

This generic algorithm can be applied to boosting and all other variable selection methods. The main idea of stability selection is to run the selection algorithm on multiple subsamples of the original data. Highly relevant base-learners should be selected in (almost) all subsamples.

Stability selection in combination with boosting was investigated in [12] and is outlined in Algorithm 2. In the first step, B random subsets of size ⌊n/2⌋ are drawn and a boosting model is fitted to each one. The model fit is interrupted as soon as q different base-learners have entered the model. For each base-learner, the selection frequency πj is the fraction of subsets in which the base-learner j was selected (7). An effect is included in the model if the selection frequency exceeds the user-specified threshold πthr (8).

Algorithm 2 Stability selection for model-based boosting

1. For b = 1, ..., B:

(a) Draw a subset of size ⌊n/2⌋ from the data.

(b) Fit a boosting model until the number of selected base-learners is equal to q or the number of iterations reaches a pre-specified number (mstop).

2. Compute the relative selection frequencies per base-learner:

$$\pi_j := \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}_{\{j \in S_b\}}, \qquad (7)$$

where S_b denotes the set of selected base-learners in iteration b.

3. Select base-learners with a selection frequency of at least πthr, which yields a set of stable covariates

$$S_{\text{stable}} := \{j : \pi_j \geq \pi_{\text{thr}}\}. \qquad (8)$$
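Because Algorithm 2 only requires some selection routine that stops after q base-learners, it can be sketched generically in a few lines of R; the function names, the toy selector and the assumption of a response column named y are illustrative placeholders rather than any package API.

```r
## generic sketch of Algorithm 2: subsampling + selection frequencies
stability_selection <- function(data, q, pi_thr, B = 100, select_q) {
  p_cov <- ncol(data) - 1                    # number of candidate covariates
  sel_freq <- numeric(p_cov)                 # selection counts per covariate
  for (b in seq_len(B)) {
    idx <- sample(nrow(data), floor(nrow(data) / 2))  # subset of size floor(n/2)
    sel <- select_q(data[idx, ], q)          # indices of the q selected base-learners
    sel_freq[sel] <- sel_freq[sel] + 1
  }
  pi_hat <- sel_freq / B                     # relative selection frequencies, Eq. (7)
  list(freq = pi_hat,
       stable = which(pi_hat >= pi_thr))     # stable set, Eq. (8)
}

## toy selector: the q covariates most correlated with the response 'y'
toy_select <- function(d, q) {
  cors <- abs(cor(d[setdiff(names(d), "y")], d$y))
  order(cors, decreasing = TRUE)[seq_len(q)]
}

## example call (with a data frame 'dat' that contains a column 'y'):
## stability_selection(dat, q = 5, pi_thr = 0.8, B = 50, select_q = toy_select)
```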

This approach leads to upper bounds for the per-family error rate (PFER) E(V), where V is the number of non-informative base-learners wrongly included in the model (i.e., false positives) [28]:

$$E(V) \leq \frac{q^2}{(2\pi_{\text{thr}} - 1)\,p}. \qquad (9)$$

Under certain assumptions, refined, less conservative error bounds can be derived [38].

One of the main difficulties of stability selection in practice is the choice of the parameters q, πthr and PFER. Even though only two of the three parameters need to be specified (the last one can then be derived under equality in (9)), their choice is not trivial and not always intuitive for the user. Meinshausen and Bühlmann [28] state that the threshold should be πthr ∈ (0.6, 0.9) and that it has little influence on the result. The number of base-learners q has to be sufficiently large, i.e., q should be at least as big as the number of informative variables in the data (or, more precisely, the number of corresponding base-learners). This is obviously a problem in practical applications, in which the number of informative variables (or base-learners) is usually unknown. One nice property is that if q is fixed, πthr and the PFER can be varied without the need to refit the model. A general recommendation would thus be to choose q relatively large or to make sure that q is large enough for a given combination of πthr and PFER. Simulation studies like [12,26] have shown that the PFER is quite conservative and the true number of false positives will most likely be much smaller than the specified value.

In practical applications two different approaches to select the parameters are typically used.


Both assume that the number of covariates to include, q, is chosen intuitively by the user. The first idea is to look at the calculated inclusion frequencies πj and look for a breakpoint in the decreasing order of the values. The threshold can then be chosen so that all covariates with inclusion frequencies larger than the breakpoint are included, and the resulting PFER is only used as additional information. The second possibility is to fix the PFER as a conservative upper bound for the expected number of false positive base-learners. Hofner et al. [12] provide some rationales for the selection of the PFER by relating it to common error types, the per-comparison error (i.e., the type I error without multiplicity correction) and the family-wise error rate (i.e., with conservative multiplicity correction).
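The interplay between q, πthr and the PFER bound in Eq. (9) is easy to explore numerically; the two helpers below simply rearrange the bound (assuming equality), and the numbers plugged in are arbitrary illustrations.

```r
## Eq. (9): fixing two of q, pi_thr and PFER determines the third (under equality)
pfer_bound <- function(q, pi_thr, p) q^2 / ((2 * pi_thr - 1) * p)
pi_thr_for <- function(q, PFER, p)  (q^2 / (PFER * p) + 1) / 2

pfer_bound(q = 15, pi_thr = 0.8, p = 500)  # upper bound on E(V): 0.75
pi_thr_for(q = 15, PFER = 2, p = 500)      # threshold implied by a target PFER: 0.6125
```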

2.4.2 Stability selection for boosted GAMLSS models

The question of variable selection in (boosted) GAMLSS models is even more critical than in regular (GAM) models, as the question of including a base-learner implies not only whether the base-learner should be used in the model at all, but also for which distribution parameter(s) it should be used. Essentially, the number of possible base-learners doubles in a distribution with two parameters, triples in one with three parameters, and so on. This is particularly challenging in situations with a large number of base-learners and in p > n situations.

The method of fitting boosted GAMLSS models in a cyclical way leads to a severe problem when used in combination with stability selection. In each iteration of the algorithm all distribution parameters will receive an additional base-learner as long as their mstop limit is not exceeded. This means that base-learners are added to the model that might have a rather small importance compared to base-learners for other distribution parameters. This becomes especially relevant if the number of informative base-learners varies substantially between distribution parameters.

Regarding the maximum number of base-learners q to be considered in the model, base-learners are counted separately for each distribution parameter, so a base-learner that is selected for the location and the scale parameter counts as two different base-learners. Arguably, one might circumvent this problem by choosing a higher value for q, but still less stable base-learners could be selected instead of stable ones for other distribution parameters. One aspect of the problem is that the possible model improvement between different distribution parameters is not considered. A careful selection of mstop per distribution parameter might resolve the problem, but the process would still be unstable because the margin of base-learner selection in later stages of the algorithm is quite small. Furthermore, this is not in line with the general approach of stability selection, where the standard tuning parameters do not play an important role.

2.5 Noncyclical fitting for boosted GAMLSS

The main idea to solve the previously stated problems of the cyclical fitting approach is to update only one distribution parameter in each iteration, i.e., the distribution parameter with a base-learner that leads to the highest loss reduction over all distribution parameters and base-learners.

Usually, base-learners are selected by comparing their residual sum of squares with respect to the negative gradient vector (inner loss). This is done in step (3c) of Algorithm 1, where the different base-learners are compared. However, the residual sum of squares cannot be used to compare the fit of base-learners over different distribution parameters, as the gradients are not comparable.

Inner loss One solution is to compare the empirical risks (i.e., the negative log-likelihood of the modeled distribution) after the update with the best-fitting base-learners that have been selected via the residual sum of squares for each distribution parameter: First, for each distribution parameter the best-performing base-learner is selected via the residual sum of squares of the base-learner fit with respect to the gradient vector. Then, the potential improvement in the empirical loss ∆ρ is compared for all selected base-learners (i.e., over all distribution parameters). Finally, only the best-fitting base-learner (w.r.t. the inner loss) which leads to the highest improvement (w.r.t. the outer loss) is updated. Because the base-learner selection for each distribution parameter is still done with the inner loss (i.e., the residual sum of squares), this variant is named after the inner loss.

Outer loss Choosing base-learners and parameters with respect to two different optimization criteria may not always lead to the best possible update. A better solution could be to use a criterion which can compare all base-learners for all distribution parameters. As stated, the inner loss cannot be used for such a comparison. However, the empirical loss (i.e., the negative log-likelihood of the modeled distribution) can be used to compare both the base-learners within a distribution parameter and those across different distribution parameters. Now, the negative gradients are used to estimate all base-learners h11, ..., h1p1, h21, ..., h4p4.


The improvement in the empirical risk is then calculated for each base-learner of every distribution parameter, and only the overall best-performing base-learner (w.r.t. the outer loss) is updated. Instead of the inner loss, the whole selection process is hence based on the outer loss (empirical risk), and the method is named accordingly.

The noncyclical fitting algorithm is shown in Algorithm 3. The inner and outer variant solely differ in step (3c).

Algorithm 3 “Noncyclical” component-wise gradient boosting in multiple dimensions

Initialize

(1) Initialize the additive predictors η[0] = (η[0]_{θ1}, η[0]_{θ2}, η[0]_{θ3}, η[0]_{θ4}) with offset values.

(2) For each distribution parameter θk, k = 1, ..., 4, specify a set of base-learners, i.e., for parameter θk define h_{k1}(·), ..., h_{kJk}(·), where Jk is the cardinality of the set of base-learners specified for θk.

Boosting in multiple dimensions

For m = 1 to mstop:

(3) For k = 1 to 4:

(a) Compute the negative partial derivative −∂/∂η_{θk} ρ(y, η) and plug in the current estimates η[m−1](·):

$$u_k = \left( -\left.\frac{\partial}{\partial \eta_{\theta_k}}\,\rho(y, \eta)\right|_{\eta = \eta^{[m-1]}(x^{(i)}),\; y = y^{(i)}} \right)_{i=1,\ldots,n}$$

(b) Fit each of the base-learners specified for the distribution parameter θk in step (2) to the negative gradient vector uk.

(c) Select the best-fitting base-learner h_{kj*} either by

• the inner loss, i.e., the residual sum of squares of the base-learner fit w.r.t. uk:

$$j^* = \operatorname*{arg\,min}_{j \in 1,\ldots,J_k} \sum_{i=1}^{n} \bigl(u_k^{(i)} - h_{kj}(x^{(i)})\bigr)^2,$$

• or the outer loss, i.e., the negative log-likelihood of the modelled distribution after the potential update:

$$j^* = \operatorname*{arg\,min}_{j \in 1,\ldots,J_k} \sum_{i=1}^{n} \rho\bigl(y^{(i)},\, \eta^{[m-1]}_{\theta_k}(x^{(i)}) + \mathrm{sl} \cdot h_{kj}(x^{(i)})\bigr).$$

(d) Compute the possible improvement of this update regarding the outer loss,

$$\Delta\rho_k = \sum_{i=1}^{n} \rho\bigl(y^{(i)},\, \eta^{[m-1]}_{\theta_k}(x^{(i)}) + \mathrm{sl} \cdot h_{kj^*}(x^{(i)})\bigr).$$

(4) Update, depending on the value of the loss reduction k* = argmin_{k∈1,...,4}(∆ρk), only the overall best-fitting base-learner:

$$\eta^{[m]}_{\theta_{k^*}} = \eta^{[m-1]}_{\theta_{k^*}} + \mathrm{sl} \cdot h_{k^*j^*}(x)$$

(5) Set η[m]_{θk} := η[m−1]_{θk} for all k ≠ k*.
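To make steps (3) and (4) concrete outside any package, the following self-contained R sketch performs noncyclical updates for a two-parameter Gaussian model (identity link for µ, log link for σ) with univariate linear base-learners, using the residual sum of squares as inner loss within each parameter and the negative log-likelihood as outer loss across parameters (the “inner” variant). All data and names are purely illustrative.

```r
## noncyclical updates for a Gaussian GAMLSS (sketch; "inner" variant)
set.seed(1)
n <- 500; p <- 6
X <- matrix(runif(n * p, -1, 1), nrow = n)
y <- rnorm(n, mean = X[, 1] + 2 * X[, 2], sd = exp(0.5 * X[, 3]))

sl <- 0.1
nll <- function(eta_mu, eta_sigma)          # outer loss: negative log-likelihood
  -sum(dnorm(y, mean = eta_mu, sd = exp(eta_sigma), log = TRUE))

best_fit <- function(u) {                   # inner loss: RSS w.r.t. the gradient u
  fits <- lapply(seq_len(p), function(j) fitted(lm(u ~ X[, j])))
  rss  <- vapply(fits, function(f) sum((u - f)^2), numeric(1))
  fits[[which.min(rss)]]
}

noncyclic_step <- function(eta_mu, eta_sigma) {
  # negative partial derivatives of the loss w.r.t. each additive predictor
  u_mu    <- (y - eta_mu) / exp(eta_sigma)^2
  u_sigma <- (y - eta_mu)^2 / exp(eta_sigma)^2 - 1
  h_mu    <- best_fit(u_mu)                 # step (3c), parameter mu
  h_sigma <- best_fit(u_sigma)              # step (3c), parameter sigma
  # step (4): update only the parameter whose update reduces the outer loss most
  if (nll(eta_mu + sl * h_mu, eta_sigma) < nll(eta_mu, eta_sigma + sl * h_sigma))
    eta_mu <- eta_mu + sl * h_mu
  else
    eta_sigma <- eta_sigma + sl * h_sigma
  list(mu = eta_mu, sigma = eta_sigma)
}

st <- list(mu = rep(mean(y), n), sigma = rep(log(sd(y)), n))   # offsets
for (m in 1:200) st <- noncyclic_step(st$mu, st$sigma)         # mstop = 200
```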

A major advantage of both noncyclical variants compared to the cyclical fitting algorithm (Algorithm 1) is that mstop is always scalar. The updates of each distribution parameter estimate are adaptively chosen. The optimal partitioning (and sequence) of base-learners between different parameters is determined automatically while fitting the model. Such a scalar optimization can be done very efficiently using standard cross-validation methods without the need for a multi-dimensional grid search.
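With the gamboostLSS package this corresponds, roughly, to requesting the noncyclical method and tuning a single scalar mstop by resampling. The sketch below assumes the interface of recent gamboostLSS versions (a method argument and cvrisk support for LSS models) and a data frame dat with response y, so names and arguments may differ from the actual package.

```r
## noncyclical fit with a single, scalar mstop (sketch)
library(gamboostLSS)

mod_nc <- glmboostLSS(y ~ ., data = dat, families = GaussianLSS(),
                      method = "noncyclic",
                      control = boost_control(mstop = 1000, nu = 0.1))

## one-dimensional tuning of mstop via 25-fold bootstrap out-of-bag risk
cvr <- cvrisk(mod_nc, grid = 1:1000,
              folds = cv(model.weights(mod_nc), type = "bootstrap", B = 25))
mstop(mod_nc) <- mstop(cvr)   # set the model to the estimated optimal mstop
```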

3 Simulation study

In a first step, we carry out simulations to evaluate the performance of the new noncyclical fitting algorithms regarding convergence, convergence speed and runtime. In a second step, we analyze the variable selection properties when the new variant is combined with stability selection.

3.1 Performance of the noncyclical algorithms

The response yi is drawn from a normal distribution N(µi, σi), where µi and σi depend on 4 covariates each. The covariates xi, i = 1, ..., 6, are drawn independently from a uniform distribution on [−1, 1], i.e., n = 500 samples are drawn independently from U(−1, 1). Two covariates, x3 and x4, are shared between µi and σi, i.e., they are informative for both parameters, which means that there are pinf = 6 informative variables overall. The resulting predictors are

$$\mu_i = x_{1i} + 2 x_{2i} + 0.5 x_{3i} - x_{4i},$$
$$\log(\sigma_i) = 0.5 x_{3i} + 0.25 x_{4i} - 0.25 x_{5i} - 0.5 x_{6i}.$$
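This data-generating process can be reproduced with a few lines of base R; the seed and object names are arbitrary.

```r
## simulation design of Section 3.1 (sketch)
set.seed(1)
n <- 500
X <- matrix(runif(n * 6, -1, 1), nrow = n,
            dimnames = list(NULL, paste0("x", 1:6)))
mu        <- X[, 1] + 2 * X[, 2] + 0.5 * X[, 3] - X[, 4]
log_sigma <- 0.5 * X[, 3] + 0.25 * X[, 4] - 0.25 * X[, 5] - 0.5 * X[, 6]
dat <- data.frame(y = rnorm(n, mean = mu, sd = exp(log_sigma)), X)
```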

Convergence First, we compare the new noncyclical boosting algorithms and the cyclical approach with the classical estimation method based on penalized maximum likelihood (as implemented in the R package gamlss, [35]). The results from B = 100 simulation runs are shown in Figure 1. All four methods converge to the correct solution.

Convergence speed Second, we compare the convergence speed in terms of boosting iterations. Therefore, non-informative variables are added to the model. Four settings are considered with pn-inf = 0, 50, 250 and 500 additional non-informative covariates independently sampled from a U(−1, 1) distribution. With n = 500 observations, both pn-inf = 250 and pn-inf = 500 are high-dimensional situations (p > n), as we have two distribution parameters. In Figure 2 the mean risk over 100 simulated data sets is plotted against the number of iterations.


Fig. 1: Distribution of coefficient estimates from B = 100 simulation runs for the cyclical, outer, inner and gamlss methods. The dashed lines show the true parameters. All algorithms were fitted until convergence.

The mstop value of the cyclical variant shown in Figure 2 is the sum of the number of updates on every distribution parameter. The outer and inner loss variants of the noncyclical algorithm have exactly the same risk profiles in all four settings. Compared to the cyclical algorithm, the convergence is faster in the first 500 iterations. After more than 500 iterations the risk reduction is the same for all three methods. The margin between the cyclical and both noncyclical algorithms decreases with a larger number of noise variables.

Runtime The main computational effort of the algorithms is the base-learner selection, which is different for all three methods. The runtime is evaluated in the context of cross-validation, which allows us to see how out-of-bag error and runtime behave in different settings. We consider two scenarios: a two-dimensional (d = 2) and a three-dimensional (d = 3) distribution. The data are generated according to settings 1A and 3A of Section 3.2. In each scenario we sample n = 500 observations, but do not add any additional noise variables. For optimization of the model, the out-of-bag prediction error is estimated via a 25-fold bootstrap. A grid of length 10 is created for the cyclical model, with a maximum mstop of 300 for each distribution parameter. The grid is created with the make.grid function in gamboostLSS (refer to the package documentation for details on the arrangement of the grid points). To allow the same complexity for all variants, the noncyclical methods are allowed up to mstop = d × 300 iterations.
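A hedged sketch of the two tuning strategies compared here, assuming the gamboostLSS interface (make.grid, cvrisk with a grid argument, cv from mboost) and two-parameter models mod_cyc and mod_nc fitted as in the earlier sketches; exact argument names may differ between package versions.

```r
## tuning effort: multi-dimensional grid (cyclical) vs. scalar grid (noncyclical)
library(gamboostLSS)

folds <- cv(model.weights(mod_cyc), type = "bootstrap", B = 25)

# cyclical model: grid of length 10 per parameter, up to mstop = 300 each
grid2d <- make.grid(max = c(mu = 300, sigma = 300), length.out = 10)
cv_cyc <- cvrisk(mod_cyc, folds = folds, grid = grid2d)

# noncyclical model: one-dimensional grid with the same total complexity
cv_nc <- cvrisk(mod_nc, folds = folds, grid = 1:(2 * 300))

mstop(cv_cyc)   # optimal (mu, sigma) combination
mstop(cv_nc)    # optimal scalar mstop
```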

Fig. 2: Convergence speed (regarding the number of boosting iterations m) with 6 informative and pn-inf = 0, 50, 250 and 500 additional noise variables, for the cyclical, outer and inner methods.

Fig. 3: Out-of-bag error (top) and optimization time in minutes (logarithmic scale; bottom) for a two-dimensional (left) and three-dimensional distribution (right) based on 25-fold bootstrap.

The results of the benchmark can be seen in Figure 3. The out-of-bag error in the two-dimensional setting is similar for all three methods, but the average number of optimal iterations is considerably smaller for the noncyclical methods (cyclical: 360 vs. inner: 306, outer: 308).


In the three-dimensional setting, the outer variant of the noncyclical fitting results in a higher error, whereas the inner variant results in a slightly better performance compared to the cyclical variant. In this setting the optimal number of iterations is similar for all three methods but near the edge of the searched grid. It is possible that the outer variant would result in a comparable out-of-bag error if the range of the grid were increased.

3.2 Stability selection

After having analyzed the properties of the new noncyclical boosting algorithms for fitting GAMLSS, the remaining question is how they perform when combined with stability selection. In the previous subsection no differences in the model fit (Figure 1) and convergence speed (Figure 2) could be observed, but the optimization result in the three-dimensional setting (Figure 3) was worse for the outer algorithm. Taking this into consideration, we will only compare the inner and the cyclical algorithm here.

We consider three different distributions: (1) the normal distribution with two parameters, mean µi and standard deviation σi; (2) the negative binomial distribution with two parameters, mean µi and dispersion σi; (3) the zero-inflated negative binomial (ZINB) distribution with three parameters, µi and σi identical to the negative binomial distribution, and the zero-inflation probability νi.

Furthermore, two different partitions of six informative covariates shared between the distribution parameters are evaluated:

(A) Balanced case: For the normal and negative binomial distribution, both µi and σi depend on four informative covariates, of which two are shared. In case of the ZINB distribution, each parameter depends on three informative covariates, each sharing one with the other two parameters.

(B) Unbalanced case: For the normal and negative binomial distribution, µi depends on five informative covariates, while σi depends on only one. No informative variables are shared between the two parameters. For the ZINB distribution, µi depends on five informative variables, σi on two, and νi on one. One variable is shared across all three parameters.

To summarize these different scenarios for a total of six informative variables, x1, ..., x6:

(1A, 2A)
$$\mu_i = \beta_{1\mu} x_{1i} + \beta_{2\mu} x_{2i} + \beta_{3\mu} x_{3i} + \beta_{4\mu} x_{4i}$$
$$\log(\sigma_i) = \beta_{3\sigma} x_{3i} + \beta_{4\sigma} x_{4i} + \beta_{5\sigma} x_{5i} + \beta_{6\sigma} x_{6i}$$

(1B, 2B)
$$\log(\mu_i) = \beta_{1\mu} x_{1i} + \beta_{2\mu} x_{2i} + \beta_{3\mu} x_{3i} + \beta_{4\mu} x_{4i} + \beta_{5\mu} x_{5i}$$
$$\log(\sigma_i) = \beta_{6\sigma} x_{6i}$$

(3A)
$$\log(\mu_i) = \beta_{1\mu} x_{1i} + \beta_{2\mu} x_{2i} + \beta_{3\mu} x_{3i}$$
$$\log(\sigma_i) = \beta_{3\sigma} x_{3i} + \beta_{4\sigma} x_{4i} + \beta_{5\sigma} x_{5i}$$
$$\operatorname{logit}(\nu_i) = \beta_{1\nu} x_{1i} + \beta_{5\nu} x_{5i} + \beta_{6\nu} x_{6i}$$

(3B)
$$\log(\mu_i) = \beta_{1\mu} x_{1i} + \beta_{2\mu} x_{2i} + \beta_{3\mu} x_{3i} + \beta_{4\mu} x_{4i} + \beta_{5\mu} x_{5i}$$
$$\log(\sigma_i) = \beta_{5\sigma} x_{5i} + \beta_{6\sigma} x_{6i}$$
$$\operatorname{logit}(\nu_i) = \beta_{6\nu} x_{6i}$$

To evaluate the performance of stability selection, two criteria have to be considered: first, the true positive rate, i.e., the number of true positives (TP, the number of correctly identified informative variables); second, the false positive rate, i.e., the number of false positives (FP, the number of non-informative variables that were selected as stable predictors).

Considering stability selection, the most obvious control parameter to influence the false and true positive rates is the threshold πthr. To evaluate the algorithms depending on the settings of stability selection, we consider several values for the number of variables to be included in the model, q ∈ {8, 15, 25, 50}, and the threshold πthr (varying between 0.55 and 0.99 in steps of 0.01). A third factor is the number of (noise) variables in the model: we consider p = 50, 250 or 500 covariates (including the six informative ones). It should be noted that the actual number of possible base-learners is p times the number of distribution parameters, as each covariate can be included in one or more additive predictors. To visualize the simulation results, the progression of true and false positives is plotted against the threshold πthr for different values of p and q, where true and false positives are aggregated over all distribution parameters. Separate figures for each distribution parameter can be found in the web supplement. The setting p = 50, q = 50 is an edge case that would work for some assumptions about the distribution of selection probabilities [38]. Since the practical application of this scenario is doubtful, we will not examine it further here.

3.2.1 Results

It can be observed that with increasing threshold πthr, the number of true positives as well as the number of false positives declines in all six scenarios (see Figures 4 to 9) and for every combination of p and q. This is a natural consequence: as the threshold is increased, fewer variables are selected.


Fig. 4: Balanced case with normal distribution (Scenario 1A).

Fig. 5: Unbalanced case with normal distribution (Scenario 1B).

Furthermore, the PFER, which is to be controlled by stability selection, decreases with increasing threshold πthr (see Eq. 9).

Results for the normal distribution In the balanced case (Figure 4) a higher number of true positives can be observed for the noncyclical algorithm compared to the cyclical algorithm for most simulation settings. Particularly for smaller values of q (q ∈ {8, 15}) the true positive rate was always higher compared to the cyclical variant. For higher q values the margin decreases, and for the highest settings both methods have approximately the same progression over πthr, with slightly better results for the cyclical algorithm. Overall, the number of true positives increases with a higher value of q. Hofner et al. [12] found similar results for boosting with one-dimensional prediction functions, but also showed that the true positive rate decreases again after a certain value of q. This could not be verified for the multidimensional case.

The false positive rate is extremely low for both methods, especially in the high-dimensional settings. The noncyclical fitting method has a constantly smaller or identical false positive rate, and the difference reduces for higher πthr, as expected. For all settings the false positive rate reaches zero for a threshold higher than 0.9. The setting with the highest false positive rate is p = 50 and q = 25, a low-dimensional case with a relatively high threshold. This is also the only setting where on average all 8 informative variables are found (for a threshold of 0.55).

In the unbalanced case (Figure 5) the results are similar. The number of false positives for the noncyclical variant is lower compared to the cyclical approach in almost all settings. The main difference between the balanced and the unbalanced case is that the number of true positives for the p = 50, q = 25 setting is almost identical in the former case, whereas in the latter case the noncyclical variant dominates the cyclical algorithm. On the other hand, in the high-dimensional case with a small q (p = 500, q = 8) both fitting methods have about the same true positive rate for all possible threshold values.

In summary, the novel noncyclical algorithm is generally better than, or at least comparable to, the cyclical method in identifying informative variables. Furthermore, its false positive rate is lower than or identical to that of the cyclical method. For some scenarios in which the scale parameter σi is higher compared to the location parameter µi, the cyclical variant achieves slightly better results than the noncyclical variant regarding true positives at high p and q values.

Results for the negative binomial distribution In the balanced case of the negative binomial distribution (Figure 6), the number of true positives is almost identical for the cyclical and noncyclical algorithm in all settings, while the number of true positives is generally quite high. It varies between 6 and 8 in almost all settings, except for the cases with a very small value of q (= 8), where it is slightly lower.


Fig. 6: Balanced case with negative binomial distribution (Scenario 2A).

Fig. 7: Unbalanced case with negative binomial distribution (Scenario 2B).

This is consistent with the results for stability selection with one-dimensional boosting [12,26]. The number of false positives in the noncyclical variant is smaller or identical to the cyclical variant in all tested settings.

In the unbalanced case the true positive rate of the noncyclical variant is higher compared to the cyclical variant, whereas the difference reduces for larger values of q.

Fig. 8: Balanced case with zero-inflated negative binomial distribution (Scenario 3A).

The results are consistent with the normal distribution setting, but with smaller differences between both methods.

Results for the ZINB distribution The third considered distribution in our simulation setting is the ZINB distribution, which features three parameters to fit. In Figure 8, the results for the balanced case (Scenario 3A) are visualized. The tendency towards a larger number of true positives in the noncyclical variant, which could be observed for both two-parametric distributions, is not present here. For all settings, except for high-dimensional settings with a low q (i.e., p = 250, 500 and q = 50), the cyclical variant has a higher number of true positives. Additionally, the number of false positives is constantly higher for the noncyclical variant.

For the unbalanced setting (Figure 9) the results for true positives and negatives are similar between both methods. The number of true positives is overall considerably smaller compared to all other simulation settings. Particularly in the high-dimensional cases (p = 250, 500), not even half of the informative covariates are found. In settings with smaller q the number of true positives is lower than two. Both algorithms obtain approximately the same number of true positives for all settings. In cases with a very low or a very high q (i.e., q = 8 or 50), the noncyclical algorithm is slightly better. The number of false positives is very high, especially compared with the number of true positives, and particularly for the unbalanced case.


Fig. 9: Unbalanced case with zero-inflated negative binomial distribution (Scenario 3B).

In many settings, more than half of the included variables are non-informative. The number of false positives is higher for the noncyclical case. The differences are especially pronounced in settings with a high q and a low πthr, i.e., those settings which also have the highest numbers of true positives.

Altogether, the trend from the simulated two-parameter distributions is not present in the three-parametric setting. The cyclical algorithm is overall not worse, or even better, with regard to both true and false positives for almost all tested scenarios.

4 Modelling sea duck abundance

A recent analysis by Smith et al. [39] investigated the abundance of wintering sea ducks in Nantucket Sound, Massachusetts, USA. Spatio-temporal abundance data for common eider (among other species) were collected between 2003 and 2005 by counting sea ducks on multiple aerial strip transects from a small plane. For the subsequent analysis, the research area was split into 2.25 km² segments (see Figure 10). Researchers were interested in variables that explained and predicted the distribution of the common eider in the examined area. As the data were zero-inflated (75% of the segments contained no birds) and highly skewed (a small number of segments contained up to 30000 birds), a hurdle model [30] was used for estimation.

Fig. 10: Nantucket Sound – research area of the seabird study by Smith et al. [39]. Squares are the discretized segments in which bird abundance was studied. Gray lines indicate all aerial transects flown over the course of the study. The black polygon indicates the location of permitted wind energy development on Horseshoe Shoal.

Therefore, the model was split into an occupancy model (zero part) and an abundance model (count part). The occupancy model estimated if a segment was populated at all and was fitted by boosting a generalized additive model (GAM) with binomial loss, i.e., an additive logistic regression model. In the second step, the number of birds in populated segments was estimated with a boosted GAMLSS model. Because of the skewed and long-tailed data, the (zero-truncated) negative binomial distribution was chosen for the abundance model (compare [30]).

We reproduce the common eider model reported by Smith et al. [39] but apply the novel noncyclical algorithm; Smith et al. used the cyclical algorithm to fit the GAMLSS model. As discussed in Section 3.2, we apply the noncyclical algorithm with inner loss. In short, both distribution parameters, mean and overdispersion of the abundance model, and the probability of bird sightings in the occupancy model were regressed on a large number of biophysical covariates, spatial and spatio-temporal effects, and some pre-defined interactions. A complete list of the considered effects can be found in the web supplement. To allow model selection (i.e., the selection between modelling alternatives), the covariate effects were split into linear and nonlinear base-learners [20,14].


Fig. 11: Spatial effects for the mean (upper figure) and overdispersion (lower figure) of the seabird population, plotted over easting and northing (in km).

The step-length was set to sl = 0.3 and the optimal number of boosting iterations mstop was found via 25-fold subsampling with sample size n/2 [27]. Additionally, we used stability selection to obtain sparser models. The number of variables to be included per boosting run was set to q = 35 and the per-family error rate was set to 6. With the unimodality assumption this resulted in a threshold of πthr = 0.9. These settings were chosen identically to the original choices in [39].
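In code, the stability selection step corresponds approximately to the following call, assuming a noncyclically fitted gamboostLSS object eider_model and the stabsel interface provided via the stabs package; the object name is illustrative and the exact arguments may differ between package versions.

```r
## stability selection for the abundance model (sketch)
library(gamboostLSS)
library(stabs)

# q = 35 base-learners per subsampling run, target per-family error rate 6;
# under the unimodality assumption this corresponds to a threshold of 0.9
sel <- stabsel(eider_model, q = 35, PFER = 6, assumption = "unimodal")
plot(sel)        # selection frequencies, cf. Figure 12
selected(sel)    # the stable set of base-learners
```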

4.1 Results

Subsampling yielded an optimal mstop of 2231, split into mstop,µ = 1871 and mstop,σ = 336. The resulting model selected 46 out of 48 possible covariates for µ and 8 out of 48 for σ, which is far too complex a model (especially for µ) to be useful.
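The covariates selected for each distribution parameter can be read off a fitted gamboostLSS object directly; the following line is a hypothetical illustration, assuming a fitted model object abund as in the sketch above.

lapply(coef(abund), names)   # names of the selected base-learners per distribution parameter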

Fig. 12: Selection frequencies of the 20 most frequently selected biophysical covariate base-learners of common eider abundance, determined by stability selection with q = 35 and PFER = 6. The grey line represents the corresponding threshold of 0.9.

With stability selection (see Figure 12), 10 effects were selected for the location: the intercept, relative sea surface temperature (smooth), chlorophyll-a levels (smooth), chromophoric dissolved organic material levels (smooth), sea floor sediment grain size (linear and smooth), sea floor surface area (smooth), mean epibenthic tidal velocity (smooth), a smooth spatial interaction, the presence of nearby ferry routes (yes/no), and two factors to account for changes in 2004 and 2005 compared to the year 2003. For the overdispersion parameter, 5 effects were selected: sea surface temperature (linear), bathymetry (linear), the mean (smooth) and standard deviation (linear) of the epibenthic tidal velocity, and the linear spatial interaction. For the location, all metric variables entered the model nonlinearly; only sediment grain size was selected both linearly and nonlinearly. The converse was true for the overdispersion parameter: only the mean epibenthic velocity was selected as a smooth effect, while all others entered as linear effects. The spatial effects for the mean and overdispersion are shown in Figure 11.



4.2 Comparison to results of the cyclic method

Comparing this model with the results of Smith et al. [39], the noncyclical model was larger in µ (10 effects, compared to 8 effects) but smaller in σ (5 effects, compared to 7 effects). Chlorophyll-a levels, mean epibenthic tidal velocity, smooth spatial variation and year were not selected for the mean by stability selection with the cyclical fitting algorithm. On the other hand, bathymetry was selected by the cyclical fitting method but not by the noncyclical one. For the overdispersion parameter, the cyclical algorithm selected the year and the northing of a segment (the north-south position of a segment relative to the median) in addition to all effects selected by the noncyclical variant. Most effects were selected by both the cyclical and the noncyclical algorithm, and the differences in the selected effects were rather small. In the simulation study for the negative binomial distribution (Section 3), the noncyclical variant had a smaller false positive rate and a higher true positive rate. Even though the simulation was simplified compared to this application (only linear effects, known true number of informative covariates, uncorrelated effects), the results suggest preferring the noncyclical variant. Nonetheless, the interpretation of selected covariate effects and the final model assessment rest ultimately with subject matter experts.

5 Conclusion

The main contribution of this paper is a statistical model building algorithm that combines the three approaches of gradient boosting, GAMLSS and stability selection. As shown in our simulation studies and the application to sea duck abundance in Section 4, the proposed algorithm incorporates the flexibility of structured additive regression modelling via GAMLSS, while it simultaneously allows for a data-driven generation of sparse models. Being based on the gamboostLSS framework by Mayr et al. [25], the main feature of the new algorithm is a new "noncyclical" fitting method for boosted GAMLSS models. As shown in the simulation studies, this method not only increases the flexibility of the variable selection mechanism used in gamboostLSS, but is also more time-efficient than the traditional cyclical fitting algorithm. In fact, even though the initial runtime to fit a single model may be higher (especially if the base-learner selection is done via the outer loss approach), this time is regained when finding the optimal number of boosting iterations via cross-validation approaches. Furthermore, the convergence speed of the new algorithm proved to be faster, and consequently fewer boosting iterations were needed in total.

Regarding stability selection, we observed that the noncyclical algorithm often had fewer false positives as well as more true positives than the cyclical variant for the two-parameter distributions tested in our simulation study. For high-dimensional cases, however, the differences between both methods decreased and, especially with regard to the number of true positives, approximately equal results were achieved. For three-parameter distributions the cyclical variant achieved better values throughout with respect to both true and false positive rates. This may be due to the fact that for more complex distributions, similar densities can be achieved with different parameter settings. For example, in a zero-inflated negative binomial setting, a small location may be hard to distinguish from a large zero-inflation. The behavior of the cyclical variant appears to be more robust in these situations than that of the noncyclical variant, which tends to fit very different models on each subsample and consequently selects a higher number of non-informative variables.

In summary, we have developed a framework for model building in GAMLSS that simplifies traditional optimization approaches to a great extent. For practitioners and applied statisticians, the main consequence of the new methodology is the incorporation of fewer noise variables in the GAMLSS model, leading to sparser and thus more interpretable models. Furthermore, the tuning of the new algorithm is far more efficient and leads to much shorter run times, particularly for complex distributions.

Implementation

The derived fitting methods for gamboostLSS models are implemented in the R add-on package gamboostLSS [15]. The fitting algorithm can be specified via the method argument. By default, method is set to "cyclical", which is the originally proposed algorithm. The two new noncyclical algorithms can be selected via method = "inner" and method = "outer". Base-learners and some of the basic methods are implemented in the R package mboost [18,16,19]. The basic fitting algorithm for each distribution parameter is also implemented in mboost. For a tutorial and an explanation of technical details of gamboostLSS see [17]. Stability selection is implemented in the R package stabs [13,12], with a specialized function for gamboostLSS models included in gamboostLSS itself. The source code of mboost, gamboostLSS and stabs is openly hosted at

http://www.github.com/boost-R/mboost



http://www.github.com/boost-R/gamboostLSS
http://www.github.com/hofnerb/stabs.
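As a minimal, hypothetical usage sketch of this interface (dat, y, x1 and x2 are placeholders, and the method values are those named above), a noncyclical model could be fitted and tuned as follows; note that, in contrast to the cyclical algorithm, the search for the optimal mstop is one-dimensional.

## Minimal sketch: choosing the fitting algorithm and tuning mstop
library(gamboostLSS)

## Noncyclical fitting with base-learner selection via the inner loss
mod <- gamboostLSS(y ~ x1 + x2, data = dat,
                   families = GaussianLSS(),
                   control = boost_control(mstop = 1000, nu = 0.1),
                   method = "inner")

## Optimal mstop via 25-fold subsampling, a one-dimensional search
cvr <- cvrisk(mod, folds = cv(model.weights(mod),
                              type = "subsampling", B = 25))
mstop(mod) <- mstop(cvr)   # set the model to the cross-validated number of iterations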

Acknowledgements

We thank Mass Audubon for the use of common eider abundance data.

References

1. Aho, K., Derryberry, D.W., Peterson, T.: Model selection for ecologists: the worldviews of AIC and BIC. Ecology 95, 631–636 (2014)

2. Anderson, D.R., Burnham, K.P.: Avoiding pitfalls when using information-theoretic methods. The Journal of Wildlife Management pp. 912–918 (2002)

3. Bühlmann, P., Gertheiss, J., Hieke, S., Kneib, T., Ma, S., Schumacher, M., Tutz, G., Wang, C., Wang, Z., Ziegler, A., et al.: Discussion of "the evolution of boosting algorithms" and "extending statistical boosting". Methods Inf Med 53(6), 436–445 (2014)

4. Bühlmann, P., Hothorn, T.: Boosting algorithms: Regularization, prediction and model fitting. Statistical Science 22, 477–505 (2007)

5. Bühlmann, P., Hothorn, T.: Twin boosting: improved feature selection and prediction. Statistics and Computing 20, 119–138 (2010)

6. Bühlmann, P., Yu, B.: Boosting with the L2 loss: Regression and classification. Journal of the American Statistical Association 98, 324–339 (2003)

7. Bühlmann, P., Yu, B.: Sparse boosting. The Journal of Machine Learning Research 7, 1001–1024 (2006)

8. Dormann, C.F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carre, G., Marquez, J.R.G., Gruber, B., Lafourcade, B., Leitao, P.J., Münkemüller, T., McClean, C., Osborne, P.E., Reineking, B., Schröder, B., Skidmore, A.K., Zurell, D., Lautenbach, S.: Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography 36, 27–46 (2013)

9. Flack, V.F., Chang, P.C.: Frequency of selecting noise variables in subset regression analysis: A simulation study. The American Statistician 41(1), 84–86 (1987)

10. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Annals of Statistics 28(2), 337–407 (2000)

11. Hastie, T.J., Tibshirani, R.J.: Generalized additive models, vol. 43. CRC Press (1990)

12. Hofner, B., Boccuto, L., Göker, M.: Controlling false discoveries in high-dimensional situations: boosting with stability selection. BMC Bioinformatics 16(1), 144 (2015)

13. Hofner, B., Hothorn, T.: stabs: Stability Selection with Error Control (2015). URL http://CRAN.R-project.org/package=stabs. R package version 0.5-1

14. Hofner, B., Hothorn, T., Kneib, T., Schmid, M.: A framework for unbiased model selection based on boosting. Journal of Computational and Graphical Statistics 20, 956–971 (2011)

15. Hofner, B., Mayr, A., Fenske, N., Thomas, J., Schmid, M.: gamboostLSS: Boosting Methods for GAMLSS Models (2016). URL http://CRAN.R-project.org/package=gamboostLSS. R package version 1.5-0

16. Hofner, B., Mayr, A., Robinzonov, N., Schmid, M.: Model-based boosting in R – A hands-on tutorial using the R package mboost. Computational Statistics 29, 3–35 (2014)

17. Hofner, B., Mayr, A., Schmid, M.: gamboostLSS: An R package for model building and variable selection in the GAMLSS framework. Journal of Statistical Software 74(1), 1–31 (2016)

18. Hothorn, T., Buehlmann, P., Kneib, T., Schmid, M., Hofner, B.: Model-based boosting 2.0. Journal of Machine Learning Research 11, 2109–2113 (2010)

19. Hothorn, T., Buehlmann, P., Kneib, T., Schmid, T., Hofner, B.: mboost: Model-Based Boosting (2016). URL http://CRAN.R-project.org/package=mboost. R package version 2.6-0

20. Hothorn, T., Müller, J., Schröder, B., Kneib, T., Brandl, R.: Decomposing environmental, spatial, and spatiotemporal components of species distributions. Ecological Monographs 81, 329–347 (2011)

21. Huang, S.M.Y., Huang, J., Fang, K.: Gene network-based cancer prognosis analysis with sparse boosting. Genetics Research 94, 205–221 (2012)

22. Li, P.: Robust logitboost and adaptive base class (abc) logitboost. arXiv preprint arXiv:1203.3491 (2012)

23. Mayr, A., Binder, H., Gefeller, O., Schmid, M., et al.: The evolution of boosting algorithms. Methods Inf Med 53(6), 419–427 (2014)

24. Mayr, A., Binder, H., Gefeller, O., Schmid, M., et al.: Extending statistical boosting. Methods Inf Med 53(6), 428–435 (2014)

25. Mayr, A., Fenske, N., Hofner, B., Kneib, T., Schmid, M.: Generalized additive models for location, scale and shape for high-dimensional data - a flexible approach based on boosting. Journal of the Royal Statistical Society, Series C - Applied Statistics 61(3), 403–427 (2012)

26. Mayr, A., Hofner, B., Schmid, M.: Boosting the discriminatory power of sparse survival models via optimization of the concordance index and stability selection. BMC Bioinformatics 17(1), 288 (2016)

27. Mayr, A., Hofner, B., Schmid, M., et al.: The importance of knowing when to stop. Methods Inf Med 51(2), 178–186 (2012)

28. Meinshausen, N., Bühlmann, P.: Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(4), 417–473 (2010)

29. Messner, J.W., Mayr, G.J., Zeileis, A.: Non-homogeneous boosting for predictor selection in ensemble post-processing. Working papers, Faculty of Economics and Statistics, University of Innsbruck (2016). URL http://EconPapers.repec.org/RePEc:inn:wpaper:2016-04

30. Mullahy, J.: Specification and testing of some modified count data models. Journal of Econometrics 33(3), 341–365 (1986)

31. Murtaugh, P.A.: Performance of several variable-selection methods applied to real ecological data. Ecology Letters 12, 1061–1068 (2009)

32. Opelt, A., Fussenegger, M., Pinz, A., Auer, P.: Weak hypotheses and boosting for generic object detection and recognition. In: European Conference on Computer Vision, pp. 71–84. Springer (2004)

33. Osorio, J.D.G., Galiano, S.G.G.: Non-stationary analysis of dry spells in monsoon season of Senegal River Basin using data from regional climate models (RCMs). Journal of Hydrology 450–451, 82–92 (2012)

34. Rigby, R.A., Stasinopoulos, D.M.: Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics) 54(3), 507–554 (2005)

35. Rigby, R.A., Stasinopoulos, D.M., Akantziliotou, C.: Instructions on how to use the gamlss package in R (2008). URL http://www.gamlss.org/wp-content/uploads/2013/01/gamlss-manual.pdf

36. Schmid, M., Hothorn, T.: Boosting additive models using component-wise P-splines. Computational Statistics & Data Analysis 53(2), 298–311 (2008)

37. Schmid, M., Potapov, S., Pfahlberg, A., Hothorn, T.: Estimation and regularization techniques for regression models with multidimensional prediction functions. Statistics and Computing 20(2), 139–150 (2010)

38. Shah, R.D., Samworth, R.J.: Variable selection with error control: Another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75(1), 55–80 (2013)

39. Smith, A.D., Hofner, B., Osenkowski, J.E., Allison, T., Sadoti, G., McWilliams, S.R., Paton, P.W.C.: Spatiotemporal modeling of animal abundance using boosted GAMLSS hurdle models: a case study using wintering sea ducks. Methods in Ecology and Evolution (2016). Under review