
R News
The Newsletter of the R Project
Volume 7/2, October 2007

Editorial
by Torsten Hothorn

Welcome to the October 2007 issue of R News, the second issue for this year! Last week, R 2.6.0 was released with many new useful features: dev2bitmap allows semi-transparent colours to be used, plotmath offers access to all characters in the Adobe Symbol encoding, and three higher-order functions Reduce, Filter, and Map are now available. In addition, the startup time has been substantially reduced when starting without the methods package. All the details about changes in R 2.6.0 are given on pages 54 ff.

The Comprehensive R Archive Network continues to reflect a very active R user and developer community. Since April, 187 new packages have been uploaded, ranging from ADaCGH, providing facilities to deal with array CGH data, to yest for model selection and variance estimation in Gaussian independence models. Kurt Hornik presents this impressive list starting on page 61.

The first North American R user conference took place in Ames, Iowa, last August, and many colleagues and friends from all over the world enjoyed a well-organized and interesting conference. Duncan Murdoch and Martin Maechler report on useR! 2007 on page 74. As I'm writing this, Uwe Ligges and his team at Dortmund university are working hard to organize next year's useR! 2008; Uwe's invitation to Dortmund, Germany, appears on the last page of this issue.

Three of the nine papers contained in this issue were presented at useR! 2006 in Vienna: Peter Dalgaard reports on new functions for multivariate analysis, Heather Turner and David Firth describe their new package gnm for fitting generalized nonlinear models, and Olivia Lau and colleagues use ecological inference models to infer individual-level behavior from aggregate data.

Further topics are the analysis of functional magnetic resonance imaging and environmental data, matching for observational studies, and machine learning methods for survival analysis as well as for categorical and continuous data. In addition, an application to create web interfaces for R scripts is described, and Fritz Leisch completes this issue with a review of Michael J. Crawley's The R Book.

Torsten Hothorn
Ludwig–Maximilians–Universität München, Germany
[email protected]

Contents of this issue:

Editorial . . . 1
New Functions for Multivariate Analysis . . . 2
gnm: A Package for Generalized Nonlinear Models . . . 8
fmri: A Package for Analyzing fmri Data . . . 13
Optmatch: Flexible, Optimal Matching for Observational Studies . . . 18
Random Survival Forests for R . . . 25
Rwui: A Web Application to Create User Friendly Web Interfaces for R Scripts . . . 32
The np Package: Kernel Methods for Categorical and Continuous Data . . . 35
eiPack: R × C Ecological Inference and Higher-Dimension Data Management . . . 43
The ade4 Package — II: Two-table and K-table Methods . . . 47
Review of "The R Book" . . . 53
Changes in R 2.6.0 . . . 54
Changes on CRAN . . . 61
R Foundation News . . . 72
News from the Bioconductor Project . . . 73
Past Events: useR! 2007 . . . 74
Forthcoming Events: useR! 2008 . . . 75
Forthcoming Events: R Courses in Munich . . . 76


New Functions for Multivariate Analysis
by Peter Dalgaard

R (and S-PLUS) used to have limited support for multivariate tests. We had the manova function, which extended the features of aov to multivariate responses, but like aov, this effectively assumed a balanced design, and was not capable of dealing with the within-subject transformations that are commonly used in repeated measurement modelling.

Although the methods encoded in procedures available in SAS and SPSS can seem somewhat old-fashioned, they do have some added value relative to analysis by mixed model methodology, and they have a strong tradition in several applied areas. It was thus worthwhile to extend R's capabilities to handle contrast tests, as well as Greenhouse-Geisser and Huynh-Feldt epsilons. The extensions also provide flexible ways of dealing with linear models with a multivariate response.

Theoretical setting

The general setup is given by

Y ∼ N(ΞB, I ⊗ Σ)

Here, Y is an N × p matrix and Σ is a p × p covariance matrix. The rows yi of Y are independent with the same covariance Σ.

Ξ is an N × k design matrix (the reader will have to forgive me for not using the traditional X, but that symbol is reserved for other purposes later on) and B is a k × p matrix of regression coefficients.

Thus, we have the same linear model for all p response coordinates, with separate parameters contained in the columns of B.

Standard test procedures

From classical univariate and multivariate theory, a number of standard tests are available. We shall focus on three of them:

1. Testing hypotheses of simplified mean value structure: This reduction is required to be the same for all coordinates, effectively replacing the design matrix Ξ with one spanning a subspace. Such tests take the form of generalized F tests, replacing the variance ratio by

R = MSres⁻¹ MSeff

in which the MS terms are matrices which generalize the mean square terms from analysis of variance. Under the hypothesis, R should be distributed around the unit matrix (in the sense that the two MS matrices both have mean Σ; notice, however, that MSeff will be rank deficient whenever the degrees of freedom for the effect is less than p), but for a test statistic we need to reduce R to a scalar measure. Four such measures are cited in the literature, namely Wilks's Λ, the Pillai trace, the Hotelling-Lawley trace, and Roy's greatest root. These are all based on combinations of the eigenvalues of R. Details can be found in, e.g., Hand and Taylor (1987).

Wilks's Λ is equivalent to the likelihood ratio test, but R and S-PLUS have traditionally favoured the Pillai trace based on the (rather vague) recommendations cited in Hand and Taylor (1987). Each test can be converted to an approximately F distributed statistic. If the tests are for a single-degree-of-freedom hypothesis, the matrix R has only one non-zero eigenvalue and all four tests are equivalent.

2. Testing whether Σ is proportional to a given matrix, say Σ0 (which is usually the unit matrix I): This is known as Mauchly's test of sphericity. It is based on a comparison of the determinant and the trace of U = Σ0⁻¹ S, where S is the SSD (deviation sum-of-squares-and-products) matrix (MSres instead of S is equivalent). Specifically,

W = det(U) / (tr(U)/p)^p

is close to 1 if U is close to a diagonal matrix of dimension p with a constant value along the diagonal. The test statistic −f log W is an asymptotic χ² on p(p + 1)/2 − 1 degrees of freedom (where f is the degrees of freedom for the covariance matrix). An improved approximation is found in Anderson (1958). A small computational sketch of this statistic is given after the list.

3. Testing linear hypotheses as in point 1 above, but assuming that sphericity holds. In this case, we are assuming that the covariance is known up to a constant, and thus are effectively in a univariate setting of (weighted) least squares theory. The relevant F statistic will have (p f1, p f2) degrees of freedom if the coordinatewise tests have (f1, f2) degrees of freedom.
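The following is a hedged computational sketch of Mauchly's statistic from item 2 above, not the package's mauchly.test() code; S is assumed to be an SSD matrix with f degrees of freedom, and Sigma0 the reference matrix (often the identity).

p <- nrow(S)
U <- solve(Sigma0, S)                  # U = Sigma0^{-1} S
W <- det(U) / (sum(diag(U)) / p)^p     # Mauchly's criterion
stat <- -f * log(W)
pchisq(stat, p * (p + 1) / 2 - 1, lower.tail = FALSE)   # asymptotic p-value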

Within-subject transformations

It is often necessary to consider a transformation of responses: If, for instance, coordinates are repeated measures, we might wish to test for "no change over time" or "same changes over time in different groups" (profile analysis). This leads to analysis of within-subjects contrasts, rather than of the observations themselves.

If we assume a covariance structure with random effects both between and within subject, the contrasts will cancel out the between-subject variation. They will effectively behave as if the between-subject term wasn't there and satisfy a sphericity condition.

Hence we consider the transformed response YT′ (which has rows Tyi). In many cases T is chosen to annihilate another matrix X, i.e., it satisfies TX = 0; by rotation invariance, different such choices of T will lead to the same inference as long as they have maximal rank. In such cases it may well be more convenient to specify X rather than T.

For profile analysis, X could be a p-vector of ones. In that case, the hypothesis of compound symmetry implies sphericity of TΣT′ w.r.t. TT′ (but sphericity does not imply compound symmetry), and the hypothesis EYT′ = 0 is equivalent to a flat (constant level) profile.

More elaborate within-subject designs exist. As an example, consider a complete two-way layout, in which case we might want to look at the sets of (a) contrasts between row means, (b) contrasts between column means, and (c) interaction contrasts.

These contrasts connect to the theory of balanced mixed-effects analysis of variance. A model in which all interactions with subject are considered random, i.e., random effects of subjects, as well as random interaction between subject and rows and between subjects and columns, implies sphericity of each of the above contrasts. The proportionality constants will be different linear combinations of the random effect variances.

A common pattern for such sets of contrasts is as follows: Define two subspaces of R^p by matrices X and M, such that span(X) ⊂ span(M). The transformation is calculated as T = PM − PX, where P denotes orthogonal projection. Actually, T defined like this would be singular, and therefore T is thinned by deletion of linearly dependent rows (it can fairly easily be seen that it does not matter exactly how this is done). Putting M = I (the identity matrix) will make T satisfy TX = 0, which is consistent with the situation described earlier.

The choice of X and M is most easily done by viewing them as design matrices for linear models in p-dimensional space. Consider again a two-way intrasubject design. To make T a transformation which calculates contrasts between column means, choose M to describe a model with different means in each column and X to describe a single mean for all data. Preferably, but equivalently for a balanced design, let M describe an additive model in rows and columns and let X be a model in which there is only a row effect.

There is a hidden assumption involved in the choice of the orthogonal projection onto span(M). This calculates (for each subject) an ordinary least squares (OLS) fit, and for some covariance structures a weighted fit may be more appropriate. However, OLS is not actually biased, and the efficiency that we might gain is offset by the need to estimate a rather large number of parameters to find the optimal weights.

The epsilons

Assume that the conditions for a balanced analysis of variance are roughly valid. We may decide to compute, say, column means for each subject and test whether the contrasts are zero on average. There could be a loss of efficiency in the use of unweighted means, but the critical issue for an F test is whether the sphericity condition holds for the contrasts between column means.

If the sphericity condition is not valid, then the F test is in principle wrong. Instead, we could use one of the multivariate tests, but they will often have low power due to the estimation of a large number of parameters in the empirical covariance matrix.

For this reason, a methodology has evolved in which "near-sphericity" data are analyzed by F tests, but applying the so-called epsilon corrections to the degrees of freedom. The theory originates in Box (1954a), in which it is shown that F is approximately distributed as F(ε f1, ε f2), where

ε = (∑ λi/p)² / (∑ λi²/p)

and the λi are the eigenvalues of the true covariance matrix (after contrast transformation, and with respect to a similarly transformed identity matrix). One may notice that 1/p ≤ ε ≤ 1; the upper limit corresponds to sphericity (all eigenvalues equal) and the lower limit corresponds to the case where there is one dominating eigenvalue, so that data are effectively one-dimensional. Box (1954b) details the result as a function of the elements of the covariance matrix in the two-way layout.

The Box correction requires knowledge of the true covariance matrix. Greenhouse and Geisser (1959) suggest the use of εGG, which is simply the empirical version of ε, inserting the empirical covariance matrix for the true one. However, it is easily seen that this estimator is biased, at least when the true epsilon is close to 1, since εGG < 1 almost surely when p > 1.

The Huynh-Feldt correction (Huynh and Feldt, 1976) is

εHF = ((f + 1) p εGG − 2) / (p (f − p εGG))

where f is the number of degrees of freedom for the empirical covariance matrix. (The original paper has N instead of f + 1 in the numerator for the split-plot design. A correction was published 15 years later (Lecoutre, 1991), but another 15 years later, this error is still present in SAS and SPSS.)

Notice that εHF can be larger than one, in which case you should use the uncorrected F test. Also, εHF is obtained by bias-correcting the numerator and denominator of εGG, which is not guaranteed to be helpful, and simulation studies in Huynh and Feldt (1976) indicate that it actually makes the approximations worse when εGG < 0.7 or so.
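As a hedged computational sketch (not the package's internal sphericity() code), the two epsilon estimates can be obtained from an estimated covariance matrix V of the contrast-transformed responses with f degrees of freedom; V and f are placeholders here.

lambda <- eigen(V, only.values = TRUE)$values
p <- length(lambda)
eGG <- (sum(lambda) / p)^2 / (sum(lambda^2) / p)        # Greenhouse-Geisser
eHF <- ((f + 1) * p * eGG - 2) / (p * (f - p * eGG))    # Huynh-Feldt
c(GG = eGG, HF = min(eHF, 1))   # an eHF above 1 means the uncorrected test is used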

Implementation in R

Basics

Quite a lot of the calculations could be copied from the existing manova and summary.manova code for balanced designs. Also, objects of class mlm, inheriting from lm, were already defined as the output from lm(Y ~ ...) when Y is a matrix.

It was necessary to add the following new functions:

• SSD creates an object of (S3) class SSD, which is the sums of squares and products matrix augmented by degrees of freedom and information about the call. This is a generic function with a method for mlm objects.

• estVar calculates the estimated variance-covariance matrix. It has methods for SSD and mlm objects. In the former case, it just normalizes the SSD by the degrees of freedom, and in the latter case it calculates SSD(object) and then applies the SSD method.

• anova.mlm is used to compare two multivariate linear models or partition a single model in the usual cumulative way. The various multivariate tests can be specified using test="Pillai", etc., just as in summary.manova. In addition, it is possible to specify test="Spherical", which gives the F test under assumption of sphericity.

• mauchly.test tests for sphericity of a covariance matrix with respect to another.

• sphericity calculates the εGG and εHF based on the SSD matrix and its degrees of freedom. This function is private to the stats namespace (at least currently, as of R version 2.6.0) and is used to generate the headings and adjusted p-values in anova.mlm.

Representing transformations

As previously described, sphericity tests and tests of linear hypotheses may make sense only for a transformed response. Hence we need to be able to specify transformations in anova.mlm and mauchly.test. The code has several ways to deal with this:

• The transformation matrix can be given directly using the argument T.

• It is possible to specify the arguments X and M as the matrices described previously. The defaults are set so that the default transformation is the identity.

• It is also possible to specify X and/or M using model formulas. In that case, they usually need to refer to a data frame which describes the intra-subject design. This can be given in the idata argument. The default for idata is a data frame consisting of the single variable index=1:p, which may be used to specify, e.g., a polynomial response for equispaced data.

Example

These data from the book by Maxwell and Delaney (1990) are also used by Baron and Li (2006). They show reaction times where ten subjects respond to stimuli in the absence and presence of ambient noise, and using stimuli tilted at three different angles.

> reacttime <- matrix(c(
+ 420, 420, 480, 480, 600, 780,
+ 420, 480, 480, 360, 480, 600,
+ 480, 480, 540, 660, 780, 780,
+ 420, 540, 540, 480, 780, 900,
+ 540, 660, 540, 480, 660, 720,
+ 360, 420, 360, 360, 480, 540,
+ 480, 480, 600, 540, 720, 840,
+ 480, 600, 660, 540, 720, 900,
+ 540, 600, 540, 480, 720, 780,
+ 480, 420, 540, 540, 660, 780),
+ ncol = 6, byrow = TRUE,
+ dimnames = list(subj = 1:10,
+   cond = c("deg0NA", "deg4NA", "deg8NA",
+            "deg0NP", "deg4NP", "deg8NP")))

The same data are used by example(estVar) and example(anova.mlm), so you can load the reacttime matrix just by running the examples. The following is mainly an expanded explanation of those examples.

First let us calculate the estimated covariance matrix:

> mlmfit <- lm(reacttime ~ 1)
> estVar(mlmfit)
        cond
cond     deg0NA deg4NA deg8NA deg0NP deg4NP deg8NP
  deg0NA   3240   3400   2960   2640   3600   2840
  deg4NA   3400   7400   3600    800   4000   3400
  deg8NA   2960   3600   6240   4560   6400   7760
  deg0NP   2640    800   4560   7840   8000   7040
  deg4NP   3600   4000   6400   8000  12000  11200
  deg8NP   2840   3400   7760   7040  11200  13640


In this case there is no between-subjects structure, except for a common mean, so the result is equivalent to var(reacttime).

Next we consider tests for whether the response depends on the design at all. We generate a contrast transformation which is orthogonal to an intercept-only within-subject model; this is equivalent to any full-rank set of within-subject contrasts. A test based on multivariate normal theory is performed as follows.

> mlmfit0 <- update(mlmfit, ~0)
> anova(mlmfit, mlmfit0, X = ~1)
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrasts orthogonal to
~1

  Res.Df Df Gen.var. Pillai approx F
1      9     1249.57
2     10  1  2013.16   0.95    17.38
  num Df den Df   Pr(>F)
1
2      5      5 0.003534 **

This gives the default Pillai test, but actually, you get the same result with the other multivariate tests since the degrees of freedom (per coordinate) only changes by one.

To perform the same test, but assuming sphericity of the covariance matrix, just add an argument to the anova call:

> anova(mlmfit, mlmfit0, X=~1, test="Spherical")
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrasts orthogonal to
~1

Greenhouse-Geisser epsilon: 0.4855
Huynh-Feldt epsilon:        0.6778

  Res.Df Df Gen.var.      F num Df
1      9     1249.6
2     10  1   2013.2 38.028      5
  den Df    Pr(>F)    G-G Pr    H-F Pr
1
2     45 4.471e-15 2.532e-08 7.393e-11

It can be noticed that the epsilon corrections in this case are rather strong, but that the corrected p-values nevertheless are considerably more significant with this approach, reflecting the low power of the multivariate test. This is commonly the case when the number of replications is low.

To test the hypothesis of sphericity, we employ Mauchly's criterion:

> mauchly.test(mlmfit, X=~1)

        Mauchly's test of sphericity

Contrasts orthogonal to
~1

data:  SSD matrix from lm(formula = reacttime ~ 1)
W = 0.0311, p-value = 0.04765

Accordingly, the hypothesis of sphericity is rejected at the 0.05 significance level.

However, the analysis of contrasts completely disregards the two-way intrasubject design. To generate tests for overall effects of deg and noise, as well as interaction between the two, we do as follows: First we need to set up the idata data frame to describe the design.

> idata <- expand.grid(
+   deg = c("0", "4", "8"),
+   noise = c("A", "P"))
> idata
  deg noise
1   0     A
2   4     A
3   8     A
4   0     P
5   4     P
6   8     P

Then we can specify the tests using model formulas to specify the X and M matrices. To test for an effect of deg, we let M specify an additive model in deg and noise and let X be the model with noise alone.

> anova(mlmfit, mlmfit0, M = ~ deg + noise,
+   X = ~ noise,
+   idata = idata, test = "Spherical")
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrasts orthogonal to
~noise

Contrasts spanned by
~deg + noise

Greenhouse-Geisser epsilon: 0.9616
Huynh-Feldt epsilon:        1.2176

  Res.Df Df Gen.var.      F num Df
1      9     1007.0
2     10  1   2703.2 40.719      2
  den Df    Pr(>F)    G-G Pr    H-F Pr
1
2     18 2.087e-07 3.402e-07 2.087e-07

It might at this point be helpful to explain the roles of X and M a bit further. To do this, we need to access an internal function in the stats namespace, namely proj.matrix. This function calculates the projection matrix for a given design matrix, i.e. PX = X(X′X)⁻¹X′. In the above context, we have

M <- model.matrix(~ deg + noise, data = idata)
P1 <- stats:::proj.matrix(M)
X <- model.matrix(~ noise, data = idata)
P2 <- stats:::proj.matrix(X)

The two design matrices are (eliminating attributes)

> M
  (Intercept) deg4 deg8 noiseP
1           1    0    0      0
2           1    1    0      0
3           1    0    1      0
4           1    0    0      1
5           1    1    0      1
6           1    0    1      1
> X
  (Intercept) noiseP
1           1      0
2           1      0
3           1      0
4           1      1
5           1      1
6           1      1

For printing the projection matrices, it is useful to include the fractions function from MASS.

> library("MASS")
> fractions(P1)
     1    2    3    4    5    6
1  2/3  1/6  1/6  1/3 -1/6 -1/6
2  1/6  2/3  1/6 -1/6  1/3 -1/6
3  1/6  1/6  2/3 -1/6 -1/6  1/3
4  1/3 -1/6 -1/6  2/3  1/6  1/6
5 -1/6  1/3 -1/6  1/6  2/3  1/6
6 -1/6 -1/6  1/3  1/6  1/6  2/3
> fractions(P2)
    1   2   3   4   5   6
1 1/3 1/3 1/3   0   0   0
2 1/3 1/3 1/3   0   0   0
3 1/3 1/3 1/3   0   0   0
4   0   0   0 1/3 1/3 1/3
5   0   0   0 1/3 1/3 1/3
6   0   0   0 1/3 1/3 1/3

Here, P2 is readily recognized as the operator that replaces each of the first three values by their average, and similarly the last three. The other one, P1, is a little harder, but if you look long enough, you will recognize the formula xij = xi· + x·j − x·· from two-way ANOVA.

The transformation T is the difference between P1 and P2:

> fractions(P1 - P2)
     1    2    3    4    5    6
1  1/3 -1/6 -1/6  1/3 -1/6 -1/6
2 -1/6  1/3 -1/6 -1/6  1/3 -1/6
3 -1/6 -1/6  1/3 -1/6 -1/6  1/3
4  1/3 -1/6 -1/6  1/3 -1/6 -1/6
5 -1/6  1/3 -1/6 -1/6  1/3 -1/6
6 -1/6 -1/6  1/3 -1/6 -1/6  1/3

This is the representation of xi· − x··, where i and j index levels of deg and noise, respectively. The matrix has (row) rank 2; for the actual computations, only the first two rows are used.

The important aspect of this matrix is that it assigns equal weights to observations at different levels of noise and that it constructs a maximal set of within-deg differences. Any other such choice of transformation leads to equivalent results, since it will be a full-rank transformation of T. For instance:

> T1 <- rbind(c(1, -1, 0, 1, -1, 0), c(1, 0, -1, 1, 0, -1))
> anova(mlmfit, mlmfit0, T = T1,
+   idata = idata, test = "Spherical")
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrast matrix
1 -1  0 1 -1  0
1  0 -1 1  0 -1

Greenhouse-Geisser epsilon: 0.9616
Huynh-Feldt epsilon:        1.2176

  Res.Df Df Gen.var.      F num Df
1      9      12084
2     10  1    32438 40.719      2
  den Df    Pr(>F)    G-G Pr    H-F Pr
1
2     18 2.087e-07 3.402e-07 2.087e-07

This differs from the previous analysis only in the Gen.var. column, which is not invariant to scaling and transformation of data.

Returning to the interpretation of the results, notice that the epsilons are much closer to one than before. This is consistent with a covariance structure induced by a mixed model containing a random interaction between subject and deg. If we perform the similar procedure for noise we get epsilons of exactly 1, because we are dealing with a single-df contrast.

> anova(mlmfit, mlmfit0, M = ~ deg + noise,
+   X = ~ deg,
+   idata = idata, test = "Spherical")
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrasts orthogonal to
~deg

Contrasts spanned by
~deg + noise

Greenhouse-Geisser epsilon: 1
Huynh-Feldt epsilon:        1

  Res.Df Df Gen.var.      F num Df
1      9       1410
2     10  1     6030 33.766      1
  den Df    Pr(>F)    G-G Pr    H-F Pr
1
2      9 0.0002560 0.0002560 0.0002560

However, we really shouldn't be doing these tests on individual effects of deg and noise if there is interaction between the two, and as the following output shows, there is:

> anova(mlmfit, mlmfit0, X = ~ deg + noise,
+   idata = idata, test = "Spherical")
Analysis of Variance Table

Model 1: reacttime ~ 1
Model 2: reacttime ~ 1 - 1

Contrasts orthogonal to
~deg + noise

Greenhouse-Geisser epsilon: 0.904
Huynh-Feldt epsilon:        1.118

  Res.Df Df Gen.var.     F num Df
1      9     316.58
2     10  1   996.34 45.31      2
  den Df    Pr(>F)    G-G Pr    H-F Pr
1
2     18 9.424e-08 3.454e-07 9.424e-08

Finally, to test between-within interactions, rephrase them as effects of between-factors on within-contrasts. To illustrate this, we introduce a fake grouping f of the reacttime data into two groups. The interaction between f and the (implied) column factor is obtained by testing whether the contrasts orthogonal to ~1 depend on f.

> f <- factor(rep(1:2, 5))
> mlmfit2 <- update(mlmfit, ~f)
> anova(mlmfit2, X = ~1, test = "Spherical")
Analysis of Variance Table

Contrasts orthogonal to
~1

Greenhouse-Geisser epsilon: 0.4691
Huynh-Feldt epsilon:        0.6758

            Df       F num Df den Df
(Intercept)  1 34.9615      5     40
f            1  0.2743      5     40
Residuals    8
               Pr(>F)    G-G Pr    H-F Pr
(Intercept) 1.382e-13 2.207e-07 8.254e-10
f             0.92452   0.79608   0.86456
Residuals

More detailed tests of interactions between f and deg, f and noise, and the three-way interaction can be constructed using M and X settings as described previously.
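For instance, a hedged sketch following the same pattern as above (not taken from the original article): with M and X chosen as for the deg contrasts, the row for f in the resulting table would test the f-by-deg between-within interaction.

> anova(mlmfit2, M = ~ deg + noise, X = ~ noise,
+   idata = idata, test = "Spherical")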

Bibliography

T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1958.

J. Baron and Y. Li. Notes on the use of R for psychology experiments and questionnaires, 2006. URL http://www.psych.upenn.edu/~baron/rpsych/rpsych.html.

G. E. P. Box. Some theorems on quadratic forms applied in the study of analysis of variance problems, I. Effect of inequality of variance in the one-way classification. Annals of Mathematical Statistics, 25(2):290–302, 1954a.

G. E. P. Box. Some theorems on quadratic forms applied in the study of analysis of variance problems, II. Effect of inequality of variance and of correlation between errors in the two-way classification. Annals of Mathematical Statistics, 25(3):484–498, 1954b.

S. W. Greenhouse and S. Geisser. On methods in the analysis of profile data. Psychometrika, 24:95–112, 1959.

D. J. Hand and C. C. Taylor. Multivariate Analysis of Variance and Repeated Measures. Chapman and Hall, 1987.

H. Huynh and L. S. Feldt. Estimation of the Box correction for degrees of freedom from sample data in randomized block and split-plot designs. Journal of Educational Statistics, 1(1):69–82, 1976.

B. Lecoutre. A correction for the ε approximate test in repeated measures designs with two or more independent groups. Journal of Educational Statistics, 16(4):371–372, 1991.

S. E. Maxwell and H. D. Delaney. Designing Experiments and Analyzing Data: A Model Comparison Perspective. Brooks/Cole, Pacific Grove, CA, 1990.

Peter Dalgaard
Department of Biostatistics
University of Copenhagen
[email protected]


gnm: A Package for Generalized Nonlinear Models
by Heather Turner and David Firth

In a generalized nonlinear model, the expectation of a response variable Y is related to a predictor η — a possibly nonlinear function of parameters — via a link function g:

g[E(Y)] = η(β)

and the variance of Y is equal to, or proportional to, a known function v[E(Y)]. This class of models may be thought of as extending the class of generalized linear models by allowing nonlinear terms in the predictor, or extending the class of nonlinear least squares models by allowing the variance of the response to depend on the mean.

The gnm package provides facilities for the specification, estimation and inspection of generalized nonlinear models. Models are specified via symbolic formulae, with nonlinear terms specified by functions of class "nonlin". The fitting procedure uses an iterative weighted least squares algorithm, adapted to work with over-parameterized models. Identifiability constraints are not essential for model specification and estimation, and so the difficulty of automatically defining such constraints for the entire family of generalized nonlinear models is avoided. Constraints may be pre-specified, if desired; otherwise, functions are provided in the package for conducting inference on identifiable parameter combinations after a model in an over-parameterized representation has been fitted.

Specific models which the package may be used to fit include models with multiplicative interactions, such as row-column association models (Goodman, 1979), UNIDIFF (uniform difference) models for social mobility (Xie, 1992; Erikson and Goldthorpe, 1992), GAMMI (generalized additive main effects and multiplicative interaction) models (e.g. van Eeuwijk, 1995), and Lee-Carter models for trends in age-specific mortality (Lee and Carter, 1992); diagonal-reference models for dependence on a square or hyper-square classification (Sobel, 1981, 1985); Rasch-type logit or probit models for legislative voting (e.g. de Leeuw, 2006); and stereotype multinomial regression models for ordinal response (Anderson, 1984). A comprehensive manual is distributed with the package (see vignette("gnmOverview", package = "gnm")) and this manual may also be downloaded from http://go.warwick.ac.uk/gnm. Here we give an introduction to the key functions and provide some illustrative applications.

Key Functions

The primary function defined in package gnm is the model-fitting function of the same name. This function is patterned after glm (the function included in the standard stats package for fitting generalized linear models), taking similar arguments and returning an object of class c("gnm", "glm", "lm").

A model formula is specified to gnm as the first argument. The conventional symbolic form is used to specify linear terms, whilst nonlinear terms are specified using functions of class "nonlin". The nonlin functions currently exported in gnm are summarized in Figure 1. These functions enable the specification of basic mathematical functions of predictors (Exp, Inv and Mult) and some more specialized nonlinear terms (MultHomog, Dref). Often arguments of nonlin functions may themselves be parametric functions described in symbolic form. For example, an exponential decay model with additive errors

y = α + exp(β + γx) + e    (1)

is specified using the Exp function as

gnm(y ~ Exp(1 + x))

These "sub-formulae" (to which intercepts are not added by default) can include other nonlin functions, allowing more complex nonlinear terms to be built up. Users may also define custom nonlin functions, as we shall illustrate later.

Dref        to specify a diagonal reference term
Exp         to specify the exponential of a predictor
Inv         to specify the reciprocal of a predictor
Mult        to specify a product of predictors
MultHomog   to specify a multiplicative interaction with homogeneous effects

Figure 1: "nonlin" functions in the gnm package.

The remaining arguments to gnm are mainly control parameters and are of secondary importance. However, the eliminate argument — which implements a feature seen previously in the GLIM 4 statistical modelling system — can be extremely useful when a model contains the additive effect of a (typically 'nuisance') factor with a large number of levels; in such circumstances the use of eliminate can substantially improve computational efficiency, as well as allowing more convenient model summaries with the 'nuisance' factor omitted.
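As a hedged illustration of this argument (the data frame and variable names here are hypothetical, not taken from the package documentation), a model with a multiplicative term and a many-level nuisance factor strata might be specified as

fit <- gnm(count ~ x + Mult(a, b), eliminate = strata,
           family = poisson, data = mydata)
summary(fit)   # the eliminated strata effects are omitted from the summary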

Most of the methods and accessor functions implemented for "glm" or "lm" objects are also implemented for "gnm" objects, such as print, summary, plot, coef, and so on. Since "gnm" models may be over-parameterized, care is needed when interpreting the output of some of these functions. In particular, for parameters that are not identified, an arbitrary parameterization will be returned and the standard error will be displayed as NA.

For estimable parameters, methods for profile and confint enable inference based on the profile likelihood. These methods are designed to handle the asymmetric behaviour of the log-likelihood function that is a well-known feature of many nonlinear models.

Estimates and standard errors of simple "sum-to-zero" contrasts can be obtained using getContrasts, if such contrasts are estimable. For more general linear combinations of parameters, the lower-level se function may be used instead.

Example

We illustrate here the use of some of the key functions in gnm through the analysis of a contingency table from Goodman (1979). The data are a cross-classification of a sample of British males according to each subject's occupational status and his father's occupational status. The contingency table is distributed with gnm as the data set occupationalStatus.

Goodman (1979) analysed these data using row-column association models, in which the interaction between the row and column factors is represented by components of a multiplicative interaction. The log-multiplicative form of the RC(1) model — the row-column association model with one component of the multiplicative interaction — is given by

log μrc = αr + βc + γrδc,    (2)

where μrc is the cell mean for row r and column c. Before modelling the occupational status data, Goodman (1979) first deleted cells on the main diagonal. Equivalently, we can separate out the diagonal effects as follows:

log μrc = αr + βc + θrc + γrδc    (3)

where

θrc = φr if r = c, and θrc = 0 otherwise.

In addition to the "nonlin" functions provided by gnm, the package includes a number of functions for specifying structured linear interactions. The diagonal effects in model 3 can be specified using the gnm function Diag. Thus we can fit model 3 as follows:

> set.seed(1)
> RC <- gnm(Freq ~ origin + destination +
+   Diag(origin, destination) +
+   Mult(origin, destination),
+   family = poisson,
+   data = occupationalStatus)

An abbreviated version of the output given by summary(RC) is shown in Figure 2. Default constraints (as specified by options("contrasts")) are applied to linear terms: in this case the first levels of the origin and destination factors have been set to zero. However, the "main effects" are not estimable, since they are aliased with the unconstrained multiplicative interaction. If the gnm call were re-evaluated from a different random seed, the parameterization for the main effects and multiplicative interaction would differ from that shown in Figure 2. On the other hand, the diagonal effects, which are essentially contrasts with the off-diagonal elements, are identified here. This is immediately apparent from Figure 2, since standard errors are available for these parameters.

One could consider extending the model by adding a further component of the multiplicative interaction, giving an RC(2) model:

log μrc = αr + βc + θrc + γrδc + θrφc.    (4)

Most of the "nonlin" functions, including Mult, have an inst argument to allow the specification of multiple instances of a term, as here:

Freq ~ origin + destination +
    Diag(origin, destination) +
    Mult(origin, destination, inst = 1) +
    Mult(origin, destination, inst = 2)

Multiple instances of a term with an inst argument may be specified in shorthand using the instances function provided in the package:

Freq ~ origin + destination +
    Diag(origin, destination) +
    instances(Mult(origin, destination), 2)

The formula is then expanded by gnm before the model is fitted.

In the case of the occupational status data, however, the RC(1) model is a good fit (a residual deviance of 29.149 on 28 degrees of freedom). So rather than extending the model, we shall see if it can be simplified by using homogeneous scores in the multiplicative interaction:

log μrc = αr + βc + θrc + γrγc    (5)

This model can be obtained by updating RC using update:

> RChomog <- update(RC, . ~ .
+   - Mult(origin, destination)
+   + MultHomog(origin, destination))

We can compare the two models using anova, which shows that there is little gained by allowing heterogeneous row and column scores in the association of the fathers' occupational status and the sons' occupational status:


Call:
gnm(formula = Freq ~ origin + destination + Diag(origin, destination) +
    Mult(origin, destination), family = poisson, data = occupationalStatus)

Deviance Residuals: ...

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                    0.11881         NA      NA       NA
origin2                        0.49005         NA      NA       NA
...
origin8                        1.60214         NA      NA       NA
destination2                   0.95334         NA      NA       NA
...
destination8                   1.42813         NA      NA       NA
Diag(origin, destination)1     1.47923    0.45401   3.258  0.00112
...
Diag(origin, destination)8     0.40731    0.21930   1.857  0.06327
Mult(., destination).origin1  -1.74022         NA      NA       NA
...
Mult(., destination).origin8   1.65900         NA      NA       NA
Mult(origin, .).destination1  -1.32971         NA      NA       NA
...
Mult(origin, .).destination8   0.66730         NA      NA       NA
...
Residual deviance: 29.149 on 28 degrees of freedom
...

Figure 2: Abbreviated summary of model RC.

> anova(RChomog, RC)
Analysis of Deviance Table

Model 1: Freq ~ origin + destination +
    Diag(origin, destination) +
    MultHomog(origin, destination)
Model 2: Freq ~ origin + destination +
    Diag(origin, destination) +
    Mult(origin, destination)
  Resid. Df Resid. Dev Df Deviance
1        34     32.561
2        28     29.149  6    3.412

All of the parameters in model 5 can be made identifiable by constraining one level of the homogeneous multiplicative factor to zero. This can be achieved by using the constrain argument to gnm, but this would require re-fitting the model. As we are only really interested in the parameters of the multiplicative interaction, we investigate simple contrasts of these parameters using getContrasts. The gnm function pickCoef enables us to select the parameters of interest using a regular expression to match against the parameter names:

> contr <- getContrasts(RChomog,
+   pickCoef(RChomog, "MultHomog"))

A summary of contr is shown in Figure 3. By default, getContrasts sets the first level of the homogeneous multiplicative factor to zero. The quasi standard errors and quasi variances are independent of parameterization; see Firth (2003) and Firth and de Menezes (2004) for more detail. In this example the reported relative-error diagnostics indicate that the quasi-variance approximation is rather inaccurate — in contrast to the high accuracy found in many other situations by Firth and de Menezes (2004).

Users may run through the example covered in this section by calling demo(gnm).

Custom "nonlin" functions

The small number of "nonlin" functions currently distributed with gnm allow a large class of generalized nonlinear models to be specified, as exemplified through the package documentation. Nevertheless, for some models a custom "nonlin" function may be necessary or desirable. As an illustration, we shall consider a logistic predictor of the form given by SSlogis (a selfStart model for use with the stats function nls),

Asym / (1 + exp((xmid − x)/scal)).    (6)

This predictor could be specified to gnm as

~ -1 + Mult(1, Inv(Const(1) +
    Exp(Mult(1 + offset(-x), Inv(1)))))


Model call: gnm(formula = Freq ~ origin + destination + Diag(origin, destination)
    + MultHomog(origin, destination), family = poisson, data = occupationalStatus)
                                estimate    SE quasiSE quasiVar
MultHomog(origin, destination)1    0.000 0.000  0.1573  0.02473
MultHomog(origin, destination)2    0.218 0.235  0.1190  0.01416
MultHomog(origin, destination)3    0.816 0.167  0.0611  0.00374
MultHomog(origin, destination)4    1.400 0.160  0.0518  0.00269
MultHomog(origin, destination)5    1.418 0.172  0.0798  0.00637
MultHomog(origin, destination)6    1.929 0.157  0.0360  0.00129
MultHomog(origin, destination)7    2.345 0.173  0.0796  0.00634
MultHomog(origin, destination)8    2.589 0.189  0.1095  0.01200
Worst relative errors in SEs of simple contrasts (%):  -19  4.4
Worst relative errors over *all* contrasts (%): -17.2 51.8

Figure 3: Simple contrasts of homogeneous row and column scores in model RChomog.

where Mult, Inv and Exp are "nonlin" functions, offset is the usual stats function and Const is a gnm function specifying a constant offset. However, this specification is rather convoluted and it would be preferable to define a single "nonlin" function that could specify a logistic term.

Our custom "nonlin" function only needs a single argument, to specify the variable x, so the skeleton of our definition might be as follows:

Logistic <- function(x){
}
class(Logistic) <- "nonlin"

Now we need to fill in the body of the function. The purpose of a "nonlin" function is to convert its arguments into a list of arguments for the internal function nonlinTerms. This function considers a nonlinear term as a mathematical function of predictors, the parameters of which need to be estimated, and possibly also variables, which have a coefficient of 1. In this terminology, Asym, xmid and scal in Equation 6 are parameters of intercept-only predictors, whilst x is a variable. Our Logistic function must specify the corresponding predictors and variables arguments of nonlinTerms. We can define the predictors argument as a named list of symbolic expressions

predictors = list(Asym = 1, xmid = 1, scal = 1)

and pass the user-specified variable x as the variables argument:

variables = list(substitute(x))

We must then define the term argument of nonlinTerms, which is a function that creates a deparsed mathematical expression of the nonlinear term from labels for the predictors and variables. These labels should be passed through arguments predLabels and varLabels respectively, so we can specify the term argument as

term = function(predLabels, varLabels){
    paste(predLabels[1], "/(1 + exp((",
          predLabels[2], "-", varLabels[1], ")/",
          predLabels[3], "))")
}

Our Logistic function need not specify any further arguments of nonlinTerms. However, we do have an idea of useful starting values, so we can also specify the start argument. This should be a function which takes a named vector of the parameters in the term and returns a vector of starting values for these parameters. The following function will set the initial scale parameter to one:

start = function(theta){
    theta[3] <- 1
    theta
}

Putting these components together, we have:

Logistic <- function(x){
    list(predictors =
             list(Asym = 1, xmid = 1, scal = 1),
         variables = list(substitute(x)),
         term = function(predLabels, varLabels){
             paste(predLabels[1],
                   "/(1 + exp((", predLabels[2],
                   "-", varLabels[1], ")/",
                   predLabels[3], "))")
         },
         start = function(theta){
             theta[3] <- 1
             theta
         })
}
class(Logistic) <- "nonlin"

Having defined this function, we can reproduce the example from ?SSlogis as shown below.

> Chick.1 <-
+   ChickWeight[ChickWeight$Chick == 1, ]


> fm1gnm <- gnm(weight ~ Logistic(Time) - 1,
+   data = Chick.1, trace = FALSE)
> summary(fm1gnm)

Call:
gnm(formula = weight ~ Logistic(Time) - 1,
    data = Chick.1)

Deviance Residuals:
   Min     1Q Median     3Q    Max
-4.154 -2.359  0.825  2.146  3.745

Coefficients:
     Estimate Std. Error t value Pr(>|t|)
Asym 937.0205   465.8569   2.011  0.07516
xmid  35.2228     8.3119   4.238  0.00218
scal  11.4052     0.9052  12.599 5.08e-07

(Dispersion parameter for gaussian family
 taken to be 8.51804)

Residual deviance: 76.662 on 9 degrees of freedom
AIC: 64.309

Number of iterations: 14

In general, whenever a nonlinear term can be represented as an expression that is differentiable using deriv, it should be possible to define a "nonlin" function to specify the term. Additional arguments of nonlinTerms allow for the specification of factors with homologous effects and control over the way in which parameters are automatically labelled.

Summary

The functions distributed in the gnm package enable a wide range of generalized nonlinear models to be estimated. Nonlinear terms that are not (easily) represented by the "nonlin" functions provided may be implemented as custom "nonlin" functions, under fairly weak conditions.

There are several features of gnm that we have not covered in this paper. Some facilitate model inspection, such as the ofInterest argument to gnm, which allows the user to customize model summaries and simplify parameter selection. Other features are designed for particular models, such as the residSVD function for generating good starting values for a multiplicative interaction. Still others relate to particular types of data, such as the expandCategorical function for expressing categorical data as counts. These features complement the facilities we have already described, to produce a substantial and flexible package for working with generalized nonlinear models.

Acknowledgments

This work was supported by the Economic and Social Research Council (UK) through Professorial Fellowship RES-051-27-0055.

Bibliography

J. A. Anderson. Regression and ordered categorical variables. J. R. Statist. Soc. B, 46(1):1–30, 1984.

J. de Leeuw. Principal component analysis of binary data by iterated singular value decomposition. Comp. Stat. Data Anal., 50(1):21–39, 2006.

R. Erikson and J. H. Goldthorpe. The Constant Flux. Oxford: Clarendon Press, 1992.

D. Firth. Overcoming the reference category problem in the presentation of statistical models. Sociological Methodology, 33:1–18, 2003.

D. Firth and R. X. de Menezes. Quasi-variances. Biometrika, 91:65–80, 2004.

L. A. Goodman. Simple models for the analysis of association in cross-classifications having ordered categories. J. Amer. Statist. Assoc., 74:537–552, 1979.

R. D. Lee and L. Carter. Modelling and forecasting the time series of US mortality. Journal of the American Statistical Association, 87:659–671, 1992.

M. E. Sobel. Diagonal mobility models: A substantively motivated class of designs for the analysis of mobility effects. Amer. Soc. Rev., 46:893–906, 1981.

M. E. Sobel. Social mobility and fertility revisited: Some new models for the analysis of the mobility effects hypothesis. Amer. Soc. Rev., 50:699–712, 1985.

F. A. van Eeuwijk. Multiplicative interaction in generalized linear models. Biometrics, 51:1017–1032, 1995.

Y. Xie. The log-multiplicative layer effect model for comparing mobility tables. American Sociological Review, 57:380–395, 1992.

Heather Turner
University of Warwick, UK
[email protected]

David Firth
University of Warwick, UK
[email protected]


fmri: A Package for Analyzing fmri Data
by J. Polzehl and K. Tabelow

This article describes the usage of the R package fmri to analyze single time series BOLD fMRI (blood-oxygen-level dependent functional magnetic resonance imaging) data using structure adaptive smoothing procedures (the Propagation-Separation approach) as described in Tabelow et al. (2006). See Polzehl and Tabelow (2006) for an extended documentation.

Analysing fMRI data with the fmri package

The approach implemented in the fmri package is based on a linear model for the hemodynamic response and structural adaptive spatial smoothing of the resulting Statistical Parametric Map (SPM). The package requires R (version ≥ 2.2). 3D visualization needs the R package tkrplot.

The statistical modeling implemented with this software is described in Tabelow et al. (2006).

NOTE! This software comes with absolutely no warranty! It is not intended for clinical use, but for evaluation purposes only. Absence of bugs can not be guaranteed!

We first start with a basic script for using the fmri package in a typical fMRI analysis. In the following sections we describe the consecutive steps of the analysis.

# read the data
data <- read.AFNI("afnifile")
# or read sequence of ANALYZE files
# analyze031file.hdr ... analyze137file.hdr
# data <- read.ANALYZE("analyze",
#   numbered = TRUE, "file", 31, 107)

# create expected BOLD signal and design matrix
hrf <- fmri.stimulus(107, c(18, 48, 78), 15, 2)
x <- fmri.design(hrf)

# generate parametric map from linear model
spm <- fmri.lm(data, x)

# adaptive smoothing with maximum bandwidth hmax
spmsmooth <- fmri.smooth(spm, hmax = 4)

# calculate p-values for smoothed parametric map
pvalue <- fmri.pvalue(spmsmooth)

# write slicewise results into a file or ...
plot(pvalue, maxpvalue = 0.01, device = "jpeg",
     file = "result.jpeg")

# ... use interactive 3D visualization
plot(pvalue, maxpvalue = 0.01, type = "3d")

Reading the data

The fmri package can read ANALYZE (Mayo Foundation, 2001), AFNI HEAD/BRIK (Cox, 1996), NIFTI and DICOM files. Use

data <- read.AFNI(<filename>)

data <- read.ANALYZE(<filename>)

to create the object data from the data in file 'filename'. Drop the extension in 'filename'. While AFNI data is generally given as a four-dimensional datacube in one file, ANALYZE format data often comes in a series of numbered files. A sequence of images in ANALYZE format can be imported by

data <- read.ANALYZE(prefix = "", numbered = TRUE,
                     postfix = "", picstart = 1,
                     numbpic = 1)

Setting the argument numbered to TRUE allows a list of image files to be read. The image files are assumed to be ordered by a number contained in the filename. picstart is the number of the first file, and numbpic the number of files to be read. The increment between consecutive numbers is assumed to be 1. prefix specifies a string preceding the number, whereas postfix identifies a string following the number in the filename. Note that read.ANALYZE() requires the file names to be of the form '<prefix>number<postfix>.hdr/img'.

Figure 1: View of a typical fMRI dataset. Data dimension is 64x64x26 with a total of 107 scans. A high noise level is typical. The time series is described by a linear model containing the experimental stimulus (red) and quadratic drift. Data: courtesy of H. Voss, Weill Medical College of Cornell University.

Both functions return lists of class "fmridata" with components containing the datacube ('ttt'). See Figure 1 for an illustration. The data cube is stored in 'raw' format to save memory. The data can be extracted from the list using extract.data(). Additionally the list contains basic information like the size of the data cube ('dim') and the voxel size ('delta'), as well as the complete header information ('header'), which is itself a list with components depending on the data format read. A head mask is defined by simply using the 75% quantile of the data grey levels as cut-off. This is only used to provide improved spatial correlation estimates for the head in fmri.lm().
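A minimal sketch of accessing the cube (the dimensions in the comment are those of the dataset in Figure 1):

ttt <- extract.data(data)   # 4D array of image intensities
dim(ttt)                    # e.g., 64 64 26 107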

Expected BOLD response

In voxels affected by the experiment the observed signal is assumed to follow the expected Blood Oxygenation Level Dependent (BOLD) signal. This signal depends on both the experimental stimulus, described by a task indicator function, and a hemodynamic response function h(t). We define h(t) as the difference of two gamma functions

h(t) = (t/d1)^a1 exp(−(t − d1)/b1) − c (t/d2)^a2 exp(−(t − d2)/b2)

with a1 = 6, a2 = 12, b1 = 0.9, b2 = 0.9, di = ai·bi (i = 1, 2), and c = 0.35, where t is the time in seconds; see Glover (1999). The expected BOLD response is given as a discrete convolution of this function with the task indicator function. Use

hrf <- fmri.stimulus(107, c(18, 48, 78), 15, 2)

to create the expected BOLD response for a stimulus with 107 scans, onset times at the 18th, 48th, and 78th scan, a duration of the stimulus of 15 scans, and a time TR = 2s between two scans. In case of multiple stimuli the results of fmri.stimulus() for the different stimuli should be arranged as columns of a matrix hrf using cbind().

The hemodynamic response function may have an unknown latency (Worsley and Taylor, 2005). This can be modeled by creating an additional explanatory variable as the first numerical derivative of any experimental stimulus:

dhrf <- (c(0,diff(hrf)) + c(diff(hrf),0))/2

See the next section for how to include this in the linear model.

Construction of the SPM

We adopt the common view of a linear model for the time series $Y_i = (Y_{it})$ in each voxel $i$ after reconstruction of the raw data and motion correction.

$$Y_i = X\beta_i + \varepsilon_i, \qquad (1)$$

where $X$ denotes the design matrix. The design matrix is created by

x <- fmri.design(hrf) .

This will include polynomial drift terms up to quadratic order. To deviate from this default, the order of the polynomial drift can be specified by a second argument. Use cbind() to combine several stimulus vectors into a matrix of stimuli before calling fmri.design().
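For example (a sketch; we assume here that the polynomial order is simply passed as the second argument, as just described), a cubic instead of quadratic drift could be requested by

x <- fmri.design(hrf, 3)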

The first $q$ columns of $X$ contain values of the expected BOLD response for the different stimuli evaluated at scan acquisition times. The other $p - q$ columns are chosen to be orthogonal to the expected BOLD responses and to account for a slowly varying drift and possible other external effects. The error vector $\varepsilon_i$ has zero expectation and is assumed to be correlated in time. In order to assess the variability of the estimates of $\beta_i$ correctly we have to take the correlation structure of the error vector $\varepsilon_i$ into account. We assume an AR(1) model to be sufficient for commonly used MRI scanners. The autocorrelation coefficients $\rho_i$ are estimated from the residual vector $r_i = (r_{i1}, \ldots, r_{iT})$ of the fitted model (1) as

$$\rho_i = \sum_{t=2}^{T} r_{it}\, r_{i(t-1)} \Big/ \sum_{t=1}^{T} r_{it}^2.$$

This estimate of the correlation coefficient is biased due to fitting the linear model (1). We therefore apply the bias correction given in Worsley et al. (2002), leading to a corrected estimate of $\rho_i$.

We then use prewhitening to transform model (1) into a linear model with approximately uncorrelated errors. The prewhitened linear model is obtained by multiplying the terms in (1) with some matrix $A_i$ depending on $\rho_i$. The prewhitening procedure thus results in a new linear model

$$\tilde{Y}_i = \tilde{X}_i \beta_i + \tilde{\varepsilon}_i \qquad (2)$$

with $\tilde{Y}_i = A_i Y_i$, $\tilde{X}_i = A_i X$, and $\tilde{\varepsilon}_i = A_i \varepsilon_i$. In the new model the errors $\tilde{\varepsilon}_i = (\tilde{\varepsilon}_{it})$ are approximately uncorrelated in time $t$, such that $\operatorname{var} \tilde{\varepsilon}_i = \sigma_i^2 I_T$. Finally least squares estimates $\hat{\beta}_i$ are obtained from model (2) as

$$\hat{\beta}_i = (\tilde{X}_i^T \tilde{X}_i)^{-1} \tilde{X}_i^T \tilde{Y}_i.$$

The error variance $\sigma_i^2$ is estimated from the residuals $\tilde{r}_i$ of the linear model (2) as $\hat{\sigma}_i^2 = \sum_{t=1}^{T} \tilde{r}_{it}^2 / (T - p)$, leading to estimated covariance matrices

$$\widehat{\operatorname{var}}\, \hat{\beta}_i = \hat{\sigma}_i^2 (\tilde{X}_i^T \tilde{X}_i)^{-1}.$$

Estimates $\hat{\beta}_i$ and their estimated covariance matrices are, in the simplest case, obtained by

spm <- fmri.lm(data, x)


where data is the data object read by read.AFNI() or read.ANALYZE(), and x is the design matrix created with fmri.design().

See figure 1 for an example of a typical fMRI dataset together with the result of the fit of the linear model to a time series.

To consider more than one stimulus and to estimate an effect

$$\gamma = c^T \beta \qquad (3)$$

defined by a vector of contrasts $c$, set the argument contrast of the function fmri.lm() correspondingly:

hrf1 <- fmri.stimulus(214, c(18, 78, 138), 15, 2)

hrf2 <- fmri.stimulus(214, c(48, 108, 168), 15, 2)

x <- fmri.design(cbind(hrf1, hrf2))

# stimulus 1 only

spm1 <- fmri.lm(data, x, contrast = c(1,0))

# stimulus 2 only

spm2 <- fmri.lm(data, x, contrast = c(0,1))

# contrast between both

spm3 <- fmri.lm(data, x, contrast = c(1,-1))

If the argument vvector is set, the component "cbeta" of the object returned by fmri.lm() contains a vector with the parameters corresponding to the non-zero elements in vvector in each voxel. This may be used to include unknown latency of the hemodynamic response function in the analysis. First define the expected BOLD response for a given stimulus and its derivative and then combine them into the design matrix

hrf <- fmri.stimulus(107, c(18, 48, 78), 15, 2)

dhrf <- (c(0,diff(hrf)) + c(diff(hrf),0))/2

x <- fmri.design(cbind(hrf, dhrf))

spm <- fmri.lm(data, x, vvector = c(1,1)) .

The specification of vvector in the last statement results in a vector of length 2 containing the two parameter estimates for the expected BOLD response and its derivative in each voxel. Furthermore the ratio of the variance estimates for these parameters is calculated as a prerequisite for the smoothing procedure. See fmri.smooth() for details about smoothing this parametric map.

The function returns an object with class attributes "fmridata" and "fmrispm". This is again a list with components containing the estimated parameter contrast ('cbeta') and its estimated variance ('var'), as well as estimated spatial correlations in all directions.

Structure Adaptive Smoothing (PS)

Smoothing of SPMs is applied in this context to improve the sensitivity of signal detection. This is expected to work since neural activations extend over regions of adjacent voxels. Averaging over such regions allows us to reduce the variance of the parameter estimates without compromising their mean. Introducing a spatial correlation structure also reduces the number of independent decisions made for signal detection and therefore eases the multiple test problem. Adaptive smoothing as implemented in this package also improves the specificity of signal detection; see figure 2 for an illustration.

The parameter map is smoothed with

spmsmooth <- fmri.smooth(spm, hmax = hmax)

where spm is the result of the function fmri.lm(). hmax is the maximum bandwidth for the smoothing algorithm. For lkern="Gaussian" the bandwidth is given in units of FWHM, for any other localization kernel the unit is voxel. hmax should be chosen as the expected maximum size of the activation areas. As adaptive smoothing automatically adapts to different sizes and shapes of the activation areas, oversmoothing is not to be expected.
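For illustration (a sketch based on the arguments named above; the value of hmax is arbitrary), smoothing with a Gaussian localization kernel and a maximum bandwidth of 8 (in FWHM units) would be requested by

spmsmooth <- fmri.smooth(spm, hmax = 8, lkern = "Gaussian")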

In Tabelow et al. (2006) the use of a spatial adaptive smoothing procedure derived from the Propagation-Separation approach (Polzehl and Spokoiny, 2006) has been proposed in this context. The approach focuses, for each voxel $i$, on simultaneously identifying a region where the unknown parameter $\gamma$ is approximately constant and obtaining an optimal estimate $\hat{\gamma}_i$ employing this structural information. This is achieved by an iterative procedure. Local smoothing is restricted to local vicinities of each voxel, which are characterized by a weighting scheme. Smoothing and characterization of local vicinities are alternated. Weights for a pair of voxels $i$ and $j$ are constructed as a product of kernel weights $K_{loc}(\delta(i, j)/h)$, depending on the distance $\delta(i, j)$ between the two voxels and a bandwidth $h$, and a factor reflecting the difference of the estimates $\hat{\gamma}_i$ and $\hat{\gamma}_j$ obtained within the last iteration. The bandwidth $h$ is increased with iterations up to a maximal bandwidth $h_{max}$.

The name Propagation-Separation is a synonym for the two main properties of this algorithm. In case of a completely homogeneous array $\Gamma$, that is $\mathrm{E}\,\gamma \equiv \mathrm{const.}$, the algorithm delivers essentially the same result as a nonadaptive kernel smoother employing the bandwidth $h_{max}$. In this case the procedure selects the best of a sequence of almost non-adaptive estimates, that is, it propagates to the one with maximum bandwidth. Separation means that as soon as within one iteration step significant differences of $\hat{\gamma}_i$ and $\hat{\gamma}_j$ are observed the corresponding weight is decreased to zero and the information from voxel $j$ is no longer used to estimate $\gamma_i$. Voxels $i$ and $j$ belong to different regions of homogeneity and are therefore separated. As a consequence smoothing is restricted to regions with approximately constant values of $\gamma$, bias at the border of such regions is avoided, and the spatial structure of activated regions is preserved.


Figure 2: A numerical phantom for studying the performance of PS vs. Gaussian smoothing. (a) Signal locations within one slice. Eight different signal-to-noise ratios, increasing clockwise, are coded by gray values. The spatial extent of activations varies in the radial direction. The data cube contains 15 slices with activation alternated with 9 empty slices. (b) Smoothing with a conventional Gaussian filter. (c) Smoothing with PS. In both (b) and (c), the bandwidth is 3 voxels which corresponds to FWHM = 10 mm for typical voxel size. In (b) and (c) the proportion, over slices containing activations, of voxels that are detected in a given position is rendered by gray values. Black corresponds to the absence of a detection.

For a formal description of this algorithm, a discussion of its properties and theoretical results we refer to Polzehl and Spokoiny (2006) and Tabelow et al. (2006). Numerical complexity, as well as smoothness within homogeneous regions, is controlled by the maximum bandwidth $h_{max}$.

If the argument object contains a parameter vector for each voxel (for example to include latency, see section 4), these will be smoothed according to their estimated variance ratio; see Tabelow et al. (2006) for details on the smoothing procedure.

Figure 2 provides a comparison of signal detection employing a Gaussian filter and structural adaptive smoothing.

Signal detection

Smoothing leads to variance reduction and thus signal enhancement. It leaves us with three-dimensional arrays $\hat{\Gamma}$, $S$ containing the estimated effects $\hat{\gamma}_i = c^T \hat{\beta}_i$ and their estimated standard deviations $s_i = (c^T \widehat{\operatorname{var}}\,\hat{\beta}_i\, c)^{1/2}$, obtained from time series of smoothed residuals. The voxelwise quotient $\hat{\theta}_i = \hat{\gamma}_i / s_i$ of both arrays forms a statistical parametric map (SPM) $\hat{\Theta}$. The SPM as well as a map of p-values are generated by

pvalue <- fmri.pvalue(spmsmooth)

Under the hypothesis, that is, in absence of activation, this SPM behaves approximately like a Gaussian Random Field, see Tabelow et al. (2006). We therefore use the theory of Gaussian Random Fields to assign appropriate p-values as a prerequisite for signal detection. Such p-values can be defined (Worsley et al., 1996) as

$$p_i = \sum_{d=0}^{3} R_d(V(r_x, r_y, r_z))\, \rho_d(\hat{\theta}_i) \qquad (4)$$

where $R_d(V)$ is the resel count of the search volume $V$ and $\rho_d(\hat{\theta}_i)$ is the EC density depending only on the parameter $\hat{\theta}_i$. $r_x, r_y, r_z$ denote the effective FWHM bandwidths that measure the smoothness (in resel space, see Worsley et al. (1996)) of the random field generated by a Gaussian filter that employs the bandwidth from the last iteration of the PS procedure (Tabelow et al., 2006). $R_d(V)$ and $\rho_d$ are given in Worsley et al. (1996). A signal will be detected in all voxels where the observed p-value is less than or equal to a specified threshold.

Finally we provide a statistical analysis including unknown latency of the hemodynamic response function. If spmsmooth contains a vector (see fmri.lm() and fmri.smooth()), a $\chi^2$ statistic is calculated from the first two parameters and used for p-value calculation. If delta is given, a cone statistic is used (Worsley and Taylor, 2005).

The parameter mode allows for different kinds of p-value calculation. "basic" corresponds to a global definition based on the amount of smoothness achieved by an equivalent Gaussian filter. The propagation condition ensures that under the hypothesis $\Theta = 0$ the adaptive smoothing performs like a non-adaptive filter with the same kernel function. "local" corresponds to a more conservative setting, where the p-values are derived from the estimated local resel counts that have been achieved by the adaptive smoothing. "global" takes a global median of these resel counts for calculation.
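A usage sketch of these alternatives (only the argument mode, described above, is varied; the remaining settings are as before):

pvalue <- fmri.pvalue(spmsmooth, mode = "local")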


Figure 3 provides the results of signal detection for an experimental fMRI data set.

Viewing results

Results can be displayed by a generic plot function

plot(object, anatomic,

device="jpeg", file="result.jpeg")

object is an object of class "fmridata" (and "fmrispm" or "fmripvalue") as returned by fmri.pvalue(), fmri.smooth(), fmri.lm(), read.ANALYZE() or read.AFNI(). anatomic is an anatomic underlay of the same dimension as the functional data. The argument type="3d" provides an interactive display to produce 3 dimensional illustrations (requires the R package tkrplot) on the screen. The default for type is "slice", which can create output to the screen and image files.

We also provide generic functions summary() and print() for objects with class attribute "fmridata".
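For example, the objects created above can be inspected with

summary(spm)
print(pvalue)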

Figure 3: Signal detection in 4 consecutive slices using the adaptive smoothing procedure. The shape and extent of the activation areas are conserved much better than with traditional Gaussian filtering. Data: courtesy of H. Voss, Weill Medical College of Cornell University.

Writing the results to files

Finally we provide functions to write data in standard medical image formats such as HEAD/BRIK, ANALYZE, and NIFTI.

write.AFNI("afnitest", data, c("signal"),

note="random data", origin=c(0,0,0),

delta=c(4,4,5), idcode="unique ID")

write.ANALYZE(data, list(pixdim=c(4,4,4,5)),

file="analyzetest")

Some basic header information can be provided, in case of ANALYZE files as a list with several elements (see documentation for syntax). Any datacube created during the analysis can be written.

Bibliography

Biomedical Imaging Resource. Analyze Program. Mayo Foundation, 2001.

R. W. Cox. AFNI: Software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomed. Res., 29:162–173, 1996.

G. H. Glover. Deconvolution of impulse response in event-related BOLD fMRI. NeuroImage, 9:416–429, 1999.

J. Polzehl and V. Spokoiny. Propagation-separation approach for local likelihood estimation. Probab. Theory and Relat. Fields, 135:335–362, 2006.

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2007. ISBN 3-900051-07-0.

K. Tabelow, J. Polzehl, H. U. Voss, and V. Spokoiny. Analyzing fMRI experiments with structural adaptive smoothing procedures. NeuroImage, 33:55–62, 2006.

J. Polzehl and K. Tabelow. Analysing fMRI experiments with the fmri package in R. Version 1.0 - A users guide. WIAS Technical Report No. 10, 2006.

K. J. Worsley. Spatial smoothing of autocorrelations to control the degrees of freedom in fMRI analysis. NeuroImage, 26:635–641, 2005.

K. J. Worsley, C. Liao, J. A. D. Aston, V. Petre, G. H. Duncan, F. Morales, and A. C. Evans. A general statistical analysis for fMRI data. NeuroImage, 15:1–15, 2002.

K. J. Worsley, S. Marrett, P. Neelin, K. J. Friston, and A. C. Evans. A unified statistical approach for determining significant signals in images of cerebral activation. Human Brain Mapping, 4:58–73, 1996.

K. J. Worsley and J. E. Taylor. Detecting fMRI activation allowing for unknown latency of the hemodynamic response. NeuroImage, 29:649–654, 2006.

Jörg Polzehl & Karsten Tabelow
Weierstrass Institute for Applied Analysis and Stochastics, Berlin, Germany
[email protected]
[email protected]


Optmatch: Flexible, Optimal Matching for Observational Studies

Ben B. Hansen

Observational studies compare subjects who received a specified treatment to others who did not, without controlling assignment to treatment and comparison groups. When the groups differ at baseline in ways that are relevant to the outcome, the study has to adjust for the differences. An old and particularly direct method of making these adjustments is to match treated subjects to controls who are similar in terms of their pretreatment characteristics, then conduct an outcome analysis conditioning upon the matched sets. Adjustments of this type enjoy properties of robustness (Rubin, 1979) and transparency not shared with purely model-based adjustments, such as covariance adjustment without matching or stratification; and with the introduction of propensity scores to matching (Rosenbaum and Rubin, 1985), the approach was shown to be more broadly applicable than was previously thought. Arguably, the reach of techniques based on matching now exceeds that of purely model-based adjustment (Hansen, 2004).

To achieve these benefits, matched adjustment requires the analyst to articulate a distinction between desirable and undesirable potential matches, and then to match treated and control subjects in such a way as to favor the more desirable pairings. Propensity scoring fits under the first of these tasks, as do the construction of Mahalanobis matching metrics (Rosenbaum and Rubin, 1985), prognostic scoring (Hansen, 2006b), and the distance metric optimization of Diamond and Sekhon (2006). The second task, matching itself, is less statistical in nature, but doing it well can substantially improve the power and robustness of matched inference (Hansen and Klopfer, 2006; Hansen, 2004). The main purpose of optmatch is to relieve the analyst of responsibility for this important, if potentially tedious, undertaking, freeing attention for other aspects of the analysis. Given discrepancies between each treatment and control subject that might potentially be matched, optmatch places them into non-overlapping matched sets, in the process solving the discrete optimization problems needed to make sums of matched discrepancies as small as possible; after this, the analysis can proceed using permutation inference (Rosenbaum, 2002; Hothorn et al., 2006; Bowers and Hansen, 2006), conditional inference (Breslow and Day, 1980; Cox and Snell, 1989; Hansen, 2004; Lumley and Therneau, 2006), approximately conditional inference (Pierce and Peters, 1992; Brazzale, 2005; Brazzale et al., 2006), or multilevel models (Smith, 1997; Raudenbush and Bryk, 2002; Gelman and Hill, 2006).

Optimal matching of two groups

To illustrate the meaning of optimal matching, consider Cox and Snell's (1981, p. 81) study of costs of nuclear power. Of 26 light water reactor plants constructed in the U.S. between 1967 and 1972, seven had been built on the site of existing plants. The problem is to estimate the cost benefit (or penalty) of building on an existing site as opposed to a new one. A matched analysis seeks to adjust for background characteristics determinative of cost, such as the date of construction and the capacity of the plant, by linking similar refurbished and new plants: plants of about the same capacity and constructed at about the same time, for example. To highlight the analogy with intervention studies, I refer to existing-site plants as "treatments" and new-site plants as "controls."

Consider the problem of arranging the plants in disjoint triples, each containing one treatment and two controls, placing each treatment and 14 of the 19 controls into some matched triple or another. A straightforward way to create such a match is to move down the list of treatments, pairing each to the two most similar controls that have not yet been matched; this is nearest-available matching. Figure 1 shows the 26 plants, their capacities and dates of construction, and a 1:2 matching constructed in this way. First A was matched to I and J, then B to L and N, and so forth. This example is discussed by Rosenbaum (2002, ch. 10).

Existing site
    date  capacity
A    2.3       660
B    3.0       660
C    3.4       420
D    3.4       130
E    3.9       650
F    5.9       430
G    5.1       420

New site
    date  capacity
H    3.6       290
I    2.3       660
J    3.0       660
K    2.9       110
L    3.2       420
M    3.4        60
N    3.3       390
O    3.6       160
P    3.8       390
Q    3.4       130
R    3.9       650
S    3.9       450
T    3.4       380
U    4.5       440
V    4.2       690
W    3.8       510
X    4.7       390
Y    5.4       140
Z    6.1       730

"date" is date of construction, in years after 1965; "capacity" is net capacity of the power plant, in MWe above 400.

Figure 1: 1:2 matching by a nearest-available algorithm.

How might this process be improved? To complete step i, the nearest-available algorithm requires a ranking of potential matches for treatment unit i, an ordering of available controls accounting for their differences with plant i in generating capacity and in year of construction. Typically controls j are ordered in terms of a numeric discrepancy, d[i, j], from i; Figure 1's match follows Rosenbaum (2002, ch. 10) in using the sum of rank differences on the two covariates (after restricting to a subset of the plants, pt!=1):

> data("nuclear", package="boot")
> attach(nuclear[nuclear$pt!=1,])
> drk <- rank(date)
> d <- outer(drk[pr==1], drk[pr!=1], "-")
> d <- abs(d)
> crk <- rank(cap)
> d <- d +
    abs(outer(crk[pr==1], crk[pr!=1], "-"))

(where pr==1 indicates the treatment group). The d that results from these operations is shown (after rounding) in Figure 3. Having calculated this d, one can pose the task of matching as a discrete optimization problem: find the match $M = \{(i, j)\}$ minimizing $\sum_{M} d(i, j)$ among all sets of pairs $(i, j)$ in which each treatment $i$ appears twice and each control $j$ appears at most once.

Optimal matching refers to algorithms guaranteed to find matches attaining this minimum, or falling within a specified tolerance of it, given an $n_t \times n_c$ discrepancy matrix. Optimal matching's performance advantage over heuristic, non-optimal algorithms can be striking. For example, in the problem of Figure 1 optimal matching reduces nearest-available's sum of discrepancies by 23%. This optimal solution, found by optmatch's pairmatch function, is shown in Figure 2.

By evaluating potential matches all together rather than sequentially, optimal matching (blue lines) reduces the sum of distances by 23%. The plants and their dates and capacities are as in Figure 1.

Figure 2: Optimal vs. greedy 1:2 matching.

An optimal match is optimal relative to given structural requirements, here that all treatments and a corresponding number of controls be arranged in 1:2 matched sets, and a given distance, here d. It is best-possible for the purposes of the analyst only insofar as the given distance and structural stipulations best represent the analyst's goals; in this sense "optimal" in "optimal matching" is analogous to the "maximum" of "maximum likelihood," which is never better than the chosen likelihood model it maximizes.

For example, in the problem just discussed the structural stipulation that the matched sets all be 1:2 triples may be poorly tailored to the goal of reducing baseline differences between the groups. It is appropriate if these differences stem entirely from a small minority of controls being too unlike any treatment subjects to bear comparison to them, since it does exclude 5/19 of potential controls from the final match; but differences on a larger number of controls, even small differences, require the techniques to be described under Generalizations of pair matching, below, which may also give similar or better bias reduction without excluding as many control observations. (See also Discussion, below.)

Growing your own discrepancy matrix

Figures 1 and 2 both illustrate multivariate distance matching, aligning units so as to minimize a sum of rank discrepancies. Optmatch is entirely flexible about the form of the distance on which matches are to be made. To propensity-score match nuclear plants, for example, one would prepare a propensity distance using

> pscr <- glm(pr ~ . -(pr+cost), family = binomial,
              data = nuclear)$linear.predictors
> PR <- nuclear$pr==1
> pdist <- outer(pscr[PR], pscr[!PR], "-")
> pscr.v <- (var(pscr[PR])*(sum(PR)-1) +
             var(pscr[!PR])*(sum(!PR)-1))/(length(PR)-2)
> pdist <- abs(pdist)/sqrt(pscr.v)

or, more simply and reliably,

> pmodel <- glm(pr ~ . -(pr+cost),
                family = binomial, data = nuclear)

> pdist <- pscore.dist(pmodel)

Then pdist is passed to pairmatch or fullmatch as its first argument. Other discrepancies on which one might match include Mahalanobis distances (which can be produced using mahal.dist) and combinations of Mahalanobis and propensity-based distances (Rosenbaum and Rubin, 1985; Gu and Rosenbaum, 1993; Rubin and Thomas, 2000). Many special requirements, such as that matches be made only within given subclasses, or that specific matches be avoided, are also introduced through the discrepancy matrix.

First consider the case where matches are to be made within subclasses only. In the nuclear dataset, plants with and without partial turnkey (pt) guarantees should be compared separately, since the meaning of the outcome variable, cost, changes with pt. Figure 2 shows only the pt!=1 plants, and its match is generated with the command pairmatch(d, controls=2), where d is the matrix in Figure 3. To match also partial turnkey plants to each other, one gathers into a list, dl, both d and a distance matrix dpt comparing new- and existing-site partial turnkey plants, then feeds dl to pairmatch as its first argument. For propensity or Mahalanobis matching, pscore.dist or mahal.dist would do this if given the formula pr~pt as their optional argument 'structure.fmla'.

More generally, the helper function makedist (which is called by pscore.dist and mahal.dist) eases the application of a (user-defined) discrepancy matrix-producing function to each of a sequence of strata in order to generate a list of distance matrices. For separate summed-rank distances by pt subclass, one would write a function that extracts portions of relevant variables from a given data frame, looking to a treatment-group variable to decide what portions to take, as in

> capdatediffs <- function(trt, dat) {
    crk <- rank(dat[names(trt), "cap"])
    names(crk) <- names(trt)
    dmt <- outer(crk[trt], crk[!trt], "-")
    dmt <- abs(dmt)
    drk <- rank(dat[names(trt), "date"])
    dmt <- dmt +
      abs(outer(drk[trt], drk[!trt], "-"))
    dmt
  }

Then one would use makedist to apply the function separately within levels of pt:

> dl <- makedist(pr ~ pt, nuclear,
                 capdatediffs)

The result of this is a list of two distance matrices, both submatrices of d created above, one comparing pt!=1 treatments and controls and a smaller one for pt==1 plants.

In larger problems, matching can be substantially faster if preceded by a division of the sample into subclasses; see Under the hood, below. The use of pscore.dist, mahal.dist, and makedist carries another advantage: the lists of distances they generate carry metadata to prevent fullmatch or pairmatch from getting confused about the order of observations in the data frame from which the distances were generated.

Another common aim is to forbid unwanted matches. With optmatch, this is done by placing NA's, NaN's or Inf's at the relevant places in a distance matrix. Consider matching nuclear plants within calipers of three years on date of construction. Pairings of plants that would violate this requirement are indicated in red in Figure 3. To enforce the caliper, one could generate a matrix of discrepancies dy on year of construction, then replace the distance matrix of Figure 3, d, with d/(dy<=3); this new matrix has an Inf at each entry in Figure 3 currently shown in red, and otherwise is the same as in Figure 3.
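For example, one way to build such a dy and impose the caliper (a sketch reusing the variables date and pr made available by the attach() call above; the name d.cal is merely illustrative) is

> dy <- abs(outer(date[pr==1], date[pr!=1], "-"))
> d.cal <- d/(dy <= 3)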

Operations of these types, division and logical comparison, are compatible with subclassification prior to matching, despite the fact that the operations seem to require matrices while subclassification demands lists of matrices. Assuming one has defined a function datediffs as

> datediffs <- function(trt, data) {
    sclr <- data[names(trt), 'date']
    names(sclr) <- names(trt)
    abs(outer(sclr[trt], sclr[!trt], '-'))
  }

then the command

> dly <- makedist(pr ~ pt, nuclear,
                  datediffs)

tabulates absolute differences on date of construction, separately for pt==1 and pt!=1 strata. With optmatch, the expression dly<=3 returns a list of indicators of whether potential matches were built within three years of one another. Furthermore, dl/(dly<=3) is a list imposing the three-year caliper upon distances coded in dl. To pair match on propensity scores, but with a 3-year date-of-construction caliper, one would use pairmatch(dl/(dly<=3)).

Generalizations of pair matching

Matching with a varying number of controls

In Figures 1 and 2, both non-optimal and optimal matches insist on precisely two controls per treatment. If one's aim is to match as closely as possible, this is a limitation. To optimally match 14 of the 19 controls, as Figure 2 does, but without requiring that they always be matched two-to-one to treatments, one would use the command fullmatch, with options 'min.controls=1' and 'omit.fraction=5/19'. The flexibility this adds improves matching even more than the switch from greedy to optimal matching did; while optimal pair matching reduced greedy pair matching's net discrepancy from 82 to 63, optimal matching with a varying number of controls brings it to 44, just over half its original value.
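A sketch of the call just described, assuming the rank-discrepancy matrix d of Figure 3:

> mvnc <- fullmatch(d, min.controls = 1,
                    omit.fraction = 5/19)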

If mvnc is the match created in this way, then the structure of mvnc is returned by

> stratumStructure(mvnc)
stratum treatment:control ratios
1:1 1:2 1:3 1:5
  4   1   1   1


             New site
Existing   H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
    A     28  0  3 22 14 30 17 28 26 28 20 22 23 26 21 18 34 40 28
    B     24  3  0 22 10 27 14 26 24 24 16 19 20 23 18 16 31 37 25
    C     10 18 14 18  4 12  6 11  9 10 14 12  6 14 22 10 16 22 28
    D      7 28 24  8 14  2 10  6 12  0 24 22  4 24 32 20 18 16 38
    E     17 20 16 32 18 26 20 18 12 24  0  2 20  6  8  4 14 20 14
    F     20 31 28 35 20 29 22 20 14 26 12  9 22  5 15 12  9 11 12
    G     14 32 29 30 18 24 17 16 10 22 12 10 17  6 16 14  4  8 17

Figure 3: Rank discrepancies of new- and existing-site nuclear plants without partial turnkey guarantees. New- and existing-site plants which differ by more than 3 years in date of construction are indicated in red.

This means it consists of four matched pairs and a matched triple, quadruple, and sextuple, all containing precisely one treatment. Ming and Rosenbaum (2000) discuss matching with a varying number of controls, implementing it in their example with a somewhat different algorithm.

Full matching

A propensity score is the conditional probability, $p(x)$, of falling in the treatment group given covariates $x$ (or an increasing transformation of it, such as its logit). Because subjects with large propensity scores more frequently fall in the treatment group, and subjects with low propensity scores are more frequently controls, propensity score matching is fundamentally at odds with matching treatments and controls in fixed ratios, such as 1:1 or 1:$k$. These ratios must be allowed to adapt, so that 1:1 matches can be made where $p(x)/(1 - p(x)) \approx 1$ while 1:$k$ matches are made where $p(x)/(1 - p(x)) \approx 1/k$; otherwise either some subjects will have to go unmatched or some subjects are bound to be poorly matched on their propensity scores. Matching with multiple controls addresses part of this problem, but only full matching (Rosenbaum, 1991) addresses it in its entirety, by also permitting $l$:1 matches, for when $p(x)/(1 - p(x)) \approx l \geq 2$.

In general, full matching is useful when there are some regions of covariate space in which controls outnumber treatments but others in which treatments outnumber controls. This pattern emerges most clearly when matching on a propensity score, but such patterns influence the quality of matches even without propensity scores. The rank discrepancies of new- and existing-site pt==1 plants, shown in Figure 4, show it: earlier dates of construction, and smaller capacities, are more common among controls (d and e) than treatments (b only), and this is reflected in Figure 4's sums of discrepancies on rank. As a consequence, full matching achieves a net rank discrepancy (3) that is half of the minimum possible (6) with techniques that don't permit both 1:2 and 2:1 matched sets.

             New site
Existing   d  e  f
    a      6  6  0
    b      0  3  6
    c      6  6  0

Figure 4: Rank discrepancies of new- and existing-site nuclear plants with partial turnkey guarantees. Boxes indicate the optimal full matching of these plants.

Gu and Rosenbaum (1993) compare full and other forms of matching in an extensive simulation study, while Hansen (2004) and Hansen and Klopfer (2006) present applications. This literature emphasizes the importance of using structural restrictions, upper limits on K in K:1 matched sets and on L in 1:L matched sets, when full matching, in order to control the variance of matched estimation. With fullmatch, an upper limit K:1 on treatment:control ratios is conveyed using 'min.controls=1/K', while a lower limit of 1:L on the treatment:control ratio would be given with 'max.controls=L'. Hansen (2004) and Hansen and Klopfer (2006) give strategies to optimize these tuning parameters. In the context of a specific application, Hansen (2004) finds (min.controls, max.controls) = (1/2, 2) · (1 − p)/p to work best, where p represents the proportion treated in the stratum being matched. In an unrelated application, Stuart and Green (2006) find these values to work well; they may be a good starting point for general use.
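As an illustrative sketch of such restrictions (here p is taken, for simplicity, as the overall proportion treated rather than the proportion in each stratum, and fm.res is merely an illustrative name):

> p <- mean(nuclear$pr)
> fm.res <- fullmatch(dl, min.controls = (1/2)*(1 - p)/p,
                      max.controls = 2*(1 - p)/p)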

A somewhat related technique is matching "with replacement," in which overlap between matched sets is permitted in the interests of achieving closer matches. Because of the overlap, methods appropriate to stratified data are not generally appropriate for samples matched with replacement. The technique forces one to resort to specialized methods, such as those of Abadie and Imbens (2006). On the other hand, with-replacement matching would appear to offer the possibility of closer matches, since its pairing of one treatment unit in no way limits its pairing of the next treatment unit.

However, it is a surprising, and evidently little-known, fact that with-replacement matching achieves no closer matches than full matching, a without-replacement matching technique. As discussed by Rosenbaum (1991) and (more explicitly) by Hansen and Klopfer (2006), given any criterion for a potential pairing of subjects to be acceptable, full matching matches all subjects with at least one suitable match in their comparison group, matching them only to acceptable counterparts. So one might insist, in particular, that each treatment unit be matched only to one of its nearest neighbors; by sharing controls among treated units where necessary, omitting controls who are not the nearest neighbor of some treatment, and matching to multiple controls where that can be done, full matching finds a way to honor this requirement. Since the matched sets produced by full matching never overlap, it has the advantage over with-replacement matching of combining with any method of estimation appropriate to finely stratified data.

Under the hood

Hansen and Klopfer (2006) describe the network-flows algorithm on which optmatch relies in some detail, establishing its optimality for full matching and matching with a fixed or varying number of controls. They also give upper bounds for the time complexity of the algorithm: roughly, $O(n^3 \log(nC))$, where $n$ is the size of the sample and $C$ is the quotient of the largest discrepancy in the distance matrix and the matching tolerance. This is comparable to the time complexity of squaring an $n \times n$ matrix. More precisely, the algorithm requires $O(n\, n_t n_c \log(nC))$ floating-point operations, where $n_t$ and $n_c$ are the sizes of the treatment and control groups.

These bounds have two practical consequences for optmatch. First, computational costs grow steeply with the size of the discrepancy matrix. Just as squaring two $(n/2) \times (n/2)$ submatrices of an $n \times n$ matrix is about four times faster than squaring the full $n \times n$ matrix, matching is made much faster by subdividing large matching problems into smaller ones. For this reason makedist is written so as to facilitate subclassification prior to matching, the effect of which is to split larger matching problems into a sequence of smaller ones. Second, the C-factor contributes secondarily to computational cost; its contribution is reduced by increasing the value of the 'tol' argument to fullmatch or pairmatch.

Discussion

When and how matching reduces systematic differences between groups

Matching can address bias in observational studies in either of two ways. In matched sampling, it is used to select a subset of control subjects most like treatments, with the remainder of subjects excluded from analysis; in matched adjustment, it is used to force treatment-control comparisons to be based on individualized comparisons made within matched sets, which will have been so engineered that matched counterparts are more like one another than are treatments and controls on the whole. Matched sampling is typically followed by matched adjustment, but matched adjustment can be useful even when not preceded by matched sampling.

Because it excludes some five control subjects, the match depicted in Figures 1 and 2 might be used in a context of matched sampling, although it differs in important respects from typical matched samples. In archetypal cases, matched sampling is used when for cost reasons the number of controls to be followed up for outcome data has to be reduced anyway, not when outcome data are already available for the entire sample; and in archetypal cases, the reservoir of potential controls is many times larger than the size of the desired control group. See, e.g., Althauser and Rubin (1970) or Rosenbaum and Rubin (1985). The matches in Figures 1 and 2 are typical of matched sampling in matching a fixed number of controls to each treatment subject. When bias can be addressed by being very selective in the choice of controls, flexibility in the structure of matched sets becomes less important.

When there is no additional data to be collected, there may be little use for matched sampling per se, while matched adjustment may still be attractive. In these cases, it is important to recognize that matching, even optimal matching, does not in itself reduce systematic differences between treatment and control groups unless it is specifically given the flexibility to do so. Suppose, for instance, that adjustment for the variable t2 is needed in the comparison of new- and existing-site plants. This variable, which represents the time between the issue of an operating permit and a construction permit, differs markedly in its distribution among "treatments" and "controls," as seen in Figure 5: treatments have systematically larger values of it, although the two distributions overlap considerably. When two groups compare in this way, no fixed-ratio matching of them can reduce their overall discrepancy. Some of the observations will have to be set aside, or, better yet, one could match the two groups in varying ratios, using matching with multiple controls or full matching. These techniques have surprising power to reconcile differences between treatments and controls while setting aside few or even no subjects because they lack suitable counterparts; the reader is referred to Hansen (2004, § 2) for general discussion and a case study.


(Boxplots of "Days to Permit Issue" for Controls and Treatments.)

Figure 5: New and existing sites' differences on the variable t2. To reduce these differences, one has either to drop observations or to use flexible matching techniques.

In practice, one should look critically at an optimal match before moving ahead with it toward outcome analysis, refining its generating distance and structural requirements as appropriate, just as a careful analyst deploys various diagnostics in the process of developing and refining a likelihood-based analysis. Diagnostics for matching are discussed in various methodological papers, many of them recent (Rosenbaum and Rubin, 1985; Rubin, 2001; Lee, 2006; Hansen, 2006a; Sekhon, 2007).

optmatch output

Matched pairs are often analyzed by methods particular to that structure, for example the paired t-test. However, matching with multiple controls and full matching require methods that treat the matched sets as strata. With these uses in mind, matching functions in optmatch give factor objects as their output, creating unique identifiers for each matched set and tagging them as such in the factor. (Strictly speaking, the value of a call to fullmatch or pairmatch is of the class c("optmatch", "factor"), but it is safe to treat it as a factor.) If one in fact has produced a pairmatch, then one can recover the paired differences using the split command:

> pm <- pairmatch(dl)
> attach(nuclear)
> unlist(split(cost[PR], pm[PR])) -
    unlist(split(cost[!PR], pm[!PR]))

— the result of which is the vector of differences

   0.1    0.2    0.3   ...    1.2    1.3
 -9.77 -10.09 184.75   ... -17.77  -4.52

For matched comparisons after full matching or matching with a varying number of controls, one uses such commands as

> fm <- fullmatch(dl)
> tapply(cost[PR], fm[PR], mean) -
    tapply(cost[!PR], fm[!PR], mean)

to return differences of treatment and control means by matched set. The sizes of the matched sets, in terms of treatment units, controls, or both, can be tabulated by

> tapply(PR, fm, sum)
> tapply(!PR, fm, sum)
> tapply(fm, fm, length)

respectively. Unmatched units are automatically dropped, and split and tapply return matched-set specific results in a common ordering (that of the levels of the match object, e.g. pm or fm).

Summary

Optmatch offers a comprehensive implementation of matching of two groups, such as treatments and controls or cases and controls, including optimal pair matching, optimal matching with k controls, optimal matching with a varying number of controls, and full matching, with and without structural restrictions.

Bibliography

A. Abadie and G. W. Imbens. Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1):235–267, 2006.

R. Althauser and D. Rubin. The computerized construction of a matched sample. American Journal of Sociology, 76(2):325–346, 1970.

J. Bowers and B. B. Hansen. Attributing effects to a cluster randomized get-out-the-vote campaign. Technical Report 448, Statistics Department, University of Michigan, October 2006. URL http://www-personal.umich.edu/%7Ejwbowers/PAPERS/bowershansen2006-10TechReport.pdf.


A. R. Brazzale. hoa: An R package bundle for higher order likelihood inference. R News, 5(1):20–27, May 2005. URL ftp://cran.r-project.org/doc/Rnews/Rnews_2005-1.pdf. ISSN 1609-3631.

A. R. Brazzale, A. C. Davison, and N. Reid. Applied Asymptotics. Cambridge University Press, 2006.

N. E. Breslow and N. E. Day. Statistical Methods in Cancer Research (Vol. 1) — The Analysis of Case-control Studies. Number 32 in IARC Scientific Publications. International Agency for Research on Cancer, 1980.

D. Cox and E. Snell. Applied Statistics. Chapman and Hall, 1981.

D. R. Cox and E. J. Snell. Analysis of Binary Data. Chapman & Hall Ltd, 1989.

A. Diamond and J. S. Sekhon. Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Technical report, Travers Department of Political Science, University of California, Berkeley, 2006. Version 1.2.

A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.

X. Gu and P. R. Rosenbaum. Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2(4):405–420, 1993.

B. B. Hansen. Appraising covariate balance after assignment to treatment by groups. Technical Report 436, University of Michigan, Statistics Department, 2006a.

B. B. Hansen. Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467):609–618, September 2004.

B. B. Hansen. Bias reduction in observational studies via prognosis scores. Technical Report 441, University of Michigan, Statistics Department, April 2006b. URL http://www.stat.lsa.umich.edu/%7Ebbh/rspaper2006-06.pdf.

B. B. Hansen and S. O. Klopfer. Optimal full matching and related designs via network flows. Journal of Computational and Graphical Statistics, 15(3):609–627, 2006. URL http://www.stat.lsa.umich.edu/%7Ebbh/hansenKlopfer2006.pdf.

T. Hothorn, K. Hornik, M. van de Wiel, and A. Zeileis. A lego system for conditional inference. The American Statistician, 60(3):257–263, 2006.

W.-S. Lee. Propensity score matching and variations on the balancing test, 2006. URL http://papers.ssrn.com/sol3/papers.cfm?abstract_id=936782#.

T. Lumley and T. Therneau. survival: Survival analysis, including penalised likelihood, 2006. R package version 2.26.

K. Ming and P. R. Rosenbaum. Substantial gains in bias reduction from matching with a variable number of controls. Biometrics, 56:118–124, 2000.

D. Pierce and D. Peters. Practical use of higher order asymptotics for multiparameter exponential families. Journal of the Royal Statistical Society, Series B (Methodological), 54(3):701–737, 1992.

S. W. Raudenbush and A. S. Bryk. Hierarchical Linear Models: Applications and Data Analysis Methods. Sage Publications Inc, 2002.

P. R. Rosenbaum. A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society, 53:597–610, 1991.

P. R. Rosenbaum. Observational Studies. Springer-Verlag, second edition, 2002.

P. R. Rosenbaum and D. B. Rubin. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39:33–38, 1985.

D. B. Rubin. Using propensity scores to help design observational studies: application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3):169–188, 2001.

D. B. Rubin. Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74:318–328, 1979.

D. B. Rubin and N. Thomas. Combining propensity score matching with additional adjustments for prognostic covariates. Journal of the American Statistical Association, 95(450):573–585, 2000.

J. S. Sekhon. Alternative balance metrics for bias reduction in matching methods for causal inference. Survey Research Center, University of California, Berkeley, 2007. URL http://sekhon.berkeley.edu/papers/SekhonBalanceMetrics.pdf.

H. Smith. Matching with multiple controls to estimate treatment effects in observational studies. Sociological Methodology, 27:325–353, 1997.

E. A. Stuart and K. M. Green. Using full matching to estimate causal effects in non-experimental studies: Examining the relationship between adolescent marijuana use and adult outcomes. Technical report, Johns Hopkins University, 2006.

Ben B. Hansen
Department of Statistics
University of Michigan
[email protected]


Random Survival Forests for R

Hemant Ishwaran and Udaya B. Kogalur

Introduction

In this article we introduce Random Survival Forests, an ensemble tree method for the analysis of right censored survival data. As is well known, constructing ensembles from base learners, such as trees, can significantly improve learning performance. Recently, Breiman showed that ensemble learning can be further improved by injecting randomization into the base learning process, a method called Random Forests (Breiman, 2001). Random Survival Forests is closely modeled after Breiman's approach. In Random Forests, randomization is introduced in two forms. First, a randomly drawn bootstrap sample of the data is used for growing the tree. Second, the tree learner is grown by splitting nodes on randomly selected predictors. While at first glance Random Forests might seem an unusual procedure, considerable empirical evidence has shown it to be highly effective. Extensive experimentation, for example, has shown it compares favorably to state-of-the-art ensemble methods such as bagging (Breiman, 1996) and boosting (Schapire et al., 1998).

Random Survival Forests, being closely patterned after Random Forests, naturally inherits many of its good properties. Two features especially worth emphasizing are: (1) It is user-friendly in that only three, fairly robust, parameters need to be set (the number of randomly selected predictors, the number of trees grown in the forest, and the splitting rule to be used). (2) It is highly data adaptive and virtually model assumption free. This last property is especially helpful in survival analysis. Standard analyses often rely on restrictive assumptions such as proportional hazards. Also, with such methods there is always the concern whether associations between predictors and hazards have been modeled appropriately, and whether or not non-linear effects or higher order interactions for predictors should be included. In contrast, such problems are handled seamlessly and automatically within a Random Forests approach.

While R currently has a Random Forests package for classification and regression problems (the randomForest package ported by Andy Liaw and Matthew Wiener), there is currently no version available for analyzing survival data.^1 The need for a Random Forests procedure separate from one that handles classification and regression problems is well motivated, as survival data possesses unique features not handled within a CART (Classification and Regression Tree) paradigm. In particular, the notion of what constitutes a good node split for growing a tree, what prediction means, and how to measure prediction performance, all pose unique problems in survival analysis.

Moreover, while a survival tree can in some instances be reformulated as a classification tree, thereby making it possible to use CART software for a Random Forests analysis, we believe such approaches are merely stop-gap measures that will be difficult for the average user to implement. For example, Ishwaran et al. (2004) show under a proportional hazards assumption that one can grow survival trees using the splitting rule of LeBlanc and Crowley (1992) using the rpart() algorithm (Therneau and Atkinson, 1997), hence making it possible to implement a relative risk forests analysis in R. However, this requires extensive coding on the user's part, is limited to proportional hazards settings, and the splitting rule used is only approximate.

The algorithm

It is clear that a comprehensive method with accompanying software is needed. To fill this need we introduce randomSurvivalForest, an R software package for implementing Random Survival Forests. The algorithm used by randomSurvivalForest is broadly described as follows:

1. Draw ntree bootstrap samples from the original data.

2. Grow a tree for each bootstrapped data set. At each node of the tree randomly select mtry predictors (covariates) for splitting on. Split on a predictor using a survival splitting criterion. A node is split on that predictor which maximizes survival differences across daughter nodes.

3. Grow the tree to full size under the constraint that a terminal node should have no less than nodesize unique deaths.

4. Calculate an ensemble cumulative hazard estimate by combining information from the ntree trees. One estimate for each individual in the data is calculated.

5. Compute an out-of-bag (OOB) error rate for the ensemble derived using the first b trees, where b = 1, . . . , ntree.
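To fix ideas, a call might look roughly as follows. This is a hypothetical sketch only: the fitting function and formula interface shown here are assumptions for illustration and should be checked against the package documentation; the arguments ntree, mtry, nodesize and splitrule are the tuning parameters described above.

library(randomSurvivalForest)
# hypothetical call sketch; consult the package help for the exact interface
fit <- rsf(Survrv(time, status) ~ ., data = mydata,
           ntree = 1000, mtry = 3, nodesize = 3,
           splitrule = "logrank")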

Splitting rules

Node splits are a crucial ingredient to the algorithm. The randomSurvivalForest package provides four different survival splitting rules for the user. These are: (i) a log-rank splitting rule, the default splitting rule, invoked by the option splitrule="logrank"; (ii) a conservation of events splitting rule, splitrule="conserve"; (iii) a logrank score rule, splitrule="logrankscore"; and (iv) a fast approximation to the logrank splitting rule, splitrule="logrankapprox".

^1 We are careful to distinguish Random Forests procedures following Breiman's methodology from other approaches. Readers, for example, should be aware of the R package party, which implements a random forests style analysis using conditional tree base learners (Hothorn et al., 2006).

Notation

Assume we are at node h of a tree during its growth and that we seek to split h into two daughter nodes. We introduce some notation to help discuss how the various splitting rules work to determine the best split. Assume that within h are n individuals. Denote their survival times and 0-1 censoring information by $(T_1, \delta_1), \ldots, (T_n, \delta_n)$. An individual l will be said to be right censored at time $T_l$ if $\delta_l = 0$; otherwise the individual is said to have died at $T_l$ if $\delta_l = 1$. In the case of death, $T_l$ will be referred to as an event time, and the death as an event. An individual l who is right censored at $T_l$ simply means the individual is known to have been alive at $T_l$, but the exact time of death is unknown.

A proposed split at node h on a given predictor x is always of the form $x \leq c$ and $x > c$. Such a split forms two daughter nodes (a left and right daughter) and two new sets of survival data. A good split maximizes survival differences across the two sets of data. Let $t_1 < t_2 < \cdots < t_N$ be the distinct death times in the parent node h, and let $d_{i,j}$ and $Y_{i,j}$ equal the number of deaths and individuals at risk at time $t_i$ in the daughter nodes $j = 1, 2$. Note that $Y_{i,j}$ is the number of individuals in daughter j who are alive at time $t_i$, or who have an event (death) at time $t_i$. More precisely,

$$Y_{i,1} = \#\{T_l \geq t_i,\; x_l \leq c\}, \qquad Y_{i,2} = \#\{T_l \geq t_i,\; x_l > c\},$$

where $x_l$ is the value of x for individual $l = 1, \ldots, n$. Finally, define $Y_i = Y_{i,1} + Y_{i,2}$ and $d_i = d_{i,1} + d_{i,2}$. Let $n_j$ be the total number of observations in daughter j. Thus, $n = n_1 + n_2$. Note that $n_1 = \#\{l : x_l \leq c\}$ and $n_2 = \#\{l : x_l > c\}$.

Log-rank splitting

The log-rank test for a split at the value c for predictor x is

$$L(x, c) = \frac{\displaystyle\sum_{i=1}^{N}\left(d_{i,1} - Y_{i,1}\,\frac{d_i}{Y_i}\right)}{\sqrt{\displaystyle\sum_{i=1}^{N}\frac{Y_{i,1}}{Y_i}\left(1 - \frac{Y_{i,1}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right)d_i}}.$$

The value $|L(x, c)|$ is the measure of node separation. The larger the value of $|L(x, c)|$, the greater the difference between the two groups, and the better the split is. In particular, the best split at node h is determined by finding the predictor $x^*$ and split value $c^*$ such that $|L(x^*, c^*)| \geq |L(x, c)|$ for all $x$ and $c$.
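To make the computation concrete, here is a small didactic sketch (not the package's internal code) that evaluates the statistic L(x, c) exactly as defined above, given survival times T, censoring indicators delta, a predictor x and a candidate split value c:

logrank.split <- function(T, delta, x, c) {
  left <- x <= c
  dtimes <- sort(unique(T[delta == 1]))   # distinct death times t_1 < ... < t_N
  num <- 0
  den <- 0
  for (ti in dtimes) {
    Y1 <- sum(T >= ti & left)             # at risk in the left daughter
    Yi <- sum(T >= ti)                    # at risk in the parent node
    d1 <- sum(T == ti & delta == 1 & left)
    di <- sum(T == ti & delta == 1)
    num <- num + (d1 - Y1 * di / Yi)
    if (Yi > 1)
      den <- den + (Y1 / Yi) * (1 - Y1 / Yi) * ((Yi - di) / (Yi - 1)) * di
  }
  num / sqrt(den)
}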

Conservation of events splitting

The log-rank test for splitting survival trees is a well-established concept (Segal, 1988), having been shown to be robust in both proportional and non-proportional hazard settings (LeBlanc and Crowley, 1993). However, one criticism often heard is that it tends to favor continuous predictors and often suffers from an end-cut preference (favoring uneven splits). In our experience with Random Survival Forests we have not found this to be a serious deficiency. Nevertheless, to address this potential problem we introduce another important class of test statistics for splitting that are related to conservation of events, a concept introduced in Naftel et al. (1985) (our simulations have indicated these tests may be much less susceptible to the aforementioned problems).

Under fairly general conditions, conservation of events asserts that the sum of the estimated cumulative hazard function over the observed time points (deaths and censored values) must equal the total number of deaths. This applies to a wide collection of estimates, including the Nelson-Aalen estimator. The Nelson-Aalen cumulative hazard estimator for daughter j is
$$
H_j(t) = \sum_{t_{i,j} \le t} \frac{d_{i,j}}{Y_{i,j}},
$$
where t_{i,j} are the ordered death times for daughter j (note: we define 0/0 = 0).

Let (T_{l,j}, δ_{l,j}), for l = 1, . . . , n_j, denote all survival times and censoring indicator pairs for daughter j. Conservation of events asserts that
$$
\sum_{l=1}^{n_j} H_j(T_{l,j}) = \sum_{l=1}^{n_j} \delta_{l,j}. \qquad (1)
$$
In other words, the total number of deaths is conserved in each daughter.

The conservation of events splitting rule is motivated by (1). First, order the time points within each daughter node such that
$$
T_{(1),j} \le T_{(2),j} \le \cdots \le T_{(n_j),j}.
$$
Let δ_{(l),j} be the censoring indicator function for the ordered value T_{(l),j}. Define
$$
M_{k,j} = \sum_{l=1}^{k} H_j(T_{(l),j}) - \sum_{l=1}^{k} \delta_{(l),j}, \qquad k = 1, \dots, n_j.
$$
One can think of M_{k,j} as “residuals” that measure accuracy of conservation of events. The proposed test statistic takes the sum of the absolute values of M_{k,j} for k = 1, . . . , n_j for each daughter j, and weights these values by the number of individuals at risk within each group. Observe that M_{n_j,j} = 0, but nothing can be said about M_{k,j} for k < n_j. Thus, by considering M_{k,j} for each k, the proposed test measures how evenly distributed conservation of events is over all deaths. The measure of conservation of events for the split on x at the value c is

$$
\mathrm{Conserve}(x, c) = \frac{1}{Y_{1,1} + Y_{1,2}} \sum_{j=1}^{2} Y_{1,j} \sum_{k=1}^{n_j - 1} |M_{k,j}|.
$$
This value is small if the two groups are well separated. Because we want to maximize survival differences due to a split, we use the transformed value 1/(1 + Conserve(x, c)) as our measure of node separation.

The preceding expression for Conserve(x, c) can be quite expensive to compute as it involves summing over all survival times within the daughter nodes. However, we can greatly reduce the amount of work by compressing the sums to involve only event times. With some work, one can show that Conserve(x, c) is equivalent to:

$$
\frac{1}{Y_{1,1} + Y_{1,2}} \sum_{j=1}^{2} Y_{1,j} \sum_{k=1}^{N-1} \left\{ N_{k,j}\, Y_{k+1,j} \sum_{l=1}^{k} \frac{d_{l,j}}{Y_{l,j}} \right\},
$$
where N_{i,j} = Y_{i,j} − Y_{i+1,j} equals the number of observations in daughter j with observed time falling within the interval [t_i, t_{i+1}) for i = 1, . . . , N, where t_{N+1} = ∞.
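As an illustration only, the sketch below transcribes the direct definition of Conserve(x, c) (not the faster event-time form); all names are ours and the handling of edge cases is simplified.

## Illustrative transcription of Conserve(x, c) for a split x <= c.
conserve.split <- function(time, status, x, c) {
  t1 <- min(time[status == 1])             # smallest death time in the parent node
  grp <- list(x <= c, x > c)               # left and right daughters
  Y1 <- term <- numeric(2)
  for (j in 1:2) {
    tj <- time[grp[[j]]]; dj <- status[grp[[j]]]
    o <- order(tj); tj <- tj[o]; dj <- dj[o]     # order time points within daughter
    Y1[j] <- sum(tj >= t1)                       # Y_{1,j}
    Hj <- function(t) {                          # Nelson-Aalen estimator in daughter j
      s <- unique(tj[dj == 1 & tj <= t])
      if (!length(s)) return(0)
      sum(sapply(s, function(u) sum(tj == u & dj == 1) / sum(tj >= u)))
    }
    M <- cumsum(vapply(tj, Hj, numeric(1))) - cumsum(dj)   # residuals M_{k,j}
    term[j] <- Y1[j] * sum(abs(M[-length(M)]))             # sum over k = 1, ..., n_j - 1
  }
  sum(term) / sum(Y1)                      # small values indicate good separation
}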

Log-rank score splitting

Another useful splitting rule available within the randomSurvivalForest package is the log-rank score test of Hothorn and Lausen (2003). To describe this rule, assume the predictor x has been ordered so that x_1 ≤ x_2 ≤ · · · ≤ x_n. Now, compute the “ranks” for each survival time T_l,
$$
a_l = \delta_l - \sum_{k=1}^{\Gamma_l} \frac{\delta_k}{n - \Gamma_k + 1},
$$
where Γ_k = #{t : T_t ≤ T_k}. The log-rank score test is defined as
$$
S(x, c) = \frac{\sum_{x_l \le c} a_l - n_1 \bar{a}}{\sqrt{n_1 \left(1 - \frac{n_1}{n}\right) s_a^2}},
$$
where ā and s_a² are the sample mean and sample variance of {a_l : l = 1, . . . , n}. Log-rank score splitting defines the measure of node separation by |S(x, c)|. Maximizing this value over x and c yields the best split.
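A sketch of the statistic, again ours rather than the package's code, follows; we interpret the inner sum as running over the observations with T_k ≤ T_l.

## Illustrative transcription of the log-rank score statistic S(x, c).
logrank.score.split <- function(time, status, x, c) {
  n <- length(time)
  Gamma <- sapply(time, function(Tk) sum(time <= Tk))   # Gamma_k = #{t : T_t <= T_k}
  a <- sapply(seq_len(n), function(l) {                 # "ranks" a_l
    k <- which(time <= time[l])
    status[l] - sum(status[k] / (n - Gamma[k] + 1))
  })
  n1 <- sum(x <= c)
  (sum(a[x <= c]) - n1 * mean(a)) /
    sqrt(n1 * (1 - n1 / n) * var(a))
}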

Approximate logrank splitting

An approximate log-rank test can be used in place of L(x, c) to greatly reduce computations. To derive the approximation, first rewrite the numerator of L(x, c) in a form that uses the Nelson-Aalen estimator for the parent node. The Nelson-Aalen estimator is
$$
H(t) = \sum_{t_i \le t} \frac{d_i}{Y_i}.
$$
As shown in LeBlanc and Crowley (1993), one can write
$$
\sum_{i=1}^{N} \left( d_{i,1} - Y_{i,1} \frac{d_i}{Y_i} \right) = D_1 - \sum_{l=1}^{n} I\{x_l \le c\}\, H(T_l),
$$
where D_j = \sum_{i=1}^{N} d_{i,j} for j = 1, 2. Because the Nelson-Aalen estimator is computed on the parent node, and not the daughter nodes, this yields an efficient way to compute the numerator of L(x, c).

Now to simplify the denominator, we approximate the variance of the numerator of L(x, c) as in Section 7.7 of Cox and Oakes (1988) (this approximation was suggested to us by Michael LeBlanc in personal communication). Setting D = \sum_{i=1}^{N} d_i, we get the following approximation to the log-rank test L(x, c):
$$
\frac{D^{1/2} \left( D_1 - \sum_{l=1}^{n} I\{x_l \le c\}\, H(T_l) \right)}
{\sqrt{\left\{ \sum_{l=1}^{n} I\{x_l \le c\}\, H(T_l) \right\} \left\{ D - \sum_{l=1}^{n} I\{x_l \le c\}\, H(T_l) \right\}}}.
$$
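Because only the parent-node Nelson-Aalen estimator is needed, the approximation is easy to transcribe; the sketch below uses our own (hypothetical) function name and is not the package's implementation.

## Illustrative transcription of the approximate log-rank statistic.
logrank.approx.split <- function(time, status, x, c) {
  H <- function(t) {                       # parent-node Nelson-Aalen estimator
    s <- unique(time[status == 1 & time <= t])
    if (!length(s)) return(0)
    sum(sapply(s, function(u) sum(time == u & status == 1) / sum(time >= u)))
  }
  Hl <- sapply(time, H)
  D  <- sum(status)                        # total number of deaths in the parent
  D1 <- sum(status[x <= c])                # deaths in the left daughter
  S  <- sum(Hl[x <= c])                    # sum of I{x_l <= c} H(T_l)
  sqrt(D) * (D1 - S) / sqrt(S * (D - S))
}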

Ensemble estimation

The randomSurvivalForest package produces an ensemble estimate for the cumulative hazard function. This is our predictor and key deliverable. Error rate performance is calculated based on this value. The ensemble is derived as follows. First, for each tree grown from a bootstrap data set we estimate the cumulative hazard function for the tree. This is accomplished by grouping hazard estimates by terminal nodes. Consider a specific node h. Let {t_{l,h}} be the distinct death times in h and let d_{l,h} and Y_{l,h} equal the number of deaths and individuals at risk at time t_{l,h}. The cumulative hazard estimate for node h is defined as
$$
H_h(t) = \sum_{t_{l,h} \le t} \frac{d_{l,h}}{Y_{l,h}}.
$$
Each tree provides a sequence of such estimates, H_h(t). If there are M terminal nodes in the tree, then there are M such estimates. To compute H(t|x_i) for an individual i with predictor x_i, simply drop x_i down the tree. The terminal node for i yields the desired estimator. More precisely,
$$
H(t|x_i) = H_h(t), \quad \text{if } x_i \in h. \qquad (2)
$$
Note this value is computed for all individuals i in the data.


The estimate (2) is based on one tree. To produce our ensemble we average (2) over all ntree trees. Let H_b(t|x) denote the cumulative hazard estimate (2) for tree b = 1, . . . , ntree. Define I_{i,b} = 1 if i is an OOB point for b, otherwise set I_{i,b} = 0. The OOB ensemble cumulative hazard estimator for i is
$$
H^{*}_{e}(t|x_i) = \frac{\sum_{b=1}^{ntree} I_{i,b}\, H_b(t|x_i)}{\sum_{b=1}^{ntree} I_{i,b}}.
$$
Observe that the estimator is obtained by averaging over only those bootstrap samples in which i is excluded (i.e., those datasets in which i is an OOB value). The OOB estimator is in contrast to the ensemble cumulative hazard estimator that uses all samples:
$$
H_e(t|x_i) = \frac{1}{ntree} \sum_{b=1}^{ntree} H_b(t|x_i).
$$

Concordance error rate

Given the OOB estimator H^*_e(t|x), it is a simple matter to compute the error rate. We measure error using Harrell's concordance index (Harrell et al., 1982). Unlike other measures of survival performance, Harrell's C-index does not depend on choosing a fixed time for evaluation of the model and specifically takes into account censoring of individuals (May et al., 2004). The method has quickly become quite popular in the literature as a means for assessing prediction performance in survival analysis settings. See Kattan et al. (1998) and references therein.

To compute the concordance index we must define what constitutes a worse predicted outcome. We take the following approach. Let t*_1, . . . , t*_N denote all unique event times in the data. Individual i is said to have a worse outcome than j if
$$
\sum_{k=1}^{N} H^{*}_{e}(t^{*}_{k}|x_i) > \sum_{k=1}^{N} H^{*}_{e}(t^{*}_{k}|x_j).
$$

The concordance error rate is computed as follows:

1. Form all possible pairs of observations over all the data.

2. Omit those pairs where the shorter event time is censored. Also, omit pairs i and j if T_i = T_j unless δ_i = 1 and δ_j = 0 or δ_i = 0 and δ_j = 1. The last restriction only allows ties if one of the observations is a death and the other a censored observation. Let Permissible denote the total number of permissible pairs.

3. Count 1 for each permissible pair in which the shorter event time had the worse predicted outcome. Count 0.5 if the predicted outcomes are tied. Let Concordance denote the total sum over all permissible pairs.

4. Define the concordance index C as
$$
C = \frac{Concordance}{Permissible}.
$$

5. The error rate is Error = 1 − C. Note that 0 ≤ Error ≤ 1 and that Error = 0.5 corresponds to a procedure doing no better than random guessing, whereas Error = 0 indicates perfect accuracy.
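The steps above translate directly into code. The following O(n²) sketch (ours, with the tie rule of step 2) computes the error rate from a vector of predicted mortalities, e.g. mort_i = Σ_k H*_e(t*_k|x_i); it is an illustration, not the package's routine.

## Illustrative computation of Harrell's concordance error rate.
concordance.error <- function(mort, time, status) {
  conc <- perm <- 0
  n <- length(time)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    if (time[i] == time[j] && status[i] == status[j]) next  # disallowed tie
    s <- if (time[i] < time[j]) i else j                    # shorter event time
    if (time[i] == time[j]) s <- if (status[i] == 1) i else j
    if (status[s] == 0) next                                # shorter time censored
    l <- setdiff(c(i, j), s)
    perm <- perm + 1
    conc <- conc + if (mort[s] == mort[l]) 0.5 else as.numeric(mort[s] > mort[l])
  }
  1 - conc / perm                                           # Error = 1 - C
}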

Usage in R

The user interface to randomSurvivalForest is similar in many aspects to randomForest and, as the reader may have already noticed, many of the argument names are also the same. This was done deliberately in order to promote compatibility between the two packages. The primary R function call to the randomSurvivalForest package is rsf(). The online documentation describes rsf() in great detail and there is no reason to repeat this information here. Different R wrapper functions are provided with the randomSurvivalForest package to aid in interpreting the object produced by rsf(). The examples given below illustrate how some of these wrappers work, and also indicate how rsf() might be used in practice.

Lung-vet data

For our first example, we use the well-known veteran's administration lung cancer data from Kalbfleisch and Prentice (Kalbfleisch and Prentice, 1980). This is an example data set available within the package. In total there are 6 predictors in the data. We first focus on an analysis that includes only Karnofsky score as a predictor:

> library("randomSurvivalForest")> data(veteran,package="randomSurvivalForest")> ntree <- 1000> v.out <- rsf(Survrsf(time,status) ~ karno,

veteran, ntree=ntree, forest=T)> print(v.out)

Call:rsf.default(formula = Survrsf(time, status)

~ karno, data = veteran, ntree = ntree)

Sample size: 137Number of deaths: 128Number of trees: 1000

Minimum terminal node size: 3Average no. of terminal nodes: 8.437

No. of variables tried at each split: 1Total no. of variables: 1

Splitting rule: logrankEstimate of error rate: 36.28%


The error rate is significantly smaller than 0.5, the benchmark value associated with a procedure no better than flipping a coin. This is very strong evidence that Karnofsky score is predictive.

We can investigate the effect of Karnofsky score more closely by considering how the ensemble estimated mortality varies as a function of the predictor:

> plot.variable(v.out, partial=T)

Figure 1, produced by the above command, is a partial plot of Karnofsky score. The vertical axis represents the expected number of deaths.

Figure 1: Partial plot of Karnofsky score. The vertical axis is mortality, ∑_{k=1}^{N} N_k H_e(t*_k | x), for a given Karnofsky value x, and represents the expected number of deaths.

Now we run an analysis with all six predictors under each of the four splitting rules. For each splitting rule we run 100 replications and record the mean and standard deviation of the concordance error rate (as before, ntree equals 1000):

> splitrule <- c("logrank", "conserve",
     "logrankscore", "logrankapprox")
> nrep <- 100
> err.rate <- matrix(0, 4, nrep)
> names(err.rate) <- splitrule
> v.f <-
     as.formula("Survrsf(time,status) ~ .")
> for (j in 1:4) {
>   for (k in 1:nrep) {
>     err.rate[j,k] <- rsf(v.f,
        veteran, ntree=ntree,
        splitrule=splitrule[j])$err.rate[ntree]
>   }
> }
> err.rate <- rbind(
     mean=apply(err.rate, 1, mean),
     std=apply(err.rate, 1, sd))
> colnames(err.rate) <- splitrule
> print(round(err.rate,4))

     logrank conserve logrankscore logrankapx
mean  0.2982   0.3239       0.2951     0.3170
std   0.0027   0.0034       0.0027     0.0046

The analysis shows that logrankscore has the best predictive performance (logrank is a close second). Standard deviations in all cases are reasonably small. It is interesting to observe that the mean error rates are not substantially smaller than in our previous analysis which used only Karnofsky score, thus indicating the predictor is highly influential. Our next example illustrates further techniques for studying the informativeness of a predictor.

Primary biliary cirrhosis (PBC) of the liver

Next we consider the PBC data set found in Appendix D.1 of Fleming and Harrington (Fleming and Harrington, 1991). This is also an example data set available in the package. Similar to the previous analysis, we analyzed the data by running a forest analysis for each of the four splitting rules, repeating the analysis 100 times independently (as before, ntree was set to 1000). The R code is similar to that shown above and is suppressed:

     logrank conserve logrankscore logrankapx
mean  0.1703   0.1677       0.1719     0.1602
std   0.0014   0.0014       0.0015     0.0020

As can be seen, the error rates are between 16 and 17%, with logrankapprox having the lowest value.

Figure 2: Error rate for PBC data as a function of trees (left side) and out-of-bag importance values for predictors (right side).

We now consider the informativeness of each predictor under the logrankapprox splitting rule:

> data("pbc", package="randomSurvivalForest")
> pbc.f <- as.formula("Survrsf(days,status)~.")
> pbc.out <- rsf(pbc.f, pbc, ntree=ntree,
     splitrule = "logrankapprox", forest=T)
> plot(pbc.out)

Figure 2 depicts the importance values for all 17 predictors. From the plot we see that “age” and “bili” are clearly predictive and have substantially larger importance values than all other predictors. The partial plots for the top six predictors are displayed in Figure 3. The figure was produced using the command:

> plot.variable(pbc.out,3,partial=T,n.pred=6)

We now consider the incremental effect of each predictor using a nested analysis. We sort predictors by their importance values and consider the nested sequence of models starting with the top variable, followed by the model with the top 2 variables, then the model with the top three variables, and so on:

> imp <- pbc.out$importance
> pnames <- pbc.out$predictorNames
> pnames.order <- pnames[rev(order(imp))]
> n.pred <- length(pnames)
> pbc.err <- rep(0, n.pred)
> for (k in 1:n.pred){
>   rsf.f <- "Survrsf(days,status)~"
>   rsf.f <- as.formula(paste(rsf.f,
>     paste(pnames.order[1:k],collapse="+")))
>   pbc.err[k] <- rsf(rsf.f, pbc, ntree=ntree,
>     splitrule="logrankapprox")$err.rate[ntree]
> }
> pbc.imp.out <- as.data.frame(
>   cbind(round(rev(sort(imp)),4),
>     round(pbc.err,4),
>     round(-diff(c(0.5,pbc.err)),4)),
>   row.names=pnames.order)
> colnames(pbc.imp.out) <-
>   c("Imp","Err","Drop Err")
> print(pbc.imp.out)

                Imp    Err Drop Err
age          0.0130 0.3961   0.1039
bili         0.0081 0.1996   0.1965
prothrombin  0.0052 0.1918   0.0078
copper       0.0042 0.1685   0.0233
edema        0.0030 0.1647   0.0038
albumin      0.0026 0.1569   0.0078
chol         0.0022 0.1606  -0.0037
ascites      0.0014 0.1570   0.0036
spiders      0.0013 0.1601  -0.0030
stage        0.0013 0.1557   0.0043
hepatom      0.0008 0.1570  -0.0013
sex          0.0006 0.1549   0.0021
platelet     0.0000 0.1565  -0.0016
sgot        -0.0012 0.1538   0.0027
trig        -0.0017 0.1545  -0.0007
treatment   -0.0019 0.1596  -0.0052
alk         -0.0040 0.1565   0.0032

The first column is the importance value of a predictor in the full model. The kth value in the second column is the error rate for the kth nested model, while the kth value in the third column is the difference between the error rates for the kth and (k−1)th nested models, where the error rate for the null model, k = 0, is 0.5. One can see that not much is gained by using more than 6-7 predictors and that the top 3-4 predictors account for much of the predictive power.

Large scale problems

In terms of computationally challenging problems, we have applied randomSurvivalForest successfully to several large survival datasets. For example, we have considered data collected at the Cleveland Clinic involving over 20,000 records and well over 60 predictors. We have also analyzed a data set containing 1,000 records and with almost 250 predictors. Our success with these applications is consistent with that seen for Random Forests: namely, that the methodology has been shown to scale up very nicely, even in very large predictor spaces and with large sample sizes. In terms of computational speed, we have found that logrankapprox is almost always fastest. After that, conserve is second fastest. For very large datasets, discretizing continuous predictors and/or the observed survival times can greatly speed up computational times. Discretization does not have to be overly granular for substantial gains to be seen.

Acknowledgements

The authors are extremely grateful to Eugene H. Blackstone and Michael S. Lauer for their tremendous support, feedback, and generous time given to us throughout this project. Without their help this project would surely never have been completed. We also thank the referee of the paper and Paul Murrell for their constructive and helpful comments. This research was supported by the National Institutes of Health RO1 grant HL-072771.

Bibliography

L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.

L. Breiman. Bagging predictors. Machine Learning, 26:123–140, 1996.

D.R. Cox and D. Oakes. Analysis of Survival Data. Chapman and Hall, London, 1988.

T. Fleming and D. Harrington. Counting Processes and Survival Analysis. Wiley, New York, 1991.

F. Harrell, R. Califf, D. Pryor, K. Lee, and R. Rosati. Evaluating the yield of medical tests. J. Amer. Med. Assoc., 247:2543–2546, 1982.

T. Hothorn, K. Hornik, and A. Zeileis. Unbiased recursive partitioning: A conditional inference framework. J. Comp. Graph. Statist., 15:651–674, 2006.

T. Hothorn and B. Lausen. On the exact distribution of maximally selected rank statistics. Comput. Statist. Data Analysis, 43:121–137, 2003.


Figure 3: Partial plots for the top six predictors (age, bili, prothrombin, copper, edema, albumin) from the PBC data. Values on the vertical axis represent the expected number of deaths for a given predictor, after adjusting for all other predictors. Dashed red lines for continuous predictors are ± 2 standard error bars.

H. Ishwaran, E. Blackstone, M. Lauer, and C. Pothier. Relative risk forests for exercise heart rate recovery as a predictor of mortality. J. Amer. Stat. Assoc., 99:591–600, 2004.

J. Kalbfleisch and R. Prentice. The Statistical Analysis of Failure Time Data. Wiley, New York, 1980.

M. Kattan, K. Hess, and J. Beck. Experiments to determine whether recursive partitioning (CART) or an artificial neural network overcomes theoretical limitations of Cox proportional hazards regression. Computers and Biomedical Research, 31:363–373, 1998.

M. LeBlanc and J. Crowley. Relative risk trees for censored survival data. Biometrics, 48:411–425, 1992.

M. LeBlanc and J. Crowley. Survival trees by goodness of split. J. Amer. Stat. Assoc., 88:457–467, 1993.

M. May, P. Royston, M. Egger, A.C. Justice, and J.A.C. Sterne. Development and validation of a prognostic model for survival time data: application to prognosis of HIV positive patients treated with antiretroviral therapy. Statist. Medicine, 23:2375–2398, 2004.

D. Naftel, E. Blackstone, and M. Turner. Conservation of events, 1985. Unpublished notes.

R. Schapire, Y. Freund, P. Bartlett, and W. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist., 26(5):1651–1686, 1998.

M.R. Segal. Regression trees for censored data. Biometrics, 44:35–47, 1988.

T. Therneau and E. Atkinson. An introduction to recursive partitioning using the rpart routine. Technical Report 61, 1997. Section of Biostatistics, Mayo Clinic, Rochester.

Hemant Ishwaran
Department of Quantitative Health Sciences
Cleveland Clinic, Cleveland, OH
[email protected]

Udaya B. Kogalur
Kogalur Shear Corporation
Clemmons, NC
[email protected]


Rwui: A Web Application to Create User Friendly Web Interfaces for R Scripts

by Richard Newton and Lorenz Wernisch

Summary

The web application Rwui is used to create web interfaces for running R scripts. All the code is generated automatically so that a fully functional web interface for an R script can be downloaded and up and running in a matter of minutes.

Rwui is aimed at R script writers who have scripts that they want people unversed in R to use. The script writer uses Rwui to create a web application that will run their R script. Rwui allows the script writer to do this without having to do any web application programming, because Rwui generates all the code for them.

The script writer designs the web application to run their R script by entering information on a sequence of web pages. The script writer then downloads the application they have created and installs it on their own server. This is a simple matter of copying one file from the download onto the server. The script writer now has a web application on their server that runs their R script, which other people can use over the web.

Although of general applicability, Rwui was designed primarily with bioinformatics applications in mind, aimed at bioinformaticians who are developing a statistical analysis of experimental data for collaborators and who want to automate their analysis in a user friendly way. Rwui may also be of use for creating teaching applications.

Rwui may be found at http://rwui.cryst.bbk.ac.uk

Introduction

R is widely used in the field of bioinformatics. The Bioconductor project (Gentleman et al., 2004) contains R packages specifically designed for this field. However, many potential users of bioinformatics programs written in R, who come from a non-bioinformatics background, are unfamiliar with the language. One solution to this problem is for developers of R scripts to provide user-friendly web interfaces for their scripts.

Rwui (R Web User Interface) is a web application that the developer of an R script can use to create a web application for running their script. All the code for the web application is generated automatically. This is the key feature of Rwui and means that it only takes a few minutes for someone who is entirely unfamiliar with web application programming to design, download and install on their server a fully functional web interface for an R script.

A web interface for an R script means that the script can be used by anyone, even if they have no knowledge of R. Instead of using the R script directly, values for variables and data files for processing are first submitted by the user on a web form. The application then runs the R script on a server, out of sight of the user, and returns the results of the analysis to the user's web page. Because the web application runs on a server it can be accessed remotely, and the user does not need to have R installed on their machine. And updates to the script need only be made to the copy on the server.

Although of general applicability, Rwui has been designed with bioinformatics applications in mind. To this end the web applications created by Rwui can include features often required in bioinformatics data analysis, such as a page for uploading replicate data files and their group identifiers. Rwui is typically aimed at bioinformaticians who have developed an R script for analysing experimental data and who want to make their method immediately accessible to collaborators who are unfamiliar with R. Rwui enables bioinformaticians to do this quickly and simply, without having to concern themselves with any aspects of web application programming.

The completed web applications run on Tomcat servers (http://tomcat.apache.org/). Tomcat is free and widely used server software, very easy to install on both Unix and Windows machines, and with an impeccable security record. There have been no known instances of the safety of data on Tomcat servers being compromised despite its widespread use. But if data cannot be sent over the internet then either the web applications created by Rwui can be installed on a local Tomcat server for internal use, situated behind a firewall, or Tomcat and the web applications created by Rwui can be installed on individual stand-alone machines, in which case the web applications are accessed in a browser on the machine via the ‘localhost’ URL. Instructions on installing and running Tomcat can be accessed from a link on the ‘Help’ page of Rwui.

A number of approaches to providing web interfaces to R scripts exist. A list with links can be found on the R website (http://cran.r-project.org/doc/FAQ/R-FAQ.html) in the section ‘R Web Interfaces’. These fall into two main categories. Firstly, there are projects which allow the user to submit R code to a remote server. With these methods, the end user must handle R code. In contrast, users of applications created by Rwui have no contact with R code.

Secondly, there are projects which facilitate programming web applications that run R code. With these methods, whoever is creating the application must write web application code. In contrast, when using Rwui no web application code needs to be written; a web interface for an R script is created simply by selecting options from a sequence of web pages.

Figure 1: Screenshot showing the web page of Rwui where input items are added.

Using the Application

The information that Rwui requires in order to create a web application for running an R script is entered on a sequence of forms. After entering a title and introductory text for the application, the input items that will appear on the application's web page are selected. Input items may be Numeric or Text entry boxes, Checkboxes, Drop-down lists, Radio Buttons, File Upload boxes and a Multiple/Replicate File Upload page. Each of the input variables of the R script, that is, those variables in the script that require a value supplied by the user, must have a corresponding input item on the application's web page. Figure 1 shows the web page of Rwui on which the input items that will appear in the application are added.

Section headings can also be added if required. Rwui displays a facsimile of the web page that has been created as items are added to the page. This can be seen in the lower part of the screenshot shown in Figure 1. Input items are given a number so that items can be deleted and new items inserted between existing items. After uploading the R script, Rwui generates the web application, which can be downloaded as a zip or tgz file.

Rwui creates Java based applications that use the Apache Struts framework (http://struts.apache.org/). Struts is open source and a popular framework for constructing well-organised, stable and extensible web applications.

The completed applications will run on a Tomcat server. All that needs to be done to use the downloaded web application is to place the application's ‘web application archive’ file, which is contained in the download, into a subdirectory of the Tomcat server. In addition, the file permissions of an included shell script must be changed to executable.


Figure 2: Screenshot showing an example of a results web page of an application created using Rwui.

An application description file, written in XML, is included in the download. This is useful if the application requires modification at a later date. The details of the application can be re-entered into Rwui by uploading the application description file. The application can then be edited and rebuilt within Rwui.

Further information on using Rwui can be found from links on the application's web pages. The ‘Quick Tour’ provides a ten minute introductory example of using Rwui. The ‘Technical Report’ gives a technical overview of the application. The ‘Help’ link accesses the manual for Rwui, which contains detailed information for users.

System Requirements

In order to use the web applications created by Rwui, a machine is required with Tomcat version 5.0 or later, Java version 1.5 and an R version compatible with the R script(s). Although a server running a Unix operating system is preferable, the applications will work without modification on a Tomcat server running Windows XP.

R script Requirements

An R script needs little or no modification in order to be run from a web application created by Rwui. There are three areas where the script may require attention.

The R script receives input from the user via R variables, which we term the input variables of the script. The values of input variables are entered by the user on the web page of the application. The input variables must be named according to the rules of both R and Java variable naming. These rules are given on the ‘Help’ page of Rwui. Those variables in the R script that are not input variables do not need to conform to the rules of Java variable naming.

Secondly, in order to make the results of the analysis available to the user, the R script must save the results to files. Thirdly, and optionally, the R script may also write progress information into a file which can be continuously displayed for the user in a Javascript pop-up window.
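By way of illustration only, a script laid out along these lines might look as follows; the input variable names (datafile, ncluster) and output file names are hypothetical, and Rwui supplies the values of the input variables before the script is run.

## Hypothetical R script arranged for use with Rwui.
progress <- function(msg)                      # optional progress file for the pop-up
  cat(msg, "\n", file = "progress.txt", append = TRUE)

progress("Reading data...")
dat <- read.table(datafile, header = TRUE)     # 'datafile' is an input variable

progress("Clustering...")
fit <- kmeans(dat, centers = ncluster)         # 'ncluster' is an input variable

progress("Writing results...")
write.csv(data.frame(cluster = fit$cluster),   # results must be saved to files
          file = "clusters.csv")
png("clusters.png")
plot(dat[, 1:2], col = fit$cluster)
dev.off()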


Using applications created by Rwui

A demonstration application created by Rwui can be accessed from the ‘Help’ page of Rwui.

Applications created by Rwui can include a login page. Access can be controlled either by a single password, or by username/password pairs.

If an application created by Rwui includes a Multiple/Replicate File upload page, then the application consists of two web pages on which the user enters information. On the first web page the user uploads multiple files one at a time. Once completed, a button takes the user to a second web page where singleton data files and values for all other variables are entered. The ‘Analyse’ button on this page submits the values of variables, uploads any data files and runs the R script. If a Multiple/Replicate File upload page is not included, the application consists of this second web page only.

Before running the R script the application first checks the validity of the values that the user has entered and returns an error message to the page if any are invalid. During the analysis, progress information can be displayed for the user. To enable this the R script must append information to a text file at stages during its execution. This text file is displayed for the user in a JavaScript pop-up window which refreshes at fixed intervals.

On completion of the analysis, a link to a Results page appears at the bottom of the web page. The user can change data files and/or the values of any of the variables and re-analyse, and the new results will appear as a second link at the bottom of the page, and so on. Clicking on a link brings up the Results page for the corresponding analysis. Figure 2 shows an example of a results page.

The user can download individual results files by clicking on the name of the appropriate file on a Results page. Alternatively, each Results page also contains a link which will download all the results files from the page and the html of the page itself. In this way the user can view offline saved Results pages with their associated results files. Any uploaded data files are also linked to on the Results page, giving the user the opportunity to check that the correct data has been submitted and has been uploaded correctly.

The applications include a SessionListener which detects when a user's session is about to expire. The SessionListener then removes all the working directories created during the session from the server, to prevent it from becoming clogged with data. By default a session expires 30 minutes after it was last accessed by the user, but the delay can be changed in the server configuration.

All the code of a complete demonstration application created by Rwui can be downloaded from a link on the ‘Help’ page of Rwui.

Acknowledgment

RN was funded as part of a Wellcome Trust Functional Genomics programme grant.

Bibliography

Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A.J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J.Y.H., Zhang, J. (2004) Bioconductor: Open software development for computational biology and bioinformatics, Genome Biology, 5, R80.

Richard Newton
School of Crystallography
Birkbeck College, University of London, UK
[email protected]

Lorenz Wernisch
MRC Biostatistics Unit, Cambridge, UK
[email protected]

The np Package: Kernel Methods for Categorical and Continuous Data

by Tristen Hayfield and Jeffrey S. Racine

Readers of R News are certainly aware of a number of nonparametric kernel¹ methods that exist in R base (e.g., density) and in certain R packages (e.g., locpoly in the KernSmooth package). Such functionality allows R users to nonparametrically model a density or to conduct nonparametric local polynomial regression, to name but two applications of kernel methods. Nonparametric kernel approaches are appealing to applied researchers because they often reveal features in the data that might be missed by classical parametric methods. However, traditional nonparametric kernel methods presume that the underlying data is continuous, which is frequently not the case.

¹A ‘kernel’ is simply a weighting function.


Practitioners often encounter a mix of categorical and continuous data types, but may still wish to proceed in a nonparametric direction. The traditional nonparametric approach is called a ‘frequency’ approach, whereby data is broken up into subsets (‘cells’) corresponding to the values assumed by the categorical variables, and only then do you apply, say, density or locpoly to the continuous data remaining in each cell. Nonparametric frequency approaches are widely acknowledged to be unsatisfactory. Recent theoretical developments offer practitioners a variety of kernel-based methods for categorical (i.e., unordered and ordered factors) data only or for a mix of continuous and categorical data; see Li and Racine (2007) and the references therein for an in-depth treatment of these methods, and also see the articles listed in the bibliography below.

The np package implements recently developed kernel methods that seamlessly handle the mix of continuous, unordered, and ordered factor data types often found in applied settings. The package also allows users to create their own routines using high-level function calls rather than writing their own C or Fortran code.² The design philosophy underlying np is simply to provide an intuitive, flexible, and extensible environment for applied kernel estimation.

Currently, a range of methods can be found in the np package including unconditional (Li and Racine, 2003; Ouyang et al., 2006) and conditional (Hall et al., 2004; Racine et al., 2004) density estimation and bandwidth selection, conditional mean and gradient estimation (local constant (Racine and Li, 2004; Hall et al., forthcoming) and local polynomial (Li and Racine, 2004)), conditional quantile and gradient estimation (Li and Racine, forthcoming), model specification tests (regression (Hsiao et al., forthcoming), quantile, significance (Racine et al., forthcoming)), semiparametric regression (partially linear, index models, average derivative estimation, varying/smooth coefficient models), etc.

Before proceeding, we caution the reader that data-driven bandwidth selection methods can be numerically intensive, which is the reason underlying the development of an MPI-aware³ version of the np package that uses some of the functionality of the Rmpi package, which we have tentatively called the npRmpi package. The functionality of np and npRmpi will be identical; however, using npRmpi you could take advantage of a cluster computing environment or a multi-core/multi-cpu desktop machine, thereby alleviating the computational burden associated with the nonparametric analysis of large datasets. We ought also to point out that data-driven (i.e., automatic) bandwidth selection procedures are not guaranteed always to produce good results. For this reason, we advise the reader to interrogate their bandwidth objects with the summary command, which produces a table of the bandwidths for the continuous variables scaled by an appropriate constant (σ_x n^α, where α depends on the kernel order and number of continuous variables, e.g. α = −1/5 for one continuous variable and a second order kernel), which some readers may find helpful. Also, the admissible range for the bandwidths for the categorical variables is provided when summary is used, which some readers may also find helpful.

We have tried to make np flexible enough to be of use to a wide range of users. All options can be tweaked by users (kernel function, kernel order, bandwidth type, estimator type and so forth). One function, npksum, allows you to create your own estimators, tests, etc. The function npksum is simply a call to highly optimized C code, so you get the benefits of compiled code along with the power and flexibility of the R language. We hope that incorporating the npksum function renders the package suitable for teaching and research alike.

There are two ways in which you can interact with functions in np: i) using data frames, or ii) using a formula interface, where appropriate.

To some, it may be natural to use the data frame interface. The R data.frame function preserves a variable's type once it has been cast (unlike cbind, which we avoid for this reason). If you find this most natural for your project, you first create a data frame casting data according to their type (i.e., one of continuous (default), factor, ordered), as in

data.object <- data.frame(x1=factor(x1),
                          x2, x3=ordered(x3))

where x1 is, say, a binary factor, x2 continuous, and x3 an ordered factor. Then you could pass this data frame to the appropriate np function, say

npudensbw(dat=data.object)

To others, however, it may be natural to use the formula interface that is used for the regression example outlined below. For nonparametric regression functions such as npregbw, you would proceed just as you would using lm, e.g.,

npregbw(y ~ factor(x1) + x2)

except that you would of course not need to specify, e.g., polynomials in variables or interaction terms. Every function in np supports both interfaces, where appropriate.

Before proceeding, a few words on the formula interface are in order. We use the standard formula interface as it provides capabilities for handling missing observations and so forth. This interface, however, is simply a convenient device for telling a routine which variable is, say, the outcome and which are, say, the covariates. That is, just because one writes x1 + x2 in no way means or is meant to imply that the model will be linear and additive (why use fully nonparametric methods to estimate such models in the first place?). It simply means that there are, say, two covariates in the model, the first being x1 and the second x2, that we are passing them to a routine with the formula interface, and nothing more is presumed or implied.

²The high-level functions found in the package in turn call compiled C code, allowing users to focus on the application rather than the implementation details.

³MPI is a library specification for message-passing, and has emerged as a dominant paradigm for portable parallel programming.

Regardless of whether you use the data frame or formula interface, workflow for nonparametric estimation in np typically proceeds as follows:

1. compute appropriate bandwidths;

2. estimate an object and extract fitted or predicted values, standard errors, etc.;

3. optionally, plot the object.

In order to streamline the creation of a set of complicated graphics objects, plot (which calls npplot) is dynamic; i.e., you can specify, say, bootstrapped error bounds and the appropriate routines will be called in real time.
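A minimal sketch of this three-step workflow, with hypothetical variable names (y, x1, x2), is:

bw  <- npregbw(y ~ x1 + factor(x2))          # 1. data-driven bandwidths
fit <- npreg(bws = bw)                       # 2. estimate; inspect with summary(fit)
plot(bw, plot.errors.method = "bootstrap")   # 3. optionally, plot with error bounds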

Efficient nonparametric regression in the presence of qualitative data

We begin with a motivating example that demonstrates the potential benefits arising from the use of kernel smoothing methods that smooth both the qualitative and quantitative variables in a particular manner; see the subsequent section for more detailed information regarding the method itself. For what follows, we consider an application taken from Wooldridge (2003, pg. 226) that involves multiple regression analysis with qualitative information.

We consider modeling an hourly wage equation for which the dependent variable is log(wage) (lwage) while the explanatory variables include three continuous variables, namely educ (years of education), exper (the number of years of potential experience), and tenure (the number of years with their current employer), along with two qualitative variables, female (“Female”/“Male”) and married (“Married”/“Notmarried”). For this example there are n = 526 observations.

The classical parametric approach towards estimating such relationships requires that one first specify the functional form of the underlying relationship. We start by first modelling this relationship using a simple parametric linear model. By way of example, Wooldridge (2003, pg. 222) presents the following model:⁴

> data("wage1")>> model.ols <- lm(lwage ~ factor(female) ++ factor(married) + educ + exper + tenure,+ data=wage1)>> summary(model.ols)....

This model is, however, restrictive in a number of ways. First, the analyst must specify the functional form (in this case linear) for the continuous variables (educ, exper, and tenure). Second, the analyst must specify how the qualitative variables (female and married) enter the model (in this case they affect the model's intercepts only). Third, the analyst must specify the nature of any interactions among all variables, quantitative and qualitative (in this case, there are none). Should any of these assumptions be incorrect, then the estimated model will be biased and inconsistent, potentially leading to faulty inference.

One might next test the null hypothesis that this parametric linear model is correctly specified using the consistent model specification test found in Hsiao et al. (forthcoming) that admits both categorical and continuous data (this example will likely take a few minutes on a desktop computer as it uses bootstrapping and cross-validated bandwidth selection):

> data("wage1")
> attach(wage1)
>
> model.ols <- lm(lwage ~ factor(female) +
+   factor(married) + educ + exper + tenure,
+   x=TRUE, y=TRUE)
>
> X <- data.frame(factor(female),
+   factor(married), educ, exper, tenure)
>
> output <- npcmstest(model=model.ols,
+   xdat=X, ydat=lwage, tol=.1, ftol=.1)
> summary(output)

Consistent Model Specification Test
....
IID Bootstrap (399 replications)
Test Statistic 'Jn': 5.594745e+00
P Value: 0.000000e+00
....

This naïve linear model is rejected by the data (the P-value for the null of correct specification is < 0.001), hence one might proceed instead to model this relationship using kernel methods.

As noted, the traditional nonparametric approach towards modeling relationships in the presence of qualitative variables requires that you first split your data into subsets containing only the continuous variables of interest (lwage, exper, and tenure). For instance, we would have four such subsets: a) n = 132 observations for married females, b) n = 120 observations for single females, c) n = 86 observations for single males, and d) n = 188 observations for married males. One would then construct smooth nonparametric regression models for each of these subsets and proceed with the analysis in this fashion. However, this may lead to a loss in efficiency due to a reduction in the sample size, leading to overly variable estimates of the underlying relationship.

⁴We would like to thank Jeffrey Wooldridge for allowing us to use his data. Also, we would like to point out that Wooldridge starts out with this naïve linear model, but quickly moves on to a more realistic model involving nonlinearities in the continuous variables and so forth.

Instead, however, we could construct smooth nonparametric regression models by i) using a smoothing function that is appropriate for the qualitative variables, such as that proposed by Aitchison and Aitken (1976), and ii) modifying the nonparametric regression model as was done by Li and Racine (2004). One can then conduct sound nonparametric estimation based on the n = 526 observations rather than resorting to sample splitting. The rationale for this lies in the fundamental concept that doing so may introduce potential bias, but it will always reduce variability, thus leading to potential finite-sample efficiency gains. Our experience has been that the potential benefits arising from this approach more than offset the potential costs in finite-sample settings.

Next, we consider using the local-linear nonparametric method described in Li and Racine (2004). For the reader's convenience we supply precomputed cross-validated bandwidths which are automatically loaded when one reads the wage1 dataset. The commented-out code that generates the cross-validated bandwidths is also provided should the reader wish to fully replicate the example (recall being cautioned about the computational burden associated with multivariate data-driven bandwidth methods).

> data("wage1")
>
> #bw.all <- npregbw(lwage ~ factor(female) +
> #  factor(married) + educ + exper + tenure,
> #  regtype="ll", bwmethod="cv.aic",
> #  data=wage1)
>
> model.np <- npreg(bws=bw.all)
>
> summary(model.np)

Regression Data: 526 training points,
in 5 variable(s)
....
Kernel Regression Estimator: Local-Linear
Bandwidth Type: Fixed
Residual standard error: 1.371530e-01
R-squared: 5.148139e-01
....

Note that the bandwidth object is the only thing you need to pass to npreg as it encapsulates the kernel types, regression method, and so forth. You could, of course, use npreg with a formula and perhaps manually specify the bandwidths using a bandwidth vector if you so chose. We have tried to make each function as flexible as possible to meet the needs of a variety of users.

The goodness of fit of the nonparametric model (R² = 51.5%) is better than that for the parametric model (R² = 40.4%). In order to investigate whether this apparent improvement reflects overfitting or simply that the nonparametric model is in fact more faithful to the underlying data generating process, we shuffled the data and created two independent samples, one of size n_1 = 400 and one of size n_2 = 126. We fit the models on the n_1 training observations then evaluate the models on the n_2 independent hold-out observations using the predicted square error criterion, namely $n_2^{-1}\sum_{i=1}^{n_2}(y_i - \hat y_i)^2$, where the $y_i$ are the lwage values for the hold-out observations and the $\hat y_i$ are the predicted values. Finally, we compare the parametric model, the nonparametric model that smooths both the qualitative and quantitative variables, and the traditional frequency nonparametric model that breaks the data into subsets and smooths the quantitative data only.

> data("wage1")> set.seed(123)>> # Shuffle the data and create two datasets...>> ii <- sample(seq(nrow(wage1)),replace=FALSE)>> wage1.train <- wage1[ii[1:400],]> wage1.eval <- wage1[ii[401:nrow(wage1)],]>> # Compute the parametric model for the> # training data...>> model.ols <- lm(lwage ~ factor(female) ++ factor(married) + educ + exper + tenure,+ data=wage1.train)>> # Obtain the out-of-sample predictions...>> fit.ols <- predict(model.ols,+ data=wage1.train, newdata=wage1.eval)>> # Compute the predicted square error...>> pse.ols <- mean((wage1.eval$lwage -+ fit.ols)^2)>> #bw.subset <- npregbw(> # lwage~factor(female) + factor(married) +> # educ + exper + tenure, regtype="ll",> # bwmethod="cv.aic", data=wage1.train)>


> model.np <- npreg(bws=bw.subset)
>
> # Obtain the out-of-sample predictions...
>
> fit.np <- predict(model.np,
+   data=wage1.train, newdata=wage1.eval)
>
> # Compute the predicted square error...
>
> pse.np <- mean((wage1.eval$lwage -
+   fit.np)^2)
>
> # Do the same for the frequency estimator...
> # We do this by setting lambda=0 for the
> # qualitative variables...
>
> bw.freq <- bw.subset
> bw.freq$bw[1] <- 0
> bw.freq$bw[2] <- 0
>
> model.np.freq <- npreg(bws=bw.freq)
>
> # Obtain the out-of-sample predictions...
>
> fit.np.freq <- predict(model.np.freq,
+   data=wage1.train, newdata=wage1.eval)
>
> # Compute the predicted square error...
>
> pse.np.freq <- mean((wage1.eval$lwage -
+   fit.np.freq)^2)
>
> # Compare performance for the
> # hold-out data...
>
> pse.ols
[1] 0.1469691
> pse.np
[1] 0.1344028
> pse.np.freq
[1] 0.1450111

The predicted square error on the hold-out data was 0.147 for the parametric linear model, 0.145 for the traditional nonparametric estimator that splits the data into subsets, and 0.134 for the nonparametric estimator that uses the full sample but smooths both the qualitative and quantitative data. We therefore conclude that the nonparametric model that smooths both the continuous and qualitative variables appears to provide a better description of the underlying data generating process than either the nonparametric model that uses sample splitting or the naïve linear model.

Note that for this example we have only four cells. If one used all qualitative variables included in the dataset (16 in total), one would have 65,536 cells, many of which would be empty, and most having far too few observations to provide meaningful nonparametric estimates. As the number of qualitative variables increases, the difference between the estimator that smooths both continuous and discrete variables in a particular manner and the traditional estimator that relies upon sample splitting will become even more pronounced.

We now briefly outline the essence of kernel smoothing of categorical data for the interested reader.

Categorical data kernel methods

To introduce the R user to the basic concept of kernel smoothing of categorical random variables, consider a kernel estimator of a probability function, i.e. p(x^d) = P(X^d = x^d), where X^d is an unordered (i.e., discrete) categorical variable assuming values in, say, {0, 1, 2, . . . , c − 1}, where c is the number of possible outcomes of the discrete random variable X^d. Aitchison and Aitken (1976) proposed a kernel estimator of p(x^d) given by
$$
\hat p(x^d) = \frac{1}{n} \sum_{i=1}^{n} L(X_i^d = x^d),
$$

where L(·) is a kernel function defined by, say,

$$
L(X_i^d = x^d) =
\begin{cases}
1 - \lambda & \text{if } X_i^d = x^d,\\
\lambda/(c-1) & \text{otherwise,}
\end{cases}
$$

and where λ is a ‘smoothing’ parameter that can be selected via a variety of data-driven methods; see also Ouyang et al. (2006) for theoretical underpinnings of least-squares cross-validation in this context.

The smoothing parameter λ is restricted to lie in [0, (c − 1)/c] for the kernel given above. Note that when λ = 0, then L(X_i^d = x^d) becomes an indicator function, and hence p̂(x^d) would be the standard frequency probability estimator (i.e., the sample proportion of X_i^d = x^d). If, on the other hand, λ = (c − 1)/c, its upper bound, then L(X_i^d = x^d) becomes a constant function and p̂(x^d) = 1/c for all x^d, which is the discrete uniform distribution. Data-driven methods for selecting λ are based on minimizing expected square error loss, among other criteria.

A number of issues surrounding this estimator are noteworthy: i) this approach extends trivially to a general multivariate setting, and in fact is best suited to such settings; ii) to deal with ordered variables you simply use an ‘ordered kernel’; iii) there is no “curse of dimensionality” associated with smoothing discrete probability functions – the estimators are √n consistent, just like their parametric counterparts; and iv) you can “mix-and-match” data types using “generalized product kernels” as we describe in the next section.

Why would anyone want to smooth a probability function in the first place? For finite-sample efficiency reasons, of course. That is, smoothing a probability function may introduce some finite-sample bias, but it always reduces variability. In settings where you are modeling a mix of categorical and continuous data or where one has sparse multivariate categorical data, the efficiency gains can be substantial indeed.
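A small self-contained sketch (ours, not np code) makes the behaviour at the two extremes of λ concrete:

## Aitchison-Aitken estimator of p(x^d) for an unordered factor.
aa.prob <- function(xd, lambda) {
  xd <- factor(xd)
  c.num <- nlevels(xd)
  sapply(levels(xd), function(val) {
    L <- ifelse(xd == val, 1 - lambda, lambda / (c.num - 1))  # kernel weights
    mean(L)                                                   # estimate of p(val)
  })
}

x <- factor(c("a", "a", "b", "c"))
aa.prob(x, 0)      # lambda = 0: the frequency estimator (0.50, 0.25, 0.25)
aa.prob(x, 2/3)    # lambda = (c-1)/c: the discrete uniform (1/3, 1/3, 1/3)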

Mixed data kernel methods

Estimating a joint density function defined over mixed data (i.e., both categorical and continuous) follows naturally using generalized product kernels. For example, for one discrete variable x^d and one continuous variable x^c, our kernel estimator of the joint PDF would be
$$
\hat f(x^d, x^c) = \frac{1}{n h_{x^c}} \sum_{i=1}^{n} L(X_i^d = x^d)\, W\!\left(\frac{X_i^c - x^c}{h_{x^c}}\right),
$$
where L(X_i^d = x^d) is a categorical data kernel function, while W[(X_i^c − x^c)/h_{x^c}] is a continuous data kernel function (e.g., Epanechnikov, Gaussian, or uniform) and h_{x^c} is the bandwidth for the continuous variable; see Li and Racine (2003) for further details.
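For intuition, the estimator can be transcribed directly for a single evaluation point; the sketch below (ours, with a Gaussian choice for W) is purely illustrative, since np computes such estimates and their bandwidths for you.

## Illustrative mixed-data joint density estimate at one point (x^d, x^c).
mixed.pdf <- function(xd.eval, xc.eval, xd, xc, lambda, h) {
  c.num <- nlevels(factor(xd))
  L <- ifelse(xd == xd.eval, 1 - lambda, lambda / (c.num - 1))  # categorical kernel
  W <- dnorm((xc - xc.eval) / h)                                # continuous kernel
  mean(L * W) / h                                               # (1 / (n h)) * sum(L W)
}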

Once we can consistently estimate a joint density function defined over mixed data, we can then proceed to estimate a range of statistical objects of interest to practitioners. Some mainstays of applied data analysis include estimation of regression functions and their derivatives, conditional density functions and their quantiles, conditional mode functions (i.e., count data models, probability models), and so forth, each of which is currently implemented.

Nonparametric estimation of binary outcome and count data models

For what follows, we adopt the conditional probability estimator proposed in Hall et al. (2004) to estimate a nonparametric model of a binary outcome when there exist a number of categorical covariates.

For this example, we use the birthwt data taken from the MASS package, and compute a parametric Logit model and a nonparametric conditional mode model. We then compare their confusion matrices and assess their classification ability. The outcome is an indicator of low infant birthweight (0/1).

> library("MASS")
> attach(birthwt)
>
> # First, we model this with a simple
> # parametric Logit model...
>
> model.logit <- glm(low ~ factor(smoke) +
+ factor(race) + factor(ht) + factor(ui) +
+ ordered(ftv) + age + lwt, family = binomial)
>
> # Now we model this with the nonparametric
> # conditional density estimator and compute
> # the conditional mode...
>
> bw <- npcdensbw(factor(low) ~
+ factor(smoke) + factor(race) +
+ factor(ht) + factor(ui) + ordered(ftv) +
+ age + lwt, tol = .1, ftol = .1)
> model.np <- npconmode(bws = bw)
>
> # Finally, we compare confusion matrices
> # from the logit and nonparametric models...
>
> # Parametric...
>
> cm <- table(low,
+ ifelse(fitted(model.logit) > 0.5, 1, 0))
> cm

low     0   1
  0   119  11
  1    34  25

>
> # Nonparametric...
>
> summary(model.np)

Conditional Mode data: 189 training points,
in 8 variable(s)

....
Bandwidth Type: Fixed

Confusion Matrix
      Predicted
Actual   0   1
     0 127   3
     1  27  32

....

For this example the nonparametric model is better able to predict low birthweight infants than its parametric counterpart, correctly predicting 159/189 birthweights compared with 144/189 for the parametric model.
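As a small illustrative addition of our own (not part of the original example), the correct classification rate of the Logit model can be read directly off its confusion matrix; the analogous rate for the nonparametric model (159/189) follows from the matrix printed by summary(model.np).

sum(diag(cm)) / sum(cm)   # Logit model: (119 + 25)/189 = 144/189, about 0.762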

Visualizing and summarizing nonparametric results

Summarizing nonparametric results and comparing them with parametric ones is perhaps one of the main challenges faced by the practitioner. We have attempted to provide a range of summary measures and plotting methods to facilitate these tasks.

The plot function (which calls npplot) can be used on most nonparametric objects. In order to further facilitate comparison of parametric with nonparametric results, we also compute a number of summary measures of goodness of fit, normalized bandwidths ('scale factors') and so forth that are accessed by the summary command, while objects generated by np also contain various alternative measures of goodness of fit and the like. We again consider the approach detailed in Li and Racine (2004) which conducts local linear kernel regression with a mix of discrete and continuous data types on a popular macroeconomic dataset. The original study assessed the impact of Organisation for Economic Co-operation and Development (OECD) membership on a country's growth rate when controlling for a range of factors such as population growth rates, stock of capital (physical and human) and so forth. For this example we use the formula interface rather than the data frame interface; the cross-validated bandwidths have already been computed and are loaded when one reads the oecdpanel dataset.

> data("oecdpanel")
> attach(oecdpanel)
>
> # bw <- npregbw(growth ~ factor(oecd) +
> #   ordered(year) + initgdp + popgro + inv +
> #   humancap, bwmethod = "cv.aic",
> #   regtype = "ll")
>
> summary(bw)

Regression Data (616 observations,
6 variable(s)):

Regression Type: Local-Linear

....

>
> model <- npreg(bw)
>
> summary(model)

Regression Data: 616 training points,
in 6 variable(s)

....
>
> plot(bw,
+ plot.errors.method = "bootstrap",
+ plot.errors.boot.num = 25)
>

We display partial regression plots in Figure 1 (see footnote 5). We also plot bootstrapped variability bounds, where the bootstrapping is done via the boot package, thereby facilitating a variety of bootstrap methods.

If you prefer gradients rather than levels, you can simply use the argument gradients=TRUE in plot to indicate that you wish to plot partial derivatives rather than the conditional mean function itself (see footnote 6).

Many of the accessor functions such as predict, fitted, and residuals work with most nonparametric objects, where appropriate. As well, gradients is a generic function which extracts gradients from objects. The R function coef can be used for extracting the parametric coefficients from semiparametric models such as those created with npplreg.
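By way of a brief sketch of our own (not part of the original example), the accessors can be applied to the regression object estimated above; we assume here that gradients are requested at estimation time.

model <- npreg(bw, gradients = TRUE)
head(fitted(model))      # fitted conditional mean values
head(residuals(model))   # residuals
head(gradients(model))   # estimated partial derivatives, one column per regressor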

Creating mixed data kernel objects via npksum

Rather than being limited to only those kernel methods that exist in np, you could instead create your own mixed data kernel objects. npksum is a function that computes kernel sums on evaluation data, given a set of training data, data to be weighted (optional), and a bandwidth specification (any bandwidth object). npksum uses highly-optimized C code that strives to minimize its memory footprint, while there is low overhead involved when using repeated calls to this function. The convolution kernel option would allow you to create, say, the least squares cross-validation function for kernel density estimation. You can choose powers to which the kernels constituting the sum are raised (default=1), whether or not to divide by bandwidths, whether or not to generate leave-one-out sums, and so on. Three classes of kernel methods for the continuous data types are available: fixed, adaptive nearest-neighbor, and generalized nearest-neighbor. Adaptive nearest-neighbor bandwidths change with each sample realization in the set, x_i, when estimating the kernel sum at the point x. Generalized nearest-neighbor bandwidths change with the point at which the sum is computed, x. Fixed bandwidths are constant over the support of x. A variety of kernels may be specified by users. Kernels implemented for continuous data types include the second, fourth, sixth, and eighth order Gaussian and Epanechnikov kernels, and the uniform kernel. We have tried to anticipate the needs of a variety of users when designing this function. A number of semiparametric estimators and nonparametric tests in the np package in fact make use of npksum.

In the following example, we conduct local-constant (i.e., Nadaraya-Watson) kernel regression. We shall use cross-validated bandwidths drawn from npregbw for this example. Note that we extract the kernel sum from npksum via the '$ksum' return value in both the numerator and denominator of the resulting object. For this example, we use the data frame interface rather than the formula interface, and output from this example is plotted in Figure 2.

Footnote 5: A "partial regression plot" is simply a 2D plot of the outcome y versus one covariate x_j when all other covariates are held constant at their respective medians (this can be changed by users).

Footnote 6: plot(bw, gradients=TRUE, plot.errors.method="bootstrap", plot.errors.boot.num=25, common.scale=FALSE) will work for this example.


Figure 1: Partial regression plots with bootstrapped error bounds based on a cross-validated local linear kernel estimator. The six panels plot growth against factor(oecd), ordered(year), initgdp, popgro, inv, and humancap.

> data("cps71")
> attach(cps71)
....
> fit.lc <- npksum(txdat = age, tydat = logwage,
+ bws = 1.892169)$ksum /
+ npksum(txdat = age, bws = 1.892169)$ksum
>

Note that the arguments 'txdat' and 'tydat' refer to 'training' data for the regressors X and for the dependent variable y, respectively. One can also specify 'evaluation' data if you wish to evaluate the function on a set of points that differ from the training data. In such cases, you would also specify 'exdat' (and 'eydat' if so desired).
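For instance, the following short sketch (ours, purely illustrative rather than part of the article's original example) evaluates the same local-constant estimator on an equally spaced grid of ages via the 'exdat' argument:

age.grid <- seq(min(age), max(age), length = 100)
fit.grid <- npksum(txdat = age, exdat = age.grid, tydat = logwage,
                   bws = 1.892169)$ksum /
            npksum(txdat = age, exdat = age.grid, bws = 1.892169)$ksum
plot(age, logwage, cex = 0.25)
lines(age.grid, fit.grid)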

We direct the interested reader to see, by way of illustration, the example available via ?npksum that conducts leave-one-out cross-validation for a local constant regression estimator via calls to the R function nlm, and compares this to the npregbw function.

Figure 2: A local constant kernel estimator created with npksum using the Gaussian kernel (default) and a bandwidth of 1.89 (log(wage) plotted against Age).

Acknowledgments

We would like to thank but not implicate Achim Zeileis and John Fox, whose many and varied suggestions led to a much improved version of the package and of this review.


Bibliography

J. Aitchison and C. G. G. Aitken. Multivariate binary discrimination by the kernel method. Biometrika, 63(3):413–420, 1976.

P. Hall, J. S. Racine, and Q. Li. Cross-validation and the estimation of conditional probability densities. Journal of the American Statistical Association, 99(468):1015–1026, 2004.

P. Hall, Q. Li, and J. S. Racine. Nonparametric estimation of regression functions in the presence of irrelevant regressors. The Review of Economics and Statistics, forthcoming.

C. Hsiao, Q. Li, and J. S. Racine. A consistent model specification test with mixed categorical and continuous data. Journal of Econometrics, 140(2):802–826, 2007.

Q. Li and J. Racine. Nonparametric Econometrics: Theory and Practice. Princeton University Press, 2007.

Q. Li and J. S. Racine. Nonparametric estimation of distributions with categorical and continuous data. Journal of Multivariate Analysis, 86:266–292, August 2003.

Q. Li and J. S. Racine. Cross-validated local linear nonparametric regression. Statistica Sinica, 14(2):485–512, April 2004.

Q. Li and J. S. Racine. Nonparametric estimation of conditional CDF and quantile functions with mixed categorical and continuous data. Journal of Business and Economic Statistics, forthcoming.

D. Ouyang, Q. Li, and J. S. Racine. Cross-validation and the estimation of probability distributions with categorical data. Journal of Nonparametric Statistics, 18(1):69–100, 2006.

J. S. Racine and Q. Li. Nonparametric estimation of regression functions with both categorical and continuous data. Journal of Econometrics, 119(1):99–130, 2004.

J. S. Racine, Q. Li, and X. Zhu. Kernel estimation of multivariate conditional distributions. Annals of Economics and Finance, 5(2):211–235, 2004.

J. S. Racine, J. D. Hart, and Q. Li. Testing the significance of categorical predictor variables in nonparametric regression models. Econometric Reviews, 25:523–544, 2007.

J. M. Wooldridge. Introductory Econometrics. Thompson South-Western, 2003.

Tristen Hayfield and Jeffrey S. Racine
McMaster University
Hamilton, Ontario, Canada
[email protected]
[email protected]

eiPack: R × C Ecological Inference and Higher-Dimension Data Management

by Olivia Lau, Ryan T. Moore, and Michael Kellermann

Introduction

Ecological inference (EI) models allow researchers to infer individual-level behavior from aggregate data when individual-level data is unavailable. Table 1 shows a typical unit of ecological analysis: a contingency table with observed row and column marginals and unobserved interior cells.

            col1    col2    ...   colC
  row1      N_11i   N_12i   ...   N_1Ci   N_1.i
  row2      N_21i   N_22i   ...   N_2Ci   N_2.i
  ...       ...     ...     ...   ...     ...
  rowR      N_R1i   N_R2i   ...   N_RCi   N_R.i
            N_.1i   N_.2i   ...   N_.Ci   N_i

Table 1: A typical R × C unit in ecological inference; the interior cell counts are typically unobserved.

In ecological inference, challenges arise because information is lost when aggregating across individuals, a problem that cannot be solved by collecting more aggregate-level data. Thus, EI models are unusually sensitive to modeling assumptions. Testing these assumptions is difficult without access to individual-level data, and recent years have witnessed a lively discussion of the relative merits of various models (Wakefield, 2004).

Nevertheless, there are many applied problems in which ecological inferences are necessary, either because individual-level data is unavailable or because the aggregate-level data is considered more authoritative. The latter is true in the voting rights context in the United States, where federal courts often base decisions on evidence derived from one or more EI models (Cho and Yoon, 2001). While packages such as MCMCpack (Martin and Quinn, 2006) and eco (Imai and Lu, 2005) provide tools for 2 × 2 inference, this is insufficient in many applications. In eiPack, we implement three existing methods for the general case in which the ecological units are R × C tables.

Methods and Data in eiPack

The methods currently implemented in eiPack are the method of bounds (Duncan and Davis, 1953), ecological regression (Goodman, 1953), and the Multinomial-Dirichlet model (Rosen et al., 2001).

The functions that implement these models share several attributes. The ecological tables are defined using a common formula of the form cbind(col1, ..., colC) ~ cbind(row1, ..., rowR). The row and column marginals can be expressed as either proportions or counts. Auxiliary functions renormalize the results for some subset of columns taken from the original ecological table, and appropriate print, summary, and plot functions conveniently summarize the model output.

In the following section, we demonstrate the features of eiPack using the (included) senc dataset, which contains individual-level party affiliation data for Black, White, and Native American voters in 8 counties in southeastern North Carolina. These counties include 212 precincts, which form the ecological units in this dataset. Because the data are observed at the individual level, the interior cell counts are known, allowing us to benchmark the estimates generated by each method.

Method of Bounds

The method of bounds (Duncan and Davis, 1953) uses the observed row and column marginals to calculate upper and lower bounds for functions of the interior cells of each ecological unit. The method of bounds is not a statistical procedure in the traditional sense; the bounds implied by the row and column marginals are deterministic and there is no probabilistic model for the data-generating process.

As implemented in eiPack, the method of bounds allows the user to calculate for a specified column k′ ∈ k = {1, ..., C} the deterministic bounds on the proportion of individuals in each row who belong in that column. For each unit being considered, let j be the row of interest, k index columns, k′ be the column of interest, k″ be the set of other columns considered, and k̄ be the set of columns excluded. For example, if we want the bounds on the proportion of Native American two-party registrants who are Democrats, j is Native American, k′ is Democrat, k″ is Republican, and k̄ is No Party. The unit-level quantity of interest is

$$\frac{N_{jk'i}}{N_{jk'i} + \sum_{k \in k''} N_{jki}}.$$

The lower and upper bounds on this quantity given by the observed marginals are, respectively,

$$\frac{\max\!\big(0,\, N_{ji} - \sum_{k \neq k'} N_{ki}\big)}{\max\!\big(0,\, N_{ji} - \sum_{k \neq k'} N_{ki}\big) + \min\!\big(N_{ji},\, \sum_{k \in k''} N_{ki}\big)}$$

and

$$\frac{\min\!\big(N_{ji},\, N_{k'i}\big)}{\min\!\big(N_{ji},\, N_{k'i}\big) + \max\!\big(0,\, N_{ji} - N_{k'i} - \sum_{k \in \bar{k}} N_{ki}\big)}.$$

The intervals generated by the method of bounds can be analyzed in a variety of ways. Grofman (2000) suggests calculating the intersection of the unit-level bounds. If this intersection (calculated by eiPack) is non-empty, it represents the range of values that are consistent with the observed marginals in each of the ecological units.
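To make the idea concrete, here is a small sketch of our own (not eiPack code) using made-up unit-level bounds: the intersection is simply the largest lower bound paired with the smallest upper bound, and it is empty whenever the former exceeds the latter.

lower <- c(0.62, 0.55, 0.70)   # illustrative unit-level lower bounds
upper <- c(0.80, 0.74, 0.91)   # illustrative unit-level upper bounds
intersection <- c(max(lower), min(upper))
if (intersection[1] > intersection[2]) {
  cat("The intersection of the unit-level bounds is empty.\n")
} else {
  intersection                  # here: c(0.70, 0.74)
}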

Researchers and practitioners may also choose to restrict their attention to units in which one group dominates, since the bounds will typically be more informative in those units. eiPack allows users to set row thresholds to conduct this extreme case analysis (known as homogeneous precinct analysis in the voting context). For example, suppose the user is interested in the proportion of two-party White registrants registered as Democrats in precincts that are at least 90% White. eiPack calculates the desired bounds:

> out <- bounds(cbind(dem, rep, non) ~ cbind(black,

+ white, natam), data = senc, rows = "white",

+ column = "dem", excluded = "non",

+ threshold = 0.9, total = NULL)

These calculated bounds can then be represented graphically. Segments cover the range of possible values (the true value for each precinct is the red dot, not included in the standard bounds plot). In this example, the intersection of the precinct-level bounds is empty.

> plot(out, row = "white", column = "dem")

# add true values to plot

> idx <- as.numeric(rownames(out$bounds$white.dem))

> truth <- senc$whdem[idx]/(senc$white[idx]

+ - senc$non[idx])

> plot((1:length(idx)) / (length(idx) + 1), truth)


Figure 1: A plot of deterministic bounds (proportion Democratic in precincts that are at least 90% White).

Ecological Regression

In ecological regression (Goodman, 1953), observed row and column marginals are expressed as proportions and each column is regressed separately on the row proportions, thus performing C regressions. Regression coefficients then estimate the population internal cell proportions. For a given unit i, define

• X_ri, the proportion of individuals in row r,

• T_ci, the proportion of individuals in column c, and

• β_rci, the proportion of row r individuals in column c.

The following identities hold:

$$T_{ci} = \sum_{r=1}^{R} \beta_{rci} X_{ri} \qquad\text{and}\qquad \sum_{c=1}^{C} \beta_{rci} = 1.$$

Defining the population cell fractions β_rc such that ∑_{c=1}^{C} β_rc = 1 for every r, ecological regression assumes that β_rc = β_rci for all i, and estimates the regression equations T_ci = ∑_{r=1}^{R} β_rc X_ri + ε_ci. Under the standard linear regression assumptions, including E[ε_ci] = 0 and Var[ε_ci] = σ²_c for all i, these regressions recover the population parameters β_rc. eiPack implements frequentist and Bayesian regression models (via ei.reg and ei.reg.bayes, respectively).

In the Bayesian implementation, we offer two options for the prior on β_rc. As a default, truncate = FALSE uses an uninformative flat prior that provides point estimates approaching the frequentist estimates (even when those estimates are outside the feasible range) as the number of draws m → ∞. In cases where the cell estimates are near the boundaries, choosing truncate = TRUE imposes a uniform prior over the unit hypercube such that all cell fractions are restricted to the range [0, 1].

Output from ecological regression can be summarized numerically just as in lm, or graphically using density plots. We also include functions to calculate estimates and standard errors of shares of a subset of columns in order to address questions such as, "What is the Democratic share of 2-party registration for each group?" For the Bayesian model, densities represent functions of the posterior draws of the β_rc; for the frequentist model, densities reflect functions of regression point estimates and standard errors calculated using the δ-method.

> out.reg <- ei.reg(cbind(dem, rep, non)

+ ~ cbind(black, white, natam), data = senc)

> lreg <- lambda.reg(out.reg,

+ columns = c("dem", "rep"))

> density.plot(lreg)

Figure 2: Density plots of ecological regression output (the Democratic and Republican proportions).

Multinomial-Dirichlet (MD) model

In the Multinomial-Dirichlet model proposed by Rosen et al. (2001), the data is expressed as counts and a hierarchical Bayesian model is fit using a Metropolis-within-Gibbs algorithm implemented in C. Level 1 models the observed column marginals as multinomial (and independent across units); the choice of the multinomial corresponds to sampling with replacement from the population. Level 2 models the unobserved row cell fractions as Dirichlet (and independent across rows and units); Level 3 models the Dirichlet parameters as i.i.d. Gamma. More formally, without a covariate, the model is


$$(N_{\cdot 1i}, \ldots, N_{\cdot Ci}) \overset{\perp\!\!\!\perp}{\sim} \mathrm{Multinomial}\Big(N_i,\ \sum_{r=1}^{R}\beta_{r1i}X_{ri},\ \ldots,\ \sum_{r=1}^{R}\beta_{rCi}X_{ri}\Big)$$

$$(\beta_{r1i}, \ldots, \beta_{rCi}) \overset{\perp\!\!\!\perp}{\sim} \mathrm{Dirichlet}(\alpha_{r1}, \ldots, \alpha_{rC})$$

$$\alpha_{rc} \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Gamma}(\lambda_1, \lambda_2)$$

With a unit-level covariate Z_i in the second level, the model becomes

$$(N_{\cdot 1i}, \ldots, N_{\cdot Ci}) \overset{\perp\!\!\!\perp}{\sim} \mathrm{Multinomial}\Big(N_i,\ \sum_{r=1}^{R}\beta_{r1i}X_{ri},\ \ldots,\ \sum_{r=1}^{R}\beta_{rCi}X_{ri}\Big)$$

$$(\beta_{r1i}, \ldots, \beta_{rCi}) \overset{\perp\!\!\!\perp}{\sim} \mathrm{Dirichlet}\big(d_r e^{(\gamma_{r1}+\delta_{r1}Z_i)},\ \ldots,\ d_r e^{(\gamma_{r(C-1)}+\delta_{r(C-1)}Z_i)},\ d_r\big)$$

$$d_r \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Gamma}(\lambda_1, \lambda_2)$$

In the model with a covariate, users have two options for the priors on γ_rc and δ_rc. They may assume an improper uniform prior, as was suggested by Rosen et al. (2001), or they may specify normal priors for each γ_rc and δ_rc as follows:

$$\gamma_{rc} \sim N(\mu_{\gamma_{rc}}, \sigma^2_{\gamma_{rc}}), \qquad \delta_{rc} \sim N(\mu_{\delta_{rc}}, \sigma^2_{\delta_{rc}}).$$

As Wakefield (2004) notes, the weak identification that characterizes hierarchical models in the EI context is likely to make the results sensitive to the choice of prior. Users should experiment with different assumptions about the prior distribution of the upper-level parameters in order to gauge the robustness of their inferences.

The parameterization of the prior on each (β_r1i, ..., β_rCi) implies that the following log-odds ratio of expected fractions is linear with respect to the covariate Z_i:

$$\log\!\left(\frac{E(\beta_{rci})}{E(\beta_{rCi})}\right) = \gamma_{rc} + \delta_{rc} Z_i.$$

Conducting an analysis using the MD model requires two steps. First, tuneMD calibrates the tuning parameters used for Metropolis-Hastings sampling:

> tune.nocov <- tuneMD(cbind(dem, rep, non)

+ ~ cbind(black, white, natam), data = senc,

+ ntunes = 10, totaldraws = 100000)

Second, ei.MD.bayes fits the model by calling C code to generate MCMC draws:

> out.nocov <- ei.MD.bayes(cbind(dem, rep, non)

+ ~ cbind(black, white, natam),

+ covariate = NULL, data = senc,

+ tune.list = tune.nocov)

The output of this function can be returned as mcmc objects or arrays; in the former case, the standard diagnostic tools in coda (Plummer et al., 2006) can be applied directly. The MD implementation includes lambda and density.plot functions, usage for which is analogous to ecological regression:

> lmd <- lambda.MD(out.nocov,

+ columns = c("dem", "rep"))

> density.plot(lmd)

If the precinct-level parameters are returned or saved, cover.plot plots the central credible intervals for each precinct. The segments represent the 95% central credible intervals and their medians for each unit (the true value for each precinct is the red dot, not included in the standard cover.plot).

> cover.plot(out.nocov, row = "white",

+ column = "dem")

# add true values to plot

> points(senc$white/senc$total,

+ senc$whdem/senc$white)

Figure 3: Coverage plot for MD model output (proportion of White Democrats against the proportion White in each precinct).

Data Management

In the MD model, reasonable-sized problems produce unreasonable amounts of data. For example, a model for voting in Ohio includes 11000 precincts, 3 racial groups, and 4 parties. Implementing 1000 iterations yields about 130 million parameter draws. These draws occupy about 1GB of RAM, and this is almost certainly not enough iterations. We provide a few options to users in order to make this model tractable for large EI problems.

The unit-level parameters present the most significant data management problem. Rather than storing unit-level parameters in the workspace, users can save each chain as a .tar.gz file on disk using the option ei.MD.bayes(..., ret.beta = "s"), or discard the unit-level draws entirely using ei.MD.bayes(..., ret.beta = "d"). To reconstruct the chains, users can select the row marginals, column marginals, and units of interest, without reconstructing the entire matrix of unit-level draws:

> read.betas(rows = c("black", "white"),

+ columns = "dem", units = 1:150,

+ dir = getwd())

If users are interested in some function of the unit-level parameters, the implementation of the MD model allows them to define a function in R that will be called from within the C sampling algorithm, in which case the unit-level parameters need not be saved for post-processing.

Acknowledgments

eiPack was developed with the support of the Institute for Quantitative Social Science at Harvard University. Thanks to John Fox, Gary King, Kevin Quinn, D. James Greiner, and an anonymous referee for suggestions and Matt Cox and Bob Kinney for technical advice. For further information, see http://www.olivialau.org/software.

Bibliography

W. T. Cho and A. H. Yoon. Strange bedfellows: Politics, courts and statistics: Statistical expert testimony in voting rights cases. Cornell Journal of Law and Public Policy, 10:237–264, 2001.

O. D. Duncan and B. Davis. An alternative to ecological correlation. American Sociological Review, 18:665–666, 1953.

L. Goodman. Ecological regressions and the behavior of individuals. American Sociological Review, 18:663–664, 1953.

B. Grofman. A primer on racial bloc voting analysis. In N. Persily, editor, The Real Y2K Problem: Census 2000 Data and Redistricting Technology. Brennan Center for Justice, New York, 2000.

K. Imai and Y. Lu. eco: R Package for Fitting Bayesian Models of Ecological Inference in 2x2 Tables, 2005. URL http://imai.princeton.edu/research/eco.html.

A. D. Martin and K. M. Quinn. Applied Bayesian inference in R using MCMCpack. R News, 6:2–7, 2006.

M. Plummer, N. Best, K. Cowles, and K. Vines. CODA: Convergence diagnostics and output analysis for MCMC. R News, 6:7–11, 2006.

O. Rosen, W. Jiang, G. King, and M. A. Tanner. Bayesian and frequentist inference for ecological inference: The R × C case. Statistica Neerlandica, 55(2):134–156, 2001.

J. Wakefield. Ecological inference for 2 × 2 tables (with discussion). Journal of the Royal Statistical Society, 167:385–445, 2004.

Olivia Lau, Ryan T. Moore, Michael Kellermann
Institute for Quantitative Social Science
Harvard University, Cambridge, MA
[email protected]
[email protected]
[email protected]

The ade4 Package — II: Two-table and K-table Methods

by Stéphane Dray, Anne B. Dufour and Daniel Chessel

Introduction

The ade4 package proposes a great variety of explanatory methods to analyse multivariate datasets. As suggested by the acronym ade4 (Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods), the package is devoted to ecologists but it could be useful in many other fields (e.g., Goecke, 2005). Methods available in the package are particular cases of the duality diagram (Escoufier, 1987; Holmes, 2006; Dray and Dufour, 2007) and the implementation of the functions follows the description of this unifying mathematical tool (class dudi). The main functions of the package for one-table analysis methods have been presented in Chessel et al. (2004). This new paper presents a short summary of two-table and K-table methods available in the package.

Ecological illustration

In order to illustrate the methods, we used the dataset jv73 (Verneaux, 1973) which is available in the package. This dataset concerns 12 rivers. For each river, a number of sites have been sampled. The number of sites per river is not constant. jv73$poi is a data.frame and contains presence/absence data for 19 fish species (columns) in 92 sites (rows). jv73$fac.riv is a factor indicating the river corresponding to each site. jv73$morpho contains the measurements of six environmental variables (altitude (m), distance between the site and the source (km), slope (per thousand), wetted cross section (m2), average flow (m3/s) and average speed (m/s)) for the same sites. Several ecological questions are related to these data:

1. Are there groups of fish species living together (i.e. species communities)?

2. Is there a relation between the composition of fish communities and the environmental variations?

3. Does the composition of fish communities vary (or not) among rivers?

4. Do the species-environment relationships vary (or not) among rivers?

Multivariate analyses help to answer these different questions: one-table methods for the first question, two-table methods for the second one and K-table methods for the last two.

Matching two tables

The main purpose of ecological data analysis is the matching of two data tables: a sites-by-environmental variables table and a sites-by-species table, to study the relationships between the composition of species communities and their environment. The ade4 package contains the main variants of these methods (procrustean rotation, co-inertia analysis and principal component analyses with respect to instrumental variables).

The first approach is procrustean rotation (Gower, 1971), introduced in ecology by Digby and Kempton (1987, p. 116).

data(jv73)

pca1 <- dudi.pca(jv73$morpho, scannf = FALSE)

pca2 <- dudi.pca(jv73$poi, scale = FALSE,

scannf = FALSE)

plot(procuste(pca1$tab, pca2$tab,

nf = 2))

Figure 1: Plot of a Procrustes analysis: loadings for environmental variables and species, eigenvalues screeplot, scores of sites for the two data sets, and projection of the two sets of sites after rotation (arrows link environment site score to the species site score) (Dray et al., 2003a).

Two randomization procedures are available to test the association between two tables: PROTEST (Jackson, 1995) and RV (Heo and Gabriel, 1998).

plot(procuste.randtest(pca1$tab,

pca2$tab), main = "PROTEST")

plot(RV.rtest(pca1$tab, pca2$tab),

main = "RV")

Figure 2: Plots of PROTEST and RV tests: histograms of simulated values and observed value (vertical line).

Co-inertia analysis (Dolédec and Chessel, 1994; Dray et al., 2003b) is a general approach that can be applied to any pair of duality diagrams having the same row weights. This method is symmetric and seeks a common structure between two datasets. It extends psychometricians' inter-battery analysis (Tucker, 1958), canonical analysis on qualitative variables (Cazes, 1980), and ecological profiles analysis (Montaña and Greig-Smith, 1990; Mercier et al., 1992). Co-inertia analysis of the pair of triplets (X1, Q1, D) and (X2, Q2, D) leads to the triplet (X2^t D X1, Q1, Q2). Note that the two triplets must have the same row weights. For a comprehensive definition of the statistical triplet of matrices X, Q, D, the reader could consult Chessel et al. (2004).

coa1 <- dudi.coa(jv73$poi, scannf = FALSE)

pca3 <- dudi.pca(jv73$morpho,

row.w = coa1$lw, scannf = F)

plot(coinertia(coa1, pca3, scannf = FALSE))

Figure 3: Plot of a co-inertia analysis: projection of the principal axes of the two tables (species and environment) on co-inertia axes, eigenvalues screeplot, canonical weights of species and environmental variables, and joint display of the sites.

For each coupling method, a generic plot function presents the various elements required to interpret the results. However, the quality of the graphs can vary according to the data set, and it is consequently impossible to produce graphical output that is equally relevant in all cases. That is why these generic plot functions use graphical functions of ade4 which can be directly called by the user. A brief description of some of these functions is given in Table 1.

Another two-table matching strategy is principal component analysis with respect to instrumental variables (pcaiv, Rao, 1964). This approach consists in explaining a triplet (X2, Q2, D) by a table of independent variables X1 and leads to the triplet (P_{X1} X2, Q2, D) where P_{X1} = X1 (X1^t D X1)^- X1^t D. This family of methods corresponds to constrained ordinations, among which redundancy analysis (van den Wollenberg, 1977) and canonical correspondence analysis (Ter Braak, 1986) are the most frequently used in ecology. Note that canonical correspondence analysis can also be performed using the cca wrapper function which takes two tables as arguments. The example given below is then exactly equivalent to plot(cca(jv73$poi, jv73$morpho, scannf = FALSE)). While the cca function of ade4 is a particular case of pcaiv, the cca function of the package vegan is a more traditional implementation of the method which could be preferred by ecologists.

Function       Objectives
s.arrow        cloud of points with vectors
s.chull        cloud of points with groups by convex hulls
s.class        cloud of points with groups by stars or ellipses
s.corcircle    correlation circle
s.distri       cloud of points with frequency distribution by stars and ellipses
s.hist         cloud of points with two marginal histograms
s.image        grid of gray-scale rectangles with contour lines
s.kde2d        cloud of points with kernel density estimation
s.label        cloud of points with labels
s.logo         cloud of points with pictures
s.match        matching two clouds of points with vectors
s.traject      cloud of points with trajectories
s.value        cloud of points with numerical variable

Table 1: Objectives of some graphical functions.
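These graphical functions operate directly on data frames of coordinates, so they can be combined freely with the scores stored in dudi objects. The following short call is an illustration of ours (it is not part of the original example) and uses the correspondence analysis coa1 defined above:

s.label(coa1$li)                       # sites plotted with their labels
s.class(coa1$li, fac = jv73$fac.riv)   # sites grouped by river, with ellipses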

plot(pcaiv(coa1, jv73$morpho, scannf = FALSE))

Figure 4: Plot of a CCA seen as a particular case of PCAIV: environmental variables loadings and correlations with CCA axes, projection of principal axes on CCA axes, species scores, eigenvalues screeplot, and joint display of the rows of the two tables (position of the sites by averaging (points) and by regression (arrow tips)).


Orthogonal analysis (pcaivortho) removes the effect of independent variables and corresponds to the triplet (P⊥_{X1} X2, Q2, D) where P⊥_{X1} = I − P_{X1}. Between-class (between) and within-class (within) analyses (see Chessel et al., 2004, for details) are particular cases of PCAIV and orthogonal PCAIV when there is only one categorical variable (i.e. factor) in X1. Within-class analyses take into account a partition of individuals into groups and focus on structures which are common to all groups. They can be seen as a first step to K-table methods.

wit1 <- within(coa1, fac = jv73$fac.riv,

scannf = FALSE)

plot(wit1)

Figure 5: Plot of a within-class analysis: species loadings, species scores, eigenvalues screeplot, projection of principal axes on within-class axes, sites scores (common centring), projections of sites and groups (i.e. rivers in this example) on within-class axes.

The K-table class

Class ktab corresponds to collections of more than two duality diagrams, for which the internal structures are to be compared. Three formats of these collections can be considered:

• (X1, Q1, D), (X2, Q2, D), ..., (XK, QK, D)

• (X1, Q, D1), (X2, Q, D2), ..., (XK, Q, DK), stored in the form of (X1^t, D1, Q), (X2^t, D2, Q), ..., (XK^t, DK, Q)

• (X1, Q, D), (X2, Q, D), ..., (XK, Q, D), which can also be stored in the form of (X1^t, D, Q), (X2^t, D, Q), ..., (XK^t, D, Q)

Each statistical triplet corresponds to a separate analysis (e.g., principal component analysis, correspondence analysis, ...). The common dimension of the K statistical triplets is the rows of the tables, which can represent individuals (samples, statistical units) or variables. Utilities for building and manipulating ktab objects are available. K-tables can be constructed from a list of tables (ktab.list.df), a list of dudi objects (ktab.list.dudi), a within-class analysis (ktab.within) or by splitting a table (ktab.data.frame). Generic functions to transpose (t.ktab), combine (c.ktab) or extract elements ([.ktab) are also available. The sepan function can be used to compute automatically the K separate analyses.

kt1 <- ktab.within(wit1)

sep1 <- sepan(kt1)

kplot.sepan.coa(sep1, permute.row.col = TRUE)

Figure 6: Kplot of 12 separate correspondence analyses (same species, different sites).
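As an additional illustration of the construction utilities listed above (this sketch is ours and not part of the original example), a ktab can also be built from a list of dudi objects that share the same rows and row weights, here the environmental and faunistic analyses of the 92 sites computed earlier:

kt2 <- ktab.list.dudi(list(pca1, pca2))   # environment and species analyses
names(kt2)
sep2 <- sepan(kt2)                        # the two separate analyses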

When the ktab object is built, various statistical methods can be used to analyse it. The foucart function can be used to analyse K tables of positive numbers having the same rows and the same columns and that can be analysed by a CA (Foucart, 1984; Pavoine et al., 2007). Partial triadic analysis (Tucker, 1966) is a first step toward three-mode principal component analysis (Kroonenberg, 1989) and can be computed with the pta function. It must be used on K triplets having the same row and column weights. The pta function can be used to perform the STATICO method (Simier et al., 1999; Thioulouse et al., 2004). This makes it possible to analyse a pair of ktab objects which have been combined by the ktab.match2ktabs function.

Multiple factor analysis (mfa, Escofier and Pagès, 1994), multiple co-inertia analysis (mcoa, Chessel and Hanafi, 1996) and the STATIS method (statis, Lavit et al., 1994) can be used to compare K triplets having the same row weights. The STATIS method can also be used to compare K triplets having the same column weights, which is a first step toward Common PCA (Flury, 1988).

sta1 <- statis(kt1, scannf = F)

plot(sta1)

Figure 7: Plot of STATIS analysis: interstructure, typological value of each table, compromise and projection of principal axes of separate analyses onto STATIS axes.

The kplot generic function is associated to the foucart, mcoa, mca, pta, sepan, sepan.coa and statis methods, giving adapted collections of graphics.

kplot(sta1, traj = TRUE, arrow = FALSE,

unique = TRUE, clab = 0)

Conclusion

The ade4 package provides many methods to analyse multivariate ecological data sets. This diversity of tools is a methodological answer to the great variety of questions and data structures associated with biological problems. Specific methods dedicated to the analysis of biodiversity, spatial, genetic or phylogenetic data are also available in the package. The adehabitat companion package contains tools to analyse habitat selection by animals, while the ade4TkGUI package provides a graphical interface to ade4. More resources can be found on the ade4 website (http://pbil.univ-lyon1.fr/ADE-4/).

Figure 8: Kplot of the projection of the sites of each table on the principal axes of the compromise of the STATIS analysis.

Bibliography

P. Cazes. L'analyse de certains tableaux rectangulaires décomposés en blocs : généralisation des propriétés rencontrées dans l'étude des correspondances multiples. I. Définitions et applications à l'analyse canonique des variables qualitatives. Les Cahiers de l'Analyse des Données, 5:145–161, 1980.

D. Chessel and M. Hanafi. Analyse de la co-inertie de K nuages de points. Revue de Statistique Appliquée, 44(2):35–60, 1996.

D. Chessel, A.-B. Dufour, and J. Thioulouse. The ade4 package - I: One-table methods. R News, 4:5–10, 2004.

P. G. N. Digby and R. A. Kempton. Multivariate Analysis of Ecological Communities. Chapman and Hall, Population and Community Biology Series, London, 1987.

S. Dolédec and D. Chessel. Co-inertia analysis: an alternative method for studying species-environment relationships. Freshwater Biology, 31:277–294, 1994.

S. Dray, D. Chessel, and J. Thioulouse. Procrustean co-inertia analysis for the linking of multivariate datasets. Ecoscience, 10:110–119, 2003a.

S. Dray, D. Chessel, and J. Thioulouse. Co-inertia analysis and the linking of ecological tables. Ecology, 84(11):3078–3089, 2003b.

S. Dray and A. Dufour. The ade4 package: implementing the duality diagram for ecologists. Journal of Statistical Software, 22(4):1–20, 2007.

B. Escofier and J. Pagès. Multiple factor analysis (AFMULT package). Computational Statistics and Data Analysis, 18:121–140, 1994.

Y. Escoufier. The duality diagram: a means of better practical applications. In P. Legendre and L. Legendre, editors, Development in numerical ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin, 1987.

B. Flury. Common Principal Components and Related Multivariate Models. Wiley and Sons, New York, 1988.

T. Foucart. Analyse factorielle de tableaux multiples. Masson, Paris, 1984.

R. Goecke. 3D lip tracking and co-inertia analysis for improved robustness of audio-video automatic speech recognition. In Proceedings of the Auditory-Visual Speech Processing Workshop AVSP 2005, pages 109–114, 2005.

J. Gower. Statistical methods of comparing different multivariate analyses of the same data. In F. Hodson, D. Kendall, and P. Tautu, editors, Mathematics in the archaeological and historical sciences, pages 138–149. University Press, Edinburgh, 1971.

M. Heo and K. Gabriel. A permutation test of association between configurations by means of the RV coefficient. Communications in Statistics - Simulation and Computation, 27:843–856, 1998.

S. Holmes. Multivariate analysis: The French way. In N. D. and S. T., editors, Festschrift for David Freedman. IMS, Beachwood, OH, 2006.

D. Jackson. PROTEST: a PROcrustean randomization TEST of community environment concordance. Ecoscience, 2:297–303, 1995.

P. Kroonenberg. The analysis of multiple tables in factorial ecology. III. Three-mode principal component analysis: "analyse triadique complète". Acta OEcologica, OEcologia Generalis, 10:245–256, 1989.

C. Lavit, Y. Escoufier, R. Sabatier, and P. Traissac. The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97–119, 1994.

P. Mercier, D. Chessel, and S. Dolédec. Complete correspondence analysis of an ecological profile data table: a central ordination method. Acta OEcologica, 13:25–44, 1992.

C. Montaña and P. Greig-Smith. Correspondence analysis of species by environmental variable matrices. Journal of Vegetation Science, 1:453–460, 1990.

S. Pavoine, J. Blondel, M. Baguette, and D. Chessel. A new technique for ordering asymmetrical three-dimensional data sets in ecology. Ecology, 88:512–523, 2007.

C. Rao. The use and interpretation of principal component analysis in applied research. Sankhya A, 26:329–359, 1964.

M. Simier, L. Blanc, F. Pellegrin, and D. Nandris. Approche simultanée de K couples de tableaux : application à l'étude des relations pathologie végétale - environnement. Revue de Statistique Appliquée, 47:31–46, 1999.

C. Ter Braak. Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology, 67:1167–1179, 1986.

J. Thioulouse, M. Simier, and D. Chessel. Simultaneous analysis of a sequence of pairs of ecological tables with the STATICO method. Ecology, 85:272–283, 2004.

L. Tucker. An inter-battery method of factor analysis. Psychometrika, 23:111–136, 1958.

L. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31:279–311, 1966.

A. van den Wollenberg. Redundancy analysis, an alternative for canonical analysis. Psychometrika, 42(2):207–219, 1977.

J. Verneaux. Cours d'eau de Franche-Comté (Massif du Jura). Recherches écologiques sur le réseau hydrographique du Doubs. Essai de biotypologie. Thèse de doctorat, Université de Besançon, Besançon, 1973.

Stéphane Dray, Anne-Béatrice Dufour, Daniel Chessel
Laboratoire de Biométrie et Biologie Evolutive (UMR 5558); CNRS
Université de Lyon; Université Lyon 1
43, Boulevard du 11 Novembre 1918
69622 Villeurbanne Cedex, France
[email protected]
[email protected]
[email protected]


Review of "The R Book"

Michael J. Crawley, Wiley, 2007

by Friedrich Leisch

The back cover of this physically impressive 1000-page volume advertises it as ". . . the first comprehensive reference manual for the R language. . . " which ". . . introduces all the statistical models covered by R. . . ". Considering (a) that the R Core Team considers its own language manual (R Development Core Team, 2007b) a draft, and only the source code the ultimate reference in more cases than we like, and (b) the multitude of models implemented by R packages on CRAN or Bioconductor, I thought I would be in for an interesting read.

The book has 27 chapters. Chapters 1–8 give an introduction to R, starting with where to obtain and how to install the software, describing the language, data input and data manipulation, graphics, tables, mathematical calculations, and classical statistical tests. Chapters 9–20 on statistical modelling form the main part of the book with a detailed coverage of the linear regression model and its extensions like GLMs, GAMs, mixed effects and non-linear least squares. Chapters 20–27 show "other" topics like trees, time series, multivariate and spatial statistics, survival analysis, using R for simulation models and low-level graphics commands.

The preface states that the book is "aimed at beginners and intermediate users" and can be used "as a text . . . as well as a reference manual". I find that the book in its present form is not optimal for either purpose. The first section on "getting started" has a few minor problems, like using many functions without quoted character arguments. In some cases this is a matter of style (like library(foo) or help(foo)), but some instances simply do not work (find(foo), apropos(foo)). Packages are often called libraries (which is a different thing), input lines can be longer than 128 characters (the current limit is 8192), and recommending MS Word as a source code editor is at least debatable even for Windows users. I personally find the R code throughout the book hard to read: it is typeset in a proportional font, uses no spaces around the assignment operator <-, no line indentation for nested code blocks, and path names sometimes contain erroneous spaces, especially after backslashes.

The chapter on "essentials of the R language" gives an introduction to the language and many non-statistical functions like string processing and regular expressions. What I found very confusing is the lack of clear structure. The book uses only one level of numbering (chapters), and this chapter is 100 pages long. E.g., on page 47 there are two headings: "The match function" and "Writing functions in R". Both seem to have the same font size and hence are on an equal level. However, as is to be expected given the two topics, the section on the match function is 2 paragraphs long, while how to write functions takes the next 20 pages, with many intermezzos and headings in two different sizes. The author also jumps around a lot, many concepts are discussed or introduced as a side note for a different theme, and it is often unclear where examples end. E.g., how formal and actual arguments are matched in a function call is the first paragraph in the section on "Saving data to disc". All of this will be confusing for beginners and makes it hard to use the book as a reference manual.

In the chapter on mathematics a dozen pages is used on introducing the OLS estimate (a typo appears in several equations: β = X′X − 1X′y instead of the correct β = (X′X)⁻¹X′y), including a stepwise implementation of (the correct version of) this formula. Although the next page in the book starts with solving linear equations via solve(), it is not even mentioned that it is numerically not the best idea to compute regression coefficients using the formula above.
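To illustrate the numerical point for readers (this sketch is ours, not taken from the book or the review), the "textbook" formula can be contrasted with the normal-equations and QR approaches:

set.seed(1)
X <- cbind(1, matrix(rnorm(200), 100, 2))            # design matrix with intercept
y <- X %*% c(1, 2, -1) + rnorm(100)
beta.naive <- solve(t(X) %*% X) %*% t(X) %*% y       # explicit (X'X)^{-1} X'y
beta.solve <- solve(crossprod(X), crossprod(X, y))   # solve the normal equations
beta.qr    <- qr.coef(qr(X), y)                      # QR decomposition, as lm() uses

For well-conditioned data all three agree, but the explicit inverse is the least stable choice when X is nearly collinear.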

The quality of the book increases considerably in the chapters on statistical modelling. A minor drawback is that it sometimes gives the impression that linear models are the only ones available; even discriminant analysis is not considered a model, because response variables cannot be multi-level categorical according to the cookbook recipe on page 324. However, there is a nice general introduction to statistical modelling and model selection, and linear modelling is covered in depth with many examples.

Once the author leaves the territory of linear models (and their extensions), quality decreases again. The chapter on trees uses package tree, although even the author of tree recommends using package rpart (Venables and Ripley, 2002, p. 266). The chapter on multivariate statistics basically recommends not doing multivariate statistics at all, because one is too likely to shoot oneself into the foot.

In summary, the book fails to meet the high expectations that the title and cover texts raise. In this review I could list only a selection of problems I found, and of course there are good things too, like the detailed explanation on how to enter data into a spreadsheet to form a proper data frame. The book is definitely not a reference manual for the R system or R language, but a book on applied linear modelling with (many pages of) lower-quality additional material to give the impression of universal coverage. There are better introductory books for beginners, and Venables and Ripley (2002) is still "the R book" when it comes to a reference text for applied statistics.

A (symptomatic) end note: The author gives detailed credit to the R core team and wider R community in the acknowledgements (thanks!). On page one he recommends the citation() function to users to give credit to developers (yes!), however he seems not to have used the function too often, because R Development Core Team (2007a,b) and many others are missing from the references, which cover only 4 of 1000 pages.

Bibliography

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2007a. URL http://www.R-project.org. ISBN 3-900051-07-0.

R Development Core Team. R Language Definition. R Foundation for Statistical Computing, Vienna, Austria, 2007b. URL http://www.R-project.org. ISBN 3-900051-13-5.

W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Fourth Edition. Springer, 2002. URL http://www.stats.ox.ac.uk/pub/MASS4/. ISBN 0-387-95457-0.

Friedrich Leisch
Ludwig-Maximilians-Universität München, Germany
[email protected]

Changes in R 2.6.0

by the R Core Team

User-visible changes

• integrate(), nlm(), nlminb(), optim(), optimize() and uniroot() now have ... much earlier in their argument list. This reduces the chances of unintentional partial matching but means that the later arguments must be named in full.

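An illustrative consequence of the new argument order (this example is not part of the NEWS entry): optional arguments that follow ... now have to be given by name.

    f <- function(x) exp(-x^2)
    integrate(f, lower = 0, upper = Inf, rel.tol = 1e-8)   # rel.tol must be named in full
    optimize(f, interval = c(-2, 2), maximum = TRUE)       # maximum must be named in full
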
• The default type for nchar() is now "chars". This is almost always what was intended, and differs from the previous default only for non-ASCII strings in an MBCS locale. There is a new argument allowNA, and the default behaviour is now to throw an error on an invalid multibyte string if type = "chars" or type = "width".

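A small illustration of the distinction (assuming a UTF-8 locale):

    x <- "caf\u00e9"
    nchar(x)                   # 4: number of characters (the new default, type = "chars")
    nchar(x, type = "bytes")   # 5: number of bytes in a UTF-8 locale
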
• Connections will be closed if there is no R object referring to them. A warning is issued if this is done, either at garbage collection or if all the connection slots are in use.

New features

• abs(), sign(), sqrt(), floor(), ceiling(), exp() and the gamma, trig and hyperbolic trig functions now only accept one argument even when dispatching to a Math group method (which may accept more than one argument for other group members).

• abbreviate() gains a method argument with a new option "both.sides" which can make shorter abbreviations.

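For example (illustrative only; the second call may abbreviate more aggressively by dropping characters from both ends of words):

    abbreviate("internationalization", minlength = 6)                         # default "left.kept"
    abbreviate("internationalization", minlength = 6, method = "both.sides")
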
• aggregate.data.frame() no longer changes the group variables into factors, and leaves alone the levels of those which are factors. (Inter alia, grants the wish of PR#9666.)

• The default max.names in all.names() and all.vars() is now -1, which means unlimited. This fixes PR#9873.

• as.vector() and the default methods of as.character(), as.complex(), as.double(), as.expression(), as.integer(), as.logical() and as.raw() no longer duplicate in most cases where the object is unchanged. (Beware: some code has been written that invalidly assumes that they do duplicate, often when using .C/.Fortran(DUP = FALSE).)

• as.complex(), as.double(), as.integer(), as.logical() and as.raw() are now primitive and internally generic for efficiency. They no longer dispatch on S3 methods for as.vector() (which was never documented). as.real() and as.numeric() remain as alternative names for as.double().

expm1(), log(), log1p(), log2(), log10(), gamma(), lgamma(), digamma() and trigamma() are now primitive. (Note that logb() is not.)

The Math2 and Summary groups (round, signif, all, any, max, min, sum, prod, range) are now primitive.

See under Section “methods Package” below for some consequences for S4 methods.

• apropos() now sorts by name and not by position on the search path.


• attr() gains an exact = TRUE argument to disable partial matching.

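A sketch of the new behaviour (example made up for illustration):

    x <- 1
    attr(x, "description") <- "toy example"
    attr(x, "des")                 # partial matching still finds the attribute
    attr(x, "des", exact = TRUE)   # NULL: partial matching disabled
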
• bxp() now allows xlim to be specified. (PR#9754)

• C(f, SAS) now works in the same way as C(f, treatment), etc.

• chol() is now generic.

• dev2bitmap() has a new option to go via PDF and so allow semi-transparent colours to be used.

• dev.interactive() regards devices with the display list enabled as interactive, and packages can register the names of their devices as interactive via deviceIsInteractive().

• download.packages() and available.packages() (and functions which use them) now support in repos or contriburl either ‘file:’ plus a general path (including drives on a UNC path on Windows) or a ‘file:///’ URL in the same way as url().

• dQuote() and sQuote() are more flexible, with rendering controlled by the new option useFancyQuotes. This includes the ability to have TeX-style rendering and directional quotes (the so-called “smart quotes”) on Windows. The default is to use directional quotes in UTF-8 locales (as before) and in the Rgui console on Windows (new).

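A short illustration of the new option (not from the NEWS file):

    options(useFancyQuotes = FALSE)
    sQuote("text")                  # plain ASCII quotes
    options(useFancyQuotes = "TeX")
    dQuote("text")                  # TeX-style ``text''
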
• duplicated() and unique() and their methods in base gain an additional argument fromLast.

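For example:

    x <- c(1, 2, 1, 3, 2)
    duplicated(x)                   # FALSE FALSE  TRUE FALSE  TRUE
    duplicated(x, fromLast = TRUE)  #  TRUE  TRUE FALSE FALSE FALSE
    unique(x, fromLast = TRUE)      # 1 3 2: keeps the last occurrence of each value
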
• fifo() no longer has a default description argument.

fifo("") is now implemented, and works in the same way as file("").

• file.edit() and file.show() now tilde-expand file paths on all interfaces (they used to on some and not others).

• The find() argument is now named numeric and not numeric.: the latter was needed to avoid warnings about name clashes many years ago, but partial matching was used.

• stats:::.getXlevels() confines attention to factors since some users expected R to treat unclass(a_factor) as a numeric vector.

• grep(), strsplit() and friends now warn if incompatible sets of options are used, instead of silently using the documented priority.

• gsub()/sub() with perl = TRUE now preserves attributes from the argument x on the result.

• is.finite() and is.infinite() are now S3 and S4 generic.

• jpeg(), png(), bmp() (Windows), dev2bitmap() and bitmap() have a new argument units to specify the units of width and height.

• levels() is now generic (levels<-() has been for a long time).

• Loading serialized raw objects with load() is now considerably faster.

• New primitive nzchar() as a faster alternative to nchar(x) > 0 (and avoids having to convert to wide chars in an MBCS locale and hence consider validity).

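For example:

    x <- c("alpha", "", "b")
    nzchar(x)        # TRUE FALSE TRUE
    nchar(x) > 0     # same result, computed more slowly
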
• The way old.packages() and hence update.packages() handle packages with different versions in multiple package repositories has been changed. Previously the first package encountered was selected; now it is the one with the highest version number.

• optim(method = "L-BFGS-B") now accepts zero-length parameters, like the other methods. Also, method = "SANN" no longer attempts to optimize in this case.

• New options showWarnCalls and showErrorCalls to give a concise traceback on warnings and errors. showErrorCalls = TRUE is the default for non-interactive sessions. Option showNCalls controls how abbreviated the call sequence is.

• New options warnPartialMatchDollar, warnPartialMatchArgs and warnPartialMatchAttr to help detect the unintended use of partial matching in $, argument matching and attr() respectively.

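A minimal sketch of turning the new checks on (example data made up):

    options(warnPartialMatchDollar = TRUE,
            warnPartialMatchArgs   = TRUE,
            warnPartialMatchAttr   = TRUE)
    d <- list(value = 1)
    d$val    # still returns 1, but now signals a warning about partial matching
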
• A device named as a character string in options(device =) is now looked for in the grDevices name space if it is not visible from the global environment.

• pmatch(x, y, duplicates.ok = TRUE) now uses hashing and so is much faster for large x and y when most matches are exact.

• qr() is now generic.

• It is now a warning to have a non-integer object for .Random.seed: this indicates a user had been playing with it, and it has always been documented that users should only save and restore it.

• New higher-order functions Reduce(), Filter() and Map().

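Brief illustrations of the three new functions:

    Reduce(`+`, 1:5)                       # 15: fold the vector from the left
    Reduce(`+`, 1:5, accumulate = TRUE)    # 1 3 6 10 15: running sums
    Filter(function(x) x %% 2 == 0, 1:10)  # keep elements where the predicate is TRUE
    Map(`+`, 1:3, 10 * (1:3))              # list(11, 22, 33): element-wise mapping
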

• regexpr() and gregexpr() gain an ignore.case argument for consistency with grep(). (This does change the positional matching of arguments, but no instances of positional matching beyond the second were found.)

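For example:

    regexpr("na", "BANANA", ignore.case = TRUE)   # first match starts at position 3
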
• relist() utility, an S3 generic with several methods, providing an inverse for unlist(); thanks to a code proposal from Andrew Clausen.

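A small example of the round trip:

    skel <- list(a = 1:3, b = c(x = 4, y = 5))
    flat <- unlist(skel)            # flatten to a named numeric vector
    relist(flat, skeleton = skel)   # restore the original list structure
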
• require() now returns invisibly.

• The interface to reshape() has been revised, allowing some simplified forms that did not work before, and somewhat improved error handling. A new argument sep has been introduced to replace simple usages of split (the old features are retained).

• rmultinom() uses a high-precision accumulator where available, and so is more likely to give the same result on different platforms (although it is still possible to get different results, and the result may differ from previous versions of R).

• row() and col() now work on matrix-like objects such as data frames, not just matrices.

• Rprof() allows smaller values of interval on machines that support it: for example, modern Linux systems support interval = 0.001.

• sample() now requires its first argument x to be numeric (in the sense of is.numeric()) as well as of length 1 and ≥ 1 before it is regarded as shorthand for 1:x.

• sessionInfo() now provides details about package name spaces that are loaded but not attached. The output of sessionInfo() has been improved to make it easier to read when it is inadvertently wrapped after being pasted into an email message.

• setRepositories() has a new argument ind to allow selections to be made programmatically.

• showMethods() has a “smart” default for inherited such that showMethods(genfun, incl = TRUE) becomes a useful short cut.

• sprintf() no longer has an output string length limit.

• storage.mode<-() is now primitive, and hence makes fewer copies of an object (none if the mode is unchanged). It is a little less general than mode<-(), which remains available. (See also the entry under Deprecated & defunct below.)

• sweep() gains an argument check.margin = TRUE which warns about mismatched dimensions.

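A sketch of the new check (made-up data):

    m <- matrix(1:6, nrow = 2)
    sweep(m, 2, colMeans(m))                    # centre the columns, as before
    sweep(m, 2, c(1, 2), check.margin = TRUE)   # STATS has the wrong length: the new check warns
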
• The mathematical annotation facility (plotmath) now recognises a symbol() function which forces the font to be a symbol font. This allows access to all characters in the Adobe Symbol encoding within plotmath expressions.

• For OSes that cannot unset environment variables, Sys.unsetenv() sets the value to "", with a warning.

• New function Sys.which(), an interface to which on Unix-alikes and an emulation on Windows.

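For example (results depend on what is installed on the local PATH):

    Sys.which("make")            # full path to the executable, or "" if it is not found
    Sys.which(c("sh", "latex"))  # vectorized over program names
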
• On Unix-alikes, system(, intern = TRUE) reports on very long lines that may be truncated, giving the line number of the content being read.

• termplot() has a default for ask that uses dev.interactive().

It allows ylim to be set, or computed to cover all the plots to be made (the new default) or computed for each plot (the previous default).

• uniroot(f, *) is slightly faster for non-trivial f() because it computes f(lower) and f(upper) only once, and it has new optional arguments f.lower and f.upper by which the caller can pass these.

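A minimal sketch of the new arguments:

    f <- function(x) x^2 - 2
    # pass precomputed endpoint values to save two extra evaluations of f():
    uniroot(f, lower = 0, upper = 2, f.lower = f(0), f.upper = f(2))
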
• unlink() is now internal, using common POSIX code on all platforms.

• unsplit() now works with lists of data frames.

• The vcov() methods for classes "gls" and "nlme" have migrated to package nlme.

• vignette() has a new argument all to choose between showing vignettes in attached packages or in all installed packages.

• New function within(), which is like with(), except that it returns a modified version of the list or data frame rather than the value of the expression.

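For example, using the built-in airquality data set:

    # with() returns the value of the expression; within() returns the modified data frame:
    aq <- within(airquality, {
        Celsius <- (Temp - 32) * 5/9    # add a derived column
    })
    head(aq$Celsius)
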
• X11(), postscript() (and hence bitmap()), xfig(), jpeg(), png() and the Windows devices win.print(), win.metafile() and bmp() now warn (once at first use) if semi-transparent colours are used (rather than silently treating them as fully transparent).

• New function xspline() to provide base graphics support of X-splines (cf. grid.xspline()).

• New function xyTable() does the 2D gridding “computations” used by sunflowerplot().


• Rd conversion to HTML and CHM now makes use of classes, which are set in the stylesheets. Editing ‘R.css’ will change the styles used for \env, \option, \pkg etc. (CHM styles are set at compilation time.)

• The documented arguments of %*% have been changed to be x and y, to match S and the implicit S4 generic.

• If members of the Ops group (the arithmetic, logical and comparison operators) and %*% are called as functions, e.g., `>`(x, y), positional matching is always used. (It used to be the case that positional matching was used for the default methods, but names would be matched for S3 and S4 methods, and in the case of ! the argument name differed between S3 and S4 methods.)

• Imports environments of name spaces are named (as "imports:foo"), and so are known e.g. to environmentName().

• Package stats4 uses lazy-loading, not SaveImage (which is now deprecated).

• Installing help for a package now parses the ‘.Rd’ file only once, rather than once for each type.

• PCRE has been updated to version 7.2.

• bzip2 has been updated to version 1.0.4.

• gettext has been updated to version 0.16.1.

• There is now a global CHARSXP cache, R_StringHash. CHARSXPs are no longer duplicated and must not be modified in place. Developers should strive to only use mkChar (and mkString) for creating new CHARSXPs and avoid use of allocString. A new macro, CallocCharBuf, can be used to obtain a temporary char buffer for manipulating character data. This patch was written by Seth Falcon.

• The internal equivalents of as.complex(), as.double(), as.integer() and as.logical() used to handle length-1 arguments now accept character strings (rather than report that this is “unimplemented”).

• Lazy-loading a package is now substantially more efficient (in memory saved and load time).

• Various performance improvements lead to a 45% reduction in the startup time without methods (and one-sixth with; methods now takes 75% of the startup time of a default session).

• The [[ subsetting operator now has an argument exact that allows programmers to disable partial matching (which will in due course become the default). The default value is exact = NA, which causes a warning to be issued when partial matching occurs. When exact = TRUE, no partial matching will be performed. When exact = FALSE, partial matching can occur and no warning will be issued. This patch was written by Seth Falcon.

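A sketch of the three settings under the semantics described above:

    l <- list(alpha = 1)
    l[["al", exact = FALSE]]   # 1: silent partial match
    l[["al"]]                  # 1, with a warning under the exact = NA default
    l[["al", exact = TRUE]]    # NULL: no partial matching
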
• Many of the C-level warning/error messages (e.g., from subscripting) have been re-worked to give more detailed information on either the location or the cause of the problem.

• The S3 and S4 Math groups have been harmonized. Functions log1p(), expm1(), log10() and log2() are members of the S3 group, and sign(), log1p(), expm1(), log2(), cummax(), cummin(), digamma(), trigamma() and trunc() are members of the S4 group. gammaCody() is no longer in the S3 group. They are now all primitive.

• The initialization of the random-number stream makes use of the sub-second part of the current time where available.

Initialization of the 1997 Knuth TAOCP generator is now done in R code, avoiding some C code whose licence status has been questioned.

• The reporting of syntax errors has been made more user-friendly.

methods Package

• Packages using methods have to have been installed in R 2.4.0 or later (when various internal representations were changed).

• Internally generic primitives no longer dispatch S4 methods on S3 objects.

• load() and restoring a workspace attempt to detect and warn on the loading of pre-2.4.0 S4 objects.

• Making functions primitive changes the semantics of S4 dispatch: these no longer dispatch on classes based on types but do dispatch whenever the function in the base name space is called.

This applies to as.complex(), as.integer(), as.logical(), as.numeric(), as.raw(), expm1(), log(), log1p(), log2(), log10(), gamma(), lgamma(), digamma() and trigamma(), as well as the Math2 and Summary groups.


Because all members of the group generics are now primitive, they are all S4 generic and setting an S4 group generic does at last apply to all members and not just those already made S4 generic.

as.double() and as.real() are identical to as.numeric(), and now remain so even if S4 methods are set on any of them. Since as.numeric is the traditional name used in S4, currently methods must be exported from a ‘NAMESPACE’ for as.numeric only.

• The S4 generic for ! has been changed to have signature (x) (was (e1)) to match the documentation and the S3 generic. setMethod() will fix up methods defined for (e1), with a warning.

• The "structure" S4 class now has methods that implement the concept of structures as described in the Blue Book: element-by-element functions and operators leave structure intact unless they change the length. The informal behavior of R for vectors with attributes was inconsistent.

• The implicitGeneric() function and relatives have been added to specify how a function in a package should look when methods are defined for it. This will be used to ensure that generic versions of functions in R core are consistent. See ?implicitGeneric.

• Error messages generated by some of the functions in the methods package provide the name of the generic to provide more contextual information.

• It is now possible to use setGeneric(useAsDefault = FALSE) to define a new generic with the name of a primitive function (but having no connection with the primitive).

Deprecated & defunct

• $ on an atomic vector now gives a warning that it is “invalid”. It remains deprecated, but may be removed in R ≥ 2.7.0.

• storage.mode(x) <- "real" and storage.mode(x) <- "single" are defunct: use instead storage.mode(x) <- "double" and mode(x) <- "single".

• In package installation, ‘SaveImage: yes’ is deprecated in favour of ‘LazyLoad: yes’.

• seemsS4Object (methods package) is deprecated in favour of isS4().

• It is planned that [[exact = TRUE]] will become the default in R 2.7.0.

Utilities

• checkS3methods() (invoked by R CMD check) now checks the arguments of methods for primitive members of the S3 group generics.

• R CMD check now does a recursive copy on the ‘tests’ directory.

• R CMD check now warns on non-ASCII ‘.Rd’ files without an \encoding field, rather than just on ones that are definitely not from an ISO-8859 encoding. This agrees with the long-standing stipulation in “Writing R Extensions”, and catches some packages with UTF-8 man pages.

• R CMD check now warns on DESCRIPTION files with a non-portable Encoding field, or with non-ASCII data and no Encoding field.

• R CMD check now loads all the Suggests and Enhances dependencies to reduce warnings about non-visible objects, and also emulates standard functions (such as shell()) on alternative R platforms.

• R CMD check now (by default) attempts to latex the vignettes rather than just weave and tangle them: this will give a NOTE if there are latex errors.

• R CMD check computations no longer ignore Rd \usage entries for functions for extracting or replacing parts of an object, so S3 methods should use the appropriate \method{} markup.

• R CMD check now checks for CR (as well as CRLF) line endings in C/C++/Fortran source files, and for non-LF line endings in ‘Makefile[.in]’ and ‘Makevars[.in]’ in the package ‘src’ directory. R CMD build will correct non-LF line endings in source files and in the make files mentioned.

• Rdconv now warns about unmatched braces rather than silently omitting sections containing them. (Suggestion by Bill Dunlap, PR#9649)

Rdconv now renders (rather than ignores) \var{} inside \code{} markup in LaTeX conversion.

R CMD Rdconv gains a ‘--encoding’ argument to set the default encoding for conversions.

• The list of CRAN mirrors now has a new (manually maintained) column "OK" which flags mirrors that seem to be OK; only those are used by chooseCRANmirror(). The now exported function getCRANmirrors() can be used to get all known mirrors or only the ones that are OK.

• R CMD SHLIB gains arguments ‘--clean’ and ‘--preclean’ to clean up intermediate files after and before building.

• R CMD config now knows about FC and FCFLAGS (used for F9x compilation).

• R CMD Rdconv now does a better job of rendering quotes in titles in HTML, and \sQuote and \dQuote into text on Windows.

C-level facilities

• New utility function alloc3DArray similar to allocMatrix.

• The entry point R_seemsS4Object in ‘Rinternals.h’ has not been needed since R 2.4.0 and has been removed. Use IS_S4_OBJECT instead.

• Applications embedding R can use R_getEmbeddingDllInfo() to obtain DllInfo for registering symbols present in the application itself.

• The instructions for making and using standalone libRmath have been moved to the R Installation and Administration manual.

• CHAR() now returns (const char *) since CHARSXPs should no longer be modified in place. This change allows compilers to warn or error about improper modification. Thanks to Herve Pages for the suggestion.

• acopy_string is a (provisional) new helper function that copies character data and returns a pointer to memory allocated using R_alloc. This can be used to create a copy of a string stored in a CHARSXP before passing the data on to a function that modifies its arguments.

• asLogical, asInteger, asReal and asComplex now accept STRSXP and CHARSXP arguments, and asChar accepts CHARSXP.

• New R_GE_str2col() exported via ‘R_ext/GraphicsEngine.h’ for external device developers.

• doKeybd and doMouseevent are now exported in ‘GraphicsDevice.h’.

• R_alloc now has first argument of type size_t to support 64-bit platforms (e.g., Win64) with a 32-bit long type.

• The type of the last two arguments of getMatrixDimnames (non-API but mentioned in ‘R-exts.texi’ and in ‘Rinternals.h’) has been changed to const char ** (from char **).

• R_FINITE now always resolves to the function call R_finite in packages (rather than sometimes substituting isfinite). This avoids some issues where R headers are called from C++ code using features tested on the C compiler.

• The advice to include R headers from C++ inside extern "C" has been changed. It is nowadays better not to wrap the headers, as they include other headers which on some OSes should not be wrapped.

• ‘Rinternals.h’ no longer includes a substantial set of C headers. All but ‘ctype.h’ and ‘errno.h’ are included by ‘R.h’, which is supposed to be used before ‘Rinternals.h’.

Installation

• The test-Lapack test is now part of make check.

• The stat system call is now required, along with opendir (which had long been used but not tested for). (make check would have failed in earlier versions without these calls.)

• evince is now considered as a possible PDF viewer.

• make install-strip now also strips the DLLs in the standard packages.

• Perl 5.8.0 (released in July 2002) or later is now required. (R 2.4.0 and later have in fact required 5.6.1 or later.)

• The C function finite is no longer used: we expect a C99 compiler which will have isfinite. (If that is missing, we test separately for NaN, Inf and -Inf.)

• A script/executable texi2dvi is now required on Unix-alikes: it is part of the texinfo distribution.

• Files ‘texinfo.tex’ and ‘txi-en.tex’ are no longer supplied in doc/manual (as the latest versions have an incompatible licence). You will need to ensure that your texinfo and/or TeX installations supply them.


• wcstod is now required for MBCS support.

• There are some experimental provisions for building on Cygwin.

Package Installation

• The encoding declared in the ‘DESCRIPTION’ file is now used as the default encoding for ‘.Rd’ files.

• A standard for specifying package license information in the ‘DESCRIPTION’ License field was introduced, see “Writing R Extensions”. In addition, files ‘LICENSE’ or ‘LICENCE’ in a package top-level source directory are now installed (so putting copies into the ‘inst’ subdirectory is no longer necessary).

• install.packages() on a Unix-alike now updates ‘doc/html/packages.html’ only if packages are installed to ‘.Library’ (by that exact name).

• R CMD INSTALL with option ‘--clean’ now runs R CMD SHLIB with option ‘--clean’ to do the clean up (unless there is a ‘src/Makefile’), and this will remove $(OBJECTS) (which might have been redefined in ‘Makevars’).

R CMD INSTALL with ‘--preclean’ cleans up the sources after a previous installation (as if that had used ‘--clean’) before attempting to install.

R CMD INSTALL will now run R CMD SHLIB in the ‘src’ directory if ‘src/Makevars’ is present, even if there are no source files with known extensions.

• If there is a file ‘src/Makefile’, ‘src/Makevars’ is now ignored (it could be included by ‘src/Makefile’ if desired), and it is preceded by ‘etc/Makeconf’ rather than ‘R_HOME/share/make/shlib.mk’. Thus the makefiles read are ‘R_HOME/etc/Makeconf’, ‘src/Makefile’ in the package and then any personal ‘Makevars’ files.

• R CMD SHLIB used to support the use of OBJS in ‘Makevars’, but this was changed to OBJECTS in 2001. The undocumented alternative of OBJS has finally been removed.

• R CMD check no longer issues a warning about no data sets being present if a lazyload db is found (as determined by the presence of ‘Rdata.rdb’, ‘Rdata.rds’, and ‘Rdata.rdx’ in the ‘data’ subdirectory).

Bug fixes

• charmatch() and pmatch() used to accept non-integer values for nomatch even though the return value was documented to be integer. Now nomatch is coerced to integer (rather than the result being coerced to the type of nomatch).

• match.call() no longer “works” outside a function unless definition is supplied. (Under some circumstances it used to “work”, matching itself.)

• The formula methods of boxplot, cdplot, pairs and spineplot now attach stats so that model.frame() is visible where they evaluate it.

• Date-time objects are no longer regarded as numeric by is.numeric().

• methods("Math") did not work if methods was not attached.

• readChar() read an extra empty item (or more than one) beyond the end of the source; in some conditions it would terminate early when reading an item of length 0.

• Added a promise evaluation stack so interrupted promise evaluations can be restarted.

• R.version[1:10] now prints nicely.

• In the methods package, prototypes are now inherited for the .Data “slot”; i.e., for classes that contain one of the basic data types.

• data_frame[[i, j]] now works if i is character.

• write.dcf() no longer writes NA fields (PR#9796), and works correctly on empty descriptions.

• pbeta(x, log.p = TRUE) now has improved accuracy in many cases, and so have functions depending on it such as pt(), pf() and pbinom().

• mle() had problems with L-BFGS-B in the no-parameter case and consequently also when profiling 1-parameter models (fix thanks to Ben Bolker).

• Two bugs fixed in methods that involve the ... argument in the generic function: previously failed to catch methods that just dropped the ...; and use of callGeneric() with no arguments failed in some circumstances when ... was a formal argument.

• sequence() now behaves more reasonably, although not back-compatibly for zero or negative input.


• nls() now allows more peculiar but reasonable ways of being called, e.g., with data = list(uneven_lengths) or a model without variables.

• match.arg() was not behaving as documented when several.ok = TRUE (PR#9859), gave spurious warnings when arg had the wrong length and was incorrectly documented (exact matches are returned even when there is more than one partial match).

• The data.frame method for split<-() was broken.

• The test for -D__NO_MATH_INLINES was badly broken and returned true on all non-glibc platforms and false on all glibc ones (whether they were broken or not).

• LF was missing after the last prompt when ‘--quiet’ was used without ‘--slave’. Use ‘--slave’ when no final LF is desired.

• Fixed bug in initialisation code in the grid package for determining the boundaries of shapes. Problem reported by Hadley Wickham; symptom was the error message: ‘Polygon edge not found’.

• str() is no longer slow for large POSIXct objects. Its output is also slightly more compact for such objects; implementation via new optional argument give.head.

• strsplit(*, fixed = TRUE), potentially iconv() and internal string formatting is now faster for large strings, thanks to report PR#9902 by John Brzustowski.

• de.restore() gave a spurious warning for matrices (Ben Bolker).

• plot(fn, xlim = c(a, b)) would not set from and to properly when plotting a function. The argument lists to curve() and plot.function() have been modified slightly as part of the fix.

• julian() was documented to work with POSIXt origins, but did not work with POSIXlt ones. (PR#9908)

• Dataset HairEyeColor has been corrected to agree with Friendly (2000): the change involves the breakdown of the Brown hair / Brown eye cell by Sex, and only totals over Sex are given in the original source.

• Trailing spaces are now consistently stripped from \alias{} entries in ‘.Rd’ files, and this is now documented. (PR#9915)

• .find.packages(), packageDescription() and sessionInfo() assumed that attached environments named "package:foo" were package environments, although misguided users could use such a name in attach().

• spline() and splinefun() with method = "periodic" could return incorrect results when length(x) was 2 or 3.

• getS3method() could fail if the method name contained a regexp metacharacter such as "+".

• help(a_character_vector) now uses the name and not the value of the vector unless it has length exactly one, so e.g. help(letters) now gives help on letters. (Related to PR#9927)

• Ranges in chartr() now work better in CJK locales, thanks to Ei-ji Nakama.

Changes on CRAN

by Kurt Hornik

New contributed packages

ADaCGH Analysis and plotting of array CGH data. Allows usage of Circular Binary Segmentation, wavelet-based smoothing, ACE method (CGH Explorer), HMM, BioHMM, GLAD, CGHseg, and Price’s modification of Smith & Waterman’s algorithm. Most computations are parallelized. Figures are imagemaps with links to IDClight (http://idclight.bioinfo.cnio.es). By Ramon Diaz-Uriarte and Oscar M. Rueda. Wavelet-based aCGH smoothing code from Li Hsu and Douglas Grove, imagemap code from Barry Rowlingson.

AIS Tools to look at the data (“Ad Inidicia Spec-tata”). By Micah Altman.

AcceptanceSampling Creation and evaluation ofAcceptance Sampling Plans. Plans can be sin-gle, double or multiple sampling plans. By An-dreas Kiermeier.

Amelia Amelia II: A Program for Missing Data. Amelia II “multiply imputes” missing data in a single cross-section (such as a survey), from a time series (like variables collected for each year in a country), or from a time-series-cross-sectional data set (such as collected by years for each of several countries). Amelia II implements a bootstrapping-based algorithm that gives essentially the same answers as the standard IP or EMis approaches, is usually considerably faster than existing approaches and can handle many more variables. The program also generalizes existing approaches by allowing for trends in time series across observations within a cross-sectional unit, as well as priors that allow experts to incorporate beliefs they have about the values of missing cells in their data. The program works from the R command line or via a graphical user interface that does not require users to know R. By James Honaker, Gary King, and Matthew Blackwell.

BiodiversityR A GUI (via Rcmdr) and some util-ity functions (often based on the vegan) forstatistical analysis of biodiversity and ecolog-ical communities, including species accumu-lation curves, diversity indices, Renyi pro-files, GLMs for analysis of species abundanceand presence-absence, distance matrices, Man-tel tests, and cluster, constrained and un-constrained ordination analysis. By RoelandKindt.

CORREP Multivariate correlation estimator andstatistical inference procedures. By DongxiaoZhu and Youjuan Li.

CPGchron Create radiocarbon-dated depthchronologies, following the work of Parnelland Haslett (2007, submitted to JRSSC). By An-drew Parnell.

Cairo Provides a Cairo graphics device that canbe use to create high-quality vector (PDF,PostScript and SVG) and bitmap output (PNG,JPEG, TIFF), and high-quality rendering in dis-plays (X11 and Win32). Since it uses the sameback-end for all output, copying across formatsis WYSIWYG. Files are created without the de-pendence on X11 or other external programs.This device supports alpha channel (semi-transparent drawing) and resulting images cancontain transparent and semi-transparent re-gions. It is ideal for use in server environments(file output) and as a replacement for other de-vices that don’t have Cairo’s capabilities suchas alpha support or anti-aliasing. Backendsare modular such that any subset of backendsis supported. By Simon Urbanek and JeffreyHorner.

CarbonEL Carbon Event Loop: hooks a Carbonevent loop handler into R. This is useful for en-abling UI from a console R (such as using the

Quartz device from Terminal or ESS). By SimonUrbanek.

DAAGbio Data sets and functions useful for thedisplay of microarray and for demonstrationswith microarray data. By John Maindonald.

Defaults Set, get, and import global function de-faults. By Jeffrey A. Ryan.

Devore7 Data sets and sample analyses from Jay L.Devore (2008), “Probability and Statistics forEngineering and the Sciences (7th ed)”, Thom-son. Original by Jay L. Devore, modificationsby Douglas Bates, modifications for the 7th edi-tion by John Verzani.

G1DBN Perform Dynamic Bayesian Network infer-ence using 1st order conditional dependencies.By Sophie Lebre.

GSA Gene Set Analysis. By Brad Efron and R. Tib-shirani.

GSM Gamma Shape Mixture. Implements aBayesian approach for estimation of a mixtureof gamma distributions in which the mixingoccurs over the shape parameter. This fam-ily provides a flexible and novel approach formodeling heavy-tailed distributions, is compu-tationally efficient, and only requires to specifya prior distribution for a single parameter. BySergio Venturini.

GeneF Implements several generalized F-statistics,including ones based on the flexible iso-tonic/monotonic regression or order restrictedhypothesis testing. By Yinglei Lai.

HFWutils Functions for Excel connections, garbagecollection, string matching, and passing by ref-erence. By Felix Wittmann.

ICEinfer Incremental Cost-Effectiveness (ICE) Statistical Inference. Given two unbiased samples of patient level data on cost and effectiveness for a pair of treatments, make head-to-head treatment comparisons by (i) generating the bivariate bootstrap resampling distribution of ICE uncertainty for a specified value of the shadow price of health, λ, (ii) form the wedge-shaped ICE confidence region with specified confidence fraction within [0.50, 0.99] that is equivariant with respect to changes in λ, (iii) color the bootstrap outcomes within the above confidence wedge with economic preferences from an ICE map with specified values of λ, beta and gamma parameters, (iv) display VAGR and ALICE acceptability curves, and (v) display indifference (iso-preference) curves from an ICE map with specified values of λ, β and γ or η parameters. By Bob Obenchain.


ICS ICS/ICA computation based on two scatter ma-trices. Implements Oja et al.’s method of twodifferent scatter matrices to obtain an invari-ant coordinate system or independent compo-nents, depending on the underlying assump-tions. By Klaus Nordhausen, Hannu Oja, andDave Tyler.

ICSNP Tools for multivariate nonparametrics, aslocation tests based on marginal ranks, spa-tial median and spatial signs computation,Hotelling’s T-test, estimates of shape. By KlausNordhausen, Seija Sirkia, Hannu Oja, and DaveTyler.

LLAhclust Hierarchical clustering of variables orobjects based on the likelihood linkage analy-sis method. The likelihood linkage analysis isa general agglomerative hierarchical clusteringmethod developed in France by Lerman in along series of research articles and books. Ini-tially proposed in the framework of variableclustering, it has been progressively extendedto allow the clustering of very general objectdescriptions. The approach mainly consists inreplacing the value of the estimated similar-ity coefficient by the probability of finding alower value under the hypothesis of “absenceof link”. Package LLAhclust contains routinesfor computing various types of probabilisticsimilarity coefficients between variables or ob-ject descriptions. Once the similarity values be-tween variables/objects are computed, a hier-archical clustering can be performed using sev-eral probabilistic and non-probabilistic aggre-gation criteria, and indices measuring the qual-ity of the partitions compatible with the result-ing hierarchy can be computed. By Ivan Ko-jadinovic, Israël-César Lerman, and PhilippePeter.

LLN Learning with Latent Networks. A new frame-work in which graph-structured data are usedto train a classifier in a latent space, and thenclassify new nodes. During the learning phase,a latent representation of the network is firstlearned and a supervised classifier is then builtin the learned latent space. In order to clas-sify new nodes, the positions of these nodesin the learned latent space are estimated usingthe existing links between the new nodes andthe learning set nodes. It is then possible to ap-ply the supervised classifier to assign each newnode to one of the classes. By Charles Bouvey-ron & Hugh Chipman.

LearnBayes Functions helpful in learning the basictenets of Bayesian statistical inference. Con-tains functions for summarizing basic one andtwo parameter posterior distributions and pre-dictive distributions, MCMC algorithms for

summarizing posterior distributions definedby the user, functions for regression models, hi-erarchical models, Bayesian tests, and illustra-tions of Gibbs sampling. By Jim Albert.

LogConcDEAD Computes the maximum likelihoodestimator from an i.i.d. sample of data from alog-concave density in any number of dimen-sions. Plots are available for 1- and 2-d data. ByMadeleine Cule, Robert Gramacy, and RichardSamworth.

MLDS Maximum Likelihood Difference Scaling.Difference scaling is a method for scaling per-ceived supra-threshold differences. The pack-age contains functions that allow the user todesign and run a difference scaling experiment,to fit the resulting data by maximum likelihoodand test the internal validity of the estimatedscale. By Kenneth Knoblauch and Laurence T.Maloney, based in part on C code written byLaurence T. Maloney and J. N. Yang.

MLEcens MLE for bivariate (interval) censoreddata. Contains functions to compute the non-parametric maximum likelihood estimator forthe bivariate distribution of (X, Y), when real-izations of (X, Y) cannot be observed directly.More precisely, the MLE is computed based ona set of rectangles (“observation rectangles”)that are known to contain the unobservable re-alizations of (X, Y). The methods can also beused for univariate censored data, and for cen-sored data with competing risks. The packagecontains the functionality of bicreduc, whichwill no longer be maintained. By MarloesMaathuis.

MiscPsycho Miscellaneous Psychometrics: statisti-cal analyses useful for applied psychometri-cians. By Harold C. Doran.

ORMDR Odds ratio based multifactor-dimensionalityreduction method for detecting gene-gene in-teractions. By Eun-Kyung Lee, Sung Gon Yi,Yoojin Jung, and Taesung Park.

PET Simulation and reconstruction of PET images.Implements different analytic/direct and itera-tive reconstruction methods of Peter Toft, andoffers the possibility to simulate PET data. ByJoern Schulz, Peter Toft, Jesper James Jensen,and Peter Philipsen.

PSAgraphics Propensity Score Analysis (PSA) Graphics. Includes functions which test balance within strata of categorical and quantitative covariates, give a representation of the estimated effect size by stratum, provide a graphic and loess based effect size estimate, and various balance functions that provide measures of the balance achieved via a PSA in a categorical covariate. By James E. Helmreich and Robert M. Pruzek.

PerformanceAnalytics Econometric tools for per-formance and risk analysis. Aims to aid prac-titioners and researchers in utilizing the lat-est research in analysis of non-normal returnstreams. In general, the package is most testedon return (rather than price) data on a monthlyscale, but most functions will work with dailyor irregular return data as well. By Peter Carland Brian G. Peterson.

QRMlib Code to examine Quantitative Risk Man-agement concepts, accompanying the book“Quantitative Risk Management: Concepts,Techniques and Tools” by Alexander J. McNeil,Rüdiger Frey and Paul Embrechts. S-Plus orig-inal by Alexander McNeil; R port by Scott Ul-man.

R.cache Fast and light-weight caching of objects.Methods for memoization, that is, caching ar-bitrary R objects in persistent memory. Objectscan be loaded and saved stratified on a set ofhashing objects. By Henrik Bengtsson.

R.huge Methods for accessing huge amounts ofdata. Provides a class representing a matrixwhere the actual data is stored in a binary for-mat on the local file system. This way the sizelimit of the data is set by the file system and notthe memory. Currently in an alpha/early-betaversion. By Henrik Bengtsson.

RBGL Interface to boost C++ graph library. Demoof interface with full copy of all hpp definingboost. By Vince Carey, Li Long, and R. Gentle-man.

RDieHarder An interface to the dieharder test suiteof random number generators and tests thatwas developed by Robert G. Brown, extendingearlier work by George Marsaglia and others.By Dirk Eddelbuettel.

RLRsim Exact (Restricted) Likelihood Ratio tests formixed and additive models. Provides rapidsimulation-based tests for the presence of vari-ance components/nonparametric terms with aconvenient interface for a variety of models. ByFabian Scheipl.

ROptEst Optimally robust estimation using S4classes and methods. By Matthias Kohl.

ROptRegTS Optimally robust estimation forregression-type models using S4 classes andmethods. By Matthias Kohl.

RSVGTipsDevice An R SVG graphics device withdynamic tips and hyperlinks using the w3.orgXML standard for Scalable Vector Graphics.

Supports tooltips with 1 to 3 lines and linestyles. By Tony Plate, based on RSvgDevice byT Jake Luciani.

Rcapture Loglinear Models in Capture-RecaptureExperiments. Estimation of abundance andother demographic parameters for closed pop-ulations, open populations and the robust de-sign in capture-recapture experiments usingloglinear models. By Sophie Baillargeon andLouis-Paul Rivest.

RcmdrPlugin.TeachingDemos Provides an Rcmdr“plug-in” based on the TeachingDemos pack-age, and is primarily for illustrative purposes.By John Fox.

Reliability Functions for estimating parameters insoftware reliability models. Only infinite fail-ure models are implemented so far. By AndreasWittmann.

RiboSort Rapid classification of (TRFLP & ARISA)microbial community profiles, eliminating thelaborious task of manually classifying commu-nity fingerprints in microbial studies. By au-tomatically assigning detected fragments andtheir respective relative abundances to ap-propriate ribotypes, RiboSort saves time andgreatly simplifies the preparation of DNA fin-gerprint data sets for statistical analysis. ByÚna Scallan & Ann-Kathrin Liliensiek.

Rmetrics Rmetrics —- Financial Engineering andComputational Finance. Environment forteaching “Financial Engineering and Compu-tational Finance”. By Diethelm Wuertz andmany others.

RobLox Optimally robust influence curves in caseof normal location with unknown scale. ByMatthias Kohl.

RobRex Optimally robust influence curves in caseof linear regression with unknown scale andstandard normal distributed errors where theregressor is random. By Matthias Kohl.

Rsac Seismic analysis tools in R. Mostly functionsto reproduce some of the Seismic AnalysisCode (SAC, http://www.llnl.gov/sac/) com-mands in R. This includes reading standardbinary ‘.SAC’ files, plotting arrays of seismicrecordings, filtering, integration, differentia-tion, instrument deconvolution, and rotation ofhorizontal components. By Eric M. Thompson.

Runuran Interface to the UNU.RAN library for Uni-versal Non-Uniform RANdom variate gener-ators. By Josef Leydold and Wolfgang Hör-mann.


Ryacas An interface to the yacas computer al-gebra system. By Rob Goedman, GaborGrothendieck, Søren Højsgaard, Ayal Pinkus.

SRPM Shared Reproducibility Package Manage-ment. A package development and manage-ment system for distributed reproducible re-search. By Roger D. Peng.

SimHap A comprehensive modeling frameworkfor epidemiological outcomes and a multiple-imputation approach to haplotypic analysis ofpopulation-based data. Can perform singleSNP and multi-locus (haplotype) associationanalyses for continuous Normal, binary, lon-gitudinal and right-censored outcomes mea-sured in population-based samples. Uses es-timation maximization techniques for inferringhaplotypic phase in individuals, and incorpo-rates a multiple-imputation approach to dealwith the uncertainty of imputed haplotypes inassociation testing. By Pamela A. McCaskie.

SpatialNP Multivariate nonparametric methodsbased on spatial signs and ranks. Containstest and estimates of location, tests of indepen-dence, tests of sphericity, several estimates ofshape and regression all based on spatial signs,symmetrized signs, ranks and signed ranks. BySeija Sirkia, Jaakko Nevalainen, Klaus Nord-hausen, and Hannu Oja.

TIMP Problem solving environment for fitting su-perposition models. Measurements often rep-resent a superposition of the contributions ofdistinct sub-systems resolved with respect tomany experimental variables (time, tempera-ture, wavelength, pH, polarization, etc). TIMPallows parametric models for such superposi-tions to be fit and validated. The package hasbeen extensively applied to modeling data aris-ing in spectroscopy experiments. By KatharineM. Mullen and Ivo H. M. van Stokkum.

TwslmSpikeWeight Normalization of cDNA mi-croarray data with the two-way semilinearmodel(TW-SLM). It incorporates informationfrom control spots and data quality in the TW-SLM to improve normalization of cDNA mi-croarray data. Huber’s and Tukey’s bisquareweight functions are available for robust esti-mation of the TW-SLM. By Deli Wang and JianHuang.

Umacs Universal Markov chain sampler. By JouniKerman.

WilcoxCV Functions to perform fast variable se-lection based on the Wilcoxon rank sum testin the cross-validation or Monte-Carlo cross-validation settings, for use in microarray-

based binary classification. By Anne-LaureBoulesteix.

YaleToolkit Tools for the graphical exploration ofcomplex multivariate data developed at YaleUniversity. By John W. Emerson and WaltonGreen.

adegenet Genetic data handling for multivariateanalysis using ade4. By Thibaut Jombart.

ads Spatial point patterns analysis. Perform first-and second-order multi-scale analyses derivedfrom Ripley’s K-function, for univariate, multi-variate and marked mapped data in rectangu-lar, circular or irregular shaped sampling win-dows, with test of statistical significance basedon Monte Carlo simulations. By R. Pelissierand F. Goreaud.

argosfilter Functions to filter animal satellite track-ing data obtained from Argos. Especially indi-cated for telemetry studies of marine animals,where Argos locations are predominantly oflow quality. By Carla Freitas.

arrayImpute Missing imputation for microarraydata. By Eun-kyung Lee, Dankyu Yoon, andTaesung Park.

ars Adaptive Rejection Sampling. By Paulino PerezRodriguez; original C++ code from Arnost Ko-marek.

arulesSequences Add-on for arules to handle andmine frequent sequences. Provides interfacesto the C++ implementation of cSPADE by Mo-hammed J. Zaki. By Christian Buchta andMichael Hahsler.

asuR Functions and data sets for a lecture in Ad-vanced Statistics using R. Especially the func-tions mancontr() and inspect() may be ofgeneral interest. With the former, it is pos-sible to specify your own contrasts and givethem useful names. Function inspect() showsa wide range of inspection plots to validatemodel assumptions. By Thomas Fabbro.

bayescount Bayesian analysis of count distributionswith JAGS. A set of functions to apply azero-inflated gamma Poisson (equivalent to azero-inflated negative binomial), zero-inflatedPoisson, gamma Poisson (negative binomial)or Poisson model to a set of count data us-ing JAGS (Just Another Gibbs Sampler). Re-turns information on the possible values formean count, overdispersion and zero inflationpresent in count data such as faecal egg countdata. By Matthew Denwood.


benchden Full implementation of the 28 distribu-tions introduced as benchmarks for nonpara-metric density estimation by Berlinet and De-vroye (1994). Includes densities, cdfs, quantilefunctions and generators for samples. By Tho-ralf Mildenberger, Henrike Weinert, and Sebas-tian Tiemeyer.

biOps Basic image operations and image process-ing. Includes arithmetic, logic, look up tableand geometric operations. Some image pro-cessing functions, for edge detection (severalalgorithms including Roberts, Sobel, Kirsch,Marr-Hildreth, Canny) and operations by con-volution masks (with predefined as well asuser defined masks) are provided. Supportedfile formats are jpeg and tiff. By Matias Bordeseand Walter Alini.

binMto Asymptotic simultaneous confidence inter-vals for comparison of many treatments withone control, for the difference of binomial pro-portions, allows for Dunnett-like-adjustment,Bonferroni or unadjusted intervals. Simulationof power of the above interval methods, ap-proximate calculation of any-pair-power, andsample size iteration based on approximateany-pair power. Exact conditional maximumtest for many-to-one comparisons to a control.By Frank Schaarschmidt.

bio.infer Compute biological inferences. Importsbenthic count data, reformats this data, andcomputes environmental inferences from thisdata. By Lester L. Yuan.

blockTools Block, randomly assign, and diagnosepotential problems between units in random-ized experiments. Blocks units into experimen-tal blocks, with one unit per treatment condi-tion, by creating a measure of multivariate dis-tance between all possible pairs of units. Max-imum, minimum, or an allowable range of dif-ferences between units on one variable can beset. Randomly assign units to treatment con-ditions. Diagnose potential interference prob-lems between units assigned to different treat-ment conditions. Write outputs to ‘.tex’ and‘.csv’ files. By Ryan T. Moore.

bootStepAIC Model selection by bootstrapping thestepAIC() procedure. By Dimitris Rizopoulos.

brew A templating framework for mixing text and Rcode for report generation. Template syntax issimilar to PHP, Ruby’s erb module, Java ServerPages, and Python’s psp module. By JeffreyHorner.

ca Computation and visualization of simple, mul-tiple and joint correspondence analysis. ByMichael Greenacre and Oleg Nenadic.

catmap Case-control And Tdt Meta-Analysis Pack-age. Conducts fixed-effects (inverse vari-ance) and random-effects (DerSimonian andLaird, 1986) meta-analyses of case-control orfamily-based (TDT) genetic data; in addition,performs meta-analyses combining these twotypes of study designs. The fixed-effects modelwas first described by Kazeem and Farrell(2005); the random-effects model is describedin Nicodemus (submitted for publication). ByKristin K. Nicodemus.

celsius Retrieve Affymetrix microarray measure-ments and metadata from Celsius web services,see http://genome.ucla.edu/projects/celsius. By Allen Day, Marc Carlson.

cghFLasso Spatial smoothing and hot spot detectionusing the fused lasso regression. By Robert Tib-shirani and Pei Wang.

choplump Choplump tests: permutation tests forcomparing two groups with some positive butmany zero responses. By M. P. Fay.

clValid Statistical and biological validation of clus-tering results. By Guy Brock, Vasyl Pihur, Sus-mita Datta, and Somnath Datta.

classGraph Construct directed graphs of S4 class hi-erarchies and visualize them. Typically, thesegraphs are DAGs (directed acyclic graphs) ingeneral, though often trees. By Martin Maech-ler partly based on code from Robert Gentle-man.

clinfun Clinical Trial Design and Data AnalysisFunctions. Utilities to make your clinical col-laborations easier if not fun. By E. S. Venkatra-man.

clusterfly Explore clustering interactively using Rand GGobi. Contains both general code forvisualizing clustering results and specific vi-sualizations for model-based, hierarchical andSOM clustering. By Hadley Wickham.

clv Cluster Validation Techniques. Contains most ofthe popular internal and external cluster vali-dation methods ready to use for the most of theoutputs produced by functions coming frompackage cluster. By Lukasz Nieweglowski.

codetools Code analysis tools for R. By Luke Tier-ney.

colorRamps Builds single and double gradient colormaps. By Tim Keitt.

contrast Contrast methods, in the style of the De-sign package, for fit objects produced by thelm, glm, gls, and geese functions. By Steve We-ston, Jed Wing and Max Kuhn.


coxphf Cox regression with Firth’s penalized likeli-hood. R by Meinhard Ploner, Fortran by GeorgHeinze.

crosshybDetector Identification of probes poten-tially affected by cross-hybridizations in mi-croarray experiments. Includes functions fordiagnostic plots. By Paolo Uva.

ddesolve Solver for Delay Differential Equations byinterfacing numerical routines written by Si-mon N. Wood, with contributions by BenjaminJ. Cairns. These numerical routines first ap-peared in Simon Wood’s solv95 program. ByAlex Couture-Beil, Jon T. Schnute, and RowanHaigh.

desirability Desirability Function Optimization andRanking. S3 classes for multivariate optimiza-tion using the desirability function by Der-ringer and Suich (1980). By Max Kuhn.

dplR Dendrochronology Program Library in R.Contains functions for performing some stan-dard tree-ring analyses. By Andy Bunn.

dtt Discrete Trigonometric Transforms. Functionsfor 1D and 2D Discrete Cosine Transform(DCT), Discrete Sine Transform (DST) and Dis-crete Hartley Transform (DHT). By LukaszKomsta.

earth Multivariate Adaptive Regression SplineModels. Build regression models usingthe techniques in Friedman’s papers “FastMARS” and “Multivariate Adaptive Regres-sion Splines”. (The term “MARS” is copy-righted and thus not used as the name of thepackage.). By Stephen Milborrow, derivedfrom code in package mda by Trevor Hastieand Rob Tibshirani.

eigenmodel Semiparametric factor and regression models for symmetric relational data. Estimates the parameters of a model for symmetric relational data (e.g., the above-diagonal part of a square matrix) using a model-based eigenvalue decomposition and regression. Missing data are accommodated, and a posterior mean for missing data is calculated under the assumption that the data are missing at random. The marginal distribution of the relational data can be arbitrary, and is fit with an ordered probit specification. By Peter Hoff.

epibasix Elementary tools for the analysis of common epidemiological problems, ranging from sample size estimation, through 2 × 2 contingency table analysis and basic measures of agreement (kappa, sensitivity/specificity). Appropriate print and summary statements are also written to facilitate interpretation wherever possible. The target audience includes biostatisticians and epidemiologists who would like to apply standard epidemiological methods in a simple manner. By Michael A. Rotondi.

experiment Various statistical methods for designing and analyzing randomized experiments. One main functionality is the implementation of randomized-block and matched-pair designs based on possibly multivariate pre-treatment covariates. Also provides the tools to analyze various randomized experiments including cluster randomized experiments, randomized experiments with noncompliance, and randomized experiments with missing data. By Kosuke Imai.

fCopulae Rmetrics — Dependence Structures with Copulas. Environment for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others.

forensic Statistical Methods in Forensic Genetics. Statistical evaluation of DNA mixtures, DNA profile match probability. By Miriam Marusiakova.

fractal Insightful Fractal Time Series Modeling and Analysis. Software for the book in development entitled "Fractal Time Series Analysis in S-PLUS and R" by William Constantine and Donald B. Percival, Springer. By William Constantine and Donald Percival.

fso Fuzzy Set Ordination: a multivariate analysis used in ecology to relate the composition of samples to possible explanatory variables. While differing in theory and method, in practice the use is similar to "constrained ordination". Contains plotting and summary functions as well as the analyses. By David W. Roberts.

gWidgetstcltk Toolkit implementation of the gWidgets API for the tcltk package. By John Verzani.

gamlss.mx A GAMLSS add-on package for fitting mixture distributions. By Mikis Stasinopoulos and Bob Rigby.

gcl Computes a fuzzy rules classifier (Vinterbo, Kim & Ohno-Machado, 2005). By Staal A. Vinterbo.

geiger Analysis of evolutionary diversification. Features running macroevolutionary simulation and estimating parameters related to diversification from comparative phylogenetic data. By Luke Harmon, Jason Weir, Chad Brock, Rich Glor, and Wendell Challenger.

ggplot An implementation of the grammar of graphics in R. Combines the advantages of both base and lattice graphics: conditioning and shared axes are handled automatically, and one can still build up a plot step by step from multiple data sources. Also implements a more sophisticated multidimensional conditioning system and a consistent interface to map data to aesthetic attributes. By Hadley Wickham.

ghyp Provides univariate and multivariate generalized hyperbolic distributions and their special cases (Hyperbolic, Normal Inverse Gaussian, Variance Gamma and skewed Student t distributions): in particular fitting procedures, computation of the density, quantile, probability and random variates, expected shortfall, and some portfolio optimization and plotting routines. Also contains the generalized inverse Gaussian distribution. By Wolfgang Breymann and David Luethi.

glmmAK Generalized Linear Mixed Models. By Arnost Komarek.

granova Graphical Analysis of Variance. Provides distinctive graphics for display of ANOVA results. Functions were written to display data for any number of groups, regardless of their sizes (however, very large data sets or numbers of groups are likely to be problematic), using a specialized approach to construct data-based contrast vectors with respect to which ANOVA data are displayed. By Robert M. Pruzek and James E. Helmreich.

graph Implements some simple graph handling capabilities. By R. Gentleman, Elizabeth Whalen, W. Huber, and S. Falcon.

grasp Generalized Regression Analysis and Spatial Prediction for R. GRASP is a general method for making spatial predictions of response variables (RV) using point surveys of the RV and spatial coverages of predictor variables (PV). Originally, GRASP was developed to analyse, model and predict vegetation distribution over New Zealand. It has been used in all sorts of applications since then. By Anthony Lehmann, Fabien Fivaz, John Leathwick and Jake Overton, with contributions from many specialists from around the world.

hdeco Hierarchical DECOmposition of Entropy for Categorical Map Comparisons. A flexible and hierarchical framework for comparing categorical map composition and configuration (spatial pattern) along spatial, thematic, or external grouping variables. Comparisons are based on measures of mutual information between thematic classes (colors) and location (spatial partitioning). Results are returned in textual, tabular, and graphical forms. By Tarmo K. Remmel, Sandor Kabos, and Ferenc (Ferko) Csillag.

heatmap.plus Heatmaps with more sensible behavior. Allows the heatmap matrix to have non-identical x and y dimensions, and multiple tracks of annotation for RowSideColors and ColSideColors. By Allen Day.

hydrosanity Graphical user interface for exploring hydrological time series. Designed to work with catchment surface hydrology data (rainfall and streamflow), but could also be used with other time series data. Hydrological time series typically have many missing values and varying accuracy of measurement (indicated by data quality codes). Furthermore, the spatial coverage of data varies over time. Much of the functionality of this package attempts to understand these complex data sets, detect errors and characterize sources of uncertainty. Emphasis is on graphical methods. The GUI is based on rattle. By Felix Andrews.

identity Calculate identity coefficients, based on Mark Abney's C code. By Na Li.

ifultools Insightful Research Tools. By William Constantine.

inetwork Network Analysis and Plotting. Implements a network partitioning algorithm to identify communities (or modules) in a network. A network plotting function then utilizes the identified community structure to position the vertices for plotting. Also contains functions to calculate the assortativity and transitivity of a vertex. By Sun-Chong Wang.

inline Functionality to dynamically define R functions and S4 methods with in-lined C, C++ or Fortran code supporting .C and .Call calling conventions. By Oleg Sklyar, Duncan Murdoch, and Mike Smith.
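
A minimal sketch of what such in-lining can look like, assuming the cfunction() interface and the .C calling convention (a working compiler toolchain is required; the function dbl and its arguments are invented for illustration):

  library(inline)
  code <- "for (int i = 0; i < *n; i++) x[i] = 2 * x[i];"
  dbl <- cfunction(signature(n = "integer", x = "numeric"),
                   code, language = "C", convention = ".C")
  dbl(n = 3L, x = c(1, 2, 3))$x   # 2 4 6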

irtoys A simple common interface to the estimation of item parameters in IRT models for binary responses with three different programs (ICL, BILOG-MG, and ltm), and a variety of functions useful with IRT models. By Ivailo Partchev.

kin.cohort Analysis of kin-cohort studies. Provides estimates of age-specific cumulative risk of a disease for carriers and noncarriers of a mutation. The cohorts are retrospectively built from relatives of probands for whom the genotype is known. Currently the method of moments and marginal maximum likelihood are implemented. Confidence intervals are calculated from bootstrap samples. Most of the code is a translation from previous MATLAB code by Chatterjee. By Victor Moreno, Nilanjan Chatterjee, and Bhramar Mukherjee.

kzs Kolmogorov-Zurbenko Spline. A collection of functions utilizing splines to smooth a noisy data set in order to estimate its underlying signal. By Derek Cyr and Igor Zurbenko.

lancet.iraqmortality Surveys of Iraq mortality published in The Lancet. The Lancet has published two surveys on Iraq mortality before and after the US-led invasion. The package facilitates access to the data and a guided tour of some of their more interesting aspects. By David Kane, with contributions from Arjun Navi Narayan and Jeff Enos.

ljr Fits and tests logistic joinpoint models. By Michal Czajkowski, Ryan Gill, and Greg Rempala.

luca Likelihood Under Covariate Assumptions (LUCA). Likelihood inference in case-control studies of a rare disease under independence or simple dependence of genetic and non-genetic covariates. By Ji-Hyung Shin, Brad McNeney, and Jinko Graham.

meifly Interactive model exploration using GGobi. By Hadley Wickham.

mixPHM Mixtures of proportional hazard models. Fits multiple variable mixtures of various parametric proportional hazard models using the EM algorithm. Proportionality restrictions can be imposed on the latent groups and/or on the variables. Several survival distributions can be specified. Missing values are allowed. Independence is assumed over the single variables. By Patrick Mair and Marcus Hudec.

mlmmm Computational strategies for multivariate linear mixed-effects models with missing values (Schafer and Yucel, 2002). By Recai Yucel.

modeest Provides estimators of the mode of univariate unimodal data or univariate unimodal distributions. Also allows computation of the Chernoff distribution. By Paul Poncet.

modehunt Multiscale Analysis for Density Functions. Given independent and identically distributed observations X(1), ..., X(n) from a density f, this package provides five methods to perform a multiscale analysis about f as well as the necessary critical values. The first method, introduced in Duembgen and Walther (2006), provides simultaneous confidence statements for the existence and location of local increases (or decreases) of f, based on all intervals I(all) spanned by any two observations X(j), X(k). The second method approximates the latter approach by using only a subset of I(all) and is therefore computationally much more efficient, but asymptotically equivalent. Omitting the additive correction term Gamma in either method offers another two approaches which are more powerful on small scales and less powerful on large scales, however not asymptotically minimax optimal anymore. Finally, the block procedure is a compromise between adding Gamma or not, having intermediate power properties. The latter is again asymptotically equivalent to the first and was introduced in Rufibach and Walther (2007). By Kaspar Rufibach and Guenther Walther.

monomvn Estimation of multivariate normal data of arbitrary dimension where the pattern of missing data is monotone. Through the use of parsimonious/shrinkage regressions (plsr, pcr, lasso, ridge, etc.), where standard regressions fail, the package can handle an (almost) arbitrary amount of missing data. The current version supports maximum likelihood inference. Future versions will provide a means of sampling from a Bayesian posterior. By Robert B. Gramacy.

mota Mean Optimal Transformation Approach for detecting nonlinear functional relations. Originally designed for the identifiability analysis of nonlinear dynamical models. However, the underlying concept is very general and allows detecting groups of functionally related variables whenever there are multiple estimates for each variable. By Stefan Hengl.

nlstools Tools for assessing the quality of fit of a Gaussian nonlinear model. By Florent Baty and Marie-Laure Delignette-Muller, with contributions from Sandrine Charles and Jean-Pierre Flandrois.

oc OC Roll Call Analysis Software. Estimates Optimal Classification scores from roll call votes supplied through a rollcall object from package pscl. By Keith Poole, Jeffrey Lewis, James Lo and Royce Carroll.

oce Analysis of Oceanographic data. Supports CTD measurements, sea-level time series, coastline files, etc. Also includes functions for calculating seawater properties such as density, and derived properties such as buoyancy frequency. By Dan Kelley.

pairwiseCI Calculation of parametric and nonparametric confidence intervals for the difference or ratio of location parameters and for the difference, ratio and odds-ratio of binomial proportions for comparison of independent samples. CIs are not adjusted for multiplicity. A by statement allows calculation of CIs separately for the levels of one further factor. By Frank Schaarschmidt.

paran An implementation of Horn's technique for evaluating the components retained in a principal components analysis (PCA). Horn's method contrasts eigenvalues produced through a PCA on a number of random data sets of uncorrelated variables with the same number of variables and observations as the experimental or observational data set, to produce eigenvalues for components that are adjusted for the sample error-induced inflation. Components with adjusted eigenvalues greater than one are retained. The package may also be used to conduct parallel analysis following Glorfeld's (1995) suggestions to reduce the likelihood of over-retention. By Alexis Dinno.
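
As a hedged illustration only (the paran() call and its iterations argument are assumed from the package's documented interface; USArrests is simply a convenient built-in data set), a parallel analysis might be run as:

  library(paran)
  # Horn's parallel analysis using 500 random data sets
  paran(USArrests, iterations = 500)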

pcse Estimation of panel-corrected standard errors. Data may contain balanced or unbalanced panels. By Delia Bailey and Jonathan N. Katz.

penalized L1 (lasso) and L2 (ridge) penalized estimation in Generalized Linear Models and in the Cox Proportional Hazards model. By Jelle Goeman.

plRasch Fit log linear by linear association models. By Zhushan Li & Feng Hong.

plink Separate Calibration Linking Methods. Uses unidimensional item response theory methods to compute linking constants and conduct chain linking of tests for multiple groups under a nonequivalent groups common item design. Allows for mean/mean, mean/sigma, Haebara, and Stocking-Lord calibrations of single-format or mixed-format dichotomous (1PL, 2PL, and 3PL) and polytomous (graded response, partial credit/generalized partial credit, nominal, and multiple-choice model) common items. By Jonathan Weeks.

plotAndPlayGTK A GUI for interactive plots using GTK+. When wrapped around plot calls, a window with the plot and a tool bar to interact with it pops up. By Felix Andrews, with contributions from Graham Williams.

pomp Inference methods for partially-observed Markov processes. By Aaron A. King, Ed Ionides, and Carles Breto.

poplab Population Lab: a tool for constructing a virtual electronic population of related individuals evolving over calendar time, by using vital statistics, such as mortality and fertility, and disease incidence rates. By Monica Leu, Kamila Czene, and Marie Reilly.

proftools Profile output processing tools. By Luke Tierney.

proj4 A simple interface to lat/long projection and datum transformation of the PROJ.4 cartographic projections library. Allows transformation of geographic coordinates from one projection and/or datum to another. By Simon Urbanek.
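
For illustration, a hedged sketch of the kind of call the package supports; the project() function is part of proj4, but the particular PROJ.4 string and coordinates below are made-up examples:

  library(proj4)
  # forward-project two lon/lat points to UTM zone 32N (WGS84)
  pts <- matrix(c(7.5, 47.5, 8.5, 48.5), ncol = 2, byrow = TRUE)
  project(pts, "+proj=utm +zone=32 +datum=WGS84")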

proxy Distance and similarity measures. An extensible framework for the efficient calculation of auto- and cross-proximities, along with implementations of the most popular ones. By David Meyer and Christian Buchta.
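
A minimal sketch, assuming the dist() and simil() wrappers provided by proxy and the "Jaccard" measure from its built-in registry (the toy binary matrix is invented):

  library(proxy)
  x <- matrix(c(1, 0, 1, 1,
                0, 1, 1, 0,
                1, 1, 0, 0), nrow = 3, byrow = TRUE)
  simil(x, method = "Jaccard")   # pairwise similarities between rows
  dist(x, method = "Euclidean")  # proxy extends the usual dist() interface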

psych Routines for personality, psychometrics and experimental psychology. Functions are primarily for scale construction and reliability analysis, although others are basic descriptive stats. By William Revelle.

psyphy Functions that could be useful in analyzing data from psychophysical experiments, including functions for calculating d′ from several different experimental designs, links for mafc to be used with the binomial family in glm (and possibly other contexts), and selfStart functions for estimating gamma values for CRT screen calibrations. By Kenneth Knoblauch.

qualV Qualitative methods for the validation of models. By K.G. van den Boogaart, Stefanie Jachner and Thomas Petzoldt.

quantmod Specify, build, trade, and analyze quantitative financial trading strategies. By Jeffrey A. Ryan.

rateratio.test Exact rate ratio test. By Michael Fay.

regtest Functions for unary and binary regression tests. By Jens Oehlschlägel.

relations Data structures for k-ary relations with arbitrary domains, predicate functions, and fitters for consensus relations. By Kurt Hornik and David Meyer.

rgcvpack Interface to the GCVPACK Fortran package for thin plate spline fitting and prediction. By Xianhong Xie.

rgr Geological Survey of Canada (GSC) functions for exploratory data analysis with applied geochemical data, with special application to the estimation of background ranges to support both environmental studies and mineral exploration. By Robert G. Garrett.

rindex Indexing for R. Index structures allow quickly accessing elements from large collections. With btree optimized for disk databases and ttree for RAM databases, uses hybrid static indexing which is quite optimal for R. By Jens Oehlschlägel.

rjson JSON for R. Converts R objects into JSON objects and vice versa. By Alex Couture-Beil.
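
A minimal round-trip sketch, assuming the toJSON() and fromJSON() functions exported by rjson (the example list is arbitrary):

  library(rjson)
  json <- toJSON(list(title = "R News", volume = 7, issue = 2))
  json             # roughly '{"title":"R News","volume":7,"issue":2}'
  fromJSON(json)   # back to an R list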

rsbml R support for SBML. Links R to libsbml for SBML parsing and output, provides an S4 SBML DOM, converts SBML to R graph objects, and more. By Michael Lawrence.

sapa Insightful Spectral Analysis for Physical Applications. Software for the book "Spectral Analysis for Physical Applications" by Donald B. Percival and Andrew T. Walden, Cambridge University Press, 1993. By William Constantine and Donald Percival.

scagnostics Calculates Tukey's scagnostics, which describe various measures of interest for pairs of variables, based on their appearance on a scatterplot. They are a useful tool for weeding out interesting or unusual scatterplots from a scatterplot matrix, without having to look at every individual plot. By Heike Hofmann, Lee Wilkinson, Hadley Wickham, Duncan Temple Lang, and Anushka Anand.

schoolmath Functions and data sets for math used in school. A main focus is set on prime calculation. By Joerg Schlarmann and Josef Wienand.

sdcMicro Statistical Disclosure Control methods for the generation of public- and scientific-use files. Data from statistical agencies and other institutions are mostly confidential. The package can be used for the generation of safe (micro)data, i.e., for the generation of public- and scientific-use files. By Matthias Templ.

seriation Infrastructure for seriation with an implementation of several seriation/sequencing techniques to reorder matrices, dissimilarity matrices, and dendrograms. Also contains some visualization techniques based on seriation. By Michael Hahsler, Christian Buchta and Kurt Hornik.

signalextraction Real-Time Signal Extraction (Direct Filter Approach). The Direct Filter Approach (DFA) provides efficient estimates of signals at the current boundary of time series in real-time. For that purpose, one-sided ARMA-filters are computed by minimizing customized error criteria. The DFA can be used for estimating either the level or turning-points of a series, knowing that both criteria are incongruent. In the context of real-time turning-point detection, various risk-profiles can be operationalized, which account for the speed and/or the reliability of the one-sided filter. By Marc Wildi & Marcel Dettling.

simba Functions for similarity calculation of binary data (for instance presence/absence species data). Also contains wrapper functions for reshaping species lists into matrices and vice versa and some other functions for further processing of similarity data (Mantel-like permutation procedures) as well as some other useful stuff. By Gerald Jurasinski.

simco A package to import Structure files and deduce similarity coefficients from them. By Owen Jones.

snpXpert Tools to analyze SNP data. By Eun-kyung Lee, Dankyu Yoon, and Taesung Park.

spam SPArse Matrix: functions for sparse matrix algebra (used by fields). Differences with SparseM and Matrix are: (1) support for only one sparse matrix format, (2) based on transparent and simple structure(s), and (3) S3 and S4 compatible. By Reinhard Furrer.

spatgraphs Graphs, graph visualization and graph based summaries to be used with spatial point pattern analysis. By Tuomas Rajala.

splus2R Insightful package providing missing S-PLUS functionality in R. Currently there are many functions in S-PLUS that are missing in R. To facilitate the conversion of S-PLUS modules and libraries to R packages, this package helps to provide missing S-PLUS functionality in R. By William Constantine.

sqldf Manipulate R data frames using SQL. By G. Grothendieck.
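
A minimal sketch of the idea, assuming the sqldf() function that gives the package its name; it looks up data frames in the calling environment by name, so the query below refers to the made-up data frame DF directly:

  library(sqldf)
  DF <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))
  sqldf("select g, sum(x) as total from DF group by g")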

sspline Smoothing splines on the sphere. By Xianhong Xie.

stochasticGEM Fitting Stochastic General Epidemic Models: Bayesian inference for partially observed stochastic epidemics. The general epidemic model is used for estimating the parameters governing the infectious and incubation period length, and the parameters governing susceptibility. In real-life epidemics the infection process is unobserved, and the data consist of the times individuals are detected, usually via appearance of symptoms. The package fits several variants of the general epidemic model, namely the stochastic SIR with Markovian and non-Markovian infectious periods. The estimation is based on a Markov chain Monte Carlo algorithm. By Eugene Zwane.

surveyNG An experimental revision of the survey package for complex survey samples (featuring a database interface and sparse matrices). By Thomas Lumley.

termstrc Term Structure and Credit Spread Estimation. Offers several widely-used term structure estimation procedures, i.e., the parametric Nelson and Siegel approach, the Svensson approach and cubic splines. By Robert Ferstl and Josef Hayden.

tframe Time Frame coding kernel. Functions for writing code that is independent of the way time is represented. By Paul Gilbert.

timsac TIMe Series Analysis and Control package. By The Institute of Statistical Mathematics.

trackObjs Track Objects. Stores objects in files on disk so that files are automatically rewritten when objects are changed, and so that objects are accessible but do not occupy memory until they are accessed. Also tracks times when objects are created and modified, and caches some basic characteristics of objects to allow for fast summaries of objects. By Tony Plate.

tradeCosts Post-trade analysis of transaction costs. By Aaron Schwartz and Luyi Zhao.

tripEstimation Metropolis sampler and supporting functions for estimating animal movement from archival tags and satellite fixes. By Michael Sumner and Simon Wotherspoon.

twslm A two-way semilinear model for normalization and analysis of cDNA microarray data. Huber's and Tukey's bisquare weight functions are available for robust estimation of the two-way semilinear models. By Deli Wang and Jian Huang.

vbmp Variational Bayesian multinomial probit regression with Gaussian process priors. By Nicola Lama and Mark Girolami.

vrtest Variance ratio tests for weak-form market efficiency. By Jae H. Kim.

waved The WaveD Transform in R. Makes available code necessary to reproduce figures and tables in recent papers on the WaveD method for wavelet deconvolution of noisy signals. By Marc Raimondo and Michael Stewart.

wikibooks Functions and datasets of the German WikiBook "GNU R", which introduces R to new users. By Joerg Schlarmann.

wmtsa Insightful Wavelet Methods for Time Series Analysis. Software for the book "Wavelet Methods for Time Series Analysis" by Donald B. Percival and Andrew T. Walden, Cambridge University Press, 2000. By William Constantine and Donald Percival.

wordnet An interface to WordNet using the Jawbone Java API to WordNet. By Ingo Feinerer.

yest Model selection and variance estimation in Gaussian independence models. By Petr Simecek.

Other changes

• CRAN’s Devel area is gone.

• Package write.snns was moved up from Devel.

• Package anm was resurrected from the Archive.

• Package Rcmdr.HH was renamed to RcmdrPlugin.HH.

• Package msgcop was renamed to sbgcop.

• Package grasper was moved to the Archive.

• Package tapiR was removed from CRAN.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

R Foundation News

by Kurt Hornik

Donations and new members

Donations

AT&T Research (USA)
Jaimison Fargo (USA)
Stavros Panidis (Greece)
Saxo Bank (Denmark)
Julian Stander (United Kingdom)

New benefactors

Shell Statistics and Chemometrics, Chester, UK

New supporting institutions

AT&T Labs, New Jersey, USA

BC Cancer Agency, Vancouver, Canada

Black Mesa Capital, Santa Fe, USA

Department of Statistics, University of Warwick, Coventry, UK

New supporting members

Michael Bojanowski (Netherlands)
Caoimhin ua Buachalla (Ireland)
Jake Bowers (UK)
Sandrah P. Eckel (USA)
Charles Fleming (USA)
Hui Huang (USA)
Jens Oehlschlägel (Germany)
Marc Pelath (UK)
Tony Plate (USA)
Julian Stander (UK)

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

News from the Bioconductor Project

by Hervé Pagès and Martin Morgan

Bioconductor 2.1 was released on October 8, 2007 and is designed to be compatible with R 2.6.0, released five days before Bioconductor. This release contains 23 new software packages, 97 new annotation (or metadata) packages, and many improvements and bug fixes to existing packages.

New packages

The 23 new software packages provide exciting analytic and interactive tools. Packages address Affymetrix array quality assessment (e.g., arrayQualityMetrics, AffyExpress), error assessment (e.g., OutlierD, MCRestimate), particular application domains (e.g., comparative genomic hybridization, CGHcall, SMAP; SNP arrays, oligoClasses, RLMM, VanillaICE; SAGE, sagenhaft; gene set enrichment, GSEABase; protein interaction, Rintact) and sophisticated modeling tools (e.g., timecourse, vbmp, maigesPack). exploRase uses the GTK toolkit to provide an integrated user interface for systems biology applications.

Several packages benefit from important infrastructure developments in R. Recent changes allow consolidation of C code common to several packages into a single location (preprocessCore), greatly simplifying code maintenance and improving reliability. Many new packages use the S4 class system. A number of these extend classes provided in Biobase, facilitating more seamless interoperability. The ability to access web resources, including convenient XML parsing, allows Bioconductor packages such as GSEABase to access important curated resources.

SQLite-based annotations

This release provides SQLite-based annotations in addition to the traditional environment-based ones. Annotations contain maps between information from microarray manufacturers, standard naming conventions (e.g., Entrez gene identifiers) and resources such as the Gene Ontology consortium. Eighty-six SQLite-based annotation packages are currently available. The names of these packages end with ".db" (e.g., hgu95av2.db). For an example of different metadata packages related to specific chips, view the annotations available for the hgu95av2 chip: http://bioconductor.org/packages/2.1/hgu95av2.html

New genome wide metadata packages provide a more complete set of maps, similar to those provided in the chip-based annotation packages. Genome wide annotations have an "org." prefix in their name, and are available as SQLite-based packages only. Five organisms are currently supported: human (org.Hs.eg.db), mouse (org.Mm.eg.db), rat (org.Rn.eg.db), fly (org.Dm.eg.db) and yeast (org.Sc.sgd.db). The *LLMappings packages will be deprecated in Bioconductor 2.2.

Environment-based (e.g., hgu95av2) and SQLite-based (e.g., hgu95av2.db) packages contain the same data. For the end user, moving from hgu95av2 to hgu95av2.db is transparent because the objects (or maps) in hgu95av2.db can be manipulated as if they were environments (i.e., functions ls, mget, get, etc. still work on them). Using SQLite allows considerable flexibility in querying maps and in performing complex joins between tables, in addition to placing the burden of memory management and optimized query construction in SQLite. As with the implementation of operations like ls and mget, the intention is to recognize common use cases and to code these so that R users do not need to know the underlying SQL query.
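
As a brief sketch of the environment-style access described above (only the hgu95av2.db package and its hgu95av2SYMBOL map are assumed; the probe identifiers are simply whatever ls() returns first):

  library(hgu95av2.db)
  probes <- ls(hgu95av2SYMBOL)[1:3]   # first few probe set identifiers
  mget(probes, hgu95av2SYMBOL)        # map probe sets to gene symbols
  get(probes[1], hgu95av2SYMBOL)      # single lookups work as well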

Looking ahead

For the next release (BioC 2.2, April 2008) all our annotations will be SQLite-based and we will deprecate the environment-based versions.

We anticipate increasing emphasis on sequence-based technologies like Solexa (http://www.illumina.com) and 454 (http://www.454.com). The volume of data these technologies generate is very large (a three day Solexa run produces almost a terabyte of raw data, with tens of gigabytes appropriate for numerical analysis). This volume of data poses significant challenges, but graphical, statistical, and interactive abilities and web-based integration make R a strong candidate for making sense of, and making fundamental new contributions to, understanding and critically assessing this data.

The best Bioconductor packages are contributed by our users, based on their practical needs and sophisticated experience. We look forward to receiving your contribution over the next several months.

Hervé Pagès
Computational Biology Program
Fred Hutchinson Cancer Research Center
[email protected]

Martin Morgan
Computational Biology Program
Fred Hutchinson Cancer Research Center
[email protected]

Past Events: useR! 2007

Duncan Murdoch and Martin Maechler

The first North American useR! meeting took place over three hot days at Iowa State University in Ames this past August.

The program started with John Chambers talking about his philosophy of programming: our mission is to enable effective and rapid exploration of data, and the prime directive is to provide trustworthy software and "tell no lies". This was followed by a varied and interesting program of presentations and posters, many of which are now online at http://useR2007.org. A panel on the use of R in clinical trials may be noteworthy because of the related publishing by the R Foundation of a document (http://www.r-project.org/certification.html) on "Regulatory Compliance and Validation" issues. In particular, "21 CFR part 11" compliance is very important in the pharmaceutical industry.

The meeting marked the tenth anniversary of the formation of the R Core group, and an enormous blue R birthday cake was baked for the occasion (Figure 1). Six of the current R Core members were present for the cutting, and Thomas Lumley gave a short after-dinner speech at the banquet.

Figure 1: R Core turns ten.

There were two programming competitions. The first requested submissions in advance, to produce a package useful for large data sets. This was won by the team of Daniel Adler, Oleg Nenadic, Walter Zucchini and Christian Gläser from Göttingen. They wrote the ff package to use paged memory-mapping of binary files to handle very large datasets. Daniel, Oleg and Christian attended the meeting and presented their package.

The second competition was a series of short R programming tasks to be completed within a time limit at the conference. The tasks included relabelling, working with ragged longitudinal data, and writing functions on functions. There was a tie for first place between Elaine McVey and Olivia Lau.

Three kinds of T-shirts with variations on the "R" theme were available: the official conference T-shirt (also worn by the local staff) sold out within a day, so the two publishers' free T-shirts gained even more attraction.

Local arrangements for the meeting were handled by Di Cook, Heike Hofmann, Michael Lawrence, Hadley Wickham, Denise Riker, Beth Hagenman and Karen Koppenhaver at Iowa State; the Program Committee also included Doug Bates, Dave Henderson, Olivia Lau, and Luke Tierney. Thanks are due to all of them, and to the team of Iowa State students who made numerous trips to the Des Moines airport carrying meeting participants back and forth, and who kept the equipment and facilities running smoothly; useR! 2007 was an excellent meeting.

Duncan Murdoch, University of Western Ontario
[email protected]
Martin Maechler, ETH Zürich
[email protected]

Forthcoming Events: useR! 2008

The international R user conference 'useR! 2008' will take place at the Universität Dortmund, Dortmund, Germany, August 12-14, 2008.

This world-meeting of the R user community will focus on

• R as the 'lingua franca' of data analysis and statistical computing,

• providing a platform for R users to discuss and exchange ideas on how R can be used to do statistical computations, data analysis, visualization and exciting applications in various fields,

• giving an overview of the new features of the rapidly evolving R project.

The program comprises invited lectures, user-contributed sessions and pre-conference tutorials.

Invited Lectures

R has become the standard computing engine in more and more disciplines, both in academia and the business world. How R is used in different areas will be presented in invited lectures addressing hot topics. Speakers will include Peter Bühlmann, John Fox, Andrew Gelman, Kurt Hornik, Gary King, Duncan Murdoch, Jean Thioulouse, Graham J. Williams, and Keith Worsley.

User-contributed Sessions

The sessions will be a platform to bring together R users, contributors, package maintainers and developers in the S spirit that 'users are developers'. People from different fields will show us how they solve problems with R in fascinating applications. The scientific program is organized by members of the program committee, including Micah Altman, Roger Bivand, Peter Dalgaard, Jan de Leeuw, Ramón Díaz-Uriarte, Spencer Graves, Leonhard Held, Torsten Hothorn, François Husson, Christian Kleiber, Friedrich Leisch, Andy Liaw, Martin Mächler, Kate Mullen, Ei-ji Nakama, Thomas Petzoldt, Martin Theus, and Heather Turner. The program will cover topics of current interest such as

• Applied Statistics & Biostatistics
• Bayesian Statistics
• Bioinformatics
• Chemometrics and Computational Physics
• Data Mining
• Econometrics & Finance
• Environmetrics & Ecological Modeling
• High Performance Computing
• Machine Learning
• Marketing & Business Analytics
• Psychometrics
• Robust Statistics
• Sensometrics
• Spatial Statistics
• Statistics in the Social and Political Sciences
• Teaching
• Visualization & Graphics
• and many more.

Call for pre-conference Tutorials

Before the start of the official program, half-day tutorials will be offered on Monday, August 11.

We invite R users to submit proposals for three hour tutorials on special topics on R. The proposals should give a brief description of the tutorial, including goals, detailed outline, justification why the tutorial is important, background knowledge required and potential attendees. The proposals should be sent before 2007-10-31 to [email protected].

Call for Papers

We invite all R users to submit abstracts on topics presenting innovations or exciting applications of R. A web page offering more information on 'useR! 2008' is available at:

http://www.R-project.org/useR-2008/

Abstract submission and registration will start in December 2007.

We hope to meet you in Dortmund!

The organizing committee:
Uwe Ligges, Achim Zeileis, Claus Weihs, Gerd Kopp, Friedrich Leisch, and Torsten Hothorn
[email protected]

Forthcoming Events: R Courses in Munich

by Friedrich Leisch

The department of statistics at Ludwig–Maximilians–Universität München, Germany, is offering a range of R courses to practitioners from industry, universities and all others interested in using R for data analysis, starting with R basics (November 8–9, by Torsten Hothorn and Friedrich Leisch), followed by R programming (December 12–13, by Torsten Hothorn and Friedrich Leisch), machine learning (January 23–24, by Torsten Hothorn, Friedrich Leisch and Carolin Strobl), econometrics (spring 2008, by Achim Zeileis), and generalized regression (spring 2008, by Thomas Kneib).

The regular courses are taught in German; more information is available from http://www.statistik.lmu.de/R/.

Friedrich Leisch
Ludwig-Maximilians-Universität München, Germany
[email protected]

Editor-in-Chief:
Torsten Hothorn
Institut für Statistik
Ludwig–Maximilians–Universität München
Ludwigstraße 33, D-80539 München
Germany

Editorial Board:
John Fox and Vincent Carey.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor and all other submissions to the editor-in-chief or another member of the editorial board. More detailed submission instructions can be found on the R homepage.

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631