anova with binary variables - alternatives for a dangerous...

30
Universität zu Köln Anova with binary variables - Alternatives for a dangerous F-test Version 1.0 (11.6.2018) Haiko Lüpsen Regionales Rechenzentrum (RRZK) Kontakt: [email protected]

Upload: donhu

Post on 27-Aug-2019

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Universität zu Köln

Anova with binary variables -Alternatives for a dangerous F-test

Version 1.0(11.6.2018)

Haiko Lüpsen

Regionales Rechenzentrum (RRZK)

Kontakt: [email protected]

Page 2: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Introduction 1

ANOVA with binary variables -Alternatives for a dangerous F-test

Abstract

Several methods are compared with the parametric F-test to perform an ANOVA with a binary dependent variable in 2-way layouts. Equal and unequal cell counts as well as several different effect models are taken into account. Special attention has been paid on heterogeneous condi-tions which are caused by nonnull effects through the relation of the binomial probability and its variance. For between subject designs Puri & Sen‘s L statistic, Brunner & Munzel‘s ATS, the χ2-test of loglinear models, the logistic and the probit regresssion in conjunction with the Wald test, the LR test and a compromise test as ANOVA-like tests are considered. The L stati-stic is recommended because the F-test cannot keep the type I error under control if there are nonnull effects in unbalanced designs. For mixed designs the Huynh-Feldt adjustment, Ho-telling Lawley‘s multivariate test, Puri & Sen‘s L statistic, Brunner & Munzel‘s ATS and Koch‘s ANOVA are considered. In addition several GLMM and GEE models are included. The ATS turned out to be the only method which can handle all, especially heterogeneous conditions and large designs, where the F-test fails. The GLMM and GEE methods were not able to com-pete with the tests listed above, from which the GLMM together with the type II Wald test proved to be the best one.

Keywords: ANOVA, binary, dichotomous, Puri & Sen, ATS, GLMM, GEE, logistic, probit, regression.

1. IntroductionThe analysis of variance (ANOVA) is one of the most important and frequently used methods of applied statistics, mainly for the analysis of designs with only grouping factors (between subject designs) and of designs with grouping and repeated measurements factors (mixed or split-plot designs). There is the parametric version and there are nonparametric methods as well. The first one has assumptions, of course. These are essentially normality of the residuals, homo-geneity of the variances and in the case of repeated measurements additionally sphericity and homogeneity of the covariance matrices over the groups. But what to do if the dependent va-riable y is dichotomous, e.g. with values yes or no, or 1 and 0?

Due to the familiarity and simplicity of the ANOVA methodology, one could trust in the robust-ness of the parametric tests. „A test is called robust when its significance level (Type I error probability) and power (one minus Type-II probability) are insensitive to departures from the assumptions on which it is derives.“ (See Ito, 1980). And one of the first who investigated the applicability of the parametric F-test to a dichotomous criterion was Lunney (1970). His simu-lations showed that for 1-, 2- and 3-factorial designs the type I error rate is controlled as long as n 20 for 0.2 p 0.8 and n 40 for other values of p (p being the percentage of one of the outcomes y). And the power is satisfying as long as n is not too small. This seems reasonable as on one side the homogeneity of the variances is the most essential assumption - noting that the variance p(1-p) depends on the mean p - and on the other side for 0.25 p 0.75 variances of a binomial distributed outcome can be regarded as equal. Unfortunately Lunney‘s study has a fundamental restriction: he examined only equal sample sizes and type I error rates if there are no other effects, therefore neglecting the cases of unequal variances. Beside that only between subject designs had been studied. D‘Agostino (1971) wrote a detailed critic on Lunney‘s paper.

≥ ≤ ≤ ≥

≤ ≤

Page 3: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Introduction 2

Nevertheless it remains one of the most important works on this subject. Only decades later Jaeger (2008) expressed concern on the use of the parametric ANOVA for the analysis of a bina-ry outcome, but mainly on a variation of this procedure: Cochran (1940) proposed first compu-ting the success proportion and applying an arcsine-square-root transformation before performing the ANOVA. But which are the alternatives?

First, one of the nonparametric ANOVA methods could be applied. See Luepsen (2017) for an overview. A simple rank transform (RT) or an inverse normal transformation (INT) of y would make no difference compared to the parametric F-test because these transform the two values of y just into two other different values. The aligned rank transform (ART) is not reasonable for dichotomous variables as Luepsen (2016) pointed out. But the Puri & Sen-method (e.g. Puri & Sen, 1985), often referred as L statistic, is one alternative. It can be viewed as a generalization both of the Kruskal-Wallis H test and the Friedman ANOVA. Here also applying the INT-trans-formation on the Puri & Sen-method, resulting in the van der Waerden-test, makes no difference compared to Puri & Sen‘s test. Brunner & Munzel (2002) developed two tests to compare sam-ples by means of comparing the relative effects (for details see next chapter): the ATS (ANO-VA-type statistic) and the WTS (Wald-type statistic).

Furthermore there are the logistic regression and the probit regression (see e.g. Agresti, 2002, and Haberman, 1978), ideal models for an ANOVA with a dichotomous dependent variable. One disadvantage of these methods is the large n required. Recommendations say e.g. 10 per parameter and a minimum of 100 (see. e.g. Peng et al., 2002).

Finally there is the χ2 test. But this one also needs a large sample size to prevent that too many of the expected cell frequencies will be 1.0 or less. Additionally it should be remarked that for the analysis of factorial designs loglinear models must be used instead which require more effort. Therefore Hsu & Feldt (1969), who investigated the effect of scale limitations on the type I error rate, do not recommend the χ2-test for the analysis of ANOVA designs. At this point it should be remarked that the χ2- and the F-test are algebraically similar and under the null hy-pothesis asymptotically equivalent as D‘Agostino (1972) has shown.

So far only methods suitable for between subjects designs have been considered. Of course the parametric F-test, but also the Puri & Sen-method and the ATS may be applied to repeated measures designs. At this point Cochran‘s Q-test should be mentioned. But this one is identical to the Friedman test in the case of a binary dependent variable (see e.g. Mandeville, 1972). And on the other side the Puri & Sen-test is a generalization of the Friedman test. So Cochran‘s test is included in Puri & Sen‘s L-statistic. Reflecting that unequal binomial probabilities pi result in unequal variances, methods should be considered that do not assume sphericity, e.g. the Huynh-Feldt adjustment for the parametric ANOVA and the multivariate statistic by Hotelling-Lawley. Mandeville (1972) tried in his study the multivariate statistic by Hotelling-Lawley and Myers et al. (1982) applied an adjustment of the degrees of freedom to the F- and Q-test to com-pensate for the case of nonsphericity.

Another solution to this problem is supplied by G. Koch. On one side he proposed several nonparametric ANOVA procedures (Koch, 1969). On the other side he developed a general framework for the analysis of experiments with repeated measurements of categorical data (Koch et al., 1977), also known as marginal probability model, which is based on the weighted least squares methodology by Grizzle, Starmer & Koch (GSK-method). Guthrie (1981) speciali-zed this method for the case of binary variables. Both, of course, include the application on dichotomous variables. Unfortunately both methods require large sample sizes: ni 25 is recommended (Koch et al., 1977). And, the tests of main and interaction effects are not ortho-

Page 4: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Introduction 3

gonal. By the way, Kenward & Jones (1992) compared these together with a couple of other approaches to the analysis of categorical data in repeated measures designs. Other comparisons come from Omar et al. (1999), Allan (1997) as well as Neuhaus (1992) who use sample datasets without giving general recommendations.

Concerning the logistic regression there exist also corresponding methods for dependent samp-les (see e.g. Hosmer & Lemeshow, 2013): GEE (Generalized Estimating Equations), estab-lished by Liang & Zeger (1986), and GLMM (Generalized LinearMixed Models, sometimes also called MLM, multi level models) by Harville (1977). Both are extensions of the generalized linear models GLM allowing correlated responses. Pendergast et al. (1996) give an overview of extensions of the logistic regression to correlated samples. GEE can be considered as an advan-cement of the above mentioned marginal probability model and is based on LS estimation. It has appealing and robust properties in parameter estimation: Unlike the repeated measures ANOVA, GEE does not require the outcome variable to have a particular distribution. GEE al-lows specification of the correlation structure, among others the compound symmetry (CS) and the autoregressive (AR(1)) correlation structure. And the GEE approach produces virtually unbiased estimates even if the correlation structure is misspecified (see Emrich & Piedmonte, 1992 and Pan & Connett, 2002). On the other side, as the method is based on large sample as-ymptotic theory, it is not surprising that the tests of the parameters are generally liberal for small samples n<50 (see e.g. Qu et al., 1994 and Stiger et al., 1998). That induced a number of statis-ticians to think about an amelioration for small samples. Some of their work is summarized by Wang et al. (2016) and Fan & Zhang (2014). GLMM is also based on large sample asymptotic theory which causes the same deficiencies in the case of small samples (see e.g. Li & Redden, 2015). In contrary to GEE, GLMM uses ML estimation methods which lead to a number of dif-ferent solutions and programs. One advantage of this approach is the flexibilty in handling miss-ing data though such datasets are not considered here. While both methods primarily estimate and test the regression coefficients of the model, an ANOVA-like test is compulsory for each of the effects, e.g. the Wald-test. Finally two major disadvantages of these methods should be mentioned: First, due to different estimating methods there are a number of programs for GEE and GLMM which yield unfortunately different results, and secondly, due to the complex com-putations, the failure to converge in fitting GEE and GLMM is a common occurrence (Fox & Weisberg, 2011).

There are several studies comparing the different methods. First the case of between subject de-signs. Malhotra (1983) considered regression methods and discriminant analysis which is also used for the prediction of a binary outcome, but not suitable for the analysis of ANOVA designs. He sees the logistic approach superior to the probit regression. Powers et al. (1978) come to the same result. Malhotra reported in his publication also quite a number of comparative studies and gave the results in a clearly arranged table. Gessner et al. (1988) studied the same methods but under special situations, covariance heterogeneity, multicollinearity and nonnormal joint distribution of predictors. But they give no recommendation. Whitmore & Schumacker (1999) compared the parametric ANOVA with the logistic regression in the context of differential item functioning (DIF). They favored the logistic approach. Hsu & Feldt (1969) compared in their above mentioned study the ANOVA F-test with the χ2-test. They state: If a sample size of 50 or more per treatment is used data from a two-point scale such that .25 < p < .75 may be validly analyzed via analysis of variance technique, and even for smaller sample sizes the error rates exceed the nominal level of 5 percent by 10% at most. Especially for very small sample sizes or when the design involves more than one factor, the F-test has an advantage over the χ2-test of independence. A summary of these results can be found at Glass et al. (1972).

For the case of within subjects designs the studies restrict to the comparison of Cochran‘s Q-

Page 5: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Introduction 4

test with the F-test. First of all two assumptions for the Q-test: The population covariances for all pairs of conditions are the same (Cochran, 1950) and the variance may not be zero, i.e. all values are equal. Mandeville (1972) compared the two tests for .1 < p < .9 and showed that the F-Test has generally the larger power and for the number of treatments k > 5 the lower type I error rate. Further he demonstrated that its error rate offends the nominal level of 5 percent by a maximum of 20% as long as the total n is larger than 60. Unfortunately both tests offend the nominal 1 percent level. Additionally he included the multivariate statistic by Hotelling-Law-ley. He observed that for 3 treatments the procedure behaves fairly well, but that for 6 or more treatments in some instances the errors are very large. Myers et al. (1982) compared the two tests for the case of heterogeneous covariance matrices and considered additional adjustments of the degrees of freedom for these tests to compensate the degree of heterogeneity. They re-vealed an improvement by their adjustments in these cases. Additionally they cited several other studies on these two tests, the results of which seemed to be less useful because of too small sample sizes. Finally Seeger & Gabrielsson (1968) proved by means of a simulation study that the F-test as a rule gives at least as good results as the Q-test with regard to the type I error.

Meanwhile a large number of studies are concerned with GEE and GLMM, but only very few compare these methods with the parametric F-test. Stiger et al. (1998) analyzed ordinal data in a 2*4 split-plot design with small samples sizes (20, 40, 60, 80) and examined the performance with respect to error rates and power of ANOVA (with and without the Huynh-Feldt-adjust-ment), Manova, GEE and a weighted least squared method by Koch. Though a 4-point scale had been used the results may be adapted to binary data. The ANOVA with adjustment as well as the Manova performed well for all sample sizes, while the unadjusted ANOVA behaved some-times slightly liberal if sphericity was not given. In contrast the GEE exceeded the type I error rates generally for ni<60. Concerning the power, ANOVA was overall superior while Manova had the lowest rates. Ma et al. (2012) also compared GEE with ANOVA in their study, consi-dering continuous and binary variables, and, surprisingly, found that GEE keeps the type I error rate even for small ni and has the largest power. Mancl & DeRouen (2001) summarized a number of studies examining the behaviour of GEE in small samples and concluded that for ni<50 the type I error rate is generally much too high. McNeish & Stapleton (2016) compared, among other methods, GEE and GLMM for very small sample sizes ( ), but unfortu-nately for a continuous outcome. They found that uncorrected GEE is a poor choice, which also applies to most of the small sample corrections except the version by Morel et al. (2003). This one could control the type I error rate at the cost of a diminished power while GLMM provided satisfying results. Just to mention their large list of studies analyzing these two methods. McNeish & Harring (2017) confirmed these results for binary variables. Fan & Zhang (2014) dealt with another problem associated with GEE and GLMM: a suitable ANOVA-type test for small samples. They suggested a test based on the work of Akritas et al. (1997) and showed in their study of GEE for repeated measures models with that this one could control the error rates in most situations while the Wald-test produced rates up to 80% for the trial effects. Concerning the previously cited corrections for GEE there are several studies beside those mentioned above, e.g. by Fan et al. (2013) or by Lu et al. (2007) which lead to different results, partly because of the different effect tests used. Littell (2002) compared the ANOVA with maximum likelihood and generalized least squares estimation, mainly from a theoretical point of view, and favored the modern techniques because of their flexibility matching the mo-del.

Finally a couple of warnings in this context: „When applied to modeling binary responses, diffe-rent software packages and even different procedures within a package may give quite different results“ (Zhang et al., 2011). „This kind of convergence problem is a common occurrence in

4 n≤ i 14≤

5 n≤ i 20≤

Page 6: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Methods to be compared 5

mixed-effects modeling“ (Fox & Weisberg, 2015) who also report that SAS (Proc NLMIXED) and R (lme4 and glmmML) yield different results for the same datasets though they all use the same integral approximation approach. By the way, sincere convergence problems are reported in quite a number of publications, e.g. by Beckman & Stroup (2003), who also tell: „The SAS-available GLMM algorithms considered in this paper performed poorly with fewer than 20 subjects per treatment. ... This raises significant questions about the viability of studies with few subjects and binary data“.

So this study tries on one side to fill the gaps from the Lunney study (unequal sample sizes, impact of other effects and split plot designs) and to compare the results with those produced with the other methods mentioned above. However those methods are focused which are easily available in the statistical packages and give results for the global hypotheses. But on the other side the number of designs, ANOVA arrangements and different models had to be limited. Whereas Lunney (1970) studied 4 different levels for the factors, 5 two-way and 3 three-way designs as well as nested designs, and e.g. Mandeville (1972) compared 4 fixed and several non-patterned correlation matrices, here only a few ANOVA settings could be considered while additionally the impact of effects of other factors is studied. So the influence of a number of parameters is checked here: equal/unequal cell counts, small/large number of cells, effect mo-dels, success rate p, sample size n, and additionally for within subject designs the pattern of the correlations between the repeated measurements. But not in detail. Therefore further investiga-tions may be helpful.

2. Methods to be comparedIt follows a brief description of the methods compared in this paper. More information, especially how to use them in R or SPSS can be found in Luepsen (2015).

The parametric F-test

In the case of a between subject design the 2-factorial ANOVA model for a dependent variable y with N observations shall be denoted by

with fixed effects αi (factor A, i=1,..,I), βj (factor B, j=1,..,J), αβij (interaction AB), error eijk (k=1,..,nij), cell counts nij and . The parameters αi , βj and αβij with the restrictions

, , can be estimated by means of a linear model y‘ = X p‘ + e‘ using the least squares method, where y are the values of the dependent variable, p is the vector of the parameters, X a suitable design matrix and e the random variable of the errors. If the contrasts for the tests of the hypotheses HA (αi=0), HB (βi=0) and HAB (αβij=0) are orthogonal the resul-ting sum of squares SSA, SSB, SSAB of the parameters are also orthogonal and commonly called type III SSq. They are tested by means of the F-distribution. In case of equal sample sizes the sum of squares as well as the mean squares can be easily computed as

and the F-ratios as

yijk αi βj αβij eijk+ + +=

N nij=

αi 0= βj 0= αβij 0=

SSANI---- yi · ·

y–( )2= SSB

NJ---- y j· ·

y–( )2= SSAB

NIJ----- yij ·

yi · ·y j· ·

–– y+( )2=

MSA SSA I 1–( )⁄= MSB SSB J 1–( )⁄= MSAB SSAB I 1–( ) J 1–( )( )⁄=MSerror yijk yij ·

–( )2 N IJ–( )⁄=

FA MSA MSerror⁄= FB MSB MSerror⁄= FAB MSAB MSerror⁄=

Page 7: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Methods to be compared 6

where , are the level means of factor A and B, are the cell means and is the grand mean (see e.g. Winer, 1991).

In the case of a mixed design with one grouping factor A and one repeated measures factor B, often called trial factor, the 2-factorial ANOVA model for a dependent variable y with

observations shall be denoted by

with αi, βj, eijk as above, ni subjects per group and a subject specific variation τik (k=1,..,ni). The sums of squares and mean squares of the effects are the same as above, if N is substituted by NJ, due to the different definition of N, whereas those for the error terms are different:

and the F-ratios as To make up for heterogeneous variances, i.e. here unequal pi, on factor B, an appropriate adjust-ment of the degrees of freedom for the F-test is applied. Here the Huynh-Feldt adjustment is chosen (see e.g. Winer et al., 1991).

Puri & Sen tests (L statistic)

These are generalizations of the well known Kruskal-Wallis H test (for independent samples) and the Friedman test (for dependent samples) by Puri & Sen (1985), often referred as L stati-stic. A good introduction offer Thomas et al. (1999). The idea dates back to the 60s, when Bennett (1968), Scheirer, Ray & Hare (1976) and later Shirley (1981) generalized the H test for multifactorial designs. It is well-known that the Kruskal-Wallis H test as well as the Friedman test can be performed by a suitable ranking of y (see e.g. Winer,1991), conducting a parametric ANOVA and finally computing χ2-ratios using the sum of squares. In fact the same applies to the generalized tests.The χ2-ratios are computed in the case of only grouping factors as

and in the case of a mixed design for the tests of A, B and AB as

Here SSA, SSB, SSAB are the sum of squares as outlined before, but computed for R(y), the ranks of y, where midranks are used in case of tied values. MSbetween and MSwithin are the mean squares defined previously, and MStotal the variance of R(y). The degrees of freedom are those of the numerator of the corresponding F-test.

The major disadvantage of this method is the lack of power for any effect in the case of other nonnull effects in the model. The reason: In the standard ANOVA the denominator of the F-values is the residual mean square which is reduced by the effects of other factors in the model. In contrast the mean squares in the denominator of the χ2-tests of Puri & Sen‘s L statistic inc-rease with effects of the other factors, thus making the ratio of the considered effect and therefo-re also the χ2-ratio smaller. A good review of articles concerning this test can be found in the study by Toothaker & De Newman (1994).

yi · ·y j· ·

yij ·y

N ni=yijk αi βj αβij τik βτijk+ + + + eijk+=

MSbetween J yi · k yi · ·–( )2

k

i= MSwithin yijk yi · k yij ·

–– yi · ·+( )2

j

k

i=

FA MSA MSbetween⁄= FB MSB MSwithin⁄= FAB MSAB MSwithin⁄=

χeffect2 SSeffect

MStotal-----------------=

χA2 SSA

MSbetween------------------------= χB

2 SSBMSwithin---------------------= χAB

2 SSABMSwithin---------------------=

Page 8: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Methods to be compared 7

Brunner, Munzel and Puri (ATS)

Based on the relative effect (see Brunner & Munzel, 2002) the authors developed two tests to compare samples by means of comparing the relative effects: the approximately F-distributed ATS (ANOVA-type statistic) and the asymptotically χ2-distributed WTS (Wald type statistic). The ATS has preferable attributes e.g. more power (see Brunner & Munzel (2002). The relative effect of a random variable X1 to a second one X2 is defined as p+ = , i.e. the probability that X1 has smaller values than X2 . As the definition of relative effects is based only on an ordinal scale of y this method is suitable also for variables of ordinal or dichotomous scale. For between subject designs detailed descriptions can be found in Brunner & Munzel (2002, chapter 3) as well as in Akritas, Arnold and Brunner (1997). These tests have been extended to reapeated measures designs by Brunner et al. (1999). Noguchi et al. (2012) described the pro-cedures which involve a lot of matrix algebra, and developed the corresponding R package nparLD.

Logistic Regression and Probit Regression

Instead of building a model for y these methods model the probability of y=1:

logistic regression

probit regression

Here xi (i=1,..,P) are predictors which correspond in an ANOVA environment to design va-riables, βi are the regression parameters, and Φ the normal distribution. The computational pro-cedures are described e.g. in Agresti (2002) and not repeated here. An ANOVA-type test, into which all βi belonging to the same effect are summarized, is available by means of the Wald test (see below) or LR (likelihood ratio) test. Usually the latter is preferred (see e.g. Agresti, 2002, and Fox, 1997), especially for smaller samples as analyzed in this study. But a warning for R users: the standard function for an ANOVA-type test, anova, returns sequential (type I) tests, running from 1st factor to the interaction, and the result is therefore dependent from the order of the factors for unbalanced designs. Normally the function Anova (from package car) should be preferred which yields type II F-tests.

χ2-test and loglinear model

For between subject designs Pearson‘s χ2-test is performed. The test of the main effects is recei-ved from the classical test of independence for two variables. And for the test of the interaction of factors A and B a loglinear model including all 2-way interactions is fitted which yields the desired result. For details see e.g. Agresti (2002).

Hotelling-Lawley‘s multivariate test

This test is often used for the analysis of repeated measures designs because it does not require a compound symmetry of the variance-covariance matrix of y. Instead a multivariate normal distribution is demanded. Therefore this test does not seem appropriate for the analysis of dicho-tomous dependent variables. Nevertheless various authors tried it with differing success (see e.g. Mandeville, 1972, and Stiger, 1998). First the differences of two consecutive variables are computed d1ik = yi2k - yi1k, , d2ik = yi3k - yi2k, ,.. (for i=1,..,I and k=1,..,ni). Then d1 , d2 ,.. are checked for 0 by means of Hotelling Lawley‘s test (see e.g. Winer et a.., 1991).

P X1 X2≤( )

P y 1=( ) βixi( )exp 1 βixi( )exp+( )⁄=

P y 1=( ) Φ βixi( )exp( )=

Page 9: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Methods to be compared 8

Koch‘s ANOVA

Koch (1969 and 1970) proposed a couple of nonparametric procedures for split-plot designs based on a multivariate version of the Kruskal-Wallis test and a nonparametric analogue of the one-way Manova based on the trace. There are several variants for the cases with and without compound symmetry, as well as with and without independence of the factors A and B. The ver-sion used here assumes an interaction, but no compound symmetry. A detailed description can be found in Koch (1969) and shall not be repeated here.

GEE (Generalized Estimating Equations)

The GEE method (Liang & Zeger, 1986) can be considered as an extension of the logistic re-gression to designs with repeated measurements. The specification of the model requires the type of correlation matrix of y. Possible correlation structures are among others the compound symmetry (CS), also often named exchangeable, and the autoregressive (AR(1)), though most authors remark that an incorrect specification has scarcely an impact on the results (see e.g.Fan et al., 2013). A short sketch of the model: Let

(k=1,..,N , j=1,..,J and i=1,...,P)

or equivalently

be the marginal expectation of the binary y: yjk = pjk + ejk with E(ejk) = 0 and Var(ejk) = Vk , and with P predictors and corresponding regression parameters βi (k=1,...,P). Here, xijk is the de-

sign matrix of subject k, and with a correlation matrix Rk(α) for yk , which can be parametrized by a vector α, and Ak = diag(pk1(1-pk1), pk2(1-pk2),...). Then the GEE estimates of βk are the solution of

where

and yk = (yk1, yk2,...), pk = (pk1, pk2,...), β = (β1, β2,...) (see e.g. Emrich & Piedmonte, 1992). A detailed description of the model and the estimation process can be found in McNeish, D. & Stapleton, L.M. (2016). Also to mention Ziegler et al. (1998) who summarize a number of va-riants and different estimation methods for GEE. For the tests of the ANOVA effects the vari-ance-covariance matrix of the estimation βk is essential. For this purpose normally the sandwich estimator by Liang & Zeger (1986) is used. But this has the disadvantage that for smaller N<50 the estimates are too small which leads to liberal tests of the βk . A number of authors proposed bias-corrected sandwich estimators, among others Fay & Graubard (2001), Kauermann & Car-roll (2001), Mancl & DeRouen (2001) and Morel et al. (2003). A comparison of them can be found in Fan et al. (2013). But as Fan & Zhang (2014) had shown, these are also unsatisfying. Other solutions came from Pan & Wall (2001), Gosho et al. (2014) and Wang & Long (2011) who assumed additionally that all subjects have a common correlation structure. Since their estimate cov(yk) is obtained by pooling observations across different subjects, whereas (yk-μk)(yk-μk)´, the estimate by Liang & Zeger and the first mentioned authors, is only based on one observation, it is expected they are more efficient. Pan (2001) showed that in the extreme case when all subjects have different correlation structures the estimates by Pan and Liang & Zeger are the same. In a recent publication Wang et al. (2016) compared all of the currently available methods.

pjk βixijk( )exp 1 βixijk( )exp+( )⁄=

pjk1 pjk–--------------- ln βixijk

i

P

=

Vk Ak1 2⁄ Rk α( )Ak

1 2⁄=

DkVk yk pk–( )k

0= Dk β∂∂pk=

Page 10: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Methods to be compared 9

GLMM (Generalized LinearMixed Models)

Also the GLMM method can be considered as an extension of the logistic regression to designs with repeated measurements. The model looks rather similar to the GEE model:

for the binary y: yjk = pjk + ejk with E(ejk)=0 and Var(ejk)= Vk . But here, in addition to the fixed effects βi (i=1,...,P) with design matrix xijk , there are also random effects γik (i=1,...,Q) for subject k with a design matrix zijk , e.g. for modelling subject and repeated measures effects, and to reflect the correlation among the observations of the same subject, often called cluster in this context. Thus, a correlation structure, as for GEE, has not to be stated here. γik are mul-tivariate normal distributed with E(γik )=0 und independent from the ejk which do not depend on the repeated measurements. Details, especially concerning the ML estimation, can be found at Tuerlinckx et al. (2006) and Song, X.Y. & Lee, S.Y. (2006). As mentioned above there are several estimation methods leading not necessarily to the same results, e.g. Penalized qua-silikelihood, Laplace approximation, Gauss-Hermite quadrature or Markov chain Monte Carlo. Similar to GEE, here also the method is based on large sample asymptotic theory with the consequence that for small N<50 the tests for βk and γik are sometimes liberal. Li & Red-den (2015) discuss a number of solutions for this problem which lies in the estimation of the denominator degrees of freedom (ddf) for the F-test into which the Wald test is transformed. The most popular solution is probably the rather complicated one by Kenward & Roger. The simpliest one uses ddf=N-rank(C) where C is the contrast matrix. Additional ANOVA-like tests are mentioned below.

Wald tests

The primary results from an analysis using logistic regression, probit regression, the GEE or GLMM method are the estimates of the model parameters βi together with their standard errors and a significance test of βi=0 for each i, normally by means of the Wald test. But in this context an ANOVA-like test is desired into which all βi belonging to the same effect are summarized. On one side there is Wald‘s χ2-test in the variant for several parameters (see e.g. Carr & Chi, 1992 and Pan & Wall, 2001):

which is approximately χ2-distributed with rank(C) degrees of freedom, and where are the estimates of β, Vβ is the variance-covariance-matrix of β, and C a contrast matrix, and in its simpler form (see e.g. Kenward & Jones, 1992)

where and Vβ are restricted to those i belonging to the effect of interest. Fan & Zhang (2014) found that the above test is too liberal for small sample sizes and proposed a different one, based on the work of Akritas et al. (1997) and Brunner et al. (1997):

The expression (c1/c2)Q is approximately χ2-distributed with f degrees of freedom where

pjk1 pjk–--------------- ln βixijk

i

P

γikzijki

Q

+=

Cβ̂( )' CVβC'( ) 1– Cβ̂( )

β̂

β'ˆ Vβ1– β̂

β̂

Q β̂'C C'C( ) 1– C'β̂=

c1 tr TVβ( )=

c2 tr TVβTVβ( )=

Page 11: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

The Study 10

The tests above cannot be simply calculated from the usual table of coefficients which inclu-de normally either a normal z- or a χ2-value. Therefore finally it should be remarked that, as-suming the as independent, i.e. their covariances being zero, an ANOVA-like test can be computed by summing up either the χ2- or the z2-values which has the same degrees as the numerator from the corresponding F-test. While the Wald test above is equivalent to a Type III test, Fox & Weisberg (2011, chapter 4.4.4) favored a Type II Wald test which is offered in the function Anova of the R package car. It is based on the likelihood ratio method, using analysis of deviance tests. This one conforms to the principle of marginality and is most powerful in the case of no interaction. Using it the main effects may be overestimated, in contrary to the inter-action effects.For the application of GEE estimation Pan (2001) derived additionally the exact degrees of freedom for the above described Wald test for those cases where a robust estimator of the cova-riance matrix Vβ , mentioned in a previous section, is used: the sum of eigenvalues of , where VP and VLZ are the covariance estimates by Pan and Liang & Zeger respectively. As this test requires an invertible covariance matrix which is often not guaranteed for smaller ni it is not recommendable in this context.

3. The StudyThis is a pure Monte Carlo study, which means a couple of designs and theoretical distributions had been chosen from which a large number of samples had been drawn by means of a random number generator. These samples had been analyzed using the various ANOVA methods. Con-cerning the number of different situations, for example, distributions, equal/unequal cell counts, effect sizes, relations of means, variances, and cell counts, one had to restrict to a minimum, as the number of resulting combinations produce an unmanageable amount of information. Therefore, not all influencing factors could be varied. The main focus had been laid upon the control of the type I error rates for α = 0.05 and α = 0.01 as well as upon a comparison of the power for the various methods in the following situations:

There are two major designs: a between subject and a mixed (split-plot) design. For both the following subdesigns are analysed:• a 2*4 design (“small design“) with equal cell counts (balanced) and one with unequal cell

counts and a ratio max(nij)/min(nij) of 3 (unbalanced), and• a 4*5 design (“large design“) with equal cell counts (balanced) and one with unequal cell

counts and a ratio max(nij)/min(nij) of 4.

Mainly the aspect will be put on differences between the small balanced and the large unbalanced designs. So, if results show differences between them in a special situation, this may have two reasons. In such cases additionally the results for the corresponding small unbalanced and the large balanced designs are consulted to get things straight concerning the difference.

The binomial probabilities pi had been set to 0.5, 0.8 and 0.9, which is equivalent to 0.5, 0.2 and 0.1, as for 0.25 p 0.75 the variances of a binomial distributed outcome can be regarded as equal. For the split-plot design the following correlation structures had been chosen which are assumed equal for all groups, while heterogeneous correlation matrices over the groups of factor A were not considered:

f c12 c2⁄=

T C C'C( ) 1– C'=β̂

β̂

VP1– VLZ

≤ ≤

Page 12: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

The Study 11

• exchangeable (compound symmetry) with r=0.3, a value that seems realistic and had often been chosen (see e.g. Emrich & Piedmonte, 1992), and

• descending correlations r=(0.7, 0.5, 0.4, 0.2) which is similar to the AR(1) structure, denoted as ar1.

In the case of between subject designs the error rates had been checked for the following effect models:• main effects and interaction effect for the case of no effects (null model, equal probabilities

pi ) and for the case of one significant main effect A(0.6),• main effect B for the case of a significant interaction AB(0.6) and for the case of significant

effects A(0.6) and AB(0.6),• interaction effect for the case of both significant main effects A(0.4) and B(0.4).

In the case of mixed designs the error rates had been checked for the following effect models:• main effects and interaction effect for the case of no effects (null model, equal probabilities

pi ), for the case of one significant main effect A(0.6) or B(0.4), and for the case of a si-gnificant interaction AB(0.4).

• interaction effect for the case of both significant main effects A(0.3) and B(0.3).

where e.g. A(d) denotes an effect of size d for factor A, which means pi-s*d/2 is compared with pi+s*d/2 (s being the standard deviation), or computationally with

. In 2*4 designs the first group will be compared with last group, in 4*5 designs the first two groups with the last two groups. An analog procedure is chosen for B(d) and the interaction effect. For unbalanced designs the interaction effects abij had to be adjusted in order to avoid impacts on the main effects.

Unfortunately first simulations revealed that difficulties arise when data have to be generated in mixed design models with pi=0.9 and effects had to be added. In order not to let the shifted parameter come too close to 0 or 1 respectively, this parameter had generally to be reduced to pi=0.88. The problems intensified in the case of an unequal correlation structure with descending correlations where effect sizes had to be reduced from 0.6 to 0.4 (for factor B) and 0.4 to 0.36 (for the interaction).

Another problem to be investigated is heterogeneity in unbalanced designs, because it is wellknown that at least the parametric F-test tends to be conservative if cells with larger ni have also larger variances (positive pairing) and that it reacts liberal if cells with larger ni have the smaller variances (negative pairing), as Feir & Toothaker (1974) and Luepsen (2017), to name a few, reported. Here unequal variances coincide with unequal means, i.e. in between subject designs as well as in mixed designs a factor A with nonnull effects will produce heterogeneity which may in unbalanced designs lead to biased type I error rates. Being aware that for pi>0.5 the variances of y become smaller with increasing pi , it is to be expected that the F-test reacts liberal if levels of A with larger ni have also larger pi and that it reacts conservative if levels with larger ni have the smaller pi. Of course, the same behavior will apply to the case pi<0.5. Therefore the effects of factor A will be analyzed for both situations. Though for both designs, between subjects and mixed, the effect size will be the same (d=0.6), the ratio max(ni)/min(ni) will be only 1.3 for the between subjects, but 4 for the mixed design. As in the case of pi=0.5 with an effect d for factor A the resulting pi-s*d/2 and pi+s*d/2 will be equal and therefore pro-duce equal variances, pi=0.6 is chosen instead when situations of heterogeneity are analyzed.

pi d 2⁄( ) pi 1 pi–( )–pi d 2⁄( ) pi 1 pi–( )+

Page 13: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Selected Methods and Preliminary Results 12

The type I error rates at 5% and 1% and the power were computed for ni=5,10,15,..,50 as percentages of rejected null hypotheses. The number of repetitions were 2000 for both, the bet-ween subject and the split-plot designs. Because of the enormous computational effort for the GEE and the GLMM method, the number of repetitions for these had been limited to 1000. The relatively small number of samples is not unusual (see e.g. McNeish, 2017 and Guerin & Stroup, 2000). Due to the convergence problems, mentioned in the previous chapter, which occurred mainly with GLMM in smaller samples , the actual number of repetitions was reduced sometimes by about 2 percent. But the situation became much worse with GEE which produced unmanageable covariance matrices for smaller samples . The failure rates were sometimes 90 percent and more. In those cases the repetitions had to be increased to 5000 in order to receive at least 200 valid results, or the sample size of 5 had been dropped from the study, especially for unbalanced designs.

For between subject designs the variables had been generated from a uniform distribution which had been dichotomized according to the desired pi . For the repeated measurements rmvbin (from R package bindata) created correlated multivariate binary random variables by thres-holding a normal distribution, given probabilities pi and a given correlation matrix.

4. Selected Methods and Preliminary ResultsFor between subject designs these are the parametric F-test, Puri & Sen‘s L-statistic, Brunner & Munzel‘s ATS, the logistic and the probit regression, both with the Wald and the LR test, as well as loglinear models (denoted as chi square). As first results showed that, especially for small samples, the LR test behaves rather liberal while the Wald test acts extremely conserva-tive, a compromise test for the logistic and the probit regression is proposed, composed by the χ2-values of the two tests with the same degrees of freedom, denoted by WLR:

Especially for between subject designs there were considerations to include exact tests for the comparison of two groups (see e.g. Storer & Kim, 1990). These could be either corrected for the number of comparisons or accumulated, e.g. by Fisher‘s combined probability test, to get a test for the global hypothesis. But first, some preliminary tests revealed these solutions lead to a very poor power, and secondly, it is very difficult to deduce an corresponding test for the interaction.

Concerning split-plot designs the choice is also the parametric F-test, with and without the Huynh-Feldt correction, Puri & Sen‘s L-statistic, Brunner & Munzels ATS, together with Ho-telling-Lawley‘s multivariate ANOVA test, Koch‘s rank ANOVA, later often denoted as standard methods, as well as GEE and GLMM, special cases of GLM models. For the selection of a suitable function to apply these two methods in R, preliminary studies had been necessary.

For the GLMM analysis three R functions had been tested: glmer (R package lm4) which is based on restricted maximum likelihood estimation (REML) using a bounds constrained quasi-Newton method (nlminb, by means of R function optimx from package optimx), glmmPQL (R package MASS) which uses Penalized quasilikelihood estimation and glmmML (R package glmmML) which applies adaptive Gauss-Hermite quadrature. (A practical comparison give Bolker et al., 2009, while Zhang et al., 2011, matched the methods using data.) The three methods had been compared for the above mentioned balanced and unbalanced designs with ni=5,...,50, given compound symmetry and pi=0.8, for the following models: null model, A si-gnficant, B signficant and AB significant. To compare the results the classical Wald test, the robust test by Fan & Zhang (2014) and the Type II Wald test (except glmmML)had been used

ni 15≤

ni 10≤

χWLR2 χWald

2 χLR2+( ) 2⁄=

Page 14: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Selected Methods and Preliminary Results 13

which were described in the previous chapter. Additionally the F-transformed Wald test and the sum of χ2-values were computed (see appendix B6). The only satisfying procedure was glmer, which had the error rates completely under control. But, unexpectedly, it is the Type II Wald test which manages also the case of a signficant interaction, though this one is not recommended if there is an interaction. In contrast, the other ANOVA-type tests explode with increasing sam-ple sizes. glmmML generally offends the acceptable limits for the type I error rate with values around 10 for sample sizes . And glmmPQL shows in nearly all situations error rates beyond 10.

For the choice of the GEE procedure the focus had been laid upon the different estimation methods for the covariance matrix of the parameter estimates . As a basis for the estimation of the parameters themselves the function geeglm (R package geepack) had been used which is based on the estimation method by Prentice & Zhao (1991). All the 9 methods described by Wang et al. (2016) had been compared for the above mentioned situations using the functions from the R package geesmv. For the estimation according to Liang & Zeger to-gether with the sandwich estimators for the covariance matrix of the two different functions had been considered: the above mentioned geeglm and the corresponding function GEE.var.lz in package geesmv. The results in general were rather disappointing (see also appendix B6), particularly for the case of an interaction effect, when none of the methods was able to keep the type I error rate for the test of any main effect in an acceptable range. In fact, for all methods and tests the error rates rose up to over 90 percent for ni=50 (see 6.10 and 6.11 in B6).

A few words to the two functions used for the Liang & Zeger method. Generally geeglm be-haved much more liberal than GEE.var.lz. E.g. for the test of the interaction in the null model where the first one showed exceeding error rates for nearly all sample sizes, especially for unbalanced designs, and even when the conservative ANOVA test by Fan & Zhang is used (see B6.3). Similar results were achieved for the test of the interaction when there are significant main effects. One reason might be the following fact: GEE.var.lz is just a transcription of the formula into R. For this study the function had been modified so that any sample is dropped if computational problems like a not invertible matrix arise. By the way, the same had been ap-plied to the other functions of package geesmv. In contrary geeglm tries to handle problem-atic situations itself. Among others this difference leads to quite different accepted cases from the sample. But as cited in the first chapter, the same method may lead to different results when different functions or programs are used.

Some remarks to the other covariance estimation methods. As to be expected: the ANOVA-like tests by Fan & Zhang (2014) show generally much smaller error rates than the Wald test, but with the disadvantage of an also much smaller power. McKinnon shows in many situations a strange behavior: e.g. higher error rates for the Fan & Zhang test than for the Wald test and de-creasing power rates for increasing sample sizes (see tables 6.4 and 6.8 in B6). For the methods by Kauermann & Carroll, Mancl & DeRouen and Fay & Grauband there are huge differences between the error rates by the Wald test and the one by Fan & Zhang in a number of situations. In general the solutions from Pan & Wall (2001), Gosho et al. (2014) and Wang & Long (2011) which obtain their estimates by pooling observations across different subjects, as well as the method by Morel et al. (2003), have the most benevolent behavior. For the first three methods additionally the previously mentioned ANOVA-type test by Pan (2001) had been computed which is able to control the type I error rate in a same way as the one by Fan & Zhang (2014), but shows on the other side clearly better power rates: e.g. 35 (factor A for unequal ni), 51 (fac-tor B for equal ni) and 37 (factor B for unequal ni) in contrary to 14, 26 and 22 in the same situ-ations for the test by Fan & Zhang (see tables 6.4 and 6.8 in B6). But, unfortunately, even these

ni 10≤

βiˆ

βiˆ

Page 15: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Results 14

procedures cannot keep the type I error rate of the main effects under control for the case of a significant interaction.

In an additional preliminary study the error rates and power for the previously mentioned four methods, together with the one by Liang & Zeger, have been studied for all situations quoted in the previous chapter, where the ANOVA-type tests by Wald, by Fan & Zhang and, if possible, by Pan were applied. For the case of an underlying ar1-correlation structure of the data two GEE models have been analyzed: one assuming the correct ar1-structure and one assuming an ex-changeable (compound symmetry) correlation structure. The results from it are to be found in appendixes B7 and B8. If there is no interaction effect the procedure of Morel keeps the type I error always under control, also the one by Gosho et al., except in a few situations for ni=5. For the approach by Wang & Long it is the test by Pan which leads sometimes to exceeding error rates for small ni. The one by Pan & Wall is even more severe affected, though for both the ANOVA-like test by Fan & Zhang keeps the error under control while the Wald test leads often to liberal results, e.g. for the tests of factor B with an ar1-structure (see 7.4.2 and 7.5.2 in B7) or if an ar1-structure is handled as an exchangeable structure (see e.g. 7.4.3, 7.5.3 or 7.8.3 in B7). Finally, for the approach by Liang & Zeger it is again the test by Fan & Zhang which pro-duces satisfying results whereas the Wald test leads generally to error rates of 10 and above. As had to be expected: for all methods using the ANOVA-like test by Fan & Zhang the power is the worst, from which the one by Morel has the lowest. Reasonable results are achieved with the procedures of Wang & Long and Gosho et al., if the Wald test is used, which both are reliable concerning the type I error rate. Regarding the case of an interaction effect, the type I error rates rise up to 50 percent and beyond (for ni=50) for all methods and nearly all situations (see 7.3 and 7.6 un B7). There are only differences in the size of the intervals for which the dis-tinct methods keep the error rate in the acceptable range. If the focus is directed to those techni-ques favoured above, one has to accept that only for admissible error rates are produced, while, of course, the conservative methods, like the one by Morel et al., have generally a wider range , especially when the test by Fan & Zhang is applied.

Overall it can be concluded that the procedures by Gosho et al. and by Wang & Long in conjunc-tion with the ANOVA tests by Wald or Pan are the most recommendable from all GEE models with a slight advantage for the first one, while the Pan & Wall procedure is a bit more liberal in some situations. Finally, a couple of observations to mention: first, the error and power rates reduce with rising parameter pi , and second, if the underlying correlation structure is assumed as an ar1 model, the Liang & Zeger method as well as the application of the ANOVA test by Pan lead to rather conservative results (see e.g. the tables in B9, where the results for the diffe-rent correlation structures are summarized). As a result from these experiences, for the compari-son with the other ANOVA methods the estimation approach by Gosho et al. (2014) is selected together with the ANOVA-type tests by Wald, by Fan & Zhang and by Pan. As these are not generally available the estimation results by Liang & Zeger as received from GEE.var.lz to-gether the ANOVA-type test by Fan & Zhang are also included for comparison purposes. One final remark: As cited in the first chapter, the same method may lead to different results when different functions or programs are used.

5. ResultsTables and Graphical Illustrations

The following remarks represent only a small extract from the numerous tables and graphics produced in this study and will concentrate on essential and perhaps unexpected results. All tables and corresponding graphical illustrations are available online (see address below). These

ni 20≤

ni 40≤

Page 16: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Results 15

are structured as follows and report the proportions of rejections of the corresponding null hy-pothesis (α=0.05, α=0.01) for all methods considered in 2*4 balanced and 4*5 unbalanced de-signs:• B 1: type I error rates for fixed nij (5,10,..,50) for different models in between subject designs,• B 2: power in relation to nij (5,10,..,50) for different models in between subject designs,• B 3: type I error rates for fixed ni (5,10,..,50) for different models in mixed designs,• B 4: power in relation to ni (5,10,..,50) for different models in mixed designs,• B 5: type I error rates for fixed ni (5,10,..,50) for different models in 4*5 balanced and 2*4

unbalanced mixed designs,• B 6: type I error rates and power of selected GLMM and GEE methods for fixed ni

(5,10,..,50) for different models in mixed designs,• B 7: type I error rates of selected GEE methods for fixed ni (5,10,..,50) for different models

in mixed designs,• B 8: power of selected GEE methods in relation to ni (5,10,..,50) for different models in

mixed designs,• B 9: type I error rates of selected GEE methods for ni =50 for different correlation structures,• B 10: type I error rates for fixed nij (5,10,..,50) for different models in 4*5 balanced and 2*4

unbalanced between subject designs.

All references to these tables and graphics will be referred as B n.n.n. All tables and graphics can be viewed online:http://www.uni-koeln.de/~luepsen/statistik/texte/comparison-tables/

Criteria

A deviation of 10 percent (α + 0.1α) - that is 5.50 percent for α=0.05 - can be regarded as a stringent definition of robustness whereas 25 percent (α + 0.25α) - that is 6.25 percent for α=0.05 - to be treated as a moderate robustness (see Peterson, 2002). It should be mentioned that there are other studies in which a deviation of 50 percent, i.e. (α 0.5α), Bradleys liberal criterion (see Bradley, 1978), is regarded as robustness. In this study Peterson‘s moderate robustness will be applied, i.e. an acceptance interval [3.75 , 6.25]. As a large amount of the re-sults concerns the error rates for 10 sample sizes nij = 5,...,50 it seems reasonable to allow a couple of exceedances within this range.

The following remarks concern the results for tests at α=0.05. As noted in several other studies (see e.g. Luepsen, 2017) nearly all tests behave more liberal at α=0.01.

5. 1 Results I: between subject designs

The most exciting question is: How behaves the parametric F-test in those cases which were not treated by Lunney (1979): small sample (ni<20) and the influence of nonnull effects. The control of the type I error rate is guaranteed, even for small ni=5, at least in balanced designs, while for unbalanced designs this is only valid as long as there is no nonnull main or interaction effect which results in heterogeneity of variances. As a consequence the type I error rates for the test of a main or interaction effect do not lie any longer in the interval of robustness if ni and pi are dependent and pi 0.8. E.g. in the case of a significant factor A, even for a small ratio max(ni)/min(ni) of 1.3 the rates rise to nearly 12 for the test of main effect B (see table B1.3)

+−

Page 17: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Results 16

and to 15 for the test of interaction AB if ni and pi are positively correlated, and fall to 2 if they are negatively correlated (see B1.8). Similar results are obtained for the tests of the main effects if the interaction AB is signficant (see B1.5 and figure 1). It should be remarked that the viola-tions are independent of the cell frequencies 5,...,50. The conclusion: the tests of the effects are not independent. Moreover, even in the case of independent ni and pi the error rates show slight exceedances for the test of the interaction if there is a nonnull main effect, but only for extreme pi 0.8: values near 7 for pi=0.8 and near 10 for pi=0.9 (see figure 3 and B1.7). A more detailed inspection revealed that this is due to the larger number of cells, compared with the balanced design, and occurs only for larger designs. For, in a balanced design with 4*5 cells the type I error rates are nearly the same in the same situation (see B10.1).

Figure 1: The maximum of the type I error rates over the range ni=5,..,50 for pi=0.5,..,0.9: for the test of the interaction AB when one main effect is nonnull (left) and for the test of a main effect if the interaction AB is nonnull (right), both in between subject designs when ni and pi are not independent.

Slightly better performs Puri & Sen‘s L statistic in these situations of heterogeneity because it exceeds only for pi=0.9 and with values near 7 (see B1.3, B1.8 and B1.5). The only method that keeps the type I error rate without exceptions is Brunner & Munzel‘s ATS. In contrast, the χ2-test based on loglinear models shows rates rising up to 20 for increasing sample sizes ni->50 in unbalanced designs, too. The logistic and probit models exhibit nearly identical performances: if the LR-test is used the error rates exceed generally the acceptance interval for small samples, sometimes only for ni=5, but often even for larger , whereas the Wald-test reacts rather conservative in theses cases (see e.g. B1.1). Generally the error rates of the LR test exceed those of the Wald test, except in one situation: for the test of the main effect when there is a nonnull interaction and pi=0.9 (see B1.4.3 and 1.5.3). Rather extreme is the difference for the test of the interaction, with and without nonnull main effects: the LR test has extreme type I error rates, up to over 40 for pi=0.9, while the Wald test behaves extremely conservative with rates near 0. On the other hand, the error rates of both tests raise to 20 and more with increasing ni for the test of a main effect if there is a nonnull interaction (see B1.4 and 1.5), which makes the usage of the logistic or probit model dangerous. It shows that these methods are no alternatives to the F-test, wich is rather liberal in this situation, too. In contrary, the WLR compromise test behaves much better, except for two situations mentioned above: the test of a main effect if there is an interac-tion (see B1.4 and 1.5) and the test of the interaction if there are both main effects present and pi=0.8, but with rising error rates for pi=0.9 (see B1.9.2 and 1.10.2).

0.5 0.6 0.7 0.8 0.9

0

5

10

15

20

interaction AB (A significant)

pi

0

5

10

15

20

0.5 0.6 0.7 0.8 0.9

main effect A (interaction AB significant)

pilegend

parametricPuri & SenATS

chi-squaredlogistic Regr/WLR

ni 20≤

Page 18: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Results 17

Concerning the power, the parametric F-test is always among the best performers. It is out-reached only by the LR test for the logistic and probit regression, which is too liberal with regard to the type I error. Puri & Sen‘s L statistic has nearly identical rates. Only for the test of the in-teraction the rates lie sometimes below those of the F-test for small ni: about 20% for ni=5 and about 10% for (see B2.5). Brunner & Munzel‘s ATS is able to keep up only in balanced designs, whereas in unbalanced designs the rates lie clearly below, mostly for smaller samples: for ni=5 (pi=0.5), for (pi=0.8) and (pi=0.9) (see e.g. B2.1 and 2.2). The logistic and probit models exhibit also for the power the same performance: desastrous rates if the Wald test is used, while the LR test leads generally to good power values, and even above average for the test of the interaction. But unfortunately the LR test cannot control the type I error in these cases. In contrary, the compromise test WLR has acceptable power rates in all situations. Finally, the the χ2-test based on linear models performs rather poor in balanced designs. Figure 2 shows the graphs of the power for the methods discussed in a representative situation.

Figure 2: The power of factor B when factor A has a nonnull effect, for pi=0.8 and all methods considered. These graphs reflect the typical behavior of the methods: a similar power for the F-test and rank based methods, slightly higher values for the logistic and probit regression in conjunction with the LR-test, and lower rates, if for these the Wald test is used as ANOVA. The better results for unequal cell counts are due to the larger design.

5. 2 Results II: mixed designs

Split plot designs, as mixed designs are often called, require a more detailed analysis because first, the factors A and B are not exchangeable and second, the correlation structure of the re-peated measurements has to be taken into account. Here again first the behavior of the parametric F-test will be regarded.

ni 15≤

ni 20≤ ni 30≤

20

40

60

80

100

abso

lute

pow

er

equal cell counts

10 20 30 40 50

0

50

100

150

200

10 20 30 40 50

rela

tive

cell counts

20

40

60

80

100unequal cell counts

10 20 30 40 50

0

50

100

150

200

cell counts

legend parametricPuri & SenA TSchi-squared

logis tic /Waldlogis tic /LRlogis tic /Wald+LRprobit/Waldprobit/LRprobit/Wald+LR

Page 19: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Results 18

Likewise to the between subject design, the type I error is fairly well controlled as long as either the grouping factor A has no effects or the cell frequencies are equal. Decent exceedances occur only for extreme pi 0.8 in large unbalanced designs: in the case of equal correlations r=0.3 only the interaction is affected if there is a significant main effect (rates around 7, see B3.8.1) while for unequal correlations (ar1 structure) all tests except that of the grouping effect A are concerned, but predominantly the test of the interaction, with rates between 6 and 7 for pi=0.8 and 9 for pi=0.9 (see e.g. B3.8.2, 3.9.2 and 3.11.2). A detailed look into the results of the corre-sponding small unbalanced and large balanced designs exhibited that mainly the size of the de-sign is responsable for the violations, i.e. the exceedances remain, though a bit smaller (see figure 3 and e.g. B5.8.2 and 5.9.2). As only repeated measurements effects are affected, the Huynh-Feldt adjustment can help which reduces the exceedances for pi=0.8 completely, but for pi=0.9 only to a maximum of 7 percent. A stricter remedy is given by Hotelling-Lawley‘s mul-tivariate test which keeps the type I error rate completely under control. The same applies to the procedure by G. Koch. Puri & Sen‘s L statistic behaves very similar to the parametric F-test, only the error rates lie a bit lower, though in the cases mentioned above still between 6 and 8. Whereas for the above noted methods only the tests of the repeated measurements effects react sometimes too liberal, it is vice versa with the ATS by Brunner & Munzel: only for the test of the grouping effect A the error rates pass the limit of robustness, with values up to 12, but mainly for (see B3.1 to 3.3). Here the violations are more severe for unbalanced than for balanced designs.

Figure 3: The maximum type I error rates over the range ni=5,..,50 vs. the number of cells (8,12,16 and 20) for the test of the interaction AB in a balanced design if there is of a nonnull grouping factor A, for pi=0.8 in between subject designs (left) and mixed designs (right).

Things look considerably worse if in unbalanced designs the grouping factor A has a nonnull effect which leads to heterogeneous variances. It is mainly the test of the interaction which is biased. For positively correlated ni and pi the type I error shows rates between 15 and 20, rising with increasing pi for 0.5 -> 0.9 (see figure 4), which are even higher in the case of unequal cor-relations, and rates below 2 for negatively correlated ni and pi (see B3.10.1 and 3.10.2). And unfortunately neither the Huynh-Feldt adjustment nor Hotelling-Lawley‘s multivariate test are able to reduce the rates clearly for pi > 0.6. Here also the violations are independent of the cell frequencies 5,...,50. The only method which remains completely unaffected is the ATS.

ni 30≤

8 10 12 14 16 18 20

0.0

2.5

5.0

7.5

10.0

type

I er

ror r

ate

between subj design

# cells

0.0

2.5

5.0

7.5

10.0

8 10 12 14 16 18 20

mixed design

# cellslegend

parametricPuri & SenATSchi-squared

parametricparam - HF-adjmultivariatePuri & SenATSKoch

Page 20: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Results 19

Also for mixed designs the parametric F-test has to be attested a constantly good power. While there are no notable differences between the plain F-test and the Huynh-Feldt adjustment for the tests of the trial effects, the multivariate test shows superior rates for the interaction effect in unbalanced designs, but lower rates in balanced designs. Here also a more precise analysis re-vealed that this effect is mainly due to the size of the design. In addition the power is below aver-age for the test of the trial factor B in small samples, mainly if B has an unequal correlation structure: it achieves only between 5% and 25% of the power of the F-test for ni=5 but generally over 90% for ni 10 (see B4.2.2). Puri & Sen‘s L statistic shows a similar performance as the F-test, with a slight advantage for the test of the grouping factor A. In this case also Koch‘s pro-cedure scores, while its power for factor B and the interaction is rather poor, especially for smaller samples ( ). The ATS by Brunner & Munzel has in most situations the lowest power rates from the procedures mentioned so far: it achieves only 50% of the power of the F-test for ni=5 and 70% for ni=20 (see B4.3). With one exception: the test of factor A for very small samples (see B4.1), but in a situation where the error rate is not controlled. By the way, there are no remarkable influences of the correlation structure on the power.

Figure 4: The maximum of the type I error rates over the range ni=5,..,50 for pi=0.5,..,0.9: for the test of the interaction AB (left) and for the test of a main effect, both in mixed designs when the grouping factor A is nonnull and when ni and pi are not independent.

Concerning the GLMM and GEE methods chosen for this comparison, it has to be stated in ad-vance: they are not able to compete with the tests listed above. In general it must be admitted that the results are not reliable for small samples. Not only the procedures used for GLMM but also, to an even larger extent, those for GEE had tremendous failure rates, up to over 90 percent, for ni=5 and often also for ni=10. Therefore peculiarities for small samples are not mentioned here. Let us start with the GLMM method which seems to be the most acceptable. From the th-ree ANOVA-type tests used in the comparisons the Type II Wald test turned out to be the most recommendable which is in all situations superior to the other two ANOVAs, in respect to the type I error rate and even more to the power. Let us first have a look onto the type I error beha-vior. When using the Type II Wald ANOVA for the test of grouping factor A, there are a number of distinct exceedances with rates between 7 and 10, mainly for ni 30 (see e.g. B3.2.1), while there are none for trial factor B. Furthermore, for the test of the interaction both Wald tests as well as the test by Fan & Zhang react sometimes too liberal, mainly if factor B is significant, with rates up to 9, primarily for ni 20. And these are more numerous if the classical Wald test or the Fan & Zhang test are used. Especially if there is an interaction effect, these ANOVA-type

ni 20≤

0.5 0.6 0.7 0.8 0.9

0

5

10

15

20

interaction effect AB

pi

0

5

10

15

20

0.5 0.6 0.7 0.8 0.9

main effect B

pilegend

parametricparam HF adjmultivariate

Puri & SenATSKoch

Page 21: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Results 20

tests, as well as the GEE methods, lead to a large increase of the error rates (up to 90 percent for ni ->50, see e.g. B3.3 and 3.7), whereas the Type II Wald test remains unaffected (see figure 5). Probably the most positive attribute of GLMM with the Type II Wald test: when ni and pi are correlated in unbalanced designs, GLMM with the Type II Wald test has perfect control over the Type I error rate, in contrary to the F-test (see B3.10), though it must be admitted that this combination has an unsatisfying small sample behavior: a failure rate of approximately 10 percent and some severe exceedances of the error rate for ni 10.

The GEE procedures, i.e. the one by Gosho et al. in combination with any of the ANOVA-type tests (Wald, Fan & Zhang or Pan) or the one by Liang & Zeger together with the test by Fan & Zhang, behave benevolently in all situations as long as there is no interaction effect. But if there is one, the error rates raise up to 80 percent for ni ->50 (see B3.3 and 3.7), though they keep in the acceptable range for extreme pi 0.8 and small sample sizes, e.g. for pi=0.9 and when using the test by Fan & Zhang (see e.g. section 9 in B3.3 and 3.7). But this exception will not be generally helpful. A suspicion, the size of the correlation between the repeated measu-rements might be the reason, did not prove to be true: r=0 induces the same rates as r=0.3 (see B9.1 and 9.2). But a comparison of the various results reproduced there, shows that generally the error rates decline with rising pi in which the procedure by Liang & Zeger reaches the lowest values.

Figure 5: Type I error rates for the GLMM and GEE methods for balanced designs and pi=0.8 when there is a significant interaction effect.

Concerning the power, the GLMM procedure in conjunction with the Type II Wald test is the only one with rates mostly above the average of all procedures, at least for larger samples ni 30. However, there are situations in which its performance is rather poor, e.g. for the test of main effect A in balanced or small designs, especially for an ar1 correlation structure and for extreme pi (see e.g. B4.1.2). Compared to the other GLMM and GEE procedures the gap is huge, especially for the test of the main effects (a relation of 1:3, see e.g. B4.1.1 and 4.2.1). From the GEE procedures the one by Gosho et al. in combination with the ANOVA-type test by Pan scores for exchangeable correlation structures whereas the one by Gosho et al. in combination with the classical Wald test is better for the ar1 correlation structure.

≥ ni 30≤

10 20 30 40 50

0

5

10

15

20

type

I er

ror r

ate

main effect A

cell counts n

0

5

10

15

20

10 20 30 40 50

main effect B

cell counts n

legendGlmm Wald IIGlmm Wald IIIGlmm Fan&Zhang

Gee Gosho/WaldGee Gosho/F&ZGee Gosho/PanGee Liang&Zeger/F&Z

Page 22: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Results 21

Figure 6: The power of grouping factor A for pi=0.8 and all methods considered. These graphs reflect the typical behavior of the methods: a similar power for all standard methods, except the ATS for unbalanced designs, and on the other side poor results for the GLM methods except the GLMM with the Wald type II test.

20

40

60

80

100 ab

solu

tepo

wer

equal

050

100150200250300

rela

tive

20

40

60

80

100cell counts

050100150200250300

20

40

60

80

100

abso

lute

pow

er

unequal

10 20 30 40 50

050

100150200250300

rela

tive

cell counts

20

40

60

80

100cell counts

10 20 30 40 50

050100150200250300

cell counts

legend standard methods legend GLM methods

parametricparam - HF-adjmultivariatePuri & SenATSKoch

Glmm Wald IIGlmm Wald IIIGlmm Fan&ZhangGee Gosho/WaldGee Gosho/F&ZGee Gosho/PanGee Liang&Zeger/F&Z

Page 23: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Conclusion and practical aspects 22

6. Conclusion and practical aspectsAt first sight, e.g. looking at the results for the null model or balanced designs, the parametric F-test seems to be a good choice for analyzing binary variables. But when the results are ex-amined in detail the F-test seems to be only second choice.

In between subject designs the tests of main and interaction effects are confounded when there are any nonnull effects. This makes the usage of the F-test dangerous, at least in unbalanced de-signs. The better selection is Puri & Sen‘s L statistic which controls the type I error in nearly all situations. Only in the case of a nonnull main effect it shows only slight exceedances of the error rate (values between 7 and 9), and this even only for pi=0.9. Another argument in favor of the L statistic is the overall excellent power, at least for ni>10, and in most cases even for ni 10. Though Brunner & Munzel‘s ATS has a complete control of the type I error rate it is no good choice because of its poor power. And finally from the logistic and probit regression the only acceptable ANOVA-type test is the compromise WLR test proposed in the previous chapter. But unfortunately its error rates rise up to 10 and beyond (ni->50) for the test of a main effect if there is a nonnull interaction, and for the test of the interaction if both main effects are signifi- cant. And this even in balanced designs and those with independent ni and pi. This makes these procedures, which were made especially for binary variables, a dangerous choice. All in all Puri & Sen‘s L statistic seems to be the best recommendation because of its overall control of the type I error rate, even in heterogeneous situations and for large designs, and because of its generally high power.

Also in mixed designs the use of the parametric F-test is dangerous in unbalanced designs if ni and pi are not independent, because unequal pi may lead to heterogeneous variances for pi 0.8 and together with unequal ni to extreme liberal or conservative test results. E.g. for positively correlated ni and pi the error rates may rise up to values beyond 20 for the test of the interaction, when there is a nonnull effect of the grouping factor A. As neither the Huynh-Feldt adjustment, nor Hotelling-Lawley‘s multivariate test, nor Koch‘s ANOVA can help in this situation, at least for pi > 0.6, the only safe method is the ATS by Brunner & Munzel. But unfortunately this one cannot control the type I error rate for the test of the grouping factor A, and it has the lowest power from the methods mentioned. Beside this, the F-test behaves generally liberal in large de-signs and if factor B has an unequal correlation structure. But in these situations there are two recommendable alternatives: the multivariate ANOVA with its superior power in larger designs (with the restriction of ni 10 for the trial factor B) and the Huynh-Feldt adjustment for the F-test which achieves the power of the F-test.

The GEE methods seem generally not to be recommendable for the analyses examined here. First there is the enormous failure rate (up to 90%), particularly for small samples ni 15. Second their power is overall far below that of the other methods noted above. And third, a non-null interaction affects the tests of both main effects. On the contrary, the GLMM method may be an alternative to the F-test, particularly in the heterogeneous cases quoted above. Using the Type II Wald test as the ANOVA-type test the GLMM based on the REML method (see chapter 2) controls the Type I error rate not only when ni and pi are correlated, but also when there is a nonnull interaction term, a situation where all other GLM methods fail. Also with regard to the power its performance is acceptable in many situations and can often compete at least with that of the ATS for not too small sample sizes (ni 30). But as the GLMM reacts liberal for the test of grouping factor A and of the interaction and as it has an unsatisfying small sample behavior, as noted in the previous section.

Page 24: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Programming 23

The final recommendation: If the relative frequencies of the two values of y lie within the inter-val [0.25 , 0.75] the parametric F-test may be used, because there are only very few decent ex-ceedances of the type I error rate: only for mixed designs when the correlation structure of B is unequal. If this constraint is not given additional checks are necessary. In between subject de-signs the F-test may also be applied if there are equal cell counts nij and the number of cells is not too large ( 15). Otherwise Puri & Sen‘s L statistic should be the choice. In mixed designs with equal cell counts ni the F-test may be used also if the number of cells is not too large ( 15), otherwise Hotelling-Lawley‘s multivariate test should be preferred. For mixed unbalanced de-signs two methods are advised: For the test of factor A the parametric F-test can be applied because in this case it induces no problems. For the the test of factor B and the interaction Brunner & Munzel‘s ATS is recommended, though a loss of power of about 50% has to be accepted. However also in these cases the multivariate test may be applied if it is guaranteed that the ni and the relative frequencies of y for factor A are independent.

7. ProgrammingThis study has been programmed in R (version 3.3.2 and later 3.3.3). For the data generation two different functions had been applied: in the case of between subject designs runif to generate uniform distributed data which were split into two groups at the desired cutpoint p, and rmvbin from the package bindata (Hoejsgaard et al., 2006) for split plot designs which allows the creation of correlated binary variables with specified percentages of pi and specified correlations.

For the analysis of the simulated data various functions had been chosen: for the standard ANO-VA F-test the function aov in combination with drop1 to receive type III sum of squares estimates in the case of unequal cell counts, for the factorial Puri & Sen-tests an own function np.anova, for the ATS method in between subject designs also an own function ats.2 - meanwhile an appropriate package GFD is supplied in R, and in mixed designs the function nparLD from the package nparLD. The logistic and probit regression had been performed with glm, the χ2-tests with chisq.test and loglin, and the multivariate Hotelling-Lawley test with the functions lm and anova. To incorporate exact binary tests the function exact.test from the package Exact had been used. For Koch‘s nonparametric analysis of a split plot design again an own function koch.anova had been chosen. For the analysis of GLMM models the following functions had been applied: glmer (R package lm4) which is based on restricted maximum likelihood estimation (REML) using a bounds constrained quasi-Newton method (nlminb, by means of R function optimx from package optimx), glmmPQL (R package MASS) which uses Penalized quasilikelihood estimation and glmmML (R package glmmML) which applies adaptive Gauss-Hermite quadrature. For the analysis of GEE models the function geeglm (R package geepack) and the functions from the package geesmv had been used, however with a modification to handle failures in the model estrima-tion process. For the own functions see Luepsen (2014).

Some of the computations had been performed on a Windows notebook, but for the major part the high performance cluster CHEOPS of the Regional Computing Centre (RRZK) of the uni-versity of Cologne had been used. I would like to thank the staff of the RRZK for their technical support as well as Prof. Unkelbach for his organizational support.

≤≤

Page 25: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Literature 24

8. LiteratureAgresti, A. (2002): Categorical data analysis. Vol. 2. New York, NY, John Wiley & Sons.

Akritas, M.G., Arnold, S.F., Brunner, E. (1997): Nonparametric Hypotheses and Rank Statis-tics for Unbalanced Factorial Designs, Journal of the American Statistical Association, Volume 92, Issue 437, pp 258-265.

Allan, E.F. (1997): Analyzing binary data in repeated measurements setting using SASAnnual Conference on Applied Statistics in Agriculture, Kansas State University Librari-es New Prairie Press.

Beckman, M. & Stroup, W.W. (2003): Small Sample Power Characteristics of Generalized Mixed Model Procedures for Binary Repeated Measures Data Using SAS. Annual Confe-rence on Applied Statistics in Agriculture, Kansas State University Libraries, New Prairie Press.

Bennett, B.M. (1968). Rank-order tests of linear hypotheses, J. of Stat . Society B 30, pp 483- 489.

Bolker B.M., Brooks M.E., Clark C.J., Geange S.W., Poulsen J.R., Stevens M.H., White J.S. (2009): Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology and Evolution, Vol. 24, No.3, pp 127-135.

Bradley, J.V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, pp 144-152.

Brunner, E., Munzel, U. (2002). Nichtparametrische Datenanalyse - unverbundene Stich-proben, Springer, Berlin.

Brunner, E., Dette, H. & Munk, A. (1997): Box-Type Approximations in Nonparametric Fac-torial Designs, Journal of the American Statistical Association, Vol. 92, No. 440, pp. 1494- 1502.

Brunner, E., Munzel, U. and Puri, M.L. (1999): Rank-Score Tests in Factorial Designs with Re-peated Measures, Journal of Multivariate Analysis 70, pp 286-317.

Carr, J.C. & Chi, E.M. (1992): Analysis of Variance for Repeated Measures Data: A Generali-zed Estimating Equations Approach. Statistics in Medicine, Vol 11, pp 1033-1040.

Cochran, W.G. (1950): The Comparison of Percentages in Matched Samples. Biometrika 37, pp 256-266.

Cochran WG. (1940): The analysis of variances when experimental errors follow the Poisson or binomial laws. The Annals of Mathematical Statistics, 11, pp 335–347.

D'Agostino, R.B. (1972): Relation Between the Chi-Squared and ANOVA Tests for Testing the Equality of k Independent Dichotomous Population. The American Statistician, Vol. 26, No. 3, pp. 30-32.

D'Agostino, R.B. (1971): A Second Look at Analysis of Variance on Dichotomous Data, Journal of Educational Measurement, Vol. 8, No. 4, pp. 327-333.

Davis, S.D. (1991): Semi-parametric and non-parametric Methods for the analysis of repeated measurements with applications to clinical trials, Statistics in Medicine, Vol. 10, pp 1959-1980.

Page 26: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Literature 25

Emrich L.J. , Piedmonte M.R. (1992): On some small sample properties of generalized estima-ting equation estimates for multivariate dichotomous outcomes. Journal of Statistical Computation and Simulation, 41, 19-29 .

Fan, C., Zhang, D. & Zhang, C.H. (2013): A comparison of bias-corrected covariance estima-tors for generalized estimating equations. Journal of Biopharmaceutical Statistics 23, pp 1172–1187.

Fan, C. & Zhang, D. (2014): Robust small sample inference for generalised estimating equa-tions: An application of the Anova-type test. Australian & New Zealand Journal of Stati-stics, 56(3), pp 237–255.

Fay, M. P., Graubard, B. I. (2001): Small-sample adjustments for Wald-type tests using sandwich estimators. Biometrics 57, pp 1198–1206.

Feir, B.J., Toothaker, L.E. (1974): The ANOVA F-Test Versus the Kruskal-WallisTest: A Robustness Study. Paper presented at the 59th Annual Meeting of the American Educatio-nal Research Association in Chicago, IL.

Fox, J. (1997). Applied regression analysis, linear models, and related methods. Thousand Oaks, CA: Sage Publications.

Fox, J. & Weisberg, S. (2011): An R Companion to Applied Regression. SAGE Publications, Los Angeles.

Gessner G., Kamakura W.A. , Malhotra N.K. , Zmijewski M.E. (1988): Estimating Models with Binary Dependent Variables: Some Theoretical and Empirical Observations, Journal of Business Research, 16(1), pp 49-65.

Glass Gene V. , Peckham Percy D. and Sanders James R. (1972): Consequences of Failure to Meet Assumptions Underlying the Fixed Effects Analyses of Variance and Covariance, Review of Educational Research, Vol. 42, No. 3, pp. 237-288.

Gosho M, Sato Y, Takeuchi H. (2014): Robust covariance estimator for small-sample adjust-ment in the generalized estimating equations: A simulation study. Science Journal of Applied Mathematics and Statistics, 2(1), pp 20–25.

Guerin, LeAnna and Stroup, Walter W. (2000): A Simulation Study to Evaluate PROC MIXED Analysis of Repeated Measures Data. Annual Conference on Applied Statistics in Agriculture. URL: http://newprairiepress.org/agstatconference/2000/proceedings/15

Guthrie, Donald (1981): Analysis of Dichotomous Variables in Repeated Measures Experiments, Psychological Bulletin, Vol.90, No. 1, pp 189-195.

Haberman, Shelby J. (1978): Analysis of Qualitative Data (Vol. 1). New York: Academic Press.

Harville, D.A. (1977): Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems. Journal of the American Statistical Association, Vol. 72, No. 358, pp. 320-338 .

Hoejsgaard, S., Halekoh, U. & Yan J. (2006): The R Package geepack for Generalized Estima-ting Equations, Journal of Statistical Software, 15, 2, pp 1-11.

Hosmer D.W. & Lemeshow S. (2013): Applied logistic regression, RX Sturdivant.

Hsu, Tse-Chi and Feldt, Leonard S. (1969): The Effect of Limitations on the Number of Cri-terion Score Values on the Significance Level of the F-Test, American Educational Re-

Page 27: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Literature 26

search Journal, Vol. 6, No. 4 (Nov., 1969), pp. 515-527.

Ito, P.K. (1980): Robustness of Anova and Manova Test Procedures. Handbook of Statistics, Vol. 1, (P.R.Krishnaiah, ed.), pp 199-236

Jaeger, T.F. (2008) Categorical Data Analysis: Away from ANOVAs (transformation or not) and towards Logit Mixed Models, Journal of Memory and Language, 59(4): pp 434–446.

Kauermann, G., Carroll, R. J. (2001): A note on the efficiency of sandwich covariance matrix estimation. Journal of the American Statistical Association 96:pp 1387–1396.

Kenward, M.G and Jones, B (1992: Alternative approaches to the analysis of binary and categorical repeated measurements. Journal of Biopharmaceutical Statistics, 2(2), pp 137-170.

Koch, G.G. (1969): Some Aspects of the Statistical Analysis of Split Plot Experiments in completely Randomized Designs, Journal of the American Statistical Association, Vol 64, No 326, pp 485-504.

Koch, G.G. (1970): The Use of Non-Parametric Methods in the Statistical Analysis of a Com-plex Split Plot Experiment. Biometrics, Vol. 26, No. 1, pp. 105-128.

Koch, G.G, Landis, J.R et al (1977): A general methodology for the analysis of experiments with repeated measurement of categorical data. Biometrics, 33, pp 133-158.

Li, Peng & Redden, David T. (2015): Comparing denominator degrees of freedom approxima-tions for the generalized linear mixed model in analyzing binary outcome in small sample cluster-randomized trials. BMC Medical Research Methodology, https://doi.org/10.1186/s12874-015-0026-x

Liang, K.Y. & Zeger S.L. (1986): A Comparison of Two Bias-Corrected Covariance Estimators for Generalized Estimating Equations. Biometrika 73,pp 13–22.

Littel, Ramon C. (2002): Analysis of Unbalanced Mixed Model Data: A Case Study Compari-son of Anova versus REML/GLS. Journal of Agricultural, Biological, and Environmental Statistics, Vol. 7, No 4, pp. 472-490.

Lu, B., Preisser, J.S., Qaqish, B.J., Suchindran, C., Bangdiwala, S.I. and Wolfson, M. (2007): A Comparison of Two Bias-Corrected Covariance Estimators for Generalized Estimating Equations. Biometrics, Vol. 63, No. 3, pp. 935-941.

Luepsen, Haiko (2014): R Functions for the Analysis of Variance.URL: http://www.uni-koeln.de/~luepsen/R/ .

Luepsen, Haiko (2015). Varianzanalysen - Prüfung der Voraussetzungen und Übersicht der nichtparametrischen Methoden sowie praktische Anwendungen mit R und SPSS.URL: http://www.uni-koeln.de/~luepsen/statistik/texte/nonpar-anova.pdfURL: http://kups.ub.uni-koeln.de/6851/1/nonpar-anova.pdf .

Luepsen, Haiko (2016): The aligned rank transform and discrete variables: A warning, Communications in Statistics - Simulation and Computation, DOI: 10.1080/03610918.2016.1217014

Luepsen, Haiko (2017):Comparison of nonparametric analysis of variance methods: A Vote for van der Waerden. Communications in Statistics - Simulation and Computation, Volume 30, pp 1-30, DOI: 10.1080/03610918.2017.1353613

Page 28: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Literature 27

Lunney, G.H. (1979): Using Analysis of Variance with a Dichotomous Dependent Variable: An Empirical Study. Journal of Educational Measurement, Vol. 7, No. 4, pp. 263-269.

Ma, Y., Mazumdar, M., Memtsoudis, S.G. (2012): Beyond Repeated measures ANOVA: advanced statistical methods for the analysis of longitudinal data in anesthesia research, Reg Anesth Pain Med, 37(1): pp 99–105.

Malhotra, Naresh K. (1983): A Comparison of the Predictive Validity of Procedures for Analyzing Binary Data, Journal of Business & Economic Statistics, Vol. 1, No. 4, pp. 326-336.

Mancl, L. A., DeRouen, T. A. (2001): A covariance estimator for GEE with improved smallsample properties. Biometrics 57, pp 126–134.

Mandeville, Garrett K. (1972): Comparison of Three Methods of Analyzing Dichotomous Data in a Randomized Flock Design, Distributed by ERIC Clearinghouse.

Marascuilo, L.A. & Serlin, R. (1977): Interactions for Dichotomous Variables in Repeated Measures Designs, Psychological Bulletin, Vol. 84, No. 5, pp 1002-1007

Marubini, Correa & Leite, Milani (1988): Analysis of Dichotomous Response Variables in Teratology, Biometrical Journal, Volume 30, Issue 8, pp 965–974.

McNeish, D. & Stapleton, L.M. (2016): Modeling Clustered Data with Very Few Clusters, Multivariate Behavioral Research, 51 (4), pp 495-518.

McNeish, D. & Harring, J.R. (2017): Clustered data with small sample sizes: Comparing the performance of model-based and designbased approaches, Communications in Statistics - Simulation and Computation, 46 (2), pp 855-869.

Morel, J. G., Bokossa, M. C., Neerchal, N. K. (2003): Small sample correction for the variance of GEE estimators. Biometrical Journal 45, pp 395–409.

Myers, Jerome L., DiCecco, Joseph V., White, John B., Borden, Victor M. (1982): Repeated measurements of dichotomous variables: Q and F tests, Psychological Bulletin, Vol 92(2), pp 517-525.

Neuhaus, John M. (1992): Statistical methods for longitudinal and clustered designs with binary responses, Statistical Methods in Medical Research, Vol I, pp 249-273.

Noguchi, K., Gel, Y.R., Brunner, E., and Konietschke, F. (2012). nparLD: An R Software Package for the Nonparametric Analysis of Longitudinal Data in Factorial Experiments. Journal of Statistical Software , 50 (12), pp 1-23.

Omar, R.Z., Wright, E.M., Turner, R.M., Thompson, S.G. (1999): Analysing repeated measu-rements Data: A practical Comparison of Methods, Statistics in Medicine, Vol. 18, pp 1587-1603.

Pan, W. (2001): On the Robust Variance Estimator in Generalized Estimation Equations. Biometrika 88, No 3, pp 901-906.

Pan, W. & Wall, M.M. (2001): Small Sample Adjustments in Using the Sandwich Variance Estimator in Generalized Estimating Equations. Statistics in Medicine, Volume 21, Issue 10, pp 1429–1441.

Pan, W. & Connett, J.E. (2002): Selecting the working correlation structure in generalized estimating equations with application to the lung health study. Statistica Sinica, Vol 12,

Page 29: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Literature 28

No 2, pp 475-490.

Pendergast, J.F., Gange, S.J., Newton, M.A., Lindstrom, M.J., Palta, M., Fisher, M.R. (1996):A Survey of Methods for Analyzing Clustered Binary Response Data. International Statistical Review, Vol 64, No 1, pp. 89-118.

Peng CYJ, Lee KL , Ingersoll GM (2002): An introduction to logistic regression analysis and reporting, The Journal of Educational Research, Vol 96, No 1, pp 3-14.

Peterson, K. (2002). Six Modifications Of The Aligned Rank TransformTest For Interaction. Journal Of Modem Applied Statistical Methods. Vol. 1, No. 1, pp 100-109.

Powers J.A.,, Marsh, L.C., Huckfelt, R.R. and Johnson, C.L. (1978): A Comparison of Logit, Probit, and Discriminant Analysis in Prediction of Family Size. American Statistical Association, Proceedings of the Social Statistics Section, pp 693-697.

Prentice, R.L. & Zhao, L.P. (1991): Estimating Equations for Parameters in Means and Cova-riances of Multivariate Discrete and Continuous Responses. Biometrics, Vol. 47, No. 3, pp 825-839.

Puri, M.L. & Sen, P.K. (1985): Nonparametric Methods in General Linear Models. Wiley, New York.

Qu, Y., Piedmonte, M.R. & Williams, G.V. (1994): Small Sample Validity of Latent Variable Models for Correlated Binary Data. Communications in Statistics - Simulation and Com-putation, Vol 23, No 1. pp 243-269.

Scheirer, J., Ray, W.J., Hare, N. (1976). The Analysis of Ranked Data Derived from Comple-tely Randomized Factorial Designs. Biometrics. 32(2). International Biometric Society, pp 429−434.

Seeger, P., Gabrielsson (1968): Applicability of the Cochran Q Test and the F Test for Statistical Analysis of Dichotomous Data for Dependent Samples , Psychological Bulletin, Vol. 69, No. 4, pp 269-277.

Shirley, E.A. (1981). A distribution-free method for analysis of covariance based on ranked data. Journal of Applied Statistics 30: 158-162.

Singh, B. (2004) CATANOVA for Analysis of Nominal Data from Repeated Measures Design Journal Indian Society Agril Statistics, 58(3), pp 257-268.

Song, X.Y. & Lee, S.Y. (2006): Model comparison of generalized linear mixed models. Statistics in Medicine, 25, pp 1685–1698.

Stiger, R.T., Kosinski, A.S. ,Barnhart, H.X. & Kleinbaum, D.G. (1998) Anova for repeated ordinal data with small sample size? A comparison of anova, manova, wls and gee metho-ds by simulation, Communications in Statistics - Simulation and Computation, 27:2, pp 357-375.

Storer, B.E. & Kim, C. (1990): Exact Properties of Some Exact Test Statistics for Comparing Two Binomial Proportions. Journal of the American Statistical Association, Vol. 85, No. 409 (Mar., 1990), pp. 146-155.

Thomas, J.R., Nelson, J.K. and Thomas, T.T. (1999): A Generalized Rank-Order Method for Nonparametric Analysis of Data from Exercise Science: A Tutorial. Research Quarterly for Exercise and Sport, Physical Education, Recreation and Dance, Vol. 70, No. 1,

Page 30: Anova with binary variables - Alternatives for a dangerous ...a0032/statistik/texte/binary.book.pdf · in unequal variances, methods should be considered that do not assume sphericity,

Literature 29

pp 11-23.

Toothaker, L.E. and De Newman (1994). Nonparametric Competitors to the Two-Way ANOVA. Journal of Educational and Behavioral Statistics, Vol. 19, No. 3, pp. 237-273.

Tuerlinckx, F., Rijmen, F., Verbeke, G. & De Boeck, P. (2006): Statistical inference in generalized linear mixed models: A review. British Journal of Mathematical and Stati-stical Psychology, 59, pp 225–255.

Wang M & Long Q. (2011): Modified robust variance estimator for generalized estimating equations with improved small-sample performance. Statistics in medicine, 30(11), pp 1278–1291.

Wang, M., Kong, L., Zheng, L. & Zhang, L. (2016): Covariance estimators for Generalized Estimating Equations (GEE) in longitudinal analysis with small samples. Statistics in Medicine, 35(10), pp 1706–1721.

Whitmore Marjorie L., Schumacker Randall E. (1999) A Comparison of Logistic Regression and Analysis of Variance Differential Item Functioning Detection Methods, Educational and Psychological Measurement, Vol. 59, no. 6, pp 910-927.

Winer, B.J., Brown, D.R. & Michels, K.M. (1991): Statistical Principles in Expertimental Design, McGraw-Hill, New York.

Zhang, H., N. Lu, C. Feng, S. Thurston, Y. Xia, L. Zhu, and X. Tu (2011): On Fitting Generali-zed Linear Mixed-effects Models for Binary Responses using Different Statistical Packages. Statistics in Medicine, 30(20), pp 2562–2572.

Ziegler, A., Kastner, Ch., Blettner, M. (1998): The Generalised Estimating Equations: An Annotated Bibliography. Biometrical Journal 40 (2), pp 115-139.