
Robust Nonparametric Statistical Methods

Thomas P. Hettmansperger, Penn State University
and Joseph W. McKean, Western Michigan University

Copyright © 1997, 2008, 2010 by Thomas P. Hettmansperger and Joseph W. McKean. All rights reserved.


Dedication: To Ann and to Marge


Contents

Preface . . . ix

1 One Sample Problems . . . 1
  1.1 Introduction . . . 1
  1.2 Location Model . . . 2
  1.3 Geometry and Inference in the Location Model . . . 4
    1.3.1 Computation . . . 12
  1.4 Examples . . . 12
  1.5 Properties of Norm-Based Inference . . . 17
    1.5.1 Basic Properties of the Power Function γS(θ) . . . 18
    1.5.2 Asymptotic Linearity and Pitman Regularity . . . 21
    1.5.3 Asymptotic Theory and Efficiency Results for θ̂ . . . 24
    1.5.4 Asymptotic Power and Efficiency Results for the Test Based on S(θ) . . . 25
    1.5.5 Efficiency Results for Confidence Intervals Based on S(θ) . . . 27
  1.6 Robustness Properties of Norm-Based Inference . . . 30
    1.6.1 Robustness Properties of θ̂ . . . 30
    1.6.2 Breakdown Properties of Tests . . . 33
  1.7 Inference and the Wilcoxon Signed-Rank Norm . . . 35
    1.7.1 Null Distribution Theory of T(0) . . . 36
    1.7.2 Statistical Properties . . . 37
    1.7.3 Robustness Properties . . . 42
  1.8 Inference Based on General Signed-Rank Norms . . . 44
    1.8.1 Null Properties of the Test . . . 46
    1.8.2 Efficiency and Robustness Properties . . . 47
  1.9 Ranked Set Sampling . . . 53
  1.10 Interpolated Confidence Intervals for the L1 Inference . . . 56
  1.11 Two Sample Analysis . . . 60
  1.12 Exercises . . . 65

2 Two Sample Problems . . . 73
  2.1 Introduction . . . 73
  2.2 Geometric Motivation . . . 74

    2.2.1 Least Squares (LS) Analysis . . . 77
    2.2.2 Mann-Whitney-Wilcoxon (MWW) Analysis . . . 78
    2.2.3 Computation . . . 80
  2.3 Examples . . . 80
  2.4 Inference Based on the Mann-Whitney-Wilcoxon . . . 83
    2.4.1 Testing . . . 83
    2.4.2 Confidence Intervals . . . 92
    2.4.3 Statistical Properties of the Inference Based on the MWW . . . 92
    2.4.4 Estimation of ∆ . . . 96
    2.4.5 Efficiency Results Based on Confidence Intervals . . . 97
  2.5 General Rank Scores . . . 99
    2.5.1 Statistical Methods . . . 102
    2.5.2 Efficiency Results . . . 103
    2.5.3 Connection between One and Two Sample Scores . . . 107
  2.6 L1 Analyses . . . 108
    2.6.1 Analysis Based on the L1 Pseudo Norm . . . 108
    2.6.2 Analysis Based on the L1 Norm . . . 112
  2.7 Robustness Properties . . . 115
    2.7.1 Breakdown Properties . . . 115
    2.7.2 Influence Functions . . . 116
  2.8 Lehmann Alternatives and Proportional Hazards . . . 118
    2.8.1 The Log Exponential and the Savage Statistic . . . 119
    2.8.2 Efficiency Properties . . . 121
  2.9 Two Sample Rank Set Sampling (RSS) . . . 123
  2.10 Two Sample Scale Problem . . . 125
    2.10.1 Optimal Rank-Based Tests . . . 125
    2.10.2 Efficacy of the Traditional F-Test . . . 133
  2.11 Behrens-Fisher Problem . . . 135
    2.11.1 Behavior of the Usual MWW Test . . . 135
    2.11.2 General Rank Tests . . . 137
    2.11.3 Modified Mathisen's Test . . . 138
    2.11.4 Modified MWW Test . . . 140
    2.11.5 Efficiencies and Discussion . . . 141
  2.12 Paired Designs . . . 143
    2.12.1 Behavior under Alternatives . . . 145
  2.13 Exercises . . . 148

3 Linear Models . . . 153
  3.1 Introduction . . . 153
  3.2 Geometry of Estimation and Tests . . . 153
    3.2.1 Estimation . . . 154
    3.2.2 The Geometry of Testing . . . 156

  3.3 Examples . . . 159
  3.4 Assumptions for Asymptotic Theory . . . 164
  3.5 Theory of Rank-Based Estimates . . . 166
    3.5.1 R-Estimators of the Regression Coefficients . . . 166
    3.5.2 R-Estimates of the Intercept . . . 170
  3.6 Theory of Rank-Based Tests . . . 176
    3.6.1 Null Theory of Rank-Based Tests . . . 177
    3.6.2 Theory of Rank-Based Tests under Alternatives . . . 181
    3.6.3 Further Remarks on the Dispersion Function . . . 185
  3.7 Implementation of the R-Analysis . . . 187
    3.7.1 Estimates of the Scale Parameter τϕ . . . 188
    3.7.2 Algorithms for Computing the R-Analysis . . . 191
    3.7.3 An Algorithm for a Linear Search . . . 193
  3.8 L1-Analysis . . . 194
  3.9 Diagnostics . . . 196
    3.9.1 Properties of R-Residuals and Model Misspecification . . . 196
    3.9.2 Standardization of R-Residuals . . . 202
    3.9.3 Measures of Influential Cases . . . 208
  3.10 Survival Analysis . . . 214
  3.11 Correlation Model . . . 220
    3.11.1 Huber's Condition for the Correlation Model . . . 221
    3.11.2 Traditional Measure of Association and its Estimate . . . 223
    3.11.3 Robust Measure of Association and its Estimate . . . 223
    3.11.4 Properties of R-Coefficients of Multiple Determination . . . 225
    3.11.5 Coefficients of Determination for Regression . . . 230
  3.12 High Breakdown (HBR) Estimates . . . 232
    3.12.1 Geometry of the HBR-Estimates . . . 232
    3.12.2 Weights . . . 233
    3.12.3 Asymptotic Normality of β̂HBR . . . 235
    3.12.4 Robustness Properties of the HBR Estimates . . . 239
    3.12.5 Discussion . . . 242
    3.12.6 Implementation and Examples . . . 243
    3.12.7 Studentized Residuals . . . 244
    3.12.8 Examples . . . 245
  3.13 Diagnostics for Differentiating between Fits . . . 247
  3.14 Rank-Based Procedures for Nonlinear Models . . . 252
    3.14.1 Implementation . . . 255
  3.15 Exercises . . . 257
  3.16 Exercises . . . 260

4 Experimental Designs: Fixed Effects . . . 275
  4.1 Introduction . . . 275
  4.2 Oneway Design . . . 276
    4.2.1 R-Fit of the Oneway Design . . . 277
    4.2.2 Rank-Based Tests of H0 : µ1 = · · · = µk . . . 281
    4.2.3 Tests of General Contrasts . . . 283
    4.2.4 More on Estimation of Contrasts and Location . . . 284
    4.2.5 Pseudo-observations . . . 286
  4.3 Multiple Comparison Procedures . . . 288
    4.3.1 Discussion . . . 294
  4.4 Twoway Crossed Factorial . . . 296
  4.5 Analysis of Covariance . . . 300
  4.6 Further Examples . . . 304
  4.7 Rank Transform . . . 310
    4.7.1 Monte Carlo Study . . . 312
  4.8 Exercises . . . 317

5 Models with Dependent Error Structure . . . 323
  5.1 General Mixed Models . . . 323
    5.1.1 Applications . . . 326
  5.2 Simple Mixed Models . . . 327
    5.2.1 Variance Component Estimators . . . 328
    5.2.2 Studentized Residuals . . . 329
    5.2.3 Example and Simulation Studies . . . 330
    5.2.4 Simulation Studies of Validity . . . 331
    5.2.5 Simulation Study of Other Score Functions . . . 333
  5.3 Rank-Based Procedures Based on Arnold Transformations . . . 333
    5.3.1 R Fit Based on Arnold Transformed Data . . . 334
  5.4 General Estimating Equations (GEE) . . . 339
    5.4.1 Asymptotic Theory . . . 342
    5.4.2 Implementation and a Monte Carlo Study . . . 343
    5.4.3 Example . . . 345
  5.5 Time Series . . . 348
  5.6 Exercises . . . 349

6 Multivariate . . . 351
  6.1 Multivariate Location Model . . . 351
  6.2 Componentwise Methods . . . 356
    6.2.1 Estimation . . . 359
    6.2.2 Testing . . . 361
    6.2.3 Componentwise Rank Methods . . . 364
  6.3 Spatial Methods . . . 366


    6.3.1 Spatial Sign Methods . . . 366
    6.3.2 Spatial Rank Methods . . . 373
  6.4 Affine Equivariant and Invariant Methods . . . 377
    6.4.1 Blumen's Bivariate Sign Test . . . 377
    6.4.2 Affine Invariant Sign Tests in the Multivariate Case . . . 379
    6.4.3 The Oja Criterion Function . . . 387
    6.4.4 Additional Remarks . . . 391
  6.5 Robustness of Multivariate Estimates of Location . . . 392
    6.5.1 Location and Scale Invariance: Componentwise Methods . . . 392
    6.5.2 Rotation Invariance: Spatial Methods . . . 392
    6.5.3 The Spatial Hodges-Lehmann Estimate . . . 394
    6.5.4 Affine Equivariant Spatial Median . . . 394
    6.5.5 Affine Equivariant Oja Median . . . 394
  6.6 Linear Model . . . 395
    6.6.1 Test for Regression Effect . . . 397
    6.6.2 The Estimate of the Regression Effect . . . 404
    6.6.3 Tests of General Hypotheses . . . 405
  6.7 Experimental Designs . . . 412
  6.8 Exercises . . . 416

A Asymptotic Results . . . 421
  A.1 Central Limit Theorems . . . 421
  A.2 Simple Linear Rank Statistics . . . 422
    A.2.1 Null Asymptotic Distribution Theory . . . 423
    A.2.2 Local Asymptotic Distribution Theory . . . 424
    A.2.3 Signed-Rank Statistics . . . 431
  A.3 Results for Rank-Based Analysis of Linear Models . . . 433
    A.3.1 Convex Functions . . . 436
    A.3.2 Asymptotic Linearity and Quadraticity . . . 437
    A.3.3 Asymptotic Distance Between β̂ and β̃ . . . 439
    A.3.4 Consistency of the Test Statistic Fϕ . . . 440
    A.3.5 Proof of Lemma 3.5.1 . . . 442
  A.4 Asymptotic Linearity for the L1 Analysis . . . 443
  A.5 Influence Functions . . . 446
    A.5.1 Influence Function for Estimates Based on Signed-Rank Statistics . . . 447
    A.5.2 Influence Functions for Chapter 3 . . . 448
    A.5.3 Influence Function of β̂HBR of Chapter 5 . . . 454
  A.6 Asymptotic Theory for Chapter 5 . . . 455

B Larger Data Sets . . . 465


Preface

I don’t believe I can really do without teaching. The reason is, I have to have something so that when I don’t have any ideas and I’m not getting anywhere I can say to myself, “At least I’m living; at least I’m doing something; I’m making some contribution”; it’s just psychological.

Richard Feynman

We are currently revising these notes. Any corrections and/or comments are welcome.

This book is based on the premise that nonparametric or rank-based statistical methods are a superior choice in many data analytic situations. We cover location models, regression models including designed experiments, and multivariate models. Geometry provides a unifying theme throughout much of the development. We emphasize the similarity in interpretation with least squares methods. Basically, we replace the Euclidean norm with a weighted L-1 norm. This results in rank-based methods or L-1 methods depending on the choice of weights. The rank-based methods proceed much like the traditional analysis. Using the norm, models are easily fitted. Diagnostic procedures can then be used to check the quality of fit (model criticism) and to locate outlying points and points of high influence. Upon satisfaction with the fit, rank-based inferential procedures can be used to conduct the statistical analysis. The benefits include significant gains in power and efficiency when the error distribution has tails heavier than those of a normal distribution and superior robustness properties in general.

The main text concentrates on Wilcoxon and L-1 methods. The theoretical development for general scores (weights) is contained in the Appendix. By restricting attention to Wilcoxon rank methods, we can recommend a unified approach to data analysis beginning with the simple location models and extending through complex regression models and designed experiments. All major methodology is illustrated on real data. The examples are intended as guides for the application of the rank and L-1 methods. Furthermore, all the data sets in this book can be obtained from the web site: http://www.stat.wmich.edu/home.html.

Selected topics from the first four chapters provide a basic graduate course in rank-based methods. The prerequisites are an introductory course in mathematical statistics and some background in applied statistics. The first seven sections of Chapter 1 and the first four sections of Chapter 2 are fundamental for the development of Wilcoxon signed-rank and Mann-Whitney-Wilcoxon rank sum methods in the one- and two-sample location models. In Chapter 3, on the linear model, sections one through seven and section nine present the basic material for estimation, testing and diagnostic procedures for model criticism. Sections two through four of Chapter 4 give extensive development of methods for the one- and two-way layouts. Then, depending on individual tastes, there are several more exotic topics in each chapter to choose from.

Chapters 5 and 6 contain more advanced material. In Chapter 5 we extend rank-based methods for a linear model to bounded influence, high breakdown estimates and tests. In Chapter 6 we take up the concept of multidimensional rank. We then discuss various approaches to the development of rank-like procedures that satisfy various invariant/equivariant restrictions.

Computation of the procedures discussed in this book is very important. Minitab contains an undocumented RREG (rank regression) command. It contains various subcommands that allow for testing and estimation in the linear model. The reader can contact Minitab at (put email address or web page address here) and request a technical report that describes the RREG command. In many of the examples of this book the package rglm is used to obtain the rank-based analyses. The basic algorithms behind this package are described in Chapter 3. Information (including online rglm analyses of examples) can be obtained from the web site: http://www.stat.wmich.edu/home.html. Students can also be encouraged to write their own S-plus functions for specific methods.

We are indebted to many of our students and colleagues for valuable discussions, stimulation, and motivation. In particular, the first author would like to express his sincere thanks for many stimulating hours of discussion with Steve Arnold, Bruce Brown, and Hannu Oja, while the second author wants to express his sincere thanks for discussions with John Kapenga, Joshua Naranjo, Jerry Sievers, and Tom Vidmar. We both would like to express our debt to Simon Sheather, our friend, colleague, and co-author on many papers. Finally, we would like to thank Jun Recta for assistance in creating several of the plots.

Tom Hettmansperger
Joe McKean

July 2008
State College, PA
Kalamazoo, MI


Chapter 1

One Sample Problems

1.1 Introduction

Traditional statistical procedures are widely used because they offer the user a unified methodology with which to attack a multitude of problems, from simple location problems to highly complex experimental designs. These procedures are based on least squares fitting. Once the problem has been cast into a model then least squares offers the user:

1. a way of fitting the model by minimizing the Euclidean normed distance between the responses and the conjectured model;

2. diagnostic techniques that check the adequacy of the fit of the model, explore the quality of fit, and detect outlying and/or influential cases;

3. inferential procedures, including confidence procedures, tests of hypotheses and multiple comparison procedures;

4. computational feasibility.

Procedures based on least squares, though, are easily impaired by outlying observations. Indeed one outlying observation is enough to spoil the least squares fit, its associated diagnostics and inference procedures. Even though traditional inference procedures are exact when the errors in the model follow a normal distribution, they can be quite inefficient when the distribution of the errors has longer tails than the normal distribution.

For simple location problems, nonparametric methods were proposed by Wilcoxon (1945). These methods consist of test statistics based on the ranks of the data and associated estimates and confidence intervals for location parameters. The test statistics are distribution free in the sense that their null distributions do not depend on the distribution of the errors. It was soon realized that these procedures are almost as efficient as the traditional methods when the errors follow a normal distribution and, furthermore, are often much more efficient relative to the traditional methods when the error distributions deviate from normality; see Hodges and Lehmann (1956). These procedures possess both robustness of validity and power. In recent years these nonparametric methods have been extended to linear and nonlinear models. In addition, from the perspective of modern robustness theory, contrary to least squares estimates, these rank-based procedures have bounded influence functions and positive breakdown points.

Often these nonparametric procedures are thought of as disjoint methods that differ from one problem to another. In this text, we intend to show that this is not the case. Instead, these procedures present a unified methodology analogous to the traditional methods. The four items cited above for the traditional analysis hold for these procedures too. Indeed the only operational difference is that the Euclidean norm is replaced by another norm.

There are computational procedures available for the rank-based procedures discussed in this book. We offer the reader a collection of computational functions written in the software language R at the site http://www.stat.wmich.edu/mckean/Rfuncs/. We refer to these computational algorithms as rank-based R algorithms or RBR. We discuss these functions throughout the text and use them in many of the examples, simulation studies, and exercises. The programming language R (see Ihaka and Gentleman, 1996) is freeware and can run on all (PC, Mac, Linux) platforms. To download the R software and accompanying information, visit the site http://www.r-project.org/. The language R has intrinsic functions for computation of some of the procedures discussed in this and the next chapter.

1.2 Location Model

In this chapter we will consider the one sample location problem. This will allow us to explore some useful concepts, such as distribution freeness and robustness, in a simple setting. We will extend many of these concepts to more complicated situations in later chapters. We need to first define a location parameter. For a random variable X we often subscript its distribution function by X to avoid confusion.

Definition 1.2.1. Let T(H) be a function defined on the set of distribution functions. We say T(H) is a location functional if

1. if G is stochastically larger than F (i.e., G(x) ≤ F(x) for all x), then T(G) ≥ T(F);

2. T(HaX+b) = aT(HX) + b, a > 0;

3. T(H−X) = −T(HX).

Then, we will call θ = T(H) a location parameter of H.

Note that if X has location parameter θ it follows from the second item in the above definition that the random variable e = X − θ has location parameter 0. Suppose X1, . . . , Xn is a random sample having the common distribution function H(x) and θ = T(H) is a location parameter of interest. We express this by saying that Xi follows the statistical location model,

Xi = θ + ei , i = 1, . . . , n , (1.2.1)


where e1, . . . , en are independent and identically distributed random variables with distribution function F(x) and density function f(x) and location T(F) = 0. It follows that H(x) = F(x − θ) and that T(H) = θ. We next discuss three examples of location parameters that we will use throughout this chapter. Other location parameters are discussed in Section 1.8. See Bickel and Lehmann (1975) for additional discussion of location functionals.

Example 1.2.1. The Median Location Functional

First define the inverse of the cdf H(x) by H⁻¹(u) = inf{x : H(x) ≥ u}. Generally we will suppose that H(x) is strictly increasing on its support, and this will eliminate ambiguities on the selection of the parameter. Now define θ1 = T1(H) = H⁻¹(1/2). This is the median functional. Note that if G(x) ≤ F(x) for all x, then G⁻¹(u) ≥ F⁻¹(u) for all u; and, in particular, G⁻¹(1/2) ≥ F⁻¹(1/2). Hence, T1(H) satisfies the first condition for a location functional. Next let H∗(x) = P(aX + b ≤ x) = H[a⁻¹(x − b)]. Then it follows at once that H∗⁻¹(u) = aH⁻¹(u) + b and the second condition is satisfied. The third condition follows with an argument similar to the one for the second condition.

Example 1.2.2. The Mean Location Functional

For the mean functional let θ2 = T2(H) = ∫ x dH(x), when the mean exists. Note that ∫ x dH(x) = ∫ H⁻¹(u) du. Now if G(x) ≤ F(x) for all x, then x ≤ G⁻¹(F(x)). Let x = F⁻¹(u) and we have F⁻¹(u) ≤ G⁻¹(F(F⁻¹(u))) ≤ G⁻¹(u). Hence, T2(G) = ∫ G⁻¹(u) du ≥ ∫ F⁻¹(u) du = T2(F) and the first condition is satisfied. The other two conditions follow easily from the definition of the integral.

Example 1.2.3. The Pseudo-Median Location Functional

Assume that X1 and X2 are independent and identically distributed (iid) with distribution function H(x). Let Y = (X1 + X2)/2. Then Y has distribution function H∗(y) = P(Y ≤ y) = ∫ H(2y − x)h(x) dx. Let θ3 = T3(H) = H∗⁻¹(1/2). To show that T3 is a location functional, suppose G(x) ≤ F(x) for all x. Then

G∗(y) = ∫ G(2y − x)g(x) dx = ∫ [∫_{−∞}^{2y−x} g(t) dt] g(x) dx ≤ ∫ [∫_{−∞}^{2y−x} f(t) dt] g(x) dx
      = ∫ [∫_{−∞}^{2y−t} g(x) dx] f(t) dt ≤ ∫ [∫_{−∞}^{2y−t} f(x) dx] f(t) dt = F∗(y) ;

hence, as in Example 1.2.1, it follows that G∗⁻¹(u) ≥ F∗⁻¹(u) and, hence, that T3(G) ≥ T3(F). For the second property, let W = aX + b where X has distribution function H and a > 0. Then W has distribution function FW(t) = H((t − b)/a). Then by the change of variable z = (x − b)/a, we have

F∗W(y) = ∫ H((2y − x − b)/a) (1/a) h((x − b)/a) dx = ∫ H(2(y − b)/a − z) h(z) dz .


Thus the defining equation for T3(FW) is

1/2 = ∫ H(2(T3(FW) − b)/a − z) h(z) dz ,

which is satisfied for T3(FW) = aT3(H) + b.

which is satisfied for T3(FW ) = aT3(H) + b. For the third property, let V = −X where Xhas distribution function H . Then V has distribution function FV (t) = 1 −H(−t). Hence,by the change in variable z = −x,

F ∗V (y) =

∫(1 −H(−2y + x))h(−x) dx = 1 −

∫H(−2y − z))h(z) dz .

Because the defining equation of T3(FV ) can be written as

1

2=

∫H(2(−T3(FV )) − z)h(z) dz ,

it follows that T3(FV) = −T3(H). Therefore, T3 is a location functional. It has been called the pseudo-median by Hoyland (1965) and is more appropriate for symmetric distributions.
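Although the pseudo-median is defined through the convolution distribution H∗, its sample analogue is simple to compute. The following R sketch (our illustration, not from the text) approximates the three functionals of Examples 1.2.1-1.2.3 on a sample from a symmetric distribution; all three estimate the center of symmetry, as Theorem 1.2.1 below confirms.

# Sample analogues of the median, mean, and pseudo-median functionals.
set.seed(1)
x  <- rnorm(500, mean = 3)   # symmetric about 3
ij <- which(upper.tri(diag(length(x)), diag = TRUE), arr.ind = TRUE)
c(median    = median(x),                               # T1(H)
  mean      = mean(x),                                 # T2(H)
  pseudomed = median((x[ij[, 1]] + x[ij[, 2]]) / 2))   # T3(H)
# All three values are near 3, the center of symmetry.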

The next theorem characterizes all the location functionals for a symmetric distribution.

Theorem 1.2.1. Suppose that the pdf h(x) is symmetric about some point a. If T(H) is a location functional, then T(H) = a.

Proof. Let the random variable X have pdf h(x) symmetric about a. Let Y = X − a; then Y has pdf g(y) = h(y + a) symmetric about 0. Hence Y and −Y have the same distribution. By the third property of location functionals, this means that T(GY) = T(G−Y) = −T(GY); i.e., T(GY) = 0. But by the second property, 0 = T(GY) = T(H) − a; that is, a = T(H).

This theorem means that when we sample from a symmetric distribution we can unambiguously define location as the center of symmetry. Then all location functionals that we may wish to study will specify the same location parameter.

1.3 Geometry and Inference in the Location Model

Letting X = (X1, . . . , Xn)′ and e = (e1, . . . , en)′, we then write the statistical location model, (1.2.1), as

X = 1θ + e , (1.3.1)

where 1 denotes the vector all of whose components are 1 and T(Fe) = 0. If ΩF denotes the one-dimensional subspace spanned by 1, then we can express the model more compactly as X = η + e, where η ∈ ΩF. The subscript F on Ω stands for full model in the context of hypothesis testing as discussed below.

Let x be a realization of X. Note that except for random error, x would lie in ΩF. Hence an intuitive fitting criterion is to estimate θ by a value θ̂ such that the vector 1θ̂ ∈ ΩF lies “closest” to x, where “closest” is defined in terms of a norm. Furthermore, a norm, as the following general discussion shows, provides a complete inference for the parameter θ.

Recall that a norm is a nonnegative function, ‖ · ‖, defined on Rⁿ such that ‖y‖ ≥ 0 for all y; ‖y‖ = 0 if and only if y = 0; ‖ay‖ = |a|‖y‖ for all real a; and ‖y + z‖ ≤ ‖y‖ + ‖z‖. The distance between two vectors is d(z, y) = ‖z − y‖.

Given a location model, (1.3.1), and a specified norm, ‖ · ‖, the estimate of θ induced by the norm is

θ̂ = argmin ‖x − 1θ‖ , (1.3.2)

i.e., the value which minimizes the distance between x and the space ΩF. As discussed in Exercise 1.12.1, a minimizing value always exists. The dispersion function induced by the norm is given by

D(θ) = ‖x − 1θ‖ . (1.3.3)

The minimum distance between the vector of observations x and the space ΩF is D(θ̂). As Exercise 1.12.3 shows, D(θ) is a convex, continuous function of θ which is differentiable almost everywhere. Actually the norms discussed in this book are differentiable at all but at most a finite number of points. We define the gradient process by the function

S(θ) = −(d/dθ) D(θ) . (1.3.4)

As Exercise 1.12.3 shows, S(θ) is a nonincreasing function. Its discontinuities are the points where D(θ) is nondifferentiable. Furthermore the minimizing value is a value where S(θ) is 0 or, due to a discontinuity, steps through 0. We express this by saying that θ̂ solves the equation

S(θ̂) ≐ 0 . (1.3.5)

Suppose we can represent the above estimate by θ̂ = θ̂(x) = θ̂(Hn), where Hn denotes the empirical distribution function of the sample. The notation θ̂(Hn) is suggestive of the functional notation used in the last section. This is as it should be, since it is easy to show that θ̂ satisfies the sample analogues of properties (2) and (3) of Definition 1.2.1. For property (2), consider the estimating equation of the translated sample y = ax + 1b, for a > 0, given by

θ̂(y) = argmin ‖y − 1θ‖ = a · argmin ‖x − 1(θ − b)/a‖ .

From this we immediately have that θ̂(y) = aθ̂(x) + b. For property (3), the defining equation for the sample y = −x is

θ̂(y) = argmin ‖y − 1θ‖ = argmin ‖x − 1(−θ)‖ ,

from which we have θ̂(y) = −θ̂(x). Furthermore, for the norms considered in this book it is easy to check that θ̂(Hn) ≥ θ̂(Gn) when Hn and Gn are empirical cdfs for which Hn is stochastically larger than Gn.


Hence, the norms generate location functionals on the set of empirical cdfs. The L1 norm provides an easy example. We can think of θ̂(Hn) = Hn⁻¹(1/2) as the restriction of θ(H) = H⁻¹(1/2) to the class of discrete distributions which assign mass 1/n to n points. Generally we can think of θ̂(Hn) as the restriction of θ(H) or, conversely, we can think of θ(H) as the extension of θ̂(Hn). We let the norm determine the location. This is especially simple in the symmetric location model, where all location functionals are equal to the point of symmetry.
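To make the geometry concrete, the following R sketch (our illustration; hypothetical code, not part of the RBR collection) evaluates the L1 dispersion function D(θ) = ∑|xi − θ| and its negative gradient S(θ) = ∑ sgn(xi − θ) over a grid; the minimizer of D is the sample median, the point where S steps through zero.

# L1 dispersion and gradient for a small sample.
x <- c(1.2, -0.4, 3.1, 0.7, 2.5)
D <- function(theta) sum(abs(x - theta))     # dispersion (1.3.3) for the L1 norm
S <- function(theta) sum(sign(x - theta))    # negative gradient (1.3.4)
thetas <- seq(min(x), max(x), length = 401)
thetahat <- thetas[which.min(sapply(thetas, D))]
c(gridmin = thetahat, median = median(x))    # the two agree, up to grid error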

Next consider the hypotheses,

H0 : θ = θ0 versus HA : θ ≠ θ0 , (1.3.6)

for a specified θ0. Because of the second property of location functionals in Definition 1.2.1, we can assume without loss of generality that θ0 = 0; otherwise we need only subtract θ0 from each Xi. Based on the data, the most acceptable value of θ is the value at which the gradient S(θ) is zero. Hence large values of |S(0)| favor HA. Formally, the level α gradient test or score test for the hypotheses (1.3.6) is given by

Reject H0 in favor of HA if |S(0)| ≥ c , (1.3.7)

where c is such that P0[|S(0)| ≥ c] = α. Typically, the null distribution of S(0) is symmetric, so there is no loss in generality in considering symmetrical critical regions.

A second formulation of a test statistic is based on the difference in minimizing dispersions, or the reduction in dispersion. Call Model (1.2.1) the full model. As noted above, the distance between x and the subspace ΩF is D(θ̂). The reduced model is the full model subject to H0. In this case the reduced model space is {0}. Hence the distance between x and the reduced model space is D(0). Under H0, x should be close to this space; therefore, the reduction in dispersion test is given by

Reject H0 in favor of HA if RD = D(0) − D(θ̂) ≥ m , (1.3.8)

where m is determined by the null distribution of RD. This test will be used in Chapter 3 and subsequent chapters.

A third formulation is based on the standardized estimate:

Reject H0 in favor of HA if |θ̂| / √(Var θ̂) ≥ γ , (1.3.9)

where γ is determined by the null distribution of θ̂. Tests based directly on the estimate are often referred to as Wald type tests.

The following useful theorem allows us to shift between computing probabilities when θ = 0 and for general θ. Its proof is a straightforward application of a change of variables. See Theorem A.2.4 of the Appendix for a more general result.

Theorem 1.3.1. Suppose that we can write S(θ) = S(x1 − θ, . . . , xn − θ). Then Pθ(S(0) ≤ t) = P0(S(−θ) ≤ t).


We now turn to the problem of the construction of a (1 − α)100% confidence interval for θ based on S(θ). Such an interval is easily obtained by inverting the acceptance region of the level α test given by (1.3.7). The acceptance region is |S(0)| < c. Define

θ̂L = inf{t : S(t) < c} and θ̂U = sup{t : S(t) > −c} . (1.3.10)

Then because S(θ) is nonincreasing,

{θ : |S(θ)| < c} = {θ : θ̂L ≤ θ ≤ θ̂U} . (1.3.11)

Thus from Theorem 1.3.1,

Pθ(θ̂L ≤ θ ≤ θ̂U) = Pθ(|S(θ)| < c) = P0(|S(0)| < c) = 1 − α . (1.3.12)

Hence, inverting a size α test results in the (1 − α)100% confidence interval (θ̂L, θ̂U). Thus a norm not only provides a fitting criterion but also a complete inference. As with all statistical analyses, checks on the appropriateness of the model and the quality of fit are needed. Useful plots here include: stem-leaf plots and q−q plots to check shape and distributional assumptions, boxplots and dotplots to check for outlying observations, and a plot of Xi versus i (or other appropriate variables) to check for dependence between observations. Some of these diagnostic checks are performed in the next section of numerical examples.

In the next three examples, we discuss the inference for the norms associated with the location functionals presented in the last section. We state the results of their associated inference, which we will derive in later sections.

Example 1.3.1. L1-Norm

Recall that the L1 norm is defined as ‖x‖1 = ∑|xi|; hence the associated dispersion and negative gradient functions are given respectively by D1(θ) = ∑|Xi − θ| and S1(θ) = ∑ sgn(Xi − θ). Letting Hn denote the empirical cdf, we can write the estimating equation as

0 = n⁻¹ ∑ sgn(xi − θ) = ∫ sgn(x − θ) dHn(x) .

The solution, of course, is θ̂, the median of the observations. If we replace the empirical cdf Hn by the true underlying cdf H then the estimating equation becomes the defining equation for the parameter θ = T(H). In this case, we have

0 = ∫ sgn(x − T(H)) dH(x) = −∫_{−∞}^{T(H)} dH(x) + ∫_{T(H)}^{∞} dH(x) ;

hence, H(T(H)) = 1/2 and, solving for T(H), we find T(H) = H⁻¹(1/2) as expected.


As we show in Section 1.5,

θ̂ has an asymptotic N(θ, τS²/n) distribution , (1.3.13)

where τS = 1/(2h(θ)). Estimation of the standard deviation of θ̂ is discussed in Section 1.5.

Turning next to testing the hypotheses (1.3.6), the gradient test statistic is S1(0) = ∑ sgn(Xi). But we can write S1(0) = S1⁺ − S1⁻ + S1⁰, where S1⁺ = ∑ I(Xi > 0), S1⁻ = ∑ I(Xi < 0), and S1⁰ = ∑ I(Xi = 0) = 0 with probability one, since we are sampling from a continuous distribution; I(·) is the indicator function. In practice, we must deal with ties, and this is usually done by setting aside those observations that are equal to the hypothesized value and carrying out the test with a reduced sample size. Now note that n = S1⁺ + S1⁻, so that we can write S1 = 2S1⁺ − n and the test can be based on S1⁺. The null distribution of S1⁺ is binomial with parameters n and 1/2. Hence the level α sign test of the hypotheses (1.3.6) is

Reject H0 in favor of HA if S1⁺ ≤ c1 or S1⁺ ≥ n − c1 , (1.3.14)

and c1 satisfies

P[bin(n, 1/2) ≤ c1] = α/2 , (1.3.15)

where bin(n, 1/2) denotes a binomial random variable based on n trials and with probability of success 1/2. Note that the critical value of the test can be determined without specifying the shape of F. In this sense, the test based on S1 is distribution free or nonparametric. Using the asymptotic null distribution of S1⁺, c1 can be approximated as c1 ≐ n/2 − √n zα/2/2 − .5, where Φ(−zα/2) = α/2, Φ(·) is the standard normal cdf, and .5 is the continuity correction.

For the associated (1 − α)100% confidence interval, we follow the general development above, (1.3.12). Hence, we must find θ̂L = inf{t : S1⁺(t) < n − c1}, where c1 is given by (1.3.15). Note that S1⁺(t) < n − c1 if and only if the number of Xi greater than t is less than n − c1. But #{i : Xi > X(c1+1)} = n − c1 − 1 and #{i : Xi > X(c1+1) − ε} ≥ n − c1 for any ε > 0. Hence, θ̂L = X(c1+1). A similar argument shows that θ̂U = X(n−c1). We can summarize this by saying that the (1 − α)100% L1 confidence interval is the half open, half closed interval

[X(c1+1), X(n−c1)) where α/2 = P(S1⁺(0) ≤ c1) determines c1 . (1.3.16)

The critical value c1 can be determined from the binomial(n, 1/2) distribution or from the normal approximation cited above. The interval developed here is a distribution-free confidence interval since the confidence coefficient is determined from the binomial distribution without making any shape assumption on the underlying model distribution.
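The pieces of this example are easy to assemble in R. The sketch below (our illustration, not the RBR function onesampsgn) computes S1⁺, its exact binomial p-value, and the distribution-free interval (1.3.16); following the convention discussed in Section 1.4, ties with θ0 are set aside for the test but not for the interval.

# Sign test and L1 confidence interval, following (1.3.14)-(1.3.16).
signtest <- function(x, theta0 = 0, alpha = 0.05) {
  x0 <- x[x != theta0]                  # observations tied with theta0 set aside
  n0 <- length(x0)
  Splus <- sum(x0 > theta0)             # S1+ = #{X_i > theta0}
  pval  <- 2 * min(pbinom(Splus, n0, 0.5), 1 - pbinom(Splus - 1, n0, 0.5))
  n  <- length(x)                       # full sample used for the interval
  c1 <- qbinom(alpha / 2, n, 0.5) - 1   # P[bin(n,1/2) <= c1] <= alpha/2 (conservative)
  xs <- sort(x)
  list(Splus = Splus, p.value = min(pval, 1),
       conf.int = c(xs[c1 + 1], xs[n - c1]))   # [X_(c1+1), X_(n-c1))
}
signtest(rnorm(15, mean = 0.5))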

Example 1.3.2. L2-Norm

Recall that the square of the L2 norm is given by ‖x‖2² = ∑_{i=1}^n xi². As shown in Exercise 1.12.4, the estimate determined by this norm is the sample mean X̄ and the functional parameter is µ = ∫ x h(x) dx, provided it exists. Hence the L2 norm is consistent for the mean location problem. The associated test statistic is equivalent to Student's t-test. The approximate distribution of X̄ is N(0, σ²/n), provided the variance σ² = Var X1 exists. Hence, the test statistic is not distribution free. In practice, σ is replaced by its estimate s = (∑(Xi − X̄)²/(n − 1))^{1/2} and the test is based on the t-ratio, t = √n X̄/s, which, under the null hypothesis, is asymptotically N(0, 1). The usual confidence interval is X̄ ± tα/2,n−1 s/√n, where tα/2,n−1 is the (1 − α/2)-quantile of a t-distribution with n − 1 degrees of freedom. This interval has the approximate confidence coefficient (1 − α)100%, unless the errors, ei, follow a normal distribution, in which case it has exact confidence.
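In R, the entire L2 analysis of this example is available through the intrinsic function t.test; for example:

# The L2 (least squares) analysis via R's t.test.
x <- rnorm(20, mean = 0.5)    # placeholder data
t.test(x, mu = 0)             # t-ratio sqrt(n)*mean(x)/s, p-value from the
                              # t-distribution with n-1 df, and the interval
                              # mean(x) +/- t_{alpha/2,n-1} * s/sqrt(n)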

Example 1.3.3. Weighted L1 Norm

Consider the function

‖x‖3 = ∑_{i=1}^n R(|xi|)|xi| , (1.3.17)

where R(|xi|) denotes the rank of |xi| among |x1|, . . . , |xn|. As the next theorem shows, this function is a norm on Rⁿ. See Section 1.8 for a general weighted L1 norm.

Theorem 1.3.2. The function ‖x‖3 = ∑ j|x|(j) = ∑ R(|xj|)|xj| is a norm, where R(|xj|) is the rank of |xj| among |x1|, . . . , |xn| and |x|(1) ≤ · · · ≤ |x|(n) are the ordered absolute values.

Proof. The equality relating ‖x‖3 to the ranks is clear. To show that we have a norm, we first note that ‖x‖3 ≥ 0 and that ‖x‖3 = 0 if and only if x = 0. Also clearly ‖ax‖3 = |a|‖x‖3 for any real a. Hence, to finish the proof, we must verify the triangle inequality. Now

‖x + y‖3 = ∑ j|x + y|(j) = ∑ R(|xi + yi|)|xi + yi| ≤ ∑ R(|xi + yi|)|xi| + ∑ R(|xi + yi|)|yi| . (1.3.18)

Consider the first term on the right side. By summing through another index we can write it as

∑ R(|xi + yi|)|xi| = ∑ bj|x|(j) ,

where b1, . . . , bn is a permutation of the integers 1, . . . , n. Suppose the bj are not in order; then there exist t and s such that |x|(t) ≤ |x|(s) but bt > bs. Whence,

[bs|x|(t) + bt|x|(s)] − [bt|x|(t) + bs|x|(s)] = (bt − bs)(|x|(s) − |x|(t)) ≥ 0 .

Hence such an interchange never decreases the sum. This leads to the result

∑ R(|xi + yi|)|xi| ≤ ∑ j|x|(j) .

A similar result holds for the second term on the right side of (1.3.18). Therefore, ‖x + y‖3 ≤ ∑ j|x|(j) + ∑ j|y|(j) = ‖x‖3 + ‖y‖3, and this completes the proof. The above argument is taken from Hardy, Littlewood, and Polya (1952).


We shall call this norm the weighted L1 norm. In the next theorem, we offer an interesting identity satisfied by this norm. First, though, we need another representation of it. For a random sample X1, . . . , Xn, define the anti-ranks to be the random variables D1, . . . , Dn such that

Z1 = |XD1| ≤ · · · ≤ Zn = |XDn| . (1.3.19)

For example, if D1 = 2 then |X2| is the smallest absolute value and Z1 has rank 1. Note that the anti-rank function is just the inverse of the rank function. We can then write

‖x‖3 = ∑_{j=1}^n j|x|(j) = ∑_{j=1}^n j|xDj| . (1.3.20)

Theorem 1.3.3. For any vector x,

‖x‖3 = ∑∑_{i≤j} |(xi + xj)/2| + ∑∑_{i<j} |(xi − xj)/2| . (1.3.21)

Proof: Letting the index run through the anti-ranks, we have

∑∑_{i≤j} |(xi + xj)/2| + ∑∑_{i<j} |(xi − xj)/2| = ∑_{i=1}^n |xi| + ∑∑_{i<j} { |(xDi + xDj)/2| + |(xDj − xDi)/2| } . (1.3.22)

For i < j, so that |xDi| ≤ |xDj|, consider the expression

|(xDi + xDj)/2| + |(xDj − xDi)/2| .

There are four cases to consider: where xDi and xDj are both positive; where they are both negative; and the two cases where they have mixed signs. In all these cases, though, it is easy to show that

|(xDi + xDj)/2| + |(xDj − xDi)/2| = |xDj| .

Using this, we have that the right side of expression (1.3.22) is equal to

∑_{i=1}^n |xi| + ∑∑_{i<j} |xDj| = ∑_{j=1}^n |xDj| + ∑_{j=1}^n (j − 1)|xDj| = ∑_{j=1}^n j|xDj| = ‖x‖3 , (1.3.23)

and we are finished.

The associated gradient function is

T(θ) = ∑_{i=1}^n R(|Xi − θ|) sgn(Xi − θ) = ∑∑_{i≤j} sgn((Xi + Xj)/2 − θ) . (1.3.24)
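The identity (1.3.21) and the pairwise-average form of the gradient are easy to check numerically; the R sketch below is our illustration only.

# Numerical check of the identity (1.3.21).
x   <- c(1.5, -0.3, 2.2, -1.1, 0.4)
lhs <- sum(seq_along(x) * sort(abs(x)))                 # sum_j j |x|_(j)
ij  <- expand.grid(i = seq_along(x), j = seq_along(x))
le  <- ij$i <= ij$j; lt <- ij$i < ij$j
rhs <- sum(abs((x[ij$i[le]] + x[ij$j[le]]) / 2)) +
       sum(abs((x[ij$i[lt]] - x[ij$j[lt]]) / 2))
all.equal(lhs, rhs)                                     # TRUE

# Gradient T(theta) of (1.3.24) via the Walsh averages.
Tgrad <- function(theta) sum(sign((x[ij$i[le]] + x[ij$j[le]]) / 2 - theta))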


The middle term is due to the fact that the ranks only change values at the finite number of points determined by |Xi − θ| = |Xj − θ|; otherwise R(|Xi − θ|) is constant. The third term is obtained immediately from the identity (1.3.21). The n(n + 1)/2 pairwise averages {(Xi + Xj)/2 : 1 ≤ i ≤ j ≤ n} are called the Walsh averages. Hence, the estimate of θ is the median of the Walsh averages, which we shall denote as

θ̂3 = med_{i≤j} {(Xi + Xj)/2} , (1.3.25)

first discussed by Hodges and Lehmann (1963). Often θ̂3 is called the Hodges-Lehmann estimate of location. In order to obtain the corresponding location functional, note that

R(|Xi − θ|) = #{j : |Xj − θ| ≤ |Xi − θ|} = #{j : θ − |Xi − θ| ≤ Xj ≤ θ + |Xi − θ|}
            = nHn(θ + |Xi − θ|) − nHn⁻(θ − |Xi − θ|) ,

where Hn⁻ is the left limit of Hn. Hence (1.3.24) becomes

∫ {Hn(θ + |x − θ|) − Hn⁻(θ − |x − θ|)} sgn(x − θ) dHn(x) = 0 ,

and in the limit we have

∫ {H(θ + |x − θ|) − H(θ − |x − θ|)} sgn(x − θ) dH(x) = 0 ,

that is,

−∫_{−∞}^{θ} {H(2θ − x) − H(x)} dH(x) + ∫_{θ}^{∞} {H(x) − H(2θ − x)} dH(x) = 0 .

This simplifies to

∫_{−∞}^{∞} H(2θ − x) dH(x) = ∫_{−∞}^{∞} H(x) dH(x) = 1/2 . (1.3.26)

Hence, the functional is the pseudo-median defined in Example 1.2.3. If the density h(x) is symmetric then from (1.7.11)

θ̂3 has an approximate N(θ3, τ²/n) distribution , (1.3.27)

where τ = 1/(√12 ∫ h²(x) dx). Estimation of τ is discussed in Section 3.7.

The most convenient form of the gradient process is

T⁺(θ) = ∑∑_{i≤j} I((Xi + Xj)/2 > θ) = ∑_{i=1}^n R(|Xi − θ|) I(Xi > θ) . (1.3.28)


The corresponding gradient test statistic for the hypotheses (1.3.6) is T⁺(0). In Section 1.7, provided that h(x) is symmetric, it is shown that T⁺(0) is distribution free under H0 with null mean and variance n(n + 1)/4 and n(n + 1)(2n + 1)/24, respectively. This test is often referred to as the Wilcoxon signed-rank test. Thus the test for the hypotheses (1.3.6) is

Reject H0 in favor of HA if T⁺(0) ≤ k or T⁺(0) ≥ n(n + 1)/2 − k , (1.3.29)

where P(T⁺(0) ≤ k) = α/2. An approximation for k is given in the next paragraph.

Because of the similarity between the sign and signed-rank processes, the confidence interval based on T⁺(θ) follows immediately from the argument given in Example 1.3.1 for the sign process. Instead of the order statistics, which were used in the confidence interval based on the sign process, in this case we use the ordered Walsh averages, which we denote as W(1), . . . , W(n(n+1)/2). Hence a (1 − α)100% confidence interval for θ is given by

[W(k+1), W((n(n+1)/2)−k)) where k is such that α/2 = P(T⁺(0) ≤ k) . (1.3.30)

As with the sign process, k can be approximated using the asymptotic normal distribution of T⁺(0) by

k ≐ n(n + 1)/4 − zα/2 √(n(n + 1)(2n + 1)/24) − .5 ,

where zα/2 is the (1 − α/2)-quantile of the standard normal distribution. Provided that h(x) is symmetric, this confidence interval is distribution free.
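For illustration, here is a direct R sketch (ours) of the Walsh-average computations: the Hodges-Lehmann estimate (1.3.25), the statistic T⁺(0) of (1.3.28), and the interval (1.3.30) with k from the normal approximation. R's intrinsic function wilcox.test(x, conf.int = TRUE) carries out the same analysis.

# Hodges-Lehmann estimate and Wilcoxon signed-rank confidence interval.
x  <- rnorm(15, mean = 0.5)          # placeholder data
n  <- length(x)
ij <- which(upper.tri(diag(n), diag = TRUE), arr.ind = TRUE)
walsh <- (x[ij[, 1]] + x[ij[, 2]]) / 2          # the n(n+1)/2 Walsh averages
Tplus <- sum(walsh > 0)                         # T+(0)
k <- floor(n * (n + 1) / 4 -
           qnorm(0.975) * sqrt(n * (n + 1) * (2 * n + 1) / 24) - 0.5)
W <- sort(walsh)
c(estimate = median(walsh),                     # (1.3.25)
  lower = W[k + 1], upper = W[n * (n + 1) / 2 - k])   # (1.3.30), 95% interval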

1.3.1 Computation

The three procedures discussed in this section are easily computed in R. The R intrinsic functions t.test and wilcox.test compute the t- and Wilcoxon signed-rank tests, respectively. Our collection of R functions, RBR, contains the functions onesampwil and onesampsgn, which compute the asymptotic versions of the Wilcoxon signed-rank and sign tests, respectively. These functions also compute the associated estimates, confidence intervals and standard errors. Their use is discussed in the examples. Minitab (see ??) also can be used to compute these tests. At command line, the Minitab commands stest, wtest, and ttest compute the sign, Wilcoxon signed-rank, and t-tests, respectively.
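For example, with a sample in the vector x, the intrinsic analyses take the following form (the sign test has no dedicated intrinsic function, but its exact binomial version is available through binom.test):

x <- rnorm(20, mean = 0.5)                   # placeholder data
t.test(x, mu = 0)                            # t-test and confidence interval
wilcox.test(x, mu = 0, conf.int = TRUE)      # Wilcoxon signed-rank test with the
                                             # Hodges-Lehmann estimate and interval
binom.test(sum(x > 0), length(x), p = 0.5)   # exact sign test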

1.4 Examples

In applications, by convention, when testing the null hypothesis H0 : θ = θ0 using the sign test, any data point equal to θ0 is set aside and the sample size is reduced. On the other hand, these values are not set aside for point estimation or confidence intervals. The output of the RBR functions onesampwil and onesampsgn includes the test statistics T and S, respectively, and a continuity corrected standardized value z. The p-values are approximated by computing normal probabilities on z. Especially for small sample sizes, for the test based on the signs, S, the approximate and exact p-values can be somewhat different. In calculating the signed-ranks for the test statistic T, we use average ranks. For t-tests, we report the p-values and confidence intervals using the t-distribution with n − 1 degrees of freedom.

Table 1.4.1: Excess hours of sleep under the influence of two drugs and the difference in excesses.

Row   Dextro   Laevo   Diff (L − D)
 1     -0.1     -0.1      0.0
 2      0.8      1.6      0.8
 3      3.4      4.4      1.0
 4      0.7      1.9      1.2
 5     -0.2      1.1      1.3
 6     -1.2      0.1      1.3
 7      2.0      3.4      1.4
 8      3.7      5.5      1.8
 9     -1.6      0.8      2.4
10      0.0      4.6      4.6

Example 1.4.1. Cushney-Peebles Data.

Table 1.4.1 gives the average excess number of hours of sleep that each of 10 patients achieved from the use of two drugs. The third column gives the difference (Laevo − Dextro) in excesses across the two drugs. This is a famous data set. Gosset, writing under the pseudonym Student, published his landmark paper on the t-test in 1908 and used this data set for illustration. The differences, however, suggest that the L2 methods may not be the methods of choice in this case. The normal quantile plot, Panel A of Figure 1.4.1, shows that the tails may be heavy and that there may be an outlier. A normal quantile plot has the data (differences) on the vertical axis and the expected values of the standard normal order statistics on the horizontal axis. When the data are consistent with a normal assumption, the plot should be roughly linear. The boxplot, with 95% L1 confidence interval, Panel B of Figure 1.4.1, further illustrates the presence of an outlier. The box is defined by the quartiles and the shaded notch represents the confidence interval.

For the sake of discussion and comparison of methods, we provide the p-values for the sign test, the Wilcoxon signed rank test, and the t-test. We used the RBR functions onesampwil, onesampsgn, and onesampt to compute the results for the Wilcoxon signed rank test, the sign test, and the t-test, respectively. For each function, the following display shows the necessary R code (these are preceded with the prompt >), which is then followed by the results. The standard errors (SE) for the sign and signed-rank estimates are given by (1.5.28) and (1.7.12), respectively, and are discussed in general in Section 1.5.5. These functions also produce a boxplot of the data. The boxplot produced by the function onesampsgn is shown in Figure 1.4.1.


Figure 1.4.1: Panel A: Normal q−q plot of the Cushney-Peebles data (differences, Laevo − Dextro, versus normal quantiles); Panel B: Boxplot with 95% notched confidence interval; Panel C: Sensitivity curve for the t-test; Panel D: Sensitivity curve for the sign test (Panels C and D plot the standardized test statistic against the value of the 10th difference).

[Figure not reproduced here.]

The following assumes that the differences are in the vector diffs.

> onesampwil(diffs)

Results for the Wilcoxon-Signed-Rank procedure

Test of theta = 0 versus theta not equal to 0

Test-Stat. is T 54 Standardized (z) Test-Stat. is 2.70113 p-value 0.00691043

Estimate 1.3 SE is 0.484031

95 % Confidence Interval is ( 0.9 , 2.7 )

Estimate of the scale parameter tau 1.530640


> onesampsgn(diffs)

Results for the Sign procedure

Test of theta = 0 versus theta not equal to 0

Test stat. S is 9 Standardized (z) Test-Stat. 2.666667 p-value 0.007660761

Estimate 1.3 SE is 0.4081708

95 % Confidence Interval is ( 0.8 , 2.4 )

Estimate of the scale parameter tau 1.290749

> temp=onesampt(diffs)

Results for the t-test procedure

Test of theta = 0 versus theta not equal to 0

Test stat. Ave(x) - 0 is 1.58 Standardized (t) Test-Stat. 4.062128 p-value 0.00283289

Estimate 1.58 SE is 0.3889587

95 % Confidence Interval is ( 0.7001142 , 2.459886 )

Estimate of the scale parameter sigma 1.229995

The confidence interval corresponding to the sign test is (0.8, 2.4), which is shifted above 0. Hence, there is strong support for the alternative hypothesis that the location of the difference distribution is not equal to zero. That is, we reject H0 : θ = 0 in favor of HA : θ ≠ 0 at α = .05. All three tests support this conclusion. The estimates of location corresponding to the three tests are the median (1.3), the median of the Walsh averages (1.3), and the mean of the sample differences (1.58). Note that the outlier had an effect on the sample mean.

In order to see how sensitive the test statistics are to outliers, we change the value of the outlier (the difference in the 10th row of Table 1.4.1) and plot the value of the test statistic against the value of the difference in the 10th row; see Panel C of Figure 1.4.1. Note that as the value of the 10th difference changes, the t-test changes quite rapidly. In fact, the t-test can be pulled out of the rejection region by making the difference sufficiently small or large. However, the sign test, Panel D of Figure 1.4.1, stays constant until the difference crosses zero and then only changes by 2. This illustrates the high sensitivity of the t-test to outliers and the relative resistance of the sign test. A similar plot can be prepared for the Wilcoxon signed rank test; see Exercise 1.12.8. In addition, the corresponding p-values can be plotted to see how sensitive the decision to reject the null hypothesis is to outliers. Sensitivity plots are similar to influence functions. We discuss influence functions for estimates in Section 1.6.
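A sensitivity plot such as those in Panels C and D is simple to generate; the following R sketch (ours, not the code used to produce Figure 1.4.1) varies the 10th difference over a grid and recomputes the standardized t and sign statistics.

# Sensitivity curves for the t-test and the sign test (no continuity correction).
diffs <- c(0.0, 0.8, 1.0, 1.2, 1.3, 1.3, 1.4, 1.8, 2.4, 4.6)
grid  <- seq(-10, 10, by = 0.1)
tstat <- sapply(grid, function(d) {
  y <- diffs; y[10] <- d                     # replace the 10th difference
  sqrt(length(y)) * mean(y) / sd(y)
})
sstat <- sapply(grid, function(d) {
  y <- diffs; y[10] <- d
  y <- y[y != 0]                             # zeros set aside for the sign test
  (2 * sum(y > 0) - length(y)) / sqrt(length(y))
})
par(mfrow = c(1, 2))
plot(grid, tstat, type = "l", xlab = "Value of 10th difference", ylab = "t-test")
plot(grid, sstat, type = "s", xlab = "Value of 10th difference",
     ylab = "Standardized sign test")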

Example 1.4.2. Shoshoni Rectangles.


Table 1.4.2: Width to Length Ratios of Rectangles

0.553 0.570 0.576 0.601 0.606 0.606 0.609 0.611 0.615 0.628
0.654 0.662 0.668 0.670 0.672 0.690 0.693 0.749 0.844 0.933

The golden rectangle is a rectangle in which the ratio of the width to length is approximately 0.618. It can be characterized in various ways. For example, w/l = l/(w + l) characterizes the golden rectangle. It is considered to be an aesthetic standard in Western civilization and appears in art and architecture going back to the ancient Greeks. It now appears in such items as credit and business cards. In a cultural anthropology study, DuBois (1960) reports on a study of the Shoshoni beaded baskets. These baskets contain beaded rectangles, and the question was whether the Shoshonis use the same aesthetic standard as the West. A sample of twenty width to length ratios from Shoshoni baskets is given in Table 1.4.2.

Panel A of Figure 1.4.2 shows the notched boxplot containing the 95% L1 confidence interval for θ, the median of the population of w/l ratios. It shows two outliers, which are also apparent in the normal quantile plot, Panel B of Figure 1.4.2. We used the sign procedure to analyze the data, performing the computations with the RBR function onesampsgn.

Figure 1.4.2: Panel A: Boxplot of width to length ratios of Shoshoni rectangles; Panel B: Normal q−q plot of the ratios.

[Figure not reproduced here.]

For this problem, it is of interest to test H0: θ = 0.618 (the golden rectangle). The display below shows this evaluation for the sign test along with a 90% confidence interval for θ.

> onesampsgn(x,theta0=.618,alpha=.10)

Results for the Sign procedure

Test of theta = 0.618 versus theta not equal to 0.618

Test stat. S is 2 Standardized (z) Test-Stat. 0.2236068 p-value 0.8230633

Estimate 0.641 SE is 0.01854268

90 % Confidence Interval is ( 0.609 , 0.67 )

Estimate of the scale parameter tau 0.0829254

With a p-value of 0.823, there is no evidence to refute the null hypothesis. Further, we see that the golden rectangle ratio 0.618 is contained in the confidence interval. This suggests that there is no evidence in these data that the Shoshonis are using a different standard.

For comparison, the analysis based on the t-procedure is

> onesampt(x,theta0=.618,alpha=.10)

Results for the t-test procedure

Test of theta = 0.618 versus theta not equal to 0.618

Test stat. Ave(x) - 0.618 is 0.0425 Standardized (t) Test-Stat. 2.054523 p-value 0.05394133

Estimate 0.6605 SE is 0.02068606

90 % Confidence Interval is ( 0.624731 , 0.696269 )

Estimate of the scale parameter sigma 0.09251088

Based on the t-test with its p-value of 0.053, a researcher might conclude that there is evidence that the Shoshonis are using a different standard; further, the 90% t-interval does not contain the golden rectangle ratio. Hence, the robust and traditional approaches lead to different practical conclusions for this problem. The outliers, of course, impaired the t-analysis. For these data, we have more faith in the simple sign test.

1.5 Properties of Norm-Based Inference

In this section, we establish statistical properties of the inference described in Section 1.3 for the norm-fit of a location model. These properties describe the null and alternative distributions of the test, (1.3.7), and the asymptotic distribution of the estimate, (1.3.2). Furthermore, these properties allow us to derive relative efficiencies between competing procedures. While our discussion is general, we will illustrate the inference based on the L1 and L2 norms as we proceed. The inference based on the signed-rank norm will be considered in Section 1.7 and that based on norms of general signed-rank scores in Section 1.8.

We assume then that Model (1.2.1) holds for a random sample X1, . . . , Xn with common distribution and density functions H(x) = F(x − θ) and h(x) = f(x − θ), respectively. Next a norm is specified to fit the model. We will assume that the induced functional is 0 at F, i.e., T(F) = 0. Let S(θ) be the gradient function induced by the norm. We establish the properties of the inference by considering the null and alternative behavior of the gradient test. For convenience, we consider the one-sided hypotheses

H0 : θ = 0 versus HA : θ > 0 . (1.5.1)

Since S(θ) is nonincreasing, a level α test of these hypotheses based on S(0) is

Reject H0 in favor of HA if S(0) ≥ c , (1.5.2)

where c is such that P0[S(0) ≥ c] = α. The power function of this test is given by

γS(θ) = Pθ[S(0) ≥ c] = P0[S(−θ) ≥ c] , (1.5.3)

where the last equality follows from Theorem 1.3.1.

The power function forms a convenient summary of the test based on S(0). The probability of a Type I error (level of the test) is given by γS(0). The probability of a Type II error at the alternative θ is βS(θ) = 1 − γS(θ). For a given test of the hypotheses (1.5.1) we want the power function to be increasing in θ with an upper limit of one. In the first subsection below, we establish these properties for the test (1.5.2). We can also compare level α tests of (1.5.1) by comparing their powers at alternative hypotheses. These are efficiency considerations, and they are covered in later subsections.

1.5.1 Basic Properties of the Power Function γS(θ)

As a first step we show that γS(θ) is nondecreasing:

Theorem 1.5.1. Suppose the test of H0: θ = 0 versus HA: θ > 0 rejects when S(0) ≥ c. Then the power function is nondecreasing in θ.

Proof. Recall that S(θ) is nonincreasing in θ since D(θ) is convex. By Theorem 1.3.1, γS(θ) = P0[S(−θ) ≥ c]. Now, if θ1 ≤ θ2 then S(−θ1) ≤ S(−θ2) and, hence, S(−θ1) ≥ c implies that S(−θ2) ≥ c. It then follows that P0(S(−θ1) ≥ c) ≤ P0(S(−θ2) ≥ c), and the power function is monotone in θ, as required.

This theorem shows that the test of H0: θ = 0 versus HA: θ > 0 based on S(0) is unbiased, that is, Pθ(S(0) ≥ c) ≥ α for positive θ, where α is the size of the test. At times it is convenient to consider the more general null hypothesis:

H∗0 : θ ≤ 0 versus HA : θ > 0 . (1.5.4)


A test of H∗0 versus HA with power function γS is said to have level α if

sup_{θ≤0} γS(θ) = α.

The proof of Theorem 1.5.1 shows that γS(θ) is nondecreasing in all θ ∈ R. Since the gradient test has level α for H0, it follows immediately that it has level α for H∗0 also.

We next show that the power function of the gradient test converges to 1 as θ → ∞. We formally define this as:

Definition 1.5.1. Consider a level α test for the hypotheses (1.5.1) which has power function γS(θ). We say the test is resolving if γS(θ) → 1 as θ → ∞.

Theorem 1.5.2. Suppose the test of H0: θ = 0 versus HA: θ > 0 rejects when S(0) ≥ c. Further, let η = sup_θ S(θ) and suppose that η is attained for some finite value of θ. Then the test is resolving; that is, Pθ(S(0) ≥ c) → 1 as θ → ∞.

Proof. Since S(θ) is nonincreasing, for any unbounded increasing sequence θm, S(θm) ≥ S(θm+1). For fixed n and F, there is a real number a such that P0(|Xi| ≤ a, i = 1, . . . , n) > 1 − ǫ for any specified ǫ > 0. Let Aǫ denote the event {|Xi| ≤ a, i = 1, . . . , n}. Now,

Pθm(S(0) ≥ c) = P0(S(−θm) ≥ c)
             = 1 − P0(S(−θm) < c)
             = 1 − P0({S(−θm) < c} ∩ Aǫ) − P0({S(−θm) < c} ∩ Aǫᶜ).

The hypothesis of the theorem implies that, for sufficiently large m, {S(−θm) < c} ∩ Aǫ is empty. Further, P0({S(−θm) < c} ∩ Aǫᶜ) ≤ P0(Aǫᶜ) < ǫ. Hence, for m sufficiently large, Pθm(S(0) ≥ c) ≥ 1 − ǫ and the proof is complete.

The condition of boundedness imposed on S(θ) in the above theorem holds for almost all the nonparametric tests discussed in this book; hence, these nonparametric tests will be resolving. Thus they will be able to discern large alternative hypotheses with high power. What can be said at a fixed alternative? Recall the definition of a consistent test:

Definition 1.5.2. We say that a test is consistent if the power tends to one for each fixed alternative as the sample size n increases. The alternatives consist of specific values of θ and a cdf F.

Consistency implies that the test is behaving as expected when the sample size increases and the alternative hypothesis is true. To obtain consistency of the gradient test, we need to impose the following two assumptions on S(θ): first,

S̄(θ) = S(θ)/n^γ →P µ(θ), where µ(0) = 0 and µ(0) < µ(θ) for all θ > 0, (1.5.5)

for some γ > 0, and secondly,

E0 S̄(0) = 0 and √n S̄(0) →D N(0, σ²(0)) under H0 for all F, (1.5.6)

for some positive constant σ(0). The first assumption means that S̄(0) separates the null from the alternative hypothesis. Note, it is not crucial that µ(0) = 0, since this can always be achieved by recentering. It will be useful to have the following result concerning the asymptotic null distribution of S̄(0). Its proof follows readily from the definition of convergence in distribution.

Theorem 1.5.3. Assume (1.5.6). The test defined by √n S̄(0) ≥ zα σ(0), where zα is the upper α percentile of the standard normal cdf, i.e., 1 − Φ(zα) = α, is asymptotically size α. Hence, P0(√n S̄(0) ≥ zα σ(0)) → α.

It follows that a gradient test is consistent; i.e.,

Theorem 1.5.4. Assume conditions (1.5.5) and (1.5.6). Then the gradient test √n S̄(0) ≥ zα σ(0) is consistent; i.e., the power at fixed alternatives tends to one as n increases.

Proof. Fix θ∗ > 0 and F. For ǫ > 0 and for large n, we have n^{−1/2} zα σ(0) < µ(θ∗) − ǫ. This leads to the following string of inequalities:

Pθ∗,F(S̄(0) ≥ n^{−1/2} zα σ(0)) ≥ Pθ∗,F(S̄(0) ≥ µ(θ∗) − ǫ) ≥ Pθ∗,F(|S̄(0) − µ(θ∗)| ≤ ǫ) → 1,

which is the desired result.

Example 1.5.1. The L1 Case

Assume that the model cdf F has the unique median 0. Consider the L1 norm. The associated level α gradient test of (1.5.1) is equivalent to the sign test given by:

Reject H0 in favor of HA if S1⁺ = ∑ I(Xi > 0) ≥ c,

where c is such that P[bin(n, 1/2) ≥ c] = α. The test is nonparametric, i.e., it does not depend on F. From the above discussion, its power function is nondecreasing in θ. Since S1⁺(θ) is bounded and attains its bound on a finite interval, the test is resolving. For consistency, take γ = 1 in expression (1.5.5). Then Eθ[n^{−1}S1⁺(0)] = Pθ(X > 0) = 1 − F(−θ) = µ(θ). An application of the Weak Law of Large Numbers shows that the limit in condition (1.5.5) holds. Further, µ(0) = 1/2 < µ(θ) for all θ > 0 and all F. Finally, apply the Central Limit Theorem to show that (1.5.6) holds. Hence, the sign test is consistent for location alternatives. Further, it is consistent for each pair θ, F such that P(X > 0) > 1/2.
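As a quick illustration, this sign test can be carried out in a few lines of base R; the following is a minimal sketch using the exact binomial null distribution, not the RBR function onesampsgn.

sign.test <- function(x) {
  s <- sum(x > 0)                          # the statistic S1+
  p <- 1 - pbinom(s - 1, length(x), 0.5)   # P[bin(n, 1/2) >= s]
  c(S.plus = s, p.value = p)
}
sign.test(rnorm(20, 0.5))   # example: a sample with true median 0.5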

A discussion of these properties for the gradient test based on the L2 norm can be found in Exercise 1.12.5.


1.5.2 Asymptotic Linearity and Pitman Regularity

In the last section we discussed some of the basic properties of the power function for a gradient test. Next we establish some general results that will allow us to compare power functions for different level α tests. These results will also lead to the asymptotic distributions of the location estimators θ̂ based on norm fits. We will also make use of them in later sections and chapters.

Assume the setup found at the beginning of this section; i.e., we are considering the location model (1.3.1) and we have specified a norm with gradient function S(θ). We first define a Pitman Regular process:

Definition 1.5.3. We will say an estimating function S(θ) is Pitman Regular if the following four conditions hold: first,

S(θ) is nonincreasing in θ; (1.5.7)

second, letting S̄(θ) = S(θ)/n^γ for some γ > 0, there exists a function µ(θ) such that µ(0) = 0, µ′(θ) is continuous at 0, µ′(0) > 0, and either

S̄(0) →P µ(θ) under Pθ or Eθ S̄(0) = µ(θ); (1.5.8)

third,

sup_{|b|≤B} |√n S̄(b/√n) − √n S̄(0) + µ′(0)b| →P 0, (1.5.9)

for any B > 0; and fourth, there is a constant σ(0) such that

√n S̄(0)/σ(0) →D N(0, 1) under θ = 0. (1.5.10)

Further, the quantity

c = µ′(0)/σ(0) (1.5.11)

is called the efficacy of S(θ).

Condition (1.5.9) is called the asymptotic linearity of the process S(θ). Often we can compute c when we have the mean under general θ and the variance under θ = 0. Thus

µ′(0) = (d/dθ) Eθ[S̄(0)]|_{θ=0} and σ²(0) = lim_n n Var0(S̄(0)). (1.5.12)

Hence, another way of expressing the asymptotic linearity of S(θ) is

√n S̄(b/√n)/σ(0) = √n S̄(0)/σ(0) − cb + op(1). (1.5.13)


If we replace b by √n θn where, of course, |√n θn| ≤ B for B > 0, then we can write

√n S̄(θn)/σ(0) = √n S̄(0)/σ(0) − c √n θn + op(1). (1.5.14)

We record one more result on limiting distributions, whose proof follows from Theorems 1.3.1 and 1.5.6.

Theorem 1.5.5. Suppose S(θ) is Pitman Regular. Then

√n S̄(b/√n)/σ(0) →D Z − cb under θ = 0, (1.5.15)

and

√n S̄(0)/σ(0) →D Z − cb under θ = −b/√n, (1.5.16)

where Z ∼ N(0, 1) and, so, Z − cb ∼ N(−cb, 1).

The second part of this theorem says that the limiting distribution of S̄(0), when standardized by σ(0) and computed along a sequence of alternatives −b/n^{1/2}, is still normal with the same variance of one but with a new mean, namely −cb. This result will be useful in approximating the power near the null hypothesis.

We will find asymptotic linearity to be useful in establishing statistical properties. Our next result provides sufficient conditions for linearity.

Theorem 1.5.6. Let S̄(θ) = (1/n^γ)S(θ) for some γ > 0 such that conditions (1.5.7), (1.5.8), and (1.5.10) of Definition 1.5.3 hold. Suppose, for any b ∈ R,

n Var0(S̄(n^{−1/2}b) − S̄(0)) → 0, as n → ∞. (1.5.17)

Then

sup_{|b|≤B} |√n S̄(b/√n) − √n S̄(0) + µ′(0)b| →P 0, (1.5.18)

for any B > 0.

Proof. First consider Un(b) = [S̄(n^{−1/2}b) − S̄(0)]/(b/√n). Since E0 S̄(b/√n) = µ(−b/√n) by Theorem 1.3.1, condition (1.5.8) and the mean value theorem give

E0(Un(b)) = (√n/b) µ(−b/√n) = (√n/b)[−(b/√n) µ′(ξn)] → −µ′(0), (1.5.19)

where ξn lies between 0 and −b/√n and the limit uses the continuity of µ′ at 0. Furthermore,

Var0 Un(b) = (n/b²) Var0[S̄(b/√n) − S̄(0)] → 0. (1.5.20)

As Exercise 1.12.9 shows, (1.5.19) and (1.5.20) imply that Un(b) converges to −µ′(0) in probability, pointwise in b; i.e., Un(b) = −µ′(0) + op(1).

For the second part of the proof, let Wn(b) = √n[S̄(b/√n) − S̄(0) + µ′(0)b/√n]. Further, let ǫ > 0 and γ > 0 and partition [−B, B] into −B = b0 < b1 < . . . < bm = B so that bi − b_{i−1} ≤ ǫ/(2|µ′(0)|) for all i. There exists N such that n ≥ N implies P[max_i |Wn(bi)| > ǫ/2] < γ.

Now suppose that Wn(b) ≥ 0 (a similar argument can be given for Wn(b) < 0). Then, for b_{i−1} ≤ b ≤ bi, using the monotonicity of S̄,

|Wn(b)| = √n[S̄(b/√n) − S̄(0)] + bµ′(0)
       = √n[S̄(b/√n) − S̄(0)] + b_{i−1}µ′(0) + (b − b_{i−1})µ′(0)
       ≤ |Wn(b_{i−1})| + (b − b_{i−1})|µ′(0)| ≤ max_i |Wn(bi)| + ǫ/2.

Hence,

P0(sup_{|b|≤B} |Wn(b)| > ǫ) ≤ P0(max_i |Wn(bi)| + ǫ/2 > ǫ) < γ,

and sup_{|b|≤B} |Wn(b)| →P 0.

In the next three subsections we use these tools to handle the issues of power and efficiency for a general norm-based inference, but first we show that the L1 gradient function is Pitman Regular.

Example 1.5.2. Pitman Regularity of the L1 Process

Assume that the model pdf satisfies f(0) > 0. Recall that the L1 gradient function is

S1(θ) = ∑_{i=1}^n sgn(Xi − θ).

Take γ = 1 in Theorem 1.5.6; hence, the average of interest is S̄1(θ) = n^{−1}S1(θ). This is nonincreasing, so condition (1.5.7) is satisfied. Next, it is easy to check that µ(θ) = Eθ S̄1(0) = Eθ sgn(X1) = E0 sgn(X1 + θ) = 1 − 2F(−θ). Hence, µ′(0) = 2f(0), and condition (1.5.8) is satisfied. We now consider condition (1.5.17). Consider the case b > 0 (the case b < 0 is similar):

S̄1(b/√n) − S̄1(0) = n^{−1} ∑_{i=1}^n [sgn(Xi − b/√n) − sgn(Xi)] = −(2/n) ∑_{i=1}^n I(0 < Xi < b/√n).

Because this is a sum of independent Bernoulli variables, we have

n Var0[S̄1(b/√n) − S̄1(0)] ≤ 4P(0 < X1 < b/√n) = 4[F(b/√n) − F(0)] → 0.

The convergence to 0 occurs since F is continuous. Thus condition (1.5.17) is satisfied. Finally, note that σ(0) = 1, so √n S̄1(0) converges in distribution to Z ∼ N(0, 1) by the Central Limit Theorem. Therefore the L1 gradient process S1(θ) is Pitman Regular. It follows that the efficacy of the L1 is

cL1 = 2f(0). (1.5.21)

For future reference, we state the asymptotic linearity result for the L1 process: if |√n θn| ≤ B, then

√n S̄1(θn) = √n S̄1(0) − 2f(0) √n θn + op(1). (1.5.22)

Example 1.5.3. Pitman Regularity of the L2 Process

In Exercise 1.12.6 it is shown that, provided Xi has finite variance, the L2 gradient function is Pitman Regular and that the efficacy is simply cL2 = 1/σf.

We are now in a position to investigate the efficiency and power properties of the statistical methods based on the L1 norm relative to the statistical methods based on the L2 norm. As we will see in the next three subsections, these properties depend only on the efficacies.

1.5.3 Asymptotic Theory and Efficiency Results for θ̂

As at the beginning of this section, suppose we have the location model, (1.2.1), and that we have chosen a norm to fit the model with gradient function S(θ). In this part we will develop the asymptotic distribution of the estimate. The asymptotic variance will provide the basis for efficiency comparisons. We will use the asymptotic linearity that accompanies Pitman Regularity. To do this, however, we first need to show that √n θ̂ is bounded in probability.

Lemma 1.5.1. If the gradient function S(θ) is Pitman Regular, then √n(θ̂ − θ) = OP(1).

Proof. Assume without loss of generality that θ = 0 and take t > 0. By the monotonicity of S(θ), if S(t/√n) < 0 then θ̂ ≤ t/√n. Hence, P0(S(t/√n) < 0) ≤ P0(θ̂ ≤ t/√n). Theorem 1.5.5 implies that the first probability can be made as close to Φ(tc) as desired, which, in turn, can be made as close to 1 as desired. In a similar vein, we note that if S(−t/√n) > 0, then θ̂ ≥ −t/√n and −√n θ̂ ≤ t. Again, the probability of this event can be made arbitrarily close to 1. Hence, P0(|√n θ̂| ≤ t) is arbitrarily close to 1 and we have boundedness in probability.

We are now in a position to exploit this boundedness in probability to determine the asymptotic distribution of the estimate.

Theorem 1.5.7. Suppose S(θ) is Pitman Regular with efficacy c. Then √n(θ̂ − θ) converges in distribution to Z ∼ N(0, c^{−2}).

Proof. As usual we assume, without loss of generality, that θ = 0. First recall that θ̂ is defined by n^{−1/2}S(θ̂) ≐ 0. From Lemma 1.5.1, we know that √n θ̂ is bounded in probability, so that we can apply (1.5.13) to deduce

√n S̄(θ̂)/σ(0) = √n S̄(0)/σ(0) − c √n θ̂ + op(1).

Solving, we have

√n θ̂ = c^{−1} √n S̄(0)/σ(0) + op(1);

hence, the result follows because √n S̄(0)/σ(0) is asymptotically N(0, 1).

Definition 1.5.4. If we have two Pitman Regular estimates with efficacies c1 and c2, respectively, then the efficiency of θ̂1 with respect to θ̂2 is defined to be the reciprocal ratio of their asymptotic variances, namely, e(θ̂1, θ̂2) = c1²/c2².

The next example compares the L1 estimate to the L2 estimate.

Example 1.5.4. Relative efficiency between the L1 and L2 estimates

In this example we compare the L1 and L2 estimates, namely, the sample median and the sample mean. We have seen that their respective efficacies are 2f(0) and σf^{−1}, and their asymptotic variances are 1/[4f²(0)n] and σf²/n, respectively. Hence, the relative efficiency of the median with respect to the mean is

e(X̃, X̄) = asyvar(√n X̄)/asyvar(√n X̃) = c²(X̃)/c²(X̄) = 4f²(0)σf², (1.5.23)

where X̃ is the sample median and X̄ is the sample mean. The efficiency computation depends only on the Pitman efficacies. We illustrate the computation of the efficiency using the contaminated normal distribution. The pdf of the contaminated normal distribution consists of mixing the standard normal pdf with a normal pdf having mean zero and variance δ² > 1. For ǫ between 0 and 1, the pdf can be written

fǫ(x) = (1 − ǫ)φ(x) + ǫδ^{−1}φ(δ^{−1}x), (1.5.24)

with σf² = 1 + ǫ(δ² − 1). This distribution has tails heavier than the standard normal distribution and can be used to model data contamination; see Tukey (1960) for more discussion. We can think of ǫ as the fraction of the data that is contaminated. In Table 1.5.1 we provide values of the efficiencies for various amounts of contamination with δ = 3. Note that when we have 10 percent contamination the efficiency is 1; this indicates that, for this distribution, the median and the mean are equally effective. Finally, this example exhibits a distribution for which the median is superior to the mean as an estimate of the center. See Exercise 1.12.12 for other examples.
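A few lines of R verify these entries; this sketch evaluates (1.5.23) under (1.5.24) with δ = 3 and reproduces Table 1.5.1 up to rounding (the table reports 1.000 at ǫ = .10).

eff.med.mean <- function(eps, delta = 3) {
  f0     <- ((1 - eps) + eps / delta) * dnorm(0)   # f_eps(0)
  sigma2 <- 1 + eps * (delta^2 - 1)                # variance of f_eps
  4 * f0^2 * sigma2                                # (1.5.23)
}
round(sapply(c(0, .03, .05, .10, .15), eff.med.mean), 3)
## 0.637 0.758 0.833 0.998 1.134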

1.5.4 Asymptotic Power and Efficiency Results for the Test Based on S(θ)

Consider the location model, (1.2.1), and assume that we have chosen a norm to fit the model with gradient function S(θ). Consider the gradient test (1.5.2) of the hypotheses (1.5.1). In Section 1.5.1, we showed that the power function of this test is nondecreasing with upper limit one and that it is typically resolving. Further, we showed that for a fixed alternative the test is consistent. Thus the power will tend to one as the sample size increases.


Table 1.5.1: Efficiencies of the median relative to the mean for contaminated normal models.

  ǫ     e(X̃, X̄)
 .00     .637
 .03     .758
 .05     .833
 .10    1.000
 .15    1.134

To offset this effect, we will let the alternative converge to the null value at a rate that stabilizes the power away from one. This will enable us to compare two tests along the same alternative sequence. Consider the null hypothesis H0: θ = 0 versus HAn: θ = θn, where θn = θ∗/√n and θ∗ > 0. Recall that the asymptotic size α test based on S(0) rejects H0 if √n S̄(0)/σ(0) ≥ zα, where 1 − Φ(zα) = α.

The following theorem is called the asymptotic power lemma. Its proof follows immediately from expression (1.5.13).

mediately from expression ( 1.5.13).

Theorem 1.5.8. Assume that S(0) is Pitman Regular with efficacy c, then the asymptoticlocal power along the sequence θn = θ∗/

√n is

γS(θn) = Pθn

[√nS(0)/σ(0) ≥ zα

]= P0

[√nS(−θn)/σ(0) ≥ zα

]→ 1 − Φ(zα − θ∗c) ,

as n→ ∞.

Note that larger values of the efficacy imply larger values of the asymptotic local power.

Definition 1.5.5. The Pitman asymptotic relative efficiency of one test relative to another is defined to be e(S1, S2) = c1²/c2².

Note that this is the same formula as the efficiency of one estimate relative to another given in Definition 1.5.4. Therefore, the efficiency results discussed in Example 1.5.4 between the L1 and L2 estimates apply to the sign and t tests also. Hence, we have an example in which the simple sign test is asymptotically more powerful than the t-test.

We can also develop a sample size interpretation for the asymptotic power. Suppose we specify a power γ < 1 and let zγ be defined by 1 − Φ(zγ) = γ. Then 1 − Φ(zα − cn^{1/2}θn) = 1 − Φ(zγ), so that zα − cn^{1/2}θn = zγ. Solving for n yields

n ≐ (zα − zγ)²/(c²θn²). (1.5.25)

Typically we take θn = knσ with kn small. Now, if S1(0) and S2(0) are two Pitman Regular asymptotically size α tests, then the ratio of sample sizes required to achieve the same asymptotic power along the same sequence of alternatives is given by the approximation n2/n1 ≐ c1²/c2². This provides additional motivation for the above definition of the Pitman efficiency of two tests. The initial development of asymptotic efficiency was done by Pitman (1948) in an unpublished manuscript and later published by Noether (1955).
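The approximation (1.5.25) is easy to evaluate; the sketch below assumes the one-sided setup of this section, with zγ = Φ^{−1}(1 − γ).

n.approx <- function(alpha, gamma, c.eff, theta) {
  (qnorm(1 - alpha) - qnorm(1 - gamma))^2 / (c.eff^2 * theta^2)
}
## e.g., the sign test at a standard normal model (c = 2 f(0)),
## power .80 at the alternative theta = 0.5:
n.approx(.05, .80, 2 * dnorm(0), .5)   # roughly 39 observations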


1.5.5 Efficiency Results for Confidence Intervals Based on S(θ)

In this part we consider the length of a confidence interval as a measure of its efficiency. Suppose that we specify γ = 1 − α for the confidence coefficient, and let z_{α/2} be defined by 1 − Φ(z_{α/2}) = α/2. Again we suppose throughout the discussion that the estimating functions are Pitman Regular. Then the endpoints of the 100γ% confidence interval are given asymptotically by θ̂L and θ̂U such that

√n S̄(θ̂L)/σ(0) = z_{α/2} and √n S̄(θ̂U)/σ(0) = −z_{α/2}; (1.5.26)

see (1.3.10) for the exact versions of the endpoints.

The next theorem provides the asymptotic behavior of the length of this interval and, further, it shows that the standardized length of the confidence interval is a consistent estimate of the asymptotic standard deviation of √n θ̂.

Theorem 1.5.9. Suppose S(θ) is a Pitman Regular estimating function with efficacy c. Let L be the length of the corresponding confidence interval. Then

√n L/(2z_{α/2}) →P 1/c.

Proof: Using the same argument as in Lemma 1.5.1, we can show that θ̂L and θ̂U are bounded in probability when multiplied by √n. Hence, the above estimating equations can be linearized to obtain, for example,

z_{α/2} = √n S̄(θ̂L)/σ(0) = √n S̄(0)/σ(0) − c √n θ̂L + oP(1).

This can then be solved to find

√n θ̂L = √n S̄(0)/[cσ(0)] − z_{α/2}/c + oP(1).

When this is also done for θ̂U and the difference is taken, we have

n^{1/2}(θ̂U − θ̂L) = 2z_{α/2}/c + oP(1),

which concludes the argument.

From Theorem 1.5.7, θ̂ has an approximate normal distribution with variance c^{−2}/n. So, by Theorem 1.5.9, a consistent estimate of the standard error of θ̂ is

SE(θ̂) = [√n L/(2z_{α/2})](1/√n) = L/(2z_{α/2}). (1.5.27)

If the ratio of squared asymptotic lengths is used as a measure of efficiency, then the efficiency of one confidence interval relative to another is again the ratio of the squares of the efficacies.


The discussion of the properties of estimation, testing, and confidence interval construction shows that, asymptotically at least, the relative merit of a procedure is measured by its efficacy. This measure is the slope of the linear approximation of the standardized estimating function that determines these procedures. In the comparison of L1 and L2 methods, we have seen that the efficiency e(L1, L2) = 4σf²f²(0). There are other types of asymptotic efficiency that have been studied in the literature, along with finite sample versions of these asymptotic efficiencies. The conclusions drawn from these other efficiencies are consistent with the picture presented here. Finally, conclusions of simulation studies have also been consistent with the material presented here. Hence, we will not discuss these other measures; see Section 2.6 of Hettmansperger (1984a) for further references.

Example 1.5.5. Estimation of the Standard Error of the Sample Median

Recall that the sample median, when properly standardized, has a limiting normal distribution. Suppose we have a sample of size n from H(x) = F(x − θ), where θ is the unknown median. From Theorem 1.5.7, we know that the approximating distribution for θ̂, the sample median, is normal with mean θ and variance 1/[4nh²(θ)]. We refer to this variance as the asymptotic variance. This normal distribution can be used to approximate probabilities concerning the sample median. When the underlying form of the distribution H is unknown, we must estimate this asymptotic variance. Theorem 1.5.9 provides one key to the estimation of the asymptotic variance. The square root of the asymptotic variance is sometimes called the asymptotic standard error of the sample median. We will discuss the estimation of this standard error rather than the asymptotic variance.

As a simple example, in expression (1.5.27) take α = .05, z_{α/2} ≐ 2, and k = n/2 − n^{1/2}; then we have the following consistent estimate of the asymptotic standard error of the median:

SE(median) ≈ [X_{(n/2+n^{1/2})} − X_{(n/2−n^{1/2})}]/4. (1.5.28)

This simple estimate of the asymptotic standard error is based on the length of the 95% confidence interval for the median. Sheather (1987) shows that the estimate can be improved by using the interpolated confidence intervals discussed in Section 1.10. Of course, other confidence intervals with different confidence coefficients can be used also. We recommend using 90% or 95%; again, see McKean and Schrader (1984) and Sheather (1987). This SE is computed by our R function onesampsgn for general α. The default value of α is set at 0.05.
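A direct implementation of (1.5.28) takes only a few lines; in this sketch the order-statistic indices n/2 ± n^{1/2}, which the asymptotics leave as real numbers, are rounded to integers.

se.median <- function(x) {
  n  <- length(x)
  xs <- sort(x)
  lo <- max(1, floor(n / 2 - sqrt(n)))     # index of lower order statistic
  hi <- min(n, ceiling(n / 2 + sqrt(n)))   # index of upper order statistic
  (xs[hi] - xs[lo]) / 4
}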

There are other approaches to the estimation of this standard error. For example, we could estimate the density h(x) directly and then use hn(θ̂), where hn is the density estimate. Another possibility is to estimate the finite sample standard error of the sample median directly. Sheather (1987) surveys these approaches. We will discuss one further possibility here, namely, the bootstrap. The bootstrap has gained wide attention recently because of its versatility in estimation and testing in nonstandard situations. See Efron and Tibshirani (1993) for a very readable account of the bootstrap.

If we know the underlying distribution H(x), then we could estimate the standard error of the median by repeatedly drawing samples with a computer from the distribution H.


Table 1.5.2: Generated N(0, 1) variates (placed in order).

-1.79756 -1.66132 -1.46531 -1.45333 -1.21163 -0.92866 -0.86812
-0.84697 -0.81584 -0.78912 -0.68127 -0.37479 -0.33046 -0.22897
-0.02502 -0.00186  0.09666  0.13316  0.17747  0.31737  0.33125
 0.80905  0.88860  0.90606  0.99640  1.26032  1.46174  1.52549
 1.60306  1.90116

If we have B samples from H and have computed and stored the B values of the sample median, then our estimate of the standard error of the median is simply the sample standard deviation of these B values. When H is unknown, we replace it by Hn, the empirical distribution function, and proceed with the simulation. Later in the chapter we will encounter an example where we want to compute a bootstrap p-value for a test; see Section ??. The bootstrap approach based on Hn is called the nonparametric bootstrap since nothing is assumed about the form of the underlying distribution H. In another version, called the parametric bootstrap, we suppose that we know the form of the underlying distribution H but there are some unknown parameters, such as the mean and variance. We use the sample to estimate these unknown parameters, insert the values into H, and use this distribution to draw the B samples. In this book we will be concerned mainly with the nonparametric bootstrap, and we will use the generic term bootstrap to refer to this approach. In either case, ready access to high speed computing makes this method appealing. The following example illustrates the computations.

Example 1.5.6. Generated Data

Using Minitab, the 30 data points in Table 1.5.2 were generated from a normal distribution with mean 0 and variance 1. Thus, we know that the asymptotic standard error should be about 1/[30^{1/2} · 2f(0)] = 0.23. We will use this to check what happens if we try to estimate the standard error from the data.

Using expression (1.3.16), the 95% confidence interval for the median is (−0.789, 0.331). Hence, the length of confidence interval estimate, given in expression (1.5.28), is (0.331 + 0.789)/4 = 0.28. A simple R function was written to bootstrap the sample; see Exercise 1.12.7. Using this function, we obtained 1000 bootstrap samples, and the resulting standard deviation of the 1000 bootstrap medians was 0.27. For this instance, the bootstrap procedure essentially agrees with the length of confidence interval estimate.
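The nonparametric bootstrap computation amounts to a few lines of R; this sketch assumes the data of Table 1.5.2 are stored in the vector x.

B <- 1000
meds <- replicate(B, median(sample(x, replace = TRUE)))   # resample from Hn
sd(meds)    # about 0.27 in the run reported above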

Note that, from the data, the sample mean is −0.03575 and the sample standard deviation is 1.04769. If we assume the underlying distribution H is normal with unknown mean and variance, we would use the parametric bootstrap. Hence, instead of sampling from the empirical distribution function, we want to sample from a normal distribution with mean −0.03575 and standard deviation 1.04769. Using R (see Exercise 1.12.7), we obtained 1000 parametric bootstrapped samples. The sample standard deviation of the resulting medians was 0.23, just the value we would expect. You should not expect to get the precise value every time you bootstrap, either parametrically or nonparametrically. It is, however, a very versatile method to use to estimate such quantities as standard errors of estimates and p-values of tests.
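The parametric version differs only in the resampling step; a sketch under the normal model is:

B <- 1000
meds <- replicate(B, median(rnorm(length(x), mean(x), sd(x))))
sd(meds)    # about 0.23 in the run reported above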

An unusual aspect of this example is that the bootstrap distribution of the sample median can be found in closed form and does not have to be simulated as described above. The variance of the sample median computed from the bootstrap distribution can then be found; the result is another estimate of the variance of the sample median. This was discovered independently by Maritz and Jarrett (1978) and Efron (1979). We do not pursue this development here because in most cases we must simulate the bootstrap distribution, and that is where the real strength of the bootstrap approach lies. For an interesting comparison of the various estimates of the variance of the sample median see McKean and Schrader (1984).

1.6 Robustness Properties of Norm-Based Inference

We have just considered the statistical properties of the inference procedures, looking at ideas such as efficiency and power. We now turn to stability or robustness properties. By this we mean how the inference procedures are affected by outliers or corruption of portions of the data. Ideally, we would like procedures (tests and estimates) which do not respond too quickly to a single outlying value when it is introduced into the sample. Further, we would not like procedures that can be changed by arbitrary amounts by corrupting a small amount of the data. Response to outliers is measured by the influence curve, and response to data corruption is measured by the breakdown value. We will introduce finite sample versions of these concepts. They are easy to work with and, in the limit, they generally equal the more abstract versions based on the study of statistical functionals. We consider first the robustness properties of the estimates and then those of the tests. As in the last section, the discussion will be general, but the L1 and L2 procedures will be discussed as we proceed. The robustness properties of the procedures based on the weighted L1 norm will be covered in Sections 1.7 and 1.8. See Section A.5 of the Appendix for a development based on functionals.

1.6.1 Robustness Properties of θ̂

We begin with the definition of breakdown for the estimator θ̂.

Definition 1.6.1. Estimation Breakdown. Let x = (x1, . . . , xn) represent a realization of a sample and let

x^{(m)} = (x1∗, . . . , xm∗, x_{m+1}, . . . , xn)

represent the corruption of any m of the n observations. We define the bias of an estimator θ̂ to be bias(m; θ̂, x) = sup |θ̂(x^{(m)}) − θ̂(x)|, where the sup is taken over all possible corrupted samples x^{(m)}. Note that we change only x1∗, . . . , xm∗, while x_{m+1}, . . . , xn are fixed at their original values. If the bias is infinite, we say the estimate has broken down, and the finite sample breakdown value is given by

ǫn∗ = min{m/n : bias(m; θ̂, x) = ∞}. (1.6.1)


This approach to breakdown is called replacement breakdown because observations are replaced by corrupted values; see Donoho and Huber (1983) for more discussion of this approach. Often there exists an integer m such that x_{(m)} ≤ θ̂ ≤ x_{(n−m+1)}, and either θ̂ tends to −∞ as x_{(m)} tends to −∞ or θ̂ tends to +∞ as x_{(n−m+1)} tends to +∞. If m∗ is the smallest such integer, then ǫn∗ = m∗/n. Hodges (1967) was the first to introduce these ideas.

To remove the effect of sample size, the limit, when it exists, can be computed. In this case we call lim ǫn∗ = ǫ∗ the asymptotic breakdown value.

Example 1.6.1. Breakdown Values for the L1 and L2 Estimates

The L1 estimate is the sample median. If the sample size is n = 2k, then it is easy to see that when x_{(k)} tends to −∞ the median also tends to −∞. Hence, the breakdown value of the sample median is k/n, which tends to .5. By a similar argument, when the sample size is n = 2k + 1, the breakdown value is (k + 1)/n, and it also tends to .5 as the sample size increases. Hence, we say that the sample median is a 50% breakdown estimate. The L2 estimate is the sample mean. A similar analysis shows that its breakdown value is 1/n, which tends to zero. Hence, we say the sample mean is a zero breakdown estimate. This sharply contrasts the two estimates: the median is the most resistant estimate while the sample mean is the least resistant estimate. In Exercise 1.12.13, the reader is asked to show that the pseudo-median induced by the signed-rank norm, (1.3.25), has breakdown .29.
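The contrast is easy to see numerically; in the sketch below, a single corrupted observation carries the mean off while the median barely moves.

x <- rnorm(11)
x.bad <- x
x.bad[1] <- 1e6                 # corrupt one observation
c(mean(x), mean(x.bad))         # the mean breaks down
c(median(x), median(x.bad))     # the median moves at most one order statistic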

We have just considered the effect of corrupting some of the observations, where the estimate breaks down if we can force it to change by an arbitrary amount by changing the observations over which we have control. Another important concept of stability entails measuring the effect of the introduction of a single outlier. An estimate is stable or resistant if it does not change by a large amount when the outlier is introduced. In particular, we want the change to be bounded no matter what the value of the outlier.

Suppose we have a sample of observations x1, . . . , xn from a distribution centered at 0 and an estimate θ̂n based on these observations. By Pitman Regularity, Definition 1.5.3, and Theorem 1.5.7, we have

n^{1/2} θ̂n = c^{−1} n^{−1/2} S(0)/σ(0) + oP(1), (1.6.2)

provided the true parameter is 0. Further, we often have a representation of S(0) as a sum of independent random variables. We may have to make a projection of S(0) to achieve this; see the next chapter for examples of projections. In any case, we then have the following representation:

c^{−1} n^{−1/2} S(0)/σ(0) = n^{−1/2} ∑_{i=1}^n Ω(xi) + oP(1), (1.6.3)

where Ω(·) is the function needed in the representation. When we combine the above two statements we have

n^{1/2} θ̂n = n^{−1/2} ∑_{i=1}^n Ω(xi) + oP(1). (1.6.4)

Recall that the distribution from which we are sampling is assumed to be centered at 0. The difference (θ̂n − 0) is approximated by the average of n independent and identically distributed random variables. Since Ω(xi) represents the effect of the ith observation on θ̂n, it is called the influence function.

The influence function approximates the rate of change of the estimate when an outlier is introduced. Let x_{n+1} = x∗ represent a new, outlying, observation. Since θ̂n should be roughly 0, we have

(n + 1)θ̂_{n+1} − (n + 1)θ̂n ≐ Ω(x∗)

and

(θ̂_{n+1} − θ̂n)/[1/(n + 1)] ≈ Ω(x∗), (1.6.5)

which reveals the differential character of the influence function. Hampel (1974) developed the influence function from the theory of von Mises differentiable functions. In Sections A.5 and A.5.2 of the Appendix, we use his formulation to derive several influence functions for later situations. Here, though, we will identify influence functions for the estimates through the approximations described above. We now illustrate this approach.

Example 1.6.2. Influence Function for the L1 and L2 Estimates

We will briefly describe the influence functions for the sample median and the sample mean, the L1 and L2 estimates. From Example 1.5.2 we have immediately that, for the sample median,

n^{1/2} θ̂ ≈ (1/√n) ∑_{i=1}^n sgn(Xi)/(2f(0))

and

Ω(x) = sgn(x)/(2f(0)).

Note that the influence function is bounded but not continuous. Hence, outlying observations cannot have an arbitrarily large effect on the estimate. It is this feature, along with the 50% breakdown property, that makes the sample median the prototype of resistant estimates. The sample mean, on the other hand, has an unbounded influence function. It is easy to see that Ω(x) = x, linear and unbounded. Hence, a single large outlier is sufficient to carry the sample mean beyond any bound. The unbounded influence is connected to the 0 breakdown property. Hence, the L2 estimate is the prototype of an estimate highly efficient at a specified model, the normal model in this case, but not resistant. This means that quite close to the model for which the estimate is optimal, the estimate may perform very poorly; recall Table 1.5.1.
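The two influence functions are easily compared visually; the sketch below plots them at the standard normal model, where 2f(0) = 2φ(0).

curve(sign(x) / (2 * dnorm(0)), -4, 4, n = 1001, lty = 2,
      ylab = "influence")       # median: bounded step function
abline(0, 1)                    # mean: Omega(x) = x, unbounded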

1.6.2 Breakdown Properties of Tests

We now turn to the issue of breakdown in testing hypotheses. The problems are a bit different in this case, since we typically want to move, by data corruption, a test statistic into or out of a critical region. It is not a matter of sending the statistic beyond any finite bound, as it is in estimation breakdown.

Definition 1.6.2. Suppose that V is a statistic for testing H0: θ = 0 versus HA: θ > 0, and we reject the null hypothesis when V ≥ k, where P0(V ≥ k) = α determines k. The rejection breakdown of the test is defined by

ǫn∗(reject) = min{m/n : inf_x sup_{x^{(m)}} V ≥ k}, (1.6.6)

where the sup is taken over all possible corruptions of m data points. Likewise, the acceptance breakdown is defined to be

ǫn∗(accept) = min{m/n : sup_x inf_{x^{(m)}} V < k}. (1.6.7)

Rejection breakdown is the smallest portion of the data that can be corrupted to guarantee that the test will reject. Acceptance breakdown is interpreted as the smallest portion of the data that must be corrupted to guarantee that the test statistic will not be in the critical region; i.e., the test is guaranteed to fail to reject the null hypothesis. We turn immediately to a comparison of the L1 and L2 tests.

Example 1.6.3. Rejection Breakdown of the L1

We first consider the one-sided sign test for testing H0: θ = 0 versus HA: θ > 0. The asymptotically size α test rejects the null hypothesis when n^{−1/2}S1(0) ≥ zα, the upper α quantile from a standard normal distribution. It is easier to see exactly what happens if we convert the test to S1⁺(0) = ∑ I(Xi > 0) ≥ n/2 + (n^{1/2}zα)/2. Now each time we make an observation positive, S1⁺(0) increases by one. Hence, if we wish to guarantee that the test rejects, we make m∗ observations positive, where m∗ = [n/2 + (n^{1/2}zα)/2] + 1, with [·] the greatest integer function. Then the rejection breakdown is

ǫn∗(reject) = m∗/n ≐ 1/2 + zα/(2n^{1/2}).

Likewise,

ǫn∗(accept) ≐ 1/2 − zα/(2n^{1/2}).

Note that the rejection breakdown converges down to the estimation breakdown and the acceptance breakdown converges up to it.


We next turn to the one-sided Student's t-test. Acceptance breakdown for the t-test is simple: by making a single observation approach −∞, the t statistic can be made negative; hence we can always guarantee acceptance with control of one observation. The rejection breakdown is more interesting. If we increase an observation, both the sample mean and the sample standard deviation increase; hence, it is not at all clear what will happen to the t-statistic. In fact, it is not sufficient to increase a single observation in order to force the t-statistic to move into the critical region. We now show that the rejection breakdown for the t-statistic is

ǫn∗(reject) = tα²/(n − 1 + tα²) → 0, as n → ∞,

where tα is the upper α quantile from a t-distribution with n − 1 degrees of freedom. The infimum part of the definition suggests that we set all observations at −B < 0 and then change m observations to M > 0. The result is

x̄ = [mM − (n − m)B]/n and s² = m(n − m)(M + B)²/[(n − 1)n].

Putting these two quantities together, we have

n^{1/2} x̄/s = [m − (n − m)B/M] {(n − 1)/[m(n − m)(1 + B/M)²]}^{1/2} → [m(n − 1)/(n − m)]^{1/2},

as M → ∞. We now equate the limit to tα and solve for m to get m = ntα²/(n − 1 + tα²) (actually we would take the greatest integer and add one). Then the rejection breakdown is m divided by n, as stated. Table 1.6.1 compares rejection breakdown values for the sign and t-tests. We assume α = .05, and the sample sizes are chosen so that the size of the sign test is quite close to .05. For further discussion, see Ylvisaker (1977).
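The two breakdown expressions are easily evaluated; in the sketch below, the t column agrees with Table 1.6.1, while the sign column is the asymptotic approximation derived above (the table's small-sample sign values depend on the exact critical point used).

n <- c(10, 13, 18, 30, 100)
t.bd <- qt(.95, n - 1)^2 / (n - 1 + qt(.95, n - 1)^2)   # t-test breakdown
sign.bd <- 1/2 + qnorm(.95) / (2 * sqrt(n))             # sign test, approx.
round(cbind(n, sign.bd, t.bd), 2)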

These definitions of breakdown assume a worst case scenario: they assume that the test statistic is as far away from the critical region (for rejection breakdown) as possible. In practice, however, it may be the case that a test statistic is quite near the edge of the critical region and only one observation is needed to change the decision from fail-to-reject to reject. An alternative form of breakdown considers the average number of observations that must be corrupted, conditional on the test statistic being in the acceptance region, to force a rejection.

Let MR be the number of observations that must be corrupted to force a rejection; then MR is a random variable. The expected rejection breakdown is defined to be

Expn∗(reject) = EH0[MR | MR > 0]/n. (1.6.8)

Note that we condition on MR > 0, since MR = 0 is equivalent to a rejection. It is left as Exercise 1.12.14 to show that the expected breakdown can be computed with unconditional expectation as

Expn∗(reject) = EH0[MR]/[(1 − α)n]. (1.6.9)

In the following example we illustrate this computation on the sign test and show how it compares to the worst case breakdown introduced earlier.


Table 1.6.1: Rejection breakdown values for size α = .05 tests.

   n    Sign    t
  10    .71    .27
  13    .70    .21
  18    .67    .15
  30    .63    .09
 100    .58    .03
   ∞    .50    0

Table 1.6.2: Comparison of expected breakdown and worst case breakdown for the size α = .05 sign test.

   n    Expn∗(reject)    ǫn∗(reject)
  10    .27              .71
  13    .24              .70
  18    .20              .67
  30    .16              .63
 100    .08              .58
   ∞    0                .50

Example 1.6.4. Expected Rejection Breakdown of the Sign Test

Refer to Example 1.6.3. The one-sided sign test rejects when ∑ I(Xi > 0) ≥ n/2 + n^{1/2}zα/2. Hence, given that we fail to reject the null hypothesis, we will need to change (corrupt) n/2 + n^{1/2}zα/2 − ∑ I(Xi > 0) negative observations into positive ones. This is precisely MR, and EH0[MR] ≐ n^{1/2}zα/2. It follows that Expn∗(reject) ≐ zα/[2n^{1/2}(1 − α)] → 0 as n → ∞, rather than the .5 which happens in the worst case breakdown. Table 1.6.2 compares the two types of rejection breakdown. This simple calculation clearly shows that even highly resistant tests such as the sign test may break down quite easily. This is contrary to what the worst case breakdown analysis would suggest. For additional reading on test breakdown see Coakley and Hettmansperger (1992); He, Simpson and Portnoy (1990) discuss asymptotic test breakdown.
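The expected breakdown column of Table 1.6.2 follows directly from this formula; the sketch below reproduces it up to rounding.

n <- c(10, 13, 18, 30, 100)
round(qnorm(.95) / (2 * sqrt(n) * (1 - .05)), 2)
## 0.27 0.24 0.20 0.16 0.09 (Table 1.6.2 reports .08 at n = 100)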

1.7 Inference and the Wilcoxon Signed-Rank Norm

In this section we develop the statistical properties of the procedures based on the Wilcoxon signed-rank norm, (1.3.17), which was defined in Example 1.3.3 of Section 1.3. Recall that the norm and its associated gradient function are given in expressions (1.3.17) and (1.3.24), respectively. Recall, for a sample X1, . . . , Xn, that the estimate of θ is the median of the Walsh averages, given by (1.3.25). As in Section 1.3, our hypotheses of interest are

H0: θ = 0 versus HA: θ ≠ 0. (1.7.1)

The level α test associated with the signed-rank norm is

Reject H0 in favor of HA if |T(0)| ≥ c, (1.7.2)

where c is such that P0[|T(0)| ≥ c] = α. To complete the test we need to determine the null distribution of T(0), which is given by Theorems 1.7.1 and 1.7.2.

In order to develop the statistical properties, in addition to (1.2.1) we assume that

the density h(x) is symmetric about θ. (1.7.3)

We refer to this as the symmetric location model. Under symmetry, by Theorem 1.2.1, T(H) = θ for all location functionals T.

1.7.1 Null Distribution Theory of T(0)

In addition to expression (1.3.24), a third representation of T(0) will be helpful in establishing its null distribution. Recall the definition of the anti-ranks, D1, . . . , Dn, given in expression (1.3.19). Using these anti-ranks, we can write

T(0) = ∑ R(|Xi|) sgn(Xi) = ∑ j sgn(X_{Dj}) = ∑ j Wj,

where Wj = sgn(X_{Dj}).

Lemma 1.7.1. Under H0, |X1|, . . . , |Xn| are independent of sgn(X1), . . . , sgn(Xn).

Proof: Since X1, . . . , Xn is a random sample from H(x), it suffices to show that P[|Xi| ≤ x, sgn(Xi) = 1] = P[|Xi| ≤ x] P[sgn(Xi) = 1]. But due to H0 and the symmetry of h(x), this follows from the following string of equalities:

P[|Xi| ≤ x, sgn(Xi) = 1] = P[0 < Xi ≤ x] = H(x) − 1/2 = [2H(x) − 1](1/2) = P[|Xi| ≤ x] P[sgn(Xi) = 1].

Based on this lemma, the vector of ranks and, hence, the vector of anti-ranks (D1, . . . , Dn) are independent of the vector (sgn(X1), . . . , sgn(Xn)). Based on these facts, we can obtain the distribution of (W1, . . . , Wn), which we summarize in the following lemma; see Exercise 1.12.15 for its proof.

Lemma 1.7.2. Under H0 and the symmetry of h(x), W1, . . . , Wn are iid random variables with P[Wi = 1] = P[Wi = −1] = 1/2.


We can now easily derive the null distribution theory of T(0), which we summarize in the following theorems. Details are given in Exercise 1.12.16.

Theorem 1.7.1. Under H0 and the symmetry of h(x),

T(0) is distribution free and its distribution is symmetric; (1.7.4)

E0[T(0)] = 0; (1.7.5)

Var0(T(0)) = n(n + 1)(2n + 1)/6; (1.7.6)

T(0)/√Var0(T(0)) has an asymptotically N(0, 1) distribution. (1.7.7)

The exact distribution of T(0) cannot be found in closed form. We do, however, have the following recursion formula; see Exercise 1.12.17.

Theorem 1.7.2. Consider the version of the signed-rank test statistic given by T⁺, (1.3.28). Let pn(k) = P[T⁺ = k] for k = 0, . . . , n(n + 1)/2. Then

pn(k) = (1/2)[p_{n−1}(k) + p_{n−1}(k − n)], (1.7.8)

where p0(0) = 1, p0(k) = 0 for k ≠ 0, and pn(k) = 0 for k < 0.

Using this formula, algorithms can be developed which obtain the null distribution of the signed-rank test statistic. The moment generating function can also be inverted to find the null distribution; see Hettmansperger (1984a, Section 2.2). As discussed in Section 1.3.1, software is now available which computes critical values and p-values of the null distribution.
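In fact, the recursion (1.7.8) is only a few lines of R; in the sketch below, p[k + 1] holds P(T⁺ = k) on the support 0, . . . , n(n + 1)/2, and the result agrees with R's built-in dsignrank.

psignrank.rec <- function(n) {
  p <- 1                                           # p_0: point mass at k = 0
  for (m in 1:n) {
    p <- (c(p, rep(0, m)) + c(rep(0, m), p)) / 2   # recursion (1.7.8)
  }
  p
}
all.equal(psignrank.rec(10), dsignrank(0:55, 10))  # TRUE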

Theorem 1.7.1 justifies the confidence interval for θ given in display (1.3.30); i.e., the (1 − α)100% confidence interval given by [W_{(k+1)}, W_{(n(n+1)/2−k)}), where W_{(i)} denotes the ith ordered Walsh average and P(T⁺(0) ≤ k) = α/2. Based on (1.7.7), k can be approximated as k ≈ n(n + 1)/4 − .5 − z_{α/2}[n(n + 1)(2n + 1)/24]^{1/2}. As noted in Section 1.3.1, the computation of the estimate and confidence interval can be obtained by our R function onesampwil or the R intrinsic function wilcox.test.
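For illustration, the estimate and interval can also be computed directly from the Walsh averages; this sketch assumes no ties and uses the normal approximation to k given above.

walsh.ci <- function(x, alpha = 0.05) {
  n <- length(x)
  w <- sort(outer(x, x, "+")[upper.tri(diag(n), diag = TRUE)] / 2)  # Walsh avgs
  k <- floor(n * (n + 1) / 4 - .5 -
             qnorm(1 - alpha / 2) * sqrt(n * (n + 1) * (2 * n + 1) / 24))
  c(estimate = median(w), lower = w[k + 1], upper = w[n * (n + 1) / 2 - k])
}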

1.7.2 Statistical Properties

From our earlier analysis of the statistical properties of the L1 and L2 methods, we see that Pitman Regularity is crucial. In particular, we need to compute the Pitman efficacy, which determines the asymptotic variance of the estimate, the asymptotic local power of the test, and the asymptotic length of the confidence interval. In the following theorem we show that the weighted L1 gradient function is Pitman Regular and determine the efficacy. Then we make some preliminary efficiency comparisons with the L1 and L2 methods.


Theorem 1.7.3. Suppose that h is symmetric and that ∫h²(x)dx < ∞. Let

T̄(θ) = [2/(n(n + 1))] ∑_{i≤j} sgn((xi + xj)/2 − θ).

Then the conditions of Definition 1.5.3 are satisfied and, thus, T(θ) is Pitman Regular. Moreover, the Pitman efficacy is given by

c = √12 ∫_{−∞}^{∞} h²(x)dx. (1.7.9)

Proof. Since we have the L1 norm applied to the Walsh averages, the estimating function is a nonincreasing step function with steps at the Walsh averages. Hence, (1.5.7) holds. Next note that h(x) = h(−x) and, hence,

µ(θ) = Eθ T̄(0) = [2/(n + 1)] Eθ sgn(X1) + [(n − 1)/(n + 1)] Eθ sgn((X1 + X2)/2).

Now

Eθ sgn(X1) = ∫ sgn(x + θ)h(x)dx = 1 − 2H(−θ),

and

Eθ sgn((X1 + X2)/2) = ∫∫ sgn[(x + y)/2 + θ]h(x)h(y)dxdy = ∫[1 − 2H(−2θ − y)]h(y)dy.

Differentiate with respect to θ and set θ = 0 to get

µ′(0) = [2/(n + 1)]2h(0) + [(n − 1)/(n + 1)]4∫_{−∞}^{∞} h²(y)dy → 4∫h²(y)dy.

The finiteness of the integral is sufficient to ensure that the derivative can be passed through the integral; see Hodges and Lehmann (1961) or Olshen (1967). Hence, (1.5.8) also holds. We next establish condition (1.5.9). Since

T̄(θ) = [2/(n(n + 1))] ∑_{i=1}^n sgn(Xi − θ) + [2/(n(n + 1))] ∑_{i<j} sgn((Xi + Xj)/2 − θ),

the first term is of smaller order and we need only consider the second term. Now, for b > 0, let

V∗ = [2/(n(n + 1))] ∑_{i<j} [sgn((Xi + Xj)/2 − n^{−1/2}b) − sgn((Xi + Xj)/2)]
   = [−4/(n(n + 1))] ∑_{i<j} I(0 < (Xi + Xj)/2 < n^{−1/2}b).


Hence,

n Var(V∗) = [16n/(n²(n + 1)²)] E ∑_{i<j} ∑_{s<t} (Iij Ist − EIij EIst),

where Iij = I(0 < (Xi + Xj)/2 < n^{−1/2}b). This becomes

n Var(V∗) = [16n²(n − 1)/(2n²(n + 1)²)] Var(I12) + [16n²(n − 1)(n − 2)/(2n²(n + 1)²)][EI12I13 − EI12EI13].

The first term tends to zero since it behaves like 1/n. In the second term, consider |EI12I13 − EI12EI13| ≤ EI12 + E²I12 = EI12(1 + EI12). Now, as n → ∞,

EI12 = P(0 < (X1 + X2)/2 < n^{−1/2}b) = ∫[H(2n^{−1/2}b − x) − H(−x)]h(x)dx → 0.

Hence, by Theorem 1.5.6, condition (1.5.9) is true. Finally, asymptotic normality of the null distribution is established in Theorem 1.7.1, which also yields n Var0 T̄(0) → 4/3 = σ²(0). It follows that the Pitman efficacy is

c = 4∫h²(y)dy / √(4/3) = √12 ∫h²(y)dy.

For future reference we display the asymptotic linearity result:

T(θ)/√[n(n + 1)(2n + 1)/6] = T(0)/√[n(n + 1)(2n + 1)/6] − √n θ √12 ∫_{−∞}^{∞} h²(x)dx + op(1), (1.7.10)

for √n|θ| ≤ B, where B > 0.

An immediate consequence of this theorem and Theorem 1.5.7 is that

√n(θ̂ − θ) →D Z ∼ N(0, 1/{12[∫h²(t)dt]²}), (1.7.11)

and we thus have the limiting distribution of the median of the Walsh averages. Exercise 1.12.20 shows that ∫h²(t)dt < ∞ when h has finite Fisher information.

From our general discussion, a simple estimate of the standard error of the median of the Walsh averages is proportional to the length of a distribution-free confidence interval. Consider the (1 − α)100% confidence interval given by [W_{(k+1)}, W_{(n(n+1)/2−k)}), where W_{(i)} denotes the ith ordered Walsh average and P(T⁺(0) ≤ k) = α/2. Then, by expression (1.5.27), a consistent estimate of the SE of the median of the Walsh averages (medWA) is

SE(medWA) = [W_{(n(n+1)/2−k)} − W_{(k+1)}]/(2z_{α/2}). (1.7.12)

Our R function onesampwil computes this standard error for general α (the default value of α is 0.05). We will have more to say about this particular c in the next chapter, where we will encounter it in the two-sample location model and later in the linear model, where a better estimator of this SE is presented.

From Example 1.5.3 and Definition 1.5.4, the asymptotic relative efficiency between the signed-rank Wilcoxon process and the L2 process is given by

e(Wilcoxon, L2) = 12σh² (∫h²(x)dx)², (1.7.13)

where h is the underlying density with variance σh².

In the following example, we consider the contaminated normal distribution and find the efficiency of the rank methods relative to the L1 and L2 methods.

Example 1.7.1. Asymptotic Relative Efficiency for Contaminated Normal Distributions

Let fǫ(x) denote the pdf of the contaminated normal distribution used in Example 1.5.4; the proportion of contamination is ǫ and the variance of the contaminated part is 9. A straightforward computation shows that

∫fǫ²(y)dy = (1 − ǫ)²/(2√π) + ǫ²/(6√π) + ǫ(1 − ǫ)/(√5 √π),

and we use this in the formula for c given above. We first consider the special case ǫ = 0, corresponding to an underlying normal distribution. In this case we have, for the rank methods, cR² = 12/(4π) = 3/π ≐ .955; for the L1 methods, c1² = 2/π ≐ .637; and for the L2 methods, c2² = 1. We have already seen that the efficiency e_normal(L1, L2) = c1²/c2² = .637 from the first line of Table 1.5.1. We now have

e_normal(Wilcoxon, L2) = 3/π ≐ .955 and e_normal(Wilcoxon, L1) = 1.5. (1.7.14)

The efficiency of the rank methods relative to the L2 methods is extraordinary: it says that even at the distribution for which the t-test is uniformly most powerful, the Wilcoxon signed-rank test is almost as efficient. This means that replacing the values of the observations by their ranks (retaining only the order information) does not affect the statistical properties of the test. This was considered highly nonintuitive in the 1950s, since nonparametric methods were thought of as quick and dirty. Now they must be considered highly efficient competitors of the optimal methods and, in addition, they are more robust than the optimal methods. This provides powerful motivation for the continued study of rank methods in other statistical models, such as the two-sample location model and the linear model. The early work in the area of efficiency of rank methods is due largely to Lehmann and his students. See Lehmann and Hodges (1956, 1961) for two important early papers and Lehmann (1975, Appendix) for more discussion.
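The displayed integral makes these efficiencies easy to compute; the following sketch, with δ = 3 built into the constants, reproduces (1.7.14) and, to within rounding, the entries of Table 1.7.1 below.

eff.cn <- function(eps) {
  ## delta = 3 (contamination variance 9) is built into these constants
  int.f2 <- (1 - eps)^2 / (2 * sqrt(pi)) + eps^2 / (6 * sqrt(pi)) +
            eps * (1 - eps) / sqrt(5 * pi)     # integral of f_eps^2
  f0     <- ((1 - eps) + eps / 3) * dnorm(0)   # f_eps(0)
  sigma2 <- 1 + 8 * eps                        # variance of f_eps
  e12 <- 4 * f0^2 * sigma2                     # e(L1, L2)
  eR2 <- 12 * sigma2 * int.f2^2                # e(R, L2)
  c(e.L1.L2 = e12, e.R.L1 = eR2 / e12, e.R.L2 = eR2)
}
round(t(sapply(c(0, .01, .03, .05, .10, .15), eff.cn)), 3)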

We complete this example with a table of efficiencies of the rank methods relative to the L1 and L2 methods for the contaminated normal model with δ = 3.


Table 1.7.1: Efficiencies of the rank, L1, and L2 methods for the contaminated normal distribution.

  ǫ     e(L1, L2)   e(R, L1)   e(R, L2)
 .00     .637       1.500       .955
 .01     .678       1.488      1.009
 .03     .758       1.462      1.108
 .05     .833       1.436      1.196
 .10    1.000       1.373      1.373
 .15    1.134       1.320      1.497

Table 1.7.1 shows these efficiencies and extends Table 1.5.1. As ǫ increases, the weight in the tails of the distribution also increases. Note that the efficiencies of both the L1 and the rank methods relative to the L2 methods increase with ǫ. On the other hand, the efficiency of the rank methods relative to the L1 methods decreases slightly. The rank methods are still more efficient; however, this illustrates the fact that the L1 methods are good for heavy-tailed distributions. The overall implication of this example is that the L2 methods, such as the sample mean, the t-test and confidence interval, are not particularly efficient once the underlying distribution departs from the normal distribution. Further, the rank methods, such as the Wilcoxon signed-rank test, confidence interval, and the median of the Walsh averages, are surprisingly efficient, even at the normal distribution. Note that the rank methods are more efficient than the L2 methods even for 1% contamination.

Finally, the following theorem shows that the Wilcoxon signed-rank statistic never loses much efficiency relative to the t-statistic. Let Fs denote the family of distributions which have symmetric densities and finite Fisher information; see Exercise 1.12.20.

Theorem 1.7.4. Let X1, . . . , Xn be a random sample from H ∈ FS. Then

infFs

e(Wilcoxon, L2) = 0.864 . (1.7.15)

Proof: By ( 1.7.13), e(Wilcoxon, L2) = 12σ2h

(∫h2(x) dx

)2. If σ2

h = ∞ then e(Wilcoxon, L2) >.864; hence, we can restrict attention to H ∈ Fs such that σ2

h < ∞. As Exercise 1.12.21indicates e(Wilcoxon, L2) is location and scale invariant, so, we can further assume thath is symmetric about 0 and σ2

h = 1. The problem, then, is to minimize∫h2 subject to∫

h =∫x2h = 1 and

∫xh = 0. This is equivalent to minimizing

∫h2 + 2b

∫x2h− 2ba2

∫h , (1.7.16)

where a and b are positive constants to be determined later. We now write ( 1.7.16) as∫ [

h2 + 2b(x2 − a2)h]

=

|x|≤a

[h2 + 2b(x2 − a2)h

]

+

|x|>a

[h2 + 2b(x2 − a2)h

]. (1.7.17)

Page 52: Robust Nonparametric Statistical Methods

42 CHAPTER 1. ONE SAMPLE PROBLEMS

First complete the square on the first term on the right side of ( 1.7.17) to get

|x|≤a

[h + b(x2 − a2)

]2 −∫

|x|≤ab2(x2 − a2)2 (1.7.18)

Now ( 1.7.17) is equal to the two terms of ( 1.7.18) plus the second term on the right side of( 1.7.17). We can now write the density that minimizes ( 1.7.16).

If |x| > a take h(x) = 0, since x2 > a2, and if |x| ≤ a take h(x) = b(a2 − x2), since theintegral in the first term of ( 1.7.18) is nonnegative. We can now determine the values of aand b from the side conditions. From

∫h = 1. we have

∫ a

−ab(a2 − x2) dx = 1 ,

which implies that a3b = 34. Further, from

∫x2h = 1 we have

∫ a

−ax2b(a2 − x2) dx = 1 ,

from which a5b = 154. Hence solving for a and b yields a =

√5 and b = 3

√5/100. Now

∫h2 =

∫ √5

−√

5

[3√

5

100(5 − x2)

]2

dx =3√

5

25,

which leads to the result,

infFs

e(Wilcoxon, L2) = 12

(3√

5

25

)2

=108

125

.= 0.864 .

1.7.3 Robustness Properties

We complete this section with a discussion of the breakdown point of the estimate and testand a heuristic derivation of the influence function of the estimate. In Example 1.6.1 wediscussed the breakdown of the sample median and mean. In those cases we saw that themedian is the most resistant while the mean is the least resistant. In Exercise 1.12.13you are asked to show that the breakdown point of the median of the Walsh averages, theR-estimate, is roughly .29. Our next result gives the influence function θ.

Theorem 1.7.5. The influence function of θ = medi≤j(xi + xj)/2 is given by:

Ω(x) =H(x) − 1/2∫∞−∞ h2(t)dt

Page 53: Robust Nonparametric Statistical Methods

1.7. INFERENCE AND THE WILCOXON SIGNED-RANK NORM 43

We sketch a derivation of this result, here. A rigorous development is offered in SectionA.5 of the Appendix. From Theorems 1.7.3 and 1.5.6 we have

n1/2T (θ)/σ(0) ≈ n1/2T (0)/σ(0) − cn1/2θ ,

and

θn ≈ T (0)/cσ(0) ,

where σ(0) = (4/3)1/2 and c = (12)1/2∫h2(t)dt. Make these substitutions to get

θn.=

1

n(n + 1)2∫h2(t)dt

i≤jsgn

(Xi +Xj

2

)

Now introduce an outlier xn+1 = x∗ and take the difference between θn+1 and θn. The resultis

2

∫h2(t)dt[(n+ 2)θn+1 − nθn]

.=

1

(n + 1)

n+1∑

i=1

sgn

(xi + x∗

2

).

We can replace n + 2 and n + 1 by n where convenient without effecting the asymptotics.Using the symmetry of the density of H , we have

1

n

n∑

i=1

sgn

(xi + x∗

2

).= 1 − 2Hn(−x∗) → 1 − 2H(−x∗) = 2H(x∗) − 1 .

It now follows that (n + 1)(θn+1 − θn).= Ω(x∗), given in the statement of the theorem; see

the discussion of the influence function in Section 1.6.

Note that we have a bounded influence function since the cdf H is a bounded function.Further, it is continuous, unlike the influence function of the median. Finally, as an additionalcheck, note that EΩ2(X) = 1/12[

∫h2(t)dt]2 = 1/c2, the asymptotic variance of n1/2θ.

Let θc = medi,j (Xi − cXj)/(1 − c) for −1 ≤ c < 1 . This extension of the Hodges-Lehmann estimate, ( 1.3.25), has some very interesting robustness properties for c > 0. The

influence function of θc is not only bounded but also redescending, similar to the most robustM-estimates. In addition, θc has 50% breakdown. For a complete discussion of this estimatesee Maritz, Wu and Staudte (1977) and Brown and Hettmansperger (1994).

In the next theorem we develop the test breakdown for the Wilcoxon signed rank test.

Theorem 1.7.6. The rejection breakdown, Definition 1.6.2, for the Wilcoxon signed ranktest is

ǫ∗n.= 1 −

(1

2− zα

(3n)1/2

)1/2

→ 1 − 1

21/2

.= .29 .

Page 54: Robust Nonparametric Statistical Methods

44 CHAPTER 1. ONE SAMPLE PROBLEMS

Table 1.7.2: Rejection breakdown values for size α = .05 tests.Signed-rank

n Sign t Wilcoxon10 .71 .27 .5713 .70 .21 .5318 .67 .15 .4830 .63 .09 .43

100 .58 .03 .37∞ .50 0 .29

Proof. Consider the form T+(0) =∑ ∑

I[(xi+xj)/2 > 0], where the double sum is overall i ≤ j. The asymptotically size α test rejects H0 : θ = 0 in favor of HA : θ > 0 whenT+(0) ≥ c

.= n(n+ 1)/4 + zα[n(n+ 1)(2n+ 1)/24]1/2. Now we must guarantee that T+(0) is

in the critical region. This requires at least c positive Walsh averages. Let x(1) ≤ . . . ≤ x(n)

be the ordered observations. Then contamination of x(n) results in n contaminated Walshaverages, namely those Walsh averages that include x(n). Contamination of x(n−1) yields n−1additional contaminated Walsh averages. When we proceed in this way, contamination of theb ordered values x(n), . . . , x(n−b+1) yields n+(n−1)+...+(n−b+1) = [n(n+1)/2]−[(n−b)(n−b+1)/2] contaminated Walsh averages. We now set [n(n+1)/2]−[(n−b)(n−b+1)/2]

.= c and

solve the resulting quadratic for b. We must solve b2 − (2n+ 1)b+ 2c.= 0. The appropriate

root in this case is

b.=

2n+ 1 − [(2n+ 1)2 − 8c]1/2

2.

Substituting the approximate critical value for c, dividing by n, and ignoring higher orderterms, leads to the stated result.

Table 1.7.2 displays the finite rejection breakdowns of the Wilcoxon signed rank testover the same sample sizes as the rejection breakdowns of the sign test given in Table 1.6.1.For convenience we have reproduced the results for the sign and t tests, also. The rejectionbreakdown for the Wilcoxon test converges from above to the estimation breakdown of .29.The Wilcoxon test is more resistant than the t-test but not as resistant as the simple signtest. It is interesting to note that from the discussion of efficiency, it is clear that we can nowachieve high efficiency and not pay the price in lack of robustness. The rank based methodsseem to be a very attractive alternative to the highly resistant but relatively inefficient (atthe normal model) L1 methods and the highly efficient (at the normal model) but nonrobustL2 methods.

1.8 Inference Based on General Signed-Rank Norms

In this section, we develop properties for a generalized sign-rank process. It includes the L1

and the weighted L1 as special cases. The development is similar to that of the weighted L1

Page 55: Robust Nonparametric Statistical Methods

1.8. INFERENCE BASED ON GENERAL SIGNED-RANK NORMS 45

so a brief sketch suffices. For x ∈ Rn, consider the function,

‖x‖ϕ+ =

n∑

i=1

a+(R|xi|)|xi| , (1.8.1)

where the scores a+(i) are generated as a+(i) = ϕ+(i/(n + 1)) for a positive valued, non-decreasing, square-integrable function ϕ+(u) defined on the interval (0, 1). The proof that‖ · ‖ϕ+ is a norm on Rn follows in the same way as in the weighted L1 case; see the proof ofTheorem 1.3.2 and Exercise 1.12.22. The gradient function associated with this norm is

Tϕ+(θ) =n∑

i=1

a+(R|Xi − θ|)sgn(Xi − θ) . (1.8.2)

Note that it reduces to the L1 norm if ϕ+(u) ≡ 1 and the weighted L1, Wilcoxon signed-rank,norm if ϕ+(u) = u. A family of simple score functions between the weighted L1 and the L1

are of the form

ϕ+c (u) =

u 0 < u < cc 0 ≤ u < 1

, (1.8.3)

where the parameter c is between 0 and 1. These scores were proposed by Policello andHettmansperger (1976); see, also, Hogg (1974). The frequently used normal scores aregenerated by the score function,

ϕ+Φ(u) = Φ−1

(u+ 1

2

), (1.8.4)

where Φ is the standard normal distribution function. Note that ϕ+Φ(u) is the inverse cdf (or

quantile function) of the absolute value of a standard normal random variable. The normalscores were originally proposed by Fraser (1957).

For the location model ( 1.2.1), the estimate of θ based on the norm ( 1.8.1) is the valueof θ which minimizes the distance ‖X− 1θ‖ϕ+ or equivalently solves the equation

Tϕ+(θ).= 0 . (1.8.5)

A simple tracing algorithm suffices to compute θ. As Exercise 1.12.18 shows Tϕ+(θ) is adecreasing step function of θ which steps down only at the Walsh averages. So first sortthe Walsh averages. Next select a starting value θ(0), such as median of the Walsh averageswhich corresponds to the signed-rank Wilcoxon scores. Then proceed through the sortedWalsh averages left or right, depending on whether or not Tϕ+(θ(0)) is negative or positive.The algorithm continues until the sign of Tϕ+(θ) changes. This is the algorithm behindour RBR function onesampr which solves equation (1.8.5) for general scores functions; seeExercise 1.12.33. Also, the linear searches discussed in Chapter 3, Section 3.7.3, can be usedto compute θ.

To determine the corresponding functional, note that we can write R|Xi − θ| = #jθ −|Xi − θ| ≤ Xj ≤ |Xi − θ| + θ. Let Hn denote the empirical distribution function of the

Page 56: Robust Nonparametric Statistical Methods

46 CHAPTER 1. ONE SAMPLE PROBLEMS

sample X1, . . . , Xn and let H−n denote the left limit of Hn. We can then write the defining

equation of θ as,∫ϕ+(Hn(|x− θ| + θ) −H−

n (θ − |x− θ|))sgn(x− θ) dHn(x) = 0 ,

which converges to

δ(θ) =

∫ ∞

−∞ϕ+(H(|x− θ| + θ) −H(θ − |x− θ|))sgn(x− θ) dH(x) = 0 . (1.8.6)

For convenience, a second representation of δ(θ) can be obtained if we extend ϕ+(u) to theinterval (−1, 0) as follows:

ϕ+(t) = −ϕ+(−t) , for −1 < t < 0 . (1.8.7)

Using this extension, the functional θ = T (H) is the solution of

δ(θ) =

∫ ∞

−∞ϕ+(H(x) −H(2θ − x)) dH(x) = 0. (1.8.8)

Compare expressions ( 1.8.8) and ( 1.3.26).The level α test of the hypotheses ( 1.3.6) based on Tϕ+(0) is

Reject H0 in favor of HA, if |Tϕ+(0)| ≥ c , (1.8.9)

where c solves P0[|Tϕ+(0)| ≥ c] = α. We briefly develop the statistical and robustness

properties of this test and the estimator θϕ+ in the next two subsections.

1.8.1 Null Properties of the Test

For this subsection on null properties and the following subsection on efficiency properties ofthe test ( 1.8.9), we will assume that the sample X1, . . . , Xn follows the symmetric locationmodel, ( 1.7.3), with common symmetric density function h(x) = f(x − θ), where f(x) issymmetric about 0. Let H(x) denote the distribution function associated with h(x).

As in Section 1.7.1, we can express Tϕ+(0) in terms of the anti-ranks as,

Tϕ+(0) =∑

a+(R(|Xi|))sgn(Xi) =∑

a+(j)sgn(XDj) =

∑a+(j)Wj ; (1.8.10)

see the corresponding expression ( 1.3.20) for the weighted L1 norm. Recall that under H0

and the symmetry of h(x), the variables W1, . . . ,Wn are iid with P [Wi = 1] = P [Wi = −1] =1/2, (Lemma 1.7.2). Thus we immediately have that Tϕ+(0) is distribution free under H0

with mean and variance

E0[Tϕ+(0)] = 0 (1.8.11)

Var0[Tϕ+(0)] =n∑

i=1

a+2(i) . (1.8.12)

Page 57: Robust Nonparametric Statistical Methods

1.8. INFERENCE BASED ON GENERAL SIGNED-RANK NORMS 47

Tables can be constructed for the null distribution of Tϕ+(0) from which critical values, c,can be obtained to complete the test described in ( 1.8.9).

For the asymptotic null distribution of Tϕ+(0), the following additional assumption onthe scores will be sufficient:

maxja+2(j)∑

i=1 a+2(i)

→ 0 . (1.8.13)

Because ϕ+ is square integrable, we have

1

n

i=1

a+2(i) → σ2ϕ+ =

∫ 1

0

(ϕ+(u))2 du , 0 < σ2ϕ+ <∞ , (1.8.14)

i.e., the left side is a Riemann sum of the integral. Under these assumptions and the sym-metric location model, Corollary A.1.1 of the Appendix can be used to show that the nulldistribution of Tϕ+(0) is asymptotically normal; see, also, Exercise 1.12.16. Hence, anasymptotic level α test is

Reject H0 in favor of HA, if∣∣∣ Tϕ+ (0)√nσϕ+

∣∣∣ ≥ zα/2 . (1.8.15)

An approximate (1−α)100% confidence interval for θ based on the process Tϕ+(θ) is the

interval (θϕ+,L, θϕ+,U) such that

Tϕ+(θϕ+,L) = zα/2√nσϕ+ and Tϕ+(θϕ+,U) = −zα/2

√nσϕ+ ; (1.8.16)

see ( 1.5.26). These equations can be solved by the simple tracing algorithm discussedimmediately following expression (1.8.5).

1.8.2 Efficiency and Robustness Properties

We derive the efficiency properties of the analysis described above by establishing the fourconditions of Definition 1.5.3 to show that the process Tϕ+(θ) is Pitman regular. Assumethat ϕ+(u) is differentiable. First define the quantity γh as

γh =

∫ 1

0

ϕ+(u)ϕ+h (u) du , (1.8.17)

where

ϕ+h (u) = −h

′ (H−1(u+1

2

))

h(H−1

(u+1

2

)) . (1.8.18)

As discussed below, ϕ+h (u) is called the optimal score function. We assume that our

scores are such that γh > 0.Since it is the negative of a gradient of a norm, Tϕ+(θ) is nondecreasing in θ; hence, the

first condition, ( 1.5.7), holds. Let Tϕ+(0) = Tϕ+(0)/n and consider

µϕ+(θ) = Eθ[T ϕ+(0)] = E0[Tϕ+(−θ)] .

Page 58: Robust Nonparametric Statistical Methods

48 CHAPTER 1. ONE SAMPLE PROBLEMS

Note that T ϕ+(−θ) converges in probability to δ(−θ) in ( 1.8.8). Hence, µϕ+(θ) = δ(−θ)where in ( 1.8.8) H is a distribution function with point of symmetry at 0, without loss ofgenerality. If we differentiate δ(−θ) and set θ = 0, we get

µ′ϕ+(0) = 2

∫ ∞

−∞ϕ+′(2H(x) − 1)h(x) dH(x)

= 4

∫ ∞

0

ϕ+′(2H(x) − 1)h2(x) dx =

∫ 1

0

ϕ+(u)ϕ+h (u) du > 0 , (1.8.19)

where the third equality in ( 1.8.19) follows from an integration by parts. Hence the secondPitman regularity condition holds.

For the third condition, ( 1.5.9), the asymptotic linearity for the process Tϕ+(0) is givenin Theorem A.2.11 of the Appendix. We restate the result here for reference:

P0

[sup√n|θ|≤B

∣∣∣∣1√nTϕ+(θ) − 1√

nTϕ+(0) + θγh

∣∣∣∣ ≥ ǫ

]→ 0 , (1.8.20)

for all ǫ > 0 and all B > 0. Finally the fourth condition, ( 1.5.10), concerns the asymptoticnull distribution which was discussed above. The null variance of Tϕ+(0)/

√n is given by

expression ( 1.8.12). Therefore the process Tϕ+(θ) is Pitman regular with efficacy given by

cϕ+ =

∫ 1

0ϕ+(u)ϕ+

h (u) du√∫ 1

0(ϕ+(u))2 du

=2∫∞−∞ ϕ+′(2H(x) − 1)h2(x) dx

√∫ 1

0(ϕ+(u))2 du

. (1.8.21)

As our first result, we obtain the asymptotic power lemma for the process Tϕ+(θ). This,of course, follows immediately from Theorem 1.5.8 so we state it as a corollary.

Corollary 1.8.1. Under the symmetric location model,

Pθn

[Tϕ+(0)√nσϕ+

≥ zα

]→ 1 − Φ(zα − θ∗cϕ+) , (1.8.22)

for the sequence of hypotheses

H0 : θ = 0 versus HAn : θ = θn = θ∗√n

for θ∗ > 0 .

Based on Pitman regularity, the asymptotic distribution of the the estimate θϕ+ is

√n(θϕ+ − θ)

D→ N(0, τ 2ϕ+) , (1.8.23)

where the scale parameter τϕ+ is defined by the reciprocal of ( 1.8.21),

τϕ+ = c−1ϕ+ =

σϕ+

∫ 1

0ϕ+(u)ϕ+

h (u) du. (1.8.24)

Page 59: Robust Nonparametric Statistical Methods

1.8. INFERENCE BASED ON GENERAL SIGNED-RANK NORMS 49

Using the general result of Theorem 1.5.9, the length of the confidence interval for θ,(1.8.16), can be used to obtain a consistent estimate of τϕ+ . This in turn can be used to

obtain a consistent estimate of the standard error of θϕ+ ; see Exercise ??.The asymptotic relative efficiency between two estimates or two tests based on score

functions ϕ+1 (u) and ϕ+

2 (u) is the ratio

e(ϕ+1 , ϕ

+2 ) =

c2ϕ+

1

c2ϕ+

2

=τ 2ϕ+

2

τ 2ϕ+

1

. (1.8.25)

This can be used to compare different tests. For a specific distribution we can determinethe optimum scores. Such a score should make the scale parameter τϕ+ as small as possible.This scale parameter can be written as,

cϕ+ = τ−1ϕ+ =

∫ 1

0ϕ+(u)ϕ+

h (u) du

σϕ+

√∫ 1

0ϕ2h(u) du

√∫ 1

0

ϕ2h(u) du . (1.8.26)

The quantity in brackets is a correlation coefficient; hence, to minimize the scale parameterτϕ+ , we need to maximize the correlation coefficient which can be accomplished by selectingthe optimal score function given by

ϕ+(u) = ϕ+h (u) ,

where ϕ+h (u) is given by expression ( 1.8.18). The quantity

√∫ 1

0(ϕ+

h (u))2 du is the square

root of Fisher information; see Exercise 1.12.23. Therefore for this choice of scores theestimate θϕ+

his asymptotically efficient. This is the reason for calling the score function

ϕ+h the optimal score function.

It is shown in Exercise 1.12.24 that the optimal scores are the normal scores if h(x) isa normal density, the Wilcoxon weighted L1 scores if h(x) is a logistic density, and the L1

scores if h(x) is a double exponential density. It is further shown that the scores generatedby ( 1.8.3) are optimal for symmetric densities with a logistic center and exponential tails.

From Exercise 1.12.24, the efficiency of the normal scores methods relative to the leastsquares methods is

e(NS,LS) =

∫ ∞

−∞

f 2(x)

φ (Φ−1(F (x)))dx

2

, (1.8.27)

where F ∈ FS, the family of symmetric distributions with positive, finite Fisher informationand φ = Φ′ is the N(0, 1) pdf.

We now prove a result similar to Theorem 1.7.4. We prove that the normal scoresmethods always have efficiency at least equal to one relative to the LS methods. Further, itis only equal to 1 at the normal distribution. The result was first proved by Chernoff andSavage (1958); however, the proof presented below is due to Gastwirth and Wolff (1968).

Page 60: Robust Nonparametric Statistical Methods

50 CHAPTER 1. ONE SAMPLE PROBLEMS

Theorem 1.8.1. Let X1, . . . , Xn be a random sample from F ∈ Fs. Then

infFs

e(NS,LS) = 1 , (1.8.28)

and is only equal to 1 at the normal distribution.

Proof: If σ2f = ∞ then e(NS,LS) > 1; hence, we suppose that σ2

f = 1. Let e = e(NS,LS).Then from ( 1.8.27) we can write

√e = E

[f(X)

φ (Φ−1(F (X)))

]

= E

[1

φ (Φ−1(F (X))) /f(X)

].

Applying Jensen’s inequality to the convex function h(x) = 1/x, we have

√e ≥ 1

E [φ (Φ−1(F (X))) /f(X)].

Hence,

1√e

≤ E

[φ (Φ−1(F (X)))

f(X)

]

=

∫φ(Φ−1(F (x))

)dx .

We now integrate by parts, using u = φ (Φ−1(F (x))), du = φ′ (Φ−1(F (x))) f(x) dx/φ (Φ−1(F (x))) =−Φ−1(F (x))f(x) dx since φ′(x)/φ(x) = −x. Hence, with dv = dx, we have

∫ ∞

−∞φ(Φ−1(f(x))

)dx = xφ

(Φ−1(F (x))

) ∣∣∞−∞ +

∫ ∞

−∞xΦ−1(F (x))f(x) dx . (1.8.29)

Now transform xφ (Φ−1(F (x))) into F−1(Φ(w))φ(w) by first letting t = F (x) and thenw = Φ−1(t). The integral

∫F−1(Φ(w))φ(w) dw =

∫xf(x) dx < ∞, hence the limit of the

integrand must be 0 as x → ±∞. This implies that the first term on the right side of( 1.8.29) is 0. Hence applying the Cauchy-Schwarz inequality,

1√e

≤∫ ∞

−∞xΦ−1(F (x))f(x) dx

=

∫ ∞

−∞x√f(x)Φ−1(F (x))

√f(x) dx

≤[∫ ∞

−∞x2f(x) dx

∫ ∞

−∞

Φ−1(F (x))

2f(x) dx

]1/2

= 1 ,

Page 61: Robust Nonparametric Statistical Methods

1.8. INFERENCE BASED ON GENERAL SIGNED-RANK NORMS 51

since∫x2f(x) dx = 1 and

∫x2φ(x) dx = 1. Hence e1/2 ≥ 1 and e ≥ 1, which completes

the proof. It should be noted that the inequality is strict except at the normal distribution.Hence the normal scores are strictly more efficient than the LS procedures except at thenormal model where the asymptotic relative efficiency is 1.

The influence function for θϕ+ is derived in Section A.5 of the Appendix. It is givenby

Ω(t, θϕ+) =ϕ+(2H(t) − 1)

4∫∞0ϕ+′(2H(x) − 1)h2(x) dx

. (1.8.30)

Note, also, that E[Ω2(X, θϕ+)] = τ 2ϕ+ as a check on the asymptotic distribution of θϕ+ . Note

that the influence function is bounded provided the score function is bounded. Thus theestimates based on the scores discussed in the last paragraph are all robust except for thenormal scores. In the case of the normal scores, when H(t) = Φ(t), the influence function isΩ(t) = Φ−1(t); see Exercise 1.12.25.

The asymptotic breakdown of the estimate θϕ+ is ǫ∗ given by

∫ 1−ǫ∗

0

ϕ+(u) du =1

2

∫ 1

0

ϕ+(u) du . (1.8.31)

We provide a heuristic argument for ( 1.8.31); for a rigorous development see Huber (1981).Recall Definition 1.6.1. The idea is to corrupt enough data so that the estimating equation,( 1.8.5), no longer has a solution. Suppose that [ǫn] observations are corrupted, where [·]denotes the greatest integer function. Push the corrupted observations out towards +∞ sothat

n∑

i=[(1−ǫ)n]+1

a+(R(|Xi − θ|))sgn(Xi − θ) =n∑

i=[(1−ǫ)n]+1

a+(i) .

This restrains the estimating function from crossing the horizontal axis provided

−[(1−ǫ)n]∑

i=1

a+(i) +n∑

i=[(1−ǫ)n]+1

a+(i) > 0 .

Replacing the sums by integrals in the limit yields

∫ 1−ǫ

0

ϕ+(u) du >

∫ 1

1−ǫϕ+(u) du .

Now use the fact that∫ 1−ǫ

0

ϕ+(u) du+

∫ 1

1−ǫϕ+(u) du =

∫ 1

0

ϕ+(u) du

and that we want the smallest possible ǫ to get ( 1.8.31).

Example 1.8.1. Breakdowns of Estimates Based on Wilcoxon and Normal Scores

Page 62: Robust Nonparametric Statistical Methods

52 CHAPTER 1. ONE SAMPLE PROBLEMS

Table 1.8.1: Empirical AREs Based on n = 30 and 10,000 simulations.Estimators Normal Contaminated NormalNS, LS 0.983 1.035Wil, LS 0.948 1.007NS, WIL 1.037 1.028

For θ = med(Xi +Xj)/2, ϕ+(u) = u and it follows at once that ǫ∗ = 1 − (1/√

2).= .293.

For the estimate based on the normal scores where ϕ+(u) is given by ( 1.8.4), expression( 1.8.31) becomes

exp

−1

2

[Φ−1

(1 − ǫ

2

)]2=

1

2

and ǫ∗ = 2(1 − Φ(√

log 4)).= .239. Hence we have the unusual situation that the estimate

based on the normal scores has positive breakdown but an unbounded influence curve.

Example 1.8.2. Small Sample Empirical AREs of Estimator Based on Normal Scores

As discussed above, the ARE between the normal scores estimator and the sample meanis 1 at the normal distribution. This is an asymptotic result. To answer the question aboutthis efficiency at small samples, we conducted a small simulation study. We set the samplesize at n = 30 and ran 10,000 simulations from a normal distribution. We also selected thecontaminated normal distribution with ǫ = 0.01 and σc = 3, which is a very mild contami-nated distribution. We consider the three estimators: rank-based estimator based on normalscores (NS), rank-based estimator based on Wilcoxon scores (WIL), and the sample mean(LS). We used the RBR command onesampr(x,score=phinscp,grad=spnsc,maktable=F)

to compute the normal scores estimator; see Exercise 1.12.29. As our empirical ARE we usedthe ratios of empirical mean square errors of the three estimators. Table 1.8.1 summarizesthe results. The empirical AREs for the NS and WIL estimators, at the normal, are close totheir asymptotic counterparts. Note that the NS estimator results in only a loss of less than2% efficiency over LS. For this small amount of contamination the NS estimator dominatesthe LS estimator. It also dominates the Wilcoxon estimator. In Exercise 1.12.29, the readeris asked to extend this study to other situations.

Example 1.8.3. Shoshoni Rectangles, Continued.

The next display shows the normal scores analysis of the Shoshoni Rectangles Data; seeExample 1.4.2. We conducted the same analysis as we did for the sign test and tratditionalt-test discussed in Example 1.4.2. Note that the call to the RBR function onnesampr withthe values score=phinscp,grad=spnsc computes the normal scores analysis.

> onesampr(x,theta0=.618,alpha=.10,score=phinscp,grad=spnsc)

Test of Theta = 0.618 Alternative selected is 0

Test Stat. Tphi+ is 7.809417 Standardized (z) Test-Stat. 1.870514 and p-vlaue 0.06141252

Page 63: Robust Nonparametric Statistical Methods

1.9. RANKED SET SAMPLING 53

Estimate 0.6485 SE is 0.02502799

90 % Confidence Interval is ( 0.61975 , 0.7 )

Estimate of the scale parameter tau 0.1119286

While not as sensitive to the outliers as the traditional analysis, the outliers still had someinfluence on the normal scores analysis. The normal scores test rejects the null hypothesisat level 0.06 while the 90% confidence interval just misses the value 0.618.

1.9 Ranked Set Sampling

In this section we discuss an alternative to simple random sampling (SRS) called ranked setsampling (RSS). This method of data collection is useful when measurements are destructiveor expensive while ranking of the data is relatively easy. Johnson, Nussbaum, Patil andRoss (1996) give an interesting application to environmental sampling. As a simple exampleconsider the problem of estimating the mean volume of trees in a forest. To measure thevolume, we must destroy the tree. On the other hand, an expert may well be able to rank thetrees by volume in a small sample. The idea is to take a sample of size k of trees and ask theexpert to pick the one with smallest volume. This tree is cut down and the volume measuredand the other k − 1 trees are returned to the population for possible future selection. Thena new sample of size k is taken and the expert identifies the second smallest which is thencut down and measured. This is repeated until we have k measurements, having looked atk2 trees. This ends cycle 1. The measurements are represented as x(1)1 ≤ . . . ≤ x(k)1 wherethe number in parentheses indicates an order statistic and the second number indicates thecycle. We repeat the process for n cycles to get nk measurements:

x(1)1, . . . , x(1)n iid h(1)(t)

x(2)1, . . . , x(2)n iid h(2)(t)

......

...

x(k)1, . . . , x(k)n iid h(k)(t)

It is important to note that all nk measurements are independent but are identicallydistributed only within each row. The density function h(j)(t) represents the pdf of the jthorder statistic from a sample of size k and is given by:

h(j)(t) =k!

(j − 1)!(k − j)!Hj−1(t)[1 −H(t)]k−jh(t)

We suppose the measurements are distributed as H(x) = F (x− θ) and we wish to makea statistical inference concerning θ, such as an estimate, test, or confidence interval. We willillustrate the ideas on the L1 methods since they are simple to work with. We also wish to

Page 64: Robust Nonparametric Statistical Methods

54 CHAPTER 1. ONE SAMPLE PROBLEMS

compute the efficiency of the RSSL1 methods relative to the SRSL1 methods. We will seethat there is a substantial increase in efficiency when using the RSS design. In particular,we will compare the RRS methods to SRS methods based on a sample of size nk. TheRSS method was first applied by McIntyre (1952) in measuring mean pasture yields. SeeHettmansperger (1995) for a development of the RSSL1 methods. The most convenientform of the RSS sign statistic is the number of positive measurements given by

S+RSS =

k∑

j=1

n∑

i=1

I(X(j)i > 0) . (1.9.1)

Now note that S+RSS can be written as S+

RSS =∑S+

(j) where S+(j) =

∑i I(X(j)i > 0) has

a binomial distribution with parameters n and 1 − H(j)(0). Further, S+(j), j = 1, . . . , k are

stochastically independent. It follows at once that

ES+RSS = n

k∑

j=1

(1 −H(j)(0)) (1.9.2)

VarS+RSS = n

k∑

j=1

(1 −H(j)(0))H(j)(0) .

With k fixed and n→ ∞, it follows from the independence of S+(j), j = 1, . . . , k that

(nk)−1/2S+RSS − n

k∑

j=1

(1 −H(j)(0) D→ Z ∼ n(0, ξ2) , (1.9.3)

and the asymptotic variance is given by

ξ2 = k−1k∑

j=1

[1 −H(j)(0)]H(j)(0) =1

4− k−1

k∑

j=1

(H(j)(0) − 1

2)2 . (1.9.4)

It is convenient to introduce a parameter δ2 = 1− (4/k)∑

(H(j)(0)− 1/2)2, then ξ2 = δ2/4.The reader is asked to prove the second equality above in Exercise 1.12.26. Using theformulas for the pdfs of the order statistics it is straightforward to verify that

h(t) = k−1k∑

j=1

h(j)(t) and H(t) = k−1k∑

j=1

H(j)(t) .

We now consider testing H0 : θ = 0 versus HA : θ 6= 0. The following theorem provides themean and variance of the RSS sign statistic under the null hypothesis.

Theorem 1.9.1. Under the assumption that H0 : θ = 0 is true, F (0) = 1/2,

F(j)(0) =k!

(j − 1)!(k − j)!

∫ 1/2

0

uj−1(1 − u)k−jdu

Page 65: Robust Nonparametric Statistical Methods

1.9. RANKED SET SAMPLING 55

Table 1.9.1: Values of F(j)(0), j = 1, . . . , k and δ2 = 1 − (4/k)∑

(F(j)(0) − 1/2)2.

k: 2 3 4 5 6 7 8 9 101 .750 .875 .938 .969 .984 .992 .996 .998 .9992 .250 .500 .688 .813 .891 .938 .965 .981 .9893 .125 .313 .500 .656 .773 .856 .910 .9454 .063 .188 .344 .500 .637 .746 .8285 .031 .109 .227 .363 .500 .6236 .016 .063 .145 .254 .3777 .008 .035 .090 .1728 .004 .020 .0559 .002 .011

10 .001δ2 .750 .625 .547 .490 .451 .416 .393 .371 .352

andES+

RSS = nk/2, and VarS+RSS = 1/4 − k−1

∑(F(j)(0) − 1/2)2 .

Proof. Use the fact that k−1∑F(j)(0) = F (0) = 1/2, and the expectation formula

follows at once. Note that

F(j)(0) =k!

(j − 1)!(k − j)!

∫ 0

−∞F (t)j−1(1 − F (t))k−jf(t)dt ,

and then make the change of variable u = F (t).The variance of S+

RSS does not depend on H , as expected; however, its computationrequires the evaluation of the incomplete beta integral. Table 1.9.1 provides the valuesof F(j)(0), under H0 : θ = 0. The bottom line of the table provides the values of δ2 =1− (4/k)

∑(F(j)(0)− 1/2)2, an important parameter in assessing the gain of RSS over SRS.

We will compare the SRS sign statistic S+SRS based on a sample of nk to the RSS sign

statistic S+RSS . Note that the variance of S+

SRS is nk/4. Then the ratio of variances isVarS+

RSS/VarS+SRS = δ2 = 1− (4/k)

∑(F(j)(0)− 1/2)2. The reduction in variance is given in

the last row of Table 1.9.1 and can be quite large.We next show that the parameter δ is an integral part of the efficacy of the RSS L1

methods. It is straight forward using the methods of Section 1.5 and Example 1.5.2 toshow that the RSS L1 estimating function is Pitman regular. To compute the efficacy wefirst note that

SRSS = (nk)−1k∑

j=1

n∑

i=1

sgn(X(j)i) = (nk)−1[2S+RSS − nk] .

We then have at once that

(nk)−1/2SRSSD0→ Z ∼ n(0, δ2) , (1.9.5)

Page 66: Robust Nonparametric Statistical Methods

56 CHAPTER 1. ONE SAMPLE PROBLEMS

and µ′(0) = 2f(0); see Exercise 1.12.27. See Babu and Koti (1996) for a development of theexact distribution. Hence, the efficacy of the RSS L1 methods is given by

cRSS =2f(0)

δ=

2f(0)

1 − (4/k)∑k

j=1(F(j)(0) − 1/2)21/2.

We now summarize the inference methods and their efficiency in the following:

1. The test. Reject H0 : θ = 0 in favor of HA : θ > 0 at significance level α ifS+SRS > (nk/2) − zαδ(nk/4)1/2 where, as usual, 1 − Φ(zα) = α.

2. The estimate. (nk)1/2medX(j)i − θ D→ Z ∼ n(0, δ2/4f 2(0)).

3. The confidence interval. Let X∗(1), . . . , X

∗(nk) be the ordered values of X(j)i, j =

1, . . . , k and i = 1, . . . , n. Then [X∗(m+1), X

∗(nk−m)] is a (1 − α)100% confidence in-

terval for θ where P (S+SRS ≤ m) = α/2. Using the normal approximation we have

m.= (nk/2) − zα/2δ(nk/4)1/2.

4. Efficiency. The efficiency of the RSS methods with respect to the SRS methods isgiven by e(RSS, SRS) = c2RSS/c

2SRS = δ−2. Hence, the reciprocal of the last line of

Table 1.9.1 provides the efficiency values and they can be quite substantial. Recallfrom the discussion following Definition 1.5.5 that efficiency can be interpreted asthe ratio of sample sizes needed to achieve the same approximate variances, the sameapproximate local power, and the same confidence interval length. Hence, we write(nk)RSS

.= δ2(nk)SRS. This is really the point of the RSS design. Returning to the

example of estimating the volume of wood in a forest, if we let k = 5, then from Table1.9.1, we would need to destroy and measure only about one half as many trees usingthe RSS method rather than the SRS method.

As a final note, we mention the problem of assessing the effect of imperfect ranking. Supposethat the expert makes a mistake when asked to identify the jth ordered value in a set of kobservations. As expected, there is less gain from using the RSS method. The interestingpoint is that if the expert simply identifies the supposed jth ordered value by random guessthen δ2 = 1 and the two sign tests have the same information; see Hettmansperger (1995)for more detail.

1.10 Interpolated Confidence Intervals for the L1 In-

ference

When we construct L1 confidence intervals, we are limited in our choice of confidence coeffi-cients because of the discreteness of the binomial distribution. The effect does not wear off

Page 67: Robust Nonparametric Statistical Methods

1.10. INTERPOLATED CONFIDENCE INTERVALS FOR THE L1 INFERENCE 57

very quickly as the sample size increases. For example with a sample of size 50, we can haveeither a 93.5% or a 96.7% confidence interval, and that is as close as we can come to 95%.In the following discussion we provide a method to interpolate between confidence intervals.The method is nonlinear and seems to be essentially distribution-free. We will begin bypresenting and illustrating the method and then derive its properties.

Suppose γ is the desired confidence coefficient. Further, suppose the following intervalsare available from the binomial table: interval (x(k), x(n−k+1)) with confidence coefficient γkand interval (x(k+1), x(n−k)) with confidence coefficient γk+1 where γk+1 ≤ γ ≤ γk. Then the

interpolated interval is [θL, θU ],

θL = (1 − λ)x(k) + λx(k+1) and θU = (1 − λ)x(n−k+1) + λx(n−k) (1.10.1)

where

λ =(n− k)I

k + (n− 2k)Iand I =

γk − γ

γk − γk+1. (1.10.2)

We call I the interpolation factor and note that if we were using linear interpolation thenλ = I. Hence, we see that the interpolation is distinctly nonlinear.

As a simple example we take n = 10 and ask for a 95% confidence interval. For k = 2we find γk = .9786 and γk+1 = .8907. Then I = .325 and λ = .685. Hence, θL = .342x(2) +

.658x(3) and θU = .342x(9) + .658x(8) . Note that linear interpolation is almost the reverse ofthe recommended mixtures, namely λ = I = .325 and this can make a substantial differencein small samples.

The method is based on the following theorem. This theorem highlights the nonlinearrelationship between the interpolation factor and λ. After proving the theorem we will needto develop an approximate solution and then show that it works in practice.

Theorem 1.10.1. The interpolation factor I is given by

I =γk − γ

γk − γk+1= 1 − (n− k)2n

∫ ∞

0

F k

( −λ1 − λ

y

)(1 − F (y))n−k−1f(y)dy

Proof. Without loss of generality we will assume that θ is 0. Then we can write:

γk = P0(xk ≤ 0 ≤ xn−k+1) = P0(k − 1 < S+1 (0) < n− k − 1)

and

γk+1 = P0(xk+1 ≤ 0 ≤ xn−k) = P0(k < S+1 (0) < n− k) .

Taking the difference, we have, using(nk

)to denote the binomial coefficient,

γk − γk+1 = P0(S+1 (0) = k) + P0(S

+1 (0) = n− k) =

(n

k

)(1/2)n−1 . (1.10.3)

Page 68: Robust Nonparametric Statistical Methods

58 CHAPTER 1. ONE SAMPLE PROBLEMS

We now consider the lower tail probability associated with the confidence interval. Firstconsider

P0(Xk+1 > 0) =1 − γk+1

2=

∫ ∞

0

n!

k!(n− k − 1)!F k(t)(1 − F (t))n−k−1dF (t)(1.10.4)

= P0(S+1 (0) ≥ n− k) = P0(S

+1 (0) ≤ k) .

We next consider the lower end of the interpolated interval

1 − γ

2= P0((1 − γ)Xk + λXk+1 > 0)

=

∫ ∞

0

∫ y

−λ1−λ

y

n!

(k − 1)!(n− k − 1)!F k−1(x)(1 − F (y))n−k−1f(x)f(y)dxdy

=

∫ ∞

0

n!

(k − 1)!(n− k − 1)!

1

k

[F k(y) − F k

( −λy1 − λ

)](1 − F (y))n−k−1f(y)dy

=1 − γk+1

2−∫ ∞

0

n!

k!(n− k − 1)!F k

( −λy1 − λ

)(1 − F (y))n−k−1f(y)dy (1.10.5)

Use ( 1.10.4) in the last line above. Now with ( 1.10.3), substitute into the formula for theinterpolation factor and the result follows.

Clearly, not only is the relationship between I and λ nonlinear but it also depends on theunderlying distribution F . Hence, the interpolated interval is not distribution free. Thereis one interesting case in which we have a distribution free interval given in the followingcorollary.

Corollary 1.10.1. Suppose F is the cdf of a symmetric distribution. Then I(1/2) = k/n,where we write I(λ) to denote the dependence of the interpolation factor on λ.

This shows that when we sample from a symmetric distribution, the interval that lieshalf between the available intervals does not depend on the underlying distribution. Otherinterpolated intervals are not distribution free. Our next theorem shows how to approximatethe solution and the solution is essentially distribution free. We show by example that theapproximate solution works in many cases.

Theorem 1.10.2.I(λ)

.= λk/(λ(2k − n) + n− k)

Proof. We consider the integral∫ ∞

0

F k

( −λ1 − λ

y

)(1 − F (y))n−k−1f(y)dy

The integrand decreases rapidly for moderate powers; hence, we expand the integrand aroundy = 0. First take logarithms then

k logF

( −λ1 − λ

y

)= k logF (0) − λ

1 − λkf(0)

F (0)y + o(y)

Page 69: Robust Nonparametric Statistical Methods

1.10. INTERPOLATED CONFIDENCE INTERVALS FOR THE L1 INFERENCE 59

Table 1.10.1: Confidence Coefficients for Interpolated Confidence Intervals in Example1.10.1. DE(Approx)=Double Exponential and the Approximation in Theorem 1.10.2,U=Uniform, N=Normal, C=Cauchy, Linear=Linear Interpolation

λ DE(Approx) U N C Linear0.1 0.976 0.977 0.976 0.976 0.9700.2 0.973 0.974 0.974 0.974 0.9610.3 0.970 0.971 0.971 0.970 0.9520.4 0.966 0.967 0.966 0.966 0.9430.5 0.961 0.961 0.961 0.961 0.9350.6 0.955 0.954 0.954 0.954 0.9260.7 0.946 0.944 0.944 0.946 0.9170.8 0.935 0.930 0.931 0.934 0.9080.9 0.918 0.912 0.914 0.918 0.899

and

(n− k − 1) log(1 − F (y)) = (n− k − 1) log(1 − F (0)) − (n− k − 1)f(0)

1 − F (0)y + o(y) .

Substitute r = λk/(1 − λ) and F (0) = 1 − F (0) = 1/2 into the above equations, and addthe two equations together. Add and subtract r log(1/2), and group terms so the right sideof the second equation appears on the right side along with k log(1/2) − r log(1/2). Hence,we have

k logF

( −λ1 − λ

y

)+ (n− k − 1) log(1 − F (y)) = k log(1/2) − r log(1/2)

+(n− r − k − 1) log(1 − F (y)) + o(y) ,

and, hence,

∫ ∞

0

F k

( −λ1 − λ

y

)(1 − F (y))n−k−1f(y)dy

.=

∫ ∞

0

2−(k−r)(1 − F (y))n+r−k−1f(y)dy

=1

2n(n + r − k). (1.10.6)

Substitute this approximation into the formula for I(λ), use r = λk/(1 − λ) and the resultfollows.

Note that the approximation agrees with Corollary 1.10.1. In addition Exercise 1.12.28shows that the approximation formula is exact for the double exponential (Laplace) dis-tribution. In Table 1.10.1 we show how well the approximation works for several otherdistributions. The exact results were obtained by numerical integration of the integral inTheorem 1.10.1. Similar close results were found for asymmetric examples. For furtherreading see Hettmansperger and Sheather (1986) and Nyblom (1992).

Page 70: Robust Nonparametric Statistical Methods

60 CHAPTER 1. ONE SAMPLE PROBLEMS

Example 1.10.1. Cushney-Peebles Example 1.4.1, continued.

We now return to this example using it to illustrate the sign test and the L1 interpolatedconfidence interval. We use the RBR function interpci for the computations. We take asour location model: X1, . . . , X10 iid from H(x) = F (x − θ), F and θ both unknown, alongwith the L1 norm. We have already seen that the estimate of θ is the sample median equalto 1.3. Besides obtaining an interpolated 95% confidence interval, we test H0 : θ = 0 versusHA : θ 6= 0. Assuming that the sample is in the vector x, the output for a test and a 95%interpolated confidence interval is:

> tm=interpci(.05,x)

Estimation of Median

Sample Median is 1.3

Confidence Interval ( 1 , 1.8 ) 89.0625 %

Confidence Interval ( 0.9315 , 2.0054 ) 95 % Interpolted

Confidence Interval ( 0.8 , 2.4 ) 97.8516 %

Results for the Sign Test

Test of theta = 0 versus theta not equal to 0

Test stat. S is 9 p-vlaue 0.00390625

Note the p-value of the test is .0039 and we would easily reject the null hypothesis at anyreasonable level of significance. The interpolated 95% confidence interval for θ shows thereasonable set of values of θ to be between .9315 and 2.0054, given the level of confidence.

1.11 Two Sample Analysis

We now propose a simple way to extend our one sample methods to the comparison of twosamples. Suppose X1, . . . , Xm are iid F (x − θx) and Y1, . . . , Yn are iid F (y − θy) and thetwo samples are independent. Let ∆ = θy − θx and we wish to test the null hypothesisH0 : ∆ = 0 versus the alternative hypothesis Ha : ∆ 6= 0. Without loss of generality we canconsider θx = 0 so that the X sample is from a distribution with cdf F (x) and the Y sampleis from a distribution with cdf F (y − ∆).

The hypothesis testing rule that we propose is:

1. Construct L1 confidence intervals [XL, XU ] and [YL, YU ].

2. Reject H0 if the intervals are disjoint.

Page 71: Robust Nonparametric Statistical Methods

1.11. TWO SAMPLE ANALYSIS 61

If we consider the confidence interval as a set of reasonable values for the parameter, giventhe confidence coefficient, then we reject the null hypothesis when the respective reasonablevalues are disjoint. We must determine the significance level for the test. In particular, forgiven γx and γy, what is the value of αc, the significance level for the comparison? Perhapsmore pertinent: Given αc, what values should we choose for γx and γy? Below we show thatfor a broad range of sample sizes,

Comparing two 84% CI’s yields a 5% test of H0 : ∆ = 0 versus HA : ∆ 6= 0, (1.11.1)

where CI denotes confidence interval. In the following theorem we provide the relationshipbetween αc and the pair γx, γy. Define zx by γx = 2Φ(zx) − 1 and likewise zy by γy =2Φ(zy) − 1.

Theorem 1.11.1. Suppose m,n → ∞ so that m/N → λ, 0 < λ < 1, N = m + n. Thenunder the null hypothesis H0 : ∆ = 0,

αc = P (XL > YU) + P (YL > XU) → 2Φ[−(1 − λ)1/2zx − λ1/2zy]

Proof. We will consider αc/2 = P (XL > YU). From ( 1.5.22) we have

XL.=

Sx(0)

m2f(0)− zxm1/22f(0)

and YU.=

Sy(0)

m2f(0)+

zyn1/22f(0)

.

Since m/N → λ

N1/2XLD→ λ−1/2Z1, Z1 ∼ n(−zx/2f(0), 1/4f 2(0)) ,

and

N1/2YUD→ (1 − λ)−1/2Z2, Z2 ∼ n(−zy/2f(0), 1/4f 2(0)) .

Now αc/2 = P (XL > YU) = P (N1/2(YU −XL) < 0) and XL, YU are independent, hence

N1/2(YU −XL)D→ λ−1/2Z1 − (1 − λ)−1/2Z ,

and

λ−1/2Z1 − (1 − λ)−1/2Z2 ∼ n

(1

2f(0)

zx

(1 − λ)1/2+

zyλ1/2

,

1

4f 2(0)

1

λ+

1

1 − λ

).

It then follows that

P (N1/2(YU −XL) < 0) → Φ

(−

zx(1 − λ)1/2

+zyλ1/2

/

1

λ(1 − λ)

1/2).

Which, when simplified, yields the result in the statement of the theorem.

Page 72: Robust Nonparametric Statistical Methods

62 CHAPTER 1. ONE SAMPLE PROBLEMS

Table 1.11.1: Confidence Coefficients for 5% Comparison.

λ = m/N .500 .550 .600 .650 .750m/n 1.00 1.22 1.50 1.86 3.00

zx = zy 1.39 1.39 1.39 1.40 1.43γx = γy .84 .84 .84 .85 .86

To illustrate, we take equal sample sizes so that λ = 1/2 and we take zx = zy = 2. Thenwe have two 95% confidence intervals and we will reject the null hypothesis H0 : ∆ = 0 if thetwo intervals are disjoint. The above theorem says that the significance level is approximatelyequal to αc = 2Φ(−2.83) = .0046. This is a very small level and it will be difficult to rejectthe null hypothesis. We might prefer a significance level of say αc = .05. We then must findzx and zy so that .05 = 2Φ(−(.5)1/2(zx + zy)). Note that now we have an infinite number ofsolutions. If we impose the reasonable condition that the two confidence coefficients are thesame then we require that zx = zy = z. Then we have the equation .025 = Φ(−(2)1/2z) andhence −2 = −(2)1/2z. So z = 21/2 = 1.39 and the confidence coefficient for the two intervalsis γ = γx = γy = 2Φ(1.41) − 1 = .84. Hence, if we have equal sample sizes and we use two84% confidence intervals then we have a 5% two sided comparison of the two samples.

If we set αc = .10, this would correspond to a 5% one sided test. This means that wecompare the two confidence intervals in the direction specified by the alternative hypothesis.For example, if we specify ∆ = θy − θx > 0, then we would reject the null hypothesis ifthe X-interval is completely below the Y -interval. To determine which confidence intervalswe again assume that the two intervals will have the same confidence coefficient. Then wemust find z such that .05 = Φ(−(2)1/2z) and this leads to −1.645 = −(2)1/2z and z = 1.16.Hence, the confidence coefficient for the two intervals is γ = γx = γy = 2Φ(1.16) − 1 = .75.Hence, for a one-sided 5% test or a 10% two-sided test, when you have equal sample sizes,use two 75% confidence intervals.

We must now consider what to do if the sample sizes are not equal. Let zc be determinedby αc/2 = Φ(−zc), then, again if we use the same confidence coefficient for the two intervals,z = zx = zy = zc/(λ

1/2 + (1 − λ)1/2). When m = n so that λ = 1 − λ = .5 we hadz = zc/2

1/2 = .707zc and so z = 1.39 when αc = .05. We now show by example that whenαc = .05, z is not sensitive to the value of λ. Table 1.11.1 gives the relevant information.Hence, if we use 84% confidence intervals, then the significance level will be roughly 5% forthe comparison for a broad range of ratios of sample sizes. Likewise, we would use 75%intervals for a 10% comparison. See Hettmansperger (1984b) for additional discussion.

Next suppose that we want a confidence interval for ∆ = θy− θx. In the following simpletheorem we show that the proposed test based on comparing two confidence intervals isequivalent to checking to see if zero is contained in a different confidence interval. This newinterval will be a confidence interval for ∆.

Theorem 1.11.2. [XL, XU ] and [YL, YU ] are disjoint if and only if 0 is not contained in

Page 73: Robust Nonparametric Statistical Methods

1.11. TWO SAMPLE ANALYSIS 63

[YL −XU , YU −XL].

If we specify our significance level to be αc then we have immediately that

1 − αc = P∆(YL −XU ≤ ∆ ≤ YU −XL)

and [YL −XU , YU −XL] is a γc = 1 − αc confidence interval for ∆.This theorem simply points out that the hypothesis test can be equivalently based on

a single confidence interval. Hence, two 84% intervals produce a roughly 95% confidenceinterval for ∆. The confidence interval is easy to construct since we need only find the leastand greatest differences of the end points between the respective Y and X intervals.

Recall that one way to measure the efficiency of a confidence interval is to find its asymp-totic length. This is directly related to the Pitman efficacy of the procedure; see Section1.5.5. This would seem to be the most natural way to study the efficiency of the test basedon confidence intervals. In the following theorem we determine the asymptotic length of theinterval for ∆.

Theorem 1.11.3. Suppose m,n→ ∞ in such a way that m/N → λ, 0 < λ < 1, N = m+n.Further suppose that γc = 2Φ(zc) − 1. Let Λ be the length of [YL −XU , YU −XL]. Then

N1/2Λ

2zc→ 1

[λ(1 − λ)]1/2]2f(0)

Proof. First note that Λ = Λx+Λy, the sum of the two lengths of the X and Y intervals,respectively. Further,

N1/2Λ =N1/2

n1/2n1/2Λy+ =

N1/2

m1/2m1/2Λx .

But by Theorem 1.5.9 this converges in probability to zx/λ1/2 + zy/(1 − λ)1/2. Now note

that (1 − λ)1/2zx + λ1/2zy = zc and the result follows.The interesting point about this theorem is that the efficiency of the interval does not

depend on how zx and zy are chosen so long as they satisfy (1 − λ)1/2zx + λ1/2zy = zc. Inaddition, this interval has inherited the efficacy of the L1 interval in the one sample locationmodel. We will discuss the two-sample location model in detail in the next chapter. InHettmansperger (1984b) other choices for zx and zy are discussed; for example, we couldchoose zx and zy so that the asymptotic standardized lengths are equal. The correspondingconfidence coefficients for this choice are more sensitive to unequal sample sizes than themethod proposed here.

Example 1.11.1. Hendy and Charles Coin Data

Hendy and Charles (1970) study the change in silver content in Byzantine coins. Duringthe reign of Manuel I (1143-1180) there were several mintings. We consider the researchhypothesis that the silver content changed from the first to the fourth coinage. The dataconsists in 9 coins identified from the first coinage and 7 coins from the fourth. We suppose

Page 74: Robust Nonparametric Statistical Methods

64 CHAPTER 1. ONE SAMPLE PROBLEMS

Table 1.11.2: Silver Percentage in Two Mintings

First 5.9 6.8 6.4 7.0 6.6 7.7 7.2 6.9 6.2Fourth 5.3 5.6 5.5 5.1 6.2 5.8 5.8

that they are realizations of random samples of coins from the two populations. The percent-age of silver in each coin is given in Table 1.11. Let ∆ = θ1 − θ4 where the 1 and 4 indicatethe coinage. To test the null hypothesis H0 : ∆ = 0 versus HA : ∆ 6= 0 at α = .05, weconstruct two 84% L1 confidence intervals and reject the null hypothesis if they are disjoint.The confidence intervals can be computed by using the RBR function onesampsgn with thevalue alph=.16. Results pertinent to the confidence intervals are:

> onesampsgn(First,alpha=.16)

Estimate 6.8 SE is 0.2135123

84 % Confidence Interval is ( 6.4 , 7 )

Estimate of the scale parameter tau 0.6405368

> onesampsgn(Fourth,alpha=.16)

Estimate 5.6 SE is 0.1779269

84 % Confidence Interval is ( 5.3 , 5.8 )

Estimate of the scale parameter tau 0.4707503

Clearly, the 84% confidence intervals are disjoint, hence, we reject the null hypothesisat a 5% significance level and claim that the emperor apparently held back a little on thefourth coinage. A 95% confidence interval for ∆ = θ1 − θ4 is found by taking the differencesin the ends of the confidence intervals: (6.4−5.8, 7.0−5.3) = (0.6, 1.7). Hence, this analysissuggests that the difference in median percentages is someplace between .6% and 1.7%, witha point estimate of 6.8 − 5.6 = 1.2%.

Figure 1.11.1 provides a comparison boxplot of the data for the first and fourth coinages.Marking the 84% confidence intervals on the plot, we can see the relatively large gap betweenthe confidence intervals, i.e., the sharp reduction in silver content from the first to fourthcoinage. In addition, the box for the fourth coinage is a bit more narrow than the box forthe first coinage indicating that there may be less variation (as measured by the interquartilerange) in the fourth coinage. There are no apparent outliers as indicated by the whiskers onthe boxplot. Larson and Stroup (1976) analyze this example with a two sample t-test.

Page 75: Robust Nonparametric Statistical Methods

1.12. EXERCISES 65

Figure 1.11.1: Comparison Boxplots of the Hendy and Charles Coin Data

First Fourth

5.0

5.5

6.0

6.5

7.0

7.5

Per

cent

age

of s

ilver

1.12 Exercises

1.12.1. Show that if ‖ · ‖ is a norm, then there always exists a value of θ which minimizes‖x − θ1‖ for any x1, . . . , xn.

1.12.2. Figure 1.12.1 displays the graph of Z(θ) versus θ for n = 20 data points (count thesteps) where

Z(θ) =1√n

20∑

i=1

sign(Xi − θ),

i.e., the standardized sign (median) process.

(a) From the plot, what are the minimum and maximum values of the sample?

(b) From the plot, what is the associated point estimate of θ?

Page 76: Robust Nonparametric Statistical Methods

66 CHAPTER 1. ONE SAMPLE PROBLEMS

(c) From the plot, determine a 95% confidence interval for θ, (approximate, but show onthe graph).

(d) From the plot, determine the value of the test statistic and the associated p-value fortesting H0 : θ = 0 versus HA : θ > 0.

−1 0 1 2 3

−5

05

theta

Z(t

heta

)

Plot of Z(theta) versus theta

Figure 1.12.1: The Graph of Z(θ) versus θ

1.12.3. Show D(θ), ( 1.3.3), is convex and continuous as a function of θ. Further, argue thatD(θ) is differentiable almost everywhere. Let S(θ) be a function such that S(θ) = −D′(θ)where the derivative exists. Then show that S(θ) is a nonincreasing function.

1.12.4. Consider the L2 norm. Show that θ = x and that S2(0) =√nt/

√n− 1 + t2 where

t =√nx/s, and s is the sample standard deviation. Further, show S2(0) is an increasing

function of t so the test based on t is equivalent to S2(0).

1.12.5. Discuss the consistency of the t-test. Is the t-test resolving?

1.12.6. Discuss the Pitman regularity in the L2 case.

1.12.7. The following R function computes a bootstrap distribution of the sample median.

bootmed = function(x,nb)

# Sample is in x and nb is the number of bootstraps

n = length(x)

bootmed = rep(0,nb)

for(i in 1:nb)

Page 77: Robust Nonparametric Statistical Methods

1.12. EXERCISES 67

y = sample(x,size=n,replace=T)

bootmed[i] = median(y)

bootmed

(a). Use this code to obtain 1000 bootstraped medians for the Shoshoni data of Example1.4.2. Determine the standard error of this bootstrap sample of medians and compareit with the estimate based on the length of the confidence interval for the Shoshonidata.

(b). Now find the mean and variance of the Shoshoni data. Use these estimates to performa parametric bootstrap of the sample median, as discussed in Example ??. Determinethe standard error of this parametric bootstrap sample of medians and compare it withestimates in Part (a).

1.12.8. Using languages such as Minitab or R, obtain a plot of the test sensitivity curvesbased on the signed-rank Wilcoxon statistic for the Cushney-Peebles Data, Example 1.4.1,similar to the sensitivity curves based on the t test and the sign test as shown in Figure1.4.1.

1.12.9. In the proof of Theorem 1.5.6, show that ( 1.5.19) and ( 1.5.20) imply that Un(b)converges to −µ′(0) in probability, pointwise in b, i.e., Un(b) = −µ′(0) + op(1).

1.12.10. Suppose we are sampling fron the distribution with pdf

f(x) =3

4

1

Γ(2/3)exp−|x|3/2, −∞ < x <∞

and we are considering whether to use the Wilcoxon or sign test. Using the efficacies of thesetests, determine which test to use.

1.12.11. For which of the following distributions is the signed-rank Wilcoxon more powerful?Why?

f1(x) =

32x2 −1 < x < 1

0 elsewhere.

f2(x) =

32(1 − x2) −1 < x < 1

0 elsewhere.

1.12.12. Show that ( 1.5.23) is scale invariant. Hence, the efficiency does not change if Xis multiplied by a positive constant. Let

f(x, δ) = δ exp(−|x|δ)/2Γ(δ−1), −∞ < x <∞, 1 ≤ δ ≤ 2.

When δ = 2, f is a normal distribution and when δ = 1, f is a Laplace distribution. Computeand plot as a function of δ the efficiency ( 1.5.23).

Page 78: Robust Nonparametric Statistical Methods

68 CHAPTER 1. ONE SAMPLE PROBLEMS

1.12.13. Show that the finite sample breakdown of the Hodges-Lehmann estimate ( 1.3.25) isǫ∗n = m/n, where m is the solution to the quadratic inequality 2m2−(4n+2)m∗+n2+n ≤ 0.Table ǫ∗n as a function of n and show that ǫ∗n converges to 1 − 1√

2

.= .29.

1.12.14. Derive ( 1.6.9).

1.12.15. Prove Lemma 1.7.2.

1.12.16. Prove Theorem 1.7.1. In particular, check the conditions of the Lindeberg CentralLimit Theorem to verify ( 1.7.7).

1.12.17. Prove Theorem 1.7.2.

1.12.18. For the general signed-rank norm given by (1.8.1), show that the function Tϕ+(θ),(1.8.2) is a decreasing step function which steps down only at the Walsh averages. Hint:First show that the ranks of |Xi − θ| and |Xj − θ| switch for θ1 < θ2 if and only if

θ1 <Xi +Xj

2< θ2,

(replace ranks by signs if i = j).

1.12.19. Use the results of the last exercise to write in some detail the tracing algorithm,described after expression (1.8.5), for obtaining the location estimator θϕ+ and its associatedstandard error.

1.12.20. Suppose h(x) has finite Fisher information:

I(h) =

∫(h′(x))2

h(x)dx <∞ .

Prove that h(x) is bounded and that∫h2(x)dx <∞.

Hint: Write

h(x) =

∫ x

−∞h′(t)dt ≤

∫ x

−∞|h′(t)|dt .

1.12.21. Repeat Exercise 1.12.12 for ( 1.7.13).

1.12.22. Show that ( 1.8.1) is a norm.

1.12.23. Show that∫φ+2

h (u)du, φ+2

h (u) given by ( 1.8.18), is equal to Fisher information,

∫(h′(x))2

h(x)dx .

1.12.24. Find ( 1.8.18) when h is normal, logistic, Laplace (double exponential) density,respectively.

Page 79: Robust Nonparametric Statistical Methods

1.12. EXERCISES 69

1.12.25. Verify that the influence function of the normal score estimate is unbounded whenthe underlying distribution is normal.

1.12.26. Verify ( 1.9.4).

1.12.27. Derive the limit distribution in expression ( 1.9.5).

1.12.28. Show that approximation ( 1.10.6) is exact for the double exponential (Laplace)distribution.

1.12.29. Extend the simulation study of Example 1.8.2 to the other contaminated normalsituations found in Table 1.7.1. Comment on the results. Compare the empirical results forthe Wilcoxon withe asymptotic results found in the table.

The following R code performs the contaminated normal simulation discussed in Example1.8.2. (Semicolons are end of line indicators. As indicated in the call to onesampr, the normalscores estimator is computed by using the gradient R function spnsc and score functionphinscp.)

nsims = 10000; n = 30; itype = 1; eps = .01; sigc = 3;

collls = rep(0,nsims): collwil = rep(0,nsims); collnsc = rep(0,nsims);

for(i in 1:nsims)

if(itype == 0)x = rnorm(n)

if(itype == 1)x = rcn(n,eps,sigc)

collls[i] = mean(x)

collnsc[i] = onesampr(x,score=phinscp,grad=spnsc,maktable=F)$est

collwil[i] = onesampwil(x,maktable=F)$est

msels = mean(collls^2); msensc = mean(collnsc^2): msewil = mean(collwil^2)

arensc = msels/msensc; arewil = msels/msewil; arenscwil = msewil/msensc

1.12.30. Consider the one sample location problem. Let T (θ) be a nonincreasing process.Consider the hypotheses:

H0 : θ = 0 versus HA : θ > 0.

Assume that T (θ) is standardized so that the decision rule of the (asymptotic) level α testis given by

Reject H0 : θ = 0 in favor of HA : θ > 0, if T (0) > zα.

Further assume that for all |θ| < B, B > 0,

T (θ/√n) = T (0) − 1.2θ + op(1).

(a) For θ0 > 0, determine the asymptotic power γ(θ0), i.e., determine

γ(θ0) = Pθ0 [T (0) > zα].

(b) Evaluate γ(θ0) for n = 36 and θ0 = 0.5.

Page 80: Robust Nonparametric Statistical Methods

70 CHAPTER 1. ONE SAMPLE PROBLEMS

1.12.31. Suppose X1, . . . , X2n are independent observations such that Xi has cdf F (x−θi)0.For testing H0 : θ1 = . . . = θ2n versus HA : θ1 ≤ . . . ≤ θ2n with at least one strict inequality,consider the test statistic,

S =

n∑

i=1

I(Xn+i > Xi) .

(a.) Discuss the small sample and asymptotic distribution of S under H0.

(b.) Determine the alternative distribution of S under the alternative θn+i−θi = ∆, ∆ > 0,for all i = 1, . . . , n. Show that the test is consistent for this alternative. This test iscalled Mann’s (1945) test for trend.

1.12.32. The data in Table 1.12.1 constitutes a sample of size 59 of information on profes-sional baseball players. The data were recorded from the back of a deck of baseball cards,(complements of Carrie McKean).

(a). Obtain dotplots of the weights and heights of the baseball players.

(b). Assume the weight of a typical adult male is 175 pounds. Use the Wilcoxon test statisticto test the hypotheses

H0 : θW = 175 versus HA : θW 6= 175 ,

where θW is the median weight of a professional baseball player. Compute the p-value.Next obtain a 95% confidence interval for θW using the confidence interval procedurebased on the Wilcoxon. Use the dotplot in Part (a) to comment on the assumption ofsymmetry.

(c). Let θH be the median height of a baseball player. Repeat the analysis of Part (b) forthe hypotheses

H0 : θH = 70 versus HA : θH 6= 70 .

1.12.33. The signed-rank Wilcoxon scores are optimal for the logistic distribution while thesign scores are optimal for the Laplace distribution. A family of score functions which areoptimal for distributions with logistic “middles” and Laplace “tails” are the bent scores.These are continuous score functions ϕ+(u) with a linear (positive slope and intercept 0)piece for 0 < u < b and a constant piece for b < u < 1, for a specified value of b; see Policelloand Hettmansperger (1976). These are called signed-rank Winsorized Wilcoxon scores.

(a) Obtain the standardized scores such that∫

[ϕ+(u)]2 du = 1.

(b) For these scores with b = 0.75, obtain the corresponding estimate of location and anestimate of its standard error for the following data set:

7.94 8.13 8.11 7.96 7.83 7.04 7.91 7.827.42 8.06 8.51 7.88 8.96 7.58 8.14 8.06

Page 81: Robust Nonparametric Statistical Methods

1.12. EXERCISES 71

Table 1.12.1: Data for professional baseball players, Exercise 1.12.32. The variables are:(H) Height in inches; (W) Weight in pounds; (B) Side of plate from which the player bats,(1-Right handed, 2-Left handed, 3-Switch-hitter); (A) Throwing arm (0-Right, 1-Left); (P)Pitch-hit indicator, (0-Pitcher, 1-Hitter); and (Ave) Average (ERA if pitcher, Batting aver-age if hitter).

H W B A P Ave H W B A P Ave74 218 1 1 0 3.330 79 232 2 1 0 3.10075 185 1 0 1 0.286 72 190 1 0 1 0.23877 219 2 1 0 3.040 75 200 2 0 0 3.18073 185 1 0 1 0.271 70 175 2 0 1 0.27969 160 3 0 1 0.242 75 200 1 0 1 0.27473 222 1 0 0 3.920 78 220 1 0 0 3.88078 225 1 0 0 3.460 73 195 1 0 0 4.57076 205 1 0 0 3.420 75 205 2 1 1 0.28477 230 2 0 1 0.303 74 185 1 0 1 0.28678 225 1 0 0 3.460 71 185 3 0 1 0.21876 190 1 0 0 3.750 73 210 1 0 1 0.28272 180 3 0 1 0.236 76 210 2 1 0 3.28073 185 1 0 1 0.245 73 195 1 0 1 0.24373 200 2 1 0 4.800 75 205 1 0 0 3.70074 195 1 0 1 0.276 73 175 1 1 0 4.65075 195 1 0 0 3.660 73 190 2 1 1 0.23872 185 2 1 1 0.300 74 185 3 1 0 4.07075 190 1 0 1 0.239 72 190 3 0 1 0.25476 200 1 0 0 3.380 73 210 1 0 0 3.29076 180 2 1 0 3.290 71 195 1 0 1 0.24472 175 2 1 1 0.290 71 166 1 0 1 0.27476 195 2 1 0 4.990 71 185 1 1 0 3.73068 175 2 0 1 0.283 73 160 1 0 0 4.76073 185 1 0 1 0.271 74 170 2 1 1 0.27169 160 1 0 1 0.225 76 185 1 0 0 2.84076 211 3 0 1 0.282 71 155 3 0 1 0.25177 190 3 0 1 0.212 76 190 1 0 0 3.28074 195 1 0 1 0.262 71 160 3 0 1 0.27075 200 1 0 0 3.940 70 155 3 0 1 0.26173 207 3 0 1 0.251

Page 82: Robust Nonparametric Statistical Methods

72 CHAPTER 1. ONE SAMPLE PROBLEMS

The software RBR computes this estimate with the call

onesampr(x,score=phipb,grad=sphipb,param=c(.75)).

Page 83: Robust Nonparametric Statistical Methods

Chapter 2

Two Sample Problems

2.1 Introduction

Let X1, . . . , Xn1 be a random sample with common distribution function F (x) and densityfunction f(x). Let Y1, . . . , Yn2 be another random sample, independent of the first, withcommon distribution function G(x) and density g(x). We will call this the general modelthroughout this chapter. A natural null hypothesis is H0 : F (x) = G(x). In this chapterwe will consider rank and sign tests of this hypothesis. A general alternative to H0 isHA : F (x) 6= G(x) for some x. Except for the Section 2.10 on the scale model we willbe generally concerned with the alternative models where one distribution is stochasticallylarger than the other; for example, the alternative that G is stochastically larger than Fwhich can be expressed as HA : G(x) ≤ F (x) with a strict inequality for some x. This familyof alternatives includes the location model, described next, and the Lehmann alternativemodels discussed in Section 2.7, which are used in survival analysis.

As in Chapter 1, the location models will be of primary interest. For these modelsG(x) = F (x − ∆) for some parameter ∆. Thus the parameter ∆ represents a shift inlocation between the two distributions. It can be expressed as ∆ = θY − θX where θY andθX are the medians of the distributions of G and F or equivalently as ∆ = µY − µX where,provided they exist, µY and µX are the means of G and F. In the location problem the nullhypothesis becomes H0 : ∆ = 0. In addition to tests of this hypothesis we will developestimates and confidence intervals for ∆. We will call this the location model throughoutthis chapter and we will show that this is a generalization of the location problem defined inChapter 1.

As in Chapter 1 with the one-sample problems, for the two-sample problems, we offerthe reader computational R functions which do the computation for the rank-based analysesdicussed in this chapter.

73

Page 84: Robust Nonparametric Statistical Methods

74 CHAPTER 2. TWO SAMPLE PROBLEMS

2.2 Geometric Motivation

In this section, we work with the location model described above. As in Chapter 1, we willderive sign and rank-based tests and estimates from a geometric point of view. As we shallshow, their development is analogous to that of least squares procedures in that other normsare used in place of the least squares Euclidean norm. In order to do this we place theproblem into the context of a linear model. This will facilitate our geometric developmentand will also serve as an introduction to Chapter 3, linear models.

Let Z′ = (X1, . . . , Xn1, Y1, . . . , Yn2) denote the vector of all observations; let n = n1 + n2

denote the total sample size; and let

ci =

0 if 1 ≤ i ≤ n1

1 if n1 + 1 ≤ i ≤ n. (2.2.1)

Then we can write the location model as

Zi = ∆ci + ei , 1 ≤ i ≤ n , (2.2.2)

where e1, . . . , en are iid with distribution function F (x). Let C = [ci] denote the n×1 designmatrix and let ΩFULL denote the column space of C. We can express the location model as

Z = C∆ + e , (2.2.3)

where e′ = (e1, . . . , en) is the n× 1 vector of errors. Note that except for random error, theobservations Z would lie in ΩFULL. Thus given a norm, we estimate ∆ so that C∆ minimizesthe distance between Z and the subspace ΩFULL; i.e., C∆ is the vector in ΩFULL closest toZ.

Before turning our attention to ∆, however, we write the problem in terms of the geometrydiscussed in Chapter 1. Consider any location functional T of the distribution of e. Letθ = T (F ). Define the random variable e∗ = e − θ. Then the distribution function of e∗ isF ∗(x) = F (x+θ) and its functional is T (F ∗) = 0. Thus the model, (2.2.3), can be expressedas

Z = 1θ + C∆ + e∗ . (2.2.4)

Note that this is a generalization of the location problem discussed in Chapter 1. Fromthe last paragraph, the distribution function of Xi can be expressed as F (x) = F ∗(x − θ);hence, T (F ) = θ is a location functional of Xi. Further, the distribution function of Yjcan be written as G(x) = F ∗(x − (∆ + θ)). Thus T (G) = ∆ + θ is a location functionalof Yj. Therefore, ∆ is precisely the difference in location functionals between Xi and Yj.Furthermore ∆ does not depend on which location functional is used and will be called theshift parameter.

Let b = (θ,∆)′. Given a norm, we want to choose as our estimate of b a value b such

that [1 C]b minimizes the distance between the vector of observations Z and the columnspace V of the matrix [1 C]. Thus we can use the norms defined in Chapter 1 to estimate b.

Page 85: Robust Nonparametric Statistical Methods

2.2. GEOMETRIC MOTIVATION 75

If, as an example, we select the L1 norm, then our estimate of b minimizes

D(b) =n∑

i=1

|Zi − θ − ci∆| . (2.2.5)

Differentiating D with respect to θ and ∆, respectively, and setting the resulting equationsto 0 we obtain the equations,

n1∑

i=1

sgn (Xi − θ) +

n2∑

j=1

sgn (Yj − θ − ∆).= 0 (2.2.6)

n2∑

j=1

sgn (Yj − θ − ∆).= 0 . (2.2.7)

Subtracting the second equation from the first we get∑n1

i=1 sgn (Xi − θ).= 0; hence,

θ = med Xi. Substituting this into the second equation, we get ∆ = med Yj − θ =

med Yj − med Xi; hence, b = (med Xi,med Yj − θ − med Xi). We will obtaininference based on the L1 norm in Sections 2.6.1 and 2.6.2.

If we select the L2 norm then, as shown in Exercise 2.13.1, the LS-estimate b = (X, Y −X)′. Another norm discussed in Chapter 1 was the weighted L1 norm. In this case b isestimated by minimizing

D(b) =n∑

i=1

R(|Zi − θ − ci∆|)|Zi − θ − ci∆| . (2.2.8)

This estimate cannot be obtained in closed form; however, fast minimization algorithms forsuch problems are discussed later in Chapter 3.

In the initial statement of the problem, though, θ is a nuisance parameter and we arereally interested in ∆, the shift in location between the populations. Hence, we want to definedistance in terms of norms which are invariant to θ. The type of norm that is invariant to θis a pseudo-norm which we define next.

Definition 2.2.1. An operator ‖ · ‖∗ is called a pseudo-norm if it satisfies the followingfour conditions:

‖u + v‖∗ ≤ ‖u‖∗ + ‖v‖∗ for all u,v ∈ Rn

‖αu‖∗ = |α|‖u‖∗ for all α ∈ R,u ∈ Rn

‖u‖∗ ≥ 0 for all u ∈ Rn

‖u‖∗ = 0 if and only if u1 = · · · = un

Note that a regular norm satisfies the first three properties but in lieu of the fourthproperty, the norm of a vector is 0 if and only if the vector is 0. The following inequalities

Page 86: Robust Nonparametric Statistical Methods

76 CHAPTER 2. TWO SAMPLE PROBLEMS

establish the invariance of pseudo-norms to the parameter θ:

‖Z − θ1 −C∆‖∗ ≤ ‖Z −C∆‖∗ + ‖θ1‖∗= ‖Z −C∆‖∗ = ‖Z− θ1 − C∆ + θ1‖∗≤ ‖Z − θ1 −C∆‖∗ .

Hence, ‖Z− θ1 − C∆‖∗ = ‖Z− C∆‖∗.Given a pseudo-norm, denote the associated dispersion function by D∗(∆) = ‖Z−C∆‖∗.

It follows from the above properties of a pseudo-norm that D∗(∆) is a non-negative, contin-uous, and convex function of ∆.

We next develop an inference which includes estimation of ∆ and tests of hypothesesconcerning ∆ for a general pseudo-norm. As an estimate of the shift parameter ∆, wechoose a value ∆ which solves

∆ = ArgminD∗(∆) = Argmin‖Z− C∆‖∗ ; (2.2.9)

i.e., C∆ minimizes the distance between Z and ΩFULL. Another way of defining ∆ is as thestationary point of the gradient of the pseudo-norm. Define the function S∗ by

S∗(∆) = − ‖Z −C∆‖∗ (2.2.10)

where denotes the gradient of ‖Z − C∆‖∗ with respect to ∆. Because D∗(∆) is convex,it follows immediately that

S∗(∆) is nonincreasing in ∆ . (2.2.11)

Hence ∆ is such thatS∗(∆)

.= 0 . (2.2.12)

Given a location functional θ = T (F ), i.e. Model (2.2.4), once ∆ has been estimated we

can base an estimate of θ on the residuals Zi − ∆ci. For example, if we chose the medianas our location functional then we could use the median of the residuals to estimate it. Wewill discuss this in more detail for general linear models in Chapter 3.

Next consider the hypotheses

H0 : ∆ = 0 versus HA : ∆ 6= 0 . (2.2.13)

The closer S∗(0) is to 0 the more plausible is the hypothesis H0. More formally, we definethe gradient test of H0 versus HA by the rejection rule,

Reject H0 in favor of HA if S∗(0) ≤ k or S∗(0) ≥ l ,

where the critical values k and l depend on the null distribution of S∗(0). Typically, the nulldistribution of S∗(0) is symmetric about 0 and k = −l. The reduction in dispersion testis given by

Reject H0 in favor of HA if D∗(0) −D∗(∆) ≥ m ,

Page 87: Robust Nonparametric Statistical Methods

2.2. GEOMETRIC MOTIVATION 77

where the critical value m is determined by the null distribution of the test statistic. Inthis chapter, as in Chapter 1, we will be concerned with the gradient test while in Chapter3 we will use the reduction in dispersion test. A confidence interval for ∆ of confidence(1 − α)100% is the interval ∆ : k < S∗(∆) < l and

1 − α = P∆[k < S∗(∆) < l] . (2.2.14)

Since D∗(∆) is convex, S∗(∆) is nonincreasing and we have

∆L = inf∆ : S∗(∆) < l and ∆U = sup∆ : S∗(∆) > k ; (2.2.15)

compare (1.3.10). Often we will be able to invert k < S∗(∆) < l to find an explicit formulafor the upper and lower end points.

We will discuss a large class of general pseudo norms in Section 2.5, but now we presentthe pseudo norms that yield the pooled t-test and the Mann-Whitney-Wilcoxon test.

2.2.1 Least Squares (LS) Analysis

The traditional analysis is based on the squared pseudo-norm given by

‖u‖2LS =

n∑

i=1

n∑

j=1

(ui − uj)2 , u ∈ Rn . (2.2.16)

It follows, (see Exercise 2.13.1) that

‖Z−C∆‖2LS = −4n1n2(Y −X − ∆) ;

hence the classical estimate is ∆LS = Y − X. Eliminating the constant factor 4n1n2 theclassical test is based on the statistic

SLS(0) = Y −X .

As shown in Exercise 2.13.1, standardizing SLS results in the two-sample pooled t-statistic.An approximate confidence interval for ∆ is given by

Y −X ± t(α/2,n1+n2−2)σ

√1

n1+

1

n2,

where σ is the usual pooled estimate of the common standard deviation. This confidenceinterval is exact if ei has a normal distribution. Asymptotically, we replace t(α/2,n1+n2−2) byzα/2. The test is asymptotically distribution free.

Page 88: Robust Nonparametric Statistical Methods

78 CHAPTER 2. TWO SAMPLE PROBLEMS

2.2.2 Mann-Whitney-Wilcoxon (MWW) Analysis

The rank based analysis is based on the pseudo-norm defined by

‖u‖R =

n∑

i=1

n∑

j=1

|ui − uj| , u ∈ Rn . (2.2.17)

Note that this pseudo-norm is the L1-norm based on the differences between the componentsand that it is the second term of expression (1.3.20), which defines the norm of the signedrank analysis of Chapter 1. Note further, that this pseudo-norm differs from the least squarespseudo-norm in that the square root is taken inside the double summation. In Exercise 2.13.2the reader is asked to show that this indeed is a pseudo-norm and that further it can bewritten in terms of ranks as

‖u‖R = 4n∑

i=1

(R(ui) −

n + 1

2

)ui .

From (2.2.17), it follows that the MWW gradient is

‖Z− C∆‖R = −2n1∑

i=1

n2∑

j=1

sgn (Yj −Xi − ∆) .

Our estimate of ∆ is a value which makes the gradient zero; that is, makes half of thedifferences positive and the other half negative. Thus the rank based estimate of ∆ is

∆R = med Yj −Xi . (2.2.18)

This pseudo-norm estimate is often called the Hodges-Lehmann estimate of shift for thetwo sample problem, (Hodges and Lehmann, 1963). As we show in Section 2.4.4, ∆R has anapproximate normal distribution with mean ∆ and standard deviation τ

√(1/n1) + (1/n2),

where the scale parameter τ is given in display (2.4.22).From the gradient we define

SR(∆) =

n1∑

i=1

n2∑

j=1

sgn (Yj −Xi − ∆) . (2.2.19)

Next defineS+R (∆) = #(Yj −Xi > ∆) . (2.2.20)

Note that we have (with probability one) that SR(∆) = 2S+R(∆) − n1n2. The statistic

S+R = S+

R (0), originally proposed by Mann and Whitney (1947), will be more convenient touse. The gradient test for the hypotheses (2.2.13) is

Reject H0 in favor of HA if S+R ≤ k or S+

R ≥ n1n2 − k ,

Page 89: Robust Nonparametric Statistical Methods

2.2. GEOMETRIC MOTIVATION 79

where k is chosen by P0(S+R ≤ k) = α/2. We show in Section 2.4 that the test statistic

is distribution free under H0 and, that further, it has an asymptotic normal distributionwith mean n1n2/2 and standard deviation

√n1n2(n1 + n2 + 1)/12 under H0. Hence, an

asymptotic level α test rejects H0 in favor of HA, if

|z| > zα/2 where z =S+

R−(n1n2/2)√n1n2(n1+n2+1)/12

. (2.2.21)

As shown in Section 2.4.2, the (1−α)100% MWW confidence interval for ∆ is givenby

[D(k+1), D(n1n2−k)) , (2.2.22)

where k is such that P0[S+R ≤ k] = α/2 and D(1) ≤ · · · ≤ D(n1n2) denote the ordered n1n2

differences Yj − Xi. It follows from the asymptotic null distribution of S+R that k can be

approximated as n1n2

2− 1

2− zα/2

√n1n2(n+1)

12.

A rank formulation of the MWW test statistic S+R (∆) will also prove useful. Letting

R(ui) denote the rank of ui among u1, . . . , un we can write

n2∑

j=1

R(Yj − ∆) =

n2∑

j=1

#i(Xi < Yj − ∆) + #i(Yi − ∆ ≤ Yj − ∆)

= #(Yj −Xi > ∆) +n2(n2 + 1)

2.

Defining,

W (∆) =

n2∑

i=1

R(Yi − ∆) , (2.2.23)

we thus have the relationship that

S+R (∆) = W (∆) − n2(n2 + 1)

2. (2.2.24)

The test statistic W (0) was proposed by Wilcoxon (1945). Since it is a linear function ofthe Mann-Whitney test statistic it has identical statistical properties. We will refer to thestatistic, S+

R , as the Mann-Whitney-Wilcoxon statistic and will label it as MWW.As a final note on the geometry of the rank based analysis, reconsider the model with

the location functional θ in it, i.e. (2.2.4). Suppose we obtain the R-estimate of ∆, (2.2.18).

Let eR = Z − C∆R denote the residuals. Next suppose we want to estimate the locationparameter θ by using the weighted L1 norm which was discussed for estimation of locationin Section 1.7 of Chapter 1. Let ‖u‖SR =

∑nj=1 j|u|(j) denote this norm. For the residual

vector eR, expression (1.3.10) of Chapter 1 is given by

‖e − θ1‖SR =∑∑

i≤j

∣∣∣∣ei + ej

2− θ

∣∣∣∣ + (1/4)‖eR‖R . (2.2.25)

Page 90: Robust Nonparametric Statistical Methods

80 CHAPTER 2. TWO SAMPLE PROBLEMS

Hence the estimate of θ determined by this geometry is the Hodges-Lehmann estimate basedon the residuals; i.e.,

θR = medi≤j

ei + ej

2

. (2.2.26)

Asymptotic theory for the joint distribution of the random vector (θR, ∆R)′ will be discussedin Chapter 3.

2.2.3 Computation

The Mann-Whitney-Wilcoxon analysis which we described above is easily computed using theRBR function twosampwil. This function returns the value of the Mann-Whitney-Wilcoxontest statistic S+

R = S+R (0), (2.2.20), the estimate ∆, (2.2.18), the associated confidence in-

terval (2.2.22), and comparison boxplots of the samples. Also, the R intrinsic functionwilcoxon.test and minitab command MANN compute this Mann-Whitney-Wilcoxon analy-sis.

2.3 Examples

In this section we present two examples which illustrate the methods discussed in the lastsection. The calculations were performed by the RBR functions twosampwil and twosampt

which compute the Mann-Whitney-Wilcoxon and LS analyses, respectively. By convention,for each difference Yj −Xi = 0, we add the value 1/2 to the test statistic S+

R . Further, thereturned p-value is calculated with the usual continuity correction. The estimate of τ andits standard error (SE) displayed in the results are given by expression (2.4.27), where a fulldiscussion is given. The LS analysis, computed by twosampt, is based on the traditionalpooled two-sample t-test.

Example 2.3.1. Quail Data

The data for this problem are drawn from a high volume drug screen designed to findcompounds which reduce low density lipoproteins, LDL, cholesterol in quail; see McKean,Vidmar and Sievers (1989) for a discussion of this screen. For the purposes of the presentexample, we have taken the plasma LDL levels of one group of quail who were fed over aspecified period of time a special diet mixed with a drug compound and the LDL levels ofa second group of quail who were fed the same special diet but without the drug compoundover the same length of time. A completely randomized design was employed. We will referto the first group as the treatment group and the second group as the control group. Thedata are displayed in Table 2.3.1. Let θC and θT denote the true median levels of LDL for thecontrol and treatment populations, respectively. The parameter of interest is ∆ = θC − θT .We are interested in the alternative hypothesis that the treatment has been effective; hencethe hypotheses are:

H0 : ∆ = 0 versus HA : ∆ > 0 .

Page 91: Robust Nonparametric Statistical Methods

2.3. EXAMPLES 81

Table 2.3.1: Data for Quail ExampleControl 64 49 54 64 97 66 76 44 71 89

70 72 71 55 60 62 46 77 86 71

Treated 40 31 50 48 152 44 74 38 81 64

The comparison boxplots returned by the RBR function twosampwil are found in Figure2.3.1. Note that there is one outlier, the fifth observation of the treated group, which hasthe value 152. Outliers such as this were typical with most of the data in this study; seeMcKean et al. (1989). For the data at hand, the treated group appears to have lower LDLlevels.

Figure 2.3.1: Comparison Boxplots of Treatment and Control Quail LDL Levels

Control Treated

4060

8010

012

014

0

LDL

chol

este

rol

Comparison Boxplots of Treated and Control

The analyses returned by the functions twosampwil and twosampt are given below. TheMann-Whitney-Wilcoxon test statistic has the value 134.5 with p-value 0.067, while the t-teststatistic has value 0.557 with p-value 0.291. The MWW indicates with marginal significancethat the treatment performed better than the placebo. The two sample t analysis wasimpaired by the outlier.

The Hodges-Lehmann estimate of ∆, (2.2.18), is 14 and the 90% confidence interval is(−2.0, 24.0). In contrast, the least squares estimate of shift is 5 and the corresponding 90%confidence interval is (−10.25, 20.25).

> twosampwil(y,x,alt=1,alpha=.10,namex="Treated",namey="Control",

Page 92: Robust Nonparametric Statistical Methods

82 CHAPTER 2. TWO SAMPLE PROBLEMS

nameresp="LDL cholesterol")

Test of Delta = 0 Alternative selected is 1

Test Stat. S+ is 134.5 Standardized (z) Test-Stat. 1.495801 and

p-vlaue 0.06735282

MWW estimate of the shift in location is 14 SE is 8.180836

90 % Confidence Interval is ( -2 , 24 )

Estimate of the scale parameter tau 21.12283

> twosampt(y,x,alt=1,alpha=.10,namex="Treated",namey="Control",

nameresp="LDL cholesterol")

Test of Delta = 0 Alternative selected is 1

Test Stat. ybar-xbar- 0 is 5 Standardized (t) Test-Stat. 0.5577585

and p-vlaue 0.2907209

Mean of y minus the mean of x is 5 SE is 8.964454

90 % Confidence Interval is ( -10.24971 , 20.24971 )

Estimate of the scale parameter sigma 23.14612

As noted above, this data was drawn from data from a high-speed drug screen to discoverdrug compounds which have the potential to reduce LDL cholesterol. In this screen, if acompound was at least marginally significant the investigation of it would continue, else itwould be eliminated from further scrutiny. Hence, for this drug compound, the robust andLS analyses would result in different practical outcomes.

Example 2.3.2. Hendy-Charles Coin Data, continuation of Example 1.11.1

Recall that the 84% L1 confidence intervals for the data are disjoint. Thus we reject thenull hypothesis that the silver content is the same for the two mintings at the 5% level. Wenow apply the MWW test and confidence interval to this data and find the Hodges-Lehmannestimate of shift. If the tailweights of the underlying distributions are moderate, the MWWmethods are more efficient.

The output from the RBR function twosampwil is:

> twosampwil(Fourth,First)

Test of Delta = 0 Alternative selected is 0

Test Stat. S+ is 61.5 Standardized (z) Test-Stat. 3.122611

and p-vlaue 0.001792544

MWW estimate of the shift in location is 1.1 SE is 0.2999926

Page 93: Robust Nonparametric Statistical Methods

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON 83

95 % Confidence Interval is ( 0.6 , 1.7 )

Estimate of the scale parameter tau 0.5952794

Note that there is strong statistical evidence that the mintings are different. The Hodges-Lehmann estimate (2.2.18) is 1.1 which suggests that there is roughly a 1.1% decrease in thesilver content from the first to the fourth mintings. The 95% confidence interval, (2.2.22), is(0.6, 1.7). Half the length of the confidence is .45 and this could be reported as the marginof error in estimating ∆, the change in median silver contents from the first to the fourthmintings. Hence we could report 1.1% ± .45%.

2.4 Inference Based on the Mann-Whitney-Wilcoxon

We next develop the theory for inference based on the Mann-Whitney-Wilcoxon statistic,including the test, the estimate and the confidence interval. Although much of the devel-opment is for the location model the general model will also be considered. We begin withtesting.

2.4.1 Testing

Although the geometric motivation of the test statistic S+R was derived under the location

model, the test can be used for more general models. Recall that the general model iscomprised of a random sample X1, . . . , Xn1 with cdf F (x) and a random sample Y1, . . . , Yn2

with cdf G(x). For the discussion we select the hypotheses,

H0 : F (x) = G(x), for all x versus HA : F (x) ≥ G(x), with strict inequality for some x.(2.4.1)

Under this stochastically ordered alternative Y tends to dominate X,; i.e., P (Y > X) >1/2. Our rank-based decision rule is to reject H0 in favor of HA if S+

R is too large, whereS+R = #(Yj −Xi > 0). Our immediate goal is to make this precise. What we discuss will of

course hold for the other one-sided alternative F (x) ≤ G(x) and the two-sided alternativeF (x) ≤ G(x) or F (x) ≥ G(x) as well. Furthermore, since the location model is a submodelof the general model, what holds for the general model will hold for it also. It will alwaysbe clear which set of hypotheses is being considered.

Under H0, we first show that S+R is distribution free and then show it is symmetrically

distributed about (n1n2)/2.

Theorem 2.4.1. Under the general null hypothesis in (2.4.1), S+R is distribution free.

Proof: Under the null hypothesis, the combined samples X1, . . . , Xn1, Y1, . . . , Yn2 consti-tute a random sample of size n from the distribution function F (x). Hence any assignmentof n2 ranks from the set of integers 1, . . . , n to Y1, . . . , Yn2 is equilikely; i. e., has probability(nn2

)−1independent of F .

Theorem 2.4.2. Under H0 in (2.4.1), the distribution of S+R is symmetric about (n1n2)/2.

Page 94: Robust Nonparametric Statistical Methods

84 CHAPTER 2. TWO SAMPLE PROBLEMS

Proof: Under H0 in (2.4.1) L(Yj −Xi) = L(Xi−Yj) for all i, j; see Exercise 2.13.3. Thusif S−

R = #(Xi − Yj > 0) then, under H0, L(S+R) = L(S−

R). Since S−R = n1n2 − S+

R we havethe following string of equalities which proves the result:

P [S+R ≥ n1n2

2+ x] = P [n1n2 − S−

R ≥ n1n2

2+ x]

= P [S−R ≤ n1n2

2− x] = P [S+

R ≤ n1n2

2− x]

Hence for the hypotheses (2.4.1), a level α test based on S+R would reject H0 if S+

R ≥cα,n1,n2 where PH0 [S

+R ≥ cα,n1,n2] = α. From the symmetry, note that the lower α critical

point is given by n1n2 − cα,n1,n2 .Although S+

R is distribution free under the null hypothesis its distribution cannot beobtained in closed form. The next theorem gives a recursive formula for its distribution.The proof can be found in Exercise 2.13.4; see, also, Hettmansperger (l984, p. 136-137).

Theorem 2.4.3. Under the general null hypothesis in (2.4.1), let Pn1,n2(k) = PH0 [S+R = k].

ThenPn1,n2(k) =

n2

n1 + n2

Pn1,n2−1(k − n1) +n1

n1 + n2

Pn1−1,n2(k) ,

where Pn1,n2(k) satisfies the boundary conditions Pi,j(k) = 0 if k < 0, Pi,0(k) and P0,j(k) are1 or 0 as k = 0 or k 6= 0.

Based on these recursion formulas, tables of the null distribution can be obtained readily,which then can be used to obtain the critical values for the rank based test. Alternatively,the asymptotic null distribution of S+

R can be used to determine approximate critical values.This asymptotic test will be discussed later; see Theorem 2.4.9.

We next derive the mean and variance of S+R under the three models:

(a) the general model where X has distribution function F (x) and Y has distribution func-tion G(x);

(b) the location model where G(x) = F (x− ∆);

(c) and the null model in which F (x) = G(x).

Of course, from Theorem 2.4.2, the null mean of S+R is (n1n2)/2. In our derivation we

repeatedly make use of the fact that if H is the distribution function of a random variableZ then the random variable H(Z) has a uniform distribution over the interval (0, 1); seeExercise 2.13.5.

Theorem 2.4.4. Assuming that X1, . . . , Xn1 are iid F (x) and Y1, . . . , Yn2 are iid G(x) andthat these two samples are independent of one another, the means of S+

R under the threemodels (a)-(c) are:

(a) E[S+R

]= n1n2 [1 − E [G(X)]] = n1n2E [F (Y )]

(b) E[S+R

]= n1n2 [1 − E [F (X − ∆)]] = n1n2E [F (X + ∆)]

(c) E[S+R

]=

n1n2

2.

Page 95: Robust Nonparametric Statistical Methods

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON 85

Proof: We shall prove only (a), since results (b) and (c) follow directly from it. We canwrite S+

R in terms of indicator functions as

S+R =

n1∑

i=1

n2∑

j=1

I(Yj −Xi > 0) , (2.4.2)

where I(t > 0) is 1 or 0 for t > 0 or t ≤ 0, respectively. Let Y have distribution function G,let X have distribution function F , and let X and Y be independent. Then

E [I (Y −X > 0)] = E [P [Y > X|X]]

= E [1 −G(X)] = E [F (Y )] ,

where the second equality follows from the independence of X and Y . The results thenfollow.

Theorem 2.4.5. The variances of S+R under the models (a) - (c) are:

(a) V ar[S+R

]= n1n2

(E [G(X)] −E2 [G(X)]

)

+ n1n2(n1 − 1)V ar [F (Y )] + n1n2(n2 − 1)V ar [G(X)]

(b) V ar[S+R

]= n1n2

(E [F (X − ∆)] − E2 [F (X − ∆)]

)

+ n1n2(n1 − 1)V ar [F (Y )] + n1n2(n2 − 1)V ar [F (X − ∆)]

(c) V ar[S+R

]=

n1n2(n+ 1)

12.

Proof: Again only the result (a) will be obtained. Using the indicator formulation of S+R ,

(2.4.2), we have

V ar[S+R

]=

n1∑

i=1

n2∑

j=1

V ar [I(Yj −Xi > 0)]+

n1∑

i=1

n2∑

j=1

n1∑

l=1

n2∑

k=1

Cov [I(Yj −Xi > 0), I(Yk −Xl > 0)] ,

where the sums for the covariance terms are over all possible combinations except (i, j) =(l, k). For the first term, note that the variance of I(Y −X > 0) is

V ar [I(Y > X)] = E [I(Y > X)] − E2 [I(Y > X)]

= E [1 −G(X)] − E2 [1 −G(X)]

= E [G(X)] − E2 [G(X)] .

This yields the first term in (a). For the covariance terms, note that a covariance is 0 unlesseither j = k or i = l. This leads to the following two cases:

Page 96: Robust Nonparametric Statistical Methods

86 CHAPTER 2. TWO SAMPLE PROBLEMS

Case(i) For the covariance terms with j = k and i 6= l, we need E [I(Y > X1)I(Y > X2)]which is,

E [I(Y > X1)I(Y > X2)] = P [Y > X1, Y > X2]

= E [P [Y > X1, Y > X2 |Y ]]

= E [P [Y > X1 |Y ]P [Y > X2 |Y ]]

= E[F (Y )2

]

There are n2ways to get a j and n1(n1−1) ways to get i 6= l; hence there are n1n2(n1−1)covariances of this form. This leads to the second term of (a).

Case(ii) ) The terms for the covariances where i = l and j 6= k follow similarly to Case (i).This leads to the third and final term of (a).

The last two theorems suggest that the random variable Z =S+

R−n1n22

q

n1n2(n+1)12

has an approx-

imate N(0, 1) distribution under H0. This follows from the next results which yield theasymptotic distribution of S+

R under general alternatives as well as under the null hypoth-esis. We will obtain these results by projecting our statistic S+

R down onto a set of linearcombinations of independent random variables. Then we can use central limit theory on theprojection. See Hajek and Sidak (1967) for a discussion of this technique.

Let T = T (Z1, . . . , Zn) be a random variable based on a sample Z1, . . . , Zn such thatE [T ] = 0. Let

p∗k(x) = E [T | Zk = x] , k = 1, . . . , n .

Next define the random variable Tp to be

Tp =

n∑

k=1

p∗k(Zk) . (2.4.3)

In the next theorem we show that Tp is the projection of T onto the space of linear functions ofZ1, . . . , Zn. Note that unlike T , Tp is a linear combination of independent random variables;hence, its asymptotic distribution is often easier to obtain than that of T . As the followingprojection theorem shows it is in a sense the “closest” linear function of the form

∑pi(Zi)

to T .

Theorem 2.4.6. If W =∑n

i=1 pi(Zi) then E [(T −W )2] is minimized by taking pi(x) =p∗i (x). Furthermore E [(T − Tp)

2] = V ar [T ] − V ar [Tp].

Proof: First note that E [p∗k(Zk)] = 0. We have,

E[(T −W )2

]= E

[[(T − Tp) − (W − Tp)]

2] (2.4.4)

= E[(T − Tp)

2]+ E

[(W − Tp)

2]− 2E [(T − Tp)(W − Tp)] .

Page 97: Robust Nonparametric Statistical Methods

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON 87

We can write one-half the cross product term as

n∑

i=1

E [(T − Tp)(pi(Zi) − p∗i (Zi))] =n∑

i=1

E [E [(T − Tp)(pi(Zi) − p∗i (Zi)) | Zi]]

=n∑

i=1

E

[(pi(Zi) − p∗i (Zi))E

[T −

n∑

j=1

p∗j(Zj) | Zi]]

.

The conditional expectation can be written as,

(E [T | Zi] − p∗i (Zi)) −∑

j 6=iE[p∗j (Zj)

]= 0 − 0 = 0 .

Hence the cross-product term is zero, and, therefore the left-hand-side of expression (2.4.4)is minimized with respect to W by taking W = Tp. Also since this holds, in particular, forW = 0 we get

E[T 2]

= E[(T − Tp)

2]+ E

[T 2p

].

Since both T and Tp have zero means the second result of the theorem also follows.From these results a strategy for obtaining the asymptotic distribution of T is apparent.

Namely, find the asymptotic distribution of its projection, Tp and then show V ar [T ] −V ar [Tp] → 0 as n→ ∞. This implies that T and Tp have the same asymptotic distribution;see Exercise 2.13.7. We shall apply this strategy to get the asymptotic distribution of therank based methods. As a first step we obtain the projection of S+

R − E[S+R

]under the

general model.

Theorem 2.4.7. Under the general model the projection of the random variable S+R−E

[S+R

]

is

Tp = n1

n2∑

j=1

(F (Yj) − E [F (Yj)]) − n2

n1∑

i=1

(G(Xi) −E [G(Xi)]) . (2.4.5)

Proof: Define the n random variables Z1 , . . . , Zn by

Zi =

Xi if 1 ≤ i ≤ n1

Yi−n1 if n1 + 1 ≤ i ≤ n.

We have,

p∗k(x) = E[S+R | Zk = x

]−E

[S+R

]

=

n1∑

i=1

n2∑

j=1

E [I(Yj > Xi) | Zk = x] − E[S+R

]. (2.4.6)

There are two cases depending on whether 1 ≤ k ≤ n1 or n1 + 1 ≤ k ≤ n1 + n2 = n.

Page 98: Robust Nonparametric Statistical Methods

88 CHAPTER 2. TWO SAMPLE PROBLEMS

Case (1) Suppose 1 ≤ k ≤ n1 then the conditional expectation in the above expression(2.4.6), depending on the value of i, becomes

(a): i 6= k, E [I(Yj > Xi) | Xk = x] = E [I(Yj > Xi)]

= P [Y > X]

(b): i = k, E [I(Yj > Xi) | Xi = x]

= P [Y > X | X = x]

= 1 −G(x)

Hence, in this case,

p∗k(x) = n2(n1 − 1)P [Y > X] + n2(1 −G(x)) − E[S+R

].

Case (2) Next suppose that n1 + 1 ≤ k ≤ n then the conditional expectation in the aboveexpression (2.4.6), depending on the value of j, becomes

(a): j 6= k, E [I(Yj > Xi) | Yk = x] = P [Y > X]

(b): j = k, E [I(Yj > Xi) | Yj = x] = F (x)

Hence, in this case,

p∗k(x) = n1(n2 − 1)P [Y > X] + n1F (x) −E[S+R

].

Combining these results we get

Tp =

n1∑

i=1

p∗i (Xi) +

n2∑

j=1

p∗j (Yj)

= n1n2(n1 − 1)P [Y > X] + n2

n1∑

i=1

(1 −G(Xi))

+ n1n2(n2 − 1)P [Y > X] + n1

n2∑

j=1

F (Yj) − nE[S+R

].

This can be simplified by noting that

P (Y > X) = E [P (Y > X | X)] = E [1 −G(X)]

or similarlyP (Y > X) = E [F (Y )] .

From (a) of Theorem 2.4.4,

E[S+R

]= n1n2 (1 − E [G(X)]) = n1n2P (Y > X) .

Substituting these three results into (2.4.6) we get the desired result.An immediate outcome is

Page 99: Robust Nonparametric Statistical Methods

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON 89

Corollary 2.4.1. Under the general model, if Tp is given by (2.4.5) then

Var (Tp) = n21n2Var (F (Y )) + n1n

22Var (G(X)) .

From this it follows that Tp should be standardized as

T ∗p =

1√nn1n2

Tp .

In order to obtain the asymptotic distribution of Tp and subsequently S+R we need the

following assumption on the design (sample sizes),

(D.1) :nin

→ λi , 0 < λi < 1 . (2.4.7)

This says that the sample sizes go to ∞ at the same rate. Note that λ1 + λ2 = 1. Theasymptotic variance of T ∗

p is thus

Var (T ∗p ) → λ1Var (F (Y )) + λ2Var (G(X)) .

We first want to obtain the asymptotic distribution under general alternatives. In orderto do this we need an assumption concerning the ranges of X and Y . The support of acontinuous random variable with distribution function H and density h is defined to be theset x : h(x) > 0 which is denoted by S(H).

Our second assumption states that the intersection of the supports of F and G has anonempty interior; that is

(E.3) : There is an open interval I such that I ⊂ S(F ) ∩ S(G) . (2.4.8)

Note that the asymptotic variance of T ∗p is not zero under (E.3).

We are now in the position to find the asymptotic distribution of T ∗p .

Theorem 2.4.8. Under the general model and the assumptions (D.1) and (E.3), T ∗p has an

asymptotic N (0, λ1Var (F (Y )) + λ2Var (G(X))) distribution.

Proof: By (2.4.5) we can write

T ∗p =

√n1

nn2

n2∑

j=1

(F (Yj) −E [F (Yj)]) −√

n2

nn1

n1∑

i=1

(G(Xi) − E [G(Xi)]) . (2.4.9)

Note that both sums on the right side of expression (2.4.9) are composed of independent andidentically distributed random variables and that the sums are independent of one another.The result then follows immediately by applying the simple central limit theorem to eachsum.

This is the key result we need in order to obtain the asymptotic distribution of our teststatistic S+

R . We first obtain the result under the general model and then under the nullhypothesis. As we will see, both results are immediate.

Page 100: Robust Nonparametric Statistical Methods

90 CHAPTER 2. TWO SAMPLE PROBLEMS

Theorem 2.4.9. Under the general model and the conditions (E.3) and (D.1) the random

variableS+

R−E[S+R]√

Var (S+R)

has a limiting N(0, 1) distribution.

Proof: By the last theorem and Theorem 2.4.6, we need only show that the difference inthe variances of S+

R/√nn1n2 and T ∗

p goes to 0 as n→ ∞. Note that,

Var

(1√nn1n2

S+R

)=

n1n2

nn1n2

(E [G(X)] −E [G(X)]2

)

+n1n2(n1 − 1)

nn1n2Var (F (Y )) +

n1n2(n2 − 1)

nn1n2Var (G(X)) ;

hence, Var (T ∗p ) − Var (S+

R/√nn1n2) → 0 and the result follows from Exercise (2.13.7).

The asymptotic distribution of the test statistic under the null hypothesis follows imme-diately from this theorem. We record it in the next corollary.

Corollary 2.4.2. Under H0 : F (x) = G(x) and (D.1) only, the test statistic S+R is approx-

imately N(n1n2

2, n1n2(n+1)

12

).

Therefore an asymptotic size α test for H0 : F (x) = G(x) versus HA : F (x) 6= G(x) isto reject H0 if |z| ≥ zα/2 where

z =S+R − n1n2

2√n1n2(n+1)

12

and

1 − Φ(zα/2) = α/2 .

Since we approximate a discrete random variable with continuous one, we think it is advisablein cases of small samples to use a continuity correction. Fix and Hodges (l955) give anEdgeworth approximation to the distribution of S+

R and Bickel (l974) discusses the error ofthis approximation.

Since the standard normal distribution function, Φ, is continuous on the entire real line,we can strengthen the convergence in Theorem 2.4.9 to uniform convergence; that is, thedistribution function of the standardized MWW converges uniformly to Φ. Using this, itis not hard to show that the standardized critical values of the MWW converge to theircounterparts at the standard normal. Thus if cα,n is the MWW critical value defined byα = PH0[S

+R ≥ cα,n] then

cα,n − n1n2

2√n1n2(n+1)

12

→ zα , (2.4.10)

where 1−α = Φ(zα); see Exercise 2.13.8 for details. This result will prove useful in the nextsection.

Page 101: Robust Nonparametric Statistical Methods

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON 91

We now consider when the test based on S+R is consistent. Consider the general set up;

i. e., X1, . . . , Xn1 is a random sample with distribution function F (x) and Y1, . . . , Yn2 is arandom sample with distribution function G(x). Consider the hypotheses

H0 : F = G versus HA1 : F (x) ≥ G(x) with F (x0) > G(x0) for some x0 ∈ Int(S(F ) ∩ S(G)).(2.4.11)

Such an alternative is called a stochastically ordered alternative. The next theorem showsthat the MWW test statistic is consistent for this alternative. Likewise it is consistent forthe other one sided stochastically ordered alternative with F and G interchanged, HA2, and,also, for the two sided alternative which consists of the union of HA1 and HA2. These resultsimply that the MWW test is consistent for location alternatives, provided F and G haveoverlapping support. As Exercise 2.13.9 shows, it will also be consistent when one supportis shifted to the right of the other support.

Theorem 2.4.10. Suppose that the assumptions (D.1), (2.4.7), and (E.3), (2.4.8), hold.Under the stochastic ordering alternatives given above, S+

R is a consistent test.

Proof: Assume the stochastic ordering alternative HA1, (2.4.11). For an arbitrary level α,select the critical level cα such that the test that rejects H0 if S+

R ≥ cα has asymptotic levelα. We want to show that the power of the test goes to 1 as n→ ∞. Since F (x0) > G(x0) forsome point x0 in the interior of S(F )∩S(G), there exists an interval N such that F (x) > G(x)on N . Hence

EHA[G(X)] =

N

G(y)f(y)dy +

Nc

G(y)f(y)dy

<

N

F (y)f(y)dy +

Nc

F (y)f(y)dy =1

2(2.4.12)

The power of the test is given by

PHA

[S+R ≥ cα

]= PHA

[S+R −EHA

(S+R)√

VHA(S+

R)≥ cα − (n1n2/2)√

VHA(S+

R )+

(n1n2/2) − EHA(S+

R )√VHA

(S+R )

].

Note by (2.4.10) that

cα − (n1n2/2)√VHA

(S+R )

=cα − (n1n2/2)√

VH0(S+R )

√VH0(S

+R )√

VHA(S+

R)→ zακ ,

where κ is a real number (since the variances are of the same order). But by (2.4.12)

(n1n2/2) −EHA(S+

R )√VHA

(S+R )

=(n1n2/2) − n1n2[1 − EHA

(G(X))√VHA

(S+R )

]

=n1n2

[−1

2+ EHA

(G(X))]

√VHA

(S+R )

→ −∞ .

By Theorem (2.4.9), under HA the random variableS+

R−EHA(S+

R )√VHA

(S+R )

converges in distribution to

a standard normal variate. Since the convergence is uniform, it follows from the above limitsthat the power converges to 1. Hence the MWW test is consistent.

Page 102: Robust Nonparametric Statistical Methods

92 CHAPTER 2. TWO SAMPLE PROBLEMS

2.4.2 Confidence Intervals

Consider the location model (2.2.4). We next obtain a distribution free confidence intervalfor ∆ by inverting the MWW test. As a first step we have the following result on the functionS+R (∆), (2.2.20):

Lemma 2.4.1. S+R (∆) is a decreasing step function of ∆ which steps down by 1 at each

difference Yj −Xi. Its maximum is n1n2 and its minimum is 0.

Proof: Let D(1) ≤ · · · ≤ D(n1n2) denote the ordered n1n2 differences Yj −Xi. The resultsfollow immediately by writing S+

R (∆) = #(D(i) > ∆).Let α be given and choose cα/2 to be the lower α/2 critical point of the MWW distribution;

i.e., P∆

[S+R (∆) ≤ cα/2,

]= α/2. By the above lemma we have

1 − α = P∆

[cα/2 < S+

R (∆) < n1n2 − cα/2]

= P∆

[D(cα/2+1) ≤ ∆ < D(n1n2−cα/2)

].

Thus [D(cα/2+1), D(n1n2−cα/2)) is a (1 − α)100% confidence interval for ∆; compare (1.3.30).

From the asymptotic null distribution theory for S+R , Corollary (2.4.2), we can approximate

cα/2 as

cα/2.=n1n2

2− zα/2

√n1n2(n+ 1)

12− .5 . (2.4.13)

2.4.3 Statistical Properties of the Inference Based on the MWW

In this section we derive the efficiency properties of the MWW test statistic and propertiesof its power function under the location model (2.2.4).

We begin with an investigation of the power function of the MWW test. For definitenesswe will consider the one sided alternative,

H0 : ∆ = 0 versus HA : ∆ > 0 . (2.4.14)

Results similar to those given below can be obtained for the power function of the other onesided and the two sided alternatives. Given a level α, let cα,n1,n2 denote the upper criticalvalue for the MWW test of this hypothesis; hence, the test rejects H0 if S+

R ≥ cα,n1,n2. Thepower function of this test is given by

γ(∆) = P∆[S+R ≥ cα,n1,n2] , (2.4.15)

where the subscript ∆ on P denotes that the probability is determined when the true pa-rameter is ∆. Recall that S+

R (∆) = #Yj −Xi > ∆.The following theorem will prove useful, its proof is similar to that of Theorem 1.3.1 of

Chapter 1 and the more general result Theorem A.2.4 of the Appendix.

Theorem 2.4.11. For all t, P∆[S+R (0) ≥ t] = P0[S

+R (−∆) ≥ t].

Page 103: Robust Nonparametric Statistical Methods

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON 93

From Lemma 2.4.1 and Theorem 2.4.11 we have our first important result on the powerfunction of the MWW test; namely, that it is monotone.

Theorem 2.4.12. For the above hypotheses (2.4.14), the function γ(∆) in monotonicallyincreasing in ∆.

Proof: Let ∆1 < ∆2. Then −∆2 < −∆1 and, hence, from Lemma 2.4.1, we haveS+R (−∆2) ≥ S+

R (−∆1). By applying Theorem 2.4.11, the desired result, γ(∆2) ≥ γ(∆1),follows from the following:

1 − γ(∆2) = P∆2 [S+R (0) < cα,n1,n2]

= P0[S+R (−∆2) < cα,n1,n2]

≤ P0[S+R (−∆1) < cα,n1,n2]

= P∆1 [S+R (0) < cα,n1,n2]

= 1 − γ(∆1) .

From this we immediately have that the MWW test is unbiased; that is, its powerfunction evaluated at an alternative is always at least as large as its level of significance. Westate it as a corollary.

Corollary 2.4.3. For the above hypotheses (2.4.14), γ(∆) ≥ α for all ∆ > 0.

A more general null hypothesis is given by

H∗0 : ∆ ≤ 0 versus HA : ∆ > 0 .

If T is any test for these hypotheses with critical region C then we say T is a size α testprovided

sup∆≤0

P∆[T ∈ C] = α .

For selected α, it follows from the monotonicity of the MWW power function that the MWWtest has size α for this more general null hypothesis.

From the above theorems, we have that the MWW power function is monotonicallyincreasing in ∆. Since S+

R(∆) achieves its maximum for ∆ finite, we have by Theorem 1.5.2of Chapter 1 that the MWW test is resolving; hence, its power function approaches one as∆ → ∞. Even for the location model, though, we cannot get the power function of the MWWtest in closed form. For local alternatives, however, we can obtain an asymptotic expressionfor the power function. Applications of this result include sample size determination forthe MWW test and efficiency comparisons of the MWW with other tests, both of which weconsider.

We will need the assumption that the density f(x) had finite Fisher Information, i.e.,

(E.1) f is absolutely continuous, 0 < I(f) =∫ 1

0ϕ2f(u) du <∞ , (2.4.16)

Page 104: Robust Nonparametric Statistical Methods

94 CHAPTER 2. TWO SAMPLE PROBLEMS

where

ϕf(u) = −f′(F−1(u))

f(F−1(u)). (2.4.17)

As discussed in Section 3.4, assumption (E.1) implies that f is uniformly bounded.Once again we will consider the one sided alternative, (2.4.14), (similar results hold for

the other one sided and two sided alternatives). Consider a sequence of local alternatives ofthe form

HAn : ∆n =δ√n, (2.4.18)

where δ > 0 is arbitrary but fixed.As a first step, we need to show that S+

R(∆) is Pitman regular as discussed in Chapter

1. Let S+

R(∆) = S+R(∆)/(n1n2). We need to verify the four conditions of Definition 1.5.3 of

Chapter 1. The first condition is true by Lemma 2.4.1 and the fourth condition follows fromCorollary 2.4.2. By (b) of Theorem 2.4.4, we have

µ(∆) = E∆[S+

R(0)] = 1 − E[F (X − ∆)] . (2.4.19)

By assumption (E.1), (2.4.16),∫f 2(x) dx ≤ sup f

∫f(x) dx < ∞. Hence differentiating

(2.4.19) we obtain µ′(0) =∫f 2(x)dx > 0 and, thus, the second condition is true. Hence we

need only show that the third condition, asymptotic linearity of S+

R(∆) is true. This willfollow provided we can show the variance condition (1.5.17) of Theorem 1.5.6 is true. Notethat

S+

R(δ/√n) − S

+

R(0) = (n1n2)−1#(0 < Yj −Xi ≤ δ/

√n) .

This is similar to the MWW statistic itself. Using essentially the same argument as that forthe variance of the MWW statistic, Theorem 2.4.5 we get

nVar0[S+

R(δ/√n) − S

+

R(0)] =n

n1n2

(an − a2n) +

n(n1 − 1)

n1n2

(bn − cn)

+n(n2 − 1)

n1n2(dn − a2

n) ,

where an = E0[F (X + δ/√n)− F (X)], bn = E0[(F (Y )− F (Y − δ/

√n))2], cn = E0[(F (Y )−

F (Y − δ/√n))], and dn = E0[(F (X + δ/

√n) − F (X))2]. Using the Lebesgue Dominated

Convergence Theorem, it is easy to see that an, bn, cn, and dn all converge to 0. ThereforeCondition (1.5.17) of Theorem 1.5.6 holds and we have thus established the asymptoticlinearity result given by:

sup|δ|≤B

∣∣∣∣n1/2S+

R(δ/√n) − n1/2S

+

R(0) + δ

∫f 2(x) dx

∣∣∣∣

P→ 0 , (2.4.20)

for any B > 0. Therefore, it follows that S+R (∆) is Pitman regular.

Page 105: Robust Nonparametric Statistical Methods

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON 95

In order to get the efficacy of the MWW test, we need the quantity σ2(0) defined by

σ2(0) = limn→0

nVar0(SR(0))

= limn→0

nn1n2(n + 1)

n21n

2212

= (12λ1λ2)−1 ;

see expression (1.5.12). Therefore by (1.5.11) the efficacy of the MWW test is

cMWW = µ′(0)/σ(0) =√λ1λ2

√12

∫f 2(x) dx =

√λ1λ2τ

−1 , (2.4.21)

where τ is the scale parameter given by

τ = (√

12

∫f 2(x)dx)−1 . (2.4.22)

In Exercise 2.13.10 it is shown that the efficacy of the two sample pooled t-test is√λ1λ2σ

−1 where σ2 is the common variance of X and Y . Hence the efficiency of the theMWW test to the two sample t test is the ratio σ2/τ 2. This of course is the same efficiencyas that of the signed rank Wilcoxon test to the one sample t test; see (1.7.13). In particularif the distribution of X is normal then the efficiency of the MWW test to the two samplet test is .955. For heavier tailed distributions, this efficiency is usually larger than 1; seeExample 1.7.1.

As in Chapter 1 it is convenient to summarize the asymptotic linearity result as follows:

√n

S

+

R(δ/√n) − µ(0)

σ(0)

=

√n

S

+

R(0) − µ(0)

σ(0)

− cMWW δ + op(1) , (2.4.23)

uniformly for |δ| ≤ B and any B > 0.The next theorem is the asymptotic power lemma for the MWW test. As in Chapter 1,

(see Theorem 1.5.8), its proof follows from the Pitman regularity of the MWW test.

Theorem 2.4.13. Under the sequence of local alternatives, (2.4.18),

limn→∞

γ(∆n) = P0 [Z ≥ zα − cMWW δ] = 1 − Φ

(zα −

√12λ1λ2

∫f 2δ

); ,

where Z is N(0, 1).

In Exercise 2.13.10, it is shown that if γLS(∆) denotes the power function of the usualtwo sample t-test then

limn→∞

γLS(∆n) = 1 − Φ

[zα −

√λ1λ2

δ

σ

], (2.4.24)

where σ2 is the common variance of X and Y . By comparing these two power functions, it isseen that the Wilcoxon is asymptotically more powerful if τ < σ, i.e., if e = c2MWW/c

2t > 1.

Page 106: Robust Nonparametric Statistical Methods

96 CHAPTER 2. TWO SAMPLE PROBLEMS

As an application of the asymptotic power lemma, we consider sample size determi-nation. Consider the MWW test for the one sided hypothesis (2.4.14). Suppose the level,α, and the power, β, for a particular alternative ∆A are specified. For convenience, assumeequal sample sizes, i. e. n1 = n2 = n∗ where n∗ denotes the common sample size; hence,λ1 = λ2 = 2−1. Express ∆A as

√2n∗∆A/

√2n∗. Then by Theorem 2.4.13 we have

β.= 1 − Φ

[zα −

√1

4

√2n∗∆A

τ

],

But this implies

zβ = zα − τ−1√n∗∆A/

√2

and (2.4.25)

n∗ =

(zα − zβ

∆A

)2

2τ 2 .

The above value of n∗ is the approximate sample size. Note that it does depend on τ which,in applications, would have to be guessed or estimated in a pilot study; see the discussionin Section 2.4.5, (estimates of τ are discussed in Sections 2.4.5 and 3.7.1). For a specifieddistribution it can be evaluated; for instance, if the underlying density is assumed to benormal with standard deviation σ then τ =

√π/3σ.

Using (2.4.24) a similar derivation can be obtained for the usual two sample t-test, re-sulting in an approximate sample size of

n∗LS =

(zα − zβ

∆A

)2

2σ2 .

The ratio of the sample size needed by the MWW test to that of the two sample t test isτ 2/σ2. This provides additional motivation for the definition of efficiency.

2.4.4 Estimation of ∆

Recall from the geometry earlier in this chapter that the estimate of ∆ based on the rank-pseudo norm is ∆R = med i,jYj − Xi, (2.2.18). We now obtain several properties ofthis estimate including its asymptotic distribution. This will lead again to the efficiencyproperties of the rank based methods discussed in the last section.

For convenience, we note some equivariances of ∆R = ∆(Y,X), which are established in

Exercise 2.13.11. First, ∆R is translation equivariant; i.e.,

∆R(Y + ∆ + θ,X + θ) = ∆R(Y,X) + ∆ ,

for any ∆ and θ. Second, ∆R is scale equivariant; i.e.,

∆R(aY, aX) = a∆R(Y,X) ,

for any a. Based on these we next show that ∆R is an unbiased estimate of ∆ under certainconditions.

Page 107: Robust Nonparametric Statistical Methods

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON 97

Theorem 2.4.14. If the errors, e∗i , in the location model (2.2.4) are symmetrically dis-

tributed about 0, then ∆R is symmetrically distributed about ∆.

Proof: Due to translation equivariance there is no loss of generality in assuming that ∆and θ are 0. Then Y and X are symmetrically distributed about 0; hence, L(Y ) = L(−Y )and L(X) = L(−X). Thus from the above equivariance properties we have,

L(−∆(Y,X)) = L(∆(−Y,−X) = L(∆(Y,X)) .

Therefore ∆R is symmetrically distributed about 0, and, in general it is symmetrically dis-tributed about ∆.

Theorem 2.4.15. Under Model (2.2.4), if n1 = n2 then ∆R is symmetrically distributedabout ∆.

The reader is asked to prove this in Exercise 2.13.12. In general, ∆R may be biased if theerror distribution is not symmetrically distributed but as the following result shows ∆R isalways asymptotically unbiased. Since the MWW process S+

R(∆) was shown to be Pitman

regular the asymptotic distribution of√n(∆ − ∆) is N(0, c−2

MWW ). In practice, we say

∆R has an approximate N(∆, τ 2(n−11 + n−1

2 )) distribution,

where τ was defined in (2.4.22).Recall from Definition 1.5.4 of Chapter 1, that the asymptotic relative efficiency of two

Pitman regular estimators is the reciprocal of the ratio of their aymptotic variances. AsExercise 2.13.10 shows, the least squares estimate ∆LS = Y − X of ∆ is approximately

N(∆, σ2

(1n1

+ 1n2

)); hence,

e(∆R, ∆LS) =σ2

τ 2= 12σ2

f

(∫f 2(x) dx

)2

.

This agrees with the asymptotic relative efficiency results for the MWW test relative to thet test and (1.7.13).

2.4.5 Efficiency Results Based on Confidence Intervals

Let L1−α be the length of the (1−α)100% distribution free confidence interval based on theMWW statistic discussed in Section 2.4.2. Since this interval is based on the Pitman regularprocess S+

R (∆), it follows from Theorem 1.5.9 of Chapter 1 that

√n1n2

n

L1−α2zα/2

P→ τ ; (2.4.26)

that is, the standardized length of a distribution-free confidence interval is a consistentestimate of the scale parameter τ . It further follows from (2.4.26) that, as in Chapter 1, if

Page 108: Robust Nonparametric Statistical Methods

98 CHAPTER 2. TWO SAMPLE PROBLEMS

efficiency is based on the relative squared asymptotic lengths of confidence intervals then weobtain the same efficiency results as quoted above for tests and estimates.

In the RBR computational function twosampwil a simple degree of freedom adjustmentis made in the estimation of τ . In the traditional LS analysis based on the pooled t, thisadjustment is equivalent to dividing the pooled estimate of variance by n1 + n2 − 2 insteadof n1 + n2. Hence, as our estimate of τ , the function twosampwil uses

τ =

√n1 + n2

n1 + n2 − 2

√n1n2

n

L1−α2zα/2

. (2.4.27)

Thus the standard error (SE) of the estimator ∆R is given by τ√

(1/n1) + (1/n2).

The distribution free confidence interval is not symmetric about ∆R. Often in practicesymmetric intervals are desired. Based on the asymptotic distribution of ∆R we can formulatethe approximate interval

∆R ± zα/2τ

√1

n1

+1

n2

, (2.4.28)

where τ is a consistent estimate of τ . If we use (2.4.26) as our estimate of τ with the levelα, then the confidence interval simplifies to

∆R ± L1−α2

. (2.4.29)

Besides the estimate given in (2.4.26), a consistent estimate of τ was proposed by byKoul, Sievers and McKean (1987) and will be discussed in Section 3.7. Using this estimatesmall sample studies indicate that zα/2 should be replaced by the t critical value t(α/2,n−1);see McKean and Sheather (1991) for a review of small sample studies on R-estimates. In

this case, the symmetric confidence interval based on ∆R is directly analogous to the usualt interval based on least squares in that the only difference is that σ is replaced by τ .

Example 2.4.1. Hendy and Charles Coin Data, continued from Examples 1.11.1 and 2.3.2

Recall from Chapter 1 that this example concerned the silver content in two coinages (thefirst and the fourth) minted during the reign of Manuel I. The data are given in Chapter1. The Hodges-Lehmann estimate of the difference between the first and the fourth coinageis 1.10 percent of silver and a 95% confidence interval for the difference is (.60, 1.70). Thelength of this confidence interval is 1.10; hence, the estimate of τ given in expression (2.4.27)is 0.595. The symmetrized confidence interval (2.4.28) based on the t upper .025 criticalvalue is (0.46, 1.74). Both of these intervals are in agreement with the confidence intervalobtained in Example 1.11.1 based on the two L1 confidence intervals.

Another estimate of $\tau$ can be obtained from a similar consideration of the distribution-free confidence intervals based on the signed-rank statistic discussed in Chapter 1; see Exercise 2.13.13. Note in this case, though, that for consistency we would have to assume that $f$ is symmetric.


2.5 General Rank Scores

In this section we will be concerned with the location model; i.e., $X_1, \ldots, X_{n_1}$ are iid $F(x)$, $Y_1, \ldots, Y_{n_2}$ are iid $G(x) = F(x - \Delta)$, and the samples are independent of one another. We will present an analysis for this problem based on general rank scores. In this terminology, the Mann-Whitney-Wilcoxon procedures are based on a linear score function. We will present the results for the hypotheses

$$H_0 : \Delta = 0 \ \ \mbox{versus} \ \ H_A : \Delta > 0 . \quad (2.5.1)$$

The results for the other one-sided and two-sided alternatives are similar. We will also be concerned with estimation and confidence intervals for $\Delta$. As in the preceding sections we will first present the geometry.

Recall that the pseudo-norm which generated the MWW analysis could be written as a linear combination of ranks times residuals. This is easily generalized. Consider the function

$$\|u\|_* = \sum_{i=1}^n a(R(u_i))\, u_i , \quad (2.5.2)$$

where the scores $a(i)$ satisfy $a(1) \leq \cdots \leq a(n)$ and $\sum a(i) = 0$. For the next theorem we will also assume that $a(i) = -a(n+1-i)$; although, this is only used to show the scalar multiplicative property.

Theorem 2.5.1. Suppose that $a(1) \leq \cdots \leq a(n)$, $\sum a(i) = 0$, and $a(i) = -a(n+1-i)$. Then the function $\| \cdot \|_*$ is a pseudo-norm.

Proof: By the connection between ranks and order statistics we can write

$$\|u\|_* = \sum_{i=1}^n a(i)\, u_{(i)} .$$

Next suppose that $u_{(j)}$ is the last order statistic with a negative score. Since the scores sum to 0, we can write

$$\|u\|_* = \sum_{i=1}^n a(i)(u_{(i)} - u_{(j)}) = \sum_{i \leq j} a(i)(u_{(i)} - u_{(j)}) + \sum_{i \geq j} a(i)(u_{(i)} - u_{(j)}) . \quad (2.5.3)$$

Both terms on the right side are nonnegative; hence, $\|u\|_* \geq 0$. Since all the terms in (2.5.3) are nonnegative, $\|u\|_* = 0$ implies that all the terms are zero. But since the scores are not all 0, yet sum to zero, we must have $a(1) < 0$ and $a(n) > 0$. Hence we must have $u_{(1)} = u_{(j)} = u_{(n)}$; i.e., $u_{(1)} = \cdots = u_{(n)}$. Conversely, if $u_{(1)} = \cdots = u_{(n)}$ then $\|u\|_* = 0$. By the condition $a(i) = -a(n+1-i)$ it follows that $\|\alpha u\|_* = |\alpha|\, \|u\|_*$; see Exercise 2.13.16.


In order to complete the proof we need to show that the triangle inequality holds. This is established by the following string of inequalities:

\begin{eqnarray*}
\|u + v\|_* & = & \sum_{i=1}^n a(R(u_i + v_i))(u_i + v_i) \\
 & = & \sum_{i=1}^n a(R(u_i + v_i))\, u_i + \sum_{i=1}^n a(R(u_i + v_i))\, v_i \\
 & \leq & \sum_{i=1}^n a(i)\, u_{(i)} + \sum_{i=1}^n a(i)\, v_{(i)} \\
 & = & \|u\|_* + \|v\|_* .
\end{eqnarray*}

The proof of the above inequality is similar to that of Theorem 1.3.2 of Chapter 1.

Based on a set of scores satisfying the above assumptions, we can establish a rank inference for the two sample problem similar to the MWW analysis. We shall do so for general rank scores of the form

ence for the two sample problem similar to the MWW analysis. We shall do so for generalrank scores of the form

aϕ(i) = ϕ(i/(n+ 1)) , (2.5.4)

where ϕ(u) satisfies the following assumptions

ϕ(u) is a nondecreasing function defined on the interval (0, 1)∫ 1

0ϕ(u) du = 0 and

∫ 1

0ϕ2(u) du = 1

; (2.5.5)

see (S.1), (3.4.10) in Chapter 3, also. The last assumptions concerning standardization ofthe scores are for convenience. The Wilcoxon scores are generated in this way by the linearfunction ϕR(u) =

√12(u− (1/2)) and the sign scores are generated by ϕS(u) = sgn(2u− 1).

We will denote the corresponding pseudo-norm for scores generated by $\varphi(u)$ as

$$\|u\|_\varphi = \sum_{i=1}^n a_\varphi(R(u_i))\, u_i . \quad (2.5.6)$$
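For concreteness, the score generation in (2.5.4) and the pseudo-norm (2.5.6) take only a few lines of R. The following is a minimal sketch (the function names are illustrative, not from the RBR package), using the standardized Wilcoxon and sign score functions named above:

phiR <- function(u) sqrt(12) * (u - 0.5)       # Wilcoxon score function
phiS <- function(u) sign(2*u - 1)              # sign score function
a.scores <- function(phi, n) phi((1:n)/(n + 1))       # a_phi(i), (2.5.4)
pseudo.norm <- function(u, phi)                       # ||u||_phi, (2.5.6)
  sum(a.scores(phi, length(u))[rank(u)] * u)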

These two sample sign and Wilcoxon scores are generalizations of the sign and Wilcoxon scores discussed in Chapter 1 for the one sample problem. In Section 1.8 of Chapter 1 we presented one sample analyses based on general score functions. Similar to the sign and Wilcoxon cases, we can generate a two sample score function from any one sample score function. For reference we establish this in the following theorem:

Theorem 2.5.2. As discussed at the beginning of Section 1.8, let $\varphi^+(u)$ be a score function for the one sample problem. For $u \in (-1, 0)$, let $\varphi^+(u) = -\varphi^+(-u)$. Define

$$\varphi(u) = \varphi^+(2u - 1) , \ \mbox{for} \ u \in (0,1) , \quad (2.5.7)$$

and

$$\|x\|_\varphi = \sum_{i=1}^n \varphi(R(x_i)/(n+1))\, x_i . \quad (2.5.8)$$


Then $\| \cdot \|_\varphi$ is a pseudo-norm on $R^n$. Furthermore,

$$\varphi(u) = -\varphi(1 - u) , \quad (2.5.9)$$

and

$$\int_0^1 \varphi^2(u)\, du = \int_0^1 (\varphi^+(u))^2\, du . \quad (2.5.10)$$

Proof: As discussed at the beginning of Section 1.8 (see expression (1.8.1)), $\varphi^+(u)$ is a positive valued and nondecreasing function defined on the interval $(0,1)$. Based on these properties, it follows that $\varphi(u)$ is nondecreasing and that $\int_0^1 \varphi(u)\, du = 0$. Hence $\| \cdot \|_\varphi$ is a pseudo-norm on $R^n$. Properties (2.5.9) and (2.5.10) follow readily; see Exercise 2.13.17 for details.

The two sample sign and Wilcoxon scores, cited above, are easily seen to be generated this way from their one sample counterparts $\varphi^+(u) = 1$ and $\varphi^+(u) = \sqrt{3}\, u$, respectively. As discussed further in Section 2.5.3, properties such as efficiencies of the analysis based on the one sample scores are the same for a two sample analysis based on their corresponding two sample scores.

In the notation of (2.2.3), the estimate of $\Delta$ is

$$\hat{\Delta}_\varphi = \mbox{Argmin}\, \|Z - C\Delta\|_\varphi .$$

Denote the negative of the gradient of $\|Z - C\Delta\|_\varphi$ by $S_\varphi(\Delta)$. Then based on (2.5.6),

$$S_\varphi(\Delta) = \sum_{j=1}^{n_2} a_\varphi(R(Y_j - \Delta)) . \quad (2.5.11)$$

Hence $\hat{\Delta}_\varphi$ equivalently solves the equation

$$S_\varphi(\hat{\Delta}_\varphi) \doteq 0 . \quad (2.5.12)$$

As with pseudo-norms in general, the function $\|Z - C\Delta\|_\varphi$ is a convex function of $\Delta$. The negative of its derivative, $S_\varphi(\Delta)$, is a decreasing step function of $\Delta$ which steps down at the differences $Y_j - X_i$; see Exercise 2.13.18. Unlike the MWW function $S_R(\Delta)$, the step sizes of $S_\varphi(\Delta)$ are not necessarily the same size. Based on MWW starting values, a simple trace algorithm through the differences can be used to obtain the estimator $\hat{\Delta}_\varphi$. The R function twosampr2 computes the rank-based analysis for general scores.
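A direct, if crude, way to carry out this computation is sketched below; twosampr2 is the full implementation, while this minimal version assumes x and y hold the samples and phi is the score function. It evaluates the gradient (2.5.11) and searches the differences $Y_j - X_i$ for the root of (2.5.12):

Sphi <- function(Delta, x, y, phi) {            # gradient (2.5.11)
  z <- c(x, y - Delta); n <- length(z)
  sum(phi(rank(z)[length(x) + seq_along(y)] / (n + 1)))
}
deltahat <- function(x, y, phi) {
  d <- sort(c(outer(y, x, "-")))                # step locations of S_phi
  s <- sapply(d, Sphi, x = x, y = y, phi = phi)
  mean(d[abs(s) == min(abs(s))])                # Delta where S_phi is nearest 0
}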

The gradient rank test statistic for the hypotheses (2.5.1) is

$$S_\varphi = \sum_{j=1}^{n_2} a_\varphi(R(Y_j)) . \quad (2.5.13)$$


Since the test statistic only depends on the ranks of the combined sample it is distribution-free under the null hypothesis. As shown in Exercise 2.13.18,

$$E_0[S_\varphi] = 0 \quad (2.5.14)$$

$$\sigma_\varphi^2 = V_0[S_\varphi] = \frac{n_1 n_2}{n(n-1)} \sum_{i=1}^n a^2(i) . \quad (2.5.15)$$

Note that we can write the variance as

$$\sigma_\varphi^2 = \frac{n_1 n_2}{n-1} \left\{ \sum_{i=1}^n a^2(i)\, \frac{1}{n} \right\} \doteq \frac{n_1 n_2}{n-1} , \quad (2.5.16)$$

where the approximation is due to the fact that the term in braces is a Riemann sum of $\int \varphi^2(u)\, du = 1$ and, hence, converges to 1.

It will be convenient from time to time to use rank statistics based on unstandardized scores; i.e., a rank statistic of the form

$$S_a = \sum_{j=1}^{n_2} a(R(Y_j)) , \quad (2.5.17)$$

where $a(i) = \varphi(i/(n+1))$, $i = 1, \ldots, n$, is a set of scores. As Exercise 2.13.18 shows, the null mean $\mu_S$ and null variance $\sigma_S^2$ of $S_a$ are given by

$$\mu_S = n_2 \bar{a} \quad \mbox{and} \quad \sigma_S^2 = \frac{n_1 n_2}{n(n-1)} \sum (a(i) - \bar{a})^2 . \quad (2.5.18)$$

2.5.1 Statistical Methods

The asymptotic null distribution of the statistic $S_\varphi$, (2.5.13), follows easily from Theorem A.2.1 of the Appendix. To see this, note that we can use the notation (2.2.1) and (2.2.2) to write $S_\varphi$ as a linear rank statistic; i.e.,

$$S_\varphi = \sum_{i=1}^n c_i\, a(R(Z_i)) = \sum_{i=1}^n (c_i - \bar{c})\, \varphi\left( \frac{n}{n+1} F_n(Z_i) \right) , \quad (2.5.19)$$

where $F_n$ is the empirical distribution function of $Z_1, \ldots, Z_n$. Our score function $\varphi$ is monotone and square integrable; hence, the conditions on scores in Section A.2 are satisfied. Also $F$ is continuous, so the distributional assumption is satisfied. Finally, we need only show that the constants $c_i$ satisfy conditions D.2, (3.4.7), and D.3, (3.4.8). It is a simple exercise to show that

$$\sum_{i=1}^n (c_i - \bar{c})^2 = \frac{n_1 n_2}{n} \quad \mbox{and} \quad \max_{1 \leq i \leq n} (c_i - \bar{c})^2 = \max\left\{ \frac{n_2^2}{n^2}, \frac{n_1^2}{n^2} \right\} .$$

Under condition (D.1), (2.4.7), $0 < \lambda_i < 1$, where $\lim (n_i/n) = \lambda_i$ for $i = 1, 2$. Using this along with the last two expressions, it is immediate that Noether's condition, (3.4.9), holds for the $c_i$'s. Thus the assumptions of Section A.2 hold for the statistic $S_\varphi$.

As in expression (A.2.7) of Section A.2, define the random variable $T_\varphi$ as

$$T_\varphi = \sum_{i=1}^n (c_i - \bar{c})\, \varphi(F(Z_i)) . \quad (2.5.20)$$

By comparing expressions (2.5.19) and (2.5.20), it seems that the variable $T_\varphi$ is an approximation of $S_\varphi$. This follows from Section A.2. Briefly, under $H_0$ the distribution of $T_\varphi$ is approximately normal and $\mbox{Var}((T_\varphi - S_\varphi)/\sigma_\varphi) \rightarrow 0$; hence, $S_\varphi$ is asymptotically normal with mean and variance given by expressions (2.5.14) and (2.5.15), respectively. Hence, an asymptotic level $\alpha$ test of the hypotheses (2.5.1) is

Reject $H_0$ in favor of $H_A$, if $S_\varphi \geq z_\alpha \sigma_\varphi$,

where $\sigma_\varphi$ is defined by (2.5.15).

As discussed above, the estimate $\hat{\Delta}_\varphi$ of $\Delta$ solves the equation (2.5.12). The interval $(\hat{\Delta}_L, \hat{\Delta}_U)$ is a $(1-\alpha)100\%$ confidence interval for $\Delta$ (based on the asymptotic distribution) provided $\hat{\Delta}_L$ and $\hat{\Delta}_U$ solve the equations

$$S_\varphi(\hat{\Delta}_U) \doteq -z_{\alpha/2} \sqrt{\frac{n_1 n_2}{n}} \quad \mbox{and} \quad S_\varphi(\hat{\Delta}_L) \doteq z_{\alpha/2} \sqrt{\frac{n_1 n_2}{n}} , \quad (2.5.21)$$

where $1 - \Phi(z_{\alpha/2}) = \alpha/2$. As with the estimate of $\Delta$, these equations can be easily solved with an iterative algorithm; see Exercise 2.13.18.
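One such simple algorithm walks through the ordered differences until the gradient crosses the cutoffs in (2.5.21). The sketch below reuses the illustrative Sphi function from the estimation sketch above:

ci.phi <- function(x, y, phi, alpha = 0.05) {
  n1 <- length(x); n2 <- length(y); n <- n1 + n2
  cut <- qnorm(1 - alpha/2) * sqrt(n1*n2/n)     # cutoffs in (2.5.21)
  d <- sort(c(outer(y, x, "-")))
  s <- sapply(d, Sphi, x = x, y = y, phi = phi) # S_phi is decreasing in Delta
  c(lower = max(d[s >= cut]), upper = min(d[s <= -cut]))
}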

2.5.2 Efficiency Results

In order to obtain the efficiency results for these statistics, we first show that the process $S_\varphi(\Delta)$ is Pitman regular. For general scores we need to further assume that the density has finite Fisher information; i.e., satisfies condition (E.1), (2.4.16). Recall that Fisher information is given by $I(f) = \int_0^1 \varphi_f^2(u)\, du$, where

$$\varphi_f(u) = -\frac{f'(F^{-1}(u))}{f(F^{-1}(u))} . \quad (2.5.22)$$

Below we will show that the score function $\varphi_f$ is optimal. Define the parameter $\tau_\varphi$ as

$$\tau_\varphi^{-1} = \int \varphi(u)\varphi_f(u)\, du . \quad (2.5.23)$$

Estimation of $\tau_\varphi$ is discussed in Section 3.7.
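As a quick illustration of (2.5.22), consider the standard case of logistic errors, $f(x) = e^{-x}(1 + e^{-x})^{-2}$. Then

$$-\frac{f'(x)}{f(x)} = 2F(x) - 1 , \quad \mbox{so} \quad \varphi_f(u) = 2u - 1 ;$$

standardizing so that $\int_0^1 \varphi^2(u)\, du = 1$ gives $\sqrt{12}\,(u - 1/2)$, the Wilcoxon score function. This is the sense in which the MWW analysis is asymptotically efficient at the logistic distribution; see Exercise 2.13.19.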


To show that the process $S_\varphi(\Delta)$ is Pitman regular, we show that the four conditions of Definition 1.5.3 are true. As noted after expression (2.5.12), $S_\varphi(\Delta)$ is nonincreasing; hence, the first condition holds. For the second condition, note that we can write

$$S_\varphi(\Delta) = \sum_{i=1}^{n_2} a(R(Y_i - \Delta)) = \sum_{i=1}^{n_2} \varphi\left( \frac{n_1}{n+1} F_{n_1}(Y_i - \Delta) + \frac{n_2}{n+1} F_{n_2}(Y_i) \right) , \quad (2.5.24)$$

where $F_{n_1}$ and $F_{n_2}$ are the empirical cdfs of the samples $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$, respectively. Hence, passing to the limit we have

$$E_0\left[ \frac{1}{n} S_\varphi(\Delta) \right] \rightarrow \lambda_2 \int_{-\infty}^{\infty} \varphi[\lambda_1 F(x) + \lambda_2 F(x - \Delta)]\, f(x - \Delta)\, dx = \lambda_2 \int_{-\infty}^{\infty} \varphi[\lambda_1 F(x + \Delta) + \lambda_2 F(x)]\, f(x)\, dx = \mu_\varphi(\Delta) ; \quad (2.5.25)$$

see Chernoff and Savage (1958) for a rigorous proof of the limit. Differentiating $\mu_\varphi(\Delta)$ and evaluating the derivative at 0 we obtain

$$\mu_\varphi'(0) = \lambda_1 \lambda_2 \int_{-\infty}^{\infty} \varphi'[F(t)]\, f^2(t)\, dt = \lambda_1 \lambda_2 \int_{-\infty}^{\infty} \varphi[F(t)] \left( -\frac{f'(t)}{f(t)} \right) f(t)\, dt = \lambda_1 \lambda_2 \int_0^1 \varphi(u)\varphi_f(u)\, du = \lambda_1 \lambda_2 \tau_\varphi^{-1} > 0 . \quad (2.5.26)$$

Hence, the second condition is satisfied. The null asymptotic distribution of $S_\varphi(0)$ was established in Section 2.5.1; hence the fourth condition is true. Thus we need only establish asymptotic linearity. This result follows from the results for general rank regression statistics which are developed in Section A.2.2 of the Appendix. By Theorem A.2.8 of the Appendix, the asymptotic linearity result for $S_\varphi(\Delta)$ is given by

$$\frac{1}{\sqrt{n}} S_\varphi(\delta/\sqrt{n}) = \frac{1}{\sqrt{n}} S_\varphi(0) - \tau_\varphi^{-1} \lambda_1 \lambda_2 \delta + o_p(1) , \quad (2.5.27)$$

uniformly for $|\delta| \leq B$, where $B > 0$ and $\tau_\varphi$ is defined in (2.5.23). Therefore, following Definition 1.5.3 of Chapter 1, the estimating function is Pitman regular.

By the discussion following (2.5.20), we have that $n^{-1/2} S_\varphi(0)/\sqrt{\lambda_1 \lambda_2}$ is asymptotically $N(0,1)$. The efficacy of the test based on $S_\varphi$ is thus given by

$$c_\varphi = \frac{\tau_\varphi^{-1} \lambda_1 \lambda_2}{\sqrt{\lambda_1 \lambda_2}} = \tau_\varphi^{-1} \sqrt{\lambda_1 \lambda_2} . \quad (2.5.28)$$


As with the MWW analysis, several important items follow immediately from Pitman regularity. Consider first the behavior of $S_\varphi$ under local alternatives. Specifically, consider a level $\alpha$ test based on $S_\varphi$ for the hypothesis (2.5.1) and the sequence of local alternatives $H_n : \Delta_n = \delta/\sqrt{n}$. As in Chapter 1, it is easy to show that the asymptotic power of the test based on $S_\varphi$ is given by

$$\lim_{n \rightarrow \infty} P_{\delta/\sqrt{n}}[S_\varphi \geq z_\alpha \sigma_\varphi] = 1 - \Phi(z_\alpha - \delta c_\varphi) . \quad (2.5.29)$$

Based on this result, sample size determination for the test based on $S_\varphi$ can be conducted similar to that based on the MWW test statistic; see (2.4.25).

Next consider the asymptotic distribution of the estimator $\hat{\Delta}_\varphi$. Recall that the estimate $\hat{\Delta}_\varphi$ solves the equation $S_\varphi(\hat{\Delta}_\varphi) \doteq 0$. Based on Pitman regularity and Theorem 1.5.7 of Chapter 1, the asymptotic distribution of $\hat{\Delta}_\varphi$ is given by

$$\sqrt{n}(\hat{\Delta}_\varphi - \Delta) \stackrel{D}{\rightarrow} N(0, \tau_\varphi^2 (\lambda_1 \lambda_2)^{-1}) . \quad (2.5.30)$$

By using (2.5.27) and $T_\varphi(0)$ to approximate $S_\varphi(0)$, we have the following useful result:

$$\sqrt{n}\hat{\Delta}_\varphi = \frac{\tau_\varphi}{\lambda_1 \lambda_2}\, \frac{1}{\sqrt{n}} T_\varphi(0) + o_p(1) . \quad (2.5.31)$$

We want to select scores such that the efficacy $c_\varphi$, (2.5.28), is as large as possible, or equivalently such that the asymptotic variance of $\hat{\Delta}_\varphi$ is as small as possible. How large can the efficacy be? Similar to (1.8.26), note that we can write

$$\tau_\varphi^{-1} = \int \varphi(u)\varphi_f(u)\, du = \sqrt{\int \varphi_f^2(u)\, du}\; \frac{\int \varphi(u)\varphi_f(u)\, du}{\sqrt{\int \varphi_f^2(u)\, du}\, \sqrt{\int \varphi^2(u)\, du}} = \rho \sqrt{\int \varphi_f^2(u)\, du} . \quad (2.5.32)$$

The second equality is true since the scores are standardized as above. In the third expression, $\rho$ is a correlation coefficient and $\int \varphi_f^2(u)\, du$ is the Fisher location information, (2.4.16), which we denoted by $I(f)$. By the Rao-Cramér lower bound, the smallest asymptotic variance obtainable by an asymptotically unbiased estimate is $(\lambda_1 \lambda_2 I(f))^{-1}$. Such an estimate is called asymptotically efficient. Choosing a score function to maximize (2.5.32) is equivalent to choosing a score function to make $\rho = 1$. This can be achieved by taking the score function to be $\varphi(u) = \varphi_f(u)$, (2.5.22). The resulting estimate, $\hat{\Delta}_\varphi$, is asymptotically efficient. Of course this can be accomplished only provided that the form of $f$ is known; see Exercise 2.13.19. Evidently, the closer the chosen score is to $\varphi_f$, the more powerful the rank analysis will be.


In Exercise 2.13.19, the reader is asked to show that the MWW analysis is asymptotically efficient if the errors have a logistic distribution. For normal errors, it follows in a few steps from expression (2.4.17) that the optimal scores are generated by the normal scores function

$$\varphi_N(u) = \Phi^{-1}(u) , \quad (2.5.33)$$

where $\Phi(u)$ is the distribution function of a standard normal random variable. Exercise 2.13.19 shows that this score function is standardized. These scores yield an asymptotically efficient analysis if the errors truly have a normal distribution and, further, $e(\varphi_N, L_2) \geq 1$; see Theorem 1.8.1. Also, unlike the Mann-Whitney-Wilcoxon analysis, the estimate of the shift $\Delta$ based on the normal scores cannot be obtained in closed form. But as mentioned above for general scores, provided the score function is nondecreasing, simple iterative algorithms can be used to obtain the estimate and the corresponding confidence interval for $\Delta$. In the next sections we will discuss analyses that are asymptotically efficient for other distributions.
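In terms of the illustrative sketches given earlier in this section, the normal scores analysis is just a different choice of score function; e.g., assuming those hypothetical helpers are in place:

phiN <- function(u) qnorm(u)        # normal scores function (2.5.33)
rank.score.test(x, y, phiN)         # standardized test statistic
deltahat(x, y, phiN)                # estimate of the shift
ci.phi(x, y, phiN)                  # 95% confidence interval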

Example 2.5.1. Quail Data, continued from Example 2.3.1

In the larger study, McKean et al. (1989), from which these data were drawn, the responses were positively skewed with long right tails, although outliers frequently occurred in the left tail also. McKean et al. conducted an investigation of estimates of the score functions for over 20 of these experiments. Classes of simple scores which seemed appropriate for such data were piecewise linear, with one piece which is linear on the first part of the interval, $(0, b)$, and a second piece which is constant on the second part, $(b, 1)$; i.e., scores of the form

$$\varphi_b(u) = \left\{ \begin{array}{ll} \frac{2}{b(2-b)}\, u - 1 & \mbox{if } 0 < u < b \\ \frac{b}{2-b} & \mbox{if } b \leq u < 1 \end{array} \right. . \quad (2.5.34)$$

These scores are optimal for densities with left logistic and right exponential tails; see Exercise 2.13.19. A value of $b$ which seemed appropriate for this type of data was 3/4. Let $S_{3/4} = \sum a_{3/4}(R(Y_j))$ denote the test statistic based on these scores. The RBR function phibentr with the argument param = 0.75 computes these scores. Using the RBR function twosampr2 with the argument score = phibentr computes the rank-based analysis for the score function (2.5.34). Assuming that the treated and control observations are in x and y, respectively, the call and the resulting analysis for a one-sided test as computed by R is:

> tempb = twosampr2(x,y,test=T,alt=1,delta0=0,score=phibentr,grad=sphir,

param=.75,alpha=.05,maktable=T)

Test of Delta = 0 Alternative selected is 1

Standardized (z) Test-Statistic 1.787738 and p-vlaue 0.03690915

Estimate 15.5 SE is 7.921817


95 % Confidence Interval is ( -2 , 28 )

Estimate of the scale parameter tau 20.45404

Comparing p-values, the analysis based on the score function (2.5.34) is a little more precise than the MWW analysis given in Example 2.3.1. Recall that the data are right skewed, so this result is not surprising.

For another class of scores similar to (2.5.34), see the discussion around expression (3.10.6) in Chapter 3.
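The bent scores themselves are simple to generate directly; the RBR function phibentr plays this role, but a minimal stand-in (illustrative only) is:

phib <- function(u, b = 0.75)                    # bent scores, (2.5.34)
  ifelse(u < b, 2*u/(b*(2 - b)) - 1, b/(2 - b))
integrate(phib, 0, 1)$value     # ~ 0: the scores are centered
phib(0.75); phib(0.75 - 1e-9)   # ~ 0.6 on both sides: continuous at b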

2.5.3 Connection between One and Two Sample Scores

In Theorem 2.5.2 we discussed how to obtain a corresponding two sample score function given a one sample score function. Here we reverse the problem, showing how to obtain a one sample score function from a two sample score function. This will provide a natural estimate of $\theta$ in (2.2.4). We also show that the efficiencies and asymptotic properties are the same for such corresponding score functions.

Consider the location model but further assume that $X$ has a symmetric distribution. Then $Y$ also has a symmetric distribution. For associated one sample problems, we could then use the signed-rank methods developed in Chapter 1. What one sample scores should we select?

First consider what two sample scores would be suitable under symmetry. Assume without loss of generality that $X$ is symmetrically distributed about 0. Recall that the optimal scores are given by expression (2.5.22). Using the fact that $F(x) = 1 - F(-x)$, it is easy to see (Exercise 2.13.20) that the optimal scores satisfy

$$\varphi_f(u) = -\varphi_f(1 - u) , \ \mbox{for} \ 0 < u < 1 ;$$

that is, the optimal score function is odd about $\frac{1}{2}$. Hence for symmetric distributions it makes sense to consider two sample scores which are odd about $\frac{1}{2}$.

For this subsection, then, assume that the two sample score generating function satisfies the property

$$\mbox{(S.3)} \quad \varphi(1 - u) = -\varphi(u) . \quad (2.5.35)$$

Note that such scores satisfy $\varphi(1/2) = 0$ and $\varphi(u) \geq 0$ for $u \geq 1/2$. Define a one sample score generating function as

$$\varphi^+(u) = \varphi\left( \frac{u+1}{2} \right) \quad (2.5.36)$$

and the one sample scores as

$$a^+(i) = \varphi^+\left( \frac{i}{n+1} \right) . \quad (2.5.37)$$

It follows that these one sample scores are nonnegative and nondecreasing.

For example, if we use the Wilcoxon two sample scores, that is, scores generated by the function $\varphi(u) = \sqrt{12}\left( u - \frac{1}{2} \right)$, then the associated one sample score generating function is $\varphi^+(u) = \sqrt{3}\, u$ and, hence, the one sample scores are the Wilcoxon signed-rank scores. If instead we use the two sample sign scores, $\varphi(u) = \mbox{sgn}(2u - 1)$, then the one sample score function is $\varphi^+(u) = 1$. This results in the one sample sign scores.

Suppose we use two sample scores which satisfy (2.5.35) and use the associated one sample scores. Then the corresponding one and two sample efficacies satisfy

$$c_\varphi = \sqrt{\lambda_1 \lambda_2}\, c_{\varphi^+} , \quad (2.5.38)$$

where the efficacies are given by expressions (2.5.28) and (1.8.21). Hence the efficiency and asymptotic properties of the one and two sample analyses are the same. As a final remark, if we write the model as in expression (2.2.4), then we can use the rank statistic based on the two sample scores to estimate $\Delta$. We next form the residuals $Z_i - \hat{\Delta} c_i$. Then, using the one sample scores statistic of Chapter 1, we can estimate $\theta$ based on these residuals, as discussed in Chapter 1. In terms of a regression problem we are estimating the intercept parameter $\theta$ based on the residuals after fitting the regression coefficient $\Delta$. This is discussed in some detail in Section 3.5.

2.6 L1 Analyses

In this section, we present analyses based on the $L_1$ norm and pseudo-norm. We discuss the pseudo-norm first, showing that the corresponding test is the familiar Mood's (1950) test. The test which corresponds to the norm is Mathisen's (1943) test.

2.6.1 Analysis Based on the L1 Pseudo Norm

Consider the sign scores; these are the scores generated by the function $\varphi(u) = \mbox{sgn}(u - 1/2)$. The corresponding pseudo-norm is given by

$$\|u\|_\varphi = \sum_{i=1}^n \mbox{sgn}\left( R(u_i) - \frac{n+1}{2} \right) u_i . \quad (2.6.1)$$

This pseudo-norm is optimal for double exponential errors; see Exercise 2.13.19.

We have the following relationship between the $L_1$ pseudo-norm and the $L_1$ norm. Note that we can write

$$\|u\|_\varphi = \sum_{i=1}^n \mbox{sgn}\left( i - \frac{n+1}{2} \right) u_{(i)} .$$

Next consider

$$\sum_{i=1}^n |u_{(i)} - u_{(n-i+1)}| = \sum_{i=1}^n \mbox{sgn}(u_{(i)} - u_{(n-i+1)})(u_{(i)} - u_{(n-i+1)}) = 2 \sum_{i=1}^n \mbox{sgn}(u_{(i)} - u_{(n-i+1)})\, u_{(i)} .$$

Finally note that

$$\mbox{sgn}(u_{(i)} - u_{(n-i+1)}) = \mbox{sgn}(i - (n-i+1)) = \mbox{sgn}\left( i - \frac{n+1}{2} \right) .$$

Putting these results together we have the relationship

$$\sum_{i=1}^n |u_{(i)} - u_{(n-i+1)}| = 2 \sum_{i=1}^n \mbox{sgn}\left( i - \frac{n+1}{2} \right) u_{(i)} = 2\|u\|_\varphi . \quad (2.6.2)$$

Recall that the pseudo-norm based on Wilcoxon scores can be expressed as the sum of all absolute differences between the components; see (2.2.17). In contrast, the pseudo-norm based on the sign scores involves only the $n$ symmetric absolute differences $|u_{(i)} - u_{(n-i+1)}|$.

In the two sample location model the corresponding R-estimate based on the pseudo-norm (2.6.1) is a value of $\Delta$ which solves the equation

$$S_\varphi(\Delta) = \sum_{j=1}^{n_2} \mbox{sgn}\left( R(Y_j - \Delta) - \frac{n+1}{2} \right) \doteq 0 . \quad (2.6.3)$$

Note that we are ranking the set $\{X_1, \ldots, X_{n_1}, Y_1 - \Delta, \ldots, Y_{n_2} - \Delta\}$, which is equivalent to ranking the set $\{X_1 - \mbox{med}\, X_i, \ldots, X_{n_1} - \mbox{med}\, X_i, Y_1 - \Delta - \mbox{med}\, X_i, \ldots, Y_{n_2} - \Delta - \mbox{med}\, X_i\}$. We must choose $\Delta$ so that half of the ranks of the $Y$ part of this set are above $(n+1)/2$ and half are below. In the second set, half of the $X$ part is below 0 and half is above 0. Thus we need to choose $\Delta$ so that half of the $Y$ part of this set is below 0 and half is above 0. This is achieved by taking

$$\hat{\Delta} = \mbox{med}\, Y_j - \mbox{med}\, X_i . \quad (2.6.4)$$

This is the same estimate as produced by the $L_1$ norm; see the discussion following (2.2.5). We shall refer to the above pseudo-norm (2.6.1) as the $L_1$ pseudo-norm. Actually, as pointed out in Section 2.2, this equivalence between estimates based on the $L_1$ norm and the $L_1$ pseudo-norm is true for general regression problems in which the model includes an intercept, as it does here.

The corresponding test statistic for $H_0 : \Delta = 0$ is $\sum_{j=1}^{n_2} \mbox{sgn}\left( R(Y_j) - \frac{n+1}{2} \right)$. Note that the sgn function here is simply counting the number of $Y_j$'s which are above the combined sample median $M = \mbox{med}\, \{X_1, \ldots, X_{n_1}, Y_1, \ldots, Y_{n_2}\}$ minus the number below $M$. Hence a more convenient but equivalent test statistic is

$$M_0^+ = \#(Y_j > M) , \quad (2.6.5)$$

which is called Mood's median test statistic; see Mood (1950).


Testing

Since this $L_1$ analysis is based on a rank-based pseudo-norm, we could use the general theory discussed in Section 2.5 to handle the theory for estimation and testing. As we point out, though, there are some interesting results pertaining to this analysis.

For the null distribution of $M_0^+$, first assume that $n$ is even. Without loss of generality, assume that $n = 2r$ and $n_1 \geq n_2$. Consider the combined sample as a population of $n$ items, where $n_2$ of the items are $Y$'s and $n_1$ items are $X$'s. Think of the $n/2$ items which exceed $M$. Under $H_0$ these items are as likely to be an $X$ as a $Y$. Hence $M_0^+$, the number of $Y$'s in the top half of the sample, follows the hypergeometric distribution, i.e.,

$$P(M_0^+ = k) = \frac{\binom{n_2}{k} \binom{n_1}{r-k}}{\binom{n}{r}} , \quad k = 0, \ldots, n_2 ,$$

where $r = n/2$. If $n$ is odd the same result holds, except in this case $r = (n-1)/2$. Thus as a level $\alpha$ decision rule, we would reject $H_0 : \Delta = 0$ in favor of $H_A : \Delta > 0$ if $M_0^+ \geq c_\alpha$, where $c_\alpha$ could be determined from the hypergeometric distribution or approximated by the binomial distribution. From the properties of the hypergeometric distribution, $E_0[M_0^+] = r(n_2/n)$ and $V_0[M_0^+] = (r n_1 n_2 (n-r))/(n^2(n-1))$. Under the assumption D.1, (2.4.7), it follows that the limiting distribution of $M_0^+$ is normal.
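An exact version of this test is immediate from the hypergeometric null distribution. The sketch below is a minimal illustration, assuming x and y hold the samples; the function name is chosen to avoid clashing with base R's mood.test, which tests scale:

mood.median.test <- function(x, y) {
  n1 <- length(x); n2 <- length(y); n <- n1 + n2
  M <- median(c(x, y))
  r <- if (n %% 2 == 0) n/2 else (n - 1)/2
  M0 <- sum(y > M)                               # Mood's statistic (2.6.5)
  # exact p-value P(M0+ >= observed) under H0
  pval <- phyper(M0 - 1, m = n2, n = n1, k = r, lower.tail = FALSE)
  c(statistic = M0, p.value = pval)
}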

Confidence Intervals

Exercise 2.13.21 shows that, for $n = 2r$,

$$M_0^+(\Delta) = \#(Y_j - \Delta > M) = \sum_{i=1}^{n_2} I(Y_{(i)} - X_{(r-i+1)} - \Delta > 0) , \quad (2.6.6)$$

and furthermore that the $n_2$ differences

$$Y_{(1)} - X_{(r)} < Y_{(2)} - X_{(r-1)} < \cdots < Y_{(n_2)} - X_{(r-n_2+1)}$$

can be ordered knowing only the order statistics from the individual samples. It is further shown that if $k$ is such that $P(M_0^+ \leq k) = \alpha/2$, then a $(1-\alpha)100\%$ confidence interval for $\Delta$ is given by

$$(Y_{(k+1)} - X_{(r-k)}, \ Y_{(n_2-k)} - X_{(r-n_2+k+1)}) .$$

The above confidence interval simplifies when $n_1 = n_2 = m$, say. In this case the interval becomes

$$(Y_{(k+1)} - X_{(m-k)}, \ Y_{(m-k)} - X_{(k+1)}) ,$$

which is the difference in endpoints of the two simple $L_1$ confidence intervals $(X_{(k+1)}, X_{(m-k)})$ and $(Y_{(k+1)}, Y_{(m-k)})$ which were discussed in Section 1.11. Using the normal approximation to the hypergeometric we have $k \doteq m/2 - z_{\alpha/2}\sqrt{m^2/(4(2m-1))} - .5$. Hence, the above two intervals have confidence coefficient

$$\gamma \doteq 1 - 2\Phi\left( \frac{k - m/2}{\sqrt{m/4}} \right) = 1 - 2\Phi\left( -z_{\alpha/2}\sqrt{m/(2m-1)} \right) \doteq 1 - 2\Phi\left( -z_{\alpha/2}\, 2^{-1/2} \right) .$$

For example, for the equal sample size case, a 5% two-sided Mood's test is equivalent to rejecting the null hypothesis if the 84% one sample $L_1$ confidence intervals are disjoint. While this also could be done for the unequal sample sizes case, we recommend the direct approach of Section 1.11.

Efficiency Results

We will obtain the efficiency results from the asymptotic distribution of the estimate, $\hat{\Delta} = \mbox{med}\, Y_j - \mbox{med}\, X_i$, of $\Delta$. Equivalently, we could obtain the results by the asymptotic linearity that was derived for arbitrary scores in (2.5.27); see Exercise 2.13.22.

Theorem 2.6.1. Under the conditions cited in Example 1.5.2 (the $L_1$ Pitman regularity conditions) and (2.4.7), we have

$$\sqrt{n}(\hat{\Delta} - \Delta) \stackrel{D}{\rightarrow} N(0, (\lambda_1 \lambda_2 4 f^2(0))^{-1}) . \quad (2.6.7)$$

Proof: Without loss of generality assume that $\Delta$ and $\theta$ are 0. We can write

$$\sqrt{n}\hat{\Delta} = \frac{\sqrt{n}}{\sqrt{n_2}}\, \sqrt{n_2}\, \mbox{med}\, Y_j - \frac{\sqrt{n}}{\sqrt{n_1}}\, \sqrt{n_1}\, \mbox{med}\, X_i .$$

From Example 1.5.2 we have

$$\sqrt{n_2}\, \mbox{med}\, Y_j = \frac{1}{2f(0)}\, \frac{1}{\sqrt{n_2}} \sum_{j=1}^{n_2} \mbox{sgn}\, Y_j + o_p(1) ;$$

hence, $\sqrt{n_2}\, \mbox{med}\, Y_j \stackrel{D}{\rightarrow} Z_2$, where $Z_2$ is $N(0, (4f^2(0))^{-1})$. Likewise $\sqrt{n_1}\, \mbox{med}\, X_i \stackrel{D}{\rightarrow} Z_1$, where $Z_1$ is $N(0, (4f^2(0))^{-1})$. Since $Z_1$ and $Z_2$ are independent, we have that $\sqrt{n}\hat{\Delta} \stackrel{D}{\rightarrow} \lambda_2^{-1/2} Z_2 - \lambda_1^{-1/2} Z_1$, which yields the result.

The efficacy of Mood's test is thus $\sqrt{\lambda_1 \lambda_2}\, 2f(0)$. The asymptotic relative efficiency of Mood's test to the two-sample $t$-test is $4\sigma^2 f^2(0)$, while its asymptotic relative efficiency with respect to the MWW test is $f^2(0)/(3(\int f^2)^2)$. These are the same as the efficiency results of the sign test to the $t$-test and to the Wilcoxon signed-rank test, respectively, that were obtained in Chapter 1; see Section 1.7.

Example 2.6.1. Quail Data, continued, Example 2.3.1


For the quail data the median of the combined samples is $M = 64$. For the subsequent test based on Mood's test we eliminated the three data points which had this value. Thus $n = 27$, $n_1 = 9$ and $n_2 = 18$. The value of Mood's test statistic is $M_0^+ = \#(P_j > 64) = 11$. Since $E_{H_0}(M_0^+) = 8.67$ and $V_{H_0}(M_0^+) = 1.55$, the standardized value (using the continuity correction) is 1.47 with a p-value of .071. Using all the data, the point estimate corresponding to Mood's test is 19, while a 90% confidence interval, using the normal approximation, is $(-10, 31)$.

2.6.2 Analysis Based on the L1 Norm

Another sign-type procedure is based on the $L_1$ norm. Reconsider expression (2.2.7), which is the partial derivative of the $L_1$ dispersion function with respect to $\Delta$. We take the parameter $\theta$ as a nuisance parameter and estimate it by $\mbox{med}\, X_i$. An aligned sign test procedure for $\Delta$ is then obtained by aligning the $Y_j$'s with respect to this estimate of $\theta$. The process of interest, then, is

$$S(\Delta) = \sum_{j=1}^{n_2} \mbox{sgn}(Y_j - \mbox{med}\, X_i - \Delta) .$$

A test of $H_0 : \Delta = 0$ is based on the statistic

$$M_a^+ = \#(Y_j > \mbox{med}\, X_i) . \quad (2.6.8)$$

This statistic was proposed by Mathisen (1943) and is also referred to as the control median test; see Gastwirth (1968). The estimate of $\Delta$ obtained by solving $S(\hat{\Delta}) \doteq 0$ is, of course, the $L_1$ estimate $\hat{\Delta} = \mbox{med}\, Y_j - \mbox{med}\, X_i$.

Testing

Mathisen's test statistic, similar to Mood's, has a hypergeometric distribution under $H_0$.

Theorem 2.6.2. Suppose $n_1$ is odd and is written as $n_1 = 2n_1^* + 1$. Then under $H_0 : \Delta = 0$,

$$P(M_a^+ = t) = \frac{\binom{n_1^* + t}{n_1^*} \binom{n_2 - t + n_1^*}{n_1^*}}{\binom{n}{n_1}} , \quad t = 0, 1, \ldots, n_2 .$$

Proof: The proof will be based on a conditional argument. Given $X_{(n_1^*+1)} = x$, $M_a^+$ is binomial with $n_2$ trials and $1 - F(x)$ as the probability of success. The density of $X_{(n_1^*+1)}$ is

$$f^*(x) = \frac{n_1!}{(n_1^*!)^2}\, (1 - F(x))^{n_1^*} F(x)^{n_1^*} f(x) .$$


Using this and the fact that the samples are independent we get

$$P(M_a^+ = t) = \int \binom{n_2}{t} (1 - F(x))^t F(x)^{n_2 - t} f^*(x)\, dx = \binom{n_2}{t} \frac{n_1!}{(n_1^*!)^2} \int (1 - F(x))^{t + n_1^*} F(x)^{n_1^* + n_2 - t} f(x)\, dx = \binom{n_2}{t} \frac{n_1!}{(n_1^*!)^2} \int_0^1 (1 - u)^{t + n_1^*} u^{n_1^* + n_2 - t}\, du .$$

By properties of the beta function this reduces to the result.

Once again using the conditional argument, we obtain the moments of $M_a^+$ as

$$E_0[M_a^+] = \frac{n_2}{2} \quad (2.6.9)$$

$$V_0[M_a^+] = \frac{n_2(n+1)}{4(n_1+2)} ; \quad (2.6.10)$$

see Exercise 2.13.23. The result when $n_1$ is even is found in Exercise 2.13.23. For the asymptotic null distribution of $M_a^+$ we shall make use of the linearity result for the sign process derived in Chapter 1; see Example 1.5.2.

Theorem 2.6.3. Under $H_0$ and D.1, (2.4.7), $M_a^+$ has an approximate $N\left( \frac{n_2}{2}, \frac{n_2(n+1)}{4(n_1+2)} \right)$ distribution.

Proof: Assume without loss of generality that the true median of $X$ and $Y$ is 0. Let $\hat{\theta} = \mbox{med}\, X_i$. Note that

$$M_a^+ = \left( \sum_{j=1}^{n_2} \mbox{sgn}(Y_j - \hat{\theta}) + n_2 \right) \Big/ 2 . \quad (2.6.11)$$

Clearly under (D.1), $\sqrt{n_2}\hat{\theta}$ is bounded in probability. Hence, by the asymptotic linearity result for the $L_1$ analysis obtained in Example 1.5.2, we have

$$n_2^{-1/2} \sum_{j=1}^{n_2} \mbox{sgn}(Y_j - \hat{\theta}) = n_2^{-1/2} \sum_{j=1}^{n_2} \mbox{sgn}(Y_j) - 2f(0)\sqrt{n_2}\hat{\theta} + o_p(1) .$$

But we also have

$$\sqrt{n_1}\hat{\theta} = (2f(0)\sqrt{n_1})^{-1} \sum_{i=1}^{n_1} \mbox{sgn}(X_i) + o_p(1) .$$

Therefore

$$n_2^{-1/2} \sum_{j=1}^{n_2} \mbox{sgn}(Y_j - \hat{\theta}) = n_2^{-1/2} \sum_{j=1}^{n_2} \mbox{sgn}(Y_j) - \sqrt{n_2/n_1}\, n_1^{-1/2} \sum_{i=1}^{n_1} \mbox{sgn}(X_i) + o_p(1) .$$

Note that

$$n_2^{-1/2} \sum_{j=1}^{n_2} \mbox{sgn}(Y_j) \stackrel{D}{\rightarrow} N(0, 1)$$

and

$$\sqrt{n_2/n_1}\, n_1^{-1/2} \sum_{i=1}^{n_1} \mbox{sgn}(X_i) \stackrel{D}{\rightarrow} N(0, \lambda_2/\lambda_1) .$$

The result follows from these asymptotic distributions, the independence of the samples, expression (2.6.11), and the fact that asymptotically the variance of $M_a^+$ satisfies

$$\frac{n_2(n+1)}{4(n_1+2)} \doteq n_2 (4\lambda_1)^{-1} .$$

Confidence Intervals

Note that $M_a^+(\Delta) = \#(Y_j - \Delta > \hat{\theta}) = \#(Y_j - \hat{\theta} > \Delta)$; hence, if $k$ is such that $P_0(M_a^+ \leq k) = \alpha/2$, then $(Y_{(k+1)} - \hat{\theta}, Y_{(n_2-k)} - \hat{\theta})$ is a $(1-\alpha)100\%$ confidence interval for $\Delta$. For testing the two-sided hypothesis $H_0 : \Delta = 0$ versus $H_A : \Delta \neq 0$ we would reject $H_0$ if 0 is not in the confidence interval. This is equivalent, however, to rejecting if $\hat{\theta}$ is not in the interval $(Y_{(k+1)}, Y_{(n_2-k)})$.

Suppose we determine $k$ by the normal approximation. Then

$$k \doteq \frac{n_2}{2} - z_{\alpha/2} \sqrt{\frac{n_2(n+1)}{4(n_1+2)}} - .5 \doteq \frac{n_2}{2} - z_{\alpha/2} \sqrt{\frac{n_2}{4\lambda_1}} - .5 .$$

The confidence interval $(Y_{(k+1)}, Y_{(n_2-k)})$ is a $\gamma 100\%$, $\gamma = 1 - 2\Phi(-z_{\alpha/2}(\lambda_1)^{-1/2})$, confidence interval based on the sign procedure for the sample $Y_1, \ldots, Y_{n_2}$. Suppose we take $\alpha = .05$ and have the equal sample sizes case, so that $\lambda_1 = .5$. Then $\gamma = 1 - 2\Phi(-2\sqrt{2})$. Hence, the two-sided 5% test rejects $H_0 : \Delta = 0$ if $\hat{\theta}$ is not in the confidence interval.

Remarks on Efficiency

Since the estimator of $\Delta$ based on the Mathisen procedure is the same as that of Mood's procedure, the asymptotic relative efficiency results for Mathisen's procedure are the same as those of Mood's. Using another type of efficiency due to Bahadur (1967), Killeen, Hettmansperger and Sievers (1972) show that it is generally better to compute the median of the smaller sample.

Curtailed sampling on the $Y$'s is one situation where Mathisen's test would be used instead of Mood's test, since with Mathisen's test an early decision could be made; see Gastwirth (1968).

Example 2.6.2. Quail Data, continued, Examples 2.3.1 and 2.6.1


For this data, $\mbox{med}\, T_i = 49$. Since one of the placebo values was also 49, we eliminated it in the subsequent computation of Mathisen's test. The test statistic has the value $M_a^+ = \#(C_j > 49) = 17$. Using $n_2 = 19$ and $n_1 = 10$, the null mean and variance are 9.5 and 11.875, respectively. This leads to a standardized test statistic of 2.03 (using the continuity correction) with a p-value of .021. Utilizing all the data, the corresponding point estimate and confidence interval are 19 and $(6, 27)$. This differs from the MWW and Mood analyses; see Examples 2.3.1 and 2.6.1, respectively.

2.7 Robustness Properties

In this section we obtain the breakdown points and the influence functions of the $L_1$ and MWW estimates. We first consider the breakdown properties.

2.7.1 Breakdown Properties

We begin with the definition of an equivariant estimator of $\Delta$. For convenience, let the vectors $\mathbf{X}$ and $\mathbf{Y}$ denote the samples $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$, respectively. Also let $\mathbf{X} + a\mathbf{1} = (X_1 + a, \ldots, X_{n_1} + a)'$.

Definition 2.7.1. An estimator $\hat{\Delta}(\mathbf{X}, \mathbf{Y})$ of $\Delta$ is said to be an equivariant estimator of $\Delta$ if $\hat{\Delta}(\mathbf{X} + a\mathbf{1}, \mathbf{Y}) = \hat{\Delta}(\mathbf{X}, \mathbf{Y}) - a$ and $\hat{\Delta}(\mathbf{X}, \mathbf{Y} + a\mathbf{1}) = \hat{\Delta}(\mathbf{X}, \mathbf{Y}) + a$.

Note that the $L_1$ estimator and the Hodges-Lehmann estimator are both equivariant estimators of $\Delta$. Indeed, as Exercise 2.13.24 shows, any estimator based on the rank pseudo-norms discussed in Section 2.5 is an equivariant estimator of $\Delta$. As the following theorem shows, the breakdown point of an equivariant estimator is bounded above by .25.

Theorem 2.7.1. Suppose $n_1 \leq n_2$. Then the breakdown point of an equivariant estimator satisfies $\epsilon^* \leq \{[(n_1 + 1)/2] + 1\}/n$, where $[\cdot]$ denotes the greatest integer function.

Proof: Let $m = [(n_1+1)/2] + 1$. Suppose $\hat{\Delta}$ is an equivariant estimator such that $\epsilon^* > m/n$. Then the estimator remains bounded if $m$ points are corrupted. Let $\mathbf{X}^* = (X_1 + a, \ldots, X_m + a, X_{m+1}, \ldots, X_{n_1})'$. Since we have corrupted $m$ points, there exists a $B > 0$ such that

$$|\hat{\Delta}(\mathbf{X}^*, \mathbf{Y}) - \hat{\Delta}(\mathbf{X}, \mathbf{Y})| \leq B . \quad (2.7.1)$$

Next let $\mathbf{X}^{**} = (X_1, \ldots, X_m, X_{m+1} - a, \ldots, X_{n_1} - a)'$. Then $\mathbf{X}^{**}$ contains $n_1 - m \leq [n_1/2] \leq m$ altered points. Therefore,

$$|\hat{\Delta}(\mathbf{X}^{**}, \mathbf{Y}) - \hat{\Delta}(\mathbf{X}, \mathbf{Y})| \leq B . \quad (2.7.2)$$

Equivariance implies that $\hat{\Delta}(\mathbf{X}^{**}, \mathbf{Y}) = \hat{\Delta}(\mathbf{X}^*, \mathbf{Y}) + a$, since $\mathbf{X}^{**} = \mathbf{X}^* - a\mathbf{1}$. By (2.7.1) we have

$$\hat{\Delta}(\mathbf{X}, \mathbf{Y}) - B \leq \hat{\Delta}(\mathbf{X}^*, \mathbf{Y}) \leq \hat{\Delta}(\mathbf{X}, \mathbf{Y}) + B , \quad (2.7.3)$$

and hence by equivariance,

$$\hat{\Delta}(\mathbf{X}, \mathbf{Y}) - B + a \leq \hat{\Delta}(\mathbf{X}^{**}, \mathbf{Y}) \leq \hat{\Delta}(\mathbf{X}, \mathbf{Y}) + B + a . \quad (2.7.4)$$

Taking $a = 3B$ leads to a contradiction between (2.7.2) and (2.7.4).

By this theorem the maximum breakdown point of any equivariant estimator is roughly half of the smaller sample proportion. If the sample sizes are equal, then the best possible breakdown is 1/4.

Example 2.7.1. Breakdown of L1 and MWW estimates

The $L_1$ estimator of $\Delta$, $\hat{\Delta} = \mbox{med}\, Y_j - \mbox{med}\, X_i$, achieves the maximal breakdown since $\mbox{med}\, Y_j$ achieves the maximal breakdown in the one sample problem.

The Hodges-Lehmann estimate $\hat{\Delta}_R = \mbox{med}\, \{Y_j - X_i\}$ also achieves maximal breakdown. To see this, suppose we corrupt an $X_i$. Then $n_2$ differences $Y_j - X_i$ are corrupted. Hence between samples we maximize the corruption by corrupting the items in the smaller sample, so without loss of generality we can assume that $n_1 \leq n_2$. Suppose we corrupt $m$ $X_i$'s. In order to corrupt $\mbox{med}\, \{Y_j - X_i\}$ we must corrupt $(n_1 n_2)/2$ differences. Therefore $m n_2 \geq (n_1 n_2)/2$; i.e., $m \geq n_1/2$. Hence $\mbox{med}\, \{Y_j - X_i\}$ has maximal breakdown. Based on Exercise 1.12.13 of Chapter 1, the one sample estimate based on the Wilcoxon signed-rank statistic does not achieve the maximal breakdown value of 1/2 in the one sample problem.

2.7.2 Influence Functions

Recall from Section 1.6.1 that the influence function of a Pitman regular estimator based on a single sample $X_1, \ldots, X_n$ is the function $\Omega(z)$ when the estimator has the representation $n^{-1/2} \sum \Omega(X_i) + o_p(1)$. The estimators we are concerned with in this section are Pitman regular; hence, to determine their influence functions we need only obtain similar representations for them.

For the $L_1$ estimate we have, from the proof of Theorem 2.6.1, that

$$\sqrt{n}\hat{\Delta} = \sqrt{n}\left( \mbox{med}\, Y_j - \mbox{med}\, X_i \right) = \frac{1}{2f(0)}\, \frac{1}{\sqrt{n}} \left[ \sum_{j=1}^{n_2} \frac{\mbox{sgn}(Y_j)}{\lambda_2} - \sum_{i=1}^{n_1} \frac{\mbox{sgn}(X_i)}{\lambda_1} \right] + o_p(1) .$$

Hence the influence function of the $L_1$ estimate is

$$\Omega(z) = \left\{ \begin{array}{ll} -(\lambda_1 2f(0))^{-1}\, \mbox{sgn}(z) & \mbox{if $z$ is an $x$} \\ (\lambda_2 2f(0))^{-1}\, \mbox{sgn}(z) & \mbox{if $z$ is a $y$} \end{array} \right. ,$$

which is a bounded, discontinuous function.

which is a bounded discontinuous function.For the Hodges-Lehmann estimate, (2.2.18), note that we can write the linearity result

(2.4.23) as√n(S

+(δ/

√n) − 1/2) =

√n(S

+(0) − 1/2) − δ

∫f 2 + op(1) ,

Page 127: Robust Nonparametric Statistical Methods

2.7. ROBUSTNESS PROPERTIES 117

which upon substituting√n∆R for δ leads to

√n∆R =

(∫f 2

)−1 √n(S

+(0) − 1/2) + op(1) .

Recall the projection of the statistic SR(0)−1/2 given in Theorem 2.4.7. Since the differencebetween it and this statistic goes to zero in probability we can, after some algebra, obtainthe following representation for the Hodges-Lehmann estimator,

√n∆R =

(∫f 2

)−11√n

n2∑

j=1

F (Yj) − 1/2

λ2−

n2∑

i=1

F (Xi) − 1/2

λ1

+ op(1) .

Therefore the influence function for the Hodges-Lehmann estimate is

Ω(z) =

−(λ1

∫f 2)−1

(F (z) − 1/2) if z is an x(λ2

∫f 2)−1

(F (z) − 1/2) if z is an y,

which is easily seen to be bounded and continuous.For least squares, since the estimate is Y −X the influence function is

Ω(Z) =

−(λ1)

−1z if z is an x(λ2)

−1z if z is an y,

which is unbounded and continuous. The Hodges-Lehmann and L1 estimates attain themaximal breakdown point and have bounded influence functions; hence they are robust.On the other hand, the least squares estimate has 0 percent breakdown and an unboundedinfluence function. One bad point can destroy a least squares analysis.

For a general score function $\varphi(u)$, by (2.5.31) we have the asymptotic representation

$$\sqrt{n}\hat{\Delta}_\varphi = \frac{1}{\sqrt{n}} \left[ \sum_{i=1}^{n_1} \left( -\frac{\tau_\varphi}{\lambda_1} \right) \varphi(F(X_i)) + \sum_{i=1}^{n_2} \left( \frac{\tau_\varphi}{\lambda_2} \right) \varphi(F(Y_i)) \right] + o_p(1) .$$

Hence, the influence function of the R-estimate based on the score function $\varphi$ is given by

$$\Omega(z) = \left\{ \begin{array}{ll} -\frac{\tau_\varphi}{\lambda_1}\, \varphi(F(z)) & \mbox{if $z$ is an $x$} \\ \frac{\tau_\varphi}{\lambda_2}\, \varphi(F(z)) & \mbox{if $z$ is a $y$} \end{array} \right. ,$$

where $\tau_\varphi$ is defined by expression (2.5.23). In particular, the influence function is bounded provided the score generating function is bounded. Note that the influence function for the R-estimate based on normal scores is unbounded; hence, this estimate is not robust. Recall Example 1.8.1, in which the one sample normal scores estimate has an unbounded influence function (not robust) but has positive breakdown point (resistant). A rigorous derivation of these influence functions can be based on the influence function derived in Section A.5.2 of the Appendix.


2.8 Lehmann Alternatives and Proportional Hazards

Consider a two sample problem where the responses are lifetimes of subjects. We shall continue to denote the independent samples by $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$. Let $X_i$ and $Y_j$ have distribution functions $F(x)$ and $G(x)$, respectively. Since we are dealing with lifetimes, both $X_i$ and $Y_j$ are positive valued random variables. The hazard function for $X_i$ is defined by

$$h_X(t) = \frac{f(t)}{1 - F(t)}$$

and represents the likelihood that a subject will die at time $t$ given that he has survived until that time; see Exercise 2.13.25.

In this section we will consider the class of lifetime models that are called Lehmann alternative models, for which the distribution function $G$ satisfies

$$1 - G(x) = (1 - F(x))^\alpha , \quad (2.8.1)$$

where the parameter $\alpha > 0$. See Section 4.4 of Maritz (1981) for an overview of nonparametric methods for these models. The Lehmann model generalizes the exponential scale model $F(x) = 1 - \exp(-x)$ and $G(x) = 1 - (1 - F(x))^\alpha = 1 - \exp(-\alpha x)$. As shown in Exercise 2.13.25, the hazard function of $Y_j$ is given by $h_Y(t) = \alpha h_X(t)$; i.e., the hazard function of $Y_j$ is proportional to the hazard function of $X_i$; hence, these models are also referred to as proportional hazards models; see, also, Section 3.10. The null hypothesis can be expressed as $H_{L0} : \alpha = 1$. The alternative we will consider is $H_{LA} : \alpha < 1$; that is, $Y$ is less hazardous than $X$; i.e., $Y$ has more chance of long survival than $X$ and is stochastically larger than $X$. Note that

$$P_\alpha(Y > X) = E_\alpha[P(Y > X \mid X)] = E_\alpha[1 - G(X)] = E_\alpha[(1 - F(X))^\alpha] = (\alpha + 1)^{-1} . \quad (2.8.2)$$

The last equality holds since $1 - F(X)$ has a uniform $(0,1)$ distribution. Under $H_{LA}$, then, $P_\alpha(Y > X) > 1/2$; i.e., $Y$ tends to dominate $X$.

The MWW test statistic $S_R^+ = \#(Y_j > X_i)$ is a consistent test statistic for $H_{L0}$ versus $H_{LA}$, by Theorem 2.4.10. We reject $H_{L0}$ in favor of $H_{LA}$ for large values of $S_R^+$. Furthermore, by Theorem 2.4.4 and (2.8.2), we have that

$$E_\alpha[S_R^+] = n_1 n_2 E_\alpha[1 - G(X)] = \frac{n_1 n_2}{1 + \alpha} .$$

This suggests as an estimate of $\alpha$ the statistic

$$\hat{\alpha} = ((n_1 n_2)/S_R^+) - 1 . \quad (2.8.3)$$

By Theorem 2.4.5 it can be shown that

$$V_\alpha(S_R^+) = \frac{\alpha n_1 n_2}{(\alpha+1)^2} + \frac{n_1 n_2 (n_1 - 1)\alpha}{(\alpha+2)(\alpha+1)^2} + \frac{n_1 n_2 (n_2 - 1)\alpha^2}{(2\alpha+1)(\alpha+1)^2} ; \quad (2.8.4)$$

see Exercise 2.13.27. Using this result and the asymptotic distribution of $S_R^+$ under general alternatives, Theorem 2.4.9, we can obtain by the delta method the asymptotic variance of $\hat{\alpha}$:

$$\mbox{Var}\, \hat{\alpha} \doteq \frac{(1+\alpha)^2 \alpha}{n_1 n_2} \left\{ 1 + \frac{n_1 - 1}{\alpha + 2} + \frac{(n_2 - 1)\alpha}{2\alpha + 1} \right\} . \quad (2.8.5)$$

This can be used to obtain an asymptotic confidence interval for $\alpha$; see Exercise 2.13.27 for details. As in the example below, the bootstrap could also be used to estimate $\mbox{Var}(\hat{\alpha})$.

2.8.1 The Log Exponential and the Savage Statistic

Another rank test which is frequently used in this situation is the log-rank test proposed by Savage (1956). In order to obtain this test, first consider the special case where $X$ has the exponential distribution function $F(x) = 1 - e^{-x/\theta}$, for $\theta > 0$. In this case the hazard function of $X$ is a constant function. Consider the random variable $\epsilon = \log X - \log \theta$. In a few steps we can obtain its distribution function as

$$P[\epsilon \leq t] = P[\log X - \log \theta \leq t] = 1 - \exp(-e^t) ;$$

i.e., $\epsilon$ has an extreme value distribution. The density of $\epsilon$ is $f_\epsilon(t) = \exp(t - e^t)$. Hence, we can model $\log X$ as the location model:

$$\log X = \log \theta + \epsilon . \quad (2.8.6)$$

Next consider the distribution of $\log Y$. Using expression (2.8.1) and a few steps of algebra we get

$$P[\log Y \leq t] = 1 - \exp\left( -(\alpha/\theta)\, e^t \right) .$$

But from this it is easy to see that we can model $Y$ as

$$\log Y = \log \theta + \log \frac{1}{\alpha} + \epsilon , \quad (2.8.7)$$

where the error random variable has the above extreme value distribution. From (2.8.6) and (2.8.7) we see that the log-transformed problem is simply a two sample location problem with shift parameter $\Delta = -\log \alpha$. Here $H_{L0}$ is equivalent to $H_0 : \Delta = 0$ and $H_{LA}$ is equivalent to $H_A : \Delta > 0$. We shall refer to this model as the log exponential model for the remainder of this section. Thus any of the rank-based analyses that we have discussed in this chapter can be used to analyze this model.

Let's consider the analysis based on the optimal score function for the model. Based on Section 2.5 and Exercise 2.13.19, the optimal scores for the extreme value distribution are generated by the function

$$\varphi_{f_\epsilon}(u) = -(1 + \log(1 - u)) . \quad (2.8.8)$$


Hence the optimal rank test in the log exponential model is given by

$$S_L = \sum_{j=1}^{n_2} \varphi_{f_\epsilon}\left( \frac{R(\log Y_j)}{n+1} \right) = -\sum_{j=1}^{n_2} \left( 1 + \log\left( 1 - \frac{R(\log Y_j)}{n+1} \right) \right) = -\sum_{j=1}^{n_2} \left( 1 + \log\left( 1 - \frac{R(Y_j)}{n+1} \right) \right) , \quad (2.8.9)$$

where the last equality holds because the log transformation is strictly increasing, so the ranks of the $\log Y_j$'s are the ranks of the $Y_j$'s. We reject $H_{L0}$ in favor of $H_{LA}$ for large values of $S_L$. By (2.5.14) the null mean of $S_L$ is 0, while from (2.5.18) its null variance is given by

$$\sigma_{\varphi_{f_\epsilon}}^2 = \frac{n_1 n_2}{n(n-1)} \sum_{i=1}^n \left\{ 1 + \log\left( 1 - \frac{i}{n+1} \right) \right\}^2 . \quad (2.8.10)$$

Then an asymptotic level $\alpha$ test rejects $H_{L0}$ in favor of $H_{LA}$ if $S_L \geq z_\alpha \sigma_{\varphi_{f_\epsilon}}$.

Certainly the statistic $S_L$ can be used in the general Lehmann alternative model described above, although it is not optimal if $X$ does not have an exponential distribution. We shall discuss the efficiency of this test below.

For estimation, let $\hat{\Delta}$ be the estimate of $\Delta$ based on the optimal score function $\varphi_{f_\epsilon}$; that is, $\hat{\Delta}$ solves the equation

$$\sum_{j=1}^{n_2} \left( 1 + \log\left( 1 - \frac{R[\log(Y_j) - \hat{\Delta}]}{n+1} \right) \right) \doteq 0 . \quad (2.8.11)$$

Besides estimation, the confidence intervals discussed in Section 2.5 for general scores can be obtained for the score function $\varphi_{f_\epsilon}$; see Example 2.8.1 for an illustration.

Thus another estimate of $\alpha$ would be $\hat{\alpha} = \exp\{-\hat{\Delta}\}$. As discussed in Exercise 2.13.27, an asymptotic confidence interval for $\alpha$ can be formulated from this relationship. Keep in mind, though, that we are assuming that $X$ is exponentially distributed.

As a further note, since $\varphi_{f_\epsilon}(u)$ is an unbounded function, it follows from Section 2.7.2 that the influence function of $\hat{\Delta}$ is unbounded. Thus the estimate is not robust.

A frequently used test statistic, equivalent to $S_L$, was proposed by Savage. To derive it,

denote $R(Y_j)$ by $R_j$. Then we can write

$$\log\left( 1 - \frac{R_j}{n+1} \right) = \int_1^{1 - R_j/(n+1)} \frac{1}{t}\, dt = \int_{R_j/(n+1)}^{0} \frac{1}{1-t}\, dt .$$

We can approximate the magnitude of this last integral, $\int_0^{R_j/(n+1)} (1-t)^{-1}\, dt$, by the following Riemann sum:

$$\frac{1}{1 - R_j/(n+1)}\, \frac{1}{n+1} + \frac{1}{1 - (R_j - 1)/(n+1)}\, \frac{1}{n+1} + \cdots + \frac{1}{1 - (R_j - (R_j - 1))/(n+1)}\, \frac{1}{n+1} .$$

This simplifies to

$$\frac{1}{n+1-1} + \frac{1}{n+1-2} + \cdots + \frac{1}{n+1-R_j} = \sum_{i=n+1-R_j}^{n} \frac{1}{i} .$$


This suggests the rank statistic

$$\tilde{S}_L = -n_2 + \sum_{j=1}^{n_2} \sum_{i=n-R_j+1}^{n} \frac{1}{i} . \quad (2.8.12)$$

This statistic was proposed by Savage (1956). Note that it is a rank statistic with scores defined by

$$a_j = -1 + \sum_{i=n-j+1}^{n} \frac{1}{i} . \quad (2.8.13)$$

Exercise 2.13.28 shows that its null mean and variance are given by

$$E_{H_0}[\tilde{S}_L] = 0 \quad \mbox{and} \quad \tilde{\sigma}^2 = \frac{n_1 n_2}{n-1} \left\{ 1 - \frac{1}{n} \sum_{j=1}^n \frac{1}{j} \right\} . \quad (2.8.14)$$

Hence an asymptotic level $\alpha$ test is to reject $H_{L0}$ in favor of $H_{LA}$ if $\tilde{S}_L \geq \tilde{\sigma} z_\alpha$.

Based on the above Riemann sum it would seem that $\tilde{S}_L$ and $S_L$ are close statistics. Indeed they are asymptotically equivalent and, hence, both are optimal when $X$ is exponentially distributed; see Hajek and Sidak (1967) or Kalbfleisch and Prentice (1980) for details.
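The Savage statistic (2.8.12)-(2.8.14) is a few lines of R; the following is a minimal illustrative sketch (not from the RBR package), assuming x and y hold the two samples of lifetimes:

savage.test <- function(x, y) {
  n1 <- length(x); n2 <- length(y); n <- n1 + n2
  a <- sapply(1:n, function(j) sum(1/((n - j + 1):n)) - 1)   # scores (2.8.13)
  SL <- sum(a[rank(c(x, y))[n1 + (1:n2)]])                   # statistic (2.8.12)
  v <- n1*n2/(n - 1) * (1 - sum(1/(1:n))/n)                  # variance (2.8.14)
  c(statistic = SL, z = SL/sqrt(v))
}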

2.8.2 Efficiency Properties

We next derive the asymptotic relative efficiencies for the log exponential model with $f_\epsilon(t) = \exp(t - e^t)$. The MWW statistic, $S_R^+$, is a consistent test for the log exponential model. By (2.4.21), the efficacy of the Wilcoxon test is

$$c_{MWW} = \sqrt{12} \int f_\epsilon^2\, \sqrt{\lambda_1 \lambda_2} = \frac{\sqrt{3}}{2} \sqrt{\lambda_1 \lambda_2} ,$$

since $\int f_\epsilon^2(t)\, dt = 1/4$. Since the Savage test is asymptotically optimal, its efficacy is the square root of the Fisher information, i.e., $I^{1/2}(f_\epsilon)$ discussed in Section 2.5. This efficacy is $\sqrt{\lambda_1 \lambda_2}$. Hence the asymptotic relative efficiency of the Mann-Whitney-Wilcoxon test to the Savage test at the log exponential model is 3/4; see Exercise 2.13.29.

Recall that the efficacy of the $L_1$ procedures, both Mood's and Mathisen's, is $2 f_\epsilon(\theta_\epsilon) \sqrt{\lambda_1 \lambda_2}$, where $\theta_\epsilon$ denotes the median of the extreme value distribution. This turns out to be $\theta_\epsilon = \log(\log 2)$. Hence $f_\epsilon(\theta_\epsilon) = (\log 2)/2$, which leads to the efficacy $\sqrt{\lambda_1 \lambda_2}\, \log 2$ for the $L_1$ methods. Thus the asymptotic relative efficiency of the $L_1$ procedures with respect to the procedure based on Savage scores is $(\log 2)^2 = .480$. The asymptotic relative efficiency of the $L_1$ methods to the MWW at this model is .6406. Therefore there is a substantial loss of efficiency if $L_1$ methods are used for the log exponential model. This makes sense since the extreme value distribution has very light tails.


The variance of a random variable with density $f_\epsilon$ is $\pi^2/6$; hence the asymptotic relative efficiency of the $t$-test to the Savage test at the log exponential model is $6/\pi^2 = .608$. Hence, for the procedures analyzed in this chapter on the log exponential model, the Savage test is optimal followed, in order, by the MWW, $t$, and $L_1$ tests.

Example 2.8.1. Lifetimes of an Insulation Fluid.

The data below are drawn from an example on page 3 of Lawless (1982); see, also, Nelson (1982, p. 227). They consist of the breakdown times (in minutes) of an electrical insulating fluid when subject to two different levels of voltage stress, 30 and 32 kV. Suppose we are interested in testing to see if the lower level is less hazardous than the higher level.

Voltage Level   Times to Breakdown (Minutes)
30 kV (Y)       17.05  22.66  21.02  175.88  139.07  144.12  20.46  43.40  194.90  47.30  7.74
32 kV (X)       0.40   82.85  9.88   89.29   215.10  2.75    0.79   15.93  3.91   0.27   0.69   100.58  27.80  13.95  53.24

Let $Y$ and $X$ denote the logs of the breakdown times of the insulating fluid at the voltage stresses of 30 kV and 32 kV, respectively. Let $\Delta = \theta_Y - \theta_X$ denote the shift in locations. We are interested in testing $H_0 : \Delta = 0$ versus $H_A : \Delta > 0$. The comparison boxplots for the log-transformed data are displayed in the left panel of Figure 2.8.1. It appears that the lower level (30 kV) is less hazardous.

The RBR function twosampr2 with the score argument set at philogr obtains the analysis based on the log-rank scores. Briefly, the results are:

Test of Delta = 0 Alternative selected is 1

Standardized (z) Test-Statistic 1.302 and p-vlaue 0.096

Estimate 0.680 SE is 0.776

95 % Confidence Interval is (-0.261, 2.662)

Estimate of the scale parameter tau 1.95

The corresponding Mann-Whitney-Wilcoxon analysis is

Test of Delta = 0 Alternative selected is 1

Test Stat. S+ is 118 Standardized (z) Test-Stat. 1.816 and p-vlaue 0.034

MWW estimate of the shift in location is 1.297 SE is 0.944

95 % Confidence Interval is (-0.201, 3.355)

Estimate of the scale parameter tau 2.37


Figure 2.8.1: Comparison boxplots of the log 30 kV and log 32 kV breakdown times (left panel); exponential q-q plot, exponential quantiles versus the 32 kV breakdown times (right panel).

While the log-rank analysis is insignificant, the MWW analysis is significant at level 0.034. This difference is not surprising upon considering the q-q plot of the original data at the 32 kV level found in the right panel of Figure 2.8.1. The population quantiles are drawn from an exponential distribution. The plot indicates heavier tails than that of an exponential distribution. In turn, the error distribution for the location model would have heavier tails than the light-tailed extreme value distribution. Thus the MWW analysis is more appropriate. The two sample $t$-test has value 1.34 with a p-value also of .096. It was impaired by the heavy tails too.

Although the exponential model on the original data seems unlikely, for illustration we consider it. The sum of the ranks of the 30 kV ($Y$) sample is 184. The estimate of $\alpha$ based on the MWW statistic is .40. A 90% confidence interval for $\alpha$ based on the approximate (via the delta method) variance, (2.8.5), is $(.06, .74)$; while a 90% bootstrap confidence interval based on 1000 bootstrap samples is $(.15, .88)$. Hence the MWW test, the corresponding estimate of $\alpha$, and the two confidence intervals indicate that the lower voltage level is less hazardous than the higher level.
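These values are easily reproduced with the illustrative alpha.hat sketch from earlier in this section, using the insulation data (x holds the 32 kV sample, y the 30 kV sample):

y <- c(17.05, 22.66, 21.02, 175.88, 139.07, 144.12, 20.46, 43.40,
       194.90, 47.30, 7.74)
x <- c(0.40, 82.85, 9.88, 89.29, 215.10, 2.75, 0.79, 15.93,
       3.91, 0.27, 0.69, 100.58, 27.80, 13.95, 53.24)
alpha.hat(x, y, conf = 0.90)   # estimate ~ .40 with interval near (.06, .74)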

2.9 Two Sample Rank Set Sampling (RSS)

The basic background for rank set sampling was discussed in Section 1.9. In this section we extend these ideas to the two sample location problem. Suppose we have two samples in which $X_1, \ldots, X_{n_1}$ are iid $F(x)$ and $Y_1, \ldots, Y_{n_2}$ are iid $F(x - \Delta)$, and the two samples are independent of one another. In the corresponding RSS design, we take $n_1$ cycles of $k$ samples for $X$ and $n_2$ cycles of $q$ samples for $Y$. Proceeding as in Section 1.9, we display the measured data as:

$$\begin{array}{ll} X_{(1)1}, \ldots, X_{(1)n_1} \ \mbox{iid} \ f_{(1)}(t) & \quad Y_{(1)1}, \ldots, Y_{(1)n_2} \ \mbox{iid} \ f_{(1)}(t - \Delta) \\ \qquad \vdots & \qquad \vdots \\ X_{(k)1}, \ldots, X_{(k)n_1} \ \mbox{iid} \ f_{(k)}(t) & \quad Y_{(q)1}, \ldots, Y_{(q)n_2} \ \mbox{iid} \ f_{(q)}(t - \Delta) \end{array}$$

To test $H_0 : \Delta = 0$ versus $H_A : \Delta > 0$ we compute the Mann-Whitney-Wilcoxon statistic with these rank set samples. Letting $U_{si} = \sum_{t=1}^{n_2} \sum_{j=1}^{n_1} I(Y_{(s)t} > X_{(i)j})$, the test statistic is

$$U_{RSS} = \sum_{s=1}^{q} \sum_{i=1}^{k} U_{si} .$$

Note that $U_{si}$ is the Mann-Whitney-Wilcoxon statistic computed on the sample of the $s$th $Y$ order statistics and the $i$th $X$ order statistics. Even under the null hypothesis $H_0 : \Delta = 0$, $U_{si}$ is not based on identically distributed samples unless $s = i$. This complicates the null distribution of $U_{RSS}$.

Bohn and Wolfe (1992) present a thorough treatment of the distribution theory for $U_{RSS}$. We note that under $H_0 : \Delta = 0$, $U_{RSS}$ is distribution-free and further, using the same ideas as in Theorem 1.9.1, $E_{H_0}(U_{RSS}) = qk n_1 n_2 / 2$. For fixed $k$ and $q$, provided assumption D.1, (2.4.7), holds, Theorem 2.4.2 can be applied to show that $(U_{RSS} - qk n_1 n_2/2)/\sqrt{V_{H_0}(U_{RSS})}$ has a limiting $N(0,1)$ distribution. The difficulty is in the calculation of $V_{H_0}(U_{RSS})$; recall Theorem 1.9.1 for a similar calculation for the sign statistic. Bohn and Wolfe (1992) present a complex formula for the variance. Bohn and Wolfe provide a table of the approximate null distribution of $U_{RSS}$ for $q = k = 2$, $n_1 = 1, \ldots, 5$, $n_2 = 1, \ldots, 5$ and likewise for $q = k = 3$.

Another way to approximate the null distribution of $U_{RSS}$ is to bootstrap it. Consider, for simplicity, the case $k = q = 3$ and $n_1 = n_2 = m$. Hence the expert must rank three observations, and each of the $m$ cycles consists of three samples of size three for each of the $X$ and $Y$ measurements. In order to bootstrap the null distribution of $U_{RSS}$, first align the $Y$-RSS's with $\hat{\Delta}$, the Hodges-Lehmann estimate of shift computed across the two RSS's. Our bootstrap sampling is on the data with the indicated sampling distributions:

$$\begin{array}{ll} X_{(1)1}, \ldots, X_{(1)m} \ \mbox{sample} \ F_{(1)}(x) & \quad Y_{(1)1}, \ldots, Y_{(1)m} \ \mbox{sample} \ F_{(1)}(y - \hat{\Delta}) \\ X_{(2)1}, \ldots, X_{(2)m} \ \mbox{sample} \ F_{(2)}(x) & \quad Y_{(2)1}, \ldots, Y_{(2)m} \ \mbox{sample} \ F_{(2)}(y - \hat{\Delta}) \\ X_{(3)1}, \ldots, X_{(3)m} \ \mbox{sample} \ F_{(3)}(x) & \quad Y_{(3)1}, \ldots, Y_{(3)m} \ \mbox{sample} \ F_{(3)}(y - \hat{\Delta}) \end{array}$$

In the bootstrap process, for each row $i = 1, 2, 3$, we take random samples $X_{(i)1}^*, \ldots, X_{(i)m}^*$ from $F_{(i)}(x)$ and $Y_{(i)1}^*, \ldots, Y_{(i)m}^*$ from $F_{(i)}(y - \hat{\Delta})$. We then compute $U_{RSS}^*$ on these samples. Repeating this $B$ times, we obtain the sample of test statistics $U_{RSS,1}^*, \ldots, U_{RSS,B}^*$. Then the bootstrap p-value for our test is $\#(U_{RSS,j}^* \geq U_{RSS})/B$, where $U_{RSS}$ is the value of the statistic based on the original data. Generally we take $B = 1000$ for a p-value. It is clear how to modify the above argument to allow for $k \neq q$ and $n_1 \neq n_2$.
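A minimal sketch of this bootstrap is given below; it assumes xr and yr are $k \times m$ and $q \times m$ matrices whose rows hold the measured order-statistic samples, and that dhat holds the Hodges-Lehmann estimate across the two RSS's (all names illustrative). Resampling with replacement within each row mimics sampling from the row distributions above:

urss <- function(xr, yr) sum(outer(c(yr), c(xr), ">"))
boot.urss <- function(xr, yr, dhat, B = 1000) {
  u0 <- urss(xr, yr)                       # statistic on the original data
  yal <- yr - dhat                         # align the Y rows
  ustar <- replicate(B, {
    xs <- t(apply(xr, 1, sample, replace = TRUE))   # resample within rows
    ys <- t(apply(yal, 1, sample, replace = TRUE))
    urss(xs, ys)
  })
  mean(ustar >= u0)                        # bootstrap p-value
}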


2.10 Two Sample Scale Problem

Frequently it is of interest to investigate whether or not one random variable is more dispersed than another. The general case is when the random variables differ in both location and scale. Suppose the distribution functions of $X$ and $Y$ are given by $F(x)$ and $G(y) = F((y - \Delta)/\eta)$, respectively; hence $L(Y) = L(\eta X + \Delta)$. For discussion, we consider one-sided hypotheses of the form

$$H_0 : \eta = 1 \ \ \mbox{versus} \ \ H_A : \eta > 1 . \quad (2.10.1)$$

The other one-sided or two-sided hypotheses can be handled similarly. Let $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ be samples drawn on the random variables $X$ and $Y$, respectively.

The traditional test of $H_0$ is the $F$-test, which is based on the ratio of sample variances. As we discuss in Section 2.10.2, though, this test is generally not asymptotically correct (one of the exceptions is when $F(t)$ is a normal cdf). Indeed, as many simulation studies have shown, this test is extremely liberal in many non-normal situations; see Conover, Johnson and Johnson (1981).

Tests of $H_0$ should be invariant to the locations. One way of ensuring this is to first center the observations. For the $F$-test the centering is by sample means; instead, we prefer to use the sample medians. Let $\hat{\theta}_X$ and $\hat{\theta}_Y$ denote the sample medians of the $X$ and $Y$ samples, respectively. Then the samples of interest are the folded aligned samples given by $|X_1^*|, \ldots, |X_{n_1}^*|$ and $|Y_1^*|, \ldots, |Y_{n_2}^*|$, where $X_i^* = X_i - \hat{\theta}_X$ and $Y_i^* = Y_i - \hat{\theta}_Y$.

2.10.1 Optimal Rank-Based Tests

To obtain appropriate score functions for the scale problem, first consider the case whenthe location parameters of X and Y are known. Without loss of generality, we can thenassume that they are 0 and, hence, that L(Y ) = L(ηX). Further because η > 0, we haveL(|Y |) = L(η|X|). Let Z′ = (log |X1|, . . . , log |Xn1 |, log |Y1|, . . . , log |Yn2|) and ci, (2.2.1), bethe dummy indicator variable, i.e., ci = 0 or 1, depending on whether Zi is an X or Y ,repectively. Then an equivalent formulation of this problem is

Zi = ζci + ei , 1 ≤ i ≤ n , (2.10.2)

where ζ = log η and e1, . . . , en are iid with distribution function F∗(x), the cdf of log |X|. The hypotheses, (2.10.1), are equivalent to

H0 : ζ = 0 versus HA : ζ > 0 . (2.10.3)

Of course, this is the two sample location problem based on the logs of the absolute values of the observations. Hence, the optimal score function for Model (2.10.2) is given by

ϕf∗(u) = −f∗′(F∗−1(u))/f∗(F∗−1(u)) . (2.10.4)


After some simplification, see Exercise 2.13.30, we have

−f∗′(x)/f∗(x) = −{e^x [f′(e^x) − f′(−e^x)]}/{f(e^x) + f(−e^x)} − 1 . (2.10.5)

If we further assume that f is symmetric, then expression (2.10.5) for the optimal score function simplifies to

ϕf∗(u) = −F−1((u + 1)/2) · f′(F−1((u + 1)/2))/f(F−1((u + 1)/2)) − 1 . (2.10.6)

This expression is convenient to work with because it depends on F(t) and f(t), the cdf and pdf of X, in the original formulation of this scale problem. The following two examples obtain the optimal score function for the normal and double exponential situations, respectively.

Example 2.10.1. L(X) Is Normal

Without loss of generality, assume that f(x) is the standard normal density. In this case expression (2.10.6) simplifies to

ϕFK(u) = [Φ−1((u + 1)/2)]² − 1 , (2.10.7)

where Φ is the standard normal distribution function; see Exercise 2.13.33. Hence, if we are sampling from a normal distribution this suggests the rank test statistic

SFK = ∑_{j=1}^{n2} [Φ−1( R(|Yj|)/(2(n + 1)) + 1/2 )]² , (2.10.8)

where the FK subscript is due to Fligner and Killeen (1976), who discussed this score function in their work on the two-sample scale problem.
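For instance, with the locations known to be 0, the statistic (2.10.8) can be computed directly in R; the following short sketch is ours and makes no use of RBR (it assumes no ties).

    sfk <- function(x, y) {
      n <- length(x) + length(y)
      r <- rank(abs(c(x, y)))                  # ranks of the |Z|'s in the combined sample
      a <- qnorm(r / (2 * (n + 1)) + 0.5)^2    # the scores of (2.10.8)
      sum(a[(length(x) + 1):n])                # sum over the Y sample
    }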

Example 2.10.2. L(X) Is Double Exponential

Suppose that the density of X is the double exponential, f(x) = 2^{−1} exp(−|x|), −∞ < x < ∞. Then, as Exercise 2.13.33 shows, the optimal rank score function is given by

ϕ(u) = −(log (1 − u) + 1) . (2.10.9)

These scores are not surprising, because the distribution of |X| is exponential. Hence, this is precisely the log-linear problem with exponentially distributed lifetime that was discussed in Section 2.8; see the discussion around expression (2.8.8).

Example 2.10.3. L(|X|) Is a Member of the Generalized F -family: MWW Statistic


In Section 3.10 a discussion is devoted to a large family of commonly used distributions called the generalized F family for survival type data. In particular, as shown there, if |X| follows an F(2, 2)-distribution, then it follows (Exercise 2.13.31) that log |X| has a logistic distribution. Thus the MWW statistic is the optimal rank score statistic in this case.

Notice the relationship between the tail-weight of the distribution and the optimal score function for the scale problem over these last three examples. If the underlying distribution is normal, then the optimal score function (2.10.8) is for very light-tailed distributions. Even at the double exponential, the score function (2.10.9) is still for light-tailed errors. Finally, for the heavy-tailed (variance is ∞) F(2, 2) distribution the score function is the bounded MWW score function. The reason for the difference in location and scale scores is that the optimal score function for the scale case is based on the distribution of the logs of the original variables.

Once a scale score function is selected, following Section 2.5 the general scores process for this problem is given by

Sϕ(ζ) = ∑_{j=1}^{n2} aϕ(R(log |Yj| − ζ)) , (2.10.10)

where the scores aϕ(i) are generated by aϕ(i) = ϕ(i/(n + 1)). A rank test statistic for the hypotheses (2.10.3) is given by

Sϕ = Sϕ(0) = ∑_{j=1}^{n2} aϕ(R(log |Yj|)) = ∑_{j=1}^{n2} aϕ(R(|Yj|)) , (2.10.11)

where the last equality holds because the log function is strictly increasing. This is not necessarily a standardized score function, but it follows from the discussion on general scores found in Section 2.5 and (2.5.18) that the null mean µϕ and null variance σ²ϕ of the statistic are given by

µϕ = n2 ā and σ²ϕ = [n1n2/(n(n − 1))] ∑ (a(i) − ā)² . (2.10.12)

The asymptotic version of this test statistic rejects H0 at approximate level α if z ≥ zα where

z = (Sϕ − µϕ)/σϕ . (2.10.13)
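As an illustration, a minimal R sketch of the standardized statistic (2.10.11)-(2.10.13), for a user-supplied score function phi and locations known to be 0, is (assuming no ties):

    scale.z <- function(x, y, phi) {
      n1 <- length(x); n2 <- length(y); n <- n1 + n2
      a <- phi((1:n) / (n + 1))                        # a(i) = phi(i/(n+1))
      sphi <- sum(a[rank(abs(c(x, y)))[(n1 + 1):n]])   # S_phi of (2.10.11)
      mu <- n2 * mean(a)                               # null mean (2.10.12)
      sig2 <- n1 * n2 / (n * (n - 1)) * sum((a - mean(a))^2)  # null variance (2.10.12)
      (sphi - mu) / sqrt(sig2)                         # z of (2.10.13)
    }
    # e.g., the normal-scores test: scale.z(x, y, function(u) qnorm((u + 1) / 2)^2 - 1)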

The efficacy of the test based on Sϕ is given by expression (2.5.28); i.e.,

cϕ = (1/τϕ) √(λ1λ2) , (2.10.14)

where τϕ is given by

1/τϕ = ∫_0^1 ϕ(u) ϕf∗(u) du , (2.10.15)


and the optimal score function ϕf∗(u) is given in expression (2.10.4). Note that this formula for the efficacy is under the assumption that the score function ϕ(u) is standardized.

Recall the original (realistic) problem, where the distribution functions of X and Y are given by F(x) and G(y) = F((y − ∆)/η), respectively, and the difference in locations, ∆, is unknown. In this case, L(Y) = L(ηX + ∆). As noted above, the samples of interest are the folded aligned samples given by |X∗1|, . . . , |X∗n1| and |Y∗1|, . . . , |Y∗n2|, where X∗i = Xi − θ̂X and Y∗i = Yi − θ̂Y, and θ̂X and θ̂Y denote the sample medians of the X and Y samples, respectively.

Given a score function ϕ(u), we consider the linear rank statistic (2.10.11), where the ranking is performed on the folded-aligned observations; i.e.,

S∗ϕ = ∑_{j=1}^{n2} aϕ(R(|Y∗j|)) . (2.10.16)

The statistic S∗ϕ is no longer distribution free for finite samples. However, if we further assume that the distributions of X and Y are symmetric, then the test statistic S∗ϕ is asymptotically distribution free and has the same efficiency properties as Sϕ; see Puri (1968) and Fligner and Hettmansperger (1979). The requirement that f is symmetric is discussed in detail by Fligner and Hettmansperger (1979). In particular, we standardize the statistic using the mean and variance given in expression (2.10.12).

Estimation and confidence intervals for the parameter η are based on the process

S∗ϕ(ζ) = ∑_{j=1}^{n2} aϕ(R(log |Y∗j| − ζ)) . (2.10.17)

An estimate of ζ is a value ζ̂ which solves the equation (2.5.12); i.e.,

S∗ϕ(ζ̂) ≐ 0 . (2.10.18)

An estimate of η, the ratio of scale parameters, is then

η̂ = e^{ζ̂} . (2.10.19)

The interval (ζ̂L, ζ̂U), where ζ̂L and ζ̂U solve the equations (2.5.21), forms (asymptotically) a (1 − α)100% confidence interval for ζ. The corresponding confidence interval for η is (exp{ζ̂L}, exp{ζ̂U}).

As a simple rank-based analysis, consider the test and estimation given above based on the optimal scores (2.10.7) for the normal situation. The folded aligned samples version of the test statistic (2.10.8) is the statistic

S∗FK = ∑_{j=1}^{n2} [Φ−1( R(|Y∗j|)/(2(n + 1)) + 1/2 )]² . (2.10.20)


The standardized test statistic is z∗FK = (S∗FK − µFK)/σFK, where µFK and σFK are the values of (2.10.12) for the scores (2.10.7). This statistic for non-aligned samples is given on page 74 of Hajek and Sidak (1967). A version of it was also discussed by Fligner and Killeen (1976). We refer to this test and the associated estimator and confidence interval as the Fligner-Killeen analysis. The RBR function twoscale with the score function phiscalefk computes the Fligner-Killeen analysis. We next obtain the efficacy of this analysis.
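For readers without RBR, a minimal base R sketch of the folded-aligned Fligner-Killeen test follows; it reproduces only the test, not the estimate and confidence interval, and assumes no ties.

    fk.test <- function(x, y) {
      xf <- abs(x - median(x)); yf <- abs(y - median(y))   # folded aligned samples
      n1 <- length(xf); n2 <- length(yf); n <- n1 + n2
      a <- qnorm((1:n) / (2 * (n + 1)) + 0.5)^2            # scores of (2.10.20)
      sfk <- sum(a[rank(c(xf, yf))[(n1 + 1):n]])
      mu <- n2 * mean(a)
      sig <- sqrt(n1 * n2 / (n * (n - 1)) * sum((a - mean(a))^2))
      z <- (sfk - mu) / sig
      c(S.FK = sfk, z = z, p.value = 2 * pnorm(-abs(z)))   # two-sided p-value
    }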

Example 2.10.4. Efficacy for the Score Function ϕFK(u).

To use expression (2.5.28) for the efficacy, we must first standardize the score function ϕFK(u) = [Φ−1((u + 1)/2)]² − 1, (2.10.7). Using the substitution (u + 1)/2 = Φ(t), we have

∫_0^1 ϕFK(u) du = ∫_{−∞}^{∞} t²φ(t) dt − 1 = 1 − 1 = 0 .

Hence, the mean is 0. In the same way,

∫_0^1 [ϕFK(u)]² du = ∫_{−∞}^{∞} t⁴φ(t) dt − 2 ∫_{−∞}^{∞} t²φ(t) dt + 1 = 3 − 2 + 1 = 2 .

Thus the standardized score function is

ϕ∗FK(u) = {[Φ−1((u + 1)/2)]² − 1}/√2 . (2.10.21)

Hence, the efficacy of the Fligner-Killeen analysis is

cϕFK = √(λ1λ2) ∫_0^1 (1/√2){[Φ−1((u + 1)/2)]² − 1} ϕf∗(u) du , (2.10.22)

where the optimal score function ϕf∗(u) is given in expression (2.10.4). In particular, the efficacy at the normal distribution is given by

cϕFK(normal) = √(λ1λ2) ∫_0^1 (1/√2){[Φ−1((u + 1)/2)]² − 1}² du = √2 √(λ1λ2) . (2.10.23)

We illustrate the Fligner-Killeen analysis with the following example.

Example 2.10.5. Doksum and Sievers Data.

Doksum and Sievers (1976) describe an experiment involving the effect of ozone on weight gain of rats. The experimental group consisted of n2 = 22 rats which were placed in an ozone environment for seven days, while the control group contained n1 = 21 rats which were placed in an ozone free environment for the same amount of time. The response was the weight gain of a rat over the time period. Figure 2.10.1 displays the comparison boxplots for the data. There appears to be a difference in scale. Using the RBR software discussed above,


Figure 2.10.1: Comparison Boxplots of Treated and Control Weight Gains in Rats.

the Fligner-Killeen test statistic is S∗FK = 28.711 and its standardized value is z∗FK = 2.095. The corresponding p-value for a two-sided test is 0.036, confirming the impression from the plot. The associated estimate of the ratio (ozone to control) of scales is η̂ = 2.36, with a 95% confidence interval of (1.09, 5.10).

Conover, Johnson and Johnson (1981) performed a large Monte Carlo study of tests of dispersion, including these folded-aligned rank tests, over a wide variety of situations for the c-sample scale problem. The traditional F-test (Bartlett's test) did poorly, as would be expected from our comments below about the lack of robustness of the classical F-test. In certain null situations its empirical α levels exceeded .80 when the nominal α level was .05. One rank test that performed very well was the aligned rank version of a test statistic similar to S∗FK, (2.10.20), but with an exponent of one instead of two in the definition of the score function. This performed well overall in terms of validity and power except for highly asymmetric distributions, where it has a tendency to be liberal. In the following simulation study, however, the Fligner-Killeen test (2.10.20) (exponent of two) is empirically valid over the asymmetric situations covered.

Example 2.10.6. Simulation Study for Validity of Tests S∗ϕ

Table 2.10.1 displays the results of a small simulation study of the validity of the rank-based tests of scale for various score functions over mostly skewed error distributions. The scores in the study are: (fk2), the optimal score function for the normal distribution; (fk), similar to the last except that the exponent is one; (Wilcoxon), the linear Wilcoxon score function; (Quad), the score function ϕ(u) = u²; and (Logistic), the optimal score function if the distribution of X is logistic (see Exercise 2.13.32). The error distributions include the normal and the χ²(1) distributions and several members of the skewed contaminated normal distribution. In the latter case, the random variable X is written as X = X1(1 − Iε) + IεX2, where X1 and X2 have N(0, 1) and N(µc, σ²c) distributions, respectively, Iε has a Bernoulli distribution with probability of success ε, and X1, X2 and Iε are mutually independent. For the study, ε was set at 0.3 while µc and σc varied. The pdfs of the three SCN distributions in Table 2.10.1 are shown in Figure 2.10.2. The pdf in the bottom right panel of the figure is that of the χ²(1)-distribution. For all but the last situation in Table 2.10.1, the sample sizes are n1 = 20 and n2 = 25. The last situation is for n1 = n2 = 10. The number of simulations for each situation was set at 1000. For each run, the two-sided alternative, HA : η ≠ 1, was tested, and the estimate of η and an associated confidence interval for η were obtained. Computations were performed by RBR functions.

The table shows the empirical α levels at the nominal 0.10, 0.05, and 0.01 levels; the empirical confidence coefficient for a nominal 95% confidence interval; the mean of the estimates of η; and the MSE of the estimates. Of the five analyses, overall the Fligner-Killeen analysis (fk2) performed the best. This analysis was valid (nominal levels and empirical coverage) in all the situations, except for the χ²(1) distribution at the 10% level and the larger sample sizes. Even here, its empirical level is 0.128. The other tests were liberal in the skewed situations, some, such as the Wilcoxon, quite liberal. Also, the fk analysis (exponent 1 in its score function) was liberal for the χ²(1) situations. Notice that the Fligner-Killeen analysis achieved the lowest MSE in all the situations.

Hall and Padmanabhan (1997) developed a percentile bootstrap for these rank-based tests which, in their accompanying study, performed quite well for skewed error distributions as well as symmetric error distributions.

As a final remark, another class of linear rank statistics for the two sample scale problem consists of simple linear rank statistics of the form

S = ∑_{j=1}^{n2} a(R(Yj)) , (2.10.24)

where the scores are generated as a(i) = ϕ(i/(n + 1)). The folded rank statistics discussed above suggest that ϕ be a convex (or concave) function. One popular score function is the quadratic function ϕ(u) = (u − 1/2)². The resulting statistic,

SM = ∑_{j=1}^{n2} ( R(Yj)/(n + 1) − 1/2 )² , (2.10.25)

was proposed by Mood (1954) as a test statistic for the hypotheses (2.10.1). For the realistic problem with unknown location, though, the observations have to be first aligned. Asymptotic theory holds provided the underlying distribution is symmetric. This class of aligned rank tests, though, did not perform nearly as well as the folded rank statistics, (2.10.16), in the large Monte Carlo study of Conover et al. (1981). Hence, we recommend the folded rank-based analyses discussed above.


Table 2.10.1: Empirical Levels, Confidences, and MSE's for the Monte Carlo Study of Example 2.10.6.

Normal Errors, n1 = 20, n2 = 25
             α.10    α.05    α.01    Cnf.95   η̂       MSE(η̂)
Logistic     0.083   0.041   0.006   0.961    1.037   0.060
Quad.        0.080   0.030   0.008   0.970    1.043   0.076
Wilcoxon     0.073   0.033   0.004   0.967    1.042   0.097
fk2          0.087   0.039   0.004   0.960    1.036   0.057
fk           0.077   0.033   0.005   0.969    1.037   0.067

SKCN(µc = 2, σc = √2, εc = 0.3), n1 = 20, n2 = 25
Logistic     0.106   0.036   0.006   0.965    1.035   0.076
Quad.        0.106   0.046   0.008   0.953    1.040   0.095
Wilcoxon     0.103   0.049   0.007   0.952    1.043   0.117
fk2          0.100   0.034   0.006   0.966    1.033   0.073
fk           0.099   0.047   0.006   0.953    1.034   0.085

SKCN(µc = 6, σc = √2, εc = 0.3), n1 = 20, n2 = 25
Logistic     0.081   0.033   0.006   0.966    1.067   0.166
Quad.        0.122   0.068   0.020   0.933    1.105   0.305
Wilcoxon     0.163   0.103   0.036   0.897    1.125   0.420
fk2          0.072   0.026   0.005   0.974    1.057   0.126
fk           0.111   0.057   0.015   0.942    1.075   0.229

SKCN(µc = 12, σc = √2, εc = 0.3), n1 = 20, n2 = 25
Logistic     0.084   0.046   0.007   0.954    1.091   0.298
Quad.        0.138   0.085   0.018   0.916    1.183   0.706
Wilcoxon     0.171   0.116   0.038   0.886    1.188   0.782
fk2          0.074   0.042   0.007   0.958    1.070   0.201
fk           0.115   0.069   0.015   0.932    1.109   0.400

χ²(1), n1 = 20, n2 = 25
Logistic     0.154   0.086   0.023   0.913    1.128   0.353
Quad.        0.249   0.149   0.047   0.851    1.170   0.482
Wilcoxon     0.304   0.197   0.067   0.804    1.196   0.611
fk2          0.128   0.066   0.018   0.936    1.120   0.336
fk           0.220   0.131   0.039   0.870    1.154   0.432

χ²(1), n1 = 10, n2 = 10
Logistic     0.132   0.062   0.018   0.934    1.360   1.495
Quad.        0.192   0.099   0.035   0.900    1.457   2.108
Wilcoxon     0.276   0.166   0.042   0.833    1.560   3.311
fk2          0.111   0.057   0.013   0.941    1.335   1.349
fk           0.199   0.103   0.033   0.893    1.450   2.086


Figure 2.10.2: Pdfs of the Skewed Distributions in the Simulation Study of Example 2.10.6. (Four panels of f(x) versus x: SCN with µc = 2, σc = 1.41, ε = .3; SCN with µc = 6, σc = 1.41, ε = .3; SCN with µc = 12, σc = 1.41, ε = .3; and the χ²(1) distribution, one degree of freedom.)

2.10.2 Efficacy of the Traditional F -Test

We next obtain the efficacy of the traditional F-test for the ratio of scale parameters. Actually, for our development we need not assume that X and Y have the same locations. Let σ²2 and σ²1 denote the variances of Y and X, respectively. Then, in the notation of the first paragraph of this section, η² = σ²2/σ²1. The classical F-test of the hypotheses (2.10.1) is to reject H0 if F∗ ≥ F(α, n2 − 1, n1 − 1), where

F∗ = σ̂²2/σ̂²1 ,

and σ̂²2 and σ̂²1 are the sample variances of the samples Y1, . . . , Yn2 and X1, . . . , Xn1, respectively. The F-test is exact size α if f is a normal pdf. Also, the test is invariant to differences in location.

We first need the asymptotic distribution of F∗ under the null hypothesis. Instead of working with F∗, it is more convenient mathematically to work with the equivalent test statistic √n log F∗. We will assume that X has a finite fourth central moment; i.e., µX,4 = E[(X − E(X))⁴] < ∞. Let ξ = (µX,4/σ⁴1) − 3 denote the kurtosis of X. It easily follows that Y has the same kurtosis under the null and alternative hypotheses. A key result, established in Exercise 2.13.38, is that under these conditions

√ni (σ̂²i − σ²i) →D N(0, σ⁴i(ξ + 2)) , for i = 1, 2 . (2.10.26)

It follows immediately by the delta method that

√ni (log σ̂²i − log σ²i) →D N(0, ξ + 2) , for i = 1, 2 . (2.10.27)


Under H0, σi = σ, say, and by the last result,

√n log F∗ = √(n/n2) √n2 (log σ̂²2 − log σ²) − √(n/n1) √n1 (log σ̂²1 − log σ²) →D N(0, (ξ + 2)/(λ1λ2)) . (2.10.28)

The approximate test rejects H0 if

√n log F∗ / √((ξ + 2)/(λ1λ2)) ≥ zα . (2.10.29)

Note that ξ = 0 if X is normal. In practice, the test which is used assumes ξ = 0; that is, F∗ is not corrected by an estimate of ξ. This is one reason that the usual F-test for the ratio of variances does not possess robustness of validity; that is, the significance level is not asymptotically distribution free. Unlike the t-test, the F-test for variances is not even asymptotically distribution free under H0.
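A corrected version of the test is easy to program; the R sketch below uses a pooled moment estimate of ξ, which is our choice here and is not prescribed by the text.

    ftest.adj <- function(x, y, alpha = 0.05) {
      n1 <- length(x); n2 <- length(y); n <- n1 + n2
      e <- c(x - mean(x), y - mean(y))          # centered combined sample
      xihat <- mean(e^4) / mean(e^2)^2 - 3      # pooled kurtosis estimate
      z <- sqrt(n) * log(var(y) / var(x)) /
           sqrt((xihat + 2) / ((n1 / n) * (n2 / n)))
      c(z = z, reject = (z >= qnorm(1 - alpha)))   # the corrected test (2.10.29)
    }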

In order to obtain the efficacy of the F-test, consider the sequence of contiguous alternatives ∆n = δ/√n, δ > 0. Assume without loss of generality that the locations of X and Y are the same. Under this sequence of alternatives we have Yj = e^{∆n} Uj, where Uj is a random variable with cdf F(x), so that Yj has cdf F(e^{−∆n} x). We also get σ̂²2 = e^{2∆n} σ̂²U, where σ̂²U denotes the sample variance of U1, . . . , Un2. Let γF(∆) denote the power function of the F-test. The asymptotic power lemma for the F-test is

Theorem 2.10.1. Assume that X has a finite fourth moment, with ξ = (µX,4/σ⁴1) − 3. Then

lim_{n→∞} γF(∆n) = P(Z ≥ zα − cF δ) ,

where Z has a standard normal distribution and the efficacy is

cF = 2√(λ1λ2)/√(ξ + 2) . (2.10.30)

Proof: The conclusion follows directly upon observing

√n log F∗ = √n (log σ̂²2 − log σ̂²1)
          = √n (log σ̂²U + 2(δ/√n) − log σ̂²1)
          = 2δ + √(n/n2) √n2 (log σ̂²U − log σ²) − √(n/n1) √n1 (log σ̂²1 − log σ²) ,

and that the last quantity converges in distribution to a N(2δ, (ξ + 2)/(λ1λ2)) variate.

Let ϕ(u) denote a general score function for a folded-aligned rank-based analysis as discussed above. It then follows that the asymptotic relative efficiency of this analysis to the F-test is the ratio of the squares of their efficacies; i.e., e(S, F) = c²ϕ/c²F, where cϕ is given in expression (2.5.28).

Suppose we use the Fligner-Killeen analysis. Then its efficacy is cϕFK, which is given in expression (2.10.22). The ARE between the Fligner-Killeen analysis and the traditional F-test analysis is the ratio c²ϕFK/c²F. In particular, if we assume that the underlying distribution is normal, then by (2.10.23) this ratio is one.


2.11 Behrens-Fisher Problem

Consider the general model in Section 2.1 of this chapter, where X1, . . . , Xn1 is a random sample on the random variable X, which has distribution function F(x) and density function f(x), and Y1, . . . , Yn2 is a second random sample, independent of the first, on the random variable Y, which has distribution function G(x) and density g(x). Let θX and θY denote the medians of X and Y, respectively, and let ∆ = θY − θX. In Section 2.4 we showed that the MWW test is consistent for the stochastically ordered alternative. In the location model, where the distributions of X and Y differ by at most a shift in location, the hypothesis F = G is equivalent to the null hypothesis that ∆ = 0. In this section we drop the location model assumption; that is, we will assume that X and Y have distribution functions F and G, respectively, but we still consider the null hypothesis that ∆ = 0. In order to avoid confusion with Section 2.4, we explicitly state the hypotheses of this section as

H0 : ∆ = 0 versus HA : ∆ > 0 , where ∆ = θY − θX, L(X) = F, and L(Y) = G . (2.11.1)

As in the previous sections, we have selected a specific alternative for the discussion.

The above hypothesis is the most general hypothesis of this section, and the modified Mathisen's test defined below is consistent for it. We will also consider the case where the forms of F and G are the same; that is, G(x) = F(x/η), for some parameter η. Note in this case that L(Y) = L(ηX); hence, η = T(Y)/T(X), where T(X) is any scale functional (T(X) > 0 and T(aX) = aT(X) for a ≥ 0). If T(X) = σX, the standard deviation of X, then this is a Behrens-Fisher problem with F unknown. If we further assume that the distributions of X and Y are symmetric, then the modified MWW, defined below, can be used to test that ∆ = 0. The most restrictive case is when both F and G are assumed to be normal distribution functions. This is, of course, the classical Behrens-Fisher problem, and the classical solution to it is the Welch type t-test, discussed below. For motivation we first show the behavior of the usual MWW statistic. We then consider general rank procedures and finally specialize to analogues of the L1 and MWW analyses.

2.11.1 Behavior of the Usual MWW Test

In order to motivate the problem, consider the null behavior of the usual MWW test under (2.11.1) with the further restriction that the distributions of X and Y are symmetric. Under H0, since we are examining null behavior, there is no loss of generality if we assume that θX = θY = 0. The asymptotic form of the MWW test rejects H0 in favor of HA if

S⁺R = ∑_{i=1}^{n1} ∑_{j=1}^{n2} I(Yj − Xi > 0) ≥ n1n2/2 + zα √(n1n2(n + 1)/12) .

This test would have asymptotic level α if F = G. As Exercise 2.13.41 shows, we still have EH0(S⁺R) = n1n2/2 when the densities of X and Y are symmetric. From Theorem 2.4.5, Part


(a), the variance of the MWW statistic under H0 satisfies the limit,

VarH0(S⁺R)/(n1n2(n + 1)) → λ1 Var(F(Y)) + λ2 Var(G(X)) .

Recall that we obtained the asymptotic distribution of S⁺R, Theorem 2.4.9, under general conditions which cover the current assumptions; hence, the true significance level of the MWW test has the following limiting behavior:


αS⁺R = PH0[ S⁺R ≥ n1n2/2 + zα √(n1n2(n + 1)/12) ]
     = PH0[ (S⁺R − n1n2/2)/√VarH0(S⁺R) ≥ zα √(n1n2(n + 1)/(12 VarH0(S⁺R))) ]
     → 1 − Φ[ zα (12)^{−1/2} (λ1 Var(F(Y)) + λ2 Var(G(X)))^{−1/2} ] . (2.11.2)

Under the assumptions that the sample sizes are the same and that L(X) and L(Y) have the same form, we can simplify expression (2.11.2) further. We express the result in the following theorem.

Theorem 2.11.1. Suppose that the null hypothesis in (2.11.1) is true. Assume that the distributions of Y and X are symmetric, n1 = n2, and G(x) = F(x/η), where η is an unknown parameter. Then the maximum observed significance level is 1 − Φ(.816zα), which is approached as η → 0 or η → ∞.

Proof: Under the assumptions of the theorem, note that Var(F(Y)) = ∫F²(ηt) dF(t) − 1/4 and Var(G(X)) = ∫F²(x/η) dF(x) − 1/4. Differentiating (2.11.2) with respect to η, we get

φ[ zα(12)^{−1/2} ((1/2)Var(F(Y)) + (1/2)Var(G(X)))^{−1/2} ] zα(12)^{−1/2}
   × { ∫F(ηt) t f(ηt) f(t) dt + ∫F(t/η) f(t/η)(−t/η²) f(t) dt }
   × ((1/2)Var(F(Y)) + (1/2)Var(G(X)))^{−3/2} . (2.11.3)

Making the substitution u = ηt in the first integral, the quantity in braces reduces to η^{−2} ∫(F(u) − F(u/η)) u f(u) f(u/η) du. Note that the other factors in (2.11.3) are strictly positive. Thus, to determine the graphical behavior of (2.11.2) with respect to η, we need only consider the factor in braces. First note that it has a critical point at η = 1. Next consider the case η > 1. In this case F(u) − F(u/η) < 0 on the interval (−∞, 0) and is positive on the interval (0, ∞); hence, the factor in braces is positive for η > 1. Using a similar argument, this factor is negative for 0 < η < 1. Therefore the limit of the function αS⁺R(η) is decreasing on the interval (0, 1), has a minimum at η = 1, and is increasing on the interval (1, ∞).

Thus the minimum level of significance occurs at η = 1 (the location model), where it is α. By the graphical behavior of the function, maximum levels would occur at the extremes


of 0 and ∞. But it follows that

Var(F(Y)) = ∫F²(ηt) dF(t) − 1/4 → { 0 if η → 0 ; 1/4 if η → ∞ }

and

Var(G(X)) = ∫F²(x/η) dF(x) − 1/4 → { 1/4 if η → 0 ; 0 if η → ∞ } .

From these two results and (2.11.2), the true significance level of the MWW test satisfies

αS⁺R → { 1 − Φ(zα(3/2)^{−1/2}) if η → 0 ; 1 − Φ(zα(3/2)^{−1/2}) if η → ∞ } .

Hence,

αS⁺R → 1 − Φ(zα(3/2)^{−1/2}) = 1 − Φ(.816zα) ,

whether η → 0 or ∞. Thus the maximum observed significance level is 1 − Φ(.816zα), which is approached as η → 0 or η → ∞.

For example, if α = .05 then .816zα = 1.34 and αS⁺R → 1 − Φ(1.34) = .09. Thus, in the equal sample size case, when F and G differ only in a scale parameter and are symmetric, the nominal 5% level of the MWW test will not be worse than .09. In order to guarantee that α ≤ .05, choose zα so that 1 − Φ(.816zα) = .05. This leads to zα = 2.02, which is the critical value for α = .02. Hence, another way of saying this is: by performing a 2% MWW test we are guaranteed that the true (asymptotic) level is at most 5%.
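These numbers are easily verified in R:

    1 - pnorm(0.816 * qnorm(0.95))   # maximum true level of a nominal 5% MWW test: about .09
    qnorm(0.95) / 0.816              # z-alpha guaranteeing level .05: 2.02, i.e., a 2% test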

2.11.2 General Rank Tests

Assuming the most general hypothesis, (2.11.1), we will follow the development of Fligner and Policello (1981) to construct general tests. Suppose T represents a rank test statistic, used in the case F = G, and that the test rejects H0 : ∆ = 0 in favor of HA : ∆ > 0 for large values of T. Suppose further that n^{1/2}(T − µF,G)/σF,G converges in distribution to a standard normal. Let µ0 denote the null mean of T and assume that it is independent of F. Next, suppose that σ̂ is a consistent estimate of σF,G which is a function only of the ranks of the combined sample. This will ensure distribution freeness under H0; otherwise, the test statistic will only be asymptotically distribution free. The modified test statistic is

T̃ = n^{1/2}(T − µ0)/σ̂ . (2.11.4)

Such a test can be used for the general hypothesis (2.11.1). Fligner and Policello (1981) applied this approach to Mood's statistic; see also Hettmansperger and Malin (1975). In the next section, we consider Mathisen's test.


2.11.3 Modified Mathisen’s Test

We next present a modified version of Mathisen's test for the most general hypothesis (2.11.1). Let θ̂X = med_i Xi and define the sign-process

S2(θ) = ∑_{j=1}^{n2} sgn(Yj − θ) . (2.11.5)

Recall from expression (2.6.8) of Section 2.6.2 that Mathisen's test statistic (centered version) is given by S2(θ̂X). This will be our test statistic. The modification lies in its asymptotic distribution, which is given in the next theorem.

Theorem 2.11.2. Assume that the null hypothesis in expression (2.11.1) is true. Then, under the assumption (D.1), (2.4.7), (1/√n2) S2(θ̂X) is asymptotically normal with mean 0 and asymptotic variance 1 + K²12, where K²12 is defined by

K²12 = (λ2/λ1)(g²(θY)/f²(θX)) . (2.11.6)

Proof: Assume without loss of generality that θX = θY = 0. From the asymptotic linearity results discussed in Example 1.5.2 of Chapter 1, we have that

(1/√n2) S2(θn) ≐ (1/√n2) S2(0) − 2g(0) √n2 θn ,

for √n|θn| ≤ c, c > 0. Since √n2 θ̂X is bounded in probability, upon substitution in the last expression we get

(1/√n2) S2(θ̂X) ≐ (1/√n2) S2(0) − 2g(0) √n2 θ̂X . (2.11.7)

In Example 1.5.2, we also have the approximation

θ̂X ≐ (1/(2f(0) n1)) S1(0) , (2.11.8)

where S1(0) = ∑_{i=1}^{n1} sgn(Xi). Combining (2.11.7) and (2.11.8), we get

(1/√n2) S2(θ̂X) ≐ (1/√n2) S2(0) − (g(0)/f(0)) √(n2/n1) (1/√n1) S1(0) . (2.11.9)

The result follows because the samples are independent and because Si(0)/√ni →D N(0, 1), for i = 1, 2.

In order to use this test we need an estimate of K12. As in Chapter 1, selected order statistics from the sample X1, . . . , Xn1 provide a confidence interval for the median of X. Hence, given a level α, the interval (L1, U1), where L1 = X(k+1), U1 = X(n1−k), and k = n1/2 − zα/2(√n1/2), is an approximate (1 − α)100% confidence interval for the median of X. Let DX denote the length of this confidence interval. By Theorem 1.5.9 of Chapter 1,

√n1 DX/(2zα/2) →P (2f(0))^{−1} . (2.11.10)

In the same way, let DY denote the length of the corresponding (1 − α)100% confidence interval for the median of Y. Define

K̂12 = DX/DY . (2.11.11)

From (2.11.10) and the corresponding result for DY, the estimate K̂12 is a consistent estimate of K12, under both H0 and HA.

Thus the modified Mathisen’s test for the general hypotheses (2.11.1), is to reject H0 atapproximately level α if

ZM = S2(θ̂X)/√(n2(1 + K̂²12)) ≥ zα . (2.11.12)
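A minimal R sketch of this test, using the order-statistic confidence intervals above for DX and DY, is the following; it is ours, not an RBR function.

    mod.mathisen <- function(x, y, alpha = 0.05) {
      len.ci <- function(z, a) {                      # length of the median CI
        nz <- length(z)
        k <- floor(nz / 2 - qnorm(1 - a / 2) * sqrt(nz) / 2)
        zs <- sort(z)
        zs[nz - k] - zs[k + 1]
      }
      k12 <- len.ci(x, alpha) / len.ci(y, alpha)      # DX/DY of (2.11.11)
      zm <- sum(sign(y - median(x))) / sqrt(length(y) * (1 + k12^2))
      c(ZM = zm, reject = (zm >= qnorm(1 - alpha)))
    }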

To derive the efficacy of this statistic we will use the development of Section 1.5.2. The average to consider is n^{−1} S2(θ̂X). Let ∆ denote the shift in medians and, without loss of generality, let θX = 0. Then the mean function we need is

lim_{n→∞} E∆(n^{−1} S2(θ̂X)) = µ(∆) .

Note that we can reexpress the expansion (2.11.9) as

(1/n) S2(θ̂X) = (n2/n)(1/n2) S2(θ̂X)
            ≐ (n2/n)(1/n2) S2(0) − (g(0)/f(0))(n2/n)(1/n1) S1(0)
            →P∆ λ2 { E∆[sgn(Y)] − (g(0)/f(0)) E∆[sgn(X)] }
            = λ2 E∆[sgn(Y)] = µ(∆) , (2.11.13)

where the next to last equality holds since θX = 0. Using E∆(sgn(Y)) = 1 − 2G(−∆), we obtain the derivative

µ′(0) = 2λ2g(0) . (2.11.14)

By Theorem 2.11.2, we have the asymptotic null variance of the test statistic S2(θ̂X)/√n. From the above discussion, then, the statistic S2(θ̂X) is Pitman regular with efficacy

cMM = 2λ2 g(0)/√(λ2(1 + K²12)) = √(λ1λ2) 2g(0)/√(λ1 + λ2(g²(0)/f²(0))) . (2.11.15)

Using Theorem 1.5.4 of Chapter 1, consistency of the modified Mathisen's test for the hypotheses (2.11.1) is obtained provided µ(∆) > µ(0). But this follows immediately from the inequality G(−∆) < G(0) for ∆ > 0.


2.11.4 Modified MWW Test

Recall from Theorem 2.4.9 that the mean of the MWW test statistic S⁺R is n1n2 P(Y > X) = n1n2[1 − ∫G(x) f(x) dx]. For general F and G, though, P(Y > X) may not be 1/2 under H0. Since this section is concerned with methods for testing the specific hypothesis that ∆ = 0, we add the further restriction that the distributions of X and Y are symmetric. Recall from Section 2.11.1 that, under this assumption and ∆ = 0, E(S⁺R) = n1n2/2; see Exercise 2.13.41.

Using the general development of rank tests in Section 2.11.2, our modified rank test is given by: reject H0 : ∆ = 0 in favor of HA : ∆ > 0 if Z > zα, where

Z = (S⁺R − n1n2/2)/√(V̂ar(S⁺R)) , (2.11.16)

where V̂ar(S⁺R) is a consistent estimate of Var(S⁺R) under H0. From the asymptotic distribution theory obtained for S⁺R under general conditions, Theorem 2.4.9, it follows that this test has approximate level α. By Theorem 2.4.5, we can express the variance as

Var(S⁺R) = n1n2 ( ∫G dF − (∫G dF)² )
         + n1n2(n1 − 1) ( ∫F² dG − (∫F dG)² )
         + n1n2(n2 − 1) ( ∫(1 − G)² dF − (∫(1 − G) dF)² ) . (2.11.17)

Following the suggestion of Fligner and Policello (1981), we estimate Var(S⁺R) by replacing F and G by the empirical cdfs Fn1 and Gn2, respectively. As Exercise 2.13.42 demonstrates, this estimate is consistent and, further, it is a function of the ranks of the combined sample. Thus the test is distribution free when F(x) = G(x) and is asymptotically distribution free when F and G have symmetric densities.
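A minimal R sketch of this modified MWW test, plugging the empirical cdfs into (2.11.17), is:

    mod.mww <- function(x, y, alpha = 0.05) {
      n1 <- length(x); n2 <- length(y)
      Fn <- ecdf(x); Gn <- ecdf(y)
      sr <- sum(outer(y, x, ">"))                     # S_R^+ = #(Y_j > X_i)
      v <- n1 * n2 * (mean(Gn(x)) - mean(Gn(x))^2) +
           n1 * n2 * (n1 - 1) * (mean(Fn(y)^2) - mean(Fn(y))^2) +
           n1 * n2 * (n2 - 1) * (mean((1 - Gn(x))^2) - mean(1 - Gn(x))^2)
      z <- (sr - n1 * n2 / 2) / sqrt(v)
      c(S.R = sr, z = z, reject = (z >= qnorm(1 - alpha)))
    }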

The efficacy of the modified MWW follows using an argument similar to that for the MWW in Section 2.4. As there, the function S⁺R(∆) is a decreasing function of ∆. Its mean function is given by

E∆(S⁺R) = E0(S⁺R(−∆)) = n1n2 ∫(1 − G(x − ∆)) f(x) dx .

The average to consider here is S̄R = (n1n2)^{−1} S⁺R. Letting µ(∆) denote the mean of S̄R under ∆, we have µ′(0) = ∫g(x) f(x) dx > 0. The variance we need is σ²(0) = lim_{n→∞} n Var0(S̄R), which, using the above result on the variance, simplifies to

σ²(0) = λ2^{−1} ( ∫F² dG − (∫F dG)² ) + λ1^{−1} ( ∫(1 − G)² dF − (∫(1 − G) dF)² ) .


The process S⁺R(∆) is Pitman regular and, in particular, its efficacy is given by

cMMWW = √(λ1λ2) ∫g(x) f(x) dx / √( λ1(∫F² dG − (∫F dG)²) + λ2(∫(1 − G)² dF − (∫(1 − G) dF)²) ) . (2.11.18)

As with the modified Mathisen's test, we show consistency of the modified MWW test by using Theorem 1.5.4. Again, we need only show that µ(0) < µ(∆). But this follows immediately provided the supports of F and G overlap in a neighborhood of 0. Note that this shows that the modified MWW is consistent for the hypotheses (2.11.1) under the further restriction that the densities of X and Y are symmetric.

2.11.5 Efficiencies and Discussion

Before obtaining the asymptotic relative efficiencies of the above procedures, we briefly discuss traditional methods. Suppose we restrict F and G to have symmetric densities of the same form with finite variance; that is, F(x) = F0((x − θX)/σX) and G(x) = F0((x − θY)/σY), where F0 is some distribution function with symmetric density f0, and σX and σY are the standard deviations of X and Y, respectively.

Under these assumptions, it follows that √n(Ȳ − X̄ − ∆) converges in distribution to N(0, (σ²X/λ1) + (σ²Y/λ2)); see Exercise 2.13.43. The test is to reject H0 : ∆ = 0 in favor of HA : ∆ > 0 if tW > zα, where

tW = (Ȳ − X̄)/√(s²X/n1 + s²Y/n2) ,

and s²X and s²Y are the sample variances of the Xi and the Yj, respectively. Under these assumptions, these sample variances are consistent estimates of σ²X and σ²Y, respectively; hence, the test has approximate level α. If F0 is also normal then, under H0, tW has an approximate t distribution with a degrees of freedom correction proposed by Welch (1949). This test is frequently used in practice, and we shall subsequently call it the Welch t-test.
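In base R, the Welch t-test of these hypotheses for samples x and y is available directly:

    # var.equal = FALSE is the default, which applies the Welch
    # degrees-of-freedom correction:
    t.test(y, x, alternative = "greater")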

In contrast, the pooled t-test can behave poorly in this situation, since we have,

tp = (Ȳ − X̄)/√( [((n1 − 1)s²X + (n2 − 1)s²Y)/(n1 + n2 − 2)] (1/n1 + 1/n2) )
   ≐ (Ȳ − X̄)/√(s²X/n2 + s²Y/n1) ;

that is, the sample variances are divided by the wrong sample sizes. Hence, unless the sample sizes are fairly close, the pooled t is not asymptotically distribution free. Exercise 2.13.44 obtains the true asymptotic level of tp.


In order to get the efficacy of the Welch t, consider the statistic Ȳ − X̄. The mean function at ∆ is µ(∆) = ∆; hence, µ′(0) = 1. It follows from the asymptotic distribution discussed above that

√n [ √(λ1λ2)(Ȳ − X̄)/√(λ2σ²X + λ1σ²Y) ] →D N(0, 1) ;

hence, σ(0) = √((σ²X/λ1) + (σ²Y/λ2)) = √(λ2σ²X + λ1σ²Y)/√(λ1λ2). Thus the efficacy of tW is given by

ctW = µ′(0)/σ(0) = √(λ1λ2)/√(λ2σ²X + λ1σ²Y) . (2.11.19)

We obtain the ARE’s of the above procedures for the case where G(x) = F (x/η) andF (x) has density f(x) symmetric about 0 with variance 1. Thus η is the ratio of standarddeviations σY /σX . For this case the efficacies (2.11.15), (2.11.18), and (2.11.19) reduce to

cMM = 2√(λ1λ2) f(0)/√(λ2 + λ1η²)
cMMWW = √(λ1λ2) ∫g f / √( λ1[∫F² dG − (∫F dG)²] + λ2[∫(1 − G)² dF − (∫(1 − G) dF)²] )
ctW = √(λ1λ2)/√(λ2 + λ1η²) .

Thus the ARE between the modified Mathisen's procedure and the Welch procedure is the ratio c²MM/c²tW = 4σ²X f²(0) = 4f²0(0). This is the same ARE as in the location problem. In particular, the ARE does not depend on η = σY/σX. Thus the modified Mathisen's test in comparison to tW would have poor efficiency at the normal distribution, .63, but in general it would be much more efficient than tW for heavy-tailed distributions. Similar to the modified Mathisen's test, the Mood test can also be modified for these problems; see Exercise 2.13.45. Its efficacy is the same as that of Mathisen's test.

Asymptotic relative efficiencies involving the modified Wilcoxon do depend on the ratio of scale parameters η. Fligner and Rust (1982) show that if the variances of X and Y are quite different, then the modified Mathisen's test may be as efficient as the modified MWW irrespective of the shape of the underlying distribution.

Fligner and Policello (1981) conducted a simulation study of the pooled t, Welch's t, the MWW, and the modified MWW over situations where F and G differ in scale only. The unmodified tests did not maintain their level. Welch's t performed well when F and G were normal, whereas the modified MWW performed well over all situations, including unequal sample sizes and normal and contaminated normal distributions. In the simulation study performed by Fligner and Rust (1982), they found that the modified Mood test maintains its level over the situations that were considered by Fligner and Policello (1981).

As a final note, Welch's t requires distributions with the same shape, and the modified MWW requires symmetric densities. The modified Mathisen's test and the modified Mood test, though, are consistent tests for the general problem stated in expression (2.11.1).


2.12 Paired Designs

Consider the situation where we have two treatments of interest, say, A and B, which can be applied to subjects from a population of interest. Suppose we are interested in a particular response after these treatments have been applied. Let X denote the response of a subject after treatment A has been applied, and let Y be the corresponding measurement for a subject after treatment B has been applied. The natural null hypothesis, H0, is that there is no difference in treatment effects. A one-sided alternative would be that the response of a subject under treatment B is in general larger than that of a subject under treatment A. Reversing the roles of A and B would yield the other one-sided alternative, while the union of these two alternatives would result in the two-sided alternative. Again, for definiteness, we choose as our alternative, HA, the first one-sided alternative.

The completely randomized design and the paired design are two experimental designs which are often employed in this situation. In the completely randomized design, n subjects are selected at random from the population of interest; n1 of them are randomly assigned to treatment A, while the remaining n2 = n − n1 are assigned to treatment B. At the end of the treatment period, we then have two samples, one on X and the other on Y. The two sample procedures discussed in the previous sections can be used to analyze the data. Proper randomization, along with carefully controlled experimental conditions, gives credence to the assumptions that the samples are random and are independent of one another. The design that produced the data of Example 2.3.1 was a completely randomized design.

While the completely randomized design is often used in practice, the underlying variability may impair the power of any procedure, robust or classical, to detect alternative hypotheses. The design discussed next usually results in a more powerful analysis, but it does require a pairing device; i.e., a block of length two.

Suppose we have a pairing device. Some examples include identical twins for a study on human subjects, litter mates for a study on animal subjects, or the same exterior wall of a house for a study on the durability of exterior house paints. In the paired design, n pairs of subjects are randomly selected from the population of interest. Within each pair, one member is randomly assigned to treatment A, while the other receives treatment B. Again, let X and Y denote the responses of subjects after treatments A and B, respectively, have been applied. This experimental design results in a sample of pairs (X1, Y1), . . . , (Xn, Yn). The sample differences D1 = X1 − Y1, . . . , Dn = Xn − Yn, however, become the single sample of interest. Note that the random pairing in this design induces, under the null hypothesis, a symmetric distribution for the differences.

Theorem 2.12.1. In a randomized paired design, under the null hypothesis of no treatment effect, the differences Di are symmetrically distributed about 0.

Proof: Let F(x, y) denote the joint distribution of (X, Y). Under the null hypothesis of no treatment effect and randomized pairing, it follows that X and Y are exchangeable random variables; that is, P(X ≤ x, Y ≤ y) = P(X ≤ y, Y ≤ x). Hence, for a difference


D = Y −X we have,

P [D ≤ t] = P [Y −X ≤ t] = P [X − Y ≤ t] = P [−D ≤ t] .

Thus D and −D have the same distribution; hence, D is symmetrically distributed about 0.

Let θ be a location functional for the distribution of Di. We shall further assume that Di is symmetrically distributed under alternative models also. Then we can express the above hypotheses as H0 : θ = 0 versus HA : θ > 0.

Note that the one sample analyses based on signs and signed ranks discussed in Chapter 1 are appropriate for the randomly paired design. The appropriate sign test statistic is S = ∑ sgn(Di), while the signed-rank statistic is T = ∑ sgn(Di) R(|Di|).

From Chapter 1 we shall summarize the analysis based on the signed-rank statistic. A level α test would reject H0 in favor of HA if T ≥ cα, where cα is determined from the null distribution of the Wilcoxon signed-rank test or from the asymptotic approximation to the distribution. The test is consistent for θ > 0, and it has the efficiency results discussed in Chapter 1. In particular, for normal errors the efficiency of T with respect to the usual paired t-test is .955. The associated point estimate of θ is the Hodges-Lehmann estimate given by θ̂ = med_{i≤j}(Di + Dj)/2. A distribution free confidence interval for θ is constructed based on the Walsh averages (Di + Dj)/2, i ≤ j, as discussed in Chapter 1. Instead of using Wilcoxon scores, general signed-rank scores, as discussed in Chapter 1, can also be used.

A similar summary holds for the analysis based on the sign statistic. In fact, for the sign scores we need not assume that D1, . . . , Dn are identically distributed; that is, there can be a block effect. This is discussed further in Chapter 4.

We should mention that if the pairing is not done randomly, then Di may or may not be symmetrically distributed. If the symmetry assumption is realistic, then both sign and signed-rank analyses can be used. If, however, it is not realistic, then the sign analysis would still be valid, but caution would be necessary in interpreting the results of the signed-rank analysis.

Example 2.12.1. Darwin Data:

The data, Table 2.12.1, are some measurements recorded by Charles Darwin in 1878. They consist of 15 pairs of heights in inches of cross-fertilized plants and self-fertilized plants (Zea mays), each pair grown in the same pot.

RBR Results for Darwin Data

Results for the Wilcoxon-Signed-Rank procedure

Test of theta = 0 versus theta not equal to 0

Test-Stat. is T 72 Standardized (z) Test-Stat. is 2.016 p-value 0.043

Estimate 3.1375 SE is 1.244385

95 % Confidence Interval is ( 0.5 , 5.2125 )


Table 2.12.1: Plant Growth

Pot      1       2       3       4       5       6       7       8
Cross-   23.500  12.000  21.000  22.000  19.125  21.500  22.125  20.375
Self-    17.375  20.375  20.000  20.000  18.375  18.625  18.625  15.250

Pot      9       10      11      12      13      14      15
Cross-   18.250  21.625  23.250  21.000  22.125  23.000  12.000
Self-    16.500  18.000  16.250  18.000  12.750  15.500  18.000

Estimate of the scale parameter tau 4.819484

Results for the Sign procedure

Test of theta = 0 versus theta not equal to 0

Test stat. S is 11 Standardized (z) Test-Stat. 2.581 p-value 0.009

Estimate 3 SE is 1.307422

95 % Confidence Interval is ( 1 , 6.125 )

Estimate of the scale parameter tau 5.063624

Let Di denote the difference between the heights of the cross-fertilized and self-fertilized plants of the ith pot, and let θ denote the median of the distribution of Di. Suppose we are interested in testing for an effect; that is, the hypotheses are H0 : θ = 0 versus HA : θ ≠ 0. The boxplot of the differences is displayed in Panel A of Figure 2.12.1, while Panel B gives the normal q−q plot of the differences. As the plots indicate, the differences for Pot 2 and, perhaps, Pot 15 are possible outliers. The results from the RBR functions onesampwil and onesampsgn are shown above. The value of the signed-rank Wilcoxon statistic for these data is T = 72, with an approximate p-value of .044. The corresponding estimate of θ is 3.14 inches, and the 95% confidence interval is (.50, 5.21).

There are 13 positive differences, so the standardized value of the sign test statistic is 2.58, with a p-value of 0.01. The corresponding estimate of θ is 3 inches, and the 95% interpolated confidence interval is (1.00, 6.13). The paired t-test statistic has the value 2.15, with p-value 0.050. The difference in sample means is 2.62 inches, and the corresponding 95% confidence interval is (0, 5.23). Note that the outliers impaired the t-test and, to a lesser degree, the Wilcoxon signed-rank test; see Exercise 2.13.46 for further analyses.
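The analysis is easy to replicate with base R functions, which use an equivalent (shifted) form of the signed-rank statistic:

    cross <- c(23.5, 12, 21, 22, 19.125, 21.5, 22.125, 20.375,
               18.25, 21.625, 23.25, 21, 22.125, 23, 12)
    self  <- c(17.375, 20.375, 20, 20, 18.375, 18.625, 18.625, 15.25,
               16.5, 18, 16.25, 18, 12.75, 15.5, 18)
    d <- cross - self
    wilcox.test(d, conf.int = TRUE)     # signed-rank test, HL estimate, and CI
    binom.test(sum(d > 0), length(d))   # sign test: 13 of 15 differences positive
    t.test(d)                           # paired t-test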

2.12.1 Behavior under Alternatives

In this section we compare sample size determination for the paired design with sample size determination for the completely randomized design. For the paired design, let γ⁺(θ) denote the power function of the Wilcoxon signed-rank test statistic for the alternative θ. Then the asymptotic power lemma, Theorem 1.5.8, with c = τ^{−1} = √12 ∫f²(t) dt, for the signed-rank Wilcoxon from Chapter 1, states that at significance level α and under the sequence of


Figure 2.12.1: Boxplot of the Darwin Data (paired differences).

contiguous alternatives, θn = θ/√n,

lim_{n→∞} γ⁺(θn) = P[ Z ≥ zα − θ/τ ] .

We will only consider the case where the random vector (Y, X) is jointly normal with variance-covariance matrix

V = σ² [ 1  ρ ; ρ  1 ] .

Then τ = √(π/3) σ √(2(1 − ρ)).

Now suppose we select the sample size n∗ so that the Wilcoxon signed-rank test has power γ⁺(θ0) to detect the one-sided alternative θ0 > 0 for a level α test. Then, writing θ0 = √n∗ θ0/√n∗, we have by the asymptotic power lemma and (1.5.25) that

γ⁺(θ0) ≐ 1 − Φ(zα − √n∗ θ0/τ) ,

and

n∗ ≐ [ (zα − zγ⁺(θ0))²/θ²0 ] τ² .

Substituting the value of τ into this last equation, we have that the necessary sample size for the paired design to achieve the desired local power is

n∗ ≐ [ (zα − zγ⁺(θ0))²/θ²0 ] (π/3) σ² 2(1 − ρ) . (2.12.1)


Next consider a two-sample design with equal sample sizes n1 = n2. Assume that X and Y are iid normal with variance σ². Then τ² = (π/3)σ². Hence, by (2.4.25), the necessary sample size for the completely randomized design to achieve power γ⁺(θ0) at the one-sided alternative θ0 > 0 for a level α test is given by

n = [ (zα − zγ⁺(θ0))²/θ²0 ] 2(π/3) σ² . (2.12.2)

Based on expressions (2.12.1) and (2.12.2), the sample size needed for the paired design is (1 − ρ) times the sample size needed for the completely randomized design. If the pairing device is such that X and Y are strongly, positively correlated, then it pays to use the paired design. The paired design is a disaster, of course, if the variables are negatively correlated.
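A small R sketch of the two formulas makes the comparison concrete:

    n.paired <- function(theta0, sigma, rho, alpha = 0.05, power = 0.80) {
      (qnorm(1 - alpha) - qnorm(1 - power))^2 / theta0^2 *
        (pi / 3) * sigma^2 * 2 * (1 - rho)       # expression (2.12.1)
    }
    n.crd <- function(theta0, sigma, alpha = 0.05, power = 0.80) {
      (qnorm(1 - alpha) - qnorm(1 - power))^2 / theta0^2 *
        2 * (pi / 3) * sigma^2                   # expression (2.12.2)
    }
    # n.paired(...) / n.crd(...) equals 1 - rho, whatever the other arguments.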


2.13 Exercises

2.13.1. (a). Derive the L2 estimates of intercept and shift based on the L2 norm on Model (2.2.4).

(b). Next apply the pseudo norm, (2.2.16), to (2.2.4) and derive the estimating function.Show that the natural test statistic is the pooled t-statistic.

2.13.2. Show that (2.2.17) is a pseudo-norm. Show, also, that it can be written in terms of ranks; see the formula following (2.2.17).

2.13.3. In the proof of Theorem 2.4.2, verify that L(Yj −Xi) = L(Xi − Yj).

2.13.4. Prove Theorem 2.4.3.

2.13.5. Prove that if a continuous random variable Z has cdf H(z), then the random variable H(Z) has a uniform distribution on (0, 1).

2.13.6. In Theorem 2.4.4, show that E(F(Y)) = ∫F(y) dG(y) = ∫(1 − G(x)) dF(x) = E(1 − G(X)).

2.13.7. Prove that if Zn converges in distribution to Z, and if Var(Zn − Wn) and EZn − EWn converge to 0, then Wn also converges in distribution to Z.

2.13.8. Verify (2.4.10).

2.13.9. Explain what happens to the MWW statistic when one support is shifted completely to the right of the other support. What does this imply about the consistency of the MWW in this case?

2.13.10. Show that the L2 estimating function is Pitman regular and derive the efficacy of the pooled t-test. Also, establish the asymptotic power lemma, Theorem 2.4.13, for the L2 case. Finally, establish the asymptotic distribution of √n(Ȳ − X̄).

2.13.11. Prove that the Hodges-Lehmann estimate of shift, (2.2.18), is translation and scale equivariant. (See the discussion in Section 2.4.4.)

2.13.12. Prove Theorem 2.4.15.

2.13.13. In Example 2.4.1, form the residuals Zi − ∆̂ci, i = 1, . . . , n. Then, similar to Section 1.5.5, use these residuals to estimate τ based on (1.3.30).

2.13.14. Simulate independent random samples from N(20, 5²) and N(22, 5²) distributions of sizes 10 and 15, respectively. Let ∆ denote the shift in the locations of the distributions.

(a.) Obtain comparison boxplots for your samples.

(b.) Use the Wilcoxon procedure to test H0 : ∆ = 0 versus HA : ∆ ≠ 0 at level .05.


(c.) Use the Wilcoxon procedure to estimate ∆ and obtain a 95% confidence interval for it.

(d.) Obtain the true value of τ. Use your confidence interval from the last item to obtain an estimate of τ. Obtain a symmetric 95% confidence interval for ∆ based on your estimate.

(e.) Form a pooled estimate of τ based on the Wilcoxon signed-rank process for each sample. Obtain a symmetric 95% confidence interval for ∆ based on your estimate. Compare it with the estimate from the last item and the true value.

2.13.15. Write Minitab macros to bootstrap the distribution of ∆̂. Obtain the bootstrap distribution for 500 bootstraps of the data of Problem 2.13.14. What is your bootstrap estimate of τ? Compare it with the true value and the other estimates.

2.13.16. Verify the scalar multiple condition for the pseudo-norm in the proof of Theorem 2.5.1.

2.13.17. Verify (2.5.9) and (2.5.10).

2.13.18. Consider the process Sϕ(∆), (2.5.11):

(a). Show that Sϕ(∆) is a decreasing step function, with steps occurring at Yj −Xi.

(b). Using Part (a) and the MWW estimator as a starting value, write with some details an algorithm which obtains the estimator ∆̂ϕ.

(c). Verify expressions (2.5.14), (2.5.15), and (2.5.16).

2.13.19. Consider the optimal score function (2.5.22):

(a). Show it is location invariant and scale equivariant. Hence, show that if g(x) = (1/σ) f((x − µ)/σ), then ϕg = σ^{−1} ϕf.

(b). Use (2.5.22) to show that the MWW is asymptotically efficient when the underlying distribution is logistic (F(x) = (1 + exp(−x))^{−1}, −∞ < x < ∞).

(c). Show that (2.6.1) is optimal for a Laplace or double exponential distribution (f(x) = (1/2) exp(−|x|), −∞ < x < ∞).

(d). Show that the optimal score function for the extreme value distribution (f(x) = exp{x − e^x}, −∞ < x < ∞) is given by (2.8.8).

(e). Show that the optimal score function for the normal distribution is given by (2.5.33). Show that it is standardized.

(f). Show that (2.5.34) is the optimal score function for an underlying distribution that has a left logistic tail and a right exponential tail.


2.13.20. Show that when the underlying density f is symmetric then ϕf(1 − u) = −ϕf (u).

2.13.21. Show that expression (2.6.6) is true and that the n = 2r differences,

Y(1) −X(r) < Y(2) −X(r−1) < · · · < Y(n2) −X(r−n2+1) ,

can be ordered knowing only the order statistics from the individual samples.

2.13.22. Develop the asymptotic linearity formula for Mood's estimating function given in (2.6.3). Then give an alternative proof of Theorem 2.6.1 based on this result.

2.13.23. Verify the moment formulas (2.6.9) and (2.6.10).

2.13.24. Show that any estimator based on the pseudo-norm (2.5.2) is equivariant. Hence, if we multiply the combined sample observations by a constant, then the estimator is multiplied by that same constant.

2.13.25. Suppose X is a continuous random variable representing the time until failure of some process. The hazard function for a continuous random variable X with cdf F is defined to be the instantaneous rate of failure at X = t, conditional on survival to time t. It is formally given by

hX(t) = lim_{∆t→0⁺} P(t ≤ X < t + ∆t | X ≥ t)/∆t .

(a). Show that

hX(t) = f(t)/(1 − F(t)) .

(b). Suppose that Y has cdf given by (2.8.1). Show that the hazard function is given by hY(t) = α hX(t).

2.13.26. Verify (2.8.4).

2.13.27. Apply the delta method of finding the asymptotic distribution of a function to (2.8.3) to find the asymptotic distribution of α̂. Then verify (2.8.5). Explain how this can be used to find an approximate (1 − α)100% confidence interval for α.

2.13.28. Verify (2.8.14).

2.13.29. Show that the asymptotic relative efficiency of the Mann-Whitney-Wilcoxon test to the Savage test at the log exponential model is 3/4.

2.13.30. Verify (2.10.5).

2.13.31. Show that if |X| has an F (2, 2) distribution then log |X| has a logistic distribution.

2.13.32. Suppose f(t) is the logistic pdf. Show that the optimal scores function, (2.10.6), is given by ϕ(u) = u log[(u + 1)/(1 − u)] − 1.


2.13.33. (a). Verify (2.10.6).

(b). Apply (2.10.6) to the normal distribution.

(c). Apply (2.10.6) to the Laplace or double exponential distribution.

2.13.34. We consider the Siegel-Tukey (1960) test for the equality of variances when the underlying centers are equal but possibly unknown. The test statistic is the sum of the ranks of the Y sample in the combined sample (an MWW-type statistic). However, the ranks are assigned in a different way: in the ordered combined sample, assign rank 1 to the smallest value, rank 2 to the largest value, rank 3 to the second largest value, rank 4 to the second smallest value, and so on, alternately assigning ranks to end values. To test H0 : var X = var Y vs HA : var X > var Y, reject H0 when the sum of the ranks of the Y sample is large. Find the mean, variance, and the limiting distribution of the test statistic. Show how to find an approximate size α test.

2.13.35. Develop a sample size formula for the scale problem similar to the sample size formula in the location problem, (2.4.25).

2.13.36. Verify (??).

2.13.37. Compute the efficacy of Mood's scale test, the Ansari-Bradley scale test, and Klotz's scale test discussed in Section ??.

2.13.38. Verify the asymptotic properties given in (2.10.26), (2.10.27), and (2.10.28).

2.13.39. Compute the efficiency of Mood's scale test and the Ansari-Bradley scale test relative to the classical F-test for equality of variances.

2.13.40. Show that the Ansari-Bradley scale test is optimal for f(x) = (1/2)(1 + |x|)^{−2}, −∞ < x < ∞.

2.13.41. Show that when F and G have densities symmetric at 0 (or any common point), the expected value of S⁺R is n1n2/2.

2.13.42. Show that the estimate of (2.11.17) based on the empirical cdfs is consistent and that it is a function only of the combined sample ranks.

2.13.43. Under the general model in Section 2.11.5, derive the limiting distribution of √n(Ȳ − X̄ − ∆).

2.13.44. Find the true asymptotic level of the pooled t-test under the null hypothesis in (2.11.1).

2.13.45. Develop a modified Mood's test similar to the modified Mathisen's test discussed in Section 2.11.5.


2.13.46. Construct and discuss a normal quantile plot of the differences from Table 2.12.1. Carry out the Boos test for asymmetry (??). Why do these results suggest that the L1 analysis may be the best analysis in this example?

2.13.47. Consider the data set of information on professional baseball players given in Exercise 1.12.32. Let ∆ denote the shift parameter of the difference between the height of a pitcher and the height of a hitter.

(a.) Obtain comparison dotplots between the heights of the pitchers and hitters. Does a shift model seem appropriate?

(b.) Use the MWW test statistic to test the hypotheses H0 : ∆ = 0 versus HA : ∆ > 0. Compute the p-value.

(c.) Determine a point estimate for ∆ and a 95% confidence interval for ∆ based on the MWW procedure.

(d.) Obtain an estimate of the standard deviation of ∆̂. Use it to obtain an approximate 95% confidence interval for ∆.

2.13.48. Repeat Exercise 2.13.47 when ∆ is the shift parameter for the difference in pitchers' and hitters' weights.

2.13.49. Repeat Exercise 2.13.47 when ∆ is the shift parameter for the difference in left-handed (A-1) and right-handed (A-0) pitchers' ERAs and the hypotheses are H0 : ∆ = 0 versus HA : ∆ ≠ 0.


Chapter 3

Linear Models

3.1 Introduction

In this chapter we discuss the theory for a rank-based analysis of a general linear model. Applications of this analysis to experimental design models will be discussed in Chapter 4. The rank-based analysis is complete, consisting of estimation, testing, and diagnostic tools for checking the adequacy of fit of the model, outlier detection, and detection of influential cases. As in the earlier chapters, we present the analysis in terms of its geometry.

The analysis could be based on either rank scores or signed-rank scores. We have chosen to use the general rank scores of Chapter 2. This allows the error distribution to be either asymmetric or symmetric. An analysis based on signed-rank scores would parallel the one based on rank scores except that the theory would require a symmetric error distribution; see Hettmansperger and McKean (1983) for discussion. Although the results are established for general score functions, we illustrate the methods with Wilcoxon and sign scores throughout. We will commonly use the subscripts R and S for results based on Wilcoxon and sign scores, respectively.

3.2 Geometry of Estimation and Tests

For i = 1, . . . , n, let Yi denote the ith observation and let xi denote a p × 1 vector of explanatory variables. Consider the linear model

Yi = xi′β + ei* , (3.2.1)

where β is a p × 1 vector of unknown parameters. In this chapter, the components of β are the parameters of interest. We are interested in estimating β and testing linear hypotheses concerning it. However, it will be convenient to also have a location parameter. So accordingly let α = T(ei*) be a location functional. One that we will frequently use is the median. Let ei = ei* − α; then T(ei) = 0 and the model can be written as

Yi = α + xi′β + ei . (3.2.2)


The parameter α is called an intercept parameter. An argument similar to the one concerning the shift parameter ∆ of Chapter 2 shows that β does not depend on the location functional used.

Let Y = (Y1, . . . , Yn)′ denote the n × 1 vector of observations and let X denote the n × p matrix whose ith row is xi′. We can then express the model as

Y = 1α + Xβ + e , (3.2.3)

where 1 is an n × 1 vector of ones and e′ = (e1, . . . , en). Since the model includes an intercept parameter, α, there is no loss in generality in assuming that X is centered; i.e., the columns of X sum to 0. Further, in this chapter, we will assume that X has full column rank p. Let ΩF denote the column space spanned by the columns of X. Note that we can then write the model as

Y = 1α + η + e , where η ∈ ΩF . (3.2.4)

This model is often called the coordinate free model.

Besides estimation of the regression coefficients, we are interested in tests of general linear hypotheses of the form

H0 : Mβ = 0 versus HA : Mβ ≠ 0 , (3.2.5)

where M is a q × p matrix of full row rank. In this section, we discuss the geometry of estimation and testing with rank-based procedures for the linear model.

3.2.1 Estimation

With respect to model (3.2.4), we will estimate η by minimizing the distance between Y and the subspace ΩF. In this chapter we will define distance in terms of the norms or pseudo-norms presented in Chapter 2. Consider, first, the general R pseudo-norm discussed in Chapter 2 which is given by expression (2.5.2) and which we write for convenience,

‖v‖ϕ = Σ_{i=1}^n a(R(vi)) vi , (3.2.6)

where a(1) ≤ a(2) ≤ · · · ≤ a(n) is a set of scores generated as a(i) = ϕ(i/(n + 1)) for some nondecreasing score function ϕ(u) defined on the interval (0, 1) and standardized such that ∫ϕ(u)du = 0 and ∫ϕ²(u)du = 1. This was shown to be a pseudo-norm in Chapter 2. Recall that the Wilcoxon pseudo-norm is generated by the linear score function ϕ(u) = √12(u − 1/2). We will also discuss the sign pseudo-norm which is generated by ϕ(u) = sgn(u − 1/2) and show that it is equivalent to using the L1 norm. In Section 3.10 we will also discuss a class of score functions appropriate for survival type analyses.
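For concreteness, a minimal computational sketch (ours, not the text's) of the pseudo-norm (3.2.6) with Wilcoxon scores:

```python
# R pseudo-norm ||v||_phi = sum_i a(R(v_i)) v_i with Wilcoxon scores
# a(i) = phi(i/(n+1)), phi(u) = sqrt(12)(u - 1/2); ties are ignored here.
import numpy as np

def wilcoxon_scores(n):
    u = np.arange(1, n + 1) / (n + 1.0)
    return np.sqrt(12.0) * (u - 0.5)

def r_pseudo_norm(v):
    v = np.asarray(v, dtype=float)
    ranks = np.argsort(np.argsort(v)) + 1       # ranks R(v_i), 1..n
    return np.sum(wilcoxon_scores(len(v))[ranks - 1] * v)
```

Because the scores sum to zero, the pseudo-norm is invariant to adding a constant to every component; this is why the intercept must be estimated separately, from the residuals.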

For the general R pseudo-norm given above by (3.2.6), an R-estimate of η is a vector Ŷϕ such that

Dϕ(Y, ΩF) = ‖Y − Ŷϕ‖ϕ = min_{η∈ΩF} ‖Y − η‖ϕ . (3.2.7)


Figure 3.2.1: The R-estimate of η is a vector Ŷϕ which minimizes the normed differences, (3.2.6), between Y and ΩF. The distance between Y and the space ΩF is Dϕ(Y, ΩF).

about here

These quantities are represented geometrically in Figure 3.2.1.

Once η has been estimated, β can be estimated by solving the equation Xβ = Ŷϕ; that is, the R-estimate of β is β̂ϕ = (X′X)⁻¹X′Ŷϕ. As discussed later in Section 3.7, the intercept α can be estimated by a location estimate based on the residuals ê = Y − Ŷϕ. One that we will frequently use is the median of the residuals, which we denote as α̂S = med{Yi − xi′β̂ϕ}.

Theorem 3.5.7 shows, under regularity conditions, that

(α̂S, β̂ϕ′)′ has an approximate N_{p+1}((α, β′)′, V) distribution, where V is block diagonal with upper left block n⁻¹τS² and lower right block τϕ²(X′X)⁻¹, (3.2.8)

where τϕ and τS are the scale parameters defined in displays (3.4.4) and (3.4.6), respectively. From this result, an asymptotic confidence interval for the linear function h′β is given by

h′β̂ϕ ± t(α/2, n−p−1) τ̂ϕ √(h′(X′X)⁻¹h) , (3.2.9)

where the estimate τ̂ϕ is discussed in Section 3.7.1. The use of t-critical values instead of z-critical values is documented in the small sample studies cited in Section 3.7. Note the close analogy between this confidence interval and those based on LS estimates. The only difference is that σ̂ has been replaced by τ̂ϕ.

We will make use of the coordinate free model, especially in Chapter 4; however, in this chapter we are primarily concerned with the properties of the estimator β̂ϕ and it will be more convenient to use the coordinate model (3.2.3). Define the dispersion function by

Dϕ(β) = ‖Y − Xβ‖ϕ . (3.2.10)

Then Dϕ(β̂ϕ) = Dϕ(Y, ΩF) = ‖Y − Ŷϕ‖ϕ is the R-distance between Y and the subspace ΩF. It is also the residual dispersion.

Because Dϕ is expressed in terms of a norm, it is a continuous and convex function of β; see Exercise 1.12.3. Exercise 3.16.2 shows that the ranks of the residuals can only change at the boundaries of the regions defined by the n(n − 1)/2 equations yi − xi′β = yj − xj′β. Note that in the simple linear regression case, these equations define the sample slopes (Yj − Yi)/(xj − xi). Hence, in the interior of these regions the ranks are constant. Therefore, Dϕ(β) is a piecewise linear, continuous, convex function of β with gradient (defined almost everywhere) given by

∇Dϕ(β) = −Sϕ(Y − Xβ) , (3.2.11)

where Sϕ(Y − Xβ) = X′a(R(Y − Xβ)) (3.2.12)


and a(R(Y − Xβ))′ = (a(R(Y1 − x1′β)), . . . , a(R(Yn − xn′β))). Thus β̂ϕ solves the equations

Sϕ(Y − Xβ) = X′a(R(Y − Xβ)) ≐ 0 , (3.2.13)

which are called the R normal equations. A quadratic form in Sϕ(Y − Xβ0) serves as the gradient R-test statistic for testing H0 : β = β0 versus HA : β ≠ β0.
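A companion sketch (again ours) of the negative gradient (3.2.12); at the minimizing value β̂ϕ it is approximately the zero vector, which is the content of the R normal equations (3.2.13):

```python
# Negative gradient S_phi(Y - X b) = X' a(R(Y - X b)) with Wilcoxon scores.
import numpy as np

def gradient_S(X, y, beta):
    e = y - X @ beta
    ranks = np.argsort(np.argsort(e)) + 1
    u = ranks / (len(y) + 1.0)
    a = np.sqrt(12.0) * (u - 0.5)               # Wilcoxon scores a(R(e_i))
    return X.T @ a                              # S_phi, a p-vector
```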

In terms of the simple regression problem, Sϕ(β) is a decreasing step function of β which steps down at each sample slope. There may be an interval of solutions of Sϕ(β) = 0, or Sϕ(β) may step across the horizontal axis. Let β̂ϕ denote any point in the interval in the former case and the crossing point in the latter case. The gradient test statistic is Sϕ(β0) = Σ xi a(R(yi − xiβ0)). If the x's are distinct and equally spaced then for Wilcoxon scores this test statistic is equivalent to the test for correlation based on Spearman's rS; see Exercise 3.16.4.

For the asymptotic distribution theory of estimation and testing, we note that the estimate is location and scale equivariant. Let β̂ϕ(Y) denote the R-estimate of β for the linear model (3.2.3). Then, as shown in Exercise 3.16.6, β̂ϕ(Y + Xδ) = β̂ϕ(Y) + δ and β̂ϕ(kY) = kβ̂ϕ(Y). In particular these results imply, without loss of generality, that the theory developed in the following sections can be accomplished under the assumption that the true β is 0.

As a final note, we outline the least squares estimates. The LS estimate of η in model (3.2.4) is given by

Ŷ_LS = Argmin_{η∈ΩF} ‖Y − η‖²_LS ,

where ‖ · ‖_LS denotes the least squares pseudo-norm given by (2.2.16) of Chapter 2. The value of η which minimizes this pseudo-norm is

η̂_LS = HY , (3.2.14)

where H is the projection matrix onto the space ΩF; i.e., H = X(X′X)⁻¹X′. Denote the sum of squared residuals by SSE = min_{η∈ΩF} ‖Y − η‖²_LS = ‖(I − H)Y‖²_LS. In order to have similar notation we shall denote this minimum by D²_LS(Y, ΩF). Also, it is easy to show that the least squares estimate of β is β̂_LS = (X′X)⁻¹X′Y.

3.2.2 The Geometry of Testing

We next discuss the geometry behind rank-based tests of the general linear hypotheses given by (3.2.5). As above, consider the model (3.2.4),

Y = 1α + η + e , where η ∈ ΩF , (3.2.15)

and ΩF is the column space of the full model design matrix X. Let Ŷϕ,ΩF denote the R-fitted value in the full model. Note that Dϕ(Y, ΩF) is the amount of residual dispersion not accounted for in fitting the Model (3.2.4). These are shown geometrically in Figure 3.2.2.


Next let ω denote the subspace of ΩF subject to H0; in symbols, ω = {η ∈ ΩF : η = Xβ, for some β such that Mβ = 0}. In Exercise 3.16.7 the reader is asked to show that ω is a subspace of ΩF of dimension p − q. Let Ŷϕ,ω denote the R-estimate of η when the reduced model is fit and let Dϕ(Y, ω) = ‖Y − Ŷϕ,ω‖ϕ denote the distance between Y and the subspace ω. These are illustrated by Figure 3.2.2. The nonnegative quantity

RDϕ = Dϕ(Y, ω) − Dϕ(Y, ΩF) , (3.2.16)

denotes the reduction in residual dispersion when we pass from the reduced model to the full model. Large values of RDϕ indicate HA while small values support H0.

Figure 3.2.2: The reduction in dispersion RDϕ is the difference in normed distances between Y and the subspaces ΩF and ω.

about here

This drop in residual dispersion, RDϕ, is analogous to the drop in residual sums of squares for the LS analysis. In fact, to obtain this reduction in sums of squares, we need only replace the R-norm with the square of the Euclidean norm in the above development. Thus the drop in sums of squared errors is

SS = D²_LS(Y, ω) − D²_LS(Y, ΩF) ,

where D²_LS(Y, ΩF) is defined above. Hence the reduction in sums of squared residuals can be written as

SS = ‖(I − Hω)Y‖²_LS − ‖(I − HΩF)Y‖²_LS .

The traditional least squares F-test is given by

F_LS = (SS/q) / σ̂² , (3.2.17)

where σ̂² = D²_LS(Y, ΩF)/(n − p). Other than replacing one norm with another, Figures 3.2.1 and 3.2.2 remain the same for the two analyses, LS and R.

In order to be useful as a test statistic, similar to least squares, the reduction in dispersion RDϕ must be standardized. The asymptotic distribution theory that follows suggests the standardization

Fϕ = (RDϕ/q) / (τ̂ϕ/2) , (3.2.18)

where τ̂ϕ is the estimate of τϕ discussed in Section 3.7. Small sample studies cited in Section 3.7 indicate that Fϕ should be compared with F-critical values with q and n − (p + 1) degrees of freedom, analogous to the LS classical F-test statistic. Similar to the LS F-test, the test based on Fϕ can be summarized in the ANOVA table, Table 3.2.1. Note that the reduction in dispersion replaces the reduction in sums of squares in the classical table. These robust ANOVA tables were first discussed by Schrader and McKean (1976).
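As a small illustration of how (3.2.18) would be used (a sketch under our own function names; the reduced and full model dispersions and the estimate τ̂ϕ come from fitting the two models):

```python
# F_phi test (3.2.18): compare (RD/q)/(tau_hat/2) with F(q, n-p-1) critical values.
from scipy.stats import f as f_dist

def f_phi_test(D_reduced, D_full, q, n, p, tau_hat, level=0.05):
    RD = D_reduced - D_full                    # reduction in dispersion (3.2.16)
    F_phi = (RD / q) / (tau_hat / 2.0)
    crit = f_dist.ppf(1.0 - level, q, n - p - 1)
    p_value = f_dist.sf(F_phi, q, n - p - 1)
    return F_phi, crit, p_value
```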


Table 3.2.1: Robust ANOVA Table for H0 : Mβ = 0

Source       Reduction in Dispersion        df           Mean Reduction in Dispersion   Fϕ
Regression   RDϕ = Dϕ(Y, ω) − Dϕ(Y, ΩF)     q            RD/q                           Fϕ
Error                                       n − (p + 1)  τ̂ϕ/2

Table 3.2.2: Robust ANOVA Table for H0 : β = 0

Source       Reduction in Dispersion        df           Mean Reduction in Dispersion   Fϕ
Regression   RD = Dϕ(0) − Dϕ(Y, ΩF)         p            RD/p                           Fϕ
Error                                       n − p − 1    τ̂ϕ/2

Tests that all Regression Coefficients are 0

As discussed more fully in Section 3.6, there are three R-test statistics for the hypotheses (3.2.5). These are the R-analogues of the classical tests: the likelihood ratio test, the scores test, and the Wald test. We shall introduce them here for the special null hypothesis that all the regression parameters are 0; i.e.,

H0 : β = 0 versus HA : β ≠ 0 . (3.2.19)

Their asymptotic theory and small sample properties are discussed in more detail in later sections.

In this case, the reduced model dispersion is just the dispersion of the response vector Y, i.e., Dϕ(0). Hence, the R-test based on the reduction in dispersion is

Fϕ = [(Dϕ(0) − Dϕ(Y, ΩF))/p] / (τ̂ϕ/2) . (3.2.20)

As discussed above, Fϕ should be compared with F(α, p, n − p − 1)-critical values. Similar to the general hypothesis, the test based on Fϕ can be expressed in the robust ANOVA table given in Table 3.2.2. This is the robust analogue of the traditional ANOVA table that is printed out for a regression analysis by most least squares regression packages.

The R-scores test is the test based on the gradient. Theorem 3.5.2, below, gives the asymptotic distribution of the gradient Sϕ(0) under the null hypothesis. This leads to the asymptotic level α test: reject H0 if

Sϕ′(0)(X′X)⁻¹Sϕ(0) ≥ χ²α(p) . (3.2.21)

Note that this test avoids the estimation of τϕ.

The R-Wald test is a quadratic form in the full model estimates. Based on the asymptotic distribution of the full model estimate β̂ϕ given in Corollary 3.5.1, an asymptotic level α test rejects H0 if

[β̂ϕ′(X′X)β̂ϕ/p] / τ̂ϕ² ≥ F(α, p, n − p − 1) . (3.2.22)
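Sketches of these two statistics (our own helper functions; S0 denotes Sϕ(0) computed from the responses and β̂ is the full model R-estimate):

```python
# R-scores test (3.2.21) and R-Wald test (3.2.22) for H0: beta = 0.
import numpy as np
from scipy.stats import chi2, f as f_dist

def scores_test(X, S0, level=0.05):
    stat = S0 @ np.linalg.solve(X.T @ X, S0)          # S'(X'X)^{-1} S
    return stat, chi2.ppf(1.0 - level, X.shape[1])    # compare with chi^2(p)

def wald_test(X, beta_hat, tau_hat, level=0.05):
    n, p = X.shape
    stat = (beta_hat @ (X.T @ X) @ beta_hat / p) / tau_hat**2
    return stat, f_dist.ppf(1.0 - level, p, n - p - 1)
```

Note that the scores test needs no estimate of τϕ, in line with the remark above, while the Wald test does.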

3.3 Examples

We offer several examples to illustrate the rank-based estimates and test procedures discussed in the last section. For all the examples, we use Wilcoxon scores, ϕ(u) = √12(u − 1/2), for the rank-based estimates of the regression coefficients. We estimate the intercept by the median of the residuals and we estimate the scale parameter τϕ as discussed in Section 3.7. We begin with a simple regression data set and proceed to multiple regression problems.

Example 3.3.1. Telephone Data

The response for this data set is the number of telephone calls (tens of millions) made in Belgium for the years 1950 through 1973. Time, the years, serves as our only predictor variable. The data is discussed in Rousseeuw and Leroy (1987) and, for convenience, is displayed in Table 3.3.1.

Table 3.3.1: Data for Example 3.3.1. The number of calls is in tens of millions and the years are from 1950-1973.

Year       50    51    52    53    54    55    56    57    58    59    60    61
No. Calls  0.44  0.47  0.47  0.59  0.66  0.73  0.81  0.88  1.06  1.20  1.35  1.49
Year       62    63    64     65     66     67     68     69     70    71    72    73
No. Calls  1.61  2.12  11.90  12.40  14.20  15.90  18.20  21.20  4.30  2.40  2.70  2.90

The Wilcoxon estimates of the intercept and slope are −7.13 and .145, respectively, while the LS estimates are −26 and .504. The reason for this disparity in fits is easily seen in Panel A of Figure 3.3.1, which is a scatterplot of the data overlaid with the LS and Wilcoxon fits. Note that the years 1964 through 1969 had a profound effect on the LS fit while the Wilcoxon fit was much less sensitive to these years. As discussed in Rousseeuw and Leroy, the recording system for the years 1964 through 1969 differed from the other years. Panels B and C of Figure 3.3.1 are the studentized residual plots of the fits; see (3.9.31) of Section 3.9. As with internal LS-studentized residuals, values of the internal R-studentized residuals which exceed 2 in absolute value are potential outliers. Note that the internal Wilcoxon studentized residuals clearly show that the years 1964-1969 are outliers while the internal LS studentized residuals only detect 1969. The Wilcoxon studentized residuals also mildly detect the year 1970. Based on the scatterplot, this point does not follow the trend of the early (before 1964) years either. The scatterplot and Wilcoxon residual plot indicate that there may be a quadratic trend over the years before the outliers occur. The last few years, though, do not seem to follow this trend. Hence, a linear model for this data is questionable. On the basis of these plots, we will not discuss any formal inference for this data set.
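The Wilcoxon fit quoted above can be roughly reproduced by a direct minimization of the Wilcoxon dispersion; the following is only a sketch (the authors' analyses also use the estimate τ̂ϕ of Section 3.7, omitted here). Since D(β) does not involve the intercept, the intercept is estimated afterwards by the median of the residuals:

```python
# Simple regression Wilcoxon fit for the telephone data of Table 3.3.1 by
# minimizing the (convex, piecewise linear) Wilcoxon dispersion over the slope.
import numpy as np
from scipy.optimize import minimize_scalar

year = np.arange(50, 74, dtype=float)
calls = np.array([0.44, 0.47, 0.47, 0.59, 0.66, 0.73, 0.81, 0.88,
                  1.06, 1.20, 1.35, 1.49, 1.61, 2.12, 11.90, 12.40,
                  14.20, 15.90, 18.20, 21.20, 4.30, 2.40, 2.70, 2.90])

def wilcoxon_dispersion(beta):
    e = calls - beta * year
    ranks = np.argsort(np.argsort(e)) + 1
    a = np.sqrt(12.0) * (ranks / (len(e) + 1.0) - 0.5)
    return np.sum(a * e)

slope = minimize_scalar(wilcoxon_dispersion, bounds=(-0.2, 0.6),
                        method="bounded").x
intercept = np.median(calls - slope * year)
print(slope, intercept)    # should land near .145 and -7.13, as in the text
```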


Figure 3.3.1: Panel A: Scatterplot of the Telephone Data, overlaid with the LS and Wilcoxon fits; Panel B: Internal LS studentized residual plot; Panel C: Internal Wilcoxon studentized residual plot; and Panel D: Wilcoxon dispersion function.

about here

Panel D of Figure 3.3.1 depicts the Wilcoxon dispersion function over the interval (−.2, .6). Note that the Wilcoxon estimate β̂R = .145 is the minimizing value. Next consider the hypotheses H0 : β = 0 versus HA : β ≠ 0. The basis for the test statistic Fϕ can be read from this plot. The reduction in dispersion is given by RD = D(0) − D(.145). Also, the gradient test of these hypotheses would be the negative of the slope of the dispersion function at 0; i.e., −D′(0).

Example 3.3.2. Baseball Salaries

As a large data set, we consider data on the salaries of professional baseball pitchers for the 1987 baseball season. This data set was taken from the data set on baseball salaries which was used in the 1988 ASA Graphics Section Poster Session. It can be obtained at the web site: http://lib.stat.cmu.edu/datasets. Our analysis concerns a subdata set of 176 pitchers, which can be obtained from the authors upon request. Our response variable is the 1987 beginning salary (in log dollars) of these pitchers. As predictors, we took the career summary statistics through the end of the 1986 season. The names of these variables are listed in Table 3.3.2. Panels A - G of Figure 3.3.2 show the scatter plots of the log of salary versus each of the predictors. Certainly the strongest predictor on the basis of these plots is log years; although, linearity in this plot is questionable.


Figure 3.3.2: Panels A - G: Plots of log-salary versus each of the predictors for the baseball data of Example 3.3.2; Panel H: Internal Wilcoxon studentized residual plot.

about here

The internal Wilcoxon studentized residuals, (3.9.31), versus fitted values are displayed in Panel H of Figure 3.3.2. Based on Panels A and H, the pattern in the residual plot follows from the fact that log years is not a linear predictor. Better fitting models are pursued in Exercise 3.16.1. Note that there are several large outliers. The three identified outliers, circled points in Panel H, are interesting. These correspond to the pitchers Steve Carlton, Phil Niekro and Rick Sutcliff. These were very good pitchers, but in 1987 they were at the end of their careers (21, 23, and 21 years of pitching, respectively); hence, they missed the rapid rise in baseball salaries. A diagnostic analysis (see Section 3.9 and Exercise 3.16.1) indicates a few mildly influential points, also. For illustration, though, we will consider the model that we fit. Table 3.3.2 also displays the estimated coefficients and their standard errors. The outliers impaired the LS fit somewhat. The LS estimate of σ is .515 in comparison to the estimate of τ which is .388.

Table 3.3.2: Predictors for Baseball Salaries of Pitchers and Their Estimated (Wilcoxon Fit) Coefficients

Predictor                            Estimate   Stand. Error   t-ratio
log Years in professional baseball   .839       .044           19.15
Average wins per year                .045       .028           1.63
Average losses per year              -.024      .026           -.921
Earned run average                   -.146      .070           -2.11
Average games per year               -.006      .004           -1.60
Average innings per year             .004       .003           1.62
Average saves per year               .012       .011           1.07
Intercept                            4.22       .324
Scale (τ)                            .388

Table 3.3.3: Wilcoxon ANOVA Table for H0 : β = 0

Source       Reduction in Dispersion   df    Mean Reduction in Dispersion   Fϕ
Regression   78.287                    7     11.18                          57.65
Error                                  168   .194

Table 3.3.3 displays the robust ANOVA table for testing that all the coefficients, except the intercept, are 0. Based on the large value of Fϕ, (3.2.20), the predictors are helpful in explaining the response. In particular, based on Table 3.3.2, the predictors years in professional baseball, earned run average, average innings per year, and average number of saves per year seem more important than the variables wins, losses, and games. These last three variables form a similar group of variables; hence, as an illustration of the rank-based statistic Fϕ, the hypothesis that the coefficients for these three predictors are 0 was tested. The reduction in dispersion for this hypothesis is RD = 1.24, which leads to Fϕ = 2.12, which is significant at the 10% level. This confirms the above observations on the regression coefficients.

Example 3.3.3. Potency Data

This example is part of an n = 34 multivariate data set discussed in Chapter 6; see Table 6.6.2 for the data. The experiment concerned the potency of drug compounds which were manufactured under different levels of 4 factors. Here we shall consider only one of the response variables, POT2, which is the potency of a drug compound at the end of two weeks. The factors are: SAI, the amount of intragranular steric acid, which was set at the three levels −1, 0 and 1; SAE, the amount of extragranular steric acid, which was set at the three levels −1, 0 and 1; ADS, the amount of cross carmellose sodium, which was set at the three levels −1, 0 and 1; and TYPE of steric acid, which was set at two levels −1 and 1. The initial potency of the compound, POT0, served as a covariate.

In Example 3.9.2 of Section 3.9 a residual analysis of this data set is performed. This analysis indicates that the model which includes the covariate, the linear terms of the factors, the simple two-way interaction terms of the factors, and the quadratic terms of the three factors SAE, SAI and ADS is adequate. Let xj for j = 1, . . . , 4 denote the levels of the factors SAI, SAE, ADS, and TYPE, respectively, and let ci denote the value of the covariate. Then the model is expressed as

yi = α + β1 x1,i + β2 x2,i + β3 x3,i + β4 x4,i + β5 x1,i x2,i + β6 x1,i x3,i
   + β7 x1,i x4,i + β8 x2,i x3,i + β9 x2,i x4,i + β10 x3,i x4,i
   + β11 x1,i² + β12 x2,i² + β13 x3,i² + β14 ci + ei . (3.3.1)

Table 3.3.4: Wilcoxon and LS Estimates for the Potency Data

                          Wilcoxon Estimates     LS Estimates
Terms       Parameter     Est.      SE           Est.      SE
Intercept   α             7.184     2.96         5.998     4.50
Linear      β1            0.072     0.05         0.000     0.08
            β2            0.023     0.05         -0.018    0.07
            β3            0.166     0.05         0.135     0.07
            β4            0.020     0.04         -0.011    0.05
Two-way     β5            0.042     0.05         0.086     0.08
Inter.      β6            -0.040    0.05         0.035     0.08
            β7            0.040     0.05         0.102     0.07
            β8            -0.085    0.06         -0.030    0.09
            β9            0.024     0.05         0.070     0.07
            β10           -0.049    0.05         -0.011    0.07
Quad.       β11           -0.002    0.10         0.117     0.15
            β12           -0.222    0.09         -0.240    0.13
            β13           0.022     0.09         -0.007    0.14
Covariate   β14           0.092     0.31         0.217     0.47
Scale       τ or σ        .204                   .310

The Wilcoxon and LS estimates of the regression coefficients and their standard errors are given in Table 3.3.4. The Wilcoxon estimates are more precise. As the diagnostic analysis of Example 3.9.2 shows, this is due to the outliers in this data set.

Note that the Wilcoxon estimate of the parameter β13, the quadratic term of the factor ADS, is significant. Again referring to the residual analysis given in Example 3.9.2, there is some graphical evidence to retain the three quadratic coefficients in the model. In order to statistically confirm this evidence, we will test the hypotheses

H0 : β12 = β13 = β14 = 0 versus HA : βi ≠ 0 for some i = 12, 13, 14 .

The Wilcoxon test is summarized in Table 3.3.5 and it is based on the test statistic (3.2.18). The value of the test statistic is significant at the .05 level. The LS F-test statistic, though, has the value 1.19. As with its estimates of the regression coefficients, the LS F-test statistic has been impaired by the outliers.


Table 3.3.5: Wilcoxon ANOVA Table for H0 : β12 = β13 = β14 = 0

Source            Reduction in Dispersion   df   Mean Reduction in Dispersion   Fϕ
Quadratic Terms   .977                      3    .326                           3.20
Error                                       19   .102

3.4 Assumptions for Asymptotic Theory

For the asymptotic theory developed in this chapter certain assumptions on the distribution of the errors, the design matrix, and the scores are needed. The required assumptions for each section may differ, but for easy reference, we have placed them in this section.

The major assumption on the error density function f for much of the rank-based analyses is:

(E.1) f is absolutely continuous, 0 < I(f) < ∞ , (3.4.1)

where I(f) denotes Fisher information, (2.4.16). Since f is absolutely continuous, we can write

f(s) − f(t) = ∫_t^s f′(x) dx

for some function f′. An application of the Cauchy-Schwarz inequality yields

|f(s) − f(t)| ≤ I(f)^{1/2} √(|F(s) − F(t)|) ; (3.4.2)

see Exercise 1.12.20. It follows from (3.4.2) that assumption (E.1) implies that f is uniformly bounded and is uniformly continuous.

An assumption that will be used for analyses based on the L1 norm is:

(E.2) f(θe) > 0 , (3.4.3)

where θe denotes the median of the error distribution, i.e., θe = F⁻¹(1/2).

For easy reference, we list again the scale parameter τϕ, (2.5.23),

τϕ⁻¹ = ∫ ϕ(u) ϕf(u) du , (3.4.4)

where

ϕf(u) = −f′(F⁻¹(u)) / f(F⁻¹(u)) . (3.4.5)

Under (E.1) the scale parameter τϕ is well defined. Another scale parameter that will be needed is τS, defined as

τS = (2f(θe))⁻¹ ; (3.4.6)

see (1.5.21). Note that it is well defined under Assumption (E.2).
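As an aside (a short derivation, not in the text at this point), for Wilcoxon scores ϕ(u) = √12(u − 1/2) the parameter (3.4.4) reduces, via the substitution u = F(x) and an integration by parts, to the Wilcoxon scale parameter of Chapters 1 and 2:

```latex
% tau_phi for Wilcoxon scores; assumes f' exists and the boundary term
% (F(x) - 1/2) f(x) vanishes in the tails.
\tau_\varphi^{-1}
  = \int_0^1 \sqrt{12}\left(u - \tfrac12\right)
      \left(-\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}\right) du
  = \sqrt{12}\int_{-\infty}^{\infty}\left(F(x) - \tfrac12\right)(-f'(x))\, dx
  = \sqrt{12}\int_{-\infty}^{\infty} f^{2}(x)\, dx ,
\qquad\text{so}\qquad
\tau_\varphi = \left(\sqrt{12}\int f^{2}(x)\, dx\right)^{-1}.
```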


As above, let H = X(X′X)⁻¹X′ denote the projection matrix onto Ω, the column space of X. Our asymptotic theory assumes that the design matrix X is imbedded in a sequence of design matrices which satisfy the next two properties. We should subscript quantities such as X and the projection matrix with n to show this, but as a matter of convenience we have not done so. We will subscript the leverage values hiin, which are the diagonal entries of the projection matrix H. We will often impose the next two conditions on the design matrix:

(D.2) lim_{n→∞} max_{1≤i≤n} hiin = 0 (3.4.7)

(D.3) lim_{n→∞} n⁻¹X′X = Σ , (3.4.8)

where Σ is a p × p positive definite matrix. The first condition has become known as Huber's condition. Huber (1981) showed that (D.2) is a necessary and sufficient design condition for the least squares estimates to have an asymptotic normal distribution provided the errors, ei, are iid with finite variance. Condition (D.3) reduces to assumption (D.1), (2.4.7), of Chapter 2 for the two sample problem.

Another design condition is Noether's condition, which is given by

(N.1) max_{1≤i≤n} xik² / Σ_{j=1}^n xjk² → 0 for all k = 1, . . . , p . (3.4.9)

Although this condition will be convenient, as the next lemma shows it is implied by Huber's condition.

Lemma 3.4.1. (D.2) implies (N.1).

Proof: By the generalized Cauchy-Schwarz inequality (see Graybill, 1976, page 224), for all i = 1, . . . , n we have the following equalities:

sup_{‖δ‖=1} (δ′xi xi′δ)/(δ′X′Xδ) = xi′(X′X)⁻¹xi = hiin .

Next for k = 1, . . . , p take δ to be δk, the p × 1 vector of zeroes except for 1 in the kth component. Then the above equalities imply that

xik² / Σ_{j=1}^n xjk² ≤ hiin , i = 1, . . . , n, k = 1, . . . , p .

Hence

max_{1≤k≤p} max_{1≤i≤n} xik² / Σ_{j=1}^n xjk² ≤ max_{1≤i≤n} hiin .

Therefore Huber's condition implies Noether's condition.

As in Chapter 2, we will often assume that the score generating function ϕ(u) satisfies assumption (2.5.5). We will in addition assume that it is bounded. For reference, we will assume that ϕ(u) is a function defined on (0, 1) such that

(S.1) ϕ(u) is a nondecreasing, square-integrable, and bounded function with ∫₀¹ ϕ(u) du = 0 and ∫₀¹ ϕ²(u) du = 1 . (3.4.10)


Occasionally we will need further assumptions on the score function. In Section 3.7, we will need to assume that

(S.2) ϕ is differentiable . (3.4.11)

When estimating the intercept parameter based on signed-rank scores, we need to assume that the score function is odd about 1/2, i.e.,

(S.3) ϕ(1 − u) = −ϕ(u) ; (3.4.12)

see, also, (2.5.5).

3.5 Theory of Rank-Based Estimates

Consider the linear model given by (3.2.3). To avoid confusion, we will denote the true vector of parameters by (α0, β0′)′; that is, the true model is Y = 1α0 + Xβ0 + e. In this section we will derive the asymptotic theory for the R-analysis, estimation and testing, under the assumptions (E.1), (D.2), (D.3), and (S.1). We will occasionally suppress the subscripts ϕ and R from the notation. For example, we will denote the R-estimate by simply β̂.

3.5.1 R-Estimators of the Regression Coefficients

A key result for both estimation and testing concerns the gradient S(Y − Xβ), (3.2.12). We first derive its mean and covariance matrix and then obtain its asymptotic distribution.

Theorem 3.5.1. Under Model (3.2.3),

E[S(Y − Xβ0)] = 0 and V[S(Y − Xβ0)] = σa² X′X , where σa² = (n − 1)⁻¹ Σ_{i=1}^n a²(i) ≐ 1 . (3.5.1)

Proof: Note that S(Y − Xβ0) = X′a(R(e)). Under Model (3.2.3), e1, . . . , en are iid; hence, the ith component a(R(ei)) has mean

E[a(R(ei))] = Σ_{j=1}^n a(j) n⁻¹ = 0 ,

from which the result for the expectation follows.

For the result on the variance-covariance matrix, note that V[S(Y − Xβ0)] = X′V[a(R(e))]X. The diagonal entries of the covariance matrix on the RHS are:

V[a(R(ei))] = E[a²(R(ei))] = Σ_{j=1}^n a²(j) n⁻¹ = ((n − 1)/n) σa² .

The off-diagonal entries are the covariances given by

cov(a(R(ei)), a(R(el))) = E[a(R(ei)) a(R(el))]
  = Σ_{j=1}^n Σ_{k≠j} a(j) a(k) (n(n − 1))⁻¹
  = −(n(n − 1))⁻¹ Σ_{j=1}^n a²(j)
  = −σa²/n ,

where the third step in the derivation follows from 0 = (Σ_{j=1}^n a(j))². The result, (3.5.1), is obtained directly from these variances and covariances.

Under (D.3), we have that

V[n^{-1/2} S(Y − Xβ0)] → Σ . (3.5.2)

This anticipates our next result:

Theorem 3.5.2. Under Model (3.2.3), (E.1), (D.2), (D.3), and (S.1) in Section 3.4,

n^{-1/2} S(Y − Xβ0) →_D Np(0, Σ) . (3.5.3)

Proof: Let S(0) = S(Y − Xβ0) and let T(0) = X′ϕ(F(Y − Xβ0)). Under the above assumptions, the discussion around Theorem A.3.1 of the Appendix shows that (T(0) − S(0))/√n converges to 0 in probability. Hence we need only show that T(0)/√n converges to the intended distribution. Letting W* = n^{-1/2} t′T(e) where t ≠ 0 is an arbitrary p × 1 vector, it suffices to show that W* converges in distribution to a N(0, t′Σt) distribution. Note that we can write W* as

W* = n^{-1/2} Σ_{k=1}^n t′xk ϕ(F(ek)) . (3.5.4)

Since F is the distribution function of ek, it follows from ∫ϕ du = 0 that E[W*] = 0, and from ∫ϕ² du = 1 and (D.3) that

V[W*] = n⁻¹ Σ_{k=1}^n (t′xk)² = t′(n⁻¹X′X)t → t′Σt > 0 . (3.5.5)

Since W* is a sum of independent random variables which are not identically distributed, we establish the limit distribution by the Lindeberg-Feller Central Limit Theorem; see Theorem A.1.1 of the Appendix. In the notation of this theorem let Bn² = V[W*]. By (3.5.5), Bn² converges to a positive real number. We need to show

lim Bn⁻² Σ_{k=1}^n E[ (1/n)(xk′t)² ϕ²(F(ek)) I( |n^{-1/2}(xk′t)ϕ(F(ek))| > εBn ) ] = 0 . (3.5.6)


The key is the factor n^{-1/2}(xk′t) in the indicator function. By the Cauchy-Schwarz inequality and (D.2) we have the string of inequalities:

n^{-1/2}|xk′t| ≤ n^{-1/2}‖xk‖‖t‖ = [n⁻¹ Σ_{j=1}^p xkj²]^{1/2} ‖t‖ ≤ [p max_j n⁻¹xkj²]^{1/2} ‖t‖ . (3.5.7)

By assumptions (D.2) and (D.3), it follows that the quantity in brackets in equation (3.5.7), and hence n^{-1/2}|xk′t|, converges to zero as n → ∞. Call the term on the right side of equation (3.5.7) Mn. Note that it does not depend on k and Mn → 0. From this string of inequalities, the limit on the left side of (3.5.6) is less than or equal to

lim Bn⁻² · lim E[ ϕ²(F(e1)) I( |ϕ(F(e1))| > εBn/Mn ) ] · lim n⁻¹ Σ_{k=1}^n (xk′t)² .

The first and third limits are positive reals. For the second limit, note that the random variable inside the expectation is bounded; hence, by the Lebesgue Dominated Convergence Theorem we can interchange the limit and expectation. Since εBn/Mn → ∞ the expectation goes to 0 and our desired result is obtained.

Similar to Chapter 2, Exercise 3.16.9 obtains the proof of the above theorem for the special case of Wilcoxon scores by first getting the projection of the statistic W*.

Note from this theorem we have the gradient test that all the regression coefficients are 0; that is, H0 : β = 0 versus HA : β ≠ 0. Consider the test statistic

T = σa⁻² S(Y)′(X′X)⁻¹S(Y) . (3.5.8)

From the last theorem an approximate level α test for H0 versus HA is:

Reject H0 in favor of HA if T ≥ χ²(α, p) , (3.5.9)

where χ²(α, p) denotes the upper level α critical value of the χ²-distribution with p degrees of freedom.

Theorem A.3.8 of the Appendix gives the following linearity result for the process S(βn):

(1/√n) S(βn) = (1/√n) S(β0) − τϕ⁻¹ Σ √n(βn − β0) + o_p(1) , (3.5.10)

for √n(βn − β0) = O(1), where the scale parameter τϕ is given by (3.4.4). Recall that we have made use of this result in Section 2.5 when we showed that the two sample location process under general score functions is Pitman regular.


If we integrate the RHS of this result, we obtain a locally smooth approximation of the dispersion function D(βn) which is given by the following quadratic function:

Q(Y − Xβ) = (2τϕ)⁻¹(β − β0)′X′X(β − β0) − (β − β0)′S(Y − Xβ0) + D(Y − Xβ0) . (3.5.11)

Note that Q depends on τϕ and β0, so it cannot be used to estimate β. As we will show, the function Q is quite useful for establishing asymptotic properties of the R-estimates and test statistics. As discussed in Section 3.7.3, it also leads to a Gauss-Newton type algorithm for obtaining R-estimates.
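The step that Q suggests is easy to state; the following is a minimal sketch of one such Newton-type iteration with Wilcoxon scores (the full algorithm of Section 3.7.3 adds a consistent estimate τ̂ϕ and a step-size search, both omitted here):

```python
# One Newton-type step implied by minimizing (3.5.11) at the current iterate:
# beta_{k+1} = beta_k + tau_hat * (X'X)^{-1} S_phi(Y - X beta_k).
import numpy as np

def newton_step(X, y, beta_k, tau_hat):
    e = y - X @ beta_k
    ranks = np.argsort(np.argsort(e)) + 1
    a = np.sqrt(12.0) * (ranks / (len(y) + 1.0) - 0.5)  # Wilcoxon scores
    S = X.T @ a                                         # gradient (3.2.12)
    return beta_k + tau_hat * np.linalg.solve(X.T @ X, S)
```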

The following theorem shows that Q provides a local approximation to D. This is an asymptotic quadraticity result which was proved by Jaeckel (1972). It in turn is based on an asymptotic linearity result derived by Jureckova (1971) and displayed above, (3.5.10). It is proved in the Appendix; see Theorem A.3.8.

Theorem 3.5.3. Under Model (3.2.3) and the assumptions (E.1), (D.1), (D.2), and (S.1) of Section 3.4, for any ε > 0 and c > 0,

P[ max_{‖β−β0‖ < c/√n} |D(Y − Xβ) − Q(Y − Xβ)| ≥ ε ] → 0 , (3.5.12)

as n → ∞.

We will use this result to obtain the asymptotic distribution of the R-estimate. Without loss of generality assume that the true β0 = 0. Then we can write Q(Y − Xβ) = (2τϕ)⁻¹β′X′Xβ − β′S(Y) + D(Y). Because Q is a quadratic function it follows from differentiation that it is minimized by

β̃ = τϕ (X′X)⁻¹ S(Y) . (3.5.13)

Hence, β̃ is a linear function of S(Y). Thus we immediately have from Theorem 3.5.2:

Theorem 3.5.4. Under Model (3.2.3), (E.1), (D.1), (D.2) and (S.1) in Section 3.4,

√n(β̃ − β0) →_D Np(0, τϕ²Σ⁻¹) . (3.5.14)

Since Q is a local approximation to D, it would seem that their minimizing values are close also. As the next result shows this indeed is the case. The proof first appeared in Jaeckel (1972) and is sketched in the Appendix; see Theorem A.3.9.

Theorem 3.5.5. Under Model (3.2.3), (E.1), (D.1), (D.2) and (S.1) in Section 3.4,

√n(β̂ − β̃) →_P 0 .

Combining this last result with Theorem 3.5.4, we get the next corollary which gives the asymptotic distribution of the R-estimate.


Corollary 3.5.1. Under Model (3.2.3), (E.1), (D.1), (D.2) and (S.1),

√n(β̂ϕ − β0) →_D Np(0, τϕ²Σ⁻¹) . (3.5.15)

Under the further restriction that the errors have finite variance σ², Exercise 3.16.10 shows that the least squares estimate β̂_LS of β satisfies √n(β̂_LS − β) →_D Np(0, σ²Σ⁻¹). Hence, as in the location problems of Chapters 1 and 2, the asymptotic relative efficiency between the R-estimates and least squares is the ratio σ²/τϕ², where τϕ is the scale parameter (3.4.4). Thus the R-estimates of regression coefficients have the same high efficiency relative to LS estimates as do the rank-based estimates in the location problem. In particular, the efficiency of the Wilcoxon estimates relative to the LS estimates at the normal distribution is .955. For longer tailed error distributions this relative efficiency is much higher; see the efficiency discussion for contaminated normal distributions in Example 1.7.1.

From the above corollary, R-estimates are asymptotically unbiased. It follows from the invariance properties, if we additionally assume that the errors have a symmetric distribution, that R-estimates are unbiased for all sample sizes; see Exercise 3.16.11 for details.

The random vector β̃, (3.5.13), is an asymptotic representation of the R-estimate β̂. The following representation will be useful later:

Corollary 3.5.2. Under Model (3.2.3), (E.1), (D.1), (D.2) and (S.1) in Section 3.4,

n^{1/2}(β̂ϕ − β0) = τϕ (n⁻¹X′X)⁻¹ n^{-1/2} X′ϕ(F(Y − Xβ0)) + o_p(1) , (3.5.16)

where the notation ϕ(F(Y)) means the n × 1 vector whose ith component is ϕ(F(Yi)).

Proof: This follows immediately from (A.3.9), (A.3.10), the proof of Theorem 3.5.2, and equation (3.5.13).

Based on this last corollary, we have that the influence function of the R-estimate is given by

Ω(x0, y0; β̂ϕ) = τϕ Σ⁻¹ ϕ(F(y0)) x0 . (3.5.17)

A more rigorous derivation of this result, based on Fréchet derivatives, is given in the Appendix; see Section A.5.2. Note that the influence function is bounded in the Y-space but it is unbounded in the x-space. Hence an outlier in the x-space can seriously impair an R-estimate. Although as noted above the R-estimates are highly efficient relative to the LS estimates, it follows from its influence function that the breakdown of the R-estimate is 0. In Section 3.12, we present the HBR estimates whose influence function is bounded in both spaces and which can attain 50% breakdown; although, it is less efficient than the R estimate.

3.5.2 R-Estimates of the Intercept

As discussed in Section 3.2, the intercept parameter requires the specification of a location functional, T(ei). In this section we shall take T(ei) = med(ei). Since we assume, without loss of generality, that T(ei) = 0, α = T(Yi − xi′β). This leads immediately to estimating α by the median of the R residuals. Note that this is analogous to LS, since the LS estimate of the intercept is the arithmetic average of the LS residuals. Further, this estimate is associated with the sign test statistic and the L1 norm. More generally, we could also consider estimates associated with signed-rank test statistics. For example, if we consider the signed-rank Wilcoxon scores of Chapter 1 then the corresponding estimate is the median of the Walsh averages of the residuals. The theory of such estimates based on signed-rank tests, though, requires symmetrically distributed errors. Thus, while we briefly discuss these later, we now concentrate on the median of the residuals which does not require this symmetry assumption. We will make use of Assumption (E.2), (3.4.3), i.e., f(0) > 0.
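Both intercept estimates just described are simple functions of the R residuals; a brief sketch (our own helpers):

```python
# Intercept estimates from the R residuals e_hat = y - X @ beta_hat:
# the residual median (sign scores) and the median of the Walsh averages
# (signed-rank Wilcoxon scores; the latter requires symmetric errors).
import numpy as np

def intercept_median(e_hat):
    return np.median(e_hat)

def intercept_walsh(e_hat):
    i, j = np.triu_indices(len(e_hat))       # all pairs i <= j
    return np.median((e_hat[i] + e_hat[j]) / 2.0)
```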

The process we consider is the sign process based on residuals given by

S1(Y − α1 − Xβ̂ϕ) = Σ_{i=1}^n sgn(Yi − α − xi′β̂ϕ) . (3.5.18)

As with the sign process in Chapter 1, this process is a nonincreasing step function of α which steps down at the residuals. The solution to the equation

S1(Y − α1 − Xβ̂ϕ) ≐ 0 (3.5.19)

is the median of the residuals, which we shall denote by α̂S = med{Yi − xi′β̂ϕ}. Our goal is to obtain the asymptotic joint distribution of the estimate b̂ϕ = (α̂S, β̂ϕ′)′.

Similar to the R-estimate of β, the estimate of the intercept is location and scale equivariant; hence, without loss of generality, we will assume that the true intercept and regression parameters are 0. We begin with a lemma.

Lemma 3.5.1. Assume conditions (E.1), (E.2), (S.1), (D.1) and (D.2) of Section 3.4. For any ε > 0 and for any a ∈ R,

lim_{n→∞} P[ |S1(Y − an^{-1/2}1 − Xβ̂ϕ) − S1(Y − an^{-1/2}1)| ≥ ε√n ] = 0 .

The proof of this lemma was first given by Jureckova (1971) for general signed-rank scores and it is briefly sketched in the Appendix for the sign scores; see Lemma A.3.2. This lemma leads to the asymptotic linearity result for the process (3.5.18).

We need the following linearity result:

Theorem 3.5.6. Assume conditions (E.1), (E.2), (S.1), (D.1) and (D.2) of Section 3.4. For any ε > 0 and c > 0,

lim_{n→∞} P[ sup_{|a|≤c} |n^{-1/2}S1(Y − an^{-1/2}1 − Xβ̂ϕ) − n^{-1/2}S1(Y − Xβ̂ϕ) + aτS⁻¹| ≥ ε ] = 0 ,

where τS is the scale parameter defined in expression (3.4.6).


Proof: For any fixed a write

|n^{-1/2}S1(Y − an^{-1/2}1 − Xβ̂ϕ) − n^{-1/2}S1(Y − Xβ̂ϕ) + aτS⁻¹|
  ≤ |n^{-1/2}S1(Y − an^{-1/2}1 − Xβ̂ϕ) − n^{-1/2}S1(Y − an^{-1/2}1)|
  + |n^{-1/2}S1(Y − an^{-1/2}1) − n^{-1/2}S1(Y) + aτS⁻¹|
  + |n^{-1/2}S1(Y) − n^{-1/2}S1(Y − Xβ̂ϕ)| .

We can apply Lemma 3.5.1 to the first and third terms on the right side of the above inequality. For the middle term we can use the asymptotic linearity result in Chapter 1 for the sign process, (1.5.22). This yields the result for any a, and the sup follows from the monotonicity of the process, similar to the proof of Theorem 1.5.6 of Chapter 1.

Letting a = 0 in Lemma 3.5.1, we have that the difference n^{-1/2}S1(Y − Xβ̂ϕ) − n^{-1/2}S1(Y) goes to zero in probability. Thus the asymptotic distribution of n^{-1/2}S1(Y − Xβ̂ϕ) is the same as that of n^{-1/2}S1(Y), namely, N(0, 1). We have two applications of these results. The first is found in the next lemma.

Lemma 3.5.2. Assume conditions (E.1), (E.2), (D.1), (D.2), and (S.1) of Section 3.4. The random variable n^{1/2}α̂S is bounded in probability.

Proof: Let ε > 0 be given. Since n^{-1/2}S1(Y − Xβ̂ϕ) is asymptotically N(0, 1) there exists a c < 0 such that

P[n^{-1/2}S1(Y − Xβ̂ϕ) < c] < ε/2 . (3.5.20)

Take c* = τS(c − ε). By the process's monotonicity and the definition of α̂S, we have the implication n^{1/2}α̂S < c* ⇒ n^{-1/2}S1(Y − c*n^{-1/2}1 − Xβ̂ϕ) ≤ 0. Adding in and subtracting out the above linearity result leads to

P[n^{1/2}α̂S < c*] ≤ P[n^{-1/2}S1(Y − n^{-1/2}c*1 − Xβ̂ϕ) ≤ 0]
  ≤ P[ |n^{-1/2}S1(Y − c*n^{-1/2}1 − Xβ̂ϕ) − (n^{-1/2}S1(Y − Xβ̂ϕ) − c*τS⁻¹)| ≥ ε ]
  + P[ n^{-1/2}S1(Y − Xβ̂ϕ) − c*τS⁻¹ < ε ] . (3.5.21)

The first term on the right side can be made less than ε/2 for sufficiently large n, whereas the second term is (3.5.20). From this it follows that n^{1/2}α̂S is bounded below in probability. To finish the proof, a similar argument shows that n^{1/2}α̂S is bounded above in probability.

As a second application we can write the linearity result of the last theorem as

n^{-1/2}S1(Y − an^{-1/2}1 − Xβ̂ϕ) = n^{-1/2}S1(Y) − aτS⁻¹ + o_p(1) (3.5.22)

uniformly for all |a| ≤ c and for c > 0.

Because α̂S is a solution to equation (3.5.19) and n^{1/2}α̂S is bounded in probability, the second linearity result, (3.5.22), yields, after some simplification, the following asymptotic representation of the estimate of the intercept for the true intercept α0:

n^{1/2}(α̂S − α0) = τS n^{-1/2} Σ_{i=1}^n sgn(Yi − α0) + o_p(1) , (3.5.23)


where τS is given in (3.4.6). From this we have that n^{1/2}(α̂S − α0) →_D N(0, τS²). Our interest, though, is in the joint distribution of α̂S and β̂ϕ.

By Corollary 3.5.2 the corresponding asymptotic representation of β̂ϕ for the true vector of regression coefficients β0 is

n^{1/2}(β̂ϕ − β0) = τϕ (n⁻¹X′X)⁻¹ n^{-1/2} X′ϕ(F(Y)) + o_p(1) , (3.5.24)

where τϕ is given by (3.4.4). The joint asymptotic distribution is given in the following theorem.

Theorem 3.5.7. Under (D.1), (D.2), (S.1), (E.1) and (E.2) in Section 3.4,

b̂ϕ = (α̂S, β̂ϕ′)′ has an approximate N_{p+1}((α0, β0′)′, V) distribution, where V is block diagonal with upper left block n⁻¹τS² and lower right block τϕ²(X′X)⁻¹.

Proof: As above, assume without loss of generality that the true parameters are 0. It is easier to work with the random vector Tn = (τS⁻¹√n α̂S, √n(τϕ⁻¹(n⁻¹X′X)β̂ϕ)′)′. Let t = (t1, t2′)′ be an arbitrary, nonzero vector in R^{p+1}. We need only show that Zn = t′Tn has an asymptotically univariate normal distribution. Based on the above asymptotic representations of α̂S, (3.5.23), and β̂ϕ, (3.5.24), we have

Zn = n^{-1/2} Σ_{k=1}^n ( t1 sgn(Yk) + (t2′xk) ϕ(F(Yk)) ) + o_p(1) . (3.5.25)

Denote the sum on the right side of (3.5.25) as Z*n. We need only show that Z*n converges in distribution to a univariate normal distribution. Denote the kth summand as Z*nk. We shall use the Lindeberg-Feller Central Limit Theorem. Our application of this theorem is similar to its use in the proof of Theorem 3.5.2. First note that since the score function ϕ is standardized (∫ϕ = 0), E(Z*n) = 0. Let Bn² = Var(Z*n). Because the individual summands are independent, the Yk are identically distributed, ϕ is standardized (∫ϕ² = 1), and the design is centered, Bn² simplifies to

Bn² = n⁻¹[ Σ_{k=1}^n t1² + Σ_{k=1}^n (t2′xk)² + 2 t1 cov(sgn(Y1), ϕ(F(Y1))) t2′ Σ_{k=1}^n xk ]
    = t1² + t2′(n⁻¹X′X)t2 + 0 .

Hence, by (D.3),

lim_{n→∞} Bn² = t1² + t2′Σt2 , (3.5.26)

which is a positive number. To satisfy the Lindeberg-Feller condition, we need to show that for any ε > 0

lim_{n→∞} Bn⁻² Σ_{k=1}^n E[ Z*nk² I(|Z*nk| > εBn) ] = 0 . (3.5.27)

Since Bn² converges to a positive constant, we need only show that the sum converges to 0. By the triangle inequality we can show that the indicator function satisfies

I( n^{-1/2}|t1| + n^{-1/2}|t2′xk| |ϕ(F(Yk))| > εBn ) ≥ I( |Z*nk| > εBn ) . (3.5.28)

Following the discussion after expression (3.5.7), we have that n^{-1/2}|t2′xk| ≤ Mn where Mn is independent of k and, furthermore, Mn → 0. Hence, we have

I( |ϕ(F(Yk))| > (εBn − n^{-1/2}|t1|)/Mn ) ≥ I( n^{-1/2}|t1| + n^{-1/2}|t2′xk| |ϕ(F(Yk))| > εBn ) . (3.5.29)

Thus the sum in expression (3.5.27) is less than or equal to

Σ_{k=1}^n E[ Z*nk² I( |ϕ(F(Yk))| > (εBn − n^{-1/2}|t1|)/Mn ) ]
  = t1² E[ I( |ϕ(F(Y1))| > (εBn − n^{-1/2}|t1|)/Mn ) ]
  + (2/n) E[ sgn(Y1) ϕ(F(Y1)) I( |ϕ(F(Y1))| > (εBn − n^{-1/2}|t1|)/Mn ) ] t2′ Σ_{k=1}^n xk
  + E[ ϕ²(F(Y1)) I( |ϕ(F(Y1))| > (εBn − n^{-1/2}|t1|)/Mn ) ] (1/n) Σ_{k=1}^n (t2′xk)² .

Because the design is centered, the middle term on the right side is 0. As remarked above, the term (1/n)Σ_{k=1}^n (t2′xk)² = (1/n)t2′X′Xt2 converges to a positive constant. In the expression (εBn − n^{-1/2}|t1|)/Mn, the numerator converges to a positive constant as the denominator converges to 0; hence, the expression goes to ∞. Therefore, since ϕ is bounded, the indicator function converges to 0. Again using the boundedness of ϕ, we can interchange limit and expectation by the Lebesgue Dominated Convergence Theorem. Thus condition (3.5.27) is true and, hence, Z*n converges in distribution to a univariate normal distribution. Therefore Tn converges to a multivariate normal distribution. Note by (3.5.26) it follows that the asymptotic covariance of b̂ϕ is the result displayed in the theorem.

In the above development, we considered the centered design. In practice, though, we are often concerned with an uncentered design. Let α* denote the intercept for the uncentered model. Then α* = α − x̄′β, where x̄ denotes the vector of column averages of the uncentered design matrix. An estimate of α* based on R-estimates is given by α̂*S = α̂S − x̄′β̂ϕ. Based on the last theorem, it follows (Exercise 3.16.14) that

(α̂*S, β̂ϕ′)′ is approximately N_{p+1}( (α0, β0′)′ , [ κn  −τϕ²x̄′(X′X)⁻¹ ; −τϕ²(X′X)⁻¹x̄  τϕ²(X′X)⁻¹ ] ) , (3.5.30)

(rows of the covariance matrix separated by semicolons), where κn = n⁻¹τS² + τϕ²x̄′(X′X)⁻¹x̄ and τS and τϕ are given respectively by (3.4.6) and (3.4.4).


Intercept Estimate Based on Signed-Rank Scores

Suppose we additionally assume that the errors have a symmetric distribution; i.e., f(−x) = f(x). In this case, all location functionals are the same. Let ϕf(u) = −f′(F⁻¹(u))/f(F⁻¹(u)) denote the optimal scores for the density f(x). Then as Exercise 3.16.12 shows, ϕf(1 − u) = −ϕf(u); that is, the scores are odd about 1/2. Hence, in this subsection we will additionally assume that the scores satisfy property (S.3), (3.4.12).

For scores satisfying (S.3), the corresponding signed-rank scores are generated as a⁺(i) = ϕ⁺(i/(n + 1)) where ϕ⁺(u) = ϕ((u + 1)/2); see the discussion in Section 2.5.3. For example, if Wilcoxon scores are used, ϕ(u) = √12(u − 1/2), then the signed-rank score function is ϕ⁺(u) = √3 u. Recall from Chapter 1 that these signed-rank scores can be used to define a norm and a subsequent R-analysis. Here we only want to apply the associated one sample signed-rank procedure to the residuals in order to obtain an estimate of the intercept. So consider the process

T⁺(êR − α1) = Σ_{i=1}^n sgn(êRi − α) a⁺(R|êRi − α|) , (3.5.31)

where êRi = yi − xi′β̂ϕ; see (1.8.2). Note that this is the process discussed in Section 1.8, except now the iid observations are replaced by residuals. The process is still a nonincreasing function of α which steps down at the Walsh averages of the residuals; see Exercise 1.12.28. The estimate of the intercept is a value α̂⁺ϕ which solves the equation

T⁺(êR − α) ≐ 0 . (3.5.32)

If Wilcoxon scores are used then the estimate is the median of the Walsh averages, (1.3.25), while if sign scores are used the estimate is the median of the residuals.

Let b̂⁺ϕ = (α̂⁺ϕ, β̂ϕ′)′. We next briefly sketch the development of the asymptotic distribution of b̂⁺ϕ. Assume without loss of generality that the true parameter vector (α0, β0′)′ is 0. Suppose instead of the residuals we had the true errors in (3.5.31). Theorem A.2.11 of the Appendix then yields an asymptotic linearity result for the process. McKean and Hettmansperger (1976) show that this result holds for the residuals also; that is,

(1/√n) T⁺(êR − α1) = (1/√n) T⁺(e) − ατϕ⁻¹ + o_p(1) , (3.5.33)

for all |α| ≤ c, where c > 0. Using arguments similar to those in McKean and Hettmansperger (1976), we can show that √n α̂⁺ϕ is bounded in probability; hence, by (3.5.33) we have that

√n α̂⁺ϕ = τϕ (1/√n) T⁺(e) + o_p(1) . (3.5.34)


But by (A.2.43) and (A.2.45) of the Appendix, we have the second representation given by

√n α̂⁺ϕ = τϕ (1/√n) Σ_{i=1}^n ϕ⁺(F⁺(|ei|)) sgn(ei) + o_p(1)
        = τϕ (1/√n) Σ_{i=1}^n ϕ⁺(2F(ei) − 1) + o_p(1) , (3.5.35)

where F⁺ is the distribution function of the absolute errors |ei|. Due to symmetry, F⁺(t) = 2F(t) − 1, for t ≥ 0. Then using the relationship between the rank and the signed-rank scores, ϕ⁺(u) = ϕ((u + 1)/2), we obtain finally

√n α̂⁺ϕ = τϕ (1/√n) Σ_{i=1}^n ϕ(F(Yi)) + o_p(1) . (3.5.36)

Therefore, using expressions (3.5.24) and (3.5.36), we have the asymptotic representation of the estimates:

√n (α̂⁺ϕ, β̂ϕ′)′ = (τϕ/√n) ( 1′ϕ(F(Y)), ((n⁻¹X′X)⁻¹X′ϕ(F(Y)))′ )′ + o_p(1) . (3.5.37)

This and an application of the Lindeberg Central Limit Theorem, similar to the proof of Theorem 3.5.7, leads to the theorem,

Theorem 3.5.8. Under assumptions (D.1), (D.2), (E.1), (E.2), (S.1) and (S.3) of Section 3.4,

(α̂⁺ϕ, β̂ϕ′)′ has an approximate N_{p+1}( (α0, β0′)′ , τϕ²(X1′X1)⁻¹ ) distribution , (3.5.38)

where X1 = [1 X].

3.6 Theory of Rank-Based Tests

Consider the general linear hypotheses discussed in Section 3.2,

H0 : Mβ = 0 versus HA : Mβ ≠ 0 , (3.6.1)

where M is a q × p matrix of full row rank. The geometry of R testing, Section 3.2.2, indicated the statistic based on the reduction in dispersion between the reduced and full models, Fϕ = (RDϕ/q)/(τ̂ϕ/2); see (3.2.18), as a test statistic. In this section we develop the asymptotic theory for this test statistic under null and alternative hypotheses. This theory will be sufficient for two other rank-based tests which we will discuss later. See Table 3.2.2 and the discussion relating to that table for the special case when M = I.


3.6.1 Null Theory of Rank Based Tests

We proceed with two lemmas about the dispersion function D(β) and its quadratic approximation Q(β) given by expression (3.5.11).

Lemma 3.6.1. Let β̂ denote the R-estimate of β in the full model (3.2.3). Then under (E.1), (S.1), (D.1) and (D.2) of Section 3.4,

D(β̂) − Q(β̂) →_P 0 . (3.6.2)

Proof: Assume without loss of generality that the true β is 0. Let ε > 0 be given. Choose c0 such that P[√n‖β̂‖ > c0] < ε/2, for n sufficiently large. Using asymptotic quadraticity, Theorem A.3.8, we have for n sufficiently large

P[ |D(β̂) − Q(β̂)| < ε ] ≥ P[ { max_{‖β‖<c0/√n} |D(β) − Q(β)| < ε } ∩ { √n‖β̂‖ < c0 } ] > 1 − ε . (3.6.3)

From this we obtain the result.

The last result shows that D and Q are close at the R-estimate of β. Our next result shows that Q(β̂) is close to the minimum of Q.

Lemma 3.6.2. Let β̃ denote the minimizing value of the quadratic function Q. Then under (E.1), (S.1), (D.1) and (D.2) of Section 3.4,

Q(β̂) − Q(β̃) →_P 0 . (3.6.4)

Proof: By simple algebra we have

Q(β̂) − Q(β̃) = (2τϕ)⁻¹(β̂ − β̃)′X′X(β̂ + β̃) − (β̂ − β̃)′S(Y)
            = √n(β̂ − β̃)′ [ (2τϕ)⁻¹(n⁻¹X′X)√n(β̂ + β̃) − n^{-1/2}S(Y) ] .

It is shown in Exercise 3.16.15 that the term in brackets in the last equation is bounded in probability. Since the left factor converges to zero in probability by Theorem 3.5.5, the desired result follows.

It is easier to work with the equivalent formulation of the linear hypotheses given by

Lemma 3.6.3. An equivalent formulation of the model and the hypotheses is:

Y = 1α + X1*β1* + X2*β2* + e , (3.6.5)

with the hypotheses H0 : β2* = 0 versus HA : β2* ≠ 0, where Xi* and βi*, i = 1, 2, are defined in display (3.6.7).


Proof: Consider the QR decomposition of M′ given by

M′ = [Q2 Q1] [ R ; O ] = Q2R , (3.6.6)

where R is stacked above O, the columns of Q1 form an orthonormal basis for the kernel of the matrix M, the columns of Q2 form an orthonormal basis for the column space of M′, O is a (p − q) × q matrix of 0's, and R is a q × q upper triangular, nonsingular matrix. Define

Xi* = XQi and βi* = Qi′β for i = 1, 2 . (3.6.7)

It follows that

Y = 1α + Xβ + e = 1α + X1*β1* + X2*β2* + e .

Further, Mβ = 0 if and only if β2* = 0, which yields the desired result.

Without loss of generality, by the last lemma, for the remainder of the section we will consider a model of the form

$$ \mathbf{Y} = \mathbf{1}\alpha + \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \mathbf{e} , \qquad (3.6.8) $$

with the hypotheses

$$ H_0\colon \boldsymbol{\beta}_2 = \mathbf{0} \ \ \text{versus} \ \ H_A\colon \boldsymbol{\beta}_2 \neq \mathbf{0} . \qquad (3.6.9) $$

With these lemmas we are now ready to obtain the asymptotic distribution of Fϕ. Let β_r = (β′₁, 0′)′ denote the reduced model vector of parameters, let β̂_{r,1} denote the reduced model R-estimate of β1, and let β̂_r = (β̂′_{r,1}, 0′)′. We shall use similar notation for the minimizing value of the approximating quadratic Q. With this notation the drop in dispersion becomes RDϕ = D(β̂_r) − D(β̂). McKean and Hettmansperger (1976) proved the following:

Theorem 3.6.1. Suppose the assumptions (E.1), (D.1), (D.2), and (S.1) of Section 3.4 hold. Then, under H0,

$$ \frac{RD_\varphi}{\tau_\varphi/2} \stackrel{D}{\rightarrow} \chi^2(q) , $$

where RDϕ is formally defined in expression (3.2.16).

Proof: Assume that the true vector of parameters is 0 and suppress the subscript ϕ on RD. Write RD as the sum of five differences:

$$ RD = D(\widehat{\boldsymbol{\beta}}_r) - D(\widehat{\boldsymbol{\beta}}) = \left(D(\widehat{\boldsymbol{\beta}}_r) - Q(\widehat{\boldsymbol{\beta}}_r)\right) + \left(Q(\widehat{\boldsymbol{\beta}}_r) - Q(\widetilde{\boldsymbol{\beta}}_r)\right) + \left(Q(\widetilde{\boldsymbol{\beta}}_r) - Q(\widetilde{\boldsymbol{\beta}})\right) + \left(Q(\widetilde{\boldsymbol{\beta}}) - Q(\widehat{\boldsymbol{\beta}})\right) + \left(Q(\widehat{\boldsymbol{\beta}}) - D(\widehat{\boldsymbol{\beta}})\right) . $$


By Lemma 3.6.1 the first and fifth differences go to zero in probability, and by Lemma 3.6.2 the second and fourth differences go to zero in probability. Hence we need only show that the middle difference converges in distribution to the intended distribution. As in Lemma 3.6.2, algebra leads to

$$ Q(\widetilde{\boldsymbol{\beta}}) = -2^{-1}\tau_\varphi\, S(\mathbf{Y})'(\mathbf{X}'\mathbf{X})^{-1}S(\mathbf{Y}) + D(\mathbf{Y}) , $$

while

$$ Q(\widetilde{\boldsymbol{\beta}}_r) = -2^{-1}\tau_\varphi\, S(\mathbf{Y})'\begin{bmatrix}(\mathbf{X}_1'\mathbf{X}_1)^{-1} & \mathbf{0}\\ \mathbf{0} & \mathbf{0}\end{bmatrix}S(\mathbf{Y}) + D(\mathbf{Y}) . $$

Combining these last two results, the middle difference becomes

$$ Q(\widetilde{\boldsymbol{\beta}}_r) - Q(\widetilde{\boldsymbol{\beta}}) = 2^{-1}\tau_\varphi\, S(\mathbf{Y})'\left((\mathbf{X}'\mathbf{X})^{-1} - \begin{bmatrix}(\mathbf{X}_1'\mathbf{X}_1)^{-1} & \mathbf{0}\\ \mathbf{0} & \mathbf{0}\end{bmatrix}\right)S(\mathbf{Y}) . $$

Using a well known matrix identity (see page 27 of Searle, 1971),

$$ (\mathbf{X}'\mathbf{X})^{-1} = \begin{bmatrix}(\mathbf{X}_1'\mathbf{X}_1)^{-1} & \mathbf{0}\\ \mathbf{0} & \mathbf{0}\end{bmatrix} + \begin{bmatrix}-\mathbf{A}_1^{-1}\mathbf{B}\\ \mathbf{I}\end{bmatrix}\mathbf{W}\begin{bmatrix}-\mathbf{B}'\mathbf{A}_1^{-1} & \mathbf{I}\end{bmatrix} , $$

where

$$ \mathbf{X}'\mathbf{X} = \begin{bmatrix}\mathbf{A}_1 & \mathbf{B}\\ \mathbf{B}' & \mathbf{A}_2\end{bmatrix} \quad\text{and}\quad \mathbf{W} = \left(\mathbf{A}_2 - \mathbf{B}'\mathbf{A}_1^{-1}\mathbf{B}\right)^{-1} . \qquad (3.6.10) $$

Hence, after some simplification, we have

$$ \frac{RD}{\tau_\varphi/2} = S(\mathbf{Y})'\begin{bmatrix}-\mathbf{A}_1^{-1}\mathbf{B}\\ \mathbf{I}\end{bmatrix}\mathbf{W}\begin{bmatrix}-\mathbf{B}'\mathbf{A}_1^{-1} & \mathbf{I}\end{bmatrix}S(\mathbf{Y}) + o_p(1) $$
$$ = \left(\begin{bmatrix}-\mathbf{B}'\mathbf{A}_1^{-1} & \mathbf{I}\end{bmatrix}S(\mathbf{Y})\right)'\mathbf{W}\left(\begin{bmatrix}-\mathbf{B}'\mathbf{A}_1^{-1} & \mathbf{I}\end{bmatrix}S(\mathbf{Y})\right) + o_p(1) $$
$$ = \left(\begin{bmatrix}-\mathbf{B}'\mathbf{A}_1^{-1} & \mathbf{I}\end{bmatrix}n^{-1/2}S(\mathbf{Y})\right)'\,n\mathbf{W}\,\left(\begin{bmatrix}-\mathbf{B}'\mathbf{A}_1^{-1} & \mathbf{I}\end{bmatrix}n^{-1/2}S(\mathbf{Y})\right) + o_p(1) . \qquad (3.6.11) $$

Using n⁻¹X′X → Σ and the asymptotic distribution of n⁻¹ᐟ²S(Y), Theorem 3.5.2, it follows that the right side of (3.6.11) converges in distribution to a χ² random variable with q degrees of freedom, which completes the proof of the theorem.

A consistent estimate of τϕ is discussed in Section 3.7. We shall denote this estimate by τ̂ϕ. The test statistic we shall subsequently use is given by

$$ F_\varphi = \frac{RD_\varphi/q}{\widehat{\tau}_\varphi/2} . \qquad (3.6.12) $$

Although the test statistic qFϕ has an asymptotic χ² distribution, small sample studies (see below) have indicated that it is best to compare the test statistic with F-critical values having q and n − p − 1 degrees of freedom; that is, the test at nominal level α is:

$$ \text{Reject } H_0\colon \mathbf{M}\boldsymbol{\beta} = \mathbf{0} \text{ in favor of } H_A\colon \mathbf{M}\boldsymbol{\beta} \neq \mathbf{0} \text{ if } F_\varphi \geq F(\alpha, q, n - p - 1) . \qquad (3.6.13) $$
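To make the decision rule (3.6.13) concrete, the following minimal sketch computes Fϕ from the dispersions of the reduced and full model R-fits. The dispersion values and the estimate τ̂ϕ are assumed to be supplied by those fits; the function and argument names are ours, not part of any package.

```python
# Hedged sketch of (3.6.12)-(3.6.13): D_red and D_full are the minimized
# dispersions of the reduced and full model R-fits; tau_hat is a consistent
# estimate of tau_phi.  Names are illustrative only.
from scipy.stats import f

def drop_in_dispersion_test(D_red, D_full, tau_hat, q, n, p, alpha=0.05):
    RD = D_red - D_full                       # reduction in dispersion
    F_phi = (RD / q) / (tau_hat / 2.0)        # test statistic (3.6.12)
    crit = f.ppf(1.0 - alpha, q, n - p - 1)   # F-critical value in (3.6.13)
    return F_phi, f.sf(F_phi, q, n - p - 1), F_phi >= crit
```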


McKean and Sheather (1991) review numerous small sample studies concerning the validity of the rank-based analysis based on the test statistic Fϕ. These small sample studies demonstrate that the empirical α-levels of Fϕ over a variety of designs, sample sizes and error distributions are close to the nominal values.

In classical inference there are three tests of general hypotheses: the likelihood ratio test (the reduction in sums of squares test), Wald's test, and Rao's scores (gradient) test. A good discussion of these tests can be found in Rao (1973). When the hypotheses are the general linear hypotheses (3.6.1), the errors have a normal distribution, and the least squares procedure is used, the three test statistics are algebraically equivalent. Actually, the equivalence holds without normality, although in this case the reduction in sums of squares statistic is not the likelihood ratio test; see the discussion in Hettmansperger and McKean (1983).

There are three rank-based tests for the general linear hypotheses, also. The reduction in dispersion test statistic Fϕ is the analogue of the likelihood ratio test, i.e., the reduction in sums of squares test. Since Wald's test statistic is a quadratic form in full model estimates, its rank analogue is given by

$$ F_{\varphi,Q} = \frac{\left(\mathbf{M}\widehat{\boldsymbol{\beta}}\right)'\left[\mathbf{M}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{M}'\right]^{-1}\left(\mathbf{M}\widehat{\boldsymbol{\beta}}\right)/q}{\widehat{\tau}_\varphi^{\,2}} . \qquad (3.6.14) $$

Provided τ̂ϕ is a consistent estimate of τϕ, it follows from the asymptotic distribution of β̂ϕ, Corollary 3.5.1, that under H0, qFϕ,Q has an asymptotic χ² distribution. Hence the test statistics Fϕ and Fϕ,Q have the same null asymptotic distributions. Actually, as Exercise 3.16.16 shows, the difference of the test statistics converges to zero in probability under H0. Unlike the classical methods, though, they are not algebraically equivalent; see Hettmansperger and McKean (1983).
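As an illustration, a minimal sketch of the Wald-type statistic (3.6.14) follows; it assumes that a full model R-estimate β̂ and a consistent estimate τ̂ϕ are already available, and the names are ours.

```python
# Sketch of the quadratic form (3.6.14); X is the centered design matrix,
# M the q x p hypothesis matrix; beta_hat and tau_hat are assumed given.
import numpy as np

def wald_type_statistic(beta_hat, tau_hat, X, M):
    q = M.shape[0]
    Mb = M @ beta_hat
    XtX_inv = np.linalg.inv(X.T @ X)
    quad = Mb @ np.linalg.solve(M @ XtX_inv @ M.T, Mb)  # (Mb)'[M(X'X)^-1 M']^-1 (Mb)
    return (quad / q) / tau_hat**2                      # F_{phi,Q}
```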

The rank gradient scores test is easiest to define in terms of the reparameterized model (3.6.8); that is, the null hypothesis is H0: β2 = 0. Rewrite the random vector defined in (3.6.11) of Theorem 3.6.1 using as the true parameter under H0, β0 = (β′₀₁, 0′)′; i.e.,

$$ \left(\begin{bmatrix}-\mathbf{B}'\mathbf{A}_1^{-1} & \mathbf{I}\end{bmatrix}n^{-1/2}S(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}_0)\right)'\,n\mathbf{W}\,\left(\begin{bmatrix}-\mathbf{B}'\mathbf{A}_1^{-1} & \mathbf{I}\end{bmatrix}n^{-1/2}S(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}_0)\right) . \qquad (3.6.15) $$

From the proof of Theorem 3.6.1 this quadratic form has an asymptotic χ² distribution with q degrees of freedom. Since it does depend on β0, it cannot be used as a test statistic. Suppose we substitute the reduced model R-estimate of β1, i.e., the first p − q components of β̂r, defined immediately after expression (3.6.9); we shall call it β̂01. Now since this is the reduced model R-estimate, we have

$$ S(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}_r) \doteq \begin{bmatrix}\mathbf{0}\\ S_2(\mathbf{Y} - \mathbf{X}_1\widehat{\boldsymbol{\beta}}_{r,1})\end{bmatrix} , \qquad (3.6.16) $$

where the subscript 2 on S denotes the last q components of S. This yields

$$ A_\varphi = S_2(\mathbf{Y} - \mathbf{X}_1\widehat{\boldsymbol{\beta}}_{r,1})'\left[\mathbf{X}_2'\mathbf{X}_2 - \mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\right]^{-1}S_2(\mathbf{Y} - \mathbf{X}_1\widehat{\boldsymbol{\beta}}_{r,1}) \qquad (3.6.17) $$


as a test statistic. This is often called the aligned rank test, since the observations are aligned by the reduced model estimate. Exercise 3.16.17 shows that under H0, Aϕ has an asymptotic χ² distribution. As the proof shows, the difference between qFϕ and Aϕ converges to zero in probability under H0. Aligned rank tests were introduced by Hodges and Lehmann (1962) and are developed in the linear model by Puri and Sen (1985).
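A minimal sketch of Aϕ for Wilcoxon scores is given next; e_r denotes the residuals from a reduced model R-fit, X1 and X2 partition the centered design matrix, and the function name is ours.

```python
# Sketch of the aligned rank statistic (3.6.17) with Wilcoxon scores
# phi(u) = sqrt(12)(u - 1/2); note that no estimate of tau_phi is needed.
import numpy as np
from scipy.stats import rankdata, chi2

def aligned_rank_test(e_r, X1, X2):
    n = len(e_r)
    a = np.sqrt(12.0) * (rankdata(e_r) / (n + 1) - 0.5)   # scores a(R(e_i))
    S2 = X2.T @ a                                         # last q components of S
    # Schur complement  X2'X2 - X2'X1 (X1'X1)^{-1} X1'X2
    C = X2.T @ X2 - X2.T @ X1 @ np.linalg.solve(X1.T @ X1, X1.T @ X2)
    A_phi = S2 @ np.linalg.solve(C, S2)
    return A_phi, chi2.sf(A_phi, X2.shape[1])             # chi^2(q) p-value
```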

Suppose in (3.6.16) we use a reduced model estimate β̂*_{r,1} which is not the R-estimate; for example, it may be the LS-estimate. Then we have

$$ S(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}_r^{*}) \doteq \begin{bmatrix}S_1(\mathbf{Y} - \mathbf{X}_1\widehat{\boldsymbol{\beta}}_{r,1}^{*})\\ S_2(\mathbf{Y} - \mathbf{X}_1\widehat{\boldsymbol{\beta}}_{r,1}^{*})\end{bmatrix} . \qquad (3.6.18) $$

The reduced model estimate must satisfy √n(β̂*_r − β0) = Op(1) under H0. Then the statistic in (3.6.17) becomes

$$ A_\varphi^{*} = S_2^{*\prime}\left[\mathbf{X}_2'\mathbf{X}_2 - \mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\right]^{-1}S_2^{*} , \qquad (3.6.19) $$

where, from (3.6.11),

$$ S_2^{*} = S_2(\mathbf{Y} - \mathbf{X}_1\widehat{\boldsymbol{\beta}}_{r,1}^{*}) - \mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}S_1(\mathbf{Y} - \mathbf{X}_1\widehat{\boldsymbol{\beta}}_{r,1}^{*}) . \qquad (3.6.20) $$

Note that when the R-estimate is used, the second term in S*₂ vanishes and we have (3.6.17); see Adichie (1978) and Chiang and Puri (1984).

Hettmansperger and McKean (1983) give a general discussion of these three tests. Note that both Fϕ,Q and Fϕ require full model estimates and an estimate of the scale parameter τϕ, while Aϕ does not. However, when using a linear model, one is usually interested in more than hypothesis testing. Of primary interest is checking the quality of the fit, i.e., does the model fit the data? This requires estimation of the full model parameters and an estimate of τϕ. Diagnostics for fits based on R-estimates are the topics of Section 3.9. One is also usually interested in estimating contrasts and their standard errors; for R-estimates this requires an estimate of τϕ. Moreover, as discussed in Hettmansperger and McKean (1983), the small sample properties of the aligned rank test can be poor on certain designs.

The influence function of the test statistic Fϕ is derived in Appendix A.5.2. As discussed there, it is easier to work with √(qFϕ). The result is given by

$$ \Omega(\mathbf{x}_0, y_0; \sqrt{qF_\varphi}) = \left|\varphi\!\left[F(y_0 - \mathbf{x}_0'\boldsymbol{\beta}_r)\right]\right|\left\{\mathbf{x}_0'\left((\mathbf{X}'\mathbf{X})^{-1} - \begin{bmatrix}(\mathbf{X}_1'\mathbf{X}_1)^{-1} & \mathbf{0}\\ \mathbf{0} & \mathbf{0}\end{bmatrix}\right)\mathbf{x}_0\right\}^{1/2} . \qquad (3.6.21) $$

As shown in the Appendix, the null distribution of Fϕ can be read from this result. Note that, similar to the R-estimates, the influence function of Fϕ is bounded in the Y-space but not in the x-space; see (3.5.17).

3.6.2 Theory of Rank-Based Tests under Alternatives

In the last section we developed the null asymptotic theory of the rank-based tests based on a general score function. In this section we obtain some properties of these tests under alternative models. We show first that the test based on the reduction in dispersion, RDϕ, (3.2.16), is consistent under any alternative to the general linear hypothesis. We then show that the efficiency of these tests is the same as the efficiency results obtained in Chapter 2.

Consistency

We want to show that the test statistic Fϕ is consistent for the general linear hypothesis (3.2.5). Without loss of generality, we will again reparameterize the model as in (3.6.8) and consider as our hypothesis H0: β2 = 0 versus HA: β2 ≠ 0. Let β0 = (β′₀₁, β′₀₂)′ be the true parameter. We will assume that the alternative is true; hence, β02 ≠ 0. Let α be a given level of significance and let T(τ̂ϕ) = RDϕ/(τ̂ϕ/2), where RDϕ = D(β̂r) − D(β̂). Because we estimate τϕ under the full model by a consistent estimate, to show the consistency of Fϕ it suffices to show that

$$ P_{\boldsymbol{\beta}_0}\!\left[T(\tau_\varphi) \geq \chi^2_{\alpha,q}\right] \rightarrow 1 , \qquad (3.6.22) $$

as n → ∞.

As in the proof under the null hypothesis, it is convenient to work with the approximating quadratic function Q(Y − Xβ), (3.5.11). As above, let β̃ and β̂ denote the minimizing values of Q and D, respectively, under the full model. The present argument simplifies if, for the full model, we replace β̂ by β̃ in T(τϕ). We can do this because we can write

$$ D(\mathbf{Y} - \mathbf{X}\widetilde{\boldsymbol{\beta}}) - D(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}) = \left(D(\mathbf{Y} - \mathbf{X}\widetilde{\boldsymbol{\beta}}) - Q(\mathbf{Y} - \mathbf{X}\widetilde{\boldsymbol{\beta}})\right) + \left(Q(\mathbf{Y} - \mathbf{X}\widetilde{\boldsymbol{\beta}}) - Q(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}})\right) + \left(Q(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}) - D(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}})\right) . $$

Applying asymptotic quadraticity, Theorem A.3.8, the first and third differences go to 0 in probability, while the second difference goes to 0 in probability by Lemma 3.6.2; hence the left side goes to 0 in probability under the alternative model. Thus we need only show that

$$ P_{\boldsymbol{\beta}_0}\!\left[(2/\tau_\varphi)\left(D(\widehat{\boldsymbol{\beta}}_r) - D(\widetilde{\boldsymbol{\beta}})\right) \geq \chi^2_{\alpha,q}\right] \rightarrow 1 , \qquad (3.6.23) $$

where, as above, β̂r denotes the reduced model R-estimate. We state the result next; the proof can be found in the appendix, see Theorem A.3.10.

Theorem 3.6.2. Suppose conditions (E.1), (D.1), (D.2), and (S.1) of Section 3.4 hold. The test statistic Fϕ is consistent for the hypotheses (3.2.5).

Efficiency Results

The above result establishes that the rank-based test statistic Fϕ is consistent for the general linear hypothesis (3.2.5). We next derive the efficiency results of the test. Our first step is to obtain the asymptotic power of Fϕ along a sequence of alternatives. This generalizes the asymptotic power lemmas discussed in Chapters 1 and 2. From this the efficiency results will follow. As with the consistency discussion, it is more convenient to work with the model (3.6.8).


The sequence of alternative models to the hypothesis H0: β2 = 0 is:

$$ \mathbf{Y} = \mathbf{1}\alpha + \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2(\boldsymbol{\theta}/\sqrt{n}) + \mathbf{e} , \qquad (3.6.24) $$

where θ is a nonzero vector. Because R-estimates are invariant to location shifts, we can assume without loss of generality that β1 = 0. Let βn = (0′, θ′/√n)′ and let Hn denote the hypothesis that (3.6.24) is the true model. The concept of contiguity will prove helpful with the asymptotic theory of the statistic Fϕ under this sequence of models. A discussion of contiguity is given in the appendix; see Section A.2.2.

Theorem 3.6.3. Under the sequence of models (3.6.24) and the assumptions (E.1), (D.1), (D.2), and (S.1) of Section 3.4,

$$ P_{\boldsymbol{\beta}_n}\!\left(T(\tau_\varphi) \leq t\right) \rightarrow P\!\left(\chi^2_q(\eta_\varphi) \leq t\right) , \qquad (3.6.25) $$

where χ²_q(ηϕ) has a noncentral χ²-distribution with q degrees of freedom and noncentrality parameter

$$ \eta_\varphi = \tau_\varphi^{-2}\,\boldsymbol{\theta}'\mathbf{W}_0^{-1}\boldsymbol{\theta} , \qquad (3.6.26) $$

where W0 = lim_{n→∞} nW and W is defined in display (3.6.10).

Proof: As in the proof of Theorem 3.6.1, we can write the drop in dispersion as the sum of the same five differences. Since the first two and last two differences go to zero in probability under the null model, it follows from the discussion on contiguity (Section A.2.2) that these differences go to zero in probability under the model (3.6.24). Hence we need only be concerned about the middle difference. Since β1 = 0, the middle difference reduces to the same quantity as in Theorem 3.6.1; i.e., we obtain

$$ \frac{RD_\varphi}{\tau_\varphi/2} = \left(\begin{bmatrix}-\mathbf{B}'\mathbf{A}_1^{-1} & \mathbf{I}\end{bmatrix}S(\mathbf{Y})\right)'\mathbf{W}\left(\begin{bmatrix}-\mathbf{B}'\mathbf{A}_1^{-1} & \mathbf{I}\end{bmatrix}S(\mathbf{Y})\right) + o_p(1) . $$

The asymptotic linearity result derived in the Appendix (Theorem A.3.8) is

$$ \sup_{\sqrt{n}\|\boldsymbol{\beta}\| \leq c}\left\|n^{-1/2}S(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) - \left(n^{-1/2}S(\mathbf{Y}) - \tau_\varphi^{-1}\boldsymbol{\Sigma}\sqrt{n}\boldsymbol{\beta}\right)\right\| = o_p(1) , $$

for all c > 0. Since √n‖βn‖ = ‖θ‖, we can take c = ‖θ‖ and get

$$ \left\|n^{-1/2}S(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}_n) - \left(n^{-1/2}S(\mathbf{Y}) - \tau_\varphi^{-1}\boldsymbol{\Sigma}(\mathbf{0}', \boldsymbol{\theta}')'\right)\right\| = o_p(1) . \qquad (3.6.27) $$

The above probability statements hold under the null model and hence, by contiguity, under the sequence of models (3.6.24) also. Under the sequence of models (3.6.24), however,

$$ n^{-1/2}S(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}_n) \stackrel{D}{\rightarrow} N_p(\mathbf{0}, \boldsymbol{\Sigma}) . $$

Hence, under the sequence of models (3.6.24),

$$ n^{-1/2}S(\mathbf{Y}) \stackrel{D}{\rightarrow} N_p\!\left(\tau_\varphi^{-1}\boldsymbol{\Sigma}(\mathbf{0}', \boldsymbol{\theta}')', \boldsymbol{\Sigma}\right) . \qquad (3.6.28) $$


Then, under the sequence of models (3.6.24),

$$ \begin{bmatrix}-\mathbf{B}'\mathbf{A}_1^{-1} & \mathbf{I}\end{bmatrix}n^{-1/2}S(\mathbf{Y}) \stackrel{D}{\rightarrow} N_q\!\left(\tau_\varphi^{-1}\mathbf{W}_0^{-1}\boldsymbol{\theta},\ \mathbf{W}_0^{-1}\right) . $$

From this last result the conclusion readily follows.

Several interesting remarks follow from this theorem. First, since W0 is positive definite, under alternatives the noncentrality parameter ηϕ > 0. Thus the asymptotic distribution of T(τϕ) under the sequence of models (3.6.24) has mean q + ηϕ. Furthermore, the asymptotic power of a level α test based on T(τϕ) is P[χ²_q(ηϕ) ≥ χ²_{α,q}].

Second, note that we can write the noncentrality parameter as

$$ \eta_\varphi = (\tau_\varphi^2 n)^{-1}\left[\boldsymbol{\theta}'\mathbf{A}_2\boldsymbol{\theta} - (\mathbf{B}\boldsymbol{\theta})'\mathbf{A}_1^{-1}\mathbf{B}\boldsymbol{\theta}\right] . $$

Both matrices A2 and A1⁻¹ are positive definite; hence, the noncentrality parameter is maximized when θ is in the kernel of B. One way of assuring this for a design is to take B = 0. Because B = X′₁X2, this condition holds for orthogonal designs. Therefore orthogonal designs are generally more efficient than non-orthogonal designs.

We next obtain the asymptotic relative efficiency of the test statistic Fϕ with respect to the least squares classical F-test, FLS, defined by (3.2.17) in Section 3.2.2. The theory for FLS under local alternatives is outlined in Exercise 3.16.18, where it is shown that, under the additional assumption that the random errors ei have finite variance σ², the null asymptotic distribution of qFLS is a central χ²_q distribution. Thus both Fϕ and FLS have the same asymptotic null distribution. As outlined in Exercise 3.16.18, under the sequence of models (3.6.24), qFLS has an asymptotic noncentral χ²_q(ηLS) distribution with noncentrality parameter

$$ \eta_{LS} = (\sigma^2)^{-1}\,\boldsymbol{\theta}'\mathbf{W}_0^{-1}\boldsymbol{\theta} . \qquad (3.6.29) $$

Based on Theorem 3.6.3, the asymptotic relative efficiency of Fϕ and FLS is the ratio of their noncentrality parameters; i.e.,

$$ e(F_\varphi, F_{LS}) = \frac{\eta_\varphi}{\eta_{LS}} = \frac{\sigma^2}{\tau_\varphi^2} . $$

Thus the efficiency results for the rank-based estimates and tests discussed in this section are the same as the efficiency results presented in Chapters 1 and 2. An asymptotically efficient analysis can be obtained if the selected rank score function is ϕ_{f₀}(u) = −f′₀(F₀⁻¹(u))/f₀(F₀⁻¹(u)), where f₀ is the form of the density of the error distribution. If the errors have a logistic distribution, then the Wilcoxon scores will result in an asymptotically efficient analysis.

Usually we have no knowledge of the distribution of the errors, in which case we recommend using Wilcoxon scores. With them, the loss in efficiency relative to the classical analysis at the normal distribution is only 5%, while the gain in efficiency over the classical analysis for long-tailed error distributions can be substantial, as discussed in Chapters 1 and 2.
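These efficiencies are easy to check numerically. The sketch below evaluates e(Fϕ, FLS) = σ²/τϕ² for Wilcoxon scores, for which τϕ⁻¹ = √12 ∫f²(x)dx; the function name is ours.

```python
# Numerical check of e(F_phi, F_LS) = sigma^2 / tau_phi^2 for Wilcoxon scores.
import numpy as np
from scipy import integrate, stats

def are_wilcoxon_vs_ls(pdf, variance):
    int_f2, _ = integrate.quad(lambda x: pdf(x)**2, -np.inf, np.inf)
    tau_phi = 1.0 / (np.sqrt(12.0) * int_f2)   # tau_phi = 1/(sqrt(12) int f^2)
    return variance / tau_phi**2

print(are_wilcoxon_vs_ls(stats.norm.pdf, 1.0))   # 3/pi ~ 0.955 at the normal
print(are_wilcoxon_vs_ls(stats.t(3).pdf, 3.0))   # ~ 1.9 at t with 3 df
```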


Many of the studies reviewed in the article by McKean and Sheather (1991) included power comparisons of the rank-based analyses with the least squares F-test, FLS. The empirical power of FLS at normal error distributions was slightly better than the empirical power of Fϕ under Wilcoxon scores. Under error distributions with heavier tails than the normal distribution, the empirical power of Fϕ was generally larger, often much larger, than the empirical power of FLS. These studies provide empirical evidence that the good asymptotic efficiency properties of the rank-based analysis hold in the small sample setting.

As discussed above, the noncentrality parameters of the test statistics Fϕ and FLS differ only in the scale parameters. Hence, in practice, planning designs based on the noncentrality parameter of Fϕ can proceed similarly to the planning of a design using the noncentrality parameter of FLS; see, for example, the discussion in Chapter 4 of Graybill (1976).

3.6.3 Further Remarks on the Dispersion Function

Let ê denote the rank-based residuals when the linear model, (3.2.4), is fit using the scores based on the function ϕ. Suppose the same assumptions hold as above, i.e., (E.1), (D.1), and (D.2) in Section 3.4. In this section we explore further properties of the residual dispersion D(ê); see also Sections 3.9.2 and 3.11.

The functional corresponding to the dispersion function evaluated at the errors ei is determined as follows: letting F̂n denote the empirical distribution function of the iid errors e1, ..., en, we have

$$ \frac{1}{n}D(\mathbf{e}) = \sum_{i=1}^{n} a(R(e_i))\,e_i\,\frac{1}{n} = \sum_{i=1}^{n} \varphi\!\left(\frac{n}{n+1}\widehat{F}_n(e_i)\right) e_i\,\frac{1}{n} = \int \varphi\!\left(\frac{n}{n+1}\widehat{F}_n(x)\right) x\, d\widehat{F}_n(x) \stackrel{P}{\rightarrow} \int \varphi(F(x))\,x\, dF(x) = D_e . \qquad (3.6.30) $$

As Exercise 3.16.19 shows, De is a scale parameter; see also the examples below.

Let D(ê) denote the residual dispersion, D(β̂) = D(Y, Ω). We next show that n⁻¹D(ê) also converges in probability to De, a result which will prove useful in Sections 3.9.2 and 3.11. Assume without loss of generality that the true β is 0. We can write

$$ D(\widehat{\mathbf{e}}) = \left(D(\widehat{\mathbf{e}}) - Q(\widehat{\boldsymbol{\beta}})\right) + \left(Q(\widehat{\boldsymbol{\beta}}) - Q(\widetilde{\boldsymbol{\beta}})\right) + Q(\widetilde{\boldsymbol{\beta}}) . $$

By Lemmas 3.6.1 and 3.6.2, the two differences on the right-hand side converge to 0 in probability. After some algebra, we obtain

$$ Q(\widetilde{\boldsymbol{\beta}}) = -\frac{\tau_\varphi}{2}\left\{\frac{1}{\sqrt{n}}S(\mathbf{e})'\left(\frac{1}{n}\mathbf{X}'\mathbf{X}\right)^{-1}\frac{1}{\sqrt{n}}S(\mathbf{e})\right\} + D(\mathbf{e}) . $$


By Theorem 3.5.2, the term in braces on the right side converges in distribution to a χ² random variable with p degrees of freedom. This implies that (D(e) − D(ê))/(τϕ/2) also converges in distribution to a χ² random variable with p degrees of freedom. Although this is a stronger result than we need, it does imply that n⁻¹(D(e) − D(ê)) converges to 0 in probability. Hence, n⁻¹D(ê) converges in probability to De.

The natural analogue of the least squares F-test statistic is

$$ F_\varphi^{*} = \frac{RD/q}{\widehat{\sigma}_D/2} , \qquad (3.6.31) $$

where σ̂D = D(ê)/(n − p − 1), rather than Fϕ. But we have

$$ qF_\varphi^{*} = \frac{\widehat{\tau}_\varphi/2}{n^{-1}D(\widehat{\mathbf{e}})/2}\; qF_\varphi \stackrel{D}{\rightarrow} \kappa_F\,\chi^2(q) , \qquad (3.6.32) $$

where κF is defined by

$$ \frac{\widehat{\tau}_\varphi}{n^{-1}D(\widehat{\mathbf{e}})} \stackrel{P}{\rightarrow} \kappa_F . \qquad (3.6.33) $$

Hence, to have a limiting χ²-distribution for qF*ϕ we need to have κF = 1. Below we give several examples where this occurs. In the first example the form of the error distribution is known, while in the second example the errors are normally distributed; however, these cases rarely occur in practice.

There is an even more acute problem with using F*ϕ, though. In Section A.5.2 of the appendix we show that the influence function of F*ϕ is not bounded in the Y-space, while, as noted above, the influence function of the statistic Fϕ is bounded in the Y-space provided the score function ϕ(u) is bounded. Note, however, that the influence functions of D(ê) and F*ϕ are linear rather than quadratic, as is the influence function of FLS. Hence, they are somewhat less sensitive to outliers in the Y-space than FLS; see Hettmansperger and McKean (1978).

Example 3.6.1. Form of Error Density Known.

Assume that the errors have density f(x) = σ⁻¹f₀(x/σ), where f₀ is known. Our choice of scores would then be the optimal scores given by

$$ \varphi_0(u) = -\frac{1}{\sqrt{I(f_0)}}\,\frac{f_0'(F_0^{-1}(u))}{f_0(F_0^{-1}(u))} , \qquad (3.6.34) $$

where I(f₀) denotes the Fisher information corresponding to f₀. These scores yield an asymptotically efficient rank-based analysis. Exercise 3.16.20 shows that with these scores

$$ \tau_\varphi = D_e . \qquad (3.6.35) $$

Thus κF = 1 for this example, and qF*_{ϕ₀} has a limiting χ²(q)-distribution under H0.


Example 3.6.2. Errors are Normally Distributed.

In this case the form of the error density is f₀(x) = (√(2π))⁻¹ exp{−x²/2}, i.e., the standard normal density. This is of course a subcase of the last example. The optimal scores in this case are the normal scores ϕ₀(u) = Φ⁻¹(u), where Φ denotes the standard normal distribution function. Using these scores, the statistic qF*_{ϕ₀} has a limiting χ²(q)-distribution under H0. Note here that the score function ϕ₀(u) = Φ⁻¹(u) is unbounded; hence the above theory must be modified to obtain this result. Under further regularity conditions on the design matrix, Jureckova (1969) obtained asymptotic linearity for the unbounded score function case; see also Koul (1992, p. 51). Using these results, the limiting distribution of qF*_{ϕ₀} can be obtained. The R-estimates based on these scores, however, have an unbounded influence function; see Section 1.8.1. We next consider this analysis for Wilcoxon and sign scores.

If Wilcoxon scores are employed, then Exercise 3.16.21 shows that

$$ \tau_\varphi = \sigma\sqrt{\frac{\pi}{3}} \qquad (3.6.36) $$
$$ D_e = \sigma\sqrt{\frac{3}{\pi}} . \qquad (3.6.37) $$

Thus, in this case, a consistent estimate of τϕ/2 is n⁻¹D(ê)(π/6).

For sign scores a similar computation yields

$$ \tau_S = \sigma\sqrt{\frac{\pi}{2}} \qquad (3.6.38) $$
$$ D_e = \sigma\sqrt{\frac{2}{\pi}} . \qquad (3.6.39) $$

Hence n⁻¹D(ê)(π/4) is a consistent estimate of τS/2.
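These relationships are easy to verify by simulation. The following minimal sketch checks the limit in (3.6.30) and the Wilcoxon consistency claim above, using the true errors in place of residuals.

```python
# Monte Carlo check: for standard normal errors and Wilcoxon scores,
# n^{-1} D(e) -> De = sqrt(3/pi) and n^{-1} D(e) (pi/6) -> tau_phi/2.
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
e = rng.standard_normal(100_000)
n = len(e)
a = np.sqrt(12.0) * (rankdata(e) / (n + 1) - 0.5)  # Wilcoxon scores a(R(e_i))
D = np.sum(a * e)                                  # dispersion at the errors
print(D / n)                 # ~ sqrt(3/pi) = 0.977
print(D / n * np.pi / 6)     # ~ tau_phi/2 = sqrt(pi/3)/2 = 0.512
```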

Note that both examples are overly restrictive and, in all cases, the resulting rank-based test of the general linear hypothesis H0 based on F*ϕ has an unbounded influence function, even when the errors have a normal density and the analysis is based on Wilcoxon or sign scores. In general, then, we recommend using a bounded score function ϕ and the corresponding test statistic Fϕ, (3.2.18), which is highly efficient and whose influence function, (3.6.21), is bounded in the Y-space.

3.7 Implementation of the R-Analysis

Up to this point we have presented the geometry and asymptotic theory of the R-analysis. In order to implement the analysis, we need to discuss the estimation of the scale parameters τϕ and τS. Estimation of τS is discussed around expression (1.5.28); here, though, the estimate is based on the residuals. We next discuss estimation of the scale parameter τϕ. We also discuss algorithms for obtaining the rank-based analysis.


3.7.1 Estimates of the Scale Parameter τϕ

The estimators of τϕ that we discuss are based on the R-residuals formed after estimating β. In particular, the estimators do not depend on the estimate of the intercept parameter α. Suppose, then, that we have fit Model (3.2.3) based on a score function ϕ which satisfies (S.1), (3.4.10), i.e., ϕ is bounded and is standardized so that ∫ϕ = 0 and ∫ϕ² = 1. Let β̂ϕ denote the R-estimate of β and let êR = Y − Xβ̂ϕ denote the residuals based on the R-fit.

There have been several estimates of τϕ proposed. McKean and Hettmansperger (1976) proposed a Lehmann type estimator based on the standardized length of a confidence interval for the intercept parameter α. This estimator is a function of residuals and is consistent provided the density of the errors is symmetric. It is similar to the estimators of τϕ discussed in Chapter 1. For Wilcoxon scores, Aubuchon and Hettmansperger (1984, 1989) obtained a density type estimator for τϕ and showed it was consistent for symmetric and asymmetric error distributions. Both of these estimators are available as options in the command RREGR in Minitab. In this section we briefly sketch the development of an estimator of τϕ for bounded score functions proposed by Koul, Sievers and McKean (1987). It is a density type estimate based on residuals which is consistent for symmetric and asymmetric error distributions satisfying (E.1), (3.4.1). It further satisfies a uniform consistency property, as stated in Theorem 3.7.1. Witt et al. (1995) derived the influence function of this estimator, showing that it is robust.

A bootstrap percentile-t procedure based on this estimator did quite well in terms of empirical validity and efficiency in the Monte Carlo study performed by George, McKean, Schucany and Sheather (1995).

Let the score function ϕ satisfy (S.1), (S.2), and (S.3) of Section 3.4. Since it is bounded, consider the standardization of it given by

$$ \varphi^{*}(u) = \frac{\varphi(u) - \varphi(0)}{\varphi(1) - \varphi(0)} . \qquad (3.7.1) $$

Since ϕ* is a linear function of ϕ, the inference properties under either score function are the same. The score function ϕ* will be useful since it is also a distribution function on (0, 1). Recall that τϕ = 1/γ, where

$$ \gamma = \int_0^1 \varphi(u)\varphi_f(u)\,du \quad\text{and}\quad \varphi_f(u) = -\frac{f'(F^{-1}(u))}{f(F^{-1}(u))} . $$

Note that γ* = ∫ϕ*(u)ϕf(u)du = (ϕ(1) − ϕ(0))⁻¹γ. For the present it will be more convenient to work with γ*.


If we make the change of variable u = F(x) in γ*, we can rewrite it as

$$ \gamma^{*} = -\int_{-\infty}^{\infty} \varphi^{*}(F(x))f'(x)\,dx = \int_{-\infty}^{\infty} \varphi^{*\prime}(F(x))f^2(x)\,dx = \int_{-\infty}^{\infty} f(x)\,d\varphi^{*}(F(x)) , $$

where the second equality is obtained upon integration by parts using dv = f′(x)dx and u = ϕ*(F(x)).

From the above assumptions on ϕ*, ϕ*(F(x)) is a distribution function. Suppose Z1 and Z2 are independent random variables with distribution functions F(x) and ϕ*(F(x)), respectively. Let H(y) denote the distribution function of |Z1 − Z2|. It then follows that

$$ H(y) = P[|Z_1 - Z_2| \leq y] = \begin{cases} \int_{-\infty}^{\infty}\left[F(z_2 + y) - F(z_2 - y)\right] d\varphi^{*}(F(z_2)) & y > 0 \\ 0 & y \leq 0 \end{cases} . \qquad (3.7.2) $$

Let h(y) denote the density of H(y). Upon differentiating under the integral sign in expression (3.7.2), it easily follows that

$$ h(0) = 2\gamma^{*} . \qquad (3.7.3) $$

So to estimate γ we need to estimate h(0).

So to estimate γ we need to estimate h(0).Using the transformation t = F (z2), rewrite ( 3.7.2) as

H(y) =

∫ 1

0

[F (F−1(t) + y) − F (F−1(t) − y)

]dϕ∗(t) . (3.7.4)

Next let F̂n denote the empirical distribution function of the R-residuals and let F̂n⁻¹(t) = inf{x : F̂n(x) ≥ t} denote the usual inverse of F̂n. Let Ĥn denote the estimate of H which is obtained by replacing F by F̂n. Some simplification follows by noting that, for t ∈ ((j − 1)/n, j/n], F̂n⁻¹(t) = ê₍ⱼ₎. This leads to the following form of Ĥn:

$$ \widehat{H}_n(y) = \int_0^1 \left[\widehat{F}_n(\widehat{F}_n^{-1}(t) + y) - \widehat{F}_n(\widehat{F}_n^{-1}(t) - y)\right] d\varphi^{*}(t) $$
$$ = \sum_{j=1}^{n} \int_{((j-1)/n,\, j/n]} \left[\widehat{F}_n(\widehat{F}_n^{-1}(t) + y) - \widehat{F}_n(\widehat{F}_n^{-1}(t) - y)\right] d\varphi^{*}(t) $$
$$ = \sum_{j=1}^{n} \left[\widehat{F}_n(\widehat{e}_{(j)} + y) - \widehat{F}_n(\widehat{e}_{(j)} - y)\right]\left(\varphi^{*}\!\left(\frac{j}{n}\right) - \varphi^{*}\!\left(\frac{j-1}{n}\right)\right) $$
$$ = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} \left(\varphi^{*}\!\left(\frac{j}{n}\right) - \varphi^{*}\!\left(\frac{j-1}{n}\right)\right) I(|\widehat{e}_{(i)} - \widehat{e}_{(j)}| \leq y) . \qquad (3.7.5) $$


An estimate of h(0), and hence of γ*, (3.7.3), is an estimate of the form Ĥn(tn)/(2tn), where tn is chosen close to 0. Since Ĥn is a distribution function, let t̂_{n,δ} denote the δth quantile of Ĥn, i.e., t̂_{n,δ} = Ĥn⁻¹(δ), and take tn = t̂_{n,δ}/√n. Our estimate of γ is given by

$$ \widehat{\gamma}_{n,\delta} = \frac{(\varphi(1) - \varphi(0))\,\widehat{H}_n(\widehat{t}_{n,\delta}/\sqrt{n})}{2\,\widehat{t}_{n,\delta}/\sqrt{n}} . \qquad (3.7.6) $$

Its consistency is given by the following theorem:

Theorem 3.7.1. Under (E.1), (D.1), (S.1), and (S.2) of Section 3.4, and for any 0 < δ < 1,

$$ \sup_{\varphi \in \mathcal{C}} |\widehat{\gamma}_{n,\delta} - \gamma| \stackrel{P}{\rightarrow} 0 , $$

where C denotes the class of all bounded, right continuous, nondecreasing score functions defined on the interval (0, 1).

The proof can be found in Koul et al. (1987). It follows immediately that τ̂ϕ = 1/γ̂_{n,δ} is a consistent estimate of τϕ. Note that the uniformity condition on the scores in the theorem is more than we need here; this result, though, proves useful in adaptive procedures which estimate the score function; see McKean and Sievers (1989).

Since the scores are differentiable, an approximation of Ĥn is obtained by an application of the mean value theorem to (3.7.5), which results in

$$ \widehat{H}_n^{*}(y) = \frac{1}{c_n n}\sum_{i=1}^{n}\sum_{j=1}^{n} \varphi^{*\prime}\!\left(\frac{j}{n+1}\right) I(|\widehat{e}_{(i)} - \widehat{e}_{(j)}| \leq y) , \qquad (3.7.7) $$

where cn = Σ_{j=1}ⁿ ϕ*′(j/(n + 1)) is such that Ĥ*n is a distribution function.

The expression (3.7.5) for Ĥn contains a density estimate of f based on a rectangular kernel. Hence, in choosing δ we are really choosing a bandwidth for a density estimator. As most kernel type density estimates are sensitive to the bandwidth, so is γ̂* sensitive to δ. Several small sample studies have been done on this estimate of τϕ; see McKean and Sheather (1991) for a summary. In these studies, the quality of an estimator of τϕ is based on how well it standardizes test statistics such as Fϕ, in terms of how close the empirical α-levels of the test statistic are to nominal α-levels. In the same way, scale estimators used in confidence intervals were judged by how close empirical confidence levels were to nominal confidence levels. The major concern is thus the validity of the inference procedure. For moderate sample sizes where the ratio n/p exceeds 5, the value δ = .80 yielded valid estimates. For ratios less than 5, larger values of δ, around .90, gave valid estimates. In all cases it was found that the following simple degrees of freedom correction benefits the analysis:

$$ \widehat{\tau}_\varphi = \sqrt{\frac{n}{n - p - 1}}\;\widehat{\gamma}_{n,\delta}^{-1} . \qquad (3.7.8) $$

Note that this is similar to the least squares correction on the maximum likelihood estimate (under normality) of the variance.
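A minimal sketch of this estimator for Wilcoxon scores follows. For Wilcoxon scores ϕ*(u) = u, so the weights in (3.7.5) are all 1/n and Ĥn is simply the empirical distribution function of the pairwise absolute differences of the residuals. The O(n²) storage restricts the sketch to modest n, and the function name is ours.

```python
# Sketch of the Koul-Sievers-McKean estimate (3.7.6) with the degrees of
# freedom correction (3.7.8), specialized to Wilcoxon scores, for which
# phi(1) - phi(0) = sqrt(12) and phi*(u) = u.
import numpy as np

def tau_ksm_wilcoxon(resid, p, delta=0.80):
    resid = np.asarray(resid, dtype=float)
    n = len(resid)
    diffs = np.abs(resid[:, None] - resid[None, :]).ravel()  # |e_(i) - e_(j)|
    t_delta = np.quantile(diffs, delta)        # delta-th quantile of H_n
    t_n = t_delta / np.sqrt(n)
    H_tn = np.mean(diffs <= t_n)               # H_n(t_n) from (3.7.5)
    gamma_hat = np.sqrt(12.0) * H_tn / (2.0 * t_n)           # (3.7.6)
    return np.sqrt(n / (n - p - 1.0)) / gamma_hat            # (3.7.8)
```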


3.7.2 Algorithms for Computing the R-Analysis

As we saw in Section 3.2, the dispersion function D(β) is a continuous convex function of β. Gradient type algorithms, such as steepest descent, can be used to minimize D(β), but they are often agonizingly slow. The algorithm which we describe next is a Newton type of algorithm based on the asymptotic quadraticity of D(β). It is generally much faster than gradient type algorithms and is currently used in the RREGR command in Minitab and in the program RGLM (Kapenga, McKean and Vidmar, 1988). A finite algorithm to minimize D(β) is discussed by Osborne (1985).

The Newton type of algorithm needs an initial estimate, which we denote by β̂⁽⁰⁾. Let ê⁽⁰⁾ = Y − Xβ̂⁽⁰⁾ denote the initial residuals and let τ̂ϕ⁽⁰⁾ denote the initial estimate of τϕ based on these residuals. By (3.5.11), the approximating quadratic to D based on β̂⁽⁰⁾ is given by

$$ Q(\boldsymbol{\beta}) = \left(2\widehat{\tau}_\varphi^{(0)}\right)^{-1}\left(\boldsymbol{\beta} - \widehat{\boldsymbol{\beta}}^{(0)}\right)'\mathbf{X}'\mathbf{X}\left(\boldsymbol{\beta} - \widehat{\boldsymbol{\beta}}^{(0)}\right) - \left(\boldsymbol{\beta} - \widehat{\boldsymbol{\beta}}^{(0)}\right)'S\!\left(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}^{(0)}\right) + D\!\left(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}^{(0)}\right) . $$

By (3.5.13), the value of β which minimizes Q(β) is given by

$$ \widehat{\boldsymbol{\beta}}^{(1)} = \widehat{\boldsymbol{\beta}}^{(0)} + \widehat{\tau}_\varphi^{(0)}(\mathbf{X}'\mathbf{X})^{-1}S\!\left(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}^{(0)}\right) . \qquad (3.7.9) $$

This is the first Newton step. In the same way that the first step was defined in terms of the initial estimate, so can a second step be defined in terms of the first step. We shall call these iterated estimates or k-step estimates. In practice, though, we would want to know if D(β̂⁽¹⁾) is less than D(β̂⁽⁰⁾) before proceeding. A more formal algorithm is presented below.

These k-step estimates satisfy some interesting properties themselves, which we briefly discuss; details can be found in McKean and Hettmansperger (1978). Provided the initial estimate is such that √n(β̂⁽⁰⁾ − β) is bounded in probability, then for any k ≥ 1 we have

$$ \sqrt{n}\left(\widehat{\boldsymbol{\beta}}^{(k)} - \widehat{\boldsymbol{\beta}}_\varphi\right) \stackrel{P}{\rightarrow} \mathbf{0} , $$

where β̂ϕ denotes a minimizing value of D. Hence the k-step estimates have the same asymptotic distribution as β̂ϕ. Furthermore, τ̂ϕ⁽ᵏ⁾ is a consistent estimate of τϕ if it is any of the scale estimates discussed in Section 3.7.1 based on k-step residuals. Let Fϕ⁽ᵏ⁾ denote the R-test of a general linear hypothesis based on reduced and full model k-step estimates. Then it can be shown that Fϕ⁽ᵏ⁾ satisfies the same asymptotic properties as the test statistic Fϕ under the null hypothesis and contiguous alternatives. Also, it is consistent for any alternative HA.


Formal Algorithm

In order to outline the algorithm used by RGLM, first consider the QR-decomposition of X, which is given by

$$ \mathbf{Q}'\mathbf{X} = \mathbf{R} , \qquad (3.7.10) $$

where Q is an n × n orthogonal matrix and R is an n × p upper triangular matrix of rank p. As discussed in Stewart (1973), Q can be expressed as a product of p Householder transformations. Writing Q = [Q1 Q2], where Q1 is n × p, it is easy to show that the columns of Q1 form an orthonormal basis for the column space of X. In particular, the projection matrix onto the column space of X is given by H = Q1Q′1. The software package LINPACK (1979) is a collection of subroutines which efficiently computes QR-decompositions, and it further has routines which obtain projections of vectors.

Note that we can write the kth Newton step in terms of residuals as

$$ \widehat{\mathbf{e}}^{(k)} = \widehat{\mathbf{e}}^{(k-1)} - \widehat{\tau}_\varphi\,\mathbf{H}\,\mathbf{a}\!\left(R(\widehat{\mathbf{e}}^{(k-1)})\right) , \qquad (3.7.11) $$

where a(R(ê⁽ᵏ⁻¹⁾)) denotes the vector whose ith component is a(R(ê⁽ᵏ⁻¹⁾ᵢ)). Let D⁽ᵏ⁾ denote the dispersion function evaluated at ê⁽ᵏ⁾. The Newton step is a step from ê⁽ᵏ⁻¹⁾ along the direction τ̂ϕHa(R(ê⁽ᵏ⁻¹⁾)). If D⁽ᵏ⁾ < D⁽ᵏ⁻¹⁾, the step has been successful; otherwise, a linear search can be made along the direction to find a value which minimizes D. This would then become the kth step residual. Such a search can be performed using methods such as false position, as discussed below in Section 3.7.3. Stopping rules can be based on the relative drop in dispersion, i.e., stop when

$$ \frac{D^{(k-1)} - D^{(k)}}{D^{(k-1)}} < \epsilon_D , \qquad (3.7.12) $$

where ǫD is a specified tolerance. A similar stopping rule can be based on the relative size of the step. Upon stopping at step k, obtain the fitted value Ŷ = Y − ê⁽ᵏ⁾ and then the estimate of β by solving Xβ̂ = Ŷ.

A formal algorithm follows. Let ǫD and ǫs be the given stopping tolerances.

1. Set k = 1. Obtain initial residuals ê⁽ᵏ⁻¹⁾ and, based upon these, an initial estimate τ̂ϕ⁽⁰⁾ of τϕ.

2. Obtain ê⁽ᵏ⁾ as in expression (3.7.11). If the step is successful, proceed to the next step; otherwise, search along the Newton direction for a value which minimizes D, then go to the next step. An algorithm for this search is discussed in Section 3.7.3.

3. If the relative drop in dispersion or the length of the step is within its respective tolerance ǫD or ǫs, stop; otherwise set ê⁽ᵏ⁻¹⁾ = ê⁽ᵏ⁾ and go to step (2).

4. Obtain the estimate of β and the final estimate of τϕ. (A sketch of this iteration in code is given below.)
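As a minimal sketch of steps (1)-(3), assuming Wilcoxon scores, a caller-supplied scale estimate (e.g., the sketch in Section 3.7.1), and simple step-halving in place of the regula falsi search of Section 3.7.3; all names are ours.

```python
# Sketch of the Newton iteration (3.7.9) for a Wilcoxon R-fit; tau_of maps
# residuals to a scale estimate.
import numpy as np
from scipy.stats import rankdata

def scores(e):                         # Wilcoxon scores a(R(e_i))
    return np.sqrt(12.0) * (rankdata(e) / (len(e) + 1) - 0.5)

def disp(e):                           # dispersion D = sum a(R(e_i)) e_i
    return np.sum(scores(e) * e)

def r_fit(X, y, tau_of, beta0, max_iter=50, eps_D=1e-8):
    beta = np.asarray(beta0, dtype=float)
    e = y - X @ beta
    for _ in range(max_iter):
        # Newton step (3.7.9): beta + tau_hat (X'X)^{-1} S(Y - X beta)
        step = tau_of(e) * np.linalg.solve(X.T @ X, X.T @ scores(e))
        D_old = disp(e)
        for _ in range(30):            # crude line search by step-halving
            e_new = y - X @ (beta + step)
            if disp(e_new) < D_old:
                break
            step /= 2.0
        beta, e = beta + step, e_new
        if (D_old - disp(e)) / max(D_old, 1e-12) < eps_D:
            break
    return beta
```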


The QR-decomposition can readily be used to form a reduced model design matrix for testing the general linear hypotheses (3.2.5), Mβ = 0, where M is a specified q × p matrix. Recall that we called the column space of X, ΩF, and the space ΩF constrained by Mβ = 0 the reduced model space, ω. The key result lies in the following theorem:

Theorem 3.7.2. Denote the row space of M by R(M′). Let QM be a p × (p − q) matrix whose columns consist of an orthonormal basis for the space (R(M′))⊥. If U = XQM, then R(U) = ω.

Proof: If u ∈ ω then u = Xb for some b where Mb = 0. Hence b ∈ (R(M′))⊥; i.e., b = QMc for some c, so that u = X(QMc) ∈ R(U). Conversely, if u ∈ R(U) then for some c ∈ Rᵖ⁻ᑫ, u = X(QMc). Hence u ∈ R(X) and M(QMc) = (MQM)c = 0.

Thus, using the LINPACK subroutines mentioned above, it is easy to write an algorithm which obtains the reduced model design matrix U defined in the theorem. The package RGLM uses such an algorithm to test linear hypotheses; see Kapenga, McKean and Vidmar (1988).
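With modern libraries the construction of U in Theorem 3.7.2 is essentially a one-liner; a sketch:

```python
# Sketch of Theorem 3.7.2: U = X Q_M, where the columns of Q_M form an
# orthonormal basis for (R(M'))-perp, i.e., the null space of M.
import numpy as np
from scipy.linalg import null_space

def reduced_design(X, M):
    Q_M = null_space(M)     # p x (p - q), orthonormal, and M @ Q_M = 0
    return X @ Q_M          # column space equals the reduced space omega
```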

3.7.3 An Algorithm for a Linear Search

The computation of many of the quantities needed in a rank-based analysis involves simple linear searches. Examples include the estimate of the location parameter for a signed-rank procedure, the estimate of the shift in location in the two sample location problem, the estimate of τϕ discussed in Section 3.7, and the search along the Newton direction for a minimizing value in Step (2) of the algorithm for the R-fit in a regression problem discussed in the last section. The following is a generic setup for these problems: solve the equation

$$ S(b) = K , \qquad (3.7.13) $$

where S(b) is a decreasing step function and K is a specified constant. Without loss of generality we will take K = 0 for the remainder of the discussion. By the monotonicity, a solution always exists, although it may be an interval of solutions. In almost all cases S(b) is asymptotically linear, so the search problem becomes relatively more efficient as the sample size increases.

There are certainly many search algorithms that can be used for solving (3.7.13). One that we have successfully employed is the Illinois version of regula falsi; see Dowell and Jarratt (1971). McKean and Ryan (1977) employed this routine to obtain the estimate and confidence interval for the two sample Wilcoxon location problem. We will write the generic asymptotic linearity result as

$$ S(b) \doteq S(b^{(0)}) - \zeta\,(b - b^{(0)}) . \qquad (3.7.14) $$

The parameter ζ is often of the form δ⁻¹C, where C is some constant. Since δ is a scale parameter, initial estimates of it include such estimates as the MAD, (3.9.27), or the sample standard deviation. We have found MAD usually to be preferable. An outline of an algorithm for the search is:


1. Bracket Step. Beginning with an initial estimate b⁽⁰⁾, step along the b-axis to b⁽¹⁾, where the interval (b⁽⁰⁾, b⁽¹⁾), or vice-versa, brackets the solution. Asymptotic linearity can be used here to make these steps; for instance, if ζ̂⁽⁰⁾ is an estimate of ζ based on b⁽⁰⁾, then the first step is

$$ b^{(1)} = b^{(0)} + S(b^{(0)})/\widehat{\zeta}^{(0)} . $$

2. Regula Falsi. Assume the interval (b⁽⁰⁾, b⁽¹⁾) brackets the solution and that b⁽¹⁾ is the more recent of b⁽⁰⁾, b⁽¹⁾. If |b⁽¹⁾ − b⁽⁰⁾| < ǫ then stop. Else, the next step is where the secant line determined by b⁽⁰⁾, b⁽¹⁾ intersects the b-axis; i.e.,

$$ b^{(2)} = b^{(0)} - \frac{b^{(1)} - b^{(0)}}{S(b^{(1)}) - S(b^{(0)})}\,S(b^{(0)}) . \qquad (3.7.15) $$

(a) If (b⁽⁰⁾, b⁽²⁾) brackets the solution, then replace b⁽¹⁾ by b⁽²⁾ and go to (2), but use S(b⁽⁰⁾)/2 in place of S(b⁽⁰⁾) in the determination of the secant line (this is the Illinois modification).

(b) If (b⁽²⁾, b⁽¹⁾) brackets the solution, then replace b⁽⁰⁾ by b⁽²⁾ and go to (2).

The above algorithm is easy to implement. Such searches are used in the package RGLM; see Kapenga, McKean and Vidmar (1988).
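A minimal sketch of the search follows, in the standard form of the Illinois modification: the retained endpoint's S-value is halved whenever that endpoint is kept twice in a row, mirroring step 2(a) above.

```python
# Sketch of the Illinois version of regula falsi for S(b) = 0, where S is
# a decreasing step function and the interval (a, b) brackets the solution.
def illinois(S, a, b, eps=1e-8, max_iter=100):
    fa, fb = S(a), S(b)                     # bracketing: fa >= 0 >= fb
    side = 0
    for _ in range(max_iter):
        c = a - (b - a) / (fb - fa) * fa    # secant step (3.7.15)
        fc = S(c)
        if fa * fc > 0:                     # root lies in (c, b): replace a
            a, fa = c, fc
            if side == -1:
                fb /= 2.0                   # Illinois modification
            side = -1
        else:                               # root lies in (a, c): replace b
            b, fb = c, fc
            if side == 1:
                fa /= 2.0
            side = 1
        if abs(b - a) < eps:
            break
    return (a + b) / 2.0
```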

3.8 L1-Analysis

This section is devoted to L1-procedures, which are widely used; see, for example, Bloomfield and Steiger (1983). We first show that they are equivalent to R-estimates based on the sign score function under Model (3.2.4). Hence the asymptotic theory for L1-estimation and the subsequent analysis is contained in Section 3.5. The asymptotic theory for L1-estimation can also be found in Bassett and Koenker (1978) and Rao (1988) from an L1 point of view.

Consider the sign scores, i.e., the scores generated by ϕ(u) = sgn(u − 1/2). In this section we shall denote the associated pseudo-norm by

$$ \|\mathbf{v}\|_S = \sum_{i=1}^{n} \operatorname{sgn}\!\left(R(v_i) - \frac{n+1}{2}\right) v_i , \quad \mathbf{v} \in R^n ; $$

see also Section 2.6.1. This score function is optimal if the errors follow a double exponential (Laplace) distribution; see Exercise 2.13.19 of Chapter 2. We shall summarize the analysis based on the sign scores, but first we show that the R-estimates based on sign scores are indeed L1-estimates, provided that the intercept is estimated by the median of the residuals.

Consider the intercept model, (3.2.4), as given in Section 3.2, and let Ω denote the column space of X and Ω1 the column space of the augmented matrix X1 = [1 X]. First consider the R-estimate of η ∈ Ω based on the L1 pseudo-norm. This is a vector ŶS ∈ Ω such that

$$ \widehat{\mathbf{Y}}_S = \operatorname*{Argmin}_{\boldsymbol{\eta} \in \Omega} \|\mathbf{Y} - \boldsymbol{\eta}\|_S . $$


Next consider the L1-estimate for the space Ω1, i.e., the L1-estimate of α1 + η. This is a vector ŶL₁ ∈ Ω1 such that

$$ \widehat{\mathbf{Y}}_{L_1} = \operatorname*{Argmin}_{\boldsymbol{\theta} \in \Omega_1} \|\mathbf{Y} - \boldsymbol{\theta}\|_{L_1} , $$

where ‖v‖L₁ = Σ|vᵢ| is the L1-norm.

Theorem 3.8.1. R-estimates based on sign scores are equivalent to L1-estimates; that is,

$$ \widehat{\mathbf{Y}}_{L_1} = \widehat{\mathbf{Y}}_S + \operatorname{med}\{\mathbf{Y} - \widehat{\mathbf{Y}}_S\}\,\mathbf{1} . \qquad (3.8.1) $$

Proof: Any vector v ∈ Ω1 can be written uniquely as v = a1 + v_c, where a is a scalar and v_c ∈ Ω. Since the sample median minimizes the L1-distance between a vector and the space spanned by 1, we have

$$ \|\mathbf{Y} - \mathbf{v}\|_{L_1} = \|\mathbf{Y} - a\mathbf{1} - \mathbf{v}_c\|_{L_1} \geq \|\mathbf{Y} - \operatorname{med}\{\mathbf{Y} - \mathbf{v}_c\}\mathbf{1} - \mathbf{v}_c\|_{L_1} . $$

But it is easy to show that sgn(Yᵢ − med{Y − v_c} − v_{ci}) = sgn(R(Yᵢ − v_{ci}) − (n + 1)/2) for i = 1, ..., n. Putting these two results together, along with the fact that the sign scores sum to 0, we have

$$ \|\mathbf{Y} - \mathbf{v}\|_{L_1} \geq \|\mathbf{Y} - \operatorname{med}\{\mathbf{Y} - \mathbf{v}_c\}\mathbf{1} - \mathbf{v}_c\|_{L_1} = \|\mathbf{Y} - \mathbf{v}_c\|_S , \qquad (3.8.2) $$

for any vector v ∈ Ω1. Once more using the sign argument above, we can show that

$$ \|\mathbf{Y} - \operatorname{med}\{\mathbf{Y} - \widehat{\mathbf{Y}}_S\}\mathbf{1} - \widehat{\mathbf{Y}}_S\|_{L_1} = \|\mathbf{Y} - \widehat{\mathbf{Y}}_S\|_S . \qquad (3.8.3) $$

Putting (3.8.2) and (3.8.3) together establishes the result.

Let b̂′_S = (α̂_S, β̂′_S) denote the R-estimate of the vector of regression coefficients b = (β₀, β′)′. It follows that these R-estimates are the maximum likelihood estimates if the errors eᵢ are double exponentially distributed; see Exercise 3.16.13.

From the discussions in Sections 3.5 and 3.5.2, b̂_S has an approximate N(b, τ²_S(X′1X1)⁻¹) distribution, where τS = (2f(0))⁻¹. From this, the efficiency properties of the L1-procedures discussed in the first two chapters carry over to the L1 linear model procedures. In particular, the efficiency relative to LS at the normal distribution is .63, and the L1-procedures can be much more efficient than LS for heavier tailed error distributions.

As Exercise 3.16.22 shows, the drop in dispersion test based on sign scores, FS, is, except for the scale parameter, the likelihood ratio test of the general linear hypothesis (3.2.5), provided the errors have a double exponential distribution. For other error distributions, the same comments about the efficiency of the L1 estimates can be made about the test FS.

In terms of implementation, Schrader and McKean (1987) found it more difficult to standardize the L1 statistics than other R-procedures, such as the Wilcoxon. Their most successful standardization of FS was based on the following bootstrap procedure:

1. Compute the full model L1 estimates β̂_S and α̂_S, the full model residuals ê1, ..., ên, and the test statistic FS.

2. Select ê1, ..., ê_{ñ}, the ñ = n − (p + 1) nonzero residuals.

3. Draw a bootstrap random sample e*₁, ..., e*ₙ with replacement from ê1, ..., ê_{ñ}. Calculate β̂*_S and F*_S, the L1 estimate and test statistic, from the model y*ᵢ = α̂_S + x′ᵢβ̂_S + e*ᵢ.

4. Independently repeat step 3 a large number B of times. The bootstrap p-value is p* = #{F*_S ≥ FS}/B.

5. Reject H0 at level α if p* ≤ α.

Notice that by using full model residuals, the algorithm estimates the null distribution of FS. The algorithm depends on the number B of bootstrap samples taken; we suggest at least 2000. A sketch of this procedure in code follows.
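In the sketch below, l1_fit and f_s_statistic stand in for the full model L1 fit and the FS statistic described in the text; they are placeholders, not references to an actual library.

```python
# Hedged sketch of the Schrader-McKean bootstrap for F_S (steps 1-5 above).
import numpy as np

def bootstrap_fs_pvalue(X, y, l1_fit, f_s_statistic, B=2000, seed=0):
    rng = np.random.default_rng(seed)
    alpha_hat, beta_hat, resid = l1_fit(X, y)        # step 1: full model fit
    F_S = f_s_statistic(X, y)
    e_pool = resid[np.abs(resid) > 1e-10]            # step 2: nonzero residuals
    exceed = 0
    for _ in range(B):                               # steps 3 and 4
        e_star = rng.choice(e_pool, size=len(y), replace=True)
        y_star = alpha_hat + X @ beta_hat + e_star   # data generated under the fit
        if f_s_statistic(X, y_star) >= F_S:
            exceed += 1
    return exceed / B                                # step 5: reject if p* <= alpha
```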

3.9 Diagnostics

One of the most important parts of the analysis of a linear model is the examination of the resulting fit. Tools for doing this include residual plots and diagnostic techniques. Over the last fifteen years or so, these tools have been developed for fits based on least squares; see, for example, Cook and Weisberg (1982) and Belsley, Kuh and Welsch (1980). Least squares residual plots can be used to detect such things as curvature not accounted for by the fitted model; see Cook and Weisberg (1989) for a recent discussion. Further diagnostic techniques can be used to detect outliers, which are points that differ greatly from the pattern set by the bulk of the data, and to measure the influence of individual cases on the least squares fit.

In this section we explore the properties of the residuals from the rank-based fits, showing how they can be used to determine model misspecification. We present diagnostic techniques for rank-based residuals that detect outlying and influential cases. Together these tools offer the user a residual analysis for the rank-based fit of a linear model similar to the residual analysis based on least squares estimates.

In this section we consider the same linear model, (3.2.3), as in Section 3.2. For a given score function ϕ, let β̂ϕ and êR denote the R-estimate of β and the residuals from the R-fit of the model based on these scores. Much of the discussion is taken from the articles by McKean, Sheather and Hettmansperger (1990, 1991, 1993). Also see Dixon and McKean (1996) for a robust rank-based approach to modeling heteroscedasticity.

3.9.1 Properties of R-Residuals and Model Misspecification

As we discussed above, a primary use of least squares residuals is in the detection of model misspecification. In order to show that the R-residuals can also be used to detect model misspecification, consider the sequence of models

$$ \mathbf{Y} = \mathbf{1}\alpha + \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \mathbf{e} , \qquad (3.9.1) $$


where Z is an n × q centered matrix of constants and γ = θ/√n, for θ ≠ 0. Note that this sequence of models is contiguous to Model (3.2.3). Suppose we fit model (3.2.3), i.e., Y = 1α + Xβ + e, when model (3.9.1) is the true model; hence, the model has been misspecified. As a first step in examining the residuals in this situation, we consider the limiting distribution of the corresponding R-estimate.

Theorem 3.9.1. Assume model (3.9.1) is the true model. Let β̂ϕ be the R-estimate for the model (3.2.3). Suppose that conditions (E.1) and (S.1) of Section 3.4 are true and that conditions (D.1) and (D.2) are true for the augmented matrix [X Z]. Then

$$ \widehat{\boldsymbol{\beta}}_\varphi \ \text{has an approximate}\ N_p\!\left(\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}\boldsymbol{\theta}/\sqrt{n},\ \tau^2_\varphi(\mathbf{X}'\mathbf{X})^{-1}\right)\ \text{distribution} . \qquad (3.9.2) $$

Proof: Without loss of generality, assume that β = 0. Note that the situation here is the same as the situation in Theorem 3.6.3, except that now the null hypothesis corresponds to γ = 0 and β̂ϕ is the reduced model estimate. Thus we seek the asymptotic distribution of the reduced model estimate. As in Section 3.5.1, it is easier to consider the corresponding pseudo-estimate β̃, the reduced model estimate which minimizes the quadratic Q(Y − Xβ), (3.5.11). Under the null hypothesis γ = 0, √n(β̂ϕ − β̃) →P 0; hence, by contiguity, √n(β̂ϕ − β̃) →P 0 under the sequence of models (3.9.1). Thus β̂ϕ and β̃ have the same distributions under (3.9.1); hence, it suffices to find the distribution of β̃. But by (3.5.13),

$$ \widetilde{\boldsymbol{\beta}} = \tau_\varphi(\mathbf{X}'\mathbf{X})^{-1}S(\mathbf{Y}) , \qquad (3.9.3) $$

where S(Y) is the first p components of the vector T(Y) = [X Z]′a(R(Y)). By (3.6.28) of Theorem 3.6.3,

$$ n^{-1/2}T(\mathbf{Y}) \stackrel{D}{\rightarrow} N_{p+q}\!\left(\tau_\varphi^{-1}\boldsymbol{\Sigma}^{*}(\mathbf{0}', \boldsymbol{\theta}')',\ \boldsymbol{\Sigma}^{*}\right) , \qquad (3.9.4) $$

where Σ* is the following limit:

$$ \lim_{n\rightarrow\infty} \frac{1}{n}\begin{bmatrix}\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z}\\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z}\end{bmatrix} = \boldsymbol{\Sigma}^{*} . $$

Because β̃ is defined by (3.9.3), the result is just an algebraic computation applied to (3.9.4).

With a few more steps we can write a first order expression for β̂ϕ, which is given in the following corollary:

Corollary 3.9.1. Under the assumptions of the last theorem,

$$ \widehat{\boldsymbol{\beta}}_\varphi = \boldsymbol{\beta} + \tau_\varphi(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\varphi(F(\mathbf{e})) + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}\boldsymbol{\theta}/\sqrt{n} + o_p(n^{-1/2}) . \qquad (3.9.5) $$

Proof: Without loss of generality, assume that the regression coefficients are 0. By (A.3.10) and expression (3.6.27) of Theorem 3.6.3 we can write

$$ \frac{1}{\sqrt{n}}T(\mathbf{Y}) = \frac{1}{\sqrt{n}}\begin{bmatrix}\mathbf{X}'\varphi(F(\mathbf{e}))\\ \mathbf{Z}'\varphi(F(\mathbf{e}))\end{bmatrix} + \tau_\varphi^{-1}\frac{1}{n}\begin{bmatrix}\mathbf{X}'\mathbf{Z}\boldsymbol{\theta}\\ \mathbf{Z}'\mathbf{Z}\boldsymbol{\theta}\end{bmatrix} + o_p(1) ; $$

hence, the first p components of n⁻¹ᐟ²T(Y) satisfy

$$ \frac{1}{\sqrt{n}}S(\mathbf{Y}) = \frac{1}{\sqrt{n}}\mathbf{X}'\varphi(F(\mathbf{e})) + \tau_\varphi^{-1}\frac{1}{n}\mathbf{X}'\mathbf{Z}\boldsymbol{\theta} + o_p(1) . $$

By expression (3.9.3) and the fact that √n(β̂ϕ − β̃) →P 0, the result follows.

By expression ( 3.9.3) and the fact that√n(β − β)

P→ 0 the result follows.From this corollary we obtain the following first order expressions of the R-residuals and

R-fitted values:

YR.= α1 + Xβ + τϕHϕ(F (e)) + HZγ (3.9.6)

eR.= e − τϕHϕ(F (e)) + (I − H)Zγ , (3.9.7)

where H = X (X′X)−1 X′. In Exercise 3.16.23 the reader is asked to show that the leastsquares fitted values and residuals satisfy

YLS = α1 + Xβ + He + HZγ (3.9.8)

eLS = e − He + (I −H)Zγ . (3.9.9)

In terms of model mispecification the coefficients of interest are the regression coefficients.Hence, at this time we need not consider the effect of the estimation of the intercept. Thisavoids the problem of which estimate of the intercept to use. In practice, though, for bothR- and LS-fits, the intercept will also be fitted and its effect will be removed from theresiduals. We will also include the effect of estimation of the intercept in our discussion ofthe standardization of residuals and fitted values in Sections 3.9.2 and 3.9.3, respectively.

Suppose that the linear model (3.2.3) is correct. Based on its first order expression when γ = 0, êR is a function of the random errors similar to êLS; hence, it follows that a plot of êR versus ŶR should generally be a random scatter, similar to the least squares residual plot.

In the case of model misspecification, note that the R-residuals and least squares residuals have the same asymptotic bias, namely (I − H)Zγ. Hence R-residual plots, similar to those of least squares, are useful in identifying model misspecification.

For least squares residual plots, since least squares residuals and fitted values are uncorrelated, any pattern in this plot is due to model misspecification and not to the fitting procedure used. The converse, however, is not true. As the example below on the potency of drug compounds illustrates, the least squares residual plot can exhibit a random scatter for a poorly fitted model. This orthogonality in the LS residual plot does, however, make it easier to pick out patterns in the plot. Of course the R-residuals are not orthogonal to the R-fitted values, but they are usually close to orthogonality; see Naranjo et al. (1994). We introduce the following parameter ν to measure the extent of departure from orthogonality.

Denote general fitted values and residuals by Ŷ and ê, respectively. The expected departure from orthogonality is the parameter ν defined by

$$ \nu = E\!\left[\widehat{\mathbf{e}}'\widehat{\mathbf{Y}}\right] . \qquad (3.9.10) $$


For least squares, νLS is of course 0. For R-fits we have the following first order expression for it:

Theorem 3.9.2. Under the assumptions of Theorem 3.9.1 and either Model (3.2.3) or Model (3.9.1),

$$ \nu_R \doteq p\,\tau_\varphi\left(E[\varphi(F(e_1))e_1] - \tau_\varphi\right) . \qquad (3.9.11) $$

Proof: Suppose Model (3.9.1) holds. Using the above first order expressions, we have

$$ \nu_R \doteq E\left[\left(\mathbf{e} + \alpha\mathbf{1} - \tau_\varphi\mathbf{H}\varphi(F(\mathbf{e})) + (\mathbf{I} - \mathbf{H})\mathbf{Z}\boldsymbol{\gamma}\right)'\left(\mathbf{X}\boldsymbol{\beta} + \tau_\varphi\mathbf{H}\varphi(F(\mathbf{e})) + \mathbf{H}\mathbf{Z}\boldsymbol{\gamma}\right)\right] . $$

Using E[ϕ(F(e))] = 0, E[e] = E(e1)1, and the fact that X is centered, this expression simplifies to

$$ \nu_R \doteq \tau_\varphi E\left[\operatorname{tr}\,\mathbf{H}\varphi(F(\mathbf{e}))\mathbf{e}'\right] - \tau_\varphi^2 E\left[\operatorname{tr}\,\mathbf{H}\varphi(F(\mathbf{e}))\varphi(F(\mathbf{e}))'\right] . $$

Since the components of e are independent, the result follows. The result is invariant to either of the models.

Although in general νR ≠ 0 for R-estimates, if, as the next corollary shows, optimal scores (see Examples 3.6.1 and 3.6.2) are used, then the expected departure from orthogonality is 0.

Corollary 3.9.2. Under the hypothesis of the last theorem, if optimal R-scores are used then νR = 0.

Proof: Let ϕ(u) = −c f′(F⁻¹(u))/f(F⁻¹(u)), where c is chosen so that ∫ϕ²(u)du = 1. Then

$$ \tau_\varphi = \left[\int \varphi(u)\left(-\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}\right)du\right]^{-1} = c . $$

Some simplification and an integration by parts show that

$$ E[\varphi(F(e_1))e_1] = \int \varphi(F(e))\,e\, dF(e) = -c\int e\,f'(e)\,de = c\int f(e)\,de = c . $$

Hence, by (3.9.11), νR = pτϕ(c − c) = 0.

Naranjo et al. (1994) conducted a simulation study to investigate the above properties of rank-based and LS residuals over several small sample situations of null (the true model was fitted) models and misspecified models. Error distributions included the normal distribution and a contaminated normal distribution. Wilcoxon scores were used. The first part of the study concerned the amount of association between residuals and fitted values, where the association was measured by several correlation coefficients, including Pearson's r and Kendall's τ. Because of the orthogonality between the LS residuals and fitted values, Pearson's r is always 0 for LS. On the other measures of association, however, the results for the Wilcoxon analysis and LS were about the same. In general, there was little association. The second part investigated measures of randomness in a residual plot, including a runs test and a quadrant count test (the quadrants were determined by the medians of the residuals and fitted values). The results were similar for the LS and Wilcoxon fits. Both showed validity over the null models and exhibited similar power over the misspecified models. In a power study over a quadratic misspecified model, the Wilcoxon analysis exhibited more power for long-tailed error distributions. In summary, the simulation study provided empirical evidence that residual analyses based on Wilcoxon fits are similar to LS based residual analyses.

Table 3.9.1: Cloud Data

  %I-8         0     1     2     3     4     5     6     7     8     0
  Cloud Point  22.1  24.5  26.0  26.8  28.2  28.9  30.0  30.4  31.4  21.9

  %I-8         2     4     6     8     10    0     3     6     9
  Cloud Point  26.1  28.5  30.3  31.5  33.1  22.8  27.3  29.8  31.8

There are other useful residual plots. Two that we will briefly discuss are q−q plots and added variable plots. As with standard residual plots, the internal R-studentized residuals (see Section 3.9.2) can be used in place of the residuals. Since the R-estimates of β are consistent, the distribution of the residuals should resemble the distribution of the errors. This leads to consideration of another useful residual plot, the q−q plot. In this plot, the quantiles of the target distribution form the horizontal coordinates while the sample quantiles (ordered residuals) form the vertical coordinates. Linearity of this plot indicates the appropriateness of the target distribution as the true model distribution; see Exercise 3.16.24. McKean and Sievers (1989) discuss how to use these plots adaptively to select appropriate rank scores. In the next example, we use them to examine how well the R-fit fits the bulk of the data and to highlight outliers.

For the added variable plot, let êR denote the residuals from the R-fit of the model Y = α1 + Xβ + e. In this case, Z is a known vector and we wish to decide whether or not to add it to the regression model. For the added variable plot, we regress Z on X; we will denote the residuals from this fit by ê(Z | X) = (I − H)Z. The added variable plot consists of the scatter plot of the residuals êR versus ê(Z | X). Under model misspecification, γ ≠ 0, and from expression (3.9.7) the residuals êR are also a function of (I − H)Z. Hence the plot can be quite powerful in determining the potential of Z as a predictor.
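A minimal sketch of the added variable plot follows; the LS projection is used only to compute ê(Z | X), and the function name is ours.

```python
# Sketch of the added variable plot: R-fit residuals e_R plotted against
# e(Z|X) = (I - H)Z, the LS residuals of Z regressed on the columns of X.
import numpy as np
import matplotlib.pyplot as plt

def added_variable_plot(e_R, X, Z):
    coef, *_ = np.linalg.lstsq(X, Z, rcond=None)
    e_Z_given_X = Z - X @ coef          # (I - H) Z
    plt.scatter(e_Z_given_X, e_R)
    plt.xlabel("e(Z | X)")
    plt.ylabel("R residuals")
    plt.show()
```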

Example 3.9.1. Cloud Data

The data for this example can be found in Table 3.9.1. It is taken from an exercise on p. 162 of Draper and Smith (1966). The dependent variable is the cloud point of a liquid, a measure of the degree of crystallization in a stock. The independent variable is the percentage of I-8 in the base stock. The subsequent R-fits for this data set were all based on Wilcoxon scores, with the intercept estimated by α̂S, the median of the residuals.

Panel A of Figure 3.9.1 displays the residual plot (R-residuals versus R-fitted values) of the R-fit of the simple linear model. The curvature in the plot indicates that this model is a poor choice and that a higher degree polynomial model would be more appropriate. Panel B of Figure 3.9.1 displays the residual plot from the R-fit of a quadratic model. Some curvature is still present in the plot. A cubic polynomial was fitted next. Its R-residual plot, found in Panel C of Figure 3.9.1, is much more of a random scatter than the first two plots. On the basis of the residual plots, the cubic polynomial is an adequate model. Least squares residual plots would also lead to a third degree polynomial.

Figure 3.9.1: Panels A through C are the residual plots of the Wilcoxon fits of the linear, quadratic and cubic models, respectively, for the Cloud Data (Wilcoxon residuals versus Wilcoxon fitted values). Panel D is the q−q plot (Wilcoxon residuals versus normal quantiles) based on the Wilcoxon fit of the cubic model.

In the R-residual plot of the cubic model, several points appear to be outlying from the bulk of the data. These points are also apparent in Panel D of Figure 3.9.1, which displays the q−q plot of the R-residuals. Based on these plots, the R-regression appears to have fit the bulk of the data well. The q−q plot suggests that the underlying error distribution has slightly heavier tails than the normal distribution. A scale would be helpful in interpreting these residual plots, as discussed in the next section. Table 3.9.2 displays the estimated coefficients along with their standard errors. The Wilcoxon and least squares fits are practically the same.

Example 3.9.2. Potency Data, Example 3.3.3 continued

Page 212: Robust Nonparametric Statistical Methods

202 CHAPTER 3. LINEAR MODELS

Table 3.9.2: Wilcoxon and LS estimates of the regression coefficients for the Cloud Data. Standard errors are in parentheses.

Method          Intercept     Linear       Quadratic    Cubic        Scale
Wilcoxon        22.35 (.18)   2.24 (.17)   -.23 (.04)   .01 (.003)   τ̂ϕ = .307
Least Squares   22.31 (.15)   2.22 (.15)   -.22 (.04)   .01 (.003)   σ̂ = .281

This example was discussed in Section 3.3. Recall that the data were the result of an experiment concerning the potency of drug compounds manufactured under different levels of 4 factors and one covariate. Here we want to discuss a residual analysis of the rank-based fits of the two models that were fit in Example 3.3.3.

First consider Model (3.3.1) without the quadratic terms, i.e., without the parameters $\beta_{11}$, $\beta_{12}$ and $\beta_{13}$. The residuals used are the internal R-studentized residuals defined in the next section; see (3.9.31). They provide a convenient scale for detecting outliers. The curvature in the Wilcoxon residual plot of this model, Panel A of Figure 3.9.2, is quite apparent, indicating the need for quadratic terms in the model; whereas the LS residual plot, Panel C of Figure 3.9.2, does not exhibit this quadratic effect. As the R-residual plot indicates, there are outliers in the data, and these had an effect on the LS fit. Panels B and D display the residual plots when the squared terms of the factors are added to the model, i.e., when Model (3.3.1) was fit. This R-residual plot no longer exhibits the quadratic effect, indicating a better fitting model. Also, by examining the R-plots for both models, it is seen that the outlyingness of some of the outliers indicated in the plot for the first model was accounted for by the larger model.

3.9.2 Standardization of R-Residuals

In this section we want to obtain an expression for the variance of the R-residuals under the model (3.2.3). We will assume in this section that $\sigma^2$, the variance of the errors, is finite. As we show below, similar to the least squares residual, the variance of an R-residual depends both on its location in the $x$-space and the underlying variation of the errors. The internal Studentized least squares residuals (residuals divided by their estimated standard errors) have proved useful in diagnostic procedures, since they correct for both the model and the underlying variance. The internal R-studentized residuals defined below, (3.9.31), are similarly Studentized R-residuals.

A diagnostic use of a Studentized residual is in detecting outlying observations. The R-method provides a robust fit to the bulk of the data. Thus any case with a large Studentized residual can be considered an outlier from this model. Even though a robust fit is resistant to outliers, it is still useful to detect such points. Indeed, in practice these are often the points of most interest. The value of an internally Studentized residual is in its simplicity. It tells how many estimated standard errors a residual is away from the center of the data.

The standardization depends on which estimate of the intercept is selected. We shall obtain the result for $\hat\alpha_S$, the median of the $\hat e_{R,i}$, and only state the results for the intercept based on symmetric errors.


Figure 3.9.2: Panels A and B are the Wilcoxon internal studentized residual plots for the models without and with, respectively, the three quadratic terms $\beta_{11}$, $\beta_{12}$ and $\beta_{13}$. Panels C and D are the analogous plots for the LS fit.


Thus the residuals we seek to standardize are given by
$$\hat e_R = Y - \hat\alpha_S 1 - X\hat\beta_\varphi. \tag{3.9.12}$$

We will obtain a first order approximation of $\mathrm{cov}(\hat e_R)$. Since the residuals are invariant to the regression coefficients, we can assume without loss of generality that the true parameters are zero. Recall that $h_{c,i}$ is the $i$th diagonal element of $H = X(X'X)^{-1}X'$ and $h_i = n^{-1} + h_{c,i}$.

Theorem 3.9.3. Under the conditions (E.1), (E.2), (D.1), (D.2) and (S.1) of Section 3.4, if the intercept estimate is $\hat\alpha_S$, then a first order representation of the variance of $\hat e_{R,i}$ is
$$\mathrm{Var}(\hat e_{R,i}) \doteq \sigma^2(1 - K_1 n^{-1} - K_2 h_{c,i}), \tag{3.9.13}$$
where $K_1$ and $K_2$ are defined in expressions (3.9.18) and (3.9.19), respectively. In the case of a symmetric error distribution, when the estimate of the intercept is given by $\hat\alpha^+_\varphi$, discussed in Section 3.5.2, and (S.3) also holds,
$$\mathrm{Var}(\hat e_{R,i}) \doteq \sigma^2(1 - K_2 h_i). \tag{3.9.14}$$

Proof: Using the first order expression for $\hat\beta_\varphi$ given in (3.5.24) and the asymptotic representation of $\hat\alpha_S$ given by (3.5.23), we have
$$\hat e_R \doteq e - \tau_S\,\overline{\mathrm{sgn}(e)}\,1 - H\tau_\varphi\varphi(F(e)), \tag{3.9.15}$$

where $\overline{\mathrm{sgn}(e)} = \sum\mathrm{sgn}(e_i)/n$, and $\tau_S$ and $\tau_\varphi$ are defined in expressions (3.4.6) and (3.4.4), respectively. Because the median of $e_i$ is 0 and $\int\varphi(u)\,du = 0$, we have
$$E[\hat e_R] \doteq E(e_1)1.$$

Hence,
$$\mathrm{cov}(\hat e_R) \doteq E\bigl[(e - \tau_S\,\overline{\mathrm{sgn}(e)}\,1 - H\tau_\varphi\varphi(F(e)) - E(e_1)1)\,(e - \tau_S\,\overline{\mathrm{sgn}(e)}\,1 - H\tau_\varphi\varphi(F(e)) - E(e_1)1)'\bigr]. \tag{3.9.16}$$

Let $J = 11'/n$ denote the projection onto the space spanned by 1. Since our design matrix is $[1\ X]$, the leverage of the $i$th case is $h_i = n^{-1} + h_{c,i}$, where $h_{c,i}$ is the $i$th diagonal entry of the projection matrix $H$. By expanding the above expression and using the independence of the components of $e$, we get after some simplification (see Exercise 3.16.25):

$$\mathrm{Cov}(\hat e_R) \doteq \sigma^2(I - K_1 J - K_2 H), \tag{3.9.17}$$

where
$$K_1 = \left(\frac{\tau_S}{\sigma}\right)^2\left(\frac{2\delta_S}{\tau_S} - 1\right), \tag{3.9.18}$$
$$K_2 = \left(\frac{\tau_\varphi}{\sigma}\right)^2\left(\frac{2\delta}{\tau_\varphi} - 1\right), \tag{3.9.19}$$
$$\delta_S = E[e_i\,\mathrm{sgn}(e_i)], \tag{3.9.20}$$
$$\delta = E[e_i\,\varphi(F(e_i))], \tag{3.9.21}$$
$$\sigma^2 = \mathrm{Var}(e_i) = E((e_i - E(e_i))^2). \tag{3.9.22}$$

This yields the first result, (3.9.13). Next consider the case of a symmetric error distribution. If the estimate of the intercept is given by $\hat\alpha^+_\varphi$, discussed in Section 3.5.2, the result simplifies to (3.9.14).

From Cook and Weisberg (1982, p. 11), in the least squares case, $\mathrm{Var}(\hat e_{LS,i}) = \sigma^2(1 - h_i)$, so that $K_1$ and $K_2$ are correction factors due to using the rank score function.

Based on the results in the theorem, an estimate of the variance-covariance matrix of $\hat e_R$ is
$$\hat S = \hat\sigma^2(I - \hat K_1 J - \hat K_2 H_c), \tag{3.9.23}$$


where
$$\hat K_1 = \frac{\hat\tau_S^2}{\hat\sigma^2}\left(\frac{2\hat\delta_S}{\hat\tau_S} - 1\right), \tag{3.9.24}$$
$$\hat K_2 = \frac{\hat\tau_\varphi^2}{\hat\sigma^2}\left(\frac{2\hat\delta}{\hat\tau_\varphi} - 1\right), \tag{3.9.25}$$
$$\hat\delta_S = \frac{1}{n-p}\sum|\hat e_{R,i}|, \tag{3.9.26}$$
and
$$\hat\delta = \frac{1}{n-p}D(\hat\beta_\varphi).$$

The estimators $\hat\tau_S$ and $\hat\tau_\varphi$ are discussed in Section 3.7.1. To complete the estimate of $\mathrm{Cov}(\hat e_R)$, we need to estimate $\sigma$. A robust estimate of it is given by the MAD,
$$\hat\sigma = 1.483\,\mathrm{med}_i\{|\hat e_{R,i} - \mathrm{med}_j\{\hat e_{R,j}\}|\}, \tag{3.9.27}$$
which is a consistent estimate of $\sigma$ if the errors have a normal distribution. For the examples discussed here, we used this estimate in (3.9.23)–(3.9.25).

It follows from (3.9.23) that an estimate of $\mathrm{Var}(\hat e_{R,i})$ is
$$s_{R,i}^2 = \hat\sigma^2\left(1 - \hat K_1\frac{1}{n} - \hat K_2 h_{c,i}\right), \tag{3.9.28}$$
where $h_{c,i} = x_i'(X'X)^{-1}x_i$.

Let $\hat\sigma_{LS}^2$ denote the usual least squares estimate of the variance. Least squares residuals are standardized by $s_{LS,i}$, where
$$s_{LS,i}^2 = \hat\sigma_{LS}^2(1 - h_i); \tag{3.9.29}$$
see page 11 of Cook and Weisberg (1982), and recall that $h_i = n^{-1} + x_i'(X'X)^{-1}x_i$. If the error distribution is symmetric, (3.9.28) reduces to
$$s_{R,i}^2 = \hat\sigma^2(1 - \hat K_2 h_i). \tag{3.9.30}$$

Internal R-studentized Residual

We define the internal R-studentized residuals as
$$r_{R,i} = \frac{\hat e_{R,i}}{s_{R,i}}\quad\mbox{for } i = 1,\ldots,n, \tag{3.9.31}$$
where $s_{R,i}$ is the square root of either (3.9.28) or (3.9.30), depending on whether one assumes an asymmetric or symmetric error distribution, respectively.
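To make the standardization concrete, here is a Python sketch of (3.9.24)–(3.9.28) and (3.9.31) for Wilcoxon scores. The inputs $\hat\tau_S$ and $\hat\tau_\varphi$ are the scale estimates of Section 3.7.1; the fallback for a negative variance estimate follows the suggestion given below. The function name is ours.

```python
# Sketch of internal R-studentized residuals (3.9.31), asymmetric case.
import numpy as np
from scipy.stats import rankdata

def internal_rstudent(e_R, X, tauS, tauphi):
    n, p = X.shape
    Xc = X - X.mean(axis=0)                        # centered design
    Hc = Xc @ np.linalg.solve(Xc.T @ Xc, Xc.T)
    hc = np.diag(Hc)                               # leverages h_{c,i}
    sigma = 1.483 * np.median(np.abs(e_R - np.median(e_R)))   # MAD, (3.9.27)
    deltaS = np.sum(np.abs(e_R)) / (n - p)                    # (3.9.26)
    a = np.sqrt(12.0) * (rankdata(e_R) / (n + 1.0) - 0.5)     # Wilcoxon scores
    delta = np.sum(a * e_R) / (n - p)              # D(beta-hat)/(n - p)
    K1 = (tauS / sigma) ** 2 * (2.0 * deltaS / tauS - 1.0)    # (3.9.24)
    K2 = (tauphi / sigma) ** 2 * (2.0 * delta / tauphi - 1.0) # (3.9.25)
    s2 = sigma ** 2 * (1.0 - K1 / n - K2 * hc)                # (3.9.28)
    h = 1.0 / n + hc
    s2 = np.where(s2 > 0, s2, sigma ** 2 * (1.0 - h))  # negative-variance fix
    return e_R / np.sqrt(s2)
```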


It is interesting to compare expression (3.9.30) with the estimate of the variance of the least squares residual, $\hat\sigma_{LS}^2(1 - h_i)$. The correction factor $\hat K_2$ depends on the score function $\varphi(\cdot)$ and the underlying symmetric error distribution. If, for example, the error distribution is normal and if we use normal scores, then $\hat K_2$ converges in probability to 1; see Exercise 3.16.26. In general, however, we will not wish to specify the error distribution, and then $\hat K_2$ provides a natural adjustment.

A simple benchmark is useful in declaring whether or not a case is an outlier. We are certainly not advocating eliminating such cases, but flagging them as potential outliers and targeting them for further study. As we discussed in the last section, the distribution of the R-residuals should resemble the true distribution of the errors. Hence a simple rule for all cases is not apparent. In general, unless the residuals appear to be from a highly skewed distribution, a simple rule is to declare a case to be a potential outlier if its residual exceeds two standard errors in absolute value; i.e., $|r_{R,i}| > 2$.

The matrix $\hat S$, (3.9.23), is an estimate of a first order approximation of $\mathrm{cov}(\hat e_R)$. It is not necessarily positive semi-definite, and we have not constrained it to be so. In practice this has not proved troublesome, since only occasionally have we encountered negative estimates of the variance of the residuals. For instance, the R-fit for the cloud data resulted in one case with a negative variance. Presently, we replace (3.9.28) by $\hat\sigma\sqrt{1 - h_i}$, where $\hat\sigma$ is the MAD estimate (3.9.27), in these situations.

We have already illustrated the internal R-studentized residuals for the potency data of Example 3.9.2, discussed in the last section. We use them next on the Cloud Data.

Example 3.9.3. Cloud Data, Example 3.9.1, continued

Returning to the Cloud Data example, Panel A of Figure 3.9.3 displays a residual plot of the internal Wilcoxon studentized residuals versus the fitted values. It is similar to Panel C of Figure 3.9.1, but it has a meaningful scale on the vertical axis. The residuals for three of the cases (4, 10, and 16) are over two standard errors from the center of the data. These should be flagged as potential outliers. Panel B of Figure 3.9.3 displays the normal q−q plot of the internal Wilcoxon studentized residuals. The underlying error structure appears to have heavier tails than the normal distribution.

As with their least squares counterparts, we think the chief benefit of the internal R-studentized residuals is their usefulness in diagnostic plots and in flagging potential outliers.

External R-studentized Residual

Another statistic that is useful for flagging outliers is a robust version of the external t statistic. The LS version of this diagnostic is discussed in detail in Cook and Weisberg (1982). A robust version of this diagnostic is discussed in McKean, Sheather and Hettmansperger (1991). We briefly describe this latter approach.

Suppose we want to examine the $i$th case to see if it is an outlier. Consider the mean shift model given by
$$Y = X_1 b + \theta_i d_i + e, \tag{3.9.32}$$


Figure 3.9.3: Internal Wilcoxon studentized residual plot, Panel A, and corresponding normal q−q plot, Panel B, for the Cloud Data.


where $X_1$ is the augmented matrix $[1\ X]$ and $d_i$ is an $n\times 1$ vector of zeroes except for its $i$th component, which is 1. A formal hypothesis that the $i$th case is an outlier is given by
$$H_0\colon\theta_i = 0 \mbox{ versus } H_A\colon\theta_i \neq 0. \tag{3.9.33}$$

One way of testing these hypotheses is to use the test procedures described in Section 3.6. This requires fitting Model (3.9.32) for each value of $i$. A second approach is described next.

Note that we can rewrite Model (3.9.32) equivalently as
$$Y = X_1 b^* + \theta_i d_i^* + e, \tag{3.9.34}$$

where $d_i^* = (I - H_1)d_i$, $H_1$ is the projection matrix onto the column space of $X_1$, and $b^* = b + H_1 d_i\theta_i$; see Exercise 3.16.27. Because of the orthogonality between $X$ and $d_i^*$, the least squares estimate of $\theta_i$ can be obtained by a simple linear regression of $Y$ on $d_i^*$ or, equivalently, of $\hat e_{LS}$ on $d_i^*$. For the rank-based estimate, the asymptotic distribution theory of the regression estimates suggests a similar approach. Accordingly, let $\hat\theta_{R,i}$ denote the R-estimate when $\hat e_R$ is regressed on $d_i^*$. This is a simple regression, and the estimate can be obtained by a linear search algorithm; see Section 3.7.2. As Exercise 3.16.29 shows, this estimate is the inversion of an aligned rank statistic to test the hypotheses (3.9.33). Next let $\hat\tau_{\varphi,i}$ denote the estimate of $\tau_\varphi$ produced from this regression. We define the external R-studentized residual to be the statistic

$$t_R(i) = \frac{\hat\theta_{R,i}}{\hat\tau_{\varphi,i}/\sqrt{1 - h_{1,i}}}, \tag{3.9.35}$$
where $h_{1,i}$ is the $i$th diagonal entry of $H_1$. Note that we have standardized $\hat\theta_{R,i}$ by its asymptotic standard error.

A final remark on these external t statistics is in order. In the mean shift model, (3.9.32), the leverage value of the $i$th case is 1. Hence, the design assumption (D.2), (3.4.7), is not true. This invalidates both the LS and rank-based asymptotic theory for the external t statistics. In light of this, we do not propose the statistic $t_R(i)$ as a test statistic for the hypotheses (3.9.33), but as a diagnostic for flagging potential outliers. As a benchmark, we suggest the value 2.

3.9.3 Measures of Influential Cases

Since R-estimates have bounded influence in the $y$-space but not in the $x$-space, the R-fit may be affected by outlying points in the $x$-space. We next introduce a statistic which measures the influence of the $i$th case on the robust fit. We work with the usual model (3.2.3). First, we need the first order representation of $\hat Y_R$. Similar to the proof of Theorem 3.9.3, which obtained the first order representation of the residuals, (3.9.15), we have
$$\hat Y_R \doteq \alpha 1 + X\beta + \tau_S\,\overline{\mathrm{sgn}(e)}\,1 + H\tau_\varphi\varphi(F(e)); \tag{3.9.36}$$
see Exercise 3.16.28.

Let $\hat Y_{R(i)}$ denote the R-predicted value of $Y_i$ when the $i$th case is deleted from the model. We shall call this model the delete $i$ model. Then the change in the robust fit due to the $i$th case is
$$\mathrm{RDFFIT}_i = \hat Y_{R,i} - \hat Y_{R(i)}. \tag{3.9.37}$$

RDFFIT$_i$ is our measure of the influence of case $i$. Computation of this statistic is discussed later. Clearly, in order to be useful, RDFFIT$_i$ must be assessed relative to some scale.

RDFFIT is a change in the fitted value; hence, a natural scale for assessing RDFFIT is a fitted value scale. Using as our estimate of the intercept $\hat\alpha_S$, it follows from the expression (3.9.36), with $\gamma = 0$, that
$$\mathrm{Var}(\hat Y_{R,i}) \doteq n^{-1}\tau_S^2 + h_{c,i}\tau_\varphi^2. \tag{3.9.38}$$
Hence, based on a fitted scale assessment, we standardize RDFFIT by an estimate of the square root of this quantity.


For least squares diagnostics there is some discussion on whether to use the original model or the model with the $i$th point deleted for the estimation of scale. Cook and Weisberg (1982) advocate the original model. In this case the scale estimate is the same for all $n$ cases. This allows casewise comparisons involving the diagnostic. Belsley, Kuh, and Welsch (1980), however, advocate scale estimation based on the delete $i$ model. Note that both standardizations correct for the model and the underlying variation of the errors.

Let $\hat\tau_{S(i)}$ and $\hat\tau_{\varphi(i)}$ denote the estimates of $\tau_S$ and $\tau_\varphi$ for the delete $i$ model, as discussed above. Then our diagnostic, in which RDFFIT$_i$ is assessed relative to a fitted value scale with estimates of scale based on the delete $i$ model, is given by
$$\mathrm{RDFFITS}_i = \frac{\mathrm{RDFFIT}_i}{(n^{-1}\hat\tau_{S(i)}^2 + h_{c,i}\hat\tau_{\varphi(i)}^2)^{1/2}}. \tag{3.9.39}$$

This is an R-analogue of the least squares diagnostic DFFITS$_i$ proposed by Belsley et al. (1980). For standardization based on the original model, replace $\hat\tau_{S(i)}$ and $\hat\tau_{\varphi(i)}$ by $\hat\tau_S$ and $\hat\tau_\varphi$, respectively. We shall define
$$\mathrm{RDCOOK}_i = \frac{\mathrm{RDFFIT}_i}{(n^{-1}\hat\tau_S^2 + h_{c,i}\hat\tau_\varphi^2)^{1/2}}. \tag{3.9.40}$$

If $\hat\alpha_R^+$ is used as the estimate of the intercept then, provided the errors have a symmetric distribution, the R-diagnostics are obtained by replacing $\mathrm{Var}(\hat Y_{R,i})$ with $\mathrm{Var}(\hat Y_{R,i}) = h_i\tau_\varphi^2$; see Exercise 3.16.30 for details. This results in the diagnostics
$$\mathrm{RDFFITS}_{\mathrm{symm},i} = \frac{\mathrm{RDFFIT}_i}{\sqrt{h_i}\,\hat\tau_{\varphi(i)}}, \tag{3.9.41}$$
and
$$\mathrm{RDCOOK}_{\mathrm{symm},i} = \frac{\mathrm{RDFFIT}_i}{\sqrt{h_i}\,\hat\tau_\varphi}. \tag{3.9.42}$$

This eliminates the need to estimate $\tau_S$.

There is also a disagreement on what benchmarks to use for flagging points of potential influence. As Belsley et al. (1980) discuss in some detail, DFFITS is inversely influenced by sample size. They advocate a size-adjusted benchmark of $2\sqrt{p/n}$ for DFFITS. Cook and Weisberg (1982) suggest a more conservative value, which results in $\sqrt{p}$. We shall use both benchmarks in the examples. We realize these diagnostics only flag potentially influential points that require investigation. Similar to the two references cited above, we would never recommend indiscriminately deleting observations solely because their diagnostic values exceed the benchmark. Rather, these are potential points of influence which should be investigated.
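In code, applying the two benchmarks is immediate; the following sketch assumes RDFFIT$_i$ and the delete $i$ scale estimates have already been computed (for instance with the one or two step R-estimates recommended below). The function name is ours.

```python
# Sketch: standardize RDFFIT_i as in (3.9.39) and flag cases against the
# size-adjusted and the conservative benchmarks.
import numpy as np

def rdffits(rdffit, tauS_i, tauphi_i, hc, n, p):
    stat = rdffit / np.sqrt(tauS_i ** 2 / n + hc * tauphi_i ** 2)  # (3.9.39)
    flag_liberal = np.abs(stat) > 2.0 * np.sqrt(p / n)   # Belsley et al.
    flag_conserv = np.abs(stat) > np.sqrt(p)             # Cook and Weisberg
    return stat, flag_liberal, flag_conserv
```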

The diagnostics described above are formed with the leverage values based on the projection matrix. These leverage values are nonrobust (see Rousseeuw and van Zomeren, 1990). For data sets with clusters of outliers in factor space, robust leverage values can be formulated in terms of high breakdown estimates of the center and scatter matrix in factor space. One such choice would be the MVE, the minimum volume ellipsoid, proposed by Rousseeuw and van Zomeren (1990). Other estimates could be based on the robust singular value decomposition discussed by Ammann (1993). See also Simpson, Ruppert and Carroll (1992). We recommend computing $\hat Y_{R(i)}$ with a one or two step R-estimate based on the residuals from the original model; see Section 3.7.2. Each step involves a single ordering of the residuals, which are nearly in order (in fact, on the first step they are in order), and a single projection onto the range of $X$ (easily obtained by using the routines in LINPACK, as discussed in Section 3.7.2).

The diagnostic RDFFITS$_i$ measures the change in the fitted values when the $i$th case is deleted. Similarly, we can also measure changes in the estimates of the regression coefficients. For the LS analysis, this is the diagnostic DBETAS proposed by Belsley, Kuh and Welsch (1980). The corresponding diagnostics for the rank-based analysis are
$$\mathrm{RDBETAS}_{ij} = \frac{\hat\beta_{\varphi,j} - \hat\beta_{\varphi,j}(i)}{\hat\tau_{\varphi(i)}\sqrt{(X'X)^{-1}_{jj}}}, \tag{3.9.43}$$
where $\hat\beta_\varphi(i)$ denotes the R-estimate of $\beta$ in the delete $i$ model. A similar statistic can be constructed for the intercept parameter. Furthermore, a DCOOK version can also be constructed as above. These diagnostics are often used when $|\mathrm{RDFFITS}_i|$ is large. In such cases, it may be of interest to know which components of the regression coefficients are more influential than other components. The benchmark suggested by Belsley, Kuh and Welsch (1980) is $2/\sqrt{n}$.

Example 3.9.4. Free Fatty Acid (FFA) Data.

The data for this example can be found in Morrison (1983, p. 64) and, for convenience, in Table 3.9.3. The response is the level of free fatty acid in prepubescent boys, while the independent variables are age, weight, and skin fold thickness. The sample size is 41. Panel A of Figure 3.9.4 depicts the residual plot based on the least squares internal t residuals. From this plot there appear to be several outliers. Certainly the cases 12, 22, 26 and 9 are outlying, and perhaps the cases 8, 10 and 38. In fact, the first four of these cases probably control the least squares fit, obscuring cases 8, 10 and 38.

As our first R-fit of this data, we used the Wilcoxon scores with the intercept estimated by the median of the residuals, $\hat\alpha_S$. Note that all seven cases stand out in the Wilcoxon residual plot based on the internal R-studentized residuals, (3.9.31); see Panel B of Figure 3.9.4. This is further confirmed by the fits displayed in Table 3.9.4, where the LS fit with these seven cases deleted is very similar to the Wilcoxon fit using all the cases. The q−q plot of the internal R-studentized residuals, Panel C of Figure 3.9.4, also highlights these outlying cases. Similar to the residual plot, the q−q plot suggests that the underlying error distribution is positively skewed with a light left tail. The estimates of the regression coefficients and their standard errors are displayed in Table 3.9.4.


Table 3.9.3: Free Fatty Acid (FFA) Data.

Case  Age (months)  Weight (lbs)  Skin Fold Thickness  Free Fatty Acid
  1       105            67              0.96               0.759
  2       107            70              0.52               0.274
  3       100            54              0.62               0.685
  4       103            60              0.76               0.526
  5        97            61              1.00               0.859
  6       101            62              0.74               0.652
  7        99            71              0.76               0.349
  8       101            48              0.62               1.120
  9       107            59              0.56               1.059
 10       100            51              0.44               1.035
 11       100            80              0.74               0.531
 12       101            57              0.58               1.333
 13       104            58              1.10               0.674
 14        99            58              0.72               0.686
 15       101            54              0.72               0.789
 16       110            66              0.54               0.641
 17       109            59              0.68               0.641
 18       109            64              0.44               0.355
 19       110            76              0.52               0.256
 20       111            50              0.60               0.627
 21       112            64              0.70               0.444
 22       117            73              0.96               1.016
 23       109            68              0.82               0.582
 24       112            67              0.52               0.325
 25       111            81              1.14               0.368
 26       115            74              0.82               0.818
 27       115            63              0.56               0.384
 28       125            74              0.72               0.509
 29       131            70              0.58               0.634
 30       121            63              0.90               0.526
 31       123            67              0.66               0.337
 32       125            82              0.94               0.307
 33       122            62              0.62               0.748
 34       124            67              0.74               0.401
 35       122            60              0.60               0.451
 36       129            98              1.86               0.344
 37       128            76              0.82               0.545
 38       127            63              0.26               0.781
 39       140            79              0.74               0.501
 40       141            60              0.62               0.524
 41       139            81              0.78               0.318


Due to the skewness in the data, it is not surprising that the LS and R estimates of the intercept are different, since the former estimates the mean of the residuals while the latter estimates the median of the residuals.

Figure 3.9.4: Panel A, internal LS studentized residual plot of the original Free Fatty Acid Data; Panel B, internal Wilcoxon studentized residual plot of the original Free Fatty Acid Data; Panel C, internal Wilcoxon studentized normal q−q plot of the original Free Fatty Acid Data; and Panel D, internal R-studentized residual plot of the original Free Fatty Acid Data based on the score function $\varphi_{.5}(u)$.


Table 3.9.5 displays the values of the R and LS diagnostics for the cases of interest. For the seven cases cited above, the internal Wilcoxon studentized residuals, (3.9.31), definitely flag three of the cases, and for two of the others the residual exceeds 1.70; see Panel B of Figure 3.9.4. As RDFFITS, (3.9.39), indicates, none of these seven cases seems to have an effect on the Wilcoxon fit (the liberal benchmark is .62), whereas the 12th case appears to have an effect on the least squares fit. RDFFITS exceeded the benchmark only for case 2, for which it had the value −.64. Case 36, with $h_{36} = .53$, has high leverage, but it did not have an adverse effect on either the Wilcoxon fit or the LS fit. This is true too of cases 11 and 40, which were the only other cases whose leverage values exceeded the benchmark of $2p/n$.


Table 3.9.4: Estimates of β for the Free Fatty Acid Data; standard errors are in parentheses.

                          Original Data                                           log y
Par.     LS             Wilcoxon       LS (w/o 7 pts.)  R-Bent Score      LS             Wilcoxon
β0        1.70 (.33)     1.49 (.27)     1.24 (.21)       1.37 (.21)        1.12 (.52)     .99 (.54)
β1       -.002 (.003)   -.001 (.003)   -.001 (.002)     -.001 (.002)      -.001 (.005)    .000 (.005)
β2       -.015 (.005)   -.015 (.004)   -.013 (.003)     -.015 (.003)      -.029 (.008)   -.031 (.008)
β3        .205 (.167)    .274 (.137)    .285 (.103)      .355 (.104)       .444 (.263)    .555 (.271)
Scale     .215           .178           .126             .134              .341           .350

Table 3.9.5: Regression diagnostics for cases of interest for the Fatty Acid Data.

                  LS                Wilcoxon           Bent Score
Case   h_i    Int. t   DFFIT    Int. t   DFFIT     Int. t   DFFIT
  8    0.12    1.16     0.43     1.57     0.44      1.73     0.31
  9    0.04    1.74     0.38     2.14     0.13      2.37     0.26
 10    0.09    1.12     0.36     1.59     0.53      1.84     0.30
 12    0.06    2.84     0.79     3.30     0.33      3.59     0.30
 22    0.05    2.26     0.53     2.51    -0.06      2.55     0.11
 26    0.04    1.51     0.32     1.79     0.20      1.86     0.10
 38    0.15    1.27     0.54     1.70     0.53      1.93     0.19
  2    0.10   -1.19    -0.40    -0.17    -0.64     -0.75    -0.48
  7    0.11   -1.07    -0.37    -0.75    -0.44     -0.74    -0.64
 11    0.22    0.56     0.30     0.97     0.31      1.03     0.07
 40    0.25   -0.51    -0.29    -0.31    -0.21     -0.35     0.06
 36    0.53    0.18     0.19    -0.04    -0.27     -0.66    -0.34


As we noted above, both the residual and the q−q plots indicate that the distribution of the residuals is positively skewed. This suggests a transformation, as discussed below, or perhaps a prudent choice of a score function which would be more appropriate for skewed error distributions than the Wilcoxon scores. The score function $\varphi_{.5}(u)$, (2.5.34), is more suited to positively skewed errors. Panel D of Figure 3.9.4 displays the internal R-studentized residuals based on the R-fit using this bent score function. From this plot and the tabled diagnostics, the outliers stand out more from this fit than from the previous two fits. The RDFFITS values for this fit are even smaller than those of the Wilcoxon fit, which is expected since this score function protects on the right. While Case 7 has a little influence on the bent score fit, no other cases have RDFFITS exceeding the benchmark.

Table 3.9.4 displays the estimates of the betas for the three fits, along with their standard errors. At the .05 level, coefficients 2 and 3 are significant for the robust fits, while only coefficient 2 is significant for the LS fit. The robust fits appear to be an improvement over LS. Of the two robust fits, the bent score fit appears to be more precise than the Wilcoxon fit.

A practical transformation on the response variable, suggested by the Box-Cox transformation, is the log. Panel A of Figure 3.9.5 shows the internal R-studentized residual plot based on the Wilcoxon fit of the log transformed response. Note that 5 of the cases still stand out in the plot. The residuals from the transformed response still appear to be skewed, as is evident in the q−q plot, Panel B of Figure 3.9.5. From Table 3.9.4, the Wilcoxon fit seems slightly more precise in terms of standard errors.

Figure 3.9.5: Panel A, internal R-studentized residual plot of the log transformed Free Fatty Acid Data; Panel B, corresponding normal q−q plot.


3.10 Survival Analysis

In this section we discuss scores which are appropriate for lifetime distributions when the log of lifetime follows a linear model. These are called accelerated failure time models; see Kalbfleisch and Prentice (1980).


Let $T$ denote the lifetime of a subject and let $x$ be a $p\times 1$ vector of covariates associated with $T$. Let $h(t;x)$ denote the hazard function of $T$ at time $t$; see Section 2.8. Suppose $T$ follows a log linear model; that is, $Y = \log T$ follows the linear model
$$Y = \alpha + x'\beta + e, \tag{3.10.1}$$

where $e$ is a random error with density $f$. Exponentiating both sides, we get $T = \exp\{\alpha + x'\beta\}T_0$, where $T_0 = \exp\{e\}$. Let $h_0(t)$ denote the hazard function of $T_0$. This is called the baseline hazard function. Then the hazard function of $T$ is given by
$$h(t;x) = h_0(t\exp\{-(\alpha + x'\beta)\})\exp\{-(\alpha + x'\beta)\}. \tag{3.10.2}$$

Thus the covariate $x$ accelerates or decelerates the failure time of $T$; hence the name accelerated failure time for these models.

An important subclass of the accelerated failure time models are those where $T_0$ follows a Weibull distribution, i.e.,
$$f_{T_0}(t) = \lambda\gamma(\lambda t)^{\gamma-1}\exp\{-(\lambda t)^\gamma\},\quad t > 0, \tag{3.10.3}$$
where $\lambda$ and $\gamma$ are unknown parameters. In this case it follows that the hazard function of $T$ is proportional to the baseline hazard function, with the covariate acting as the factor of proportionality; i.e.,
$$h(t;x) = h_0(t)\exp\{-(\alpha + x'\beta)\}. \tag{3.10.4}$$

Hence these models are called proportional hazards models. Kalbfleisch and Prentice (1980) show that the only proportional hazards models which are also accelerated failure time models are those for which $T_0$ has the Weibull density. We can write the random error $e = \log T_0$ as $e = \xi + \gamma^{-1}W_0$, where $\xi = -\log\gamma$ and $W_0$ has the extreme value distribution discussed in Section 2.8 of Chapter 2. Thus the optimal rank scores for these log-linear models are generated by the function
$$\varphi_{f_\epsilon}(u) = -1 - \log(1 - u); \tag{3.10.5}$$

see (2.8.8) of Chapter 2.

Next we consider suitable score functions for the general failure time models, (3.10.1). As noted in Kalbfleisch and Prentice (1980), many of the error distributions currently used for these models are contained in the log-F class. In this class, $e = \log T$ is distributed, down to an unknown scale parameter, as the log of an F random variable with $2m_1$ and $2m_2$ degrees of freedom. In this case we shall say that $e$ has a $GF(2m_1, 2m_2)$ distribution. The distribution of $T$ is Weibull if $(m_1, m_2) \to (1, \infty)$, log-normal if $(m_1, m_2) \to (\infty, \infty)$, and generalized gamma if $(m_1, m_2) \to (\infty, 1)$; see Kalbfleisch and Prentice. If $(m_1, m_2) = (1, 1)$, then $e$ has a logistic distribution. In general this class contains a variety of shapes. The distributions are symmetric for $m_1 = m_2$, positively skewed for $m_1 > m_2$, and negatively skewed for $m_1 < m_2$. While Kalbfleisch and Prentice discuss this class for $m_1, m_2 \ge 1$, we will extend the class to $m_1, m_2 > 0$ in order to include heavier tailed error distributions.


For random errors with distribution $GF(2m_1, 2m_2)$, the optimal rank score function is given by
$$\varphi_{m_1,m_2}(u) = \frac{m_1 m_2(\exp\{F^{-1}(u)\} - 1)}{m_2 + m_1\exp\{F^{-1}(u)\}}, \tag{3.10.6}$$
where $F$ is the cdf of the $GF(2m_1, 2m_2)$ distribution; see Exercise 3.16.31. We shall label these scores as $GF(2m_1, 2m_2)$ scores. It follows that the scores are strictly increasing and bounded below by $-m_1$ and above by $m_2$. Hence an R-analysis based on these scores will have bounded influence in the $Y$-space.
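Although the scores (3.10.6) generally have no closed form, they are easy to produce numerically. Since $e$ is distributed as the log of an $F(2m_1, 2m_2)$ variate, $\exp\{F^{-1}(u)\}$ is simply the $F(2m_1, 2m_2)$ quantile. The following is a sketch using SciPy, rather than the Minitab or Splus routines mentioned later in this section.

```python
# GF(2m1, 2m2) scores: exp{F^{-1}(u)} in (3.10.6) is the F(2m1, 2m2)
# quantile, so the scores follow directly from scipy's F distribution.
import numpy as np
from scipy.stats import f

def gf_score(u, m1, m2):
    q = f.ppf(u, 2.0 * m1, 2.0 * m2)           # exp{F^{-1}(u)}
    return m1 * m2 * (q - 1.0) / (m2 + m1 * q)  # bounded in (-m1, m2)

n = 10
u = np.arange(1, n + 1) / (n + 1.0)
# m1 = m2 = 1 returns the linear score 2u - 1, proportional to the
# Wilcoxon score, since GF(2, 2) is the logistic distribution.
print(gf_score(u, 1.0, 1.0))
```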

Figure 3.10.1: Schematic of the four classes, $C_1$–$C_4$, of the $GF(2m_1, 2m_2)$ scores.

[Plot: horizontal axis $m_1$ and vertical axis $m_2$, each from 0 to 2; the quadrants about the center $(1,1)$ are labeled $C_3$ (upper left, negatively skewed), $C_2$ (upper right, light tailed), $C_4$ (lower left, heavy tailed) and $C_1$ (lower right, positively skewed).]

This class of scores can be conveniently divided into the four subclasses $C_1$ through $C_4$, which are represented by the four quadrants with center $(1,1)$, as depicted in Figure 3.10.1. The point $(1,1)$ in this figure corresponds to the linear-rank, Wilcoxon, scores. These scores are optimal for the logistic distribution, $GF(2,2)$, and form a “natural” center point for the scores. One score function from each class, with the density for which it is optimal, is plotted in Figure 3.10.2. These plots are generally representative. The score functions in $C_2$ change from concave to convex as $u$ increases and, hence, are suitable for light tailed error structure, while those in $C_4$ pass from convex to concave and are suitable for heavy tailed error structure. The score functions in $C_3$ are always convex and are suitable for negatively skewed error structure with heavy left tails and moderate right tails, while those in $C_1$ are suitable for positively skewed errors with heavy right tails and moderate left tails.

Figure 3.10.2: Column A contains plots of the densities: the Class $C_1$ distribution $GF(3, .8)$; the Class $C_2$ distribution $GF(4, 8)$; the Class $C_3$ distribution $GF(.5, 6)$; and the Class $C_4$ distribution $GF(1, .6)$. Column B contains the corresponding optimal score functions.


Figure 3.10.2 shows how a score function corresponds to its density. If the density has a heavy right tail, then the score function will tend to be flat on the right side; hence, the resulting estimate will be less sensitive to outliers on the right. While if the density has a light right tail, then the scores will tend to rise on the right in order to accentuate points on the right. The plots in Figure 3.10.2 suggest approximating these scores by scores consisting of two or three line segments, such as the bent score function, (2.5.34).

Generally the $GF(2m_1, 2m_2)$ scores cannot be obtained in closed form due to $F^{-1}$, but programs such as Minitab and Splus can easily produce them. There are two interesting subclasses for which closed forms are possible. These are the subclasses $GF(2, 2m_2)$ and $GF(2m_1, 2)$. As Exercise 3.16.32 shows, the random variables for these classes are the logs of variates having Pareto distributions. For the subclass $GF(2, 2m_2)$ the score generating function is

$$\varphi_{m_2}(u) = \left(\frac{m_2 + 2}{m_2}\right)^{1/2}\left(m_2 - (m_2 + 1)(1 - u)^{1/m_2}\right). \tag{3.10.7}$$

These are the powers of rank scores discussed by Mielke (1972) in the context of two sample problems.

It is interesting to note that the asymptotic relative efficiency of the Wilcoxon to the optimal rank score function at the $GF(2m_1, 2m_2)$ distribution is given by
$$\mathrm{ARE} = \frac{12\,\Gamma^4(m_1 + m_2)\,\Gamma^2(2m_1)\,\Gamma^2(2m_2)\,(m_1 + m_2 + 1)}{\Gamma^4(m_1)\,\Gamma^4(m_2)\,\Gamma^2(2m_1 + 2m_2)\,m_1 m_2}; \tag{3.10.8}$$

see Exercise 3.16.31. This efficiency can be arbitrarily small. For instance, in the subclass $GF(2, 2m_2)$ the efficiency reduces to
$$\mathrm{ARE} = \frac{3m_2(m_2 + 2)}{(2m_2 + 1)^2}, \tag{3.10.9}$$
which approaches 0 as $m_2 \to 0$ and $\frac{3}{4}$ as $m_2 \to \infty$. Thus in the presence of severely skewed errors, the Wilcoxon scores can have arbitrarily low efficiency compared to a fully efficient R-estimate based on the optimal scores.
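A quick numerical look at (3.10.9) makes the two limits concrete:

```python
# Efficiency (3.10.9) of Wilcoxon scores relative to the optimal
# GF(2, 2m2) scores; it tends to 0 as m2 -> 0 and to 3/4 as m2 -> infinity.
for m2 in [0.01, 0.1, 0.5, 1.0, 5.0, 100.0]:
    are = 3.0 * m2 * (m2 + 2.0) / (2.0 * m2 + 1.0) ** 2
    print(f"m2 = {m2:7.2f}   ARE = {are:.3f}")
# m2 = 1 gives ARE = 1, since the Wilcoxon scores are optimal for the
# logistic distribution GF(2, 2).
```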

For a given problem, the choice of scores presents a problem. McKean and Sievers (1989) discuss several methods for score selection, one of which is illustrated in the next example. This method is adaptive in nature, with the adaption depending on residuals from an initial fit. In practice, this can lead to overfitting. Its use, however, can lead to insight and may prove beneficial for fitting future data sets of the same type; see McKean et al. (1989) for such an application. Using XLISP-STAT (Tierney, 1990), Wang (1996) presents a graphical interface for methods of score selection.

Example 3.10.1. Insulating Fluid Data.

We consider a problem discussed in Nelson (1982, p. 227) and also discussed by Lawless (1982, p. 185). The data consist of breakdown times $T$ of an electrical insulating fluid subject to seven different levels of voltage stress $v$. Panel A of Figure 3.10.3 displays a scatter plot of $Y = \log T$ versus $\log v$.

As a full model we consider a oneway layout, as discussed in Chapter 4, with the response variable $Y = \log T$ and with the seven voltage levels as treatments. The comparison boxplots, Panel B of Figure 3.10.3, are an appropriate display for this model. One method for score selection, which we briefly touch on here, is based on q−q plots; see McKean and Sievers (1989). Using Wilcoxon scores, we obtained an initial fit of the oneway layout model as discussed in Chapter 4. Panel C of Figure 3.10.3 displays the q−q plot of the ordered residuals versus the logistic quantiles based on this fit. Although the left tail of the logistic distribution appears adequate, the right side of the plot indicates that distributions with lighter right tails might be more appropriate. This is confirmed by the near linearity of the $GF(2, 10)$ quantiles versus the Wilcoxon residuals. After trying several R-fits using $GF(2m_1, 2m_2)$ scores with $m_1, m_2 \ge 1$, we decided that the q−q plot of the $GF(2, 10)$ fit, Panel D of Figure 3.10.3, appeared to be most linear, and we used it to conduct the following R-analysis.

For the fit of the full model using the scores $GF(2, 10)$, the minimum value of the dispersion function, $D$, is 103.298 and the estimate of $\tau_\varphi$ is 1.38. Note that this minimum value of $D$ is the analogue of the “pure” sum of squared errors in a least squares analysis; hence, we will use the notation $DPE = 103.298$ for pure error dispersion. We first test the goodness of fit of a simple linear model. The reduced model in this case is a simple linear model. The alternative hypothesis is that the model is not linear but, other than this, it is not specified; hence, the full model is the oneway layout. Thus the hypotheses are

$$H_0\colon Y = \alpha + \beta\log v + e \quad\mbox{versus}\quad H_A\colon\mbox{the model is not linear}. \tag{3.10.10}$$

To test $H_0$, we fit the reduced model $Y = \alpha + \beta\log v + e$. The dispersion at the reduced model is 104.399. Since, as noted above, the dispersion at the full model is 103.298, the lack of fit is the reduction in dispersion $RD_{LOF} = 104.399 - 103.298 = 1.101$. Therefore, the value of the robust test statistic is $F_\varphi = .319$. There is no evidence on the basis of this test to contest a linear model.
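The arithmetic of this test is worth displaying. The full model (the oneway layout with seven levels) and the reduced model (two parameters) differ by $q = 5$ parameters, so:

```python
# The reduction-in-dispersion test for lack of fit in the example.
D_full, D_reduced, tau_hat, q = 103.298, 104.399, 1.38, 5
RD_LOF = D_reduced - D_full              # 1.101
F_phi = (RD_LOF / q) / (tau_hat / 2.0)   # (1.101/5)/(0.69) = 0.319
print(round(RD_LOF, 3), round(F_phi, 3))
```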

The $GF(2, 10)$ fit of the simple linear model is $\hat Y = 64 - 17.67\log v$, which is graphed in Panel A of Figure 3.10.3. Under this linear model, the estimate of the scale parameter $\tau_\varphi$ is 1.57. From this we compute a 95% confidence interval for the slope parameter $\beta$ to be $-17.67 \pm 3.67$; hence, it appears that the slope parameter differs significantly from 0. In Lawless, there was interest in computing a confidence interval for $E(Y\,|\,x = \log 20)$. The robust estimate of this conditional mean is $\hat Y = 11.07$, and a confidence interval is $11.07 \pm 1.9$. Similar to the other robust confidence intervals, this interval is the same as in the least squares analysis, except that $\hat\tau_\varphi$ replaces $\hat\sigma$. A fuller discussion of the R-analysis of this data set can be found in McKean and Sievers (1989).

Example 3.10.2. Sensitivity Analysis for Insulating Fluid Data.

As noted by Lawless, engineers may suggest a Weibull distribution for breakdown times in this problem. As discussed earlier, this means the errors have an extreme value distribution. This distribution is essentially the limit of a $GF(2, 2m)$ distribution as $m \to \infty$. For completeness, we obtained, using the IMSL (1987) subroutine UMIAH, estimates based on an extreme value likelihood function. These estimates are labeled EXT. R-estimates based on the optimum R-score function (2.8.8) for the extreme value distribution are labeled as REXT. The influence functions for the EXT and REXT estimates are unbounded in the $Y$-space and, hence, neither estimate is robust; see (3.5.17).

In order to illustrate this lack of robustness, we conducted a small sensitivity analysis. We replaced the fifth point, which had the value 6.05 (log units), with an outlying observation.


Table 3.10.1: Sensitivity analysis for the Insulating Data.

                                       Value of Y5
             Original (6.05)    7.75           10.05          16.05           30.05
Estimate      α̂      β̂         α̂     β̂       α̂     β̂       α̂      β̂       α̂      β̂
LS           59.4   -16.4      60.8  -16.8    62.7  -17.3    67.6   -18.7    79.1   -21.9
Wilcoxon     62.7   -17.2      63.1  -17.4    63.0  -17.4    63.1   -17.4    63.1   -17.4
GF(2,10)     64.0   -17.7      65.5  -18.1    67.0  -18.5    67.1   -18.5    67.1   -18.5
REXT         64.1   -17.7      65.5  -18.1    68.3  -18.9    68.3   -18.9    68.3   -18.9
EXT          64.8   -17.7      68.4  -18.7    79.3  -21.8    114.6  -31.8    191.7  -53.5

Table 3.10.1 summarizes the results for several different choices of the outlier. Note that even for the first case, when the changed point is 7.75, which is the maximum of the original data, there is a substantial change in the EXT estimates. The EXT fit is a disaster when the point is changed to 10.05, whereas the R-estimates exhibit robustness. This is even more so for succeeding cases. Although the REXT estimates have an unbounded influence function, they behaved well in this sensitivity analysis.

3.11 Correlation Model

In this section, we are concerned with the correlation model defined by
$$Y = \alpha + x'\beta + e, \tag{3.11.1}$$

where $x$ is a $p$-dimensional random vector with distribution function $M$ and density function $m$, $e$ is a random variable with distribution function $F$ and density $f$, and $x$ and $e$ are independent. Let $H$ and $h$ denote the joint distribution function and joint density function of $Y$ and $x$. It follows that
$$h(x, y) = f(y - \alpha - x'\beta)m(x). \tag{3.11.2}$$

Denote the marginal distribution and density functions of $Y$ by $G$ and $g$. The hypotheses of interest are:
$$H_0\colon Y \mbox{ and } x \mbox{ are independent versus } H_A\colon Y \mbox{ and } x \mbox{ are dependent}. \tag{3.11.3}$$

By (3.11.2) this is equivalent to the hypotheses $H_0\colon\beta = 0$ versus $H_A\colon\beta \neq 0$. For this section, we will use the additional assumptions:
$$\mbox{(E.2)}\quad \mathrm{Var}(e) = \sigma_e^2 < \infty \tag{3.11.4}$$
$$\mbox{(M.1)}\quad E[xx'] = \Sigma,\ \Sigma > 0. \tag{3.11.5}$$

Without loss of generality assume that E[x] = 0 and E(e) = 0.


Let $(x_1, Y_1),\ldots,(x_n, Y_n)$ be a random sample from the above model. Define the $n\times p$ matrix $X_1$ to be the matrix whose $i$th row is the vector $x_i$, and let $X$ be the corresponding centered matrix, i.e., $X = (I - n^{-1}11')X_1$. Thus the notation here agrees with that found in the previous sections.

We intend to briefly describe the rank-based analysis for this model. As we will show, using conditional arguments, the asymptotic inference we developed for the fixed $x$ case will hold for the stochastic case also. We then want to explore measures of association between $x$ and $Y$. These will be analogues of the classical coefficient of multiple determination, $\bar R^2$. As with $\bar R^2$, these robust CMDs will be 0 when $x$ and $Y$ are independent and positive when they are dependent. Besides defining these measures, we will obtain consistent estimates of them. First we show that, conditionally, the assumptions of Section 3.4 hold. Much of the discussion in this section is taken from the paper by Witt, Naranjo and McKean (1995).

3.11.1 Huber’s Condition for the Correlation Model

The key assumption on the design matrix for the nonstochastic $x$ linear model was Huber's condition, (D.2), (3.4.7). As we next show, it holds almost surely (a.s.) for the correlation model. This will allow us to easily obtain inference methods for the correlation model, as discussed below.

First define the modulus of a matrix $A$ to be
$$m(A) = \max_{i,j}|a_{ij}|. \tag{3.11.6}$$

As Exercise 3.16.33 shows, the following three facts follow from this definition: $m(AB) \le p\,m(A)m(B)$, where $p$ is the common dimension of $A$ and $B$; $m(AA') \ge m(A)^2$; and $m(A) = \max a_{ii}$ if $A$ is positive semidefinite. We next need a preliminary lemma found in Arnold (1980).

Lemma 3.11.1. Let $\{a_n\}$ be a sequence of nonnegative real numbers. If $n^{-1}\sum_{i=1}^{n}a_i \to a_0$, then $n^{-1}\sup_{1\le i\le n}a_i \to 0$.

Proof: We have, since each average converges to $a_0$,
$$\frac{a_n}{n} = \frac{1}{n}\sum_{i=1}^{n}a_i - \frac{n-1}{n}\,\frac{1}{n-1}\sum_{i=1}^{n-1}a_i \to 0. \tag{3.11.7}$$
Now suppose that $n^{-1}\sup_{1\le i\le n}a_i \not\to 0$. Then for some $\epsilon > 0$ and for all integers $N$ there exists an $n_N$ such that $n_N \ge N$ and $n_N^{-1}\sup_{1\le i\le n_N}a_i \ge \epsilon$. Thus we can find a subsequence of integers $\{n_j\}$ such that $n_j \to \infty$ and $n_j^{-1}\sup_{1\le i\le n_j}a_i \ge \epsilon$. Let $a_{i_{n_j}} = \sup_{1\le i\le n_j}a_i$. Then
$$\epsilon \le \frac{a_{i_{n_j}}}{n_j} \le \frac{a_{i_{n_j}}}{i_{n_j}}. \tag{3.11.8}$$
Also, since $n_j \to \infty$ and $\epsilon > 0$, $i_{n_j} \to \infty$; hence, expression (3.11.8) leads to a contradiction of expression (3.11.7).

The following theorem is due to Arnold (1980).


Theorem 3.11.1. Under (3.11.5),
$$\lim_{n\to\infty}\max\mathrm{diag}\left\{X(X'X)^{-1}X'\right\} = 0,\ \mbox{a.s.} \tag{3.11.9}$$

Proof: Using the facts cited above on the modulus of a matrix, we have
$$m\left(X(X'X)^{-1}X'\right) \le p^2 n^{-1}m(XX')\,m\!\left(\left(\frac{1}{n}X'X\right)^{-1}\right). \tag{3.11.10}$$

Using the assumptions on the correlation model, the law of large numbers yields $\left(\frac{1}{n}X'X\right) \to \Sigma$ a.s. Hence we need only show that $n^{-1}m(XX') \to 0$ a.s. Let $U_i$ denote the $i$th diagonal element of $XX'$. We then have
$$\frac{1}{n}\sum_{i=1}^{n}U_i = \frac{1}{n}\mathrm{tr}\,X'X \stackrel{a.s.}{\to} \mathrm{tr}\,\Sigma.$$
By Lemma 3.11.1, we have $n^{-1}\sup_{i\le n}U_i \stackrel{a.s.}{\to} 0$. Since $XX'$ is positive semidefinite, the desired conclusion is obtained from the facts which followed expression (3.11.6).

Thus, given $X$, we have the same assumptions on the design matrix as we did in the previous sections. By conditioning on $X$, the theory derived in Section 3.5 holds for the correlation model also. Such a conditional argument is demonstrated in Theorem 3.11.2 below. For later discussion, we summarize the rank-based inference for the correlation model. Given a specified score function $\varphi$, let $\hat\beta_\varphi$ denote the R-estimate of $\beta$ defined in Section 3.2. Under the correlation model (3.11.1) and the assumptions (3.11.4), (S.1), (3.4.10), and (3.11.5), $\sqrt{n}(\hat\beta_\varphi - \beta) \stackrel{D}{\to} N_p(0, \tau_\varphi^2\Sigma^{-1})$. Also, the estimates of $\tau_\varphi$ discussed in Section 3.7.1 will be consistent estimates of $\tau_\varphi$ under the correlation model. Let $\hat\tau_\varphi$ denote such an estimate. In terms of testing, consider the R-test statistic, $F_\varphi = (RD/p)/(\hat\tau_\varphi/2)$, of the above hypothesis $H_0$ of independence. Employing the usual conditional argument, it follows that $pF_\varphi \stackrel{D}{\to} \chi^2(p, \delta_R)$, a.e. $M$ under $H_n\colon\beta = \theta/\sqrt{n}$, where the noncentrality parameter $\delta_R$ is given by $\delta_R = \theta'\Sigma\theta/\tau_\varphi^2$.

Likewise for the LS estimate $\hat\beta_{LS}$ of $\beta$. Using the conditional argument (see Arnold (1980) for details), $\sqrt{n}(\hat\beta_{LS} - \beta) \stackrel{D}{\to} N_p(0, \sigma^2\Sigma^{-1})$ and, under $H_n$, $pF_{LS} \stackrel{D}{\to} \chi^2(p, \delta_{LS})$, with noncentrality parameter $\delta_{LS} = \theta'\Sigma\theta/\sigma^2$. Thus the ARE of the R-test $F_\varphi$ to the least squares test $F_{LS}$ is the ratio of noncentrality parameters, $\sigma^2/\tau_\varphi^2$. This is the usual ARE of rank tests to tests based on least squares in simple location models. Hence the test statistic $F_\varphi$ has efficiency robustness. The theory of rank-based tests in Section 3.6 applies to the correlation model.

We return to measures of association and their estimates. For motivation, we consider the least squares measure first.


3.11.2 Traditional Measure of Association and its Estimate

The traditional population coefficient of multiple determination (CMD) is defined by
$$\bar R^2 = \frac{\beta'\Sigma\beta}{\sigma_e^2 + \beta'\Sigma\beta}; \tag{3.11.11}$$

see Arnold (1981). Note that $\bar R^2$ is a measure of association between $Y$ and $x$. It lies between 0 and 1, and it is 0 if and only if $Y$ and $x$ are independent (because $Y$ and $x$ are independent if and only if $\beta = 0$).

In order to obtain a consistent estimate of $\bar R^2$, treat $x_i$ as nonstochastic and fit by least squares the model $Y_i = \alpha + x_i'\beta + e_i$, which will be called the full model. The residual amount of variation is $SSE = \sum_{i=1}^{n}(Y_i - \hat\alpha_{LS} - x_i'\hat\beta_{LS})^2$, where $\hat\beta_{LS}$ and $\hat\alpha_{LS}$ are the least squares estimates. Next fit the reduced model, defined as the full model subject to $H_0\colon\beta = 0$. The total amount of variation is $SST = \sum_{i=1}^{n}(Y_i - \bar Y)^2$. The reduction in variation in fitting the full model over the reduced model is $SSR = SST - SSE$. An estimate of $\bar R^2$ is the proportion of explained variation given by
$$R^2 = \frac{SSR}{SST}. \tag{3.11.12}$$

The least squares test statistic for $H_0$ versus $H_A$ is $F_{LS} = (SSR/p)/\hat\sigma_{LS}^2$, where $\hat\sigma_{LS}^2 = SSE/(n-p-1)$. Recall that $R^2$ can be expressed as
$$R^2 = \frac{SSR}{SSR + (n-p-1)\hat\sigma_{LS}^2} = \frac{\frac{p}{n-p-1}F_{LS}}{1 + \frac{p}{n-p-1}F_{LS}}. \tag{3.11.13}$$

Now consider the general correlation model. As shown in Arnold (1980), under (3.11.4) and (3.11.5), $R^2$ is a consistent estimate of $\bar R^2$. Under the multivariate normal model, $R^2$ is the maximum likelihood estimate of $\bar R^2$.

3.11.3 Robust Measure of Association and its Estimate

The rank-based analogue to the reduction in residual variation is the reduction in residual dispersion, which is given by $RD = D(0) - D(\hat\beta_R)$. Hence, the proportion of dispersion explained by fitting $\beta$ is
$$R_1 = RD/D(0). \tag{3.11.14}$$
This is a natural CMD for any robust estimate and, as we shall show below, the population CMD for which $R_1$ is a consistent estimate does satisfy interesting properties. As expression (A.5.11) of the Appendix shows, however, the influence function of the denominator is not bounded in the $Y$-space. Hence the statistic $R_1$ is not robust.

In order to obtain a CMD which is robust, consider the test statistic of $H_0$, $F_\varphi = (RD/p)/(\hat\tau_\varphi/2)$, (3.6.12). As we indicated above, the test statistic $F_\varphi$ has efficiency robustness. Furthermore, as shown in the Appendix, the influence function of $F_\varphi$ is bounded in the $Y$-space. Hence the test statistic is robust.


Consider the relationship between the classical F-test and $R^2$ given by expression (3.11.13). In the same way, but using the robust test $F_\varphi$, we can define a second R-coefficient of multiple determination,
$$R_2 = \frac{\frac{p}{n-p-1}F_\varphi}{1 + \frac{p}{n-p-1}F_\varphi} = \frac{RD}{RD + (n-p-1)(\hat\tau_\varphi/2)}. \tag{3.11.15}$$
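In practice, both coefficients come directly from the pieces of the rank-based fit; a minimal sketch (the function name is ours):

```python
# The two R-coefficients of multiple determination from a rank-based fit:
# D0 = D(0), Dfit = D(beta-hat), tau_hat the estimate of tau_phi.
def robust_cmds(D0, Dfit, tau_hat, n, p):
    RD = D0 - Dfit                                  # reduction in dispersion
    R1 = RD / D0                                    # (3.11.14), not robust
    R2 = RD / (RD + (n - p - 1) * tau_hat / 2.0)    # (3.11.15), robust
    return R1, R2
```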

It follows from the above discussion on the R-test statistic that the influence function of $R_2$ has bounded influence in the $Y$-space.

The parameters that respectively correspond to the statistics $D(0)$ and $D(\hat\beta_R)$ are $D_y = \int\varphi(G(y))y\,dG(y)$ and $D_e = \int\varphi(F(e))e\,dF(e)$; see the discussion in Section 3.6.3. The population CMDs associated with $R_1$ and $R_2$ are
$$\bar R_1 = RD/D_y \tag{3.11.16}$$
$$\bar R_2 = RD/(RD + (\tau_\varphi/2)), \tag{3.11.17}$$
where $RD = D_y - D_e$. The properties of these parameters are discussed in the next section. The consistency of $R_1$ and $R_2$ is given in the following theorem:

Theorem 3.11.2. Under the correlation model (3.11.1) and the assumptions (E.1), (2.4.16), (S.1), (3.4.10), (S.2), (3.4.11), and (3.11.5),
$$R_i \stackrel{P}{\to} \bar R_i\ \mbox{a.e. } M,\ i = 1, 2.$$

Proof: Note that we can write
$$\frac{1}{n}D(0) = \sum_{i=1}^{n}\varphi\!\left(\frac{n}{n+1}F_n(Y_i)\right)Y_i\,\frac{1}{n} = \int\varphi\!\left(\frac{n}{n+1}F_n(t)\right)t\,dF_n(t),$$
where $F_n$ denotes the empirical distribution function of the random sample $Y_1,\ldots,Y_n$. As $n\to\infty$ the integral converges to $D_y$.

Next consider the reduction in dispersion. By Theorem 3.11.1, with probability 1, we can restrict the sample space to a space on which Huber's design condition (D.1) holds and on which $n^{-1}X'X \to \Sigma$. Then, conditionally, given $X$, we have the assumptions found in Section 3.4 for the non-stochastic model. Hence, from the discussion found in Section 3.6.3, $(1/n)D(\hat\beta_R) \stackrel{P}{\to} D_e$. Hence it is true unconditionally, a.e. $M$. The consistency of $\hat\tau_\varphi$ was discussed above. The result then follows.


Example 3.11.1. Measures of Association for Wilcoxon Scores.

For the Wilcoxon score function, $\varphi_W(u) = \sqrt{12}(u - 1/2)$, as Exercise 3.16.34 shows, $D_y = \int\varphi(G(y))y\,dG(y) = \sqrt{3/4}\,E|Y_1 - Y_2|$, where $Y_1, Y_2$ are iid with distribution function $G$. Likewise, $D_e = \sqrt{3/4}\,E|e_1 - e_2|$, where $e_1, e_2$ are iid with distribution function $F$. Finally, $\tau_\varphi = (\sqrt{12}\int f^2)^{-1}$. Hence for Wilcoxon scores these coefficients of multiple determination simplify to
$$R_{W1} = \frac{E|Y_1 - Y_2| - E|e_1 - e_2|}{E|Y_1 - Y_2|} \tag{3.11.18}$$
$$R_{W2} = \frac{E|Y_1 - Y_2| - E|e_1 - e_2|}{E|Y_1 - Y_2| - E|e_1 - e_2| + (1/(6\int f^2))}. \tag{3.11.19}$$
As discussed above, in general, $R_{W1}$ is not robust but $R_{W2}$ is.
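These Wilcoxon expressions are simple to evaluate by Monte Carlo. The sketch below uses the normal correlation model of the next section with $\beta = 1$, $x \sim N(0,1)$ and $e \sim N(0,1)$, for which $\int f^2 = 1/(2\sqrt{\pi})$; the simulated value of $R_{W1}$ should be near $1 - 1/\sqrt{2} \approx .293$.

```python
# Monte Carlo sketch of the Wilcoxon population CMDs (3.11.18)-(3.11.19)
# under Y = x + e with x, e independent standard normals.
import numpy as np
rng = np.random.default_rng(1)
N = 200_000
Y = rng.normal(size=N) + rng.normal(size=N)         # Y = x + e
e = rng.normal(size=N)
EdY = np.mean(np.abs(Y[: N // 2] - Y[N // 2 :]))    # E|Y1 - Y2|
Ede = np.mean(np.abs(e[: N // 2] - e[N // 2 :]))    # E|e1 - e2|
int_f2 = 1.0 / (2.0 * np.sqrt(np.pi))               # int f^2, normal errors
RW1 = (EdY - Ede) / EdY
RW2 = (EdY - Ede) / (EdY - Ede + 1.0 / (6.0 * int_f2))
print(round(RW1, 3), round(RW2, 3))                 # roughly 0.293, 0.442
```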

Example 3.11.2. Measures of Association for Sign Scores.

For the sign score function, Exercise 3.16.34 shows that $D_y = \int\varphi(G(y))y\,dG(y) = E|Y - \mathrm{med}\,Y|$, where $\mathrm{med}\,Y$ denotes the median of $Y$. Likewise, $D_e = E|e - \mathrm{med}\,e|$. Hence for sign scores the coefficients of multiple determination are
$$R_{S1} = \frac{E|Y - \mathrm{med}\,Y| - E|e - \mathrm{med}\,e|}{E|Y - \mathrm{med}\,Y|} \tag{3.11.20}$$
$$R_{S2} = \frac{E|Y - \mathrm{med}\,Y| - E|e - \mathrm{med}\,e|}{E|Y - \mathrm{med}\,Y| - E|e - \mathrm{med}\,e| + (4f(\mathrm{med}\,e))^{-1}}. \tag{3.11.21}$$
These were obtained by McKean and Sievers (1987) from an $L_1$ point of view.

3.11.4 Properties of R-Coefficients of Multiple Determination

In this section we explore further properties of the population coefficients of multiple determination proposed in the last section. To show that $\bar R_1$ and $\bar R_2$, (3.11.16) and (3.11.17), are indeed measures of association, we have the following two theorems. The proof of the first theorem is quite similar to corresponding proofs of properties of the dispersion function for the nonstochastic model.

Theorem 3.11.3. Suppose $f$ and $g$ satisfy the condition (E.1), (3.4.1), and their first moments are finite. Then $D_y > 0$ and $D_e > 0$, where $D_y = \int\varphi(G(y))y\,dG(y)$.

Proof: It suffices to show it for $D_y$, since the proof for $D_e$ is the same. The function $\varphi$ is increasing and $\int\varphi = 0$; hence, $\varphi$ must take on both negative and positive values. Thus the set $A = \{y : \varphi(G(y)) < 0\}$ is not empty and is bounded above. Let $y_0 = \sup A$. Then
$$D_y = \int_{-\infty}^{y_0}\varphi(G(y))(y - y_0)\,dG(y) + \int_{y_0}^{\infty}\varphi(G(y))(y - y_0)\,dG(y). \tag{3.11.22}$$


Since both integrands are nonnegative, it follows that $D_y \ge 0$. If $D_y = 0$, then it follows from (E.1) that $\varphi(G(y)) = 0$ for all $y \neq y_0$, which contradicts the facts that $\varphi$ takes on both positive and negative values and that $G$ is absolutely continuous.

The next theorem is taken from Witt (1989).

Theorem 3.11.4. Suppose $f$ and $g$ satisfy the conditions (E.1) and (E.2) in Section 3.4 and that $\varphi$ satisfies assumption (S.2), (3.4.11). Then $RD$ is a strictly convex function of $\beta$ and has a minimum value of 0 at $\beta = 0$.

Proof: We will show that the gradient of $RD$ is zero at $\beta = 0$ and that its second matrix derivative is positive definite. Note first that the distribution function, $G$, and density, $g$, of $Y$ can be expressed as $G(y) = \int F(y - \beta'x)\,dM(x)$ and $g(y) = \int f(y - \beta'x)\,dM(x)$. We have
$$\frac{\partial RD}{\partial\beta} = -\int\!\!\int\!\!\int\varphi'[G(y)]\,y\,f(y-\beta'x)f(y-\beta'u)\,u\,dM(x)\,dM(u)\,dy - \int\!\!\int\varphi[G(y)]\,y\,f'(y-\beta'x)\,x\,dM(x)\,dy. \tag{3.11.23}$$

Since $E[x] = 0$, both terms on the right side of the above expression are 0 at $\beta = 0$. Before obtaining the second derivative, we rewrite the first term of (3.11.23) as
$$-\int\left[\int\!\!\int\varphi'[G(y)]\,y\,f(y-\beta'x)f(y-\beta'u)\,dy\,dM(x)\right]u\,dM(u) = -\int\left[\int\varphi'[G(y)]\,g(y)\,y\,f(y-\beta'u)\,dy\right]u\,dM(u).$$

Next integrate by parts the expression in brackets with respect to $y$, using $dv = \varphi'[G(y)]g(y)\,dy$ and $t = yf(y - \beta'u)$. Since $\varphi$ is bounded and $f$ has a finite second moment, this leads to
$$\frac{\partial RD}{\partial\beta} = \int\!\!\int\varphi[G(y)]f(y-\beta'u)\,u\,dy\,dM(u) + \int\!\!\int\varphi[G(y)]\,y\,f'(y-\beta'u)\,u\,dy\,dM(u) - \int\!\!\int\varphi[G(y)]\,y\,f'(y-\beta'x)\,x\,dy\,dM(x) = \int\!\!\int\varphi[G(y)]f(y-\beta'u)\,u\,dy\,dM(u).$$

Hence the second derivative of $RD$ is
$$\frac{\partial^2 RD}{\partial\beta\,\partial\beta'} = -\int\!\!\int\varphi[G(y)]f'(y-\beta'x)\,xx'\,dy\,dM(x) - \int\!\!\int\!\!\int\varphi'[G(y)]f(y-\beta'x)f(y-\beta'u)\,xu'\,dy\,dM(x)\,dM(u). \tag{3.11.24}$$

Now integrate the first term on the right side of (3.11.24) by parts with respect to $y$, using $dt = f'(y-\beta'x)\,dy$ and $v = \varphi[G(y)]$. This leads to
$$\frac{\partial^2 RD}{\partial\beta\,\partial\beta'} = -\int\!\!\int\!\!\int\varphi'[G(y)]f(y-\beta'x)f(y-\beta'u)\,x(u-x)'\,dy\,dM(x)\,dM(u). \tag{3.11.25}$$


We have, however, the following identity:
$$\int\!\!\int\!\!\int\varphi'[G(y)]f(y-\beta'x)f(y-\beta'u)(u-x)(u-x)'\,dy\,dM(x)\,dM(u) = \int\!\!\int\!\!\int\varphi'[G(y)]f(y-\beta'x)f(y-\beta'u)\,u(u-x)'\,dy\,dM(x)\,dM(u) - \int\!\!\int\!\!\int\varphi'[G(y)]f(y-\beta'x)f(y-\beta'u)\,x(u-x)'\,dy\,dM(x)\,dM(u).$$
Since the two integrals on the right side of the last expression are negatives of each other, this, combined with expression (3.11.25), leads to
$$2\frac{\partial^2 RD}{\partial\beta\,\partial\beta'} = \int\!\!\int\!\!\int\varphi'[G(y)]f(y-\beta'x)f(y-\beta'u)(u-x)(u-x)'\,dy\,dM(x)\,dM(u).$$

Since the functions $f$ and $M$ are continuous and the score function is increasing, it follows that the right side of this last expression is a positive definite matrix.

It follows from these theorems that the $\bar R_i$s satisfy properties of association similar to $\bar R^2$. We have $0 \le \bar R_i \le 1$. By Theorem 3.11.4, $\bar R_i = 0$ if and only if $\beta = 0$, if and only if $Y$ and $x$ are independent.

Example 3.11.3. Multivariate Normal Model

Further understanding of the $\bar R_i$ can be gleaned from their direct relationship with $\bar R^2$ for the multivariate normal model.

Theorem 3.11.5. Suppose Model (3.11.1) holds. Assume further that $(x, Y)$ follows a multivariate normal distribution with the variance-covariance matrix
$$\Sigma_{(x,Y)} = \begin{bmatrix}\Sigma & \Sigma\beta \\ \beta'\Sigma & \sigma_e^2 + \beta'\Sigma\beta\end{bmatrix}. \tag{3.11.26}$$
Then, from (3.11.16) and (3.11.17),
$$\bar R_1 = 1 - \sqrt{1 - \bar R^2} \tag{3.11.27}$$
$$\bar R_2 = \frac{1 - \sqrt{1 - \bar R^2}}{1 - \sqrt{1 - \bar R^2}\,[1 - (1/(2T^2))]}, \tag{3.11.28}$$
where $T = \int\varphi[\Phi(t)]t\,d\Phi(t)$, $\Phi$ is the standard normal distribution function, and $\bar R^2$ is the traditional coefficient of multiple determination given by (3.11.11).


Proof: Note that $\sigma_y^2 = \sigma_e^2 + \beta'\Sigma\beta$ and $E(Y) = \alpha + \beta'E[x]$. Further, the distribution function of $Y$ is $G(y) = \Phi((y - \alpha - \beta'E(x))/\sigma_y)$, where $\Phi$ is the standard normal distribution function. Then
$$D_y = \int_{-\infty}^{\infty}\varphi[\Phi(y/\sigma_y)]\,y\,d\Phi(y/\sigma_y) = \sigma_y T. \tag{3.11.29, 3.11.30}$$
Similarly, $D_e = \sigma_e T$. Hence,
$$RD = (\sigma_y - \sigma_e)T. \tag{3.11.31}$$

By the definition of $\bar R^2$, we have $\bar R^2 = 1 - \frac{\sigma_e^2}{\sigma_y^2}$. This leads to the relationship
$$1 - \sqrt{1 - \bar R^2} = \frac{\sigma_y - \sigma_e}{\sigma_y}. \tag{3.11.32}$$
The result (3.11.27) follows from the expressions (3.11.31) and (3.11.32).

The result ( 3.11.27) follows from the expressions ( 3.11.31) and ( 3.11.32).For the result ( 3.11.28), by the assumptions on the distribution of (x, Y ), the distribution

of e is N(0, σ2e); i.e., f(x) = (2πσ2

e)−1/2 exp −x2/(2σ2

e) and F (x) = Φ(x/σe). It follows thatf ′(x)/f(x) = −σ−2

e x, which leads to

−f′(F−1(u))

f ′(F (u))=

1

σeΦ−1(u) .

Hence,

τ−1ϕ =

∫ 1

0

ϕ(u)

1

σeΦ−1(u)

du

=1

σe

∫ 1

0

ϕ(u)Φ−1(u) du .

Upon making the substitution u = Φ(t), we obtain the relationship T = σe/τϕ. Using this,the result ( 3.11.31), and the definition of R2, ( 3.11.11), we get

R2 =

σy−σe

σy

σy−σe

σy+ σe

σy

12T 2

.

The result for R2 follows from this and ( 3.11.32).Note that T is free of all parameters. It can be shown directly that the Ris are one-to-one

increasing functions of R2; see Exercise 3.16.35. Hence, for the multivariate normal model

the parameters R2, R1, and R2 are equivalent.

Although the CMDs are equivalent for the normal model, they measure dependencebetween x and Y on different scales. We can use the above relationships derived in thelast theorem to have these coefficients measure the same quantity at the normal model by

Page 239: Robust Nonparametric Statistical Methods

3.11. CORRELATION MODEL 229

simply solving for R2

in terms of R1 and R2 in ( 3.11.27) and ( 3.11.28) respectively. Theseparameters will be useful later so we will call them R

∗1 and R

∗2 respectively. Hence solving

as indicated we get

R∗21 = 1 − (1 − R1)

2 (3.11.33)

R∗22 = 1 −

[1 − R2

1 −R2(1 − (1/(2T 2)))

]2

. (3.11.34)

Again, at the multivariate normal model we have R2

= R∗21 = R

∗22 .

For Wilcoxon scores and sign scores the reader is ask to show in Exercise 3.16.36 that(1/(2T 2)) = π/6 and (1/(2T 2)) = π/4, respectively.

Example 3.11.4. A Contaminated Normal Model.

As an illustration of these population coefficients of multiple determination, we evaluate themfor the situation where the random error e has a contaminated normal distribution withproportion of contamination ǫ and the ratio of contaminated variance to uncontaminatedσ2c , the random variable x has a univariate normal N(0, 1) distribution, and the parameterβ = 1. So β′Σβ = 1. Without loss of generality, we took α = 0 in ( 3.11.1). Hence Y andx are dependent. We consider the CMDs based on the Wilcoxon score function only.

The density of Y = x+ e is given by,

g(y) =1 − ǫ√

(y√2

)+

ǫ√1 + σ2

c

φ

(y√

1 + σ2c

).

This leads to the expressions,

Dy =

√12√2π

2−1/2(1 − ǫ)2

√2 + 2−1/2ǫ2

√1 + σ2

c + ǫ(1 − ǫ)[3 + σ2c ]

1/2

De =

√12√2π

2−1/2(1 − ǫ)2 + 2−1/2ǫ2σc + ǫ(1 − ǫ)

√1 + σ2

c

τϕ =

[√12√2π

(1 − ǫ)2

√2

+ǫ2

σc√

2+

2ǫ(1 − ǫ)√σ2c + 1

]−1

;

see Exercise 3.16.37. Based on these quantities the coefficients of multiple determination

R2, R1 and R2 can be readily formulated.Table 3.11.1 displays these parameters for several values of ǫ and for σ2

c = 9 and 100. Forease of interpretation we rescaled the robust CMDs as discussed above. Thus at the normal

(ǫ = 0) we have R∗21 = R

∗22 = R

2with the common value of .5 in these situations. Certainly

as either ǫ or σc change, the amount of dependence between Y and x changes; hence all

Page 240: Robust Nonparametric Statistical Methods

230 CHAPTER 3. LINEAR MODELS

Table 3.11.1: Coefficients of Multiple Determination under Contaminated Errors (e).e ∼ CN(ǫ, σ2

c = 9) e ∼ CN(ǫ, σ2c = 100)

ǫ ǫCMD .00 .01 .02 .05 .10 .15 .00 .01 .02 .05 .10 .15

R2

.50 .48 .46 .42 .36 .31 .50 .33 .25 .14 .08 .06

R∗1 .50 .50 .48 .45 .41 .38 .50 .47 .42 .34 .26 .19

R∗2 .50 .50 .49 .47 .44 .42 .50 .49 .47 .45 .40 .36

the coefficients change somewhat. However, R2 decays as the percentage of contaminationincreases, and the decay is rapid in the case σ2

c = 100. This is true also, to a lesser degree,for R

∗1 which is predictable since its denominator has unbounded influence in the Y -space.

The coefficient R∗2 shows stability with the increase in contamination. For instance when

σ2c = 100, R2 decays .44 units while R

∗2 decays only .14 units. See Witt et al. (1995) for more

discussion on this example.

Ghosh and Sen (1971) proposed the mixed rank test statistic to test the hypothesis ofindependence ( 3.11.3). It is essentially the gradient test of the hypothesis H0 : β = 0.As we showed in Section 3.6, this test statistic is asymptotically equivalent to Fϕ. Ghoshand Sen (1971), also, proposed a pure rank statistic in which both variables are ranked andscored.

3.11.5 Coefficients of Determination for Regression

We have mainly been concerned with coefficients of multiple determination as measuresof dependence between the random variables Y and x. In the regression setting, though,the statistic R2 is one of the most widely used statistics, not in the sense of estimatingdependence but in the sense of comparing models. As the proportion of variance accountedfor, R2 is intuitively appealing. Likewise R1, the proportion of dispersion accounted for inthe fit, is an intuitive statistic. But neither of these statistics are robust. The statistic R2

though is robust and is directly linked (a one-to-one function) to the robust test statisticFϕ. Furthermore it lies between 0 and 1, having the values 1 for a perfect fit and 0 for acomplete lack of fit. These properties make R2 an attractive coefficient of determination forregression as the following example illustrates.

Example 3.11.5. Hald Data

This data consists of 13 observations and 4 predictors. It can be found in Hald (1952)but it is also discussed in Draper and Smith (1966) where it serves to illustrate a method ofpredictor subset selection based on R2. The data are given in Table 3.11.2. The responseis the heat evolved in calories per gram of cement. The predictors are the percent in weight

Page 241: Robust Nonparametric Statistical Methods

3.11. CORRELATION MODEL 231

Table 3.11.2: Hald Data used in Example 3.11.5x1 x2 x3 x4 Response7 26 6 60 78.51 29 15 52 74.3

11 56 8 20 104.311 31 8 47 87.67 52 6 33 95.9

11 55 9 22 109.23 71 17 6 102.71 31 22 44 72.52 54 18 22 93.1

21 47 4 26 115.91 40 23 34 83.8

11 66 9 12 113.310 68 8 12 109.4

Table 3.11.3: Coefficients of Multiple Determination on Hald DataSubset of Original Data Changed DataPredictors R2 R1 R2 R2 R1 R2

x1, x2 .98 .86 .92 .57 .55 .92x1, x3 .55 .33 .52 .47 .24 .41x1, x4 .97 .84 .90 .52 .51 .88x2, x3 .85 .63 .76 .66 .46 .72x2, x4 .68 .46 .62 .34 .27 .57x3, x4 .94 .76 .89 .67 .52 .83

of ingredients used in the cement and are given by:

x1 = amount of tricalcium aluminate

x2 = amount of tricalcium silicate

x3 = amount of tetracalcium alumino ferrite

x4 = amount of dicalcium silicate .

To illustrate the use of the coefficients of determination R1 and R2, suppose we areinterested in the best two variable predictor model based on coefficients of determination.Table 3.11.3 gives the results for two data sets. The first is the original Hald data while inthe second we changed the 11th response observation from 83.8 to 8.8.

Note that on the original data all three coefficients choose the subset x1, x2. For thechanged data, though, the outlier severely affects the LS coefficient R2 and the nonrobustcoefficient R1, but the robust coefficient R2 was much less sensitive to the outlier. It choosesthe same subset x1, x2 as it did with the original data; however, the LS coefficient selects

Page 242: Robust Nonparametric Statistical Methods

232 CHAPTER 3. LINEAR MODELS

the subset x3, x4, two different predictors than its selection for the original data. Thenonrobust coefficient R1 still chooses x1, x2, although, at a relativity much smaller value.

This example illustrates that the coefficient R2 can be used in the selection of predic-tors in a regression problem. This selection could be formalized like the MAXR procedurein SAS. In a similar vein, the stepwise model building criteria based on LS estimation(Draper and Smith, 1966) could easily be robustified by using R-estimates in place of LS-estimates and the robust test statistic Fϕ in place of FLS.

3.12 High Breakdown (HBR) Estimates

By (3.5.17), the influence function of the R-estimate is unbounded in the x-space. Whilein a designed experiment this is of little consequence, for non-designed experiments wherethere are widely dispersed xs, (i.e. outliers in factor space), this is of some concern. Inthis chapter we present R-estimators which have influence functions bounded in both spacesand which can have 50% breakdown. We shall call these estimators high breakdown R(HBR) estimators. Further, we derive diagnostics which differentiate between fits based onthese estimators, R-estimators and LS-estimators. Tableman (1990) provides an alternativedevelopment of bounded influence R-estimates.

3.12.1 Geometry of the HBR-Estimates

Consider the linear model ( 3.2.3). In Chapter 3, estimation and testing are based on thepseudo-norm, (3.2.6). Here we shall consider the function

‖u‖HBR =∑

i<j

bij |ui − uj| , (3.12.1)

where the weights bij are positive and symmetric, i.e., bij = bji. It is then easy to show, seeExercise 3.15.1, that the function ( 3.12.1) is a pseudo-norm. As noted in Section 2.2.2, ifthe weights bij ≡ 1, then this pseudo-norm is proportional to the pseudo-norm based on theWilcoxon scores. Hence we will refer to this as a generalized R (HBR) pseudo-norm.

Since this is a pseudo-norm we can develop estimation and testing procedures using thesame geometry as in the last chapter. Briefly the HBR estimate of β in model ( 3.2.3) is a

vector βHBR such that

βHBR = Argminβ‖Y −Xβ‖HBR. (3.12.2)

Equivalently we can define the dispersion function

DHBR(β) = ‖Y − Xβ‖HBR . (3.12.3)

Since it is based on a pseudo-norm, DHBR is a continuous, nonnegative, convex function ofβ. The negative of its gradient is given by

SHBR(β) =∑

i<j

bij(xi − xj)sgn[(Yi − Yj) − (xi − xj)′β] . (3.12.4)

Page 243: Robust Nonparametric Statistical Methods

3.12. HIGH BREAKDOWN (HBR) ESTIMATES 233

Thus the HBR-estmate solves the equation

SHBR(β).= 0 .

In the next subsection, we discuss the selection of the weights bij .

The HBR-estimates were proposed by Chang, McKean, Naranjo and Sheather (1999).Using the package RBR, these estimates are easily computed, as discussed in the examplesbelow.

3.12.2 Weights

The weight for a point (xi, Yi), i = 1, . . . , n, for the HBR estimates is a function of twocomponents. One component depends on the “distance” of the point xi from the centerof the X-space (factor space) and the other component depends on the size of the residualbased on an initial high breakdown fit. As shown below, these components are used incombination, so the weight due to one component may be offset by the weight of the othercomponent.

First, we consider distance in factor space. It seems reasonable to downweight pointsfar from the center of the data. The leverage values hi = n−1 + x′

ci(X′cXc)

−1xci, for i =1, . . . , n,measure distance (Mahalanobis) from the center relative to the scatter matrix X′

cXc.Leverage values, though, are based on means and the usual (LS) variance-covariance scattermatrix which are not robust estimators. There are several robust estimators of locationand scatter from which to choose, including the high breakdown minimum covariancedeterminant (MCD) which is an ellipsoid that covers about half of the data and yet hasminimum determinant. Although computationally intensive, Rousseeuw and Van Driessen(1999) present a fast computational algorithm for it. Let vc denote the center of the ellipsoid.Letting V denote the MCD, the robust distances are given by

Qi = (xi − vc)′V−1(xi − vc). (3.12.1)

We define the associated weights by wi = min

1, cQi

, where c is usually set at the 95th

percentile of the χ2(p) distribution. Note that “good” points generally have weights 1.The class of GR-estimates proposed by Sievers (1983) use weights of the form bij = wiwj

which depend only on distance in factor space. These estimates have positive breakdownand bounded influence in factor space, but as Exercise 3.16.41 shows they are always lessefficient than the Wilcoxon estimates, unless all weights are 1. Further, at times, the loss inefficiency can be severe; see Chang et al. (1999) for discussion. One reason is that “good”points of high leverage (points that follow the model) are down weighted by the same amountas points at the same distance from the center of factor space but which do not follow themodel (“bad” points of high leverage).

The weights for the HBR estimates are a function of the GR weights and residual infor-mation from the Y -space. The residuals are based on a high breakdown initial estimate of

Page 244: Robust Nonparametric Statistical Methods

234 CHAPTER 3. LINEAR MODELS

the regression coefficients. We have chosen to use the least trim squares (LTS) estimatewhich is given by

Argmin

h∑

i=1

[Y − α− x′β]2(i), (3.12.2)

where h = [n/2] + 1 and where the notation (i) denotes the ith ordered absolute residual;see Rousseeuw and Van Driessen (1999). Let e0 denote the residuals from this initial fit.

Define the function ψ(t) by ψ(t) = 1, t, or − 1 according as t ≥ 1, −1 < t < 1, or t ≤−1. Let σ be estimated by the initial scaling estimate MAD = 1.483 medi|e(0)i −medje(0)j | .Recall the robust distances Qi, defined in expression (??). Let

mi = ψ

(b

Qi

)= min

1,

b

Qi

,

and consider the weights

bij = min

1,cσ

|ei|σ

|ej|min

1,

b

Qi

min

1,

b

Qj

, (3.12.3)

where the tuning constants b and c are both set at 4. From this point-of-view, it is clear thatthese weights downweight both outlying points in factor space and outlying responses. Notethat the initial residual information is a multiplicative factor in the weight function. Hence,a good leverage point will generally have a small (in absolute value) initial residual whichwill offset its distance in factor space. The following example will illustrate the differencesamong the Wilcoxon, GR and HBR estimates.

Example 3.12.1 (Stars Data). This data set is drawn from an astronomy study on the starcluster CYG OB1 which contains 47 stars; see Rousseeuw and Leroy (1987) for a discussionon the history of the data set. The response is the logarithm of the light intensity of the starwhile the independent variable is the logarithm of the temperature of the star. The data aretabled in Table 3.12.1 and are shown in Panel A of Figure 3.12.1. Note that four of thestars, called giants, form a cluster of outliers in factor space while the rest of the stars fall ina point cloud. Panel A shows also the overlay plot of the LS and Wilcoxon fits. Note thatthe cluster of four outliers in the x space have exerted such a strong influence on the fits thatthey have drawn the LS and Wilcoxon fits towards the cluster. This behavior is predictablebased on the influence functions of these estimates. These four giant cases have very largerobust distances from the center of the data. Hence the weights used by the GR estimatesseverely downweight these points, resulting in its fit through the point cloud. For this dataset, the initial LTS fit ignores the four giant stars and fits the point cloud. Hence, the fourgiant stars are “bad” leverage points and, hence, are downweighted for the HBR fit, also.

The RBR command to compute the GR or HBR estimates is the same as for the Wilcoxonestimates, wwest, except that the argument for the weight indicator bij is set at bij="GR"or bij="HBR", respectively. For example, suppose the design matrix without the intercept

Page 245: Robust Nonparametric Statistical Methods

3.12. HIGH BREAKDOWN (HBR) ESTIMATES 235

column is in the variable xmat and the response vector is in the variable y. Then the followingR commands place the LS, Wilcoxon, GR and HBR estimates as respective columns in thematrix ests.

ls.fit = lm(y~x)

wil.fit = wwest(x,y,bij="WIL",print.tbl=F)

gr.fit = wwest(x,y,bij="GR",print.tbl=F)

hbr.fit = wwest(x,y,bij="HBR",print.tbl=F)

est = cbind(ls.fit$coef,wil.fit$tmp1$coef,gr.fit$tmp1$coef,hbr.fit$tmp1$coef)

Example 3.12.2 (Stars Data, Continued). Suppose in the last example that we had nosubject matter available. Then based on the scatterplot, we may decide to fit a quadraticmodel. The plots of the LS, Wilcoxon GR, and HBR fits for the quadratic model are foundin Panel B of Figure ??. The quadratic fits based on the LS, Wilcoxon and HBR estimatesfollow the curvature in the data, while the GR fit misses the curvature resulting in a verypoor fit. For quadratic model, the cluster of four giant stars are “good” data points and theHBR weights take this into account. The weights used for the GR fit, however, ignore thisresidual information and severely downweight the four giant star cases, resulting in the poorfit as shown in the figure.

The last two plots in the figure, Panels C and D, are the residual plots for the GR andHBR fits. Based on their fits, the LS and Wilcoxon residual plots are the same as the HBR.The pattern in the GR residual plot (Panel C), while not random, does not indicate how toproceed with model selection. This is often true for residual plots based on high breakdownfits; see McKean et al. (1993).

3.12.3 Asymptotic Normality of βHBR

The asymptotic normality of the HBR estimates was developed by Chang (1995) and Changet al. (1999). Much of our development is in Appendix A.6 which is taken from the articleby Chang et al. (1999). Our discussion is for general weights under assumptions that we will

specify as we proceed. In order to establish asymptotic normality of βHBR, we need somefurther notation and assumptions. Define the parameters

γij = B′ij(0)/Eβ(bij) , for 1 ≤ i, j ≤ n , (3.12.4)

whereBij(t) = Eβ[bijI(0 < yi − yj < t)] . (3.12.5)

Consider the symmetric n× n matrix An = [aij ] defined by

aij =

−γijbij if i 6= j∑

k 6=i γikbik if i = j. (3.12.6)

Page 246: Robust Nonparametric Statistical Methods

236 CHAPTER 3. LINEAR MODELS

Table 3.12.1: Stars DataLog Log Log Log

Star Temp Intensity Star Temp Intensity1 4.37 5.23 25 4.38 5.022 4.56 5.74 26 4.42 4.663 4.26 4.93 27 4.29 4.664 4.56 5.74 28 4.38 4.905 4.30 5.19 29 4.22 4.396 4.46 5.46 30 3.48 6.057 3.84 4.65 31 4.38 4.428 4.57 5.27 32 4.56 5.109 4.26 5.57 33 4.45 5.22

10 4.37 5.12 34 3.49 6.2911 3.49 5.73 35 4.23 4.3412 4.43 5.45 36 4.62 5.6213 4.48 5.42 37 4.53 5.1014 4.01 4.05 38 4.45 5.2215 4.29 4.26 39 4.53 5.1816 4.42 4.58 40 4.43 5.5717 4.23 3.94 41 4.38 4.6218 4.42 4.18 42 4.45 5.0619 4.23 4.18 43 4.50 5.3420 3.49 5.89 44 4.45 5.3421 4.29 4.38 45 4.55 5.5422 4.29 4.22 46 4.45 4.9823 4.42 4.42 47 4.42 4.5024 4.49 4.85

Page 247: Robust Nonparametric Statistical Methods

3.12. HIGH BREAKDOWN (HBR) ESTIMATES 237

Define the p× p matrix Cn asCn = X′AnX . (3.12.7)

Since the rows and columns of An sum to zero, it can be shown that

Cn =∑

i<j

γijbij(xj − xi)(xj − xi)′ ; (3.12.8)

see Exercise 3.15.21. Let

Ui = (1/n)

n∑

j=1

(xj − xi) E(bijsgn(yj − yi)|yi) . (3.12.9)

Besides assumptions (E.1), ( 3.4.1), (D.2), ( 3.4.7), and (D.3), ( 3.4.8) of Chapter 3, weneed to assume additionally that,

(H.1) There exists a matrix CH such that n−2Cn = n−2X′AnXP→ CH . (3.12.10)

(H.2) There exists a p× p matrix ΣH , (1/n)∑n

i=1 Var(Ui) → ΣH . (3.12.11)

(H.3)√n(β

(0) − β)D−→ N(0,Ξ) where β

(0)is the initial estimator and Ξ

is a positive definite matrix. (3.12.12)

(H.4) The weight function bij = g(xi,xj , yi, yj, β

(0))≡ gij

(0))

is

continuous and the gradient gij is bounded uniformly in i and j.(3.12.13)

For the correlation model, an explicit expression can be given for the matrix CH assumedin (H.1); see ( 3.12.22) and, also, Lemma 3.12.1.

As our theory will show, the HBR-estimate attains 50% breakdown (Section 3.12.3) andasymptotic normality, at rate

√n, provided the initial estimate of regression estimates have

these qualities. One such estimate is the least trimmed squares, LTS, which is given byexpression (3.12.2). Another class of such estimates are the rank-based estimates proposedby Hossjer (1994); see also Croux, Rousseeuw and Hossjer (1994).

The development of the theory for βHBR proceeds similar to that of the R estimates.The theory is sketched in the appendix, Section A.6, and here we present only the two mainresults: the asymptotic distribution of the gradient and the asymptotic distribution of theestimate.

Theorem 3.12.1. Under assumptions (E.1), ( 3.4.1), and (H.1) - (H.4), ( 3.12.10) -( 3.12.13),

n−3/2SHBR(0)D→ N(0, ΣH).

The proof of this theorem proceeds along the same lines of the theory that was used toobtain the null distribution of the gradients of the R estimates. The projection of SHBR(0)is first determined and its asymptotic distribution is established as N(0, ΣH). The resultfollows then upon showing that difference between SHBR(0) and its projection goes to zeroin second mean; see Theorem A.6.4 for details. The following theorem gives the asymptoticdistribution of βHBR.

Page 248: Robust Nonparametric Statistical Methods

238 CHAPTER 3. LINEAR MODELS

Theorem 3.12.2. Under assumptions (E.1), ( 3.4.1), and (H.1) - (H.4), ( 3.12.10) -( 3.12.13), √

n(βHBR − β)D−→ N( 0, (1/4)C−1

H ΣHC−1H ).

The proof of this theorem is similar to that of the R estimates. First asymptotic linearityand quadraticity are established. These results are then combined with Theorem 3.12.1 toyield the result; see Theorem A.6.1 of the Appendix for details.

The following lemma derives another representation of the limiting matrix CH , which willprove useful in the derivation of the influence function of βHBR found in the next section andSection 3.12.6 which concerns the implementation of these high breakdown estimates. Forwhat follows, assume without loss of generality that the true parameter value β = 0. Let

gij

(0))≡ b

(xi,xj , yi, yj, β

(0))

denote the weights as a function of the initial estimator.

Let gij(0) ≡ b(xi,xj, yi, yj) denote the weight function evaluated at the true value β = 0.The following result is proved in Lemma A.6.1 of the Appendix:

B′ij(t) =

∫ ∞

−∞· · ·∫ ∞

−∞b(xi, xj , yj + t, yj, β

(0))f(yj + t) f(yj)

k 6=i,jf(yk) dy1 · · · dyn .

(3.12.14)It is further shown that B′

ij(t) is continuous in t. The representation we want is:

Lemma 3.12.1. Under assumptions (E.1), ( 3.4.1), and (H.1) - (H.4), ( 3.12.10) - ( 3.12.13),

E

[1

2

1

n2

n∑

i=1

n∑

j=1

∫ ∞

−∞b(xi,xj, yi, yj)f

2(yj) dyj(xj − xi)(xj − xi)′

]−→ CH . (3.12.15)

Proof: By ( 3.12.4), ( 3.12.5), ( 3.12.8), and ( 3.12.14),

E

[1

n2Cn

]=

1

2

1

n2

n∑

i=1

n∑

j=1

B′ij(0)(xj − xi)(xj − xi)

′ . (3.12.16)

Because B′ij(0) is uniformly bounded over all i and j, and the matrix (1/n2)

∑i

∑j(xj −

xi)(xj−xi)′ converges to a positive definite matrix, the right side of ( 3.12.16) also converges.

By Lemmas A.6.1 and A.6.3 of the Appendix, we have

B′ij(0) =

∫b(xi,xj, yj, yj)f

2(yj) dyj + o(1) (3.12.17)

where the remainder term is uniformly small over all i and j. Under Assumption (H.1),( 3.12.10), the result follows.

Remark 3.12.1 (Empirical Efficiency). As noted above, there is always a loss of efficiency ofthe GR estimator relative to the Wilcoxon estimator. It was hoped that the HBR estimatorwould regain some this efficiency. This was confirmed in a Monte Carlo study which is

Page 249: Robust Nonparametric Statistical Methods

3.12. HIGH BREAKDOWN (HBR) ESTIMATES 239

discussed in Section 8 of the article by Chang et al. (1999). In this study, over a series ofdesigns, which included contamination in both responses and factor space, in all but two ofthe situations, the empirical efficiency of the HBR estimate relative to the Wilcoxon estimatewas always larger than that of the GR estimate relative to the Wilcoxon estimate

Remark 3.12.2 (Stability Study). To obtain its full 50% breakdown, the HBR estimatesrequire initial estimates with 50% breakdown. It is known that slight changes to centrallylocated data can cause some high breakdown estimates to change by a large amount. Thiswas discussed for the high breakdown least median squares (LMS) estimates by Hettman-sperger and Sheather (1992, 1993) and later confimed in a Monte Carlo study by Sheather,McKean and Hettmansperger (1997). Part of the article by Chang et al. (1999) consistedof a stability study for the HBR estimator using LMS and LTS starting values. Over thesituations investigated, the HBR estimates were much more stable than either the LTS orLMS estimates but were less stable than the Wilcoxon estimates.

3.12.4 Robustness Prperties of the HBR Estimates

In this section we show that the HBR estimate can attain 50% breakdown and derive itsinfluence function. We show that its influence function is bounded in both the x and theY -spaces. The argument for breakdown is taken from Chang (1995) while the influencefunction derivation is taken from Chang et al. (1999).

Breakdown of the HBR Estimate

Let Z = zi = (xi, yi), i = 1, . . . , n denote the sample of data points and ‖ · ‖ theEuclidean norm. Define the breakdown point of the estimator at sample Z as

ǫ∗n(β, Z) = max

m

n; sup

Z′‖β(Z′) − β(Z)‖ <∞

,

where the supremum is taken over all samples Z′ that can result from replacing m observa-tions in Z by arbitrary values. See, also, Defintion 1.6.1.

We now state conditions under which the HBR estimate remains bounded.

Lemma 3.12.2. Suppose there exist finite constants M1 > 0 and M2 > 0 such that thefollowing conditions hold:

(B1) inf‖β‖=1supijbij(xj − xi)

′β = M1.

(B2) supijbij |yj − yi| = M2.

Then

‖βHBR‖ <1

M1

[1 + 2

(n

2

)]M2.

Page 250: Robust Nonparametric Statistical Methods

240 CHAPTER 3. LINEAR MODELS

Proof: Note that,

DHBR(β) ≥ supij

bij|yj − yi − (xj − xi)′β| ≥ ‖β‖M1 −M2

≥ 2

(n

2

)M2

whenever ‖β‖ ≥ 1M1

[1 + 2

(n2

)]M2. Since DHBR(0) =

∑∑i<j bij |yj − yi| ≤

(n2

)M2 and

DHBR is a convex function of β, it follows that βHBR = ArgminDHBR(β) satisfies

‖βHBR‖ <1

M1

[1 + 2

(n

2

)]M2.

The lemma follows.

For our result, we need to further assume that the data points Z are in general position;that is, any subset of p + 1 of these points determines a unique solution β. In particular,this implies that neither all of the xis are the same nor are all of the yis are the same; hence,provided the weights have not broken down, this implies that both constants M1 and M2 ofLemma 3.12.2 are positive.

Theorem 3.12.3. Assume that the data points Z are in general position. Let v, V and

β(0)

denote the initial estimates of location, scatter and β. Let ǫ∗n(v, Z), ǫ∗n(V, Z) and

ǫ∗n(β(0), Z) denote their corresponding breakdown points. Then breakdown point of the HBR

estimator is

ǫ∗n(βHBR, Z) = minǫ∗n(v, Z), ǫ∗n(V, Z), ǫ∗n(β(0), Z), 1/2 . (3.12.18)

Proof: Corrupt m points in the data set Z and let Z′ be the sample consisting of these corruptpoints and the remaining n−m points. Assume that Z′ is in general position. Assume that

v(Z′), V(Z′) and β(0)

(Z′) have not broken down. Then the constants M1 and M2 of Lemma

3.12.2 are positive and finite. Hence, by Lemma 3.12.2, ‖βHBR(Z′)‖ <∞ and the theoremfollows.

Based on this last result, the HBR-estimate has 50% breakdown provided the initial

estimates v, V and β(0)

all have 50% breakdown. Assuming that the data points are ingeneral position, the MCD estimates of location and scatter as discussed near expression(3.12.1) have 50% breakdown. For initial estimates of the regression coefficients, againassuming that the data points are in general position, the LTS-estimates, ( 3.12.2), have50% breakdown; see, also, Hossjer (1994). The HBR-estimates used in the examples ofSection 3.12.6 employ the MCD estimates of location and scatter and the LTS-estimate ofthe regression coefficients, resulting in the weights defined in (3.12.3).

Page 251: Robust Nonparametric Statistical Methods

3.12. HIGH BREAKDOWN (HBR) ESTIMATES 241

Influence Function of the HBR Estimate

In order to derive the influence function, we start with the gradient equation S(β).= 0,

written as

0.=

1

n2

n∑

i=1

n∑

j=1

bijsgn(zj − zi)(xj − xi).

Note by Lemma A.6.3 of the Appendix, that bij = gij(0) + Op(1/√n) so that the defining

equation may be written as

0.=

1

n2

n∑

i=1

n∑

j=1

gij(0)sgn(zj − zi)(xj − xi) , (3.12.19)

ignoring a remainder term of magnitude Op(1/√n).

Influence functions are derived at the model where both x and y are stochastic; hence,consider the correlation model of Section ??,

y = x′β + e , (3.12.20)

where e has density f , x is a p× 1 random vector with density function m, and e and x areindependent. Let F and M denote the corresponding distribution functions of e and x. LetH and h denote the joint distribution function and density of y and x. It then follows that

h(x, y) = f(y − x′β)m(x) . (3.12.21)

If we rewrite equation ( 3.12.19) using the Stieltjes integral notation of the empiricaldistribution of (xi, yi), for i = 1, . . . , n, we see that the functional β(H) solves the equation

0 =

∫ ∫b(x1,x2, y1, y2)sgny2 − y1 − (x2 − x1)

′β(H)(x2 − x1)dH(x1, y1)dH(x2, y2) .

Let I(a < b) = 1 or 0, depending on whether a < b or a > b. Then using the fact thatthe sign function is odd and the symmetry of the weight function in its x and y argumentswe can write the defining equation of the functional β(H) as

0 =

∫ ∫x1b(x1,x2, y1, y2)

[I(y2 − y1 < (x2 − x1)

′β(H)) − 1

2

]dH(x1, y1)dH(x2, y2) .

Define the matrix CH by

CH =

1

2

∫ ∫ ∫(x2 − x1)b(x1,x2, y1, y1)(x2 − x1)

′f 2(y1) dy1dM(x1)dM(x2)

. (3.12.22)

Note that under the correlation model CH is the assumed limiting matrix of Assumption(H.1), ( 3.12.10); see Lemma 3.12.1.

The next theorem gives the result for the influence function of βHBR. Its proof ig givenin Theorem A.5.1 of the Appendix.

Page 252: Robust Nonparametric Statistical Methods

242 CHAPTER 3. LINEAR MODELS

Theorem 3.12.4. The influence function for the estimate βHBR is given by

Ω(x0, y0, βHBR) = C−1H

1

2

∫ ∫(x0 − x1)b(x1,x0, y1, y0)sgny0 − y1 dF (y1)dM(x1) ,

(3.12.23)where CH is given by expression ( 3.12.22).

In order to show that the influence function correctly identifies the asymptotic distribu-tion of the estimator, define Wi as

Wi =

∫ ∫(xi − x1)b(x1,xi, y1, yi)sgn(yi − y1) dF (y1)dM(x1) . (3.12.24)

Next write Wi in terms of a Stieltjes integral over the empirical distribution of (xj , yj) as

W ∗i =

1

n

n∑

j=1

(xi − xj)b(xj ,xi, yj, yi)sgn(yi − yj) . (3.12.25)

If we can show that (1/√n)∑n

j=1W∗i

d→ N(0, ΣH), then we are done. From the proof ofTheorem A.6.4 in the Appendix, it will suffice to show that

1√n

n∑

i=1

(Ui −W ∗i )

P→ 0 , (3.12.26)

where Ui = (1/n)∑n

j=1(xi−xj)E[bijsgn(yi− yj)|yi] . Writing the left hand side of ( 3.12.26)as

(1/n3/2)n∑

i=1

n∑

j=1

(xi − xj) E [bijsgn(yi − yj)|yi] − gij(0)sgn(yi − yj) ,

where gij(0) ≡ b(xj ,xi, yj, yi), the proof is analogous to the proof of Theorem A.6.4.

3.12.5 Discussion

The influence function, Ω(x0, y0, βHBR), for the HBR estimate is a continuous function of x0

and y0. With a proper choice of a weight function it is bounded in both the x and Y spaces.This is true for the weights given by (3.12.3); furthermore, for these weights Ω(x0, y0, βHBR)goes to zero as x0 and y0 get large in any direction.

The influence function Ω(x0, y0, β) is a generalization of the influence functions for theWilcoxon estimates; see Exercise 3.15.22. Panel A of Figure 3.12.2 shows the influencefunction of the HBR estimate for the special case where (x, Y ) has a bivariate normal dis-tribution with mean 0 and the identity matrix as the variance-covariance matrix. For thisplot we used the weights given by ( ??) where mi = ψ(b/x2

i ) with the constants b = c = 4.For comparison purposes, the influence functions of the Wilcoxon and GR estimates are alsoshown in Panels B and C of Figure 3.12.2. The Wilcoxon influence function is bounded

Page 253: Robust Nonparametric Statistical Methods

3.12. HIGH BREAKDOWN (HBR) ESTIMATES 243

in the Y space but is unbounded in the x space while the GR estimate is bounded in bothspaces. Note because the weights of the GR estimate do not depend on Y , it does not taperto 0 as y0 → ∞, as the influence function of the HBR estimate does. For all three plots,we used the method of Monte Carlo, (10000 simulations for each of 1600 grid points), toperform the numerical integration. The plot of the Wilcoxon influence function is an easilyverifiable check on the Monte Carlo because of its closed form, ( 3.5.17).

High breakdown estimates can have unbounded influence functions. Such estimates canhave instability problems as discussed in Sheather, McKean and Hettmansperger (1999) forthe LMS estimate which has unbounded influence in the x space at the quartiles of Y . Thegeneralized S estimators discussed in Croux et al. (1994) also have unbounded influencefunctions in the x space at particular values of Y . In contrast, the influence function of theHBR estimate is bounded everywhere. This helps to explain its more stable behavior thanthe LMS in the stability study discussed in Chang et al. (1997).

3.12.6 Implementation and Examples

In this section, we discuss how to estimate the standard errors of the HBR estimates andhow to properly standardize the residuals. We then consider two examples.

Standard Errors and Studentized Residuals

Using the asymptotic distribution of the HBR estimate as a guideline and upon substitutingthe estimated weights for the true weights we can estimate the asymptotic standard errorsfor these estimates. The asymptotic variance-covariance matrix of βHBR is a function of thetwo matrices ΣH and CH , given in ( 3.12.11) and ( 3.12.10), respectively. The matrix ΣH

is the variance-covariance matrix of the random vector Ui, ( 3.12.9). We can approximateUi by the expression,

Ui =1

n

n∑

j=1

(xj − xi)bij(1 − 2Fn(ei)) , (3.12.1)

where bij are the estimated weights, ei are the HBR residuals and Fn is the empirical distri-bution function of the residuals. Our estimate of ΣH is then the sample variance-covariancematrix of U1, . . . , Un, i.e.,

ΣH =1

n− 1

n∑

i=1

(Ui − U

)(Ui − U

)′. (3.12.2)

For the matrix CH , consider the results in Lemma 3.12.1. Upon substituting the esti-mated weights for the weights, expression ( 3.12.17) simplifies to

B′ij(0)

.= bij

∫f 2(t) dt = bij

1√12τW

, (3.12.3)

Page 254: Robust Nonparametric Statistical Methods

244 CHAPTER 3. LINEAR MODELS

where τW is the scale parameter (3.4.4) for the Wilcoxon score function; i.e., τW = [√

12∫f 2(t) dt]−1.

To estimate τW , we will use the estimator τW given in expression ( 3.7.8). Now approximatingbij in Cn using ( 3.12.3) leads to the estimate

n−2Cn =1

4√

3(τWn)−2

n∑

i=1

n∑

j=1

bij(xj − xi)(xj − xi)′ . (3.12.4)

Similar to the R- and GR-estimates, we estimate the intercept by

αHBR = med1≤i≤nyi − x′iβHBR . (3.12.5)

Because√nβHBR is bounded in probability and X is centered, it follows, using an argument

very similar to the corresponding result for the R estimates (see McKean et al., 1990), that

the joint asymptotic distribution of α and βHBR is given by

√n

βHBR

] [αβ

]D→ N

([00

],

[τ 2S 0′

0 (1/4)C−1ΣC−1

]), (3.12.6)

where τS is defined by (3.4.6); see the discussion around expression expression (1.5.28) forestimation of this scale parameter.

3.12.7 Studentized Residuals

An important use of robust residual is in detection of outliers. This is easiest done whenthe residuals are correctly studentized by an estimate of their standard deviation. Let βHBR

and αHBR be the estimates of the last section for β and α, respectively. Denote the residualfor the ith case by

e∗i = yi − α− x′iβHBR , (3.12.7)

and the vector of residuals by e∗. Using ( 3.12.6), a first-order approximation of the standarddeviation of the residuals, e∗i , can be obtained in the same way as the derivation for Studen-tized residuals for regular R estimates; see the development proposed for robust estimates,in general, by McKean, Sheather, and Hettmansperger (1990, 1993).

As briefly outlined in the Appendix, this development for the HBR residuals results inthe first order approximation given by

Var(e∗).= σ2I + τ 2

SH1 +1

4X(X′A∗X)−1Σ(X′A∗X)−1X′ − 2τSκ1H1

−√

12τκ2

A∗X(X′A∗X)−1X + X(X′A∗X)−1X′A∗ , (3.12.8)

where σ2 is the variance of ei, κ1 = E[|ei|], κ2 = E[ei(2F (ei) − 1)], H1 = n−111′, and A∗ isdefined above expression ( A.6.17) of the Appendix.

We recommend estimating σ2 by MAD = 1.483med|e|∗i ; κ1 by

κ1 =1

n

n∑

i=1

|e∗i | ; (3.12.9)

Page 255: Robust Nonparametric Statistical Methods

3.12. HIGH BREAKDOWN (HBR) ESTIMATES 245

and κ2 by

κ2 =1

n

n∑

i=1

(R(e∗i )

n + 1− 1

2

)e∗i , (3.12.10)

which is a consistent estimate of κ2; see McKean et al. (1990). Replacing a∗ij by aij , ( ??),

yields an estimate of the matrix A∗. Estimation of Σ was discussed in Section ??. Let Vdenote the estimate of Var(e∗).

Let σ2bei

denote the the ith diagonal entry of V. Define the Studentized residuals by

e∗i =eiσ

bei

. (3.12.11)

As in LS, these standard errors correct for both the underlying variance of the errorsand location. For flagging outliers, appropriate benchmarks for these residuals are ±2; seeMcKean et al. (1990, 1993) for discussion.

3.12.8 Examples

Similar to the discussion in Section ?? on the difference between GR estimates and Restimates, high breakdown estimates and highly efficient estimates often give conflictingresults. High breakdown estimates are less sensitive to outliers and clusters of outliers inthe x-space; hence, for data sets where this is a problem high breakdown estimates oftengive better fits than highly efficient fits. On the other hand, similar to the GR estimates,the HBR estimates are hampered in fitting and detecting curvature while this is not true ofthe highly efficient estimates. We choose two examples which illustrate these disagreementsbetween the high breakdown HBR fit and the highly efficient Wilcoxon fit.

To obtain the HBR estimates we used the weights given by ( ??). As initial estimatesof location and scatter we chose the MVE estimates discussed in Section ??. They werecomputed by the algorithm proposed by Hadi and Simonoff (1993). For initial estimates ofthe regression coefficients we chose the LMS estimates given by,

(α(0)

β(0)

)= Argmin med1≤i≤nyi − α− x′

iβ2 ; (3.12.12)

see Rousseeuw and Leroy (1987). These estimates have 50% breakdown. We computed themwith the algorithm written by Stromberg (1993). The Mallows tuning constant b was set atthe 95th percentile of χ2(p). The tuning constant c was set at [Medai+3 MADai]2 where

ai = ei(β0)/(σ mi) and σ = MAD, ( ??). Once the weights are computed, a Gauss-Newtontype algorithm similar to that used by RGLM (see Kapenga et al., 1988) is used to actuallyobtain the HBR estimates.

Example 3.12.3. Stars Data, Example 3.12.1 continued

Page 256: Robust Nonparametric Statistical Methods

246 CHAPTER 3. LINEAR MODELS

Table 3.12.1: Estimates of Coefficients for the Quadratic DataQuadratic Data

Fit Intercept Linear QuadraticWilcoxon -.665 5.95 -.652HBR .422 4.64 -.375LMS 1.12 3.65 -.141

In Example 3.12.1 we compared the GR and Wilcoxon fits for a data set consisting ofmeasurements of stars. Recall that there is a cluster of outliers in the x-space which greatlyinfluences the Wilcoxon fit but has little effect on the GR fit. The HBR estimates of theintercept and slope are −6.91 and 2.71, respectively, which are quite close to the GR estimatesas displayed in Table ??. Hence similar to the GR estimates, the outlying cluster of giantstars has little effect on the HBR fit.

Example 3.12.4. Quadratic Data

In order to demonstrate the problems that the high breakdown estimates have in fittingcurvature, we simulated data from the following quadratic model:

Yi = 5.5|xi| − .6x2i + ei , (3.12.13)

where the ei’s were simulated iid N(0, 1) variates and the xi’s were simulated contaminatednormal variates with the contamination proportion set at .25 and the ratio of the varianceof the contaminated part to the noncontaminated part set at 16. Panel A of Figure 3.12.1displays a scatter plot of the data overlaid with the Wilcoxon, HBR, and LMS fits. Theestimated coefficients for these fits are in Table 3.12.1. As shown, the Wilcoxon fit is quitegood in fitting the curvature of the data. Its estimates are close to the true values. Onthe other hand, the high breakdown fits are quite poor. The LMS fit missed the curvaturein the data. This is true too for the HBR fit, although, the fit did correct itself somewhatfrom the poor LMS starting values. Panels B and C of Figure 3.12.1 contain the internalstudentized residual plots based on the Wilcoxon and the HBR fits, respectively. Based onthe Wilcoxon residual plot, no further models would be considered. The HBR residual plotshows as outliers the two points which were fitted poorly. It also has a mild linear trend init, which is not helpful since a linear term was fit. This trend is true for the LMS residualplot, (Panel D); although, it gives an overall impression of the lack of a quadratic term inthe model. In such cases in practice, a higher degree polynomial may be fitted, which in thiscase would be incorrect. Difficulties in reading residual plots from high breakdown fits, asencountered here, were discussed in Section ??; see, also, McKean et al. (1993).

Page 257: Robust Nonparametric Statistical Methods

3.13. DIAGNOSTICS FOR DIFFERENTIATING BETWEEN FITS 247

3.13 Diagnostics for Differentiating between Fits

Is the least squares (LS) fit appropriate for the data at hand? How different would a morerobust estimate be from LS? Is a high breakdown estimator necessary, or is a highly efficientrobust estimator sufficient? In this section, we present simple intuitive diagnostics which helpanswer these questions. These measure the difference in fits among LS, highly efficient R,and high breakdown R fits. These diagnostics were developed by McKean et al. (1996a,1999);see, also, McKean and Sheather (2009) for a recent discussion. The package RBR computesthem; see Terpstra and McKean (2005) and McKean, Terpstra and Kloke (2009). We sketchthe development for the diagnostics that differentiate between the LS and R fits first. Also,we focus on Wilcoxon scores, leaving the general scores analog to the exercises. We beginby looking at the difference in LS and Wilcoxon (W) fits, which leads to our diagnostics.

Consider the linear model (3.2.3). The design matix is centered, so the LS estimator

is βLS = (X′X)−1X′Y. Let βW denote the R-estimate, immediately following expression(3.2.7), based on Wilcoxon scores, (ϕ(u) =

√12[u − (1/2)]). We use τ to denote the corre-

sponding scale parameter τϕ. We then have the following theorem:

Theorem 3.13.1. Assume that the random errors ei of Model (3.2.3) have finite varianceσ2. Also, assume that Assumptions (E1), (E2), (D1) and (D2) of Section 3.4 are true. Then

βLS − βW is asymptotically normal with mean 0 and variance-covariance matrix

Var(βLS − βW ) = δ2(X ′X)−1 , (3.13.1)

where δ2 = σ2 + τ 2W − 2κ, τW is the scale parameter (??) for Wilcoxon scores, and κ =√

12τWE[e(F (e) − 1/2)].

The proof of this theorem can be found in the appendix. The parameter δ is positiveunless σ − τW = 0. To see this note that E[e(F (e) − 1/2)] = Cov(e, F (e)) ≥ 0; hence,

κ ≥ 0. Then by the Cauchy-Schwarz inequality, κ ≤ τW

√E(e2) · E(

√12(F (e) − 1/2))2 =

τW√σ2 · 1 = τW σ so that δ2 ≥ (τW − σ)2 ≥ 0.

Outliers have an effect on the LS-estimate of the intercept, αLS, also. Hence this mea-surement of overall difference in the R- and LS-estimates needs to based on the estimatesof α too. The problems to which we apply our diagnostics are often messy. In particular,we do not want to preclude errors with skewed distributions. Hence, as in Section ??, our Restimate of the intercept is the median of the Wilcoxon residuals, i.e., αS = medYi−x′

iβW.This, however, raises a problem since αLS is a consistent estimate of the mean of the errors,µe, while αR is a consistent estimate of the median of the errors, µe. We shall define thedifference in these target values to be

µd = µe − µe . (3.13.2)

One way to avoid a problem here is to assume that the errors have a symmetric distribution,i.e. µd = 0, but this is undesirable in developing diagnostics for exploratory analysis. Instead,

Page 258: Robust Nonparametric Statistical Methods

248 CHAPTER 3. LINEAR MODELS

we consider measures composed of two parts: one part measures the difference in slopeparameters and the other part measures differences in the estimates of intercept. First, thefollowing notation is convenient, here. Let b = (α,β′)′ denote the vector of parameters. Let

bLS and bW denote respectively the LS and Wilcoxon estimates of b.The version of Theorem 3.13.1 that includes intercept is the following corollary. Let

τs = 1/(2f(θ)), where f is the error density and θ is the median of f .

Corollary 3.13.1. Under the assumptions of Theorem 3.13.1, bW − bLS is asymptoticallynormal with mean vector (µd, 0

′)′ and variance-covariance matrix

Var

(αLS − αSβLS − βW

).=

[(δ2s/n) 00 δ2(X ′X)−1

](3.13.3)

where δ2s = σ2 + τ 2

s − 2τsE(e · sgn(e)).

By the Cauchy-Schwarz inequality, E(e · sgn(e)) ≤ σ so that δ2s ≥ (τs − σ)2 ≥ 0. Hence,

the parameter δ2s ≥ 0.

A simple diagnostic to measure the difference between the LS and Wilcoxon fit suggestedby this corollary is given by (bLS − bW )′ A−1

D (bLS − bW ) where AD is the covariance matrix( 3.13.3). This would have an asymptotic χ2 distribution with p + 1 degrees of freedom.Monte Carlo studies, however, showed that it was too liberal. The major problem is thatif the LS and Wilcoxon fits are close, then τW is close to σ, which can lead to a practicalsingularity for the matrix AD; see McKean et al. (1996a, 1999) for discussion. One practicalsolution is to standardize with the asymptotic covariance matrix of the Wilcoxon estimate.This leads to the diagnostic

TDBETAS(LS,W ) = (bLS − bW )′A−1W (bLS − bW ) (3.13.4)

where

AW =

[τ 2s /n 00 τ 2

W (X ′X)−1

], (3.13.5)

where τW is the robust estimator of τW given in expression (3.7.8), for Wilcoxon scores, andτs is the robust estimator of τs discussed around expression (??).

The diagnostic TDBETAS(LS,W ) decomposes into separate intercept and slope terms

TDBETAS(LS,W ) = (n/widehatτ2s )(αLS − αS)

2 + (1/τ 2W )(βLS − βW )′X ′X(βLS − βW )

= (n/τ 2s )(αLS − αS)

2 + (1/wodehatτ2W ) ‖ YLS − YW ‖2 .

= TDINT (LS,W ) + TDBETAS(LS,W )(3.13.6)

Even if the Wilcoxon and LS fits of the slope parameters are essentially the same, TDBETAS(LS,W )can be large because of asymmetry of the errors, i.e. TDINT (LS,W ) is large. Hence, bothparts of the decomposition are useful diagnostics. Below we give a benchmark for largevalues of TDBETAS(LS,W ).

Page 259: Robust Nonparametric Statistical Methods

3.13. DIAGNOSTICS FOR DIFFERENTIATING BETWEEN FITS 249

If TDBETAS(LS,W ) is large then we often want to determine the cases which are themain contributors to this large difference. hence, consider the correspondingly standardizedstatistic for difference in ith fitted value

CFITSi(LS,W ) =YW,i − YLS,i

SE(YW,i), (3.13.7)

where SE(YW,i) = τW [hi − (1/n)], (the expression in brackets is ith leverage of the designmatrix). As with TDBETAS(LS,W ), the standardization of CFITSi(LS,W ) is robust.Note that this standardization is similar to that proposed for the diagnostic RDFFITS byMcKean, Sheather and Hettmansperger (1990).

Note that ( 3.13.7) standardizes by only one fitted value in the numerator (instead of thedifference). Belsley, Kuh, and Welsch (1980) used a similar standardization in assessing thedifference between yLS,i and yLS,(i), the ith deleted fitted value. They suggested a benchmark

of 2√

(p+ 1)/n, and we propose the same benchmark for CFITSi(LS,W ). Having said this,we have found it useful in many cases to ignore the benchmarks and simply look for gapsthat separate large CFITS from small CFITS (see the examples presented below).

Simulations, such as those discussed in McKean et al. (1999), show that standardizationat the Wilcoxon fit is successful and has much better performance than standardization usingthe asymptotic covariance of the difference in fits (Theorem 3.13.1). These simulations wereperformed over a wide range of error and x-variable distributions.

Using the benchmark for CFITS, we can derive an analogous benchmark for TDBETASby replacing τS with τW . We realize that this is may be a crude approximation but weare deriving a benchmark. Let X1 = [1 : X] and denote the projection matrix by H =

X1(X′1X1)

−1X′1. Replacing τS with τW , we have Cov(bR)

.= τ 2

W (X′1X1)

−1 and SE(YW,i).=

τW√hii. Under this approximation, it follows from (??) that an observation is flagged by

the diagnostic CFITSW,i(LS,W ) whenever

|YW,i − YLS,i|τW

√hii

> 2√

(p+ 1)/n (3.13.8)

We use this expression to obtain a benchmark for the diagnostic TDBETAS(LS,W ) asfollows:

TDBETAS(LS,W ) = (bW − bLS)′[τ 2W (X′

1X1)−1]−1(bW − bLS)

= (1/τ 2W )[X1(bR − bGR)]′[X1(bW − bLS)]

= (1/τ 2W )∑

i

(YW,i − YLS,i)2

= (p+ 1)1

n

i

(YW,i − YLS,i

τW√

(p+ 1)/n

)2

.

Since hii has the average value (p+ 1)/n, (3.13.8) suggests flagging TDBETAS(LS,W ) as

Page 260: Robust Nonparametric Statistical Methods

250 CHAPTER 3. LINEAR MODELS

large whenever TDBETAS(LS,W ) > (p+ 1)(2√

(p+ 1)/n)2, or

TDBETAS(LS,W ) >4(p+ 1)2

n. (3.13.9)

We proceed the same way for diagnostics to indicate differences in fits between Wilcoxonand HBR fits and between LS and HBR fits. The asymptotic representation for the HBRestimate of β, (??), can be used to obtain the asymptotic distribution of the differencesbetween these fits. For data sets where the HBR weights are close to one, though, thecovariance matrix of this difference is practically singular, resulting in the diagnostic beingquite liberal; see McKean et al. (1996a, 1999) for discussion. So, as we did for the diagnosticbetween the Wilcoxon and LS fits, we standardize the differences in fits using the asymptoticcovariance matrix of the Wilcoxon estimate; i.e., AW . Hence, the total differences in fits aregiven by

TDBETAS(W,HBR) = (bW − bHBR)′A−1W (bW − bHBR) (3.13.10)

TDBETAS(LS,HBR) = (bLS − bHBR)′A−1W (bLS − bHBR). (3.13.11)

We recommend using the benchmark given by (3.13.9) for these diagnostics, also. Likewisethr diagnostics for casewise differences are given by

CFITSi(W,HBR) =YW,i − YHBR,i

SE(YW,i)(3.13.12)

CFITSi(LS,HBR) =YLS,i − YHBR,i

SE(YW,i). (3.13.13)

They suggested a benchmark of 2√

(p+ 1)/n, and We recommend the same benchmark

(2√

(p+ 1)/n) as discussed above for these diagnostics.

Example 3.13.1 (Bonds Data). Siegel (1997) presented a data set, the Bonds data, whichwe use to illustrate some of these concepts. It was further discussed in Sheather (2009) andMcKean and Sheather (2009). The responses are the bid prices for U.S. treasury bondswhile the dependent variable is the coupon rate (size of the bond’s periodic payment rate (inpercent). The data are shown in Panel A of Figure 3.13.1 overlaid with the LS (solid line) andWilcoxon (broken line) fits. The fits differ dramatically and the diagnostic TDBETA(LS,W)has the value 213.7 which far exceeds the benchmark of 0.457. The three cases yieldingthe largest values for the casewise diagnostic CFITS are cases 4, 13, and 35. Panels B andC display the LS and Wilcoxon Studentized residual plots. As can be seen, the WilcoxonStudentized residual plot highlights Cases 4, 13, and 35, also. Their Studentized residualsexceed 20 and clearly should be label outliers. These are the outlying points on the far leftin the scatterplot of the data. On the other hand, the LS Studentized residual plot showsonly two of them exceeding the benchmark. Further, the bow-tie pattern of the Wilcoxonresidual plot indicates heteroscedasticity of the errors. As discussed in Sheather (2009), thisheteroscedasticity is to be expected because the bonds have different maturity dates.

Page 261: Robust Nonparametric Statistical Methods

3.13. DIAGNOSTICS FOR DIFFERENTIATING BETWEEN FITS 251

Table 3.13.1: Estimates of regression coefficients for the Hawkins data.α (se) β1 (se) β2 (se) β3 (se)

LS -0.387 (0.42) 0.239 (0.26) -0.334 (0.15) 0.383 (0.13)Wilcoxon -0.776 (0.20) 0.169 (0.11) 0.018 (0.07) 0.269 (0.05)HBR -0.155 (0.22) 0.096 (0.12) 0.038 (0.07) -0.046 (0.06)

As further discussed in Sheather (2009), the three outlying cases are of a different type ofbond then the others. The plot in Panel D is the Studentized residuals versus fitted valuesfor the Wilcoxon fit after removing these three cases. Note that are still a few outlying datapoints. The diagnostic TDBETA(LS,Wil) has the value 1.55 which exceeds the benchmarkof 0.50 but the difference is far less than the difference based on the original data.

Next consider the differences between the LS and HBR fits. The leverage values corre-sponding to the three outlying cases exceed the benchmark for leverage points, (the smallestleverage value of these three cases has value 0.152 which exceeds the benchmark of 0.114).The diagnostic TDBETA(LS,HBR) has the value 318.8 which far exceeds the benchmark.As discussed above the Wilcoxon fit is sensitive to outliers in factor space and in this caseTDBETA(Wil,HBR) is 10.5. When the outliers are omitted, the value of this statistic is0.034 which is less than the benchmark.

In this simple regression model, it is obvious that the three outlying cases are on theedge of factor space. As the next example shows, in a multiple regression problem this isgenerally not as apparent. The diagnostics discussed in this section, though, alert the userto potential troublesome points in factor space or response space.

Example 3.13.2 (Hawkins Data). This is an artificial data set proposed by Hawkins, Braduand Kass (1984) involving three independent variables. There are a total of 75 data pointsin the set and the first 14 of them are outlying in factor space. The other 61 points followa linear model. Of the 14 outlying points, the first 10 points do not follow the model whilethe points 11 through 14 do; hence, the first ten cases are referred to as bad points of highleverage while the next 4 cases are referred to as good points of high leverage.

Panels A, B and C of Figure 3.13.2 are the unstandardized residual plots from the LS,Wilcoxon and HBR fits, respectively. Note that the LS and Wilcoxon fits are fooled by thebad outliers. Their fits are drawn by the bad points of high leverage while they both flagthe four good points of high leverage. On the other hand, as Panel C indicates, the HBR fitcorrectly identified the 10 bad points of high leverage and fit well the 4 good points of highlevearge. Table 3.13.1 displays the estimates and their standard errors.

The differences in fits diagnostics were successful for this data set. As displayed on theplot in Panel D, TDBETAS(W,HBR) = 1324 which far exceeds the benchmark value of0.853 and which indicates the the Wilcoxon and HBR fits differ substantially. The plotin Panel D consists the diagnostics CFITSW,i(W,HBR) versus Case i. The 14 largestvalues of CFITSW,i(W,HBR) are for the 14 outlying cases. Recall that the Wilcoxon fitincorrectly fit the 4 good leverage points. So it is reassuring to see that all 14 outlying were

Page 262: Robust Nonparametric Statistical Methods

252 CHAPTER 3. LINEAR MODELS

correctly identified. Also, in a further investigation of this data set, the gap between these14 CFITSW,i(W,HBR) values and the other cases, would lead to one considering a fit basedon the other 61 cases. Assuming that the matrix x1 is the design matrix (not including theintercept, the following is the Rfit code which obtained the fits and the diagnostics:

fit.ls = lm(y~x1)

fit.wl = wwest(x1,y)

fit.hbr = wwest(x1,y,bij="HBR")

fdwilhbr = fitdiag(x1,y,est=c("WIL","HBR"))

fdwilhbr$tdbeta

fdwilhbr$bmtd

cfit =fdwilhbr$cfit

fdwills = fitdiag(x1,y,est=c("WIL","LS"))

fdlshbr = fitdiag(x1,y,est=c("LS","HBR"))

3.14 Rank-Based procedures for Nonlinear Models

In this section, we consider the general nonlinear model:

Yi = fi(θ0) + εi. i = 1, . . . , n , (3.14.1)

where fi are known real valued functions defined on a compact space Θ and ε1, . . . , εnare independent and identically distributed random errors with probability density functionh(t). The asymptotic properties and conditions needed for the numerical stability of the LSestimation procedure were investigated in Jennrich (1969); see, also, Malinvaud (1970) andWu (1981). LS estimation in nonlinear models is a direct extension of its estimation in linearmodels. The same norm (Euclidean) is minimized to obtain the LS estimate of θ0. For therank-based procedures of this section, we simply replace the Euclidian norm by an R norm,as we did in the linear model case. Hence, the geometry is the same as in the linear modelcase.

For the nonlinear model ( 3.14.1), Oberhofer (1982) obtained the weak consistency for Restimates based on the sign scores, i.e., the L1 estimate. Abebe and McKean (2007) obtainedthe consistency and asymptotic normality for the nonlinear R estimates of Model ( 3.14.1)based on the Wilcoxon score function. In this section, we briefly discuss the Wilcoxondevelopment.

For the long form of the model, let Y = (y1, . . . , yn)T and f(θ) = (f1(θ), . . . , fn(θ))T. Given a norm ‖ · ‖ on n-space, a natural estimator of θ is a value θ̂ which minimizes the distance between the response vector Y and f(θ); i.e., θ̂ = argmin_{θ∈Θ} ‖Y − f(θ)‖. If the norm is the Euclidean norm then θ̂ is the LS estimate. Our interest, though, is in the Wilcoxon norm given in expression (??), where the score function is ϕ(u) = √12[u − (1/2)]. Here, we write the norm as ‖ · ‖W, where W denotes the Wilcoxon score function. We define the Wilcoxon estimator of θ0, denoted hereafter by θ̂W,n, as

$\hat\theta_{W,n} = \operatorname{argmin}_{\theta \in \Theta} \| \mathbf{Y} - \mathbf{f}(\theta) \|_{W} .$   (3.14.2)

We assume that fi(θ) is defined and continuous for all θ ∈ Θ, for all i. It then follows that the dispersion function is a continuous function of θ and, hence, since Θ is compact, that the Wilcoxon estimate θ̂W,n exists.

To state the asymptotic properties of the LS and R nonlinear estimates, certain assumptions are needed. These are discussed in detail in Abebe and McKean (2007). We do note the analog of Assumption (??) for the linear model; that is, the sequence of matrices

$n^{-1} \sum_{i=1}^{n} \nabla f_i(\theta_0)\, \nabla f_i(\theta_0)'$

converges to a positive definite matrix Σ(θ0), where ∇fi(θ) is the p × 1 derivative vector of fi(θ) with respect to θ. Under these assumptions, Abebe and McKean (2007) showed that θ̂W,n converges in probability to θ0. They then derived the asymptotic distribution of θ̂W,n. Similar to the derivation in the linear model case of Section ??, this involves a pseudo-linear model.

Consider the local linear model given by

$y_i^* = \mathbf{x}_i^{*T} \theta_0 + \varepsilon_i ,$   (3.14.3)

where, for i = 1, . . . , n,

$y_i^* = y_i^*(\theta_0) \equiv y_i - f_i(\theta_0) + \nabla f_i(\theta_0)^T \theta_0 .$

Note that the probability density function of the errors of Model (3.14.3) is h(t), i.e., the density function of εi. Define the corresponding Wilcoxon dispersion function as

$D_n^*(\theta) \equiv [2n(n+1)]^{-1} \sum_{i<j} | e_i^*(\theta) - e_j^*(\theta) | .$   (3.14.4)

Furthermore, let

$\tilde\theta_n = \operatorname{argmin}_{\theta \in \Theta} D_n^*(\theta) .$   (3.14.5)

It then follows (see Exercise 3.16.42) that

$\sqrt{n}(\tilde\theta_n - \theta_0) \xrightarrow{D} N_p\big(0, \tau^2 \Sigma^{-1}(\theta_0)\big) ,$   (3.14.6)

where τ is as given in (??). Abebe and McKean (2007) show that $\sqrt{n}(\hat\theta_{W,n} - \tilde\theta_n) \to 0$, in probability; hence, we have the asymptotic distribution for the Wilcoxon estimator, which we state in a theorem.


Theorem 3.14.1. Under the assumptions in Abebe and McKean (2007),

$\sqrt{n}(\hat\theta_{W,n} - \theta_0) \xrightarrow{D} N_p\big(0, \tau^2 \Sigma^{-1}(\theta_0)\big) .$   (3.14.7)

Let θ̂LS,n denote the LS estimator of θ0. Under suitable regularity conditions, the asymptotic distribution of the LS estimator is given by

$\sqrt{n}(\hat\theta_{LS,n} - \theta_0) \xrightarrow{D} N_p\big(0, \sigma^2 \Sigma^{-1}(\theta_0)\big) ,$   (3.14.8)

where σ² is the variance of the random error εi. It follows immediately from expressions (3.14.7) and (3.14.8) that, for any component of θ0, the asymptotic relative efficiency (ARE) between the Wilcoxon estimator and the LS estimator of the component is given by the ratio σ²/τ². This, of course, is the ARE between the Wilcoxon and LS estimators in linear models. If the error distribution is normal, then this ratio is the well known number 0.955. Hence, there is only a loss of 5% efficiency if one uses the Wilcoxon estimator instead of the LS estimator when the errors are normally distributed. In contrast, the L1 estimator has an asymptotic relative efficiency of about 64% relative to the LS estimator at the normal, and the ARE between the Wilcoxon and L1 estimators at normal errors is 150%. Hence, as in the linear model case, the Wilcoxon estimator is a highly efficient estimator for nonlinear models. For heavier tailed error distributions the Wilcoxon estimator is generally much more efficient than the LS estimator. A discussion of such results, along with a Monte Carlo verification, for a family of contaminated normal error distributions can be found in Abebe and McKean (2007).

Using the pseudo model and Section 3.5, a useful asymptotic representation of the Wilcoxon estimate is given by

$\sqrt{n}(\hat\theta_{W,n} - \theta_0) = \tau (n^{-1} \mathbf{X}^{*T}\mathbf{X}^*)^{-1}\, n^{-1/2}\, \mathbf{X}^{*T} \varphi[H(\mathbf{y}^* - \mathbf{X}^*\theta_0)] + o_p(1) ,$   (3.14.9)

where X* is the n × p matrix whose ith row is ∇fi(θ0)T and y* is the n × 1 vector whose ith component is yi − fi(θ0) + ∇fi(θ0)Tθ0.

Based on (3.14.9), we can obtain the influence function of the Wilcoxon estimate. Assume fi depends on a set of predictors zi ∈ Z ⊂ ℜ^m as fi(θ) = f(zi, θ). Assume also that f is a continuous function of θ for each z ∈ Z and is a measurable function of z for each θ ∈ Θ with respect to a σ-finite measure. Under these assumptions, the representation above gives us the local influence function of the Wilcoxon estimate at the point (z0, y0),

$\mathrm{IF}(\mathbf{z}_0, y_0; \hat\theta_{W,n}) = \tau\, \Sigma^{-1}(\theta_0)\, \varphi[H(y_0)]\, \nabla f(\mathbf{z}_0, \theta_0) .$

Note that the influence function is unbounded if the tangent plane of S at θ0 is unbounded. This phenomenon corresponds to the existence of high leverage points in linear regression. The HBR estimators, however, can be extended to the nonlinear model also; see (??). These are robust in such cases.


3.14.1 Implementation

To implement the asymptotic inference based on the Wilcoxon estimate we need a consistent estimator of the variance-covariance matrix. Define the statistic Σ(θ̂W,n) to be Σ(θ0) of expression (N2) with θ0 replaced by θ̂W,n. By Assumption (N2) and the consistency of θ̂W,n to θ0, Σ(θ̂W,n) converges in probability to Σ(θ0). Next, it follows from the asymptotic representation (3.14.9) that the estimator of τ proposed by Koul, Sievers and McKean (1987) for linear models is also a consistent estimator of τ for our nonlinear model. We denote this estimator by τ̂. Thus τ̂²Σ⁻¹(θ̂W,n) is a consistent estimator of the asymptotic variance-covariance matrix of θ̂W,n.

Estimation Algorithm

Similar to the LS estimates for nonlinear models, a Gauss-Newton type of algorithm can be used to obtain the Wilcoxon fit. Recall that this is an iterative algorithm which uses the Taylor series expansion of the function f(θ), evaluated at the current estimate, to obtain the estimate at the next iteration. Thus each iteration consists of fitting a linear model. Abebe and McKean (2007) show that this algorithm for obtaining the Wilcoxon fit converges in a probability sense. Using this algorithm, all that is required to compute the Wilcoxon nonlinear estimate is a computational procedure for Wilcoxon linear model estimates; see Exercise 3.16.43 for further discussion. A minimal sketch of such an algorithm is given below.
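To make the steps concrete, the following is a minimal R sketch of the Gauss-Newton Wilcoxon iteration; it is not the authors' implementation. For simplicity the inner linear Wilcoxon fit of the linearized model is done by directly minimizing Jaeckel's dispersion with optim; in practice one would instead call a linear rank-based fitter (such as wwest above), using a regression through the origin fit as discussed in Exercise 3.16.43. The function names gn.wilcoxon and wil.disp are ours.

wil.disp = function(beta, X, y) {
  # Jaeckel's dispersion of the residuals under Wilcoxon scores
  e = y - X %*% beta
  n = length(e)
  a = sqrt(12) * (rank(e) / (n + 1) - 0.5)
  sum(a * e)
}

gn.wilcoxon = function(f, grad, y, theta0, maxit = 20, tol = 1e-6) {
  theta = theta0
  for (it in 1:maxit) {
    X = grad(theta)              # n x p matrix with ith row grad f_i(theta)
    r = y - f(theta)             # current residuals
    # Wilcoxon fit, through the origin, of the linearized model r ~ X
    delta = optim(rep(0, length(theta)), wil.disp, X = X, y = r)$par
    theta = theta + delta        # Gauss-Newton update
    if (sqrt(sum(delta^2)) < tol) break
  }
  theta
}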

We next consider an example that demonstrates the robustness and efficiency properties of the rank estimators in comparison to the least squares (LS) estimator in practical situations.

Example 3.14.1 (Chwirut's Data). These data are taken from the ultrasonic block reference study by Chwirut [?]. The response variable is ultrasonic response and the predictor variable is metal distance. The study involved 214 observations. The model under consideration is

$f_i(\theta) \equiv f(x_i; \theta_1, \theta_2, \theta_3) \equiv \frac{\exp[-\theta_1 x_i]}{\theta_2 + \theta_3 x_i} , \quad i = 1, \ldots, 214 .$

Using the Wilcoxon and LS fitting procedures, we fit the (original) data and then a data set with one observation replaced by an outlier. Figure 3.14.1 displays the results of the fits. A sketch of one way such fits might be computed in R follows.
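The following hedged R sketch sets up the model and gradient functions for the Gauss-Newton Wilcoxon routine sketched in Section 3.14.1, with nls supplying the LS fit. Here chwirut is assumed to be a data frame with columns x (metal distance) and y (ultrasonic response), and the starting values th0 are rough guesses suggested by the estimates in Table 3.14.1.

f = function(th) exp(-th[1] * chwirut$x) / (th[2] + th[3] * chwirut$x)
gr = function(th) {
  # n x 3 matrix of partial derivatives of f_i with respect to theta
  x = chwirut$x
  num = exp(-th[1] * x)
  den = th[2] + th[3] * x
  cbind(-x * num / den, -num / den^2, -x * num / den^2)
}
th0 = c(0.15, 0.005, 0.01)
fit.ls  = nls(y ~ exp(-b1 * x) / (b2 + b3 * x), data = chwirut,
              start = list(b1 = th0[1], b2 = th0[2], b3 = th0[3]))
fit.wil = gn.wilcoxon(f, gr, chwirut$y, th0)   # sketch from Section 3.14.1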

For the original data, as shown in the figure and by the estimates given in Table 3.14.1, the LS and Wilcoxon fits are quite similar. As shown in the residual plots of Figure 3.14.1, there are several moderate outliers in the original data set. These outliers have an impact on the LS estimate of scale, the square root of MSE, which has the value σ̂LS = 3.36. In contrast, the Wilcoxon estimate of τ is τ̂ = 2.45, which explains the Wilcoxon's smaller standard errors relative to those of LS in the table of estimates.

Table 3.14.1: Wilcoxon and LS estimates based on the original data with standard errors (SE) and the Wilcoxon estimates based on the data with the substituted gross outlier.

           Original Data Set                     Outlier Data Set
       Wil. Est.    SE      LS Est.    SE       Wil. Est.    SE
  θ1    0.1902    0.0161    0.1903   0.0219      0.1897    0.0161
  θ2    0.0061    0.0002    0.0061   0.0003      0.0061    0.0002
  θ3    0.0197    0.0006    0.0105   0.0008      0.0107    0.0006

For robustness considerations, we introduced a gross outlier in the response space (observation 17 was changed from 8.025 to 5000). The Wilcoxon and LS fits were obtained. As shown in Figure 3.14.1, the LS estimate essentially did not converge. From the plot of the fitted models and the residual plots, it is clear that the Wilcoxon fit performs dramatically better than its LS counterpart. In Table 3.14.1 the Wilcoxon estimates are displayed with their standard errors. There is basically little difference between the Wilcoxon fits for the original data set and the data set with the gross outlier.


3.15 Exercises

3.15.1. Show that the function (3.12.1) is a pseudo-norm.

3.15.2. Consider the hypotheses H0 : β = 0 versus HA : β ≠ 0.

(a). Using Theorem ??, derive the Wald type test based on the GR estimate of β.

(b). Using Theorem ??, derive the gradient test.

3.15.3. In the derivation of Lemma ??, show the results for Cases 2-6.

3.15.4. Show that the second standardized term of the variance of n^{−3/2}S(0) in expression (??) goes to 0 as n → ∞.

3.15.5. Consider Theorem ??.

(a). If the weights are identically equal to 1, show that the random variable $\sqrt{12}\, RD_{GR}/(n\hat\tau)$ reduces to the random variable $RD_\varphi/(\hat\tau_\varphi/2)$ of Theorem 3.6.1, provided ϕ(u) = √12(u − 1/2), (i.e., Wilcoxon scores).

(b). Complete the proof of Theorem ??.

3.15.6. Keeping in mind that the weights for the GR estimates depend only on the xs, discuss how the bootstrap described in Section 3.8 for the L1 estimate can be used to bootstrap the p-value of the test based on the statistic FGR.

3.15.7. (a). Write an algorithm which obtains the simulated distribution of $\sum_{i=1}^{q} \lambda_i \chi_i^2(1)$, where λ1, . . . , λq are specified and the $\chi_i^2(1)$ are iid χ², one degree of freedom, random variables.

(b). Write a second algorithm which uses the algorithm in (a) to obtain the p-value of the test statistic qFGR.

3.15.8. For general linear hypotheses, H0 : Mβ = 0 versus HA : Mβ ≠ 0, discuss the Wald type test based on the GR estimates.

3.15.9. Show that the second term in expression (??) is less than or equal to zero.

3.15.10. For the data in Example ?? consider polynomial models of degree p,

$y_i = \alpha + \sum_{j=1}^{p} \beta_j (x_i - \bar{x})^j .$

Once a good fit has been established, a hypothesis of interest is H0 : βp = 0. Suppose for the GR estimates we use the weights given by expression (??) parameterized by the exponent α1. For α1 = .5, 1, and 2, obtain the asymptotic relative efficiency between the GR estimate and the Wilcoxon estimate of βp for polynomial models of degree p = 1, . . . , 4. If software is available, obtain these efficiencies for the weights given by expression (??) using the tuning constants recommended.


3.15.11. Obtain the details of the derivation of Var(êGR) given in expression (??).

3.15.12. Obtain the projection that is used in the proof of Theorem ??.

3.15.13. In the proof of Theorem ??, show that $\sqrt{n}(\hat\beta_R - \hat\beta_{GR})$ is asymptotically normal.

3.15.14. In the proof of Theorem ??, show that E[Fc(e)S′(β0)] = (n/6)WX.

3.15.15. By completing the following steps, obtain the asymptotic variance-covariance matrix of $\hat{b}^*_R - \hat{b}^*_{GR}$, where $\hat{b}^*_R = (\hat\alpha^*_{S,R}, \hat\beta_R')'$ and $\hat{b}^*_{GR} = (\hat\alpha^*_{S,GR}, \hat\beta_{GR}')'$.

(a). Show that $\hat{b}^*_R = T\hat{b}_R$ where

$T = \begin{bmatrix} 1 & -\bar{x}' \\ 0 & I \end{bmatrix} .$

(b). Show that the following holds (where AV represents asymptotic variance):

$AV\left[ \begin{pmatrix} \hat\alpha^*_{S,R} \\ \hat\beta_R \end{pmatrix} - \begin{pmatrix} \hat\alpha^*_{S,GR} \\ \hat\beta_{GR} \end{pmatrix} \right] = T\, AV\left[ \begin{pmatrix} \hat\alpha_{S,R} - \hat\alpha_{S,GR} \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ \hat\beta_R - \hat\beta_{GR} \end{pmatrix} \right] T' .$

(c). Since the asymptotic representation (3.5.23) holds for both $\hat\alpha^*_{S,R}$ and $\hat\alpha^*_{S,GR}$ and since the centered intercept is asymptotically independent of the regression estimates, use (??) to conclude that

$AV\left[ \begin{pmatrix} \hat\alpha^*_{S,R} \\ \hat\beta_R \end{pmatrix} - \begin{pmatrix} \hat\alpha^*_{S,GR} \\ \hat\beta_{GR} \end{pmatrix} \right] = \tau^2 \begin{bmatrix} \bar{x}'D\bar{x} & -\bar{x}'D \\ -D\bar{x} & D \end{bmatrix} ,$

where $D = (X'WX)^{-1}X'W^2X(X'WX)^{-1} - (X'X)^{-1}$.

3.15.16. Show that the asymptotic variance-covariance matrix derived in Exercise 3.15.15 is singular by showing that the vector (1, x̄′)′ is in its kernel.

3.15.17. Show that RDGR(β) = DGR(y) − DGR(e), (??) and (??), is a strictly convex function of β with a minimum value of 0 at β = 0.

3.15.18. Using the influence function (??) of the GR estimate, obtain the asymptotic distribution of $\hat\beta_{GR}$.

3.15.19. The influence function of the test statistic TGR is given in Theorem ??. Use it to obtain the asymptotic distribution of TGR.

3.15.20. Obtain the approximate distribution of $\hat\beta_{HBR} - \hat\beta_R$, where $\hat\beta_R$ is the Wilcoxon estimate. Use it to obtain HBR-analogues of the diagnostic statistics TDBETASR, (??), and CFITSR,i, (??).

3.15.21. Show that the p × p matrix Cn defined in expression (3.12.7) can be written alternately as $C_n = \sum_{i<j} \gamma_{ij} b_{ij} (x_j - x_i)(x_j - x_i)'$.


3.15.22. Consider the influence function of the HBR estimate given in expression (A.5.24).

(a). If the weights for residuals are set at 1, show that the influence function of the HBR estimate simplifies to the influence function of the GR estimate given in (??).

(b). If the weights for residuals and the xs are both set at 1, show that the influence function of the HBR estimate simplifies to the influence function of the Wilcoxon estimate given in (3.5.17).


3.16 Exercises

3.16.1. For the baseball data in Example 3.3.2, explore other transformations of the predictor Years in order to obtain a better fitting model than the one discussed in the example.

3.16.2. Consider the linear model (3.2.2).

(a). Show that the ranks of the residuals can only change values at the $\binom{n}{2}$ equations $y_i - x_i'\beta = y_j - x_j'\beta$.

(b). Determine the change in dispersion as β moves across one of these defining planes.

(c). For the telephone data, Example 3.3.1, obtain the plot shown in Panel D of Figure 3.3.1; i.e., a plot of the dispersion function D(β) for a set of values β in the interval (−.2, .6). Locate the estimate of slope on the plot.

(d). Plot the gradient function S(β) for the same set of values β in the interval (−.2, .6). Locate the estimate of slope on the plot.

3.16.3. In Section 2.2 of Chapter 2, the two sample location problem was modeled as a regression problem; see (2.2.2). Consider fitting this model using Wilcoxon scores.

(a.) Show that the gradient test statistic (3.5.8) simplifies to the square of the standardized MWW test statistic (2.2.21).

(b.) Show that the regression estimate of the slope parameter is the Hodges-Lehmann estimator given by expression (2.2.18).

(c.) Verify Parts (a) and (b) by fitting the data in the two sample problem of Exercise 2.13.48 as a regression model.

3.16.4. For the simple linear regression problem, if the values of the independent variable x are distinct and equally spaced, show that the Wilcoxon test statistic is equivalent to the test for correlation based on Spearman's rs, where

$r_s = \frac{\sum \big( R(x_i) - \frac{n+1}{2} \big)\big( R(y_i) - \frac{n+1}{2} \big)}{\sqrt{\sum \big( R(x_i) - \frac{n+1}{2} \big)^2}\, \sqrt{\sum \big( R(y_i) - \frac{n+1}{2} \big)^2}} .$

Note that the denominator of rs is a constant. Obtain its value.

3.16.5. For the simple linear regression model consider the process

$T(\beta) = \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{sgn}(x_i - x_j)\, \operatorname{sgn}\big((Y_i - x_i\beta) - (Y_j - x_j\beta)\big) .$

(a.) Show under the null hypothesis H0 : β = 0, that E(T(0)) = 0 and that Var(T(0)) = 2(n − 1)n(2n + 5)/9.

(b.) Determine the estimate of β based on inverting the test statistic T(0); i.e., the value of β which solves

$T(\hat\beta) \doteq 0 .$

(c.) Show that when the two sample problem is written as a regression model, (2.2.2), this estimate of β is the Hodges-Lehmann estimate (2.2.18).

Note: Kendall's τ is a measure of association between xi and Yi given by τ = T(0)/(n(n − 1)); see Chapter 4 of Hettmansperger (1984) for further discussion.

3.16.6. Show that the R estimate $\hat\beta_\varphi$ is an equivariant estimator; that is, $\hat\beta_\varphi(\mathbf{Y} + \mathbf{X}\delta) = \hat\beta_\varphi(\mathbf{Y}) + \delta$ and $\hat\beta_\varphi(k\mathbf{Y}) = k\hat\beta_\varphi(\mathbf{Y})$.

3.16.7. Consider Model 3.2.1 and the hypotheses (3.2.5). Let ΩF denote the column space of the full model design matrix X and let ω denote the subspace of ΩF subject to H0. Show that ω is a subspace of ΩF and determine its dimension. Hint: One way of establishing the dimension is to show that C = X(X′X)−1M′ is a basis matrix for ΩF ∩ ωc.

3.16.8. Show that Assumptions (3.4.9) and (3.4.8) imply Assumption (3.4.7).

3.16.9. For the special case of Wilcoxon scores, obtain the proof of Theorem 3.5.2 by first getting the projection of the statistic S(0).

3.16.10. Assume that the errors ei in Model (3.2.2) have finite variance σ². Let $\hat\beta_{LS}$ denote the least squares estimate of β. Show that $\sqrt{n}(\hat\beta_{LS} - \beta) \xrightarrow{D} N_p(0, \sigma^2\Sigma^{-1})$. Hint: First show that the LS estimate is location and scale equivariant. Then without loss of generality we can assume that the true β is 0.

3.16.11. Under the additional assumption that the errors have a symmetric distribution, show that R-estimates are unbiased for all sample sizes.

3.16.12. Let ϕf(u) = −f′(F−1(u))/f(F−1(u)) denote the optimal scores for the density f(x) and suppose that f is symmetric. Show that ϕf(1 − u) = −ϕf(u); that is, the optimal scores are odd about 1/2.

3.16.13. Suppose the errors ei are double exponentially distributed. Show that the L1 estimate, i.e., the R estimate based on sign scores, is the maximum likelihood estimate.

3.16.14. Using Theorem 3.5.7, show that

$\begin{pmatrix} \hat\alpha^*_S \\ \hat\beta_\varphi \end{pmatrix} \text{ is approximately } N_{p+1}\left( \begin{pmatrix} \alpha_0 - \bar{x}'\beta_0 \\ \beta_0 \end{pmatrix}, \begin{bmatrix} \kappa_n & -\tau_\varphi^2\, \bar{x}'(X'X)^{-1} \\ -\tau_\varphi^2\, (X'X)^{-1}\bar{x} & \tau_\varphi^2\, (X'X)^{-1} \end{bmatrix} \right) ,$   (3.16.1)

where $\kappa_n = n^{-1}\tau_S^2 + \tau_\varphi^2\, \bar{x}'(X'X)^{-1}\bar{x}$ and τS and τϕ are given respectively by (3.4.6) and (3.4.4).


3.16.15. Show that the random vector within the brackets in the proof of Lemma 3.6.2 is bounded in probability.

3.16.16. Show that the difference between the numerators of the two F-statistics, (3.6.12) and (3.6.14), converges to 0 in probability under the null hypothesis.

3.16.17. Show that the difference between Fϕ, (3.6.12), and Aϕ, (3.6.17), converges to 0 in probability under the null hypothesis.

3.16.18. By showing the following results, establish the asymptotic distribution of the least squares test statistic, FLS, under the sequence of models (3.6.24) with the additional assumption that the random errors have finite variance σ².

(a). First show that

$\frac{1}{\sqrt{n}} X'Y \xrightarrow{D} N\left( \begin{bmatrix} B\theta \\ A_2\theta \end{bmatrix}, \sigma^2 I \right) ,$   (3.16.2)

where the matrices A2 and B are defined in the proof of Theorem 3.6.1. This can be established by using the Lindeberg-Feller CLT, Theorem A.1.1 of the Appendix, to show that an arbitrary linear combination of the components of the random vector on the left side converges in distribution to a random variable with a normal distribution.

(b). Based on Part (a), show that

$\left[ -B'A_1^{-1} \,\vdots\, I \right] \frac{1}{\sqrt{n}} X'Y \xrightarrow{D} N\left( W^{-1}\theta, \sigma^2 W^{-1} \right) ,$   (3.16.3)

where the matrices A1 and W are defined in the proof of Theorem 3.6.1.

(c). Let FLS(σ²) denote the LS F-test statistic with the true value of σ² replacing the estimate σ̂². Show that

$F_{LS}(\sigma^2) = \left\{ \left[ -B'A_1^{-1} \,\vdots\, I \right] \frac{1}{\sqrt{n}} X'Y \right\}' \sigma^{-2} W \left\{ \left[ -B'A_1^{-1} \,\vdots\, I \right] \frac{1}{\sqrt{n}} X'Y \right\} .$   (3.16.4)

(d). Based on (3.16.3) and (3.16.4), show that FLS(σ²) has a limiting noncentral χ² distribution with noncentrality parameter given by expression (3.6.29).

(e.) Obtain the final result by showing that σ̂² is a consistent estimate of σ² under the sequence of models (3.6.24).

3.16.19. Show that De, (3.6.30), is a scale parameter; i.e., De(Fae+b) = |a|De(Fe).

3.16.20. Establish expression (3.6.35).

3.16.21. Suppose Wilcoxon scores are used.

(a). Establish the expressions (3.6.36) and (3.6.37).

(b). Similarly for sign scores establish (3.6.38) and (3.6.39).

3.16.22. Consider the model (3.2.1) and hypotheses (3.6.9). Suppose the errors have a double exponential distribution with density f(t) = (2b)−1 exp{−|t|/b}. Assume b is known. Show that the likelihood ratio test is equivalent to the drop in dispersion test based on sign scores.

3.16.23. Establish expressions (3.9.8) and (3.9.9).

3.16.24. Let X be a random variable with distribution function FX(x) and let Y = aX + b. Define the quantile function of X as $q_X(p) = F_X^{-1}(p)$. Show that qX(p) is a linear function of qY(p).

3.16.25. Verify expression (3.9.17).

3.16.26. Assume that the errors have a normal distribution. Show that K2, (3.9.25), converges in probability to 1.

3.16.27. Verify expression (3.9.34).

3.16.28. Proceeding as in Theorem 3.9.3, show that the first order representation of the fitted value $\hat{Y}_R$ is given by expression (3.9.36). Next show that the approximate variance of the ith fitted case is given by expression (3.9.38).

3.16.29. Consider the mean shift model, (3.9.32). Show that the estimator of θi given by the numerator of expression (3.9.35) is based on the inversion of an aligned rank statistic to test the hypotheses (3.9.33).

3.16.30. Assume that the errors have a symmetric distribution. Verify expressions (3.9.41) and (3.9.42).

3.16.31. Assume that the errors have the distribution GF(2m1, 2m2).

(a). Show that the optimal rank score function is given by expression (3.10.6).

(b). Show that the asymptotic relative efficiency between the Wilcoxon analysis and the rank-based analysis based on the optimal scores for the distribution GF(2m1, 2m2) is given by expression (3.10.8).

3.16.32. Suppose the errors have density function

$f_{m_2}(x) = e^{-x}\big(1 + m_2^{-1} e^{-x}\big)^{-(m_2+1)} , \quad m_2 > 0 , \; -\infty < x < \infty .$   (3.16.5)

(a). Show that the optimal scores are given by expression (3.10.7).

(b). Show that the asymptotic relative efficiency of the Wilcoxon analysis to the rank analysis based on the optimal rank score function for the density (3.16.5) is given by expression (3.10.9).


3.16.33. The definition of the modulus of a matrix A is given in expression (3.11.6). Verify the three properties concerning the modulus of a matrix listed in the text following this definition.

3.16.34. Consider Example 3.11.1. If Wilcoxon scores are used, show that $D_y = \sqrt{3/4}\, E|Y_1 - Y_2|$ where Y1, Y2 are iid with distribution function G, and that $D_e = \sqrt{3/4}\, E|e_1 - e_2|$ where e1, e2 are iid with distribution function F. Next assume that sign scores are used. Show that Dy = E|Y − med Y| where med Y denotes the median of Y. Likewise De = E|e − med e|.

3.16.35. In Example 3.11.3, show that the coefficients of multiple determination R̄1 and R̄2, given by expressions (3.11.27) and (3.11.28), respectively, are one-to-one functions of R² given by expression (3.11.11).

3.16.36. At the end of Example 3.11.3, verify, for Wilcoxon scores and sign scores, that (1/(2T²)) = π/6 and (1/(2T²)) = π/4, respectively.

3.16.37. In Example 3.11.4, show that the density of Y is given by

$g(y) = \frac{1-\epsilon}{\sqrt{2}}\, \phi\!\left(\frac{y}{\sqrt{2}}\right) + \frac{\epsilon}{\sqrt{1+\sigma_c^2}}\, \phi\!\left(\frac{y}{\sqrt{1+\sigma_c^2}}\right) .$

Using this, verify the expressions for Dy, De, and τϕ found in the example.

3.16.38. For the baseball data given in Exercise 1.12.32, consider the variables height and weight.

(a.) Obtain the scatterplot of height versus weight.

(b.) Obtain the CMDs: R², R̄1, R̄2, R1*², and R2*².

3.16.39. Consider a linear model of the form

$Y = X^*\beta^* + e ,$   (3.16.6)

where X* is n × p and its column space Ω*F does not include 1. This model is often called regression through the origin. Note for the pseudo-norm ‖ · ‖ϕ that

$\|Y - X^*\beta^*\|_\varphi = \sum_{i=1}^{n} a(R(y_i - x_i^{*\prime}\beta^*))(y_i - x_i^{*\prime}\beta^*)$
$\qquad = \sum_{i=1}^{n} a(R(y_i - (x_i^* - \bar{x}^*)'\beta^*))(y_i - (x_i^* - \bar{x}^*)'\beta^*)$
$\qquad = \sum_{i=1}^{n} a(R(y_i - \alpha - (x_i^* - \bar{x}^*)'\beta^*))(y_i - \alpha - (x_i^* - \bar{x}^*)'\beta^*) ,$   (3.16.7)

where x*i is the ith row of X* and x̄* is the vector of column averages of X*. Based on this result, the estimate of the regression coefficients based on the R-fit of Model (3.16.6) is estimating the regression coefficients of the centered model, i.e., the model with the design matrix X = X* − H1X*. Hence, in general, the parameter β* is not estimated. This problem also occurs in a weighted regression model. Dixon and McKean (1996) proposed the following solution. Assume that (3.16.6) is the true model, but obtain the R-fit of the model:

$Y = \mathbf{1}\alpha_1 + X^*\beta_1^* + e = [\mathbf{1}\; X^*] \begin{bmatrix} \alpha_1 \\ \beta_1^* \end{bmatrix} + e ,$   (3.16.8)

where the true α1 is 0. Let X1 = [1 X*] and let Ω1 denote the column space of X1. Let $\hat{Y}_1 = \mathbf{1}\hat\alpha_1 + X^*\hat\beta_1^*$ denote the R-fitted value based on the fit of Model (3.16.8). Note that Ω* ⊂ Ω1. Let $\hat{Y}^* = H_{\Omega^*}\hat{Y}_1$ be the projection of this fitted value onto the desired space Ω*. Finally estimate β* by solving the equation

$X^*\hat\beta^* = \hat{Y}^* .$   (3.16.9)

(a.) Show that $\hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}\hat{Y}_1$ is the solution of (3.16.9).

(b.) Assume that the density function of the errors is symmetric, that the R-score function is odd about 1/2, and that the intercept α1 is estimated by solving the equation $T^+(e_R - \alpha) \doteq 0$ as discussed in Section 3.5.2. Under these assumptions show that

$\hat\beta^* \text{ has an approximate } N(\beta^*, \tau^2 (X^{*\prime}X^*)^{-1}) \text{ distribution} .$   (3.16.10)

(c.) Next, suppose that the intercept is estimated by the median of the residuals from the R-fit of (3.16.8). Using the asymptotic representations (3.5.24) and (3.5.23), show that the asymptotic representation of $\hat\beta^*$ is given by

$\hat\beta^* = \tau_S (X^{*\prime}X^*)^{-1} X^{*\prime} H_1 \operatorname{sgn}(e) + \tau_\varphi (X^{*\prime}X^*)^{-1} X^{*\prime} H_X \varphi[F(e)] + o_p(1/\sqrt{n}) .$   (3.16.11)

Use this result to show that the asymptotic variance of $\hat\beta^*$ is given by

$\operatorname{AsyVar}(\hat\beta^*) = \tau_S^2 (X^{*\prime}X^*)^{-1} X^{*\prime} H_1 X^* (X^{*\prime}X^*)^{-1} + \tau_\varphi^2 (X^{*\prime}X^*)^{-1} X^{*\prime} H_X X^* (X^{*\prime}X^*)^{-1} .$   (3.16.12)

(d.) Show that the invariance to x̄* as shown in (3.16.7) is true for any pseudo-norm.

3.16.40. The data in Table 3.16.1 are presented in Graybill and Iyer (1994). The dependent variable is the weight (in grams) of a crystalline form of a certain chemical compound while the independent variable is the length of time (in hours) that the crystal was allowed to grow. A model of interest is the regression through the origin model (3.16.6). Obtain the R-estimate of β* for these data using the procedure described in (3.16.9). Compare this fit with the R-fit of the intercept model.


Table 3.16.1: Crystal Data for Exercise 3.16.40

Time (hours)      2     4     6     8    10    12    14
Weight (grams)  0.08  1.12  4.43  4.98  4.92  7.18  5.57
Time (hours)     16    18    20    22    24    26    28
Weight (grams)  8.40  8.881 10.81 11.16 10.12 13.12 15.04

3.16.41. The following lemma is helpful in a comparison of the efficiency between the R- and the GR-estimates.

Lemma 3.16.1. The matrix

$B = (X'WX)^{-1} X'W^2X\, (X'WX)^{-1} - (X'X)^{-1}$

is positive semi-definite.

Proof: Let v be any vector in Rp. Since X′WX is non-singular, there exists a vector u such that v = X′WXu. Hence by the Pythagorean Theorem,

$v'Bv = \|WXu\|^2 - \|H\,WXu\|^2 \geq 0 ,$

where H is the projection matrix onto the column space of X.

Based on this lemma, it is easy to see that there is always a loss of efficiency when using the GR-estimates. As the examples show, this loss can be severe. If the design matrix, though, has clusters of outlying points, then this downweighting may be necessary.

3.16.42. Consider the pseudo-linear model (3.14.3) of Section 3.14. For the Wilcoxon pseudo-estimator, obtain the asymptotic result (3.14.6).

3.16.43. By filling in the brief sketch below, write out the Gauss-Newton algorithm for the Wilcoxon estimate of the nonlinear model (3.14.1).

Let θ̂0 be an initial estimate of θ. Let f0 = f(θ̂0). Write the norm to be minimized as ‖Y − f‖W = ‖Y − f0 + [f0 − f]‖W. Then use a Taylor series of order 1 to approximate the term in brackets. The increment for the next step estimate is the Wilcoxon estimator of this approximate linear model with Y − f0 as the dependent variable. For actual implementation, discuss why the regression through the origin algorithm of Exercise 3.16.39 is usually necessary here.


[Figure 3.10.3. Panel A: Scatterplot of Insulating Fluid Data, Example 3.10.1, overlaid with GF(2,10) and LS fits; Panel B: Comparison boxplots of log breakdown times over levels of voltage stress; Panel C: q−q plot of Wilcoxon fit versus logistic population quantiles of full model (oneway layout); Panel D: q−q plot of GF(2,10) fit versus GF(2,10) population quantiles of full model (oneway layout).]


[Figure. Panel A: Linear Model, with Wilcoxon, LS, HBR and GR fits of log light intensity versus log temperature; Panel B: Quadratic Model, with the fits labeled LS, Wil, HBR and GR; Panel C: GR residuals (Quadratic Model); Panel D: HBR residuals (Quadratic Model).]


[Figure 3.12.2. Panel A: Influence function for the HBR estimate; Panel B: Influence function for the Wilcoxon estimate; Panel C: Influence function for the GR estimate.]


[Figure 3.12.1. Panel A: For the Quadratic Data of Example 3.12.4, scatter plot of the data overlaid by Wilcoxon, HBR and LMS fits; Panel B: studentized residual plot based on the Wilcoxon fit; Panel C: studentized residual plot based on the HBR fit; Panel D: residual plot based on the LMS fit.]


[Figure 3.13.1: Plots for Bonds Data. Panel A: Bid price versus coupon rate; Panel B: LS Studentized residuals versus LS fit; Panels C and D: Wilcoxon Studentized residuals versus Wilcoxon fit.]


[Figure 3.13.2. Panel A: LS residual plot of the Hawkins data; Panel B: Wilcoxon residual plot; Panel C: HBR residual plot; Panel D: CFITS(Wilcoxon,HBR) by case, with TDBETAS(W,HBR) = 1324.]


[Figure 3.14.1: Analysis of Chwirut's data. (a) Wilcoxon residuals versus predictor; (b) LS residuals versus predictor; (c) Wilcoxon and LS fits, original data; (d) Wilcoxon residuals versus predictor, outlier data; (e) LS residuals versus predictor, outlier data; (f) Wilcoxon and LS fits, outlier data.]


Chapter 4

Experimental Designs: Fixed Effects

4.1 Introduction

In this chapter we will discuss rank-based inference for experimental designs based on the theory developed in Chapter 3. We will concentrate on factorial type designs and analysis of covariance designs but, based on our discussion, it will be clear how to extend the rank-based analysis for any fixed effects design. For example, based on this rank-based inference, Vidmar and McKean (1996) developed a response surface methodology which is quite analogous to the traditional response surface methods. We will discuss estimation of effects, tests of linear hypotheses concerning effects, and multiple comparison procedures. We illustrate this rank-based inference with numerous examples. One purpose of our discussion is to show how this rank-based analysis is analogous to the traditional analysis based on least squares. In Section 4.2.5 we will introduce pseudo-observations which are based on an R-fit of the full model. We show that the rank-based analysis (Wald type) can be obtained by substituting these pseudo-observations in place of the responses in a package that obtains the traditional analysis. We begin with the oneway design.

In our development we apply rank scores to residuals. In this sense our methods are not pure rank statistics; but they do provide consistent and highly efficient tests for traditional linear hypotheses. The rank transform method is a pure rank test and it is discussed in Section 4.7, where we describe various drawbacks to the approach for testing traditional linear hypotheses in linear models. Brunner and his colleagues have successfully developed a general approach to testing in designed experiments based on pure ranks; although, the hypotheses of their approach are generally not linear hypotheses. Brunner and Puri (1996) provide an excellent survey of these pure rank tests. We will not pursue them further in this book.

While we will only consider linear models in this chapter, there have been extensions of robust inference to other models. For example, Vidmar, McKean and Hettmansperger (1992) extended this robust inference to generalized linear models for quantal responses in the context of drug combination problems, and Li (1991) discussed rank procedures for a logistic model. Stefanski, Carroll and Ruppert (1986) discussed generalized M-estimates for generalized linear models.

4.2 Oneway Design

Suppose we want to determine the effect that a single factor A has on a response of interest over a specified population. Assume that A has k levels, each level being referred to as a treatment group. In this situation, the completely randomized design is often used to investigate the effect of A. For this design n subjects are selected at random from the population of interest and ni of these subjects are randomly assigned to level i of A, for i = 1, . . . , k. Let Yij denote the response of the jth subject in the ith level of A. We will assume that the responses are independent of one another and that the distributions among levels differ by at most shifts in location. Although the randomization gives some credence to the assumption of independence, after fitting the model a residual analysis should be conducted to check this assumption and the assumption that the level distributions differ by at most a shift in locations.

Under these assumptions, the full model can be written as

$Y_{ij} = \mu_i + e_{ij} , \quad j = 1, \ldots, n_i , \; i = 1, \ldots, k ,$   (4.2.1)

where the eijs are iid random variables with density f(x) and distribution function F(x), and the parameter µi is a convenient location parameter (for example, the mean or median). Let T(F) denote the location functional. Assume, without loss of generality, that T(F) = 0. Let ∆ii′ denote the shift between the distributions of Yij and Yi′l. Recall from Chapter 2 that the parameters ∆ii′ are invariant to the choice of location functional and that ∆ii′ = µi − µi′. If µi is the mean of the Yij then Hocking (1985) calls this the means model. If µi is the median of the Yij then we will call it the medians model; see Section 4.2.4 below.

Observational studies can also be modeled this way. Suppose k independent samples are drawn from k different populations. If we assume further that the distributions for the different populations differ by at most a shift in locations then Model (4.2.1) is appropriate. But as in all observational studies, care must be taken in the interpretation of the results of the analyses.

While the parameters µi fix the locations, the parameters of interest in this chapter are contrasts of the form $h = \sum_{i=1}^{k} c_i\mu_i$ where $\sum_{i=1}^{k} c_i = 0$. Similar to the shift parameters, contrasts are invariant to the choice of location functional. In fact contrasts are linear functions of these shifts; i.e.,

$h = \sum_{i=1}^{k} c_i\mu_i = \sum_{i=1}^{k} c_i(\mu_i - \mu_1) = \sum_{i=2}^{k} c_i\Delta_{i1} = \mathbf{c}_1'\boldsymbol{\Delta}_1 ,$   (4.2.2)

where $\mathbf{c}_1' = (c_2, \ldots, c_k)$ and

$\boldsymbol{\Delta}_1' = (\Delta_{21}, \ldots, \Delta_{k1})$   (4.2.3)

is the vector of location shifts from the first cell. In order to easily reference the theory of Chapter 3, we will often use ∆1, which references cell 1. But picking cell 1 is only for convenience and similar results hold for the selection of any other cell.

As in Chapter 2, we can write this model in terms of a linear model as follows. Let $\mathbf{Z}' = (Y_{11}, \ldots, Y_{1n_1}, \ldots, Y_{k1}, \ldots, Y_{kn_k})$ denote the vector of all observations, $\boldsymbol{\mu}' = (\mu_1, \ldots, \mu_k)$ denote the vector of locations, and $n = \sum n_i$ denote the total sample size. The model can then be expressed as a linear model of the form

$\mathbf{Z} = \mathbf{W}\boldsymbol{\mu} + \mathbf{e} ,$   (4.2.4)

where e denotes the n × 1 vector of random errors eij and the n × k design matrix W denotes the appropriate incidence matrix of 0s and 1s; i.e.,

$\mathbf{W} = \begin{bmatrix} \mathbf{1}_{n_1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{1}_{n_2} & \cdots & \mathbf{0} \\ \vdots & \vdots & & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{1}_{n_k} \end{bmatrix} .$   (4.2.5)

Note that the vector 1n is in the column space of W; hence, the theory derived in Chapter 3 is valid for this model.

At times it will be convenient to reparameterize the model in terms of a vector of shift parameters. For the vector ∆1, let W1 denote the last k − 1 columns of W and let X be the centered W1; i.e., X = (I − H1)W1, where H1 = 1(1′1)−11′ = n−111′ and 1′ = (1, . . . , 1). Then we can write Model (4.2.4) as

$\mathbf{Z} = \alpha\mathbf{1} + \mathbf{X}\boldsymbol{\Delta}_1 + \mathbf{e} ,$   (4.2.6)

where ∆1 is given in expression (4.2.3). It is easy to show that, for any matrix [1 | X*] having the same column space as W, its corresponding non-intercept parameters are linear functions of the shifts and, hence, are invariant to the selected location functional. The relationship between Models (4.2.4) and (4.2.6) will be explored further in Section 4.2.4.

4.2.1 R-Fit of the Oneway Design

Note that the sum of all the column vectors of W equals the vector of ones 1n. Thus 1n is in the column space of W and we can fit Model (4.2.4) by using the R-estimates discussed in Chapter 3. In this chapter we assume that a specified score function, a(i) = ϕ(i/(n+1)), has been chosen which, without loss of generality, has been standardized; recall (S.1), (3.4.10). A convenient way of obtaining the fit is by the QR-decomposition algorithm on the incidence matrix W; see Section 3.7.3.

For the R fits used in the examples of this chapter, we will use the cell median model; that is, Model (4.2.4) with T(F) = 0, where T denotes the median functional and F denotes the distribution of the random errors eij. We will use the score function ϕ(u) to obtain the R fit of this model. Let $\mathbf{X}\hat{\boldsymbol{\Delta}}_1$ denote the fitted value. As discussed in Chapter 3, $\mathbf{X}\hat{\boldsymbol{\Delta}}_1$ lies in the column space of the centered matrix X = (I − H1)W1. We then estimate the intercept as

$\hat{\alpha}_S = \operatorname{med}_{1 \leq i \leq n}\{ Z_i - \mathbf{x}_i'\hat{\boldsymbol{\Delta}}_1 \} ,$   (4.2.7)

where xi is the ith row of X. The final fitted value and the residuals are, respectively,

$\hat{\mathbf{Z}} = \hat{\alpha}_S\mathbf{1} + \mathbf{X}\hat{\boldsymbol{\Delta}}_1$   (4.2.8)
$\hat{\mathbf{e}} = \mathbf{Z} - \hat{\mathbf{Z}} .$   (4.2.9)

Note that $\hat{\mathbf{Z}}$ lies in the column space of W and that, further, T(Fn) = 0, where Fn denotes the empirical distribution function of the residuals and T is the median location functional. Denote the fitted value of the response Yij as $\hat{Y}_{ij}$. Given $\hat{\mathbf{Z}}$, we find from (4.2.4) that $\hat{\boldsymbol{\mu}} = (\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}'\hat{\mathbf{Z}}$. Because W is an incidence matrix, the estimate of µi is the common fitted value of the ith cell which, for future reference, is given by

$\hat{\mu}_i = \hat{Y}_{ij} ,$   (4.2.10)

for any j = 1, . . . , ni. In the examples below, we shall denote the R fit described in this paragraph by stating that the model was fit using Wilcoxon scores and the residuals were adjusted to have median zero.

It follows from Section 3.5.2 that $\hat{\boldsymbol{\mu}}$ is asymptotically normal with mean µ. To do inference based on these estimates of µi we need their standard errors, but these can be obtained immediately from the variance of the fitted values given by expression (3.9.38). First note that the leverage value for an observation in the ith cell is 1/ni and, hence, the leverage value for the centered design is hc,i = hi − n−1 = (n − ni)/(nni). Therefore by (3.9.38) the approximate variance of $\hat{\mu}_i$ is given by

$\operatorname{Var}(\hat{\mu}_i) \doteq \frac{1}{n_i}\tau_\varphi^2 + \frac{1}{n}(\tau_S^2 - \tau_\varphi^2) , \quad i = 1, \ldots, k ;$   (4.2.11)

see Exercise 4.8.18.

Let τ̂ϕ and τ̂S denote, respectively, the estimates of τϕ and τS presented in Section 3.7.1. The estimated approximate variance of $\hat{\mu}_i$ is expression (4.2.11) with these estimates in place of τϕ and τS. Define the minimum value of the dispersion function as DE, i.e.,

$DE = D(\hat{\mathbf{e}}) = \sum_{i=1}^{k}\sum_{j=1}^{n_i} a(R(\hat{e}_{ij}))\hat{e}_{ij} .$   (4.2.12)

The symbol DE stands for the dispersion of the errors and is analogous to the LS sums of squared errors, SSE. Upon fitting such a model, a residual analysis as discussed in Section 3.9 should be conducted to assess the goodness of fit of the model.

Example 4.2.1. LDL Cholesterol of Quail


Table 4.2.1: Data for Example 4.2.1.

Drug    LDL Cholesterol
I        52  67  54  69 116  79  68  47 120  73
II       36  34  47 125  30  31  30  59  33  98
III      52  55  66  50  58 176  91  66  61  63
IV       62  71  41 118  48  82  65  72  49

Table 4.2.2: Estimates of Location Levels for the Quail Data

Drug          Wilcoxon Fit        LS Fit
Compound      Est.     SE       Est.     SE
I             67.0     6.3      74.5     9.6
II            42.0     6.3      52.3     9.6
III           63.0     6.3      73.8     9.6
IV            62.0     6.6      67.6    10.1

Thirty-nine quail were randomly assigned to four diets, each diet containing a different drug compound, which, hopefully, would reduce low density lipid (LDL) cholesterol. The drug compounds are labeled: I, II, III, and IV. At the end of the prescribed experimental time the LDL cholesterol of each quail was measured. The data are displayed in Table 4.2.1.

From the boxplots, Panel A of Figure 4.2.1, it appears that Drug Compound II was more effective than the other three in lowering LDL. The data appear to be positively skewed with a long right tail. We fitted the data using Wilcoxon scores, ϕ(u) = √12(u − 1/2), and adjusted the residuals to have median 0. Panel B of Figure 4.2.1 displays the Wilcoxon residuals versus fitted values. The long right tail of the error distribution is apparent from this plot. The lower panels of Figure 4.2.1 involve the internal R-studentized residuals, (3.9.31), with the benchmarks ±2. The internal R-studentized residuals detected six outlying data points, while the normal q−q plot of these residuals clearly shows the skewness.

The estimates of τϕ and τS are 19.19 and 21.96, respectively. For comparison, the LS estimate of σ was 30.49. Table 4.2.2 displays the Wilcoxon and LS estimates of the cell locations along with their standard errors. The Wilcoxon and LS estimates of the location levels are quite different, as they should be since they estimate different functionals under asymmetric errors. The long right tail has drawn out the LS estimates. The standard errors of the Wilcoxon estimates are much smaller than their LS counterparts.

This data set was taken from a much larger study discussed in McKean, Vidmar and Sievers (1989). Most of the data in that study exhibited long right tails. The left tails were also long; hence, transformations such as logarithms were not effective. Scores more appropriate for positively skewed data were used with considerable success in this study. These scores are briefly discussed in Example 2.5.1.
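For readers working in R, the following is a hedged sketch of how the quail fit might be reproduced with the Rfit package of Kloke and McKean. It is not the software used for the analyses in this book, so details of the output may differ slightly; Rfit uses Wilcoxon scores by default and, by default, estimates the intercept by the median of the residuals, consistent with the cell median model used here.

library(Rfit)
ldl = c(52, 67, 54, 69, 116, 79, 68, 47, 120, 73,    # Drug I
        36, 34, 47, 125, 30, 31, 30, 59, 33, 98,     # Drug II
        52, 55, 66, 50, 58, 176, 91, 66, 61, 63,     # Drug III
        62, 71, 41, 118, 48, 82, 65, 72, 49)         # Drug IV
drug = factor(rep(c("I", "II", "III", "IV"), c(10, 10, 10, 9)))
fit = rfit(ldl ~ drug)    # Wilcoxon R fit of the oneway design
summary(fit)              # coefficients with standard errors based on tau-hat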


[Figure 4.2.1. Panel A: Comparison boxplots for data of Example 4.2.1; Panel B: Wilcoxon residual plot; Panel C: Wilcoxon internal R-studentized residual plot; Panel D: Wilcoxon internal R-studentized residual normal q−q plot.]


4.2.2 Rank-Based Tests of H0 : µ1 = · · · = µk

Consider Model (4.2.4). A hypothesis of interest in the oneway design is that there are no differences in the levels of A; i.e.,

$H_0 : \mu_1 = \cdots = \mu_k \;\text{ versus }\; H_1 : \mu_i \neq \mu_{i'} \text{ for some } i \neq i' .$   (4.2.13)

Define the k × (k − 1) matrix M as

$\mathbf{M} = \begin{bmatrix} 1 & -1 & 0 & 0 & \cdots & 0 \\ 1 & 0 & -1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ 1 & 0 & 0 & 0 & \cdots & -1 \end{bmatrix} .$   (4.2.14)

Then Mµ = ∆1, (4.2.3), and, hence, H0 is equivalent to Mµ = 0. Note that the rows of M form k − 1 linearly independent contrasts in the vector µ. If the design matrix given in (4.2.6) is used then the null hypothesis is simply H0 : Ik−1∆1 = 0; that is, all the regression coefficients are zero. We shall discuss two rank-based tests for this hypothesis.

One appropriate test statistic is the gradient test statistic, (3.5.8), which is given by

$T = \hat{\sigma}_a^{-2}\, S(\mathbf{Z})'(\mathbf{X}'\mathbf{X})^{-1}S(\mathbf{Z}) ,$   (4.2.15)

where $S(\mathbf{Z})' = (S_2(\mathbf{Z}), \ldots, S_k(\mathbf{Z}))$ for

$S_i(\mathbf{Z}) = \sum_{j=1}^{n_i} a(R(Z_{ij})) ,$   (4.2.16)

and, as defined in Theorem 3.5.1,

$\hat{\sigma}_a^2 = (n-1)^{-1} \sum_{i=1}^{n} a^2(i) .$   (4.2.17)

Based on Theorem 3.5.2, a level α test for H0 versus H1 is:

Reject H0 in favor of H1 if T ≥ χ²(α, k − 1),   (4.2.18)

where χ²(α, k − 1) denotes the upper level α critical value of the χ²-distribution with k − 1 degrees of freedom. Because the design matrix X of Model (4.2.6) is an incidence matrix, the gradient test simplifies. First note that

$(\mathbf{X}'\mathbf{X})^{-1} = \frac{1}{n_1}\mathbf{J} + \operatorname{diag}\left\{ \frac{1}{n_2}, \ldots, \frac{1}{n_k} \right\} ,$   (4.2.19)

where J is a (k − 1) × (k − 1) matrix of ones; see Exercise 4.8.1. Since the scores sum to 0, we have that S(Z)′1k−1 = −S1(Z). Upon combining these results, the gradient test statistic simplifies to

$T_\varphi = \hat{\sigma}_a^{-2}\, S(\mathbf{Z})'(\mathbf{X}'\mathbf{X})^{-1}S(\mathbf{Z}) = \hat{\sigma}_a^{-2} \sum_{i=1}^{k} \frac{1}{n_i} S_i^2(\mathbf{Z}) .$   (4.2.20)


Table 4.2.3: Analysis of Dispersion Table for the Hypotheses (4.2.13)

Source    D=Dispersion    df       MD             F
A         RD              k − 1    RD/(k − 1)     Fϕ
Error                     n − k    τ̂ϕ/2

For Wilcoxon scores further simplification is possible. In this case

$S_i(\mathbf{Y}) = \sum_{j=1}^{n_i} \sqrt{12}\left( \frac{R(Y_{ij})}{n+1} - \frac{1}{2} \right) = \frac{\sqrt{12}}{n+1}\, n_i \left( \bar{R}_i - \frac{n+1}{2} \right) ,$   (4.2.21)

where $\bar{R}_i$ denotes the average of the ranks from sample i. Also for Wilcoxon scores $\hat{\sigma}_a^2 = n/(n+1)$. Thus the test statistic for Wilcoxon scores is given by

$H_W = \frac{12}{n(n+1)} \sum_{i=1}^{k} n_i \left( \bar{R}_i - \frac{n+1}{2} \right)^2 .$   (4.2.22)

This is the Kruskal-Wallis (1952) test statistic. It is distribution free under H0. In the case of two levels, the Kruskal-Wallis test is equivalent to the MWW test discussed in Chapter 2; see Exercise 4.8.2. From the discussion on efficiency, Section 3.5, the efficiency results for the Kruskal-Wallis test are the same as for the MWW. A small computational sketch is given below.

As a second rank-based test, we briefly discuss the drop in dispersion test for H0 versus H1 given by expression (3.6.12). Under the null hypothesis, the underlying distributions of the k levels of A are the same; hence, the reduced model is

$Y_{ij} = \mu + e_{ij} ,$   (4.2.23)

where µ is a common location functional. Thus there are no parameters to fit in this case and the reduced model dispersion is

$DT = D(\mathbf{Y}) = \sum_{i=1}^{k}\sum_{j=1}^{n_i} a(R(Y_{ij}))Y_{ij} .$   (4.2.24)

The symbol DT denotes the total dispersion in the problem, which is analogous to the classical LS total variation, SST. Hence the reduction in dispersion is RD = DT − DE, where DE is defined in expression (4.2.12), and the drop in dispersion test is given by Fϕ = (RD/(k − 1))/(τ̂ϕ/2). As discussed in Section 3.6, this should be compared with F-critical values having k − 1 and n − k degrees of freedom. The analysis can be summarized in an analysis of dispersion table of the form given in Table 4.2.3.

Because the Kruskal-Wallis test is a gradient test, the drop in dispersion test and the Kruskal-Wallis test have the same asymptotic efficiency; see Section 3.6. The third test discussed in that section, the Wald type test, will be discussed below, for this hypothesis, in Section 4.2.5. A computational sketch of Fϕ follows.
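Continuing the R sketch above (with ldl, drug, and the rfit object fit), the drop in dispersion statistic can be computed from Jaeckel's dispersion at the full and reduced fits. Since the dispersion is invariant to a common location shift (the scores sum to zero), the reduced model dispersion DT of (4.2.24) can be evaluated at the observations themselves. We assume the rfit object carries the estimate of τϕ in the component tauhat.

disp = function(e) {              # Jaeckel's dispersion, Wilcoxon scores
  n = length(e)
  sum(sqrt(12) * (rank(e) / (n + 1) - 0.5) * e)
}
DE = disp(residuals(fit))         # full model dispersion, (4.2.12)
DT = disp(ldl)                    # reduced model dispersion, (4.2.24)
k = 4; n = length(ldl)
Fphi = ((DT - DE) / (k - 1)) / (fit$tauhat / 2)
pf(Fphi, k - 1, n - k, lower.tail = FALSE)    # p-value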


Example 4.2.2. LDL Cholesterol of Quail, Example 4.2.1 continued

For the hypothesis of no difference among the locations of the cholesterol levels of the drug compounds, hypotheses (4.2.13), the results of the LS F-test, the Kruskal-Wallis test, and the drop in dispersion test can be found in Table 4.2.4. The long right tail of the errors spoiled the LS test statistic. Using it, one would conclude that there is no significant difference among the drug compounds, which is inconsistent with the boxplots in Figure 4.2.1. On the other hand, both robust procedures detect the differences among the drug compounds, especially the drop in dispersion test statistic.

Table 4.2.4: Tests of Hypotheses (4.2.13) for the Quail Data

Test Procedure     Statistic   Scale σ̂ or τ̂ϕ   df        p-value
LS, FLS            1.14        30.5             (3, 35)   .35
Drop Disp., Fϕ     3.77        19.2             (3, 35)   .02
Kruskal-Wallis     7.18                         3         .067

4.2.3 Tests of General Contrasts

As discussed above, the parameters and hypotheses of interest for Model (4.2.4) can usually be defined in terms of contrasts. In this section we discuss R-estimates and tests of contrasts. We will apply these results to more complicated designs in the remainder of the chapter.

For Model (4.2.4), consider general linear hypotheses of the form

$H_0 : \mathbf{M}\boldsymbol{\mu} = \mathbf{0} \;\text{ versus }\; H_A : \mathbf{M}\boldsymbol{\mu} \neq \mathbf{0} ,$   (4.2.25)

where M is a q × k matrix of contrasts (rows sum to 0) of full row rank. Since M is a matrix of contrasts, the hypothesis H0 is invariant to the intercept and, hence, can be tested by the R-test statistic discussed in Section 3.6. To obtain the test based on the reduction of dispersion, Fϕ, discussed in Section 3.6, we need to fit the reduced model ω, which is Model (4.2.4) subject to H0. Let D(ω) denote the minimum value of the dispersion function for the reduced model fit and let RD = D(ω) − DE denote the reduction in dispersion. Note that RD is analogous to the reduction in sums of squares of the traditional LS analysis. The test statistic is given by Fϕ = (RD/q)/(τ̂ϕ/2). As discussed in Chapter 3, this statistic should be compared with F-critical values having q and n − k degrees of freedom. The test can be summarized in the analysis of dispersion table found in Table 4.2.5, which is analogous to the traditional analysis of variance table for summarizing a LS analysis.

Example 4.2.3. Poland China Pigs

This data set, presented on page 87 of Scheffe (1959), concerns the birth weights of Poland China pigs in eight litters. For convenience we have tabled the data in Table 4.2.6. There are 56 pigs in the eight litters. The sample sizes of the litters vary from 4 to 10.


Table 4.2.5: Analysis of Dispersion Table for H0 : Mµ = 0

Source     D=Dispersion    df       MD              F
Mµ = 0     RD              q        MRD = RD/q      Fϕ
Error                      n − k    τ̂ϕ/2

Table 4.2.6: Birth weights of Poland China Pigs by litter.

Litter   Birth Weight
1        2.0  2.8  3.3  3.2  4.4  3.6  1.9  3.3  2.8  1.1
2        3.5  2.8  3.2  3.5  2.3  2.4  2.0  1.6
3        3.3  3.6  2.6  3.1  3.2  3.3  2.9  3.4  3.2  3.2
4        3.2  3.3  3.2  2.9  3.3  2.5  2.6  2.8
5        2.6  2.6  2.9  2.0  2.0  2.1
6        3.1  2.9  3.1  2.5
7        2.6  2.2  2.2  2.5  1.2  1.2
8        2.5  2.4  3.0  1.5

In Exercise 4.8.3 a residual analysis is conducted for this data set and the hypothesis (4.2.13) is tested. Here we are only concerned with the following contrast suggested by Scheffe. Assume that litters 1, 3, and 4 were sired by one boar while the other litters were sired by another boar. The contrast of interest is that the average litter birthweights of the pigs sired by the boars are the same; i.e., H0 : h = 0 where

$h = \frac{1}{3}(\mu_1 + \mu_3 + \mu_4) - \frac{1}{5}(\mu_2 + \mu_5 + \mu_6 + \mu_7 + \mu_8) .$   (4.2.26)

For this hypothesis, the matrix M of expression (4.2.25) is given by [5 −3 5 5 −3 −3 −3 −3]. The value of the LS F-test statistic is 11.19, while Fϕ = 15.65. There are 1 and 48 degrees of freedom for this hypothesis so both tests are highly significant. Hence both tests indicate a difference in the average litter birthweights of the boars. The reason Fϕ is more significant than FLS is clear from the residual analysis found in Exercise 4.8.3.

4.2.4 More on Estimation of Contrasts and Location

In this section we further explore the relationship between Models (4.2.4) and (4.2.6). This will enable us to formulate the contrast procedure based on the pseudo-observations discussed in Section 4.2.5. Recall that the design matrix X of expression (4.2.6) is a centered design matrix based on the last k − 1 columns of the design matrix W of expression (4.2.4). To determine the relationship between the parameters of these models, we simply match them by location parameter for each level. For Model (4.2.4) the location parameter for level i is of course µi. In terms of Model (4.2.6), the location parameter for the first level is $\alpha - \sum_{j=2}^{k}(n_j/n)\Delta_j$ and that of the ith level is $\alpha - \sum_{j=2}^{k}(n_j/n)\Delta_j + \Delta_i$. Hence, letting $\delta = \sum_{j=2}^{k}(n_j/n)\Delta_j$, we can write the vector of level locations as

$\boldsymbol{\mu} = (\alpha - \delta)\mathbf{1} + (0, \boldsymbol{\Delta}_1')' ,$   (4.2.27)

Table 4.2.7: All Pairwise 95% Confidence Intervals for the Quail Data Based on the Wilcoxon Fit

Difference    Estimate    Confidence Interval
µ2 − µ1       −25.0       (−42.7, −7.8)
µ2 − µ3       −21.0       (−38.6, −3.8)
µ2 − µ4       −20.0       (−37.8, −2.0)
µ1 − µ3         4.0       (−13.41, 21.41)
µ1 − µ4         5.0       (−12.89, 22.89)
µ3 − µ4         1.0       (−16.89, 18.89)

where ∆1 is defined in expression (4.2.3).

Let h = Mµ be a q × 1 vector of contrasts of interest (i.e., rows of M sum to 0). Write M as [m M1]. Then by (4.2.27) we have

$h = \mathbf{M}\boldsymbol{\mu} = \mathbf{M}_1\boldsymbol{\Delta}_1 .$   (4.2.28)

By Corollary 3.5.1, $\hat{\boldsymbol{\Delta}}_1$ has an asymptotic $N(\boldsymbol{\Delta}_1, \tau_\varphi^2(\mathbf{X}'\mathbf{X})^{-1})$ distribution. Hence, based on expression (4.2.28), the asymptotic variance-covariance matrix of the estimate $\mathbf{M}\hat{\boldsymbol{\mu}}$ is

$\Sigma_h = \tau_\varphi^2\, \mathbf{M}_1(\mathbf{X}'\mathbf{X})^{-1}\mathbf{M}_1' .$   (4.2.29)

Note that the only difference for the LS-fit is that σ² would be substituted for τϕ². Expressions (4.2.28) and (4.2.29) are the basic relationships used by the pseudo-observations discussed in Section 4.2.5.

To illustrate these relationships, suppose we want a confidence interval for µi − µi′. Based on expression (4.2.29), an asymptotic (1 − α)100% confidence interval is given by

$\hat{\mu}_i - \hat{\mu}_{i'} \pm t_{(\alpha/2,\, n-k)}\, \hat{\tau}_\varphi \sqrt{\frac{1}{n_i} + \frac{1}{n_{i'}}} ;$   (4.2.30)

i.e., the same as LS except that τ̂ϕ replaces σ̂.

Example 4.2.4. LDL Cholesterol of Quail, Example 4.2.1 continued

To illustrate the above confidence intervals, Table 4.2.7 displays the six pairwise confidence intervals among the four drug compounds. On the basis of these intervals Drug Compound II seems best. This conclusion, though, is based on six simultaneous confidence intervals and the problem of overall confidence in these intervals needs to be addressed. This is discussed in some detail in Section 4.3, at which time we will return to this example. A sketch of the computation of one of these intervals is given below.
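For instance, the interval for µ2 − µ1 of Table 4.2.7 can be approximated from the Rfit sketch above via expression (4.2.30); differences from the tabled values reflect the different software used for the analyses in this book. The cell estimates are taken as the common fitted values (4.2.10).

tau = fit$tauhat                         # estimate of tau_phi
ni = tapply(ldl, drug, length)
muhat = tapply(fit$fitted.values, drug, unique)   # cell estimates, (4.2.10)
est = muhat["II"] - muhat["I"]
half = qt(0.975, sum(ni) - 4) * tau * sqrt(1/ni["II"] + 1/ni["I"])
c(est - half, est + half)                # interval (4.2.30) for mu_2 - mu_1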


Medians Model

Suppose we are interested in estimates of the level locations themselves. We first need to select a location functional. For the discussion we will use the median; although, for any other functional, only a change of the scale parameter τS is necessary. Assume then that the R residuals have been adjusted so that their median is zero. As discussed above, (4.2.10), the estimate of µi is $\hat{Y}_{ij}$, for any j = 1, . . . , ni, where $\hat{Y}_{ij}$ is the fitted value of Yij. Let $\hat{\boldsymbol{\mu}} = (\hat{\mu}_1, \ldots, \hat{\mu}_k)'$. Further, $\hat{\boldsymbol{\mu}}$ is asymptotically normal with mean µ and the asymptotic variance of $\hat{\mu}_i$ is given in expression (4.2.11). As Exercise 4.8.4 shows, the asymptotic covariance between estimates of location levels is

$\operatorname{cov}(\hat{\mu}_i, \hat{\mu}_{i'}) = (\tau_S^2 - \tau_\varphi^2)/n ,$   (4.2.31)

for i ≠ i′. As Exercises 4.8.4 and 4.8.18 show, expressions (3.9.38) and (4.2.31) lead to a verification of the confidence interval (4.2.30).

Note that if the scale parameters are the same, say, τS = τϕ = κ, then the approximate variance reduces to κ²/ni and the covariances are 0. Hence, in this case, the estimates µ̂i are asymptotically independent. This occurs in the following two ways:

1. For the fit of Model (4.2.4) use a score function ϕ(u) which satisfies (S2) and use the location functional based on the corresponding signed-rank score function ϕ⁺(u) = ϕ((u + 1)/2). The asymptotic theory, though, requires the assumption of symmetric errors. If the Wilcoxon score function is used, then the location functional would result in the residuals being adjusted so that the median of the Walsh averages of the adjusted residuals is 0.

2. Use the l1 score function ϕS(u) = sgn(u − (1/2)) to fit Model (4.2.4) and use the median as the location functional. This of course is equivalent to using an l1 fit of Model (4.2.4). The estimate of µi is then the cell median.

4.2.5 Pseudo-observations

We next discuss a convenient way to estimate and test contrasts once an R-fit of Model (4.2.4) is obtained. Let Ẑ denote the R-fit of this model, let ê denote the vector of residuals, let a(R(ê)) denote the vector of scored residuals, and let τ̂ϕ be the estimate of τϕ. Let HW denote the projection matrix onto the column space of the incidence matrix W. Because of (3.2.13), the fact that 1n is in the column space of W, and that the scores sum to 0, we get

    HW a(R(ê)) = 0 .    (4.2.32)

Define the constant ζϕ by

    ζ²ϕ = (n − k) / Σ_{i=1}^{n} a²(i) .    (4.2.33)



Because n⁻¹ Σ a²(i) ≐ 1, ζ²ϕ ≐ 1 − (k/n). Then the vector of pseudo-observations is defined by

    Z̃ = Ẑ + τ̂ϕ ζϕ a(R(ê)) ;    (4.2.34)

see Bickel (1976) for a discussion of the pseudo-observations. For the Wilcoxon scores, we get

    ζ²W = (n − k)(n + 1) / (n(n − 1)) .    (4.2.35)

Let Ẑ_LS and ê_LS denote the LS fit and residuals, respectively, of the pseudo-observations (4.2.34). By (4.2.32) we have

    Ẑ_LS = Ẑ ,    (4.2.36)

and, hence,

    ê_LS = τ̂ϕ ζϕ a(R(ê)) .    (4.2.37)

From this last expression and the definition of ζϕ, (4.2.33), we get

    (1/(n − k)) ê′_LS ê_LS = τ̂²ϕ .    (4.2.38)

Therefore the LS fit of the pseudo-observations results in the R-fit of Model (4.2.4) and, further, the LS estimator MSE is τ̂²ϕ.

The pseudo-observations can be used to compute the R-inference on a given contrast, say, h = Mµ. If the pseudo-observations are used in place of the observations in a LS algorithm, then based on the variance-covariance of ĥ, (4.2.29), expressions (4.2.36) and (4.2.38) imply that the resulting LS estimate of h and the LS estimate of the corresponding variance-covariance matrix of ĥ will be the R-estimate of h and the R-estimate of the corresponding variance-covariance matrix of ĥ. Similarly for testing the hypotheses (4.2.25), the LS test using the pseudo-observations will result in the Wald-type R-test, Fϕ,Q, of these hypotheses given by expression (3.6.14). Pseudo-observations will be used in many of the subsequent examples of this chapter.

The pseudo-observations are easy to obtain. For example, the package rglm returns the pseudo-observations directly in the output data set of fits and residuals. These pseudo-observations can then be read into Minitab or another package for further analyses. In Minitab itself, for Wilcoxon scores the robust regression command, RREGR, has the subcommand PSEUDO, which returns the pseudo-observations. Then the pseudo-observations can be used in place of the observations in Minitab commands to obtain the R-inference on contrasts.
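For readers working outside Minitab or rglm, the following minimal sketch shows how the pseudo-observations (4.2.34) could be formed from the pieces of an R-fit. It assumes Wilcoxon scores; the function names and the NumPy implementation are ours, not part of either package, and τ̂ϕ (tau_hat) is taken as given from the fit.

    import numpy as np

    def wilcoxon_scores(n):
        # Wilcoxon scores a(i) = sqrt(12) * (i/(n+1) - 1/2), i = 1, ..., n
        i = np.arange(1, n + 1)
        return np.sqrt(12.0) * (i / (n + 1.0) - 0.5)

    def pseudo_observations(fitted, resid, tau_hat, k):
        """Pseudo-observations (4.2.34): fitted + tau_hat * zeta * a(R(e))."""
        n = len(resid)
        a = wilcoxon_scores(n)
        scored = a[np.argsort(np.argsort(resid))]   # a(R(e_i)): score of each residual's rank
        zeta = np.sqrt((n - k) / np.sum(a ** 2))    # zeta_phi from (4.2.33)
        return fitted + tau_hat * zeta * scored

Running LS routines on the returned vector then reproduces the R-inference on contrasts, as expressions (4.2.36)-(4.2.38) guarantee.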

Example 4.2.5. LDL Cholesterol of Quail, Example 4.2.1 continued

To demonstrate how easy it is to use the pseudo-observations with Minitab, reconsider Example 4.2.1 concerning LDL cholesterol levels of quail under the treatment of 4 different drug compounds. Suppose we want the Wald-type R-test of the hypothesis that there is no



effect due to the different drug compounds. The pseudo-observations were obtained based on the full model R-fit and placed in column 10, and the corresponding levels were placed in column 11. The Wald Fϕ,Q statistic is obtained by using the following Minitab command:

oneway c10 c11

The execution of this command returned the value Fϕ,Q = 3.45 with a p-value of .027, which is close to the result based on the Fϕ-statistic.

4.3 Multiple Comparison Procedures

Our basic model for this section is Model (4.2.4); although, much of what we do here pertains to the rest of this chapter also. We will discuss methods based on the R-fit of this model as described in Section 4.2.1. In particular, we shall use the same notation to describe the fit; i.e., the R-residuals and fitted values are, respectively, ê and Ẑ, the estimates of µ and τϕ are µ̂ and τ̂ϕ, and the vector of pseudo-observations is Z̃. We also denote the pseudo-observation corresponding to the observation Yij as Z̃ij.

Besides tests of contrasts of level locations, often we want to make comparisons among the location levels, for instance, all pairwise comparisons among the levels. With so many comparisons to make, overall confidence becomes a problem. Multiple comparison procedures, MCP, have been developed to offset this problem. In this section we will explore several of these methods in terms of robust estimation. These procedures can often be directly robustified. It is our intent to show this for several popular methods, including the Tukey T-method. We will also discuss simultaneous, rank-based tests among levels. We will show how simple Minitab code, based on the pseudo-observations, suffices to compute these procedures. It is not our purpose to give a full discussion of MCPs. Such discussions can be found, for example, in Miller (1981) and Hsu (1996).

We will focus on the problem of simultaneous inference for all (k choose 2) comparisons µi − µi′ based on an R-fit of Model (4.2.4). Recall from (4.2.30) that a (1 − α)100% asymptotic confidence interval for µi − µi′ based on the R-fit of Model (4.2.4) is given by

    µ̂i − µ̂i′ ± t(α/2, n−k) τ̂ϕ √(1/ni + 1/ni′) .    (4.3.1)

In this section we say that this confidence interval has experiment error rate α. As Exercise 4.8.8 illustrates, simultaneous confidence for several such intervals can easily slip well below 1 − α. The error rate for a simultaneous confidence procedure will be called its family error rate.

We next describe six robust multiple comparison procedures for the problem of all pairwise comparisons. The error rates for them are based on asymptotics. But note that the same is true for MCPs based on least squares when the normality assumption is not valid. Sufficient Minitab code is given to demonstrate how easily these procedures can be performed.



1. Bonferroni Procedure. This is the simplest of all the MCPs. Suppose we are interested in making l comparisons of the form µi − µi′. If each individual confidence interval, (4.3.1), has confidence 1 − (α/l), then the family error rate for these l simultaneous confidence intervals is at most α; see Exercise 4.8.8. To do all comparisons just select l = (k choose 2). Hence the R-Bonferroni procedure declares

    levels i and i′ differ if |µ̂i − µ̂i′| ≥ t(α/(2l), n−k) τ̂ϕ √(1/ni + 1/ni′) .    (4.3.2)

The asymptotic family error rate for this procedure is at most α.

To obtain these Bonferroni intervals by Minitab, assume that the pseudo-observations Z̃ij are in column 10, the corresponding levels i are in column 11, and the constant α/(k choose 2) is in k1. Then the following two lines of Minitab code will obtain the intervals:

oneway c10 c11;

bonferroni k1.
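Outside Minitab, rule (4.3.2) is a few lines in any language. Here is a minimal sketch (our own function name, not from the book) that takes the R-estimates of the level locations and of τϕ as inputs:

    import numpy as np
    from scipy import stats

    def bonferroni_pairwise(mu_hat, n_i, tau_hat, alpha=0.05):
        """All-pairwise R-Bonferroni intervals, rule (4.3.2)."""
        mu_hat, n_i = np.asarray(mu_hat, float), np.asarray(n_i)
        k, n = len(mu_hat), int(n_i.sum())
        l = k * (k - 1) // 2                              # number of comparisons
        tcrit = stats.t.ppf(1 - alpha / (2 * l), n - k)   # t(alpha/(2l), n-k)
        intervals = {}
        for i in range(k):
            for j in range(i + 1, k):
                hw = tcrit * tau_hat * np.sqrt(1.0 / n_i[i] + 1.0 / n_i[j])
                d = mu_hat[i] - mu_hat[j]
                intervals[(i + 1, j + 1)] = (d - hw, d + hw)
        return intervals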

2. Protected LSD Procedure of Fisher. First use the test statistic Fϕ to test the hypothesis that all the level locations are the same, (4.2.13), at level α. If H0 is rejected, then the usual level 1 − α confidence intervals, (4.3.1), are used to make the comparisons. If we fail to reject H0, then either no comparisons are made or the comparisons are made using the Bonferroni procedure. In summary, this procedure declares

    levels i and i′ differ if Fϕ ≥ F(α, k−1, n−k) and |µ̂i − µ̂i′| ≥ t(α/2, n−k) τ̂ϕ √(1/ni + 1/ni′) .    (4.3.3)

This MCP has no family error rate, but the initial test does offer protection. In a large simulation study conducted by Carmer and Swanson (1973), this procedure based on LS estimates performed quite well in terms of power and level. In fact, it was one of the two procedures recommended. In a moderate sized simulation study conducted by McKean, Vidmar and Sievers (1989), the robust version of the protected LSD discussed here performed similarly to the analogous LS procedure on normal errors and had a considerable gain in power over LS for error distributions with heavy tails.

Upon rejection of the hypotheses (4.2.13) at level α, the following Minitab code will obtain the comparison confidence intervals. Assume that the pseudo-observations Z̃ij are in column 10, the corresponding levels i are in column 11, and the constant α is in k1.

oneway c10 c11;

fisher k1.



The F-test that appears in the AOV table upon execution of these commands is Wald's test statistic Fϕ,Q for the hypotheses (4.2.13). Recall from Chapter 3 that it is asymptotically equivalent to Fϕ under the null and local hypotheses.

3. Tukey's T Procedure. This is a multiple comparison procedure for the set of all contrasts, h = Σ_{i=1}^{k} ciµi where Σ_{i=1}^{k} ci = 0. Assume that the sample sizes for the levels are the same, say, n1 = · · · = nk = m. The basic geometric fact for this procedure is the following equivalence due to Tukey (see Miller, 1981): for t > 0,

    max_{1≤i,i′≤k} |(µ̂i − µi) − (µ̂i′ − µi′)| ≤ t  ⟺  Σ ciµ̂i − (t/2) Σ |ci| ≤ Σ ciµi ≤ Σ ciµ̂i + (t/2) Σ |ci| ,    (4.3.4)

for all contrasts Σ_{i=1}^{k} ciµi where Σ_{i=1}^{k} ci = 0. Hence to obtain simultaneous confidence intervals for the set of all contrasts, we need the distribution of the left side of this inequality. But first note that

    (µ̂i − µi) − (µ̂i′ − µi′) = [(µ̂i − µi) − (µ̂1 − µ1)] − [(µ̂i′ − µi′) − (µ̂1 − µ1)] = (∆̂i1 − ∆i1) − (∆̂i′1 − ∆i′1) .

Hence, we need only consider the asymptotic distribution of ∆̂1, which by (4.2.19) is Nk−1(∆1, (τ²ϕ/m)[I + J]).

Recall that if v1, . . . , vk are iid N(0, σ²), then max_{1≤i,i′≤k} |vi − vi′|/σ has the Studentized range distribution, with k and ∞ degrees of freedom. But we can write this random variable as

    max_{1≤i,i′≤k} |vi − vi′| = max_{1≤i,i′≤k} |(vi − v1) − (vi′ − v1)| .

Hence we need only consider the random vector of shifts v′1 = (v2 − v1, . . . , vk − v1) to determine the distribution. But v1 has distribution Nk−1(0, σ²[I + J]). Based on this, it follows from the asymptotic distribution of ∆̂1 that if we substitute qα;k,∞ τϕ/√m for t in expression (4.3.4), where qα;k,∞ is the upper α critical value of a Studentized range distribution with k and ∞ degrees of freedom, then the asymptotic probability of the resulting expression will be 1 − α.

The parameter τϕ, though, is unknown and must be replaced by an estimate. In the Tukey T procedure for LS, the parameter is σ. The usual estimate s of σ is such that if the errors are normally distributed, then the random variable (n − k)s²/σ² has a χ² distribution and is independent of the LS location estimates. In this case the Studentized range distribution with k and n − k degrees of freedom is used. If the errors are not normally distributed, then this distribution leads to an approximate simultaneous confidence procedure. We proceed similarly for the procedure based on the robust estimates. Replacing t in expression (4.3.4) by qα;k,n−k τ̂ϕ/√m, where qα;k,n−k is the upper α critical value of a Studentized range distribution with k and n − k degrees of freedom, yields an approximate simultaneous confidence procedure for



the set of all contrasts. As discussed before, though, small sample studies have shown that the Student t-distribution works well for inference based on the robust estimates. Hopefully these small sample properties carry over to the approximation based on the Studentized range distribution. Further research is needed in this area.

Tukey's procedure requires that the level sample sizes are the same, which is frequently not the case in practice. A simple adjustment due to Kramer (1956) results in the simultaneous confidence intervals

    µ̂i − µ̂i′ ± (1/√2) qα;k,n−k τ̂ϕ √(1/ni + 1/ni′) .    (4.3.5)

These intervals have approximate family error rate α. This approximation is often called the Tukey-Kramer procedure.

In summary, the R-Tukey-Kramer procedure declares

    levels i and i′ differ if |µ̂i − µ̂i′| ≥ (1/√2) qα;k,n−k τ̂ϕ √(1/ni + 1/ni′) .    (4.3.6)

The asymptotic family error rate for this procedure is approximately α.

To obtain these R-Tukey intervals by Minitab, assume that the pseudo-observations Z̃ij are in column 10, the corresponding levels i are in column 11, and the constant α is in k1. Then the following two lines of Minitab code will obtain the intervals:

oneway c10 c11;

tukey k1.
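As a stand-alone alternative, rule (4.3.6) can be coded directly. The sketch below (our own helper) assumes SciPy's studentized_range distribution, available in recent SciPy releases, for the critical value qα;k,n−k:

    import numpy as np
    from scipy import stats

    def tukey_kramer(mu_hat, n_i, tau_hat, alpha=0.05):
        """R-Tukey-Kramer intervals (4.3.5): the LS rule with tau_hat in place of s."""
        mu_hat, n_i = np.asarray(mu_hat, float), np.asarray(n_i)
        k, n = len(mu_hat), int(n_i.sum())
        q = stats.studentized_range.ppf(1 - alpha, k, n - k)   # q_{alpha;k,n-k}
        intervals = {}
        for i in range(k):
            for j in range(i + 1, k):
                hw = (q / np.sqrt(2.0)) * tau_hat * np.sqrt(1.0 / n_i[i] + 1.0 / n_i[j])
                d = mu_hat[i] - mu_hat[j]
                intervals[(i + 1, j + 1)] = (d - hw, d + hw)
        return intervals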

4. Pairwise Tests Based on Joint Rankings. The above methods were concerned with estimation and simultaneous confidence intervals for effects. Traditionally, simultaneous nonparametric inference has dealt with comparison tests. The first such procedure we will discuss is based on the combined rankings of all levels; i.e., the rankings that are used by the Kruskal-Wallis test. We will discuss this procedure using the Wilcoxon score function; see Exercise 4.8.10 for the analogous procedure based on a selected score function. Assume a common level sample size m. Denote the average of the ranks for the ith level by R̄i· and let R̄′1 = (R̄2· − R̄1·, . . . , R̄k· − R̄1·). Using the results of Chapter 3, under H0 : µ1 = · · · = µk, R̄1 is asymptotically Nk−1(0, (k(n + 1)/12)(Ik−1 + Jk−1)); see Exercise 4.8.9. Hence, as in the development of the Tukey procedure above, we have the asymptotic result

    PH0 [ max_{1≤i,i′≤k} |R̄i· − R̄i′·| ≤ √(k(n + 1)/12) qα;k,∞ ] ≐ 1 − α .    (4.3.7)

Hence the joint ranking procedure declares

    levels i and i′ differ if |R̄i· − R̄i′·| ≥ √(k(n + 1)/12) qα;k,∞ .    (4.3.8)



This procedure has an approximate family error rate of α. This procedure is not easy to invert for simultaneous confidence intervals for the effects. We would recommend the Tukey procedure, (3), with Wilcoxon scores for corresponding simultaneous inference on the effects.

An approximate level α test of the hypotheses (4.2.13) is given by

    Reject H0 if max_{1≤i,i′≤k} |R̄i· − R̄i′·| ≥ √(k(n + 1)/12) qα;k,∞ ,    (4.3.9)

although the Kruskal-Wallis test is the usual choice in practice.

The joint ranking procedure, (4.3.9), is approximate for the unequal sample size case. Miller (1981, p. 166) describes a procedure similar to the Scheffé procedure in LS which is valid for the unequal sample size case, but which is also much more conservative; see Exercise 4.8.6. A Tukey-Kramer type rule, (4.3.6), for the procedure (4.3.9) is

    levels i and i′ differ if |R̄i· − R̄i′·| ≥ √(n(n + 1)/24) √(1/ni + 1/ni′) qα;k,∞ .    (4.3.10)

The small sample properties of this approximation need to be studied.
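Rule (4.3.10) is simple to implement from the combined ranks. The following sketch is our own illustration (a large degrees-of-freedom value stands in for the ∞ in qα;k,∞):

    import numpy as np
    from scipy import stats

    def joint_ranking_mcp(samples, alpha=0.05):
        """Pairwise comparisons from joint rankings, rule (4.3.10).
        samples: list of 1-D arrays, one per level."""
        sizes = [len(s) for s in samples]
        n, k = sum(sizes), len(samples)
        ranks = stats.rankdata(np.concatenate(samples))      # Kruskal-Wallis ranks
        ends = np.cumsum(sizes)
        rbar = [ranks[e - m:e].mean() for e, m in zip(ends, sizes)]
        q = stats.studentized_range.ppf(1 - alpha, k, 1e6)   # large df approximates q_{alpha;k,inf}
        declared = []
        for i in range(k):
            for j in range(i + 1, k):
                cut = np.sqrt(n * (n + 1) / 24.0) * np.sqrt(1.0 / sizes[i] + 1.0 / sizes[j]) * q
                if abs(rbar[i] - rbar[j]) >= cut:
                    declared.append((i + 1, j + 1))
        return declared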

5. Pairwise Tests Based on Separate Rankings. For this procedure we compare levels i and i′ by ranking the combined ith and i′th samples. Let R^(i′)_i· denote the sum of the ranks of the ith level when it is compared with the i′th level. Assume that the sample sizes are the same, n1 = · · · = nk = m. For 0 < α < 1, define the critical value cα;m,k by

    PH0 [ max_{1≤i,i′≤k} R^(i′)_i· ≥ cα;m,k ] = α .    (4.3.11)

Tables for this critical value at the 5% and 1% levels are provided in Miller (1981). The separate ranking procedure declares

    levels i and i′ differ if R^(i′)_i· ≥ cα;m,k or R^(i)_i′· ≥ cα;m,k .    (4.3.12)

This procedure has an approximate family error rate of α and was developed independently by Steel (1960) and Dwass (1960).

An approximate level α test of the hypotheses (4.2.13) is given by

    Reject H0 if max_{1≤i,i′≤k} R^(i′)_i· ≥ cα;m,k ,    (4.3.13)

although as noted for the last procedure the Kruskal-Wallis test is the usual choice in practice.

Corresponding simultaneous confidence intervals can be constructed similar to the confidence intervals developed in Chapter 2 for a shift in locations based on the MWW



statistic. For the confidence interval for the ith and i′th samples corresponding to the test (4.3.12), first form the differences between the two samples, say,

    D^(ii′)_kl = Yik − Yi′l ,  1 ≤ k, l ≤ m .

Let D(1), . . . , D(m²) denote the ordered differences. Note here that the critical value cα;m,k is for the sum of the ranks and not statistics of the form S⁺R, (2.4.2). But recall that these versions of the Wilcoxon statistic differ by the constant m(m + 1)/2. Hence the confidence interval is

    ( D(cα;m,k − m(m+1)/2 + 1) , D(m² − cα;m,k + m(m+1)/2) ) .    (4.3.14)

It follows that this set of confidence intervals, over all pairs of levels i and i′, forms a set of simultaneous 1 − α confidence intervals. Using the iterative algorithm discussed in Section 3.7.2, the differences need not be formed.

6. Procedures Based on Pairwise Distribution-Free Confidence Intervals. Simple pairwise (separate ranking) multiple comparison procedures can be easily formulated based on the MWW confidence intervals discussed in Section 2.4.2. Such procedures do not depend on equal sample sizes. As an illustration, we describe a Bonferroni-type procedure for the situation of all l = (k choose 2) comparisons. For the levels (i, i′), let [D^(ii′)_(cα/(2l)+1), D^(ii′)_(nini′−cα/(2l))) denote the (1 − (α/l))100% confidence interval discussed in Section 2.4.2 based on the nini′ differences between the ith and i′th samples. This procedure declares

    levels i and i′ differ if 0 is not in [D^(ii′)_(cα/(2l)+1), D^(ii′)_(nini′−cα/(2l))) .    (4.3.15)

This Bonferroni-type procedure has family error rate at most α. Note that the asymptotic value for cα/(2l) is given by

    cα/(2l) ≐ (nini′/2) − zα/(2l) √(nini′(ni + ni′ + 1)/12) − .5 ;    (4.3.16)

see (2.4.13). A Protected LSD type procedure can be constructed in the same way, using as the overall test either the Kruskal-Wallis test or the test based on Fϕ; see Exercise 4.8.12.
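A sketch of one such interval (our own helper, using the asymptotic cutoff (4.3.16); for very small samples the cutoff should instead be taken from exact MWW tables):

    import numpy as np
    from scipy import stats

    def mww_bonferroni_interval(x, y, l, alpha=0.05):
        """Distribution-free CI for the shift between two levels, from the ordered
        pairwise differences and cutoff (4.3.16), Bonferroni-corrected over l comparisons."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        ni, nj = len(x), len(y)
        d = np.sort((x[:, None] - y[None, :]).ravel())   # all ni*nj differences
        z = stats.norm.ppf(1 - alpha / (2 * l))
        c = int(np.floor(ni * nj / 2.0 - z * np.sqrt(ni * nj * (ni + nj + 1) / 12.0) - 0.5))
        return d[c], d[ni * nj - c - 1]                  # (D_(c+1), D_(ni*nj − c)), 1-based

The procedure (4.3.15) then declares levels i and i′ different whenever the returned interval excludes 0.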

Example 4.3.1. LDL Cholesterol of Quail, Example 4.2.1 continued

Reconsider the data on the LDL levels of quail subject to four different drug compounds. The full model fit returned the estimate µ̂ = (67, 42, 63, 62). We set α = .05 and ran the first five MCPs on this data set. We used the Minitab code based on pseudo-observations to compute the first three procedures and we obtained the latter two by Minitab commands. A table that helps for the separate rank procedure can be found on page 242 of Lehmann



Table 4.3.1: Drug Compounds Declared Significantly Different by MCPs

    Procedure          Compounds Declared Different   Respective Confidence Intervals
    Bonferroni         (1, 2)                         (1.25, 49.23)
    Fisher             (1, 2), (2, 3), (2, 4)         (7.83, 42.65), (−38.57, −3.75), (−37.79, −2.01)
    Tukey-Kramer       (1, 2)                         (2.13, 48.35)
    Joint Ranking      None
    Separate Ranking   None

(1975), which links the tables in Miller (1981) with a table of family error α values for this procedure. Based on these values, the Minitab MANN command can then be used to obtain the confidence intervals (4.3.14). For each procedure, Table 4.3.1 displays the drug compounds that were declared significantly different by the procedure. The first three procedures, based on effects, declared drug compounds 1 and 2 different. Fisher's PLSD also declared drug compound 2 different from drug compounds 3 and 4. The usual summary schematic based on Fisher's is

    2    4 3 1
         _____

which shows the separation of the second drug compound from the other three compounds. On the other hand, the schematic for either the Bonferroni or Tukey-Kramer procedures is

    2 4 3 1
    _____
      _____

which shows that though Treatment 2 is significantly different from Treatment 1, it does not differ significantly from either Treatments 4 or 3. The joint ranking procedure came close to declaring drug compounds 1 and 2 different because the difference in average rankings between these levels was 12.85, slightly less than the critical value of 13.10. The separate ranking procedure declared none different. Its interval, (4.3.14), for compounds 1 and 2 is (−29, 68.99). In comparison, the corresponding confidence interval for the Tukey procedure based on LS is (−14.5, 58.9). Hence, the separate ranking procedure was impaired more by the outliers than least squares.

4.3.1 Discussion

We have presented robust analogues to three of the most popular multiple comparison procedures: the Bonferroni, Fisher's protected least significant difference, and the Tukey T method. These procedures provide the user with estimates of the most interesting parameters in these experiments, namely the simple contrasts between treatment effects, and estimates of standard errors with which to assess these contrasts. The robust analogues are straightforward. Replace the LS estimates of the effects by the robust estimates and replace the estimate of σ by the estimate of τϕ. Furthermore, these robust procedures can



easily be obtained by using the pseudo-observations as discussed in Section 4.2.5. Hence, the asymptotic relative efficiency between the LS based MCP and its robust analogue is the same as the ARE between the LS estimator and the robust estimator, as discussed in Chapters 1-3. In particular, if Wilcoxon scores are used, then the ARE of the Wilcoxon MCP to that of the LS MCP is .955 provided the errors are normally distributed. For error distributions with longer tails than the normal, the Wilcoxon MCP is generally much more efficient than its LS MCP counterpart.

The theory behind the robust MCPs is asymptotic; hence, the error rates are approximate. But this is true also for the LS MCPs when the errors are not normally distributed. Verification of the validity and power of both LS and robust MCPs is based on small sample studies. The small sample study by McKean et al. (1989) demonstrated that the Wilcoxon Fisher PLSD had the same validity as its LS counterpart over a variety of error distributions for a oneway design. For normal errors, the LS MCP had slightly more empirical power than the Wilcoxon. Under error distributions with heavier tails than the normal, though, the empirical power of the Wilcoxon MCP was larger than the empirical power of the LS MCP.

The decision as to which MCP to use has long been debated in the literature. It is not our purpose here to discuss these issues. We refer the reader to books devoted to MCPs for discussions on this topic; see, for example, Miller (1981) and Hsu (1996). We do note that, besides τϕ replacing σ, the error part of the robust MCP is the same as that of LS; hence, arguments that one procedure dominates another in a certain situation will hold for the robust MCP as well as for LS.

There has been some controversy on the two simultaneous rank-based testing procedures that we presented: pairwise tests based on joint rankings and pairwise tests based on separate rankings. Miller (1981) and Hsu (1996) both favor the tests based on separate rankings because in the separate rankings procedure the comparison between two levels is not influenced by any information from the other levels, which is not the case for the procedure based on joint rankings. They point out that this is true of the LS procedure, also, since the comparison between two levels is based only on the difference in sample means for those two levels, except for the estimate of scale. However, Lehmann (1975) points out that the joint ranking makes use of all the information in the experiment while the separate ranking procedure does not. The spacings between all the points are information that is utilized by the joint ranking procedure and that is lost in the separate ranking procedure. The quail data, Example 4.3.1, is illustrative. The separate ranking procedure did quite poorly on this data set. The sample sizes are moderate, and in the comparisons, when half of the information is lost, the outliers impaired the procedure. In contrast, the joint ranking procedure came close to declaring drug compounds 1 and 2 different. Consider also the LS procedure on this data set. It is true that the outliers impaired the sample means, but the estimated variance, being a weighted average of the level sample variances, was drawn down some over all the information; for example, instead of using s3 = 37.7 in the comparisons with the third level, the LS procedure uses a pooled standard deviation s = 30.5. There is no way to make a similar correction to the separate ranking procedure. Also, the separate rankings procedure



can lead to inconsistencies in that it could declare Treatment A superior to Treatment B and Treatment B superior to Treatment C, while not declaring Treatment A superior to Treatment C; see page 245 of Lehmann (1975) for a simple illustration.

4.4 Twoway Crossed Factorial

For this design we have two factors, say, A at a levels and B at b levels, that may have an effect on the response. Each combination of the ab factor settings is a treatment. For a completely randomized design, n subjects are selected at random from the reference population and then nij of these subjects are randomly assigned to the (i, j)th treatment combination; hence, n = ΣΣ nij. Let Yijk denote the response for the kth subject at the (i, j)th treatment combination, let Fij denote the distribution function of Yijk, and let µij = T(Fij). Then the unstructured or full model is

    Yijk = µij + eijk ,    (4.4.1)

where the eijk are iid with distribution and density functions F and f, respectively. Let T denote the location functional of interest and assume without loss of generality that T(F) = 0. The submodels described below utilize the two-way structure of the design.

Model (4.4.1) is the same as the oneway design model (4.2.1) of Section 4.2. Using the scores a(i) = ϕ(i/(n + 1)), the R-fit of this model can be obtained as described in that section. We will use the same notation as in Section 4.2; i.e., ê denotes the residuals from the fit adjusted so that T(Fn) = 0, where Fn is the empirical distribution function of the residuals; µ̂ denotes the R-estimate of µ, the ab × 1 vector of the µij's; and τ̂ϕ denotes the estimate of τϕ. For the examples discussed in this section, Wilcoxon scores are used and the residuals are adjusted so that their median is 0.

An interesting submodel is the additive model, which is given by

    µij = µ̄·· + (µ̄i· − µ̄··) + (µ̄·j − µ̄··) .    (4.4.2)

For the additive model, the profile plots, (µij versus i or j), are parallel. A diagnostic check for the additive model is to plot the sample profile plots, (µ̂ij versus i or j), and see how close the profiles are to parallel. The null hypotheses of interest for this model are the main effect hypotheses given by

    H0A : µ̄i· = µ̄i′· for all i, i′ = 1, . . . , a , and    (4.4.3)
    H0B : µ̄·j = µ̄·j′ for all j, j′ = 1, . . . , b .    (4.4.4)

Note that there are a − 1 and b − 1 free constraints for H0A and H0B, respectively. Under H0A, the levels of A have no effect on the response.

The interaction parameters are defined as the differences between the full model parameters and the additive model parameters; i.e.,

    γij = µij − [µ̄·· + (µ̄i· − µ̄··) + (µ̄·j − µ̄··)] = µij − µ̄i· − µ̄·j + µ̄·· .    (4.4.5)



The hypothesis of no interaction is given by

    H0AB : γij = 0 ,  i = 1, . . . , a ,  j = 1, . . . , b .    (4.4.6)

Note that there are (a − 1)(b − 1) free constraints for H0AB. Under H0AB the additive model holds.

Historically, nonparametric tests for interaction were developed in an ad hoc fashion.

They generally do not appear in nonparametric texts and this has been a shortcoming of the area. Sawilowsky (1990) provides an excellent review of nonparametric approaches to testing for interaction. The methods we present are simply part of the general R-theory of testing general linear hypotheses in linear models and they are analogous to the traditional LS tests for interactions.

All these hypotheses are contrasts in the parameters µij of the oneway model, (4.4.1); hence they can easily be tested with the rank-based analysis as described in Section 4.2.3. Usually the interaction hypothesis is tested first. If H0AB is rejected, then there is difficulty in the interpretation of the main effect hypotheses, H0A and H0B. In the presence of interaction, H0A concerns the cell mean averaged over Factor B, which may have little practical significance. In this case multiple comparisons, (see below), between cells may be of more practical significance. If H0AB is not rejected, then there are two schools of thought. The "pooling" school would take the additive model, (4.4.2), as the new full model to test main effects. The "non-poolers" would stick with the unstructured model, (4.4.1), as the full model. In either case, with little evidence of interaction present, the main effect hypotheses are more interpretable.
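For concreteness, testing H0AB as a contrast hypothesis Mµ = 0 requires a contrast matrix whose rows span the interaction space; one standard choice is sketched below. The NumPy helper and the cell ordering are our own conventions, offered as an illustration.

    import numpy as np

    def interaction_contrasts(a, b):
        """Rows are the contrasts mu_ij - mu_ib - mu_aj + mu_ab, i < a, j < b,
        which are zero exactly when all interaction parameters (4.4.5) vanish.
        Cells are ordered mu_11, ..., mu_1b, mu_21, ..., mu_ab."""
        M = []
        for i in range(a - 1):
            for j in range(b - 1):
                row = np.zeros(a * b)
                row[i * b + j] = 1.0
                row[i * b + (b - 1)] = -1.0
                row[(a - 1) * b + j] = -1.0
                row[(a - 1) * b + (b - 1)] = 1.0
                M.append(row)
        return np.array(M)

Each row sums to zero, and there are (a − 1)(b − 1) of them, matching the free constraints of H0AB.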

Since Model (4.4.1) is a oneway design, the multiple comparison procedures discussed in Section 4.3 can be used. The crossed structure of the design makes for several interesting families of contrasts. When interaction is present in the model, it is often of interest to consider simple contrasts between cell locations. Here, we will only mention all (ab choose 2) pairwise comparisons. Among others, the Bonferroni, Fisher, and Tukey T procedures described in Section 4.3 can be used. The rule for the Tukey-Kramer procedure is:

    cells (i, j) and (i′, j′) differ if |µ̂ij − µ̂i′j′| ≥ (1/√2) qα;ab,n−ab τ̂ϕ √(1/nij + 1/ni′j′) .    (4.4.7)

The asymptotic family error rate for this procedure is approximately α.

The pseudo-observations discussed in Section 4.2.5 can be used to easily obtain the Wald test statistic, Fϕ,Q, (3.6.14), for tests of hypotheses, and similarly they can be used to obtain multiple comparison procedures for families of contrasts. Simply obtain the R-fit of Model (4.4.1), form the pseudo-observations, (4.2.34), and input these pseudo-observations into a LS package. The analysis of variance table output will contain the Wald-type R-tests of the main effect hypotheses, (H0A and H0B), and the interaction hypothesis, (H0AB). As with a LS analysis, one has to know what main hypotheses are being tested by the LS package. For instance, the main effect hypothesis H0A, (4.4.3), is a Type III sums of squares hypothesis in SAS; see Speed, Hocking, and Hackney (1978).



Table 4.4.1: Data for Example 4.4.1, Lifetimes of Motors (hours)

                  Insulation
    Temp.       1      2      3
    200 F    1176   2856   3528
             1512   3192   3528
             1512   2520   3528
             1512   3192
             3528   3528
    225 F     624    816    720
              624    912   1296
              624   1296   1488
              816   1392
             1296   1488
    250 F     204    300    252
              228    324    300
              252    372    324
              300    372
              324    444

Example 4.4.1. Lifetime of Motors

This problem is an unbalanced twoway design which is discussed on page 471 of Nelson (1982); see, also, McKean and Sievers (1989) for a discussion of R-analyses of this data set. The responses are lifetimes of three motor insulations, (1, 2, and 3), which were tested at three different temperatures (200 F, 225 F, and 250 F). The design is an unbalanced 3 × 3 factorial with 5 replicates in 6 of the cells and 3 replicates in the others. The data are displayed in Table 4.4.1. Following Nelson, as the response variable we considered the logs of the lifetimes. Let Yijk denote the log of the lifetime of the kth replicate at temperature level i and which used motor insulation j. As a full model we will use Model (4.4.1). The results found in Tables 4.4.2 and 4.4.3 are for the R-analysis based on Wilcoxon scores with the intercept estimated by the median of the residuals. Hence the R-estimates of µij estimate the true cell medians.

The cell median profile plot based on the Wilcoxon estimates, Panel A of Figure 4.4.1, indicates that some interaction is present. Panel B of Figure 4.4.1 is a plot of the internal Wilcoxon studentized residuals, (3.9.31), versus fitted values. It indicates randomness but also shows several outlying data points which are, also, quite evident in the q−q plot, Panel C of Figure 4.4.1, of the Wilcoxon studentized residuals versus logistic population quantiles. This plot indicates that score functions for distributions with heavier right tails than the logistic would be more appropriate for this data; see McKean and Sievers (1989) for more discussion on score selection for this example. Panel D of Figure 4.4.1, a casewise plot of the Wilcoxon studentized residuals, readily identifies the outliers as the fifth observation in cell (1, 1), the fifth observation in cell (2, 1), and the first observation in cell (2, 3).



Figure 4.4.1: Panel A: Cell median profile plot for the data of Example 4.4.1, cell medians based on the Wilcoxon fit; Panel B: Internal Wilcoxon studentized residual plot; Panel C: Logistic q−q plot based on internal Wilcoxon studentized residuals; Panel D: Casewise plot of the Wilcoxon studentized residuals.



Table 4.4.2: Analysis of Dispersion Table for the Lifetime of Motors Data

    Source                  RD      df   MRD    FR
    Temperature (T)         26.40    2   13.20  121.7
    Motor Insulation (I)     3.72    2    1.86   17.2
    T × I                    1.24    4    .310    2.86
    Error                           30    .108

Table 4.4.3: Contrasts for Differences in Insulations at Temperature 200

    Contrast    Estimate   Confidence Interval
    µ11 − µ12     −.76     (−1.22, −.30)
    µ11 − µ13     −.84     (−1.37, −.32)
    µ12 − µ13     −.09     (−.62, .44)

Table 4.4.2 is an ANOVA table for the R-analysis. Since F(.05, 4, 30) = 2.69, the test of interaction is significant at the .05 level. This confirms the profile plot, Panel A. It is interesting to note that the least squares F-test statistic for interaction was 1.30 and, hence, was not significant. The LS analysis was impaired because of the outliers. The row effect hypothesis is that the average row effects are the same. The column effect hypothesis is similarly defined. Both main effects are significant. In the presence of interaction, though, we have interpretation difficulties with main effects.

In Nelson's discussion of this problem it was of interest to estimate the simple contrasts of mean lifetimes of insulations at the temperature setting of 200. Since this is the first temperature setting, these contrasts are µ1j − µ1j′. Table 4.4.3 displays the estimates of these contrasts along with corresponding confidence intervals formed under the Tukey-Kramer procedure as discussed above, (4.4.7). It seems that insulations 2 and 3 are better than insulation 1 at the temperature of 200, but between insulations 2 and 3 there is no discernible difference.

In this example, the number of observations per parameter was less than five. To offset uneasiness over the use of the rank analysis for such small samples, McKean and Sievers (1989) conducted a Monte Carlo study on this design. The empirical levels and powers of the R-analysis were good over situations similar to those suggested by this data.

4.5 Analysis of Covariance

Often there are extraneous variables available besides the response variable. Hopefully these variables explain some of the noise in the data. These variables are called covariates or concomitant variables, and the traditional analysis of such data is called analysis of covariance.

As an example, consider the oneway model (4.2.1), with k levels, and suppose we have a single covariate, say, xij. A first order model is yij = µi + βxij + eij. This model, however,



assumes that the covariate behaves the same within each treatment combination. A more general model is

    yij = µi + βxij + γixij + eij ,  j = 1, . . . , ni ,  i = 1, . . . , k .    (4.5.1)

Hence the slope at the ith level is βi = β + γi and, thus, each treatment combination has its own linear model. There are two natural hypotheses for this model: H0C : β1 = · · · = βk and H0L : µ1 = · · · = µk. If H0C is true, then the differences between the levels of Factor A are just the differences in the location parameters µi for a given value of the covariate. In this case, contrasts in these parameters are often of interest, as well as the hypothesis H0L. If H0C is not true, then the covariate and the treatment combinations interact. For example, whether one treatment combination is better than another may depend on where in factor space the responses are measured. Thus as in crossed factorial designs, the interpretation of main effect hypotheses may not be clear; for more discussion on this point see Huitema (1980).

The above example is easily generalized. Consider a designed experiment with k treatment combinations. This may be a oneway model with a factor at k levels, a twoway crossed factorial design model with k = ab treatment combinations, or some other design. Suppose we have ni observations at treatment level i. Let n = Σ ni denote the total sample size. Denote by W the full model incidence matrix and µ the k × 1 vector of location parameters. Suppose we have p covariates. Let U be the n × p matrix of covariates and let Z denote the n × 1 vector of responses. Let β denote the corresponding p × 1 vector of regression coefficients. Then the general covariate model is given by

    Z = Wµ + Uβ + Vγ + e ,    (4.5.2)

where V is the n × pk matrix consisting of all column products of W and U, and the pk × 1 vector γ is the vector of interaction parameters between the design and the covariates.
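As a concrete illustration of how the pieces of (4.5.2) fit together, the following sketch (our own helper, not from the book or rglm) builds W and V from the treatment labels and the covariate matrix:

    import numpy as np

    def covariate_design(levels, U):
        """Design pieces of the covariate model (4.5.2); a minimal sketch.
        levels : length-n array of treatment labels (k distinct values)
        U      : n x p matrix of covariates
        Returns the n x k incidence matrix W and the n x pk interaction matrix V."""
        levels = np.asarray(levels)
        U = np.asarray(U, dtype=float).reshape(len(levels), -1)
        W = (levels[:, None] == np.unique(levels)[None, :]).astype(float)
        # V: columnwise products of W with each covariate column
        V = np.hstack([W * U[:, [j]] for j in range(U.shape[1])])
        return W, V

The response vector Z is then fit to the columns of [W U V] by the chosen R-estimates.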

The first hypothesis of interest is

    H0C : γ11 = · · · = γpk versus HAC : γij ≠ γi′j′ for some (i, j) ≠ (i′, j′) .    (4.5.3)

Other hypotheses of interest consist of contrasts in the location parameters. In general, let M be a q × k matrix of contrasts and consider the hypotheses

    H0 : Mµ = 0 versus HA : Mµ ≠ 0 .    (4.5.4)

Matrices M of interest are related to the design. For a oneway design, M may be a (k − 1) × k matrix that tests all the location levels to be the same, while for a twoway design it may be used to test that all interactions between the two factors are zero. But as noted above, the hypothesis H0C concerns interaction between the covariate and design spaces. While the interpretation of the latter hypotheses, (4.5.4), is clear under H0C, it may not be if H0C



Table 4.5.1: Snake Data

       Placebo        Treatment 2     Treatment 3     Treatment 4
    Initial  Final  Initial  Final  Initial  Final  Initial  Final
     Dist.   Dist.   Dist.   Dist.   Dist.   Dist.   Dist.   Dist.
       25      25      17      11      32      24      10       8
       13      25       9       9      30      18      29      17
       10      12      19      16      12       2       7       8
       25      30      25      17      30      24      17      12
       10      37       6       1      10       2       8       7
       17      25      23      12       8       0      30      26
        9      31       7       4       5       0       5       8
       18      26       5       3      11       1      29      29
       27      28      30      26       5       1       5      29
       17      29      19      20      25      10      13       9

The rank-based fit of the full Model (4.5.2) proceeds as described in Chapter 3, after a score function is chosen. Once the fitted values and residuals have been obtained, the diagnostic procedures described in Section 3.9 can be used to assess the fit. With a good fit, the model estimates of the parameters and their standard errors can be used to form confidence intervals and regions, and multiple comparison procedures can be used for simultaneous inference. Reduced models appropriate for the hypotheses of interest can be obtained, and the values of the test statistic Fϕ can be used to test them. This analysis can be conducted by the package rglm. It can also be conducted by fitting the full model and obtaining the pseudo-observations. These in turn can be substituted for the responses in a package which performs the traditional LS analysis of covariance in order to obtain the R-analysis.

Example 4.5.1. Snake Data

As an example of an analysis of covariance problem, consider the data set discussed by Afifi and Azen (1972). The data are reproduced in Table 4.5.1. It involves four methods, three of which are intended to reduce a human's fear of snakes. Forty subjects were given a behavior approach test to determine how close they could walk to a snake without feeling uncomfortable. This score was taken as the covariate. Next they were randomly assigned to one of the four treatments, with ten subjects assigned to each treatment. The first treatment was a control (placebo), while the other three treatments were different methods intended to reduce a human's fear of snakes. The response was a subject's score on the behavior approach test after treatment. Hence, the sample size is 40 and the number of independent variables in Model (4.5.2) is 8. Wilcoxon scores were used to conduct the analysis of covariance described above, with the residuals adjusted to have median 0.

The plots of the response variable versus the covariate for each treatment are found in Panels A - D of Figure 4.5.1.



Figure 4.5.1: Panels A - D: For the Snake Data, scatterplots of Final Distance versus Initial Distance for the Placebo and Treatments 2-4, overlaid with the Wilcoxon fit (solid line) and the LS fit (dashed line); Panel E: Internal Wilcoxon studentized residual plot; Panel F: Wilcoxon studentized logistic q−q plot.

It is clear from the plots that the relationship between the response and the covariate varies with the treatment, from virtually no relationship for the first treatment (placebo) to a fairly strong linear relationship for the third treatment. Outliers are apparent in these plots also. These plots are overlaid with the Wilcoxon and LS fits of the full model, Model (4.5.1). Panels E and F of Figure 4.5.1 are, respectively, the internal Wilcoxon studentized residual plot and the internal Wilcoxon studentized logistic q−q plot. The outliers stand out in these plots. From the residual plot, the data appear to be heteroscedastic and, as Exercise 4.8.14 shows, the square root transformation of the response does lead to a better fit.

Table 4.5.2 displays the Wilcoxon and LS estimates of the linear models for each treatment. As this table and Figure 4.5.1 show, the larger discrepancies between the Wilcoxon and LS estimates occur for those treatments which have large outliers. The estimates of τϕ and σ are 3.92 and 5.82, respectively;



Table 4.5.2: Wilcoxon and LS Estimates of the Linear Models for Each Treatment

                 Wilcoxon Estimates          LS Estimates
    Treatment   Int. (SE)    Slope (SE)    Int. (SE)    Slope (SE)
    1           27.3 (3.6)   −.02 (.20)    25.6 (5.3)   .07 (.29)
    2           −1.78 (2.8)  .83 (.15)     −1.39 (4.0)  .83 (.22)
    3           −6.7 (2.4)   .87 (.12)     −6.4 (3.5)   .87 (.17)
    4            2.9 (2.4)   .66 (.13)      7.8 (3.4)   .49 (.19)

Table 4.5.3: Analysis of Dispersion (Wilcoxon) for the Snake Data

    Source      D=Dispersion   df   MD      F
    H0C             24.06       3   8.021   4.09
    Treatment       74.89       3   24.96   12.7
    Error                      32   1.96

hence, as the table shows, the estimated standard errors of the Wilcoxon estimates are lower than their LS counterparts.

Table 4.5.3 displays the analysis of dispersion table for this data. Note that Fϕ strongly rejects H0C (the p-value is .015). This confirms the discussion above based on Figure 4.5.1. The second hypothesis tested is no treatment effect, H0 : µ1 = · · · = µ4. Although Fϕ strongly rejects this hypothesis also, in light of the results for H0C, the practical interpretation of such a decision is not obvious. The value of the LS F-test for H0C is 2.34 (p-value .078). If H0C is not rejected, then the LS analysis could lead to an invalid interpretation. The outliers spoiled the LS analysis of this data set. As shown in Exercise 4.8.15, both the R-analysis and the LS analysis strongly reject H0C for the square root transformation of the response.

4.6 Further Examples

In this section we present two further data examples. Our main purpose in this section is to show how easy it is to use the rank-based analysis on more complicated models. Each example is a three-way crossed factorial design. The first has replicates while the second involves a covariate. Besides displaying tests of the effects, we also consider estimates and standard errors of contrasts of interest.

Example 4.6.1. Marketing Data

This data set is drawn from an exercise on page 953 of Neter et al. (1996). A marketing firm's research consultant studied the effects that three factors have on the quality of work performed under contract by independent marketing research agencies. The three factors and their levels are: Fee level, ((1) High, (2) Average, and (3) Low); Scope, ((1) All contract work performed in house, (2) Some subcontracted out); Supervision, ((1) Local supervision, (2) Traveling supervisors). The response was the quality of the work performed as measured by an index. Four agencies were chosen for each level combination. For convenience the data are displayed in Table 4.6.1.



Table 4.6.1: Marketing Data for Example 4.6.1

                            Supervision
                 Local Supervision      Traveling Supervision
    Fee Level   In House   Sub-out     In House   Sub-out
    High          124.3     115.1        112.7      88.2
                  120.6     119.9        110.2      96.0
                  120.7     115.4        113.5      96.4
                  122.6     117.3        108.6      90.1
    Average       119.3     117.2        113.6      92.7
                  188.9     114.4        109.1      91.1
                  125.3     113.4        108.9      90.7
                  121.4     120.0        112.3      87.9
    Low            90.9      89.9         78.6      58.6
                   95.3      83.0         80.6      63.5
                   88.8      86.5         83.5      59.8
                   92.0      82.7         77.1      62.3

Hence, the design is a 3 × 2 × 2 crossed factorial with 4 replications, which we shall write as

    yijkl = µijk + eijkl ,  i = 1, . . . , 3;  j, k = 1, 2;  l = 1, . . . , 4 ,    (4.6.1)

where yijkl denotes the response for the lth replicate, at Fee i, Scope j, and Supervision k. Wilcoxon scores were selected for the fit, with residuals adjusted to have median 0. Panels A and B of Figure 4.6.1 show, respectively, the residual and normal q−q plots for the internal R-studentized residuals, (3.9.31), based on this fit. The scatter in the residual plot is fairly random and flat. There do not appear to be any outliers. The main trend in the normal q−q plot indicates tails lighter than those of a normal distribution. Hence, the fit is good and we proceed with the analysis.

Table 4.6.2 displays the tests of the effects based on the LS and Wilcoxon fits. The Wald-type Fϕ,Q statistic based on the pseudo-observations is also given. The LS and Wilcoxon analyses agree, which is not surprising based on the residual plot. The main effects are highly significant and the only significant interaction is the interaction between Scope and Supervision.

As a subsequent analysis, we shall consider nine contrasts of interest. We will use the Bonferroni method based on the pseudo-observations as discussed in Section 4.3. We used Minitab to obtain the results that follow. Because the factor Fee does not interact with the other two factors, the contrasts of interest for this factor are: µ̄1·· − µ̄2··, µ̄1·· − µ̄3··, and µ̄2·· − µ̄3··. Table 4.6.3 presents the estimates of these contrasts and the 95% Bonferroni confidence intervals, which are given by the estimate of the contrast ± t(.05/18; 36) τ̂ϕ √(2/16) ≐ ±2.64. From these results, quality of work significantly improves for either high or average fees over low fees, while the difference between high and average fees is insignificant.
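The Bonferroni half-widths quoted here and below are easy to verify; a stand-alone check (our own snippet, using the value τ̂ϕ = 2.53 from Table 4.6.2):

    from scipy import stats

    tau_hat, df, alpha, l = 2.53, 36, 0.05, 9              # nine contrasts in all
    tcrit = stats.t.ppf(1 - alpha / (2 * l), df)           # t(.05/18; 36)
    print(round(tcrit * tau_hat * (2 / 16) ** 0.5, 2))     # Fee contrasts: about 2.64
    print(round(tcrit * tau_hat * (2 / 12) ** 0.5, 2))     # Scope-Supervision cells: about 3.04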




Figure 4.6.1: Panel A: Wilcoxon studentized residual plot for the data of Example 4.6.1; Panel B: Wilcoxon studentized residual normal q−q plot.

Table 4.6.2: Tests of Effects for the Market Data

    Effect               df    FLS    Fϕ     Fϕ,Q
    Fee                   2   679.   207.   793.
    Scope                 1   248.   160.   290.
    Supervision           1   518.   252.   596.
    Fee×Scope             2   .108   .098   .103
    Fee×Super.            2   .053   .004   .002
    Scope×Super.          1   77.7   70.2   89.6
    Fee×Scope×Super.      2   .266   .532   .362
    σ̂ or τ̂ϕ             36   2.72   2.53   2.53



Table 4.6.3: Contrasts of Interest for the Market Data

    Contrast        Estimate   Confidence Interval
    µ̄1·· − µ̄2··      1.05     (−1.59, 3.69)
    µ̄1·· − µ̄3··     31.34     (28.70, 33.98)
    µ̄2·· − µ̄3··     30.28     (27.64, 32.92)

Table 4.6.4: Parameters of Interest for the Market Data

    Parameter   µ̄·11    µ̄·12    µ̄·21    µ̄·22
    Estimate    111.9   101.0   106.4   81.64

Since the factors Scope and Supervision interact, but do not interact separately or jointly with the factor Fee, the parameters of interest are the simple contrasts among µ̄·11, µ̄·12, µ̄·21 and µ̄·22. Table 4.6.4 displays the estimates of these parameters. Using α = .05, the Bonferroni bound for a simple contrast here is t(.05/18; 36) τ̂ϕ √(2/12) ≐ 3.04. Hence all 6 simple pairwise contrasts among these parameters are significantly different from 0. In particular, averaging over fees, the best quality of work occurs when all contract work is done in house and under local supervision. The source of the interaction between the factors Scope and Supervision is also clear from these estimates.

Example 4.6.2. Pigs and Diets

This data set is discussed on page 291 of Rao (1973). It concerns the effect of diets on the growth rate of pigs. There are three diets, called A, B and C. Besides the diet classification, the pigs were classified according to their pens (5 levels) and sex (2 levels). Their initial weight was also recorded as a covariate. The data are displayed in Table 4.6.5.

The design is a 5 × 3 × 2 crossed factorial with only one replication. For comparison purposes, we will use the same model that Rao used, which is a fixed effects model with main effects and the two-way interaction between the factors Diet and Sex. Letting yijk and xijk denote, respectively, the growth rate in pounds per week and the initial weight of the pig in pen i, on diet j and of sex k, this model is given by:

    yijk = µ + αi + βj + γk + (βγ)jk + δxijk + eijk ,  i = 1, . . . , 5;  j = 1, 2, 3;  k = 1, 2 .    (4.6.2)

For convenience we have written the model as an overparameterized model, although we could have expressed it as a cell means model with constraints for the interaction effects which are assumed to be 0. The effects of interest are the diet effects, βj.

We fit the model using the Wilcoxon scores. The analysis could also be carried out using pseudo-observations and Minitab. Panels A and B of Figure 4.6.2 display the residual plot and normal q−q plot of the internal R-studentized residuals based on the Wilcoxon fit. The residual plot shows the three outliers. The outliers are prominent in the q−q plot, but note that even the remaining plotted points indicate an error distribution with heavier tails than the normal. Not surprisingly, the estimate of τϕ is smaller than that of σ, .413 and .506, respectively. The largest outlier corresponds to the 6th pig, which had the lowest initial weight (recall that the internal R-studentized residuals account for position in factor space) but whose response was above the first quartile. The second largest outlier corresponds to the pig which had the lowest response.



Table 4.6.5: Data for Example 4.6.2

    Pen   Diet   Sex   Initial Wt.   Growth Rate
     1     A      G        48            9.94
           B      G        48           10.00
           C      G        48            9.75
           C      H        48            9.11
           B      H        39            8.51
           A      H        38            9.52
     2     B      G        32            9.24
           C      G        28            8.66
           A      G        32            9.48
           C      H        37            8.50
           A      H        35            8.21
           B      H        38            9.95
     3     C      G        33            7.63
           A      G        35            9.32
           B      G        41            9.34
           B      H        46            8.43
           C      H        42            8.90
           A      H        41            9.32
     4     C      G        50           10.37
           A      H        48           10.56
           B      G        46            9.68
           A      G        46           10.98
           B      H        40            8.86
           C      H        42            9.51
     5     B      G        37            9.67
           A      G        32            8.82
           C      G        30            8.57
           B      H        40            9.20
           C      H        40            8.76
           A      H        43           10.42



Figure 4.6.2: Panel A: Internal Wilcoxon studentized residual plot for the data of Example 4.6.2; Panel B: Internal Wilcoxon studentized residual normal q−q plot.


Table 4.6.6 displays the results of the tests for the effects for the LS and Wilcoxon fits. The pseudo-observations were obtained based on the Wilcoxon fit and were input as the responses in SAS to obtain Fϕ,Q using Type III sums of squares. The Wilcoxon analyses based on Fϕ and Fϕ,Q are quite similar. All three tests indicate no interaction between the factors Diet and Sex, which clarifies the interpretation of the main effects. Also all three agree on the need for the covariate. Diet has a significant effect on weight gain, as does sex. The robust analyses indicate that pen is also a contributing factor.

Table 4.6.7 displays the results of the analyses when the covariate is not taken into account. It is interesting to note, here, that the factor Diet is not significant based on the LS fit while it is for the Wilcoxon analyses. The heavy tails of the error distribution, as evident in the residual plots, have foiled the LS analysis.



Table 4.6.6: Test Statistics for the Effects of the Pigs and Diets Data with Initial Weight as a Covariate

    Effect        df    FLS     Fϕ      Fϕ,Q
    Pen            4    2.35    3.65*   3.48*
    Diet           2    4.67*   7.98*   8.70*
    Sex            1    5.05*   8.08*   8.02*
    Diet×Sex       2    0.17    1.12    .81
    Initial Wt.    1   13.7*   19.2*   19.6*
    σ̂ or τ̂ϕ      19    .507    .413    .413
    * Denotes significance at the .05 level

Table 4.6.7: Test Statistics for the Effects of the Pigs and Diets Data with No Covariate

    Effect      df    FLS     Fϕ      Fϕ,Q
    Pen          4    2.95*   4.20*   5.87*
    Diet         2    2.77    4.80*   5.54*
    Sex          1    1.08    3.01    3.83
    Diet×Sex     2    0.55    1.28    1.46
    σ̂ or τ̂ϕ    20    .648    .499    .501
    * Denotes significance at the .05 level

4.7 Rank Transform

In this section we present a short comparison between the rank-based analysis of this chapter and the rank transform analysis (RT). Much of this discussion is drawn from McKean and Vidmar (1994). The main point of this section is to show that often the RT test does not work well for testing hypotheses in factorial designs and more complicated models. Hence, we do not recommend using RT methods. On the other hand, Akritas, Arnold and Brunner (1997) develop a unique approach in which factorial hypotheses are replaced by corresponding nonparametric hypotheses based on cdfs. They then show that RT type methods are appropriate for these nonparametric hypotheses.

As we have pointed out, the rank-based analysis is quite analogous to the LS based traditional analysis. It is based on R-estimates while the traditional analysis is based on LS estimates. The only difference in the geometry of estimation is that the R-estimates are based on the pseudo-norm (3.2.6) while the LS estimates are based on the Euclidean pseudo-norm. The rank-based analysis produces confidence intervals and regions and tests of general linear hypotheses. The diagnostic procedures of Chapter 3 can be used to check the adequacy of the fit of the model and determine outliers and influential points. Furthermore, the efficiency properties discussed for the simple location nonparametric procedures carry over to this analysis. The rank-based analysis offers the user a complete and highly efficient analysis of a linear model as an alternative to the traditional analysis. Further, there are computational algorithms available for these procedures.


Proposed by Conover and Iman (1981), the rank transform (RT) has become a very popular procedure. The RT test of a linear hypothesis consists generally of ranking the dependent variable and then performing the LS test on these ranks. Although in general the RT offers no estimation, and hence no model checking, it is a simple procedure for testing.
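To make the procedure concrete, here is a minimal sketch in R of an RT test for a two-way layout; the data frame dat, with factors A and B and response y, is hypothetical:

    ## Rank transform test: rank the response, then run the usual
    ## LS F-tests on the ranks (data frame 'dat' is hypothetical).
    dat$ry <- rank(dat$y)
    anova(lm(ry ~ A * B, data = dat))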

Some basic differences between the rank-based analysis and RT are readily apparent. In linear models the Yi's are independent but not identically distributed. Hence when the RT is applied indiscriminately to a linear model, the ranking is performed on non-identically distributed items. The rankings in the RT are not "free" of the x's. In contrast, the residuals based on the R-estimates, under Wilcoxon scores, satisfy

    \sum_{i=1}^{n} x_{ij} R(Y_i - x_i'\hat{\beta}_R) \doteq 0, \quad j = 1, \ldots, p.        (4.7.1)

Hence the R-residuals have been adjusted by the fit so that the ranks are orthogonal to the x-space, i.e., the ranks are "free" of the x's. These are the ranks that are used in the R-test statistic Fϕ at the full model. Under H0 this would also be true of the expected ranks of the residuals in the R fit of the reduced model. Note, also, that the statistic Fϕ is invariant to the values of the parameters of the reduced model.

Unlike the rank-based analysis, there is no general supporting theory for the RT. Hora and Conover (1984) presented asymptotic null theory on the RT for treatment effect in a randomized block design with no interaction. Thompson and Ammann (1989) explored the efficiency of this RT, showing, however, that this efficiency depends on the block parameters. RT theory for repeated measures designs has been developed by Akritas (1991, 1993) and Thompson (1991b). These extensions also have the unpleasant trait that their efficiencies depend on nuisance parameters.

Many of these theoretical studies on the RT have raised serious questions concerning the validity of the RT for simple two-way and more complicated designs. For a two-way crossed factorial design, Brunner and Neumann (1986) showed that the RT statistics are not reasonable for testing main effects in the presence of interaction for designs larger than 2 × 2. This was echoed by Akritas (1990), who stated further that RT statistics are not reasonable test statistics for interaction nor most other common hypotheses in either two-way crossed or nested classifications. In several of these articles (see Akritas, 1990 and Thompson, 1991a, 1993), the nonlinear nature of the RT is faulted. For a given model the hypotheses of interest are linear contrasts in model parameters. The rank transform, though, is nonlinear; hence often the original hypothesis is no longer tested by the rank transformed data. The same issue was raised earlier by Fligner (1981) in a discussion of the article by Conover and Iman (1981).

In terms of small sample properties, initial simulations of the RT analysis on certain models (see, for example, Iman, 1974) did appear promising. Now there is ample evidence based on simulation studies questioning the wisdom of doing RTs on designs as simple as two-way factorial designs with interaction; see, for example, Blair, Sawilowsky and Higgins (1987). We discuss one such study next and then present an analysis of covariance example where the use of the RT results in a poor analysis.


4.7.1 Monte Carlo Study

Another major Monte Carlo study on the RT was performed by Sawilowsky, Blair and Higgins (1989), which investigated the behavior of the RT over a three-way factorial design with interaction. In many of their situations, the RT gave severely inflated empirical levels and severely deflated empirical powers. We present the results of a small Monte Carlo study discussed in McKean and Vidmar (1994), which is based on the study of Sawilowsky et al. The model for the study is a 2 × 2 × 2 three-way factorial design. The shortcomings of the RT as discussed in the two-way models above seem to become worse for such models. Letting A, B, and C denote the factors, the model is

    Y_{ijkl} = \mu + a_i + b_j + c_k + (ab)_{ij} + (ac)_{ik} + (bc)_{jk} + (abc)_{ijk} + e_{ijkl}, \quad i, j, k = 1, 2, \; l = 1, \ldots, r,

where r is the number of replicates per cell. In the study by Sawilowsky et al., r was set at 2, 5, 10 or 20. Several distributions were considered for the errors e_{ijkl}, including the normal. They considered the usual seven hypotheses (3 main effects, 3 two-ways, and 1 three-way) and 8 patterns of alternatives. The nonnull effects were set at ±c where c was a multiple of σ; see, also, McKean and Vidmar (1992) for further discussion. The study of Sawilowsky et al. found that the RT test for interaction ". . . is dramatically nonrobust at times and that it has poor power properties in many cases."

In order to compare the behavior of the rank-based analysis and the RT on this design, we performed part of their simulation study. We considered standard normal errors and contaminated normal errors, which had 10% contamination from a normal distribution with mean 0 and standard deviation 8. The normal variates were generated as discussed in Marsaglia and Bray (1964) using uniform variates which were generated by a portable FORTRAN generator written by Kahaner, Moler and Nash (1989). There were 5 replications per cell and the nonnull constant of proportionality c was set at .75. The simulation size was 1000.
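For reference, errors from this type of contaminated normal distribution are easy to generate; a minimal sketch in R (not the FORTRAN generator used in the study):

    ## Contaminated normal errors: with probability eps draw from
    ## N(0, sigma.c^2), otherwise from the standard normal.
    rcn <- function(n, eps = 0.10, sigma.c = 8) {
      ifelse(runif(n) < eps, rnorm(n, 0, sigma.c), rnorm(n))
    }
    e <- rcn(40)   # e.g., errors for a 2 x 2 x 2 design with r = 5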

Tables 4.7.1 and 4.7.2 summarize the results of our study for the following two situations: the two-way interaction A × C and the three-way interaction effect A × B × C. The alternative for the A × C situation had all main effects and all two-way interactions in, while the alternative for the A × B × C situation had all main effects and two-way interactions in, besides the three-way alternative. These were poor situations for the RT in the study conducted by Sawilowsky et al. and, as Tables 4.7.1 and 4.7.2 indicate, the RT behaves poorly for these situations in our study also. Its empirical α levels are deplorable. For instance, at the nominal .10 level for the three-way interaction test under normal errors, the RT has an empirical level of .777, while the level is .511 at the contaminated normal. In contrast, the levels of the rank-based analysis were quite close to the nominal levels under normal errors and slightly conservative under the contaminated normal errors. In terms of power, note that the empirical power of the rank-based analysis is slightly less than the empirical power of LS under normal errors while it is substantially greater than the power of LS under contaminated normal errors. For the three-way interaction test, the empirical power of the RT falls below its empirical level.


Table 4.7.1: Empirical Levels and Power for Test of A × C

                                    Error Distribution
                   Normal Errors                   Contaminated Normal Errors
                 Null Model      Alternative      Null Model      Alternative
                 (Nominal α)     (Nominal α)      (Nominal α)     (Nominal α)
    Method      .10  .05  .01   .10  .05  .01    .10  .05  .01   .10  .05  .01
    LS         .095 .040 .009  .998 .995 .977   .087 .029 .001  .602 .505 .336
    Wilcoxon   .104 .060 .006  .997 .992 .970   .079 .032 .004  .934 .887 .713
    RT         .369 .243 .076  .847 .770 .521   .221 .128 .039  .677 .576 .319

Table 4.7.2: Empirical Levels and Power for Test of A × B × C

                                    Error Distribution
                   Normal Errors                   Contaminated Normal Errors
                 Null Model      Alternative      Null Model      Alternative
                 (Nominal α)     (Nominal α)      (Nominal α)     (Nominal α)
    Method      .10  .05  .01   .10  .05  .01    .10  .05  .01   .10  .05  .01
    LS         .094 .050 .005  1.00 .998 .980   .102 .041 .001  .598 .485 .301
    Wilcoxon   .101 .060 .004  .997 .992 .970   .085 .039 .006  .948 .887 .713
    RT         .777 .644 .381  .484 .343 .144   .511 .377 .174  .398 .276 .105


Table 4.7.3: Shirley Data

         Group 1            Group 2            Group 3
    Initial   Final     Initial   Final     Initial   Final
     time     time       time     time       time     time
      1.8      79.1       1.6      10.2       1.3      14.8
      1.3      47.6       0.9       3.4       2.3      30.7
      1.8      64.4       1.5       9.9       0.9       7.7
      1.1      68.7       1.6       3.7       1.9      63.9
      2.5     180.0       2.6      39.3       1.2       3.5
      1.0      27.3       1.4      34.0       1.3      10.0
      1.1      56.4       2.0      40.7       1.2       6.9
      2.3     163.3       0.9      10.5       2.4      22.5
      2.4     180.0       1.6       0.8       1.4      11.4
      2.8     132.4       1.2       4.9       0.8       3.3

Example 4.7.1. The Rat Data

The following example, taken from Shirley (1981), contrasts the rank-based methods, the rank transformed methods, and least squares methods in an analysis of covariance setting. The response is the time it takes a rat to enter a chamber after receiving a treatment designed to delay the time of entry. There were 30 rats in the experiment and they were divided evenly into three groups. The rats in Groups 2 and 3 received an antidote to the treatment. The covariate is the time taken by the rat to enter the chamber prior to its treatment. The data are presented in Table 4.7.3 and are displayed in Panel A of Figure 4.7.1.

As a full model, we considered the model

    y_{ij} = \alpha_j + \beta_j x_{ij} + e_{ij}, \quad j = 1, \ldots, 3, \; i = 1, \ldots, 10,        (4.7.2)

where y_{ij} denotes the response for the ith rat in Group j and x_{ij} denotes the corresponding covariate. There is a slight quadratic aspect to the Wilcoxon residual plot, Panel B of Figure 4.7.1, which is investigated in Exercise 4.8.16.
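For readers who wish to reproduce this analysis, here is a sketch of the Wilcoxon fit of Model (4.7.2) using the rfit and drop.test functions of the Rfit package (see Chapter 5); the data frame rat, assumed to hold Table 4.7.3 in long form, has columns final, initial, and the factor group:

    ## Wilcoxon fits of the rat data: full model (separate slopes) and
    ## reduced model (common slope).  drop.test gives the F_phi test of
    ## homogeneity of slopes discussed below.
    library(Rfit)
    fitF <- rfit(final ~ group * initial, data = rat)   # Model (4.7.2)
    fitR <- rfit(final ~ group + initial, data = rat)   # common slope
    summary(fitF)          # estimates and standard errors
    drop.test(fitF, fitR)  # F_phi test of H0: beta1 = beta2 = beta3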

Panel C of Figure 4.7.1 displays a plot of the internal Wilcoxon studentized residuals by case. Note that there are several outliers. These also can be seen in the plots of the data for Groups 2 and 3, Panels E and F of Figure 4.7.1. Note that the outliers have an effect on the LS-fits, drawing the fits toward the outliers in each group. In particular, for Group 3, it only took one outlier to spoil the LS-fit. On the other hand, the Wilcoxon fit is not affected by the outliers. The estimates are given in Table 4.7.4. As the plots indicate, the LS and Wilcoxon estimates differ numerically. Further evidence of the more precise R-fits relative to the LS-fits is given by the estimates of the scale parameters σ and τϕ found in Table 4.7.4.

We first test for homogeneity of slopes for the groups; i.e., H0 : β1 = β2 = β3. As clearly shown in Panel A of Figure 4.7.1, this does not appear to be true for this data. While the


Figure 4.7.1: Panel A: Wilcoxon fits of all groups; Panel B: Internal Wilcoxon studentized residual plot; Panel C: Internal Wilcoxon studentized residuals by case; Panel D: LS (solid line) and Wilcoxon (dashed line) fits for Group 1; Panel E: LS (solid line) and Wilcoxon (dashed line) fits for Group 2; Panel F: LS (solid line) and Wilcoxon (dashed line) fits for Group 3.


Table 4.7.4: LS and Wilcoxon Estimates (standard errors) for the Rat Data.

                     Group 1                  Group 2                  Group 3
    Procedure    α            β           α            β           α            β           σ or τϕ
    LS         -39.1 (20.)  76.8 (10.)  -15.6 (22.)  20.5 (14.)  -14.7 (19.)  21.9 (12.)    20.5
    Wilcoxon   -54.3 (16.)  84.2 (8.6)  -19.3 (18.)  21.0 (11.)  -11.6 (16.)  17.4 (10.)    17.0


slopes for Groups 2 and 3 seem to be about the same (the Wilcoxon 95% confidence interval for β2 − β3 is 3.9 ± 27.2), the slope for Group 1 appears to differ from the other two. To confirm this statistically, the Fϕ statistic to test homogeneity of slopes, H0, has the value 9.88 with 2 and 24 degrees of freedom, which is highly significant (p < .001). This says that Group 1, the group that did not receive the antidote, does differ significantly from the other two groups in terms of how the groups interact with the covariate. In particular, the estimated slope of post-treatment time to pre-treatment time for the rats in Group 1 is about 4 times as large as the slope for the rats in the two groups which received the antidote. Because there is interaction between the groups and the covariate, we did not proceed with the second test on average group effects; i.e., testing α1 = α2 = α3.

Shirley (1981) performed a rank transform on this data by ranking the response and then applying a standard least squares analysis. It is clear from Panel A of Figure 4.7.1 that this nonlinear transform will result in homogeneous slopes for the ranked problem, as confirmed by Shirley's analysis. But the rank transform is a nonlinear transform, and the subsequent analysis based on the rank transformed data does not test homogeneity of slopes in Model (4.7.2). The RT analysis is misleading in this case.

Note that using the rank-based analysis we performed an overall analysis of this data set, including a residual analysis for model criticism. Hypotheses of interest were readily tested and estimates of contrasts, along with standard errors, were easily obtained.


4.8 Exercises

4.8.1. Derive expression (4.2.19).

4.8.2. In Section 4.2.2 when we have only two levels, show that the Kruskal-Wallis test is equivalent to the MWW test discussed in Chapter 2.

4.8.3. Consider a oneway design for the data in Example 4.2.3. Fit the model using Wilcoxon estimates and conduct a residual analysis, including residual and q−q plots of standardized residuals. Identify any outliers. Next test the hypothesis (4.2.13) using the Kruskal-Wallis test and the test based on Fϕ.

4.8.4. Using the notation of Section 4.2.4, show that the asymptotic covariance between \hat{\mu}_i and \hat{\mu}_{i'} is given by expression (4.2.31). Next show that expressions (3.9.38) and (4.2.31) lead to a verification of the confidence interval (4.2.30).

4.8.5. Show that the asymptotic covariance between estimates of location levels is given by expression (4.2.31).

4.8.6. Suppose D is a symmetric, positive definite matrix. Prove that

    \sup_{h} \frac{h'y}{\sqrt{h'D^{-1}h}} = \sqrt{y'Dy}.        (4.8.1)

Refer to the Kruskal-Wallis statistic H_W, given in expression (4.2.22). Let y' = (\bar{R}_1 - \frac{n+1}{2}, \ldots, \bar{R}_k - \frac{n+1}{2}) and D = \frac{12}{n(n+1)} \text{diag}(n_1, \ldots, n_k). Then, using (4.8.1), show that H_W \le \chi^2_\alpha(k-1) if and only if

    \frac{\sum_{i=1}^{k} h_i (\bar{R}_i - \frac{n+1}{2})}{\sqrt{\frac{n(n+1)}{12} \sum_{j=1}^{k} \frac{1}{n_j} h_j^2}} \le \sqrt{\chi^2_\alpha(k-1)},

for all vectors h such that \sum h_i = 0.

Hence, if the Kruskal-Wallis test rejects H_0 at level α then there must be at least one contrast in the rank averages that exceeds the critical value \sqrt{\chi^2_\alpha(k-1)}. This provides Scheffé type multiple contrast tests with family error rate approximately equal to α.

4.8.7. Apply the procedure presented in Exercise 4.8.6 to the quail data of Example 4.2.1. Use α = .10.

4.8.8. Let I_1 and I_2 be (1 − α)100% confidence intervals for parameters θ_1 and θ_2, respectively. Show that

    P[\{\theta_1 \in I_1\} \cap \{\theta_2 \in I_2\}] \ge 1 - 2\alpha.        (4.8.2)

(a). Suppose the confidence intervals I_1 and I_2 are independent. Show that

    1 - 2\alpha \le P[\{\theta_1 \in I_1\} \cap \{\theta_2 \in I_2\}] \le 1 - \alpha.


(b). Generalize expression (4.8.2) to k confidence intervals and derive the Bonferroni procedure described in (4.3.2).

4.8.9. In the notation of the Pairwise Tests Based on Joint Rankings procedure of Section 4.3, show that R_1 is asymptotically N_{k-1}(0, \frac{k(n+1)}{12}(I_{k-1} + J_{k-1})) under H_0 : \mu_1 = \cdots = \mu_k. (Hint: The asymptotic normality follows as in Theorem 3.5.2. In order to determine the covariance matrix of R_1, first obtain the covariance matrix of the random vector R' = (\bar{R}_{1\cdot}, \ldots, \bar{R}_{k\cdot}) and then obtain the covariance matrix of R_1 by using the transformation [-1_{k-1} \; I_{k-1}].)

4.8.10. In Section 4.3, the Pairwise Tests Based on Joint Rankings procedure was discussed based on Wilcoxon scores. Generalize this procedure for an arbitrary score function ϕ(u).

4.8.11. For the baseball data in Exercise 1.12.32, consider the following oneway problem. The response of interest is the hitter's average and the three groups are left handed hitters, right handed hitters, and switch hitters. Using either Minitab or rglm, obtain the following analyses based on Wilcoxon scores:

(a.) Using the test statistic Fϕ, test for an overall group effect. Obtain the p-value and conclude at the 5% level.

(b.) Use the protected LSD procedure of Fisher to compare the groups at the 5% level.

4.8.12. Consider the Bonferroni type procedure described in item (6) of Section 4.3. Formulate a similar Protected LSD type procedure based on the test statistic Fϕ. Use these procedures to make the comparisons discussed in Exercise 4.8.11.

4.8.13. Consider the baseball data in Exercise 1.12.32. In Exercise 3.16.38, we investigated the linear relationship between a player's height and his weight. For this problem, consider the simple linear model

    \text{height} = \alpha + \beta \, \text{weight} + e.

Using Wilcoxon scores and either Minitab or rglm, investigate whether or not the same simple linear model can be used for both the pitchers and hitters. Obtain the p-value for the test of this hypothesis based on the statistic Fϕ.

4.8.14. In Example 4.5.1 obtain the square root of the response and fit it to the full model. Perform a residual analysis on the resulting fit. In particular, identify any outliers and compare the heteroscedasticity in the plot of the residuals versus the fitted values with the analogous plot in Example 4.5.1.

4.8.15. For Example 4.5.1, overlay the Wilcoxon and LS fits for the four treatments based on the square root transformation of the response. Then obtain an analysis of covariance for both the Wilcoxon and LS analyses for the transformed data. Comment on the plots and the results of the analyses.


4.8.16. Consider Example 4.7.1. Investigate whether a model which also includes quadratic terms in the covariates is more appropriate for the Rat Data than Model (4.7.2).

4.8.17. Consider Example 4.7.1. Eliminate the placebo group, Group 1, and perform an analysis of covariance on Groups 2 and 3. Use the linear model (4.7.2). Is there any difference between these groups?

4.8.18. Let H_W = W(W'W)^{-1}W' be the projection matrix based on the incidence matrix (4.2.5). Show that H_W is a block diagonal matrix with the ith block an n_i × n_i matrix of all ones. Recall X = (I − H_1)W_1 in Section 4.2.1. Let H_X = X(X'X)^{-1}X' be the projection matrix. Then argue that H_W = H_1 + H_X and, hence, that H_X = H_W − H_1 is easy to find. Using (4.2.8), show that, for the oneway design, \text{cov}(\hat{Z}) \doteq \tau_S^2 H_1 + \tau_\varphi^2 H_X and, hence, show that \text{var}(\hat{\mu}_i) is given by (4.2.11) and that \text{cov}(\hat{\mu}_i, \hat{\mu}_{i'}) is given by (4.2.31).

4.8.19. Suppose we have k treatments of interest and we employ a block design consisting of a blocks. Within each block, we randomly assign mk subjects to the treatments so that each treatment receives m subjects. Suppose we model the responses Y_{ijl} as

    Y_{ijl} = \mu + \alpha_i + \beta_j + e_{ijl}, \quad i = 1, \ldots, a, \; j = 1, \ldots, k, \; l = 1, \ldots, m,

where the e_{ijl} are iid with cdf F(t). We want to test

    H_0 : \beta_1 = \cdots = \beta_k \quad \text{versus} \quad H_A : \beta_j \ne \beta_{j'} \text{ for some } j \ne j'.

Suppose we rank the data in the ith block from 1 to mk for i = 1, \ldots, a. Let R_j be the sum of the ranks for the jth treatment. Show that

    E(R_j) = \frac{am(mk+1)}{2}

    \text{Var}(R_j) = \frac{am^2(mk+1)(k-1)}{12}

    \text{Cov}(R_j, R_l) = -\frac{am^2(mk+1)}{12}.

Further, argue that

    K_m = \sum_{j=1}^{k} \left(\frac{k-1}{k}\right) \left[\frac{R_j - E(R_j)}{\sqrt{\text{Var}(R_j)}}\right]^2 = \left[\frac{12}{akm^2(mk+1)} \sum_{j=1}^{k} R_j^2\right] - 3a(mk+1)

is asymptotically \chi^2 with k − 1 degrees of freedom. Note that if m = 1 then K_1 is the Friedman statistic. Show that the efficiency of the Friedman test relative to the twoway LS F-test is 12\sigma^2 [\int f^2(x)\,dx]^2 (k/(k+1)). Plot the efficiency as a function of k when f is N(0, 1).


Table 4.8.1: Box-Cox Data, Exercise 4.8.20

                       Treatments
    Poisons      1       2       3       4
      1        0.31    0.82    0.43    0.45
               0.45    1.10    0.45    0.71
               0.46    0.88    0.63    0.66
               0.43    0.72    0.76    0.62
      2        0.36    0.92    0.44    0.56
               0.29    0.61    0.35    1.02
               0.40    0.49    0.31    0.71
               0.23    1.24    0.40    0.38
      3        0.22    0.30    0.23    0.30
               0.21    0.37    0.25    0.36
               0.18    0.38    0.24    0.31
               0.23    0.29    0.22    0.33

4.8.20. The data in Table 4.8.1 are the results of a 3 × 4 design discussed in Box and Cox (1964). Forty-eight animals were exposed to three different poisons and four different treatments. The response was the survival time of the animal. The design was balanced. Use (4.4.1) as the full model to answer the questions below.

(a.) Using Wilcoxon scores obtain the fit of the full model. Sketch the cell median profile plot based on this fit and discuss whether or not interaction between poison and treatments is present.

(b.) Based on the Wilcoxon fit, plot the residuals versus the fitted values. Comment on the appropriateness of the model. Also obtain the internal Wilcoxon studentized residuals and identify any outliers.

(c.) Using the statistic Fϕ, obtain the robust ANOVA table (main effects and interaction) for this data. Conclude in terms of the p-values.

(d.) Note that the hypothesis matrix for interaction defines six interaction contrasts. Use the Bonferroni and Protected LSD multiple comparison procedures, (4.3.2) and (4.3.3), to investigate these contrasts. Determine which, if any, are significant.

(e.) Repeat the analysis in Parts (c) and (d), (Bonferroni analysis), using LS. Compare the Wilcoxon and LS results.

4.8.21. For testing the ordered alternative

    H_0 : \mu_1 = \cdots = \mu_k \quad \text{versus} \quad H_A : \mu_1 \le \cdots \le \mu_k,

with at least one strict inequality, let

    J = \sum_{s<t} S^+_{st},

where S^+_{st} = \#(Y_{tj} > Y_{si}) for i = 1, \ldots, n_s and j = 1, \ldots, n_t; see (2.2.20). This test for ordered alternatives was proposed independently by Jonckheere (1954) and Terpstra (1952). Under H_0, show the following:

(a.) E(J) = \frac{n^2 - \sum n_t^2}{4}.

(b.) V(J) = \frac{n^2(2n+3) - \sum n_t^2(2n_t+3)}{72}.

(c.) z = (J - E(J))/\sqrt{V(J)} is approximately N(0, 1).

Hence, based on (a)–(c), an asymptotic test for H_0 versus H_A is to reject H_0 if z \ge z_\alpha.


Chapter 5

Models with Dependent Error Structure

5.1 General Mixed Models

Consider an experiment done over m blocks (clusters), where block k has n_k observations. Within block k, let Y_k, X_k, and e_k denote respectively the n_k × 1 vector of responses, the n_k × p design matrix, and the n_k × 1 vector of errors. Let 1_{n_k} denote a vector of n_k ones. Then the general mixed model for Y_k is

    Y_k = \alpha 1_{n_k} + X_k \beta + e_k, \quad k = 1, \ldots, m,        (5.1.1)

where β is the p × 1 vector of regression coefficients and α is the intercept parameter. The components of the random error vector e_k are generally dependent random variables. Later, for theory, we make certain assumptions on the distribution of e_k. Alternately, the model can be written in the long form as

    Y = 1_n \alpha + X\beta + e,        (5.1.2)

where n = \sum_{k=1}^{m} n_k denotes the total sample size, Y = (Y_1', \ldots, Y_m')', X = (X_1', \ldots, X_m')', and e = (e_1', \ldots, e_m')'. Because an intercept parameter is in the model, we can assume that X is centered and that the true median of e_{kj} is zero. Since we can always reparameterize, assume that X has full column rank. It is important to note that the design matrices X_k for the clusters need not have full column rank. For example, incomplete block designs can be considered. To distinguish this general mixed model from the linear model of Chapter 3, in this chapter we call the model of Chapter 3 the independent error or case model.

This general mixed model often occurs in the applied sciences. Examples include data from clinical designs carried out over centers, repeated measures type data on individuals, data from randomized block designs, and clusters of correlated data. As in Chapters 3 and 4, for inference the primary focus concerns the regression coefficients (fixed effects), but the dependent structure must be taken into account in order to obtain valid inference for the fixed effects. Liang and Zeger (1986) discuss these models in some detail, developing a weighted LS inference for them.

The fixed effects part of the model is, of course, the linear model of Chapters 3 and 4. So in this section we proceed to discuss the R fit developed in Chapter 3 for Model (5.1.1). As we show, the asymptotic variance of the R estimator is a function of the dependence structure in the model.

Let ϕ(u), 0 < u < 1, be a specified score function which satisfies Assumption (S.1) of Section 3.4. Then, as in Chapter 3, the R estimator of β is

    \hat{\beta}_\varphi = \text{Argmin} \, \|Y - X\beta\|_\varphi.        (5.1.3)

For Model (5.1.1), properties of this estimator were developed by Kloke, McKean and Rashid (2009). They refer to it as the JR estimator for joint ranking; however, we use the terminology of Chapter 3 and call it an R estimator. As in Chapter 3, equivalently, \hat{\beta}_\varphi is a solution to S_\varphi(Y − Xβ) = 0, where S_\varphi(Y − Xβ) is the negative of the gradient of ‖Y − Xβ‖_\varphi given in (3.2.12). Once β is estimated, we estimate the intercept α by the median of the residuals; that is,

    \hat{\alpha}_S = \text{med}_{kj} \{ y_{kj} - x_{kj}' \hat{\beta}_\varphi \}.        (5.1.4)

As in Chapter 3, both estimators are regression and scale equivariant.

For the theory discussed next, certain conditions are needed. Assume that the random vectors e_1, e_2, \ldots, e_m are independent; i.e., the responses drawn from different blocks or clusters are independent. Assume further that the univariate marginal distributions of e_k are the same for all k. As discussed at the end of this section (see Subsection 5.1.1), this holds for many models of practical interest; however, in Section 5.4, we do discuss more general rank-based estimators which do not require this assumption. Let F(x) and f(x) denote the common univariate marginal distribution function and density function. Assume that f(x) follows Assumption (E.1) of Section 3.4 and that the usual regularity (likelihood) conditions hold; see, for example, Section 6.5 of Lehmann and Casella (1998). For the design matrix X, assume that Huber's condition (D.2) of Section 3.4 holds. As with the asymptotic theory for the traditional estimates (see, e.g., Liang and Zeger, 1986), assume that the number of clusters goes to ∞, i.e., m → ∞, and that n_k ≤ M for all k, for some constant M.

Because of the invariances, without loss of generality, assume that the true regression parameters are zero in Model (5.1.1). As in Chapter 3, asymptotic theory for the fixed effects estimator involves establishing the distribution of the gradient and the asymptotic quadraticity of the dispersion function.

Consider Model (5.1.1) and assume the above conditions. It then follows from Brunner and Denker (1994) that the projection of the gradient S_\varphi(Y − Xβ) is the random vector X'\varphi[F(Y − Xβ)], where \varphi[F(Y − Xβ)] = (\varphi[F(Y_{11} − x_{11}'\beta)], \ldots, \varphi[F(Y_{mn_m} − x_{mn_m}'\beta)])'. We need to assume that the covariance structure of this projection is asymptotically stable; that is, the following limit exists and is positive definite:

    \Sigma_\varphi = \lim_{m \to \infty} n^{-1} \sum_{k=1}^{m} X_k' \Sigma_{\varphi,k} X_k,        (5.1.5)

where \Sigma_{\varphi,k} = \text{Cov}\{\varphi[F(e_k)]\}. (In likelihood methods, a similar assumption is made on the covariance structure of the errors.)

As discussed by Kloke et al. (2009), under these assumptions, it follows from Theorem 3.1 of Brunner and Denker (1994) that

    \frac{1}{\sqrt{n}} S_X(0) \stackrel{D}{\to} N_p(0, \Sigma_\varphi),        (5.1.6)

where \Sigma_\varphi is defined in expression (5.1.5). The linearity and quadraticity results obtained in Chapter 3 for the linear model can be extended to our model. The linearity result is S_X(\beta) = S_X(0) - \tau_\varphi^{-1} n^{-1} X'X\beta + o_p(\sqrt{n}), uniformly for \sqrt{n}\|\beta\|_2 \le c, for c > 0, where \tau_\varphi is the same scale parameter as in Chapter 3; i.e., defined in expression (3.4.4). From this we obtain the asymptotic representation of the R estimator given by

    \sqrt{n}\,\hat{\beta}_\varphi = \tau_\varphi \sqrt{n}\,(X'X)^{-1} X' \varphi[F(e)] + o_p(1).        (5.1.7)

Based on (5.1.6) and (5.1.7), it follows that the distribution of \hat{\beta}_\varphi is approximately normal with mean β and covariance matrix

    V_\varphi = \tau_\varphi^2 (X'X)^{-1} \left( \sum_{k=1}^{m} X_k' \Sigma_{\varphi,k} X_k \right) (X'X)^{-1}.        (5.1.8)

Letting \tau_S = 1/[2f(0)], \hat{\alpha}_S is approximately normal with mean α and variance

    \sigma_1^2(0) = \tau_S^2 \frac{1}{n} \sum_{k=1}^{m} \left[ \sum_{j=1}^{n_k} \text{var}(\text{sgn}(e_{kj})) + \sum_{j \ne j'} \text{cov}(\text{sgn}(e_{kj}), \text{sgn}(e_{kj'})) \right].        (5.1.9)

In this section, we have kept the model general; i.e., we have not specified the covariance structure. To conduct inference, we need an estimate of the covariance matrix of \hat{\beta}_\varphi. Define the residuals of the R fit by

    \hat{e}_R = Y - \hat{\alpha}_S 1_n - X\hat{\beta}_\varphi.        (5.1.10)

Using these residuals, we estimate the parameter \tau_\varphi as discussed in Section 3.7.1. Next, a nonparametric estimate of \Sigma_{\varphi,k}, (5.1.5), is obtained by replacing the distribution function F(t) in its definition by the empirical distribution function of the residuals. Based on these results, for a specified vector h ∈ R^p, an approximate (1 − α)100% confidence interval for h'β is given by

    h'\hat{\beta}_\varphi \pm z_{\alpha/2} \sqrt{h' \hat{V}_\varphi h}.        (5.1.11)

Consider general linear hypotheses of the form H_0 : Mβ = 0 versus H_A : Mβ ≠ 0, where M is a q × p matrix of rank q. We offer two test statistics. First, the asymptotic distribution of \hat{\beta}_\varphi suggests a Wald type test of H_0 based on the test statistic

    T_{W,\varphi} = (M\hat{\beta}_\varphi)^T [M \hat{V}_\varphi M^T]^{-1} (M\hat{\beta}_\varphi).        (5.1.12)


Under H_0, T_{W,\varphi} has an asymptotic \chi_q^2 distribution with q degrees of freedom. Hence, a nominal level α test is to reject H_0 if T_{W,\varphi} \ge \chi_\alpha^2(q). As in the independent error case, this test is consistent for all alternatives of the form Mβ ≠ 0. For efficiency results, consider a sequence of local alternatives of the form H_{An} : M\beta_n = \beta/\sqrt{n}, where β ≠ 0. Under this sequence of alternatives, T_{W,\varphi} has an asymptotic noncentral \chi_q^2 distribution with noncentrality parameter

    \eta = (M\beta)^T [M V_\varphi M^T]^{-1} M\beta.        (5.1.13)
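In code, the Wald test takes only a few lines; the following R sketch assumes that b and V hold the rank-based estimate and its estimated covariance matrix, and that M is a full row rank hypothesis matrix:

    ## Wald test (5.1.12) of H0: M beta = 0 and its asymptotic p-value.
    wald.test <- function(b, V, M) {
      Mb <- M %*% b
      TW <- drop(t(Mb) %*% solve(M %*% V %*% t(M)) %*% Mb)
      q  <- nrow(M)
      c(TW = TW, df = q, p.value = 1 - pchisq(TW, q))
    }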

A second test utilizes the reduction in dispersion, RD_\varphi = D(\text{Red}) − D(\text{Full}), where D(\text{Full}) and D(\text{Red}) are respectively the minimum values of the dispersion function under the full and reduced (full model constrained by H_0) models. The asymptotically correct standardization depends on the dependence structure of the errors; see Exercises 5.6.5 and 5.6.6 for discussion of this test and also of the aligned rank test of Chapter 3.

Our discussion has been for general scores. If we have knowledge of the distribution of the errors, then we can optimize the analysis by selecting a suitable score function. From expression (5.1.8), although the dependence structure appears in the approximate covariance of \hat{\beta}_\varphi, as in Chapters 2 and 3 the constant of proportionality is \tau_\varphi. Hence, the discussion in Chapters 2 and 3 concerning score selection based on minimizing \tau_\varphi is still pertinent for the rank-based analysis of this section. Example 5.2.1 of the next section illustrates such score selection.

If the score function is bounded, then, based on their asymptotic representation (5.1.7), these R estimators have bounded influence in response space but not in factor space. However, for outliers in factor space, the high breakdown HBR estimators, (3.12.2), can be extended in the same way as the R estimates.

5.1.1 Applications

In many applications the form of the covariance structure of the random vector of errors e_k of Model (5.1.1) is known. This can result in a simplified asymptotic covariance structure for \hat{\beta}_\varphi. We discuss several such cases in the next few sections. In Section 5.2, we consider a simple mixed model with block as a random effect. Here, besides an estimate of \tau_\varphi, only an additional covariance parameter is required to estimate V_\varphi. In Section 5.3.1, we discuss a transformed procedure for a simple mixed model, provided the block design matrices, X_k's, have full column rank. Another rich class of such models is the repeated measures designs, where block is synonymous with subject. Two common types of covariance structure for these designs are: (i) the covariance of the errors for a subject has compound symmetric structure, i.e., a simple random effect model, or (ii) the errors follow a stationary time series model, for instance an autoregressive model. For Case (ii), the univariate marginals would have the same distribution and, hence, the above assumptions hold for our rank-based estimates. Using the residuals from the rank-based fit, R estimators of the autoregressive parameters of the error distribution can be obtained. These estimates could then be used in the usual way to transform the observations and then a second (generalized) R estimate could be obtained based on these transformed observations; see Exercise 5.6.7 for details. This is a robust analogue of the two-stage estimation procedure discussed for cluster samples in Rao, Sutradhar and Yue (1993). Generalized R estimators based on transformations are discussed in Sections 5.3 and 5.4.

5.2 Simple Mixed Models

In this section, we discuss a simple mixed model with block or cluster as a random effect. Consider Model (5.1.1), but for each block k, model the error vector e_k as e_k = 1_{n_k} b_k + \varepsilon_k, where the components of \varepsilon_k are independent and identically distributed and b_k is a continuous random variable which is independent of \varepsilon_k. Hence, we write the model as

    Y_k = \alpha 1_{n_k} + X_k \beta + 1_{n_k} b_k + \varepsilon_k, \quad k = 1, \ldots, m.        (5.2.1)

Assume that the random effects b_1, \ldots, b_m are independent and identically distributed random variables. It follows that the distribution of e_k is exchangeable. In particular, all marginal distributions of e_k are the same; so the theory of Section 5.1 holds. This family of models contains the randomized block designs, but, as in Section 5.1, the blocks can be incomplete.

For this model, the asymptotic variance-covariance matrix of \hat{\beta}_\varphi, (5.1.8), simplifies to

    \tau_\varphi^2 (X'X)^{-1} \sum_{k=1}^{m} X_k' \Sigma_{\varphi,k} X_k (X'X)^{-1}, \quad \Sigma_{\varphi,k} = (1 - \rho_\varphi) I_{n_k} + \rho_\varphi J_{n_k},        (5.2.2)

where \rho_\varphi = \text{cov}\{\varphi[F(e_{11})], \varphi[F(e_{12})]\} = E\{\varphi[F(e_{11})]\varphi[F(e_{12})]\}. Also, the asymptotic variance of the intercept (5.1.9) simplifies to n^{-1}\tau_S^2(1 + n^*\rho_S^*), for \rho_S^* = \text{cov}[\text{sgn}(e_{11}), \text{sgn}(e_{12})] and n^* = n^{-1}\sum_{k=1}^{m} n_k(n_k - 1). As with LS, for positive definiteness, we need to assume that each of \rho_\varphi and \rho_S^* exceeds \max_k\{-1/(n_k - 1)\}. Let M = \sum_{k=1}^{m} \binom{n_k}{2} - p (the subtraction of p, the dimension of the vector β, is a degree of freedom correction). A simple moment estimator of \rho_\varphi is

    \hat{\rho}_\varphi = M^{-1} \sum_{k=1}^{m} \sum_{i>j} a[R(\hat{e}_{ki})]\, a[R(\hat{e}_{kj})].        (5.2.3)

Plugging this into (5.2.2) and using the estimate of \tau_\varphi discussed earlier, we have an estimate of the asymptotic covariance matrix of the R estimators.
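In code, (5.2.3) is a sum over all within-block pairs of scored residual ranks. The sketch below uses Wilcoxon scores and assumes that ehat holds the R residuals, block the factor of cluster labels, and p the number of regression coefficients:

    ## Moment estimator (5.2.3) of rho_phi with Wilcoxon scores.
    n   <- length(ehat)
    a   <- sqrt(12) * (rank(ehat)/(n + 1) - 0.5)        # scores a[R(e)]
    M   <- sum(sapply(split(a, block),
                      function(ak) choose(length(ak), 2))) - p
    rho <- sum(sapply(split(a, block),                  # sum_{i>j} a_i a_j
                      function(ak) (sum(ak)^2 - sum(ak^2))/2)) / M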

For the general mixed model (5.1.1) of Section 5.1, the AREs for the rank-based procedures are difficult to obtain; however, for the simple mixed model, (5.2.1), the ARE can be obtained in closed form provided the design is centered within each block; see Kloke et al. (2009). The reader is asked to show in Exercise 5.6.2 that for Wilcoxon scores this ARE is

    \text{ARE}(F_{W,\varphi}, F_{LS}) = \frac{1-\rho}{1-\rho_\varphi} \, 12\sigma^2 \left[\int f^2(t)\,dt\right]^2,        (5.2.4)

where \rho_\varphi is defined under expression (5.2.2) and ρ is the correlation coefficient within a block. If the random vectors in a block follow the multivariate normal distribution, then this ARE lies in the interval [0.8660, 0.9549] when 0 < ρ < 1. The lower bound is attained when ρ → 1. The upper bound is attained when ρ = 0 (the independent case), which is the usual high efficiency of the Wilcoxon to LS at the normal distribution. When −1 < ρ < 0, this ARE lies in [0.9549, 0.9662]; the upper bound is attained when ρ = −0.52 and the lower bound is attained when ρ → −1. Generally, the high efficiency properties of the Wilcoxon analysis to the LS analysis in the independent errors case extend to the Wilcoxon analysis for this mixed model design. See Kloke et al. (2009) for details.

5.2.1 Variance Component Estimators

In this section, we assume that the variances of the errors exist. Let \Sigma_{e_k} denote the variance-covariance matrix of e_k. Under the model of this section, the variance-covariance matrix of e_k is compound symmetric, having the form \Sigma_{e_k} = \sigma^2 A_k(\rho) = \sigma^2[(1-\rho)I_{n_k} + \rho J_{n_k}], where \sigma^2 = \text{Var}(e_{ki}), I_{n_k} is the identity matrix of order n_k, and J_{n_k} is an n_k × n_k matrix of ones. Letting \sigma_b^2 and \sigma_\varepsilon^2 denote respectively the variances of the random effect b_k and the error ε, the total variance is given by \sigma^2 = \sigma_\varepsilon^2 + \sigma_b^2. The intraclass correlation coefficient is \rho = \sigma_b^2/(\sigma_\varepsilon^2 + \sigma_b^2). These parameters, (\sigma_\varepsilon^2, \sigma_b^2, \sigma^2), are referred to as the variance components.

To estimate these variance components, we use the estimates discussed in Kloke et al. (2009); see also Rashid and Nandram (1998) and Gerand and Schucany (2007). In block k, rewrite model (5.2.1) as y_{kj} - [\alpha + x_{kj}'\beta] = b_k + \varepsilon_{kj}, j = 1, \ldots, n_k. The left side of this expression is estimated by the residual

    \hat{e}_{R,kj} = y_{kj} - [\hat{\alpha} + x_{kj}'\hat{\beta}], \quad k = 1, \ldots, m; \; j = 1, \ldots, n_k.        (5.2.5)

Hence, a predictor (estimate) of b_k is given by \hat{b}_k = \text{med}_{1 \le j \le n_k}\{\hat{e}_{R,kj}\}. A robust estimator of the variance of b_k is then MAD, (3.9.27); that is,

    \hat{\sigma}_b^2 = [\text{MAD}_{1 \le k \le m}(\hat{b}_k)]^2 = \left[1.483 \, \text{med}_{1 \le k \le m} |\hat{b}_k - \text{med}_{1 \le j \le m}\hat{b}_j| \right]^2.        (5.2.6)

In this simple mixed model, the residuals \hat{e}_{R,kj}, (5.2.5), are often called the marginal residuals. In addition, though, we have the conditional residuals for the errors \varepsilon_{kj}, which are defined by

    \hat{\varepsilon}_{kj} = \hat{e}_{R,kj} - \hat{b}_k, \quad j = 1, \ldots, n_k, \; k = 1, \ldots, m.        (5.2.7)

A robust estimate of \sigma_\varepsilon^2 is then

    \hat{\sigma}_\varepsilon^2 = [\text{MAD}_{1 \le j \le n_k, 1 \le k \le m}(\hat{\varepsilon}_{kj})]^2.        (5.2.8)

Hence, robust estimates of the total variance \sigma^2 and the intraclass correlation coefficient are

    \hat{\sigma}^2 = \hat{\sigma}_\varepsilon^2 + \hat{\sigma}_b^2 \quad \text{and} \quad \hat{\rho} = \hat{\sigma}_b^2 / \hat{\sigma}^2.        (5.2.9)

Thus, our robust estimates of the variance components are given in expressions (5.2.6), (5.2.8), and (5.2.9).


5.2.2 Studentized Residuals

In Chapter 3, we presented Studentized residuals for R and HBR fits. These residuals are fundamental for diagnostic analyses of linear models. They correct for both the model (factor space) and the underlying covariance structure and allow for a simple benchmark rule for designating potential outliers. In this section, we present Studentized residuals based on the R fit of the simple mixed model, (5.2.1). Because the marginal residuals \hat{e}_{R,kj}, (5.2.5), are used to check the quality of fit, these are the appropriate residuals for standardizing.

Because the block sample sizes n_k are not necessarily the same, some additional notation simplifies the presentation. Let \nu_1 and \nu_2 be two parameters and define the block-diagonal matrix B(\nu_1, \nu_2) = \text{diag}\{B_1(\nu_1, \nu_2), \ldots, B_m(\nu_1, \nu_2)\}, where B_k(\nu_1, \nu_2) = (\nu_1 - \nu_2)I_{n_k} + \nu_2 J_{n_k}, k = 1, \ldots, m. Hence, for Model (5.2.1), we can write \text{Var}(e) = \sigma^2 B(1, \rho).

Using the asymptotic representation for \hat{\beta}_\varphi given in expression (5.1.7), a tedious calculation, similar to that in Section 3.9.2, shows that the approximate covariance matrix of \hat{e}_R is given by

    C_R = \sigma^2 B(1, \rho) + \frac{\tau_S^2}{n^2} J_n B(1, \rho_S^*) J_n + \tau^2 H_c B(1, \rho_\varphi) H_c
        - \frac{\tau_S}{n} B(\delta_{11}^*, \delta_{12}^*) J_n - \tau B(\delta_{11}, \delta_{12}) H_c - \frac{\tau_S}{n} J_n B(\delta_{11}^*, \delta_{12}^*)
        + \frac{\tau \tau_S}{n} J_n B(\gamma_{11}, \gamma_{12}) H_c - \tau H_c B(\delta_{11}, \delta_{12}) + \frac{\tau_S \tau}{n} H_c B(\gamma_{11}, \gamma_{12}) J_n,        (5.2.10)

where H_c is the projection matrix onto the column space of the centered design matrix X_c, J_n is the n × n matrix of all ones, and

    \delta_{11}^* = E[e_{11}\,\text{sgn}(e_{11})],
    \delta_{12}^* = E[e_{11}\,\text{sgn}(e_{12})],
    \delta_{11} = E[e_{11}\,\varphi(F(e_{11}))],
    \delta_{12} = E[e_{11}\,\varphi(F(e_{12}))],
    \gamma_{11} = E[\text{sgn}(e_{11})\varphi(F(e_{11}))],
    \gamma_{12} = E[\text{sgn}(e_{11})\varphi(F(e_{12}))],

and \rho_\varphi and \rho_S^* are defined in (5.1.5) and (5.1.8), respectively.

To compute the Studentized residuals, estimates of the parameters in C_R, (5.2.10), are required. First, consider the matrix \sigma^2 B(1, \rho). In Section 5.2.1, we obtained robust estimators \hat{\sigma}^2 and \hat{\rho}, given in expression (5.2.9). Substituting these estimators for \sigma^2 and ρ into \sigma^2 B(1, \rho), we have a robust estimator of \sigma^2 B(1, \rho) given by \hat{\sigma}^2 B(1, \hat{\rho}). Expression (5.2.3) gives a simple moment estimator of \rho_\varphi. The parameters \rho_S^*, \delta_{11}, \delta_{12}, \delta_{11}^*, \delta_{12}^*, \gamma_{11}, and \gamma_{12} can be estimated in the same way. Substituting these estimators into the matrix C_R, let \hat{C}_R denote the resulting estimator.

For t = 1, \ldots, n, let \hat{c}_{tt} denote the tth diagonal entry of the matrix \hat{C}_R. Then the tth Studentized marginal residual based on the R fit is

    e_{R,t}^* = \hat{e}_{R,t} / \sqrt{\hat{c}_{tt}}.        (5.2.11)


As in Chapter 3, the traditional benchmarks used with these Studentized residuals are the limits ±2.

5.2.3 Example and Simulation Studies

In this section we present an example of a randomized block design. It consists of only two blocks, so we also summarize simulation studies which confirm the validity of the rank-based analysis. For the example and the simulation studies, we computed the rank-based analysis using Rfit, a collection of R functions. By the traditional fit, we mean the maximum likelihood fit based on multivariate normality of the error random vectors; this fit and subsequent analysis were obtained using the R function lme, as discussed in Pinheiro and Bates (2000).

Example 5.2.1 (Crab Grass Data). Cobb (1998) presented an example of a complete block design concerning the weight of crab grass. Much of our discussion is drawn from Kloke et al. (2009). There are four fixed factors in the experiment: the density of the crabgrass at four levels, the nitrogen content of the crabgrass at two levels, the phosphorus content of the crabgrass at two levels, and the potassium content of the crabgrass at two levels. Two complete blocks of the experiment were carried out, so altogether there are n = 64 observations. Here block is a random factor and we assume the simple mixed model, (5.2.1), of this section. Under each set of experimental conditions, crab grass was grown in a cup. The response is the dry weight of a unit (cup) of crab grass, in milligrams. The data are presented in Table A.0.1 of Appendix B.

We consider the rank-based analysis of this section based on Wilcoxon scores. For the main effects model, Table 5.2.1 displays the estimated effects (contrasts) and standard errors for the Wilcoxon and traditional analyses. For the nutrients, these effects are the differences between the high and low levels, while for the factor density the three contrasts reference the highest density level. There are major differences between the Wilcoxon and the traditional estimates. For the Wilcoxon estimates, the nutrients nitrogen and phosphorus are significant and the contrast between the low and high levels of density is highly significant. Nitrogen is the only significant effect for the traditional analysis. The Wilcoxon statistic to test the density effects has the value T_{W,\varphi} = 20.55 with p = 0.002, while the traditional test statistic is F_{lme} = 0.82 with p = 0.490. The robust estimates of the variance components are: \hat{\sigma}^2 = 206.33, \hat{\sigma}_b^2 = 20.28, and \hat{\rho} = 0.098.

An outlier accounts for much of this dramatic difference between the robust and traditional analyses. Originally, one of the responses was mistyped; instead of the correct value 97.25, the response was typed as 972.5. As Cobb (1998) notes, this outlier was more difficult to spot in the original units. Upon replacing the outlier with its correct value, the Wilcoxon and traditional analyses are similar, although the Wilcoxon analysis is still more precise; see the discussion below on the other outliers in this data set. This is true too of the test for the factor density: T_{W,\varphi} = 23.23 (p = 0.001) and F_{lme} = 6.33 with p = 0.001. The robust estimates of the variance components are: \hat{\sigma}^2 = 209.20, \hat{\sigma}_b^2 = 20.45, and \hat{\rho} = 0.098. These


Table 5.2.1: Wilcoxon and Traditional Estimates and SEs of Effects for the Crabgrass.

                  Wilcoxon           Traditional
    Contrast    Est.     SE         Est.     SE
    Nit        39.90    4.08       69.76    28.7
    Pho        10.95    4.08      −11.52    28.7
    Pot        −1.60    4.08       28.04    28.7
    D34         3.26    5.76       57.74    40.6
    D24         7.95    5.76        8.36    40.6
    D14        24.05    5.76       31.90    40.6

are essentially unchanged from their values on the original data. If, on the original data, the experimenter had run the robust fit and compared it with the traditional fit, then the outlier would have been discovered immediately.

Figure 5.2.1 contains the Wilcoxon Studentized residual plot and q−q plot for the original data. We have removed the large outlier from the plots, so that we can focus on the remaining data. The "vacant middle" in the residual plot is an indication that interaction may be present. For the hypothesis of interaction between the nutrients, the value of the Wald type test statistic is T_{W,\varphi} = 30.61, with p = 0.000. Hence, the R analysis strongly confirms that interaction is present. On the other hand, the traditional likelihood ratio test statistic for this interaction is 2.92, with p = 0.404. In the presence of interaction, many statisticians would consider interaction contrasts instead of a main effects analysis. Hence, for such statisticians, the robust and traditional analyses would have different practical interpretations.

5.2.4 Simulation Studies of Validity

In this data set, the number of blocks is two. Hence, to answer questions concerning the validity of the Wilcoxon analysis, Kloke et al. (2009) conducted a small simulation study. Table 5.2.2 summarizes the empirical confidences and AREs of this study for two situations, normal errors and contaminated normal errors (20% contamination and the ratio of the contaminated variance to the uncontaminated variance at 25). For each situation, the same randomized block design as in the Crab Grass example was used, with the correlation structure as estimated by the Wilcoxon analysis. The empirical confidences of the asymptotic 95% confidence intervals were recorded. These intervals are of the form Estimate ± 1.96 × SE, where SE denotes the standard error of the estimate. The number of simulations was 10,000 for each situation; therefore, the error in the table, based on the usual 95% confidence interval for a proportion, is 0.004. The empirical confidences for the Wilcoxon are quite good, with the target of 0.95 usually within range of error. They were perhaps a little conservative at the contaminated normal situation. Hence, the Wilcoxon analysis appears to be valid for this design. The intervals based on the traditional fit are slightly liberal. The empirical AREs between two estimators displayed in Table 5.2.2 are the ratios of empirical mean squared errors of the two estimators. As the table shows, the traditional fit is more efficient



Figure 5.2.1: Studentized Residual and q−q Plots, Minus Large Outlier.

at the normal, but the efficiencies are close to the value 0.95 for the independent error case. The Wilcoxon analysis is much more efficient over the contaminated normal situation.

Does this rank-based analysis differ from the independent error analysis of Chapter 3? As a tentative answer to this question, Kloke et al. (2009) ran 10,000 simulations using the model for the Crab Grass Example. Wilcoxon scores were used for both analyses. To avoid confusion, call the analysis of Chapter 3 the IR analysis (I for independent errors) and the analysis of this section the R analysis. They considered normal error distributions, setting the variance components at the values of the robust estimates. Because the R and IR fits are the same, they considered the differences in their inferences of the six effects listed in Table 5.2.1. For 95% nominal confidence, the average empirical confidences over these six

Table 5.2.2: Validity of Inference (Empirical Confidence Sizes and AREs)

                   Normal Errors               Contaminated Normal Errors
    Contrast   Wilc.   Traditional   ARE     Wilc.   Traditional   ARE
    Nit        0.948     0.932      0.938    0.964     0.933       7.73
    Pho        0.953     0.934      0.941    0.964     0.930       7.82
    Pot        0.948     0.927      0.940    0.966     0.934       7.72
    D34        0.950     0.929      0.936    0.964     0.931       7.75
    D24        0.951     0.929      0.943    0.960     0.931       7.57
    D14        0.952     0.930      0.944    0.960     0.929       7.92


contrasts are 95.32% and 96.12%, respectively, for the R and IR procedures. Hence, both procedures appear valid. For a measure of efficiency, they averaged, across the contrasts, the averages of squared lengths of the confidence intervals. The ratio of the R to the IR averages is 0.914; hence, for the simulation, the R inference is about 9% more efficient than the IR inference. Similar results for the traditional analyses are reported in Rao et al. (1993).

5.2.5 Simulation Study of Other Score Functions

Besides the large outlier, there are six other potential outliers in the Cobb data. This quantity of outliers suggests the use of score functions which are preferable to the Wilcoxon score function for very heavy-tailed error structure. To investigate this, we turned to the family of Winsorized Wilcoxon score functions. Recall that this family was discussed for skewed data in Example 2.5.1. Here, though, asymmetry does not appear to be warranted. We selected the score function which is linear over the interval (0.2, 0.8), i.e., 20% Winsorizing on both sides. We denote it by WW2. For the parameters as in Table 5.2.1, the WW2 estimates and standard errors (in parentheses) are: 39.16 (3.78), 10.13 (3.78), −2.26 (3.78), 2.55 (5.35), 7.68 (5.35), and 23.28 (5.35). The estimate of the scale parameter τ is 14.97, compared to the Wilcoxon estimate of 15.56. This indicates that an analysis based on the WW2 fit has more precision than one based on the Wilcoxon fit.
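For reference, a sketch in R of this score function: the Wilcoxon score u − 1/2, Winsorized at 0.2 and 0.8 (centering and scaling of the scores is omitted). The comment indicating its use with rfit assumes the Rfit class "scores":

    ## 20% Winsorized Wilcoxon score: linear on (0.2, 0.8), flat outside.
    phiWW2  <- function(u) pmin(pmax(u, 0.2), 0.8) - 0.5
    DphiWW2 <- function(u) as.numeric(u > 0.2 & u < 0.8)
    ## With the Rfit package, a custom scores object can be built as
    ##   ww2 <- new("scores", phi = phiWW2, Dphi = DphiWW2)
    ## and passed to rfit(..., scores = ww2).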

To investigate this gain in precision, we ran a small simulation study. We used the same model and the same correlation structure as estimated by the Wilcoxon fit. We considered normal and contaminated normal errors, with the percent of contamination at 20% and the relative variance of the contaminated part at 25. For each situation 10,000 simulations were run. The AREs were very similar for all six parameters, so we only report their averages. For the normal situation the average ARE between the WW2 and Wilcoxon estimates was 0.90; hence, the WW2 estimate was 10% less efficient for the normal situation. For the contaminated normal situation, though, this average was 1.21; hence, the WW2 estimate was 20% more efficient than the Wilcoxon estimate for the contaminated normal situation.

There are families of score functions besides the Winsorized Wilcoxon scores. Gastwirth (1966) presents several families of score functions appropriate for classes of distributions with tails heavier than the exponential distribution. For certain cases, he selects a score based on a maximin strategy.

5.3 Rank-Based Procedures Based on Arnold Transformations

In this section, we apply a linear transformation to the mixed model, (5.1.1), and then obtain the R fits. We begin with a brief but necessary discussion of the intercept parameter.

Write the mixed model in the long form (5.1.2), Y = 1_n\alpha + X\beta + e. Suppose the transformation matrix is A. Multiplying both sides of the model by A, the transformed model is of the form

    Y^* = X^* b + e^*,        (5.3.1)

where v^* denotes the vector Av and the vector of parameters is b = (\alpha, \beta')'. While the original model has an intercept parameter, in general the transformed model does not. As discussed in Exercise 3.16.39 of Chapter 3, the R fit of Model (5.3.1) is actually the R fit of the model Y^* = \tilde{X}^* b + e^*, where \tilde{X}^* = (I - H_1)X^* and H_1 is the projection matrix onto the space spanned by 1; i.e., \tilde{X}^* is the centered design matrix based on X^*.

As proposed in Exercise 3.16.39, to obtain an R fit of Model (5.3.1), we use the following algorithm:

(1.) Fit the model

    Y^* = \alpha_1 1 + \tilde{X}^* b + e^*.        (5.3.2)

By fit we mean: obtain the R estimate of b and then estimate \alpha_1 by the median of the residuals. Let \hat{Y}_1^* denote the R fit.

(2.) Project \hat{Y}_1^* to the right space; i.e., obtain

    \hat{Y}^* = H_{X^*} \hat{Y}_1^*.        (5.3.3)

(3.) Solve X^* b = \hat{Y}^*; i.e., our estimator is

    \hat{b}^* = (X^{*\prime} X^*)^{-1} X^{*\prime} \hat{Y}^*.        (5.3.4)

As developed in Exercise 3.16.39, \hat{b}^* is asymptotically normal with the asymptotic representation given by (3.16.11) and asymptotic variance given by (3.16.12). We use these results in the remainder of this chapter.
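The three steps translate directly into code. A sketch in R, with Ystar the transformed response vector, Xstar the transformed design matrix, and the rfit function of the Rfit package standing in for the R fit of step (1):

    ## Steps (5.3.2)-(5.3.4): R fit with intercept, project, then solve.
    library(Rfit)
    fit1  <- rfit(Ystar ~ Xstar)                  # step (1): R fit
    Yhat1 <- fitted.values(fit1)
    XtX   <- crossprod(Xstar)
    Yhat  <- Xstar %*% solve(XtX, crossprod(Xstar, Yhat1))  # step (2)
    bstar <- solve(XtX, crossprod(Xstar, Yhat))             # step (3)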

5.3.1 R Fit Based on Arnold Transformed Data

As in the previous sections, consider an experiment done over m blocks (clusters, centers), and let Y_k denote the vector of n_k observations for the kth block, k = 1, \ldots, m. In this section, we consider the simple mixed model of Section 5.2. Using the notation of expression (5.2.1), Y_k follows the model Y_k = \alpha 1_{n_k} + X_k\beta + 1_{n_k} b_k + \varepsilon_k, where b_k is a random effect and β denotes the fixed effects of interest. As in Section 5.2, assume that the blocks are independent and that b_k and \varepsilon_k are independent. Let e_k = 1_{n_k} b_k + \varepsilon_k. As in expression (5.1.2), the long form of the model is useful, i.e., Y = 1_n\alpha + X\beta + e. Because there is an intercept parameter in the model, we may assume that X is centered. Let n = \sum_{k=1}^{m} n_k denote the total sample size. For this section, in addition, we assume that for all blocks X_k has full column rank p.

If the variances of the error variables exist, denote them by \text{Var}[b_k] = \sigma_b^2 and \text{Var}[\varepsilon_{kj}] = \sigma_\varepsilon^2. In this case, the variance-covariance structure for the kth block is compound symmetric, which we denote as

    \text{Var}[e_k] = \sigma^2 A_k(\rho) = \sigma^2[(1-\rho)I_{n_k} + \rho J_{n_k}],        (5.3.5)

where \sigma^2 = \sigma_\varepsilon^2 + \sigma_b^2 and \rho = \sigma_b^2/(\sigma_b^2 + \sigma_\varepsilon^2).


Arnold Transformation

Arnold (Chapters 14 and 15, 1981) discusses a Helmert transformation for these types of models for traditional (least squares) analyses for balanced designs, i.e., all n_k's the same. Kloke and McKean (2010) generalized Arnold's results to unbalanced designs and developed the properties of the R fit for the transformed data. Consider the n_k × n_k orthogonal matrix

    \Gamma_k = \begin{bmatrix} \frac{1}{\sqrt{n_k}} 1_{n_k}' \\ C_k' \end{bmatrix},        (5.3.6)

where the columns of C_k form an orthonormal basis for 1_{n_k}^\perp, (C_k' 1_{n_k} = 0). We call \Gamma_k an Arnold transformation of size n_k.

Now, apply an Arnold transformation of size n_k to the response vector for the kth block:

    Y_k^* = \Gamma_k Y_k = \begin{bmatrix} Y_{k1}^* \\ Y_{k2}^* \end{bmatrix},

where the mean component is Y_{k1}^* = \alpha^* + b_k^* + \sqrt{n_k}\,\bar{x}_k'\beta + e_{k1}^*, the contrast component is Y_{k2}^* = X_k^*\beta + e_{k2}^*, and the other quantities are:

    \bar{x}_k' = \frac{1}{n_k} 1_{n_k}' X_k
    e_{k1}^* = \frac{1}{\sqrt{n_k}} 1_{n_k}' e_k
    X_k^* = C_k' X_k
    e_{k2}^* = C_k' e_k = b_k C_k' 1_{n_k} + C_k' \varepsilon_k = C_k' \varepsilon_k.

In particular, note that the contrast component contains, as a linear model, the fixed effects of interest and, moreover, it is free of the random block effect.

Furthermore, notice that all the information on β is in the contrast component if \bar{x}_k = 0. This occurs when the experimental design is replicated at least once in each of the blocks and the covariate does not change. Also, all of the information on β is in the mean component if the covariates are constant within a block. More often, however, there is information on β in both of the components. If this is the case, then for balanced designs one can put both pieces back together and obtain an estimator using all of the information. For unbalanced designs this is not possible. The approach we take is to ignore the information in the mean component and use the contrast component for inference.

Let n^* = n - m. Then the long form of the Arnold transformation (AT) is Y_2^* = C'Y, where C' = \text{diag}[C_1', \ldots, C_m']. So we can model Y_2^* as

    Y_2^* = X^*\beta + e_2^*,        (5.3.7)

where e_2^* = C'e and, provided variances exist, \text{Var}[e_2^*] = \sigma_2^2 I_{n^*}, \sigma_2^2 = \sigma^2(1-\rho), and X^* = C'X.


LS Fit on Arnold Transformed Data

For the traditional least squares procedure, suppose the variances of the errors exist. Under the additional assumption of normality, the transformed errors are independent. The traditional estimator is thus the usual LS estimator

    \hat{\beta}_{ATLS} = \text{Argmin}\, \|y_2^* - X^*\beta\|_{LS};

i.e., \hat{\beta}_{ATLS} = (X^{*\prime}X^*)^{-1} X^{*\prime} y_2^*. This is the extension of Arnold's (1981) solution that was proposed by Kloke and McKean (2010) for the unbalanced case of Model (5.3.7). As usual, estimate the intercept based on the mean of the residuals:

    \hat{\alpha}_{LS} = \frac{1}{n} 1'(y - \hat{y}) = \frac{1}{n} 1'(I_n - X(X^{*\prime}X^*)^{-1}X^{*\prime}C')y = \bar{y}.

As Exercise 5.6.3 shows, the joint asymptotic distribution is
\[
\begin{bmatrix} \hat{\alpha}_{LS} \\ \hat{\beta}_{ATLS} \end{bmatrix}
\sim N_{p+1} \left( \begin{bmatrix} \alpha \\ \beta \end{bmatrix},
\begin{bmatrix} \sigma_1^2 & 0' \\ 0 & \sigma_2^2 (X^{*\prime}X^*)^{-1} \end{bmatrix} \right), \qquad (5.3.8)
\]
where $\sigma_1^2 = (\sigma^2/n^2) \sum_{k=1}^m [(1-\rho)n_k + \rho n_k^2]$ and $\sigma_2^2 = \sigma^2(1-\rho)$. Notice that if inference is to be on $\beta$ then we avoid explicit estimation of $\rho$. To estimate $\sigma_2^2$ we may use $\hat{\sigma}_2^2 = \sum_{k=1}^m \sum_{j=1}^{n_k-1} \hat{e}^{*2}_{kj} / (n^* - p)$, where $\hat{e}^*_{kj} = y^*_{kj} - x^{*\prime}_{kj}\hat{\beta}$.
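Under these assumptions, the ATLS fit is then just ordinary regression through the origin on the stacked transformed data. A sketch reusing the hypothetical helper above (blocks is an assumed list whose elements carry components y and X):

parts  <- lapply(blocks, function(b) arnold_contrast(b$y, b$X))
ystar2 <- unlist(lapply(parts, function(p) as.vector(p$ystar2)))
Xstar  <- do.call(rbind, lapply(parts, function(p) p$Xstar))
fit    <- lm(ystar2 ~ Xstar - 1)   # no intercept after the transformation
sigma2_2 <- sum(residuals(fit)^2) / (length(ystar2) - ncol(Xstar))  # estimates sigma^2(1 - rho)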

R Fit on Arnold Transformed Data

For the R fit of Model (5.3.7), we briefly sketch the development in Kloke and McKean (2010). Assume that we have selected a score function $\varphi(u)$. We define the Arnold's transformation rank-based (ATR) estimator of $\beta$ as the regression through the origin rank estimator defined by steps (5.3.2)-(5.3.4) of the last section; that is, the rank-based estimator is given by
\[
\hat{\beta}_{ATR} = \mathrm{Argmin}\, \| y^*_2 - X^*\beta \|_{\varphi} . \qquad (5.3.9)
\]

The results of Section 5.1 ensure that the ATR estimates are consistent and asymptotically normal. The reason for doing an Arnold transformation, though, is that the transformed error variables are uncorrelated. While this does not necessarily mean that they are independent, in the literature they are usually treated as if they are. This is called working independence. The asymptotic distributions discussed next are formulated under working independence. The simulation results reported in Kloke and McKean (2010) support the validity of the asymptotic distributions over normal and contaminated normal error distributions.
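For readers who want to experiment, a crude way to compute a Wilcoxon-score version of (5.3.9) is to minimize Jaeckel's dispersion function directly with a general-purpose optimizer. This is only a sketch under working independence, not the algorithm of steps (5.3.2)-(5.3.4); here y and X stand for $y^*_2$ and $X^*$:

# Jaeckel dispersion with Wilcoxon scores a(i) = sqrt(12)*(i/(n+1) - 1/2).
wil_disp <- function(beta, y, X) {
  e <- as.vector(y - X %*% beta)
  a <- sqrt(12) * (rank(e) / (length(e) + 1) - 0.5)
  sum(a * e)
}
# Start from the LS fit; for p = 1 use method = "Brent" with bounds instead.
beta_atr <- optim(coef(lm(y ~ X - 1)), wil_disp, y = y, X = X)$par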

Recall from the regression through the origin algorithm that the asymptotic distribution of $\hat{\beta}_{ATR}$ depends on the choice of the estimate of the intercept $\alpha$. For the first case, suppose the median of the residuals is used as the estimate of the intercept, i.e., $\hat{\alpha}_{ATR} = \mathrm{med}\{ y^*_{kj2} - x^{*\prime}_{kj}\hat{\beta}_{ATR} \}$. Then, under working independence, the joint approximate distribution of the regression parameters is
\[
\begin{bmatrix} \hat{\alpha}_{ATR} \\ \hat{\beta}_{ATR} \end{bmatrix}
\sim N_{p+1} \left( \begin{bmatrix} \alpha \\ \beta \end{bmatrix},
\begin{bmatrix} \sigma_s^2 \tau_{s,e}^2 / n & 0' \\ 0 & V \end{bmatrix} \right), \qquad (5.3.10)
\]
where $V$ is given in expression (3.16.12) of Chapter 3, $\sigma_s^2 = 1 + t^*\rho_s$, $t^* = \sum_{k=1}^m n_k(n_k-1)$, and $\rho_s = \mathrm{cov}[\mathrm{sgn}(e_{11}), \mathrm{sgn}(e_{12})]$.

For the second case, assume that the score function $\varphi(u)$ is odd about $1/2$, i.e., $\varphi(1-u) = -\varphi(u)$. Let $\hat{\alpha}^+_{ATR}$ denote the signed-rank estimator of the intercept; see expression (3.5.32) of Chapter 3. Then, under working independence, the joint approximate distribution of the rank-based estimator is
\[
\begin{bmatrix} \hat{\alpha}^+_{ATR} \\ \hat{\beta}_{ATR} \end{bmatrix}
\sim N_{p+1} \left( \begin{bmatrix} \alpha \\ \beta \end{bmatrix},
\begin{bmatrix} \sigma_s^2 \tau_{s,e}^2 / n & 0' \\ 0 & V \end{bmatrix} \right), \qquad (5.3.11)
\]
where $V = \tau^2 (X^{*\prime}X^*)^{-1}$. Comparing expressions (5.3.8) and (5.3.11), we see that the asymptotic relative efficiency (ARE) between the ATLS and ATR estimators is the same as that between the LS and R estimates in ordinary linear models. In particular, when Wilcoxon scores are used and the errors have a normal distribution, the ARE between ATLS and ATR (Wilcoxon) is the usual 0.95. Hence, for this second case, the ATR estimators are efficiently robust.

To complete the practical inference, the scale parameters $\tau$ and $\tau_s$ are based on the distribution of $e^*_{kj2}$ and can be estimated as discussed in Chapter 3. From this, an inference is readily formed for the parameters of the model. Validity of the resulting confidence intervals is confirmed in the simulation study of Kloke and McKean (2010). Studentized residuals are also discussed in this article. A matrix expression such as (5.2.10) for the simple mixed model is derived by the authors; however, unlike the situation in Section 5.2.2, some of the necessary correlations are not straightforward to estimate. Kloke and McKean recommend a bootstrap to estimate the standard error of a residual. We use these in the following example.

Example and Discussion

The following example is drawn from the article of Kloke and McKean (2010). Although simple, the data set demonstrates some of the nice features of Arnold's transformation, particularly for balanced data.

Example 5.3.1 (Milliken and Johnson Data). The data in Table 5.3.1 are from an example found on page 260 of Milliken and Johnson (2002). Each row represents a block of length two. There is one covariate, and each of the two responses is a measurement on a different treatment.

The model for these data is
\[
Y_k = \alpha 1_2 + \Delta \begin{bmatrix} -0.5 \\ 0.5 \end{bmatrix} + \beta x_k 1_2 + \epsilon_k .
\]


Table 5.3.1: Data for Example 5.3.1.

  x      y1     y2
 23.2   60.4   76.0
 26.9   59.9   76.3
 29.4   64.4   77.8
 22.7   63.5   75.6
 30.6   80.6   94.6
 36.9   75.9   96.1
 17.6   53.7   62.3
 28.5   66.3   81.6

Table 5.3.2: ATR and ATLS estimates and standard errors for Example 5.3.1.

            ATR              ATLS
        Est      SE      Est      SE
α      70.8     3.54    72.8     8.98
∆     −14.45    1.61   −14.45    1.19
β       1.43    0.65     1.46    0.33

The Arnold's transformation for this model is
\[
\Gamma_k = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.
\]
The transformed responses are $Y^*_k = \Gamma_k Y_k = [Y^*_{k1}, Y^*_{k2}]'$, where
\[
Y^*_{k1} = \alpha^* + \beta^* x_k + \epsilon^*_{k1}, \qquad
Y^*_{k2} = \Delta^* + \epsilon^*_{k2},
\]
with $\alpha^* = \sqrt{2}\,\alpha$, $\beta^* = \sqrt{2}\,\beta$, and $\Delta^* = \frac{1}{\sqrt{2}}\Delta$. We treat the transformed errors $\epsilon^*_{k1}$, $k = 1, \ldots, m$, and $\epsilon^*_{k2}$, $k = 1, \ldots, m$, as iid. Notice that the first component is a simple linear regression model and the second component is a simple location model. For this example, we use signed-rank estimates for both of the intercept terms. The estimates and standard errors of the parameters are given in Table 5.3.2.
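A sketch of the LS (ATLS) side of this computation on the data of Table 5.3.1 (our illustration; the published estimates are from Kloke and McKean, 2010, and the ATR versions would replace lm() and the mean by rank-based fits):

x  <- c(23.2, 26.9, 29.4, 22.7, 30.6, 36.9, 17.6, 28.5)
y1 <- c(60.4, 59.9, 64.4, 63.5, 80.6, 75.9, 53.7, 66.3)
y2 <- c(76.0, 76.3, 77.8, 75.6, 94.6, 96.1, 62.3, 81.6)
ys1 <- (y1 + y2) / sqrt(2)                   # mean component
ys2 <- (y1 - y2) / sqrt(2)                   # contrast component
alpha_hat <- mean(c(y1, y2))                 # 72.8, matching Table 5.3.2
beta_hat  <- coef(lm(ys1 ~ x))[2] / sqrt(2)  # about 1.46
Delta_hat <- sqrt(2) * mean(ys2)             # -14.45, using Delta* = Delta/sqrt(2)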

Kloke and McKean (2010) plotted bootstrap Studentized residuals for the least squares and Wilcoxon fits; these plots showed no serious outliers.

To demonstrate the robustness of the ATR estimates in the example, Kloke and McKean (2010) conducted a small sensitivity analysis. They set the second data point to $y^{(i)}_{12} = y_{11} + \Delta y$, where $\Delta y$ varied from $-30$ to $30$. Then the parameter $\Delta^{(i)}$ was estimated from the data set containing the outlier. The graph below displays the relative change of the estimate, $(\hat{\Delta} - \hat{\Delta}^{(i)})/\hat{\Delta}$, as a function of $\Delta y$.

[Figure: relative change of the ATR estimate of ∆ as a function of ∆y, for ∆y between −30 and 30.]

Over this range of $\Delta y$, the relative change in the ATR estimate is between $-0.042$ and $0.062$. In contrast, as the reader is asked to show in Exercise 5.6.4, the relative change in ATLS over this range is between $0.125$ and $0.394$. Hence, the relative change in the ATR estimates is small, which indicates the robustness of the ATR estimates.


5.4 General Estimating Equations (GEE)

For longitudinal data, Liang and Zeger (1986) presented an elegant, general iterated reweighted least squares (IRLS) fit of a generalized longitudinal model. As we note below, their fit solves a set of general estimating equations (GEE). Their model is more general than Model (5.1.1). Abebe, McKean and Kloke (2010) developed a rank-based fit of this general model which we present in this section. While analogous to Liang and Zeger's fit, it is robust in response space. Further, the procedure can easily be generalized to be robust in factor space, also.

Consider a longitudinal set of observations over $m$ subjects. Let $y_{it}$ denote the $t$th response for the $i$th subject, $t = 1, \ldots, n_i$, $i = 1, \ldots, m$. Assume that $x_{it}$ is a $p \times 1$ vector of corresponding covariates. Let $n = \sum_{i=1}^m n_i$ denote the total sample size. Assume that the marginal distribution of $y_{it}$ is of the exponential class of distributions and is given by
\[
f(y_{it}) = \exp\{ [y_{it}\theta_{it} - a(\theta_{it}) + b(y_{it})]\phi \}, \qquad (5.4.1)
\]
where $\phi > 0$, $\theta_{it} = h(\eta_{it})$, $\eta_{it} = x_{it}^T\beta$, and $h(\cdot)$ is a specified function. Thus the mean and variance of $y_{it}$ are given by
\[
E(y_{it}) = a'(\theta_{it}) \quad \text{and} \quad \mathrm{Var}(y_{it}) = a''(\theta_{it})/\phi , \qquad (5.4.2)
\]


where the $'$ denotes derivative. In this notation, the link function is $h^{-1} \circ (a')^{-1}$. More assumptions are stated later for the theory.

Let $Y_i = (y_{i1}, \ldots, y_{in_i})^T$ and $X_i = (x_{i1}, \ldots, x_{in_i})^T$ denote the $n_i \times 1$ vector of responses and the $n_i \times p$ matrix of covariates, respectively, for the $i$th individual. We consider the general case where the components of the vector of responses for the $i$th subject, $Y_i$, are dependent. Let $\theta_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{in_i})^T$, so that $E(Y_i) = a'(\theta_i) = (a'(\theta_{i1}), \ldots, a'(\theta_{in_i}))^T$. For an $s \times 1$ vector of unknown correlation parameters $\alpha$, let $C_i = C_i(\alpha)$ denote an $n_i \times n_i$ correlation matrix. Define the matrix
\[
V_i = A_i^{1/2} C_i(\alpha) A_i^{1/2} / \phi , \qquad (5.4.3)
\]
where $A_i = \mathrm{diag}\{a''(\theta_{i1}), \ldots, a''(\theta_{in_i})\}$. The matrix $V_i$ need not be the covariance matrix of $Y_i$. In any case, we refer to $C_i$ as the working correlation matrix. For estimation, let $\hat{V}_i$ be an estimate of $V_i$. This, in general, requires estimation of $\alpha$ and often an initial estimate of $\beta$. In general, we denote the estimator of $\alpha$ by $\hat{\alpha}(\beta, \phi)$ to reflect its dependence on $\beta$ and $\phi$.

Liang and Zeger (1986) defined their estimate in terms of general estimating equations (GEE). Define the $n_i \times p$ Hessian matrix,
\[
D_i = \frac{\partial a'(\theta_i)}{\partial \beta}, \quad i = 1, \ldots, m . \qquad (5.4.4)
\]
Then their GEE estimator $\hat{\beta}_{LS}$ is the solution of the equations
\[
\sum_{i=1}^m D_i^T \hat{V}_i^{-1} [Y_i - a'(\theta_i)] = 0 . \qquad (5.4.5)
\]

To motivate our estimator, it is convenient to write this in terms of the Euclidean norm. Define the dispersion function
\[
D_{LS}(\beta) = \sum_{i=1}^m [Y_i - a'(\theta_i)]^T \hat{V}_i^{-1} [Y_i - a'(\theta_i)]
= \sum_{i=1}^m [\hat{V}_i^{-1/2} Y_i - \hat{V}_i^{-1/2} a'(\theta_i)]^T [\hat{V}_i^{-1/2} Y_i - \hat{V}_i^{-1/2} a'(\theta_i)]
= \sum_{i=1}^m \sum_{t=1}^{n_i} [y^*_{it} - d_{it}(\beta)]^2 , \qquad (5.4.6)
\]
where $Y^*_i = \hat{V}_i^{-1/2} Y_i = (y^*_{i1}, \ldots, y^*_{in_i})^T$, $d_{it}(\beta) = c_t^T a'(\theta_i)$, and $c_t^T$ is the $t$th row of $\hat{V}_i^{-1/2}$.

The gradient of $D_{LS}(\beta)$ is
\[
\nabla D_{LS}(\beta) = - \sum_{i=1}^m D_i^T \hat{V}_i^{-1} [Y_i - a'(\theta_i)] . \qquad (5.4.7)
\]


Thus the solution of the GEE equations (5.4.5) can also be expressed as
\[
\hat{\beta}_{LS} = \mathrm{Argmin}\, D_{LS}(\beta) . \qquad (5.4.8)
\]
From this point of view, $\hat{\beta}_{LS}$ is a nonlinear least squares (LS) estimator. We refer to it as the GEEWL2 estimator.

Consider, then, the robust rank-based nonlinear estimators discussed in Section 3.14. For nonnegative weights (see expression (5.4.10) below), we assume for now that the score function is odd about $1/2$, i.e., satisfies (2.5.9). In situations where this assumption is unwarranted, we can adjust the weights to accommodate scores appropriate for skewed error distributions; see the discussion in Section 5.4.3.

Next consider the general model defined by expressions (5.4.1) and (5.4.2). As in the LS development, let $Y^*_i = \hat{V}_i^{-1/2} Y_i = (y^*_{i1}, \ldots, y^*_{in_i})^T$ and $g_{it}(\beta) = c_t^T a'(\theta_i)$, where $c_t^T$ is the $t$th row of $\hat{V}_i^{-1/2}$, and let $G^*_i = [g_{it}]$. The rank-based dispersion function is given by
\[
D_R(\beta) = \sum_{i=1}^m \sum_{t=1}^{n_i} \varphi[R(y^*_{it} - g_{it}(\beta))/(n+1)]\,[y^*_{it} - g_{it}(\beta)] . \qquad (5.4.9)
\]

We next write the R estimator as a weighted LS estimator. From this representation the asymptotic theory of the R estimator can be derived. Furthermore, it naturally suggests an IRLS algorithm. Let $e_{it}(\beta) = y^*_{it} - g_{it}(\beta)$ denote the $(i,t)$th residual and let $m(\beta) = \mathrm{med}_{(i,t)}\{e_{it}(\beta)\}$ denote the median of all the residuals. Then, because the scores sum to 0, we have the identity
\[
D_R(\beta) = \sum_{i=1}^m \sum_{t=1}^{n_i} \varphi[R(e_{it}(\beta))/(n+1)]\,[e_{it}(\beta) - m(\beta)]
= \sum_{i=1}^m \sum_{t=1}^{n_i} \frac{\varphi[R(e_{it}(\beta))/(n+1)]}{e_{it}(\beta) - m(\beta)}\,[e_{it}(\beta) - m(\beta)]^2
= \sum_{i=1}^m \sum_{t=1}^{n_i} w_{it}(\beta)\,[e_{it}(\beta) - m(\beta)]^2 , \qquad (5.4.10)
\]

where $w_{it}(\beta) = \varphi[R(e_{it}(\beta))/(n+1)]/[e_{it}(\beta) - m(\beta)]$ is a weight function. As usual, we take $w_{it}(\beta) = 0$ if $e_{it}(\beta) - m(\beta) = 0$. Note that by using the median of the residuals in conjunction with property (2.5.9), the weights are positive. To accommodate other score functions besides those that satisfy (2.5.9), quantiles other than the median can be used; see Example 5.4.3 and Sievers and Abebe (2004) for discussion.
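For concreteness, here is a minimal sketch of the weights in (5.4.10) for Wilcoxon scores, with the residuals $e_{it}$ stacked into a single vector e (our notation, not the authors' code):

# w = phi[R(e)/(n+1)] / (e - med(e)), set to 0 where the denominator vanishes;
# Wilcoxon score function phi(u) = sqrt(12)*(u - 1/2).
gee_weights <- function(e) {
  n   <- length(e)
  phi <- sqrt(12) * (rank(e) / (n + 1) - 0.5)
  d   <- e - median(e)
  ifelse(d == 0, 0, phi / d)
}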

For the initial estimator of $\beta$, we recommend the rank-based estimator of Chapter 3 based on the score function $\varphi(u)$. Denote this estimator by $\hat{\beta}^{(0)}_R$. As estimates of the weights, we use $\hat{w}_{it}(\hat{\beta}^{(0)}_R)$, i.e., the weight function evaluated at $\hat{\beta}^{(0)}_R$. Expression (5.4.10) leads to the dispersion function
\[
D^*_R(\beta \mid \hat{\beta}^{(0)}_R)
= \sum_{i=1}^m \sum_{t=1}^{n_i} \hat{w}_{it}(\hat{\beta}^{(0)}_R)\,[e_{it}(\beta) - m(\hat{\beta}^{(0)}_R)]^2
= \sum_{i=1}^m \sum_{t=1}^{n_i} \left[ \sqrt{\hat{w}_{it}(\hat{\beta}^{(0)}_R)}\, e_{it}(\beta) - \sqrt{\hat{w}_{it}(\hat{\beta}^{(0)}_R)}\, m(\hat{\beta}^{(0)}_R) \right]^2 . \qquad (5.4.11)
\]

Let
\[
\hat{\beta}^{(1)}_R = \mathrm{Argmin}\, D^*_R(\beta \mid \hat{\beta}^{(0)}_R) . \qquad (5.4.12)
\]
This establishes a sequence of IRLS estimates $\{\hat{\beta}^{(k)}_R\}$, $k = 1, 2, \ldots$.

After some algebraic simplification, we obtain the gradient
\[
\nabla D^*_R(\beta \mid \hat{\beta}^{(k)}_R)
= -2 \sum_{i=1}^m D_i^T \hat{V}_i^{-1/2} \hat{W}_i \hat{V}_i^{-1/2}
\left[ Y_i - a'(\theta_i) - m^*(\hat{\beta}^{(k)}_R) \right] , \qquad (5.4.13)
\]
where $m^*(\hat{\beta}^{(k)}_R) = \hat{V}_i^{1/2} m(\hat{\beta}^{(k)}_R) 1$, $1$ denotes an $n_i \times 1$ vector all of whose elements are 1, and $\hat{W}_i = \mathrm{diag}\{\hat{w}_{i1}, \ldots, \hat{w}_{in_i}\}$ is the diagonal matrix of weights for the $i$th subject. Hence, $\hat{\beta}^{(k+1)}_R$ satisfies the general estimating equations (GEE) given by
\[
\sum_{i=1}^m D_i^T \hat{V}_i^{-1/2} \hat{W}_i \hat{V}_i^{-1/2}
\left[ Y_i - a'(\theta_i) - m^*(\hat{\beta}^{(k)}_R) \right] = 0 . \qquad (5.4.14)
\]

We refer to this weighted general estimating equations estimator as the GEEWR estimator.
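Schematically, in the simplest setting of working independence ($\hat{V}_i = I$) and a linear mean ($a'(\theta_i) = X_i\beta$), one GEEWR iteration reduces to a weighted least squares step. The sketch below (ours, reusing the hypothetical gee_weights helper) illustrates (5.4.14); it is not the authors' implementation:

geewr_wi <- function(y, X, beta0, iters = 20) {
  beta <- beta0
  for (k in seq_len(iters)) {
    e <- as.vector(y - X %*% beta)
    w <- gee_weights(e)             # weights evaluated at the current iterate
    m <- median(e)
    # solve sum_i X_i' W_i [y_i - X_i beta - m 1] = 0 for the next iterate
    beta <- solve(t(X) %*% (w * X), t(X) %*% (w * (y - m)))
  }
  beta
}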

5.4.1 Asymptotic Theory

Recall that both the GEEWL2 and GEEWR estimators were defined in terms of the univariate variables $y^*_{it}$. These of course are transformations of the original observations by the estimates of the covariance matrix $V_i$ and the weight matrix $W_i$. For the theory, we need to consider similarly transformed variables using the matrices $V_i$ and $W_i$, where this notation means that $V_i$ and $W_i$ are evaluated at the true parameters. For $i = 1, \ldots, m$ and $t = 1, \ldots, n_i$, let
\[
Y^{\dagger}_i = V_i^{-1/2} Y_i = (y^{\dagger}_{i1}, \ldots, y^{\dagger}_{in_i})^T , \qquad
G^{\dagger}_i(\beta) = V_i^{-1/2} a'(\theta_i) = [g^{\dagger}_{it}] , \qquad
e^{\dagger}_{it} = y^{\dagger}_{it} - g^{\dagger}_{it}(\beta) . \qquad (5.4.15)
\]

To obtain asymptotic distribution theory for a GEE procedure, assumptions concerning these errors $e^{\dagger}_{it}$ must be made. Regularity conditions for the GEEWL2 estimates are discussed in Liang and Zeger (1986). For the GEEWR estimator, assume these conditions and, further, that the marginal pdf of $e^{\dagger}_{it}$ is continuous and the variance-covariance matrix given in (5.4.16) is positive definite. Under these conditions, Abebe et al. (2010) derived the asymptotic distribution of the GEEWR estimator. The proof involves a Taylor series expansion, as in Liang and Zeger's (1986) proof, and the rank-based theory found in Brunner and Denker (1994) for dependent observations. We state the result in the next theorem.

Theorem 5.4.1. Assume that the initial estimate satisfies $\sqrt{m}(\hat{\beta}^{(0)}_R - \beta) = O_p(1)$. Then, under the above assumptions, for $k \ge 1$, $\sqrt{m}(\hat{\beta}^{(k)}_R - \beta)$ has an asymptotic normal distribution with mean $0$ and covariance matrix
\[
\lim_{m \to \infty} m \left[ \sum_{i=1}^m D_i^T V_i^{-1/2} W_i V_i^{-1/2} D_i \right]^{-1}
\left[ \sum_{i=1}^m D_i^T V_i^{-1/2} \mathrm{Var}(\varphi^{\dagger}_i) V_i^{-1/2} D_i \right]
\left[ \sum_{i=1}^m D_i^T V_i^{-1/2} W_i V_i^{-1/2} D_i \right]^{-1} , \qquad (5.4.16)
\]
where $\varphi^{\dagger}_i$ denotes the $n_i \times 1$ vector $(\varphi[R(e^{\dagger}_{i1})/(n+1)], \ldots, \varphi[R(e^{\dagger}_{in_i})/(n+1)])^T$.

5.4.2 Implementation and a Monte Carlo Study

For practical use of the GEEWR estimate, the asymptotic covariance matrix (5.4.16) requires estimation. This is true even in the case where percentile bootstrap confidence intervals are employed for inference, because appropriately standardized bootstrap estimates are generally used. We present a nonparametric estimator of the covariance structure and then an approximation to it. We compare these in a small simulation study.

Nonparametric (NP) Estimator of Covariance

The covariance structure suggests a simple moment estimator. Let $\hat{\beta}^{(k)}$ and (for the $i$th subject) $\hat{V}^{(k)}_i$ denote the final estimates of $\beta$ and $V_i$, respectively. Then the residuals which estimate $e^{\dagger}_i \equiv (e^{\dagger}_{i1}, \ldots, e^{\dagger}_{in_i})^T$ are given by
\[
\hat{e}^{\dagger}_i = \left[ \hat{V}^{(k)}_i \right]^{-1/2}
\left\{ Y_i - \hat{G}^{(k)}_i(\hat{\beta}^{(k)}) \right\}, \quad i = 1, \ldots, m , \qquad (5.4.17)
\]
where $\hat{G}^{(k)}_i = [\hat{V}^{(k)}_i]^{-1/2} a'(\hat{\theta}^{(k)})$ and $\hat{\theta}^{(k)}_{it} = h(x_{it}^T \hat{\beta}^{(k)})$. Let $R(\hat{e}^{\dagger}_{it})$ denote the rank of $\hat{e}^{\dagger}_{it}$ among $\{\hat{e}^{\dagger}_{i't'},\ t' = 1, \ldots, n_{i'};\ i' = 1, \ldots, m\}$. Let $\hat{\varphi}^{\dagger}_i = (\varphi[R(\hat{e}^{\dagger}_{i1})/(n+1)], \ldots, \varphi[R(\hat{e}^{\dagger}_{in_i})/(n+1)])^T$ and let $S_i = \hat{\varphi}^{\dagger}_i - \bar{\hat{\varphi}}^{\dagger}_i 1_{n_i}$. Then a moment estimator of the covariance matrix (5.4.16) is that expression with $\mathrm{Var}(\varphi^{\dagger}_i)$ estimated by
\[
\widehat{\mathrm{Var}}(\varphi^{\dagger}_i) = S_i S_i^T , \qquad (5.4.18)
\]


and, of course, final estimates of $D_i$ and $V_i$. We label this estimator (NP). Although this is a simple nonparametric estimate of the covariance structure, in a simulation study Abebe et al. (2010) showed that this estimate often leads to a very liberal inference. Werner and Brunner (2007) discovered this in a corresponding rank testing problem.

Approximation (AP) of the Nonparametric Estimator

The form of the weights, though, suggests a simple approximation, which is based on certain ideal conditions. Suppose the model is correct. Assume that the true transformed errors are independent. Then, because the scores have been standardized, $\mathrm{Var}(\varphi^{\dagger}_i)$ converges asymptotically to $I_{n_i}$, so replace it with $I_{n_i}$. This is the first part of the approximation.

Next consider the weights. The functional for the weights is of the form $\varphi[F(e)]/e$. Assuming that $F(0) = 1/2$, a simple application of the Mean Value Theorem gives the approximation $\varphi[F(e)]/e \approx \varphi'[F(e)]f(e)$. The expected value of this approximation can be expressed as
\[
\tau^{-1} = \int_{-\infty}^{\infty} \varphi'[F(t)] f^2(t)\, dt
= \int_0^1 \varphi(u) \left\{ -\frac{f'[F^{-1}(u)]}{f[F^{-1}(u)]} \right\} du , \qquad (5.4.19)
\]
where the second integral is derived from the first by integration by parts followed by a substitution. The parameter $\tau$ is of course the usual scale parameter for the R estimates in the linear model based on the score function $\varphi(u)$. The second part of the approximation is to replace the weight matrix by $(1/\hat{\tau})I$. We label this estimator of the covariance matrix of $\hat{\beta}^{(k)}$ by (AP).

Monte Carlo Study

We report the results of a small simulation study in Abebe et al. (2010) which compares the estimators (NP) and (AP). It also provides empirical information on the relative efficiency between $\hat{\beta}^{(k)}$ and the maximum likelihood estimator (mle) under assumed normality.

The simulated model is a randomized block design with the fixed factor at five levels and the random (block) factor at seven levels. The distribution of the random effect was taken to be normal. Two error distributions were considered: a normal and a contaminated normal with the contamination rate at 20% and the ratio of the contaminated standard deviation to the noncontaminated at five. For the normal error model, the intraclass correlation coefficient was set at 0.5. For each distribution, 10,000 simulations were run.

We consider the GEEWR estimator based on a working independence covariance structure. We compared it with the maximum likelihood estimator (mle) for a randomized block design. This yields the traditional analysis used in practice. We used the R function lme (Pinheiro et al., 2007) to compute it.

Table 5.4.1 records the empirical efficiencies and empirical confidence coefficients between the GEEWR estimator and the mle estimator for the fixed effect contrasts between level 1 and the other four levels. The empirical confidence coefficients are for nominal 95% confidence intervals based on the asymptotic distribution of the GEEWR estimator using the nonparametric (NP) estimate of the covariance structure, the approximation (AP) discussed above, and the mle inference.

Table 5.4.1: Empirical Efficiencies and Confidence Coefficients.

                              Contrast
Dist.    Method      β21      β31      β41      β51
Empirical Efficiency
Norm                0.974    0.974    0.972    0.973
CN                  2.065    2.102    2.050    2.055
Empirical Conf. Coeff.
Norm     mle        0.916    0.915    0.914    0.914
         NP         0.546    0.551    0.564    0.549
         AP         0.951    0.955    0.954    0.951
CN       mle        0.919    0.923    0.916    0.915
         NP         0.434    0.445    0.438    0.441
         AP         0.890    0.803    0.893    0.889

At the normal distribution, the loss in empirical efficiency of the GEEWR estimates relative to the mle estimates is only about 3%, while for the contaminated normal distribution the gain in efficiency of the GEEWR estimates over the maximum likelihood estimates is about 200%. Hence, for these situations the GEEWR estimator possesses robustness of efficiency. In terms of empirical confidence coefficients, the nonparametric procedure is quite liberal. In contrast, the confidence coefficients of the approximate procedure are quite close to the nominal confidence (95%) for the normal situation and similar to those of the mle for the contaminated normal situation.

5.4.3 Example

As an example, we selected part of a study by Plaisance et al. (2007) concerning the effect of a single session of high intensity aerobic exercise on inflammatory markers of subjects taken over time. One purpose of the study was to see if these markers differed depending on the fitness level of the subject. Subjects were placed into one of two groups (High Fitness and Moderate Fitness) depending on the level of their peak oxygen uptake. The response we consider here is C-reactive protein (CRP). Elevated CRP levels are a marker of low-grade chronic inflammation and may predict a higher risk for cardiovascular disease (Ridker et al., 2002). The effect of interest is the difference in CRP between the two groups, which we denote by θ. Hence, a one-sided hypothesis of interest is

H0 : θ ≥ 0 versus HA : θ < 0. (5.4.20)

Out of the 21 subjects in the study, three were removed due to noncompliance or incomplete information. Thus, we consider the remaining 18 individuals, 9 in each group.


CRP level was obtained 24 hours and immediately prior to the acute bout of exercise and subsequently 24, 72, and 120 hours following exercise, giving 90 data points in all. The data are displayed in Table A.0.2 of Appendix B. The top left comparison boxplot of Figure 5.4.1 shows the effect based on the raw responses. An estimate of the effect based on the raw data is the difference in medians, which is $-0.54$. Note that the responses are skewed, with outliers in each group. We took the time of measurement as a covariate. Let $y_i$ and $x_i$ denote respectively the $5 \times 1$ vectors of observations and times of measurement for subject $i$, and let $c_i$ denote his/her indicator variable for group, i.e., its components are either 0 (for Moderate Fitness) or 1 (for High Fitness). Then our model is
\[
y_i = \alpha 1_5 + \theta c_i + \beta x_i + e_i , \quad i = 1, \ldots, 18 , \qquad (5.4.21)
\]
where $e_i$ denotes the vector of errors for the $i$th individual. We present the results for three covariance structures of $e_i$: working independence (WI), compound symmetry (CS), and autoregressive-one (AR(1)). We fit the GEEWR estimate for each of these covariance structures using Wilcoxon scores.

The error model for compound symmetry is the simple mixed model, i.e., $e_i = b_i 1_{n_i} + a_i$, where $b_i$ is the random effect for subject $i$ and the components of $a_i$ are iid and independent of $b_i$. Let $\sigma_b^2$ and $\sigma_a^2$ denote the variances of $b_i$ and $a_{ij}$, respectively. Let $\sigma_t^2 = \sigma_b^2 + \sigma_a^2$ denote the total variance and $\rho = \sigma_b^2/\sigma_t^2$ denote the intraclass coefficient. In this case, the covariance matrix of $e_i$ is of the form $\sigma_t^2[(1-\rho)I + \rho J]$. We estimated the variance component parameters $\sigma_t^2$ and $\rho$ at each step of the fit of Model (5.4.21) using the robust estimators discussed in Section 5.2.1.

The error model for the AR(1) is $e_{ij} = \rho_1 e_{i,j-1} + a_{ij}$, $j = 2, \ldots, n_i$, where, for the $i$th subject, the $a_{ij}$'s are iid. The $(s,t)$ entry of the covariance matrix of $e_i$ is $\kappa \rho_1^{|s-t|}$, where $\kappa = \sigma_a^2/(1-\rho_1^2)$. To estimate the covariance structure at step $k$, for each subject we fit this autoregressive model to the current residuals. For each subject, we then estimate $\rho_1$ using the Wilcoxon regression estimate of Chapter 3. As our estimate of $\rho_1$, we take the median over subjects of these Wilcoxon regression estimates. Likewise, as our estimate of $\sigma_a^2$ we take the median over subjects of the MAD$^2$ of the residuals based on the AR(1) fits.

Note that there are only 18 experimental units in this problem, nine for each treatment, so it is a small sample problem. Accordingly, we used a bootstrap to standardize the GEEWR estimates. Our bootstrap consisted of resampling the 18 experimental units, nine from each group. This keeps the covariance structure intact. Then, for each bootstrap sample, the GEEWR estimate was computed and recorded. We used 3000 bootstrap samples. With these small samples, the outliers had an effect on the bootstrap also; hence, we used the MAD of the bootstrap estimates of $\theta$ as our standard error of $\hat{\theta}$.
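A sketch of this cluster bootstrap (ours; highfit and modfit are hypothetical lists of per-subject data and geewr_fit a hypothetical fitting routine returning the estimate of θ):

theta_boot <- replicate(3000, {
  boot_sample <- c(sample(highfit, 9, replace = TRUE),   # resample whole subjects
                   sample(modfit,  9, replace = TRUE))   # to keep within-subject dependence
  geewr_fit(boot_sample)$theta
})
se_theta <- mad(theta_boot)   # MAD, since outliers affect the bootstrap as well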

Table 5.4.2 summarizes the three GEEWR estimates of $\theta$ and $\beta$, along with the estimates of the variance components for the CS and AR(1) models. As the comparison boxplots of the residuals in Figure 5.4.1 show, the three fits are similar. The WI and AR(1) estimates of the effect $\theta$ are quite similar, including their bootstrap standard errors. The CS estimate of $\theta$, though, is more precise, and it is closer to the difference in medians based on the raw data, $-0.54$.


[Figure 5.4.1 here. Panels: group comparison boxplots of CRP (High Fit vs. Mod. Fit); comparison boxplots of the residuals from the AR(1), CS, and WI fits; residual plot of the CS fit; and normal q−q plot of the residuals for the CS fit.]

Figure 5.4.1: Plots for CRP Data.

The traditional fit of the simple mixed model (under the CS covariance structure) is the maximum likelihood fit based on normality. We obtained this fit by using the lme function in R. Its estimate of $\theta$ is $-0.319$ with standard error 0.297. For the hypotheses of interest (5.4.20), based on asymptotic normality, the CS GEEWR estimate is marginally significant with $p = 0.064$, while the mle estimate is insignificant with $p = 0.141$.

Note that the residual and q−q plots of the CS GEEWR fit (bottom plots of Figure 5.4.1) show that the error distribution is right skewed with a heavy right tail. This suggests using scores more appropriate for skewed error distributions than the Wilcoxon scores. We considered a simple score from the class of Winsorized Wilcoxon scores. The Wilcoxon score function is linear. For these data, a suitable Winsorizing score function is the piecewise linear function which is linear on the interval $(0, c)$ and then constant on the interval $(c, 1)$. As discussed in Example 2.5.1 of Chapter 2, these scores are optimal for a skewed distribution with a logistic left tail and an exponential right tail. We obtained the GEEWR fit of these data using this score function with $c = 0.80$, i.e., the bend is at 0.80. To insure positive weights, we used the 47th percentile as the location estimator $m(\beta)$ in the definition of


Table 5.4.2: Summary of Estimates and Bootstrap Standard Errors (BSE).

Wilcoxon Scores
Cov.     θ̂        BSE     β̂         BSE      Cov. Parameters
WI     −0.291    0.293   −0.0007    0.0007   NA
CS     −0.370    0.244   −0.0010    0.0007   σ²_a = 0.013, ρ = 0.968
AR(1)  −0.303    0.297   −0.0008    0.0015   ρ₁ = 0.023, σ²_a = 0.032

Winsorized Wilcoxon Scores with Bend at 0.8
CS     −0.442    0.282   −0.008     0.0008   σ²_a = 0.017, ρ = 0.966

the weights; see the discussion around expression (5.4.10). The computed estimates and their bootstrap standard errors are given in the last row of Table 5.4.2 for the compound symmetry case. The estimate of $\theta$ is $-0.442$, which is closer than the Wilcoxon estimate to the difference in medians based on the raw data. Using the bootstrap standard error, the corresponding $z$-test for hypotheses (5.4.20) is $-1.57$ with $p$-value 0.059, which is more significant than the test based on Wilcoxon scores. Computationally, the iterated reweighted GEEWR algorithm remains the same, except that the Wilcoxon scores are replaced by these Winsorized Wilcoxon scores.

As a final note, the residual plot of the GEEWR fit for the compound symmetric dependence structure also shows some heteroscedasticity: the variability of the residuals is directly proportional to the fitted values. This scale trend can be modeled robustly using the rank-based procedures discussed in Exercise 3.16.39.

5.5 Time Series


5.6 Exercises

5.6.1. Assume the simple mixed model (5.2.1). Show that expression (5.2.2) is true.

5.6.2. Obtain the ARE between the R and traditional estimates found in expression (5.2.4), for Wilcoxon scores, when the random error vector has a multivariate normal distribution.

5.6.3. Show that the asymptotic distribution of the LS estimator for the Arnold transformedmodel is given by expression (5.3.8).

5.6.4. Consider Example 5.3.1.

(a.) Verify the ATR and ATLS estimates in Table 5.3.2.

(b.) Over the range of ∆y used in the example, verify the relative changes in the ATR andATLS estimates as shown in the example.

5.6.5. Consider the discussion of test statistics around expression (5.1.12). Explore theasymptotic distributions of the drop in dispersion and aligned rank test statistics under thenull and contiguous alternatives for the general mixed model.

5.6.6. Continuing with the last exercise, suppose that the simple mixed model (5.2.1) is true. Suppose further that the design is centered within each block, i.e., $X'_k 1_{n_k} = 0_p$. For example, this is true for an ANOVA design in which all subjects have all treatment combinations, such as the Plasma Example of Section 4.

(a.) Under this assumption, show that expression (5.2.2) simplifies to $V_\varphi = \tau_\varphi^2 (1 - \rho_\varphi)(X'X)^{-1}$.

(b.) Show that the noncentrality parameter $\eta$, (5.1.13), simplifies to
\[
\eta = \frac{1}{\tau_\varphi^2(1-\rho_\varphi)} (M\beta)'[M(X'X)^{-1}M']^{-1} M\beta .
\]

(c.) Consider as a test statistic the standardized version of the reduction in dispersion,
\[
F_{RD,\varphi} = \frac{RD_\varphi/q}{(1-\hat{\rho}_\varphi)(\hat{\tau}_\varphi/2)} .
\]
Show that under the null hypothesis $H_0$, $qF_{RD,\varphi} \xrightarrow{D} \chi^2(q)$, and that under the sequence of alternatives $H_{An}$, $qF_{RD,\varphi} \xrightarrow{D} \chi^2(q, \eta)$, where the noncentrality parameter $\eta$ is given in Part (b).

(d.) Show that FW,ϕ, (5.1.12), and FRD,ϕ are asymptotically equivalent under the null andlocal alternative models.

(e.) Explore the asymptotic distribution of the aligned rank test under the conditions of this exercise.

5.6.7. AR(1) generalized exercise.


Chapter 6

Multivariate

6.1 Multivariate Location Model

We now consider a statistical model in which we observe vectors of observations. For example, we may record both the SAT verbal and math scores of students; we then wish to investigate the bivariate distribution of the scores. We may wish to test the hypothesis that the vector of population locations has changed over time, or to estimate the vector of locations. The framework in which we carry out the statistical inference is the multivariate location model, which is similar to the location model of Chapter 1.

For simplicity and convenience, we often discuss the bivariate case; the $k$-dimensional results are usually obvious changes in notation. Suppose that $X_1, \ldots, X_n$ are iid random vectors with $X_i^T = (X_{i1}, X_{i2})$. In this chapter, $T$ denotes transpose and we reserve prime for differentiation. We assume that $X$ has an absolutely continuous distribution with cdf $F(s-\theta_1, t-\theta_2)$ and pdf $f(s-\theta_1, t-\theta_2)$. We also assume that the marginal distributions are absolutely continuous. The vector $\theta = (\theta_1, \theta_2)^T$ is the location vector.

Definition 6.1.1. Distribution models for bivariate data. Let $F(s,t)$ be a prototype cdf; then the underlying model is a shifted version: $H(s,t) = F(s-\theta_1, t-\theta_2)$.

The following models will be used throughout this chapter.

1. We say the distribution is symmetric when $X$ and $-X$ have the same distribution, or $f(s,t) = f(-s,-t)$. This is sometimes called diagonal symmetry. The vector $(0,0)^T$ is the center of symmetry of $F$, and the location functionals all equal the center of symmetry. Unless stated otherwise, we assume symmetry throughout this chapter.

2. The distribution has spherical symmetry when $\Gamma X$ and $X$ have the same distribution, where $\Gamma$ is an orthogonal matrix. The pdf has the form $g(\|x\|)$, where $\|x\| = (x^T x)^{1/2}$ is the Euclidean norm of $x$. The contours of the density are circular.

3. In an elliptical model the pdf has the form $|\det \Sigma|^{-1/2} g(x^T \Sigma^{-1} x)$, where $\det$ denotes determinant and $\Sigma$ is a symmetric, positive definite matrix. The contours of the density are ellipses.

351

Page 362: Robust Nonparametric Statistical Methods

352 CHAPTER 6. MULTIVARIATE

4. A distribution is directionally symmetric if $X/\|X\|$ and $-X/\|X\|$ have the same distribution.

Note that elliptical symmetry implies symmetry, which in turn implies directional symmetry. In an elliptical model the contours of the density are elliptical, and if $\Sigma$ is the identity matrix then we have a spherically symmetric distribution. An elliptical distribution can be transformed into a spherical one by a transformation of the form $Y = DX$, where $D$ is a nonsingular matrix. Along with various models, we will encounter various transformations in this chapter. The following definition summarizes the transformations.

Definition 6.1.2. Data transformations.
(a) $Y = \Gamma X$ is an orthogonal transformation when the matrix $\Gamma$ is orthogonal. These transformations include rotations and reflections of the data.
(b) $Y = AX + b$ is called an affine transformation when $A$ is a nonsingular matrix and $b$ is any vector of real numbers.
(c) When the matrix $A$ in (b) is diagonal, we have a special affine transformation called a scale and location transformation.
(d) Suppose $t(X)$ represents one of the above transformations of the data. Let $\hat{\theta}(t(X))$ denote the estimator computed from the transformed data. Then we say the estimator is equivariant if $\hat{\theta}(t(X)) = t(\hat{\theta}(X))$. Let $V(t(X))$ denote a test statistic computed from the transformed data. We say the test statistic is invariant when $V(t(X)) = V(X)$.

Recall that Hotelling's $T^2$ statistic is given by
\[
T^2 = n(\bar{X} - \mu)^T S^{-1} (\bar{X} - \mu),
\]
where $S$ is the sample covariance matrix. In Exercise 6.8.1, the reader is asked to show that the vector of sample means is affine equivariant and that Hotelling's $T^2$ test statistic is affine invariant.

As in the earlier chapters, we begin with a criterion function or with a set of estimating equations. To fix the ideas, suppose that we wish to estimate $\theta$ or test the hypothesis $H_0: \theta = 0$, and we are given a pair of estimating equations:
\[
S(\theta) = \begin{pmatrix} S_1(\theta) \\ S_2(\theta) \end{pmatrix} = 0 ; \qquad (6.1.1)
\]
see Example 6.1.1 for three criterion functions. We now list the usual set of assumptions that we have been using throughout the book. These assumptions guarantee that the estimating equations are Pitman regular in the sense of Definition 1.5.3, so that we can define the estimate and test and develop the necessary asymptotic distribution theory. It is often convenient to suppose that the true value of $\theta$ is $0$, which we can do without loss of generality.

Definition 6.1.3. Pitman Regularity conditions.


(a) The components of $S(\theta)$ should be nonincreasing functions of $\theta_1$ and $\theta_2$.

(b) $E_0(S(0)) = 0$.

(c) $\frac{1}{\sqrt{n}} S(0) \xrightarrow{D_0} Z \sim N_2(0, A)$.

(d) $\sup_{\|b\| \le B} \left\| \frac{1}{\sqrt{n}} S\!\left(\frac{1}{\sqrt{n}} b\right) - \frac{1}{\sqrt{n}} S(0) + Bb \right\| \xrightarrow{P} 0$.

The matrix $A$ in (c) is the asymptotic covariance matrix of $\frac{1}{\sqrt{n}}S(0)$, and the matrix $B$ in (d) can be computed in various ways, depending on when differentiation and expectation can be interchanged. We list the various computations of $B$ for completeness. Note that $\nabla$ denotes differentiation with respect to the components of $\theta$:
\[
B = -E_0 \nabla \frac{1}{n} S(\theta) \Big|_{\theta=0}
= \nabla E_\theta \frac{1}{n} S(0) \Big|_{\theta=0}
= E_0[(-\nabla \log f(X))\, \Psi^T(X)], \qquad (6.1.2)
\]
where $\nabla \log f(X)$ denotes the vector of partial derivatives of $\log f(X)$ and $\Psi(\cdot)$ is such that
\[
\frac{1}{\sqrt{n}} S(\theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \Psi(X_i - \theta) + o_p(1).
\]

Brown (1985) proved a multivariate counterpart to Theorem 1.5.6. We state it next and refer the reader to the paper for the proof.

Theorem 6.1.1. Suppose conditions (a)-(c) of Definition 6.1.3 hold. Suppose further that $B$ is given by the second expression in (6.1.2) and is positive definite. If, for any $b$,
\[
\mathrm{trace}\left\{ n\, \mathrm{cov}\left[ \frac{1}{n} S\!\left(\frac{1}{\sqrt{n}} b\right) - \frac{1}{n} S(0) \right] \right\} \to 0 ,
\]
then (d) of Definition 6.1.3 also holds.

The estimate of $\theta$ is, of course, the solution of the estimating equations, denoted $\hat{\theta}$. Conditions (a) and (b) make this reasonable. To test the hypothesis $H_0: \theta = 0$ versus $H_A: \theta \ne 0$, we reject the null hypothesis when $\frac{1}{n} S^T(0) \hat{A}^{-1} S(0) \ge \chi^2_\alpha(2)$, where $\chi^2_\alpha(2)$ is the upper $\alpha$ percentile of a chi-square distribution with 2 degrees of freedom. Note that $\hat{A} \to A$, in probability, and typically $\hat{A}$ is a simple moment estimator of $A$. Condition (c) implies that this is an asymptotically size $\alpha$ test.


With condition (d) we can determine the asymptotic distribution of the estimate and the asymptotic local power of the test; hence, asymptotic efficiencies can be computed. We can determine the quantity that corresponds to the efficacy in the univariate case, described in Section 1.5.2 of Chapter 1. We do this next, before discussing specific estimating equations. The following proposition follows at once from the assumptions.

Theorem 6.1.2. Suppose conditions (a)-(d) in Definition 6.1.3 are satisfied, $\theta = 0$ is the true parameter value, and $\theta_n = \gamma/\sqrt{n}$ for some fixed vector $\gamma$. Further, $\hat{\theta}$ is the solution of the estimating equations. Then

1. $\sqrt{n}\,\hat{\theta} = B^{-1} \frac{1}{\sqrt{n}} S(0) + o_p(1) \xrightarrow{D_0} Z \sim MVN(0, B^{-1}AB^{-1})$,

2. $\frac{1}{n} S^T(0) A^{-1} S(0) \xrightarrow{D_{\theta_n}} \chi^2(2, \gamma^T B A^{-1} B \gamma)$,

where $\chi^2(a,b)$ is a noncentral chi-square with $a$ degrees of freedom and noncentrality parameter $b$.

Proof: Part 1 follows immediately from condition (d) and letting $\theta_n = \hat{\theta} \to 0$ in probability; see Theorem 1.5.7. Part 2 follows by observing (see Theorem 1.5.8) that
\[
P_{\theta_n}\!\left( \frac{1}{n} S^T(0) A^{-1} S(0) \ge \chi^2_\alpha(2) \right)
= P_0\!\left( \frac{1}{n} S^T\!\left(-\frac{1}{\sqrt{n}}\gamma\right) A^{-1} S\!\left(-\frac{1}{\sqrt{n}}\gamma\right) \ge \chi^2_\alpha(2) \right)
\]
and, from (d),
\[
\frac{1}{\sqrt{n}} S\!\left(-\frac{1}{\sqrt{n}}\gamma\right)
= \frac{1}{\sqrt{n}} S(0) + B\gamma + o_p(1) \xrightarrow{D_0} Z \sim MVN(B\gamma, A).
\]
Hence, we have a noncentral chi-square limiting distribution for the quadratic form. Note that the influence function of $\hat{\theta}$ is $\Omega(x) = B^{-1}\Psi(x)$, and we say $\hat{\theta}$ has bounded influence provided $\|\Omega(x)\|$ is bounded.

Definition 6.1.4. Estimation efficiency. The efficiency of a bivariate estimator can be measured using Wilk's generalized variance, defined to be the determinant of the covariance matrix of the estimator: $\sigma_1^2 \sigma_2^2 (1 - \rho_{12}^2)$, where $((\rho_{ij}\sigma_i\sigma_j))$ is the covariance matrix of the bivariate vector of estimates. The estimation efficiency of $\hat{\theta}_1$ relative to $\hat{\theta}_2$ is the square root of the reciprocal of the ratio of the generalized variances.

This means that the asymptotic covariance matrix, given by $B^{-1}AB^{-1}$, of the more efficient estimator will be "small" in the sense of generalized variance. See Bickel (1964) for further discussion of efficiency in the multivariate case.

Definition 6.1.5. Test efficiency. When comparing two tests based on $S_1$ and $S_2$, since the asymptotic local power is an increasing function of the noncentrality parameter, we define the test efficiency as the ratio of the respective noncentrality parameters.


In the bivariate case, we have $\gamma^T B_1 A_1^{-1} B_1 \gamma$ divided by $\gamma^T B_2 A_2^{-1} B_2 \gamma$ and, unlike the estimation case, the test efficiency may depend on the direction $\gamma$ along which we approach the origin; see Theorem 6.1.2. Hence, we note that, unlike the univariate case, the testing and estimation efficiencies are not necessarily equal. Bickel (1965) shows that the ratio of noncentrality parameters can be interpreted as the limiting ratio of sample sizes needed for the same asymptotic level and same asymptotic power along the same sequence of alternatives, as in the Pitman efficiency used throughout this book. We can see that $BA^{-1}B$ should be "large" just as $B^{-1}AB^{-1}$ should be "small". In the next section we consider how to set up the estimating equations and consider what sort of estimates and tests result. We will then be in a position to compute the efficiency of the estimates and tests relative to the traditional least squares estimates and tests. First we list three important criterion functions and their associated estimating equations. Other criterion functions will be introduced in later sections.

Example 6.1.1. Three criterion functions.

We now introduce three criterion functions that, in turn, produce estimating equations through differentiation. One of the criterion functions generates the vector of means, the $L_2$ or least squares estimate. The other two criterion functions generate different versions of what may be considered $L_1$ estimates, or bivariate medians. The two types of medians differ in their equivariance properties. See Small (1990) for an excellent review of multidimensional medians. The vector of means is equivariant under affine transformations of the data; see Exercise 6.8.1. The three criterion functions are:

\[
D_1(\theta) = \sqrt{ \sum_{i=1}^n [(x_{i1}-\theta_1)^2 + (x_{i2}-\theta_2)^2] } \qquad (6.1.3)
\]
\[
D_2(\theta) = \sum_{i=1}^n \sqrt{ (x_{i1}-\theta_1)^2 + (x_{i2}-\theta_2)^2 } \qquad (6.1.4)
\]
\[
D_3(\theta) = \sum_{i=1}^n \{ |x_{i1}-\theta_1| + |x_{i2}-\theta_2| \} \qquad (6.1.5)
\]

In each of these criterion functions we have pushed the square root operation deeper into the expression. As we will see, this produces very different types of estimates. We now take the gradients of these criterion functions and display the corresponding estimating functions. The computation of these gradients is given in Exercise 6.8.2.

\[
S_1(\theta) = [D_1(\theta)]^{-1} \begin{pmatrix} \sum (x_{i1}-\theta_1) \\ \sum (x_{i2}-\theta_2) \end{pmatrix} \qquad (6.1.6)
\]
\[
S_2(\theta) = \sum_{i=1}^n \| x_i - \theta \|^{-1} \begin{pmatrix} x_{i1}-\theta_1 \\ x_{i2}-\theta_2 \end{pmatrix} \qquad (6.1.7)
\]
\[
S_3(\theta) = \begin{pmatrix} \sum \mathrm{sgn}(x_{i1}-\theta_1) \\ \sum \mathrm{sgn}(x_{i2}-\theta_2) \end{pmatrix} \qquad (6.1.8)
\]


In (6.1.7), if the vector is zero, then we take the term in the summation to be zero also. In Exercise 6.8.3 the reader is asked to verify that $S_2(\theta) = S_3(\theta)$ in the univariate case; hence, we already see something new in the structure of the bivariate location model over the univariate location model. On the other hand, $S_1(\theta)$ and $S_3(\theta)$ are componentwise equations, unlike $S_2(\theta)$, in which the two components are entangled. The solution of (6.1.8) is the vector of medians, and the solution of (6.1.7) is the spatial median, which is discussed in Section 6.3. We will begin with an analysis of componentwise estimating equations and then consider other types.
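As a computational aside (our sketch, not an algorithm from the text), the spatial median solving (6.1.7) can be computed by a Weiszfeld-type fixed-point iteration, while the componentwise median solving (6.1.8) is immediate:

# Spatial median: re-center at the weighted mean with weights 1/||x_i - theta||.
spatial_median <- function(X, iters = 100, eps = 1e-10) {
  theta <- colMeans(X)
  for (k in seq_len(iters)) {
    d <- sqrt(rowSums(sweep(X, 2, theta)^2))
    w <- 1 / pmax(d, eps)               # guard against zero distances
    theta <- colSums(X * w) / sum(w)
  }
  theta
}
comp_median <- function(X) apply(X, 2, median)   # solves (6.1.8)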

Sections 6.2.3 through 6.4.4 deal with one sample estimates and tests based on vector signs and ranks. Both rotational and affine invariant/equivariant methods are developed. Two and several sample models are treated in Section 6.6 as examples of location models. In Section 6.6 we will be primarily concerned with componentwise methods.

6.2 Componentwise Methods

Note that $S_1(\theta)$ and $S_3(\theta)$ are of the general form
\[
S(\theta) = \begin{pmatrix} \sum \psi(x_{i1}-\theta_1) \\ \sum \psi(x_{i2}-\theta_2) \end{pmatrix}, \qquad (6.2.1)
\]

where $\psi(t) = t$ or $\mathrm{sgn}(t)$ for (6.1.6) and (6.1.8), respectively. We need to find the matrices $A$ and $B$ of Definition 6.1.3. It is straightforward to verify that, when the true value of $\theta$ is $0$,
\[
A = \begin{pmatrix} E\psi^2(X_{11}) & E\psi(X_{11})\psi(X_{12}) \\ E\psi(X_{11})\psi(X_{12}) & E\psi^2(X_{12}) \end{pmatrix}, \qquad (6.2.2)
\]
and, from (6.1.2),
\[
B = \begin{pmatrix} E\psi'(X_{11}) & 0 \\ 0 & E\psi'(X_{12}) \end{pmatrix}. \qquad (6.2.3)
\]

Provided that $A$ is positive definite, the multivariate central limit theorem implies that condition (c) in Definition 6.1.3 is satisfied for the componentwise estimating functions. In the case that $\psi(t) = \mathrm{sgn}(t)$, we use the second representation in (6.1.2). The estimating functions in (6.2.1) are examples of M-estimating functions; see Maronna, Martin and Yohai (2006).

Example 6.2.1. Pulmonary Measurements on Workers Exposed to Cotton Dust.

In this example we extend the discussion to $k = 3$ dimensions. The data consist of $n = 12$ trivariate ($k = 3$) observations on workers exposed to cotton dust. The measurements in Table 6.2.1 are changes in measurements of pulmonary functions: FVC (forced vital capacity), FEV3 (forced expiratory volume), and CC (closing capacity); see Merchant et al. (1975).


Table 6.2.1: Changes in Pulmonary Function after Six Hours of Exposure to Cotton Dust.

Subject    FVC    FEV3     CC
  1       −.11    −.12    −4.3
  2        .02     .08     4.4
  3       −.02     .03     7.5
  4        .07     .19     −.30
  5       −.16    −.36    −5.8
  6       −.42    −.49    14.5
  7       −.32    −.48    −1.9
  8       −.35    −.30    17.3
  9       −.10    −.04     2.5
 10        .01    −.02    −5.6
 11       −.10    −.17     2.2
 12       −.26    −.30     5.5

Let $\theta^T = (\theta_1, \theta_2, \theta_3)$ and consider $H_0: \theta = 0$ versus $H_A: \theta \ne 0$. First we compute the componentwise sign test. In (6.2.1) take $\psi(x) = \mathrm{sgn}(x)$; then $n^{-1/2} S_3^T = n^{-1/2}(-6, -6, 2)$, and the estimate of $A = \mathrm{Cov}(n^{-1/2}S_3)$ is
\[
\hat{A} = \frac{1}{n} \begin{pmatrix}
n & \sum \mathrm{sgn}\,x_{i1}\,\mathrm{sgn}\,x_{i2} & \sum \mathrm{sgn}\,x_{i1}\,\mathrm{sgn}\,x_{i3} \\
\sum \mathrm{sgn}\,x_{i1}\,\mathrm{sgn}\,x_{i2} & n & \sum \mathrm{sgn}\,x_{i2}\,\mathrm{sgn}\,x_{i3} \\
\sum \mathrm{sgn}\,x_{i1}\,\mathrm{sgn}\,x_{i3} & \sum \mathrm{sgn}\,x_{i2}\,\mathrm{sgn}\,x_{i3} & n
\end{pmatrix}
= \frac{1}{12} \begin{pmatrix} 12 & 8 & -4 \\ 8 & 12 & 0 \\ -4 & 0 & 12 \end{pmatrix}.
\]
Here the diagonal elements are $\sum_i \mathrm{sgn}^2(X_{is}) = n$ and the off-diagonal elements are the values of the statistics $\sum_i \mathrm{sgn}(X_{is})\,\mathrm{sgn}(X_{it})$. Hence, the test statistic is $n^{-1} S_3^T \hat{A}^{-1} S_3 = 3.667$, and using $\chi^2(3)$, the approximate $p$-value is 0.299; see Section 6.2.2.

We can also consider the finite sample conditional distribution, in which sign changes are generated with a binomial with $n = 12$ and $p = 0.5$; see the discussion in Section 6.2.2. Again note that the signs of all components of an observation vector are either changed or not. The matrix $\hat{A}$ remains unchanged, so it is simple to generate many values of $n^{-1}S_3^T\hat{A}^{-1}S_3$. Out of 2500 values we found 704 greater than or equal to 3.667; hence, the randomization or sign-change $p$-value is approximately $704/2500 = 0.282$, quite close to the asymptotic approximation. At any rate, we fail to reject $H_0: \theta = 0$ at any reasonable level.
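A sketch of this sign-change randomization for the cotton dust data (our illustration; X is the 12 × 3 matrix of Table 6.2.1):

sign_stat <- function(X, Ahat) {
  S <- colSums(sign(X))
  as.numeric(t(S) %*% solve(Ahat) %*% S) / nrow(X)
}
Ahat <- t(sign(X)) %*% sign(X) / nrow(X)   # reduced-model estimate; unchanged by sign flips
obs  <- sign_stat(X, Ahat)                 # 3.667 here
flips <- replicate(2500, {
  s <- sample(c(-1, 1), nrow(X), replace = TRUE)  # one sign per observation vector
  sign_stat(s * X, Ahat)
})
pval <- mean(flips >= obs)                 # approximately 0.28, as in the text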

Further, Hotelling's $T^2 = n\bar{X}^T \hat{\Sigma}^{-1} \bar{X} = 14.02$, with a $p$-value of 0.051 based on the $F$-distribution of $[(n-p)/((n-1)p)]\,T^2$ with 3 and 9 degrees of freedom. Hence, Hotelling's $T^2$ is significant at approximately the 0.05 level.

Figure 6.2.1 provides boxplots for the data and componentwise normal q−q plots. These boxplots suggest that any differences will be due to the upward shift in the CC distribution. The normal q−q plot of the component CC shows two outlying values on the right side. In the case of the componentwise Wilcoxon test, Section 6.2.3, we consider $(n+1)S_4(0)$ in (6.2.14) along with $(n+1)^2\hat{A}$, essentially in (6.2.15). For the pulmonary function data,


Figure 6.2.1: Panel A: Boxplots of the changes in pulmonary function for the Cotton Dust Data; the responses have been standardized by componentwise standard deviations. Panel B: normal q−q plot for the component FVC, original scale. Panel C: normal q−q plot for the component FEV3, original scale. Panel D: normal q−q plot for the component CC, original scale.


$(n+1)S_4^T(0) = (-63, -52, 28)$ and
\[
(n+1)^2 \hat{A} = \frac{1}{n} \begin{pmatrix} 649 & 620.5 & -260.5 \\ 620.5 & 649.5 & -141.5 \\ -260.5 & -141.5 & 650 \end{pmatrix}.
\]
The diagonal elements are $\sum_i R^2(|X_{is}|)$, which should be $\sum_i i^2 = 650$ but differ for the first two components due to ties among the absolute values. The off-diagonal elements are $\sum_i R(|X_{is}|) R(|X_{it}|)\,\mathrm{sgn}(X_{is})\,\mathrm{sgn}(X_{it})$. The test statistic is then $n^{-1} S_4^T(0) \hat{A}^{-1} S_4(0) = 7.82$. From the $\chi^2(3)$ distribution, the approximate $p$-value is 0.0498. Hence, the Wilcoxon test rejects the null hypothesis at essentially the same level as Hotelling's $T^2$ test.

In the construction of tests we generally must estimate the matrix $A$. When testing $H_0: \theta = 0$ the question arises as to whether or not we should center the data using $\hat{\theta}$. If we do not center, then we are using a reduced model estimate of $A$; otherwise, it is a full model estimate. Reduced model estimates are generally used in randomization tests. In this case, generally, $\hat{A}$ must be computed only once in the process of randomizing and recomputing the test statistic $n^{-1} S^T \hat{A}^{-1} S$. Note also that when $H_0: \theta = 0$ is true, $\hat{\theta} \xrightarrow{P} 0$; hence, the centered $\hat{A}$ is valid under $H_0$. When estimating the asymptotic $\mathrm{Cov}(\hat{\theta})$, $B^{-1}AB^{-1}$, we should center $\hat{A}$, because we no longer assume that $H_0$ is true.

6.2.1 Estimation

Let $\theta = (\theta_1, \theta_2)^T$ denote the true vector of location parameters. Then, when (6.1.2) holds, the asymptotic covariance matrix in Theorem 6.1.2 is
\[
B^{-1}AB^{-1} = \begin{pmatrix}
\dfrac{E\psi^2(X_{11}-\theta_1)}{[E\psi'(X_{11}-\theta_1)]^2} &
\dfrac{E\psi(X_{11}-\theta_1)\psi(X_{12}-\theta_2)}{E\psi'(X_{11}-\theta_1)\,E\psi'(X_{12}-\theta_2)} \\[8pt]
\dfrac{E\psi(X_{11}-\theta_1)\psi(X_{12}-\theta_2)}{E\psi'(X_{11}-\theta_1)\,E\psi'(X_{12}-\theta_2)} &
\dfrac{E\psi^2(X_{12}-\theta_2)}{[E\psi'(X_{12}-\theta_2)]^2}
\end{pmatrix}. \qquad (6.2.4)
\]

Now Theorem 6.1.2 can be applied for various M-estimates to establish asymptotic normality. Our interest is in the comparison of $L_2$ and $L_1$ estimates, and we now turn to that discussion. In the case of $L_2$ estimates, corresponding to $S_1(\theta)$, we take $\psi(t) = t$. In this case, $\theta$ in expression (6.2.4) is the vector of means. Then it is easy to see that $B^{-1}AB^{-1}$ is equal to the covariance matrix of the underlying model, say $\Sigma_f$. In applications, $\theta$ is estimated by the vector of component sample means. For the standard errors of these estimates, the vector of componentwise sample means replaces $\theta$ in expression (6.2.4), and the expected values are replaced by the corresponding sample moments. Then it is easy to see that the estimate of $B^{-1}AB^{-1}$ is equal to the traditional sample covariance matrix.

In the first $L_1$ case, corresponding to $S_3(\theta)$, we take $\psi(t) = \mathrm{sgn}(t)$ and find, using the second representation in (6.1.2), that
\[
B^{-1}AB^{-1} = \begin{pmatrix}
\dfrac{1}{4f_1^2(0)} & \dfrac{E\,\mathrm{sgn}(X_{11}-\theta_1)\,\mathrm{sgn}(X_{12}-\theta_2)}{4f_1(0)f_2(0)} \\[8pt]
\dfrac{E\,\mathrm{sgn}(X_{11}-\theta_1)\,\mathrm{sgn}(X_{12}-\theta_2)}{4f_1(0)f_2(0)} & \dfrac{1}{4f_2^2(0)}
\end{pmatrix}, \qquad (6.2.5)
\]

where $f_1$ and $f_2$ denote the marginal pdfs of the joint pdf $f(s,t)$, and $\theta_1$ and $\theta_2$ denote the componentwise medians. In applications, the estimate of $\theta$ is the vector of componentwise sample medians, which we denote by $(\hat{\theta}_1, \hat{\theta}_2)'$. For inference, an estimate of the asymptotic covariance matrix (6.2.5) is required. An estimate of $E\,\mathrm{sgn}(X_{11}-\theta_1)\,\mathrm{sgn}(X_{12}-\theta_2)$ is the simple moment estimator $n^{-1}\sum \mathrm{sgn}(x_{i1}-\hat{\theta}_1)\,\mathrm{sgn}(x_{i2}-\hat{\theta}_2)$. The estimators discussed in Section 1.5.5, (1.5.28), can be used to estimate the scale parameters $1/2f_1(0)$ and $1/2f_2(0)$.

We now turn to the efficiency of the vector of sample medians with respect to the vector of sample means. Assume for each component that the median and mean are the same and that, without loss of generality, their common value is 0. Let $\delta = \det(B^{-1}AB^{-1}) = \det(A)/[\det(B)]^2$ be the Wilk's generalized variance of $\sqrt{n}\hat{\theta}$ in Definition 6.1.4. For the vector of means we have $\delta = \sigma_1^2\sigma_2^2(1-\rho^2)$, the determinant of the underlying variance-covariance matrix. For the vector of sample medians we have
\[
\delta = \frac{1 - (E\,\mathrm{sgn}\,X_{11}\,\mathrm{sgn}\,X_{12})^2}{16 f_1^2(0) f_2^2(0)} ,
\]
and the efficiency of the vector of medians with respect to the vector of means is given by
\[
e(\mathrm{med}, \mathrm{mean}) = 4\sigma_1\sigma_2 f_1(0) f_2(0)
\sqrt{\frac{1-\rho^2}{1 - [E\,\mathrm{sgn}\,X_{11}\,\mathrm{sgn}\,X_{12}]^2}} . \qquad (6.2.6)
\]

Note that $E\,\mathrm{sgn}\,X_{11}\,\mathrm{sgn}\,X_{12} = 4P(X_{11} < 0, X_{12} < 0) - 1$. When the underlying distribution is bivariate normal with means 0, variances 1, and correlation $\rho$, Exercise 6.8.4 shows that
\[
P(X_{11} < 0, X_{12} < 0) = \frac{1}{4} + \frac{1}{2\pi} \sin^{-1}\rho . \qquad (6.2.7)
\]
Further, the marginal distributions are standard normal; hence, (6.2.6) becomes
\[
e(\mathrm{med}, \mathrm{mean}) = \frac{2}{\pi}
\sqrt{\frac{1-\rho^2}{1 - [(2/\pi)\sin^{-1}\rho]^2}} . \qquad (6.2.8)
\]
The first factor, $2/\pi \cong 0.637$, is the univariate efficiency of the median relative to the mean when the underlying distribution is normal, and also the efficiency of the vector of medians relative to the vector of means when the correlation in the underlying model is zero. The second factor accounts for the bivariate structure of the model and, in general, depends on the correlation $\rho$. Some values of the efficiency are given in Table 6.2.2.


Table 6.2.2: Efficiency (6.2.8) of the vector of medians relative to the vector of means when the underlying distribution is bivariate normal.

ρ      0    .1   .2   .3   .4   .5   .6   .7   .8   .9   .99
eff   .64  .63  .63  .62  .60  .58  .56  .52  .47  .40  .22
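Expression (6.2.8) is straightforward to evaluate; the following lines (our check) reproduce Table 6.2.2:

eff_med_mean <- function(rho)
  (2 / pi) * sqrt((1 - rho^2) / (1 - ((2 / pi) * asin(rho))^2))
round(eff_med_mean(c(0, .1, .2, .3, .4, .5, .6, .7, .8, .9, .99)), 2)
# 0.64 0.63 0.63 0.62 0.60 0.58 0.56 0.52 0.47 0.40 0.22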

Clearly, as the elliptical contours of the underlying normal distribution flatten out, the efficiency of the vector of medians decreases. This is the first indication that the vector of medians is not affine (or even rotation) equivariant. The vector of means is affine equivariant, and hence the dependency of the efficiency on $\rho$ must be due to the vector of medians. Indeed, Exercise 6.8.5 asks the reader to construct an example showing that when the axes are rotated, the vector of means rotates into the new vector of means while the vector of medians fails to do so.

6.2.2 Testing

We now consider the properties of bivariate tests. Recall that we assume the underlying bivariate distribution is symmetric. In addition, we would generally use an odd $\psi$-function, so that $\psi(t) = -\psi(-t)$. This implies that $\psi(t) = \psi(|t|)\,\mathrm{sgn}(t)$, which will be useful shortly.

Now, referring to Theorem 6.1.2 along with the corresponding matrix $A$, the test of $H_0: \theta = 0$ versus $H_A: \theta \ne 0$ rejects the null hypothesis when $\frac{1}{n} S^T(0) \hat{A}^{-1} S(0) \ge \chi^2_\alpha(2)$. Note that the covariance term in $A$ is $E\psi(X_{11})\psi(X_{12}) = \int\!\!\int \psi(s)\psi(t) f(s,t)\, ds\, dt$, and it depends upon the underlying bivariate distribution $f$. Hence, even the sign test based on the componentwise sign statistics $S_3(0)$ is not distribution free under the null hypothesis, as it is in the univariate case. In this case, $E\psi(X_{11})\psi(X_{12}) = 4P(X_{11} < 0, X_{12} < 0) - 1$, as we saw in the discussion of estimation.

To make the test operational we must estimate the components of $A$. Since they are expectations, we use moment estimates, under the null hypothesis. Now condition (c) in Definition 6.1.3 guarantees that the test with the estimated $A$ is asymptotically distribution free, since it has a limiting chi-square distribution, independent of the underlying distribution. What can we say about finite samples?

First note that
\[
S(0) = \begin{pmatrix} \sum \psi(|x_{i1}|)\,\mathrm{sgn}(x_{i1}) \\ \sum \psi(|x_{i2}|)\,\mathrm{sgn}(x_{i2}) \end{pmatrix}. \qquad (6.2.9)
\]

Under the assumption of symmetry, $(x_1, \ldots, x_n)$ is a realization of $(s_1 x_1, \ldots, s_n x_n)$, where $(s_1, \ldots, s_n)$ is a vector of independent random variables, each equaling $\pm 1$ with probability $1/2, 1/2$. Hence $Es_i = 0$ and $Es_i^2 = 1$. Condition on $(x_1, \ldots, x_n)$; then, under the null hypothesis, there are $2^n$ equally likely sign combinations associated with these vectors. Note that the sign changes attach to the entire vector. From (6.2.9), we see that, conditionally, the scores are not affected by the sign changes, and $S(0)$ depends on the sign changes only through the signs of the components of the observation vectors. It follows at once that the conditional mean of $S(0)$ under the null hypothesis is $0$. Further, the conditional covariance matrix is given by
\[
\begin{pmatrix}
\sum \psi^2(|x_{i1}|) & \sum \psi(|x_{i1}|)\psi(|x_{i2}|)\,\mathrm{sgn}(x_{i1})\,\mathrm{sgn}(x_{i2}) \\
\sum \psi(|x_{i1}|)\psi(|x_{i2}|)\,\mathrm{sgn}(x_{i1})\,\mathrm{sgn}(x_{i2}) & \sum \psi^2(|x_{i2}|)
\end{pmatrix}. \qquad (6.2.10)
\]

Note that, conditionally, $n^{-1}$ times this matrix is an estimate of the matrix $A$ above. Thus we have a conditionally distribution free sign-change distribution. For small to moderate $n$, the test statistic (quadratic form) can be computed for each combination of signs, and a conditional $p$-value of the test is the number of values (divided by $2^n$) of the test statistic at least as large as the observed value of the test statistic. In the first chapter on univariate methods this argument also leads to unconditionally distribution free tests in the case of the univariate sign and rank tests, since in those cases the signs and the ranks do not depend on the values of the conditioning variables. Again, the situation is different in the bivariate case due to the matrix $A$, which must be estimated since it depends on the unknown underlying distribution. In Exercise 6.8.6 the reader is asked to construct the sign-change distributions for some examples.

We now turn to a more detailed analysis of the tests based on $S_1 = S_1(0)$ and $S_3 = S_3(0)$. Recall that $S_1$ is the vector of sample means. The matrix $A$ is the covariance matrix of the underlying distribution, and we take the sample covariance matrix as its natural estimate. The resulting test statistic is $n\bar{X}^T\hat{A}^{-1}\bar{X}$, which is Hotelling's $T^2$ statistic. Note that for $T^2$ we typically use a centered estimate of $A$; if we want the randomization distribution, then we use the uncentered estimate. Since $BA^{-1}B = \Sigma_f^{-1}$, where $\Sigma_f$ is the covariance matrix of the underlying distribution, the asymptotic noncentrality parameter for Hotelling's test is $\gamma^T\Sigma_f^{-1}\gamma$. The vector $S_3$ is the vector of componentwise sign statistics. By inverting (6.2.5) we can write down the noncentrality parameter for the bivariate componentwise sign test.

To illustrate the efficiency of the bivariate sign test relative to Hotelling's test we simplify the structure as follows: assume that the marginal distributions are identical. Let ξ = 4P(X11 < 0, X12 < 0) − 1 and let ρ denote the underlying correlation, as usual. Then Hotelling's noncentrality parameter is

[1/(σ²(1 − ρ²))] γᵀ ( 1  −ρ ; −ρ  1 ) γ = (γ1² − 2ργ1γ2 + γ2²) / [σ²(1 − ρ²)] .   (6.2.11)

Likewise, the noncentrality parameter for the bivariate sign test is

[4f²(0)/(1 − ξ²)] γᵀ ( 1  −ξ ; −ξ  1 ) γ = 4f²(0)(γ1² − 2ξγ1γ2 + γ2²) / (1 − ξ²) .   (6.2.12)

The efficiency of the bivariate sign test relative to Hotelling's test is the ratio of their respective noncentrality parameters:

4f²(0)σ²(1 − ρ²)(γ1² − 2ξγ1γ2 + γ2²) / [(1 − ξ²)(γ1² − 2ργ1γ2 + γ2²)] .   (6.2.13)


Table 6.2.3: Minimum and maximum efficiencies of the bivariate sign test relative to Hotelling's T² when the underlying distribution is bivariate normal.

  ρ     0    .2    .4    .6    .8    .9   .99
 min  .64   .58   .52   .43   .31   .22   .07
 max  .64   .68   .71   .72   .72   .71   .66

There are three contributing factors in this efficiency: 4f²(0)σ², which is the univariate efficiency of the sign test relative to the t test; (1 − ρ²)/(1 − ξ²), due to the dependence structure in the bivariate distribution; and the final factor, which reflects the direction of approach of the sequence of alternatives. It is this last factor which separates the testing efficiency from the estimation efficiency. In order to see the effect of direction on the efficiency we will use the following result from matrix theory; see Graybill (1983).

Lemma 6.2.1. Suppose D is a nonsingular, square matrix and C is any square matrix, and suppose λ1 and λ2 are the minimum and maximum eigenvalues of CD⁻¹. Then

λ1 ≤ γᵀCγ / γᵀDγ ≤ λ2 .

The proof of the following theorem is left as Exercise 6.8.7.

Theorem 6.2.1. The efficiency e(S3, S1) is bounded between the minimum and maximum of 4f²(0)σ²(1 − ρ)/(1 − ξ) and 4f²(0)σ²(1 + ρ)/(1 + ξ).

In Table 6.2.3 we give some values of the maximum and minimum efficiencies when the underlying distribution is bivariate normal with means 0, variances 1, and correlation ρ. This table can be compared to Table 6.2.2, which contains the corresponding estimation efficiencies. We have f²(0) = (2π)⁻¹ and ξ = (2/π) sin⁻¹ ρ. Hence, the dependence of the efficiency on the direction determined by γ is apparent. The examples involving the bivariate normal distribution also show the superiority of the vector of means over the vector of medians and of Hotelling's test over the bivariate sign test, as expected. Bickel (1964, 1965) gives a more thorough analysis of the efficiency for general models. He points out that when heavy tailed models are expected then the medians and sign test will be much better, provided ρ is not too close to ±1.
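As a check, the bounds of Theorem 6.2.1 are easy to evaluate; the short R sketch below reproduces Table 6.2.3 using 4f²(0)σ² = 2/π and ξ = (2/π) sin⁻¹ ρ for the standard bivariate normal.

# Bounds of Theorem 6.2.1 at the bivariate normal: 4 f^2(0) sigma^2 = 2/pi
rho <- c(0, .2, .4, .6, .8, .9, .99)
xi  <- (2 / pi) * asin(rho)
round(rbind(min = (2 / pi) * (1 - rho) / (1 - xi),
            max = (2 / pi) * (1 + rho) / (1 + xi)), 2)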

In the exercises the reader is asked to show that Hotelling's T² statistic is affine invariant. Thus the efficiency properties of this statistic do not depend on ρ. This means that the bivariate sign test cannot be affine invariant; again, this is developed in the exercises. It is now natural to inquire about the properties of the estimate and test based on S2. This estimating function cannot be written in the componentwise form that we have been considering. Before we turn to this statistic, we consider estimates and tests based on componentwise ranking.


6.2.3 Componentwise Rank Methods

In this part we sketch the results for the vector of Wilcoxon signed rank statistics, discussed in Section 1.7, applied to each component. See Example 6.2.1 for an illustration of the calculations. In Section 6.6 we provide a full development of componentwise rank-based methods for location and regression models with examples. We let

S4(θ) = ( Σ [R(|xi1 − θ1|)/(n + 1)] sgn(xi1 − θ1) , Σ [R(|xi2 − θ2|)/(n + 1)] sgn(xi2 − θ2) )ᵀ .   (6.2.14)

Using the projection method, Theorem 2.4.6, we have from Exercise 6.8.8, for the case θ = 0,

S4(0) = ( Σ F1⁺(|xi1|) sgn(xi1) , Σ F2⁺(|xi2|) sgn(xi2) )ᵀ + op(1) = ( 2Σ [F1(xi1) − 1/2] , 2Σ [F2(xi2) − 1/2] )ᵀ + op(1) ,

where Fj⁺ is the marginal distribution of |X1j| and Fj is the marginal distribution of X1j, for j = 1, 2; see, also, Section A.2.3 of the Appendix. Symmetry of the marginal distributions is used in the computation of the projections. The conditions (a)-(d) of Definition 6.1.3 can now be verified for the projection, and then we note that the vector of rank statistics has the same asymptotic properties. We must identify the matrices A and B for the purposes of constructing the quadratic form test statistic, the asymptotic distribution of the vector of estimates, and the noncentrality parameter.

The first two conditions, (a) and (b), are easy to check, since the multivariate central limit theorem can be applied to the projection. Since, under the null hypothesis that θ = 0, F(Xi1) has a uniform distribution on (0, 1), and introducing θ and differentiating with respect to θ1 and θ2, the matrices A and B are

(1/n)A = ( 1/3  δ ; δ  1/3 )  and  B = ( 2∫f1²(t)dt  0 ; 0  2∫f2²(t)dt ) ,   (6.2.15)

where δ = 4∫∫F1(s)F2(t) dF(s, t) − 1. Hence, similar to the vector of sign statistics, the vector of Wilcoxon signed rank statistics also has a covariance which depends on the underlying bivariate distribution. We could construct a conditionally distribution free test but not an unconditionally distribution free one. Of course, the test is asymptotically distribution free.

A consistent estimate of the parameter δ in A is given by

δ̂ = (1/n) Σ_{i=1}^{n} [Ri1 Ri2 / ((n + 1)(n + 1))] sgn(Xi1) sgn(Xi2) ,   (6.2.16)

where Rit is the rank of |Xit| among |X1t|, . . . , |Xnt|, the absolute values of the observations in the tth component. This estimate is the conditional covariance and can be used in estimating A in the construction of an asymptotically distribution free test; when we estimate the asymptotic covariance matrix of θ̂ we first center the data and then compute (6.2.16).
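For illustration, a base R sketch of (6.2.16) follows. The optional centering step uses the componentwise Hodges-Lehmann estimates, which is one reasonable reading of "center the data" above; both function names are our own.

hl.est <- function(v) {                     # median of the Walsh averages
  w <- outer(v, v, "+") / 2
  median(w[row(w) <= col(w)])
}
delta.hat <- function(x, center = FALSE) {  # x: n x 2 data matrix
  if (center) x <- sweep(x, 2, apply(x, 2, hl.est))
  n <- nrow(x)
  R1 <- rank(abs(x[, 1])); R2 <- rank(abs(x[, 2]))
  sum(R1 * R2 * sign(x[, 1]) * sign(x[, 2])) / (n * (n + 1)^2)   # (6.2.16)
}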


Table 6.2.4: Efficiencies of componentwise Wilcoxon methods relative to L2 methods when the underlying distribution is bivariate normal.

  ρ     0    .2    .4    .6    .8    .9   .99
 min  .96   .94   .93   .91   .89   .88   .87
 max  .96   .96   .97   .97   .96   .96   .96
 est  .96   .96   .95   .94   .93   .92   .91

The estimator that solves S4(θ) = 0 is the vector of Hodges-Lehmann estimates for the two components; that is, the vector of medians of Walsh averages for each component. Like the vector of medians, the vector of HL estimates is not equivariant under orthogonal transformations, and the test is not invariant under these transformations. This shows up in the efficiency with respect to the L2 methods, which consist of an equivariant estimate and an invariant test. Theorem 6.1.2 provides the asymptotic distribution of the estimator and the asymptotic local power of the test.

Suppose the underlying distribution is bivariate normal with means 0, variances 1, and correlation ρ. Then the estimation and testing efficiencies are given by

e(HL, mean) = (3/π) √[(1 − ρ²)/(1 − 9δ²)] ,   (6.2.17)

e(Wilcoxon, Hotelling) = (3/π) [(1 − ρ²)/(1 − 9δ²)] (γ1² − 6δγ1γ2 + γ2²)/(γ1² − 2ργ1γ2 + γ2²) .   (6.2.18)

Exercise 6.8.9 asks the reader to apply Lemma 6.2.1 and show that the testing efficiency is bounded between

3(1 + ρ) / {2π[2 − (3/π) cos⁻¹(ρ/2)]}  and  3(1 − ρ) / {2π[(3/π) cos⁻¹(ρ/2) − 1]} .   (6.2.19)

In Table 6.2.4 we provide some values of the minimum and maximum efficiencies as well as the estimation efficiency. Note how much more stable the rank methods are than the sign methods. Bickel (1964) points out, however, that when there is heavy contamination and ρ is close to ±1 the estimation efficiency can be arbitrarily close to 0. Further, this efficiency can be arbitrarily large. This behavior is due to the fact that the sign and rank methods are not invariant and equivariant under orthogonal transformations, unlike the L2 methods. Hence, we now turn to an analysis of the methods generated by S2(θ). Additional material on the componentwise methods can be found in the papers of Bickel (1964, 1965) and the monograph by Puri and Sen (1971). The extension of the results to dimensions higher than two is straightforward and the formulas are obvious. One interesting question is how the efficiencies of the sign or rank methods relative to the L2 methods depend on the dimension. See Section 6.6 and Davis and McKean (1993) for componentwise linear model rank-based methods.
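The entries of Table 6.2.4 can be reproduced in the same way as those of Table 6.2.3; a sketch follows, using the bivariate normal value δ = (2/π) sin⁻¹(ρ/2) together with (6.2.17) and the bounds (6.2.19) written in the equivalent form (3/π)(1 ∓ ρ)/(1 ∓ 3δ).

rho   <- c(0, .2, .4, .6, .8, .9, .99)
delta <- (2 / pi) * asin(rho / 2)           # bivariate normal value of delta
round(rbind(min = (3 / pi) * (1 - rho) / (1 - 3 * delta),   # bounds (6.2.19)
            max = (3 / pi) * (1 + rho) / (1 + 3 * delta),
            est = (3 / pi) * sqrt((1 - rho^2) / (1 - 9 * delta^2))), 2)  # (6.2.17)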


6.3 Spatial Methods

6.3.1 Spatial Sign Methods

We are now ready to consider the estimate and test generated by S2(θ); recall (6.1.4) and (6.1.7). This estimating function cannot be written in componentwise fashion because ‖xi − θ‖ appears in both components. Note that S2(θ) = Σ ‖xi − θ‖⁻¹(xi − θ), a sum of unit vectors, so that the estimating function depends on the data only through the directions, and not on the magnitudes, of xi − θ, i = 1, . . . , n. The vector ‖x‖⁻¹x is also called the spatial sign of x. It generalizes the notion of the univariate sign, sgn(x) = |x|⁻¹x. Hence, the test is sometimes called the angle test or spatial sign test, and the estimate is called the spatial median; see Brown (1983). Milasevic and Ducharme (1987) show that the spatial median is always unique, unlike the univariate median. We will see that the test is invariant under orthogonal transformations and the estimate is equivariant under these transformations. Hence, the methods are rotation invariant and equivariant, properties suitable for methods used on spatial data. However, applications do not have to be confined to spatial data, and we will consider these methods competitors to the other methods already discussed.

Following our pattern above, we first consider the matrices A and B in Definition 6.1.3. Suppose θ = 0. Then, since S2(0) is a sum of independent random variables, condition (c) is immediate with A = E‖X‖⁻²XXᵀ, and the obvious estimate of A, under H0, is

Â = (1/n) Σ_{i=1}^{n} ‖xi‖⁻² xi xiᵀ ,   (6.3.1)

which can be used to construct the spatial sign test statistic with

(1/√n) S2(0) →D N2(0, A)  and  (1/n) S2ᵀ(0) Â⁻¹ S2(0) →D χ²(2) .   (6.3.2)

In order to compute B, we first compute the partial derivatives and then take the expectation. This yields

B = E (1/‖X‖) [ I − (1/‖X‖²) XXᵀ ] ,   (6.3.3)

where I is the identity matrix. We use a moment estimate for B, similar to the estimate of A. The spatial median is determined by

θ̂ = Argmin Σ_{i=1}^{n} ‖xi − θ‖ ,   (6.3.4)

or as the solution to the estimating equations

S2(θ) = Σ_{i=1}^{n} (xi − θ)/‖xi − θ‖ = 0 .   (6.3.5)


The R package SpatialNP provides routines to compute the spatial median. Gower (1974) calls the estimate the mediancentre and provides a Fortran program for its computation. See Bedall and Zimmerman (1979) for a program in dimensions higher than 2; further, for higher dimensions see Mottonen and Oja (1995).
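As an alternative to the packaged routines, the following base R sketch computes the spatial sign test via (6.3.1)-(6.3.2) and the spatial median by a Weiszfeld-type iteration on (6.3.5). The guard against zero distances and the componentwise-median starting value are our own choices, not from the text.

spatial.sign.test <- function(x) {
  n <- nrow(x); k <- ncol(x)
  u <- x / sqrt(rowSums(x^2))               # spatial signs
  S <- colSums(u)                           # S2(0)
  A <- crossprod(u) / n                     # estimate (6.3.1)
  stat <- drop(t(S) %*% solve(A) %*% S) / n
  c(statistic = stat, p.value = 1 - pchisq(stat, df = k))
}
spatial.median <- function(x, tol = 1e-8, maxit = 200) {
  theta <- apply(x, 2, median)              # starting value
  for (it in 1:maxit) {
    d <- pmax(sqrt(rowSums(sweep(x, 2, theta)^2)), tol)
    new <- colSums(x / d) / sum(1 / d)      # Weiszfeld update
    if (sum(abs(new - theta)) < tol) break
    theta <- new
  }
  theta
}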

We have the asymptotic representation

√n θ̂ = B⁻¹ (1/√n) S2(0) + op(1) →D N2(0, B⁻¹AB⁻¹) .   (6.3.6)

Chaudhuri (1992) provides a sharper analysis of the remainder term in his Theorem 3.2. The consistency of the moment estimates of A and B is established rigorously in the linear model setting by Bai, Chen, Miao, and Rao (1990). Hence, we would use Â and B̂ computed from the residuals. Bose and Chaudhuri (1993) develop estimates of A and B that converge more quickly than the moment estimates. Bose and Chaudhuri provide a very interesting analysis of why it is easier to estimate the asymptotic covariance matrix of θ̂ than to estimate the asymptotic variance of the univariate median. Essentially, unlike the univariate case, we do not need to estimate the multivariate density at a point. It is left as an exercise to show that the estimate is equivariant and the test is invariant under orthogonal transformations of the data; see Exercise 6.8.13.

Example 6.3.1. Cork Borings Data

We consider a well known example due to Rao (1948) of testing whether the weight of cork borings on trees is independent of the directions: North, South, East, and West. In this case we have 4 measurements on each tree, and we wish to test the equality of the marginal locations: H0 : θN = θS = θE = θW. This is a common hypothesis in repeated measures designs. See Jan and Randles (1996) for an excellent discussion of issues in repeated measures designs. We reduce the data to trivariate vectors via N − E, E − S, S − W. Then we test δ = 0, where δᵀ = (θN − θE, θE − θS, θS − θW). Table 6.3.1 displays the original n = 28 four component data vectors.

In Table 6.3.2 we display the data differences N − E, E − S, and S − W, along with the unit spatial sign vectors ‖x‖⁻¹x for each data point. Note that, except for rounding error, the sum of squares in each row is 1 for the spatial sign vectors.

We compute the spatial sign statistic to be S2ᵀ = (7.78, −4.99, 6.65) and, from (6.3.1),

Â = (  .2809  −.1321  −.0539 ;
      −.1321   .3706  −.0648 ;
      −.0539  −.0648   .3484 ) .

Then n⁻¹S2ᵀ(0)Â⁻¹S2(0) = 14.74, which yields an asymptotic p-value of .002, using a χ² approximation with 3 degrees of freedom. Hence, we easily reject H0 : δ = 0 and conclude that boring size depends on direction.


Table 6.3.1: Weight of Cork Borings (in Centigrams) in Four Directions for 28 Trees

  N   E   S   W      N   E   S   W
 72  66  76  77     91  79 100  75
 60  53  66  63     56  68  47  50
 56  57  64  58     79  65  70  61
 41  29  36  38     81  80  68  58
 32  32  35  36     78  55  67  60
 30  35  34  26     46  38  37  38
 39  39  31  27     39  35  34  37
 42  43  31  25     32  30  30  32
 37  40  31  25     60  50  67  54
 33  29  27  36     35  37  48  39
 32  30  34  28     39  36  39  31
 63  45  74  63     50  34  37  40
 54  46  60  52     43  37  39  50
 47  51  52  43     48  54  57  43

For estimation we return to the original component data. Since we have rejected the null hypothesis of equality of locations, we want to estimate the 4 components of the location vector θᵀ = (θ1, θ2, θ3, θ4). The spatial median solves S2(θ) = 0, and we find θ̂ᵀ = (45.38, 41.54, 43.91, 41.03). For comparison, the mean vector is (50.54, 46.18, 49.68, 45.18)ᵀ. These computations can be performed using the R package SpatialNP. The issue of how to apply rank methods in repeated measures designs has an extensive literature. In addition to Jan and Randles (1996), Kepner and Robinson (1988) and Akritas and Arnold (1994) discuss the use of rank transforms and pure ranks for testing hypotheses in repeated measures designs. The Friedman test, Exercise 4.8.19, can also be used for repeated measures designs.

Efficiency for Spherical Distributions

Expressions for A and B can be simplified, and the computation of efficiencies made easier, if we transform to polar coordinates. We write

x = r (cos φ, sin φ)ᵀ = r s (cos ϕ, sin ϕ)ᵀ ,   (6.3.7)

where r = ‖x‖ ≥ 0, 0 ≤ φ < 2π, and s = ±1 according as x is above or below the horizontal axis, with 0 < ϕ < π. The second representation is similar to (6.2.9) and is useful in the development of the conditional distribution of the test under the null hypothesis. Hence

S2(0) = Σ si (cos ϕi, sin ϕi)ᵀ ,   (6.3.8)


Table 6.3.2: Each row is a data vector for N − E, E − S, S − W, along with the components of the spatial sign vector.

 Row  N−E  E−S  S−W     S1     S2     S3
  1     6  -10   -1   0.51  -0.85  -0.09
  2     7  -13    3   0.46  -0.86   0.20
  3    -1   -7    6  -0.11  -0.75   0.65
  4    12   -7   -2   0.85  -0.50  -0.14
  5     0   -3   -1   0.00  -0.95  -0.32
  6    -5    1    8  -0.53   0.11   0.84
  7     0    8    4   0.00   0.89   0.45
  8    -1   12    6  -0.07   0.89   0.45
  9    -3    9    6  -0.27   0.80   0.53
 10     4    2   -9   0.40   0.19  -0.90
 11     2   -4    6   0.27  -0.53   0.80
 12    18  -29   11   0.50  -0.80   0.31
 13     8  -14    8   0.44  -0.78   0.44
 14    -4   -1    9  -0.40  -0.10   0.91
 15    12  -21   25   0.34  -0.60   0.71
 16   -12   21   -3  -0.49   0.86  -0.12
 17    14   -5    9   0.81  -0.29   0.52
 18     1   12   10   0.06   0.77   0.64
 19    23  -12    7   0.86  -0.44   0.26
 20     8    1   -1   0.98   0.12  -0.12
 21     4    1   -3   0.78   0.20  -0.59
 22     2    0   -2   0.71   0.00  -0.71
 23    10  -17   13   0.42  -0.72   0.55
 24    -2  -11    9  -0.14  -0.77   0.63
 25     3   -3    8   0.33  -0.33   0.88
 26    16   -3   -3   0.97  -0.18  -0.18
 27     6   -2  -11   0.47  -0.16  -0.87
 28    -6   -3   14  -0.39  -0.19   0.90


where ϕi is the angle measured counterclockwise between the positive horizontal axis and the line through xi extending indefinitely through the origin, and si indicates whether the observation is above or below the axis. Under the null hypothesis θ = 0, si = ±1 with probabilities 1/2, 1/2, and s1, . . . , sn are independent. Thus, we can condition on ϕ1, . . . , ϕn to get a conditionally distribution free test. The conditional covariance matrix is

Σ_{i=1}^{n} ( cos²ϕi   cos ϕi sin ϕi ; cos ϕi sin ϕi   sin²ϕi ) ,   (6.3.9)

and this is used in the quadratic form with S2(0) to construct the test statistic; see Mottonen and Oja (1995, Section 2.1).

To consider the asymptotically distribution free version of this test we use the form

S2(0) = Σ (cos φi, sin φi)ᵀ ,   (6.3.10)

where, recall, 0 ≤ φ < 2π, and the multivariate central limit theorem implies that (1/√n)S2(0) has a limiting bivariate normal distribution with mean 0 and covariance matrix A. We now translate A and its estimate into polar coordinates:

A = E ( cos²φ   cos φ sin φ ; cos φ sin φ   sin²φ )  and  Â = (1/n) Σ_{i=1}^{n} ( cos²φi   cos φi sin φi ; cos φi sin φi   sin²φi ) .   (6.3.11)

Hence, rejecting when (1/n)S2ᵀ(0)Â⁻¹S2(0) ≥ χ²α(2) yields an asymptotically size α test. We next express B in terms of polar coordinates:

B = E r⁻¹ [ I − ( cos²φ   cos φ sin φ ; cos φ sin φ   sin²φ ) ] = E r⁻¹ ( sin²φ   −cos φ sin φ ; −cos φ sin φ   cos²φ ) .   (6.3.12)

Hence, √n times the spatial median is limiting bivariate normal with asymptotic covariance matrix equal to B⁻¹AB⁻¹. The corresponding noncentrality parameter of the noncentral chisquare limiting distribution of the test is γᵀBA⁻¹Bγ. We are now in a position to evaluate the efficiency of the spatial median and the spatial sign test with respect to the mean vector and Hotelling's test under various model assumptions. The following result is basic and is derived in Exercise 6.8.10.

Theorem 6.3.1. Suppose the underlying distribution is spherically symmetric, so that the joint density is of the form f(x) = h(‖x‖). Let (r, φ) be the polar coordinates. Then r and φ are stochastically independent, the pdf of φ is uniform on (0, 2π], and the pdf of r is g(r) = 2πr h(r), for r > 0.

Theorem 6.3.2. If the underlying distribution is spherically symmetric, then the matrices are A = (1/2)I and B = [(Er⁻¹)/2]I. Hence, under the null hypothesis, the test statistic n⁻¹S2ᵀ(0)Â⁻¹S2(0) is distribution free over the class of spherically symmetric distributions.


Proof. First note that

E cos φ sin φ = (1/2π) ∫₀^{2π} cos φ sin φ dφ = 0 .

Then note that, by the independence of r and φ,

E r⁻¹ cos φ sin φ = E r⁻¹ E cos φ sin φ = 0 .

Finally, note that E cos²φ = E sin²φ = 1/2.

We can then compute B⁻¹AB⁻¹ = [2/(Er⁻¹)²]I and BA⁻¹B = [(Er⁻¹)²/2]I, so the generalized variance of the spatial median is determined by 2/(Er⁻¹)², and the noncentrality parameter of the angle sign test is [(Er⁻¹)²/2]γᵀγ. Notice that the efficiencies relative to the mean and Hotelling's test are now equal and independent of the direction. Recall, for the mean vector and T², that A = 2⁻¹E(r²)I, the generalized variance is determined by 2⁻¹E(r²), and γᵀBA⁻¹Bγ = [2/E(r²)]γᵀγ. This is because both the spatial L1 methods and the L2 methods are equivariant and invariant with respect to orthogonal (rotations and reflections) transformations. Hence, we see that the efficiency is

e(spatial L1, L2) = (1/4) E(r²) [E(r⁻¹)]² .   (6.3.13)

If, in addition, we assume the underlying distribution is spherical normal (bivariate normal with means 0 and identity covariance matrix), then Er⁻¹ = √(π/2), Er² = 2, and e(spatial L1, L2) = π/4 ≈ .785. Hence, the spatial L1 methods based on S2(θ) are more efficient relative to the L2 methods at the spherical normal model than the componentwise L1 methods (.637) discussed in Section 6.2.3.

In Exercise 6.8.12 the reader is asked to show that the efficiency of the spatial L1 methods relative to the L2 methods for a k-variate spherical model is given by

e_k(spatial L1, L2) = [(k − 1)/k]² E(r²) [E(r⁻¹)]² .   (6.3.14)

When the k-variate spherical model is normal, the exercise shows that Er⁻¹ = Γ[(k − 1)/2] / [√2 Γ(k/2)], with Γ(1/2) = √π. Table 6.3.3 gives some values of this efficiency as a function of dimension. Hence, we see that the efficiency increases with dimension. This suggests that the spatial methods are superior to the componentwise L1 methods, at least for spherical models.

Efficiency for Elliptical Distributions

We need to consider what happens to the efficiency when the model is elliptical but not spherical. Since the methods that we are considering are equivariant and invariant under rotations, we can eliminate the correlation from the elliptical model with a rotation, but then the variances are typically not equal. Hence, we study, without loss of generality, the efficiency


Table 6.3.3: Efficiency as a function of dimension for a k-variate spherical normal model.

  k                    2      4      6
  e(spatial L1, L2)  0.785  0.884  0.920

when the underlying model has unequal variances but covariance 0. Now the L2 methods are affine equivariant and invariant, but the spatial L1 methods are not scale equivariant and invariant (hence not affine equivariant and invariant); hence, the efficiency will be a function of the underlying variances.

The computations are now more difficult. To fix the ideas, suppose the underlying model is bivariate normal with means 0, variances 1 and σ², and covariance 0. If we let X and Z denote iid N(0, 1) random variables, then the model distribution is that of X and Y = σZ. Note that W = Z/X has a standard Cauchy distribution. Now we are ready to determine the matrices A and B.

First, by symmetry, we have E cos φ sin φ = E[XY/(X² + Y²)] = 0 and E r⁻¹ cos φ sin φ = E[XY/(X² + Y²)^{3/2}] = 0; hence, the matrices A and B are diagonal. Next, cos²φ = X²/(X² + σ²Z²) = 1/(1 + σ²W²), so we can use the Cauchy density to compute the expectation. Using the method of partial fractions,

E cos²φ = ∫ [1/(1 + σ²w²)] [1/(π(1 + w²))] dw = 1/(1 + σ) .

Hence, E sin²φ = σ/(1 + σ). The next two formulas are given by Brown (1983) and are derivable by several steps of partial integration:

Hence, E sin2 φ = σ/(1 + σ). The next two formulas are given by Brown (1983) and arederivable by several steps of partial integration:

Er−1 =

√π

2

∞∑

j=0

(2j)!

22j(j!)2

2

(1 − σ2)j ,

Er−1 cos2 φ =1

2

√π

2

∞∑

j=0

(2j + 2)!(2j)!

24j+1(j!)2[(j + 1)!]2

2

(1 − σ2)j ,

andEr−1 sin2 φ = Er−1 −Er−1 cos2 φ .

Thus A = diag[(1 + σ)⁻¹, σ(1 + σ)⁻¹], and the distribution of the test statistic, even under the normal model, depends on σ. The formulas can be used to compute the efficiency of the spatial L1 methods relative to the L2 methods; numerical values are given in Table 6.3.4. The dependency of the efficiency on σ reflects the dependency of the efficiency on the underlying correlation which is present prior to rotation.
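The following sketch evaluates Brown's series above and reproduces Table 6.3.4. Here the estimation efficiency is computed as the square root of the ratio of generalized variances, √[det(Σ) det(B)² / det(A)] with Σ = diag(1, σ²), which is one standard way to summarize bivariate efficiency; the truncation point jmax is our own choice, and the series converge slowly as σ → 0.

eff.spatial <- function(sigma, jmax = 20000) {
  j <- 0:jmax; m <- 1 - sigma^2
  a <- exp(lgamma(2 * j + 1) - 2 * j * log(2) - 2 * lgamma(j + 1))
  Erinv <- sqrt(pi / 2) * sum(a^2 * m^j)     # first series above
  b <- exp(lgamma(2 * j + 3) + lgamma(2 * j + 1) - (4 * j + 1) * log(2) -
           2 * lgamma(j + 1) - 2 * lgamma(j + 2))
  Erc2 <- 0.5 * sqrt(pi / 2) * sum(b * m^j)  # E r^{-1} cos^2(phi)
  B <- diag(c(Erinv - Erc2, Erc2))           # B in polar form, from (6.3.3)
  A <- diag(c(1, sigma) / (1 + sigma))
  sqrt(det(diag(c(1, sigma^2))) * det(B)^2 / det(A))
}
round(sapply(c(1, .8, .6, .4, .2), eff.spatial), 3)   # .785 .783 .773 .747 .678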

Hence, just as the componentwise L1 methods have decreasing efficiency as a function of the underlying correlation, the spatial L1 methods have decreasing efficiency as a function of the ratio of the underlying variances. It should be emphasized that the spatial methods are most appropriate for spherical models, where they have equivariance and invariance properties.


Table 6.3.4: Efficiencies of spatial L1 methods relative to the L2 methods for the bivariate normal model with means 0, variances 1 and σ², and 0 correlation, the elliptical case.

  σ                    1     .8     .6     .4     .2    .05    .01
  e(spatial L1, L2)  0.785  0.783  0.773  0.747  0.678  0.593  0.321

The componentwise methods, although equivariant and invariant under scale transformations of the components, cannot tolerate changes in correlation. See Mardia (1972) and Fisher (1987, 1993) for further discussion of spatial methods. In higher dimensions, Mardia refers to the angle test as Rayleigh's test; see Section 9.3.1 of Mardia (1972). Mottonen and Oja (1995) extend the spatial median and the spatial sign test to higher dimensions. See Table 6.3.6 below for efficiencies relative to Hotelling's test for higher dimensions and for a multivariate t underlying distribution. Note that for higher dimensions and lower degrees of freedom, the spatial sign test is superior to Hotelling's T².

6.3.2 Spatial Rank Methods

Spatial Signed Rank Test

Mottonen and Oja (1995) develop the concept of an orthogonally invariant rank vector. Hence, rather than use the univariate concept of rank in the construction of a test, they define a spatial rank vector that has both magnitude and direction. This problem is delicate, since there is no inherently natural way to order or rank vectors.

We must first review the relationship between sign, rank, and signed rank. Recall the norm, (1.3.17) and (1.3.21), that was used to generate the Wilcoxon signed rank statistic. Further, recall that the second term in the norm was the basis, in Section 2.2.2, for the Mann-Whitney-Wilcoxon rank sum statistic. We reverse this approach here and show how the one sample signed rank statistic based on ranks of the absolute values can be developed from the ranks of the data. This will provide the motivation for a one sample spatial signed rank statistic.

Let x1, . . . , xn be a univariate sample. Then 2[Rn(xi) − (n + 1)/2] = Σj sgn(xi − xj); thus the centered rank is constructed from the signs of the differences. Now, to construct a one sample statistic, we introduce the reflections −x1, . . . , −xn and consider the centered rank of xi among the 2n combined observations and their reflections. The subscript 2n indicates that the reflections are included in the ranking:

2[R2n(xi) − (2n + 1)/2] = Σj sgn(xi − xj) + Σj sgn(xi + xj) = [2Rn(|xi|) − 1] sgn(xi) ;   (6.3.15)

see Exercise 6.8.14. Hence, ranking observations among the combined observations and reflections is essentially equivalent to ranking the absolute values |x1|, . . . , |xn|. In this way, one sample methods can be developed from two sample methods.


Mottonen and Oja (1995) use this approach to develop a one sample spatial signed rank statistic. The key is the expression sgn(xi − xj) + sgn(xi + xj), which requires only the concept of sign, not rank. Hence, we must find the appropriate extension of sign to two dimensions. In one dimension, sgn(x) = |x|⁻¹x can be thought of as a unit vector pointing in the positive or negative direction toward x.

Likewise, u(x) = ‖x‖⁻¹x is a unit vector in the direction of x. Hence, as in the previous section, we take u(x) to be the vector spatial sign. The vector centered spatial rank of xi is then R(xi) = Σj u(xi − xj). Thus, the vector spatial signed rank statistic is

S5(0) = Σi Σj [ u(xi − xj) + u(xi + xj) ] .   (6.3.16)

This is also the sum of the centered spatial ranks of the observations when ranked among the combined observations and their reflections. Note that −u(xi − xj) = u(xj − xi), so that ΣΣ u(xi − xj) = 0 and the statistic can be computed from

S5(0) = Σi Σj u(xi + xj) ,   (6.3.17)

which is the direct analog of (1.3.24). We now develop a conditional test by conditioning on the data x1, . . . , xn. From (6.3.16) we can write

S5(0) = Σi r⁺(xi) ,   (6.3.18)

where r⁺(x) = Σj [u(x − xj) + u(x + xj)]. Now it is easy to see that r⁺(−x) = −r⁺(x). Under the null hypothesis of symmetry about 0, we can think of S5(0) as a realization of Σi bi r⁺(xi), where b1, . . . , bn are iid variables with P(bi = +1) = P(bi = −1) = 1/2. Hence, Ebi = 0 and var(bi) = 1. This means that, conditional on the data,

E S5(0) = 0  and  Â = Cov[ n^{−3/2} S5(0) ] = (1/n³) Σ_{i=1}^{n} r⁺(xi) r⁺(xi)ᵀ .   (6.3.19)

The approximate size α conditional test of H0 : θ = 0 versus HA : θ ≠ 0 rejects H0 when

(1/n³) S5ᵀ Â⁻¹ S5 ≥ χ²α(2) ,   (6.3.20)

where χ²α(2) is the upper α percentile of a chisquare distribution with 2 degrees of freedom. Note that the extension to higher dimensions is done in exactly the same way. See Chaudhuri (1992) for rigorous asymptotics.
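A direct base R sketch of the test (6.3.20) follows; the function name is our own, and, because the statistic is invariant to a common rescaling of the r⁺ vectors, applying it to the differences in Table 6.3.2 reproduces the test statistic of Example 6.3.2 below.

spatial.signed.rank.test <- function(x) {
  n <- nrow(x); k <- ncol(x)
  u <- function(v) { nv <- sqrt(sum(v^2)); if (nv > 0) v / nv else v }
  rplus <- t(sapply(1:n, function(i) {       # r+(x_i), as in (6.3.18)
    ri <- numeric(k)
    for (j in 1:n) ri <- ri + u(x[i, ] - x[j, ]) + u(x[i, ] + x[j, ])
    ri
  }))
  S5 <- colSums(rplus)
  Ahat <- crossprod(rplus) / n^3             # (6.3.19)
  stat <- drop(t(S5) %*% solve(Ahat) %*% S5) / n^3   # (6.3.20)
  c(statistic = stat, p.value = 1 - pchisq(stat, df = k))
}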

Example 6.3.2. Cork Borings, Example 6.3.1 continued


Table 6.3.5: Each row is a spatial signed rank vector for the data differences in Table 6.3.2.

 Row   SR1    SR2    SR3     Row   SR1    SR2    SR3
  1    0.28  -0.49  -0.07     15   0.30  -0.54   0.69
  2    0.28  -0.58   0.12     16  -0.40   0.73  -0.07
  3   -0.09  -0.39   0.31     17   0.60  -0.14   0.39
  4    0.58  -0.29  -0.11     18   0.10   0.56   0.49
  5   -0.03  -0.20  -0.07     19   0.77  -0.34   0.22
  6   -0.28   0.07   0.43     20   0.48   0.10  -0.03
  7    0.07   0.43   0.23     21   0.26   0.08  -0.16
  8    0.01   0.60   0.32     22   0.12   0.00  -0.11
  9   -0.13   0.46   0.34     23   0.32  -0.58   0.48
 10    0.23   0.13  -0.49     24  -0.14  -0.53   0.42
 11    0.12  -0.20   0.33     25   0.19  -0.12   0.45
 12    0.46  -0.76   0.28     26   0.73  -0.07  -0.14
 13    0.30  -0.56   0.34     27   0.31  -0.12  -0.58
 14   -0.22  -0.05   0.49     28  -0.30  -0.14   0.67

We use the spatial signed-rank method (6.3.20) to test the hypothesis. Table 6.3.5 provides the vector signed-ranks r⁺(xi) defined in expression (6.3.18). Then S5ᵀ(0) = (4.94, −2.90, 5.17),

n⁻³ Â⁻¹ = (  .1231  −.0655   .0050 ;
            −.0655   .1611  −.0373 ;
             .0050  −.0373   .1338 ) ,

and the test statistic (6.3.20) has value 11.19, with an approximate p-value of 0.011 based on a χ²-distribution with 3 degrees of freedom. The Hodges-Lehmann estimate of θ, which solves S5(θ) ≈ 0, is computed to be θ̂ᵀ = (49.30, 45.07, 48.90, 44.59).

Efficiency

The test in (6.3.20) can be developed from the point of view of asymptotic theory, and the efficiency can be computed. The computations are quite involved. The multivariate t distributions provide both a range of tailweights and a range of dimensions. A summary of these efficiencies is found in Table 6.3.6; see Mottonen, Oja and Tienari (1997) for details.

The Mottonen and Oja (1995) test efficiency increases with the dimension; see, especially, the circular normal case. The efficiency begins at .95 and increases! The efficiency also increases with tailweight, as expected. This strongly suggests that the Mottonen and Oja approach is an excellent way to extend the idea of signed rank from the univariate case. See Example 6.6.2 for a discussion of the two sample spatial rank test.


Table 6.3.6: The rows labeled Spatial SR are the asymptotic efficiencies of the multivariate spatial signed-rank test, (6.3.20), relative to Hotelling's test under the multivariate t distribution; the efficiencies for the spatial sign test, (6.3.2), are given in the rows labeled Spatial Sign.

                              Degrees of Freedom
 Dimension  Test           3     4     6     8    10    15     20     ∞
   1   Spatial SR        1.90  1.40  1.16  1.09  1.05  1.01   1.00  0.95
       Spatial Sign      1.62  1.13  0.88  0.80  0.76  0.71   0.70  0.64
   2   Spatial SR        1.95  1.43  1.19  1.11  1.07  1.03   1.01  0.97
       Spatial Sign      2.00  1.39  1.08  0.98  0.93  0.88   0.85  0.79
   3   Spatial SR        1.98  1.45  1.20  1.12  1.08  1.04   1.02  0.97
       Spatial Sign      2.16  1.50  1.17  1.06  1.01  0.95   0.92  0.85
   4   Spatial SR        2.00  1.46  1.21  1.13  1.09  1.04  1.025  0.98
       Spatial Sign      2.25  1.56  1.22  1.11  1.05  0.99   0.96  0.88
   6   Spatial SR        2.02  1.48  1.22  1.14  1.10  1.05   1.03  0.98
       Spatial Sign      2.34  1.63  1.27  1.15  1.09  1.03   1.00  0.92
  10   Spatial SR        2.05  1.49  1.23  1.14  1.10  1.06   1.04  0.99
       Spatial Sign      2.42  1.68  1.31  1.19  1.13  1.06   1.03  0.95

Hodges-Lehmann Estimator

The estimator derived from S5(θ) ≈ 0 is the spatial median of the pairwise averages, a spatial Hodges-Lehmann (1963) estimator. This estimator is studied in great detail by Chaudhuri (1992). His paper contains a thorough review of multidimensional location estimates. He develops a Bahadur representation for the estimate. From his Theorem 3.2, we can immediately conclude that

√n θ̂ = B2⁻¹ [√n / (n(n − 1))] Σ_{i=1}^{n} Σ_{j=1}^{n} u( (xi + xj)/2 ) + op(1) ,   (6.3.21)

where B2 = E‖x*‖⁻¹(I − ‖x*‖⁻²x*(x*)ᵀ) and x* = (1/2)(x1 + x2). Hence, the asymptotic distribution of √n θ̂ is determined by that of n^{−3/2}S5(0). This leads to

√n θ̂ →D N2(0, B2⁻¹ A2 B2⁻¹) ,   (6.3.22)

where A2 = E[u(x1 + x2) u(x1 + x2)ᵀ]. Moment estimates of A2 and B2 can be used. In fact, the estimator Â defined in expression (6.3.19) is a consistent estimate of A2. Bose and Chaudhuri (1993) and Chaudhuri (1993) discuss refinements in the estimation of A2 and B2. Choi and Marden (1997) extend these spatial rank methods to the two-sample model and the one-way layout. They also consider tests for ordered alternatives; see, also, Oja (2010).


6.4 Affine Equivariant and Invariant Methods

6.4.1 Blumen’s Bivariate Sign Test

It is clear from the efficiency results of Tables 6.3.4 and 6.3.6 in the previous section that it is desirable to have robust sign and rank methods that are affine invariant and equivariant to compete with the LS methods. We begin with yet another representation of the estimating function S2(θ), (6.1.7). Let the ordered ϕ angles be given by 0 ≤ ϕ(1) < ϕ(2) < . . . < ϕ(n) < π, and let s(i) = ±1 according as the observation corresponding to ϕ(i) is above or below the horizontal axis. Then we can write, as in expression (6.3.8),

S2(θ) = Σ_{i=1}^{n} s(i) ( cos ϕ(i), sin ϕ(i) )ᵀ .   (6.4.1)

Now, under the assumption of spherical symmetry, ϕ(i) is distributed as the ith order statistic from the uniform distribution on [0, π), and, hence, Eϕ(i) = πi/(n + 1), i = 1, . . . , n. Recall, in the univariate case, that if we believe the underlying distribution is normal, then we could replace the data by the normal scores (expected values of the order statistics from a normal distribution) in a signed rank statistic; the result is the distribution free normal scores test. We will do the same thing here. We replace ϕ(i) by its expected value to construct a scores statistic. Let

S6(θ) = Σ_{i=1}^{n} s(i) ( cos[πi/(n + 1)], sin[πi/(n + 1)] )ᵀ = Σ_{i=1}^{n} si ( cos[πRi/(n + 1)], sin[πRi/(n + 1)] )ᵀ ,   (6.4.2)

where R1, . . . , Rn are the ranks of the unordered angles ϕ1, . . . , ϕn. Note that s1, . . . , sn are iid with P(si = 1) = P(si = −1) = 1/2 even if the underlying model is elliptical rather than spherical. Since we now have constant vectors in S6(θ), it follows that the sign test based on S6(θ) is distribution free over the class of elliptical models. We look at the test in more detail and consider the efficiency of this sign test relative to Hotelling's test. First, under the null hypothesis, we have immediately from the distribution of s1, . . . , sn that

cov[ (1/√n) S6(0) ] = ( (1/n)Σ cos²[πi/(n+1)]   (1/n)Σ cos[πi/(n+1)] sin[πi/(n+1)] ; (1/n)Σ cos[πi/(n+1)] sin[πi/(n+1)]   (1/n)Σ sin²[πi/(n+1)] ) → A ,

where

A = ( ∫₀¹ cos² πt dt   ∫₀¹ cos πt sin πt dt ; ∫₀¹ cos πt sin πt dt   ∫₀¹ sin² πt dt ) = (1/2) I

as n → ∞. So we reject H0 : θ = 0 if (2/n) S6ᵀ(0) S6(0) ≥ χ²α(2) for the asymptotically size α, distribution free version of the test, where

(2/n) S6ᵀ(0) S6(0) = (2/n) { [ Σ s(i) cos(πi/(n + 1)) ]² + [ Σ s(i) sin(πi/(n + 1)) ]² } .   (6.4.3)


This test is not affine invariant. Blumen (1958) created an asymptotically equivalent test that is affine invariant. We can think of Blumen's statistic as an elliptical scores version of the angle statistic of Brown (1983). In (6.4.3), i/(n + 1) is replaced by (i − 1)/n. Blumen rotated the axes so that ϕ(1) is equal to zero and the corresponding data point is on the horizontal axis; then the remaining scores are uniformly spaced. In this case, π(i − 1)/n is the conditional expectation of ϕ(i) given ϕ(1) = 0. Estimation methods corresponding to Blumen's test, however, have not yet been developed.
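A base R sketch of the uniform scores test (6.4.3) follows; the angle folding is our own simplification (it assumes no observation lies exactly on the horizontal axis), and replacing R/(n + 1) by (R − 1)/n after rotating so that ϕ(1) = 0 gives Blumen's version.

uniform.scores.test <- function(x) {
  n <- nrow(x)
  s <- ifelse(x[, 2] >= 0, 1, -1)            # above/below the horizontal axis
  phi <- atan2(s * x[, 2], s * x[, 1])       # angles folded into [0, pi)
  R <- rank(phi)                             # ranks of the unordered angles
  S6 <- c(sum(s * cos(pi * R / (n + 1))),
          sum(s * sin(pi * R / (n + 1))))
  stat <- (2 / n) * sum(S6^2)                # (6.4.3)
  c(statistic = stat, p.value = 1 - pchisq(stat, df = 2))
}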

To compute the efficiency of Blumen's test relative to Hotelling's test we must compute the noncentrality parameter of the limiting chisquare distribution. Hence, we must compute BA⁻¹B, and this leads us to B. Theorem 6.3.2 provides the matrices A and B for the angle sign statistic when the underlying distribution is spherically symmetric. The following theorem shows that the affine invariant sign statistic has the same A and B matrices as in Theorem 6.3.2 and that they hold for all elliptical distributions. We discuss the implications after the proof.

Theorem 6.4.1. If the underlying distribution is elliptical, then, corresponding to S6(0), we have A = (1/2)I and B = [(Er⁻¹)/2]I. Hence, the efficiency of Blumen's test relative to Hotelling's test is e(S6, Hotelling) = E(r²)[E(r⁻¹)]²/4, which is the same for all elliptical models.

Proof. To prove this we show that, under a spherical model, the angle statistic S2(0) and the scores statistic S6(0) are asymptotically equivalent. Then S6(0) has the same A and B matrices as in Theorem 6.3.2. But since S6(0) leads to an affine invariant test statistic, it follows that the same A and B continue to apply for elliptical models.

Recall that, under the spherical model, s(1), . . . , s(n) are iid random variables with P(si = 1) = P(si = −1) = 1/2. Then we consider

(1/√n) Σ_{i=1}^{n} s(i) ( cos[πi/(n+1)], sin[πi/(n+1)] )ᵀ − (1/√n) Σ_{i=1}^{n} s(i) ( cos ϕ(i), sin ϕ(i) )ᵀ = (1/√n) Σ_{i=1}^{n} s(i) ( cos[πi/(n+1)] − cos ϕ(i), sin[πi/(n+1)] − sin ϕ(i) )ᵀ .

We treat the two components separately. First,

| (1/√n) Σ s(i) ( cos[πi/(n + 1)] − cos ϕ(i) ) | ≤ max_i | cos[πi/(n + 1)] − cos ϕ(i) | · | (1/√n) Σ s(i) | .

The cdf of the uniform distribution on [0, π) is equal to t/π for 0 ≤ t < π. Let Gn(t) be the empirical cdf of the angles ϕi, i = 1, . . . , n. Then Gn⁻¹(i/(n + 1)) = ϕ(i) and max_i |πi/(n + 1) − ϕ(i)| ≤ sup_t |Gn⁻¹(t) − πt| = π sup_t |Gn(t) − t/π| → 0 wp1 by the Glivenko-Cantelli Lemma. The result now follows by using a linear approximation to cos[πi/(n + 1)] − cos ϕ(i) and noting that cos and sin are bounded. The same argument applies to the second component. Hence, the difference of the two statistics is op(1), and they are asymptotically equivalent. The results for the angle statistic now apply to S6(0) for a spherical model. The affine invariance extends the result to an elliptical model.


The main implication of this theorem is that the efficiency of the test based on S6(0) relative to Hotelling's test is π/4 ≈ .785 for all bivariate normal models, not just the spherical normal model. Recall that the test based on S2(0), the angle sign test, has efficiency π/4 only for the spherical normal, with declining efficiency for elliptical normal models. Hence, we not only gain affine invariance but also have a constant, nondecreasing efficiency.

Oja and Nyblom (1989) study a class of sign tests for the bivariate location problem. They show that Blumen's test is locally most powerful invariant for the entire class of elliptical models. Ducharme and Milasevic (1987) define a normalized spatial median as an estimate of location of a spherical distribution. They construct a confidence region for the modal direction. These methods are resistant to outliers.

6.4.2 Affine Invariant Sign Tests in the Multivariate Case

Affine Invariant Sign Tests

Affine invariance is achieved in Blumen's test statistic by rearranging the data axes so that they are essentially equally spaced, via the uniform scores. Further, note that the asymptotic covariance matrix A is (1/2)I, where I is the identity. This is the covariance matrix of a random vector that is uniformly distributed on the unit circle. The equally spaced scores cannot be constructed in higher dimensions. The approach taken here is due to Randles (2000), in which we seek a linear transformation of the data that makes the data axes roughly equally spaced, so that the resulting direction vectors are roughly uniformly distributed on the unit sphere. We choose the transformation so that the sample covariance matrix of the unit vectors of the transformed data is that of a random vector uniformly distributed on the unit sphere. We then compute the spatial sign test (6.3.2) on the transformed data. The result is an affine invariant test.

Let x1, . . . , xn be a random sample of size n from a k-variate multivariate symmetric distribution with symmetry center 0. Suppose, for the moment, that a nonsingular matrix Ux, determined by the data, exists and satisfies

(1/n) Σ_{i=1}^{n} ( Uxxi/‖Uxxi‖ ) ( Uxxi/‖Uxxi‖ )ᵀ = (1/k) I .   (6.4.4)

Hence, the unit vectors of the transformed data have covariance matrix equal to that of a random vector uniformly distributed on the unit k-sphere. Below we describe a simple and fast way to compute Ux for any dimension k. The spatial sign test statistic (6.3.2), computed on the transformed data, becomes

(1/n) S7ᵀ Â⁻¹ S7 = (k/n) S7ᵀ S7 ,   (6.4.5)

where

S7 = Σ_{i=1}^{n} Uxxi/‖Uxxi‖   (6.4.6)


and Â in (6.3.1) becomes k⁻¹I because of the definition of Ux in (6.4.4).

Theorem: Suppose n > k(k − 1) and the underlying distribution is symmetric about 0. Then (k/n)S7ᵀS7 in (6.4.5) is affine invariant, and its limiting distribution, as n → ∞, is chisquare with k degrees of freedom.

The following lemma will be helpful in the proof of the theorem. The lemma's proof depends on a uniqueness result from Tyler (1987).

Lemma: Suppose n > k(k − 1) and D is a fixed, nonsingular transformation matrix. Suppose Ux and UDx are defined by (6.4.4). Then

a. DᵀUDxᵀUDxD = c0UxᵀUx for some positive constant c0 that may depend on D and the data, and

b. there exists an orthogonal matrix G such that √c0 GUx = UDxD.

Proof: Define U* = UxD⁻¹. Then

(1/n) Σ_{i=1}^{n} ( U*Dxi/‖U*Dxi‖ ) ( U*Dxi/‖U*Dxi‖ )ᵀ = (1/n) Σ_{i=1}^{n} ( Uxxi/‖Uxxi‖ ) ( Uxxi/‖Uxxi‖ )ᵀ = (1/k) I .

Tyler (1987) showed that the matrix UDx defined from Dx1, . . . , Dxn is unique up to a positive constant. Hence, UDx = aU* for some positive constant a. Hence,

UDxᵀUDx = a² U*ᵀU* = a² (Dᵀ)⁻¹UxᵀUxD⁻¹ ,

and DᵀUDxᵀUDxD = a²UxᵀUx, which completes the proof of part a with c0 = a².

Define G = c0^{−1/2} UDxDUx⁻¹, where c0 comes from part a. Then, using part a, it follows that GᵀG = I, so G is orthogonal. Hence,

√c0 GUx = √c0 c0^{−1/2} UDxDUx⁻¹Ux = UDxD ,

and part b follows.

Proof of affine invariance: Given a fixed, nonsingular matrix D, let yi = Dxi for i = 1, . . . , n. Then (6.4.6) becomes

S7^D = Σ_{i=1}^{n} UDxDxi/‖UDxDxi‖ .

We will show that (S7^D)ᵀS7^D = S7ᵀS7 and hence does not depend on D. Now, from the lemma,

UDxDx/‖UDxDx‖ = c0^{1/2}GUxx/‖c0^{1/2}GUxx‖ = G Uxx/‖Uxx‖

and

S7^D = G Σ_{i=1}^{n} Uxxi/‖Uxxi‖ = GS7 .


Hence, (S7^D)ᵀS7^D = S7ᵀS7, and the affine invariance follows from the orthogonal invariance of S7ᵀS7.

Sketch of the argument that the asymptotic distribution is chisquare with k degrees of freedom: Tyler (1987) showed that there exists a unique upper triangular matrix U*, with upper left diagonal element equal to 1, such that

E[ ( U*X/‖U*X‖ ) ( U*X/‖U*X‖ )ᵀ ] = (1/k) I

and √n(Ux − U*) = Op(1). Theorem 6.1.2 implies that (k/n)S7*ᵀS7* is asymptotically chisquare with k degrees of freedom, where U* replaces Ux in S7. But since Ux and U* are close, (k/n)S7*ᵀS7* − (k/n)S7ᵀS7 = op(1), and the asymptotic distribution follows. See the appendix in Randles (2000) for details.

We have assumed symmetry of the underlying multivariate distribution. The results continue to hold with the weaker assumption of directional symmetry about 0, in which X/‖X‖ and −X/‖X‖ have the same distribution. In addition to the asymptotic distribution, we can compute or approximate the conditional distribution (given the direction axes of the data) of (k/n)S7ᵀS7 under the assumption of directional symmetry by listing or sampling the 2ⁿ equi-likely values of

(k/n) ( Σ_{i=1}^{n} δi Uxxi/‖Uxxi‖ )ᵀ ( Σ_{i=1}^{n} δi Uxxi/‖Uxxi‖ ) ,

where δi = ±1 for i = 1, . . . , n. Hence, it is straightforward to approximate the p-value of the test.

Computation of Ux

It remains to compute Ux from the data x1, . . . , xn. The following efficient iterative procedure is due to Tyler (1987), who also shows that the sequence of iterates converges when n > k(k − 1).

We begin with

V0 = (1/n) Σ_{i=1}^{n} ( xi/‖xi‖ ) ( xi/‖xi‖ )ᵀ

and U0 = Chol(V0⁻¹), where Chol(M) is the upper triangular Cholesky decomposition of the positive definite matrix M, divided by its upper left diagonal element. This places a 1 as the first element of the main diagonal and makes Chol(M) unique.

If ‖V0 − k⁻¹I‖ is sufficiently small (within a prespecified tolerance), stop and take Ux = U0. If ‖V0 − k⁻¹I‖ is large, compute

V1 = (1/n) Σ_{i=1}^{n} ( U0xi/‖U0xi‖ ) ( U0xi/‖U0xi‖ )ᵀ


and compute U1 = Chol(V1⁻¹). If ‖V1 − k⁻¹I‖ is sufficiently small, stop and take Ux = U1U0. If ‖V1 − k⁻¹I‖ is large, compute

V2 = (1/n) Σ_{i=1}^{n} ( U1U0xi/‖U1U0xi‖ ) ( U1U0xi/‖U1U0xi‖ )ᵀ

and U2 = Chol(V2⁻¹). If ‖V2 − k⁻¹I‖ is sufficiently small, stop and take Ux = U2U1U0. If ‖V2 − k⁻¹I‖ is large, compute V3 and U3, and proceed until ‖Vj0 − k⁻¹I‖ is sufficiently small; then take Ux = Uj0 Uj0−1 · · · U0.
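In base R, the iteration can be written compactly as below; the function name is ours, normalizing chol() by its upper left entry plays the role of Chol (chol(M) returns the upper triangular factor U with UᵀU = M), and the accumulated product Uj Uj−1 · · · U0 is carried in Ux.

tyler.Ux <- function(x, tol = 1e-8, maxit = 100) {
  n <- nrow(x); k <- ncol(x)
  Chol <- function(M) { U <- chol(M); U / U[1, 1] }  # put a 1 in the upper left
  Ux <- diag(k)
  for (it in 1:maxit) {
    y <- x %*% t(Ux)                        # rows are (Ux x_i)'
    u <- y / sqrt(rowSums(y^2))             # unit vectors
    V <- crossprod(u) / n                   # current V_j
    if (max(abs(V - diag(k) / k)) < tol) break
    Ux <- Chol(solve(V)) %*% Ux             # U_j U_{j-1} ... U_0
  }
  Ux
}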

Affine Equivariant Median. We now turn to the problem of constructing an affine equivariant estimate of the center of symmetry of the underlying distribution. Our goal is to produce an estimate that is computationally efficient for large samples in any dimension, a problem that plagued some earlier attempts; see Small (1990) for an overview of multivariate medians. The estimate described below was proposed by Hettmansperger and Randles (2002), and we refer to it as the HR estimate. The estimator θ̂ is chosen to be the solution of

and we refer to it as the HR estimate. The estimator θ is chosen to be the solution of

1

n

n∑

i=1

Ux(xi − θ)

‖Ux(xi − θ)‖ = 0 (6.4.7)

in which Ux is the k × k upper triangular positive definite matrix, with a one in the upper left position on the diagonal, chosen to satisfy

(1/n) Σ_{i=1}^{n} ( Ux(xi − θ)/‖Ux(xi − θ)‖ ) ( Ux(xi − θ)/‖Ux(xi − θ)‖ )ᵀ = (1/k) I .   (6.4.8)

This is a transform-retransform estimate; see, for example, Chakraborty, Chaudhuri and Oja (1998). The data are transformed using Ux, and the estimate τ̂ = Uxθ̂ is computed; then the estimate is retransformed back to the original scale, θ̂ = Ux⁻¹τ̂. The simultaneous solutions of (6.4.7) and (6.4.8) are M-estimates; see Section 6.5.4 for the explicit representation. It follows from this that the estimate is affine equivariant. It is also possible to verify the affine equivariance directly.

The calculation of (Ux, θ̂) involves two routines. The first routine finds the value that solves (6.4.7) with Ux fixed. This is done by letting yi = Uxxi and finding τ̂ that solves Σ(yi − τ)/‖yi − τ‖ = 0. Hence, τ̂ is the spatial median of y1, . . . , yn; see Section 6.3.1. The solution to (6.4.7) is then θ̂ = Ux⁻¹τ̂. The second routine finds Ux in (6.4.8), as described above for the computation of Ux, for a fixed value of θ̂, with xi replaced by xi − θ̂. The calculation of (Ux, θ̂) alternates between these two routines until convergence. To obtain starting values, let θ0j = xj. Use the second routine to obtain U0j for this value of θ. The starting pair (θ0j, U0j) is the one that minimizes, over j = 1, . . . , n, the inner product

[ Σ_{i=1}^{n} U0j(xi − θ0j)/‖U0j(xi − θ0j)‖ ]ᵀ [ Σ_{i=1}^{n} U0j(xi − θ0j)/‖U0j(xi − θ0j)‖ ] .


This starting procedure is used since the starting values need to be affine invariant and equivariant.
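Reusing the spatial.median and tyler.Ux sketches given earlier, the alternating algorithm can be sketched as follows; for brevity this version starts from the componentwise median rather than the affine invariant start just described, so it is an illustration of the iteration, not a full implementation.

hr.estimate <- function(x, tol = 1e-6, maxit = 50) {
  theta <- apply(x, 2, median)              # simplified (not affine invariant) start
  for (it in 1:maxit) {
    Ux <- tyler.Ux(sweep(x, 2, theta))      # routine two: (6.4.8) at current theta
    tau <- spatial.median(x %*% t(Ux))      # routine one: spatial median of y_i = Ux x_i
    new <- drop(solve(Ux) %*% tau)          # retransform: theta = Ux^{-1} tau
    if (sum(abs(new - theta)) < tol) return(new)
    theta <- new
  }
  theta
}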

For fixed Ux there exists a unique solution for θ, and for fixed θ there exists a unique Ux, up to a multiplicative constant. In the simulations and calculations described in Hettmansperger and Randles (2002), the alternating algorithm did not fail to converge. However, the equations defining the simultaneous solution (Ux, θ̂) do not fully satisfy all conditions stated in the literature for existence and uniqueness; see Maronna (1976), Tyler (1988), and Kent and Tyler (1991).

The asymptotic distribution theory developed in Hettmansperger and Randles (2002) shows that θ̂ is approximately multivariate normally distributed under the assumption of directional symmetry and, hence, under symmetry. The asymptotic covariance matrix is complicated, and we recommend a bootstrap estimate of the covariance matrix of θ̂.

The approach taken above is more general. If we begin with the orthogonally invariant statistic in (6.3.2) and use a matrix U that satisfies the invariance property in part b of the Lemma, then the resulting statistic is affine invariant. For example, we could take U to be the inverse of the sample covariance matrix. This results in a test statistic studied by Hossjer and Croux (1995). We prefer the more robust matrix Ux proposed by Tyler (1987).

Example 6.4.1. Mathematics and Statistics Exam Scores

We now illustrate the one-sample affine invariant spatial sign test (6.4.5) and the affine equivariant spatial median on a small data set. A major advantage of this method is the speed of computation, which allows for bootstrap estimates of the covariance matrix and standard errors for the estimator. The data consist of 20 vectors, chosen at random from a larger data set published in Mardia, Kent, and Bibby (1979). Each vector consists of four components and records test scores in Mechanics, Vectors, Analysis, and Statistics. We wish to test the hypothesis that there are no differences among the examination topics. This is a traditional hypothesis in repeated measures designs; see Jan and Randles (1996) for a thorough discussion of this problem. Similar to our findings above on efficiencies, they found that multivariate sign and signed rank tests were often superior to least squares in robustness of level and efficiency.

Table 6.4.1 provides the original quadrivariate data along with the trivariate data that result when the Statistics score is subtracted from the other three. We suppose that the trivariate data are a sample of size 20 from a symmetric distribution with center θ = (θ1, θ2, θ3)ᵀ, and we wish to test H0 : θ = 0 versus HA : θ ≠ 0. In Table 6.4.2 we give the HR estimates (with standard errors) and the tests for the affine spatial methods, Hotelling's T², and Oja's affine methods described later in Section 6.4.3. The standard errors of the HR estimate are obtained from a bootstrap estimate of the covariance matrix. The following estimates are based on 500 bootstrap resamples:

Ĉov(θ̂) = ( 33.88  10.53  21.05 ;
           10.53  17.03  12.49 ;
           21.05  12.49  32.71 ) .


Table 6.4.1: Test Score Data: Mechanics (M), Vectors (V), Analysis (A), Statistics (S), and the differences when Statistics is subtracted from the other three.

  M   V   A   S   M−S  V−S  A−S
 59  70  62  56     3   14    6
 52  64  63  54    -2   10    9
 44  61  62  46    -2   15   16
 44  56  61  36     8   20   25
 30  69  52  45   -15   24    7
 46  49  59  37     9   12   22
 31  42  54  68   -37  -26  -14
 42  60  49  33     9   27   16
 46  52  41  40     6   12    1
 49  49  48  39    10   10    9
 17  53  43  51   -34    2   -8
 37  56  28  45    -8   11  -17
 40  43  21  61   -21  -18  -40
 35  36  48  29     6    7   19
 31  52  27  40    -9   12  -13
 17  51  35  31   -14   20    4
 49  50  23   9    40   41   14
  8  42  26  40   -32    2  -14
 15  38  28  17    -2   21   11
  0  40   9  14   -14   26   -5

The standard errors in Table 6.4.2 are the square roots of the main diagonal of this matrix. The affine sign methods suggest that the major source of statistical significance is the V − S difference; in particular, Vector scores are higher than Statistics scores. A more convenient comparison is achieved by estimating the locations in the four dimensional problem. We find the affine equivariant spatial median for M, V, A, S to be (along with bootstrap standard errors) 36.54 (8.41), 53.04 (5.09), 44.28 (8.39), and 39.65 (7.06). This again reflects the significant difference between Vector scores and Statistics scores. In fact, it appears the Vectors exam was easiest, while the other subjects are roughly equivalent.

An outlier was created in V by replacing the 70 (first observation) by 0. The results are shown in the lower part of Table 6.4.2. Note, in particular, that, unlike for the robust methods, the p-value for Hotelling's T² test has shifted above 0.05 and, hence, would no longer be considered significant.

An Affine Invariant Signed Rank Test and Affine Equivariant Estimate

The test statistic can be constructed in the same way that the affine invariant sign test was constructed. We sketch this development below. For a detailed and rigorous development,


Table 6.4.2: Results for the original and contaminated test score data: estimates of θ (HR, mean, and Oja HL) with standard errors, and results for the affine sign test (6.4.5), Hotelling's T² test, and the Oja signed rank test (6.4.16).

                           M−S    V−S    A−S   Test Statistic  Asymp. p-value
 Original Data
 HR Estimate             -2.12  13.85   6.21
 SE HR                    5.82   4.13   5.72
 Mean                    -4.95  12.10   2.40
 SE Mean                  4.07   3.33   3.62
 Oja HL-est.             -3.05  14.06   4.96
 Affine Sign Test (6.4.5)                          14.19           0.0027
 Hotelling's T²                                    13.47           0.0037
 Oja Signed rank (6.4.16)                          14.07           0.0028
 Contaminated Data
 HR Estimate             -2.92  12.83   6.90
 SE HR                    5.58   8.27   6.60
 Mean Vector             -4.95   8.60   2.40
 Oja HL-estimate         -3.90  12.69   4.64
 Affine Sign Test (6.4.5)                          10.76           0.0131
 Hotelling's T²                                     6.95           0.0736
 Oja Signed rank (6.4.16)                          10.09           0.0178


see Oja (2010, Chapter 7) or Oja and Randles (2004). The spatial signed rank statistic is given by S5 in (6.3.19), along with the spatial signed rank covariance matrix, given in this case by

(1/n) Σ_{i=1}^{n} r⁺(xi) r⁺(xi)ᵀ .   (6.4.9)

Now suppose we can construct a matrix Vx such that, when xi is replaced by Vxxi in (6.4.9), we have

[ (1/n) Σ r⁺(Vxxi) r⁺(Vxxi)ᵀ ] / [ (1/n) Σ r⁺(Vxxi)ᵀ r⁺(Vxxi) ] = (1/k) I .   (6.4.10)

The divisor in (6.4.10) is the average squared length of the signed rank vectors and is needed to normalize (on average) the signed rank vectors. In the simpler sign vector case, n⁻¹Σ[xiᵀxi/‖xi‖²] = 1. The normalized signed rank vectors now have roughly the same covariance structure as vectors uniformly distributed on the unit k-sphere. It is straightforward to develop an iterative routine to compute Vx in the same way we computed Ux for the sign statistic.

The signed rank test statistic developed from (6.3.22) is then

(k/n) S8ᵀ S8 ,   (6.4.11)

where S8 = Σ r⁺(Vxxi). Again, it can be verified directly that this test statistic is affine invariant. In addition, the p-value of the test can be approximated using the chisquare distribution with k degrees of freedom, or by simulation, conditionally, using the 2ⁿ equally likely values of

(k/n) [ Σ_{i=1}^{n} δi r⁺(Vxxi) ]ᵀ [ Σ_{i=1}^{n} δi r⁺(Vxxi) ]

with δi = ±1. Recall that the Hodges-Lehmann estimate related to the spatial signed rank statistic is the spatial median of the pairwise averages of the data vectors. This estimate is orthogonally equivariant but not affine equivariant, so we use the transformation-retransformation method. We transform the data using Vx to get yi = Vxxi, i = 1, . . . , n, and then compute the spatial median of the pairwise averages (yi + yj)/2, which we denote by τ̂. Then we retransform it back: θ̂ = Vx⁻¹τ̂. This estimate is now affine equivariant. Because of the complexity of the asymptotic covariance matrix, we recommend a bootstrap estimate of the covariance matrix of θ̂.

Efficiency

Recall Table 6.3.6, which provides efficiency values for either the spatial sign test or the spatial signed rank test relative to Hotelling's T² test.


The calculations were made for the spherical t distribution for various degrees of freedom and, finally, for the spherical normal distribution. Now that we have affine invariant sign and signed rank tests and affine equivariant estimates, we can apply these efficiency results to elliptical t and normal distributions. Hence, we again see the superiority of the sign and signed rank methods over Hotelling's test and the sample mean. The affine invariant tests and affine equivariant estimates are efficient and robust alternatives to the traditional least squares methods.

In the case of the affine invariant sign test, Randles (2000) presents a power sensitivity simulation comparing his test to Hotelling's T² test, Blumen's test, and Oja's sign test (6.4.14). In addition to the multivariate normal distribution, he included t distributions and a skewed distribution. Randles' affine invariant sign test performed extremely well. Although Oja's sign test performed comparably, it is much more computationally intensive than Randles' test.

6.4.3 The Oja Criterion Function

This method provides a direct approach to affine invariance and equivariance and does not require a transform-retransform technique. It is, however, much more computationally intensive. We only sketch the results in this section and give references where the more detailed derivations can be found. Recall from the univariate location model that L1 and L2 are special cases of methods derived from minimizing Σ|xi − θ|^m, for m = 1 and m = 2. Oja (1983) proposed the bivariate objective function D8(θ) = Σ_{i<j} A^m(xi, xj, θ), where A(xi, xj, θ) is the area of the triangle formed by the three vectors xi, xj, θ. When m = 2, Wilks (1960) showed that D8(θ) is proportional to the determinant of the classical scatter matrix, and the sample mean vector minimizes this criterion. Thus, by analogy with the univariate case, the m = 1 case will be called the L1 case. The same results carry over to dimensions greater than 2, in which the triangles are replaced by simplices. For the remainder of the section, m = 1.

We introduce the following notation:

Aij = ( 1  1  1 ; θ1  xi1  xj1 ; θ2  xi2  xj2 ) .

Then D8(θ) = (1/2) Σ_{i<j} abs{det Aij}, where det stands for determinant and abs stands for absolute value. Now, if we differentiate this criterion function with respect to θ1 and θ2, we get a new set of estimating equations:

S8(θ) =1

2

n−1∑

i=1

n∑

j=i+1

sgndetAij(x∗j − x∗

i ) = 0 , (6.4.12)

where x∗i is the vector xi rotated counterclockwise by π

2radians, hence, x∗

i = (−xi2, xi1)T .Note that θ enters only through the Aij . The expression in (6.4.12) is found as follows:

sgn(x∗j − x∗

i )T (θ − xi)(x∗

j − x∗i ) =

x∗j − x∗

i if x∗i → x∗

j → θ is counterclockwise−(x∗

j − x∗i ) if x∗

i → x∗j → θ is clockwise

Page 398: Robust Nonparametric Statistical Methods

388 CHAPTER 6. MULTIVARIATE

The estimator that solves 6.4.12 is called the Oja median and we will be interested inits properties. This estimator minimizes the sum of triangular areas formed by all pairs ofobservations along with θ. Niinimaa, Nyblom, and Oja (1992) provide a fortran program forcomputing the Oja median and discuss further aspects of its computation; see, also, the Rpackage OjaNP. Brown and Hettmansperger (1987a) present a geometric description of thedetermination of the Oja median. The statistic S8(0) forms the basis of a sign type statisticfor testing H0 : θ = 0. We will refer to this test as the Oja sign test. In order to study theOja median and the Oja sign test we need once again to determine the matrices A and B.Before doing this we will rewrite (6.4.12) in a more convenient form, a form that expressesit as a function of s1, . . . , sn. Recall the polar form of x, ( 6.3.7), that we have been usingand at the same time introduce the vector y as follows:

x = r

(cosφsin φ

)= rs

(cosϕsinϕ

)= sy

As usual 0 ≤ ϕ < π, s indicates whether x is above or below the horizontal axis, and r isthe length of x. Hence, if s = 1 then y = x, and if s = −1 then y = −x, so y is alwaysabove the horizontal axis.

Theorem 6.4.2. The following string of equalities is true:

1

nS8(0) =

1

2n

n−1∑

i=1

n∑

j=i+1

sgndet(xi1 xj1xi2 xj2

)(x∗

j − x∗i )

=1

2n

n−1∑

i=1

n∑

j=i+1

sisj(sjy∗j − siy

∗i )

=1

2

n∑

i=1

sizi

where

zi =1

n

n−1∑

j=1

y∗i+j and yn+i = −yi

Proof: The first formula follows at once from (6.4.12). In the second formula we need torecall the ∗ operation. It entails a counterclockwise rotation of 90 degrees. Suppose, withoutloss of generality, that 0 ≤ ϕ1 ≤ . . . ≤ ϕn ≤ π. Then

sgn

det

(xi1 xj1xi2 xj2

)= sgn

det

(siri cosϕi sjrj cosϕjsiri sinϕi sjrj sinϕj

)

= sgnsisjrirj cosϕi sinϕj − sinϕi cosϕj= sisjsgnsin(ϕj − ϕi)= sisj

Page 399: Robust Nonparametric Statistical Methods

6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS 389

Now if xi is in the first or second quadrants then y∗i = x∗

i = six∗i and if xi is in the third or

fourth quadrant then y∗i = −x∗

i = six∗i . Hence, in all cases we have x∗

i = siy∗i . The second

formula now follows.

The third formula follows by straightforward algebraic manipulations. We leave those detailsto the reader, Exercise 6.8.15, and instead point out the following helpful facts:

zi =n∑

j=i+1

y∗j −

i−1∑

j=1

y∗j , i = 2, . . . , n− 1, z1 =

n∑

j=2

y∗j , zn = −

n−1∑

j=1

y∗j (6.4.13)

The third formula shows that we have a sign statistic similar to the ones that we havebeen studying. Under the null hypothesis (s1, . . . , sn) and (z1, . . . , zn) are independent.Hence conditionally on z1, . . . , zn (or equivalently conditionally on y1, . . . ,yn) the conditional

covariance matrix of S8(0) is A = 14

∑i ziz

Ti . A conditional distribution free test is

reject H0 : θ = 0 when ST8 (0)A−1S8(0) ≥ χ2α(2) . (6.4.14)

Theorem 6.4.2 shows that conditionally on the data, the χ2-approximation is appropriate.The next theorem shows that the approximation is appropriate unconditionally as well. Foradditional discussion of this test see Brown and Hettmansperger (1989). We want to describethe asymptotically distribution free version of the Oja sign test. Then we will show that,for elliptical models, the Oja sign test and Blumen’s test are equivalent. It is left to theexercises to show that the Oja median is affine equivariant and the Oja sign test is affineinvariant so they compete with Blumen’s invariant test, the affine spatial methods in Section6.3, and with the L2 methods (vector of means and Hotelling’s test); see Exercise 6.8.16.

Since the Oja sign test is affine invariant, we will consider the behavior under sphericalmodels, without loss of generality. The elliptical models can be reduced to spherical modelsby affine transformations. The next proposition shows that zi has a useful limiting value.

Theorem 6.4.3. Suppose that we sample from a spherical distribution centered at the origin.Let

z(t) = −2

πE(r)

(cos tπsin tπ

)

then1

n3/2S8(0) =

1

2√n

n∑

i=1

siz

(i

n

)+ op(1)

Proof. We sketch the argument. A more general result and a rigorous argument can befound in Brown et al. (1992). We begin by referring to formula (6.4.13). Recall that

1

n

n∑

i=1

y∗i =

1

n

n∑

i=1

ri

(− sinϕicosϕi

)

Page 400: Robust Nonparametric Statistical Methods

390 CHAPTER 6. MULTIVARIATE

Consider the second component and let ∼= mean that the approximation is valid up to op(1)terms. From the discussion of maxi|πi/(n+ 1) − ϕ(i)| in Theorem 6.4.1, we have

1

n

[nt]∑

i=1

ri cosϕi ∼= Er 1

n

[nt]∑

i=1

cosϕi

∼=Er

π

π

n

[nt]∑

i=1

cosπi

n+ 1

∼=Er

π

∫ πt

0

cos udu =

Er

π

sin πt

Furthermore,

1

n

n∑

i=[nt]

ri cosφi ∼=Er

π

∫ π

πt

cosudu = −Er

n

sin πt

Hence the formula holds for the second component. The first component formula follows ina similar way.

This proposition is important since it shows that the Oja sign test is asymptoticallyequivalent to Blumen’s test under elliptical models since they are both invariant under affinetransformations. Hence, the efficiency results for Blumen’s test carry over for spherical andelliptical models to the Oja sign test. Also recall that Blumen’s test is locally most powerfulinvariant for the class of elliptical models so the Oja sign test should be quite good forelliptical models in general. The two tests are not equivalent for nonelliptical models. InBrown et. al. (1992) the efficiency of the Oja sign test relative to Blumen’s test was computedfor a class of symmetric densities with contours of the form |x1|m + |x2|m. When m = 2we have spherical densities, and when m = 1 we have Laplacian densities with independentmarginals. Table 1 of Brown et. al.(1992) shows that the Oja sign test is more efficient thanBlumen’s test except when m = 2 where, of course, the efficiency is 1. Hettmansperger,Nyblom, and Oja (1994) extend the Oja methods to dimensions higher than 2 in the onesample case and Hettmansperger and Oja (1994) extend the methods to higher dimensionsfor the multisample problem.

In Brown and Hettmansperger (1987a), the idea of an affine invariant rank vectoris introduced. The approach is similar to that of Mottonen and Oja (1995) for the spatialrank vector discussed earlier; see Section 6.3.2. The Oja criterion D8(θ) with m = 1 inSection 6.4.3 is a multivariate extension of the univariate L1 criterion function and we takeits gradient to be the centered rank vector. Recall in the univariate case D(θ) =

∑|xj−θ|

and the derivative D′(θ) =∑

sgn(θ−xj). Hence, D′(xi) is the centered rank of xi. Likewisethe vector centered rank of xk is defined to be:

Rn(xk) = D8(xk) =1

2

∑ ∑

i<j

sgn

det

1 1 1xk1 xi1 xj1xk2 xi2 xj2

(x∗

j − x∗i ) (6.4.15)

Page 401: Robust Nonparametric Statistical Methods

6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS 391

Again we use the idea of affine invariant vector rank to define the Oja signed rank statis-tic. Let R2n(xk) be the rank vector when xk is ranked among the observation vectorsx1, . . . ,xn and their reflections −x1, . . . ,−xn. Then the test statistic is S9(0) =

∑R2n(xj).

Now R2n(−xj) = −R2n(xj) so that the conditional covariance matrix (conditioning on theobserved data) is

A =n∑

j=1

R2n(xj)RT2n(xj)

The approximate size α test of H0 : θ = 0 is:

Reject H0 if ST9 (0)A−1S9(0) ≥ χ2α(2) . (6.4.16)

In addition, the Hodges-Lehmann estimate of θ based on S9(θ).= 0 is the Oja median

of a set of linked pairwise averages; see Brown and Hettmansperger (1987a) for details.

Hettmansperger, Mottonen and Oja (1997a, 1997b) extend the affine invariant one andtwo sample rank tests to dimensions greater than 2. Because of affine invariance, Table6.3.6 provides the efficiencies relative to Hotelling’s test for a multivariate t distribution; seeMottonen, Hettmansperger and Tienari (1997). Note that the efficiency is quite high evenfor the multivariate normal distribution. Further, note that this efficiency is the same for allelliptical normal distributions as well since the test is affine invariant.

Example 6.4.2. Mathematics and Statistics Exam Scores, Example 6.4.1 continued

We apply the Oja signed-rank test and the Oja HL-estimate to the data in Table 6.4.1.The numerical results are similar to the results of the affine spatial methods; see Table6.4.2 for the results. Note that due to computational complexity it is not possible to boot-strap the covariance matrix of the Oja HL-estimate. The R-library OjaNP can be used forcomputations.

6.4.4 Additional Remarks

Many authors have worked on the problem of developing multidimensional sign tests undervarious invariance conditions. The sign statistics are important for defining medians, andfurther in defining the concept of centered rank. Oja and Nyblom (1989) propose a family oflocally most powerful sign tests that are affine invariant and show that the Blumen (1958)test is optimal for elliptical alternatives. Using a different approach that involves data basedcoordinate systems, Chaudhuri and Sengupta (1993) introduce a family of affine invariantsign tests. See also Dietz (1982) for a development of affine invariant sign and rank pro-cedures based on rotations of the coordinate systems. Another interesting approach to theconstruction of a multivariate median and rank is based on the idea of data depth due toLiu (1990). In this case, the median is a point contained in a maximum number of trianglesformed by the

(n3

)different choices of 3 data vectors. See, also, Liu and Singh (1993).

Page 402: Robust Nonparametric Statistical Methods

392 CHAPTER 6. MULTIVARIATE

Hence, we conclude that if we are fairly certain that we have a spherical model, in aspatial statistics context, for example, then the spatial median and the spatial sign test arequite good. If the model is likely to be elliptical with heavy tails then either Blumen’s testor the affine invariant spatial sign or spatial signed-rank tests along with the correspondingequivariant estimators are both statistically and computationally quite efficient. If we suspectthat the model is nonelliptical then the methods of Oja are preferable. On the other hand, ifinvariance and equivariance considerations are not relevant then the componentwise methodsshould work quite well. Finally, departures from bivariate normality should be considered.The L1 type methods are good when there is a heavy tailed model. However, the efficiencycan be improved by rank type methods when the tail weight is more moderate and perhapsclose to normality. Even at the bivariate normal model the rank methods loose very littleefficiency when invariance is taken into account. Oja and Randles (2004) discuss affineinvariant rank tests for several samples and, further, discuss tests of independence.

6.5 Robustness of Multivariate Estimates of Location

In this section we sketch some results on the influence and breakdown points for the esti-mators derived from the various estimating equations. Recall from Theorem 6.1.2 that thevector influence is proportional to the vector Ψ(x). Typically Ω(x) is a projection and re-duces the problem of finding the asymptotic distribution of the estimating function 1√

nS(θ)

to a central limit problem. To determine whether an estimator has bounded influence ornot, it is only necessary to check that the norm of Ω(x) is bounded. Further, recall that thebreakdown point is the smallest proportion of contamination needed to carry the estimatorbeyond all bounds. We now briefly consider the different invariance models:

6.5.1 Location and Scale Invariance: Componentwise Methods

In the case of component medians, the influence function is given by

ΩT (x) ∝ (sgn(x11), sgn(x21)) .

The norm is clearly bounded. Further, the breakdown point is 50% as it is in the uni-variate case. Likewise, for the Hodges-Lehmann component estimates ΩT (x) ∝ (F1(x11) −1/2, F2(x21) − 1/2), where Fi(· ) is the ith marginal cdf. Hence, the influence is bounded inthis case as well. The breakdown point is 29%, the same as the univariate case. Note, how-ever, that the componentwise methods are neither rotation nor affine invariant/equivariant.

6.5.2 Rotation Invariance: Spatial Methods

We assume in this subsection that the underlying distribution is spherical. For the spatialmedian, we have Ω(x) = u(x), the unit vector in the x direction. Hence, again we havebounded influence. Lopuhaa and Rousseeuw (1991) were the first to point out that the

Page 403: Robust Nonparametric Statistical Methods

6.5. ROBUSTNESS OF MULTIVARIATE ESTIMATES OF LOCATION 393

spatial median has 50% breakdown point. The proof is given in the following theorem.First note from Exercise 6.8.17 that the maximum breakdown point for any translationequivariant estimator is [(n+1)/2]

nand the spatial median is translation equivariant.

Theorem 6.5.1. The spatial median θ has breakdown point ǫ∗ = [(n+1)/2]n

for every dimen-sion.

Proof. In view of the preceding remarks, we only need to show ǫ∗ ≥ [(n+1)/2]n

. LetX = (x1, . . . ,xn) be a collection of n observations in k dimensions. Let Ym = (y1, . . . ,yn)

be formed from X by corrupting any m observations. Then θ(Ym) minimizes∑

‖yi − θ‖.Assume, without loss of generality, that θ(X) = 0. (Use translation equivariance.) Wesuggest that the reader follow the argument with a picture in two dimensions.

Let M = maxi‖xi‖ and let B(0, 2M) be the sphere of radius 2M centered at the origin.

Suppose the number of corrupted observations m ≤ [n−12

]. We will show that sup‖θ(Ym)‖over Ym is finite. Hence, ǫ∗ ≥ (n−1)/2+1

n= (n+1)/2

nand we will be finished.

Let dm = inf‖θ(Ym) − γ‖ : γ in B(0, 2M), the distance of θ(Ym) from B(0, 2M).

Then the distance of θ(Ym) from the origin is ‖θ(Ym)‖ ≤ dm + 2M . Now

‖yj − θ(Ym)‖ ≥ ‖yj‖ − ‖θ(Ym)‖ ≥ ‖yj‖ − (dm + 2M) (6.5.1)

Suppose the contamination has pushed θ(Ym) far outside B(0, 2M). In particular, supposedm > 2M [(n+1)/2]. We will show this leads to a contradiction. We know that X ⊂ B(0,M)and if xk is not contaminated,

‖xk − θ(Ym)‖ ≥M + ‖xk‖ + dm (6.5.2)

Next split the following sum up over contaminated and not contaminated observations using( 6.5.1) and ( 6.5.2).

n∑

i=1

‖yi − θ(Ym)‖ ≥∑

contam

(‖yi‖ − (dm + 2M)) +∑

not

(‖yi‖ − dm)

=∑

‖yi‖ −[n− 1

2

](dm + 2M) + (n−

[n− 1

2

])dm

=∑

‖yi‖ − 2M(

[n− 1

2

]) + dm(n− 2(

[n− 1

2

]))

>∑

‖yi‖ − 2M(

[n− 1

2

]) + 2M(

[n− 1

2

])(n− 2(

[n− 1

2

]))

=∑

‖yi‖ + 2M(

[n− 1

2

])(n− 1 − 2(

[n− 1

2

]))

≥∑

‖yi‖

But, recall that θ(Ym) minimizes∑

‖yi − θ‖, hence we have a contradiction. Thus dm ≤2M(

[n−1

2

]).

But then ǫ∗ must be at least [(n−1)/2]+1n

and[n−1

2

]+ 1 =

[n+1

2

]and the proof is complete.

Page 404: Robust Nonparametric Statistical Methods

394 CHAPTER 6. MULTIVARIATE

6.5.3 The Spatial Hodges-Lehmann Estimate

This estimate is the spatial median of the pairwise averages: 12(xi + xj). It was first studied

in detail by Chaudhuri (1992) and it is the estimate corresponding to the spatial signed rankstatistic (6.3.16) of Mottonen and Oja (1995).

From ( 6.3.21) it is clear that the influence function is bounded. Further, since it is thespatial median of the pairwise averages, the argument that shows that the breakdown ofthe univariate Hodges-Lehmann estimate is 1− 1√

2≈ .29 works in the multivariate case; see

Exercise 1.12.13 in Chapter 1.

6.5.4 Affine Equivariant Spatial Median

We can represent the affine equivariant spatial median as an M-estimate; see Maronna (1976)

or Maronna et. al (2006). Our multivariate estimators θ and Ux are the solutions of thefollowing M-estimating equations:

1

n

n∑

i=1

u1(di)Ux(xi − θ) = 0,1

n

n∑

i=1

u2(di)Ux(xi − θ)(xi − θ)TUTx = I (6.5.3)

where di = ‖Ux(xi − θ)‖2 with u1(d) = d−1/2 and u2(d) = kd−1. Because they are M-

estimators, the breakdown value for θ is between (k+1)−1 and k−1 where k is the dimension

of the underlying population. The asymptotic theory for θ, developed in the appendix ofHettmansperger and Randles (2002), shows that the influence function for θ is

B−1 Ux(x − θ)

‖Ux(xi − θ)‖ where B = E

[1

‖Ux(I− xi − θ)‖

(I − Ux(x − θ)

‖Ux(xi − θ)‖(x − θ)TUT

x

‖Ux(xi − θ)‖

)];

(6.5.4)recall (6.3.3). Hence, we see that the influence function is bounded with a positive breakdown.Note however that the breakdown decreases as the dimension of the underlying distributionincreases.

6.5.5 Affine Equivariant Oja Median

This estimator is affine equivariant and solves the equation ( 6.4.12). From the projectionrepresentation of the statistic in Theorem 6.4.2 notice that the vector z(t) is bounded. Itthen follows that, for spherical models (with finite first moment), the influence function isbounded. See Niinimaa and Oja (1995) for a rigorous derivation of the influence function.

The breakdown properties of the Oja median are more interesting. As shown by Niinimaa,Oja, and Tableman (1990), even though the influence function is bounded, the estimate canbe broken down with just two contaminated points; that is, they showed that the breakdownof Oja’s median is 2/n. Further, Niinimaa and Oja (1995) show that the breakdown point ofthe Oja median depends on the dispersion of the contaminated data. When the dispersionof the contaminated data is less than the dispersion of the original data then the asymptotic

Page 405: Robust Nonparametric Statistical Methods

6.6. LINEAR MODEL 395

breakdown point is positive. If, for example, the contaminated points are all at a singlepoint, then the breakdown point is 1

3.

6.6 Linear Model

We consider the bivariate linear model. As examples of the linear model, we will find bivariateestimates and tests for a general regression effect as well as shifts in the bivariate two-samplelocation model and multisample location models. We will focus primarily on compontentwiserank methods; however, we will discuss some other methods for the multiple sample locationmodel in the examples of Section 6.6.1. Spatial and affine invariant/equivariant methods forthe general linear model are currently under development in the research literature. See Davisand McKean (1993) for a thorough development of the linear model rank-based methods.

In Chapter 3, Section 3.2, we present the notation for the univariate linear model. Here,we will think of the multivariate linear model as a series of concatenations of the univariatemodels. Hence, we introduce

Yn×2 =

Y11 Y12...

...Yn1 Yn2

= (Y(1),Y(2)) =

YT1...

YTn

(6.6.1)

The superscript indicates a column, a subscript a row, and, as usual in this chapter, Tdenotes transpose. Now the multivariate linear model is

Y = 1αT + Xβ + ǫ , (6.6.2)

where 1 is n×1 vector of ones, αT = (α(1), α(2)), X is n×p full rank, centered design matrix,β is p × 2 matrix of unknown regression constants, and ǫ is n × 2 matrix of errors. Therows of ǫ, and hence, Y, are independent and the rows of ǫ are identically distributed witha continuous bivariate cdf F (s, t).

Model 6.6.2 is the concatenation of two univariate linear models: Y(i) = 1α(i) +Xβ(i) +ǫ(i) for i = 1, 2. We have restricted attention to the bivariate case to simplify the presentation.In most cases the general multivariate results are obvious.

We rank within components or columns. Hence, the rank-score of the ith item in the jthcolumn is:

aij = a(Rij) = a(R(Yij − xTi β(j))) (6.6.3)

where Rij is the rank of Yij−xTi β(j) when ranked among Y1j−xT1 β(j), . . . , Ynj−xTnβ(j). Therank scores are generated by a(i) = ϕ( i

n+1), 0 < ϕ(u) < 1,

∫ϕ(u)du = 0, and

∫ϕ2(u)du = 1;

see Section 3.4. Let the score matrix A be defined as follows:

A =

a11 a12...

...an1 an2

= (a(1), a(2)) (6.6.4)

Page 406: Robust Nonparametric Statistical Methods

396 CHAPTER 6. MULTIVARIATE

so that each column is the set of rank scores within the column.

The criterion function is

D(β) =n∑

i=1

aTi ri (6.6.5)

where aTi = (ai1ai2) = (a(R(Yi1 − xTi β(1)), a(R(Yi2 − xTi β(2))) and rTi = (Yi1 − xTi β(1), Yi2 −xTi β(2)). Note at once that this is an analog, using inner products, of the univariate criterionin Section 3.2.1. In fact, D(β) is the sum of the corresponding univariate criterion functions.The matrix of the negatives of the partial derivatives is:

L(β) = XTA =

∑xi1ai1

∑xi1ai2

......∑

xipai1∑xipai2

=

(∑ai1xi,

∑ai2xi

); (6.6.6)

see Exercise 6.8.18 and equation ( 3.2.11). Again, note that the two columns in ( 6.6.6) arethe estimating equations for the two concatenated univariate linear models and xi is the ithrow of X written as a column.

Hence, the componentwise multivariate R-estimator of β is β that minimizes ( 6.6.5) orsolves L(β)

.= 0. Further, L(0) is the basic quantity that we will use to test H0 : β = 0. We

must statistically assess the size of L(0) and reject H0 and claim the presence of a regressioneffect when L(0) is “too large” or “too far from the zero matrix”.

We first consider testing H0 : β = 0 since the distribution theory of the test statistic willbe useful later for the asymptotic distribution theory of the estimate.

For the linear model we need some results on direct products; see Magnus and Neudedker(1988) for a complete discussion. We list here the results that we need:

1. Let A and B be m× n and p× q matrices. The mp× nq matrix A ⊗B defined by

A ⊗ B =

a11B · · · a1nB

......

am1B · · · amnB

(6.6.7)

is called the direct product or Kronecker product of A and B.

2.

(A ⊗B)T = AT ⊗BT , (6.6.8)

(A⊗ B)−1 = A−1 ⊗ B−1 , (6.6.9)

(A⊗ B)(C ⊗D) = (AC⊗ BD) . (6.6.10)

3. Let D be a m × n matrix. Then Dcolis the mn × 1 vector formed by stacking thecolumns of D. We then have

tr(ABCD) = (DT

col)T (CT ⊗ A)Bcol = DT

col(A ⊗CT )(BT )col . (6.6.11)

Page 407: Robust Nonparametric Statistical Methods

6.6. LINEAR MODEL 397

4.(AB)col = (BT ⊗ I)Acol = (I ⊗ A)Bcol . (6.6.12)

These facts are used in the proofs of the theorems in the rest of this chapter.

6.6.1 Test for Regression Effect

As mentioned above, we will base the test of H0 : β = 0 on the size of the random matrixL(0). We deal with this random matrix by rolling out the matrix by columns. Note from6.6.4 and ( 6.6.6) that L(0) = X′A = (X′a(1),X′a(2)). Then we define the vector

Lcol =

(XTa(1)

XTa(2)

)=

(XT 00 XT

)(a(1)

a(2)

). (6.6.13)

Now from the discussion in Section 3.5.1, let the column variances and covariances be

σ2a(i) =

1

n− 1

n∑

j=1

a2ji → σ2

i =

∫ϕ2(u) du = 1

σ2a(1)a(2) =

1

n− 1

n∑

j=1

aj1aj2 → σ12 =

∫ϕ(F1(s))ϕ(F2(t)) dF (s, t) , (6.6.14)

where F1(s) and F2(t) are the marginal cdfs of F (s, t). Since the ranks are centered andusing the same argument as in Theorem 3.5.1, E(Lcol) = 0 and

V = Cov(Lcol) =

[σ2

a(1)XTX σa(1)a(2)XTX

σa(1)a(2)XTX σ2a(2)X

TX

]

=

(1

n− 1ATA

)⊗(XTX

). (6.6.15)

Further,1

nV →

[1 σ12

σ12 1

]⊗Σ , (6.6.16)

where n−1XTX → Σ and Σ is positive definite.The test statistic for H0 : β = 0 is the quadratic form

AR = LT

colV−1Lcol = (n− 1)LT

col

[(ATA)−1 ⊗ (XTX)−1

]Lcol (6.6.17)

where we use a basic formula for finding the inverse of a direct product; see ( 6.6.9). Beforediscussing the distribution theory we record one final result from traditional multivariateanalysis:

AR = (n− 1)traceLT (XTX)−1L(ATA)−1 ; (6.6.18)

see Exercise 6.8.19. This result is useful in translating a quadratic form involving a directproduct into a trace involving ordinary matrix products. Expression ( 6.6.18) correspondsto the Lawley-Hotelling trace statistic based on ranks within the components. Thefollowing theorem summarizes the distribution theory needed to carry out the test.

Page 408: Robust Nonparametric Statistical Methods

398 CHAPTER 6. MULTIVARIATE

Theorem 6.6.1. Suppose H0 : β = 0 is true and the conditions in Section 3.4 hold. Then

P0(AR ≥ χ2α(2p)) → α as n→ ∞

where χ2α(2p) is the upper α percentile from a chisquared distribution with 2 degrees of free-

dom.

Proof: This theorem follows along the same lines as Theorem 3.5.2. Use a projectionto establish that 1√

nLcol is asymptotically normally distributed and then AR will be asymp-

totically chisquared. The details are left as an Exercise 6.8.20; however, the projection isprovided below for use with the estimator.

1√nLcol =

1√n

(XTϕ(1)

XTϕ(2)

)+ op(1) (6.6.19)

where ϕ(i)T= (ϕ(Fi(ǫ1i) . . . ϕ(Fi(ǫni)) i = 1, 2 and F1, F2 are the marginal cdfs. Recall also

that a(i) = ϕ( in+1

) where ϕ(· ) is the score generating function. The asymptotic covariancematrix is given in ( 6.6.16).

Example 6.6.1. Multivariate Mann-Whitney-Wilcoxon Test

We now specialize to the Wilcoxon score function a(i) =√

12( in+1

− .5) and consider thetwo sample model. The test is a multivariate version of the Mann-Whitney-Wilcoxon test.Note that

∑a(i) = 0, σ2

a= 1

n−1

∑a2(i) = n

n+1−→ 1, and

σa(1)a(2) =12

n− 1

n∑

i=1

(Ri1

n + 1− 1

2

)(Ri2

n+ 1− 1

2

)

where R11, . . . , Rn1 are the ranks of the combined samples in the first component and sim-ilarly for R12, . . . , Rn2 for the second component. Note that σa(1)a(2) = n

n+1rs, where rs is

Spearman’s Rank Correlation Coefficient. Hence,

1

n− 1ATA =

[nn+1

σa(1)a(2)

σa(1)a(2)nn+1

]=

n

n+ 1

[1 rsrs 1

]→[

1 σ12

σ12 1

]

where

σ12 = 12

∫ ∫ (F1(r) −

1

2

)(F2(s) −

1

2

)dF (r, s)

depends on the underlying bivariate distribution.

Next , we must consider the design matrix X for the two sample model. Recall ( 2.2.1)and ( 2.2.2) which cast the two sample model as a linear model in the univariate case. Thedesign matrix (or vector in this case) is not centered. For convenience we modify C in

Page 409: Robust Nonparametric Statistical Methods

6.6. LINEAR MODEL 399

( 2.2.1) to have 1 in the first n1 places and 0 elsewhere. Note that the mean of C is n1

nand

subtracting this from the elements of C yields the centered design:

X =1

n

n2...n2

−n1...

−n1

where n2 appears n1 times. Then XTX = n1n2

nand 1

nXTX = n1n2

n2 −→ λ1λ2. We assume asusual that 0 < λi < 1, i = 1, 2.

Now L = L(0) = (l1, l2) and li = XTa(i). It is easy to see that

li =n1∑

j=1

aji =√

12n1∑

j=1

(Rji

n+ 1− 1

2

), i = 1, 2

So li is the centered and scaled sum of ranks of the first sample in the ith component.Now Lcol = (l1, l2)

T has an approximate bivariate normal distribution with covariancematrix:

Cov(Lcol) =1

n− 1(ATA) ⊗ (XTX) =

n1n2

n(n− 1)ATA =

n1n2

n + 1

[1 rsrs 1

].

Note that σ12 is unknown but estimated by Spearman’s rank correlation coefficient rs (seeabove discussion). Hence the test is based on AR in ( 6.6.18). It is easy to invert Cov(Lcol)and we have, (see Exercise 6.8.20),

AR =n + 1

n1n2(1 − r2s)l21 + l22 − 2rsl1l2 =

1

1 − r2s

l∗21 + l∗22 − 2rsl

∗1l

∗2

,

where l∗1 and l∗2 are the standardized MWW statistics. We rejectH0 : β = 0 at approximatelylevel α when AR ≥ χ2

α(2). The test statistic AR is a quadratic form in the component Mann-Whitney-Wilcoxon rank statistics and rs provides the adjustment for the correlation betweenthe components.

Example 6.6.2. Brains of Mice Data

In Table 6.6.1 we provide bivariate data on levels of certain biochemical components in thebrains of mice. The treatment group received a drug which was hypothesized to alter theselevels. The control group received a placebo.

The ranks of the combined treatment and control data for each component are given inthe table, under component ranks. The Spearman rank correlation coefficient is rs = .149,

Page 410: Robust Nonparametric Statistical Methods

400 CHAPTER 6. MULTIVARIATE

Table 6.6.1: Levels of Biochemical Components in the Brains of MiceData Component Ranks Centered Affine Ranks

Control Treatment Control Treatment Control Treatment(1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2)1.21 .61 1.40 1.50 16 22 22 18.5 .90 18.53 8.28 8.06.92 .43 1.17 .39 3 12 18 13 -9.02 3.55 2.06 -7.76.80 .35 1.23 .44 1 4.5 18 13 -8.81 -6.26 4.90 1.05.85 .48 1.19 .37 2 17 15 6 -9.40 9.37 4.15 -11.78.98 .42 1.38 .42 4 10.5 21 10.5 -6.80 .74 9.79 -4.63

1.15 .52 1.17 .45 11 20 12.5 14.5 -.53 15.51 .55 5.961.10 .50 1.31 .41 9 18.5 20 9 -3.25 13.63 8.15 -7.051.02 .53 1.30 .47 6 21 19 16 -6.23 15.97 6.72 6.711.18 .45 1.22 .29 14 14.5 17 2 2.28 5.27 5.52 -16.781.09 .4 1.00 .30 7.5 8 5 3 -3.04 -4.56 -4.84 -15.02

1.12 .27 10 1 .95 -18.311.09 .35 7.5 4.5 -2.03 -12.54

the standardized MWW statistics are l∗1 = −2.74 and l∗2 = 2.17. Hence AR = 14.31 withthe p-value of .0008 based on a χ2-distribution with 2 degrees of freedom. Panels A andB of Figure 6.6.1 show, respectively, plots of the bivariate data and the component ranks.The strong separation between treatment and control is clear from the plot. The treatmentgroup contains an outlier which is brought in by the component ranking.

We have discussed the multivariate version of the MWW test based on the centered ranksof the combined data where the ranking mechanism is represented by the matrix A. Given Aand the design matrix X, the test statistic AR can be computed. Recall from Section 6.3.2that Mottoen and Oja (1995) introduced the vector spatial rank R(xi) =

∑j u(xi − xj),

where u(x) = ‖x‖−1x is a unit vector in the direction of x. In Section ??, an affine rankvector R(xi) is given by ( 6.4.15). Both spatial and affine rank vectors are centered. LetR(xi) be the ith row of A. Note that in these two cases that the columns of A are not oflength 1. Nevertheless, from ( 6.6.18), we have

AR =n(n− 1)

n1n2

[l1 l2](ATA)−1[l1 l2]

T

=n(n− 1)

n1n2

1

1 − r2

l21

‖a(1)‖2+

l22‖a(2)‖2

− 2rl1l2

‖a(1)‖a(2)‖

,

where r is the correlation coefficient between the two columns of A; see Brown and Hett-mansperger (1987b). Table 6.6.1 contains the affine rank vectors and the correspondingaffine invariant MWW test is AR = 15.69 with an approximate p-value of .0004 based on aχ2(2)-distribution. See Exercise 6.8.21.

Page 411: Robust Nonparametric Statistical Methods

6.6. LINEAR MODEL 401

Figure 6.6.1: Panel A: Plot of the data for the Brains of Mice Data; Panel B: Plot of thecorresponding ranks of the Brains of Mice Data.

C

CC

CC

CCCC

C

Component 1 responses

Com

pone

nt 2

res

pons

es

0.8 1.0 1.2 1.4

0.4

0.8

1.2

T

TT

TTT T

T

TT TT

Panel A

C

C

C

C

C

CC

C

C

C

Component 1 ranks

Com

pone

nt 2

ran

ks

5 10 15 20

510

1520

T

T

T

T

T

T

T

T

TT

T

T

Panel B

Page 412: Robust Nonparametric Statistical Methods

402 CHAPTER 6. MULTIVARIATE

Example 6.6.3. Multivariate Kruskal-Wallis Test

In this example we develop the multivariate version of the Kruskal-Wallis statistic for usein a multivariate one-way layout; see Section 4.2.2 for the univariate case. We suppose wehave k samples from k independent distributions. The n× (k− 1) design matrix is given by

C =

0n1 0n1 . . . 0n1

1n2 0n2 . . . 0n2

0n3 1n3 . . . 0n3

......

......

0nk0nk

. . . 1nk

and the column means are c′ = (λ2, . . . , λk) where λi = ni

n. The centered design is X =

C − 1c′ and has full column rank k − 1.

In this design the first of the k populations is taken as the reference population withlocation (α1, α2)

T . The ith row of the β matrix is the vector of shift parameters for the(i + 1)st population relative to the first population. We wish to test H0 : β = 0 that allpopulations have the same (unknown) location vector.

The matrix A = (a(1), a(2)) has the centered and scaled Wilcoxon scores of the previousexample. Hence, a(1) is the vector of rank scores for the combined k samples in the firstcomponent. Since the rank scores are centered, we have

XTa(i) = (C − 1cT )Ta(i) = CTa(i)

and the second version is easier to compute. Now L(0) = (L(1),L(2)) and the hth componentof column i is

lhi =√

12∑

j∈Sh

(Rji

n+ 1− 1

2

)

=

√12

n+ 1nh

(Rhi −

n+ 1

2

)

where Sh is the index set corresponding to the hth sample and Rhi is the average rank ofthe hth sample in the ith component.

As in the previous example, we replace 1n−1

ATA by its limit with 1 on the main diagonal

and σ12 off the diagonal. Then let ((σij)) be the inverse matrix. This is easy to computeand will be useful below. The test statistic is then, from ( 6.6.17)

AR ≈ (L(1)T

,L(2)T

)

(σ11(XTX)−1 σ12(XTX)−1

σ12(XTX)−1 σ22(XTX)−1

)(L(1)

L(2)

)

= σ11L(1)T

(XTX)−1L(1) + 2σ12L(1)T

(XTX)−1L(2) + σ22L(2)T

(XTX)−1L(2)

Page 413: Robust Nonparametric Statistical Methods

6.6. LINEAR MODEL 403

The ≈ indicates that the right side contains asymptotic quantities which must be estimatedin practice. Now

L(1)T

(XTX)−1L(1) =

k∑

j=1

n−1j l2j1 =

12

(n + 1)2

k∑

j=1

nj

(Rj1 −

n + 1

2

)=

n

n+ 1H1

where H1 is the Kruskal-Wallis statistic computed on the first component. Similarly,

L(1)T

(XTX)−1L(2) =12

(n+ 1)2

k∑

j=1

nj

(Rj1 −

n+ 1

2

)(Rj2 −

n+ 1

2

)=

n

n + 1H12

and H12 is a cross component statistic. Using Spearman’s rank correlation coefficient rs toestimate σ12, we get

AR =1

1 − r2s

H1 − 2rsH12 +H2

The test rejects the null hypothesis of equal location vectors at approximately level α whenAR ≥ χ2

α(2(k − 1)).In order to compute the test, first compute componentwise rankings. We can display the

means of the rankings in a 2 × k table as follows:

Treatment1 2 · · · k

Component 1 R11 R21 · · · R1k

Component 2 R12 R22 · · · R2k

Then use Minitab or some other package to find the two Kruskal-Wallis statistics. To com-pute H12 either use the formula above or use

H12 =k∑

j=1

(1 − nj

n

)Zj1Zj2 , (6.6.20)

where Zji = (Rji−(n+1)/2)/√

VarRji and VarRji = (n−nj)(n+1)/nj; see Exercise 6.8.22.

The package Minitab lists Zji in its output.

The last example shows that in the general regression problem with Wilcoxon scores, ifwe wish to test H0 : β = 0, the test statistic ( 6.6.17) can be written as

AR =1

1 − σ122L(1)T

(XTX)−1L(1) − 2σ12L(1)T

(XTX)−1L(2) + L(2)T

(XTX)−1L(2) (6.6.21)

where the estimate of σ12 can be taken to be rs or nn+1

rs and rs is Spearman’s rank correlationcoefficient and

lhi =

√12

n+ 1

n∑

j=1

(Rji −

n+ 1

2

)xjh =

n∑

j=1

a(R(Yji))xjh

Then reject H0 : β = 0 when AR ≥ χ2α(2p).

Page 414: Robust Nonparametric Statistical Methods

404 CHAPTER 6. MULTIVARIATE

6.6.2 The Estimate of the Regression Effect

In the introduction to Section 6.6, we pointed out that the R-estimate β solves L(β).= 0,

( 6.6.7). Recall the representation of the R-estimate in the univariate case given in Corollary3.5.2. This immediately extends to the multivariate case as

√n(β − β0) =

(1

nXTX

)−11√n

(τ1XTϕ(1), τ2X

Tϕ(2)) + op(1) (6.6.22)

where ϕ(i)T= (ϕ(Fi(ǫ1i)), . . . , ϕ(Fi(ǫni))), i = 1, 2 Further, τi is given by ( 3.4.4) and we

define the matrix τ by τ = diagτ1, τ2. To investigate the asymptotic distribution of therandom matrix ( 6.6.22), we again roll it out by columns. We need only consider the linearapproximation on the right side.

Theorem 6.6.2. Assume the regularity conditions in Section 3.4. Then, if β is the truematrix,

√n(βcol − βcol)

D→ N2p

(0,

(1 σ12

σ12 1

)⊗ Σ−1

)

where

σ12 =

∫ ∫ϕ(F1(s))ϕ(F2(t))dF (s, t) , τ = diagτ1, τ2

and τi is given by ( 3.4.4), and 1nX′X −→ Σ , positive definite.

Proof. We will sketch the argument based on ( 6.6.1), (6.6.13), and Theorem 3.5.2. Consider,with Σ−1 replaced by ( 1

nX′X)−1,

1√n

(τ1Σ

−1XTϕ(1)

τ2Σ−1XTϕ(2)

)=

(τ1Σ

−1 00 τ2Σ

−1

)( 1√nXTϕ(1)

1√nXTϕ(2)

)

The multivariate central limit theorem establishes the asymptotic multivariate normality.From the discussion after ( 6.6.1), we have Eϕ(i)ϕ(i)T

= I, i = 1, 2 and Eϕ(1)ϕ(2)T= σ12I.

Hence, the covariance matrix of the above vector is:(τ1Σ

−1 00 τ2Σ

−1

)(Σ σ12Σ

σ12Σ Σ

)(τ1Σ

−1 00 τ2Σ

−1

)=

(τ 21 Σ

−1 τ1τ2σ12Σ−1

τ1τ2σ12Σ−1 τ 2

2 Σ−1

)

=

(τ 21 τ1τ2σ12

τ1τ2σ12 τ 22

)⊗ Σ−1

=

(1 σ12

σ12 1

)⊗ Σ−1

and this is the asymptotic covariance matrix for√n(βcol − βcol).

We remind the reader that when we use the Wilcoxon score ϕ(u) =√

12(u − 12), then

τ−1i =

√12∫f 2i (x)dx, fi the marginal pdf i = 1, 2 and σ12 = n

n+1rs, where rs is Spearman’s

rank correlation coefficient. See Section 3.7.1 for a discussion of the estimation of τi.

Page 415: Robust Nonparametric Statistical Methods

6.6. LINEAR MODEL 405

6.6.3 Tests of General Hypotheses

Recall the model ( 6.6.1) and let the matrix M be r × p of rank r and the matrix K be2× s of rank s. The matrices M and K are fully specified by the researcher. We consider atest of H0 : MβK = 0. For example, when K = I2, and M = (O Ir) where O denotes ther× (p− r) matrix of zeros, we have H0 : Mβ = 0 and this null hypothesis specifies that thelast r parameters in both components are 0. This is the usual subhypothesis in the linearmodel applied to both components. Alternatively we might let M = Ip, p× p, and

K =

(1−1

).

Then H0 : βK = 0 and we test the null hypothesis that the parameters of the two con-catenated linear models are equal: βi1 = βi2 for i = 1, . . . , p. This could be appropriate fora pre-post test model. Thus, we generalize ( 3.6.1) to the multivariate linear model. Thedevelopment will preceed in steps beginning with H0 : Mβ = 0, i.e. K = I2.

Theorem 6.6.3. Under H0 : Mβ = 0

√n(Mβ)col

D→ N2r

(0,

(1 σ12

σ12 1

)⊗ [MΣ−1MT ]

)

here τ , σ12,Σ are given in Theorem 6.6.2. Let V denote the asymptotic covariance matrixthen

n(Mβ)TcolV−1(Mβ)col = β

T

col[τ1

n− 1ATAτ ]−1 ⊗MT (M(XTX)−1MT )−1Mβcol

= trace(Mβ)T (M(XTX)−1MT )−1(Mβ)

·[τ 1

n− 1ATAτ ]−1 (6.6.23)

is asymptotically χ2(2r). Note that we have estimated unknown parameters in the asymptoticcovariance matrix V.

Proof. First note that

(Mβ)col =

(M 00 M

)βcol

Using Theorem 6.6.2, the asymptotic covariance is, with τ = diagτ1, τ2,

V =

(M 00 M

)((τ

(1 σ12

σ12 1

)⊗ Σ−1

)(MT 00 MT

)

=

(M 00 M

)(τ1Σ

−1 τ1σ12Σ−1

τ2σ12Σ−1 τ2Σ

−1

)(MT 00 MT

)

=

(τ1MΣ−1MT τ1σ12MΣ−1MT

τ2σ12MΣ−1MT τ2MΣ−1MT

)

=

(1 σ12

σ12 1

)⊗MΣ−1MT

Page 416: Robust Nonparametric Statistical Methods

406 CHAPTER 6. MULTIVARIATE

Hence, by the same argument,

βT

col

(MT 00 MT

)V−1

(M 00 M

)βcol =

βT

col[τ(

1 σ12

σ12 1

)τ ]−1 ⊗ MT (MΣ−1MT )−1Mβcol =

trace(Mβ)T (MΣ−1MT )−1(Mβ)[τ

(1 σ12

σ12 1

)τ ]−1

Denote the test statistic, ( 6.6.23), defined in the last theorem to be

QMVR = trace(Mβ)T (M(XTX)−1MT )−1(Mβ)[τ1

n− 1ATAτ ]−1 . (6.6.24)

Then the corresponding level α asymptotic decision rule is:

Reject H0 : Mβ = 0 in favor of HA : Mβ 6= 0 if QMVR ≥ χα(2r) . (6.6.25)

The next theorem describes the test when only K is involved. After that we put the tworesults together for the general statement.

Theorem 6.6.4. Under H0 : βK = 0, where K is a 2× s matrix,√n(βK)col is asymptot-

ically

Nps

(0,

[KTτ

(1 σ12

σ12 1

)τK

]⊗ Σ−1

)

where τ , σ12, and Σ are given in Theorem 6.6.2. Let V denote the asymptotic covariancematrix. Then

n(βK)TcolV−1(βK)col = trace(βK)T (XTX)(βK)[Kτ

1

n− 1ATAτK]−1

is asymptotically χ2(ps).

Proof: First note that (βTK)col =(KT ⊗ I

)βT

col. Then from Theorem 6.6.2, the asymp-

totic covariance matrix of√nβT

col is

AsyCov(√nβT

col) =

[1 σ12

σ12 1

)⊗Σ−1 .

Hence, the asymptotic covariance matrix of√n(βK)col is,

AsyCov(√n(βK)col) =

(KT ⊗ I

)((τ

[1 σ12

σ12 1

)⊗ Σ−1

)(KT ⊗ I

)T

=

(KTτ

[1 σ12

σ12 1

]τ ⊗Σ−1

)(K⊗ I)

=

(KTτ

[1 σ12

σ12 1

]τK

)⊗ Σ−1 ,

Page 417: Robust Nonparametric Statistical Methods

6.6. LINEAR MODEL 407

which is the desired result. The asymptotic normality and chisquare distribution follow fromTheorem 6.6.2.

The previous two theorems can be combined to yield the general case.

Theorem 6.6.5. Under H0 : MβK = 0,

√n(MβK)col

D−→ Nrs

(0, [KTτ

(1 σ12

σ12 1

)τK] ⊗ MΣMT−1

).

If V is the asymptotic covariance matrix with estimate V then

n(MβK)TcolV−1(MβK)col =

(MβK)Tcol[KT τ1

n− 1ATAτK]−1 ⊗ [M(XTX)−1MT ]−1(MβK)col =

trace(MβK)T [M(XTX)−1MT ]−1(MβK)[KT τ1

n− 1ATAτK]−1 (6.6.26)

has an asymptotic χ2(rs) distribution.

The last theorem provides great flexibility in composing and testing hypotheses in themultivariate linear model. We must estimate the matrix β along with the other parametersfamiliar in the linear model. However, once we have these estimates it is a simple series ofmatrix multiplications and the trace operation to yield the test statistic.

Denote the test statistic, ( 6.6.26), defined in the last theorem to be

QMVRK =

trace(MβK)T [M(XTX)−1MT ]−1(MβK)[KT τ1

n− 1ATAτK]−1 (6.6.27)

Then the corresponding level α asymptotic decision rule is:

Reject H0 : MβK = 0 in favor of HA : MβK 6= 0 if QMVRK ≥ χα(rs) . (6.6.28)

The test statistics QMVR and QMVRK are extensions to the multivariate linear modelof the quadratic form test statistic Fϕ,Q, ( 3.6.14). The score or aligned test and the dropin dispersion test are also available. Davis and McKean (1993) develop these in detail andprovide the rigorous development of the asymptotic theory. See also Puri and Sen (1985) fora development of rank methods in the multivariate linear model.

In traditional analysis, based on the least squares estimate of the matrix of regressioncoefficients, there are several tests of the hypothesis H0 : MβK = 0. The test statisticQMVRK , ( 6.6.26), is an analogue of the Lawley (1938) and Hotelling (1951) trace criterion.This traditional test statistic is given by

QLH = trace

(MβLSK)T [M(XTX)−1MT ]−1MβLSK(KT ΛK)−1, (6.6.29)

Page 418: Robust Nonparametric Statistical Methods

408 CHAPTER 6. MULTIVARIATE

where βLS = (X′X)−1X′Y is the least squares estimate of the matrix of regression coefficients

β and Λ is the usual estimate of Λ, the covariance matrix of the matrix of errors ǫ, given by

Λ = (Y −XβLS)′(Y −XβLS)/(n− p− 1) . (6.6.30)

Under the above assumptions and the assumption that Λ is positive definite and assumingH0 : MβK = 0 is true, QLH has an asymptotic χ2 distribution with rs degrees of freedom.This type of hypothesis arises in profile analysis; see Chinchilli and Sen (1982) for thisapplication.

In order to illustrate these tests, we complete this section with an example.

Example 6.6.4. Tablet Potency Data

The data are the results from a pharmaceutical experiment on the effects of four factorson five measurements of a tablet. There are n = 34 data cases. The five responses are:(POT2), potency of the tablet at the end of 2 weeks; (POT4), potency of the tablet at theend of 4 weeks; the third and fourth responses are measures of the tablets purity (RSDCU)and hardness (HARD); and the fifth response is its water content (H2O); hence, we have a5-dimensional response rather than the bivariate responses discussed so far. This means thatthe degrees of freedom are 5r rather than 2r in Theorem 6.6.3. The factors are: SAI, theamount of intragranular steric acid, which was set at the three levels −1, 0 and 1; SAE, theamount of extragranular steric acid, which was set at the three levels −1, 0 and 1; ADS, theamount of cross carmellose sodium, which was set at the three levels −1, 0 and 1; and TYPEof steric acid which was set at two levels −1 and 1. The initial potency of the compound,POT0, served as a covariate. The data are displayed in Table 6.6.2. It was used as anexample in the article by Davis and McKean (1993) and much of our discussion below istaken from this article.

This data set was treated as a univariate model for the response POT2 in Chapter 3;see Examples 3.3.3 and 3.9.2. As our full model we choose the same model described inexpression ( 3.3.1) of Example 3.3.3. It includes: the linear effects of the four factors; sixsimple two-way interactions between the factors; the three quadractic terms of the factorsSAI, SAE, and ADS; and the covariate for a total of fifteen terms. The need for the quadraticterms was discussed in the diagnostic analysis of this model for the response POT2; seeExample 3.9.2. Hence, Y is 34 × 5, X is 34 × 14, and β is 14 × 5,

Table 6.6.3 displays the results for the test statistic QMVR, ( 6.6.24) for the usual ANOVAhypotheses of interest: main effects, interaction effects broken down as simple two-way andquadratic, and covariate. Also listed are the hypothesis matrices M for each effect wherethe notation Ot×u represents a t × u matrix of 0s and It is the t × t identity matrix. Alsogiven for comparison purposes are the results of the traditional Lawley-Hotelling test, basedon the statistic ( 6.6.29) with K = I5. For example, M = [I4 O4×10] yields a test of thehypothesis:

H0 : β11 = · · · = β41 = 0, β12 = · · · = β42 = 0, . . . , β15 = · · · = β45 = 0 ;

Page 419: Robust Nonparametric Statistical Methods

6.6. LINEAR MODEL 409

Table 6.6.2: Responses and Levels of the Factors for the Potency DataResponses Factors Covariate

POT2 POT4 RSDCU HARD H2O SAE SAI ADS TYPE POT07.94 3.15 1.20 8.50 0.188 1 1 1 1 9.388.13 3.00 0.90 6.80 0.250 1 1 1 -1 9.678.11 2.70 2.00 9.50 0.107 1 1 -1 1 9.917.96 4.05 2.30 6.00 0.125 1 1 -1 -1 9.777.83 1.90 0.50 9.80 0.142 -1 1 1 1 9.507.91 2.30 0.90 6.60 0.229 -1 1 1 -1 9.357.82 1.40 1.10 8.43 0.112 -1 1 -1 1 9.587.42 2.60 2.60 8.50 0.093 -1 1 -1 -1 9.698.06 2.00 1.90 6.17 0.207 1 -1 1 1 9.628.51 2.80 1.70 7.20 0.184 1 -1 1 -1 9.897.88 3.35 4.70 9.30 0.107 1 -1 -1 1 9.807.58 3.05 4.00 8.10 0.102 1 -1 -1 -1 9.738.14 1.20 0.80 7.17 0.202 -1 -1 1 1 9.518.06 2.95 2.50 7.80 0.027 -1 -1 1 -1 9.827.31 1.85 2.10 8.70 0.116 -1 -1 -1 1 9.208.66 4.10 3.60 6.40 0.114 -1 -1 -1 -1 9.538.16 3.95 2.00 8.00 0.183 0 0 0 1 9.678.02 2.85 1.10 6.61 0.139 0 0 0 -1 9.418.03 3.20 3.60 9.80 0.171 0 1 0 1 9.627.93 3.20 6.10 7.33 0.152 0 1 0 -1 9.497.84 3.95 2.00 7.70 0.165 0 -1 0 1 9.967.59 1.15 2.10 7.03 0.149 0 -1 0 -1 9.798.28 3.95 0.70 8.40 0.195 1 0 0 1 9.467.75 3.35 2.20 6.37 0.168 1 0 0 -1 9.787.95 3.85 7.20 9.30 0.158 -1 0 0 1 9.488.69 2.80 1.30 6.57 0.169 -1 0 0 -1 9.468.38 3.50 1.70 8.00 0.249 0 0 1 1 9.738.15 2.00 2.30 6.80 0.189 0 0 1 -1 9.678.12 3.85 2.50 7.90 0.116 0 0 -1 1 9.847.72 3.50 2.20 5.60 0.110 0 0 -1 -1 9.847.96 3.55 1.80 7.85 0.135 0 0 0 1 9.508.20 2.75 0.60 7.20 0.161 0 0 0 -1 9.788.10 3.30 0.97 8.73 0.152 0 0 0 1 9.718.16 3.90 2.40 7.50 0.155 0 0 0 -1 9.57

Page 420: Robust Nonparametric Statistical Methods

410 CHAPTER 6. MULTIVARIATE

Table 6.6.3: Tests of the Effects for the Potency DataEffects M-matrix df QMVR p-value QLH p-valueMain [I4 O4×10] 20 179.6 .00 91.8 .00Higher Order [O9×4 I9 09×1] 45 102.1 .00 70.7 .01

Interaction [O6×4 I6 O6×4] 30 70.2 .00 52.2 .01Quadratic [O3×10 I3 03×1] 15 34.5 .00 18.7 .23

Covariate [O1×13 1] 5 3.88 .57 4.34 .50

that is, the linear terms vanish in all 5 components. Note that M is 4×14 so r = 4 and hencewe have 4×5 = 20 degrees of freedom in Theorem 6.6.3. The other hypothesis matrices aredeveloped similarly. The robust analysis indicates that all effects are significant except thecovariate effect. In particular the quadratic effect is significant for the robust analysis butnot for the Lawley-Hotelling test. This confirms the discussion on LS and robust residualplots for this data set given in Example 3.9.2.

Are the effects of the factors different on potencies of the tablet after 2 weeks, POT2, or 4weeks, POT4? This question can be answered by evaluating the statistic QMVRK , ( 6.6.27),for hypotheses of the form MβK, for the matrices M given in Table 6.6.3 and the 5 × 1matrix K where K′ = [1 −1 0 0 0]. For example, β11, . . . , β41 are the linear effects of SAE,SAI, ADS, and TYPE on PO2 and β12, . . . , β42 are the linear effects on PO4. We may wantto test the hypothesis

H0 : β11 = β12, . . . , β41 = β42 .

The M matrix picks the appropriate βs within a component and the K matrix compares theresults across components. From Table 6.6.3, choose M = [I4 O4×10]. Then

Mβ =

β11 β12 · · · β15...

......

β41 β42 · · · β45

Next choose KT = [1 −1 0 0 0] so that

MβK =

β11 − β12

β21 − β22...

β41 − β42

.

Then the null hypothesis is H0 : MβK = 0. In this example r = 4 and s = 1 so thetest has rs = 4 degrees of freedom. The test is illustrated in column 3 of Table 6.6.4.Other comparisons are also given. Once again, for comparison purposes the results for theLawley-Hotelling test based on the test statistic, ( 6.6.29), are given also. The robust andtraditional analyses seem to agree on the contrasts. Although there is some overall differencethe factors behave somewhat the same on the responses.

Page 421: Robust Nonparametric Statistical Methods

6.6. LINEAR MODEL 411

Table 6.6.4: Contrast analyses between responses POT2 and POT4All Terms Covariate main effect Higher order Interaction Quadratic

except mean terms terms terms terms termsdf 14 1 4 9 6 3QMVRK 21.93 3.00 5.07 12.20 8.22 4.77p-value .08 .08 .28 .20 .22 .19QLH 22.28 2.67 6.36 11.73 6.99 5.48p-value .07 .10 .17 .23 .32 .14

Suppose we have the linear model 6.6.2 along with a matrix of scores that sum to zero.The criterion function and the matrix of partial derivatives are given by ( 6.6.5) and ( 6.6.6).Then the test statistic for a general regression effect is given by ( 6.6.17) or ( 6.6.18). Specialcases yield the two sample and k-sample tests discussed in Examples 6.6.1 and 6.6.3.The componentwise rank case uses chisquare critical values. The computation of the testsrequire the score matrix A along with the design matrix X. For example, we could use theL1 norm componentwise and produce multivariate sign tests that extend Mood’s test to themultivariate model.

This approach can be extended to the spatial rank and affine rank cases; recall thediscussion in Example 6.3.2. In the spatial case the criterion function is D(α,β) =

∑ ‖yTi −αT − x′

iβ‖, ( 6.1.4). Let u(x) = ‖x‖−1x and rTi = αT − x′iβ, then D(α,β) =

∑uT (ri)ri

and hence,

A =

uT (r1)...

uT (rn)

Further, let Rc(ri) =∑

j u(ri−rj) be the centered spatial rank vector. Then the criterion

function is D(α,β) =∑

RTc (ri)ri and

A∗ =

RT (r1)...

RT (rn)

The tests then can be carried out using the chisquare critical values. See Brown and Hett-mansperger (1987b) and Mottonen and Oja (1995) for details. For the details in the affineinvariant sign or rank vector cases see Brown and Hettmansperger (1987b), Hettmansperger,Nyblom, and Oja (1994), and Hettmansperger, Mottonen and Oja (1997a,b).

Rao (1988) and Bai, Chen, Miao, and Rao (1990) consider a different formulation of alinear model. Suppose, for i = 1, . . . , n,Yi = Xiβ+ǫi where Yi is a 2×1 vector, Xi is a q×2matrix of known values, β is a 2×1 vector of unknown parameters. Further, ǫ1, . . . , ǫn is aniid set of random vectors from a distribution with median vector 0. The criterion function is∑ ‖Yi − Xiβ‖, the spatial criterion function. Estimates, tests, and the asymptotic theoryare developed in the above references.

Page 422: Robust Nonparametric Statistical Methods

412 CHAPTER 6. MULTIVARIATE

6.7 Experimental Designs

Recall that in Chapter 4 we developed rank-based procedures for experimental designs basedon the general R-estimation and testing theory of Chapter 3. Analogously in the multivariatecase, rank-based procedures for experimental designs can be based on the R-estimation andtesting theory of the last section. In this short section we show how this development canproceed. In particular, we use the cell median model, (the basic model of Chapter 4), andshow how the test ( 6.6.28) can be used to test general linear hypotheses involving contrastsin these cell medians. This allows the testing of MANOVA type hypotheses as well as, forinstance, profile analyses for multivariate data.

Suppose we have k groups and within the jth group, j = 1, . . . , k, we have a sample of sizenj . For each subject a d-dimensional vector of variables has been recorded. Let yijl denotethe response for the ith subject in Group j for the lth variable and let yij = (yij1, . . . , yij2)

T

denote the vector of responses for this subject. Consider the model,

yij = µj + eij , j = 1, . . . , k , i = 1, . . . , nk ,

where the eij are independent and identically distributed. Let n =∑nj denote the total

sample size. Let Yn×d denote the matrix of responses, (the yijs are stacked sequentially bygroup and let ǫ be the corresponding n×d matrix of eij . Let Γ = (µ1, . . . ,µk)

T be the k×dmatrix of parameters. We can then write the model as

Y = WΓ + ǫ , (6.7.1)

where W is the incidence matrix in expression ( 4.2.5). This is our full model and it is themultivariate analog of the basic model of Chapter 4, ( 4.2.1). If µj is the vector of mediansthen this is the multivariate medians model. On the other hand, if µj is the vector ofmeans then this is the multivariate means model.

We are interested in the following general hypotheses:

H0 : MΓK = O versus HA : MΓK 6= O , (6.7.2)

where M is an r× k contrast matrix (the rows of M sum to zero) of rank r and K is a d× smatrix of rank s.

In order to use the theory of Section 6.6 we need to transform Model ( 6.7.1) into amodel of the form ( 6.6.2). Consider the k × k elementary column matrix E which replacesthe first column of a matrix by the sum of all columns of the matrix; i.e,

[c1 c2 · · · ck]E =

[k∑

i=1

ci c2 · · · ck], (6.7.3)

for any matrix [c1 c2 · · · ck]. Note that E is nonsingular. Hence we can write Model ( 6.7.1)as

Y = WΓ + ǫ = WEE−1Γ + ǫ = [1 W1]

[αT

β

]+ ǫ , (6.7.4)

Page 423: Robust Nonparametric Statistical Methods

6.7. EXPERIMENTAL DESIGNS 413

where W1 is the last k− 1 columns of W and E−1Γ = [α βT ]T . This is a model of the form( 6.6.2). Since M is a contrast matrix, its rows sum to zero. Hence the hypothesis simplifiesto:

MΓK = MEE−1ΓK = [0 M1]

[αT

β

]K = M1βK . (6.7.5)

Therefore the hypotheses ( 6.7.2) can be tested by the procedure ( 6.6.28) based on the fitof Model ( 6.7.4).

Most of the interesting hypotheses in MANOVA can be written in the form ( 6.7.2)for some specified contrast matrix M. Therefore based on the theory developed in Section6.6, a robust rank-based methodology can be developed for MANOVA type models. Thismethodology is demonstrated in Example 6.7.1, which follows, and Exercise 6.8.23.

For the multivariate setting, Davis and McKean (1993) developed an analog of Theorem

3.5.7 which gives the joint asymptotic distribution of [α βT]T . They further developed a test

of the hypothesis H0 : MΓK = O, where M is any full row rank matrix, not necessarilya contrast matrix. Hence, this provides a robust rank-based analysis for any multivariatelinear model.

Example 6.7.1. Paspalum Grass

This data set, discussed on page 460 of Seber (1984), concerns the effect on growth ofpaspalum grass due to a fungal infection. The experiment was a 4 × 2 twoway design. Halfof the forty-eight pots of paspalum grass in the experiment were inoculated with a fungalinfection and half were left as controls. The second factor was the temperature (14, 18, 22,26o C) at which the inoculation was applied. The design was balanced so that 6 plants wereused for each combination of treatment and temperature. After a specified amount of time,the following three measurements were made on each plant:

y1 = the fresh weight of the roots of the plant (gm)

y2 = the maximum root length of the plant (mm)

y3 = the fresh weight of the tops of the plant (gm) .

The data are are displayed in Table 6.7.1.

As a full model we fit Model 6.7.1. Based on the residual analysis found in Exercise6.8.24, though, the fit clearly shows heteroscedasticity and suggests the log transformation.The subsequent analysis is based on the transformed data. Table 6.7.2 displays the estimatesof Model 6.7.1 based on the Wilcoxon score function and LS. Note the fits are very similar.The estimates of the vector τ and the matrix ATA are also displayed.

The hypotheses of interest concern the average main effects and interaction. For Model

Page 424: Robust Nonparametric Statistical Methods

414 CHAPTER 6. MULTIVARIATE

Table 6.7.1: Responses for the Paspalum Grass Data, Example 6.7.1Treatments

Temperature Control Inoculated2.2 23.5 1.7 2.3 23.5 2.03.0 27.0 2.3 3.0 21.0 2.7

14oC 3.3 24.5 3.2 2.3 22.0 1.82.2 20.5 1.5 2.5 22.5 2.42.0 19.0 2.0 2.4 21.5 1.13.5 23.5 2.9 2.7 25.0 2.6

21.8 41.5 23.0 10.1 43.5 14.211.0 32.5 15.4 7.6 27.0 14.7

18oC 16.4 46.5 22.8 19.7 32.5 21.413.1 31.0 21.5 4.3 28.5 9.715.4 41.5 20.8 5.2 33.5 12.214.5 46.0 20.3 3.9 24.5 8.213.6 29.5 30.8 10.0 21.0 23.66.2 23.5 14.6 12.3 49.0 28.1

22oC 16.7 58.5 36.0 4.9 28.5 13.312.2 40.5 23.9 9.6 27.0 24.68.7 37.0 20.3 6.5 29.0 19.3

12.3 41.5 27.7 13.6 30.5 31.53.0 24.0 10.2 4.2 25.5 13.35.3 26.5 15.6 2.2 23.5 8.5

26oC 3.1 24.5 14.7 2.8 19.5 11.84.8 34.0 20.5 1.3 21.5 7.83.4 22.5 14.3 4.2 28.5 15.17.4 32.0 23.2 3.0 25.0 11.8

Page 425: Robust Nonparametric Statistical Methods

6.7. EXPERIMENTAL DESIGNS 415

Table 6.7.2: Estimates based on the Wilcoxon and LS fits for the Paspalum Grass Data,Example 6.7.1. V is the variance-covariance matrix of vector of random errors ǫ.

Wilcoxon Fit LS FitComponents Components

Parameter (1) (2) (3) (1) (2) (3)µ11 1.04 3.14 .82 .97 3.12 .78µ21 2.74 3.70 3.05 2.71 3.67 3.02µ31 2.47 3.63 3.25 2.40 3.61 3.20µ41 1.49 3.29 2.79 1.45 3.29 2.76µ12 .94 3.12 .77 .92 3.12 .70µ22 1.95 3.43 2.58 1.96 3.43 2.55µ32 2.26 3.36 3.19 2.19 3.39 3.11µ42 1.09 3.18 2.45 1.01 3.17 2.41

τ or σ .376 .188 .333 .370 .197 .292

1.04 .62 .92 .14 .04 .09ATA or V .62 1.04 .57 .04 .04 .03

.92 .57 1.04 .09 .03 .09

6.7.1, matrices for treatment effects, temperature effects and interaction are given by

MTreat. = [ 1 1 1 1 −1 −1 −1 −1 ]

MTemp. =

1 −1 0 0 1 −1 0 01 0 −1 0 1 0 −1 01 0 0 −1 1 0 0 −1

MTreat.×Temp. =

1 −1 0 0 −1 1 0 00 1 −1 0 0 −1 1 00 0 1 −1 0 0 −1 1

.

Take the matrix K to be I3. Then the hypotheses of intertest can be expressed as MΓK = Ofor the above M-matrices. Using the summary statistics in Table 6.7.2 and the elemenatrycolumn matrix E, as defined above expression ( 6.7.3), we obtained the test statistics QMVRK ,( 6.6.27) based on the Wilcoxon fit. For comparison we also obtain the LS test statisticsQLH , ( 6.6.29). The values of these statistics for the hypotheses of interest are summarizedin Table 6.7.3. The test for interaction is not significant while both main effects, Treatmentand Temperature, are significant. The results are quite similar for the traditional test also.We also tabulated the marginal test statistics, Fϕ. The results for each component are thesimilar to the multivariate result.

Page 426: Robust Nonparametric Statistical Methods

416 CHAPTER 6. MULTIVARIATE

Table 6.7.3: Test statistics QMVRK and QLH based on the Wilcoxon and LS fits, respec-tively, for the Paspalum Grass Data, Example 6.7.1. Marginal F -tests are also given. Thenumerator degrees of freedom are given. Note that the denominator degrees of freedom forthe marginal F -tests is 40.

Wilcoxon LSMVAR Marginal Fϕ MVAR Marginal FLS

Effect df QMVRK df (1) (2) (3) df QLH df (1) (2) (3)Treat. 3 14.9 1 9.19 7.07 11.6 3 12.2 1 11.4 6.72 8.66Temp. 9 819 3 32.5 13.4 61.4 9 980 3 45.2 13.4 162Treat. ×Temp. 9 11.2 3 2.27 1.49 1.35 9 7.98 3 2.01 .79 1.36

6.8 Exercises

6.8.1. Show that the vector of sample means of the components is affine equivariant. SeeDefinition 6.1.1.

6.8.2. Compute the gradients of the three criterion functions ( 6.1.3)-( 6.1.5).

6.8.3. Show that in the univariate case S2(θ) = S3(θ), ( 6.1.7) and ( 6.1.8).

6.8.4. Establish ( 6.2.7).

6.8.5. Construct an example in the bivariate case for which the mean vector rotates intothe new mean vector but the vector of componentwise medians does not rotate into the newvector of medians.

6.8.6. Students were given a math aptitude and reading comprehension test before startingan intensive study skills workshop. At the end of the program they were given the test again.The following data represents the change in the math and reading tests for the 5 studentsin the program.

Math Reading11 720 40

-10 -410 1216 5

We would like to test the hypothesis H0 : θ = 0 vs HA : θ 6= 0. Following the discussionat the beginning of Section 6.2.2, find the sign change distribution of the componentwisesign test and find the conditional p-value.

6.8.7. Prove Theorem 6.2.1.

Page 427: Robust Nonparametric Statistical Methods

6.8. EXERCISES 417

6.8.8. Using the projection method discussed in Chapter 2, derive the projection of thestatistic given in ( 6.2.14).

6.8.9. Apply Lemma 6.2.1 and show that ( 6.2.19) provides the bounds on the testingefficiency of the Wilcoxon test relative to Hotelling’s test in the case of a bivariate normaldistribution.

6.8.10. Prove Theorem 6.3.1.

6.8.11. Show that ( 6.3.13) can be generalized to k dimensions.

6.8.12. Consider the spatial L1 methods.

(a). Show that the efficiency of the spatial L1 methods relative to the L2 methods with ak-variate spherical model is given by

ek(spatialL1, L2) =

(k − 1

k

)2

E(r2)[E(r−1)]2

(b). Next assume that the k–variate spherical model is normal. Show that Er−1 = Γ[(k−1)/2)]√2Γ(k/2)

with Γ(1/2) =√π.

6.8.13. . Show that the spatial median is equivariant and that the spatial sign test isinvariant under orthogonal transformations of the data;.

6.8.14. Verify ( 6.3.15).

6.8.15. Complete the proof of Theorem 6.4.2 by establishing the third formula for S8(0).

6.8.16. Show that the Oja median and Oja sign test are affine equivariant and affine invari-ant, respectively. See Section 6.4.3.

6.8.17. Show that the maximum breakdown point for a translation equivariant estimator is(n+1)/(2n). An estimator is translation equivariant if T (X + a1) = T (X) + a1, for everyreal a. Note that 1 is the vector of all ones.

6.8.18. Verify ( 6.6.6).

6.8.19. Show that ( 6.6.18) can be derived from ( 6.6.17).

6.8.20. Fill in the details of the proof of Theorem 6.6.1.

6.8.21. Show that AR = 15.69 in Example 6.6.2, using Table 6.6.1.

6.8.22. Verify formula ( 6.6.20).

Page 428: Robust Nonparametric Statistical Methods

418 CHAPTER 6. MULTIVARIATE

6.8.23. Consider Model ( 6.7.1) for a repeated measures design in which the responses arerecorded on the same variable over time; i.e., yijl is response for the ith subject in Groupj at time period l. In this model the vector µj is the profile vector for the jth group andthe plot of µij versus i is called the profile plot for Group j. Let µj denote the estimate ofµj based on the R-fit of Model ( 6.7.1). The plot of µij versus j is called the sample profileplot of Group j. These group plots are overlaid and are called the sample profiles. Ahypothesis of interest is whether or not the population profiles are parallel.

(a.) Let At−1 be the (t− 1) × t matrix given by

At−1 =

1 −1 0 · · · 00 1 −1 · · · 0...

......

...0 0 0 · · · −1

.

Show that parallel profiles are equivalent to the null hypothesis H0 defined by:

H0 : Ak−1ΓAd−1 = O versus HA : Ak−1ΓAd−1 6= O , (6.8.1)

where Γ is defined in Model 6.7.1. Hence show that a test of parallel profiles can bebased on the test ( 6.6.28).

(b.) The data in Table 6.8.1 are the times (in seconds) it took three different species (A,B, and C) of rats to run a maze at four different times (I, II, III, and IV). Each rowcontains the scores of a single rat. Compare the sample profile plots based on Wilcoxonand LS estimates.

(c.) Test the hypotheses ( 6.8.1) using the procedure ( 6.6.28) based on Wilcoxon scores.Repeat using the LS test procedure ( 6.6.29).

(d.) Repeat items (b) and (c) if the 13th rat at time period 2 took 80 seconds to run themaze instead of 34. Note that p-value of the LS procedure changes from .77 to .15while the p-value of the Wilcoxon procedure changes from .95 to .85.

6.8.24. Consider the data of Example 6.7.1.

(a.) Using the Wilcoxon scores, fit Model ( 6.7.4) to the original data displayed in Table6.7.1. Obtain the marginal residual plots which show heteroscedasticity. Reason thatthe log transformation is appropriate. Show that the residual plots based on thetransformed remove much of the heteroscedasticity. For both the transformed andoriginal data obtain the internal Wilcoxon studentized residuals. Identify the outliers.

(b.) In order to see the effect of the transformation, obtain the Wilcoxon and LS analysesof Example 6.7.1 based on the original data. Discuss your findings.

Page 429: Robust Nonparametric Statistical Methods

6.8. EXERCISES 419

Table 6.8.1: Data for Exercise 6.8.23Group A Group B Group C

Times Times TimesRat I II III IV Rat I II III IV Rat I II III IV

1 47 53 51 28 6 44 57 46 27 11 45 33 30 182 35 66 38 39 7 47 29 21 30 12 30 50 21 253 43 40 34 40 8 28 76 29 39 13 33 32 32 244 49 60 44 32 9 57 63 60 15 14 44 62 38 225 41 61 38 32 10 34 62 41 27 15 40 42 33 24

Page 430: Robust Nonparametric Statistical Methods

420 CHAPTER 6. MULTIVARIATE

Page 431: Robust Nonparametric Statistical Methods

Appendix A

Asymptotic Results

A.1 Central Limit Theorems

The following version of the Lindeberg-Feller Central Limit Theorem will be useful. A proofof it can be found in Arnold (1981).

Theorem A.1.1. Consider the sequence of independent random variables W1n, . . . ,Wnn, forn = 1, 2 . . . . Suppose E(Win) = 0, Var(Win) = σ2

in <∞, and

max1≤i≤n

σ2in → 0 , as n→ ∞ , (A.1.1)

n∑

i=1

σ2in → σ2 , 0 < σ2 <∞ , as n→ ∞ , (A.1.2)

and

limn→∞

n∑

i=1

E(W 2inIǫ(|Win|) = 0 , (A.1.3)

for all ǫ > 0, where Ia(|x|) is 0 or 1 when |x| > a or |x| ≤ a, respectively. Then

n∑

i=1

WinD→ N(0, σ2) .

A useful corollary to this theorem is given next; see, also, page 153 of Hajek and Sidak(1967).

Corollary A.1.1. Suppose that the sequence of random variables X1, . . . , Xn are iid withE(Xi) = 0 and Var(Xi) = σ2 <∞. Suppose the sequence of constants a1n, . . . , ann are suchthat

n∑

i=1

a2in → σ2

a , as n→ ∞ , 0 < σ2a <∞ , (A.1.4)

max1≤i≤n

|ain| → 0 , as n→ ∞ . (A.1.5)

421

Page 432: Robust Nonparametric Statistical Methods

422 APPENDIX A. ASYMPTOTIC RESULTS

Thenn∑

i=1

ainXiD→ N(0, σ2σ2

a) .

Proof: Take Win of Theorem A.1.1 to be ainXi. Then the mean of Win is 0 and itsvariance is σ2

in = a2inσ

2. By ( A.1.5), maxσ2in → 0 and by ( A.1.4),

∑σ2in → σ2σ2

a. Hence weneed only show that condition ( A.1.3) is true. For i = 1, . . . , n, define

W ∗in = max

1≤j≤n|ajn||Xi| .

Then |W ∗in| ≥ |Win|; hence, Iǫ(|Win|) ≤ Iǫ(|W ∗

in|), for ǫ > 0. Therefore,

n∑

i=1

E[W 2inIǫ(|Win|)

]≤

n∑

i=1

E[W 2inIǫ(|W ∗

in|)]

=

n∑

i=1

a2in

E[X2

1Iǫ(|W ∗1n|)]. (A.1.6)

Note that the sum in braces converges to σ2σ2a. Because X2

1Iǫ(|W ∗1n|) converges to 0 pointwise

and it is bounded above by the integrable function X21 , it then follows that by Lebesgue’s

Dominated Convergence Theorem that the right side of ( A.1.6) converges to 0. Thuscondition ( A.1.3) of Theorem A.1.1 is true and we are finished.

Note that the simple Central Limit Theorem follows from this corollary by taking ain =n−1/2, so that ( A.1.4) and ( A.1.5) hold.

A.2 Simple Linear Rank Statistics

In the next two subsections, we present the asymptotic distribution theory for a simplelinear rank statistic under the null and local alternative hypotheses. This theory is used inChapters 1 and 2 for location models and, also, in Section A.3, it will be useful in establishingasymptotic linearity and quadraticity results for Chapters 3 and 5. The theory for a simplelinear rank statistic is presented in detail in Chapters 5 and 6 of the book by Hajek andSidak (1967); hence, here we will only present a heuristic development with appropriatereferences to Hajek and Sidak. Also, Chapter 8 of Randles and Wolfe (1979) presents adetailed development of the null asymptotic theory of a simple linear rank statistic.

In this section we assume that the sequence of random variables Y1, . . . , Yn are iid withcommon density function f(y) which follows Assumption E.1, ( 3.4.1). Let x1, . . . , xn denotea sequence of centered, (x = 0), regression coefficients and assume that they follow as-sumptions D.2, ( 3.4.7), and D.3, ( 3.4.8). For this one-dimensional case, these assumptionssimplify to:

maxx2i∑n

i=1 x2i

→ 0 (A.2.1)

1

n

n∑

i=1

x2i → σ2

x , σ2x > 0 , (A.2.2)

Page 433: Robust Nonparametric Statistical Methods

A.2. SIMPLE LINEAR RANK STATISTICS 423

for some constant σ2x. It follows from these assumptions that maxi |xi|/

√n→ 0, a fact that

we will find useful. Assume that the score function ϕ(u) is defined on the interval (0, 1) andthat it satisfies (S.1), ( 3.4.10); in particular,

∫ 1

0ϕ(u) du = 0 and

∫ 1

0ϕ2(u) du = 1 . (A.2.3)

Consider then the linear rank statistics,

S =

n∑

i=1

xia(R(Yi)) , (A.2.4)

where the scores are generated as a(i) = ϕ(i/(n+ 1)).

A.2.1 Null Asymptotic Distribution Theory

It follows immediately that the mean and variance of S are given by

E(S) = 0 and Var(S) =∑n

i=1 x2i

1

n−1

∑ni=1 a

2(i) .

=∑n

i=1 x2i , (A.2.5)

where the approximation is due to the fact that the quantity in braces is a Riemann sum of∫ 1

0ϕ2(u) du = 1.Note that we can write S as

S =n∑

i=1

xiϕ

(n

n + 1Fn(Yi)

), (A.2.6)

where Fn is the empirical distribution function of Y1, . . . , Yn. This suggests the approxima-tion,

T =n∑

i=1

xiϕ(F (Yi)) . (A.2.7)

We have immediately from ( A.2.3) that the mean and variance of T are

E(T ) = 0 and Var(T ) =∑n

i=1 x2i . (A.2.8)

Furthermore, by assumptions ( A.2.1) and ( A.2.2), we can apply Corollary A.1.1 to showthat

1√nT is asymptotically distributed as N(0, σ2

x) . (A.2.9)

Because the means of S and T are the same, it will follow that S has the same asymptoticdistribution as T provided the second moment of their difference goes to 0. But this followsfrom the string of inequalities:

E

[(1√nS − 1√

nT

)2]

=1

nE

(

n∑

i=1

xi

(n

n + 1Fn(Yi)

)− ϕ(F (Yi))

))2

≤ n

n− 1

1

n

n∑

i=1

x2i

E

[(ϕ

(n

n+ 1Fn(Y1)

)− ϕ(F (Y1))

)2]

→ σ2x · 0 ,

Page 434: Robust Nonparametric Statistical Methods

424 APPENDIX A. ASYMPTOTIC RESULTS

where the inequality and the derivation of the limit is given on page 160 of Hajek and Sidak(1967). This results in the following theorem,

Theorem A.2.1. Under the above assumptions,

1√n

(T − S)P→ 0 , (A.2.10)

and1√nS

D→ N(0, σ2x) . (A.2.11)

Hence we have established the null asymptotic distribution theory of a simple linear rankstatistic.

A.2.2 Local Asymptotic Distribution Theory

We first need the definition of contiguity between two sequences of densities.

Definition A.2.1. A sequence of densities qn is contiguous to another sequence of den-sities pn, if for any sequence of events An,

An

pn → 0 ⇒∫

An

qn → 0 .

This concept is discussed in some detail in Hajek and Sidak (1967).

The following fact follows immediately from this definition. Suppose the sequence ofdensities qn is contiguous to the sequence of densities pn. Let Xn be a sequence of

random variables. If XnP→ 0 under pn then Xn

P→ 0 under qn.

Then according to LeCam’s First Lemma, if log(qn/pn) is asymptoticallyN(−σ2/2, σ2)under pn, then qn is contiguous to pn. Further by LeCam’s Third Lemma, if (Sn, log(qn/pn))is asymptotically bivariate normal (µ1, µ2, σ

21, σ

22, ρσ1σ2) with µ2 = −σ2

2/2 under pn, then Snis asymptotically N(µ1 + ρσ1σ2, σ

21) under qn; see pages 202-209 in Hajek and Sidak (1967).

In this section, we assume that the random variables Y1, . . . , Yn and the regression coeffi-cients x1, . . . , xn follow the same assumptions that we made in the last section; see expressions( A.2.1) and ( A.2.2). We denote the likelihood function of Y1, . . . , Yn by

py = Πni=1f(yi) . (A.2.12)

In the last section we derived the asymptotic distribution of S under py. In this section weare further concerned with the likelihood function

qd = Πni=1f(yi + di) , (A.2.13)

Page 435: Robust Nonparametric Statistical Methods

A.2. SIMPLE LINEAR RANK STATISTICS 425

for a sequence of constants d1, . . . , dn which satisfies the conditions

n∑

i=1

di = 0 (A.2.14)

n∑

i=1

d2i → σ2

d > 0 , as n→ ∞ (A.2.15)

max1≤i≤n

d2i → 0 , as n→ ∞ (A.2.16)

1√n

n∑

i=1

xidi → σxd , as n→ ∞ . (A.2.17)

In applications (eg. power in simple linear models) we take di = −xiβ/√n. For xis following

assumptions ( A.2.1) and ( A.2.2), the above assumptions would hold for these dis.In this section, we establish the asymptotic distribution of S under qd. Consider the log

of the ratio of the likehood functions qd and py given by

l(η) =

n∑

i=1

logf(Yi + ηdi)

f(Yi). (A.2.18)

Expanding l(η) about 0 and evaluating the resulting expression at η = 1 results in

l =n∑

i=1

dif ′(Yi)

f(Yi)+

1

2

n∑

i=1

d2i

f(Yi)f′′(Yi) − (f ′(Yi))

2

f 2(Yi)+ op(1) , (A.2.19)

provided that the third derivative of the log-ratio, evaluated at 0, is square integrable.Under py, the middle term converges in probability to −I(f)σ2

d/2, provided that the secondderivative of the log-ratio, evaluated at 0, is square integrable.

Hence, under py and some further regularity conditions we can write,

l =n∑

i=1

dif ′(Yi)

f(Yi)− I(f)σ2

d

2+ op(1) . (A.2.20)

The random variables in the first term, f ′(Yi)/f(Yi) are iid with mean 0 and variance I(f).Because the sequence d1, . . . , dn satisfies ( A.2.14) - ( A.2.16), we can use Corollary A.1.1 toshow that, under py, l converges in distribution to N(−I(f)σ2

d/2, I(f)σ2d). By the definition

of contiguity ( A.2.1) and the immediate following discussion of LeCam’s first lemma, wehave the result

the densities qd = Πni=1f(yi + di) are contiguous to py = Πn

i=1f(yi) ; (A.2.21)

see, also, page 204 of Hajek and Sidak (1967).We next establish the key result:

Page 436: Robust Nonparametric Statistical Methods

426 APPENDIX A. ASYMPTOTIC RESULTS

Theorem A.2.2. For T given by ( A.2.7) and under py and the assumptions ( 3.4.1),( A.2.1), ( A.2.2), ( A.2.14) - ( A.2.17),

( 1√nT

l

)D→ N2

((0

− I(f)σ2d

2

),

[σ2x σxdγf

σxdγf I(f)σ2d

]). (A.2.22)

Proof: Consider the random vector V = (T/√n, l)′, where T is defined in expression ( A.2.7).

To show that V is asymptotically normal under pn it suffices to show that for t ∈ R2, t 6= 0,t′V is asymptotically univariate normal. By the above discussion, for the second componentof V, we need only be concerned with the first term in expression ( A.2.19); hence, fort = (t1, t2)

′, define the random variables Win by

n∑

i=1

[1√nxit1ϕ(F (Yi)) + t2di

f ′(Yi)

f(Yi)

]=

n∑

i=1

Win . (A.2.23)

We want to apply Theorem A.1.1. The random variables Win are independent and havemean 0. After some simplification, we can show that the variance of Win is

σ2in =

1

nx2i t

21 + t22d

2i I(f) − 2t1t2di

xi√nγf , (A.2.24)

where γf is given by

γf =

∫ 1

0

ϕ(u)

(−f

′(F−1(u))

f(F−1(u))

)du . (A.2.25)

Note by assumptions ( A.2.1), ( A.2.2), and ( A.2.15) - ( A.2.17) that

n∑

i=1

σ2in → t21σ

2x + t22σ

2dI(f) − 2t1t2γfσxd > 0 , (A.2.26)

and that

max1≤i≤n

σ2in ≤ max

1≤i≤n

1

nx2i t

21 + t22I(f) max

1≤i≤nd2i + 2|t1t2|γf max

1≤i≤n

1√n|xi| max

1≤i≤n|di| → 0 ; (A.2.27)

hence conditions ( A.1.2) and ( A.1.1) are true. Thus to obtain the result we need to show

limn→∞

n∑

i=1

E[W 2inIǫ(|Win|)

]= 0 , (A.2.28)

for ǫ > 0. But |Win| ≤W ∗in where

W ∗in = |t1| max

1≤j≤n

1√n|xj ||ϕ(F (Yi))| + |t2| max

1≤j≤n|dj|

∣∣∣∣f ′(Yi)

f(Yi)

∣∣∣∣ .

Page 437: Robust Nonparametric Statistical Methods

A.2. SIMPLE LINEAR RANK STATISTICS 427

Hence,n∑

i=1

E[W 2inIǫ(|Win|)

]≤

n∑

i=1

E[W 2inIǫ(W

∗in)]

=

n∑

i=1

E

[(1

nt21x

2iϕ

2(F (Y1))

+t22d2i

(f ′(Y1)

f(Y1)

)2

+ 2t1t21√nxidiϕ(F (Y1))

(−f

′(Y1)

f(Y1)

))Iǫ(W

∗1n)

]

= t21

n∑

i=1

1

nx2i

E[ϕ2(F (Y1))Iǫ(W

∗1n)]

+t22

n∑

i=1

d2i

E

[(f ′(Yi)

f(Yi)

)2

Iǫ(W∗1n)

]

+2t1t2

1√n

n∑

i=1

xidi

E

[ϕ(F (Y1))

(−f

′(Y1)

f(Y1)

)2

Iǫ(W∗1n)

].(A.2.29)

Because Iǫ(W∗1n) → 0 pointwise and each of the other random variables in the expectations of

( A.2.29) are absolutely integrable, the Lebesgue Dominated Convergence Theorem impliesthat each of these expectations converge to 0. The desired limit in expression ( A.2.28),then follows from assumptions ( A.2.1), ( A.2.2) and ( A.2.15) - ( A.2.17). Hence V isasymptotically bivariate normal. We can obtain its asymptotic variance-covariance matrixfrom expression ( A.2.26), which completes the proof.

Based on Theorem A.2.2, an application of LeCam’s third lemma leads to the asymptoticdistribution of T/

√n under local alternatives which we state in the following theorem.

Theorem A.2.3. Under the sequence of densities qd = Πni=1f(yi + di), and the assumptions

( 3.4.1), ( A.2.1), ( A.2.2), ( A.2.14) - ( A.2.17),

1√nT

D→ N(σxdγf , σ2x) , (A.2.30)

1√nS

D→ N(σxdγf , σ2x) . (A.2.31)

The result for S/√n follows because (T − S)/

√n→ 0 in probability under the densities

py; hence, due to the contiguity cited in expression ( A.2.21), (T − S)/√n → 0, also, under

the densities qd. A proof of the asymptotic power lemma, Theorem 2.4.13, follows from thisresult.

We now investigate the relationship between S and the shifted process given by

Sd =n∑

i=1

xia(R(Yi + di)) . (A.2.32)

Consider the analogous process,

Td =n∑

i=1

xiϕ(F (Yi + di)) . (A.2.33)

Page 438: Robust Nonparametric Statistical Methods

428 APPENDIX A. ASYMPTOTIC RESULTS

We next establish the connection between T and Td; see Theorem 1.3.1, also.

Theorem A.2.4. Under the likelihoods qd and py, we have the following identity:

Pqd

[1√nT ≤ t

]= Ppy

[1√nTd ≤ t

]. (A.2.34)

Proof: The proof follows from the following string of equalities.

Pqd

[1√nT ≤ t

]= Pqd

[1√n

n∑

i=1

xiϕ(F (Yi)) ≤ t

]

= Pqd

[1√n

n∑

i=1

xiϕ(F ((Yi − di) + di)) ≤ t

]

= Ppy

[1√n

n∑

i=1

xiϕ(F (Zi + di)) ≤ t

]

= Ppy

[1√nTd ≤ t

],

(A.2.35)

where in line three the sequence of random variables Z1, . . . , Zn follows the likelihood py.We next establish an asymptotic relationship between T and Td.

Theorem A.2.5. Under py and the assumptions ( 3.4.1), ( A.2.1), ( A.2.2), ( A.2.14) -( A.2.17), [

T − [Td −Epy(Td)]√n

]P→ 0 .

Proof: Since E(T ) = 0, it suffices to show that the V [(T − Td)/√n ] → 0. We have,

V

[T − Td√

n

]=

1

n

n∑

i=1

x2iV [ϕ(F (Yi)) − ϕ (F (Yi + di))]

≤ 1

n

n∑

i=1

x2iE [ϕ(F (Yi)) − ϕ(F (Yi + di)]

2

=1

n

n∑

i=1

x2i

∫ ∞

−∞[ϕ(F (y)) − ϕ(F (y + di)]

2 f(y) dy

≤(

1

n

n∑

i=1

x2i

)(∫ ∞

−∞max1≤i≤n

[ϕ(F (y))− ϕ(F (y + di)]2 f(y) dy

).

The first factor in the last expression converges to σ2x; hence, it suffices to show that the lim

of the second factor is 0. Fix y. Let ǫ > 0 be given. Then since ϕ(u) is continuous a.e. we can

Page 439: Robust Nonparametric Statistical Methods

A.2. SIMPLE LINEAR RANK STATISTICS 429

assume it is continuous at F (y). Hence there exists a δ1 > 0 such that |ϕ(z) − ϕ(F (y))| < ǫfor |z−F (y)| < δ1. By the uniform continuity of F , choose δ2 > 0 such that |F (t)−F (s)| < δ1for |s− t| < δ2. By ( A.2.16) choose N0 so that for n > N0 implies

max1≤i≤n

|di| < δ2 .

Thus for n > N0,|F (y) − F (y + di)| < δ1 , for i = 1, . . . , n ,

and, hence,|ϕ(F (y))− ϕ (F (y + di))| < ǫ , for i = 1, . . . , n .

Thus for n > N0,max1≤i≤n

[ϕ(F (y)) − ϕ(F (y + di)]2 < ǫ2 ,

Therefore,

lim

(∫ ∞

−∞max1≤i≤n

[ϕ(F (y)) − ϕ(F (y + di))]2 f(y) dy

)≤ ǫ2 ,

and we are finished.The next result yields the asymptotic mean of Td.

Theorem A.2.6. Under py and the assumptions ( 3.4.1), ( A.2.1), ( A.2.2), ( A.2.14) -( A.2.17),

Epy

[1√nTd

]→ γfσxd .

Proof: By Theorem A.2.3,

1√nT − γfσxd

σx

D→ N(0, 1) , under qd.

Hence by the transformation Theorem A.2.4,

1√nTd − γfσxd

σx

D→ N(0, 1) , under py. (A.2.36)

By ( A.2.9),1√nT

σx

D→ N(0, 1) , under py ;

hence by Theorem A.2.5, we must have

1√nTd − E

[1√nTd

]

σx

D→ N(0, 1) , under py. (A.2.37)

The conclusion follows from the results ( A.2.36) and ( A.2.37).

Page 440: Robust Nonparametric Statistical Methods

430 APPENDIX A. ASYMPTOTIC RESULTS

By the last two theorems we have under py

1√nTd =

1√nT + γfσxd + op(1) .

We need to express these results for the random variables S, ( A.2.4), and Sd, ( A.2.32).Because the densities qd are contiguous to py and (T − S)/

√n→ 0 in probability under py,

it follows that (T − S)/√n→ 0 in probability under qd. By a change of variable this means

(Td−Sd)/√n→ 0 in probability under py. This discussion leads to the following two results

which we state in a theorem.

Theorem A.2.7. Under py and the assumptions ( 3.4.1), ( A.2.1), ( A.2.2), ( A.2.14) -( A.2.17),

1√nSd =

1√nS + γfσxd + op(1) (A.2.38)

1√nSd =

1√nT + γfσxd + op(1) . (A.2.39)

Next we relate the result Theorem A.2.7 to ( 2.5.27), the asymptotic linearity of thegeneral scores statistic in the two sample problem. Recall in the two sample problem thatci = 0 for 1 ≤ i ≤ n1 and ci = 1 for n1 + 1 ≤ i ≤ n1 + n2 = n, ( 2.2.1). Hence, xi = ci − c =−n2/n for 1 ≤ i ≤ n1 and xi = n1/n for n1 + 1 ≤ i ≤ n. Defining di = −δxi/

√n, it is easy

to check that conditions ( A.2.14) - ( A.2.17) hold with σxd = −λ1λ2δ. Further ( A.2.32)becomes Sϕ(δ/

√n) =

∑xia(R(Yi − δxi/

√n)) and ( A.2.4) becomes Sϕ(0) =

∑xia(R(Yi)),

where a(i) = ϕ(i/(n + 1)),∫ϕ = 0 and

∫ϕ2 = 1. Hence ( A.2.38) becomes

1√nSϕ(δ/

√n) =

1√nSϕ(0) − λ1λ2γfδ + op(1) .

Finally using the usual partition argument, Theorem 1.5.6, and the monotonicity of Sϕ(δ/√n)

we have:

Theorem A.2.8. Assuming Finite Fisher information, nondecreasing and square integrableϕ(u), and ni/n→ λi, 0 < λi < 1, i = 1, 2,

Ppx

(sup√n|δ|≤c

∣∣∣∣1√nSϕ

(δ√n

)− 1√

nSϕ(0) + λ1λ2γfδ

∣∣∣∣ ≥ ǫ

)→ 0 , (A.2.40)

for all ǫ > 0 and for all c > 0.

This theorem establishes ( 2.5.27). As a final note from ( A.2.11), n−1/2Sϕ(0) is asymp-totically N(0, σ2

x), where σ2x = σ2(0) = limn−1

∑x2i = λ1λ2. Hence to determine the efficacy

using this approach, we have

cϕ =λ1λ2γfσ(0)

=√λ1λ2τ

−1ϕ , (A.2.41)

see ( 2.5.28).

Page 441: Robust Nonparametric Statistical Methods

A.2. SIMPLE LINEAR RANK STATISTICS 431

A.2.3 Signed-Rank Statistics

In this section we develop the asymptotic local behavior for the general signed rank statisticsdefined in Section 1.8. Assume that X1, . . .Xn are a random sample having distributionfunction H(x) with density h(x) which is symmetric about 0. Recall that general signedrank statistics are given by

Tϕ+ =∑

a+(R(|Xi|))sgn(Xi) , (A.2.42)

where the scores are generated as a+(i) = ϕ+(i/(n + 1)) for a nonnegative and squareintegrable function ϕ+(u) which is standardized such that

∫(ϕ+(u))2 du = 1.

The null asymptotic distribution of Tϕ+ was derived in Section 1.8 so here we will beconcerned with its behavior under local alternatives. Also the derivations here are similarto those for simple linear rank statistics, Section A.2.2; hence, we will be brief.

Note that we can write Tϕ+ as

Tϕ+ =∑

ϕ+

(n

n + 1H+n (|Xi|)

)sgn(Xi) ,

where H+n denotes the empirical distribution function of |X1|, . . . , |Xn|. This suggests the

approximation

T ∗ϕ+ =

∑ϕ+(H+(|Xi|))sgn(Xi) , (A.2.43)

where H+(x) is the distribution function of |Xi|.Denote the likelihood of the sample X1, . . .Xn by

px = Πni=1h(xi) . (A.2.44)

A result that we will need is,

1√n

(Tϕ+ − T ∗

ϕ+

) P→ 0 , under px . (A.2.45)

This result is shown on page 167 of Hajek and Sidak (1967).For the sequence of local alternatives, b/

√n with b ∈ R, (here we are taking di = −b/√n),

we denote the likelihood by

qb = Πni=1h

(xi −

b√n

). (A.2.46)

For b ∈ R, consider the log of the likelihoods given by,

l(η) =n∑

i=1

logh(Xi − η b√

n)

h(Xi). (A.2.47)

If we expand l(η) about 0 and evaluate it at η = 1, similar to the expansion ( A.2.19), weobtain

l = − b√n

n∑

i=1

h′(Xi)

h(Xi)+b2

2n

n∑

i=1

h(Xi)h′′(Xi) − (h′(Xi))

2

h2(Xi)+ op(1) , (A.2.48)

Page 442: Robust Nonparametric Statistical Methods

432 APPENDIX A. ASYMPTOTIC RESULTS

provided that the third derivative of the log-ratio, evaluated at 0, is square integrable.Under px, the middle term converges in probability to −I(h)b2/2, provided that the secondderivative of the log-ratio, evaluated at 0, is square integrable. An application of Theorem

A.1.1 shows that l converges in distribution to a N(− I(h)b2

2, I(h)b2). Hence, by LeCam’s first

lemma,

the densities qb = Πni=1h

(xi − b√

n

)are contiguous to px = Πn

i=1h(xi) ; (A.2.49)

Similar to Section A.2.2, by using Theorem A.1.1 we can derive the asymptotic distri-bution of the random vector (T ∗

ϕ+/√n, l), which we record as:

Theorem A.2.9. Under px and some regularity conditions on h,

( 1√nT ∗ϕ+

l

)D→ N2

((0

− I(h)b2

2

),

[1 bγhbγh I(h)b2

]), (A.2.50)

where γh = 1/τϕ+ and τϕ+ is given in expression ( 1.8.24).

By this last theorem and LeCam’s third lemma, we have

1√nT ∗ϕ+

D→ N(bγh, 1) , under qb . (A.2.51)

By the result on contiguity, ( A.2.49), the test statistic Tϕ+/√n has the same distribution

under qb. A proof of the asymptotic power lemma, Theorem 1.8.1, follows from this result.Next consider a shifted version of T ∗

ϕ+ given by

T ∗bϕ+ =

n∑

i=1

ϕ+

(H+

(∣∣∣∣Xi +b√n

∣∣∣∣))

sgn

(Xi +

b√n

). (A.2.52)

The following identity is readily established:

Pqb[T∗ϕ+ ≤ t] = Ppx [T

∗bϕ+ ≤ t] ; (A.2.53)

see, also, Theorem 1.3.1. We need the following theorem:

Theorem A.2.10. Under px,

[T ∗ϕ+ − [T ∗

bϕ+ −Epx(T∗bϕ+)]

√n

]P→ 0 .

Proof: As in Theorem A.2.5, it suffices to show that V [(T ∗ϕ+ − T ∗

bϕ+)/√n] → 0. But this

variance reduces to

V

[T ∗ϕ+ − T ∗

bϕ+√n

]=

∫ ∞

−∞

ϕ+(H+(|x|)

)sgn(x) − ϕ+

(H+

(∣∣∣∣x+b√n

∣∣∣∣))

sgn

(x+

b√n

)2

h(x) dx .

Page 443: Robust Nonparametric Statistical Methods

A.3. RESULTS FOR RANK-BASED ANALYSIS OF LINEAR MODELS 433

Since ϕ+(u) is square integrable, the quantity in braces is dominated by an integrable func-tion. Since it converges pointwise to 0, a.e., an application of the Lebesgue DominatedConvergence Theorem establishes the result.

Using the above results, we can proceed as we did for Theorem A.2.6 to show that underpx,

Epx

[1√nT ∗bϕ+

]→ bγh . (A.2.54)

Hence,1√nT ∗bϕ+ =

1√nT ∗ϕ+ + bγh + op(1) . (A.2.55)

A similar result holds for the signed-rank statistic.For the results needed in Chapter 1, however, it is convenient to change the notation to:

Tϕ+(b) =

n∑

i=1

a+(R|Xi − b|)sgn(Xi − b) . (A.2.56)

The above results imply that

1√nTϕ+(θ) =

1√nTϕ+(0) − θγh + op(1) , (A.2.57)

for√n|θ| ≤ B, for B > 0.

The general signed-rank statistics found in Chapter 1 are based on norms. In this case,since the scores are nondecreasing, we can strengthen our results to include uniformity; thatis,

Theorem A.2.11. Assuming Finite Fisher information, nondecreasing and square inte-grable ϕ+(u),

Ppx[ sup√n|θ|≤B

| 1√nTϕ+(θ) − 1√

nTϕ+(0) + θγh| ≥ ǫ] → 0 , (A.2.58)

for all ǫ > 0 and all B > 0.

A proof can be obtained by the usual partitioning type of argument on the interval[−B,B]; see the proof of Theorem 1.5.6. Hence, since

∫(ϕ+(u))2 du = 1, the efficacy is

given by cϕ+ = γh; see ( 1.8.21).

A.3 Results for Rank-Based Analysis of Linear Models

In this section we consider the linear model defined by ( 3.2.3) in Chapter 3. The distributionof the errors satisfies Assumption E.1, ( 3.4.1). The design matrix satisfies conditions D.2,( 3.4.7), and D.3, ( 3.4.8). We shall assume without loss of generality that the true vectorof parameters is 0.

Page 444: Robust Nonparametric Statistical Methods

434 APPENDIX A. ASYMPTOTIC RESULTS

It will be easier to work with the following transformation of the design matrix andparameters. We consider β such that

√nβ = O(1). Note that we will suppress the notation

indicating that β depends on n. Let,

∆ = (X′X)1/2

β , (A.3.1)

C = X (X′X)−1/2

, (A.3.2)

di = −c′i∆ , (A.3.3)

where ci is the ith row of C and note that ∆ = O(1) because n−1X′X → Σ > 0 and√nβ = O(1). Then C′C = Ip and HC = HX, where HC is the projection matrix onto the

column space of C. Note that since X is centered, C is also. Also ‖ci‖2 = h2nii where h2

nii

is the ith diagonal entry of HX. It is straightforward to show that c′i∆ = x′iβ. Using the

conditions (D.2) and (D.3), the following conditions are readily established:

d = 0 (A.3.4)n∑

i=1

d2i ≤

n∑

i=1

‖ci‖2‖∆‖2 = p‖∆‖2 , for all n (A.3.5)

max1≤i≤n

d2i ≤ ‖∆‖2 max

1≤i≤n‖ci‖2 (A.3.6)

= ‖∆‖2 max1≤i≤n

h2nii → 0 as n→ ∞ ,

since ‖∆‖ is bounded.For j = 1, . . . , p define

Snj(∆) =n∑

i=1

cija(R(Yi − c′i∆)) , (A.3.7)

where the scores are generated by a function ϕ which staisfies (S.1), ( 3.4.10). We now showthat the theory established in the Section A.2 for simple linear rank statistics holds for Snj,for each j.

Fix j, then the regression coefficients xi of Section A.2 are given by xi =√ncij. Note

from ( A.3.2) that∑x2i /n =

∑c2ij = 1; hence, condition ( A.2.2) is true. Further by ( A.3.6),

max1≤i≤n x2i∑n

i=1 x2i

= max1≤i≤n

c2ij → 0 ;

hence, condition ( A.2.1) is true.For the sequence di = −c′i∆, conditions ( A.3.4) - ( A.3.6) imply conditions ( A.2.14) -

( A.2.16), (the upper bound in condition ( A.3.6) was actually all that was needed in theproofs of Section A.2). Finally for ( A.2.17), because C is orthogonal, σxd is given by

σxd =1√n

n∑

i=1

xidi = −n∑

i=1

cijc′i∆ = −

p∑

k=1

n∑

i=1

cijcik

∆k = −∆j . (A.3.8)

Page 445: Robust Nonparametric Statistical Methods

A.3. RESULTS FOR RANK-BASED ANALYSIS OF LINEAR MODELS 435

Thus by Theorem A.2.7, for j = 1, . . . , p, we have the results,

Snj(∆) = Snj(0) − γf∆j + op(1) (A.3.9)

Snj(∆) = Tnj(0) − γf∆j + op(1) , (A.3.10)

where

Tnj(0) =

n∑

i=1

cijϕ(F (Yi)) . (A.3.11)

Let Sn(∆)′ = (Sn1(∆), . . . , Snp(∆)). Because component-wise convergence in probabilityimplies that the corresponding vector converges, we have shown that the following theoremis true:

Theorem A.3.1. Under the above assumptions, for ǫ > 0 and for all ∆

limn→∞

P (‖Sn(∆) − (Sn(0) − γ∆) ‖ ≥ ǫ) = 0 . (A.3.12)

The conditions we want are asymptotic linearity and quadraticity. Asymptotic linear-ity is the condition

limn→∞

P

(sup

‖∆‖≤c‖Sn(∆) − (Sn(0) − γ∆) ‖ ≥ ǫ

)= 0 , (A.3.13)

for arbitrary c > 0 and ǫ > 0. This result was first shown by Jureckova (1971) under morestringent conditions on the design matrix.

Consider the dispersion function discussed in Chapter 2. In terms of the above notation

Dn(∆) =n∑

i=1

a(R(Yi − ci∆))(Yi − ci∆) . (A.3.14)

An approximation of Dn(∆) is the quadratic function

Qn(∆) = γ∆′∆/2 − ∆′Sn(0) +Dn(0) . (A.3.15)

Using Jureckova’s conditions, Jaeckel (1972) extended the result ( A.3.13) to asymptoticquadraticity which is given by

limn→∞

P

(sup

‖∆‖≤c|Dn(∆) −Qn(∆)| ≥ ǫ

)= 0 , (A.3.16)

for arbitrary c > 0 and ǫ > 0. Our main result of this section shows that ( A.3.12),( A.3.13), and ( A.3.16) are equivalent. The proof proceeds as in Heiler and Willers (1988)who established their results based on convex function theory. Before proceeding with theproof, for the reader’s convenience, we present some notes on convex functions.

Page 446: Robust Nonparametric Statistical Methods

436 APPENDIX A. ASYMPTOTIC RESULTS

A.3.1 Convex Functions

Let f be a real valued function defined on Rp. Recall the definition of a convex function:

Definition A.3.1. The function f is convex if

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) , (A.3.17)

for 0 < λ < 1. Further, a convex function f is called proper if it is defined on an open setC ∈ Rp and is everywhere finite on C.

The convex functions of interest in this appendix are proper with C = Rp.The proof of the following theorem can be found in Rockafellar (1970); see pages 82 and

246.

Theorem A.3.2. Suppose f is convex and proper on an open subset C of Rp. Then f iscontinuous on C and is differentiable almost everywhere on C.

We will find it useful to define a subgradient:

Definition A.3.2. The vector D(x0) is called a subgradient of f at x0 if

f(x) − f(x0) ≥ D(x0)′(x − x0) , for all x ∈ C . (A.3.18)

As shown on page 217 of Rockafellar (1970), a proper convex function which is definedon an open set C has a subgradient at each point in C. Furthermore, at the points ofdifferentiability, the subgradient is unique and it agrees with the gradient. This is a theoremproved on page 242 of Rockafellar which we next state.

Theorem A.3.3. Let f be convex. If f is differentiable at x0 then f(x0), the gradient off at x0, is the unique subgradient of f at x0.

Hence combining Theorems A.3.2 and A.3.3, we see that for proper convex functionsthe subgradient is the gradient almost everywhere; hence if f is a proper convex function wehave,

f(x) − f(x0) ≥ f(x0)′(x − x0) , a.e. x ∈ C . (A.3.19)

The next theorem can be found on page 90 of Rockafellar (1970).

Theorem A.3.4. Let the sequence of convex functions fn be proper on C and suppose thesequence converges for all x ∈ C∗ where C∗ is dense in C. Then the functions fn convergeon the whole set C to a proper and convex function f and, furthermore, the convergence isuniform on each compact subset of C.

The following theorem is a modification by Heiler and Willers (1988) of a theorem foundon page 248 of Rockafellar (1970).

Page 447: Robust Nonparametric Statistical Methods

A.3. RESULTS FOR RANK-BASED ANALYSIS OF LINEAR MODELS 437

Theorem A.3.5. Suppose in addition to the assumptions of the last theorem the limit func-tion f is differentiable, then

limn→∞

fn(x) = f(x) , for all x ∈ C . (A.3.20)

Furthermore the convergence is uniform on each compact subset of C.

The following result is proved in Heiler and Willers (1988).

Theorem A.3.6. Suppose the hypotheses of Theorem A.3.4 hold. Assume, also, that thelimit function f is differentiable. Then

limn→∞

fn(x) = f(x) , for all x ∈ C∗ (A.3.21)

andlimn→∞

fn(x0) = f(x0) , for at least one x0 ∈ C∗ (A.3.22)

where C∗ is dense in C, imply that

limn→∞

fn(x) = f(x) , for all x ∈ C (A.3.23)

and the convergence is uniform on each compact subset of C.

A.3.2 Asymptotic Linearity and Quadraticity

We now proceed with Heiler and Willers (1988) proof of the equivalence of ( A.3.12),( A.3.13), and ( A.3.16).

Theorem A.3.7. Under Model ( 3.2.3) and the assumptions ( 3.4.7), ( 3.4.8), and ( 3.4.1),the expressions ( A.3.12), ( A.3.13), and ( A.3.16) are equivalent.

Proof:

( A.3.12) ⇒ ( A.3.16). Both functions Dn(∆) and Qn(∆) are proper convex functionsfor ∆ ∈ Rp. Their gradients are given by,

Qn(∆) = γ∆ − Sn(0) (A.3.24)

Dn(∆) = −Sn(∆) , a.e. ∆ ∈ Rp . (A.3.25)

By Theorem A.3.2 the gradient of D exists almost everwhere. Where the derivative ofDn(∆) is not defined, we will use the subgradient ofDn(∆), ( A.3.2), which, in the caseof proper convex functions, exists everwhere and which agrees uniquely with the gra-dient at points where D(∆) is differentiable; see Theorem A.3.3 and the surroundingdiscussion. Combining these results we have,

(Dn(∆) −Qn(∆)) = −[Sn(∆) − Sn(0) + γ∆] (A.3.26)

Page 448: Robust Nonparametric Statistical Methods

438 APPENDIX A. ASYMPTOTIC RESULTS

Let N denote the set of positive integers. Let ∆(1),∆(2), . . . be a listing of the vectorsin p-space with rational components. By ( A.3.12) the right side of ( A.3.26) goes to 0in probability for ∆(1). Hence, for every infinite index set N∗ ⊂ N there exists anotherinfinite index set N∗∗

1 ⊂ N∗ such that

[Sn(∆(1)) − Sn(0) + γ∆(1)]

a.s.→ 0 , (A.3.27)

for n ∈ N∗∗1 . Since the right side of ( A.3.26) goes to 0 in probability for ∆(2) and N∗∗

1

is an infinite index set, there exists another infinite index set N∗∗2 ⊂ N∗∗

1 such that

[Sn(∆(i)) − Sn(0) + γ∆(i)]

a.s.→ 0 , (A.3.28)

for n ∈ N∗∗2 and i ≤ 2. We continue and, hence, get a sequence of nested infinite index

sets N∗∗1 ⊃ N∗∗

2 ⊃ · · · ⊃ N∗∗i ⊃ · · · such that

[Sn(∆(j)) − Sn(0) + γ∆(j)]

a.s.→ 0 , (A.3.29)

for n ∈ N∗∗i ⊃ N∗∗

i+1 ⊃ · · · and j ≤ i. Let N be a diagonal infinite index set of thesequence N∗∗

1 ⊃ N∗∗2 ⊃ · · · ⊃ N∗∗

i ⊃ · · · . Then

[Sn(∆) − Sn(0) + γ∆]a.s.→ 0 , (A.3.30)

for n ∈ N and for all rational ∆.

Define the convex function Hn(∆) = Dn(∆) −Dn(0) + ∆′Sn(0). Then

Dn(∆) −Qn(∆) = Hn(∆) − γ∆′∆/2 (A.3.31)

(Dn(∆) −Qn(∆)) = Hn(∆) − γ∆ . (A.3.32)

Hence by ( A.3.30) we have

Hn(∆)a.s.→ γ∆ = γ∆′∆/2 , (A.3.33)

for n ∈ N and for all rational ∆. Also note

Hn(0) = 0 = γ∆′∆/2|∆=0. (A.3.34)

Since Hn is convex and ( A.3.33) and ( A.3.34) hold, we have by Theorem A.3.6 thatHn(∆)n∈ eN converges to γ∆′∆/2 a.s., uniformly on each compact subset of Rp. Thatis by ( A.3.31), Dn(∆) − Qn(∆) → 0 a.s., uniformly on each compact subset of Rp.Since N∗ is arbitrary, we can conclude, (see Theorem 4, page 103 of Tucker, 1967),

that Dn(∆) −Qn(∆)P→ 0 uniformly on each compact subset of Rp.

( A.3.16) ⇒ ( A.3.13). Let c > 0 be given and let C = ∆ : ‖∆‖ ≤ c. By ( A.3.16) we

know that Dn(∆) − Qn(∆)P→ 0 on C. Using the same diagonal argument as above,

for any infinite index set N∗ ⊂ N there exists an infinite index set N ⊂ N∗ such that

Page 449: Robust Nonparametric Statistical Methods

A.3. RESULTS FOR RANK-BASED ANALYSIS OF LINEAR MODELS 439

Dn(∆) −Qn(∆)a.s.→ 0 for n ∈ N and for all rational ∆. As in the last part, introduce

the function Hn as

Dn(∆) −Qn(∆) = Hn(∆) − γ∆′∆/2 . (A.3.35)

Hence,Hn(∆)

a.s.→ γ∆′∆/2 , (A.3.36)

for n ∈ N and for all rational ∆. By ( A.3.36) and the fact that the function γ∆′∆/2is differentiable we have by Theorem A.3.5,

Hn(∆)a.s.→ γ∆ , (A.3.37)

for n ∈ N and uniformly on C. This leads to the following string of convergences,

(Dn(∆) −Qn(∆))a.s.→ 0

Sn(∆) − (Sn(0) − γ∆)a.s.→ 0 , (A.3.38)

where both convergences are for n ∈ N and uniformly on C. Since N∗ was arbitrarywe can conclude that

Sn(∆) − (Sn(0) − γ∆)P→ 0 , (A.3.39)

uniformly on C. Hence ( A.3.13) holds.

( A.3.13) ⇒ ( A.3.12). This is trivial.

These are the results we wanted. For convenience we summarize asymptotic linearityand asymptotic quadraticity in the following theorem:

Theorem A.3.8. Under Model ( 3.2.3) and the assumptions ( 3.4.7), ( 3.4.8), and ( 3.4.1),

limn→∞

P

(sup

‖∆‖≤c‖Sn(∆) − (Sn(0) − γ∆) ‖ ≥ ǫ

)= 0 , (A.3.40)

limn→∞

P

(sup

‖∆‖≤c|Dn(∆) −Qn(∆)| ≥ ǫ

)= 0 , (A.3.41)

for all ǫ > 0 and all c > 0.

Proof: This follows from the Theorems A.3.1 and A.3.7.

A.3.3 Asymptotic Distance Between β and β

This section contains a proof of Theorem 3.5.5. It shows that the R-estimate in Chapter 3 isclose to the value which minimizes the quadratic approximation to the dispersion function.The proof is due to Jaeckel (1972). For convenience, we have restated the theorem.

Page 450: Robust Nonparametric Statistical Methods

440 APPENDIX A. ASYMPTOTIC RESULTS

Theorem A.3.9. Under the Model ( 3.2.3), (E.1), (D.1), (D.2) and (S.1) in Section 3.4,

√n(β − β)

P→ 0 .

Proof: Choose ǫ > 0 and δ > 0. Since√nβ converges in distribution, there exists a c0

such thatP[‖β‖ ≥ c0/

√n]< δ/2 , (A.3.42)

for n sufficiently large. Let

T = minQ(Y −Xβ) : ‖β − β‖ = ǫ/

√n−Q(Y −Xβ) . (A.3.43)

Since β is the unique minimizer of Q, T > 0; hence, by asymptotic quadraticity we have

P

[max

‖β‖<(c0+ǫ)/√n

|D(Y − Xβ) −Q(Y − Xβ)| ≥ T/2

]≤ δ/2 , (A.3.44)

for sufficiently large n. By ( A.3.42) and ( A.3.44) we can assert with probability greater

than 1−δ that for sufficiently large n, |Q(Y−Xβ)−D(Y−Xβ)| < (T/2) and ‖β‖ < c0/√n.

This implies with probability greater than 1 − δ that for sufficiently large n,

D(Y − Xβ) < Q(Y −Xβ) + T/2 and ‖β‖ < c0/√n . (A.3.45)

Next suppose β is arbitrary and on the ring ‖β − β‖ = ǫ/√n. For ‖β‖ < c0/

√n it then

follows that ‖β‖ ≤ (c0 + ǫ)/√n. Arguing as above, we have with probability greater than

1 − δ that D(Y − Xβ) > Q(Y − Xβ) − T/2, for sufficiently large n. From this, ( A.3.43),and ( A.3.45) we get the following string of inequalities

D(Y − Xβ) > Q(Y −Xβ) − T/2

≥ minQ(Y − Xβ) : ‖β − β‖ = ǫ/

√n− T/2

= T +Q(Y − Xβ) − T/2

= T/2 +Q(Y − Xβ) > D(Y − Xβ) (A.3.46)

Thus, D(Y−Xβ) > D(Y−Xβ), for ‖β−β‖ = ǫ/√n. Since D is convex, we must also have

D(Y − Xβ) > D(Y − Xβ), for ‖β − β‖ ≥ ǫ/√n. But D(Y − Xβ) ≥ minD(Y − Xβ) =

D(Y −Xβ). Hence β must lie inside the disk ‖β − β‖ = ǫ/√n with probability of at least

1 − 2δ; that is, P[‖β − β‖ < ǫ/

√n]> 1 − 2δ. This yields the result.

A.3.4 Consistency of the Test Statistic Fϕ

This section contains a proof of the consistency of the test statistic Fϕ, Theorem 3.6.2. Webegin with a lemma.

Page 451: Robust Nonparametric Statistical Methods

A.3. RESULTS FOR RANK-BASED ANALYSIS OF LINEAR MODELS 441

Lemma A.3.1. Let a > 0 be given and let tn = min√n‖β− eβ‖=a

(Q(β) − Q(β)). Then

tn = (2τ)−1a2λn,1 where λn,1 is the minimum eigenvalue of n−1X′X.

Proof: After some computation, we have

Q(β) −Q(β) = (2τ)−1√n(β − β)′n−1X′X

√n(β − β) .

Let 0 < λn,1 ≤ · · · ≤ λn,p be the eigenvalues of n−1X′X and let γn,1, . . . ,γn,p be a correspond-ing set of orthonormal eigenvectors. The spectral decomposition of n−1X′X is n−1X′X =∑p

i=1 λn,iγn,iγ′n,i. From this we can show for any vector δ that δ′n−1X′Xδ ≥ λn,1‖δ‖2 and,

that further, the minimum is achieved over all vectors of unit length when δ = γn,1. It thenfollows that

min‖δ‖=a

δ′n−1X′Xδ = λn,1a2 ,

which yields the conclusion.Note that by (D.2) of Section 3.4, λn,1 → λ1, for some λ1 > 0. The following is a

restatement and a proof of Theorem 3.6.2.

Theorem A.3.10. Suppose conditions (E.1), (D.1), (D.2), and (S.1) of Section 3.4 hold.The test statistic Fϕ is consistent for the hypotheses ( 3.2.5).

Proof: By the above discussion we need only show that ( 3.6.23) is true. Let ǫ > 0 begiven. Let c0 = (2τ)−1χ2

α,q. By Lemma A.3.1, choose a > 0 so large that (2τ)−1a2λ1 > 3c0+ǫ.

Next choose n0 so large that (2τ)−1a2λn,1 > 3c0, for n ≥ n0. Since√n‖β − β0‖ is bounded

in probability, there exits a c > 0 and a n1 such that for n ≥ n1

Pβ0(C1,n) ≥ 1 − (ǫ/2) , (A.3.47)

where we define the event C1,n = √n‖β−β0‖ < c. Since t > 0 by asymptotic quadraticity,Theorem A.3.8, there exits an n2 such that for n > n2,

Pβ0(C2,n) ≥ 1 − (ǫ/2) , (A.3.48)

where C2,n = max√n‖β−β0‖≤c+a

|Q(β) − D(β)| < (t/3). For the remainder of the proof

assume that n ≥ maxn0, n1, n2 = n∗. Next suppose β is such that√n‖β −β‖ = a. Then

on C1,n it follows that√n‖β − β‖ ≤ c+ a. Hence on both C1,n and C2,n we have

D(β) > Q(β) − (t/3)

≥ Q(β) + t− (t/3)

= Q(β) + 2(t/3)

> D(β) + (t/3) .

Therefore, for all β such that√n‖β−β‖ = a, D(β)−D(β) > (t/3) > c0. But D is convex;

hence on C1,n ∩ C2,n, for all β such that√n‖β − β‖ ≥ a, D(β) −D(β) > (t/3) > c0.

Page 452: Robust Nonparametric Statistical Methods

442 APPENDIX A. ASYMPTOTIC RESULTS

Finally choose n3 such that for n ≥ n3, δ > (c + a)/√n where δ is the positive distance

between β0 and Rr. Now assume that n ≥ maxn∗, n3 and C1,n ∩ C2,n is true. Recall that

the reduced model R-estimate is βr = (β′r,1, 0

′)′ where βr,1 lies in Rr; hence,

√n‖βr − β‖ ≥

√n‖βr − β0‖ −

√n‖β − β0‖ ≥

√nδ − c > a .

Thus on C1,n ∩ C2,n, D(βr) −D(β) > c0. Thus for n sufficiently large we have,

P [D(βr) −D(β) > (2τ)−1χ2α,q] ≥ 1 − ǫ .

Because ǫ was arbitrary ( 3.6.23) is true and consistency of Fϕ follows.

A.3.5 Proof of Lemma 3.5.1

The following lemma was used to establish the asymptotic linearity for the sign process forlinear models in Chapter 3. The proof of this lemma was first given by Jureckova (1971) forgeneral scores. We restate the lemma and give its proof for sign scores.

Lemma A.3.2. Assume conditions (E.1), (E.2), (S.1), (D.1) and (D.2) of Section 3.4.For any ǫ > 0 and for any a ∈ R,

limn→∞

P [|S1(Y − an−1/2 − XβR) − S1(Y − an−1/2)| ≥ ǫ√n] = 0 .

Proof: Let a be arbitrary but fixed and let c > |a|. After matching notation, TheoremA.4.3 leads to the result,

max‖(X′X)1/2β‖≤c

∣∣∣∣1√nS1(Y − an−1/2 −Xβ) − 1√

nS1(Y) + (2f(0))a

∣∣∣∣ = op(1) . (A.3.49)

Obviously the above result holds for β = 0. Hence for any ǫ > 0,

P

[max

‖(X′X)1/2β‖≤c

∣∣∣∣1√nS1(Y − an−1/2 −Xβ) − 1√

nS1(Y − an−1/2)

∣∣∣∣ ≥ ǫ

]

≤ P

[max

‖(X′X)1/2β‖≤c

∣∣∣∣1√nS1(Y − an−1/2 − Xβ) − 1√

nS1(Y) + (2f(0)a

∣∣∣∣ ≥ǫ

2

]

+P

[∣∣∣∣1√nS1(Y − an−1/2) − 1√

nS1(Y) + (2f(0)a

∣∣∣∣ ≥ǫ

2

].

By ( A.3.49), for n sufficiently large, the two terms on the right side are arbitrarily small.

The desired result follows from this since (X′X)1/2β is bounded in probability.

Page 453: Robust Nonparametric Statistical Methods

A.4. ASYMPTOTIC LINEARITY FOR THE L1 ANALYSIS 443

A.4 Asymptotic Linearity for the L1 Analysis

In this section we obtain a linearity result for the L1 analysis of a linear model. Recall fromSection 3.6 that the L1-estimates are equivalent to the R-estimates when the rank scoresare generated by the sign function; hence, the distribution theory for the L1-estimates isderived in Section 3.4. The linearity result derived below offers another way to obtain thisresult. More importantly though, we need the linearity result for the proof of Lemma 3.5.6of Section 3.5. As we next show, this result is a corollary to the linearity results derived inthe last section.

We will assume the same linear model and use the same notation as in Section 3.2. Recallthat the L1 estimate of β minimizes the dispersion function,

D1(α,β) =n∑

i=1

|Yi − α− xiβ| .

The corresponding gradient function is the (p+ 1) × 1 vector whose components are

jD1 =

−∑n

i=1 sgn(Yi − α− xiβ) if j = 0−∑n

i=1 xijsgn(Yi − α− xiβ) if j = 1, . . . , p,

where j = 0 denotes the partial of D1 with respect to α. The parameter α will denote thelocation functional med(Yi − xiβ), i.e., the median of the errors. Without loss of generalitywe will assume that the true parameters are 0.

We first consider the simple linear model. Consider then the notation of Section A.3;see ( A.3.1) - ( A.3.7). We will derive the analogue of Theorem A.3.8 for the processes

U0(α,∆) =

n∑

i=1

sgn(Yi −α√n− ∆ci) (A.4.1)

U1(α,∆) =n∑

i=1

cisgn(Yi −α√n− ∆ci) . (A.4.2)

Let pd = Πni=1f0(yi) denote the likelihood for the iid observations Y1, . . . , Yn and let qd =

Πni=1f0(yi + α/

√n + ∆ci) denote the liklihood of the variables Yi − α√

n− ∆ci. We assume

throughout that f(0) > 0. Similar to Section A.2.2, the sequence of densities qd is contiguousto the sequence pd. Note that the processes U0 and U1 are already sums of independentvariables; hence, projections are unnecessary.

We first work with the process U1.

Lemma A.4.1. Under the above assumptions and as n→ ∞,

E0(U1(α,∆)) → −2∆f0(0) .

Page 454: Robust Nonparametric Statistical Methods

444 APPENDIX A. ASYMPTOTIC RESULTS

Proof: After some simplification we get

E0(U1(α,∆)) = 2

n∑

i=1

ci[F0(0) − F0(α/

√n + ∆ci)

]

= 2

n∑

i=1

ci(−∆ci − α/√n)f0(ξin) ,

where, by the mean value theorem, ξin is between 0 and |α/√n + ∆ci|. Since the ci’s arecentered, we further obtain

E0(U1(α,∆)) = −2∆

n∑

i=1

c2i [f0(ξin) − f0(0)] − 2∆

n∑

i=1

c2i f0(0) .

By assumptions of Section A.2.2, it follows that maxi |α/√n + ∆ci| → 0 as n → ∞. Since∑n

i=1 c2i = 1 and the assumptions that f0 continuous and positive at 0 the desired result

easily follows.

This leads us to our main result for U1(α,∆):

Theorem A.4.1. Under the above assumptions, for all α and ∆

U1(α,∆) − [U1(0, 0) − ∆2f0(0)]P→ 0 ,

as n→ ∞.

Because the ci’s are centered it follows that Epd(U1(0, 0)) = 0. Thus by the last lemma,

we need only show that Var(U1(α,∆) − U1(0, 0)) → 0. By considering the variance of thesign of a random variable, simplification leads to the bound:

Var((U1(α,∆) − U1(0, 0)) ≤ 4

n∑

i=1

c2i |F0(α/√n + ∆ci) − F0(0)| .

By our assumptions, maxi |∆ci +α/√n| → 0 as n→ ∞. From this and the continuity of F0

at 0, it follows that Var(U1(α,∆) − U1(0, 0)) → 0.

We need analogous results for the process U0(α,∆).

Lemma A.4.2. Under the above assumptions,

E0[U0(α,∆)] → −2αf0(0) ,

as n→ ∞.

Page 455: Robust Nonparametric Statistical Methods

A.4. ASYMPTOTIC LINEARITY FOR THE L1 ANALYSIS 445

Proof: Upon simplification and an application of the mean value theorem,

E0[U0(α,∆)] =2√n

n∑

i=1

[F0(0) − F0

(α√n

+ ci∆

)]

=−2√n

n∑

i=1

[α√n

+ ci∆

]f0(ξin)

=−2α

n

n∑

i=1

[f0(ξin) − f0(0)] − 2αf0(0) ,

where we have used the fact that the ci’s are centered. Note that |ξin| is between 0 and|α/√n+ ci∆| and that max |α/√n+ ci∆| → 0 as n→ ∞. By the continuity of f0 at 0, thedesired result follows.

Theorem A.4.2. Under the above assumptions, for all α and ∆

U0(α,∆) − [U0(0, 0) − 2αf0(0)]P→ 0 ,

as n→ ∞.

Because the medYi is 0, E0[U0(0, 0)] = 0. Hence by the last lemma it then suffices toshow that Var(U0(α,∆) − U0(0, 0)) → 0. But,

Var(U0(α,∆) − U0(0, 0)) ≤ 4

n

n∑

i=1

∣∣∣∣F0

(α√n

+ ci∆

)− F0(0)

∣∣∣∣

Because max |α/√n + ci∆| → 0 and F0 is continuous at 0, Var(U0(α,∆) − U0(0, 0)) → 0.Next consider the multiple regression model as discussed in Section A.3. The only

difference in notation is that here we have the intercept parameter included. Let ∆ =(α,∆1, . . . ,∆p)

′ denote the vector of all regression parameters. Take X = [1n : Xc], whereXc denotes a centered design matrix and as in ( A.3.2) take C = X(X′X)−1/2. Note thatthe first column of C is (1/

√n)1n. Let U(∆) = (U0(∆), . . . , Up(∆))′ denote the vector of

processes. Similar to the discussion prior to Theorem A.3.1, the last two theorems implythat

U(∆) − [U(0) − 2f0(0)∆]P→ 0 ,

for all real ∆ in Rp+1.As in Section A.3, we define the approximation quadratic to D1 as

Q1n(∆) = (2f0(0))∆′∆/2 − ∆′U(0) +D1(0) .

The asymptotic linearity of U and the asymptotic quadraticity of D1 then follow as in thelast section. We state the result for reference:

Page 456: Robust Nonparametric Statistical Methods

446 APPENDIX A. ASYMPTOTIC RESULTS

Theorem A.4.3. Under conditions ( 3.4.1), ( 3.4.3), ( 3.4.7) and ( 3.4.8),

limn→∞

P

(max

‖∆‖≤c‖U(∆) − (U(0) − (2f0(0))∆) ‖ ≥ ǫ

)= 0 , (A.4.3)

limn→∞

P

(max

‖∆‖≤c|D1(∆) −Q1(∆)| ≥ ǫ

)= 0 , (A.4.4)

for all ǫ > 0 and all c > 0.

A.5 Influence Functions

In this section we derive the influence functions found in Chapters 1-3 and Chapter 5.Discussions of the influence function can be found in Staudte and Sheather (1990), Hampelet al. (1986) and Huber (1981). For the influence functions of Chapter 3, we will find theGateux derivative of a convenient functional; see Fernholz (1983) and Huber (1981) forrigourous discussions of functionals and derivatives.

Definition A.5.1. Let T be a statistical functional defined on a space of distribution func-tions and let H denote a distribution function in the domain of T . We say T is Gateuxdifferentiable at H if for any distribution function W , such that the distribution functions(1 − s)H + sW lie in the domain of T , the following limit exists:

lims→0

T [(1 − s)H + sW ] − T [H ]

s=

∫ψHdW , (A.5.1)

for some function ψH .

Note by taking W to be H in the above definition we have,

∫ψHdH = 0 . (A.5.2)

The usual definition of the influence function is obtained by taking the distributionfunction W to be a point mass distribution. Denote the point mass distribution function att by ∆t(x). Letting W (x) = ∆t(x), the Gateux derivative of T (H) is

lims→0

T [(1 − s)H + s∆s(x)] − T [H ]

s= ψH(x); . (A.5.3)

The function ψH(x) is called the influence function of T (H). Note that this is the deriva-tive of the functional T [(1 − s)H + s∆s(x)] at s = 0. It measures the rate of change of thefunctional T (H) at H in the direction of ∆s. A functional is said to be robust when thisderivative is bounded.

Page 457: Robust Nonparametric Statistical Methods

A.5. INFLUENCE FUNCTIONS 447

A.5.1 Influence Function for Estimates Based on Signed-Rank

Statistics

In this section we derive the influence function for the one-sample location estimate θϕ+ ,( 1.8.5), discussed in Chapter 1. We will assume that we are sampling from a symmetricdensity h(x) with distribution function H(x), as in Section 1.8. As in Chapter 2, we willassume that the one sample score function ϕ+(u) is defined by

ϕ+(u) = ϕ

(u+ 1

2

), (A.5.4)

where ϕ(u) is a nondecreasing, differentiable function defined on the interval (0, 1) satisfying

ϕ(1 − u) = −ϕ(u) . (A.5.5)

Recall from Chapter 2 that this assumption is appropriate for scores for samples from sym-metrical distributions. For convenience we extend ϕ+(u) to the interval (−1, 0) by

ϕ+(−u) = −ϕ+(u) . (A.5.6)

Our functional T (H) is defined implicitly by the equation ( 1.8.5). Using the symmetry ofh(x), ( A.5.5), and ( A.5.6) we can write the defining equation for θ = T (H) as

0 =

∫ ∞

−∞ϕ+(H(x) −H(2θ − x))h(x) dx

0 =

∫ ∞

−∞ϕ(1 −H(2θ − x))h(x) dx . (A.5.7)

For the derivation, we will proceed as discussed above; see the discussion around expres-sion ( A.5.3). Consider the contaminated distribution of H(x) given by

Ht,ǫ(x) = (1 − ǫ)H(x) + ǫ∆t(x) , (A.5.8)

where 0 < ǫ < 1 is the proportion of contamination and ∆t(x) is the distribution functionfor a point mass at t. By ( A.5.3) the influence function is the derivative of the functionalat ǫ = 0. To obtain this derivative we implicitly differentiate the defining equation ( A.5.7)at Ht,ǫ(x); i.e., at

0 = (1 − ǫ)

∫ ∞

−∞ϕ(1 − (1 − ǫ)H(2θ − x) − ǫ∆t(2θ − x))h(x) dx

= ǫ

∫ ∞

−∞ϕ(1 − (1 − ǫ)H(2θ − x) − ǫ∆t(2θ − x)) d∆t(x)

Let θ denote the derivative of the functional. Implicitly differentiating this equation andthen setting ǫ = 0 and without loss of generality θ = 0, we get

0 = −∫ ∞

−∞ϕ(H(x))h(x) dx+

∫ ∞

−∞ϕ′(H(x))H(−x)h(x) dx

= −2θ

∫ ∞

−∞ϕ′(H(x))h2(x) dx−

∫ ∞

−∞ϕ′(H(x))∆t(−x)h(x) dx+ ϕ(H(t)) .

Page 458: Robust Nonparametric Statistical Methods

448 APPENDIX A. ASYMPTOTIC RESULTS

Label the four integrals in the above equation as I1, . . . , I4. Since∫ϕ(u) du = 0, I1 = 0. For

I2 we get

I2 =

∫ ∞

−∞ϕ′(H(x))h(x) dx−

∫ ∞

−∞ϕ′(H(x))H(x)h(x) dx

=

∫ 1

0

ϕ′(u) du−∫ 1

0

ϕ′(u)u du = −ϕ(0) .

Next I4 reduces to

−∫ −t

−∞ϕ′(H(x))h(x) dx = −

∫ H(−t)

0

ϕ′(u) du = ϕ(H(t)) + ϕ(0) .

Combining these results and solving for θ leads to the influence function which we can writein either of the following two ways,

Ω(t, θϕ+) =ϕ(H(t))∫∞

−∞ ϕ′(H(x))h2(x) dx

=ϕ+(2H(t) − 1)

4∫∞0ϕ+′(2H(x) − 1)h2(x) dx

. (A.5.9)

A.5.2 Influence Functions for Chapter 3

In this section, we derive the influence functions which were presented in Chapter 3. Muchof this work was developed in Witt (1989) and Witt, McKean and Naranjo (1995). Thecorrelation model of Section 3.11 is the underlying model for the influence functions derivedin this section. Recall that the joint distribution function of x and Y is H , the distributionfunctions of x, Y and e are M , G and F , respectively, and Σ is the variance-covariancemartix of x.

Let βϕ denote the R-estimate of β for a specified score function ϕ(u). In this section weare interested in deriving the influence functions of this R-estimate and of the correspondingR-test statistic for the general linear hypotheses. We will obtain these influence functionsby using the definition of the Gateux derivative of a functional, ( A.5.1). The influencefunctions are then obtained by taking W to be the point mass distribution function ∆(x0,y0);see expression ( A.5.3). If T is Gateux differentiable at H then by setting W = ∆(x0,y0) wesee that the influence function of T is given by

Ω(x0, y0;T ) =

∫ψHd∆(x0,y0) = ψH(x0, y0) . (A.5.10)

As a simple example, we will obtain the influence function of the statistic D(0) =∑a(R(Yi))Yi. Since G is the distribution function of Y , the corresponding functional is

Page 459: Robust Nonparametric Statistical Methods

A.5. INFLUENCE FUNCTIONS 449

T [G] =∫ϕ(G(y))ydG(y). Hence for a given distribution function W ,

T [(1 − s)G+ sW ] = (1 − s)

∫ϕ[(1 − s)G(y) + sW (y)]ydG(y)

+s

∫ϕ[(1 − s)G(y) + sW (y)]ydW (y) .

Taking the partial derivative of the right side with respect to s, setting s = 0, and substituting∆y0 for W leads to the influence function

Ω(y0;D(0)) = −∫ϕ(G(y))ydG(y)−

∫ϕ′(G(y))G(y)ydG(y)

+

∫ ∞

y0

ϕ′(G(y))ydG(y) + ϕ(G(y0))y0 . (A.5.11)

Note that this is not bounded in the Y -space and, hence, the statistic D(0) is not robust.Thus, as noted in Section 3.11, the coefficient of multiple determination R1, ( 3.11.16), isnot robust. A similar development establishes the influence function for the denominator ofLS coefficient of multiple determination R2, showing too that it is not bounded. Hence R2

is not a robust statistic.

Another example is the the influence function of the least squares estimate of β which isgiven by,

Ω(x0, y0; βLS) = σ−1y0Σ−1x0 . (A.5.12)

The influence function of the least squares estimate is, thus, unbounded in both the Y - andx-spaces.

Influence Function of βϕ

Recall that H is the joint distribution function of x and Y . Let the p×1 vector T (H) denote

the functional corresponding to βϕ. Assume without loss of generality that the true β = 0,α = 0, and that the Ex = 0. Hence the distribution function of Y is F (y) and Y and x areindependent; i.e., H(x, y) = M(x)F (y).

Recall that the R-estimate satisfies the equations

n∑

i=1

xia(R(Yi − x′iβ))

.= 0 .

Let G∗n denote the empirical distribution function of Yi−x′

iβ. Then we can rewrite the aboveequations as

nn∑

i=1

xiϕ

(n

n + 1G∗n(Yi − x′

iβ)

)1

n

.= 0 .

Page 460: Robust Nonparametric Statistical Methods

450 APPENDIX A. ASYMPTOTIC RESULTS

Let G∗ denote the distribution function of Y − x′T (H). Then the functional T (H) satisfies

∫ϕ(G∗(y − x′T (H))xdH(x, y) = 0 . (A.5.13)

We can show that,

G∗(t) =

∫ ∫

u≤v′T (H)+t

dH(v, u) . (A.5.14)

Let Hs = (1 − s)H + sW for an arbitrary distribution function W . Then the functionalT (H) evaluated at Hs satisfies the equation

(1 − s)

∫ϕ(G∗

s(y − x′T (Hs))xdH(x, y) + s

∫ϕ(G∗

s(y − x′T (Hs))xdW (x, y) = 0 ,

where G∗s is the distribution function of Y − x′T (Hs). We will obtain ∂T/∂s by implicit

differentiation. Then upon substituting ∆x0,y0 for W the influence function is given by(∂T/∂s) |s=0, which we will denote by T . Implicit differentiation leads to

0 = −∫ϕ(G∗

s(y − x′T (Hs))xdH(x, y) − (1 − s)

∫ϕ′(G∗

s(y − x′T (Hs))∂G∗

s

∂sxdH(x, y)

+

∫ϕ(G∗

s(y − x′T (Hs))xdW (x, y) + sB1 , (A.5.15)

where B1 is irrelevant since we will be setting s to 0. We first get the partial derivative ofG∗s with respect to s. By ( A.5.14) and the independence between Y and x at H , we have

G∗s(y − x′T (Hs)) =

∫ ∫

u≤y−T (Hs)′(x−v)

dHs(v, u)

= (1 − s)

∫F [y − T (Hs)

′(x − v)]dM(v) + s

∫ ∫

u≤y−T (Hs)′(x−v)

dW (v, u) .

Thus,

∂G∗s(y − x′T (Hs))

∂s= −

∫F [y − T (Hs)

′(x − v)]dM(v)

+(1 − s)

∫F ′[y − T (Hs)

′(x − v)](v − x)′∂T

∂sdM(v)

+

∫ ∫

u≤y−T (Hs)′(x−v)

dW (v, u) + sB2 ,

where B2 is irrelevant since we are setting s to 0. Therefore using the independence betweenY and x at H , T (H) = 0, and Ex = 0, we get

∂G∗s(y − x′T (Hs))

∂s|s=0= −F (y) − f(y)x′T +WY (y) , (A.5.16)

Page 461: Robust Nonparametric Statistical Methods

A.5. INFLUENCE FUNCTIONS 451

where WY denotes the marginal (second variable) distribution function of W .Upon evaluating expression ( A.5.15) at s = 0 and substituting into it expression ( A.5.16)

we have

0 = −∫

xϕ(F (y))dH(x, y) +

∫xϕ′(F (y))[−F (y)− f(y)x′T +WY (y)]dH(x, y)

+

∫xϕ(F (y))dW (x, y)

= −∫ϕ′(F (y))f(y)xx′T dH(x, y) +

∫xϕ(F (y))dW (x, y)

Substituting ∆x0,y0 in for W , we get

0 = −τΣT + x0ϕ(F (y0)) .

Solving this last expression for T , the influence function of βϕ is given by

Ω(x0, y0; βϕ) = τΣ−1ϕ(F (y0))x0 . (A.5.17)

Hence the influence function of βϕ is bounded in the Y -space but not in the x-space. Theestimate is thus bias robust. In Chapter 5 we presented R-estimates whose influence functionsare bounded in both spaces; see Theorems ?? and 3.12.4. Note that the asymptoticrepresentation of βϕ in Corollary 3.5.24 can be written in terms of this influence functionas

√nβϕ = n−1/2

n∑

i=1

Ω(xi, Yi; βϕ) + op(1) .

Influence Function of Fϕ

Rewrite the correlation model as

Y = α + x′1β1 + x′

2β2 + e

and consider testing the general linear hypotheses

H0 : β2 = 0 versus HA : β2 6= 0 , (A.5.18)

where β1 and β2 are q×1 and (p−q)×1 vectors of parameters, respectively. Let β1,ϕ denotethe reduced model estimate. Recall that the R-test based upon the drop in dispersion isgiven by

Fϕ =RD/q

τ/2,

where RD = D(β1,ϕ) − D(βϕ) is the reduction in dispersion. In this section we want toderive the influence function of the test statistic.

Page 462: Robust Nonparametric Statistical Methods

452 APPENDIX A. ASYMPTOTIC RESULTS

Let RD(H) denote the functional for the statistic RD. Then

RD(H) = D1(H) −D2(H) ,

where D1(H) and D2(H) are the reduced and full model functionals given by

D1(H) =

∫ϕ[G∗(y − x′

1T1(H))](y − x′1T1(H))dH(x, y)

D2(H) =

∫ϕ[G∗(y − x′T (H))](y − x′T (H))dH(x, y) , (A.5.19)

and T1(H) and T2(H) denote the reduced and full model functionals for β1 and β, respec-tively. Let βr = (β′

1, 0′)′ denote the true vector of parameters under H0. Then the random

variables Y − x′βr and x are independent. Next write Σ as

Σ =

[Σ11 Σ12

Σ21 Σ22

].

It will be convenient to define the matrices Σr and Σ+r as

Σr =

[Σ11 00 0

]and Σ+

r =

[Σ−1

11 00 0

].

As above, let Hs = (1 − s)H + sW . We begin with a lemma,

Lemma A.5.1. Under the correlation model,

(a) RD(0) = 0

(b)∂RD(Hs)

∂s|s=0 = 0

(c)∂2RD(Hs)

∂s2|s=0 = τϕ2[F (y − x′βr)]x

′ [Σ−1 − Σ+]x . (A.5.20)

Proof: Part (a) is immediate. For Part (b), it follows from ( A.5.19) that

∂D2(Hs)

∂s= −

∫ϕ[G∗

s(y − x′T (Hs))](y − x′T (Hs))dH

+(1 − s)

∫ϕ′[G∗

s(y − x′T (Hs))](y − x′T (Hs))∂G∗

s

∂sdH

+(1 − s)

∫ϕ[G∗

s(y − x′T (Hs))](−x′∂T

∂s)dH

+

∫ϕ[G∗

s(y − x′T (Hs))](y − x′T (Hs))dW (y) + sB , (A.5.21)

Page 463: Robust Nonparametric Statistical Methods

A.5. INFLUENCE FUNCTIONS 453

where B is irrelevant because we are setting s to 0. Evaluating this at s = 0 and using theindependence of Y − x′βr and x, and E(x) = 0 we get after some simplification

∂D2(Hs)

∂s|s=0 = −

∫ϕ[F (y − xβr)](y − xβr)dH −

∫ϕ′[F (y − xβr)]F (y − xβr)(y − xβr)dH

+

∫ϕ′[F (y − xβr)]WY (y − xβr)(y − xβr)dH + ϕ[F (y0 − x0βr)](y0 − x0βr) .

Differentiating as above and using x′βr = x′1β1, we get the same expression for ∂D1

∂s|s=0.

Hence Part (b) is true. Taking the second partial derivatives of D1(H) and D2(H) withrespect to s, the result for Part (c) can be obtained. This is a tedious derivation and detailsof it can be found in Witt (1989) and Witt et al. (1995).

Since Fϕ is nonnegative, there is no loss in generality in deriving the influence functionof√qFϕ. Letting Q2 = 2τ−1RD we have

Ω(x0, y0;√qFϕ) = lim

s→0

Q[(1 − s)H + s∆x0,y0 ] −Q[H ]

s.

But Q[H ] = 0 by Part (a) of Lemma A.5.1. Hence we can rewrite the above limit as

Ω(x0, y0;√qFϕ) =

[lims→0

Q2[(1 − s)H + s∆x0,y0]

s2

]1/2

.

Using Parts (a) and (b) of Lemma A.5.1, we can apply L’hospital’s rule twice to evaluatethis limit. Thus

Ω(x0, y0;√qFϕ) =

[lims→0

1

2

∂2Q2

∂s2

]1/2

=

[2τ−1∂

2RD

∂s2

]1/2

= |ϕ[F (y − x′βr)]|√

x′ [Σ−1 − Σ+]x (A.5.22)

Hence, the influence function of the rank-based test statistic Fϕ is bounded in the Y -spaceas long as the score function is bounded. It can be shown that the influence function of theleast squares test statistic is not bounded in Y -space. It is clear from the above argumentthat the coefficient of multiple determination R2 is also robust. Hence, for R-fits R2 is thepreferred coefficient of determination.

However, Fϕ is not bounded in the x-space. In Chapter 5 we present statistics whoseinfluence function are bounded in both spaces; although, they are less efficient.

The asymptotic distribution of qFϕ was derived in Section 3.6; however, we can use theabove result on the influence function to immediately display it. If we expand Q2 into a

Page 464: Robust Nonparametric Statistical Methods

454 APPENDIX A. ASYMPTOTIC RESULTS

vonMises expansion at H , we have

Q2(Hs) = Q2(H) +∂Q2

∂s|s=0 +

1

2

∂2Q2

∂s2|s=0 +R

=

[∫ϕ(F (y − x′βr)x

′d∆x0,y0(x, y)

] [Σ−1 −Σ+

]

·[∫

ϕ(F (y − x′βr)xd∆x0,y0(x, y)

]+R . (A.5.23)

Upon substituting the empirical distribution function for ∆x0,y0 in expression ( A.5.23), wehave at the sample

nQ2(Hs) =

[1√n

n∑

i=1

x′iϕ

(1

nR(Yi − x′

iβr)

)][Σ−1 − Σ+

][

1√n

n∑

i=1

xiϕ

(1

nR(Yi − x′

iβr)

)]+op(1) .

This expression is equivalent to the expression ( 3.6.11) which yields the asymptotic distri-bution of the test statistic in Section 3.6.

A.5.3 Influence Function of βHBR of Chapter 5

The influence function of the high breakdown estimator βHBR is discussed in Section 3.12.4.In this section, we restate Theorem A.5.24 and then derive a proof of it.

Theorem A.5.1. The influence function for the estimate βHBR is given by

Ω(x0, y0, βHBR) = C−1H

1

2

∫ ∫(x0−x1)b(x1,x0, y1, y0)sgny0−y1 dF (y1)dM(x1) , (A.5.24)

where CH is given by expression ( 3.12.22).

Proof: Let ∆0(x, y) denote the distribution function of the point mass at the point (x0, y0)and consider the contaminated distribution Ht = (1 − t)H + t∆0 for 0 < t < 1. Let β(Ht)denote the functional at Ht. Then β(Ht) satisfies

0 =

∫ ∫x1b(x1,x2, y1, y2)

[I(y2 − y1 < (x2 − x1)

′β(Ht)) −1

2

]dHt(x1, y1)dHt(x2, y2) .

(A.5.25)We next implicitly differentiate ( A.5.25) with respect to t to obtain the derivative of thefunctional. The value of this derivative at t = 0 is the influence function. Without loss ofgenerality, we can assume that the true parameter β = 0. Under this assumption x and yare independent. Substituting the value of Ht into ( A.5.25) and expanding we obtain the

Page 465: Robust Nonparametric Statistical Methods

A.6. ASYMPTOTIC THEORY FOR CHAPTER 5 455

four terms:

0 = (1 − t)2

∫ ∫ ∫x1

[∫ y1+(x2−x1)′β(Ht)

−∞b(x1,x2, y1, y2)dF (y2) −

1

2

]dM(x2)dM(x1)dF (y1)

+(1 − t)t

∫ ∫ ∫ ∫x1b(x1,x2, y1, y2)

[I(y2 − y1 < (x2 − x1)

′β(H)) − 1

2

]dM(x2)dF (y2)d∆0(x1, y1)

+(1 − t)t

∫ ∫ ∫ ∫x1b(x1,x2, y1, y2)

[I(y2 − y1 < (x2 − x1)

′β(H)) − 1

2

]d∆0(x2, y2)dM(x1)dF (y1)

+t2∫ ∫ ∫ ∫

x1b(x1,x2, y1, y2)

[I(y2 − y1 < (x2 − x1)

′β(H)) − 1

2

]d∆0(x2, y2)dδ0(x1, y1) .

Let β denote the derivative of the functional evaluted at 0. Proceeding to implicitly differ-entiate this equation and evaluating the derivative at 0, we get, after some derivation,

0 =

∫ ∫ ∫x1b(x1,x2, y1, y1)f

2(y1)(x2 − x1)′ dy1dM(x1)dM(x2) β

+

∫ ∫x0b(x0,x2, y0, y2)

[I(y2 < y0) −

1

2

]dF (y2)dM(x2)

+

∫ ∫x1b(x1,x0, y1, y0)

[I(y0 < y1) −

1

2

]dF (y1)dM(x1)

Once again using the symmetry in the x arguments and y arguments of the function b, wecan simplify this expression to

0 = −

1

2

∫ ∫ ∫(x2 − x1)b(x1,x2, y1, y1)(x2 − x1)

′f 2(y1) dy1dM(x1)dM(x2)

β

+

∫ ∫(x0 − x1)b(x1,x0, y1, y0)

[I(y1 < y0) −

1

2

]dF (y1)dM(x1) .

Using the relationship between the indicator function and the sign function and the definitionof CH ,( 3.12.22), we can rewrite this last expression as

0 = −CH β +1

2

∫ ∫(x0 − x1)b(x1,x0, y1, y0)sgny0 − y1 dF (y1)dM(x1) .

Solving for β leads to the desired result.

A.6 Asymptotic Theory for Chapter 5

In this section we derive the results that are needed in Section 3.12.3 of Chapter 5. Theseresults were first derived by Chang (1995). Our development is taken from the article byChang, McKean, Naranjo and Sheather (1996). The main goal is to prove Theorem 3.12.2which we restate here:

Page 466: Robust Nonparametric Statistical Methods

456 APPENDIX A. ASYMPTOTIC RESULTS

Theorem A.6.1. Under assumptions (E.1), ( 3.4.1), and (H.1) - (H.4), ( 3.12.10) -( 3.12.13), √

n(βHBR − β)d−→ N( 0, (1/4)C−1ΣHC−1).

Besides the notation of Chapter 5, we need:

1. Wij(∆) = (1/2)[sgn(zj − zi) − sgn(yj − yi)],

where zj = yj − x′j∆/

√n . (A.6.1)

2. tij(∆) = (xj − xi)′∆/

√n . (A.6.2)

3. Bij(t) = E[bijI(0 < yi − yj < t)] . (A.6.3)

4. γij = B′ij(0)/E(bij) . (A.6.4)

5. Cn =∑

i<j

γijbij(xj − xi)(xj − xi)′ . (A.6.5)

6. R(∆) = n−3/2

[∑

i<j

bij(xj − xi)Wij(∆) + Cn∆/√n

]. (A.6.6)

Without loss of generality we will assume that the true β0 is 0. We begin with thefollowing lemma.

Lemma A.6.1. Under assumptions (E.1), ( 3.4.1), and (H.1), ( 3.12.13),

B′ij(t) =

∫ ∞

−∞· · ·∫ ∞

−∞b(xi, xj , yj + t, yj, β0) f(yj + t) f(yj)

k 6=i,jf(yk) dy1 · · · dyn

is continuous in t.

Proof: This result follows from ( 3.4.1), ( 3.12.13) and an application of Leibnitz’s rule ondifferentiation of definite integrals.

Let ∆ be arbitrary but fixed. Denote Wij(∆) by Wij , suppressing dependence on ∆.

Lemma A.6.2. Under assumptions (E.1), ( 3.4.1), and (H.4), ( 3.12.13), there exist con-stants |ξij| < |tij| such that E(bijWij) = −tij B′

ij(ξij).

Proof: Since Wij = 1, −1, or 0 according as tij < yj−yi < 0, 0 < yj−yi < tij , or otherwise,we have

Eβ0(bijWij) =

tij<yj−yi<0

bijfY(y) dy −∫

0<yj−yi<tij

bijfY(y) dy .

When tij > 0, E(bijWij) = −Bij(tij) = Bij(0) − Bij(tij) = −tij B′ij(ξij) by Lemma A.6.1

and the Mean Value Theorem. The same result holds for tij < 0, which proves the lemma.

Lemma A.6.3. Under assumptions (H.3), ( 3.12.12), and (H.4), ( 3.12.13), we have

bij = gij(β0) = gij(0) + [gij(ξ)]′β0 = gij(0) +Op(1/√n),

uniformly over all i and j, where ‖ξ‖ ≤ ‖β0‖.

Page 467: Robust Nonparametric Statistical Methods

A.6. ASYMPTOTIC THEORY FOR CHAPTER 5 457

Proof: Follows from a multivariate Mean Value Theorem (see e.g. page 355 of Apostol, 1974),and by ( 3.12.12) and ( 3.12.13).

Lemma A.6.4. Under assumptions ( 3.12.10)-( 3.12.13), ( 3.4.1), ( 3.4.7), and ( 3.4.8),

(i) E(gij(0)gik(0)WijWik) −→ 0, as n→ ∞(ii) E(gij(0)Wij) −→ 0, as n→ ∞

uniformly over i and j.

Proof: Without loss of generality, let tij > 0 and tik > 0, where the indices i, j and k are alldifferent. Then

E(gij(0)gik(0)WijWik) = E[gijgikI(0 < yj − yi < tij) I(0 < yk − yi < tik)]

=

∣∣∣∣∫ ∞

−∞

∫ yi+tik

yi

∫ yi+tij

yi

gijgik fifjfk dyjdykdyi

∣∣∣∣ .

Assumptions ( 3.4.7) and ( 3.4.8) imply (1/n)maxi(xik − xk)2 → 0 for all k, or equivalently

(1/√n)maxi|xik − xk| → 0 for all k, which implies that tij → 0. Since the integrand is

bounded, this proves (i).

Similarly, E(gij(0)Wij) =∫∞−∞∫ yi+tijyi

gijfifjdyjdyi → 0, which proves (ii).

Lemma A.6.5. Under assumptions ( 3.12.10)-( 3.12.13), ( 3.4.1), ( 3.4.7), and ( 3.4.8),

(i) Cov(b12W12, b34W34) = o(n−1).

(ii) Cov(b12W12, b34) = o(n−1).

(iii) Cov(b12W12, b13W13) = o(1).

(iv) Cov(b12W12, b13) = o(1).

Proof: To prove (i), recall that b12 = g12(0) + [g12(ξ)]′ · β0. Thus

Cov(b12W12, b34W34) = Cov(g12(0)W12, g34(0)W34)

+2 Cov([g12(ξ)]′ · β0 W12, g34(0)W34)

+Cov([g12(ξ)]′ · β0 W12, [g34(ξ)]′ · β0 W34).

Let I1, I2 and I3 denote the three terms on the right hand side. The term I1 is 0, byindependence. Now,

I2 = 2E

[g12(ξ)]′ · β0 W12 g34(0)W34

− 2E

[g12(ξ)]′ · β0 W12

E g34(0)W34

= I21 − I22.

Write the first term above as

I21 = 2(1/n)E

[g12(ξ)]′ · β0 g34(0)(√nW12)(

√nW34)

.

Page 468: Robust Nonparametric Statistical Methods

458 APPENDIX A. ASYMPTOTIC RESULTS

The term [g12(ξ)]′ · β0 = b12 − g12(0) is bounded and of magnitude op(1). If we can showthat

√nW12 is integrable, then it follows using standard arguments that I21 = o(1/n). Let

F ∗ denote the cdf of y2 − y1 and f ∗ denote its pdf. Using the mean value theorem,

E[√nW12(∆)] =

√n(1/2)E[sgn(y2 − y1 − (x2 − x1)

′)∆/√n) − sgn(y2 − y1)]

=√n(1/2)[2F ∗(−(x2 − x1)

′)∆/√n) − 2F ∗(0)]

= −√nf ∗(ξ∗)(x2 − x1)

′∆/√n ≤ f ∗(ξ∗)|(x2 − x1)

′∆| ,for |ξ ∗ | < |(x2 − x1)

′∆/√n|. The right side of the inequality in expression ( A.6.7) is

bounded. This proves that I21 = o(1/n). Similarly,

I22 = 2(1/n)E

[g12(ξ)]′ · β0 (√nW12)

Eg34(0)(

√nW34)

= o(1/n),

which proves I2 = 0.The term I3 can be shown to be o(n−1) similarly, which proves (i). The proof of (ii) is

analogous to (i). To prove (iii), note that

Cov(b12W12, b13W13) = Cov(g12(0)W12, g13(0)W13)

+ 2 Cov([g12(ξ)]′ · β0 W12, g13(0)W13)

+ Cov([g12(ξ)]′ · β0 W12, [g13(ξ)]′ · β0 W13).

The first term is o(1) by Lemma A.6.4. The second and third terms are clearly o(1). Thisproves (iii). Result (iv) is analogously proved.

We are now ready to state and prove asymptotic linearity. Consider the negative gradientfunction

S(β) = −D(β) =∑∑

i<j

bijsgn(zj − zi)(xj − xi). (A.6.7)

Theorem A.6.2. Under assumptions ( 3.12.10)-( 3.12.13), ( 3.4.1), ( 3.4.7), and ( 3.4.8).Then

sup‖√nβ‖≤C

n−3/2 [S(β) − S(0) + 2 Cnβ]p−→ 0.

Proof: Write R(∆) =[S(n−1/2∆) − S(0) + 2n−1/2 Cn∆

]. We will show that

sup‖∆‖≤C

R(∆) = 2 sup‖∆‖≤C

n−3/2

∑∑

i<j

bij(xj − xi) Wij(∆) + n−1/2Cn∆

p−→ 0.

It will suffice to show that each component converges to 0. Consider the kth component

Rk(∆) = 2

[n−3/2

∑∑

i<j

bij(xjk − xik) Wij(∆) +∑∑

i<j

γijbij(xjk − xik) tij

]

= 2 n−3/2∑∑

i<j

(xjk − xik)(bijWij + γijtijbij).

Page 469: Robust Nonparametric Statistical Methods

A.6. ASYMPTOTIC THEORY FOR CHAPTER 5 459

We will show that E(Rk(∆)) → 0 and V ar(Rk(∆)) → 0. By Lemma A.6.2 and thedefinition of γij,

E(Rk) = 2 n−3/2∑∑

i<j

(xjk − xik)[E(bijWij) + γijtijE(bij)]

= 2 n−3/2∑∑

i<j

(xjk − xik)tij [B′ij(0) − B′

ij(ξij)]

≤ 2 n−3/2

[∑∑

i<j

(xjk − xik)2

]1/2 [∑∑

i<j

t2ij

]1/2

supi,j

|B′ij(0) −B′

ij(ξij)|

= 2

[(1/n2)

∑∑

i<j

(xjk − xik)2

]1/2 [(1/n)

∑∑

i<j

t2ij

]1/2

supi,j

|B′ij(0) −B′

ij(ξij)| −→ 0

since (1/n)∑∑

i<j t2ij = (1/n)∆′X′X∆ = O(1) and supi,j |B′

ij(0)−B′ij(ξij)| → 0 by Lemma

A.6.1.Next, we will show that V ar(Rk) → 0.

V ar(Rk) = V ar

[2 n−3/2

∑∑

i<j

(xjk − xik)(bijWij + γijtijbij)

]

= V ar

[2 n−3/2

n∑

i=1

n∑

j=1

(xjk − xk)(bijWij + γijtijbij)

]

= 4 n−3n∑

i=1

n∑

j=1

(xjk − xk)2V ar(bijWij + γijtijbij)

+4 n−3∑∑∑ ∑

(i,j)6=(l,m)

(xjk − xk)(xmk − xk)Cov(bijWij + γijtijbij , blmWlm + γlmtlmblm).

The double sum term above goes to 0, since there there n2 bounded terms in the doublesum, multiplied by n−3. There are two types of covariance terms in the quadruple sum,covariance terms with all four indices different, e.g. ((i, j), (l,m)) = ((1, 2), (3, 4)), andcovariance terms with one index of the first pair equal to one index of the second pair, e.g.((i, j), (l,m)) = ((1, 2), (1, 3)). Since there are of magnitude n4 terms with all four indicesdifferent, we need to show that each covariance term is o(n−1). This immediately followsfrom Lemma A.6.5. Finally there are of magnitude n3 covariance terms with one sharedindex, and we need to show each term is o(1). Again, this immediately follows from LemmaA.6.5. Hence, we have established the desired result.

Next define the approximating quadratic process,

Q(β) = D(0) −∑∑

i<j

bijsgn(yj − yi)(xj − xi)′β + β′Cnβ . (A.6.8)

Page 470: Robust Nonparametric Statistical Methods

460 APPENDIX A. ASYMPTOTIC RESULTS

Let

D∗(∆) = n−1D(n−1/2∆) (A.6.9)

and

Q∗(∆) = n−1Q(n−1/2∆) . (A.6.10)

Note that minimizing D∗(∆) andQ∗(∆) is equivalent to minimizingD(n−1/2∆) andQ(n−1/2∆),respectively.

The next result is asymptotic quadraticity.

Theorem A.6.3. Under assumptions ( 3.12.10)-( 3.12.13), ( 3.4.1), ( 3.4.7), and ( 3.4.8),for a fixed constant C and for any ǫ > 0,

P

(sup

‖∆‖<C|Q∗(∆) −D∗(∆)| ≥ ǫ

)→ 0 . (A.6.11)

Proof: Since ∂Q∗

∂∆− ∂D∗

∂∆= 2n−3/2

[∑∑i<j bij(xj − xi)Wij + C(n−1/2∆)

]= R(∆), it follows

from Theorem A.6.2 that for ǫ > 0 and C > 0,

P

(sup

‖∆‖<C‖ ∂Q

∂∆− ∂D∗

∂∆‖≥ ǫ/C

)→ 0.

For 0 ≤ t ≤ 1, let ∆t = t ∆. Then

∣∣∣∣d

dt[Q∗(∆t) −D∗(∆t)]

∣∣∣∣ =

∣∣∣∣∣

p∑

k=1

∆k(∂Q∗

∂∆tk

− ∂D∗

∂∆tk

)

∣∣∣∣∣

≤ ‖ ∆ ‖ sup‖∆‖<C

‖ ∂Q∗

∂∆− ∂D∗

∂∆‖<‖ ∆ ‖ (ǫ/C) < ǫ

with probability approaching 1. Now, let h(t) = Q∗(∆t) −D∗(∆t). By the previous result,we have |h′(t)| < ǫ with high probability. Thus

|h(1)| = |h(1) − h(0)| =

∣∣∣∣∫ 1

0

h′(t) dt

∣∣∣∣ ≤∫ 1

0

|h′(t)| dt < ǫ,

with probability approaching one. This proves the theorem.

The next theorem states asymptotic normality of S(0).

Theorem A.6.4. Under assumptions ( 3.12.10)-( 3.12.13), ( 3.4.1), ( 3.4.7), and ( 3.4.8),

n−3/2S(0)D→ N(0, ΣH) . (A.6.12)

Page 471: Robust Nonparametric Statistical Methods

A.6. ASYMPTOTIC THEORY FOR CHAPTER 5 461

Proof: Let SP denote the projection of S∗(0) = n−3/2S(0) onto the space of linear combina-tions of independent random variables. Then

SP =n∑

k=1

E[S∗(0)|yk] =n∑

k=1

E

[n−3/2

∑∑

i<j

(xj − xi)bijsgn(yj − yi) | yk]

=n∑

k=1

n−3/2

[k−1∑

i=1

(xk − xi)E[biksgn(yk − yi)|yk] +n∑

j=k+1

(xj − xk)E[bkjsgn(yj − yk)|yk]]

= n−3/2

n∑

k=1

n∑

j=1

(xj − xk)E[bkjsgn(yj − yk)|yk]

= (1/√n)

n∑

k=1

Uk,

where Uk is defined in expression ( 3.12.9) of Chapter 4. By assumption (D.3), ( 3.4.8),and a multivariate extension of the Lindeberg-Feller theorem (Rao, 1973), it follows thatSP ∼ AN(0, ΣH). If we show that E ‖ SP − S∗(0) ‖2→ 0, then it follows from theProjection Theorem 2.4.6 that S∗(0) has the same asymptotic distribution as SP , and theproof will be done. Equivalently, we may show that E(SP,r−S∗

r (0))2 → 0 for each componentr = 1, . . . , p. Since for each r we have E(SP,r − S∗

r (0)) = 0, then

E(SP,r − S∗r (0))2 = V ar(SP,r − S∗

r (0))

= V ar

[n−3/2

n∑

k=1

n∑

j=1

(xjr − xkr) E[bkjsgn(yj − yk)|yk] − bkjsgn(yj − yk)]

≡ V ar

[n−3/2

n∑

k=1

n∑

j=1

T (yj, yk)

]

= n−3n∑

k=1

n∑

j=1

V ar(T (yj, yk)) + n−3∑

k

j

l

m

Cov[T (yj, yk), T (yl, ym)]

where the quadruple sum is taken over (j, k) 6= (l,m). The double sum term goes to 0 sincethere are n2 bounded terms divided by n3. There are two types of covariance terms in thequadruple sum: terms with four different indices , and terms with three different indices (i.e.,one shared index). Covariance terms with four different indices are zero (this can be shownby writing out the covariance in terms of expectations, and using symmetry to show thateach covariance term is zero). Thus we only need to consider covariance terms with threedifferent indices and show that the sum goes to 0. Letting k be the shared index (withoutloss of generality), and noting that E T (yj, yk) = 0 for all j, k, we have

Page 472: Robust Nonparametric Statistical Methods

462 APPENDIX A. ASYMPTOTIC RESULTS

n−3∑

k

j 6=k

l 6=k,jCov[T (yj, yk), T (yl, yk)]

= n−3∑

k

j 6=k

l 6=k,jE T (yj, yk) · T (yl, yk)

= n−3∑

k

j 6=k

l 6=k,jE [E(bkjsgn(yj − yk)|yk) − bkjsgn(yj − yk)]

· [E(bklsgn(yl − yk)|yk) − bklsgn(yl − yk)]= n−3

k

j 6=k

l 6=k,jE [E(gkj(0) sgn(yj − yk)|yk) − gkj(0) sgn(yj − yk)]

· [E(gkl(0) sgn(yl − yk)|yk) − gkl(0) sgn(yl − yk)] + op(1)

where the last equality follows from the relation bkj = gkj(0) + 0p(1/√n). Expanding the

product, each term in the triple sum may be written as

E[E(gkj(0) sgn(yj − yk)|yk)]2

+ E gkj(0) sgn(yj − yk)gkl(0) sgn(yl − yk)

−2 E gkj(0) sgn(yj − yk) [E(gkl(0) sgn(yl − yk)|yk)]= (1 + 1 − 2)

[E(gkj(0) sgn(yj − yk)|yk)]2

= 0

where the first equality follows by taking conditional expectations with respect to k insideappropriate terms.

A similar method applies to terms where k is not the shared index. The theorem isproved.Proof of Theorem A.6.1: Let β denote the value which minimizes Q(β). Then β is thesolution to

0 = S(0) − 2Cnβ

so that√nβ = (1/2)n2C−1

n [n−3/2S(0)] ∼ AN(0, (1/4) C−1ΣC−1), by Theorem A.6.4 andAssumption (D.2), ( 3.4.7). It remains to show that

√n(β − β) = op(1). This follows from

Theorem A.6.3 and convexity of D(β), using standard arguments as in Jaeckel (1972).Studentized Residuals: NEEDS EDITED????????????????????????

Assume without loss of generality that α = 0, β = 0, and med ei = 0. In this section wemust further assume that the variance of ei is finite, i.e., σ2 < ∞. From the above proof ofTheorem ??, asymptotically β can be expressed as

√nβ =

1

2(n−2X′AnX)−1 1√

n

n∑

i=1

Uk + op(1) , (A.6.13)

where Uk is given in the first paragraph of Section 5. In this section, we will assume theweights bij are given. We use the approximation ( ??) in place of Uk, with a∗ij = bij/(

√12τ),

see ( ??). Hence, Uk is replaced by,

U∗k = −

√12τ

n2

n∑

j=1

(xj − xk)a∗kj(1 − 2F (ek)) . (A.6.14)

Page 473: Robust Nonparametric Statistical Methods

A.6. ASYMPTOTIC THEORY FOR CHAPTER 5 463

The estimate of α given by ( ??) can be expressed asymptotically as

α = τSn−1/2

n∑

i=1

sgn(ei) + op(1) ; (A.6.15)

see McKean et al. (1990). Using ( A.6.13) and ( A.6.15) we have the following first orderexpression for the residuals e∗i , ( 3.12.7),

e∗ .= e − τS

1

n

n∑

i=1

sgn(ei)1 − 1

2√nX(n−2X′A∗X)−1 1√

n

n∑

i=1

U∗k , (A.6.16)

where A∗ = [a∗ij ]. Because E[U∗k] = 0 and med ei = 0, taking expectations of both sides of

( A.6.16) leads toE[e∗]

.= E[e1]1 . (A.6.17)

WriteVar(e∗)

.= E[(e∗ − E[e1]1)(e∗ − E[e1]1)′] . (A.6.18)

The approximate variance-covariance matrix of e∗ can then be obtained by subtituting theright side of expression ( A.6.16) for e∗ in expression ( A.6.18) and then expanding and takingexpectations term-by-term. This is a tedious derivation, but by making use of E[U∗

k] = 0,med ei = 0,

∑ni=1 xi = 0, and

∑nj=1 a

∗ij =

∑nj=1 a

∗ji = 0 we obtain expression ( 5.4.17) of

Section 5.*************** cut this part some???????????????????????

Proof of Theorem 3.13.1

Let X1 = [1, X] denote the matrix of centered explanatory variables with a column of ones.Recall that β∗

LS = (X ′1X1)

−1X ′1Y . Since 1′X = 0, it can be shown that the vector of slope

parameters β satisfies βLS = (X ′X)−1X ′Y . Since Y = α1 +Xβ + e, we get the relation

βLS = β + (X ′X)−1X ′e. (A.6.19)

From McKean et al. (1990), we have the equivalent relation

βR = β +√

12τ(X ′X)−1X ′Fc(e) + op(n−1/2) (A.6.20)

where Fc(e) = [Fc(e1) − 1/2, . . . , Fc(en) − 1/2]′ is an n × 1 vector of independent randomvariables.

Now,

V ar(βLS − βR) = V ar(βLS) + V ar(βR) − 2E(βLS − β)(βR − β)′

= σ2(X ′X)−1 + τ 2(X ′X)−1 − 2√

12τE[(X ′X)−1X ′eF ′c(e)X(X ′X)−1]

= δ(X ′X)−1

where δ = σ2 + τ 2 − 2√

12τE[e1(F (e1) − 1/2)].Finally, note that from expressions ( A.6.19) and ( A.6.20) both βLS and βR are both

functions of the errors ei. Hence, asymptotic normality follows in the usual way by using(A2) to show that the Lindeberg condition holds.

Page 474: Robust Nonparametric Statistical Methods

464 APPENDIX A. ASYMPTOTIC RESULTS

Proof of Theorem 3.2

Recall that αLS = (Y ) = (1/n)∑n

i=1 yi. Defining the R-intercept as the median ofresiduals, it can be shown that

αR = (1/n)τs

n∑

i=1

sgn(yi) + op(n−1/2), (A.6.21)

which gives

V ar(αLS − αR) = (1/n2)∑n

i=1 V ar(yi − τssgn(yi))= (1/n2)

∑ni=1[V ar(yi) + τ 2

s V ar(sgn(yi)) − 2τscov(yi, sgn(yi))]= (1/n2)

∑ni=1[σ

2 + τ 2s − 2τsE(eisgn(ei))

= (1/n)[σ2 + τ 2s − 2τsE(e1sgn(e1))].

Next we need to show that the intercept and slope differences have zero covariance (i.e., thatthe off diagonal term of AD is 0). This follows from the fact that 1′X = 0. Asymptoticnormality follows as in the proof of Theorem 3.1.

Proof of Theorem 4.2

Since the intercept αGR is defined as the median of residuals, the linear approximationis the same as αR, so that the asymptotic variance is the same.

From Naranjo et al. (1994), we have the following approximation for the slope parameters:

βGR = β + (√

3τ/n)(X ′WX)−1S(β) + op(n−1/2)

where S(β) =∑∑

i<j wiwj(xi−xj)sgn(yi−yj−(xi−xj)′β) =∑∑

i<j wiwj(xi−xj)sgn(ei−ej) is a p− 1 × 1 random vector. Now,

V ar(βGR − βLS) = V ar(βGR) + V ar(βLS) − 2E(βGR − β)(βLS − β)′

= τ 2(X ′WX)−1(X ′W 2X)(X ′WX)−1

+σ2(X ′X)−1 − 2E[√

3τ/n)(X ′WX)−1S(β)][e′X(X ′X)−1].

It can be shown using an elementwise argument that E[S(β)e′] = 2nE[e1(F (e1)−1/2)]X ′Wand the result follows. Asymptotic normality follows as in the proof of Theorem 3.1.

Page 475: Robust Nonparametric Statistical Methods

Appendix B

Larger Data Sets

This appendix contains some of the larger data sets discussed in the book.

465

Page 476: Robust Nonparametric Statistical Methods

466 APPENDIX B. LARGER DATA SETS

Table A.0.1: data for Example 5.2.1. For each center, the columns are weight of the unit ofcrabgrass, Nitrogen level, Phosphorus level, Potassium level, and the density level.

Center 1 Center 2Wt N Ph Po D Wt N Ph Po D96.4 1 1 1 20 73.4 1 1 1 2086.1 1 1 0 20 77.8 1 1 0 2059.6 1 0 1 20 64.7 1 0 1 2068.9 1 0 0 20 47.1 1 0 0 2039.7 0 1 1 20 41.4 0 1 1 2035.4 0 1 0 20 40.6 0 1 0 2036.9 0 0 1 20 45.4 0 0 1 2042.2 0 0 0 20 79.0 0 0 0 2095.7 1 1 1 15 91.0 1 1 1 15107.3 1 1 0 15 60.6 1 1 0 1567.4 1 0 1 15 972.5 1 0 1 1570.1 1 0 0 15 68.6 1 0 0 1535.0 0 1 1 15 36.4 0 1 1 1538.3 0 1 0 15 50.8 0 1 0 1543.5 0 0 1 15 38.4 0 0 1 1553.7 0 0 0 15 29.1 0 0 0 15117.0 1 1 1 10 87.4 1 1 1 10123.4 1 1 0 10 86.7 1 1 0 1092.0 1 0 1 10 66.5 1 0 1 1072.0 1 0 0 10 56.0 1 0 0 1041.8 0 1 1 10 43.1 0 1 1 1039.1 0 1 0 10 53.4 0 1 0 1047.8 0 0 1 10 47.3 0 0 1 1039.8 0 0 0 10 55.1 0 0 0 10224.3 1 1 1 5 130.0 1 1 1 5162.7 1 1 0 5 128.6 1 1 0 588.2 1 0 1 5 82.4 1 0 1 554.0 1 0 0 5 91.0 1 0 0 553.3 0 1 1 5 46.5 0 1 1 575.3 0 1 0 5 50.3 0 1 0 547.4 0 0 1 5 49.5 0 0 1 586.6 0 0 0 5 74.9 0 0 0 5

Page 477: Robust Nonparametric Statistical Methods

467

Table A.0.2: Data for CRP Data. Sub denotes subject, Grp denoted group. The covariateis the same vector for each subject and is given in the first row. For each subjects, theresponses (CRP) follow row by row in the order of the covariate.

Covariate (Times)−24 0 24 72 120

Sub Grp Responses (CRP)1 M 1.79 0.78 0.68 0.86 0.832 M 0.09 0.14 0.07 0.14 0.513 M 1.13 0.99 1.06 0.91 0.924 M 2.94 2.07 1.51 1.05 1.035 M 1.82 2.28 2.99 1.95 1.786 M 0.17 0.08 0.29 0.30 0.227 M 0.38 0.19 0.29 0.49 0.268 M 1.32 0.96 1.80 0.82 1.689 M 0.40 0.11 0.25 0.20 0.31

10 H 0.11 0.19 0.21 0.22 0.0811 H 0.15 0.17 0.15 0.11 0.2112 H 0.06 0.05 0.05 0.06 0.1613 H 0.31 0.20 0.28 0.14 0.2414 H 0.21 0.28 0.35 0.10 0.2015 H 0.36 0.37 4.54 1.80 1.1016 H 0.92 1.23 1.47 0.96 0.8117 H 0.52 0.52 0.51 0.31 0.3518 H 2.05 1.61 1.35 0.73 0.67

Page 478: Robust Nonparametric Statistical Methods

468 APPENDIX B. LARGER DATA SETS

Page 479: Robust Nonparametric Statistical Methods

Bibliography

[1] Adichi, J. N. (1978), Rank tests of sub-hypotheses in the general regression model,Annals of Statistics, 6, 1012-1026.

[2] Afifi, A. A. and Azen, S. P. (1972), Statistical Analysis: A Computer Oriented Approach,New York: Academic Press.

[3] Akritas, M. G. (1990), The rank transform method in some two-factor designs, Journalof the American Statistical Association, 85, 73-78.

[4] Akritas, M. G. (1991), Limitations of the rank transform procedure: A study of repeatedmeasures designs, Part I, Journal of the American Statistical Association, 86, 457-460.

[5] Akritas, M. G. (1993), Limitations of the rank transform procedure: A study of repeatedmeasures designs, Part II, Statistics and Probability Letters, 17, 149-156.

[6] Akritas, M. G. and Arnold, S. F. (1994), Fully nonparametric hypotheses for factorialdesigns I: Multivariate repeated measures designs, Journal of the American StatisticalAssociation, 89, 336-343.

[7] Akritas, M. G., Arnold, S. F. and Brunner, E. (1997), Nonparametric hypotheses andrank statistics for unbalanced factorial designs, Journal of the American Statistical As-sociation, 92, 258-265.

[8] Ammann, L. P. (1993), Robust singular value decompositions: A new approach toprojection pursuit, Journal of the American Statistical Association, 88, 505-514.

[9] Ansari, A. R., and Bradley, R. A. (1960), Rank-sum tests for dispersion, Annals ofMathematical Statistics, 31, 1174-1189.

[10] Apostal, T. M. (1974), Mathematical Analysis, 2nd Edition, Reading, Massachusetts:Addison-Wesley.

[11] Arnold, S. F. (1980), Asymptotic validity of F -tests for the ordinary linear model andthe multiple correlation model, Journal of the American Statistical Association, 75,890-894.

469

Page 480: Robust Nonparametric Statistical Methods

470 BIBLIOGRAPHY

[12] Arnold, S. F. (1981), The Theory of Linear Models and Multivariate Analysis, NewYork: John Wiley and Sons.

[13] Aubuchon, J. C. and Hettmansperger, T. P. (1984), A note on the estimation of theintegral of f 2(x), Journal of Statistical Inference and Planning, 9, 321-331.

[14] Aubuchon, J. C. and Hettmansperger, T. P. (1989), Rank-based inference for linearmodels: Asymmetric errors, Statistics and Probability Letters, 8, 97-107.

[15] Babu, G. J., and Koti, K. M. (1996), Sign test for ranked-set sampling, Communicationsin Statistics, Part A-Theory and Methods, 25(7), 1617-1630.

[16] Bahadur, R. R. (1967), Rates of convergence of estimates and test statistics, Annals ofMathematical Statistics, 31, 276-295.

[17] Bai, Z. D., Chen, X. R., Miao, B. Q., and Rao, C. R. (1990), Asymptotic theory of leastdistance estimate in multivariate linear models, Statistics, 21, 503-519.

[18] Bassett, G. and Koenker, R. (1978), Asymptotic theory of least absolute error regression,Journal of the American Statistical Association, 73, 618-622.

[19] Bedall, F. K. and Zimmerman, H. (1979), Algorithm AS143, the mediancenter, AppliedStatistics 28, 325-328.

[20] Belsley, D. A. Kuh, K. and Welsch, R. E. (1980), Regression Diagnostics, New York:John Wiley and Sons.

[21] Bickel, P. J. (1964), On some alternative estimates for shift in the p-variate one sampleproblem, Annals of Mathematical Statistics, 35, 1079-1090.

[22] Bickel, P. J. (1965), On some asymptotically nonparametric competitors of Hotelling’sT 2, Annals of Mathematical Statistics, 36, 160-173.

[23] Bickel, P. J. (1974), Edgeworth expansions in nonparametric statistics, Annals of Statis-tics, 2, 1-20.

[24] Bickel, P. J. (1976), Another look at robustness: A review of reviews and some newdevelopments, (Reply to Discussant)”, Scandinavian Journal of Statistics, 3, 167.

[25] Bickel, P. J. and Lehmann, E. L. (1975), Descriptive statistics for nonparametric model,II Location, Annals of Statistics, 3, 1045-1069.

[26] Blair, R. C., Sawilowsky, S. S. and Higgins, J. J. (1987), Limitations of the rank trans-form statistic in tests for interaction, Communications in Statistics, Part B-Simulationand Computation, 16, 1133-1145.

[27] Bloomfield, B. and Steiger, W. L. (1983), Least Absolute Deviations, Boston:Birkhauser.

Page 481: Robust Nonparametric Statistical Methods

BIBLIOGRAPHY 471

[28] Blumen, I. (1958), A new bivariate sign test, Journal of the American Statistical Asso-ciation, 53, 448-456.

[29] Bohn, L. L. and Wolfe, D. A. (1992), Nonparametric two-sample procedures for ranked-set samples data, Journal of the American Statistical Association, 87, 552-561.

[30] Boos, D. D. (1982), A test for asymmetry associated with the Hodges-Lehmann esti-mator, Journal of the American Statistical Association, 77, 647-651.

[31] Bose, A. and Chaudhuri, P. (1993), On the dispersion of multivariate median, Annalsof the Institute of Statistical Mathematics, 45, 541-550.

[32] Box, G. E. P. and Cox, D. R. (1964), An analysis of transformations, Journal of theRoyal Statistical Society, Series B, Methodological, 26, 211-252.

[33] Brown, B. M. (1983), Statistical uses of the spatial median, Journal of the Royal Sta-tistical Society, Series B, Methodological, 45, 25-30.

[34] Brown, B. M. (1985), Multiparameter linearizaton theorems, Journal of the Royal Sta-tistical Society, Series B, Methodological, 47, 323-331.

[35] Brown, B. M. and Hettmansperger, T. P. (1987a), Affine invariant rank methods in thebivariate location model, Journal of the Royal Statistical Society, Series B, Methodolog-ical, 49, 301-310.

[36] Brown, B. M. and Hettmansperger, T. P. (1987b), Invariant tests in bivariate modelsand the L1 criterion function, In: Statistical Data Analysis Based on the L1 Norm andRelated Methods, ed. Y. Dodge, 333-344. North Holland, Amsterdam.

[37] Brown, B. M. and Hettmansperger, T. P. (1989), An affine invariant version of the signtest, Journal of the Royal Statistical Society, Series B, Methodological, 51, 117-125.

[38] Brown, B. M. and Hettmansperger, T. P. (1994), Regular redescending rank estimates,Journal of the American Statistical Association, 89, 538-542.

[39] Brown, B. M., Hettmansperger, T. P., Nyblom, J., and Oja, H., (1992), On certainbivariate sign tests and medians, Journal of the American Statistical Association, 87,127-135.

[40] Brunner, E. and Neumann, N. (1986), Rank tests in 2×2 designs, Statistica Neerlandica,40, 251-272.

[41] Brunner, E. and Puri, M. L. (1996), Nonparametric methods in design and analysis ofexperiments, In: Handbook of Statistics, S. Ghosh and C. R. Rao, eds, 13, 631-703, TheNetherlands: Elsevier Science, B. V.

Page 482: Robust Nonparametric Statistical Methods

472 BIBLIOGRAPHY

[42] Carmer, S. G. and Swanson, M. R. (1973), An evaluation of ten pairwise multiplecomparison procedures by Monte Carlo methods, Journal of the American StatisticalAssociation, 68, 66-74.

[43] Chang, W. H. (1995), High break-down rank-based estimates for linear models, Unpub-lished Ph.D. Thesis, Western Michigan University, Kalamazoo, MI.

[44] Chang, W. H., McKean, J. W., Naranjo, J. D. and Sheather, S. .J. (1997), High break-down rank regression, Submitted.

[45] Chaudhuri, P. (1992), Multivariate location estimation using extension of R-estimatesthrough U-statistics type approach, Annals of Statistics, 20, 897-916.

[46] Chaudhuri, P. and Sengupta, D. (1993), Sign tests in multidimensional inference basedon the geometry of the data cloud, Journal of the American Statistical Association, 88,1363-1370.

[47] Chernoff, H. and Savage, I. R. (1958), Asymptotic normality and efficiency of certainnonparametric test statistics, Annals of Mathematical Statistics, 39, 972-994.

[48] Chiang, C.-Y. and Puri, M. L. (1984), Rank procedures for testing subhypotheses inlinear regression, Annals of the Institute of Statistical Mathematics, 36, 35-50.

[49] Chinchilli, V. M. and Sen, P. K. (1982), Multivariate linear rank statistics for profileanalysis, Journal of Multivariate Analysis, 12, 219-229.

[50] Choi, K. and Marden, J. (1997), An approach to multivariate rank tests in multivariateanalysis of variance, Journal of the American Statistical Association, To appear.

[51] Coakley, C. W. and Hettmansperger, T. P. (1992), Breakdown bounds and expectedtest resistance, Journal of Nonparametric Statistics, 1, 267-276.

[52] Conover, W. J. and Iman, R. L. (1981), Rank transform as a bridge between parametricand nonparametric statistics, The American Statistician, 35, 124-133.

[53] Conover, W. J., Johnson, M. E., and Johnson, M. M. (1981), A comparative studyof tests for homogeneity of variances, with applications to the outer continental shelfbidding data, Technometrics, 23, 351-361.

[54] Cook, R. D., Hawkins, D. M. and Weisberg, S. (1992), Comparison of model misspecifi-cation diagnostics using residuals from least mean of squares and least median of squaresfits, Journal of the American Statistical Association, 87, 419-424.

[55] Cook, R. D. and Weisberg, S. (1982), Residuals and Influence in Regression, New York:Chapman and Hall.

Page 483: Robust Nonparametric Statistical Methods

BIBLIOGRAPHY 473

[56] Cook, R. D. and Weisberg, S. (1989), Regression diagnostics with dynamic graphics,Technometrics, 31, 277-291.

[57] Cook, R. D. and Weisberg, S. (1994), An Introduction to Regression Graphics, NewYork: John Wiley and Sons.

[58] Croux, C. Rousseeuw, P. J. and Hossjer, O. (1994), Generalized S-estimators, Journalof the American Statistical Association, 89, 1271-1281.

[59] Cushney, A. R. and Peebles, A. R. (1905), The action of optical isomers, II, Hyoscines,Journal of Physiology, 32, 501-510.

[60] Davis, J. B. and McKean, J. W. (1993), Rank based methods for multivariate linearmodels, Journal of the American Statistical Association, 88, 245-251.

[61] Dietz, E. J. (1982), Bivariate nonparametric tests for the one-sample location problem,Journal of the American Statistical Association, 77, 163-169.

[62] Dixon, S. L. and McKean, J. W. (1996), Rank-based analysis of the heteroscedasticlinear model, Journal of the American Statistical Association, 91, 699-712.

[63] Dongarra, J. J. Bunch, J. R. Moler, C. B. and Stewart, G. W. (1979), Linpack Users’Guide, Philadelphia: SIAM.

[64] Donoho, D. L. and Huber, P. J. (1983), The notion of breakdown point, In: A Festschriftfor Erich L. Lehmann, Eds. P. J. Bickel, K. A. Doksum, J. L. Hodges Jr., 157-184,Belmont, CA: Wadsworth.

[65] Dowell, M. and and Jarratt, P. (1971), A modified regula falsi method for computingthe root of an equation, BIT 11, 168-171.

[66] DuBois, C., ed. (1960) Lowie’s Selected Papers in Anthropology. Berkeley: University ofCalifornia Press.

[67] Draper, D. (1988), Rank-based robust analysis of linear models. I. Exposition and re-view, Statistical Science, 3, 239-257.

[68] Draper, N. R. and Smith, H. (1966), Applied Regression Analysis, New York: JohnWiley and Sons.

[69] Ducharme, G. R. and Milasevic, P. (1987), Spatial median and directional data,Biometrika, 74, 212-215.

[70] Dwass, M. (1960), Some k-sample rank order tests, in I. Olkin, et al. (eds.), Contribu-tions to Probability and Statistics, Stanford: Stanford University Press.

[71] Efron, B. (1979), Bootstrap methods: another look at the jackknife, Annals of Statistics,7, 1-26.

Page 484: Robust Nonparametric Statistical Methods

474 BIBLIOGRAPHY

[72] Efron B. and Tibshirani, R. J. (1993), An Introduction to the Bootstrap, New York:Chapman and Hall.

[73] Eubank, R. L., LaRiccia, V. N. and Rosenstein (1992), Testing symmetry about anunknown median, via linear rank procedures, Journal of Nonparametric Statistics, 1,301-311.

[74] Fernholz, L. T. (1983), von Mises Calculus for Statistical Functionals, Lecture Notes inStatistics 19, New York: Springer.

[75] Fisher, N. I. (1987), Statistical Analysis for Spherical Data, Cambridge: CambridgeUniversity Press.

[76] Fisher, N. I. (1993), Statistical Analysis for Circular Data, Cambridge: CambridgeUniversity Press.

[77] Fix, E. and Hodges, J. L., Jr. (1955), Significance probabilities of the Wilcoxon test,Annals of Mathematical Statistics, 26, 301-312.

[78] Fligner, M. A. (1981), Comment, American Statistician, 35, 131-132.

[79] Fligner, M. A. and Hettmansperger, T. P. (1979), On the use of conditional asymptoticnormality, Journal of the Royal Statistical Society, Series B, Methodological, 41, 178-183.

[80] Fligner, M. A. and Killeen, T. J. (1976), Distribution-free two-sample test for scale,Journal of the American Statistical Association, 71, 210-213.

[81] Fligner, M. A. and Policello, G. E. (1981), Robust rank procedures for the Behrens-Fisher problem, Journal of the American Statistical Association, 76, 162-168.

[82] Fligner, M. A. and Rust, S. W. (1982), A modification of Mood’s median test for thegeneralized Behrens-Fisher problem, Biometrika, 69, 221-226.

[83] Fraser, D. A. S. (1957), Nonparametric Methods in Statistics, New York: John Wileyand Sons.

[84] Gastwirth, J. L. (1968), The first median test: A two-sided version of the control mediantest, Journal of the American Statistical Association, 63, 692-706.

[85] Gastwirth, J. L. (1971), On the sign test for symmetry, Journal of the American Sta-tistical Association, 66, 821-823.

[86] George, K. J., McKean, J. W., Schucany, W. R. and Sheather, S. J. (1995), A com-parison of confidence intervals from R-estimators in regression, Journal of StatisticalComputation and Simulation, 53, 13-22.

Page 485: Robust Nonparametric Statistical Methods

BIBLIOGRAPHY 475

[87] Ghosh, M. and Sen, P. K. (1971), On a class of rank order tests for regression withpartially formed stochastic predictors, Annals of Mathematical Statistics, 42, 650-661.

[88] Gower, J. C. (1974), The mediancenter, Applied Statistics, 32, 466-470.

[89] Graybill, F. A. (1976), Theory and Application of the Linear Model, North Scituate,Massachusetts: Duxbury.

[90] Graybill, F. A. (1983), Matrices with Applications in Statistics, Belmont, CA:Wadsworth.

[91] Graybill, F. A. and Iyer, H. K. (1994), Regression Analysis: Concepts and Applications,Belmont, California: Duxbury Press.

[92] Hadi, A. S. and Simonoff, J. S. (1993), Procedures for the identification of multipleoutliers in linear models, Journal of the American Statistical Association, 88, 1264-1272.

[93] Hajek, J. and Sidak, Z. (1967), Theory of Rank Tests, New York: Academic Press.

[94] Hald, A. (1952), Statistical Theory with Engineering Applications, New York: JohnWiley and Sons.

[95] Hampel, F. R. (1974), The influence curve and its role in robust estimation, Journal ofthe American Statistical Association, 69, 383-393.

[96] Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. J. (1986), RobustStatistics, the Approach Based on Influence Functions, New York: John Wiley and Sons.

[97] Hardy, G. H., Littlewood, J. E., and Polya, G. (1952), Inequalities. 2nd ed., Cambridge:Cambridge University Press.

[98] Hawkins, D. M., Bradu, D. and Kass, G. V. (1984), Location of several outliers inmultiple regression data using elemental sets, Technometrics, 26, 197 -208

[99] He, X., Simpson, D. G. and Portnoy, S. L. (1990), Breakdown robustness of tests,Journal of the American Statistical Association, 85, 446-452.

[100] Heiler, S. and Willers, R. (1988), Asymptotic normality of R-estimation the linearmodel, Statistics, 19, 173-184.

[101] Hendy, M. F. and Charles, J. A. (1970), The production techniques, silver content andcirculation history of the twelfth-century Byzantine, Archaeometry, 12, 13-21.

[102] Hettmansperger, T. P. (1984a), Statistical Inference Based on Ranks, New York: JohnWiley and Sons.

Page 486: Robust Nonparametric Statistical Methods

476 BIBLIOGRAPHY

[103] Hettmansperger, T. P. (1984b), Two-sample inference based on one-sample sign statis-tics, Applied Statistics, 33, 45-51.

[104] Hettmansperger, T. P. (1995), The rank-set sample sign test, Journal of NonparametricStatistics, 4, 263-270.

[105] Hettmansperger, T. P. and Malin, J. S. (1975), A modified Mood’s test for locationwith no shape assumptions on the underlying distributions, Biometrika, 62, 527-529.

[106] Hettmansperger, T. P. and McKean, J. W. (1978), Statistical inference based on ranks,Psychometrika, 43, 69-79.

[107] Hettmansperger, T. P. and McKean, J. W. (1983), A geometric interpretation of in-ferences based on ranks in the linear model, Journal of the American Statistical Asso-ciation, 78, 885-893.

[108] Hettmansperger, T. P. McKean, J. W. and Sheather, S. J. (1997), Rank-based analysesof linear models, Handbook of Statistics, 145-173, S. Ghosh and C. R. Rao, eds, 15,Amsterdam: Elsevier Science.

[109] Hettmansperger, T. P., Mottonen, J., and Oja, H. (1997a), Affine invariant multivari-ate one-sample signed-rank tests, Journal of the American Statistical Association, Toappear.

[110] Hettmansperger, T. P., Mottonen, J., and Oja, H. (1997b), Affine invariant multivari-ate two-sample rank tests, Statistica Sinica, To appear.

[111] Hettmansperger, T. P. and Oja, H. (1994), Affine invariant multivariate multisamplesign tests, Journal of the Royal Statistical Society, Series B, Methodological, 56, 235-249.

[112] Hettmansperger, T. P., Nyblom, J. and Oja, H. (1994), Affine invariant multivariateone-sample sign tests, Journal of the Royal Statistical Society, Series B, Methodological,56, 221-234.

[113] Hettmansperger, T. P. and Sheather, S. J. (1986), Confidence intervals based on in-terpolated order statistics, Statistics and Probability Letters, 4, 75-79.

[114] Hocking, R. R. (1985), The Analysis of Linear Models, Monterey, California:Brooks/Cole.

[115] Hodges, J. L. Jr. (1967), Efficiency in normal samples and tolerance of extreme valuesfor some estimates of location, In: Proceedings of the Fifth Berkeley Symposium onMathematical Statistics and Probability, 1, 163-186. Berkeley: University of CaliforniaPress.

[116] Hodges, J. L., Jr., and Lehmann, E. L. (1956), The efficiency of some nonparametriccompetitors of the t-test, Annals of Mathematical Statistics, 27, 324-335.

Page 487: Robust Nonparametric Statistical Methods

BIBLIOGRAPHY 477

[117] Hodges, J. L., Jr., and Lehmann, E. L. (1961), Comparison of the normal scores andWilcoxon tests, In: Proceedings of the Fourth Berkeley Symposium on MathematicalStatistics and Probability, 1, 307-317, Berkeley: University of California Press.

[118] Hodges, J. L. Jr. and Lehmann, E. L. (1962), Rank methods for combination of inde-pendent experiments in analysis of variance, Annals of Mathematical Statistics 33,482-497.

[119] Hodges, J. L., Jr., and Lehmann, E. L. (1963), Estimates of location based on ranktests. Annals of Mathematical Statistics, 34, 598-611.

[120] Hogg, R. V. (1974), Adaptive robust procedures: A partial review and some suggestionsfor future applications and theory, Journal of the American Statistical Association, 69,909-923.

[121] Hora, S. C. and Conover, W. J. (1984), The F-statistic in the two-way layout withrank-score transformed data, Journal of the American Statistical Association, 79, 688-673.

[122] Hossjer, O. (1994), Rank-based estimates in the linear model with high breakdownpoint, Journal of the American Statistical Association, 89, 149-158.

[123] Hossjer, O. and Croux, C. (1995), Generalizing univariate signed rank statistics fortesting and estimating a multivariate location parameter, Journal of NonparametricStatistics, 4, 293-308.

[124] otelling, H. (1951), A genralized T -test and measure of multivariate dispersion, InProceedings of the Second Berkeley Symposium on Mathematical Statistics, 23-41.

[125] Hφyland, A. (1965), Robustness of the Hodges-Lehmann estimates for shift, Annals ofMathematical Statistics, 36, 174-197.

[126] Hsu, J. C. (1996), Multiple Comparisons, London: Chapman Hall.

[127] Huber, P. J. (1981), Robust Statistics, New York: John Wiley and Sons.

[128] Huitema, B. E. (1980), The Analysis of Covariates and Alternatives, New York: JohnWiley and Sons.

[129] Iman, R. L. (1974), A power study of the rank transform for the two-way classificationmodel when interaction may be present, Canadian Journal of Statistics, 2, 227-239.

[130] International Mathematical and Statistical Libraries, Inc. (1987), User’s Manual:Stat/Library, Author: Houston, TX.

[131] Jaeckel, L. A. (1972), Estimating regression coefficients by minimizing the dispersionof the residuals, Annals of Mathematical Statistics, 43, 1449-1458.

Page 488: Robust Nonparametric Statistical Methods

478 BIBLIOGRAPHY

[132] Jan, S. L., and Randles, R. H. (1995), A multivariate signed sum test for the one-samplelocation problem, Journal of Nonparametric Statistics, 4, 49-63.

[133] Jan, S. L., and Randles, R. H. (1996), Interdirection tests for simple repeated measuresdesigns, Journal of the American Statistical Association, 91, 1611-1618.

[134] Johnson, G. D., Nussbaum, B. D., Patil, G. P., and Ross, N. P. (1996), Designingcost-effective environmental sampling using concomitant information. Chance, 9, 4-16.

[135] Jonckheere, A. R. (1954), A distribution-free k-sample tests against ordered alterna-tives, Biometrika, 41, 133-145.

[136] Jureckova, J. (1969), Asymptotic linearity of rank statistics in regression parameters,Annals of Mathematical Statistics, 40, 1449-1458.

[137] Jureckova, J. (1971), Nonparametric estimate of regression coefficients, Annals ofMathematical Statistics, 42, 1328-1338.

[138] Kahaner, D. and Moler, C. and Nash, S (1989), Numerical Methods and Software,Englewood Cliffs, New Jersey: Prentice Hall.

[139] Kalbfleisch, J. D. and Prentice, R. L. (1980), The Statistical Analysis of Failure TimeData, New York: John Wiley and Sons.

[140] Kapenga, J. A., McKean, J. W. and Vidmar, T. J. (1988), RGLM: Users Manual,Amer. Statist. Assoc. Short Course on Robust Statistical Procedures for the Analysisof Linear and Nonlinear Models, New Orleans.

[141] Kepner, J. C. and Robinson, D. H. (1988), Nonparametric methods for detecting treat-ment effects in repeated measures designs, Journal of the American Statistical Associ-ation, 83, 456-461.

[142] Killeen, T. J., Hettmansperger, T. P., and Sievers, G. L. (1972), An elementary theoremon the probability of large deviations, Annals of Mathematical Statistics, 43, 181-192.

[143] Klotz, J. (1962), Nonparametrics tests for scale, Annals of Mathematical Statistics, 33,498-512.

[144] Koul, H. L. (1992), Weighted Empiricals and Linear Models, Hayward, California:Institute of Mathematical Statistics.

[145] Koul, H. L. Sievers, G. L. and McKean, J. W. (1987), An estimator of the scale param-eter for the rank analysis of linear models under general score functions, ScandinavianJournal of Statistics, 14, 131-141.

[146] Kramer, C. Y. (1956), Extension of multiple range tests to group means with unequalnumbers of replications, Biometrics, 12, 307-310.

Page 489: Robust Nonparametric Statistical Methods

BIBLIOGRAPHY 479

[147] Kruskal, W. H. and Wallis, W. A. (1952), Use of ranks in one criterion variance analysis,Journal of the American Statistical Association, 57, 583-621.

[148] Larsen, R. J. and Stroup, D. F. (1976), Statistics in the Real World, New York: Macmil-lan.

[149] Lawless, J. F. (1982), Statistical Models and Methods for Lifetime Data, New York:John Wiley and Sons.

[150] Lawley, D. N. (1938), A generalization of Fisher’s z-test, Biometrika, 30, 180-187.

[151] Lehmann, E. L. (1975), Nonparametrics: Statistical Methods Based on Ranks, SanFrancisco: Holden-Day.

[152] Li, H. (1991), Rank Procedures for the Logistic Model, Unpublished PhD. Thesis, West-ern Michigan University, Kalamazoo, MI.

[153] Liu, R. Y. (1990), On a notion of data depth based on simplices, Annals of Statistics,18, 405-414.

[154] Liu, R. Y. and Singh, K. (1993), A quality index based on data depth and multivariaterank tests, Journal of the American Statistical Association, 88, 405-414.

[155] Lopuha, H. P. and Rousseeuw, P. J. (1991), Breakdown properties of affine equivariantestimators of multivariate location and covariance matrices, Annals of Statistics, 19,229-248.

[156] Magnus, J. R. and Neudecker, H. (1988), Matrix Differential Calculus with Applicationsin Statistics and Econometrics, New York: John Wiley and Sons.

[157] Mann, H. B. (1945), Nonparametric tests against trend, Econometrica, 13, 245-259.

[158] Mann, H. B. and Whitney, D. R. (1947), On a test of whether one of two randomvariables is stochastically larger than the other, Annals of Mathematical Statistics, 18,50-60.

[159] Mardia, K. V. (1972), Statistics of Directional Data, London: Academic Press.

[160] Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979), Multivariate Analysis, Orlando,Fl.: Academic Press.

[161] Maritz, J. S. (1981), Distribution-Free Statistical Methods, London: Chapman andHall.

[162] Maritz, J. S. and Jarrett, R. G. (1978), A note on estimating the variance of the samplemedian, Journal of the American Statistical Association, 73, 194-196.

Page 490: Robust Nonparametric Statistical Methods

480 BIBLIOGRAPHY

[163] Maritz, J. S., Wu, M., and Staudte, R. G., Jr. (1977), A location estimator based on a U-statistic, Annals of Statistics, 5, 779-786.

[164] Marsaglia, G. and Bray, T. A. (1964), A convenient method for generating normal variables, SIAM Review, 6, 260-264.

[165] Mason, R. L., Gunst, R. F., and Hess, J. L. (1989), Statistical Design and Analysis of Experiments, New York: John Wiley and Sons.

[166] Mathisen, H. C. (1943), A method of testing the hypothesis that two samples are from the same population, Annals of Mathematical Statistics, 14, 188-194.

[167] McIntyre, G. A. (1952), A method of unbiased selective sampling, using ranked sets, Australian Journal of Agricultural Research, 3, 385-390.

[168] McKean, J. W. and Hettmansperger, T. P. (1976), Tests of hypotheses of the general linear model based on ranks, Communications in Statistics, Part A-Theory and Methods, 5, 693-709.

[169] McKean, J. W. and Hettmansperger, T. P. (1978), A robust analysis of the general linear model based on one step R-estimates, Biometrika, 65, 571-579.

[170] McKean, J. W., Naranjo, J. D., and Sheather, S. J. (1996a), Diagnostics to detect differences in robust fits of linear models, Computational Statistics, 11, 223-243.

[171] McKean, J. W., Naranjo, J. D., and Sheather, S. J. (1996b), An efficient and high breakdown procedure for model criticism, Communications in Statistics, Part A-Theory and Methods, 25, 2575-2595.

[172] McKean, J. W. and Ryan, T. A., Jr. (1977), An algorithm for obtaining confidence intervals and point estimates based on ranks in the two sample location problem, ACM Transactions on Mathematical Software, 3, 183-185.

[173] McKean, J. W. and Schrader, R. (1980), The geometry of robust procedures in linear models, Journal of the Royal Statistical Society, Series B, Methodological, 42, 366-371.

[174] McKean, J. W. and Schrader, R. M. (1984), A comparison of methods for studentizing the sample median, Communications in Statistics, Part B-Simulation and Computation, 13, 751-773.

[175] McKean, J. W. and Sheather, S. J. (1991), Small sample properties of robust analyses of linear models based on R-estimates: A survey, In: Directions in Robust Statistics and Diagnostics, Part II, 1-20, Editors: W. Stahel and S. Weisberg, New York: Springer-Verlag.


[176] McKean, J. W., Sheather, S. J., and Hettmansperger, T. P. (1990), Regression diagnostics for rank-based methods, Journal of the American Statistical Association, 85, 1018-1028.

[177] McKean, J. W., Sheather, S. J., and Hettmansperger, T. P. (1991), Regression diagnostics for rank-based methods II, In: Directions in Robust Statistics and Diagnostics, Part II, 21-31, Editors: W. Stahel and S. Weisberg, New York: Springer-Verlag.

[178] McKean, J. W., Sheather, S. J., and Hettmansperger, T. P. (1993), The use and interpretation of residuals based on robust estimation, Journal of the American Statistical Association, 88, 1254-1263.

[179] McKean, J. W., Sheather, S. J., and Hettmansperger, T. P. (1994), Robust and high breakdown fits of polynomial models, Technometrics, 36, 409-415.

[180] McKean, J. W. and Sievers, G. L. (1987), Coefficients of determination for least absolute deviation analysis, Statistics and Probability Letters, 5, 49-54.

[181] McKean, J. W. and Sievers, G. L. (1989), Rank scores suitable for the analysis of linear models under asymmetric error distributions, Technometrics, 31, 207-218.

[182] McKean, J. W. and Vidmar, T. J. (1992), Using procedures based on ranks: cautions and recommendations, American Statistical Association 1992 Proceedings of the Biopharmaceutical Section, 280-289.

[183] McKean, J. W. and Vidmar, T. J. (1994), A comparison of two rank-based methods for the analysis of linear models, The American Statistician, 48, 220-229.

[184] McKean, J. W., Vidmar, T. J., and Sievers, G. L. (1989), A robust two stage multiple comparison procedure with application to a random drug screen, Biometrics, 45, 1281-1297.

[185] Merchant, J. A., Halprin, G. M., Hudson, A. R., Kilburn, K. H., McKenzie, W. N., Jr., Hurst, D. J., and Bermazohn, P. (1975), Responses to cotton dust, Archives of Environmental Health, 30, 222-229.

[186] Milasevic, P. and Ducharme, G. R. (1987), Uniqueness of the spatial median, Annals of Statistics, 15, 1332-1333.

[187] Mielke, P. W. (1972), Asymptotic behavior of two-sample tests based on the powers of ranks for detecting scale and location alternatives, Journal of the American Statistical Association, 67, 850-854.

[188] Miller, R. G. (1981), Simultaneous Statistical Inference, New York: Springer-Verlag.

[189] Mood, A. M. (1950), Introduction to the Theory of Statistics, New York: McGraw-Hill.


[190] Mood, A. M. (1954), On the asymptotic efficiency of certain nonparametric two-sample tests, Annals of Mathematical Statistics, 25, 514-533.

[191] Morrison, D. F. (1983), Applied Linear Statistical Methods, Englewood Cliffs, New Jersey: Prentice Hall.

[192] Möttönen, J. (1997a), SAS/IML Macros for spatial sign and rank tests, Mathematics Department, University of Oulu, Finland.

[193] Möttönen, J. (1997b), SAS/IML Macros for affine invariant multivariate sign and rank tests, Mathematics Department, University of Oulu, Finland.

[194] Möttönen, J., Hettmansperger, T. P., Oja, H., and Tienari, J. (1997), On the efficiency of multivariate affine invariant rank methods, Journal of Multivariate Analysis, In press.

[195] Möttönen, J. and Oja, H. (1995), Multivariate spatial sign and rank methods, Journal of Nonparametric Statistics, 5, 201-213.

[196] Möttönen, J., Oja, H., and Tienari, J. (1997), On efficiency of multivariate spatial sign and rank methods, Annals of Statistics, In press.

[197] Naranjo, J. D. and Hettmansperger, T. P. (1994), Bounded-influence rank regression, Journal of the Royal Statistical Society, Series B, Methodological, 56, 209-220.

[198] Naranjo, J. D., McKean, J. W., Sheather, S. J., and Hettmansperger, T. P. (1994), The use and interpretation of rank-based residuals, Journal of Nonparametric Statistics, 3, 323-341.

[199] Nelson, W. (1982), Applied Life Data Analysis, New York: John Wiley and Sons.

[200] Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. (1996), Applied Linear Statistical Models, 4th Ed., Chicago: Irwin.

[201] Niinimaa, A. and Oja, H. (1995), On the influence functions of certain bivariate medians, Journal of the Royal Statistical Society, Series B, Methodological, 57, 565-574.

[202] Niinimaa, A., Oja, H., and Nyblom, J. (1992), Algorithm AS 277: The Oja bivariate median, Applied Statistics, 41, 611-617.

[203] Niinimaa, A., Oja, H., and Tableman, M. (1990), The finite-sample breakdown point of the Oja bivariate median, Statistics and Probability Letters, 10, 325-328.

[204] Noether, G. E. (1955), On a theorem of Pitman, Annals of Mathematical Statistics, 26, 64-68.

[205] Noether, G. E. (1987), Sample size determination for some common nonparametric tests, Journal of the American Statistical Association, 82, 645-647.


[206] Numerical Algorithms Group, Inc. (1983), Library Manual Mark 15, Oxford: Numerical Algorithms Group.

[207] Nyblom, J. (1992), Note on interpolated order statistics, Statistics and Probability Letters, 14, 129-131.

[208] Oja, H. (1983), Descriptive statistics for multivariate distributions, Statistics and Probability Letters, 1, 327-333.

[209] Oja, H. and Nyblom, J. (1989), Bivariate sign tests, Journal of the American Statistical Association, 84, 249-259.

[210] Olshen, R. A. (1967), Sign and Wilcoxon test for linearity, Annals of Mathematical Statistics, 38, 1759-1769.

[211] Osborne, M. R. (1985), Finite Algorithms in Optimization and Data Analysis, Chichester: John Wiley and Sons.

[212] Peters, D. and Randles, R. H. (1990a), Multivariate rank tests in the two-sample location problem, Communications in Statistics, Part A-Theory and Methods, 19(11), 4225-4238.

[213] Peters, D. and Randles, R. H. (1990b), A multivariate signed-rank test for the one-sample location problem, Journal of the American Statistical Association, 85, 552-557.

[214] Pitman, E. J. G. (1948), Notes on nonparametric statistical inference, Unpublished notes.

[215] Policello, G. E., II, and Hettmansperger, T. P. (1976), Adaptive robust procedures for the one-sample location model, Journal of the American Statistical Association, 71, 624-633.

[216] Puri, M. L. (1968), Multisample scale problem: Unknown location parameters, Annals of the Institute of Statistical Mathematics, 40, 619-632.

[217] Puri, M. L. and Sen, P. K. (1971), Nonparametric Methods in Multivariate Analysis, New York: John Wiley and Sons.

[218] Puri, M. L. and Sen, P. K. (1985), Nonparametric Methods in General Linear Models, New York: John Wiley and Sons.

[219] Randles, R. H. (1989), A distribution-free multivariate sign test based on interdirections, Journal of the American Statistical Association, 84, 1045-1050.

[220] Randles, R. H., Fligner, M. A., Policello, G. E., and Wolfe, D. A. (1980), An asymptotically distribution-free test for symmetry versus asymmetry, Journal of the American Statistical Association, 75, 168-172.


[221] Randles, R. H. and Wolfe, D. A. (1979), Introduction to the Theory of Nonparametric Statistics, New York: John Wiley and Sons.

[222] Rao, C. R. (1948), Tests of significance in multivariate analysis, Biometrika, 35, 58-79.

[223] Rao, C. R. (1973), Linear Statistical Inference and Its Applications, 2nd Edition, New York: John Wiley and Sons.

[224] Rao, C. R. (1988), Methodology based on the L1-norm in statistical inference, Sankhya, Series A, 50, 289-313.

[225] Rockafellar, R. T. (1970), Convex Analysis, Princeton, New Jersey: Princeton University Press.

[226] Rousseeuw, P. J. (1984), Least median of squares regression, Journal of the American Statistical Association, 79, 871-880.

[227] Rousseeuw, P. J. and Leroy, A. M. (1987), Robust Regression and Outlier Detection, New York: John Wiley and Sons.

[228] Rousseeuw, P. J. and van Zomeren, B. C. (1990), Unmasking multivariate outliers and leverage points, Journal of the American Statistical Association, 85, 633-648.

[229] Rousseeuw, P. J. and van Zomeren, B. C. (1991), Robust distances: Simulations and cutoff values, In: Directions in Robust Statistics and Diagnostics, Part II, 195-203, Editors: W. Stahel and S. Weisberg, New York: Springer-Verlag.

[230] Savage, I. R. (1956), Contributions to the theory of rank order statistics: the two sample case, Annals of Mathematical Statistics, 27, 590-615.

[231] Sawilowsky, S. S. (1990), Nonparametric tests of interaction in experimental design, Review of Educational Research, 60, 91-126.

[232] Sawilowsky, S. S., Blair, R. C., and Higgins, J. J. (1989), An investigation of the Type I error and power properties of the rank transform procedure in factorial ANOVA, Journal of Educational Statistics, 14, 255-267.

[233] Scheffé, H. (1959), The Analysis of Variance, New York: John Wiley and Sons.

[234] Schrader, R. M. and McKean, J. W. (1977), Robust analysis of variance, Communications in Statistics, Part A-Theory and Methods, 6, 879-894.

[235] Schrader, R. M. and McKean, J. W. (1987), Small sample properties of least absolute values analysis of variance, In: Statistical Analysis Based on the L1-Norm and Related Methods, ed. Y. Dodge, 307-321, Amsterdam: North Holland.

[236] Schuster, E. F. (1975), Estimating the distribution function of a symmetric distribution, Biometrika, 62, 631-635.


[237] Schuster, E. F. (1987), Identifying the closest symmetric distribution or density function, Annals of Statistics, 15, 865-874.

[238] Schuster, E. F. and Becker, R. C. (1987), Using the bootstrap in testing symmetry versus asymmetry, Communications in Statistics, Part B-Simulation and Computation, 16, 19-84.

[239] Searle, S. R. (1971), Linear Models, New York: John Wiley and Sons.

[240] Sheather, S. J. (1987), Assessing the accuracy of the sample median: Estimated standard errors versus interpolated confidence intervals, In: Statistical Data Analysis Based on the L1 Norm and Related Methods, ed. Y. Dodge, 203-216, Amsterdam: North-Holland.

[241] Sheather, S. J., McKean, J. W., and Hettmansperger, T. P. (1997), Finite sample stability properties of the least median of squares estimator, Journal of Statistical Computation and Simulation, In press.

[242] Shirley, E. A. C. (1981), A distribution-free method for analysis of covariance based on rank data, Applied Statistics, 30, 158-162.

[243] Siegel, S. and Tukey, J. W. (1960), A nonparametric sum of ranks procedure for relative spread in unpaired samples, Journal of the American Statistical Association, 55, 429-444.

[244] Sievers, G. L. (1983), A weighted dispersion function for estimation in linear models, Communications in Statistics, Part A-Theory and Methods, 12(10), 1161-1179.

[245] Simonoff, J. S. and Hawkins, D. M. (1993), Algorithm AS 282: High breakdown regression and multivariate estimation, Applied Statistics, 42, 423-432.

[246] Simpson, D. G., Ruppert, D., and Carroll, R. J. (1992), On one-step GM-estimates and stability of inferences in linear regression, Journal of the American Statistical Association, 87, 439-450.

[247] Small, C. G. (1990), A survey of multidimensional medians, International Statistical Review, 58, 263-277.

[248] Speed, F. M., Hocking, R. R., and Hackney, O. P. (1978), Methods of analysis of linear models with unbalanced data, Journal of the American Statistical Association, 73, 105-112.

[249] Steel, R. G. D. (1960), A rank sum test for comparing all pairs of treatments, Technometrics, 2, 197-207.

[250] Stefanski, L. A., Carroll, R. J., and Ruppert, D. (1986), Optimally bounded score functions for generalized linear models with applications to logistic regression, Biometrika, 73, 413-424.


[251] Stewart, G. W. (1973), Introduction to Matrix Computations, New York: Academic Press.

[252] Stromberg, A. J. (1993), Computing the exact least median of squares estimate and stability diagnostics in multiple linear regression, SIAM Journal on Scientific Computing, 14, 1289-1299.

[253] Student (1908), The probable error of a mean, Biometrika, 6, 1-25.

[254] Tableman, M. (1990), Bounded-influence rank regression: A one-step estimator based on Wilcoxon scores, Journal of the American Statistical Association, 85, 508-513.

[255] Terpstra, T. J. (1952), The asymptotic normality and consistency for Kendall's test against trend, when ties are present, Indagationes Mathematicae, 14, 327-333.

[256] Thompson, G. L. (1991a), A note on the rank transform for interactions, Biometrika, 78, 697-701.

[257] Thompson, G. L. (1991b), A unified approach to rank tests for multivariate and repeated measures designs, Journal of the American Statistical Association, 86, 410-419.

[258] Thompson, G. L. (1993), Correction note to: A note on the rank transform for interactions (Vol. 78, 697-701), Biometrika, 80, 211.

[259] Thompson, G. L. and Ammann, L. P. (1989), Efficacies of rank-transform statistics in two-way models with no interaction, Journal of the American Statistical Association, 85, 519-528.

[260] Tierney, L. (1990), XLISP-STAT, New York: John Wiley and Sons.

[261] Tucker, H. G. (1967), A Graduate Course in Probability, New York: Academic Press.

[262] Vidmar, T. J., McKean, J. W., and Hettmansperger, T. P. (1992), Robust procedures for drug combination problems with quantal responses, Applied Statistics, 41, 299-315.

[263] Vidmar, T. J. and McKean, J. W. (1996), A Monte Carlo study of robust and least squares response surface methods, Journal of Statistical Computation and Simulation, 54, 1-18.

[264] Welch, B. L. (1937), The significance of the difference between two means when the population variances are unequal, Biometrika, 29, 350-362.

[265] Wilcoxon, F. (1945), Individual comparisons by ranking methods, Biometrics, 1, 80-83.

[266] Witt, L. D. (1989), Coefficients of Multiple Determination Based on Rank Estimates, Unpublished Ph.D. Thesis, Western Michigan University, Kalamazoo, MI.


[267] Witt, L. D., McKean, J. W., and Naranjo, J. D. (1994), Robust measures of association in the correlation model, Statistics and Probability Letters, 20, 295-306.

[268] Witt, L. D., Naranjo, J. D., and McKean, J. W. (1995), Influence functions for rank-based procedures in the linear model, Journal of Nonparametric Statistics, 5, 339-358.

[269] Ylvisaker, D. (1977), Test resistance, Journal of the American Statistical Association, 72, 551-556.

[270] Wang, M. H. (1996), Statistical Graphics: Applications to the R and GR Methods in Linear Models, Unpublished Ph.D. Thesis, Western Michigan University, Kalamazoo, MI.

[271] Wilks, S. S. (1960), Multidimensional statistical scatter, In: Contributions to Probability and Statistics in Honor of Harold Hotelling, ed. I. Olkin et al., 486-503, Stanford: Stanford University Press.


Index

accelerated failure time models, 214
added variable plot, 200
affine, 352
    transformation, 352
affine invariant rank methods
    Oja signed rank statistic, 391
algorithm
    tracing, 45
analysis of covariance models, see experimental designs
angle sign test, 366
    efficiency relative to Hotelling's T², 371, 376
anti-ranks, 10, 36, 46
argmin, 5
Arnold transformation, 335, 336
asymptotic linearity, 21
    L1, 24
    general signed-rank scores, 433
    linear model, 168, 439
        L1, 445
    two sample scores, 104
asymptotic power, 26
asymptotic power lemma, 26
asymptotic relative efficiency, 26
asymptotic representation
    R-estimates for mixed model, 325
    R-estimates in linear model, 176
    spatial median, 367
Behrens-Fisher problem, 135
    Mann-Whitney-Wilcoxon, 135
        modified, 140
    Mathisen's test, 138
    Welch t-test, 141
Blumen's bivariate sign test, 378
    efficiency relative to Hotelling's T², 378
bootstrap, 28
bounded in probability, 24
bounded influence, see HBR-estimates
breakdown, see general signed-rank scores, see Mann-Whitney-Wilcoxon
    L1 two sample, 116
    acceptance breakdown, 33
    asymptotic value, 31
    estimation, 30
    HBR-estimates, 239
    rejection breakdown, 33
        expected rejection breakdown, 34
Central Limit Theorem
    Lindeberg-Feller, 421
cluster correlated data, see mixed model
componentwise estimates, 359
    breakdown, 392
    efficiency, 360, 363
    Hodges-Lehmann estimate
        efficiency relative to the mean vector, 365
    influence function, 392
componentwise estimating equations, 356
componentwise tests
    sign tests, 361
    Wilcoxon test
        efficiency relative to Hotelling's T², 365
conditional residuals, 328
confidence interval, 7


    comparison of two samples, 60
    efficiency, 27
    estimate of standard error, 27
    interpolated confidence intervals, 57
    shift parameter, 62
consistent test, 19
    rank-based tests in linear model
        reduction in dispersion, 182
contaminated normal distribution, 25, 40
contiguity, 424
convex, 436
correlation model, 220
    Huber's condition, 221
    R-coefficient of multiple determination, 224
        properties of, 225
    traditional coefficient of multiple determination, 223
delete i model, 208
difference in means, 77
direct product, 396
dispersion function
    linear model, 155
        functional representation, 185
    one sample, 5
    quadratic approximation, 169
    two sample, 76
efficacy, 21
efficiency, 25
    bivariate, 354
    L1 versus L2, 25
elliptical model, 351
equivariance
    scale, 96
    translation, 96
equivariant estimator, 115
    bivariate, 352
examples, data
    correlation model
        Hald Data, 230
    diagnostics
        Cloud Data, 200, 206
        Free Fatty Acid Data, 210
    experimental design
        Box-Cox Data, 320
        LDL Cholesterol of Quail, 279, 283, 285, 287, 293
        Lifetime of Motors, 298
        Marketing Data, 304
        Pigs and Diets, 307
        Poland China Pigs, 283, 317
        Rat Data, 314
        Rate Data, 319
        Snake Data, 302, 318
    linear model
        Baseball Salaries Data, 160
        Bonds Data, 250
        Hawkins Data, 251
        Potency Data, 162, 202
        Quadratic Data, 246
        Stars Data, 234, 235, 246
        Telephone Data, 159
    log linear model
        Insulating Fluid Data, 218
    longitudinal model
        CRP data, 345
    mixed models
        crab grass, 330
        Milliken and Johnson Data, 337
    multivariate experimental design
        Paspalum Grass Data, 413
    multivariate linear model
        Tablet Potency Data, 408
    multivariate location
        Brains of Mice Data, 399
        Cork Borings Data, 367, 375
        Cotton Dust Data, 356
        Mathematics and Statistics Exam Scores, 383, 391
    one sample
        Cushney-Peebles Data, 13, 60
        Darwin Data, 144
        Shoshoni Rectangles Data, 15, 52
    proportional hazards


        Lifetime of Insulation Fluid, 122
    two samples
        Hendy and Charles Coin Data, 63, 82, 98
        Quail Data, 80, 106, 112, 115
exchangeable, 327
experimental designs
    analysis of covariance models, 300
        covariates, 300
    contrasts, 276
        estimation, 285
        hypotheses testing, 283
    incidence matrix, 277
    means model, 276
    medians model, 276, 286
    multiple comparison procedures, 288
        Bonferroni, 289
        experiment error rate, 288
        family error rate, 288
        pairwise confidence intervals, 293
        pairwise tests, joint rankings, 291
        pairwise tests, separate rankings, 292
        Protected LSD Procedure, 289
        Tukey, 290
        Tukey-Kramer, 291, 297
    multivariate
        means model, 412
        medians model, 412
    Oneway design, 277
    pseudo-observations, 287
    twoway model, 296
        additive model, 296
        interaction, 296
        main effect, 296
        profile plots, 296
extreme value distribution, 119
Fisher information, 49, 93
Fligner-Killeen
    rank-based R software (RBR), 129
folded aligned samples, 125
Friedman test statistic, 319
full model, 4
Gâteaux derivative, 446
GEE estimates, 340
GEEWR estimates, 342
    asymptotic distribution, 343
general rank scores, 165, see regression model
    piecewise linear, 106
    two sample scores, 100
        asymptotic linearity, 104
        estimate of shift, 101, 105
        gradient function, 101
        gradient rank test, 101
        normal scores, 106
        null distribution, 102
        pseudo-norm, 99
general scores
    computation, 101
general signed-rank scores, 45, 431
    asymptotic breakdown, 51
    confidence interval, 47
    derived from two sample general rank scores, 107
    efficacy, 48
    functional, 45
    gradient function, 45
    influence function, 448
    influence function of the estimate, 51
    linearity, 433
    local asymptotic distribution theory, 431, 432
    optimal score function, 47
    Pitman regular, 48
    test statistic
        asymptotic power lemma, 48
        null distribution, 47
generalized longitudinal model
    GEE estimates, 340
    GEEWR estimates, 342
    longitudinal data, 339
GF(2m1, 2m2), 215
GR-estimates, 233
gradient process, 5
gradient test, 6


    consistency, 20
hazard function, 118
HBR-estimates, 232
    asymptotic distribution, 237, 455
    breakdown, 239
    dispersion function
        quadratic approximation, 460
    gradient, 232
        asymptotic null distribution, 237, 460
    influence functions, 241
    intercept, 244
    linearity, 458
    pseudo-norm, 232
Helmert transformation, see Arnold transformation
high breakdown estimates, see HBR-estimates, see LTS-estimates
Hotelling's T², 352
HR estimate, 382
Huber's condition, 165
independent error model, see linear model
influence function, 32, 446, see R-estimates in linear model, see rank-based tests in linear model
interpolated confidence intervals, 57
intraclass correlation coefficient, 328
invariant test statistic, 352
JR estimates, 324
Kendall's τ, 261
Kronecker product, 396
Kruskal-Wallis, 282, see experimental designs, see multivariate linear model
lack of fit, 219
Lawley-Hotelling trace statistic, 397
least squares, see norm
LeCam's Lemmas, 424
Lehmann alternatives, 118
linear model, 153, see R-estimates in linear model, see rank-based tests in linear model, see multivariate linear model
    approximation of R-residual, 204
    external R-studentized residual, 208
    general linear hypotheses, 154
    independent error model, 323
    internal R-studentized residuals, 205
    least squares, 156
        reduction in sums of squares, 157
    L1-estimates, 194
        reduction in dispersion, 195
    longitudinal data, 323
    model misspecification, 196
        departure from orthogonality, 198
    R pseudo-norm, 154
    RDBETAS, 210
    RDCOOK, 209
    RDFFIT, 209
linear rank statistics, see regression model
linearity, see asymptotic linearity
location functional, 2
    center of symmetry, 4
    in two sample model, 74
location model
    multivariate location model, 351
    one sample, 2
    symmetric, 36
    two sample, 74
location parameter, 2
log linear models, 215
log-rank scores
    computation, 122
long form, see mixed model
longitudinal data, 323
    generalized model, 339
LTS-estimates, 234
Mann's test for trend, 70
Mann-Whitney-Wilcoxon, 78, see multivariate linear model
    asymptotic linearity, 95
    computation, 80
    confidence interval, 79, 92, 97
    efficacy, 95
    estimate of shift, 78, 96
        approximate distribution, 97
        breakdown, 116


        influence function, 117
        relative efficiency to difference in means, 97
    gradient, 78
    Hodges-Lehmann estimate, 78
    Pitman regular, 94
    τ, 95
        estimate of, 97
    test, 78
        consistency, 91
        efficiency relative to the t-test, 95
        general asymptotic distribution, 89
        null asymptotic distribution, 90
        null distribution theory, 83
        power function, 92
        projection, 87
        sample size determination, 96
        unbiased, 93
marginal residuals, 328
Mathisen's two sample test, 112
    confidence interval, 114
mean
    breakdown, 31
    influence function, 32
        two sample location, 117
    location functional, 3
mean shift model, 206
median, 7
    asymptotic distribution, 8
    bootstrap, 29
    breakdown, 31
    confidence interval, 8
    estimate of standard error, 28
    influence function, 32
    location functional, 3
    spatial median, 366
    standard deviation of, 8
minimum distance, 5
Minitab software, 12
    one-sample computation, 12
mixed model, 323
    general, 323
    long form, 323
    R estimates, 324
Mood's median test, 109
    confidence interval, 110
    efficacy, 111
    estimate of shift, 109
        influence function, 116
multiple comparison procedures, see experimental designs
multivariate linear model, 395
    estimating equations, 396
    Kruskal-Wallis, 402
    Mann-Whitney-Wilcoxon, 398
    means model, 412
    medians model, 412
    profile analysis, 418
    R-estimates, 404
    test for regression effect, 397
    tests of general hypotheses, 405
MWW, see Mann-Whitney-Wilcoxon
Noether's condition, 165
norm, 5, see pseudo-norm
    L1-norm, 7, see Mathisen's two sample test
        asymptotic linearity, 24
        confidence interval, 8
        dispersion, 7
        efficacy, 24
        estimating equation, 7
        gradient, 7
        Pitman regularity, 23
    L2-norm, 8
        t-test, 9
        efficacy, 24
        influence function, 449
        Pitman regularity, 24
    estimate induced by, 5
    Weighted L1-Norm, 9
        gradient function, 10
normal scores, 45
    breakdown, 51
    efficiency relative to the L2, 49


    empirical ARE, 52
Oja criterion function, 387
Oja median, 388
Oja sign test, 388
one sample location model, see norm
optimal score function
    one sample, 47
    two sample, 103
ordered alternative, 320
orthogonal transformation, 352
outlier, 206
paired design model, 143
    compared with completely randomized design, 143
Pitman regular, 21, 352
power function, 18
    asymptotic power lemma, 26
profile analysis, see multivariate linear model
projection theorem, 86
proportional hazards models, 118
    linear model, 215
    log exponential model, 119
    log rank test, 119
    Mann-Whitney-Wilcoxon, 118
pseudo-median, 11
    location functional, 3
pseudo-norm, 75, see linear model, see HBR-estimates
    L1-pseudo-norm, 108, 109
    L2-pseudo-norm, 77
        confidence interval, 77
        gradient, 76
        gradient test, 76
        Mann-Whitney-Wilcoxon, 78
        reduction in dispersion, 76
pure error dispersion, 219
QR-decomposition, 192
quadratic approximation
    L1, 445
    dispersion function linear model, 169, 439
quadraticity, 169
R rank-based software
    one-sample computation, 12
R software, 2
    url, 2
R-estimates for Arnold transformed model, 336, see also R-estimates for mixed model
    asymptotic distribution, 337
R-estimates for mixed model, 324
    asymptotic distribution, 325
R-estimates for simple mixed model, 327, see also R-estimates for mixed model
    asymptotic distribution, 327
R-estimates in linear model, 155
    asymptotic distribution, 169
    influence function, 170, 451
    intercept, 171
        joint asymptotic distribution, 173, 174, 176
    internal R-studentized residuals, 159
    Newton type algorithm, 192
    R normal equations, 156
randomized block design, see simple mixed model
randomized block designs, see mixed model
rank scores, see general rank scores
rank transform, 310
rank-based R software (RBR), 2, see url
    interpolated confidence intervals, 60
    one sample general scores, 45
    one sample sign, 12
    one sample t, 13
    one sample Wilcoxon, 12
    one sample, normal scores, 52, 69
    two sample log-rank scores, 122
    two sample general scores, 101
    two sample Mann-Whitney-Wilcoxon, 80
    two sample scale, 129
    Winsorized signed-rank Wilcoxon, 72


rank-based tests in linear model, see multivariate linear model
    aligned rank test, 181
    efficiency relative to LS, 184
    gradient test, 156
    influence function, 181
    reduction in dispersion, 157
    Fϕ, 157, 158
        asymptotic distribution, local alternatives, 183
        consistency, 182, 441
        influence function, 453
        null asymptotic distribution, 178
    scores test, 158, 166
        null asymptotic distribution, 180
    Wald type test, 158
        null asymptotic distribution, 180
ranked set sampling, 53
    Mann-Whitney-Wilcoxon, 124
reduced model, 6
reduction in dispersion, 6
regression model, 422
    linear rank statistics, 423
    local asymptotic distribution theory, 424, 427
    null distribution theory, 423
repeated measures, see mixed model
resolving test, 19
RGLM, 191
robust distance, 233
RSS, 53
sample size determination
    one sample, 26
    two sample, 96
scale problem
    unknown locations, 128
score test, 6
scores, see general signed-rank scores, see general rank scores
selection of predictors, 232
shift parameter, 74
    estimate of, 76
sign test, 8
    consistency, 20
    distribution free, 8
    nonparametric, 8
    null distribution, 8
    Pitman regularity, 23
    ranked set sampling, 54
    rejection breakdown, 33
signed-rank scores, see general signed-rank scores
signed-rank statistics, see general signed-rank scores
simple mixed model, 327
    conditional residuals, 328
    intraclass correlation coefficient, 328
    marginal residuals, 328
    robust estimates of variance components, 328
    Studentized residuals, 329
    total variance, 328
    variance components, 328
spatial median, 366
    asymptotic representation, 367
    breakdown point, 393
spatial rank methods
    rotational invariant rank vectors, 373
    spatial signed rank statistic, 374
spatial ranks
    spatial signed rank tests efficiency, 376
spatial sign test, 366, see angle sign test
Spearman's rs, 260
    multivariate linear model, 398
stepwise model building, 232
stochastic ordering, 83
Studentized residuals
    for Arnold transformed model, 337
    for simple mixed model, 329
    HBR fit of linear model, 245
    R fit linear model, 205
symmetry


    diagonal symmetry, 351
t-test
    one sample, 9
        Pitman regularity, 24
        rejection breakdown, 34
    two sample, 77
τϕ+, 48
τϕ, 103, 164
    estimate of, 188
        consistency, 190
through the origin, 264
ties
    Wilcoxon signed-rank, 13
total variance, 328
two sample location model, see pseudo-norm
two sample scale model, 125
    folded aligned samples, 125
    general scores, 127
    Mood test, 131
    Siegel-Tukey test, 151
two sample scale problem
    Fligner-Killeen test, 129
unbiased test, 18
url
    R package, 2
    rank-based R software (RBR), 2
variance components, 328
Wald type tests, 6
Walsh averages, 11
Wilcoxon
    efficacy, 37
    efficiency to the L2, 40
    Hodges-Lehmann
        approximate distribution, 11
        confidence interval, 12
        estimate of location, 11
    Pitman regular, 37
    pseudo-median, 11
        breakdown, 31, 51
        influence function, 42
    signed-rank test, 12
        null distribution, 36
        rejection breakdown, 43
        ties, 13
Wilks' generalized variance, 354
Winsorized scores
    signed-rank, 70
working independence, 336