Page 1: Regression Graphics · Regression graphics: ideas for studying regressions through graphic&. Dennis Cook. p. cm. - (Wiley series in probability and statistics Probability and statistics

Regression Graphics

Ideas for Studying Regressions through Graphics

R. DENNIS COOK

The University of Minnesota
St. Paul, Minnesota

A Wiley-Interscience Publication
JOHN WILEY & SONS, INC.
New York · Chichester · Weinheim · Brisbane · Singapore · Toronto


WILEY SERIES IN PROBABILITY AND STATISTICS PROBABILITY AND STATISTICS SECTION

Established by WALTER A. SHEWHART and SAMUEL S. WILKS

Editors: Vic Barnett, Ralph A. Bradley, Noel A. C. Cressie, Nicholas I. Fisher, Iain M. Johnstone, J. B. Kadane, David G. Kendall, David W. Scott, Bernard W. Silverman, Adrian F. M. Smith, Jozef L. Teugels; J. Stuart Hunter, Emeritus

A complete list of the titles in this series appears at the end of this volume.


Copyright © 1998 by John Wiley & Sons, Inc. All rights reserved.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, (212) 850-6011, fax (212) 850-6008, E-Mail: PERMREQ@WILEY.COM.

Library of Congress Cataloging-in-Publication Data:

Cook, R. Dennis.
Regression graphics: ideas for studying regressions through graphics / R. Dennis Cook.

p. cm. - (Wiley series in probability and statistics. Probability and statistics section)

"A Wiley-Interscience publication."
Includes bibliographical references and index.
ISBN 0-471-19365-8 (cloth : alk. paper)
1. Regression analysis-Graphic methods. I. Title. II. Series: Wiley series in probability and statistics. Probability and statistics.

QA278.2.C6647 1998    519.5'36-dc21    98-3628

CIP

10 9 8 7 6 5 4 3 2 1

To Jami

Contents

Preface

1. Introduction
   1.1. CC&I
        1.1.1 Construction
        1.1.2 Characterization
        1.1.3 Inference
   1.2. Illustrations
        1.2.1 Residuals versus fitted values
        1.2.2 Residuals versus the predictors
        1.2.3 Residuals versus the response
   1.3. On things to come
   1.4. Notational conventions
   Problems

2. Introduction to 2D Scatterplots
   2.1. Response plots in simple regression
   2.2. New Zealand horse mussels
   2.3. Transforming y via inverse response plots
        2.3.1 Response transformations
        2.3.2 Response transformations: Mussel data
   2.4. Danish twins
   2.5. Scatterplot matrices
        2.5.1 Construction
        2.5.2 Example
   2.6. Regression graphics in the 1920s
        2.6.1 Ezekiel’s successive approximations
        2.6.2 Bean’s graphic method
   2.7. Discussion
   Problems

3. Constructing 3D Scatterplots
   3.1. Getting an impression of 3D
   3.2. Depth cuing
   3.3. Scaling
   3.4. Orthogonalization
   Problems

4. Interpreting 3D Scatterplots
   4.1. Haystacks
   4.2. Structural dimensionality
        4.2.1 One predictor
        4.2.2 Two predictors
        4.2.3 Many predictors
   4.3. One-dimensional structure
   4.4. Two-dimensional structure
        4.4.1 Removing linear trends
        4.4.2 Identifying semiparametric regression functions
   4.5. Assessing structural dimensionality
        4.5.1 A visual metaphor for structural dimension
        4.5.2 A first method for deciding d = 1 or 2
        4.5.3 Natural rubber
   4.6. Assessment methods
        4.6.1 Using independence
        4.6.2 Using uncorrelated 2D views
        4.6.3 Uncorrelated 2D views: Haystack data
        4.6.4 Intraslice residuals
        4.6.5 Intraslice orthogonalization
        4.6.6 Mussels again
        4.6.7 Discussion
   Problems

5. Binary Response Variables
   5.1. One predictor
   5.2. Two predictors
        5.2.1 Checking 0D structure
        5.2.2 Checking 1D structure
        5.2.3 Comparison with previous checking methods
        5.2.4 Exploiting the binary response
   5.3. Illustrations
        5.3.1 Australian Institute of Sport
        5.3.2 Kyphosis data
   5.4. Three predictors
        5.4.1 Checking 1D structure
        5.4.2 Kyphosis data again
   5.5. Visualizing a logistic model
        5.5.1 Conditionally normal predictors
        5.5.2 Other predictor distributions
   Problems

6. Dimension-Reduction Subspaces
   6.1. Overview
   6.2. Dimension-reduction subspaces
   6.3. Central subspaces
   6.4. Guaranteeing $S_{y|x}$ by constraining ...
        6.4.1 ... the distribution of $x$
        6.4.2 ... the distribution of $y \mid x$
   6.5. Importance of central subspaces
   6.6. h-Level response plots
   Problems

7. Graphical Regression
   7.1. Introduction to graphical regression
   7.2. Capturing $S_{y|x_1}$
        7.2.1 Example: Linear regression
        7.2.2 Example: $S_{y|x_1} = S(\eta_1)$, but $S_{y|x_2} \neq S(\eta_2)$
   7.3. Forcing $S_{y|x_1} \subset S(\eta_1)$
        7.3.1 Location regressions for the predictors
        7.3.2 Elliptically contoured distributions
        7.3.3 Elliptically contoured predictors
   7.4. Improving resolution
   7.5. Forcing $S_{y|x_1} = S(\eta_1)$
        7.5.1 Example: $x_1$ independent of $x_2$, but $S_{y|x_1} \neq S(\eta_1)$
        7.5.2 Conditions for $S_{y|x_1} = S(\eta_1)$
        7.5.3 Marginal consistency assumption
   7.6. Visual fitting with h-level response plots
   Problems

8. Getting Numerical Help
   8.1. Fitting with linear kernels
        8.1.1 Isomerization data
        8.1.2 Using the Li-Duan Proposition
   8.2. Quadratic kernels
   8.3. The predictor distribution
   8.4. Reweighting for elliptical contours
        8.4.1 Voronoi weights
        8.4.2 Target distribution
        8.4.3 Modifying the predictors
   Problems

9. Graphical Regression Studies
   9.1. Naphthalene data
        9.1.1 Naphthoquinone, $Y_N$
        9.1.2 Phthalic anhydride, $Y_P$
   9.2. Wheat protein
   9.3. Reaction yield
   9.4. Discussion
   Problems

10. Inverse Regression Graphics
   10.1. Inverse regression function
        10.1.1 Mean checking condition
        10.1.2 Mean checking condition: Wheat protein
        10.1.3 Mean checking condition: Mussel data
   10.2. Inverse variance function
        10.2.1 Variance checking condition
        10.2.2 Variance checking condition: Ethanol data
   Problems

11. Sliced Inverse Regression
   11.1. Inverse regression subspace
   11.2. SIR
   11.3. Asymptotic distribution of $\hat{\Lambda}_d$
        11.3.1 Overview
        11.3.2 The general case
        11.3.3 Distribution of $\hat{\Lambda}_d$ with constraints
   11.4. SIR: Mussel data
   11.5. Minneapolis schools
   11.6. Discussion
   Problems

12. Principal Hessian Directions
   12.1. Incorporating residuals
   12.2. Connecting $S_{e|z}$ and $S_{ezz}$ when ...
        12.2.1 ... $E(z \mid \rho^T z) = P_\rho z$
        12.2.2 ... $E(z \mid \rho^T z) = P_\rho z$ and $\mathrm{Var}(z \mid \rho^T z) = Q_\rho$
        12.2.3 ... $z$ is normally distributed
   12.3. Estimation and testing
        12.3.1 Asymptotic distribution of $\hat{\Lambda}_\kappa$
        12.3.2 An algorithm for inference on $\kappa$
        12.3.3 Asymptotic distribution of $\hat{\Lambda}_\kappa$ with constraints
        12.3.4 Testing $e$ independent of $z$
   12.4. pHd: Reaction yield
        12.4.1 OLS and SIR
        12.4.2 pHd test results
        12.4.3 Subtracting $\hat{\beta}$
        12.4.4 Using stronger assumptions
   12.5. pHd: Mussel data
        12.5.1 pHd test results
        12.5.2 Simulating the response
        12.5.3 Using Voronoi weights
   12.6. pHd: Haystacks
   12.7. Discussion
        12.7.1 pHd with the response
        12.7.2 Additional developments
   Problems

13. Studying Predictor Effects
   13.1. Introduction to net-effect plots
        13.1.1 Natural rubber: Net-effect plots
        13.1.2 Joint normality
        13.1.3 Slicing
        13.1.4 Reducing brushing dimensions
   13.2. Distributional indices
        13.2.1 Example
        13.2.2 Location dependence
        13.2.3 Post-model net-effect plots
        13.2.4 Bivariate SIR
   13.3. Global net-effect plots
        13.3.1 Tar
        13.3.2 Minneapolis schools again
   Problems

14. Predictor Transformations
   14.1. CERES plots
        14.1.1 Motivation
        14.1.2 Estimating $\alpha_1$
        14.1.3 Example
   14.2. CERES plots when $E(x_1 \mid x_2)$ is ...
        14.2.1 ... Constant
        14.2.2 ... Linear in $x_2$
        14.2.3 ... Quadratic in $x_2$
   14.3. CERES plots in practice
        14.3.1 Highly dependent predictors
        14.3.2 Using many CERES plots
        14.3.3 Transforming more than one predictor
   14.4. Big Mac data
   14.5. Added-variable plots
   14.6. Environmental contamination
        14.6.1 Assessing relative importance
        14.6.2 Data analysis
   Problems

15. Graphics for Model Assessment
   15.1. Residual plots
        15.1.1 Rationale
        15.1.2 Isomerization data
        15.1.3 Using residuals in graphical regression
        15.1.4 pHd
        15.1.5 Residual plots: Kyphosis data
        15.1.6 Interpreting residual plots
   15.2. Assessing model adequacy
        15.2.1 Marginal regression functions ...
        15.2.2 Marginal variance functions ...
        15.2.3 Marginal model plots
        15.2.4 Isomerization data again
        15.2.5 Reaction yield data
        15.2.6 Tomato tops
        15.2.7 Marginal model plots: Kyphosis data
   Problems

Bibliography

Author Index

Subject Index

Preface

Humans are good, she knew, at discerning subtle patterns that are really there, but equally so at imagining them when they are altogether absent.

Simple graphs have always played a useful role in the analysis and presentation of data. Until about 12 years ago, my personal view of statistical graphics was mostly confined to relatively simple displays that could be produced on a teletype or CRT terminal. Many displays were based on six-line plots. Today statistical graphics aren’t so simple. Advances in computing have stimulated ideas that go far beyond the historically dominant graphics, and that have the potential to substantially expand the role of visualization in statistical analyses. I understand that much of modern computer graphics can be traced back to the pioneering work on PRIM-9 (Fisherkeller, Friedman, and Tukey 1974) and to Peter Huber’s visions for PRIM-ETH and PRIM-H (Cleveland and McGill 1988). David Andrews’s Macintosh program McCloud provided my first exposure to three-dimensional plots, although computer graphics didn’t really become a concrete tool for me until after Luke Tierney began work on XLISP-STAT, a programming language that allows the user to implement graphical ideas with relatively little difficulty (Tierney 1990).

This book is about ideas for the graphical analysis of regression data. The original motivation came from wondering how far computer graphics could be pushed in a regression analysis. In the extreme, is it possible to conduct a regression analysis by using just graphics? The answer depends on the semantics of the question, but under certain rather weak restrictions it seems that the possibility does exist. And in some regressions such an analysis may even be desirable.

This book is not about how to integrate graphical and nongraphical methodology in pursuit of a comprehensive analysis. The discussion is single-minded, focusing on graphics unless nongraphical methods seem essential for progress. This should not be taken to imply that nongraphical methods are somehow less appropriate or less desirable. I hope that this framework will facilitate an understanding of the potential roles for graphics in regression.

CONTEXT

In practice, regression graphics, like most of statistics, requires both an application context and a statistical context. Statistics exists as a discipline because statistical contexts apply across diverse areas and, together with an application context, provide a foundation for scientific inquiries subject to random variation. Much of this book is devoted to a relatively new statistical context for regression and regression graphics. This new context is intended to blend with rather than replace more traditional paradigms for regression analysis. (See, for example, Box 1980.) It imposes few scope-limiting restrictions on the nature of the regression and for this reason it may be particularly useful at the beginning of an analysis for guiding the choice of a first model, or during the model-checking phase when the response is replaced by a residual. Basing an entire regression analysis on graphics is also a possibility that is discussed.

OUTLINE

Chapter 1 is a light introduction to selected graphical issues that arise in a familiar context. Notational conventions are described at the end of this chapter. Chapter 2 consists of a number of miscellaneous topics to set the stage for later developments, including two-dimensional scatterplots and scatterplot matrices, smoothing, response transformations in regressions with a single predictor, plotting exchangeable pairs, and a little history. In the same spirit, some background on constructing an illusion of a rotating three-dimensional scatterplot on a two-dimensional computer screen is given in Chapter 3. The main theme of this book begins in Chapter 4.

Much of the book revolves around the idea of reducing the dimension of the predictor vector through the use of central dimension-reduction subspaces and sufficient summary plots. These and other central ideas are introduced in Chapter 4 where I make extensive use of three-dimensional scatterplots for graphical analyses of regression problems with two predictors and a many-valued response. In the same vein, graphics for regressions with a binary response are introduced in Chapter 5. The development of ideas stemming from central dimension-reduction subspaces is continued in Chapters 6, 7, and 8 by allowing for many predictors. Practical relevance of these ideas is explored in Chapter 9 through a number of examples.

Starting in Chapter 4, steps in the development of various ideas for regression graphics are expressed as propositions with justifications separated from the main text. This formal style is not intended to imply a high degree of mathematical formalism, however. Rather, I found it convenient to keep track of results and to separate justifications to facilitate reading. Generally, knowledge of mathematical statistics and finite dimensional vector spaces is required for the justifications.

The graphical foundations are expanded in Chapter 10 by incorporating inverse regressions. Numerical methods for estimating a central dimension-reduction subspace via inverse regressions are discussed in Chapters 11 and 12.

Traditional models start to play a more central role in Chapter 13, which is devoted to ideas for studying the roles of individual predictors. Chapter 14 is on graphical methods for visualizing predictor transformations in linear models. Graphical methods for model assessment are studied in Chapter 15. Finally, each chapter ends with a few problems for those who might like to explore the ideas and methodology further.

Residuals are an important part of graphics for regression analyses, and they play key roles in this book. But they are not singled out for special study. Rather, they occur throughout the book in different roles depending on the context.

No color is used in the plots of this book. Nevertheless, color can facilitate the interpretation of graphical displays. Color and three-dimensional versions of selected plots, data sets, links to recent developments, and other supplemental information will be available via http://www.stat.umn.edu/RegGraph/. A few data sets are included in the book.

ACKNOWLEDGMENTS

Earlier versions of this book were used over the past six years as lecture notes for a one-quarter course at the University of Minnesota. Most students who attended the course had passed the Ph.D. preliminary examination, which in part requires a year of mathematical statistics and two quarters of linear models. The students in these courses contributed to the ideas and flavor of the book. In particular, I would like to thank Efstathia Bura, Francesca Chiaromonte, Rodney Croos-Dabrera, and Hakbae Lee, who each worked through an entire manuscript and furnished help that went far beyond the limits of the course. Dave Nelson was responsible for naming “central subspaces.” Bret Musser helped with computing and the design of the Web page.

Drafts of this book formed the basis for various short courses sponsored by Los Alamos National Laboratory, the Brazilian Statistical Association, Southern California Chapter of the American Statistical Association, University of Birmingham (U.K.), University of Waikato (New Zealand), International Biometric Society, Seoul National University (Korea), Universidad Carlos III de Madrid, the Winter Hemavan Conference (Sweden), the University of Hong Kong, and the American Statistical Association.

I would like to thank many friends and colleagues who were generous with their help and encouragement during this project, including Richard Atkinson, Frank Critchley, Doug Hawkins, Ker-Chau Li, Bret Musser, Chris Nachtsheim, Rob Weiss, Nate Wetzel, and Joe Whittaker. I am grateful to Joe Eaton for helping me reason through justifications that I found a bit tricky, and to Harold Henderson for his careful reading of the penultimate manuscript. Sandy Weisberg deserves special recognition for his willingness to engage new ideas.

Some of the material in this book was discussed in the recent text An Introduction to Regression Graphics by Cook and Weisberg (1994a). That text comes with a computer program, the R-code, that can be used to implement all of the ideas in this book, many of which have been incorporated in the second generation of the R-code. All plots in this book were generated with the R-code, but the development is not otherwise dependent on this particular computer program.

I was supported by the National Science Foundation’s Division of Mathematical Sciences from the initial phases of this work in 1991 through its completion.

R. DENNIS COOK

St. Paul, Minnesota
January 1998

CHAPTER 1

Introduction

The focus of this book is fairly narrow relative to what could be included under the umbrella of statistical graphics. We will concentrate almost exclusively on regression problems in which the goal is to extract information from the data about the statistical dependence of a response variable $y$ on a $p \times 1$ vector of predictors $x = (x_j)$, $j = 1, \ldots, p$. The intent is to study existing graphical methods and to develop new methods that can facilitate understanding how the conditional distribution of $y \mid x$ changes as a function of the value of $x$, often concentrating on the regression function $E(y \mid x)$ and on the variance function $\mathrm{Var}(y \mid x)$. If the conditional distribution of $y \mid x$ were completely known, the regression problem would be solved definitively, although further work may be necessary to translate this knowledge into actions. Just how to choose an effective graphical construction for extracting information from the data on the distribution of $y \mid x$ depends on a variety of factors, including the specific goals of the analysis, the nature of the response and the regressors themselves, and available prior information. The graphical techniques of choice may be quite different depending on whether the distribution of $y \mid x$ is essentially arbitrary or is fully specified up to a few unknown parameters, for example.
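As a rough illustration of the regression and variance functions named above, the following sketch estimates $E(y \mid x)$ and $\mathrm{Var}(y \mid x)$ for a single predictor by averaging within slices of $x$. The simulated model, the helper name `slice_mean_var`, and the number of slices are assumptions made for this example only; they are not taken from the book.

```python
import numpy as np

def slice_mean_var(x, y, n_slices=5):
    """Crude estimates of E(y | x) and Var(y | x) within equal-count slices of x."""
    order = np.argsort(x)                      # case indices sorted by predictor value
    slices = np.array_split(order, n_slices)   # roughly equal-count slices
    centers = np.array([x[idx].mean() for idx in slices])
    means = np.array([y[idx].mean() for idx in slices])
    variances = np.array([y[idx].var(ddof=1) for idx in slices])
    return centers, means, variances

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=2000)
# Hypothetical model: E(y | x) = 2x and Var(y | x) = (0.1 + x)^2.
y = 2.0 * x + (0.1 + x) * rng.standard_normal(x.size)

centers, means, variances = slice_mean_var(x, y)
for c, m, v in zip(centers, means, variances):
    print(f"x near {c:.2f}: slice mean {m:.2f}, slice variance {v:.2f}")
```

Plotting `means` and `variances` against `centers` gives a coarse view of how the first two moments of $y \mid x$ change with $x$ in this simulated setting; a smoother would normally replace the slice averages.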

Familiarity with standard graphical and diagnostic methods based on a linear regression model is assumed (see, for example, Cook and Weisberg 1982). These methods will be reviewed briefly as they are needed to illustrate specific ideas within the general development, but we will not usually be studying them in isolation. Similar remarks hold for smoothing two-dimensional scatterplots.

Regardless of the specific attributes of the regression problem at hand, it is useful to recognize three aspects of using graphical methodology to gain insights about the distribution of $y \mid x$: construction, characterization, and inference (Cook 1994a).

1.1. CC&I

1.1.1. Construction

Construction refers to everything involved in the production of the graphical display, including questions of what to plot and how to plot. Deciding what to plot is not always easy and again depends on what we want to accomplish. In the initial phases of an analysis, two-dimensional displays of the response against each of the $p$ predictors are obvious choices for gaining insights about the data, choices that are often recommended in the introductory regression literature. Displays of residuals from an initial exploratory fit are frequently used as well.

Recent developments in computer graphics have greatly increased the flexibility that we have in deciding how to plot. Some relatively new techniques include scatterplot rotation, touring, scatterplot matrices, linking, identification, brushing, slicing, and animation. Studying how to plot is important, but is not a focal point of this book. As for the graphical techniques themselves, we will rely on scatterplot displays along with various graphical enhancements. Two- and three-dimensional scatterplots and scatterplot matrices will be used most frequently, but higher-dimensional scatterplots will be encountered as well. The graphical enhancements include brushing, linking, rotation, slicing, and smoothing. While familiarity with basic construction methods is assumed, some of the more central ideas for this book are introduced briefly in Chapter 2. Methods for constructing a rotating three-dimensional plot on a two-dimensional computer screen are discussed in Chapter 3.

In striving for versatile displays, the issue of what to plot has been relatively neglected. In linear regression, for example, would a rotating three-dimensional scatterplot of case indices, leverages, and residuals be a useful display? More generally, how can three-dimensional plots be used effectively in regression problems with many predictors? Huber’s (1987) account of his experiences with three-dimensional scatterplots is interesting reading as a reference point in the development of statistical graphics. His emphasis was more on the construction of displays and how the user could interact with them, and less on how the displays could be used to advance data analysis. For the most part, three-dimensional scatterplots were seen as effective tools for viewing spatial objects (e.g., galaxies) and for finding “... outliers, clusters and other remarkable structures.” Three-dimensional plotting of linear combinations of variables was discouraged.

In this book, quantities to plot will generally be dictated by the development starting in Chapter 4.

1.1.2. Characterization

Characterization refers to what we see in the plot itself. What aspects of the plot are meaningful or relevant? Is there a linear or nonlinear trend? Clusters? Outliers? Heteroscedasticity? A combination thereof?

There are several standard characterizations for two-dimensional scatterplots of residuals versus fitted values from an ordinary least squares (OLS) fit of a linear regression model, including a curvilinear trend, a fan-shaped pattern, isolated points, or no apparent systematic tendency. For further reading on these characterizations, see Anscombe (1973), Anscombe and Tukey (1963), or Cook and Weisberg (1982, p. 37). Characterizing a two-dimensional scatterplot is relatively easy, particularly with the full range of recently developed graphical enhancements at hand. However, standard patterns to watch for in three-dimensional plots are not as well understood as they are in many two-dimensional plots. We can certainly look for very general characteristics like curvature in three-dimensional plots, but it may not be clear how or if the curvature itself should be characterized. It is also possible to obtain useful insights into higher-dimensional scatterplots, but for the most part their interpretation must rely on lower-dimensional constructions. Similar statements apply to scatterplot matrices and various linked plots. Scatterplot matrices are discussed briefly in Section 2.5.

Beginning with Chapter 4, one central theme of this book is that characterizations of various two- and three-dimensional scatterplots can provide useful information about the distribution of $y \mid x$, even when the number of predictors $p$ is appreciable.

1.1.3. Inference

It would do little good to construct and characterize a display if we don’t then know what to do with the information.

Imagine inspecting a scatterplot matrix of data on $(y, x^T)$. Such a display consists of a square array of $p(p + 1)$ scatterplots, one for each ordered pair of distinct univariate variables from $(y, x^T)$. The plots involving just pairs of predictors may be useful for diagnosing the presence of collinearity and for spotting high-leverage cases. The marginal plots of the response against each of the individual predictors allow visualization of aspects of the marginal regression problems as represented by $y \mid x_j$, $j = 1, \ldots, p$, particularly the marginal regression functions $E(y \mid x_j)$ and the corresponding variance functions. There is surely a wealth of information in a scatterplot matrix. But in the absence of a context that establishes a connection with the conditional distribution of $y \mid x$, it’s all about sidelights having little to do with the fundamental problem of regression. For example, the presence of collinearity is generally of interest because the distributions of $y \mid (x = a)$ and $y \mid (x = b)$ are relatively difficult to distinguish when $a$ and $b$ are close. The presence of collinearity may tell us something about the relative difficulty of the analysis, but it says nothing about the distribution of $y \mid x$ per se.
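The panel count in a scatterplot matrix can be checked with a short enumeration. The sketch below lists the off-diagonal panels for a hypothetical regression with $p = 3$ predictors and separates the marginal response plots from the predictor-only plots; the variable names and the helper `scatterplot_matrix_panels` are illustrative assumptions, not notation from the book.

```python
from itertools import permutations

def scatterplot_matrix_panels(names):
    """Return the (vertical, horizontal) variable pair for every off-diagonal panel."""
    return list(permutations(names, 2))

p = 3  # hypothetical number of predictors
names = ["y"] + [f"x{j}" for j in range(1, p + 1)]
panels = scatterplot_matrix_panels(names)

# Panels with the response on the vertical axis: the marginal response plots.
marginal_response = [pair for pair in panels if pair[0] == "y"]
# Panels involving only predictors: useful for collinearity and leverage.
predictor_only = [pair for pair in panels if "y" not in pair]

print(len(panels), len(marginal_response), len(predictor_only))
```

With $p + 1$ variables there are $(p + 1)p$ ordered pairs of distinct variables, of which $p$ show the response against a single predictor and $p(p - 1)$ involve predictors alone.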

Similar remarks apply to other modern graphical displays like rotating 3D displays. What can a three-dimensional plot of the response and two predictors tell us about the distribution of $y \mid x$ when $p > 2$? when $p = 2$? Again, without a context that establishes a connection between the graphic and the full regression $y \mid x$, the characterization of a display may be of limited use. In a commentary on graphics in the decades to come, Tukey (1990) stressed that “... we badly need a detailed understanding of purpose,” particularly for deciding when a display reflects relevant phenomena. A similar theme can be found in an overview of graphical methods by Cox (1978). The construction of imponderable displays seems much too easy with all the modern graphical machinery. Even the distinction between ponderable and imponderable displays may be lost without a context. The work of Tufte (1983) provides an important lesson that such distinctions are important at all levels.

Ideally, the analysis of a plot would be concluded by forming a well-grounded inference about the nature of the distribution of y | x, about the data itself, about how to carry on the analysis, or about the possibility of unexpected phenomena. Inferences need not depend solely on the characterization, although this is often the case. An inference might also depend on other aspects of the analysis or on prior information, for example.

Detecting a fan-shaped pattern in the usual plot of residuals versus fitted values from the OLS regression of y on x leads to the data-analytic observation that the residual variance increases with the fitted values. This is a characterization of a common regression scatterplot. The inference comes when this characterization is used to infer heteroscedastic errors for the true but unknown regression relationship and to justify a weighted regression or transformation. As a second example, Cook and Weisberg (1989) suggested that a saddle-shaped pattern in certain detrended added variable plots can be used to infer the need for interaction terms. While the regression literature is replete with this sort of advice, bridging the gap between the data-analytic characterization of a plot and the subsequent inference often requires a leap of faith regarding properties of y | x.

The characterization of a plot, which is a data-analytic task, and the subsequent inference have often been merged in the literature, resulting in some confusion over crucial aspects of regression graphics. Nevertheless, it still seems useful to have a collective term: We will use interpretation to indicate the combined characterization-inference phase of a graphical analysis.

1.2. ILLUSTRATIONS

1.2.1. Residuals versus fitted values

Figure 1.1 gives a plot of the OLS residuals versus fitted values from fitting a plane with 100 observations on two predictors x = (x_1, x_2)^T. The predictors were sampled from a Pearson type II distribution on the unit disk in R^2 (Johnson 1987, p. 111), where R^t denotes t-dimensional Euclidean space. This plot exhibits the classical fan-shaped pattern that characterizes heteroscedasticity. While the construction and characterization of this plot are straightforward, any inferences must depend strongly on context.

FIGURE 1.1 Residuals versus fitted values for the example of Section 1.2.1. (Horizontal axis: fitted values.)

Residual plots like the one in Figure 1.1 are often interpreted in the context of a linear regression model, say

$$ y \mid x = \beta_0 + \beta^T x + \sigma(\beta^T x)\,\varepsilon \qquad (1.1) $$

where ε is independent of x, E(ε) = 0, and Var(ε) = 1. The OLS estimate of β is unbiased under this model, even if the standard deviation function is nonconstant. Thus, the residual plot provides useful information on σ(β^T x), allowing the viewer to assess whether it is constant or not. In a related context, nonconstant variance in residual plots can indicate that a homoscedastic linear model would be appropriate after transformation of the response. Because of such results, a variance-stabilizing transformation of the response or a weighted regression is often recommended as the next step when faced with a residual plot like the one in Figure 1.1. While these ideas are surely useful, tying graphical methods to a narrow target model like (1.1) severely restricts their usefulness. Graphics are at their best in contexts with few scope-limiting constraints, allowing the user an unfettered view of the data.
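The unbiasedness claim can be checked in a line or two. The following is only a sketch, assuming model (1.1) has the form y | x = β_0 + β^T x + σ(β^T x)ε; the design matrix X (with a leading column of ones), the stacked response vector Y, and the coefficient vector θ = (β_0, β^T)^T are notation introduced here for illustration.

```latex
\begin{aligned}
\hat{\theta} &= (X^T X)^{-1} X^T Y, \\
\mathrm{E}(\hat{\theta} \mid X) &= (X^T X)^{-1} X^T \, \mathrm{E}(Y \mid X)
  = (X^T X)^{-1} X^T X \theta = \theta .
\end{aligned}
```

Only E(ε) = 0 and the independence of ε and x are used, so the form of σ(·) plays no role in the mean of the estimate, although it does affect its variance.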

How should we interpret Figure 1.1 if the context is expanded to allow essentially arbitrary regression and variance functions,

$$ y \mid x = \mathrm{E}(y \mid x) + \sigma(x)\,\varepsilon $$

where ε is as defined previously? Now the heteroscedastic pattern can be a manifestation of a nonlinear regression function or a nonconstant variance function, or both, and we can conclude only that the homoscedastic linear regression model does not reasonably describe the data.

The response for Figure 1.1 was generated as

$$ y \mid x = \frac{|x_1|}{2 + (1.5 + x_2)^2} + \varepsilon $$

with homoscedastic error ε that is independent of x. A rotating three-dimensional plot with y on the vertical axis and the predictors on the horizontal axes resembles a round tilted sheet of paper that is folded down the middle, the fold increasing in sharpness from one side to the other. The plot in Figure 1.1 is essentially a projection of the sheet onto the plane determined by the vertical axis and the direction of the fold, so that the paper appears as a triangle. Interpreting Figure 1.1 in the context of a linear regression model could not yield the best possible explanation with the available data. A similar example was given by Cook (1994a).
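A simulation in the spirit of this example can be sketched as follows. It is not a reproduction of the data behind Figure 1.1: the predictors are assumed to be uniform on the unit disk (one member of the Pearson type II family, since the text does not give the shape parameter), the error standard deviation `noise_sd` is an arbitrary choice, and the function names are hypothetical. The sketch fits a plane by OLS and compares the residual spread below and above the median fitted value, a crude numerical stand-in for the fan shape.

```python
import numpy as np


def simulate_fold(n=100, noise_sd=0.02, seed=0):
    """Simulate a setup like Section 1.2.1; return OLS fitted values and residuals.

    Assumptions (not from the text): predictors are uniform on the unit disk
    and noise_sd is an arbitrary homoscedastic error standard deviation.
    """
    rng = np.random.default_rng(seed)
    # Uniform draws on the unit disk via polar coordinates.
    radius = np.sqrt(rng.uniform(size=n))
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    x1 = radius * np.cos(angle)
    x2 = radius * np.sin(angle)
    # Response: |x1| / (2 + (1.5 + x2)^2) plus homoscedastic error.
    y = np.abs(x1) / (2.0 + (1.5 + x2) ** 2) + noise_sd * rng.standard_normal(n)
    # OLS fit of a plane with columns: intercept, x1, x2.
    design = np.column_stack([np.ones(n), x1, x2])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    return fitted, y - fitted


def spread_by_half(fitted, resid):
    """Residual standard deviation below and above the median fitted value."""
    cut = np.median(fitted)
    return resid[fitted <= cut].std(), resid[fitted > cut].std()


if __name__ == "__main__":
    fitted, resid = simulate_fold()
    low, high = spread_by_half(fitted, resid)
    print(f"residual sd, lower half of fitted values: {low:.4f}")
    print(f"residual sd, upper half of fitted values: {high:.4f}")
```

The errors in this sketch are homoscedastic by construction, so any increase in residual spread with the fitted values comes from the nonlinear regression function and not from the variance function.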

1.2.2. Residuals versus the predictors

Another common interpretation arises in connection with a scatterplot of the OLS residuals from a linear regression versus a selected predictor x_j. In reference to such a plot Atkinson (1985, p. 3), for example, provides the standard interpretation, “The presence of a curvilinear relationship suggests that a higher-order term, perhaps a quadratic in the explanatory variable, should be added to the model.” In other words, if a plot of the residuals versus x_j is characterized by a curvilinear trend, then infer that a higher-order term in x_j is needed in the model.

For example, consider another constructed regression problem with 100 observations on a univariate response y and 3 × 1 predictor vector x = (x_1, x_2, x_3)^T generated as a multivariate normal random variable with mean 0 and nonsingular covariance matrix. Figure 1.2 gives a scatterplot matrix of the predictors and the OLS residuals ê from the linear model y | x = β_0 + β^T x + σε. The residuals here estimate σε.

The bottom row of plots in Figure 1.2 shows the residuals plotted against each of the three predictors. The plot of ê versus x_1 and the plot of ê versus x_3 both exhibit a nonlinear trend (characterization), an indication that the distribution of the residuals depends on x_1 and x_3, and thus that the linear model is deficient (inference). Restricting the context to the class of homoscedastic regression models y | x = E(y | x) + σε, where the distribution of ε does not depend on x, what additional inferences can be obtained from Figure 1.2? Since the plot of ê versus x_2 shows no clear systematic tendencies, would restricting attention to the manner in which x_1 and x_3 enter the model necessarily point us in the right direction?

FIGURE 1.2 Scatterplot matrix of residuals and predictors for the illustration of Section 1.2.2.

The response for the data of Figure 1.2 was generated as

$$ y \mid x = |x_2 + x_3| + 0.5\,\varepsilon $$

where ε is a standard normal random variable. Clearly revisions of the linear model that restrict attention to x_1 and x_3 cannot yield the best explanation. Plots of residuals versus individual predictors cannot generally be regarded as sufficient diagnostics of model failure. They can fail to indicate model deficiency when there is a nonlinearity in the predictor in question (x_2), they can indicate model failure in terms of a predictor that is not needed in the regression (x_1), and they can correctly indicate model failure for a directly relevant predictor (x_3).
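The three behaviors just listed can be reproduced in a small simulation. The sketch below is an illustration under stated assumptions, not the data behind Figure 1.2: the text says only that the predictors are multivariate normal with mean 0 and a nonsingular covariance matrix, so the particular covariance structure used here is hypothetical, chosen so that x_2 is uncorrelated with x_2 + x_3 while x_1 and x_3 are strongly correlated with it. The function names and the use of the correlation between residuals and squared predictors as a curvature score are also our own choices.

```python
import numpy as np


def simulate_abs_sum(n=100, seed=0):
    """Simulate y | x = |x2 + x3| + 0.5*eps; return predictors and OLS residuals.

    The covariance structure is a hypothetical choice (the text does not give it):
    x2 is uncorrelated with x2 + x3, while x1 and x3 are correlated with it.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, 3))
    # Invertible linear map of independent normals, so the covariance is nonsingular.
    x1 = z[:, 0] + 0.5 * z[:, 2]
    x2 = z[:, 1]
    x3 = 2.0 * z[:, 0] - z[:, 1]
    x = np.column_stack([x1, x2, x3])
    y = np.abs(x2 + x3) + 0.5 * rng.standard_normal(n)
    # OLS fit of the linear model with an intercept.
    design = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return x, y - design @ coef


def curvature_scores(x, resid):
    """Correlation of the residuals with each squared predictor (crude curvature check)."""
    return np.array([np.corrcoef(resid, x[:, j] ** 2)[0, 1] for j in range(x.shape[1])])


if __name__ == "__main__":
    x, resid = simulate_abs_sum()
    for name, score in zip(["x1", "x2", "x3"], curvature_scores(x, resid)):
        print(f"curvature score for residuals versus {name}: {score:.2f}")
```

Under this assumed covariance, the residuals show curvature against x_1, which does not enter the regression, and against x_3, but essentially none against x_2, which does enter.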

Even when the context is correct, popular interpretations of standard graphical constructions can result in misdirection.

1.2.3. Residuals versus the response

FIGURE 1.3 Plot of the residuals versus the response from the steam plant data used by Draper and Smith (1966). The 7 observations with the largest responses are highlighted. (Horizontal axis: steam used per month, y.)

Draper and Smith (1966) played a notable role in the evolution of regression methodology. They stressed the importance of graphics and promoted a number of specific graphical methods for assessing model adequacy. The first edition of their text stood for a number of years as a point of introduction to regression. Many regression texts of today offer relatively little graphical methodology beyond that originally available in Draper and Smith, who relied primarily on unadorned two-dimensional scatterplots, particularly scatterplots of residuals versus fitted values or selected predictors.

FIGURE 1.4 Two-dimensional view of the three-dimensional scatterplot of the steam data; steam used per month y, operating days per month x_1, monthly atmospheric temperature x_2.

However, Draper and Smith (1966, p. 122) also used a scatterplot of the OLS residuals versus the response in an example relating the pounds of steam used monthly y at a large industrial concern to the number of operating days per month x_1 and the average monthly atmospheric temperature x_2. Their plot is reconstructed here as Figure 1.3. They characterized the plot by noting that “... six out of the seven largest values of Y have positive residuals.” From this they inferred that “... the model should be amended to provide better prediction at higher [response] values.” We now know that this inference was not well founded. While the characterization is valid, the response and residuals from an OLS fit are always positively correlated, so the finding that “six out of the seven largest values of Y have positive residuals” is not necessarily noteworthy. In their second edition, Draper and Smith (1981, p. 147) describe the pitfalls of plotting the residuals versus the responses. The point of this illustration is not to criticize Draper and Smith. Rather, it is to reinforce the general idea that there is much more to regression graphics than construction and characterization. Many different types of graphical displays have been developed since the first edition of Draper and Smith (1966), most of which are included in widely available software. And there has probably been a corresponding increase in the number of ways in which we can be misled.
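The positive correlation between the response and the OLS residuals can be made concrete. For an OLS fit that includes an intercept, the residuals are uncorrelated with the fitted values, so the sample correlation between y and the residuals equals the square root of 1 − R², which is positive unless the fit is perfect. The sketch below checks this identity numerically; the function name is hypothetical and the data are synthetic, not the steam plant data.

```python
import numpy as np


def residual_response_correlation(x, y):
    """Return (corr(y, residuals), sqrt(1 - R^2)) for an OLS fit with intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    # R^2 = 1 - RSS / TSS.
    r_squared = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return np.corrcoef(y, resid)[0, 1], np.sqrt(1.0 - r_squared)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((25, 2))  # synthetic predictors, not the steam data
    y = 1.0 + x @ np.array([2.0, -1.0]) + rng.standard_normal(25)
    corr, root = residual_response_correlation(x, y)
    print(corr, root)  # the two values agree up to rounding error
```

Because the correlation is positive by construction, large responses tending to have positive residuals is expected even when the model is correct.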

A three-dimensional plot of the steam data from Draper and Smith is quite informative and seems easy to interpret relative to some three-dimensional plots. It shows that the two months with the smallest operating days do not conform to the linear regression established by the remaining data. Figure 1.4 gives one two-dimensional view of this three-dimensional plot. This view was obtained by rotating the three-dimensional plot to the two-dimensional projection that provides the best visual fit while mentally ignoring the two outlying months. The idea of visual fitting is developed in Chapter 4 for regressions with p = 2 predictors. In later chapters these ideas are adapted to allow visual fitting when p > 2.

1.3. ON THINGS TO COME

Progress on the types of issues raised in this chapter may be possible if we can establish a context allowing the development of connections between graphical methodology and the object of interest, principally the distribution of y | x. Beginning in Chapter 4, a central theme of this book is based on finding a simplified version of y | x by reducing the algebraic dimension of the predictor vector without losing information on the response, and avoiding, as far as possible, the introduction of assumptions on the nature of the conditional distribution of y | x.


Let η denote a p × k matrix, k ≤ p, so that y | x and y | η^T x are identically distributed. The dimension-reduction subspace S(η) spanned by the columns of η can be used as a superparameter to index the distribution of y | x and thus taken as the target for a graphical inquiry. A plot of y versus the k-dimensional vector η^T x is called a sufficient summary plot. In the end, an estimate η̂ of η can be used to form a plot of y versus η̂^T x that serves as a summary of the data. This paradigm may have practical merit because very often it seems that good summaries can be obtained with k = 1 or 2. In developing this paradigm, all plots are judged on their ability to provide graphical information about S(η), and the three aspects of regression graphics (construction, characterization, and inference) come packaged together.
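As a small worked illustration, consider the constructed regression of Section 1.2.2, where the response was generated as |x_2 + x_3| plus 0.5 times a standard normal error. The particular choice of matrix below is our reading of that example and is not stated at this point in the text.

```latex
\eta = (0, 1, 1)^T, \qquad \eta^T x = x_2 + x_3, \qquad
y \mid x \;=\; |\eta^T x| + 0.5\,\varepsilon .
```

Here p = 3 and k = 1: the distribution of y | x depends on x only through the single linear combination x_2 + x_3, so a two-dimensional plot of y versus x_2 + x_3 would be a sufficient summary plot. Any nonzero multiple of this η spans the same subspace, which is why the subspace rather than the matrix itself is taken as the target.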

Dimension-reduction subspaces, summary plots, and other central notions are introduced in Chapter 4, which makes extensive use of three-dimensional scatterplots for graphical analyses of regression problems with two predictors. The basic ideas and methodology developed in that chapter will also play a central role in regression with many predictors. The fundamental material in Chapter 4 does not depend on the nature of the response, but graphical implementation does. Basic constructions are adapted for regressions with a binary response in Chapter 5.

The development of ideas stemming from dimension-reduction subspaces is continued in Chapters 6, 7, and 8 by permitting many predictors and refining the foundations. Practical relevance of these ideas is explored in Chapter 9 through a number of examples.

The graphical foundations are expanded in Chapter 10 by incorporating inverse regressions, where x plays the role of the response and y becomes the predictor. Numerical methods for estimating a dimension-reduction subspace via inverse regressions are discussed in Chapters 11 and 12.

Traditional models start to play a more central role in Chapter 13, which is devoted to ideas for studying the roles of individual predictors. Chapter 14 is on graphical methods for visualizing predictor transformations in linear models. Finally, graphical methods for model assessment are studied in Chapter 15.

1.4. NOTATIONAL CONVENTIONS

Abbreviations

OLS, ordinary least squares
d, the structural dimension for the regression under consideration
DRS, dimension-reduction subspace
S_{y|x}, central DRS for the regression of y on x

Data and random variables

When discussing generic regression problems, the response will generally be denoted by y and the p × 1 vector of predictors will be denoted by x. Typi-