


CHAPTER 10

Linear Regression in Medical Research

PAUL J. RATHOUZ, PH.D., AND AMITA RASTOGI, M.D., M.H.A.

ABSTRACT Regression techniques are important statistical tools for assessing the relationships among variables in medical research. Linear regression summarizes the way in which a continuous outcome variable varies in relation to one or more explanatory or predictor variables. This chapter covers the application and interpretation of simple (a single predictor) and multiple (more than one predictor) linear regression. We discuss in detail the estimation and interpretation of regression slopes and intercepts, hypothesis tests, and confidence intervals in simple linear regression models. Then we contrast regression with correlation and discuss how these methods both summarize statistical association, but do not necessarily imply causation among the variables of interest. We then summarize some main points about the careful use of linear regression so that the reader has an appreciation of the data analytic issues involved in carrying out regression analyses. Building on our presentation of simple linear regression, we provide an overview of multiple linear regression, with a strong emphasis on interpretation of the regression slopes. We discuss the differential use of regression for statistical summarization of relationships between the outcome and predictors, for statistical adjustment of the observed relationship of the outcome to one or more key predictors, and for statistical prediction of the outcome from a set of candidate predictors. An example involving the prediction of HbA1c from characteristics of sleep in diabetes patients is carried throughout the chapter for illustrating the key ideas, and other illustrative examples are drawn from the recent medical literature.

The vast majority of statistical analyses in medical research involve relationships among variables. Common examples include coronary artery disease and cholesterol, lung cancer risk and smoking, cognitive function and aging, and so on. Linear regression summarizes the way in which one variable (Y) varies in relation to one or more other variables (X). Simple linear regression involves a single X variable, and multiple linear regression covers more than one X variable. The Y variable is commonly referred to as the outcome or response, and the X variables are called explanatory or predictor variables or, sometimes,




covariates. (The explanatory variables are often called "independent variables," but it is safer to avoid this terminology because the term "independent" has other meanings in statistics.) In linear regression analyses Y is a "continuous" variable, such as a laboratory measurement or a score on a 0-100 scale; techniques for categorical variables are discussed in Chapter 12.

Example 1: A study of sleep and endocrine function evaluated 122 diabetic patients for sleep quantity and quality, as well as for diabetes control, as measured by hemoglobin A1c (HbA1c). The hypothesis under investigation was that reduced sleep quantity and/or quality leads to poorer diabetes control, and hence increased HbA1c levels. Sleep quality was measured via a modified version of the previously validated Pittsburgh Sleep Quality Index (PSQI). The PSQI varies from 0 to 21 "points"; higher values represent poorer sleep, and any score greater than 5 indicates poor sleep quality. Percent HbA1c, obtained from patients' medical charts, measures average blood glucose level for the preceding three months; in diabetic populations, values over 7.5-8% are generally considered indicators of poor glycemic control.

In an uncomplicated world, the relationship between HbA1c and PSQI would be exact, so that a single value of Y would correspond to each value of X, and in a scatter plot of Y versus X the points would fall on a straight line, as in Figure 1. The key quantity of scientific interest would be the slope of the line describing how HbA1c varies with PSQI (the rate of change in HbA1c per unit increase in PSQI).

This idealized relationship of HbA1c to sleep quality is, of course, purely hypothetical. In practice, many factors other than PSQI, both measured and unmeasured, both understood and unknown, affect HbA1c levels. The actual data (Figure 2) show that the relationship is much less clear, although still quite evident. Mean HbA1c shows a tendency to increase with PSQI, but variation among individuals is substantial. We develop this example through the remainder of the chapter.

Statistical regression is a procedure for objectively fitting a curve through a set of points, such as those in Figure 2. To allow for variation or "noise" in the data (from differences among individuals or from measurement error), regression focuses on average behavior. Specifically, "the regression of Y on X" refers to the average (mean) value of Y corresponding to each value of X. The diamonds in Figure 2 show the mean value of HbA1c (Y) at each value of PSQI (X). On average, HbA1c tends to increase with PSQI, although individual HbA1c values vary considerably at any given level of PSQI. In general, the regression of Y on X does not prescribe any particular pattern for how the mean of Y varies with X.
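The idea that "the regression of Y on X" is simply the mean of Y at each observed value of X can be sketched in a few lines of Python. The (PSQI, HbA1c) pairs below are invented for illustration; they are not the study data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (PSQI, HbA1c) pairs -- illustrative values only.
data = [(2, 6.8), (2, 7.4), (5, 7.9), (5, 8.3), (5, 8.0), (9, 8.9), (9, 9.5)]

# "The regression of Y on X": the mean of Y at each observed value of X.
by_x = defaultdict(list)
for x, y in data:
    by_x[x].append(y)

regression = {x: mean(ys) for x, ys in sorted(by_x.items())}
for x, ybar in regression.items():
    print(f"PSQI = {x:2d}: mean HbA1c = {ybar:.2f}")
```

Each printed mean corresponds to one gray diamond in Figure 2; no particular pattern across X is imposed at this stage.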

Linear regression refers to the common situation in which the average values of Y across the range of X fall on or close to a straight line. Standard ways of estimating a linear regression produce the straight line that is, in a particular mathematical sense, the best fit to the data. Even when those average-value points do not follow a straight line, the computational procedure will deliver a line, which may often be a useful summary of much of the relationship. The key



Figure 1. Hypothetical ideal relationship and data for HbA1c and PSQI. HbA1c has a perfect linear relationship with PSQI score, so that all data points fall exactly on a straight line.

Figure 2. Plot of HbA1c versus PSQI. Gray diamonds represent the average HbA1c value for each value of PSQI.



quantity is the regression slope, quantifying how many units the average value of Y increases (or decreases) for each unit increase in X.

Examples of simple linear regression are less common in the medical literature than are applications of multiple linear regression, involving several predictor variables (X's). The ability to handle multiple X's and predictors that take a variety of forms gives regression methods very broad applicability. Most analyses have one of three aims: summarization and explanation, adjustment, or prediction. These groupings are not hard and fast, but they do provide a sense of the rich diversity of applications.

A summarization analysis aims to provide a simple description of how a response Y varies with X or with a set of X's. The obvious appeal of the procedure is that it reduces a potentially complex multidimensional set of relationships to a small set of regression slopes that are easy to understand and straightforward to communicate. For example, it may be of interest to describe how HbA1c varies with a set of sleep characteristics, including a measure of sleep quality such as the PSQI, total sleep obtained, and perceived sleep debt. Placing these three predictors in a single multiple-regression model for HbA1c will yield three regression slopes; adding the effects they summarize provides a parsimonious description of how HbA1c varies with these three sleep measures jointly. A summarization analysis is often motivated by a desire to explain or describe how a response behaves as a function of one or more covariates. The analysis may be guided by hypotheses about these relationships, and it is of interest to evaluate how well these hypotheses are borne out by the data.

Often, explanatory data analysis problems arise in conjunction with a goal of statistical adjustment, a compelling need in many medical studies. Suppose, for example, that it is known that both HbA1c and sleep quality vary with age and sex. Ignoring this fact may lead to spurious results, including perhaps an inflation of the relationship of HbA1c to sleep quality. One study design could handle such a problem by sampling only individuals of a single sex and within a narrow age range. More informatively, researchers might perform separate analyses for each of several groups delineated by sex and age, yielding a separate regression slope for each group. It may, however, not be feasible to assemble a useful number of subjects in each such group. Even if the sample sizes are large enough, separate results for each group may be difficult to present and digest. Statistical adjustment aims to simplify the description and clarify the picture when the regression of HbA1c on sleep quality within the various age-by-sex groups yields similar slopes for PSQI. Multiple linear regression that includes explanatory variables for age and sex, in addition to sleep quality, will yield a common regression slope for the relation of HbA1c to sleep quality across the groups defined by age and sex. This slope quantifies the relationship of HbA1c to PSQI, "adjusted for age and sex."



Another purpose of linear regression is prediction. Suppose one wants to make reliable clinical predictions of change in HbA1c over a period of a year. Such a prediction model may be critical to setting treatment regimes, targeting individuals for intensified follow-up, etc. Multiple linear regression using predictor variables such as current HbA1c, age, sex, and sleep measures, as well as other cardiovascular risk factors, might figure into such an analysis. The focus is on how well the entire set of predictors together can predict HbA1c one year hence, whereas the interpretation or values of the regression slopes are of somewhat less interest. A later section of this chapter discusses adjustment, summarization, and prediction in more detail.

In addition to the diverse uses for simple and multiple linear regression, related techniques form an important component of modern medical statistics. Special cases of multiple linear regression (sometimes unified under "general linear models") include analysis of variance for comparing means among several groups and analysis of covariance for comparing such means with adjustment for continuous covariates. Multiple linear regression is primarily applied to data in which the response variable is a continuous quantity. The basic ideas, however, carry over into logistic regression for binary outcome data (Chapter 12), proportional-hazards regression for survival time data (Chapter 11), and random-effects models and generalized-estimating-equations models for longitudinal (repeated-measures) data. Because regression techniques typify the statistical process of separating signal from noise in a set of data, they provide a statistical modeling framework that has proven remarkably useful in focusing attention on the scientific issues under investigation in medical research.

SIMPLE LINEAR REGRESSION

Simple linear regression presents a linear relationship between the response variable Y and the predictor variable X. This relationship is captured by a regression equation, which represents a statistical model. Such models play two important roles in scientific investigation. First, they provide a framework for thinking about the scientific relationships and hypotheses of interest. As such, they aid in narrowing and operationalizing questions of interest. As more variables come into play for purposes of adjustment or prediction in multiple linear regression models, the role of the regression model as a framework for formalizing scientific hypotheses becomes even more important. Secondly, statistical models form the mathematical basis for data analysis. This process includes estimation, as well as inferences such as testing hypotheses and constructing confidence intervals.

In practice, the fitted regression model separates the signal from the noise. It contains estimates of the intercept and slope and thereby traces a line through





the cloud of points in a scatter plot of the data. Estimated coefficients are accompanied by standard errors, which express the statistical uncertainty in these estimates. Standard errors in turn are used to conduct hypothesis tests of possible association between Y and X and to construct confidence intervals for the regression slope (and intercept).

The Fitted Line

Figure 1 depicts an idealized linear relationship of HbA1c to sleep quality. If this relationship were to hold, then it would be captured by the well-known algebraic equation for a line

Y = a + bX.

Statisticians commonly write

Y = β0 + β1X,    (1)

using β1 (instead of b) for the slope of Y against X and β0 (instead of a) for the intercept. The slope quantifies the relationship of Y to X and is usually the key quantity of scientific interest. The intercept is the value of Y when X = 0; in most medical research, it is less important than the slope.

In practice, such an ideal relationship never holds, but it may often hold on average:

mean(Y|X) = β0 + β1X.    (2)

In this expression mean(Y|X) refers to the average value of Y among those individuals with a given value of X. Here is the important difference between equations (1) and (2): Equation (1) says that all individuals in the population with a given value of X have the same value of Y, and that this value is exactly equal to β0 + β1X, yielding hypothetical "data" such as those in Figure 1. By contrast, equation (2) says merely that the average (central) value of Y in the population of persons with a given value of X is β0 + β1X. Individuals in that population deviate in both directions from that central value. This deviation is often represented by e, yielding the traditional simple linear regression equation,

Y = β0 + β1X + e.    (3)

This equation is simply another way of writing Y = mean(Y|X) + e or, in words,

response = signal + noise

or

outcome = prediction + deviation.

The reader may adopt the interpretation that is most natural.
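Equation (3) can be made concrete by simulating from it. In this sketch the coefficients and noise standard deviation are borrowed from the fitted example discussed later in the chapter, and the PSQI scores are randomly generated for illustration only.

```python
import random
from statistics import mean, stdev

random.seed(0)

beta0, beta1, sigma = 7.11, 0.186, 2.03   # illustrative parameter values
xs = [random.randint(0, 14) for _ in range(500)]   # hypothetical PSQI scores

# outcome = prediction + deviation:
ys = [beta0 + beta1 * x + random.gauss(0.0, sigma) for x in xs]

signal = [beta0 + beta1 * x for x in xs]
noise = [y - s for y, s in zip(ys, signal)]
print(f"mean of noise (should be near 0): {mean(noise):.3f}")
print(f"sd of noise (should be near sigma): {stdev(noise):.3f}")
```

The simulated deviations average out to roughly zero with spread near sigma, which is exactly the structure model (5) later assumes.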



We make two important points about equation (3). First, the "noise" or "deviation" component e represents everything that is not included in the "signal," β0 + β1X. It captures the fact that individuals in a population vary around their population mean, and such variation is entirely natural and expected. The e term is sometimes called "error." It is generally not error in the usual sense of the word, although it may include errors of measurement from sources such as imperfect instrumentation, human factors, and diurnal variation. For this reason we discourage the use of the term "error" to refer to individual deviation around the population mean. Second, coefficients β0 and β1 in regression equation (3) are unknown in the absence of any data. Data must be used to estimate these coefficients, yielding a fitted regression equation. Estimated coefficients are denoted by β̂0 and β̂1.

Example 1 (continued): For the sleep/diabetes study, the HbA1c and PSQI data in Figure 2 are used to estimate the regression coefficients, yielding fitted regression equation Y = 7.11 + 0.186X + e, corresponding to the line shown in Figure 3. In terms of the clinical variables, HbA1c and PSQI,

HbA1c = 7.11% + 0.186 (%/point) × PSQI + e;    (4)

the units of HbA1c are clearly given as percent (%) and of PSQI as PSQI "points." To distinguish between absolute change and relative change, the units of differences between percentages are percentage points. In the absence of a standard abbreviation for "percentage points," we use "%" in labels and equations and when referring to regression coefficients; for example, "%/point" in (4) stands for "percentage points of HbA1c per PSQI point." Equation (4) says that a randomly selected subject with PSQI score of, say, 5 points would have, on average, an HbA1c value of

7.11% + 0.186 (%/point) x 5 points = 8.04%.

Note that the units of PSQI (points) cancel.
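The arithmetic above is just the fitted line evaluated at a given PSQI score; a minimal sketch (the function name is ours, not from the chapter):

```python
# Fitted coefficients from equation (4): intercept 7.11 (%), slope 0.186 (%/point).
def predicted_hba1c(psqi_points: float) -> float:
    """Average HbA1c (%) predicted by equation (4) for a given PSQI score."""
    return 7.11 + 0.186 * psqi_points

print(f"{predicted_hba1c(5):.2f}")   # 8.04, as in the text
```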

The most important quantity in fitted equation (4) is the estimated regression slope of 0.186 (%/point): It says that, for individuals with PSQI scores separated by 1 point, their difference in HbA1c will on average be equal to 0.186 percentage point. In other words, as PSQI varies by 1 point from person to person, we expect to see corresponding differences in HbA1c between those individuals of 0.186 percentage point.

A necessary part of any regression analysis is the descriptive statistics of the component variables: PSQI in these data ranges from 0 to 14 points, with a mean of 6.0 and a standard deviation (sd) of 3.2 points. HbA1c ranges from 5.7% to 15.2% with a mean of 8.2% and a sd of 2.1 percentage points. Such information allows the reader to assess the clinical significance of the results. For example, the reader can determine that, for subjects whose PSQI scores differ by 2 standard deviations (2 × 3.2 = 6.4 points), the expected difference in HbA1c values is

0.186 (%/point) x 6.4 points = 1.19 percentage points.

Depending on the application, one might then ask whether such a difference in HbA1c values is large enough to be clinically important or even to suggest clinical intervention.



Figure 3. Linear regression analysis of HbA1c on PSQI. Average HbA1c is expressed as a linear function of PSQI score. The fitted value (on the line) and the observed data value (above or below the line) are shown (squares) connected by vertical segments for five selected observations.

Equation (3) is an example of a statistical model. As such, it states a general form of the relationship of HbA1c to PSQI, but leaves certain details of that relationship unknown; these unknowns are called parameters. In this instance the unknown parameters are the coefficients β0 and β1. An additional parameter is the population standard deviation σ of the deviations e. Thus, the complete statistical model can be expressed as

Y = β0 + β1X + e,    (5)

mean(e) = 0, sd(e) = σ.

The unknown parameters in this model are estimated from the data and have an interpretation that captures the scientific questions of interest in the study. In equation (5), β0 is the average or mean value of Y for the sub-population for whom X = 0 (even if such a person does not, or could not, exist), and β1 is the difference in average values of Y for two sub-populations separated in their value of X by 1 unit. Finally, individual values of Y vary around the mean β0 + β1X for each value of X. This deviation or "noise," represented by e, is assumed to be



random and to have mean 0, indicating that the individual could be on either side of the fitted line. The amount of variation is measured by the standard deviation σ of e or, equivalently, by its variance σ². When the model is fitted, the standard deviation σ is estimated, along with β0 and β1.

Example 1 (continued): In the fitted equation (4) we have estimated values β̂1 and β̂0. In addition, the estimate of the standard deviation σ is σ̂ = 2.03 percentage points.

First, a line is fitted to the points in the scatter plot, as in Figure 3. Each point in the original scatter plot now has a corresponding "fitted value" on the fitted line. Five such pairs of points are displayed in Figure 3. The arithmetic difference between the actual value and the fitted value (i.e., the segment connecting these two points) is the "residual" for that observation. The residuals are displayed in Figure 4, where their values can be read from the Y-axis and their distribution is displayed in a column at the right. Collecting all of the residuals together, we can present and examine their distribution via a boxplot or a histogram, as in Figure 5. From this distribution we can compute their mean (which is exactly 0 because of the method of fitting the line), their standard deviation, which is 2.03 percentage points, etc.
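The fitting procedure itself, ordinary least squares, can be sketched in a few lines. The data below are invented for illustration (the chapter's raw data are not reproduced here), but the properties just described, such as a mean residual of exactly zero, hold for any data set fitted this way.

```python
from statistics import mean

# Hypothetical (x, y) data for illustration; not the study data.
xs = [1.0, 3.0, 4.0, 6.0, 8.0, 9.0, 11.0, 14.0]
ys = [7.2, 7.4, 8.1, 7.9, 8.8, 8.3, 9.4, 9.9]

xbar, ybar = mean(xs), mean(ys)

# Least-squares slope and intercept.
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar

# Residual = observed value minus fitted value.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Residual standard deviation on n - 2 degrees of freedom.
n = len(xs)
resid_sd = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5

# R-squared: proportion of variability in y explained by x.
ss_tot = sum((y - ybar) ** 2 for y in ys)
r_squared = 1 - sum(e ** 2 for e in residuals) / ss_tot

print(f"slope = {b1:.3f}, intercept = {b0:.3f}")
print(f"mean residual = {mean(residuals):.1e}, residual sd = {resid_sd:.3f}, R^2 = {r_squared:.2f}")
```

The mean residual comes out as zero (up to floating-point rounding) because the least-squares line is fitted precisely so that the deviations balance; the residual sd and R² quantities computed here correspond to the σ̂ and R² reported for the study later in this section.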

Plots such as Figure 4 can be very useful. The plot should show random scatter about the zero line if the model is a good fit to the data. Any pattern in the

Figure 4. Residual plot of HbA1c on PSQI. Each residual is the difference between the observed value and the fitted value, and the mean residual (weighted by the number of observations) is zero. The residuals for the same five selected observations as in Figure 3 are shown (squares), and the distribution of all residuals is presented with hash marks on a vertical line at the right.


Figure 5. Histogram of residual HbA1c after regression on PSQI. The mean residual is zero, and the standard deviation is indicated with vertical lines at ±2.03.

residuals, for example a U-shaped or inverted-U-shaped cloud, would suggest otherwise. If the degree of scatter increases or decreases with X, this suggests that sd(e) is not constant, as assumed in model (5). Such regression diagnostics are discussed later in the chapter.

Another quantity sometimes reported with fitted regression models is R-squared (R²), sometimes called the coefficient of determination. Briefly, R² ranges from 0 to 1 and represents the proportion of total variability in Y (about its mean) that is accounted for or "explained" by X in the linear regression model. The higher the value of R², the better is the prediction from the model, although explanatory models with low R² are often quite useful as well. In the linear regression of HbA1c on PSQI, R² = 7.9%; i.e., PSQI accounts for 7.9% of the total variability in HbA1c.

Standard Errors, Tests, and Confidence Intervals

The previous section described the estimation and interpretation of regression coefficients β and of the standard deviation σ of Y around its mean β0 + β1X. In regression analysis, as in other statistical procedures, however, estimation of unknown statistical parameters such as β0 and β1 is only one part of the analysis.

Because estimates β̂1 and β̂0 of β1 and β0 are themselves based on data, their values involve statistical uncertainty, called sampling variation, and it is important to quantify this uncertainty. Sampling variation is the variability in parameter


estimates that arises because we observe a sample from the population, and not the entire population; it is the reason that we cannot make perfectly precise statements about model parameters. Below we discuss the quantification of statistical uncertainty via standard errors, and the role of standard errors in testing hypotheses and constructing confidence intervals.

Standard Errors of Regression Coefficients

Sampling variation of β̂0 and β̂1 is usually quantified by their standard errors. For the study of HbA1c and sleep quality, Table 1 shows the estimates for β0 and β1, along with their standard errors. The standard error of the estimate of β1 is 0.0577 (%/point); this quantity serves as input to hypothesis tests and confidence intervals.

The standard errors of β̂0 and β̂1 reflect three key features of the data. First, standard errors are proportional to the standard deviation σ of the residuals. This makes intuitive sense because the larger the value of σ, the less precisely we are able to estimate the mean of Y, and the regression model is, after all, a summary of the mean of Y as a function of X.

Second, the standard error of the estimated slope β̂1 is inversely proportional to the standard deviation of the X-values. If the values of X vary little about their overall mean, then X cannot tell us much about individual values beyond the overall mean as a summary of Y. A wide range of slopes will produce regression lines that are reasonably close to the data; that is, the slope of the line is not well determined. For a given value of σ, a larger standard deviation of X yields a more stable estimate of β̂1.

Third, the standard errors are inversely proportional to the square root of the sample size n used in fitting the model to the data. That is, the larger the value of n, the more precisely we are able to estimate the regression model.
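These three ingredients combine in the standard formula se(β̂1) = σ̂ / sqrt(Σ(xi − x̄)²), which, in terms of summary statistics, equals σ̂ / (s_X · sqrt(n − 1)). As a check (a sketch using the summary values reported for this study: σ̂ = 2.03, s_X = 3.2 points, n = 122), the formula reproduces the standard error shown in Table 1:

```python
import math

sigma_hat = 2.03   # residual sd (percentage points), as reported
sd_x = 3.2         # sd of PSQI (points), as reported
n = 122

# se(beta1) = sigma_hat / sqrt(sum of squared deviations of x)
#           = sigma_hat / (sd_x * sqrt(n - 1))
se_slope = sigma_hat / (sd_x * math.sqrt(n - 1))
print(f"se(beta1) = {se_slope:.4f}")   # matches the 0.0577 reported in Table 1
```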

Table 1. Summary Statistics for Fitted Regression Model of HbA1c (%) on PSQI for n = 122 Subjects.

                                                    95% Confidence Limits
Term            Coef.   Estimate   SE       t      p-value   Lower    Upper
Intercept       β0      7.11       0.391    18.2   0.000     6.34     7.88
PSQI (points)   β1      0.186      0.0577   3.21   0.002     0.071    0.300

Estimated standard deviation σ̂ is 2.03 percentage points on 120 degrees of freedom. Model R² = 7.9%.



Hypothesis Tests about Regression Coefficients

The standard error also yields a test statistic t for the null hypothesis that β1 = 0. This null hypothesis corresponds to no linear association between Y and X. The hypothesis that β1 = 0 is important because, as the scientific hypothesis in many medical studies is one of association between two variables, the corresponding null statistical hypothesis to be tested is that of no association.

The t-statistic is the ratio of β̂1 to its standard error. The corresponding p-value comes from a t-distribution whose degrees of freedom (df) equals the sample size minus the number of regression coefficients estimated in the model. Simple linear regression involves two regression coefficients, including the intercept, so that df = n - 2. The larger the df, the more precise will be the estimation of σ, and hence the smaller the standard error of β̂1.

Example 1 (continued): Using the estimate and standard error reported in Table 1, the test statistic t for the slope β1 is computed as

t = estimate / standard error = β̂1 / se(β̂1) = 0.186 / 0.0577 = 3.21

on 120 df, yielding the two-sided p-value 0.002. This is strong evidence that the measured value of HbA1c is associated with PSQI.
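Readers who want to reproduce this arithmetic can use the following minimal sketch (assuming SciPy is available; Table 1 reports rounded values, so results agree only approximately):

```python
# Recompute the t-statistic and its two-sided p-value from Table 1 entries.
from scipy.stats import t as t_dist

estimate = 0.186  # slope of HbA1c on PSQI, %/point (Table 1)
se = 0.0577       # standard error of the slope (Table 1)
df = 120          # n - 2 = 122 - 2

t_stat = estimate / se                    # about 3.22 from the rounded inputs
p_value = 2 * t_dist.sf(abs(t_stat), df)  # two-sided p-value, about 0.002
print(round(t_stat, 2), round(p_value, 3))
```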

Confidence Intervals on Regression Coefficients

Estimates of regression coefficients are usually accompanied by 95% confidence intervals. Under appropriate (and common) conditions, the confidence interval will include the true but unknown population value with the specified probability. Thus, unless one believes that a rare event has occurred, any parameter value outside the range of values delineated by the confidence interval is considered incompatible with the data.

Example 1 (continued): Table 1 gives the 95% confidence interval for β1 as (0.071, 0.300) (%/point). This range does not contain 0, again suggesting that β1 = 0 is inconsistent with these data and implying that a null relationship of HbA1c to PSQI is also inconsistent with these data. Indeed, for these data, the true slope is likely to lie somewhere between 0.071 and 0.300 (%/point). The bounds of the confidence interval are obtained by computing 0.186 ± 1.98 × 0.0577 ≈ (0.071, 0.300). (The 1.98 is the 97.5th percentile of the t-distribution on 120 df.)
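The same interval can be checked directly, as in this sketch (again assuming SciPy; small discrepancies in the last digit come from rounding in Table 1):

```python
# 95% confidence interval for the slope: estimate +/- t(0.975, df) * SE.
from scipy.stats import t as t_dist

estimate, se, df = 0.186, 0.0577, 120
t_crit = t_dist.ppf(0.975, df)   # 97.5th percentile on 120 df, about 1.98
lower = estimate - t_crit * se
upper = estimate + t_crit * se
print(round(t_crit, 2), round(lower, 3), round(upper, 3))
```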

CORRELATION VERSUS REGRESSION

Simple linear regression is a method of describing and quantifying a relationship of one variable Y to another variable X. A related measure, the correlation coefficient, summarizes the direction and the strength of linear relationship between


X and Y on a scale that does not involve the specific units of X and Y (i.e., it is unit-free). Numerous correlation coefficients have been developed as summary measures of association for various types of data. The most common, for continuous data, is the "Pearson" or "product-moment" correlation, denoted by r or rYX to indicate the two variables involved. As with linear regression, Pearson correlation is applicable when the relationship between X and Y is at least approximately linear. The correlation coefficient varies from −1 to +1; rYX = 0 indicates no linear relationship between Y and X, rYX = 1 indicates perfect positive (linear) association, and rYX = −1 perfect inverse (linear) association (i.e., the line in Figure 1 would slope downward rather than upward). Neither X nor Y is viewed as the response or the predictor, so the correlation coefficient is a symmetric measure (i.e., rYX = rXY). Before using a correlation coefficient to summarize a relationship, the variables should be displayed in a scatter plot. The pattern in the plot can help to determine whether the two variables are linearly associated and, if so, the direction and strength of the association.

By contrast, in regression analysis, the response variable Y is of primary scientific importance, whereas the explanatory variable X is important to the extent that it predicts Y. The regression slope β1 captures the quantitative relationship of Y to X and helps predict the value of Y for a given value of X. Unlike the correlation coefficient, the regression slope can be any number from −∞ to +∞. Also, the slope is not a symmetric measure: switching the roles of Y and X will generally give a different value for the slope.

The correlation coefficient and regression coefficients capture different aspects of the same relationship. Correlation helps determine whether there is a relationship between X and Y, and how strong the relationship is, whereas the regression coefficient quantifies that relationship. Positive (negative) regression slope corresponds to positive (negative) correlation, and the slope is 0 if and only if the correlation coefficient is also 0. In a scatter plot of Y versus X, the correlation coefficient captures how closely the points (X, Y) follow the regression line, with rYX = ±1 indicating the ideal situation where all the points fall directly on the line. An example of such an ideal relationship is depicted in Figure 1.
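These connections can be seen numerically in a small sketch on synthetic data (all values invented, not the chapter's): the fitted slope equals the correlation times the ratio of standard deviations, so the two measures always share a sign and are zero together.

```python
# Link between Pearson correlation and the regression slope:
# slope = r * (sd of Y) / (sd of X).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(6.0, 3.0, size=200)                  # a PSQI-like predictor
y = 7.0 + 0.2 * x + rng.normal(0.0, 2.0, size=200)  # a HbA1c-like response

r = np.corrcoef(x, y)[0, 1]         # Pearson correlation
slope = np.polyfit(x, y, 1)[0]      # least-squares regression slope
print(np.isclose(slope, r * y.std(ddof=1) / x.std(ddof=1)))  # True
```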

Correlation coefficients are most suitable when Y and X are on an equal scientific footing. They might be two measures of the same quantity, neither one necessarily better than the other—for example, two psychiatric assessments using the same structured interview, but made by different clinical raters, or two mammographic breast density readings made by the same rater at different times. Or, they may be two different measures of the same underlying construct, e.g., a urine test and a blood test for a specific hormone, or sleep measured by motion sensors (actigraphy) and by self-report. These are appropriate applications of a symmetric measure of association.


Because the correlation coefficient is symmetric and free of the units of measurement of X and Y, its interpretation is thought to be somewhat portable across settings, permitting comparisons of the strength of association across samples, across studies, or across pairs of variables. Loosely, correlations less than 0.25 indicate fairly weak association, and correlations between 0.25 and 0.50 and above 0.50 reflect moderate and strong associations, respectively. Corresponding interpretations apply to negative values. However, in practice this portability goes only so far, because what is strong association in one context may be weak in another. In the HbA1c–PSQI example, a correlation of 0.30 between HbA1c and PSQI may be considered strong, given the multiple factors that affect each of these variables and the multiple physiological pathways that may link them. On the other hand, 0.30 may be considered weak when it describes the association between two measures of the same construct, such as urine and blood tests of the same hormone. In another example, in a study involving the intelligence quotient (IQ) of children and their mothers,6 the correlation between children's measured IQ at ages 3 and 5 years was 0.67, which is only moderately strong given that both measures are presumably capturing the same stable characteristic on the same individuals. On the other hand, the correlation between mother's IQ and child's IQ in that same study is 0.52, which seems remarkable in the presence of the many environmental and genetic factors shared across generations.

Despite the benefits of a symmetric measure of association, this feature can be a hindrance in many settings. In most of biomedical science, a symmetric measure ignores the interest in the primary direction of the association. We may associate increasing age with higher blood pressure, but we certainly do not think that higher blood pressure indicates (or even causes) old age. As such, it is natural to model blood pressure as a function of age, but not vice-versa, and to express this relationship using a regression slope instead of a correlation coefficient. In addition, the units of the regression slope carry important clinical information that the correlation coefficient does not. In the HbA1c–PSQI example, the regression slope was 0.186 (%/point); i.e., a one-point difference in PSQI corresponds on average to a 0.186 percentage point difference in HbA1c. The units give a sense of the clinical importance of the relationship of HbA1c to PSQI.

ASSOCIATION AND CAUSATION

The estimated regression slope β̂1 and the correlation coefficient r quantify the association of the response variable Y to the predictor variable X. Often, however, the underlying scientific issue is whether and to what degree X causes Y. The distinction is evident in two common interpretations of the coefficient β1. One interpretation says the following:


When comparing two subjects randomly drawn from the population who have X values that differ by one unit, the difference in their Y values will, on average, be equal to β1 (association).

The emphasis here is on comparing average pairs of subjects in the population. An alternative and subtly different interpretation is:

If we increase any given subject's X value by one unit, then we will see a corresponding increase in his/her Y value, and that increase will be, on average, equal to β1 (causation).

The first interpretation, association, simply says that Y and X are associated in the population, but the second, stronger interpretation is one way of formalizing causation (see, e.g., Rubin7,8).

To examine the distinction more concretely, we consider a relationship between intellectual ability and blood lead concentration in young children.

Example 2: Though it is fairly well established that blood lead concentrations above 10 μg/dl hinder normal neurobehavioral development in young children, questions remain about the effects of lead at levels below 10 μg/dl. To address this question, Canfield et al. studied the association of intellectual impairment to lead exposure in children.6 They collected blood lead concentration data and intelligence quotient (IQ) test results on 172 healthy children, as well as other variables such as maternal IQ, race, maternal education, tobacco use during pregnancy, household income, and the child's sex, birth weight, and iron status. Blood lead concentrations were collected at ages 6, 12, 18, 24, 36, 48, and 60 months and were used to compute a lifetime average blood lead concentration through age 5 years. IQ and other information was complete for 154 of these children. Lifetime average blood lead concentration (± sd) for this sub-sample was 7.4 ± 4.3 μg/dl (approximate range: 0 to 30 μg/dl), and IQ was 89.8 ± 11.4 points (approximate range: 64 to 128 points). Linear regression of IQ on lifetime average blood lead concentration yielded an estimated regression slope of β̂1 = −1.00 points/(μg/dl) with a 95% CI of [−1.38, −0.63] points/(μg/dl) (p < 0.001), indicating a statistically and clinically significant inverse association of IQ to blood lead concentration.

In this example, because lead can accumulate in the bloodstream even at very low levels of ingestion, it is possible to imagine an intervention—albeit an unethical one—that increases a child's blood lead concentration while nothing else in that child's environment changes. What will it then take for the regression coefficient to be interpreted as causal?

Fitting the linear regression model allows us to quantify and measure the empirical association between IQ and blood lead concentration. Blood lead concentration is on the right-hand side of the equation because the investigators placed it there and not because the data have anything to say about blood lead concentration preceding IQ in some causal chain of events. From a purely empirical perspective, they could have just as easily regressed blood lead levels on IQ. What they have observed and reported in the data is an association in the population between individuals' blood lead levels and their IQ.

This association may arise because elevated lead in the bloodstream leads physiologically to a decrement in IQ. Or it could be that a variety of other factors contribute to both elevated blood lead and depressed IQ in some persons in the population. For example, children with lower IQ will tend to have parents


with lower IQ owing to the documented heritability of IQ. And adults with lower IQ often have fewer economic opportunities to rent or purchase safe and well-maintained housing that is free of decaying, chipping, or flaking lead-based paint. Most likely, the observed association is a combination of such explanations. Because the data themselves hold no evidence about whether the first, second, or some other explanation is most accurate, it is unwise to use these data alone to attach a causal interpretation to any observed association. Other information about the study design and/or the biological processes involved may lend support to an inference of causation. In the absence of such information we would avoid statements such as that in the second, causal interpretation.

Building on this example, one might ask under what conditions will the causal interpretation be valid? The most straightforward one is that of a randomized controlled clinical trial. Suppose that a sample of patients is enrolled in a trial and that each one is randomly assigned a dose X of a drug. The subjects are followed for a response Y, and then the data are analyzed via linear regression of Y on X. In this instance the study protocol has intervened on X. In addition, each subject's value of X is under complete control of the randomization process set out in the study design. Statistically, randomization ensures that, on average, the effects of any factors that may be related to Y will be the same for each randomization group. Thus, it is fair to assign a causal interpretation to an observed association.

CAREFUL USE OF LINEAR REGRESSION

Fitting a linear regression model provides for a simple description of the statistical relationship of Y to X. However, careful analysis examines how well the model captures the patterns in the data and how far we can take it in summarizing the relationship and in making predictions about Y from X. Key questions include: Is the relationship really linear? What is the statistical uncertainty in the model fit and predictions across the range of values of X? Are any data points not well described by the model? Is the story being told by the fitted model dominated or attenuated by a few data points?

Answering these questions is an important part of data analysis; regression diagnostics offer a wide array of tools for these tasks.9,10 This section illustrates some of the basic ideas in a non-technical manner.

Linearity of Association

Linear regression captures the linear part of the relationship of the mean of Y to X. Such a description is appropriate if the relationship is fairly linear, as in the HbA1c–PSQI example (Figure 3). Can we make the same statement for the IQ–blood lead


example? If that relationship is well approximated by a line, then the regression model will provide a parsimonious and accurate description of the data.

Example 2 (continued): Figure 6 reproduces a scatterplot of IQ versus lifetime average blood lead concentration from Canfield et al.6 A line fitted through these points yielded a regression slope of −1.00 points/(μg/dl). However, the plot reveals that summarizing this relationship as linear is an over-simplification. In fact, the inverse relationship is stronger for blood lead concentrations between 0 and 10 μg/dl, and there is almost no association between IQ and blood lead concentration above 10 μg/dl. The authors of the study recognized this, both from looking at the data and because the research question focused on blood lead concentrations below 10 μg/dl. Thus, they re-fitted the model, restricting the sample to those 86 children (56% of the 154) with lifetime peak blood lead concentration below 10 μg/dl

Figure 6. Scatter plot and nonlinear regression curve fitted to IQ versus lifetime average blood lead concentration (μg/dl). The line represents the relation between IQ and lifetime average blood lead concentration estimated by a covariate-adjusted penalized-spline mixed model. Individual points are the unadjusted lifetime average blood lead and IQ values. Source: Canfield et al.6


at age 5 years. This sub-sample yielded a regression slope of −2.54 points/(μg/dl) (95% CI: [−4.01, −1.07] points/(μg/dl)); that is, two children who differ in blood lead concentration by 1 μg/dl will, on average, differ by −2.54 IQ points, so long as both children have blood lead concentrations below 10 μg/dl.

The non-linearity problem highlighted in this example can be handled in several ways. One is the approach that Canfield et al. took. In their study, this was appropriate because the ceiling value of 10 μg/dl was implicit in the study aims. In other settings, using the data themselves to choose a subset on which to focus could introduce additional uncertainty into the analysis. It will be difficult to account for this uncertainty in hypothesis tests and confidence intervals about the regression slope. Another approach is to transform the X variable, perhaps by taking logarithms. This would have the effect of stretching out the points on the left of the plot, and squeezing together the points on the right of the plot, thus rendering the relationship more nearly linear. However, this approach complicates interpretation of the results because blood lead concentrations are now on the logarithmic rather than the natural scale. A third approach is to transform the response Y. But this is not always satisfactory either, because the transformed response might not be as easy to interpret clinically as the untransformed response. The approaches associated with generalized linear models11 offer an alternative (beyond the scope of this chapter) that avoids transforming the response. Yet another approach is to include linear, quadratic (blood lead concentration squared), and even higher-order terms in the regression model. Another family of functions called splines, in which line segments or curves join at points to cover the range of X, often yields better results. Use of either higher-order terms or splines expands the regression equation into a multiple linear regression model with its own implications of model fitting, inference, and interpretation. A final note is that, even if the relationship of Y to X is not well approximated by a line, the linear regression procedure will still produce estimates. Though those estimates may be useful, they will over-simplify the relationship in the data.
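Two of these approaches can be sketched on synthetic, curved data (all values invented for illustration): adding polynomial terms, and transforming X before fitting a straight line.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.5, 30.0, size=300)              # a lead-like predictor
y = 100 - 12 * np.log(x) + rng.normal(0, 3, 300)  # true curve is logarithmic

# Approach 1: polynomial terms -- fit y on x and x^2 (a quadratic).
quad_coefs = np.polyfit(x, y, 2)

# Approach 2: transform X, then fit a straight line to y versus log(x).
log_slope, log_intercept = np.polyfit(np.log(x), y, 1)
print(round(log_slope, 1))  # close to the true value of -12
```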

Extrapolation

Having fitted a regression model relating Y to X, one might wonder whether it should be used to predict Y for values of X outside the range available in the data. That is, can we extrapolate beyond the observed range of X? We recommend against extrapolation. For example, predicting mean HbA1c values for PSQI equal to 20 or 25 points would be a problem because the data all lie below 15 PSQI points, so they provide no empirical support whatsoever for fitted values in the range of 20 to 25 points. We have no way of verifying whether the relationship of Y to X continues to be linear outside the observed range of X.


In fact, the relationship of Y to X is almost never exactly linear. Instead, it will often be well approximated by a line for the observed values of X. That is, the linear relationship of Y to X is fairly local. Over a large enough range of X, that linearity will almost certainly break down, as we observed in the IQ–blood lead example. The relationship was strong and linear in the range of X from 0 μg/dl to 10 μg/dl, but weakened above 10 μg/dl. Extrapolation is dangerous because it ventures into uncharted X territory, where statistical uncertainty is high and model form is uncheckable.

An important form of extrapolation can arise when interpreting the regression intercept. In the HbA1c–PSQI example, the estimated intercept of 7.11% in fitted model (4) refers to the mean HbA1c level for individuals with PSQI = 0. This may be a large and important group, or it may represent only a few individuals. In other analyses, the X variable may never be equal to 0; then interpretation of the intercept represents a clear extrapolation outside of the observed range of X. These problems can be avoided by "centering" X at some meaningful value before fitting the model. The value may be the mean of X (or something close to it) or some other scientifically or clinically relevant reference value. For our example, as the mean PSQI is 6 points, and scores greater than 5 points indicate poor sleep quality, it might be reasonable to center X at 5 points. Thus, instead of

Y = β0 + β1X + e,

we would fit

Y = β0* + β1(X − 5) + e,

which would yield the fitted equation

HbA1c = 8.04% + 0.186 (%/point) × (PSQI − 5) + e.

The estimated regression slope in this new model is the same as that in the original model, and has the same interpretation. But the intercept of 8.04% now refers to the mean HbA1c level among the subpopulation of individuals with PSQI = 5, a scientifically important group on the cusp of poor sleep quality.
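A small sketch with invented data confirms the algebra of centering: the slope is untouched, and the new intercept equals the old intercept plus 5 times the slope.

```python
import numpy as np

rng = np.random.default_rng(3)
psqi = rng.integers(0, 15, size=122).astype(float)
hba1c = 7.1 + 0.19 * psqi + rng.normal(0.0, 2.0, size=122)

slope_raw, int_raw = np.polyfit(psqi, hba1c, 1)
slope_ctr, int_ctr = np.polyfit(psqi - 5.0, hba1c, 1)  # center X at 5 points

print(np.isclose(slope_raw, slope_ctr))                # True: slope unchanged
print(np.isclose(int_ctr, int_raw + 5.0 * slope_raw))  # True: intercept shifts
```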

Residuals and Outliers

Thorough analysis of a linear regression model seldom appears in the medical literature. Such analysis includes study of the departures of the Y values from the fitted model. These are the individual deviations or residuals, estimated as

(residual) = (observed) − (fitted)

and plotted for the HbA1c–PSQI example in Figures 4 and 5. In symbols,

ê = Y − (β̂0 + β̂1X),   (6)


where Y is the observed response, β̂0 + β̂1X is the fitted value for the mean of Y given X, and the regression slope and intercept estimates are denoted by β̂1 and β̂0.

Ideally, the residuals show no remarkable behavior; they are simply a sample of chance fluctuations. Sometimes, however, a few stand out from the rest. Such outliers generally deserve investigation. An effective diagnostic approach, which often reveals outliers more readily, uses a definition of residuals based on the leave-one-out principle. The idea is to compare each observation to a regression model estimated from the rest of the data, leaving out the target observation. Then the leave-one-out residual is the difference between the Y value and the regression line at the corresponding X, as in equation (6), but obtaining the fitted line from all data except the target observation. The resulting residuals are then rescaled so that, at least approximately, each has mean 0 and standard deviation 1. Remarkably, in linear (or multiple) regression it is possible to compute such leave-one-out quantities without actually refitting the model. Discrepant values of Y in the data can then be detected by displaying these rescaled residuals.
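The no-refitting shortcut rests on standard hat-matrix algebra. The sketch below (invented data, not the chapter's) computes one standard version of rescaled leave-one-out residuals, the externally studentized residuals, in a single pass and picks out a planted outlier.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 15, size=60)
y = 7.0 + 0.2 * x + rng.normal(0.0, 1.0, size=60)
y[0] += 8.0  # plant one grossly outlying response at index 0

X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
h = np.diag(H)                              # leverages
e = y - H @ y                               # ordinary residuals
n, p = X.shape
s2 = e @ e / (n - p)                        # residual variance estimate
# Leave-one-out variance, then externally studentized residuals:
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
t_loo = e / np.sqrt(s2_del * (1 - h))
print(int(np.argmax(np.abs(t_loo))))        # flags the planted outlier: 0
```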

Example 1 (continued): For the HbA1c–PSQI example, in a box plot of rescaled leave-one-out residuals (Figure 7) the largest three values stand out from the rest of the data. These three points are flagged in Figure 8.

Figure 7. Rescaled leave-one-out residuals of HbA1c after regression on PSQI. Three potential outliers (two of which are nearly on top of each other) are identified and flagged in Figure 8. The vertical axis is the standardized leave-one-out residual of HbA1c.


Figure 8. Linear regression analysis of HbA1c on PSQI. Average HbA1c is expressed as a linear function of PSQI score. The three points with the largest standardized residuals are flagged with "outlier?".

Outliers raise the question of what should be done about them. First, one should confirm that the data values have been accurately recorded and entered into the data set. Second, it is often worth examining other information on these subjects to gain insight into why their Y values are extreme; perhaps something new can be learned from these subjects. In the HbA1c–PSQI example, perhaps the subjects with large HbA1c values have had diabetes for longer periods of time or have other comorbidities that set them apart from the rest of the sample. Usually we do not advise removing these subjects from the analysis, unless a data error is found that cannot be corrected or, upon further inspection, it is discovered that the subject should have been excluded from the study. Rather, it is appropriate to include them in the analysis, perhaps making special note of them. If they are removed (as a form of sensitivity analysis), any report should include both sets of results. Displays of the data go a long way toward letting the reader decide how to interpret any outlying observations.


Influential Observations

Detection of outliers is important because such data points may be interesting in the way in which they depart from the bulk of the data. Data points can also be unusual in their contribution to the estimated regression equation and, in particular, to the regression slope. Such contributions constitute the influence of that point. Each point's influence on the slope estimate can be quantified by applying the leave-one-out principle—by taking the difference between the slope estimate computed from the full data set, and the estimate computed with that point removed—and technical criteria exist for flagging points as overly influential. We illustrate the idea with the HbA1c–PSQI example.
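This leave-one-out influence measure can be sketched by brute force on invented data (diagnostic software computes essentially the same quantity, often under the name DFBETA, without refitting):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 8, size=50)
y = 7.0 + 0.2 * x + rng.normal(0.0, 1.0, size=50)
x[0], y[0] = 14.0, 13.0   # one high-leverage, high-response point

full_slope = np.polyfit(x, y, 1)[0]
influence = np.empty_like(x)
for i in range(len(x)):
    mask = np.arange(len(x)) != i        # leave observation i out
    influence[i] = full_slope - np.polyfit(x[mask], y[mask], 1)[0]

print(int(np.argmax(np.abs(influence))))  # the extreme point, index 0
```

The planted point pulls the slope upward, so its influence value is positive: removing it lowers the fitted slope.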

Example 1 (continued): Figure 9 displays the HbA1c versus PSQI regression line with the full data set and a new regression line with two of the strong influence points removed. The new estimate of the regression slope is 0.129 (%/point) with test statistic t = 2.30 (p = 0.023). This compares with 0.186 (%/point), t = 3.21 (p = 0.002; Table 1) using the full data set. Although these points are pulling the regression slope upward, even without them we still obtain a statistically significant association of HbA1c to PSQI. The two observations are indicated in Figure 9, providing some insight into why they are influential. In this instance,

Figure 9. Linear regression analysis of HbA1c on PSQI. Average HbA1c is expressed as a linear function of PSQI score. The two points with the largest (in absolute value) influence on the regression slope are flagged with the word "removed." The two regression lines are computed with the full data set and with the two points removed.


it is because they have both high values of PSQI (12 points; 99th percentile) and high values of HbA1c (14.2% and 15.3%); the effect is to pull the regression line up for large values of PSQI. Conversely, two of the three outliers flagged in Figure 8 are not points of high influence, because their PSQI values are not extreme.

Points with extreme X values are said to have high leverage because their Y values are given more weight in estimating the slope; if their Y values are also large, then that potential to be influential is realized, as shown in Figure 9.

Careful use of regression analysis involves detection and investigation of individual points or small groups of points that may be influential. If any one point has disproportionate influence, it deserves special note. Conversely, if several points have large but relatively equal influence, then none of them by themselves can really be considered influential, relative to the others. On the other hand, if those points form a group for some reason other than their influence—e.g., those points belong to the oldest subjects or to subjects from the same clinic—then the group as a whole may warrant further attention.

MULTIPLE LINEAR REGRESSION

Linear regression extends well beyond examining the relationship of a continuous Y variable to a single continuous X variable, covering situations with predictors that are not continuous and with multiple predictors, each telling part of the story about the response Y. Multiple linear regression covers this broader domain. In this section we develop and interpret multiple linear regression models by looking first at a model with one continuous and one categorical predictor, and then at a model with two continuous predictors. We then turn via example to a general formulation and interpretation of multiple linear regression. This general model is the basis for the use of linear regression for summarization, statistical adjustment, and prediction.

One Continuous and One Categorical Predictor

We begin with an analysis of HbA1c and its association with sleep quantity, a continuous variable, and whether a person experiences diabetic complications, a categorical variable. The multiple linear regression model takes the form

Y = β0 + β1X1 + β2X2 + e.   (7)

Here X1 is a continuous predictor and X2 is a binary indicator or dummy variable, taking value 1 for membership in a given group or sub-population and 0 otherwise. Indicator variables often appear in the clinical and epidemiologic literature because, as we shall see, they capture the difference in the mean response between groups.
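A model of the form (7) can be fitted by least squares as in this sketch, using synthetic data with one continuous and one 0/1 predictor (all coefficients here are invented, not the published estimates):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 122
debt = rng.uniform(0, 6, size=n)             # continuous predictor X1
diac = (rng.random(n) < 0.43).astype(float)  # 0/1 indicator predictor X2
y = 8.0 + 0.4 * (debt - 2) + 0.7 * diac + rng.normal(0.0, 1.0, size=n)

X = np.column_stack([np.ones(n), debt - 2, diac])  # columns [1, X1, X2]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares fit
b0, b1, b2 = beta
print(round(b1, 1), round(b2, 1))  # near the true values 0.4 and 0.7
```

Here b2 estimates the mean difference in Y between the two groups defined by the indicator, holding X1 fixed, exactly as described in the text.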


Example 3: In the analysis of HbA1c and sleep in persons with diabetes (Example 1), Knutson et al. were interested not only in sleep quality (as measured by the PSQI), but also in sleep quantity. To measure sleep quantity, they used the notion of perceived sleep debt, defined as the difference between the preferred and reported sleep duration. Perceived sleep debt ranged from 0 to 6 hours, with a mean (sd) of 1.7 (1.5) hours. The analyses accounted for the history or presence of major clinical diabetic complications (including neuropathy, retinopathy, nephropathy, coronary artery disease, and peripheral vascular disease), which could be associated with elevated HbA1c (from long-term poor diabetes control) and could also lead to decreased sleep quality and quantity. Fifty-two (43%) subjects had at least one diabetic complication. In light of these goals, we fit a multiple linear regression model as in equation (7) where Y is HbA1c, X1 is perceived sleep debt (DEBT), centered at two hours, and X2 is an indicator variable (DIAC) taking value 1 for the presence of any diabetic complications and 0 for no diabetic complications. The fitted equation is

HbA1c = 8.08% + 0.39 (%/hour) × (DEBT − 2) + 0.66% × DIAC + e. (8)

Referring to equation (7), we have estimates β1 = 0.39 (%/hour) and β2 = 0.66%, with corresponding standard errors of 0.12 (%/hour) and 0.37%, yielding p-values of 0.002 and 0.079. We conclude that greater sleep debt is significantly associated, and the presence of diabetic complications is marginally associated, with higher levels of HbA1c.

The presence of both DEBT and DIAC in fitted model (8) complicates the interpretation of each of their coefficients. Certainly, the slope β1 is the estimated average difference in Y corresponding to a unit difference in X1, i.e., the linear effect of X1 on Y. Similarly, β2 is the estimated average difference in Y between the two groups defined by indicator variable X2. (Although β2 is technically a slope, it turns out to be a difference between groups because X2 is dichotomous.) Additionally, and importantly, the coefficient corresponding to each predictor must be interpreted as adjusted for the other predictor in the model. For example, β1 expresses the estimated effect of DEBT on mean HbA1c, adjusting for differences in HbA1c due to DIAC. This adjustment aims to remove the part of the HbA1c–DEBT association that is due to DIAC. To see how this

works, we continue with the example.

Example 3 (continued): To gain a sense of how DIAC (X2) might be influencing the observed association of HbA1c (Y) to DEBT (X1), we note that the mean HbA1c level in the presence of (one or more) diabetic complications is 8.69%, but 7.88% in the absence of complications. Additionally, the mean sleep debt is 1.88 hours and 1.49 hours in the presence and absence of complications, respectively. Because HbA1c and DEBT both covary with DIAC, part of the observed association of HbA1c to DEBT could be accounted for by each variable's association with DIAC. Adjustment aims to remove the part of the HbA1c–DEBT association that is due to DIAC.

Examining the data separately in the two diabetic complication groups, Figure 10 plots HbA1c versus sleep debt in each group. Included in that plot are simple linear regression lines for each group. These lines correspond to the fitted equations

HbA1c = 8.73% + 0.42 (%/hour) × (DEBT − 2) + e

for the group with complications, and


Figure 10. Data and linear regression analysis of HbA1c on perceived sleep debt (hours minus 2), stratified by diabetic complications: those without any, and those with at least one diabetic complication. A fitted HbA1c line is shown for each group. Data points have been shifted slightly to the right (left) for those with (without) diabetic complications to improve readability of the plot.

HbA1c = 8.06% + 0.36 (%/hour) × (DEBT − 2) + e

for the group without complications.

We make two important observations about this plot and these regression equations. First, the two regression lines are nearly parallel. Second, the intercepts in the two regression equations are 8.73% and 8.06%, representing the mean HbA1c levels in the two groups when sleep debt is exactly two hours. The difference between the two is 0.67 percentage points, indicating higher HbA1c levels in the group with diabetic complications.

Now suppose that the two regression lines were exactly parallel. This fact would have two consequences. First, the regression slope of HbA1c with respect to DEBT would be the same regardless of which group we considered, and only one slope would be needed. Second, the difference in HbA1c between the two groups would be the same at all levels of DEBT.

Model (7) forces these two regression lines to be parallel. Under this restriction, which appears reasonable for these data, β1 can be interpreted as a within-group regression slope that is common to the two groups, and β2 is a between-group difference that holds at any given level of X1. In fitted model (8)


the estimated slope of DEBT, 0.39 (%/hour), is adjusted for DIAC: it strikes a balance between the two separately estimated slopes, 0.42 and 0.36 (%/hour). Also, the estimated coefficient of DIAC of 0.66% is adjusted for DEBT; it is very close to the difference of 0.67% estimated from the separate regressions as the between-group difference when DEBT is fixed at two hours.
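The way the single adjusted slope balances the two within-group slopes can be verified numerically. This sketch uses simulated data (invented values, not the study data); the additive model's common slope is a weighted average of the two separately fitted slopes, so it always lies between them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: two groups whose true slopes differ slightly.
n = 300
x1 = rng.normal(0.0, 1.5, n)                 # continuous predictor
x2 = (rng.random(n) < 0.5).astype(float)     # group indicator
y = (8.0 + 0.7 * x2
     + np.where(x2 == 1, 0.42, 0.36) * x1
     + rng.normal(0.0, 0.8, n))

def slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

s1 = slope(x1[x2 == 1], y[x2 == 1])          # slope within group X2 = 1
s0 = slope(x1[x2 == 0], y[x2 == 0])          # slope within group X2 = 0

# Additive model (7): one common slope plus a group shift.
X = np.column_stack([np.ones(n), x1, x2])
b_common = np.linalg.lstsq(X, y, rcond=None)[0][1]
```

The common slope equals the within-group sum-of-squares weighted average of s0 and s1, which is why it falls between them.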

The adjustment reflects the presence of both variables in the model, relative to the corresponding slopes in two separate models: the simple linear regression of Y on X1 and the simple linear regression of Y on X2. In the first of these models, the slope of HbA1c against DEBT − 2 (i.e., unadjusted for DIAC) is 0.42 (%/hour). From model (8), β1 = 0.39 (%/hour); the presence of DIAC in the model has only a small impact on this slope. In this example this result is not surprising because the difference in sleep debt in the two complications groups is small (relative to the sd of DEBT), but in some applications the impact is substantial. In general, the regression slope β1 represents the linear relation of Y to X1 after accounting for (or "net of") the linear contribution of X2. In this sense β1 = 0.39 (%/hour) in fitted model (8) is the estimated slope of Y with respect to X1, adjusted for X2.

Similarly, in the second of the simple linear regression models, the slope of HbA1c against DIAC (i.e., the difference between the mean HbA1c level in the presence versus absence of diabetic complications) is 8.69% − 7.88% = 0.81 percentage points (the unadjusted difference). From model (8), β2 = 0.66 percentage points; the presence of DEBT − 2 in the model has produced a downward adjustment (though not a large one, relative to the sd of residual HbA1c, which is about 2 percentage points). In general, β2 in equation (8) is the estimated slope of Y against X2, adjusted for X1. This sort of adjustment is typical of observational studies. Because such studies, as distinguished from randomized trials, are not able to control all factors affecting the response, they often focus on the difference between two groups and adjust that difference for the contribution of one or several covariates.

Two Continuous Predictors

A further example analyzes HbA1c as a function of sleep quantity, measured by perceived sleep debt, and subject's age, two continuous predictor variables. The model contains the same symbols as model (7),

Y = β0 + β1X1 + β2X2 + e, (9)

but here X2 is continuous rather than binary.

Example 4: Building on Examples 1 and 3, suppose that the primary focus is on the relation of HbA1c to sleep debt. From preliminary analyses age has quite a wide range, from 24 to 92 years, with a mean (sd) of 58 (13) years. If both sleep debt and HbA1c covary with age, any


association of HbA1c and sleep debt could be inflated or attenuated by ignoring the contribution of age to HbA1c. We therefore fit a multiple linear regression model as in (9) with Y = HbA1c, X1 = DEBT, and X2 = AGE, centering age at 60 years. The result is

HbA1c = 8.26% + 0.31 (%/hour) × (DEBT − 2) − 0.037 (%/year) × (AGE − 60) + e; (10)

i.e., estimates β1 = 0.31 (%/hour) and β2 = −0.037 (%/year), with standard errors of 0.13 (%/hour) and 0.015 (%/year) and p-values of 0.018 and 0.014. We conclude that greater sleep debt and younger age are both significantly associated with higher levels of HbA1c.

The interpretation of the estimated coefficients β1 and β2 of DEBT − 2 and AGE − 60 in the fitted multiple linear regression equation (10) again involves adjusting for the other predictor. Thus, β1 is the slope of Y with respect to X1, adjusted for X2, and β2 is the slope of Y with respect to X2, adjusted for X1. By comparison, in the simple linear regression of HbA1c on DEBT − 2 the slope is 0.42 (%/hour); including AGE − 60 in the model produced an adjusted slope of 0.31 (%/hour). Similarly, the simple linear regression of HbA1c on AGE − 60 yields a slope of −0.050 (%/year); including DEBT − 2 in the model produced an adjusted slope of −0.037 (%/year). In each case, inclusion of the other variable in the model produced a substantial adjustment of the slope toward zero; in particular, age is seen to account for about a quarter of the unadjusted slope in the regression of HbA1c on sleep debt.
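This attenuation under adjustment can be reproduced in a small simulation. The sketch below uses invented coefficients (chosen only to resemble the example's magnitudes): because x2 covaries with both x1 and y, the simple regression of y on x1 absorbs part of x2's contribution, and adding x2 to the model moves the slope toward its true value.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data: x2 (an age-like variable) covaries with x1 (a sleep-
# debt-like variable) and with y, so the unadjusted slope of y on x1
# picks up part of x2's effect.
n = 5000
x2 = rng.normal(0.0, 13.0, n)                    # centered "age"
x1 = 2.0 - 0.02 * x2 + rng.normal(0.0, 1.4, n)   # correlated with x2
y = 8.0 + 0.3 * (x1 - 2.0) - 0.04 * x2 + rng.normal(0.0, 1.9, n)

def ols(cols, y):
    """Least-squares coefficients, intercept prepended."""
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

unadjusted = ols([x1 - 2.0], y)[1]               # simple regression slope
adjusted = ols([x1 - 2.0, x2], y)[1]             # slope adjusted for x2
```

With these made-up values the unadjusted slope comes out larger than the adjusted one, mirroring the direction of the adjustment in the example; the adjusted estimate recovers roughly the true value of 0.3.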

Although the models in equations (7) and (9) are generic, in actual applications the meanings of β0, β1, and β2 depend on what other specific variables are in the model. For example, both equation (8) and equation (10) contain DEBT − 2 (along with the constant term); but equation (8) also contains DIAC, whereas equation (10) also contains AGE − 60. Thus, in equation (8) the coefficient of DEBT − 2, 0.39 (%/hour), is the slope of HbA1c against DEBT − 2, adjusted for DIAC, whereas in equation (10) the corresponding coefficient, 0.31 (%/hour), is the slope of HbA1c against DEBT − 2, adjusted for AGE − 60. These are two distinct perspectives on the data, each potentially with its own scientific or clinical implications.

The General Model

Models such as (7) and (9) are instances of a multiple linear regression model, so named because it contains multiple predictor variables (Xs). The general model may contain any combination of indicator and continuous predictors. The following example, combining the models considered in Examples 3 and 4, serves to illustrate.

Example 5: As in Examples 3 and 4, suppose that the primary focus is on the relationship of HbA1c to sleep debt, but that we also wish to account for subject's age and the presence of diabetic complications in our analysis. We approach this task via a regression model with all three predictors, specifically


Term                Model 1                         Model 2
                    Estimate   95% CI               Estimate   95% CI
                               Lower     Upper                 Lower     Upper
Intercept           7.94       7.45      8.42       8.87       8.05      9.69
DEBT* (hrs)         0.271      0.014     0.528      0.321      0.066     0.575
DIAC (any)          0.726      0.011     1.44       0.749      0.025     1.472
AGE* (yrs)          -0.0397    -0.0691   -0.0103
AGE (yrs)
  24-49                                             ref.
  50-64                                             -0.961     -1.892    -0.030
  65-92                                             -1.260     -2.309    -0.211
Standard deviation  1.95                            1.97
Model R^2           16.1%                           15.5%

HbA1c = β0 + β1(DEBT − 2) + β2DIAC + β3(AGE − 60) + e. (11)

This model has two continuous predictors and one indicator predictor. The estimated coefficients are presented in Table 2, Model 1. As in separate model fits (8) and (10), perceived sleep debt, the presence of diabetic complications, and younger age are all significantly associated with higher levels of HbA1c, although the slope with respect to DEBT is smaller than in either of the previous models.

Consider interpretation of the estimated version of multiple linear regression equation (11) presented in Table 2, Model 1. First, taken as a whole, this equation can be used to predict the HbA1c levels of an individual with given values of each of the three predictors. For example, suppose we wished to predict the average HbA1c level for a 40-year-old person with three hours of perceived sleep debt and no diabetic complications. This calculation would be

β0 + β1(3 − 2) + β2(0) + β3(40 − 60) = 7.94 + 0.271 × 1 − 0.0397 × (−20) = 9.00%

If the person had diabetic complications, then this prediction would increase by β2 = 0.726 percentage points and be equal to 9.73%. These predictions are based on the assumption that the relationship of the response variable HbA1c

Table 2. Summary Statistics for Two Fitted Regression Models of HbA1c (%) on Perceived Sleep Debt, Diabetic Complications, and Age for n = 122 Subjects. The Two Models Incorporate Age Differently.

*DEBT is centered at two hours; AGE is centered at 60 years. "ref." denotes the reference category for AGE in Model 2; the estimated intercept predicts the mean of this category, and the coefficients for the other categories, when added to the intercept, predict the means for those categories.


to the predictors DEBT, DIAC, and AGE is linear and additive. That is, each predictor figures linearly into the regression equation, and the contributions from these predictors are additive.

The estimated intercept β0 is one such predicted value: it is the estimated mean response Y for persons with a value of 0 for each of the predictors X in the model, accounting for linear and additive effects of each of those predictors. In HbA1c model (11), such persons are 60 years old, with two hours of perceived sleep debt and no diabetic complications, so the indicator variable DIAC is 0. The predicted HbA1c level for such persons based on the model is β0 = 7.94%.

If we had not "centered" DEBT at two hours and AGE at 60 years, the intercept would refer to persons who are 0 years old and have no sleep debt, a group clearly outside the domain of investigation in this analysis.
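The effect of centering can be checked directly: shifting the predictors changes only what the intercept refers to, not the slopes or the predictions. A minimal sketch with invented data on the example's scale (the coefficients below are hypothetical, chosen near the example's values):

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented data on the example's scale.
n = 400
debt = rng.uniform(0.0, 6.0, n)
age = rng.uniform(24.0, 92.0, n)
y = 8.0 + 0.3 * (debt - 2.0) - 0.04 * (age - 60.0) + rng.normal(0.0, 1.9, n)

def ols(cols, y):
    """Least-squares coefficients, intercept prepended."""
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_raw = ols([debt, age], y)                # uncentered predictors
b_cen = ols([debt - 2.0, age - 60.0], y)   # centered at 2 hours and 60 years

# Same prediction for a 40-year-old with 3 hours of sleep debt either way.
pred_cen = b_cen[0] + b_cen[1] * (3 - 2) + b_cen[2] * (40 - 60)
pred_raw = b_raw[0] + b_raw[1] * 3 + b_raw[2] * 40
```

The slopes agree to machine precision; only the intercept shifts (by 2 × b1 + 60 × b2), and every predicted value is unchanged.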

The estimated regression coefficients of DEBT, DIAC, and AGE in model (11) reflect adjusted associations. The three coefficients in Table 2, Model 1 quantify the relationship of HbA1c to DEBT, DIAC, and AGE, net of any linear and additive contributions of the other two predictors.

Before turning to other aspects of multiple linear regression, we consider

one final example, which involves a categorical predictor variable with more than two categories. Specifically, we replace continuous age with age groups. Although this approach may not be optimal, and other options are available, it is easy to implement and serves to illustrate the use of multiple linear regression with categorical predictors. It has the further virtue of assuming approximate linearity within each age group, but not for the data as a whole; thus it is able to deal with nonlinearity of the age effect.

Example 6: Suppose, as in Example 5, that our goal was an analysis of the HbA1c–sleep debt association that also accounts for diabetic complications and age, but that we were concerned that age did not act on mean HbA1c in a linear fashion. An alternative approach is to create age groups and treat age as a categorical predictor variable. Examining the distribution of age, we created three groups: 24-49, 50-64, and 65-92 years old. We then chose one group, the youngest, as a reference group and consider a model with an indicator variable for each of the other two age groups:

HbA1c = β0 + β1(DEBT − 2) + β2DIAC + β3AGE2 + β4AGE3 + e. (12)

Here AGE2 and AGE3 are indicator variables for membership in the middle and oldest age groups, respectively. AGE2 takes value 1 for persons between 50 and 64 years old and 0 otherwise; AGE3 takes value 1 for persons 65 years and older and 0 otherwise. The estimated coefficients are presented in Table 2, Model 2. As in Model 1, perceived sleep debt and the presence of diabetic complications are significantly associated with higher levels of HbA1c. In addition, the increasingly negative estimated coefficients of the age group indicators reflect a trend of decreasing HbA1c levels with increasing age.
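Constructing the indicator variables for a model like (12) is mechanical. A sketch with simulated data (the coefficient values are invented, not the fitted values in Table 2):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated subjects; coefficient values below are invented.
n = 600
age = rng.integers(24, 93, n).astype(float)      # ages 24-92
debt = rng.uniform(0.0, 6.0, n)
diac = (rng.random(n) < 0.43).astype(float)

# Indicators for the middle and oldest age groups; the youngest
# group (24-49) is the reference and gets no indicator column.
age2 = ((age >= 50) & (age <= 64)).astype(float)
age3 = (age >= 65).astype(float)

y = (8.9 + 0.3 * (debt - 2.0) + 0.7 * diac
     - 1.0 * age2 - 1.3 * age3 + rng.normal(0.0, 1.9, n))

# Design matrix for model (12)-style coding.
X = np.column_stack([np.ones(n), debt - 2.0, diac, age2, age3])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The coefficients on age2 and age3 estimate the mean difference of each group from the reference group, at fixed sleep debt and complication status.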

The interpretation of these results is as follows. The estimated intercept β0 = 8.87% is the predicted mean HbA1c level for a person in the youngest age group


(24-49 years) with two hours of sleep debt and no diabetic complications. The estimated coefficients β3 = −0.96 percentage points for AGE2 and β4 = −1.26 percentage points for AGE3 are estimated differences between mean HbA1c levels in the middle and oldest age groups, respectively, and the youngest age group, adjusting for sleep debt and diabetic complications. The coefficients of DEBT and DIAC are adjusted for differences in mean HbA1c levels among age groups, as well as for the contributions of DIAC and DEBT, respectively.

The main advantage to model specification (12), with age as a categorical variable, versus model (11), with age as a continuous variable, is that model (12) does not force the slope of mean HbA1c with respect to age to be the same for 30-year-olds as it is for 70-year-olds. On the downside, however, modeling age groups does not account for any differences in mean HbA1c levels that occur by age within a group. Other modeling approaches attempt to gain the advantages of both (11) and (12). These include higher-order functions of age (such as age-squared) or, better, linear or higher-order splines. In some situations, when the number of observations is large enough to support a sizable number of categories, the coefficients of the corresponding indicator variables can guide the choice of the functional form.

When the analysis involves predictor variables for a set of categories, the regression model contains indicator variables for all groups but one. The reference group has no indicator variable. Coefficients of each indicator compare the corresponding group against the reference group in terms of the mean value of the response Y. It is up to the analyst to choose the reference group, and this choice affects the interpretation of the coefficients of the indicator variables. In principle, any group can be chosen as the referent, and as long as the model includes indicator variables for each of the other groups, the models are all equivalent. That is, the choice of reference group affects interpretation of the model coefficients, but not the degree to which the model fits the data.
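That the choice of reference group does not change the model's fit can be demonstrated directly: recoding the indicators changes the coefficients but reproduces identical fitted values. A minimal sketch with invented group means:

```python
import numpy as np

rng = np.random.default_rng(5)

# Three invented groups with different mean responses.
n = 300
g = rng.integers(0, 3, n)                        # group labels 0, 1, 2
y = np.choose(g, [8.9, 7.9, 7.6]) + rng.normal(0.0, 1.0, n)

def fitted_with_reference(ref):
    """Fit with indicator columns for every group except `ref`."""
    others = [k for k in range(3) if k != ref]
    X = np.column_stack([np.ones(n)] +
                        [(g == k).astype(float) for k in others])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta                              # fitted values

fit_ref0 = fitted_with_reference(0)              # group 0 as reference
fit_ref2 = fitted_with_reference(2)              # group 2 as reference
```

The two design matrices span the same column space, so fitted values, residuals, and R-squared are identical; only the parameterization, and hence the interpretation of each coefficient, differs.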

Other Aspects of Multiple Linear Regression Analysis

The interpretation of standard errors, and the computation and interpretation of test statistics, p-values, and confidence intervals in multiple linear regression are analogous to those in simple linear regression. The main difference in multiple linear regression is that the regression slopes represent adjusted associations. In Example 5, interest is on the association of HbA1c to sleep debt, adjusting for age and diabetic complications. The slope estimate and its standard error are β1 = 0.271 and 0.130 (%/hour). The null hypothesis of no (adjusted) association corresponds to β1 = 0. The test statistic for this null hypothesis is t = 0.271/0.130 = 2.09. This t-statistic has 118 degrees of freedom (df), which in multiple linear regression equals the sample size (n = 122) minus the number


of coefficients in the model, including the intercept. The resulting two-sided p-value is 0.039, which indicates a significant HbA1c–sleep debt association, even after accounting for linear effects of age and diabetic complications. The 95% confidence interval for β1, (0.014, 0.528) (%/hour), indicates the range of values for the adjusted regression slope that are compatible with the data.

The coefficient of determination, or R-squared (R^2) value, also extends directly from simple to multiple linear regression. It represents the proportion of the total variability in the response Y (about its mean) that is accounted for jointly by all the predictor variables X1, X2, etc., in the model. In Example 5, R^2 = 16.1%, whereas in Example 6, R^2 = 15.5%, reflecting that the model with linear age had a slightly better fit to the data than did the model with age included as three categories.
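The test-statistic and confidence-interval arithmetic for the adjusted DEBT slope can be verified from the reported estimate and standard error alone. The critical value 1.98 below is an assumption (the approximate two-sided 5% point of a t distribution with 118 df); computing the exact p-value would require the t distribution function.

```python
# Arithmetic behind the adjusted-slope inference in Example 5.
b1, se1 = 0.271, 0.130      # reported estimate and standard error (%/hour)
n, n_coef = 122, 4          # sample size; intercept + DEBT + DIAC + AGE

df = n - n_coef             # degrees of freedom for the t statistic
t = b1 / se1                # test statistic for H0: beta1 = 0

t_crit = 1.98               # assumed approx. 97.5th percentile of t(118)
ci = (b1 - t_crit * se1, b1 + t_crit * se1)
```

t comes out near 2.09 and the interval near (0.014, 0.528), matching the text; small discrepancies reflect rounding of the reported estimate and standard error.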

Does the Model Fit the Data?

Linear regression involves specifying a model, which includes a response and a set of one or more predictors, and then fitting that model to a set of data. The model fitting, or estimation, step will yield results and parameter estimates whether or not the data conform to the assumptions of that model. What if these assumptions do not hold? What if some observations or other subsets of the data are not well captured by the linear regression model?

Regarding isolated departures of individual data points from the model or from the rest of the data, model diagnostics exist for multiple linear regression. These tools can help to identify data points that are outliers or points of high leverage and/or influence; these are essential steps in studying model adequacy. Of course, the problem is more complicated than with simple linear regression because the joint influence of several predictors must be considered simultaneously.

More systemic problems arise when the data as a whole appear to violate the assumption that the mean of the response varies as a linear and additive function of the set of predictors. How should we approach this problem? First, statistical tests, graphical procedures, and other tools are available to assess the linearity of the relationship of the response to each predictor, accounting for the other predictors in the model, and also to assess whether these variables combine additively or in a more complex way to predict the response.

Second, even if the assumptions of linearity and additivity do not hold exactly, the fitted model is often still a very useful summary of the relationships in the data. It will capture the part of the relationship of the response to the predictors that is linear and additive, and this will often be a main part of the story, even if it is not the whole story. Additionally, such a summary is of use precisely because it glosses over higher-order details in favor of a more parsimonious presentation of the data, yielding an analysis that is easier to interpret and to communicate. Of


course, the advantage of including higher-order terms is that the resulting model more faithfully represents the patterns of association in the data.

For instance, in Example 3 (Figure 10), the slopes with respect to sleep debt are slightly different between the two diabetic complications groups. Strictly speaking, this violates the additivity assumption because it suggests that the presence of diabetic complications not only shifts the regression line up or down, but also alters the slope with respect to sleep debt. The two lines are, however, so close to parallel that the difference is unimportant. Indeed, a statistical test could be applied to assess whether this difference is significant. Fitted model (8) simplifies the picture by providing a single slope common to both groups. Good data analysis often requires a compromise between the two competing goals of parsimony and quality of fit to the data.
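One way to carry out such a test is a partial F test comparing the parallel-lines (additive) model with a model that adds a DEBT-by-group interaction term, allowing the two slopes to differ. The sketch below uses simulated data with a small invented interaction; it illustrates the mechanics only, not the study's result.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated data with a small (invented) difference in slopes.
n = 400
debt = rng.uniform(0.0, 6.0, n) - 2.0        # centered sleep debt
grp = (rng.random(n) < 0.5).astype(float)    # complications indicator
y = (8.0 + 0.39 * debt + 0.66 * grp
     + 0.06 * debt * grp                     # slopes differ by 0.06 across groups
     + rng.normal(0.0, 1.9, n))

def rss(X, y):
    """Residual sum of squares from a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

X_add = np.column_stack([np.ones(n), debt, grp])   # parallel-lines model
X_int = np.column_stack([X_add, debt * grp])       # adds the interaction

rss_add, rss_int = rss(X_add, y), rss(X_int, y)

# Partial F statistic for the single interaction parameter.
F = (rss_add - rss_int) / (rss_int / (n - X_int.shape[1]))
```

Comparing F with the F(1, n − 4) distribution gives the test; a small F (roughly below 3.9 at the 5% level) favors keeping the simpler parallel-lines model.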

SUMMARIZATION, ADJUSTMENT, AND PREDICTION REVISITED

The introduction mentioned three broadly construed applications for regression models: summarization, adjustment, and prediction. We now revisit the use of multiple regression for these three purposes.

Summarization

One reason that multiple regression is useful is that it yields a parsimonious description of the nature and strength of the dependence of a response Y on a set of Xs. Consider the following example.

Example 7: Lauderdale et al. objectively measured various sleep characteristics in a population-based random sample of healthy middle-aged adults. The goal of the study was to provide a description and quantification of sleep in this population, and also to examine whether and how sleep varies by demographic, socioeconomic, and other variables. Sleep parameters included time in bed, sleep latency, sleep duration, and sleep efficiency, all continuous response variables. The analysis used multiple linear regression to examine how each response jointly depends on the predictors. For each response, the authors fitted three regression models. One model included indicator variables for race-by-sex groups and a continuous predictor, age. The next model added income, another continuous variable. The last model added continuous and indicator variables for employment status, body mass index (BMI), alcohol consumption, smoking status, and number of children under 18 years old in the household, among others.

This is an example of using multiple linear regression for data summarization. The study is largely descriptive, aiming to provide information on normative sleep patterns in a healthy population. No hypotheses are strongly driving the analyses; interest is more on the joint contributions of the predictors in explaining the variability in sleep, rather than on the coefficient of one specific


predictor adjusted for the others. For example, whereas the full model, with all predictors included, provides the regression slope for BMI adjusted for income (and other things), interest on the sleep-BMI association is no more or less important than that on the sleep-income association. The reader also has available the regression slope for income both unadjusted and adjusted for BMI. In a sense, all predictors are on an "equal footing." Taken as a whole, the model is a concise description of the joint impact of race, sex, age, and other factors on sleep in healthy adults. In this sense the regression model is a powerful tool for summarization of this joint effect.

Adjustment

In many applications, not all of the predictor variables will be on an equal footing. Rather, one or more will be of primary importance. The others are included for purposes of statistical adjustment. The motivation for adjustment is often that the exposure (predictor variable) of interest is associated with some variables that are also expected to influence the outcome of interest. Such confounders, if not properly accounted for, will induce a spurious association between exposure and outcome. We illustrate with the following example.

Example 2 (continued): In the study of the association of child's IQ to blood lead levels, a major barrier to interpreting the observed inverse association as causal was concern that other variables are associated with both elevated blood lead and depressed IQ in some individuals, and that these variables account for the observed association of IQ to blood lead levels, i.e., that these variables are potential confounders. Such variables may include maternal IQ, level of education, and use of tobacco during pregnancy; family financial status; and the child's gender, birth weight, and iron status. Socioeconomic status is of particular concern because of its link to environments with unstable and uncontained lead-based paint or with lead-contaminated soil, as are perinatal variables such as birth weight and maternal smoking, owing to links with both socioeconomic status and children's development. To handle these problems, the investigators developed a multiple linear regression model that included many of these continuous and categorical covariates as adjustor variables, in addition to lifetime average blood lead concentration. The estimated coefficient of blood lead concentration in this "adjusted" model was −1.52 points/(µg/dl) (95% CI: (−2.94, −0.09)) in those with lifetime peak blood lead concentrations less than 10 µg/dl. This compares to an unadjusted blood lead concentration coefficient of −2.54 points/(µg/dl). The adjusted slope was lower in magnitude, but still indicative of a clinically and statistically significant association between blood lead and child's IQ after accounting for potential confounders.

When regression is used primarily for adjustment, the estimated coefficients for all of the adjustor variables are often not even given. Only the coefficients for blood lead are presented in the example. The idea (aside from minimizing journal space!) is to focus the analysis on the blood lead-IQ relationship while accounting for the adjustor variables, rather than to distract from this primary relationship by presenting the regression slopes of all variables.


In this example the adjustor variables such as maternal IQ, education, and household income are potential confounders because they are thought to influence both the exposure (blood lead) and the outcome (child's IQ). Because these variables are included in the regression model, the coefficient of blood lead quantifies the IQ–blood lead association net of any linear and additive relationship of IQ to these variables, thereby eliminating or reducing potential confounding from them. The statistical analyses permit estimation of both unadjusted and adjusted regression coefficients, and these represent two different empirical perspectives on the association of IQ to lead exposure. Conclusions about whether an adjustor variable is truly a confounder and/or whether the adjusted association quantified by the regression model represents a causal link, however, are extra-statistical steps that must be justified by non-statistical considerations including the scientific issues at hand. Confounding and some aspects of causality are discussed in more detail in Chapter 7.

Prediction

Prediction with linear regression has a variety of purposes in medical practice and research. First, it can involve forecasting or prognostication into the future about some responses based on predictors available at present. Second, it may be used to avoid expensive or invasive gold-standard diagnostic measures, instead predicting those measures from easier-to-obtain clinical information. Third, it may involve projection of a given patient into a "what if" situation. Prediction stands in some contrast to summarization and adjustment. In general, models that do a good job of summarization will also be good predictive models and vice versa. However, with prediction, the estimation and testing of regression coefficients are de-emphasized, and predictive accuracy is of primary importance. Therefore, a focus on prediction will sometimes lead to a different regression model than when the focus is on summarization.

Predictive accuracy is the degree to which the fitted value based on a model is close to the actual response. That is, suppose we take a data set and fit a regression model. Now, suppose we have a new observation (e.g., a new patient), with predictors X1, X2, etc., and response Y, and that we use the fitted model with these new Xs to predict the unknown Y for this new patient. How close the prediction is to the actual response, on average, is a measure of predictive accuracy. The following example serves to illustrate some of these points.
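This idea can be sketched in a few lines of Python with simulated data (all names and numbers here are illustrative): fit a model on a "current" sample, then measure how far its predictions fall from the responses of new observations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated "current" data set: one predictor X, continuous response Y.
n = 200
X = rng.uniform(0, 10, n)
Y = 3.0 + 1.5 * X + rng.normal(0, 2.0, n)

# Fit Y = b0 + b1 * X by least squares.
(b0, b1), *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X]), Y, rcond=None)

# New observations (e.g., new patients): X is observed, Y is to be predicted.
X_new = rng.uniform(0, 10, 500)
Y_new = 3.0 + 1.5 * X_new + rng.normal(0, 2.0, 500)
Y_hat = b0 + b1 * X_new

# Predictive accuracy: average closeness of prediction to actual response,
# summarized here as root-mean-squared error (close to the true noise SD, 2.0).
rmse = float(np.sqrt(np.mean((Y_new - Y_hat) ** 2)))
print(f"out-of-sample RMSE: {rmse:.2f}")
```

The out-of-sample RMSE, not the fitted coefficients themselves, is the quantity of primary interest when prediction is the goal.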

Example 8: Gulati et al.13 developed a regression model to predict exercise capacity in healthy women as a function of age. Exercise capacity is measured in metabolic equivalents (MET), defined as the maximal oxygen uptake for a given workload, measured as multiples of the basal rate of oxygen consumption when a person is at rest. The purpose of the study was to establish a nomogram of mean MET-for-age values in a healthy female population. The women were also classified as "active" or "sedentary" on the basis of self-reported participation in a regular exercise or training program. Sample sizes were 866 in the active group and 4643 in the sedentary group. Fitted regression equations were mean(MET) = 17.9 − (0.16/year) × AGE for the active group and mean(MET) = 14.0 − (0.12/year) × AGE for the sedentary group.
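A nomogram is simply these equations evaluated over a grid of ages. Transcribing the two fitted equations directly:

```python
# The two fitted equations from Example 8, transcribed directly.
def met_active(age):
    return 17.9 - 0.16 * age

def met_sedentary(age):
    return 14.0 - 0.12 * age

# Predicted mean exercise capacity for a 50-year-old woman:
print(met_active(50))     # 17.9 - 0.16 * 50 = 9.9 METs
print(met_sedentary(50))  # 14.0 - 0.12 * 50 = 8.0 METs
```

For a 50-year-old woman, the predicted mean capacities are thus 9.9 METs if active and 8.0 METs if sedentary.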

Here the modeling objective is to obtain good predictions for input into a nomogram, the driving goal of the project. We make two points about the fitted models. First, the authors were careful to assess whether the relationship of MET to age was linear; it turned out to be. However, if it had not been, their analysis had scope to transform age or to include nonlinear age terms in the regression models. Second, the authors fitted separate models for the sedentary group and the active group; this is equivalent to fitting one model to the entire sample with an indicator variable for being in the active group and an interaction term between active group and age, allowing for different age slopes in the two groups. A simpler approach would have omitted the interaction term. Though the interaction term was most likely significant, its inclusion in the model does not add substantially to the story about the MET-age relationship (MET is higher in the active group and drops with age in both groups), so if the goal had been summarization, this term might have been excluded. As the goal was prediction, however, both the potential inclusion of nonlinear functions of age and the actual inclusion of the age-by-activity-group interaction rendered the model more faithful to the data and thereby improved its predictive ability. Often good predictive models contain more terms than good explanatory models.
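The equivalence between fitting separate group-specific models and fitting one fully interacted model can be verified numerically. The sketch below uses simulated data loosely patterned on Example 8; all numbers and variable names are hypothetical, not from the study.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: two groups whose mean response declines linearly
# with age at different rates (numbers loosely patterned on Example 8).
n = 300
age = rng.uniform(20, 80, 2 * n)
active = np.repeat([0.0, 1.0], n)   # 0 = sedentary, 1 = active
met = (14.0 - 0.12 * age) + active * (3.9 - 0.04 * age) + rng.normal(0, 1.5, 2 * n)

def ols(design, y):
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# (a) Separate simple regressions within each group.
sed = active == 0.0
b_sed = ols(np.column_stack([np.ones(n), age[sed]]), met[sed])
b_act = ols(np.column_stack([np.ones(n), age[~sed]]), met[~sed])

# (b) One model: intercept, age, group indicator, and age-by-group interaction.
design = np.column_stack([np.ones(2 * n), age, active, age * active])
b_full = ols(design, met)

# Same fit either way: sedentary slope = b_full[1]; active slope = b_full[1] + b_full[3].
print(b_sed[1], b_full[1])
print(b_act[1], b_full[1] + b_full[3])
```

The group-specific slopes from (a) coincide, up to rounding, with the corresponding coefficient combinations from (b), which is why the two presentations are interchangeable.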

One aspect of predictive modeling is to evaluate and quantify the predictive accuracy. R² is a common metric for predictive accuracy in multiple linear regression models, and is more relevant in predictive than in summarization modeling. In the above example, R² was 35% in the active model and 24% in the sedentary model, reflecting reasonable but not strong predictive ability.

Additional considerations arise in practice, relating to problems of model selection and of predictive accuracy. Often, there are many candidate predictor variables, and the analyst faces the problem of choosing which ones to include. Prediction is degraded, and the model may be unusable, if any predictor variable is missing for a new patient. Once variables are selected, there is the question of whether to include nonlinear terms. The problem is further complicated by the potential need to include interactions. These choices must be made while recognizing that including too many predictors, including interaction terms (i.e., over-fitting), will degrade the generalizability and predictive ability of the model. Taken together, methods for making objective choices about model terms fall in the domain of model selection, an area of ongoing statistical research.
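A small simulation illustrates why over-fitting degrades predictive ability: in-sample error can only improve as terms are added, while out-of-sample error need not. This is a generic sketch with simulated data, not an analysis from any study cited here.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data: the true mean is linear in x, but we compare polynomial
# models of increasing degree by in-sample and out-of-sample error.
n = 80
x = rng.uniform(-2, 2, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)
x_new = rng.uniform(-2, 2, 1000)          # held-out "new patients"
y_new = 1.0 + 2.0 * x_new + rng.normal(0, 1.0, 1000)

def fit_poly(x, y, degree):
    # Least-squares fit of a degree-d polynomial (columns x^d, ..., x^0).
    coef, *_ = np.linalg.lstsq(np.vander(x, degree + 1), y, rcond=None)
    return coef

def rmse(coef, x, y):
    return float(np.sqrt(np.mean((np.polyval(coef, x) - y) ** 2)))

results = {}
for degree in (1, 3, 9):
    coef = fit_poly(x, y, degree)
    results[degree] = (rmse(coef, x, y), rmse(coef, x_new, y_new))
    print(degree, results[degree])
# In-sample RMSE falls as terms are added; out-of-sample RMSE need not.
```

Comparing models on held-out data in this way is one simple, objective basis for model selection when prediction is the goal.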




REPORTING REGRESSION RESULTS

Reporting on regression modeling is an important step in the analyses. The method used to develop the regression model should be described in sufficient detail that a reader with knowledge of regression and access to the data could reproduce the results (as noted in Chapter 14, point 2).

On the data, sufficient information should be given to enable the reader to digest the analyses and reach conclusions about any fitted model. First, any report should provide sample size(s), univariate descriptions of both the response and all predictors, information on missing data and, importantly, units of measurement for each variable. Second, to the degree that space allows, reports should include plots of the data, so that the reader can see key relationships and the variability in the data. Third, continuous predictor variables should be centered at some relevant value for the analysis at hand, and reasonable reference categories should be chosen for categorical predictors.
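Centering changes only the interpretation of the intercept, not the slope. A brief sketch with hypothetical blood-pressure data (all values simulated):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: systolic blood pressure (SBP) vs. age in 100 adults.
n = 100
age = rng.uniform(30, 70, n)
sbp = 120.0 + 0.5 * age + rng.normal(0, 5.0, n)

def ols(design, y):
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

b_raw = ols(np.column_stack([np.ones(n), age]), sbp)          # intercept at age 0
b_ctr = ols(np.column_stack([np.ones(n), age - 50.0]), sbp)   # intercept at age 50

# The slope is identical; only the intercept's meaning changes: b_ctr[0]
# estimates the mean SBP at age 50, a value inside the observed age range,
# rather than an extrapolated mean at age 0.
print(b_raw)
print(b_ctr)
```

Reporting the intercept at a clinically meaningful age makes the fitted equation easier for readers to interpret.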

The report should present the fitted regression equation either in the text or in a table. The estimated regression coefficients (β̂s) should be reported with units and be accompanied with standard errors and/or confidence intervals. Reporting only the coefficients and their p-values is rarely adequate. The estimated residual standard deviation should be included as part of the fitted model. When the goal of the model is prediction, it is appropriate to report the R² value of the fitted model, but in summarization or adjustment this statistic can often be small and lead the reader to decide that the associations detected via the regression models are not important. We emphasize that a model may very well not be a good prediction model, but may still reveal interesting and important associations. Therefore, R² is not always an important statistic to report. Finally, many reports do not include all regression coefficients, especially when those coefficients correspond to adjustor variables. Though this is sometimes unavoidable, it is better to provide this information.

ADDITIONAL READING

For research practitioners who wish to apply regression models and related methods in the analysis of their own data, a now-classic source is Weisberg,14 whereas a more modern treatment (which includes regression diagnostics) is presented by Chatterjee, Hadi, and Price.15 Predictive (or "prognostic") modeling is given extensive treatment by Harrell,16 who also provides guidance on the use of linear and higher-order splines for flexibly modeling continuous variables.



ACKNOWLEDGMENTS

The authors thank Theodore G. Karrison, PhD, for editorial suggestions that improved this material considerably, and Eve Van Cauter, PhD, and Kristen L. Knutson, PhD, for making available the data from their sleep and diabetes outcomes study.

REFERENCES

1. Knutson KL, Ryden AM, Mander BA, Van Cauter E. Role of sleep duration and quality in the risk and severity of type 2 diabetes mellitus. Arch Intern Med 2006; 166:1768-74.

2. Buysse DJ, Reynolds CF III, Monk TH, et al. The Pittsburgh Sleep Quality Index: A new instrument for psychiatric practice and research. Psychiatry Res 1989; 28:193-213.

3. Carpenter JS, Andrykowski MA. Psychometric evaluation of the Pittsburgh Sleep Quality Index. J Psychosom Res 1998; 45:5-13.

4. American Diabetes Association. Standards of medical care in diabetes-2006. Diabetes Care 2006; 29(suppl. 1):S4-S42.

5. Nathan DM, Singer DE, Hurxthal K, Goodson JD. The clinical information value of the glycosylated hemoglobin assay. N Engl J Med 1984; 310:341-6.

6. Canfield RL, Henderson CR, Cory-Slechta DA, et al. Intellectual impairment in children with blood lead concentrations below 10 μg per deciliter. N Engl J Med 2003; 348:1517-26.

7. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educational Psychol 1974; 66:688-701.

8. Rubin DB. Formal modes of statistical inference for causal effects. J Statist Planning and Inference 1990; 25:279-92.

9. Cook RD, Weisberg S. Applied regression including computing and graphics. New York: John Wiley, 1999.

10. Belsley DA, Kuh E, Welsch RE. Regression diagnostics. New York: John Wiley, 1980.

11. McCullagh P, Nelder JA. Generalized linear models. 2nd ed. London: Chapman & Hall, 1989.

12. Lauderdale DS, Knutson KL, Yan LL, et al. Objectively measured sleep characteristics among early-middle-aged adults: The CARDIA Study. Am J Epidemiol 2006; 164:5-16.

13. Gulati M, Black HR, Shaw LJ, et al. The prognostic value of a nomogram for exercise capacity in women. N Engl J Med 2005; 353:468-75.

14. Weisberg S. Applied linear regression. 2nd ed. New York: John Wiley, 1985.

15. Chatterjee S, Hadi AS, Price B. Regression analysis by example. 3rd ed. New York: John Wiley, 2000.

16. Harrell FE. Regression modeling strategies with applications to linear models, logistic regression, and survival analysis. New York: Springer-Verlag, 2001.
