

WHAT DOES R² MEASURE?

Steve Thomson, University of Kentucky Computing Center

The coefficient of determination, or squared multiple correlation coefficient, R², is perhaps the most extensively used measure of goodness of fit for regression models. In my consulting, many people have asked questions like: "My R² was .6. That's good, isn't it?", or "My R² was only .05. That's bad, isn't it?". Thus many people seem to use high values of R² as an index of fit, concluding that if R² is above some value, they have good fit, but that small values indicate bad fit. In either case above, probably the best response is "Compared to what?".

One interpretation of R² is as the limit of the regression sum of squares divided by the limit of the total sum of squares. Or equivalently, it is an estimate of the amount of variation in observation means compared to the variation in means plus the variation in residual error. This can be small even for models with small error, or large for models with large error.

I. Review:

The general linear model can be interpreted as investigating how well an observed n×1 vector, y, can be represented in some lower dimensional subspace spanned by the columns of the matrix of regressors. (Recall that the span of a set of vectors is the set of all linear combinations of the vectors, i.e. a subspace.) The coefficients of the vectors determining the points in the subspace are the parameters. The least squares solution for the parameters are the parameter values corresponding to the point ŷ in the subspace closest to y.

The various sums of squares in ANOVA can be represented as (squared) Euclidean distances between subspaces spanned by different groups of columns in the regressor matrix.

[Figure 1. Geometry of ANOVA. (Note that distances are square roots of the indicated sums of squares.)]

The total sum of squares, SST, is the (squared) distance from y to the origin. The error sum of squares, SSE, is the (squared) distance from y to ŷ in the span of A and B. The model sum of squares, SS(A,B), is the (squared) distance from ŷ to the origin. By the Pythagorean theorem,

$$SST = SS(A,B) + SSE.$$

So heuristically, a good model has a large SS(A,B) compared to SST. This seems to justify the use of their ratio as an index of fit, namely:

$$R^2 = \frac{SS(A,B)}{SST} = \frac{\text{model sum of squares}}{\text{total sum of squares}}.$$
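As a quick numerical illustration, here is a minimal Python sketch (the data are simulated and purely hypothetical) checking the Pythagorean decomposition and the ratio:

import numpy as np

# Verify SST = SS(A,B) + SSE for a least squares fit, with SST taken as the
# squared distance from y to the origin, as in Figure 1.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # columns spanning the model subspace
y = X @ np.array([2.0, 1.5]) + rng.normal(size=n)      # illustrative response

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]        # least squares solution
y_hat = X @ beta_hat                                   # projection of y onto span(X)

SST = y @ y                                            # squared distance of y from the origin
SS_model = y_hat @ y_hat                               # squared distance of y_hat from the origin
SSE = (y - y_hat) @ (y - y_hat)                        # squared distance from y to y_hat

print(np.isclose(SST, SS_model + SSE))                 # True: the Pythagorean theorem
print("R^2 =", SS_model / SST)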

The model sum of squares, SS(A,B), could also be partitioned into sequential sums of squares. One possible partition would be

$$SS(A,B) = SS(A) + SS(B|A),$$

where both terms on the right are squared distances, and thus positive. The SS(B|A) is the sum of squares for B adjusted for, or orthogonalized to, A. So comparing the R² for the submodel with only A's as regressors to the R² for the complete model with A's and B's:

$$R_A^2 = \frac{SS(A)}{SST} \;\le\; \frac{SS(A) + SS(B|A)}{SST} = R^2.$$

Clearly R² has to increase as variables are added to the model (or at least not decrease). Also, note from the geometry that 0 ≤ R² ≤ 1.
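A small sketch of this monotonicity (simulated, illustrative data; the added regressor B is pure noise, unrelated to y):

import numpy as np

def r_squared(X, y):
    # Uncorrected R^2 = SS(model)/SST for the least squares fit of y on X.
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return (y_hat @ y_hat) / (y @ y)

rng = np.random.default_rng(1)
n = 40
A = np.column_stack([np.ones(n), rng.normal(size=n)])  # submodel regressors
B = rng.normal(size=(n, 1))                            # an extra, irrelevant regressor
y = A @ np.array([1.0, -0.5]) + rng.normal(size=n)

print(r_squared(A, y) <= r_squared(np.hstack([A, B]), y))  # True: R^2 cannot drop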

If the model includes an intercept or constant term, it is particularly easy to adjust, or orthogonalize, to this constant: merely subtract the column mean from each column. Most analyses are performed on such "corrected" models. One last simple observation from the geometry is that there are an uncountably infinite number of subspaces with R² = 1, namely, any subspace that includes y.

Other results on R²:

1. If true replicates are added to the data, so that more than one y value occurs at each regressor value, then R² will often decrease (e.g. Healy, 1984, Draper, 1984). In fact, when there are true replicates its achievable upper bound will be less than 1.0. In practice, the results for close near-replicates are often similar.

2. For $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \epsilon_i$, with the $\epsilon_i$ independent homogeneous normal random variables with mean 0, when $\beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0$, it is easy to show that the expected value of R², using corrected sums of squares, is $(p-1)/(n-1)$ (e.g. Crocker, 1972, Seber, 1977, pg 115). A similar result holds for R² using uncorrected sums of squares ($p/n$). In fact, under these assumptions, R² follows a beta distribution. (A simulation sketch of the expectation follows this list.)
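As a check on result 2, a minimal Python simulation, assuming an intercept plus p − 1 irrelevant standard normal regressors and illustrative values of n and p:

import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 20, 4, 5000            # p parameters, including the intercept
r2 = np.empty(reps)
for i in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    y = rng.normal(size=n)          # true slopes all zero
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    r2[i] = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r2.mean(), (p - 1) / (n - 1))  # both near 0.158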

It would seem that a good fit index for comparing completely different regression equations should be optimized by the least squares equation, should be free of the measurement units, and should be large for models with small residual error and small for models with large residual error. Finally, the regressor variables are usually considered as fixed, so a fit index should be independent of their actual values. R² satisfies the first two criteria, but fails on the last two.

II. An Alternative Definition:

An alternative definition of R² that has been recommended (e.g. Draper & Smith, 1981) is


$$R_c^2 = \big(\mathrm{Corr}(y, \hat{y})\big)^2,$$

the squared correlation between the regressand (response) and the predicted value (written R²_c here to distinguish it from the first definition). It is well known that for a linear model "corrected" for the intercept or constant term, estimated by ordinary least squares (so corrected sums of squares are used in the first definition), the two definitions are equivalent. For uncorrected models, R²_c is easily computed in SAS PROC REG or GLM by OUTPUTing the predicted values and computing correlations in PROC CORR, and finally squaring the result.
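The recipe above is in SAS; purely as an illustration, the same computation can be sketched in Python. For an uncorrected model the two definitions need not agree (simulated data, with a no-intercept line fitted to data that actually have an intercept):

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1.0, 5.0, size=30)
y = 4.0 + 0.5 * x + rng.normal(scale=0.5, size=30)  # the true line has an intercept

b = (x @ y) / (x @ x)                               # no-intercept least squares slope
y_hat = b * x                                       # "OUTPUT" the predicted values

r2_ratio = (y_hat @ y_hat) / (y @ y)                # first definition, uncorrected sums of squares
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2          # second definition: correlate, then square
print(r2_ratio, r2_corr)                            # the two values differ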

To illustrate a possible problem with the second definition, suppose below that o's denote observed values, and p's denote predicted values from the least squares equation, from a simple regression without an intercept. Then using the second definition, R²_c = 1.

[Figure 2. Observed values (o) and predicted values (p) from a no-intercept simple regression; the two sequences follow the same pattern but lie far apart.]

Yet the predicted values, the ŷ's, are most certainly not doing a good job of predicting the observed values. Of course, in this case, any linear function of the observed x's will give an R²_c = 1. It is slightly more complicated for multiple regression problems, but generally, even if R²_c = 1, there is no requirement that the least squares hyperplane must be "close" to y. That is, for any b and nonzero a,

$$\big(\mathrm{Corr}(\hat{y}, y)\big)^2 = \big(\mathrm{Corr}(a\hat{y} + b, y)\big)^2.$$

Since we could generate a hyperplane passing through zero and any specified value, that suggests that unless the possible choices of coefficients are restricted in some way, R²_c is not explicitly a good measure of "fit".
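A numerical sketch of the Figure 2 situation and of the invariance above (hypothetical data: y is exactly affine in x, but the fitted line is forced through the origin):

import numpy as np

x = np.linspace(1.0, 5.0, 9)
y = 10.0 + 0.5 * x                     # observed values: a line with a large intercept
b = (x @ y) / (x @ x)                  # least squares slope through the origin
y_hat = b * x                          # predicted values

print(np.corrcoef(y, y_hat)[0, 1] ** 2)             # 1.0: "perfect" by the second definition
print(np.max(np.abs(y - y_hat)))                    # about 7.2: the predictions miss y badly
print(np.corrcoef(y, 3.0 * y_hat - 7.0)[0, 1] ** 2) # unchanged for any a*y_hat + b, a != 0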

III. Limiting Behavior of R²:

The definition of R² used in SAS procedures seems to make more sense,

$$R^2 = \frac{SSM}{SST},$$

by default using corrected sums of squares if the model has an intercept, uncorrected if the model has no intercept. But that particular ratio is exactly what R² is measuring, not necessarily the quality of fit.

Suppose a general linear model is appropriate, i.e. the n×1 regressand vector, y, is appropriately modeled as y = Xβ + ε, where the εᵢ's are stochastically independent with mean 0 and variance σ².

For convenience, write the orthogonal projection matrix $P_X = X(X^t X)^{-} X^t$, and, for j denoting the n×1 vector of 1's, $P_1 = \frac{1}{n} J$, an n×n matrix with all elements 1/n. Since j is a column of X, there is a g so that j = Xg. Also then $P_X P_1 = P_1 = P_1 P_X$.

Thus,

$$R^2 = \frac{SSM_c}{SST_c} = \frac{(P_X y - P_1 y)^t (P_X y - P_1 y)}{y^t y - y^t P_1 y}.$$

By the law of large numbers, $\mathrm{plim}\,\frac{1}{n} j^t \epsilon = 0$, and likewise $\mathrm{plim}\,\frac{1}{n} X^t \epsilon = 0$.


As n gets large, with $X^t X / n \to B_0$, say, then

$$\mathrm{plim}\,\frac{y^t P_X y}{n} \;=\; \mathrm{plim}\left(\frac{\beta^t X^t X \beta}{n} + \frac{2\,\beta^t X^t \epsilon}{n} + \frac{\epsilon^t P_X \epsilon}{n}\right) \;=\; \beta^t B_0 \beta.$$

Thus,

$$\mathrm{plim}\, R^2 \;=\; \frac{\beta^t B \beta}{\beta^t B \beta + \sigma^2},$$

where, now, B is the limiting, centered moment matrix of the regressors. Thus R² will be large if σ² is small (i.e. good fit), or if βᵗBβ is large (i.e. large variation among mean values).

Or again, with correct specification, R² can also be interpreted as

$$\lim_{n \to \infty} \frac{\beta^t X^t (P_X - P_1) X \beta / n}{\beta^t X^t (P_X - P_1) X \beta / n + \sigma^2},$$

the numerator being, basically, the corrected regression sum of squares per observation.
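A simulation sketch of this limit in the one-regressor case, where βᵗBβ reduces to β₁²τ² with τ² the limiting variance of the regressor (all values below are assumed for illustration):

import numpy as np

rng = np.random.default_rng(4)
beta1, tau, sigma = 1.0, 2.0, 1.0
n = 200_000                                        # large n to approximate the limit
x = rng.normal(scale=tau, size=n)
y = 3.0 + beta1 * x + rng.normal(scale=sigma, size=n)

X = np.column_stack([np.ones(n), x])
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

limit = beta1**2 * tau**2 / (beta1**2 * tau**2 + sigma**2)
print(r2, limit)                                   # both near 0.8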

IV. So what does R² measure?

For a correctly specified model, R² tends toward the inverse of

$$1 + \frac{\sigma^2}{\beta^t B \beta},$$

where, again, B is analogous to a sample covariance matrix of regressors for the intercept case, and a moment matrix for the no-intercept case. In either case, the βᵗBβ is a reasonable regression sum of squares, but is not much of an indicator of fit. If one has perfect fit, σ² = 0, and R² will be 1. Of course, R² will also tend to 1 as the variation in observation means gets large.

In practice, this means that if two experimenters choose equal sample sizes from the same process, the one who chooses his regressor variables to have more variation will tend to have a higher R². For example, in an economic context, suppose some process remains true over the years and satisfies the standard regression assumptions. Then an economist who uses monthly data from that process will tend to have a lower R² than one who uses yearly data, since generally the yearly data will have more variation. Similarly, suppose two nutritionists are modeling weight gain as a function of calories and prior weight. All else being equal, the nutritionist who chooses his sample to have more variation in calories and prior weight will tend to have a higher R². In an educational context, with a true model, a researcher who samples across schools in a state could be expected to have a larger R² than a researcher who samples schools in a certain county.
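A sketch of the two-experimenter comparison (assumed process y = 2 + 0.8x + N(0, 1); only the spread of the regressor differs between the two samples):

import numpy as np

rng = np.random.default_rng(5)

def r2_of_fit(x, y):
    X = np.column_stack([np.ones(x.size), x])
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

n = 500
x_narrow = rng.uniform(0.0, 1.0, n)                     # e.g. schools in one county
x_wide = rng.uniform(0.0, 10.0, n)                      # e.g. schools across the state
y_narrow = 2.0 + 0.8 * x_narrow + rng.normal(size=n)    # same process, same sigma
y_wide = 2.0 + 0.8 * x_wide + rng.normal(size=n)

print(r2_of_fit(x_narrow, y_narrow))   # small, roughly 0.05
print(r2_of_fit(x_wide, y_wide))       # large, roughly 0.84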

Actually this is not surprising. Under the usual assumptions, the variance of the ordinary least squares estimates of the β's is $\sigma^2 (X^t X)^{-1}$. So large variation in the X's tends to reduce the variance of the estimates of the β's. Thus more variability in the X's increases efficiency. But R² is apparently not usually interpreted as a measure of efficiency, but rather fit. However, R² is not quite a measure of efficiency either. For any set of regressors, R² would also tend to increase as the β's get large, a change which would have no effect on the precision of the estimates.


The fact that R² is not quite a measure of fit is not surprising. It is just the squared correlation between the response variable y and a linear combination of the regressor variables. Correlation measures how much variables vary (linearly) together. So R² has to depend on the regressors as much as on the response variable.

Note that asymptotically, as n increases and the number of regressors (including the intercept), p, remains constant, R² has the same limit as the adjusted R² of SAS PROC REG:

$$R_{adj}^2 = 1 - (1 - R^2)\,\frac{n^*}{n - p},$$

where n* = n − 1 for a corrected model with an intercept, and n for an uncorrected model. This is meant to adjust for the fact that R² always increases as regressors are added to the model. R²_adj only increases if the t-statistic for the added variable is greater than one in absolute magnitude. The same remarks would apply to the other finitely corrected adjusted R²'s, including, for example, Amemiya's prediction criterion (Amemiya, 1985):

$$R_{pc}^2 = 1 - (1 - R^2)\,\frac{n + p}{n - p},$$

designed to correct still further for the tendency of even R²_adj to choose models with many variables. But again, these have the same asymptotic behavior as R².
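A small sketch of both corrections for a corrected model with an intercept (so n* = n − 1), showing that with p fixed they converge to R² as n grows:

def r2_adj(r2, n, p):
    # Adjusted R^2 as in SAS PROC REG, corrected model with an intercept.
    return 1 - (1 - r2) * (n - 1) / (n - p)

def r2_pc(r2, n, p):
    # Amemiya's prediction criterion form of adjusted R^2.
    return 1 - (1 - r2) * (n + p) / (n - p)

for n in (20, 200, 2000):
    print(n, r2_adj(0.6, n, 5), r2_pc(0.6, n, 5))  # both approach 0.6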

V. So again, what does R² measure?

Even in a correctly specified model, R² is primarily an estimate of the relative size of the variation in observation means and the population variance. Within a single set of possible regressors, it could be a useful index of fit to compare various submodels. But models with small error, and thus good fit, can have a high R² or a small R², depending on the observation means, i.e. on the values of the regressor variables. So if R² is large it does not necessarily indicate good fit, while R² small does not necessarily indicate bad fit.

SAS is the registered trademark of SAS Institute, Inc., Cary, NC, USA.

For further information contact:
Steve Thomson
University of Kentucky Computing Center
128 McVey Hall
Lexington, KY 40506-0045


REFERENCES

Amemiya, T. (1985), Advanced Econometrics. Cambridge, Massachusetts: Harvard University Press.

Barrett, J.P. (1974), "The Coefficient of Determination - Some Limitations", The American Statistician, 28, 19-20.

Chang, P.C. and Afifi, A.A. (1987), "Goodness of Fit Statistics for General Linear Regression Equations in the Presence of Replicated Response", The American Statistician, 41, 195-199.

Crocker, D.C. (1972), "Some Interpretations of the Multiple Correlation Coefficient", The American Statistician, 26, 31-33.

Draper, N.R. (1984), "The Box-Wetz Criterion Versus R²", Journal of the Royal Statistical Society, Series A, 147, 100-103.

Draper, N.R. (1985), "Corrections: The Box-Wetz Criterion Versus R²", Journal of the Royal Statistical Society, Series A, 148, 357.

Draper, N.R. and Smith, H. (1981), Applied Regression Analysis, 2nd Ed. New York: Wiley.

Healy, M.J.R. (1984), "The Use of R² as a Measure of Goodness of Fit", Journal of the Royal Statistical Society, Series A, 147, 608-609.

Kvalseth, T.O. (1985), "Cautionary Note about R²", The American Statistician, 39, 279-285.

Ranney, G.B. and Thigpen, C.C. (1981), "The Sample Coefficient of Determination in Simple Linear Regression", The American Statistician, 35, 152-153.

Seber, G.A.F. (1977), Linear Regression Analysis. New York: Wiley.

Theil, H. and Chung, C.-F. (1988), "Information-Theoretic Measures of Fit for Univariate and Multivariate Linear Regressions", The American Statistician, 42, 249-252.

Willett, J.B. and Singer, J.D. (1988), "Another Cautionary Note about R²: Its Use in Weighted Regression Analysis", The American Statistician, 42, 236-238.
