8/18/2019 What Does r² measure.pdf
WHAT DOES R² MEASURE?

Steve Thomson, University of Kentucky Computing Center
The coefficient of determination, or squared multiple correlation coefficient, R², is perhaps the most extensively used measure of goodness of fit for regression models. In my consulting, many people have asked questions like: "My R² was .6. That's good, isn't it?", or "My R² was only .05. That's bad, isn't it?". Thus many people seem to use high values of R² as an index of fit, concluding that if R² is above some value, they have good fit, but that small values indicate bad fit. In either case, probably the best response is "Compared to what?".
One interpretation of R² is as the limit of the regression sum of squares divided by the limit of the total sum of squares. Equivalently, it is an estimate of the amount of variation in observation means compared to the variation in means plus the variation in residual error. This can be small even for models with small error, or large for models with large error.
I. Review:

The general linear model can be interpreted as investigating how well an observed n×1 vector, y, can be represented in some lower-dimensional subspace spanned by the columns of the matrix of regressors. (Recall that the span of a set of vectors is the set of all linear combinations of the vectors, i.e., a subspace.) The coefficients of the vectors determining the points in the subspace are the parameters. The least squares solution for the parameters is the set of parameter values corresponding to the point ŷ in the subspace closest to y.
The various sums of squares in ANOVA can be represented as (squared) Euclidean distances between subspaces spanned by different groups of columns in the regressor matrix.
Figure 1. Geometry of ANOVA.
(Note that distances are square roots of the indicated sums of squares.)
The total sum of squares, SST, is the (squared) distance from y to the origin. The error sum of squares, SSE, is the (squared) distance from y to ŷ in the span of A and B. The model sum of squares, SS(A,B), is the (squared) distance from ŷ to the origin. By the Pythagorean theorem,

SST = SS(A,B) + SSE.

So heuristically, a good model has a large SS(A,B) compared to SST. This seems to justify the use of their ratio as an index of fit, namely:

R² = SS(A,B)/SST = model sum of squares / total sum of squares.
The model sum of squares, SS(A,B), could also be partitioned into sequential sums of squares. One possible partition would be:

SS(A,B) = SS(A) + SS(B|A),

where both terms on the right are squared distances, and thus nonnegative. SS(B|A) is the sum of squares for B adjusted for, or orthogonalized to, A. So comparing the R² for the submodel with only A's as regressors to the R² for the complete model with A's and B's:

R²_A = SS(A)/SST ≤ [SS(A) + SS(B|A)]/SST = R²_{A,B}.

Clearly R² has to increase as variables are added to the model (or at least not decrease). Also, note from the geometry that 0 ≤ R² ≤ 1.
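The monotonicity is easy to check numerically. The following sketch (the data, seed, and variable names are arbitrary illustrations, not from the paper) fits a submodel and a full model by least squares and verifies that R² does not decrease:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X_A = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
X_B = rng.normal(size=(n, 1))                            # an extra, unrelated regressor
y = 2 + 3 * X_A[:, 1] + rng.normal(size=n)

def r_squared(X, y):
    """R² = 1 - SSE/SST, with corrected (mean-centered) total sum of squares."""
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    sst = np.sum((y - y.mean()) ** 2)
    sse = np.sum((y - yhat) ** 2)
    return 1 - sse / sst

r2_small = r_squared(X_A, y)                  # model with A only
r2_full = r_squared(np.hstack([X_A, X_B]), y) # model with A and B
assert r2_full >= r2_small  # SS(B|A) >= 0, so R² cannot decrease
```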
If the model includes an intercept or constant term, it is particularly easy to adjust, or orthogonalize, to this constant: merely subtract the column mean from each column. Most analyses are performed on such corrected models. One last simple observation from the geometry is that there are uncountably many subspaces with R² = 1, namely, any subspace that includes y.
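A minimal sketch of this centering trick, assuming a small simulated design (all data and names here are illustrative): subtracting the column means and dropping the intercept reproduces the slope estimates of the full intercept model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.normal(size=(n, 2))
y = 1.5 + x @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Full model with an explicit intercept column
X_full = np.column_stack([np.ones(n), x])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# "Corrected" model: centered columns, no intercept
xc = x - x.mean(axis=0)
yc = y - y.mean()
beta_centered = np.linalg.lstsq(xc, yc, rcond=None)[0]

# Same slopes; the intercept has been absorbed by the centering
assert np.allclose(beta_full[1:], beta_centered)
```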
Other results on R²:

1. If true replicates are added to the data, so that more than one y value occurs at each regressor value, then R² will often decrease (e.g., Healy, 1984; Draper, 1984). In fact, when there are true replicates its achievable upper bound will be less than 1.0. In practice, the results for close-to-replicates are often similar.
2. For Yᵢ = β₀ + β₁xᵢ₁ + ... + β_{p−1}xᵢ,ₚ₋₁ + εᵢ, with the εᵢ independent, homogeneous normal random variables with mean 0, when β₁ = β₂ = ... = β_{p−1} = 0, it is easy to show that the expected value of R², using corrected sums of squares, is (p−1)/(n−1) (e.g., Crocker, 1972; Seber, 1977, pg. 115). A similar result holds for R² using uncorrected sums of squares (p/n). In fact, under these assumptions, R² follows a beta distribution.
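A small Monte Carlo check of the null expectation (the sample size, dimension, and seed below are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 4     # p regressors including the intercept
reps = 4000
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.pinv(X)   # projection matrix onto the column span of X

r2s = []
for _ in range(reps):
    y = rng.normal(size=n)  # pure noise: all slope coefficients are zero
    yhat = H @ y
    sst = np.sum((y - y.mean()) ** 2)
    r2s.append(1 - np.sum((y - yhat) ** 2) / sst)

mean_r2 = np.mean(r2s)
# Theory: E[R²] = (p-1)/(n-1) = 3/19, about 0.158, even with no real signal
```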
It would seem that a good fit index for comparing completely different regression equations should be optimized by the least squares equation, should be free of the measurement units, and should be large for models with small residual error and small for models with large residual error. Finally, the regressor variables are usually considered fixed, so a fit index should be independent of their actual values. R² satisfies the first two criteria, but fails on the last two.
II. An Alternative Definition:

An alternative definition of R² that has been recommended (e.g., Draper & Smith, 1981) is

R² = squared correlation between regressand (response) and predicted value = [Corr(y, ŷ)]².

It is well known that for a linear model "corrected" for the intercept or constant term, estimated by ordinary least squares (so corrected sums of squares are used in the first definition), the two definitions are equivalent. For uncorrected models, this R² is easily computed in SAS® PROC REG or GLM by OUTPUTing the predicted values, computing correlations in PROC CORR, and finally squaring the result.
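The equivalence for corrected (intercept) models can be verified directly. This sketch, with simulated data standing in for the SAS PROC REG / PROC CORR route (all values are arbitrary illustrations), computes R² both ways:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])      # model with intercept
yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

# First definition: ratio of corrected sums of squares
sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - yhat) ** 2)
r2_ssq = 1 - sse / sst

# Second definition: squared correlation between y and yhat
r2_corr = np.corrcoef(y, yhat)[0, 1] ** 2

assert np.isclose(r2_ssq, r2_corr)  # identical for the corrected model
```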
To illustrate a possible problem with the second definition, suppose below that o's denote observed values, and p's denote predicted values from the least squares equation, from a simple regression without an intercept. Then, using the second definition, R² = 1.
Figure 2. Observed values (o) and predicted values (p) from the no-intercept fit.
Yet the predicted values, the ŷ's, are most certainly not doing a good job of predicting the observed values. Of course, in this case, any linear function of the observed x's will give an R² = 1. It is slightly more complicated for multiple regression problems, but generally, even if [Corr(y, ŷ)]² = 1, there is no requirement that the least squares hyperplane must be "close" to y. That is, for any b, and nonzero a,

[Corr(ŷ, y)]² = [Corr(aŷ + b, y)]².

Since we could generate a hyperplane passing through zero and any specified value, this suggests that unless the possible choices of coefficients are restricted in some way, [Corr(y, ŷ)]² is not explicitly a good measure of "fit".
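A minimal numeric sketch of this invariance (the data and the constants a = 10, b = 100 are arbitrary illustrations): squared correlation is unchanged by any affine transformation of the predictions, even one that moves them far from y.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.normal(size=n)
y = 5 + 0.5 * x + rng.normal(scale=0.1, size=n)

# No-intercept least squares fit: yhat = slope * x
slope = (x @ y) / (x @ x)
yhat = slope * x

r2_fit = np.corrcoef(y, yhat)[0, 1] ** 2
# Badly mislocated predictions: a*yhat + b with a = 10, b = 100
r2_shifted = np.corrcoef(y, 10 * yhat + 100)[0, 1] ** 2

assert np.isclose(r2_fit, r2_shifted)  # correlation ignores location and scale
```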
III. Limiting Behavior of R²:

The definition of R² used in SAS procedures seems to make more sense: R² = SS(Model)/SST, by default using corrected sums of squares if the model has an intercept, uncorrected if the model has no intercept. But that particular ratio is exactly what R² is measuring, not necessarily the quality of fit.
Suppose a general linear model is appropriate, i.e., the n×1 regressand vector, y, is appropriately modeled as y = Xβ + ε, where the εᵢ's are stochastically independent with mean 0 and variance σ².

For convenience, write the orthogonal projection matrices P_X = X(XᵗX)⁻Xᵗ and, for 1 denoting the n×1 vector of 1's, P₁ = (1/n)·11ᵗ, an n×n matrix with all elements 1/n. Since the 1 is a column of X, there is a g so that 1 = Xg. Also then P_X P₁ = P₁ = P₁ P_X.
Thus, using corrected sums of squares,

\[
R^2 = \frac{SSM_c}{SST_c}
    = \frac{(P_X y - P_1 y)^t (P_X y - P_1 y)}{y^t y - y^t P_1 y}.
\]

By the law of large numbers, plim (1/n)·Xᵗε = 0.
As n gets large, suppose (1/n)·Xᵗ(P_X − P₁)X → B, say. Then, substituting y = Xβ + ε and using plim (1/n)·Xᵗε = 0 and plim (1/n)·εᵗ(P_X − P₁)ε = 0,

\[
\operatorname{plim} \frac{(P_X y - P_1 y)^t (P_X y - P_1 y)}{n} = \beta^t B \beta,
\qquad
\operatorname{plim} \frac{y^t y - y^t P_1 y}{n} = \beta^t B \beta + \sigma^2.
\]

Thus,

\[
\operatorname{plim} R^2 = \frac{\beta^t B \beta}{\beta^t B \beta + \sigma^2},
\]

where, now, B is the limiting, centered moment matrix of the regressors. Thus R² will be large if σ² is small (i.e., good fit), or if βᵗBβ is large (i.e., large variation among mean values).
Or again, with correct specification, R² can also be interpreted via lim (1/n)·Σᵢ(μᵢ − μ̄)², where μᵢ = xᵢᵗβ are the observation means: basically, the limiting average of the corrected regression sum of squares.
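This limit can be checked by simulation. In the sketch below (a hypothetical setup, not from the paper), the single regressor is uniform on [0, 1], so the centered moment matrix B reduces to the scalar 1/12:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
beta1, sigma = 2.0, 1.0
x = rng.uniform(size=n)
y = 1 + beta1 * x + rng.normal(scale=sigma, size=n)

X = np.column_stack([np.ones(n), x])
yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

B = 1 / 12  # variance of Uniform(0, 1): the limiting centered moment "matrix"
limit = beta1**2 * B / (beta1**2 * B + sigma**2)  # = (1/3)/(1/3 + 1) = 0.25
```

With n this large, r2 sits within a fraction of a percent of the theoretical limit.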
IV. So what does R² measure?

For a correctly specified model, R² tends toward the inverse of 1 + σ²/(βᵗBβ), where, again, B is analogous to a sample covariance matrix of the regressors for the intercept case, and a moment matrix for the no-intercept case. In either case, βᵗBβ is a reasonable regression sum of squares, but is not much of an indicator of fit. If one has perfect fit, σ² = 0, and R² will be 1. Of course, R² will also tend to 1 as the variation in observation means gets large.
In practice, this means that if two experimenters choose equal sample sizes from the same process, the one who chooses his regressor variables to have more variation will tend to have a higher R². For example, in an economic context, suppose some process remains true over the years and satisfies the standard regression assumptions. Then an economist who uses monthly data from that process will tend to have a lower R² than one who uses yearly data, since generally the yearly data will have more variation. Similarly, suppose two nutritionists are modeling weight gain as a function of calories and prior weight. All else being equal, the nutritionist who chooses his sample to have more variation in calories and prior weight will tend to have a higher R². In an educational context, with a true model, a researcher who samples across schools in a state could be expected to have a larger R² than a researcher who samples schools in a certain county.
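A quick simulation of the two-experimenters scenario (the spreads 0.3 and 3.0, like everything else here, are arbitrary illustrations): same true model, same error variance, same n, but the wider design yields the larger R².

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500

def fit_r2(x, rng):
    y = 1 + 2 * x + rng.normal(size=x.size)  # identical true process both times
    X = np.column_stack([np.ones(x.size), x])
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

r2_narrow = fit_r2(rng.normal(scale=0.3, size=n), rng)  # little regressor variation
r2_wide = fit_r2(rng.normal(scale=3.0, size=n), rng)    # much regressor variation

assert r2_wide > r2_narrow  # more variation in X, higher R², same fit quality
```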
Actually this is not surprising. Under the usual assumptions, the variance of the ordinary least squares estimates of the β's is σ²(XᵗX)⁻¹. So large variation in the X's tends to reduce the variance of the estimates of the β's. Thus more variability in the X's increases efficiency. But R² is apparently not usually interpreted as a measure of efficiency, but rather of fit. However, R² is not quite a measure of efficiency either. For any set of regressors, R² would also tend to increase as the β's get large, a change which would have no effect on the precision of the estimates.
The fact that R² is not quite a measure of fit is not surprising. It is just the squared correlation between the response variable y and a linear combination of the regressor variables. Correlation measures how much variables vary (linearly) together. So R² has to depend on the regressors as much as on the response variable.
Note that asymptotically, as n increases and the number of regressors (including the intercept), p, remains constant, R² has the same limit as the adjusted R² of SAS PROC REG:

R²_adj = 1 − (1 − R²)·(n*/(n − p)),

where n* = n − 1 for a corrected model with an intercept, and n* = n for an uncorrected model. This is meant to adjust for the fact that R² always increases as regressors are added to the model. R²_adj only increases if the t-statistic for the added variable is greater than one in absolute magnitude. The same remarks would apply to the other finitely corrected adjusted R²'s, including, for example, Amemiya's prediction criterion (Amemiya, 1985):

R²_pc = 1 − (1 − R²)·((n + p)/(n − p)),

designed to correct still further for the tendency of even R²_adj to choose models with many variables. But again, these have the same asymptotic behavior as R².
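Both corrections are one-liners. The sketch below (R², n, and p are arbitrary example values) shows each one shrinking back toward the raw R² as n grows with p fixed:

```python
def adjusted_r2(r2, n, p, corrected=True):
    """Adjusted R² as in SAS PROC REG; n* = n-1 for a corrected model."""
    n_star = n - 1 if corrected else n
    return 1 - (1 - r2) * n_star / (n - p)

def amemiya_r2(r2, n, p):
    """Amemiya's prediction criterion, a harsher penalty on p."""
    return 1 - (1 - r2) * (n + p) / (n - p)

r2, p = 0.60, 5
for n in (20, 200, 2000):
    adj = adjusted_r2(r2, n, p)
    pc = amemiya_r2(r2, n, p)
    # for small n both sit well below 0.60; by n = 2000 both are within 0.01 of it
```

Note that Amemiya's criterion always penalizes at least as heavily as the usual adjustment, since (n + p)/(n − p) ≥ n*/(n − p).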
V. So again, what does R² measure?

Even in a correctly specified model, R² is primarily an estimate of the relative size of the variation in observation means and the population variance. Within a single set of possible regressors, it could be a useful index of fit to compare various submodels. But models with small error, and thus good fit, can have a high R² or a small R², depending on the observation means, i.e., on the values of the regressor variables. So if R² is large it does not necessarily indicate good fit, while a small R² does not necessarily indicate bad fit.
SAS is the registered trademark of SAS Institute, Inc., Cary, NC, USA.

For further information contact:
Steve Thomson
University of Kentucky Computing Center
128 McVey Hall
Lexington, KY 40506-0045
REFERENCES

Amemiya, T. (1985), Advanced Econometrics, Cambridge, Massachusetts: Harvard University Press.

Barrett, J.P. (1974), "The Coefficient of Determination - Some Limitations", The American Statistician, 28, 19-20.

Chang, P.C. and Afifi, A.C. (1987), "Goodness of Fit Statistics for General Linear Regression Equations in the Presence of Replicated Responses", The American Statistician, 41, 195-199.

Crocker, D.C. (1972), "Some Interpretations of the Multiple Correlation Coefficient", The American Statistician, 26, 31-33.

Draper, N.R. (1984), "The Box-Wetz Criterion Versus R²", Journal of the Royal Statistical Society, Series A, 147, 100-103.

Draper, N.R. (1985), "Corrections: The Box-Wetz Criterion Versus R²", Journal of the Royal Statistical Society, Series A, 148, 357.

Draper, N.R. and Smith, H. (1981), Applied Regression Analysis, 2nd Ed., New York: Wiley.

Healy, M.J.R. (1984), "The Use of R² as a Measure of Goodness of Fit", Journal of the Royal Statistical Society, Series A, 147, 608-609.

Kvalseth, T.O. (1985), "Cautionary Note about R²", The American Statistician, 39, 279-285.

Ranney, G.B. and Thigpen, C.C. (1981), "The Sample Coefficient of Determination in Simple Linear Regression", The American Statistician, 35, 152-153.

Seber, G.A.F. (1977), Linear Regression Analysis, New York: Wiley.

Theil, H. and Chung, C.-F. (1988), "Information-Theoretic Measures of Fit for Univariate and Multivariate Linear Regression", The American Statistician, 42, 249-252.

Willett, J.B. and Singer, J.D. (1988), "Another Cautionary Note about R²: Its Use in Weighted Regression Analysis", The American Statistician, 42, 236-238.