Source: Purdue University STAT 512 lecture notes (fmliang/stat512/lect8.pdf)
Chapter 8: Transformations
For most practical problems, there is no theory
to tell us the correct form for the mean function,
and any parametric form we use is little more than
an approximation that we hope is adequate for the
problem at hand. Replacing either the predictors,
the response, or both by nonlinear transformations
of them is an important tool that the analyst can
use to extend the number of problems for which
linear regression methodology is appropriate. This
brings up two questions:
• How do we choose transformations? (Answered
in this chapter)
• How do we decide if an approximate model is
adequate for the data at hand? (Answered in
Chapters 9 and 10)
1 Transformations and Scatterplots
The most frequent purpose of transformations is to
achieve a mean function that is linear in the trans-
formed scale. In problems with only one predictor
and one response, the mean function can be visu-
alized in a scatterplot, and we can attempt to select
a transformation so the resulting scatterplot has
an approximate straight line mean function. With
many predictors, selection of transformations can
be harder.
Figure 1 contains a plot of body weight (BodyWt)
in kilograms and brain weight (BrainWt) in grams
for 62 species of mammals. There is little or no
evidence for a straight-line mean function here.
Both variables range over several orders of
magnitude, from tiny species with body weights
of just a few grams to huge animals of over 6600
kg. Transformations can help in this problem.
[Figure 1: Plot of BrainWt (g) versus BodyWt (kg) for 62 mammal species; Asian_elephant, Human, and African_elephant are labeled.]
1.1 Power transformations
The power transformation family is defined for a
strictly positive variable U by
ψ(U, λ) = U^λ.
The usual values of λ that are considered are in
the range from -2 to 2, but values in the range from
-1 to 1 are ordinarily selected. The value λ = 1
corresponds to no transformation, λ = −1 corre-
sponds to the inverse, and λ = 0 is interpreted as
a log transformation.
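As a small illustration (not part of the original notes), the basic power family can be coded directly in Python, treating λ = 0 as the log transformation:

```python
import numpy as np

def power_transform(u, lam):
    """Basic power transformation psi(U, lambda) = U**lambda,
    with lambda = 0 interpreted as the log transformation.
    Requires a strictly positive variable U."""
    u = np.asarray(u, dtype=float)
    if np.any(u <= 0):
        raise ValueError("power transformation requires strictly positive values")
    return np.log(u) if lam == 0 else u ** lam
```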
Figure 2 shows plots of ψ(BrainWt, λ) versus
ψ(BodyWt, λ) with the same λ for both variables,
for λ = −1, 0, 1/3, 1/2. There is no necessity for
the transformation to be the same for the two
variables, but if we allowed each variable to have
its own transformation parameter, the visual
search for a transformation would be harder
because more possibilities would need to be
considered.
[Figure 2: Scatterplots for the brain weight data with four possible transformations: (a) BrainWt^(−1) versus BodyWt^(−1), (b) log_e(BrainWt) versus log_e(BodyWt), (c) BrainWt^0.33 versus BodyWt^0.33, (d) BrainWt^0.5 versus BodyWt^0.5. The dashed line is a loess smooth.]
From the four graphs in Figure 2, the clear choice
is replacing the weights by their logarithms. This
is not surprising, in light of the following empirical
rules:
• The log rule. If the values of a variable range
over more than one order of magnitude and
the variable is strictly positive, then replacing
the variable by its logarithm is likely to be help-
ful.
• The range rule. If the range of a variable is
considerably less than one order of magnitude,
then any transformation of that variable is un-
likely to be helpful.
Linear regression seems to be appropriate with
both variables in log scale. This corresponds
to the physical model

BrainWt = β0 × BodyWt^β1 × δ,
where δ is a multiplicative error. We would ex-
pect that δ would have mean 1 and a distribution
concentrated on values close to 1. Scientists who
study the relationships between attributes of indi-
viduals or species call the multiplicative error model
an allometric model.
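Taking logs turns the multiplicative allometric model into a simple linear regression of log(BrainWt) on log(BodyWt). A minimal Python sketch of that fit, using a few hypothetical brain/body values (the actual 62-species dataset is not reproduced here):

```python
import numpy as np

# Hypothetical (body kg, brain g) pairs for illustration only.
body_kg = np.array([0.005, 0.5, 3.0, 62.0, 2547.0])
brain_g = np.array([0.14, 4.0, 25.0, 1320.0, 4603.0])

# log(BrainWt) = log(beta0) + beta1 * log(BodyWt) + log(delta):
# the multiplicative error model becomes ordinary least squares.
X = np.column_stack([np.ones_like(body_kg), np.log(body_kg)])
coef, *_ = np.linalg.lstsq(X, np.log(brain_g), rcond=None)
log_beta0, beta1 = coef
```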
1.2 Transforming Only the Predictor Variable
In many problems, transformation of only one
variable may be desirable. If we want to use a
family of power transformations, the scaled power
transformation may be convenient:

ψs(X, λ) = (X^λ − 1)/λ,  if λ ≠ 0,
           log(X),        if λ = 0.
This transformation differs from the basic power
transformation in several respects:
• ψs(X, λ) is continuous as a function of λ,
because lim_{λ→0} ψs(X, λ) = log(X).
• It preserves the direction of association, in the
sense that if (X, Y) are positively related, then
(ψs(X, λ), Y) are positively related for all values
of λ. With basic power transformations, the
direction of association changes when λ < 0.
The value of λ can be determined by minimiz-
ing the residual sum of squares, SSRes(λ). As a
practical matter, we do not need to know λ very
precisely, and selecting λ to minimize SSRes(λ)
from
λ ∈ {−1,−1/2,−1/3, 0, 1/3, 1/2, 1}
is usually adequate.
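The grid search described above can be sketched in a few lines of Python; this is a minimal illustration, not the notes' own code:

```python
import numpy as np

def scaled_power(x, lam):
    """Scaled power transformation psi_s(X, lambda)."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def best_lambda(x, y, grid=(-1, -1/2, -1/3, 0, 1/3, 1/2, 1)):
    """Choose lambda from the grid by minimizing SSRes(lambda) of the
    OLS regression of y on psi_s(x, lambda) with an intercept."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    best, best_rss = None, np.inf
    for lam in grid:
        X = np.column_stack([np.ones_like(x), scaled_power(x, lam)])
        _, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = rss[0] if len(rss) else np.inf
        if rss < best_rss:
            best, best_rss = lam, rss
    return best
```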
As an example, we consider the dependence of
tree Height in decimeters on Dbh, the diameter of
the tree in mm at 137 cm above the ground, for a
sample of red cedar trees. Figure 3 is the scatter-
plot of the data, and on this plot we have super-
imposed three fitted response curves, which cor-
respond to λ = 1, 0,−1, respectively. The choice
of λ = 0 seems to match the data most closely.
As an alternative approach, the value of the
transformation parameter can be estimated by
fitting using nonlinear least squares. Using the
method described in Chapter 11, the estimate of λ
turns out to be 0.05 with a standard error of 0.15,
so λ = 0 is a sensible transformation to use.

[Figure 3: Height versus Dbh for the red cedar data from Upper Flat Creek, with fitted response curves for λ = 1, 0, −1 superimposed.]

[Figure 4: The red cedar data from Upper Flat Creek transformed: Height versus log2(Dbh).]
1.3 Transforming the Response Only
A transformation of the response only can be
selected using an inverse fitted value plot, in
which we put the fitted values from the regression
of Y on X on the vertical axis and the response on
the horizontal axis. In simple regression the fitted
values are proportional to the predictor X, so an
equivalent plot is of X on the vertical axis versus
Y on the horizontal axis. The method outlined
above can then be applied to this problem.
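One way to automate this idea in Python (a sketch under the assumption that the same grid search over the scaled power family is applied to the response) is to regress the fitted values on ψs(Y, λ) and pick the λ with the smallest residual sum of squares:

```python
import numpy as np

def scaled_power(y, lam):
    """Scaled power transformation psi_s(Y, lambda)."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def inverse_fitted_lambda(x, y, grid=(-1, -1/2, -1/3, 0, 1/3, 1/2, 1)):
    """Inverse fitted value idea: get fitted values from the OLS fit of
    y on x, then choose lambda so the fitted values are most nearly
    linear in psi_s(y, lambda)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    best, best_rss = None, np.inf
    for lam in grid:
        Z = np.column_stack([np.ones_like(y), scaled_power(y, lam)])
        _, rss, *_ = np.linalg.lstsq(Z, yhat, rcond=None)
        rss = rss[0] if len(rss) else np.inf
        if rss < best_rss:
            best, best_rss = lam, rss
    return best
```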
1.4 The Box and Cox method
Box and Cox (1964) provided another general method
for selecting transformations of the response that
is applicable in both simple and multiple
regression. For strictly positive Y, this method
suggests the transformation

ψM(Y, λy) = gm(Y)^(1−λy) × ψs(Y, λy)
          = gm(Y)^(1−λy) × (Y^λy − 1)/λy,  if λy ≠ 0,
            gm(Y)^(1−λy) × log(Y),          if λy = 0,

where gm(Y) = exp(∑ log(yi)/n) is the geometric
mean of the untransformed variable.
Given λy, we fit the data using OLS and write the
residual sum of squares from this regression as
SSRes(λy). Multiplication of the scaled power
transformation by gm(Y)^(1−λy) guarantees that
the units of ψM(Y, λy) are the same for all values
of λy. We estimate λy as the value of the
transformation parameter that minimizes SSRes(λy).
From a practical point of view, we can again select
λy from the set {−1, −1/2, −1/3, 0, 1/3, 1/2, 1}.
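The distinctive step relative to the predictor-side grid search is the geometric-mean scaling, which makes SSRes(λy) comparable across λy. A Python sketch (illustrative, not the notes' code):

```python
import numpy as np

def boxcox_modified(y, lam):
    """Modified power transformation psi_M(Y, lambda): the scaled power
    transformation multiplied by gm(Y)**(1 - lambda), so residual sums
    of squares can be compared across values of lambda."""
    y = np.asarray(y, dtype=float)
    gm = np.exp(np.mean(np.log(y)))          # geometric mean of y
    core = np.log(y) if lam == 0 else (y ** lam - 1.0) / lam
    return gm ** (1.0 - lam) * core

def boxcox_lambda(x, y, grid=(-1, -1/2, -1/3, 0, 1/3, 1/2, 1)):
    """Choose lambda_y minimizing SSRes(lambda_y) from the OLS fit of
    psi_M(y, lambda_y) on x (intercept included)."""
    x = np.asarray(x, float)
    X = np.column_stack([np.ones_like(x), x])
    best, best_rss = None, np.inf
    for lam in grid:
        _, rss, *_ = np.linalg.lstsq(X, boxcox_modified(y, lam), rcond=None)
        rss = rss[0] if len(rss) else np.inf
        if rss < best_rss:
            best, best_rss = lam, rss
    return best
```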
The Box-Cox method is not transforming for
linearity, but rather for normality: λy is chosen
to make the residuals from the regression of
ψM(Y, λy) on X as close to normally distributed
as possible.
Highway accident data. The data relate the
automobile accident rate, in accidents per million
vehicle miles, to several potential terms. Figure 5
summarizes the choice of λy by a graph with λy on
the horizontal axis and the log-likelihood,

−(n/2) log(2π) − n/2 − (n/2) log(SSRes(λy)/n),

on the vertical axis. SSRes(λy) attains its minimum
(equivalently, the log-likelihood attains its
maximum) at λ̂y ≈ −0.2, and the confidence interval
for the estimate runs from about −0.8 to 0.3. This
suggests that the log transformation is appropriate
for this dataset.
[Figure 5: Box-Cox summary graph for the highway data: log-likelihood versus λy, with the 95% confidence level marked.]
2 Transformations of Nonpositive Variables
Several transformation families for a variable U
that includes negative values have been suggested.
The central idea is to use the methods discussed in
this chapter for selecting a transformation from a
family, but to use a family that permits U to be
nonpositive. One possibility is to consider
transformations of the form (U + γ)^λ, where γ is
sufficiently large to ensure that U + γ is strictly
positive. In principle, (γ, λ) could be estimated
simultaneously, although in practice estimates of γ
are highly variable and unreliable.
Alternatively, Yeo and Johnson (2000) proposed
a family of transformations that can be used
without restrictions on U and that has many of the
good properties of the Box-Cox power family. These
transformations are defined by

ψYJ(U, λ) =  ψM(U + 1, λ),       if U ≥ 0,
            −ψM(−U + 1, 2 − λ),  if U < 0.

If U is strictly positive, then the Yeo-Johnson
transformation is the same as the Box-Cox power
transformation of (U + 1). If U is strictly
negative, then the Yeo-Johnson transformation is
the negative of the Box-Cox power transformation
of (−U + 1), with power 2 − λ. With both negative
and positive values, the transformation is a
mixture of these two, so different powers are used
for positive and negative values. In this latter
case, interpretation of the transformation
parameter is difficult, as it has a different
meaning for U ≥ 0 and for U < 0.
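The piecewise definition can be coded directly; this sketch omits the geometric-mean scaling factor used in ψM for comparing SSRes across λ, keeping only the power/log core:

```python
import numpy as np

def yeo_johnson(u, lam):
    """Yeo-Johnson transformation (unscaled core): a Box-Cox-style
    power of U+1 for U >= 0, and the negative of the power 2-lambda
    transformation of -U+1 for U < 0."""
    u = np.asarray(u, dtype=float)
    out = np.empty_like(u)
    pos, neg = u >= 0, u < 0
    if lam != 0:
        out[pos] = ((u[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log(u[pos] + 1.0)
    if lam != 2:
        out[neg] = -(((-u[neg] + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam))
    else:
        out[neg] = -np.log(-u[neg] + 1.0)
    return out
```

At λ = 1 the transformation is the identity, which makes it easy to sanity-check an implementation.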
3 Additive Models
Additive models provide an alternative to the
methods for selecting transformations for
predictors. Suppose we have a regression problem
with regressors for factors and other variables
that do not need transformation given in a vector
z, and additional predictors that may need to be
transformed in x′ = (x1, . . . , xq). We consider
the mean function

E(Y | z, x) = β′z + ∑j gj(xj),
where gj(xj) is some unknown function that is es-
sentially a transformation of xj. Additive models
proceed by estimating the functions gj ’s. Method-
ology that uses splines to estimate the gj ’s is dis-
cussed in a fine book by Wood (2006), with ac-
companying software available in R in the mgcv
package.
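As a rough Python sketch of the additive idea (not the spline machinery of the mgcv package), each unknown gj can be approximated by a small polynomial basis in xj, which turns the additive model back into one ordinary least squares problem; the data below are simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)                     # regressor needing no transformation
x1, x2 = rng.uniform(1, 5, n), rng.uniform(1, 5, n)
y = 2.0 * z + np.log(x1) + np.sqrt(x2) + rng.normal(scale=0.1, size=n)

def poly_basis(x, degree=3):
    # Centered polynomial basis as a crude stand-in for a spline basis.
    x = x - x.mean()
    return np.column_stack([x ** d for d in range(1, degree + 1)])

# Additive model E(Y|z,x) = beta'z + g1(x1) + g2(x2), with each g_j
# approximated by a cubic polynomial; the whole fit is one OLS problem.
X = np.column_stack([np.ones(n), z, poly_basis(x1), poly_basis(x2)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
```

Spline bases (as in mgcv) are more flexible and numerically better behaved than global polynomials, but the structure of the fit is the same: the gj's become linear combinations of basis functions.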