Source: Purdue University STAT 512 lecture notes (fmliang/stat512/lect8.pdf)
Chapter 8: Transformations
For most practical problems, there is no theory
to tell us the correct form for the mean function,
and any parametric form we use is little more than
an approximation that we hope is adequate for the
problem at hand. Replacing either the predictors,
the response, or both by nonlinear transformations
of them is an important tool that the analyst can
use to extend the number of problems for which
linear regression methodology is appropriate. This
brings up two questions:
• How do we choose transformations? (Answered
in this chapter)
• How do we decide if an approximate model is
adequate for the data at hand? (Answered in
Chapters 9 and 10)
1 Transformations and Scatterplots
The most frequent purpose of transformations is to
achieve a mean function that is linear in the trans-
formed scale. In problems with only one predictor
and one response, the mean function can be visu-
alized in a scatterplot, and we can attempt to select
a transformation so the resulting scatterplot has
an approximate straight line mean function. With
many predictors, selection of transformations can
be harder.
Figure 1 contains a plot of body weight (BodyWt)
in kilograms and brain weight (BrainWt) in grams
for 62 species of mammals. There is little or no
evidence for a straight-line mean function here.
Both variables range over several orders of
magnitude, from tiny species with body weights
of just a few grams to huge animals of over 6600
kg. Transformations can help in this problem.
[Figure 1: Plot of BrainWt (g) versus BodyWt (kg) for 62 mammal species; Asian_elephant, Human, and African_elephant are labeled.]
1.1 Power transformations
The power transformation family is defined for a
strictly positive variable U by
ψ(U, λ) = U^λ.
The usual values of λ that are considered are in
the range from -2 to 2, but values in the range from
-1 to 1 are ordinarily selected. The value λ = 1
corresponds to no transformation, λ = −1 corre-
sponds to the inverse, and λ = 0 is interpreted as
a log transformation.
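As a small illustration (not part of the original notes), the basic power family can be coded directly in Python, treating λ = 0 as the log transformation:

```python
import numpy as np

def power_transform(u, lam):
    """Basic power transformation psi(U, lambda) = U**lambda,
    with lambda = 0 interpreted as the log transformation.
    Requires a strictly positive variable U."""
    u = np.asarray(u, dtype=float)
    if np.any(u <= 0):
        raise ValueError("power transformation requires strictly positive values")
    return np.log(u) if lam == 0 else u ** lam
```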
Figure 2 shows plots of ψ(BrainWt, λ) versus
ψ(BodyWt, λ) with the same λ for both variables,
for λ = −1, 0, 1/3, 1/2. There is no necessity for
the transformation to be the same for the two
variables, but if we allowed each variable to have
its own transformation parameter, the visual
search for a transformation would be harder
because more possibilities would need to be
considered.
[Figure 2: Scatterplots for the brain weight data with four possible transformations: (a) BrainWt^(−1) versus BodyWt^(−1), (b) log_e(BrainWt) versus log_e(BodyWt), (c) BrainWt^0.33 versus BodyWt^0.33, (d) BrainWt^0.5 versus BodyWt^0.5. The dashed line is a loess smooth.]
From the four graphs in Figure 2, the clear choice
is replacing the weights by their logarithms. This
is not surprising, in light of the following empirical
rules:
• The log rule. If the values of a variable range
over more than one order of magnitude and
the variable is strictly positive, then replacing
the variable by its logarithm is likely to be help-
ful.
• The range rule. If the range of a variable is
considerably less than one order of magnitude,
then any transformation of that variable is un-
likely to be helpful.
Linear regression seems to be appropriate with
both variables in log scale. This corresponds
to the physical model

BrainWt = β0 × BodyWt^β1 × δ,
where δ is a multiplicative error. We would ex-
pect that δ would have mean 1 and a distribution
concentrated on values close to 1. Scientists who
study the relationships between attributes of indi-
viduals or species call the multiplicative error model
an allometric model.
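Taking logs turns the multiplicative allometric model into a simple linear regression of log(BrainWt) on log(BodyWt). A minimal Python sketch of that fit, using a few hypothetical brain/body values (the actual 62-species dataset is not reproduced here):

```python
import numpy as np

# Hypothetical (body kg, brain g) pairs for illustration only.
body_kg = np.array([0.005, 0.5, 3.0, 62.0, 2547.0])
brain_g = np.array([0.14, 4.0, 25.0, 1320.0, 4603.0])

# log(BrainWt) = log(beta0) + beta1 * log(BodyWt) + log(delta):
# the multiplicative error model becomes ordinary least squares.
X = np.column_stack([np.ones_like(body_kg), np.log(body_kg)])
coef, *_ = np.linalg.lstsq(X, np.log(brain_g), rcond=None)
log_beta0, beta1 = coef
```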
1.2 Transforming Only the Predictor Variable
In many problems, transformation of only one
variable may be desirable. If we want to use a
family of power transformations, the scaled power
transformation may be convenient:

ψs(X, λ) = (X^λ − 1)/λ,  if λ ≠ 0,
           log(X),        if λ = 0.
This transformation differs from the basic power
transformation in several respects:
• ψs(X, λ) is continuous as a function of λ,
because lim_{λ→0} ψs(X, λ) = log(X).
• It preserves the direction of association, in the
sense that if (X, Y) are positively related, then
(ψs(X, λ), Y) are positively related for all values
of λ. With basic power transformations, the
direction of association changes when λ < 0.
The value of λ can be determined by minimiz-
ing the residual sum of squares, SSRes(λ). As a
practical matter, we do not need to know λ very
precisely, and selecting λ to minimize SSRes(λ)
from
λ ∈ {−1,−1/2,−1/3, 0, 1/3, 1/2, 1}
is usually adequate.
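The grid search described above can be sketched in a few lines of Python; this is a minimal illustration, not the notes' own code:

```python
import numpy as np

def scaled_power(x, lam):
    """Scaled power transformation psi_s(X, lambda)."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def best_lambda(x, y, grid=(-1, -1/2, -1/3, 0, 1/3, 1/2, 1)):
    """Choose lambda from the grid by minimizing SSRes(lambda) of the
    OLS regression of y on psi_s(x, lambda) with an intercept."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    best, best_rss = None, np.inf
    for lam in grid:
        X = np.column_stack([np.ones_like(x), scaled_power(x, lam)])
        _, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = rss[0] if len(rss) else np.inf
        if rss < best_rss:
            best, best_rss = lam, rss
    return best
```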
As an example, we consider the dependence of
tree Height in decimeters on Dbh, the diameter of
the tree in mm at 137 cm above the ground, for a
sample of red cedar trees. Figure 3 is the scatter-
plot of the data, and on this plot we have super-
imposed three fitted response curves, which cor-
respond to λ = 1, 0,−1, respectively. The choice
of λ = 0 seems to match the data most closely.
As an alternative approach, the value of the
transformation parameter can be estimated by
fitting using nonlinear least squares. Using the
method described in Chapter 11, the estimate of λ
turns out to be 0.05 with a standard error of 0.15,
so λ = 0 is a sensible transformation to use.

[Figure 3: Height versus Dbh for the red cedar data from Upper Flat Creek, with fitted response curves for λ = 1, 0, −1 superimposed.]

[Figure 4: The red cedar data from Upper Flat Creek transformed: Height versus log2(Dbh).]
1.3 Transforming the Response Only
A transformation of the response only can be
selected using an inverse fitted value plot, in
which we put the fitted values from the regression
of Y on X on the vertical axis and the response on
the horizontal axis. In simple regression the fitted
values are proportional to the predictor X, so an
equivalent plot is of X on the vertical axis versus
Y on the horizontal axis. The method outlined
above can then be applied to this problem.
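One way to automate this idea in Python (a sketch under the assumption that the same grid search over the scaled power family is applied to the response) is to regress the fitted values on ψs(Y, λ) and pick the λ with the smallest residual sum of squares:

```python
import numpy as np

def scaled_power(y, lam):
    """Scaled power transformation psi_s(Y, lambda)."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def inverse_fitted_lambda(x, y, grid=(-1, -1/2, -1/3, 0, 1/3, 1/2, 1)):
    """Inverse fitted value idea: get fitted values from the OLS fit of
    y on x, then choose lambda so the fitted values are most nearly
    linear in psi_s(y, lambda)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    best, best_rss = None, np.inf
    for lam in grid:
        Z = np.column_stack([np.ones_like(y), scaled_power(y, lam)])
        _, rss, *_ = np.linalg.lstsq(Z, yhat, rcond=None)
        rss = rss[0] if len(rss) else np.inf
        if rss < best_rss:
            best, best_rss = lam, rss
    return best
```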
1.4 The Box and Cox method
Box and Cox (1964) provided another general method
for selecting transformations of the response that
is applicable in both simple and multiple
regression. For strictly positive Y, this method
suggests the transformation

ψM(Y, λy) = gm(Y)^(1−λy) × ψs(Y, λy)
          = gm(Y)^(1−λy) × (Y^λy − 1)/λy,  if λy ≠ 0,
            gm(Y)^(1−λy) × log(Y),          if λy = 0,

where gm(Y) = exp(∑ log(yi)/n) is the geometric
mean of the untransformed variable.
Given λy, we fit the data using OLS and write the
residual sum of squares from this regression as
SSRes(λy). Multiplication of the scaled power
transformation by gm(Y)^(1−λy) guarantees that
the units of ψM(Y, λy) are the same for all values
of λy. We estimate λy as the value of the
transformation parameter that minimizes SSRes(λy).
From a practical point of view, we can again select
λy from the set {−1, −1/2, −1/3, 0, 1/3, 1/2, 1}.
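The distinctive step relative to the predictor-side grid search is the geometric-mean scaling, which makes SSRes(λy) comparable across λy. A Python sketch (illustrative, not the notes' code):

```python
import numpy as np

def boxcox_modified(y, lam):
    """Modified power transformation psi_M(Y, lambda): the scaled power
    transformation multiplied by gm(Y)**(1 - lambda), so residual sums
    of squares can be compared across values of lambda."""
    y = np.asarray(y, dtype=float)
    gm = np.exp(np.mean(np.log(y)))          # geometric mean of y
    core = np.log(y) if lam == 0 else (y ** lam - 1.0) / lam
    return gm ** (1.0 - lam) * core

def boxcox_lambda(x, y, grid=(-1, -1/2, -1/3, 0, 1/3, 1/2, 1)):
    """Choose lambda_y minimizing SSRes(lambda_y) from the OLS fit of
    psi_M(y, lambda_y) on x (intercept included)."""
    x = np.asarray(x, float)
    X = np.column_stack([np.ones_like(x), x])
    best, best_rss = None, np.inf
    for lam in grid:
        _, rss, *_ = np.linalg.lstsq(X, boxcox_modified(y, lam), rcond=None)
        rss = rss[0] if len(rss) else np.inf
        if rss < best_rss:
            best, best_rss = lam, rss
    return best
```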
The Box-Cox method is not transforming for
linearity, but rather for normality: λy is chosen
to make the residuals from the regression of
ψM(Y, λy) on X as close to normally distributed
as possible.
Highway accident data. The data relate the
automobile accident rate, in accidents per million
vehicle miles, to several potential terms. Figure 5
summarizes the choice of λy by a graph with λy on
the horizontal axis and the log-likelihood,

−(n/2) log(2π) − n/2 − (n/2) log(SSRes(λy)/n),

on the vertical axis. SSRes(λy) attains its minimum
(equivalently, the log-likelihood attains its
maximum) at λ̂y ≈ −0.2, and the confidence interval
for the estimate runs from about −0.8 to 0.3. This
suggests that the log transformation is appropriate
for this dataset.
[Figure 5: Box-Cox summary graph for the highway data: log-likelihood versus λy, with the 95% confidence level marked.]
2 Transformations of Nonpositive Variables
Several transformation families for a variable U
that includes negative values have been suggested.
The central idea is to use the methods discussed in
this chapter for selecting a transformation from a
family, but to use a family that permits U to be
nonpositive. One possibility is to consider
transformations of the form (U + γ)^λ, where γ is
sufficiently large to ensure that U + γ is strictly
positive. In principle, (γ, λ) could be estimated
simultaneously, although in practice estimates of γ
are highly variable and unreliable.
Alternatively, Yeo and Johnson (2000) proposed
a family of transformations that can be used
without restrictions on U and that has many of the
good properties of the Box-Cox power family. These
transformations are defined by

ψYJ(U, λ) =  ψM(U + 1, λ),       if U ≥ 0,
            −ψM(−U + 1, 2 − λ),  if U < 0.

If U is strictly positive, then the Yeo-Johnson
transformation is the same as the Box-Cox power
transformation of (U + 1). If U is strictly
negative, then the Yeo-Johnson transformation is
the negative of the Box-Cox power transformation
of (−U + 1), with power 2 − λ. With both negative
and positive values, the transformation is a
mixture of these two, so different powers are used
for positive and negative values. In this latter
case, interpretation of the transformation
parameter is difficult, as it has a different
meaning for U ≥ 0 and for U < 0.
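The piecewise definition can be coded directly; this sketch omits the geometric-mean scaling factor used in ψM for comparing SSRes across λ, keeping only the power/log core:

```python
import numpy as np

def yeo_johnson(u, lam):
    """Yeo-Johnson transformation (unscaled core): a Box-Cox-style
    power of U+1 for U >= 0, and the negative of the power 2-lambda
    transformation of -U+1 for U < 0."""
    u = np.asarray(u, dtype=float)
    out = np.empty_like(u)
    pos, neg = u >= 0, u < 0
    if lam != 0:
        out[pos] = ((u[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log(u[pos] + 1.0)
    if lam != 2:
        out[neg] = -(((-u[neg] + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam))
    else:
        out[neg] = -np.log(-u[neg] + 1.0)
    return out
```

At λ = 1 the transformation is the identity, which makes it easy to sanity-check an implementation.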
3 Additive Models
Additive models provide an alternative to the
methods for selecting transformations for
predictors. Suppose we have a regression problem
with regressors for factors and other variables
that do not need transformation given in a vector
z, and additional predictors that may need to be
transformed in x′ = (x1, . . . , xq). We consider
the mean function

E(Y | z, x) = β′z + ∑j gj(xj),
where gj(xj) is some unknown function that is es-
sentially a transformation of xj. Additive models
proceed by estimating the functions gj ’s. Method-
ology that uses splines to estimate the gj ’s is dis-
cussed in a fine book by Wood (2006), with ac-
companying software available in R in the mgcv
package.
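As a rough Python sketch of the additive idea (not the spline machinery of the mgcv package), each unknown gj can be approximated by a small polynomial basis in xj, which turns the additive model back into one ordinary least squares problem; the data below are simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)                     # regressor needing no transformation
x1, x2 = rng.uniform(1, 5, n), rng.uniform(1, 5, n)
y = 2.0 * z + np.log(x1) + np.sqrt(x2) + rng.normal(scale=0.1, size=n)

def poly_basis(x, degree=3):
    # Centered polynomial basis as a crude stand-in for a spline basis.
    x = x - x.mean()
    return np.column_stack([x ** d for d in range(1, degree + 1)])

# Additive model E(Y|z,x) = beta'z + g1(x1) + g2(x2), with each g_j
# approximated by a cubic polynomial; the whole fit is one OLS problem.
X = np.column_stack([np.ones(n), z, poly_basis(x1), poly_basis(x2)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
```

Spline bases (as in mgcv) are more flexible and numerically better behaved than global polynomials, but the structure of the fit is the same: the gj's become linear combinations of basis functions.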