
Chapter 4
Multivariate Random Variables, Correlation, and Error Propagation

4. Introduction

So far we have dealt with a single random variable X, which we can use to model collections of scalar, and similar, data, such as the distance between points or the times between magnetic reversals. What we mean by "similar" is that all the data are (we assume in the model) of the same kind of thing, or measurement, so that a single, scalar r.v. X can serve as the probabilistic model.

In this chapter, we generalize this to pairs, triples, ..., m-tuples of random variables. We may use such multiple variables either to represent vector-valued, but still similar, quantities (such as velocity, the magnetic field, or angular motions between tectonic plates); or we may use them to model situations in which we have two or more different kinds of quantities that we wish to model probabilistically. In particular, we want to have probability models that can include cases in which the data appear to depend on each other. A geophysical example of such a relationship occurs, for example, if one data type is the magnitude of an earthquake, and the other is the rupture length of the fault that caused it. A compilation of such data, shown in Figure 1,[1] displays considerable scatter, and so is appropriate for probability modeling; but we need a model that can express the observed fact that larger earthquake magnitudes correspond to longer faults.

Figure 1

[1] Wells, D. L., and K. J. Coppersmith (1994). New empirical relationships among magnitude, rupture length, rupture width, rupture area, and surface displacement, Bull. Seism. Soc. Am., 84, 974-1002.


This plot displays a common, and important, aspect of data analysis, which is to transform the data in whatever way makes the relationship linear. In this case we have taken the logarithm of the rupture length; and of course, in using earthquake magnitude, we are using (approximately) the logarithm of the radiated energy.

This generalization to more than one random variable is called multivariate probability, though it might better be called multidimensional. We touched on the two-dimensional case in Chapter 2 as part of our demonstration of the convolution relation for the pdf of the sum of two random variables; in this chapter we extend and formalize our treatment. In particular we describe the idea of correlation and covariance, and discuss how multivariate probability is applied to the problem of propagating errors.

4.1. Multivariate PDF’s

Suppose we have an m-dimensional random variable X, which has as components m scalar random variables: X = (X1, X2, . . . , Xm). We can easily generalize our definition of a univariate pdf to say that the r.v. X is distributed according to a joint probability density function (which we will, as in the univariate case, just call a pdf):

\[
\phi(x_1, x_2, \ldots, x_m) \;\stackrel{\mathrm{def}}{=}\; \phi(\vec{x})
\]

which is now the derivative of a multivariate distribution function Φ:

\[
\phi(\vec{x}) = \frac{\partial^m \Phi(\vec{x})}{\partial x_1\,\partial x_2 \cdots \partial x_m}
\]

The distribution Φ is an integral of φ; if the domain of φ is not all of m-dimensional space, this integral needs to be done with appropriate limits. If the domain is all of m-dimensional space, we can write the integral as:

\[
\Phi(x_1, x_2, \ldots, x_m) = \int_{-\infty}^{x_1}\int_{-\infty}^{x_2}\cdots\int_{-\infty}^{x_m} \phi(\vec{x})\, d^m\vec{x}
\]

Given a region R in m-dimensional space (it does not have to be any particular shape), the probability of the r.v. X falling inside R is just the integral of φ over R:

\[
P(\vec{X} \in R) = \int_R \phi(\vec{x})\, d^m\vec{x}
\]

from which come the properties that φ must be everywhere nonnegative and that the integral of φ over the whole region of applicability must be 1.

Of course, it is not easy to visualize functions in m-dimensional space if m is greater than three, or to plot them if m is greater than two. Our examples will therefore focus on the case m = 2, for which the pdf becomes a bivariate pdf. Figure 2 shows what such a pdf might look like, plotting contours of equal values of φ. (We have intentionally made this pdf a somewhat complicated one; the dots and dashed lines will be explained below.) It will be evident that the probability of (say) X2 falling in a certain range is not unrelated to the probability of X1 falling in a certain (perhaps different) range: for example, if X1 is around zero, X2 will tend to be; if X1 is far from zero, X2 will be positive. We will see how to formalize this later. It is this ability to express relationships that makes multivariate probability such a useful tool.

Figure 2

4.2. Reducing the Dimension: Conditionals and Marginals

We can, from a multivariate pdf, find two kinds of other, lower-dimensional, pdf's. We start with examples for the bivariate case, for which (obviously) the only smaller number of dimensions than m is 1, so our reduction in dimension gives univariate pdf's.

Figure 3

We first consider the marginal pdf. For either variable this is the result of integrating the bivariate pdf over the other variable. So, for example, for X1 the marginal pdf is the pdf for X1 irrespective of the value of X2. If Φ(x1, x2) is the bivariate cumulative distribution, then the marginal cumulative distribution for X1 is given by Φ(x1, ∞):

\[
\Phi(x_1, \infty) = P(X_1 \le x_1,\; X_2 \le \infty) = P(X_1 \le x_1)
\]

But what is probably easier to visualize is the marginal density function, which comes from integrating the bivariate density function over all values of (say) x2; or, to put it another way, collapsing all the density onto one axis. The integral is then

\[
\phi(x_1) = \int_{-\infty}^{\infty} \phi(x_1, x_2)\, dx_2
\]

Figure 3 shows the marginal pdf's for the bivariate pdf plotted in Figure 2, with φ(x1) retaining a little of the multimodal character evident in the bivariate case, though φ(x2) does not.

Figure 4

The conditional pdf is something quite different. It has this name because it is, for random variables, the expression of conditional probability: it gives the probability (that is, the pdf) of (say) X2 for known x1; note that we write the conditioning variable in lowercase to show that it is a conventional variable, not a random one. The conditional pdf is found from φ(x1, x2) by computing

\[
\phi_c(x_2) = \frac{\phi(x_1, x_2)}{\int_{-\infty}^{\infty} \phi(x_1, x_2)\, dx_2}
\]

(We could write the conditional as φ_{X2|X1=x1}(x2), but while this is complete it is probably confusing.) We see that x1 is held fixed in the integral in the denominator. The conditional pdf φc is essentially a slice through the multivariate pdf with one variable held fixed, normalized by its own integral so that it will integrate to 1, as it must to be a pdf. Of course we could equally well look at the conditional pdf of X1 for x2 held fixed, or indeed the pdf along some arbitrary slice (or even a curve) through the full bivariate pdf φ. The dashed lines in Figure 2 show two slices for x1 and x2 held fixed, and Figure 4 shows the resulting conditional probabilities. (The dots will be explained in Section 4.5.) Note that these conditional pdf's peak at much higher values than does the bivariate pdf. This illustrates the general fact that as the dimension of an r.v. increases, its pdf tends to have smaller values, as it must in order to still integrate to 1 over the whole of the relevant space.

Another name for the marginal pdf is the unconditional pdf, in the sense that the marginal pdf for (say) X2 describes the behavior of X2 if we consider all possible values of X1.
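As a concrete (and invented) numerical illustration of these two operations, here is a minimal Python sketch that builds a gridded bivariate pdf, a two-component normal mixture standing in for the one contoured in Figure 2, and then collapses or slices it:

```python
import numpy as np

# Grid for a made-up bivariate pdf (a two-component normal mixture),
# standing in for the multimodal pdf contoured in Figure 2.
x1 = np.linspace(-4.0, 4.0, 401)
x2 = np.linspace(-4.0, 4.0, 401)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
dx1, dx2 = x1[1] - x1[0], x2[1] - x2[0]

def gauss2(m1, m2, s1, s2):
    """Independent bivariate normal pdf evaluated on the grid."""
    return np.exp(-0.5 * (((X1 - m1) / s1)**2 + ((X2 - m2) / s2)**2)) / (2 * np.pi * s1 * s2)

phi = 0.6 * gauss2(-1.0, 0.0, 0.8, 0.6) + 0.4 * gauss2(1.5, 1.0, 0.6, 0.9)

# Marginal pdf of X1: integrate (sum) the density over all values of x2.
phi_x1 = phi.sum(axis=1) * dx2

# Conditional pdf of X2 given x1 = 0.5: take the slice and renormalize it.
i = np.argmin(np.abs(x1 - 0.5))
phi_cond = phi[i, :] / (phi[i, :].sum() * dx2)

print((phi_x1 * dx1).sum())    # ~1: the marginal is itself a pdf
print((phi_cond * dx2).sum())  # exactly 1 by construction
```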


We can generalize both of these dimension-reduction strategies to more dimensions than two. If we start with a multidimensional pdf φ(x), we may either hold the variable values fixed for k dimensions (k being less than m), to get a conditional pdf of dimension m − k; or we may integrate φ over k dimensions to get a marginal pdf of dimension m − k. So, for example, if we have a pdf in 3 dimensions we might:

• Integrate over one direction (it does not have to be along one of the axes) to get a bivariate marginal pdf.

• Integrate over two directions to get a univariate marginal pdf; for example, integrate over a plane, say over the x2–x3 plane, to get a function of x1 only.

• Sample over a plane (again, it does not have to be along the axes) to get a bivariate conditional pdf.

• Sample along a line to get a univariate conditional pdf.

As we will see below, when we discuss regression, a particularly important case occurs when we take the conditional pdf for k = m − 1, which makes the conditional pdf univariate.

4.3. Moments of Multivariate PDF’s

We can easily generalize from moments of univariate pdf's to moments of multivariate pdf's. The zero-order moment, being the integral over the entire domain of the pdf, is still 1. But there are m first moments, instead of one; these are defined by [2]

\[
\mu_i \;\stackrel{\mathrm{def}}{=}\; E[x_i] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} x_i\,\phi(\vec{x})\, d^m\vec{x}
\]

which, as in the univariate case, expresses the location of the i-th variable, though not always very well.

[2] Notation alert: we make a slight change in usage from that in Chapter 2, using the subscript to denote different moments of the same degree, rather than the degree of the moment; this degree is implicit in the number of subscripts.

The second moments are more varied, and more interesting, than in the univariate case: for one thing, there are m² of them. As in the univariate case, we could consider second moments about zero, or about the expected value (the first moments); in practice, nobody ever considers anything but the second kind, making the expression for the second moments

\[
\mu_{ij} \;\stackrel{\mathrm{def}}{=}\; \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} (x_i - \mu_i)(x_j - \mu_j)\,\phi(\vec{x})\, d^m\vec{x}
\]

We can, as with univariate pdf's, describe the variance as

\[
V[X_i] \;\stackrel{\mathrm{def}}{=}\; \mu_{ii} = E[(X_i - \mu_i)(X_i - \mu_i)]
\]

which as before expresses something about the spread of this variable. But the more interesting moments are the covariances between two variables, which are defined as

\[
C[X_j, X_k] \;\stackrel{\mathrm{def}}{=}\; \mu_{jk} = E[(X_j - \mu_j)(X_k - \mu_k)]
= \int\!\!\int \cdots \!\!\int (x_j - \mu_j)(x_k - \mu_k)\,\phi(x_1, x_2, \ldots, x_m)\, d^m\vec{x}
\]

From this, it is clear that the variances are special cases of the covariances, with the variance being the covariance of a random variable with itself: V[X_j] = C[X_j, X_j].

Covariance has the very useful property of showing the degree of linear association between Xj and Xk. To lay this out in detail:

A. If Xj tends to increase linearly away from µj as Xk increases away from µk, then the covariance C[Xj, Xk] will be large and positive.

B. If Xj tends to decrease linearly away from µj as Xk increases away from µk, then the covariance C[Xj, Xk] will be large and negative.

C. The covariance C[Xj, Xk] will be small if there is little linear dependence between Xj and Xk.

For bivariate distributions we can define a particular "standardized" form of the covariance: the correlation coefficient, ρ, between X1 and X2,

\[
\rho = \frac{C[X_1, X_2]}{\left(V[X_1]\,V[X_2]\right)^{1/2}}
\]

Here "standardized" means normalized by the variances of both of the two variables. This normalization gives a quantity that varies between 1 and −1. A value of zero would mean no linear association at all; +1 or −1 would mean that X1 and X2 would vary exactly linearly.
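As an illustration of these definitions, a short Python sketch (the means, variances, and correlation below are arbitrary) estimating the moments and the correlation coefficient from simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a correlated pair (illustrative parameters, chosen arbitrarily).
mu = np.array([1.0, -2.0])
C = np.array([[2.0, 1.2],
              [1.2, 1.5]])          # covariance matrix
x = rng.multivariate_normal(mu, C, size=100_000)

# Sample estimates of the first and second moments.
mu_hat = x.mean(axis=0)
C_hat = np.cov(x, rowvar=False)     # variances on the diagonal, covariance off it

# Correlation coefficient rho = C12 / sqrt(V1 * V2).
rho_hat = C_hat[0, 1] / np.sqrt(C_hat[0, 0] * C_hat[1, 1])

print(mu_hat)    # close to [1, -2]
print(C_hat)     # close to C
print(rho_hat)   # close to 1.2 / sqrt(2.0 * 1.5), about 0.69
```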

4.4. Independence and Correlation

A special, and very important, case of a multivariate pdf occurs when the random variables X1, X2, . . . , Xm are independent; just as the probability of independent events is the product of the individual probabilities, so the pdf of independent r.v.'s can be expressed as the product of the individual pdf's of each variable:

\[
\phi(\vec{x}) = \phi_1(x_1)\,\phi_2(x_2)\cdots\phi_m(x_m)
\]

In this case the pdf of any given Xi can be found independently of the distribution of all the other variables, that is to say, from φi alone.

If two r.v.'s X1 and X2 are independent then the covariance C[X1, X2] = 0, and we would say that these variables are uncorrelated:

\[
C[X_1, X_2] = \int\!\!\int dx_1\, dx_2\; \phi_1(x_1)\,\phi_2(x_2)\,(x_1 - \mu_1)(x_2 - \mu_2)
= \int_{-\infty}^{\infty} (x_1 - \mu_1)\,\phi_1(x_1)\, dx_1 \int_{-\infty}^{\infty} (x_2 - \mu_2)\,\phi_2(x_2)\, dx_2 = 0
\]

because

\[
\mu_i = \int_{-\infty}^{\infty} x_i\,\phi_i(x_i)\, dx_i = E[X_i]
\qquad\text{and}\qquad
\int_{-\infty}^{\infty} \phi_i(x_i)\, dx_i = 1.
\]

However, the converse is not necessarily true; the covariance C[X1, X2] can be zero without implying statistical independence. Independence is the stronger condition on the pdf; absence of correlation refers only to a second moment of the pdf. For a slightly artificial example of no correlation but complete dependence, suppose that X ∼ N(0, 1) and Y = X². The covariance is then

\[
C[X, Y] = E\bigl[(X - E[X])(X^2 - E[X^2])\bigr] = E[X^3] - E[X]\,E[X^2] = 0,
\]

since all the odd moments of a standard normal vanish; but clearly X and Y are not independent: Y depends on X exactly. What the zero covariance indicates (correctly) is that there is no linear relationship between X and Y; indeed there is none, the relationship being parabolic. The conditional distribution of Y, given X = x, is a discrete distribution, consisting of unit mass of probability at the point x², while the unconditional (marginal) distribution of Y is χ² with one degree of freedom.
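A quick simulation makes the point numerically (an illustrative sketch, with a conditional mean used to expose the dependence):

```python
import numpy as np

rng = np.random.default_rng(1)

# X standard normal, Y = X**2: completely dependent, yet uncorrelated.
x = rng.standard_normal(1_000_000)
y = x**2

print(np.cov(x, y)[0, 1], np.corrcoef(x, y)[0, 1])   # both near zero (sampling noise only)

# The dependence shows up in a conditional moment: E[Y | X near 2] is about 4,
# nowhere near the unconditional mean E[Y] = 1.
near_two = np.abs(x - 2.0) < 0.05
print(y[near_two].mean(), y.mean())
```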

4.4.1. The Multivariate Uniform Distribution

Having introduced the ideas of independence and correlation, we are in a better position to see why the generation of random numbers by computer is so difficult. The generalization to m dimensions of the uniform distribution discussed in Chapter 3 would be a multivariate distribution in which:

1. Each of the Xi's would have a uniform distribution between 0 and 1;

2. Any set of n Xi's would have a uniform distribution within an n-dimensional unit hypercube;

3. Each of the Xi's can be regarded as independent, no matter how large m becomes.

It is quite possible to satisfy (1), and not the others, as has been demonstrated by many unsatisfactory designs of random-number generators. For example, some methods fail to satisfy (2), in that for some modest value of n all the combinations fall on a limited number of hyperplanes. Requirement (3) is the difficult one, both to satisfy and to test: we must ensure that no pair of X's, however widely separated, have any dependence. In practice numerical generators of random numbers have some periodicity after which they begin to repeat; one aspect of designing them is to make this period much longer than the likely number of calls to the program.

4.5. Regression

The idea of conditional probability for random variables allows us to introduce the concept of the regression of one variable on another, or on many others.[3]

[3] Regression is another in the large array of statistical terms whose common meaning gives no clue to its technical one. The term derives from the phrase "regression to mediocrity" applied by Francis Galton to his discovery that, on average, taller parents have children shorter than themselves, and short parents taller ones. We discuss the source of this effect below.

In Figure 2 we showed two slices through our example bivariate pdf (the dashed lines); Figure 4 shows the conditional pdf's along these slices, with the expected value of X1 (or X2) shown by the dot in each frame of the plot. Figure 2 shows where these dots appear in the bivariate distribution. That the expected value of X1 is so far from the peaks of the pdf is an effect of this pdf being multimodal, and with nearly similar peaks.

Now imagine finding E[X2] for each value of x1; that is, for each x1 we get the pdf of a random variable conditional on a conventional variable; and from this, we find the expected value of that random variable; this expected value is again a conventional variable. We therefore have a function, E[X2|x1], which gives x2 as a function of x1. We call this the regression of X2 on x1. More usually it is called the regression of X2 on X1, but we prefer to make the distinction between the random and conventional variables more specific. The solid line in Figure 5 shows this function, which passes through one of the dots in Figure 2, as it should. We can see that it roughly tracks the peak of the bivariate pdf. (The bivariate distribution is shown as before, with one additional contour at 0.005.)

Figure 5

Figure 5 also shows, as a dashed line, the other function we could find in this way, namely the regression of X1 on x2, and it is very different. This dashed line is, nevertheless, the correct answer to the question, "Given an x2, what is the expected value of X1?" You should remember that there is no unique regression: we will have a number of regression functions (usually just called regressions), depending on which variables the expected value is conditional on.

4.6. The Multivariate Normal Distribution

We now turn to, and spend some time exploring, the most heavily used multivariate pdf, namely the generalization to higher dimensions of the normal distribution. Our focus on this can, again, be partly justified by the Central Limit Theorem, which shows that this multivariate distribution arises from sums of random variables under general conditions: something often borne out, at least approximately, by actual data. It is also the case, as we will see, that the multivariate normal has a number of convenient properties.

The functional form of this pdf is

\[
\phi(\vec{x}) = \phi(x_1, x_2, \ldots, x_m)
= \frac{1}{(2\pi)^{m/2}\, |C|^{1/2}}
\exp\!\left[ -\tfrac{1}{2}\, (\vec{x}-\vec{\mu}) \cdot C^{-1} (\vec{x}-\vec{\mu}) \right]
\]


We see that the mean value has been replaced by an m-vector of values µ. The single variance σ² becomes C, the covariance matrix, representing the covariances between all possible pairs of variables, and the variances of these variables themselves. C is an m × m symmetric positive definite matrix, with determinant |C| > 0; since C is symmetric, there are only m(m + 1)/2 parameters needed to define C: m variances and m(m − 1)/2 covariances. As usual, visualization for dimensions above two is difficult, so in Figure 6 we show some bivariate normal distributions.

For the multivariate normal distribution we have, for the first three moments:

\[
\int_{\mathbb{R}^m} \phi(\vec{x})\, d^m\vec{x} = 1,
\qquad
E[X_i] = \mu_i,
\qquad
C[X_i, X_j] = C_{ij}
\]

As is the case for the univariate normal, the first and second moments completely define the pdf. This pdf also has the properties:

1. All marginal distributions are normal.

2. Given specified values of all the other variables, we get a conditional distribution that is normal.

3. If the variables are mutually uncorrelated, so that Cij = 0 for i ≠ j, then they are independent: in this case independence and zero correlation are equivalent.

Figure 6

The last result is most easily seen in the bivariate case; if C[X1, X2] = 0 we get


\[
C = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix},
\qquad\text{whence}\qquad
|C| = \sigma_1^2\,\sigma_2^2,
\]

and the pdf becomes

\[
\phi(x_1, x_2) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-(x_1-\mu_1)^2/2\sigma_1^2}\;
\frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-(x_2-\mu_2)^2/2\sigma_2^2}
\]

so that X1 and X2 are independent random variables. Contours of constant φ then form ellipses with major and minor axes that are parallel to the X1 and X2 axes (the top two plots of Figure 6).
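A two-line numerical check of this factorization, with arbitrary parameter values, assuming scipy is available:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Diagonal covariance: the joint normal pdf should factor into the product
# of the two univariate normal pdfs (parameters below are arbitrary).
mu1, mu2 = 1.0, -0.5
s1, s2 = 1.5, 0.7
C = np.diag([s1**2, s2**2])

x1, x2 = 0.3, -1.2
joint = multivariate_normal(mean=[mu1, mu2], cov=C).pdf([x1, x2])
product = norm(mu1, s1).pdf(x1) * norm(mu2, s2).pdf(x2)

print(joint, product)   # identical up to rounding
```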

In Chapter 3 we gave a table showing, for a univariate normal, the amount of probability that we would get if we integrated the pdf between certain limits. The most useful extension of this to more than one dimension is to find the distance R from the origin such that a given amount of probability (the integral over the pdf) lies within that distance. That is, for a given pdf φ, we want to find R such that

\[
\int_0^R \phi(\vec{x})\, d^m\vec{x} = p
\]

To solve this for the multivariate normal, we assume that the covariance matrix is the identity (no covariances, and all variances equal to 1). Then, what we seek to do is to find R such that

\[
\int_0^R r^{m-1} e^{-r^2/2}\, dr \left[ \int_0^{\infty} r^{m-1} e^{-r^2/2}\, dr \right]^{-1} = p
\]

where r is the distance from the origin (in 2 and 3 dimensions, the radius). The r^(m−1) term arises from our doing a (hyper)spherical integral; the definite integral is present to provide normalization without carrying around extra constants. A change of variables, u = r², makes the top integral proportional to

\[
\int_0^{R^2} u^{m/2 - 1}\, e^{-u/2}\, du
\]

But remember that the χ²_m distribution has a pdf proportional to x^(m/2 − 1) e^(−x/2), and we see that the integral is just proportional to the cdf of the χ²_m pdf; if we call this Φ(u), the solution to our problem becomes Φ(R²) = p. This connection should not be too surprising when we remember that the χ²_m distribution is just that of the sum of m squares of normally distributed r.v.'s. Using a standard table of the cdf Φ of χ², we can find the following values of R for p = 0.95:

    m    R
    1    1.96
    2    2.45
    3    2.79
    4    3.08

The first line shows the result we had before: to have 0.95 probability, the limits are (about) 2σ from the mean. For higher dimensions, the limits become farther out, since as the pdf spreads out, a larger volume is needed to contain the same amount of probability. These larger limits need to be taken into account when we are (for example) plotting the 2-dimensional equivalent of error bars.
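The tabulated values are easy to reproduce from a χ² quantile function; a minimal sketch, assuming scipy is available:

```python
import numpy as np
from scipy.stats import chi2

# Radius R containing probability p for an m-dimensional standard normal
# (identity covariance): R is the square root of the chi-square quantile with m dof.
p = 0.95
for m in (1, 2, 3, 4):
    R = np.sqrt(chi2.ppf(p, df=m))
    print(m, round(R, 2))   # 1.96, 2.45, 2.79, 3.08, as in the table above
```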

4.6.1. Regression for a Multivariate Normal: Linear Regression

The regression functions for a multivariate normal take a relatively simple form, one that is widely used as a model for many regression problems. To find the regression curve, we take the conditional probability for a bivariate normal, taking x1 as given; then X2 has a conditional pdf

\[
\phi_{X_2|X_1=x_1}(x_2) =
\frac{1}{\left[ 2\pi\sigma_2^2(1-\rho^2) \right]^{1/2}}
\exp\!\left\{ -\frac{\left[ x_2 - \mu_2 - \rho\,(\sigma_2/\sigma_1)(x_1 - \mu_1) \right]^2}{2\sigma_2^2(1-\rho^2)} \right\}
\]

which is a somewhat messy version of the univariate pdf. From this, we see that the expected value of X2 is

\[
E[X_2 \mid X_1 = x_1] = \mu_2 + \rho\,(\sigma_2/\sigma_1)(x_1 - \mu_1)
\]

which defines a regression line passing through (µ1, µ2), with slope ρ(σ2/σ1). If we now switch all the subscripts we get the regression line for X1 given X2, which is

\[
E[X_1 \mid X_2 = x_2] = \mu_1 + \rho\,(\sigma_1/\sigma_2)(x_2 - \mu_2)
\]

which passes through the same point but, viewed in the same (x1, x2) plane, has a different slope, of ρ^(−1)(σ2/σ1). These two regression lines become the same only if the correlation coefficient ρ is ±1. If ρ is zero, the lines are at right angles. Figure 6 (lower right panel) shows the regression lines for the bivariate pdf contoured there.
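A simulation check of the first of these lines (an illustrative sketch; the parameter values are arbitrary, and the conditional mean is estimated by selecting samples with X1 near the chosen x1):

```python
import numpy as np

rng = np.random.default_rng(2)

mu1, mu2 = 0.0, 1.0
s1, s2, rho = 2.0, 1.0, 0.6
C = np.array([[s1**2, rho*s1*s2],
              [rho*s1*s2, s2**2]])
x = rng.multivariate_normal([mu1, mu2], C, size=500_000)

# Empirical E[X2 | X1 near x1] for a few values of x1, versus the formula
# mu2 + rho*(s2/s1)*(x1 - mu1).
for x1 in (-2.0, 0.0, 2.0):
    sel = np.abs(x[:, 0] - x1) < 0.05
    print(x1, x[sel, 1].mean(), mu2 + rho*(s2/s1)*(x1 - mu1))
```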

We see that neither line falls along the major axis of the elliptical contours. This is what gives rise to the phenomenon of regression to the mean: if x1 is above the mean, the expected value of X2 given x1 is less than x1. So (to use Galton's example) if parental height and offspring height are imperfectly correlated (and they are), tall parents will, on average, have children whose height is closer to the mean. But there is nothing causal about this: the same applies to tall children, who will have shorter parents on average. The same shift will be seen, and for the same reason, in repeated measurements of the same thing: remeasurement of items that initially measured below the mean will, on average, give values closer to the mean (assuming, again, that the variations in measurement are entirely random). When this happens, it is notoriously easy to assume that this change towards the mean shows a systematic change, rather than just the effect of random variation.

In higher dimensions, we get regression planes (or hyperplanes), but it remains true that the regression function of each variable on all the others is linear:

\[
E[X_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_m]
= \mu_i + \sum_{j=1,\; j\ne i}^{m} a_j\,(x_j - \mu_j)
\]

where the slopes a_j depend on the covariances, including of course the variances, as in the bivariate case. So the multivariate normal pdf gives rise to linear regression, in which the relationships between variables are described by linear expressions. In fact, these linear expressions describe the pdf almost completely: they give its location and shape, with one variance needed for a complete description.

4.6.2. Linear Transformations of Multivariate Normal RV’s

Not infrequently we form a linear combination of our original set of random variables X to produce some new set Y; for example, by summing a set of random variables, perhaps weighted by different amounts. We can write any linear transformation of the original set of m random variables X to produce a new set of n random variables Y as

\[
Y_j = \sum_{k=1}^{m} l_{jk} X_k
\]

or in matrix notation

\[
\vec{Y} = L\vec{X}
\]

Note that the l_jk are not random variables, nor are they moments; they are simply numbers that are elements of the matrix L. If the X are distributed as a joint normal pdf, the pdf for the transformed variables Y can be shown to be another joint normal with mean value

\[
\vec{\nu} = L\vec{\mu} \qquad (1)
\]

and covariance

\[
C' = L C L^{T} \qquad (2)
\]

We demonstrate this in the section on propagation of errors, below.
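A numerical check of (1) and (2), with an arbitrary choice of L, µ, and C:

```python
import numpy as np

rng = np.random.default_rng(3)

# Original variables: m = 3, with an arbitrary mean and covariance.
mu = np.array([1.0, 0.0, -1.0])
A = rng.standard_normal((3, 3))
C = A @ A.T + 3*np.eye(3)          # symmetric positive definite

# An arbitrary linear map to n = 2 new variables.
L = np.array([[1.0, 2.0, 0.5],
              [0.0, -1.0, 1.0]])

# Theory: nu = L mu and C' = L C L^T.
nu = L @ mu
Cp = L @ C @ L.T

# Check by transforming a large sample.
x = rng.multivariate_normal(mu, C, size=200_000)
y = x @ L.T
print(nu, y.mean(axis=0))
print(Cp)
print(np.cov(y, rowvar=False))
```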

4.6.2.1. Two Examples: Weighted Means, and Sum and Difference

The simplest case of such a linear combination of random variables, often used with data, is the weighted mean; from n variables Xi we compute

\[
Y = \frac{1}{W} \sum_{i=1}^{n} w_i X_i
\qquad\text{where}\qquad
W = \sum_{i=1}^{n} w_i
\]

We suppose that the Xi are iid (independent and identically distributed), with first and second moments µ and σ². It is immediately obvious from (1) that ν = µ; applying (2) to the covariance matrix C, which in this case is a scaled version of the identity matrix, we get that the variance of Y is

\[
\frac{\sigma^2}{W^2} \sum_{i=1}^{n} w_i^2
\]

which for all the weights the same gives the familiar result σ²/n: the variance is reduced by a factor of n, or (for a normal pdf) the standard deviation by a factor of n^(1/2).
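A quick check of the weighted-mean variance (the weights below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# iid variables with variance sigma^2; arbitrary positive weights.
sigma = 2.0
w = np.array([1.0, 2.0, 3.0, 4.0])
W = w.sum()

# Predicted variance of the weighted mean: sigma^2 * sum(w_i^2) / W^2.
var_pred = sigma**2 * np.sum(w**2) / W**2

# Monte Carlo check.
x = rng.normal(0.0, sigma, size=(500_000, w.size))
y = (x * w).sum(axis=1) / W
print(var_pred, y.var())
```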

Another simple example leads to the topic of the next section. Consider taking the sum and difference of two random variables X1 and X2, scaled so that the transformation is a pure rotation. We can write this in matrix form as

\[
\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
= \frac{1}{\sqrt{2}}
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}
\qquad (3)
\]

The most general form for the bivariate covariance matrix C is

\[
C = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}
\qquad (4)
\]

If we now apply (2) to this we obtain

\[
C' = \begin{pmatrix}
\tfrac{1}{2}(\sigma_1^2 + \sigma_2^2) + \rho\sigma_1\sigma_2 & \tfrac{1}{2}(\sigma_1^2 - \sigma_2^2) \\
\tfrac{1}{2}(\sigma_1^2 - \sigma_2^2) & \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2) - \rho\sigma_1\sigma_2
\end{pmatrix}
\]

which for the case σ1 = σ2 = σ becomes

\[
C' = \sigma^2 \begin{pmatrix} 1 + \rho & 0 \\ 0 & 1 - \rho \end{pmatrix}
\]

so that the covariance has been reduced to zero, and the variances increased and decreased by an amount that depends on the correlation. What has happened is easy to picture: the change of variables (3) amounts to a 45° rotation of the axes. If σ1 = σ2 and ρ > 0 the pdf will be elongated along the line X1 = X2, which the rotation makes into the Y1 axis. The rotation thus changes the pdf into one appropriate to two uncorrelated variables.
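The decorrelation is easy to verify numerically; an illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two correlated variables with equal variances.
sigma, rho = 1.5, 0.8
C = sigma**2 * np.array([[1.0, rho],
                         [rho, 1.0]])
x = rng.multivariate_normal([0.0, 0.0], C, size=300_000)

# 45-degree rotation: (sum, difference)/sqrt(2), as in (3).
L = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2)
y = x @ L.T

print(np.cov(y, rowvar=False))
# Should be close to diag(sigma^2*(1+rho), sigma^2*(1-rho)), with ~zero covariance.
print(sigma**2 * (1 + rho), sigma**2 * (1 - rho))
```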

There are two applications of this. One is a plotting technique called the sum-difference plot: as with many handy plotting tricks, this was invented by J. W. Tukey. If you have a plot of two highly correlated variables, all you can easily see is that they fall along a line. It may be instructive to replot against the sum and difference of the two, to remove this linear dependence so you can see other effects. The other application is that this kind of correlation is quite common in fitting functions to data: if the functions are very similar in shape, the fit will produce results that appear to have large variances. And so they do; but there is also a large correlation. Looking at appropriate combinations will reveal that, for example, what has a really large variance is the fit to the difference between functions; the fit to the sum will be well determined.

4.6.3. Removing the Correlations from a Multivariate Normal

The example just given has shown how, for a normal pdf, it is possible to find a rotation of the axes that will make the covariance matrix diagonal and the new variables uncorrelated. The question naturally arises: can this be done in general, to always produce a new set of uncorrelated variables? In this section we demonstrate that this is so.

A rotation of the axes is a special case of transforming the original variables to a new set of variables using linear combinations. For a rotation the matrix of linear combinations is square (n = m), and also orthogonal, with L^T = L^(−1) =def K; the inverse matrix K is also orthogonal. Then, from (1) and (2), the mean and covariance of the new variables must be ν and

\[
C' = L C K = K^T C K \qquad (5)
\]

Given standard results from linear algebra it is easy to show that we can always find a new, uncorrelated, set of variables. For the transformed variables to be uncorrelated, the transformed covariance matrix has to be diagonal; from (5), we thus need to have

\[
K^T C K = \mathrm{diag}\,[\sigma_1^2, \sigma_2^2, \ldots, \sigma_m^2] \qquad (6)
\]

Since K is orthogonal, (6) is a standard spectral factorization or eigenvalue problem. For a symmetric positive definite matrix C (which a covariance matrix always is), we can write the C matrix as a product

\[
C = U^T \Lambda U
\]


where U is orthogonal, U = [u1, u2, . . . , um], and Λ is diagonal: Λ = diag[λ1, λ2, . . . , λm] with all the λ's positive; it is the case that

\[
C\vec{u}_k = \lambda_k \vec{u}_k
\qquad\text{and}\qquad
\vec{u}_j \cdot \vec{u}_k = \delta_{jk}
\]

so the u_k are the eigenvectors of C.

For our covariance problem, we let K = U^T, with columns given by the eigenvectors of the covariance matrix C; then the transformation (5) produces a diagonal matrix

\[
\mathrm{diag}\,[\lambda_1, \lambda_2, \ldots, \lambda_m]
\]

which means that σ_j² = λ_j: the variances of the transformed set of variables are the eigenvalues of C. The directions u_j associated with these diagonalized σ_j are called the principal axes.

While this rotation gives uncorrelated variables, there are other linear transformations that do so, for the multivariate normal at least. Another decomposition we can use is the Cholesky factorization

\[
C = L L^T
\]

where L is a lower triangular matrix; transforming with L^(−1) (that is, using L^(−1) in place of L in (2)) turns the covariance matrix into the identity. For multivariate Gaussians the transformation to an uncorrelated system of variables is thus not unique.
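Both factorizations are available in numpy; a sketch with an arbitrary covariance matrix, showing that either one decorrelates the variables:

```python
import numpy as np

rng = np.random.default_rng(6)

# A symmetric positive definite covariance matrix (arbitrary example).
A = rng.standard_normal((3, 3))
C = A @ A.T + 3*np.eye(3)

# Spectral factorization: eigenvalues are the variances along the principal axes.
lam, U = np.linalg.eigh(C)          # columns of U are eigenvectors of C
print(np.round(U.T @ C @ U, 10))    # diagonal matrix of the eigenvalues

# Cholesky factorization C = L L^T: transforming with inv(L) gives the identity.
L = np.linalg.cholesky(C)
Linv = np.linalg.inv(L)
print(np.round(Linv @ C @ Linv.T, 10))   # identity matrix

# Applied to data: both transformations decorrelate samples drawn with covariance C.
x = rng.multivariate_normal(np.zeros(3), C, size=200_000)
print(np.round(np.cov(x @ U, rowvar=False), 2))        # ~diag(lam)
print(np.round(np.cov(x @ Linv.T, rowvar=False), 2))   # ~identity
```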

4.7. The Fisher Distribution

We next look at one other bivariate pdf, one that is common in some branches of geophysics. This is the Fisher distribution, which is the analog to the bivariate normal when the domain of the random variable is the surface of a sphere instead of an infinite plane. This distribution is thus useful for modeling directions on a sphere, for example directions of magnetization in paleomagnetism (for which this distribution was invented).

The pdf for this distribution is

\[
\phi(\Delta, \theta) = \frac{\kappa}{4\pi \sinh\kappa}\, e^{\kappa\cos\Delta}
\]

To keep this expression simple, we have not included the location parameter; or rather, it is implicit in the coordinates used for the arguments of φ. These are ∆, which is the angular distance from the maximum of the pdf, and θ, which is the azimuth; but since the azimuth does not in fact enter into the pdf, the distribution is circularly symmetric about its maximum. The location of the maximum value, with ∆ = 0, also corresponds to the expected value of the random variable. As with the von Mises distribution (which can be thought of as the "one-dimensional" version of the Fisher), this distribution has only a shape parameter, κ, which determines the width; this is usually called the concentration parameter. For κ zero, the distribution is uniform over the sphere; as κ becomes large, the pdf (near its maximum) approximates a bivariate normal with no covariance and both variances equal.

The version of the distribution just given provides the probability density over a unit angular area, angular areas on the sphere being measured in steradians, with the area of the whole (unit) sphere being 4π. Another distribution derived from this is the probability density per unit of angular distance, averaging over all values of θ; this is given by

\[
\phi(\Delta) = \frac{\kappa}{2\sinh\kappa}\, e^{\kappa\cos\Delta}\,\sin\Delta
\]

and so, like the Rayleigh distribution (to which it is the analog), goes to zero at the origin. Of course, neither of these distributions incorporates possible variation in azimuth, as would be produced by unequal variances. This can be done using other distributions such as the Bingham distribution, though these have to deal with the complications of the "lobes" of the pdf wrapping around the sphere.[4]

[4] For these other distributions, see Lisa Tauxe (1998), Paleomagnetic Principles and Practice (Dordrecht: Kluwer).
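Because the azimuth θ is uniform, a Fisher-distributed direction can be sampled by drawing θ uniformly on [0, 2π) and drawing ∆ by inverting the cdf of φ(∆); a minimal sketch (the value of κ is arbitrary, and for large κ the mean of ∆ should approach the corresponding Rayleigh value):

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_fisher_delta(kappa, n, rng):
    """Draw angular distances Delta (radians) from the Fisher distribution by
    inverting the cdf of phi(Delta) = kappa/(2 sinh kappa) * exp(kappa cos Delta) * sin Delta."""
    p = rng.uniform(size=n)
    cos_d = 1.0 + np.log1p(-p * (1.0 - np.exp(-2.0 * kappa))) / kappa
    return np.arccos(np.clip(cos_d, -1.0, 1.0))

# For large kappa the pdf concentrates near Delta = 0 and Delta is approximately
# Rayleigh-distributed with scale 1/sqrt(kappa).
kappa = 50.0
d = sample_fisher_delta(kappa, 200_000, rng)
print(d.mean(), np.sqrt(np.pi / (2 * kappa)))   # empirical mean vs. Rayleigh mean
```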

4.8. Propagation of Errors

Often we can adequately represent the actual pdf of some set of variables by a multivariate normal; sometimes we just choose to do this implicitly by dealing only with the first two moments, the only parameters needed for a normal. Suppose all we have are first moments (means) and second moments (variances and covariances). Suppose further that we have a function f(X) of the random variables; what is the multivariate normal that we should use to approximate the pdf of this function?

The usual answer to this question proceeds according to what is usually called the propagation of errors, in which we suppose not only that we can represent X adequately by its first two moments, but also that we can represent the function, locally at least, by a linear approximation. All this is much simpler than what we did in Chapter 3 by finding the pdf of an r.v. transformed into another r.v.; but this simplicity is needed if we are to have manageable solutions to multivariate problems.

We start, however, by taking the function f(X) to produce a single random variable. The linear approximation of the function is

\[
f(\vec{X}) = f(\vec{\mu}) + \sum_{i=1}^{m} \left.\frac{\partial f}{\partial x_i}\right|_{x_i=\mu_i} (X_i - \mu_i)
= f(\vec{\mu}) + \vec{d} \cdot (\vec{X} - \vec{\mu})^T \qquad (7)
\]

where d is the m-vector of partial derivatives of f, evaluated at µ, the mean value of X. The new first moment (expectation) is just the result of evaluating the function at the expected value of its argument. We can do this most easily by applying the expectation operator introduced in Chapter 2 (rather than cluttering the problem with integrals), remembering its linearity properties:

\[
E[f(\vec{X})] = f(\vec{\mu}) + \vec{d} \cdot E[\vec{X} - \vec{\mu}]^T = f(\vec{\mu}) \qquad (8)
\]

because E[X] = µ. Of course, this is only an exact result if the linear relationship (7) is a complete representation of the function: in general it is only an approximation.

The second moment (variance) of f(X) will be

\[
V[f(\vec{X})] = E\!\left[\left(\vec{d}\cdot(\vec{X}-\vec{\mu})^T\right)^2\right]
= E\!\left[\left((\vec{X}-\vec{\mu})\cdot\vec{d}^{\,T}\right)\left(\vec{d}\cdot(\vec{X}-\vec{\mu})^T\right)\right]
= E\!\left[(\vec{X}-\vec{\mu})\cdot A\cdot(\vec{X}-\vec{\mu})^T\right]
\]

where A = d^T d is the m × m matrix of products of partial derivatives; in component form

\[
a_{ij} = d_i d_j = \frac{\partial f}{\partial x_i}\,\frac{\partial f}{\partial x_j}
\]

But we can take the expectation operator inside the matrix sum to get

\[
V[f(\vec{X})] = E\!\left[\sum_i \sum_j a_{ij}(x_i-\mu_i)(x_j-\mu_j)\right]
= \sum_i \sum_j a_{ij}\, E[(x_i-\mu_i)(x_j-\mu_j)]
= \sum_i \sum_j a_{ij} C_{ij}
= \sum_{ij} d_i d_j C_{ij}
\]

where the C_ij are the elements of the covariance matrix,

\[
C_{ij} = C[X_i - \mu_i,\; X_j - \mu_j]
\]

We thus have an expression for the variance of the function that depends on its derivatives and on the covariance matrix.

Now consider a pair of functions f_a(X), with derivatives d_a, and f_b(X), with derivatives d_b. Then we can apply the same kind of computation to get the covariance between them as

\[
C[f_a(\vec{X}), f_b(\vec{X})]
= E\!\left[\vec{d}_a^{\,T}\cdot(\vec{X}-\vec{\mu})\;\vec{d}_b\cdot(\vec{X}-\vec{\mu})^T\right]
= \vec{d}_a^{\,T}\cdot E\!\left[(\vec{X}-\vec{\mu})\cdot(\vec{X}-\vec{\mu})^T\right]\cdot\vec{d}_b
= \vec{d}_a^{\,T}\cdot C\cdot\vec{d}_b
\]

which gives the covariance, again in terms of the derivatives of each function and the covariances of the original variables. The variances are just d_a^T · C · d_a and d_b^T · C · d_b.

We can now generalize this to n functions, rather than just two. Each function will have an m-vector of derivatives d; we can combine these to create an n × m matrix D, with elements

\[
D_{ij} = \frac{\partial f_i}{\partial x_j}
\]

which is called the Jacobian matrix of the function. Following the same reasoning as before, we get that the transformation from the original covariance matrix C to the new one C', dimensioned n × n, is

\[
C' = D C D^T
\]

which is of course of the same form as the transformation to a new set of axes given above in Section 4.6.2, though this is approximate, as that was not. Along the diagonal, we obtain the variances of the n functions:

\[
C'_{ii} = \sum_{j=1}^{m}\sum_{k=1}^{m} D_{ij} D_{ik} C_{jk}
\]
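A sketch of this recipe in code; the functions, means, and covariance matrix are invented for illustration, the Jacobian is formed by finite differences, and a Monte Carlo calculation checks the linearization:

```python
import numpy as np

rng = np.random.default_rng(8)

# Two functions of three random variables (an invented example):
# f1 = x1*x2 + x3,  f2 = x1/x3.
def f(x):
    x = np.asarray(x)
    return np.stack([x[..., 0] * x[..., 1] + x[..., 2], x[..., 0] / x[..., 2]], axis=-1)

mu = np.array([2.0, 1.0, 4.0])
C = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])

# Jacobian D_ij = df_i/dx_j, here by centered finite differences at the mean.
eps = 1e-6
D = np.empty((2, 3))
for j in range(3):
    dx = np.zeros(3)
    dx[j] = eps
    D[:, j] = (f(mu + dx) - f(mu - dx)) / (2 * eps)

# Propagated (linearized) covariance of the two functions: C' = D C D^T.
Cp = D @ C @ D.T
print(Cp)

# Monte Carlo check; agreement holds while the spreads are small enough
# that the linearization is adequate.
x = rng.multivariate_normal(mu, C, size=300_000)
print(np.cov(f(x), rowvar=False))
```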

4.8.1. An Example: Phase and Amplitude

To show how this works, we discuss the case of random variables describing a periodic phenomenon. We can parameterize something that varies with period T (not a random variable) in two ways:

\[
Y(t) = A\cos\!\left(\frac{2\pi t}{T} + \theta\right)
= X_1\cos\!\left(\frac{2\pi t}{T}\right) + X_2\sin\!\left(\frac{2\pi t}{T}\right)
\]

where the random variables are either the amplitude A and phase θ, or the cosine and sine amplitudes X1 and X2. The latter are sometimes called the in-phase and quadrature parts or, if a complex representation is used, the real and imaginary parts.

Of these two representations, the first is more traditional, and in some ways easier to interpret; most notably, the amplitude A does not depend on what time we take to correspond to t = 0. The second is preferable for discussing error (and for other purposes), because the r.v.'s affect the result purely linearly. We may ask how the first and second moments of X1 and X2 map into the same moments of A and θ. The relationship between these is

\[
A = f_1(X_1, X_2) = (X_1^2 + X_2^2)^{1/2}
\qquad
\theta = f_2(X_1, X_2) = \arctan\!\left(\frac{X_1}{X_2}\right)
\qquad (9)
\]

from which we can find the components of the Jacobian matrix to be

\[
d_{11} = \frac{\partial f_1}{\partial x_1} = \frac{x_1}{a}
\qquad
d_{12} = \frac{x_2}{a}
\qquad
d_{21} = \frac{\partial f_2}{\partial x_1} = \frac{x_2}{a^2}
\qquad
d_{22} = -\frac{x_1}{a^2}
\]

where we have used a for the amplitude to indicate that this is not a random variable but a conventional one, to avoid taking a derivative with respect to a random variable.

If we assume that the variances for the X's are as in (4), we find that the variances for A and θ are

\[
\sigma_A^2 = \frac{x_1^2}{a^2}\sigma_1^2 + \frac{x_2^2}{a^2}\sigma_2^2 + 2\,\frac{x_1 x_2}{a^2}\,\rho\sigma_1\sigma_2
\qquad
\sigma_\theta^2 = \frac{x_2^2}{a^4}\sigma_1^2 + \frac{x_1^2}{a^4}\sigma_2^2 - 2\,\frac{x_1 x_2}{a^4}\,\rho\sigma_1\sigma_2
\]

which, if σ1 = σ2 = σ and ρ = 0, reduces to

\[
\sigma_A = \sigma \qquad \sigma_\theta = \frac{\sigma}{a}
\]

which makes sense if thought about geometrically: provided a is much larger than σ1 and σ2, the error in amplitude looks like the error in the cosine and sine parts, while the error in phase is just the angle subtended. However, if the square root of the variance is comparable to the amplitude, these linear approximations fail. For one thing, the result (8) no longer holds: the expected value of the amplitude is not (µ1² + µ2²)^(1/2), but is systematically larger; in fact, for the X's both having zero mean (and equal variances), A has the Rayleigh distribution. And the pdf for the angle θ also cannot be approximated by a normal distribution. This is a special case of a more general result: if you have a choice between variables with normal errors and another set which is related to them nonlinearly, avoid the nonlinear set unless you are prepared to do the transformation of the pdf properly, or have made sure that the linear approximation will be valid.
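A numerical illustration of both regimes (the mean amplitudes and error levels are invented), comparing the linearized error estimates with a direct Monte Carlo calculation:

```python
import numpy as np

rng = np.random.default_rng(9)

# Mean cosine/sine amplitudes and their (uncorrelated, equal-variance) errors.
mu1, mu2 = 3.0, 4.0          # so the mean amplitude a = 5
sigma = 0.1
a = np.hypot(mu1, mu2)

# Linearized predictions: sigma_A = sigma, sigma_theta = sigma / a.
print("linearized:", sigma, sigma / a)

# Monte Carlo with the full nonlinear transformation.
x1 = rng.normal(mu1, sigma, size=500_000)
x2 = rng.normal(mu2, sigma, size=500_000)
A = np.hypot(x1, x2)
theta = np.arctan2(x1, x2)    # phase convention matching theta = arctan(X1/X2)
print("monte carlo:", A.std(), theta.std())

# With sigma comparable to a, the linearization breaks down and E[A] is biased high.
sigma_big = 3.0
A_big = np.hypot(rng.normal(mu1, sigma_big, 500_000), rng.normal(mu2, sigma_big, 500_000))
print(A_big.mean(), a)        # the mean amplitude noticeably exceeds a
```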


4.9. Regression and Curve Fitting

As a transition to the next chapter, on estimation, we return to the problem of finding the regression curve. We compare this with a superficially similar procedure that is often (confusingly) also called regression, but is in fact conceptually different, although mathematically similar (hence the use of the name).

In regression as we defined it above, we assume a pdf for at least two variables, and ask for the expected value of one of them when all the others are fixed. Which one we do not fix is a matter of choice; to go back to Figure 1, it would depend on whether we were trying to predict magnitude from rupture length, or the other way around.

The other usage of the term regression is for fitting some kind of function to data. We suppose we have only one thing that should be represented by a random variable, and that this thing depends on one or more conventional variables through some functional dependence that we wish to determine. The regression curve (or more generally surface) is then the function we wish to find. But there is not the symmetry we have in the regression problem; only one variable is random.

Most elementary discussions of the analysis of experimental data assume a model that amounts to this, but one which uses an approach we described in Chapter 1, and chose not to use: that is to say that we should regard data y as being the sum of a function f_p(x) and a random variable E. Here f_p is a function that depends on some parameters p and takes as argument some variables x, and E is the "error".

In our terminology, x is a vector of conventional variables and the function f_p produces a conventional variable; the error is modeled as a random variable. A simple example would be if we measure the position of a falling object (with zero initial velocity) as a function of known times t. We would write this as

\[
\vec{Y} = \tfrac{1}{2}\, g\, \vec{t}^{\,2} + \vec{E} \qquad (10)
\]

where we take t² to represent the squares of the individual t's. The parameter of the function is g, the gravitational acceleration. In our framework, E is a random variable, and so is Y. Now assume that the components of E are independent and identically distributed, each with pdf φ(y), so the individual Y's are independent and identically distributed. Then (10) can be written as a distribution for Y:

\[
Y_i \sim \phi\!\left(y - \tfrac{1}{2}\, g\, t_i^2\right) \qquad (11)
\]

so that g, and the known values t_i, appear as parameters in a pdf, which, as conventional variables, is where they must appear. So the problem of finding g, given the values for t and Y, becomes one of estimating parameters in a pdf, something which we will examine extensively in the next chapter.

You should by now be able to see that this is quite a different setup from the one that led us to regression. In particular, since t is not a random variable, there is no way in which we can state a bivariate pdf for t and Y. We could of course contour the pdf for Y against t, but these contours would not be those of a bivariate pdf; in particular, there would be no requirement that the integral over Y and t together be equal to 1. However, it is also easy to see (we hope) that we can engage in the activity of finding a curve that represents E[Y] as a function of t and the parameter g; and that this curve, being the expected value of Y given t, is found in just the same way as the regression curve for one random variable conditional upon the other one. So in that sense "regression" is equivalent to "curve fitting when only one variable is random (i.e., has errors)". But it is important to remember that the similar solutions of these two problems come from a quite different underlying structure.

© 2005 D. C. Agnew/C. Constable