
CHAPTER 9

Methods For Mixed Data

9.1 Introduction

Chapters 5-8 pertained to datasets in which the variables were either all continuous or all categorical. In practice, however, statistical analyses involving variables of both types are extremely common: analysis of variance, analysis of covariance, logistic regression with continuous predictors, and so on. Sample surveys often contain variables of both types. This chapter develops general tools for incomplete multivariate data matrices containing both continuous and categorical variables. Such a dataset is shown in Figure 9.1, with missing values denoted by question marks.

Figure 9.1. Mixed dataset with missing values.

The statistical literature on multivariate methods tends to emphasize models for variables that are all of the same type; relatively little attention has been paid to models for mixed data. One notable exception is the model that underlies classical discriminant analysis, which contains a single categorical response and one or more continuous predictors. We begin with a version of this model called the general location model (Section 9.2) and discuss methods for keeping the number of parameters manageable (Section 9.3). Algorithms for incomplete mixed data are presented in Section 9.4, and Section 9.5 concludes with several data examples.

9.2 The general location model

9.2.1 Definition

As in Figure 9.1, let W1, W2,..., Wp denote a set of categorical variables and Z1, Z2,..., Zq a set of continuous ones. If these variables are recorded for a sample of n units, the result is an n × (p + q) data matrix Y = (W, Z), where W and Z represent the categorical and continuous parts, respectively.

The categorical data W may be summarized by a contingency table. Let us suppose that Wj takes possible values 1, 2,..., dj, so that each unit can be classified into a cell of a p-dimensional table with total number of cells equal to $D = \prod_{j=1}^{p} d_j$. A generic response pattern for the categorical

variables will be denoted by w = (w1, w2,..., wp), and the frequencies in the complete-data contingency table will be

$$x = \{x_w : w \in \mathcal{W}\}, \qquad (9.1)$$

where $x_w$ is the number of units for which (W1, W2,..., Wp) = w, and $\mathcal{W}$ is the set of all possible w. We may also arrange the cells of the contingency table in a linear order indexed by d = 1, 2,..., D, for example, the anti-lexicographical storage order in which w1 varies the fastest, w2 varies the next fastest, and so on (Appendix B). Then we can replace the vector subscript in $x_w$ by a single subscript d,

$$x = \{x_d : d = 1, 2, \ldots, D\}. \qquad (9.2)$$


Depending on the context, we will regard x either as a multidimensional array (9.1) or a vector (9.2).

Finally, it will be helpful to introduce one additional characterization of W. Let U be an n × D matrix with rows $u_i^T$, i = 1, 2,..., n, where $u_i$ is a D-vector containing a 1 in position d if unit i falls into cell d, and 0s in all other positions. Hence each row of U contains a single 1, and

$$U^T U = \operatorname{diag}(x) = \begin{pmatrix} x_1 & 0 & \cdots & 0 \\ 0 & x_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_D \end{pmatrix}. \qquad (9.3)$$

Because the sample units are assumed to be independent and identically distributed, all relevant statistical information in W is contained in x, U or $U^T U$. The continuous data are characterized simply by Z.

The general location model, so named by Olkin and Tate (1961), is most easily defined in terms of the marginal distribution of W and the conditional distribution of Z given W. The former is described by a multinomial distribution on the cell counts x,

$$x \mid \pi \sim M(n, \pi), \qquad (9.4)$$

where $\pi = \{\pi_w : w \in \mathcal{W}\} = \{\pi_d : d = 1, 2, \ldots, D\}$ is an array of cell probabilities corresponding to x. Given W, the rows $z_1^T, z_2^T, \ldots, z_n^T$ of Z are then modeled as conditionally multivariate normal. Let $E_d$ denote a D-vector containing a 1 in position d and 0s elsewhere. We assume

$$z_i \mid (u_i = E_d,\, \mu,\, \Sigma) \sim N(\mu_d, \Sigma) \qquad (9.5)$$

independently for i = 1, 2,..., n, where $\mu_d$ is a q-vector of means corresponding to cell d, and Σ is a q × q covariance matrix. The means of Z1, Z2,..., Zq are allowed to vary freely from cell to cell, but a common covariance structure Σ is assumed for all cells. When D = 2 this reduces to the model that underlies classical discriminant analysis (e.g. Anderson, 1984).

The parameters of the general location model will be written

$$\theta = (\pi, \mu, \Sigma),$$

where $\mu = (\mu_1, \mu_2, \ldots, \mu_D)^T$ is a D × q matrix of means. For the moment, we will impose no prior restrictions on θ other than the necessary positive definiteness for Σ and $\sum_{w \in \mathcal{W}} \pi_w = 1$. The number of free parameters in the unrestricted model is thus (D − 1) + Dq + q(q + 1)/2.
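To make the definition concrete, here is a minimal simulation sketch (not part of the original text): cell membership is drawn from the multinomial distribution (9.4) and the continuous variables from the conditional normal model (9.5). The values of π, µ and Σ below are arbitrary illustrations, not estimates from any dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_general_location(n, pi, mu, Sigma):
    """Simulate n units: cell d ~ Multinomial(pi), then z | d ~ N(mu[d], Sigma)."""
    D, q = mu.shape
    cells = rng.choice(D, size=n, p=pi)        # linear cell indices 0..D-1
    Z = mu[cells] + rng.multivariate_normal(np.zeros(q), Sigma, size=n)
    return cells, Z

# Illustrative parameters: D = 4 cells, q = 2 continuous variables
pi = np.array([0.4, 0.3, 0.2, 0.1])
mu = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, -1.0], [0.5, 0.5]])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
cells, Z = simulate_general_location(500, pi, mu, Sigma)
```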

Notice that the model for Z given W may also be regarded as a classical multivariate regression,

$$Z = U\mu + \epsilon, \qquad (9.6)$$

where ε is an n × q matrix of errors whose rows are independently distributed as N(0, Σ). The columns of U contain dummy variables for each of the cells d = 1, 2,..., D. Because U has the same rank as $U^T U = \operatorname{diag}(x)$, this will be a full-rank regression provided that there are no random zeroes in x. Structural zeroes may be handled simply by omitting them from the columns of U. A model of the form (9.6) is sometimes called a standard multivariate regression; in this model the same matrix of regressors U is used to predict each column of the response Z.

9.2.2 Complete-data likelihood

Combining (9.4) with (9.5), we can write the complete-data likelihood as the product of multinomial and normal likelihoods,

$$L(\theta \mid Y) \propto L(\pi \mid W)\, L(\mu, \Sigma \mid W, Z). \qquad (9.7)$$

The likelihood factors are

$$L(\pi \mid W) \propto \prod_{d=1}^{D} \pi_d^{x_d}$$

and

$$L(\mu, \Sigma \mid W, Z) \propto |\Sigma|^{-n/2} \exp\left\{ -\frac{1}{2} \sum_{d=1}^{D} \sum_{i \in B_d} (z_i - \mu_d)^T \Sigma^{-1} (z_i - \mu_d) \right\},$$

where $B_d = \{i : u_i = E_d\}$ is the set of all units belonging to cell d. After some algebraic manipulation, the second factor may be written as

$$L(\mu, \Sigma \mid W, Z) \propto |\Sigma|^{-n/2} \exp\left\{ -\tfrac{1}{2}\operatorname{tr}(\Sigma^{-1} Z^T Z) + \operatorname{tr}(\Sigma^{-1} \mu^T U^T Z) - \tfrac{1}{2}\operatorname{tr}(\Sigma^{-1} \mu^T U^T U \mu) \right\}, \qquad (9.8)$$


revealing that the complete-data loglikelihood is linear in the elements of the sufficient statistics

$$T_1 = Z^T Z, \quad T_2 = U^T Z, \quad \text{and} \quad T_3 = U^T U = \operatorname{diag}(x). \qquad (9.9)$$

Maximum-likelihood estimates

Because the parameters associated with the two factors in (9.7) are distinct, complete-data ML estimates may be found by maximizing each factor separately. The result for π is the usual ML estimate for an unrestricted multinomial model,

$$\hat\pi_d = \frac{x_d}{n}, \quad d = 1, 2, \ldots, D.$$

The estimate for µ follows from the least-squares regression of Z on U,

$$\hat\mu = (U^T U)^{-1} U^T Z = T_3^{-1} T_2, \qquad (9.10)$$

and the estimate for Σ is

$$\hat\Sigma = \frac{1}{n}\,\hat\epsilon^T \hat\epsilon = \frac{1}{n}\left( T_1 - T_2^T T_3^{-1} T_2 \right), \qquad (9.11)$$

where $\hat\epsilon = Z - U\hat\mu$ is the matrix of estimated residuals. Notice that (9.11) differs from the classical unbiased estimate in that it uses a denominator of n rather than n − D.
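In code, the complete-data ML estimates are one-liners once the sufficient statistics (9.9) are in hand. The sketch below is an illustration, not the author's software, and assumes every cell count is positive so that T3 is invertible.

```python
import numpy as np

def ml_estimates(U, Z):
    """Complete-data ML estimates from the sufficient statistics (9.9)-(9.11).
    U: n x D matrix of cell indicators; Z: n x q continuous data."""
    n = Z.shape[0]
    T1 = Z.T @ Z                     # T1 = Z'Z
    T2 = U.T @ Z                     # T2 = U'Z
    x = U.sum(axis=0)                # cell counts: diagonal of T3 = U'U
    pi_hat = x / n                   # multinomial ML estimate
    mu_hat = T2 / x[:, None]         # (9.10), assuming all x_d > 0
    Sigma_hat = (T1 - T2.T @ mu_hat) / n   # (9.11): T2' T3^{-1} T2 = T2' mu_hat
    return pi_hat, mu_hat, Sigma_hat
```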

These estimates can be further understood by noting that

$$(U^T U)^{-1} = \begin{pmatrix} x_1^{-1} & 0 & \cdots & 0 \\ 0 & x_2^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_D^{-1} \end{pmatrix}$$

and that $U^T Z$ is a D × q matrix with $\sum_{i \in B_d} z_i^T$ in the dth row.

The rows of $\hat\mu$ are thus

$$\hat\mu_d^T = x_d^{-1} \sum_{i \in B_d} z_i^T, \quad d = 1, 2, \ldots, D,$$

the within-cell averages of the rows of Z. The rows of the residual matrix $\hat\epsilon$ are the deviations of the rows of Z from their cell-specific means, so the estimated covariance matrix can be written as


$$\hat\Sigma = \frac{1}{n} \sum_{d=1}^{D} \sum_{i \in B_d} (z_i - \hat\mu_d)(z_i - \hat\mu_d)^T.$$

Random zeroes and sparse data

If any cell in x is randomly zero, the matrix of regressors U has deficient rank and the least-squares estimate (9.10) is no longer defined. When this happens, the mean vector $\mu_d$ corresponding to the empty cell drops out of the likelihood function and becomes inestimable; the likelihood takes the same value regardless of $\mu_d$, and the ML estimate is no longer unique.

Clearly, the unrestricted general location model will tend to be useful only when n is large relative to D, when enough observations are present in each cell to estimate all the components of µ. When the data are sparse, restricted versions of the model that contain fewer free parameters (to be discussed below) will be more appropriate.

Table 9.1. Classification of subjects by foreign language studied and sex

9.2.3 Example

In Section 6.3 we examined data pertaining to the validity of the Foreign Language Attitude Scale (FLAS), a test instrument for predicting achievement in foreign language study; the raw data are reproduced in Appendix A. We will now apply the unrestricted general location model to a portion of this dataset. As shown in Table 6.5, the variables LAN and FLAS had no missing values, and SEX and HGPA were missing for only one subject each. For the moment, let us discard those two subjects to obtain an apparently complete dataset with four variables and 277 observations. The variables FLAS and HGPA are continuous, whereas LAN and SEX are categorical with four and two levels, respectively. The frequencies for the LAN by SEX classification are shown in Table 9.1. Adopting a columnwise storage order, the cell counts are

$$U^T U = \operatorname{diag}(35,\, 45,\, 62,\, 9,\, 31,\, 32,\, 52,\, 11),$$

and dividing these counts by n = 277 yields the ML estimate

$$\hat\pi = (0.126,\, 0.162,\, 0.224,\, 0.032,\, 0.112,\, 0.116,\, 0.188,\, 0.040).$$

The sufficient statistics pertaining to HGPA and FLAS are

$$U^T Z = \begin{pmatrix} 94.45 & 2841 \\ 121.08 & 3397 \\ 170.78 & 4987 \\ 26.35 & 694 \\ 82.63 & 2759 \\ 83.12 & 2719 \\ 153.41 & 4517 \\ 29.86 & 907 \end{pmatrix}, \qquad Z^T Z = \begin{pmatrix} 2199.69 & 62894.18 \\ 62894.18 & 1934421 \end{pmatrix}.$$

Dividing the rows of $U^T Z$ by the cell counts yields the estimated matrix of means,

$$\hat\mu = \begin{pmatrix} 2.70 & 81.2 \\ 2.69 & 75.5 \\ 2.75 & 80.4 \\ 2.93 & 77.1 \\ 2.67 & 89.0 \\ 2.60 & 85.0 \\ 2.95 & 86.9 \\ 2.71 & 82.4 \end{pmatrix},$$

and the ML estimate of the covariance matrix is

$$\hat\Sigma = n^{-1}\left( Z^T Z - Z^T U (U^T U)^{-1} U^T Z \right) = \begin{pmatrix} 0.367 & 0.411 \\ 0.411 & 176.9 \end{pmatrix}.$$
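Because all of these quantities are simple functions of the sufficient statistics, the reported estimates can be reproduced (up to rounding in the printed summaries) with a few lines of code. This sketch is illustrative and uses only the numbers quoted in this example.

```python
import numpy as np

x = np.array([35, 45, 62, 9, 31, 32, 52, 11])      # cell counts from Table 9.1
UtZ = np.array([[94.45, 2841], [121.08, 3397], [170.78, 4987], [26.35, 694],
                [82.63, 2759], [83.12, 2719], [153.41, 4517], [29.86, 907]])
ZtZ = np.array([[2199.69, 62894.18], [62894.18, 1934421.0]])
n = x.sum()                                          # 277

pi_hat = x / n                                       # ML cell probabilities
mu_hat = UtZ / x[:, None]                            # within-cell means
Sigma_hat = (ZtZ - UtZ.T @ mu_hat) / n               # ML covariance matrix (9.11)
```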

9.2.4 Complete-data Bayesian inference

The factorization (9.7) which simplified the problem of ML estimation is also convenient from a Bayesian point of view: if we apply independent prior distributions to π and (µ, Σ), these parameter sets will be independent in the posterior distribution as well. For simplicity, we will apply a Dirichlet prior to the cell probabilities,

$$\pi \sim D(\alpha),$$

where $\alpha = \{\alpha_w : w \in \mathcal{W}\} = \{\alpha_d : d = 1, 2, \ldots, D\}$ is an array of user-specified hyperparameters; the complete-data posterior distribution of π is then

$$\pi \mid Y \sim D(\alpha'),$$

where $\alpha' = \alpha + x$. For discussion on choosing values for the hyperparameters, see Section 7.2.5.

Inferences for µ and Σ under a noninformative prior

With regard to µ and Σ, let us first consider what happens when we apply an improper uniform prior to the elements of µ and the standard noninformative prior to the covariance matrix Σ,

$$P(\mu, \Sigma) \propto |\Sigma|^{-\left(\frac{q+1}{2}\right)}. \qquad (9.12)$$

With a little algebra, the likelihood factor (9.8) for µ and Σ can be written in terms of the least-squares estimates,

$$L(\mu, \Sigma \mid W, Z) \propto |\Sigma|^{-n/2} \exp\left\{ -\tfrac{1}{2}\operatorname{tr}(\Sigma^{-1}\hat\epsilon^T\hat\epsilon) - \tfrac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1}(\mu - \hat\mu)^T U^T U (\mu - \hat\mu)\right) \right\}. \qquad (9.13)$$

The diagonal form of $U^T U$ then allows us to rewrite (9.13) as

$$L(\mu, \Sigma \mid W, Z) \propto |\Sigma|^{-n/2} \exp\left\{ -\tfrac{1}{2}\operatorname{tr}(\Sigma^{-1}\hat\epsilon^T\hat\epsilon) - \tfrac{1}{2} \sum_{d=1}^{D} x_d (\mu_d - \hat\mu_d)^T \Sigma^{-1} (\mu_d - \hat\mu_d) \right\},$$

which is equivalent to

$$L(\mu, \Sigma \mid W, Z) \propto |\Sigma|^{-(n-D)/2} \exp\left\{ -\tfrac{1}{2}\operatorname{tr}(\Sigma^{-1}\hat\epsilon^T\hat\epsilon) \right\} \times \prod_{d=1}^{D} |x_d^{-1}\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2}(\mu_d - \hat\mu_d)^T (x_d^{-1}\Sigma)^{-1} (\mu_d - \hat\mu_d) \right\}. \qquad (9.14)$$

Combining (9.14) with the prior (9.12) leads to

$$P(\mu, \Sigma \mid Y) \propto |\Sigma|^{-\left(\frac{n-D+q+1}{2}\right)} \exp\left\{ -\tfrac{1}{2}\operatorname{tr}(\Sigma^{-1}\hat\epsilon^T\hat\epsilon) \right\} \times \prod_{d=1}^{D} |x_d^{-1}\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2}(\mu_d - \hat\mu_d)^T (x_d^{-1}\Sigma)^{-1} (\mu_d - \hat\mu_d) \right\},$$


which, by inspection, is the product of independent multivariate normal densities for µ1, µ2,..., µD given Σ and an inverted-Wishart density for Σ,

$$\mu_d \mid \Sigma, Y \sim N(\hat\mu_d,\, x_d^{-1}\Sigma), \qquad (9.15)$$

$$\Sigma \mid Y \sim W^{-1}\!\left(n - D,\, (\hat\epsilon^T\hat\epsilon)^{-1}\right). \qquad (9.16)$$

For this posterior to be proper, we need $n \geq D + q$ and $x_d > 0$ for all d, structural zeroes excluded; also, the matrix $\hat\epsilon^T\hat\epsilon$ of residual sums of squares and cross-products should have full rank.

Informative priors

The preceding arguments can easily be extended to incorporate prior knowledge about µ and Σ. The most convenient way to specify prior information for µ would be in the form of independent multivariate normal distributions for µ1, µ2,..., µD with covariance matrices proportional to Σ; prior information for Σ could then be expressed through an inverted-Wishart distribution. The resulting complete-data posterior would again be the product of independent normal distributions for µ1, µ2,..., µD given Σ and an inverted-Wishart distribution for Σ, and the updated hyperparameters would be obtained by calculations similar to those given in Section 5.2.2.

For typical applications of the general location model, strong prior information about µ or Σ will not be available; in all our examples, we will use the noninformative prior (9.12). The use of an improper prior can lead to difficulties, especially in sparse-data situations. For many datasets, particularly if the number of cells D in the contingency table is large, we may find that portions of µ or Σ are poorly estimated or inestimable, and the posterior may be improper. When this happens, we will not attempt to stabilize the inference through informative priors for µ or Σ; rather, we will specify a more parsimonious regression model for Z given W, reducing the number of free parameters and enforcing simpler relationships between Z1, Z2,..., Zq and W1, W2,..., Wp.

9.3 Restricted models

9.3.1 Reducing the number of parameters

The unrestricted general location model tends to work well when the sample size n is appreciably larger than the total number of cells D. When this is not the case, the data may contain little or no information about certain aspects of π, µ or Σ, and it would be wise to reduce the number of free parameters. As shown by Krzanowski (1980, 1982) and Little and Schluchter (1985), the general location model is amenable to certain types of restrictions on the parameter space. Because we defined the complete-data distribution and likelihood as the product of two distinct factors, the marginal distribution of W and the conditional distribution of Z given W, we will impose restrictions on the parameter sets π and (µ, Σ) separately to keep them distinct.

Loglinear models for the cell probabilities

For the cell probabilities π, we may require them to satisfy a loglinear model

$$\log \pi = M\lambda, \qquad (9.17)$$

where M is a user-specified matrix. Because the contingency table is a cross-classification by W1, W2,..., Wp, M will typically reflect this structure, containing 'main effects' for W1, W2,..., Wp and 'interactions' among them. If the first column of M is constant, the first element of λ (the intercept) is not a free parameter but a normalizing constant that scales π to sum to one. The total number of free parameters in this loglinear model is rank(M) − 1. Our fitting procedures will operate directly on the elements of π; there will be no need to explicitly create M or estimate λ unless the loglinear coefficients are of intrinsic interest.

Linear models for the within-cell means

In the unrestricted general location model, the conditional distribution of Z given W is specified by the multivariate regression

$$Z = U\mu + \epsilon, \qquad (9.18)$$

where U is an n × D matrix of dummy indicators recording the cell location 1, 2,..., D of each sample unit. The means of Z1, Z2,..., Zq are allowed to vary freely among cells. As a result, (9.18) is equivalent to a multivariate analysis of variance (MANOVA) model for (Z1, Z2,..., Zq) with main effects for W1, W2,..., Wp and all interactions among them. In practice, many of these interactions may be poorly estimated, and it is advantageous to eliminate them from the model.

To simplify the model, we could directly replace U by another matrix with fewer columns. For notational purposes, however, it is helpful to retain the present definition of U because of its role in the complete-data sufficient statistics. Instead, let us restrict µ to be of the form

$$\mu = A\beta \qquad (9.19)$$

for some β, where A is a constant matrix of dimension D × r. Each of the q columns of µ, corresponding to the variables Z1, Z2,..., Zq, is thus required to lie in the linear subspace of $R^D$ spanned by the columns of A. The regression model becomes

$$Z = UA\beta + \epsilon,$$

with a reduced set of regression coefficients in β. By taking A = I (the identity matrix) we obtain the unrestricted model (9.18) as a special case.

If A has full rank, then each of the r × q elements of β represents a free parameter. More generally, the number of free parameters in β is q × rank(A). If the contingency table contains no random zeroes, then all of the regression coefficients will be estimable. If the table does contain zeroes, the coefficients may still all be estimable, because estimability now depends on the rank of UA rather than U itself. To keep matters simple, let us proceed under the assumption that there are no deficiencies in the rank of A or UA,

$$\operatorname{rank}(A) = \operatorname{rank}(UA) = r.$$

In practice we can ensure that this is satisfied by defining A to have full rank, and then checking the rank of UA by seeing whether $A^T U^T U A$ is invertible.

Choosing the design matrix

The design matrix A defines the regression that relates the cells of the contingency table to the means of the continuous variables. This matrix is created in the same way that one creates a design matrix for a factorial ANOVA. Thinking of the categorical variables W1, W2,..., Wp as 'factors' of the experiment, we first list all the possible combinations of levels of these factors, using the linear storage order that we adopted for our contingency table; these identify the rows of A. Then we create columns for the main effects of W1, W2,..., Wp, and perhaps interactions among them, using any coding scheme that is convenient. In most applications, the first column of A will contain 1s for an intercept and the remaining columns will contain dummy codes or contrasts for the desired effects of W1, W2,..., Wp and their interactions.

For example, consider a model with p = 2 categorical variables, W1 and W2, taking d1 = 2 and d2 = 3 levels, respectively, so that the contingency table has D = 6 cells. Let us adopt the anti-lexicographical storage order

(W1, W2) = (1, 1), (2, 1), (1, 2), (2, 2), (1, 3), (2, 3).

One possible design matrix is

$$A = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 1 & -1 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & -1 & 0 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & -1 \end{pmatrix},$$

whose columns correspond to the intercept, a main-effect contrast for W1 and two main-effect contrasts for W2. We may also add contrasts for the W1W2 interaction by including the products of the second column with the third and fourth. If the interaction were included, the resulting model would have the same number of parameters and give the same fit as the unrestricted version (9.18).
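Such a design matrix is easy to produce programmatically. The following sketch builds the main-effects matrix for this 2 × 3 example and, optionally, appends the interaction contrasts; it assumes the contrast coding shown above, which is one reasonable reconstruction.

```python
import numpy as np

# Anti-lexicographical order of cells: (W1, W2) = (1,1),(2,1),(1,2),(2,2),(1,3),(2,3)
w1 = np.array([1, -1, 1, -1, 1, -1])        # main-effect contrast for W1 (2 levels)
w2a = np.array([1, 1, 0, 0, -1, -1])        # first main-effect contrast for W2
w2b = np.array([0, 0, 1, 1, -1, -1])        # second main-effect contrast for W2

A = np.column_stack([np.ones(6), w1, w2a, w2b])           # main effects only
A_full = np.column_stack([A, w1 * w2a, w1 * w2b])         # with W1 x W2 interaction
```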

9.3.2 Likelihood inference for restricted models

The two sets of restrictions that we have imposed, the loglinear restrictions on π and the linear restrictions on µ, do not interfere with each other; the joint parameter space for $\theta = (\pi, \mu, \Sigma)$ is still the product of the individual spaces for π and (µ, Σ). Therefore, the problem of maximizing the joint likelihood for θ still separates into two unrelated maximizations. The ML estimate for π may be found by conventional IPF (Section 8.3). For µ and Σ, the estimates come from the least-squares fit of the reduced regression model $Z = UA\beta + \epsilon$, which gives

$$\hat\beta = (A^T U^T U A)^{-1} A^T U^T Z = (A^T T_3 A)^{-1} A^T T_2, \qquad (9.20)$$

$$\hat\Sigma = n^{-1}(Z - UA\hat\beta)^T(Z - UA\hat\beta) = n^{-1}\left( T_1 - T_2^T A (A^T T_3 A)^{-1} A^T T_2 \right). \qquad (9.21)$$

The corresponding ML estimate of µ is $\hat\mu = A\hat\beta$. For the covariance matrix, most statisticians would tend to use the unbiased estimate $n(n - r)^{-1}\hat\Sigma$ rather than $\hat\Sigma$. Notice that $A^T T_3 A$ is not diagonal, so in general the estimation of µ and Σ now requires the inversion of an r × r matrix.
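A sketch of the restricted-model estimates (9.20)-(9.21) follows; it is illustrative only, assumes $A^T T_3 A$ is invertible, and uses a linear solve rather than an explicit inverse.

```python
import numpy as np

def restricted_ml(T1, T2, x, A, n):
    """ML estimates under mu = A beta, from sufficient statistics and design A.
    x holds the cell counts, i.e. the diagonal of T3."""
    AtT3A = A.T @ (x[:, None] * A)                   # A' T3 A  (T3 is diagonal)
    AtT2 = A.T @ T2
    beta_hat = np.linalg.solve(AtT3A, AtT2)          # (9.20)
    Sigma_hat = (T1 - AtT2.T @ beta_hat) / n         # (9.21)
    mu_hat = A @ beta_hat
    return beta_hat, Sigma_hat, mu_hat
```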

Example: Foreign Language Attitude Scale

Returning to the example of Section 9.2.3, let us fit a reduced model to this four-variable dataset in which (a) SEX and LAN are marginally independent, and (b) the linear model for HGPA and FLAS has only main effects for SEX and LAN. Let $x_{ij}$ denote a count in the LAN × SEX contingency table (Table 9.1) and $\pi_{ij}$ the corresponding cell probability. The ML estimates of the cell probabilities for the independence model are available in closed form as $\hat\pi_{ij} = x_{i+}\, x_{+j} / n^2$, which gives

$$\hat\pi = (0.130,\, 0.152,\, 0.224,\, 0.039,\, 0.108,\, 0.126,\, 0.187,\, 0.033).$$

Using the dummy-coded design matrix

$$A = \begin{pmatrix} 1 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix},$$

the least-squares regression of Z on UA yields

$$\hat\beta = \begin{pmatrix} 2.825 & 83.435 \\ -0.125 & 5.403 \\ -0.154 & 0.390 \\ 0.036 & 4.024 \\ -0.032 & -7.522 \end{pmatrix}, \qquad \hat\Sigma = \begin{pmatrix} 0.372 & 0.385 \\ 0.385 & 177.3 \end{pmatrix}.$$

The corresponding ML estimate of the cell-means matrix is

$$\hat\mu = A\hat\beta = \begin{pmatrix} 2.67 & 81.3 \\ 2.64 & 76.3 \\ 2.83 & 79.9 \\ 2.79 & 75.9 \\ 2.70 & 88.8 \\ 2.67 & 83.8 \\ 2.86 & 87.5 \\ 2.82 & 83.4 \end{pmatrix}.$$

We can check the plausibility of this restricted model against the unrestricted alternative by means of a likelihood-ratio test. Plugging $\hat\pi$, $\hat\mu$ and $\hat\Sigma$ into the formula for the complete-data loglikelihood,

$$l(\pi, \mu, \Sigma \mid Y) = \sum_{d=1}^{D} x_d \log \pi_d - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}(\Sigma^{-1} T_1) + \operatorname{tr}(\Sigma^{-1}\mu^T T_2) - \frac{1}{2}\operatorname{tr}(\Sigma^{-1}\mu^T T_3\, \mu), \qquad (9.22)$$

yields a value of −1394.83. The parameter estimates from the unrestricted model (Section 9.2.3) give a slightly higher loglikelihood of −1391.86. The two models are separated by (4 − 1) × (2 − 1) = 3 parameters for the marginal association between SEX and LAN, plus 3 × 2 = 6 coefficients for the LAN × SEX interaction in the linear model for HGPA and FLAS. The deviance statistic is 2 × (−1391.86 + 1394.83) = 5.94, and the corresponding p-value is $P(\chi^2_9 \geq 5.94) = 0.75$. The reduced model thus appears to fit the data quite adequately. Because the complete-data likelihood factors into distinct pieces for π and (µ, Σ), we can also separate this goodness-of-fit test into two tests, one for the marginal model for LAN and SEX (3 degrees of freedom), another for the conditional model for HGPA and FLAS (6 degrees of freedom), and the two deviance statistics will add up to the overall deviance.
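For reference, here is a sketch of the complete-data loglikelihood (9.22); the deviance used above is then twice the difference of two such evaluations. The function layout is illustrative, not the author's software.

```python
import numpy as np

def loglik(pi, mu, Sigma, x, T1, T2, n):
    """Complete-data loglikelihood (9.22) of the general location model."""
    Sinv = np.linalg.inv(Sigma)
    return (x @ np.log(pi)
            - 0.5 * n * np.linalg.slogdet(Sigma)[1]
            - 0.5 * np.trace(Sinv @ T1)
            + np.trace(Sinv @ mu.T @ T2)
            - 0.5 * np.trace(Sinv @ mu.T @ (x[:, None] * mu)))  # T3 mu = x * mu

# deviance = 2 * (loglik at unrestricted fit - loglik at restricted fit) = 5.94 here
```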

9.3.3 Bayesian inference

Bayesian inference for the restricted model proceeds most easily if we apply independent prior distributions to the parameter sets π and (µ, Σ), so that they remain independent in the complete-data posterior distribution. In keeping with the methods developed in the last chapter, let us apply a constrained Dirichlet prior to the elements of π with prior density

$$P(\pi) \propto \prod_{d=1}^{D} \pi_d^{\alpha_d - 1}$$

for values of π that satisfy the loglinear constraints, and P(π) = 0 elsewhere. The complete-data posterior density will then be constrained Dirichlet with updated hyperparameters $\alpha_d' = \alpha_d + x_d$. Posterior modes can be calculated using conventional IPF (Section 8.3), and simulated posterior draws of π can be obtained with Bayesian IPF (Section 8.4).

Bayesian inference for β and Σ under a noninformative prior

Bayesian inference for the standard multivariate regression model is covered in many texts on multivariate analysis; a good source is Press (1982). The likelihood function for Σ and the free coefficients β is

$$L(\beta, \Sigma \mid Y) \propto |\Sigma|^{-n/2} \exp\left\{ -\tfrac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1}(Z - UA\beta)^T(Z - UA\beta)\right) \right\}.$$

Following some algebraic manipulation, this likelihood function can be rewritten in terms of the least-squares estimates as

$$|\Sigma|^{-n/2} \exp\left\{ -\tfrac{1}{2}\operatorname{tr}(\Sigma^{-1}\hat\epsilon^T\hat\epsilon) \right\} \exp\left\{ -\tfrac{1}{2}(\beta - \hat\beta)^T \left[\Sigma \otimes V\right]^{-1} (\beta - \hat\beta) \right\}, \qquad (9.23)$$

where $\hat\beta$ is the matrix of estimated coefficients, $\hat\epsilon = Z - UA\hat\beta$ is the matrix of estimated residuals and $V = (A^T U^T U A)^{-1}$. The symbol ⊗ denotes the Kronecker product,

$$\Sigma \otimes V = \begin{pmatrix} \sigma_{11}V & \sigma_{12}V & \cdots & \sigma_{1q}V \\ \sigma_{12}V & \sigma_{22}V & \cdots & \sigma_{2q}V \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1q}V & \sigma_{2q}V & \cdots & \sigma_{qq}V \end{pmatrix}.$$

In (9.23), the columns of β and $\hat\beta$ have been implicitly stacked to form vectors of length rq, so that $(\beta - \hat\beta)^T [\Sigma \otimes V]^{-1} (\beta - \hat\beta)$ is meaningful. For some elementary properties of Kronecker products, see Mardia, Kent and Bibby (1979).

Let us first consider what happens when we apply an improper uniform prior to β and the standard Jeffreys prior to Σ,

$$P(\beta, \Sigma) \propto |\Sigma|^{-\left(\frac{q+1}{2}\right)}. \qquad (9.24)$$

When A = I we have β = µ, and this reduces to the noninformative prior (9.12) that we used in the unrestricted model. Combining (9.24) with the likelihood function (9.23), and using the fact that

$$|\Sigma \otimes V| = |\Sigma|^r |V|^q,$$

we obtain the posterior density

$$P(\beta, \Sigma \mid Y) \propto |\Sigma|^{-\left(\frac{n-r+q+1}{2}\right)} \exp\left\{ -\tfrac{1}{2}\operatorname{tr}(\Sigma^{-1}\hat\epsilon^T\hat\epsilon) \right\} \times |\Sigma \otimes V|^{-1/2} \exp\left\{ -\tfrac{1}{2}(\beta - \hat\beta)^T \left[\Sigma \otimes V\right]^{-1} (\beta - \hat\beta) \right\}. \qquad (9.25)$$


By inspection, this is the product of a multivariate normal density for β given Σ and an inverted-Wishart density for Σ,

$$\beta \mid \Sigma, Y \sim N(\hat\beta,\, \Sigma \otimes V), \qquad (9.26)$$

$$\Sigma \mid Y \sim W^{-1}\!\left(n - r,\, (\hat\epsilon^T\hat\epsilon)^{-1}\right). \qquad (9.27)$$

Given Σ, the posterior distribution of each column of β is multivariate normal, centered at the corresponding column of $\hat\beta$ and with covariance matrix proportional to V. Marginally, the columns of β have multivariate t-distributions with n − r degrees of freedom. Notice that for (9.25) to be a proper posterior density, we need $n \geq q + r$, and $\hat\epsilon^T\hat\epsilon$ must have full rank.

Informative priors for β and Σ

One may extend the above arguments to incorporate more substantial prior information about β and Σ. To obtain a convenient posterior distribution for (β, Σ) within the normal inverted-Wishart family, however, the prior distribution must have a particular form: Σ must be inverted-Wishart, and β given Σ must be multivariate normal with a patterned covariance matrix similar to that of (9.26). The limitations of this family of priors are discussed by Press (1982). In most practical applications of the general location model, it will be difficult to quantify prior knowledge about β and Σ; all our examples will use the noninformative prior (9.24). If the posterior distribution under this prior is not proper, then we may interpret it as a sign that the model is too complex to be supported by the data, and the model should be simplified by choosing a design matrix A with fewer columns.


9.4 Algorithms for incomplete mixed data

Thus far we have reviewed the basic methods of likelihood and Bayesian inference for the parameters of the unrestricted (Section 9.2) and restricted (Section 9.3) general location models. Now we extend these methods to handle mixed datasets with arbitrary patterns of missing values. These algorithms are built from portions of the code for normal and categorical data given in Chapters 5-8. The reader who is less interested in computational details than in applications may wish to lightly skim this section to see what algorithms are available, and then proceed directly to the data examples in Section 9.5.

9.4.1 Predictive distributions

A row of the data matrix may have missing values for any or all of the variables W1,..., Wp, Z1,..., Zq. Before we can derive estimation and simulation algorithms for the general location model, we must be able to characterize the joint distribution of any subset of these variables given the rest, so that we can obtain the predictive distribution of the missing data in any row of the data matrix given the observed data.

Categorical variables completely missing

Let us first consider the conditional distribution of the categorical variables given the continuous ones, which is needed when Z1,..., Zq are observed but W1,..., Wp are missing. We can represent the complete data for row i by $(u_i, z_i)$, where $z_i^T$ is the realized value of (Z1,..., Zq), and $u_i$ is a vector of length D containing a single 1 in the cell position corresponding to the realized values of W1,..., Wp and 0s elsewhere. Let $E_d$ be the D-vector with 1 in position d and 0s elsewhere. By definition, the joint density of $(u_i, z_i)$ under the general location model is

$$P(u_i = E_d,\, z_i \mid \theta) \propto \pi_d\, |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2}(z_i - \mu_d)^T \Sigma^{-1} (z_i - \mu_d) \right\}.$$


The conditional distribution of $u_i$ given $z_i$ is thus

$$P(u_i = E_d \mid z_i, \theta) = \frac{\pi_d \exp\left\{ -\tfrac{1}{2}(z_i - \mu_d)^T \Sigma^{-1} (z_i - \mu_d) \right\}}{\sum_{d'=1}^{D} \pi_{d'} \exp\left\{ -\tfrac{1}{2}(z_i - \mu_{d'})^T \Sigma^{-1} (z_i - \mu_{d'}) \right\}}.$$

The portions of the numerator and denominator involving the quadratic term $z_i^T \Sigma^{-1} z_i$ cancel out, leading to a well-known result from classical multivariate analysis: the conditional probability that unit i belongs to cell d is

$$P(u_i = E_d \mid z_i, \theta) \propto \exp(\delta_{d,i}),$$

where $\delta_{d,i}$ denotes the value of the linear discriminant function of $z_i$ with respect to $\mu_d$,

$$\delta_{d,i} = \mu_d^T \Sigma^{-1} z_i - \tfrac{1}{2}\mu_d^T \Sigma^{-1} \mu_d + \log \pi_d. \qquad (9.28)$$

When Z1,..., Zq are observed but W1,..., Wp are missing, the predictive distribution of W1,..., Wp is obtained by calculating the terms $\exp(\delta_{d,i})$ for cells d = 1, 2,..., D and normalizing them to sum to one.
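A small sketch of this classification step, assuming a complete $z_i$; the discriminants (9.28) are computed for every cell and normalized on the log scale for numerical stability. This is an illustration, not the author's implementation.

```python
import numpy as np
from scipy.special import softmax

def cell_probs(z, pi, mu, Sigma):
    """P(u_i = E_d | z_i, theta) via the linear discriminant (9.28)."""
    Sinv = np.linalg.inv(Sigma)
    # delta_d = mu_d' Sinv z - 0.5 mu_d' Sinv mu_d + log pi_d
    delta = (mu @ Sinv @ z
             - 0.5 * np.einsum('dj,jk,dk->d', mu, Sinv, mu)
             + np.log(pi))
    return softmax(delta)            # normalize exp(delta) to sum to one
```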

Continuous variables partially missing

Now consider what happens if W1,..., Wp and an arbitrary subset of Z1,..., Zq are missing. Denote the observed components of $z_i$ by $z_{i(obs)}$ and the missing components by $z_{i(mis)}$. The conditional distribution of $u_i$ given $z_{i(obs)}$ and θ is

obtained by integrating both the numerator and denominator of

$$P(u_i = E_d \mid z_i, \theta) = \frac{P(u_i = E_d,\, z_i \mid \theta)}{P(z_i \mid \theta)}$$

over all possible values of $z_{i(mis)}$. The result is

$$P(u_i = E_d \mid z_{i(obs)}, \theta) \propto \exp(\delta^*_{d,i}), \qquad (9.29)$$

where $\delta^*_{d,i}$ is a linear discriminant based on the reduced information in $z_{i(obs)}$ rather than $z_i$. This new discriminant is

$$\delta^*_{d,i} = \mu^{*T}_{d,i}\, \Sigma^{*-1}_i z_{i(obs)} - \tfrac{1}{2}\mu^{*T}_{d,i}\, \Sigma^{*-1}_i \mu^*_{d,i} + \log \pi_d, \qquad (9.30)$$

where $\mu^*_{d,i}$ and $\Sigma^*_i$ denote the subvector and square submatrix of $\mu_d$ and Σ, respectively, corresponding to the observed elements of $z_i$. (When all continuous variables are missing, define $\delta^*_{d,i} = \log \pi_d$, so that (9.29) reduces to $\pi_d$.) Moreover,

because $z_i \mid (u_i = E_d, \theta) \sim N(\mu_d, \Sigma)$, the conditional distribution of the missing elements of $z_i$ given $u_i = E_d$ and the observed elements of $z_i$ is also multivariate normal; the parameters of this distribution can be obtained by applying the sweep operator to $\mu_d$ and Σ (Section 5.2). This conditional normal distribution, along with the probabilities (9.29), characterize the joint predictive distribution of W1,..., Wp and the missing elements of Z1,..., Zq.

Continuous and categorical variables partially missing

Finally, let us now consider the general case in which arbitrary subsets of W1,..., Wp and Z1,..., Zq are missing. This differs from the case we have just examined in that the predictive distribution must now take into account any additional information in the observed members of W1,..., Wp. When some of these categorical variables are observed, the unit is known to lie within a particular subset of the cells of the contingency table; the cell probabilities are still of the form (9.29), but must be normalized to sum to one over this reduced set.

More specifically, let $w_{i(obs)}$ and $w_{i(mis)}$ denote the observed and missing parts, respectively, of the categorical data for unit i. Rather than indexing the cells of the contingency table by their linear positions d = 1, 2,..., D, let us now identify them by their corresponding response patterns w = (w1, w2,..., wp), wj = 1, 2,..., dj. Let $O_i(w)$ and $M_i(w)$ denote the subvectors of w corresponding to the observed and missing parts, respectively, of the categorical data for unit i. The predictive probability of falling into cell w given the observed data is now

$$P(u_i = E_w \mid w_{i(obs)}, z_{i(obs)}, \theta) = \frac{\exp(\delta^*_{w,i})}{\sum_{M_i(w)} \exp(\delta^*_{w,i})} \qquad (9.31)$$

over the cells w for which $O_i(w)$ agrees with $w_{i(obs)}$, and zero for all other cells. Once again, the conditional predictive distribution of $z_{i(mis)}$ given $u_i = E_w$ is a multivariate normal whose parameters can be obtained by sweeping $\mu_w$ and Σ on the positions corresponding to $z_{i(obs)}$.

Predictive distributions and sweep

As shown by Little and Schluchter (1985), the discriminants $\delta^*_{w,i}$ and the parameters of the conditional normal distribution of $z_{i(mis)}$ can be neatly obtained by a single application of the sweep operator. Suppose we arrange the parameters of the general location model into a matrix,

$$\theta = \begin{pmatrix} \Sigma & \mu^T \\ \mu & P \end{pmatrix}, \qquad (9.32)$$

where P is a D × D matrix with elements

$$p_{ww} = 2 \log \pi_w$$

on the diagonal and zeroes elsewhere. If we sweep this θ-matrix on the positions in Σ corresponding to $z_{i(obs)}$, we obtain a transformed version of the parameter,

$$\theta^* = \begin{pmatrix} \Sigma^* & \mu^{*T} \\ \mu^* & P^* \end{pmatrix}. \qquad (9.33)$$

The diagonal element of P* corresponding to cell w is

$$p^*_{ww} = -\mu^{*T}_{w,i}\, \Sigma^{*-1}_i \mu^*_{w,i} + 2 \log \pi_w,$$

which is twice the sum of the final two terms in the linear discriminant function (9.30). The coefficients of $z_{i(obs)}$ in this discriminant, $\mu^{*T}_{w,i}\, \Sigma^{*-1}_i$, are found in row w of µ*, in the columns corresponding to the variables in $z_{i(obs)}$. The remaining elements of µ* and Σ* contain the parameters of the multivariate regression of $z_{i(mis)}$ on $z_{i(obs)}$ for all cells w. The intercepts, which vary from cell to cell, are found in µ*; the slopes and residual covariances, which are assumed to be equal for all cells, are found in Σ*.

Although we have depicted θ as a (q + D) × (q + D) matrix, in practice we do not actually need $(q + D)^2$ memory locations to store it. The off-diagonal elements of P* are not really of interest, nor are they needed to reverse-sweep θ* back to its original form. Thus we can minimize computation and memory requirements by retaining only µ, the diagonal elements of P and the upper-triangular portion of Σ in packed storage.

9.4.2 EM for the unrestricted model

We are now ready to describe an EM algorithm for obtaining ML estimates for the unrestricted general location model (Little and Schluchter, 1985). In Section 9.2.2, we saw that the complete-data loglikelihood is a linear function of the sufficient statistics

$$T_1 = Z^T Z, \quad T_2 = U^T Z, \quad \text{and} \quad T_3 = U^T U = \operatorname{diag}(x).$$

The ML estimates for the unrestricted model were shown to be

$$\hat\pi = n^{-1} x, \qquad (9.34)$$

$$\hat\mu = T_3^{-1} T_2, \qquad (9.35)$$

$$\hat\Sigma = n^{-1}\left( T_1 - T_2^T T_3^{-1} T_2 \right). \qquad (9.36)$$

The M-step is a simple matter of calculating (9.34)-(9.36) using the expected versions of T1, T2 and T3, rather than the sufficient statistics themselves. The complicated part is the E-step, where we must find the conditional expectations of T1, T2 and T3 given the observed parts of the data matrix and an assumed value of θ.

The E-step

First, consider the expectation of the diagonal elements of T3. Notice that the complete-data contingency table can be written as $x = \sum_{i=1}^{n} u_i$. The elements of $u_i$ are Bernoulli indicators of $u_i = E_w$ for all cells w, so their expectations are just the predictive probabilities given by (9.31). Thus, the expectation of $u_i$ can be found by the following steps. (a) Sweep the θ-matrix on positions corresponding to $z_{i(obs)}$ to obtain θ*. (b) From $z_{i(obs)}$ and θ*, calculate the discriminants for all cells w for which $O_i(w)$ agrees with $w_{i(obs)}$. The discriminant for cell w is

$$\delta^*_{w,i} = \tfrac{1}{2} p^*_{ww} + \sum_{j \in O_i} \mu^*_{w,j}\, z_{ij},$$


where $\mu^*_{w,j}$ is the (w, j)th element of µ*, and $O_i$ is the subset of {1, 2,..., q} corresponding to the variables in $z_{i(obs)}$. (We have already been using $O_i$ and $M_i$ as operators that extract the observed and missing components of w = (w1,..., wp), and for convenience we will continue to do so; the dual usage should not create any confusion.) (c) Normalize the terms $\exp(\delta^*_{w,i})$ for these cells to obtain the predictive probabilities

$$\pi^*_{w,i} = \frac{\exp(\delta^*_{w,i})}{\sum_{M_i(w)} \exp(\delta^*_{w,i})}. \qquad (9.37)$$

These predictive probabilities also play an important role in the expectation of T2. Row w of T2 is $\sum_{i=1}^{n} u_{w,i}\, z_i^T$, where $u_{w,i} = 1$ if unit i falls into cell w and $u_{w,i} = 0$ otherwise. If the observed data in $w_{i(obs)}$ indicate that unit i cannot possibly belong to cell w, then

$$E(u_{w,i}\, z_i \mid Y_{obs}, \theta) = 0.$$

On the other hand, if $w_{i(obs)}$ agrees with $O_i(w)$, then

$$E(u_{w,i}\, z_i \mid Y_{obs}, \theta) = \pi^*_{w,i}\, z^*_{w,i}, \qquad (9.38)$$

where $z^*_{w,i}$ is the predicted mean of $z_i$ given the observed values in $z_{i(obs)}$, and given that unit i falls into cell w. The parts of $z^*_{w,i}$ corresponding to $z_{i(obs)}$ are identical to $z_{i(obs)}$, whereas the parts corresponding to $z_{i(mis)}$ are the predicted values from the multivariate regression of $z_{i(mis)}$ on $z_{i(obs)}$ within cell w,

$$z^*_{w,ij} = \begin{cases} z_{ij} & \text{if } j \in O_i, \\ \mu^*_{w,j} + \sum_{k \in O_i} \sigma^*_{jk}\, z_{ik} & \text{if } j \in M_i, \end{cases}$$

where $\sigma^*_{jk}$ is the (j, k)th element of Σ*.

Finally, consider the expectation of the sums of squares and cross-products matrix,

$$T_1 = Z^T Z = \sum_{i=1}^{n} z_i z_i^T.$$

The (j, k)th element of this matrix is $\sum_{i=1}^{n} z_{ij} z_{ik}$. But notice that a single element of this sum can be written as

$$z_{ij} z_{ik} = \sum_{w} u_{w,i}\, z_{ij} z_{ik},$$

so the expectation of this element is

$$E(z_{ij} z_{ik} \mid Y_{obs}, \theta) = \sum_{M_i(w)} \pi^*_{w,i}\, E(z_{ij} z_{ik} \mid Y_{obs}, \theta, u_{w,i} = 1), \qquad (9.39)$$

where the sum is taken over all cells w for which $O_i(w)$ agrees with $w_{i(obs)}$. The form of $E(z_{ij} z_{ik} \mid Y_{obs}, \theta, u_{w,i} = 1)$ depends on whether $z_{ij}$ and $z_{ik}$ are observed. If both are observed, this expectation is simply $z_{ij} z_{ik}$. If $z_{ij}$ is observed but $z_{ik}$ is missing, the expectation is $z_{ij}\, z^*_{w,ik}$. Finally, if both are missing, the expectation becomes $z^*_{w,ij}\, z^*_{w,ik} + \sigma^*_{jk}$.

Organizing the computations

To carry out the E-step, we must cycle through the units i = 1, 2,..., n in the dataset, sweeping θ on the positions corresponding to $z_{i(obs)}$ and summing the contributions (9.37), (9.38) and (9.39) of unit i to the expectations of the sufficient statistics. The number of forward- and reverse-sweeps can be reduced by grouping together rows of the data matrix having the same pattern of missingness for Z1,..., Zq, because the same version of θ* can then be used for all units in the pattern. The expected sufficient statistics can be accumulated into a workspace of the same size and shape as θ,

$$T = \begin{pmatrix} T_1 & T_2^T \\ T_2 & T_3 \end{pmatrix}.$$

Once the E-step is complete, the M-step proceeds by applying (9.34)-(9.36) to T, which gives the updated estimate of θ.
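The sketch below illustrates the E-step contribution of a single unit for the special case in which all of W1,..., Wp are missing and $z_i$ is observed only at positions `obs`. It conditions on $z_{i(obs)}$ by direct submatrix inversion rather than by sweep, which is algebraically equivalent though less efficient; the function name and layout are illustrative assumptions, not the author's implementation.

```python
import numpy as np
from scipy.special import logsumexp

def e_step_unit(z, obs, pi, mu, Sigma):
    """Expected contributions of one unit to T1, T2, T3 when all categorical
    variables are missing and z is observed only at positions obs."""
    D, q = mu.shape
    mis = np.setdiff1d(np.arange(q), obs)
    Soo = Sigma[np.ix_(obs, obs)]
    Smo = Sigma[np.ix_(mis, obs)]
    Soo_inv = np.linalg.inv(Soo)
    coef = Soo_inv @ Smo.T                       # regression slopes, obs -> mis
    res_cov = Sigma[np.ix_(mis, mis)] - Smo @ coef

    # discriminants based on the observed components, as in (9.30)
    mo = mu[:, obs]
    delta = (mo @ Soo_inv @ z[obs]
             - 0.5 * np.einsum('dj,jk,dk->d', mo, Soo_inv, mo)
             + np.log(pi))
    p = np.exp(delta - logsumexp(delta))         # predictive probabilities (9.37)

    # per-cell predicted completions z*: observed parts kept, missing parts regressed
    zstar = np.tile(z, (D, 1))
    zstar[:, mis] = mu[:, mis] + (z[obs] - mo) @ coef
    ET3 = np.diag(p)                             # expected cell indicators
    ET2 = p[:, None] * zstar                     # rows pi*_{w,i} z*_{w,i}, eq. (9.38)
    ET1 = np.einsum('d,dj,dk->jk', p, zstar, zstar)   # eq. (9.39), mean part
    ET1[np.ix_(mis, mis)] += res_cov             # residual covariance for missing pairs
    return ET1, ET2, ET3
```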

Evaluating the observed-data loglikelihood

One can show that the contribution of observation i to the observed-data loglikelihood is

$$-\tfrac{1}{2} \log|\Sigma^*_i| + \log \sum_{w} \exp(\delta^*_{w,i}) - \tfrac{1}{2}\, z_{i(obs)}^T \Sigma^{*-1}_i z_{i(obs)},$$

where the sum is taken over all cells w for which $O_i(w)$ agrees with $w_{i(obs)}$. The procedure for evaluating the observed-data loglikelihood at any particular value of θ is very similar to the E-step. In addition to the linear discriminant $\delta^*_{w,i}$, we need to evaluate the quadratic term

$$z_{i(obs)}^T \Sigma^{*-1}_i z_{i(obs)}$$

and the determinant of $\Sigma^{*-1}_i$. The latter can be obtained along with θ* as an immediate by-product of sweep (Section 5.2.4). To calculate the former, note that $-\Sigma^{*-1}_i$ is contained in the rows and columns of Σ* corresponding to the variables in $z_{i(obs)}$.

9.4.3 Data augmentation

With fairly minor modifications, the EM algorithm described above can be converted to data augmentation, enabling us to simulate posterior draws of θ or multiple imputations of $Y_{mis}$. For the I-step, we must create a random draw of (T1, T2, T3) from its predictive distribution given the observed data and an assumed value for θ. Just as in the E-step, we cycle through the units i = 1, 2,..., n, sweeping θ to obtain the parameters of the predictive distribution of the missing variables given the observed variables; we then draw the missing data for unit i from their predictive distribution, and accumulate the resulting complete-data sufficient statistics into T1, T2 and T3. Once the I-step is complete, the P-step proceeds by drawing a new value of θ from its posterior given T1, T2 and T3. Details of these steps are given below.

The I-step

It is convenient to draw the missing data for unit i in two stages: first by drawing $u_i$, which indicates the cell to which unit i belongs, and then by drawing $z_{i(mis)}$ given $u_i$. The predictive distribution of $u_i$ is that of a single multinomial trial over the cells w for which $O_i(w)$ agrees with $w_{i(obs)}$; the cell probabilities are given by (9.37). A simple way to simulate this multinomial trial is by table sampling: cycle through the cells, summing up their probabilities, and assign the unit to the first cell for which the cumulative probability exceeds the value of a U(0, 1) random variate. Pseudocode for a similar table-sampling algorithm appears in Figure 7.4. When a unit is assigned to cell w, its contribution to T3 is reflected by adding 1 to the wth diagonal element.

After assigning unit i to cell w, we may then draw the missing continuous variables in $z_{i(mis)}$ according to their multivariate regression on $z_{i(obs)}$. The regression prediction for an element of $z_{i(mis)}$ is

$$z^*_{w,ij} = \mu^*_{w,j} + \sum_{k \in O_i} \sigma^*_{jk}\, z_{ik}.$$

To these predictions, we must add simulated residuals drawn from a multivariate normal distribution. The residual covariances are found in Σ*, in the rows and columns corresponding to $z_{i(mis)}$. To draw the residuals, we will need to extract the appropriate submatrix from Σ* and calculate its Cholesky factor (Section 5.4.1). Adding the simulated residuals to the $z^*_{w,ij}$ produces a simulated draw of $z_{i(mis)}$. The contribution of the completed version of $z_i$ to the sufficient statistics is then reflected by adding $z_i$ into the wth row of T2, and adding $z_i z_i^T$ into the matrix T1.

The P-step

In Section 9.2.4, we showed that under the improper prior distribution

$$P(\pi, \mu, \Sigma) \propto \left( \prod_{w} \pi_w^{\alpha_w - 1} \right) |\Sigma|^{-\left(\frac{q+1}{2}\right)},$$

the complete-data posterior is

$$\pi \mid Y \sim D(\alpha + x), \qquad (9.40)$$

$$\Sigma \mid \pi, Y \sim W^{-1}\!\left(n - D,\, (\hat\epsilon^T\hat\epsilon)^{-1}\right), \qquad (9.41)$$

$$\mu_w \mid \pi, \Sigma, Y \sim N(\hat\mu_w,\, x_w^{-1}\Sigma), \qquad (9.42)$$


where $\alpha = \{\alpha_w\}$ is an array of user-specified hyperparameters. The P-step is simply a matter of drawing from these distributions in turn, given the simulated values of T1, T2 and T3 from the I-step. This can be done as follows.

1. For each cell w, draw the probability $\pi_w$ from a standard gamma distribution with shape parameter $x_w + \alpha_w$, where $x_w$ is the wth diagonal element of T3, and normalize the $\pi_w$ to sum to one.

2. Draw an upper-triangular matrix B whose elements are independently distributed as

$$b_{jj}^2 \sim \chi^2_{n-D-j+1}, \quad j = 1, 2, \ldots, q,$$

$$b_{jk} \sim N(0, 1), \quad j < k,$$

and take $\Sigma = M^T M$, where $M = (B^T)^{-1} C$ and C is the upper-triangular Cholesky factor of

$$\hat\epsilon^T\hat\epsilon = T_1 - T_2^T T_3^{-1} T_2.$$

3. Calculate $\hat\mu = T_3^{-1} T_2$ and take $\mu = \hat\mu + T_3^{-1/2} H M$, where H is a D × q matrix of independent N(0, 1) random variates, and $T_3^{-1/2}$ is the matrix with elements $x_w^{-1/2}$ on the diagonal and zeroes elsewhere.
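Putting the three draws together, here is a sketch of one P-step under the unrestricted model, following the recipe above; it assumes all $x_w + \alpha_w > 0$ and a full-rank residual SSCP matrix, and the function layout is illustrative rather than the author's code.

```python
import numpy as np

rng = np.random.default_rng(2)

def p_step(T1, T2, x, alpha, n):
    """One P-step draw of (pi, Sigma, mu) from (9.40)-(9.42)."""
    D, q = T2.shape
    # 1. Dirichlet draw via independent standard gamma variates
    g = rng.gamma(x + alpha)
    pi = g / g.sum()

    # 2. inverted-Wishart draw for Sigma
    B = np.zeros((q, q))
    for j in range(q):                           # 0-based j: df = n - D - j
        B[j, j] = np.sqrt(rng.chisquare(n - D - j))
        B[j, j + 1:] = rng.standard_normal(q - j - 1)
    eps = T1 - T2.T @ (T2 / x[:, None])          # residual SSCP, as in (9.11)
    C = np.linalg.cholesky(eps).T                # eps = C'C, C upper-triangular
    M = np.linalg.solve(B.T, C)                  # M = (B')^{-1} C
    Sigma = M.T @ M

    # 3. normal draw for the matrix of cell means
    mu_hat = T2 / x[:, None]
    H = rng.standard_normal((D, q))
    mu = mu_hat + (H / np.sqrt(x)[:, None]) @ M  # mu_hat + T3^{-1/2} H M
    return pi, Sigma, mu
```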

9.4.4 Algorithms for restricted models

An ECM algorithm

Little and Schluchter (1985) discussed an EM algorithm for ML estimation under restricted versions of the general location model. The E-step is identical to that described above for the unrestricted model, because the expectations of T1, T2 and T3 have the same form regardless of where $\theta = (\pi, \mu, \Sigma)$ lies in the parameter space. The only difference is found in the M-step, which is now a constrained maximization subject to loglinear restrictions on π and linear restrictions on µ. As discussed in Section 9.3, the constrained maxima for π and µ may be found by conventional IPF and least squares, respectively.

In the same article, Little and Schluchter also conjectured that the full maximization of the likelihood for π in each M-step, which may require many IPF cycles, could be replaced by a single IPF cycle, thus avoiding undesirable nested iterations. The resulting algorithm would no longer be EM, but it would have the same essential property that the observed-data loglikelihood would be non-decreasing. Their conjecture turned out to be correct. This algorithm is a special case of ECM, exhibiting the same reliable convergence properties as EM; see Sections 3.2.5 and 8.5.1 for further details and references. A single cycle of ECM for the restricted general location model proceeds as follows.

1. E-step: Given the current estimate $\theta^{(t)} = (\pi^{(t)}, \mu^{(t)}, \Sigma^{(t)})$, calculate the expectations of T1, T2 and T3 as described in Section 9.4.2.

2. CM-step: Using the expected value of x (the diagonal elements of T3), perform a single cycle of conventional IPF from the starting value $\pi^{(t)}$ to obtain $\pi^{(t+1)}$. Then calculate $\beta^{(t+1)}$ and $\Sigma^{(t+1)}$ as in (9.20)-(9.21) using the expected values of T1, T2 and T3, and take $\mu^{(t+1)} = A\beta^{(t+1)}$.

Data augmentation-Bayesian IPF

In a similar fashion, the data augmentation algorithm for the unrestricted model can be adapted to restricted models. The I-step remains the same; only the P-step must be changed to accommodate the restrictions on the parameter space.

Under the family of prior distributions discussed in Section 9.3.3, the complete-data posterior distribution for π is constrained Dirichlet, and the complete-data posterior for (β, Σ) is

$$\Sigma \mid Y \sim W^{-1}\!\left(n - r,\, (\hat\epsilon^T\hat\epsilon)^{-1}\right), \qquad (9.43)$$

$$\beta \mid \Sigma, Y \sim N(\hat\beta,\, \Sigma \otimes V), \qquad (9.44)$$

where $V = (A^T U^T U A)^{-1}$. Random draws from the constrained Dirichlet can be simulated by Bayesian IPF (Section 8.4), and drawing from (9.43) is straightforward. In many applications the dimension of β can be quite large, but simulating draws from (9.44) is not difficult if we exploit the patterned covariance structure. Let G and H denote the upper-triangular Cholesky factors of Σ and V, respectively, so that $\Sigma = G^T G$ and $V = H^T H$. Using elementary properties of Kronecker products,

$$\Sigma \otimes V = (G^T G) \otimes (H^T H) = (G^T \otimes H^T)(G \otimes H) = (G \otimes H)^T (G \otimes H),$$

and thus $G \otimes H$ is an upper-triangular square root for $\Sigma \otimes V$. Therefore, to simulate a multivariate normal random vector with covariance matrix $\Sigma \otimes V$, we may simply pre-multiply a vector of standard normal variates by $(G \otimes H)^T$.

A data augmentation-Bayesian IPF (DABIPF) algorithm for

the restricted general location model proceeds as follows.

1. I-step: Given the current values of the parameters $\pi^{(t)}$, $\mu^{(t)} = A\beta^{(t)}$ and $\Sigma^{(t)}$, draw the missing data from their predictive distribution as described in Section 9.4.3, and accumulate the simulated values of the sufficient statistics T1, T2 and T3.

2. Bayesian IPF: Using the simulated value of x (the diagonal elements of T3), perform a single cycle of Bayesian IPF from the starting value $\pi^{(t)}$ to obtain $\pi^{(t+1)}$.

3. P-step for Σ: Draw an upper-triangular matrix B whose elements are independently distributed as

$$b_{jj}^2 \sim \chi^2_{n-r-j+1}, \quad j = 1, 2, \ldots, q,$$

$$b_{jk} \sim N(0, 1), \quad j < k,$$

and take $\Sigma^{(t+1)} = M^T M$, where $M = (B^T)^{-1} C$ and C is the upper-triangular Cholesky factor of

$$\hat\epsilon^T\hat\epsilon = T_1 - T_2^T A (A^T T_3 A)^{-1} A^T T_2.$$

4. P-step for β: Draw $\beta^{(t+1)}$ from a multivariate normal distribution with mean $\hat\beta = (A^T T_3 A)^{-1} A^T T_2$ and covariance matrix $\Sigma^{(t+1)} \otimes V$, where $V = (A^T T_3 A)^{-1}$. This can be done in the following manner. Let $\beta_j$ and $\hat\beta_j$ denote the jth columns of $\beta^{(t+1)}$ and $\hat\beta$, respectively. Calculate $G = \operatorname{Chol}(\Sigma^{(t+1)})$ and $H = \operatorname{Chol}(V)$, and take

$$\beta_1 = \hat\beta_1 + g_{11} H^T \kappa_1,$$

$$\beta_2 = \hat\beta_2 + g_{12} H^T \kappa_1 + g_{22} H^T \kappa_2,$$

$$\vdots$$

$$\beta_q = \hat\beta_q + g_{1q} H^T \kappa_1 + g_{2q} H^T \kappa_2 + \cdots + g_{qq} H^T \kappa_q,$$

where $g_{ij}$ is the (i, j)th element of G and where $\kappa_1, \kappa_2, \ldots, \kappa_q$ are vectors of independent N(0, 1) random variates of length r.

This DABIPF algorithm is not true data augmentation, but a hybrid that substitutes a single cycle of Bayesian IPF for the full simulation of π in the P-step.
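A sketch of step 4, drawing β column by column; it assumes the convention $\Sigma = G^T G$ and $V = H^T H$ with G and H upper-triangular, matching the square-root identity above, and is an illustration rather than the author's code.

```python
import numpy as np

rng = np.random.default_rng(3)

def draw_beta(beta_hat, Sigma, V):
    """Draw beta ~ N(beta_hat, Sigma kron V) using the patterned structure."""
    r, q = beta_hat.shape
    G = np.linalg.cholesky(Sigma).T        # Sigma = G'G, G upper-triangular
    H = np.linalg.cholesky(V).T            # V = H'H, H upper-triangular
    kappa = rng.standard_normal((q, r))    # kappa_1, ..., kappa_q of length r
    beta = beta_hat.copy()
    for j in range(q):                     # beta_j = beta_hat_j + sum_i g_ij H' kappa_i
        for i in range(j + 1):
            beta[:, j] += G[i, j] * (H.T @ kappa[i])
    return beta
```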


9.5 Data examples

9.5.1 St. Louis Risk Research Project

Little and Schluchter (1985) presented data from the St. Louis Risk Research Project, an observational study to assess the effects of parental psychological disorders on various aspects of child development. In a preliminary cross-sectional study, data were collected on 69 families having two children each. The families were classified into three risk groups for parental psychological disorders. The children were classified into two groups according to the number of adverse psychiatric symptoms they exhibited. Standardized reading and verbal comprehension scores were also collected for the children. Each family is thus described by three categorical and four continuous variables:

Variable                  Levels                        Code
Parental risk group       1=low, 2=moderate, 3=high     G
Symptoms, child 1         1=low, 2=high                 D1
Symptoms, child 2         1=low, 2=high                 D2
Reading score, child 1    continuous                    R1
Verbal score, child 1     continuous                    V1
Reading score, child 2    continuous                    R2
Verbal score, child 2     continuous                    V2

Data from this preliminary study are displayed in Table 9.2. Missing values occur on all variables except G. Only twelve families have values recorded for all seven variables.

Table 9.2. Data from the St. Louis Risk Research Project

The unrestricted model

The unrestricted general location model for this dataset has 69 free parameters: 11 for the 3 × 2 × 2 contingency table that cross-classifies families by G, D1 and D2, 48 for the within-cell means of R1, V1, R2 and V2, and 10 for the within-cell covariance matrix. As pointed out by Little and Schluchter (1985), all of the parameters of this model are technically estimable. There are no zero counts in the table for the 29 families that can be fully classified on G, D1 and D2. The only family known to belong to the G = 2, D1 = 2, D2 = 1 cell has missing values for V1, R2 and V2; there are three other partially classified families that can possibly belong to this cell, however, and two of these families have all their continuous variables recorded. Similarly, the only family known to have G = 1, D1 = 2, D2 = 2 has a missing value for R1, but there are two other partially classified families for whom R1 is known who may belong to this cell. These partially classified families contribute 'fractional observations' of the continuous variables to certain cells. With respect to the means of these cells, the observed-data likelihood is not flat, but some of the means may be estimated with precision equivalent to sample sizes of less than one.

Using the EM algorithm described in Section 9.4.2, Little and Schluchter (1985) discovered that the observed-data likelihood for this example is multimodal. They found that EM converges to different parameter estimates from different starting values, and the loglikelihood values at these estimates are not identical. The data augmentation algorithm of Section 9.4.3, when used in conjunction with EM, provides an additional tool to help us explore the observed-data likelihood. Starting at a mode, we ran several hundred iterations of data augmentation, and used the final simulated value of the parameter as a new starting value for EM. By repeating this process, we were quickly able to identify ten distinct modes, and would have undoubtedly found more had we continued further. The unusual shape of the observed-data loglikelihood suggests that some of the parameters of the unrestricted model are very poorly estimated. This is not surprising, given that we are trying to estimate 69 parameters from only 69 incomplete observations.


Time-series plots of some parameters across the iterations of data augmentation show erratic behavior. Plots of the simulated means of the four continuous variables within the G = 2, D1 = 2, D2 = 1 cell are shown in Figure 9.2. The means for V1, R2 and V2 are highly unstable, wandering well outside the plausible range of reading and verbal scores. The use of the unrestricted model is not recommended for this dataset, as it is clearly over-parameterized.

Figure 9.2. Time-series plots of the conditional means of R1, V1, R2 and V2 given (G=2, D1=2, D2=1) for 1000 iterations of data augmentation under the unrestricted general location model.

Restricted models

Because the ultimate purpose of the St. Louis Risk study was to examine the effects of parental psychological disorders on child development, we now examine two restricted models that focus attention on the effects of greatest interest, namely, the associations between parental risk G and the child development variables D1, R1, V1, D2, R2 and V2.

The first model, which will be called the 'null model', allows the six development variables to be interrelated, but assumes that they are collectively independent of G. The loglinear model for the categorical variables is (G, D1, D2). The design matrix specifying the regression of the four continuous variables on the categorical ones is shown in Table 9.3 (a); it includes an intercept, main effects for D1 and D2, and the D1D2 interaction. This model fits 5 free parameters to the contingency table, 16 regression coefficients and 10 covariances for a total of 31 free parameters.

Table 9.3. Design matrix for the null model, and the linear contrast for G included in the alternative model

The second model, which we call the 'alternative model', adds simple associations between G and each of the six development variables. The loglinear model is now (GD1, GD2, D1D2), and the association between G and the continuous variables is specified by adding columns to the design matrix for G. To conserve parameters, we add only a single column for a linear contrast, as shown in Table 9.3 (b). The alternative model has 9 parameters for the contingency table, 20 regression coefficients and 10 covariances for a total of 39 parameters.

ML estimates under these two models were computed using the ECM algorithm of Section 9.4.4. As with the unrestricted model, the observed-data loglikelihood functions are not unimodal; we found two modes under the null model and two modes under the alternative. The likelihood-ratio test statistic based on the two major modes is 21.9 with 8 degrees of freedom. It appears that the alternative model may fit the data substantially better than the null model, but we cannot assign an accurate p-value to this difference due to the unusual shape of the likelihood function.


Adopting a Bayesian approach, however, we can demonstrate rather conclusively that G is indeed related to each of the six development variables. Using the DABIPF algorithm, we simulated 5000 correlated draws from the observed-data posterior under the alternative model and stored the values of parameters of interest. Time-series plots of the parameters, shown in Figure 9.3, did not exhibit the same instability found in plots for the unrestricted model, so the algorithm appears to be converging reliably. By examining the simulated values of the parameters pertaining to the associations between G and the other variables, we may proceed to make Bayesian inferences about these parameters directly without appealing to large-sample approximations.

Figure 9.3. Time-series plots of the conditional means of R1, V1, R2 and V2 given (G=2, D1=2, D2=1) for 1000 iterations of DABIPF under the alternative model.

Risk and adverse psychological symptoms

Let $\pi_{ijk}$ denote the marginal probability of the event G = i, D1 = j, D2 = k. The association between G and D1 can be described by two odds ratios, say

$$\omega_1 = \frac{\pi_{11k}\,\pi_{22k}}{\pi_{21k}\,\pi_{12k}}, \qquad \omega_2 = \frac{\pi_{21k}\,\pi_{32k}}{\pi_{31k}\,\pi_{22k}}.$$

Figure 9.4. Boxplots of simulated log-odds ratios from 5000 iterations of DABIPF under the alternative model.

Table 9.4. Simulated posterior percentiles and p-values for odds ratios

These express the increase in odds of adverse symptoms in the first child as we move from low to moderate risk, and from moderate to high risk, respectively. Notice that these odds ratios do not depend on k; they are identical for k = 1 and k = 2 because the loglinear model omits the three-way association GD1D2. Similarly, the association between G and D2 can be described by

$$\omega_3 = \frac{\pi_{1j1}\,\pi_{2j2}}{\pi_{2j1}\,\pi_{1j2}}, \qquad \omega_4 = \frac{\pi_{2j1}\,\pi_{3j2}}{\pi_{3j1}\,\pi_{2j2}},$$

which express the increase in odds of adverse symptoms in the second child as we move from low to moderate and from moderate to high risk.
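Given simulated draws of π stored as an array indexed pi[G-1, D1-1, D2-1] (an assumed layout for illustration), the four log odds ratios can be computed directly; under the fitted loglinear model the fixed levels k and j do not affect the values.

```python
import numpy as np

def log_odds_ratios(pi, k=0, j=0):
    """Log odds ratios omega_1..omega_4 from one simulated pi draw of shape (3, 2, 2)."""
    w1 = np.log(pi[0, 0, k] * pi[1, 1, k] / (pi[1, 0, k] * pi[0, 1, k]))
    w2 = np.log(pi[1, 0, k] * pi[2, 1, k] / (pi[2, 0, k] * pi[1, 1, k]))
    w3 = np.log(pi[0, j, 0] * pi[1, j, 1] / (pi[1, j, 0] * pi[0, j, 1]))
    w4 = np.log(pi[1, j, 0] * pi[2, j, 1] / (pi[2, j, 0] * pi[1, j, 1]))
    return w1, w2, w3, w4
```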

Boxplots of the logarithms of the four odds ratios from 5000 cycles of DABIPF are shown in Figure 9.4. The logs of ω1 and ω3 are nearly all positive, providing strong evidence that children in moderate-risk families (G = 2) have higher rates of adverse symptoms than children in low-risk families (G = 1). The logs of ω2 and ω4, however, lie on both sides of zero; there is no evidence that the adverse-symptom rates differ for children in moderate- (G = 2) and high-risk (G = 3) families. Simulated percentiles of the posterior distributions of the $\omega_i$ are shown in Table 9.4, along with Bayesian p-values for testing each null hypothesis $\omega_i = 1$ against the two-sided alternative $\omega_i \neq 1$. Based on the posterior medians, we estimate that children in moderate-risk families are about 4.5 times as likely (on the odds scale) to display adverse symptoms as children in low-risk families.

Figure 9.5. Boxplots of simulated regression coefficients from 5000 iterations of DABIPF under the alternative model.

Table 9.5. Simulated posterior percentiles and p-values for regression coefficients

Risk and comprehension scores

The association between risk and comprehension is summarized by the coefficients of the linear term for G in the regression model for R1, V1, R2 and V2. Boxplots of the simulated regression coefficients from DABIPF are displayed in Figure 9.5. For each coefficient, the majority of the simulated values lie well below zero, providing evidence that increasing risk is associated with decreasing reading and verbal comprehension. Simulated posterior percentiles for the four coefficients are given in Table 9.5, along with a two-tailed Bayesian p-value for testing the null hypothesis that each coefficient is zero. All four effects are ’statistically significant.’ From the medians, we estimate that increasing risk by one category (low to moderate, or moderate to high) is associated with a drop of 6-7 points in reading comprehension and 10-13 points in verbal comprehension for each child.

Figure 9.5. Boxplots of simulated regression coefficients from 5000 iterations of DABIPF under the alternative model.

Table 9.5. Simulated posterior percentiles and p-values for regression coefficients

9.5.2 Foreign Language Attitude Scale

In Section 6.3, we examined data pertaining to the Foreign Language Attitude Scale (FLAS), an instrument designed to predict achievement in the study of foreign languages. Of the twelve variables in the dataset, five are categorical and seven are continuous. The analyses in Chapter 6 relied on multiple imputations created under a multivariate normal model. Prior to imputation, we recoded some of the categorical variables to make the normal model appear more reasonable. In the process of recoding, however, some potentially useful detail was lost. For example, the final grade variable GRD was collapsed from five categories to only two. Now, using the general location model, we will re-impute the missing data without altering any of the categorical variables.

The imputation model

For imputation purposes, we fitted a restricted version of the general location model to the twelve variables listed in Table 6.5. The categorical variables LAN, AGE, PRI, SEX and GRD define a five-dimensional contingency table with 4 × 5 × 5 × 2 × 5 = 1000 cells. This table was described by a loglinear model with all main effects and two-variable associations. The seven continuous variables were then described by a regression with main effects for each categorical variable. The design matrix, which had 1000 rows and eight columns, included a constant term for the intercept, three dummy indicators for LAN, a dummy indicator for SEX, and linear contrasts for AGE, PRI and GRD. The coding scheme for the design matrix is shown in Table 9.6.
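A minimal sketch of how a design matrix of this form could be assembled (Python; the centering of the linear contrasts is an illustrative placeholder, not the exact coding of Table 9.6):

```python
import itertools
import numpy as np

# One row per cell of the LAN x AGE x PRI x SEX x GRD contingency table.
levels = {"LAN": 4, "AGE": 5, "PRI": 5, "SEX": 2, "GRD": 5}

rows = []
for lan, age, pri, sex, grd in itertools.product(
        *(range(1, d + 1) for d in levels.values())):
    row = [1.0]                                            # intercept
    row += [1.0 if lan == m else 0.0 for m in (2, 3, 4)]   # LAN dummies
    row.append(1.0 if sex == 2 else 0.0)                   # SEX dummy
    row += [age - 3.0, pri - 3.0, grd - 3.0]               # linear contrasts
    rows.append(row)

X = np.array(rows)
print(X.shape)   # (1000, 8): 4*5*5*2*5 cells, eight columns
```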

Like the multivariate normal distribution, this model allows simple associations between any two variables. Imputations generated under the model will preserve simple marginal and conditional associations, but higher-order effects such as interactions will not be reflected in the imputed values. If the post-imputation analyses involve only simple associations (e.g. regressions with main effects but no interactions), then this imputation model may be expected to perform well. More elaborate analyses involving interactions, however, would require a more elaborate imputation model.

Table 9.6. Columns of design matrix in imputation model, foreign language achievement study data

Prior distributions

Recall from Section 6.3 that certain parameters of the normal model were inestimable, because values of GRD were missing for all students enrolled in Russian (LAN=4). In the new imputation model, some aspects of the association between GRD and LAN are again inestimable for the same reason. Furthermore, the sparseness of the contingency table (recall that there are 1000 cells but only n = 279 observations) results in ML estimates on the boundary of the parameter space. These difficulties can be addressed by specifying a proper prior distribution for the cell probabilities.

In previous examples involving sparse tables, we applied flattening priors: Dirichlet or constrained Dirichlet distributions with hyperparameters set to a small positive constant. Flattening priors smooth the estimated cell probabilities toward a uniform table. This type of smoothing may be undesirable in this application, because some of the categorical variables (AGE and GRD, in particular) have categories that are quite rare; flattening priors could distort the marginal distributions for these variables, leading to an over-representation of rare categories in the imputed values. Another possibility is a data-dependent prior that smooths the estimates toward a table of mutual independence among the variables, but leaves the marginal distribution of each variable unchanged (Section 7.2.5). To generate multiple imputations, we ran DABIPF under two different priors: (a) a data-dependent prior of this type, with hyperparameters scaled to add to 50; and (b) the Jeffreys prior with all hyperparameters equal to 1/2. The latter may arguably result in over-smoothing; we include it primarily to assess the sensitivity of our results to the choice of prior.
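One way such a prior could be constructed is to make each cell’s Dirichlet hyperparameter proportional to the product of the observed one-way marginal proportions, so that the prior table is the independence fit with the observed margins. The sketch below is a hedged reading of the Section 7.2.5 construction, not its exact recipe (Python):

```python
import numpy as np

def independence_prior(counts, total=50.0):
    """Dirichlet hyperparameters proportional to the independence fit
    of `counts`, scaled so that they add to `total`."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    fit = np.ones_like(p)
    for axis in range(p.ndim):
        other = tuple(a for a in range(p.ndim) if a != axis)
        margin = p.sum(axis=other)          # one-way margin for this axis
        shape = [1] * p.ndim
        shape[axis] = -1
        fit = fit * margin.reshape(shape)   # outer product of the margins
    return total * fit                      # hyperparameters sum to `total`

# Illustrative 4 x 5 x 2 table of counts (not the FLAS data).
counts = np.random.default_rng(1).poisson(2.0, (4, 5, 2)) + 1
alpha = independence_prior(counts)
print(alpha.sum())                          # 50.0
```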

Generating the imputations

Under each prior, we generated m = 10 imputations by running a single chain of DABIPF, allowing 250 cycles between imputations. To obtain a starting value of θ, we first ran the ECM algorithm, setting hyperparameters to 1.05 to ensure a mode in the interior of the parameter space. The continuous variables were modeled and imputed on their original scales without transformation. The imputed values for these variables hardly ever strayed outside their natural ranges. For example, only two of the 10 × 34 = 340 values of CGPA imputed in the first DABIPF run fell above the maximum of 4.0. Because these ’impossible’ imputations occurred so rarely, we simply allowed them to remain in the imputed data rather than editing or re-drawing them.


A proportional-odds model

In keeping with the purpose of this study, a model was fitted to predict final grade GRD from the other eleven variables. Because GRD is an ordinal scale (0=F, 1=D, 2=C, 3=B, 4=A), we used a logistic model for ordinal responses known as the proportional-odds model (McCullagh, 1980; Agresti, 1990). For any subject i, let πij denote the probability of the event GRD ≥ j, and let xi be a vector of covariates. The proportional-odds model is

$$\log\frac{\pi_{ij}}{1 - \pi_{ij}} = \alpha_j + \beta^T x_i\,, \qquad j = 1, 2, 3, 4.$$

In other words, the log-odds of falling above each of the four GRD cut-points are simultaneously modeled as parallel linear functions with common slopes β and intercepts α1 ≥ α2 ≥ α3 ≥ α4.
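A small numerical sketch of this structure (Python; the α and β values below are illustrative, not fitted estimates):

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha = np.array([4.0, 2.5, 1.0, -1.0])   # intercepts, alpha_1 >= ... >= alpha_4
beta = np.array([0.05, -0.30])            # common slopes (illustrative)
x = np.array([20.0, 1.0])                 # one subject's covariates

cum = expit(alpha + beta @ x)             # P(GRD >= j) for j = 1..4
probs = -np.diff(np.concatenate(([1.0], cum, [0.0])))
# probs[j] = P(GRD = j), j = 0 (F) through 4 (A); they sum to one
print(probs, probs.sum())
```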

Routines for maximum-likelihood estimation in the proportional-odds model are available in several popular statistical software packages, including SAS (SAS Institute Inc., 1990) and BMDP (BMDP Statistical Software, Inc., 1992).

Table 9.7. Estimates, standard errors, p-values and percent missing information for coefficients in the proportional-odds model, from m = 10 multiple imputations under (a) data-dependent and (b) Jeffreys priors

The covariates in our proportional-odds model included all seven of the continuous variables in the dataset. In addition, we included three dummy indicators for LAN, a dummy indicator for SEX, and linear contrasts for AGE and PRI, coded as shown in Table 9.6. For each imputed dataset, we calculated ML estimates using software developed by Harrell (1990) for the statistical system S (Becker, Chambers and Wilks, 1988). The estimates, along with standard errors based on score statistics, were then combined using Rubin’s rules for scalar estimands (Section 4.3.2). Estimated coefficients and standard errors are displayed in Table 9.7, along with percent missing information and two-tailed p-values for testing the null hypothesis that each coefficient is zero.
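The combining rules themselves are straightforward to implement. A minimal sketch for a scalar estimand (Python; combine_scalar is our own illustrative helper, and the fraction-of-missing-information formula is the simple large-sample approximation):

```python
import numpy as np

def combine_scalar(est, se):
    """Rubin's rules for a scalar estimand: est and se are the m
    completed-data estimates and standard errors (b > 0 assumed)."""
    est, se = np.asarray(est, float), np.asarray(se, float)
    m = len(est)
    qbar = est.mean()                      # combined point estimate
    ubar = np.mean(se ** 2)                # within-imputation variance
    b = est.var(ddof=1)                    # between-imputation variance
    t = ubar + (1.0 + 1.0 / m) * b         # total variance
    r = (1.0 + 1.0 / m) * b / ubar         # relative increase in variance
    df = (m - 1.0) * (1.0 + 1.0 / r) ** 2  # Rubin's degrees of freedom
    lam = (1.0 + 1.0 / m) * b / t          # approx. fraction of missing info
    return qbar, np.sqrt(t), df, lam

# Illustrative values, not the FLAS estimates.
qbar, se_total, df, lam = combine_scalar(
    [0.52, 0.48, 0.55, 0.50, 0.49], [0.10, 0.11, 0.10, 0.12, 0.10])
```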

Results using the data-dependent prior, shown in Table 9.7 (a), are fairly consistent with our findings in Section 6.3, where we fitted a simple logit model to the dichotomized version of GRD. The only substantial difference is that under the proportional-odds model, PRI has a significant effect on GRD but SEX does not; under the dichotomous model, SEX had a significant effect but PRI did not. Results under the Jeffreys prior, shown in Table 9.7 (b), are similar to those from the data-dependent prior, with the following two exceptions: first, the linear effect of AGE is no longer significant; second, the coefficient of the dummy indicator LAN4 is now highly significant. The latter is rather curious, because we know that the data provide essentially no information about the effect of LAN4 on GRD given the other variables. This ’statistically significant’ relationship appears to be a figment of the Jeffreys prior, which smooths the data quite heavily. The high fraction of missing information for this coefficient, along with its sensitivity to the choice of prior, should alert us to use extreme caution when trying to make any inferences regarding grades for the LAN = 4 group.

Partial correlation coefficients

Apart from determining which predictors are significantly related to GRD, it is also useful to consider the practical importance of the estimated effects. In many areas of social science, associations are expressed and compared in terms of simple or partial correlation coefficients. In linear regression, a partial correlation measures the expected change in the response variable (expressed in standard units) associated with a one-unit increase in a predictor (also in standard units) when all other predictors are held constant. A squared partial correlation measures the proportion of variance in the response variable ’explained by’ the predictor, after accounting for the measurable effects of all other predictors. Even if the classical regression model does not hold, e.g. when the response is ordinal, the partial correlation still serves as a heuristically useful benchmark for gauging the practical importance of an association.

A partial correlation can be calculated from the usual t-statistic used for testing the significance of a regression coefficient. Let T denote a t-statistic (the estimated coefficient divided by its standard error) and ν its degrees of freedom. The estimated partial correlation is

$$r = \pm\frac{T}{\sqrt{T^2 + \nu}}\,,$$

where the sign is chosen to be consistent with that of T. Under an assumption of multivariate normality, r is approximately normally distributed about the population coefficient ρ. An even better approximation is provided by Fisher’s (1921) transformation $\tanh^{-1}(r)$, which in large samples is essentially normally distributed about $\tanh^{-1}(\rho)$ with variance 1/(ν − 1) (Anderson, 1984).
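A sketch of this computation for a single completed dataset (Python; the values are illustrative, and in the multiply-imputed analysis Rubin’s rules would be applied to the $\tanh^{-1}(r)$ values across imputations before back-transforming):

```python
import numpy as np

def partial_r(t, nu):
    """r = +/- T / sqrt(T^2 + nu); dividing the signed T gives the sign."""
    return t / np.sqrt(t ** 2 + nu)

t, nu = 3.2, 250.0                      # illustrative t-statistic and df
r = partial_r(t, nu)
z = np.arctanh(r)                       # Fisher's transformation tanh^-1(r)
half = 1.96 / np.sqrt(nu - 1.0)         # using the variance 1/(nu - 1)
lo, hi = np.tanh(z - half), np.tanh(z + half)
print(round(r, 3), (round(lo, 3), round(hi, 3)))
```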

For each imputed dataset, we regressed GRD on the same set of predictors used in the proportional-odds model. Using Rubin’s rules, we calculated estimates and 95% intervals for $\tanh^{-1}(\rho)$, and then transformed the results back to the correlation scale. The resulting point and interval estimates are shown in Table 9.8. These figures should be interpreted somewhat loosely, because the assumptions underlying the classical regression model and the normal approximation to $\tanh^{-1}(r)$ clearly do not hold. Yet it is apparent that FLAS, the predictor of primary interest, has substantial validity for predicting achievement in the study of foreign languages. Except for HGPA, FLAS has the highest partial correlation with GRD, higher even than the well-established instrument MLAT.

Table 9.8. Estimated partial correlation coefficients, 95% intervals and percent missing information from m = 10 multiple imputations under (a) data-dependent and (b) Jeffreys priors

9.5.3 National Health and Nutrition Examination Survey

The largest and most notable application of these methods to date has been to the Third National Health and Nutrition Examination Survey (NHANES III). This survey, conducted by the National Center for Health Statistics (NCHS), provides basic information on health and nutritional status for the civilian non-institutionalized U.S. population. NHANES III is a complex, multistage area sample with oversampling of young children, the elderly, Mexican Americans and African Americans. Details of the design are given by Ezzati et al. (1992). Data were collected over six years (1988-94) with a total sample size of 39,695. The data collection occurred in two stages: (a) personal interviews with subjects at home, and (b) detailed physical examinations of subjects in Mobile Examination Centers (MECs). Because of the inconvenience associated with going to a MEC and completing the exam, nonresponse rates at the examination phase were understandably high; many key survey variables had missingness rates of 30% or more.

In 1992, NCHS initiated a research project to investigate alternative missing-data procedures for NHANES III, including multiple imputation. This project will culminate in the public release of a multiply-imputed research dataset, currently scheduled for 1997. The dataset will contain five imputations of more than 60 variables. Here we briefly summarize the imputation model and the results of an extensive simulation study to assess the performance of the method. Complete details are given by Schafer et al. (1996) and their references.

The imputation model

The imputation model was designed to produce imputations appropriate for a wide variety of analyses. Data from NHANES are used to estimate important health-related quantities at the national level, e.g. rates of obesity by age and sex. These estimates, produced and reported by NCHS, are based on classical methods of survey inference (Cochran, 1977) and are designed to be approximately unbiased over repetitions of the sampling procedure. Standard errors are calculated using special variance-estimation techniques appropriate for data from complex samples (Wolter, 1985). To be compatible with these procedures, an imputation model must be sensitive to major features of the sample design. Outside NCHS, the data are also subjected to secondary analysis by researchers in many health-related fields. For example, researchers might fit linear or logistic regression models to NHANES data to investigate relationships among health outcomes and potential risk factors. For this reason, the imputation model needed to preserve important marginal and conditional associations among variables.

We created multiple imputations under a general location model that included over 30 variables. Because individuals’ probabilities of selection varied by age group, gender and race/ethnicity, the distributions of other survey variables had to be allowed to vary across the levels of these three; otherwise, biases could be introduced into many important estimators, both nationally and within demographic subclasses. The imputation model was also designed to reflect potential variation in characteristics across primary sampling units (PSUs), the clusters that enter into the NCHS procedures for variance estimation; without these effects, the quality of the standard errors calculated from the resulting imputed datasets could be impaired.

The categorical part of the general location model used a four-way classification by age, gender, race/ethnicity and PSU. The remaining variables were modeled by a multivariate linear regression with full three-way interactions for age, gender and race/ethnicity, plus main effects for PSUs. Most of the response variables in this regression were continuous, but a few were binary or ordinal. Multiple imputations were generated using the DABIPF algorithm of Section 9.4.4, and the imputed values for the binary and ordinal variables were rounded off to the nearest category. Preliminary analyses of the imputed data suggested that for most purposes, m = 5 imputations would be sufficient to obtain accurate and efficient inferences.
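A hedged sketch of how a design of this structure might be declared (Python with the patsy formula language; the variable names are hypothetical stand-ins, not actual NHANES III field names):

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical cell identifiers; the real model has many more levels.
cells = pd.DataFrame({
    "agegrp": ["1", "2", "1", "2"],
    "gender": ["M", "F", "F", "M"],
    "race":   ["A", "B", "A", "B"],
    "psu":    ["01", "01", "02", "02"],
})

# Full three-way interactions for age, gender, race; PSU main effects only.
X = dmatrix("C(agegrp) * C(gender) * C(race) + C(psu)", cells)
print(X.design_info.column_names)
```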

A simulation study

Recognizing that this imputation procedure was based upon a probability model that was, at best, only approximately true, we carried out an extensive simulation experiment. The goal of this simulation was to evaluate the performance of the imputation procedure from a purely frequentist perspective, without reference to any particular probability model. For example, we wanted to learn whether 95% interval estimates in typical applications would really cover the quantity of interest 95% of the time over repetitions of the sampling and imputation procedure. To this end, we constructed an artificial population of 31,847 persons by pooling data from four NCHS examination surveys conducted since 1971. This artificial population was weighted to resemble the projected U.S. population in the year 2000 in terms of race/ethnicity and geography. From this population, we drew stratified random samples of 6000 persons using a sampling plan resembling that of NHANES III. Missing values were imposed on each sample using a random, ignorable mechanism to mimic the rates and patterns of nonresponse observed in NHANES III. The missing data were then imputed five times under the general location model, and multiple-imputation point and interval estimates were calculated for a variety of estimands (means and proportions, subdomain means, quantiles, and conditional log-odds ratios) using methods appropriate for stratified random samples. The entire sampling, imputation and estimation procedure was repeated 1000 times.

Figure 9.6. Simulated coverage of 95% multiple-imputation (MI) intervals by average percent missing information for 448 means.

Here we briefly summarize our results for means. We examined means for ten exam variables, for the entire population and within demographic categories defined by age, race/ethnicity and gender. Among these 448 means, the average simulated coverage of the 95% intervals over 1000 repetitions was 949.3 per 1000 (94.93%), not significantly different from the nominal 950. Individually, however, 81 of the 448 means (18%) had coverage significantly different from 950 at the 0.05 level. The coverages of the multiple-imputation (MI) intervals are shown in Figure 9.6, plotted against the average estimated percent missing information for the respective estimands. In this plot, the least-squares fit (dashed line) is nearly indistinguishable from a horizontal line through 950 (solid); there is no overall tendency for the actual coverage to increase or decrease with the fraction of missing information. There is, however, some tendency for the coverage to vary more as the rate of missing information goes up.
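The text does not say which test underlies ’significantly different from 950’; one natural choice, sketched below under that assumption, is a normal approximation to the binomial count of covered intervals (Python):

```python
import math

def coverage_z(covered, reps=1000, p0=0.95):
    """z-statistic for an observed coverage count out of `reps` vs p0."""
    se = math.sqrt(p0 * (1.0 - p0) / reps)   # about 0.0069 when reps = 1000
    return (covered / reps - p0) / se

print(coverage_z(949.3))   # average over the 448 means: z ~ -0.10, n.s.
print(coverage_z(920))     # a mean covered 920/1000 times would be flagged
```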

Figure 9.7. Simulated coverage of 95% multiple-imputation (MI) intervals versus complete-data (CD) intervals, with points (507, 824), (608, 799) and (479, 876) not shown.


Further analysis revealed that, among the intervals whose coverage departed substantially from 95%, the departures could be largely traced to failure of the normal approximation for the inference without missing data. In Figure 9.7, the simulated coverage of each MI interval is plotted against the coverage of the corresponding normal-based interval (the point estimate plus or minus 1.96 standard errors) that one would have used if no data were missing. The two coverages are strongly correlated. Somewhat surprisingly, for the estimands for which the complete-data (CD) interval exhibited gross undercoverage (and especially the three pathological cases that fell outside the plotting region), the MI intervals performed substantially better than their CD counterparts. On the other hand, there were no estimands for which CD did well but MI did poorly. Results for other types of estimands revealed similar trends: the MI intervals tended to perform very well, except where difficulties were observed in the corresponding CD intervals. Further discussion of this simulation study, including its limitations, is given by Schafer et al. (1996).

Further remarks

In this application, it was feasible to add PSU to the general location model because there were relatively few PSUs and a large number of subjects within each PSU; we were able to include dummy indicators for PSU in the design matrix without experiencing problems of inestimability. In other surveys, the number of clusters may be too large to adopt such an approach. In those settings, it may be possible to produce multiple imputations under hierarchical or random-effects models that impose probability distributions on the cluster-specific parameters. Estimation and imputation algorithms for random-effects models can be developed by extending the techniques of this chapter, but they are beyond the scope of this book. For an example of imputation under a random-effects model for multivariate categorical data, see Schafer (1995).
