
STATISTICS IN MEDICINE
Statist. Med. 2002; 21:3465-3474 (DOI: 10.1002/sim.1263)

Block principal component analysis with application to gene microarray data classification

Aiyi Liu1,*, Ying Zhang1, Edmund Gehan1 and Robert Clarke2

1 Biostatistics Unit, Lombardi Cancer Center, Georgetown University Medical Center, 3800 Reservoir Road NW, Washington, DC 20007, U.S.A.

2 Department of Oncology, Lombardi Cancer Center, Georgetown University Medical Center, Washington, DC 20007, U.S.A.

SUMMARY

We propose a block principal component analysis method for extracting information from a database with a large number of variables and a relatively small number of subjects, such as a microarray gene expression database. This new procedure has the advantage of computational simplicity, and theory and numerical results demonstrate it to be as efficient as ordinary principal component analysis when used for dimension reduction, variable selection, and data visualization and classification. The method is illustrated with the well-known National Cancer Institute database of gene microarray expressions in 60 human cancer cell lines (NCI60), in the context of classification of cancer cell lines. Copyright © 2002 John Wiley & Sons, Ltd.

KEY WORDS: principal component analysis; grouping of variables; similarity; gene expression; microarray data analysis

    1. INTRODUCTION

Principal component analysis is one of the most common techniques of exploratory multivariate data analysis. It is a method of transforming a set of p correlated variables x = (x_1, x_2, …, x_p)′ to a set of p uncorrelated variables y = (y_1, y_2, …, y_p)′ that are linear functions of the x's, referred to as the p principal components of x, such that the variances of the y's are in descending order with respect to the variation among the x's. Usually the first several components explain most of the variation among the x's. In addition to many other applications, principal component analysis has been shown to be a useful tool in reducing data dimension and extracting information, in seeking important regressors in regression analysis, and in effectively visualizing and clustering subjects, when measurements on a large number of variables are collected from each subject. The book by Jolliffe [1] provides excellent reading on this topic, although other textbooks on multivariate data analysis do also (for example, references [2] and [3]).

* Correspondence to: Aiyi Liu, Biostatistics Unit, Lombardi Cancer Center S112, Georgetown University Medical Center, 3800 Reservoir Road, Washington, DC 20007, U.S.A.

E-mail: [email protected]

Received August 2001
Accepted January 2002

Copyright © 2002 John Wiley & Sons, Ltd.


Recently, principal component analysis has found application in the analysis of microarray gene expressions [4], a growing technology in human genome studies [5, 6].

When dealing with an extremely large number of variables (for example, 500 or more), deriving principal components can be computationally intensive, since it involves finding the eigenvectors (and eigenvalues) of a matrix with large dimensions. Moreover, a linear combination of such a large number of variables becomes less meaningful to the investigators, since the high dimensionality makes it hard to extract useful information and to interpret the combination. In one microarray technology, cDNA clone inserts are printed onto a glass slide and then hybridized to two differentially fluorescently labelled probes. The final gene expression profile contains fluorescent intensities and ratio information for many hundreds or thousands of genes. If one intends to apply principal component analysis directly to extract gene expression information for these genes from a certain group of subjects, then one has to deal with a matrix of huge dimensions.

In dealing with such high-dimensional data, we propose to perform the principal component analysis in a stratified way. We first group the original variables into several blocks of variables, in the sense that each block contains variables (genes in the microarray experiments) that are similar; variables from the same block are more correlated than variables from different blocks. We then perform principal component analysis within each block and obtain a small number of variance-dominating principal components. Combining the principal components obtained from each block forms a new database from which we can then extract information by performing a new principal component analysis. We term this procedure block principal component analysis. Dominating principal components obtained from the final stage can then be used in various exploratory data analyses such as clustering and visualization.

The proposed block principal component analysis method also enables us to reduce the number of variables effectively. Within each block, when principal component analysis is conducted and dominating linear combinations of variables are examined, only those variables that have relatively large coefficients are retained. We will examine this variable selection procedure in detail using the gene microarray example.

After a brief review of the mathematical derivation of principal components and their applications in Section 2, we introduce in Section 3 the method of block principal component analysis. In Section 4, we investigate the efficiency of block principal components in the reduction of data dimension with respect to the amount of variance explained. It is shown that the proposed procedure can be as efficient as ordinary principal component analysis. We then discuss the selection of informative variables using block principal component analysis. In Section 5 we apply the method to the problem of classification of microarray data from the well-known National Cancer Institute database of 60 human cancer cell lines (NCI60), each of which has gene microarray expression of more than 1000 genes [7]. Some discussion is given in Section 6.

    2. PRINCIPAL COMPONENTS

We start with a brief mathematical derivation of principal components. More details can be found in reference [1] or in references [2] and [3]. Throughout, vectors are viewed as column vectors, and A′ denotes the transpose of a matrix A.



Consider a p-variate random vector X with mean vector μ and positive definite covariance matrix Σ. Let λ_1 ≥ λ_2 ≥ ··· ≥ λ_p (> 0) be the eigenvalues of Σ and let W = (w_1, …, w_p) be a p × p orthogonal matrix such that

W′ΣW = Λ = diag(λ_1, …, λ_p)    (1)

so that w_i is an eigenvector of Σ corresponding to the eigenvalue λ_i. Now put U = W′X = (U_1, …, U_p)′; then cov(U) = Λ, so that U_1, …, U_p are all uncorrelated, and var(U_i) = λ_i, i = 1, …, p. The linear components U_1, …, U_p are called the principal components of X. The first principal component is U_1 = w_1′X and its variance is λ_1; the second principal component is U_2 = w_2′X with variance λ_2; and so on. These p principal components have the following key property. The first principal component U_1 is the normalized (unit length) linear combination of the components of X with the largest variance, and its maximum variance is λ_1; then, out of all normalized linear combinations of the components of X which are uncorrelated with U_1, the second principal component U_2 has maximum variance λ_2. In general, the kth principal component U_k has maximum variance λ_k among all normalized linear combinations of the components of X which are uncorrelated with U_1, …, U_{k-1}.

Very often these principal components are referred to as population principal components.

In practice Σ is not known and has to be estimated from the sample, yielding the sample principal components. We do not distinguish these two definitions here.

Once the p principal components are derived, we can conduct various statistical analyses using only the first q (< p) principal components, which account for most of the variance of X. For example, we can plot the first two (three) principal components in a two- (three-) dimensional space to seek interesting patterns among the data, or perform clustering analysis on subjects in order to search for clusters among the data. We can also use these leading principal components as regressors in a regression analysis to find prognostic factors for clinical outcomes (for example, drug response or resistance). See reference [1] for various other applications of principal component analysis.

Derivation of principal components involves computation of the eigenvalues and eigenvectors of the p × p matrix Σ (or its sample estimate). When p is very large, the computation becomes extremely extensive. Moreover, investigators are always interested in examining the first several leading principal components in order to find useful information. With a linear combination of a large number of variables, this becomes extremely difficult and the results are hard to interpret. To deal with these problems, we develop the block principal component analysis method discussed in the following sections.

    3. BLOCK PRINCIPAL COMPONENT ANALYSIS

Ordinary principal component analysis needs to find an orthogonal matrix W such that W′ΣW is diagonal. In a very extreme case, when all of the components of X are independent, the p principal components are the p components of X, and W is merely some permutation of the identity matrix, rearranging the components of X according to their variances. If the random vector X can be partitioned into k uncorrelated random subvectors, so that Σ has diagonal blocks, then performing principal component analysis with X is equivalent to performing principal component analysis with each subvector and then combining all the principal components from all subvectors. This simple fact leads to the consideration of block principal component analysis even when X does not have uncorrelated partitions.

Let X be partitioned as X = (X_1′, …, X_k′)′ with X_i being p_i-dimensional, where p_1 + ··· + p_k = p, and let Σ be partitioned accordingly as

Σ = [ Σ_11  Σ_12  …  Σ_1k
      …     …        …
      Σ_k1  Σ_k2  …  Σ_kk ]    (2)

Let W_i = (w_i1, …, w_{ip_i}), i = 1, …, k, be a p_i × p_i orthogonal matrix such that

W_i′ Σ_ii W_i = Λ_i = diag(λ_{i1}, …, λ_{ip_i}),   λ_{i1} ≥ ··· ≥ λ_{ip_i}    (3)

so that w_ij, j = 1, …, p_i, is an eigenvector of Σ_ii corresponding to the eigenvalue λ_ij. Put U_i = W_i′X_i = (U_{i1}, …, U_{ip_i})′; then the p_i components U_ij, j = 1, …, p_i, of U_i define the p_i principal components, referred to as block principal components, with respect to the random vector X_i, the ith block of variables of X.

Now define

Q = diag(W_1, …, W_k)    (4)

also an orthogonal matrix, and

Y = Q′X = (U_1′, …, U_k′)′    (5)

a random vector combining all block principal components; then

cov(Y) = Σ* = Q′ΣQ = [ Λ_1             W_1′Σ_12 W_2  …  W_1′Σ_1k W_k
                       …               …                …
                       W_k′Σ_k1 W_1   W_k′Σ_k2 W_2  …  Λ_k ]    (6)

Note that Σ and Σ* have the same eigenvalues, and in particular tr(Σ*) = tr(Σ), where tr stands for the trace (sum of all diagonal elements) of a matrix. Hence X and Y have equal total variance. Let W be defined as in (1), and let

R = Q′W    (7)

then R is also an orthogonal matrix and

R′ cov(Y) R = W′ΣW = diag(λ_1, …, λ_p)    (8)

that is, the p principal components of Y are identical to those of X.

Hence, we can obtain the principal components of a random vector X in two steps. In the first step, we group the variables in X into several blocks, and then derive principal components for each block of variables. In the second step, we define a new random vector Y by combining all the block principal components and then obtain the principal components of Y, which are identical to the principal components of X.



The geometrical interpretation of block principal component analysis is quite clear. The p-dimensional random vector X represents the p axes of a p-dimensional space. The p principal components rotate the X-space to one whose axes are defined by the p principal components. In order to rotate the original space to its desired direction, we can first group the axes, rotate the axes within each group, and then do an overall rotation to achieve the desired direction.

From the mathematical derivation above, we notice that this procedure always yields the principal components of X, regardless of how the blocks are defined. The choice of blocks, however, does affect several aspects. First, if X can be divided into uncorrelated blocks, then the components of Y are the principal components of X, and there is no need to orthogonalize Y. Second, even when X cannot be partitioned into uncorrelated blocks, if the off-diagonal terms W_i′Σ_ij W_j are relatively small, as measured, say, by a matrix norm (for example, the sum of squares of all elements), then without losing much information we can still use the components of Y as approximations to the principal components of X; a small numerical check of this criterion is sketched below. Third, when reduction in dimension and in the number of variables is conducted within each block, as discussed in the next section, we would expect variables within each block to be much more correlated than variables from two different blocks, so that selection of dimensions and of variables from one block will not be much affected by selection of variables from another block. For these reasons, we recommend grouping the variables into blocks according to their correlation. This can be achieved by clustering the variables using a proper function of Pearson's correlation coefficient as the measure of similarity between variables; one such measure is given in Section 5 of the paper.

    4. DIMENSION REDUCTION AND VARIABLE SELECTION

    4.1. Dimension reduction

A major application of principal component analysis is to reduce data dimension so that the data structure can be explored or even visualized in a low-dimensional space. When the data dimension is extremely high, block principal component analysis allows us to reduce the dimension more effectively without losing much information. We propose the following procedure to achieve low dimension. Suppose k blocks X_i, with dimension p_i and covariance matrix Σ_ii, i = 1, …, k, of the original variables X are determined according to the correlation between variables. For each block X_i we derive the p_i principal components, and retain only the first q_i (< p_i) principal components, say U_ij, j = 1, …, q_i, so that the total variance explained by these q_i principal components is α_i tr(Σ_ii), where 0 < α_i ≤ 1. Now define

Y = (U_11, …, U_{1q_1}, …, U_{k1}, …, U_{kq_k})′    (9)

a variable combining all principal components selected from each block. We then obtain the principal components of Y, and choose the first f principal components, say Z_1, …, Z_f, which explain a high percentage, 100α per cent (for example, α = 95 per cent), of the total variance of Y. Data visualization and classification with the original variables X is then conducted based on these f principal components.

These block principal components preserve many optimal properties of the ordinary principal components: (i) Z_1, …, Z_f are uncorrelated; (ii) var(Z_1) ≥ ··· ≥ var(Z_f), although these variances are no longer the eigenvalues of Σ, the covariance matrix of the original variables X, but rather eigenvalues of the covariance matrix of Y; and (iii) the total variance of Z_1, …, Z_f is

tr[cov(Y)] = ∑_{i=1}^{k} ∑_{j=1}^{q_i} var(U_ij) = ∑_{i=1}^{k} α_i tr(Σ_ii)

which accounts for 100β per cent of the total variance of X, where

β = ∑_{i=1}^{k} α_i tr(Σ_ii) / tr(Σ) = ∑_{i=1}^{k} α_i tr(Σ_ii) / ∑_{i=1}^{k} tr(Σ_ii)

We hence have

β ≥ min{α_i}    (10)

When using principal components to explore (for example, cluster or visualize) the data, we expect the leading components to explain most of the variance so that they reveal the true nature of the data structure; (10) asserts that the block principal components Z_1, …, Z_f will retain most of the variance if, within each block and for the final principal component analysis, the selected principal components explain most of the variance. For example, if α_i ≥ 95 per cent, i = 1, …, k, and α ≥ 95 per cent, then αβ ≥ 90 per cent.

    4.2. Variable selection

When the number p of variables is very large, many variables can be highly correlated with each other, and some may become redundant when the rest are used to explore the data structure. For example, in a gene microarray experiment where the expression of a large number of genes is obtained for a number of tissues, tissue classification based on all genes may be quite similar to that based on a small group of genes. If this is the case, then with respect to tissue classification only these genes are informative and the rest are redundant, assuming that using all the genes indeed captures the real structure of the data. It is therefore important to select variables that contain almost all of the information, with respect to certain statistical properties, that the full set of variables would contain.

Block principal component analysis can be used to select these variables. We propose the following two steps:

Step 1. Divide the original variables X into k blocks, X_i, i = 1, …, k, according to the correlation between variables.

Step 2. For each block X_i, conduct principal component analysis and select the first q_i leading principal components such that a satisfactory amount of the total variance of X_i is explained, say at least 95 per cent. Examine the coefficients (or loadings, in much of the principal component analysis literature) of the variables of X_i in these q_i leading components and retain only those variables with large coefficients. Combine all the variables selected from each block and then use only these variables for further analysis.

    A third step may also be useful if the number of variables selected is still too large:

Step 3. Conduct principal component analysis again, but based only on the variables selected in step 2. Select the first several leading principal components to explain most of the variance. Then examine the variables again and retain those with large coefficients in these leading combinations.

There is no universal criterion for how many and which variables should be selected from the leading principal components. Jolliffe [1] recommended choosing from each leading principal component the variable with the largest absolute coefficient, if that variable has not already been selected from a previous leading component. In practice some modifications of Jolliffe's procedure may also be effective; for example, one can choose from each leading principal component several variables with the largest absolute coefficients. For more discussion, see reference [1]. A sketch of this loading-based selection is given below.

In the next section, we demonstrate the block principal component analysis method using the well-known NCI60 human cancer cell-line data [7] to select a group of genes to visualize/cluster the cell lines. The result shows such selection to be quite effective.
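The selection rule of steps 1-3 might be sketched as follows (our illustration, not the authors' code; `block_index` assigns each variable a block label, as produced, for example, by the blocking sketch given in Section 5 below):

    import numpy as np

    def select_variables(X, block_index, var_threshold=0.95):
        # Within each block: run PCA, keep enough leading components to
        # explain `var_threshold` of the block's variance, then, following
        # Jolliffe's rule, keep for each such component the not-yet-chosen
        # variable with the largest absolute loading.
        selected = []
        for b in np.unique(block_index):
            cols = np.flatnonzero(block_index == b)
            S = np.cov(X[:, cols], rowvar=False)
            lam, W = np.linalg.eigh(S)
            order = np.argsort(lam)[::-1]
            lam, W = lam[order], W[:, order]
            q = int(np.searchsorted(np.cumsum(lam) / lam.sum(), var_threshold)) + 1
            chosen = []
            for j in range(q):                        # one variable per leading PC
                for idx in np.argsort(-np.abs(W[:, j])):
                    if idx not in chosen:             # skip already-selected variables
                        chosen.append(int(idx))
                        break
            selected.extend(cols[chosen].tolist())
        return np.array(sorted(selected))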

    5. APPLICATION TO GENE MICROARRAY ANALYSIS: AN EXAMPLE

The NCI60 database contains the expression of more than 9000 genes in 60 human cancer cell lines from nine types of cancer, including colorectal, renal, ovarian, breast, prostate, lung and central nervous system cancers, as well as leukaemias and melanomas. Gene expression levels are expressed as log(ratio), where ratio is the red/green fluorescence ratio after computational balancing of the two channels. Readers are referred to reference [7] for more details. The data have been made public for analysis on the authors' web site http://discover.nci.nih.gov. To get familiar with DNA microarray technology, readers are referred to references [5] and [6] for more information.

One of the objectives of this study is to explore the relationship between gene profiles and cancer phenotypes. Scherf et al. [7] used a clustering analysis method to study the relationship. They provided the clustering tree of the 60 cell lines, based on 1376 genes, and showed that most of the cell lines cluster together according to their phenotypes (see Figure 2a of reference [7]). One important question is whether a smaller group of genes can preserve the same relationship structure.

We use a selection method based on block principal component analysis, as described in Section 4, to tackle this issue. For simplicity, we study only cell lines from three types of cancer: colorectal (7 cell lines), leukaemia (6 cell lines) and renal (8 cell lines); each cell line has microarray expression of the same 1416 genes. The data set of interest, with 21 cell lines (the subjects) and 1416 genes (the variables), hence forms a 21 × 1416 matrix, representing 21 data points (the 21 rows of the matrix) in a 1416-dimensional data space. The complete-linkage clustering tree based on these 1416 genes is shown in Figure 1(a). The dendrogram is consistent with that in reference [7], and shows clearly that the 21 cell lines cluster according to their cancer phenotypes. The reader is reminded that phenotype information is not used in the clustering, but only to validate the clustering results. One renal cell line, marked RE8, which is farther from the rest of the renal cell lines, has been recognized to have some special features (see reference [7] for detail).

We now seek to determine the blocks for the 1416 genes. Figure 2 shows a plot of the semi-partial R² versus the number of clusters, using the complete-linkage algorithm and d_ij = arccos(|ρ_ij|) as a measure of dissimilarity between gene i and gene j, where ρ_ij is the Pearson correlation coefficient.



[Figure 1. Dendrogram of complete-linkage hierarchical clustering of the 21 cell lines: (a) tree based on 1416 genes; (b) tree based on 200 genes. CO is colorectal, LE is leukaemia and RE is renal.]

The semi-partial R² measures the loss of homogeneity when two clusters are merged. Define SST as the corrected total sum of squares of all subjects, summed over all variables. For a certain cluster C, let SS_C be the corrected total sum of squares of all subjects in cluster C, summed over all variables. Then the semi-partial R² for combining two clusters C_1 and C_2 into one cluster C is (SS_C − SS_{C_1} − SS_{C_2})/SST. A large semi-partial R² indicates a significant decrease in homogeneity. Since subjects within the same cluster should be similar, two clusters should not be combined into one cluster if the semi-partial R² is large. In practice we determine the number of clusters by minimizing the semi-partial R²; a plot of the semi-partial R² versus the number of clusters is extremely helpful. More discussion and the computation of the semi-partial R² can be found in reference [8]. Other statistics can also be used to determine the number of clusters in the data; Milligan and Cooper [9] examined 30 procedures for determining the number of clusters, including several variations based on sums of squares.
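The blocking step can be sketched as follows (our code, not the authors'; it uses SciPy's hierarchical clustering, with `expr` standing for the 21 × 1416 expression matrix and 14 for the number of blocks suggested by the semi-partial R² plot):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def gene_blocks(expr, n_blocks):
        # Group genes (columns of expr) by complete-linkage clustering with
        # dissimilarity d_ij = arccos(|rho_ij|), rho_ij being the Pearson
        # correlation between genes i and j.
        rho = np.corrcoef(expr, rowvar=False)            # gene-by-gene correlations
        d = np.arccos(np.clip(np.abs(rho), 0.0, 1.0))    # clip guards rounding error
        np.fill_diagonal(d, 0.0)
        Z = linkage(squareform(d, checks=False), method='complete')
        return fcluster(Z, t=n_blocks, criterion='maxclust')  # block label per gene

    # e.g. block_index = gene_blocks(expr, 14)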

For the cancer cell-line microarray data, the semi-partial R² becomes nearly flat after 14 clusters. This indicates that the 1416 genes can be approximately divided into 14 clusters; dividing the data further gains little in reducing heterogeneity. These clusters of genes determine the blocks within each of which principal component analysis will be conducted. The number of genes in the blocks ranges from 43 to 158 (Table I).



[Figure 2. Determining the number of blocks: plot of the semi-partial R² versus the number of clusters.]

Table I. Summary of the 14 gene blocks.

Block                1     2     3     4     5     6     7     8     9    10    11    12    13    14
Number of genes    107   154    88   158    68   152    84   136    44    84   143    82    73    43
Number of PCs       14    13    15    15    14    15    14    15    14    16    16    14    14    11
Per cent variance 95.6  95.3  95.2  95.3  95.2  96.0  95.1  95.2  96.0  96.0  95.1  96.0  95.6  95.6

Principal component analysis is conducted within each block, and the first several leading principal components are then selected, resulting in a total of 200 principal components. For each block, the selected principal components explain at least 95 per cent of the total variance in that block. Table I lists, for each block, the number of genes, the number of selected principal components and the percentage of total variance explained by these leading components.

For each block, the genes with the largest coefficients in the selected leading principal components are retained, using Jolliffe's one-variable-per-leading-component method. This yields a total of 200 genes for further analysis.

The first three leading principal components, computed from all 1416 genes, explain only 49 per cent of the total variance, so two- or three-dimensional visualization of the data based on these principal components can be very misleading. We validate the selected 200 genes by deriving a hierarchical clustering tree for the 21 cell lines based on their expression. The dendrogram is shown in Figure 1(b). It is remarkably similar to the one based on all 1416 genes (Figure 1(a)). Both illustrate that cell lines with the same phenotype are more similar than those from different phenotypes. This shows that a much smaller number of genes can provide the same insight into the data as the whole set of genes, and block principal component analysis provides an effective way to achieve this. Note that a hierarchical clustering dendrogram obtained from a set of variables is essentially the same as the dendrogram obtained from the leading principal components, provided that these leading components explain most of the variation among the variables. The remarkable resemblance between Figure 1(a) and Figure 1(b) further demonstrates the effectiveness of the block principal component analysis method as compared to ordinary principal component analysis.
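This kind of validation can be reproduced in outline as follows (our sketch; `expr` is as before and `selected` is the output of the Section 4 selection sketch). Rather than comparing dendrograms by eye, one can correlate their cophenetic distances, with values near 1 indicating that the two trees convey the same structure:

    import numpy as np
    from scipy.cluster.hierarchy import cophenet, linkage
    from scipy.spatial.distance import pdist

    def tree_agreement(expr, selected):
        # Complete-linkage trees of the cell lines built from all genes and
        # from the selected genes only, compared via cophenetic distances.
        tree_all = linkage(pdist(expr), method='complete')
        tree_sel = linkage(pdist(expr[:, selected]), method='complete')
        return np.corrcoef(cophenet(tree_all), cophenet(tree_sel))[0, 1]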

    6. DISCUSSION

In this paper we show that a much smaller number of genes can provide the same insight into the cancer phenotypes as the whole set of genes, and we demonstrate that block principal component analysis is an effective way to select these genes. This kind of analysis is unsupervised, a term popular in neural network/pattern recognition [10]; the cancer phenotypes are used only to validate the algorithm and analysis.

Selection of informative genes in the microarray setting, and in other settings as well, is by no means an easy task, especially when the analysis is unsupervised. Very likely the choice of genes is not unique; there may exist several groups of genes that provide the same classification. Biostatisticians should provide every potential group of genes to the medical investigators, so that a meaningful group of genes can be determined by combining statistical guidance with biological knowledge. Indeed, some preliminary selection of genes based on biological knowledge is extremely valuable, even before any statistical analysis is conducted. It should be noted, however, that genes that are biologically similar/dissimilar may not be statistically similar (correlated)/dissimilar (uncorrelated).

    ACKNOWLEDGEMENTS

The authors would like to thank the editor and three anonymous referees for their valuable comments and suggestions, which have improved the manuscript.

    REFERENCES

1. Jolliffe IT. Principal Component Analysis. Springer-Verlag: New York, 1986.
2. Anderson TW. An Introduction to Multivariate Statistical Analysis. 2nd edn. Wiley: New York, 1984.
3. Rencher AC. Methods of Multivariate Analysis. Wiley: New York, 1995.
4. Hilsenbeck SG, Friedrichs WE, Schiff R, O'Connell P, Hansen RK, Osborne CK, Fuqua SA. Statistical analysis of array expression data as applied to the problem of tamoxifen resistance. Journal of the National Cancer Institute 1999; 91:453-459.
5. Cheung VG, Morley M, Aguilar F, Massimi A, Kucherlapati R, Childs G. Making and reading microarrays. Nature Genetics Supplement 1999; 21:15-19.
6. Brown PO, Botstein D. Exploring the new world of the genome with DNA microarrays. Nature Genetics Supplement 1999; 21:33-37.
7. Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L, Kohn KW, Reinhold WC, Myers TG, Andrews DT, Scudiero DA, Eisen MB, Sausville EA, Pommier Y, Botstein D, Brown PO, Weinstein JN. A gene expression database for the molecular pharmacology of cancer. Nature Genetics 2000; 24:236-244.
8. Khattree R, Naik DN. Multivariate Data Reduction and Discrimination with SAS Software. SAS Institute Inc.: Cary, NC, 2000.
9. Milligan GW, Cooper MC. An examination of procedures for determining the number of clusters in a data set. Psychometrika 1985; 50:159-179.
10. Ripley BD. Pattern Recognition and Neural Networks. Cambridge University Press: Cambridge, 1996.
