bmc bioinformatics biomed central - connecting repositories · types of pattern recognition and...

17
BioMed Central Page 1 of 17 (page number not for citation purposes) BMC Bioinformatics Open Access Research article Sample phenotype clusters in high-density oligonucleotide microarray data sets are revealed using Isomap, a nonlinear algorithm Kevin Dawson*, Raymond L Rodriguez and Wasyl Malyj Address: Laboratory for High Performance Computing and Informatics and Section of Molecular and Cellular Biology University of California Davis MCB, One Shields Avenue Davis, CA 95616. USA Email: Kevin Dawson* - [email protected]; Raymond L Rodriguez - [email protected]; Wasyl Malyj - [email protected] * Corresponding author Abstract Background: Life processes are determined by the organism's genetic profile and multiple environmental variables. However the interaction between these factors is inherently non-linear [1]. Microarray data is one representation of the nonlinear interactions among genes and genes and environmental factors. Still most microarray studies use linear methods for the interpretation of nonlinear data. In this study, we apply Isomap, a nonlinear method of dimensionality reduction, to analyze three independent large Affymetrix high-density oligonucleotide microarray data sets. Results: Isomap discovered low-dimensional structures embedded in the Affymetrix microarray data sets. These structures correspond to and help to interpret biological phenomena present in the data. This analysis provides examples of temporal, spatial, and functional processes revealed by the Isomap algorithm. In a spinal cord injury data set, Isomap discovers the three main modalities of the experiment – location and severity of the injury and the time elapsed after the injury. In a multiple tissue data set, Isomap discovers a low-dimensional structure that corresponds to anatomical locations of the source tissues. This model is capable of describing low- and high- resolution differences in the same model, such as kidney-vs.-brain and differences between the nuclei of the amygdala, respectively. In a high-throughput drug screening data set, Isomap discovers the monocytic and granulocytic differentiation of myeloid cells and maps several chemical compounds on the two-dimensional model. Conclusion: Visualization of Isomap models provides useful tools for exploratory analysis of microarray data sets. In most instances, Isomap models explain more of the variance present in the microarray data than PCA or MDS. Finally, Isomap is a promising new algorithm for class discovery and class prediction in high-density oligonucleotide data sets. Background The gene expression microarray is an assay that measures expression levels of tens of thousands of genes in parallel on a single chip. Microarrays can be performed from a very small amount of a biological sample, thus allowing for an experimental design involving many sample groups, repeats, dense time series, and samples collected at high-granularity from various anatomic locations. Published: 02 August 2005 BMC Bioinformatics 2005, 6:195 doi:10.1186/1471-2105-6-195 Received: 17 December 2004 Accepted: 02 August 2005 This article is available from: http://www.biomedcentral.com/1471-2105/6/195 © 2005 Dawson et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Upload: others

Post on 12-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BioMed CentralBMC Bioinformatics

ss

Open AcceResearch articleSample phenotype clusters in high-density oligonucleotide microarray data sets are revealed using Isomap, a nonlinear algorithmKevin Dawson*, Raymond L Rodriguez and Wasyl Malyj

Address: Laboratory for High Performance Computing and Informatics and Section of Molecular and Cellular Biology University of California Davis MCB, One Shields Avenue Davis, CA 95616. USA

Email: Kevin Dawson* - [email protected]; Raymond L Rodriguez - [email protected]; Wasyl Malyj - [email protected]

* Corresponding author

AbstractBackground: Life processes are determined by the organism's genetic profile and multipleenvironmental variables. However the interaction between these factors is inherently non-linear[1]. Microarray data is one representation of the nonlinear interactions among genes and genes andenvironmental factors. Still most microarray studies use linear methods for the interpretation ofnonlinear data. In this study, we apply Isomap, a nonlinear method of dimensionality reduction, toanalyze three independent large Affymetrix high-density oligonucleotide microarray data sets.

Results: Isomap discovered low-dimensional structures embedded in the Affymetrix microarraydata sets. These structures correspond to and help to interpret biological phenomena present inthe data. This analysis provides examples of temporal, spatial, and functional processes revealed bythe Isomap algorithm. In a spinal cord injury data set, Isomap discovers the three main modalitiesof the experiment – location and severity of the injury and the time elapsed after the injury. In amultiple tissue data set, Isomap discovers a low-dimensional structure that corresponds toanatomical locations of the source tissues. This model is capable of describing low- and high-resolution differences in the same model, such as kidney-vs.-brain and differences between thenuclei of the amygdala, respectively. In a high-throughput drug screening data set, Isomap discoversthe monocytic and granulocytic differentiation of myeloid cells and maps several chemicalcompounds on the two-dimensional model.

Conclusion: Visualization of Isomap models provides useful tools for exploratory analysis ofmicroarray data sets. In most instances, Isomap models explain more of the variance present in themicroarray data than PCA or MDS. Finally, Isomap is a promising new algorithm for class discoveryand class prediction in high-density oligonucleotide data sets.

BackgroundThe gene expression microarray is an assay that measuresexpression levels of tens of thousands of genes in parallelon a single chip. Microarrays can be performed from a

very small amount of a biological sample, thus allowingfor an experimental design involving many samplegroups, repeats, dense time series, and samples collectedat high-granularity from various anatomic locations.

Published: 02 August 2005

BMC Bioinformatics 2005, 6:195 doi:10.1186/1471-2105-6-195

Received: 17 December 2004Accepted: 02 August 2005

This article is available from: http://www.biomedcentral.com/1471-2105/6/195

© 2005 Dawson et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Page 1 of 17(page number not for citation purposes)

Page 2: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

Today, the cost of microarrays is the principal factor lim-iting the number of samples that can be examined in aparticular experiment. In spite of the high cost of microar-rays, two thirds of those surveyed by GenomeWeb saidthey performed more than 200 microarrays and 57%spent more than $100,000 on microarrays in 2003 [2].Sixty eight percent of these chips were oligonucleotidearrays, mostly Affymetrix chips. With the widespread useof microarrays in basic research and their increasing use inmedical diagnostics, biomedical researchers can antici-pate lower costs for chips that will lead to more studiesutilizing hundreds, if not thousands, of samples. Thisexpansion in sample size will provide researchers withhigher resolution insights into biological processes as theyare reflected in temporal, spatial, and functional patternsin microarray data sets. To reveal these patterns, severaltypes of pattern recognition and clustering techniqueshave been developed and applied to microarray data.

A common task in the analysis of large microarray datasets is sample classification based on gene expression pat-terns. This process can be divided into two steps: class pre-diction and class discovery. During class predictionsamples are assigned to predefined sample classes;whereas class discovery is the process of establishing newsample classes. For example, when gene expression arraysare used for cancer classification, class prediction assignstumor samples into pre-existing groups of malignancies,while class discovery reveals previously unknown cancersubtypes [3]. The newly discovered tumor subtypes mayhave different clinical patterns, respond differently to cer-tain drugs, and require more or less aggressive surgical andradiological treatment. Class discovery may also revealpreviously unknown processes in cancer biology anddefine more specific indications for certain drugs. Specificdrugs may be used to target newly discovered tumor sub-types, thus facilitating pharmacogenomic drug design anddevelopment. These goals will soon become achievablewith the results from microarray studies using large sam-ples. Class prediction and class discovery using large datasets will require the evaluation, adaptation, and develop-ment of robust mathematical, statistical, and computa-tional tools.

Several mathematical algorithms and computationalmethods have been applied to class prediction and classdiscovery in large gene expression data sets. The methodsmost frequently used are based on clustering techniquessuch as hierarchical clustering (HC) [4]. HC was used fortemporal classification in conjunction with Fourier analy-sis to detect genes that correlate with periodic changes insynchronized S. cerevisiae cells [5]. HC was also applied tocancer classification, for example breast cancer classifica-tion [6]. Other clustering techniques applied to microar-ray data are the unstructured k-means clustering [7],

cluster affinity search (CAST) [8], fuzzy c-means clustering[9], and two-way clustering [10] that was used for theanalysis of drug-tumor interactions [11]. Self-organizingmaps (SOM) is another technique that is particularly wellsuited for exploratory data analysis. Unlike HC, SOM doesnot impose a rigid structure to the data [12]. The utility ofSOM was demonstrated in leukemia classification using aweighted voting procedure [3]. Weighted voting classifica-tion was also used for predicting chemosensitivity of theNCI-60 tumor collection [13] and human breast cancers[6]. Supervised methods, such as Fisher's linear discrimi-nant analysis, artificial neural networks (ANN), supportvector machines (SVM), and boosting, have the advantagethat the sample classes are usually defined by another"gold standard" method (e.g., histology, clinical outcome,length of survival, etc). In supervised classification, thechoice of classifiers is frequently based on other consider-ations, e.g., genes that play a role in the pathomechanismof a certain disease or are expressed in a particular tissue.These "enrichment" methods can improve the predictionstrength of a classification and decrease the samplenumber necessary for developing a prediction model.However they also introduce bias that may lead to over-training or lack of discovery of unexpected sample classes.One of the supervised methods, support vector machines(SVM), has the advantage that it does not make assump-tions about the distribution of the data [14]. SVM wastested on ovarian cancer, leukemia, and colon tumor datasets [15] and was also demonstrated to be useful formulti-class cancer classification [16-18]. Artificial neuralnetworks (ANN), another machine learning method, wasshown to classify small, round blue-cell tumors (SRBCT)[19]. Tree harvesting, a new method of supervised learn-ing was recently applied for gene expression data [20]. Incontrast to these complex procedures, much simpler clas-sifiers may also perform equally well on some data sets.For example, nearest shrunken centroids were applied toSRBCT cells and leukemias [21].

Dimensionality reduction methods are useful in predict-ing the underlying true dimensionality of a microarraydata set and reduce the number of variables applied asinputs to any of the classification procedures. Multi-dimensional scaling (MDS), a linear method, was used forclassifying alveolar rhabdomyosarcomas [22], cutaneousmalignant melanomas [23], and breast cancers [24]. Prin-cipal component analysis (PCA), another dimensionalityreduction method, was used for visualizing gene expres-sion maps of central nervous system (CNS) development[25] and classifying embryonal CNS tumors [26]. Proba-bilistic PCA, a method incorporating biological assump-tions in linear factor models, was recently applied to twoyeast microarray datasets [27]. Another method, singularvalue decomposition (SVD), was applied to soft tissuetumors [28], S. cerevisiae cell cycle, and serum-induced

Page 2 of 17(page number not for citation purposes)

Page 3: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

fibroblast data sets [29,30]. Generalized SVD (GSVD) wasdeveloped to extract conserved gene expression patternscomparable between two different organisms [31]. Allthese linear methods are inherently sensitive to outliers,missing values, and non-normal distribution. A variant ofSVD, robust SVD (rSVD), was recently developed to min-imize the effect of these corruptions in the data set [32].Sammon mapping, a nonlinear mapping algorithm [33]was incorporated in the R multiv package as well as geneexpression data processing and exploratory analysis soft-ware, such as ENGENE [34]. To address similar issues, Iso-map [35] a nonlinear technique of dimensionalityreduction originally designed for solving classical prob-lems of pattern recognition, such as visual perception andhandwriting recognition, was applied recently to discov-ery biologically relevant structures in cDNA microarrays[36,37].

We have applied Isomap to the analysis of both breastcancer microarray data sets and prostate cancer proteom-ics spectra [36] and showed that it consistently outper-formed PCA in revealing biologically relevant low-dimensional structures in high-dimensional data sets.Nilsson et al. independently demonstrated the utility ofIsomap using a lymphoma and a lung adenocarcinomacDNA microarray data set [37]. We report here the appli-cation of Isomap to three independent Affymetrix Gene-Chip® data sets. We show that Isomap is capable ofdiscovering temporally, spatially, and functionally rele-vant structures in gene expression data. To avoid any kindof bias that may be introduced with gene selection or featureenrichment methods, we did not apply any data scrubbing, fil-tering, or feature enrichment techniques. We also show thatIsomap can successfully detect biologically relevant struc-tures even in the background noise of tens of thousands ofgenes present.

ResultsSpinal cord injury data setThe Isomap algorithm was first evaluated on a large dataset consisting of 170 rat U34A high density oligonucle-otide arrays with 8,799 genes on each array [38]. Thesedata were originally collected for a study on spinal cordinjury that illustrated the role of cell cycle in trauma-induced neuronal death [38]. Compared to earlier micro-array studies on experimental spinal cord damage, thisstudy applied a lower level spinal cord injury, used indi-vidual rat samples rather than pooled spinal cord tissues,evaluated several time points, employed larger Affymetrixarrays, and used both sham-injured and naïve controls.

Unlike this original study, we do not make the assump-tion that only those genes "consistently expressed abovebackground" are important for further consideration. TheDiGiovanni et al. study used a stringent inclusion thresh-

old that included only those genes present in at least 40%of all the samples. In addition to this first filter, a secondfilter was also applied that eliminated genes that did nothave a change in expression level of at least two-fold com-pared to that of the sham controls, which was determinedwith Welch ANOVA t-test. Unlike the DiGiovanni et al.study, our analysis considers all 8,799 genes, thus elimi-nating an important source of potential selection bias. Wedo not use stringent selection criteria at the level of datascrubbing, because these filters may introduce confound-ing into the data set, which can lead to separation betweensample classes based on subjective filtering criteria ratherthan existing biological phenomena.

One hundred seventy samples were collected from spinalcords of naïve rats, after sham operations, as well as mild,moderate, and severe injuries. In addition to these fiveseverity classes, samples are also classified into one ofthree locations, such as below, above, and at the positionof the spinal cord injury. The third classification categoryis the time interval from the injury to the sample collec-tion. These time points are: 0 min, 30 min, 4 h, 24 h, 2days, 3 days, 7 days, 14 days, and 28 days. All the 170samples were subjected to Isomap analysis in a com-pletely unsupervised fashion without the a priori knowl-edge of which classes the sample belongs to. Isomap fits anonlinear manifold on the 170 samples. This manifold isused to express sample-to-sample distances as path dis-tances on the surface of the manifold rather than thedirect Euclidean distance without considering the exist-ence of the manifold. Distances are computed for all pairsof the 170 samples and subjected to multi-dimensionalscaling. The result of this procedure is presented in Figure2 in a three dimensional coordinate system where eachsphere represents one sample.

Figure 2 shows the 170 samples in a three-dimensionalmodel that was generated by Isomap. Samples are classi-fied and colored according to one of three major attributes– time, location, and severity of the injury. The spherediameters express the compactness of the sample classes;the more compact the class, the smaller the diameter ofthe sphere. If members of a class spread out in a largerspace then larger size spheres are displayed. The spherediameter is determined as the 95 percentile of the distancedistribution from the nearest neighbor within the class.The Isomap model successfully clusters similar samplesinto groups that are at well-defined locations of the three-dimensional model. The least affected sham-operatedsamples are shown at the very central core location of themodel. By contrast, the most affected samples after mod-erate to severe injury at 24 h (that is, the most active phaseof spinal cord injury) are shown at the peripheries of themodel.

Page 3 of 17(page number not for citation purposes)

Page 4: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

The location panel (Fig. 2A) shows that samples fromregions below and above the injury are at the central coreof the model. At the same time, samples from the injuryitself are displayed in the peripheral lobes of the model.There is no visible separation between the unaffected sam-ples and those originating from locations other than theinjury itself.

The time panel (Fig. 2B) shows a clear-cut time-dependentseparation of the samples along the x-axis. The right-sidelobe of the model shows samples at 30 min, 4 h, and 24h after injury. The earlier samples are at the core of the

model. Conversely, samples from later time points arefound at more distal locations in the right-side lobe of themodel. This time-pattern is in agreement with Di Gio-vanni et al. [38] who found with temporal clustering thatthe immediate early genes are over-expressed at 30 minafter spinal cord injury. This is followed by genes associ-ated with inflammatory and oxidative stress plateauing at4 h after injury, as well as cell cycle and neuronal apopto-sis-regulating genes at 4–24 h. Isomap effectively sepa-rates this early active phase of post-injury damage fromthe later phase of regeneration that takes place at 48 h –28 days. Samples from this later regenerative phase, at

Isomap analysis of the spinal cord injury data setFigure 2Isomap analysis of the spinal cord injury data set. Three-dimensional Isomap models were generated from 170 rat high density oligonucleotide arrays with 8,799 genes on each array as described in Systems and Methods. Samples were classified based on the A: time, B:location, and C:severity of the spinal cord injury. Sphere diameters express the 95 percentile distance from the nearest neighbor within the group. Neighborhood size: k = 3. D: Residual variance after the application of Isomap, MDS, and PCA models. See text for more detail. Animated images are presented in Additional Files 1, 2, 3.

Page 4 of 17(page number not for citation purposes)

Page 5: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

time points of 2, 3, 7, 14, and 28 days, are displayed in theleft-side lobe of the Isomap model. During regeneration,the earlier samples are shown in the more peripheral loca-tions and the later samples are closer to the central core ofthe model. It is noteworthy that some samples are at theperipheral locations of the model even 28 days after theinjury, which is explained by incomplete regeneration andpermanent damage caused by a more severe spinal cordinjury as will be demonstrated later. In the time panel, Iso-map shows the most striking separation between the earlyand later phases of post-injury events. The right-side lobeof the model contains samples from the early 30 min – 24h phase of the post-injury damage. In these samples, thedominant events are the spinal cord injury, the secondarybiochemical changes, and the endogenous autodestruc-tive events. On the other hand, the left-side lobe of theIsomap model contains samples from the later 48 h-28day phase when the neuroprotective and recovery pro-moting phenomena overcome the earlier autodestructiveevents.

The severity panel (Fig. 2C) shows that samples taken afterincremental levels of spinal cord injury appear at increas-ingly distal locations in the Isomap model. The sham-operated samples are located in the central core of themodel surrounded by a shell of samples from rats withmild spinal cord injury. The most peripheral samples inboth lobes of the model are those representing moderateto severe injury. In the right lobe of the model, which rep-resents the 30 min – 24 h time points, there is no clearseparation between moderate injury samples and thosewith severe injury. Unexpectedly, in the left lobe that rep-resents the later time points, the samples with moderateinjury are more distal than those with severe injury.Although this separation between the moderate andsevere injury was not anticipated, the separation betweenmild and moderate to severe injury is very clear. This sep-aration also answers a question we raised on the timepanel. In the left lobe of the model, six samples are visibleat 28 days after injury. Three of these samples are in a clus-ter that is closer to the periphery and the other three sam-ples are closer to the core of the model. The severity panelclearly shows that the three distal samples are those thatunderwent moderate injury and the more central threesamples are those with only mild injury. This finding isconsistent with our interpretation that the left lobe of themodel represents the regeneration process. While threerats were able to partially recover from a mild injury after28 days, the other three animals with moderate spinalcord injury did not recover by this time.

For comparison, in addition to Isomap analysis we alsoused hierarchical clustering [4] to cluster the 170 samplesbased on their gene expression patterns (Fig. 1). After clus-tering, the orientation of the branches of the clustering

tree is undefined. Therefore, we used SOM to fold theleaves in an order that places the more similar samplescloser to each other [12]. Although hierarchical clusteringand SOM clusters similar samples into well-definedgroups, the result is inferior to the Isomap model becausethe tree structure can display only one-dimensional struc-tures, which is inadequate for displaying complex multi-dimensional phenomena such as spinal cord injury. Inthis experiment, several modalities are present, such aslocation and severity of the injury and the elapsed timeafter injury. Unlike clustering, Isomap analysis discoversthese three major modalities and presents the 170 sam-ples in a fashion that the underlying biological processescan be interpreted based on these three modalities.

Rat multiple tissue gene expression data setIsomap was applied to another data set containing 122samples from 11 peripheral tissues and 15 brain regionsfrom three common outbred rat strains, such as Wistar,Wistar Kyoto, and Sprague Dawley. The original study wasperformed to characterize physiological expression levelsin these anatomical locations, to find genes that areimportant in certain brain regions, and to explain the phe-notypic variations between rat strains [39]. This data setconsists of 122 Affymetrix rat U34A arrays with 8,799genes on each array.

Using Affymetrix MAS5, Walker et al. applied several fil-ters on the genes, such as t-test p-value less than 0.05, atleast two-fold change of expression levels, and minimumexpression thresholds. Additionally, they used RosettaResolver with two filters such as Resolver ANOVA p-valueless than 0.05 and at least two-fold change of expressionlevels. Our Isomap analysis of this data set used all genesto compute sample-to-sample differences without the useof input filters. Although our approach may increase thenoise in the data set, it also eliminates any potentialsource of selection bias. We also avoided using any tissue-enrichment techniques based on a priori knowledge. Iso-map analysis was carried out in a completely unsuper-vised fashion.

We were surprised by the ability of Isomap to sort severalrat tissue samples in a fashion that reflects the topologicalanatomy of the source tissues (Fig. 2A). It is apparent inthis figure that duplicate samples from the same tissue aredisplayed at almost identical locations on the Isomapmap. For example, on the top left side of the figure, dupli-cates from small intestine, large intestine, and endothelialsamples are presented as three dots. These three tissues aredisplayed close to each other, which is expected since theprimary components of all three are epithelial cells.Nearby is displayed the kidney, a large part of which isalso derived from epithelial cells. Below these samples onthe mid left part of the figure close to each other are the

Page 5 of 17(page number not for citation purposes)

Page 6: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

Hierarchical clustering of samples in the spinal cord injury data setFigure 1Hierarchical clustering of samples in the spinal cord injury data set. One hundred seventy rat high density oligonucle-otide arrays were clustered with hierarchical clustering as described in Systems and Methods. The samples are annotated on time, location, and severity of the injury.

Page 6 of 17(page number not for citation purposes)

Page 7: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

spleen and thymus samples, two organs of the immunesystem. On their right is the bone marrow which is theprimary site of blood cell generation and therefore func-tionally related to the spleen and thymus. On the bottomleft part of the figure, close to each other are the two majormuscle tissues, such as the skeletal muscle and the heartmuscle. At the bottom of the figure is the cornea that is anectodermal tissue close to the brain samples that are alsoectodermal in origin and farther away from the endoder-mal and mesodermal tissues mentioned above. The twoendocrine glands of the brain, such as the pineal and pitu-itary glands, are somewhat separate from the other partsof the brain.

Figure 3D is a three-dimensional Isomap model of anenlarged region of panel A. The bottom left corner of thisfigure is the cornea below the primary cortical neurons(PCN). Right of the PCN is the dorsal striatum (DS) andthe nucleus accumbens. Walker et al. were specificallyinterested in studying drug-seeking behavior; thereforeseparate samples were collected from the core and shellregions of the nucleus accumbens. In our Isomap model,these samples are displayed very close to each other.Above the nucleus accumbens is displayed the ventral teg-mental area (VTA) and the amygdala (A). Near to the topof the figure are the dorsal raphe (DR) and the hypothala-mus (H). Left of the hypothalamus is the pituitary, anendocrine gland under the control of the hypothalamus.Even farther left is the pineal gland, another endocrineorgan related to the pituitary gland. On the right middlepart of panel D, are two samples from the locus ceruleuspresented at almost identical locations in the Isomapmodel. In the top right portion of panel D are corticallocations that are displayed in panel E in a two-dimen-sional Isomap model.

Figure 3E is a two-dimensional Isomap model of anotherenlarged region of panel A. This panel shows the neigh-borhood graphs used for building the Isomap model. Onthe left side of this panel is represented the prefrontal cor-tex (PFC) that is connected to the temporal/parietal/occipital cortex (cortex minus frontal cortex, CMFC). Atthe bottom of this panel are the samples from the frontalcortex (FC) below two hypothalamus (H) samples. Abovethese are samples from the amygdala (A), the amygdalacentral nucleus (ACN), which is a part of the amygdala,the ventral striatum (VS), and the dorsal striatum (DS).Not circled are two samples from the Fisher dorsal rootganglia (DRGF). Figure 3E also shows that most of theamygdala samples are close to the ACN. Unexpectedly,two other amygdala samples are in a very different loca-tion showed on panel D close to the VTA. Similar to theamygdala, the hypothalamus (H) is also shown at two dif-ferent locations; two hypothalamus samples are shown in

panel E below the amygdala, and two other samples are inpanel D close to the dorsal raphe (DR).

Isomap analysis projects the hypothalamic samples ontotwo different well-defined locations in the Isomap model.This separation of the hypothalamus samples may reflectan unequal contribution of the different hypothalamicregions to the four hypothalamic samples and provides anexample of class discovery. The hypothalamus has severalanatomic regions, one of which is the periventricularhypothalamus that is functionally related to the pituitaryas well as to the autonomic areas in the brain stem and thespinal cord. Another hypothalamic region, the medialhypothalamus has several connections with the medialdivision of the amygdala. The third hypothalamic region,the lateral preoptic hypothalamus, has a very complexanatomy with many fibers passing through this region.These subclasses may be explained by spatial or temporaldifferences. The hypothalamus samples on panel D closeto the pituitary samples may originate from samples thatare rich in periventricular hypothalamus. On the otherhand, samples on panel E close to the amygdala samplesmay be richer in medial hypothalamus. The separation ofthe hypothalamus samples into two well defined classesmay be explained by not only anatomical but temporaldifferences. Many of the hypothalamic nuclei synthesizeseveral neurotransmitters and hormones, such as cortico-tropin-releasing hormone (CRH) and other related releas-ing hormones. The levels of these molecules and theactivity of particular hypothalamic neurons may vary bytime depending on the circadian rhythm, stress, foodintake, emotional state, estrus cycle, and other factors. Thetwo hypothalamic subclasses discovered by Isomap maybe different because of these temporal variations.

Most of the amygdala (A) samples are projected on theIsomap map at a location between the cortical regions andthe ventral striatum (Fig. 3E). Two samples, on the otherhand, are projected at the ventral tegmental area (VTA)and the nucleus accumbens (panel E). These twoamygdala samples are clearly different from the rest of theamygdala samples. This separation is consistent with ana-tomical dissimilarities between different parts of the amy-gdala. The largest portion of the amygdala, the basolateralnuclear complex, primarily consists of pyramidal and stel-late neurons similar to those in the cerebral cortex. In fact,Isomap found most of the amygdala samples in the prox-imity of the cerebral cortex. Another part of the amygdala,the centromedial group, including the central nucleus(ACN), is connected to the bed nucleus through fiberscalled the stria terminalis. Although the bed nucleus isanatomically closer to the hypothalamus, it is histologi-cally very similar to the amygdala. This separation demon-strates a potential class discovery by Isomap, where thediscovered classes are in agreement with anatomically,

Page 7 of 17(page number not for citation purposes)

Page 8: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

Isomap analysis of the rat multiple tissue gene expression data setFigure 3Isomap analysis of the rat multiple tissue gene expression data set. Three- and two-dimensional Isomap models were generated from a data set containing 122 samples from 11 peripheral tissues, 15 brain regions of three common out-bred rat strains, such as Wistar, Wistar Kyoto, and Sprague Dawley as described in Systems and Methods. Panel A shows an overview of a two-dimensional model using k = 3 nearest neighbors. Panel B shows the residual variance after the application of Isomap, MDS, and PCA models. Panel C shows a 2-dimensional comparison map generated by Principal Component Analysis. This panel demonstrates that PCA, unlike Isomap, is less accurate in mapping the different tissues correctly. Arrows show that (1) the skeletal muscle and kidney samples are overlapping the forebrain samples, (2) the thymus and spleen samples overlap the nucleus accumbens samples, and (3) the small intestine samples are mapped at a distance from the (4) large intestine samples. Panel D shows a portion of panel A with a three-dimensional model with k = 4. Panel E presents another region of the over-view map representing the central brain structures. The neighborhood graphs demonstrate the k = 3 nearest neighbors. Panel F presents the anatomic locations of the samples on brain slice images from the Neuroscience Division, Regional Primate Research Center, University of Washington: BrainInfo (2000) http://braininfo.rprc.washington.edu. The abbreviations are: A, amygdala; ACN, amygdala central nucleus; CMFC, cortex minus frontal cortex; DR, dorsal raphe; DRGF, Fisher dorsal root gan-glia; DS, dorsal striatum; endoth, endothel; FC, frontal cortex; H, hypothalamus; LIntestine, large intestine; LocCer, locus ceruleus; NAccumb, nucleus accumbens; PCN, primary cortical neurons; PFC, prefrontal cortex; Pineal, pineal gland; Pituit, pituitary gland; SIntestine, small intestine; SkelMusc, skeletal muscle; Thym, thymus; VS, ventral striatum; and VTA, ventral tegmental area. See text for more detail.

Page 8 of 17(page number not for citation purposes)

Page 9: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

histologically, and functionally well defined structureswithin the hypothalamus and the amygdala. Some of thehypothalamus and amygdala samples potentially containmore or less tissue from different parts of these anatomi-cally heterogeneous brain structures, which leads to differ-ent classifications based on the major constituent.

As a comparison, we analyzed the multiple tissue datasetalso with Principal Component Analysis (PCA) (Fig. 3C).Unlike Isomap, PCA was unable to separate the skeletalmuscle, kidney, and bone marrow samples from eachother (arrow 1). With PCA, the thymus and spleen sam-ples overlap the nucleus accumbens samples (arrow 2);and the small intestine (arrow 3) and large intestine(arrow4) samples are mapped at a distance from eachother. This comparison demonstrates that Isomap outper-forms PCA in mapping the different tissue samples on a 2-dimensional space. This difference in performance of thetwo methods is also supported by the higher residual var-iance of the PCA model compared to that of the Isomapmodel (Fig. 3B).

High throughput drug screening data setThe Isomap algorithm was also applied to a high through-put drug screening data set [40]. The goal of this study wasto develop a general approach for identifying gene expres-sion signatures as surrogates for cellular states in highthroughput drug screening experiments. Particularly, Steg-maier et al. screened 1,739 chemical compounds to iden-tify those capable of inducing terminal differentiation ofacute myeloid leukemia (AML) cells. They exposed undif-ferentiated HL-60 samples (undiff) to several chemicalcompounds. The names and abbreviations of the chemi-cal compounds selected for microarray analysis are listedin Table 1. In addition to the HL-60 samples, Stegmaier etal. also included primary acute promyelocytic leukemia(APL), primary patient AML, normal human neutrophil(poly), and normal human monocyte (mono) cells (Table2). This gene expression data set consists of 86 humangenomic U133A Affymetrix arrays with 18,400 transcriptsand variants on each array and 30 human 6800 arrayswith 7,129 genes on each array.

Figure 4 shows a two dimensional Isomap model builtfrom 86 human U133A microarrays. Panel A presents alow resolution overview of this model. The 86 samples aredistributed in a Y-shape with the untreated patient AMLsamples and normal neutrophils clustered at one end ofthe Y and the normal monocytes at the other. The twoperpendicular arrows on this panel represent the granulo-cytic and monocytic differentiation. The former arrowpoints from the HL-60 cells (undiff) to the normal neu-trophils (poly) and the latter one points to the normalmonocytes (mono). As expected, HL-60 cells treated withPMA, a well characterized inducer of monocytic differen-

tiation, are clustered close to the normal monocytes. It isnoteworthy that the gene expression patterns of the pri-mary patient AML cells are very different from that of theHL-60 cell line.

Figure 4B, an enlarged portion of panel A, focuses on themonocytic end of the Y distribution showing HL-60 cellstreated with several chemical compounds. These com-pounds, Apo, Keto, Cyc7p5, Sulma, Erythro, DMSO, Methyl,Phen, EGFR, and PMA (Table 1) change the gene expres-sion pattern of HL-60 cells to become more or less similarto normal monocytes; therefore these compounds arepotential inducers of monocytic differentiation. ATRA isdisplayed on the left side of this panel and represents aneffect that is more similar to granulocytic than monocyticdifferentiation.

Figure 4C, another enlarged portion of panel A, shows anarea of the Y-distribution around the normal granulocytes(poly). In addition to the granulocytes, this area also con-tains duplicates of APL samples before and after severaldrug treatments. The effect of drugs on APL, as measuredby the change of gene expression levels, is less comparedto the effect of the same drugs on the HL-60 cells. The

Table 1: Chemical compounds and their abbreviations used in the high throughput screening dataset

Apo (R)-(-)-apomorphine HClATRA all trans retinoic acidCaff 8-(3-chlorostyryl) caffeineCyc7p5 cyclazosin HClDMSO dimethyl sulfoxideEGFR 4,5-dianilinophthalimideErythro erythro-9-(2-hydroxy-3-nonyl)adenine HClKeto 16-ketoestradiolMethyl α-methyl-L-p-tyrosinePerg pergolide methanesulfonatePhen 1,10-phenanthrolinePMA phorbol-12-myristate-13-acetateScop (-) scopolamine methyl bromideSulma sulmazoleVitD calcitriol5FU 5-fluorouracil5FUD 5-fluorouridine

Table 2: Cells and their abbreviations included in the high throughput screening dataset

AML primary patient acute myeloid leukemia cellsAPL primary acute promyelocytic leukemia cellsundiff HL-60 cell linemono normal human monocytespoly normal human neutrophils

Page 9 of 17(page number not for citation purposes)

Page 10: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

Isomap analysis of the high throughput screening data set U133A arraysFigure 4Isomap analysis of the high throughput screening data set U133A arrays.Two-dimensional Isomap models were gen-erated from 86 human U133A high density oligonucleotide arrays with 22,283 genes on each array as described in Systems and Methods. Panel A shows an overview map of the Isomap model. Arrows point to the directions of the monocytic and granulo-cytic differentiation. Panels B, C,and D are zoomed in regions of panel A. Panel E shows Isomap analysis of the monocytic and neutrophilic differentiation markers as described in Systems and Methods. Neighborhood size k: = 4. Panel Fshows the residual variance after the application of Isomap, MDS, and PCA models. Abbreviations of compounds and cell types are listed in Table 1 and 2, respectively. See text for more detail.

Page 10 of 17(page number not for citation purposes)

Page 11: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

result of Keto- and Phen-treatment on APL cells is differentfrom that of ATRA; while EGFR and Erythro causes only aminor change in the expression pattern of APL cells.

Figure 4D zooms in the area of the Y-distribution that issurrounding the undifferentiated HL-60 cells (undiff) atthe third end of the Y. Two compounds, 5-FU, and 5-FUD,cause a change in the gene expression pattern that is dif-ferent from monocytic and granulocytic differentiation.The Scop- and Caff-treated samples are very similar to eachother and represent only a small change from theuntreated HL-60 cells. The effect of two other compounds,Apo and Perg on HL-60 cells is unique and cannot be clas-sified as only monocytic or granulocytic. At the same time,VitD is displayed close to the ATRA-treated samples in theleft upper corner of panel D, which suggests that calci-triol's effect on HL-60 cells is similar to that of ATRA. Thisfunctional similarity is supported by the known similari-ties of action between these two compounds. Both chem-icals bind to nuclear receptors, VitD binds to vitamin Dreceptors (VDR) and ATRA binds to retinoic acid receptors(RAR). Both of these receptors heterodimerize with retin-oid X receptor (RXR) and act as hormone-dependent tran-scription factors.

Stegmaier et al. computed "monocyte and neutrophilscores" based on four parameters: interleukin 1 receptorantagonist (IL1RN) and secreted phosphoprotein 1 (SPP1) forthe monocyte signature genes and autosomal chronic gran-ulomatous disease protein (NCF1) and orosomucoid 1(ORM1) for the neutrophil signature genes. In a supple-mentary table, these four values were presented for undif-ferentiated HL-60 cells and cells treated with one of 16compounds. In Figure 4E, we present a two dimensionalIsomap model built only from these four parameters.Arrows point to the directions of the granulocytic (left)and monocytic (right) differentiation. It is noteworthythat our unbiased model, using the expression levels of allthe 18,400 transcripts and variants as input, offers a verysimilar result to the model using only four signaturegenes. In both models, the monocytic and granulocyticdifferentiations are revealed and the various chemicaltreatments cause similar changes relative to the two maindimensions. In both maps, Caff and Scop are proximal andrepresent only a small change compared to the undiffer-entiated HL-60 cells (undiff). Perg, 5-FU, and 5-FUD causemore change and Erythro, Apo, EGFR, Phen, ATRA, andVitD are all mapped in similar relative locations in panelA and E. The only major difference between the two mod-els is in the location of Sulma and Methyl. However, inboth cases these two compounds are mapped close toeach other. All these similarities underscore Isomap's abil-ity to recognize treatment-related changes in gene expres-sion patterns even when no a priori information isavailable about which genes are good predictors of a bio-

logical process or difference. Isomap can detect changes ingene expression signals in the background noise of tens ofthousands of other genes.

Figure 5 presents the two- and three-dimensional Isomapmodels built from 30 Affymetrix human 6800 arrays.Panel B is shown to demonstrate the individual samplesand the separation between sample classes. Although thesample number is relatively low, Isomap detects good sep-aration between the patient AML, normal monocyte(mono), neutrophil (poly), PMA-treated, and untreatedundifferentiated HL-60 sample classes. The ATRA-treatedHL-60 class partially overlaps with the untreated HL-60class. This is in agreement with the result of the U133Aarrays that showed that ATRA has less effect than PMA onthe gene expression levels in HL-60 cells. Any of thesechanges is less than the difference between the cell types,such as HL-60, AML, neutrophils, and monocytes. In the6800 arrays, similar to the U133A arrays, the PMA-treatedcells are on the monocytic differentiation axis and theAML samples are mapped on the granulocytic differentia-tion axis. Within the PMA-treated group, samples after 4h, 12 h, and 24 h treatment are grouped together. Asexpected, the 4 h group is located at the right side of thePMA cluster closer to the untreated HL-60 cells and the 24h group is located at the left side of the PMA cluster closerto the normal monocytes.

DiscussionIn this study, we use three Affymetrix high density oligo-nucleotide microarray data sets to demonstrate Isomap'sability to discover relevant structures in an unbiased fash-ion. In the rat multiple tissue gene expression data set, thestructures revealed correspond to the anatomical topologyof the source organs and tissues [39]. Similarly, in thehigh-throughput drug screening data set, the structuresmatch the monocytic and granulocytic differentiation pat-terns of myeloid leukemia cells treated with various chem-ical compounds [40]. We also demonstrate with acomplex spinal cord injury data set that Isomap revealsthe three main modalities (e.g., location, time, and sever-ity of the injury) of the experimental design [38]. This Iso-map model (Fig. 2) is superior to hierarchical clustering(Fig. 1) because Isomap can model structures of higherdimensionality while hierarchical clustering is limited tothe discovery of only one-dimensional structures. Cluster-ing is limited in two significant ways; it requires well-sep-arated data and linear correlation. The only relationshipclustering can detect is a one-on-one relationship whenpair-wise linear comparisons are made [41]. Isomap is anonlinear algorithm capable of analyzing microarray datasets that are nonlinear in nature. The sources of these non-linearities may originate from outliers, missing values,and non-normal distribution; all common events formeasurements in biological systems.

Page 11 of 17(page number not for citation purposes)

Page 12: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

Isomap analysis of the high throughput screening data set human 6800 arraysFigure 5Isomap analysis of the high throughput screening data set human 6800 arrays. Isomap models were generated from 30 human 6800 high density oligonucleotide arrays with 7,129 genes on each array as described in Systems and Methods. Panel A shows a three-dimensional Isomap model. Panel B shows a two-dimensional Isomap model. Arrows point to the directions of the monocytic and granulocytic differentiation. Panel C shows the residual variance after the application of Isomap, MDS, and PCA models. Neighborhood size: k = 5. See text for more detail. Animated image is presented in Additional file 4.

Page 12 of 17(page number not for citation purposes)

Page 13: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

In all of our examples, Isomap is applied to the expressionlevel values of all of the genes in the microarray data set.To avoid any kind of selection bias, no filtering or datascrubbing procedures were applied. Even under these con-ditions, we demonstrate that Isomap is capable of findingthe biologically relevant structures within noisy data sets.The experimental noise may come from the measurementerror and the variations of expression levels of tens thou-sands of genes. When Isomap is used in production mode,the experimental noise may be decreased by eliminatingthe genes that are not expressed in the studied tissues andthose which have very stable and unchanging expressionlevels in all of the samples. These procedures will furtherimprove the resolution of the Isomap models and facili-tate class discovery by Isomap. The classification power ofIsomap may be improved when it is used in combinationwith enrichment procedures that weight the gene expres-sion levels of different genes dependent on other surro-gate information coming from the knowledge of chemicalpathways in which the gene plays a role, the tissues inwhich the gene is expressed [42], and the gene expressionpatterns that are evolutionarily conserved between differ-ent organisms [39,43].

In addition to Isomap, all the three datasets were also ana-lyzed with PCA and MDS. Residual variance curves arepresented for all the three datasets. In all instances, PCAand Isomap outperformed MDS and in most instances,Isomap worked the best out of the three methods. In thespinal cord injury dataset, PCA and Isomap give compara-ble results (Fig. 2D). However with 3–4-dimensionalmodels, Isomap performs better than PCA; and at dimen-sions over 9, PCA worked better. In the multiple tissuedataset, Isomap outperforms PCA in the range of 2–20dimensions (Fig. 3B). Similarly, in the two drug screeningdatasets, Isomap outperforms PCA at dimension 3 andover (Fig. 4F) and at every dimension (Fig. 5C). Typically,Isomap has a better performance in explaining the vari-ance of microarray data at lower dimensions (Fig. 6). Thisadvantage of Isomap is eliminated at higher (over 10)dimensions. The presented examples show that Isomapworks well for datasets with several samples that presentgradual changes from one another. However, when a fewvery distinct classes of samples exist then the more classi-cal methods, e.g., PCA work better. Isomap needs samplesto be present along the geodesic surface of the manifold.In our examples, many severity levels of spinal cord injurywere considered, as well as multiple tissues and multiplechemical compounds, which was the source of Isomap'ssuccess. Another prerequisite for a successful applicationof Isomap is a large enough number of samples. Isomapwill become really useful when datasets with hundreds orthousands of microarrays need to be analyzed.

The structures discovered by Isomap help to understandbiological phenomena underlying these structures.Visualization tools of Isomap models may be applied asdata mining tools for microarray data sets. Similar toolswere used for modeling co-regulated genes in C. eleganswith VxInsight [44] and generating high-resolution tem-poral models of CNS development [25]. Moreover, Iso-map is a promising tool of unbiased class discoverybecause it is performed in a completely unsupervisedmanner. This is a key advantage of the Isomap algorithmconsidering the limited number of class discovery tools[23,45,46] compared to the abundance of class predictionmethods. The Isomap algorithm is capable of revealingstructures in microarray data sets, which may provideinsight into underlying biological networks[4,5,7,22,25,29,30,42,43,47].

Another application of the Isomap algorithm is predictingclass membership. We previously demonstrated the use ofIsomap for class prediction in a breast cancer outcomemicroarray data set [36]. The selection of good classifiersthat are robust, insensitive to outliers, medicallyinterpretable, having high generalization power, andbased on the smallest possible subset with the maximaldiscriminatory features [48] is important and should beevaluated for Isomap in the future. Aclass predictionmodel using the Isomap algorithm does not need to useall the genes in the microarray. Somorjai et al. argue thatno matter what feature selection approach is used to themicroarray data, generally 50 or more features are neededfor classification [48]. In this paper, we show that onlyfour genes are used as input for an Isomap model (Fig.4E). Isomap is a type of dimensionality reduction methodand may be evaluated as input to supervised machinelearning techniques, such as ANN and SVM. Besides geneexpression microarrays, Isomap can be used on othertypes of biological data, such as genomic, proteomic, andmetabolomic data sets to reveal low dimensional struc-tures related to diet-genome interactions, genotype-dis-ease associations, and drug-gene-disease relationships.

ConclusionIsomap, a nonlinear dimensionality reduction algorithm,discovers low-dimensional structures embedded in high-dimensional Affymetrix high-density oligonucleotidemicroarray data sets. These structures correspond to andhelp to interpret underlying biological phenomenapresent in these data. Our work provides examples ofthree experiments with temporal, spatial, and functionalprocesses revealed by the Isomap algorithm. Visualizationof Isomap models helps to understand these processesand provides means of data mining from gene expressiondata sets. The low-dimensional models generated by Iso-map help to reveal new sample classes and are potentiallyuseful for class prediction. In summary, Isomap is a prom-

Page 13 of 17(page number not for citation purposes)

Page 14: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

ising new algorithm for the unbiased analysis of high-density oligonucleotide microarray data sets.

MethodsDatasetsSpinal cord injury data set: A large data set consisting of170 Affymetrix rat U34A high density oligonucleotidearrays with 8,799 genes on each array was accessed at theGene Expression Omnibus (GEO) GSE 464. Rat multipletissue data set: Another data set consisting of 122 Affyme-trix rat U34A arrays with 8,799 genes on each array wasaccessed at the GEO GSE 952. High throughput screen-ing data set: A gene expression data set based on high

throughput screening (GE-HST) was accessed at the GEOGSE 995. This series is a combination of three other series:GSE 976, 982, and 985. The accessed data set contains 86human genomic U133A Affymetrix arrays with 18,400transcripts and variants on each array and 30 human 6800arrays with 7,129 genes on each array.

Probe-level microarray data analysisAll microarray data sets were downloaded to a local com-puter in Affymetrix CEL file format. Some of these fileswere text files, others were binary CEL files. Probe-leveldata analysis was carried out with Bioconductor 1.3.28libraries [49] in the R 1.8.0 environment on a dual proc-

Residual variance differences between PCA and Isomap modelsFigure 6Residual variance differences between PCA and Isomap models.Typically, Isomap has a better performance explaining the variance in microarray datasets at lower dimensions. By increasing the number of dimensions being considered, this advan-tage will be eliminated.

Page 14 of 17(page number not for citation purposes)

Page 15: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

essor PC with RedHat 8.0 operating system installed. Rawdata were first normalized with quantile normalization;background correction and gene expression levels werethen computed with Robust Multichip Average (RMA)[50].

Hierarchical clusteringGene expression levels were expressed as RMA values andimported into Gene Cluster 3.0 [4]. Genes were centeredto the mean and filtered with the criteria of genes withexpression levels ≥ 1.0 in at least 10% of the samples.Since the RMA values are base-two logarithm-convertedvalues, this statement is equivalent to a requirement thatthe expression levels of the selected genes be ≥ 2-fold or ≤1/2 of the mean expression level of that gene in at least10% of the samples. In case of the spinal cord injury dataset, 397 of 8,799 genes passed these filtering criteria. Theselected genes and all the samples were clustered usingcentroid linkage hierarchical clustering based on uncen-tered correlation similarity metric. Branches of the cluster-ing trees were then folded using SOM and visualized usingJava TreeView 1.0.1. The result of this analysis with thespinal cord injury data set is presented in Figure 1.

Isomap analysis of the microarray data setsGene expression levels of all genes were expressed as RMAvalues and the result table was imported into the Matlabenvironment. Sample-to-sample differences between allpairs of genes were expressed as Euclidean distances ofRMA values. Isomap was carried out in Matlab with thealgorithm provided by Joshua Tenenbaum et al. [35]. Weused the nearest neighbor method with k = 2, 3, or 4dependent on which value of k generated the minimalresidual variance. Other data processing and visualizationsteps with the Isomap models were carried out with cus-tom written Matlab functions. Source code is presented inAdditional Files 5, 6, 7, 8.

Isomap analysis of the monocytic and neutrophilic differentiation markersStegmaier et al. computed "monocyte and neutrophilscores" based on four parameters: interleukin 1 receptorantagonist (IL1RN) and secreted phosphoprotein 1 (SPP1) forthe monocyte signature genes and autosomal chronic gran-ulomatous disease protein (NCF1) and orosomucoid 1(ORM1) for the neutrophil signature genes [40]. In a sup-plementary table of their paper, these four variables werepresented for undifferentiated HL-60 cells as well as forcells treated with one of 16 chemical compounds. In thisstudy, we expressed sample-to-sample differences asEuclidean distances of the base-two logarithm-convertedmean values of the four marker genes. An Isomap modelusing the nearest neighbor method with k = 2 was per-formed with the algorithm provided by Joshua Tenen-baum [35]. All other data processing and visualization

steps were carried out with custom written Matlab func-tions. The result of this analysis is presented in Figure 4E.

Authors' contributionsKD conceived of and designed the study, carried out thedata analysis and visualization, developed the Matlabcomputer code, and drafted the manuscript. RLR and WMcontributed to the study design and editing of the manu-script. All authors read and approved the finalmanuscript.

Additional material

Additional File 1Animated Isomap model of Fig. 2A.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-6-195-S1.gif]

Additional File 2Animated Isomap model of Fig. 2B.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-6-195-S2.gif]

Additional File 3Animated Isomap model of Fig. 2C.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-6-195-S3.gif]

Additional File 4Animated Isomap model of Fig. 5A.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-6-195-S4.gif]

Additional File 5Matlab code. Gene expression values are stored in a matrix (M) with var-iables (genes) in the rows and observations (samples) in the columns. A vector (samples) contains the sample names. Another vector (pheno) con-tains the classification of each sample. The gene names are stored in another vector (genes). Euclidean distances may be computed as D = L2_distance(M, M); If we want to perform Isomap at dimensions 1 through 20 then we store this range in: options.dims = 1:20; The Isomap algorithm can be executed with a neighborhood size of 5 as follows: [Y, R] = Isomap(D,'k',5, options); 1 through 20 dimensional coordinates are stored in Y. coords and the residual variances are stored in R. The results can be wrapped around to create a 3D 'model' structure, using make-model: model = makemodel(M, Y, options, samples, pheno,'i',3,5);Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-6-195-S5.m]

Page 15 of 17(page number not for citation purposes)

Page 16: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

AcknowledgementsThis work was funded by grant (P60MD00222) from the NIH to the NCMHD Center of Excellence in Nutritional Genomics.

References1. Nicholson JK, Holmes E, Lindon JC, Wilson ID: The challenges of

modeling mammalian biocomplexity. Nat Biotechnol 2004,22:1268-1274.

2. Arrays Come of Age. Genome Technology 2004, 42:38-39.3. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov

JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD,Lander ES: Molecular classification of cancer: class discoveryand class prediction by gene expression monitoring. Science1999, 286:531-537.

4. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysisand display of genome-wide expression patterns. Proc NatlAcad Sci U S A 1998, 95:14863-14868.

5. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB,Brown PO, Botstein D, Futcher B: Comprehensive identificationof cell cycle-regulated genes of the yeast Saccharomyces cer-evisiae by microarray hybridization. Mol Biol Cell 1998,9:3273-3297.

6. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pol-lack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A,Williams C, Zhu SX, Lonning PE, Borresen-Dale AL, Brown PO, Bot-stein D: Molecular portraits of human breast tumours. Nature2000, 406:747-752.

7. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: System-atic determination of genetic network architecture. NatGenet 1999, 22:281-285.

8. Ben-Dor A, Shamir R, Yakhini Z: Clustering gene expressionpatterns. J Comput Biol 1999, 6:281-297.

9. Wang J, Bo TH, Jonassen I, Myklebost O, Hovig E: Tumor classifi-cation and marker gene prediction by feature selection andfuzzy c-means clustering using microarray data. BMCBioinformatics 2003, 4:60.

10. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, LevineAJ: Broad patterns of gene expression revealed by clusteringanalysis of tumor and normal colon tissues probed by oligo-nucleotide arrays. Proc Natl Acad Sci U S A 1999, 96:6745-6750.

11. Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L, KohnKW, Reinhold WC, Myers TG, Andrews DT, Scudiero DA, Eisen MB,Sausville EA, Pommier Y, Botstein D, Brown PO, Weinstein JN: A

gene expression database for the molecular pharmacologyof cancer. Nat Genet 2000, 24:236-244.

12. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E,Lander ES, Golub TR: Interpreting patterns of gene expressionwith self-organizing maps: methods and application tohematopoietic differentiation. Proc Natl Acad Sci U S A 1999,96:2907-2912.

13. Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J,Scherf U, Lee JK, Reinhold WO, Weinstein JN, Mesirov JP, Lander ES,Golub TR: Chemosensitivity prediction by transcriptionalprofiling. Proc Natl Acad Sci U S A 2001, 98:10787-10792.

14. Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS,Ares MJ, Haussler D: Knowledge-based analysis of microarraygene expression data by using support vector machines. ProcNatl Acad Sci U S A 2000, 97:262-267.

15. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haus-sler D: Support vector machine classification and validationof cancer tissue samples using microarray expression data.Bioinformatics 2000, 16:906-914.

16. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, AngeloM, Ladd C, Reich M, Latulippe E, Mesirov JP, Poggio T, Gerald W,Loda M, Lander ES, Golub TR: Multiclass cancer diagnosis usingtumor gene expression signatures. Proc Natl Acad Sci U S A 2001,98:15149-15154.

17. Su AI, Welsh JB, Sapinoso LM, Kern SG, Dimitrov P, Lapp H, SchultzPG, Powell SM, Moskaluk CA, Frierson HFJ, Hampton GM: Molecu-lar classification of human carcinomas by use of gene expres-sion signatures. Cancer Res 2001, 61:7388-7393.

18. Yeang CH, Ramaswamy S, Tamayo P, Mukherjee S, Rifkin RM, AngeloM, Reich M, Lander E, Mesirov J, Golub T: Molecular classificationof multiple tumor types. Bioinformatics 2001, 17 Suppl1:S316-22.

19. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F,Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS: Clas-sification and diagnostic prediction of cancers using geneexpression profiling and artificial neural networks. Nat Med2001, 7:673-679.

20. Hastie T, Tibshirani R, Botstein D, Brown P: Supervised harvestingof expression trees. Genome Biol 2001, 2:RESEARCH00031-12.

21. Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiplecancer types by shrunken centroids of gene expression. ProcNatl Acad Sci U S A 2002, 99:6567-6572.

22. Khan J, Simon R, Bittner M, Chen Y, Leighton SB, Pohida T, Smith PD,Jiang Y, Gooden GC, Trent JM, Meltzer PS: Gene expression pro-filing of alveolar rhabdomyosarcoma with cDNAmicroarrays. Cancer Res 1998, 58:5009-5013.

23. Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Rad-macher M, Simon R, Yakhini Z, Ben-Dor A, Sampas N, Dougherty E,Wang E, Marincola F, Gooden C, Lueders J, Glatfelter A, Pollock P,Carpten J, Gillanders E, Leja D, Dietrich K, Beaudry C, Berens M,Alberts D, Sondak V: Molecular classification of cutaneousmalignant melanoma by gene expression profiling. Nature2000, 406:536-540.

24. Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R,Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, BorgA, Trent J: Gene-expression profiles in hereditary breastcancer. N Engl J Med 2001, 344:539-548.

25. Wen X, Fuhrman S, Michaels GS, Carr DB, Smith S, Barker JL, Som-ogyi R: Large-scale temporal gene expression mapping ofcentral nervous system development. Proc Natl Acad Sci U S A1998, 95:334-339.

26. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M,McLaughlin ME, Kim JY, Goumnerova LC, Black PM, Lau C, Allen JC,Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T,Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, MesirovJP, Lander ES, Golub TR: Prediction of central nervous systemembryonal tumour outcome based on gene expression.Nature 2002, 415:436-442.

27. Girolami M, Breitling R: Biologically valid linear factor models ofgene expression. Bioinformatics 2004, 20:3021-3033.

28. Nielsen TO, West RB, Linn SC, Alter O, Knowling MA, O'Connell JX,Zhu S, Fero M, Sherlock G, Pollack JR, Brown PO, Botstein D, van deRijn M: Molecular characterisation of soft tissue tumours: agene expression study. Lancet 2002, 359:1301-1307.

29. Holter NS, Mitra M, Maritan A, Cieplak M, Banavar JR, Fedoroff NV:Fundamental patterns underlying gene expression profiles:

Additional File 6Matlab code. The function showmode can be used l to render an image from this model: showmodel(model); Enter showmodel without input parameters for more options.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-6-195-S6.m]

Additional File 7Matlab code. Adding the sample names to the figure: annotate(Y, samples,3);Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-6-195-S7.m]

Additional File 8Matlab code. Similarly, a 3-D animation (mov) can be created using: mov = makemovie(model); Enter makemovie without input parameters for more options.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-6-195-S8.m]

Page 16 of 17(page number not for citation purposes)

Page 17: BMC Bioinformatics BioMed Central - COnnecting REpositories · types of pattern recognition and clustering techniques have been developed and applied to microarray data. A common

BMC Bioinformatics 2005, 6:195 http://www.biomedcentral.com/1471-2105/6/195

Publish with BioMed Central and every scientist can read your work free of charge

"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."

Sir Paul Nurse, Cancer Research UK

Your research papers will be:

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours — you keep the copyright

Submit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.asp

BioMedcentral

simplicity from complexity. Proc Natl Acad Sci U S A 2000,97:8409-8414.

30. Holter NS, Maritan A, Cieplak M, Fedoroff NV, Banavar JR: Dynamicmodeling of gene expression data. Proc Natl Acad Sci U S A 2001,98:1693-1698.

31. Alter O, Brown PO, Botstein D: Generalized singular valuedecomposition for comparative analysis of genome-scaleexpression data sets of two different organisms. Proc Natl AcadSci U S A 2003, 100:3351-3356.

32. Liu L, Hawkins DM, Ghosh S, Young SS: Robust singular valuedecomposition analysis of microarray data. Proc Natl Acad SciU S A 2003, 100:13167-13172.

33. Sammon JWJ: A nonlinear mapping for data structure analysis.IEEE Trans Comp 1969, C-18:401-409.

34. Garcia de la Nava J, Santaella DF, Cuenca Alba J, Maria Carazo J,Trelles O, Pascual-Montano A: Engene: the processing andexploratory analysis of gene expression data. Bioinformatics2003, 19:657-658.

35. Tenenbaum JB, de Silva V, Langford JC: A global geometric frame-work for nonlinear dimensionality reduction. Science 2000,290:2319-2323.

36. Dawson K, Rodriguez RL, Malyj WV: Gene Expression and Pro-teomics Data Analysis in Cancer Diagnosis and PrognosisUsing Isomap, a Non-Linear Algorithm. Breast Cancer Research2003, 5:.

37. Nilsson J, Fioretos T, Hoglund M, Fontes M: Approximate geo-desic distances reveal biologically relevant structures inmicroarray data. Bioinformatics 2004, 20:874-880.

38. Di Giovanni S, Knoblach SM, Brandoli C, Aden SA, Hoffman EP, FadenAI: Gene profiling in spinal cord injury shows role of cell cyclein neuronal death. Ann Neurol 2003, 53:454-468.

39. Walker JR, Su AI, Self DW, Hogenesch JB, Lapp H, Maier R, Hoyer D,Bilbe G: Applications of a rat multiple tissue gene expressiondata set. Genome Res 2004, 14:742-749.

40. Stegmaier K, Ross KN, Colavito SA, O'Malley S, Stockwell BR, GolubTR: Gene expression-based high-throughput screening(GE-HTS) and application to leukemia differentiation. Nat Genet2004, 36:257-263.

41. Bittner M, Meltzer P, Trent J: Data analysis and integration: ofsteps and arrows. Nat Genet 1999, 22:213-215.

42. Beer MA, Tavazoie S: Predicting gene expression fromsequence. Cell 2004, 117:185-198.

43. Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression net-work for global discovery of conserved genetic modules. Sci-ence 2003, 302:249-255.

44. Kim SK, Lund J, Kiraly M, Duke K, Jiang M, Stuart JM, Eizinger A, WylieBN, Davidson GS: A gene expression map for Caenorhabditiselegans. Science 2001, 293:2087-2092.

45. von Heydebreck A, Huber W, Poustka A, Vingron M: Identifyingsplits with clear separation: a new class discovery method forgene expression data. Bioinformatics 2001, 17 Suppl 1:S107-14.

46. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A,Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, MooreT, Hudson JJ, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC,Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R,Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM:Distinct types of diffuse large B-cell lymphoma identified bygene expression profiling. Nature 2000, 403:503-511.

47. Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS: Discoveringfunctional relationships between RNA expression and chem-otherapeutic susceptibility using relevance networks. ProcNatl Acad Sci U S A 2000, 97:12182-12186.

48. Somorjai RL, Dolenko B, Baumgartner R: Class prediction and dis-covery using gene microarray and proteomics mass spec-troscopy data: curses, caveats, cautions. Bioinformatics 2003,19:1484-1491.

49. Dudoit S, Gentleman RC, Quackenbush J: Open source softwarefor the analysis of microarray data. Biotechniques 2003,Suppl:45-51.

50. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ,Scherf U, Speed TP: Exploration, normalization, and summa-ries of high density oligonucleotide array probe level data.Biostatistics 2003, 4:249-264.

Page 17 of 17(page number not for citation purposes)