

Independent component analysis: identifying data-driven human gene modules

Jesse Engreitz¹, Bernie Daigle², Yoni Marshall¹, and Russ B. Altman¹,²

¹Department of Bioengineering, ²Department of Genetics, Stanford University

motivation

Cellular physiology, including disease states and drug responses, results from the combined influences of many genes. Experimentalists have now sampled many conditions and cell types, contributing vast amounts of gene expression data that represent many biological processes. The expansion of public microarray databases such as the Gene Expression Omnibus (GEO) allows the use of intelligent data mining approaches to extract information about these biological processes in a data-driven manner. Using 9,395 human microarrays measuring over 20,000 genes, we use independent component analysis to identify functional gene modules, or sets of genes, that describe a wide range of biological conditions.

methods

9,395 human microarrays (Affymetrix HG U133 Plus 2.0), diverse experimental conditions

Preprocessing
We used hierarchical clustering to reduce the contributions from highly replicated experimental systems. We consolidated clusters of over-represented conditions to create a meta-compendium of 423 meta-arrays.

Independent component analysis
Independent component analysis (ICA) models gene expression data as a linear combination of transcriptional patterns, termed independent components [1]. Given a set of microarrays, ICA identifies components so that statistical independence is maximized. Since ICA is a stochastic method, we run the algorithm 20 times and cluster the component estimates.
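The linear model described above can be written out explicitly. The notation below is an assumption made for illustration and does not appear on the poster: X is the genes-by-arrays expression matrix, S holds the per-gene weights of each component, and A holds the activity of each component in each array.

```latex
X \approx S A,
\qquad
x_{gj} \approx \sum_{k=1}^{K} s_{gk}\, a_{kj}
```

Here x_{gj} is the expression of gene g in array j, s_{gk} is the weight of gene g in component k, and a_{kj} is the activity of component k in array j. ICA chooses S so that its columns are as statistically independent as possible. Because each run starts from a random initialization, the estimated columns can differ in order and sign between runs, which is presumably why repeated runs are clustered before components are reported.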

Gene modules
Each independent component has a weight for each gene that quantifies relative expression level. For each component, we use a weight cut-off to generate gene modules of over-expressed and under-expressed genes [2].
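The cut-off step can be sketched as follows. This is a minimal illustration, not the poster's actual procedure: it assumes the weights are standardized per component and that the cut-off is expressed in standard deviations; the function name, the gene names, and the threshold value are all hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def gene_modules(weights: Dict[str, float], cutoff: float = 3.0) -> Tuple[List[str], List[str]]:
    """Split one component into an over-expressed and an under-expressed module.

    weights: gene name -> weight of that gene in the component.
    cutoff:  assumed threshold, in standard deviations from the mean weight.
    """
    values = list(weights.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All weights identical: no gene stands out in either direction.
        return [], []
    z_scores = {gene: (w - mu) / sigma for gene, w in weights.items()}
    over = sorted(g for g, z in z_scores.items() if z >= cutoff)
    under = sorted(g for g, z in z_scores.items() if z <= -cutoff)
    return over, under


# Toy component with hypothetical gene names.
weights = {
    "GENE_A": 5.0, "GENE_B": -5.0, "GENE_C": 0.1, "GENE_D": -0.1,
    "GENE_E": 0.2, "GENE_F": -0.2, "GENE_G": 0.0, "GENE_H": 0.0,
}
over_module, under_module = gene_modules(weights, cutoff=1.5)
print(over_module, under_module)
```

Each component yields two modules under this scheme, which matches the poster's count of 846 modules from 423 components.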

Gene module analysis
We can calculate the expression of each independent component in a new microarray experiment. We predict that gene modules associated with highly expressed components play an important role in the experiment.
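One way to compute the expression of each component in a new array is a least-squares projection of the new profile onto the component weights. The poster does not state which method was used, so the sketch below is an assumption; the function names and the toy numbers are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def solve_linear_system(a: Matrix, b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [a | b] so the inputs stay untouched.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("components are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def component_activities(components: Matrix, array: List[float]) -> List[float]:
    """Least-squares activity of each component in one expression profile.

    components: K rows of per-gene weights (K x G).
    array:      expression values for the same G genes, in the same order.
    """
    k = len(components)
    # Normal equations: (C C^T) a = C x.
    gram = [[sum(u * v for u, v in zip(components[i], components[j]))
             for j in range(k)] for i in range(k)]
    rhs = [sum(w * x for w, x in zip(comp, array)) for comp in components]
    return solve_linear_system(gram, rhs)


# Toy check: build an array from known activities, then recover them.
components = [[1.0, 0.0, 2.0, -1.0],
              [0.0, 1.0, 1.0, 1.0]]
true_activity = [2.0, -1.0]
new_array = [sum(a * comp[g] for a, comp in zip(true_activity, components))
             for g in range(4)]
estimated = component_activities(components, new_array)
print(estimated)
```

Components with large absolute activity in the new array would then point to the gene modules worth inspecting first.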

Figure: Components ranked by variance explained

results

We identified 423 independent components and defined 846 gene modules. Annotation using the Gene Ontology (GO) suggests that while some of the gene modules represent known biological processes, some may describe transcriptional programs not well characterized by GO. ICA gives reproducible component estimates when applied to a large compendium of gene expression data, and performs better than PCA in describing the data.

Figure: Relationship between variance explained and number of enriched GO terms
Figure: Clustered component estimates with top GO annotations

application: parthenolide

GSE7538: Treatment of primary AML specimens with parthenolide
12 treated-untreated pairs of microarrays

Figure: Differentially expressed components in parthenolide response


We investigated the mechanism of parthenolide, a preclinical drug with anti-proliferative effects. First we identified differentially expressed components between treated and untreated samples. These components represent known parthenolide-modulated pathways including NF-κB signaling and oxidative stress response. To identify key genes in these pathways, we used the up-regulated genes in component 373 to generate a gene network [3]. We ranked these genes based on their connectivity and identified CREM as a potential regulator in parthenolide response.
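Because the data set consists of treated-untreated pairs, one natural way to find differentially expressed components is a paired t-statistic on the component activities. The poster does not name the test it used, so this is only a sketch under that assumption; the component names and activity values are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, List, Tuple


def paired_t_statistic(treated: List[float], untreated: List[float]) -> float:
    """Paired t-statistic for matched treated/untreated activities."""
    if len(treated) != len(untreated) or len(treated) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [t - u for t, u in zip(treated, untreated)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / sqrt(len(diffs)))


def rank_components(treated: Dict[str, List[float]],
                    untreated: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank components by absolute paired t-statistic, largest first."""
    stats = {name: paired_t_statistic(treated[name], untreated[name])
             for name in treated}
    return sorted(stats.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical activities for two components across four sample pairs.
treated = {"component_A": [1.2, 1.5, 1.1, 1.6],
           "component_B": [0.1, -0.2, 0.0, 0.2]}
untreated = {"component_A": [0.2, 0.4, 0.3, 0.5],
             "component_B": [0.0, -0.1, 0.1, 0.1]}
ranked = rank_components(treated, untreated)
print(ranked)
```

With several hundred components tested at once, a real analysis would also need a multiple-testing correction before calling any component differentially expressed.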

Figure: Network of up-regulated genes in component 373, created using Ingenuity software [3]. Green indicates a 'known' parthenolide gene. Blue indicates a novel prediction.
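Ranking genes by connectivity can be read as counting each gene's neighbors in the network. The sketch below assumes an undirected edge list exported from the network tool; the node names are placeholders rather than real interactions, and degree is only one possible connectivity measure.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_by_connectivity(edges: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Rank nodes of an undirected network by degree, highest first.

    Self-loops and repeated edges are ignored so each neighbor counts once.
    Ties are broken alphabetically to keep the output deterministic.
    """
    degree: Counter = Counter()
    seen = set()
    for a, b in edges:
        if a == b:
            continue
        key = frozenset((a, b))
        if key in seen:
            continue
        seen.add(key)
        degree[a] += 1
        degree[b] += 1
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


# Placeholder network: "HUB" touches every other node.
edges = [("HUB", "G1"), ("HUB", "G2"), ("HUB", "G3"),
         ("G1", "G2"), ("G2", "HUB")]
ranking = rank_by_connectivity(edges)
print(ranking)
```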

Top experiments associated with component 373
GSE1060   The recurrent SET-NUP214 fusion as a new HOXA activation mechanism in pediatric T-ALL
GSE7440   Early Response and Outcome in High-Risk Childhood Acute Lymphoblastic Leukemia
GSE10358  Discovery and validation of expression data for the Genomics of Acute Myeloid Leukemia…
GSE10792  Genome wide genotyping and gene expression data of childhood B-cell precursor ALL…
GSE7757   Robustness of gene expression signatures in leukemia: comparison of three distinct…
GSE11190  Interferon signaling and treatment outcome in chronic hepatitis C

conclusion

We extracted gene modules from a large corpus of expression data using data-driven means, providing a new method for predicting functional relationships between genes. These modules are useful for differential expression analysis and may be applied in a number of other settings, including Gene Set Enrichment Analysis, phenotype classification, drug discovery, and content-based microarray search.

Support
Bio-X Undergraduate Research Award
Bio-X2 Supercomputing Cluster

References
1. Liebermeister W. Bioinformatics 2002, 18(1): 51-60.
2. Lee S, Batzoglou S. Genome Biology 2003, 4(11): R76.
3. Ingenuity Systems, www.ingenuity.com