knowledge discovery - ist department at ritrpv/local/syllabi/discovery/knowledgediscovery1.pdfthe...
TRANSCRIPT
Knowledge Discovery
Our goal
From data ... to information ... to knowledge ... to understanding (wisdom)
Why do we need Knowledge Discovery?
• Data explosion: web usage, automated data collection tools, mature database technology
• Too much data and too little knowledge
• Humans are not able to sift through the data effectively
• Computational approaches to data analysis are required for the continually increasing, accumulated data
Potential Applications
• Market analysis, customer relationship management
• Risk analysis and management
• Fraud detection
• Text mining: newsgroups, email, documents
• Web mining of logs and data streams for customization, advertising, marketing
• Biology and medicine: many types of high-throughput data for diagnostics, predictive and personalized medicine
Even Better: Consult the Domain Expert(s)
The Process
• Guided discovery
– PBL
– Knowledge discovery
– Learn through examples and practice
• The same general approach may be applied to many different problem domains
• Select appropriate methods to customize the approach
• No one right answer!
Running Example of KD
• Gene expression data
• Why a good example?
– Biotechnology advances created a huge influx of data
– Biologists were not equipped to analyze the data
– Computational scientists didn't understand the biology
– KDD process sorely needed
– Has significantly advanced over the last 10 years
Papers
• Data preprocessing and transformation
– Quackenbush
• Need for standards
– MAGE-ML
– www.mged.org
• Mining large data sets for patterns
– Molecular Classification of Cancer
– Golub et al.
A Typical Scenario
• A biologist designs and runs an experiment and delivers samples (along with $$) to the Functional Genomics lab for high-throughput gene expression analysis. A couple of weeks later the biologist picks up a CD with multiple files containing the raw data and some preprocessed data… Not knowing how to analyze the data, the biologist calls for your help…
• Where do we start?
– Understand the domain and the problems
High-Throughput Systems for Studying Global Gene Expression are Complex
• Need to learn about and consider:
– the biology behind the experiments and the interpretation of the experiments
– how the data is acquired (biotechnology)
– the data issues
Biology Basics: The Flow of Information
A gene is expressed in 2 steps:
– DNA is transcribed into RNA (mRNA)
– RNA is translated into protein
Genotype to Phenotype
• Individual cells in an organism have the same genes (DNA)
– the genotype
but… not all genes are active (expressed) in each cell
• It is the expression of thousands of genes and their products (RNA, proteins), functioning in a complicated and orchestrated way, that makes a specific cell what it is.
– the phenotype
Gene Expression Depends on Context
• The subsets of genes that are expressed (RNA/protein) will differ among cells, tissues, organs, conditions…
– the subset expressed confers unique properties to the cell
Figure labels: muscle, neuron, liver
Differential Gene Expression
• The level of expression of genes also differs with the cellular context
• i.e. the amount of a given RNA will vary
• We can think of gene expression (in higher organisms) as having both an “on/off” switch and a “volume” control
What Biologists Want to Know: Specific Patterns of Gene Expression
• Tissue/cell type-specific
– e.g. skin cell vs. brain cell
– e.g. keratinocyte vs. melanocyte
• Developmental stage
– e.g. embryonic skin cell vs. adult skin cell
• Disease state
– e.g. normal skin cell vs. skin tumor cell
• Environment-specific (drugs, toxins)
– e.g. skin cell untreated vs. treated
But also, the more difficult problem: Gene Networks
• Genes and their products are related through their roles in:
– metabolic pathways
– cell signalling networks
Metabolic Pathway
From KEGG Database
Cell Signalling Networks
www.mpi-dortmund.mpg.de/departments/dep1/signaltransduktion/image3.gif
What can we learn by studying global patterns of gene expression?
• Individual gene expression patterns
• Classifications: for diagnosis, prediction…
– Groups of genes
– Molecular taxonomy of disease
• Gene networks/pathways:
– Reconstruction of metabolic & regulatory pathways
Now that we have some understanding of the domain and goals…
• What about the data?
– How are the data generated?
– Data type?
– Data quality?
– Need for data cleaning and preprocessing?
Knowledge Discovery Process
Consult the Domain Expert(s)
GeneChip® Oligonucleotide Array
High-throughput gene expression analysis
Recall that DNA and RNA are composed of strings of nucleotides
• A gene of interest will have a specific nucleotide sequence
• DNA and RNA sequences can form bonds with complementary bases on another string; this is called base-pairing.
• When we do this experimentally we call it hybridization, and we can detect it by labeling one of the strings (aka strands)
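Base-pairing can be illustrated with a minimal sketch. It assumes DNA-only sequences and exact Watson-Crick complementarity; the probe and target strings are hypothetical, and real hybridization also tolerates partial matches.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the strand that base-pairs with `strand`, read 5'->3'."""
    return "".join(DNA_PAIRS[base] for base in reversed(strand.upper()))

def hybridizes(probe: str, target: str) -> bool:
    """True if `target` contains the exact site complementary to `probe`."""
    return reverse_complement(probe) in target.upper()

probe = "ATGCGT"        # hypothetical 6-base probe
target = "TTACGCATGG"   # hypothetical target strand
print(reverse_complement(probe))   # ACGCAT
print(hybridizes(probe, target))   # True
```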
GeneChip® Expression Analysis
Hybridization and Staining
Figure labels: array, cRNA target, hybridized array, streptavidin-phycoerythrin conjugate
Courtesy of M. Hessner, CAAGED Workshop
How do Affymetrix microarrays work?
• 12-20 probes are picked to “interrogate” a gene; the idea is to get multiple measurements. Each probe is a 25-mer oligonucleotide that binds to a gene
• The collection of probes that are designed to hybridize to the same gene is called a “probe set”… there may be tens of thousands of these probe sets on a given chip
• Probe sets have identifiers called “Affymetrix IDs”, which look like “10329_g_at”, etc. On any GeneChip, some probe sets are dedicated to “Quality Control”; these begin with “AFFX_”
• Take-home message: have to learn a lot of terminology
Affymetrix Chips
Figure labels: 300,000 “Probes”; Perfect Match and Mismatch; Average Difference Values
Courtesy of J. Glasner, CAAGED Workshop
Affymetrix Analysis
• A high-resolution image of the scanned microarray generates a DAT file
• Since the probes are laid out in a grid, and each probe position is determined in terms of its X-Y coordinates, one can compute the PM and MM probe intensities from the pixelated image
• The CDF (chip description file) library file contains the XY layout of every probe
Affymetrix Data Flow
Figure (flow diagram) labels: hybridized GeneChip, scan chip, DAT file, process image (GCOS), CEL file, CDF file, MAS5 (GCOS), CHP file, TXT file, RPT file, EXP file
GeneChip Operating Software (GCOS) - Affymetrix
http://www.affymetrix.com/products/software/specific/gcos.affx
Affymetrix File Types
• DAT file:
– Raw (TIFF) optical image of the hybridized chip
• CDF file (Chip Description File):
– Provided by Affy, describes layout of chip
• CEL file:
– Processed DAT file (intensity/position values)
– http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/cel.html
• CHP file:
– The “CHP” file contains summarized gene expression scores after probe cells are analyzed;
– format is:
Gene           Avg.D   Presence
AFFX_CreX_at   48      A
AFFX_BioB_at   149     P
• TXT file:
– Probe set expression values with annotation (CHP file in text format)
• RPT file:
– Generated by Affy software, report of QC info
Data Quality
• Most data mining techniques can tolerate some level of imperfection in the data, but improving data quality can improve the quality of analyses
• Main issues
– Noise
– Outliers
– Missing values
– Duplicate data
– Inconsistent data
There are Many Problems Facing Expression Analysis on the Biotech Side
• Standardization & quality control in the experiments (affects data quality at many levels)
• Cost
Problem in reproducibility of experimental data
• Lots of variation in arrays
– more than 100 experimental steps
• Sources of variation
– biological variability in each RNA extract
– each labeling reaction is different
– each slide is a separate hybridization
– spots on the slide are variable across slides (and within slides when double spotted)
– each “color” is scanned separately
• Need replicates and statistics!
Outcome
• “Noisy” data
• Data preprocessing is necessary
– normalization
– scaling
• Heavy reliance on statistics today
What do the spots (intensity measurements) represent?
• Fluorescence intensity is a measure of the relative abundance of individual mRNAs (expressed genes) in given samples
– e.g. experimental relative to control
• But, gene expression experiments are run on “multiple samples”. Why?
• We are trying to understand a dynamic process; each sample only represents a “snapshot”
– Compare among samples (different arrays)
– Compare across a time-course of related samples
How can we use the data?
• For microarrays we can only really depend on between-sample fold change, not on absolute values or within-sample comparisons (in general, fold changes greater than 1.3-2.0)
• Take-home message: have to be careful when comparing between arrays and from experiment to experiment…
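A between-sample fold change can be computed as in the following sketch. The intensities are hypothetical, and the default threshold of 2.0 is simply the upper end of the 1.3-2.0 range quoted above, applied symmetrically to increases and decreases.

```python
def fold_change(experimental: float, control: float) -> float:
    """Ratio of a gene's intensity in the experimental sample to the control."""
    return experimental / control

def is_changed(experimental: float, control: float, threshold: float = 2.0) -> bool:
    """Flag a gene whose expression rose or fell by at least `threshold`-fold."""
    fc = fold_change(experimental, control)
    return fc >= threshold or fc <= 1.0 / threshold

print(fold_change(400.0, 100.0))   # 4.0
print(is_changed(400.0, 100.0))    # True  (4-fold up)
print(is_changed(120.0, 100.0))    # False (1.2-fold, below threshold)
print(is_changed(50.0, 100.0))     # True  (2-fold down)
```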
Pre-processing
• Gene filtering
– control genes
– uninformative genes
• Normalization and scaling
– allows comparisons across arrays
– scaling to control dynamic range
• Transformation
– logarithmic transformation for improved statistical properties
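One possible preprocessing chain is sketched below: a log2 transformation followed by median centering, so that two arrays become comparable. Median centering is an assumption chosen for illustration (the slides do not name a specific normalization method), and the intensities are hypothetical.

```python
import math
from statistics import median

def log2_transform(values):
    """Logarithmic transformation of raw intensities (all must be > 0)."""
    return [math.log2(v) for v in values]

def median_center(log_values):
    """Shift one array so that its median log intensity is zero."""
    m = median(log_values)
    return [v - m for v in log_values]

array_a = [100.0, 200.0, 400.0, 800.0, 1600.0]   # hypothetical intensities
array_b = [2.0 * v for v in array_a]             # same sample, twice as bright

norm_a = median_center(log2_transform(array_a))
norm_b = median_center(log2_transform(array_b))
print([round(v, 2) for v in norm_a])
print([round(v, 2) for v in norm_b])
```

After centering, the overall brightness difference between the two arrays disappears, which is what allows comparisons across arrays.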
Normalization
Figure (scatterplot) axes: Cy3 signal (log2), Cy5 signal (log2)
Take-home Message
• Important to remember that once preprocessing, normalization, and transformation of the data have occurred, all downstream mining will be affected.
Data Representation
• Flat file
• Vector data
• Sparse matrix (text) data
• Sequence data (e.g. web or genomic)
• Time series
• Image data
• Spatio-temporal
Three levels of microarray gene expression data processing
Brazma et al., Nature Genetics, 29:365-371, 2001
Outcomes of Microarray Analysis
Large, complex data sets of high dimensionality
– example of a routine study: 50,000 “genes” from 20 samples, approx. 1-2 × 10^6 pieces of data
Challenges for bioinformatics
• annotation, storage, retrieval, sharing of data
• information from the data
State of Microarray Data
• Wide availability of the technology has given rise to a large number of distributed databases
• Data are scattered among many independent sites (accessible via the Internet) or not publicly available at all
• Need for standardization!
MGED Group and Standardization Issues
• Microarray Gene Expression Database (MGED) Group
www.mged.org
• MGED is taking on the challenge of standardization
• Four major projects
MGED Projects
• MIAME - The formulation of the minimum information about a microarray experiment required to interpret and verify the results.
• MAGE - The establishment of a data exchange format (MAGE-ML) and object model (MAGE-OM) for microarray experiments.
• Ontologies - The development of ontologies for microarray experiment description and biological material (biomaterial) annotation in particular.
• Normalization - The development of recommendations regarding experimental controls and data normalization methods.
MAGE-ML
• the XML representation of the MAGE-OM
• the DTD (document type definition) is what is specified in MAGE-ML
– rules or declarations
– what tags can be used
– what tags contain
• MAGE-OM
– http://www.mged.org/Workgroups/MAGE/mage-om.html
– mapping of microarray experimental workflow to the OM
• DTD
– http://www.omg.org/docs/dtc/03-05-03.dtd
• MAGE-STK software toolkit
– defines an API to MAGE-OM
– in Java, Perl, C++
• Used to
– export data to MAGE-ML
– store data in a relational database
– input data to analysis tools
• Reader: MAGE-ML docs into objects
• Writer: objects into MAGE-ML
Data Mining Techniques
• Exploratory data analysis
• Descriptive modeling
• Predictive modeling
• Pattern discovery
• Others
Exploratory Data Analysis
• Interactive and visual
• Insight and feel for the data in a broad sense
– Provide summaries
• e.g. max/min, mean/median, variance etc.
– Visualization
• Histograms, scatterplots
• Useful for data validation or verification
• Simple exploratory data analysis is invaluable
– Always get a cursory view of the data before applying data mining algorithms
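A cursory view of this kind can be produced with a few lines of code. The sketch below uses only the standard library; the values and the bin width are hypothetical, and the text histogram stands in for a real plot.

```python
from statistics import mean, median, variance

def summarize(values):
    """Basic summaries: min/max, mean/median, sample variance."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "variance": variance(values),
    }

def histogram(values, bin_width):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = {}
    for v in values:
        edge = (v // bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))

values = [1, 8, 12, 16, 12, 8]   # hypothetical measurements
summary = summarize(values)
bins = histogram(values, 5)
print(summary)
for edge, count in bins.items():
    print(f"{edge:>3}-{edge + 4:<3} {'#' * count}")
```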
Pattern Discovery
• Discover interesting local patterns in data rather than characterize data globally
• Market basket data
– Discover that if customers buy wine and bread, they buy cheese with a 0.9 probability
– Known as association rules
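The 0.9 probability in the wine-and-bread rule is usually called the confidence of the rule. A minimal sketch follows; the baskets are made up so that the rule comes out at exactly 0.9, and the function names are illustrative.

```python
def count_containing(baskets, items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)

def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent | antecedent) from basket counts."""
    with_antecedent = count_containing(baskets, antecedent)
    with_both = count_containing(baskets, antecedent | consequent)
    return with_both / with_antecedent

# Hypothetical data: 10 baskets hold wine and bread, 9 of them also hold cheese.
baskets = (
    [{"wine", "bread", "cheese"}] * 9
    + [{"wine", "bread"}]
    + [{"bread", "milk"}] * 5
)
rule_confidence = confidence(baskets, {"wine", "bread"}, {"cheese"})
print(rule_confidence)   # 0.9
```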
Descriptive Modeling
• Build a model for the underlying process
– Simulate the data if needed
• Cluster analysis to find natural groups in the data
• Bayesian network to find dependency models among variables
Predictive Modeling
• Predict a variable Y, given a p-dimensional vector X
– Classification: Y is categorical
– Regression: Y is real-valued
• Much like function approximation
– Learning the relationship between Y and X
• Statistics and machine learning have many algorithms for predictive modeling
– Emphasis is often on predictive accuracy rather than understanding the model itself.
Mining of Expression Data
Recall that:
• A gene expression pattern derived from a single microarray is simply a snapshot (one experimental sample vs. reference)
• Usually want to understand a process or changes in expression over a collection of samples
– gene expression profile
Working with Gene Expression Data
• Hypothesis-driven approaches
– Typically model-oriented
– Descriptive statistics relying on prior knowledge and good design
• Discovery-based
– Few, if any, a priori hypotheses
– Data-driven and algorithm-oriented
– Statistical algorithms
– Machine learning using heuristic techniques
Testing Hypotheses
• Based on prior biological knowledge
• Simplest
– look for individual differentially expressed genes
– fold changes
• Scatterplot
• Statistical measures
Scatterplot
Some simple statistics
• If we are looking at samples that seem to belong to two groups or conditions
• t-test compares the means of two groups while accounting for the standard error of the difference of the means
• ANOVA if we want to extend the analysis to more than two groups
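The t statistic described above can be sketched as follows. This uses the unequal-variance (Welch) form of the standard error, which is an assumption since the slides do not say which variant is meant; the expression values are hypothetical, and no p-value is computed.

```python
import math
from statistics import mean, variance

def t_statistic(group1, group2):
    """Difference of the group means divided by its standard error (Welch form)."""
    n1, n2 = len(group1), len(group2)
    standard_error = math.sqrt(variance(group1) / n1 + variance(group2) / n2)
    return (mean(group1) - mean(group2)) / standard_error

control = [2.0, 2.5, 3.0]   # hypothetical log expression, condition 1
treated = [5.0, 5.5, 6.0]   # hypothetical log expression, condition 2
t = t_statistic(control, treated)
print(round(t, 3))          # -7.348
```

A large absolute value of t suggests the gene's mean expression differs between the two conditions; turning it into a p-value requires the t distribution with the appropriate degrees of freedom.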
But, gene chips allow us to measure thousands of genes…
• Across multiple samples
Goal of Analysis of Expression Matrix
• Some statistical methods applied to:
1. “Group” similar genes together => groups of functionally similar genes.
2. “Group” similar cell samples together.
3. “Extract” representative genes in each group.
Typical approach
• Look for patterns
– compare rows to find evidence for co-regulation of genes
– compare columns to find evidence for relatedness among samples
1) Choose a measure of similarity (distance) among the objects being compared; each row or column is considered a vector in space
2) Then, group together objects (genes or samples) with similar properties; this is a multidimensional analysis
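Step 1 can be made concrete with a small sketch. Euclidean distance is only one possible choice of measure, and the three profiles are hypothetical two-dimensional vectors picked so the distances are easy to check by hand.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(vectors):
    """N x N table of pairwise distances, the input to most grouping methods."""
    return [[euclidean(u, v) for v in vectors] for u in vectors]

profiles = [[0.0, 0.0], [3.0, 4.0], [3.0, 5.0]]   # hypothetical rows
matrix = distance_matrix(profiles)
print(matrix[0][1])   # 5.0
print(matrix[1][2])   # 1.0
```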
An experiment
• 12 genes
• Expression values at 0, 2, 4, 6, 8 and 10 hours
Table 4.2 of Campbell/Heyer
Name  0 hrs  2 hrs  4 hrs  6 hrs  8 hrs  10 hrs
C     1      8      12     16     12     8
D     1      3      4      4      3      2
E     1      4      8      8      8      8
F     1      1      1      .25    .25    .1
G     1      2      3      4      3      2
H     1      .5     .33    .25    .33    .5
I     1      4      8      4      1      .5
J     1      2      1      2      1      2
K     1      1      1      1      3      3
L     1      2      3      4      3      2
M     1      .33    .25    .25    .33    .5
N     1      .125   .0833  .0625  .0833  .125
Take logs (base 2)
Name  0 hrs  2 hrs  4 hrs  6 hrs  8 hrs  10 hrs
C     0      3.0    3.58   4.0    3.58   3.0
D     0      1.58   2.0    2.0    1.58   1.0
E     0      2.0    3.0    3.0    3.0    3.0
F     0      0      0      -2.0   -2.0   -3.32
G     0      1.0    1.58   2.0    1.58   1.0
H     0      -1.0   -1.6   -2.0   -1.6   -1.0
I     0      2.0    3.0    2.0    0      -1.0
J     0      1.0    0      1.0    0      1.0
K     0      0      0      0      1.58   1.58
L     0      1.0    1.58   2.0    1.58   1.0
M     0      -1.6   -2.0   -2.0   -1.6   -1.0
N     0      -3.0   -3.59  -4.0   -3.59  -3.0
• Compare
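The log table can be reproduced with a short sketch, shown here for three of the twelve rows. Base 2 is inferred from the values (for example, 8 becomes 3.0); rounding to two decimals is a display choice, and the slide rounds some entries to one decimal.

```python
import math

# Three rows from Table 4.2 (expression at 0, 2, 4, 6, 8, 10 hours).
rows = {
    "C": [1, 8, 12, 16, 12, 8],
    "F": [1, 1, 1, 0.25, 0.25, 0.1],
    "K": [1, 1, 1, 1, 3, 3],
}

def log2_row(values):
    """log2 of each value: doubling becomes +1 and halving becomes -1."""
    return [round(math.log2(v), 2) for v in values]

logged = {name: log2_row(values) for name, values in rows.items()}
for name, values in logged.items():
    print(name, values)
```

On the log scale a gene that rises 8-fold (+3.0) and one that falls 8-fold (-3.0) sit equally far from zero, which the raw ratios 8 and 0.125 do not show.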
How Similar are two Rows?
• How similar are the expressions of two genes?
• First we'll normalize each row
• Calculate the mean and standard deviation for each gene
• Normalize each value by subtracting the mean and dividing by the standard deviation.
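Row normalization can be sketched as below, using row G of the table. The population standard deviation is an assumption (the slides do not say whether the population or sample form is used), and a row with constant values would need special handling because its standard deviation is zero.

```python
from statistics import mean, pstdev

def normalize_row(values):
    """Subtract the row mean and divide by the row standard deviation."""
    m = mean(values)
    s = pstdev(values)   # population standard deviation (assumed)
    return [(v - m) / s for v in values]

row_g = [1, 2, 3, 4, 3, 2]
z = normalize_row(row_g)
print([round(v, 3) for v in z])
```

After normalization every row has mean 0 and standard deviation 1, so rows can be compared by the shape of their profile rather than by magnitude.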
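How Similar are two Rows?
• Calculate the Pearson correlation between pairs of rows
• Correlation quantifies the extent to which the expression patterns of two genes go up or down together, regardless of their magnitudes.
• Calculated by taking the dot product of the two normalized vectors and dividing by the number of values (when the population standard deviation is used)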
> (pc '(1 2 3 4 3 2)    ; row G
      '(1 2 3 4 3 2))   ; row L
1.0
> (pc '(1 2 3 4 3 2)    ; row G
      '(1 3 4 4 3 2))   ; row D
0.8971499589146109
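The slides call a function named pc without showing its definition. One way it might be written is sketched below in Python; this is an illustrative reimplementation of the standard Pearson formula, not the original code.

```python
import math

def pc(x, y):
    """Pearson correlation between two equal-length expression rows."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    spread_x = math.sqrt(sum(a * a for a in dx))
    spread_y = math.sqrt(sum(b * b for b in dy))
    return covariance / (spread_x * spread_y)

row_g = [1, 2, 3, 4, 3, 2]
row_l = [1, 2, 3, 4, 3, 2]
row_d = [1, 3, 4, 4, 3, 2]
print(pc(row_g, row_l))   # approximately 1.0
print(pc(row_g, row_d))   # approximately 0.897
```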
Some other pairs
> (pc '(1 3 4 4 3 2)            ; row D
      '(1 .33 .25 .25 .33 .5))  ; row M
-0.9260278787295065
> (pc '(1 2 3 4 3 2)            ; row G
      '(1 .5 .33 .25 .33 .5))   ; row H
-0.9090853650855358
Pearson Correlation
• pc(G,L) = 1 -- identically expressed genes
• pc(G,D) = .897 -- similarly expressed genes
• pc(D,M) = -.926 -- reciprocally expressed
• pc(G,H) = -.909 -- also reciprocally expressed
Descriptive and Predictive Modeling
• Clustering
• Feature extraction/selection
• Classification - discrimination analysis
Analytic Approaches
• Clustering: identification of associations between data points; organization of data into groups
• Unsupervised clustering: genes clustered by similarity/correlation, or other criteria based on X-values; no useful external information about the Y-variables (the response) is used → doesn't reveal groups of genes with special interest for tissue discrimination
• Supervised methods: grouping of variables (genes), controlled by information about the X and Y variables → supervised algorithms try to find gene clusters whose average expression profile has great potential for explaining the response Y, i.e. for tissue discrimination
• Unsupervised clustering algorithms
– Hierarchical
– K-means
– Self-organizing maps
– Others
Eisen et al.
http://www.pnas.org/cgi/content/full/95/25/14863
Figure: Gene Expression Matrix & Hierarchical Clustering (rows: genes, columns: samples)
Theory
• Hierarchical clustering works by sequentially joining the two nearest clusters, then the next two closest clusters, and so on, so that the nearest clusters are joined first and the farthest clusters last.
• Initially each individual data point is its own cluster
Hierarchical Clustering Algorithm
• Given a set of N items to be clustered, and an N*N distance (or similarity) matrix.
1. Start by assigning each item to a cluster, so that if you have N items, you will now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be defined as the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster. You now have one fewer cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
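The four steps above can be sketched as follows. This is a deliberately simple version for small N: instead of storing updated cluster distances in step 3, it recomputes them from the item distances on demand. The linkage argument defaults to min (single linkage), and the five one-dimensional points are hypothetical.

```python
def hierarchical_cluster(dist, linkage=min):
    """Agglomerative clustering on an N x N distance matrix.

    Returns the merges in order as (cluster_a, cluster_b, distance),
    where each cluster is a tuple of item indices.
    """
    clusters = [(i,) for i in range(len(dist))]           # step 1
    merges = []

    def cluster_distance(a, b):                            # step 3, on demand
        return linkage(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:                               # step 4
        d, ai, bi = min(                                   # step 2
            (cluster_distance(a, b), ai, bi)
            for ai, a in enumerate(clusters)
            for bi, b in enumerate(clusters)
            if ai < bi
        )
        a, b = clusters[ai], clusters[bi]
        merges.append((a, b, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (ai, bi)]
        clusters.append(a + b)
    return merges

points = [0, 1, 5, 6, 20]                                  # hypothetical items
dist = [[abs(p - q) for q in points] for p in points]
merges = hierarchical_cluster(dist)
for a, b, d in merges:
    print(a, b, d)
```

The list of merges, read in order, is the information a dendrogram displays: which clusters were joined and at what distance.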
Hierarchical in action
Variations of Hierarchical Algorithm
• Step 3 (computing distances between the new cluster and each of the old clusters) can be done in several different ways: single linkage, average linkage and complete linkage.
• In single linkage the distance between clusters is equal to the shortest distance from any one member of one cluster to any one member of the other cluster.
• In average linkage the distance between two clusters is defined as the average distance from any member of one cluster to any member of the other cluster.
• In complete linkage the distance between clusters is defined as the maximum distance from any one member of the first cluster to any one member of the second cluster.
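The three linkage rules differ only in how they summarize the member-to-member distances, as this sketch shows. The clusters hold hypothetical one-dimensional values and the distance function is a plain absolute difference.

```python
from statistics import mean

def pairwise_distances(cluster_a, cluster_b, dist):
    """All distances from a member of one cluster to a member of the other."""
    return [dist(x, y) for x in cluster_a for y in cluster_b]

def single_linkage(cluster_a, cluster_b, dist):
    return min(pairwise_distances(cluster_a, cluster_b, dist))

def average_linkage(cluster_a, cluster_b, dist):
    return mean(pairwise_distances(cluster_a, cluster_b, dist))

def complete_linkage(cluster_a, cluster_b, dist):
    return max(pairwise_distances(cluster_a, cluster_b, dist))

def absolute_difference(x, y):
    return abs(x - y)

cluster_a = [1.0, 2.0]   # hypothetical members
cluster_b = [5.0, 9.0]
print(single_linkage(cluster_a, cluster_b, absolute_difference))     # 3.0
print(average_linkage(cluster_a, cluster_b, absolute_difference))    # 5.5
print(complete_linkage(cluster_a, cluster_b, absolute_difference))   # 8.0
```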
Variations of Hierarchical Algorithm
• Self-Organizing Tree Algorithm
– Unsupervised neural network with a binary tree topology
– Combination of SOM and hierarchical clustering
– Run time is approximately linear
• Faster than normal hierarchical method
– Uses divisive method
• In comparison to bottom-up method of hierarchical
Advantages
• Hierarchical clustering results in a visual representation that is convenient for humans to analyze
• Unlike k-means and SOM, it does not need an a priori cluster number
Why cluster analysis may not be “the” answer
• Clustering methods typically require user inputs:
– Example: distance measure
• Clustering methods differ in the way that the number of clusters is specified.
• Clustering methods are often sensitive to the initialization condition (starting guess)
• Local vs. global sampling of clustering space
Cluster Analysis Challenges
• “Noise” in the data itself
• Large data sets
– most of the techniques currently used were not developed for multidimensional data
• What about networks?
– limitation of cluster analysis: similarity in expression pattern suggests co-regulation but doesn't reveal cause-effect relationships
Feature Selection & Classification
• First, identify features (genes) that discriminate between classes
• Then use features for classification
– machine learning approach
– supervised analysis
– assignment of a new sample to a previously specified class, based on sample features and a trained classifier
“Classic” Example: Classification of AML vs. ALL
• Biological/clinical problems:
– previously, no single reliable test to distinguish them
– differ greatly in clinical course & response to treatments
Golub et al., Science Oct 15 1999: 531-537
• Comparing 2 acute leukemias
– acute myeloid leukemia (AML)
– acute lymphoid leukemia (ALL)
Study Design
The prediction of a new sample is based on 'weighted votes' of a set of informative genes
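A weighted-vote predictor can be sketched as follows. The weighting (difference of class means over the sum of class standard deviations) and the prediction strength PS follow the scheme reported by Golub et al. as far as it can be summarized here; the use of the sample standard deviation, the two-gene toy data and all function names are assumptions for illustration.

```python
from statistics import mean, stdev

def train(class1, class2):
    """Per gene: weight a = (m1 - m2) / (s1 + s2), boundary b = (m1 + m2) / 2."""
    params = []
    for g in range(len(class1[0])):
        v1 = [sample[g] for sample in class1]
        v2 = [sample[g] for sample in class2]
        m1, m2 = mean(v1), mean(v2)
        params.append(((m1 - m2) / (stdev(v1) + stdev(v2)), (m1 + m2) / 2))
    return params

def predict(params, sample, threshold=0.3):
    """Each gene votes a * (x - b); positive votes favor class 1."""
    votes = [a * (x - b) for (a, b), x in zip(params, sample)]
    v1 = sum(v for v in votes if v > 0)
    v2 = -sum(v for v in votes if v < 0)
    win, lose = max(v1, v2), min(v1, v2)
    ps = (win - lose) / (win + lose) if win + lose > 0 else 0.0
    label = "class1" if v1 > v2 else "class2"
    return (label if ps >= threshold else "uncertain"), ps

# Hypothetical training data: 3 samples per class, 2 informative genes.
class1 = [[10, 2], [12, 3], [11, 1]]
class2 = [[2, 9], [3, 11], [1, 10]]
params = train(class1, class2)
print(predict(params, [11, 2]))     # ('class1', 1.0)
print(predict(params, [6.5, 6]))    # ('uncertain', 0.0)
```

Samples whose PS falls below the threshold are left unassigned, which corresponds to the "uncertain" calls (PS < 0.3) reported in the results.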
Results of the study
1) Clustering of microarray data using tumors of known type
– found 1100 of 6817 genes correlated with class distinction
2) Formation of a class predictor = 50 most informative genes used as a training set
– classification of unknown tumors
Results
How to test the validity of class predictors?
• Cross-validation tests: The 50-gene predictor assigned 36 of the 38 samples as either AML or ALL and the remaining two as uncertain (PS < 0.3). All 36 predictions agreed with the patients' clinical diagnosis;
• Independent test: The 50-gene predictor was applied to an independent collection of 34 leukemia samples. The predictor assigned 29 of the 34 samples, and the accuracy was 100%;
• Prediction strength: median PS = 0.77 in cross-validation and 0.73 in independent test (Fig. 3A).
Results
Class discovery
• If the AML-ALL distinction were not already known, could it have been discovered simply on the basis of gene expression?
Results
Two cluster analysis
(1). Cluster tumors by gene expression:
• A two-cluster SOM was applied to automatically group the 38 initial leukemia samples into two classes on the basis of the expression pattern of all 6817 genes.
Results
Determine whether putative classes produced are meaningful.
• The clusters were first evaluated by comparing them to the known AML-ALL classes (Fig. 4A). Class A1 contained mostly ALL (24 of 25 samples) and class A2 contained mostly AML (10 of 13 samples). The SOM was thus quite effective at automatically discovering the two types of leukemia.
Results
• How could one evaluate such putative clusters if the "right" answer were not already known?
Class discovery could be tested by class prediction: if putative classes reflect true structure, then a class predictor based on these classes should perform well.