all the data: challenges - stanford...
TRANSCRIPT
![Page 1: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/1.jpg)
All the Data: challenges
SusanHolmeshttp://www-stat.stanford.edu/˜susan/
Bio-X andStatistics, StanfordUniversity
STAMPS,2015thanks, NIH-TR01andNIH-R01GM086884
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 2: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/2.jpg)
What'shappeninginBiologicalDataAnalysis?
1985 1998 2012
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 3: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/3.jpg)
What'shappeninginBiologicalDataAnalysis?1985 1998 2012
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 4: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/4.jpg)
messy
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 5: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/5.jpg)
(Big)Challengesforthoseofusworkingfromthegroundup
Heterogeneity. Heteroscedasticity. InformationLeaks. MultiplicityofChoices. Reproducibility.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 6: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/6.jpg)
Part I
`Homogeneous data are all alike;
all heterogeneous data are heterogeneous
in their own way.'
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 7: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/7.jpg)
HeterogeneityofData Status: response/explanatory. Hidden(latent)/measured. Types:
Continuous Binary, categorical Graphs/Trees Images Maps/SpatialInformation Rankings
Amountsofdependency: independent/timeseries/spatial. Differenttechnologiesused(454, phyloseq, Illumina,
MassSpec, RNA-seq).
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 8: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/8.jpg)
HeterogeneousDataObjectsInputanddatamanipulationwith phyloseq
(McMurdieandHolmes, 2013, PlosONE)AsalwaysinR:objectorienteddata.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 9: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/9.jpg)
apepackage
OTU Abundanceotu_table
Sample Variablessample_data
Taxonomy TabletaxonomyTable
Phylogenetic Treephylo
otu_table sample_data tax_table phy_tree
otu_table sample_data tax_table
read.treeread.nexusread_tree
as as as
import
phyloseqconstructor:
Biostringspackage
Reference Seq.XStringSet
DNAStringSet RNAStringSet
AAStringSet
phyloseq
Experiment Data
otu_table,sam_data,tax_table,phy_treerefseq
Accessors:get_taxaget_samplesget_variablensamplesntaxarank_namessample_namessample_sumssample_variablestaxa_namestaxa_sums
Processors:filter_taxamerge_phyloseqmerge_samplesmerge_taxaprune_samplesprune_taxasubset_taxasubset_samplestip_glomtax_glom
matrix matrixdata.frame
optional
refseq
data
data structure & APIphyloseq
http://joey711.github.io/phyloseq/
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 10: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/10.jpg)
phyloseq
Preprocessing
Import
Direct Plots
plot_network plot_heatmap plot_ordination
distance ordinate
Summary / ExploratoryGraphics
filter_taxafilterfun_samplegenefilter_sampleprune_taxaprune_samplessubset_taxasubset_samplestransform_sample_counts
import_biomimport_mothurimport_pyrotaggerimport_qiimeimport_RDP
plot_tree
plot_richness
plot_bar
bootstrappermutation testsregressiondiscriminant analysismultiple testinggap statisticclusteringprocrustes
Inference, Testing
sample data
OTU cluster output
Input
raw
phyloseqprocessed
work flowphyloseq
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 11: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/11.jpg)
graphics
−0.3
−0.2
−0.1
0.0
0.1
0.2
0.3
−0.4 −0.2 0.0 0.2 0.4NMDS1
NMDS
2
SampleType
FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue
plot_ordination, NMDS, wUF
FreshwaterFreshwater (creek)FreshwaterFreshwater (creek)Freshwater (creek)SoilSoilSoilSkinSkinSkinM
ockM
ockM
ockFecesFecesFecesFecesSedim
ent (estuary)TongueTongueO
ceanO
ceanO
ceanSedim
ent (estuary)Sedim
ent (estuary)
SampleType
OTU
1
100
10000
Abundance
plot_heatmap; bray−curtis, NMDS
SeqTech
IlluminaPyro454Sanger
Enterotype 1
23
plot_network; Enterotype data, bray−curtis, max.dist=0.25
CytophagaEmticicia
Sphingobacterium
Segetibacter
Haliscomenobacter
Pedobacter
Bacteroides
Alistipes
Bacteroides
Cytophaga
Porphyromonas
Prevotella
Parabacteroides
Algoriphagus
Odoribacter
CandidatusAquirestis
Capnocytophaga
Porphyromonas
Spirosoma
Prevotella
Balneola
Prevotella
Hymenobacter
Prevotella
76
73
75
75
79
67
81
84
84
82
75
Abundance
12562515625
SampleType
FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue
Order Bacteroidales
FlavobacterialesSphingobacteriales
plot_tree; Bacteroidetes−only. Merged samples, tip_glom=0.1
0e+00
2e+05
4e+05
6e+05
Feces
Freshwater
Freshwater (creek)
Mock
Ocean
Sediment (estuary)
Skin
Soil
Tongue
SampleType
Abun
danc
e
FamilyBacteroidaceaeBalneolaceaeCryomorphaceaeCyclobacteriaceaeFlavobacteriaceaeFlexibacteraceaePorphyromonadaceaePrevotellaceaeRikenellaceaeSaprospiraceaeSphingobacteriaceae
plot_bar; Bacteroidetes−only
S.obs S.chao1 S.ACE
2000
4000
6000
8000
FALSE TRUE FALSE TRUE FALSE TRUEHuman Associated Samples
Num
ber o
f OTU
sSampleType
FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue
plot_ordination()
plot_network()
plot_bar()
plot_heatmap()
plot_tree()
plot_richness()
phyloseq
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 12: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/12.jpg)
graphics
plot_ordination()
−1.0
−0.5
0.0
0.5
1.0
1.5
−1 0 1CA1
CA2
SampleTypeFecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTongue
Samples Only; type="samples"
−1.0
−0.5
0.0
0.5
1.0
1.5
−1 0 1CA1
CA2
Class
FlavobacteriaSphingobacteriaBacteroidiasamples
SampleType
FecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTonguetaxa
type
samplestaxa
Biplot; type="biplot"
Bacteroidia Flavobacteria Sphingobacteria
−1.0
−0.5
0.0
0.5
1.0
1.5
−1 0 1 −1 0 1 −1 0 1CA1
CA2
Class
BacteroidiaFlavobacteriaSphingobacteria
Taxa Only; type="taxa"
samples taxa
−1.0
−0.5
0.0
0.5
1.0
1.5
−1 0 1 −1 0 1CA1
CA2
ClassFlavobacteriaSphingobacteriaBacteroidiasamples
SampleTypeFecesFreshwaterFreshwater (creek)MockOceanSediment (estuary)SkinSoilTonguetaxa
Split Plot; type="split"
type="scree"
split
taxa-only
biplot
samples-only
phyloseq
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 13: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/13.jpg)
markdown !(code + console) +
figures
phyloseq + !ggplot2 + !etc.
# Main title!This is an [R Markdown](my.link.com) document of my recent analysis.!## Subsection: some codeHere is some import code, etc.```rlibrary("phyloseq")library("ggplot2")physeq = import_biom(“datafile.biom”)plot_richness(physeq)```
source.Rmd
Complete HTML5
knitr::knit2html()
microbiome data
Our Goal with Collaborators:!Reproducible analysis workflow with R-markdown
Better Reproducibility
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 14: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/14.jpg)
Part II
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 15: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/15.jpg)
ExampleofStudy
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 16: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/16.jpg)
Summaryofthestudy Choosethedatatransformation(hereproportions
replacedtheoriginalcounts).... log, rlog, subsample, prop, orig.
Takeasubsetofthedata, somesamplesdeclaredasoutliers. ... leaveout0, 1, 2,..,9, +criteria(10)......
Filteroutcertaintaxa(unknownlabels, rare, etc...)... removeraretaxa(thresholdat0.01%, 1%, 2%,...)
Chooseadistance.... 40choicesinvegan/phyloseq.
Chooseanordinationmethodandnumberofcoordinates.... MDS,NMDS,k=2,3,4,5..
Chooseaclusteringmethod, chooseanumberofclusters.... PAM,KNN,densitybased, hclust...
Chooseanunderlyingcontinuousvariable(gradientorgroupofvariables: manifold).
Chooseagraphicalrepresentation.. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 17: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/17.jpg)
Therearethusmorethan200millionpossiblewaysofanalyzingthisdata:
5× 100× 10× 40× 8× 16× 2× 4 = 204800000
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 18: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/18.jpg)
Part III
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 19: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/19.jpg)
Some real data (Caporoso et al, 2011)
> GlobalPatterns
phyloseq-class experiment-level object
otu_table() OTU Table: [ 19216 taxa and 26 samples ]
sample_data()Sample Data: [ 26 samples by 7 sample variables ]
tax_table()Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree:[ 19216 tips and 19215 internal nodes ]
> sample_sums(GlobalPatterns)
CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
864077 1135457 697509 1543451 2076476 718943 433894 186297
.....
NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1
1478965 1652754 58688 493126 279704 937466 1211071 1216137
> summary(sample_sums(GlobalPatterns))
Min. 1st Qu. Median Mean 3rd Qu. Max.
58690 567100 1107000 1085000 1527000 2357000
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 20: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/20.jpg)
Howtodealwithdifferentnumbersofreads?
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 21: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/21.jpg)
CurrentMethod: RarefyingAdhoc librarysizenormalizationbyrandomsubsamplingwithoutreplacement.1. Selectaminimumlibrarysize, NL,min. Thishasalsobeen
calledthe rarefactionlevel thoughwewillnotusethetermhere.
2. Discardlibraries(microbiomesamples)thathavefewerreadsthan NL,min.
3. Subsampletheremaininglibrarieswithoutreplacementsuchthattheyallhavesize NL,min.
Often NL,min ischosentobeequaltothesizeofthesmallestlibrarythatisnotconsidered defective, andtheprocessofidentifyingdefectivesamplescomeswithariskofsubjectivityandbias. Inmanycasesresearchershavealsofailedtorepeattherandomsubsamplingstep(3)orrecordthepseudorandomnumbergenerationseed/process---bothofwhichareessentialforreproducibility.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 22: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/22.jpg)
ReductionofDatatoProportions
Manysoftwareprogramsautomaticallyreducethedatatorelativeproportions, losingtheinformationaboutlibrarysizesorreadcounts.Thismakescomparisonsverydifficult.StatisticalFormulation: Whenmakinga(testing)decision,reducingresultsfromaBinomialdistributionintoaproportiondoesnotgivean admissible procedure.Definition:Anadmissibleruleisanoptimalruleformakingadecisioninthesensethatthereisnootherrulethatisalwaysbetterthanit.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 23: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/23.jpg)
Aimofthestudies: DifferentialAbundance
Likedifferentiallyexpressedgenes, aspecies/OTU isconsidereddifferentiallyabundantifitsmeanproportionissignificantlydifferentbetweentwoormoresampleclassesintheexperimentaldesign.OptimalityCriteria:SensititivityorPower TruePositiveRate.Specificity TrueNegativeRate.
Wehavetocontrolformanysourcesoferror(blocking,modeling, etc..)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 24: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/24.jpg)
NegativeBinomialorGamma-Poisson
McMurdieandHolmes(2014)``WasteNot, WantNot: Whyrarefyingmicrobiomedataisinadmissible'',NegativeBinomialasahierarchicalmixtureforreadcountsIftechnicalreplicateshavesamenumberofreads: sj,Poissonvariationwithmean µ = sjui.Taxa i incidenceproportion ui.Numberofreadsforthesample j andtaxa i wouldbe
Kij ∼ Poisson (sjui)
SameasDESeqforRNA-seq
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 25: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/25.jpg)
ImprovementinPowerandFDR
Performanceofdifferentialabundancedetectionwithandwithoutrarefyingsummarizedby“AreaUndertheCurve”(AUC) metricofaReceiverOperatorCurve(ROC) (verticalaxis).Briefly, theAUC valuevariesfrom0.5(random)to1.0(perfect).Thehorizontalaxisindicatestheeffectsize, shownasthefactorappliedtoOTU countstosimulateadifferentialabundance.Eachcurvetracestherespectivenormalizationmethod’smeanperformanceofthatpanel, withaverticalbarindicatingastandarddeviationinperformanceacrossallreplicatesandmicrobiometemplates.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 26: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/26.jpg)
Theright-handsideofthepanelrowsindicatesthemedianlibrarysize, N,whilethedarknessoflineshadingindicatesthenumberofsamplespersimulatedexperiment.Colorshadeandshapeindicatethenormalizationmethod.DetectionamongmultipletestswasdefinedusingaFalseDiscoveryRate(Benjamini-Hochberg)significancethresholdof0.05.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 27: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/27.jpg)
Part IV
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 28: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/28.jpg)
UnifracDistance(LozuponeandKnight, 2005)
isadistancebetweengroupsoforganismsthatarerelatedtoeachotherbyatree.SupposewehavetheOTUspresentinsample1(blue)andinsample2(red).Question: Dothetwosamplesdifferphylogenetically?ItisdefinedastheratioofthesumofthelengthsofthebranchesleadingtomembersofgroupA ormembersofgroupB butnotbothtothetotalbranchlengthofthetree.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 29: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/29.jpg)
WeightedUnifracdistance A modificationofUniFrac,weightedUniFracisdefinedin(Lozuponeetal., 2007)as
n∑i=1
bi × | AiAT
− BiBT
|
n = numberofbranchesinthetree bi =lengthoftheithbranch Ai =numberofdescendantsof
ithbranchingroupA AT =totalnumberofsequences
ingroupA
[?]orcanbeseenasbyprobabilistsastheWassersteindistance(earthmovers)[?]. . .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 30: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/30.jpg)
Costelloetal. 2010
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 31: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/31.jpg)
Rao'sDistance
Westartwithadistancebetweenindividuals.Theheterogeneityofapopulation(Hi )istheaveragedistancebetweenmembersofthatpopulation.Theheterogeneitybetweentwopopulations(Hij)istheaveragedistancebetweenamemberofpopulation i andamemberofpopulation j.Thedistancebetweentwopopulationsis
Dij = Hij −1
2(Hi + Hj)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 32: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/32.jpg)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 33: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/33.jpg)
DecompositionofDiversity
Ifwehavepopulations 1, . . . , k withfrequencies π1, . . . , πk,thenthediversityofallthepopulationstogetheris
H0 =
k∑i=1
πiHi +∑i
∑j
πiπjDij = H(w) + D(b)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 34: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/34.jpg)
DoublePrincipalCoordinateAnalysisPurdom, 2005hadtheideaofusingamethoddevelopedbyPavoineetal. calledDPCoA [?]implementedin ade4 [?].Supposewehavenspeciesinplocationsanda(euclidean)matrix ∆ givingthesquaresofthepairwisedistancesbetweenthespecies. Thenwecan
Usethedistancesbetweenspeciestofindanembeddingin n− 1 -dimensionalspacesuchthattheeuclideandistancesbetweenthespeciesisthesameasthedistancesbetweenthespeciesdefinedin ∆.
Placeeachoftheplocationsatthebarycenterofitsspeciesprofile. TheeuclideandistancesbetweenthelocationswillbethesameasthesquarerootoftheRaodissimilaritybetweenthem.
UsePCA tofindalower-dimensionalrepresentationofthelocations.
Givethespeciesandcommunitiescoordinatessuchthattheinertiadecomposesthesamewaythediversitydoes.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 35: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/35.jpg)
FukuyamaandHolmes, 2012.Method Originaldescription Newformula PropertiesDPCoA square root of Rao's distance
basedonthesquarerootofthepatristicdistances
[∑
i bi(Ai/AT − Bi/BT)2]1/2 Mostsensitivetooutliers, leastsensitive to noise, upweightsdeep differences, gives OTUlocations
wUniFrac∑
i bi |Ai/AT − Bi/BT|∑
i bi |Ai/AT − Bi/BT| Less sensitive to outliers/moresensitivetonoisethanDPCoA
UniFrac fractionofbranchesleadingtoexactlyonegroup
∑i bi1
Ai/AT−Bi/BTAi/AT+Bi/BT
≥ 1 Sensitive to noise, upweightsshallowdifferencesonthetree
Summaryofthemethodsunderconsideration. ``Outliers"referstohighlyabundantOTUs, andnoisereferstonoiseindetectinglow-abundanceOTUs(seethetextformoredetail).
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 36: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/36.jpg)
AntibioticTimeCourseData
Measurementsofabout2500differentbacterialOTUsfromstoolsamplesofthreepatients(D,E,F)Eachpatientsampled ∼ 50timesduringthecourseoftreatmentwithciprofloxacin(anantibiotic).TimescategorizedasPreCp, 1stCp, 1stWPC (weekpostcipro), Interim, 2ndCp, 2ndWPC,andPostCp.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 37: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/37.jpg)
UniFrac
Axis 1: 14.7%
Axi
s 2:
10.
3%
−0.2
−0.1
0.0
0.1
0.2
−0.2 −0.1 0.0 0.1 0.2 0.3 0.4
weighted UniFrac
Axis 1: 47.6%A
xis
2: 1
2.3%
−0.2
−0.1
0.0
0.1
0.2
−0.4−0.3−0.2−0.1 0.0 0.1 0.2 0.3
weighted UF on presence/absence
Axis 1: 32.7%
Axi
s 2:
15.
1%
−0.05
0.00
0.05
0.10
−0.10−0.050.000.050.100.150.20
subject
D
E
F
ComparingtheUniFracvariants. Fromlefttoright:PCoA/MDS withunweightedUniFrac, withweightedUniFrac,andwithweightedUniFracperformedonpresence/absencedataextractedfromtheabundancedatausedintheothertwoplots
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 38: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/38.jpg)
(a) MDS of OTUs
Axis 1: 6.2%
Axi
s 2:
3.7
%
−1.5
−1.0
−0.5
0.0
0.5
−1.0 −0.5 0.0 0.5
(c) DPCoA OTU plot
CS1
CS
2
−1.0
−0.5
0.0
0.5
1.0
−1.5 −1.0 −0.5 0.0 0.5 1.0
phylum
4C0d−2
Actinobacteria
Bacteroidetes
Candidate division TM7
Cyanobacteria
Firmicutes
Fusobacteria
Lentisphaerae
Proteobacteria
Synergistetes
Verrucomicrobia
(b) DPCoA community plot
Axis 1: 40.9%
Axi
s 2:
13.
3%−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
−1.0 −0.5 0.0 0.5
subject
D
E
F
(a)PCoA/MDS oftheOTUsbasedonthepatristicdistance, (b)communityand(c)speciespointsforDPCoA afterremovingtwooutlyingspecies.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 39: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/39.jpg)
AntibioticStress
Wenextwanttovisualizetheeffectoftheantibiotic.OrdinationsofthecommunitiesduetoDPCoA andUniFracwithinformationaboutthewhetherthecommunitywasstressedornotstressed(precipro, interim, andpostciprowereconsidered``notstressed'', whilefirstcipro, firstweekpostcipro, secondcipro, andsecondweekpostciprowereconsidered``stressed'').WeseethatforUniFrac, thefirstaxisseemstoseparatethestressedcommunitiesfromthenotstressedcommunities.DPCoA alsoseemstoseparatetheoutthestressedcommunitiesalongthefirstaxis(inthedirectionassociatedwithBacteroidetes), althoughonlyforsubjectsD andE.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 40: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/40.jpg)
Axis1
Axi
s2
−0.2
−0.1
0.0
0.1
0.2
D 1
D 2
E 1 E 2
F 1F 2
−0.2 −0.1 0.0 0.1 0.2 0.3 0.4
Antibiotic stress
1: not stressed
2: stressed
Subject
D
E
F
PCoA/MDS withunweightedUniFrac. Thelabelsrepresentsubjectplusantibioticcondition.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 41: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/41.jpg)
Axis1
Axi
s2
−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
D 1D 2
E 1
E 2
F 1F 2
−1.0 −0.5 0.0 0.5
CommunitypointsasrepresentedbyDPCoA.Thelabelsrepresentsubjectplusantibioticcondition.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 42: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/42.jpg)
ConclusionsforAntibioticStress
SinceUniFracemphasizesshallowdifferencesonthetreeandsincePCoA/MDS withUniFracseemstoseparatethesubjectsfromeachotherbetterthantheothertwomethods, wecanconcludethatthedifferencesbetweensubjectsaremainlyshallowones. However, DPCoA alsoseparatesthesubjectsandthestressedversusnon-stressedcommunities, andexaminingthecommunityandOTU ordinationscantellusaboutthedifferencesinthecompositionsofthesecommunities.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 43: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/43.jpg)
Part V
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 44: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/44.jpg)
Non-reproducibility: EnterotypesStudyNature WebLink
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 45: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/45.jpg)
MultiplicityofChoices/Tests/Tuning
Data Filter Normalize Smooth Project WeightWeight
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 46: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/46.jpg)
SuboptimalWorkflow
Anoptimizationwhereyouonlyallowedyourselftwostepsoneparalleltoeachcoordinate....
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 47: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/47.jpg)
Iteration: DocumentationandRecordkeeping
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 48: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/48.jpg)
DocumentedChoices
Dataminingdiary: arecordofeverysampledropped, everynormalization, everychoiceofweight.ExamplefortheEnterotypedata Cloud Enterotype
example
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 49: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/49.jpg)
RationalScientificEconomics
Manysignificantfindingstodayarenotreproducible(seeJPA Ioannidis-2005). Why?
Datadredging. Researchisrecreatedfromscratcheverytime(new
clusteringmethod, newtensormethod, newsmoothingmethod), insteadofbuildingonpriorresearch.Why?Originalityrequirementpushespeopletoobscureconnectionstoexistingwork.
Someideas: consistentvocabulary, carefulreferencetoexistingwork...Collaborativeframeworks, crowdchecking, user-base.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 50: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/50.jpg)
RationalScientificEconomics
Manysignificantfindingstodayarenotreproducible(seeJPA Ioannidis-2005). Why?Datadredging.
Researchisrecreatedfromscratcheverytime(newclusteringmethod, newtensormethod, newsmoothingmethod), insteadofbuildingonpriorresearch.Why?
Originalityrequirementpushespeopletoobscureconnectionstoexistingwork.
Someideas: consistentvocabulary, carefulreferencetoexistingwork...Collaborativeframeworks, crowdchecking, user-base.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 51: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/51.jpg)
RationalScientificEconomics
Manysignificantfindingstodayarenotreproducible(seeJPA Ioannidis-2005). Why?Datadredging.
Researchisrecreatedfromscratcheverytime(newclusteringmethod, newtensormethod, newsmoothingmethod), insteadofbuildingonpriorresearch.Why?Originalityrequirementpushespeopletoobscureconnectionstoexistingwork.
Someideas: consistentvocabulary, carefulreferencetoexistingwork...Collaborativeframeworks, crowdchecking, user-base.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 52: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/52.jpg)
Complexitiesofproofbysimulation
Itisatheoremthat...Proofverifiedbypeerreview.Oursimulationshows....
Howwerechoicesmade, robustness. Doesthemodelfitthedata? Goodnessoffitishard
whendoing200,000datapoint. Howwastheprogramchecked(theoryavailable, test
case)? Crowdchecking....
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 53: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/53.jpg)
ReproducibleSimulationsandSubsampling
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 54: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/54.jpg)
SupplementaryMaterial
ClusterSimulationCodeDispersionsurvey
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 55: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/55.jpg)
KeepalltheData
......includingtheanalyses
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 57: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/57.jpg)
Goalsalreadyattained:
Modelingmixtures: VarianceStabilizingtransformations(nodatadiscarded).
Waysofcombiningheterogeneousdata: distancesandmultivariaterepresentation(nodatadiscarded).
Throughscripting: multiplemethodsshowninparallelanditeratively(noanalyseshidden).
Dataintegration phyloseq (trees, tables, networks)throughanobjectorientedapproach.
Threshold, sensitivitytestsandmodelingsimulations(tuningmadevisible).
Reproducibility: opensourcestandards, publicationofsourcecodeanddatausingR/Bioconductor. (phyloseq).
Shiny-phyloseq.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 58: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/58.jpg)
BenefittingfromthetoolsandschoolsofStatisticians.......
Thankstothe R community:HadleyWickhamfor ggplot2, Chessel, Jombart, Dray,Thioulouse ade4 , GordonSmythandhisteamfor edgeR,WolfgangHuber, MichaelLovefor DESeq2 andEmmanuelParadisfor ape.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 59: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/59.jpg)
Collaborators:
DavidRelman AlfredSpormann ElisabethPurdomJustinSonnenburgandPersiDiaconis.
PostdoctoralFellows Paul(Joey)McMurdie, AlexAlekseyenko(NYU),BenCallahan, AngelaMarcobal.Students: MilingShen, AldenTimme, KatieShelef, YanaHoy,JohnChakerian, JuliaFukuyama, KrisSankaran.Fundingfrom CIMI-Toulouse, NIH-TR01, NIH/NIGMS R01,NSF-VIGRE andNSF-DMS.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 60: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/60.jpg)
phyloseq
JoeyMcMurdie(joey711ongithub).AvailableinBioconductor.HowcanI (mystudents, mypostdocs...) learnmore?Askme.http://www-stat.stanford.edu/˜susan/
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 61: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/61.jpg)
Multi-tablemethods
Inertia, Co-InertiaWegeneralizeitinseveraldirectionsthroughtheideaofinertia.Asinphysics, wedefineinertiaasaweightedsumofdistancesofweightedpoints.Thisenablesustouseabundancedatainacontingencytableandcomputeitsinertiawhichinthiscasewillbetheweightedsumofthesquaresofdistancesbetweenobservedandexpectedfrequencies, suchasisusedincomputingthechisquarestatistic.Anothergeneralizationofvariance-inertiaistheusefulPhylogeneticdiversityindex. (computingthesumofdistancesbetweenasubsetoftaxathroughthetree).Wealsohavesuchgeneralizationsthatcovervariabilityofpointsonagraphtakenfromstandardspatialstatistics.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 62: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/62.jpg)
Co-Inertia
Whenstudyingtwovariablesmeasuredatthesamelocations,forinstancePH andhumiditythestandardquantificationofcovariationisthecovariance.
sum(x1 ∗ y1 + x2 ∗ y2 + x3 ∗ y3)
ifxandyco-vary-inthesamedirectionthiswillbebig.A simplegeneralizationtothiswhenthevariabilityismorecomplicatedtomeasureasaboveisdonethroughCo-Inertiaanalysis(CIA).Co-inertiaanalysis(CIA) isamultivariatemethodthatidentifiestrendsorco-relationshipsinmultipledatasetswhichcontainthesamesamplesorthesametimepoints. Thatistherowsorcolumnsofthematrixhavetobeweightedsimilarlyandthusmustbematchable.
. .. .. .. .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 63: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/63.jpg)
RV coefficient
TheglobalmeasureofsimilarityoftwodatatablesasopposedtotwovectorscanbedonebyageneralizationofcovarianceprovidedbyaninnerproductbetweentablesthatgivestheRV coefficient, anumberbetween0and1, likeacorrelationcoefficient, butfortables.
RV(A,B) = Tr(A′B)√Tr(A′A)
√Tr(B′B)
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 64: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/64.jpg)
ExampleCurrentworkwithJoeyMcMurdie, LesDethfelsen, DavidRelman, AngelaMarcobal, JustinSonnenburg. Data:
Taxa Readcounts(3patientstakingcipro: twotimecourses): .
Mass-Spec PositiveandNegativeionMassSpecfeaturesandtheirintensities: .
RNA-seq Metagenomicdataongenes:.
HereistheRV tableofthethreearraytypes:
> fourtable$RV
Taxa Kegg MassSpec+ MassSpec-
Taxa 1 0.565 0.561 0.670
Kegg 0.565 1 0.686 0.644
MassSpec+ 0.561 0.686 1 0.568
MassSpec- 0.670 0.644 0.568 1
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .
![Page 65: All the Data: challenges - Stanford Universityweb.stanford.edu/class/bios221/AlltheData_STAMPS.pdf · 2015. 8. 8. · (Big) Challenges for those of us working from the ground up Heterogeneity](https://reader035.vdocument.in/reader035/viewer/2022062219/60f7d72f888d4902a77a5626/html5/thumbnails/65.jpg)
Multipletablemethods
InPCA wecomputethevariance-covariancematrix, inmultipletablemethodswecantakeacubeoftablesandcomputetheRV coefficientoftheircharacterizingoperators.Wethendiagonalizethisandfindthebestweighted`ensemble'.Thisiscalledthe`compromise'andalltheindividualtablescanbeprojectedontoit.
. .. .. . . .. .. .. . . .. .. .. . . .. .. .. . . .. . . .. .. .