geometrical methods for heterogeneous datastatweb.stanford.edu/~susan/ima_geometry.pdfgraph or tree...

134
Geometrical Methods for Heterogeneous Data Susan Holmes http://www-stat.stanford.edu/˜susan/ Bio-X and Statistics, Stanford University IMA, 2013 and funded by NIH-R01GM086884 ABabcdfghiejkl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Upload: others

Post on 13-Nov-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Geometrical Methods forHeterogeneous Data

Susan Holmeshttp://www-stat.stanford.edu/˜susan/

Bio-X and Statistics, Stanford University

IMA, 2013 and funded by NIH-R01GM086884

ABabcdfghiejkl..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 2: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

What's happening in Biological Data Analysis?

1985 1998 2012

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 3: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

What's happening in Biological Data Analysis?1985 1998 2012

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 4: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

What's happening in Biological Data Analysis?1985 1998 2012

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 5: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

What's happening in Biological Data Analysis?1985 1998 2012

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 6: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Challenges for those of us working from the groundup

▶ Heterogeneity.▶ Structured high-dimensionality.▶ Graph or Tree integration.▶ Identifiability.▶ High quality graphics.▶ Reproducibility.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 7: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Part I

Heterogeneity

`Homogeneous data are all alike;

all heterogeneous data are heterogeneous

in their own way.'

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 8: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Heterogeneity of Data▶ Status : response/ explanatory.▶ Hidden (latent)/measured.▶ Types :

▶ Continuous▶ Binary, categorical▶ Graphs/ Trees▶ Images▶ Maps/ Spatial Information▶ Rankings

▶ Amounts of dependency: independent/time series/spatial.▶ Different technologies used (454, phyloseq, Illumina,

MassSpec, RNA-seq).

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 9: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Microbiome and other useful `ome'wordsJoshua Lederberg:`the ecological community of commensal,symbiotic, and pathogenic microorganisms that literally shareour body space and have been all but ignored asdeterminants of health and disease'Microbiome Complete collection of genes contained in the

genomes of microbes living in a givenenvironment.

Numbers Humans shelter 100 trillion microbes (1014), (weare made of 10 ×1012 cells)It is estimated that there are more than 1000`species' of bacteria living in the human gut.

Metagenome Composition of all genes present in anenvironment (soil, gut, seawater), regardless ofspecies.

Transciptome These are the mRNA transcripts in the cell, itreflects the genes that are being activelyexpressed at any given time.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 10: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Bacteria etc... and Us

The human microbiome or human microbiota is theassemblage of microorganisms that reside on the surface andin deep layers of skin, in the saliva and oral mucosa, in theconjunctiva, and in the gastrointestinal tracts.

▶ They include bacteria, fungi, and archaea.▶ Some of these organisms perform tasks that are useful

for the human host. (live in symbiosis)▶ Majority have no known beneficial or harmful effect.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 11: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

YK Lee and SK Mazmanian Science, 2010.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 12: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Human Microbiome: What are the data?

DNA The Genomic material present (16sRNA-geneespecially).

RNA What genes are being turned on (geneexpression).

Mass Spec Specific signatures of chemical compoundspresent.

Clinical Multivariate information about patients' clinicalstatus, medication, weight.

Environmental Location, nutrition, time.Domain Knowledge Metabolic networks, phylogenetic trees,

gene ontologies.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 13: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Heterogeneous Data ObjectsObject Oriented Data AnalysisInput and data manipulation with phyloseq(McMurdie and Holmes, 2013, Plos ONE)As always in R: object oriented data:

Taxonomy Table taxonomyTableslots: .Data

OTU Abundanceclass: otuTableslots: .Data, speciesAreRows

Sample VariablessampleDataslots: .Data,names,row.names,.S3Class

Phylogenetic Treeclass: phyloslots: see ape package

matrix matrixdata.frame

phyloseqslots:otuTablesampleDatataxTabtre

Experiment-level data object:

Component data objects:

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 14: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Questions from many Disciplines

Biomedical sciences Clinical variables, immune history.Genetics Host and bacterial genomes and transcriptomes.Bioinformatics Databases and formats.Biochemistry Compounds, Metabolic pathways involved.Ecology Ordination, diversity, environmental influences.Statistics Decompositions of variability, spatio-temperal

analyses.Systematic Biology Phylogenetic Trees, dynamics of evolution.Network science Microbial communities, gene interactions,

metabolic pathways,...Visualization Dimension reduction, rich representations.Mathematics Differential Geometry, multilinear algebra,

probability, topology.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 15: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Useful first order representation: Many Matrices

▶ Time series of abundance matrices.▶ Different types of data on same samples (taxa counts,

clinical variates, spatial location).▶ Networks and trees over time.▶ Explanatory (environmental) variables, Response variables.

Holmes (2005), Duality Diagrams.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 16: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Part II

Dimension Reduction: theEuclidean embedding workhorse:

MDS

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 17: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Principal Component Analysis: Dimension Reduction

PCA seeks to replace the original (centered) matrix X by amatrix of lower rank, this can be solved using the singularvalue decomposition of X:

X = USV′, with U′DU = In and V′QV = Ip and Sdiagonal

XX′ = US2U′, with U′DU = In and S2 = Λ

PCA is a linear nonparametric multivariate method fordimension reduction. D and Q are the relevant metrics on thedual row and column spaces of n samples and p variables.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 18: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Examples from Ecology

The Boomlake plant data:

Biplot representing both species and locationsBlue circles with letters are species scores

Sampling locations are green circles with numbers.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 19: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Sample 1 is actually in the lake, and sample 12 is far away.Species are located closely to the samples they occur in. Ifyou looked carefully into the data matrix, you would find thatspecies R and Q are strictly aquatic, while species F is adryland plant (cribs). There is an arch effect.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 20: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Psychological Data

Color confusion data (Ekman, 1954):

w434 w445 w465 w472 w490 w504 w537 w555 w584 w600 w610 w628 w651 w6741 0.00 0.86 0.42 0.42 0.18 0.06 0.07 0.04 0.02 0.07 0.09 0.12 0.13 0.162 0.86 0.00 0.50 0.44 0.22 0.09 0.07 0.07 0.02 0.04 0.07 0.11 0.13 0.143 0.42 0.50 0.00 0.81 0.47 0.17 0.10 0.08 0.02 0.01 0.02 0.01 0.05 0.034 0.42 0.44 0.81 0.00 0.54 0.25 0.10 0.09 0.02 0.01 0.00 0.01 0.02 0.045 0.18 0.22 0.47 0.54 0.00 0.61 0.31 0.26 0.07 0.02 0.02 0.01 0.02 0.006 0.06 0.09 0.17 0.25 0.61 0.00 0.62 0.45 0.14 0.08 0.02 0.02 0.02 0.017 0.07 0.07 0.10 0.10 0.31 0.62 0.00 0.73 0.22 0.14 0.05 0.02 0.02 0.008 0.04 0.07 0.08 0.09 0.26 0.45 0.73 0.00 0.33 0.19 0.04 0.03 0.02 0.029 0.02 0.02 0.02 0.02 0.07 0.14 0.22 0.33 0.00 0.58 0.37 0.27 0.20 0.2310 0.07 0.04 0.01 0.01 0.02 0.08 0.14 0.19 0.58 0.00 0.74 0.50 0.41 0.2811 0.09 0.07 0.02 0.00 0.02 0.02 0.05 0.04 0.37 0.74 0.00 0.76 0.62 0.5512 0.12 0.11 0.01 0.01 0.01 0.02 0.02 0.03 0.27 0.50 0.76 0.00 0.85 0.6813 0.13 0.13 0.05 0.02 0.02 0.02 0.02 0.02 0.20 0.41 0.62 0.85 0.00 0.7614 0.16 0.14 0.03 0.04 0.00 0.01 0.00 0.02 0.23 0.28 0.55 0.68 0.76 0.00

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 21: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Results: Color Confusion

Screeplot

●●

● ● ● ●

2 4 6 8 10

0.0

0.5

1.0

1.5

2.0

Index

clas

s.co

l$ei

g

Planar configuration

−0.4 −0.2 0.0 0.2 0.4

−0.

4−

0.2

0.0

0.2

0.4

cmdscale(colorc)[,1]cm

dsca

le(c

olor

c)[,2

]

1 2

34

5

6

78

9

10

11

1213

14

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 22: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Metric Multidimensional ScalingSchoenberg (1935)

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 23: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

From Coordinates to Distances and Back

If we started with original data in Rp that are not centered:Y, apply the centering matrix

X = HY, with H = (I− 1

n11′), and 1′ = (1, 1, 1 . . . , 1)

Call B = XX′, if D(2) is the matrix of squared distancesbetween rows of X in the euclidean coordinates, we can showthat

−1

2HD(2)H = B

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 24: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Schoenberg's result: exact Euclidean distance

If B is positive semi-definite then D can be seen as a distancebetween points in a Euclidean space.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 25: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Reverse engineering an Euclidean embedding

We can go backwards from a matrix D to X by taking theeigendecomposition of B = −1

2HD(2)H in much the same waythat PCA provides the best rank r approximation for data bytaking the singular value decomposition of X, or theeigendecomposition of XX′.

X(r) = US(r)V′ with S(r) =

s1 0 0 0 ...0 s2 0 0 ...0 0 ... ... ...0 0 ... sr ...... ... ... 0 0

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 26: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Multidimensional Scaling (MDS) also called PCoA

Simple classical multidimensional scaling.▶ Square D elementwise D(2) = D2.▶ Compute −1

2 HD2H = B.▶ Diagonalize B to find the principal coordinates SV′.▶ Choose a number of dimensions by inspecting the

eigenvalue's screeplot.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 27: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Malaria Data as seen using ape

Pre1

Pme2

Plo6

Pga11

Pma3

Pbe5

Pfr7

Pkn8

Pcy9

Pvi10

Pfa4

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 28: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Bootstrap of Malaria Data

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 29: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

0 20000 40000 60000 80000 120000

Eigenvalues of MDS for bootstrapped trees

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 30: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

−40 −20 0 20 40

−60

−40

−20

020

40

Bootstrapped trees

o1

2

3

4

5

6

7

8

9

1011

12

1314

15

1617

18

19

20

21

22

23

24

25 26

2728

29

30

31

32

33

34

35

36

373839

40

41

42

43

4445

46

47 4849

50

51

52

53

54

55

5657

5859

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

7576

77

78

7980

81

8283

84

85

86

8788

89

90

91

92

93

94

95

9697

98

99

100o

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 31: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Data Analysis: Geometrical Approach

i. The data are p variables measured on n observations.ii. X with n rows (the observations) and p columns (the

variables).iii. Dn is an n× n matrix of weights on the ``observations'',

which is most often diagonal.iv Symmetric definite positive matrix Q,often

Q =

1σ21

0 0 0 ...

0 1σ22

0 0 ...

0 0 1σ23

0 ...

... ... ... 0 1σ2

p

.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 32: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Euclidean Spaces

These three matrices form the essential ``triplet" (X,Q,D)defining a multivariate data analysis.Q and D define geometries or inner products in Rp and Rn,respectively, through

xtQy =< x, y >Q x, y ∈ Rp

xtDy =< x, y >D x, y ∈ Rn.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 33: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

A Commutative Diagram Approach

Statisticians search for approximations with certainproperties, for the case of PCA for instance, we rephrase theproblem as follows:

▶ Q can be seen as a linear function from Rp toRp∗ = L(Rp), the space of scalar linear functions on Rp.

▶ D can be seen as a linear function from Rn toRn∗ = L(Rn).

▶Rp∗ −−−−→

XRn

Qx yV D

y xW

Rp ←−−−−Xt

Rn∗

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 34: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Properties of the Diagram

Rank of the diagram: X,Xt,VQ and WD all have the samerank.For Q and D symmetric matrices, VQ and WD arediagonalisable and have the same eigenvalues.

λ1 ≥ λ2 ≥ λ3 ≥ . . . ≥ λr ≥ 0 ≥ · · · ≥ 0.

Eigendecomposition of the diagram: VQ is Q symmetric, thuswe can find Z such that

VQZ = ZΛ,ZtQZ = Ip, where Λ = diag(λ1, λ2, . . . , λp). (1)

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 35: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Comparing Two Diagrams: the RV coefficientMany problems can be rephrased in terms of comparison oftwo ``duality diagrams" or put more simply, two characterizingoperators, built from two ``triplets", usually with one of thetriplets being a response or having constraints imposed on it.Most often what is done is to compare two such diagrams,and try to get one to match the other in some optimal way.To compare two symmetric operators, there is either a vectorcovariance as inner productcovV(O1,O2) = Tr(Ot

1O2) =< O1,O2 > or a vector correlation(Escoufier, 1977)

RV(O1,O2) =Tr(Ot

1O2)√Tr(Ot

1O1)tr(Ot2O2)

.

If we were to compare the two triplets(Xn×1, 1,

1nIn)

and(Yn×1, 1,

1nIn)

we would have RV = ρ2.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 36: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

PCA: Approximating one diagram by another

PCA can be seen as finding the matrix Y which maximizes theRV coefficient between characterizing operators, that is,between

(Xn×p,Q,D

)and

(Yn×q, I,D

), under the constraint

that Y be of rank q < p .

RV(XQXtD,YYtD

)=

Tr(XQXtDYYtD

)√Tr (XQXtD)2 Tr (YYtD)2

.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 37: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

This maximum is attained where Y is chosen as the first qeigenvectors of XQXtD normed so that YtDY = Λq. Themaximum RV is

RVmax =

∑qi=1 λ

2i∑p

i=1 λ2i.

Of course, classical PCA has D = 1nI, Q = I, but the extra

flexibility is often useful. We define the distance betweentriplets (X,Q,D) and (Z,Q,M) where Z is also n× p, as thedistance deduced from the RV inner product betweenoperators XQXtD and ZMZtD.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 38: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Part III

Combine and Compare Trees,Graphs and Contingent Data

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 39: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Find sub graphs which are well explained by variables

A subset of the skeleton gene interaction network. −→most perturbed subgraph

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 40: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Find sub graphs which are well explained by variables.

A subgraph related to a chemokine pathway. (GXNA,Serban et al, 2005, Bioinformatics). ..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 41: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Discriminant Analysis as a duality diagram

Let A be the g× p matrix of group means in each of the pvariables. This satisfies

YtDX = ∆YA where ∆Y = YtDY = diag(w1,w2, . . . ,wg),

and wk =∑

i:yik=1 di, the wk's are the group weights, as theyare the sums of the weights as defined by D for all theelements in that group.Call T the matrix T = XtDX, in the standard case with alldiagonal elements of D equal to 1

n this is just the standardvariance-covariance, otherwise it is a generalization thereof.The generalized between group variance-covariance isB = At∆YA and call the between group variance covariancethe matrix W = (X− YA)tD(X− YA).

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 42: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Proposition 1

(a generalized Huyghens' formula):

T = B + W

Proof: Expanding W gives

W = XtDX− XtDYA− AtYtDX + AtYtDYA= T− A′∆YA− A′∆YA + A′∆YA = T− B

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 43: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Duality Diagram for LDA

The duality diagram for linear discriminant analysis is

Rp∗ −−→A

Rg

T−1

x yB ∆Y

y xAT−1At

Rp ←−−At

Rg∗

.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 44: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

This corresponds to the triple (A,T−1,∆Y), because

(XtDY)∆−1Y (YtDX) = At∆YA

and gives equivalent results to the triple (YtDX,T−1,∆−1Y ).

The discriminating variables are the eigenvectors of theoperator

At∆YAT−1.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 45: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Explaining one diagram by another

Principal Component Analysis with respect to InstrumentalVariables was a technique developed by C.R. Rao,(1964) tofind the best set of coefficients in the multivariate regressionsetting where the response is multivariate, given by a matrixY. In terms of diagrams and RV coefficients, this problem canbe rephrased as that of finding M to associate to X so that(X,M,D) is as close as possible to (Y,Q,D) in the RV sense.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 46: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

The answer is provided by defining M such that

YQYtD = λXMXtD.

If this is possible then the two eigendecompositions of thetriple give the same answers. We simplify notation by thefollowing abbreviations:

XtDX = Sxx YtDY = Syy XtDY = Sxy and R = S−1xx SxyQSyxS−1

xx .

Then

∥ YQYtD−XMXtD ∥2=∥ YQYtD−XRXtD ∥2 + ∥ XRXtD−XMXtD ∥2 .

The first term on the right hand side does not depend on M,and the second term will be zero for the choice M = R.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 47: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

If we add the extra constraint that we only allow ourselvesa rank q approximation, with q < min (rank (X), rank (Y)), theoptimal choice of a positive definite matrix M is to takeM = RBBtR where the columns of B are the eigenvectors ofXtDXR with:

B =

(1√λ1

β1, . . .1√λq

βq

)such that

XtDXRβk = λkβk

βtkRβk = λk, k = 1 . . . q

λ1 > λ2 > . . . > λq

The PCA with regards to instrumental variables of rank q isequivalent to the PCA of rank q of the triple (X,R,D) where

R = S−1xx SxyQSyxS−1

xx .

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 48: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Canonical Correlation Analysis

Canonical correlation analysis was introduced by Hotelling[?]to find the common structure in two sets of variables X1 andX2 measured on the same observations. MaximizingCovariance

cov(x, y) = 1

n∑

i(xi − x)(yi − y)

(v(x) cov(x, y)

cov(y, x) v(y)

)

cor(x, y) = cov(x, y)√v(x)

√v(y)

=< x, y >

√< x, x >

√< y, y >

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 49: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

One Diagram to replace Two Diagrams

Canonical correlation analysis was introduced by Hotelling tofind the common structure in two sets of variables X1 and X2

measured on the same observations. This is equivalent tomerging the two matrices columnwise to form a large matrixwith n rows and p1 + p2 columns and taking as the weightingof the variables the matrix defined by the two diagonalblocks (Xt

1DX1)−1 and (Xt

2DX2)−1

Q =

(Xt

1DX1)−1 0

0 (Xt2DX2)

−1

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 50: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

.

Rp1∗ −−−−→X1

Rn

Ip1

x yV1 Dy xW1

Rp1 ←−−−−Xt1

Rn∗

Rp2∗ −−−−→X2

Rn

Ip2

x yV2 Dy xW2

Rp2 ←−−−−Xt2

Rn∗

Rp1+p2∗ −−−−→[X1;X2]

Rn

Qx yV D

y xW

Rp1+p2 ←−−−−−[X1;X2]t

Rn∗

This analysis gives the same eigenvectors as the analysis ofthe triple (Xt

2DX1, (Xt1DX1)

−1, (Xt2DX2)

−1), also known as thecanonical correlation analysis of X1 and X2...... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 51: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Incorporate taxa abundances and phylogenetic tree

Epulopiscium

Clostridium

Adlercreutzia

Lachnospira

Alistipes

Roseburia

Coprococcus

Clostridium

Blautia

Coprococcus

Dehalobacterium

Clostridium

Clostridium

Clostridium

Coprobacillus

Coprococcus

Clostridium

Clostridium

Moryella

Abundance

1

25

625

Class

Actinobacteria (class)

Bacilli

Bacteroidia

Clostridia

Erysipelotrichi

Gammaproteobacteria

Mollicutes

Verrucomicrobiae

YS2

Duality diagram methods that can use any dependencystructure. ..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 52: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Rao's Distance

We start with a distance between samples.The heterogeneity of a population (Hi ) is the averagedistance between members of that population.The heterogeneity between two populations (Hij) is theaverage distance between a member of population i and amember of population j.The distance between two populations is

Dij = Hij −1

2(Hi + Hj)

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 53: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 54: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Decomposition of Diversity

If we have populations 1, . . . , k with frequencies π1, . . . , πk,then the diversity of all the populations together is

H0 =

k∑i=1

πiHi +∑

i

∑j

πiπjDij = H(w) + D(b)

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 55: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Double Principal Component AnalysisPavoine et al. 2004 developed a method known as DPCoAimplemented in ade4.Suppose we have n species in p samples/locations and amatrix ∆ giving the squares of the pairwise distancesbetween the species. (Think distance along the tree of thespecies) Then we can

▶ Use the distances between species to find an embeddingin n− 1 -dimensional space such that the euclideandistances between the species is the same as thedistances between the species defined in ∆.

▶ Place each of the p samples/locations at the barycenterof its species profile. The euclidean distances betweenthe samples/locations will be the same as the squareroot of the Rao dissimilarity between them.

▶ Use PCA to find a lower-dimensional representation ofthe locations.

Give the species and communities coordinates such that theinertia decomposes the same way the diversity does...... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 56: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Double Principal Coordinate Analysis

Takes into account both the tree distances and theabundances.Pavoine, Dufour and Chessel (2004), Purdom (2010) andFukuyama et al. (2011).Takes the tree into account in computation of principalcoordinates.baselineDPCoA=ordinate(phy_pifn_glom, "DPCoA")p=plot_ordination(phy_pifn_glom, baselineDPCoA,type="taxa",color="Class")p+scale_colour_brewer(type = "qual", palette = "Set1")+geom_point(size=7,alpha=0.2)p=plot_ordination(phy_pifn_glom, baselineDPCoA, type="samples", color = "mousenames")+ geom_point(size = 1)+geom_text(aes(label=matn[-which(matn[,2]=="154"),2]),size=8)

p+scale_colour_hue(guide="none")baselineDPCoA$eig[1]/sum(baselineDPCoA$eig)0.88baselineDPCoA$eig[2]/sum(baselineDPCoA$eig)0.067sum(baselineDPCoA$eig[1:2])/sum(baselineDPCoA$eig)0.95

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 57: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

161

161

165165

174

174

175175175

33

33

33

591591 71

71 7171

73

7373

73

91

91

92

92

94

94

152

152

153

153

162

162

163

163

164

164

172

172173

173

34

34

3435

35

35

41

41

43

43

592592

593593

7474

74

74

82

82

8383

85

85

95

95

151

171

1713131

3232

32

42

4445

45

727272 7275

757575

8181

8484

93

93

−0.1

0.0

0.1

−0.6 −0.3 0.0 0.3Axis1

Axi

s2

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 58: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

−0.2

−0.1

0.0

0.1

0.2

−0.5 0.0 0.5CS1

CS

2

Class

Actinobacteria (class)

Bacilli

Bacteroidia

Clostridia

Erysipelotrichi

Gammaproteobacteria

Mollicutes

Verrucomicrobiae

YS2

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 59: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Comparing Graphs and Covariates

Example: Co-occurrence graph network.With the enterotype data we show the results of themakenetwork() function igraph package to create agraphical display of the ``connectedness'' of samples accordingto the profile of taxa that they share.

> data(enterotype)> entc <- subset_samples(enterotype, !is.na(ClinicalStatus))> ig <- newnet(entc, TRUE, 0.02, 0.4)> p <- gggraph(ig)# Make a new plotting data.frame from the ggplot-plot> DF <- sampleData(entc)[ig[[9]][[3]][["name"]], ]> DF <- data.frame( p$data, DF[p$data$value, ] )> p <- p + geom_point(aes(x, y, color=ClinicalStatus,

shape=Enterotype), data=DF, size=5, alpha=1)> print(p)

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 60: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 61: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Network analysis plot produced by phyloseq'smakenetwork() wrapper on the enterotype dataset

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 62: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Bacteria `sharing' between mice

Using the Jaccard index that measures the co-incidence orco-occurrence of species between mice.

Jaccard Similarity =f11

f01 + f10 + f11

Jaccard Disimilarity =f01 + f10

f01 + f10 + f11mouse10 0 0 1 0 1 0 1 0 0 0 0 0 0 1mouse41 0 0 0 0 0 0 0 0 0 0 0 0 0 1vegdist(rbind(mouse1,mouse4),method="jaccard")0.8

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 63: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Bacteria `sharing' between mice as a network

netbaseline=make_network(phy_pifn_glom)p=plot_network(netbaseline,phy_pifn_glom,color="mousenames",label="mousenames",point_size=7)+geom_text(aes(label=mousenames),size=7)p+scale_colour_hue(guide="none")

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 64: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

161

161

165165

174

174

175

175

175

33

33591

71

71

71

73

73

73

73

91

91

92

152

152153

153

163

163

164

172

172

173

34

34

34

35

41

592592

593

74

74

74

74

82

83

83

85

171

31

32

32

32

45

72

72

72

72

7575

75

75

81

81

84

84

93

93

161

161

165 165

174174

175

175175

3333591

71

7171

73

73

73

73

91 91

92

152152 153153

163

163

164

172

172

173

34

34

34

35

41

592592

59374

74

74

74

82

8383

85

171

31

3232

32

45

72

72

72727575

75

75

81

81

8484

9393

163

164

165

171

172

173

174

175

31

32

33

34

35

41

45

591

592

593

71

72

73

74

75

81

82

83

84

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 65: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Does the network relate to `communities'?

Friedman and Rafsky (1979) devised a nonparametric test formultivariate data using the minimum spanning tree with anymetric.Then compute the number of `pure' edging connecting labelsfrom the same groups compared to the mixed edgesconnecting labels from different groups, call Fo the observedstatistic.In our example: Fo = 82Keeping the graph fixed, permute the labels and recomputethe number of pure edges.All 1000 simulated values had Fs < 82 so p < 0.001.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 66: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Co-occurrence networks for taxa of the baselinemice

p=plot_network(netbasetaxa,phy_pifn_glom,color="Family",type="taxa",label=NULL)p+geom_text(aes(label=Class),size=3)

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 67: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Mollicutes

Bacteroidia

Bacteroidia

BacteroidiaBacteroidia

Bacteroidia

Bacteroidia

Bacteroidia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Clostridia

Family

Anaeroplasmataceae

Catabacteriaceae

Clostridiales Family XIII. Incertae Sedis

Lachnospiraceae

Ruminococcaceae

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 68: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Changes of the network over time?

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 69: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Differences between two graphs?

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 70: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Bootstrap for dependency neighborhoods

Given a graph, we want to create a new one by bootstrappingand being able to compare the bootstrapped graphs with ouractual ones.We want a ``Block-Bootstrap'' for graphs.Following Holmes and Reinert, 2004 we use the notion ofneighborhood of dependency.This is what inspires our sampling approach.To create the neighborhood of dependencies,we take a smallrandom walk starting at a node (say 5 steps) and all nodesreached by the walk are sampled.This is an ego-centric sampling scheme. Take the subset ofnodes that have been walked upon and combine them into anew graph.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 71: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Part IV

Combining Many Data Tables

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 72: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Comparative Systems

functionalSamples tim

e

functionalSamples tim

e

geneticSamples tim

e

geneticSamples tim

e

taxonomicSamples tim

e

Samples tim

e

Uncontaminated (eg Sample A) Contaminated (eg Sample B)

taxonomic

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 73: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Multi-table methods

Inertia, Co-InertiaWe generalize it in several directions through the idea ofinertia.As in physics, we define inertia as a weighted sum ofdistances of weighted points.This enables us to use abundance data in a contingency tableand compute its inertia which in this case will be theweighted sum of the squares of distances between observedand expected frequencies, such as is used in computing thechisquare statistic.Another generalization of variance-inertia is the usefulPhylogenetic diversity index. (computing the sum of distancesbetween a subset of taxa through the tree).We also have such generalizations that cover variability ofpoints on a graph taken from standard spatial statistics.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 74: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Co-Inertia

When studying two variables measured at the same locations,for instance PH and humidity the standard quantification ofcovariation is the covariance.

sum(x1 ∗ y1 + x2 ∗ y2 + x3 ∗ y3)

if x and y co-vary -in the same direction this will be big.A simple generalization to this when the variability is morecomplicated to measure as above is done through Co-Inertiaanalysis (CIA).Co-inertia analysis (CIA) is a multivariate method thatidentifies trends or co-relationships in multiple datasets whichcontain the same samples or the same time points. That isthe rows or columns of the matrix have to be weightedsimilarly and thus must be matchable.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 75: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

RV coefficient

The global measure of similarity of two data tables asopposed to two vectors can be done by a generalization ofcovariance provided by an inner product between tables thatgives the RV coefficient, a number between 0 and 1, like acorrelation coefficient, but for tables.

RV(A,B) = Tr(A′B)√Tr(A′A)

√Tr(B′B)

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 76: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

ExampleCurrent work with Joey McMurdie, Les Dethfelsen, DavidRelman, Angela Marcobal, Justin Sonnenburg. Data:

Taxa Read counts (3 patients taking cipro: two timecourses) : .

Mass-Spec Positive and Negative ion Mass Spec featuresand their intensities: .

RNA-seq Metagenomic data on genes :.

Here is the RV table of the three array types:

> fourtable$RVTaxa Kegg MassSpec+ MassSpec-

Taxa 1 0.565 0.561 0.670Kegg 0.565 1 0.686 0.644MassSpec+ 0.561 0.686 1 0.568MassSpec- 0.670 0.644 0.568 1

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 77: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Multiple table methods

In PCA we compute the variance-covariance matrix, inmultiple table methods we can take a cube of tables andcompute the RV coefficient of their characterizing operators.We then diagonalize this and find the best weighted`ensemble'.This is called the `compromise' and all the individual tablescan be projected onto it. file:///Users/susan/MultiTable/ReprodResearch/statis-demo.html

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 78: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Distances and Inertia

Variance is at the heart of ANOVA: decomposition ofvariability into parts that can be explained by a factor and aresidual part.

Test statistic:average ss explained by factoraverage of residual ss

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 79: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Features

1. Inertia : Trace(VQ) = Trace(WD)(inertia in the sense of Huyghens inertia formula forinstance). Huygens, C. (1657),

n∑i=1

pid2(xi, a)

Inertia with regards to a point a of a cloud of pi-weightedpoints.PCA with Q = Ip, D = 1

nIn, and the variables are centered,the inertia is the sum of the variances of all the variables.If the variables are standardized (Q is the diagonal matrix ofinverse variances), then the inertia is the number ofvariables p.For correspondence analysis the inertia is the Chi-squaredstatistic.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 80: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Goals already attained:

▶ Mathematical Generalizations using diagram andmultilinear approaches (cubes with statis).

▶ Compute distances between general trees distory anduse this in MDS.

▶ Data integration phyloseq, networks, trees, abundancetables.

▶ Threshold, sensitivity tests and modeling simulations.▶ Reproducibility: open source standards, publication of

source code and data. (R).Questions:More methods based on approximate diagrams (expectationsas in Stein's method).We know the distance between hierarchical clustering trees,what is the correct distance between barcodes?

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 81: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Benefitting from the tools and schools ofStatisticians.......

Thanks to the R community: Chessel, Jombart, Dray,Thioulouse ade4 and Emmanuel Paradis ape.

Collaborators: David Relman, Alfred Spormann, LesDethfelsen, Justin Sonnenburg, Persi Diaconis, ElisabethPurdom.

Postdoctoral Fellows Paul (Joey) McMurdie.Angela Marcobal, Serban Nacu.Students: Alden Timme, Katie Shelef, Yana Hoy, JohnChakerian, Julia Fukuyama, Kris Sankaran.Funding from NIH/ NIGMS R01, NSF-VIGRE and NSF-DMS.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 82: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

phyloseq

Joey McMurdie (joey711 on github).Available in Bioconductor.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 83: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

L. Billera, S. Holmes, and K. Vogtmann.The geometry of tree space.Adv. Appl. Maths, 771--801, 2001.

J. Chakerian and S. Holmes.distory:Distances between trees, 2010.

Daniel Chessel, Anne Dufour, and Jean Thioulouse.The ade4 package - i: One-table methods.R News, 4(1):5--10, 2004.

P. Diaconis, S. Goel, and S. Holmes.Horseshoes in multidimensional scaling and kernelmethods.Annals of Applied Statistics, 2007.

Y. Escoufier.Operators related to a data matrix.In J.R. et al. Barra, editor, Recent developments inStatistics., pages 125--131. North Holland,, 1977.

Jerome H Friedman and Lawrence C Rafsky...... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 84: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Multivariate generalizations of the Wald-Wolfowitz andSmirnov two-sample tests.The Annals of Statistics, pages 697--717, 1979.

Susan Holmes.Multivariate analysis: The French way.In D. Nolan and T. P. Speed, editors, Probability andStatistics: Essays in Honor of David A. Freedman,volume 56 of IMS Lecture Notes--Monograph Series.IMS, Beachwood, OH, 2006.

Purna C Kashyap, Angela Marcobal, Luke K Ursell,Samuel A Smits, Erica D Sonnenburg, Elizabeth K Costello,Steven K Higginbottom, Steven E Domino, Susan P Holmes,David A Relman, et al.Genetically dictated change in host mucus carbohydratelandscape exerts a diet-dependent effect on the gutmicrobiota.Proceedings of the National Academy of Sciences, page201306070, 2013.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 85: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

S.W. Kembel, P.D. Cowan, M.R. Helmus, W.K. Cornwell,H. Morlon, D.D. Ackerly, S.P. Blomberg, and C.O. Webb.Picante: R tools for integrating phylogenies and ecology.Bioinformatics, 26(11):1463--1464, 2010.

K. Mardia, J. Kent, and J. Bibby.Multiariate Analysis.Academic Press, NY., 1979.

P. J. McMurdie and S. Holmes.Phyloseq: A bioconductor package for handling andanalysis of high-throughput phylogenetic sequence data.

P. J. McMurdie and S. Holmes.Phyloseq: Reproduible research platform for bacterialcensus data.PlosONE, 2013.April 22,.

Serban Nacu, Rebecca Critchley-Thorne, Peter Lee, andSusan Holmes.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 86: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Gene expression network analysis and applications toimmunology.Bioinformatics, 23(7):850--8, Apr 2007.

Jari Oksanen, F. Guillaume Blanchet, Roeland Kindt, PierreLegendre, R. G. O'Hara, Gavin L. Simpson, Peter Solymos,M. Henry H. Stevens, and Helene Wagner.vegan:Community Ecology Package, 2010.

E. Paradis, J. Claude, and K. Strimmer.Ape: analyses of phylogenetics and evolution in Rlanguage.Bioinformatics, 20:289--290, 2004.

Sandrine Pavoine, Anne-Béatrice Dufour, and DanielChessel.From dissimilarities among species to dissimilarities amongcommunities: a double principal coordinate analysis.Journal of Theoretical Biology, 228(4):523--537, 2004.

Katherine S. Pollard, Houston N. Gilbert, Yongchao Ge,Sandra Taylor, and Sandrine Dudoit...... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 87: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

multtest: Resampling-based multiple hypothesis testing,2010.R package version 2.4.0.

C. R. Rao.The use and interpretation of principal componentanalysis in applied research.Sankhya A, 26:329--359., 1964.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 88: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Distances between Trees

▶ Robinson and Foulds, (bipartitions).▶ Nearest Neighbor Interchange (NNI).

Rotation Moves

4

0

2 31

0

4321

0

41 32

▶ Subtree Prune Rebranch. (SPR)▶ Fill-in of NNI moves: Billera, Holmes, Vogtmann (BHV).

The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved' tree.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 89: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Distances between Trees

▶ Robinson and Foulds, (bipartitions).▶ Nearest Neighbor Interchange (NNI). Rotation Moves

4

0

2 31

0

4321

0

41 32

▶ Subtree Prune Rebranch. (SPR)▶ Fill-in of NNI moves: Billera, Holmes, Vogtmann (BHV).

The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved' tree.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 90: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Distances between Trees

▶ Robinson and Foulds, (bipartitions).▶ Nearest Neighbor Interchange (NNI). Rotation Moves

4

0

2 31

0

4321

0

41 32

▶ Subtree Prune Rebranch. (SPR)▶ Fill-in of NNI moves: Billera, Holmes, Vogtmann (BHV).

The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved' tree.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 91: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Distances between Trees

▶ Robinson and Foulds, (bipartitions).▶ Nearest Neighbor Interchange (NNI). Rotation Moves

4

0

2 31

0

4321

0

41 32

▶ Subtree Prune Rebranch. (SPR)▶ Fill-in of NNI moves: Billera, Holmes, Vogtmann (BHV).

The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved' tree.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 92: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Distances between Trees

▶ Robinson and Foulds, (bipartitions).▶ Nearest Neighbor Interchange (NNI). Rotation Moves

4

0

2 31

0

4321

0

41 32

▶ Subtree Prune Rebranch. (SPR)▶ Fill-in of NNI moves: Billera, Holmes, Vogtmann (BHV).

The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved' tree.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 93: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Boundary for trees with 3 leaves

1 2 3

1 2 3

3 2 1

3 1 2

0

0

0

0

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 94: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

The quadrant for one tree

(0,0) (1,0)

(0,1) (1,1)

0

1 2 3 4

11

{{

1 2 3 4

0

21 3 4

0

1 2 3 4

{1

0

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 95: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

The quadrant for one tree

(0,0) (1,0)

(0,1) (1,1)

0

1 2 3 4

11

{{

1 2 3 4

0

21 3 4

0

1 2 3 4

{1

0

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 96: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

The quadrant for one tree

(0,0) (1,0)

(0,1) (1,1)

0

1 2 3 4

11

{{

1 2 3 4

0

21 3 4

0

1 2 3 4

{1

0

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 97: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

The quadrant for one tree

(0,0) (1,0)

(0,1) (1,1)

0

1 2 3 4

11

{{

1 2 3 4

0

21 3 4

0

1 2 3 4

{1

0

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 98: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Link of the originAll 15 quadrants for n = 4 share the same origin. If we takethe diagonal line segment x + y = 1 in each quadrant, weobtain a graph with an edge for each quadrant and atrivalent vertex for each boundary ray; this graph is calledthe link of the origin.

1 42 3

0

1 42 3

0

(1,0)

(0,1)

x+y=1

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 99: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Cube complex of Euclidean Orthants

A path between two treesconsists of line segments through a sequence of orthants.This sequence of orthants is the path.A path is a geodesic when it has the smallest length of allpaths between two points.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 100: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Cube complex of Euclidean Orthants

A path between two treesconsists of line segments through a sequence of orthants.This sequence of orthants is the path.A path is a geodesic when it has the smallest length of allpaths between two points.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 101: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

A Cone Path

A path between two trees T and T′ always exists. Since allorthants connect at the origin, any two trees T and T′ canbe connected by a two-segment path, this is called thecone-path.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 102: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Three orthants sharing a common boundary for n = 4 leaves.

1 42 3

0

1 42 3

0

1 42 3

0

1 43 2

0

2 43 1

0

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 103: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

c

a

c

b ba

Tree space with BHV metric is a CAT(0) space, that is, it hasnon-positive curvature. This implies there are geodesicbetween any two trees (Gromov).It is also not an Euclidean space.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 104: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Euclidean space (where through every point not on a line) isflat:(sum of angles of a triangle is 180 o),

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 105: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Hyperbolic space is ` negatively' curved:

Euclid's parallel postulate is replaced.In hyperbolic geometry there are at least two distinct linesthrough P which do not intersect l, so the parallel postulateis false.A characteristic property of hyperbolic geometry is that theangles of a triangle add to less than 180 o.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 106: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

c

a

c

b ba

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 107: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Geodesic metric space:If we have a distance defined between any two points of aspace, we call it a metric space.(The distance doesn't have to be defined through ordinarycoordinates)A geodesic metric space is a metric space where geodesicsare defined to be the shortest path between points in thespace.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 108: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

δ-hyperbolic space is a geodesic metric space in which everygeodesic triangle is δ-thin.δ-thin: pick three points and draw geodesic lines betweenthem to make a geodesic triangle. Then any point on any ofthe edges of the triangle is within a distance of δ from oneof the other two sides.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 109: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

For example, trees are 0-hyperbolic: a geodesic triangle in atree is just a subtree, so any point on a geodesic triangle isactually on two edges.

Normal Euclidean space is ∞-hyperbolic; i.e. not hyperbolic.Generally, the higher δ has to be, the less curved the spaceis.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 110: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Is it better to represent the distances by a tree or aEuclidean projection?

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 111: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Tree of Trees

A tree is a complete CAT(0) space.

c

a

c

b baa

c

b

Since BHV,2001 [1] have shown that the space of trees isnegatively curved (a CAT(0) space), the most naturalrepresentation of a collection of trees may be a tree.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 112: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Some Advantages of R

Cutting edge solutions Availability of > 4000 packages, allstatistical/machine learning methods available.Robust methods, sparse versions of all of theprojection methods.

Data Input and cleaning Multiple formats, (Newick, RNA-seq,Qiime, HTS, Micro-array, Mass-Spec, Genome,...).

Visualization and Presentation High quality graphics:ggplot2.

Reproducibilty Sweave, history, ...Open Source Community of users, documentation.Simulation Random number generators.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 113: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Some Difficulties in R

Main Drawback By default works with data in RAM.Another Problem Heterogeneity: Data, Packages,

Documentation.Open Source Community of users, documentation.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 114: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Best Platform for Designing a Statistical Analysis

▶ Comparing sophisticated methods.Almost 5,000 packages available.

▶ Compare 20 methods on a new type of data in a week.▶ High quality visualization.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 115: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Seamless Interaction with standardized Databases

▶ Through Bioconductor package AnnotationDbi wehave access to GenBank, the Gene OntologyConsortium, Entrez genes, UniGene, the UCSCHuman Genome Project KEGG

▶ metlin ChemmineR , rcdklibs, rpubchem

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 116: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Packages Tailored to specific formats

.

▶ Maps.▶ Trees, Networks and Graphs▶ Microarrays, Mass Spectroscopy,▶ Texts.▶ Images.▶ DNA, AA, Sequencing Data, HTS, RNA-seq.▶ Qiime / mothur formats.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 117: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Plotting

.

High powered graphics packages with layered functionalities:▶ ade4 .▶ Philosophy: Leland Wilkinson's Grammar of Graphics.▶ Implementation :ggplot2

(Hadley Wickham youtube video)▶ Animated gifs, interactive graphics.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 118: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

One Diagram to replace Two Diagrams

Canonical correlation analysis was introduced by Hotelling tofind the common structure in two sets of variables X1 and X2

measured on the same observations. This is equivalent tomerging the two matrices columnwise to form a large matrixwith n rows and p1 + p2 columns and taking as the weightingof the variables the matrix defined by the two diagonalblocks (Xt

1DX1)−1 and (Xt

2DX2)−1

Q =

(Xt

1DX1)−1 0

0 (Xt2DX2)

−1

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 119: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

.

Rp1∗ −−−−→X1

Rn

Ip1

x yV1 Dy xW1

Rp1 ←−−−−Xt1

Rn∗

Rp2∗ −−−−→X2

Rn

Ip2

x yV2 Dy xW2

Rp2 ←−−−−Xt2

Rn∗

Rp1+p2∗ −−−−→[X1;X2]

Rn

Qx yV D

y xW

Rp1+p2 ←−−−−−[X1;X2]t

Rn∗

This analysis gives the same eigenvectors as the analysis ofthe triple (Xt

2DX1, (Xt1DX1)

−1, (Xt2DX2)

−1), also known as the..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 120: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

canonical correlation analysis of X1 and X2.

Part VI

Confirmatory Analysis:Nonparametric tests

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 121: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Nonparametric Tests

▶ Canonical Correspondence Analysis tests on a factor.▶ Mantel's Test between distance matrices▶ Multiple testing correction.▶ Bootstrap tests.

Everything is done by shuffling labels

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 122: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Example:Is there a shedding effect in Relman/Hoy Mice data?> shed = scan("shed.txt")> shedf = as.factor(shed)[-(70:71)]> resca.shed = cca(t(pib.nz) ~ shedf)> anova(resca.shed)Permutation test for cca under reduced model

Model: cca(formula = t(pib.nz) ~ shedf)Df Chisq F N.Perm Pr(>F)

Model 6 0.3825 2.0627 199 0.005 **Residual 87 2.6886---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 123: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Example: Is there a subject effect in Katie's data?subjcc = vegan::cca(tnorepnz ~ subject)anova(subjcc)

Permutation test for cca under reduced model

Model: cca(formula = tnorepnz ~ subject)Df Chisq F N.Perm Pr(>F)

Model 7 1.2177 10.7999 199 0.005 **Residual 472 7.6027---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 124: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

> cca.cage = vegan::cca(t(tcmall) ~ cagef)> plot(cca.cage)> text(cca.cage, choices = c(1, 2), label = cagef, display = "lc")> anova(cca.cage)Permutation test for cca under reduced model

Model: cca(formula = t(tcmall) ~ cagef)Df Chisq F N.Perm Pr(>F)

Model 2 0.3722 7.6501 199 0.005 **Residual 178 4.3305---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 125: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

−4 −2 0 2 4

−2

02

4

CCA1

CC

A2

++

++ +

++

++

++++ +

++

+

+++ +

++

+

+

+ +

+

++

+

+

++

+ +

++

+

++

++

++

+

+

+

+

+

+

+

++

+

+

+

++

++++

+

+

+

+++

+

+

+++

+

++

++

++++

+

+

+

++

++

+

+

++

++

++

+

++

+++

+

+

++

+++

+

+

+

+

+

+

+++

++ ++

+

+ ++

+

++

+

++ ++

++

+

++ +

++

++

+

++

+ +++

++

+ +

++

+

+

++

+ ++

+

+

+

+++ +

++

++

+

+

++

++ +

+

+

+

+

+

++

++

+

+

++

+

+

++

+++

+++

++

++

++ +

++

+ + ++

+

+

+

+

+

+

+++ +

+

++

++++ +

+

+

++

+

+

+

+++

++

+

+

++

++

+

+

+

+++

+++

++++ ++

+

+

++ +

+

++

+++

++

+

++

++

+ +

++

++

+

+ ++

++

+

+

++

+

+

● ●

● ●

●●●

●●

●●

●●

●●●

● ●

●●●

●●

●●●

●●

●●●●

●●

●●

●●

● ●●●

●●

●●

●●

●●●

●●●

●●●

● ●

● ●

●●

●●

●●

●●

●●●

●●

●● ●●

●●●

●●

●●

●●

x

x

x

1515151515151515151515

777777777777777777777777

4444444444444444444444444444

1515151515151515151515151515

4444444444444444444444444444

7777777777777777777777777777777777777777777777777777777777777777777777777777

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 126: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

> plot(cca.cage, scaling = 1)> text(cca.cage, scaling = 1, choices = c(1, 2), display = "sp",+ cex = 0.6)> title("Species, Cage effect, 1,2")

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 127: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

−4 −2 0 2 4

−2

02

4

CCA1

CC

A2

+

+

++ +

+

+

+

+

+

+

++

+

+

+

+

++

++

+

+

+

+

++

+

++

+

+

+

+

++

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+++

+

+

+

+

+

++

+

+

+

++

+

++

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

++

++

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

+

++

+

+

+

++

+

+

+

+

+

+++

+

+

+

+

+

++

++

+

+

+

+

+

+ ++

+

+

+

+ +

+

+

+

+

+

+

++

+

+

+

+

++

+ +

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

++ +

+

+

+ + +

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

++

+

+

+

++

+

+

+

+

++

+

+

+

+

+

+

+

+

+

+

+

+

++

+

++

+

++

++ +

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+ +

+

+

+

+

+

+

+

+

+

●● ●

●●

●●

● ●●

●● ●●●●

●●●

●● ●●●

●●●●● ●

●●●●

●●

●●●

●●●

●●

●●

●●

●●

●●●●

●●

●●●●

●●

●●

● ●● ●

●●●●●

●●

●●●

●● ●●

●●●

●●

●●

●●

●● ●● ●

●● ●

●●

●●●

●●

●●●●

●●●

●●

●●●

●●●●●

●●

●●●

●●●●

●●

●●●●●

●●●

x

xx

f__

Lachno

LachnoLachno Lachno

Lachno

Ruminoc

Lachno

Lachno

Lachno

Lachno

LachnoRuminoc

f__

Lachno

Lachno

f__

LachnoLachno

LachnoRuminoc

f__

Lachno

f__

Lachno

Lachnof__

Lachno

LachnoLachno

Lachno

Lachno

Ruminoc

Lachno

RuminocLachno

f__

Ruminoc

Lachno

f__

Lachno

f__

Coriobact

Lachnof__

f__

f__

Lachno

Lachno

Lachno

Lachno

Lachno

Lachno

f__

Lachno

Lachno

Ruminoc

Lachno

Lachno

LachnoLachnoLachno

Lachno

Lachno

Ruminoc

Lachno

Ruminoc

LachnoLachno

Lachno

Lachno

Ruminoc

LachnoLachno

Lachno

Lachno

Lachno

Lachno

Lachno

Lachno

f__

Lachno

Lachno

Lachno

Lachno

Lachno

Lachno

Lachno

Ruminoc

f__

Lachno

Lachno

Lachno

Lachno

f__

Ruminoc

Lachno

Ruminoc

f__

Catabact

f__

AnaeroplCatabact

f__

Lachno

Ruminoc

Lachno

Catabact

f__Ruminoc

Lachno

Lachno

Ruminoc

Dehalobact

f__

Ruminoc

Lachno

Lachno

Lachno

Lachno

Lachnof__

Ruminoc

Ruminoc

Lactobaci

RuminocRuminoc

ClostridI

Lachno

Ruminoc

Lachno

f__

LachnoRuminocRuminoc

Lachno

Lachno

f__

Lachno

Lachno

LachnoRuminoc

LachnoLachno

Lachno

Lachno

Ruminoc

Ruminoc

Ruminoc

Lachno RuminocLachno

Ruminoc

Lachno

Lachno

RuminocLachno

Lachno

Lactobaci

Ruminoc

f__

f__

Erysip

LachnoLachno

Lachno

Lachno

Lachno

Ruminoc

LachnoLachno

RuminocLachno

Lachno

Lachno

Lachno

Lachno

Lachno

Lachno

Lachno

Lachno

RuminocRuminoc

f__

Lachno

Lachno

Ruminoc

Lachno

Lachno

RuminocLachno

Lachno

Lachno

Lachno

Lachno

Lachno

f__

Lachno

Ruminoc

Lachno

Lachno

Lachno

Lachno

Lachno

LachnoLachno

Lachno

Lachno

Lachno

Lachno

Lachno

LachnoEntero Lachno

Lachno

Lachno

Ruminoc Lachno Lachno

Ruminoc

Lachno

Lachno

Lachno

Lachno

Ruminoc

Lachno

Lachno

Lachnof__

Ruminoc

Lachno

Lachno

Lachno

Lachno

Lachno

LachnoLachno

Lachno

Lachno

Lachno

Lachno

Lachno

Lachno

Lachno

Lachno

Lachno

LachnoLachno

Lachno

ClostridI

Leuconost

Lachno

Lachno

Ruminoc

f__

Lachno

Lachno

Lachno

Lachno

Lachno

LachnoLachno

Lachno

LachnoLachno

Lachno

RuminocLachno

Lachno

f__f__

Lachno

Lachno

Lachno

Lachno

Lachno

Ruminoc

Erysip

Lachno

Lachno

Catabact

Lachno

Lachno

Lachno

Ruminoc

Lachno

Lachno

Catabact

ClostridIII

Lachno

Lachno

f__Lachno

Lachno

Ruminoc

f__

Lachno f__

Lachno

Lachno

Erysip

Lachno

Verruc

Lachno

Lachno

Lachno

Lachno

Species, Cage effect, 1,2

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 128: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Over-representation of certain phyla

Set Over-represented UniverseMicrobiome Families/Phyla Species Present

Gene Expression Ontological groups Filtered GenesTest in both cases: hypergeometric / Fisher's exact test.We define the set of prefiltered species (species universe) asthose that passed the threshold test of being present (>6000) in at least 31 of the arrays.This method is especially relevant here as the tree does notshow equal representation of different families and phyla.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 129: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Hypergeometric Tests for over representation ofcertain phyla)

▶ IBS higher group had significantly more Bacteriodetes▶ overrepresentation of Firmicutes in the healthy controls.▶ At the family level, the results showed that the families

of Oxalobacteraceae, Prevotellaceae, Burkholderiaceae,Sphingobacteriaceae were significantly overrepresentedin IBS.

▶ Conversely, the most significantly enriched family incontrol rats were Lachnospiraceae, includingRuminococcus sp., followed by Erysipelotrichaeceae andClostridiaceae.

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 130: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Part VII

RobustnessHigh Breakdown (median,)

Sparse (L1 minimizing)

Nonparameteric (ranks, nmMDS, ..)

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 131: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 132: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Tumor Cells

0 5000 10000 15000 20000

05000

10000

15000

20000

05e-0

40.0

010 1 1 0 -1 1 0 0 0 -1

0 1 1 0 0 0 0 0 0 1

0 1 -1 0 -1 0 0 0 0 -1

0 1 1 0 0 -1 1 0 1 1

0 1 1 0 0 0 0 1 0 1( (

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 133: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Solutions: Ways of Using Trees

Main tool: ape.Example:

> treeall=read.tree("/Users/susan/Dropbox/Yana/gg.tre")> todrop=which(!(treeall$tip.label %in% otu.id))>> todrop=which(!(treeall$tip.label %in% otu.id))> dropthese=treeall$tip.label[todrop]> yana1772tree=drop.tip(treeall,dropthese)>layout(matrix(c(1,1,1,2,2,2,3,4,5,5,5,5),ncol=12,nrow=1))>subtree = drop.tip(tree, which(!tree$tip.label %in% selected))>plot(tree, use.edge.length=F, no.margin=T, edge.color=colo, show.tip.label=F)###Relevant subtree>plot(subtree, edge.color='red', show.tip.label=T, no.margin=T, use.edge.length=F)

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .

Page 134: Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree integration. Identifiability. High quality graphics. Reproducibility..... clinical

Mapping Variables onto Phylogenies

Example using phyloseq package> data(esophagus)> es1 <- esophagus> sn <- sample.names(esophagus)> sampleData(es1) <- sampleData(data.frame(sample=sn, row.names=sn))

We create the tree graphic, grouping/coloring by our dummysample-name variable, and also labelling the number ofindividuals observed in each sample (if at all). The symbolsare slightly enlarged as the number of individuals increases.> plot_tree_phyloseq(es1, color_factor="sample",

type_abundance_value=TRUE, treeTitle="esophagus example data")

..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .