Phylogenetic Analyses in R

Klaus Schliep

Universidad de Vigo

Porto, 15–16 July 2013

Outline

Getting started

Data Structures

Distance based methods

Maximum Parsimony

Maximum likelihood

Section 1

Getting started

About

These slides give a short introduction to phylogenetic reconstruction in R. They focus mostly on the packages ape and phangorn. I have to thank Emmanuel Paradis for his work on ape. The slides are produced with literate programming using LaTeX, Beamer, Sweave and R, so all the code and graphics are "real"!

Help

To install an R package it is good to have administrator rights. Download R from www.cran.r-project.org. You can easily install packages from within R:

> install.packages("phangorn")

> install.packages("phytools")

> install.packages("pegas")

> install.packages("seqLogo")

> q()

Then you can load the packages you need:

> library("phangorn")

> library("seqLogo")

Help

The R homepage provides lots of general documentation, FAQs, etc. There are help pages for all the functions, and most of them contain examples.

> library(help="phangorn")

> help.start()

> ?pml

> help(pml)

> example(pml)

> vignette("Ancestral")

Copying and pasting parts of the code in the examples is a good start. If you prefer reading a book (even though books become outdated quickly):

Paradis, E. (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). New York: Springer.

There is a mailing list, stat.ethz.ch/mailman/listinfo/r-sig-phylo, where you can ask questions after browsing through the archive.

Section 2

Data Structures

Data Structures

Reminder:

1. Data in R are made of vector + attribute(s) (and combinations of these). Vector: a series of elements all of the same kind (a list is a vector of pointers).

2. The class is the attribute determining the action of generic functions (plot, summary, etc.)

We will make heavy use of the following 3 data structures:

1. phyDat: sequences (DNA, AA, codons, user defined) in phangorn

2. DNAbin: DNA sequences (ape format)

3. phylo: phylogenetic trees

Class phylo

This class represents phylogenetic trees. Tip labels may be replicated, and node labels may be absent. Input:

1. read.tree: Newick files

2. read.nexus: NEXUS files

If the file contains several trees, these two functions return an object of class multiPhylo, which is a list of trees of class phylo. You can write objects of class phylo using write.tree or write.nexus.
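As a minimal sketch of these input and output functions, the following reads trees from Newick strings through the `text` argument of read.tree instead of from a file. The example trees and their labels are made up for illustration.

```r
library(ape)

# A single Newick string gives an object of class "phylo"
tree <- read.tree(text = "((human,chimp),gorilla,orangutan);")
class(tree)
Ntip(tree)     # number of tips
Nnode(tree)    # number of internal nodes

# Several Newick strings give an object of class "multiPhylo"
trees <- read.tree(text = c("((a,b),c,d);", "((a,c),b,d);"))
class(trees)
length(trees)

# Without a file name, write.tree() returns the Newick string
newick <- write.tree(tree)
newick
```

The same objects can be written to disk by passing a file name to write.tree or write.nexus.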

Plotting trees

ape has great plotting capabilities.

> help(plot.phylo)

A simple example:

> tree <- rtree(10)

> par(mfrow=c(2,2), mar=rep(0,4))

> plot(tree)

> plot(tree, type="fan")

> plot(tree, type="unrooted")

> plot(tree, type="cladogram")

Plotting trees

[Figure: the same random tree with tips t1 to t10, drawn in four panels: phylogram (default), fan, unrooted and cladogram.]

Transforming trees

There are many functions in ape and phangorn to transform trees (i.e. objects of class phylo):

> root(tree, outgroup)

> drop.tip(tree, "t1")

> extract.clade(phy, 1)

> bind.tree(tree1, tree2)

> unroot(tree)

> multi2di(tree)

> di2multi(tree)

> nni(tree)

> rSPR(tree)
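The calls above are schematic. A small runnable sketch with ape, using a random tree and arbitrarily chosen tips and nodes (all choices here are illustrative assumptions), might look like this:

```r
library(ape)

set.seed(1)
tree <- rtree(10)                      # random rooted tree, tips t1..t10

pruned   <- drop.tip(tree, "t1")       # remove a single tip
unrooted <- unroot(tree)               # remove the root
rerooted <- root(unrooted, outgroup = "t2")

# Subtree below an internal node; internal nodes are numbered
# from Ntip(tree) + 1 (the root) upwards
clade <- extract.clade(tree, node = Ntip(tree) + 2)

Ntip(pruned)
is.rooted(tree)
is.rooted(unrooted)
```

Each function returns a new object of class phylo, so the transformations can be chained.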

Class phyDat

The starting point for phylogenetic reconstruction is a sequence alignment. ape can call clustal, tcoffee and muscle, and phyloch can call mafft, prank and gblocks. More frequently you will just read in an alignment:

> align1 <- read.phyDat("myfile")

phangorn (phyDat) and ape (DNAbin) use different formats torepresent alignments, but it is easy to convert formats.

> align2 <- read.dna("myfile") # ape format

> align3 <- as.phyDat(align2) # phangorn format
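To try the conversion without an alignment file, here is a sketch that uses the woodmouse example alignment shipped with ape and converts it in both directions:

```r
library(phangorn)               # also attaches ape

data(woodmouse)                 # example alignment from ape, class DNAbin
woody <- as.phyDat(woodmouse)   # ape -> phangorn
back  <- as.DNAbin(woody)       # phangorn -> ape

class(woodmouse)
class(woody)
class(back)
```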

Section 3


Distance based methods

Distance methods take a distance or dissimilarity matrix as input.

Ultrametric    Additive
upgma (a)      fastme.ols
wpgma (a)      fastme.bal
               nj
               UNJ (a)
               bionj

(a) in phangorn, the rest in ape.

• Fast methods, O(n^2) or O(n^3) → big data sets can be analysed.

• Distances can be calculated for different kinds of data.

• In phylogenetics often used to compute starting trees for ML, MP or inside species tree methods.
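A short sketch of the workflow, assuming the Laurasiatherian alignment that ships with phangorn: compute a distance matrix once, then build one ultrametric and one additive tree from it.

```r
library(phangorn)

data(Laurasiatherian)
dm <- dist.ml(Laurasiatherian)   # pairwise distances, JC69 by default

tree_upgma <- upgma(dm)          # ultrametric method (phangorn)
tree_nj    <- nj(dm)             # additive method (ape)

is.rooted(tree_upgma)
is.rooted(tree_nj)
```

Either tree can then serve as a starting tree for the parsimony or likelihood searches.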


Consensus network

> data(Laurasiatherian)

> set.seed(1)

> bs <- bootstrap.phyDat(Laurasiatherian, FUN = function(x) nj(dist.ml(x)), bs=100)

> class(bs) <- 'multiPhylo'

> cnet = consensusNet(bs, .3)

> plot(cnet, show.tip.label=FALSE, show.nodes=TRUE)

Section 4

Maximum Parsimony

Maximum parsimony

In contrast to the distance methods, (maximum) parsimony uses sequence alignments as input. The target is to minimize an optimality criterion, i.e. a score assigned to a tree, given the data. For the parsimony method the score is the minimal number of substitutions needed to account for the data on a phylogeny.

> data(Laurasiatherian)

> tree = nj(dist.ml(Laurasiatherian))

> parsimony(tree, Laurasiatherian)

[1] 9776

> tree2 = optim.parsimony(tree, Laurasiatherian,
+     trace=FALSE, rearrangement="SPR")

> parsimony(tree2, Laurasiatherian)

[1] 9713

> tree3 = pratchet(Laurasiatherian, rearrangement="SPR", trace=0)

Branch and bound

Normally it is not possible to evaluate an optimality criterion for all trees, as there are just too many trees.

> sapply(3:10, howmanytrees, FALSE)

[1] 1 3 15 105 945 10395

[7] 135135 2027025

> howmanytrees(20, FALSE)

[1] 2.216431e+20

For small datasets it is possible to find all most parsimonious trees using a branch and bound algorithm. For datasets with more than 10 taxa this can take a long time and depends strongly on how tree-like the data are.

> besttree <- bab(subset(Laurasiatherian,1:10), trace=0)

> parsimony(besttree, Laurasiatherian)

[1] 2695

Ancestral reconstruction

To reconstruct ancestral sequences we first load some data and reconstruct a tree:

> primates = read.phyDat("primates.dna")

> tree = pratchet(primates, trace=0)

> tree = acctran(tree, primates)

> parsimony(tree, primates)

[1] 746

In parsimony analysis the edge lengths represent the observed number of changes. Reconstructing ancestral states therefore also defines the edge lengths of a tree. However, there can exist several equally parsimonious reconstructions, or states can be ambiguous, and therefore edge lengths can differ (e.g. "MPR" or "ACCTRAN").

> anc.acctran = ancestral.pars(tree, primates, "ACCTRAN")

> anc.mpr = ancestral.pars(tree, primates, "MPR")

Ancestral reconstruction

> seqLogo( t(subset(anc.mpr, getRoot(tree), 1:20)[[1]]), ic.scale=FALSE)

[Figure: sequence logo of the MPR reconstruction at the root for positions 1 to 20; x-axis: Position, y-axis: Probability (0 to 1).]

Ancestral reconstruction MPR

> plotAnc(tree, anc.mpr, 17)

> title("MPR")

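[Figure: tree of the 14 taxa (Mouse, Bovine, Lemur, Tarsier, Squir Monk, Jpn Macaq, Rhesus Mac, Crab−E.Mac, BarbMacaq, Gibbon, Orang, Gorilla, Chimp, Human) with the reconstructed states drawn at the nodes; legend: a, c, g, t; title: MPR.]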

Ancestral reconstruction ACCTRAN

> plotAnc(tree, anc.acctran, 17)

> title("ACCTRAN")

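[Figure: tree of the 14 taxa (Mouse, Bovine, Lemur, Tarsier, Squir Monk, Jpn Macaq, Rhesus Mac, Crab−E.Mac, BarbMacaq, Gibbon, Orang, Gorilla, Chimp, Human) with the reconstructed states drawn at the nodes; legend: a, c, g, t; title: ACCTRAN.]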

Section 5

Maximum likelihood

Maximum Likelihood

"[In 1961] I had visions of evolutionary tree estimation being much the same [than linkage estimation] but with the addition of the need to estimate the form of the tree itself, surely a fatal complexity: my intuition was that there would be insufficient data for the task."

—A.W.F. Edwards (2009)

Phylogenetic likelihood is the probability $f(x \mid \theta, \tau)$ of observing an alignment $x$ given a model of (nucleotide) substitution with parameters $\theta$ and phylogenetic tree $\tau$.

$$L(\theta, \tau, x) = \prod_{i=1}^{N} f(x_i \mid \theta, \tau)$$

where $N$ is the number of sites in the alignment. It is common to maximise the log-likelihood function

$$\ell(\theta, \tau, x) = \sum_{i=1}^{N} \log f(x_i \mid \theta, \tau),$$

which also maximises $L(\theta, \tau, x)$.
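One practical reason for working on the log scale is numerical: the product of many small site likelihoods underflows in double precision, while the sum of their logarithms does not. A small sketch with made-up per-site likelihoods (the values are purely illustrative):

```r
# Hypothetical per-site likelihoods for an alignment of N = 500 sites
site_lik <- rep(1e-3, 500)

lik    <- prod(site_lik)       # underflows to 0 in double precision
loglik <- sum(log(site_lik))   # still well defined

lik
loglik
```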

Applications in phylogenetics

Felsenstein (1981) introduced the pruning algorithm which made the computation of the likelihood feasible. Let nodes $j$ and $k$ have a direct ancestor $h$. We can estimate the conditional likelihood

$$L_h(x_h) = \left( \sum_{x_j} L_j(x_j)\, p_{x_j, x_h}(t_j) \right) \times \left( \sum_{x_k} L_k(x_k)\, p_{x_k, x_h}(t_k) \right)$$

The likelihood of the tree is evaluated by traversing the tree in postorder fashion from the tips towards the root. For unrooted trees, a root can be chosen arbitrarily as our models are time-reversible. We get the likelihood of the tree if we multiply the conditional likelihood of the root node $r$ with the base composition $\pi$, as

$$f_h(x \mid \theta, \tau) = \sum_{x_r} \pi_{x_r} L_r(x_r).$$

These formulas can be adapted to estimate ancestral sequences.
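The recursion can be sketched in a few lines of base R for a single site. This is an illustration under stated assumptions, not phangorn's implementation: it assumes the Jukes-Cantor model, equal base frequencies, a tree (((human, chimp), gorilla), orangutan) and an arbitrary branch length of 0.1 on every edge. The helper names `jc69`, `tip`, `prune` and `site_likelihood` are made up for this example.

```r
states <- c("a", "c", "g", "t")

# JC69 transition probabilities for a branch of length t
jc69 <- function(t) {
  P <- matrix(0.25 - 0.25 * exp(-4 * t / 3), 4, 4,
              dimnames = list(states, states))
  diag(P) <- 0.25 + 0.75 * exp(-4 * t / 3)
  P
}

# Conditional likelihood vector of a tip with an observed state
tip <- function(state) as.numeric(states == state)

# Conditional likelihood of a parent h from its children j and k
prune <- function(Lj, tj, Lk, tk) {
  as.vector(jc69(tj) %*% Lj) * as.vector(jc69(tk) %*% Lk)
}

# Site likelihood on the tree (((human, chimp), gorilla), orangutan);
# bl is the (assumed) length of every branch
site_likelihood <- function(x, bl = 0.1) {
  L_hc  <- prune(tip(x[1]), bl, tip(x[2]), bl)   # human + chimp
  L_hcg <- prune(L_hc, bl, tip(x[3]), bl)        # ... + gorilla
  L_r   <- prune(L_hcg, bl, tip(x[4]), bl)       # root
  sum(0.25 * L_r)                                # weight by base frequencies
}

site_likelihood(c("a", "a", "g", "t"))
```

A useful sanity check for such code is that the site likelihoods of all 4^4 possible site patterns sum to one.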

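ML in phylogenetics

[Figure: tree with the tips human, chimp, gorilla and orangutan and the internal nodes 5, 6 and 7. At one site the tips carry the states a, a, g and t, coded as the conditional likelihood vectors 1|0|0|0, 1|0|0|0, 0|0|1|0 and 0|0|0|1 (order a|c|g|t). The conditional likelihoods at the three internal nodes are 0.923613|0.000168|0.000168|0.000169, 0.027161|0.000559|0.016240|0.000559 and 0.000988|0.000031|0.000595|0.000744.]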

Finding the best topology

A binary unrooted tree with 4 tips has 5 edges and 3 distinct topologies. Here are the general formulas for binary unrooted trees with n tips:

• 2n − 3 edges

• (2n − 5)!! = 1 × 3 × 5 × · · · × (2n − 5) topologies

Rooted binary trees have 2n − 2 edges and (2n − 3)!! topologies. A function exists for this:

> howmanytrees(4, rooted=FALSE)

[1] 3

> howmanytrees(10, rooted=FALSE)

[1] 2027025

> howmanytrees(20, rooted=FALSE)

[1] 2.216431e+20
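The counts can be checked directly in base R. This sketch assumes the standard double-factorial formulas, (2n − 5)!! for unrooted and (2n − 3)!! for rooted binary trees with n ≥ 3 tips; the helper names `n_unrooted` and `n_rooted` are made up for this example.

```r
# Products of the odd numbers 1, 3, 5, ... up to the given limit
n_unrooted <- function(n) prod(seq(1, 2 * n - 5, by = 2))   # (2n - 5)!!
n_rooted   <- function(n) prod(seq(1, 2 * n - 3, by = 2))   # (2n - 3)!!

sapply(3:10, n_unrooted)   # same values as howmanytrees(n, rooted = FALSE)
n_rooted(4)                # rooted trees on 4 tips
```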

Finding the best trees

The strategy of evaluating the likelihood criterion for all trees in order to find the best tree topology is in most cases highly impractical. Instead, local tree rearrangements are used to search locally within the tree space. The idea behind such a heuristic is to use a starting tree and search locally for improved scores (parsimony, maximum likelihood, least squares), until no further rearrangements can lead to a tree with a better score.

Nearest neighbor interchange

For any internal edge of a binary tree there exist three different ways to connect its four subtrees, one of which is the current tree.

[Figure: the three arrangements of the subtrees A, B, C and D around an internal edge: AB|CD, AC|BD and AD|BC.]
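A minimal sketch with phangorn's nni function on a four-taxon tree, which has a single internal edge and therefore exactly two NNI neighbours (the tip labels are arbitrary):

```r
library(phangorn)

# Unrooted four-taxon tree; its only internal edge separates AB from CD
tree <- read.tree(text = "((A,B),C,D);")

neighbours <- nni(tree)    # all trees one NNI move away from 'tree'
class(neighbours)
length(neighbours)
write.tree(neighbours)     # Newick strings of the neighbouring trees
```

For larger trees the same call returns two neighbours per internal edge.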

Modelling rate variation

We assume that the substitution rate varies between different sites (intron vs. exon, codon positions, etc.). Two approaches are commonly used:

• define different partitions

• model rate variation with different rate categories, with a (discrete) Γ distribution and/or a proportion of invariable sites
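To make the second approach concrete, here is a base R sketch of the discrete Γ model: the Gamma distribution with shape and rate both equal to alpha (mean 1) is cut into k equally probable categories, and each category is represented by its mean rate. The helper name `discrete_gamma` and the values alpha = 0.5 and k = 4 are assumptions for illustration.

```r
# Mean rates of k equally probable categories of a Gamma(alpha, alpha)
# distribution
discrete_gamma <- function(alpha, k = 4) {
  # category boundaries: quantiles of the Gamma distribution
  cuts <- qgamma((1:(k - 1)) / k, shape = alpha, rate = alpha)
  # mean within each category, via the incomplete gamma function
  k * diff(c(0, pgamma(cuts * alpha, shape = alpha + 1), 1))
}

rates <- discrete_gamma(alpha = 0.5, k = 4)
rates
mean(rates)   # the rates average to 1
```

Small values of alpha give strongly unequal rates, while large values approach a model without rate variation.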

Comparing trees and models

The phylogenetic likelihood allows us to compare many different models or trees. There is often a bias vs. variance trade-off. Simple models are easy to interpret but can often be biased.

[Figure: MSE, variance and squared bias plotted against the number of parameters.]

Comparing trees and models

The phylogenetic likelihood allows us to compare many different models or trees.

• If two models are nested, that is, one model can be described as a special case of the other, then we can directly compare their likelihoods under their ML parameter estimates for a fixed tree using a likelihood ratio test (LRT).

• For non-nested models we can use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC):
AIC = −2 ℓ(θ, τ, x) + 2 · df
BIC = −2 ℓ(θ, τ, x) + ln(n) · df
where df is the number of parameters of the model and n the number of sites.

• Or use the Shimodaira-Hasegawa test or similar bootstrap approaches.
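As a worked example of the two criteria, the following base R sketch recomputes AIC and BIC for the JC model from the log-likelihood and df reported in the model selection table later in these slides. The number of sites, n = 3179 for the Laurasiatherian alignment, is an assumption not stated in the slides.

```r
loglik <- -54303.67   # log-likelihood of the JC model (model selection table)
df     <- 91          # number of parameters (model selection table)
n      <- 3179        # assumed number of sites in the alignment

aic <- -2 * loglik + 2 * df
bic <- -2 * loglik + log(n) * df

round(c(AIC = aic, BIC = bic), 2)
```

Up to rounding of the log-likelihood, the values agree with the AIC and BIC columns of that table.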

Detection of molecular adaptation

We look at each triplet of nucleotides and assume that only one nucleotide can be replaced at a time. Then we can distinguish between nucleotide substitutions that result in the same amino acid (synonymous substitutions) or a different amino acid (non-synonymous substitutions). The ratio dN/dS of non-synonymous to synonymous substitutions can be an indication of the kind of selective pressure acting on the codon site. Under negative selection, we expect that non-synonymous substitutions will accumulate more slowly than synonymous ones. Under positive or diversifying selection, we expect more amino acid changing replacements.

Applications with phangorn

The two main functions are pml to set up the model and optim.pml for optimising parameters and the tree with ML. Example session for the Jukes-Cantor, GTR and GTR+Γ+I models:

> data(Laurasiatherian)

> tr <- nj(dist.ml(Laurasiatherian))

> m0 <- pml(tr, Laurasiatherian)

> m.jc69 <- optim.pml(m0, optNni=TRUE)

> m.gtr <- optim.pml(m0, optNni=TRUE, model="GTR")

> m.gtr.G.I <- optim.pml(update(m.gtr, k=4), model="GTR",
+     optNni=TRUE, optGamma=TRUE, optInv=TRUE)

By default, only the edge lengths are optimized. Currently phangorn only supports NNI tree rearrangements (equivalent to PhyML vers. 2).

There exist several useful generic functions like update, anova or AIC for objects of class pml.

> methods(class="pml")

[1] anova.pml logLik.pml plot.pml print.pml

[5] update.pml vcov.pml

For example, since the models are nested, we can compare them with a likelihood ratio test:

> anova(m.jc69, m.gtr, m.gtr.G.I)

Likelihood Ratio Test Table

Log lik. Df Df change Diff log lik. Pr(>|Chi|)

1 -54113 91

2 -50603 99 8 7020 < 2.2e-16 ***

3 -44527 101 2 12151 < 2.2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
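The first comparison in this table can be reproduced by hand. The sketch below uses the rounded log-likelihoods and df printed above for the JC69 and GTR fits. The test statistic is twice the difference in log-likelihood, compared against a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters.

```r
loglik0 <- -54113; df0 <- 91    # JC69 (row 1 of the table)
loglik1 <- -50603; df1 <- 99    # GTR  (row 2 of the table)

stat    <- 2 * (loglik1 - loglik0)
p_value <- pchisq(stat, df = df1 - df0, lower.tail = FALSE)

stat
p_value
```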

Partition models

pmlPart(global ~ local, object, model)

global   local
bf       bf
Q        Q
inv      inv
shape    shape
edge     edge

Further components: rate, nni.

Each component can be only used once in the formula.

Partition models

There are two different ways to set up partition models.

1. Setting up partition models for different genes.

> fit1 <- pml(tree, g1)

> fit2 <- pml(tree, g2)

> fit3 <- pml(tree, g3)

> fit4 <- pml(tree, g4)

> genePart <- pmlPart(Q + bf ~ edge,
+     list(fit1, fit2, fit3, fit4), optRooted=TRUE)

> trees <- lapply(genePart$fits, function(x) x$tree)

> class(trees) <- "multiPhylo"

> densiTree(trees, type="phylogram", col="red")

where g1, g2, g3 and g4 are objects of class phyDat.

[Figure: densiTree plot of the gene trees; tips: Scer, Spar, Smik, Skud, Sbay, Scas, Sklu, Calb.]

Partition models

2. Partitioning via a weight matrix.

> woody <- phyDat(woodmouse)

> tree <- nj(dist.ml(woody))

> fit <- pml(tree, woody)

> w <- attr(woody, "index")

> weight <- table(w, rep(c(1,2,3), length=length(w)))

> codonPart <- pmlPart(edge ~ rate, fit,
+     model=c("JC", "JC", "GTR"), weight=weight)

Model / tree comparison

Alternatively, we can use the Shimodaira-Hasegawa test to check for differences between models:

> SH.test(m.jc69, m.gtr, m.gtr.G.I)

Trees ln L Diff ln L p-value

[1,] 1 -54112.74 9585.685 0.0000

[2,] 2 -50602.74 6075.683 0.0000

[3,] 3 -44527.06 0.000 0.5911

Model selection

Two possibilities:

• ape: phymltest

> write.phyDat(woody, "woody.phy")

> out <- phymltest("woody.phy", execname = "~/phyml")

• phangorn: modelTest

> mt <- modelTest(Laurasiatherian, model=c("JC", "F81", "HKY", "GTR"))

modelTest also works for amino acid models, similar to ProtTest.

> mt <- modelTest(myAAData, model=c("WAG", "JTT", "LG", "Dayhoff"))

Model Selection

     Model      df      logLik      AIC        BIC
 1   JC         91.00   -54303.67   108789.35  109341.20
 2   JC+I       92.00   -50673.32   101530.63  102088.55
 3   JC+G       92.00   -48684.10    97552.19   98110.11
 4   JC+G+I     93.00   -48605.03    97396.06   97960.05
 5   F81        94.00   -54212.64   108613.27  109183.32
 6   F81+I      95.00   -50549.53   101289.06  101865.17
 7   F81+G      95.00   -48500.49    97190.99   97767.10
 8   F81+G+I    96.00   -48416.26    97024.51   97606.69
 9   HKY        95.00   -51275.86   102741.72  103317.83
10   HKY+I      96.00   -47451.73    95095.45   95677.63
11   HKY+G      96.00   -44893.11    89978.23   90560.40
12   HKY+G+I    97.00   -44770.18    89734.36   90322.60
13   GTR        99.00   -50759.89   101717.79  102318.16
14   GTR+I     100.00   -47081.77    94363.55   94969.98
15   GTR+G     100.00   -44759.49    89718.99   90325.42
16   GTR+G+I   101.00   -44624.02    89450.04   90062.54

Bootstrap

> bs <- bootstrap.pml(m.gtr, bs=100, optNni=TRUE)

> plotBS(m.gtr$tree, bs, type="phylo", bs.adj=c(.5,0))

[Figure: ML tree of the Laurasiatherian data (tips from Platypus to GraySeal) with bootstrap support values drawn on the edges.]

Codon Models

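$$q_{ij} = \begin{cases}
0 & \text{if } i \text{ and } j \text{ differ in more than one position} \\
\pi_j & \text{for synonymous transversion} \\
\pi_j \kappa & \text{for synonymous transition} \\
\pi_j \omega & \text{for non-synonymous transversion} \\
\pi_j \omega \kappa & \text{for non-synonymous transition}
\end{cases}$$

or, if we abstract from $\pi_j$ (the frequency of codon $j$):

$$q_{ij} = \begin{cases}
0 & \text{if } i \text{ and } j \text{ differ in more than one position} \\
1 & \text{for synonymous transversion} \\
\kappa & \text{for synonymous transition} \\
\omega & \text{for non-synonymous transversion} \\
\omega \kappa & \text{for non-synonymous transition}
\end{cases}$$

where $\omega$ is the dN/dS ratio, $\kappa$ the transition/transversion ratio and $\pi_j$ is the equilibrium frequency of codon $j$.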
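The second form of the rate definition can be written as a small base R function. This is a sketch under simplifying assumptions: it uses the standard genetic code, ignores the codon frequencies, returns 0 for identical codons (the diagonal is not covered by the definition) and does not treat stop codons specially. The helper name `q_rate` is made up for this example.

```r
bases <- c("T", "C", "A", "G")

# Standard genetic code; codons ordered TTT, TTC, TTA, TTG, TCT, ...
aa <- strsplit(paste0("FFLLSSSSYY**CC*W", "LLLLPPPPHHQQRRRR",
                      "IIIMTTTTNNKKSSRR", "VVVVAAAADDEEGGGG"), "")[[1]]
codons <- unname(apply(expand.grid(bases, bases, bases)[, 3:1], 1,
                       paste, collapse = ""))
names(aa) <- codons

# Relative rate between codons i and j, without the factor pi_j
q_rate <- function(i, j, kappa, omega) {
  x <- strsplit(i, "")[[1]]
  y <- strsplit(j, "")[[1]]
  pos <- which(x != y)
  if (length(pos) != 1) return(0)      # no change or more than one change
  pair <- paste(sort(c(x[pos], y[pos])), collapse = "")
  transition <- pair %in% c("AG", "CT")
  synonymous <- aa[[i]] == aa[[j]]
  (if (transition) kappa else 1) * (if (synonymous) 1 else omega)
}

q_rate("TTT", "TTC", kappa = 2, omega = 0.3)   # synonymous transition
q_rate("TTT", "TTA", kappa = 2, omega = 0.3)   # non-synonymous transversion
q_rate("TTT", "CCC", kappa = 2, omega = 0.3)   # more than one change
```

Multiplying each entry by the frequency of the target codon gives the first form of the definition.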

Codon Models

> (dat <- phyDat(as.character(yeast), "CODON"))

> tree <- nj(dist.ml(yeast))

> fit <- pml(tree, dat)

> ctr <- pml.control(trace=0)

> fit0 <- optim.pml(fit, control = ctr)

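> fit1 <- optim.pml(fit0, model="codon1", control=ctr)

> fit2 <- optim.pml(fit0, model="codon2", control=ctr)

> fit3 <- optim.pml(fit0, model="codon3", control=ctr)

Model    κ      ω
codon0   1      1
codon1   free   free
codon2   1      free
codon3   free   1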

Additionally, the equilibrium frequencies of the codons πj can be estimated by setting the parameter optBf=TRUE.

Codon Models

> anova(fit0, fit2, fit1)



1 -1054762 13

2 -648282 14 1 812961 < 2.2e-16 ***

3 -642807 15 1 10949 < 2.2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> anova(fit0, fit3, fit1)



1 -1054762 13

2 -708674 14 1 692176 < 2.2e-16 ***

3 -642807 15 1 131735 < 2.2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
