Phylogenetics Advances in Bioinformatics and Genomics GEN 240B Jason Stajich April 26 & 28, 2010 Phylogenetics Slide 1/82

Page 1: Phylogenetics - Advances in Bioinformatics and Genomics ...lab.stajich.org/presentations/teaching/Gen240B.2011S/Phylogenetics… · Phylogenetics Introduction Phylogenetics Slide

Phylogenetics: Advances in Bioinformatics and Genomics

GEN 240B

Jason Stajich

April 26 & 28, 2010

Phylogenetics Slide 1/82

Outline

Introduction
  Phylogenetics

Tree Basics
  Tree Topology
  Counting Trees
  Tree Rooting Methods
  Inferring Trees from Distances

Tree Building Methods
  Clustering Methods
  Parsimony Methods

Quality Assessment by Resampling
  Bootstrap
  Jackknife

Software
  Software

References

Phylogeny: Evolutionary Relationships Among Organisms

Systematics is the study of the diversity of organisms, while taxonomy is the classification of species. It uses and applies phylogenetics and phylogenetic methods.

Phylogenetics is the study of the evolutionary relationships among organisms through time.

Phylogenies (evolutionary models or trees) provide a framework for studying evolutionary patterns and mechanisms.

The Tree of Life

The similarity of molecular mechanisms suggests a common ancestor for all organisms.

Thus, any set of species is related. This relationship is called phylogeny.

Phylogenetic trees are often used to represent this relationship.

Morphological Characters Used to Construct and Interpret Phylogenies

Morphological characteristics from living and fossilised organisms have been used for inferring and interpreting phylogenies.

Sequence-Based Phylogenies

Molecular data can provide more accurate distance measures, such as:

Enzyme and immunological data
Genetic mapping data
DNA and protein sequence data

Because certain sequences change at an almost constant rate over time, they can serve as molecular clocks.

These clock-like sequences often allow much more reliable phylogenies to be inferred than traditional approaches.

Common Workflow of Phylogenetic Sequence Analyses

1. Select the Appropriate Sequences for a Phylogenetic Question
   Important: sequences should show significant similarity.

2. Create a Multiple Alignment for Chosen Sequences
   Important: unalignable sequence areas should be removed.

   S1         FMPFSAGKRICAGEGLARMELFLFLT 450
   S2         FMPFSAGKRICVGEALAGMELFLFLT 450
   S3         .LAFGCGARVCLGEPLARLELFVVLT 443
   S4         SLPFGFGKRSCMGRRLAELELQMALA 470
   S5         YTPFGSGPRNCIGMRFALMNMKLALI 457
   consensus  ..PFg.GkR.C.Ge.LA.mELfl.Lt

3. Compute a Distance Matrix for Multiple Alignment

        S1    S2    S3    S4    S5
   S1   0.0   0.43  0.71  0.71  0.48
   S2         0.0   0.57  0.57  0.39
   S3               0.0   0.29  0.21
   S4                     0.0   0.13
   S5                           0.0

4. Calculate Phylogenetic Tree
   Important: choose a tree building method.

5. Tree Post Processing
   Important: tree rooting and bootstrapping.

The Tree of Life

Phylogenies from sequences assume that they have evolved from the same ancestral gene in a common ancestral species.

Incomplete Phylogenies of the Tree of Life

[Figure: unresolved, partially resolved, and fully resolved trees; a bifurcation contrasted with a polytomy (multifurcation)]

Some parts of the tree of life are fully resolved; others are only partially resolved or completely unresolved.

Evolution of Genes

Complication: the evolution of genes is driven by

Gene duplications

Speciation

Special case in bacteria: horizontal gene transfer

Orthologous and Paralogous Genes

[Figure: gene tree for Species A and Species B in which speciation events give rise to orthologues and duplications give rise to paralogues]

Homologous genes: genes that evolved from a common ancestor.

Orthologous genes: homologous genes that evolved by speciation.

Paralogous genes: homologous genes that evolved by gene duplication.

To infer phylogenies of species, orthologous genes need to be used.

To infer phylogenies of gene duplications, paralogous genes need to be used.

Interpreting trees: Character evolution

In comparison with its ancestor, an organism has both shared and derived characteristics.

A shared ancestral character is one that originated in an ancestor before the clade formed.

A shared derived character is an evolutionary novelty unique to a particular clade; it originated at the time of cladogenesis, so every member of the clade has it.

Interpreting trees: Characters

(a) Character table

                               TAXA
CHARACTERS                     Lancelet (outgroup)  Lamprey  Tuna  Salamander  Turtle  Leopard
Vertebral column (backbone)    0                    1        1     1           1       1
Hinged jaws                    0                    0        1     1           1       1
Four walking legs              0                    0        0     1           1       1
Amniotic (shelled) egg         0                    0        0     0           1       1
Hair                           0                    0        0     0           0       1

Interpreting trees: Characters on Trees

[Figure (b) Phylogenetic tree: Lancelet (outgroup), Lamprey, Tuna, Salamander, Turtle, and Leopard, with the characters vertebral column, hinged jaws, four walking legs, amniotic egg, and hair marked on the tree]

Basics on Trees

Usually, the leaves (end nodes) of a tree are labelled.

In some cases the labels can be swapped (see Fig. 7) without changing the tree.

A tree with a given leaf labelling is called a labelled branching pattern or the tree topology T.

The lengths of the edges are denoted by t_i.

Basics on Trees

Typically, binary trees are used in phylogenetics, where three branches meet at each branch (internal) node.

Tree components

internal and external branches (edges)
internal and external nodes

[Figure: tree with internal nodes, external nodes, internal branches, and external branches marked]

A true phylogeny has a root, which represents the ultimate ancestor of all the other items in the tree.

Certain algorithms, such as parsimony or probabilistic models, provide no information about the position of the root in a tree.

Tree Styles

Circle Tree

Significance of Branch Lengths in Cladograms, Phylograms and Ultrametric Trees

[Figure: the same four taxa (Taxon A to Taxon D) drawn as a cladogram, a phylogram, and an ultrametric tree]

Cladogram: branch lengths have no meaning.

Phylogram: branch lengths are proportional to (genetic) change.

Ultrametric tree: branch lengths are proportional to time (axis in million years).

Rotating Branches in Phylogenetic Trees

[Figure 7: Branch Rotations. Two drawings of a tree with leaves 1, 2, 3, 4 that differ only in how branches are rotated.]

Rotations of internal nodes yield exactly the same tree.

Rooting Trees

There are five possibilities to root the unrooted tree on the top left.

Trees rooted on branches c and d are not shown.

Number of Nodes and Branches in Trees

N nodes in rooted tree = (2n − 1)

N branches in rooted tree = (2n − 2)

N nodes in unrooted tree = (2n − 2)

N branches in unrooted tree = (2n − 3)

n = number of leaves (taxa)

Number of Unrooted and Rooted Trees

The number of possible trees grows faster than exponentially as the number of taxa n increases:

\[ N_{\text{unrooted}} = \frac{(2n-5)!}{2^{n-3}\,(n-3)!} = (2n-5)!! \qquad (1) \]

\[ N_{\text{rooted}} = \frac{(2n-3)!}{2^{n-2}\,(n-2)!} = (2n-3)!! \qquad (2) \]

n = number of leaves (taxa)
N = number of possible trees

Example

n     N Unrooted Trees   N Rooted Trees
3     1                  3
4     3                  15
5     15                 105
7     945                10,395
10    2,027,025          34,459,425
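Equations (1) and (2) are easy to evaluate directly. The following is a minimal sketch, assuming only the double-factorial forms given above; the function names are illustrative and not part of the lecture.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ... (defined as 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n: int) -> int:
    """Equation (1): number of unrooted binary trees for n >= 3 labelled leaves."""
    if n < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    return double_factorial(2 * n - 5)


def count_rooted_trees(n: int) -> int:
    """Equation (2): number of rooted binary trees for n >= 2 labelled leaves."""
    if n < 2:
        raise ValueError("need at least 2 taxa for a rooted tree")
    return double_factorial(2 * n - 3)


# Print one row per number of taxa: n, unrooted count, rooted count.
for n in (3, 4, 5, 7, 10):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Note that the number of rooted trees for n taxa equals the number of unrooted trees for n + 1 taxa, because each rooting corresponds to attaching one extra leaf.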

Methods for Rooting Phylogenetic Trees

Outgroup Method

Rooting by including one or more outgroup taxa/sequences.

Gene Duplication

Paralogous genes that arose from a duplication predating the common ancestor of a clade are used.

Midpoint Rooting

The tree is rooted at the midpoint between the two most distant taxa (branch tips).

Rooting by Outgroup

Rooting is accomplished by including one or more outgroups (taxa/sequences) that differ from all ingroup members more than the ingroup members differ from each other.

The main assumption of this method is that the outgroup taxa fall outside of the ingroup.

Rooting with Duplicated Genes

A gene duplication in an ancestral organism gives rise to paralogous genes.

Speciation processes give rise to orthologous genes.

Rooting with Duplicated Genes

The root is placed between paralogous gene populations.

Gene Copies A

Gene Copies B

Rooting the Tree of Life

Universal trees based on single-gene orthologs cannot be rooted by the outgroup method, because of the lack of an ancestral sequence.

Solution: use an ancient gene duplication that predates the last common ancestor (cenancestor) of all living organisms [Iwabe 1989].

Cenancestor

Midpoint Rooting

Choose the midpoint between the two most distant branches.

Midpoint rooting assumes that the rate of evolution is the same on the longest branches of the tree.
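Midpoint rooting can be sketched in a few lines. The example below is only an illustration under stated assumptions: the unrooted tree, its node names (`x`, `y` for internal nodes), and its branch lengths are hypothetical, and the tree is stored as a plain adjacency map rather than in any particular library's tree type.

```python
from itertools import combinations

# Hypothetical unrooted tree: node -> {neighbour: branch length}.
tree = {
    "A": {"x": 6.0},
    "B": {"x": 1.0},
    "x": {"A": 6.0, "B": 1.0, "y": 1.0},
    "y": {"x": 1.0, "C": 1.0, "D": 4.0},
    "C": {"y": 1.0},
    "D": {"y": 4.0},
}


def path_between(tree, start, goal, seen=None):
    """Return the nodes on the unique path from start to goal (None if absent)."""
    seen = (seen or set()) | {start}
    if start == goal:
        return [start]
    for neighbour in tree[start]:
        if neighbour not in seen:
            rest = path_between(tree, neighbour, goal, seen)
            if rest:
                return [start] + rest
    return None


def path_length(tree, path):
    """Sum the branch lengths along a path of adjacent nodes."""
    return sum(tree[u][v] for u, v in zip(path, path[1:]))


def midpoint_root(tree):
    """Return ((u, v), offset): the root lies on branch u-v, offset away from u."""
    tips = [node for node, neighbours in tree.items() if len(neighbours) == 1]
    # The two most distant tips define the longest tip-to-tip path.
    longest = max(
        (path_between(tree, a, b) for a, b in combinations(tips, 2)),
        key=lambda path: path_length(tree, path),
    )
    half = path_length(tree, longest) / 2
    walked = 0.0
    for u, v in zip(longest, longest[1:]):
        if walked + tree[u][v] >= half:
            return (u, v), half - walked
        walked += tree[u][v]


print(midpoint_root(tree))
```

In this hypothetical tree the longest path runs from A to D (length 11), so the root falls on the branch between A and x, 5.5 units from A.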

Transforming Characters to Distances

DNA Alignment

G1  TTATTAA
G2  AATTTAA
G3  AAAAATA
G4  AAAAAAT

Distance Matrix

      G1    G2    G3    G4
G1    -     0.43  0.71  0.71
G2    3     -     0.57  0.57
G3    5     4     -     0.29
G4    5     4     2     -

Absolute distances in bottom triangle and uncorrected relative distances in top triangle.
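Both triangles of the matrix can be reproduced by counting mismatched alignment columns. This is a minimal sketch using the four sequences from the slide; the helper names are illustrative, and gaps or ambiguity codes are not handled.

```python
from itertools import combinations

alignment = {
    "G1": "TTATTAA",
    "G2": "AATTTAA",
    "G3": "AAAAATA",
    "G4": "AAAAAAT",
}


def absolute_distance(a: str, b: str) -> int:
    """Number of alignment columns at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def relative_distance(a: str, b: str) -> float:
    """Uncorrected relative distance: differing columns divided by alignment length."""
    return absolute_distance(a, b) / len(a)


absolute = {}
relative = {}
for (name1, seq1), (name2, seq2) in combinations(alignment.items(), 2):
    absolute[(name1, name2)] = absolute_distance(seq1, seq2)
    relative[(name1, name2)] = round(relative_distance(seq1, seq2), 2)

for pair in absolute:
    print(pair, absolute[pair], relative[pair])
```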

Similarity vs. Phylogenetic Relationship

Similarity and phylogenetic relationship are not the same.

Similarity refers to likeness or resemblance.

Phylogenetic relationship refers to historical connections through common ancestry.

Similarity reflects evolutionary relationship only when distances are ultrametric (e.g. when sequences evolve in a perfectly clock-like manner).

Example Tree A: B is most similar to A and is also most closely related to A.

When distances are not ultrametric, two taxa can be most similar without being most closely related.

Example Tree B: B is most similar to C, but it is most closely related to A.

Properties of Distances

Metric Distances: A matrix of metric distances must satisfy the following four conditions for all taxa.

1. Identity: d(A,A) = 0
2. Symmetry: d(A,B) = d(B,A)
3. Non-negativity: d(A,B) ≥ 0 if A ≠ B
4. Triangle inequality: d(A,C) ≤ d(A,B) + d(B,C)

Properties of Distances

Ultrametric Distances: Satisfy conditions 1-4 plus condition 5.

5. Ultrametric condition:
   d(A,B) ≤ max [d(A,C), d(B,C)]
   d(A,C) ≤ max [d(A,B), d(B,C)]
   d(B,C) ≤ max [d(A,B), d(A,C)]

Condition 5 can only be true if the two largest distances are equal and define the longest sides of an isosceles triangle.
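Conditions 1-5 can be checked mechanically on a small distance matrix. The sketch below is illustrative only: the two three-taxon matrices are hypothetical values chosen for the example, and the ultrametric test uses the isosceles formulation (the two largest of the three distances must be equal), which is equivalent to condition 5.

```python
import math
from itertools import combinations, permutations


def make_matrix(pairs):
    """Build a symmetric lookup d[x][y] from {(x, y): distance}, with d[x][x] = 0."""
    d = {}
    for (x, y), value in pairs.items():
        d.setdefault(x, {x: 0.0})[y] = value
        d.setdefault(y, {y: 0.0})[x] = value
    return d


def is_metric(d, tol=1e-9):
    """Check identity, symmetry, non-negativity, and the triangle inequality."""
    taxa = list(d)
    if any(d[a][a] != 0 for a in taxa):
        return False
    for a, b in combinations(taxa, 2):
        if d[a][b] < 0 or not math.isclose(d[a][b], d[b][a], abs_tol=tol):
            return False
    for a, b, c in permutations(taxa, 3):
        if d[a][c] > d[a][b] + d[b][c] + tol:
            return False
    return True


def is_ultrametric(d, tol=1e-9):
    """Check that in every triple the two largest distances are equal."""
    for a, b, c in combinations(list(d), 3):
        _, middle, largest = sorted((d[a][b], d[a][c], d[b][c]))
        if not math.isclose(middle, largest, abs_tol=tol):
            return False
    return True


# Hypothetical example values, not taken from the slides.
clock_like = make_matrix({("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0})
not_clock_like = make_matrix({("A", "B"): 3.0, ("A", "C"): 6.0, ("B", "C"): 7.0})

print(is_metric(clock_like), is_ultrametric(clock_like))
print(is_metric(not_clock_like), is_ultrametric(not_clock_like))
```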

Properties of Distances

Additive Distances: They must be metric (conditions 1-4) or ultrametric (conditions 1-5) and also satisfy condition 6.

6. A matrix is additive if and only if the following conditions apply for every combination of four taxa (A, B, C, D):

d(A,B) + d(C,D) ≤ max [d(A,C) + d(B,D), d(A,D) + d(B,C)]

d(A,C) + d(B,D) ≤ max [d(A,B) + d(C,D), d(A,D) + d(B,C)]

d(A,D) + d(B,C) ≤ max [d(A,B) + d(C,D), d(A,C) + d(B,D)]

Condition 6 is also known as Buneman’s four-point metric. For distances to fit perfectly into an evolutionary tree, they must satisfy this rule.

Tree additivity occurs when the evolutionary distance between each pair of taxa is equal to the sum of the branch lengths on the path between the members of that pair.

Additivity of Distances in Trees

Additive property: the distances in the distance matrix match the relative branch lengths in the tree.

In a perfectly additive tree the branch lengths match the distances in the distance matrix perfectly. In such a case there will be only a single and unique additive tree that fits the distance matrix (perfect fit theorem).

Example for perfectly additive tree:

     A    B    C    D
A         0.3  0.6  0.5
B              0.7  0.6
C                   0.7

Application of Buneman’s Test

Buneman’s four-point metric simply means that, of the three sums of distances, one must be smaller than the other two, and these other two must be equal.

For example:

d(A,B) + d(C,D) < d(A,C) + d(B,D) = d(A,D) + d(B,C)

(0.3 + 0.7) < (0.6 + 0.6) = (0.5 + 0.7)

1.0 < 1.2 = 1.2
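The same arithmetic can be checked programmatically. This is a small sketch using the example distances for taxa A-D; the function names are illustrative, and a tolerance is used because the sums are floating-point values.

```python
import math

# Pairwise distances from the example matrix for taxa A, B, C, D.
d = {
    ("A", "B"): 0.3, ("A", "C"): 0.6, ("A", "D"): 0.5,
    ("B", "C"): 0.7, ("B", "D"): 0.6, ("C", "D"): 0.7,
}


def dist(x, y):
    """Look up a distance regardless of the order in which the pair is given."""
    return d[(x, y)] if (x, y) in d else d[(y, x)]


def four_point_sums(a, b, c, e):
    """Return the three pairwise sums for a quartet, sorted in ascending order."""
    return sorted([
        dist(a, b) + dist(c, e),
        dist(a, c) + dist(b, e),
        dist(a, e) + dist(b, c),
    ])


def satisfies_four_point(a, b, c, e, tol=1e-9):
    """True if the two largest sums are equal (within tol)."""
    _, middle, largest = four_point_sums(a, b, c, e)
    return math.isclose(middle, largest, abs_tol=tol)


print(four_point_sums("A", "B", "C", "D"))
print(satisfies_four_point("A", "B", "C", "D"))
```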

The Neighbor-Relations Methods

The neighbor-relations method takes advantage of Buneman’s four-point metric to choose the correct tree when the distances are additive.

Neighbors are taxa that are joined through a single internal node. Non-neighbors are joined through more than one internal node.

If the tree above is taken as the true tree, then d(A,B) + d(C,D) < d(A,C) + d(B,D) = d(A,D) + d(B,C), because d(A,B) and d(C,D) are distances between neighbors and do not contain the internal branch.

In the case of four taxa with unknown phylogenetic relationships, identify the two sets of neighbors. Once the neighbors are identified, so is the topology.

The Neighbor-Relations Methods

With four taxa three unrooted trees are possible.

Each tree contains different neighbors.

To determine which taxa are neighbors, compute the distances for all possible pairs of taxa.

     A    B    C    D
A         0.3  0.6  0.5
B              0.7  0.6
C                   0.7

The given tree is the correct one, because the sum d(A,B) + d(C,D) is the smallest of the three possible sums.
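For four taxa, identifying the neighbors amounts to picking the pairing with the smallest sum. The following sketch assumes the example distances above; `find_neighbors` is a hypothetical helper name, not a function from any phylogenetics package.

```python
# Pairwise distances for taxa A, B, C, D, keyed by unordered pair.
distances = {
    frozenset(("A", "B")): 0.3, frozenset(("A", "C")): 0.6,
    frozenset(("A", "D")): 0.5, frozenset(("B", "C")): 0.7,
    frozenset(("B", "D")): 0.6, frozenset(("C", "D")): 0.7,
}


def find_neighbors(d, taxa):
    """Return the pairing of four taxa whose summed distances are smallest."""
    a, b, c, e = taxa
    # The three possible unrooted topologies correspond to three pairings.
    pairings = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]

    def total(pairing):
        (p, q), (r, s) = pairing
        return d[frozenset((p, q))] + d[frozenset((r, s))]

    return min(pairings, key=total)


print(find_neighbors(distances, ("A", "B", "C", "D")))
```

The returned pairing, A with B and C with D, fixes the unrooted topology for the four taxa.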

Superimposed Substitutions Cause Distances to Be Non-Additive

A complete inventory of all genetic events (unique and superimposed substitutions) would constitute a set of perfectly additive distances.

However, superimposed substitutions cause observed distances to be non-additive and to underestimate true evolutionary distances.

Corrections for Non-Additive Distances

Models of sequence evolution are employed to correct observed distances for superimposed substitutions.

Ideally, this will result in additive distances.

In reality, corrected distances are unlikely to exhibit a perfect fit to path-length distances on a tree because of:

1. Inadequate models of sequence evolution.
2. Stochastic (random) error associated with sequences of finite length.
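The slides do not name a specific model at this point. As one illustration of such a correction, the sketch below uses the Jukes-Cantor formula for DNA, d = -3/4 ln(1 - 4p/3), where p is the uncorrected relative distance. The choice of model is an assumption of this example, not something prescribed by the lecture.

```python
import math


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# The correction grows with p, reflecting more superimposed substitutions.
for p in (0.05, 0.29, 0.43, 0.57):
    print(p, round(jukes_cantor(p), 3))
```

The corrected distance is always at least as large as the observed one, which matches the point that observed distances underestimate the true amount of change.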

Additivity and Distance Matrices

Desirable Features of Tree Building Methods

Consistency: will the method converge on the correct solution given enough data?

Efficiency: how fast is the method?

Robustness: will minor violations of the assumptions result in poor estimates of phylogeny?

Clustering Methods

Algorithmic methods in which the algorithm itself defines the tree selection criterion.

No optimality criteria applied.

Advantage: tend to be very fast (efficient) computations that produce a single tree.

Disadvantages:

Do not allow evaluation of competing hypotheses.
No objective function (e.g. likelihood, number of steps) is used to compare different trees to each other, even if numerous other trees could explain the data equally well.

Examples that construct trees from distances:

UPGMA
Neighbor joining (NJ)

UPGMA

UPGMA stands for unweighted pair group method using arithmetic averages [Sokal & Michener 1958].

Clusters taxa (sequences) agglomeratively and creates at the same time a hierarchical tree.

The branch (edge) lengths and node positions are determined by the average distance between clusters.

There are variants of UPGMA that define the distance between clusters (linkage method) as the minimum or maximum of the distances between clusters, rather than the average.

The average linkage seems to have the best performance record.

Basic Definitions for UPGMA

Initially, each sequence is assigned to its own cluster ($i$, $j$, ...) of size 1.

The distance $d_{ij}$ between two clusters $C_i$ and $C_j$ is the average distance between pairs of sequences from each cluster:

$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq} \qquad (3)$$

$|C_i|$ and $|C_j|$ are the numbers of items in clusters $i$ and $j$.

$C_k$ is defined as the union of two clusters $C_i$ and $C_j$. If $C_k = C_i \cup C_j$ and $C_l$ is any other cluster, then

$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|} \qquad (4)$$
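The two definitions can be checked against each other numerically. The following is a minimal sketch; the function names and the toy distance matrix are illustrative assumptions, not taken from the slides. It computes the distance from a merged cluster to a third cluster once directly with eq 3 and once with the update rule of eq 4.

```python
from itertools import product

def average_distance(d, ci, cj):
    """Eq. 3: mean pairwise distance between the members of clusters ci and cj."""
    total = sum(d[p][q] for p, q in product(ci, cj))
    return total / (len(ci) * len(cj))

def merged_distance(d_il, d_jl, size_i, size_j):
    """Eq. 4: distance from the merged cluster C_k = C_i U C_j to a cluster C_l."""
    return (d_il * size_i + d_jl * size_j) / (size_i + size_j)

# Hypothetical symmetric distances between four sequences (indices 0..3).
d = [
    [0.0, 2.0, 6.0, 9.0],
    [2.0, 0.0, 6.0, 11.0],
    [6.0, 6.0, 0.0, 13.0],
    [9.0, 11.0, 13.0, 0.0],
]

ci, cj, cl = [0, 1], [2], [3]

# Distance from C_k = {0, 1, 2} to C_l = {3}, computed directly with eq. 3 ...
direct = average_distance(d, ci + cj, cl)

# ... and from the two pre-merge distances with eq. 4.
updated = merged_distance(
    average_distance(d, ci, cl),
    average_distance(d, cj, cl),
    len(ci),
    len(cj),
)
```

Both routes give the same value, which is why UPGMA can update the matrix with eq 4 without revisiting the individual sequences.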

Phylogenetics Tree Building Methods Clustering Methods Slide 54/82

UPGMA Algorithm

Initialization

Assign each sequence to its own cluster $C_i$.

Define one leaf of $T$ (tree) for each sequence, and place it at height zero.

Iteration

Determine the two clusters $i$, $j$ for which $d_{ij}$ is minimal. If there are several equidistant choices, then pick one randomly.

Define a new cluster $k$ by $C_k = C_i \cup C_j$.

Define a node $k$ with daughter nodes $i$ and $j$, and place it at height $d_{ij}/2$.

Add $k$ to the current clusters and remove clusters $i$ and $j$.

Update distances by computing $d_{kl}$ for all other clusters $l$.

Termination

When only two clusters $i$ and $j$ remain, place the root at height $d_{ij}/2$.
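The steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `upgma`, the nested-tuple tree representation and the toy matrix are all hypothetical, and ties are resolved by taking the first minimal pair rather than a random one.

```python
def upgma(names, d):
    """Build a UPGMA tree from a symmetric distance matrix d (list of lists).

    Leaves are returned as names, internal nodes as (left, right, height).
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (subtree, size)
    dist = {(i, j): d[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        # Closest pair of clusters; ties go to the first pair found, not a random one.
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        del dist[(i, j)]
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        k = next_id
        next_id += 1
        # Eq. 4: size-weighted average of the distances from i and j to each other cluster.
        for l in clusters:
            d_il = dist.pop((min(i, l), max(i, l)))
            d_jl = dist.pop((min(j, l), max(j, l)))
            dist[(l, k)] = (d_il * n_i + d_jl * n_j) / (n_i + n_j)
        # The new node sits at half the distance between the merged clusters.
        clusters[k] = ((tree_i, tree_j, d_ij / 2), n_i + n_j)
    tree, _size = next(iter(clusters.values()))
    return tree

# Hypothetical ultrametric distances between four taxa.
names = ["A", "B", "C", "D"]
d = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(names, d)
```

The last pass through the loop handles the termination step: when two clusters remain, joining them creates the root at half their distance.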

Phylogenetics Tree Building Methods Clustering Methods Slide 55/82

Illustration of UPGMA Algorithm

[Figure: UPGMA applied to five sequences (1 to 5), shown for iteration 1, iteration 2 and iterations 3-4. Each new node (6 to 9) is placed at half the distance between the clusters it joins; the heights labelled in the figure are 1/2 d45, 1/2 d12 and 1/2 d68.]

Phylogenetics Tree Building Methods Clustering Methods Slide 56/82

Limitations of UPGMA Algorithm

The edge lengths of a UPGMA tree correspond roughly to the times measured by a molecular clock with constant rate.

The method assumes that the divergence of sequences occurs at all points in the tree with a constant rate and that the distances are additive.

If the molecular clock assumption applies to a given distance matrix, then UPGMA constructs the tree correctly.

However, if this assumption does not apply to the underlying distance matrix, then UPGMA may construct the tree incorrectly.

Example Fig 7: correct tree on the left and incorrect UPGMA tree on the right:

[Figure 7: two trees over the leaves A, B, C and D.]

Solution: test whether the distance matrix is ultrametric, i.e. whether for any triplet of sequences two of the three pairwise distances are equal and the remaining one is smaller than or equal to them.
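The triplet test can be written directly as a loop over all triplets. This is a minimal sketch; the function name `is_ultrametric`, the tolerance parameter and both example matrices are assumptions made for illustration.

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """Check the three-point condition on a symmetric distance matrix d."""
    for i, j, k in combinations(range(len(d)), 3):
        smallest, middle, largest = sorted((d[i][j], d[i][k], d[j][k]))
        # The two largest distances of every triplet must be equal;
        # the third one may be smaller or equal.
        if abs(largest - middle) > tol:
            return False
    return True

# Hypothetical clock-like distances: every triplet passes.
clock_like = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]

# Hypothetical distances with unequal rates: triplet (0, 1, 2) gives 0.3, 0.7, 0.8.
unequal_rates = [
    [0.0, 0.3, 0.7, 0.4],
    [0.3, 0.0, 0.8, 0.5],
    [0.7, 0.8, 0.0, 0.5],
    [0.4, 0.5, 0.5, 0.0],
]

results = (is_ultrametric(clock_like), is_ultrametric(unequal_rates))
```

A matrix that fails this test is a signal that UPGMA may not be an appropriate method for the data.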

Phylogenetics Tree Building Methods Clustering Methods Slide 57/82

Neighbor-Joining Method

If the molecular clock property fails for a given data set, but additivity holds, then the neighbor-joining method can construct a correct tree.

To overcome the problem that neighboring leaves can be more distant from each other than from non-neighboring leaves (see Fig 7), one can calculate the rate-corrected distances $D_{ij}$ by subtracting from $d_{ij}$ the averaged distances to all other leaves:

$$D_{ij} = d_{ij} - (r_i + r_j), \quad \text{where} \quad r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik} \qquad (5)$$

|L| is the size of the leaf set L.

Consequence: i and j are neighboring leaves if their Dij is minimal!

Phylogenetics Tree Building Methods Clustering Methods Slide 58/82

Example: Rate Corrected Distance Matrix

Distance Matrix

      A      B      C      D
A     -     0.3    0.7    0.4
B   -1.2     -     0.8    0.5
C   -1.0   -1.0     -     0.5
D   -1.0   -1.0   -1.2     -

Upper right triangle: original distance values.

Lower left triangle: rate-corrected distances calculated by eq 5 as:

$r_A = (0.3 + 0.7 + 0.4)/2 = 0.7$

$r_B = (0.3 + 0.8 + 0.5)/2 = 0.8$

$r_C = (0.7 + 0.8 + 0.5)/2 = 1.0$

...

$D_{AB} = 0.3 - 0.7 - 0.8 = -1.2$

$D_{AC} = 0.7 - 0.7 - 1.0 = -1.0$

...
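The remaining entries of the lower triangle follow from the same arithmetic. As a sketch, the helper below (the name `rate_corrected` is an assumption) applies eq 5 to the full matrix of this example; the diagonal entries are zero, so summing over whole rows matches the sums shown above.

```python
def rate_corrected(d):
    """Return (r, D) as defined by eq. 5 for a symmetric distance matrix d."""
    n = len(d)
    # d[i][i] is 0, so summing the whole row equals summing over all other leaves.
    r = [sum(row) / (n - 2) for row in d]
    D = [
        [d[i][j] - (r[i] + r[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    return r, D

# Distances from the example, in the order A, B, C, D.
d = [
    [0.0, 0.3, 0.7, 0.4],
    [0.3, 0.0, 0.8, 0.5],
    [0.7, 0.8, 0.0, 0.5],
    [0.4, 0.5, 0.5, 0.0],
]
r, D = rate_corrected(d)

# Round only for display; floating-point sums are not exact.
r_rounded = [round(x, 2) for x in r]
D_AB, D_CD = round(D[0][1], 2), round(D[2][3], 2)
```

The pairs A-B and C-D share the minimal corrected distance, so either pair can be joined first.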

Phylogenetics Tree Building Methods Clustering Methods Slide 59/82

Neighbor-Joining Algorithm

Initialization

$T$ is the set of leaf nodes, one for each item in the distance matrix; set $L = T$.

Iteration

Pick a pair $i$, $j$ in $L$ for which $D_{ij}$ is minimal as defined by eq 5.

Define a new node $k$ and set $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$ for all other leaves $m$ in $L$.

Add $k$ to $T$ with edges of lengths $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$ and $d_{jk} = d_{ij} - d_{ik}$ that connect $i$ and $j$ to node $k$.

Remove $i$ and $j$ from $L$ and add $k$.

Termination

When $L$ consists of two leaves $i$ and $j$, add the remaining edge between $i$ and $j$ with length $d_{ij}$.
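A compact sketch of these steps follows. The function name `neighbor_joining`, the edge-list output and the internal node labels (`node0`, `node1`, ...) are assumptions made for illustration; ties in $D_{ij}$ are resolved by taking the first minimal pair.

```python
def neighbor_joining(names, d):
    """Return the unrooted NJ tree as a list of edges (node_a, node_b, length)."""
    dist = {a: {b: d[i][j] for j, b in enumerate(names)} for i, a in enumerate(names)}
    active = list(names)  # the set L
    edges = []            # the edges of T
    counter = 0
    while len(active) > 2:
        n = len(active)
        # Eq. 5: average distance of each node to the others (dist[a][a] is 0).
        r = {a: sum(dist[a][b] for b in active) / (n - 2) for a in active}
        pairs = [(a, b) for x, a in enumerate(active) for b in active[x + 1:]]
        i, j = min(pairs, key=lambda p: dist[p[0]][p[1]] - r[p[0]] - r[p[1]])
        k = f"node{counter}"
        counter += 1
        d_ij = dist[i][j]
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        # Distances from the new node k to every remaining node m.
        dist[k] = {k: 0.0}
        for m in active:
            if m not in (i, j):
                d_km = 0.5 * (dist[i][m] + dist[j][m] - d_ij)
                dist[k][m] = d_km
                dist[m][k] = d_km
        active = [m for m in active if m not in (i, j)] + [k]
    i, j = active
    edges.append((i, j, dist[i][j]))
    return edges

# Distances from the rate-corrected example, in the order A, B, C, D.
names = ["A", "B", "C", "D"]
d = [
    [0.0, 0.3, 0.7, 0.4],
    [0.3, 0.0, 0.8, 0.5],
    [0.7, 0.8, 0.0, 0.5],
    [0.4, 0.5, 0.5, 0.0],
]
edges = neighbor_joining(names, d)
```

Because these distances are additive, the path lengths in the resulting tree reproduce the input matrix, for example A to C as the sum of the A edge, the internal edge and the C edge.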

Phylogenetics Tree Building Methods Clustering Methods Slide 60/82

Main Differences Between UPGMA and Neighbor-Joining

UPGMA

1 Assumes additivity and ultrametricity.

2 Does not use rate-corrected distance values for tree construction.

3 Results in a rooted tree with branch pairs reflecting the distance information.

4 Tree type: ultrametric tree.

5 May fail to generate the correct tree from distance values that violate the ultrametricity rule.

Neighbor-Joining

1 Assumes additivity, but not ultrametricity.

2 Uses rate-corrected distance values for tree construction.

3 Results in an unrooted tree with the branch lengths reflecting the distance information.

4 Tree type (after rooting): phylogram or cladogram.

5 Can generate the correct tree from additive distance values even if they violate the ultrametricity rule.

Phylogenetics Tree Building Methods Clustering Methods Slide 61/82

Parsimony Methods

Simple hypotheses are preferred over more complicated hypotheses (∼ evolution is parsimonious).

No explicit correction for superimposed substitutions.

The maximum parsimony tree is the tree that requires the minimum number of steps.

Known to be inconsistent in some situations (e.g. parsimony is susceptible to long branch attraction).

Phylogenetics Tree Building Methods Parsimony Methods Slide 63/82

Parsimony

Basic principle: find the tree that explains the observed sequences with the minimal number of substitutions.

Instead of building a tree, as distance methods do, parsimony assigns a cost to a given tree.

This requires searching through all possible trees or a subset of trees that contains the best or close to best tree topology.

Parsimony algorithms consist of two major steps:

1 Computation of the cost for a given tree T.

2 Search for the tree with minimum cost.

Phylogenetics Tree Building Methods Parsimony Methods Slide 64/82

Example: Computing the Cost of a Given Tree

Given a multiple alignment and a tree topology, one can count the number of substitutions needed for each tree.

The following figure shows two possible trees for the alignment on the left. The trees differ in the order in which the sequences are assigned to the leaves.

Hypothetical sequences that minimize the number of substitutions needed in the entire tree have been assigned to the ancestral nodes.

[Figure: an alignment of four sequences (AAG, AAA, GGA, AGA) and two possible parsimony trees. Left tree: leaves AAG, AAA, GGA, AGA with ancestral sequences AAA and AGA and root AAA (3 changes). Right tree: leaves AAG, AGA, AAA, GGA with ancestral sequences AAA and AAA and root AAA (4 changes).]

The tree on the left is more parsimonious than the one on the right because it requires only 3 instead of 4 changes.

As shown here, parsimony treats each site independently and then sums up the substitutions needed for all sites.

Phylogenetics Tree Building Methods Parsimony Methods Slide 65/82

Traditional Parsimony [Fitch 1971]

Summary

Requires only counting the number of substitutions for a tree. To obtain the cost of a tree, generate a list of minimal cost residues $R_k$ at each node $k$, along with the current cost $C$. To compute the minimal cost at site $u$, one can proceed as follows:

Initialization

Set $C = 0$ and $k = 2n - 1$, the number of the root node.

Recursion - to obtain the set $R_k$:

If $k$ is a leaf node: set $R_k = x_u^k$.

If $k$ is not a leaf node: compute $R_i$, $R_j$ for the daughter nodes $i$, $j$ of $k$, and set $R_k = R_i \cap R_j$ if the intersection is not empty, or else set $R_k = R_i \cup R_j$ and increment $C$.

Termination

Minimal cost of tree = $C$.
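The recursion can be sketched as follows and applied to the four-sequence example from the previous slide. The function names, the nested-tuple tree encoding and the sequence labels `s1` to `s4` are assumptions made for illustration.

```python
def fitch_site(tree, states):
    """Return (R_k, cost) for one alignment column on a rooted binary tree.

    A tree is either a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0          # leaf: R_k holds the observed residue
    left, right = tree
    r_i, c_i = fitch_site(left, states)
    r_j, c_j = fitch_site(right, states)
    common = r_i & r_j
    if common:                            # non-empty intersection: no extra cost
        return common, c_i + c_j
    return r_i | r_j, c_i + c_j + 1       # union: one more substitution

def fitch_cost(tree, alignment):
    """Sum the per-site costs over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for u in range(n_sites):
        column = {name: seq[u] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total

# Sequences of the example; the labels s1..s4 are hypothetical.
alignment = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
left_tree = (("s1", "s2"), ("s3", "s4"))
right_tree = (("s1", "s4"), ("s2", "s3"))

costs = (fitch_cost(left_tree, alignment), fitch_cost(right_tree, alignment))
```

The two costs reproduce the 3 and 4 changes counted by hand in the example.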

Phylogenetics Tree Building Methods Parsimony Methods Slide 66/82

Algorithm Weighted Parsimony

Definitions

$S(a, b)$ is the cost of substituting $a$ by $b$. To compute the minimal cost at site $u$, let $S_k(a)$ denote the minimal cost for the assignment of $a$ to node $k$.

Initialization

Set $k = 2n - 1$, the number of the root node.

Recursion - compute $S_k(a)$ for all $a$ as follows:

If $k$ is a leaf node: set $S_k(a) = 0$ for $a = x_u^k$, and $S_k(a) = \infty$ otherwise.

If $k$ is not a leaf node: compute $S_i(a)$, $S_j(a)$ for all $a$ at the daughter nodes $i$, $j$, and define $S_k(a) = \min_b (S_i(b) + S(a, b)) + \min_b (S_j(b) + S(a, b))$.

Termination

Minimal cost of tree = $\min_a S_{2n-1}(a)$.

Note: If $S(a, a) = 0$ and $S(a, b) = 1$ for $a \neq b$, then the algorithm is identical to the traditional parsimony.
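As a sketch of the recursion, the code below evaluates $S_k(a)$ bottom-up on a nested-tuple tree. All names are illustrative assumptions, and the transition/transversion weights in `ts_tv_cost` are hypothetical values chosen only to show a non-uniform cost function.

```python
import math

def sankoff_site(tree, states, cost, alphabet="ACGT"):
    """Return {a: S_k(a)} for the node `tree` at a single alignment site."""
    if isinstance(tree, str):
        # Leaf: cost 0 for the observed residue, infinity for every other residue.
        return {a: (0 if a == states[tree] else math.inf) for a in alphabet}
    left, right = tree
    s_i = sankoff_site(left, states, cost, alphabet)
    s_j = sankoff_site(right, states, cost, alphabet)
    return {
        a: min(s_i[b] + cost(a, b) for b in alphabet)
        + min(s_j[b] + cost(a, b) for b in alphabet)
        for a in alphabet
    }

def sankoff_cost(tree, alignment, cost):
    """Sum over all sites of the minimal root cost min_a S_root(a)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for u in range(n_sites):
        column = {name: seq[u] for name, seq in alignment.items()}
        total += min(sankoff_site(tree, column, cost).values())
    return total

def unit_cost(a, b):
    """S(a, a) = 0 and S(a, b) = 1: reduces to traditional parsimony."""
    return 0 if a == b else 1

def ts_tv_cost(a, b):
    """Hypothetical weights: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else 2

# Same example sequences as before; the labels s1..s4 are hypothetical.
alignment = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
left_tree = (("s1", "s2"), ("s3", "s4"))
right_tree = (("s1", "s4"), ("s2", "s3"))

unit_costs = (
    sankoff_cost(left_tree, alignment, unit_cost),
    sankoff_cost(right_tree, alignment, unit_cost),
)
weighted = sankoff_cost(("x", "y"), {"x": "A", "y": "C"}, ts_tv_cost)
```

With `unit_cost` the totals agree with the traditional parsimony counts for the same two trees, as the note above states.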

Phylogenetics Tree Building Methods Parsimony Methods Slide 67/82

Tree Search Methods

Several tree search methods can be considered:

1 Exhaustive searches: all trees are evaluated; only possible for trees with fewer than 20 taxa.

2 Heuristic searches: not guaranteed to find the best tree (e.g. random branch changes and re-scoring of the tree).

3 Branch and bound algorithm: does not evaluate all trees, but is guaranteed to find the best tree.

4 Many other approaches.

Branch and bound algorithm

It exploits the idea that the cost (number of substitutions) of a subtree can only increase by adding an extra edge.

It systematically builds trees with increasing numbers of leaves and abandons avenues of tree building whenever an incomplete tree exceeds the smallest cost of a complete tree.
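To see why exhaustive searches are restricted to small taxon sets and why pruning matters, it helps to count the candidate topologies. The sketch below uses the standard result that there are $(2n-5)!!$ unrooted bifurcating trees for $n \geq 3$ labelled taxa; the function name is an assumption.

```python
def count_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 labelled taxa: (2n - 5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count

# Tree counts for a few taxon-set sizes.
table = {n: count_unrooted_trees(n) for n in (4, 5, 10, 20)}
```

Ten taxa already give about two million topologies, and twenty taxa give on the order of 10^20, so any method that can discard whole groups of trees without scoring them saves a great deal of work.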

Phylogenetics Tree Building Methods Parsimony Methods Slide 68/82

Branch and Bound Method in Phylogeny

Phylogenetics Tree Building Methods Parsimony Methods Slide 69/82

Assessing the Reliability of Tree Branches

Nonparametric resampling methods are often used to estimate the variance associated with a statistic when the underlying sampling distribution for the statistic is either unknown or difficult to derive analytically.

Resampling methods include the bootstrap and the jackknife, both of which operate by repeatedly resampling data from the original data set to estimate the variance of the sampling distribution.

Although both methods have been used for evaluating the reliability of branches, the bootstrap method is more commonly applied.

Phylogenetics Quality Assessment by Resampling Slide 71/82

Bootstrap Resampling

Data points are randomly resampled from the original data set, with replacement, until new data sets with the original number of observations are obtained.

The statistic of interest (e.g. a tree) is computed for each replicated data set.

Agreement among the resulting trees is summarized with a majority-rule consensus tree (agreement > 50%).

A bootstrap proportion (BP) is the frequency of occurrence of a clade (for all replicated data sets) and is a measure of support for a group.
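For sequence data the resampled data points are alignment columns. The sketch below illustrates the procedure under stated assumptions: all function names are hypothetical, and `s1_s2_closest` is a toy stand-in for the real step of building a tree from each replicate and checking whether a clade is present.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def bootstrap_proportion(alignment, has_clade, replicates=100, seed=0):
    """Fraction of bootstrap replicates for which has_clade(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(
        bool(has_clade(bootstrap_alignment(alignment, rng)))
        for _ in range(replicates)
    )
    return hits / replicates

def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def s1_s2_closest(aln):
    """Toy clade test: is s1 strictly closer to s2 than to any other sequence?"""
    d12 = hamming(aln["s1"], aln["s2"])
    others = [name for name in aln if name not in ("s1", "s2")]
    return all(d12 < hamming(aln["s1"], aln[o]) for o in others)

# Hypothetical four-sequence alignment.
alignment = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
bp = bootstrap_proportion(alignment, s1_s2_closest, replicates=200, seed=1)
```

In a real analysis the callback would run a tree-building method on each replicate, and a proportion would be recorded for every clade of interest.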

Phylogenetics Quality Assessment by Resampling Bootstrap Slide 73/82

Example: Bootstrapping

Original matrix (alignment) Resampled matrix (alignment)

Majority rule tree with bootstrap proportions (BP) at branch nodes.

Phylogenetics Quality Assessment by Resampling Bootstrap Slide 74/82

Interpreting Bootstrap Values

Bootstrap proportions are sometimes interpreted as confidence intervals for phylogenies.

This interpretation assumes that characters (e.g. nucleotide or amino acid sites) are independent and identically distributed, and that the method is consistent.

Even in a best case scenario, bootstrapping sometimes gives underestimates of accuracy at high bootstrap values and overestimates of accuracy at low bootstrap values.

Zharkikh and Li (1995) suggested the complete and partial bootstrap technique to reduce the bias of bootstrap proportions.

Phylogenetics Quality Assessment by Resampling Bootstrap Slide 75/82

Jackknife Resampling

The original data set, which contains n data points, is resampled by dropping k data points at a time. This results in a resampled data set with n − k data points.

The statistic of interest is computed for each replicated data set.

Agreement among the resulting trees is summarized with a majority-rule consensus tree. A jackknife proportion (JP) is the frequency of occurrence of a clade (for all replicated data sets) and is a measure of support for a group.

Like the bootstrap, the jackknife assumes that characters (e.g. nucleotides) are independent and identically distributed.
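A jackknife replicate differs from a bootstrap replicate only in how the columns are chosen: k columns are dropped and nothing is drawn twice. A minimal sketch, with the function name and the example alignment as illustrative assumptions:

```python
import random

def jackknife_alignment(alignment, k, rng):
    """Drop k randomly chosen columns, keeping the other n - k in their original order."""
    n = len(next(iter(alignment.values())))
    keep = sorted(rng.sample(range(n), n - k))
    return {name: "".join(seq[c] for c in keep) for name, seq in alignment.items()}

# Hypothetical alignment with n = 6 columns; each replicate drops k = 2 of them.
alignment = {"s1": "AAGCTA", "s2": "AAACTA", "s3": "GGACTT", "s4": "AGACTT"}
rng = random.Random(0)
replicates = [jackknife_alignment(alignment, 2, rng) for _ in range(5)]
```

Each replicate would then be passed to the same tree-building and clade-counting steps as in the bootstrap to obtain jackknife proportions.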

Phylogenetics Quality Assessment by Resampling Jackknife Slide 77/82

Phylogenetics Software (Selection!)

PAUP: complex phylogenetic tool collection (partially commercial)

PHYLIP: complex phylogenetic tool collection (free)

MrBayes: popular package for Bayesian inference of phylogenies.

BEAST: Another tool for Bayesian inference of phylogenies.

PAML: Molecular evolution software for determining sequence evolutionary rates.

HyPhy: Molecular evolution software.

Many more: see for instance this Link Collection.

Phylogenetics Software Slide 79/82

Alignment processing

trimAl: automated alignment trimming - with built-in and flexible models

Gblocks: automated alignment trimming

Phylogenetics Software Slide 81/82

References

Durbin R, Eddy S, Krogh A and Mitchison G (1999) Biological Sequence Analysis - Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press. Chapters 7-8.

Felsenstein J (2004) Inferring Phylogenies. Sinauer Associates, Inc, MA. Pages 1-664.

Fitch WM (1971) Toward defining the course of evolution: minimum change for a specified tree topology. Systematic Zoology 20: 406-416.

Iwabe N, Kuma K, Hasegawa M, Osawa S, Miyata T (1989) Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes. Proc Natl Acad Sci U S A 86: 9355-9359. URL http://www.hubmed.org/display.cgi?uids=2531898

Sokal RR and Michener CD (1958) A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin 28: 1409-1438.

Phylogenetics References Slide 82/82