1 phylogenetic tree reconstruction modified version of dr. chun-chieh shih’ institute of...

11

Phylogenetic Tree Reconstruction

Modified version of Dr. Chun-Chieh Shih’

Institute of Information Sciences

Academia Sinica

22

OUTLINE

Tree reconstruction methods

Flowchart of phylogenetic analysis

Concept of evolutionary trees

Evaluation of reconstructed trees

33

Why RECONSTRUCT phylogenetic trees

Understand evolutionary history

Map pathogen strain diversity for vaccines

Assist in epidemiology of infectious diseases

Aid in prediction of function of novel genes

Biodiversity studies

Understanding microbial ecologies

For Example

44


Rooted tree: one sequence (the root) is defined to be the common ancestor of all other sequences

If the molecular clock hypothesis holds, it is possible to predict a root

Unrooted tree: indicates evolutionary relationships without revealing the location of the oldest ancestry

5

Image: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html

66

Number of Trees

                             Unrooted trees            Rooted trees
# sequences  # pairwise      # trees       # branches  # trees       # branches
             distances                     /tree                     /tree
3            3               1             3           3             4
4            6               3             5           15            6
5            10              15            7           105           8
6            15              105           9           945           10
10           45              2,027,025     17          34,459,425    18
30           435             8.69 × 10^36  57          4.95 × 10^38  58
N            N(N - 1)/2      U(N)          2N - 3      R(N)          2N - 2

where U(N) = (2N - 5)! / [2^(N - 3) (N - 3)!] and R(N) = (2N - 3)! / [2^(N - 2) (N - 2)!]

Taken From: http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
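The counts in the Number of Trees table can be reproduced directly from the two factorial formulas. The following is a minimal sketch; the function names `num_unrooted_trees` and `num_rooted_trees` are illustrative and not part of the original slides.

```python
from math import factorial


def num_unrooted_trees(n: int) -> int:
    """Unrooted bifurcating trees for n >= 3 sequences: (2n-5)! / (2^(n-3) (n-3)!)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def num_rooted_trees(n: int) -> int:
    """Rooted bifurcating trees for n >= 2 sequences: (2n-3)! / (2^(n-2) (n-2)!)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


if __name__ == "__main__":
    for n in (3, 4, 5, 6, 10):
        pairwise = n * (n - 1) // 2
        print(n, pairwise, num_unrooted_trees(n), num_rooted_trees(n))
```

Because Python integers are arbitrary precision, the same functions also work for 30 or more sequences, where the counts no longer fit in a machine word.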

77

Types of data used in phylogenetic inference

Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.

Distance-based methods: Transform the sequence data into pairwise distances, and use the matrix during tree building.

88

Data set collection

Multiple sequence alignment

Tree construction

Character-based: Parsimony, Maximum likelihood (optimality criteria)

Distance-based: UPGMA, NJ, Fitch-Margoliash, KITSCH

Test reliability of the tree by analytical and/or resampling procedure

99

Distance Methods

Finding the closest neighbors among a group of sequences

Calculate the changes between each pair in a group of sequences (the first step in producing a multiple sequence alignment)

Identify the tree that correctly positions neighbors and that also has branch lengths that reproduce the original data as closely as possible

1010

Distance Methods - Example

distances between sequences

distance table

1111

Distance Programs in PHYLIP

FITCH: estimates a phylogenetic tree assuming additivity of branch lengths, using the Fitch-Margoliash method

KITSCH: same as FITCH, but under the assumption of a molecular clock

NEIGHBOR: estimates phylogenies using either:
– Neighbor-joining (no molecular clock assumed)
– Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (molecular clock assumed)

1212

Distance Methods - UPGMA

Construct a distance tree

A - GCTTGTCCGTTACGAT
B - ACTTGTCTGTTACGAT
C - ACTTGTCCGAAACGAT
D - ACTTGACCGTTTCCTT
E - AGATGACCGTTTCGAT
F - ACTACACCCTTATGAG

     A   B   C   D   E
B    2
C    4   4
D    6   6   6
E    6   6   6   4
F    8   8   8   8   8

Clustering

All leaves are assigned to a cluster, which then are iteratively merged according to their distance

1313


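The distance between two clusters i and j is defined as:

d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}

where |C_i| and |C_j| denote the number of sequences in cluster i and j, respectively.

Replacing clusters i and j by their union:

C_k = C_i \cup C_j

The new distances between the new node k and all other clusters l are computed according to:

d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}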

1414


Step I: Initialization

. Assign each sequence i to its own cluster C_i.

. Define one leaf of T for each sequence, and place it at height zero.

Step II: Iteration

. Determine the two clusters i, j for which d_ij is minimal.

. Define a new cluster k by C_k = C_i ∪ C_j, and define d_kl for all l.

. Define a node k with daughter nodes i and j, and place it at height d_ij/2.

. Add k to the current clusters and remove i and j.

Step III: Termination

. When only two clusters i, j remain, place the root at height d_ij/2.
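The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `upgma`, `NAMES` and `DIST` are hypothetical, clusters are represented as nested tuples, and ties between equally close pairs are broken by whichever pair is encountered first. The distances are those of the six-sequence example A to F.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E", "F"]
DIST = {
    ("A", "B"): 2,
    ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
    ("A", "E"): 6, ("B", "E"): 6, ("C", "E"): 6, ("D", "E"): 4,
    ("A", "F"): 8, ("B", "F"): 8, ("C", "F"): 8, ("D", "F"): 8, ("E", "F"): 8,
}


def upgma(names, dist):
    """Return (root, height): root is a nested tuple, height maps each node to its height."""
    size = {name: 1 for name in names}        # Step I: one cluster per sequence
    height = {name: 0.0 for name in names}    # leaves sit at height zero
    d = {frozenset(pair): float(value) for pair, value in dist.items()}
    while len(size) > 1:
        # Step II: find the two closest clusters
        i, j = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        k = (i, j)
        height[k] = d[frozenset((i, j))] / 2
        # size-weighted average distance from the new cluster to every other cluster
        for other in size:
            if other not in (i, j):
                d[frozenset((k, other))] = (
                    d[frozenset((i, other))] * size[i]
                    + d[frozenset((j, other))] * size[j]
                ) / (size[i] + size[j])
        size[k] = size[i] + size[j]
        del size[i], size[j]
    (root,) = size                             # Step III: the last cluster is the root
    return root, height


root, height = upgma(NAMES, DIST)
print(root, height[root])
```

On this matrix the sketch places the (A,B) node at height 1, the (D,E) node at height 2, and the root at height 4, which matches the worked rounds in the slides.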

1515


First round

Choose the most similar pair, cluster them together, and calculate the new distance matrix.

     A   B   C   D   E
B    2
C    4   4
D    6   6   6
E    6   6   6   4
F    8   8   8   8   8

[Tree so far: A and B joined, with branch lengths 1 and 1]

dist((A,B),C) = (distAC + distBC)/2 = 4

dist((A,B),D) = (distAD + distBD)/2 = 6

dist((A,B),E) = (distAE + distBE)/2 = 6

dist((A,B),F) = (distAF + distBF)/2 = 8

     A,B  C   D   E
C    4
D    6    6
E    6    6   4
F    8    8   8   8

1616


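Second round

     A,B  C   D   E
C    4
D    6    6
E    6    6   4
F    8    8   8   8

[Tree so far: A and B joined with branch lengths 1 and 1; D and E joined with branch lengths 2 and 2]

Third round

     A,B  C   D,E
C    4
D,E  6    6
F    8    8   8

[Tree so far: A and B joined with branch lengths 1 and 1; D and E joined with branch lengths 2 and 2; C joined to the (A,B) node with branch length 2 for C and 1 for (A,B)]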

1717


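Fourth round

      AB,C  D,E
D,E   6
F     8     8

[Tree so far: the (AB,C) node and the (D,E) node joined, each through a branch of length 1]

Fifth round

      ABC,DE
F     8

[Final tree: F joins the (ABC,DE) node at the ROOT, with branch length 4 for F and 1 for (ABC,DE)]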

1818


The UPGMA clustering method is very sensitive to unequal evolutionary rates: it assumes that the evolutionary rate is the same for all branches

Clustering works only if the data are ultrametric

Ultrametric tree: a special kind of additive tree in which the tips of the tree are all equidistant from the root

Additive tree: a cladogram with branch lengths, also called a phylogram or metric tree

[Figure: an example ultrametric tree and an example additive tree, with branch lengths]

1919


UPGMA fails when rates of evolution are not constant

[Figure: the true tree, with unequal branch lengths (A = 1, B = 4, C = 2, D = 3, E = 2, F = 4)]

     A   B   C   D   E
B    5
C    4   7
D    7   10  7
E    6   9   6   5
F    8   11  8   9   8

[Figure: UPGMA applied to this matrix joins A and C first, then D and E, then adds B to (A,C), then joins (A,C,B) with (D,E), and finally adds F]

Wrong topology
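Whether a distance matrix is ultrametric can be checked mechanically with the standard three-point condition: for every triple of sequences, the two largest of the three pairwise distances must be equal. The sketch below applies that test to the equal-rate matrix from the UPGMA worked example and to the unequal-rate matrix above. The helper names `lower_triangle` and `is_ultrametric` are illustrative assumptions, not part of the original slides.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E", "F"]

# Lower-triangular rows: row i lists the distances from NAMES[i] to NAMES[0..i-1].
EQUAL_RATES = [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]]
UNEQUAL_RATES = [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]]


def lower_triangle(names, rows):
    """Build a symmetric lookup d[x, y] from lower-triangular rows."""
    d = {}
    for i, row in enumerate(rows, start=1):
        for j, value in enumerate(row):
            d[names[i], names[j]] = d[names[j], names[i]] = value
    return d


def is_ultrametric(names, d):
    """Three-point condition: in every triple the two largest distances are equal."""
    for a, b, c in combinations(names, 3):
        _, middle, largest = sorted((d[a, b], d[a, c], d[b, c]))
        if middle != largest:
            return False
    return True


print(is_ultrametric(NAMES, lower_triangle(NAMES, EQUAL_RATES)))
print(is_ultrametric(NAMES, lower_triangle(NAMES, UNEQUAL_RATES)))
```

The first matrix passes and the second fails (for example, the triple A, B, C has distances 5, 4, 7), which is why UPGMA recovers the correct topology only in the first case.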

2020

Distance Methods – Neighbor Joining

The Four Point Condition

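For four sequences in which A and B are neighbors and C and D are neighbors, with external branch lengths a, b, c, d and internal branch length x:

d_AC + d_BD = d_AD + d_BC = a + b + c + d + 2x = d_AB + d_CD + 2x

The 4-point condition

d_AB + d_CD < d_AC + d_BD

d_AB + d_CD < d_AD + d_BC

(left-hand side: neighbors; right-hand side: non-neighbors)

• Neighbors are closer than non-neighbors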
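The four-point condition can be tested on a quartet by comparing the three possible pairing sums. The sketch below is a minimal illustration; the function names and the example distances are hypothetical, chosen to come from a tree with a = 1, b = 4, c = 3, d = 2 and x = 2.

```python
def symmetric(pairs):
    """Expand {(x, y): value} into a lookup that works in both orders."""
    d = {}
    for (x, y), value in pairs.items():
        d[x, y] = d[y, x] = value
    return d


def pairing_sums(d, p, q, r, s):
    """Return the three pairing sums of a quartet, smallest first."""
    sums = [
        (d[p, q] + d[r, s], ((p, q), (r, s))),
        (d[p, r] + d[q, s], ((p, r), (q, s))),
        (d[p, s] + d[q, r], ((p, s), (q, r))),
    ]
    return sorted(sums)


def satisfies_four_point(d, p, q, r, s, tol=1e-9):
    """True if the two largest pairing sums are equal (additive distances)."""
    _, (middle, _), (largest, _) = pairing_sums(d, p, q, r, s)
    return abs(largest - middle) <= tol


# Distances implied by the hypothetical tree: A-B = a + b, A-C = a + x + c, and so on.
D = symmetric({
    ("A", "B"): 5, ("C", "D"): 5,
    ("A", "C"): 6, ("B", "D"): 8,
    ("A", "D"): 5, ("B", "C"): 9,
})

smallest_sum, neighbor_pairs = pairing_sums(D, "A", "B", "C", "D")[0]
print(neighbor_pairs, smallest_sum, satisfies_four_point(D, "A", "B", "C", "D"))
```

The pairing with the smallest sum identifies the neighbors, and the two larger sums exceed it by exactly 2x when the distances are additive.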

2121


Begin with a star topology – no neighbors have been joined

[Figure: star topology with leaves A, B, C, D, E]

Tree modified by joining pairs of sequences

Sequences chosen to give the best least-squares estimate of branch length

2222


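Pair is chosen by calculating the sum of branch lengths for the corresponding tree

S_{mn} = \frac{\sum_{i \ne m,n} (d_{im} + d_{in})}{2(N-2)} + \frac{d_{mn}}{2} + \frac{\sum_{i<j;\; i,j \ne m,n} d_{ij}}{N-2}

If A and B are joined:

[Figure: A and B joined to a new internal node, which is connected to a star of C, D, E]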

2323


Neighbor-Joining approximates the least-squares tree, assuming additivity, but without resorting to the assumption of a molecular clock.

Idea: join clusters that are not only close to one another, but are also far from the rest.

In each iteration: find the direct ancestor of two neighboring leaves (species) in the tree.

2424


Example: neighboring leaves i, j with ancestor k. Join i and j, remove them from the list of leaf nodes, and add k to the list, with distances to the other leaves m defined as

d_{km} = \frac{1}{2}\,(d_{im} + d_{jm} - d_{ij})

Problem: it is not sufficient to simply pick the two closest leaves

2525


Solution: For node i, define the average distance u_i to all other leaves (n is the current number of leaves):

u_i = \frac{1}{n-2} \sum_{k \ne i} d_{ik}

and correct the distances:

D_{ij} = d_{ij} - u_i - u_j

Minimum-evolution criterion: minimize the sum of all branch lengths. Nodes i and j that are clustered next are those for which D_ij is smallest.

2626


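Initialization:

1. Initialize n clusters with the given species, one species per cluster

2. Set the size of each cluster to 1: n_i ← 1

3. In the output tree T, assign a leaf for each species

Iteration:

1. For each species i, compute u_i = \frac{1}{n-2} \sum_{k \ne i} d_{ik}

2. Choose the i and j for which d_ij − u_i − u_j is smallest.

3. Join clusters i and j to a new cluster, with corresponding node k, and set d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij}) for every other cluster m. Calculate the branch lengths from i and j to the new node as:

d_{ik} = \frac{1}{2}(d_{ij} + u_i - u_j), \qquad d_{jk} = \frac{1}{2}(d_{ij} + u_j - u_i)

4. Delete clusters i and j from T and add k

5. If more than two nodes remain, go back to 1. Otherwise, connect the two remaining nodes by a branch of length d_ij and end.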
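The iteration above can be sketched as follows. This is a minimal illustration under stated assumptions: the name `neighbor_joining` is hypothetical, u_i is taken as the sum of distances from i divided by (n − 2) as is standard for neighbor joining, internal nodes are represented as nested tuples, and ties are broken by encounter order. The input is the unequal-rate matrix on which UPGMA gave the wrong topology.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E", "F"]
ROWS = [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]]
DIST = {
    frozenset((NAMES[i], NAMES[j])): float(value)
    for i, row in enumerate(ROWS, start=1)
    for j, value in enumerate(row)
}


def neighbor_joining(names, dist):
    """Return a list of (child, parent, branch_length) edges of the unrooted NJ tree."""
    d = dict(dist)
    nodes = list(names)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # step 1: u_i for every current node
        u = {
            i: sum(d[frozenset((i, m))] for m in nodes if m != i) / (n - 2)
            for i in nodes
        }
        # step 2: pair minimising d_ij - u_i - u_j
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]],
        )
        k = (i, j)
        dij = d[frozenset((i, j))]
        # step 3: branch lengths to the new node and distances from it
        edges.append((i, k, 0.5 * (dij + u[i] - u[j])))
        edges.append((j, k, 0.5 * (dij + u[j] - u[i])))
        for m in nodes:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - dij
                )
        # step 4: replace i and j by k
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    # step 5: connect the last two nodes
    i, j = nodes
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


edges = neighbor_joining(NAMES, DIST)
for child, parent, length in edges:
    print(child, "->", parent, length)
```

Because this matrix is additive, the sketch pairs A with B and D with E and returns external branch lengths 1, 4, 2, 3, 2 for A to E. The branch to F has length 5, since in an unrooted tree it absorbs the two edges on either side of the root.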

2727

Maximum Parsimony

Predicts the evolutionary tree by minimizing the number of steps required to generate the observed variation

For each position, the phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified

Trees producing the smallest number of changes for all sequence positions are identified

Time-consuming algorithm

Only works well if the sequences have a strong sequence similarity

2828

Maximum Parsimony

Step I

Input: multiple sequence alignment

Step II

For each aligned position, identify phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes

Step III

Continue analysis for every position in the sequence alignment

Step IV

Sequence variations at each site in the alignment are placed at the tips of the trees

2929

Maximum Parsimony - Example

Sequences

positions

Informative sites: must favor one tree over another

site 5 is informative, but sites 1, 6, 8 are not

To be informative, a site must also have the same sequence character in at least two genomes

only sites 5, 7, and 9 are informative according to this rule

E.g. trees for position 5:

Combining sites 5, 7, and 9, the left tree is the best tree for these 4 sequences

3030


What is the parsimony score of

3131


Site        1  2  3  4  5  6  7  8  9  10
Species 1   A  G  G  G  T  A  A  C  T  G
Species 2   A  C  G  A  T  T  A  T  T  A
Species 3   A  T  A  A  T  T  G  T  C  T
Species 4   A  A  T  G  T  T  G  T  C  G

How many possible unrooted trees?

3232


How many substitutions?

tree 1 change 5 changes

3333



1 2 3 4 5 6 7 8 9 10

0

0

0

3434



1 2 3 4 5 6 7 8 9 10

0 3

0 3

0 3

3535



1 2 3 4 5 6 7 8 9 10

0 3 2 2

0 3 2 2

0 3 2 1

3636


Site        1  2  3  4  5  6  7  8  9  10
Species 1   A  G  G  G  T  A  A  C  T  G
Species 2   A  C  G  A  T  T  A  T  T  A
Species 3   A  T  A  A  T  T  G  T  C  T
Species 4   A  A  T  G  T  T  G  T  C  G

Minimum substitutions

Site             1  2  3  4  5  6  7  8  9  10  Total
((1,2),(3,4))    0  3  2  2  0  1  1  1  1  2   13
((1,3),(2,4))    0  3  2  2  0  1  2  1  2  2   15
((1,4),(2,3))    0  3  2  1  0  1  2  1  2  2   14
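The per-site minimum number of substitutions can be counted with Fitch's algorithm, a standard method for scoring a fixed tree. The sketch below is a minimal illustration; the names `fitch` and `parsimony_score` are hypothetical, and each unrooted four-species tree is written as a nested tuple rooted on its internal branch, which does not change the count. Counting directly from the four sequences as printed gives totals of 13, 15 and 14, so the tree grouping species 1 with 2 and species 3 with 4 needs the fewest substitutions.

```python
SEQS = {
    1: "AGGGTAACTG",
    2: "ACGATTATTA",
    3: "ATAATTGTCT",
    4: "AATGTTGTCG",
}

TREES = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]


def fitch(tree, site):
    """Return (possible states at this node, substitutions below it) for one column."""
    if not isinstance(tree, tuple):           # leaf: the observed base
        return {SEQS[tree][site]}, 0
    left_states, left_changes = fitch(tree[0], site)
    right_states, right_changes = fitch(tree[1], site)
    common = left_states & right_states
    if common:                                 # children agree: no extra change
        return common, left_changes + right_changes
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_score(tree):
    """Sum the per-site minimum number of substitutions over the alignment."""
    length = len(next(iter(SEQS.values())))
    return sum(fitch(tree, site)[1] for site in range(length))


for tree in TREES:
    per_site = [fitch(tree, site)[1] for site in range(10)]
    print(tree, per_site, parsimony_score(tree))
```

Only sites 4, 7 and 9 distinguish the three trees here; every other site costs the same number of changes on all of them.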

3737

Maximum Parsimony – Searching for Trees

#Taxa    3   4   5    10        50         100
#Trees   1   3   15   2×10^6    2×10^74    2×10^182

Imagine how large 10^182 is ...

3838

Maximum Parsimony

Parsimony can give misleading information when rates of sequence change vary in the different branches of a tree that are represented by the sequence data

Where maximum parsimony fails

Real tree: 2 long branches in which G has turned to A independently, possibly with some intermediate steps.

In parsimony analysis, rates of change along all branches of the tree are assumed equal. Therefore the tree predicted from parsimony will not be correct.

3939

Standard problem: Maximum Parsimony (Hamming distance Steiner Tree)

Input: Set S of n aligned sequences of length k

Output: A phylogenetic tree T
– leaf-labeled by sequences in S
– additional sequences of length k labeling the internal nodes of T

such that \sum_{(i,j) \in E(T)} H(i,j) is minimized, where H(i,j) is the Hamming distance between the sequences labeling nodes i and j.


4040

Maximum parsimony (example)

Input: Four sequences
– ACT
– ACA
– GTT
– GTA

Question: which of the three trees has the best MP score?


4141

All possible unrooted trees

Tree 1: (ACT, GTT) | (ACA, GTA)

Tree 2: (ACA, GTT) | (ACT, GTA)

Tree 3: (ACT, ACA) | (GTT, GTA)


4242

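Possible substitutions

(ACT, GTT) | (ACA, GTA), internal nodes labeled GTT and GTA: changes on the branches 1, 2, 2

MP score = 5

(ACA, GTT) | (ACT, GTA), internal nodes labeled ACA and ACT: changes on the branches 3, 1, 3

Score with this labeling = 7 (labeling both internal nodes ACA gives 6)

(ACT, ACA) | (GTT, GTA), internal nodes labeled ACA and GTA: changes on the branches 1, 2, 1

MP score = 4

Optimal MP tree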


4343

Maximum Parsimony: computational complexity

[Figure: the optimal tree (ACT, ACA) | (GTT, GTA), internal nodes labeled ACA and GTA, changes on the branches 1, 2, 1; MP score = 4]

Finding the optimal MP tree is NP-hard

Optimal labeling can be computed in linear time O(nk)


4444

Maximum likelihood approach

Method uses probability calculations to find a tree that best accounts for the variation in a set of sequences

Similar to the maximum parsimony method in that analysis is performed on each column of a multiple sequence alignment

Start with an evolutionary model of sequence change that provides estimates of rates of substitution of one base for another (transitions and transversions).

4545


Statistical method - powerful and flexible, also computationally complex

Given a particular tree and a model of the evolutionary change, calculate the likelihood of the tree based on the data, i.e. the given multiple sequence alignment

Likelihood (tree | data) proportional to Probability (data | tree)

4646


Tree with branches; v_k are the branch lengths

Probability of character change P_AC(t) for A → C in time t

Don’t know the character states inside the tree (in the past), so calculate for all possibilities, e.g. A, C, G, T

4747


L = p(A) PAA(v1) PAA(v2) PAG(v4) PAA(v5) PAA(v6) PAA(v3) PAA(v7) PAA(v8)

L = p(A) PAA(v1) PAG(v2) PGG(v4) PGA(v5) PAA(v6) PAA(v3) PAA(v7) PAA(v8)

4848


L = p(s0) Ps0s1(v1) Ps1s2(v2) Ps2s4(v4) Ps2s5(v5) Ps1s6(v6) Ps0s3(v3) Ps3s7(v7) Ps3s8(v8)

Maximum likelihood does best in simulation but is also the slowest method

Variety of new heuristics to find ML tree faster
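The likelihood of one alignment column is the sum, over all possible states at the internal nodes, of products like the one written above. A minimal sketch of that sum, using Felsenstein's pruning recursion and the Jukes-Cantor model, follows. Everything concrete in it is an assumption made for illustration: the function names, the leaf bases, the branch lengths (in expected substitutions per site), and the uniform prior p(s0) = 1/4. Only the tree shape follows the s0 to s8 formula, with s4, s5, s6, s7 and s8 as the observed leaves.

```python
import math

BASES = "ACGT"


def jc69(a, b, v):
    """Jukes-Cantor probability of base a becoming base b along a branch of length v."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def conditional(node):
    """Likelihood of the data below `node`, for each possible state at `node`."""
    if isinstance(node, str):                  # leaf: observed base
        return {s: 1.0 if s == node else 0.0 for s in BASES}
    result = {s: 1.0 for s in BASES}
    for child, v in node:                      # internal node: (child, branch length) pairs
        below = conditional(child)
        for s in BASES:
            result[s] *= sum(jc69(s, t, v) * below[t] for t in BASES)
    return result


def site_likelihood(root):
    """Sum over root states with a uniform prior p(s0) = 1/4."""
    return sum(0.25 * value for value in conditional(root).values())


# Hypothetical leaf bases and branch lengths on the s0..s8 tree shape.
s2 = (("A", 0.1), ("G", 0.1))      # leaves s4, s5 via v4, v5
s1 = ((s2, 0.1), ("A", 0.2))       # s2 via v2, leaf s6 via v6
s3 = (("A", 0.2), ("A", 0.2))      # leaves s7, s8 via v7, v8
s0 = ((s1, 0.1), (s3, 0.1))        # s1 via v1, s3 via v3

print(site_likelihood(s0))
```

For a full alignment, the per-column likelihoods are multiplied together (in practice their logarithms are summed), and a tree search then looks for the topology and branch lengths that maximize the result.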

4949

Maximum Likelihood (ML)

Given: stochastic model of sequence evolution (e.g. Jukes-Cantor) and a set S of sequences

Objective: Find tree T and probabilities p(e) of substitution on each edge, to maximize the probability of the data.

Preferred by some systematists, but even harder than MP in practice.


5050

Quality of the tree

Phylogenetic trees can vary dramatically with

slight changes in data

We want to know which branches are reliable, and

which branches do not have strong support from the

data

Bootstrapping is the most common method used

A general statistical technique for determining how

much error is in a set of results

5151

Confidence assessment

Bootstrapping

Original data set with n characters

Draw n characters randomly with replacement. Repeat m times.

m pseudo-replicates, each with n characters.

Original analysis, e.g. MP, ML, NJ.

Repeat the original analysis on each of the pseudo-replicate data sets.

Evaluate the results from the m analyses.
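The resampling step of this procedure can be sketched as follows: alignment columns are drawn with replacement until the pseudo-replicate has as many columns as the original. The function name `bootstrap_replicates`, the `seed` parameter and the example alignment (the four species from the parsimony example) are illustrative choices; the tree-building step applied to each replicate is left out.

```python
import random

ALIGNMENT = {
    "Species 1": "AGGGTAACTG",
    "Species 2": "ACGATTATTA",
    "Species 3": "ATAATTGTCT",
    "Species 4": "AATGTTGTCG",
}


def bootstrap_replicates(alignment, m, seed=0):
    """Yield m pseudo-replicate alignments built by resampling columns with replacement."""
    rng = random.Random(seed)
    n = len(next(iter(alignment.values())))    # number of characters (columns)
    for _ in range(m):
        columns = [rng.randrange(n) for _ in range(n)]
        yield {
            name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()
        }


for replicate in bootstrap_replicates(ALIGNMENT, m=3):
    for name, seq in replicate.items():
        print(name, seq)
    print()
```

The same columns are picked for every sequence, so each pseudo-replicate is still a valid alignment. The bootstrap value of a branch is then the percentage of replicate trees in which that branch appears.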

5252


Bootstrap sampling of phylogenies

5353


What do the bootstrap values mean?

Bootstrap values for phylogenetic trees do not

follow proper statistical behavior

Bootstrap value 95% actually close to 100%

confidence in that branch

Bootstrap value 75% often close to 95%

confidence

Bootstrap value 60% is much lower confidence

Less than 50% bootstrap: no confidence in that

branch over an alternative

5454

Computer Software for Phylogenetics

Due to the lack of consensus among evolutionary biologists about basic principles for phylogenetic analysis, it is not surprising that there is a wide array of computer software available for this purpose.

– PHYLIP is a free package that includes 30 programs that compute various phylogenetic algorithms on different kinds of data.

– The GCG package (available at most research institutions) contains a full set of programs for phylogenetic analysis, including simple distance-based clustering and the complex cladistic analysis program PAUP (Phylogenetic Analysis Using Parsimony)

– CLUSTALX is a multiple alignment program that includes the ability to create trees based on Neighbor Joining.

– MacClade is a well designed cladistics program that allows the user to explore possible trees for a data set.

5555

Phylogenetics on the Web

There are several phylogenetics servers available on the Web
– some of these will change or disappear in the near future
– these programs can be very slow, so keep your sample sets small

The Institut Pasteur, Paris has a PHYLIP server at:
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html

Louxin Zhang at the Natl. University of Singapore has a WebPhylip server:
http://sdmc.krdl.org.sg:8080/~lxzhang/phylip/

The Belozersky Institute at Moscow State University has their own "GeneBee" phylogenetics server:
http://www.genebee.msu.su/services/phtree_reduced.html

The Phylodendron website is a tree drawing program with a nice user interface and a lot of options; however, the output is limited to gifs at 72 dpi - not publication quality.
http://iubio.bio.indiana.edu/treeapp/treeprint-form.html

5656

Other Web Resources

Joseph Felsenstein (author of PHYLIP) maintains a comprehensive list of Phylogeny programs at:
http://evolution.genetics.washington.edu/phylip/software.html

Introduction to Phylogenetic Systematics, Peter H. Weston & Michael D. Crisp, Society of Australian Systematic Biologists
http://www.science.uts.edu.au/sasb/WestonCrisp.html

University of California, Berkeley Museum of Paleontology (UCMP)
http://www.ucmp.berkeley.edu/clad/clad4.html

5757

Software Hazards

There are a variety of programs for Macs and PCs, but you can easily tie up your machine for many hours with even moderately sized data sets (e.g. fifty 300 bp sequences)

Moving sequences into different programs can be a major hassle due to incompatible file formats.

Just because a program can perform a given computation on a set of data does not mean that it is the appropriate algorithm for that type of data.

5858

Which Method to Choose?

Depends upon the sequences that are being compared

Strong sequence similarity:

Maximum parsimony

Clearly recognizable sequence similarity

Distance methods

All others:

Maximum likelihood

Best to choose at least two approaches

Compare the results – if they are similar, you can have more confidence

5959

Which Method to Choose?

6060

Comparison of Methods

Neighbor-joining
– Uses only pairwise distances
– Minimizes distance between nearest neighbors
– Very fast
– Easily trapped in local optima
– Good for generating tentative tree, or choosing among multiple trees

Maximum parsimony
– Uses only shared derived characters
– Minimizes total distance
– Slow
– Assumptions fail when evolution is rapid
– Best option when tractable (<30 taxa, homoplasy rare)

Maximum likelihood
– Uses all data
– Maximizes tree likelihood given specific parameter values
– Very slow
– Highly dependent on assumed evolution model
– Good for very small data sets and for testing trees built using other methods

Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt

6161

More Topics

Related to

Phylogenetics

6262

More topics related to Phylogenetics

Phylogenetic epidemiology

Supertree / Tree of life

Phylogeography

6363

Idea of the ‘Tree of Life’

The idea that the evolution of life can be represented as a tree, with leaves corresponding to extant species and nodes to extinct ancestors, came from Charles Darwin

The earliest trees formed by Ernst Haeckel and others were based on a general idea of a hierarchy of relationships between species and higher taxa

Gradually, quantitative criteria have been developed to measure the degree of morphological difference that was thought to reflect evolutionary distance

6464

Winds of Change

In the early days of molecular phylogenetics, a gene tree was usually equated with the species tree. This view was typified using ribosomal RNA (rRNA) sequences as the principal molecular phylogenetic marker

This resulted in the discovery of a previously unrecognized domain of life, the Archaea, and in a tree topology that has been aptly called the ‘standard model’ of evolution

This model involves the early descent of the bacterial clade from the last universal common ancestor and a subsequent separation of archaea and eukaryotes.

All this was to change once comparative genomics yielded more information and multiple complete genome sequences became available for comparison

6565

The three domains of Life

Identified by phylogenetic analysis of the highly

conserved 16S ribosomal RNA

6666

Three strategies for constructing phylogenies

Homologous single-gene data set

Sequence concatenation

Supertree construction

Rely on many taxa for a single gene

Combine or concatenate multiple sequences for the same set of species

Need for close concordance of species sampling among genes, which is difficult because of the hit-or-miss sampling in the databases.

Fewer genes and fewer samples

Alignment of a large number of sequences

Sample multiple genes only for minimally overlapping sets of species

Tree constructed from a set of subtrees

6767

Assembling the Tree of Life (ATOL)

Tree of Life (30,000 species)

What is the difficulty in computing?

With current computational tools, phylogenetic analyses for 1,000 species are possible with adequate computer resources

It is currently impossible to reach a reasonable solution for 500,000 species, even with months of computation.

David Hillis, Science, 2003

PARALLEL ALGORITHMS FOR GENETICS

6969

Assembling large data matrices by concatenation

Advantages

Improve the accuracy of a specific portion of a tree

The addition of species can be useful in cases of so-called

‘long-branch attraction’, in which high substitution rates or long

intervals of time can mislead phylogenetic inference methods

Two potential problems

Multiple genes can mix phylogenetic signals arising from different

evolutionary histories

Some sequences are usually unavailable for some species,

‘missing data’, with possible deleterious effects on accuracy

Domination by biological problems

7070

Reconstruction of trees from large data matrices

Two issues in constructing phylogenetic trees

Computation time

Reliability

Two time-consuming computational problems

Multiple sequence alignment

Phylogenetic inference

Domination by computational problems

Optimal methods ( parsimony and maximum likelihood ) are time-consuming

Even heuristic approaches are time-consuming

Months of processor time were devoted to a heuristic parsimony analysis of the Chase et al. dataset of ~500 sequences, and it never ran to completion (Sanderson and Driskell, 2004)

7171

Synthesis of large trees: supertree

Tree constructed by a set of trees

Advantages

Independent studies can be combined into a single tree

Initial trees can be based on different kinds of data

Initial trees can be obtained by different methodologies

Initial trees often have been selected from competing trees by professional judgment

There are most likely no common data for all species

Methods such as maximum likelihood would not be computationally tractable on such a large dataset


7272

Synthesis of large trees: supertree

Classification ( Wilkinson et al, 2001, Bininda-Emonds et al, 2002 )

Present Past

Supertree technique past and present ( Bininda-Emonds, 2004 )

7373

Reconstructing the “Tree” of Life

Handling large datasets: millions of species

The “Tree of Life” is not really a tree: reticulate evolution

7474

Phylogenetic Epidemiology

7575

Infectious diseases are caused by pathogens

pathogen: microbe that causes disease

microbe: microscopic organism

The major classes of disease-causing microbes are

viruses, bacteria, and eukaryotes (protists, fungi, and worms)

RNA Viruses

The RNA viruses are more often associated with epidemic and

emerging diseases in humans than DNA viruses.

The gene sequences of many RNA viruses change so rapidly

that it is possible to watch spatial and temporal patterns unfold on

a ‘real time’ scale that is not usually visible in other organisms.

Diseases caused by RNA viruses: avian influenza, HIV, dengue...

7676

The rapidity of RNA virus evolution is caused by a combination of (Holmes, 2004)

Extremely high mutation rates

Short generation times

Immense population sizes.

These factors produce rates of nucleotide substitution that are, on

average, some six orders of magnitude higher than those in eukaryotes

and DNA viruses (Jenkins et al. 2002).

The high rates of substitution found in viruses and bacteria allow

phylogenies to be reconstructed for sequences that have diverged

only recently

Molecular phylogenies have come to play an increasingly important

role in epidemiological studies of microbial pathogens, as they

provide information about the location, timing, and mechanisms by

which virulent strains arise.

7777

Guan et al. (2002) Emergence of multiple genotypes of H5N1 avian influenza viruses in Hong Kong SAR. Proc Natl Acad Sci U S A, 99, 8950-8955.

7878

Moya, A., Holmes, E.C., and Gonzalez-Candelas, F. (2004) The population genetics and evolutionary epidemiology of RNA viruses. Nat Rev Microbiol, 2, 279-288.

7979

Maximum likelihood estimate of phylogeny of eight strains of influenza A isolated from humans, swine, and birds based on an analysis of the HA gene. The divergence years prior to 1870, estimated using a partially constrained molecular clock, are shown at the left of the branch. The branch lengths (after 1870) are calibrated in units of years (scale at bottom).

Rannala, B. 2002. Molecular phylogenies and virulence evolution. In Adaptive Dynamics of Infectious Diseases: In Pursuit of Virulence Management

8080

Difficulties With Phylogenetic Analysis

Horizontal or lateral transfer of genetic material

(for instance through viruses) makes it difficult to

determine phylogenetic origin of some evolutionary

events

Garbage in, garbage out ! Alignment crucial

Genes under selective pressure can evolve rapidly, masking earlier changes that had occurred phylogenetically

8181

Difficulties With Phylogenetic Analysis

Two sites within comparative sequences may be

evolving at different rates

Rearrangements of genetic material can lead to

false conclusions

duplicated genes can evolve along separate pathways,

leading to different functions

8282

Phylogenetics - Issues

Gene trees vs species trees

Gene duplication can complicate phylogenetic analysis

Paralogues (duplicated genes) do not fit in the evolutionary tree

Choice of target sequence type

Ribosomal RNA (slowest change / mutation rate)
Use for very long-term evolutionary studies, spanning species boundaries & biological kingdoms

DNA / RNA (fastest change / mutation rate)
(a) Use for short-term studies of closely-related species
(b) Contains more evolutionary information than protein

Protein (medium change / mutation rate)
(a) Use for wide species comparisons
(b) More reliable alignment than DNA

8383

NO HOMEWORK! Happy??

A problem will appear in the Final Exam:

Give an example and design a flowchart to show how to construct a tree

Your answer should include, at least:

(a) Where did you find the example? (Google, books, or papers)

(b) Why did you choose this example? (curiosity, simplicity, or no reason?)

(c) Where do you plan to get the sequences? (a database in the public domain)

(d) What kind of methods do you plan to use to construct your tree?

(e) Why do you plan not to use other methods?

Just go to Google and find YOUR OWN answer!
