Phylogenetic analysis in the context of multigene sequences Sudhindra R. Gadagkar University of Dayton

Post on 15-Jan-2016

Page 1:

Phylogenetic analysis in the context of multigene sequences

Sudhindra R. Gadagkar

University of Dayton

Page 2:

DNA Evolution

The unifying force of all life on earth is DNA

Adenine, Cytosine, Guanine, Thymine

ATGGCATACGTGCAGTTCATCGGCTAGTGTGACATGA

Page 3:

DNA sequence evolution

t0: ATGGCATACGTGCA

t1: ATGGTATAGGTGCA and ATGGCATACGTGAA

Page 4:

A phylogenetic tree

A pattern of branching events, with each branching point showing a speciation (or divergence) event

[Figure: tree of Taxon A, Taxon B, and Taxon C, with branch lengths 3.5, 3.5, 7.5, and 4]

• Nodes (extinct ancestors)
• Tips (living species)
• Branches (amount of evolution)
• Taxon (pl. taxa)

Page 5:

What is phylogenetic inference?

• Reconstruction of the evolutionary relationships among “taxa”
• Representation in a graphical form.

[Figure: tree of 13 taxa shown in three layouts: M Fin Whale, M Blue Whale, M Cow, M Rat, M Mouse, M Opossum, B Chicken, A Xenopus, F Rainbow Trout, F Loach, F Carp, L Lamprey, S Sea urchin; scale bar 0.05]

Page 6:

Parts of a tree

• Tree size: the number of taxa in the phylogeny.
• Interior branch: partitions an unrooted tree into 2 subtrees, each containing at least 2 taxa.
• Cluster size: the smaller of the two subtree sizes partitioned by an interior branch.
• Depth of a branch: defined in terms of the number of taxa clustered by it.

[Figure: rooted tree of F. Whale, B. Whale, Cow, Rat, Mouse, and Opossum, labeled with Root, Node, Internal branch, External branch, and Outgroup]
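The cluster-size definition can be sketched in a few lines of Python. This is only an illustration: the function name `cluster_size` and the idea of describing an interior branch by the set of taxa on one of its sides are assumptions, not something the slides specify.

```python
def cluster_size(side, all_taxa):
    """Return the smaller of the two subtree sizes split by an interior branch.

    `side` is the set of taxa on one side of the branch (a hypothetical
    representation chosen for this sketch); the other side is the rest.
    """
    side = set(side)
    all_taxa = set(all_taxa)
    if not side <= all_taxa:
        raise ValueError("side must be a subset of all_taxa")
    other = all_taxa - side
    if len(side) < 2 or len(other) < 2:
        raise ValueError("an interior branch has at least 2 taxa on each side")
    return min(len(side), len(other))


taxa = ["F. Whale", "B. Whale", "Cow", "Rat", "Mouse", "Opossum"]
print(cluster_size({"F. Whale", "B. Whale"}, taxa))         # 2
print(cluster_size({"F. Whale", "B. Whale", "Cow"}, taxa))  # 3
```

In a 6-taxon tree the cluster size of an interior branch can therefore only be 2 or 3.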

Page 7:

Example of a 6-sequence tree

[Figure: the six taxa F. Whale, B. Whale, Cow, Rat, Mouse, and Opossum drawn as a rooted tree and as an unrooted tree]

Page 8:

Phylogenetic analysis using DNA sequences

t0: ATGGCATACGTGCA

t1: ATGGTATAGGTGCA and ATGGCATACGTGAA

Page 9:

Gene Sequences

Homologous (orthologous) gene sequences

• D. melanogaster ATGTCGTTGACCAACAAGAACGTGATTTTCGTGGCCGGTCT...
• D. pseudoobscura ATGTCTCTCACCAACAAGAACGTCGTTTTCGTGGCCGGTCT...
• D. crassifemur ATGTTCATCGCTGGCAAGAACATCATCTTTGTCGCTGGTCT...
• D. mulleri ATGGCCATCGCTAACAAGAACATCATCTTCGTCGCTGGACT...

Distance Matrix

[         D.me   D.ps   D.cr   D.mu]
[D.me]
[D.ps]    0.14
[D.cr]    0.24   0.24
[D.mu]    0.21   0.20   0.21

[Figure: tree of D. melanogaster, D. pseudoobscura, D. mulleri, and D. crassifemur]
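How a pairwise distance matrix is obtained from aligned sequences can be sketched as follows. The slide does not say which distance measure produced its matrix, so the Jukes-Cantor correction used here is an assumption, and the function names are illustrative. The three short sequences are the ones shown on the DNA sequence evolution slide; the Drosophila sequences are truncated on the slide, so they are not used and the output is not expected to reproduce the matrix above.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def jc_distance(seq_a, seq_b):
    """Jukes-Cantor distance: d = -(3/4) ln(1 - 4p/3) (assumed model)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(sequences):
    """Lower-triangular matrix as a dict keyed by (later name, earlier name)."""
    names = list(sequences)
    return {(a, b): jc_distance(sequences[a], sequences[b])
            for i, a in enumerate(names) for b in names[:i]}


sequences = {
    "seq_1": "ATGGCATACGTGCA",
    "seq_2": "ATGGTATAGGTGCA",
    "seq_3": "ATGGCATACGTGAA",
}
for pair, d in distance_matrix(sequences).items():
    print(pair, round(d, 3))
```

The correction matters because the observed proportion of differences underestimates the number of substitutions once the same site has changed more than once.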

Page 10:

Expected or Species tree

[Figure: tree of F. Whale, B. Whale, Cow, Rat, Mouse, and Opossum]

Realized tree for gene X

[Figure: tree of the same six taxa as inferred from gene X]

Page 11:

Two-fold Challenge

• Today’s challenge is the flood of data, in two ways:

1. The increasing number of taxa (say, species) for which molecular data is available.

2. The increasing amount of molecular data that is available for each taxon.

Page 12:

Why is reconstructing the evolutionary history of a large number of taxa a challenge?

The number of possible trees increases enormously as the number of taxa increases

Page 13:

Number of rooted trees

• The number of bifurcating rooted trees is given by the following formula, where m is the number of taxa.

$$1 \cdot 3 \cdot 5 \cdots (2m-3) = \frac{(2m-3)!}{2^{m-2}\,(m-2)!}$$

Source: Nei and Kumar, 2000. Molecular Evolution and Phylogenetics
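As a quick check of how fast this count grows, the rooted-tree formula (2m − 3)! / [2^(m−2) (m − 2)!] can be evaluated directly. The function name `num_rooted_trees` is just an illustrative choice.

```python
import math


def num_rooted_trees(m):
    """Number of bifurcating rooted trees for m taxa: (2m-3)! / (2^(m-2) (m-2)!)."""
    if m < 2:
        raise ValueError("need at least 2 taxa")
    return math.factorial(2 * m - 3) // (2 ** (m - 2) * math.factorial(m - 2))


for m in (3, 4, 5, 10):
    print(m, num_rooted_trees(m))
# 3 3
# 4 15
# 5 105
# 10 34459425

# For larger m it is easier to read the count as a power of ten.
for m in (20, 50, 100):
    print(m, "about 10^%d" % (len(str(num_rooted_trees(m))) - 1))
```

Python integers have arbitrary precision, so the exact count is available even when it is far too large to enumerate.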

Page 14:

3 taxa

Source: Brian Golding, Reconstructing Phylogenies, http://helix.biology.mcmaster.ca/721/phylo/phylo.html

Page 15:

4 taxa

Source: Brian Golding, Reconstructing Phylogenies, http://helix.biology.mcmaster.ca/721/phylo/phylo.html

Page 16:

More taxa

Source: Brian Golding, Reconstructing Phylogenies, http://helix.biology.mcmaster.ca/721/phylo/phylo.html

Page 17:

So many trees!

[Figure: number of possible trees (log scale) plotted against number of sequences (0 to 400), with these reference values marked:]

• 10^79 atoms in the universe
• 10^37 atoms in the bodies of all humans by year 2035
• 5 × 10^30 prokaryotes living today
• 5 × 10^11 stars in the Milky Way

How many trees represent the true relationship?

Page 18:

Only ONE out of all possible trees is the true tree!

Page 19:

Which is the true tree?

Choose a criterion (optimality criterion).

Score the fit of the data to a given tree for that criterion

Tree with the optimal score is chosen as the best tree.

Optimal tree found in this way is expected to be closest to the true tree.

Page 20:

Optimality Criteria

Minimum Evolution (ME)

Branch lengths are computed for each tree using pair-wise distances obtained from sequences. The sum of branch lengths (S) is used as the optimality score.

[Diagram boxes: Data, Substitution Model, Distance Computer, Topology, Branch lengths Computer, Sum of branch lengths]

Tree with the smallest S-value is chosen.

Page 21:

The Neighbor-Joining method (Saitou and Nei, Mol. Biol. Evol. 4: 406-425, 1987)
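A compact sketch of neighbor-joining on a distance matrix is given below, to make the method concrete. It uses the common Q-criterion formulation of the algorithm; the function name, the nested-tuple tree representation, and the return values are choices made for this sketch, not the implementation used in the talk. The example input is the four-species Drosophila distance matrix from the Gene Sequences slide.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return (tree, lengths, last) for taxa `names` and pairwise distances `dist`.

    `dist` maps pairs (a, b) to distances. The tree is a nested tuple,
    `lengths` maps each joined node to the length of the branch above it,
    and `last` is the length of the branch joining the final two nodes.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    nodes = list(names)  # active nodes; tuples stand for internal nodes
    lengths = {}
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each active node from all the others.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Join the pair that minimizes Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        d_ij = d[frozenset((i, j))]
        lengths[i] = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        lengths[j] = d_ij - lengths[i]
        new = (i, j)
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k != i and k != j:
                d[frozenset((new, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij) / 2
        nodes = [k for k in nodes if k != i and k != j] + [new]
    a, b = nodes
    return (a, b), lengths, d[frozenset((a, b))]


drosophila = {
    ("D.me", "D.ps"): 0.14,
    ("D.me", "D.cr"): 0.24, ("D.ps", "D.cr"): 0.24,
    ("D.me", "D.mu"): 0.21, ("D.ps", "D.mu"): 0.20, ("D.cr", "D.mu"): 0.21,
}
tree, lengths, last = neighbor_joining(["D.me", "D.ps", "D.cr", "D.mu"], drosophila)
print(tree)
```

With four taxa several pairs tie on Q, so the nesting of the printed tuple depends on tie-breaking, but every outcome describes the same unrooted tree, with D.me and D.ps on one side and D.cr and D.mu on the other.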

Page 22:

Properties of the NJ method

• Computationally efficient
• Desirable statistical properties
• Accuracy
• Performance with large phylogenies?

Page 23:

Research Problem

Performance of NJ optimality criteria in inferring large trees

Performance worse with more sequences?

More difficult to infer deep branches as compared to the shallow ones?

Reconstruct branches at similar depths in large and small trees with same efficiency?

Page 24:

Model trees and their features

• 4 basic 6-taxa trees (topologies)
• Equal interior branch lengths
• Trees stacked to make larger trees (e.g., Dx = x trees of type D stacked)

[Figure: model trees with numbered branches, panels A to G]

Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Fig. 1)

Page 25:

Additional model topology - the rbcL tree

(From Hillis, Nature, 383:130-131, 1996)

Page 26:

Tree parameters

• Rate: Up to 10-fold differences in rate.

• Sequence Length: Up to 10 multiples of 100 sites.

• Tree size: Ax, Bx, Cx, Dx, where x varied from 1 to 10, 16, and 32

Page 27:

Simulating Evolutionary Change

• Starting point or “root” chosen.
• Random ancestral sequence generated for the root.
• Branch length randomly obtained from a Poisson distribution with mean = expected no. of substitutions (evolutionary rate × sequence length × multiplier).

[Figure: model tree with numbered branches]

Page 28:

Simulating Evolutionary Change (contd.)

• Equal probability of transition from one state to another.
• Process carried out for all branches.
• Resulting data are sequences for the taxa for that “gene”.
• These sequences are then used to infer the evolutionary relationships using NJ.
• 1000 replications (A to D trees; ≤ 60 taxa), 100 reps (> 60 taxa, rbcL tree).
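The simulation procedure described on these two slides can be sketched roughly as below. Everything specific here is an assumption made for illustration: the tree representation (a node is either a tip name or a list of `(child, multiplier)` pairs), the small example tree, the rate value, and the function names. The Poisson draw uses Knuth's simple method so that only the standard library is needed; it is adequate for the small means used here.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def poisson(mean, rng):
    """Draw a Poisson-distributed integer (Knuth's method, for small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def evolve(sequence, rate, multiplier, rng):
    """Evolve a sequence along one branch.

    The number of substitutions is Poisson with
    mean = rate x sequence length x branch multiplier, and each substitution
    changes a random site to any other nucleotide with equal probability.
    """
    seq = list(sequence)
    for _ in range(poisson(rate * len(seq) * multiplier, rng)):
        site = rng.randrange(len(seq))
        seq[site] = rng.choice([n for n in NUCLEOTIDES if n != seq[site]])
    return "".join(seq)


def simulate(node, sequence, rate, rng, tips=None):
    """Pass a sequence down the tree and collect the sequences at the tips."""
    if tips is None:
        tips = {}
    if isinstance(node, str):
        tips[node] = sequence
        return tips
    for child, multiplier in node:
        simulate(child, evolve(sequence, rate, multiplier, rng), rate, rng, tips)
    return tips


rng = random.Random(1)
root = "".join(rng.choice(NUCLEOTIDES) for _ in range(200))
# Hypothetical 3-taxon model tree: ((A, B), C) with branch multipliers.
model_tree = [([("A", 1), ("B", 1)], 2), ("C", 3)]
for name, seq in simulate(model_tree, root, 0.01, rng).items():
    print(name, seq[:30] + "...")
```

Running `simulate` repeatedly with the same model tree gives the replicate data sets that are then fed to the tree-building method.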

Page 29:

Accurate Inference of Complete Trees

[Figure: % recovery of complete trees (0 to 100) plotted against number of sequences (0 to 200) for 200, 500, and 1000 sites]

Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Table 1)

Page 30:

Effect of 0-length branches on NJ performance

[Figure: % branches correct plotted against sequence length (s), 0 to 1000; curves P0, PModel, and PRealized]

Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Fig. 3)

Page 31:

Reconstruction efficiency of 6 taxa monophyletic clusters

[Figure: % correct replicates (70 to 100) plotted against number of sequences (0 to 200) for 200, 500, and 1000 sites]

Page 32:

% branches inferred correctly

[Figure: percent efficiency (PBR) plotted against tree size (6 to 192) and rate (r = 0.00625, 0.03125, 0.0625) for 100, 200, 500, and 1000 sites]

Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Fig. 4)

Page 33:

Branch depth and NJ efficiency

[Figure: pB (70 to 100) plotted against branch depth (2 to 96) for tree sizes from 6 to 192]

Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Fig. 5B)

Page 34:

Shallow versus deep branches

Results from rbcL tree

[Figure: reconstruction efficiency (70 to 100) plotted against branch depth (2 to 74)]

Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Fig. 8B)

Page 35:

Branch depth and efficiency for different inference methods (JC simulations)

Rosenberg and Kumar, 2001, Mol. Biol. Evol.,18:1823-1827 (Fig. 1)

Page 36:

Branch depth and efficiency for different inference methods (HKY simulations)

Rosenberg and Kumar, 2001, Mol. Biol. Evol.,18:1823-1827 (Fig. 2)

Page 37:

The Challenge of Multi-Gene Sequences

• Multi-Gene/Whole Genome sequences increasingly available for many taxa.

• How best to obtain phylogenetic information from these multiple sequences?

Page 38:

Concatenation vs Consensus

Concatenation approach

[Diagram: gene sequences ATGCTGACTG and ATGTCGTCAGTC are joined into ATGCTGACTGATGTCGTCAGTC, and one tree of taxa A B C D E is inferred from the joined sequence]

Consensus approach

[Diagram: a separate tree of taxa A B C D E is inferred from each of ATGCTGACTG and ATGTCGTCAGTC, and the two trees are combined into one tree of A B C D E]
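The two strategies can be contrasted with a short sketch. The slide does not say which consensus rule is meant, so the majority rule used below is an assumption, as are the function names and the representation of each gene tree as a set of clusters (groups of taxa). The two gene sequences in the usage example are the ones shown on the slide.

```python
from collections import Counter


def concatenate(alignments):
    """Join per-gene alignments (dicts of taxon -> sequence) end to end."""
    taxa = set(alignments[0])
    if any(set(a) != taxa for a in alignments):
        raise ValueError("every gene must have the same taxa")
    return {t: "".join(a[t] for a in alignments) for t in sorted(taxa)}


def majority_consensus(trees):
    """Keep the clusters found in more than half of the gene trees.

    Each tree is a set of clusters, and each cluster is a frozenset of taxa.
    """
    counts = Counter(c for tree in trees for c in set(tree))
    return {c for c, n in counts.items() if n > len(trees) / 2}


# Concatenation: one long alignment, from which a single tree is inferred.
gene_1 = {"A": "ATGCTGACTG"}
gene_2 = {"A": "ATGTCGTCAGTC"}
print(concatenate([gene_1, gene_2])["A"])  # ATGCTGACTGATGTCGTCAGTC

# Consensus: one tree per gene, then keep what most gene trees agree on.
ab, abc, de = frozenset("AB"), frozenset("ABC"), frozenset("DE")
gene_trees = [{ab, abc}, {ab, de}, {ab, abc}]
print(sorted("".join(sorted(c)) for c in majority_consensus(gene_trees)))
# ['AB', 'ABC']
```

Concatenation pools the signal of all genes before any tree is built, whereas consensus builds a tree per gene first and only then combines them.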

Page 39:

The worst-case scenario approach

• The worst-case scenario is when all the available genes yield highly incorrect phylogenetic reconstructions.

• When faced with such sequences, which strategy to employ: consensus or concatenation?

Page 40:

Simulation with estimated parameters

• Model tree based on the phylogenetic relationships among 66 mammals from Murphy et al. (Nature 409:614-618, 2001).

Page 41:

Source: Fig. 1 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)

Page 42:

Simulation with estimated parameters

• Sequences for 448 genes downloaded from HOVERGEN (Duret et al., Nucleic Acids Res. 22: 2360-2365, 1994).

• Sequence parameters (length, L; substitution rate, r; transition-transversion rate ratio, κ; and G+C content) were estimated from the data.

Page 43:

Simulation with estimated parameters (contd.)

• For each of the 448 genes, 100 replicate sequences generated by computer simulation, using the estimated parameters and the HKY model of evolution.

Page 44:

Computer Simulation

[Diagram: grid of simulated data sets, with rows Gene1 to Gene448 and columns Rep1 to Rep100]

Page 45:

Simulation with estimated parameters (contd.)

• Phylogenetic inference was done on each of the 44,800 simulation replicates using NJ-JC and NJ-TN methods.

• The accuracy of each tree was recorded in terms of the number of incorrect branches when compared to the model tree.
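One way to count incorrect branches is to compare the interior branches (bipartitions of the taxa) of the inferred tree with those of the model tree. The sketch below assumes trees written as nested tuples of taxon names and treats them as unrooted; the function names are illustrative and this is not necessarily the exact procedure used in the study.

```python
def leaf_set(node):
    """All taxon names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def splits(tree):
    """Interior branches of a tree, each stored as the side of the
    bipartition that does not contain a fixed reference taxon."""
    all_taxa = leaf_set(tree)
    reference = min(all_taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        other = all_taxa - side
        if len(side) >= 2 and len(other) >= 2:
            found.add(other if reference in side else side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def incorrect_branches(inferred, model):
    """Number of interior branches in the inferred tree absent from the model tree."""
    return len(splits(inferred) - splits(model))


model = ((("F. Whale", "B. Whale"), "Cow"), (("Rat", "Mouse"), "Opossum"))
inferred = ((("F. Whale", "B. Whale"), "Rat"), (("Cow", "Mouse"), "Opossum"))
print(incorrect_branches(model, model))     # 0
print(incorrect_branches(inferred, model))  # 2
```

Storing each branch relative to a reference taxon makes the comparison independent of where the tuples happen to be rooted.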

Page 46:

Spread of sequence attributes

[Figure: four histograms of gene attributes: log sequence length, substitution rate per site (× 10^-9), log kappa, and G+C content]

Page 47:

Simulation lets us play God!

• In computer simulation, evolution is simulated based on a model tree, and replicate sequences are obtained.
• These replicate sequences are then used to infer back the true tree.
• Therefore, for the 100 simulation replicates for each of the 448 genes, we know the worst performing replicate.

Page 48:

Simulation lets us play God!

[Figure: tree of D. melanogaster, D. pseudoobscura, D. mulleri, and D. crassifemur, with a point labeled "Start"]

Page 49:

The Two-Gene Case

• Data: For each of NJ-JC and NJ-TN, we picked

* 10,000 pairs of worst replicates

* 10,000 pairs of randomly chosen replicates

Page 50:

Two-gene concatenation

Source: Table 1 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)

Page 51:

Comparison of the number of incorrect inferred branches (NJ-JC)

[Figure: two panels, "Worst replicate pairs" and "Random replicate pairs"; x axis: Gene 1 tree; both axes run from 0 to 50]

Page 52:

Effect of Gene Attributes

Page 53:

Effect of gene attributes (contd.)

[Figure: Log Length (Gene 2) plotted against Log Length (Gene 1); both axes run from 2.00 to 4.00]

Page 54:

Quality of second gene

Source: Fig. 2 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)

Worst case

Random case

Page 55:

Progressive addition of genes

Source: Fig. 3 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)

Page 56:

When all 448 genes were used

Source: Fig. 4 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)

Page 57:

Effect of neighboring branches

Source: Fig. 5 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)

Page 58:

Summary & Conclusions

• Heck of a lot of data available
• Two dimensions: number of species, and number of sequences per species
• Many methods available to infer phylogenies from a large number of species
• Neighbor-joining (NJ), a fast, distance-based algorithm, works well and infers trees correctly as long as there are no polytomies (multifurcations) in the true tree
• NJ also infers shallow and deep branches with good and equal efficiency

Page 59:

Summary & Conclusions (contd.)

• Multigene data available for many species
• How best to obtain phylogenetic info from these sequences (consensus or concatenation)?
• Our simulation results, with biologically realistic parameters and the worst-case approach, show that concatenation is better
• However, the concatenation approach appears excessively prone to certain systematic errors.

Page 60:

Acknowledgements

• Co-authors:
– Sudhir Kumar
– Michael Rosenberg

• Help:
– Roman Johnson
– Tushar Gadagkar
– Sankar Subramanian
– Balaji Ramanujam

Arizona State University

University of Dayton

Page 61:

To find out more about our graduate programs, please visit our Biology Department at:

http://biology.udayton.edu

Apply online for free at:

http://gradadmission.udayton.edu
