cg7-trees



Page 1: CG7-trees

Based on lectures by C-B Stewart, and by Tal Pupko

Phylogenetic Analysis

Based on two talks, by

Caro-Beth Stewart, Ph.D.
Department of Biological Sciences
University at Albany, SUNY

and

Tal Pupko, Ph.D.
Faculty of Life Science
Tel-Aviv University

Page 2: CG7-trees

What is phylogenetic analysis and why should we perform it?

Phylogenetic analysis has two major components:

1. Phylogeny inference or “tree building”: the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.)

2. Character and rate analysis: using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest

Page 3: CG7-trees

Common Phylogenetic Tree Terminology

[Figure: a rooted tree with five taxa, A to E, labeled as follows.]

• Ancestral Node or ROOT of the Tree
• Internal Nodes or Divergence Points (represent hypothetical ancestors of the taxa)
• Branches or Lineages
• Terminal Nodes (A, B, C, D, E): represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny

Page 4: CG7-trees

Phylogenetic trees diagram the evolutionary relationships between the taxa

[Figure: a rooted tree whose tips are, from top to bottom, Taxon A, Taxon B, Taxon C, Taxon E and Taxon D.]

((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses

No meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.

This dimension (the axis along which the branches run) either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).

These say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
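The nested-parentheses notation can be read mechanically. As a minimal sketch (the helper names `parse_nested` and `clade_list` are hypothetical, and the parser assumes well-formed input with no spaces and no branch lengths), the following Python turns the string above into nested tuples and lists the clades it implies:

```python
def parse_nested(text):
    """Parse a tree such as '((A,(B,C)),(D,E))' into nested tuples of taxon names."""
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip '('
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                  # skip ','
                children.append(parse_node())
            pos += 1                      # skip ')'
            return tuple(children)
        start = pos                       # otherwise read a taxon name
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    return parse_node()


def clade_list(tree):
    """Return the taxa under every internal node, innermost clades first."""
    found = []

    def walk(node):
        if isinstance(node, str):
            return {node}
        leaves = set()
        for child in node:
            leaves |= walk(child)
        found.append("".join(sorted(leaves)))
        return leaves

    walk(tree)
    return found


tree = parse_nested("((A,(B,C)),(D,E))")
print(tree)
print(clade_list(tree))
```

The clade list contains BC, ABC and DE, matching the reading of the tree given in the text: B and C group together, A joins them, and D and E form the sister clade.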

Page 5: CG7-trees

A few examples of what can be inferred from phylogenetic trees built from DNA or protein sequence data:

• Which species are the closest living relatives of modern humans?

• Did the infamous Florida Dentist infect his patients with HIV?

• What were the origins of specific transposable elements?

• Plus countless others…

Page 6: CG7-trees

Which species are the closest living relatives of modern humans?

Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either is to gorillas.

The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.

[Figure: two trees of humans, bonobos, chimpanzees, gorillas and orangutans, each with a time axis in MYA. One axis runs from 15-30 to 0, the other from 14 to 0.]

Page 7: CG7-trees

Did the Florida Dentist infect his patients with HIV?

Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local HIV-infected People:

[Figure: a tree whose tips are two DENTIST sequences, sequences from Patients A to G, and several local controls (2, 3, 9 and 35). One group of patients is marked "Yes" and two other tips are marked "No".]

Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.

From Ou et al. (1992) and Page & Holmes (1998)

Page 8: CG7-trees

A few examples of what can be learned from character analysis using phylogenies as analytical frameworks:

• When did specific episodes of positive Darwinian selection occur during evolutionary history?

• Which genetic changes are unique to the human lineage?

• What was the most likely geographical location of the common ancestor of the African apes and humans?

• Plus countless others…

Page 9: CG7-trees

The number of unrooted trees increases in a greater than exponential manner with number of taxa

(2N - 5)!! = # unrooted trees for N taxa

[Figure: example unrooted trees for 3 taxa (A to C), 4 taxa (A to D), 5 taxa (A to E) and 6 taxa (A to F).]
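To get a feel for how fast (2N - 5)!! grows, the formula can be evaluated directly. This is a small sketch; the function names `double_factorial` and `unrooted_trees` are hypothetical and are not taken from the lectures.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ..., stopping at 1; taken as 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa >= 3, using (2N - 5)!!."""
    if n_taxa < 3:
        raise ValueError("the formula applies to 3 or more taxa")
    return double_factorial(2 * n_taxa - 5)


for n in (3, 4, 5, 6, 10, 20):
    print(n, unrooted_trees(n))
```

Three taxa give a single unrooted tree, four give 3, five give 15, six give 105, and ten already give 2,027,025.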

Page 10: CG7-trees

Inferring evolutionary relationships between the taxa requires rooting the tree:

To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root:

[Figure: an unrooted tree of taxa A, B, C and D with a root position marked, next to the rooted tree obtained by pulling at that root.]

Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

Page 11: CG7-trees

Now, try it again with the root at another position:

[Figure: the same unrooted tree of taxa A, B, C and D with the root marked at another position, next to the resulting rooted tree.]

Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

Page 12: CG7-trees

An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees

[Figure: the unrooted tree 1, with A and B on one side and C and D on the other, its five branches numbered 1 to 5, and the five rooted trees 1a, 1b, 1c, 1d and 1e obtained by placing the root on branch 1, 2, 3, 4 and 5 respectively.]

These trees show five different evolutionary relationships among the taxa!
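The five rootings can be enumerated by placing the root on each branch in turn. The sketch below assumes the unrooted tree joins A and B at one internal node and C and D at the other; the internal node names `u` and `v` and the function name `rootings` are hypothetical.

```python
# Unrooted four-taxon tree as an adjacency list; "u" and "v" are internal nodes.
UNROOTED = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["v"],
    "u": ["A", "B", "v"],
    "v": ["C", "D", "u"],
}


def rootings(adj):
    """Return one nested-parentheses string per branch the root can be placed on."""

    def subtree(node, parent):
        # Describe everything reachable from node without going back through parent.
        kids = [n for n in adj[node] if n != parent]
        if not kids:
            return node
        return "(" + ",".join(sorted(subtree(k, node) for k in kids)) + ")"

    # Each branch appears twice in the adjacency list; keep one copy of each.
    edges = sorted({tuple(sorted((a, b))) for a in adj for b in adj[a]})
    return ["(" + ",".join(sorted((subtree(a, b), subtree(b, a)))) + ")"
            for a, b in edges]


for rooted in rootings(UNROOTED):
    print(rooted)
```

A four-taxon unrooted tree has five branches, so the loop prints five distinct rooted trees, one of which is ((A,B),(C,D)).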

Page 13: CG7-trees

There are two major ways to root trees:

By outgroup: Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).

By midpoint or distance: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.

[Figure: a tree of taxa A, B, C and D with branch lengths 10, 2, 3, 5 and 2, with the outgroup marked.]

d (A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
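The midpoint calculation can be sketched in a few lines. The branch lengths for A (10), the internal branch (3) and D (5) come from the worked sum above; giving B and C a length of 2 each is an assumption about the figure, and the node names `u` and `v` and the function names are hypothetical.

```python
from itertools import combinations

# Branch lengths of an unrooted tree; "u" and "v" are internal nodes.
# B = 2 and C = 2 are assumed from the figure residue, not stated in the text.
BRANCHES = {("A", "u"): 10, ("B", "u"): 2, ("u", "v"): 3, ("C", "v"): 2, ("D", "v"): 5}


def find_path(adj, src, dst, seen=None):
    """Return the nodes on the unique path from src to dst in a tree."""
    seen = seen or {src}
    if src == dst:
        return [src]
    for nxt in adj[src]:
        if nxt not in seen:
            rest = find_path(adj, nxt, dst, seen | {nxt})
            if rest:
                return [src] + rest
    return None


def midpoint_root(branches):
    """Find the two most distant taxa and where the midpoint of their path falls."""
    adj = {}
    for (a, b), length in branches.items():
        adj.setdefault(a, {})[b] = length
        adj.setdefault(b, {})[a] = length
    taxa = sorted(n for n in adj if len(adj[n]) == 1)
    best = None
    for x, y in combinations(taxa, 2):
        nodes = find_path(adj, x, y)
        dist = sum(adj[p][q] for p, q in zip(nodes, nodes[1:]))
        if best is None or dist > best[0]:
            best = (dist, nodes)
    dist, nodes = best
    walked = 0
    for p, q in zip(nodes, nodes[1:]):
        if walked + adj[p][q] >= dist / 2:
            # Root lies on branch (p, q), this far from p.
            return nodes[0], nodes[-1], dist, (p, q), dist / 2 - walked
        walked += adj[p][q]


print(midpoint_root(BRANCHES))
```

With these lengths the most distant pair is A and D at distance 18, and the root falls on the branch leading to A, 9 units from A.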

Page 14: CG7-trees

Each unrooted tree theoretically can be rooted anywhere along any of its branches

(# unrooted trees) x (# branches per tree) = # rooted trees

(2N - 3)!! = # rooted trees for N taxa

[Figure: example unrooted trees for 4 taxa (A to D), 5 taxa (A to E) and 6 taxa (A to F).]
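As a short worked check of the rooted-tree count (a sketch that uses only the fact that an unrooted bifurcating tree on N taxa has 2N - 3 branches), multiplying the number of unrooted trees by the number of places a root can go gives:

```latex
\[
\underbrace{(2N-5)!!}_{\text{unrooted trees}} \times \underbrace{(2N-3)}_{\text{branches per tree}} = (2N-3)!!
\]
\[
N=4:\; 3 \times 5 = 15, \qquad
N=5:\; 15 \times 7 = 105, \qquad
N=6:\; 105 \times 9 = 945 .
\]
```

The N = 4 case agrees with the five rootings of each of the three unrooted four-taxon trees.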

Page 15: CG7-trees

Molecular phylogenetic tree building methods:

Are mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available today, each having strengths and weaknesses. Most can be classified as follows:

                 COMPUTATIONAL METHOD
DATA TYPE        Optimality criterion                 Clustering algorithm
Characters       PARSIMONY, MAXIMUM LIKELIHOOD        (none)
Distances        MINIMUM EVOLUTION, LEAST SQUARES     UPGMA, NEIGHBOR-JOINING

Page 16: CG7-trees

Types of data used in phylogenetic inference:

Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.

Taxa         Characters
Species A    ATGGCTATTCTTATAGTACG
Species B    ATCGCTAGTCTTATATTACA
Species C    TTCACTAGACCTGTGGTCCA
Species D    TTGACCAGACCTGTGGTCCG
Species E    TTGACCAGTTCTCTAGTTCG

Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.

             A      B      C      D      E
Species A    ----   0.20   0.50   0.45   0.40
Species B    0.23   ----   0.40   0.55   0.50
Species C    0.87   0.59   ----   0.15   0.40
Species D    0.73   1.12   0.17   ----   0.25
Species E    0.59   0.89   0.61   0.31   ----

Above the diagonal, Example 1: Uncorrected “p” distance (= observed percent sequence difference)

Below the diagonal, Example 2: Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
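Both kinds of distance can be recomputed from the alignment. The sketch below uses the standard Kimura 2-parameter formula, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions; the slides do not spell this formula out, and the function names are hypothetical.

```python
import math

ALIGNMENT = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}
PURINES = {"A", "G"}


def count_differences(seq1, seq2):
    """Count transitions (A<->G, C<->T) and transversions between two aligned sequences."""
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1    # purine <-> pyrimidine
    return transitions, transversions


def p_distance(seq1, seq2):
    """Uncorrected proportion of sites that differ."""
    ts, tv = count_differences(seq1, seq2)
    return (ts + tv) / len(seq1)


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter estimate of substitutions per site."""
    ts, tv = count_differences(seq1, seq2)
    p, q = ts / len(seq1), tv / len(seq1)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


for x, y in (("A", "B"), ("C", "D")):
    print(x, y, round(p_distance(ALIGNMENT[x], ALIGNMENT[y]), 2),
          round(k2p_distance(ALIGNMENT[x], ALIGNMENT[y]), 2))
```

For the pairs A/B and C/D this gives 0.20 and 0.15 for the p distance and 0.23 and 0.17 for the Kimura distance, the values shown in the matrix. The corrected distance is always at least as large as the uncorrected one because it accounts for multiple substitutions at the same site.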

Page 17: CG7-trees

Computational methods for finding optimal trees:

Exact algorithms: "Guarantee" to find the optimal or "best" tree for the method of choice. Two types used in tree building:

Exhaustive search: Evaluates all possible unrooted trees, choosing the one with the best score for the method.

Branch-and-bound search: Eliminates the parts of the search tree that only contain suboptimal solutions.

Heuristic algorithms: Approximate or “quick-and-dirty” methods that attempt to find the optimal tree for the method of choice, but cannot guarantee to do so. Heuristic searches often operate by “hill-climbing” methods.

Page 18: CG7-trees

Exact searches become increasingly difficult, and eventually impossible, as the number of taxa increases:

(2N - 5)!! = # unrooted trees for N taxa

[Figure: example unrooted trees for 3 taxa (A to C), 4 taxa (A to D), 5 taxa (A to E) and 6 taxa (A to F).]

Page 19: CG7-trees

Heuristic search algorithms are input order dependent and can get stuck in local minima or maxima

Rerunning heuristic searches using different input orders of taxa can help find global minima or maxima

[Figure: two score landscapes, each marked with its GLOBAL MAXIMUM and GLOBAL MINIMUM. In the search for the global minimum, the search can stop at a local minimum; in the search for the global maximum, it can stop at a local maximum.]
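The effect of restarts can be seen on a toy landscape. In this sketch the list `LANDSCAPE` is an invented one-dimensional set of scores, not tree scores, and different starting positions stand in for different input orders of taxa; the function names are hypothetical.

```python
LANDSCAPE = [5, 3, 4, 6, 2, 1, 7]   # invented scores; lower is better


def hill_climb(scores, start):
    """Greedy descent: step to the better neighbor until no neighbor improves."""
    position = start
    while True:
        neighbors = [j for j in (position - 1, position + 1) if 0 <= j < len(scores)]
        best = min(neighbors, key=lambda j: scores[j])
        if scores[best] >= scores[position]:
            return position          # no neighbor is better: a (possibly local) minimum
        position = best


def best_of_restarts(scores, starts):
    """Run the search from several starting points and keep the best end point."""
    return min((hill_climb(scores, s) for s in starts), key=lambda j: scores[j])


print(hill_climb(LANDSCAPE, 0))                   # stops at the local minimum
print(best_of_restarts(LANDSCAPE, [0, 3, 6]))     # restarts reach the global minimum
```

Starting at position 0 the search stops at position 1 (score 3), a local minimum, while the restarts from positions 3 and 6 both reach position 5 (score 1), the global minimum. Restarts improve the odds but still give no guarantee.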

Page 20: CG7-trees

Classification of phylogenetic inference methods

                 COMPUTATIONAL METHOD
DATA TYPE        Optimality criterion                 Clustering algorithm
Characters       PARSIMONY, MAXIMUM LIKELIHOOD        (none)
Distances        MINIMUM EVOLUTION, LEAST SQUARES     UPGMA, NEIGHBOR-JOINING

Page 21: CG7-trees

Parsimony methods:

Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest number of evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.

Advantages:
• Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
• Can be used on molecular and non-molecular (e.g., morphological) data.
• Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy).
• Can be used for character (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.

Disadvantages:
• Are simple, intuitive, and logical (derived from “Medieval logic”, not statistics!).
• Can be fooled by high levels of homoplasy (‘same’ events).
• Can become positively misleading in the “Felsenstein Zone”.

[See Stewart (1993) for a simple explanation of parsimony analysis, and Swofford et al. (1996) for a detailed explanation of various parsimony methods.]
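To make the optimality criterion concrete, the number of events a given tree requires can be counted with Fitch's algorithm. The slides do not name a counting method, so its use here is an assumption, and the toy sequences and function names are invented for illustration.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site on a rooted binary tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost            # children agree: no new change
    return left_set | right_set, left_cost + right_cost + 1   # disagreement costs one change


def parsimony_score(tree, seqs):
    """Sum the minimum number of changes over all aligned sites."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, {taxon: s[i] for taxon, s in seqs.items()})[1]
               for i in range(n_sites))


# Invented toy alignment: two sites group A with B, one site groups A with C.
SEQS = {"A": "AAA", "B": "AAG", "C": "GGA", "D": "GGG"}
CANDIDATES = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
for tree in CANDIDATES:
    print(tree, parsimony_score(tree, SEQS))
```

The three candidate trees need 4, 5 and 6 changes respectively, so ((A,B),(C,D)) is the most-parsimonious tree for this toy data. The site that groups A with C is an example of homoplasy on that tree.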

Page 22: CG7-trees

Branch and Bound

Tal Pupko, Tel-Aviv University

Page 23: CG7-trees

There are many trees…

We cannot go over all the trees, so we will try to find a way to find the best tree. There are approximate solutions… But what if we want to make sure we find the global optimum?

There is a way that is more efficient than just going over all possible trees. It is called BRANCH AND BOUND, and it is a general technique in computer science that can be applied to phylogeny.

Page 24: CG7-trees

BRANCH AND BOUND

To exemplify the BRANCH AND BOUND (BNB) method, we will use an example not connected to evolution. Later, when the general BNB method is understood, we will see how to apply this method to finding the MP tree. We will present the traveling salesperson path problem (TSP).

Page 25: CG7-trees

THE TSP PROBLEM

(especially adapted to Israel).

A guard has to visit n check-points whose locations on a map are known. The problem is to find the shortest path that goes through all points exactly once (no need to come back to the starting point).

Naïve approach (say, for 5 points): You have 5 starting points. For each such starting point you have 4 “next steps”. For each such combination of starting point and first step, you have 3 possible second steps, etc. Altogether we have 5*4*3*2*1 = 5! = 120 possible solutions.
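The naïve approach is short to write down. In this sketch the five check-point coordinates are invented (they lie on a line so the answer is easy to check by eye), and the function names are hypothetical.

```python
import math
from itertools import permutations

# Invented check-point coordinates, placed on a line for easy checking.
POINTS = [(0, 0), (3, 0), (1, 0), (4, 0), (2, 0)]


def path_length(points, order):
    """Total length of the path that visits the points in the given order."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))


def brute_force_shortest_path(points):
    """Score every permutation of the points and keep the shortest one."""
    orders = list(permutations(range(len(points))))
    best = min(orders, key=lambda order: path_length(points, order))
    return len(orders), best, path_length(points, best)


print(brute_force_shortest_path(POINTS))
```

All 5! = 120 orders are scored, and the shortest path simply walks along the line from one end to the other, with length 4.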

Page 26: CG7-trees

THE TSP TREE

[Figure: the search tree of partial paths.
Level 1 (starting point): 1 2 3 4 5
Level 2 (next point, grouped by starting point): 2 3 4 5 | 1 3 4 5 | 1 2 4 5 | 1 2 3 5 | 1 2 3 4
Deeper levels are drawn for only some of the branches.]

Page 27: CG7-trees

THE TSP NAÏVE APPROACH

Each solution can be represented as a permutation:

(1,2,3,4,5)
(1,2,3,5,4)
(1,2,4,3,5)
(1,2,4,5,3)
(1,2,5,3,4)
…

We can go over the list and find the one giving the best score (the shortest path).

Page 28: CG7-trees

THE TSP NAÏVE APPROACH

However, for 15 points, for example, there are 15! = 1,307,674,368,000 possible solutions.

The rate of increase of the number of solutions is too fast for this to be practical.

Page 29: CG7-trees

A TSP GREEDY HEURISTIC

Start from a random point. Go to the closest point. Go to its closest point, etc. This approach doesn’t work so well…

(but a reasonably close heuristic, based on simulated annealing, will be presented in a couple of lectures.)
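A small invented example shows how the greedy heuristic can go wrong. The four points below and the function names are hypothetical, and the start is fixed rather than random so that the result is reproducible.

```python
import math
from itertools import permutations

# Invented check-points on a line at x = 0, 4, 5 and 11.
POINTS = [(0, 0), (4, 0), (5, 0), (11, 0)]


def path_length(points, order):
    """Total length of the path that visits the points in the given order."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))


def greedy_path(points, start):
    """Always move to the closest point that has not been visited yet."""
    order = [start]
    remaining = set(range(len(points))) - {start}
    while remaining:
        here = points[order[-1]]
        nearest = min(remaining, key=lambda i: math.dist(here, points[i]))
        order.append(nearest)
        remaining.remove(nearest)
    return order


greedy = greedy_path(POINTS, start=1)
optimum = min(permutations(range(len(POINTS))), key=lambda o: path_length(POINTS, o))
print(greedy, path_length(POINTS, greedy))
print(optimum, path_length(POINTS, optimum))
```

Starting at x = 4, the greedy path goes to 5, back to 0, and then all the way to 11, for a length of 17, while the best path has length 11. Each step was locally the best choice, but the early choice forced a long detour later.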

Page 30: CG7-trees

BNB SOLUTION TO TSP

[Figure: the same search tree of partial paths as in THE TSP TREE, with two annotations.]

Shortest path found so far = 15

Score here already 16: no point in expanding the rest of the subtree
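The pruning rule can be sketched as a depth-first search that abandons any partial path already as long as the best complete path found so far. The coordinates and the function name `shortest_path_bnb` are invented for illustration; the pruning is safe because adding more steps can never shorten a path.

```python
import math

# Invented check-points on a line at x = 0, 4, 5 and 11.
POINTS = [(0, 0), (4, 0), (5, 0), (11, 0)]


def shortest_path_bnb(points):
    """Branch and bound over partial paths; returns (length, path, nodes visited)."""
    best = {"length": math.inf, "path": None, "visited": 0}

    def extend(path, length):
        best["visited"] += 1
        if length >= best["length"]:
            return                      # bound: cannot beat the record, skip the subtree
        if len(path) == len(points):
            best["length"], best["path"] = length, path   # new record
            return
        for i in range(len(points)):    # branch: try every unvisited point next
            if i not in path:
                step = math.dist(points[path[-1]], points[i])
                extend(path + [i], length + step)

    for start in range(len(points)):
        extend([start], 0.0)
    return best["length"], best["path"], best["visited"]


print(shortest_path_bnb(POINTS))
```

The full search tree for four points has 4 + 12 + 24 + 24 = 64 nodes. The search still returns the optimal path of length 11 but visits fewer nodes than that, because subtrees such as the one starting 0, 4, 11 are cut off as soon as their partial length reaches the record.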

Page 31: CG7-trees

Back to finding the MP tree

Finding the MP tree is NP-Hard (will see shortly)…

BNB helps, though it is still exponential…

Page 32: CG7-trees

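The MP search tree

[Figure: the search tree starts from the unrooted tree of taxa 1, 2 and 3. Taxon 4 is added to each of its three branches in turn (for example, “4 is added to branch 1”), giving three four-taxon trees. Each four-taxon tree is then extended with taxon 5 (for example, “5 is added to branch 2”).]

There are 5 branches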

Page 33: CG7-trees

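The MP search tree

[Figure: the search tree with the parsimony score of each tree. The first of the second-level trees is labeled “4 is added to branch 1.”]

Root: 30
Second level: 43, 55, 39
Third level: 52 54 52 53 58 | 61 56 59 61 69 | 53 51 42 47 47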

Page 34: CG7-trees

MP-BNB

[Figure: the MP search tree with its scores. Root 30; second level 43, 55 and 39; third level 52 54 52 53 58 | 61 56 59 61 69 | 53 51 42 47 47.]

Best (minimum) value = 52

Page 35: CG7-trees

MP-BNB

[Figure: the MP search tree with its scores. Root 30; second level 43, 55 and 39; third level 52 54 52 53 58 | 61 56 59 61 69 | 53 51 42 47 47.]

Best record = 52

Page 37: CG7-trees

MP-BNB

[Figure: the MP search tree with the leaves 61 56 59 61 69 no longer shown. Root 30; second level 43, 55 and 39; third level 52 54 52 53 58 | 53 51 42 47 47.]

Best record = 52

Page 39: CG7-trees

MP-BNB

[Figure: the MP search tree. Root 30; second level 43, 55 and 39; third level 52 54 52 53 58 | 53 51 42 47 47.]

Best record = 51 (previously 52)

Page 40: CG7-trees

MP-BNB

[Figure: the MP search tree. Root 30; second level 43, 55 and 39; third level 52 54 52 53 58 | 53 51 42 47 47.]

Best record = 42 (previously 52, then 51)

Page 43: CG7-trees

MP-BNB

[Figure: the MP search tree. Root 30; second level 43, 55 and 39; third level 52 54 52 53 58 | 53 51 42 47 47.]

Best tree: MP score = 42

Total # trees visited: 14

Page 44: CG7-trees

Order of Evaluation Matters

Evaluate all 3 first

[Figure: the MP search tree with only one subtree expanded. Root 30; second level 43, 55 and 39; third level 53 51 42 47 47.]

The bound after searching this subtree will be 42.

Total trees visited: 9
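Both visit counts (14 and 9) can be reproduced on the score tree from these slides. The sketch assumes, from the layout of the figure, that the leaves 52 54 52 53 58 belong under 43, the leaves 61 56 59 61 69 under 55, and the leaves 53 51 42 47 47 under 39, and that the first search takes the subtrees in the order 43, 55, 39. The function and variable names are hypothetical.

```python
import math


def leaf(score):
    return (score, [])


# (score, children); the grouping of leaves under parents is assumed from the figure.
SEARCH_TREE = (30, [
    (43, [leaf(s) for s in (52, 54, 52, 53, 58)]),
    (55, [leaf(s) for s in (61, 56, 59, 61, 69)]),
    (39, [leaf(s) for s in (53, 51, 42, 47, 47)]),
])


def branch_and_bound(root, evaluate_children_first):
    """Return (best score, number of trees scored) for one evaluation order."""
    state = {"record": math.inf, "visited": 1}   # the root tree is scored first

    def expand(node):
        score, children = node
        if not children:                          # complete tree: update the record
            state["record"] = min(state["record"], score)
            return
        if evaluate_children_first:
            state["visited"] += len(children)     # score all children up front
            children = sorted(children, key=lambda child: child[0])
        for child in children:
            if not evaluate_children_first:
                state["visited"] += 1             # score a child only when reached
            if child[0] < state["record"]:        # bound: skip subtrees that cannot win
                expand(child)

    expand(root)
    return state["record"], state["visited"]


print(branch_and_bound(SEARCH_TREE, evaluate_children_first=False))
print(branch_and_bound(SEARCH_TREE, evaluate_children_first=True))
```

In the first order the record goes 52, 51, 42 and 14 trees are scored. When all three four-taxon trees are scored first and the most promising one (39) is expanded first, the record reaches 42 immediately, the subtrees under 43 and 55 are never expanded, and only 9 trees are scored. The pruning relies on the parsimony score never decreasing when a taxon is added.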

Page 45: CG7-trees

And Now

Maximum Parsimony is Computationally Intractable

Felsenstein’s Dynamic Programming Algorithm for tiny maximum likelihood

and more, time permitting