Phylogenetic Analysis

Upload: bertram-hopkins

Post on 17-Dec-2015


Page 1: Phylogenetic Analysis. Motivation –The problem of explaining the evolutionary history of today's species –How do species relate to one another in terms

Phylogenetic Analysis

Page 2:

• Motivation
  – The problem of explaining the evolutionary history of today's species
  – How do species relate to one another in terms of common ancestors
  – Nucleic acids and proteins also evolve
• Approaches
  – Fossil records, phylogenetic trees

Page 3:

General comments on phylogenetics

• Phylogenetics is the branch of biology that deals with evolutionary relatedness
• Uses some measure of evolutionary relatedness: e.g., morphological features
• Phylogenetics on sequence data is an attempt to reconstruct the evolutionary history of those sequences
• Relationships between individual sequences are not necessarily the same as those between the organisms they are found in
• The ultimate goal is to be able to use sequence data from many sequences to give information about the phylogenetic history of organisms
• Phylogenetic relationships are usually depicted as trees, with branches representing ancestors of “children”; the tips at the bottom of the tree (individual organisms) are leaves. Individual branch points are nodes.

Page 4:

What is phylogenetic analysis and why should we perform it?

Phylogenetic analysis has two major components:

1. Phylogenetic inference or “tree building”: the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.)

2. Character and rate analysis: using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest

Page 5:

• Examine the process of evolution
  – What drives evolution?
  – Understanding mutation, gene flow and natural selection
• Examine the history of evolution
  – What has evolution done in the past?
  – Understanding how living organisms are related and how they have changed over time
• Aim
  – The ultimate goal is to be able to use sequence data from many sequences to give information about the phylogenetic history of organisms
  – To construct a visual representation (a tree) to describe the assumed evolution occurring between and among different groups (individuals, populations, species, etc.) and to study the reliability of the consensus tree
  – Phylogenetic relationships are usually depicted as trees, with branches representing ancestors of “children”; the tips at the bottom of the tree (individual organisms) are leaves. Individual branch points are nodes.

Page 6:

Common Phylogenetic Tree Terminology

[Figure: a rooted tree with terminal nodes A, B, C, D and E, labelled as follows.]

• Ancestral node or ROOT of the tree
• Internal nodes or divergence points (represent hypothetical ancestors of the taxa)
• Branches or lineages
• Terminal nodes: represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny

Page 7:

Parts of a Phylogenetic Tree

[Figure: a tree with the following parts labelled: node, root, outgroup, ingroup, branch.]

Page 8:

Phylogenetic trees diagram the evolutionary relationships between the taxa

[Figure: a rooted tree with tips Taxon A, Taxon B, Taxon C, Taxon E and Taxon D, listed from top to bottom.]

((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses

• There is no meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
• The other dimension (along the branches) either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).

Both representations say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
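The nested-parentheses notation can be read mechanically. The following is a minimal Python sketch, assuming taxon names are plain tokens with no whitespace, branch lengths or trailing semicolon; the function names `parse_nested`, `leaves` and `clades` are illustrative and do not come from the slides.

```python
def parse_nested(text):
    """Parse a tree such as '((A,(B,C)),(D,E))' into nested tuples."""
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip '('
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ','
                children.append(parse_node())
            pos += 1  # skip ')'
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]  # a taxon name

    return parse_node()


def leaves(node):
    """Return the set of taxon names below a node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def clades(node):
    """List the leaf set of every internal node (every clade), root first."""
    if isinstance(node, str):
        return []
    found = [leaves(node)]
    for child in node:
        found.extend(clades(child))
    return found


tree = parse_nested("((A,(B,C)),(D,E))")
for clade in clades(tree):
    print(sorted(clade))
```

Each printed set is one clade: B and C group together inside the A, B, C clade, and D and E form the sister clade.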

Page 9:

• In phylogenetic trees
  – Leaves represent present-day species
  – Interior nodes represent hypothesized ancestors
  – We will only consider binary trees: edges split only into two branches (daughter edges)
  – Rooted trees have an explicit ancestor; the direction of time is explicit in these trees
  – Unrooted trees do not have an explicit ancestor; the direction of time is undetermined in such trees

Page 10:

A few examples of what can be inferred from phylogenetic trees built from DNA or protein sequence data:

• Which species are the closest living relatives of modern humans?
• What were the origins of specific transposable elements?
• Plus countless others...

Page 11:

Input data for Phylogenetic Reconstruction

• Distance Matrix

• Character State Matrix

Page 12:

Types of phylogenetic analysis methods

• Phenetic: trees are constructed based on observed characteristics, not on evolutionary history (distance methods)
• Cladistic: trees are constructed based on fitting observed characteristics to some model of evolutionary history (parsimony and maximum likelihood methods)

Page 13:

Distance methods

• Distance methods start from a set of distances d_ij, one for each pair of sequences i, j in the dataset. d_ij can be the fraction f of sites u at which residues x_i and x_j differ, or that fraction corrected or weighted in some way (e.g. the Jukes-Cantor distance)
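As a rough illustration of the two kinds of distance mentioned above, the sketch below computes the raw fraction of differing sites and then applies the standard Jukes-Cantor correction for DNA, d = -(3/4) ln(1 - 4f/3). The sequences and function names are hypothetical, and the code assumes gap-free aligned sequences of equal length.

```python
import math


def p_distance(seq_i, seq_j):
    """Fraction f of aligned sites at which the two sequences differ."""
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    differing = sum(1 for x, y in zip(seq_i, seq_j) if x != y)
    return differing / len(seq_i)


def jukes_cantor(f):
    """Jukes-Cantor corrected distance for an observed DNA fraction f (needs f < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


# Hypothetical aligned sequences, chosen only for illustration.
x_i = "ATGGCTATTC"
x_j = "ATCGCTAGTC"
f = p_distance(x_i, x_j)
print(f, jukes_cantor(f))
```

The corrected value is always at least as large as f, because the correction estimates substitutions hidden by multiple changes at the same site.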

Page 14:

Parsimony methods

• Parsimony methods are based on the idea that the most probable evolutionary pathway is the one that requires the smallest number of changes from some ancestral state
• For sequences, this implies treating each position separately and finding the minimal number of substitutions at each position
• Parsimony methods assign a cost to each tree available to the dataset, then screen those trees and select the most parsimonious
• Screening all the trees available to even a smallish dataset would take too much time; the branch and bound method builds trees with increasing numbers of leaves but abandons a topology whenever the current partial tree already has a bigger cost than the best complete tree found so far
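The per-position cost described above can be computed with Fitch's small-parsimony algorithm. The slides do not name an algorithm, so treat this as one possible sketch: it assumes a rooted binary tree written as nested pairs and uses a hypothetical four-taxon alignment.

```python
def fitch_site(tree, states):
    """Return (candidate ancestral states, minimum changes) for one site.

    tree   : a leaf name (str) or a pair (left, right) of subtrees
    states : dict mapping each leaf name to its state at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    # No shared state: one extra substitution is needed below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_cost(tree, alignment):
    """Sum the per-site minimum changes over all aligned positions."""
    length = len(next(iter(alignment.values())))
    total = 0
    for site in range(length):
        column = {name: seq[site] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical four-taxon alignment, for illustration only.
alignment = {"A": "AAG", "B": "AAG", "C": "TAC", "D": "TGC"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_cost(tree_1, alignment), parsimony_cost(tree_2, alignment))
```

The tree with the lower total is the more parsimonious of the two; a full analysis would compare every candidate topology in the same way.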

Page 15:

Example of parsimonious tree building

• Tree on left requires only one change, tree on right requires two: left tree is most parsimonious

Page 16:

Character State Matrix

• A character has a finite number of states
• Taxonomical units for which we want to create a phylogeny are called Objects, e.g. species, populations
• Every object has a state vector; all objects share the same characters but not necessarily the same states

Page 17:

Character State Matrix M

• M has n rows (objects)
• M has m columns (characters)
• M_ij denotes the state object i has for character j
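A character state matrix of this shape can be held as a plain list of rows. The sketch below is illustrative only: the objects, characters and binary states are invented, and indices start at 0 as is usual in Python.

```python
# Hypothetical objects (rows) and characters (columns), for illustration only.
objects = ["species_1", "species_2", "species_3"]
characters = ["has_wings", "has_fur", "lays_eggs"]

# M[i][j] is the state that object i has for character j (binary states here).
M = [
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 1],
]

n = len(M)      # number of objects (rows)
m = len(M[0])   # number of characters (columns)


def state(i, j):
    """Return M_ij, the state of object i for character j."""
    return M[i][j]


def state_vector(i):
    """Return the full state vector (row) of object i."""
    return M[i]


print(n, m, state(2, 1), state_vector(0))
```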

Page 18:

Which species are the closest living relatives of modern humans?

Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either is to gorillas.

The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.

[Figure: two trees of humans, chimpanzees, bonobos, gorillas and orangutans, each drawn against a time axis in MYA. One axis runs from 0 to 15-30 MYA and the other from 0 to 14 MYA.]

Page 19:

A few examples of what can be learned from character analysis using phylogenies as analytical frameworks:

• When did specific episodes of positive Darwinian selection occur during evolutionary history?
• Which genetic changes are unique to the human lineage?
• What was the most likely geographical location of the common ancestor of the African apes and humans?
• Plus countless others...

Page 20:

The number of unrooted trees increases in a greater than exponential manner with number of taxa

(2N - 5)!! = # unrooted trees for N taxa

[Figure: example unrooted trees for three taxa (A to C), four taxa (A to D), five taxa (A to E) and six taxa (A to F).]
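To get a feel for how quickly (2N - 5)!! grows, the count can be tabulated directly. This is a small sketch; the function names are illustrative, and the double factorial here is the product of every second integer down to 1.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 for odd k; 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n_taxa):
    """Number of distinct unrooted binary trees for n_taxa >= 3 labelled taxa."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    return double_factorial(2 * n_taxa - 5)


for n in (3, 4, 5, 6, 10, 20):
    print(n, num_unrooted_trees(n))
```

Three taxa give a single tree, six give 105, and ten already give more than two million.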

Page 21:

Inferring evolutionary relationships between the taxa requires rooting the tree:

To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.

[Figure: an unrooted tree of taxa A, B, C and D with the root position marked, and the rooted tree that results.]

Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

Page 22:

Now, try it again with the root at another position:

[Figure: the same unrooted tree of taxa A, B, C and D with the root marked at a different position, and the rooted tree that results.]

Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

Page 23:

An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees

[Figure: the unrooted tree 1, with A and B on one side and C and D on the other, and root positions numbered 1 to 5; beside it, rooted trees 1a, 1b, 1c, 1d and 1e, one for each root position.]

These trees show five different evolutionary relationships among the taxa!
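The five rootings can be generated by placing the root on each branch of the unrooted tree in turn. The sketch below assumes the four-taxon tree with A and B on one side and C and D on the other; the internal node names `u` and `v` are hypothetical, and the order of the output is not meant to match the slide's labels 1a to 1e.

```python
# Unrooted four-taxon tree ((A,B),(C,D)) as an adjacency list.
# 'u' and 'v' are hypothetical names for the two internal nodes.
adjacency = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["v"],
    "u": ["A", "B", "v"],
    "v": ["C", "D", "u"],
}


def subtree(node, parent):
    """Nested-parentheses string for the part of the tree hanging below node."""
    children = [nbr for nbr in adjacency[node] if nbr != parent]
    if not children:
        return node  # a leaf
    return "(" + ",".join(sorted(subtree(c, node) for c in children)) + ")"


def root_on_branch(a, b):
    """Place the root on the branch a-b and return the rooted tree."""
    return "(" + ",".join(sorted([subtree(a, b), subtree(b, a)])) + ")"


branches = [("A", "u"), ("B", "u"), ("C", "v"), ("D", "v"), ("u", "v")]
rooted = [root_on_branch(a, b) for a, b in branches]
for tree in rooted:
    print(tree)
```

Rooting on the internal branch gives the balanced tree ((A,B),(C,D)); rooting on any of the four terminal branches makes that taxon the sister of all the others.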

Page 24:

There are two major ways to root trees:

By outgroup: Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).

By midpoint or distance: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.

[Figure: an unrooted tree of taxa A, B, C and D with an outgroup marked and branch lengths 10, 2, 3, 5 and 2.]

d(A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
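The midpoint calculation can be reproduced on a small weighted tree. In the sketch below the lengths 10, 3 and 5 are taken from the d(A,D) sum on the slide; giving B and C a length of 2 each, and the internal node names `u` and `v`, are assumptions made for illustration.

```python
# Branch lengths for the unrooted tree ((A,B),(C,D)). The values 10, 3 and 5 come
# from the slide's d(A,D) sum; assigning 2 to B and 2 to C is an assumption.
# 'u' and 'v' are hypothetical names for the two internal nodes.
branches = {("A", "u"): 10, ("B", "u"): 2, ("u", "v"): 3, ("C", "v"): 2, ("D", "v"): 5}
leaves = ["A", "B", "C", "D"]

neighbours = {}
for (a, b), length in branches.items():
    neighbours.setdefault(a, []).append((b, length))
    neighbours.setdefault(b, []).append((a, length))


def path(start, goal, parent=None):
    """Return the (node, next_node, length) steps from start to goal, or None."""
    if start == goal:
        return []
    for nxt, length in neighbours[start]:
        if nxt == parent:
            continue
        rest = path(nxt, goal, start)
        if rest is not None:
            return [(start, nxt, length)] + rest
    return None


def midpoint_root():
    """Find the two most distant taxa and the branch holding their midpoint."""
    pairs = [(x, y) for i, x in enumerate(leaves) for y in leaves[i + 1:]]
    far = max(pairs, key=lambda p: sum(step[2] for step in path(*p)))
    steps = path(*far)
    total = sum(step[2] for step in steps)
    walked = 0
    for a, b, length in steps:
        if walked + length >= total / 2:
            # Root lies on branch a-b, this far from node a.
            return far, total, (a, b), total / 2 - walked
        walked += length


print(midpoint_root())
```

With these lengths the most distant pair is A and D at distance 18, so the root falls 9 units from A, on A's own branch.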

Page 25:

Each unrooted tree theoretically can be rooted anywhere along any of its branches

[Figure: example unrooted trees for four taxa (A to D), five taxa (A to E) and six taxa (A to F).]

(# unrooted trees) x (# branches per tree) = # rooted trees

An unrooted binary tree for N taxa has 2N - 3 branches, so (2N - 5)!! x (2N - 3) = (2N - 3)!!

(2N - 3)!! = # rooted trees for N taxa

Page 26:

Molecular phylogenetic tree building methods:

Are mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available today, each having strengths and weaknesses. Most can be classified as follows:

                        COMPUTATIONAL METHOD
DATA TYPE     Optimality criterion      Clustering algorithm
Characters    PARSIMONY
              MAXIMUM LIKELIHOOD
Distances     MINIMUM EVOLUTION         UPGMA
              LEAST SQUARES             NEIGHBOR-JOINING

Page 27:

Types of data used in phylogenetic inference:

Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.

Taxa         Characters
Species A    ATGGCTATTCTTATAGTACG
Species B    ATCGCTAGTCTTATATTACA
Species C    TTCACTAGACCTGTGGTCCA
Species D    TTGACCAGACCTGTGGTCCG
Species E    TTGACCAGTTCTCTAGTTCG

Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.

             A       B       C       D       E
Species A    ----    0.20    0.50    0.45    0.40
Species B    0.23    ----    0.40    0.55    0.50
Species C    0.87    0.59    ----    0.15    0.40
Species D    0.73    1.12    0.17    ----    0.25
Species E    0.59    0.89    0.61    0.31    ----

Example 1 (above the diagonal): Uncorrected “p” distance (= observed percent sequence difference)

Example 2 (below the diagonal): Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
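Both kinds of distance can be recomputed from the five sequences on this slide. The sketch below uses the standard Kimura 2-parameter formula, d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the fractions of transition and transversion differences. The formula and the function names are not given in the slides, so treat them as assumptions.

```python
import math

# The five aligned sequences shown on the slide.
alignment = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}
PURINES = {"A", "G"}


def site_fractions(seq_1, seq_2):
    """Return (P, Q): the fractions of transition and transversion differences."""
    transitions = transversions = 0
    for x, y in zip(seq_1, seq_2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    return transitions / len(seq_1), transversions / len(seq_1)


def p_distance(seq_1, seq_2):
    """Uncorrected 'p' distance: observed fraction of differing sites."""
    p, q = site_fractions(seq_1, seq_2)
    return p + q


def k2p_distance(seq_1, seq_2):
    """Kimura 2-parameter distance computed from P and Q."""
    p, q = site_fractions(seq_1, seq_2)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


names = sorted(alignment)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(p_distance(alignment[a], alignment[b]), 2),
              round(k2p_distance(alignment[a], alignment[b]), 2))
```

For the A and B pair this gives 0.20 and 0.23, as in the matrix. Working the B and E pair by hand gives a Kimura value of about 0.87 rather than the 0.89 shown, so not every entry should be expected to match to the last digit.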

Page 28:

Computational methods for finding optimal trees:

Exact algorithms: "Guarantee" to find the optimal or "best" tree for the method of choice. Two types used in tree building:

• Exhaustive search: Evaluates all possible unrooted trees, choosing the one with the best score for the method.
• Branch-and-bound search: Eliminates the parts of the search tree that only contain suboptimal solutions.

Heuristic algorithms: Approximate or “quick-and-dirty” methods that attempt to find the optimal tree for the method of choice, but cannot guarantee to do so. Heuristic searches often operate by “hill-climbing” methods.
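An exhaustive search needs a way to list every unrooted tree. One common approach, sketched here, adds taxa one at a time to every branch of each partial tree. The function names are illustrative, and `score` stands for any hypothetical scoring function (for example a parsimony cost) where lower is better.

```python
def attach_everywhere(tree, leaf):
    """Yield each tree made by attaching leaf to one branch of tree."""
    yield (tree, leaf)  # attach on the branch leading to this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in attach_everywhere(left, leaf):
            yield (new_left, right)
        for new_right in attach_everywhere(right, leaf):
            yield (left, new_right)


def all_unrooted_trees(taxa):
    """Enumerate every unrooted binary tree, written as rooted at the first taxon."""
    first, rest = taxa[0], taxa[1:]
    partial = [rest[0]]
    for leaf in rest[1:]:
        partial = [t for tree in partial for t in attach_everywhere(tree, leaf)]
    return [(first, tree) for tree in partial]


def exhaustive_search(taxa, score):
    """Score every tree and return one with the lowest score."""
    return min(all_unrooted_trees(taxa), key=score)


trees = all_unrooted_trees(["A", "B", "C", "D", "E"])
print(len(trees))
```

The number of trees produced follows (2N - 5)!!, which is why this approach is only practical for small numbers of taxa.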

Page 29:

Exact searches become increasingly difficult, and eventually impossible, as the number of taxa increases:

(2N - 5)!! = # unrooted trees for N taxa

[Figure: example unrooted trees for three, four, five and six taxa.]

Page 30:

Classification of phylogenetic inference methods

                        COMPUTATIONAL METHOD
DATA TYPE     Optimality criterion      Clustering algorithm
Characters    PARSIMONY
              MAXIMUM LIKELIHOOD
Distances     MINIMUM EVOLUTION         UPGMA
              LEAST SQUARES             NEIGHBOR-JOINING

Page 31:

Parsimony methods:

Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.

Advantages:
• Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
• Can be used on molecular and non-molecular (e.g., morphological) data.
• Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy).
• Can be used for character (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.

Disadvantages:
• Are simple, intuitive, and logical (derived from “Medieval logic”, not statistics!).
• Can be fooled by high levels of homoplasy (‘same’ events).
• Can become positively misleading in the “Felsenstein Zone”.

[See Stewart (1993) for a simple explanation of parsimony analysis, and Swofford et al. (1996) for a detailed explanation of various parsimony methods.]

Page 32:

Bootstrapping

• Evaluation of the tree reliability
• n trees are built (n = 100/1000/5000)
• How many times a certain branch is reproduced
• Values between 1-100 (%)
• If the assumptions the method is based on hold, you should always get the same tree from the bootstrapped alignments as you did originally
• The frequency of some feature of your phylogeny in the bootstrapped set gives some measure of the confidence you can have for this feature
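The resampling step behind these percentages can be sketched as follows. Each bootstrapped alignment is made by drawing columns of the original alignment with replacement. A real analysis would rebuild a tree from every replicate; as a simplified stand-in, this sketch only checks whether taxon A's nearest neighbour by p-distance is B. The four sequences are taken from the character-data slide (Page 27), and all function names are illustrative.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def p_distance(seq_1, seq_2):
    """Fraction of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(seq_1, seq_2)) / len(seq_1)


def nearest_neighbour(alignment, taxon):
    """Name of the taxon with the smallest p-distance to the given taxon."""
    others = [name for name in alignment if name != taxon]
    return min(others, key=lambda name: p_distance(alignment[taxon], alignment[name]))


def bootstrap_support(alignment, feature, n_replicates=1000, seed=0):
    """Percentage of bootstrap replicates in which feature(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(feature(bootstrap_alignment(alignment, rng)) for _ in range(n_replicates))
    return 100.0 * hits / n_replicates


alignment = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
}
support = bootstrap_support(alignment, lambda aln: nearest_neighbour(aln, "A") == "B")
print(f"A groups with B in {support:.1f}% of replicates")
```

The fixed seed only makes the run repeatable; the feature being tested can be replaced by any yes/no question about a tree rebuilt from the replicate.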

Page 33:

Parsimony methods

• Parsimony methods are based on the idea that the most probable evolutionary pathway is the one that requires the smallest number of changes from some ancestral state
• For sequences, this implies treating each position separately and finding the minimal number of substitutions at each position

Page 34:

Example of parsimonious tree building

• Tree on left requires only one change, tree on right requires two: left tree is most parsimonious

Page 35:

• Parsimony methods assign a cost to each tree available to the dataset, then screen those trees and select the most parsimonious
• Screening all the trees available to even a smallish dataset would take too much time; the branch and bound method builds trees with increasing numbers of leaves but abandons a topology whenever the current partial tree already has a bigger cost than the best complete tree found so far

Page 36:

Phylogeny in medical forensics: HIV

• A dentist who was infected with HIV was suspected of infecting some of his patients in the course of treatment
• HIV evolves very quickly (10^-3 substitutions/year)
• Possible to trace the history of infections among individuals by conducting a phylogenetic analysis of HIV sequences
• Samples were taken from dentist, patients, and other infected individuals in the community
• Study found 5 patients had been infected by the dentist

Source: Ou et al. 1992. Molecular epidemiology of HIV transmission in a dental practice. Science, 256: 1165-1171.

Page 37:

Did the Florida Dentist infect his patients with HIV?

[Figure: Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local HIV-infected People. Tips include DENTIST (twice), Patients A to G, and Local controls 2, 3, 9 and 35. One group of patients is annotated “Yes”; two others are annotated “No”.]

Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.

From Ou et al. (1992) and Page & Holmes (1998)