
Post on 10-Feb-2022


What is phylogenetic analysis and why should we perform it?

Phylogenetic analysis has two major components:

(1) Phylogeny inference or “tree building”: the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.)

(2) Analyzing change in traits (phenotypes, genes): using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest

Germline and somatic evolution included!

Uses of Phylogenetics in the Study of Health & Disease

(1) Evolutionary history of humans, between and within species

(2) Analysis of evolution of phenotypic and genetic traits in humans, especially human-specific traits - evolved when, where, why, how

(3) Evolution of parasites and pathogens, in relation to their hosts (us)

(4) Evolution of cancer cell lineages, and somatic evolution more generally.

(5) Study of adaptation in humans and other taxa

What you will learn in this lecture

(1) About phylogenies, terminology, what they are, how they work, ‘tree thinking’

(2) How to infer phylogenies

(3) How we can use phylogenies to answer questions related to human adaptation, health and disease

Common Phylogenetic Tree Terminology

Phylogenetic trees diagram the evolutionary relationships between the taxa

[Figure: a rooted tree of taxa A, B, C, D, and E with its parts labeled: the ancestral node or ROOT of the tree; internal nodes or divergence points (which represent hypothetical ancestors of the taxa); branches or lineages; and terminal nodes, which represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny.]

((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses
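The nested-parentheses notation can be read mechanically. The following is a minimal sketch, assuming tip names contain no parentheses or commas and that branch lengths are absent; the function names `parse_nested` and `tip_names` are illustrative and not part of the lecture.

```python
def parse_nested(text):
    """Parse a tree such as "((A,(B,C)),(D,E))" into nested tuples of tip names."""
    text = text.replace(" ", "").rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # step past "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # step past ","
                children.append(node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1                      # step past ")"
            return tuple(children)
        start = pos                       # otherwise read a tip name
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    tree = node()
    if pos != len(text):
        raise ValueError("unexpected trailing text")
    return tree


def tip_names(tree):
    """List the tips of a nested-tuple tree in the order they are written."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in tip_names(child)]


tree = parse_nested("((A,(B,C)),(D,E))")
print(tree)             # (('A', ('B', 'C')), ('D', 'E'))
print(tip_names(tree))  # ['A', 'B', 'C', 'D', 'E']
```

Each pair of parentheses corresponds to one internal node, so the innermost pair (B,C) is the most recent grouping written in the string.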

[Figure: the same tree drawn with the tips listed from top to bottom: Taxon A, Taxon B, Taxon C, Taxon E, Taxon D.]

No meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.

The dimension along the branches either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).

These say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.

Three types of trees

[Figure: the same four-taxon tree (Taxon A, Taxon B, Taxon C, Taxon D) drawn three ways. Cladogram: the branch-length axis has no meaning. Phylogram: the axis is genetic change, with branch lengths labeled 1, 1, 1, 6, 3, and 5. Ultrametric tree: the axis is time.]

All show the same evolutionary relationships, or branching orders, between the taxa.


A major goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees:

RESOLUTION AND SUPPORT for nodes

[Figure: three trees of taxa A, B, C, D, and E: a completely unresolved or "star" phylogeny, a partially resolved phylogeny, and a fully resolved, bifurcating phylogeny. A polytomy (or multifurcation) and a bifurcation are labeled.]

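There are three possible unrooted trees for four taxa (A, B, C, D)

[Figure: the three unrooted four-taxon trees. Tree 1 groups A with B and C with D; Tree 2 groups A with C and B with D; Tree 3 groups A with D and B with C.]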

Phylogenetic tree building (or inference) methods are aimed at discovering which of the possible unrooted trees is "correct". We would like this to be the “true” biological tree, that is, one that accurately represents the evolutionary history of the taxa. However, we must settle for discovering the computationally correct or optimal tree for the phylogenetic method of choice.

The number of unrooted trees increases in a greater than exponential manner with number of taxa

# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
. . . .       . . . .
30            ≈ 8.69 x 10^36

(2N - 5)!! = # unrooted trees for N taxa
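The double factorial (2N - 5)!! is the product of the odd numbers 1 x 3 x 5 x ... x (2N - 5). A minimal sketch of the count follows; the function name `unrooted_tree_count` is illustrative, and the formula is assumed to apply to fully resolved (bifurcating) trees with N of at least 3.

```python
def unrooted_tree_count(n_taxa):
    """Return (2N - 5)!!, the number of bifurcating unrooted trees for N taxa."""
    if n_taxa < 3:
        raise ValueError("the formula needs at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2N - 5 (the leading factor 1 is implicit).
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


for n in (3, 4, 5, 6, 7, 8, 9, 10, 30):
    print(n, unrooted_tree_count(n))
```

Python integers do not overflow, so the count for 30 taxa is returned exactly rather than as a rounded floating-point value.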

[Figure: example unrooted trees with three (A, B, C), four (A, B, C, D), five (A to E), and six (A to F) taxa.]

Inferring evolutionary relationships between the taxa requires rooting the tree:

To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root:

[Figure: an unrooted tree of taxa A, B, C, and D with a root position marked, and the rooted tree that results, drawn against a TIME axis.]

Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

Now, try it again with the root at another position:

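[Figure: the same unrooted tree of taxa A, B, C, and D with the root marked at a different position, and the rooted tree that results, drawn against a TIME axis.]

Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.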

An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees

[Figure: the unrooted tree 1 (A and B on one side of the internal branch, C and D on the other) with five candidate root positions numbered 1 to 5, and the corresponding rooted trees 1a, 1b, 1c, 1d, and 1e.]

These trees show five different evolutionary relationships among the taxa!
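The five rootings can be enumerated by placing a root on each branch in turn. The sketch below assumes the unrooted tree 1 groups A with B and C with D; the internal node names `u` and `v` and the function name `rootings` are hypothetical labels chosen for illustration, and the order of the output does not follow the 1a to 1e numbering of the slide.

```python
# Unrooted tree 1 as an adjacency list: A and B join at u, C and D join at v.
UNROOTED_TREE_1 = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["v"],
    "u": ["A", "B", "v"], "v": ["C", "D", "u"],
}


def rootings(adjacency):
    """Yield one rooted tree (as nested parentheses) per branch of an unrooted tree."""

    def subtree(node, parent):
        # Describe everything reachable from `node` without going back through `parent`.
        children = sorted(subtree(n, node) for n in adjacency[node] if n != parent)
        if not children:
            return node                    # a tip
        return "(" + ",".join(children) + ")"

    branches = {tuple(sorted((a, b))) for a in adjacency for b in adjacency[a]}
    for a, b in sorted(branches):
        # Placing the root on branch a-b splits the tree into two subtrees.
        yield "(" + ",".join(sorted([subtree(a, b), subtree(b, a)])) + ")"


rooted_trees = list(rootings(UNROOTED_TREE_1))
for tree in rooted_trees:
    print(tree)
```

A four-taxon unrooted tree has five branches (four leading to tips and one internal), which is why exactly five rooted trees come out.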


All of these rearrangements show the same evolutionary relationships between the taxa

[Figure: rooted tree 1a redrawn several ways, with the tips A, B, C, and D appearing in different top-to-bottom orders.]
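One way to check that two drawings show the same relationships is to compare the groups of taxa (clades) they contain, ignoring the order in which branches are written. This is a minimal sketch using the five-taxon tree from earlier in the lecture; the function name `clade_sets` and the example variables are illustrative assumptions.

```python
def clade_sets(tree):
    """Collect the set of tip names found under every internal node of a nested-tuple tree."""
    found = set()

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(tips(child) for child in node))
        found.add(below)                  # record this clade, regardless of child order
        return below

    tips(tree)
    return found


drawn_one_way = (("A", ("B", "C")), ("D", "E"))      # ((A,(B,C)),(D,E))
drawn_rotated = (("E", "D"), (("C", "B"), "A"))      # same tree, branches rotated
different_tree = ((("A", "B"), "C"), ("D", "E"))     # B now groups with A, not C

print(clade_sets(drawn_one_way) == clade_sets(drawn_rotated))   # True
print(clade_sets(drawn_one_way) == clade_sets(different_tree))  # False
```

Rotating branches around a node changes the written order of the tips but not the set of clades, so the first comparison succeeds and the second does not.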

Main way to root trees:

By outgroup: Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa.

[Figure: a rooted tree with the outgroup branch labeled.]

Molecular phylogenetic tree building methods: Are mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available today, each having strengths and weaknesses. Most can be classified as follows:

                         COMPUTATIONAL METHOD
DATA TYPE      Clustering algorithm      Optimality criterion
Characters                               PARSIMONY
                                         MAXIMUM LIKELIHOOD
Distances      UPGMA                     MINIMUM EVOLUTION
               NEIGHBOR-JOINING          LEAST SQUARES

Types of data used in phylogenetic inference:

Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.

Taxa         Characters
Species A    ATCGCTAGTCCTATAGTGCA
Species B    ATCGCTAGTCCTATATTGCA
Species C    TTCGCTAGACCTGTGGTCCA
Species D    TTGACCAGACCTGTGGTCCG
Species E    TTGACCAGTTCTGTGGTCCG
etc.         etc.
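Distance-based methods start instead from a matrix of pairwise distances computed from the same alignment. The sketch below uses the simplest such measure, the proportion of aligned sites that differ (often called the p-distance); the choice of this measure and the names `p_distance` and `distance_matrix` are assumptions made for illustration, not something the lecture specifies.

```python
from itertools import combinations

ALIGNMENT = {
    "Species A": "ATCGCTAGTCCTATAGTGCA",
    "Species B": "ATCGCTAGTCCTATATTGCA",
    "Species C": "TTCGCTAGACCTGTGGTCCA",
    "Species D": "TTGACCAGACCTGTGGTCCG",
    "Species E": "TTGACCAGTTCTGTGGTCCG",
}


def p_distance(seq1, seq2):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def distance_matrix(alignment):
    """Pairwise p-distances for every pair of taxa, keyed by (taxon, taxon)."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


matrix = distance_matrix(ALIGNMENT)
for pair, distance in matrix.items():
    print(pair, distance)
```

A raw proportion of differences ignores multiple substitutions at the same site, so in practice a corrected distance is often used instead.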

Similarity vs. Evolutionary Relationship:

Similarity and relationship are not the same thing, even though evolutionary relationship is inferred from certain types of similarity.

Similar: having likeness or resemblance (an observation)

Related: genetically connected (an historical fact)

Two taxa can be most similar without being most closely-related:

[Figure: phylogram of Taxon A, Taxon B (e.g., HUMANS!), Taxon C, and Taxon D, with branch lengths labeled 1, 1, 1, 6, 3, and 5.]

C is more similar in sequence to A (d = 3) than to B (d = 7), but C and B are most closely related (that is, C and B shared a common ancestor more recently than either did with A).
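The two distances can be reproduced by adding branch lengths along the path between two tips. In the sketch below, the lengths for A, B, C, and the branch joining B and C to A are the only assignment consistent with d = 3 and d = 7; the placement of the remaining lengths 3 and 5, the node names `n1`, `n2`, and `root`, and the function names are assumptions made for illustration.

```python
# child -> (parent, branch length)
PARENT = {
    "A": ("n1", 1),
    "B": ("n2", 6),      # long branch leading to B
    "C": ("n2", 1),
    "n2": ("n1", 1),     # ancestor of B and C joins A's lineage at n1
    "n1": ("root", 3),   # assumed placement of the length-3 branch
    "D": ("root", 5),    # assumed placement of the length-5 branch
}


def distances_to_ancestors(tip):
    """Map the tip and each of its ancestors to the summed branch length from the tip."""
    total, node, result = 0, tip, {tip: 0}
    while node in PARENT:
        node, length = PARENT[node]
        total += length
        result[node] = total
    return result


def tree_distance(x, y):
    """Sum of branch lengths on the path between two tips (via their most recent common ancestor)."""
    from_x, from_y = distances_to_ancestors(x), distances_to_ancestors(y)
    return min(from_x[n] + from_y[n] for n in from_x if n in from_y)


print(tree_distance("C", "A"))  # 3
print(tree_distance("C", "B"))  # 7
```

B and C still share the more recent ancestor (`n2`); the distance to B is larger only because B's own branch is long.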

Main computational approach:

Optimality approaches: Use either character or distance data. First define an optimality criterion (minimum branch lengths, fewest number of events, highest likelihood), and then use a specific algorithm for finding trees with the best value for the objective function. Can identify many equally optimal trees, if such exist.

Warning: Finding an optimal tree is not necessarily the same as finding the “true” tree. Random data will give you an ‘optimal’ (best) tree!


Parsimony methods:

Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest number of evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.

Advantages:
• Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
• Can be used on molecular and non-molecular (e.g., morphological) data.
• Can be used for character (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.

Disadvantages:
• Not explicitly statistical
• Can be fooled by high levels of parallel evolution

Use parsimony to infer the optimal (best) tree

Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.

Taxa         Characters
Species A    ATCG CTAGACCTATAGTGCA
Species B    ATCG CTAGACCTATATTGCA
Species C    TTCG CTAGACCTGTGGTCCA
Species D    TTGA CCAGACCTGTGGTCCG
Species E    TTGA CCAGTTGTGTGGTCCG
OUTGROUP     TTAC CCATTTGTGTCCTCCG

Infer maximum parsimony tree using first four characters
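Counting the changes a given tree requires can be sketched with Fitch's small-parsimony procedure, which is one standard way to score a tree and is not necessarily the pencil-and-paper approach intended for the exercise. The two candidate trees and all function names below are hypothetical choices for illustration; only the four-character columns come from the table above.

```python
FIRST_FOUR = {
    "A": "ATCG", "B": "ATCG", "C": "TTCG",
    "D": "TTGA", "E": "TTGA", "OUT": "TTAC",
}


def fitch_site(tree, states):
    """Return (possible ancestral states, number of changes) for one character on a binary tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, states)
    right_states, right_cost = fitch_site(right, states)
    shared = left_states & right_states
    if shared:
        return shared, left_cost + right_cost
    # No state in common: at least one extra change is needed at this node.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Total number of changes over all characters for the given tree."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


candidate_1 = (((("A", "B"), "C"), ("D", "E")), "OUT")
candidate_2 = (((("A", "D"), "B"), ("C", "E")), "OUT")

print(parsimony_length(candidate_1, FIRST_FOUR))  # 5
print(parsimony_length(candidate_2, FIRST_FOUR))  # 8
```

Under these assumptions the first candidate needs 5 changes and the second needs 8, so parsimony prefers the first. A full search would score every possible tree rather than two hand-picked ones, and several trees may tie for the minimum.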

Quality of trees (how likely it is that they reflect the one True Tree) can be evaluated in various ways (random data will give you a low-quality ‘best’ tree)

We can Statistically Compare alternative trees, corresponding to specific biological hypotheses of the history of some set of lineages

Time scales on trees: molecular clocks

[Figure: % genetic divergence (25% to 100%) plotted against time since divergence (300 to 1500 Myr), with separate curves for fibrinopeptides, hemoglobin, cytochrome c, and histone IV.]

Why such different profiles? Variation in mutation rate?

Variation in selection. Genes coding for some molecules under very strong stabilizing selection.

Dates for calibrating molecular clocks can come from geology, fossils, or historical data

From known ages of islands, for two genes

Calibrating using fossil data

[Figure: tree of chimps, humans, whales, and hippos. Labels: 56 mya (whale-hippo split), 60 substitutions, 6 substitutions.]
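The arithmetic behind a fossil calibration can be sketched as follows. The reading of the figure used here is an assumption: that 60 substitutions separate whales and hippos, whose split is dated to 56 mya, and that 6 substitutions separate chimps and humans in the same gene. The function name `divergence_time_myr` is illustrative, and a strict (constant-rate) clock is assumed.

```python
def divergence_time_myr(substitutions, calibration_substitutions, calibration_time_myr):
    """Estimate a divergence time by scaling a calibrated split under a constant rate."""
    rate = calibration_substitutions / calibration_time_myr  # substitutions per Myr
    return substitutions / rate


estimate = divergence_time_myr(
    substitutions=6,               # assumed: chimp vs. human
    calibration_substitutions=60,  # assumed: whale vs. hippo
    calibration_time_myr=56,       # fossil date of the whale-hippo split
)
print(round(estimate, 1))  # 5.6
```

Because only the ratio of substitution counts matters, the result is the same whether the counts are read as per-lineage or pairwise, as long as both are counted the same way.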


Calibrating from known dates of the ages of samples: for very fast-evolving taxa such as HIV

Uses of Phylogenetics in the Study of Health & Disease

(1) Evolutionary history of humans, between and within species

(2) Analysis of evolution of phenotypic and genetic traits in humans, especially human-specific traits - evolved when, where, why, how

(3) Taxonomy and evolution of parasites and pathogens, and evolution in relation to their hosts

(4) Evolution of cancer cell lineages, and somatic evolution more generally.

(5) Study of adaptation in humans and other taxa, via analysis of divergence and convergence

VIRUS - what IS it? Sequence its DNA and relate the sequence to known viruses

Evolution of SIV and HIV viruses: multiple transfers to humans, from chimps and from green monkeys

EMERGING VIRUSES - THE GREATEST KNOWN HEALTH THREAT TO HUMANITY

SARS (severe acute respiratory syndrome): what causes it and where did it come from?

HIV phylogeny within humans in different regions: Haiti as stepping stone to North America


HIV evolves very rapidly WITHIN hosts, as a result of interactions with the immune system

Can do phylogenetics:
- Pathogens within individuals
- Pathogens between individuals (e.g., in different or same regions)

How do they originate? From other species? How do they spread? How does resistance to antibiotics evolve in pathogens, and resistance to chemotherapeutic agents evolve in cancer?

Cancer evolves genetically in the body during carcinogenesis, allowing the inference of ‘oncogenetic trees’

Cytogenetic data: Gains and losses of chromosomal regions during evolution of cancers; lose tumor suppressor gene copies, gain oncogene copies

Involves losses of heterozygosity and losses of imprinting


Cancer Evolutionary Phylogenomics

Compare primary cancer with metastatic tumors

What you learned in this lecture

(1) About phylogenies, terminology, what they are, how they work, ‘tree thinking’

(2) How to infer and evaluate phylogenies

(3) How to use phylogenies to answer questions related to human adaptation, health and disease (viruses, cancer, etc.)

(4) How to THINK in terms of evolutionary trees (historical patterns of evolution), within and between species