mar 2004 irit orr phylogenetic analysis introduction to · phylogenetic analysis irit orr mar 2004...

Introduction to Phylogenetic Analysis

Irit OrrMar 2004

Subjects of this lecture

1 Introducing some of the terminology ofphylogenetics.

2 Introducing some of the most commonlyused methods for phylogenetic analysis.

3 Explain how to construct phylogenetictrees.

Taxonomy - is the

science of classification

of organisms.

Phylogeny –

is the evolution

of a genetically

related group

of organisms.

Or: A study of relationships between collection of "things" (genes, proteins, organs..) that are derived from a common ancestor

Phylogenetics - WHY?

Find evolutionary ties between organisms.(Analyze changes that occured in different

organisms during evolution).

Find (understand) relationships between anancestral sequence and it descendants.

(Evolution of family of sequences)

Estimate time of divergence between a group oforganisms that share a common ancestor.

Basic tree Of Life

Eukaryote tree

From a common ancestor sequence, two

DNA sequences are diverged.

Each of these two sequences start toaccumulate nucleotide substitutions.

The number of these mutations are used inmolecular evolution analysis.

Similar sequences, common ancestor...

... common ancestor, similar function

One of the most striking features of life is allliving organisms share “highly conservedregions” in proteins, particularly in proteins thatare involved in information processing(transcription and translation).All of course share the same genetic codes.This information lead us to accept the theory thatall organisms known to us have evolved from acommon ancestor.(Patrick Forterre coined this ancestor as LUCA(Last Universal Common Ancestor)

Relationships of PhylogeneticAnalysis and Sequences Analysis

When 2 sequences found in 2 organisms are

very similar, we assume that they have derived

from one ancestor.

AAGAATC AAGAGTT

AAGA(A/G)T(C/ T)

The sequences alignment reveal which positions are conserved from the ancestor sequence.

How we calculate theDegree of Divergence

If two sequences of length N differ fromeach other at n sites, then their degree ofdivergence is:

n/N or n/N*100%.


Most phylogenetic methods assume that eachposition in a sequence can changeindependently from the other positions.

Gaps in alignments represent mutations insequences such as: insertion, deletion, geneticrearrangments.

Gaps are treated in various ways by thephylogenetic methods. Most of them ignoregaps.


Another approach to treat gaps is by usingsequences similarity scores as the base for thephylogenetic analysis, instead of using thealignment itself, and instead of trying to decidewhat happened at each position of the sequence.

Alignment similarity scores are based on scoringmatrices (with gaps scores), and are used by theDISTANCES methods.

What is a phylogenetic tree?

An illustration of the evolutionaryrelationships among a group of organisms.

Dendrogram is another name for aphylogenetic tree.

A tree is composed of nodes and branches.One branch connects any two adjacentnodes. Nodes represent the taxonomicunits. (sequences)

What is a phylogenetic tree?

E.G: 2 very similar sequences will beneighbors on the outer branches and will beconnected by a common internal branch.

Trees

Types of Trees

Networks

Only one path between any pair of nodes

More than one path between any pair of nodes

seqA

seqB

seqC

seqD

Tips (Leaves) = Represent the taxa (sequences)1

2

3Nodes = 1 2 3Represent a speciation event.Nodes (e.g 1) represent the ancestor seq from which seqA and seqB divrged.

*

*

*

Branches *Connect Tips and Nodes on the tree. The length of thebranch can represent the # of changes that occurred inthe seqs prior to the next level of separation.

Phylogenetic Tree

In a phylogenetic tree...

Each NODE represents a speciation event inevolution. Beyond this point any sequencechanges that occurred in each new specie, arespecific for it (specie).The BRANCH connects NODES or TIPS on thetree. The length of each BRANCH between oneNODE to the next, represents the # of changesthat occurred until the next speciation(separation) event.After a branching event, one taxon (sequence)can undergo more mutations then the othertaxon.

In a phylogenetic tree...

NOTE: The amount of evolutionary time thatpassed from a diversion of 2 sequences fromtheir common ancestor is not known.

The phylogenetic analysis can only estimate thetime, based on the # of changes that occurredfrom the time of separation.

Topology of a tree is the branching pattern of atree.

Tree structure

♠Terminal nodes - represent the data (e.gsequences) under comparison(A,B,C,D,E), also known as OTUs,(Operational Taxonomic Units).

♠ Internal nodes - represent inferredancestral units (usually without empiricaldata), also known as HTUs, (HypotheticalTaxonomic Units).

Slide taken from Dr. Itai Yanai

Different kinds of trees can be used to depict different aspects of evolutionary history

1. Cladogram: simply shows relative recency of common ancestry

2. Additive trees: a cladogram with branch lengths, also called phylograms and metric trees

3. Ultrametric trees: (dendograms) special kind of additive tree in which the tips of the trees are all equidistant from the root

5

4

3

1

3

7

32

1

1 1 1 12

31

1

13

The Molecular Clock Hypothesis

All the mutations occur in the same rate inall the taxa of a tree.

The rate of the mutations is the same forall positions along the sequence.

The Molecular Clock Hypothesis is mostsuitable for closely related species.

Rooted Tree = Cladogram

A phylogenetic tree thatall the "objects" on itshare a known commonancestor (the root).There exists a particularroot node.A root is a taxon (seq)that branched earlier ofall the other taxa on thetree, but is related tothem.

A

B

C

Root

The paths from the root to the nodes correspond to evolutionary time

A phylogenetic tree where allthe "objects" on it are relateddescendants - but there isnot enough information tospecify the commonancestor (root).

The path between nodes of thetree do not specify anevolutionary time.

B

Unrooted Tree = PhenogramRooted versus Unrooted

The number of tree topologies of rootedtree is much higher than that of theunrooted tree for the same number ofOTUs (sequences).

Therefore, the error of an unrooted treetopology is smaller than that of a rootedtree.

Possible Number of

Rooted trees Unrootedtrees

2 1 1

3 3 1

4 1 5 3

5 1 0 5 1 5

6 9 4 5 1 0 5

7 10395 9 4 5

8 135135 10395

9 2027025 135135

1 0 34459425 2027025

The number of rooted and unrooted trees:

Number of OTU’s

OTU – Operational Taxonomical Unit

Slide taken from Dr. Itai Yanai

Orthologs - genes related by speciationevents. Meaning same genes in differentspecies.

Paralogs - genes related by duplicationevents. Meaning duplicated genes in thesame species.

Selecting sequences for phylogeneticanalysis

What type of sequences are used for thephylogenetic analysis? Protein or DNA?

For DNA sequences, the rate of mutation isassumed to be the same in both codingand non-coding regions.

However, there is a difference in thesubstitution rate between coding and non-coding regions.Non-coding DNA regions have more substitutionthan coding regions.


For Proteins, the rate of mutation is very low inthe conserved region ,or “functional regions” aswe relate to them.

Most evolution algorithm can better analyzeregions that mutate slowly, small number ofchanges in the multiple alignments. Regions that have “high number of changes”need special algorithm to deal with themsuccessfully.

Known Problems of Multiple Alignments

The base for a phylogenetic analysis is MultipleAlignments of the sequences.

SO…Good analysis is based on good alignments.

Check the alignment to see that Important Sitesare not misaligned by the software used for thesequence alignment.

Misalignment can effect the significance of thesite - and the tree.

For example: ATG as start codon, or specific aminoacids in functional domains.


Gaps in multiple alignment are usually notscored, (or plainly ignored) by mostprograms.

Gaps are not scored since there is nosuitable model of evolution mechanism thatproduces them.

T Y R R S R ACA TAC AGG CGA T Y R RT Y R R S R ACA TAC AGG CGA T Y R RT Y R - S R ACA TAC AGG --- T Y R -T Y R - S R ACA TAC --- CGA T Y - RT Y R R S R ACA TAC AGG CGA T Y R R


Alignment of Proteins that contains gaps, should be compared with the alignment of their DNA coding regions..The reason is to to be sure about the placement of gaps,since the degeneracy of the cod.


♥Low complexity regions - effect the multiplealignment because they create random biasfor various regions of the alignment.

♥Low complexity regions should be removedfrom the alignment before building the tree.

If you delete these regions you need toconsider the affect of the deletions on thebranch lengths of the whole tree.


‡ Sequences that are being compared belongtogether (orthologs).

‡ If no ancestral sequence is available you mayuse an "outgroup" as a reference to measuredistances. In such a case, for an outgroup youneed to choose a close relative to the groupbeing compared.‡ For example: if the group is of mammalian

sequences then the outgroup should be a sequencefrom birds and not plants.

How to choose a phylogenetic method?

Choose set of related seqs(DNA or Proteins

Obtain MultipleAlignment

Is there a strong similarity?

Strong similarityMaximum Parsimony

Distant (weak) similarityDistance methods

Very weak similarityMaximum Likelihood

Check validity of the results

Taken from Dr.Itai Yanai

A - GCTTGTCCGTTACGATB – ACTTGTCTGTTACGATC – ACTTGTCCGAAACGATD - ACTTGACCGTTTCCTTE – AGATGACCGTTTCGATF - ACTACACCCTTATGAG

Given a multiple alignment, how do we construct the tree?

?

Building Phylogenetic Trees

Main methods:

Distances matrix methods Neighbour Joining, UPGMA

Character based methods: Parsimony methods

Maximum Likelihood method

Validation method:Bootstrapping

Jack Knife

Character Based Methods

All Character Based Methods assume thateach character substitution is independentof its neighbors.

In most of the programs based onCharacter Methods, the program will yield(build) one tree with the fewest changesrequired to explain (tree) the differencesobserved in the data.


For example the Maximum Parsimony(minimum evolution) program will find thetree where any changes can occur in anyof the sequences (taxa) on the tree, thatcan turn them into any of the othersequences on the tree.


Q: How do you find the minimum # of changesneeded to explain the data in a given tree?

A: The answer will be to construct a set of possibleways to get from one set to the other, and choosethe "best". (for example: Maximum Parsimony)

CCGCCACGA P P RCGGCCACGA R P R

Character Based Methods - MaximumParsimony

∞Not all the sites are informative for theparsimony method.

∞ Informative site, is a site that has at least 2characters, each appearing at least in 2 ofthe sequences of the dataset.

∞With the parsimony method, onlyinformative sites need to be analyzed.

123456789012345678901 Mouse CTTCGTTGGATCAGTTTGATA Rat CCTCGTTGGATCATTTTGATADog CTGCTTTGGATCAGTTTGAACHuman CCGCCTTGGATCAGTTTGAAC------------------------------------Invariant * * ******** *****Variant ** * * **------------------------------------Informative ** ** Non-inform. * *

Start by classifying the sites:

Maximum Parsimony

Taken fromDr. Itai Yanai

123456789012345678901 Mouse CTTCGTTGGATCAGTTTGATA Rat CCTCGTTGGATCATTTTGATADog CTGCTTTGGATCAGTTTGAACHuman CCGCCTTGGATCAGTTTGAAC

** *

Mouse

Rat

Dog

Human

Mouse

Rat

Dog

Human

Mouse Rat

Dog Human

Mouse

Rat

Dog

Human

Mouse

Rat

Dog

Human

Mouse Rat

Dog HumanMouse

Rat

Dog

Human

Mouse

Rat

Dog

Human

Mouse Rat

Dog Human

Site 5:G

G

T

C

T

T

T

C

T

C

G

G

T

G

T

C

G

T

G

C

T

C

T

G

G

T

T

T

T

G

G

C

C

C

T

G

GG

CC

GG

GG

CT

GG

TG

CC

GT

Site 2:

Site 3:


123456789012345678901 Mouse CTTCGTTGGATCAGTTTGATA Rat CCTCGTTGGATCATTTTGATADog CTGCTTTGGATCAGTTTGAACHuman CCGCCTTGGATCAGTTTGAACInformative ** **

Mouse

Rat

Dog

Human

Mouse

Rat

Dog

Human

Mouse Rat

Dog Human

3 0 1

Maximum Parsimony



The possible optimal tree is built byadding the number of changes at eachinformative site for each tree.

The tree that requires the least number

of changes is chosen.


∞ The Maximum Parsimony method is good forsimilar sequences, a sequences group withsmall amount of variations

Maximum Parsimony methods do not give thebranch lengths only the branch order.

For larger set it is recommended to usethe “branch and bound” method insteadof Maximum Parsimony.

Maximum Parsimony Methods areAvailable…

For DNA in Programs:

paup, molphy,phylo_win

In the Phylip package:

DNAPars, DNAPenny, etc..

For Protein in Programs:

paup, molphy,phylo_win


PROTPars

Character Based Methods -Maximum Likelihood

Basic idea of Maximum Likelihood method isbuilding a tree based on mathematical model.This method find a tree based on probabilitycalculations that best accounts for the largeamount of variations of the data (sequences)set.

Maximum Likelihood method (like the MaximumParsimony method) performs its analysis oneach position of the multiple alignment.

This is why this method is very heavy on CPU.

Character Based Methods -Maximum Likelihood

Maximum Likelihood method –Starts with a set of sequences, and finds

estimates for the variability in each position.

It checks the rates of transitions and transversions(using both models Kimura and Jukes & Cantor).At the end, after all positions in the sequences

alignment were checked, the likelihood of thewhole tree is reported.

Maximum Likelihood method

The Maximum Likelihood methods arevery slow and cpu consuming.

Maximum Likelihood methods can befound in phylip, paup or puzzle.

In phylip package in programs:

DNAML and DNAMLK

Distances Methods

Distance - the number of substitutions per site pertime period.Evolutionary distance are calculated based on oneof DNA evolutionary models.

Neighbors – pair of sequences, in a sequence set,that have the smallest number of changes(substitutions) between them.On a phylogenetic tree, neighbors are joined by abranch to the same node (common ancestor).

Distances Matrix Methods

Some of the Distances methods assume amolecular clock, such as the Neighbor-JoiningUPGMA. Most of the other Distances programs

do not.This assumption is not true for several reasons:

Different environmental conditions affectmutation rates.This assumption ignores selection issueswhich are different with different timeperiods.


Distance methods try to place the correctpositions of all the neighbors, and find thecorrect branches lengths. However, theyvary in the way they construct the tree.

Distance based clustering methods:Neighbor-Joining (unrooted tree) UPGMA (rooted tree)

Common steps to build a TREE usedby Distances method

1 Multiple alignments - based on all againstall pairwise comparisons.

2 Building distance matrix of all the comparedsequences (all pairs of OTUs).

3 Disregard of the actual sequences.

4 Constructing a guide tree by clustering thedistances. Iteratively build the relations(branches and internal nodes) between allOTUs.

Distance method steps

Construction of a distance tree using clustering with the UnweightedPair Group Method with Arithmatic Mean (UPGMA)

A B C D E

B 2

C 4 4

D 6 6 6

E 6 6 6 4

F 8 8 8 8 8

From http://www.icp.ucl.ac.be/~opperd/private/upgma.html

A - GCTTGTCCGTTACGATB – ACTTGTCTGTTACGATC – ACTTGTCCGAAACGATD - ACTTGACCGTTTCCTTE – AGATGACCGTTTCGATF - ACTACACCCTTATGAG

First, construct a distance matrix:


Distances matrix methods can be found in thefollowing Programs:

Clustalw, Phylo_win, Paup

In the GCG software package:

Paupsearch, distances


DNADist, PROTDist, Fitch, Kitch, Neighbor

Statistical Methods

Bootstrapping Analysis –

Is a method for testing how good a dataset fits aevolutionary model.

This method can check the branch arrangement(topology) of a phylogenetic tree.

In Bootstrapping, the program re-samplescolumns in a multiple aligned group ofsequences, and creates many new alignments,(with replacement the original dataset).

These new sets represent the population.

Statistical Methods

The process is done at least 100 times.

Phylogenetic trees are generated from all thesets.Part of the results will show the # of times aparticular branch point occurred out of all thetrees that were built.

The higher the # - the more valid thebranching point.

Taken from Dr. Itai Yanai

Chimpanzee

Gorilla

Human

Orang-utan

Gibbon

Given the following tree, estimate the confidence of the two internal branches

Taken from Dr. Itai Yanai

Chimpanzee

Chimpanzee

Chimpanzee

Gorilla

Gorilla

Gorilla

HumanHuman Human

Orang-utanOrang-utan Orang-utan

GibbonGibbon

Gibbon

41/100 28/100 31/100

Chimpanzee

Gorilla

Human

Orang-utan

Gibbon

100

41

Estimating Confidence from the Resamplings

1. Of the 100 trees:

In 100 of the 100 trees, gibbon and orang-utan are split from the rest.

In 41 of the 100 trees, chimp and gorilla are split from the rest.

2. Upon the original tree we superimpose bootstrap values:

Statistical Methods

Bootstrap values between 90-100 areconsidered statistically significant

Mutations in DNA as a source forevolutionary analysis

Only mutations that were fixed in thepopulation are called substitutions.

We assume that each observed change insimilar sequences, represent a “singlemutation event”.

The greater the number of changes, themore possible types of mutations.

Original Sequence

ACTGAACGTAA

CTGA > C > TAC > GGT > A

AC > ATGA AC > AGT > A

Single Substitution

Parallel Substitution

MultipleSubstitution coincidental Substitution

DNA Substitution Mutations

Transition - a change between purines(A,G) or between pyrimidines (T,C).

Transversion - a change between purines(A,G) to pyrimidines (T,C).

Substitution mutations usually arise frommispairing of bases during replication.

Correction for likelihood of mutationsin DNA sequences

There are several evolutionary modelsused for correction for the likelihood ofmultiple mutations and reversions in DNAsequences.

These evolutionary models use anormalized distance measurement that isthe average degree of change per lengthof aligned sequences.

Jukes & Cantor one-parameter model

This model assumes that substitutionsbetween the 4 bases occur with equal frequency.Meaning no bias in the direction of the change.

G

TC

A

α

α

Is the rate of substitutionsIn each of the 3 directionsFor one base.

α ( Is the one parameter).

Kimura two-parameter model

This model assumes that transitions(A - G or T - C) occur more often than

transversions (purine -pyrimidine).

G

TC

A

α

α

Is the rate of transitional Substitutions.

α

ββ β Is the rate of transversionalsubstitutions.

These evolutionary models improve thedistance calculations between thesequences.These evolutionary models have lesseffect in phylogenetic predictions ofclosely related sequences.These evolutionary models have bettereffect with distant related sequences.

DNA Evolution Models

A basic process in DNA sequenceevolution is the substitution of onenucleotide with another.This process is slow and can not beobserved directly.The study of DNA changes is used toestimate the rate of evolution, and theevolution history of organisms.

How to draw Trees? (Building trees software)

* Unrooted trees should be plotted usingthe DRAWGRAM program (phylip), orsimilar.

* Rooted trees should be plotted using theDRAWTREE program (phylip), or similar.

* On a PC/Mac use the TreeView program

A Tip...

For DNA sequences use the Kimura'smodel in the building trees programs.

For PROTEINS the differences lie withthe scoring (substitution) matrices used.For more distant sequences you shoulduse BLOSUM with lower # (i.e., fordistant proteins use blosum45 and forsimilar proteins use blosum60).

Known problems of Phylogenetic Analysis

Order of the input data (sequences) - The order of the input sequences effects the tree

construction. You can "correct" this effect in someof the programs (like phylip), using the Jumbleoption. (J in phylip set to 10).

The number of possible trees is huge for largedatasets. Often it is not possible to construct alltrees, but can guarantee only "a good" tree notthe "best tree".


The definition of "best tree" is ambiguous.It might mean the most likely tree, or a tree withthe fewest changes, or a tree best fit to a knownmodel, etc..

The trees that result from various methods differfrom each other. Never the less, in order tocompare trees, one need to assume someevolutionary model so that the trees may betested.


∆ Pay attention to the data used for the treeconstruction, use “informative” data, withoutlarge gaps.

∆ Population effects are often to be considered,especially if we have a lot of variety (large #of alleles for one protein).

How many trees to build?

! For each dataset it is recommended to buildmore than one tree. Build a tree using adistance method and if possible also use acharacter-based method, like maximumparsimony.

! The core of the tree should be similar inboth methods, otherwise you may suspectthat your tree is incorrect.

mar 2004 irit orr phylogenetic analysis introduction to · phylogenetic analysis irit orr mar 2004...

Documents