lecture 24

Post on 02-Feb-2016



Bioinformatics: Lecture 24

• Inferring molecular phylogeny

• Distance methods

• Discrete methods

• Comparisons of different tree building methods

• Estimating sampling error: the bootstrap

Inferring molecular phylogeny

• The objective of molecular phylogenetics is to convert sequence information (DNA, RNA, proteins) into an evolutionary tree for those sequences.

• The ever-growing number of tree-building methods can be classified, very roughly, in two ways:

• Distance methods versus discrete-character methods.

• Clustering methods versus search methods.

• These methods will be considered during the lecture.

Distance methods

• The simplest distance method is UPGMA (Unweighted Pair Group Method with Arithmetic Mean). It is based on the assumption of constant substitution rates and approximately equal lengths of neighbouring branches.

• As a first step, a distance matrix must be built that holds the distances between all possible pairs of the sequences used for the phylogenetic reconstruction.

• The UPGMA starts from calculating branch length

Distance methods: an idealised case

A. Sequences

Sequence A  ACGCGTTGGGCGATGGCAAC
Sequence B  ACGCGTTGGGCGACGGTAAT
Sequence C  ACGCATTGAATGATGATAAT
Sequence D  ACACATTGAGTGATAATAAT

B. Distances between sequences

nAB = 3, nAC = 7, nAD = 8, nBC = 6, nBD = 7, nCD = 3

C. Distance table

OTU  A  B  C  D
A    -  3  7  8
B    -  -  6  7
C    -  -  -  3
D    -  -  -  -
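The distances in panel B are counts of differing positions between each pair of aligned sequences. The following minimal sketch recomputes them; the helper name count_differences is illustrative only, and the fourth sequence is assumed to be sequence D.

```python
from itertools import combinations

# Sequences from panel A (the fourth one is assumed to be sequence D).
sequences = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def count_differences(s1, s2):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s1, s2))


# One entry per unordered pair of OTUs, keyed "AB", "AC", ...
distances = {
    x + y: count_differences(sequences[x], sequences[y])
    for x, y in combinations(sorted(sequences), 2)
}

for pair, n in distances.items():
    print(f"n{pair} = {n}")
```

Raw difference counts are used here because the example is idealised; real analyses usually correct the counts for multiple substitutions at the same site.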

D. The assumed unrooted tree

[Figure: unrooted tree for OTUs A, B, C and D; branch-length labels 1, 1, 2, 2 and 4.]
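Because the distances in this idealised case fit a tree exactly, the branch lengths can be recovered from the distance table alone. The sketch below first uses the four-point condition to find which OTUs pair up, then solves for the five branch lengths. It assumes the distances are perfectly additive, and the variable names are illustrative only.

```python
# Pairwise distances from the distance table.
d = {"AB": 3, "AC": 7, "AD": 8, "BC": 6, "BD": 7, "CD": 3}

# Four-point condition: of the three ways to split four OTUs into two pairs,
# the split with the smallest sum of within-pair distances gives the neighbours.
pairing_sums = {
    "AB|CD": d["AB"] + d["CD"],
    "AC|BD": d["AC"] + d["BD"],
    "AD|BC": d["AD"] + d["BC"],
}
best = min(pairing_sums, key=pairing_sums.get)

# Branch lengths for the unrooted tree ((A,B),(C,D)), assuming additivity.
a = (d["AB"] + d["AC"] - d["BC"]) / 2   # terminal branch to A
b = d["AB"] - a                         # terminal branch to B
c = (d["CD"] + d["AC"] - d["AD"]) / 2   # terminal branch to C
d_len = d["CD"] - c                     # terminal branch to D
internal = d["AC"] - a - c              # internal branch

branches = {"A": a, "B": b, "C": c, "D": d_len, "internal": internal}
print(best, branches)
```

With real data the distances are rarely exactly additive, so these formulas would give only approximate branch lengths.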

Diagram illustrating the stepwise construction of a phylogenetic tree for four OTUs according to unweighted pair group method with arithmetic mean (UPGMA). The resulting tree is ultrametric. Methods used: distance and clustering.

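Step 1. Initial distance table

OTU   B    C    D
A     14   11   7
B     -    11   13
C     -    -    8

A and D are joined first; the branch length to each is dAD / 2 = 3.5.

Step 2. Distance table after joining A and D

OTU    B      C
(AD)   13.5   9.5
B      -      11

(AD)B = (AB + DB)/2

(AD)C = (AC + DC)/2

C is joined to (AD) at depth d(AD)C / 2 = 4.75.

Step 3. Distance table after joining C to (AD)

OTU     B
(ADC)   12.67

(ADC)B = (AB + DB + CB)/3

B is joined last, at depth d(ADC)B / 2 = 6.33.

Values for these tables are calculated from the data presented in the initial table.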
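The stepwise UPGMA construction can be sketched in a few lines. This is a minimal illustration rather than a production implementation: the function name upgma and the tuple-based cluster representation are assumptions made here. The distance between two clusters is the arithmetic mean of all original pairwise distances between their members, which is what the formulas (AD)B = (AB + DB)/2 and (ADC)B = (AB + DB + CB)/3 express.

```python
from itertools import combinations

# Initial distances between the four OTUs in the UPGMA example.
dist = {
    frozenset("AB"): 14, frozenset("AC"): 11, frozenset("AD"): 7,
    frozenset("BC"): 11, frozenset("BD"): 13, frozenset("CD"): 8,
}


def upgma(dist):
    """Return the list of merges as (cluster1, cluster2, distance, node_depth)."""
    # Each cluster is a tuple of the original OTU names it contains.
    clusters = [(name,) for name in sorted({n for pair in dist for n in pair})]
    merges = []

    def average(c1, c2):
        # Unweighted mean over all original pairs spanning the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Join the two closest clusters; the new node sits at half their distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        d = average(c1, c2)
        merges.append((c1, c2, d, d / 2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


merges = upgma(dist)
for c1, c2, d, depth in merges:
    print(c1, c2, round(d, 2), round(depth, 2))
```

Ties between equally close clusters are broken arbitrarily in this sketch; that does not matter for this example because each step has a unique minimum.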

Neighbour-joining tree construction. Methods: distance and clustering.

OTU   H       C      G      O
C     1.45*   -      -      -
G     1.51    1.57   -      -
O     2.98    2.94   3.04   -
R     7.51    7.55   7.39   7.10

H – Human

C – Chimpanzee

G – Gorilla

O – Orangutan

R – Rhesus monkey

* Number of nucleotide substitutions per 100 sites between OTUs.

Neighbours-relation scores obtained from the distance matrix (see previous slide)

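Calculation of the total scores:

For each set of four OTUs, the three possible sums of pairwise distances are compared. For H, C, G and O, for example, (dHG + dCO) is the minimum sum, so each of the pairs (HG) and (CO) is assigned a score of 1; the other pairs score 0.

Summing these scores over all sets of four OTUs gives the total scores, which are shown in the table.

(OR) has the highest total score.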
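The scoring can be reproduced from the distance matrix with a short sketch. It assumes the procedure is applied to every set of four OTUs drawn from the five, and the names dist and scores are illustrative only.

```python
from collections import Counter
from itertools import combinations

# Substitutions per 100 sites between OTUs, from the distance matrix.
d = {
    frozenset("HC"): 1.45, frozenset("HG"): 1.51, frozenset("CG"): 1.57,
    frozenset("HO"): 2.98, frozenset("CO"): 2.94, frozenset("GO"): 3.04,
    frozenset("HR"): 7.51, frozenset("CR"): 7.55, frozenset("GR"): 7.39,
    frozenset("OR"): 7.10,
}


def dist(x, y):
    return d[frozenset((x, y))]


scores = Counter()
for w, x, y, z in combinations("HCGOR", 4):
    # The three ways to split four OTUs into two pairs.
    pairings = [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]
    # The pairing with the minimum distance sum gives both of its pairs a score of 1.
    best = min(pairings, key=lambda p: dist(*p[0]) + dist(*p[1]))
    for pair in best:
        scores["".join(pair)] += 1

for pair, score in scores.most_common():
    print(pair, score)
```

Pairs that never appear in a minimum-sum pairing are simply absent from the counter, which is equivalent to a total score of 0.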

Building the neighbour-joining (NJ) tree

Treating (OR), which has the highest total score, as a single composite OTU, the following table can be calculated.

OTU    H      C      G
C      1.45   -      -
G      1.51   1.57   -
(OR)   5.25   5.25   5.22

As only 4 OTUs are left, it is easy to see that

dHC + dG(OR) = 6.67 < dHG + dC(OR) = 6.76 < dH(OR) + dCG = 6.82

Therefore, H and C are chosen as one pair of neighbours and G and (OR) as the other.
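The reduced table and the final comparison can be checked with a short sketch. It assumes the distance from an OTU to the composite (OR) is the arithmetic mean of its distances to O and to R, which reproduces the tabulated 5.25, 5.25 and 5.22 after rounding. The sums below use the unrounded means, so they can differ from the slide's values in the third decimal place.

```python
# Distances (substitutions per 100 sites) from the original matrix.
d = {
    "HC": 1.45, "HG": 1.51, "CG": 1.57,
    "HO": 2.98, "CO": 2.94, "GO": 3.04,
    "HR": 7.51, "CR": 7.55, "GR": 7.39,
}

# Distance from each remaining OTU to the composite OTU (OR),
# assumed here to be the mean of its distances to O and to R.
d_or = {x: (d[x + "O"] + d[x + "R"]) / 2 for x in "HCG"}

# The three possible ways to pair up H, C, G and (OR).
sums = {
    "HC|G(OR)": d["HC"] + d_or["G"],
    "HG|C(OR)": d["HG"] + d_or["C"],
    "H(OR)|CG": d_or["H"] + d["CG"],
}

best = min(sums, key=sums.get)
print(best, sums)
```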

Maximum parsimony. Methods: discrete characters and search/optimisation.

Informative sites (*) in four compared sequences, used for phylogenetic reconstruction.

            Site
Sequence    1 2 3 4 5 6 7 8 9
1           A A G A G T G C A
2           A G C C G T G C G
3           A G A T A T C C A
4           A G A G A T C C G
Inf. sites          *   *   *

Three possible unrooted trees (I, II and III) for four DNA sequences (1, 2, 3, 4) that have been used to choose the most parsimonious tree.
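The choice of the most parsimonious tree can be sketched as follows. A site is treated as informative when at least two different nucleotides each occur in at least two sequences, and each tree is scored by the minimum number of substitutions it requires, counted with Fitch's method. The assignment of the labels I, II and III to particular groupings is an assumption made here, since the figure itself is not reproduced.

```python
from collections import Counter

# The four sequences from the table of sites.
sequences = {1: "AAGAGTGCA", 2: "AGCCGTGCG", 3: "AGATATCCA", 4: "AGAGATCCG"}
columns = list(zip(*(sequences[i] for i in (1, 2, 3, 4))))


def is_informative(column):
    """At least two states, each present in at least two sequences."""
    return sum(1 for n in Counter(column).values() if n >= 2) >= 2


informative = [i + 1 for i, col in enumerate(columns) if is_informative(col)]

# Assumed labelling of the three unrooted trees for four sequences.
trees = {
    "I": ((1, 2), (3, 4)),
    "II": ((1, 3), (2, 4)),
    "III": ((1, 4), (2, 3)),
}


def fitch_join(s1, s2):
    """Fitch step: intersection at no cost, otherwise union at a cost of 1."""
    common = s1 & s2
    return (common, 0) if common else (s1 | s2, 1)


def tree_length(tree):
    """Minimum number of substitutions over all sites for one tree."""
    (a, b), (c, e) = tree
    total = 0
    for col in columns:
        left, cost_left = fitch_join({col[a - 1]}, {col[b - 1]})
        right, cost_right = fitch_join({col[c - 1]}, {col[e - 1]})
        _, cost_mid = fitch_join(left, right)
        total += cost_left + cost_right + cost_mid
    return total


lengths = {name: tree_length(tree) for name, tree in trees.items()}
print(informative, lengths)
```

The uninformative sites add the same number of changes to every tree, so only the informative sites affect which tree is preferred.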

Comparison of different tree-building methods

• Efficiency (how fast is the method?)

• Power (how much data does the method need to produce a reasonable result?)

• Consistency (will it converge on the right answer given enough data?)

• Robustness (will minor violations of the method’s assumptions result in poor estimates of phylogeny?)

• Falsifiability (will the method tell us when its assumptions are violated, so that we can avoid using it?)

Performance of UPGMA and parsimony methods

[Figure: two panels, UPGMA and parsimony.]

The success rate is the percentage of times that the correct tree was recovered in that region of the parameter space. The white area at the top left of both diagrams is where neither method performs well.

MEGA 3

MEGA 3: Sequence Data Explorer

[Screenshot labels: Variable sites; Parsimonious sites; Sequences continue]

MEGA 3: phylogenetic trees

[Figure panels: Neighbor-joining (NJ), Minimum evolution (ME), Maximum parsimony (MP), UPGMA]

Bootstrapping

[Figure panels: NJ, ME, MP, UPGMA]
