2-step single-step phylogenetics (observation)sites.fas.harvard.edu/~bio181/lectures/lecture...

Page 1: 2-STEP SINGLE-STEP PHYLOGENETICS (observation)sites.fas.harvard.edu/~bio181/lectures/Lecture 05.ppt.pdfDNA sequence alignment It is the process of transforming unequal length sequences

DNA (observation)

PHYLOGENETIC TREE (hypothesis of relationships)

2-STEP PHYLOGENETICS: Alignment step (cost matrix/model), then MULTIPLE SEQUENCE ALIGNMENT (primary homology), then Tree inference step (cost matrix/model)

SINGLE-STEP PHYLOGENETICS: DNA to PHYLOGENETIC TREE (cost matrix/model)

Page 2

Morrison, D. A., Ellis, J. T., 1997. Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of Apicomplexa. Molecular Biology and Evolution 14: 428-441.

Page 3

Recognizing Homology

• Morphological homology criteria
  – Similarity (phenetic)
  – Positional
  – Ontogenetic

• Molecules
  – Similarity
  – Primers
  – Codons
  – Secondary structure
  – Introns/exons
  – Other: genic clusters, expression patterns, functionality, etc.

Page 4

“Recognizing” homology

• Similarity (BLAST)

• Primer specificity

• Secondary structure

• Codon

• Topologic

cgtcgtAAATCGCGGATTTcgtcgt

cgtcgtAACTCCCGGAGTTcgtcgt

cgtcgtAACTCGC-GAGTTcgtcgt

[Figure: stem-loop (hairpin) secondary structures of the three sequences above, drawn with RNA base pairs (A:U, U:A, C:G)]

Enoplus

1 GGCAGCAAAT TTTGTTGTTT GTTGAA

Ephemera

1 GGCGCCGAGA TCCTTGTGCT CTCGGCGCTT ACT

Euperipatoides

1 GCCCGGAGCG GTTTTTGATC TTTGCCGTCC GTCCTTTGTT TTTTCCCTTT

1 CGGCCGACCG GTTCGGCGGG TACGAGGGAA GGCGGGGGCG GAGGGGTCAT

1 TCCGCGGTCT GCGGGGAAAC CTTTTCGTCA CTTTACCTTC TCGCCGTCTT

1 CGAGCGGTCG AGACGGAAGG GGAGCTTAGG CGGAAGCATA ACCGAAGCCG

1 CAGGAAG

Glossiphonia

1 GTACAGCCGC TGTCAGCCGC AACGTCTTCG GGCGTTCGGA CGTCTGGAAA

1 GGTGCAGA

Gordius

1 GAACGTCGAT CATTTTCGTC GGCGAAG

Haplognathia

1 GCACTTTGAT GGCTCTGCCG TCATATGGCG

Heterodon

1 GTTATGCGAC CCCCNAGCGG TCGGNNTCCA A

Hirudo

1 GTACAGCCGC TGTCAGCCGC AACGTCTTCG GGCGTTCGGA CGTCTGGAAA

1 GGTGCAGA

Lepidopleurus

1 CTTCTTAGAG GGACAAGTGG CGTTTAGCCA CACGAAATTG A

Acanthochitona

1 CTTCTTAGAG GGACAAGTGG CGTTTAGCCA CACGAAATTG A

Loligo

1 CTTCTTAGAG GGACGGGTGG CAGAAAGCCA CACGAGGCGG A

Sepia

1 CTTCTTAGAG GGACGGGTGG CAGAAAGCCA CACGAGGCGG A

Haliotis

1 CTTCTTAGAG GGACAGGTGG CGTATAGCCA CACGAAGTAG A

Sinezona

1 CTTCTTAGAG GGACAGGTGG CGTTTAGCCA CACGAAGTAG A

Diodora

1 CTTCTTAGAG GGACAGGTGG CGTTTAGTCA CACGAAGTAG A

Rhabdus

1 CTTCTTAGAG GGACGAGTGG CGTATAGCCA CGCGAGATTG A

Solemya

1 CTTCTTAGAG GGACAAGTGG CTTYTAGCCA CACGAGATTG A

Nucula

1 CTTCTTAGAG GGACAAGTGG CGTTTAGCCA CACGAGATTG A

Page 5

• Protein-coding genes
  – Introns/exons
  – Preserving the reading frame
  – Third position and saturation problems

• Ribosomal genes
  – Secondary structure

• Coding gene adjacencies: mitochondrial genome analysis
  – Inversions
  – Overlapping reading frames
  – Translocations

• Complete genome analyses
  – Inversions
  – Overlapping reading frames
  – Translocations
  – Transpositions

[Figure: 70S ribosome structure at 3.5 Å resolution]

[Figure: Complete mitochondrial genomes of two myriapods]

Page 6

Homology in molecular characters

Base to base Fragment to fragment

Static

Dynamic

Multiple alignment

Optimization alignment

Fixed state optimization

Genome analysis

Wheeler, W. C. (2001). Homology and DNA sequence data. In The character concept in evolutionary biology (G. P. Wagner, Ed.), pp. 303-317. Academic Press, San Diego.

Wheeler, W. C. (2001). Homology and the optimization of DNA sequence data. Cladistics 17, S3-S11.

Page 7

DNA sequence alignment

It is the process of transforming unequal length sequences into characters of equal length via the insertion of gaps (insertion/deletion events; indels)

ATTCGCA

ATTGCA

ATTCGCA

ATT-GCA

Gaps

Are symbols that indicate that an insertion or deletion event has occurred at a position of the DNA chain after the sequences being compared diverged from a common ancestor, resulting in the lack of homologous nucleotides at that position

Page 8

Unaligned sequences:

ATTCGCGTTA
ATCCGTTTA

Alignment 1: 3 insertions, 0 transformations

ATTCGCGTT-A
AT-C-CGTTTA

Alignment 2: 1 insertion, 2 transformations

ATTCGCGTTA
ATCCGT-TTA

Alignment 3: 17 insertions, 0 transformations

--------ATTCGCGTTA
ATCCGTTTA---------
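The insertion and transformation counts for these three alignments can be checked mechanically. The following is a minimal sketch, assuming that gaps are written as "-" and that every column holding two different bases counts as one transformation. The function name count_events is illustrative and does not come from the lecture.

```python
def count_events(row_a, row_b, gap="-"):
    """Count indel columns and transformation columns in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    indels = 0
    transformations = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue  # a column with two gaps carries no event
        if a == gap or b == gap:
            indels += 1  # one row has a base, the other a gap
        elif a != b:
            transformations += 1  # two different bases
    return indels, transformations


# The three candidate alignments of ATTCGCGTTA and ATCCGTTTA from the slide.
alignments = [
    ("ATTCGCGTT-A", "AT-C-CGTTTA"),
    ("ATTCGCGTTA", "ATCCGT-TTA"),
    ("--------ATTCGCGTTA", "ATCCGTTTA---------"),
]
for row_a, row_b in alignments:
    print(row_a, row_b, count_events(row_a, row_b))
```

Which of the three alignments is preferred then depends only on the relative cost assigned to an indel and to a transformation.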

Page 9

Types of alignments:

• Local alignments
  – Search for homologous domains between sequences (searches for orthologs, paralogs, structural domains, etc.): BLAST (Altschul et al. 1997)

• Global alignments
  – Two homologous fragments are compared (comparison of orthologs): PHYLOGENETIC ESTIMATION

• Two sequences (pairwise alignments)
  – Needleman & Wunsch (1970): minimum edit distance between two sequences, i.e. the minimum number of transformations required to go from sequence A to sequence B

• N sequences (N > 2): multiple sequence alignments
  – N-dimensional case (Sankoff & Cedergren 1983)
  – Heuristic multiple alignment following a guide tree (Feng & Doolittle 1987)

Page 10

Algorithmic alignments (two or more sequences)

• Require specification of at least two variables (parameters):
  – Gaps (indels)
    • Individual gaps
    • Contiguous gaps (with specific functions)
      – Gap initiation
      – Gap extension (linear or concave function)
  – Substitutions
    • All equal
    • Transversions/transitions
    • Each substitution receives its own cost

***** *

ACTTGCACA-A

ACT--CCGACA

ACTT--ACACA

[Figure: substitution costs among A, C, G and T]
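The two parameter families listed on this slide (gap costs and substitution costs) can be written as small functions. This is a sketch under stated assumptions: the numeric defaults, the function names, and the convention that a gap run of length k costs initiation plus an extension term for the remaining k - 1 positions are all illustrative choices, not values given in the lecture. The logarithmic form is only one possible concave function.

```python
import math
from itertools import groupby

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_cost(a, b, ts=1, tv=2):
    """Cost of changing base a into base b (hypothetical default weights)."""
    if a == b:
        return 0
    # Transitions stay within purines or within pyrimidines.
    is_transition = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return ts if is_transition else tv


def gap_run_lengths(row, gap="-"):
    """Lengths of the contiguous gap runs in one aligned row."""
    return [len(list(run)) for symbol, run in groupby(row) if symbol == gap]


def linear_gap_cost(length, initiation=3, extension=1):
    """Initiation cost plus a linear extension cost (length >= 1)."""
    return initiation + extension * (length - 1)


def concave_gap_cost(length, initiation=3, extension=1):
    """Initiation cost plus a logarithmic (concave) extension cost (length >= 1)."""
    return initiation + extension * math.log(length)


for row in ("ACTTGCACA-A", "ACT--CCGACA", "ACTT--ACACA"):
    runs = gap_run_lengths(row)
    print(row, runs, [linear_gap_cost(k) for k in runs])
```

Setting ts equal to tv gives the "all equal" scheme. Replacing substitution_cost with a lookup in a full 4 x 4 table gives the scheme where each substitution receives its own cost.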

Page 11

TTTACTTT

TTTG-TTT

TTTACTTT

TTT-GTTT

TTT-ACTTT

TTTG--TTT

Gap = 1; Tv = 2; Ts = 2

Cost = 3

Cost = 3

Cost = 4

Gap = 2; Tv = 1; Ts = 2

Cost = 6

Gap = 2; Tv = 2; Ts = 1

Cost = 3

Cost = 4

Cost = 6

Pairwise alignments:
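The dependence of alignment cost on the gap, transversion (Tv) and transition (Ts) weights can be explored with a short script. This is a minimal sketch, assuming each gap position is charged independently and that A/G and C/T changes are transitions. The function name alignment_cost is illustrative, and the script makes no claim about how the individual costs were laid out on the original slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def alignment_cost(row_a, row_b, gap, tv, ts):
    """Sum the column costs of a pairwise alignment under one parameter set."""
    cost = 0
    for a, b in zip(row_a, row_b):
        if a == b:
            continue  # identical symbols cost nothing
        if "-" in (a, b):
            cost += gap  # indel column
        elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            cost += ts  # transition
        else:
            cost += tv  # transversion
    return cost


# The three candidate alignments and three parameter sets from the slide.
ALIGNMENTS = [
    ("TTTACTTT", "TTTG-TTT"),
    ("TTTACTTT", "TTT-GTTT"),
    ("TTT-ACTTT", "TTTG--TTT"),
]
PARAMETER_SETS = [
    {"gap": 1, "tv": 2, "ts": 2},
    {"gap": 2, "tv": 1, "ts": 2},
    {"gap": 2, "tv": 2, "ts": 1},
]

for params in PARAMETER_SETS:
    costs = [alignment_cost(a, b, **params) for a, b in ALIGNMENTS]
    print(params, costs)
```

Running it shows that the cheapest alignment changes from one parameter set to the next, which is the point of the slide.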

Page 12

The number of multiple alignments

n\m    2          3                    4                5
1      3          13                   75               541
2      13         409                  23917            2244361
3      63         16081                10681263         14638756721
4      321        699121               5552351121       117629959485121
5      1683       32193253             3147728203035    1.05 x 10^18
10     8097453    9850349744182729     3.32 x 10^26     1.35 x 10^38

f(n, m) for 1 ≤ n ≤ 10; 1 ≤ m ≤ 5 (n = sequence length, m = number of sequences)

It is not possible to obtain exact solutions

From Slowinski, J. B. (1998). The number of multiple alignments. Mol. Phylogenet. Evol. 10, 264-266.
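The small entries of this table can be reproduced by direct counting. The sketch below assumes the usual definition: an alignment is a sequence of columns, each column places either a residue or a gap in every row, and a column of gaps only is not allowed. The function name num_alignments and the recursive formulation are illustrative and are not taken from Slowinski (1998).

```python
from functools import lru_cache
from itertools import product


def num_alignments(n, m):
    """Number of distinct alignments of m sequences, each of length n."""
    # Each column consumes a residue (1) or inserts a gap (0) in every row;
    # the all-gap column is excluded.
    columns = [c for c in product((0, 1), repeat=m) if any(c)]

    @lru_cache(maxsize=None)
    def count(remaining):
        if not any(remaining):
            return 1  # every residue has been placed
        total = 0
        for column in columns:
            rest = tuple(r - c for r, c in zip(remaining, column))
            if min(rest) >= 0:  # cannot consume residues that are not left
                total += count(rest)
        return total

    return count((n,) * m)


for n in (1, 2, 3):
    print(n, [num_alignments(n, m) for m in (2, 3)])
```

Memoisation keeps this feasible for the small cases only. The growth seen in the table is the reason exact multiple alignment is replaced by heuristics in practice.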

Page 13

Heuristic solutions

• All pairwise comparisons are made

• A binary guide tree is generated

• Sequences are incorporated following the binary guide tree

Implementation

• CLUSTAL (Higgins & Sharp 1988; Jeanmougin et al. 1998)

• TREEALIGN (Hein 1989, 1990)

• MALIGN (Wheeler & Gladstein 1994, 1995)

• POY (Wheeler 1996): implied alignment

• COFFEE, etc.: profile alignments, iterative alignments, etc.

Page 14

Sequence A: TAAATTGCA
Sequence B: AATTTGGGCCA

The Needleman-Wunsch algorithm: wavefront updating

Phillips, A., Janies, D. & Wheeler, W. C. Multiple sequence alignment in phylogenetic analysis. Molecular Phylogenetics and Evolution 16, 317-330 (2000).

Page 15

The Needleman-Wunsch algorithm: traceback procedure and edit graph
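The matrix fill and the traceback can be sketched in a few lines. This is a minimal illustration, not the implementation shown in the lecture figures. It minimises cost with an assumed gap cost of 2 and change cost of 1, fills the matrix row by row rather than along anti-diagonal wavefronts (the resulting matrix is the same), and returns only one of possibly several optimal alignments.

```python
def needleman_wunsch(seq_a, seq_b, gap=2, change=1):
    """Return (cost, aligned_a, aligned_b) for one minimum-cost global alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * gap  # leading gaps in seq_b
    for j in range(1, cols):
        cost[0][j] = j * gap  # leading gaps in seq_a
    for i in range(1, rows):
        for j in range(1, cols):
            step = 0 if seq_a[i - 1] == seq_b[j - 1] else change
            cost[i][j] = min(
                cost[i - 1][j - 1] + step,  # match or substitution
                cost[i - 1][j] + gap,       # gap in seq_b
                cost[i][j - 1] + gap,       # gap in seq_a
            )

    # Traceback: walk from the bottom-right cell back to the origin.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = 0 if seq_a[i - 1] == seq_b[j - 1] else change
            diagonal = cost[i][j] == cost[i - 1][j - 1] + step
        else:
            diagonal = False
        if diagonal:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return cost[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


total, row_a, row_b = needleman_wunsch("TAAATTGCA", "AATTTGGGCCA")
print(total)
print(row_a)
print(row_b)
```

The order in which the three moves are tested during traceback decides which of the equally optimal alignments is reported. The full set of optimal paths is what an edit graph represents.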

Page 16

Pairwise alignments: distance-based guide trees

A - ATTCG
B - AGCG
C - ACTCG

gap = 2; change = 1

A - ATTCG
B - AG-CG

A - ATTCG
C - ACTCG

B - AG-CG
C - ACTCG

Distance matrix:

d(A,B) = 3
d(A,C) = 1
d(B,C) = 3

Guide tree: (B, (A, C))
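The distance matrix for A, B and C follows from running a pairwise alignment on every pair and keeping only the cost. The sketch below assumes the slide's parameters (gap = 2, change = 1). The function name edit_cost and the choice to join the closest pair first are illustrative of a generic distance-based guide tree, not of any particular program.

```python
from itertools import combinations


def edit_cost(seq_a, seq_b, gap=2, change=1):
    """Minimum alignment cost of two sequences (score only, no traceback)."""
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        current = [i * gap]
        for j, b in enumerate(seq_b, start=1):
            current.append(min(
                previous[j - 1] + (0 if a == b else change),  # match or change
                previous[j] + gap,                            # gap in seq_b
                current[j - 1] + gap,                         # gap in seq_a
            ))
        previous = current
    return previous[-1]


sequences = {"A": "ATTCG", "B": "AGCG", "C": "ACTCG"}
distances = {
    (x, y): edit_cost(sequences[x], sequences[y])
    for x, y in combinations(sequences, 2)
}
closest = min(distances, key=distances.get)
print(distances)
print("first pair joined in the guide tree:", closest)
```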

Page 17

Pairwise alignments

A - ATTCG
B - AGCG
C - ACTCG
D - AATGG

Distance matrix:

d(A,B) = 3
d(A,C) = 1
d(A,D) = 2
d(B,C) = 3
d(B,D) = 4
d(C,D) = 2

Guide tree so far: (A, C); next sequence to join: ?

d(AC,B) = 2
d(AC,D) = 2
d(B,D) = 4

A + C: AYTCG (Y = C or T)
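The merged sequence AYTCG is the IUPAC ambiguity consensus of A (ATTCG) and C (ACTCG). A minimal sketch of that step follows, assuming the rows are already aligned, have equal length and contain no gaps. The function name consensus is illustrative.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def consensus(*rows):
    """Column-wise IUPAC consensus of equal-length, gap-free sequences."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("rows must be non-empty and of equal length")
    codes = []
    for column in zip(*rows):
        bases = "".join(sorted(set(column)))  # e.g. {"T", "C"} becomes "CT"
        codes.append(IUPAC_CODES[bases])
    return "".join(codes)


print(consensus("ATTCG", "ACTCG"))
```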

Page 18

Multiple alignments based on parsimony

Alignment 1        Alignment 2

I    GGGG          I    GGGG
II   -GGG          II   GGG-
III  GAAG          III  GAAG
IV   -GAA          IV   GAA-

Tree cost by topology:

Topology               Alignment 1    Alignment 2
(((I II) III) IV)      7              6
(((I III) II) IV)      6              6
(((I IV) II) III)      8              8

Gap = 2; Substitution = 1
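The tree costs for each alignment and topology can be recomputed column by column. The sketch below uses a Sankoff-style dynamic programming pass with the slide's costs (gap = 2, substitution = 1) and treats the gap as a fifth character state. The nested-tuple tree encoding and the function names are illustrative assumptions, not the lecture's own procedure.

```python
STATES = "ACGT-"


def change_cost(a, b, gap=2, sub=1):
    """Cost of a change between two states; a gap counts as a fifth state."""
    if a == b:
        return 0
    return gap if "-" in (a, b) else sub


def subtree_costs(tree, column, gap, sub):
    """Minimum cost of the subtree for every possible state at its root."""
    if isinstance(tree, str):  # leaf: only the observed state is allowed
        return {s: (0 if s == column[tree] else float("inf")) for s in STATES}
    children = [subtree_costs(child, column, gap, sub) for child in tree]
    return {
        s: sum(
            min(child[t] + change_cost(s, t, gap, sub) for t in STATES)
            for child in children
        )
        for s in STATES
    }


def tree_length(tree, alignment, gap=2, sub=1):
    """Total cost of an alignment on a tree, summed over columns."""
    width = len(next(iter(alignment.values())))
    total = 0
    for i in range(width):
        column = {name: row[i] for name, row in alignment.items()}
        total += min(subtree_costs(tree, column, gap, sub).values())
    return total


ALIGNMENT_1 = {"I": "GGGG", "II": "-GGG", "III": "GAAG", "IV": "-GAA"}
ALIGNMENT_2 = {"I": "GGGG", "II": "GGG-", "III": "GAAG", "IV": "GAA-"}
TOPOLOGIES = [
    ((("I", "II"), "III"), "IV"),
    ((("I", "III"), "II"), "IV"),
    ((("I", "IV"), "II"), "III"),
]

for tree in TOPOLOGIES:
    print(tree, tree_length(tree, ALIGNMENT_1), tree_length(tree, ALIGNMENT_2))
```

Because the cost scheme is symmetric, the position of the root does not affect the total.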

Page 19

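Tree (((I II) III) IV); terminals in order: GGGG  GGG  GAAG  GAA

GGGG
-GGG
GAAG
-GAA
Cost = 7

GGGG
GGG-
GAAG
GAA-
Cost = 6

Tree (((I III) II) IV); terminals in order: GGGG  GAAG  GGG  GAA

GGGG
GAAG
-GGG
-GAA
Cost = 6

GGGG
GAAG
GGG-
GAA-
Cost = 6

Gap = 2; Substitution = 1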

Page 20

• Eliminate gappy regions

• Eliminate ambiguous regions: Cull (Gatesy et al. 1993)

• Differential weighting of ambiguous regions: Elision (Wheeler et al. 1995)

• Exploring multiple analytical parameters: Sensitivity analysis (Wheeler 1995):
  – Multiple parameters are evaluated (gaps, substitutions, relative weights of partitions, etc.)
  – Allows testing for node stability
  – May require a second optimality criterion to choose a parameter set from the parameter space (e.g. character congruence, topological congruence, etc.)

ACTGACT

A-TGACT

AC-GACT

ACT-ACT

ACTG-CT

ACTGACT

A-AGACT

ACTGACT

AA-GACT

AGACT

AGACT

ACTGACT

A-AGACT

ACTGACT

AA-GACT

1 0.5 1

A CT CT GACT

A -A A- GACT

Page 21

• Alignments are difficult: they require heuristic implementations

• Alignments are parameter-dependent (Fitch & Smith 1983; Waterman et al. 1992)

Different parameters → Different alignments → Different phylogenetic hypotheses

Page 22

GGGG

GGG(G)

GGG

GAAG

GRRG GAA

GAA(G)

2 indels + 2 substitutions = 6

GGGG

GRR(G)

GAA

GGG

GGG GAAG

GR(A)G

2 indels + 3 substitutions = 7

Gap cost = 2

Base substitution = 1

Direct optimization:

I GGGG

II GGG

III GAAG

IV GAA

(((I II) III) IV)

(((I IV) II) III)

(((I III) II) IV) …

Page 23

Probabilistic “alignments”

• Require the implementation of indels in evolutionary models

• The TKF model (Thorne, Kishino & Felsenstein 1991)

• Tree HMMs (Mitchison 1999); Evolutionary HMMs (Holmes & Bruno 2001)

• POY-ML (Wheeler et al. 2002) (see also Fleissner et al. 2005)

• Simultaneous Bayesian inference of trees and alignments (Redelings & Suchard 2005)

Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts.

Wheeler, W. C., D. Gladstein and J. De Laet. 2002. POY version 3.0. Program and documentation available at ftp.amnh.org/pub/molecular. American Museum of Natural History.

Redelings, B. D. and M. A. Suchard. 2005. Joint Bayesian estimation of alignment and phylogeny. Systematic Biology 54: 401-418.

Fleissner, R., D. Metzler and A. von Haeseler. 2005. Simultaneous statistical multiple alignment and phylogeny reconstruction. Systematic Biology 54: 548-561.

Page 24

DNA (observation)

PHYLOGENETIC TREE (hypothesis of relationships)

2-STEP PHYLOGENETICS: Alignment step (cost matrix/model), then MULTIPLE SEQUENCE ALIGNMENT (primary homology), then Tree inference step (cost matrix/model)

SINGLE-STEP PHYLOGENETICS: DNA to PHYLOGENETIC TREE (cost matrix/model)