2-step single-step phylogenetics (observation)sites.fas.harvard.edu/~bio181/lectures/lecture...
[Figure: two routes from DNA (observation) to PHYLOGENETIC TREE (hypothesis of relationships)]
2-STEP PHYLOGENETICS: alignment step (cost matrix/model), MULTIPLE SEQUENCE ALIGNMENT (primary homology), tree inference step (cost matrix/model)
SINGLE-STEP PHYLOGENETICS: cost matrix/model
Morrison, D. A., Ellis, J. T., 1997. Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of Apicomplexa. Molecular Biology and Evolution 14: 428-441.
Recognizing Homology
• Morphological homology criteria
– Similarity (phenetic)
– Positional
– Ontogenetic
• Molecules
– Similarity
– Primers
– Codons
– Secondary structure
– Introns/exons
– Other: genic clusters, expression patterns, functionality, etc.
“Recognizing” homology
• Similarity (BLAST)
• Primer specificity
• Secondary structure
• Codon
• Topologic
cgtcgtAAATCGCGGATTTcgtcgt
cgtcgtAACTCCCGGAGTTcgtcgt
cgtcgtAACTCGC-GAGTTcgtcgt
[Figure: hairpin (stem-loop) secondary structures drawn for the three sequences above, with paired bases in the stem and unpaired bases in the loop]
Enoplus
1 GGCAGCAAAT TTTGTTGTTT GTTGAA
Ephemera
1 GGCGCCGAGA TCCTTGTGCT CTCGGCGCTT ACT
Euperipatoides
1 GCCCGGAGCG GTTTTTGATC TTTGCCGTCC GTCCTTTGTT TTTTCCCTTT
1 CGGCCGACCG GTTCGGCGGG TACGAGGGAA GGCGGGGGCG GAGGGGTCAT
1 TCCGCGGTCT GCGGGGAAAC CTTTTCGTCA CTTTACCTTC TCGCCGTCTT
1 CGAGCGGTCG AGACGGAAGG GGAGCTTAGG CGGAAGCATA ACCGAAGCCG
1 CAGGAAG
Glossiphonia
1 GTACAGCCGC TGTCAGCCGC AACGTCTTCG GGCGTTCGGA CGTCTGGAAA
1 GGTGCAGA
Gordius
1 GAACGTCGAT CATTTTCGTC GGCGAAG
Haplognathia
1 GCACTTTGAT GGCTCTGCCG TCATATGGCG
Heterodon
1 GTTATGCGAC CCCCNAGCGG TCGGNNTCCA A
Hirudo
1 GTACAGCCGC TGTCAGCCGC AACGTCTTCG GGCGTTCGGA CGTCTGGAAA
1 GGTGCAGA
Lepidopleurus
1 CTTCTTAGAG GGACAAGTGG CGTTTAGCCA CACGAAATTG A
Acanthochitona
1 CTTCTTAGAG GGACAAGTGG CGTTTAGCCA CACGAAATTG A
Loligo
1 CTTCTTAGAG GGACGGGTGG CAGAAAGCCA CACGAGGCGG A
Sepia
1 CTTCTTAGAG GGACGGGTGG CAGAAAGCCA CACGAGGCGG A
Haliotis
1 CTTCTTAGAG GGACAGGTGG CGTATAGCCA CACGAAGTAG A
Sinezona
1 CTTCTTAGAG GGACAGGTGG CGTTTAGCCA CACGAAGTAG A
Diodora
1 CTTCTTAGAG GGACAGGTGG CGTTTAGTCA CACGAAGTAG A
Rhabdus
1 CTTCTTAGAG GGACGAGTGG CGTATAGCCA CGCGAGATTG A
Solemya
1 CTTCTTAGAG GGACAAGTGG CTTYTAGCCA CACGAGATTG A
Nucula
1 CTTCTTAGAG GGACAAGTGG CGTTTAGCCA CACGAGATTG A
• Protein-coding genes
– Introns/exons
– Preserving the reading frame
– Third position and saturation problems
• Ribosomal genes
– Secondary structure
• Coding gene adjacencies: mitochondrial genome analysis
– Inversions
– Overlapping reading frames
– Translocations
• Complete genome analyses
– Inversions
– Overlapping reading frames
– Translocations
– Transpositions
70S ribosome structure at 3.5 Å resolution
Complete mitochondrial genomes of two myriapods
Homology in molecular characters
Base to base Fragment to fragment
Static
Dynamic
Multiple alignment
Optimization alignment
Fixed state optimization
Genome analysis
Wheeler, W. C. (2001). Homology and DNA sequence data. In The character concept in evolutionary biology (G. P. Wagner, Ed.), pp. 303-317. Academic Press, San Diego.
Wheeler, W. C. (2001). Homology and the optimization of DNA sequence data. Cladistics 17, S3-S11.
DNA sequence alignment
The process of transforming sequences of unequal length into characters of equal length by inserting gaps (insertion/deletion events; indels)
ATTCGCA
ATTGCA
ATTCGCA
ATT-GCA
Gaps
Symbols indicating that an insertion or deletion event occurred at a position in the DNA chain after the compared sequences diverged from a common ancestor, leaving no homologous nucleotide at that position
ATTCGCGTTA
ATCCGTTTA
ATTCGCGTT-A
AT-C-CGTTTA
ATTCGCGTTA
ATCCGT-TTA
--------ATTCGCGTTA
ATCCGTTTA---------
3 insertions, 0 transformations
1 insertion, 2 transformations
17 insertions, 0 transformations
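The counts attached to the three alignments above can be checked mechanically. The following is a minimal sketch, assuming that an "insertion" is any column with a gap in exactly one row and a "transformation" is any column with two different bases; the function name count_events is hypothetical.

```python
def count_events(row_a, row_b):
    """Count indel columns and substitution columns in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    indels = substitutions = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a column of two gaps carries no event
        if x == "-" or y == "-":
            indels += 1
        elif x != y:
            substitutions += 1
    return indels, substitutions


# The three alternative alignments of ATTCGCGTTA and ATCCGTTTA shown above.
alignments = [
    ("ATTCGCGTT-A", "AT-C-CGTTTA"),
    ("ATTCGCGTTA", "ATCCGT-TTA"),
    ("--------ATTCGCGTTA", "ATCCGTTTA---------"),
]
for row_a, row_b in alignments:
    indels, substitutions = count_events(row_a, row_b)
    print(f"{indels} indels, {substitutions} substitutions")
```

Which of the three alignments is preferred depends entirely on the relative cost given to indels and substitutions.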
Types of alignments:
• Local alignments
– Search for homologous domains between sequences (searches for orthologs, paralogs, structural domains, etc.): BLAST (Altschul et al. 1997)
• Global alignments
– Two homologous fragments are compared (comparison of orthologs): PHYLOGENETIC ESTIMATION
• Two sequences (pairwise alignments)
Needleman & Wunsch (1970): minimum edit distance between two sequences, i.e. the minimum number of transformations required to go from sequence A to sequence B
• N sequences (N > 2): multiple sequence alignments
– N-dimensional case (Sankoff & Cedergren 1983)
– Heuristic multiple alignment following a guide-tree (Feng & Doolittle 1987)
Algorithmic alignments (two or more sequences)
• Require specification of at least two variables (parameters):
– Gaps (indels)
• Individual gaps
• Contiguous gaps (with specific functions)
– Gap initiation
– Gap extension (linear or concave function)
– Substitutions
• All equal
• Transversions/transitions
• Each substitution receives its own cost
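The difference between a linear and a concave gap-extension function can be sketched as follows. This is an illustration under stated assumptions, not the scheme of any particular program: the function names are hypothetical, the initiation cost is assumed to cover the first gapped position, and the logarithm is just one possible concave function.

```python
import math
import re


def gap_run_lengths(row):
    """Lengths of the runs of contiguous '-' symbols in one aligned row."""
    return [len(run) for run in re.findall(r"-+", row)]


def linear_gap_cost(length, initiation, extension):
    """Initiation covers the first position; each further position adds `extension`."""
    return initiation + extension * (length - 1)


def concave_gap_cost(length, initiation, extension):
    """Each further position adds less than the previous one (logarithmic growth)."""
    return initiation + extension * math.log(length)


# Hypothetical parameter values: initiation = 3, extension = 1.
for length in (1, 2, 4, 8):
    print(length,
          linear_gap_cost(length, 3, 1),
          round(concave_gap_cost(length, 3, 1), 2))
```

With individual gaps, every gapped position is charged independently; with contiguous gaps, gap_run_lengths supplies the run lengths that one of the two cost functions is then applied to.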
***** *
ACTTGCACA-A
ACT--CCGACA
ACTT--ACACA
A
C
G
T
Pairwise alignments:
(a) TTTACTTT
    TTTG-TTT
(b) TTTACTTT
    TTT-GTTT
(c) TTT-ACTTT
    TTTG--TTT
Parameters                 Cost (a)   Cost (b)   Cost (c)
Gap = 1, Tv = 2, Ts = 2    3          3          3
Gap = 2, Tv = 1, Ts = 2    4          3          6
Gap = 2, Tv = 2, Ts = 1    3          4          6
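The cost of each candidate alignment of TTTACTTT and TTTGTTT under each parameter set can be recomputed directly. The sketch below assumes the usual definitions (a transition is A/G or C/T, any other substitution is a transversion) and charges the gap cost once per gapped position; the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(x, y):
    """True when both bases are purines or both are pyrimidines."""
    return {x, y} <= PURINES or {x, y} <= PYRIMIDINES


def alignment_cost(row_a, row_b, gap, tv, ts):
    """Sum the column costs of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    cost = 0
    for x, y in zip(row_a, row_b):
        if x == y:
            continue  # identical symbols cost nothing
        if "-" in (x, y):
            cost += gap
        elif is_transition(x, y):
            cost += ts
        else:
            cost += tv
    return cost


candidates = [
    ("TTTACTTT", "TTTG-TTT"),    # one transition (A/G) and one gap
    ("TTTACTTT", "TTT-GTTT"),    # one gap and one transversion (C/G)
    ("TTT-ACTTT", "TTTG--TTT"),  # three gaps
]
parameter_sets = [
    {"gap": 1, "tv": 2, "ts": 2},
    {"gap": 2, "tv": 1, "ts": 2},
    {"gap": 2, "tv": 2, "ts": 1},
]
for params in parameter_sets:
    costs = [alignment_cost(a, b, **params) for a, b in candidates]
    print(params, costs)
```

The optimal alignment changes with the parameter set, which is the point of the example: the alignment is a function of the cost regime, not of the sequences alone.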
The number of multiple alignments
n\m   2         3                  4               5
1     3         13                 75              541
2     13        409                23917           2244361
3     63        16081              10681263        14638756721
4     321       699121             5552351121      117629959485121
5     1683      32193253           3147728203035   1.05 x 10^18
10    8097453   9850349744182729   3.32 x 10^26    1.35 x 10^38
f(n, m) for 1 ≤ n ≤ 10; 1 ≤ m ≤ 5
Exact solutions are not computationally feasible for realistic numbers and lengths of sequences
From Slowinski, J. B. (1998). The number of multiple alignments. Mol. Phylogenet. Evol. 10, 264-266.
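The growth of f(n, m) can be reproduced with a short recurrence. The sketch below assumes that an alignment is an ordered series of columns, each holding a base from at least one sequence, so the count for a set of sequence lengths is the sum over every possible last column. The function names are hypothetical and this is not presented as Slowinski's own formulation.

```python
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def count_alignments(lengths):
    """Number of alignments of sequences with the given (sorted) lengths."""
    if not any(lengths):
        return 1  # nothing left to align: only the empty alignment
    nonempty = [i for i, n in enumerate(lengths) if n > 0]
    total = 0
    # The last column takes a base from every sequence in some non-empty subset.
    for size in range(1, len(nonempty) + 1):
        for subset in combinations(nonempty, size):
            shorter = list(lengths)
            for i in subset:
                shorter[i] -= 1
            # The count depends only on the multiset of lengths, so sort the key.
            total += count_alignments(tuple(sorted(shorter)))
    return total


def f(n, m):
    """Number of alignments of m sequences, each of length n."""
    return count_alignments((n,) * m)


for n in (1, 2, 3):
    print(n, [f(n, m) for m in (2, 3)])
```

Even with memoization this only counts alignments; enumerating and scoring them all is what becomes impossible as n and m grow.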
Heuristic solutions
• All pairwise comparisons are made
• A binary guide tree is generated
• Sequences are incorporated following the binary guide tree
Implementation
• CLUSTAL (Higgins & Sharp 1988; Jeanmougin et al. 1998)
• TREEALIGN (Hein 1989, 1990)
• MALIGN (Wheeler & Gladstein 1994, 1995)
• POY (Wheeler 1996): implied alignment
• COFFEE, etc.: profile alignments, iteration alignments, etc.
Sequence A TAAATTGCA
Sequence B AATTTGGGCCA
The Needleman-Wunsch algorithm: wavefront updating
Phillips, A., Janies, D. & Wheeler, W. C. Multiple sequence alignment in phylogenetic analysis. Molecular Phylogenetics and Evolution 16, 317-330 (2000).
The Needleman-Wunsch algorithm: traceback procedure and edit graph
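The matrix-filling and traceback steps can be sketched as follows. This is a minimal cost-minimizing version with hypothetical default costs (gap = 1, substitution = 1). It fills the matrix row by row rather than along anti-diagonal wavefronts, which yields the same cell values, and when several alignments are equally optimal the traceback returns only one of them.

```python
def needleman_wunsch(seq_a, seq_b, gap=1, substitution=1):
    """Return (cost, aligned_a, aligned_b) for one minimum-cost global alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * gap  # seq_a prefix aligned against gaps only
    for j in range(1, cols):
        cost[0][j] = j * gap  # seq_b prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            step = 0 if seq_a[i - 1] == seq_b[j - 1] else substitution
            cost[i][j] = min(cost[i - 1][j - 1] + step,  # match or substitution
                             cost[i - 1][j] + gap,       # gap in seq_b
                             cost[i][j - 1] + gap)       # gap in seq_a

    # Traceback: walk from the last cell back to the origin of the edit graph.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = 0 if seq_a[i - 1] == seq_b[j - 1] else substitution
            if cost[i][j] == cost[i - 1][j - 1] + step:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return cost[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


total, row_a, row_b = needleman_wunsch("TAAATTGCA", "AATTTGGGCCA")
print(total)
print(row_a)
print(row_b)
```

The order in which the three moves are tested during the traceback decides which of the equally optimal alignments is reported.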
Pairwise alignments: distance-based guide trees
A - ATTCG
B - AGCG
C - ACTCG
gap = 2
change = 1
A - ATTCG
B - AG-CG
A - ATTCG
C - ACTCG
B - AG-CG
C - ACTCG
Distance matrix:
d(A,B) = 3
d(A,C) = 1
d(B,C) = 3
B A C
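The distance matrix for sequences A, B and C, and the choice of the first pair to join in the guide tree, can be reproduced as below. This is a sketch assuming gap = 2 and change = 1 as stated above; the function name alignment_distance is hypothetical, and only the cost is computed, not the alignment itself.

```python
from itertools import combinations


def alignment_distance(seq_a, seq_b, gap=2, change=1):
    """Minimum pairwise alignment cost, keeping one matrix row at a time."""
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i, base_a in enumerate(seq_a, start=1):
        current = [i * gap]
        for j, base_b in enumerate(seq_b, start=1):
            step = 0 if base_a == base_b else change
            current.append(min(previous[j - 1] + step,  # match or change
                               previous[j] + gap,       # gap in seq_b
                               current[j - 1] + gap))   # gap in seq_a
        previous = current
    return previous[-1]


sequences = {"A": "ATTCG", "B": "AGCG", "C": "ACTCG"}
distances = {
    (x, y): alignment_distance(sequences[x], sequences[y])
    for x, y in combinations(sequences, 2)
}
closest = min(distances, key=distances.get)
print(distances)
print("join first:", closest)
```

The closest pair is aligned first and the remaining sequence is then added to that pair, which is the order the guide tree encodes.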
Pairwise alignments
A - ATTCG
B - AGCG
C - ACTCG
D - AATGG
Distance matrix:
d(A,B) = 3
d(A,C) = 1
d(A,D) = 2
d(B,C) = 3
d(B,D) = 4
d(C,D) = 2
? A C
d(AC,B) = 2
d(AC,D) = 2
d(B,D) = 4
A C
AYTCG
Multiple alignments based on parsimony
Alignment
I GGGG
II -GGG
III GAAG
IV -GAA
I GGGG
II GGG-
III GAAG
IV GAA-
Topology
Topology:       (((I II) III) IV)   (((I III) II) IV)   (((I IV) II) III)
Alignment 1:    7                   6                   8
Alignment 2:    6                   6                   8
Gap = 2; Substitution = 1
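The tree lengths for each combination of alignment and topology can be recomputed with Sankoff-style (cost matrix) parsimony, treating the gap as a fifth state. This is a sketch assuming gap = 2 and substitution = 1 as stated above, with trees written as nested tuples; the function names are hypothetical and this is not the implementation of any program named in these slides.

```python
STATES = "ACGT-"


def step_cost(x, y, gap=2, sub=1):
    """Cost of changing state x into state y along one branch."""
    if x == y:
        return 0
    return gap if "-" in (x, y) else sub


def subtree_costs(node, column, gap, sub):
    """Map each state to the minimum cost of the subtree if `node` has that state."""
    if isinstance(node, str):  # leaf: only the observed state is allowed
        return {s: (0 if s == column[node] else float("inf")) for s in STATES}
    left, right = (subtree_costs(child, column, gap, sub) for child in node)
    return {
        s: min(left[t] + step_cost(s, t, gap, sub) for t in STATES)
        + min(right[t] + step_cost(s, t, gap, sub) for t in STATES)
        for s in STATES
    }


def tree_cost(tree, alignment, gap=2, sub=1):
    """Total cost of an alignment on a tree, summed over columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: row[i] for name, row in alignment.items()}
        total += min(subtree_costs(tree, column, gap, sub).values())
    return total


alignment_1 = {"I": "GGGG", "II": "-GGG", "III": "GAAG", "IV": "-GAA"}
alignment_2 = {"I": "GGGG", "II": "GGG-", "III": "GAAG", "IV": "GAA-"}
topologies = [
    ((("I", "II"), "III"), "IV"),
    ((("I", "III"), "II"), "IV"),
    ((("I", "IV"), "II"), "III"),
]
for alignment in (alignment_1, alignment_2):
    print([tree_cost(tree, alignment) for tree in topologies])
```

Searching over alignments and topologies together, rather than fixing one alignment first, is what distinguishes parsimony-based multiple alignment from the guide-tree heuristics.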
GGGG GGG GAAG GAA
GGGG
-GGG
GAAG
-GAA
7
GGGG
GGG-
GAAG
GAA-
6
GGGG GAAG GGG GAA
GGGG
GAAG
-GGG
-GAA
6
GGGG
GAAG
GGG-
GAA-
6
Gap = 2; Substitution = 1
• Eliminate gappy regions
• Eliminate ambiguous regions: Cull (Gatesy et al. 1993)
• Differential weighting of ambiguous regions: Elision (Wheeler et al. 1995)
• Exploring multiple analytical parameters: Sensitivity analysis (Wheeler 1995):
– Multiple parameters are evaluated (gaps, substitutions, relative weights of partitions, etc.)
– Allows testing for node stability
– May require a second optimality criterion to choose a parameter set within the parameter space (e.g. character congruence, topological congruence, etc.)
ACTGACT
A-TGACT
AC-GACT
ACT-ACT
ACTG-CT
ACTGACT
A-AGACT
ACTGACT
AA-GACT
AGACT
AGACT
ACTGACT
A-AGACT
ACTGACT
AA-GACT
1 0.5 1
A CT CT GACT
A -A A- GACT
• Alignments are difficult: they require heuristic implementations
• Alignments are parameter-dependent (Fitch & Smith 1983; Waterman et al. 1992)
Different parameters
Different alignments
Different phylogenetic hypotheses
GGGG
GGG(G)
GGG
GAAG
GRRG GAA
GAA(G)
2 indels + 2 substitutions = 6
GGGG
GRR(G)
GAA
GGG
GGG GAAG
GR(A)G
2 indels + 3 substitutions = 7
Gap cost = 2
Base substitution = 1
Direct optimization:
I GGGG
II GGG
III GAAG
IV GAA
(((I II) III) IV)
(((I IV) II) III)
(((I III) II) IV) …
Probabilistic “alignments”
• Require the implementation of indels in evolutionary models
• The TKF model (Thorne, Kishino & Felsenstein 1991)
• Tree HMMs (Mitchison 1999); Evolutionary HMMs (Holmes & Bruno 2001)
• POY-ML (Wheeler et al. 2002) (see also Fleissner et al. 2005)
• Simultaneous Bayesian inference of trees and alignments (Redelings & Suchard 2005)
Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts.
Wheeler, W. C., D. Gladstein and J. De Laet. 2002. POY version 3.0. Ver. Program and documentation available at ftp.amnh.org/pub/molecular. American Museum of Natural History.
Redelings, B. D. and M. A. Suchard. 2005. Joint Bayesian estimation of alignment and phylogeny. Systematic Biology 54: 401-418.
Fleissner, R., D. Metzler and A. von Haeseler. 2005. Simultaneous statistical multiple alignment and phylogeny reconstruction. Systematic Biology 54: 548-561.