Multiple Sequence Alignments and Phylogeny
Bioinformatics, Dr. Víctor Treviño, vtrevino@itesm.mx
![Page 2: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/2.jpg)
SEQUENCE SIMILARITY
Within a protein sequence, some regions will be more conserved than others. The more conserved a region is, the more important it tends to be:
for function
for 3D structure
for localization
for modification
for interaction
for regulation/control
for transcriptional regulation (in DNA)
REASONS TO PERFORM SEQUENCE SIMILARITY ANALYSIS AND SEARCHES
![Page 3: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/3.jpg)
SEQUENCE ALIGNMENT
Procedure for comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for similar patterns that are in the same order in the sequences
Identical residues (nt or aa) are placed in the same column
Non-identical residues can be placed in the same column or indicated as gaps
Sources: Wikipedia; http://www-personal.umich.edu/~lpt/fgf/fgfrcomp.htm; Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
[Figure label: overall similarity]
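The column rules on this slide (identical residues share a column, non-identical residues share a column or face a gap) can be illustrated with a short Python sketch. It assumes two rows that are already aligned, with "-" as the gap character; the function name `classify_columns` and the marker symbols are illustrative choices, not something taken from the slides.

```python
def classify_columns(row_a, row_b):
    """Mark each column of a pair-wise alignment.

    "|" = identical residues, "." = non-identical residues, " " = gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    marks = []
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            marks.append(" ")   # one sequence has a gap in this column
        elif x == y:
            marks.append("|")   # identical residues in the same column
        else:
            marks.append(".")   # non-identical residues in the same column
    return "".join(marks)


row_a = "GATTACA"
row_b = "GA-TGCA"
print(row_a)
print(classify_columns(row_a, row_b))
print(row_b)
```

Counting the "|" marks and dividing by the number of columns gives a simple percent-identity figure for the pair.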
![Page 4: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/4.jpg)
MULTIPLE SEQUENCE ANALYSIS – ADDITIONAL USES
Interesting regions
Promoter regions
Consensus sequence for probe design
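As a hedged illustration of the consensus-sequence use, the sketch below takes the most frequent residue in each column of an existing alignment. The majority threshold `min_fraction`, the placeholder symbol "N", and the function name are assumptions made for this example; real probe design would apply additional criteria.

```python
from collections import Counter


def consensus(alignment, min_fraction=0.5, unknown="N"):
    """Return a simple majority consensus of equally long aligned rows."""
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        # Keep the residue only if it is a strict majority and not a gap.
        if residue != "-" and count / len(column) > min_fraction:
            result.append(residue)
        else:
            result.append(unknown)
    return "".join(result)


sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
print(consensus(sites))
```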
![Page 5: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/5.jpg)
Multiple Sequence Alignment - MSA
![Page 6: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/6.jpg)
MULTIPLE SEQUENCE ALIGNMENT - MSA
Dynamic programming is designed for two sequences
It would take quite a long time for three or more (see MSA program)
[Figure axes: Sequence A, Sequence B, Sequence C]
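To give a rough sense of why full dynamic programming becomes slow beyond two sequences, the sketch below counts the cells of the scoring lattice and the number of predecessor moves examined per cell. It assumes the straightforward extension of the two-sequence method to N sequences with no pruning; the function name `dp_work` is hypothetical.

```python
def dp_work(lengths):
    """Return (cells, moves_per_cell) for a full N-dimensional DP lattice."""
    cells = 1
    for length in lengths:
        cells *= length + 1          # one extra row/column for the empty prefix
    moves = 2 ** len(lengths) - 1    # each sequence either advances or takes a gap
    return cells, moves


for n in (2, 3, 4):
    cells, moves = dp_work([300] * n)
    print(n, "sequences:", cells, "cells,", moves, "moves per cell")
```

For sequences of 300 residues this goes from about ninety thousand cells for a pair to about twenty-seven million for three sequences.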
![Page 8: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/8.jpg)
MULTIPLE SEQUENCE ALIGNMENT – METHODS
Extensions of sequence pair alignment: MSA
Progressive Methods: CLUSTALW
Iterative Methods
Hidden Markov Models (HMM)
![Page 9: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/9.jpg)
MULTIPLE SEQUENCE ALIGNMENT - MSA Algorithm
1. Calculate all pair-wise alignment scores (alignment costs).
2. Use the scores (costs) to predict a tree.
3. Calculate pair weights based on the tree.
4. Produce a heuristic MSA based on the tree.
5. Calculate the maximum epsilon for each sequence pair.
6. Determine the spatial positions that must be calculated to obtain the optimal alignment.
7. Perform the optimal alignment.
8. Report the epsilon found compared to the maximum epsilon.
Epsilon for a given sequence pair is the difference between the score of the alignment of that pair in the MSA and the score of the optimal pair-wise alignment. The bigger the value of epsilon, the more divergent the MSA is from the pair-wise alignment and the smaller the contribution of that alignment to the MSA. For example, if an extra copy of one of the sequences is added to the alignment project, then epsilon for sequence pairs that do not include that sequence will increase, indicating a lesser role because the contributions of that pair have been out-voted by the alike sequences.
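The definition of epsilon can be made concrete with a small sketch. It assumes a deliberately simple cost scheme (match 0, mismatch 1, gap 1, linear gaps), which is an illustrative choice and not the cost scheme of the MSA program; the function names are hypothetical. Epsilon is computed as the cost of a pair as it appears inside the multiple alignment minus the cost of that pair's optimal pair-wise alignment.

```python
def optimal_pair_cost(a, b, mismatch=1, gap=1):
    """Minimum alignment cost of two ungapped sequences (dynamic programming)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            curr.append(min(sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def projected_pair_cost(row_a, row_b, mismatch=1, gap=1):
    """Cost of the pair as aligned inside the MSA (columns with two gaps are skipped)."""
    cost = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            cost += gap
        elif x != y:
            cost += mismatch
    return cost


def epsilon(row_a, row_b):
    """Projected cost minus optimal pair-wise cost; zero means the MSA is optimal for the pair."""
    a = row_a.replace("-", "")
    b = row_b.replace("-", "")
    return projected_pair_cost(row_a, row_b) - optimal_pair_cost(a, b)


msa = ["ACGT-A", "A-GTTA", "ACG-TA"]
print(epsilon(msa[0], msa[1]), epsilon(msa[0], msa[2]))
```

In this toy alignment the first and third rows are identical sequences forced apart by gaps, so their epsilon is positive, while the first and second rows are aligned as well as they could be on their own.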
![Page 10: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/10.jpg)
PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT
Dynamic programming is designed for two sequences
It would take quite a long time for three or more (see MSA program)
Therefore…
1. Align all sequences pair-wise
2. Determine the "distances" between each pair
3. Align the two most similar sequences and keep the resulting alignment
4. Take the next most similar sequence or group and repeat the same steps until all sequences have been included
5. E.g.
1. (S3+S4)=c1
2. (S1+S2)=c2
3. (c1+c2)=c3
4. (c3+S5)=final
[Figure labels: S1, S2, S3, S4, S5]
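The joining order in the example can be sketched in Python. The sketch only derives the order in which sequences and groups would be aligned; it does not perform the alignments. It assumes average distances between groups, and the distance values below are invented so that the output reproduces the S1 to S5 example; CLUSTALW itself builds its guide tree differently.

```python
from itertools import combinations


def progressive_order(names, dist):
    """Return merge steps (x, y, label); dist maps frozenset({a, b}) to a distance."""
    clusters = {name: (name,) for name in names}   # label -> member sequences

    def avg(x, y):
        # Average distance between all members of two groups.
        pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    steps = []
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda p: avg(*p))
        label = "final" if len(clusters) == 2 else f"c{len(steps) + 1}"
        clusters[label] = clusters.pop(x) + clusters.pop(y)
        steps.append((x, y, label))
    return steps


names = ["S1", "S2", "S3", "S4", "S5"]
# Hypothetical distances: S5 is far from everything, S3/S4 and S1/S2 are close pairs.
dist = {frozenset(p): 8.0 for p in combinations(names, 2)}
dist.update({frozenset(p): 4.0 for p in [("S1", "S3"), ("S1", "S4"), ("S2", "S3"), ("S2", "S4")]})
dist[frozenset(("S3", "S4"))] = 1.0
dist[frozenset(("S1", "S2"))] = 2.0

for x, y, label in progressive_order(names, dist):
    print(f"({x}+{y})={label}")
```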
![Page 11: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/11.jpg)
PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT - CLUSTALW
[Figure: CLUSTALW method. Labels: "(then normalized to largest = 1)"; "Alignment score for column"]
![Page 12: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/12.jpg)
PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT - CLUSTALW
![Page 13: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/13.jpg)
PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT - PROBLEMS
Dependency on the most similar sequences
Nested problems when the most similar sequences are actually different
So CLUSTALW works best for closely related sequences
Choice of suitable scoring matrices
![Page 14: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/14.jpg)
ITERATIVE MULTIPLE SEQUENCE ALIGNMENT
Try to correct for the dependency on the most similar sequences in progressive methods
Repeatedly realign subgroups, then align these on the global alignment
Based on tree ordering, separation of sequences, or random group selection
![Page 15: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/15.jpg)
ITERATIVE MULTIPLE SEQUENCE ALIGNMENT
![Page 16: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/16.jpg)
MULTIPLE SEQUENCE ALIGNMENT
![Page 17: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/17.jpg)
SEQUENCE ALIGNMENT - PROGRAMS
![Page 19: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/19.jpg)
PHYLOGENY ANALYSIS AND PREDICTION FROM DNA/PROTEIN SEQUENCES
Determination of how the family might have been derived during evolution
Sequences are depicted as branches on a tree
Very similar sequences are located as neighbours in a branch
The goal is to discover all the branching relationships and the branch lengths
![Page 20: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/20.jpg)
PHYLOGENY ANALYSIS AND PREDICTION FROM DNA/PROTEIN SEQUENCES
Phylogenetic relationships among the genes can help to predict which ones might have an equivalent function.
Phylogenetic analysis may also be used to follow the changes occurring in a rapidly changing species, such as a virus
Important for discovering function, 3D structure, localization, modification, interaction, regulation/control, transcriptional regulation
![Page 21: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/21.jpg)
PHYLOGENY ANALYSIS AND PREDICTION FROM DNA/PROTEIN SEQUENCES
Related to SEQUENCE ALIGNMENT
![Page 22: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/22.jpg)
SEQUENCE SIMILARITY – EVOLUTIONARY RELATIONSHIP
![Page 23: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/23.jpg)
GENOME COMPLEXITY
![Page 24: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/24.jpg)
GENOME COMPLEXITY
![Page 25: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/25.jpg)
EVOLUTIONARY TREE
An evolutionary tree is a two-dimensional graph showing evolutionary relationships among organisms
The separate sequences are referred to as taxa (singular taxon), defined as phylogenetically distinct units on the tree
The tree is composed of outer branches (or leaves) representing the taxa and nodes and branches representing relationships among the taxa
![Page 26: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/26.jpg)
EVOLUTIONARY TREE
A and B are derived from a common ancestor
Each node in the tree represents a splitting of the evolutionary path of the gene into two different species that are isolated reproductively
![Page 27: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/27.jpg)
EVOLUTIONARY TREE
Beyond splitting, any further evolutionary changes in each new branch are independent of those in the other new branch
The length of each branch to the next node represents the number of sequence changes that occurred prior to the next level of separation
![Page 28: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/28.jpg)
EVOLUTIONARY TREE
Uniform mutation rate: Molecular Clock Hypothesis, suitable for closely related species
Special cases could use non-uniform rates
The root is defined by including a taxon that we are reasonably sure branched off earlier than the other taxa
![Page 29: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/29.jpg)
EVOLUTIONARY TREE
The sum of all the branch lengths in a tree is referred to as the tree length.
The tree is also a bifurcating or binary tree, in that only two branches emanate from each node.
Trees can have more than two branches emanating from a node if the events separating taxa are so close that they cannot be resolved, or to simplify the tree.
The unrooted tree also shows the evolutionary relationships among sequences A–D, but it does not reveal the location of the oldest ancestry.
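The notions of tree length and of a bifurcating tree can be checked on a small data structure. The sketch below assumes a tree written as nested `(content, branch_length)` pairs, where `content` is either a taxon name or a list of child nodes; this representation, the function names, and the branch lengths are illustrative only.

```python
def tree_length(node):
    """Sum of all branch lengths in the tree."""
    content, length = node
    if isinstance(content, str):          # a leaf (taxon)
        return length
    return length + sum(tree_length(child) for child in content)


def is_bifurcating(node):
    """True if every internal node has exactly two branches emanating from it."""
    content, _ = node
    if isinstance(content, str):
        return True
    return len(content) == 2 and all(is_bifurcating(child) for child in content)


# Rooted tree ((A, B), (C, D)) with hypothetical branch lengths; the root has length 0.
tree = ([([("A", 1.0), ("B", 2.0)], 0.5),
         ([("C", 1.5), ("D", 1.5)], 1.0)], 0.0)

print(tree_length(tree), is_bifurcating(tree))
```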
![Page 30: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/30.jpg)
EVOLUTIONARY TREE
The number of possible rooted trees increases very rapidly with the number of sequences or taxa
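How rapidly the count grows can be shown with the standard formulas for strictly bifurcating trees with labeled taxa: (2n−3)!! rooted trees and (2n−5)!! unrooted trees for n taxa. The short sketch below evaluates them; the function names are illustrative.

```python
def odd_double_factorial(k):
    """Product 1 * 3 * 5 * ... * k for odd k (returns 1 for k <= 1)."""
    result = 1
    for i in range(3, k + 1, 2):
        result *= i
    return result


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 labeled taxa."""
    return odd_double_factorial(2 * n - 3)


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 labeled taxa."""
    return odd_double_factorial(2 * n - 5)


for n in (3, 4, 5, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```

With only 10 taxa there are already more than 34 million rooted trees, which is why methods that examine every tree are limited to small data sets.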
![Page 31: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/31.jpg)
METHODS TO BUILD EVOLUTIONARY TREES
To find the evolutionary tree or trees that best account for the observed variation in a group of sequences
Maximum Parsimony
Distance
Maximum Likelihood
![Page 32: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/32.jpg)
METHOD SELECTION
![Page 33: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/33.jpg)
CONSIDERATIONS
The alignment should not contain a large number of gaps
Phylogenetic methods analyze conserved regions that are represented in all the sequences (Local Alignments)
![Page 34: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/34.jpg)
MAXIMUM PARSIMONY (OR MINIMUM EVOLUTION)
Predicts the evolutionary tree by minimizing the number of steps required to generate the observed sequence changes
Requires a multiple sequence alignment
The method examines each informative position and each possible tree
Informative position: the same residue in at least two sequences, but not in all
Used for sequences that are quite similar and for a small number of sequences
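Finding the informative positions is the first step and is easy to sketch. Note the assumption: the code uses the usual textbook definition, which is slightly stricter than the wording on the slide, namely that a column is informative when at least two different residues each occur in at least two sequences. The four sequences are a made-up example and the function name is hypothetical.

```python
from collections import Counter


def informative_columns(alignment):
    """Return 0-based indices of parsimony-informative columns."""
    informative = []
    for idx, column in enumerate(zip(*alignment)):
        counts = Counter(res for res in column if res != "-")
        shared = [res for res, n in counts.items() if n >= 2]
        if len(shared) >= 2:      # two or more residues, each seen at least twice
            informative.append(idx)
    return informative


alignment = [
    "AAGAGTGCA",
    "AGCCGTGCG",
    "AGATATCCA",
    "AGAGATCCG",
]
print(informative_columns(alignment))
```

Columns where all residues agree, or where only one sequence differs, cannot favor one tree over another and are skipped.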
![Page 35: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/35.jpg)
MAXIMUM PARSIMONY (OR MINIMUM EVOLUTION)
[Figure label: Noninformative]
![Page 36: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/36.jpg)
DISTANCE METHODS
Employ the number of changes between each pair of sequences
Sequence pairs that have the smallest number of sequence changes are "neighbours" sharing a node in the tree
Closely related to the multiple sequence alignment method (CLUSTALW), which produces DISTANCE MATRICES that are then analysed by distance methods
Remember Distance vs Similarity (and gaps)
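A minimal sketch of how a distance matrix can be obtained from an alignment follows. It assumes the simplest measure, the proportion of differing positions (p-distance), and it skips any column where either sequence has a gap; both choices are assumptions for illustration, and programs such as CLUSTALW apply their own corrections. Similarity is then simply 1 minus the distance.

```python
def p_distance(row_a, row_b):
    """Fraction of compared (ungapped) columns in which two aligned rows differ."""
    compared = differences = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            continue                  # gapped columns are ignored here
        compared += 1
        differences += x != y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def distance_matrix(alignment):
    """Symmetric matrix of pair-wise p-distances."""
    return [[p_distance(a, b) for b in alignment] for a in alignment]


rows = ["ACGTACGT", "ACGTACGA", "AC-TTCGA"]
for line in distance_matrix(rows):
    print(["%.3f" % value for value in line])
```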
![Page 37: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/37.jpg)
DISTANCE METHODS
[Figure label: "Idealized"]
![Page 38: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/38.jpg)
DISTANCE ALGORITHMS
Fitch and Margoliash Method
Neighbor-joining Method
Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
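As one worked illustration from this list, the sketch below performs only the first joining step of neighbor-joining. It uses the standard criterion Q(i, j) = (n − 2)·d(i, j) − Σ d(i, k) − Σ d(j, k), picks the pair with the smallest Q, and computes the two branch lengths to the new node. The five-taxon distance matrix and the function name are made up for the example; a complete implementation would repeat the step on a reduced matrix.

```python
def nj_first_join(names, d):
    """Return (taxon_i, taxon_j, branch_i, branch_j) for the first neighbor-joining step."""
    n = len(names)                       # requires n >= 3
    totals = [sum(row) for row in d]     # sum of distances from each taxon
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    branch_i = d[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    branch_j = d[i][j] - branch_i
    return names[i], names[j], branch_i, branch_j


names = ["a", "b", "c", "d", "e"]
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_first_join(names, d))
```

Here a and b are joined first even though d and e have the smallest raw distance, which shows how neighbor-joining differs from simply clustering the closest pair as UPGMA does.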
![Page 39: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/39.jpg)
DISTANCE ALGORITHM
Choosing an outgroup improves prediction because the methods are informed about the "order" of the outgroup
![Page 40: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/40.jpg)
MAXIMUM LIKELIHOOD
Uses the probability of the number of sequence changes
Analysis is performed for each informative residue (as in maximum parsimony)
All possible trees are considered (so it is suitable for a small number of sequences)
Considers variations in mutation rates, so it can be used for more distant sequences
Main disadvantage: Computation Time
![Page 41: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/41.jpg)
MAXIMUM LIKELIHOOD
Needs a model that provides estimates of substitution rates for each residue pair
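To show what such a model looks like in the simplest case, the sketch below uses the Jukes-Cantor (JC69) model for DNA, which assumes equal base frequencies and one rate for every substitution. This model is chosen here only as an illustration and is not named on the slides. For two aligned sequences the code scans candidate distances and keeps the one with the highest likelihood; the function names and the grid search are illustrative choices.

```python
import math


def jc69_log_likelihood(n_sites, n_diff, d):
    """Log-likelihood of two sequences separated by d substitutions per site (JC69)."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e      # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e      # probability of one specific different base
    n_same = n_sites - n_diff
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def jc69_ml_distance(n_sites, n_diff, step=0.001, max_d=3.0):
    """Distance on a grid that maximizes the JC69 likelihood."""
    grid = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(n_sites, n_diff, d))


n_sites, n_diff = 100, 10
estimate = jc69_ml_distance(n_sites, n_diff)
closed_form = -0.75 * math.log(1 - 4 * (n_diff / n_sites) / 3)
print(round(estimate, 3), round(closed_form, 3))
```

The grid estimate agrees with the known closed-form JC69 distance, and both are larger than the raw proportion of differences because the model accounts for multiple changes at the same site.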
![Page 42: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/42.jpg)
RELIABILITY OF PHYLOGENETIC PREDICTIONS
Bootstrap method: randomly resampling the columns of the alignment, with replacement (robustness test)
Good evidence if more than 70% of the predictions are conserved
Then collapse branches and confirm tree length
Compare distinct methods and parameters
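The bootstrap resampling step can be sketched as follows. The code assumes the common form of the bootstrap, in which whole alignment columns are drawn with replacement to build a pseudo-alignment of the same length. The tree-building step is left as a hypothetical callable `has_group` supplied by the user, since the slides do not prescribe one; it should return True when the grouping of interest appears in the tree built from a replicate.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the alignment length."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in picks) for row in alignment]


def bootstrap_support(alignment, has_group, replicates=100, seed=0):
    """Fraction of replicates in which has_group(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(bool(has_group(bootstrap_alignment(alignment, rng)))
               for _ in range(replicates))
    return hits / replicates


alignment = ["ACGTAC", "ACGTTC", "TCGAAC"]
replicate = bootstrap_alignment(alignment, random.Random(1))
print(replicate)
```

A support value above 0.7 from `bootstrap_support` would correspond to the 70% rule of thumb mentioned on the slide.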
![Page 43: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/43.jpg)
"CLASSIC" PROGRAMS
PHYLIP: http://evolution.genetics.washington.edu/phylip.html
PAUP: http://paup.csit.fsu.edu/downl.html
Phylemon: http://phylemon.bioinfo.cipf.es/cgi-bin/tools.cgi
![Page 45: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/45.jpg)
WEB SERVICES
http://bioinformatics.ca/links_directory/index.php?search=phylogeny&submit=Search+Directory
![Page 46: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/46.jpg)
![Page 48: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/48.jpg)
EXERCISE/HOMEWORK
Select a gene
Get the sequence in at least 7 species
Select a site (Phylemon)
Perform the multiple sequence alignment (ClustalW)
Perform Phylogeny to obtain a tree
At least 2 tree methods
At least 3 parameter changes
Take DNA/Protein
Report results and discussion
12 MSA+Trees
![Page 49: Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx](https://reader036.vdocument.in/reader036/viewer/2022062305/5681645f550346895dd635d9/html5/thumbnails/49.jpg)
PAPERS TO REVIEW
Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis – Löytynoja, Goldman, Science 2008
Insertions and deletions treated as different events