sequence comparison with dot matrices
-
8/2/2019 Sequence Comparison With Dot Matrices
Computational Biology, Part 2: Sequence Comparison with Dot Matrices
Robert F. Murphy
Copyright 1996, 1999-2006. All rights reserved.
-
Sequence Alignment
Definition: Procedure for comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences
Pair-wise alignment: compare two sequences
Multiple sequence alignment: compare more than two sequences
-
Example sequence alignment
Task: align abcdef with abdgf
Write second sequence below the first
abcdef
abdgf
Move sequences to give maximum match between them
Show characters that match using vertical bar
-
Example sequence alignment
abcdef
||
abdgf
Insert gap between b and d on lower
sequence to allow d and f to align
-
Example sequence alignment
abcdef
|| | |
ab-dgf
-
Example sequence alignment
abcdef
|| | |
ab-dgf
Note e and g don't match
-
Matching: Similarity vs. Identity
Alignments can be based on finding only identical characters, or (more commonly) can be based on finding similar characters
More on how to define similarity later
-
Global vs. Local Alignment
We distinguish
Global alignment algorithms, which optimize overall alignment between two sequences
Local alignment algorithms, which seek only relatively conserved pieces of sequence
Alignment stops at the ends of regions of strong similarity
Favors finding conserved patterns in otherwise different pairs of sequences
-
Global vs. Local Alignment
Global
LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA
Local
--------GKG--------
        |||
--------GKG--------
-
Why do sequence alignments?
To find whether two (or more) genes or
proteins are evolutionarily related to each
other
To find structurally or functionally similar
regions within proteins
-
Origin of similar genes
Similar genes arise by gene duplication
Copy of a gene inserted next to the original
Two copies mutate independently
Each can take on separate functions
All or part can be transferred from one part of genome to another
-
Methods for Pairwise Alignment
Dot matrix analysis
Dynamic Programming
Word or k-tuple methods (FASTA and BLAST)
-
Sequence comparison with dot matrices
Goal: Graphically display regions of similarity between two sequences (e.g., domains in common between two proteins of suspected similar function)
-
Sequence comparison with dot matrices
Basic Method: For two sequences of lengths M and N, lay out an M by N grid (matrix) with one sequence across the top and one sequence down the left side. For each position in the grid, compare the sequence elements at the top (column) and to the left (row). If and only if they are the same, place a dot at that position.
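The basic method can be sketched in a few lines of Python. This is a minimal illustration, not the spreadsheet used in the demonstrations: the function names dot_matrix and render are made up for this example, and the two sequences are the abcdef / abdgf pair from the earlier alignment example.

```python
def dot_matrix(top, left):
    """Return one row per character of `left` and one column per character of `top`.

    A cell is True if and only if the row and column characters are identical.
    (Function name and layout are illustrative assumptions.)
    """
    return [[a == b for b in top] for a in left]


def render(top, left):
    """Draw the grid as text: '*' marks a dot, '.' marks an empty cell."""
    grid = dot_matrix(top, left)
    lines = ["  " + top]
    for ch, row in zip(left, grid):
        lines.append(ch + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("abcdef", "abdgf"))
```

In the printed grid, the a/b dots sit on the main diagonal, while the d and f dots sit on a diagonal shifted one column to the right. That shift corresponds to the gap inserted in the lower sequence.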
-
Examples for protein sequences
(Demonstration A6, Sequence 1 vs. 2)
(Demonstration A6, Sequence 2 vs. 3)
-
Interpretation of dot matrices
Regions of similarity appear as diagonal
runs of dots
Reverse diagonals (perpendicular to diagonal) indicate inversions
Reverse diagonals crossing diagonals (Xs) indicate palindromes
(Demonstration A6, Sequence 4 vs. 4)
-
Interpretation of dot matrices
Can link or "join" separate diagonals to
form alignment with "gaps"
Each a.a. or base can only be used once
Can't trace vertically or horizontally
Can't double back
A gap is introduced by each vertical or horizontal skip
-
Uses for dot matrices
Can use dot matrices to align two proteins
or two nucleic acid sequences
Can use to find amino acid repeats within a protein by comparing a protein sequence to itself
Repeats appear as a set of diagonal runs stacked vertically and/or horizontally
(Demonstration A6, Sequence 5 vs. 6)
-
Uses for dot matrices
Can use to find self base-pairing of an RNA (e.g., tRNA) by comparing a sequence to itself complemented and reversed
Excellent approach for finding sequence transpositions
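The self base-pairing comparison can be sketched as follows. This is an illustration under stated assumptions: only Watson-Crick pairs (A-U, G-C) are counted, G-U wobble pairs are ignored, the function names are made up, and the short RNA is a hypothetical hairpin rather than a real tRNA.

```python
# Watson-Crick complements only; G-U wobble pairing is deliberately ignored here.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Complement every base and reverse the order."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def pairing_matrix(rna):
    """Dot matrix of `rna` (across the top) against its reverse complement (down the side).

    A dot at row i, column j means base j can pair with base len(rna) - 1 - i,
    so a diagonal run of dots marks a stretch that could form a stem.
    """
    rc = reverse_complement(rna)
    return [[a == b for b in rna] for a in rc]


rna = "GGGAAAUCCC"  # hypothetical hairpin: GGG can pair with CCC
for base, row in zip(reverse_complement(rna), pairing_matrix(rna)):
    print(base, "".join("*" if hit else "." for hit in row))
```

A run of dots along a diagonal in this matrix suggests that the corresponding stretch of the RNA could fold back and pair with another stretch of the same molecule.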
-
Filtering to remove noise
A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches (e.g., a single A matching another A)
Solution: use a window and a threshold
compare character by character within a window (have to choose window size)
require a certain fraction of matches within the window in order to display it with a dot
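A windowed filter along these lines might look like the sketch below. Two choices here are assumptions rather than something the slides specify: the threshold is given as a minimum count of matches (called stringency, as in the DNA Strider settings later in the deck) instead of a fraction, and the dot is recorded at the start of the window rather than at its center.

```python
def windowed_dot_matrix(top, left, window, stringency):
    """Place a dot only where a whole window contains enough matches.

    Cell (i, j) compares left[i:i+window] with top[j:j+window] position by
    position and is True when at least `stringency` positions are identical.
    (Names and the start-of-window convention are illustrative assumptions.)
    """
    rows = len(left) - window + 1
    cols = len(top) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(left[i + k] == top[j + k] for k in range(window))
            row.append(matches >= stringency)
        grid.append(row)
    return grid


# With window=1 and stringency=1 this reduces to the unfiltered dot matrix.
for row in windowed_dot_matrix("ACGTACGT", "ACGTACGT", window=4, stringency=4):
    print("".join("*" if hit else "." for hit in row))
```

Raising the window size while keeping the required fraction of matches fixed removes isolated chance matches but keeps longer diagonal runs.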
-
Example spreadsheet with window
(Demonstration A7)
-
How do we choose a window size?
Window size changes with goal of analysis
size of average exon
size of average protein structural element
size of gene promoter
size of enzyme active site
-
How do we choose a threshold value?
Threshold based on statistics
using shuffled actual sequence
find average (m) and s.d. (σ) of match scores of shuffled sequence
convert original (unshuffled) scores (x) to Z scores
Z = (x - m)/σ
use threshold Z of 3 to 6
using analysis of other sets of sequences
provides objective standard of significance
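The shuffling approach can be sketched as follows. Several details are assumptions made for illustration: the match score is the number of identical positions in a window, only one of the two sequences is shuffled, scores from several shuffles are pooled, and the population standard deviation is used. The sequences and function names are hypothetical.

```python
import random
import statistics


def window_scores(top, left, window):
    """Match count for every pair of windows (one from each sequence)."""
    return [
        sum(left[i + k] == top[j + k] for k in range(window))
        for i in range(len(left) - window + 1)
        for j in range(len(top) - window + 1)
    ]


def shuffled_background(top, left, window, n_shuffles=20, seed=0):
    """Mean (m) and s.d. of window scores after shuffling `left`.

    Shuffling keeps the composition of `left` but destroys its order, so the
    scores reflect chance matches only. Pooling n_shuffles is an assumption.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_shuffles):
        shuffled = list(left)
        rng.shuffle(shuffled)
        scores.extend(window_scores(top, "".join(shuffled), window))
    return statistics.mean(scores), statistics.pstdev(scores)


def z_score(x, m, sd):
    """Z = (x - m) / sd; sd must be non-zero."""
    return (x - m) / sd


# Hypothetical sequences, used only to show the calls.
m, sd = shuffled_background("GATTACAGATTACCA", "GATCACAGGTTACA", window=5)
print("background mean:", m, "s.d.:", sd)
```

A real window score x would then be displayed as a dot only if z_score(x, m, sd) exceeds the chosen threshold, for example a value between 3 and 6.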
-
Dot matrix analysis with DNA Strider (Mount, Fig 3.4)
Get phage λ cI and phage P22 c2 repressor sequences from GenBank (X00166 and V01153 respectively)
Use DNA Strider 1.4 (contact TA to get a copy)
Use window size of 11 and stringency of 7
-
Dot matrix (Mount Fig 3.4)
Note set of diagonals in lower right that do not line up due to insertion near 475 on cI
[Figure: dot matrix plot]
-
Dot matrix analysis with DNA Strider (Mount, Fig 3.6)
Get human LDL receptor protein sequence from GenBank (P01130)
Use weighting Identity
Use window size of 1 and stringency of 1
Use window size of 23 and stringency of 7
-
Dot matrix (Mount Fig 3.6)
W=1 S=1
Note set of stacked diagonals in upper left
[Figure: dot matrix plot]
-
Dot matrix (Mount Fig 3.6)
W=23 S=7
Note set of stacked diagonals in upper left
[Figure: dot matrix plot]
-
Reading for next class
Mount, Chapter 3 through page 93
Look over paper by Needleman and Wunsch on web site
(03-510/710) Durbin et al., pp 17-32