Pairwise alignment II
Agenda
- Previous Lesson: Administration + Introduction
- Review Dynamic Programming
- Pairwise Alignment Biological Motivation
Today:
- Quick Review: Sequence Alignment (Global, Local, Variants).
- Heuristic Search
Ilan Smoly and Dan Gusfield, WABI 2015 in Atlanta, Georgia
Sequence Comparison (cont)
• We seek the following similarities between sequences:
• Find similar proteins – allows us to predict function and structure
• Locate similar subsequences in DNA – allows us to identify (e.g.) regulatory elements
• Locate DNA sequences that might overlap – helps in sequence assembly
Comparison methods
• Global alignment – Finds the best alignment across the entire length of both sequences.
• Local alignment – Finds regions of similarity in parts of the sequences.
Global Local
_____ _______ __ ____
__ ____ ____ __ ____
Global Alignment
• Algorithm of Needleman and Wunsch (1970)
• Finds the alignment of two complete sequences:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
• Some global alignment programs “trim ends”
Local Alignment
• Algorithm of Smith and Waterman (1981).
• Makes an optimal alignment of the best segment of
similarity between two sequences.
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
• Can return a number of highly aligned segments.
Global Alignment: Algorithm
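C(i,j) = cost of an optimum alignment of S[1..i] and T[1..j], where
S[1..i] = prefix of S of length i
T[1..j] = prefix of T of length j

$$ w(a,b) = \begin{cases} w_{\mathrm{match}} & \text{if } a = b \\ w_{\mathrm{mismatch}} & \text{if } a \neq b \end{cases} $$

Theorem. C(i,j) satisfies the following relationships:
Initial conditions:
$$ C(i,0) = i \cdot g, \qquad C(0,j) = j \cdot g $$
Recurrence relation: For $1 \le i \le n$, $1 \le j \le m$:
$$ C(i,j) = \max \begin{cases} C(i-1,j-1) + w(S_i, T_j) \\ C(i-1,j) + g \\ C(i,j-1) + g \end{cases} $$
Here g is the score of aligning a character with a space.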
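The initial conditions and recurrence can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `global_alignment_table` and its parameter names are assumptions made here, and the default scores (+10 match, -2 mismatch, -5 space) are borrowed from the worked example in these slides.

```python
def global_alignment_table(s, t, match=10, mismatch=-2, gap=-5):
    """Fill the (len(s)+1) x (len(t)+1) table C of global alignment scores.

    C[i][j] is the best score for aligning the prefix s[:i] with t[:j].
    Names and default scores are illustrative assumptions.
    """
    n, m = len(s), len(t)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    # Initial conditions: a prefix aligned against nothing is all spaces.
    for i in range(1, n + 1):
        C[i][0] = i * gap
    for j in range(1, m + 1):
        C[0][j] = j * gap
    # Recurrence: diagonal (match/mismatch), up (space in t), left (space in s).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + w,
                          C[i - 1][j] + gap,
                          C[i][j - 1] + gap)
    return C


table = global_alignment_table("CATTCAC", "CTCGCAGC")
print(table[-1][-1])  # optimal global score: 33
```

The bottom-right entry is the optimal score; the rest of the table is kept so that an alignment can be recovered by traceback.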
Computation Procedure
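The table is filled from C(0,0) to C(n,m). Each entry C(i,j) depends only on its three neighbours C(i-1,j-1), C(i-1,j) and C(i,j-1):
$$ C(i,j) = \max \{\, C(i-1,j-1) + w(S_i, T_j),\; C(i-1,j) + g,\; C(i,j-1) + g \,\} $$
where g is the score of aligning a character with a space.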
+10 for match, -2 for mismatch, -5 for space
       λ    C    T    C    G    C    A    G    C
  λ    0   -5  -10  -15  -20  -25  -30  -35  -40
  C   -5   10    5
  A  -10
  T  -15
  T  -20
  C  -25
  A  -30
  C  -35
       λ    C    T    C    G    C    A    G    C
  λ    0   -5  -10  -15  -20  -25  -30  -35  -40
  C   -5   10    5    0   -5  -10  -15  -20  -25
  A  -10    5    8    3   -2   -7    0   -5  -10
  T  -15    0   15   10    5    0   -5   -2   -7
  T  -20   -5   10   13    8    3   -2   -7   -4
  C  -25  -10    5   20   15   18   13    8    3
  A  -30  -15    0   15   18   13   28   23   18
  C  -35  -20   -5   10   13   28   23   26   33
Traceback can yield both optimum alignments
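Traceback walks from the bottom-right cell back to the top-left, at each step choosing a neighbour that could have produced the current value. The sketch below is one possible way to do this, under the same assumed scores (+10, -2, -5); the name `global_align` is hypothetical. When several moves are valid it prefers the diagonal, so it returns only one of the optimum alignments.

```python
def global_align(s, t, match=10, mismatch=-2, gap=-5):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0] = i * gap
    for j in range(1, m + 1):
        C[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + w, C[i - 1][j] + gap, C[i][j - 1] + gap)

    # Traceback: at each cell, follow a move that reproduces its value.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            w = match if s[i - 1] == t[j - 1] else mismatch
            if C[i][j] == C[i - 1][j - 1] + w:   # diagonal move
                a.append(s[i - 1]); b.append(t[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and C[i][j] == C[i - 1][j] + gap:  # space in t
            a.append(s[i - 1]); b.append("-")
            i -= 1
        else:                                        # space in s
            a.append("-"); b.append(t[j - 1])
            j -= 1
    return C[n][m], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = global_align("CATTCAC", "CTCGCAGC")
print(score)
print(top)
print(bottom)
```

Changing the order in which the three moves are tested yields the other co-optimal alignments.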
Local Alignment: Smith-Waterman
• Best score for aligning part of sequences
– Often beats global alignment score
ATTGCAGTG-TCGAGCGTCAGGCT
ATTGCGTCGATCGCAC-GCACGCT
Global Alignment
Local Alignment
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
Local Alignment: Motivation
• Ignoring stretches of non-coding DNA:
– Non-coding regions are more likely to be subjected to mutations than coding regions.
– Local alignment between two sequences is likely to be between two exons.
• Locating protein domains:
– Proteins of different kinds and of different species often exhibit local similarities.
– Local similarities may indicate "functional subunits".
Global vs. Local alignment
Alignment of two Genomic sequences
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Mouse DNA
CATGCGTCTGACgctttttgctagcgatatcggactATCGATATA
Global Alignment
Human:CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
Mouse:CATGCGTCTGACgct---ttttgctagcgatatcggactATCGAT-ATA
      ****** *****         *     ***      *  ****** ***
Local Alignment
Human:CATGCGACTGAC
Mouse:CATGCGTCTGAC
Human:ATCGATCATA
Mouse:ATCGAT-ATA
Global vs. Local alignment
Alignment of DNA and mRNA
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Global Alignment
DNA: CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
mRNA:CATGCGACTGAC---------------------------ATCGATCATA
     ************                           **********
Local Alignment
DNA: CATGCGACTGAC
mRNA:CATGCGACTGAC
DNA: ATCGATCATA
mRNA:ATCGATCATA
Global vs. Local alignment
Sequences: Dorothy Hodgkin, Dorothy Crowfoot Hodgkin
Global alignment:
DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN
Local alignment:
DOROTHY
DOROTHY
HODGKIN
HODGKIN
Global vs. Local Alignment
Source: Jones and Pevzner
Local Alignment: Algorithm
Initialize top row and leftmost column to zero.
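$$ C[i,j] = \max \begin{cases} 0 \\ C[i-1,j] + g \\ C[i,j-1] + g \\ C[i-1,j-1] + \mathrm{score}(s_i, t_j) \end{cases} $$
where g is the score of aligning a character with a space.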
C [i, j] = Score of optimally aligning a suffix of S1…i with a suffix of T1…j.
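A minimal Python sketch of this local alignment recurrence follows. The function name `local_alignment_table` and its parameters are assumptions for illustration; the default scores (+1 match, -1 mismatch, -5 space) are taken from the example table in these slides. The only changes from the global version are the zero initialization and the extra 0 inside the max.

```python
def local_alignment_table(s, t, match=1, mismatch=-1, gap=-5):
    """Return (best_score, C) for Smith-Waterman style local alignment.

    C[i][j] is the best score of aligning a suffix of s[:i] with a
    suffix of t[:j]; an empty alignment scores 0.
    """
    n, m = len(s), len(t)
    # Top row and leftmost column stay at zero.
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            C[i][j] = max(0,                      # start a fresh alignment here
                          C[i - 1][j - 1] + w,    # extend diagonally
                          C[i - 1][j] + gap,      # space in t
                          C[i][j - 1] + gap)      # space in s
    best = max(max(row) for row in C)
    return best, C


best, C = local_alignment_table("CATTCAC", "CTCGCAGC")
print(best)  # 2
```

The best local alignment ends at any cell holding the maximum; tracing back from that cell until a 0 is reached recovers the aligned segments.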
+1 for a match, -1 for a mismatch, -5 for a space
     λ  C  T  C  G  C  A  G  C
  λ  0  0  0  0  0  0  0  0  0
  C  0  1  0  1  0  1  0  0  1
  A  0  0  0  0  0  0  2  0  0
  T  0  0  1  0  0  0  0  1  0
  T  0  0  1  0  0  0  0  0  0
  C  0  1  0  2  0  1  0  0  1
  A  0  0  0  0  1  0  2  0  0
  C  0  1  0  1  0  2  0  1  1
Reducing space requirements
• O(mn) tables are often the limiting factor in computing large alignments
• There is a linear space technique that only doubles the time required [Hirschberg, 1977]
IDEA: We only need the previous row to calculate the next
       λ    C    T    C    G    C    A    G    C
  λ    0    0    0    0    0    0    0    0    0
  C    0   10    5   10    5   10    5    0   10
  A    0    5    8    5    8    5   20   15   10
  T
  T
  C
  A
  G
Linear-space Alignments
mn + ½ mn + ¼ mn + 1/8 mn + 1/16 mn + … = 2 mn
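The "previous row only" idea can be sketched as below. This is an illustrative fragment under assumed names (`global_score_linear_space`) and the +10 / -2 / -5 scores from the global example; it computes only the optimal score. Recovering the alignment itself in linear space needs the divide-and-conquer step on top of this, which is where the mn + ½ mn + ¼ mn + … = 2 mn bound comes from, and which is not shown here.

```python
def global_score_linear_space(s, t, match=10, mismatch=-2, gap=-5):
    """Optimal global alignment score using two rows of the table."""
    # The scoring is symmetric, so put the shorter sequence along the row
    # to keep the stored rows as short as possible.
    if len(t) > len(s):
        s, t = t, s
    prev = [j * gap for j in range(len(t) + 1)]  # row 0 of the full table
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)          # column 0 of row i
        for j in range(1, len(t) + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + w,   # diagonal
                          prev[j] + gap,     # up
                          curr[j - 1] + gap) # left
        prev = curr                          # discard the older row
    return prev[-1]


print(global_score_linear_space("CATTCAC", "CTCGCAGC"))  # 33
```

Memory use is proportional to the shorter sequence, while the running time is still proportional to mn.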
Some Results
Most pairwise sequence alignment problems can be solved in O(mn) time.
Space requirement can be reduced to O(m+n), while keeping the run-time at O(mn) [Hirschberg, 1988].
Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau, 1986].
Time complexity of the fastest known sequence alignment algorithms? O(n^2 / log n) [Crochemore, Landau, Ziv-Ukelson, 2003]; for discrete scoring schemes: [Masek and Paterson, 1980].
Sub Quadratic Sequence Alignment
How many points of interest? O(n^2 / t)
n/t rows with n vertices each
n/t columns with n vertices each
LZ-78 Compression: [Crochemore, Landau and Ziv-Ukelson, 2003]
Table Lookup: [Masek and Paterson, 1980]
Variants of Sequence Alignment
We have seen two basic variants of sequence alignment:
• Global alignment (Needleman-Wunsch)
• Local alignment (Smith-Waterman)
We will pose and solve two problems:
• Finding the best overlap alignment
• Using an affine cost for gaps
Overlap Alignment
Consider the following question:
Can we find the most significant overlap between two sequences s, t?
Possible overlap relations (figure): a. and b.
The difference between this problem and local alignment is that here we require alignment between the endpoints of the two sequences.
End-gap free alignment
• Gaps at the start or end of alignment are not penalized
Match: +2, Mismatch and space: -1
Best global: Score = 1
Best end-gap free: Score = 9
Motivation: Shotgun assembly
• Shotgun assembly produces a large set of partially overlapping subsequences from many copies of one unknown DNA sequence.
• Problem: Use the overlapping sections to "paste" the subsequences together.
• Overlapping pairs will have a low global alignment score, but a high end-space free score because of the overlap.
• HOW CAN THIS BE SOLVED?
Algorithm
• Same as global alignment, except:
1. Initialize with zeros (free gaps at start)
2. Locate max in the last row/column (free gaps at end)
+10 for match, -2 for mismatch, -5 for gap
       λ    C    T    C    G    C    A    G    C
  λ    0    0    0    0    0    0    0    0    0
  C    0   10    5   10    5   10    5    0   10
  A    0    5    8    5    8    5   20   15   10
  T    0    0   15   10    5    6   15   18   13
  T    0   -2   10   13    8    3   10   13   16
  C    0   10    5   20   15   18   13    8   23
  A    0    5    8   15   18   13   28   23   18
  G    0    0    3   10   25   20   23   38   33
Overlap Alignment
• Initialization: V[i,0]=0, V[0,j]=0
• Recurrence: as in global alignment
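• A match starts at the top or left border of the matrix and finishes on the right or bottom border.
• Score: maximum value in the bottom row and rightmost column of the matrix
$$ V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1], t[j+1]) \\ V[i,j+1] + \sigma(s[i+1], -) \\ V[i+1,j] + \sigma(-, t[j+1]) \end{cases} $$
where σ(a,b) is the score of aligning a with b, and − denotes a space.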
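The overlap (end-gap free) procedure can be sketched as follows: zero initialization, the global recurrence, and a maximum taken over the bottom row and rightmost column. The function name `overlap_alignment` and its parameters are assumptions for illustration; the default scores (+4, -1, -5) are those of the PAWHEAE / HEAGAWGHEE example.

```python
def overlap_alignment(s, t, match=4, mismatch=-1, indel=-5):
    """Return (best_score, (i, j)) for the best overlap alignment of s and t.

    (i, j) is the cell on the bottom row or rightmost column where the
    best overlap ends. Leading and trailing end gaps are free.
    """
    n, m = len(s), len(t)
    # V[i][0] = V[0][j] = 0: gaps before the overlap cost nothing.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + w,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)
    # Gaps after the overlap are free: search the bottom row and last column.
    border = [(n, j) for j in range(m + 1)] + [(i, m) for i in range(n + 1)]
    end = max(border, key=lambda cell: V[cell[0]][cell[1]])
    return V[end[0]][end[1]], end


print(overlap_alignment("PAWHEAE", "HEAGAWGHEE"))
# (11, (7, 4)): the suffix HEAE of s overlaps the prefix HEAG of t
print(overlap_alignment("CATTCAG", "CTCGCAGC", match=10, mismatch=-2, indel=-5))
# (38, (7, 7))
```

Tracing back from the returned cell until the top row or left column is reached gives the overlapping segments.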
Overlap Alignment Example
H E A G A W G H E E
0 0 0 0 0 0 0 0 0 0 0
P 0
A 0
W 0
H 0
E 0
A 0
E 0
s = PAWHEAE
t = HEAGAWGHEE
Scoring system:
Match: +4
Mismatch: -1
Indel: -5
Overlap Alignment Example
H E A G A W G H E E
0 0 0 0 0 0 0 0 0 0 0
P 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
A 0 -1
W 0 -1
H 0 4
E 0 -1
A 0 -1
E 0 -1
s = PAWHEAE
t = HEAGAWGHEE
Scoring system:
Match: +4
Mismatch: -1
Indel: -5
Overlap Alignment Example
H E A G A W G H E E
0 0 0 0 0 0 0 0 0 0 0
P 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
A 0 -1 -2 3 -2 3 -2 -2 -2 -2 -2
W 0 -1 -2 -2 2 -2 7 2 -3 -3 -3
H 0 4 -1 -3 -3 1 2 6 6 1 -4
E 0 -1 8 3 -2 -4 0 1 5 10 5
A 0 -1 3 12 7 2 -3 -1 0 5 9
E 0 -1 3 7 11 6 1 -4 -2 4 9
s = PAWHEAE
t = HEAGAWGHEE
Scoring system:
Match: +4
Mismatch: -1
Indel: -5
Overlap Alignment Example
The best overlap is:
PAWHEAE------
---HEAGAWGHEE
Remark:
A different scoring system could lead us to a different result, such as:
---PAW-HEAE
HEAGAWGHEE-