Pairwise alignments
Introduction
Why do alignments?
Definitions
Scoring alignments
Alignment methods
Significance of alignments
Definitions
An alignment is a mutual arrangement of sequences, which exhibits where the sequences are similar, and where they differ.
An optimal alignment is one that exhibits the most correspondences and the fewest differences. It is the alignment with the highest score. It may or may not be biologically meaningful.
Why do alignments?
Sequence alignment is useful for discovering structural, functional and evolutionary information in biological sequences.
How to measure the similarity
Three kinds of changes can occur at any given position within a sequence:
Mutation (substitution)
Insertion
Deletion
Insertions and deletions have been found to occur in nature at a significantly lower frequency than substitutions.
Given 2 DNA sequences v and w:
v : A T G T T A T    (m = 7)
w : A T C G T A C    (n = 7)
Without gaps: 4 matches, 3 mismatches
Alignment: 2-row representation
A T - G T T A T -    (letters of v)
A T C G T - A - C    (letters of w)
5 matches, 2 insertions, 2 deletions
indel: an insertion or a deletion
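As a rough illustration of how the column counts on this slide can be obtained, the sketch below walks over a two-row alignment and classifies each column. The function name `column_counts` is hypothetical, and the convention that a space in the v row counts as an insertion (and a space in the w row as a deletion) is an assumption; the rows used are v = ATGTTAT and w = ATCGTAC, with and without gaps.

```python
def column_counts(row_v, row_w, gap="-"):
    """Classify every column of a two-row alignment (hypothetical helper)."""
    if len(row_v) != len(row_w):
        raise ValueError("both rows must have the same number of columns")
    counts = {"matches": 0, "mismatches": 0, "insertions": 0, "deletions": 0}
    for a, b in zip(row_v, row_w):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain two spaces")
        if a == gap:
            counts["insertions"] += 1   # assumed: letter of w opposite a space in v
        elif b == gap:
            counts["deletions"] += 1    # assumed: letter of v opposite a space in w
        elif a == b:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1
    return counts


gapped = column_counts("AT-GTTAT-", "ATCGT-A-C")
ungapped = column_counts("ATGTTAT", "ATCGTAC")
print(gapped)    # 5 matches, 2 insertions, 2 deletions
print(ungapped)  # 4 matches, 3 mismatches
```

Allowing indels raises the number of matches from 4 to 5 here, which is why the gapped arrangement is the more informative one.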
Aligning DNA Sequences
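V = ATCTGATG (n = 8)
W = TGCATAC (m = 7)
[Figure: V and W written as the two rows of an alignment, with columns labeled match, mismatch, insertion and deletion; insertions and deletions together are indels]
4 matches, 1 mismatch, 2 insertions, 3 deletions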
Scoring Matrices for Aligning DNA Sequences
Transitions: substitutions in which a purine (A/G) is replaced by another purine (A/G) or a pyrimidine (C/T) is replaced by another pyrimidine (C/T).
Transversions: substitutions in which a purine (A/G) is replaced by a pyrimidine (C/T), or vice versa.
Identity matrix
     A   T   C   G
A    1   0   0   0
T    0   1   0   0
C    0   0   1   0
G    0   0   0   1

BLAST matrix
     A   T   C   G
A    5  -4  -4  -4
T   -4   5  -4  -4
C   -4  -4   5  -4
G   -4  -4  -4   5

Transition-Transversion matrix
     A   T   C   G
A    1  -5  -5  -1
T   -5   1  -1  -5
C   -5  -1   1  -5
G   -1  -5  -5   1
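The three DNA scoring schemes named on this slide can be written as small scoring functions. This is a minimal sketch, not a reference implementation: the function names are hypothetical, and the scores (1/0, 5/-4, and 1/-1/-5) are the ones shown in the matrices.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(a, b):
    """True for a purine<->purine or pyrimidine<->pyrimidine substitution."""
    pair = {a, b}
    return a != b and (pair <= PURINES or pair <= PYRIMIDINES)


def identity_score(a, b):
    return 1 if a == b else 0


def blast_score(a, b):
    return 5 if a == b else -4


def transition_transversion_score(a, b):
    if a == b:
        return 1
    # transitions are penalized less than transversions
    return -1 if is_transition(a, b) else -5


def as_rows(score, alphabet="ATCG"):
    """Tabulate a scoring function as a list of rows over the alphabet."""
    return [[score(a, b) for b in alphabet] for a in alphabet]


for name, score in [("identity", identity_score),
                    ("BLAST", blast_score),
                    ("transition-transversion", transition_transversion_score)]:
    print(name)
    for letter, row in zip("ATCG", as_rows(score)):
        print(letter, row)
```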
Scoring a Sequence Alignment
Match score: +1
Mismatch score: +0
Gap penalty: -1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (-1)
Score = 18 + 0 - 7 = +11
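The arithmetic on this slide can be checked with a few lines of code. The sketch below assumes the alignment is given as two equal-length rows with "-" for a space; `score_alignment` is a hypothetical helper name, and the default scores are the ones stated above (match +1, mismatch 0, gap -1).

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=-1):
    """Sum per-column scores of a two-row alignment (linear gap penalty)."""
    if len(row1) != len(row2):
        raise ValueError("rows must have equal length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap        # every space is charged the gap penalty
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
total = score_alignment(top, bottom)
print(total)  # 18 matches, 2 mismatches, 7 spaces -> 11
```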
Aligning protein sequences
FFDGGLQMQMLKDKFPMEGGQKDPKQRI
Amino Acid Substitution Matrices
PAM - point accepted mutation based on global alignment [evolutionary model]
BLOSUM - block substitutions based on local alignments [similarity among conserved sequences]
Part of PAM 250 Matrix
    C   S   T   P   A   G
C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
Log-odds = log( chance to see pair in homologous proteins / chance to see pair in unrelated proteins by chance )
PAM 250 Matrix
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
Scoring Matrix: Example
A R N K
A 5 -2 -1 -1
R - 7 -1 3
N - - 7 0
K - - - 6
• Notice that although R and K are different amino acids, they have a positive score.
• Why?
They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.
AKRANR
KAAANK
-1 + (-1) + (-2) + 5 + 7 + 3 = 11
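The sum on this slide can be reproduced by looking up each aligned pair in the small A/R/N/K matrix. The sketch below reads the example as AKRANR aligned against KAAANK without gaps, which is an assumption (it is the split that reproduces the slide's terms); the names `UPPER`, `sub_score` and `score_ungapped` are hypothetical.

```python
# Upper triangle of the example matrix; the matrix is symmetric.
UPPER = {
    ("A", "A"): 5, ("A", "R"): -2, ("A", "N"): -1, ("A", "K"): -1,
    ("R", "R"): 7, ("R", "N"): -1, ("R", "K"): 3,
    ("N", "N"): 7, ("N", "K"): 0,
    ("K", "K"): 6,
}


def sub_score(a, b):
    """Look up a pair in either orientation."""
    return UPPER[(a, b)] if (a, b) in UPPER else UPPER[(b, a)]


def score_ungapped(s, t):
    """Score two equal-length sequences position by position (no gaps)."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return [sub_score(a, b) for a, b in zip(s, t)]


terms = score_ungapped("AKRANR", "KAAANK")
print(terms, sum(terms))  # [-1, -1, -2, 5, 7, 3] 11
```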
Sequence Alignment Problem
T C A T G
C A T T G
Elements of Dynamic Programming
Dynamic programming is used to solve optimization problems whose optimal solutions depend on optimal solutions to subproblems. It involves:
Characterizing the structure of the optimal solutions
Recursively defining the score of an optimal solution in terms of the scores of optimal solutions of subproblems
Computing the solution in a bottom-up fashion
Tracing back the optimal solution
Dynamic Programming
Consider two sequences: AAAT and AGC
To find the optimal solution, if T is aligned with C, we have to find the best alignment between AAA and AG.
Best solution depends on the best solutions of the subproblems.
Dynamic Programming
Consider two sequences: AAAT and AGC
To find the optimal solution, we have to find the best alignment between AAA and AG, AAA and AGC or AAAT and AG.
Best solution depends on the best solutions of the subproblems.
Dynamic Programming
Optimal solutions for the subproblems have to be solved recursively.
Let n be the size of sequence s = AAAT, m be the size of sequence t = AGC.
Consider subproblems: matching the prefixes of s and t.
s has n+1 possible prefixes, including the empty string.
t has m+1 possible prefixes, including the empty string.
Dynamic Programming
We would like to match s[1…i] and t[1…j]. There are three options:
Align s[1…i] with t[1…j-1] and match a space with t[j]
Align s[1…i-1] with t[1…j-1] and match s[i] with t[j]
Align s[1…i-1] with t[1…j] and match a space with s[i]
Similarity between s and t:
Score(s[1…i],t[1…j]) = max of:
  Score(s[1…i],t[1…j-1]) + gap penalty
  Score(s[1…i-1],t[1…j-1]) + score(s[i],t[j])
  Score(s[1…i-1],t[1…j]) + gap penalty
Definitions
Global alignment - Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.
Local alignment - Smith-Waterman (1981) is a modification of the dynamic programming algorithm that gives the highest scoring local match between two sequences.
Example
Let gap = -2, match = 1, mismatch = -1.
      empty     A     A     A     C
empty     0    -2    -4    -6    -8
A        -2     1    -1    -3    -5
G        -4    -1     0    -2    -4
C        -6    -3    -2    -1    -1
Alignment (score -1):
AAAC
A-GC
Complexity: O(mn)
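The recurrence for global alignment and the example above (AAAC against AGC with gap = -2, match = 1, mismatch = -1) can be sketched as follows. This is one possible implementation under those scores, not the only one: the function name is hypothetical, the matrix is indexed with s on the first axis, and when several optimal alignments exist the traceback simply returns one of them (diagonal moves are tried first).

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, row_s, row_t)."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # prefix of s against the empty string
    for j in range(1, m + 1):
        score[0][j] = j * gap          # prefix of t against the empty string
    for i in range(1, n + 1):          # bottom-up fill, O(mn) cells
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_s.append(s[i - 1]); row_t.append(t[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-")
            i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


best, row_s, row_t = needleman_wunsch("AAAC", "AGC")
print(best)
print(row_s)
print(row_t)
```

Both loops visit each of the (n+1)(m+1) cells once and do constant work per cell, which is where the O(mn) bound comes from.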
Gap Penalty
Scoring Indels: Naive Approach
A fixed penalty σ is given to every indel:
-σ for 1 indel
-2σ for 2 consecutive indels
-3σ for 3 consecutive indels, etc.
This can be too severe a penalty for a series of 100 consecutive indels.
Affine Gap Penalties
In nature, a series of k indels often comes as a single event rather than a series of k single nucleotide events.
Normal scoring would give the same score for both alignments:
k indels in one contiguous gap: this is more likely.
k indels scattered as separate gaps: this is less likely.
Gap Penalty
Gap opening penalty defines the cost for opening a gap in one of the sequences. The penalty must be tuned to the substitution matrix being used.
Gap extension penalty is an extra penalty proportional to the length of the gap. The gap extension penalty is usually lower than the gap opening penalty.
Optimal penalties vary from sequence to sequence, and finding the most adequate value is a matter of empirical trial and error.
Accounting for Gaps
Gaps: contiguous sequences of spaces in one of the rows
Score for a gap of length x is -(ρ + σx), where x is the length of the gap, ρ > 0 is the gap opening penalty (the penalty for introducing a gap) and σ is the gap extension penalty.
ρ will be large relative to σ, because you do not want to add too much of a penalty for extending the gap.
Affine Gap Penalties
Gap penalties:
-ρ-σ when there is 1 indel
-ρ-2σ when there are 2 indels
-ρ-3σ when there are 3 indels, etc.
-ρ-x·σ (-gap opening - x gap extensions)
Somewhat reduced penalties (as compared to naïve scoring) are given to runs of horizontal and vertical edges
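To make the difference between the two schemes concrete, the sketch below scores the gaps in one row of an alignment under the naive rule (-σ per indel) and under the affine rule (-(ρ + σx) per gap of length x). The function names are hypothetical, and the values ρ = 10, σ = 1 and the two example rows are made up purely for illustration.

```python
def naive_gap_score(x, sigma):
    """Naive scoring: every one of the x indels costs sigma."""
    return -sigma * x


def affine_gap_score(x, rho, sigma):
    """Affine scoring: one opening charge rho plus sigma per indel in the gap."""
    return -(rho + sigma * x) if x > 0 else 0


def gap_lengths(row, gap="-"):
    """Lengths of the maximal runs of spaces in one row of an alignment."""
    lengths, run = [], 0
    for ch in row:
        if ch == gap:
            run += 1
        else:
            if run:
                lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


RHO, SIGMA = 10, 1            # illustrative values only (assumption)
one_gap = "ATG---CCA"         # 3 indels as a single event
three_gaps = "A-TG-C-CA"      # 3 indels as three separate events

for row in (one_gap, three_gaps):
    lengths = gap_lengths(row)
    naive = sum(naive_gap_score(x, SIGMA) for x in lengths)
    affine = sum(affine_gap_score(x, RHO, SIGMA) for x in lengths)
    print(row, lengths, naive, affine)
```

Under naive scoring both rows are charged the same amount for their indels, while the affine rule charges the single long gap far less than the three scattered ones.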
Affine Gap Penalties and Edit Graph
To reflect affine gap penalties we have to add “long” horizontal and vertical edges to the edit graph. Each such edge of length x should have weight
-ρ - σ·x
There are many such edges!
Adding them to the graph increases the running time of the alignment algorithm by a factor of n (where n is the length of the sequence)
So the complexity increases from O(n^2) to O(n^3)
Adding “Affine Penalty” Edges to the Edit Graph
Optimal alignment
Optimal alignment - one that exhibits the most correspondences. It is the alignment with the highest score. It may or may not be biologically meaningful.
Global alignment - Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.
Local alignment - Smith-Waterman (1981) gives the highest scoring local match between two sequences.
Local Alignment
Problem first formulated: Smith and Waterman (1981)
Problem: Find an optimal alignment between a substring of s and a substring of t
Algorithm: a variant of the basic algorithm for global alignment
Motivation:
Searching for unknown domains or motifs within proteins from different families
Proteins encoded from Homeobox genes (only conserved in 1 region called Homeo domain, 60 amino acids long)
Identifying active sites of enzymes
Comparing long stretches of anonymous DNA
Querying databases where the query word is much smaller than the sequences in the database
Analyzing repeated elements within a single sequence
Local Alignment
Let n be the size of sequence s = GATCACCT, m be the size of sequence t = GATACCC.
Consider subproblems: matching the suffixes of s and t.
s has n+1 possible suffixes, including the empty string.
t has m+1 possible suffixes, including the empty string.
DP for Local Alignment
Match the suffixes of s[1…i] and t[1…j]:
Align a suffix of s[1…i] with a suffix of t[1…j-1] & match a space with t[j]
Align a suffix of s[1…i-1] with a suffix of t[1…j-1] & match s[i] with t[j]
Align a suffix of s[1…i-1] with a suffix of t[1…j] & match a space with s[i]
Score(s[1…i],t[1…j]) = max of:
  Score(s[1…i],t[1…j-1]) + gap penalty
  Score(s[1…i-1],t[1…j-1]) + score(s[i],t[j])
  Score(s[1…i-1],t[1…j]) + gap penalty
  0
Sij: highest score for an alignment between a suffix of s[1…i] and a suffix of t[1…j]
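Local Alignment
Let gap = -2, match = 1, mismatch = -1.
s = GATCACCT, t = GATACCC
      empty   G   A   T   C   A   C   C   T
empty     0   0   0   0   0   0   0   0   0
G         0   1   0   0   0   0   0   0   0
A         0   0   2   0   0   1   0   0   0
T         0   0   0   3   1   0   0   0   1
A         0   0   1   1   2   2   0   0   0
C         0   0   0   0   2   1   3   1   0
C         0   0   0   0   1   1   2   4   2
C         0   0   0   0   1   0   2   3   3
GATCACCT
GAT_ACCC
The highest entry is 4, reached by the local alignment GATCACC / GAT_ACC.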
Local Alignment
Let gap = -2, match = 1, mismatch = -1.
ACACACTA
AGCACACA
    -   A   C   A   C   A   C   T   A
-   0   0   0   0   0   0   0   0   0
A   0   1   0   1   0   1   0   0   1
G   0   0   0   0   0   0   0   0   0
C   0   0   1   0   1   0   1   0   0
A   0   1   0   2   0   2   0   0   1
C   0   0   2   0   3   1   3   1   0
A   0   1   0   3   1   4   2   2   2
C   0   0   2   1   4   2   5   3   1
A   0   1   0   3   2   5   3   4   4
Smith & Waterman
Place each sequence along one axis
Place score 0 at the upper-left corner
Fill in 1st row & column with 0s
Fill in the matrix with the max of 4 possible values:
0
Vertical move: Score + gap penalty
Horizontal move: Score + gap penalty
Diagonal move: Score + match/mismatch score
The optimal alignment score is the max in the matrix
To reconstruct the optimal alignment, trace back where the MAX at each step came from, stop when a zero is hit
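The steps listed above can be sketched in code as follows. This is a minimal illustration with a linear gap penalty and the scores used in the examples (gap = -2, match = 1, mismatch = -1); the function name is hypothetical, and when the maximum occurs in several cells the sketch simply keeps the first one it encounters.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Local alignment; returns (best score, aligned part of s, aligned part of t)."""
    n, m = len(s), len(t)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # 1st row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                       # start a new local alignment
                          H[i - 1][j - 1] + sub,   # diagonal move
                          H[i - 1][j] + gap,       # space in t
                          H[i][j - 1] + gap)       # space in s
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the maximum until a zero is hit.
    row_s, row_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-")
            i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


score1, a1, b1 = smith_waterman("GATCACCT", "GATACCC")
print(score1, a1, b1)   # 4 GATCACC GAT-ACC
score2, a2, b2 = smith_waterman("ACACACTA", "AGCACACA")
print(score2)           # 5
```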
Semi-global Alignment
Example:
CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

CAGCACTTGGATTCTCGG
CAGC----G--T----GG
We like the first alignment much better. In semiglobal comparison, we score the alignments ignoring some of the end spaces.
Global Alignment
Example:
AAACCC
A--CCC
Prefer to see:
AAACCC
--ACCC
Do not want to penalize the end spaces
      empty     A     A     A     C     C     C
empty     0    -2    -4    -6    -8   -10   -12
A        -2     1    -1    -3    -5    -7    -9
C        -4    -1     0    -2    -2    -4    -6
C        -6    -3    -2    -1    -1    -1    -3
C        -8    -5    -4    -3     0     0     0
SemiGlobal Alignment
Example:
s = AAACCC
t = --ACCC
      empty     A     A     A     C     C     C
empty     0     0     0     0     0     0     0
A        -2     1     1     1    -1    -1    -1
C        -4    -1     0     0     2     0     0
C        -6    -3    -2    -1     1     3     1
C        -8    -5    -4    -3     0     2     4
SemiGlobal Alignment
Example:
s = AAACCCG
t = --ACCC-
      empty     A     A     A     C     C     C     G
empty     0     0     0     0     0     0     0     0
A        -2     1     1     1    -1    -1    -1    -1
C        -4    -1     0     0     2     0     0    -2
C        -6    -3    -2    -1     1     3     1    -1
C        -8    -5    -4    -3     0     2     4     2
SemiGlobal Alignment
Summary of end space charging procedures (with the 1st sequence s across the top of the matrix and the 2nd sequence t down the side, as in the examples):
Place where spaces are not penalized    Action
Beginning of 2nd sequence               Initialize 1st row with zeros
End of 2nd sequence                     Look for max in last row
Beginning of 1st sequence               Initialize 1st column with zeros
End of 1st sequence                     Look for max in last column
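The four end-space options can be sketched as switches on the global algorithm. The code below assumes the layout used in the semiglobal examples (s across the top as columns, t down the side as rows), so that spaces at the beginning of t correspond to a zero first row and spaces at the end of t to taking the maximum of the last row; the function name and the four flag names are hypothetical.

```python
def semiglobal_score(s, t, match=1, mismatch=-1, gap=-2,
                     free_start_s=False, free_end_s=False,
                     free_start_t=False, free_end_t=False):
    """Alignment score where selected end spaces are not penalized.

    Layout assumption: rows are indexed by t, columns by s.
    free_start_t: spaces at the beginning of t are free -> 1st row is zero.
    free_start_s: spaces at the beginning of s are free -> 1st column is zero.
    free_end_t:   spaces at the end of t are free -> max over the last row.
    free_end_s:   spaces at the end of s are free -> max over the last column.
    """
    n, m = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        a[0][j] = 0 if free_start_t else j * gap
    for i in range(1, m + 1):
        a[i][0] = 0 if free_start_s else i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if t[i - 1] == s[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + sub,
                          a[i - 1][j] + gap,
                          a[i][j - 1] + gap)
    candidates = [a[m][n]]                       # plain global score
    if free_end_t:
        candidates.extend(a[m])                  # last row
    if free_end_s:
        candidates.extend(row[n] for row in a)   # last column
    return max(candidates)


print(semiglobal_score("AAACCC", "ACCC"))                                       # 0
print(semiglobal_score("AAACCC", "ACCC", free_start_t=True))                    # 4
print(semiglobal_score("AAACCCG", "ACCC", free_start_t=True))                   # 2
print(semiglobal_score("AAACCCG", "ACCC", free_start_t=True, free_end_t=True))  # 4
```

With all four flags off the function reduces to ordinary global alignment, which makes the effect of each switch easy to compare on the same pair of sequences.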
Global vs Local
Demo
R
R: http://www.r-project.org/
R IDE: http://rstudio.org/
R manual: http://cran.r-project.org/doc/manuals/R-intro.pdf
R & IDE downloads: http://cran.case.edu/ http://rstudio.org/download/
Quick R: http://www.statmethods.net/
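R Demo
Bioconductor
http://www.bioconductor.org/install/
source("http://bioconductor.org/biocLite.R")
biocLite()
# on current Bioconductor releases, biocLite has been replaced by BiocManager:
# install.packages("BiocManager"); BiocManager::install("Biostrings")
library(Biostrings) # load library
pairwiseAlignment(pattern = c("succeed", "precede"), subject = "supersede")
?pairwiseAlignment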