Post on 15-Jan-2016
S. Maarschalkerweerd & A. Tjhang
Probability Theory and Basic Alignment of String Sequences
Chapter 1.1-2.3
Overview
Probability Theory
- Maximum Likelihood
- Bayes’ Theorem
Pairwise Alignment
- The Scoring Model
- Alignment Algorithms
Probability Theory
What is a probabilistic model? Simple example:
What is the probability of a base sequence x1x2…xn?
Assume each residue xi occurs with probability p(xi), independently of the other residues, so P(x1x2…xn) = p(x1) p(x2) … p(xn)
If pC = 0.3, pT = 0.2 and the sequence is CTC: P(CTC) = 0.3 × 0.2 × 0.3 = 0.018
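The independence assumption can be sketched in a few lines of Python. This is only an illustration: the function name sequence_probability is hypothetical, and the dictionary holds just the two probabilities given in the example.

```python
from math import prod

def sequence_probability(seq, p):
    """Probability of seq when every residue is drawn independently with probability p[residue]."""
    return prod(p[residue] for residue in seq)

# Only pC and pT are given in the example, so only those two are listed here.
p = {"C": 0.3, "T": 0.2}
print(sequence_probability("CTC", p))  # approximately 0.018
```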
Maximum Likelihood Estimation
Estimate the parameters of the model from a large set of examples (the training set)
– For example: P(T) and P(C) are estimated from their frequency in a database of residues
Avoid overfitting
– If the database is too small, the model also fits the noise in the training set
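Estimating residue probabilities from their frequencies could look like the following Python sketch. The function name estimate_frequencies and the tiny training set are hypothetical; a training set this small is exactly the situation in which overfitting is a risk.

```python
from collections import Counter

def estimate_frequencies(training_sequences):
    """Estimate P(residue) as its relative frequency over all training sequences."""
    counts = Counter()
    for seq in training_sequences:
        counts.update(seq)  # count every residue in the sequence
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

# Hypothetical toy training set, for illustration only.
estimates = estimate_frequencies(["CTCA", "TTCG"])
print(estimates["C"], estimates["T"])  # 0.375 0.375
```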
Probability Theory
Conditional Probability
- $P(X,Y) = P(X|Y)\,P(Y)$ (joint probability)
- $P(X) = \sum_Y P(X,Y) = \sum_Y P(X|Y)\,P(Y)$ (marginal probability)
Bayes’ Theorem
$P(X|Y) = \frac{P(Y|X)\,P(X)}{P(Y)}$
- Posterior probability
Example:
P(X) = probability that a tumor is visible on the x-ray; P(C) = probability of breast cancer = 0.01
P(X|C) = 0.9; P(X|¬C) = 0.05
- On the x-ray a tumor is seen. What is the probability that the woman has breast cancer?
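The example can be worked through numerically. In the sketch below, P(X) is obtained by marginalizing over C and ¬C, which gives P(X) = 0.9 × 0.01 + 0.05 × 0.99 = 0.0585; the function and parameter names are hypothetical.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(C|X) from Bayes' theorem, with P(X) computed by marginalizing over C and not-C."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)  # P(X)
    return likelihood * prior / evidence

# P(C) = 0.01, P(X|C) = 0.9, P(X|not C) = 0.05, as given in the example.
p_cancer_given_xray = posterior(prior=0.01, likelihood=0.9, likelihood_given_not=0.05)
print(round(p_cancer_given_xray, 3))  # 0.154
```

Even with a tumor visible on the x-ray, the posterior probability of breast cancer is only about 15%, because the prior P(C) is so low.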
Pairwise Alignment
Goal: determine whether two sequences are related (homologous).
Issues regarding pairwise alignment:
1. What sorts of alignment should be considered?
2. The scoring system used to rank alignments.
3. The algorithm used to find optimal (or good) scoring alignments.
4. The statistical methods to evaluate significance of an alignment score.
Example
You need a ‘smart’ scoring model to distinguish b from c.
The Scoring Model
When two sequences are related, both descend from a common ancestor.
– Sequences change over time through mutation:
  Substitutions
  Gaps (insertions or deletions)
– Natural selection ensures that some mutations are seen more often than others (survival of the fittest).
The Scoring Model
Total score of an alignment:
– Sum of terms for each aligned pair of residues
– Terms for each gap
Take the sum of those terms
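A minimal Python sketch of this sum follows. It assumes gaps are written as "-" and that every gap position costs a fixed penalty d (a linear gap cost); the function names and the +1/-1 scores are hypothetical and chosen only for illustration.

```python
def alignment_score(aligned_x, aligned_y, s, d):
    """Sum s(a, b) over aligned residue pairs and subtract d for every gap position."""
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(aligned_x, aligned_y):
        if a == "-" or b == "-":
            total -= d          # term for a gap position
        else:
            total += s(a, b)    # term for an aligned pair of residues
    return total

def simple_score(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

print(alignment_score("GATTACA", "GA-TACA", simple_score, d=2))  # 6 matches - 2 = 4
```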
Substitution Matrices
We need a matrix with the scores for every possible pair of residues (e.g. bases or amino acids)
We can compute these scores by:
$s(a,b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$
$p_{ab}$ = probability that residues a and b have been derived independently from some unknown original residue c.
$q_a$ = frequency of a
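The log-odds score is easy to evaluate directly. The sketch below uses the natural logarithm (the base only rescales the scores) and made-up probabilities; neither the numbers nor the function name come from the slides.

```python
import math

def log_odds_score(p_ab, q_a, q_b):
    """s(a,b) = log(p_ab / (q_a * q_b)): positive if the pair is more likely under the
    related model than under independent occurrence, negative if less likely."""
    return math.log(p_ab / (q_a * q_b))

# Hypothetical values: with q_a = q_b = 0.25, chance alone gives q_a * q_b = 0.0625.
print(log_odds_score(0.10, 0.25, 0.25) > 0)  # True: pair seen more often than by chance
print(log_odds_score(0.03, 0.25, 0.25) < 0)  # True: pair seen less often than by chance
```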
BLOSUM50
Gap Penalties
$\gamma(g) = -g d$ (linear score)
$\gamma(g) = -d - (g-1) e$ (affine score)
– d = gap-open penalty
– e = gap-extension penalty
– g = gap length
$P(\text{gap}) = f(g) \prod_{i \in \text{gap}} q_{x_i}$
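The two gap scores can be compared side by side. The penalty values used below (d = 8 for the linear score, d = 12 and e = 2 for the affine score) are illustrative assumptions, not values taken from the slides.

```python
def linear_gap(g, d):
    """Linear gap score: every gap position costs d."""
    return -g * d

def affine_gap(g, d, e):
    """Affine gap score: opening the gap costs d, each further position costs e."""
    return -d - (g - 1) * e

print(linear_gap(3, d=8))        # -24
print(affine_gap(3, d=12, e=2))  # -16
```

With e smaller than d, the affine score penalizes one long gap less than several short ones.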
Alignment Algorithms
Needleman-Wunsch (global alignment)
Smith-Waterman (local alignment)
Repeated matches
Overlap matches
Hybrid match conditions
Dynamic Programming
There is an enormous number of possible alignments
Algorithm for finding the optimal alignment: use dynamic programming
– Save sub-results for later reuse, avoiding repeated calculation of the same subproblem
Needleman-Wunsch Algorithm
Global alignment
For sequences of size n and m, make an (n+1)×(m+1) matrix
Fill in from top left to bottom right:
$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$
Keep a pointer to the cell that is used to derive F(i,j)
Takes O(nm) time and memory
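The recurrence and the pointer bookkeeping can be sketched in Python as below. The boundary values F(i,0) = -id and F(0,j) = -jd are the usual choice for a linear gap penalty but are an assumption here, since the slide does not state them; the +1/-1 scoring function and the example sequences are hypothetical.

```python
def needleman_wunsch(x, y, s, d):
    """Global alignment with linear gap penalty d; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary conditions: leading gaps cost d per position.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            # Keep the best value and a pointer to the cell it was derived from.
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback from the bottom right cell to the top left cell.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

def simple_score(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

print(needleman_wunsch("ACGT", "AGT", simple_score, d=2))  # (1, 'ACGT', 'A-GT')
```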
Matrix
[Figure: example alignment matrix]
Matrix
Traceback
Smith-Waterman Algorithm
Local alignment
Two differences with Needleman-Wunsch:
1. $F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$
2. A local alignment can end anywhere, so the traceback starts from the highest value in the matrix (not necessarily the bottom right cell)
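Both differences show up directly in code. The following Python sketch assumes that the top row and left column are 0 and that the traceback stops at the first cell whose value came from the 0 option; the +2/-1 scores and the example sequences are hypothetical.

```python
def smith_waterman(x, y, s, d):
    """Local alignment with linear gap penalty d; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (0, None),  # difference 1: an alignment may start afresh here
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)
    # Difference 2: traceback starts at the highest value and stops at a cell with no pointer.
    ax, ay = [], []
    i, j = best_cell
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

def match_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

print(smith_waterman("TACGG", "CACGA", match_score, d=2))  # (6, 'ACG', 'ACG')
```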
Matrix
Smith-Waterman Algorithm
The expected score s(a,b) for a random match must be negative
There must be some s(a,b) greater than 0, or no alignment is found
Repeated Matches
Many local alignments are possible if one or both sequences are long, but Smith-Waterman only finds one of them
Find parts of one sequence in the other sequence
Not every alignment is useful, so only matches that score above a threshold T are considered
Repeated Matches
$F(i,0) = \max \begin{cases} F(i-1,0) \\ F(i-1,j) - T, \quad j = 1,\dots,m \end{cases}$
$F(i,j) = \max \begin{cases} F(i,0) \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$
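A minimal Python sketch of these two recurrences follows. It computes only the total score and omits the traceback. Several details are assumptions not stated on the slide: x is taken to be the long sequence (index i), row 0 is initialized to 0, and the total is read from an extra cell F(n+1,0) computed with the F(i,0) rule. The +2/-1 scores, the threshold and the sequences are hypothetical.

```python
def repeated_matches_score(x, y, s, d, T):
    """Total score of all matches of (parts of) y found in x, each counted as score - T."""
    n, m = len(x), len(y)
    # Assumption: F(0, j) = 0 for all j, consistent with F(0, 0) = 0.
    F = [[0] * (m + 1) for _ in range(n + 2)]
    for i in range(1, n + 2):
        # F(i, 0): stay unmatched, or end a match in row i-1 and pay the threshold T.
        F[i][0] = max([F[i - 1][0]] + [F[i - 1][j] - T for j in range(1, m + 1)])
        if i > n:
            break  # the extra cell F(n+1, 0) holds the total score
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],                                   # start a new match here
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # extend with an aligned pair
                F[i - 1][j] - d,                           # gap
                F[i][j - 1] - d,                           # gap
            )
    return F[n + 1][0]

def match_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

# "ACG" occurs twice in x; each copy scores 6 and contributes 6 - T = 3.
print(repeated_matches_score("ACGTTACG", "ACG", match_score, d=2, T=3))  # 6
```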
Matrix
Threshold T = 20
Overlap Matches
Find a match between the start of one sequence and the end of another (the two can be the same sequence)
The alignment begins on the left-hand or top border of the matrix and ends on the right-hand or bottom border
Overlap Matches
F(0,j) = 0, for j = 1,…,m
F(i,0) = 0, for i = 1,…,n
$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$
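These boundary conditions and the recurrence can be sketched in Python as below. The sketch returns only the best score and the border cell where the alignment ends; the traceback from that cell back to the top or left border is omitted. The +2/-1 scores and the example sequences are hypothetical.

```python
def overlap_match(x, y, s, d):
    """Best overlap score and the border cell (i, j) where that alignment ends."""
    n, m = len(x), len(y)
    # F(0, j) = 0 and F(i, 0) = 0: overhanging ends are not penalized.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # The alignment must end on the right-hand border (j = m) or the bottom border (i = n).
    border = [(F[i][m], (i, m)) for i in range(n + 1)]
    border += [(F[n][j], (n, j)) for j in range(m + 1)]
    return max(border, key=lambda c: c[0])

def match_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

# The end of x ("ACG") overlaps the start of y ("ACG").
print(overlap_match("TTACG", "ACGGG", match_score, d=2))  # (6, (5, 3))
```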
Matrix
Hybrid Match Conditions
Different types of alignment can be created by
– adjusting the right-hand side of this formula: F(i,j) = max {…
– adjusting the traceback
Example:
– We want to align two sequences from the beginning of both sequences until a local alignment has been found.
Summary
Probability theory is important for sequence analysis
Goal: determine whether two sequences are related
For that, we need to find an optimal alignment between those sequences using algorithms
A scoring model is required to rank different alignments
Different algorithms for different types of alignments – use dynamic programming