Probability Theory and Basic Alignment of String Sequences

S. Maarschalkerweerd & A. Tjhang

Chapter 1.1-2.3


Overview

Probability Theory
- Maximum Likelihood
- Bayes' Theorem

Pairwise Alignment
- The Scoring Model
- Alignment Algorithms


Probability Theory

What is a probabilistic model? A simple example:

What is the probability of a base sequence x1x2…xn?

Each residue xi occurs with probability p(xi), and p(x1), p(x2), …, p(xn) are independent of each other, so P(x1x2…xn) = p(x1) * p(x2) * … * p(xn).

If pC = 0.3, pT = 0.2 and the sequence is CTC: P(CTC) = 0.3 * 0.2 * 0.3 = 0.018
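The product rule for independent residues can be sketched in a few lines of Python. The function name `sequence_probability` and the dictionary layout are illustrative assumptions; only the values pC = 0.3 and pT = 0.2 come from the slide.

```python
import math

def sequence_probability(sequence, residue_probs):
    """Probability of a sequence when all residues are independent of each other."""
    return math.prod(residue_probs[residue] for residue in sequence)

# Values from the slide: pC = 0.3, pT = 0.2
probs = {"C": 0.3, "T": 0.2}
p_ctc = sequence_probability("CTC", probs)
print(round(p_ctc, 3))  # 0.018
```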


Maximum Likelihood Estimation

Estimate the parameters of the model from large sets of examples (the training set)
– For example: P(T) and P(C) are estimated from their frequency in a database of residues

Avoid overfitting
– If the database is too small, the model also fits to the noise in the training set
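As a minimal sketch of frequency-based estimation, the snippet below counts residues in a training set and normalises the counts. The function name and the three toy sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter

def estimate_residue_probs(training_sequences):
    """Estimate each residue probability as its relative frequency in the training set."""
    counts = Counter()
    for seq in training_sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

# Hypothetical toy training set (far too small for real use)
training_set = ["ACCT", "CCTA", "TTCA"]
estimates = estimate_residue_probs(training_set)
print(estimates["C"])  # 5 of the 12 residues are C
```

With a training set this small, G is never observed and receives no estimate at all, which is the kind of fit-to-noise the overfitting warning refers to.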


Conditional Probability

- $P(X,Y) = P(X|Y)\,P(Y)$ (joint probability)

- $P(X) = \sum_Y P(X,Y) = \sum_Y P(X|Y)\,P(Y)$ (marginal probability)


Bayes’ Theorem

$$P(X|Y) = \frac{P(Y|X)\,P(X)}{P(Y)}$$

- Posterior probability

Example:
- P(X) = probability that a tumor is visible on the x-ray
- P(C) = probability of breast cancer = 0.01
- P(X|C) = 0.9; P(X|¬C) = 0.05
- On the x-ray a tumor is seen. What is the probability that the woman has breast cancer?
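The x-ray question can be worked through numerically. The sketch below uses only the three numbers given in the example; the variable names are assumptions, and P(X) is obtained by marginalising over C and ¬C.

```python
# Numbers from the example
p_c = 0.01              # P(C): prior probability of breast cancer
p_x_given_c = 0.9       # P(X|C)
p_x_given_not_c = 0.05  # P(X|not C)

# Marginal probability of seeing a tumor: P(X) = P(X|C)P(C) + P(X|not C)P(not C)
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

# Bayes' theorem: P(C|X) = P(X|C) P(C) / P(X)
p_c_given_x = p_x_given_c * p_c / p_x
print(round(p_c_given_x, 3))  # 0.154
```

So even with a visible tumor the posterior probability is only about 15%, because the prior P(C) is small.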


Pairwise Alignment

Goal: determine whether 2 sequences are related (homologous).

Issues regarding pairwise alignment:
1. What sorts of alignment should be considered?
2. The scoring system used to rank alignments.
3. The algorithm used to find optimal (or good) scoring alignments.
4. The statistical methods used to evaluate the significance of an alignment score.


Example

You need a ‘smart’ scoring model to distinguish b from c.


The Scoring Model

When sequences are related, both sequences descend from a common ancestor.
– Due to mutation, sequences can change:
  - Substitutions
  - Gaps (insertions or deletions)
– Natural selection ensures that some mutations are seen more often than others (survival of the fittest).


Total score of an alignment:
– Sum of terms for each aligned pair of residues
– Terms for each gap

Take the sum of those terms
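A minimal sketch of this additive score, assuming the simplest possible terms: a hypothetical `simple_score` function (+2 for identical residues, -1 otherwise) and a fixed cost `d` per gap position. Neither the values nor the function names come from the slides.

```python
def alignment_score(aligned_x, aligned_y, s, d):
    """Sum s(a, b) over aligned residue pairs and subtract d for every gap position."""
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(aligned_x, aligned_y):
        if a == "-" or b == "-":
            total -= d          # gap term
        else:
            total += s(a, b)    # term for an aligned pair of residues
    return total

def simple_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

print(alignment_score("ACGT", "A-GT", simple_score, d=2))  # 4
```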


Substitution Matrices

We need a matrix with the scores for every possible pair of residues (e.g. bases or amino acids).

We can compute these scores by:

$$s(a,b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$$

– $p_{ab}$ = probability that residues a and b have been derived independently from some unknown original residue c
– $q_a$ = frequency of a
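The log-odds score can be evaluated directly. In this sketch the probabilities passed in are hypothetical numbers picked to show the sign of the score; they are not taken from any real substitution matrix.

```python
import math

def log_odds_score(p_ab, q_a, q_b):
    """s(a,b) = log(p_ab / (q_a * q_b))."""
    return math.log(p_ab / (q_a * q_b))

# Hypothetical values: a pair seen more often than chance scores positive,
# a pair seen less often than chance scores negative.
s_frequent_pair = log_odds_score(p_ab=0.10, q_a=0.25, q_b=0.25)
s_rare_pair = log_odds_score(p_ab=0.02, q_a=0.25, q_b=0.25)
print(s_frequent_pair > 0, s_rare_pair < 0)  # True True
```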


BLOSUM50


Gap Penalties

$\gamma(g) = -gd$ (linear score)
$\gamma(g) = -d - (g-1)e$ (affine score)

– d = gap-open penalty
– e = gap-extension penalty
– g = gap length

$$P(\text{gap}) = f(g) \prod_{i \in \text{gap}} q_{x_i}$$
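The two penalty functions translate directly into code. The parameter values in the example calls are hypothetical and serve only to compare the two schemes.

```python
def linear_gap(g, d):
    """Linear gap score: -g * d."""
    return -g * d

def affine_gap(g, d, e):
    """Affine gap score: -d - (g - 1) * e, with gap-open d and gap-extension e."""
    return -d - (g - 1) * e

# Hypothetical penalties for a gap of length 3
print(linear_gap(3, d=8))        # -24
print(affine_gap(3, d=12, e=2))  # -16
```

When e is smaller than d, the affine scheme penalises one long gap less than several short ones.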


Alignment Algorithms

- Needleman-Wunsch (global alignment)
- Smith-Waterman (local alignment)
- Repeated matches
- Overlap matches
- Hybrid match conditions


Dynamic Programming

- The number of possible alignments is enormous
- Algorithm for finding the optimal alignment: use dynamic programming
- Save sub-results for later reuse, avoiding recalculation of the same subproblem


Needleman-Wunsch Algorithm

- Global alignment
- For sequences of size n and m, make an (n+1)x(m+1) matrix
- Fill in from top left to bottom right:

$$F(i,j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$

- Keep a pointer to the cell that is used to derive F(i,j)
- Takes O(nm) time and memory
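The recurrence and the pointer-based traceback can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a linear gap penalty d, the usual boundary values F(i,0) = -id and F(0,j) = -jd (not shown on the slide), and a hypothetical `simple_score` function.

```python
def needleman_wunsch(x, y, s, d):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # Boundary: a prefix aligned entirely against gaps
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "left"
    # Fill in from top left to bottom right, keeping a pointer per cell
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback from the bottom right cell to the top left cell
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

def simple_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

print(needleman_wunsch("ACGT", "AGT", simple_score, d=2))  # (4, 'ACGT', 'A-GT')
```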


Matrix

[Figure: example Needleman-Wunsch matrix]


Matrix

Traceback


Smith-Waterman Algorithm

Local alignment. Two differences with Needleman-Wunsch:

1. The recurrence has an extra option 0:

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$

2. A local alignment can end anywhere, so the traceback starts from the highest value in the matrix (not necessarily the bottom right cell)
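Both differences can be seen in a short sketch. It assumes a linear gap penalty d and a hypothetical `simple_score` function; the traceback stops at the first cell whose value came from the 0 option, which is the standard way to end a local alignment but is not stated on the slide.

```python
def smith_waterman(x, y, s, d):
    """Local alignment; returns (best_score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (0, None),  # difference 1: start a new local alignment here
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Difference 2: traceback starts at the highest value in the matrix
    ax, ay = [], []
    i, j = best_pos
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

def simple_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

print(smith_waterman("GGACGTCC", "TTACGTAA", simple_score, d=2))  # (8, 'ACGT', 'ACGT')
```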


Matrix


Smith-Waterman Algorithm

- The expected score s(a,b) for a random match must be negative
- There must be some s(a,b) greater than 0, or no alignment is found
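Both conditions can be checked for a given scoring function. The sketch assumes that "expected score for a random match" means the sum of q_a q_b s(a,b) over all residue pairs, with residues drawn independently; the uniform frequencies and `simple_score` are hypothetical.

```python
def expected_random_score(freqs, s):
    """Expected s(a, b) when a and b are drawn independently with frequencies q."""
    return sum(freqs[a] * freqs[b] * s(a, b) for a in freqs for b in freqs)

def simple_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

# Hypothetical uniform base frequencies
freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
expected = expected_random_score(freqs, simple_score)
has_positive = any(simple_score(a, b) > 0 for a in freqs for b in freqs)
print(expected, has_positive)  # -0.25 True
```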


Repeated Matches

- Many local alignments are possible if one or both sequences are long; Smith-Waterman only finds one of them
- Find parts of one sequence in the other sequence
- Not every alignment is useful: only keep matches that score above a threshold


Repeated Matches

$$F(i,0) = \max \begin{cases} F(i-1, 0) \\ F(i-1, j) - T, \quad j = 1,\dots,m \end{cases}$$

$$F(i,j) = \max \begin{cases} F(i, 0) \\ F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$
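A minimal sketch of filling the matrix with these two recurrences. It assumes F(0,j) = 0 for the first row, a non-empty y, and one extra row n+1 whose cell F(n+1,0) collects the total score of all matches minus T per match; these conventions and the `simple_score` function are assumptions, not taken from the slide. Traceback is omitted.

```python
def repeated_match_score(x, y, s, d, T):
    """Total score of the best set of repeated matches of y inside x (score only)."""
    n, m = len(x), len(y)
    # Rows 0..n+1; row n+1 is only used for its column-0 cell
    F = [[0] * (m + 1) for _ in range(n + 2)]
    for i in range(1, n + 2):
        # F(i,0): either carry on unmatched, or end a match in row i-1 and pay T
        F[i][0] = max(F[i - 1][0], max(F[i - 1][1:]) - T)
        if i <= n:
            for j in range(1, m + 1):
                F[i][j] = max(
                    F[i][0],  # start a new match here
                    F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                    F[i - 1][j] - d,
                    F[i][j - 1] - d,
                )
    return F[n + 1][0]

def simple_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

# Two copies of ACGT, each scoring 8, each paying the threshold T = 3
print(repeated_match_score("ACGTTTACGT", "ACGT", simple_score, d=2, T=3))  # 10
```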


Matrix

Threshold T = 20


Overlap Matches

- Find a match between the start of one sequence and the end of another (the two can be the same sequence)
- The alignment begins on the left-hand or top border of the matrix and ends on the right-hand or bottom border


Overlap Matches

F(0,j) = 0, for j = 1,…,m
F(i,0) = 0, for i = 1,…,n

$$F(i,j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$
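The zero borders and the border-to-border condition can be sketched as below. The code assumes both sequences are non-empty, takes the best score on the right-hand or bottom border as the result, and uses a hypothetical `simple_score` function. Traceback is omitted.

```python
def overlap_score(x, y, s, d):
    """Best overlap alignment score: free start on top/left border, end on bottom/right border."""
    n, m = len(x), len(y)
    # F(0,j) = 0 and F(i,0) = 0: overhanging ends are not penalised
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    right_border = [F[i][m] for i in range(1, n + 1)]
    bottom_border = [F[n][j] for j in range(1, m + 1)]
    return max(right_border + bottom_border)

def simple_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

# The end of x (CGT) overlaps the start of y (CGT)
print(overlap_score("AAACGT", "CGTTTT", simple_score, d=2))  # 6
```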


Matrix


Hybrid Match Conditions

Different types of alignment can be created by
– adjusting the right-hand side of this formula: F(i,j) = max {…
– adjusting the traceback

Example:
– We want to align two sequences from the beginning of both sequences until a local alignment has been found.


Summary

- Probability theory is important for sequence analysis
- Goal: determine whether 2 sequences are related
- For that, we need to find an optimal alignment between those sequences using algorithms
- A scoring model is required to rank different alignments
- Different algorithms for different types of alignments – use dynamic programming
