S. Maarschalkerweerd & A. Tjhang: Probability Theory and Basic Alignment of String Sequences, Chapter 1.1-2.3

Post on 15-Jan-2016




Overview

Probability Theory
- Maximum Likelihood
- Bayes' Theorem

Pairwise Alignment
- The Scoring Model
- Alignment Algorithms


Probability Theory


What is a probabilistic model? Simple example:

What is the probability of a base sequence x1x2…xn?

The probabilities p(x1), p(x2), …, p(xn) are independent of each other, so P(x1x2…xn) = p(x1) p(x2) … p(xn)

If pC = 0.3, pT = 0.2 and the sequence is CTC: P(CTC) = 0.3 × 0.2 × 0.3 = 0.018
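The CTC calculation can be reproduced with a minimal Python sketch, assuming every residue is drawn independently. The function name `sequence_probability` and the dictionary layout are illustrative choices, not part of the slides.

```python
from math import prod

def sequence_probability(seq, p):
    """Probability of seq when each residue is drawn independently with probability p[residue]."""
    return prod(p[residue] for residue in seq)

# Values from the slide: pC = 0.3, pT = 0.2
p = {"C": 0.3, "T": 0.2}
print(sequence_probability("CTC", p))  # approximately 0.018
```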


Maximum Likelihood Estimation

Estimate the parameters of the model from large sets of examples (the training set)
– For example: P(T) and P(C) are estimated from their frequency in a database of residues

Avoid overfitting
– If the database is too small, the model also fits the noise in the training set
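As a rough illustration of estimating residue probabilities from frequencies, the sketch below counts residues in a made-up training set. The sequences and the name `estimate_frequencies` are hypothetical and only serve the example.

```python
from collections import Counter

def estimate_frequencies(sequences):
    """Maximum likelihood estimate: observed count of each residue divided by the total count."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

# Toy training set (invented data): 12 residues in total, 4 of them C and 4 of them T
training_set = ["ACGT", "CCTA", "TTCG"]
freqs = estimate_frequencies(training_set)
print(freqs["C"], freqs["T"])
```

With so few residues the estimates mostly reflect noise, and any residue that happens not to occur gets no probability at all, which is the overfitting problem the slide warns about.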


Probability Theory

Conditional Probability

- $P(X,Y) = P(X \mid Y)\,P(Y)$ (joint probability)

- $P(X) = \sum_Y P(X,Y) = \sum_Y P(X \mid Y)\,P(Y)$ (marginal probability)
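The joint and marginal formulas can be checked numerically with a small sketch. All numbers and outcome labels below are invented for illustration.

```python
# Hypothetical distribution over Y and conditional distribution of X given Y
p_y = {"y1": 0.4, "y2": 0.6}
p_x_given_y = {
    "y1": {"x1": 0.5, "x2": 0.5},
    "y2": {"x1": 0.1, "x2": 0.9},
}

def joint(x, y):
    """P(X,Y) = P(X|Y) P(Y)"""
    return p_x_given_y[y][x] * p_y[y]

def marginal(x):
    """P(X) = sum over Y of P(X|Y) P(Y)"""
    return sum(joint(x, y) for y in p_y)

print(marginal("x1"), marginal("x2"))  # approximately 0.26 and 0.74
```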


Bayes’ Theorem

$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$

- Posterior probability

Example:

P(X) = probability that a tumor is visible on the x-ray
P(C) = probability of breast cancer = 0.01

P(X|C) = 0.9; P(X|¬C) = 0.05

- On the x-ray a tumor is seen. What is the probability that the woman has breast cancer?
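The x-ray question can be worked through with Bayes' theorem, using the marginal P(X) = P(X|C) P(C) + P(X|¬C) P(¬C) as the denominator. The sketch below uses the numbers from the slide; the function and parameter names are illustrative.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(C|X) = P(X|C) P(C) / [P(X|C) P(C) + P(X|not C) P(not C)]"""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# P(C) = 0.01, P(X|C) = 0.9, P(X|not C) = 0.05
p_cancer_given_xray = posterior(prior=0.01, likelihood=0.9, likelihood_given_not=0.05)
print(round(p_cancer_given_xray, 3))  # 0.154
```

Even with a tumor visible on the x-ray, the posterior is only about 15%, because the prior probability of cancer is low.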


Pairwise Alignment


Goal: determine whether 2 sequences are related (homologous).

Issues regarding pairwise alignment:

1. What sorts of alignment should be considered?

2. The scoring system used to rank alignments.

3. The algorithm used to find optimal (or good) scoring alignments.

4. The statistical methods to evaluate significance of an alignment score.


Example

You need a ‘smart’ scoring model to distinguish b from c.


The Scoring Model


When sequences are related, both descend from a common ancestor.
– Sequences change over time through mutation:
  - Substitutions
  - Gaps (insertions or deletions)
– Natural selection ensures that some mutations are seen more often than others (survival of the fittest).


Total score of an alignment:
– Sum of terms for each aligned pair of residues
– Terms for each gap

Take the sum of those terms
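A minimal sketch of this additive scoring, assuming the alignment is given as two equal-length strings with "-" marking gap positions. The match, mismatch and per-position gap values are placeholders chosen for the example, not values from the slides.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Sum one term per aligned column: a residue pair term or a gap term."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap          # term for a gap position
        elif x == y:
            total += match        # term for an identical pair
        else:
            total += mismatch     # term for a substituted pair
    return total

print(alignment_score("GATTACA", "GA-TGCA"))  # 5 matches, 1 mismatch, 1 gap -> 2
```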


Substitution Matrices

We need a matrix with the scores for every possible pair of residues (e.g. bases or amino acids)

We can compute these scores by:

$$s(a,b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$$

pab = probability that residues a and b have been derived independently from some unknown original residue c.

qa = frequency of a
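The log-odds score can be sketched as follows. The probabilities are invented, and the natural logarithm is an assumption, since the slide does not state the base of the log.

```python
from math import log

def log_odds_score(p_ab, q_a, q_b):
    """s(a,b) = log(p_ab / (q_a * q_b)); natural log assumed here."""
    return log(p_ab / (q_a * q_b))

# Hypothetical values: the pair occurs twice as often as chance predicts -> positive score
print(log_odds_score(0.02, 0.1, 0.1))
# Hypothetical values: the pair occurs half as often as chance predicts -> negative score
print(log_odds_score(0.005, 0.1, 0.1))
```

Pairs seen more often in related sequences than expected by chance score above zero, and pairs seen less often score below zero.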


BLOSUM50


Gap Penalties

$\gamma(g) = -gd$ (linear score)
$\gamma(g) = -d - (g-1)e$ (affine score)

– d = gap-open penalty
– e = gap-extension penalty
– g = gap length

$$P(\text{gap}) = f(g) \prod_{i \in \text{gap}} q_{x_i}$$
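The two gap-penalty formulas translate directly into code. The function names and the penalty values in the example are illustrative assumptions.

```python
def linear_gap(g, d):
    """Linear score: every gap position costs d."""
    return -g * d

def affine_gap(g, d, e):
    """Affine score: opening a gap costs d, each further position costs e."""
    return -d - (g - 1) * e

# Hypothetical penalties for a gap of length 3
print(linear_gap(3, d=8))        # -24
print(affine_gap(3, d=12, e=2))  # -16
```

With e smaller than d, the affine score penalises one long gap less than several short ones.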


Alignment Algorithms


Needleman-Wunsch (global alignment)
Smith-Waterman (local alignment)
Repeated matches
Overlap matches
Hybrid match conditions


Dynamic Programming

The number of possible alignments is enormous.
Algorithm for finding the optimal alignment: use dynamic programming.
Save sub-results for later reuse, avoiding repeated calculation of the same subproblem.
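To give a feel for both points, the sketch below counts paths through the alignment grid using diagonal, down and right steps, with memoisation so each sub-result is computed once. Treating every such path as a distinct alignment is one way of counting (it distinguishes the order of adjacent gaps in the two sequences), and the function name is illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_paths(n, m):
    """Number of grid paths from (0,0) to (n,m) using diagonal, down and right steps."""
    if n == 0 or m == 0:
        return 1  # only gaps remain, so there is a single path
    # lru_cache stores each (n, m) result, so every subproblem is solved only once
    return count_paths(n - 1, m - 1) + count_paths(n - 1, m) + count_paths(n, m - 1)

print(count_paths(3, 3))      # 63
print(count_paths(100, 100))  # a number with dozens of digits
```

Without the cache the same subproblems would be recomputed exponentially often; with it, only (n+1)(m+1) values are ever calculated.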


Needleman-Wunsch Algorithm

Global alignment
For sequences of size n and m, make an (n+1)×(m+1) matrix
Fill in from top left to bottom right:

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$

Keep a pointer to the cell that is used to derive F(i,j)
Takes O(nm) time and memory
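The Needleman-Wunsch recurrence with pointers and traceback can be sketched as below. The boundary values F(i,0) = -id and F(0,j) = -jd are the usual ones for global alignment with a linear gap penalty and are an assumption here, since the slide does not list them. The toy scoring function and the value of d are also placeholders.

```python
def needleman_wunsch(x, y, score, d):
    """Global alignment with linear gap penalty d; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):           # first column: x aligned to gaps only
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):           # first row: y aligned to gaps only
        F[0][j], ptr[0][j] = -j * d, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback from the bottom right cell, following the stored pointers
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

def toy_score(a, b):
    """Placeholder substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

print(needleman_wunsch("GATTACA", "GATACA", toy_score, d=2))
```

When several candidates tie, this sketch keeps the first one in the list, so other equally good alignments may exist.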


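Matrix

[Figure: example score matrix for Needleman-Wunsch global alignment; the cell values are not recoverable from this transcript.]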


Matrix

Traceback


Smith-Waterman Algorithm

Local alignment
Two differences with Needleman-Wunsch:

1. $$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$

2. Local alignment can end anywhere, so choose the highest value in the matrix as the cell from which traceback starts (not necessarily the bottom right cell)
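Both differences can be seen in the following sketch of Smith-Waterman: the extra 0 option in the recurrence, and a traceback that starts at the highest-scoring cell and stops when it reaches a cell with no pointer. The scoring function, the sequences and the value of d are made-up placeholders.

```python
def smith_waterman(x, y, score, d):
    """Local alignment with linear gap penalty d; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]        # borders stay 0
    ptr = [[None] * (m + 1) for _ in range(n + 1)]   # None marks the start of an alignment
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (0, None),                            # difference 1: start afresh
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:                        # difference 2: remember the best cell
                best, best_pos = F[i][j], (i, j)
    ax, ay = [], []
    i, j = best_pos
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

def toy_score(a, b):
    """Placeholder substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

print(smith_waterman("AAAGATTACACCC", "TTGATTACAGG", toy_score, d=2))
```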


Matrix


Smith-Waterman Algorithm

Expected score for a random match s(a,b) must be negative

There must be some s(a,b) greater than 0 or no alignment is found
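Both conditions can be checked for a given scoring scheme. The sketch below assumes the expected score of a random match is the sum of qa qb s(a,b) over all residue pairs, and uses uniform base frequencies and a toy score as placeholders.

```python
def expected_score(q, s):
    """Expected score of a random pair: sum over a, b of q[a] * q[b] * s(a, b)."""
    return sum(q[a] * q[b] * s(a, b) for a in q for b in q)

def toy_score(a, b):
    """Placeholder substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

q = {base: 0.25 for base in "ACGT"}  # assumed uniform frequencies
print(expected_score(q, toy_score))                       # -0.5, so condition 1 holds
print(any(toy_score(a, b) > 0 for a in q for b in q))     # True, so condition 2 holds
```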


Repeated Matches

Many local alignments are possible if one or both sequences are long; Smith-Waterman finds only one of them.

Find parts of one sequence in the other sequence.
Not every alignment is useful, so use a threshold T.


$$F(i,0) = \max \begin{cases} F(i-1,0) \\ F(i-1,j) - T, \quad j = 1,\ldots,m \end{cases}$$

$$F(i,j) = \max \begin{cases} F(i,0) \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
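A sketch of filling the matrix with these two recurrences follows. Two details are assumptions not shown on the slide: the top row F(0,j) is set to 0, and one extra cell F(n+1,0) is computed with the F(i,0) rule to collect the total score of all matches above the threshold. The sequences, scores, d and T are invented.

```python
def repeated_matches(x, y, score, d, T):
    """Total score (each match minus T) of repeated matches of y in x."""
    n, m = len(x), len(y)
    # Rows 0..n for x plus one extra row n+1 that only holds the final F(n+1, 0)
    F = [[0] * (m + 1) for _ in range(n + 2)]
    for i in range(1, n + 2):
        # F(i,0): stay unmatched, or close a match that ended in row i-1 and pay T
        F[i][0] = max([F[i - 1][0]] + [F[i - 1][j] - T for j in range(1, m + 1)])
        if i == n + 1:
            break
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],                                        # start a new match
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    return F[n + 1][0]

def toy_score(a, b):
    """Placeholder substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

# Two copies of y in x, each scoring 14; with T = 5 the total is (14 - 5) + (14 - 5)
print(repeated_matches("GATTACANNNNGATTACA", "GATTACA", toy_score, d=2, T=5))  # 18
```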


Matrix

Threshold T = 20


Overlap Matches

Find match between start of a sequence and end of a sequence (can be the same)

Alignment begins on left-hand or top border of the matrix and ends on right-hand or bottom border


F(0,j) = 0, for j = 1,…,m
F(i,0) = 0, for i = 1,…,n

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
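A sketch of the overlap variant: the top and left borders are zero so an alignment may start anywhere on them at no cost, and the best score is then taken from the bottom row or right-hand column. The scoring function, sequences and d are placeholders, and only the score and end cell are returned to keep the example short.

```python
def overlap_match(x, y, score, d):
    """Best overlap score and the border cell (i, j) where that alignment ends."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # F(i,0) = F(0,j) = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # The alignment must end on the right-hand or bottom border
    border = [(F[i][m], (i, m)) for i in range(n + 1)]
    border += [(F[n][j], (n, j)) for j in range(m + 1)]
    return max(border, key=lambda cell: cell[0])

def toy_score(a, b):
    """Placeholder substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

# The end of x (GATTACA) overlaps the start of y
print(overlap_match("CCCGATTACA", "GATTACATTT", toy_score, d=2))  # (14, (10, 7))
```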


Matrix


Hybrid Match Conditions

Different types of alignment can be created by
– adjusting the right-hand side of this formula: F(i,j) = max {…
– adjusting the traceback

Example:
– We want to align two sequences from the beginning of both sequences until a local alignment has been found.
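One possible reading of this example is an alignment that is global at the start and local at the end: the borders are penalised as in Needleman-Wunsch, so the alignment begins at the start of both sequences, but the traceback would begin at the highest-scoring cell anywhere in the matrix. The sketch below follows that interpretation, which is an assumption, and uses placeholder scores.

```python
def global_start_local_end(x, y, score, d):
    """Best score and end cell for an alignment anchored at (0,0) that may end anywhere."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d        # leading gaps are penalised: the start is global
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # The end is local: pick the best cell anywhere in the matrix
    cells = ((F[i][j], (i, j)) for i in range(n + 1) for j in range(m + 1))
    return max(cells, key=lambda cell: cell[0])

def toy_score(a, b):
    """Placeholder substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

print(global_start_local_end("GATTACATTTT", "GATTACACCCC", toy_score, d=2))  # (14, (7, 7))
```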


Summary

Probability theory is important for sequence analysis
Goal: determine whether 2 sequences are related
For that, we need to find an optimal alignment between those sequences using algorithms
Scoring model is required to rank different alignments
Different algorithms for different types of alignments – use dynamic programming