Sequence analysis of nucleic acids and proteins: part 1

Based on Chapter 3 of Post-genome Bioinformatics

by Minoru Kanehisa, Oxford University Press, 2000

Similarity search

Search and learning problems in sequence analysis

Each group lists the problems in biological science, followed by the Math/Stat/CompSci methods applied to them.

Similarity search
• Pairwise sequence alignment
• Database search for similar sequences
• Multiple sequence alignment
• Phylogenetic tree reconstruction
• Protein 3D structure alignment
Methods: optimization algorithms
• Dynamic programming (DP)
• Simulated annealing (SA)
• Genetic algorithms (GA)
• Markov Chain Monte Carlo (MCMC: Metropolis and Gibbs samplers)
• Hopfield neural network

Structure/function prediction
ab initio prediction
• RNA secondary structure prediction
• RNA 3D structure prediction
• Protein 3D structure prediction
Knowledge-based prediction
• Motif extraction
• Functional site prediction
• Cellular localization prediction
• Coding region prediction
• Transmembrane domain prediction
• Protein secondary structure prediction
• Protein 3D structure prediction
Methods: pattern recognition and learning algorithms
• Discriminant analysis
• Neural networks
• Support vector machines
• Hidden Markov models (HMM)
• Formal grammar
• CART

Molecular classification
• Superfamily classification
• Ortholog/paralog grouping of genes
• 3D fold classification
Methods: clustering algorithms
• Hierarchical, k-means, etc
• PCA, MDS, etc
• Self-organizing maps, etc

A comparison of the homology search and the motif search for functional interpretation of sequence information.

[Figure: two flow diagrams drawing on a sequence database (primary data).
Homology Search: New sequence; Retrieval; Similar sequence; Expert knowledge; Sequence interpretation.
Motif Search: Knowledge acquisition; Motif library (empirical rules); Expert knowledge; New sequence; Inference; Sequence interpretation.]

Pairwise sequence alignment by the dynamic programming algorithm. The algorithm involves finding the optimal path in the path matrix (a), which is equivalent to searching for the optimal solution in the search tree (b).

[Figure: (a) Path Matrix for the sequences AIMS and AMOS; (b) Search Tree, with pruning by an optimization function.]

Alignment:
AIM-S
A-MOS

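Methods for computing the optimal score in the dynamic programming algorithm: (a) the gap penalty is a constant; (b) the gap penalty is a linear function of the gap length.

[Figure: panel (a) relates $D_{i,j}$ to its neighbours $D_{i-1,j-1}$, $D_{i-1,j}$ and $D_{i,j-1}$ through the substitution weight $w_{s(i),t(j)}$ and the gap penalty $d$. Panel (b) uses the same neighbours and three quantities $D_{i,j}^{(1)}$, $D_{i,j}^{(2)}$ and $D_{i,j}^{(3)}$, with an additional gap parameter $b$.]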
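Case (a) can be sketched as follows. This is a minimal illustration, assuming that similarity is maximised and that every gap position costs the same constant d; the function name and the match, mismatch and gap values are assumptions chosen for the example, not values taken from the figure.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score with a constant per-position gap penalty."""
    n, m = len(s), len(t)
    # D[i][j] = best score for aligning s[:i] with t[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap          # s[:i] aligned against gaps only
    for j in range(1, m + 1):
        D[0][j] = j * gap          # t[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = max(
                D[i - 1][j - 1] + w,   # align s(i) with t(j)
                D[i - 1][j] + gap,     # s(i) against a gap
                D[i][j - 1] + gap,     # t(j) against a gap
            )
    return D[n][m]


print(global_alignment_score("AIMS", "AMOS"))
```

With these assumed scores the two sequences of the path-matrix example reach their optimum with the alignment AIM-S / A-MOS (three matches, two gaps).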

Concepts of global and local optimality in the pairwise sequence alignment. The distinction is made as to how the initial values are assigned to the path matrix.

[Figure: three path matrices, (a) Global vs. Global, (b) Local vs. Global and (c) Local vs. Local, differing in which border cells are initialised to 0.]

The order of computing matrix elements in the path matrix, which is suitable for (a) sequential processing and (b) parallel processing.

[Figure: path-matrix cells around (i, j). Panel (a) labels (i-1, j-1), (i-1, j), (i, j-1), (i, j), (i+1, j-1) and (i+1, j); panel (b) additionally labels (i, j-2) and (i+1, j-2).]
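One way to obtain an order suitable for parallel processing is to group the cells by anti-diagonal: every cell with the same i + j depends only on cells from the two preceding anti-diagonals, so all cells in a group can be computed at once. The sketch below only produces the grouping; the function name is hypothetical.

```python
def antidiagonal_waves(n_rows, n_cols):
    """Group the cells (i, j) of an n_rows x n_cols matrix by the value of i + j."""
    waves = []
    for k in range(n_rows + n_cols - 1):
        first_row = max(0, k - n_cols + 1)
        last_row = min(n_rows, k + 1)
        waves.append([(i, k - i) for i in range(first_row, last_row)])
    return waves


for wave in antidiagonal_waves(3, 3):
    print(wave)
```

A matrix with n rows and m columns yields n + m - 1 waves, compared with n x m steps when the cells are visited one by one.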

The dynamic programming algorithm can be applied to limited areas, rather than to the entire matrix, after rapidly searching the diagonals that contain candidate markers.

[Figure: path matrix with indices i and j and sequence lengths m and n; its n + m - 1 diagonals are indexed by l.]

The hashing technique for rapid sequence comparison. In this case the horizontal sequence is converted to a hash table, which contains the locations of the four nucleotides.

[Figure: dot matrix comparing the horizontal sequence A T C A C A C G G C T with the vertical query sequence A T C G C A G T C A A T T C ...]

Hash Table

Key  Address
A    1 4 6
C    3 5 7 10
G    8 9
T    2

Used in FASTA
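The lookup table and the diagonal search can be sketched together. This is an illustration only, not FASTA's actual code: the function names are hypothetical, positions are 1-based as in the figure, and the example uses the first ten residues of the horizontal sequence, which reproduce the addresses listed in the hash table. The query CACG and the word length k = 2 are assumptions made for the example.

```python
from collections import Counter, defaultdict


def position_table(sequence, k=1):
    """Map every k-letter word to the 1-based positions where it starts."""
    table = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        table[sequence[start:start + k]].append(start + 1)
    return dict(table)


def diagonal_hits(sequence, query, k=2):
    """Count word matches on each diagonal (sequence position minus query position)."""
    table = position_table(sequence, k)
    hits = Counter()
    for start in range(len(query) - k + 1):
        word = query[start:start + k]
        for position in table.get(word, []):
            hits[position - (start + 1)] += 1
    return hits


print(position_table("ATCACACGGC"))
print(diagonal_hits("ATCACACGGC", "CACG"))
```

Diagonals that collect many word hits are the candidates on which a restricted dynamic programming pass would then be run.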

An example of the finite state automaton for pattern matching

[Figure: state-transition diagram with states Q0, Q1, Q2, Q3 and Q4; arrows are labelled with the symbols A, B and C.]

Bold arrows lead to outputs indicating patterns have been found

Used in BLAST
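A finite state automaton that reports every occurrence of a set of words in one pass over the text can be sketched in the Aho-Corasick style. The word list BA, CA, BC and the text below are hypothetical, and this is an illustration of the general idea rather than BLAST's own implementation.

```python
from collections import deque


def build_automaton(words):
    """Build transition, failure and output tables for a set of words."""
    goto = [{}]        # goto[state][symbol] -> next state
    output = [set()]   # words recognised on entering each state
    for word in words:
        state = 0
        for symbol in word:
            if symbol not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        output[state].add(word)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # states one step from the start state
    while queue:
        state = queue.popleft()
        for symbol, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and symbol not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(symbol, 0)
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output


def find_words(text, words):
    """Return (0-based start position, word) for every occurrence in text."""
    goto, fail, output = build_automaton(words)
    hits, state = [], 0
    for pos, symbol in enumerate(text):
        while state and symbol not in goto[state]:
            state = fail[state]
        state = goto[state].get(symbol, 0)
        for word in sorted(output[state]):
            hits.append((pos - len(word) + 1, word))
    return hits


print(find_words("ABCABA", ["BA", "CA", "BC"]))
```

Each text symbol causes one state transition (plus occasional failure steps), so the scan time does not grow with the number of words in the list.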

The tree-based progressive method for multiple sequence alignment, which utilizes: (a) a dendrogram obtained by cluster analysis and (b) group alignment for pairwise comparison of groups of sequences.

(a)

DEHUG3

DEPGG3

DEBYG3

DEZYG3

DEBSGF

(b) L W R D G R G A L Q

L W R G G R G A A Q

D W R - G R T A S G

L R R - A R T A S A

L - R G A R A A A E

Possible tree topologies in the phylogenetic analysis of: (a) three sequences or (b) four sequences. Filled circles represent extant sequences, while open circles represent common ancestors.

[Figure: tree diagrams with leaves A, B and C in panel (a), and leaves A, B, C and D in panel (b).]
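The number of possible topologies grows quickly with the number of sequences. As a small sketch, assuming strictly bifurcating trees with labelled leaves, the standard counts are (2n-5)!! unrooted and (2n-3)!! rooted topologies for n sequences; the function names are hypothetical.

```python
def odd_double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_topologies(n):
    """Number of unrooted bifurcating trees with n labelled leaves (n >= 3)."""
    if n < 3:
        raise ValueError("need at least three sequences")
    return odd_double_factorial(2 * n - 5)


def count_rooted_topologies(n):
    """Number of rooted bifurcating trees with n labelled leaves (n >= 2)."""
    if n < 2:
        raise ValueError("need at least two sequences")
    return odd_double_factorial(2 * n - 3)


for n in (3, 4, 10):
    print(n, count_unrooted_topologies(n), count_rooted_topologies(n))
```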

Simulated annealing and Metropolis Monte Carlo methods are based on the concept of thermal fluctuations in the energy functions.

$\Delta E = E(x'_n) - E(x_n)$

$p = 1$ when $\Delta E \le 0$

$p = \exp(-\Delta E / T_n)$ when $\Delta E > 0$

[Figure: energy E plotted against x.]
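The acceptance rule can be sketched as below. The function names, the proposal function and the toy energy used in the usage lines are hypothetical; simulated annealing would additionally lower the temperature from step to step according to some schedule, which is not shown here.

```python
import math
import random


def acceptance_probability(delta_e, temperature):
    """Metropolis rule: always accept downhill moves, sometimes accept uphill ones."""
    if delta_e <= 0:
        return 1.0
    return math.exp(-delta_e / temperature)


def metropolis_step(x, energy, propose, temperature, rng):
    """Propose a new state and accept it with the Metropolis probability."""
    candidate = propose(x, rng)
    delta_e = energy(candidate) - energy(x)
    if rng.random() < acceptance_probability(delta_e, temperature):
        return candidate
    return x


# Toy usage: a random walk on the energy E(x) = x^2 at a fixed temperature.
rng = random.Random(0)
x = 4.0
for _ in range(1000):
    x = metropolis_step(x, lambda v: v * v,
                        lambda v, r: v + r.uniform(-0.5, 0.5), 0.1, rng)
print(x)
```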

Dynamic programming to find edit distances

- Edit operation: M (match), R (replace), I (insert), D (delete)

- Edit transcript: A string over the alphabet M, R, I, D that describes a transformation of one string into another. Example:

R D I M D M
M A - T H S
A - R T - S

- Edit (Levenshtein) distance: The minimum number of edit operations necessary to transform one string into another. (Note: matches are not counted.) Example: the transcript above costs

R D I M D M
1 + 1 + 1 + 0 + 1 + 0 = 4

The edit distance is the minimum of this cost over all possible transcripts.

The recurrence

- Stage: position in the edit transcript;

- State: I, D, M, or R;

- Optimal value function: D(i, j), where D(i, j) = edit distance of Seq1[1...i] and Seq2[1...j]

- Recurrence relation:

$D(i, j) = \min \{\, 1 + D(i-1, j),\; 1 + D(i, j-1),\; t(i, j) + D(i-1, j-1) \,\}$, where

$t(i, j) = 1$ if $\mathrm{Seq1}(i) \ne \mathrm{Seq2}(j)$, and $t(i, j) = 0$ if $\mathrm{Seq1}(i) = \mathrm{Seq2}(j)$
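The recurrence translates directly into a table-filling routine. The sketch below is a straightforward rendering with a hypothetical function name; row 0 and column 0 hold the cost of building a prefix from nothing, and the usage line applies it to the strings MATHS and ARTS.

```python
def edit_distance_table(seq1, seq2):
    """Return the full table D, where D[i][j] is the edit distance of seq1[:i] and seq2[:j]."""
    n, m = len(seq1), len(seq2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # i deletions
    for j in range(m + 1):
        D[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            D[i][j] = min(
                1 + D[i - 1][j],         # delete seq1(i)
                1 + D[i][j - 1],         # insert seq2(j)
                t + D[i - 1][j - 1],     # match or replace
            )
    return D


table = edit_distance_table("MATHS", "ARTS")
for row in table:
    print(row)
print("edit distance:", table[-1][-1])
```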

The tabulation, D(i, j)

The table is filled starting from D(0, 0) = 0, then along row 0 and column 0, and then row by row from left to right. The completed table:

          Seq2(j)        A   R   T   S
Seq1(i)   i \ j      0   1   2   3   4
          0          0   1   2   3   4
M         1          1   1   2   3   4
A         2          2   1   2   3   4
T         3          3   2   2   2   3
H         4          4   3   3   3   3
S         5          5   4   4   4   3

The traceback

Starting from D(5, 4) = 3, trace back through the cells whose values produced each minimum until D(0, 0) is reached. The table admits three optimal paths.

The solutions - #1: 1 + 0 + 1 + 1 + 0 = 3

D M R R M
M A T H S
- A R T S

The solutions - #2: 1 + 0 + 1 + 0 + 1 + 0 = 3

D M I M D M
M A - T H S
- A R T - S

The solutions - #3: 1 + 1 + 0 + 1 + 0 = 3

R R M D M
M A T H S
A R T - S
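The traceback can be sketched as a recursive walk from the bottom-right cell that follows every predecessor attaining the minimum. The function name is hypothetical; D denotes deleting a character of the first string and I inserting a character of the second, matching the transcripts above.

```python
def optimal_transcripts(seq1, seq2):
    """Enumerate every edit transcript whose cost equals the edit distance."""
    n, m = len(seq1), len(seq2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            D[i][j] = min(1 + D[i - 1][j], 1 + D[i][j - 1], t + D[i - 1][j - 1])

    results = []

    def walk(i, j, ops):
        # ops collects operations from the end of the strings towards the start
        if i == 0 and j == 0:
            results.append("".join(reversed(ops)))
            return
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            walk(i - 1, j, ops + ["D"])
        if j > 0 and D[i][j] == D[i][j - 1] + 1:
            walk(i, j - 1, ops + ["I"])
        if i > 0 and j > 0:
            same = seq1[i - 1] == seq2[j - 1]
            if D[i][j] == D[i - 1][j - 1] + (0 if same else 1):
                walk(i - 1, j - 1, ops + ["M" if same else "R"])

    walk(n, m, [])
    return sorted(results)


print(optimal_transcripts("MATHS", "ARTS"))
```

The number of co-optimal transcripts can grow exponentially for long, repetitive strings, so practical tools usually report a single path.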

“Life must be lived forwards and understood backwards.”

- Søren Kierkegaard

BLOSUM62 SCORING MATRIX

C 9

S -1 4

T -1 1 5

P -3 -1 -1 7

A 0 1 0 -1 4

G -3 0 -2 -2 0 6

N -3 1 0 -2 -2 0 6

D -3 0 -1 -1 -2 -1 1 6

E -4 0 -1 -1 -1 -2 0 2 5

Q -3 0 -1 -1 -1 -2 0 0 2 5

H -3 -1 -2 -2 -2 -2 1 -1 0 0 8

R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5

K -3 0 -1 -1 -1 -2 0 -1 1 1 -1 2 5

M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5

I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4

L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4

V -1 -2 0 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4

F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6

Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7

W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11

C S T P A G N D E Q H R K M I L V F Y W

134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
    |     |||         |       |        ||||||    |    || ||
137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM

D:D = +6

D:R = -2

From Henikoff 1996
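Looking up pair scores such as D:D and D:R can be sketched by parsing the lower-triangular listing shown above. The rows are copied from the matrix as printed in these slides; the function names are hypothetical, and the short sequences in the last line are made up for illustration.

```python
# Lower-triangular rows as printed above; column order is the same as row order.
BLOSUM62_ROWS = """\
C 9
S -1 4
T -1 1 5
P -3 -1 -1 7
A 0 1 0 -1 4
G -3 0 -2 -2 0 6
N -3 1 0 -2 -2 0 6
D -3 0 -1 -1 -2 -1 1 6
E -4 0 -1 -1 -1 -2 0 2 5
Q -3 0 -1 -1 -1 -2 0 0 2 5
H -3 -1 -2 -2 -2 -2 1 -1 0 0 8
R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5
K -3 0 -1 -1 -1 -2 0 -1 1 1 -1 2 5
M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5
I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4
L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4
V -1 -2 0 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4
F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6
Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7
W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11
"""


def parse_lower_triangle(text):
    """Turn the triangular listing into a symmetric {(a, b): score} dictionary."""
    order, scores = [], {}
    for line in text.strip().splitlines():
        residue, *values = line.split()
        order.append(residue)
        for other, value in zip(order, values):
            scores[(residue, other)] = scores[(other, residue)] = int(value)
    return scores


def ungapped_score(top, bottom, scores):
    """Sum the pair scores of two equal-length, gap-free aligned sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return sum(scores[(a, b)] for a, b in zip(top, bottom))


scores = parse_lower_triangle(BLOSUM62_ROWS)
print(scores[("D", "D")], scores[("D", "R")])
print(ungapped_score("DL", "RL", scores))
```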

Scoring Matrices

• Physical/Chemical similarities

- comparing two sequences according to the properties of their residues may highlight regions of structural similarity

• Identity matrices

- by stressing only identities in the alignment, stretches of sequence that may have diverged will not penalise any remaining common features

Scoring Matrices (ctd)

• As the direct source of residue-by-residue comparison scores, the scoring matrix you choose will have a major impact on the alignment calculated

• The most commonly used will be one of the mutation matrices (PAM, BLOSUM)

• The matrix that performs best will be the matrix that reflects the evolutionary separation of the sequences being aligned

Probability and Likelihood

Some probabilities of observations depend on unknown parameters. E.g. if

O = SFFSFFF

then under independence, with p the probability of S,

$\mathrm{pr}(O) = p^2 (1-p)^5$.

We can calculate this for any observation O, so in a sense we have a 2-variable function

pr(O, p) or pr(O|p)

depending on O and p (0 < p < 1).

Likelihood: holds O fixed, varies p.

Maximum Likelihood estimate: the p which maximizes pr(O, p), O fixed, denoted $\hat{p}$.

E.g. above, $\hat{p} = 2/7$.
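The estimate can be checked numerically. The sketch below evaluates the likelihood of O = SFFSFFF on a grid of p values and picks the maximiser; the grid resolution and the function name are arbitrary choices made for the example.

```python
def likelihood(p, observation="SFFSFFF"):
    """pr(O | p) for independent trials, with p the probability of S."""
    successes = observation.count("S")
    failures = observation.count("F")
    return p ** successes * (1 - p) ** failures


grid = [k / 1000 for k in range(1, 1000)]   # p = 0.001, 0.002, ..., 0.999
p_hat = max(grid, key=likelihood)
print(p_hat, 2 / 7)
```

The grid maximiser agrees with 2/7 to within the grid spacing.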

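Statistical motivation for alignment scores

Alignment:

AGCTGATCA...
AACCGGTTA...

Hypotheses:

H = homologous (indep. sites, Jukes-Cantor)
R = random (indep. sites, equal freq.)

$\mathrm{pr}(\mathrm{data}|H) = \mathrm{pr}(\mathrm{column}\ 1|H) \times \mathrm{pr}(\mathrm{column}\ 2|H) \times \cdots = (1-p)^a p^d$

where d = # disagreements, a = # agreements, $p = \tfrac{3}{4}(1 - e^{-8\alpha t})$

$\mathrm{pr}(\mathrm{data}|R) = \mathrm{pr}(\mathrm{column}\ 1|R) \times \mathrm{pr}(\mathrm{column}\ 2|R) \times \cdots = (\tfrac{1}{4})^a (\tfrac{3}{4})^d$

$\log\left\{ \dfrac{\mathrm{pr}(\mathrm{data}|H)}{\mathrm{pr}(\mathrm{data}|R)} \right\} = a \times \log\dfrac{1-p}{1/4} + d \times \log\dfrac{p}{3/4}$

Since $p < \tfrac{3}{4}$, $\log\dfrac{p}{3/4} < 0$ and $\log\dfrac{1-p}{1/4} > 0$.

score = $a \times \sigma + d \times (-\mu)$, with $\sigma > 0$ the match score and $-\mu < 0$ the mismatch penalty

Note that if $t \approx 0$, $p \approx 6\alpha t$, $1-p \approx 1$ and so $\sigma \approx \log 4$, while $-\mu \approx \log 8\alpha t$ is large and negative: a big difference in the two scores.

Conversely, if t is large, write $\varepsilon = e^{-8\alpha t}$. Then $p = \tfrac{3}{4}(1-\varepsilon)$, $\dfrac{p}{3/4} = 1-\varepsilon$, and $\log(1-\varepsilon) \approx -\varepsilon$, while $1-p = \tfrac{1}{4}(1+3\varepsilon)$, $\dfrac{1-p}{1/4} = 1+3\varepsilon$, and so $\log(1+3\varepsilon) \approx 3\varepsilon$. Thus the scores are about 3:1.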
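The two limits can be checked numerically. The sketch below assumes the Jukes-Cantor disagreement probability p = (3/4)(1 - exp(-8 a t)), where a stands for the substitution rate, and computes the match and mismatch log-odds scores against the equal-frequency random model; the function name and the sample values of a t are assumptions made for the example.

```python
import math


def jukes_cantor_log_odds(alpha_t):
    """Return (match score, mismatch score) as natural-log odds, H versus R."""
    p = 0.75 * (1.0 - math.exp(-8.0 * alpha_t))   # probability that a site disagrees
    match = math.log((1.0 - p) / 0.25)            # positive
    mismatch = math.log(p / 0.75)                 # negative
    return match, mismatch


for alpha_t in (1e-6, 0.01, 0.1, 1.0):
    match, mismatch = jukes_cantor_log_odds(alpha_t)
    print(alpha_t, round(match, 4), round(mismatch, 4), round(match / -mismatch, 4))
```

For small divergence the match score approaches log 4 and the mismatch penalty becomes very large, while for large divergence both shrink toward zero with a ratio close to 3:1.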

We can do the same with any other Markov substitution matrix for molecular evolution. E.g. with a PAM or BLOSUM matrix of probabilities,

data = a gap-free alignment of two a.a. sequence fragments:

$a_1 \ldots a_m$
$b_1 \ldots b_m$

$\mathrm{pr}(\mathrm{data}|H) = \prod_{i=1}^{m} \pi_{a_i} p_{a_i b_i}(2t)$, $\quad \mathrm{pr}(\mathrm{data}|R) = \prod_{i=1}^{m} \pi_{a_i} \pi_{b_i}$

$\log\left\{ \dfrac{\mathrm{pr}(\mathrm{data}|H)}{\mathrm{pr}(\mathrm{data}|R)} \right\} = \sum_{i=1}^{m} \log\left\{ \dfrac{p_{a_i b_i}(2t)}{\pi_{b_i}} \right\}$

The elements of a log-odds score matrix are typically > 0 on the diagonal and < 0 off the diagonal, but not always.

Also the relative sizes of match and mismatch penalties increase as #PAMs (t) decreases. Thus PAM(120) is more stringent than PAM(250), while PAM(360) is less stringent than PAM(250).

PAM(0) = the identity matrix is the toughest.

There are plenty of score matrices based on other principles.

Local alignment

aligns only the most similar regions of two sequences

Why? Often distantly related proteins have only isolated regions (e.g. active sites) of similarity.

The modular nature of proteins

How? The dynamic programming algorithm we have seen needs only a minor modification to yield the best local alignment between two sequences. It is called the Smith-Waterman algorithm, and is named bestfit in GCG.
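The minor modification can be sketched as follows: border cells start at 0, no cell is allowed to drop below 0, and the answer is the largest value anywhere in the matrix rather than the bottom-right cell. The scoring values and the function name are illustrative assumptions, and this sketch returns only the score, not the aligned regions.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best local alignment score in the Smith-Waterman style (constant gap cost)."""
    n, m = len(s), len(t)
    best = 0
    previous = [0] * (m + 1)             # row i - 1 of the matrix
    for i in range(1, n + 1):
        current = [0] * (m + 1)          # row i; column 0 stays at 0
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            current[j] = max(
                0,                        # start a new local alignment here
                previous[j - 1] + w,      # extend with an aligned pair
                previous[j] + gap,        # gap in t
                current[j - 1] + gap,     # gap in s
            )
            best = max(best, current[j])
        previous = current
    return best


print(local_alignment_score("XXXACGTYYY", "ZZACGTWW"))
```

In the usage line only the shared ACGT segment contributes to the score; the unrelated flanks are ignored, which is the behaviour wanted for isolated regions of similarity.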

Similar Amino Acid Sequences: Chance or Common Ancestry?

Title of paper by Russell F. Doolittle, Science 214 (1981)

The question arises every time an alignment is done without prior knowledge of homology.

The usual caveats:

• the scientific goal is not necessarily the same as the mathematical/statistical goal

• significance may not mean homology

• non-significance may not mean non-homology

Early use of statistics

• Generate random permutations of the sequence(s)

• Obtain the average (av) and standard deviation (SD) of the random similarity scores

• Compute z = (observed score - av)/SD

• Think normal (e.g. 4 is a very large z)

This approach is still used for global alignments, but is no longer seen as appropriate for local alignments, since the score is optimized, and random optimal scores do not follow the normal law.
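The permutation procedure can be sketched for global alignment scores. The scoring values, the number of permutations, the fixed random seed and the function names are all assumptions made for the example; only the second sequence is shuffled here.

```python
import random
import statistics


def global_score(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score with a constant gap cost, computed row by row."""
    previous = [j * gap for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        current = [i * gap]
        for j, b in enumerate(t, start=1):
            w = match if a == b else mismatch
            current.append(max(previous[j - 1] + w, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def permutation_z(s, t, n_perm=200, seed=0):
    """z = (observed score - av) / SD, with av and SD taken over shuffled copies of t."""
    rng = random.Random(seed)
    observed = global_score(s, t)
    letters = list(t)
    scores = []
    for _ in range(n_perm):
        rng.shuffle(letters)
        scores.append(global_score(s, "".join(letters)))
    av = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return (observed - av) / sd


print(permutation_z("ACGTTGCAAGCTTAGC", "ACGTTGCAAGCTTAGC"))
```

Shuffling preserves the composition and length of the sequence, so the z value measures how much of the score is due to residue order rather than composition.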

More recent statistical developments:

Theory developed by Karlin and collaborators in 1990-4 and, independently, by Waterman and collaborators in 1988-94. Incorporates the fact that the score has been optimized.

Immediately implemented in BLAST. Later appears in a similar form in FASTA and elsewhere.

The theory applies to the ensemble of random pairs of sequences, with fixed

• possibly different lengths,

• possibly different residue distributions

• and ungapped alignments

(extensions to gapped alignments coming now)

The theoretical distribution of random similarity scores

• is universal in form (see diagram)

• with scale parameter depending on the two residue distributions, and the substitution scores used

• and location parameter depending on the above, plus the lengths of the two sequences

For m, n large, the optimal random score S has the extreme-value distribution with cdf

$\exp\{-\exp\{-\lambda(s-u)\}\}$

where $\lambda$ is the unique positive solution (in t) of

$\sum_{i,j} p_i q_j \exp(s_{ij} t) = 1$,

and

$u = \lambda^{-1} \log(Kmn)$

and K is given by a series depending on the compositions $(p_i)$ and $(q_j)$ and the scoring matrix $(s_{ij})$.
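The scale parameter can be found numerically and plugged into the extreme-value tail. The sketch below solves the sum equation by bisection for a toy DNA scoring scheme (match +1, mismatch -1, uniform composition), for which the solution is log 3. The value of K is NOT computed here: the K = 0.1 in the usage lines is a placeholder assumption, as are the function names and sequence lengths.

```python
import math


def solve_lambda(p, q, score):
    """Unique positive root of sum_ij p_i q_j exp(s_ij * t) = 1 (needs a negative mean score)."""
    def f(t):
        return sum(p[a] * q[b] * math.exp(score(a, b) * t) for a in p for b in q) - 1.0

    lo, hi = 1e-6, 1.0
    while f(hi) <= 0:          # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):       # bisection: f < 0 below the root, f > 0 above it
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


def tail_probability(s, lam, K, m, n):
    """pr(S >= s) = 1 - exp(-exp(-lambda (s - u))), with u = log(K m n) / lambda."""
    u = math.log(K * m * n) / lam
    return -math.expm1(-math.exp(-lam * (s - u)))


uniform = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform, uniform, lambda a, b: 1 if a == b else -1)
print(lam, math.log(3))
print(tail_probability(20, lam, K=0.1, m=100, n=100))   # K is a placeholder value
```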

Database searches: why do them?

To find exact matches to sequences

To find homologous sequences

To infer structure and/or function of new protein sequences

To locate genes in ESTs or genomic sequences

To discover gene structure in DNA sequence

And much more...

Database searching

Compares a query sequence to each sequence in a database (also called a library). Because of the large size of sequence databases, comparisons are generally carried out using faster heuristic approximations to the exact Smith-Waterman local alignment algorithm, rather than the exact algorithm itself. The two most common of these are FASTA and BLAST, where each of these names corresponds to a family of algorithms used in different contexts.

BLAST variants for different searches (a)

(after S. Brenner, Trends Guide to Bioinformatics, 1998)

Program   Query     Database   Comparison      Common use
blastn    DNA       DNA        DNA level       Seek identical DNA sequences and splicing patterns
blastp    Protein   Protein    Protein level   Find homologous proteins
blastx    DNA       Protein    Protein level   Analyze new DNA to find genes and seek homologous proteins
tblastn   Protein   DNA        Protein level   Search for genes in unannotated DNA
tblastx   DNA       DNA        Protein level   Discover gene structure

(a) Similar variant programs are available for FASTA. Protein-level searches of DNA sequences are performed by comparing translations of all six reading frames.
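Translation in all six reading frames can be sketched with the standard genetic code. The frame labels (+1 to +3 for the given strand, -1 to -3 for its reverse complement), the function names and the short example sequence are assumptions for illustration; incomplete trailing codons are dropped and codons containing other letters are shown as X.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated with T, C, A, G at each position.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Complement each base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return the translations of the three forward and three reverse frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames


print(six_frame_translations("ATGGCC"))
```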

cDNA, ORFs and ESTs

• Complementary DNA (cDNA)
– Single-stranded DNA complementary to an RNA, from which it is synthesized by reverse transcription.

• Open reading frames (ORFs)
– Contain a series of triplets coding for amino acids without any termination codons (potentially translatable into proteins)
– Many derived from sequencing of cDNAs

• Expressed sequence tags (ESTs)
– Short (300-500 bp) single reads from mRNA (cDNA) sequencing survey projects.
– A snapshot of what is expressed in a given tissue at a given developmental stage.
