sequence analysis of nucleic acids and proteins: part 1

Sequence analysis of nucleic acids and proteins: part 1 Based on Chapter 3 of Post-genome Bioinformatics by Minoru Kanehisa, Oxford University Press, 2000 Similarity search

Upload: gage

Post on 06-Jan-2016


Page 1: Sequence analysis of nucleic acids and proteins: part 1

Sequence analysis of nucleic acids and proteins: part 1

Based on Chapter 3 of Post-genome Bioinformatics

by Minoru Kanehisa, Oxford University Press, 2000

Similarity search

Page 2: Sequence analysis of nucleic acids and proteins: part 1

Search and learning problems in sequence analysis

Problems in biological science (left column of the original table) and Math/Stat/CompSci methods (right column):

Similarity search
• Pairwise sequence alignment
• Database search for similar sequences
• Multiple sequence alignment
• Phylogenetic tree reconstruction
• Protein 3D structure alignment

Optimization algorithms
• Dynamic programming (DP)
• Simulated annealing (SA)
• Genetic algorithms (GA)
• Markov Chain Monte Carlo (MCMC: Metropolis and Gibbs samplers)
• Hopfield neural network

Structure/function prediction: ab initio prediction
• RNA secondary structure prediction
• RNA 3D structure prediction
• Protein 3D structure prediction

Structure/function prediction: knowledge-based prediction
• Motif extraction
• Functional site prediction
• Cellular localization prediction
• Coding region prediction
• Transmembrane domain prediction
• Protein secondary structure prediction
• Protein 3D structure prediction

Pattern recognition and learning algorithms
• Discriminant analysis
• Neural networks
• Support vector machines
• Hidden Markov models (HMM)
• Formal grammar
• CART

Molecular classification
• Superfamily classification
• Ortholog/paralog grouping of genes
• 3D fold classification

Clustering algorithms
• Hierarchical, k-means, etc.
• PCA, MDS, etc.
• Self-organizing maps, etc.

Page 3: Sequence analysis of nucleic acids and proteins: part 1

A comparison of the homology search and the motif search for functional interpretation of sequence information.

Homology search: New sequence → Retrieval from the sequence database (primary data) → Similar sequence → Expert knowledge → Sequence interpretation

Motif search: Sequence database (primary data) → Knowledge acquisition (expert knowledge) → Motif library (empirical rules); New sequence → Inference → Sequence interpretation

Page 4: Sequence analysis of nucleic acids and proteins: part 1

Pairwise sequence alignment by the dynamic programming algorithm. The algorithm involves finding the optimal path in the path matrix (a), which is equivalent to searching for the optimal solution in the search tree (b).

(a) Path matrix for the sequences AIMS and AMOS. (b) Search tree, with branches pruned (X) by an optimization function.

Alignment:
AIM-S
A-MOS

Page 5: Sequence analysis of nucleic acids and proteins: part 1

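Methods for computing the optimal score in the dynamic programming algorithm: (a) the gap penalty is a constant; (b) the gap penalty is a linear function of the gap length.

(a) $D_{i,j}$ is computed from three neighbours: $D_{i-1,j-1}$ with the substitution weight $w_{s(i),t(j)}$, and $D_{i-1,j}$ and $D_{i,j-1}$, each with the gap penalty $d$.

(b) Three values, $D_{i,j}^{(1)}$, $D_{i,j}^{(2)}$ and $D_{i,j}^{(3)}$, are kept for each cell; the figure labels the moves with $w_{s(i),t(j)}$ and a gap-length term $b$.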
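Case (a), the constant gap penalty, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the score is maximized, the gap penalty d is negative, and the match/mismatch weights (+1/-1) are hypothetical values chosen for the example, not taken from the slide.

```python
def global_alignment_score(s, t, w, d):
    """Fill the path matrix D for a global alignment with constant gap penalty d.

    Assumptions (not from the slide): scores are maximized and d is negative.
    """
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary cells: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        D[i][0] = i * d
    for j in range(1, n + 1):
        D[0][j] = j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(
                D[i - 1][j - 1] + w(s[i - 1], t[j - 1]),  # diagonal move
                D[i - 1][j] + d,                          # gap in t
                D[i][j - 1] + d,                          # gap in s
            )
    return D[m][n]


def simple_weight(a, b):
    """Hypothetical substitution weight: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


score = global_alignment_score("AIMS", "AMOS", simple_weight, d=-1)
print(score)
```

With these weights the alignment AIM-S / A-MOS (three matches, two gaps) attains the optimum.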

Page 6: Sequence analysis of nucleic acids and proteins: part 1

Concepts of global and local optimality in the pairwise sequence alignment. The distinction is made as to how the initial values are assigned to the path matrix.

(a) Global vs. Global (b) Local vs. Global (c) Local vs. Local

[Figure: three path matrices, (a) to (c), showing which boundary cells are initialized to 0 in each case.]

Page 7: Sequence analysis of nucleic acids and proteins: part 1

The order of computing matrix elements in the path matrix, which is suitable for (a) sequential processing and (b) parallel processing.

[Figure: (a) sequential processing, showing cells (i-1, j-1), (i-1, j), (i, j-1), (i, j), (i+1, j-1), (i+1, j); (b) parallel processing, showing cells (i-1, j-1), (i-1, j), (i, j-2), (i, j-1), (i, j), (i+1, j-2), (i+1, j-1).]

Page 8: Sequence analysis of nucleic acids and proteins: part 1

The dynamic programming algorithm can be applied to limited areas, rather than to the entire matrix, after rapidly searching the diagonals that contain candidate markers.

[Figure: path matrix with axes i and j for sequences of lengths m and n; the diagonals l are numbered from 1 to n + m - 1.]

Page 9: Sequence analysis of nucleic acids and proteins: part 1

The hashing technique for rapid sequence comparison. In this case the horizontal sequence is converted to a hash table, which contains the locations of the four nucleotides.

Horizontal sequence: A T C A C A C G G C

Hash table
Key   Address
A     1 4 6
C     3 5 7 10
G     8 9
T     2

[Figure: dot matrix in which asterisks mark the positions where a nucleotide of the query sequence (vertical axis) matches the horizontal sequence.]

Used in FASTA
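The hash table on this slide can be reproduced with a short Python sketch. The function names are hypothetical, and the diagonal-counting step is an assumption about how such a table is typically used to find candidate diagonals; the slide itself only shows the table with single-nucleotide keys.

```python
from collections import defaultdict


def build_hash_table(sequence):
    """Map each nucleotide to the 1-based positions where it occurs."""
    table = defaultdict(list)
    for position, base in enumerate(sequence, start=1):
        table[base].append(position)
    return dict(table)


def diagonal_hits(query, table):
    """Count matches per diagonal, where diagonal = table position - query position."""
    counts = defaultdict(int)
    for q_pos, base in enumerate(query, start=1):
        for t_pos in table.get(base, []):
            counts[t_pos - q_pos] += 1
    return dict(counts)


table = build_hash_table("ATCACACGGC")
print(table)                        # positions of A, T, C and G
print(diagonal_hits("ATC", table))  # a hypothetical three-letter query
```

Each query letter is looked up once, so no cell-by-cell comparison of the two sequences is needed; diagonals with many hits are the ones worth passing to dynamic programming.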

Page 10: Sequence analysis of nucleic acids and proteins: part 1

An example of the finite state automaton for pattern matching

[Figure: finite state automaton with states Q0 to Q4 and transitions labelled A, B and C.]

Bold arrows lead to outputs indicating patterns have been found

Used in BLAST

Page 11: Sequence analysis of nucleic acids and proteins: part 1

The tree-based progressive method for multiple sequence alignment, which utilizes: (a) a dendrogram obtained by cluster analysis and (b) group alignment for pairwise comparison of groups of sequences.

(a) [Dendrogram over the sequences DEHUG3, DEPGG3, DEBYG3, DEZYG3 and DEBSGF]

(b)
L W R D G R G A L Q
L W R G G R G A A Q
D W R - G R T A S G
L R R - A R T A S A
L - R G A R A A A E

Page 12: Sequence analysis of nucleic acids and proteins: part 1

Possible tree topologies in the phylogenetic analysis of: (a) three sequences or (b) four sequences. Filled circles represent extant sequences, while open circles represent common ancestors.

[Figure: (a) tree topologies for three sequences A, B and C; (b) tree topologies for four sequences A, B, C and D.]

Page 13: Sequence analysis of nucleic acids and proteins: part 1

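Simulated annealing and Metropolis Monte Carlo methods are based on the concept of thermal fluctuations in the energy functions.

$$\Delta E = E(x'_n) - E(x_n)$$

$$p = \begin{cases} 1 & \text{when } \Delta E \le 0 \\ \exp(-\Delta E / T_n) & \text{when } \Delta E > 0 \end{cases}$$

[Figure: energy E plotted against the state x.]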
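The acceptance rule can be written as a small Python sketch. It assumes the standard Metropolis criterion (always accept a move that does not raise the energy, otherwise accept with probability exp(-ΔE/T)); the function names and the `propose` callback are hypothetical.

```python
import math
import random


def acceptance_probability(delta_e, temperature):
    """Metropolis criterion: downhill moves are always accepted."""
    if delta_e <= 0:
        return 1.0
    return math.exp(-delta_e / temperature)


def metropolis_step(x, energy, propose, temperature, rng=random):
    """Propose a new state and accept or reject it at the given temperature."""
    candidate = propose(x)
    delta_e = energy(candidate) - energy(x)
    if rng.random() < acceptance_probability(delta_e, temperature):
        return candidate
    return x


print(acceptance_probability(-1.0, 2.0))
print(acceptance_probability(2.0, 2.0))
```

In simulated annealing the temperature is lowered step by step, so uphill moves become progressively rarer.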

Page 14: Sequence analysis of nucleic acids and proteins: part 1

Dynamic programming to find edit distances

- Edit operations: M (match), R (replace), I (insert), D (delete)

- Edit transcript: A string over the alphabet M, R, I, D that describes a transformation of one string into another. Example:

R D I M D M
M A - T H S
A - R T - S

- Edit (Levens(h)tein) distance: The minimum number of edit operations necessary to transform one string into another. (Note: matches are not counted.) Example (cost of the transcript above):

R D I M D M
1 + 1 + 1 + 0 + 1 + 0 = 4

Page 15: Sequence analysis of nucleic acids and proteins: part 1

The recurrence

- Stage: position in the edit transcript;

- State: I, D, M, or R;

- Optimal value function: D(i, j), where D(i, j) = edit distance of Seq1[1...i] and Seq2[1...j]

- Recurrence relation:

$$D(i,j) = \min \begin{cases} 1 + D(i-1, j) \\ 1 + D(i, j-1) \\ t(i,j) + D(i-1, j-1) \end{cases}$$

where

$$t(i,j) = \begin{cases} 1, & \mathrm{Seq1}(i) \ne \mathrm{Seq2}(j) \\ 0, & \mathrm{Seq1}(i) = \mathrm{Seq2}(j) \end{cases}$$
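The recurrence translates directly into a table-filling routine. The sketch below is a minimal Python version; the function name is hypothetical, and the boundary values D(i, 0) = i and D(0, j) = j are taken from the tabulation slides that follow.

```python
def edit_distance_table(seq1, seq2):
    """Return the full table D, where D[i][j] is the edit distance
    between seq1[:i] and seq2[:j]."""
    m, n = len(seq1), len(seq2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # i deletions
    for j in range(n + 1):
        D[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            D[i][j] = min(
                1 + D[i - 1][j],      # delete seq1[i]
                1 + D[i][j - 1],      # insert seq2[j]
                t + D[i - 1][j - 1],  # match or replace
            )
    return D


D = edit_distance_table("MATHS", "ARTS")
for row in D:
    print(row)
```

The bottom-right cell, D[5][4], holds the edit distance of the two full strings.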

Page 16: Sequence analysis of nucleic acids and proteins: part 1

The tabulation, D(i, j)

          Seq2(j)    A  R  T  S
Seq1(i)   i \ j   0  1  2  3  4
          0
M         1
A         2
T         3
H         4
S         5

Page 17: Sequence analysis of nucleic acids and proteins: part 1

The tabulation, D(i, j)

          Seq2(j)    A  R  T  S
Seq1(i)   i \ j   0  1  2  3  4
          0       0
M         1
A         2
T         3
H         4
S         5

Page 18: Sequence analysis of nucleic acids and proteins: part 1

The tabulation, D(i, j)

          Seq2(j)    A  R  T  S
Seq1(i)   i \ j   0  1  2  3  4
          0       0  1
M         1
A         2
T         3
H         4
S         5

Page 19: Sequence analysis of nucleic acids and proteins: part 1

The tabulation, D(i, j)

          Seq2(j)    A  R  T  S
Seq1(i)   i \ j   0  1  2  3  4
          0       0  1  2
M         1
A         2
T         3
H         4
S         5

Page 20: Sequence analysis of nucleic acids and proteins: part 1

The tabulation, D(i, j)

          Seq2(j)    A  R  T  S
Seq1(i)   i \ j   0  1  2  3  4
          0       0  1  2  3  4
M         1       1
A         2       2
T         3       3
H         4       4
S         5       5

Page 21: Sequence analysis of nucleic acids and proteins: part 1

The tabulation, D(i, j)

          Seq2(j)    A  R  T  S
Seq1(i)   i \ j   0  1  2  3  4
          0       0  1  2  3  4
M         1       1  1
A         2       2
T         3       3
H         4       4
S         5       5

Page 22: Sequence analysis of nucleic acids and proteins: part 1

The tabulation, D(i, j)

          Seq2(j)    A  R  T  S
Seq1(i)   i \ j   0  1  2  3  4
          0       0  1  2  3  4
M         1       1  1  2
A         2       2
T         3       3
H         4       4
S         5       5

Page 23: Sequence analysis of nucleic acids and proteins: part 1

The tabulation, D(i, j)

          Seq2(j)    A  R  T  S
Seq1(i)   i \ j   0  1  2  3  4
          0       0  1  2  3  4
M         1       1  1  2  3  4
A         2       2  1  2  3  4
T         3       3
H         4       4
S         5       5

Page 24: Sequence analysis of nucleic acids and proteins: part 1

The tabulation, D(i, j)

          Seq2(j)    A  R  T  S
Seq1(i)   i \ j   0  1  2  3  4
          0       0  1  2  3  4
M         1       1  1  2  3  4
A         2       2  1  2  3  4
T         3       3  2  2  2  3
H         4       4
S         5       5

Page 25: Sequence analysis of nucleic acids and proteins: part 1

The tabulation, D(i, j)

          Seq2(j)    A  R  T  S
Seq1(i)   i \ j   0  1  2  3  4
          0       0  1  2  3  4
M         1       1  1  2  3  4
A         2       2  1  2  3  4
T         3       3  2  2  2  3
H         4       4  3  3  3  3
S         5       5  4  4  4  3

Page 26: Sequence analysis of nucleic acids and proteins: part 1

The traceback

          Seq2(j)    A  R  T  S
Seq1(i)   i \ j   0  1  2  3  4
          0       0  1  2  3  4
M         1       1  1  2  3  4
A         2       2  1  2  3  4
T         3       3  2  2  2  3
H         4       4  3  3  3  3
S         5       5  4  4  4  3

Path traced back for solution #1: D(5,4) → D(4,3) → D(3,2) → D(2,1) → D(1,0) → D(0,0)

Page 27: Sequence analysis of nucleic acids and proteins: part 1

The solutions - #1

1 + 0 + 1 + 1 + 0 = 3

D M R R M
M A T H S
- A R T S

Page 28: Sequence analysis of nucleic acids and proteins: part 1

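The traceback

          Seq2(j)    A  R  T  S
Seq1(i)   i \ j   0  1  2  3  4
          0       0  1  2  3  4
M         1       1  1  2  3  4
A         2       2  1  2  3  4
T         3       3  2  2  2  3
H         4       4  3  3  3  3
S         5       5  4  4  4  3

Path traced back for solution #2: D(5,4) → D(4,3) → D(3,3) → D(2,2) → D(2,1) → D(1,0) → D(0,0)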

Page 29: Sequence analysis of nucleic acids and proteins: part 1

The solutions - #2

1 + 0 + 1 + 0 + 1 + 0 = 3

D M I M D M
M A - T H S
- A R T - S

Page 30: Sequence analysis of nucleic acids and proteins: part 1

The traceback

          Seq2(j)    A  R  T  S
Seq1(i)   i \ j   0  1  2  3  4
          0       0  1  2  3  4
M         1       1  1  2  3  4
A         2       2  1  2  3  4
T         3       3  2  2  2  3
H         4       4  3  3  3  3
S         5       5  4  4  4  3

Path traced back for solution #3: D(5,4) → D(4,3) → D(3,3) → D(2,2) → D(1,1) → D(0,0)

Page 31: Sequence analysis of nucleic acids and proteins: part 1

The solutions - #3

1 + 1 + 0 + 1 + 0 = 3

R R M D M
M A T H S
A R T - S

“Life must be lived forwards and understood backwards.”

- Søren Kierkegaard
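The traceback can be sketched as a recursive walk from the last cell back to D(0, 0), following every move that is consistent with the recurrence. This is an illustrative Python version with hypothetical names; it enumerates all optimal edit transcripts rather than just one.

```python
def optimal_transcripts(seq1, seq2):
    """Return every minimum-cost edit transcript (over M, R, I, D)
    that transforms seq1 into seq2."""
    m, n = len(seq1), len(seq2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            D[i][j] = min(1 + D[i - 1][j], 1 + D[i][j - 1], t + D[i - 1][j - 1])

    def walk(i, j):
        # Yield the transcripts that lead from D(0, 0) to D(i, j).
        if i == 0 and j == 0:
            yield ""
            return
        if i > 0 and j > 0:
            same = seq1[i - 1] == seq2[j - 1]
            if D[i][j] == D[i - 1][j - 1] + (0 if same else 1):
                for prefix in walk(i - 1, j - 1):
                    yield prefix + ("M" if same else "R")
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            for prefix in walk(i - 1, j):
                yield prefix + "D"
        if j > 0 and D[i][j] == D[i][j - 1] + 1:
            for prefix in walk(i, j - 1):
                yield prefix + "I"

    return sorted(walk(m, n))


solutions = optimal_transcripts("MATHS", "ARTS")
print(solutions)
```

For MATHS and ARTS the walk finds the three transcripts shown on the solution slides.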

Page 32: Sequence analysis of nucleic acids and proteins: part 1

BLOSUM62 SCORING MATRIX

C 9

S -1 4

T -1 1 5

P -3 -1 -1 7

A 0 1 0 -1 4

G -3 0 -2 -2 0 6

N -3 1 0 -2 -2 0 6

D -3 0 -1 -1 -2 -1 1 6

E -4 0 -1 -1 -1 -2 0 2 5

Q -3 0 -1 -1 -1 -2 0 0 2 5

H -3 -1 -2 -2 -2 -2 1 -1 0 0 8

R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5

K -3 0 -1 -1 -1 -2 0 -1 1 1 -1 2 5

M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5

I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4

L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4

V -1 -2 0 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4

F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6

Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7

W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11

C S T P A G N D E Q H R K M I L V F Y W

134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
    |     |||         |       |        ||||||    |    || ||
137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM

D:D = +6

D:R = -2

From Henikoff 1996
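The lookups D:D = +6 and D:R = -2 can be reproduced by loading the lower triangle exactly as printed on the slide. The Python sketch below uses hypothetical names and a simple ungapped scorer added only for illustration.

```python
ORDER = "CSTPAGNDEQHRKMILVFYW"  # row and column order used on the slide

LOWER_TRIANGLE = """
9
-1 4
-1 1 5
-3 -1 -1 7
0 1 0 -1 4
-3 0 -2 -2 0 6
-3 1 0 -2 -2 0 6
-3 0 -1 -1 -2 -1 1 6
-4 0 -1 -1 -1 -2 0 2 5
-3 0 -1 -1 -1 -2 0 0 2 5
-3 -1 -2 -2 -2 -2 1 -1 0 0 8
-3 -1 -1 -2 -1 -2 0 -2 0 1 0 5
-3 0 -1 -1 -1 -2 0 -1 1 1 -1 2 5
-1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5
-1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4
-1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4
-1 -2 0 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4
-2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6
-2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7
-2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11
"""

ROWS = [[int(x) for x in line.split()]
        for line in LOWER_TRIANGLE.strip().splitlines()]


def blosum62(a, b):
    """Look up the score of residues a and b (the matrix is symmetric)."""
    i, j = ORDER.index(a), ORDER.index(b)
    if i < j:
        i, j = j, i
    return ROWS[i][j]


def ungapped_score(x, y):
    """Sum the substitution scores of two equal-length, gap-free fragments."""
    return sum(blosum62(a, b) for a, b in zip(x, y))


print(blosum62("D", "D"), blosum62("D", "R"))
```

Because only the lower triangle is stored, the lookup swaps the indices whenever the first residue comes earlier in the ordering than the second.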

Page 33: Sequence analysis of nucleic acids and proteins: part 1

Scoring Matrices

• Physical/Chemical similarities

- comparing two sequences according to the properties of their residues may highlight regions of structural similarity

• Identity matrices

- by stressing only identities in the alignment, stretches of sequence that may have diverged will not penalise any remaining common features

Page 34: Sequence analysis of nucleic acids and proteins: part 1

Scoring Matrices (ctd)

• As the direct source of residue-by-residue comparison scores, the scoring matrix you choose will have a major impact on the alignment calculated

• The most commonly used matrices are the mutation matrices: PAM, BLOSUM

• The matrix that performs best will be the matrix that reflects the evolutionary separation of the sequences being aligned

Page 35: Sequence analysis of nucleic acids and proteins: part 1

Probability and Likelihood

Some probabilities of observations depend on unknown parameters. E.g. if

O = SFFSFFF

then under independence

$\mathrm{pr}(O) = p^2 (1-p)^5$.

We can calculate this for any observation O, so in a sense we have a 2-variable function

pr(O, p) or pr(O | p)

depending on O and p (0 < p < 1).

Likelihood: holds O fixed, varies p.

Maximum Likelihood estimate: the p which maximizes pr(O, p), O fixed, denoted $\hat{p}$.

E.g. above, $\hat{p} = 2/7$.
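The estimate 2/7 can be checked numerically. The Python sketch below evaluates the likelihood p^2 (1-p)^5 on a grid of p values and picks the maximum; the grid search is only an illustration (the closed form is the number of S outcomes divided by the number of trials), and the names are hypothetical.

```python
def likelihood(p, observation="SFFSFFF"):
    """pr(O | p) under independence, where p is the probability of S."""
    successes = observation.count("S")
    failures = observation.count("F")
    return p ** successes * (1 - p) ** failures


# Hold O fixed and vary p over a grid strictly inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
p_hat_grid = max(grid, key=likelihood)
p_hat_exact = 2 / 7

print(p_hat_grid, p_hat_exact)
```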

Page 36: Sequence analysis of nucleic acids and proteins: part 1

Statistical motivation for alignment scores

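Alignment:

AGCTGATCA...
AACCGGTTA...

Hypotheses: H = homologous (indep. sites, Jukes-Cantor); R = random (indep. sites, equal freq.)

$$\mathrm{pr}(\text{data} \mid H) = \mathrm{pr}(\text{alignment} \mid H) = \mathrm{pr}(\text{column 1} \mid H) \times \dots = (1-p)^a p^d$$

where d = # disagreements, a = # agreements, and $p = \tfrac{3}{4}(1 - e^{-8\alpha t})$.

$$\mathrm{pr}(\text{data} \mid R) = \mathrm{pr}(\text{alignment} \mid R) = \mathrm{pr}(\text{column 1} \mid R) \times \dots = \left(\tfrac{1}{4}\right)^a \left(\tfrac{3}{4}\right)^d$$

$$\log\left\{\frac{\mathrm{pr}(\text{data} \mid H)}{\mathrm{pr}(\text{data} \mid R)}\right\} = a \times \log\frac{1-p}{1/4} + d \times \log\frac{p}{3/4}$$

Since $p < \tfrac{3}{4}$, $\log\frac{p}{3/4} < 0$ and $\log\frac{1-p}{1/4} > 0$.

score = a × σ + d × (−μ), where σ = $\log\frac{1-p}{1/4}$ > 0 is the match score and −μ = $\log\frac{p}{3/4}$ < 0 is the mismatch penalty.

Note that if t ≈ 0, p ≈ 6αt, 1−p ≈ 1 and so σ ≈ log 4, while −μ ≈ log 8αt is large and negative: a big difference in the two scores.

Conversely, if t is large, p = (3/4)(1−ε), p/(3/4) = 1−ε, and −μ = log(1−ε) ≈ −ε, while 1−p = (1/4)(1+3ε), (1−p)/(1/4) = 1+3ε, and so σ = log(1+3ε) ≈ 3ε. Thus the scores are about 3:1.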
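The two limiting cases can be checked numerically. The Python sketch below assumes the Jukes-Cantor form p = (3/4)(1 - exp(-8αt)) and takes the match score as log((1-p)/(1/4)) and the mismatch penalty as -log(p/(3/4)); the symbol names (alpha, sigma, mu) are notation chosen here for illustration.

```python
import math


def jukes_cantor_scores(alpha_t):
    """Return (sigma, mu): the log-odds match score and mismatch penalty
    for a given value of alpha * t."""
    p = 0.75 * (1.0 - math.exp(-8.0 * alpha_t))
    sigma = math.log((1.0 - p) / 0.25)  # match score, positive
    mu = -math.log(p / 0.75)            # mismatch penalty, positive
    return sigma, mu


sigma_small, mu_small = jukes_cantor_scores(1e-4)  # t close to 0
sigma_large, mu_large = jukes_cantor_scores(1.0)   # t large

print(sigma_small, mu_small)
print(sigma_large / mu_large)
```

For small t the match score approaches log 4 while the mismatch penalty grows without bound; for large t both shrink towards zero with a ratio close to 3:1.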

Page 37: Sequence analysis of nucleic acids and proteins: part 1

We can do the same with any other Markov substitution matrix for molecular evolution. E.g. with a PAM or BLOSUM matrix of probabilities,

data = a gap-free alignment of two a.a. sequence fragments:

a_1 ..... a_m
b_1 ..... b_m

$$\mathrm{pr}(\text{data} \mid H) = \prod_{i=1}^{m} \pi_{a_i}\, p_{a_i b_i}(2t), \qquad \mathrm{pr}(\text{data} \mid R) = \prod_{i=1}^{m} \pi_{a_i} \pi_{b_i}$$

$$\log\left\{\frac{\mathrm{pr}(\text{data} \mid H)}{\mathrm{pr}(\text{data} \mid R)}\right\} = \log\left\{\prod_{i=1}^{m} \frac{p_{a_i b_i}(2t)}{\pi_{b_i}}\right\}$$

The elements of a log-odds score matrix are typically > 0 on the diagonal and < 0 off the diagonal, but not always.

Also the relative sizes of match scores and mismatch penalties increase as #PAMs (t) decreases. Thus PAM(120) is more stringent than PAM(250), while PAM(360) is less stringent than PAM(250).

PAM(0) = the identity matrix is the toughest.

There are plenty of score matrices based on other principles.

Page 38: Sequence analysis of nucleic acids and proteins: part 1

Local alignment

aligns only the most similar regions of two sequences

Why? Often distantly related proteins have only isolated regions (e.g. active sites) of similarity.

The modular nature of proteins

How? The dynamic programming algorithm we have seen needs only a minor modification to yield the best local alignment between two sequences. It is called the Smith-Waterman algorithm, and is named bestfit in GCG.
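The "minor modification" can be sketched as follows: every cell is floored at zero, and the best score is taken from anywhere in the matrix rather than from the last cell. The Python version below is a minimal illustration with hypothetical scoring values (match +2, mismatch -1, gap -1) and a constant per-residue gap penalty; it returns only the score, not the alignment.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of s and t (scoring values are assumptions)."""
    m, n = len(s), len(t)
    H = [[0] * (n + 1) for _ in range(m + 1)]  # boundary cells stay at 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            H[i][j] = max(
                0,                  # start a new local alignment here
                diagonal,           # extend with a match or mismatch
                H[i - 1][j] + gap,  # gap in t
                H[i][j - 1] + gap,  # gap in s
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("XXXACGTYYY", "ZZACGTZZ"))
```

In this example only the shared ACGT segment contributes to the score; the unrelated flanking residues are ignored, which is the point of a local alignment.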

Page 39: Sequence analysis of nucleic acids and proteins: part 1

The usual caveats:

The question arises every time an alignment is done without prior knowledge of homology.

• the scientific goal is not necessarily the same as the mathematical/statistical goal

•significance may not mean homology

•non-significance may not mean non-homology

Similar Amino Acid Sequences: Chance or Common Ancestry?

Title of paper by Russell F. Doolittle, Science 214 (1981)1

Page 40: Sequence analysis of nucleic acids and proteins: part 1

Early use of statistics

•Generate random permutations of the sequence(s)

•Obtain the average (av) and standard deviation (SD) of the random similarity scores

•Compute z=(observed score - av)/SD

•Think normal (e.g. 4 is a very large z)

This approach is still used for global alignments, but is no longer seen as appropriate for local alignments, since the score is optimized, and random optimal scores do not follow the normal law.
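The permutation procedure can be sketched in Python. For brevity the similarity score used here is a hypothetical stand-in (the number of identical positions in an ungapped, equal-length comparison); in practice the score would come from an alignment algorithm. The number of trials and the seed are also arbitrary choices.

```python
import random
import statistics


def identity_score(x, y):
    """Stand-in similarity score: count of identical aligned positions."""
    return sum(a == b for a, b in zip(x, y))


def permutation_z(x, y, score=identity_score, trials=200, seed=0):
    """z = (observed score - av) / SD over random permutations of y."""
    rng = random.Random(seed)
    observed = score(x, y)
    letters = list(y)
    random_scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        random_scores.append(score(x, "".join(letters)))
    av = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    return (observed - av) / sd


sequence = "ACGT" * 10
z = permutation_z(sequence, sequence)
print(z)
```

Shuffling preserves the composition of the sequence, so the random scores reflect what that composition alone would produce.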

Page 41: Sequence analysis of nucleic acids and proteins: part 1

More recent statistical developments:

Theory developed by Karlin and collaborators in 1990-4 and, independently, by Waterman and collaborators in 1988-94. Incorporates the fact that the score has been optimized.

Immediately implemented in BLAST. Later appears in a similar form in FASTA and elsewhere.

Page 42: Sequence analysis of nucleic acids and proteins: part 1

The theory applies to the ensemble of random pairs of sequences, with fixed

•possibly different lengths,

•possibly different residue distributions

•and ungapped alignments

(extensions to gapped alignments coming now)

Page 43: Sequence analysis of nucleic acids and proteins: part 1

The theoretical distribution of random similarity scores

•is universal in form (see diagram)

•with scale parameter depending on the two residue distributions, and the substitution scores used

•and location parameter depending on the above, plus the lengths of the two sequences

Page 44: Sequence analysis of nucleic acids and proteins: part 1

For m, n large, the optimal random score S has the extreme-value distribution with cdf

$$\exp\{-\exp\{-\lambda (s - u)\}\}$$

where $\lambda$ is the unique positive solution (in t) of

$$\sum_{i,j} p_i q_j \exp(s_{ij} t) = 1,$$

and

$$u = \lambda^{-1} \log(Kmn)$$

and K is given by a series depending on the compositions $(p_i)$ and $(q_j)$ and the scoring matrix $(s_{ij})$.
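The scale parameter can be found numerically. The Python sketch below solves the sum over i, j of p_i q_j exp(s_ij t) = 1 for its positive root by bisection and evaluates the upper tail of the extreme-value distribution with u = log(Kmn)/λ. The names are hypothetical, K is treated as a given input (its series is not computed), and the example scoring scheme (uniform DNA, +1 match, -1 mismatch) is an assumption chosen because its root has the closed form log 3.

```python
import math


def solve_lambda(p, q, s):
    """Positive root t of sum_ij p_i q_j exp(s_ij t) = 1, by bisection.

    Assumes the expected score is negative and at least one score is
    positive; otherwise no positive root exists and the search will not end.
    """
    def f(t):
        return sum(p[a] * q[b] * math.exp(s(a, b) * t) for a in p for b in q) - 1.0

    hi = 1.0
    while f(hi) <= 0:   # move right until the function is positive
        hi *= 2
    lo = hi
    while f(lo) >= 0:   # move left until the function is negative
        lo /= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def p_value(score, m, n, lam, K):
    """P(S >= score) = 1 - exp(-K m n exp(-lambda * score))."""
    return 1.0 - math.exp(-K * m * n * math.exp(-lam * score))


uniform = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform, uniform, lambda a, b: 1 if a == b else -1)
print(lam, math.log(3))
```

The tail probability follows from the cdf because exp(-λ(s - u)) equals K m n exp(-λ s) once u is substituted.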

Page 45: Sequence analysis of nucleic acids and proteins: part 1

Databases searches: why do them?

To find exact matches to sequences

To find homologous sequences

To infer structure and/or function of new protein sequences

To locate genes in ESTs or genomic sequences

To discover gene structure in DNA sequence

And much more...

Page 46: Sequence analysis of nucleic acids and proteins: part 1

Database searching

Compares a query sequence to each sequence in a database (also called a library). Because sequence databases are large, comparisons are generally carried out with fast heuristic approximations to the exact Smith-Waterman local alignment algorithm rather than with the algorithm itself. The two most common of these are FASTA and BLAST; each name corresponds to a family of algorithms used in different contexts.

Page 47: Sequence analysis of nucleic acids and proteins: part 1

BLAST variants for different searches (a)

(after S. Brenner, Trends Guide to Bioinformatics, 1998)

Program   Query     Database   Comparison      Common use
blastn    DNA       DNA        DNA level       Seek identical DNA sequences and splicing patterns
blastp    Protein   Protein    Protein level   Find homologous proteins
blastx    DNA       Protein    Protein level   Analyze new DNA to find genes and seek homologous proteins
tblastn   Protein   DNA        Protein level   Search for genes in unannotated DNA
tblastx   DNA       DNA        Protein level   Discover gene structure

(a) Similar variant programs are available for FASTA. Protein-level searches of DNA sequences are performed by comparing translations of all six reading frames.
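The six-frame translation mentioned in the footnote can be sketched in Python: three reading frames on the given strand and three on its reverse complement. The function names are hypothetical, the standard genetic code is assumed, and incomplete or ambiguous codons are rendered as "X" as a simplification.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
print(frames)
```

Programs such as blastx and tblastn compare these translated frames at the protein level, which is why a DNA query can be searched against a protein database and vice versa.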

Page 48: Sequence analysis of nucleic acids and proteins: part 1

cDNA, ORFs and ESTs

• Complementary DNA (cDNA)
– Single-stranded DNA complementary to an RNA, from which it is synthesized by reverse transcription.

• Open reading frames (ORFs)
– Contain a series of triplets coding for amino acids without any termination codons (potentially translatable into proteins)
– Many derived from sequencing of cDNAs

• Expressed sequence tags (ESTs)
– Short (300-500 bp) single reads from mRNA (cDNA) sequencing survey projects.
– A snapshot of what is expressed in a given tissue at a given developmental stage.