Sequence analysis of nucleic acids and proteins: part 1
Similarity search. Based on Chapter 3 of Post-genome Bioinformatics by Minoru Kanehisa, Oxford University Press, 2000. Search and learning problems in sequence analysis.
Similarity search
Search and learning problems in sequence analysis
(Problems in biological science, each followed by the Math/Stat/CompSci methods applied to them)

Similarity search
• Pairwise sequence alignment
• Database search for similar sequences
• Multiple sequence alignment
• Phylogenetic tree reconstruction
• Protein 3D structure alignment
Methods: optimization algorithms
• Dynamic programming (DP)
• Simulated annealing (SA)
• Genetic algorithms (GA)
• Markov Chain Monte Carlo (MCMC: Metropolis and Gibbs samplers)
• Hopfield neural network

Structure/function prediction
• Ab initio prediction: RNA secondary structure prediction, RNA 3D structure prediction, protein 3D structure prediction
• Knowledge-based prediction: motif extraction, functional site prediction, cellular localization prediction, coding region prediction, transmembrane domain prediction, protein secondary structure prediction, protein 3D structure prediction
Methods: pattern recognition and learning algorithms
• Discriminant analysis
• Neural networks
• Support vector machines
• Hidden Markov models (HMM)
• Formal grammar
• CART

Molecular classification
• Superfamily classification
• Ortholog/paralog grouping of genes
• 3D fold classification
Methods: clustering algorithms
• Hierarchical, k-means, etc.
• PCA, MDS, etc.
• Self-organizing maps, etc.
A comparison of the homology search and the motif search for functional interpretation of sequence information.
[Figure: two flow diagrams.]
Homology search: New sequence → Retrieval from the sequence database (primary data) → Similar sequence → Expert knowledge → Sequence interpretation
Motif search: Sequence database (primary data) → Knowledge acquisition (expert knowledge) → Motif library (empirical rules); New sequence → Inference → Sequence interpretation
Pairwise sequence alignment by the dynamic programming algorithm. The algorithm involves finding the optimal path in the path matrix (a), which is equivalent to searching for the optimal solution in the search tree (b).
[Figure: (a) path matrix for the sequences AIMS and AMOS; (b) search tree, with branches marked X pruned by an optimization function.]
Alignment:
AIM-S
A-MOS
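Methods for computing the optimal score in the dynamic programming algorithm: (a) the gap penalty is a constant; (b) the gap penalty is a linear function of the gap length.
[Figure: (a) $D_{i,j}$ is reached from $D_{i-1,j-1}$ with the weight $w_{s(i),t(j)}$, and from $D_{i-1,j}$ and $D_{i,j-1}$ with the gap penalty $d$. (b) Three values $D_{i,j}^{(1)}$, $D_{i,j}^{(2)}$ and $D_{i,j}^{(3)}$ are kept per cell, with the weight $w_{s(i),t(j)}$ on the diagonal step and the parameter $b$ on the gap steps.]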
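Case (a) can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: it assumes the optimum is a maximum over the three predecessor cells, and the function name and the default scores (match = 1, mismatch = -1, d = -1) are hypothetical choices.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, d=-1):
    """Optimal global alignment score with a constant penalty d per gap position.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(s), len(t)
    # D[i][j] = best score for aligning s[:i] with t[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * d          # s[:i] aligned against gaps only
    for j in range(1, m + 1):
        D[0][j] = j * d          # t[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + w,   # diagonal step
                          D[i - 1][j] + d,       # gap in t
                          D[i][j - 1] + d)       # gap in s
    return D[n][m]


print(global_alignment_score("AIMS", "AMOS"))
```

With these assumed scores the alignment AIM-S / A-MOS (three matches, two gap positions) is optimal.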
Concepts of global and local optimality in pairwise sequence alignment. The distinction lies in how the initial values are assigned to the path matrix.
[Figure: three path matrices, (a) Global vs. Global, (b) Local vs. Global, (c) Local vs. Local, showing which border cells are initialized to 0.]
The order of computing matrix elements in the path matrix, which is suitable for (a) sequential processing and (b) parallel processing.
[Figure: (a) neighbouring cells (i-1, j-1), (i-1, j), (i, j-1), (i, j), (i+1, j-1), (i+1, j); (b) neighbouring cells (i-1, j-1), (i-1, j), (i, j-2), (i, j-1), (i, j), (i+1, j-2), (i+1, j-1).]
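One common way to obtain an order suited to parallel processing is to sweep the matrix by anti-diagonals: every cell with the same i + j depends only on cells from the two previous anti-diagonals, so the cells of one anti-diagonal can be computed at the same time. The sketch below assumes this wavefront reading of panel (b); the function name is hypothetical and indices are 1-based.

```python
def antidiagonal_order(n, m):
    """Yield one list of (i, j) cells per anti-diagonal i + j = k.

    Cells within one list do not depend on each other, so they could be
    filled concurrently; the lists themselves must be processed in order.
    """
    for k in range(2, n + m + 1):
        first_i = max(1, k - m)      # keeps j = k - i <= m
        last_i = min(n, k - 1)       # keeps j = k - i >= 1
        yield [(i, k - i) for i in range(first_i, last_i + 1)]


for cells in antidiagonal_order(2, 3):
    print(cells)
```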
The dynamic programming algorithm can be applied to limited areas, rather than to the entire matrix, after rapidly searching the diagonals that contain candidate markers.
[Figure: path matrix for sequences of lengths n and m, with axes i and j and diagonals indexed by l from 1 to n + m - 1.]
The hashing technique for rapid sequence comparison. In this case the horizontal sequence is converted to a hash table, which contains the locations of the four nucleotides.
[Figure: dot matrix of the query sequence against the horizontal sequence; matching positions are marked *.]
Horizontal sequence: A T C A C A C G G C
Hash table:
Key  Address
A    1 4 6
C    3 5 7 10
G    8 9
T    2
Used in FASTA
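The hash table from the slide, and the diagonal search it enables, can be reproduced with a short sketch. The function names are hypothetical, and the word length k = 1 simply mirrors the single-nucleotide keys shown here; this is not FASTA's actual code.

```python
from collections import Counter, defaultdict


def build_hash_table(seq, k=1):
    """Map every k-letter word of seq to its 1-based start positions."""
    table = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[seq[pos:pos + k]].append(pos + 1)
    return dict(table)


def diagonal_hits(query, table, k=1):
    """Count word matches per diagonal (offset = table position - query position)."""
    hits = Counter()
    for qpos in range(len(query) - k + 1):
        for tpos in table.get(query[qpos:qpos + k], []):
            hits[tpos - (qpos + 1)] += 1
    return hits


table = build_hash_table("ATCACACGGC")
print(table)
print(diagonal_hits("ATC", table).most_common(1))
```

Diagonals with many hits are the candidates to which the dynamic programming step can then be restricted.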
An example of the finite state automaton for pattern matching
[Figure: automaton with states Q0 to Q4 and transitions labelled A, B and C.]
Bold arrows lead to outputs indicating patterns have been found
Used in BLAST
The tree-based progressive method for multiple sequence alignment, which utilizes: (a) a dendrogram obtained by cluster analysis and (b) group alignment for pairwise comparison of groups of sequences.
(a) Dendrogram leaves: DEHUG3, DEPGG3, DEBYG3, DEZYG3, DEBSGF
(b) Group alignment:
L W R D G R G A L Q
L W R G G R G A A Q
D W R - G R T A S G
L R R - A R T A S A
L - R G A R A A A E
Possible tree topologies in the phylogenetic analysis of: (a) three sequences or (b) four sequences. Filled circles represent extant sequences, while open circles represent common ancestors.
[Figure: (a) trees with leaves A, B, C; (b) trees with leaves A, B, C, D.]
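Simulated annealing and Metropolis Monte Carlo methods are based on the concept of thermal fluctuations in the energy functions.
\[ \Delta E = E(x'_n) - E(x_n) \]
\[ p = \begin{cases} 1 & \text{when } \Delta E \le 0 \\ \exp(-\Delta E / T_n) & \text{when } \Delta E > 0 \end{cases} \]
[Figure: energy E plotted against state x.]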
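The acceptance rule can be written directly as code. The sketch below is a generic simulated annealing loop under stated assumptions: the geometric cooling schedule, the parameter names and the toy energy function in the usage lines are all illustrative choices, not part of the slides.

```python
import math
import random


def acceptance_probability(delta_e, temperature):
    """Metropolis rule: always accept downhill moves, sometimes accept uphill ones."""
    return 1.0 if delta_e <= 0 else math.exp(-delta_e / temperature)


def simulated_annealing(energy, neighbor, x0, t0=1.0, cooling=0.95, steps=2000, seed=0):
    """Minimize energy(x); neighbor(x, rng) proposes the candidate state x'."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t0
    for _ in range(steps):
        cand = neighbor(x, rng)
        e_cand = energy(cand)
        if rng.random() < acceptance_probability(e_cand - e, t):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = x, e
        t = max(t * cooling, 1e-12)   # assumed geometric cooling, kept above zero
    return best_x, best_e


# Toy usage: minimize (x - 3)^2 over the integers, moving one step at a time.
best_x, best_e = simulated_annealing(
    energy=lambda x: (x - 3) ** 2,
    neighbor=lambda x, rng: x + rng.choice((-1, 1)),
    x0=20,
)
print(best_x, best_e)
```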
Dynamic programming to find edit distances
- Edit operations: M (match), R (replace), I (insert), D (delete)
- Edit transcript: A string over the alphabet {M, R, I, D} that describes a transformation of one string into another. Example:
R D I M D M
M A - T H S
A - R T - S
- Edit (Levenshtein) distance: The minimum number of edit operations necessary to transform one string into another. (Note: matches are not counted.) The example transcript costs
R D I M D M
1 + 1 + 1 + 0 + 1 + 0 = 4
which is only an upper bound on the distance: the tabulation that follows finds transcripts of cost 3.
The recurrence
- Stage: position in the edit transcript;
- State: I, D, M, or R;
- Optimal value function: D(i, j), where D(i, j) = edit distance of Seq1[1...i] and Seq2[1...j]
- Recurrence relation:
\[ D(i, j) = \min \{\, 1 + D(i-1, j),\; 1 + D(i, j-1),\; t(i, j) + D(i-1, j-1) \,\} \]
where
\[ t(i, j) = \begin{cases} 1, & \mathrm{Seq1}(i) \ne \mathrm{Seq2}(j) \\ 0, & \mathrm{Seq1}(i) = \mathrm{Seq2}(j) \end{cases} \]
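The recurrence translates almost line for line into code. This is a minimal sketch with a hypothetical function name; it assumes the usual boundary values D(i, 0) = i and D(0, j) = j.

```python
def edit_distance_table(seq1, seq2):
    """Return the full table D, where D[i][j] is the edit distance of seq1[:i] and seq2[:j]."""
    n, m = len(seq1), len(seq2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i              # delete the first i letters of seq1
    for j in range(m + 1):
        D[0][j] = j              # insert the first j letters of seq2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            D[i][j] = min(1 + D[i - 1][j],        # D: delete seq1[i]
                          1 + D[i][j - 1],        # I: insert seq2[j]
                          t + D[i - 1][j - 1])    # M or R
    return D


D = edit_distance_table("MATHS", "ARTS")
for row in D:
    print(row)
print("edit distance:", D[-1][-1])
```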
The tabulation, D(i, j)
The first row and the first column are initialized with D(0, j) = j and D(i, 0) = i; the remaining cells are then filled in row by row, giving the completed table:
Seq2(j)          A  R  T  S
Seq1(i)  i\j  0  1  2  3  4
         0    0  1  2  3  4
M        1    1  1  2  3  4
A        2    2  1  2  3  4
T        3    3  2  2  2  3
H        4    4  3  3  3  3
S        5    5  4  4  4  3
The traceback
Seq2(j)          A  R  T  S
Seq1(i)  i\j  0  1  2  3  4
         0    0  1  2  3  4
M        1    1  1  2  3  4
A        2    2  1  2  3  4
T        3    3  2  2  2  3
H        4    4  3  3  3  3
S        5    5  4  4  4  3

The solutions
#1: 1 + 0 + 1 + 1 + 0 = 3
D M R R M
M A T H S
- A R T S

#2: 1 + 0 + 1 + 0 + 1 + 0 = 3
D M I M D M
M A - T H S
- A R T - S

#3: 1 + 1 + 0 + 1 + 0 = 3
R R M D M
M A T H S
A R T - S
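The traceback can be sketched as a recursive walk from the bottom-right cell that follows every predecessor consistent with the recurrence. The function name is hypothetical; D denotes deletion of a letter of Seq1 and I insertion of a letter of Seq2, matching the transcripts above.

```python
def optimal_transcripts(seq1, seq2):
    """Return (edit distance, sorted list of all optimal edit transcripts)."""
    n, m = len(seq1), len(seq2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            D[i][j] = min(1 + D[i - 1][j], 1 + D[i][j - 1], t + D[i - 1][j - 1])

    def walk(i, j):
        """All optimal transcripts that transform seq1[:i] into seq2[:j]."""
        if i == 0 and j == 0:
            return [""]
        found = []
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            found += [p + "D" for p in walk(i - 1, j)]
        if j > 0 and D[i][j] == D[i][j - 1] + 1:
            found += [p + "I" for p in walk(i, j - 1)]
        if i > 0 and j > 0:
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:
                found += [p + ("M" if t == 0 else "R") for p in walk(i - 1, j - 1)]
        return found

    return D[n][m], sorted(walk(n, m))


print(optimal_transcripts("MATHS", "ARTS"))
```

For long sequences the number of optimal transcripts can grow very quickly, so practical tools usually report a single one.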
“Life must be lived forwards and understood backwards.”
- Søren Kierkegaard
BLOSUM62 SCORING MATRIX
C 9
S -1 4
T -1 1 5
P -3 -1 -1 7
A 0 1 0 -1 4
G -3 0 -2 -2 0 6
N -3 1 0 -2 -2 0 6
D -3 0 -1 -1 -2 -1 1 6
E -4 0 -1 -1 -1 -2 0 2 5
Q -3 0 -1 -1 -1 -2 0 0 2 5
H -3 -1 -2 -2 -2 -2 1 -1 0 0 8
R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5
K -3 0 -1 -1 -1 -2 0 -1 1 1 -1 2 5
M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5
I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4
L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4
V -1 -2 0 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4
F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6
Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7
W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11
C S T P A G N D E Q H R K M I L V F Y W
134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
    |     |||         |       |        ||||||    |    || ||
137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM
D:D = +6
D:R = -2
From Henikoff 1996
Scoring Matrices
• Physical/Chemical similarities
- comparing two sequences according to the properties of their residues may highlight regions of structural similarity
• Identity matrices
- by stressing only identities in the alignment, stretches of sequence that may have diverged will not penalise any remaining common features
Scoring Matrices (ctd)
• As the direct source of residue-by-residue comparison scores, the scoring matrix you choose will have a major impact on the alignment calculated
• The most commonly used will be one of the mutation matrices: PAM, BLOSUM
• The matrix that performs best will be the matrix that reflects the evolutionary separation of the sequences being aligned
Probability and Likelihood
Some probabilities of observations depend on unknown parameters. E.g. if
O = SFFSFFF
where each S occurs with probability p, then under independence
$\mathrm{pr}(O) = p^2 (1-p)^5$.
We can calculate this for any observation O, so in a sense we have a 2-variable function
pr(O, p) or pr(O | p)
depending on O and p (0 < p < 1).
Likelihood: holds O fixed, varies p.
Maximum Likelihood estimate: the p which maximizes pr(O, p), O fixed, denoted $\hat{p}$.
E.g. above, $\hat{p} = 2/7$.
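The maximum can be checked numerically. The sketch below evaluates the likelihood of O = SFFSFFF on a grid of p values and picks the largest; the grid resolution is an arbitrary assumption, chosen so that 2/7 is one of the grid points.

```python
def likelihood(p, successes=2, failures=5):
    """pr(O | p) for an observation with 2 S and 5 F under independence."""
    return p ** successes * (1 - p) ** failures


# Evaluate on a grid over the open interval (0, 1) and keep the maximizer.
grid = [k / 7000 for k in range(1, 7000)]
p_hat = max(grid, key=likelihood)
print(p_hat, 2 / 7)
```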
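Statistical motivation for alignment scores
Alignment:
AGCTGATCA...
AACCGGTTA...
Hypotheses:
H = homologous (indep. sites, Jukes-Cantor)
R = random (indep. sites, equal freq.)
Under both hypotheses the probability of the data is a product over the aligned columns:
\[ \mathrm{pr}(\text{data} \mid H) = (1-p)^a \, p^d \]
where d = # disagreements, a = # agreements, and $p = \tfrac{3}{4}(1 - e^{-8\alpha t})$;
\[ \mathrm{pr}(\text{data} \mid R) = \left(\tfrac{1}{4}\right)^a \left(\tfrac{3}{4}\right)^d \]
\[ \log \frac{\mathrm{pr}(\text{data} \mid H)}{\mathrm{pr}(\text{data} \mid R)} = a \times \log\frac{1-p}{1/4} + d \times \log\frac{p}{3/4} \]
Since $p < \tfrac{3}{4}$, $\log\frac{p}{3/4} < 0$ and $\log\frac{1-p}{1/4} > 0$, so
\[ \text{score} = a \times \sigma + d \times (-\mu) \]
with $\sigma > 0$ the match score and $-\mu < 0$ the mismatch penalty.
Note that if $t \to 0$, then $p \approx 6\alpha t$ and $1 - p \approx 1$, and so $\sigma \approx \log 4$, while $-\mu \approx \log 8\alpha t$ is large and negative: a big difference in the two scores.
Conversely, if t is large, $p = \tfrac{3}{4}(1-\varepsilon)$, so $\frac{p}{3/4} = 1-\varepsilon$ and $\log(1-\varepsilon) \approx -\varepsilon$, while $1-p = \tfrac{1}{4}(1+3\varepsilon)$, so $\frac{1-p}{1/4} = 1+3\varepsilon$ and $\log(1+3\varepsilon) \approx 3\varepsilon$. Thus the scores are about 3:1.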
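The two limiting cases can be checked numerically. The sketch below assumes the Jukes-Cantor form p = (3/4)(1 - exp(-8αt)) and computes the log-odds match score log((1-p)/(1/4)) and mismatch score log(p/(3/4)); the function name and the sample values of αt are illustrative.

```python
import math


def jukes_cantor_scores(alpha_t):
    """Return (match score, mismatch score) as natural-log odds for a given alpha * t."""
    p = 0.75 * (1.0 - math.exp(-8.0 * alpha_t))   # probability that a site differs
    match = math.log((1.0 - p) / 0.25)            # positive
    mismatch = math.log(p / 0.75)                 # negative
    return match, mismatch


for alpha_t in (1e-6, 0.05, 1.0):
    match, mismatch = jukes_cantor_scores(alpha_t)
    print(alpha_t, round(match, 4), round(mismatch, 4), round(match / -mismatch, 3))
```

For small αt the match score approaches log 4 while the mismatch score becomes strongly negative; for large αt the ratio of match score to mismatch penalty approaches 3:1.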
We can do the same with any other Markov substitution matrix for molecular evolution. E.g. with a PAM or BLOSUM matrix of probabilities, take
a_1 ..... a_m
b_1 ..... b_m
data = a gap-free alignment of two a.a. sequence fragments
\[ \mathrm{pr}(\text{data} \mid H) = \prod_{i=1}^{m} \pi_{a_i} \, p_{a_i b_i}(2t), \qquad \mathrm{pr}(\text{data} \mid R) = \prod_{i=1}^{m} \pi_{a_i} \pi_{b_i} \]
\[ \log \frac{\mathrm{pr}(\text{data} \mid H)}{\mathrm{pr}(\text{data} \mid R)} = \sum_{i=1}^{m} \log \frac{p_{a_i b_i}(2t)}{\pi_{b_i}} \]
The elements of a log-odds score matrix are typically > 0 on the diagonal and < 0 off the diagonal, but not always.
Also the relative sizes of match scores and mismatch penalties increase as #PAMs (t) decreases. Thus PAM(120) is more stringent than PAM(250), while PAM(360) is less stringent than it.
PAM(0) = the identity matrix is the toughest.
There are plenty of score matrices based on other principles.
Local alignment
aligns only the most similar regions of two sequences
Why? Often distantly related proteins have only isolated regions (e.g. active sites) of similarity.
The modular nature of proteins
How? The dynamic programming algorithm we have seen needs only a minor modification to yield the best local alignment between two sequences. It is called the Smith-Waterman algorithm, and is named bestfit in GCG.
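The minor modification can be sketched as follows: each cell is floored at zero, so an alignment may start anywhere, and the answer is the largest value found anywhere in the matrix, so it may end anywhere. The function name and the scoring values (match = 2, mismatch = -1, gap = -2) are illustrative assumptions, not the GCG bestfit defaults.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between s and t with a constant gap penalty."""
    n, m = len(s), len(t)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                       # restart the alignment here
                          H[i - 1][j - 1] + w,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("XXXXACGTYYYY", "ZZACGTWW"))
```

In this toy usage the only shared region is ACGT, so the local score is four matches regardless of the unrelated flanking letters.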
Similar Amino Acid Sequences: Chance or Common Ancestry?
Title of paper by Russell F. Doolittle, Science 214 (1981)
The question arises every time an alignment is done without prior knowledge of homology.
The usual caveats:
• the scientific goal is not necessarily the same as the mathematical/statistical goal
• significance may not mean homology
• non-significance may not mean non-homology
Early use of statistics
• Generate random permutations of the sequence(s)
• Obtain the average (av) and standard deviation (SD) of the random similarity scores
• Compute z = (observed score - av)/SD
• Think normal (e.g. 4 is a very large z)
This approach is still used for global alignments, but is no longer seen as appropriate for local alignments, since the score is optimized, and random optimal scores do not follow the normal law.
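The permutation procedure can be sketched for global alignment scores. Everything specific here is an assumption made for illustration: the simple scoring scheme, the number of shuffles, the fixed random seed and the function names.

```python
import random
import statistics


def global_score(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score, keeping only two rows of the path matrix."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            curr.append(max(prev[j - 1] + w, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def permutation_z_score(s, t, n_shuffles=200, seed=0):
    """z = (observed score - av) / SD, with av and SD taken over shuffles of t."""
    rng = random.Random(seed)
    observed = global_score(s, t)
    letters = list(t)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)                      # random permutation of t
        scores.append(global_score(s, "".join(letters)))
    av = statistics.mean(scores)
    sd = statistics.stdev(scores)                 # zero if every shuffle scores the same
    return (observed - av) / sd


print(permutation_z_score("ACGTTGCAAGCTTAGC", "ACGTTGCAAGCTTAGC"))
```

Shuffling preserves the composition of the sequence, so the z-score measures how much of the observed score is due to the order of the residues.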
More recent statistical developments:
Theory developed by Karlin and collaborators in 1990-4 and, independently, by Waterman and collaborators in 1988-94. Incorporates the fact that the score has been optimized.
Immediately implemented in BLAST. Later appears in a similar form in FASTA and elsewhere.
The theory applies to the ensemble of random pairs of sequences, with
• fixed, possibly different lengths,
• possibly different residue distributions,
• and ungapped alignments
(extensions to gapped alignments coming now)
The theoretical distribution of random similarity scores
• is universal in form (see diagram)
• with scale parameter depending on the two residue distributions, and the substitution scores used
• and location parameter depending on the above, plus the lengths of the two sequences
For m, n large, the optimal random score S has the extreme-value distribution with cdf
\[ \exp\{-\exp\{-\lambda (s - u)\}\} \]
where $\lambda$ is the unique positive solution (in t) of
\[ \sum_{i,j} p_i q_j \exp(s_{ij} t) = 1, \]
and
\[ u = \lambda^{-1} \log(Kmn), \]
and K is given by a series depending on the compositions $(p_i)$ and $(q_j)$ and the scoring matrix $(s_{ij})$.
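The scale parameter has no closed form in general but is easy to find numerically. The sketch below solves the defining equation by bisection and then evaluates the tail probability of the extreme-value distribution, taking the location parameter as log(Kmn) divided by the scale parameter. The function names are hypothetical, and K is treated as a given input (the value 0.1 in the usage lines is a placeholder, not a computed constant).

```python
import math


def solve_lambda(p, q, score):
    """Unique positive root t of sum_ij p_i q_j exp(s_ij t) = 1, by bisection."""
    def f(t):
        return sum(p[a] * q[b] * math.exp(score(a, b) * t) for a in p for b in q) - 1.0

    expected = sum(p[a] * q[b] * score(a, b) for a in p for b in q)
    if expected >= 0 or not any(score(a, b) > 0 for a in p for b in q):
        raise ValueError("need a negative expected score and at least one positive score")
    lo = hi = 1.0
    while f(lo) > 0:        # f is negative just to the right of 0
        lo /= 2
    while f(hi) <= 0:       # f eventually becomes positive
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def p_value(s, m, n, lam, K):
    """P(S >= s) = 1 - exp(-exp(-lam * (s - u))) with u = log(K m n) / lam."""
    u = math.log(K * m * n) / lam
    return 1.0 - math.exp(-math.exp(-lam * (s - u)))


uniform = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform, uniform, lambda a, b: 1 if a == b else -1)
print(lam, math.log(3))
print(p_value(10, 100, 100, lam, K=0.1))   # K = 0.1 is a placeholder value
```

For uniform base frequencies with scores +1 and -1 the equation reduces to a quadratic in exp(t), whose positive root is log 3.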
Database searches: why do them?
To find exact matches to sequences
To find homologous sequences
To infer structure and/or function of new protein sequences
To locate genes in ESTs or genomic sequences
To discover gene structure in DNA sequence
And much more...
Database searching
Compares a query sequence to each sequence in a database (also called a library). Because sequence databases are so large, comparisons are generally carried out using faster heuristic approximations rather than the exact Smith-Waterman local alignment algorithm. The two most common of these are FASTA and BLAST, where each of these names corresponds to a family of algorithms used in different contexts.
BLAST variants for different searches (a)
(after S. Brenner, Trends Guide to Bioinformatics, 1998)

Program   Query     Database   Comparison      Common use
blastn    DNA       DNA        DNA level       Seek identical DNA sequences and splicing patterns
blastp    Protein   Protein    Protein level   Find homologous proteins
blastx    DNA       Protein    Protein level   Analyze new DNA to find genes and seek homologous proteins
tblastn   Protein   DNA        Protein level   Search for genes in unannotated DNA
tblastx   DNA       DNA        Protein level   Discover gene structure

(a) Similar variant programs are available for FASTA. Protein-level searches of DNA sequences are performed by comparing translations of all six reading frames.
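The six reading frames are the three codon offsets on the given strand plus the three offsets on its reverse complement. The sketch below assumes the standard genetic code, written in the usual compact TCAG order; the function names and the frame labels (+1 to +3, -1 to -3) are illustrative conventions.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become X, stops become *."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```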
cDNA, ORFs and ESTs
• Complementary DNA (cDNA)
– Single-stranded DNA complementary to an RNA, from which it is synthesized by reverse transcription.
• Open reading frames (ORFs)
– Contain a series of triplets coding for amino acids without any termination codons (potentially translatable into proteins)
– Many derived from sequencing of cDNAs
• Expressed sequence tags (ESTs)
– Short (300-500 bp) single reads from mRNA (cDNA) sequencing survey projects.
– A snapshot of what is expressed in a given tissue at a given developmental stage.