similarity searches in sequence databases sequence search

9
Similarity searches in sequence databases Introduction to Bioinformatics B!GRe Bioinformatique des Génomes et Réseaux Jacques van Helden [email protected] Université d’Aix-Marseille, France Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090) http://tagc.univ-mrs.fr/ FORMER ADDRESS (1999-2011) Université Libre de Bruxelles, Belgique Bioinformatique des Génomes et des Réseaux (BiGRe lab) http://www.bigre.ulb.ac.be/ Sequence search algorithms Bioinformatics Matching a sequence against a database ! Example of utilization " We have a gene coding sequence and we would like to search UniProt for all similar proteins, in order to get a cue about the possible function ! Approach: we will align our query sequence to each entry in UniProt. ! Problem of size : in Dec 2005, there were 2.500.000 entries in UniProt (Swiss- Prot + TREMBL) ! It is possible to apply dynamical programming, but it takes a lot of time or requires high computation power. Fast algorithms for database matching ! FastA ! BLAST (Basic Local Alignment Search Tool) ! In short " These algorithms are ~50 times faster than Smith-Waterman " They cannot guarantee the optimal solution " However, a comparison with results obtained by dynamical programming has shown that FastA answer is close to the optimum FastA strategy ! FastA builds an index with the positions of all the small words (k-tuples) found in the query sequence. ! The program then detects diagonals of k-tuples between the query and the database sequences. ! When a significant diagonal is detected, the two sequences are aligned with Smith-Waterman algorithm. ! The size of words (k) influences the behaviour of the program: when k increases, the search is faster but one might miss some matches. 1 2 3 pos 12345678901234567890123456789012 seq MVDFYYLPGSSMVDVFDFYAKAVGVELNKKLL MVD VDF DFY YYL FYY ... 3-tuples position index k-tuple positions MVD 1, 12 VDF 2 DFY 3,17 ... ... Source: Mount (2000) FastA strategy ! Comparison of k-tuple positions between query and database sequences. ! Highest-density regions ("init regions") are identified (the best one is highlighted with an asterisk). Regions with a score below a given threshold appear in dotted lines. ! Low-scoring regions are filtered out, and the remaining regions are joined.

Upload: others

Post on 05-Apr-2022

6 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Similarity searches in sequence databases Sequence search

Similarity searches in sequence databases

Introduction to Bioinformatics

B!GRe Bioinformatique des

Génomes et Réseaux

Jacques van Helden

[email protected] Université d’Aix-Marseille, France

Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090)

http://tagc.univ-mrs.fr/

FORMER ADDRESS (1999-2011) Université Libre de Bruxelles, Belgique

Bioinformatique des Génomes et des Réseaux (BiGRe lab) http://www.bigre.ulb.ac.be/

Sequence search algorithms

Bioinformatics

Matching a sequence against a database

!  Example of utilization "  We have a gene coding sequence and we would like to search UniProt for all similar

proteins, in order to get a cue about the possible function !  Approach: we will align our query sequence to each entry in UniProt. !  Problem of size : in Dec 2005, there were 2.500.000 entries in UniProt (Swiss-

Prot + TREMBL) !  It is possible to apply dynamical programming, but it takes a lot of time or

requires high computation power.

Fast algorithms for database matching

!  FastA !  BLAST (Basic Local Alignment Search Tool) !  In short

"  These algorithms are ~50 times faster than Smith-Waterman "  They cannot guarantee the optimal solution "  However, a comparison with results obtained by dynamical programming has shown

that FastA answer is close to the optimum

FastA strategy

!  FastA builds an index with the positions of all the small words (k-tuples) found in the query sequence.

!  The program then detects diagonals of k-tuples between the query and the database sequences.

!  When a significant diagonal is detected, the two sequences are aligned with Smith-Waterman algorithm.

!  The size of words (k) influences the behaviour of the program: when k increases, the search is faster but one might miss some matches.

! 1 2 3!pos !12345678901234567890123456789012!seq !MVDFYYLPGSSMVDVFDFYAKAVGVELNKKLL!

MVD VDF

DFY

YYL FYY

...

3-tuples position index

k-tuple positions MVD 1, 12 VDF 2 DFY 3,17 ... ... Source: Mount (2000)

FastA strategy

!  Comparison of k-tuple positions between query and database sequences. !  Highest-density regions ("init regions") are identified (the best one is highlighted

with an asterisk). Regions with a score below a given threshold appear in dotted lines.

!  Low-scoring regions are filtered out, and the remaining regions are joined.

Page 2: Similarity searches in sequence databases Sequence search

BLAST - Basic Local Alignement Search Tool

!  Recherche des mots de dictionnaire entre les deux séquences (W=3 pour les aa, W=11 pour les nucléotides)

!  Source: Emese Meglecz

AAATGTCCTGTCTGTACTGTACATGGTCAAACTGGTGAATGGTAGC ||||||||||| ||||||||||| AAATGTCCTGTGTGAATGGTAGATGGTCAAACTGGTCAATTGTATC !

BLAST strategy (gapless version, Altschul et al., 1990)

!  Version 1 (1990) "  Builds a dictionary of k-tuples (small words) found in the query sequence. "  Uses a substitution matrix (e.g. BLOSUM) to calculate a score between these words

and each possible word of the same length. "  Only retains the words with a high score. "  Each time a pair of words from the dictionary are found (hits) in the database

sequence, extends the hit in both direction (without gaps), to obtain a High-scoring Segment Pair (HSP).

"  The program returns sequences with significant high-scoring segment pairs.

The quick brown fox jumps over the lazy dog The quiet brown cat purrs when she sees him

Identité = 1 Substitution = -1

BLAST Elongation de l’alignement

!The quick brown fox jump The quiet brown cat purr 123 45654 56789 876 5654 <= SCORE

Identité = 1 Substitution = -1

BLAST Elongation de l’alignement

!The quick brown fox jump The quiet brown cat purr 123 45654 56789 876 5654 <= SCORE 000 00012 10000 123 4345 <=(SCORE(max)-SCORE) Elongation s’arrête si (SCORE(max)-SCORE) > limite prédéfinie

BLAST Elongation de l’alignement

!The quick brown fox jump The quiet brown cat purr 123 45654 56789 876 5654 <= SCORE 000 00012 10000 123 4345 <=(SCORE(max)-SCORE) Ecourter l’alignement jusqu’au dernier Score(max)

HSP (High Scoring Pair): The quick brown ||| ||| ||||| The quiet brown

BLAST Elongation de l’alignement

Page 3: Similarity searches in sequence databases Sequence search

!  Elongation de l’alignement de deux côtés à partir des mots du dictionnaire "  L’élongation s’arrête si le score diminue au-delà

d’une limite prédéfinie par rapport au dernier maximum

"  L’alignement est écourté jusqu’au dernier score maximal (S) => HSP (High Scoring Pair)

BLAST – élongation

TAAATGGTCATGTGATGGTCCTGACTGATGCTGCCTGA GAAATGGTCATGTGATGGTCGTAACGATGCAATTGGGC

BLAST – exemple

Faites un alignement local entre ces deux séquences en suivant l’algorithme de BLAST version1 Identité: 1 Substitution: -1 Différence maximal entre le score actuel et le score maximale : 5

TAAATGGTCATGTGATGGTCCTGACTGATGCTGCCTGA GAAATGGTCATGTGATGGTCGTAACGATGCAATTGGGC

BLAST – exemple

Faites un alignement local entre ces deux séquences en suivant l’algorithme de BLAST version1 Identité: 1 Substitution: -1 Différence maximal entre le score actuel et le score maximale : 5

Noyau

TAAATGGTCATGTGATGGTCCTGACTGATGCTGCCTGA Seq1 GAAATGGTCATGTGATGGTCGTAACGATGCAATTGGGC Seq2 12345678911111111111111211111 Score 01234567898989098765

BLAST – exemple

Identité: 1 Substitution: -1 Différence maximal entre le score actuel et le score maximale : 5

TAAATGGTCATGTGATGGTCCTGACTGATGCTGCCTGA Seq1 GAAATGGTCATGTGATGGTCGTAACGATGCAATTGGGC Seq2 12345678911111111111111211111 Score 01234567898989098765 00000000000000000001010012345 Score(max)-Score

BLAST – exemple

Identité: 1 Substitution: -1 Différence maximal entre le score actuel et le score maximale : 5

TAAATGGTCATGTGATGGTCCTGACTGATGCTGCCTGA Seq1 ||||||||||||||||||| | || GAAATGGTCATGTGATGGTCGTAACGATGCAATTGGGC Seq2 12345678911111111111111211111 Score 01234567898989098765 00000000000000000001010012345 Score(max)-Score

BLAST – exemple

Identité: 1 Substitution: -1 Différence maximal entre le score actuel et le score maximale : 5

Page 4: Similarity searches in sequence databases Sequence search

BLAST strategy (gapped version, Altschul et al., 1997)

!  Version 2 (1997) "  Use smaller words, but only extend when there are two hits on the same diagonal. "  Extension includes gaps (dynamical programming). "  The extension costs more time, but the number of times it is done is reduced, because

the extension requires a pair of hits.

BLAST strategies (Altschul et al., 1990; 1997)

!  Version 1 (1990): gapless BLAST "  Builds a dictionary of k-tuples (small words) found in the query sequence. "  Uses a substitution matrix (e.g. BLOSUM) to calculate a score between these words

and each possible word of the same length. "  Only retains the words with a high score. "  Each time a pair of words from the dictionary are found (hits) in the database

sequence, extends the hit in both direction (without gaps), to obtain a High-scoring Segment Pair (HSP).

"  The program returns sequences with significant high-scoring segment pairs. !  Version 2 (1997): gapped BLAST

"  Use smaller words, but only extend when there are two hits on the same diagonal. "  Extension includes gaps (dynamical programming).

!  PSI-BLAST (in the 1997 article as well) "  A second step after the proper BLAST process. "  Once the gapped BLAST has returned a set of sequences, these sequences are

aligned and used to build a profile motif. "  The database is then scanned with this profile motif to collect additional similarities. "  The process can be iterated several times

•  collect sequences > build a profile -> collect sequences -> build a profile ...

Some traps for BLAST searches

!  Spurious domains "  Some domains are found in many proteins. This does not mean that these proteins

have the same function. The width of the alignment should thus be analyzed to assess whether the alignment covers most of the sequence length, or only a small segment.

!  Low complexity regions (repetitive sequences). "  return multiple matches with no apparent functional relationship.

!  Cloning vectors "  Some entries in the database contain a fragment of the cloning vector. This can return

many apparent matches, where the matching region is restricted to the cloning vector.

Statistics of sequence similarities

Bioinformatics

R L A S V E T D M P L T L R Q H! ! T L T S L Q T T L K A H L G T H!

Ala A 4Arg R -1 5Asn N -2 0 6Asp D -2 -2 1 6Cys C 0 -3 -3 -3 9Gln Q -1 1 0 0 -3 5Glu E -1 0 0 2 -4 2 5Gly G 0 -2 0 -1 -3 -2 -2 6His H -2 0 1 -1 -3 0 0 -2 8Ile I -1 -3 -3 -3 -1 -3 -3 -4 -3 4Leu L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4Lys K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5Met M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5Phe F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6Pro P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7Ser S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4Thr T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5Trp W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11Tyr Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7Val V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

Ala

Arg

Asn

Asp

Cys

Gln

Glu

Gly

His

Ile

Leu

Lys

Met

Phe

Pro

Ser

Thr

Trp

Tyr

Val

A R N D C Q E G H I L K M F P S T W Y V

!

S = sr1,i r2,ii=1

L

"

Matching statistics - raw score

!  The raw score of a matching segment pair (MSP) is obtained by summing the scores (obtained from the substitution matrix) for each pair of residues (r1,i and r2,i) over the length of the alignment (L).

R L A S V E T D M P L T L R Q H! . | . | : : | . : . . . | . . |! T L T S L Q T T L K A H L G T H!-1 +4 +0 +4 +1 +2 +5 -1 +2 -1 -1 -2 +4 -2 -1 +8 = 21!

Ala A 4Arg R -1 5Asn N -2 0 6Asp D -2 -2 1 6Cys C 0 -3 -3 -3 9Gln Q -1 1 0 0 -3 5Glu E -1 0 0 2 -4 2 5Gly G 0 -2 0 -1 -3 -2 -2 6His H -2 0 1 -1 -3 0 0 -2 8Ile I -1 -3 -3 -3 -1 -3 -3 -4 -3 4Leu L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4Lys K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5Met M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5Phe F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6Pro P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7Ser S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4Thr T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5Trp W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11Tyr Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7Val V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

Ala

Arg

Asn

Asp

Cys

Gln

Glu

Gly

His

Ile

Leu

Lys

Met

Phe

Pro

Ser

Thr

Trp

Tyr

Val

A R N D C Q E G H I L K M F P S T W Y V

!

S = sr1,i r2,ii=1

L

"

Matching statistics - raw score

!  The raw score of a matching segment pair (MSP) is obtained by summing the scores (obtained from the substitution matrix) for each pair of residues (r1,i and r2,i) over the length of the alignment (L).

Page 5: Similarity searches in sequence databases Sequence search

MSP-wise P-value and bit score

!  The P-value of a matching segment pair (MSP) with score S is the probability to observe a score of at least S by chance.

"  Karlin and Altschul (1990) defined a way to calculate the P-value of an MSP.

"  The P-value follows an exponential distribution, with two parameters : lambda and K. These two parameters depend on the substitution matrix chosen. They have thus to be estimated for each substitution matrix separately.

"  The analytic way to determine the parameters lambda and L is only valid for gapless alignments.

"  For alignment with gaps, Altschul et al (1997) propose to estimate these parameters on the basis of empirical observations

!  Bit score of an MSP "  Karlin and Altschul (1990) also propose to convert

the raw score S into a bit score S’. "  This facilitates the interpretation of the score,

because the P-value can be directly calculated from the bit score, irrespective of the substitution matrix used for the alignment. .

PvalSMSP = P(X ! S)

= Ke"!S

!

S '= "S # ln(K )ln(2)

!

PvalSMSP = Ke"#S

= Ke" ln(2)S '+ ln(K )

= 2"S'

Matching statistics - the E-value

!  Let us imagine that we align a random word with another random word. The score is likely to be generally low.

!  However, if this is repeated billions of times, some high scores will occasionally occur by chance.

!  In a database scan, each word of the query sequence is compared to each word of the database.

!  For a query sequence of size m and a database of size n, the search space (number of word pair comparisons) is thus N=nm.

!  FastA and BLAST estimate, for each score, the number of matches that would be expected by chance, given the size of the database. This is called the E-value.

!  The E-value is the product of the nominal P-value (i.e. the risk of false positives for a single comparison) by the size of the search space.

!  For a given score S, the expected number of random matches thus increases with the size of the database.

N = n !m

E =m !n !Pval=m !n !K !e"!S

= N !K !e"!S

= N / 2S '

Threshold on E-value

!  The lower is the E-value, the more significant is the match. !  High E-value ( > 1) indicate that the match should not be trusted too much. !  One essential parameter of FastA and BLAST is the threshold on E-value. !  Beware

"  On the BLAST Web server at NCBI, the default threshold value is 10 "  This means that each query would return ~10 matches by chance alone. "  If this default value is used, we already know that the answer is likely to contain

~10 false positive.

!

PvalDB = P X "1( )=1# P X = 0( )

=1# e#EE 0

0!=1# e#E

!

P X " x( ) =e#EE i

i!i= x

N

$

=1# e#EE i

i!i= 0

x#1

$

Matching statistics - database-wise P-value (=Family-Wise Error Rate)

!  From the E-value (E), one can estimate the probability to observe at least X matches by chance in random sequences.

!  This is a simple application of the Poisson distribution : calculate the probability to observe X occurrences of an event whilst expecting E.

!  A particular case is the probability to observe at least one match by chance (X>=1).

!  This probability is called database-wise P-value.

!  This P-value represents the probability to find at least one spurious match in the whole database search, with a score greater or equal to S.

Distribution of probability for matching scores

!  When one performs a database similarity search, the distribution of scores follows an extreme value distribution. This distribution is asymmetric, and should thus not be modelled with a normal (Gaussian) distribution.

Source: W.P.Pearson (2000). Protein sequence comparison and Protein evolution. ISMB Tutorial.

Interpreting similarity search results

Bioinformatics

[email protected] Université Libre de Bruxelles, Belgique

Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/

Page 6: Similarity searches in sequence databases Sequence search

Score distribution

!  The histogram shows the number of database matches for each score.

!  For scores higher than 92, the number of matches is very small.

!  A higher resolution histogram is shown besides the main histogram.

!  Asterisks (*) represent the random expectation (E-value) for each score

opt E()< 20 17 0:= one = represents 22 library sequences22 0 0:24 0 0:26 2 0:=28 7 3:*30 7 18:*32 45 68:===*34 166 184:========*36 337 379:================ *38 581 626:=========================== *40 869 873:=======================================*42 1009 1067:============================================== *44 1276 1177:=====================================================*====46 1253 1198:======================================================*==48 1199 1147:====================================================*==50 1032 1047:===============================================*52 949 920:=========================================*==54 838 786:===================================*===56 578 657:=========================== *58 467 539:====================== *60 393 437:================== *62 339 350:===============*64 276 278:============*66 214 220:=========*68 188 173:=======*=70 140 136:======*72 131 106:====*=74 88 83:===*76 71 64:==*=78 48 50:==*80 43 39:=*82 38 30:=*84 27 24:=*86 21 18:*88 15 14:*90 17 11:*92 7 8:* :=======* = represents 1 library sequence94 22 7:* :======*===============96 3 5:* :=== *98 8 4:* :===*====100 6 3:* :==*===102 5 2:* :=*===104 9 2:* :=*=======106 4 1:* :*===108 5 1:* :*====110 4 1:* :*===112 4 1:* :*===114 4 1:* :*===116 6 0:= *======118 1 0:= *=>120 32 0:== *================================

FastA output from Pearson (2000)

zoom

BLASTP 2.2.6 [Apr-09-2003]!!!Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, !Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), !"Gapped BLAST and PSI-BLAST: a new generation of protein database search!programs", Nucleic Acids Res. 25:3389-3402.!!Query= metL!gi|16131778|ref|NP_418375.1| aspartokinase II and!homoserine dehydrogenase II; bifunctional: aspartokinase II!(N-terminal); homoserine dehydrogenase II (C-terminal) [Escherichia!coli K12]! (810 letters)!!Database: /Users/jvanheld/rsa-!tools/data/genomes/Escherichia_coli_K12/genome/NC_000913.faa ! 4242 sequences; 1,351,322 total letters!!Searching.........done!! Score E!Sequences producing significant alignments: (bits) Value!!gi|16131778|ref|NP_418375.1| aspartokinase II and homoserine deh... 1596 0.0 !gi|16127996|ref|NP_414543.1| bifunctional: aspartokinase I (N-te... 344 2e-95!gi|16131850|ref|NP_418448.1| aspartokinase III, lysine sensitive... 122 7e-29!gi|16128228|ref|NP_414777.1| gamma-glutamate kinase [Escherichia... 31 0.28 !!>gi|16131778|ref|NP_418375.1| aspartokinase II and homoserine! dehydrogenase II; bifunctional: aspartokinase II! (N-terminal); homoserine dehydrogenase II (C-terminal)! [Escherichia coli K12]! Length = 810!! Score = 1596 bits (4132), Expect = 0.0! Identities = 810/810 (100%), Positives = 810/810 (100%)!!Query: 1 MSVIAQAGAKGRQLHKFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINW 60! MSVIAQAGAKGRQLHKFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINW!Sbjct: 1 MSVIAQAGAKGRQLHKFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINW 60!!Query: 61 LKLSQTDRLSAHQVQQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDA 120! LKLSQTDRLSAHQVQQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDA!Sbjct: 61 LKLSQTDRLSAHQVQQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDA 120!!Query: 121 VYAEVVGHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAERAAQPQVDEGLSYPLLQQLL 180! VYAEVVGHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAERAAQPQVDEGLSYPLLQQLL!Sbjct: 121 VYAEVVGHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAERAAQPQVDEGLSYPLLQQLL 180!!Query: 181 VQHPGKRLVVTGFISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADP 240! VQHPGKRLVVTGFISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADP!Sbjct: 181 VQHPGKRLVVTGFISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADP 240!!Query: 241 RKVKDACLLPLLRLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRIER 300! RKVKDACLLPLLRLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRIER!Sbjct: 241 RKVKDACLLPLLRLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRIER 300!!Query: 301 VLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLL 360! VLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLL!Sbjct: 301 VLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLL 360!!Query: 361 QFCYTSEVADSALKILDEAGLPGELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPV 420! QFCYTSEVADSALKILDEAGLPGELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPV!Sbjct: 361 QFCYTSEVADSALKILDEAGLPGELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPV 420!!Query: 421 EFTWQSDDGISLVAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGKGNIGSRWLELFARE 480! EFTWQSDDGISLVAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGKGNIGSRWLELFARE!Sbjct: 421 EFTWQSDDGISLVAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGKGNIGSRWLELFARE 480!!Query: 481 QSTLSARTGFEFVLAGVVDSRRSLLSYDGLDASRALAFFNDEAVEQDEESLFLWMRAHPY 540! QSTLSARTGFEFVLAGVVDSRRSLLSYDGLDASRALAFFNDEAVEQDEESLFLWMRAHPY!Sbjct: 481 QSTLSARTGFEFVLAGVVDSRRSLLSYDGLDASRALAFFNDEAVEQDEESLFLWMRAHPY 540!!Query: 541 DDLVVLDVTASQQLADQYLDFASHGFHVISANKLAGASDSNKYRQIHDAFEKTGRHWLYN 600! DDLVVLDVTASQQLADQYLDFASHGFHVISANKLAGASDSNKYRQIHDAFEKTGRHWLYN!Sbjct: 541 DDLVVLDVTASQQLADQYLDFASHGFHVISANKLAGASDSNKYRQIHDAFEKTGRHWLYN 600!!Query: 601 ATVGAGLPINHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSVPFTELVDQAWQQGLT 660! ATVGAGLPINHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSVPFTELVDQAWQQGLT!Sbjct: 601 ATVGAGLPINHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSVPFTELVDQAWQQGLT 660!!Query: 661 EPDPRDDLSGKDVMRKLVILAREAGYNIEPDQVRVESLVPAHCEGGSIDHFFENGDELNE 720! EPDPRDDLSGKDVMRKLVILAREAGYNIEPDQVRVESLVPAHCEGGSIDHFFENGDELNE!Sbjct: 661 EPDPRDDLSGKDVMRKLVILAREAGYNIEPDQVRVESLVPAHCEGGSIDHFFENGDELNE 720!!Query: 721 QMVQRLEAAREMGLVLRYVARFDANGKARVGVEAVREDHPLASLLPCDNVFAIESRWYRD 780! QMVQRLEAAREMGLVLRYVARFDANGKARVGVEAVREDHPLASLLPCDNVFAIESRWYRD!Sbjct: 721 QMVQRLEAAREMGLVLRYVARFDANGKARVGVEAVREDHPLASLLPCDNVFAIESRWYRD 780!!Query: 781 NPLVIRGPGAGRDVTAGAIQSDINRLAQLL 810! NPLVIRGPGAGRDVTAGAIQSDINRLAQLL!Sbjct: 781 NPLVIRGPGAGRDVTAGAIQSDINRLAQLL 810!!!>gi|16127996|ref|NP_414543.1| bifunctional: aspartokinase I! (N-terminal); homoserine dehydrogenase I (C-terminal)! [Escherichia coli K12]! Length = 820!! Score = 344 bits (882), Expect = 2e-95! Identities = 247/821 (30%), Positives = 410/821 (49%), Gaps = 44/821 (5%)!!Query: 16 KFGGSSLADVKCYLRVAGIMAEYSQPDDMM-VVSAAGSTTNQLINWLKLSQTDRLSAHQV 74! KFGG+S+A+ + +LRVA I+ ++ + V+SA TN L+ ++ + + + + +!Sbjct: 5 KFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNI 64!!Query: 75 QQTLRRYQCDLISGLLPAEEADSL--ISAFVSDLERLAALLDSGIN------DAVYAEVV 126! R + +L++GL A+ L + FV + GI+ D++ A ++!Sbjct: 65 SDAERIF-AELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALI 123!!Query: 127 GHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAER---AAQPQVDEGLSYPLLQQLLVQH 183! GE S +M+ VL +G +D E L A + + E ++ H!Sbjct: 124 CRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADH 183!!Query: 184 PGKRLVVTGFISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKV 243! +++ GF + N GE V+LGRNGSDYSA + A IW+DV GVY+ DPR+V!Sbjct: 184 ---MVLMAGFTAGNEKGELVVLGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQV 240!!Query: 244 KDACLLPLLRLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQ-----GSTRI 298! DA LL + EA EL+ A VLH RT+ P++ +I ++ + P G++R !Sbjct: 241 PDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASRD 300!!Query: 299 ERVLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQ 358! E L + +++ +++ + P + + + RA++ + + + !Sbjct: 301 EDELP----VKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLITQSSSEY 356!!Query: 359 LLQFCYTSEVADSALKILDEA-------GLPGELRLRQGLALVAMVGAGVTRNPLHCHRF 411! + FC A + + E GL L + + LA++++VG G+ +F!Sbjct: 357 SISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKF 416!!Query: 412 WQQLKGQPVEFTW--QSDDGISLVAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGKGNI 469! + L + Q S+ V+ + ++ HQ +F ++ I + + G G +!Sbjct: 417 FAALARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGV 476!!Query: 470 GSRWLELFAREQSTLSARTGFEFVLAGVVDSRRSLLSYDGLDASRALAFFNDEAVEQDEE 529! G LE R+QS L + + + GV +S+ L + GL+ L + +E + E !Sbjct: 477 GGALLEQLKRQQSWLKNKH-IDLRVCGVANSKALLTNVHGLN----LENWQEELAQAKEP 531!!Query: 530 ----SLFLWMRAHPYDDLVVLDVTASQQLADQYLDFASHGFHVISANKLAGASDSNKYRQ 585! L ++ + + V++D T+SQ +ADQY DF GFHV++ NK A S + Y Q!Sbjct: 532 FNLGRLIRLVKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHQ 591!!Query: 586 IHDAFEKTGRHWLYNATVGAGLPINHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSV 645! + A EK+ R +LY+ VGAGLP+ +++L+++GD ++ SGI SG+LS++F + D +!Sbjct: 592 LRYAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEGM 651!!Query: 646 PFTELVDQAWQQGLTEPDPRDDLSGKDVMRKLVILAREAGYNIEPDQVRVESLVPAHCEG 705! F+E A + G TEPDPRDDLSG DV RKL+ILARE G +E + +E ++PA !Sbjct: 652 SFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIEIEPVLPAEFNA 711!!Query: 706 -GSIDHFFENGDELNEQMVQRLEAAREMGLVLRYVARFDANGKARVGVEAVREDHPLASL 764! G + F N +L++ R+ AR+ G VLRYV D +G RV + V + PL +!Sbjct: 712 EGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFKV 771!!Query: 765 LPCDNVFAIESRWYRDNPLVIRGPGAGRDVTAGAIQSDINR 805! +N A S +Y+ PLV+RG GAG DVTA + +D+ R!Sbjct: 772 KNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLR 812!!!>gi|16131850|ref|NP_418448.1| aspartokinase III, lysine sensitive;! aspartokinase III, lysine-sensitive [Escherichia coli! K12]! Length = 449!! Score = 122 bits (307), Expect = 7e-29! Identities = 121/452 (26%), Positives = 194/452 (42%), Gaps = 25/452 (5%)!!Query: 16 KFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINWLK-LSQTDRLSAHQV 74! KFGG+S+AD R A I+ + ++V+SA+ TN L+ + L +R + !Sbjct: 8 KFGGTSVADFDAMNRSADIVLSDANVR-LVVLSASAGITNLLVALAEGLEPGERF---EK 63!!Query: 75 QQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDAVYAEVVGHGEVWSA 134! +R Q ++ L I + ++ LA + A+ E+V HGE+ S !Sbjct: 64 LDAIRNIQFAILERLRYPNVIREEIERLLENITVLAEAAALATSPALTDELVSHGELMST 123!!Query: 135 RLMSAVLNQQGLPAAWLDAREFLRA-ERAAQPQVDEGLSYPLLQQLLVQHPGKRLVVT-G 192! L +L ++ + A W D R+ +R +R + + D L L+ + LV+T G!Sbjct: 124 LLFVEILRERDVQAQWFDVRKVMRTNDRFGRAEPDIAALAELAALQLLPRLNEGLVITQG 183!!Query: 193 FISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLLPLL 252! FI N G T LGR GSDY+A + SRV IW+DV G+Y+ DPR V A + +!Sbjct: 184 FIGSENKGRTTTLGRGGSDYTAALLAEALHASRVDIWTDVPGIYTTDPRVVSAAKRIDEI 243!!Query: 253 RLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRI---------ERVLA 303! EA+E+A A VLH TL P S+I + + S P G T + R LA!Sbjct: 244 AFAEAAEMATFGAKVLHPATLLPAVRSDIPVFVGSSKDPRAGGTLVCNKTENPPLFRALA 303!!Query: 304 SGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLLQFC 363! ++T H L A LA I L +A+ L !Sbjct: 304 LRRNQTLLTLHSLNMLHSRGFLAEVFGILARHNISVDLITTSEVSVAL-------TLDTT 356!!Query: 364 YTSEVADSAL--KILDEAGLPGELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPVE 421! ++ D+ L +L E + + +GLALVA++G +++ + L+ + !Sbjct: 357 GSTSTGDTLLTQSLLMELSALCRVEVEEGLALVALIGNDLSKACGVGKEVFGVLEPFNIR 416!!Query: 422 FTWQSDDGISLVAVLRTGPTESLIQGLHQSVF 453! +L ++ E ++Q LH ++F!Sbjct: 417 MICYGASSHNLCFLVPGEDAEQVVQKLHSNLF 448!!!>gi|16128228|ref|NP_414777.1| gamma-glutamate kinase [Escherichia! coli K12]! Length = 367!! Score = 31.2 bits (69), Expect = 0.28! Identities = 17/56 (30%), Positives = 29/56 (51%)!!Query: 194 ISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLL 249! I+ N+A T + +D + LAG ++ + +D G+Y+ADPR A L+!Sbjct: 133 INENDAVATAEIKVGDNDNLSALAAILAGADKLLLLTDQKGLYTADPRSNPQAELI 188!!! Database: /Users/jvanheld/rsa-! tools/data/genomes/Escherichia_coli_K12/genome/NC_000913.faa! Posted date: Sep 8, 2004 12:13 PM! Number of letters in database: 1,351,322! Number of sequences in database: 4242! !Lambda K H! 0.320 0.136 0.397 !!Gapped!Lambda K H! 0.267 0.0410 0.140 !!!Matrix: BLOSUM62!Gap Penalties: Existence: 11, Extension: 1!Number of Hits to DB: 2,199,628!Number of Sequences: 4242!Number of extensions: 96525!Number of successful extensions: 290!Number of sequences better than 1.0: 4!Number of HSP's better than 1.0 without gapping: 4!Number of HSP's successfully gapped in prelim test: 0!Number of HSP's that attempted gapping in prelim test: 279!Number of HSP's gapped (non-prelim): 5!length of query: 810!length of database: 1,351,322!effective HSP length: 92!effective length of query: 718!effective length of database: 961,058!effective search space: 690039644!effective search space used: 690039644!T: 11!A: 40!X1: 16 ( 7.4 bits)!X2: 38 (14.6 bits)!X3: 64 (24.7 bits)!S1: 41 (21.8 bits)!S2: 65 (29.6 bits)!

BLAST result

!  The text shows the result of a BLAST search,

!  Query: the E.coli protein MetL, a bifunctional enzyme combining aspartokinase and homoserine dehydrogenase activities.

!  Database: all proteins from Escherichia coli K12.

!  The BLAST result file starts with a summary of

"  the parameters used for the search

"  The matching sequences and the score of each match.

BLAST result - first match

!  The first match is the query sequence itself (metL). This is not surprising since we scanned the set of all E.coli proteins with a protein from E.coli.

!  The E-value (0) means that, with this level of similarity; one would expect 0 false positive by chance.

>gi|16131778|ref|NP_418375.1| aspartokinase II and homoserine! dehydrogenase II; bifunctional: aspartokinase II! (N-terminal); homoserine dehydrogenase II (C-terminal)! [Escherichia coli K12]! Length = 810!! Score = 1596 bits (4132), Expect = 0.0! Identities = 810/810 (100%), Positives = 810/810 (100%)!!Query: 1 MSVIAQAGAKGRQLHKFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINW 60! MSVIAQAGAKGRQLHKFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINW!Sbjct: 1 MSVIAQAGAKGRQLHKFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINW 60!!Query: 61 LKLSQTDRLSAHQVQQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDA 120! LKLSQTDRLSAHQVQQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDA!Sbjct: 61 LKLSQTDRLSAHQVQQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDA 120!!Query: 121 VYAEVVGHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAERAAQPQVDEGLSYPLLQQLL 180! VYAEVVGHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAERAAQPQVDEGLSYPLLQQLL!Sbjct: 121 VYAEVVGHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAERAAQPQVDEGLSYPLLQQLL 180!!Query: 181 VQHPGKRLVVTGFISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADP 240! VQHPGKRLVVTGFISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADP!Sbjct: 181 VQHPGKRLVVTGFISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADP 240!!Query: 241 RKVKDACLLPLLRLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRIER 300! RKVKDACLLPLLRLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRIER!Sbjct: 241 RKVKDACLLPLLRLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRIER 300!!Query: 301 VLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLL 360! VLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLL!Sbjct: 301 VLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLL 360!!Query: 361 QFCYTSEVADSALKILDEAGLPGELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPV 420! QFCYTSEVADSALKILDEAGLPGELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPV!Sbjct: 361 QFCYTSEVADSALKILDEAGLPGELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPV 420!!Query: 421 EFTWQSDDGISLVAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGKGNIGSRWLELFARE 480! EFTWQSDDGISLVAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGKGNIGSRWLELFARE!Sbjct: 421 EFTWQSDDGISLVAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGKGNIGSRWLELFARE 480!!Query: 481 QSTLSARTGFEFVLAGVVDSRRSLLSYDGLDASRALAFFNDEAVEQDEESLFLWMRAHPY 540! QSTLSARTGFEFVLAGVVDSRRSLLSYDGLDASRALAFFNDEAVEQDEESLFLWMRAHPY!Sbjct: 481 QSTLSARTGFEFVLAGVVDSRRSLLSYDGLDASRALAFFNDEAVEQDEESLFLWMRAHPY 540!!Query: 541 DDLVVLDVTASQQLADQYLDFASHGFHVISANKLAGASDSNKYRQIHDAFEKTGRHWLYN 600! DDLVVLDVTASQQLADQYLDFASHGFHVISANKLAGASDSNKYRQIHDAFEKTGRHWLYN!Sbjct: 541 DDLVVLDVTASQQLADQYLDFASHGFHVISANKLAGASDSNKYRQIHDAFEKTGRHWLYN 600!!Query: 601 ATVGAGLPINHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSVPFTELVDQAWQQGLT 660! ATVGAGLPINHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSVPFTELVDQAWQQGLT!Sbjct: 601 ATVGAGLPINHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSVPFTELVDQAWQQGLT 660!!Query: 661 EPDPRDDLSGKDVMRKLVILAREAGYNIEPDQVRVESLVPAHCEGGSIDHFFENGDELNE 720! EPDPRDDLSGKDVMRKLVILAREAGYNIEPDQVRVESLVPAHCEGGSIDHFFENGDELNE!Sbjct: 661 EPDPRDDLSGKDVMRKLVILAREAGYNIEPDQVRVESLVPAHCEGGSIDHFFENGDELNE 720!!Query: 721 QMVQRLEAAREMGLVLRYVARFDANGKARVGVEAVREDHPLASLLPCDNVFAIESRWYRD 780! QMVQRLEAAREMGLVLRYVARFDANGKARVGVEAVREDHPLASLLPCDNVFAIESRWYRD!Sbjct: 721 QMVQRLEAAREMGLVLRYVARFDANGKARVGVEAVREDHPLASLLPCDNVFAIESRWYRD 780!!Query: 781 NPLVIRGPGAGRDVTAGAIQSDINRLAQLL 810! NPLVIRGPGAGRDVTAGAIQSDINRLAQLL!Sbjct: 781 NPLVIRGPGAGRDVTAGAIQSDINRLAQLL 810!!!>gi|16127996|ref|NP_414543.1| bifunctional: aspartokinase I! (N-terminal); homoserine dehydrogenase I (C-terminal)! [Escherichia coli K12]! Length = 820!! Score = 344 bits (882), Expect = 2e-95! Identities = 247/821 (30%), Positives = 410/821 (49%), Gaps = 44/821 (5%)!!Query: 16 KFGGSSLADVKCYLRVAGIMAEYSQPDDMM-VVSAAGSTTNQLINWLKLSQTDRLSAHQV 74! KFGG+S+A+ + +LRVA I+ ++ + V+SA TN L+ ++ + + + + +!Sbjct: 5 KFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNI 64!!Query: 75 QQTLRRYQCDLISGLLPAEEADSL--ISAFVSDLERLAALLDSGIN------DAVYAEVV 126! R + +L++GL A+ L + FV + GI+ D++ A ++!Sbjct: 65 SDAERIF-AELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALI 123!!Query: 127 GHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAER---AAQPQVDEGLSYPLLQQLLVQH 183! GE S +M+ VL +G +D E L A + + E ++ H!Sbjct: 124 CRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADH 183!!Query: 184 PGKRLVVTGFISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKV 243! +++ GF + N GE V+LGRNGSDYSA + A IW+DV GVY+ DPR+V!Sbjct: 184 ---MVLMAGFTAGNEKGELVVLGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQV 240!!Query: 244 KDACLLPLLRLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQ-----GSTRI 298! DA LL + EA EL+ A VLH RT+ P++ +I ++ + P G++R !Sbjct: 241 PDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASRD 300!!Query: 299 ERVLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQ 358! E L + +++ +++ + P + + + RA++ + + + !Sbjct: 301 EDELP----VKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLITQSSSEY 356!!Query: 359 LLQFCYTSEVADSALKILDEA-------GLPGELRLRQGLALVAMVGAGVTRNPLHCHRF 411! + FC A + + E GL L + + LA++++VG G+ +F!Sbjct: 357 SISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKF 416!!Query: 412 WQQLKGQPVEFTW--QSDDGISLVAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGKGNI 469! + L + Q S+ V+ + ++ HQ +F ++ I + + G G +!Sbjct: 417 FAALARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGV 476!!Query: 470 GSRWLELFAREQSTLSARTGFEFVLAGVVDSRRSLLSYDGLDASRALAFFNDEAVEQDEE 529! G LE R+QS L + + + GV +S+ L + GL+ L + +E + E !Sbjct: 477 GGALLEQLKRQQSWLKNKH-IDLRVCGVANSKALLTNVHGLN----LENWQEELAQAKEP 531!!Query: 530 ----SLFLWMRAHPYDDLVVLDVTASQQLADQYLDFASHGFHVISANKLAGASDSNKYRQ 585! L ++ + + V++D T+SQ +ADQY DF GFHV++ NK A S + Y Q!Sbjct: 532 FNLGRLIRLVKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHQ 591!!Query: 586 IHDAFEKTGRHWLYNATVGAGLPINHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSV 645! + A EK+ R +LY+ VGAGLP+ +++L+++GD ++ SGI SG+LS++F + D +!Sbjct: 592 LRYAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEGM 651!!Query: 646 PFTELVDQAWQQGLTEPDPRDDLSGKDVMRKLVILAREAGYNIEPDQVRVESLVPAHCEG 705! F+E A + G TEPDPRDDLSG DV RKL+ILARE G +E + +E ++PA !Sbjct: 652 SFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIEIEPVLPAEFNA 711!!Query: 706 -GSIDHFFENGDELNEQMVQRLEAAREMGLVLRYVARFDANGKARVGVEAVREDHPLASL 764! G + F N +L++ R+ AR+ G VLRYV D +G RV + V + PL +!Sbjct: 712 EGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFKV 771!!Query: 765 LPCDNVFAIESRWYRDNPLVIRGPGAGRDVTAGAIQSDINR 805! +N A S +Y+ PLV+RG GAG DVTA + +D+ R!Sbjct: 772 KNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLR 812!!!>gi|16131850|ref|NP_418448.1| aspartokinase III, lysine sensitive;! aspartokinase III, lysine-sensitive [Escherichia coli! K12]! Length = 449!! Score = 122 bits (307), Expect = 7e-29! Identities = 121/452 (26%), Positives = 194/452 (42%), Gaps = 25/452 (5%)!!Query: 16 KFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINWLK-LSQTDRLSAHQV 74! KFGG+S+AD R A I+ + ++V+SA+ TN L+ + L +R + !Sbjct: 8 KFGGTSVADFDAMNRSADIVLSDANVR-LVVLSASAGITNLLVALAEGLEPGERF---EK 63!!Query: 75 QQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDAVYAEVVGHGEVWSA 134! +R Q ++ L I + ++ LA + A+ E+V HGE+ S !Sbjct: 64 LDAIRNIQFAILERLRYPNVIREEIERLLENITVLAEAAALATSPALTDELVSHGELMST 123!!Query: 135 RLMSAVLNQQGLPAAWLDAREFLRA-ERAAQPQVDEGLSYPLLQQLLVQHPGKRLVVT-G 192! L +L ++ + A W D R+ +R +R + + D L L+ + LV+T G!Sbjct: 124 LLFVEILRERDVQAQWFDVRKVMRTNDRFGRAEPDIAALAELAALQLLPRLNEGLVITQG 183!!Query: 193 FISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLLPLL 252! FI N G T LGR GSDY+A + SRV IW+DV G+Y+ DPR V A + +!Sbjct: 184 FIGSENKGRTTTLGRGGSDYTAALLAEALHASRVDIWTDVPGIYTTDPRVVSAAKRIDEI 243!!Query: 253 RLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRI---------ERVLA 303! EA+E+A A VLH TL P S+I + + S P G T + R LA!Sbjct: 244 AFAEAAEMATFGAKVLHPATLLPAVRSDIPVFVGSSKDPRAGGTLVCNKTENPPLFRALA 303!!Query: 304 SGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLLQFC 363! ++T H L A LA I L +A+ L !Sbjct: 304 LRRNQTLLTLHSLNMLHSRGFLAEVFGILARHNISVDLITTSEVSVAL-------TLDTT 356!!Query: 364 YTSEVADSAL--KILDEAGLPGELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPVE 421! ++ D+ L +L E + + +GLALVA++G +++ + L+ + !Sbjct: 357 GSTSTGDTLLTQSLLMELSALCRVEVEEGLALVALIGNDLSKACGVGKEVFGVLEPFNIR 416!!Query: 422 FTWQSDDGISLVAVLRTGPTESLIQGLHQSVF 453! +L ++ E ++Q LH ++F!Sbjct: 417 MICYGASSHNLCFLVPGEDAEQVVQKLHSNLF 448!!!>gi|16128228|ref|NP_414777.1| gamma-glutamate kinase [Escherichia! coli K12]! Length = 367!! Score = 31.2 bits (69), Expect = 0.28! Identities = 17/56 (30%), Positives = 29/56 (51%)!!Query: 194 ISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLL 249! I+ N+A T + +D + LAG ++ + +D G+Y+ADPR A L+!Sbjct: 133 INENDAVATAEIKVGDNDNLSALAAILAGADKLLLLTDQKGLYTADPRSNPQAELI 188!!! Database: /Users/jvanheld/rsa-! tools/data/genomes/Escherichia_coli_K12/genome/NC_000913.faa! Posted date: Sep 8, 2004 12:13 PM! Number of letters in database: 1,351,322! Number of sequences in database: 4242! !Lambda K H! 0.320 0.136 0.397 !!Gapped!Lambda K H! 0.267 0.0410 0.140 !!!Matrix: BLOSUM62!Gap Penalties: Existence: 11, Extension: 1!Number of Hits to DB: 2,199,628!Number of Sequences: 4242!Number of extensions: 96525!Number of successful extensions: 290!Number of sequences better than 1.0: 4!Number of HSP's better than 1.0 without gapping: 4!Number of HSP's successfully gapped in prelim test: 0!Number of HSP's that attempted gapping in prelim test: 279!Number of HSP's gapped (non-prelim): 5!length of query: 810!length of database: 1,351,322!effective HSP length: 92!effective length of query: 718!effective length of database: 961,058!effective search space: 690039644!effective search space used: 690039644!T: 11!A: 40!X1: 16 ( 7.4 bits)!X2: 38 (14.6 bits)!X3: 64 (24.7 bits)!S1: 41 (21.8 bits)!S2: 65 (29.6 bits)!

BLAST result - second match

!  The second match is another bifunctional protein, product of the gene thrA.

!  This protein contains the same two domains as metA (aspartokinase and homoserine dehydrogenase).

!  The alignment covers almost the complete sequences (820 aa), with 30% identities and 49% similarity.

!  The E-value is very low (2e-95), indicating that thrA and metL are likely to be true homologs.

>gi|16127996|ref|NP_414543.1| bifunctional: aspartokinase I! (N-terminal); homoserine dehydrogenase I (C-terminal)! [Escherichia coli K12]! Length = 820!! Score = 344 bits (882), Expect = 2e-95! Identities = 247/821 (30%), Positives = 410/821 (49%), Gaps = 44/821 (5%)!!Query: 16 KFGGSSLADVKCYLRVAGIMAEYSQPDDMM-VVSAAGSTTNQLINWLKLSQTDRLSAHQV 74! KFGG+S+A+ + +LRVA I+ ++ + V+SA TN L+ ++ + + + + +!Sbjct: 5 KFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNI 64!!Query: 75 QQTLRRYQCDLISGLLPAEEADSL--ISAFVSDLERLAALLDSGIN------DAVYAEVV 126! R + +L++GL A+ L + FV + GI+ D++ A ++!Sbjct: 65 SDAERIF-AELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALI 123!!Query: 127 GHGEVWSARLMSAVLNQQGLPAAWLDAREFLRAER---AAQPQVDEGLSYPLLQQLLVQH 183! GE S +M+ VL +G +D E L A + + E ++ H!Sbjct: 124 CRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADH 183!!Query: 184 PGKRLVVTGFISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKV 243! +++ GF + N GE V+LGRNGSDYSA + A IW+DV GVY+ DPR+V!Sbjct: 184 ---MVLMAGFTAGNEKGELVVLGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQV 240!!Query: 244 KDACLLPLLRLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQ-----GSTRI 298! DA LL + EA EL+ A VLH RT+ P++ +I ++ + P G++R !Sbjct: 241 PDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASRD 300!!Query: 299 ERVLASGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQ 358! E L + +++ +++ + P + + + RA++ + + + !Sbjct: 301 EDELP----VKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLITQSSSEY 356!!Query: 359 LLQFCYTSEVADSALKILDEA-------GLPGELRLRQGLALVAMVGAGVTRNPLHCHRF 411! + FC A + + E GL L + + LA++++VG G+ +F!Sbjct: 357 SISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKF 416!!Query: 412 WQQLKGQPVEFTW--QSDDGISLVAVLRTGPTESLIQGLHQSVFRAEKRIGLVLFGKGNI 469! + L + Q S+ V+ + ++ HQ +F ++ I + + G G +!Sbjct: 417 FAALARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGV 476!!Query: 470 GSRWLELFAREQSTLSARTGFEFVLAGVVDSRRSLLSYDGLDASRALAFFNDEAVEQDEE 529! G LE R+QS L + + + GV +S+ L + GL+ L + +E + E !Sbjct: 477 GGALLEQLKRQQSWLKNKH-IDLRVCGVANSKALLTNVHGLN----LENWQEELAQAKEP 531!!Query: 530 ----SLFLWMRAHPYDDLVVLDVTASQQLADQYLDFASHGFHVISANKLAGASDSNKYRQ 585! L ++ + + V++D T+SQ +ADQY DF GFHV++ NK A S + Y Q!Sbjct: 532 FNLGRLIRLVKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHQ 591!!Query: 586 IHDAFEKTGRHWLYNATVGAGLPINHTVRDLIDSGDTILSISGIFSGTLSWLFLQFDGSV 645! + A EK+ R +LY+ VGAGLP+ +++L+++GD ++ SGI SG+LS++F + D +!Sbjct: 592 LRYAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEGM 651!!Query: 646 PFTELVDQAWQQGLTEPDPRDDLSGKDVMRKLVILAREAGYNIEPDQVRVESLVPAHCEG 705! F+E A + G TEPDPRDDLSG DV RKL+ILARE G +E + +E ++PA !Sbjct: 652 SFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIEIEPVLPAEFNA 711!!Query: 706 -GSIDHFFENGDELNEQMVQRLEAAREMGLVLRYVARFDANGKARVGVEAVREDHPLASL 764! G + F N +L++ R+ AR+ G VLRYV D +G RV + V + PL +!Sbjct: 712 EGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFKV 771!!Query: 765 LPCDNVFAIESRWYRDNPLVIRGPGAGRDVTAGAIQSDINR 805! +N A S +Y+ PLV+RG GAG DVTA + +D+ R!Sbjct: 772 KNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLR 812!!!>gi|16131850|ref|NP_418448.1| aspartokinase III, lysine sensitive;! aspartokinase III, lysine-sensitive [Escherichia coli! K12]! Length = 449!! Score = 122 bits (307), Expect = 7e-29! Identities = 121/452 (26%), Positives = 194/452 (42%), Gaps = 25/452 (5%)!!Query: 16 KFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINWLK-LSQTDRLSAHQV 74! KFGG+S+AD R A I+ + ++V+SA+ TN L+ + L +R + !Sbjct: 8 KFGGTSVADFDAMNRSADIVLSDANVR-LVVLSASAGITNLLVALAEGLEPGERF---EK 63!!Query: 75 QQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDAVYAEVVGHGEVWSA 134! +R Q ++ L I + ++ LA + A+ E+V HGE+ S !Sbjct: 64 LDAIRNIQFAILERLRYPNVIREEIERLLENITVLAEAAALATSPALTDELVSHGELMST 123!!Query: 135 RLMSAVLNQQGLPAAWLDAREFLRA-ERAAQPQVDEGLSYPLLQQLLVQHPGKRLVVT-G 192! L +L ++ + A W D R+ +R +R + + D L L+ + LV+T G!Sbjct: 124 LLFVEILRERDVQAQWFDVRKVMRTNDRFGRAEPDIAALAELAALQLLPRLNEGLVITQG 183!!Query: 193 FISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLLPLL 252! FI N G T LGR GSDY+A + SRV IW+DV G+Y+ DPR V A + +!Sbjct: 184 FIGSENKGRTTTLGRGGSDYTAALLAEALHASRVDIWTDVPGIYTTDPRVVSAAKRIDEI 243!!Query: 253 RLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRI---------ERVLA 303! EA+E+A A VLH TL P S+I + + S P G T + R LA!Sbjct: 244 AFAEAAEMATFGAKVLHPATLLPAVRSDIPVFVGSSKDPRAGGTLVCNKTENPPLFRALA 303!!Query: 304 SGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLLQFC 363! ++T H L A LA I L +A+ L !Sbjct: 304 LRRNQTLLTLHSLNMLHSRGFLAEVFGILARHNISVDLITTSEVSVAL-------TLDTT 356!!Query: 364 YTSEVADSAL--KILDEAGLPGELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPVE 421! ++ D+ L +L E + + +GLALVA++G +++ + L+ + !Sbjct: 357 GSTSTGDTLLTQSLLMELSALCRVEVEEGLALVALIGNDLSKACGVGKEVFGVLEPFNIR 416!!Query: 422 FTWQSDDGISLVAVLRTGPTESLIQGLHQSVF 453! +L ++ E ++Q LH ++F!Sbjct: 417 MICYGASSHNLCFLVPGEDAEQVVQKLHSNLF 448!!!>gi|16128228|ref|NP_414777.1| gamma-glutamate kinase [Escherichia! coli K12]! Length = 367!! Score = 31.2 bits (69), Expect = 0.28! Identities = 17/56 (30%), Positives = 29/56 (51%)!!Query: 194 ISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLL 249! I+ N+A T + +D + LAG ++ + +D G+Y+ADPR A L+!Sbjct: 133 INENDAVATAEIKVGDNDNLSALAAILAGADKLLLLTDQKGLYTADPRSNPQAELI 188!!! Database: /Users/jvanheld/rsa-! tools/data/genomes/Escherichia_coli_K12/genome/NC_000913.faa! Posted date: Sep 8, 2004 12:13 PM! Number of letters in database: 1,351,322! Number of sequences in database: 4242! !Lambda K H! 0.320 0.136 0.397 !!Gapped!Lambda K H! 0.267 0.0410 0.140 !!!Matrix: BLOSUM62!Gap Penalties: Existence: 11, Extension: 1!Number of Hits to DB: 2,199,628!Number of Sequences: 4242!Number of extensions: 96525!Number of successful extensions: 290!Number of sequences better than 1.0: 4!Number of HSP's better than 1.0 without gapping: 4!Number of HSP's successfully gapped in prelim test: 0!Number of HSP's that attempted gapping in prelim test: 279!Number of HSP's gapped (non-prelim): 5!length of query: 810!length of database: 1,351,322!effective HSP length: 92!effective length of query: 718!effective length of database: 961,058!effective search space: 690039644!effective search space used: 690039644!T: 11!A: 40!X1: 16 ( 7.4 bits)!X2: 38 (14.6 bits)!X3: 64 (24.7 bits)!S1: 41 (21.8 bits)!S2: 65 (29.6 bits)!

BLAST result - third match

!  The third match is the product of the gene lysC: aspartokinase III.

!  This protein contains the aspartokinase domain, but not the homoserine dehydrogenase.

!  Consequently, the alignment only extends over the first half of the query protein (453aa).

!  On this segment, there is a good level of identity (26%) and similarity (42%).

!  The E-value is very low (7e–29), indicating that the two domains are likely to be true homologs.

>gi|16131850|ref|NP_418448.1| aspartokinase III, lysine sensitive;! aspartokinase III, lysine-sensitive [Escherichia coli! K12]! Length = 449!! Score = 122 bits (307), Expect = 7e-29! Identities = 121/452 (26%), Positives = 194/452 (42%), Gaps = 25/452 (5%)!!Query: 16 KFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAAGSTTNQLINWLK-LSQTDRLSAHQV 74! KFGG+S+AD R A I+ + ++V+SA+ TN L+ + L +R + !Sbjct: 8 KFGGTSVADFDAMNRSADIVLSDANVR-LVVLSASAGITNLLVALAEGLEPGERF---EK 63!!Query: 75 QQTLRRYQCDLISGLLPAEEADSLISAFVSDLERLAALLDSGINDAVYAEVVGHGEVWSA 134! +R Q ++ L I + ++ LA + A+ E+V HGE+ S !Sbjct: 64 LDAIRNIQFAILERLRYPNVIREEIERLLENITVLAEAAALATSPALTDELVSHGELMST 123!!Query: 135 RLMSAVLNQQGLPAAWLDAREFLRA-ERAAQPQVDEGLSYPLLQQLLVQHPGKRLVVT-G 192! L +L ++ + A W D R+ +R +R + + D L L+ + LV+T G!Sbjct: 124 LLFVEILRERDVQAQWFDVRKVMRTNDRFGRAEPDIAALAELAALQLLPRLNEGLVITQG 183!!Query: 193 FISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLLPLL 252! FI N G T LGR GSDY+A + SRV IW+DV G+Y+ DPR V A + +!Sbjct: 184 FIGSENKGRTTTLGRGGSDYTAALLAEALHASRVDIWTDVPGIYTTDPRVVSAAKRIDEI 243!!Query: 253 RLDEASELARLAAPVLHARTLQPVSGSEIDLQLRCSYTPDQGSTRI---------ERVLA 303! EA+E+A A VLH TL P S+I + + S P G T + R LA!Sbjct: 244 AFAEAAEMATFGAKVLHPATLLPAVRSDIPVFVGSSKDPRAGGTLVCNKTENPPLFRALA 303!!Query: 304 SGTGARIVTSHDDVCLIEFQVPASQDFKLAHKEIDQILKRAQVRPLAVGVHNDRQLLQFC 363! ++T H L A LA I L +A+ L !Sbjct: 304 LRRNQTLLTLHSLNMLHSRGFLAEVFGILARHNISVDLITTSEVSVAL-------TLDTT 356!!Query: 364 YTSEVADSAL--KILDEAGLPGELRLRQGLALVAMVGAGVTRNPLHCHRFWQQLKGQPVE 421! ++ D+ L +L E + + +GLALVA++G +++ + L+ + !Sbjct: 357 GSTSTGDTLLTQSLLMELSALCRVEVEEGLALVALIGNDLSKACGVGKEVFGVLEPFNIR 416!!Query: 422 FTWQSDDGISLVAVLRTGPTESLIQGLHQSVF 453! +L ++ E ++Q LH ++F!Sbjct: 417 MICYGASSHNLCFLVPGEDAEQVVQKLHSNLF 448!!!>gi|16128228|ref|NP_414777.1| gamma-glutamate kinase [Escherichia! coli K12]! Length = 367!! Score = 31.2 bits (69), Expect = 0.28! Identities = 17/56 (30%), Positives = 29/56 (51%)!!Query: 194 ISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLL 249! I+ N+A T + +D + LAG ++ + +D G+Y+ADPR A L+!Sbjct: 133 INENDAVATAEIKVGDNDNLSALAAILAGADKLLLLTDQKGLYTADPRSNPQAELI 188!!! Database: /Users/jvanheld/rsa-! tools/data/genomes/Escherichia_coli_K12/genome/NC_000913.faa! Posted date: Sep 8, 2004 12:13 PM! Number of letters in database: 1,351,322! Number of sequences in database: 4242! !Lambda K H! 0.320 0.136 0.397 !!Gapped!Lambda K H! 0.267 0.0410 0.140 !!!Matrix: BLOSUM62!Gap Penalties: Existence: 11, Extension: 1!Number of Hits to DB: 2,199,628!Number of Sequences: 4242!Number of extensions: 96525!Number of successful extensions: 290!Number of sequences better than 1.0: 4!Number of HSP's better than 1.0 without gapping: 4!Number of HSP's successfully gapped in prelim test: 0!Number of HSP's that attempted gapping in prelim test: 279!Number of HSP's gapped (non-prelim): 5!length of query: 810!length of database: 1,351,322!effective HSP length: 92!effective length of query: 718!effective length of database: 961,058!effective search space: 690039644!effective search space used: 690039644!T: 11!A: 40!X1: 16 ( 7.4 bits)!X2: 38 (14.6 bits)!X3: 64 (24.7 bits)!S1: 41 (21.8 bits)!S2: 65 (29.6 bits)!

BLAST result - fourth match

!  The fourth match is a gamma-glutamate kinase, product of proB.

!  The match has the same level of identity (30%) and similarity (51%) as the second match (thrA).

!  However, this match only extends over 56aa, whereas the alignment between thrA and metL extends over 821aa.

!  The significance of the match is thus much lower: the E-value is quite high (0.28) suggesting that the similarity could be an artefact.

!  This clearly illustrates the fact that the important parameter to evaluate the significance of an alignment is the E-value, not the percentage of similarity !

>gi|16128228|ref|NP_414777.1| gamma-glutamate kinase [Escherichia! coli K12]! Length = 367!! Score = 31.2 bits (69), Expect = 0.28! Identities = 17/56 (30%), Positives = 29/56 (51%)!!Query: 194 ISRNNAGETVLLGRNGSDYSATQIGALAGVSRVTIWSDVAGVYSADPRKVKDACLL 249! I+ N+A T + +D + LAG ++ + +D G+Y+ADPR A L+!Sbjct: 133 INENDAVATAEIKVGDNDNLSALAAILAGADKLLLLTDQKGLYTADPRSNPQAELI 188!!!

Page 7: Similarity searches in sequence databases Sequence search

BLAST result - summary

!  The last part of the BLAST result gives some statistics about the search:

"  Number of hits "  Number of sequences in the DB "  !

Database: /Users/jvanheld/rsa-! tools/data/genomes/Escherichia_coli_K12/genome/NC_000913.faa! Posted date: Sep 8, 2004 12:13 PM! Number of letters in database: 1,351,322! Number of sequences in database: 4242! !Lambda K H! 0.320 0.136 0.397 !!Gapped!Lambda K H! 0.267 0.0410 0.140 !!Matrix: BLOSUM62!Gap Penalties: Existence: 11, Extension: 1!Number of Hits to DB: 2,199,628!Number of Sequences: 4242!Number of extensions: 96525!Number of successful extensions: 290!Number of sequences better than 1.0: 4!Number of HSP's better than 1.0 without gapping: 4!Number of HSP's successfully gapped in prelim test: 0!Number of HSP's that attempted gapping in prelim test: 279!Number of HSP's gapped (non-prelim): 5!length of query: 810!length of database: 1,351,322!effective HSP length: 92!effective length of query: 718!effective length of database: 961,058!effective search space: 690039644!effective search space used: 690039644!T: 11!A: 40!X1: 16 ( 7.4 bits)!X2: 38 (14.6 bits)!X3: 64 (24.7 bits)!S1: 41 (21.8 bits)!S2: 65 (29.6 bits)!

BLAST – examples of results

BLAST (NCBI) BLAST (NCBI)

BLASTn paramètres BLASTp paramètres

Page 8: Similarity searches in sequence databases Sequence search

Checking expected values with random sequences

(“negative control”)

Searching a sequence database with a random sequence as query !  Empirical test of the expected value:

"  We generated a random sequence of 1588 aa using the tool random-sequence (http://rsat.ulb.ac.be/rsat/ ). "  This random sequence mimics the amino acid and dipeptide composition of yeast proteins (generated with First

order Markov model). !  A blast search against the non-redundant database returns several hits. !  These hits however have low scores. Not surprisingly, the corresponding expected values are higher than 1.

blastp with random sequences

!  pblast of 10 random sequences against the non-redundant database.

"  Between 1 and 15 matches per trial. "  Was this to be expected ?

!  This corresponds pretty well to the expectation

"  On the NCBI BLAST web server, the default threshold on expect has been set to 10.

"  We thus expect, for each request, an average of 10 matches by chance.

"  We indeed observe this order of magnitude when submitting random sequences.

The modalities of BLAST

DNA versus protein searches

!  When the query is a coding DNA sequence, it is recommended to apply searches with the translated rather than raw DNA sequences

"  This allows to introduce a substitution matrix (PAM, BLOSUM, ...), which better reflects the evolutionary changes.

"  It has been shown that some distant relationships can be detected with translated searches, but escape detection with the DNA search.

"  It is easier to filter out low complexity regions from proteins than from DNA sequences.

Page 9: Similarity searches in sequence databases Sequence search

Séquence Banque

Protéique Protéique Nucléotidique Nucléotidique

BLASTn

BLASTp

tBLASTn BLASTx

tBLASTx

Variants de BLAST BLAST - a family of purpose-specific programs

!  Different program names exist, depending on the type (protein or nucleic acid) of query and database sequences.

!  For comparison between nucleic acids and proteins, the nucleic acid is translated in the 6 frames (3 frames per strand)

ATTGTGAGTCCTGATGATGGT!TAACTCTCAGGACTACTACCA!

6-frames translation

Query Database Program Application examples Study cases

protein protein blastpStarting from a protein of known function detect putative homologs in the whole Uniprot database.

Collect sequences similar to the blue-sensitive opsin in all human proteins.

nucleic acid

nucleic acid blastn Match RNAi against a genome.

Match mRNA (or EST) against a genome.

nucleic acid (translated) protein blastx

After having sequenced a piece of DNA, search potentially coding fragments + their putative homologs without any prior knowledge of gene positions in the query sequence.

protein nucleic acid (translated) tblastn

- Identify a genomic region likely to code for an homolog of a protein of interest.- Identify pseudo-genes (defective genes, with many stop codons) for a protein of interest in a genome.

Do cat see colors ? Get Human blue-sensitive opsin protein, connect UCSC genome browser, use BLAT to find similarities in Cat genome

nucleic acid (translated)

nucleic acid (translated) tblastx

Scanning 6-frames translated genomes with a protein sequence

!  The mouse urate oxidase enzyme (P25688) contains an uricase domain (EC 1.7.3.3), catalyzing the urate degradation:

"  Urate + O(2) + H(2)O <=> 5-hydroxyisourate + H(2)O(2).

!  Top: blastp result (protein against protein): "  Homo sapiens reference proteins

(Refseq) scanned with urate oxidase peptidic sequence.

"  The scan only returns weakly scoring matches (lowest E-value = 0.004).

!  Bottom: tblastn (search translated nucleotide database using protein query)

"  The scan returns 4 very high-scoring genome locations.

!  Question: why did we identify very good matches with tblastn, and not with blastp ?

Protein against protein (blastp)

Protein against translated DNA (tblastn)

tblastn result : mouse urate oxidase agains Human genome

!  Some aligned fragments.

>ref|NW_001838589.2| Download subject sequence NW_001838589 spanning the HSP Homo sapiens chromosome 1 genomic contig, alternate assembly HuRef SCAF_1103279188310, whole genome shotgun sequence Length=20217283 Sort alignments for this subject sequence by: E value Score Percent identity Query start position Subject start position Features flanking this part of subject sequence: 27555 bp at 5' side: deoxyribonuclease-2-beta isoform 2 34809 bp at 3' side: sterile alpha motif domain-containing protein 13 isoform 2 Score = 137 bits (346), Expect = 6e-33, Method: Compositional matrix adjust. Identities = 67/75 (89%), Positives = 71/75 (95%), Gaps = 0/75 (0%) Frame = +1 Query 10 KNDEVEFVRTGYGKDMVKVLHIQRDGKYHSIKEVATSVQLTLRSKKDYLHGDNSDIIPTD 69 +NDEVEFVRTGYGK+MVKVLHIQ DGKYHSIKEVATSVQLTL SKKDYLHGDNSDIIPTD Sbjct 19269451 QNDEVEFVRTGYGKEMVKVLHIQ*DGKYHSIKEVATSVQLTLSSKKDYLHGDNSDIIPTD 19269630 Query 70 TIKNTVHVLAKLRGI 84 TIKNTVHVLAK + + Sbjct 19269631 TIKNTVHVLAKFKEV 19269675

Protein against translated DNA (tblastn)

tblastn example – do cat see colors ? !  Approach: match the peptidic sequence of the Human Short-wave-sensitive opsin (blue) against the

complete cat genome (6-frames translated). !  Tool: BLAT tool at UCSC genome browser (http://genome.ucsc.edu/) !  Result: 3 matches in cat genome

"  Short wave (blue) sensitive opsin "  rhodopsin "  Long wave (red) sensitive opsin. Partial match with 1 exon, but sufficient to “fish out” the cat OPN1LW gene.