Michael Schroeder
Biotechnology Center, TU Dresden
BLAST
Contents
Why compare and align sequences?
How to judge an alignment? Z-score, E-value, P-value, structure and function
How to compare and align sequences? Levenshtein distance, scoring schemes, longest common subsequence, global and local alignment, substitution matrix
How to compute an alignment? Dynamic programming
How to compute an alignment fast? BLAST
How to align many sequences? Multiple sequence alignment, phylogenetic trees
Alignments and structure: How to predict protein structure from protein sequence
Motivation
Two sequences of length n
Dynamic programming: matrix = time proportional to n^2
Database of m sequences: time proportional to m·n^2
Manual search: How long does it take?
1 cell = 10 sec
1,000,000 sequences
Sequence = 100 amino acids
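The manual-search question can be answered with a short calculation. The sketch below assumes that the query is also 100 amino acids long and that one n x n matrix is filled per database sequence; the function name is made up for illustration.

```python
# Back-of-the-envelope estimate for the manual search on the slide.
SECONDS_PER_CELL = 10        # from the slide: 1 cell = 10 sec
NUM_SEQUENCES = 1_000_000    # from the slide: database size m
SEQ_LENGTH = 100             # from the slide: n = 100 amino acids
                             # (assumption: the query has the same length)

def manual_search_seconds(m: int, n: int, seconds_per_cell: float) -> float:
    """Time to fill one n x n matrix for each of the m database sequences."""
    return m * n * n * seconds_per_cell

seconds = manual_search_seconds(NUM_SEQUENCES, SEQ_LENGTH, SECONDS_PER_CELL)
years = seconds / (365 * 24 * 3600)
print(f"{seconds:.0e} seconds, roughly {years:.0f} years")
```

Under these assumptions the search takes about 10^11 seconds, i.e. on the order of three thousand years.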
Motivation
How can we reduce the database size?
How can we reduce the matrix size?
lev(petra, peter) = ?
Levenshtein Distance with Dynamic Programming:
Are all cells in the matrix equally important?
i \ j      p  e  t  e  r
        0  1  2  3  4  5
   p    1  0  1  2  3  4
   e    2  1  0  1  2  3
   t    3  2  1  0  1  2
   r    4  3  2  1  1  1
   a    5  4  3  2  2  2
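The matrix above can be reproduced with a few lines of dynamic programming. This is a minimal sketch with unit costs for insertion, deletion and substitution; the function name is chosen for illustration.

```python
def levenshtein_matrix(a: str, b: str) -> list[list[int]]:
    """Return the full (len(a)+1) x (len(b)+1) edit-distance matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all i characters of a
    for j in range(cols):
        d[0][j] = j                      # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = d[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1,   # insertion
                          substitution)      # match or mismatch
    return d

matrix = levenshtein_matrix("petra", "peter")
for row in matrix:
    print(" ".join(str(value) for value in row))
print("lev(petra, peter) =", matrix[-1][-1])
```

The bottom-right cell holds the answer to the earlier question: lev(petra, peter) = 2.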
For the two optimal alignments, we only used 8 out of 36 cells. Can we discard the other cells beforehand? How?
Legend (cell colours): not used, maybe used, used
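The count of 8 used cells can be checked by tracing back from the bottom-right cell along every move that explains the cell's value. The following is a sketch with unit costs; both function names are illustrative.

```python
def levenshtein_matrix(a: str, b: str) -> list[list[int]]:
    """Full edit-distance matrix with unit costs."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d

def cells_on_optimal_paths(a: str, b: str) -> set[tuple[int, int]]:
    """Collect every cell that lies on at least one optimal alignment path."""
    d = levenshtein_matrix(a, b)
    used = set()
    stack = [(len(a), len(b))]           # start the traceback at the end
    while stack:
        i, j = stack.pop()
        if (i, j) in used:
            continue
        used.add((i, j))
        # Follow each predecessor whose value explains d[i][j].
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            stack.append((i - 1, j))
        if j > 0 and d[i][j] == d[i][j - 1] + 1:
            stack.append((i, j - 1))
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            stack.append((i - 1, j - 1))
    return used

used_cells = cells_on_optimal_paths("petra", "peter")
print(len(used_cells), "of", 6 * 6, "cells are used")
```

For petra and peter the traceback visits 8 cells, which matches the count on the slide.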
i \ j
        0  1  2  3  4  5
        1  1
        2     2
        3        3
        4           4
        5              5
What is the worst possible distance for two strings of size 5? 5 mismatches. This means all paths of length > 5 can be excluded.
Paths through the red cells (those more than two steps off the main diagonal) all have length > 5. Only 24 out of 36 cells can contribute to results.
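One way to arrive at this count: a path from the top-left to the bottom-right corner that passes through cell (i, j) needs at least |i - j| insertions or deletions to leave the main diagonal and the same number to return, so it costs at least 2·|i - j|. The sketch below counts the cells that survive a threshold k under that argument; the function name is illustrative.

```python
def cells_within_threshold(n: int, k: int) -> int:
    """Count cells (i, j) of the (n+1) x (n+1) matrix with 2*|i - j| <= k.

    Any corner-to-corner path through (i, j) costs at least 2*|i - j|,
    so cells violating the bound cannot lie on a path of cost <= k.
    """
    return sum(1
               for i in range(n + 1)
               for j in range(n + 1)
               if 2 * abs(i - j) <= k)

print(cells_within_threshold(5, 5), "of 36 cells remain for threshold 5")
```

The same function can be called with a smaller threshold to see how quickly the band around the diagonal shrinks.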
Are we solving the right problem?
Are all alignments useful?
Only results with a reasonable edit distance.
For strings of size 5, let's say that's 3.
Paths through the red cells (those more than one step off the main diagonal) all have length > 3. Only 16 out of 36 cells can contribute to results.
Ukkonen E. (1983) On approximate string matching. In: Karpinski M. (eds) Foundations of Computation Theory. FCT 1983. Lecture Notes in Computer Science, vol 158. Springer
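The idea of restricting the computation to a band around the diagonal can be sketched as follows. This is a simplified illustration, not necessarily Ukkonen's exact formulation: it uses the looser but always safe band |i - j| <= k, caps every value at k + 1, and still allocates full rows for readability. The function name is made up for this example.

```python
from typing import Optional

def bounded_levenshtein(a: str, b: str, k: int) -> Optional[int]:
    """Return the edit distance of a and b if it is <= k, otherwise None.

    Only cells with |i - j| <= k are evaluated; all others count as k + 1.
    """
    if abs(len(a) - len(b)) > k:
        return None                       # length difference alone exceeds k
    inf = k + 1                           # stands for "more than k"
    prev = [j if j <= k else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        if i <= k:
            cur[0] = i
        lo, hi = max(1, i - k), min(len(b), i + k)
        for j in range(lo, hi + 1):       # stay inside the band
            cost = prev[j - 1] + (a[i - 1] != b[j - 1])
            cost = min(cost, prev[j] + 1, cur[j - 1] + 1)
            cur[j] = min(cost, inf)
        prev = cur
    return prev[-1] if prev[-1] <= k else None

print(bounded_levenshtein("petra", "peter", 3))
print(bounded_levenshtein("petra", "peter", 1))
```

With threshold 3 the distance 2 is found; with threshold 1 the pair is rejected without filling the whole matrix.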
Are we Solving the Right Problem?
Aligning completely unrelated sequences is not needed.
A reasonable alignment has some perfect matches.
BLAST idea:
1. Identify small perfect matches (termed words)
2. Extend perfect matches on the same diagonal
3. Merge extended perfect matches (with dynamic programming)
BLAST Idea
Do we need the alignments for all sequences in the database?
No, only for “reasonable” hits ➞ introduce a threshold
A “reasonable” alignment will contain short stretches of perfect matches
Find these first, then extend and connect them as well as possible
Perfect Matches (Words) Formally
BLAST: Filtration Algorithm
Detects all matching words of length p between a query string a and a target string b (both of length n) and combines them into an alignment of a and b with no more than k mismatches.
Potential match detection: Find all matches of p-tuples between a and b (can be done in linear time by inserting them into a hash table).
Potential match verification: Verify each potential match by extending it to the left and right until either the first k+1 mismatches are found or the beginning or end of a sequence is reached.
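The two filtration steps can be sketched in a few lines. This is an illustration under stated assumptions, not BLAST's actual implementation: the function names are made up, the extension is gap-free along one diagonal, and the mismatch budget k is spent first to the left and then to the right.

```python
from collections import defaultdict

def find_word_matches(a: str, b: str, p: int) -> list[tuple[int, int]]:
    """Detection: all (i, j) with a[i:i+p] == b[j:j+p], via a hash table."""
    index = defaultdict(list)
    for j in range(len(b) - p + 1):
        index[b[j:j + p]].append(j)       # word -> start positions in b
    return [(i, j)
            for i in range(len(a) - p + 1)
            for j in index.get(a[i:i + p], [])]

def extend_match(a: str, b: str, i: int, j: int, p: int, k: int):
    """Verification: extend the word match at (i, j) along its diagonal.

    Stops at a sequence end or just before mismatch number k + 1.
    Returns (start in a, start in b, length, number of mismatches).
    """
    mismatches = 0
    left = 0
    while i - left - 1 >= 0 and j - left - 1 >= 0:
        if a[i - left - 1] != b[j - left - 1]:
            if mismatches == k:
                break
            mismatches += 1
        left += 1
    right = 0
    while i + p + right < len(a) and j + p + right < len(b):
        if a[i + p + right] != b[j + p + right]:
            if mismatches == k:
                break
            mismatches += 1
        right += 1
    return i - left, j - left, left + p + right, mismatches

seeds = find_word_matches("peter", "petra", 3)
print(seeds)
print([extend_match("peter", "petra", i, j, 3, 1) for i, j in seeds])
```

For peter and petra with p = 3 the only shared word is pet at position 0 in both strings; with k = 1 the extension absorbs one mismatch to the right and stops before the second.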
BLAST Mechanism Summary
word length p (here: p = 4)
no mismatches or gaps allowed
only within the grey areas
Zipf's Law
Example for BLAST
Search SWISSPROT for Ig-alpha:
SWISS_PROT:C79A_HUMAN P11912
Distribution of Hits:
Example for BLAST: Alignment
>gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor complex associated protein alpha-chain precursor (IG-alpha) (MB-1 membrane glycoprotein) (Surface-IGM-associated protein) (Membrane-bound immunoglobulin associated protein) (CD79A)
Length = 220
Score = 278 bits (711), Expect = 5e-75
Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%)

Query: 1   MPGGPGVLQALPATIFLLFLLSAVYLGPGCQALWMHKVPASLMVSLGEDAHFQCPHNSSN 60
           MPGG          + LL  LS   LGPGCQAL +   P SL V+LGE+A   C  N+
Sbjct: 1   MPGG----LEALRALPLLLFLSYACLGPGCQALRVEGGPPSLTVNLGEEARLTC-ENNGR 55

Query: 61  NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQEGNESYQQSCG 120
           N N+TWW  L  N TWPP  LGPG+   G L    VNK+ G    C+V E N   ++SCG
Sbjct: 56  NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTGCQVIE-NNILKRSCG 114

Query: 121 TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKLGLDAGD 180
           TYLRVR P PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK G+D  D
Sbjct: 115 TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKFGVDMPD 174

Query: 181 EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP 226
           +YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG+L+IGD QLEKP
Sbjct: 175 DYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGNLHIGDAQLEKP 220
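The Identities and Gaps figures in the report can be recounted directly from the aligned Query and Sbjct rows. The sketch below concatenates the four alignment blocks shown above; the function name is illustrative. Positives are left out because they additionally depend on the substitution matrix used for the search.

```python
# Aligned rows copied from the BLAST report above (four blocks each).
QUERY = (
    "MPGGPGVLQALPATIFLLFLLSAVYLGPGCQALWMHKVPASLMVSLGEDAHFQCPHNSSN"
    "NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQEGNESYQQSCG"
    "TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKLGLDAGD"
    "EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP"
)
SBJCT = (
    "MPGG----LEALRALPLLLFLSYACLGPGCQALRVEGGPPSLTVNLGEEARLTC-ENNGR"
    "NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTGCQVIE-NNILKRSCG"
    "TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKFGVDMPD"
    "DYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGNLHIGDAQLEKP"
)

def alignment_stats(query: str, sbjct: str) -> tuple[int, int, int]:
    """Return (identities, gaps, alignment length) for two aligned rows."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have equal length")
    pairs = list(zip(query, sbjct))
    identities = sum(q == s and q != "-" for q, s in pairs)
    gaps = sum(q == "-" or s == "-" for q, s in pairs)
    return identities, gaps, len(pairs)

identities, gaps, length = alignment_stats(QUERY, SBJCT)
# Integer division reproduces the rounded-down percentages of the report.
print(f"Identities = {identities}/{length} ({identities * 100 // length}%)")
print(f"Gaps = {gaps}/{length} ({gaps * 100 // length}%)")
```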
Example for BLAST: Taxonomy
Lineage Report
root
. cellular organisms
. . Eukaryota [eukaryotes]
. . . Fungi/Metazoa group [eukaryotes]
. . . . Bilateria [animals]
. . . . . Coelomata [animals]
. . . . . . Gnathostomata [vertebrates]
. . . . . . . Tetrapoda [vertebrates]
. . . . . . . . Amniota [vertebrates]
. . . . . . . . . Eutheria [mammals]
. . . . . . . . . . Homo sapiens (man) ---------------------- 473 33 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Bos taurus (bovine) ..................... 312 2 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Mus musculus (mouse) .................... 278 31 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Canis familiaris (dogs) ................. 37 1 hit [mammals] IG KAPPA CHAIN V REGION GOM
. . . . . . . . . . Rattus norvegicus (brown rat) ........... 35 7 hits [mammals] Vascular endothelial growth factor receptor 1 precursor (VE
. . . . . . . . . . Oryctolagus cuniculus (domestic rabbit) . 29 1 hit [mammals] IG KAPPA CHAIN V REGION K29-213
. . . . . . . . . Coturnix japonica ------------------------- 33 2 hits [birds] Vascular endothelial growth factor receptor 2 precursor (VE
. . . . . . . . . Gallus gallus (chickens) .................. 31 4 hits [birds] CILIARY NEUROTROPHIC FACTOR RECEPTOR ALPHA PRECURSOR (CNTFR
. . . . . . . . Xenopus laevis (clawed frog) ---------------- 30 2 hits [amphibians] Neural cell adhesion molecule 1, 180 kDa isoform precursor
. . . . . . . Heterodontus francisci ------------------------ 28 1 hit [sharks and rays] Myelin P0 protein precursor (Myelin protein zero) (Myelin p
. . . . . . Drosophila melanogaster ------------------------- 30 2 hits [flies] Neuroglian precursor
. . . . . Caenorhabditis elegans ---------------------------- 29 1 hit [nematodes] Hypothetical protein F59B2.12 in chromosome III
. . . . Saccharomyces cerevisiae (brewer's yeast) ----------- 33 1 hit [ascomycetes] Putative 101.7 kDa transcriptional regulatory protein in PR
. . . Marchantia polymorpha --------------------------------- 29 1 hit [liverworts] Succinate dehydrogenase cytochrome b560 subunit (Succinate
. . Agrobacterium tumefaciens str. C58 ---------------------- 28 1 hit [a-proteobacteria] Formamidopyrimidine-DNA glycosylase (Fapy-DNA glycosylase)
. Human adenovirus type 3 ----------------------------------- 30 1 hit [viruses] EARLY E3 20.5 KD GLYCOPROTEIN
. Human adenovirus type 7 ................................... 30 1 hit [viruses] EARLY E3 20.5 KD GLYCOPROTEIN
PSI-BLAST
The globin family of proteins (oxygen transport) occurs in many species.
The proteins have the same function and structure.
But there are pairs of family members sharing less than 10% identical residues.
PSI-BLAST idea: score via intermediaries may be better than score from direct comparison
Figure (three proteins A, B, C): A and B share 50% identical residues, B and C share 50%, A and C share only 10%.
PSI-BLAST:
1. BLAST
2. Collect top hits
3. Build multiple sequence alignment from significant local matches
4. Build profile
5. Re-probe database with profile
6. Go back to 2.
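Steps 4 and 5 can be illustrated with a toy profile. The sketch below is a strong simplification and not PSI-BLAST's actual profile construction: it assumes gap-free hits of equal length, a uniform background distribution, and a flat pseudocount. The function names and the example sequences are made up for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

def build_profile(aligned: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Position-specific log-odds scores from equal-length, gap-free hits."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(ALPHABET)            # assumption: uniform background
    total = len(aligned) + pseudocount * len(ALPHABET)
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile

def profile_score(profile: list[dict[str, float]], segment: str) -> float:
    """Score a segment of the same length against the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))

hits = ["MPGG", "MPGA", "MAGG"]                 # hypothetical aligned top hits
profile = build_profile(hits)
print(round(profile_score(profile, "MPGG"), 2))
print(round(profile_score(profile, "WWWW"), 2))
```

A segment that resembles the collected hits scores higher than an unrelated one, which is what lets the profile find more distant family members in the next iteration.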
But beware of PSI-BLAST:
False positives propagate and spread through iterations.
If protein A consists of domains D and E, protein B of domains E and F, and protein C of domain F, then PSI-BLAST will relate A and C although they do not share any domain.
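The domain example can be played through with a toy model. The sketch below replaces sequence similarity with a much cruder rule, assumed only for illustration: a protein becomes a hit as soon as it shares a domain with any protein already among the hits. The names are hypothetical.

```python
# Domain composition from the example above.
DOMAINS = {"A": {"D", "E"}, "B": {"E", "F"}, "C": {"F"}}

def iterative_hits(query: str, domains: dict[str, set[str]],
                   max_iterations: int = 10) -> list[set[str]]:
    """Return the hit set after each iteration, starting with the query."""
    hits = {query}
    history = [set(hits)]
    for _ in range(max_iterations):
        # Every domain seen in the current hits acts like the profile.
        covered = set().union(*(domains[p] for p in hits))
        new_hits = {p for p, ds in domains.items() if ds & covered}
        if new_hits == hits:              # converged, nothing new found
            break
        hits = new_hits
        history.append(set(hits))
    return history

for iteration, hit_set in enumerate(iterative_hits("A", DOMAINS)):
    print(iteration, sorted(hit_set))
print("A and C share domains:", DOMAINS["A"] & DOMAINS["C"])
```

Starting from A, the first iteration adds B through the shared domain E, and the second adds C through domain F, although A and C have no domain in common.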