BLAST - biotec.tu-dresden.de

Post on 31-Aug-2019

Michael Schroeder

Biotechnology Center, TU Dresden

BLAST

Contents

Why to compare and align sequences?

How to judge an alignment? Z-score, E-value, P-value, structure and function

How to compare and align sequences? Levenshtein distance, scoring schemes, longest common subsequence, global and local alignment, substitution matrix

How to compute an alignment? Dynamic programming

How to compute an alignment fast? BLAST

How to align many sequences? Multiple sequence alignment, phylogenetic trees

Alignments and structure: How to predict protein structure from protein sequence

Motivation

Two sequences of length n: dynamic programming fills an n × n matrix, so time is proportional to n²

Database of m sequences: time proportional to m · n²

Manual search: How long does it take?
1 cell = 10 sec
1,000,000 sequences
Sequence = 100 amino acids
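The manual-search question can be answered with a short back-of-the-envelope calculation. The sketch below only multiplies the numbers given on the slide; the assumptions that the query also has 100 amino acids and that a year has 365 days of non-stop work are mine.

```python
# Numbers from the slide.
SECONDS_PER_CELL = 10        # 1 cell = 10 sec
NUM_SEQUENCES = 1_000_000    # m, size of the database
SEQUENCE_LENGTH = 100        # n, amino acids per sequence (assumed for the query too)

cells_per_alignment = SEQUENCE_LENGTH ** 2          # n^2 cells per matrix
total_cells = NUM_SEQUENCES * cells_per_alignment   # m * n^2 cells overall
total_seconds = total_cells * SECONDS_PER_CELL

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: no breaks, no leap years
years = total_seconds / SECONDS_PER_YEAR

print(f"{total_cells:,} cells, {total_seconds:,} seconds, about {years:,.0f} years")
```

Under these assumptions the search takes on the order of three thousand years, which is why both the database size and the matrix size have to shrink.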

Motivation

How can we reduce the database size?

How can we reduce the matrix size?

lev(petra, peter) = ?

Levenshtein Distance with Dynamic Programming:

Are all cells in the matrix equally important?

i \ j       p   e   t   e   r
        0   1   2   3   4   5
p       1   0   1   2   3   4
e       2   1   0   1   2   3
t       3   2   1   0   1   2
r       4   3   2   1   1   1
a       5   4   3   2   2   2
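The matrix above can be reproduced with a few lines of dynamic programming. This is a minimal sketch with unit costs for insertion, deletion and substitution; the function name `levenshtein_matrix` is chosen here for illustration.

```python
def levenshtein_matrix(a: str, b: str) -> list[list[int]]:
    """Return the full DP matrix D with D[i][j] = lev(a[:i], b[:j])."""
    rows, cols = len(a) + 1, len(b) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i                      # delete the first i characters of a
    for j in range(cols):
        D[0][j] = j                      # insert the first j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,          # deletion
                          D[i][j - 1] + 1,          # insertion
                          D[i - 1][j - 1] + cost)   # match or substitution
    return D


D = levenshtein_matrix("petra", "peter")
for row in D:
    print(" ".join(str(value) for value in row))
print("lev(petra, peter) =", D[-1][-1])
```

The bottom-right cell holds the answer to the earlier question: lev(petra, peter) = 2.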

For the two optimal alignments, we only used 8 out of 36 cells. Can we discard the other cells beforehand? How?

Cell legend: not used, maybe used, used
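The count of 8 used cells can be checked programmatically. The sketch below is one possible way, not taken from the slides: a cell (i, j) lies on some optimal alignment path exactly when the distance of the prefixes up to (i, j) plus the distance of the remaining suffixes equals the total distance. The helper names are hypothetical.

```python
def lev_matrix(a: str, b: str) -> list[list[int]]:
    """DP matrix of unit-cost edit distances between all prefixes of a and b."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 or j == 0:
                D[i][j] = i + j
            else:
                D[i][j] = min(D[i - 1][j] + 1,
                              D[i][j - 1] + 1,
                              D[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return D


def cells_on_optimal_paths(a: str, b: str) -> set[tuple[int, int]]:
    """Cells (i, j) that at least one optimal alignment of a and b passes through."""
    forward = lev_matrix(a, b)
    # Suffix distances, obtained as prefix distances of the reversed strings.
    backward = lev_matrix(a[::-1], b[::-1])
    total = forward[len(a)][len(b)]
    return {(i, j)
            for i in range(len(a) + 1)
            for j in range(len(b) + 1)
            if forward[i][j] + backward[len(a) - i][len(b) - j] == total}


used = cells_on_optimal_paths("petra", "peter")
print(len(used), "of", 6 * 6, "cells:", sorted(used))
```

Note that this check needs the full matrix first, so it only confirms the observation; it does not yet save any work.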

Levenshtein Distance with Dynamic Programming:

Are all cells in the matrix equally important?

i \ j
        0   1   2   3   4   5
        1   1
        2       2
        3           3
        4               4
        5                   5

What is the worst possible distance for two strings of size 5? 5 mismatches. This means all paths of length > 5 can be excluded.

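Paths through the red cells all have length > 5: a path through cell (i, j) needs at least |i − j| insertions or deletions to reach that cell and as many again to get from there to the final cell. Only 24 out of 36 cells can contribute to results.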
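The cell counts follow from a simple bound. For two strings of equal length, any path through cell (i, j) contains at least 2·|i − j| insertions or deletions, so with a threshold k only cells with 2·|i − j| ≤ k can matter. The sketch below counts those cells; the function name and the equal-length assumption are mine.

```python
def cells_within_threshold(n: int, k: int) -> int:
    """Count cells (i, j) of the (n+1) x (n+1) matrix with 2 * |i - j| <= k."""
    return sum(1
               for i in range(n + 1)
               for j in range(n + 1)
               if 2 * abs(i - j) <= k)


for k in (5, 3):
    print(f"threshold {k}: {cells_within_threshold(5, k)} of 36 cells")
```

For strings of size 5 this gives 24 cells for threshold 5 and 16 cells for threshold 3.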

Are we solving the right problem?

Are all alignments useful?

Only results with a reasonable edit distance.

For size 5 strings, let's say that's 3.

Paths through the red cells all have length > 3. Only 16 out of 36 cells can contribute to results.

Ukkonen E. (1983) On approximate string matching. In: Karpinski M. (eds) Foundations of Computation Theory. FCT 1983. Lecture Notes in Computer Science, vol 158. Springer
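A thresholded edit distance in the spirit of the reference above can be sketched as follows. This is not a reproduction of Ukkonen's algorithm: it uses the simpler band |i − j| ≤ k (slightly wider than the band on the slides, but valid for strings of different lengths), and the function name is hypothetical.

```python
from typing import Optional


def banded_levenshtein(a: str, b: str, k: int) -> Optional[int]:
    """Return lev(a, b) if it is at most k, else None.

    Only cells with |i - j| <= k are filled; all others count as "too far".
    """
    if abs(len(a) - len(b)) > k:
        return None
    too_far = k + 1                       # stands for any distance above k
    prev = [j if j <= k else too_far for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [too_far] * (len(b) + 1)
        lo, hi = max(0, i - k), min(len(b), i + k)
        if lo == 0:
            cur[0] = i                    # first column is inside the band
        for j in range(max(1, lo), hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost,  # match or substitution
                         too_far)
        prev = cur
    return prev[len(b)] if prev[len(b)] <= k else None


print(banded_levenshtein("petra", "peter", 3))   # within the threshold
print(banded_levenshtein("petra", "peter", 1))   # above the threshold
```

The work per row is proportional to k instead of n, so the total time drops from n² to roughly k · n.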

Are we Solving the Right Problem?

Aligning completely unrelated sequences not needed

A reasonable alignment has some perfect matches

BLAST Idea:

1. Identify small perfect matches (termed words)

2. Extend perfect matches on the same diagonal

3. Merge extended perfect matches (with dynamic programming)
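Steps 1 and 2 rely on knowing which word matches share a diagonal. A minimal sketch, assuming exact words of a fixed length p and toy protein fragments: the diagonal of a match at query position i and target position j is j − i. The function name is hypothetical, and real BLAST also accepts similar (not only identical) words for proteins, which is not modelled here.

```python
from collections import defaultdict


def word_hits_by_diagonal(query: str, target: str, p: int) -> dict[int, list[tuple[int, int]]]:
    """Find exact word matches of length p and group them by diagonal j - i."""
    positions = defaultdict(list)             # word -> start positions in target
    for j in range(len(target) - p + 1):
        positions[target[j:j + p]].append(j)

    diagonals = defaultdict(list)             # diagonal -> list of (i, j) hits
    for i in range(len(query) - p + 1):
        for j in positions.get(query[i:i + p], []):
            diagonals[j - i].append((i, j))
    return dict(diagonals)


hits = word_hits_by_diagonal("MPGGPGVLQALPATIF", "MPGGLEALRALPLLLF", p=3)
for diagonal, matches in sorted(hits.items()):
    print("diagonal", diagonal, "->", matches)
```

Hits on the same diagonal are candidates for ungapped extension; hits on neighbouring diagonals can only be joined with gaps, which is where the dynamic programming of step 3 comes in.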

BLAST Idea

Do we need the alignments for all sequences in the database?

No, only for “reasonable” hits ➞ introduce a threshold

A “reasonable” alignment will contain short stretches of perfect matches

Find these first, then extend them to connect them as best possible

Perfect Matches (Words) Formally

BLAST: Filtration Algorithm

Detects all matching words of length p (of a query string a in a target string b, both of length n) and combines them into an alignment of a and b with no more than k mismatches

Potential match detection: Find all matches of p-tuples between a and b (can be done in linear time by inserting them into a hash table)

Potential match verification: Verify each potential match by extending it to the left and right until either the first k+1 mismatches are found or the beginning or end of the sequences is reached
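The verification step can be sketched as an ungapped extension along the diagonal of a word match. The slide does not say how the k allowed mismatches are shared between the two directions; the sketch below assumes a single budget that is spent to the right first and then to the left. The function name and the return format are hypothetical.

```python
def extend_match(a: str, b: str, i: int, j: int, p: int, k: int) -> tuple[int, int, int]:
    """Extend the word match a[i:i+p] == b[j:j+p] without gaps.

    Returns (start_in_a, end_in_a, mismatches): the extended region
    a[start:end] lies on the same diagonal in b and has at most k mismatches.
    """
    mismatches = 0

    # Extend to the right until mismatch k+1 or the end of a sequence.
    end_a, end_b = i + p, j + p
    while end_a < len(a) and end_b < len(b):
        if a[end_a] != b[end_b]:
            if mismatches == k:
                break
            mismatches += 1
        end_a += 1
        end_b += 1

    # Extend to the left with whatever mismatch budget remains.
    start_a, start_b = i, j
    while start_a > 0 and start_b > 0:
        if a[start_a - 1] != b[start_b - 1]:
            if mismatches == k:
                break
            mismatches += 1
        start_a -= 1
        start_b -= 1

    return start_a, end_a, mismatches


# The word "pe" matches at position 0 in both strings (p = 2).
print(extend_match("petra", "peter", 0, 0, p=2, k=2))
print(extend_match("petra", "peter", 0, 0, p=2, k=0))
```

Each extension touches only one diagonal, so verifying a hit costs time linear in the sequence length rather than quadratic.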

BLAST Mechanism Summary

word length p (here: p = 4)

no mismatches or gaps allowed

only within the grey areas

Zipf's Law

Example for BLAST

SWISS_PROT:C79A_HUMAN P11912

Search SWISSPROT for Ig-alpha:

Distribution of Hits:

Example for BLAST: Alignment

>gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor complex associated protein alpha-chain precursor (IG-alpha) (MB-1 membrane glycoprotein) (Surface-IGM-associated protein) (Membrane-bound immunoglobulin associated protein) (CD79A)
Length = 220

Score = 278 bits (711), Expect = 5e-75
Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%)

Query: 1   MPGGPGVLQALPATIFLLFLLSAVYLGPGCQALWMHKVPASLMVSLGEDAHFQCPHNSSN 60
           MPGG          + LL  LS   LGPGCQAL +   P SL V+LGE+A   C  N+
Sbjct: 1   MPGG----LEALRALPLLLFLSYACLGPGCQALRVEGGPPSLTVNLGEEARLTC-ENNGR 55

Query: 61  NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQEGNESYQQSCG 120
           N N+TWW  L  N TWPP  LGPG+   G L    VNK+ G    C+V E N   ++SCG
Sbjct: 56  NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTGCQVIE-NNILKRSCG 114

Query: 121 TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKLGLDAGD 180
           TYLRVR P PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK G+D  D
Sbjct: 115 TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKFGVDMPD 174

Query: 181 EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP 226
           +YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG+L+IGD QLEKP
Sbjct: 175 DYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGNLHIGDAQLEKP 220

Example for BLAST: Taxonomy

Lineage Report

root
. cellular organisms
. . Eukaryota [eukaryotes]
. . . Fungi/Metazoa group [eukaryotes]
. . . . Bilateria [animals]
. . . . . Coelomata [animals]
. . . . . . Gnathostomata [vertebrates]
. . . . . . . Tetrapoda [vertebrates]
. . . . . . . . Amniota [vertebrates]
. . . . . . . . . Eutheria [mammals]
. . . . . . . . . . Homo sapiens (man) ---------------------- 473 33 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Bos taurus (bovine) ..................... 312 2 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Mus musculus (mouse) .................... 278 31 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Canis familiaris (dogs) ................. 37 1 hit [mammals] IG KAPPA CHAIN V REGION GOM
. . . . . . . . . . Rattus norvegicus (brown rat) ........... 35 7 hits [mammals] Vascular endothelial growth factor receptor 1 precursor (VE
. . . . . . . . . . Oryctolagus cuniculus (domestic rabbit) . 29 1 hit [mammals] IG KAPPA CHAIN V REGION K29-213
. . . . . . . . . Coturnix japonica ------------------------- 33 2 hits [birds] Vascular endothelial growth factor receptor 2 precursor (VE
. . . . . . . . . Gallus gallus (chickens) .................. 31 4 hits [birds] CILIARY NEUROTROPHIC FACTOR RECEPTOR ALPHA PRECURSOR (CNTFR
. . . . . . . . Xenopus laevis (clawed frog) ---------------- 30 2 hits [amphibians] Neural cell adhesion molecule 1, 180 kDa isoform precursor
. . . . . . . Heterodontus francisci ------------------------ 28 1 hit [sharks and rays] Myelin P0 protein precursor (Myelin protein zero) (Myelin p
. . . . . . Drosophila melanogaster ------------------------- 30 2 hits [flies] Neuroglian precursor
. . . . . Caenorhabditis elegans ---------------------------- 29 1 hit [nematodes] Hypothetical protein F59B2.12 in chromosome III
. . . . Saccharomyces cerevisiae (brewer's yeast) ----------- 33 1 hit [ascomycetes] Putative 101.7 kDa transcriptional regulatory protein in PR
. . . Marchantia polymorpha --------------------------------- 29 1 hit [liverworts] Succinate dehydrogenase cytochrome b560 subunit (Succinate
. . Agrobacterium tumefaciens str. C58 ---------------------- 28 1 hit [a-proteobacteria] Formamidopyrimidine-DNA glycosylase (Fapy-DNA glycosylase)
. Human adenovirus type 3 ----------------------------------- 30 1 hit [viruses] EARLY E3 20.5 KD GLYCOPROTEIN
. Human adenovirus type 7 ................................... 30 1 hit [viruses] EARLY E3 20.5 KD GLYCOPROTEIN

PSI-BLAST

Globin family (oxygen transport) of proteins occurs in many species

Proteins have the same function and structure

But there are pairs of members of the family sharing less than 10% identical residues

PSI-BLAST idea: score via intermediaries may be better than score from direct comparison

(Figure: three proteins A, B, C with pairwise identities of 50% for A and B, 50% for B and C, and only 10% for A and C.)

PSI-BLAST

1. BLAST

2. Collect top hits

3. Build multiple sequence alignment from significant local matches

4. Build profile

5. Re-probe database with profile

6. Go back to 2.
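Steps 4 and 5 can be illustrated with a heavily simplified profile. The sketch below assumes an ungapped toy alignment, a uniform background distribution and a flat pseudocount; PSI-BLAST itself builds its position-specific scores in a more elaborate way, and the function names here are hypothetical.

```python
import math
from collections import Counter


def build_profile(aligned: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Column-wise log-odds scores from equal-length aligned sequences ('-' = gap)."""
    alphabet = sorted(set("".join(aligned)) - {"-"})
    background = 1.0 / len(alphabet)          # assumption: uniform background
    profile = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({c: math.log2((counts[c] + pseudocount) / total / background)
                        for c in alphabet})
    return profile


def profile_score(profile: list[dict[str, float]], seq: str) -> float:
    """Ungapped score of seq against the profile; unseen residues get the column minimum."""
    return sum(column.get(c, min(column.values()))
               for column, c in zip(profile, seq))


top_hits = ["PETER", "PETRA", "PEDER"]        # toy stand-in for aligned top hits
profile = build_profile(top_hits)
for candidate in ("PETER", "PEDRA", "AAAAA"):
    print(candidate, round(profile_score(profile, candidate), 2))
```

A candidate such as PEDRA matches none of the top hits exactly but still scores well, because each of its residues is supported by some hit in the corresponding column. This is the sense in which scoring via intermediaries can beat direct comparison.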

PSI-BLAST

But beware of PSI-BLAST:

False positives propagate and spread through iterations

If protein A consists of domains D and E, protein B of domains E and F, and protein C of domain F, then PSI-BLAST will relate A and C although they do not share any domain
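The domain example can be played through with a toy model. The sketch assumes, purely for illustration, that an iteration reports every protein sharing at least one domain with anything found so far; it models the propagation effect only, not PSI-BLAST's actual scoring.

```python
# Domain composition from the example above.
domains = {"A": {"D", "E"}, "B": {"E", "F"}, "C": {"F"}}


def iterative_hits(query: str, rounds: int) -> set[str]:
    """Each round adds every protein that shares a domain with the proteins found so far."""
    found = {query}
    for _ in range(rounds):
        covered = set().union(*(domains[protein] for protein in found))
        found |= {protein for protein, parts in domains.items() if parts & covered}
    return found


print(sorted(iterative_hits("A", rounds=1)))
print(sorted(iterative_hits("A", rounds=2)))
print("A and C share a domain:", bool(domains["A"] & domains["C"]))
```

After one round only B is linked to A; after the second round C appears as well, although A and C have no domain in common.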
