local and global alignment, database searching with blast

Sequence Database Searching

Local and Global Alignment, Database Searching With Blast

Original Presentations:

Hugues Sicotte

National Center for Biotechnology Information

sicotte@ncbi.nlm.nih.gov

Adaptation:

Alan Durham

University of São Paulo

alan@ime.usp.br

Alignment definition and Type:

G-ATES

GRATED

Local Alignments:

Global Alignment:

Alignment:

All bases aligned with another base or with a gap (symbol of “-” or sometimes “.”).

Each Base is used at most once.

Do not need to align all the bases in all sequences.

Align BILLGATESLIKESCHEESE and GRATEDCHEESE

G-ATESLIKESCHEESE or G-ATES & CHEESE

GRATED-----CHEESE GRATED & CHEESE

C O M P A R A T I V E A N A L Y S I S

GATTATACCAGATTA---CA

Insertions and deletions (‘indels’) are represented by gaps in alignments

gap of length 3

Score and Statistics

G-ATESLIKESCHEESE AND/OR G-ATES & CHEESE

GRATED-----CHEESE GRATED & CHEESE

Percent Identity. Can be misleading.

Score: A simple quality measure is the “score”. The score assigns points for each aligned base (or gap) of the alignment.

identical bases : “match” score

mismatching bases: “mismatch” score

gaps(optional): “gap opening” penalty for starting a gap

“gap extension” penalty for each gap symbol.

Score = 10*(+1)+1*(-1)+(-5-1)+(-5+5*(-1))

Example: match = +1 , mismatch =-1,

gap opening = -5, gap extension=-1

S C O R I N G S Y S T E M S

Which alignment is “better”?

GCTACTAGTT------CGCTTAGCGCTACTAGCTCTAGCGCGTATAGC

GCTACTAG-T-T--CGC-T-TAGCGCTACTAGCTCTAGCGCGTATAGC

0 mismatches, 5 gaps

3 mismatches, 1 gap

S C O R I N G S Y S T E M S

High penalty for “opening” a gap

(e.g. G = 5)

GCTACTAGTT------CGCTTAGCGCTACTAGCTCTAGCGCGTATAGC

GCTACTAG-T-T--CGC-T-TAGCGCTACTAGCTCTAGCGCGTATAGC

Penalty = 5G + 6L = 31

Penalty = 1G + 6L = 11

Lower penalty for “entending” a gap

(e.g. L = 1)

L O C A L S I M I L A R I T Y

Figure 7.3

F12 F2 E F1 E K Catalytic

PLAT F1 E K CatalyticK

Mix-and-match protein modules confound alignment algorithms

Protein modules in coagulation factor XII (F12) and tissue plasminogen activator (PLAT)

F1,F2 Fibronectin repeatsE EGF similarity domainK Kringle domainCatalytic Serine protease activitiy

Figure 7.3

modules inreverse order

Figure 7.3

repeatedmodules

D O T P L O T S

Figure 7.4

Dot-plot Fitch : Biochem. Genet. (1969)3,99-108

C G T A C C G T

0 0 0 1 0 0 0 0

0 0 0 1 1 0 0

1 0 0 0 0 1 0

0 1 0 0 0 0 1

Horizontal axis is coordinates for one sequence

Vertical axis is coordinates for the other

D O T P L O T S

Figure 7.4b

Dot-plot Fitch : Biochem. Genet. (1969)3,99-108

Can also score not 1 position at a time, but in sliding window. For example a window of 3 nucleotides where we score 1 for identical triplets and 0 for all other combinations yields.

C G T A C C G T

0 0 0 0 0 0

1 0 0 0 0 1

D O T P L O T S

Coagulation Factor XII (F12)

Figure 7.4

D O T P L O T S

Figure 7.4

Catalytic

EF1EF2

Plot dots for high similarity within a short window

Adjacent dots merge to form diagonal segments

D O T P L O T S

Figure 7.4

Catalytic

EF1EF2

Repeated domains show a characteristic pattern

P A T H G R A P H S

Figure 7.5

90 137

PLAU 90 EPKKVKDHCSKHSPCQKGGTCVNMP--SGPH-CLCPQHLTGNHCQKEK---CFE 137PLAT 23 ELHQVPSNCD----CLNGGTCVSNKYFSNIHWCNCPKKFGGQHCEIDKSKTCYE 72

EGF similarity domains of urokinse plasminogen activator (PLAU) and tissue plasminogen activator (PLAT)

Dot plots suggest paths through the alignment space

Path graphs are more explicit representations

Each path is a unique alignment

G A T A C T AG A T T A C C A

Construct an optimal of these two sequences:

Using these scoring rules: Match:

Mismatch:Gap:

+1-1-1

D Y N A M I C P R O G R A M M I N G

Dynamic Programming Example

G A T A C T AGATTACCA

Arrange the sequence residues along a two-dimensional lattice

Vertices of the lattice fall between letters

The goal is to find the optimal path

from here

to here

Each path corresponds to a unique alignment

Which one is optimal?

The score for a path is the sum of its incremental edges scores

A aligned with AMatch = +1

The score for a path is the sum of its incremental edges scores A aligned with T

Mismatch = -1

The score for a path is the sum of its incremental edges scores

T aligned with NULL

Gap = -1

NULL aligned with T

Incrementally extend the path

Remember the best sub-path leading to each point on the lattice

-1 -3-2

+2-2 +2 +1

-7-6-5

Trace-back to get optimal path and alignment

+2-2 +2 +1

-7-6-5

Print out the alignment

AA-TTTAACCTCAA

Two different types of Alignment

Needleman & Wunch (J. Mol. Biol. (1970) 48,443-453 : Problem of finding the best path. Revelation: Any partial sub-path that ends at a point along the true optimal path must itself be the optimal path leading to that point. This provides a method to create a matrix of path “score”, the score of a path leading to that point. Trace the optimal path from one end to the other of the two sequences.

Global Alignment methods:

Smith & Waterman.(J. Mol. Biol. (1981), 147,195-197: Use Needleman &Wunch, but report all non-overlapping paths, starting at the highest scoring points in the path graph.

FASTP(Lipman &Pearson(1985),Science 227,1435-1441

BLAST (Altschul et al (1990),J. Mol. Bio. 215,408-410): don’t report all overlapping paths, but only attempt to find paths if there are words that are high-scoring. Speeds up considerably the alignments.

Local Alignment methods:

G L O B A L & L O C A L S I M I L A R I T Y

Implementations of dynamic programming for global and local similarities

Optimal global alignment

Needleman & Wunsch (1970)

Sequences align essentially from end to end

Optimal local alignment

Smith & Waterman (1981)

Sequences align only in small, isolated regions

Local and Global Alignments: computing the matrix

global alignment: as shown in previous slides

local allignment: change computation so that never put negative values

– when value of cell will be negative, set to zero (means staring another path)

– best local alignment comes from entry in matrix with maximum value

semi-global alignment:

– good in assembling

– ignore gaps at the end and at the beginning of sequences

– to ignore gaps at the beginning of alignment: zeroes in first column and first row

– to ignore gaps at the end of alignment: pick maximum value of last row and last column

D Y N A M I C P R O G R A M M I N G: semi-global alignment

G A T A C TGATTACCA

(we eliminated the last symbol from one of the sequences)

•choose the best score from scores in last row and collumn

•fill first row and first column with zeroes

•in this problem, 2 solutions

+2-1 +2 +1

0 +2 +4

Match/mismatch scores and Statistics

•Some amino acids mutations do not affect structure/function very much. Amino acids with similar physico-chemical and steric properties can often replace each other.

•Scoring system that doesn’t penalize very much mutations to similar amino acid.

•PAM Matrices: Point Accepted Mutations. Defined in terms of a divergence of 1 percent PAM. For distant sequences use PAM250, while for closer sequences (like DNA) use PAM100. Some sites accumulate mutations some others don’t, thus use of the PAM100 matrice doesn’t mean that the sequences compared were 100% mutated.

•BLOSUM: BLOCK substitution matrices. Started with the BLOCKS database of multiple alignment only involving distant sequences. BLOSUM62 means that the proteins compated were never closer than 62% Identity. BLOSUM50 matrices involved alignment of more distant sequences. Recommend use BLOSUM matrices (BLOSUM62) for most protein alignments.

Alignment methods

Query sequence

Sequence Alignment representation using a dot plot.

For a query of N letters against a subject sequence of M letters, it requires MxN comparisons.

H A S H I N G M E T H O D S

Hashing is a common method for accelerating database searches

MLILII

MLIIKRDELVISWASHEREquery sequence

IIKIKRKRDRDEDELELVLVIVISISWSWAWASASHSHEHERERE

all overlappingwords of size 3

Compile “dictionary” of words from the query sequence. Put each word in a look-up table that points to the original position in the sequence. Thus given one word, you can know if it is in the query in a single operation.

Index lookup

Each word is assigned a unique integer.

E.g. for a word of 3 letters made up of an alphabet of 20 letters.

1. Assign a code to each letter Code(l) (0 to 19)

2. For a word of 3 letters L1 L2 L3 the code is

index = Code(L1)*202 + Code(L2)*201 + Code(L3)

3. Have an array with a list of the positions that have that word.

0 1 2 3

Position in query sequence of word

Building the dictionary for the query sequence requires (N-2) operations.

MLILII

MLIIKRDELVISWASHEREquery sequence

IIKIKRKRDRDEDELELVLVIVISISWSWAWASASHSHEHERERE

all overlappingwords of size 3

The database contains (M-2) words, and it takes only one operation to see if the word was in the query.

Query sequence

Scan the subject, looking up words in the dictionary

Use word hits to determine were to search for alignments

fills the dynamic programming matrix

in (N-2)+(M-2) operations instead

of MxN.

Blast: extending good hits

blast pre-processes the target sequence set

lists of hits for each possible word (

– 3-tuple for proteins - 203 = 8000 different words

– for each word, find with ones have “good match”

• 13 in old version 11 in new version

for the “good ones” get list of sequences in database that have it

Blast (old) : extend match both ways while score is increasing

Blast2 (new):

– when two words found in same “diagonal” withing “short” distance, extend an un-gapped alignment.

– continue extension like old blast

get local alignments with score greater than cutoff score

perform SW on best candidates

Query sequence

Scan the database, looking up words in the dictionary

BLAST extends from word hits

Query sequence

BLAST2 extends pairs in same diagonal first

Database Search Space

Query sequence

Simplest Database searching could is a large dynamic programming example.

With all the database sequences concatenated one after another.

Query sequence

Which alignment is more significant?

Query sequence

Score can be used to judge alignments. But a score absolute value is a function of the score parameters.

Match=+1,Mismatch=-1,

Gap_open=5,

gap_extend=1

Yields same alignments as

Match=+10,Mismatch=-10,

Gap_open=50,

gap_extend=10

Scores useful for relative ranking.

Query sequence

To Judge relevancy of an alignment, need to judge if match is significant.

E-value = Expect(S) is a function of the score, database size and composition, and query size.

Number of Aligments with scores >= S expected if the query was a random given the database size and composition.

Expect of 0.0 means a very good match unlikely to be random.

Alligning sequences in databases: evaluating significance

we can allign a sequence with any other one

we want good allignents that are statistically significant

when searching databases, statistical relevance needs to be computed too

E value: number of hits a random sequence of the same size would get in a database of the same size

E-value

“Hits” can be sorted according to their E-value or their score.

The E-value is better known as the EXPECT value and is a function of score, database size and query sequence length.

E-value: Number of alignments with a score >=S that you expect to find if the database was a collection of random letters.

e.g. For a score of 1, one only requires 1 match, and there should be an enormous amount of alignments. One expects to find less alignments with a score of 5, and so on.. Eventually when the score is big enough, one expects to find an insignificant number of of alignments that could be due to chance.

E-value of less than 1e-6 (1* 10-6 in scientific notation) are usually very good and for proteins, E<1e-2 is usually considered significant. It is still possible for a Hit with E>1 to be biologically meaningful, but more analysis is required to comfirm that.

Even for VERY good hits, it is possible that the hit is due to a biological artifact (sequencing/cloning vector, repeats, low-complexity sequence…)

D A T A B A S E S E A R C H I N G

The “hit list” gives titles and scores for matched sequences

> fasta myquery swissprot -ktup 2The best scores are: initn init1 opt z-sc E(77110)gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)- 996 996 996 1262.1 0gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL) 412 382 395 507.6 1.4e-21gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI 238 133 316 407.4 5.4e-16gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT- 153 98 190 253.1 2.1e-07gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7 163 163 184 244.8 6.1e-07gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN 164 164 170 227.2 5.8e-06gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI 130 91 157 210.3 5.1e-05gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1 125 125 148 199.7 0.0002gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4 42 42 140 191.3 0.00058gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9 128 73 139 188.7 0.00082gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT- 76 76 133 181.0 0.0022gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1 27 27 119 165.2 0.017gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO 66 66 118 163.0 0.022gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO 65 65 116 160.5 0.03gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT- 52 52 117 160.3 0.031gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO 66 66 115 159.3 0.035gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO 66 66 112 155.5 0.057gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN 73 73 112 155.4 0.058gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN 76 76 110 153.8 0.072gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP 58 58 104 138.5 0.51gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE 47 47 103 137.8 0.56gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T 63 63 98 131.3 1.3gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA 58 58 99 129.4 1.6gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA 70 48 91 122.9 3.7gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE 75 50 92 121.9 4.3gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU 36 36 85 121.3 4.7gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC 36 36 84 120.0 5.5gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA 45 45 90 118.9 6.3gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA 48 48 92 117.4 7.6gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED 59 59 89 117.0 8gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC 48 48 97 117.0 8gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO 38 38 83 116.8 8.3

D A T A B A S E S E A R C H I N G

Detailed alignments are shown farther down in the output

> fasta myquery swissprot -ktup 2

>>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa)initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21Smith-Waterman score: 395; 52.3% identity in 109 aa overlap

10 20 30 40 50gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF : X: .:.:: :.:: ::..:::::: : : : :..:: :.:..:::gi|170 MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF 10 20 30 40 50 60

60 70 80 90 100 110gi|170 QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK ....: :.:: : ... ....::: .::::: :::::..::: .:: .:: .: :X.:gi|170 TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK 70 80 90 100 110 120

120 130 140gi|170 HDKEDFPASWRSEEEMAAEAAALRVYFQ ..gi|170 NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE 130 140 150 160 170 180

>>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa)initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16Smith-Waterman score: 316; 37.4% identity in 131 aa overlap

10 20 30 40gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER :.. :. .v^: :.. ..:::: ::.::::::. ::X :

Query sequence

Some matches are non-meaningful because they occur VERY often in

database.

e.g. nucleotide AAA (from polyA)

Biological repeated elements(retroposons ALU)

Low-complexity repeated patterns.

(CAGCAG, QQQ,KKK,…)

These elements should be

FILTERED or MASKED

to avoid generating false ‘hits’.. It is ‘OK’ to align through them if they are near meaningful diagonal ‘hits’

Query sequence

FASTA searches in a band

local and global alignment, database searching with blast

Documents

l4: blast: alignment scores...

blast: basic local alignment search tool · blast: basic...

blast and fasta heuristics in pairwise sequence...

heuris'c)search:)fasta)and)blast - duke...

advanced blast searching · 2014. 5. 26. · advanced blast...

blast and multiple sequence alignment learning...

1 part 8 sequence alignment and sequencing searching...

point specific alignment methods psi – blast & phi –...

blast (basic local alignment search tool)

blast and sequence alignment

alignment methods and database searching

basic local alignment search tool blast why use blast?

basic local alignment search tool (blast)

the basic local alignment search tool (blast)

protein sequence alignment and database searching

introduction to bioinformatics - tutorial no. 2 global...

point specific alignment methods psi – blast & phi –...

basic local alignment sequence tool (blast)€¦ · blast...

alignment and database searching

blast: basic local alignment search tool