Sequence based searching

Lesson 7

Based on a presentation by Irit Gat-Viks

Based on a presentation by Amir Mitchel, Introduction to bioinformatics course,

Bioinformatics unit, Tel Aviv University.

Reminder – Importance of Homology

Use a sequence as a search query in order to find homologous sequences in a database.

Homology – similarity between sequences that results from a common ancestor.

Basic Assumption: Sequence homology → similar structure/function

Why?

• Characterizing an ORF.
• Finding duplicate genes in the same organism (known function, variants).
• Finding homologous genes in other organisms (phylogeny, known function).

Study a sequence through homologs

Identity Similarity Homology

Query= uniprot|Q9UP52|TFR2_HUMAN Transferrin receptor protein 2 (TfR2).

>gi|20140567|sp|Q07891|TFR1_CRIGR Transferrin receptor protein 1 (TfR1) (TR) (TfR) (Trfr)

Length = 757

Score = 540 bits (1392), Expect = e-152

Identities = 305/727 (41%), Positives = 412/727 (56%), Gaps = 52/727 (7%)

Query: 87 LTALLIFTGAFLLGYVAF--RGSCQAC--------GDSVLVVSEDVNYEPDLDFHQGRLY 136
+ ++ F F++GY+ + R + C G+S ++ E++ RLY
Sbjct: 71 IAVVIFFLIGFMIGYLGYCKRTEQKDCVRLAETETGNSEIIQEENIP-------QSSRLY 123

Query: 137 WSDLQAMFLQFLGEGRLEDTIRQTSLRERVAGSAGMAALTQDIRAALSRQKLDHVWTDTH 196
W+DL+ + + L DTI+Q S R AGS L I KL VW D H
Sbjct: 124 WADLKKLLSEKLDAIEFTDTIKQLSQTSREAGSQKDENLAYYIENQFRDFKLSKVWRDEH 183

Query: 197 YVGLQFPDPAHPNTLHWVDEAGKVGEQLPLEDPDVYCPYSAIGNVTGELVYAHYGRPEDL 256
YV +Q A N + ++ G + +E+P Y YS V+G+L++A++G +D
Sbjct: 184 YVKIQVKGSAAQNAVTIINVNG---DSDLVENPGGYVAYSKATTVSGKLIHANFGTKKDF 240

Query: 257 QDLRAXXXXXXXXXXXXXXXXISFAQKVTNAQDFGAQGVLIYPEPADFSQDPPKPSLSSQ 316
+DL+ I+FA+KV NAQ F A GVLIY + F P + ++
Sbjct: 241 EDLK---YPVNGSLVIVRAGKITFAEKVANAQSFNAIGVLIYMDQTKF------PVVEAE 291

Query: 317 QAVYGHVHLGTGDPYTPGFPSFNQTQFPPVASSGLPSIPAQPISADIASRLLRKLKGPVA 376
+++GH HLGTGDPYTPGFPSFN TQFPP SSGLPSIP Q IS A +L + ++
Sbjct: 292 LSLFGHAHLGTGDPYTPGFPSFNHTQFPPSQSSGLPSIPVQTISRKAAEKLFQNMETNCP 351

• For proteins, finding distant relatives is a difficult task.
• Distant protein family members may share <20% amino acid identity (!).

>gi|3582021|emb|CAA70575.1| cytochrome P450 [Nepeta racemosa]

Length = 509

Score = 405 bits (1043), Expect = e-111

Identities = 94/479 (19%), Positives = 192/479 (40%), Gaps = 35/479 (7%)

Query: 61 NLYHFWRETGTHKVHLHHVQNFQKYGPIYREKLGNVESVYVIDPEDVALLFKSEGPNPER 120
NL+ G + H + ++YGP+ + G+V + PE + K++
Sbjct: 45 NLHQL----GLY-PHRYLQSLSRRYGPLMQLHFGSVPVLVASSPEAAREIMKNQDIVFSN 99

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Query: 297 -----DYRGMLYRLLGDSK----MSFEDIKANVTEMLAGGVDTTSMTLQWHLYEMARNLK 347
D+ +L + ++K + + +KA + +M G DTT+ L+W + E+ +N +
Sbjct: 271 GDGALDFVDILLQFQRENKNRSPVEDDTVKALILDMFVAGTDTTATALEWAVAELIKNPR 330

Query: 348 VQDMLRAEVLAARHQAQGDMATMLQLVPLLKASIKETLRLH-PISVTLQRYLVNDLVLRD 406
L+ EV L+ +P LKASIKE+LRLH P+ + + R D +
Sbjct: 331 AMKRLQNEVREVAGSKAEIEEEDLEKMPYLKASIKESLRLHVPVVLLVPRESTRDTNVLG 390

Query: 407 YMIPAKTLVQVAIYALGREPTFFFDPENFDPTRWLSK--DKNITYFRNLGFGWGVRQCLG 464
Y I + T V + +A+ R+P+ + +PE F P R+L D +F L FG G R C G
Sbjct: 391 YDIASGTRVLINAWAIARDPSVWENPEEFLPERFLDSSIDYKGLHFELLPFGAGRRGCPG 450

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Query types

DNA vs. Protein (e.g., UCAUAC or Serine–Tyrosine)

• The sequence query can be a nucleotide sequence or an amino acid sequence.
• The search is performed against a nucleotide or amino acid database.

Which search is preferable?

1. Which sequence is more conserved during evolution? Answer: the amino acid sequence. The genetic code is redundant: some amino acids are coded by more than one codon. Therefore, the DNA sequence can change while the amino acid sequence remains the same.

2. Nucleotides: a four-letter alphabet. Amino acids: a twenty-letter alphabet. Two random DNA sequences will share on average 25% identity. Two random protein sequences will share on average 5% identity.

3. Protein comparison matrices are much more sensitive than those for DNA, i.e., similarity relationships are defined between pairs of amino acids (PAM/BLOSUM).

4. DNA databases are much larger, meaning more random hits.

Using the amino acid sequence is preferable for homology search.
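Points 1 and 2 above can be checked with a few lines of Python. This is a minimal sketch, not part of any BLAST tool: the codon table is the standard genetic code written as DNA, the helper names (`translate`, `percent_identity`, `expected_random_identity`) are made up for illustration, and the expected-identity figure assumes every letter is equally frequent, which real sequences are not.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate DNA in frame 1; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percentage of identical positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def expected_random_identity(alphabet_size):
    """Expected % identity of two random sequences with uniform letter usage."""
    return 100.0 / alphabet_size


# Two DNA sequences that differ only at third-codon positions.
dna_1, dna_2 = "TCATAC", "TCGTAT"
print(percent_identity(dna_1, dna_2))        # DNA identity is below 100%
print(translate(dna_1), translate(dna_2))    # both encode Ser-Tyr
print(expected_random_identity(4), expected_random_identity(20))
```

The two DNA strings are hypothetical variants of the slide's UCAUAC example (written with T instead of U); they show how the nucleotide level can diverge while the protein stays identical.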

1. Protein sequence comparisons typically double the evolutionary look-back time over DNA sequence comparisons.

2. Evolutionarily distant proteins will exhibit high similarity rather than high identity.

3. Hits can exhibit a long alignment (homology) or a short alignment (conserved domains).

Why use a nucleotide sequence after all?

Query type

• The sequence query can be a nucleotide sequence or an amino acid sequence.
But … we can translate the query sequence!

• The search is performed against a nucleotide or amino acid database.
But … we can use translated databases! (e.g., TrEMBL)

All types of searches are possible.

• A nucleotide query can be translated and searched against protein databases:
1. Translate all six reading frames (3 + 3).
2. Find a long ORF.

• Can an amino acid query be back-translated and searched against nucleotide databases?
1. During translation we lose information.
2. A single amino acid sequence can be back-translated to many possible nucleotide sequences.
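To get a feel for how many nucleotide sequences a single protein can be back-translated to, one can multiply the number of codons available for each residue. The sketch below assumes the standard genetic code; `count_back_translations` is an illustrative name, not a function from any existing tool.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# How many codons encode each amino acid (for example 6 for Ser, 1 for Met).
CODONS_PER_AA = dict(Counter(CODON_TABLE.values()))


def count_back_translations(protein):
    """Number of distinct DNA sequences that translate to `protein`."""
    total = 1
    for aa in protein:
        total *= CODONS_PER_AA[aa]  # KeyError for letters outside the code
    return total


print(count_back_translations("SY"))     # the Ser-Tyr dipeptide
print(count_back_translations("MW"))     # Met and Trp have one codon each
```

Even a two-residue peptide maps to several DNA strings, and the count grows multiplicatively with length, which is why back-translating a protein query is not a practical search strategy.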

Query: DNA Protein

Database: DNA Protein

Query types

1. Amino acid query against protein database (blastp)
– Identifying a protein sequence.
– Finding similar sequences in protein databases.

2. Nucleotide query against nucleotide database (blastn)
– In non-coding regions (no ORF found): identify the query sequence or find similar sequences.
– Find primer binding sites or map short contiguous motifs.

3. Translated nucleotide query against protein database (blastx)
– Useful when the query includes a coding region and we are trying to find homologous proteins.
– Used extensively in analyzing EST sequences. This search is more sensitive than nucleotide BLAST since the comparison is performed at the protein level.

4. Protein query against translated nucleotide database (tblastn)
– Useful for finding protein homologs in unannotated nucleotide data of coding regions (e.g., ESTs, draft genome records (HTG)).

5. Translated nucleotide query against translated nucleotide database (tblastx)
– Useful for identifying novel genes in error-prone query sequences.
– Used for identifying potential proteins encoded by single-pass read ESTs.

* All translations are six-frame.
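The six-frame translation used by the translated searches can be sketched as follows. This is an illustration under the assumption of the standard genetic code and unambiguous A/C/G/T input; the function names are hypothetical and the real BLAST programs do this internally.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate DNA in frame 1; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return {frame: protein} for frames +1, +2, +3, -1, -2, -3."""
    frames = {}
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            frames[strand * (offset + 1)] = translate(seq[offset:])
    return frames


def longest_orf(protein):
    """Longest stretch without a stop ('*') in one translated frame."""
    return max(protein.split("*"), key=len)


for frame, protein in six_frame_translations("ATGTCATACTAA").items():
    print(frame, protein, longest_orf(protein))
```

Here `longest_orf` only looks for the longest stop-free stretch; it does not require a start codon, which is a simplification of what "find a long ORF" usually means.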

Searching databases

• Naïve solution: use an exact algorithm to compare each sequence in the database to the sequence query.

• Problems:
1. Databases are huge, millions of sequences.
2. Running the computations in parallel is expensive.

• Solutions:
1. Use a heuristic algorithm to discard most irrelevant sequences.
2. Perform the alignment on the small group of remaining sequences.

• Key concept of BLAST (Basic Local Alignment Search Tool): homologous sequences are expected to contain short ungapped segments (with substitutions, but without gaps).

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.
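For contrast with the heuristic, the naïve solution can be sketched as a full dynamic-programming comparison of the query against every database entry. The slides do not name the exact algorithm; Smith-Waterman local alignment is assumed here as a typical choice, and the match, mismatch and gap values are arbitrary illustrative numbers.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if x == y else mismatch)
            current.append(max(0, diagonal, previous[j] + gap, current[j - 1] + gap))
            best = max(best, current[j])
        previous = current
    return best


def naive_search(query, database):
    """Score the query against every database sequence, best score first."""
    scored = [(smith_waterman_score(query, seq), name) for name, seq in database.items()]
    return sorted(scored, reverse=True)


# Hypothetical two-entry database.
print(naive_search("ACGT", {"s1": "TTACGTTT", "s2": "GGGG"}))
```

Each comparison costs time proportional to the product of the two sequence lengths, so repeating it for millions of database entries is what makes the exact approach impractical.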

BLAST

• Preprocess
– Low complexity regions are removed.
– A dictionary of K-tuple words is prepared for the query sequence and the database. Protein: 3-letter words; DNA: 4-6 or even 11-letter words.

• Search for K-tuple words and find database records with common words. Words can be similar, not only identical.
– Identity: CAT : CAT
– Similarity: CAT : CAT, CAR, HAT …
– But even CAT : ZTX can be similar.
– For each three-letter word there are at most 20^3 (8,000) similar words.
– Only words that reach a minimum cut-off score (T) against the query word count as similar.
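The idea of a word neighborhood with cut-off T can be sketched as below. BLAST scores words with a substitution matrix such as BLOSUM62; to keep the example self-contained, a toy scoring function based on hand-picked residue groups stands in for the matrix, so the groups, the score values, and the function names are all assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Toy similarity groups; a real search would use BLOSUM62 or a PAM matrix.
GROUPS = ["AVLIM", "FWY", "STNQ", "KRH", "DE", "C", "G", "P"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def toy_score(x, y):
    """+5 for identity, +2 within a group, -2 otherwise (illustrative values)."""
    if x == y:
        return 5
    return 2 if GROUP_OF[x] == GROUP_OF[y] else -2


def neighborhood(word, threshold):
    """All words of the same length scoring >= threshold against `word`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(x, y) for x, y in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    return sorted(hits, key=lambda hit: -hit[1])


for similar_word, score in neighborhood("CAT", threshold=12):
    print(similar_word, score)
```

Raising T shrinks the neighborhood toward the exact word, while lowering it admits more of the 8,000 possible three-letter words, trading speed for sensitivity.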

BLAST Stage I

• Find matching word pairs

• Extend word pairs as much as possible, i.e., as long as the total weight increases

• Result: High-scoring Segment Pairs (HSPs)

THEFIRSTLINIHAVEADREAMESIRPATRICKREAD

INVIEIAMDEADMEATTNAMHEWASNINETEEN
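Stage I can be sketched as an ungapped extension of a matching word in both directions. The slide says to extend while the total weight increases; the sketch below assumes a simple +1/-1 scoring and an X-drop stopping rule (give up once the running score falls more than `x_drop` below the best seen), both of which are illustrative choices rather than BLAST's actual parameters.

```python
def extend_seed(a, b, i, j, k, match=1, mismatch=-1, x_drop=2):
    """Extend the exact seed a[i:i+k] == b[j:j+k] without gaps.

    Returns (start_in_a, start_in_b, length, score) of the resulting HSP.
    """
    def walk(pos_a, pos_b, step):
        best_score = running = 0
        best_length = length = 0
        while 0 <= pos_a < len(a) and 0 <= pos_b < len(b):
            running += match if a[pos_a] == b[pos_b] else mismatch
            length += 1
            if running > best_score:
                best_score, best_length = running, length
            elif best_score - running > x_drop:
                break  # score dropped too far below the best: stop extending
            pos_a += step
            pos_b += step
        return best_score, best_length

    right_score, right_length = walk(i + k, j + k, +1)
    left_score, left_length = walk(i - 1, j - 1, -1)
    length = left_length + k + right_length
    score = k * match + left_score + right_score
    return i - left_length, j - left_length, length, score


first = "THEFIRSTLINIHAVEADREAMESIRPATRICKREAD"
second = "INVIEIAMDEADMEATTNAMHEWASNINETEEN"
i, j = first.find("EAD"), second.find("EAD")
start_a, start_b, length, score = extend_seed(first, second, i, j, 3)
print(first[start_a:start_a + length], second[start_b:start_b + length], score)
```

The usage example seeds on the shared word EAD from the two slide sequences and prints the segment pair the extension settles on.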

BLAST Stage II

• Try to connect HSPs by aligning the sequences in between them:

THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD

INVIEIAMDEADMEATTNAMHEW___ASNINETEEN

Versions of the program

[t]BLAST[x/n/p]

t : Translate a DNA database in all 6 reading frames for comparison with a Protein query.

x : Translate a nucleotide query in all 6 reading frames for comparison with a Protein database.

p : Comparison is against a Protein database.

n : Comparison is against a Nucleotide database.
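The naming scheme can be summarized as a small lookup table. The sketch below is only a mnemonic: `PROGRAMS` and `choose_program` are hypothetical helpers, and "translated nucleotide" stands for a six-frame translation compared at the protein level.

```python
# (query as compared, database as compared) for each program.
PROGRAMS = {
    "blastp": ("protein", "protein"),
    "blastn": ("nucleotide", "nucleotide"),
    "blastx": ("translated nucleotide", "protein"),
    "tblastn": ("protein", "translated nucleotide"),
    "tblastx": ("translated nucleotide", "translated nucleotide"),
}


def choose_program(query_is_protein, database_is_protein, compare_as_protein=True):
    """Pick a program name from the types of the query and the database."""
    if query_is_protein and database_is_protein:
        return "blastp"
    if query_is_protein:
        return "tblastn"
    if database_is_protein:
        return "blastx"
    # Both are nucleotide: compare translations, or compare the DNA directly.
    return "tblastx" if compare_as_protein else "blastn"


# An EST (nucleotide) searched against a protein database.
program = choose_program(query_is_protein=False, database_is_protein=True)
print(program, PROGRAMS[program])
```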

BLAST

Masking low complexity

• There is one frequent case where the random models, and therefore the statistics discussed here, break down: regions with highly biased amino acid composition ("low complexity" regions).
• Alignments of two regions with similarly biased composition may achieve very high scores that owe virtually nothing to residue order and are due to segment composition.
• Such regions are usually generated by slippage and are thus not interesting. The BLAST programs employ the SEG (protein) or DUST (DNA) algorithm to filter low-complexity regions from the query before executing a database search.
• Masking is applied to the query sequence only, not to the database sequences!
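The effect of masking can be illustrated with a simple sliding-window filter. This is not the SEG or DUST algorithm: it is a simplified stand-in that masks any window whose Shannon entropy falls below a threshold, and the window size, threshold, and function names are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the letter composition."""
    total = len(segment)
    return -sum(n / total * math.log2(n / total) for n in Counter(segment).values())


def mask_low_complexity(sequence, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every window with entropy below `min_entropy` by mask characters."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


# Hypothetical query with a poly-Q stretch in the middle.
query = "ACDEFGHIKLMN" + "Q" * 12 + "ACDEFGHIKLMN"
print(mask_low_complexity(query))
```

The run of X characters in the TfR2 query alignment earlier in the lesson is what a masked region looks like in BLAST output.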

BLAST

http://www.ncbi.nlm.nih.gov/

BLAST http://www.ncbi.nlm.nih.gov/BLAST/

BLAST

I can limit my search to a selected organism

…or even construct my own searchable database by an Entrez query

Filter Low Complexity regions by SEG or DUST

Mask for lookup table hit search stage, but NOT for the hit extension stage.

Mask according to letter case within the query sequence.

BLAST

Lineage Report

root
. Bilateria [animals]
. . Coelomata [animals]
. . . Euteleostomi [vertebrates]
. . . . Tetrapoda [vertebrates]
. . . . . Eutheria [mammals]
. . . . . . Homo sapiens (man) ------------ 571 18 hits [mammals] retinoic acid induced 3; retinoic acid responsive gene [Hom
. . . . . . Mus musculus (mouse) .......... 432 15 hits [mammals] retinoic acid inducible protein 3 [Mus musculus]
. . . . . . Rattus norvegicus (brown rat) . 411 5 hits [mammals] similar to retinoic acid inducible protein 3 [Rattus norveg
. . . . . Xenopus laevis (clawed frog) ---- 216 1 hit [amphibians] MGC68729 protein [Xenopus laevis]
. . . . Takifugu rubripes (torafugu) ------ 40 4 hits [bony fishes] pheromone receptor [Takifugu rubripes]
. . . Drosophila melanogaster ------------- 48 4 hits [flies] CG8285-PA [Drosophila melanogaster] >gi|2827758|sp|P22815|B
. . . Drosophila virilis .................. 39 1 hit [flies] Bride of sevenless protein precursor >gi|1079166|pir||A4755
. . . Anopheles gambiae str. PEST ......... 38 1 hit [flies] ENSANGP00000013404 [Anopheles gambiae] >gi|21296536|gb|EAA0
. . Caenorhabditis elegans ---------------- 41 2 hits [nematodes] calcium-sensing receptor, similar to human metabotropic glu
. environmental sequence ------------------ 40 2 hits [unclassified] unknown [environmental sequence]

The statistics of sequence similarity scores

• Bit score: a normalized score for the alignment, derived from its identities, similarities, and gaps.

• Expect value (E-value) of an alignment having a score S: the number of alignments with a score >= S that one expects to find by chance when searching a random sequence against a random database (having the same lengths and compositions). The closer the E-value is to zero, the greater the confidence that the match is real. An E-value is an expected count, not a probability, so it can exceed 1.

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
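The relation between the two scores is not spelled out on the slide. Under the standard Karlin-Altschul statistics used by BLAST, a raw score S is normalized to a bit score S' = (λS − ln K) / ln 2, and the E-value is E = m · n · 2^(−S'), where m and n are the (effective) query and database lengths. The sketch below simply evaluates these formulas; the function names and the numbers in the usage example are illustrative.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """Expected number of chance alignments: E = m * n * 2 ** (-S')."""
    return query_length * database_length * 2.0 ** (-bits)


# A 10-bit hit for a 100-residue query in a 1,000-residue database ...
print(e_value(10, 100, 1000))
# ... becomes twice as likely by chance if the database doubles in size.
print(e_value(10, 100, 2000))
```

This makes two properties visible: each additional bit halves the E-value, and the same alignment becomes less significant as the database grows.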

BLAST

What about:
• Short sequences?
• Large sequences and queries?

Short sequences:

Parameter settings for standard blastp and "Search for short and nearly exact matches"

Program                                     Word Size   SEG Filter   Expect Value   Score Matrix
Standard blastp                             3           On           10             BLOSUM62
Search for short and nearly exact matches   2           Off          20000          PAM30

Parameter settings for standard blastn and "Search for short and nearly exact matches"

Program                                     Word Size   DUST Filter   Expect Value
Standard blastn                             11          On            10
Search for short and nearly exact matches   7           Off           1000

PAM vs. BLOSUM: reminder

– Different BLOSUM matrices are derived from blocks with different identity percentages (e.g., BLOSUM62 is derived from blocks in which sequences sharing 62% identity or more are clustered and counted as one). Larger n → smaller evolutionary distance.

– A single PAM matrix (PAM1) was constructed from a dataset of sequences with at least 85% identity. The other PAM matrices were computationally derived from it. Larger n → larger evolutionary distance.
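"Computationally derived" means extrapolation: the one-step mutation probability matrix is multiplied by itself n times to model n PAM units of change. The sketch below shows this on a hypothetical two-letter alphabet with 1% change per step; the real PAM1 is a 20 x 20 matrix estimated from data, so the numbers here are only for illustration.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(matrix, n):
    """matrix multiplied by itself n times (identity for n = 0)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


def observed_difference(pam1, n):
    """% of positions expected to differ after n steps, with uniform letter usage."""
    pam_n = mat_pow(pam1, n)
    stay = sum(pam_n[i][i] for i in range(len(pam_n))) / len(pam_n)
    return 100.0 * (1.0 - stay)


# Toy "PAM1": each letter has a 1% chance of changing in one step.
TOY_PAM1 = [[0.99, 0.01], [0.01, 0.99]]
for n in (1, 10, 100, 250):
    print(n, round(observed_difference(TOY_PAM1, n), 2))
```

Because repeated and reverse substitutions hit the same position, the observed difference grows more slowly than the PAM distance, which is the pattern relating observed difference to PAM distance in this reminder.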

Observed % Difference   Evolutionary distance (PAM)   BLOSUM
1                       1                             99
10                      11                            90
20                      23                            80
30                      38                            70
40                      56                            60
50                      80                            50
60                      120                           40
70                      159                           30
80                      250                           20

Values highlighted on the slide: 62 (BLOSUM), 120 and 250 (PAM).

How to generate results from large sequences and queries?

1. Some sequences contain large regions of ALU repeats. In this case you can select the "Human Repeat" filtering option on the main BLAST search page. This will mask repeat regions which generate a large number of biologically uninteresting hits to the databases.

2. Increase the Word Size to 20-25. Compared with a default Word Size of 7, this limits the number of small initial fragments to be extended to HSPs.

3. Decrease the Expect value to 1.0 or lower. This eliminates many hits and concentrates the results on those more likely to contain large coding regions and genomic fragments.

4. Processing multiple query sequences in one run can be much faster than processing them in separate runs because the database is scanned only once for the entire set of queries.

PSI-BLAST (Position-Specific Iterated BLAST)

Sensitive protein-protein similarity searches.

• The most sensitive BLAST program, making it useful for finding very distantly related proteins.

• Use PSI-BLAST when your standard protein-protein BLAST search failed to find significant hits.

Algorithm:

• The first round of PSI-BLAST is a standard protein-protein BLAST search. The program builds a position-specific scoring matrix (PSSM or profile) from an alignment of the sequences returned with Expect values better (lower) than the inclusion threshold (default = 0.005).

• The PSSM is used to evaluate the alignment in the next iteration of the search. Any new database hits below the inclusion threshold are included in the construction of the next PSSM.
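The profile-building step can be sketched as follows. This is a deliberately simplified PSSM: it takes gap-free aligned segments, adds a flat pseudocount, and scores against a uniform background, whereas PSI-BLAST itself uses sequence weighting, data-dependent pseudocounts, and real background frequencies. The function names and the three aligned sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Log-odds (base 2) profile from equal-length, gap-free aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background: a simplification
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_segment(pssm, segment):
    """Sum of position-specific scores for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


profile = build_pssm(["ACD", "ACE", "ACD"])
print(round(score_segment(profile, "ACD"), 2))
print(round(score_segment(profile, "ACE"), 2))
print(round(score_segment(profile, "WWW"), 2))
```

Unlike a single substitution matrix, the profile scores the same residue differently at each position, which is what lets later iterations pick up relatives that a plain blastp search misses.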
