
Sequence Search

Abhishek Niroula

Department of Experimental Medical Science

Lund University

2015-12-10


Sequence search

• Sequences
  – Nucleotide and amino acid sequences
  – Known sequences are stored in different databases
    • NCBI, Ensembl, and others
  – Number of organisms being sequenced is increasing
    • Ensembl, Ensembl Plants, Ensembl Fungi, Ensembl Bacteria, etc.
    • Genome 10K (genomic zoo) project
  – Use of sequences is expanding rapidly in biomedical research
    • 1000 Genomes Project, 100K Genomes Project
• Sequence search
  – Search for an appropriate sequence
  – Search for similar sequences in a database


Sequence search

• Identify a new sequence
• Functional and structural annotation of sequences
• Finding homologous sequences for
  – Genomic, phylogenetic, structural studies, etc.
• Haemophilus influenzae
  – The genome of Haemophilus influenzae was reported in 1995 (Fleischmann et al.)
  – 1,743 assumed coding regions were translated into amino acid sequences and searched for similarity in the Swiss-Prot database
  – 1,007 of them matched
    • The biochemical function could be deduced for each of them
• Multiple sclerosis (source: Martin Tompa)
  – Autoimmune disease in which the T-cells recognise the myelin sheaths of nerves as foreign bodies and attack them
  – Hypothesis: Myelin sheath proteins are similar to bacterial sheath proteins from an earlier infection
  – Methodology
    • Myelin sheath proteins were sequenced
    • A database was searched for similar bacterial and viral sequences
    • Lab tests checked whether the T-cells attacked the identified bacterial and viral proteins


Similarity vs homology

• Similarity
  – Similarity is the degree of likeness between two sequences
  – It is a quantitative measure
• Homology
  – Homology is an evolutionary relationship between two sequences
  – It cannot be measured: two sequences either are homologous or are not
  – Distant and close homology refer to the distance between the sequences and their common ancestor
• "Two sequences are 80% similar." is a meaningful statement.
• "Two sequences are 80% homologous." is not.


Orthologs and paralogs

[Figure: orthologs and paralogs] Source: http://www.ensembl.org/info/genome/compara/tree_example1.png


Sequence search: Problem

• Given
  – Query sequence
  – Database
• Goal
  – To find statistically significant similarity that can be used to infer homology

[Figure: a query is searched against a database to produce a result]

Sensitivity: Are all related sequences identified?

Specificity: Are all unrelated sequences rejected?

Example: the database holds sequences 1-7 and the search returns 1, 2, 4 and 7.

TP: 1, 2, 4

FP: 7

FN: 3, 5

TN: 6
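The example on this slide can be turned into numbers. The following is a minimal sketch that assumes sequences 1-5 are the truly related ones (which is what the TP and FN sets imply) and uses the usual definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The function names are illustrative only.

```python
def confusion_counts(database, related, returned):
    """Count TP, FP, FN, TN for a search result against known related sequences."""
    database, related, returned = set(database), set(related), set(returned)
    tp = len(related & returned)             # related and reported
    fp = len(returned - related)             # reported but unrelated
    fn = len(related - returned)             # related but missed
    tn = len(database - related - returned)  # unrelated and correctly left out
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


# Values from the slide: database 1-7, search returns 1, 2, 4, 7.
# Assumption: sequences 1-5 are the related ones.
tp, fp, fn, tn = confusion_counts(range(1, 8), {1, 2, 3, 4, 5}, {1, 2, 4, 7})
print(sensitivity(tp, fn), specificity(tn, fp))
```

With these sets, three of the five related sequences are found and one of the two unrelated sequences is rejected.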


Heuristic database searching

• Sequence search: problem
  – Exact similarity computation between a query sequence and a database using dynamic programming is computationally intensive
  – With available technology, aligning a query sequence against an entire database is not feasible
• Solution
  – Heuristic methods: fast scanning for similar sequences
  – Sequences similar to the query are retrieved from the database using heuristic methods before exact alignment scores are computed
  – Tools
    • BLAST
    • FASTA
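To make the cost of the exact approach concrete, here is a minimal sketch of local alignment scoring by dynamic programming (the Smith-Waterman recurrence, score only). The match, mismatch, and gap values are illustrative assumptions, not values from the slides. The table has len(a) × len(b) cells, and that work would have to be repeated for every sequence in the database.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("MDLSALTRQ", "KKMDLSAVTRQGG"))
```

Only two rows are kept in memory because the score alone is needed; recovering the alignment itself would require the full table and a traceback.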


BLAST

• Basic Local Alignment Search Tool
• Developed by Altschul et al., 1990
• Determines the local alignment between a query and a database
• BLAST consists of two steps:
  – Searching for matches
  – Computing the statistical significance of the matches


BLAST

• Given a query sequence, split it into words of k residues
  – k = 3 for amino acid sequences
  – k = 11-12 for nucleotide sequences
• Generate all other possible words of k residues
• Score each of the words against the query words using a substitution matrix
  – Words with scores higher than a threshold are considered in the next step

Example (from the slide figure):

Query sequence: M D L S A L T R Q

k-mers (k = 3): MDL, DLS, LSA, SAL, ALT, ...

Candidate words for MDL: MDV, MDM, MRL, MQL, ...

Candidate words for DLS: DLR, QLS, DVS, DKS, ...
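The word-generation step can be sketched as follows. This is a toy illustration, not BLAST's implementation: instead of a real substitution matrix such as BLOSUM62 it uses a hypothetical identity scoring (match = 4, mismatch = -1), and the threshold of 6 is likewise an assumption chosen so that words differing from the query word at one position are kept.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmers(sequence, k=3):
    """All overlapping words of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def word_score(word, other, match=4, mismatch=-1):
    """Toy stand-in for a substitution-matrix score (assumption, not BLOSUM62)."""
    return sum(match if a == b else mismatch for a, b in zip(word, other))


def neighbourhood(word, threshold=6):
    """Every possible word whose score against `word` is higher than the threshold."""
    candidates = ("".join(p) for p in product(AMINO_ACIDS, repeat=len(word)))
    return [c for c in candidates if word_score(word, c) > threshold]


query = "MDLSALTRQ"
print(kmers(query))
print(len(neighbourhood("MDL")))
```

With a real substitution matrix the neighbourhood is not symmetric in this way: conservative substitutions score higher than unlikely ones, so only some single-residue variants pass the threshold.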


BLAST

• Match each of the high-scoring words against the database sequences
• The matches are extended in both directions to form an ungapped local alignment, the high-scoring segment pair (HSP)
• HSPs that score above a cutoff threshold are kept
• The significance of each ungapped HSP is calculated

[Figure: high-scoring words are matched in the database; ungapped extension of a match yields an HSP]
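A minimal sketch of ungapped extension around a word match. Two things here are assumptions rather than content from the slides: scoring is simple identity (+1 / -1), and extension stops once the running score falls more than `x_drop` below the best score seen so far.

```python
def _extend(pairs, match, mismatch, x_drop):
    """Walk residue pairs away from the seed; return (best gain, length at best)."""
    best = run = 0
    best_len = 0
    for n, (a, b) in enumerate(pairs, start=1):
        run += match if a == b else mismatch
        if run > best:
            best, best_len = run, n
        elif best - run > x_drop:
            break  # score dropped too far below the best: stop extending
    return best, best_len


def extend_ungapped(query, subject, q_pos, s_pos, k, match=1, mismatch=-1, x_drop=2):
    """Extend a length-k word hit in both directions without gaps.

    Returns (score, query_start, subject_start, length) of the resulting HSP.
    """
    seed = sum(match if query[q_pos + i] == subject[s_pos + i] else mismatch
               for i in range(k))
    right, r_len = _extend(zip(query[q_pos + k:], subject[s_pos + k:]),
                           match, mismatch, x_drop)
    left, l_len = _extend(zip(reversed(query[:q_pos]), reversed(subject[:s_pos])),
                          match, mismatch, x_drop)
    return seed + left + right, q_pos - l_len, s_pos - l_len, l_len + k + r_len


# Seed: the word "DLS" at query position 1 and subject position 3.
print(extend_ungapped("MDLSALTRQ", "KKMDLSAVTRQGG", q_pos=1, s_pos=3, k=3))
```

The extension tolerates the single L/V mismatch because the score recovers within the allowed drop.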


Gapped BLAST

• Altschul et al., 1997
• Extension of matches requires two non-overlapping matches on the same diagonal within a distance "A"
• Fewer extensions make the search faster
• Gapped alignments are performed around the hits that score higher than a predefined threshold
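The two-hit requirement can be sketched as below. Hits are given as (query position, subject position) pairs for words of length k; two hits lie on the same diagonal when subject position minus query position is equal. Measuring the distance between word start positions, and the parameter name `window` for the slide's "A", are assumptions of this sketch.

```python
from collections import defaultdict


def two_hit_seeds(hits, k, window):
    """Pairs of non-overlapping word hits on one diagonal, at most `window` apart.

    Returns a list of (diagonal, first_query_pos, second_query_pos).
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for i, first in enumerate(positions):
            for second in positions[i + 1:]:
                distance = second - first
                if distance > window:
                    break              # later hits are even further away
                if distance >= k:      # words do not overlap
                    seeds.append((diagonal, first, second))
    return seeds


hits = [(0, 10), (5, 15), (1, 30), (2, 12), (40, 50)]
print(two_hit_seeds(hits, k=3, window=10))
```

Isolated hits, such as the one on its own diagonal in the example, never trigger an extension, which is where the speed-up comes from.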


FASTA

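• FAST-All (extension of FASTP and FASTN)
• Developed by Lipman and Pearson (FASTP, 1985; extended to FASTA by Pearson and Lipman, 1988)
• FASTA also builds a local alignment between query and database
• FASTA has four steps:
  – Hashing
  – Scoring (within selected regions)
  – Scoring (combined regions)
  – Alignment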


FASTA

• Hashing
  – Query sequence is split into words of size k
  – Exact word matches are identified in the database
  – Regions densely populated with matches are identified and the 10 best regions are selected
• Scoring (within selected regions)
  – Within the selected regions, the optimal local alignment is computed using a substitution matrix
• Scoring (combined regions)
  – Alignments are combined to obtain a single larger alignment
  – Gaps are allowed in the alignment
• Alignment
  – The alignment is optimized using Smith-Waterman dynamic programming
• Statistical significance for each alignment is computed
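The hashing step can be sketched as follows: index the query words, look up each database word, and count exact matches per diagonal. Diagonals with many matches correspond to the "regions populated with matches". This is a simplified illustration under my own assumptions (one database sequence, whole diagonals rather than bounded regions), not the FASTA program's actual code.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, k=2):
    """Count exact k-word matches between query and subject on each diagonal."""
    index = defaultdict(list)  # word -> positions in the query
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[j - i] += 1  # same offset means same diagonal
    return diagonals


def best_diagonals(query, subject, k=2, top=10):
    """The `top` diagonals with the most word matches (10, as on the slide)."""
    return diagonal_hits(query, subject, k).most_common(top)


print(best_diagonals("MDLSALTRQ", "KKMDLSAVTRQGG"))
```

Because only exact lookups in a hash table are involved, this pass is far cheaper than scoring every cell of a dynamic programming table.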


Variants of BLAST and FASTA

Query       Database    Program                  Comment
Protein     Protein     blastp, fasta
Nucleotide  Nucleotide  blastn, fasta
Nucleotide  Protein     blastx, fastx, fasty     Translate query to a protein
Protein     Nucleotide  tblastn, tfastx, tfasty  Translate database
Nucleotide  Nucleotide  tblastx                  Translate both query and database
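The "translate" comments in the table refer to conceptual translation of nucleotide sequences before protein-level comparison. A minimal sketch of six-frame translation with the standard genetic code is shown below; it assumes unambiguous A/C/G/T input, marks stop codons as `*` and unknown codons as `X`, and is an illustration rather than the code used by any of the listed programs.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def six_frame_translations(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


print(six_frame_translations("ATGGATCTG"))
```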


Using BLAST and FASTA

• Web application
  – BLAST
    • http://blast.ncbi.nlm.nih.gov/Blast.cgi
  – FASTA
    • http://www.ebi.ac.uk/Tools/sss/fasta/
• Standalone
  – Local installation
  – Database should also be downloaded


BLASTP


FASTA


Input formats

• FASTA format files
  – Widely used in bioinformatics
• Other file formats
  – GCG, EMBL, GenBank, PIR, UniProtKB/Swiss-Prot, PHYLIP
• Identifiers
  – Supported in BLAST
  – Accession
  – Gene identifier
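For reference, a FASTA file consists of records that each start with a `>` header line followed by one or more sequence lines. A minimal parsing sketch is given below; treating the first whitespace-delimited token of the header as the identifier is a common convention that is assumed here, and the record names in the example are made up.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""  # identifier = first token
            records[current] = []
        elif current is not None:
            records[current].append(line)          # sequence may span many lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 hypothetical example protein
MDLSA
LTRQ
>seq2
ACGT
"""
print(parse_fasta(example))
```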


Database

• Generic databases
  – UniProt or RefSeq databases
  – UniRef and non-redundant databases: databases of unique sequence entries
  – Genome, chromosome
• Structure databases
  – Database of sequences for which 3D structures are available in PDB
  – Used especially for finding template sequences for homology modelling
• Specialized databases
  – A local database can be created from the sequences that are relevant for your purpose


Other parameters

• Expect
  – Statistical significance threshold (E-value)
  – Default = 10, i.e. matches are reported as long as no more than 10 such matches would be expected by chance
• Filter
  – Masks regions of low complexity and short repeats
• Alignment options
  – Substitution matrix and gap penalties
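The Expect value can be related to an alignment's bit score through the Karlin-Altschul relation E = m · n · 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The slides do not state this formula; it is added here as background, and the sketch below ignores the effective-length corrections that BLAST applies in practice.

```python
def expected_hits(bit_score, query_length, database_length):
    """E-value from a bit score: E = m * n * 2 ** (-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def passes_expect_threshold(bit_score, query_length, database_length, expect=10.0):
    """True if the match would be reported under the given Expect threshold."""
    return expected_hits(bit_score, query_length, database_length) <= expect


# Hypothetical numbers: a 300-residue query against 10 million database residues.
for score in (20, 30, 50):
    print(score, expected_hits(score, 300, 10_000_000))
```

The relation also shows why the size of the database matters: the same bit score gives a proportionally larger E-value when n grows.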


Output

• There are three major sections in BLAST output
  – Header
    • Information about the query sequence and the database searched
    • Graphical overview of matches (only in web version)
  – Description
    • Description of the sequences (hits)
    • Scores: generated from the alignment; higher is better
    • E-value: number of hits expected by chance; lower is better
    • Sequence identifier in NCBI databases
  – Alignment
    • Pairwise alignment
    • Details of the alignment (score, E-value, similarity, etc.)


PSI BLAST

• Position-Specific Iterated BLAST
• More sensitive to distantly related sequences
• Algorithm
  – In the first iteration, standard BLAST is run
    • A PSSM (position-specific scoring matrix) is generated based on the significant alignments
  – In the next iteration, the new PSSM is used to score the alignments
    • A new PSSM is generated based on the significant alignments
  – The above step is repeated until a stop criterion is met. Stop criteria may be:
    • No new sequences are identified in two consecutive iterations
    • The desired number of iterations is reached
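To illustrate what a PSSM is, the sketch below builds one from a set of equal-length, ungapped aligned segments as log-odds scores per column. It is deliberately simplified and rests on assumptions that differ from PSI-BLAST itself: a uniform background frequency, a flat pseudocount, and no sequence weighting or gap handling.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """One dict per alignment column mapping residue -> log2 odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in range(len(aligned[0])):
        counts = Counter(seq[column] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def pssm_score(pssm, segment):
    """Score a segment of the same length as the PSSM, column by column."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


pssm = build_pssm(["MDL", "MDV", "MEL"])
print(pssm_score(pssm, "MDL"), pssm_score(pssm, "AAA"))
```

Unlike a general substitution matrix, the score for a residue depends on the column it falls in, which is what lets later iterations pick up more distant relatives.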


Sequence search: Challenges

• Self hits are uninteresting

• Size of target database
  – Use a database no bigger than required

• Paralogs have similar sequences but often have different function

• Low-complexity regions reduce the quality of alignments

• Short repeats give false hits

• Results for very short queries may be less reliable
  – Matches that are 50% identical over a length of 20-40 amino acids occur frequently by chance

• Distant homologs may have very low similarity
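One simple way to see what "low complexity" means is to measure the Shannon entropy of residue composition in a sliding window and mask windows that fall below a cutoff. The sketch below is only an illustration of that idea; the window size, the entropy cutoff, and the masking character are arbitrary assumptions, and it is not the filtering algorithm that BLAST uses.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(sequence, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue that lies in a low-entropy window with mask_char."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("MDLSALTRQKEVSSSSSSSSSSSSSSSSMDLSALTRQKEV"))
```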


Sequence search

• Exercise
  – BLAST
  – FASTA
