sequence search abhishek niroula department of experimental medical science lund university...
1
Sequence Search
Abhishek Niroula
Department of Experimental Medical Science
Lund University
2015-12-10
2
Sequence search
• Sequences
  – Nucleotide and amino acid sequences
  – Known sequences are stored in different databases
    • NCBI, Ensembl, and others
  – Number of organisms being sequenced is increasing
    • Ensembl, Ensembl Plants, Ensembl Fungi, Ensembl Bacteria, etc.
    • Genome 10K (genomic zoo) project
  – Use of sequences is expanding rapidly in biomedical research
    • 1000 Genomes Project, 100K Genomes Project
• Sequence search
  – Search for an appropriate sequence
  – Search for similar sequences in a database
3
Sequence search
• Identify a new sequence
• Functional and structural annotation of sequences
• Finding homologous sequences for
  – Genomic, phylogenetic, structural studies, etc.
• Haemophilus influenzae
  – The genome of Haemophilus influenzae was reported in 1995 (Fleischmann et al.)
  – 1,743 assumed coding regions were translated into amino acid sequences and searched for similarity in the Swiss-Prot database
  – 1,007 of them matched
    • The biochemical function could be deduced for each of them
• Multiple sclerosis (source: Martin Tompa)
  – Autoimmune disease in which the T-cells recognise the nerves' myelin sheaths as foreign bodies and attack them
  – Hypothesis: The nerves' myelin sheath proteins were similar to bacterial sheath proteins from an earlier infection
  – Methodology
    • Myelin sheath proteins were sequenced
    • A database was searched for similar bacterial and viral sequences
    • Lab tests checked whether the T-cells attacked the identified bacterial and viral proteins
4
Similarity vs homology
• Similarity
  – Similarity is the degree of likeness of two sequences
  – It is a quantitative measure
• Homology
  – Homology is an evolutionary relationship between two sequences
  – It cannot be measured
  – Distant and close homology refer to how far the sequences are from their common ancestor
• Correct: Two sequences are 80% similar.
• Incorrect: Two sequences are 80% homologous. (Homology is not a quantity; two sequences either are homologous or are not.)
5
Orthologs and paralogs
[Figure: example gene tree]
Source: http://www.ensembl.org/info/genome/compara/tree_example1.png
6
Sequence search: Problem
• Given
  – Query sequence
  – Database
• Goal
  – To find statistically significant similarity that can be used to infer homology
[Figure: Query → Search → Database → Result]
Sensitivity: Are all related sequences identified?
Specificity: Are all unrelated sequences rejected?
Example: the database holds sequences 1, 2, 3, 4, 5, 6, 7 and the search returns 1, 2, 4, 7
  TP: 1, 2, 4
  FP: 7
  FN: 3, 5
  TN: 6
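The two questions on this slide correspond to the usual ratios sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). A minimal sketch using the slide's TP/FP/FN/TN example follows; the function and variable names are illustrative and not taken from any tool.

```python
def sensitivity_specificity(related, retrieved, database):
    """Return (sensitivity, specificity) for one search result."""
    related, retrieved, database = set(related), set(retrieved), set(database)
    tp = related & retrieved             # related and found
    fp = retrieved - related             # found but unrelated
    fn = related - retrieved             # related but missed
    tn = database - related - retrieved  # unrelated and correctly rejected
    sensitivity = len(tp) / (len(tp) + len(fn))
    specificity = len(tn) / (len(tn) + len(fp))
    return sensitivity, specificity


database = range(1, 8)       # sequences 1..7
related = {1, 2, 3, 4, 5}    # TP (1, 2, 4) plus FN (3, 5)
hits = {1, 2, 4, 7}          # what the search returned
sens, spec = sensitivity_specificity(related, hits, database)
print(sens, spec)  # 0.6 0.5
```

In this example the search finds 3 of the 5 related sequences and correctly rejects 1 of the 2 unrelated ones.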
7
Heuristic database searching
• Sequence search: problem
  – Exact similarity computation between a query sequence and a database using dynamic programming is computationally intensive
  – With available technology, exactly aligning a query sequence against an entire database is not feasible
• Solution
  – Heuristic methods: fast scanning for similar sequences
  – Sequences similar to the query are first picked out of the database with heuristic methods; exact alignment scores are computed only for those
  – Tools
    • BLAST
    • FASTA
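To make the cost of the exact computation concrete, here is a minimal sketch of local alignment scoring by dynamic programming (Smith-Waterman style). The match, mismatch and gap values are assumptions chosen for illustration; a real search would use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)   # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("MDLSALT", "MDLALT"))
```

Every cell of a len(a) x len(b) matrix is filled, so the work grows with the product of the two lengths. Repeating this for every database sequence is what the heuristic methods try to avoid.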
8
BLAST
• Basic Local Alignment Search Tool
• Developed by Altschul et al., 1990
• Determines the local alignment between a query and a database
• BLAST consists of two steps:
  – Searching for matches
  – Computing the statistical significance of the matches
9
BLAST
• Given a query sequence, split it into words of k residues
  – k = 3 for amino acid sequences
  – k = 11-12 for nucleotide sequences
• Generate all other possible words of k residues
• Score each of the words against the query words using a substitution matrix
  – Words with scores higher than a threshold are considered in the next step
Query sequence: M D L S A L T R Q
k-mers: MDL DLS LSA SAL ALT ---
Similar words for MDL: MDV MDM MRL MQL ---
Similar words for DLS: DLR QLS DVS DKS ---
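The word-generation step can be sketched as follows. This is only an illustration: `toy_score` (+5 for identity, -1 otherwise) stands in for a real substitution matrix such as BLOSUM62, and the threshold value is likewise an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_into_words(seq, k=3):
    """All overlapping words of k residues in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_score(a, b, match=5, mismatch=-1):
    """Assumed toy scoring, NOT a real substitution matrix."""
    return match if a == b else mismatch


def neighbourhood(word, threshold, score=toy_score):
    """Words of the same length that score higher than threshold against word."""
    neighbours = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(x, y) for x, y in zip(word, candidate))
        if total > threshold:
            neighbours.append("".join(candidate))
    return neighbours


words = split_into_words("MDLSALTRQ")
print(words)
print(len(neighbourhood("MDL", threshold=8)))
```

With this toy scoring and threshold, only the word itself and words differing at a single position pass. A real matrix would instead favour conservative substitutions.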
10
BLAST
• Match each of the high-scoring words against the database sequences
• The matches are extended in both directions to form ungapped local alignments, called high-scoring pairs (HSPs)
• HSPs with a score greater than the cutoff threshold are kept
• The significance of each ungapped HSP is calculated
[Figure: high-scoring words are matched in the database; ungapped extension of a match gives an HSP]
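A minimal sketch of ungapped extension of a single word hit is shown below. The slide does not say when extension stops; the sketch assumes an "X-drop" rule (stop once the running score falls more than `x_drop` below the best score seen) and a toy identity scoring, both of which are illustrative choices.

```python
def extend_one_side(query, subject, q, s, step, score, x_drop):
    """Extend from (q, s) in direction step (+1 or -1).

    Returns (best score gain, number of residues added to reach it).
    """
    run = best = best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        run += score(query[q], subject[s])
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > x_drop:
            break  # score dropped too far below the best: give up
        q += step
        s += step
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, k, match=5, mismatch=-1, x_drop=10):
    """Ungapped extension of a word hit of length k at (q_pos, s_pos)."""
    def score(a, b):
        return match if a == b else mismatch

    seed = sum(score(query[q_pos + i], subject[s_pos + i]) for i in range(k))
    right, r_len = extend_one_side(query, subject, q_pos + k, s_pos + k, 1, score, x_drop)
    left, l_len = extend_one_side(query, subject, q_pos - 1, s_pos - 1, -1, score, x_drop)
    return {
        "score": seed + left + right,
        "q_start": q_pos - l_len, "q_end": q_pos + k + r_len,  # end is exclusive
        "s_start": s_pos - l_len, "s_end": s_pos + k + r_len,
    }


hsp = extend_hit("MDLSALTRQ", "AAMDLSALTKK", q_pos=2, s_pos=4, k=3)
print(hsp)
```

Here the word LSA is extended to the ungapped segment MDLSALT in both sequences. The trailing mismatches are left out because they lower the score.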
11
Gapped BLAST
• Altschul et al., 1997
• Extension of matches requires two non-overlapping matches on the same diagonal within a distance "A"
• Fewer extensions make the search faster
• Gapped alignments are performed around the hits that score higher than a pre-defined score
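The two-hit requirement can be sketched as follows. A word hit is represented as a (query position, database position) pair, and its diagonal is the difference between the two positions. The function name and the example `window` value (playing the role of the distance "A") are hypothetical.

```python
from collections import defaultdict


def two_hit_seeds(hits, k, window):
    """Pairs of non-overlapping word hits on the same diagonal within window.

    hits: iterable of (q_pos, s_pos) word matches of length k.
    Returns a list of (diagonal, first_q_pos, second_q_pos).
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        previous = positions[0]
        for current in positions[1:]:
            distance = current - previous
            if distance < k:
                continue  # overlaps the previous hit: ignore it
            if distance <= window:
                seeds.append((diagonal, previous, current))
            previous = current
    return seeds


hits = [(0, 10), (1, 11), (5, 15), (30, 40), (3, 50)]
print(two_hit_seeds(hits, k=3, window=10))  # [(10, 0, 5)]
```

Only the pair of hits that triggers a seed would be extended; isolated hits, such as the one on diagonal 47 here, are never extended.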
12
FASTA
• FAST-All (extension of FASTP and FASTN)
• Developed by Lipman and Pearson, 1985
• FASTA also builds a local alignment between query and database
• FASTA has four steps:
  – Hashing
  – Scoring (of the selected regions)
  – Scoring (of the combined regions)
  – Alignment
13
FASTA
• Hashing
  – The query sequence is split into words of size k
  – Exact word matches are identified in the database
  – Regions populated with matches are identified and the 10 best regions are selected
• Scoring (of the selected regions)
  – Within the selected regions, the optimal local alignment is computed using a substitution matrix
• Scoring (of the combined regions)
  – Alignments are combined to obtain a single larger alignment
  – Gaps are allowed in the alignment
• Alignment
  – The alignment is optimized using Smith-Waterman dynamic programming
• Statistical significance for each alignment is computed
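The hashing step can be illustrated with a word lookup table and a count of exact word matches per diagonal. A diagonal with many matches corresponds to a region "populated with matches". This is a sketch under assumed names and an assumed word size, not the FASTA implementation.

```python
from collections import Counter, defaultdict


def build_word_index(seq, k):
    """Map each word of length k to the positions where it occurs in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_counts(query, subject, k=2):
    """Count exact word matches on each diagonal (subject pos - query pos)."""
    index = build_word_index(query, k)
    counts = Counter()
    for s_pos in range(len(subject) - k + 1):
        for q_pos in index.get(subject[s_pos:s_pos + k], ()):
            counts[s_pos - q_pos] += 1
    return counts


counts = diagonal_counts("MDLSALTRQ", "AAMDLSALTKK")
print(counts.most_common(10))  # the best-populated diagonals first
```

In this example all six shared two-residue words lie on diagonal 2, so that diagonal would be passed on to the scoring steps.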
14
Variants of BLAST and FASTA
Query       Database    Program                  Comment
Protein     Protein     blastp, fasta
Nucleotide  Nucleotide  blastn, fasta
Nucleotide  Protein     blastx, fastx, fasty     Translate query to a protein
Protein     Nucleotide  tblastn, tfastx, tfasty  Translate database
Nucleotide  Nucleotide  tblastx                  Translate both query and database
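The "translate" comments in the table refer to conceptual translation of nucleotide sequences into protein. The sketch below assumes the standard genetic code and translation in all six reading frames (three per strand). The helper names are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate complete codons of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translation(dna):
    """Translations of frames +1, +2, +3, then -1, -2, -3."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


print(six_frame_translation("ATGGATCTG"))
```

Each of the resulting protein sequences can then be compared against a protein query or database, which is why translated searches cost more than blastp or blastn.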
15
Using BLAST and FASTA
• Web application
  – BLAST
    • http://blast.ncbi.nlm.nih.gov/Blast.cgi
  – FASTA
    • http://www.ebi.ac.uk/Tools/sss/fasta/
• Standalone
  – Local installation
  – Database should also be downloaded
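For the standalone route, a local protein search with the NCBI BLAST+ command-line tools might look roughly like the sketch below. It assumes BLAST+ is installed and that `makeblastdb` and `blastp` are on the PATH; the file names and the database name are hypothetical.

```python
import shutil
import subprocess


def blastp_commands(db_fasta, query_fasta, db_name="local_db",
                    evalue=10.0, out="hits.tsv"):
    """Build the two command lines: format the database, then search it."""
    make_db = ["makeblastdb", "-in", db_fasta, "-dbtype", "prot", "-out", db_name]
    search = ["blastp", "-query", query_fasta, "-db", db_name,
              "-evalue", str(evalue), "-outfmt", "6", "-out", out]
    return make_db, search


def run_local_blastp(db_fasta, query_fasta, **options):
    """Run both commands; raises if BLAST+ is not installed."""
    if shutil.which("makeblastdb") is None or shutil.which("blastp") is None:
        raise RuntimeError("BLAST+ executables not found on PATH")
    for command in blastp_commands(db_fasta, query_fasta, **options):
        subprocess.run(command, check=True)


for command in blastp_commands("proteins.fasta", "query.fasta"):
    print(" ".join(command))
```

The `-outfmt 6` option requests tab-separated output, which is convenient for further processing.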
16
BLASTP
17
FASTA
18
Input formats
• FASTA format files
  – Widely used in bioinformatics
• Other file formats
  – GCG, EMBL, GenBank, PIR, UniProtKB/Swiss-Prot, PHYLIP
• Identifiers (supported in BLAST)
  – Accession
  – Gene identifier
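In the FASTA format, each record starts with a header line beginning with ">" and is followed by one or more lines of sequence. A minimal reader could look like this; the example records are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 hypothetical example protein
MDLSALTRQ
MDLSAL
>seq2
ACGT
"""
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```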
19
Database
• Generic databases
  – UniProt or RefSeq databases
  – UniRef and the non-redundant database: databases of unique sequence entries
  – Genome, chromosome
• Structure databases
  – Database of sequences for which 3D structures are available in PDB
  – Used especially for finding template sequences for homology modelling
• Specialized databases
  – A local database can be created from the sequences that are relevant for your purpose
20
Other parameters
• Expect
  – Statistical significance parameter
  – Default = 10, i.e. 10 matches are expected by chance
• Filter
  – Mask regions of low complexity and short repeats
• Alignment options
  – Substitution table and gap function
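The Expect threshold is compared with the E-value of each match. As a rough illustration, the E-value can be obtained from a normalised (bit) score S' as E = m * n * 2^(-S'), where m and n are the lengths of the query and of the database. The sketch ignores the effective-length corrections that BLAST applies, and its names are illustrative.

```python
def expect_value(bit_score, query_length, database_length):
    """Number of matches with at least this bit score expected by chance."""
    return query_length * database_length * 2.0 ** (-bit_score)


def passes_expect_threshold(bit_score, query_length, database_length, expect=10.0):
    """True if the match would be reported at the given Expect threshold."""
    return expect_value(bit_score, query_length, database_length) <= expect


print(expect_value(10, 100, 1000))
print(passes_expect_threshold(40, 300, 10**8))
```

Because the E-value scales with the database length, the same alignment becomes less significant in a larger database.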
21
Output
• There are three major sections in BLAST output
  – Header
    • Information about the query sequence and the database searched
    • Graphical overview of matches (only in web version)
  – Description
    • Description of the sequences (hits)
    • Scores: generated from the alignment; higher is better
    • E-value: number of hits expected by chance; lower is better
    • Sequence identifier in NCBI databases
  – Alignment
    • Pairwise alignment
    • Details of the alignment (score, E-value, similarity, etc.)
22
PSI-BLAST
• Position-Specific Iterated BLAST
• More sensitive to distantly related sequences
• Algorithm
  – In the first iteration, standard BLAST is run
    • A PSSM (position-specific scoring matrix) is generated based on the significant alignments
  – In the next iteration, the new PSSM is used to score the alignments
    • A new PSSM is generated based on the significant alignments
  – The above step is repeated until a stop criterion is met. Stop criteria may be:
    • No new sequences are identified in two consecutive iterations
    • The desired number of iterations is reached
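To give an idea of what a PSSM is, the sketch below builds log-odds scores from a small set of aligned, ungapped, equal-length segments. It is a strong simplification of what PSI-BLAST does: the uniform background frequencies, the flat pseudocount and the absence of sequence weighting are all assumptions made for brevity.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)          # assumed uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment of the same length as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


pssm = build_pssm(["MDL", "MDV", "MDM", "MRL"])
print(score_with_pssm(pssm, "MDL"), score_with_pssm(pssm, "AAA"))
```

Unlike a general substitution matrix, the score of a residue here depends on its position, which is what makes the later iterations more sensitive to distant relatives.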
23
Sequence search: Challenges
• Self hits are uninteresting
• Size of target database
  – Use a database no bigger than required
• Paralogs have similar sequences but often have different functions
• Low-complexity regions reduce the quality of alignments
• Short repeats give false hits
• Results for very short queries may be less reliable
  – Matches that are 50% identical over a length of 20-40 amino acids occur frequently by chance
• Distant homologs may have very low similarity
24
Sequence search
• Exercise
  – BLAST
  – FASTA