NCBI
Review Concepts 20040715
Chuong Huynh
Pairwise Sequence Alignments
• Purpose:
  • identification of sequences with significant similarity to (a) sequence(s) in a sequence repository
  • identification of all homologous sequences in the repository
  • identification of domains with sequence similarity
• Terminology
  • Global alignment
  • Local alignment
Terminology: Global Alignment
• Finds the optimal alignment over the entire length of the two compared sequences
• Unlikely to detect genes that have evolved by recombination (e.g. domain shuffling) or insertion/deletion of DNA
• Suitable for sequences of homologous molecules
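To make "optimal alignment over the entire length" concrete, here is a minimal sketch of the Needleman-Wunsch dynamic-programming score for a global alignment. The function name and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, not values taken from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: every residue of a and b must be aligned or gapped."""
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # A two-domain sequence against a single-domain sequence.
    print(global_alignment_score("AAAAAABBBBBB", "BBBBBB"))
```

Because the unmatched half of the longer sequence must be paid for with gaps, a shared domain alone does not yield a good global score, which is the weakness the second bullet describes.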
Terminology: Local Alignment
• Finds short regions of similarity between a pair of sequences
• Compared sequences can receive high local similarity scores without needing high similarity over their entire length
• Useful when looking for domains within proteins or for regions of genomic DNA that contain coding exons
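A minimal sketch of local alignment scoring, using the Smith-Waterman recurrence, may help here. The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration. The key difference from a global score is that a cell can never drop below zero, so the best-scoring region is free to start and end anywhere.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # The shared "domain" scores fully even though the sequences differ in length.
    print(local_alignment_score("AAAAAABBBBBB", "BBBBBB"))
```

Here the shared six-residue region receives its full score regardless of what flanks it.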
An alignment that BLAST can’t find
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
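One plausible reading of this slide is that the two sequences are clearly related overall yet never share a long enough run of identical bases to give a word-based search a seed. The sketch below measures both quantities for the ungapped alignment shown. The function name is hypothetical, and the word size of 11 used for comparison is an assumption about the classic blastn default, not something stated on the slide.

```python
SEQ1 = (
    "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
    "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
    "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC"
)
SEQ2 = (
    "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
    "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
    "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC"
)


def identity_and_longest_run(a, b):
    """Return (fraction of identical positions, longest run of identical positions)."""
    if len(a) != len(b):
        raise ValueError("ungapped comparison needs sequences of equal length")
    matches = longest = run = 0
    for x, y in zip(a, b):
        if x == y:
            matches += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return matches / len(a), longest


if __name__ == "__main__":
    ASSUMED_WORD_SIZE = 11  # assumption: classic blastn default word size
    identity, longest = identity_and_longest_run(SEQ1, SEQ2)
    print(f"identity: {identity:.1%}, longest exact run: {longest}")
    print("seed possible:", longest >= ASSUMED_WORD_SIZE)
```

If the longest exact run is shorter than the word size, no initial word hit exists for the search to extend, however similar the sequences are overall.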
BLAST Selection Matrix
Choosing The Right BLAST Flavor for Proteins
What You Want to Do | The Right BLAST Flavor
Find out something about the function of the protein | Use blastp to compare your protein with other proteins contained in the databases.
Discover new genes encoding similar proteins | Use tblastn to compare your protein with DNA sequences translated into their 6 possible reading frames.
Claverie & Notredame 2003
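The "6 possible reading frames" can be illustrated with a short sketch: three offsets on the given strand and three on its reverse complement. This assumes the standard genetic code and uses hypothetical function names; it shows the idea only and is not how tblastn is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Reverse-complement an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Stop codons are shown as `*`, so a frame with long stretches free of `*` is a candidate coding frame.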
Choosing the Right BLAST Flavor for DNA
Question | Answer
Am I interested in non-coding DNA? | Yes: use blastn. Note: blastn is only for closely related DNA sequences (more than 70% identical).
Do I want to discover new proteins? | Yes: use tblastx.
Do I want to discover proteins encoded in my query DNA sequence? | Yes: use blastx.
Am I unsure of the quality of my DNA? | Yes: use blastx, especially if you suspect your DNA sequence codes for a protein but may contain sequencing errors.
Claverie & Notredame 2003
Choosing The Right BLAST Flavor for DNA Sequences
Usage | Query | Database | Program
Find very similar DNA sequence | DNA | DNA | blastn
Protein discovery and ESTs | Translated DNA | Translated DNA | tblastx
Analysis of query DNA sequence | Translated DNA | Protein | blastx
Claverie & Notredame 2003
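The selection tables for protein and DNA queries can be summarized as a small lookup. This is a sketch only: the function name and its parameters are hypothetical, and the mapping simply restates the query/database combinations listed in the slides.

```python
def choose_blast_program(query, database, compare_as_protein=False):
    """Pick a BLAST program from the query type and database type.

    query and database are "protein" or "dna". compare_as_protein only matters
    for DNA against DNA: True means both sides are translated (tblastx).
    """
    if query == "protein" and database == "protein":
        return "blastp"
    if query == "protein" and database == "dna":
        return "tblastn"  # database is translated in six frames
    if query == "dna" and database == "protein":
        return "blastx"   # query is translated in six frames
    if query == "dna" and database == "dna":
        return "tblastx" if compare_as_protein else "blastn"
    raise ValueError(f"unsupported combination: {query!r} vs {database!r}")


if __name__ == "__main__":
    print(choose_blast_program("dna", "protein"))
    print(choose_blast_program("dna", "dna", compare_as_protein=True))
```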
BLAST Tips
• It is faster and more accurate to BLAST proteins (blastp) rather than nucleotides.
• If in doubt, use blastp.
• When possible, restrict to the subset of the database you are interested in.
• Look around for the database you need, or create your own custom BLAST database. But how?
• When is the best time to use the BLAST server?
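On the "But how?" question about custom databases: one possible answer, assuming a local installation of NCBI BLAST+ (which is not mentioned in the slides), is the `makeblastdb` command. The sketch below only builds the command line; the helper name and the file names are hypothetical.

```python
def makeblastdb_command(fasta_path, db_name, is_protein=True):
    """Build a makeblastdb command line for a FASTA file (assumes BLAST+ is installed)."""
    db_type = "prot" if is_protein else "nucl"
    return ["makeblastdb", "-in", fasta_path, "-dbtype", db_type, "-out", db_name]


if __name__ == "__main__":
    # Hypothetical file names; print the command instead of running it.
    print(" ".join(makeblastdb_command("my_proteins.fasta", "my_proteins_db")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is available.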
Asking Biological Problems with BLAST
Finding genes in a genome
• General (but more complicated) computational method: Run gene prediction software or an ORF finder (for bacteria).
• Using BLAST: Cut your genome sequence into small (2-5 kb) overlapping pieces. Use blastx to BLAST each piece against NR (nonredundant protein database). Works better for sequences with no introns (bacteria).
Predicting protein function
• General (but more complicated) computational method: Domain analysis or wet-lab experimentation.
• Using BLAST: Use blastp to BLAST your protein sequence against SWISS-PROT (future = UniProt). If you get a good hit (more than 25% identity) over the complete length of the protein, then your protein is likely to have the same function as the SWISS-PROT protein.
Predicting protein 3-D structure
• General (but more complicated) computational method: Homology modeling, or X-ray or NMR analysis of the protein of interest.
• Using BLAST: Use blastp to BLAST your protein against PDB (protein structure database). If you get a hit with more than 25% identity, then your protein and the good hit(s) are likely to have a similar 3-D structure.
Finding protein family members
• General (but more complicated) computational method: Clone new family members using PCR techniques.
• Using BLAST: Use blastp (or better, PSI-BLAST) and run it against NR (nonredundant protein database). Once you have all members of the family, you can make a multiple sequence alignment and a phylogenetic tree.
Claverie & Notredame 2003
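The "cut your genome into overlapping pieces" step for gene finding can be sketched as follows. The 5000-base chunk size falls within the 2-5 kb range given in the table; the 500-base overlap and the function name are assumptions made for illustration.

```python
def overlapping_chunks(sequence, chunk_size=5000, overlap=500):
    """Split a sequence into (start, piece) tuples; neighbors share `overlap` bases."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, sequence[start:start + chunk_size]))
        if start + chunk_size >= len(sequence):
            break  # this piece reached the end of the sequence
        start += step
    return chunks


if __name__ == "__main__":
    genome = "ACGT" * 3000  # toy 12 kb sequence
    for start, piece in overlapping_chunks(genome):
        print(start, len(piece))
```

Keeping the start offset with each piece makes it possible to map a blastx hit in a piece back to genome coordinates.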
BLAST and PSI-BLAST Servers on the Internet
Country | Program | URL
USA | BLAST / PSI-BLAST | http://www.ncbi.nlm.nih.gov/BLAST
USA | BLAST | http://genome.wustl.edu/gsc/BLAST
Europe | BLAST | http://www.ch.embnet.org/software/bBLAST.html
Europe | BLAST | http://www.ebi.ac.uk/blast2/
Japan | BLAST / PSI-BLAST | http://www.ddbj.nig.ac.jp/E-mail/homology.html
Common Mistake
• Seq1 has domains A and B; Seq2 has domain A; Seq3 has domain B
• Use Seq1 as the query sequence
• What happens? Both hits may have very significant (very low) E-values if domains A and B are long and well conserved.
• Seq1 is homologous to Seq2 and Seq3, but remember that Seq1 is not homologous to either over its entire length
• Just don’t depend on the E-value
• “BLAST hits are not transitive, unless the alignments are overlapping”
• Most proteins have more than one domain, so be careful when looking at BLAST results: not all reported hits belong to the same big family.
Sequence 1: AAAAAABBBBBB
Sequence 2: AAAAAA
Sequence 3: BBBBBB
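The overlap condition in the quoted rule can be checked mechanically. The sketch below assumes each hit is summarized by the 1-based, inclusive query coordinates of its alignment; the function name and the `min_overlap` parameter are hypothetical.

```python
def hits_overlap_on_query(hit_a, hit_b, min_overlap=1):
    """True if two hits cover a shared stretch of the query.

    Each hit is (query_start, query_end), 1-based and inclusive.
    """
    shared_start = max(hit_a[0], hit_b[0])
    shared_end = min(hit_a[1], hit_b[1])
    return shared_end - shared_start + 1 >= min_overlap


if __name__ == "__main__":
    # Seq1 = AAAAAABBBBBB: Seq2 aligns to positions 1-6, Seq3 to positions 7-12.
    seq2_hit = (1, 6)
    seq3_hit = (7, 12)
    print(hits_overlap_on_query(seq2_hit, seq3_hit))
```

When the two alignments do not overlap on the query, as in the Seq2 and Seq3 example, there is no basis for inferring that the two hits are related to each other.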
Alternative Methods for Homology Searches
• Smith-Waterman (ssearch): slower but more accurate
• FASTA: slower than BLAST, but more accurate when making DNA comparisons
• BLAT: for locating cDNA in a genome or finding close proteins in a genome
Common Questions
• When I run a BLAST job with WU-BLAST and with NCBI BLAST using the same query sequence, I get different results. Both are based on the same algorithm but are different implementations. So why the difference?
Usually this is due to slight variation in the database version, though differences in BLAST program version also play a minor role. The results usually do not change dramatically, but they do change a bit.
Basic Gene Prediction Flow Chart
Obtain new genomic DNA sequence
1. Translate in all six reading frames and compare to protein sequence databases
2. Perform a database similarity search of the expressed sequence tag (EST) database of the same organism, or of cDNA sequences if available
Use gene prediction program to locate genes
Analyze regulatory sequences in the gene
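As a rough illustration of the "locate genes" step, here is a naive open-reading-frame scan over all six frames. It is far simpler than a real gene prediction program (no introns, no statistical model), and the function names and the `min_codons` threshold are assumptions made for this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Reverse-complement an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=30):
    """Return (strand, start, end) for ATG...stop ORFs in all six frames.

    Coordinates are 0-based, end-exclusive, on the strand that was scanned.
    min_codons counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCC" * 5 + "TAA" + "G"
    print(find_orfs(toy, min_codons=3))
```

This approach is mainly useful for genomes without introns, such as bacterial sequences.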
The Annotation Process
Figure: DNA sequence, analysis software, useful information, annotator
Annotation Process
Figure: DNA sequence analyzed with RepeatMasker, Blastn, Halfwise, Blastx, gene finders, tRNA scan
Features: repeats, promoters, pseudogenes, rRNA, genes, tRNA
Protein analysis: Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM
How do I do large scale genome analysis?
• Read Koonin’s book on NCBI Bookshelf
TaxPlot is a tool for three-way comparisons of genomes on the basis of the protein sequences they encode.
Demo TaxPlot
http://www.ncbi.nlm.nih.gov/sutils/taxik2.cgi
Demo - VecScreen
http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html