Analysis of single sequences
Toolboxes
• EMBOSS: many portals.
• Biology Workbench
• ExPASy proteomics tools
• Biotools @ U. Mass. Med. School.
• Many, many more…
Before we start - VecScreen
• When you get a DNA sequence from the sequencer, make sure it is really the sequence you think it is.
• If you don’t, you may spend a lot of time analysing the wrong sequence!
• Possible problems: contamination! Work clean.
• Always: Vector contamination.
Vector contamination
• Failure to recognize foreign segments in a sequence can:
  – Lead to erroneous conclusions about the biological significance of the sequence
  – Waste time and effort in analysis of contaminated sequence
  – Delay the release of the sequence in a public database
  – Pollute public databases with contaminated sequence
Reminder: Cloning procedure
• The DNA of interest is cloned into a vector.
• The resultant DNA may (probably does) contain sections from the vector.
VecScreen
• VecScreen is a system for quickly identifying segments of a nucleic acid sequence that may be of vector origin.
• NCBI developed VecScreen to combat the problem of vector contamination in public sequence databases.
VecScreen
EMBOSS
• European Molecular Biology Open Software Suite.
• Built for use from the command line.
• Many EMBOSS portals, servers and mirrors are available.
• Each program has its own help file.
• One server: http://emboss.bioinformatics.nl/
• Examples of a few EMBOSS programs:
Briefly – What is PCR
• The polymerase chain reaction (PCR) is a technique to amplify a single copy of a piece of DNA.
Briefly – What is PCR
• The number of copies of the target DNA increases exponentially.
• After 35 cycles: 2^36 ≈ 68 billion copies.
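The exponential growth can be checked with a few lines of Python. This is a minimal sketch that assumes ideal amplification (every molecule is copied in every cycle); the function name `pcr_copies` is made up for illustration. Note that 2^36 corresponds to 36 doublings of a single starting molecule, while 35 ideal doublings would give 2^35 (about 34 billion), so the figure quoted above presumably counts one extra doubling. How the cycles are counted is not stated here.

```python
def pcr_copies(doublings: int, templates: int = 1) -> int:
    """Number of copies after a given number of ideal doublings.

    Assumes 100% efficiency, which real PCR reactions do not reach.
    """
    if doublings < 0:
        raise ValueError("doublings must be non-negative")
    return templates * 2 ** doublings


# Print a few values to show how quickly the copy number grows.
for n in (1, 10, 20, 35, 36):
    print(f"{n:2d} doublings: {pcr_copies(n):,} copies")
```

In practice the yield is lower, because efficiency drops in later cycles as reagents are used up.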
Primer design
• Primer Length: the optimal length is 18-22 bp.
• Primer Melting Temperature (Tm): the temperature at which one half of the DNA duplex will dissociate. A Tm of 52-58 °C produces the best results.
• GC Content
• Primer Secondary Structures
• Repeats
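The length and Tm criteria above can be checked quickly in Python. This is a rough sketch, not what primer3 does: it estimates Tm with the simple Wallace rule (2 °C per A/T plus 4 °C per G/C), which is only an approximation for short oligos, and it does not look at secondary structures or repeats. The function names and the example primer are made up for illustration.

```python
def gc_content(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    p = primer.upper()
    return 100.0 * (p.count("G") + p.count("C")) / len(p)


def wallace_tm(primer: str) -> int:
    """Approximate Tm in °C using the Wallace rule: 2*(A+T) + 4*(G+C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


def check_primer(primer: str) -> dict:
    """Report length, GC content and Tm, and test them against the
    ranges given on the slide (18-22 bp, Tm 52-58 °C)."""
    tm = wallace_tm(primer)
    return {
        "length": len(primer),
        "length_ok": 18 <= len(primer) <= 22,
        "gc_percent": gc_content(primer),
        "tm_c": tm,
        "tm_ok": 52 <= tm <= 58,
    }


# Hypothetical 20-mer used only as an example.
print(check_primer("GTCAGATACTTGAAGCATCA"))
```

A dedicated tool such as primer3 uses more accurate thermodynamic models, so treat these numbers as a first sanity check only.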
Primer design
• Avoid template secondary structure.
• Avoid cross homology:
  – Commonly, primers are BLASTed to test the specificity.
primer3
• Is a program from the Whitehead Institute, written by Steve Rozen and Helen J. Skaletsky, for finding primers and oligonucleotide probes.
• One interface to 'primer3' is eprimer3, an EMBOSS program.
• Primer3Plus is a nicer interface to primer3, from Biotools (U. Mass. Med. School.). We will use it.
Primer3Plus
Gene prediction
• Identifying stretches of sequence, usually genomic DNA, that are biologically functional.
• This especially includes protein-coding genes
• Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
• Annotation
Gene prediction
• Prokaryotes:
  – The sequence coding for a protein occurs as one contiguous open reading frame (ORF), typically many hundreds or thousands of bp.
• Eukaryotes:
  – CpG islands and binding sites for a poly(A) tail.
  – Difficult to use ORF detection because of splicing.
Gene prediction
• Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects.
• Genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools are rarely suitable for efficacious gene hunting in DNA sequences of a new genome.
• Methods based on cDNA and expressed sequence tag (EST) mapping to genomic DNA or those using alignments to closely related genomes rely either on existence of abundant cDNA and EST data and/or availability on reference genomes.
(Alexandre Lomsadze et al., 2005)
SixPack
• “Display a DNA sequence with 6-frame translation and ORFs”
• Set “Minimum size of ORFs” to 300 to obtain only meaningful ORFs (proteins are usually longer than 100 aa).
• Set “ORF start with an M?” to “Yes” to obtain only ORFs that begin with a Methionine.
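To make the two settings concrete, here is a small Python sketch of six-frame ORF finding. It is not the sixpack implementation: it assumes the standard genetic code, treats the minimum size as a length in nucleotides (300 nt, i.e. 100 codons), and counts any stretch between stop codons, trimmed to its first methionine, as an ORF. All names are made up for illustration.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_orfs(dna: str, min_nt: int = 300, start_with_m: bool = True):
    """Return (frame, protein) pairs for ORFs in all six reading frames."""
    dna = dna.upper()
    min_aa = min_nt // 3
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(seq[offset:])
            # Stop codons ('*') separate candidate ORFs.
            for segment in protein.split("*"):
                if start_with_m:
                    m = segment.find("M")
                    if m == -1:
                        continue
                    segment = segment[m:]  # start at the first methionine
                if len(segment) >= min_aa:
                    orfs.append((f"{strand}{offset + 1}", segment))
    return orfs


# Tiny made-up example; a low threshold is used so the short ORF is reported.
print(find_orfs("ATGGCTGCTGCTGCTGCTTAA", min_nt=18))
```

With the default threshold of 300 nt the toy sequence above yields no ORFs, which is the point of the setting: short chance ORFs are filtered out.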
SixPack
• The 1st section in the results page lists all the ORFs discovered.
>NM_118742.2_1_ORF1 Translation of NM_118742.2 in frame 1, ORF 1, threshold 500, 42aa
GIKRLLEGQFCYRAFTWPVEITSMQTTVRDFEEDSYLSLLVS
>NM_118742.2_1_ORF2 Translation of NM_118742.2 in frame 1, ORF 2, threshold 500, 909aa
MDFISSLIVGCAQVLCESMNMAERRGHKTDLRQAITDLETAIGDLKAIRDDLTLRIQQDG
LEGRSCSNRAREWLSAVQVTETKTALLLVRFRRREQRTRMRRRYLSCFGCADYKLCKKVS
AILKSIGELRERSEAIKTDGGSIQVTCREIPIKSVVGNTTMMEQVLEFLSEEEERGIIGV
YGPGGVGKTTLMQSINNELITKGHQYDVLIWVQMSREFGECTIQQAVGARLGLSWDEKET
GENRALKIYRALRQKRFLLLLDDVWEEIDLEKTGVPRPDRENKCKVMFTTRSIALCNNMG
SixPack
• The 2nd section shows a map of where the ORFs are in the actual sequence.
EMBOSS / plotorf
• Plots the ORFs found by sixpack:
ORF finder
ORF finder
ORF finder
Problems with ORF finding
• ORF finding can detect only 85% of genes.
• Short proteins
• More than 1 long ORF.
• Alternative start codon (not always the one furthest from the stop codon).
Possible solutions
• Searching the databases for similar proteins. Existence of such a protein will indicate this is a true gene.
• Gene prediction tools:
  – GeneMark: http://opal.biology.gatech.edu/GeneMark/
  – Many more (e.g. see CBCB website)
GeneMark
• The suggested method of parallelization of gene prediction with the model parameters estimation follows the path of the iterative Viterbi training.
• Tests on well-studied eukaryotic genomes have shown that the new method performs comparably or better than conventional methods…
• Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification.
(Alexandre Lomsadze et al., 2005)
GeneMark
• Sample output: Exon prediction for PIP1B (remember the Gene entry?)
Gene  Exon  Strand  Exon       Exon Range     Exon    Start/End
  #     #           Type                      Length  Frame
  1     1     +     Initial     156    483     328    1 1 - -
  1     2     +     Internal    709   1004     296    2 3 - -
  1     3     +     Internal   1085   1225     141    1 3 - -
  1     4     +     Terminal   1314   1409      96    1 3 - -
# protein sequence of predicted genes
>gene_1|GeneMark.hmm|286_aa
MEGKEEDVRVGANKFPERQPIGTSAQSDKDYKEPPPAPLFEPGELASWSFWRAGIAEFIATFLFLYITVLTVMGVKRSPNMCASVGIQGIAWAFGGMIFALVYCTAGISGGHINPAVTFGLFLARKLSLTRAVYYIVMQCLGAICGAGVVKGFQPKQYQALGGGANTIAHGYTKGSGLGAEIIGTFVLVYTVFSATDAKRNARDSHVPILAPLPIGFAVFLVHLATIPITGTGINPARSLGAAIIFNKDNAWDDHWVFWVGPFIGAALAALYHVIVIRAIPFKSRS
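The exon table and the protein length in the sample output can be cross-checked with a little arithmetic. The sketch below uses the exon coordinates from the output above and assumes that the terminal exon range includes the stop codon (an assumption, not something stated in the output); variable names are made up for illustration.

```python
# (exon type, start, end) copied from the sample output; coordinates are
# 1-based and inclusive.
exons = [
    ("Initial", 156, 483),
    ("Internal", 709, 1004),
    ("Internal", 1085, 1225),
    ("Terminal", 1314, 1409),
]

# Length of each exon: end - start + 1.
exon_lengths = [end - start + 1 for _, start, end in exons]

# Introns lie between consecutive exons.
intron_lengths = [
    exons[i + 1][1] - exons[i][2] - 1 for i in range(len(exons) - 1)
]

cds_nt = sum(exon_lengths)   # total coding sequence length
codons = cds_nt // 3         # includes the stop codon (assumed)
protein_aa = codons - 1      # amino acids in the predicted protein

print("Exon lengths:  ", exon_lengths)
print("Intron lengths:", intron_lengths)
print(f"CDS: {cds_nt} nt = {codons} codons -> {protein_aa} aa")
```

Under that assumption the result agrees with the `286_aa` in the FASTA header, and the exon lengths match the "Exon Length" column.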
extractseq
• Usually one would use sequence-editing software such as BioEdit.
• extractseq is one editing tool available from EMBOSS.
• Many more options are available on the command line (see the manual).
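What region extraction does can be sketched in a few lines of Python. This illustrates the idea only and is not the EMBOSS code: it assumes regions are given as 1-based, inclusive start-end pairs and that the extracted pieces are joined in the order given. The function name `extract_regions` is made up for illustration.

```python
def extract_regions(sequence: str, regions: list[tuple[int, int]]) -> str:
    """Concatenate the given regions of a sequence.

    Each region is a (start, end) pair, 1-based and inclusive, as is
    usual for sequence coordinates.
    """
    parts = []
    for start, end in regions:
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"region {start}-{end} is outside the sequence")
        # Python slices are 0-based and end-exclusive, hence start - 1.
        parts.append(sequence[start - 1:end])
    return "".join(parts)


# Extract positions 2-4 and 7-8 from a short made-up sequence.
print(extract_regions("ACGTACGTAC", [(2, 4), (7, 8)]))
```

The off-by-one conversion between 1-based sequence coordinates and 0-based string indices is the usual source of mistakes when doing this by hand.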
BioEdit
Multiple sequence alignment using EMBOSS
• seqret generates a multiple sequence file.
• emma aligns the sequences.
• prettyplot generates a graphical alignment.
• Usually, one uses better tools for this. We’ll see them later on in the course.
Restriction maps
• Represent the locations in a DNA sequence cut by restriction enzymes.
• Are used, for example, in identifying whether DNA in a test-tube is the same as its putative sequence.
• Can be used in cloning to design inserts for plasmids.
ReMap
• Display sequence with restriction sites, translation etc.
• Useful in identification of single nucleotide polymorphisms (SNPs).
• If a SNP changes a restriction site, it will cause that RE to cut the DNA differently compared with the wild type.
• Other RE programs: redata, restrict …
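The effect of a SNP on a restriction digest can be illustrated with a short Python sketch. It uses EcoRI, which recognises GAATTC and cuts after the first G, and two made-up linear sequences that differ by a single base. This is a toy example, not how remap or restrict work; the function names are hypothetical.

```python
ECORI_SITE = "GAATTC"
ECORI_CUT_OFFSET = 1  # EcoRI cuts G^AATTC, i.e. one base into the site


def find_sites(dna: str, site: str) -> list[int]:
    """0-based start positions of every occurrence of the site."""
    positions = []
    i = dna.find(site)
    while i != -1:
        positions.append(i)
        i = dna.find(site, i + 1)
    return positions


def fragment_lengths(dna: str, site: str, cut_offset: int) -> list[int]:
    """Fragment lengths after cutting a linear molecule at every site."""
    cuts = [p + cut_offset for p in find_sites(dna, site)]
    bounds = [0] + cuts + [len(dna)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Made-up sequences: the variant has an A->G change in the second site.
wild_type = "ATAT" + "GAATTC" + "GCGCGC" + "GAATTC" + "ATAT"
snp_variant = "ATAT" + "GAATTC" + "GCGCGC" + "GAGTTC" + "ATAT"

print("Wild type fragments:", fragment_lengths(wild_type, ECORI_SITE, ECORI_CUT_OFFSET))
print("SNP fragments:      ", fragment_lengths(snp_variant, ECORI_SITE, ECORI_CUT_OFFSET))
```

The wild type is cut twice and gives three fragments, while the variant loses one site and gives two, which is the difference one would look for on a gel.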
PepStats
PEPSTATS of AAU04762.1 from 1 to 1020
Molecular weight = 116504.98
Residues = 1020
Average Residue Weight = 114.221
Charge = 7.5
Isoelectric Point = 6.9295
A280 Molar Extinction Coefficient = 102130
A280 Extinction Coefficient 1mg/ml = 0.88
Improbability of expression in inclusion bodies = 0.709
Residue    Number  Mole%   DayhoffStat
A = Ala    29      2.843   0.331
B = Asx    0       0.000   0.000
…
Y = Tyr    22      2.157   0.634
Z = Glx    0       0.000   0.000
Property    Residues                   Number  Mole%
Tiny        (A+C+G+S+T)                237     23.235
Small       (A+B+C+D+G+N+P+S+T+V)      444     43.529
Aliphatic   (A+I+L+V)                  300     29.412
Aromatic    (F+H+W+Y)                  107     10.490
Non-polar   (A+C+F+G+I+L+M+P+V+W+Y)    508     49.804
Polar       (D+E+H+K+N+Q+R+S+T+Z)      512     50.196
Charged     (B+D+E+H+K+R+Z)            289     28.333
Basic       (H+K+R)                    156     15.294
Acidic      (B+D+E+Z)                  133     13.039
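The property table is a count of residues per group divided by the total length. The Python sketch below reproduces that calculation using the residue groups listed in the output above; it is an illustration, not the pepstats code, and the function names and the short test peptide are made up.

```python
# Residue groups as listed in the PepStats property table.
PROPERTY_GROUPS = {
    "Tiny": "ACGST",
    "Small": "ABCDGNPSTV",
    "Aliphatic": "AILV",
    "Aromatic": "FHWY",
    "Non-polar": "ACFGILMPVWY",
    "Polar": "DEHKNQRSTZ",
    "Charged": "BDEHKRZ",
    "Basic": "HKR",
    "Acidic": "BDEZ",
}


def mole_percent(count: int, total: int) -> float:
    """Percentage of residues in a group."""
    return 100.0 * count / total


def property_composition(protein: str) -> dict:
    """Return {property: (count, mole%)} for a protein sequence."""
    protein = protein.upper()
    result = {}
    for name, members in PROPERTY_GROUPS.items():
        count = sum(1 for aa in protein if aa in members)
        result[name] = (count, mole_percent(count, len(protein)))
    return result


# Short made-up peptide, used only to show the output format.
for name, (count, pct) in property_composition("ACDKWLLG").items():
    print(f"{name:10s} {count:3d} {pct:7.3f}")
```

As a check against the table, `mole_percent(237, 1020)` gives 23.235 for the "Tiny" row.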
PepInfo
[Figure: PepInfo plots of residue properties along the sequence. Panels: Tiny, Small, Aliphatic, Aromatic, Non-polar, Polar, Charged, Basic, Acidic]
PepInfo
Protein with transmembrane sections
PepInfo
Protein without transmembrane sections
TMHMM
• Very good tool for identifying transmembrane segments.
• http://www.cbs.dtu.dk/services/TMHMM/
Conclusion
• This is only the tip of the iceberg of what can be done with a sequence.
• If you start working with sequences, you will have to decide which tools suit you best.
• It has a lot to do with personal preference and something to do with algorithm accuracy.