Analysis of single sequences

Toolboxes

• EMBOSS: many portals are available.

• Biology Workbench

• ExPASy proteomics tools

• Biotools @ U. Mass. Med. School.

• Many, many more…

http://workbench.sdsc.edu/

Before we start - VecScreen

• When you get a DNA sequence from the sequencer, make sure it is really the sequence you think it is.

• If you don't, you may spend a lot of time analysing the wrong sequence!

• Possible problems: contamination! Work clean.

• Always: Vector contamination.

Vector contamination

• Failure to recognize foreign segments in a sequence can:
– Lead to erroneous conclusions about the biological significance of the sequence
– Waste time and effort in analysis of contaminated sequence
– Delay the release of the sequence in a public database
– Pollute public databases with contaminated sequence

Reminder: Cloning procedure

• The DNA of interest is cloned into a vector.

• The resultant DNA may (probably does) contain sections from the vector.

VecScreen

• VecScreen is a system for quickly identifying segments of a nucleic acid sequence that may be of vector origin.

• NCBI developed VecScreen to combat the problem of vector contamination in public sequence databases.

VecScreen

EMBOSS

• European Molecular Biology Open Software Suite.

• Built for use from the command line.

• Many EMBOSS portals, servers and mirrors are available.

• Each program has its own help file.

• One server: http://emboss.bioinformatics.nl/

• Examples of a few EMBOSS programs:

Briefly – What is PCR

• The polymerase chain reaction (PCR) is a technique to amplify a single copy of a piece of DNA.

Briefly – What is PCR

• The number of copies of the target DNA increases exponentially.

• After 35 cycles: 2^36 ≈ 68 billion copies.
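The doubling arithmetic can be checked directly. The sketch below assumes ideal amplification (every template is copied in every cycle, starting from one double-stranded molecule), which real reactions only approximate. The function name `pcr_copies` is illustrative.

```python
def pcr_copies(cycles: int, start_molecules: int = 1) -> int:
    """Double-stranded copies after `cycles` rounds of ideal doubling."""
    return start_molecules * 2 ** cycles


for n in (10, 20, 35, 36):
    print(f"{n} cycles: {pcr_copies(n):,} copies")
```

Under this idealization, 2^36 (about 68.7 billion) is the number of duplexes after 36 cycles, or equivalently the number of single strands after 35 cycles.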

Primer design

• Primer Length: the optimal length is 18-22 bp.

• Primer Melting Temperature (Tm): the temperature at which one half of the DNA duplex will dissociate. A Tm of 52-58 °C generally produces the best results.

• GC Content

• Primer Secondary Structures

• Repeats
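Some of these criteria can be screened with a few lines of code. The sketch below is only a rough pre-check, not what primer3 does: it estimates Tm with the simple Wallace rule, 2·(A+T) + 4·(G+C), which is a crude approximation for short oligos, and it looks only at single-base repeats. The example primer and all function names are hypothetical.

```python
def gc_content(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * sum(primer.count(b) for b in "GC") / len(primer)


def wallace_tm(primer: str) -> int:
    """Rough Tm in degrees C by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def longest_run(primer: str) -> int:
    """Length of the longest single-base repeat (e.g. AAAA -> 4)."""
    best = run = 1
    for prev, cur in zip(primer, primer[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def check_primer(primer: str) -> dict:
    """Collect the simple checks in one report."""
    return {
        "length_ok": 18 <= len(primer) <= 22,  # length range from the slide
        "gc_percent": round(gc_content(primer), 1),
        "tm_wallace": wallace_tm(primer),
        "longest_run": longest_run(primer),
    }


# Hypothetical 20-mer, used only to exercise the functions
print(check_primer("ATGCGTACGTTAGCCTAGGA"))
```

Dedicated tools such as primer3 use more accurate thermodynamic models for Tm, so the Wallace estimate should be treated as a sanity check only.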

Primer design

• Avoid Template secondary structure.

• Avoid cross homology:
– Commonly, primers are BLASTed to test their specificity.

primer3

• Is a program from the Whitehead Institute, written by Steve Rozen and Helen J. Skaletsky, for finding primers and oligonucleotide probes.

• One interface to 'primer3' is eprimer3, an EMBOSS program.

• Primer3Plus is a nicer interface to primer3, from Biotools (U. Mass. Med. School.). We will use it.

Primer3Plus

Gene prediction

• Identifying stretches of sequence, usually genomic DNA, that are biologically functional.

• This especially includes protein-coding genes

• Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

• Annotation

Gene prediction

• Prokaryotes:
– The sequence coding for a protein occurs as one contiguous open reading frame (ORF), typically many hundreds or thousands of bp.

• Eukaryotes:
– CpG islands and binding sites for a poly(A) tail.
– Difficult to use ORF detection because of splicing.

Gene prediction

• Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects.

• Genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools are rarely suitable for efficacious gene hunting in DNA sequences of a new genome.

• Methods based on cDNA and expressed sequence tag (EST) mapping to genomic DNA or those using alignments to closely related genomes rely either on existence of abundant cDNA and EST data and/or availability on reference genomes.

(Alexandre Lomsadze et al., 2005)

SixPack

• “Display a DNA sequence with 6-frame translation and ORFs”

• Set “Minimum size of ORFs” to 300, to obtain only meaningful ORFs (Proteins are usually longer than 100 aa).

• Set “ORF start with an M?” to “Yes” to obtain only ORFs that begin with a Methionine.
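The idea behind these two settings can be sketched in a few lines. This is not sixpack's implementation, only an illustration under stated assumptions: an ORF is taken to be a stop-free stretch in one of the six frames, the minimum size is measured in nucleotides (matching the 300 nt ≈ 100 aa reasoning above), and the standard genetic code is used. All names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate from the first base; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(dna: str, min_nt: int = 300, start_with_m: bool = True):
    """Yield (frame, peptide) for stop-free stretches of at least min_nt bases."""
    dna = dna.upper()
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            for segment in translate(seq[offset:]).split("*"):
                if start_with_m:
                    # Trim to the first methionine, or discard if there is none
                    m = segment.find("M")
                    segment = segment[m:] if m >= 0 else ""
                if segment and len(segment) * 3 >= min_nt:
                    yield strand * (offset + 1), segment


# Toy sequence with a low threshold so that something is reported
for frame, peptide in six_frame_orfs("ATGAAATTTGGGTAA", min_nt=9):
    print(frame, peptide)
```

Frames are numbered 1 to 3 on the given strand and -1 to -3 on the reverse complement; the real program's numbering and ORF definition may differ in detail.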

SixPack

• The 1st section in the results page lists all the ORFs discovered.

>NM_118742.2_1_ORF1 Translation of NM_118742.2 in frame 1, ORF 1, threshold 500, 42aa
GIKRLLEGQFCYRAFTWPVEITSMQTTVRDFEEDSYLSLLVS
>NM_118742.2_1_ORF2 Translation of NM_118742.2 in frame 1, ORF 2, threshold 500, 909aa
MDFISSLIVGCAQVLCESMNMAERRGHKTDLRQAITDLETAIGDLKAIRDDLTLRIQQDG
LEGRSCSNRAREWLSAVQVTETKTALLLVRFRRREQRTRMRRRYLSCFGCADYKLCKKVS
AILKSIGELRERSEAIKTDGGSIQVTCREIPIKSVVGNTTMMEQVLEFLSEEEERGIIGV
YGPGGVGKTTLMQSINNELITKGHQYDVLIWVQMSREFGECTIQQAVGARLGLSWDEKET
GENRALKIYRALRQKRFLLLLDDVWEEIDLEKTGVPRPDRENKCKVMFTTRSIALCNNMG

SixPack

• The 2nd section shows a map of where the ORFs are in the actual sequence.

EMBOSS / plotorf

• Plots the ORFs found by sixpack:

ORF finder

Problems with ORF finding

• ORF finding can detect only 85% of genes.

• Short proteins

• More than 1 long ORF.

• Alternative start codon (not always the one furthest from the stop codon).

Possible solutions

• Searching the databases for similar proteins. Existence of such a protein will indicate this is a true gene.

• Gene prediction tools:
– GeneMark: http://opal.biology.gatech.edu/GeneMark/
– Many more (e.g. see CBCB website)

GeneMark

• The suggested method of parallelization of gene prediction with the model parameters estimation follows the path of the iterative Viterbi training.

• Tests on well-studied eukaryotic genomes have shown that the new method performs comparably or better than conventional methods…

• Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification.

(Alexandre Lomsadze et al., 2005)

GeneMark

• Sample output: Exon prediction for PIP1B (remember the Gene entry?)

Gene  Exon  Strand  Exon      Exon Range    Exon    Start/End
 #     #            Type                    Length  Frame

 1     1      +     Initial    156   483    328     1 1  - -
 1     2      +     Internal   709  1004    296     2 3  - -
 1     3      +     Internal  1085  1225    141     1 3  - -
 1     4      +     Terminal  1314  1409     96     1 3  - -

# protein sequence of predicted genes

>gene_1|GeneMark.hmm|286_aa
MEGKEEDVRVGANKFPERQPIGTSAQSDKDYKEPPPAPLFEPGELASWSFWRAGIAEFIATFLFLYITVLTVMGVKRSPNMCASVGIQGIAWAFGGMIFALVYCTAGISGGHINPAVTFGLFLARKLSLTRAVYYIVMQCLGAICGAGVVKGFQPKQYQALGGGANTIAHGYTKGSGLGAEIIGTFVLVYTVFSATDAKRNARDSHVPILAPLPIGFAVFLVHLATIPITGTGINPARSLGAAIIFNKDNAWDDHWVFWVGPFIGAALAALYHVIVIRAIPFKSRS
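A quick consistency check on this output: the exon lengths should follow from the 1-based, inclusive ranges, and the summed coding length should match the protein length. The sketch below uses only the numbers from the exon table above. That the terminal exon includes the stop codon is an assumption, though it is consistent with the arithmetic.

```python
# (start, end, reported_length) for each predicted exon, copied from the exon table
exons = [(156, 483, 328), (709, 1004, 296), (1085, 1225, 141), (1314, 1409, 96)]

for start, end, reported in exons:
    # Coordinates are 1-based and inclusive, so length = end - start + 1
    assert end - start + 1 == reported

coding_nt = sum(end - start + 1 for start, end, _ in exons)
codons = coding_nt // 3

print(f"coding length: {coding_nt} nt = {codons} codons")
```

The total is 861 nt, i.e. 287 codons: 286 amino acids plus one stop codon, which agrees with the `286_aa` in the FASTA header.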

extractseq

• Usually one would use sequence-editing software such as BioEdit.

• extractseq is one editing tool available from EMBOSS.

• Many more options are available on the command line (see the manual).
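The core operation, pulling out one or more coordinate ranges and joining them, is simple to sketch. The function below is a hypothetical illustration, not EMBOSS code; it assumes 1-based, inclusive coordinates, the convention biologists normally use for sequence positions.

```python
def extract_regions(seq: str, regions: list[tuple[int, int]]) -> str:
    """Concatenate 1-based, inclusive (start, end) regions of seq."""
    parts = []
    for start, end in regions:
        if not 1 <= start <= end <= len(seq):
            raise ValueError(f"region {start}-{end} is outside 1-{len(seq)}")
        # Convert to Python's 0-based, end-exclusive slicing
        parts.append(seq[start - 1:end])
    return "".join(parts)


print(extract_regions("ACGTACGTAC", [(1, 3), (8, 10)]))
```

The same function could be used, for example, to join predicted exon ranges into a coding sequence.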

BioEdit

Multiple sequence alignment using EMBOSS

• seqret generates a multiple sequence file.

• emma aligns the sequences.

• prettyplot generates a graphical alignment.

Usually, one uses better tools for this. We'll see them later on in the course.

Restriction maps

• Represent the locations in a DNA sequence cut by restriction enzymes.

• Are used, for example, in identifying whether DNA in a test-tube is the same as its putative sequence.

• Can be used in cloning to design inserts for plasmids.

ReMap

• Display sequence with restriction sites, translation etc.

• Useful in identification of single nucleotide polymorphisms (SNPs).

• If a SNP changes a restriction site, it will cause that RE to cut the DNA differently compared with the wild type.

• Other RE programs: redata, restrict …
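The SNP effect described above can be illustrated with a toy digest. The sketch assumes a linear molecule, a complete digest, and a search of the top strand only (sufficient for a palindromic site). It uses EcoRI, which recognizes GAATTC and cuts after the first G. The two sequences are invented for the example.

```python
def cut_positions(dna: str, site: str = "GAATTC", cut_offset: int = 1) -> list[int]:
    """0-based positions at which the enzyme cuts the top strand."""
    positions = []
    i = dna.find(site)
    while i != -1:
        positions.append(i + cut_offset)
        i = dna.find(site, i + 1)
    return positions


def fragment_lengths(dna: str, **kwargs) -> list[int]:
    """Lengths of the linear fragments produced by a complete digest."""
    bounds = [0] + cut_positions(dna, **kwargs) + [len(dna)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


wild_type = "AAAAGAATTCAAAAAAAAGAATTCAAAA"
variant = "AAAAGAATTCAAAAAAAAGAGTTCAAAA"  # hypothetical SNP destroys the second site

print("wild type:", fragment_lengths(wild_type))
print("variant:  ", fragment_lengths(variant))
```

The wild type is cut twice and the variant once, so the two give different fragment patterns on a gel.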

PepStats

PEPSTATS of AAU04762.1 from 1 to 1020

Molecular weight = 116504.98          Residues = 1020
Average Residue Weight = 114.221      Charge = 7.5
Isoelectric Point = 6.9295
A280 Molar Extinction Coefficient = 102130
A280 Extinction Coefficient 1mg/ml = 0.88
Improbability of expression in inclusion bodies = 0.709

Residue    Number  Mole%   DayhoffStat
A = Ala    29      2.843   0.331
B = Asx    0       0.000   0.000
…
Y = Tyr    22      2.157   0.634
Z = Glx    0       0.000   0.000

Property    Residues                    Number  Mole%
Tiny        (A+C+G+S+T)                 237     23.235
Small       (A+B+C+D+G+N+P+S+T+V)       444     43.529
Aliphatic   (A+I+L+V)                   300     29.412
Aromatic    (F+H+W+Y)                   107     10.490
Non-polar   (A+C+F+G+I+L+M+P+V+W+Y)     508     49.804
Polar       (D+E+H+K+N+Q+R+S+T+Z)       512     50.196
Charged     (B+D+E+H+K+R+Z)             289     28.333
Basic       (H+K+R)                     156     15.294
Acidic      (B+D+E+Z)                   133     13.039
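The property rows are simple counts: Mole% is the number of residues in a class divided by the sequence length, times 100 (for example 237 / 1020 = 23.235%). The sketch below reproduces that calculation using the class definitions listed in the PepStats output. The function name and the short test peptide are hypothetical.

```python
# Residue classes as listed in the PepStats property table
PROPERTY_CLASSES = {
    "Tiny": "ACGST",
    "Small": "ABCDGNPSTV",
    "Aliphatic": "AILV",
    "Aromatic": "FHWY",
    "Non-polar": "ACFGILMPVWY",
    "Polar": "DEHKNQRSTZ",
    "Charged": "BDEHKRZ",
    "Basic": "HKR",
    "Acidic": "BDEZ",
}


def property_table(protein: str) -> dict[str, tuple[int, float]]:
    """Map each property class to (residue count, mole percent)."""
    protein = protein.upper()
    total = len(protein)
    table = {}
    for name, residues in PROPERTY_CLASSES.items():
        count = sum(protein.count(r) for r in residues)
        table[name] = (count, round(100.0 * count / total, 3))
    return table


# Hypothetical 8-residue peptide, used only to exercise the function
for name, (count, mole_pct) in property_table("MKRDEAAG").items():
    print(f"{name:10s} {count:3d} {mole_pct:7.3f}")
```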

PepInfo

[Figure: PepInfo plots of residue properties along the sequence. Panels: Tiny, Small, Aliphatic, Aromatic, Non-polar, Polar, Charged, Basic, Acidic]

PepInfo

Protein with transmembrane sections

PepInfo

Protein without transmembrane sections

TMHMM

• Very good tool for identifying transmembrane segments.

• http://www.cbs.dtu.dk/services/TMHMM/


Conclusion

• This is only the tip of the iceberg of what can be done with a sequence.

• If you start working with sequences, you will have to decide which tools suit you best.

• It has a lot to do with personal preference and something to do with algorithm accuracy.
