Analysis of single sequences

Upload: rudolf-chandler

Post on 26-Dec-2015


Page 1: Analysis of single sequences. Toolboxes: EMBOSS (many portals), Biology Workbench, ExPASy proteomics tools, Biotools @ U. Mass. Med. School.

Analysis of single sequences

Toolboxes

• EMBOSS

– Many portals.

• Biology Workbench

• ExPASy proteomics tools

• Biotools @ U. Mass. Med. School.

• Many, many more…

Before we start - VecScreen

• When you get a DNA sequence from the sequencer, make sure it is really the sequence you think it is.

• If you don't, you may spend a lot of time analysing the wrong sequence!

• Possible problems: contamination! Work clean.

• Always check for vector contamination.

Vector contamination

• Failure to recognize foreign segments in a sequence can:

– Lead to erroneous conclusions about the biological significance of the sequence

– Waste time and effort in analysis of contaminated sequence

– Delay the release of the sequence in a public database

– Pollute public databases with contaminated sequence

Reminder: Cloning procedure

• The DNA of interest is cloned into a vector.

• The resultant DNA may (probably does) contain sections from the vector.

VecScreen

• VecScreen is a system for quickly identifying segments of a nucleic acid sequence that may be of vector origin.

• NCBI developed VecScreen to combat the problem of vector contamination in public sequence databases.
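The idea of screening a read for vector-derived segments can be sketched in a few lines of Python. This is only an illustration: VecScreen itself runs a BLAST search against NCBI's UniVec database, whereas the sketch below uses exact k-mer matching against a single vector string. The function name `vector_hits`, the value of `k` and the example sequences are all assumptions made for the demonstration.

```python
def vector_hits(query, vector, k=12):
    """Return merged 1-based (start, end) spans of `query` that share
    exact k-mers with `vector`.

    Simplification: only the forward strand is checked and no mismatches
    are tolerated, so this is far less sensitive than a BLAST search.
    """
    query, vector = query.upper(), vector.upper()
    vector_kmers = {vector[i:i + k] for i in range(len(vector) - k + 1)}
    spans = []
    for i in range(len(query) - k + 1):
        if query[i:i + k] in vector_kmers:
            start, end = i + 1, i + k
            if spans and start <= spans[-1][1] + 1:
                # Overlaps or touches the previous span: extend it.
                spans[-1] = (spans[-1][0], max(spans[-1][1], end))
            else:
                spans.append((start, end))
    return spans


# Hypothetical example: a made-up vector fragment embedded in a read.
vector_fragment = "AAGCTTGGCACTGGCCGTCG"
read = "CCCCCCCCCC" + vector_fragment + "TTTTTTTTTT"
print(vector_hits(read, vector_fragment, k=8))
```

Any span reported this way would be a candidate for trimming before further analysis.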

EMBOSS

• European Molecular Biology Open Software Suite.

• Built for use from the command line.

• Many EMBOSS portals, servers and mirrors are available.

• Each program has its own help file.

• One server: http://emboss.bioinformatics.nl/

• Examples of a few EMBOSS programs:

Briefly – What is PCR

• The polymerase chain reaction (PCR) is a technique to amplify a single copy of a piece of DNA.

Briefly – What is PCR

• The number of copies of the target DNA increases exponentially.

• After 35 cycles: 2^36 ≈ 68 billion copies.
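The arithmetic can be checked with a short Python sketch. It assumes ideal conditions (every strand is copied in every cycle) and counts single strands starting from one double-stranded template, which is one way to arrive at an exponent of 36 after 35 cycles. The function name `strands_after` is made up for this example.

```python
def strands_after(cycles, templates=1):
    """Number of single strands after `cycles` rounds of perfect doubling.

    Each double-stranded template contributes 2 strands before the first
    cycle, and every cycle doubles the strand count.
    """
    return templates * 2 ** (cycles + 1)


for n in (1, 2, 3, 35):
    print(n, strands_after(n))
```

Real reactions fall short of this figure because amplification efficiency drops in later cycles.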

Primer design

• Primer Length: the optimal length is 18-22 bp.

• Primer Melting Temperature (Tm): the temperature at which one half of the DNA duplex will dissociate. A Tm of 52-58 °C generally produces the best results.

• GC Content

• Primer Secondary Structures

• Repeats
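The length, Tm and GC criteria above can be turned into a rough first-pass check. The sketch below uses the Wallace rule (Tm ≈ 2 °C per A/T plus 4 °C per G/C), a crude approximation for short oligos; primer3 uses a more accurate thermodynamic model, so the two will disagree. The function names are hypothetical, and the thresholds are simply the ones quoted on this slide.

```python
def gc_content(primer):
    """GC content of the primer as a percentage."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Approximate Tm in °C using the Wallace rule: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def check_primer(primer):
    """Compare a primer against the length and Tm ranges from the slide."""
    tm = wallace_tm(primer)
    return {
        "length": len(primer),
        "length_ok": 18 <= len(primer) <= 22,
        "gc_percent": gc_content(primer),
        "tm_wallace": tm,
        "tm_ok": 52 <= tm <= 58,
    }


# Made-up primer sequence, for illustration only.
print(check_primer("ATGCATGCATGCATGCATGC"))
```

Secondary structure and repeats are not covered here; those checks are better left to primer3.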

Primer design

• Avoid Template secondary structure.

• Avoid cross homology:

– Commonly, primers are BLASTed to test their specificity.

primer3

• Is a program from the Whitehead Institute, written by Steve Rozen and Helen J. Skaletsky, for finding primers and oligonucleotide probes.

• One interface to 'primer3' is eprimer3, an EMBOSS program.

• Primer3Plus is a nicer interface to primer3, from Biotools (U. Mass. Med. School). We will use it.

Primer3Plus

Gene prediction

• Identifying stretches of sequence, usually genomic DNA, that are biologically functional.

• This especially includes protein-coding genes

• Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

• Annotation

Gene prediction

• Prokaryotes:

– The sequence coding for a protein occurs as one contiguous open reading frame (ORF), typically many hundreds or thousands of bp.

• Eukaryotes:

– Signals to look for include CpG islands and binding sites for a poly(A) tail.

– Difficult to use ORF detection because of splicing.

Gene prediction

• Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects.

• Genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools are rarely suitable for efficacious gene hunting in DNA sequences of a new genome.

• Methods based on cDNA and expressed sequence tag (EST) mapping to genomic DNA or those using alignments to closely related genomes rely either on existence of abundant cDNA and EST data and/or availability on reference genomes.

(Alexandre Lomsadze et al., 2005)

SixPack

• “Display a DNA sequence with 6-frame translation and ORFs”

• Set “Minimum size of ORFs” to 300, to obtain only meaningful ORFs (Proteins are usually longer than 100 aa).

• Set “ORF start with an M?” to “Yes” to obtain only ORFs that begin with a Methionine.
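What a six-frame ORF search does can be sketched in Python. This is not sixpack's implementation, only a minimal illustration under stated assumptions: the standard genetic code, an ORF defined as a stretch between stop codons (optionally trimmed to its first Met), and a minimum length given in amino acids. The names `find_orfs`, `min_aa` and `start_with_m` are invented for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate in frame 1; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def find_orfs(dna, min_aa=100, start_with_m=True):
    """Return (strand, frame, protein) for ORFs in all six frames.

    Stretches running to the end of the sequence without a stop codon
    are reported too, which may or may not be desirable.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for segment in translate(seq[offset:]).split("*"):
                if start_with_m:
                    m = segment.find("M")
                    segment = segment[m:] if m != -1 else ""
                if segment and len(segment) >= min_aa:
                    orfs.append((strand, offset + 1, segment))
    return orfs


# Toy sequence with one short ORF (ATG AAA TTT GGG TAA) in frame 3.
print(find_orfs("CCATGAAATTTGGGTAACC", min_aa=4))
```

With `min_aa=100` the toy sequence would return nothing, which is the point of the minimum-size setting: very short ORFs occur by chance in any sequence.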

SixPack

• The 1st section in the results page lists all the ORFs discovered.

>NM_118742.2_1_ORF1 Translation of NM_118742.2 in frame 1, ORF 1, threshold 500, 42aa
GIKRLLEGQFCYRAFTWPVEITSMQTTVRDFEEDSYLSLLVS
>NM_118742.2_1_ORF2 Translation of NM_118742.2 in frame 1, ORF 2, threshold 500, 909aa
MDFISSLIVGCAQVLCESMNMAERRGHKTDLRQAITDLETAIGDLKAIRDDLTLRIQQDG
LEGRSCSNRAREWLSAVQVTETKTALLLVRFRRREQRTRMRRRYLSCFGCADYKLCKKVS
AILKSIGELRERSEAIKTDGGSIQVTCREIPIKSVVGNTTMMEQVLEFLSEEEERGIIGV
YGPGGVGKTTLMQSINNELITKGHQYDVLIWVQMSREFGECTIQQAVGARLGLSWDEKET
GENRALKIYRALRQKRFLLLLDDVWEEIDLEKTGVPRPDRENKCKVMFTTRSIALCNNMG

SixPack

• The 2nd section shows a map of where the ORFs are in the actual sequence.

EMBOSS / plotorf

• Plots the ORFs found by sixpack:

ORF finder

Problems with ORF finding

• ORF finding can detect only 85% of genes.

• Short proteins

• More than 1 long ORF.

• Alternative start codon (not always the one furthest from the stop codon).

GeneMark

• The suggested method of parallelization of gene prediction with the model parameters estimation follows the path of the iterative Viterbi training.

• Tests on well-studied eukaryotic genomes have shown that the new method performs comparably or better than conventional methods…

• Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification.

(Alexandre Lomsadze et al., 2005)

GeneMark

• Sample output: Exon prediction for PIP1B (remember the Gene entry?)

Gene  Exon  Strand  Exon      Exon Range     Exon    Start/End
  #     #           Type                     Length  Frame

  1     1     +     Initial     156    483    328     1 1   - -
  1     2     +     Internal    709   1004    296     2 3   - -
  1     3     +     Internal   1085   1225    141     1 3   - -
  1     4     +     Terminal   1314   1409     96     1 3   - -
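The exon table is internally consistent, and a quick Python check shows how it relates to the predicted protein. The sketch assumes 1-based inclusive coordinates (so length = end - start + 1) and that the final codon of the terminal exon is the stop codon. The variable and function names are made up for the example.

```python
# Exon ranges copied from the GeneMark sample output above.
EXONS = [(156, 483), (709, 1004), (1085, 1225), (1314, 1409)]


def exon_length(start, end):
    """Length of an exon given 1-based inclusive coordinates."""
    return end - start + 1


def coding_summary(exons):
    """Return per-exon lengths, total coding length, codon count, remainder."""
    lengths = [exon_length(s, e) for s, e in exons]
    total = sum(lengths)
    codons, remainder = divmod(total, 3)
    return lengths, total, codons, remainder


lengths, total, codons, remainder = coding_summary(EXONS)
print(lengths)             # matches the Exon Length column
print(total, codons)       # total coding bases and number of codons
print(codons - 1)          # residues once the assumed stop codon is removed
```

The last value agrees with the `286_aa` in the FASTA header of the predicted protein.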

# protein sequence of predicted genes

>gene_1|GeneMark.hmm|286_aa
MEGKEEDVRVGANKFPERQPIGTSAQSDKDYKEPPPAPLFEPGELASWSFWRAGIAEFIATFLFLYITVLTVMGVKRSPNMCASVGIQGIAWAFGGMIFALVYCTAGISGGHINPAVTFGLFLARKLSLTRAVYYIVMQCLGAICGAGVVKGFQPKQYQALGGGANTIAHGYTKGSGLGAEIIGTFVLVYTVFSATDAKRNARDSHVPILAPLPIGFAVFLVHLATIPITGTGINPARSLGAAIIFNKDNAWDDHWVFWVGPFIGAALAALYHVIVIRAIPFKSRS

extractseq

• Usually one would use sequence-editing software such as BioEdit.

• extractseq is one editing tool available from EMBOSS.

• Many more options are available on the command line (see the manual).
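The core operation, pulling out one or more regions of a sequence and joining them, can be sketched in Python. This is not extractseq itself; it only illustrates the idea, assuming 1-based inclusive coordinates like those in the GeneMark exon table. The function name `extract_regions` is hypothetical.

```python
def extract_regions(seq, regions):
    """Concatenate the given 1-based inclusive (start, end) regions of seq."""
    return "".join(seq[start - 1:end] for start, end in regions)


# Letters stand in for bases so the result is easy to read.
print(extract_regions("ABCDEFGHIJ", [(2, 4), (7, 8)]))
```

Applied to a genomic sequence with the predicted exon ranges, the same call would return the spliced coding sequence.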

BioEdit

Multiple sequence alignment using EMBOSS

seqret generates a multiple sequence file

emma aligns the sequences

prettyplot generates a graphical alignment

Usually, one uses better tools for this. We'll see them later in the course.

Restriction maps

• Represent the locations in a DNA sequence cut by restriction enzymes.

• Are used, for example, in identifying whether DNA in a test-tube is the same as its putative sequence.

• Can be used in cloning to design inserts for plasmids.

ReMap

• Display sequence with restriction sites, translation etc.

• Useful in identification of single nucleotide polymorphisms (SNPs).

• If a SNP changes a restriction site, it will cause that RE to cut the DNA differently compared with the wild type.

• Other RE programs: redata, restrict …
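How a SNP can remove a restriction site is easy to show with a small Python sketch. It is a simplification, not how remap works internally: only three well-known enzymes are listed, only exact recognition sequences on the given strand are searched, and the reported position is the start of the recognition sequence rather than the cut position. The sequences are made up for the example.

```python
# Recognition sequences of three commonly used enzymes (all palindromic).
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(dna, enzymes=SITES):
    """Return {enzyme: [1-based start positions of its recognition site]}."""
    dna = dna.upper()
    hits = {}
    for name, site in enzymes.items():
        positions = []
        i = dna.find(site)
        while i != -1:
            positions.append(i + 1)
            i = dna.find(site, i + 1)  # allow overlapping matches
        hits[name] = positions
    return hits


wild_type = "AAGAATTCTTGGATCCAA"
mutant = wild_type[:4] + "G" + wild_type[5:]  # hypothetical A>G SNP at position 5

print(find_sites(wild_type))
print(find_sites(mutant))
```

In the mutant the EcoRI site is gone, so an EcoRI digest would give a different fragment pattern from the wild type.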

PepStats

PEPSTATS of AAU04762.1 from 1 to 1020

Molecular weight = 116504.98            Residues = 1020
Average Residue Weight = 114.221        Charge = 7.5
Isoelectric Point = 6.9295
A280 Molar Extinction Coefficient = 102130
A280 Extinction Coefficient 1mg/ml = 0.88
Improbability of expression in inclusion bodies = 0.709

Residue     Number   Mole%    DayhoffStat
A = Ala     29       2.843    0.331
B = Asx     0        0.000    0.000
…
Y = Tyr     22       2.157    0.634
Z = Glx     0        0.000    0.000

Property    Residues                   Number   Mole%
Tiny        (A+C+G+S+T)                237      23.235
Small       (A+B+C+D+G+N+P+S+T+V)      444      43.529
Aliphatic   (A+I+L+V)                  300      29.412
Aromatic    (F+H+W+Y)                  107      10.490
Non-polar   (A+C+F+G+I+L+M+P+V+W+Y)    508      49.804
Polar       (D+E+H+K+N+Q+R+S+T+Z)      512      50.196
Charged     (B+D+E+H+K+R+Z)            289      28.333
Basic       (H+K+R)                    156      15.294
Acidic      (B+D+E+Z)                  133      13.039
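The property table is just a count of residues falling into each class, divided by the sequence length. A minimal Python sketch, using the class definitions shown in the output above (the function name and the test peptide are made up):

```python
# Residue classes as listed in the pepstats property table.
PROPERTY_CLASSES = {
    "Tiny": "ACGST",
    "Small": "ABCDGNPSTV",
    "Aliphatic": "AILV",
    "Aromatic": "FHWY",
    "Non-polar": "ACFGILMPVWY",
    "Polar": "DEHKNQRSTZ",
    "Charged": "BDEHKRZ",
    "Basic": "HKR",
    "Acidic": "BDEZ",
}


def property_composition(protein):
    """Return {class: (count, mole percent)} for a protein sequence."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty sequence")
    result = {}
    for name, members in PROPERTY_CLASSES.items():
        count = sum(1 for aa in protein if aa in members)
        result[name] = (count, round(100.0 * count / len(protein), 3))
    return result


for name, (count, percent) in property_composition("ACDEFGHIKL").items():
    print(f"{name:10s} {count:3d} {percent:7.3f}")
```

The same division explains the sample output: 237 tiny residues out of 1020 is 23.235%.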

PepInfo

[Figure: pepinfo plots of residue properties along the sequence. Panels: Tiny, Small, Aliphatic, Aromatic, Non-polar, Polar, Charged, Basic, Acidic]

PepInfo

Protein with transmembrane sections

PepInfo

Protein without transmembrane sections

TMHMM

• Very good tool for identifying transmembrane segments.

• http://www.cbs.dtu.dk/services/TMHMM/
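For intuition about why transmembrane segments stand out in a sequence, here is a sliding-window hydropathy sketch in Python. It is not TMHMM's method (TMHMM uses a hidden Markov model); it is the much older Kyte-Doolittle approach, with the commonly quoted window of 19 residues and threshold of 1.6 taken as assumptions. Function names and test sequences are invented.

```python
# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window along the sequence.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    scores = [KD[aa] for aa in protein.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """1-based start positions of windows at or above the threshold."""
    return [i + 1
            for i, value in enumerate(hydropathy_profile(protein, window))
            if value >= threshold]


# Artificial sequence: a 19-residue hydrophobic stretch between charged flanks.
print(candidate_tm_windows("D" * 10 + "L" * 19 + "D" * 10))
```

Windows above the threshold only suggest a hydrophobic stretch; a dedicated predictor such as TMHMM should be used for actual annotation.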

Conclusion

• This is just the tip of the iceberg of what can be done with a sequence.

• If you start working with sequences, you will have to decide which tools suit you best.

• The choice has a lot to do with personal preference and something to do with algorithm accuracy.