A BioInformatics Survey . . . just a taste, with an emphasis on the GCG suite

Steven M. Thompson, Florida State University School of Computational Science and Information Technology (CSIT)

Post on 25-Feb-2016



Page 1: A BioInformatics Survey . . . just a taste, with an emphasis on the GCG suite

Steven M. Thompson
Florida State University School of Computational Science and Information Technology (CSIT)

Page 2: Summary

What is bioinformatics, genomics, sequence analysis, computational molecular biology . . .

Reverse Biochemistry & Evolution.

Database growth & cpu power.

Very brief ‘show-and-tell,’ ‘how-to,’ e.g.: NCBI Resources, phylogenetics, GCG’s SeqLab.

High quality training is essential! Graduates need to be competitive on a world biotechnology market.

Page 3: My definitions

Biocomputing and computational biology are synonymous and describe the use of computers and computational techniques to analyze any biological system, from molecules, through cells, tissues, and organisms, all the way to populations.

Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available biological databases.

Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, mechanism, interactions, evolution, and perhaps structure of biological molecules.

Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes.

Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

Page 4: The reverse biochemistry analogy

The reverse biochemistry analogy: from a ‘virtual’ DNA sequence to actual molecular physical characterization, not the other way ’round.

Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and structural insights into a gene product, without the need to isolate and purify massive amounts of protein! Eventually you can go on to clone and express the gene based on that analysis using PCR techniques.

The computer and molecular databases are an essential part of this process.

Page 5: The exponential growth of molecular sequence databases & cpu power

Year    BasePairs  Sequences
1982       680338        606
1983      2274029       2427
1984      3368765       4175
1985      5204420       5700
1986      9615371       9978
1987     15514776      14584
1988     23800000      20579
1989     34762585      28791
1990     49179285      39533
1991     71947426      55627
1992    101008486      78608
1993    157152442     143492
1994    217102462     215273
1995    384939485     555694
1996    651972984    1021211
1997   1160300687    1765847
1998   2008761784    2837897
1999   3841163011    4864570
2000  11101066288   10106023
2001  15849921438   14976310
2002  28507990166   22318883

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Doubling time ~ 1 year!
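The "doubling time ~ 1 year" figure can be checked against the growth table with a little arithmetic. The sketch below is only an illustration: it assumes steady exponential growth between two rows of the table, and the function and variable names are made up for this example.

```python
import math

# Base-pair totals copied from the growth table (1982, 1999, and 2002 rows).
BASE_PAIRS = {
    1982: 680_338,
    1999: 3_841_163_011,
    2002: 28_507_990_166,
}

def doubling_time(year0, year1, totals=BASE_PAIRS):
    """Average doubling time in years, assuming steady exponential growth."""
    doublings = math.log2(totals[year1] / totals[year0])
    return (year1 - year0) / doublings

whole_period = doubling_time(1982, 2002)
recent = doubling_time(1999, 2002)
print(f"1982-2002: {whole_period:.2f} years per doubling")
print(f"1999-2002: {recent:.2f} years per doubling")
```

Averaged over the whole table the doubling time comes out at roughly 1.3 years, and over the last three rows it is close to one year, which is consistent with the slide's rounded claim.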

Page 6: Database growth (cont.)

The Human Genome Project and numerous other genome projects have kept the data coming at alarming rates. As of April 2003 (50 years after the Watson-Crick double helix!), 16 Archaea, 128 Bacteria, and 10 Eukaryote complete, finished genomes, and 4 Vertebrate and 5 Plant essentially complete genome maps, are publicly available for analysis, not counting all the virus and viroid genomes available.

The International Human Genome Sequencing Consortium announced the completion of a "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first assembly of the human genome. Both articles were published mid-February 2001 in the journals Science and Nature.

Page 7: Some neat stuff from those papers

We, Homo sapiens, aren’t nearly as special as we had once hoped we were. Of the 3.2 billion base pairs in our DNA:

Traditional, text-book estimates of the number of genes were often in the 100,000 range; turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 35,000!

The protein coding region of our genome is only about 1% or so; much of the remainder, the ‘junk,’ is ‘jumping,’ ‘selfish DNA,’ of which much may be involved in regulation and control. Understanding this network is a huge challenge.

100-200 genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be not true by more extensive analyses, and to be due to gene loss rather than transfer.)

Page 8: What are primary sequences?

(Central Dogma: DNA -> RNA -> protein)

Primary refers to one dimension: all of the ‘symbol’ information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide.

The symbols are the one letter alphabetic codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural information are not included within this sequence; however, much of this type of information is available in the reference documentation sections associated with primary sequences in the databases.
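To make the idea of one-letter symbols and ambiguity codes concrete, here is a minimal sketch using the standard IUPAC nucleotide codes. The slide does not list the codes itself, and the helper names below are invented for illustration.

```python
# IUPAC nucleotide codes: each symbol stands for a set of possible bases.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def matches(symbol, base):
    """True if a concrete base is compatible with a (possibly ambiguous) symbol."""
    return base.upper() in IUPAC_DNA[symbol.upper()]

def count_ambiguous(sequence):
    """Number of positions that are not a plain A, C, G, or T."""
    return sum(1 for symbol in sequence.upper() if len(IUPAC_DNA[symbol]) > 1)

print(matches("R", "G"))           # R means A or G (a purine)
print(count_ambiguous("ACGTNRY"))  # N, R, and Y are ambiguous
```

A protein alphabet works the same way, with its own symbols for residues that could not be told apart.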

Page 9: What are sequence databases?

These databases are an organized way to store the tremendous amount of sequence information that accumulates from laboratories worldwide. Each database has its own specific format. Three major database organizations around the world are responsible for maintaining most of this data; they largely ‘mirror’ one another.

North America: National Center for Biotechnology Information (NCBI): GenBank & GenPept. Also Georgetown University’s NBRF Protein Identification Resource: PIR & NRL_3D.

Europe: European Molecular Biology Laboratory (also EBI & ExPasy): EMBL & Swiss-Prot.

Asia: The DNA Data Bank of Japan (DDBJ).

Page 10: Content & organization

Most sequence database installations are examples of complex ASCII/Binary databases, but they usually are not Oracle or SQL or Object Oriented (proprietary ones often are). They often contain several very long text files containing different types of information all related to particular sequences, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing index functions.

Software is usually required to successfully interact with these databases and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise. Nucleic acid databases are split into subdivisions based on taxonomy (historical). Protein databases are often organized into sections by level of annotation.
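The role an index plays in 'gluing together' long flat files can be sketched in a few lines. This is a toy illustration only: the '>'-titled record layout, the in-memory file, and the function names are assumptions made for the example, not the layout of any real GenBank or GCG installation.

```python
import io

def build_index(handle):
    """Map each record name to the byte offset of its '>' title line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            name = line[1:].split()[0].decode("ascii")
            index[name] = offset
    return index

def fetch(handle, index, name):
    """Seek straight to one record and return its sequence as a string."""
    handle.seek(index[name])
    handle.readline()  # skip the title line
    parts = []
    for line in handle:
        if line.startswith(b">"):
            break  # reached the next record
        parts.append(line.strip().decode("ascii"))
    return "".join(parts)

# A tiny stand-in for a very long flat file of sequences.
flat_file = io.BytesIO(
    b">seq1 first toy record\nACTGACGTCA\nCATACTGGGA\n"
    b">seq2 second toy record\nGATTTAATAG\n"
)
index = build_index(flat_file)
print(fetch(flat_file, index, "seq2"))
```

The point of the design is that a lookup never has to scan the whole text file: the index is built once, and each retrieval is a single seek.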

Page 11: What are other biological databases?

Three dimensional structure databases: the Protein Data Bank and Rutgers Nucleic Acid Database.

Still more; these can be considered ‘non-molecular’:

Reference Databases: e.g. OMIM (Online Mendelian Inheritance in Man) and PubMed/MedLine (over 11 million citations from more than 4 thousand bio/medical scientific journals).

Phylogenetic Tree Databases: e.g. the Tree of Life.

Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes).

Population studies data: which strains, where, etc.

And then databases that most biocomputing folk don’t even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates . . . .

Page 12: What are the primary algorithms used?

Dot matrix approaches;
The dynamic programming algorithm;
Heuristics-based hashing methods for similarity searching;
Multiple sequence alignment;
Consensus and weight matrix descriptors, including HMMs;
Phylogenetic inference methodology;
Structure estimation and homology modeling.

Common Thread: Inference through homology is a fundamental principle of biology!

What is homology? In this context it is similarity great enough such that common ancestry is implied. Walter Fitch, a famous molecular evolutionist, likes to relate the analogy: homology is like pregnancy, you either are or you’re not; there’s no such thing as 65% pregnant!
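As one concrete instance of the dynamic programming idea, the sketch below computes a Needleman-Wunsch style global alignment score. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration and are not the defaults of any GCG program; real tools use substitution matrices and more refined gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best end-to-end alignment score of a and b (Needleman-Wunsch recurrence)."""
    # previous[j] holds the best score for a[:i-1] aligned against b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = previous[j - 1] + pair  # align a[i-1] with b[j-1]
            up = previous[j] + gap             # a[i-1] opposite a gap
            left = current[j - 1] + gap        # b[j-1] opposite a gap
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]

print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the table are kept at a time, which is enough when the score alone is wanted; recovering the alignment itself would need the full table or a traceback structure. A high score is evidence of similarity, and it is from sufficient similarity that homology is then inferred.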

Page 13: So how do you do bioinformatics?

Often on the InterNet over the World Wide Web:

Site                         URL (Uniform Resource Locator)           Content
Nat’l Center Biotech' Info'  http://www.ncbi.nlm.nih.gov/             databases/analysis/software
PIR/NBRF                     http://www-nbrf.georgetown.edu/          protein sequence database
IUBIO Biology Archive        http://iubio.bio.indiana.edu/            database/software archive
Univ. of Montreal            http://megasun.bch.umontreal.ca/         database/software archive
Japan's GenomeNet            http://www.genome.ad.jp/                 databases/analysis/software
European Mol' Bio' Lab'      http://www.embl-heidelberg.de/           databases/analysis/software
European Bioinformatics      http://www.ebi.ac.uk/                    databases/analysis/software
The Sanger Institute         http://www.sanger.ac.uk/                 databases/analysis/software
Univ. of Geneva BioWeb       http://www.expasy.ch/                    databases/analysis/software
ProteinDataBank              http://www.rcsb.org/pdb/                 3D mol' structure database
Molecules R Us               http://molbio.info.nih.gov/cgi-bin/pdb/  3D protein/nuc' visualization
The Genome DataBase          http://www.gdb.org/                      The Human Genome Project
Stanford Genomics            http://genome-www.stanford.edu/          various genome projects
Inst. for Genomic Res’rch    http://www.tigr.org/                     esp. microbial genome projects
HIV Sequence Database        http://hiv-web.lanl.gov/                 HIV epidemiology seq' DB
The Tree of Life             http://tolweb.org/tree/phylogeny.html    overview of all phylogeny
Ribosomal Database Proj’     http://rdp.cme.msu.edu/html/             databases/analysis/software
WIT Metabolism               http://wit.mcs.anl.gov/WIT2/             metabolic reconstruction
Harvard Bio' Laboratories    http://golgi.harvard.edu/                nice bioinformatics links list

Page 14: But large datasets become intractable. What other resources are available?

Desktop software solutions: public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So,

commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc.,

but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!

Page 15: Therefore, UNIX server-based solutions

Public domain solutions also exist, but now a very cooperative systems manager needs to maintain everything for users, so,

commercial products, e.g. the Accelrys GCG Wisconsin Package [a Pharmacopeia Co.] and the SeqLab Graphical User Interface, simplify matters for administrators and users.

One license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere!

Operating system: UNIX command line operation hassles; communications software (telnet, ssh, and terminal emulation); X graphics; file transfer (ftp and scp/sftp); and editors (vi, emacs, pico, or desktop word processing followed by file transfer [save as "text only!"]).

Page 16: The Genetics Computer Group: the Accelrys Wisconsin Package for Sequence Analysis

Begun in 1982 in Oliver Smithies’ lab at the Genetics Dept. at the University of Wisconsin, Madison, then a private company for over 10 years, then acquired by the Oxford Molecular Group U.K., and now owned by Pharmacopeia Inc. U.S.A. under the new name Accelrys.

The suite contains almost 150 programs designed to work in a “toolbox” fashion. Several simple programs used in succession can lead to sophisticated results.

Also ‘internal compatibility,’ i.e. once you learn to use one program, all programs can be run similarly, and the output from many programs can be used as input for other programs.

Used all over the world by more than 30,000 scientists at over 530 institutions in 35 countries, so learning it here will most likely be useful anywhere else you may end up.

Page 17: Specifying sequences, GCG style; in order of increasing power and complexity

To answer the always perplexing GCG question, “What sequence(s)? . . . .”

The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and all From & To programs)

The sequence is in a local GCG database in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:,” always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression and they are case insensitive.

The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{},” containing the sequence specification, e.g. a wildcard: {*}.

Finally, the most powerful method of specifying sequences is in a GCG “list” file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@.” Furthermore, one can supply attribute information within list files to specify something special about the sequence.
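The four forms can be told apart by their punctuation alone. The classifier below is a hypothetical helper written for this illustration; it is not part of the Wisconsin Package, and it ignores edge cases such as file paths that themselves contain a colon.

```python
import re

def classify_spec(spec):
    """Guess which of the four GCG-style forms a specification string uses."""
    if spec.startswith("@"):
        # "@" precedes the name of a list file.
        return ("list file", spec[1:])
    braces = re.fullmatch(r"(.+)\{(.*)\}", spec)
    if braces:
        # file{specification} selects entries inside an MSF or RSF file.
        return ("multiple sequence file", braces.group(1), braces.group(2))
    if ":" in spec:
        # logicalname:entry points into a local database.
        logical, _, entry = spec.partition(":")
        return ("database entry", logical.upper(), entry)
    return ("single sequence file", spec)

for example in ("example.seq", "sw:ef1a_giala", "small.pfs.msf{*}", "@pileup_28.list"):
    print(example, "->", classify_spec(example))
```

The logical name is upper-cased because the slide notes that logical names are case insensitive; the example strings reuse file and entry names that appear elsewhere in these slides purely as sample input.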

Page 18: ‘Clean’ GCG format single sequence file after ‘reformat’ (or any of the From… programs)

This is a small example of GCG single sequence format.
Always put some documentation on top, so in the future
you can figure out what it is you're dealing with! The
line with the two periods is converted to the checksum line.

example.seq Length: 77 July 21, 1999 09:30 Type: N Check: 4099 ..

1 ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
51 GATTTAATAG CATGCGATCC CATGGGA

SeqLab’s Editor mode can also “Import” native GenBank format and ABI or LI-COR trace files!
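A short sketch shows how such a file splits at the line ending in two periods, and how the "Check:" value relates to the sequence. The checksum routine follows the commonly described GCG scheme (character codes weighted by position, with the weight cycling from 1 to 57, modulo 10000); treat it and the parsing rules as assumptions for illustration rather than the package's own code. The exact column spacing in the sample text is also assumed.

```python
def gcg_checksum(sequence):
    """Position-weighted sum of character codes, weight cycling 1..57, mod 10000."""
    total = 0
    for position, symbol in enumerate(sequence):
        total += (position % 57 + 1) * ord(symbol.upper())
    return total % 10000

def read_gcg_single(text):
    """Split a GCG single sequence file into (documentation, header, sequence)."""
    lines = text.splitlines()
    # The divider is the first line that ends with two periods.
    divider = next(i for i, line in enumerate(lines) if line.rstrip().endswith(".."))
    documentation = "\n".join(lines[:divider]).strip()
    header = lines[divider].strip()
    # Below the divider, keep letters only (drops position numbers and spacing).
    sequence = "".join(ch for line in lines[divider + 1:] for ch in line if ch.isalpha())
    return documentation, header, sequence

EXAMPLE = """\
This is a small example of GCG single sequence format.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""

documentation, header, sequence = read_gcg_single(EXAMPLE)
print(len(sequence), gcg_checksum(sequence))
```

For the sequence shown on the slide this gives a length of 77 and a checksum of 4099, matching the header line. Real files may also carry gap or stop symbols, which this letters-only toy parser would silently drop.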

Page 19: A BioInformatics Survey  . . .  just a taste, with an emphasis on the GCG suite

Logical terms for the Wisconsin Package

Sequence databases, nucleic acids:
GENBANKPLUS, GBP      all of GenBank plus EST and GSS subdivisions
GENBANK, GB           all of GenBank except EST and GSS subdivisions
BACTERIAL, BA         GenBank bacterial subdivision
EST                   GenBank EST (Expressed Sequence Tags) subdivision
GSS                   GenBank GSS (Genome Survey Sequences) subdivision
HTC                   GenBank High Throughput cDNA
HTG                   GenBank High Throughput Genomic
INVERTEBRATE, IN      GenBank invertebrate subdivision
OTHERMAMM, OM         GenBank other mammalian subdivision
OTHERVERT, OV         GenBank other vertebrate subdivision
PATENT, PAT           GenBank patent subdivision
PHAGE, PH             GenBank phage subdivision
PLANT, PL             GenBank plant subdivision
PRIMATE, PR           GenBank primate subdivision
RODENT, RO            GenBank rodent subdivision
STS                   GenBank (sequence tagged sites) subdivision
SYNTHETIC, SY         GenBank synthetic subdivision
TAGS                  GenBank EST and GSS subdivisions
UNANNOTATED, UN       GenBank unannotated subdivision
VIRAL, VI             GenBank viral subdivision

Sequence databases, amino acids:
GENPEPT, GP           GenBank CDS translations
SWISSPROTPLUS, SWP    all of Swiss-Prot and all of SPTrEMBL
SWISSPROT, SW         all of Swiss-Prot (fully annotated)
SPTREMBL, SPT         Swiss-Prot preliminary EMBL translations
PIR, P                all of PIR Protein
PROTEIN, PIR1         PIR fully annotated subdivision
PIR2                  PIR preliminary subdivision
PIR3                  PIR unverified subdivision
PIR4                  PIR unencoded subdivision
NRL_3D, NRL           PDB 3D protein sequences

General data files:
GENMOREDATA           path to GCG optional data files
GENRUNDATA            path to GCG default data files

These are easy: they make sense, and you’ll have a vested interest.


GCG MSF & RSF format

The trick is to not forget the braces and ‘wild card,’ e.g. filename{*}, when specifying!
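As a rough illustration of what the brace-and-wildcard notation expresses, the sketch below splits a specification such as filename{*} into a file part and an entry pattern, then matches that pattern against entry names. This is not GCG code: the helper names `split_spec` and `select_entries`, the case-insensitive matching, and the choice to select nothing when the braces are missing are all assumptions made for the example.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical helper, not part of GCG: "file{pattern}" -> (file, pattern).
SPEC = re.compile(r"^(?P<path>[^{}]+)\{(?P<pattern>[^{}]*)\}$")


def split_spec(spec):
    """Return (path, pattern); pattern is None when no braces are present."""
    spec = spec.strip()
    match = SPEC.match(spec)
    if match is None:
        return spec, None
    return match.group("path"), match.group("pattern")


def select_entries(spec, entry_names):
    """Return the entry names matched by the brace pattern.

    Assumption for this sketch: matching ignores case, and a spec
    without braces selects no entries inside the file.
    """
    _, pattern = split_spec(spec)
    if pattern is None:
        return []
    return [name for name in entry_names
            if fnmatchcase(name.lower(), pattern.lower())]


names = ["ef1a_giala", "ef1a_human", "eftu_ecoli"]
print(split_spec("another.rsf{ef1a_*}"))            # ('another.rsf', 'ef1a_*')
print(select_entries("another.rsf{ef1a_*}", names)) # ['ef1a_giala', 'ef1a_human']
print(select_entries("Ef1a-Tu.msf{*}", names))      # all three names
```

The entry names in `names` are made up for the demonstration; only the spec strings follow the notation on the slide.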

!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum 7342
offset 23
creation-date 07/11/2001 16:51:19
strand 1
comments ////////////////////////////////////////////////////////////

!!AA_MULTIPLE_ALIGNMENT 1.0

small.pfs.msf MSF: 735 Type: P July 20, 2001 14:53 Check: 6619 ..

Name: a49171 Len: 425 Check: 537 Weight: 1.00
Name: e70827 Len: 577 Check: 21 Weight: 1.00
Name: g83052 Len: 718 Check: 9535 Weight: 1.00
Name: f70556 Len: 534 Check: 3494 Weight: 1.00
Name: t17237 Len: 229 Check: 9552 Weight: 1.00
Name: s65758 Len: 735 Check: 111 Weight: 1.00
Name: a46241 Len: 274 Check: 3514 Weight: 1.00

//
//////////////////////////////////////////////////

RSF is SeqLab’s native format.
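The MSF header shown above is regular enough to read with a few lines of code. The following is a minimal sketch, assuming each entry sits on its own line in the "Name: ... Len: ... Check: ... Weight: ..." layout and that a line beginning with // ends the header; `read_msf_header` is a hypothetical name, not a GCG or library function.

```python
import re

# Two entries copied from the slide; the sequence data itself is omitted.
MSF_TEXT = """\
!!AA_MULTIPLE_ALIGNMENT 1.0

small.pfs.msf MSF: 735 Type: P July 20, 2001 14:53 Check: 6619 ..

 Name: a49171 Len: 425 Check: 537 Weight: 1.00
 Name: e70827 Len: 577 Check: 21 Weight: 1.00

//
"""

NAME_LINE = re.compile(
    r"^\s*Name:\s+(\S+)\s+Len:\s+(\d+)\s+Check:\s+(\d+)\s+Weight:\s+([\d.]+)"
)


def read_msf_header(text):
    """Collect the per-sequence header entries that precede the // divider."""
    entries = []
    for line in text.splitlines():
        if line.startswith("//"):
            break  # header is over; alignment data would follow
        match = NAME_LINE.match(line)
        if match:
            name, length, check, weight = match.groups()
            entries.append({
                "name": name,
                "len": int(length),
                "check": int(check),
                "weight": float(weight),
            })
    return entries


for entry in read_msf_header(MSF_TEXT):
    print(entry["name"], entry["len"], entry["check"], entry["weight"])
```

Stopping at the // line keeps the sketch from misreading alignment rows as header entries.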


The List File Format

An example GCG list file of many elongation 1a and Tu factors follows. As with all GCG data files, two periods separate documentation from data.
..
my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}

@[email protected]

The ‘way’ SeqLab works!

Remember the @ sign!
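The list file layout described above can be sketched as a small parser: skip the documentation up to the two-period divider, then read one specification per line, with optional begin: and end: attributes and a leading @ marking a nested list file. This is an illustration only, not how SeqLab itself reads list files; `parse_list_file` is a hypothetical helper, and the nested file name `more.list` is invented for the example.

```python
EXAMPLE = """\
An example list file. Two periods separate documentation from data. ..
my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@more.list
"""


def parse_list_file(text):
    """Return one dict per entry found after the first '..' divider.

    Assumption for this sketch: if no divider is found, every line is
    treated as data.
    """
    lines = text.splitlines()
    start = 0
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            start = index + 1  # data begins on the line after the divider
            break

    entries = []
    for line in lines[start:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        spec = fields[0]
        attrs = {}
        for field in fields[1:]:
            key, sep, value = field.partition(":")
            if sep:
                attrs[key.lower()] = value
        entries.append({
            "spec": spec.lstrip("@"),
            "is_list": spec.startswith("@"),  # @ means "another list file"
            "attrs": attrs,
        })
    return entries


for entry in parse_list_file(EXAMPLE):
    print(entry)
```

Only the first field of each line is taken as the specification, so a database reference such as SwissProt:EfTu_Ecoli is not mistaken for an attribute.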


SeqLab: GCG’s X-based GUI!

SeqLab is the merger of Steve Smith’s Genetic Data Environment and GCG’s Wisconsin Package Interface:

GDE + WPI = SeqLab

Requires an X-Windowing environment, either native on UNIX computers (including Linux; it is not included by Apple in Mac OS X [v.10+], but see Apple’s X11 package and XDarwin) or emulated with X-server software on personal computers.


Conclusions

Gunnar von Heijne in his old but quite readable treatise, Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:

“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”

He continues:

“. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”


FOR MORE INFO...

Many fine texts are also starting to become available in the field.

To ‘honk-my-own-horn’ a bit, check out the new Current Protocols in Bioinformatics from John Wiley & Sons, Inc: http://www.does.org/cp/bioinfo.html. They asked me to contribute a chapter on multiple sequence analysis using GCG software.

Humana Press, Inc. also asked me to contribute. I’ve got two chapters in their Introduction to Bioinformatics: A Theoretical And Practical Approach: http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0. Both volumes are now available.

Visit my Web page: http://bio.fsu.edu/~stevet/cv.html.

Contact me ([email protected]) for specific bioinformatics assistance and/or long distance collaboration.