curators guide for dbs

Upload: aditya-teja

Post on 10-Apr-2018


  • 8/8/2019 Curators Guide for Dbs

    1/36

    Curators Guide for Pathway/Genome Databases

    Pathway Tools Version 11.0

    Ron Caspi, Carol Fulcher, John Ingraham, Ingrid Keseler, Markus Krummenacker, Suzanne Paley,

    SRI International

    333 Ravenswood Ave.

    Menlo Park, CA 94025

    [email protected]

    July 3, 2007

    Previous Contributors: Martha Arnaud, Cynthia Krieger


    Contents

    1 Introduction
    1.1 Definitions
    2 Literature Search
    2.1 Recommended Databases for Literature Search
    2.1.1 PubMed
    2.1.2 MEDLINE
    2.1.3 BIOSIS via LANL
    2.1.4 SciSearch
    3 PGDB Curation
    3.1 Naming Issues
    3.2 Overview of PGDB Content
    3.2.1 Pathways
    3.2.2 Chemical Compounds
    3.2.3 Compound Classes for Broad Substrate Specificity and Polymerization
    3.2.4 Reactions
    3.2.5 Proteins
    3.2.6 Genes
    3.3 Summaries and History Notes
    3.3.1 Writing Style Guidelines
    3.3.2 Formatting Summaries
    3.3.3 Say It in Your Own Words
    3.3.4 Citation Guidelines
    3.4 Saving Changes
    3.5 Evidence Codes
    4 Pathway Curation
    4.1 Pathway Definitions
    4.2 Defining Pathway Start and End Points
    4.3 Pathway Links
    4.4 Limitations
    4.5 Database Searching Strategies
    4.6 Pathway Entry
    5 EcoCyc-Specific Information
    5.1 E. coli Gene Frame Names
    5.2 Interrupted Genes
    6 MetaCyc-Specific Information
    6.1 Database Searching Strategies for MetaCyc
    6.2 Species Information
    6.2.1 Taxonomic Range Information
    6.3 E. coli Pathways in MetaCyc
    6.4 Pathways from Other Pathway Databases
    6.5 Proteins as Substrates in MetaCyc
    6.6 Curation with Classification Systems
    6.6.1 Gene Classes
    7 Organism Summary
    8 Update Propagation among KBs
    8.1 Invoking KB Updating
    8.2 Overview of the Updating Procedure
    9 Database Release Process
    9.1 Correctify
    9.2 Release Notes
    9.2.1 Database Statistics
    9.2.2 Updates to the Notes
    9.3 Updates to the General KB Information
    10 Programming Hints


    1 Introduction

    This guide contains information for curators of Pathway/Genome Databases (PGDBs) such as EcoCyc [4] and MetaCyc [3]. PGDBs are created, updated, and queried using the Pathway Tools software system [2].

    This curator's guide addresses PGDB conventions, literature search and review, PGDB entry and editing, and a variety of programs that may be run periodically on PGDBs. Since the roles of curators vary, only parts of this guide may be relevant to any particular reader. For example, some curators may not conduct literature research, and others may never run any of the automated programs. Furthermore, some sections of this guide are specific to the curation of a particular PGDB, such as the organism-specific PGDB EcoCyc, and may not apply to other PGDBs. For example, the organism-specific section for EcoCyc addresses such issues as the conventions for naming E. coli genes. Organism-specific sections are labeled as such.

    Another important source of information for curators is the Pathway Tools User's Guide.

    This guide is a work in progress. Please mail suggestions for improvements to [email protected].

    1.1 Definitions

    The following terms are used in this manual.

    Pathway/Genome Database: A PGDB describes the genome of an organism (its chromosome(s), genes, and genome sequence), the product of each gene, the biochemical reaction(s) catalyzed by each gene product, the substrates of each reaction, and the organization of reactions into pathways. A PGDB can also describe the genetic network of an organism: its promoters, operons, transcription factors, and transcription-factor binding sites. A PGDB is a type of MOD (Model Organism Database).

    EcoCyc Database: A PGDB for the organism E. coli. The majority of the information in EcoCyc is derived from the biomedical literature.

    MetaCyc Database: A PGDB containing metabolic data for many different organisms. The goal of MetaCyc is to contain broad coverage of experimentally elucidated metabolic pathways from many different organisms, rather than to attempt to model the complete pathway complement of any particular organism. MetaCyc contains a broad base of well-established pathways that are used by the PathoLogic program to predict the pathway complement of a particular organism, which is modeled within a separate PGDB for that organism. The majority of the information in MetaCyc is derived from the biomedical literature.

    BioCyc Knowledge Library: The collection of PGDBs at URL BioCyc.org is called the BioCyc Knowledge Library. EcoCyc, MetaCyc, and BsubCyc are all component databases within the BioCyc Library.


    2 Literature Search

    2.1 Recommended Databases for Literature Search

    2.1.1 PubMed

    This database is a quick and easy starting point for researching metabolic pathways. Regardless of which database you use for your literature search, you'll want to get the PubMed or MEDLINE citation number for each article you reference, so that you can add a web link from your PGDB to PubMed. For articles that don't have PubMed or MEDLINE citations, you'll have to create a frame for the reference (see the Pathway Tools User's Guide). For a description of PubMed searching strategies, see the following URL: http://www.ncbi.nlm.nih.gov:80/entrez/query/static/help/pmhelp.html#PubMedSearching. The word "coli" is useful for restricting EcoCyc reference searches that would otherwise return a huge number of irrelevant results. Using the Limits option to limit searching to titles and abstracts also helps in cases where the gene name happens also to be the name of a person or institution, or part of an institution's address.
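
The two restriction strategies above can be sketched as a small query-string builder. This is only an illustration: the helper name and its parameters are hypothetical, and it expresses the title/abstract restriction with PubMed's `[Title/Abstract]` field tag rather than through the Limits option described in the text.

```python
def build_pubmed_term(gene_name, organism_word="coli", title_abstract_only=True):
    """Build a PubMed search term for a gene name (hypothetical helper).

    organism_word: extra word used to narrow the search, e.g. "coli".
    title_abstract_only: restrict the gene name to titles and abstracts.
    """
    gene = f"{gene_name}[Title/Abstract]" if title_abstract_only else gene_name
    parts = [gene]
    if organism_word:
        parts.append(organism_word)
    return " AND ".join(parts)


print(build_pubmed_term("thrA"))
```

The resulting string can be pasted into the PubMed search box; drop `organism_word` when searching for MetaCyc references that are not specific to E. coli.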

    2.1.2 MEDLINE

    MEDLINE, when accessed other than through PubMed, offers more database searching functions, such as combining different searches, subtracting others, etc.

    2.1.3 BIOSIS via LANL

    This database is accessible free of charge to Stanford University, so you can access it from any terminal connected to Stanford. If you have access to Lane library, there are computers there that can be used for database searching.

    2.1.4 SciSearch

    This database allows you to search for articles that cite an article of interest, which enables you to search for related articles that have been published after the article of interest.

    3 PGDB Curation

    This section discusses curation of specific PGDB datatypes.

    Figure 1 provides an overview of the relationships among some of the main PGDB datatypes. A pathway is connected to its component reactions, which in turn are connected to the enzymes that catalyze them. Enzymes are connected to the genes that encode them, which are connected to the replicons (chromosomes or plasmids) on which they reside. Enzymes can be active as monomers or as complexes; complexes are connected to objects that represent their polypeptide subunits. Reactions are also connected to objects representing their substrates (such as Metabolite-1 in this example). Genes are also connected to objects representing the transcription units (operons) containing them. Transcription units are also connected to their promoters.


    Figure 1: Relationships among PGDB datatypes. Each name in this drawing denotes a PGDB object of a given datatype; for example, Reaction-1 is an object that represents a reaction. A line between two objects indicates that a DB relationship exists between the objects; for example, a slot (attribute) of Reaction-1 called In-Pathway has Pathway-1 as a value, thus linking the reaction to the pathway that contains it. All of these links are bi-directional; for example, a slot of Pathway-1 called Reaction-List has Reaction-1 as a value.
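
The bi-directional links described in the caption can be illustrated with a toy frame store. This is a minimal sketch and not how Pathway Tools stores frames: the `FrameStore` class and the `INVERSE` table are hypothetical, and only the In-Pathway / Reaction-List pairing is taken from the caption.

```python
from collections import defaultdict

# Hypothetical inverse-slot table; only this one pairing comes from Figure 1.
INVERSE = {"In-Pathway": "Reaction-List", "Reaction-List": "In-Pathway"}


class FrameStore:
    """Toy store mapping frame id -> slot name -> list of values."""

    def __init__(self):
        self.slots = defaultdict(lambda: defaultdict(list))

    def link(self, frame, slot, value):
        """Add value to frame.slot and keep the inverse slot on value in sync."""
        if value not in self.slots[frame][slot]:
            self.slots[frame][slot].append(value)
        inverse = INVERSE.get(slot)
        if inverse and frame not in self.slots[value][inverse]:
            self.slots[value][inverse].append(frame)

    def get(self, frame, slot):
        """Return a copy of the values stored in frame.slot."""
        return list(self.slots[frame][slot])


kb = FrameStore()
kb.link("Reaction-1", "In-Pathway", "Pathway-1")
print(kb.get("Pathway-1", "Reaction-List"))  # ['Reaction-1']
```

Setting the link from either end yields the same pair of slot values, which is the property the caption describes.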

    3.1 Naming Issues

    A variety of naming issues arise when developing PGDBs. We strive for consistency in naming. One reason consistency is important is so that users of a database can rely on a naming scheme being used in a systematic way. For example, if a user wishes to find all degradation pathways, they can do a substring search for the term "degradation", and we can ensure that they will not miss pathways that use the term "catabolism" instead.

    Although we advocate using a single consistency criterion for the common names of different object types, such as chemicals, enzymes, and pathways, it is also reasonable, and preferable, to consistently apply several naming schemes when creating synonyms, if you believe that different communities of users are likely to use different sorts of names when querying different entities. The addition of synonyms for different object types increases the flexibility and robustness of the database. Synonyms 1) enable users to search the database using alternative names and readily find the information, and 2) prevent the addition of redundant information to the database under alternative names. Furthermore, if a reaction does not have a specific Enzyme Commission (EC) number, the software program PathoLogic [2] relies on enzyme names to correlate an annotated gene with an enzyme, which underscores the importance of enzyme names.

    Try not to put commas inside the common names for objects, because the Pathway Tools software uses commas as separators between the names of objects when it displays a list of objects. One alternative is to use hyphens instead of commas.

    More detailed naming conventions are presented in sections that follow on each datatype.

    3.2 Overview of PGDB Content

    The following sections discuss curation of different PGDB datatypes, such as the naming conventions for each datatype. Frequently, it is beneficial to know the type of information collected in PGDBs prior to curating a given pathway or other object type, so these sections summarize the information to be gathered for each datatype.

    3.2.1 Pathways

    Please note that Section 4 discusses curation of pathways in more detail.

    Summary of information collected for pathways:

    Common name

    Synonym(s) of pathway name

    Superclass(es) of pathway

    Names of species in which the pathway has been experimentally demonstrated (only relevant to MetaCyc, since other PGDBs are specific to a given organism).

    Summary

        General description of the pathway and its significance.

        Statement regarding the initial and end substrates. If the pathway is defined as the degradation of substrate A to substrate E, but E is further degraded, comment on how E is further degraded and to what end products in the species noted.

        Statement regarding whether the pathway is shared among different types of species.

        Relationship to similar pathways in the same or different species.

        Relationship to linked pathways (preceding and subsequent pathways), and sub- and super-pathways (see definitions), if applicable.

        Experimental evidence for the pathway.

        Highlight interesting or novel reactions/enzymes in the pathway.

        If the pathway contains proposed intermediates or hypothetical reactions, discuss the relevant circumstantial or preliminary evidence.

    Links to other pathways within the same PGDB.

    Links to other DBs, such as other PGDBs, the WIT DB, and The University of Minnesota Biocatalysis/Biodegradation DB (UM-BBD).

    Label hypothetical reactions as such.

    Net reaction equation: specifies the net chemical transformation of a pathway, including stoichiometry, as defined in the PGDB. This can be added via the frame editor.

    Citations


    Pathway Naming. When adding a new pathway name, try to use the format and style of other, similar pathway names.

    For consistency, please use the terms "degradation" and "biosynthesis" for pathway common names when appropriate. Do not use the terms "catabolism", "anabolism", or "utilization". If "degradation" or "biosynthesis" is not appropriate, "metabolism" may be used.

    Pathway variants should be enumerated with Roman numerals, not Arabic numerals. For example, TCA cycle variation I, TCA variation II, etc.

    Begin the name of any superpathway with "Superpathway of...".

    Explicitly note pathways that are anaerobic by including "(anaerobic)" at the end of the pathway name. The one exception is the anaerobic energy pathways (anaerobic respiration, etc.), in which the word "anaerobic" is found at the beginning of the pathway name.

    Write out the names of amino acids in full within all pathway names, rather than using full names within some pathway names and abbreviations within others.
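
Several of the pathway naming rules above are mechanical enough to check automatically. The following is a minimal sketch of such a check; the function name, its parameters, and the wording of the messages are hypothetical and not part of Pathway Tools. It also flags commas, following the advice in Section 3.1.

```python
import re

# Terms the naming convention asks curators to avoid in pathway common names.
BANNED_TERMS = ("catabolism", "anabolism", "utilization")


def check_pathway_name(name, is_superpathway=False):
    """Return a list of convention violations for a pathway common name."""
    problems = []
    lowered = name.lower()
    for term in BANNED_TERMS:
        if term in lowered:
            problems.append(
                f"uses '{term}'; prefer degradation, biosynthesis, or metabolism"
            )
    # Variants should be numbered with Roman numerals, e.g. "variation II".
    if re.search(r"\bvariation\s+\d+\b", lowered):
        problems.append("variant numbered with digits; use Roman numerals")
    if is_superpathway and not lowered.startswith("superpathway of"):
        problems.append("superpathway name should begin with 'Superpathway of'")
    # Commas act as list separators when object names are displayed.
    if "," in name:
        problems.append("contains a comma; consider a hyphen instead")
    return problems


print(check_pathway_name("TCA cycle variation 2"))
```

A check like this cannot judge whether "degradation" or "biosynthesis" is the appropriate term for a given pathway, nor whether amino acid names are written out in full; those decisions remain with the curator.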

    3.2.2 Chemical Compounds

    Summary of information collected for chemical compounds:

    Chemical name and synonym(s)

    Superclass(es) of chemical compound

    Chemical structure

    Chemical formula and molecular weight are calculated from the chemical structure.

    Gibbs free energy of formation. This can be added via the frame editor. Add it if you come across this information, but do not spend time searching for it.

    Links to other databases, such as other PGDBs, Chemical Abstract Service (CAS), etc.

    Summary

    Citation(s)

    Chemical names, synonyms and structures may be found in a variety of sources, such as the following:

    On-line databases

        Kyoto Encyclopedia of Genes and Genomes (KEGG): (http://www.genome.ad.jp/kegg/ligand.html). Go to Search Compound.

        The University of Minnesota Biocatalysis/Biodegradation Database (UMBBD): (http://umbbd.ahc.umn.edu/index.html). Use compound search.

        Google and Google Image

        Online literature databases

        Sigma-Aldrich product catalog: (http://www.sigmaaldrich.com/).

        Klotho Biochemical Compounds Declarative Database: (http://www.biocheminfo.org/klotho/).

        IUBMB Enzyme Nomenclature Database: (http://www.chem.qmul.ac.uk/iubmb/enzyme/). This DB can be helpful if the chemical of interest is a substrate or product of an EC reaction, and there is a link from the reaction to a reaction diagram displaying the chemical structure.

        WIT Database: (http://wit.mcs.anl.gov/WIT2/).

        Enzymes and Metabolic Pathways (EMP) Database: (http://emp.mcs.anl.gov/).

        Agency for Toxic Substances and Disease Registry (ATSDR): (http://www.atsdr.cdc.gov/toxfaq.html).

    Books and chemical catalogs

        Merck Index

        General chemistry and biochemistry textbooks, if the chemical is common or is a central metabolite

        Sigma catalog

        Aldrich catalog

    Chemical structures and synonyms may also be found by searching the scientific literature for research articles. PubMed can be searched readily on-line. If you are lucky, you will find a citation for a paper that is available on line and contains a figure with the chemical structure. Articles cited in the References section of MetaCyc enzyme and pathway pages are good candidates to try searching.

    If a compound cannot be found by name-based searching, it can sometimes be found by searching for a reaction in which it is known to be a substrate, or by searching for an enzyme that is known to catalyze a reaction in which it is a substrate.

    It is desirable but not essential to find each compound in at least two sources, both to ensure correctness of its structure and to aid in finding additional synonyms.

    It is often faster to enter a structure by copying and modifying the structure of an existing similar compound than by entering the structure from scratch.

    If a structure is not found for a compound, record this fact by using the frame editor to enter a value of the following form in the Comment-Internal slot: "No structure found in search on DATE by PERSON."

    Chemical Naming. Common names for chemical compounds should not be capitalized except where uppercase characters are strictly required, e.g., use "L-tryptophan", not "L-Tryptophan". In addition, for consistency, when naming organic acids, please use the name of the conjugate base as the common name and the conjugate acid as a synonym. For example, the common names for acetic acid and benzoic acid should be acetate and benzoate, respectively, and their respective synonyms should be acetic acid and benzoic acid.

    Many chemicals have a common name as well as a trivial name and an IUPAC (International Union of Pure and Applied Chemistry) name. For example, dimethyl ketone (common name) has the trivial name of acetone and the IUPAC name of 2-propanone. It is also known as beta-ketopropane and 2-oxopropane. Different communities of users should be able to query the chemical using any one of these names and find the chemical entry. To make this possible, each chemical entry in the PGDB should include the chemical common name(s), trivial name(s), and IUPAC name(s) as either the common name or a synonym. Furthermore, some functional groups have multiple prefixes and/or suffixes. For example, the functional group of ketones, >C=O, has two different prefixes, "oxo-" (IUPAC name) and "keto-" (common name). The bifunctional compound 4-oxohexanoate is equivalent to 4-oxohexanoic acid, 4-ketohexanoate, and 4-ketohexanoic acid, and thus the chemical entry for this compound should include all of these names as either the common name or a synonym. Furthermore, if you think that many people are likely to use names such as both "glucose-6-P" and "glucose-6-phosphate", you may wish to consistently use one style for the common name and one style as a synonym. Note that the Pathway Tools matching software for chemical names eliminates ambiguity due to capitalization, hyphenation, and spacing. Therefore the names "glucose-6-P" and "glucose 6-p" will match one another during a user query, so there is no need to enter both of these names as synonyms.
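
The effect of that matching behavior can be illustrated with a simple normalization function. This is a sketch of the idea only, not the actual Pathway Tools matching code; the function names and the frame IDs in the example dictionary are hypothetical.

```python
def normalize_name(name):
    """Fold case and drop hyphens and whitespace before comparing names."""
    return "".join(ch for ch in name.lower() if ch != "-" and not ch.isspace())


def find_compounds(query, compounds):
    """Return frame ids whose common name or synonyms match the query.

    compounds: dict mapping frame id -> list of names (common name + synonyms).
    """
    key = normalize_name(query)
    return [
        frame_id
        for frame_id, names in compounds.items()
        if any(normalize_name(n) == key for n in names)
    ]


# Hypothetical frame ids; the names follow the conventions in this section.
compounds = {
    "CPD-A": ["glucose-6-phosphate", "glucose-6-P"],
    "CPD-B": ["acetate", "acetic acid"],
}
print(find_compounds("glucose 6-p", compounds))  # ['CPD-A']
```

Under this kind of matching, "glucose-6-P" and "glucose 6-p" collapse to the same key, whereas "glucose-6-phosphate" does not, which is why the longer form still has to be entered as a separate name.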

    3.2.3 Compound Classes for Broad Substrate Specificity and Polymerization

    Some enzymes (and their reactions) have broad and/or ill-characterized substrate specificities. This section explains the rationale and philosophy for faithfully representing these complexities.

    Most reactions in small-molecule metabolism convert distinct and well-defined compounds, which are represented by compound instance frames that reside under the Compounds class hierarchy. However, a substantial number of reaction equations, including many that were imported from the EC enzyme classification system, refer to broad compound classes. The question then arises of how to represent these classes so that much of the chemical logic of metabolic pathways can be captured and made computationally accessible.

    One of the strengths of the PGDB schema is that the concept of broad substrate specificity can be faithfully captured by using a compound class frame instead of a compound instance as a substrate in the reaction equation. Such a compound class can have multiple specific instances, i.e., example compounds for which the reaction equation is known or presumed to hold true.

    Organization of the Compound Class Hierarchy. Compounds are organized in a class hierarchy for several reasons, including specification of groups of compounds that are interchangeable as enzyme substrates, and to allow user navigation and retrieval of sets of compounds that share functional groups or that share metabolic purposes.

    Underneath the class Compounds, compound classes describing functional groups include All-Amines, All-Carboxy-Acids, All-Amino-Acids, and All-Carbohydrates. Classes describing metabolic purposes include Coenzymes, Hormones, Vitamins, and Secondary-Metabolites. These classification hierarchies are not yet as fully developed as they could and should be, and may be subject to additional refinement and rearrangement.

    The special class Unclassified-Compounds serves as the catch-all bin for compounds that have not yet been carefully placed in an appropriate category. Newly created compound instances (when the curator uses the Compound Editor [2]) are created in this class by default, unless another class is explicitly specified.

    For now, please create new compound classes in the KB SCHEMABASE, from which they should be propagated to organism-specific PGDBs. We plan to improve the mechanism for creating new classes; for now, please create them using the Taxonomy Editor.


    Broad Substrate Specificity. Compound classes allow a PGDB to represent broad substrate specificities in reaction equations. By using a compound class as a substrate in a reaction equation, you are effectively stating that all instances of that class are interchangeable within that reaction. As an example in EcoCyc, the reaction LYSOPHOSPHOLIPASE-RXN (EC 3.1.1.5) contains two classes:

    a 2-acylglycerophosphocholine + H2O = a fatty acid + L-1-glycero-3-phosphocholine

    The class 2-Acylglycero-Phosphocholines on the left side stands for a group of lysolecithin compounds that have a variable-length fatty acid tail. On the right side, the fatty acids class stands for the corresponding hydrolysis products. In contrast, H2O and L-1-glycero-3-phosphocholine are specific compounds, represented as instances.

    What usually determines the exact extent of the specificity is the enzyme that catalyzes the reaction, and this may even differ depending on the isozyme or organism. Because the classes used for representing substrate specificity do not necessarily fully overlap with the classification based on the chemical viewpoint, the curator often has to make an informed decision about how much can be safely assumed about the likely extent of specificity in a given reaction equation.

    As a complicated example, in EcoCyc, the class Nucleosides is a substrate in the transport reaction TRANS-RXN-108. However, the narrower subclass Ribonucleosides is a substrate in the 3-NUCLEOTID-RXN phosphatase reaction, which appears to apply only to RNA-derived substrates and not to the other major subclass, Deoxy-Ribonucleosides. The overlap of these classes is quite good between their viewpoints of functional groups versus roles as broad reaction substrates. However, one subtle difference is that the class Ribonucleosides includes the compound INOSINE, because this class definition clearly makes sense from the functional group viewpoint. However, in the 3-NUCLEOTID-RXN, it is unlikely that INOSINE will ever be a substrate that occurs biologically in the RNA material that is processed. Such classification decisions often have to be made, and more precise guidelines will likely evolve for some time to come.

    In summary, to represent broad substrate specificity in reaction equations, classes can be used that will generally contain specific compound instances, and could in principle acquire additional instances as new compounds are added to the database and as the enzymes for the reactions are investigated in greater biochemical detail. Maintaining detailed information about compound classes and their instances allows inferences to be made about compound conversions mediated by enzymes with broad specificity, which would not otherwise become apparent in the metabolic network.
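
The kind of inference described above can be sketched with a toy class hierarchy built from the nucleoside example. The dictionaries below are illustrative stand-ins, not the PGDB schema: the substrate sets are deliberately reduced to the classes named in the text, and DEOXYADENOSINE is a hypothetical frame ID added to show a non-matching case.

```python
# Child class or instance -> parent class (toy fragment of the hierarchy).
PARENT = {
    "Ribonucleosides": "Nucleosides",
    "Deoxy-Ribonucleosides": "Nucleosides",
    "INOSINE": "Ribonucleosides",
    "DEOXYADENOSINE": "Deoxy-Ribonucleosides",  # hypothetical frame ID
}

# Reaction -> substrate classes named in the text (other substrates omitted).
REACTION_SUBSTRATES = {
    "TRANS-RXN-108": {"Nucleosides"},
    "3-NUCLEOTID-RXN": {"Ribonucleosides"},
}


def ancestors(frame):
    """Yield the frame itself, then each enclosing class up the hierarchy."""
    while frame is not None:
        yield frame
        frame = PARENT.get(frame)


def may_be_substrate(compound, reaction):
    """True if the compound, or any class containing it, is a listed substrate."""
    return any(f in REACTION_SUBSTRATES[reaction] for f in ancestors(compound))


print(may_be_substrate("DEOXYADENOSINE", "TRANS-RXN-108"))    # True
print(may_be_substrate("DEOXYADENOSINE", "3-NUCLEOTID-RXN"))  # False
print(may_be_substrate("INOSINE", "3-NUCLEOTID-RXN"))         # True
```

The last result illustrates the caveat raised in the text: a purely class-based inference reports INOSINE as a possible substrate of 3-NUCLEOTID-RXN even though that is biologically unlikely, so such inferences still need curator judgment.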

    Ill-characterized Substrates. Some EC reactions describe very broad conversions of functional groups, such as the reduction of aldehydes to alcohols (EC 1.1.1.2):

    NADP + an alcohol = NADPH + an aldehyde

    This is so broad that it would be dangerous to make assumptions about which compound instances are truly affected. Often, the description of such reaction equations is so vague that it is very difficult to search and obtain information on the scope of the reaction. In other cases, the biochemical experiments that could delineate the extent of the specificity more precisely simply have not yet been performed. For example, in HumanCyc, a substantial number of function assignments to enzymes were made by sequence similarity to a family of enzymes that is known to convert a general class of compounds, but the true substrate in humans has not yet been determined for every particular putative enzyme.

    In these cases, it seems best for now to create compound classes that contain no compound instances. This is in contrast to the classes that represent broad substrate specificity, which generally will contain compound instances. The classes representing ill-characterized substrates serve as placeholders until more information becomes available that allows a better reclassification.

    In the special class Pseudo-Compounds, ultra-ill-defined compound anomalies can be recorded until they can be resolved more satisfactorily.

    Polymerization. A number of EC reactions describe what are in truth polymerization reactions. Various polymerases, nucleases, and also fatty acid synthase and the corresponding degradation can all be viewed as causing polymerizations or de-polymerizations. Polymerization under a high degree of enzymatic control is what sets cells apart from a simple bag of small molecules and their direct interconversions. Because of the key role that polymerization plays in biology, and to close the loop computationally for the full metabolic flow of materials, a representation had to be found that captures the essence of polymerization reactions. As far as we know, this feature is unique to the PGDB schema; other metabolic pathway databases do not yet have high-level, concise ways of representing polymerizations.

    This can be done in the PGDB schema by treating a growing polymer as if it were a compound class, and as if its hypothetical instances (which do not necessarily have to be explicitly enumerated) represent the intermediate products as the polymer is being grown. Internally, PGDB reaction equations that add a monomer to a polymer will generally list the compound class representing the polymer on both sides of the reaction equation. But when Pathway Tools displays the name of that polymer to the user in a reaction or pathway drawing, it generates different names for different lengths of the polymer, depending on the side of the equation on which the class appears.

    For example, the reaction FOLYLPOLYGLUTAMATESYNTH-RXN (EC 6.3.2.17) refers to the identical compound class THF-GLU-N on both sides of the equation, but the display of this reaction shows the appropriate (n) and (n+1) versions of the polymer name:

    L-glutamate + H4PteGlu(n) + ATP = H4PteGlu(n+1) + phosphate + ADP

    These additional names are stored in three special slots in the compound class frame, namely N-NAME, N+1-NAME, and N-1-NAME. Each of these slots can be edited to contain a single string, which can be selected as an alternative to COMMON-NAME in reaction and pathway displays.

    In a reaction frame that refers to a polymer, the appropriate name is selected by attaching the NAME-SLOT annotation to the polymer compound class frame where it appears as a value of the LEFT or RIGHT slot in the reaction frame. The value of the NAME-SLOT annotation is the symbol that names one of the three slots mentioned above.

    Using the Show Frame menu command to print the frame contents in the example above shows these annotations:

    --- Instance FOLYLPOLYGLUTAMATESYNTH-RXN ---
    Types: EC-6.3.2
    COMMENT: "This reaction is involved in the conversion of folates to
        polyglutamate derivatives."
    COMMON-NAME: "FOLYLPOLYGLUTAMATE SYNTHASE"
    CREATION-DATE: 3000588944
    CREATOR: |mriley|
    EC-NUMBER: "6.3.2.17"
    ENZYMATIC-REACTION: FOLYLPOLYGLUTAMATESYNTH-ENZRXN
    IN-PATHWAY: FOLSYN-PWY
    LEFT:
        GLT
        THF-GLU-N
            ---NAME-SLOT: N-NAME
        ATP
    NAMES: "FOLYLPOLYGLUTAMATE SYNTHASE",
        "FOLYLPOLY-GAMMA-GLUTAMATE SYNTHETASE"
    OFFICIAL-EC?: T
    RIGHT:
        THF-GLU-N
            ---NAME-SLOT: N+1-NAME
        |Pi|
        ADP
    SCHEMA?: T
    SUBSTRATES: ADP, |Pi|, GLT, THF-GLU-N, ATP
    SYNONYMS: "FOLYLPOLY-GAMMA-GLUTAMATE SYNTHETASE"
    TEMPLATE-FILE: "/home/bolinas2/ecocyc/templates/new/folsyn/folC.template"
    ____________________________________________
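
To make the role of the NAME-SLOT annotation concrete, the following sketch renders the reaction from a simplified, hypothetical in-memory version of the frames. It is not the Pathway Tools display code. The display names are taken from the equation shown earlier; the COMMON-NAME used for THF-GLU-N is a placeholder, since its real common name is not given here.

```python
# Frame id -> slot name -> string. Only the name slots needed here are modeled.
NAMES = {
    "GLT": {"COMMON-NAME": "L-glutamate"},
    "ATP": {"COMMON-NAME": "ATP"},
    "ADP": {"COMMON-NAME": "ADP"},
    "Pi": {"COMMON-NAME": "phosphate"},
    "THF-GLU-N": {
        "COMMON-NAME": "THF-GLU-N",  # placeholder, real common name not given
        "N-NAME": "H4PteGlu(n)",
        "N+1-NAME": "H4PteGlu(n+1)",
    },
}

# Each substrate is (frame id, NAME-SLOT annotation or None).
REACTION = {
    "LEFT": [("GLT", None), ("THF-GLU-N", "N-NAME"), ("ATP", None)],
    "RIGHT": [("THF-GLU-N", "N+1-NAME"), ("Pi", None), ("ADP", None)],
}


def label(frame_id, name_slot=None):
    """Pick the annotated name slot, falling back to COMMON-NAME."""
    slots = NAMES[frame_id]
    return slots.get(name_slot or "COMMON-NAME", slots["COMMON-NAME"])


def render(reaction):
    """Format a reaction equation using the per-substrate name slots."""
    left = " + ".join(label(f, s) for f, s in reaction["LEFT"])
    right = " + ".join(label(f, s) for f, s in reaction["RIGHT"])
    return f"{left} = {right}"


print(render(REACTION))
# L-glutamate + H4PteGlu(n) + ATP = H4PteGlu(n+1) + phosphate + ADP
```

Without the annotations, both sides would show the same class name, and the equation would no longer convey that the polymer grows by one unit.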

    In a pathway that contains polymerization reactions, the appropriate names are selected by the POLYMERIZATION-LINKS slot of the pathway frame. Each value of this slot is a list of frame ID symbols of the form (cpd-class product-rxn reactant-rxn). When both reactions are non-nil, an identity link is created between the polymer compound class cpd-class, serving as a product of product-rxn, and the same compound class serving as a reactant of reactant-rxn. The identity link is displayed to the user as a white dashed line in the pathway drawing. The PRODUCT-NAME-SLOT and REACTANT-NAME-SLOT annotations on this list specify which slot should be used for displaying the compound label in product-rxn and reactant-rxn, respectively; if one or both are omitted, COMMON-NAME is assumed. Either reaction may be nil; in this case, no identity link is created, and the form is used solely in conjunction with one of the name-slot annotations to specify a name slot other than COMMON-NAME for a polymer compound class in a reaction of the pathway.

    As an example, this is what the POLYMERIZATION-LINKS slot looks like in the pathway FOLSYN-PWY, as printed out by Show Frame:

    POLYMERIZATION-LINKS:

    (THF-GLU-N FOLYLPOLYGLUTAMATESYNTH-RXN FOLYLPOLYGLUTAMATESYNTH-RXN)

    ---PRODUCT-NAME-SLOT: N+1-NAME ---REACTANT-NAME-SLOT: N-NAME
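
    The rules above can be sketched in Python. This is only an illustration of the slot semantics described in this section, not Pathway Tools code: the PolymerizationLink class and the label_slots function are hypothetical names, and nil is represented by None.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

DEFAULT_NAME_SLOT = "COMMON-NAME"  # assumed when an annotation is omitted


@dataclass
class PolymerizationLink:
    """One value of POLYMERIZATION-LINKS: (cpd-class product-rxn reactant-rxn)."""
    cpd_class: str
    product_rxn: Optional[str]   # None stands for nil
    reactant_rxn: Optional[str]  # None stands for nil
    product_name_slot: str = DEFAULT_NAME_SLOT
    reactant_name_slot: str = DEFAULT_NAME_SLOT

    def has_identity_link(self) -> bool:
        # An identity link is created only when both reactions are non-nil.
        return self.product_rxn is not None and self.reactant_rxn is not None


def label_slots(link: PolymerizationLink) -> Dict[Tuple[str, str], str]:
    """Map (reaction, role) to the slot that supplies the compound label."""
    slots = {}
    if link.product_rxn is not None:
        slots[(link.product_rxn, "product")] = link.product_name_slot
    if link.reactant_rxn is not None:
        slots[(link.reactant_rxn, "reactant")] = link.reactant_name_slot
    return slots


# The FOLSYN-PWY value shown above.
folsyn_link = PolymerizationLink(
    "THF-GLU-N",
    "FOLYLPOLYGLUTAMATESYNTH-RXN",
    "FOLYLPOLYGLUTAMATESYNTH-RXN",
    product_name_slot="N+1-NAME",
    reactant_name_slot="N-NAME",
)
```

    For the FOLSYN-PWY value, the sketch reports an identity link and selects N+1-NAME for the product label and N-NAME for the reactant label of the same reaction.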

    The Pathway Editor [2] supports the selection of these alternate name labels. Right-clicking on a compound in the Pathway Editor will bring up a menu with commands, if applicable, that allow creating and deleting polymerization links, and selecting from the name labels that the polymer compound class makes available in the name slots described above.

    The Reaction Editor [2] still lacks support for name label selection as of June 2003. For reaction frames, the Frame Editor can be used to manually add the appropriate annotations.

    Macromolecules as Substrates Numerous reaction equations refer to proteins or nucleic acids that are enzymatically modified. One prominent example is protein phosphorylation in a regulatory context. In a sense, proteins and nucleic acids are polymers, but they are generally treated differently in the PGDB schema from the polymerization compound class concept discussed above.

    Historically, the PGDB schema has made a high-level distinction between small-molecule compounds and macromolecules. Proteins and nucleic acids are in the latter category. Usually, when a macromolecule is listed as a substrate in a reaction, a corresponding modified form of the macromolecule must be specified on the other side of the equation. The unmodified macromolecule generally will be a gene product, such as a protein monomer or tRNA. For the modified variant, a new frame generally needs to be created, which represents the modified macromolecule and which points to the unmodified version in its UNMODIFIED-FORM slot. The modified and unmodified variants should be made instances in a class of their own. Also see the description in [2].

    Redox-active proteins are also treated in this manner, with the corresponding oxidized and reduced variants paired up.

    Many reaction equations accept a broad range of macromolecules, and in these cases, macromolecule classes can be used as the substrates, in analogy to how compound classes are utilized to represent broad substrate specificity. A typical example is a tRNA-charging reaction, such as LEUCINE--TRNA-LIGASE-RXN (EC 6.1.1.4):

    tRNAleu + L-leucine + ATP = L-leucyl-tRNAleu + pyrophosphate + AMP

    On the left side, the class LEU-tRNAs contains 8 distinct gene products in EcoCyc.

    There are also numerous reaction equations coming from the EC system that are very generic modifications of macromolecule structure, such as DNA methylation at one type of base, wherever it may occur in a long DNA polymer. For now, the best way to represent such modifications at particular macromolecule residues, which are likely to apply to many different gene products that are left otherwise unspecified, is by using macromolecule classes, which, in analogy to ill-specified compound classes, will contain no instances. At a future time, these placeholder classes might be replaced by a better mechanism for representing residue- and site-specific modifications on macromolecules.

    Naming Conventions of Compound Classes Like all class names, the frame IDs for compound classes should be capitalized plurals, with words separated by dashes. Example: All-Amines. The All- prefix means this is a high-level class from a chemist's viewpoint that will contain more detailed distinctions. [PK: I wouldn't rule out what the next sentence states.] This type of organizational class is generally not intended to serve as a substrate for a reaction.

    The COMMON-NAME slot should be left empty, so that the Taxonomy Editor will simply display the frame IDs.

    However, the SYNONYMS slot should list as its first value a name string that makes good sense when written in a reaction equation. This will generally be a lowercase singular name, usually beginning with "a" (or "an"), that is essentially the same as the frame ID but with spaces instead of dashes separating the words. Example: the class Aliphatic-Amines has as its first synonym "an aliphatic amine". (This name will be generated automatically by the software that displays compound names in reaction equations.)
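
    As a rough illustration of this convention, the following Python sketch derives the reaction-equation name from a class frame ID. The function name is hypothetical, the singularization is deliberately naive (it only drops one trailing "s"), and this is not the algorithm used by Pathway Tools.

```python
def reaction_equation_name(class_frame_id: str) -> str:
    """Derive a lowercase singular name with an article from a class frame ID."""
    words = class_frame_id.lower().split("-")
    # Naive singularization: drop a single trailing "s" from the last word.
    if words[-1].endswith("s"):
        words[-1] = words[-1][:-1]
    phrase = " ".join(words)
    # Choose the article from the first letter only; this ignores exceptions
    # in English pronunciation.
    article = "an" if phrase[0] in "aeiou" else "a"
    return f"{article} {phrase}"


print(reaction_equation_name("Aliphatic-Amines"))  # an aliphatic amine
```

    Irregular plurals and class names containing abbreviations would need manual correction, which is one reason to check the first synonym by eye.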

    Chemical Structures of Compound Classes Compound classes can also have molecular structures, just like instances. The main difference is that a class needs to show in its structure that variations exist, which, if explicitly enumerated, would correspond to the compound instances populating this class. In a class structure, this variation can be represented by an R group. The Compound Structure Editor allows R to be entered as a special atom type. Some class structures need to represent two or more different R groups that must be distinguished from one another. Proper software support for this does not yet exist.

    3.2.4 Reactions

    Summary of information collected for reactions:

    EC number. There is a slot to label whether this is an official EC number or not. The default is that it is the official number. The EC number for the reaction may not be official if the reaction is not in the exact form as defined by the Enzyme Commission.

    EC name

    Spontaneous reaction?

    Change in Gibbs Free Energy for the reaction in the direction of the reaction as written. DeltaG0 can be added via the frame editor. Add it if you come across this information, but do not spend time searching for it.

    Summary

    Brief description of the reaction. Highlight any interesting or unusual aspects of the reaction.

    If the reaction is novel, describe why.

    If the reaction is hypothetical, explain the supporting evidence.

    If the reaction is spontaneous, describe under what conditions. In vivo? In vitro?

    Citation(s)

    Reaction Directionality. By way of background, it is important to know that the Enzyme Commission dictates a preferred direction for every enzymatic reaction that it classifies. All PGDBs should store reactions in this preferred direction; therefore, you should not change the direction in which a reaction is stored in the DB for reactions that have EC numbers. Reactions that do not have EC numbers should be stored in the same direction as comparable EC-classified reactions, if known.

    The direction in which the Enzyme Commission writes the reaction often differs from the direction in which the reaction occurs physiologically. If the reaction is irreversible, or a strong preference for its directionality exists in the cell, a Reaction Direction can be specified using the Enzyme Editor (see documentation for the Reaction-Direction slot).

    Note that the Navigator software chooses which direction to display a reaction in reaction windows and enzyme windows based on several criteria. Please see the Navigator User's Guide for more information.

    Hypothetical Reactions. Reactions within pathways may be indicated as hypothetical if there is insufficient experimental evidence. For example, the last few reactions in anaerobic toluene degradation are marked as hypothetical because the intermediates, 2-carboxymethyl-3-hydroxyphenylpropionyl-CoA and benzoylsuccinyl-CoA, have yet to be observed in the lab. These intermediates and their respective reactions were proposed based on analogous biochemical reactions. There is also genetic evidence for the hypothesized enzymes. If a reaction is marked as hypothetical, it should be further explained in the summary of the pathway and possibly in the reaction or enzyme windows.

    Reaction Balancing. Ideally, all biochemical reactions within a PGDB should be exactly balanced with respect to mass and atoms, meaning that the sums of the masses and atoms for the reactants of the reaction should equal those for the reaction products. In practice, some reactions are unbalanced. Here we discuss how to find and correct unbalanced reactions.

    The Correctify-KB process will print a list of reactions that it finds to be unbalanced with respect to mass or atoms. Note that it ignores reactions that are unbalanced with respect to hydrogens because the chemical structures within BioCyc PGDBs are stored with inconsistent ionization states that cannot be expected to balance across every reaction.
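
    The kind of check described here can be sketched as follows. This is a minimal Python illustration of atom balancing that skips hydrogen, not the Correctify-KB implementation; the function names, the FORMULAS table, and the compound keys are assumptions made for the example.

```python
from collections import Counter

# Illustrative formulas (element -> atom count) for the neutral forms.
FORMULAS = {
    "ATP": {"C": 10, "H": 16, "N": 5, "O": 13, "P": 3},
    "ADP": {"C": 10, "H": 15, "N": 5, "O": 10, "P": 2},
    "Pi": {"H": 3, "O": 4, "P": 1},
    "WATER": {"H": 2, "O": 1},
}


def atom_totals(side, formulas):
    """Sum atoms over one side of a reaction given as {compound: coefficient}."""
    totals = Counter()
    for compound, coefficient in side.items():
        for element, count in formulas[compound].items():
            totals[element] += coefficient * count
    return totals


def unbalanced_elements(left, right, formulas, ignore=("H",)):
    """Return {element: (left_count, right_count)} for elements that differ.

    Hydrogen is ignored by default because ionization states are stored
    inconsistently.
    """
    l_totals = atom_totals(left, formulas)
    r_totals = atom_totals(right, formulas)
    elements = sorted(set(l_totals) | set(r_totals))
    return {
        e: (l_totals[e], r_totals[e])
        for e in elements
        if e not in ignore and l_totals[e] != r_totals[e]
    }


with_water = unbalanced_elements({"ATP": 1, "WATER": 1}, {"ADP": 1, "Pi": 1}, FORMULAS)
without_water = unbalanced_elements({"ATP": 1}, {"ADP": 1, "Pi": 1}, FORMULAS)
print(with_water)     # {}
print(without_water)  # {'O': (13, 14)}
```

    The second call shows the signature of a missing water molecule: oxygen is short by one on the left side, while all other non-hydrogen elements balance.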

    Reactions are typically unbalanced for one of several reasons. (1) The literature itself contains an unbalanced reaction, for example because the full reaction equation has not been determined reliably and the author does not know it. These reactions should all be given a summary to indicate that the reaction is expected to be unbalanced. (2) The reaction involves polymerization, and therefore involves complexity that the reaction balancer cannot handle properly. (3) The reaction equation is in error, for example because a water molecule is missing, and should be corrected. (4) One or more chemical structures for substrates of the reaction are in error and should be corrected. When errors in reactions or compound structures are corrected, those corrections should be logged in a history note for the reaction or compound, created using the command Right-Click → Notes → Add to History. If an error is found in a reaction that has an EC number, please alert the Enzyme Commission (IUBMB).

    The information printed for each reaction will aid the curator in identifying the reason the reaction is unbalanced. For example, if a large fraction of all reactions containing a given compound are unbalanced, there is a good chance that this is because the structure of that compound is in error.
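
    The heuristic in the last sentence can be sketched in Python as follows. This is an illustrative helper, not part of Pathway Tools; the function name, the thresholds, and the reaction and compound IDs in the example are all hypothetical.

```python
from collections import defaultdict


def suspect_compounds(reaction_substrates, unbalanced, min_fraction=0.75, min_reactions=2):
    """Rank compounds by the fraction of their reactions that are unbalanced.

    reaction_substrates: {reaction_id: [compound_id, ...]}
    unbalanced: set of reaction IDs reported as unbalanced
    The thresholds are arbitrary choices made for this sketch.
    """
    total = defaultdict(int)
    bad = defaultdict(int)
    for rxn, substrates in reaction_substrates.items():
        for cpd in set(substrates):  # count each compound once per reaction
            total[cpd] += 1
            if rxn in unbalanced:
                bad[cpd] += 1
    ranked = [
        (cpd, bad[cpd] / total[cpd])
        for cpd in total
        if total[cpd] >= min_reactions and bad[cpd] / total[cpd] >= min_fraction
    ]
    # Highest fraction first; ties broken alphabetically.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical data: compound A occurs only in unbalanced reactions.
example_reactions = {
    "RXN-1": ["A", "B"],
    "RXN-2": ["A", "C"],
    "RXN-3": ["B", "C"],
    "RXN-4": ["A", "D"],
}
suspects = suspect_compounds(example_reactions, {"RXN-1", "RXN-2", "RXN-4"})
print(suspects)  # [('A', 1.0)]
```

    Compound D is also found only in an unbalanced reaction, but it is excluded because a single reaction is too little evidence to single out its structure.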

    3.2.5 Proteins

    Summary of information collected for proteins:

    General Protein Information

    Species name

    Common name and synonym(s) of protein

    Cellular location (e.g., membrane, cytoplasm, chloroplast)

    Native molecular weight in kilodaltons

    Neidhardt Spot Number (reflects the protein's electrophoretic behavior in 2-dimensional electrophoresis). This can be added via the frame editor, but it is not typically recorded. Add it if you come across this information, but do not search for it.

    Summary

    General description of the protein, highlighting any interesting aspects. The comment on each protein page should include a summary of the most important information known about the protein; for example, its function and role within the cell, any phenotypes resulting from its mutation or absence, protein domains, its participation as a component of a larger structure, its similarity to other proteins (including any functional complementation studies), membership in protein families, and information about its structure. If the enzyme is novel, explain why. If the enzyme has different isoforms, describe the substrate specificity of the isoforms and the cell-type/tissue/developmental specificity of the isoforms. Any reviews may be noted specifically.

    If a search fails to turn up any information about a particular object, it is useful to add a summary to this effect. It should include the date of the search, for example, "No information about this protein was found by a literature search conducted on 18 June 2003."

    Citation(s). In EcoCyc, we are assembling comprehensive reference lists for each protein.

    Last-Curated. A PGDB can track a last-curated date for a gene product. Checking the last-curated box in the protein editor will cause this field to be updated. The last-curated date is the date on which a systematic literature search was last performed by curators for this gene product. This date can be used by both curators and database users to determine how up to date the entry is. Do not check this box if only partial curation is performed for the gene product, as this would interfere with the purpose of this field, which is to record the last date on which full curation was performed.

    Enzyme Activity

    Enzyme activity name and synonym(s) (name based on the activity of the reaction)

    Summary

    General description of the enzymatic activity, highlighting any interesting aspects. Kinetic data, such as Michaelis constants (Km(s)) for the substrates, and Ki(s) for inhibitors should be noted, if available.

    Inhibitors (physiologically relevant or not?)

    Competitive inhibitors: inhibit enzyme activity by binding reversibly to the enzyme and thereby preventing the substrate from binding.

    Noncompetitive inhibitors: inhibit enzyme activity by binding reversibly to either the free enzyme or the enzyme-substrate complex. The substrate is not prevented from binding, but the enzyme with the inhibitor bound is not catalytically active. This category was added to the Pathway Tools schema for the February 2004 release.

    Uncompetitive inhibitors: inhibit enzyme activity by binding reversibly to the enzyme-substrate complex. This category was added to the Pathway Tools schema for the February 2004 release.

    Allosteric inhibitors: inhibit enzyme activity by binding reversibly to the enzyme and inducing a conformational change that decreases the affinity of the enzyme for its substrates. Allosteric inhibitors can be competitive or noncompetitive; therefore, those inhibition categories may be used in conjunction with this one.

    Irreversible inhibitors: irreversibly inhibit the enzyme activity by binding to the enzyme and dissociating so slowly as to be considered irreversible. This category was added to the Pathway Tools schema for the February 2004 release.

    Other inhibitors: inhibit the enzyme activity by a mechanism that has been characterized but does not fall cleanly into one of the above categories. This category was added to the Pathway Tools schema for the February 2004 release. It replaces a similar category in previous versions for all inhibitors that were neither competitive nor allosteric.

    Inhibitors of unknown mechanism: inhibit enzyme activity, but the mechanism of action is unknown either because it has not yet been elucidated, or because it has not been curated. This category was added to the Pathway Tools schema for the February 2004 release. It combines and replaces two old categories from previous versions that distinguished between mechanisms that were unknown because they had not been extensively curated versus ones that remained unknown after a substantial literature search.

    Activators (physiologically relevant or not?)

    Categories of enzyme activation were similarly reexamined at the time of the February 2004 release, and documentation was updated. Although no new activation categories were added, the two categories for activators whose mechanism was unknown after a minimal versus a substantial literature search were combined into a single category for activators of unknown mechanism.

    Prosthetic Groups/Cofactors (e.g., metals, FAD, NAD(H)). Note: prosthetic groups are defined as being covalently or tightly bound to an enzyme, whereas cofactors are not.

    Alternative substrates

    For a description of some of these terms, see Appendix A of the Pathway Tools User's Guide.

    Citation(s)

    Enzyme Subunit Composition

    Subunit composition. Specifies the number of copies of each monomer subunit of a multimeric protein. In cases where sub-complexes of a large multimer have been observed, those sub-complexes can be created as PGDB objects that are sub-complexes of a larger super-complex.

    Subunit name(s) and synonym(s)

    Subunit molecular weight (experimental and computed from sequence)

    pI (isoelectric point). Can be added via the frame editor. Add it if you come across the information, but do not search for it.

    SWISS-PROT primary accession number of subunit, if available

    (http://us.expasy.org/sprot/). Links protein subunit to SWISS-PROT

    Summary

    Brief description of the enzyme subunit, such as its function, if known or proposed. For example, a subunit may be known to house the catalytic active site, it may have an FAD binding motif and may be proposed to be involved in electron transport, or it may be proposed to be the membrane anchor subunit of a membrane-bound enzyme.

    Citation(s)

    Protein Naming. The word "monomer" may be used to refer to a polypeptide that has intrinsic activity or to a polypeptide that acts as a member of a homo-multimeric complex. The words "subunit" or "component" may be used to refer to a member of a hetero-multimeric complex.

    When naming a protein complex, use the name that is commonly recognized by the scientific community. Any other names for the complex should be added as synonyms.

    Avoid extra wordiness (the word "complex", for example) unless it is necessary to avoid confusion. For example, the long name of the "pyruvate dehydrogenase multienzyme complex" distinguishes it from "pyruvate dehydrogenase", which is one of its components.

    Any protein product of a named locus without an associated gene should be named using the guidelines for protein complex names: acetate kinase B should be named "acetate kinase B" with "AckB" as a synonym, rather than "AckB", because the ackB mutation has not been localized to a particular gene.

    Enzyme Naming. Be sure not to assign non-specific names to enzymes. For example, do not use names such as "oxidoreductase" or "hydrogenase", because these names refer to non-specific classes of enzyme activity, not to specific enzyme activities. Use of these names is very problematic for the PathoLogic program, because genome annotations often use these non-specific enzyme names for genes whose specific functions cannot be inferred. Thus, if a MetaCyc enzyme were called "oxidoreductase", PathoLogic would assign a gene annotated with the name "oxidoreductase" to the corresponding reaction in MetaCyc, which is not correct.
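
    A curator-side check for this rule could look like the following Python sketch. It is not a Pathway Tools function: the names are hypothetical, and the set contains only the two examples given above, so a real list would have to be extended by hand.

```python
# Only the two non-specific names mentioned in the text; extend as needed.
NONSPECIFIC_NAMES = {"oxidoreductase", "hydrogenase"}


def is_nonspecific(name, nonspecific=NONSPECIFIC_NAMES):
    """True if the whole enzyme name is a bare, non-specific class name."""
    return name.strip().lower() in nonspecific


def flag_nonspecific(names):
    """Return the names from a list that should not be used as enzyme names."""
    return [name for name in names if is_nonspecific(name)]


print(flag_nonspecific(["pyruvate kinase", "Oxidoreductase"]))  # ['Oxidoreductase']
```

    The comparison is on the whole name, so a specific name that merely contains one of these words is not flagged.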

    The following procedures should be followed regarding entry of enzyme names ending in roman numerals and arabic numbers, such as "pyruvate kinase II" and "tagatose-1,6-bisphosphate aldolase 2".

    Enzyme names ending in roman numerals fall into two categories: those in which the roman numerals are an integral part of the enzyme activity, and those in which the roman numerals differentiate isozymes that have the same enzyme activity.

    PGDBs allow two sets of names and synonyms to be defined for enzymes: names for the enzyme activity, and names for the protein. Consider that E. coli has two proteins with pyruvate kinase activity, designated pyruvate kinase I and pyruvate kinase II. The name of the enzyme activity for both of these enzymes is "pyruvate kinase"; however, the names of the proteins are "pyruvate kinase I" and "pyruvate kinase II".

    To encode this situation in a PGDB, enter the protein name, "pyruvate kinase I", in the top section of the protein editor, by adding a common name to the protein. Enter the enzyme activity name, "pyruvate kinase", in the second section of the protein editor, under the first horizontal line, in the box labeled "Enzyme activity name". Note that every enzyme should have at least one enzyme activity name, but protein names should be assigned less frequently, such as when different isozymes exist. In addition, protein names should be defined when names exist for an enzyme that are specific to that protein. Arabic numbers at the ends of enzyme names are also used to differentiate isozymes, and should be treated in the same manner.

    In contrast, consider the enzymes exodeoxyribonuclease I and exodeoxyribonuclease III. These enzymes catalyze different reactions, and the roman numerals designate exactly which enzyme activity an enzyme has. Therefore, these names should be entered in the "Enzyme activity name" field of a PGDB, because the names are specific to the enzyme activity, not to the protein.

    3.2.6 Genes

    Summary of information collected for genes:

    Common name and synonym(s)

    Superclass(es) of gene

    Gene product type (e.g., enzyme, regulator, leader). Evidence for product type (experimental or predicted based on sequence analysis)

    Transcription direction (unspecified, forward or reverse)

    Left and right end position of gene on chromosome or plasmid

    Link to other DBs

    Summary

    Citation(s)

    Gene Naming. For bacterial databases, automated programs will periodically ensure that the capitalized gene name is a synonym for the name of the gene product (e.g., "TrpA" will become a synonym for the product of trpA). For most organisms, the frame name for each gene frame will be the unique identifier assigned to the gene by the genome project (e.g., HP0001 for an H. pylori gene). Therefore, there is no need to add this same identifier as a synonym for the gene name.
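
    The capitalization rule can be illustrated with a short Python sketch. It only mimics the effect described above and is not the automated program itself; both function names and the product name in the example are illustrative.

```python
def product_synonym(gene_name):
    """Capitalize the first letter of a bacterial gene name (trpA -> TrpA)."""
    return gene_name[:1].upper() + gene_name[1:]


def ensure_product_synonym(gene_name, product_common_name, product_synonyms):
    """Return the product's synonym list, adding the capitalized gene name if absent."""
    synonym = product_synonym(gene_name)
    if synonym == product_common_name or synonym in product_synonyms:
        return list(product_synonyms)
    return list(product_synonyms) + [synonym]


print(ensure_product_synonym("trpA", "tryptophan synthase, alpha subunit", []))
# ['TrpA']
```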

    3.3 Summaries and History Notes

    Text stored in the Comment slot of a PGDB will be seen by the public under the heading "Summary". Should you want to record information that will not be seen by the public but that will be visible in the Navigator to curators, put that information in the Comment-Internal slot using GKB-Editor.

    Curators are encouraged to record historical information about PGDB objects to provide explanation and justification of edits to a PGDB. Such commentary should be stored in a history note for the object, created using the command Right-Click → Notes → Add to History. Example commentary could describe the reasons for changing a gene function, a chemical structure, or a pathway definition. History note contents are also public, and will be displayed with the date they were created and the username of the creating curator.

    3.3.1 Writing Style Guidelines

    Summaries should be written in full sentences rather than in sentence fragments. Use multiple paragraphs within summaries where the extra whitespace adds clarity and separates ideas. Embed citations within summaries. Other than general commentary, most of the text of a summary should consist of an assertion followed by a citation, then another assertion followed by a citation. Do not lump all the citations at the end of the summary.

    Guidelines for Gene-Product Summaries We typically store summaries in the protein (or RNA) product of a gene, rather than in the gene itself. The first sentence of the summary should summarize the function of the gene product.

    Except in rare cases, in which the gene product is particularly complex or physiologically significant, an effort should be made to keep the length of summaries (including references) to less than 500 words. In longer summaries, the first paragraph should summarize in a few sentences what is known about the gene product. Experimental support for these conclusions and information regarding protein structure and regulation of expression should come later. In other words, summaries should be organized more like a news story than a scientific paper: conclusions should precede rather than follow detailed information and evidence, so that the user can gain the essential facts without necessarily having to read the entire summary.

    In those cases in which background information is considered to be an important aid to some users, such information can be added, but it should not be included in the first paragraph.

    If relevant reviews are available, they should be included and grouped together at the end of the summary, such as: "Reviews: [Smith95,Jones98]".

    Literature citations are an important and valuable part of summaries, but some effort should be made to restrict their number to the most significant ones. Try not to exceed 10 to 20 references within most summaries (in some cases of extremely well-studied genes, more references may be appropriate). Sentences, particularly those in the first paragraph, should not be interrupted by long listings of tens of references.
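
    These length and citation guidelines lend themselves to a simple automated check. The Python sketch below is a hypothetical helper, not a Pathway Tools feature; it assumes citations appear in square brackets as in "[Smith95,Jones98]" and treats every bracketed group as a citation list.

```python
import re

CITATION_GROUP = re.compile(r"\[([^\[\]]+)\]")


def summary_stats(text):
    """Count whitespace-separated words and distinct bracketed citation keys."""
    keys = set()
    for group in CITATION_GROUP.findall(text):
        keys.update(key.strip() for key in group.split(",") if key.strip())
    return {"words": len(text.split()), "citations": len(keys)}


def style_warnings(text, max_words=500, max_citations=20):
    """Return warnings for summaries that reach 500 words or exceed 20 citations."""
    stats = summary_stats(text)
    warnings = []
    if stats["words"] >= max_words:
        warnings.append(f"summary has {stats['words']} words; aim for fewer than {max_words}")
    if stats["citations"] > max_citations:
        warnings.append(f"summary cites {stats['citations']} references; aim for at most {max_citations}")
    return warnings


example = "TrpA catalyzes a reaction [Smith95]. Reviews: [Smith95,Jones98]."
print(summary_stats(example))  # {'words': 7, 'citations': 2}
```

    Such a check can only flag candidates for review, since the guidelines explicitly allow longer summaries for complex or extremely well-studied gene products.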

    3.3.2 Formatting Summaries

    Pathway Tools supports a subset of HTML tags for encoding special characters and formatting (such as italics) within summaries and names. Following is a list of tags that are accepted.

    For example, to encode the name α-D-glucose, enter the characters: &alpha;-D-glucose.

    Greek letters

    alpha symbol: &alpha;

    beta symbol: &beta;

    delta symbol: &delta;

    gamma symbol: &gamma;

    omega: &omega;

    mu (micro): &mu;

    Text

    italicized text: <i>italicized text</i>

    bold text: <b>bold text</b>

    underlined text: <u>underlined text</u>

    supertext: <sup>supertext</sup>

    subtext: <sub>subtext</sub>

    A few simple HTML tags, such as <p> for paragraph and <br> for hard line break, are detected by Pathway Tools but are removed from the displayed text and not observed. Thus, to display a new paragraph, hit the return key twice instead of using <p>.

    3.3.3 Say It in Your Own Words

    Avoid word-for-word duplication from papers, and enclose any small duplication (one to three sentences) within quotes and cite the source.

    3.3.4 Citation Guidelines

    Citations should be used within summaries to cite the source of the information just conveyed. Citations can also be entered independently of summaries, in which case they are stored in the Citations slot of the object (such as a protein), and they provide general references for the object.

    3.4 Saving Changes

    If changes have been made to a database, an asterisk will appear to the left of the database name at the top of the Navigator window, for example, *MetaCyc. Changes are not saved automatically. It is important to remember to use the Save KB command to save changes.

    3.5 Evidence Codes

    PGDBs include an evidence ontology that is designed to encode information about why we believe certain assertions in a PGDB, the sources of those assertions, and the degree of confidence scientists hold in those assertions.

    A detailed description of the evidence ontology can be found in [1]. The evidence ontology can be browsed within a PGDB by running the GKB Taxonomy viewer on the PGDB class hierarchy rooted at class Evidence. An HTML version of the evidence ontology is available at URL

    http://bioinformatics.ai.sri.com/ptools/evidence-ontology.html.

    An assertion could be the existence of a biological object described in a PGDB. For example, we would like to be able to encode the evidence supporting the existence of a gene, an operon, or a pathway that is described within a PGDB.

    Curators can assign evidence codes by clicking on the "Evidence Code" button within most Pathway Tools editors. For example, within the protein editor, there is an "Evidence Code" button below the "Enzyme Activity" box. This button allows the curator to assign an evidence code; a citation to the source of the evidence should also be added when available (the citation is optional since some evidence codes such as Ev-AS-NAS are used when no citation is available). Whenever an evidence code has been assigned, a new "Evidence Code" button will be drawn to allow assignment of an additional evidence code. It is proper to assign multiple evidence codes if multiple types of evidence support a given conclusion.

    We offer several guidelines as to how to apply the evidence ontology to different classes of PGDB objects.

    Pathways. Assign code EV-Exp or one of its sub-codes to a pathway if some experimental evidence supports the existence of the pathway. Code Ev-Comp or its sub-codes should be used for pathways whose presence is inferred computationally, such as by the PathoLogic program. We expect that all pathways in the MetaCyc PGDB will be assigned an EV-Exp code because MetaCyc contains experimentally elucidated pathways.

    Proteins. Evidence codes should be assigned to a protein to define the evidence supporting the function of the protein. For example, was the function of the protein elucidated using sequence analysis, or using experimental methods? If the latter, what class of method was used?

    Enzymes. Several evidence codes specific to enzymes capture evidence from experimental methods that are specific to elucidating enzyme activities, such as EV-Exp-IMP-Reaction-Enhanced. These evidence codes are actually stored within instances of the Enzymatic-Reactions class.

    4 Pathway Curation

    This section discusses curation of pathways in more depth.

    4.1 Pathway Definitions

    A metabolic pathway is a set of one or more enzymatic transformations (such as biosynthesis, degradation, conversion, or utilization), as it occurs in a particular organism. Identical pathways that exist in other organisms are not repeated in the MetaCyc database; instead, a single pathway is labeled with the multiple organisms in which it occurs. Metabolism of a substrate exogenously supplied to cells, such as a vitamin or drug, can also constitute a pathway.

    Pathways can be classified into base pathways and superpathways. A base pathway is considered a lowest-level pathway in the sense that it is not subdivided into smaller component pathways. Base pathways can be linear, circular, or branched.

    A superpathway is an aggregation of two or more pathways that are related in some way. A pathway component of a superpathway is referred to as a subpathway. A subpathway is part of a superpathway, and a superpathway is composed of subpathways. The subpathways of a superpathway can be base pathways, or can themselves be superpathways. Some superpathways will contain additional reactions and enzymes not found within the base pathways, such as reactions that connect two base pathways together. PGDBs always contain links between associated base pathways and superpathways, and those links are displayed by the Pathway/Genome Navigator toward the bottom of a pathway display page.
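
    Because subpathways can themselves be superpathways, the structure is a hierarchy. The Python sketch below shows one way to expand a superpathway into its base pathways; the dictionary representation, the function name, and the pathway IDs are hypothetical, and the sketch assumes the hierarchy contains no cycles.

```python
def base_pathways(pathway, subpathways):
    """Return the set of base pathways reachable from a pathway.

    subpathways maps each superpathway ID to the list of its subpathway IDs;
    a pathway with no entry is treated as a base pathway.
    """
    children = subpathways.get(pathway, [])
    if not children:
        return {pathway}
    result = set()
    for child in children:
        result |= base_pathways(child, subpathways)
    return result


# Hypothetical hierarchy: SUPER-A contains SUPER-B and one base pathway.
example_hierarchy = {
    "SUPER-A": ["SUPER-B", "BASE-3"],
    "SUPER-B": ["BASE-1", "BASE-2"],
}
print(sorted(base_pathways("SUPER-A", example_hierarchy)))
# ['BASE-1', 'BASE-2', 'BASE-3']
```

    Note that this expansion covers only the subpathways; any additional connecting reactions stored directly on the superpathway are not represented here.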

    There are two main types of superpathways: those whose subpathways are related by a common substrate, and those whose subpathways are related by being analogous base pathways from different organisms. For the first type of superpathway, its subpathways could all be derived from the same organism, or they could be derived from multiple organisms. Multispecies superpathways that are connected via a common substrate are potentially useful in metabolic engineering. The steps used in creating superpathways from subpathways are described in the Pathway Tools User's Guide, Volume II, section 2.3.5.3.

    More specifically, superpathways can be created based on the following types of relationships among their subpathways: (1) subpathways that are physically connected through a common substrate (that is, one subpathway produces the substrate, and another subpathway consumes it); (2) subpathways that are unconnected, but that metabolize the same substrates (e.g., MetaCyc Superpathway of aspartate and asparagine biosynthesis: interconversion of aspartate and asparagine); and (3) analogous subpathways consisting of an analogous series of reactions catalyzed by the same enzymes, or by analogous enzymes (e.g., MetaCyc Superpathway of isoleucine and valine biosynthesis).

    Additionally, pathway variants exist in which the same substrate is synthesized or degraded using different enzymes and/or cofactors, in the same or different organisms. Pathway variants share identical pathway names followed by a roman numeral. Many examples of pathway variants can be found in MetaCyc by browsing the pathway ontology. These pathway variants can potentially be combined into superpathways.

    4.2 Defining Pathway Start and End Points

    Several considerations guide the questions of how to define the start and end points of a pathway, and of whether a given published pathway should be encoded in a PGDB as a single base pathway, or as a set of base pathways within a common superpathway. The following rules should be used to guide the creation and editing of base pathways and superpathways, when possible.

    1. The substrate biosynthesized or degraded by a pathway should be a stable substrate, as opposed to a transient intermediate. However, a pathway could show the biosynthesis of a stable intermediate that is a precursor for the biosynthesis of other substrates.

    2. Biosynthetic pathways should begin with an intermediate of central metabolism. These intermediates are the 13 precursor metabolites: glucose-6-phosphate, fructose-6-phosphate, ribose-5-phosphate, erythrose-4-phosphate, triose phosphate, 3-phosphoglycerate, phosphoenolpyruvate, pyruvate, acetyl CoA, alpha-oxoglutarate, succinyl CoA, oxaloacetate, and sedoheptulose-7-phosphate. A pathway link (see Section 4.3) should be created to indicate the pathway that produces the precursor metabolite at the start of the pathway.

    3. Degradative pathways that produce an intermediate of central metabolism should stop at that point. Some degradative pathways may not produce intermediates of central metabolism, but instead produce compounds that are excreted from the cell. If appropriate, a pathway link (see Section 4.3) should be created to indicate the pathway that processes the resulting metabolite at the end of the pathway.

    4. Another class of pathways is applicable in cases where compounds are metabolized in a dissimilatory manner for the production of energy. Examples of these pathways include sulfate reduction and ammonia oxidation. In such cases, the pathways are unlikely to consume or produce intermediates of central metabolism. Such pathways should start with the natural form of the compound being used as an electron donor or acceptor, and end with the compound generated at the end of the electron transport process, which would generally be secreted by the organism.

    5. Very large or complex pathways should usually be defined as superpathways that combine several smaller base pathways, where those base pathways are divided at breakpoints. Dividing a large or complex pathway in this fashion is particularly useful to optimize the accuracy of PathoLogic predictions, especially in cases where it is the base pathways, rather than the entire pathways, that tend to be present as units across different organisms. If the pathway were defined as one large base pathway, rather than as a set of base pathways connected through a superpathway, PathoLogic would be unable to predict the presence of the smaller base pathways independently in different organisms. Breakpoints for large base pathways can be chosen based on various criteria such as: branch point substrates; substrates involved in regulation; a major metabolite that is further metabolized; the cellular compartment in which the reactions occur (organelle or cytosol); a transport segment or a utilization segment; or the type of reaction (oxidative or non-oxidative).

    6. If a published pathway contains several pathways that are already defined in a PGDB as base pathways, it should be represented as a superpathway.

    7. If the pathway contains too many reactions to represent conveniently in one base pathway, it should be broken into two or more base pathways, which should be linked together by pathway links (see Section 4.3).
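    Rule 2 above lends itself to a simple automated check. The following is a minimal sketch in Python, not part of Pathway Tools: the function name and the plain-string representation of substrates are assumptions made for illustration, and only the list of 13 precursor metabolites comes from this guide.

```python
# The 13 precursor metabolites listed in rule 2 of this section.
PRECURSOR_METABOLITES = frozenset({
    "glucose-6-phosphate", "fructose-6-phosphate", "ribose-5-phosphate",
    "erythrose-4-phosphate", "triose phosphate", "3-phosphoglycerate",
    "phosphoenolpyruvate", "pyruvate", "acetyl CoA", "alpha-oxoglutarate",
    "succinyl CoA", "oxaloacetate", "sedoheptulose-7-phosphate",
})

# Lowercased copy so that the comparison ignores capitalization.
_PRECURSORS_LOWER = frozenset(m.lower() for m in PRECURSOR_METABOLITES)


def starts_at_precursor(start_substrate: str) -> bool:
    """Return True if the first substrate of a biosynthetic pathway
    is one of the 13 precursor metabolites (hypothetical helper)."""
    return start_substrate.strip().lower() in _PRECURSORS_LOWER
```

    A check of this kind only compares names, so a substrate recorded under a synonym would not be recognized; it is a reminder to the curator rather than a validation of the pathway.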

    4.3 Pathway Links

    Pathway links are a mechanism for indicating substrate connections among pathways. Pathway links are displayed as arrows connecting an input or output substrate in a pathway to the name of a second pathway in which that substrate is metabolized. Pathway links can illustrate the source pathway for an input substrate, or the destination pathway for an output substrate (assuming that it is not completely metabolized). Clicking on the second pathway name takes the user to that pathway's display page. Links can be created to another base pathway, a superpathway, or to a class of substrates that derive from the pathway (e.g. MetaCyc glycolysis I). The steps used in creating pathway links are described in the Pathway Tools User's Guide, Volume II, section 2.3.5.3.

    Note that although it is helpful to explain the origin or fate of substrates in the pathway summaries field, this unstructured text is not computationally useful, and thus cannot replace the use of pathway links.
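    To illustrate why structured links are computationally useful where summary text is not, the sketch below finds candidate links by matching the output substrates of one pathway against the input substrates of another. It is an illustration only: the Pathway class, its fields, and the pathway names are hypothetical and do not reflect the Pathway Tools schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathway:
    """Hypothetical, simplified pathway record."""
    name: str
    inputs: frozenset
    outputs: frozenset


def candidate_links(pathways):
    """Return (producer, substrate, consumer) triples for every substrate
    that one pathway outputs and a different pathway takes as input."""
    links = []
    for producer in pathways:
        for consumer in pathways:
            if producer is consumer:
                continue
            for substrate in sorted(producer.outputs & consumer.inputs):
                links.append((producer.name, substrate, consumer.name))
    return links


pathway_a = Pathway("pathway A", frozenset({"substrate X"}), frozenset({"pyruvate"}))
pathway_b = Pathway("pathway B", frozenset({"pyruvate"}), frozenset({"substrate Y"}))
links = candidate_links([pathway_a, pathway_b])
```

    The resulting triples are only candidates; whether a link is biologically appropriate remains a curation decision.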

    4.4 Limitations

    Metabolic pathways involving macromolecules and cellular structures may be difficult to represent in PGDBs. This is a factor in pathway selection. Pathways that involve reactions that synthesize, degrade, or modify small molecule components of macromolecules and cellular structures can be represented. However, some processes may be beyond the scope of PGDBs, which focus on small molecule metabolism.

    4.5 Database Searching Strategies

    For those curating MetaCyc, please see Section 6.1 for additional information specific to MetaCyc.

    To search available databases for information regarding a particular pathway, it is recommended to begin by using general keywords related to the pathway name (e.g., creatinine degradation to formate methionine biosynthesis, etc.). You may need to search using several alternative names, for example: toluene degradation, toluene oxidation, toluene catabolism, etc. Adding search terms such as anaerobic will help avoid irrelevant hits. As in all searches, if you get too many hits you should narrow your search by adding keywords, such as bacteria to limit the search to bacterial pathways, or a species name to limit the search to pathways in a particular species. Some databases allow the use of wildcards, which are truncated names followed by a special character such as * to designate different variations for the ending of the word. For example, bacter* would include bacteria, bacterial, bacterium, etc. Different databases may use different wildcards, so it is always useful to consult the database's search description/overview. For a description of PubMed searching strategies see the following URL: http://www.ncbi.nlm.nih.gov:80/entrez/query/static/help/pmhelp.html#PubMedSearching.
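    The effect of a trailing wildcard can be illustrated locally. The sketch below uses Python's standard fnmatch module to show which words a pattern such as bacter* would cover; it only mimics the idea and does not reproduce the matching rules of any particular literature database. The function name and word list are made up for the example.

```python
from fnmatch import fnmatchcase


def expand_wildcard(pattern, vocabulary):
    """Return the words in vocabulary that match a shell-style pattern,
    where * stands for any run of characters (case-sensitive)."""
    return [word for word in vocabulary if fnmatchcase(word, pattern)]


words = ["bacteria", "bacterial", "bacterium", "archaea"]
matches = expand_wildcard("bacter*", words)
```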

    If you know a little bit about the pathway, such as the names of intermediates or of enzymes involved in the pathway, you should also search for articles using these keywords. Furthermore, if you know the names of the researchers who studied the pathway, you can search for articles using a combination of their names and the substrate or enzyme they are working on. Once you have some articles of interest, you can usually find other related articles by (1) looking at their references, and (2) using SciSearch to find articles that cite them. Often an article's full text is available online in HTML format. These HTML-formatted articles often have web links to articles that they reference as well as to articles that cite them, which is a very convenient way of finding additional articles. Once you have identified and gathered the relevant articles for a pathway, try to find their PubMed ID (PMID) numbers and label the articles with them, as these numbers are the easiest way to enter reference information into PGDBs.

    4.6 Pathway Entry

    Once you have enough papers to put a pathway together, draw out the pathway, making note of the chemical reactions, chemical names and structures of the metabolites, and enzyme names if known. Try to identify any EC numbers that may have been assigned to the reactions in the pathway. It is very important to find out whether the chemicals and/or reactions already exist in the database. This may not be straightforward, as chemicals may have many different names. MetaCyc already includes all of the reactions which have been assigned EC numbers, so you may want to search the IUBMB website carefully to find out whether existing EC reactions fit any of the reactions in the new pathway. Often authors are not aware of such EC numbers and do not include them in their publications. Make sure you do not create duplicate chemicals and/or reactions in the database, as duplicates will lead to problems in the future. After you have identified all existing reactions, you may need to create new reactions and chemicals. Write down the frame IDs of both existing and new reactions and add them to the drawing of the pathway which you have prepared. This will greatly facilitate the creation of the new pathway.
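    Because a chemical may be recorded under many names, a name-normalizing lookup can help spot likely duplicates before a new frame is created. The following is a rough sketch under stated assumptions: the dictionary of compounds, the frame ID, and both function names are hypothetical, and the lookup is not the search mechanism provided by Pathway Tools.

```python
import re


def normalize(name):
    """Lowercase a chemical name and drop spaces, hyphens and other
    punctuation so that trivially different spellings compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def find_existing(name, compounds):
    """Return the frame IDs whose recorded names match the given name.

    compounds maps a frame ID to the list of names (common name plus
    synonyms) recorded for that compound.
    """
    key = normalize(name)
    return sorted(
        frame_id
        for frame_id, names in compounds.items()
        if any(normalize(n) == key for n in names)
    )


# Hypothetical frame ID; the names are synonyms of one compound.
compounds = {"CPD-HYPOTHETICAL-1": ["alpha-oxoglutarate", "2-oxoglutarate"]}
hits = find_existing("2-Oxoglutarate", compounds)
```

    Stripping punctuation can also make distinct names collide, so any hit should be treated as a candidate to inspect by structure, not as proof of identity.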

    Once you have finished these steps, you are ready to define the new pathway. Do not forget to assign the appropriate class and evidence codes. An ideal reference for the pathway evidence code is a recent review article which cites all the relevant experimental literature. In such a case use the code EV-EXP-TAS. Make sure you assign the appropriate organisms to the pathway, and mark any hypothetical reaction as such. See section xxx for guidelines on writing summaries for new pathways. Once the pathway is defined in the PGDB, you need to enter enzymes and genes for the various reactions. Review the papers in greater detail while taking notes of the relevant information (see Section 3.2). As mentioned earlier, it is best if you are already aware of the type of information you'll input into the PGDBs. This way, you can skim or read the papers for the relevant information, and take notes in a fashion similar to how the PGDB is organized, which expedites inputting the information into the database. In addition, you'll need to cite the information you input into PGDBs. Ideally, your papers have PMID reference numbers, in which case the full reference information will be imported automatically.

    5 EcoCyc-Specific Information

    5.1 E. coli Gene Frame Names

    All frame IDs for E. coli genes are derived using numeric sequences. Some genes' frame names are the same as the identifier used in Rudd's EcoGene database; they use the prefix EG. Genes with the prefix G were either (a) created by the MBL group when they encountered new genes in the literature, or (b) created by SRI; virtually all of these genes were created in the course of including data from the Blattner GenBank entry for the full E. coli sequence into EcoCyc. Examples: EG10115, G1001.

    Identifiers used by other databases are often listed in the synonyms slot for the gene, e.g., the Blattner b identifiers and later assigned EG identifiers.

    Dr. K. Rudd developed a naming scheme for E. coli ORFs, in which ORF genes are named beginning with y. The rest of the name encodes the position of the gene on the E. coli chromosome. These names are retained as synonyms for genes whose functions are later determined.
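    The EG and G prefix convention described in this section can be recognized with a short pattern. The sketch below is illustrative only; the function name is an assumption, and the pattern covers just the two prefixes and the purely numeric suffix shown in the examples EG10115 and G1001.

```python
import re

# EG or G followed by digits only, as in the examples EG10115 and G1001.
FRAME_ID_PATTERN = re.compile(r"^(EG|G)(\d+)$")


def parse_gene_frame_id(frame_id):
    """Return (prefix, number) for an E. coli gene frame ID of the form
    EG<digits> or G<digits>, or None if the ID has another form."""
    match = FRAME_ID_PATTERN.match(frame_id)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```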

    5.2 Interrupted Genes

    Interrupted genes in E. coli are represented as follows. For each piece of the coding region, a separate gene frame is created. The same EG number is stored as a synonym for each interrupted gene, but the two interrupted genes have different b-numbers. In addition, a name of the form ilvG is stored as a synonym, but a name of the form ilvG_1 is stored as the common-name. The interrupted? slot is set to T, and the start-base and end-base delimit the segment of the coding region defined by that gene frame.
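    The representation described above can be sketched as a small data structure. This is a hypothetical model, not the Pathway Tools frame system: the class, the field names, all identifiers and all coordinates are invented for illustration, and joining the base name and piece number with an underscore is an assumption about the naming form.

```python
from dataclasses import dataclass


@dataclass
class GeneFrame:
    """Hypothetical stand-in for one gene frame."""
    frame_id: str
    common_name: str
    synonyms: list
    start_base: int
    end_base: int
    interrupted: bool = False


def make_interrupted_frames(base_name, eg_id, pieces):
    """Create one gene frame per piece of an interrupted coding region.

    pieces is a list of (frame_id, b_number, start_base, end_base).
    Every frame shares the base name and EG number as synonyms, has its
    own b-number, and gets a numbered common name.
    """
    return [
        GeneFrame(
            frame_id=frame_id,
            common_name=f"{base_name}_{index}",
            synonyms=[base_name, eg_id, b_number],
            start_base=start,
            end_base=end,
            interrupted=True,
        )
        for index, (frame_id, b_number, start, end) in enumerate(pieces, start=1)
    ]


# All identifiers and coordinates below are made up.
frames = make_interrupted_frames(
    "ilvG",
    "EG-HYPOTHETICAL",
    [("G-HYP-1", "b-hyp-1", 100, 400), ("G-HYP-2", "b-hyp-2", 401, 900)],
)
```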

    6 MetaCyc-Specific Information

    MetaCyc describes the union of pathways across a range of different organisms.

    6.1 Database Searching Strategies for MetaCyc

    The database searching strategies discussed below are unique to curating MetaCyc, since MetaCyc is currently the only PGDB covering metabolic pathways from multiple species. Section 4.5 above should be reviewed in addition to this section prior to embarking on database searching.

    While researching a pathway you may find that the pathway was studied in many different species, just a couple, or maybe only one. If the pathway has been studied in more than one species, choose the model species in which the pathway has been studied the most and is therefore best defined. This way you can build the most complete picture of a pathway. Pathways in MetaCyc may be associated with one or more species. In order for a species to be associated with a pathway, each of the reactions composing the pathway must have been either experimentally demonstrated or hypothesized to occur in the given species. Enzymes in MetaCyc, on the other hand, are associated with only one species because it is assumed that the same enzyme in different species will have slightly different properties. For an enzyme to be associated with a pathway (i.e. displayed in the pathway navigator window over the reaction it catalyzes) the enzyme and the pathway must both be associated with the same species. This further emphasizes the reason for studying all of the reactions and enzymes for a pathway in one particular species. It is a higher priority to add many different pathways to MetaCyc than to find all of the species in which a given pathway is found. However, if during your research you discover that many species have the same pathway, list all of the species for that given pathway. To reduce redundancy in MetaCyc, the same pathway in multiple species is only entered once. If you find that other species use a slightly different pathway, a new pathway can be created for these species. For example, you may note that the glutamate fermentation-the hydroxyglutarate pathway includes a list of several species in which the pathway is found, but it only lists one example for each enzyme. The priority was to find enzyme information for one of the species listed in the pathway for each of the pathway reactions, and not to include enzyme information for each of the species.
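    The display rule stated above (an enzyme appears on a pathway only when both are associated with the same species) can be expressed as a simple filter. The sketch below is a hypothetical illustration: the function name, the tuple layout, and the species and enzyme names are placeholders, and it does not describe how the pathway display is actually implemented.

```python
def displayed_enzymes(pathway_species, enzymes):
    """Return the names of enzymes whose single species is among the
    species recorded for the pathway.

    enzymes is an iterable of (enzyme_name, species) pairs, since each
    MetaCyc enzyme is associated with exactly one species.
    """
    allowed = set(pathway_species)
    return [name for name, species in enzymes if species in allowed]


# Placeholder names, for illustration only.
shown = displayed_enzymes(
    ["species A", "species B"],
    [("enzyme 1", "species A"), ("enzyme 2", "species C")],
)
```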

    6.2 Species Information

    Unlike the EcoCyc KB, which describes pathways for E. coli only, MetaCyc describes pathways for many different organisms. When entering a pathway into MetaCyc, you should record the one or more species in which the pathway is known to occur in the Species slot of the pathway, which can be done using the Species field of the Pathway Info Editor. The list of species recorded for a pathway should be considered a representative list, not necessarily an exhaustive list of all species in which the pathway occurs.

    When creating the enzymes that catalyze steps in the pathway, specify the species from which the enzyme was obtained. Each enzyme in MetaCyc should define a protein from a single species. You may create enzymes that catalyze a reaction even if those enzymes are from species other than those in which the pathway occurs, but those enzymes will not be displayed within the pathway diagram.

    6.2.1 Taxonomic Range Information

    Should the curator be reasonably certain that a pathway is expected to occur in a limited set of taxa (e.g., plants only, or animals only), a high-level taxonomic classification for its expected taxonomic range should be entered in the Taxonomic-Range slot of the pathway frame by using the Expected Taxonomic Range field in the Pathway Info Editor. Here you can limit the taxonomic range of organisms in which the pathway occurs by selecting the appropriate taxonomic domain(s) or subdomain(s). You can determine the most specific higher level domain for each genus and species by using the NCBI Taxonomy Browser. The higher level domain will usually be a phylum or class. If the species is not present in the NCBI taxonomy, select only the highest level, a superkingdom (Archaea, Bacteria, or Eukaryota). Only enter this information if you are reasonably confident that the distribution of the pathway will be limited.

    The Taxonomic-Range slot is primarily intended for use by the PathoLogic pathway prediction program during generation of new PGDBs. Once an expected taxonomic range has been assigned to a MetaCyc pathway, the domains or subdomains represented in that range can be compared with the taxonomy assigned to the organism in the new PGDB. PathoLogic could then assign a lower probability score to the pathway if the organism for the PGDB is not in the expected taxonomic range of the MetaCyc pathway. This form of reasoning can help to exclude inappropriate pathways from the new PGDB. Expected Taxonomic Range also gives MetaCyc users a taxonomic perspective on the pathway that they might not otherwise have if they are not familiar with some of the organisms in the species slot.
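    The comparison described above amounts to asking whether the organism's lineage passes through any taxon in the pathway's expected range. The sketch below shows that membership test only; it is not PathoLogic's scoring logic, the function name is an assumption, and the lineages are abbreviated and illustrative.

```python
def in_expected_range(organism_lineage, taxonomic_range):
    """Return True if the pathway has no expected taxonomic range, or if
    any taxon in the organism's lineage belongs to that range."""
    if not taxonomic_range:
        return True  # no restriction recorded for this pathway
    return any(taxon in taxonomic_range for taxon in organism_lineage)


# Abbreviated lineages, from the highest level downward.
bacterial_lineage = ["Bacteria", "Proteobacteria"]
plant_lineage = ["Eukaryota", "Viridiplantae"]
plant_only_range = {"Viridiplantae"}
```

    A prediction program could use a False result as one reason to lower a pathway's score, but how such evidence is weighted is outside the scope of this sketch.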

    6.3 E. coli Pathways in MetaCyc

    MetaCyc contains many E. coli enzymes and pathways. Pathways and enzymes of E. coli K12 are automatically imported from EcoCyc into MetaCyc periodically. To ensure that updates are not lost as a result of that automatic import process, please observe these rules when updating E. coli K12 enzymes and pathways in MetaCyc:

    Please consider EcoCyc to be the authoritative source for K12 enzymes. Do not update K12 enzymes in MetaCyc. Update them in EcoCyc.

    For K12 pathways, the only slots you should update in MetaCyc are:

    Species, Summary, Common-Name, Synonyms, Citations, History

    You should not change the pathway topology (the set of reactions within a pathway or their relative arrangement). The most likely change to a K12 pathway in MetaCyc is extending that pathway to another species.
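    The rules above can be turned into a simple pre-save check. The sketch below is a hypothetical helper, not a Pathway Tools feature: only the six permitted slot names come from this section, and the slot name used in the example call is a placeholder for any slot outside that list.

```python
# Slots that may be updated in MetaCyc for E. coli K12 pathways.
EDITABLE_K12_PATHWAY_SLOTS = frozenset({
    "Species", "Summary", "Common-Name", "Synonyms", "Citations", "History",
})


def disallowed_slot_edits(changed_slots):
    """Return, in sorted order, the changed slots that should instead be
    edited in EcoCyc because they are not on the permitted list."""
    return sorted(set(changed_slots) - EDITABLE_K12_PATHWAY_SLOTS)


# "Reaction-List" stands in for any slot that affects pathway topology.
problems = disallowed_slot_edits(["Summary", "Reaction-List"])
```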

    6.4 Pathways from Other Pathway Databases

    When entering pathways into MetaCyc from other pathway DBs, please:

    Enter a citation in the Citations slot of the pathway, for the appropriate DB

    Enter a web link from the MetaCyc pathway to the other pathway DB

    Describe in the summary field of the pathway any changes that you made to the pathway, and why

    6.5 Proteins as Substrates in MetaCyc

    Some reactions in MetaCyc pathways involve protein substrates such as acyl carrier protein (ACP) or thioredoxin. These protein substrates should be entered as protein class frames within MetaCyc. The protein class frames are generic, species-independent descriptions of these proteins. The reason for this approach is that a number of different MetaCyc reactions may reference a particular protein as a substrate, but those reactions may be from different pathways in different species, leading to confusion as to exactly which version of the protein is intended. In addition, when MetaCyc pathways are predicted in organism-specific PGDBs, these reactions are used in the context of that specific organism, and must not refer to protein instances from other organisms (e.g. a human pathway utilizing an E. coli protein as a substrate).

    If organism-specific instances of such proteins exist in MetaCyc (e.g. specific E. coli thioredoxins), these instances should reside within the appropriate protein class. This will enable the user to see a list of all such instances by clicking on a protein substrate in a pathway diagram.

    6.6 Curation with Classification Systems

    PGDBs contain several classification systems to which individual objects can be assigned. For example, the classification system for pathways divides all metabolic pathways into over 100 different categories and sub-categories, such as biosynthesis, biosynthesis of amino acids, and biosynthesis of carbohydrates. Individual pathways can be assigned to those classes to facilitate retrieval and drill-down by users into categories and sub-categories.

    Classification systems exist for pathways, chemical compounds, reactions, and genes. In some cases it is appropriate to assign an object to more than one class, such as in the case of a multifunctional protein.

    When updating an object, please consider whether its assignment within a classification system should be revised.

    6.6.1 Gene Classes

    The gene class hierarchy is the MultiFun system developed by Monica Riley and her colleagues [5].

    The gene class hierarchy has several classes specific for classification of uncharacterized genes. The general gene class ORF has three child classes as of the most recent updates to the hierarchy (in July 2003):

    1. conserved ORF

    2. conserved hypothetical ORF

    3. nonconserved ORF

    As we describe them to the users: Conserved ORFs have homologs, usually in other organisms, where one or more of those homologs have known functions, but the sequence similarity of a conserved ORF to its homologs is not strong enough to permit inference of function of the conserved ORF. In contrast, conserved hypothetical ORFs have homologs, but none of those homologs have known functions. The remaining category, nonconserved ORF, is used to refer to genes that do not have assigned gene class functions and which do not have significant similarity to other genes.

    As guidelines for naming the products of these uncharacterized genes, we suggest that a product of a conserved ORF be called "conserved protein", a product of a conserved hypothetical ORF be called "conserved hypothetical protein", and a product of a nonconserved ORF be called "hypothetical protein". In cases where some additional information has been recorded, but this information is insufficient for prediction of protein function, the additional information may be appended to the standard name; for example, "conserved protein with an unusual xyz domain".
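    The naming guideline above is a fixed mapping from gene class to product name, with optional appended detail. A minimal sketch follows; the function and variable names are assumptions made for illustration, while the class names and product names are those given in this section.

```python
# Suggested product names for the three child classes of the ORF class.
PRODUCT_NAMES = {
    "conserved ORF": "conserved protein",
    "conserved hypothetical ORF": "conserved hypothetical protein",
    "nonconserved ORF": "hypothetical protein",
}


def product_name(gene_class, extra_info=""):
    """Return the suggested product name for an uncharacterized gene,
    appending any additional recorded information to the standard name."""
    base = PRODUCT_NAMES[gene_class]
    return f"{base} {extra_info}".strip()
```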

    The explanations of selected gene class definitions are provided by Dr. Gretta Serres:

    The cell structure category contains both genes encoding components of the structure and genes encoding products involved in the biosynthesis of the structure.

    The terms trigger (3.4) and modulator (3.5) refer to specific compounds/proteins that either trigger a response or modulate a response. For example, in the case of lacI, lactose acts as the trigger and CRP acts as a modulator of the regulation.

    Any transcriptional activator or repressor should be assigned a class of 3.1.2.2 or 3.1.2.3, or both for dual regulators.

    The adaptation to stress class is intended to cover the classical inducers of stress such as osmotic pressure, temperature, and starvation.

    The protectio