
31 Jul 2002 11:44 AR AR167-GG03-12.tex AR167-GG03-12.sgm LaTeX2e(2002/01/18)P1: IBD10.1146/annurev.genom.3.030502.101529

Annu. Rev. Genomics Hum. Genet. 2002. 3:293–310 doi: 10.1146/annurev.genom.3.030502.101529

Copyright © 2002 by Annual Reviews. All rights reserved

DATABASES AND TOOLS FOR BROWSING GENOMES

Ewan Birney,1 Michele Clamp,2 and Tim Hubbard2
1European Bioinformatics Institute (EMBL-EBI), 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom; e-mail: [email protected], [email protected], [email protected]

Key Words genome sequence, gene prediction, relational database, open source software, distributed annotation

■ Abstract To maximize the value of genome sequences they need to be integrated with other types of biological data and with each other. The entire collection of data then needs to be made available in a way that is easy to view and mine for complex relationships. The recently determined vertebrate genome sequences of human and mouse are so large that building the infrastructure to manage these datasets is a major challenge. This article reviews the database systems and tools for analysis that have so far been developed to address this.

INTRODUCTION

The human genome sequence represents the first bounded biological dataset concerning our species. Having access to it is a landmark because of the limits it sets on the problem of understanding biology as a whole. It has allowed us to assemble something equivalent to an “edge” of a multidimensional jigsaw puzzle. This is only a first step in completeness, but at least it gives us a feeling for size and boundaries and directs us toward the “middle” of the puzzle that now needs to be filled in. It is the first step toward other complete datasets: the complete set of genes, the complete set of proteins, the complete set of molecular interactions in the cell, etc. Completeness changes the way we ask questions: “Is this gene involved in this function?” becomes “Which gene carries out this function?” These datasets will be determined by a combination of experimental work and computational analysis, but in the context of the genome sequence. Genome sequences provide a framework around which all this biological knowledge can potentially be organized, so each layer of data will lead to a greater understanding of layers of organization of biological systems above it.

The availability of several closely related genome sequences (e.g., mouse, rat) brings the possibility of building lists of molecular features common to all vertebrate species and those that are unique to our own. Evolutionary similarities between individual proteins can be identified across the whole of life. Among vertebrates, conservation between genome sequences goes beyond similarities between protein-coding genes and extends to gene order, resulting in large syntenic blocks of many megabases. Other more distant nonvertebrate genomes, such as fly or worm, cannot be usefully compared to human at the level of chromosomal organization (except in rare cases such as the HOX gene cluster), but orthologous genes and proteins can be identified. Analysis of conserved networks of equivalent proteins in distant species will lead to understanding how common cellular systems work and what makes species different from each other. It will also increase our understanding of how biology accommodates change through evolution and variation within populations of individuals. Refining and extending this set of orthologies will have a substantial impact on the development of medical treatments, as these relate studies of molecular systems in model organisms to human.

Because all biological data is in some way information about how biology as a whole is organized, it is most valuable when systematically organized and integrated. Having these large collections of raw data, which include protein and RNA as well as genome sequences and structures, protein and RNA expression patterns, and cellular localization images, has created a huge need for databases to store information, provide access, and add value.

For a current snapshot of the huge range of biological databases a good source is the annual special database issue of Nucleic Acids Research, published each January. It lists 339 databases in its opening review article in 2002 (5). Long before the first complete genome sequences of free-living cells were determined, groups from around the world had been tackling the issues of (a) building repositories for raw data, (b) adding annotation to this raw data, and (c) providing higher level structure and organization. Examples of repositories are the public DNA sequence databases of EMBL (37), GenBank (6), and DDBJ (38) as well as the public protein structure database PDB (40). Examples of annotation databases are Flybase, which maintained annotation around the genetics of Drosophila long before a genome sequence was available (10), and SwissProt, which adds functional annotation to protein sequences largely originating from mRNA and genomic sequencing (3). Examples of organizational databases are Pfam, which groups protein sequence domains into families, thereby showing evolutionary relationships between paralogous proteins within an organism and orthologous proteins between organisms (4); SCOP, which groups more distantly related proteins together by structural similarity (26); and KEGG, which organizes proteins and ligands into networks of enzymic processes and regulatory networks (20). These examples are only illustrative, as the roles are not clear cut even in these cases, e.g., the DNA sequence repositories and organizational databases both contain some annotation.

What differentiates the above databases from genome-sequence-based databases is that the former are all founded around primary sets of data that are currently unbounded, whereas in the case of the latter the primary dataset is essentially bounded. We do not know how many protein folds there are. We do know the complete DNA sequence for many organisms. One of the results of the integration of these existing databases with genome-sequence databases is that we can propagate this completeness to identify where the gaps in our knowledge lie. For example, we can identify the places in the human genome where there is evidence for a gene but where we do not yet have the full-length mRNA transcript or the protein sequence for which it codes. We can identify which protein sequences in a complete genome can be associated to a known three-dimensional structure and which need to be targeted for X-ray or nuclear magnetic resonance (NMR) structure determination, such as those determined by structural genomics projects. We can identify which proteins are not part of any known cellular pathway and thus need to be targeted for functional analysis.

A major current challenge for all these database projects is to increase their integration by means that may include propagating information upward from the complete genomes. One of the problems that urgently needs to be addressed in this integration is the maintenance of evidence trails linking derived annotation with the source of its evidence. For example, a protein of unknown function is labeled as a kinase because of a weak sequence homology to another protein that is known to be a kinase. Later it is discovered that the weak homology between the sequences was false and was due to a frameshift error in one of the protein sequences. Because most databases do not track the relationship between annotation and the evidence that supported it, the ‘kinase’ label is likely to persist even when the justification for it has vanished. Ideally all objects in all databases should have stable, versioned identifiers, and the relationships between them should be recorded so that when a sequence version changes it can be automatically determined that any evidence that relies on it needs to be reevaluated.
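The evidence trail described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `VersionedId`, `Annotation`, and `EvidenceStore` classes and the accessions are hypothetical and do not correspond to any existing database.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VersionedId:
    """A stable accession plus a version that increments when the object changes."""
    accession: str
    version: int

    def __str__(self) -> str:
        return f"{self.accession}.{self.version}"


@dataclass
class Annotation:
    target: VersionedId          # the object being annotated
    label: str                   # e.g. "kinase"
    evidence: list               # VersionedIds the label was derived from
    needs_review: bool = False


@dataclass
class EvidenceStore:
    """Hypothetical store recording which annotations rely on which objects."""
    annotations: list = field(default_factory=list)

    def add(self, annotation: Annotation) -> None:
        self.annotations.append(annotation)

    def update_version(self, old: VersionedId) -> VersionedId:
        """Create the next version of an object and flag every annotation
        whose target or evidence was the old version."""
        new = VersionedId(old.accession, old.version + 1)
        for ann in self.annotations:
            if ann.target == old or old in ann.evidence:
                ann.needs_review = True
        return new


store = EvidenceStore()
unknown = VersionedId("PROT0001", 1)       # hypothetical accessions
known_kinase = VersionedId("PROT0002", 1)
store.add(Annotation(unknown, "kinase", evidence=[known_kinase]))
store.add(Annotation(VersionedId("PROT0003", 1), "transporter",
                     evidence=[VersionedId("PROT0004", 2)]))

# A frameshift error is corrected in PROT0002, producing a new version.
new_kinase = store.update_version(known_kinase)
flagged = [str(a.target) for a in store.annotations if a.needs_review]
print(new_kinase, flagged)
```

Only the annotation that relied on the changed sequence is flagged; the unrelated annotation is left alone, which is the behavior the text asks for.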

When considering errors introduced by propagated annotation, it is worth also remembering the range of quality and completeness of data that is presented in databases as a whole. Ideally we would have a complete set of exact experimental measurements while being able to compute the behavior of biology exactly, in order to predict perturbations to naturally occurring systems. Currently we can do neither. For example, we have experimental methods for extracting and sequencing mRNA from cells; however, there are many transcripts that are present in such small quantities or for such a transient period that they have never been isolated. Similarly, our ability to “compute” biology is currently very limited. As is discussed below, automatic prediction of gene structures in higher organisms gives good levels of accuracy, but it is still prone to a variety of errors. Between these two extremes of prediction and experimental determination lies a third type of annotation, that of manual curation. The combination of human skills and experience, coupled frequently with knowledge of the scientific literature, means that careful manual annotation is less error prone and more complete than automatic annotation, although it is much slower and harder to maintain. In fact, prediction and curation complement each other well. The volume of data from newly sequenced genomes makes automatic annotation essential, where speed is essentially only limited by the available central processing unit (CPU) and frequent updates are required to take account of the ever-increasing amount of biological data available to support annotations. Therefore when making use of annotation, or ‘data’ (such as protein sequences) based on annotation (gene structures from genome sequence), users must be aware of whether it is based on prediction, curation, or experimentation; what the expected accuracy is; and when it was last updated.

While noting the limitations of current systems, we can look forward to what will be possible as existing systems converge. A perfect integration would allow biology to be viewed in vertical slices (from genome sequence to organism) and horizontal slices (from viruses to humans). A vertical slice might be the biology around a gene and could stretch from the genome sequence surrounding the gene, the gene’s protein products, the product’s role in a network of molecules that on the input side regulate its production and on the output side control its interaction with other molecules, the localization of products in cells and across the organism, the temporal expression in the organism from life to death, or the variation in the gene and the impact of this variation within a population of organisms. A horizontal slice might be the biology around a metabolite and could stretch from the existence of the metabolite outside life to the evolutionary tree of its roles and metabolism stretching from bacteria to humans, thus showing where it became adopted for new roles and where old roles were replaced by other mechanisms or modified versions of the metabolite. Each step toward a goal of complete integration should improve our ability to predict our interactions with biology, as we consider the consequences of perturbations such as the availability of a new drug, of new virus genotypes, and of the release of new pollutants.

From every standpoint, we are a very long way from attaining such a goal: in terms of availability of experimental data, in terms of being able to predict the behavior of biological systems, and in terms of organizing existing data. However, the first steps are very clear; they are the organization of existing biological data around and between the newly determined vertebrate genome sequences and the prediction of the next level of information from the genome sequence, namely, the genes and proteins coded by it. This review concentrates on progress on these two issues, surveying the systems that are being constructed to automatically annotate the human genome for genes and the computer systems technology that is being assembled to store, manipulate, and provide access to these very large blocks of data.

AUTOMATIC GENE ANNOTATION

Analysis Pipelines

The human (11, 39) and mouse (25) genomes are currently much larger than any other genome sequenced, and the sheer size of the data has posed many problems to those trying to annotate it. These problems are exacerbated in the human genome by the constant change in the sequence as it is being finalized, which has led to a series of genome assemblies that need to be analyzed. This problem has been partially solved by increasing the number of computers available for the analysis, but this is only part of the solution. A large portion of the analysis has to be automatic to get acceptable annotation turnaround, and this involves engineering an analysis-pipeline system to process the vast amount of sequences. A number of large-scale analysis pipelines exist [Ensembl (18), National Center for Biotechnology Information (NCBI), Oak Ridge Genome Channel (28), Gadfly, Celera’s Otto (22)]; and although they differ in their structure and underlying storage systems, they share many common features. The early stages of the different pipelines are very similar; before any gene annotation can proceed many different programs need to be run to provide input data for the gene-building system. The genomic sequence needs to be masked for repeats [e.g., RepeatMasker (34)] before doing most other analyses. After the repeat masking the next stage is to run both ab initio gene-prediction programs [e.g., Genscan (8), Fgenesh (35), Genie (32)] and database searches against the public databases. The point at which this “raw analysis” is done is where the different genome pipelines start to diverge. Each site uses different programs and sources of data to produce their gene annotations, and this part of the analysis is the most dynamic and subject to change.

Gene-Prediction Methods

Gene-prediction methods have traditionally been described as falling into two distinct approaches. The first approach has been to use the statistical information contained within the genomic sequence to predict gene structures. The information used commonly includes compositional analysis of exons, introns, and splice sites, which may vary according to the background composition of the full genomic sequence. There are many programs that perform this ab initio prediction, including Genscan, Fgenesh, HMMGene (24), Grail (41), and Genie. The algorithms behind these programs are Hidden Markov Models (HMMs), neural networks, and linear discriminant functions, although the majority now use HMMs. Genscan is commonly regarded to be the most accurate for predicting genes in the human genome, but all ab initio methods have a common weakness for large genomes. This weakness is partly due to the nature of genes in large vertebrate genomes. The amount of coding sequence compared to noncoding sequence is very small (approximately 2% of the human genome is estimated to be coding), which leads to large introns and large intergenic regions with relatively small exons. For instance, the mean size of an exon in the human genome is 120 base pairs with the median intron size being ∼2000 base pairs. The large noncoding regions lead the ab initio programs astray and lead to extra genes being predicted in intergenic regions. The large introns within genes can also lead to extra exons being predicted within a true gene.

The second approach to gene prediction has used the wealth of sequence data from genes that is already stored in the public sequence databases [SWISS-PROT (3), EMBL (37), RefSeq (31)]. These are commonly referred to as similarity-based approaches and are founded around taking a protein or a cDNA and aligning it to the genomic sequence. There are many programs that will do a local or global alignment of two sequences [BLAST (1)], but for gene annotation these programs are not good enough as they do not have any knowledge of gene structure: how to model splice sites, for instance. For accurate alignment of proteins or cDNAs to genomic sequences, programs should be able to model splice sites, exons, introns, and in the case of protein alignment programs, open reading frames. For aligning cDNAs to genomic sequences, programs such as SIM4 (16), ESTGENOME (27), BLAT (J. Kent, manuscript in preparation), and Exonerate (G. Slater & E. Birney, manuscript in preparation) are commonly used. Genewise (7) and Procrustes (17) are used for aligning proteins.

Sources of Genome Annotation

There are currently three main sites providing free annotation of the human genome. These are the Ensembl site (http://www.ensembl.org), NCBI (http://www.ncbi.nlm.nih.gov), and the genome browser at University of California, Santa Cruz (UCSC) (http://genome.cse.ucsc.edu). Both approaches to gene prediction (ab initio and similarity based) are used and combined to provide the end user with a set of gene annotations.

ENSEMBL The Ensembl gene-prediction system uses both ab initio and similarity-based methods but increases the specificity of the ab initio predictions by only using the ones confirmed by similarity to protein, cDNA, or expressed sequence tag (EST) sequences. Ensembl predicts genes in three steps: by placing known human genes onto the genome, then placing highly similar genes (maybe from mouse or rat) onto the genome, and finally predicting novel genes from ab initio predictions supported by sequence similarity.

The first two steps are based around aligning proteins to the genome. The final alignment of each protein is done using the Genewise system, which aligns proteins to genomic DNA. Genewise uses an HMM to model exons, introns, and their splice sites. It can cope with sequencing error, as it can model frameshifts, and because it is protein based, stop codons can be penalized to produce a translatable gene structure. As Genewise is a very slow program and the genome is very large, each protein to be aligned is placed on the genome using two methods. For the human proteins a very fast exact-matching algorithm, pmatch (R. Durbin, unpublished), is used to place each protein at its approximate place in the genome. To refine the location of the protein, BLASTX is then used. This should locate the positions of the exons to within a few bases and also picks up small exons for which pmatch is not sensitive enough. Only then is Genewise used to align the protein. For the similar proteins the pmatch step cannot be used, as pmatch finds exact matches only. BLASTP is used instead to search the SWISS-PROT and TrEMBL databases.

Genewise is a protein alignment program, thus the resulting gene predictions will lack untranslated regions. To remedy this, all the human cDNAs are aligned to the genome using another program that is sensitive to splice sites, ESTGENOME. Again, as for Genewise, to reduce the search time cDNAs are placed onto the genome using a two-step method before ESTGENOME is run. Instead of pmatch, the program Exonerate is used. Exonerate is a fast DNA-DNA alignment program that allows for insertions and deletions. This is used to place the cDNA within 1 Mb or so of its appropriate region. BLASTN is then used to locate the exon positions to within a few bases; then ESTGENOME is run on only those regions of the genome. The Genewise alignments and ESTGENOME alignments are then combined into complete gene structures.

The Genewise/Exonerate method for predicting genes accounts for 70% of the Ensembl genes. Genewise can only provide good alignments down to approximately 70% identity, so Ensembl uses a final method to predict genes by using weaker sequence similarities. As stated earlier the ab initio program Genscan has good sensitivity but not very good specificity. Ensembl runs Genscan over the genome but only includes exons that have similarity to protein, EST, or cDNA sequences. The similarity matches are then used to join the exons together into gene structures.
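The filtering step in this final method, keeping only ab initio exons that have similarity support, can be sketched as a simple interval-overlap test. This is an illustration under stated assumptions, not the Ensembl implementation: the function names and coordinates are hypothetical, and any overlap at all is treated as support.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive genomic coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def supported_exons(predicted: List[Interval],
                    hits: List[Interval]) -> List[Interval]:
    """Keep only the predicted exons that overlap at least one similarity hit."""
    return [exon for exon in predicted
            if any(overlaps(exon, hit) for hit in hits)]


# Hypothetical ab initio exon predictions and similarity hits on one contig.
genscan_exons = [(100, 220), (900, 1010), (5000, 5130), (9000, 9100)]
similarity_hits = [(150, 210), (5050, 5200)]

kept = supported_exons(genscan_exons, similarity_hits)
print(kept)  # [(100, 220), (5000, 5130)]
```

A real pipeline would also need to decide how much overlap counts as confirmation and how to join the surviving exons into gene structures; neither is attempted here.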

NCBI The gene predictions on the NCBI website are based around the Genomescan program (42). Genomescan combines ab initio gene predictions with similarities to protein sequences in order to predict gene structures that have at least one exon with supporting evidence from an existing protein sequence. Genomescan is an extension of the Genscan program, which uses a probabilistic model of exon-intron structure and compositional features of human genes. Genscan uses a generalized HMM that consists of a number of states modeling the various parts of a gene. These states include 5′ splice site, 3′ splice site, internal coding exon, start exon, and terminal exon. The final gene structure predicted by Genscan is the maximum probability path through the HMM. Genomescan extends Genscan by again predicting the gene structure based on an HMM, but this time the probability of the path is changed depending on the presence or absence of regions of sequence that are similar to known proteins. These regions of similarity can be generated using any method the user prefers. The BLASTX program is a popular method and is capable of generating the similarity regions down to the necessary sequence-similarity distance.
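The notion of a maximum probability path through an HMM can be made concrete with the standard Viterbi algorithm. The sketch below is a deliberately simplified stand-in: it uses a plain two-state HMM (N for noncoding, C for coding) rather than the generalized HMM of Genscan, and all of the probabilities are invented for illustration.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs, as a string of state names."""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path))


# Toy parameters (assumptions): coding sequence is GC-rich, state changes are rare.
states = ("N", "C")
start_p = {"N": 0.9, "C": 0.1}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {
    "N": {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

sequence = "ATATGCGCGCGCATAT"
path = viterbi(sequence, states, start_p, trans_p, emit_p)
print(path)  # NNNNCCCCCCCCNNNN
```

In the approach described in the text, similarity evidence changes the probability of paths through the model; in this toy setting that would correspond to adjusting the scores of states at positions covered by a similarity region, which the sketch does not attempt.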

Because BLASTX will generate output that suggests that some regions of genomic sequence have protein homologies that are spurious, Genomescan will not automatically include every similarity region. Each region is evaluated and assigned a probability, and Genomescan combines these probabilities with the HMM-state probabilities, thus leading to predicted gene structures that are more likely than not to include the similarity regions.

Not all gene predictions on the NCBI site are predicted using Genomescan. The human cDNAs from RefSeq and EMBL/GenBank are aligned directly to the genome. After the exact cDNAs are aligned the genome is split into putative gene boundaries using mRNA alignments. Genomescan is then run on these regions with a range of similarities from different sources. Three sources of protein hits are used for Genomescan: (a) BLASTX of the genome against a nonredundant set of vertebrate proteins, (b) rpsBLAST of the genome against motifs, and (c) BLASTP of all protein sequence hits to Genscan predictions. The Genomescan predictions have probabilities associated with them, and the NCBI site puts them into two classes: those with probabilities greater than 1e-4 and those with probabilities less than 1e-4. A final class of Genomescan genes comprises those that are predicted with no similarity confirmation. These can be considered equivalent to the raw Genscan predictions.

It is worthwhile comparing Genomescan to other programs (Genewise, Procrustes) that also use protein similarities to produce gene structures. Genewise and Procrustes are more tightly bound to using all available sequence similarity and thus have very high specificity, but their sensitivity declines as the sequence similarities used become more and more distant. In practice this means that Genewise will fail to give any prediction with a distant protein because it cannot find an alignment. For instance, Genewise is 99% accurate with 99% coverage with a 90% identical protein, but when the protein used is only 70% identical the coverage falls to 30%. If you want to produce gene predictions that have a high probability of being correct, algorithms like Genewise are the best choice if you do not mind missing some things. If on the other hand you want more coverage but can sacrifice some overprediction, approaches such as Genomescan are very useful as their sensitivity remains high with more-distant proteins.

UCSC GENOME BROWSER The UCSC human-genome site provides multiple sets of gene predictions, some of which are produced at UCSC and some of which are provided by external groups. The group at UCSC produces a set of genes originating from the human section of RefSeq, which is a curated set of gene structures. The alignment of RefSeq cDNAs is done with BLAT. BLAT is designed to very quickly find high-similarity (>95%) matches of lengths of 40 bases or more. Like other similar fast algorithms [SSAHA (29)], BLAT holds the whole genome in memory in the form of nonoverlapping 11mers that are not found in repeats. Using this sequence index BLAT finds the approximate location of a sequence to be aligned. Once the approximate location is found a full alignment that allows for introns and splice-site modeling is then done.
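The indexing idea can be shown with a toy sketch. It is not BLAT: the k-mer length is 4 rather than 11 so that the example fits on a page, repeat filtering is omitted, introns are ignored, and the function names and sequences are hypothetical. The genome is indexed by nonoverlapping k-mers, every overlapping k-mer of the query is looked up, and each match votes for the genomic offset it implies.

```python
from collections import Counter, defaultdict


def build_index(genome: str, k: int) -> dict:
    """Map each nonoverlapping k-mer of the genome to its start offsets."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return index


def approximate_location(query: str, index: dict, k: int):
    """Look up every overlapping k-mer of the query and return the genomic
    start position supported by the most matches (None if nothing matches)."""
    votes = Counter()
    for offset in range(len(query) - k + 1):
        for genome_pos in index.get(query[offset:offset + k], ()):
            votes[genome_pos - offset] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


K = 4  # toy value; the text describes 11mers
genome = "ACGTTGCAAGGCTTAACCGGTATCGATCGGATTACAGGCATCCGAATTG"
query = genome[13:33]  # a 20-base fragment taken from the genome itself

index = build_index(genome, K)
location = approximate_location(query, index, K)
print(location)  # 13
```

Indexing only nonoverlapping k-mers keeps the index roughly k times smaller than indexing every position, at the cost of needing a query long enough to contain whole indexed k-mers. The detailed alignment would then be run only around the reported location.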

Aligning RefSeq genes to a genome will only provide a subset of all human genes present in the genome. Another set of gene predictions on the UCSC browser that uses extra information is that produced using Acembly. The Acembly program predicts gene structures from cDNAs and ESTs and is built on top of the Acedb database (14). If there are ESTs or cDNAs that produce different models because of alternative splicing, all models will be generated. This is in contrast to the HMM-based method Genomescan, which will only produce the most probable model. It often happens that a cDNA will align to more than one region of the genome. This could be due to assembly errors or the presence of a very similar paralog or pseudogene. In these cases Acembly will only keep multiple alignments if they are of similar quality. If one alignment is worse than the other, only the best alignment will be kept. Acembly is especially focused on aligning ESTs to genomic sequence; for example, it allows for mismatches and frameshifts, and it can use and display the underlying trace data.
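The rule for multiply aligning cDNAs can be written down in a few lines. This is a sketch of the rule as stated in the text, not of Acembly itself: the use of percent identity as the quality measure, the tolerance of one percentage point, and the region names are all assumptions.

```python
def keep_alignments(alignments, tolerance=1.0):
    """Keep every alignment whose percent identity is within `tolerance`
    of the best one; clearly worse alignments are dropped.

    alignments: list of (region, percent_identity) tuples.
    """
    if not alignments:
        return []
    best = max(identity for _, identity in alignments)
    return [(region, identity) for region, identity in alignments
            if best - identity <= tolerance]


# Hypothetical alignments of one cDNA to three genomic regions.
candidates = [("chr1:1000-5000", 99.5),
              ("chr7:200-4100", 99.0),    # near-identical paralog: kept
              ("chr12:50-3900", 91.0)]    # clearly worse: dropped

kept_alignments = keep_alignments(candidates)
print(kept_alignments)
```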

Another predicted gene track on the UCSC browser comes from Softberry (http://www.softberry.com) and uses a program Fgenesh+, which is based on HMMs and protein similarity but with less emphasis on cDNA/EST data. Fgenesh+ is based on an idea similar to Genomescan and also came out of a group that had previously written ab initio prediction programs. Like Genomescan it uses similarity hits (commonly BLASTX) to known proteins to affect the probabilistic gene model and biases the final gene structure predicted to more often overlap the similarity hits.

Gene Prediction Using Other Genomes

With the arrival of the mouse genome sequence in the public domain there is a source of data for gene prediction that is the focus of much gene-prediction research. In theory if you take the genome sequence of two species, the functional parts will be conserved and over time the nonfunctional regions of sequence will slowly diverge. So if you align two syntenic regions of the genome, the conserved regions will highlight things like the exons and, possibly, the regulatory binding sites. This has the advantage of not needing a protein or cDNA already in the database to predict genes. In practice the mouse and human sequence have as many matches in introns as in exons, so it is not a trivial problem to sort the coding from the noncoding regions. Additionally not all exons are conserved from mouse to human, so the problem turns from just trivially stringing the matches together to form genes, to a more complicated issue. Several groups have attempted to solve the problem of predicting gene structures by using genomic sequence from more than one organism. These include Twinscan (23) and Doublescan (I. Meyer & R. Durbin, unpublished).

Twinscan is based on the generalized HMM of Genscan and integrates cross-species similarity into the probabilistic model of Genscan. It has similarities with Genomescan and Fgenesh++ as it too is attempting to combine similarity information with a probabilistic model. In some ways Twinscan has a more difficult problem to solve than the other methods, as the similar regions between two genomes may fall either in exons or introns or even regulatory binding sites.

Annotation Limitations

At the time of writing, just under 50% of the human genome sequence is available only in draft form. This means that, as well as there being sequencing errors in the draft part of the genome sequence, the fragmented nature of the sequence leads to problems in assembling the genome and subsequently problems in annotating it. The sequence errors can cause the annotation procedures to fail to predict a gene, owing to frameshifts. The problems in the assembly can lead to genes being fragmented. If, for instance, there is a local misassembly with two regions of sequence being in the wrong order or in the wrong orientation, it becomes almost impossible for the gene annotation methods to find all the exons for a gene because all the methods rely on exons being on the same strand and in the right order. Also it must not be forgotten that there are still small regions of missing sequence (currently 5% of RefSeq genes cannot be placed on the genome), which can lead to genes being completely absent or only partially annotated. Missing sequences can also lead to a different problem: If a gene cannot be found in the genome, there may exist a close paralog or pseudogene elsewhere for which genomic sequence already exists. This may get annotated as the real gene, as it is the closest thing that can currently be found. The draft nature of much of the sequence makes it very difficult for the automatic methods to differentiate between a pseudogene, a close paralog, and the real gene because they cannot tell whether the differences in the gene sequence arise from sequencing errors or from real divergence in the genomic sequence. Finally, as all the methods described use, in part, data from the public databases, errors in the database sequences can be propagated onto the genome. This is especially dangerous when building genes using EST sequences that are prone to genomic contamination.
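The assumption that exons lie on one strand and in the right order can be expressed as a simple consistency check on a set of exon placements. The sketch below is illustrative only; the `ExonHit` record and the coordinates are hypothetical, and real annotation methods build this assumption into their alignment models rather than testing it afterward.

```python
from typing import List, NamedTuple


class ExonHit(NamedTuple):
    cdna_start: int    # where the exon begins in the cDNA
    genome_start: int  # where it was placed on the assembly
    strand: int        # +1 or -1


def is_colinear(hits: List[ExonHit]) -> bool:
    """True if all hits share one strand and their genomic order follows
    the cDNA order (ascending on +1, descending on -1)."""
    if not hits:
        return True
    if len({h.strand for h in hits}) != 1:
        return False
    ordered = sorted(hits, key=lambda h: h.cdna_start)
    positions = [h.genome_start for h in ordered]
    descending = ordered[0].strand == -1
    return positions == sorted(positions, reverse=descending)


# A correctly assembled gene, and two hypothetical local misassemblies.
correct = [ExonHit(1, 1000, 1), ExonHit(151, 5000, 1), ExonHit(300, 9000, 1)]
wrong_order = [ExonHit(1, 1000, 1), ExonHit(151, 9000, 1), ExonHit(300, 5000, 1)]
wrong_strand = [ExonHit(1, 1000, 1), ExonHit(151, 5000, -1), ExonHit(300, 9000, 1)]

results = [is_colinear(correct), is_colinear(wrong_order), is_colinear(wrong_strand)]
print(results)  # [True, False, False]
```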

Annotation Summary

The best gene-prediction methods in vertebrate genomes rely to a greater or lesser extent on similarities to known proteins or cDNAs/ESTs. This is in contrast to bacterial genomes where, owing to the smaller size of the genomes and reduced complexity in the gene structures, the available ab initio programs perform very well.

There are various ways of incorporating similarities into gene annotation. There are the strict methods such as Genewise and the looser methods such as Genomescan. The former have high specificity but can only use high-similarity sequences. The latter have lower specificity but are able to produce gene structures using less-similar sequences. The three main sites all use a combination of these techniques to cover the whole range of gene prediction. Ensembl uses Genewise for exact proteins, down to confirming Genscan exons and building gene structures at the lower end. UCSC aligns exact cDNAs using BLAT, but it also has external predictions using ESTs from Acembly and HMM and protein-based predictions from Fgenesh++ as well as the Ensembl predictions.

The sheer volume of sequence data that needs to be searched has led to a new generation of search algorithms like BLAT, SSAHA, and Exonerate. These algorithms are designed both to fit the shape of data being searched and to take advantage of the computers available today, whether they be single large memory machines (>4 Gb RAM) or multinode (500–1000 Gb) computer farms with modest (∼1 Gb) memory. A common theme through these algorithms is that time is too short for using the most-sensitive algorithms from the outset. BLAT, SSAHA, and Exonerate, as well as the Ensembl search framework for Genewise and ESTGENOME, all employ the technique of finding an initial rough location for a sequence, a step that is usually very fast, and of only focusing on the details once this initial location is found. This is very good for exact or near-exact matches. For searches that return less-similar matches these shortcuts will not work, and one currently has to return to more traditional techniques such as BLAST.

TECHNOLOGY OF GENOME ANNOTATION

The process of creating, managing, and displaying genome information requires nontrivial software. This is because of the scale of storage and computational resources required and the complexity of the problem, both in automatic and manual annotation processes. The genome systems deployed use a mosaic of software components but generally with more similarities than differences. The various systems and some of their features are outlined in Table 1.

TABLE 1 A list of genome management systems and their associated features

System     Relational(a)  Data access(b)  Portable(c)  Language(d)  Web access(e)  Graphical viewer(f)  Developer core(g)  DAS server(h)  DAS viewer(i)
Ensembl    Yes            Yes             Yes          Perl/Java    Yes            Yes                  ∼40                Yes            Yes
UCSC       Yes            Yes             Yes          C            Yes            No                   ∼10                Yes            No
Wormbase   No             Yes             No           Perl/C       Yes            Yes                  ∼5                 Yes            Yes
Flybase    Partial        Yes             No           Perl/Java    Yes            Yes                  ∼5                 Yes            Yes
NCBI       Partial        Yes             No           C            Yes            No                   ∼20                No             No
Genquire   Yes            Yes             Yes          Perl         No             Yes                  ∼3                 No             Yes
SGD(j)     Yes            Yes             No           Perl         Yes            No                   ∼5                 No             No

(a) Whether an RDMS is used.
(b) Whether access to the data is freely available.
(c) Whether the entire system can be ported to a new site or not.
(d) Principal programming language employed in the database.
(e) Whether the database is accessible via the web.
(f) Whether the data can be viewed through a graphical application.
(g) Estimate of the number of people on the core development for the system.
(h) Whether the system can act as a DAS server.
(i) Whether the system can act as a DAS client.
(j) Saccharomyces Genome Database (9).

The management of genomic information requires the long-term storage of its data and easy programmatic ways to access and update this information. Nearly all genome systems now use a relational database management system (RDMS) as their core data store. The advantages of using RDMS are as follows: (a) Considerable work in the theory and implementation has been undertaken in the computer-science industry over the past two decades, providing well-understood, robust systems, in particular with respect to backups and data integrity. (b) Relational databases use a standardized query language (SQL), and all major programming languages have bindings to SQL interfaces, thus facilitating programmatic access. (c) There is a large pool of RDMS- and SQL-savvy programmers who can easily be brought into an RDMS-based project. The main implementations are produced by Oracle or Sybase (which are commercial products) or Postgres or MySQL (which are open-source projects). MySQL is not strictly an RDMS, but for practical purposes it can be considered one and is widely used for genome data storage.
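As an illustration of point (b), the sketch below stores a few features in a relational table and retrieves those overlapping a region with one SQL statement. It uses Python's built-in sqlite3 module purely as a stand-in for the server databases named above; the table layout, column names, and data are invented for the example and are not the schema of any of the systems discussed.

```python
import sqlite3

def make_store():
    """Create an in-memory database holding a toy feature table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE feature (
               feature_id   INTEGER PRIMARY KEY,
               seq_name     TEXT    NOT NULL,
               seq_start    INTEGER NOT NULL,
               seq_end      INTEGER NOT NULL,
               strand       INTEGER NOT NULL,
               feature_type TEXT    NOT NULL,
               name         TEXT    NOT NULL)"""
    )
    # An index on position keeps region queries fast as the table grows
    conn.execute(
        "CREATE INDEX feature_region ON feature (seq_name, seq_start, seq_end)"
    )
    rows = [
        ("20", 1000, 5000, 1, "gene", "geneA"),
        ("20", 4000, 9000, -1, "gene", "geneB"),
        ("20", 20000, 21000, 1, "repeat", "rep1"),
        ("21", 1000, 2000, 1, "gene", "geneC"),
    ]
    conn.executemany(
        "INSERT INTO feature (seq_name, seq_start, seq_end, strand, feature_type, name) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn

def features_in_region(conn, seq_name, start, end):
    """Return features overlapping the inclusive range [start, end]."""
    cursor = conn.execute(
        "SELECT name, feature_type, seq_start, seq_end, strand "
        "FROM feature "
        "WHERE seq_name = ? AND seq_start <= ? AND seq_end >= ? "
        "ORDER BY seq_start",
        (seq_name, end, start),
    )
    return cursor.fetchall()

conn = make_store()
for row in features_in_region(conn, "20", 4500, 6000):
    print(row)
```

The same SELECT statement would run largely unchanged against another SQL database, which is the portability argument made in the paragraph above.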

For most systems direct programmatic access to the database is not encouraged, but instead a number of software abstractions are provided on top of the database. These abstractions, called an application programming interface (API) or a middleware layer, allow the database schema to be insulated somewhat from the main set of programmatic clients to the system. These middleware layers differ in complexity and layout, but a number of systems, in particular Ensembl, Gadfly, Wormbase (36), and Genquire, reuse the open-source projects BioPerl and BioJava (30) to provide a core set of interfaces. This reuse of interfaces sometimes provides the ability to mix and match components directly (e.g., it is possible to use parts of the Genquire viewer with Ensembl directly), but more practically it eases the pain in porting components between projects.
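A minimal sketch of how such a middleware layer insulates clients from the schema is given below. The class and function names (`Gene`, `GeneAdaptor`, `gene_names_in_region`) and the deliberately awkward table `gene_v2` are hypothetical and are not taken from BioPerl, BioJava, or any of the systems above. The point is only that the SQL lives in one place, so a schema change touches the adaptor and nothing else.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    """Plain object handed to clients; it carries no trace of the table layout."""
    name: str
    chromosome: str
    start: int
    end: int

class GeneAdaptor:
    """Hypothetical middleware class: the only code that knows the schema."""

    def __init__(self, conn):
        self._conn = conn

    def fetch_by_region(self, chromosome, start, end):
        rows = self._conn.execute(
            "SELECT g_name, chr, g_start, g_end FROM gene_v2 "
            "WHERE chr = ? AND g_start <= ? AND g_end >= ? "
            "ORDER BY g_start",
            (chromosome, end, start),
        )
        return [Gene(*row) for row in rows]

def demo_connection():
    """Build a toy database whose column names clients never need to learn."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene_v2 (g_name TEXT, chr TEXT, g_start INTEGER, g_end INTEGER)"
    )
    conn.executemany(
        "INSERT INTO gene_v2 VALUES (?, ?, ?, ?)",
        [("geneA", "20", 1000, 5000), ("geneB", "20", 8000, 9000)],
    )
    return conn

def gene_names_in_region(adaptor, chromosome, start, end):
    """Client code: works with Gene objects and never sees SQL."""
    return [gene.name for gene in adaptor.fetch_by_region(chromosome, start, end)]

adaptor = GeneAdaptor(demo_connection())
print(gene_names_in_region(adaptor, "20", 4000, 8500))
```

If the table were later renamed or split, only `fetch_by_region` would need rewriting; `gene_names_in_region` and any viewer built on `Gene` objects would be unaffected.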

The genome management system must allow viewing and, for manual systems, editing of the genomic annotation by users. For most of the previous decade the main system to use was the FMAP display of Acedb (15, 21), which is still used in Wormbase and at the Sanger Institute for genome annotation. However, the late 1990s saw the rise of the Web as the principal mode of communicating to users, and the paradigm of “web views” of a database has been enthusiastically taken up by genome databases. This is discussed further in the Web Views section below. The core of a genome web display is generally a graphical view of a slice of the genome. As this slice can be arbitrarily positioned, the view has to be built on the fly, pulling information from an appropriate datastore and constructing an image. Returning these images within the couple of seconds that users will tolerate is a considerable technical challenge. For example, the UCSC browser goes to some length to optimize its data access pattern, and Ensembl has a second data-transformation step optimized for web displays.
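To make "built on the fly" concrete, the sketch below maps features in an arbitrary slice onto a fixed-width row, one row per track. Text characters stand in for the pixels of a real image, and the function names and data are invented for illustration; the actual browsers render graphics and are considerably more elaborate.

```python
def render_track(features, slice_start, slice_end, width):
    """Draw features overlapping [slice_start, slice_end] as '#' on one text row.

    features is a list of (start, end) pairs in genome coordinates (inclusive).
    """
    row = ["."] * width
    span = slice_end - slice_start + 1
    for feat_start, feat_end in features:
        if feat_end < slice_start or feat_start > slice_end:
            continue  # feature lies outside the requested slice
        # Clip to the slice, then scale base positions to column numbers
        lo = max(feat_start, slice_start) - slice_start
        hi = min(feat_end, slice_end) - slice_start
        first_col = lo * width // span
        last_col = hi * width // span
        for col in range(first_col, last_col + 1):
            row[col] = "#"
    return "".join(row)

def render_slice(tracks, slice_start, slice_end, width=40):
    """Render every named track for the same slice, one labeled line each."""
    label_width = max(len(name) for name in tracks)
    return [
        f"{name.ljust(label_width)} {render_track(feats, slice_start, slice_end, width)}"
        for name, feats in tracks.items()
    ]

# Toy tracks, invented for illustration
TRACKS = {
    "genes": [(11, 20), (61, 75)],
    "repeats": [(30, 34)],
}

for line in render_slice(TRACKS, 1, 80):
    print(line)
```

Because the slice boundaries are arguments, zooming or recentering is just another call with different coordinates, which is why each page view implies a fresh query and a fresh rendering.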

Editing genomic annotation, in contrast to viewing, is an activity that is practiced by a smaller number of people, generally hired by the database, called curators. A curator will spend a high proportion of their time editing annotation, meaning productivity gains provided by the software are crucial. Editing systems are always stand-alone applications, and the main ones are the FMAP display built into Acedb (Wormbase, Sanger Institute), the Apollo editor (joint project between Ensembl and GadFly), Genquire (Genquire system), and Artemis (33). These editors also provide a helpful “power user” application tool for viewing.

A new development in genome viewing is the Distributed Annotation System (DAS) (13). DAS is a protocol that allows a client computer to contact multiple DAS-aware servers and retrieve and integrate genomic annotations. From the perspective of DAS, a genomic annotation is a range on a piece of genomic sequence that, importantly, carries the information a user needs to view a web page associated with the feature. The protocol runs over standard HTTP and can be considered a type of “web service.” An important feature of DAS is that one does not need to present the entire genome to provide a DAS server; indeed, one can serve up annotations on a small region of the genome. This allows much smaller laboratories, down to modest wet labs with some computer expertise, to run their own DAS servers. The DAS specification has been active for approximately 1.5 years to date, and there are already two stand-alone systems for serving DAS annotations: Dazzle from the BioJava project and LDAS, which uses the BioPerl framework. On the client side, the most accessible way to use DAS is through the web browsers associated with the main genome projects: Wormbase, GadFly, and Ensembl all offer this capability. Already there has been a reassuring uptake of DAS usage; one common use case is to present internal work only to researchers in that lab, without publicizing the DAS server beyond a closed group of researchers.
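The client side of this arrangement can be sketched as below: build a features request for a region, parse the XML each server returns, and merge the results into one list. The URL pattern and element names (`SEGMENT`, `FEATURE`, `TYPE`, `START`, `END`, `LINK`) follow the DAS 1 features response as we understand it and should be checked against the specification. The server addresses and sample documents are invented, and no network access is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def features_url(base_url, source, segment, start, stop):
    """Build a features request URL (pattern assumed from the DAS 1 specification)."""
    query = urlencode({"segment": f"{segment}:{start},{stop}"}, safe=":,")
    return f"{base_url}/{source}/features?{query}"

def parse_features(xml_text, source_label):
    """Extract range, type and link for each FEATURE in a response document."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feat in segment.iter("FEATURE"):
            link = feat.find("LINK")
            features.append({
                "source": source_label,
                "segment": segment.get("id"),
                "id": feat.get("id"),
                "type": feat.findtext("TYPE"),
                "start": int(feat.findtext("START")),
                "end": int(feat.findtext("END")),
                "link": link.get("href") if link is not None else None,
            })
    return features

def integrate(*feature_lists):
    """Merge annotations from several servers, ordered along the sequence."""
    merged = [feat for features in feature_lists for feat in features]
    return sorted(merged, key=lambda f: (f["segment"], f["start"], f["end"]))

# Invented sample responses standing in for two independent servers
LAB_XML = """<DASGFF><GFF version="1.0" href="http://lab.example.org/das/human/features">
<SEGMENT id="20" start="1000" stop="2000">
<FEATURE id="lab_site_1" label="binding site">
<TYPE id="binding_site">binding_site</TYPE>
<START>1500</START><END>1520</END>
<LINK href="http://lab.example.org/site/1">details</LINK>
</FEATURE>
</SEGMENT></GFF></DASGFF>"""

CENTRE_XML = """<DASGFF><GFF version="1.0" href="http://centre.example.org/das/human/features">
<SEGMENT id="20" start="1000" stop="2000">
<FEATURE id="gene_1" label="gene 1">
<TYPE id="gene">gene</TYPE>
<START>1100</START><END>1900</END>
<LINK href="http://centre.example.org/gene/1">gene page</LINK>
</FEATURE>
<FEATURE id="exon_7" label="exon 7">
<TYPE id="exon">exon</TYPE>
<START>1600</START><END>1700</END>
</FEATURE>
</SEGMENT></GFF></DASGFF>"""

print(features_url("http://lab.example.org/das", "human", "20", 1000, 2000))
for feat in integrate(parse_features(LAB_XML, "lab"), parse_features(CENTRE_XML, "centre")):
    print(feat["source"], feat["id"], feat["start"], feat["end"], feat["link"])
```

Note that the small laboratory's server answers only for the region it has annotated; the client does the integration.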



A genome database needs to be populated with useful annotation. We have already described in detail the process of automatic annotation, but whether in automatic or manual annotation a considerable amount of computational work needs to occur. The system must provide an environment in which to run these computations, which, for large eukaryotic genomes such as human, require considerable resources. All genome centers currently are working on the “farm” configuration, i.e., having a set of dedicated machines with a perfectly replicated set of resources on each farm node. The design of such a system requires coordination of CPU, floor space, air conditioning, and network resources, but such farms are common in the computer-science industry. As many of the computational tasks in the genome are “embarrassingly parallel,” in the sense that they require no communication between the different execution points, the structure of the work is generally to split the problem into a large number of jobs and then schedule the jobs to run on the farm. The job scheduler used is usually LSF, PBS, or Grid Engine.
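The split-then-schedule pattern can be sketched as follows. A local thread pool stands in for the farm and its scheduler, and counting G and C bases stands in for a real analysis; the function names and chunk sizes are assumptions made for the example. The optional overlap is one simple way to avoid missing features that straddle a chunk boundary.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_jobs(seq_length, chunk_size, overlap=0):
    """Return half-open (start, end) chunks that together cover [0, seq_length)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    jobs = []
    start = 0
    while start < seq_length:
        end = min(start + chunk_size, seq_length)
        jobs.append((start, end))
        if end == seq_length:
            break
        start = end - overlap
    return jobs

def gc_count(sequence, job):
    """One independent unit of work: it needs nothing from any other job."""
    start, end = job
    chunk = sequence[start:end]
    return chunk.count("G") + chunk.count("C")

def run_on_farm(sequence, jobs, workers=4):
    """Run all jobs concurrently; results come back in job order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: gc_count(sequence, job), jobs))

SEQUENCE = "GGCCAATTGC"  # toy sequence
JOBS = split_into_jobs(len(SEQUENCE), chunk_size=4)
RESULTS = run_on_farm(SEQUENCE, JOBS)
print(JOBS, RESULTS, sum(RESULTS))
```

On a real farm each job would typically be a separate process submitted to the scheduler, with results written back to the database rather than returned in memory.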

Web Views

Many users of a genome do not have the time or inclination to learn a large complex software system. The genome databases cater to these consumers who are content with web views to the genomic databases. Like the software systems employed, these web views are mainly independently developed but share more similarities than differences. In this review we concentrate on displays provided by the same three main sources of annotation of the human genome previously described (Ensembl, NCBI, and UCSC) because they are the displays we are most used to and because they are the resources most likely to be long lived in their current forms.

In the Ensembl website the front page presents the user with a clickable karyotype, a sequence search interface via either BLAST or SSAHA, and a text search box. The search systems, whether sequence or text, end up on one of two main pages, geneview or contigview, whereas the karyotype ends up only on a contigview page. An example of the contigview page is given in Figure 1. Contigview shows a slice of the genome, with three levels of contextual information: the position on an idealized karyotype, the general locale of the region of 1 megabase each side, and then a detailed view that is zoomable from 500 base pairs to 1 megabase, which the user controls. The user also generally controls the bottom detailed view, which shows a number of genomic features; of particular importance are the confirmed genes from Ensembl. Ensembl has built in its system 50 or so different “tracks” of various types of features that can be displayed on contigview (using DAS, this number is unlimited). Strandedness is shown by the track located above or below a central sequence line. Only a subset of the tracks, approximately 10, are switched on by default. There is a set of drop-down menus that control whether a track is on or off as well as other display properties, such as height and color. The second important page in Ensembl is geneview, which shows more detailed information about a gene. It can be reached either directly by one of the search systems or from contigview. Geneview provides detailed information on the intron/exon structure and shows all known identifiers associated with that structure, in particular, Human Genome Nomenclature (HUGO) names, Swissprot, and RefSeq accession numbers. In addition the protein translations of the gene are compared against InterPro (2), allowing genes with related protein domains to be clustered together.

The UCSC website is focused around a single page that displays a slice of the genome. Either a text search box or the BLAT search page takes the user to this slice, where again tracks of genomic features are shown. At UCSC, the strandedness of the tracks is provided by small chevrons on the features. The overall paradigm at UCSC is to provide the user with all the information about features on the genomes without imposing too much interpretation about their results. UCSC does not therefore have specific combined gene prediction, but rather shows both the RefSeq sequences aligned to the genome and the gene predictions from a number of groups (which usually include the Ensembl gene predictions) underneath those. Clicking on the feature of interest provides a small page with somewhat more detail: for UCSC-generated data this generally includes the alignment of the sequence, whereas for contributed tracks a link back to the source website of that feature is provided. Additional tracks are also available and can be expanded or contracted either by a mouse click or by a control panel at the bottom of the page. One of the strengths of the UCSC system has been its open and aggressive acquisition of contributed tracks. This means that the various versions of the same genome may have a large number of different tracks depending on what analysis was done and contributed to UCSC.

The NCBI website has a two-stage view of the genomic sequence that is heavily coordinated with its LocusLink resource. The mapview page provides a view of the genome that is presented principally as a set of mapping resources on a vertical chromosome. The user who wishes to zoom into a region of interest to get a finely detailed view needs to move over to the sequenceview section of the region, which shows a scrolled base-pair level view of the genomic sequence with a number of features labeled on it. At either the mapview page or the sequenceview page, for gene features the user can click out to the LocusLink page, which provides a single point of contact for gene information, with many jumping off points into other areas of NCBI. Mainly by virtue of the LocusLink page, the genome resources are well integrated into other NCBI resources, including the ability to jump from BLAST results into the genome.

As with the underlying software there is growing coordination among websites. Currently the Ensembl and UCSC genome sites are cross-linked at their respective “genome slice” views; both the Ensembl site and the UCSC site point to LocusLink pages, and LocusLink points back out to the other websites. Although users are presented with somewhat different interfaces and sometimes quite radically varying data or interpretations of the data, the duplication of websites for the human genome has been a generally beneficial approach because each site has strengths in different areas. For example, Ensembl’s strength lies in its consistency of data and interpretation across the site and its strong exporting tools, the UCSC site has the variety of analysis and the resistance to imposing a single interpretation, and NCBI’s integration with other molecular biology resources is helpful. These varying strengths have attracted different types of users to the sites, and the cross-linking between sites allows users to pick and choose the best from the various sites. We are hopeful that this trend of coordination will continue over time, in particular in work on other genomes.

Outside of vertebrate genomes, the web views of Gadfly and Wormbase are unsurprisingly similar as they are based on the Bio::Graphics web package. This gives a similar horizontal slice of the genome with interesting features highlighted. Again, like Ensembl, the view is DAS aware, allowing other groups to add their annotations to this display.

Data Resources

As well as bench biologists wanting to access specific types of genes or regions of interest through the web, there are a variety of other power users and bioinformaticians who are less interested in web usability and more interested in appropriate slices of the data. Different databases take varying approaches to providing data access. Ensembl probably goes the furthest by providing a fully portable system that can be mirrored, i.e., a standard flat-file style distribution of the data through web-accessible bulk downloads, including Excel spreadsheets and arbitrary slices of the genome. Ensembl also provides an internet-accessible MySQL host providing full SQL access to the datastore to anyone with a MySQL client. UCSC provides a very effective slice of the genome sequence, with server and bulk download of tab-delimited tables that represent the underlying database. At NCBI there is only a flat-file bulk distribution, but it is one that aims to integrate well into its other flat-file resources. One interesting development in this area is the advanced query interface to the database at Ensembl. This allows a user to supply in a web form a query such as “all the validated SNPs within 5 kb of genes with protein kinase domains,” with the result being a formatted Excel spreadsheet. Providing power users with a mixture of web-accessible and bulk-download forms seems likely to be a growing development.
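The example query quoted above translates naturally into a three-table join. The sketch below shows one way it might look; the tables (`gene`, `protein_domain`, `snp`), their columns, and the data are invented for illustration and do not reflect the Ensembl schema, and sqlite3 is used only so the example runs without a database server.

```python
import sqlite3

def make_demo_db():
    """Build a toy database with genes, their protein domains, and SNPs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT,
                           chromosome TEXT, gene_start INTEGER, gene_end INTEGER);
        CREATE TABLE protein_domain (gene_id INTEGER, domain_name TEXT);
        CREATE TABLE snp (snp_id TEXT PRIMARY KEY, chromosome TEXT,
                          position INTEGER, validated INTEGER);
        """
    )
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?, ?)", [
        (1, "kinA", "20", 10000, 20000),
        (2, "otherB", "20", 50000, 60000),
    ])
    conn.executemany("INSERT INTO protein_domain VALUES (?, ?)", [
        (1, "protein kinase"),
        (2, "zinc finger"),
    ])
    conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?)", [
        ("rs1", "20", 6000, 1),    # validated, 4 kb upstream of kinA
        ("rs2", "20", 24000, 0),   # close to kinA but not validated
        ("rs3", "20", 30000, 1),   # validated but 10 kb away
        ("rs4", "20", 52000, 1),   # inside a gene without a kinase domain
        ("rs5", "21", 12000, 1),   # right coordinates, wrong chromosome
        ("rs6", "20", 25000, 1),   # exactly 5 kb downstream of kinA
    ])
    return conn

def validated_snps_near_domain_genes(conn, domain_name, flank):
    """Validated SNPs within `flank` bases of any gene carrying the named domain."""
    cursor = conn.execute(
        """
        SELECT DISTINCT s.snp_id, g.name
        FROM snp AS s
        JOIN gene AS g
          ON g.chromosome = s.chromosome
         AND s.position BETWEEN g.gene_start - ? AND g.gene_end + ?
        JOIN protein_domain AS d ON d.gene_id = g.gene_id
        WHERE s.validated = 1 AND d.domain_name = ?
        ORDER BY s.snp_id
        """,
        (flank, flank, domain_name),
    )
    return cursor.fetchall()

print(validated_snps_near_domain_genes(make_demo_db(), "protein kinase", 5000))
```

A web form for such queries essentially collects the three parameters (validation status, flank size, domain) and formats the resulting rows for download.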

FUTURE OUTLOOK

With the expected number of new eukaryotic genomes on the horizon numbering over 20, it is clear that the process of handling genomes will become even better understood, simply by necessity. We expect that the informal sharing of concepts and small components between groups will grow, resulting within five years in one or two main sets of components that can be gathered together for genomic annotation. The open-source nature of some of the projects should accelerate this process.

The democratization of genome annotation through the DAS system (19) shows great promise, with many small laboratories already setting up DAS servers or using the web-accessible DAS upload servers. The presence of DAS prevents genome annotation from being solely the domain of the large centers and allows both biologists and research bioinformaticians to concentrate on their area of expertise.

All sites are providing the ability to integrate “vertically” from the genome up toward proteome and functional aspects of the genome. In addition, many of the systems are being cloned in a “horizontal” manner to provide the analysis and presentation of other genomes. The integration of the entire space, i.e., being able to easily pose queries that, say, efficiently and correctly correlate the quantitative trait loci on rat with the haplotype maps on human, is fascinating to consider. It is likely that the next couple of years, with the advent of many new genomes, will force the development of new concepts and potentially new technologies to meet this challenge.

The Annual Review of Genomics and Human Genetics is online at http://genom.annualreviews.org

LITERATURE CITED

1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403–10

2. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, et al. 2000. InterPro: an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16:1145–50

3. Bairoch A, Apweiler R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28:45–48

4. Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, et al. 2002. The Pfam protein families database. Nucleic Acids Res. 30:276–80

5. Baxevanis AD. 2002. The Molecular Biology Database Collection: 2002 update. Nucleic Acids Res. 30:1–12

6. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. 2002. GenBank. Nucleic Acids Res. 30:17–20

7. Birney E, Durbin R. 1997. Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. ISMB 5:56–64

8. Burge C, Karlin S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268:78–94

9. Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, et al. 1998. SGD: Saccharomyces Genome Database. Nucleic Acids Res. 26:73–79

10. Consortium F. 2002. The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res. 30:106–8

11. Consortium IHGS. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921

12. Deloukas P, Matthews LH, Ashurst J, Burton J, Gilbert JG, et al. 2001. The DNA sequence and comparative analysis of human chromosome 20. Nature 414:865–71

13. Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L. 2001. The distributed annotation system. BMC Bioinform. 2:7

14. Durbin D, Thierry-Mieg J. 1991. A C. elegans database. http://www.sanger.ac.uk/Software/Acedb/

15. Eeckman FH, Durbin R. 1995. ACeDB and macace. Methods Cell Biol. 48:583–605

16. Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8:967–74

17. Gelfand MS, Mironov AA, Pevzner PA. 1996. Gene recognition via spliced sequence alignment. Proc. Natl. Acad. Sci. USA 93:9061–66

18. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, et al. 2002. The Ensembl Genome Database Project. Nucleic Acids Res. 30:38–41

19. Hubbard T, Birney E. 2000. Open annotation offers a democratic solution to genome sequencing. Nature 403:825

20. Kanehisa M, Goto S, Kawashima S, Nakaya A. 2002. The KEGG databases at GenomeNet. Nucleic Acids Res. 30:42–46

21. Kelley S. 2000. Getting started with Acedb. Brief Bioinform. 1:131–37

22. Kerlavage A, Bonazzi V, di Tommaso M, Lawrence C, Li P, et al. 2002. The Celera Discovery System. Nucleic Acids Res. 30:129–36

23. Korf I, Flicek P, Duan D, Brent MR. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics 17:S140–48

24. Krogh A. 2000. Using database matches with HMMGene for automated gene detection in Drosophila. Genome Res. 10:523–28

25. Lindblad-Toh K, Lander ES, McPherson JD, Waterston RH, Rodgers J, Birney E. 2001. Progress in sequencing the mouse genome. Genesis 31:137–41

26. Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG. 2002. SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res. 30:264–67

27. Mott R. 1997. ESTGENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci. 13:477–78

28. Mural RJ, Parang M, Shah M, Snoddy J, Uberbacher EC. 1999. The Genome Channel: a browser to a uniform first-pass annotation of genomic DNA. Trends Genet. 15:38–39

29. Ning Z, Cox AJ, Mullikin JC. 2001. SSAHA: a fast search method for large DNA databases. Genome Res. 11:1725–29

30. Pocock MR, Down T, Hubbard TJP. 2000. BioJava: open source components for bioinformatics. SIGBIO Newsl. 20:10–12

31. Pruitt KD, Maglott DR. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29:137–40

32. Reese MG, Kulp D, Tammana H, Haussler D. 2000. Genie: gene finding in Drosophila melanogaster. Genome Res. 10:529–38

33. Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, et al. 2000. Artemis: sequence visualization and annotation. Bioinformatics 16:944–45

34. Smit AF. 1999. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 9:657–63

35. Solovyev VV, Salamov AA, Lawrence CB. 1995. Identification of human gene structure using linear discriminant functions and dynamic programming. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3:367–75

36. Stein L, Sternberg P, Durbin R, Thierry-Mieg J, Spieth J. 2001. WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 29:82–86

37. Stoesser G, Baker W, van den Broek A, Camon E, Garcia-Pastor M, et al. 2002. The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 30:21–26

38. Tateno Y, Imanishi T, Miyazaki S, Fukami-Kobayashi K, Saitou N, et al. 2002. DNA Data Bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Res. 30:27–30

39. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. 2001. The sequence of the human genome. Science 291:1304–51

40. Westbrook J, Feng Z, Jain S, Bhat TN, Thanki N, et al. 2002. The Protein Data Bank: unifying the archive. Nucleic Acids Res. 30:245–48

41. Xu Y, Uberbacher EC. 1997. Automated gene identification in large-scale genomic sequences. J. Comput. Biol. 4:325–38

42. Yeh RF, Lim LP, Burge CB. 2001. Computational inference of homologous gene structures in the human genome. Genome Res. 11:803–16


Figure 1 Screenshot of Ensembl contigview (human version 3.26.1), showing region of human chromosome 20 around genome sequence accession AL035454. Region is shown at three resolutions, and navigation (recenter on click) is possible by clicking in any of the three panels. Panels can be turned off by clicking on the box containing a “-.” (top) In the Chromosome panel a red box shows the region being viewed in p11.23, with respect to the cytogenetic-banding pattern of the entire chromosome. (middle) In the Overview panel a second red box similarly shows the region being viewed in detail below. The middle panel shows the location of markers and genes, by default over 1 Mbase. Genes predicted automatically by the Ensembl system are colored brown and labeled with either HUGO identifiers or SPTREMBL identifiers if they are known. Novel Ensembl genes (see text for definition) are labeled as such and shown in black. Curated gene structures from the Sanger Institute human annotation group are shown in various shades of blue, depending on their type, as described in the recent publication of the Chromosome 20 (12). Other annotated genes from EMBL/Genbank sequence files, where present, are shown in green. (bottom) The Detailed View panel shows genomic sequence features in detail, by default over 100 Kb. Gene color scheme is as for Overview panel, with the addition of sequence contig based Genscan ab initio predictions shown in cyan. Matches to the mouse genome sequence are shown in pink and grouped into synteny blocks, in this case showing synteny over the entire region to mouse chromosome 2. Region being viewed can be zoomed and recentered with the computer mouse or specified precisely in chromosomal coordinates. The pull-down menu shown is one of several and allows the user to select the features being displayed. The second pull down allows the addition of annotation from third-party DAS sources. Floating menus (not shown) appear as the mouse is moved over any feature, allowing access to pages with additional information. In this view some sequence homology features, which support the Ensembl gene predictions, are shown in full (e.g., human mRNAs: one line for each distinct mRNA), and some are shown in a compact format (e.g., ESTs: all hits shown as overlapping blocks on a single line).
