Enhancement of the GBrowse Genome Annotation Browser Stein, Lincoln D.

Page 2 Material

The Generic Model Organism Database (GMOD) software tools comprise a set of biological data management and visualization components suited for creating integrated online databases of genomic data. This proposal seeks to significantly enhance the functionality of one of the most popular of the GMOD tools, the GBrowse genome annotation browser, by giving it smooth scrolling and zooming abilities and a facility for editing and commenting on existing annotations that encourages democratic genome annotation. In recognition of the shift of biology away from the analysis of a single individual and towards the study of whole populations, we also propose to add features that make GBrowse suitable for visualizing genetic association and diversity data. These enhancements will be implemented without destroying the aspects of GBrowse that made it popular in the first place: the ability to install and run it on personal computers and other modest hardware.

This proposal further seeks to encourage the uptake by the community of the integrated GMOD suite. While individual components of GMOD have been taken up enthusiastically by the community and are widely used, the full suite of tools, which includes a database schema, a workflow management system, a data mining system and a template-based web site, has not achieved the same penetration due to the lack of comprehensive documentation, packaging and examples. We wish to rectify this situation by establishing a full-time support center responsible for GMOD packaging, documentation, end-user education and outreach.

Key Personnel:
Lincoln Stein (PI), Cold Spring Harbor Laboratory
Ian Holmes (coPI), University of California, Berkeley
Todd Vision (coPI), National Evolutionary Synthesis Center, Durham, NC

A. Specific Aims

The overall purpose of this project is to improve the software and services provided to the biological research community by the Generic Model Organism Database Project (GMOD). There are two main thrusts to the project. First, we will establish a user support and training center at the National Evolutionary Synthesis Center (NESCent) staffed by a full-time GMOD representative. Second, we will enhance the GBrowse genome annotation browser to provide a better user interface and broaden it to support information on population diversity.

Specific Aim #1) Enhance the GBrowse user interface to support smooth scrolling, zooming and community annotation.

We will replace the GBrowse genome annotation browser's current click, wait & reload user interface with a dynamic one similar to Google Maps. At the same time, we will interface GBrowse with the DAS/2 writeback protocol in order to allow users to easily add comments and other annotations to genome databases.

Specific Aim #2) Enhance the GBrowse interface and underlying data model to support the visualization of genetic diversity data.

We will enhance the GBrowse user interface and its underlying Chado data model using visual displays suited for managing genetic association and diversity data.

Specific Aim #3) Encourage the adoption and utilization of the GMOD software suite by the biological research community.

To encourage the adoption of the GMOD software suite we will create and staff a help desk located at the National Evolutionary Synthesis Center. The help desk staff will develop online and live tutorials explaining the installation and use of GMOD tools, sponsor workshops at scientific meetings to demonstrate the tools and to receive feedback from the user community, manage bug reports and feature requests, and assist individual users with installation and configuration of the software.

B. Background and Significance

Biology as an information science. The past 15 years have seen a remarkable transition of biology from a science whose progress was recorded in handwritten laboratory notebooks to one that is highly dependent upon online databases and other electronic resources. Two factors have contributed to this. One is the advent of information-rich genome sequencing, microarrays and other high-throughput technologies, while the other is the development of the World Wide Web, which made it possible to collect, integrate and distribute this information directly to biologists’ desks.

The paradigm shift has been remarkable to watch. Researchers who two decades ago would have devoted years to the study of a handful of genes contributing to a single regulatory pathway now routinely perform genome-wide analyses across multiple species in order to develop insights into the functions of whole gene families and regulatory networks. The ability to integrate diverse types of biological information allows researchers to develop and test predictive systems that were not imaginable even ten years ago. A recent example of this new approach is the work of Zhong and Sternberg (2006) in which protein-protein interaction data, hand-curated Gene Ontology terms (Ashburner et al. 2000), and text mining data culled from the online databases of C. elegans, yeast and fly were combined into an online system that could accurately predict novel genetic interactions in the nematode. The genetic interactions predicted by the system now become the basis for new experimental inquiries into the developmental biology of the organism.

A signpost of this change has been the emergence of many dozens of online integrative databases of biological data called MODs (model organism databases) and CODs (clade-oriented databases) (Stein et al. 2006). These resources are manned by a combination of computer scientists and biologist curators for the explicit purpose of integrating and cross-referencing multiple data types, rationalizing them by the use of ontologies and other standards, and making them available for use by the research community through a series of visualization, data mining, and data download tools. Examples include the S. cerevisiae database SGD (www.yeastgenome.org), the C. elegans database WormBase (www.wormbase.org), and the Arabidopsis thaliana database TAIR (www.arabidopsis.org).

The democratization of genomics. Because of the dramatic increase in the efficiency and accessibility of technologies for genome sequencing and the acquisition of other forms of genomic data, genome projects are no longer restricted to a small number of model organisms. In recent years, many
genome projects have been undertaken with an explicitly evolutionary motivation. Some have included sets of closely related species, as in the genera Drosophila and Saccharomyces. Other genome projects are now underway in organisms of evolutionary and ecological interest such as the three-spine stickleback Gasterosteus aculeatus and the water flea Daphnia pulex, or in organisms that occupy critical phylogenetic positions for comparative genomic studies such as the tunicate Ciona intestinalis. In addition, a wide variety of microbial and fungal species of medical, agricultural, industrial and scientific interest have been or are being sequenced, and our ability to interpret these data in a comparative and evolutionary context (e.g. accounting for phylogenetic relatedness among organisms and genes, and for intraspecific genetic and phenotypic variability) will become increasingly important as the tree of life is filled in. There were 2000 completed or ongoing viral, microbial, eukaryotic and metacommunity genome projects listed in the Genomes Online Database as of 1 April 2006 (www.genomesonline.org). The pace at which genome data accumulates is certain to accelerate as new sequencing and other genomic technologies continue to be improved. Compared to the initial wave of model organism genomes, much more specialized research communities are now serving as custodians of the second-wave genome projects. For these communities, the Generic Model Organism Database Project (GMOD; www.gmod.org) provides a diverse array of highly configurable, powerful, off-the-shelf solutions for data management and dissemination. We are aware of many small groups that have used GMOD tools to advance their research interests without a major investment of personnel. One striking example is a database of 48 annotated fungal genomes created and maintained by a single Duke University graduate student, Jason Stajich, for his studies of intron evolution (fungal.genome.duke.edu). Another example is an Arabidopsis genome database created by CSHL postdoctoral fellow Matt Vaughn for the purpose of managing a series of experiments to study the effects of epigenetic tags on gene expression (see letter from Matt Vaughn). A small group of researchers at MBL in Woods Hole has built GMOD tools into a center-wide system of 40 databases to study pathogen genomes (see letter from Andrew McArthur).

A portion of this proposal will go towards providing end-user support for the GMOD project. We feel that supporting GMOD is a far more cost-effective strategy for funding agencies than supporting many independent software and database development projects with nearly identical aims. There are additional advantages to this approach. GMOD has a large and experienced distributed development team that can address feature requests and bugs soon after they arise, reducing the maintenance and upkeep costs on individual user groups. Additional developers can contribute to the project
due to its open-source nature. The widespread use of GMOD has facilitated the adoption of community-wide data standards, particularly for the ontologies that figure prominently in the Chado schema, and for data file formats such as GFF3. The use of common database modules also facilitates interoperability and database federation using tools such as BioMart (www.biomart.org).

The role of NESCent. The National Evolutionary Synthesis Center (NESCent) sponsors scientific activities that promote data integration, interdisciplinary synthesis, systems-level analysis and modeling across evolutionary biology. It is a collaboration of Duke University, UNC Chapel Hill and NC State University and is located in Durham, North Carolina. NESCent opened its doors in 2005 with funding primarily from the National Science Foundation. To further its mission, the center supports Working Groups (10-12 investigators who collaborate intensively over a two-year period) and Catalysis Meetings (one-time meetings intended to identify avenues ripe for scientific synthesis, involving ~30 scientists from diverse disciplines), hosts a revolving set of postdocs (~15) and sabbatical visitors (~6), has a very active education and outreach program run in collaboration with the American Institute of Biological Sciences, and has a dedicated informatics staff charged with enabling the cyberinfrastructure needed to facilitate data sharing and data integration in evolutionary biology. A major thrust of this effort is assisting working groups in the prototyping of new databases and training the evolutionary biology community at large in the use of community-standard database tools. Each year, the center supports approximately four new Working Groups and four Catalysis Meetings. Through these and other meetings and activities, NESCent hosts hundreds of scientific visitors each year from a wide variety of biological disciplines, and is thus an excellent outlet for dissemination of new informatics tools to the user community and a natural locus for collecting user feedback. Providing a help desk to the GMOD user community is a natural extension of NESCent’s mission to provide cyberinfrastructure for evolutionary synthesis.

Genome Annotation Browsers. Another portion of this proposal will go to support the ongoing development of the GBrowse genome annotation browser. Genome browsers are an essential tool for visualizing the positional relationships between structural elements on the genome, such as genes, alignments, repetitive elements and cis-regulatory motifs. The predominant genome browser user interface can be traced back to the AceDB “fmap” browser (Kelley 2000), which represents genome coordinates (e.g. the contig or chromosome) as a ruler and displays different types of
structural elements as discrete tracks in parallel with the ruler. User interface elements allow the user to turn tracks on and off and to change the relative size and position of the genomic region displayed. In contrast to fmap, in which the coordinate ruler is vertical, most modern genome browsers display the chromosome horizontally.

Examples of genome browsers include the UCSC Genome Browser (Kent et al. 2002) and the Ensembl ContigView browser (Hubbard et al. 2002), as well as GBrowse. In addition to the features introduced by fmap, essential features of these browsers include the ability for users to upload their own annotation data sets and view them in the context of the canonical set of annotations, the ability to aggregate annotations from multiple third-party sites and compare and contrast them, and the ability to view quantitative data. Quantitative data includes such information as levels of transcription in genomic tiling arrays, relative levels of precipitated oligonucleotides in chromatin immunoprecipitation (ChIP/chip) assays, and frequencies of alleles in population genetics studies.

Although there are many genome browser implementations, almost all of them are tied to a particular underlying database and schema, and most of them require specialized knowledge to configure. To our knowledge, only two browsers were designed from the ground up to be truly portable from database to database and to be highly customizable by the end user. These are GBrowse, a web-based browser described in Preliminary Results, and Apollo (Lewis et al. 2002), a standalone Java application that combines a genome browser with an editor. Both are central components of the GMOD project.

C. Preliminary Results

We first discuss the GMOD project and then describe the GBrowse genome annotation browser.

The Generic Model Organism Database Project

History. The Generic Model Organism Database (GMOD) project began in the fall of 2000 when the developers of FlyBase, WormBase, MGD and SGD agreed to pool their resources to create a system of open source, reusable software components for creating and managing model organism databases. The project was soon joined by other model organism databases, including RGD (rgd.mcw.edu; see letter from Simon Twigger), TAIR (www.arabidopsis.org), TIGR (www.tigr.org), Gramene (www.gramene.org), EcoCyc (ecocyc.org) and the Fugu Genome Project (www.fugu-sg.org; now defunct). The tools created by the GMOD project have been adopted by groups ranging in size from single-PI operations to large multi-institutional collaborations, including both academic and commercial users. In keeping with the original goals of the project, essentially all model organism database projects established over the past several years make use of GMOD tools. These include the Duke University comparative fungal genomics web site (fungal.genome.duke.edu/), BeetleBase (www.bioinformatics.ksu.edu/BeetleBase/), wFleaBase (cricket.bio.indiana.edu:7182/), BeeBase (racerx00.tamu.edu/bee_resources.html), ParameciumDB (paramecium.cgm.cnrs-gif.fr/) and DictyBase (dictydb.org; see letter from Rex Chisholm).

GMOD tools are also used widely outside the field of animal model systems. At The Institute for Genomic Research (TIGR), GMOD tools are used for managing genome annotation information on rice and numerous microorganisms (see letters from Robin Buell and Owen White). Members of the bioinformatics division of Bristol-Myers Squibb are using GMOD tools to marshal information about human gene targets in the drug development pipeline (see letter from Nathan Siemers). At the Virginia Bioinformatics Institute, GMOD tools are used to share data collected in a multi-institution study of high-priority human pathogens identified by NIAID (see letter from Joao Setubal). GMOD components are used by the Institute for Systems Biology's T1DBase, a locus-specific database that organizes genetic association data on human type 1 diabetes (see letter from Nathan Goodman), by the Kyoto Encyclopedia of Genes and Genomes (KEGG; see letter of support from Toshiaki Katayama) and by the International HapMap Project web site (www.hapmap.org). GMOD tools are also increasingly finding their way into the hands of bioinformatics software developers, who use them as components in their own systems. An example of this is the
Pachter lab, which uses GBrowse to evaluate and visualize multiple whole-genome alignment algorithms (see letter of support from Lior Pachter).

GMOD Design. The system revolves around a relational database schema known as Chado (www.gmod.org/?q=node/6), a set of shared ontologies for describing genes and genomes (Ashburner et al. 2000; Eilbeck et al. 2005), and a template-based web site known as Turnkey (turnkey.sourceforge.net). The Chado data model is distinguished by relying heavily on ontologies to describe its data objects and the relationships among them. The benefit of this design is to keep the number of tables to a minimum (there are currently fewer than 50) and to simplify the process of extending the data model. For example, to add support for a new “gene knockout strain” data type, one would define a new table that contains information specific to knockout strains, and then extend the relationship ontology to establish the relationships between this table and the generic “strain” object on the one hand, and the “deletion” genome sequence annotation on the other. In contrast, groups that use non-ontology based strategies for representing biological data often find themselves working with schemas comprising an unwieldy number of tables. For example, GUS, The Grand Unified biological Schema (www.gusdb.org), contains over 500 tables, and even minor modifications to GUS can only be made after they have been extensively discussed and approved by a centralized schema working group (Aaron Mackey, personal communication).

On top of the Chado schema and Turnkey web site are roughly two dozen applications for creating and maintaining model organism databases, including applications for manipulating genome annotation data, bioinformatics workflow management, literature mining and annotation, data mining, and comparative genomics. The component that is the focus of this application, GBrowse, will be discussed at greater length later in this section.

GMOD software is written in either Java or Perl, with the choice of the language depending on its suitability for the particular application. In general, graphical desktop applications, such as the Apollo genome annotation editor, are written in Java, while web server-based components are written in Perl. GMOD applications make heavy use of open source middleware components such as BioPerl (Stajich et al. 2002) and BioJava (Pocock et al. 2000).

Some GMOD components exist as standalone products that do not use the Chado database. The most widely used of these is Pathway Tools (Karp et al. 2002), a self-contained biochemical pathways database, editor and browser. Despite this, GMOD end-users have not had difficulty making Pathway Tools or other standalone modules part of their web sites due to the ease with which diverse software can be lashed together using the web
protocols.

Project Coordination. The project is coordinated via a website located at www.gmod.org. This site hosts documentation for the project, file downloads, a bulletin board system for announcements, mailing lists, and facilities for feature requests and bug reports. The site is organized around a Wiki which allows project developers and motivated users to modify and enhance the site freely.

The project's software development management system is more structured than the GMOD web site. We use the Sourceforge (www.sourceforge.net) infrastructure to provide CVS, backup, change management and mirroring services. In contrast to the web site, only registered developers can make changes to the software, although anyone can download the current development snapshot using anonymous CVS. There are currently 86 registered GMOD developers, although only about a dozen developers are active at any given time. Several of the more complex GMOD components, for example Apollo and BioMart, have independent web sites that are reciprocally linked to the main GMOD web site.

In addition to the GMOD web site and Sourceforge, GMOD projects are coordinated through a series of developers' mailing lists, one per major module. There are regular conference calls among project developers at a frequency determined by individual project leads.

Strategic planning for the GMOD project is provided by semiannual meetings for developers, MOD curators and motivated users. The meetings are attended by roughly 50 individuals and run for two days; the agenda consists of presentations by groups that are using the software, “show and tell” sessions, workshops and smaller sessions devoted to strategy and design. The next scheduled meeting will be held June 29-30 at NESCent.

Scott Cain, an employee of Cold Spring Harbor Laboratory in the Stein research group, is coordinator for GMOD. His responsibilities include ensuring that all GMOD components work together and are documented adequately, organizing the semiannual meetings, maintaining the GMOD web and Sourceforge sites, and coordinating releases. Other responsibilities include writing packaging and installation tools for various GMOD components, and end-user support.

The primary support for GMOD development is provided by grants to individual developers. Funding agencies include the NIH, the USDA and the NSF. Funds for the coordinator's salary and for the semiannual GMOD meetings are provided by an NHGRI supplement to the main WormBase grant.

Community Usage and Outreach. Because of the open source nature of the
project, users can download GMOD software from a variety of sources, including the GMOD web site, the Sourceforge web site, various FTP mirrors, or via anonymous CVS. Users can also share and redistribute the software freely. For these reasons, it is impossible to obtain an accurate count of the number of times that GMOD software has been downloaded. A lower limit is provided by the Sourceforge web site, which lists 20,526 downloads of GMOD software from its download page since May 2001 (for detailed download statistics see sourceforge.net/project/stats/detail.php?type=prdownload&group_id=27707&ugn=gmod). An examination of the GMOD mailing lists suggests that there are at least 600 active users of system components, where we conservatively define an active user as one who has posted a request for help or information at least twice during the past two years.

In addition to the GMOD web site, the primary means of community outreach is via workshops, presentations and tutorials. GMOD-related projects have been presented at the CSHL Biology of Genomes and Genome Informatics meetings in 2003, 2004 and 2005, and at the BioCurators meeting in Asilomar in 2005. We have held interactive tutorials on GMOD components at the 2004 and 2005 ISMB meetings in Glasgow and Detroit, at the 2005 Nordic Bioinformatics Meeting in Tartu, Estonia, and at the Plant and Animal Genome (PAG) meetings in San Diego in 2003-2005. The next tutorials, on Pathway Tools and GBrowse, will be held at the IEEE-sponsored Computational Systems Biology meeting at Stanford University in August 2006.

In order to increase the quality of end-user support, Todd Vision at NESCent and Lincoln Stein at CSHL have recently agreed to create a GMOD help desk based at NESCent, beginning in mid-June 2006. The help desk will initially have a single staff member, who will be responsible for tracking bug reports, responding to end-user inquiries, and developing and running tutorials. Startup funding for 10 months of a help desk staffer’s effort will be provided in equal parts by NESCent and CSHL; the CSHL portion of the funding comes from the USDA Agricultural Research Service (see letter of support from Doreen Ware).

The GBrowse Genome Annotation Browser

Figure 1: The GBrowse Genome Annotation Browser (from www.hapmap.org)

The Generic Genome Browser, “GBrowse,” is one of the original GMOD applications (Stein et al. 2002) and, along with Apollo, is one of the most widely used.1 GBrowse (Figure 1) provides users with a web-based genome annotation browser similar in look and feel to the UCSC Genome Browser and the Ensembl ContigView. The software is installed as a server-side script under the control of a web server. Using a web browser, the end user selects a region of the genome to browse, usually by typing in a search term such as the name of a chromosome, a chromosomal band, a gene name, an STS, a gene description, or any of a large number of other identifiers that can be selected by the web site administrator. The genome annotations, also known as “features,” that fall within the region of interest are then displayed graphically at several levels of magnification. The region of interest is displayed in the “detail” panel, which can display features at base pair resolution. The “overview” panel shows the chromosome or contig context of the selected region; a red rectangle shows the position and size of the region of interest on the chromosome. There is also an optional panel called the “region” which displays a user-defined window of genomic context surrounding the region of interest.
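
As an illustration of how the overview panel's rectangle relates to the region of interest, the following minimal Python sketch maps a genomic interval onto a horizontal panel. It is not GBrowse code (GBrowse itself is written in Perl); the function name, the 1-based inclusive coordinates, and the one-pixel minimum width are assumptions made for the example.

```python
def overview_rectangle(region_start, region_end, chrom_length, panel_width_px):
    """Return (x_offset_px, width_px) of the highlight rectangle.

    Coordinates are assumed to be 1-based and inclusive; the panel is
    assumed to span the whole chromosome from left to right.
    """
    if not (1 <= region_start <= region_end <= chrom_length):
        raise ValueError("region must lie within the chromosome")
    if panel_width_px <= 0:
        raise ValueError("panel width must be positive")

    # Left edge: number of bases to the left of the region, scaled to pixels.
    x_offset = (region_start - 1) * panel_width_px / chrom_length

    # Width: region length scaled to pixels, but never thinner than one
    # pixel so that a very small region remains visible (an assumption).
    region_length = region_end - region_start + 1
    width = max(1.0, region_length * panel_width_px / chrom_length)
    return x_offset, width


if __name__ == "__main__":
    # A 100 bp region starting at base 501 of a 1,000 bp contig,
    # drawn on an 800-pixel-wide overview panel.
    print(overview_rectangle(501, 600, 1000, 800))
```

The same linear scaling, applied to a narrower window, would also position features in the region and detail panels.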

1 It is difficult to determine the number of active installations of GBrowse, but counts of user support requests and a Google search for public GBrowse installations suggest that 200 is a realistic lower limit. The number of personal or firewalled installations may be higher.

Like the UCSC and Ensembl browsers, GBrowse organizes different types of genome annotations into a series of horizontal tracks. Tracks can display qualitative information, including gene models, nucleotide alignments such as aligned ESTs, STSs, and SNPs, as well as quantitative information such as evolutionary conservation scores. Quantitative information can be displayed as XY plots (“wiggle plots” in UCSC parlance), as heat maps, or using more specialized representations such as pie charts. The end user can turn tracks on and off, adjust the order in which they are displayed, or change track appearance in order to adjust the level of detail shown. Settings are persistent between sessions; when the user returns to the browser, his previous preferences are restored. Like the UCSC and Ensembl browsers, GBrowse allows users to upload private data to view in the context of other genomic annotations, or to view genome annotations gathered from multiple third-party providers using the Distributed Annotation System (DAS; Dowell et al. 2001). Unlike the other browsers, GBrowse is also a DAS server; once the GBrowse software is installed on a web site, it can easily be set to export the annotations in its database via DAS.

Users can export the current region of interest in a variety of ways: as sequence files in a variety of formats, as colorized FASTA files in which coding regions, SNPs, oligonucleotides and other features are shown as underlined or highlighted text, as Sequence Ontology-compatible GFF3 files (Eilbeck et al. 2005) for exchange with other SO-aware software, and as high-resolution SVG (Scalable Vector Graphics; www.w3.org/Graphics/SVG/) image files for incorporation into publications.

The main feature that distinguishes GBrowse from the UCSC and Ensembl browsers is its portability and configurability. GBrowse runs as a CGI script and requires only a standard web server, the Perl interpreter, and a number of readily available Perl library modules. The software runs on Unix, Linux, Mac OS X and Windows platforms. When high performance is required, GBrowse can be configured to run as an embedded mod_perl (perl.apache.org) module, which eliminates the latency of launching the Perl interpreter each time GBrowse is run.

GBrowse uses a flexible adaptor system to connect the user interface to the underlying genome annotation database. In addition to the recommended Chado adaptor, various standard adaptors allow GBrowse to run from flat files in a directory, from data stored in BerkeleyDB B-Tree databases (www.sleepycat.com), from various relational databases (including PostgreSQL, MySQL, Oracle and Sybase), and from DAS sources. This last feature allows a research group to run a GBrowse interface entirely off a DAS source such as UCSC or Ensembl. The advantage of the adaptor system is that a prospective GBrowse administrator can experiment with
Enhancement of the GBrowse Genome Annotation Browser Stein, Lincoln D.the software by running it on top of a text file database. Once he is satisfied with it, he can graduate to Chado or another relational database. Advanced users can write custom adaptors to adapt an existing genome annotation database to GBrowse.A conceptually similar system allows GBrowse functionality to be extended by a “plugin” layer. Plugins are small modules of code (written in Perl) that are placed in a designated directory on the GBrowse server machine. Plugins can generate tracks, provide specialized database searches, or add new report facilities to GBrowse. For example, there is an FGenesh (Salamov and Solovyev 2000) plugin that runs the FGenesh gene prediction software on the selected region each time it is loaded. The gene prediction track looks like any other track to the end user, but gives him the option of changing the FGenesh parameters. Another recently-released plugin runs a fast Fourier transform spectrum analysis on the selected region, thereby allowing coding regions, repeats, and other interesting regions to be visualized.A well-documented configuration file gives GBrowse administrators extensive control over the look and feel of the software. Administrators can insert custom headers and footers, change the help text and labels, define tracks, and select into which panel (detail, region or overview) each track should be displayed. There are a large number of glyphs (more than 60) for displaying different types of genomic features. In addition to a number of glyphs for drawing various representations of gene models, there are glyphs specialized for drawing SNPs, oligonucleotides, primer pairs, deletions, insertions, splice sites and regulatory regions. There are glyphs specialized for showing statistical data, such as the whisker plot glyph, glyphs for superimposing photographic images on the genome view (e.g. 
for showing the phenotypes of Drosophila knockout strains on top of the endpoints of the deleted region), and glyphs for registering sequence chromatogram traces on the genome. Due to its use for the HapMap project, the list of glyphs also includes ones specialized for showing population-specific allele frequencies and pairwise linkage disequilibrium values.

A common problem with genome browsers is that a track that looks good at one scale looks poor at another. For example, conservation “wiggle tracks” on the UCSC browser look fine at scales of 100 kb or so, but turn into uninformative black bands at larger scales. GBrowse addresses this problem by allowing the administrator to create semantic zooming tracks. As the user changes the magnification, the track adjusts its glyph and other settings. For example, a gene model track can be set up so that at high magnification the amino acid sequence of the coding region is shown; as the user zooms out, the amino acid sequence disappears, leaving the intron/exon structure. As he zooms out further, first the gene description disappears, then the gene names are suppressed, and then the splice site structure disappears, leaving only an arrow showing the primary transcript. At the lowest magnifications, the gene model track is replaced by a histogram that shows gene density across the chromosome.

The appearance of GBrowse is controlled by a cascading stylesheet (CSS; www.w3.org/Style/CSS/), which allows the administrator to change the page color, typeface, link appearance and other visual attributes. The help text, button labels and menu items are contained in a series of internationalization-aware configuration files. The current GBrowse distribution has translations for English, Spanish, German, Portuguese, French, Classical Mandarin, Simplified Mandarin, Korean, Japanese and Icelandic, which means that users who have their browsers set to prefer one language over another will be presented with a version of GBrowse localized to their preference. The next version of GBrowse, scheduled for release in the summer of 2006, will be even more configurable. It uses a templating system that allows administrators to reconfigure the position and appearance of panels, buttons, text and other user interface elements simply by modifying a template file with a text editor.
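The semantic zooming behaviour described above can be illustrated with a minimal sketch. It is not GBrowse's configuration syntax or implementation: the thresholds, the glyph names and the `select_rule` helper are all hypothetical and chosen only to show how a track might switch its rendering as the visible region grows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZoomRule:
    max_span_bp: int   # rule applies while the visible region is at most this wide
    glyph: str         # hypothetical glyph name, not a real GBrowse glyph
    show_labels: bool  # whether gene names/descriptions are drawn


# Hypothetical rules for one gene-model track, ordered from highest
# magnification (narrowest view) to lowest.
GENE_TRACK_RULES = [
    ZoomRule(200, "protein_sequence", True),
    ZoomRule(50_000, "exon_intron", True),
    ZoomRule(1_000_000, "transcript_arrow", False),
]

# Used when the view is wider than every rule above.
FALLBACK = ZoomRule(10**12, "density_histogram", False)


def select_rule(visible_span_bp, rules=GENE_TRACK_RULES):
    """Return the first rule whose width threshold covers the current view."""
    for rule in rules:
        if visible_span_bp <= rule.max_span_bp:
            return rule
    return FALLBACK


if __name__ == "__main__":
    for span in (150, 10_000, 400_000, 5_000_000):
        print(span, select_rule(span).glyph)
```

Keeping the rules as ordered data rather than code mirrors the idea that an administrator, not a programmer, decides how a track degrades as the user zooms out.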


D. Experimental Plan

Specific Aim #1) Enhance the GBrowse user interface to support smooth scrolling, zooming and community annotation.

Overview. GBrowse is a hugely successful web application with at least 200 active installations at the time of writing. It uses the classic CGI (Common Gateway Interface) web architecture, in which each page is generated by the web server as an HTML document and rendered in the user’s web browser. The user interacts with the application by manipulating form elements (buttons, checkboxes) and submits the new settings to the server as an encoded request string. In response, the server generates a new HTML document and returns it. This architecture has considerable portability advantages. Because the entire user interface is represented via HTML pages, a GBrowse instance can be accessed from any networked computer, any time, without the need for a particular operating system or custom software installation. GBrowse can even run (albeit somewhat awkwardly due to screen size) on cell phones and Blackberry pagers.

The drawback of this CGI architecture is that it is fundamentally “page based.” Most of the actions taken by the user, such as scrolling or zooming along a genome sequence, result in a request being sent to the remote server, and the server responding by generating a new HTML page. One effect of this architecture is that the user interface lags behind desktop applications; standard desktop application features such as smooth scrolling, drop down menus, drag-and-drop objects, smooth animation and translucency are either missing from page-based applications or are severely limited. For example, changing the order of browser tracks is accomplished using a series of popup menus, rather than the more natural “drag and drop” interface familiar to users of desktop applications.
Another consequence of the page-based architecture is that each client-server round trip introduces network and computational latencies: moving between pages can incur delays, sometimes painfully long ones, as new HTML pages are requested from, generated by and (finally) transmitted by the server. Inevitably, this causes exasperation and subliminally trains the user not to explore, but rather to take the shortest possible path through the application. This is particularly counterproductive in an application such as a genome browser, which should encourage the user to investigate, compare, explore, and assume different views.

Fortunately, the recent coalescence of the client-side technologies Javascript, XML and asynchronous data loading, known as “AJAX” (Garrett 2005), allows web-based applications to live in the best of both worlds: they can have the cross-platform compatibility, ease of installation, and client-server architecture of page-based web applications without sacrificing the interactivity of desktop applications. The application of this technology is best seen in Google’s innovative Google Maps and Gmail applications, which many feel equal or surpass desktop applications in their ease of use. We propose to extend GBrowse, by means of the AJAX framework, so as to improve the responsiveness of the user interface by orders of magnitude, adding features such as fast scrolling, zooming, autocompletion of text entries, and rich contextual popup windows.

Specific Aim 1.1: Rearchitect GBrowse to support smooth scrolling and zooming.

The new software architecture that we propose for GBrowse uses pre-rendered genome views, an XML-based query engine, and a fast Javascript client. This architecture is illustrated in Figure 2.

Figure 2: Architecture of the AJAX (client-server) GBrowse. The images, features and Javascript/XML files could all be published by the same webserver, but are shown as three different servers to illustrate a possible division of labor.

In conceiving our new approach we were inspired by the mass-market geographical mapping application “Google Maps” (maps.google.com). Progress in geo-mapping applications is a reasonable analogy for the development of genome browsers. Initial applications, such as MapQuest and Yahoo Maps, were page-based, and scrolling the map required time-consuming page reloading. Geographical features, such as roads or businesses, could be overlaid on the map, but this also required reloading the current page. Maps and geographic features are entirely analogous to the genome coordinates and annotation features in a genome browser.

Google Maps revised this model with several innovations. The core difference is the intensive use of client-side Javascript rather than static HTML. The map images are pre-rendered and broken into tiles, minimizing the work done by the server. The Javascript client loads the off-screen tiles asynchronously while the user is scrolling. Often the user is unaware that new data has been loaded by the client. The Javascript client intercepts click and drag events, enabling seamlessly smooth scrolling of the map. The result is a dramatic increase in usability: users are much better able to orient themselves, find new locations and nudge the view to simultaneously display multiple locations of interest.

This specific aim involves a complete "Web 2.0" revamp of GBrowse. The full remake will include the following features:

Fast scrolling and zooming, so that the user can drag the view in real-time (already implemented in the prototype)

Interactive switching and sorting of annotation tracks using a “drag and drop” UI. For a preliminary idea of how this might look, see the following demo from the "Script.aculo.us" Javascript library: http://wiki.script.aculo.us/scriptaculous/show/GhostlySortableDemo

Autocompletion of search fields

Richer contextual pop-up windows (e.g. gene names, synonyms, available off-site links)

Integration of off-site data (gene pages, expression data, etc.)

Interactive annotation (see Community Annotation below)

The ability to select a set of features and pass them on to other applications, such as a primer design program

We will implement these features using the following design elements:

i. As much genome annotation data as possible will be pre-rendered, allowing fast retrieval from an optimized image server;

ii. The database query engine will be separated from the genome annotation renderer, and retooled towards a lightweight XML transaction core;

iii. We will develop a Javascript client that implements the UI and retrieves asynchronous contextual information from the image server and query engine via HTTP and XML.
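The interplay between pre-rendered tiles (design element i) and an asynchronously loading client (design element iii) comes down to simple index arithmetic. The sketch below is illustrative only and is written in Python rather than the Javascript of the actual client; the function name, the pixel-based coordinate convention and the one-tile prefetch margin are assumptions, not details taken from the prototype.

```python
def tiles_to_fetch(view_start_px, view_width_px, tile_width_px,
                   total_width_px, prefetch_tiles=1):
    """Return (visible, prefetch) lists of tile indices.

    Tiles are numbered from 0 along a single very wide track image.
    `visible` covers the current viewport; `prefetch` lists the
    neighbouring off-screen tiles a client could request in advance
    so that a subsequent drag appears seamless.
    """
    last_tile = (total_width_px - 1) // tile_width_px
    first_visible = max(0, view_start_px // tile_width_px)
    last_visible = min(last_tile,
                       (view_start_px + view_width_px - 1) // tile_width_px)
    visible = list(range(first_visible, last_visible + 1))

    lo = max(0, first_visible - prefetch_tiles)
    hi = min(last_tile, last_visible + prefetch_tiles)
    prefetch = [i for i in range(lo, hi + 1)
                if i < first_visible or i > last_visible]
    return visible, prefetch


if __name__ == "__main__":
    # A 1000-pixel viewport starting at pixel 2500 of a 10,000-pixel track
    # rendered as 1000-pixel tiles.
    print(tiles_to_fetch(2500, 1000, 1000, 10_000))
```

Because the server only ever hands out fixed tiles, the same images can be served to every user and cached aggressively, which is where most of the expected speed-up comes from.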

Prototype Client. As a proof of principle, we have created a prototype of our new system, which is visible at http://genome.biowiki.org/. It currently implements fast scrolling and zooming, demonstrating the feasibility of parts (i) and (iii).

In order to develop this prototype we have already surmounted some technical challenges. One of these has been scaling the rendering/layout engine from screen-sized views to entire chromosomes. Due to the logic of the layout engine, and in particular the vertical feature-stacking heuristics known as “bumping,” dependencies between feature positions can propagate for long distances along the annotated sequence. The consequence of this is that the entire chromosome must be laid out in one run, rendered and then broken into tiles. This presents some scaling issues: a chromosome-sized image is too large to store in memory. To work around this, we developed a proxy object, called "TiledImage", for the GD::Image module (a Perl API developed by Lincoln Stein for generating image data). TiledImage intercepts calls to GD::Image primitives and memoizes these calls in an SQL database, together with their 2D bounding boxes. The calls are later “replayed” to create individual tile images. In our current prototype this is somewhat time-consuming (roughly 2 hours per track for an entire Drosophila chromosome). With straightforward database and code optimizations we anticipate shrinking this time by several orders of magnitude. The most important code optimization will be to decouple the layout and rendering steps. The layout step is relatively quick but must be performed on the entire chromosome at once. The rendering step is slower, but can be divided so that the rendering of tiles is spread out among several CPUs.

Our online prototype includes a basic Javascript client. The client displays a ruler and a set of annotation tracks that can be dragged around by the user.
It also asynchronously fetches offscreen tiles in anticipation of further drag events. Although clicking on features in the prototype has no effect (other than dragging the map), the planned version of the Javascript browser will respond to feature-clicks by raising a small popup window, tied to the feature, containing brief contextual information and links. Clicking on empty areas of a track, or on the ruler, will be interpreted as a drag request (as it is currently). These click-semantics will feel familiar to users of Google Maps.

Currently, the prototype implements searches by launching the CGI version of GBrowse; in the full version, this will be implemented by client-server communication using Javascript's XMLHttpRequest object. We envisage that the first year of the project will be spent bringing the functionality of the Javascript client-server model to a state comparable with the current CGI-based GBrowse. Once it is up to speed, the Javascript prototype will form the basis for the more sophisticated interface we outline above.

Portability across web browsers is a perennial issue for Javascript-based web applications and will, we anticipate, be a substantial component of our time commitment. One step we will take towards portability will be to make as much use as possible of emerging open-source Javascript libraries, such as Script.aculo.us, Dojo and Google's AJAXSLT.

We will implement the system so that for browsers that do not have Javascript, or for users that choose to disable this feature, GBrowse will fall back to its current page-based architecture. The fallback mode will also be the default for those wishing to install GBrowse on personal computers and laptops, which may not have the compute or memory resources to support tiled tracks.

Specific Aim 1.2: Add a community annotation framework to GBrowse.

Another innovation of Google Maps is its strong support for community annotation. Google has made available an API for publishing point annotations of geographical features. This has been widely utilized, leading to numerous applications of scientific interest, such as maps of reported bird-to-human transmissions of H5N1 avian flu (Nature 439, 6-7; 5 January 2006). Community annotation, a feature of demonstrated importance to genome browsers, is also a central feature of our proposed GBrowse development.

A facility for users to post annotations to a central site and have them rendered automatically, rather than requiring each group to set up its own server or submit annotations manually to a curator, is fundamental to collaborative genome annotation projects (see letter of support from Lior Pachter).

Our planned community annotation framework for GBrowse includes the following features:

Bulk-uploading of annotation tracks in GFF/DAS, BED, PSL, WIG, Stockholm and other common annotation formats;

Interactive annotation, commenting and peer review directly from the Javascript client.

We consider these features in turn. In this section we distinguish between “default” annotation tracks (those installed server-side), “bulk uploaded” tracks and “interactively added” tracks.

Bulk uploaded annotation tracks. Bulk uploaded annotation tracks are prepared offline and then uploaded en masse, allowing the user to create entire tracks of data at a time. Our goal is to maximize compatibility by recognizing any standard format for feature uploads:

GFF: the most widely used annotation format. Each record in a GFF file describes a single feature including co-ordinates, strandedness, reading frame and feature type. The recently developed version 3 of GFF includes more structure, including a controlled vocabulary for the "type" field (the Sequence Ontology) and ways of representing alignments (see footnote 2).

BED: the native feature format of the UCSC Genome Browser. The UCSC Browser's "Custom Tracks" feature is currently the canonical example of bulk feature uploads. See http://genome.ucsc.edu/goldenPath/help/customTrack.html

WIG: a format for attaching continuously varying data, such as posterior probabilities from feature predictors, to genome tracks.

PSL: the format used by the BLAT program. An alternative to GFF3 for representing alignments.

Stockholm: an alignment format that also includes annotation fields.

Chado/Chaos-XML (http://www.fruitfly.org/chaos-xml): a format designed as an XML serialization of the Chado relational database schema, the standard schema underlying GMOD.
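One practical consequence of accepting several upload formats is that their coordinate conventions differ: GFF uses 1-based, inclusive coordinates, whereas BED uses 0-based starts with exclusive ends. The following sketch shows how records from both formats could be normalized into one internal representation. The `Feature` class and both parser functions are hypothetical helpers written for illustration, not part of GBrowse or BioPerl, and the GFF3 parser deliberately ignores details such as URL-escaped attribute values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    seqid: str
    start: int                  # 1-based, inclusive
    end: int                    # 1-based, inclusive
    strand: str
    type: Optional[str] = None
    phase: Optional[int] = None
    name: Optional[str] = None


def parse_gff3_line(line):
    """Parse one nine-column, tab-separated GFF3 feature line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("GFF3 feature lines have exactly nine columns")
    seqid, _source, ftype, start, end, _score, strand, phase, attrs = cols
    attributes = dict(pair.split("=", 1)
                      for pair in attrs.split(";") if "=" in pair)
    return Feature(
        seqid, int(start), int(end), strand,
        type=ftype,
        phase=None if phase == "." else int(phase),
        name=attributes.get("Name") or attributes.get("ID"),
    )


def parse_bed_line(line):
    """Parse a BED line (3 to 6 columns) into 1-based inclusive coordinates."""
    cols = line.split()
    chrom, start0, end = cols[0], int(cols[1]), int(cols[2])
    name = cols[3] if len(cols) > 3 else None
    strand = cols[5] if len(cols) > 5 else "."
    # BED starts are 0-based and ends are exclusive, so only the start shifts.
    return Feature(chrom, start0 + 1, end, strand, name=name)


if __name__ == "__main__":
    gff = "chr2L\texample\tCDS\t1001\t1200\t.\t+\t0\tID=cds1;Name=geneA"
    bed = "chr2L\t1000\t1200\tgeneA\t0\t+"
    print(parse_gff3_line(gff))
    print(parse_bed_line(bed))
```

Both example lines describe the same 200-base interval, which makes the off-by-one difference between the two formats easy to see.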

While the current GBrowse supports GFF and GFF3 upload, it does not support the other formats listed above. In any case, the transition from the CGI architecture to the AJAX architecture will require the upload facility to be rewritten.

Interactively added tracks. A completely new feature will be the ability of users to add new annotations interactively, as well as to comment on existing annotations.

The user will be able to do the following from the browser:

Create a user account, log in and authenticate (with persistence)

Create new (empty) annotation tracks

Interactively add features to newly-created tracks

Edit or delete features that they have previously added

Attach comments to any feature or genomic region, and reply to others' comments

Rate any comment, feature or track (including bulk-uploaded and default tracks)

Users will be able to invoke these operations in intuitive ways. For example, to add an annotation to an existing track, they will click on the track they wish to modify, select a menu item for adding a new feature, and then

2 Two of the PIs, Stein and Holmes, have been closely involved with the development of GFF: Stein wrote the GFF version 3 spec and co-developed the Distributed Annotation System (DAS), an XML-isation of the GFF format, while Holmes developed one of the first Perl toolkits for performing operations such as intersections, sorts and dynamic programming on GFF files.

interactively draw the feature into place. The motivation for allowing users to attach comments is slightly different from that for bulk uploading of annotations. Whereas bulk uploading is vital for visualizing whole genome analyses performed by software systems, a commenting system is geared towards providing a collaborative environment in which researchers review, discuss and enhance contributed annotations. The system should encourage interaction and cross-pollination between comments; thus, for example, replies to comments should be threaded, searchable, RSS-trackable and possibly peer-rated. It should be quick and easy to attach a comment from the browser. It should also be possible to attach comments to specific regions, so that one can mark up individual residues or spans. We feel that this type of interaction will be welcomed by the small research communities that are the primary targets for the GMOD project.

If time permits, we will extend this facility by providing for peer review of community annotations, following the general pattern of social news websites such as Slashdot.com, Digg.com and Del.icio.us. User ratings will be weighted by a user-specific variable, with new users having a low weight and trusted users ("editors") having higher weights. Other possible extensions might include the ability to see how many users are concurrently online, or to "instant message" other scientists viewing the same chromosome.

Implementation. User account details, interactive annotations, comments and ratings will be stored in an SQL database on the server and passed between client and server as XML. The client will display comments and annotations by transforming these XML trees into SVG images via XSLT stylesheets.

For communication between client and server we will use the Distributed Annotation System (DAS), thereby leveraging a body of open source clients, servers and middleware code.
The DAS/2 protocol supports writeback to a database (see http://www.biodas.org) using a specially formatted XML file. This file will be formatted by the Javascript client, uploaded via the HTTP POST mechanism, then parsed by code on the server and written to a custom relational database. It will then be served back to the client via rendered browser tracks.

Third party annotations, whether created by bulk uploading or interactively, present interesting implementation issues for the Web 2.0 version of GBrowse. Because uploaded annotations must be rendered into tiled tracks on the server side, it may not be possible to pre-render an entire chromosome’s worth of uploaded annotations in acceptable time. For this reason, we will experiment with just-in-time server-side rendering in which tile images for uploaded tracks are rendered only when requested and then cached for rapid retrieval. The separation of the layout and rendering steps described earlier will go far towards making this approach feasible.

Benefits of this work extend beyond GBrowse. An important feature of our client is its generality. Our current Javascript prototype is completely generic, being essentially a viewer for scrolling across very large (chromosome-length) images. As such, it is entirely agnostic about the method used to generate the images, and contains no hardcoded dependencies on GBrowse whatsoever. It would be straightforward to render the track images using other CGI browser engines, such as those of UC Santa Cruz (genome.ucsc.edu), the Ensembl project (ensembl.org) or Microbes Online (microbesonline.org).

As the project progresses the client-server communications will necessarily become more structured, so as to implement searches, offsite links, etc. However, we are committed to ensuring that our framework stays generic and is not strongly tied to particular implementations of the client or server. That is, we are open to (and indeed anticipate) the possibility of third-party extensions to our work (e.g. a faster server, or a smarter client). A key research component of the proposed work is to identify and develop open protocols for client-server communication in future genome browsers and related web applications.

Milestones for this part of the project and a description of our software engineering methodology can be found in I. Management Plan.
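The combination of memoized drawing calls and just-in-time tile rendering discussed in this aim can be sketched in a few lines. This is not the TiledImage code, which is a Perl proxy for GD::Image; it is a simplified Python analogue under stated assumptions: the `DrawLog` and `TileCache` classes are hypothetical, only one primitive is recorded, an in-memory SQLite database stands in for the SQL store, and a "rendered tile" is represented by the list of drawing calls translated into tile-local coordinates rather than by an actual image.

```python
import sqlite3


class DrawLog:
    """Records drawing calls with their bounding boxes instead of drawing."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE calls (id INTEGER PRIMARY KEY, op TEXT, "
            "x1 INTEGER, y1 INTEGER, x2 INTEGER, y2 INTEGER)"
        )

    def rectangle(self, x1, y1, x2, y2):
        # Memoize the call; nothing is rasterized at layout time.
        self.db.execute(
            "INSERT INTO calls (op, x1, y1, x2, y2) "
            "VALUES ('rectangle', ?, ?, ?, ?)",
            (x1, y1, x2, y2),
        )

    def calls_for_tile(self, tile_index, tile_width):
        """Replay only the calls whose bounding box overlaps the tile."""
        left = tile_index * tile_width
        right = left + tile_width - 1
        rows = self.db.execute(
            "SELECT op, x1, y1, x2, y2 FROM calls "
            "WHERE x2 >= ? AND x1 <= ? ORDER BY id",
            (left, right),
        )
        # Translate chromosome-wide x coordinates into tile-local ones.
        return [(op, x1 - left, y1, x2 - left, y2)
                for op, x1, y1, x2, y2 in rows]


class TileCache:
    """Renders a tile on first request and serves it from cache afterwards."""

    def __init__(self, log, tile_width):
        self.log = log
        self.tile_width = tile_width
        self.tiles = {}
        self.renders = 0  # counts how often a tile had to be built

    def get(self, tile_index):
        if tile_index not in self.tiles:
            self.renders += 1
            self.tiles[tile_index] = self.log.calls_for_tile(
                tile_index, self.tile_width)
        return self.tiles[tile_index]


if __name__ == "__main__":
    log = DrawLog()
    log.rectangle(100, 0, 300, 10)     # lies entirely within tile 0
    log.rectangle(950, 0, 1050, 10)    # straddles tiles 0 and 1
    log.rectangle(5000, 0, 5100, 10)   # lies within tile 5
    cache = TileCache(log, tile_width=1000)
    print(cache.get(1))
    print(cache.get(1), cache.renders)
```

A feature that straddles a tile boundary is replayed into both neighbouring tiles, which is why the bounding boxes are stored alongside the calls.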

Specific Aim #2) Add glyphs and detail pages to GBrowse to support the visualization of diversity data.

An increasingly frequent request from the GBrowse user community is to view population diversity data in the context of a genomic reference sequence. This type of data underlies the discovery of associations between specific genetic variants and complex phenotypic traits (Risch 2000) and is important for model organism communities, where it is used to interpret experimental QTL (quantitative trait loci) and to find signatures of selection in naturally occurring populations. It is also of great interest to medical geneticists, who rely on integrated visualization to understand genetic association studies, and to genetic epidemiologists, who use diversity data to probe the relationship between genetics and disease prevalence in human populations.

Since GBrowse serves as the genome browser for the International HapMap project website (www.hapmap.org), its functionality has already been extended to support some views of population diversity data (Thorisson et al. 2005). These include an allele tower and pie chart glyph, which, when the genome is being viewed at sufficiently fine scale, are used to display the identity and population frequency of nucleotides at a biallelic SNP. A haplotype block graph shows the stack of inferred haplotypes across a set of linked SNPs (Figure 3). A pairwise plot is used to show the level of linkage disequilibrium between pairs of SNPs. The statistics for these glyphs are typically precomputed to avoid latency, although tagged SNPs can be selected on the fly at the HapMap website using a number of different algorithms. GBrowse can also be used to display numerical values relative to discrete features on the genome, or across continuously varying windows of the genome, using whiskerplot and XY plot glyphs, respectively. These flexible glyphs can be adapted to display a variety of simple population diversity statistics, such as estimates of recombination frequency, skewness of the allele frequency spectrum, and mutation rate.



Figure 3: Pie chart and haplotype block displays already available in GBrowse

The HapMap project provides a relatively simple case of diversity data in which each SNP is genotyped in the same 270 individuals from a small handful of predefined populations. However, the existing GBrowse facilities are not sufficient to allow users to explore more complex situations in which allele frequencies covary with geographic context and to consider the relationship between allelic and phenotypic variation.

Thus, we wish to extend GBrowse for simultaneous display of allelic/haplotypic and phenotypic variation among individuals and populations, thereby allowing the user to visually examine associations between genotype and phenotype that may have been identified external to the browser or are sufficiently noteworthy to merit follow-up study. Since such a marginal association might simply be due to population structure, it is intended that this display would be integrated into a web interface that permits data filtering and download, so that a user could statistically analyze the association offline, taking appropriate safeguards against spurious associations (Yu et al. 2006). We also wish to create a simple display that will allow genetic and phenotypic variation to be viewed relative to geographic, as well as genomic, coordinates. These enhancements will occur at the Bio::Graphics layer of GBrowse (the rendering engine, see Figure 2), and so will be compatible both with the current, legacy version of GBrowse and with the enhanced AJAX-driven version described in Specific Aim #1.

The GMOD schema, Chado, does not currently have a rich diversity schema. Our first step will be to port into Chado a well tested schema, the Genomic Diversity and Phenotype Data Model (GDPDM; www.maizegenetics.net/gdpdm/). GDPDM is a mature schema designed to hold quantitative phenotypic data and molecular genotypic data stemming from QTL mapping, association, and other genotype-by-environment studies. This will serve as the backend to the displays proposed below.


We will then develop a facility for the user to view the sequence alignment of alternative alleles against the reference sequence. For this purpose, we will create GBrowse displays based on the user interface of the open source Look-Align application (Caneran et al. 2006), which can display pre-computed multiple sequence alignments among alleles when stored in a Genomic Diversity and Phenotype Connection (GDPC) compatible database (a demonstration is available at the Panzea website, www.panzea.org/software/alignment_viewer.html). The display that we envision will be similar to the one shown at the Panzea website, except that we will color-code putative SNPs to indicate the possible functional consequences of the polymorphisms, such as non-synonymous changes in amino acid sequence or splice-site alterations. The back end for this logic is contained in the “Bio::LiveSeq” module of the BioPerl software library (www.bioperl.org). Hence the implementation of this feature is simply a matter of invoking Bio::LiveSeq on polymorphism data contained in a diversity-enhanced Chado database.

We will also implement glyphs for simultaneously displaying genetic and phenotypic variation. GBrowse currently has the capacity to visually document a particular phenotype by embedding an image or other graphic aligned to the genome within one of the annotation tracks. This glyph was designed to handle graphics in PNG, JPEG, GIF or GD format and is best-suited for morphological data, such as a genetic alteration that changes the histology of an organ, or in situ hybridization data.
We will augment this facility by combining it with new glyphs adapted to show quantitative traits; visualizations will include side-by-side whiskerplots showing the frequency or distribution of a phenotype within each of two or more genotypic classes (from a biallelic SNP to a larger limited number of haplotypes) or vice versa, and an XY scatterplot glyph to show the relationship between allele frequency and phenotype among populations.

A considerable amount of information about the patterns of selection acting on a trait of interest can be deduced from changes in allele frequencies correlated with geographic position. For example, changes in allele frequencies of the cation exchanger SLC24A5 are highly correlated with latitude among human populations, supporting the hypothesis that this gene contributes to skin pigmentation and was selected for during the human migration from low-latitude areas rich in sunlight to northern climes where sun is more scarce (Lamason et al. 2005). To support this type of research we will add the capability of correlating allelic information with geographic and genomic information within GBrowse. We will do this in two ways.

First, we will design a new glyph that displays a small “cartoon” map at the position of the gene of interest. This map will display the geographic region of interest, color-coded to show how the geographic positions of population samples correlate with single-locus and haplotype allele frequencies. Because space on the GBrowse detail panel is limited, we envision using a four-pixel box chart to indicate the position and allele frequencies of each collected sample. This will only allow allele frequency changes to be shown in 25% increments, but this should still be sufficient to make significant changes in allele frequencies apparent. This display will allow the allelic distributions of all loci within a genomic region to be interrogated simultaneously, but the geographic resolution will be limited by the small size of the maps.

The second display will be a larger standalone map that is shown when the researcher clicks on a characterized locus to see more detail. On this more detailed map, the population frequencies for two or more alleles are projected as a series of geographically-positioned pie charts. Because it is larger, this display will provide more geographic resolution, but in contrast to the embedded display, it will only allow one locus to be interrogated at a time.

The maps themselves will consist of pre-supplied bitmap images of the world at a variety of spatial scales and will take advantage of the OpenMap (http://openmap.bbn.com/) or GMT-3 (http://gmt.soest.hawaii.edu/) mapping utilities. We will implement the displays by projecting a longitude/latitude grid onto bitmapped images of the region of interest (e.g. a continent). The grid locations of population samples and their allelic frequencies will be retrieved from the enhanced Chado database and used to create the four-pixel box charts and pie charts described earlier. The charts will then be drawn into the proper positions on the bitmaps.

Milestones for this part of the project and a description of our software engineering methodology can be found in I. Management Plan.
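The two calculations behind the embedded map glyph, quantizing an allele frequency to a four-pixel box chart and placing a sample on a bitmap from its longitude and latitude, can be sketched as follows. This is an illustration under assumptions rather than the planned implementation: the function names are hypothetical, a biallelic locus is assumed, and a simple linear (equirectangular) mapping is used because the text does not name a projection.

```python
def four_pixel_chart(freq_a):
    """Quantize the frequency of allele A into four pixels.

    Each pixel stands for 25% of the sample, so the chart can only
    distinguish frequencies of 0, 25, 50, 75 and 100%.
    """
    if not 0.0 <= freq_a <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    n_a = int(freq_a * 4 + 0.5)  # round to the nearest quarter
    return ["A"] * n_a + ["B"] * (4 - n_a)


def lonlat_to_pixel(lon, lat, west, east, south, north, width_px, height_px):
    """Map a longitude/latitude pair to pixel coordinates on a bitmap.

    Assumes the bitmap covers the rectangle west..east by south..north
    with a linear grid; pixel (0, 0) is the top-left (north-west) corner.
    """
    x = (lon - west) / (east - west) * (width_px - 1)
    y = (north - lat) / (north - south) * (height_px - 1)
    return round(x), round(y)


if __name__ == "__main__":
    # A population sample in which allele A has a frequency of 30%.
    print(four_pixel_chart(0.30))
    # Centre of a whole-world bitmap at one pixel per degree.
    print(lonlat_to_pixel(0, 0, -180, 180, -90, 90, 361, 181))
```

The quantization step makes the 25% resolution limit explicit: frequencies of 0.30 and 0.20 both collapse to a single "A" pixel.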


Specific Aim #3) Encourage the adoption and utilization of the GMOD software suite by the biological research community.

The GMOD project currently has several largely unmet, necessary support tasks. These tasks fall into a few categories: writing documentation, outreach and software support. Currently, the bulk of these tasks are performed by Scott Cain, which takes away from his coordination and software writing responsibilities. To facilitate adoption of GMOD software by more organizations, clear and complete documentation is a necessity. Additionally, written tutorials and examples are needed both as basic documentation and as outreach materials. One illustration of the importance of end-user material is to compare the widespread adoption of GBrowse and Apollo, both of which have excellent user-level tutorials, to the slow pace of adoption of Chado, Turnkey, and other GMOD tools that are not documented to the same level.

Currently, support for GMOD software is highly decentralized, with requests for help or new features coming in via a variety of paths, including project-specific mailing lists, bug and feature request trackers at SourceForge, the GMOD website and email sent directly to lead developers. The lack of centralization leads to diffusion of responsibility, and some user inquiries are never adequately resolved.

To remedy this situation, we will establish a full-time help desk for GMOD, building on the seed funding provided by NESCent and USDA ARS. A user support specialist located at NESCent will manage the GMOD help desk in coordination with Scott Cain. The specialist will monitor mailing lists and forum sites and respond directly on basic issues that novice users have with installation and configuration. When appropriate, he or she will turn correspondence into formal bug reports, feature requests, and FAQs, or direct the correspondence to the appropriate developer.
He or she will also maintain active installations of the various GMOD tools at NESCent as they are released, in order to stay current with the technology from the user’s perspective and to serve as a beta tester for the developers. The user support specialist will collaborate with NESCent visitors, several of whom at any given time are working with or designing GMOD databases, to evaluate what works well and what does not, and to stimulate future development ideas.

Another important task for the user support specialist will be to maintain the GMOD website. His or her responsibilities will include ensuring that introductory documentation exists for the major modules on the GMOD Wiki, posting news items, and managing the GMOD mailing lists.

Unless the user support specialist can engage the GMOD software developers, he or she will be ineffective. For this reason, the user support specialist will participate in developer conference calls for the most active GMOD projects (GBrowse, Apollo, CMap, Pathway Tools, Turnkey, Chado), and for other projects that arise during the course of the project period. The user support specialist’s role in these conference calls will be to act as an advocate for users, bringing their complaints, questions and concerns to the attention of the developers.

On occasion, the support specialist will travel to remote sites to assist with installation and/or configuration of GMOD tools. Travel will be undertaken when remote support has proven ineffective and when the host institution is able to underwrite the travel, housing and other expenses of the specialist. We have found that this type of one-on-one interaction often reveals deficiencies in the software and/or documentation that were not apparent during telephone and email conversations. It also often suggests new uses for the software that stimulate innovations on the part of the developers.

Tutorials. The user support specialist will work with the developers of the most active GMOD packages to write tutorial materials that can be delivered live and also made available online. Tutorials will be updated to remain current with new software releases. Initial feedback on the tutorials from naïve users will be obtained through presentations to the community of postdocs and sabbatical visitors at NESCent. Live tutorial presentations will be targeted at meetings frequented by potential new GMOD users, such as those held by the Society for Molecular Biology and Evolution, Plant and Animal Genome, Microbial Genomics, the Cold Spring Harbor Biology of Genomes and the annual conference on Intelligent Systems in Molecular Biology. In-person tutorials will be presented at a minimum of one domestic and one international scientific conference each year.
In addition, NESCent is equipped to host virtual software demonstrations using Polycom unified video-voice-data conferencing technology. Tutorials available to users online will be of two kinds. First, we will provide static documents that guide the user through worked examples step-by-step. Second, we will use a multimedia tool such as Macromedia Captivate to produce software animations that capture the onscreen actions of the narrator going through the steps of a procedure, overlaid with on-screen markup as well as the voice and video image of the narrator.

Finally, the user support specialist will organize and run an intensive three-day hands-on training session for 10-15 users at NESCent at least once a year. This will be aimed at new users and will guide them through installation and configuration of the main GMOD packages. Travel scholarships will be awarded to ensure wide participation, and other participants will be charged at cost. The user support specialist is being hired at NESCent in mid-2006, well prior to the project start date, and so we anticipate being ready to start providing live tutorials and workshops in year one.

Online User Surveys. In order to remain engaged with the needs of the community and to assess the effectiveness of the GMOD project in addressing those needs, the user support specialist will run an online user survey once every year. The survey will collect information on which GMOD tools, if any, respondents use, on respondents’ research interests and host institutions, and will ask respondents to identify missing or unsatisfactory features in the GMOD tool set. The surveys will be publicized by posting announcements to mailing lists for model organism databases, ontology interest groups and bioinformatics projects (e.g. the BioPerl mailing list).

E. Human Subjects Research

n/a

F. Vertebrate Animals

n/a

G. Select Agent Research

n/a

H. Literature Cited

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 25(1):25-9.

Beck K (2000). Extreme Programming Explained: Embrace Change. Addison Wesley.

Canaran P, Stein L, Ware D (2006). Look-Align: an interactive web-based multiple sequence alignment viewer with polymorphism analysis support. Bioinformatics 22:885-6.

Dowell R, Jokerst R, Day A, Eddy S, Stein L (2001). The distributed annotation system. BMC Bioinformatics 2:7.

Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M (2005). The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 6(5):R44. Epub 2005 Apr 29.

Garrett JJ (2005). Ajax: A new approach to web applications. http://adaptivepath.com/publications/essays/archives/000385.php

Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, Clamp M (2002). The Ensembl genome database project. Nucleic Acids Res. 30:38-41.

Karp P, Paley S, Romero P (2002). The Pathway Tools software. Bioinformatics 18:S225-32.

Kelley S (2000). Getting started with Acedb. Brief Bioinform. 1(2):131-7.

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D (2002). The Human Genome Browser at UCSC. Genome Res. 12:996-1006.

Lamason RL, Mohideen MA, Mest JR, Wong AC, Norton HL, Aros MC, Jurynec MJ, Mao X, Humphreville VR, Humbert JE, Sinha S, Moore JL, Jagadeeswaran P, Zhao W, Ning G, Makalowska I, McKeigue PM, O'Donnell D, Kittles R, Parra EJ, Mangini NJ, Grunwald DJ, Shriver MD, Canfield VA, Cheng KC (2005). SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science 310(5755):1782-6.

Lewis SE, Searle SM, Harris N, Gibson M, Iyer V, Richter J, Wiel C, Bayraktaroglu L, Birney E, Crosby MA, Kaminker JS, Matthews BB, Prochnik SE, Smith CD, Tupy JL, Rubin GM, Misra S, Mungall CJ, Clamp ME (2002). Apollo: a sequence annotation editor. Genome Biol. 3(12):RESEARCH0082. Epub 2002.

Pocock M, Down T, Hubbard T (2000). BioJava: Open source components for bioinformatics. ACM SIGBIO Newsletter 20(2):10-2.

Raymond ES (1999). The Cathedral and the Bazaar. O’Reilly and Associates, Sebastopol, CA.

Risch NJ (2000). Searching for genetic determinants in the new millennium. Nature 405:847-56.

Salamov A, Solovyev V (2000). Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10:516-22.

Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H, et al. (2002). The Bioperl Toolkit: Perl Modules for the Life Sciences. Genome Res. 12(10):1611-8.

Stein L, Beavis W, Gessler D, Huala E, Lawrence C, Main D, Mueller L, Rhee SY, Rokhsar D (2006). Save our Data! The Scientist 20:24-5.

Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S (2002). The generic genome browser: a building block for a model organism system database. Genome Res. 12(10):1599-610.

Thorisson GA, Smith AV, Krishnan L, Stein LD (2005). The International HapMap Project Web site. Genome Res. 15:1592-3.

Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S, Buckler ES (2006). A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 38:203-8.

Zhong W, Sternberg PW (2006). Genome-wide prediction of C. elegans genetic interactions. Science 311(5766):1481-4.

I. Management Plan (Multiple PI Leader Plan)

Lincoln Stein will be responsible for oversight of the project as a whole, including accomplishment of the milestones described below, and the complete and timely release of software and other resources by the project team. Ian Holmes and Todd Vision will have primary responsibility for specific aims #1 and #3, respectively, and Lincoln Stein will have primary responsibility for the accomplishment of the goals set out in specific aim #2.

The three coPIs will meet by telephone once per month to discuss project progress, to identify unexpected technical and strategic issues, and, if needed, to accommodate these issues by modifying the project plan. The three coPIs, along with the project software engineers, user support specialist, and GMOD coordinator, will meet once per year in person at the spring GMOD meeting.

Should one of the groups fall behind on its milestones, the lead PI will discuss the issue with the responsible coPI, and may exercise the right to reapportion funds to ensure that the project goals are met.

Our software development methodology is heavily influenced by the literature of Extreme Programming (Beck 2000) and of the open source movement (Raymond 1999). We create short use cases during the design of the software as an aid to discussion and understanding of the design goals within our group. We then implement working prototypes of the software at the earliest possible stage, place these prototypes online, and solicit feedback from as wide a section of the community as possible. We make use of the open source model by encouraging software developers to participate in the testing and enhancement of the software.

In addition to user-driven testing, we develop automated unit tests in parallel with the implementation of the production code. Our tests for the GBrowse enhancements described in specific aims #1 and #2 will take the following form:

1) For the server-side rendering engine for the AJAX GBrowse client, annotation files in common formats will be rendered to images and broken into tiles (as per normal operation of the server). These images will then be compared at the binary level to expected output. This test is similar to the existing automated test templates for CGI-based GBrowse.

2) For the server-side query processor and image tile server, we will maintain a Chado database containing an invariant series of genome annotations. Unit tests will then exercise the server by sending it a series of URL requests and comparing the response (at the HTML level) against the expected response.

3) For client code, we will implement unit tests within JavaScript. These unit tests will load the client application JavaScript libraries and exercise each major subroutine with a variety of valid and invalid parameters.

4) Since performance optimization of the system as a whole is one of our design goals, we will also collect, report and review performance statistics on an automatic basis: excessively poor performance will be reported as a test failure, the failure threshold to be set progressively lower as the code is optimized.

5) We will test the community annotation framework using an automated agent that will (i) upload a sample annotation dataset over HTTP to a test server, (ii) test that the new annotation appears in the list of tracks available from the server, (iii) test that the annotation track was correctly rendered by downloading tile images and performing a binary-level comparison and (iv) test that the uploaded annotation can be revised and deleted, and that queries to the server reflect these changes.

6) Cross-browser portability is one of our goals. On a quarterly basis, and prior to each major release, we will review and test the entire feature set for the client on the major browsers on the major operating systems. The exact set of browsers to be tested will be determined by examining the server logs of WormBase (www.wormbase.org) and HapMap (www.hapmap.org), both of which are accessible to L. Stein.

7) The diversity displays will be tested by comparing the binary image generated by the renderer against the expected output, as described in (1).
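The binary-level tile comparison used in tests (1), (5) and (7), and the progressively tightened performance threshold in test (4), can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's actual test harness: the function names `digest`, `tiles_match` and `check_performance` are hypothetical, tiles are represented as raw byte strings rather than image files, and storing SHA-256 digests as the "expected output" is one possible design choice among several.

```python
import hashlib
import time


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a tile's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def tiles_match(rendered: list[bytes], expected_digests: list[str]) -> bool:
    """Binary-level comparison of rendered tiles against stored digests.

    Passes only if the number of tiles is the same and every tile's
    bytes hash to the corresponding expected digest.
    """
    if len(rendered) != len(expected_digests):
        return False
    return all(digest(tile) == want
               for tile, want in zip(rendered, expected_digests))


def check_performance(render, max_seconds: float) -> tuple[bool, float]:
    """Time one call to `render` and compare it to the current threshold.

    Returns (passed, elapsed_seconds). The threshold is a parameter so it
    can be lowered as the code is optimized.
    """
    start = time.perf_counter()
    render()
    elapsed = time.perf_counter() - start
    return elapsed <= max_seconds, elapsed
```

Keeping digests rather than full reference images in the test suite keeps the expected output small, at the cost of not being able to show which pixels differ when a comparison fails.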

To measure the effectiveness of the end-user support and outreach efforts described in Specific Aim #3, we will monitor the following metrics:

1) The average time between the opening of a bug report ticket on the GMOD bug tracking system and its resolution.

2) The number of web site impressions and file downloads from the GMOD web site.

3) The results from the online user surveys, as well as informal feedback provided during tutorials and workshops.
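Metric (1) amounts to a mean over resolved tickets. A minimal sketch in Python, assuming each ticket can be exported from the bug tracker as an opened timestamp and an optional resolved timestamp; the `Ticket` type and its field names are hypothetical and do not reflect the tracker's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Ticket:
    opened: datetime
    resolved: Optional[datetime] = None  # None while the ticket is still open


def mean_time_to_resolution(tickets: list[Ticket]) -> Optional[timedelta]:
    """Average (resolved - opened) over resolved tickets.

    Open tickets are excluded; returns None if nothing has been resolved.
    """
    durations = [t.resolved - t.opened
                 for t in tickets if t.resolved is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

Because open tickets are excluded, it may be worth reporting the number of unresolved tickets alongside this mean so that a backlog of old, open reports is not hidden.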

We use the SourceForge CVS (Concurrent Versions System) service for source code management. This resource is regularly backed up and is generally reliable, but there have been several well-publicized service outages recently. The GMOD developers are monitoring the situation and are prepared to switch to another source code repository if outages remain common.

A timeline for specific aims 1-2, listing major milestones and development activities, follows (Figure 4):

Figure 4: Project timeline and milestones.

J. Consortium/Contractual Arrangements

K. Resource Sharing

All software, schemas, documentation, tutorials, and workshop materials will be made available to the research community under an open source license agreement that allows for unrestricted use, modification and redistribution, provided that attribution to the original developers and their institutions is maintained. For Perl-based software, we will use the Perl Artistic License (www.perl.com/pub/a/language/misc/Artistic.html). We will distribute JavaScript client code under the Academic Free License (opensource.org/licenses/afl-2.1.php), which is a generalized version of the Berkeley BSD License. As described in the management plan, resources will be released to the community on an “early and often” basis. In addition to formal releases, the community will have access to “live” development code through anonymous CVS at the GMOD SourceForge site. Nightly snapshot “tarballs” of the development code will also be available for those without access to CVS. Formal software releases will be available for download via the web and FTP using SourceForge facilities. Non-software resources, such as tutorials, will be available for download from the GMOD web site.
