BASys: A Web Server for Automated Bacterial Genome Annotation

Gary Van Domselaar†, Paul Stothard, Savita Shrivastava, Joseph A. Cruz, AnChi Guo, Xiaoli Dong, Paul Lu, Duane Szafron, Russ Greiner, and David S. Wishart‡

Departments of Computing Science and Biological Sciences, University of Alberta

Edmonton AB T6E 2E9

[email protected] [email protected]

Abstract

BASys (Bacterial Annotation System) is a web server that supports automated, in-depth annotation of bacterial genomic (chromosomal, plasmid, and contig) sequences. It accepts raw DNA sequence data and an optional list of gene identification information and provides extensive textual and hyperlinked image output. BASys uses more than 30 programs to determine nearly 60 annotation subfields for each gene, including gene/protein name, GO function, COG function, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3-D structure, reactions, and pathways. The depth and detail of a BASys annotation match or exceed those found in a standard SwissProt entry. BASys also generates colourful, clickable, and fully zoomable maps of each query chromosome to permit rapid navigation and detailed visual analysis of all resulting gene annotations. The textual annotations and images provided by BASys can be generated in approximately 24 hours for an average bacterial chromosome (5 megabases). BASys annotations may be viewed and downloaded anonymously or through a password-protected access system. The BASys server and databases can also be downloaded and run locally. BASys is accessible at:

http://wishart.biology.ualberta.ca/basys

[Figure: pipeline overview. Genomic sequence data and optional gene identification data are submitted to the head node. Reference DB similarity search: SWISSPROT, CCDB.]

Data Submission

BASys supplies a web form for uploading chromosome, plasmid, or contig sequence data. Optional gene identification data can be provided, or BASys can predict protein coding regions from the genomic data using Glimmer [1].

[Figure: compute nodes. Model organism similarity search: E. coli, D. melanogaster, H. sapiens, C. elegans, S. cerevisiae. Metabolite analysis: KEGG. Sequence analysis: Pfam, PROSITE, PredictSPTM, etc.]

Data Scheduling

BASys is implemented as a distributed system. The head node monitors and manages the job scheduling. Annotation and report generation are carried out by the compute nodes.
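
The head node and compute node split described above can be illustrated with a minimal sketch. This is not the BASys scheduler: the thread pool stands in for the compute nodes, and the names `annotate_gene`, `run_head_node`, and `n_compute_nodes` are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate_gene(gene_id: str) -> dict:
    """Stand-in for the annotation work done on one compute node (hypothetical)."""
    return {"gene": gene_id, "status": "annotated"}


def run_head_node(gene_ids: list[str], n_compute_nodes: int = 4) -> list[dict]:
    """Dispatch one job per gene to a pool of workers and gather results in input order."""
    with ThreadPoolExecutor(max_workers=n_compute_nodes) as pool:
        # pool.map preserves the order of gene_ids in its results
        return list(pool.map(annotate_gene, gene_ids))


results = run_head_node(["gene0001", "gene0002", "gene0003"])
```

A real deployment would replace the thread pool with a cluster job queue; the sketch only shows the division of labour between scheduling and annotation.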

Annotation Reports

BASys uses CGView [3] to generate clickable genome maps for navigating the genome data. An HTML-formatted tabular summary is also provided. The genome maps are pre-rendered as a series of hyperlinked PNG image files. Each gene label is hyperlinked to its corresponding HTML-formatted “gene card”, which is in turn hyperlinked, where applicable, to external references. Text-only gene cards are also provided. BASys also supplies an “evidence card” describing how each annotation was generated. The gene cards, evidence cards, and graphical genome maps are downloadable for offline viewing.

References

1. Delcher AL et al. (1999) Nucleic Acids Res. 27:4636-41.

2. Iliopoulos I et al. (2003) Bioinformatics 19:717-26.

3. Stothard P and Wishart DS (2005) Bioinformatics 21:537-39.

[Figure: report generation. CGView produces the annotation reports.]

Search Capability

BASys supports online keyword searches and sequence similarity searches. Search results contain hyperlinks to their gene cards and graphical genome maps.

BASys Annotation Pipeline

The BASys annotation engine combines database comparison and computational sequence analysis in its annotation pipeline. Translated coding sequences are first compared, using BLAST, to two expertly annotated reference databases: UniProt and the CyberCell comprehensive molecular database on Escherichia coli. The similarity score between the query and each database sequence is compared to the threshold value for each annotation type, and qualifying annotations are transitively applied to the query sequence. BASys attempts to fill the remaining annotations with additional similarity searches and sequence analyses. BLAST searches are conducted against the protein sequences of C. elegans, human, yeast, and Drosophila; a non-redundant database of bacterial protein sequences; and the PDB, KEGG, and COG databases. Various sequence analyses are also performed, including Pfam, PROSITE, signal peptide and transmembrane domain predictions, and secondary structure prediction with PSIPRED. If the sequence has sufficient similarity to a sequence represented in the PDB, BASys may use HOMODELLER to generate a homology model and subsequently perform a structural analysis using VADAR. Several additional annotations, such as protein molecular weight, isoelectric point, and operon structure, are calculated directly from the chromosomal, protein-coding nucleotide, and translated protein sequence data. In all, a collection of nearly 60 distinct annotations is generated for each gene.
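
The threshold step in the pipeline can be sketched as follows. This is an illustration under stated assumptions, not the BASys implementation: the poster does not give the threshold values or the scoring scale, so the `THRESHOLDS` table, the 0 to 1 `score`, and the names `Hit`, `transfer_annotations`, and `missing_fields` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical per-annotation-type thresholds on a 0..1 similarity score.
# The real BASys values are not stated on the poster.
THRESHOLDS = {
    "protein_name": 0.80,
    "go_function": 0.60,
    "subcellular_localization": 0.40,
}


@dataclass
class Hit:
    """One database hit: its similarity score and the annotations it carries."""
    score: float
    annotations: dict


def transfer_annotations(hits, thresholds=THRESHOLDS):
    """Copy each annotation field from the best-scoring hit that meets that field's threshold."""
    transferred = {}
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        for field, value in hit.annotations.items():
            threshold = thresholds.get(field)
            if threshold is None or field in transferred:
                continue  # unknown field, or already filled by a better hit
            if hit.score >= threshold:
                transferred[field] = value
    return transferred


def missing_fields(transferred, thresholds=THRESHOLDS):
    """Fields still empty, to be filled by further searches and sequence analyses."""
    return sorted(set(thresholds) - set(transferred))


hits = [
    Hit(0.65, {"protein_name": "DnaK", "go_function": "ATP binding"}),
    Hit(0.45, {"subcellular_localization": "cytoplasm"}),
]
annotations = transfer_annotations(hits)
remaining = missing_fields(annotations)
```

In this toy run the 0.65 hit is good enough to transfer the GO function but not the protein name, so `protein_name` remains in `remaining` for the later stages of the pipeline.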

Validation

BASys annotations were compared to a set of expertly annotated proteins from C. trachomatis [2]. BASys annotations agreed with the expert annotations 762 times out of 894. The sensitivity is 94%; the specificity is 73%.
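
The validation figures can be related to their standard definitions with a short sketch. The agreement rate uses the counts reported above. The poster does not report the underlying confusion-matrix counts, so the true/false positive and negative counts below are toy values, labeled hypothetical, chosen only to show how percentages of this kind are computed.

```python
def agreement_rate(agreements: int, total: int) -> float:
    """Fraction of annotations that agree with the expert annotation."""
    return agreements / total


def sensitivity(true_pos: int, false_neg: int) -> float:
    """True positive rate: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


# Counts reported on the poster
rate = agreement_rate(762, 894)
print(f"agreement: {rate:.1%}")  # prints "agreement: 85.2%"

# Hypothetical toy counts, NOT taken from the validation study
toy_sensitivity = sensitivity(true_pos=94, false_neg=6)
toy_specificity = specificity(true_neg=73, false_pos=27)
```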

[Figure: structure analysis. Homodeller, VADAR, PDB.]