EBI is an Outstation of the European Molecular Biology Laboratory.
Every genome deserves a home
Dan Lawson, EMBL-EBI
Disclosure - my background
VectorBase http://www.vectorbase.org
• NIAID-funded Bioinformatic Resource Center focused on arthropod vectors of human pathogens
• Collaborates with sequencers and community on 1° (primary) annotation
• Community resource, ‘One stop shop’
Ensembl Genomes http://www.ensemblgenomes.org
• Extending Ensembl across taxonomic space
• 5 taxonomic portals to present genome assemblies and annotation
• Integrated resource for cross-species interrogation
Find a home for every genome
Every genome deserves a home
• Sequencing the genome of your favourite species is a beginning
• You will want to make your genome:
• Useful to your group/community
• Useful to other communities
• You will (hopefully) want to update/improve:
• Assembly (new sequencing technologies, mapping strategies)
• Gene predictions (new models, correct existing models, delete unsupported models)
• Gene annotation (add gene names/symbols, descriptions)
• Data richness (new high-throughput datasets, xrefs to relevant resources)
Finding a home for every genome
• All genomes deserve a home
• Houses
• Apartments/Flats
• Dormitories/Barracks
Julian Parkhill, Ewan Birney and Paul Kersey. Genomic information infrastructure after the deluge. Genome Biology 2010, 11:402. http://genomebiology.com/2010/11/7/402
Anatomy of a home
• Genome browser
• Similarity searches
• BLAST/BLAT
• Query tools
• Simple keyword
• Complex queries
• Downloads
[Screenshots: similarity searches, query tool, downloads, genome browser, Compara]
Finding a home
• Factors to take into account when choosing a home for your genome
• Required functionality
• Data access (Bulk download, tailored download, computational)
• Visualization (Genome browser)
• Search (Sequence based, simple keyword queries, complex queries)
• Extendability for new data types (e.g. NGS transcriptomics, variation)
• Resources required for maintenance
• Compute/servers
• Staff (with appropriate skills)
Tier 2 databases: VectorBase
• One of 4 NIAID Bioinformatics Resource Centers
• Integrated genomic resource for arthropod vectors of human pathogens
• Collaboration of 3 European and 3 US Institutes
• VectorBase is:
• Both service provider and content generator
• A collator of genomic information
• A genome annotation group (gene structure prediction)
• A provider of tools for browsing and data mining vector genomes
• A helpdesk for community queries
• Responsible for data submissions to the public archival databases
• Committed to regular release cycles (5-6 releases per year)
VectorBase highlights 2012
• Website orientated around data rather than species
• Consolidation of legacy sections
• Faceted universal search
• Scalable handling of:
• organism
• strain
• assembly
• gene set
• Ensembl genome browser
• Extensive user data upload facilities
• More species
• Community Annotation Portal overhaul
Tier 3 databases: Ensembl Genomes
Ensembl Genomes release 18 (http://metazoa.ensembl.org)
• 43 species
• Stakeholders:
• VectorBase
• FlyBase
• WormBase
• BeetleBase
• Hymenoptera Genome Database
• Other highlights
• Lepidoptera (3 spp., one to come)
• Sole location of a number of arthropod genomes
Ensembl Genomes - home analogy
• Integration into the Ensembl relational database schema
• Genome browser
• Data centric views
• Downloads
• Similarity searches (Blast/Blat)
• Comparative analysis with other species
• Programmatic access (Perl API)
• BioMart query tool
• Data consistency across species
Benefits of inclusion in Ensembl Genomes
• Integration with a wide range of other species
• Ability to include other data types
• Variation
• Functional genomics
• Alignments
• Community data sets (configuration of site)
• BAMs (RNA-seq, re-sequencing)
• VCFs (SNPs, CNVs)
• Wiggle plots for regulatory elements, ChIP-seq etc.
• User addition of data sets (temporary visualization)
• Downstream usage by 3rd party tools/analyses
Choosing a solution
• Look at existing solutions
“Off the shelf”
• Generic Model Organism Database project (http://www.gmod.org/wiki/Main_Page)
• Ensembl (http://www.ensembl.org)
“Roll your own”
• Content Management Systems (Drupal)
• Wikis (many flavours)
Publicise your resource
• Meetings
• Mailing lists
• Publication
• NAR Database issue
• A little bit of SEO
• Google/Bing etc.
• Social media
Make your data available in common formats
Just as we use a lingua franca to communicate between nationalities we use the same in sharing data
Sequences
• Fasta format
• http://www.ebi.ac.uk/help/formats.html
Assembly
• AGP (Golden Path)
• GenBank http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml
Annotation
• GFF3 (Generic Feature Format version 3)
• Sequence Ontology http://www.sequenceontology.org/gff3.shtml
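As an illustration of why common formats matter, the sketch below reads FASTA and GFF3 with nothing but the Python standard library. It is a minimal sketch, not a full implementation of either specification: the function names `read_fasta` and `parse_gff3_line`, and the sample records (`scaffold_1`, `gene0001`), are hypothetical and chosen only for this example.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Identifier = first whitespace-delimited token after '>'.
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")

def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments or blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"], record["end"] = int(record["start"]), int(record["end"])
    # Column 9 holds semicolon-separated tag=value pairs.
    record["attributes"] = dict(
        pair.split("=", 1) for pair in record["attributes"].split(";") if "=" in pair
    )
    return record

# Hypothetical sample data, for illustration only.
FASTA_TEXT = ">scaffold_1 example\nACGTACGT\nACGT\n>scaffold_2\nGGCC\n"
GFF3_TEXT = ("##gff-version 3\n"
             "scaffold_1\texample\tgene\t1\t12\t.\t+\t.\tID=gene0001;Name=demo\n")

sequences = dict(read_fasta(StringIO(FASTA_TEXT)))
features = [rec for rec in map(parse_gff3_line, GFF3_TEXT.splitlines()) if rec]
```

Because both formats are plain text with a fixed layout, a third party can consume them without any knowledge of the database that produced them.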
Bulk downloads are not an afterthought...
• The provision of data as bulk downloads should not be an afterthought for your project
• Make data available in common formats
• Be responsive to community needs (in terms of alternative formats, other data types)
• Run quality assurance over the download files
• Completeness
• Within files
• Across files
• ‘Round trip’ data where possible - “I have a dream”
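The quality-assurance points above could be sketched as a small script run over each release. The example assumes sequence lengths have already been read from the sequence file and feature coordinates from the annotation file. The function name `check_downloads` and all identifiers in the sample data are hypothetical.

```python
def check_downloads(seq_lengths, features):
    """Return a list of problems found in a set of download files.

    seq_lengths: dict mapping sequence identifier -> length (from the sequence file)
    features:    iterable of (seqid, start, end) with 1-based inclusive coordinates
                 (from the annotation file)
    """
    problems = []

    # Completeness within a file: every sequence record should contain sequence.
    for seqid, length in seq_lengths.items():
        if length == 0:
            problems.append(f"{seqid}: empty sequence")

    # Completeness across files: every annotated sequence must exist in the
    # sequence file, and feature coordinates must fall inside that sequence.
    for seqid, start, end in features:
        if seqid not in seq_lengths:
            problems.append(f"{seqid}: annotated but missing from sequence file")
        elif not (1 <= start <= end <= seq_lengths[seqid]):
            problems.append(
                f"{seqid}:{start}-{end}: coordinates outside sequence "
                f"(length {seq_lengths[seqid]})"
            )
    return problems

# Hypothetical release contents, for illustration only.
seq_lengths = {"scaffold_1": 12, "scaffold_2": 0}
features = [("scaffold_1", 1, 12), ("scaffold_1", 10, 20), ("scaffold_3", 5, 9)]

problems = check_downloads(seq_lengths, features)
for problem in problems:
    print(problem)
```

A check like this is cheap to run before every release and catches the mismatches between files that users otherwise discover for you.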
but by far the most important thing is
Submission to the public archival databases
Why submit to the public archival databases?
• Visibility
• Integration with the widest possible community
• xrefs back to your resource
• Longevity
• Funding for INSDC is always going to be more secure than your database
• Accreditation
• Publication
• Many funders and journals require submission prior to publication
• NCBI/EBI/UCSC Browser agreement
• Only assemblies submitted to INSDC can be visualised through these resources
Personally - I don’t consider a genome to be in the public domain until it has been submitted to INSDC
Submission makes you do a number of things
Requirement to conform to standards
• Some are mandatory, some advisory
• Opportunity to capture metadata
• Minimum information about a genome sequence (MIGS)
Encourages good practice
Explicit nomenclature and versioning
• Caveat that you need to make updates!
GenBank nomenclature
• BioProject accessions
• WGS accessions
• Assembly accessions
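The three accession types have recognisably different shapes, which makes them easy to tell apart in scripts. The sketch below uses simplified regular expressions that are assumptions based on commonly seen accessions, not an official validator. The name `classify_accession` is hypothetical and the sample strings are used only to illustrate each format.

```python
import re

# Simplified, assumed patterns; the archives' own documentation is authoritative.
ACCESSION_PATTERNS = {
    # e.g. PRJNA..., PRJEB..., PRJDB... followed by digits
    "BioProject": re.compile(r"^PRJ[A-Z]{2}\d+$"),
    # GCA_ (INSDC) or GCF_ (RefSeq), nine digits, then a version suffix
    "Assembly": re.compile(r"^GC[AF]_\d{9}\.\d+$"),
    # four or six letter prefix, two-digit version, then six or more digits
    "WGS": re.compile(r"^(?:[A-Z]{4}|[A-Z]{6})\d{2}\d{6,}$"),
}

def classify_accession(accession):
    """Return the accession type name, or None if no pattern matches."""
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(accession):
            return kind
    return None

# Strings chosen to illustrate each format.
examples = ["PRJNA163993", "AAAB01000000", "GCA_000005575.2", "scaffold_1"]
classified = {acc: classify_accession(acc) for acc in examples}
```

Note that the assembly accession carries an explicit version suffix, which is what makes the "explicit nomenclature and versioning" point above enforceable in practice.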
i5k BioProject at INSDC
• We encourage communities to submit data to the appropriate public archival database: GenBank/ENA/DDBJ, the Sequence Read Archive (SRA), etc.
• We encourage you to join us and add your project when submitting data to INSDC
• http://www.ncbi.nlm.nih.gov/bioproject/163993
Encourage collaboration
• “Many cooks spoil the broth” vs. “Many hands make light work”
• Send your genome to school to learn
• Encourage collaboration within your community
• Encourage the next generation of researchers
• Don’t be afraid to ask “experts” for specific help
• Fort Lauderdale agreement
• Outcome from a 2003 meeting
• Sequencing group reserves right to publish
• Strike a balance between fair use (i.e. no pre-emptive publication) and early disclosure.
http://arthropodgenomes.org
arthropodgenomes.org
• > 600 registered users from 178 institutes worldwide
• 30 community resources/databases
• ≈800 species nominated by individuals, consortia, museums or societies
Built around Person & Organism pages
Stakeholders - Databases
• Outreach opportunity
• Includes species (living in this home)
• Contact details for the project
• Contact details for the developers
• References
Stakeholders - Resources
• Outreach opportunity
• Includes species (living in this home)
• Contact details for the project
• Contact details for the developers
• References
Encourage collaboration
Finding “experts” from outside your community
Genome papers, supplemental data
Future challenges
• Scaling bioinformatics infrastructure to deal with 1000s of genomes
• Centralised or federated models
• Democratisation of genome analysis
• “Best practices” for genome assembly & annotation
• Metrics for assessing genome assemblies and annotations
• e.g. Assemblathon (http://assemblathon.org)
• Facilitating and improving community involvement in genome projects
• e.g. VectorBase Community Annotation Portal (CAP), WebApollo.
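The slides do not name a specific assembly metric. As one widely used example of a contiguity statistic, the sketch below computes N50: the length L such that sequences of length L or longer together cover at least half of the assembly. The function name `n50` and the sample lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half the assembly.
        if running >= half_total:
            return length
    return 0

# Hypothetical scaffold lengths: total 290, so the halfway point is 145.
scaffold_lengths = [80, 70, 50, 40, 30, 20]
result = n50(scaffold_lengths)  # 80 + 70 = 150 >= 145, so N50 is 70
```

A single number like this says nothing about correctness or gene content, which is part of why agreeing on a broader set of metrics remains a challenge.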