EBI is an Outstation of the European Molecular Biology Laboratory.

Every genome deserves a home

Dan Lawson, EMBL-EBI

Disclosure - my background

VectorBase http://www.vectorbase.org

• NIAID-funded Bioinformatics Resource Center focused on arthropod vectors of human pathogens

• Collaborates with sequencers and community on primary annotation

• Community resource, ‘One stop shop’

Ensembl Genomes http://www.ensemblgenomes.org

• Extending Ensembl across taxonomic space

• 5 taxonomic portals to present genome assemblies and annotation

• Integrated resource for cross-species interrogation

Find a home for every genome

Every genome deserves a home

• Sequencing the genome of your favourite species is a beginning

• You will want to make your genome:

• Useful to your group/community

• Useful to other communities

• You will (hopefully) want to update/improve:

• Assembly (new sequencing technologies, mapping strategies)

• Gene predictions (new models, correct existing models, delete unsupported models)

• Gene annotation (add gene names/symbols, descriptions)

• Data richness (new high-throughput datasets, xrefs to relevant resources)

Finding a home for every genome

• All genomes deserve a home

• Houses

• Apartments/Flats

• Dormitories/Barracks

Genomic information infrastructure after the deluge. Julian Parkhill, Ewan Birney and Paul Kersey. Genome Biology 2010, 11:402. http://genomebiology.com/2010/11/7/402

Anatomy of a home

• Genome browser

• Similarity searches

• BLAST/BLAT

• Query tools

• Simple keyword

• Complex queries

• Downloads

[Screenshots: similarity searches, query tool, downloads, browser, Compara]

Finding a home

• Factors to take into account when choosing a home for your genome

• Required functionality

• Data access (Bulk download, tailored download, computational)

• Visualization (Genome browser)

• Search (Sequence based, simple keyword queries, complex queries)

• Extendability for new data types (e.g. NGS transcriptomics, variation)

• Resources required for maintenance

• Compute/servers

• Staff (with appropriate skills)

Tier 2 databases: VectorBase

• One of 4 NIAID Bioinformatics Resource Centers

• Integrated genomic resource for arthropod vectors of human pathogens

• Collaboration of 3 European and 3 US Institutes

• VectorBase is:

• Both service provider and content generator

• A collator of genomic information

• A genome annotation group (gene structure prediction)

• A provider of tools for browsing and data mining vector genomes

• A helpdesk for community queries

• Responsible for data submissions to the public archival databases

• Committed to regular release cycles (5-6 releases per year)

VectorBase highlights 2012

• Website orientated around data rather than species

• Consolidation of legacy sections

• Faceted universal search

• Scalable handling of:

• organism

• strain

• assembly

• gene set

• Ensembl genome browser

• Extensive user data upload facilities

• More species

• Community Annotation Portal overhaul

Tier 3 databases: Ensembl Genomes

Ensembl Genomes release 18 (http://metazoa.ensembl.org)

• 43 species

• Stakeholders:

• VectorBase

• FlyBase

• WormBase

• BeetleBase

• Hymenoptera Genome Database

• Other highlights

• Lepidoptera (3 spp., one more to come)

• Sole location of a number of arthropod genomes

Ensembl Genomes - home analogy

• Integration into the Ensembl relational database schema

• Genome browser

• Data centric views

• Downloads

• Similarity searches (Blast/Blat)

• Comparative analysis with other species

• Programmatic access (Perl API)

• BioMart query tool

• Data consistency across species

Benefits of inclusion in Ensembl Genomes

• Integration with a wide range of other species

• Ability to include other data types

• Variation

• Functional genomics

• Alignments

• Community data sets (configuration of site)

• BAMs (RNA-seq, re-sequencing)

• VCFs (SNPs, CNVs)

• Wiggle plots for regulatory elements, ChIP-seq, etc.

• User addition of data sets (temporary visualization)

• Downstream usage by 3rd party tools/analyses

Choosing a solution

• Look at existing solutions

“Off the shelf”

• Generic Model Organism Database project (http://www.gmod.org/wiki/Main_Page)

• Ensembl (http://www.ensembl.org)

“Roll your own”

• Content Management Systems (Drupal)

• Wikis (many flavours)

Publicise your resource

• Meetings

• Mailing lists

• Publication

• NAR Database issue

• a little bit of SEO

• Google/Bing etc.

• Social media

Make your data available in common formats

Just as a lingua franca lets people of different nationalities communicate, common file formats let us share data

Sequences

• Fasta format

• http://www.ebi.ac.uk/help/formats.html

Assembly

• AGP (Golden Path)

• GenBank http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml

Annotation

• GFF3 (Generic Feature Format version 3)

• Sequence Ontology http://www.sequenceontology.org/gff3.shtml
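As an illustration of why common formats help, here is a minimal Python sketch that reads FASTA records and splits one GFF3 feature line into its nine tab-separated columns. The function names (`read_fasta`, `parse_gff3_line`) and the sample identifiers (`scaffold_1`, `gene0001`) are hypothetical, and multi-valued attributes are left as plain strings; this is a sketch, not a full parser or validator.

```python
from urllib.parse import unquote

# Column names as defined by the GFF3 specification.
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def read_fasta(lines):
    """Return {identifier: sequence} from an iterable of FASTA lines."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'.
            name = line[1:].split()[0]
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line)
    return {key: "".join(parts) for key, parts in sequences.items()}


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # Attribute values are URL-escaped in GFF3.
            attributes[key] = unquote(value)
    record["attributes"] = attributes
    return record


# Hypothetical example data.
fasta = read_fasta([">scaffold_1 example scaffold", "ACGTAC", "GTTA"])
gene = parse_gff3_line(
    "scaffold_1\texample\tgene\t2\t7\t.\t+\t.\tID=gene0001;Name=demo%20gene"
)
```

Because both formats are line-oriented plain text, a third party can consume them with a few lines of code in any language.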

Bulk downloads are not an afterthought...

• The provision of data as bulk downloads should not be an afterthought for your project

• Make data available in common formats

• Be responsive to community needs (in terms of alternative formats, other data types)

• Run quality assurance over the download files

• Completeness

• Within files

• Across files

• ‘Round trip’ data where possible - “I have a dream”
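One possible "across files" completeness check is sketched below in Python: every feature in a GFF3 file should sit on a sequence that exists in the accompanying FASTA file, with coordinates inside that sequence. The function names and sample data are hypothetical, and the sketch assumes the GFF3 file has no embedded FASTA section.

```python
def fasta_lengths(lines):
    """Return {identifier: sequence length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    return lengths


def check_gff3_against_fasta(gff3_lines, lengths):
    """List features whose sequence is missing or whose coordinates do not fit."""
    problems = []
    for number, line in enumerate(gff3_lines, start=1):
        # Skip blank lines, directives and comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            problems.append(f"line {number}: expected 9 columns, found {len(fields)}")
            continue
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        if seqid not in lengths:
            problems.append(f"line {number}: sequence '{seqid}' missing from FASTA")
        elif not 1 <= start <= end <= lengths[seqid]:
            problems.append(
                f"line {number}: coordinates {start}-{end} outside "
                f"'{seqid}' (length {lengths[seqid]})"
            )
    return problems


# Hypothetical example: one valid gene, one overrunning its scaffold,
# and one on a scaffold that is absent from the FASTA file.
lengths = fasta_lengths([">scaffold_1", "ACGTACGTTA"])
gff3 = [
    "##gff-version 3",
    "scaffold_1\texample\tgene\t2\t7\t.\t+\t.\tID=gene0001",
    "scaffold_1\texample\tgene\t8\t15\t.\t-\t.\tID=gene0002",
    "scaffold_9\texample\tgene\t1\t5\t.\t+\t.\tID=gene0003",
]
problems = check_gff3_against_fasta(gff3, lengths)
for problem in problems:
    print(problem)
```

Running a check like this before each release catches download files that are individually well formed but inconsistent with each other.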

but by far the most important thing is

Submission to the public archival databases

Why submit to the public archival databases?

• Visibility

• Integration with the widest possible community

• xrefs back to your resource

• Longevity

• Funding for INSDC is always going to be more secure than your database

• Accreditation

• Publication

• Many funders and journals require submission prior to publication

• NCBI/EBI/UCSC Browser agreement

• Only assemblies submitted to INSDC can be visualised through these resources

Personally - I don’t consider a genome to be in the public domain until it has been submitted to INSDC

Submission makes you do a number of things

Requirement to conform to standards

• Some are mandatory, some advisory

• Opportunity to capture metadata

• Minimum information about a genome sequence (MIGS)

Encourages good practice

Explicit nomenclature and versioning

• Caveat that you need to make updates!

GenBank nomenclature

• BioProject accessions

• WGS accessions

• Assembly accessions
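Explicit versioning is easy to work with in code. The Python sketch below assumes GenBank assembly accessions take the form `GCA_` plus nine digits, a dot and an integer version; the accessions shown are made-up placeholders and the helper names are hypothetical.

```python
import re

# Assumed pattern: 'GCA_' + nine digits + '.' + integer version.
ASSEMBLY_ACCESSION = re.compile(r"^(GCA_\d{9})\.(\d+)$")


def split_assembly_accession(accession):
    """Return (stable accession, version) for a versioned assembly accession."""
    match = ASSEMBLY_ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"not a versioned assembly accession: {accession!r}")
    return match.group(1), int(match.group(2))


def is_newer(candidate, current):
    """True if both accessions name the same assembly and candidate has a higher version."""
    candidate_id, candidate_version = split_assembly_accession(candidate)
    current_id, current_version = split_assembly_accession(current)
    return candidate_id == current_id and candidate_version > current_version


# Placeholder accessions, for illustration only.
stable_id, version = split_assembly_accession("GCA_000000001.2")
updated = is_newer("GCA_000000001.2", "GCA_000000001.1")
```

Keeping the stable part and the version separate lets a resource state exactly which assembly release its annotation refers to.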

i5k BioProject at INSDC

• We encourage communities to submit data to the appropriate public archival databases (GenBank/ENA/DDBJ, the Sequence Read Archive (SRA), etc.)

• We encourage you to join us and add your project when submitting data to INSDC

• http://www.ncbi.nlm.nih.gov/bioproject/163993

Encourage collaboration

• “Too many cooks spoil the broth” vs. “Many hands make light work”

• Send your genome to school to learn

• Encourage collaboration within your community

• Encourage the next generation of researchers

• Don’t be afraid to ask “experts” for specific help

• Fort Lauderdale agreement

• Outcome from a 2003 meeting

• Sequencing group reserves right to publish

• Strike a balance between fair use (i.e. no pre-emptive publication) and early disclosure.

http://arthropodgenomes.org

arthropodgenomes.org

• > 600 registered users from 178 institutes worldwide

• 30 community resources/databases

• ≈800 species nominated by individuals, consortia, museums or societies

Built around Person & Organism pages

Stakeholders - Databases

• Outreach opportunity

• Includes species (living in this home)

• Contact details for the project

• Contact details for the developers

• References

Stakeholders - Resources

• Outreach opportunity

• Includes species (living in this home)

• Contact details for the project

• Contact details for the developers

• References

Encourage collaboration

Finding “experts” from outside your community

Genome papers, supplemental data

Future challenges

• Scaling bioinformatics infrastructure to deal with 1000s of genomes

• Centralised or federated models

• Democratisation of genome analysis

• “Best practices” for genome assembly & annotation

• Metrics for assessing genome assemblies and annotations

• e.g. Assemblathon (http://assemblathon.org)

• Facilitating and improving community involvement in genome projects

• e.g. VectorBase Community Annotation Portal (CAP), WebApollo.

Contact [email protected] or [email protected]
