Describing Bioinformatic Metadata at EBI James Malone [email protected]

Post on 18-Jan-2018

Page 1: Describing Bioinformatic Metadata at EBI James Malone

Describing Bioinformatic Metadata at EBI

James Malone [email protected]

Page 2: Describing Bioinformatic Metadata at EBI James Malone

Cross-Domain Data available from EBI

Genomes

DNA & RNA sequence

Gene expression

Protein sequence

Protein families, motifs and domains

Protein structure

Protein interactions

Chemical entities

Pathways

Systems

Literature and ontologies

Page 3: Describing Bioinformatic Metadata at EBI James Malone

The Sorts of Data we Serve

• We manage databases of biological data such as nucleic acid, protein sequences and macromolecular structures
• ENA: nucleotide sequencing information
• UniProt: protein sequence and functional information
• ArrayExpress: functional genomics data repository
• Ensembl: genome info for vertebrates and other eukaryotes
• InterPro: database of predictive protein "signatures"
• PDBe: data resource on biological macromolecular structures

Page 4: Describing Bioinformatic Metadata at EBI James Malone

Sorts of Metadata we need

• Low complexity – high volume (genome sequencing)
• High complexity – low volume (mouse phenotyping)
• 1000 Genomes: data volumes within an order of magnitude of physics data
• Provenance models
• Experimental variables
• Publication details
• Synonyms and domain-specific language
• Cross-domain mappings
• Metadata has existed and been captured for a while, e.g. InterPro IDs

Page 5: Describing Bioinformatic Metadata at EBI James Malone

Page 6: Describing Bioinformatic Metadata at EBI James Malone

Metadata: Minimum Information Standards

• Minimum Information Standards specify the minimum amount of metadata (and data) required to meet a specific aim (usually reporting data or submitting to a public repository)
• MIAME: Minimum Information About a Microarray Experiment
• MIARE: Minimum Information About an RNAi Experiment
• MIAPE: Minimum Information About a Proteomics Experiment
• MIFlowCyt: Minimum Information about a Flow Cytometry Experiment
• ISA: cross-domain experiment reporting
• Some public repositories require some conformance, e.g. ArrayExpress – MIAME scoring
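The idea of scoring a submission against a minimum information checklist can be sketched as follows. This is only an illustration: the field names in `REQUIRED_FIELDS` and the function `checklist_score` are hypothetical and are not the official MIAME element list or the ArrayExpress scoring method.

```python
# Hypothetical checklist: these field names are illustrative only and are
# NOT the official MIAME element list.
REQUIRED_FIELDS = (
    "raw_data",
    "processed_data",
    "sample_annotation",
    "experimental_design",
    "array_annotation",
    "protocols",
)


def checklist_score(submission: dict) -> tuple[int, list[str]]:
    """Return (number of required fields present, list of missing fields).

    A field counts as present only if it has a non-empty value.
    """
    missing = [name for name in REQUIRED_FIELDS if not submission.get(name)]
    return len(REQUIRED_FIELDS) - len(missing), missing


# Made-up submission used only to exercise the function.
submission = {
    "raw_data": "raw_files.zip",
    "processed_data": "normalised_matrix.txt",
    "sample_annotation": {"organism part": "liver"},
    "experimental_design": "",  # present but empty: counts as missing
    "protocols": ["hybridisation", "scanning"],
}

score, missing = checklist_score(submission)
print(f"score {score}/{len(REQUIRED_FIELDS)}; missing: {missing}")
```

A repository could use such a score to flag incomplete submissions for curation rather than reject them outright.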

Page 7: Describing Bioinformatic Metadata at EBI James Malone

Ontologies

• A method of representing knowledge in which concepts are described both by their meaning and by their relationships to each other
• Increasingly important component to formalise metadata
• Thriving bio-ontology community
• e.g. Gene Ontology ‘project to standardise the representation of gene and gene product attributes’
• e.g. ChEBI ‘ontology of molecular entities focused on small chemical compounds’
• e.g. Ontology of Biomedical Investigations ‘ontology to describe experimental protocols from inception to analysis’

Page 8: Describing Bioinformatic Metadata at EBI James Malone

Metadata that is Interoperable

• Goal: a community-wide set of interoperable reference ontologies
• Consumed by application ontologies for specific needs
• E.g. Experimental Factor Ontology @ www.ebi.ac.uk/efo

Anatomy Reference Ontology

Cell Type Ontology

Chemical Entities of Biological Interest (ChEBI)

Various Species Anatomy Ontologies

Relation Ontology

Disease Ontology

Page 9: Describing Bioinformatic Metadata at EBI James Malone

Applying Ontologies in Data Curation @ www.ebi.ac.uk/gxa

Query for Cell adhesion genes in all ‘organism parts’

‘View on EFO’

Ontologically Modeling Sample Variables in Gene Expression Data [email protected]
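A query such as "genes in all organism parts" relies on expanding a high-level ontology term to everything beneath it. The sketch below shows that expansion on a toy hierarchy. The terms, the `PARENT_OF` edges and the sample annotations are made up for illustration and are not real EFO content or the Atlas implementation.

```python
from collections import defaultdict, deque

# Hypothetical child -> parent edges; these are not real EFO terms or IDs.
PARENT_OF = {
    "liver": "organism part",
    "kidney": "organism part",
    "kidney cortex": "kidney",
    "HeLa": "cell line",
}

# Hypothetical sample annotations (sample id -> annotated term).
ANNOTATIONS = {
    "sample1": "liver",
    "sample2": "kidney cortex",
    "sample3": "HeLa",
}


def expand(term: str) -> set[str]:
    """Return the term plus every term below it in the hierarchy."""
    children = defaultdict(list)
    for child, parent in PARENT_OF.items():
        children[parent].append(child)
    seen, queue = {term}, deque([term])
    while queue:  # breadth-first walk down the hierarchy
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def samples_for(term: str) -> list[str]:
    """Find samples annotated with the term or any of its descendants."""
    wanted = expand(term)
    return sorted(s for s, t in ANNOTATIONS.items() if t in wanted)


print(samples_for("organism part"))
```

Because the expansion happens at query time, curators only need to annotate each sample with its most specific term.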

Page 10: Describing Bioinformatic Metadata at EBI James Malone

Strategies for Integrating Multi-Domain Data

• Consuming reference ontologies and mapping to multiple ontologies where overlap exists offers us maximum interoperability

[Diagram: RDF triples; labels: QUERY, Atlas, SwissProt, Amino Acid Ontology]
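As a rough illustration of how one query can span resources once their data is expressed as RDF-style triples, here is a minimal in-memory sketch. All identifiers, predicates and prefixes (`atlas:`, `swissprot:`, `aao:`) are hypothetical, and a real system would use an RDF store and SPARQL rather than Python lists.

```python
from typing import Optional

Triple = tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical triples; identifiers and predicates are made up for this sketch.
TRIPLES: list[Triple] = [
    ("atlas:geneX", "encodes", "swissprot:P00001"),
    ("swissprot:P00001", "contains_residue", "aao:Cysteine"),
    ("atlas:geneX", "expressed_in", "efo:liver"),
    ("atlas:geneY", "encodes", "swissprot:P00002"),
    ("swissprot:P00002", "contains_residue", "aao:Glycine"),
]


def match(
    triples: list[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> list[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def genes_with_residue(triples: list[Triple], residue: str) -> list[str]:
    """Join two patterns: gene -encodes-> protein -contains_residue-> residue."""
    proteins = {subj for subj, _, _ in match(triples, p="contains_residue", o=residue)}
    return sorted(subj for subj, _, obj in match(triples, p="encodes") if obj in proteins)


print(genes_with_residue(TRIPLES, "aao:Cysteine"))
```

The join works only because both resources refer to the protein by the same identifier, which is the role the shared ontologies and mappings play in the slide.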

Page 11: Describing Bioinformatic Metadata at EBI James Malone

ELIXIR Report

• Data Integration & Interoperability Recommendations – Jul 2009

• ELIXIR should build a distributed data infrastructure based on a Service Oriented Architecture using WS technology

• Ontologies needed in areas of disease, anatomy and taxon
• Annotation systems for associating data to metadata
• Pan-domain coordination and funding for reporting standards

Page 12: Describing Bioinformatic Metadata at EBI James Malone

Current Challenges

• Literature – data gap
• Curation relatively slow, more advanced tooling required
• Ontologies not interoperable yet and more needed
• Bio-ontology funding
• New high-throughput methods

[Chart: Assays and Experiments over time, Oct. 03 to Jan. 10; axis scales 0 to 350000 and 0 to 12000]

Page 13: Describing Bioinformatic Metadata at EBI James Malone

Challenges: Scaling

World-wide sequencing data production is now just an order of magnitude behind CERN

• Large Hadron Collider produces 15 petabytes per year from a single point source

• LHC grid is 140 computer centres in 33 countries, centred on CERN (Tier 0)

• Sequencing is producing data in hundreds of centres in dozens of countries with Tier 0 sites (EBI & NCBI)

• More than 150 terabytes of 1000 Genomes data are in the Short Read Archive, and this represents more than half of all the data in the archive

Slide: Laura Clarke, EBI
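The figures on this slide imply some rough bounds, worked through below. This is back-of-envelope arithmetic only: it assumes decimal units (1 PB = 1000 TB), reads "an order of magnitude behind" as a factor of ten, and treats the 1000 Genomes share as roughly 150 TB even though the slide gives that only as a lower bound.

```python
# Figures taken from the slide.
LHC_PB_PER_YEAR = 15        # Large Hadron Collider output per year
THOUSAND_GENOMES_TB = 150   # stated as "more than 150 terabytes"

# Assumptions made for this sketch (not stated on the slide).
TB_PER_PB = 1000            # decimal units
ORDER_OF_MAGNITUDE = 10     # "an order of magnitude behind" read as 10x

# Implied world-wide sequencing output, in PB and TB per year.
sequencing_pb_per_year = LHC_PB_PER_YEAR / ORDER_OF_MAGNITUDE
sequencing_tb_per_year = sequencing_pb_per_year * TB_PER_PB

# If about 150 TB is more than half of the archive, the archive holds
# less than twice that amount.
archive_upper_bound_tb = 2 * THOUSAND_GENOMES_TB

print(f"implied sequencing output: ~{sequencing_pb_per_year} PB/year")
print(f"archive total: under ~{archive_upper_bound_tb} TB")
```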

Page 14: Describing Bioinformatic Metadata at EBI James Malone

Summary

• EBI uses a combination of metadata strategies
• Minimal Information useful for reporting standards
• Ontologies provide a powerful method for describing domain knowledge
• Ontologies also allow community consensus to be built as well as strategies for data integration
• ELIXIR suggests:
  • Infrastructures should be WS compatible
  • Annotation tools required
  • Pan-domain coordination is essential

Page 15: Describing Bioinformatic Metadata at EBI James Malone

Developing an Ontology from the Application [email protected]

Acknowledgements

• Ontology creation:
  • James Malone, Tomasz Adamusiak, Ele Holloway, Helen Parkinson, Jie Zheng (U Penn)
• Atlas GUI Development:
  • Misha Kapushesky, Pasha Kurnosov, Anna Zhukova, Nikolay Kolesnikov
• External Review and anatomy:
  • Jonathan Bard, Jie Zheng
• ArrayExpress Production Staff
• EBI Rebholz Group (Whatizit text mining tool)
• Many source ontologies for terms and definitions esp. Disease Ontology, Cell Type Ontology, FMA, NCIT, OBI
• Funders: EC (Gen2Phen, FELICS, MUGEN, EMERALD, ENGAGE, SLING), EMBL, NIH
• Eric Neumann, Joanne Luciano and Alan Ruttenberg
• HCLS Group - Eric Prud'hommeaux and Scott Marshall