Metadata challenges of reproducible research and re-usable data: BioSharing, ISA and STATO

Metadata challenges of reproducible research and re-usable data

BioSharing, ISA and STATO examples

Alejandra González-Beltrán, PhD

Oxford e-Research Centre, University of Oxford

[email protected] @alegonbel

OpenData & Reproducibility workshop: the Good Scientist in the Open Science era

21st April 2015, British Ecological Society, UK

Upload: alejandra-gonzalez-beltran

Post on 19-Jul-2015



Reproducible & Reusable Bioscience Research

Well-annotated & Structured Data

reasoning • analysis • exchange • integration • visualization • browsing • retrieval

Community Standards • Software Tools


A community mobilization to develop standards, e.g.:

Structural and operational differences:

• organization types (open, closed to members, society, WG etc.)

• standards development (how to formulate, conduct and maintain)

• adoption, uptake, outreach (link to journals, funders and commercial sector)

• funds (sponsors, memberships, grants, volunteering)

de jure: standard organizations

de facto: grass-roots groups

Nanotechnology Working Group

Types of reporting standards


• Minimum information reporting requirements, or checklists, to report the same core, essential information

• Controlled vocabularies, taxonomies, thesauri, ontologies etc., to use the same word and refer to the same ‘thing’

• Conceptual models and conceptual schemas, from which an exchange format is derived, to allow data to flow from one system to another
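The three kinds of reporting standard can be made concrete with a minimal Python sketch. Everything in it is an assumption for illustration: the required fields, the synonym table and the function names are hypothetical and are not taken from any published checklist, ontology or format.

```python
import json

# Hypothetical minimal checklist: fields every record must report.
REQUIRED_FIELDS = {"organism", "sample_type", "collection_date"}

# Hypothetical controlled vocabulary: free-text synonyms mapped to one preferred term.
PREFERRED_TERMS = {
    "rat": "Rattus norvegicus",
    "brown rat": "Rattus norvegicus",
    "rattus norvegicus": "Rattus norvegicus",
}


def missing_fields(record):
    """Checklist: return the required fields that are absent or empty."""
    return sorted(f for f in REQUIRED_FIELDS if not record.get(f))


def normalise_organism(record):
    """Terminology: map a free-text organism name onto the preferred term."""
    raw = record.get("organism", "")
    term = PREFERRED_TERMS.get(raw.strip().lower(), raw)
    return {**record, "organism": term}


def to_exchange_format(record):
    """Format: serialise with a stable key order so another system can parse it."""
    return json.dumps(record, sort_keys=True)


record = {"organism": "Brown rat", "sample_type": "liver"}
print(missing_fields(record))
print(to_exchange_format(normalise_organism(record)))
```

The checklist says what must be reported, the vocabulary says which word to use, and the format says how the result travels between systems; each function covers one of those roles and can be swapped out independently.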

BioSharing: a web-based, curated and searchable registry ensuring that standards and databases are registered, informative and discoverable, and monitoring the development and evolution of standards, their use in databases and the adoption of both in data policies.

Launched Jan 2011

Researchers, developers and curators lack support and guidance on how best to navigate and select content standards, understand their maturity, or find databases that implement them.

Funders, journals and librarians do not have enough information to make informed decisions on which content standards or databases to recommend in policies, fund or implement.

Goal: assist stakeholders to make informed decisions

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

Core functionalities:

• search and filtering, e.g. by funder

• submission forms to add new records

• “claim” functionality for existing records

• person’s profile (as maintainer of records) associated with the ORCID profile (for credit, as incentive)

• visualization and views of content
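As a rough illustration of the search, filter and claim functions listed above, here is a small in-memory Python sketch. It does not use or describe the registry's real interface: the `RegistryRecord` class, its fields, the record names, the funders and the ORCID iD are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RegistryRecord:
    """Hypothetical registry entry; field names are illustrative only."""
    name: str
    kind: str  # "standard" or "database"
    funders: List[str] = field(default_factory=list)
    maintainer_orcid: Optional[str] = None  # set once a record is claimed


def search(records, text="", funder=None):
    """Case-insensitive name search with an optional funder filter."""
    hits = [r for r in records if text.lower() in r.name.lower()]
    if funder is not None:
        hits = [r for r in hits if funder in r.funders]
    return hits


def claim(record, orcid):
    """Attach a maintainer's ORCID iD to a record nobody has claimed yet."""
    if record.maintainer_orcid is not None:
        raise ValueError(f"{record.name} is already claimed")
    record.maintainer_orcid = orcid
    return record


# Placeholder data, not real registry content.
records = [
    RegistryRecord("Example Checklist A", "standard", ["Funder X"]),
    RegistryRecord("Example Database B", "database", ["Funder Y"]),
    RegistryRecord("Example Format C", "standard", ["Funder X", "Funder Y"]),
]

funded_by_x = search(records, funder="Funder X")
print([r.name for r in funded_by_x])
claim(records[0], "0000-0000-0000-0000")  # placeholder ORCID iD
```

Linking a claimed record to a person's identifier is what lets maintainers receive credit for curation, which is the incentive the slide refers to.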

Search, filter, submit, claim, view and more

Curated crowdsourcing approach

Formats & Database Fragmentation


The Investigation/Study/Assay (ISA) infrastructure

• generic format for experimental description and data exchange

• open source software tools

• community engagement

[Diagram: investigation → study → assay(s) → data. Data are external files in native or other formats; assays hold pointers to data file names/location.]

investigation: high-level concept to link related studies

study: the central unit, containing information on the subject under study, its characteristics and any treatments applied. A study has associated assays.

assay: test performed either on material taken from the subject or on the whole initial subject, which produces qualitative or quantitative measurements (data)

• environmental health

• environmental genomics

• metabolomics

• metagenomics

• nanotechnology

• proteomics

• stem cell discovery

• systems biology

• transcriptomics

• toxicogenomics

• communities working to build a library of cellular signatures
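The investigation, study and assay levels described above map naturally onto nested records. The following Python sketch shows one way to model them; it is an assumption-laden illustration, not the ISA software's own API. The class names, field names, titles and file names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Assay:
    """A test on material from the subject; keeps only pointers to data files."""
    measurement: str
    technology: str
    data_files: List[str] = field(default_factory=list)  # file names/locations


@dataclass
class Study:
    """The central unit: subject characteristics, treatments and assays."""
    title: str
    characteristics: Dict[str, str] = field(default_factory=dict)
    factors: Dict[str, str] = field(default_factory=dict)
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """High-level concept linking related studies."""
    title: str
    studies: List[Study] = field(default_factory=list)

    def all_data_files(self):
        """Collect every data file pointer across all studies and assays."""
        return [f for s in self.studies for a in s.assays for f in a.data_files]


# Placeholder content for illustration only.
study = Study(
    title="Example rat study",
    characteristics={"organism": "Rattus norvegicus"},
    factors={"dose": "high"},
    assays=[
        Assay("metabolite profiling", "mass spectrometry", ["run01.raw"]),
        Assay("transcription profiling", "DNA microarray", ["chip01.cel", "chip02.cel"]),
    ],
)
investigation = Investigation("Example investigation", [study])
print(investigation.all_data_files())
```

Keeping only file names or locations in the assay, rather than the measurements themselves, mirrors the slide's point that the data stay in external files in native or other formats.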


The experimental plan

• experimental design

• sample characteristic(s)

• experimental variable(s)

2-week systemic rat study using male Wistar rats (N=15 per dose group)

14 proprietary drug candidates from participating companies and 2 reference toxic compounds

InnoMed PredTox Project

The experimental plan (continued)

• technology(s)

• measurement(s)

• protocol(s)

• data file(s)

• …

http://dx.doi.org/10.5524/100063
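One way to picture how sample characteristics and experimental variables end up in a structured file is a tab-delimited sample table. The Python sketch below uses column headers in the ISA-Tab style (`Characteristics[...]`, `Factor Value[...]`); the rows, compound names and dose levels are illustrative placeholders and are not data from the project or the dataset cited above.

```python
import csv
import io

# Headers qualify each column with its role; rows are placeholders only.
HEADER = [
    "Source Name",
    "Characteristics[organism]",
    "Characteristics[sex]",
    "Factor Value[compound]",
    "Factor Value[dose]",
    "Sample Name",
]

ROWS = [
    ["rat-001", "Rattus norvegicus", "male", "compound-A", "low", "rat-001-liver"],
    ["rat-002", "Rattus norvegicus", "male", "compound-A", "high", "rat-002-liver"],
]


def write_study_table(header, rows):
    """Serialise the sample table as tab-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def factor_columns(header):
    """Return the experimental variables declared in the header."""
    prefix = "Factor Value["
    return [h[len(prefix):-1] for h in header if h.startswith(prefix)]


table = write_study_table(HEADER, ROWS)
print(table)
print(factor_columns(HEADER))
```

Because the role of each column is stated in its header, a tool can recover the experimental variables without knowing anything else about the study.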

investigation

study

http://www.nature.com/search?journal=sdata&q=ecology

http://www.nature.com/articles/sdata201513

http://www.nature.com/articles/sdata20158


http://isa-tools.github.io/stato/

• General-purpose statistics ontology (formal logic-based representation)

• Coverage for processes (e.g. statistical tests and their conditions of application) and for information needed by or resulting from statistical methods (e.g. probability distributions, variables, spread and variation metrics)

• STATO also benefits from: (i) extensive documentation, with textual and formal definitions; (ii) associated R code snippets, using the dedicated R-command metadata tag, to facilitate teaching and learning while relying on the popular R language; (iii) documented query examples, highlighting how the ontology can be harnessed by reviewers, tutors and students alike.
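To show what recording a statistical test together with its conditions of application might look like, here is a minimal Python sketch (Python rather than R, purely for illustration). The dictionary layout, the listed assumptions and the sample values are hypothetical, and the ontology term is a deliberate placeholder, not a real STATO identifier.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def annotated_result(a, b):
    """Bundle the statistic with metadata describing how it was obtained."""
    return {
        "test": "Welch t-test",
        "ontology_term": "STATO:PLACEHOLDER",  # placeholder, not a real STATO ID
        "assumptions": ["independent samples", "approximately normal data"],
        "inputs": {"n_a": len(a), "n_b": len(b)},
        "statistic": welch_t(a, b),
    }


# Illustrative values only.
control = [4.0, 5.0, 6.0]
treated = [7.0, 8.0, 9.0]
result = annotated_result(control, treated)
print(result["test"], round(result["statistic"], 3))
```

In practice the placeholder would be replaced by the identifier that the ontology assigns to the test, so that a reviewer or a query can find every result produced by the same method.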

Developed in collaboration with Dr Burke, Senior Statistician, Nuffield Department of Population Health, University of Oxford


funders

Questions?

You can email us... [email protected]

View our blog: http://isatools.wordpress.com

Follow us on Twitter: @isatools

View our websites

View our Git repo & contribute: http://github.com/ISA-tools

Thanks for your attention!