bioshare: opal and mica: a software suite for data harmonization and federation - vincent ferretti -...

Post on 16-Apr-2017


A SOFTWARE SUITE FOR DATA HARMONIZATION AND FEDERATION

Vincent Ferretti
Ontario Institute for Cancer Research

The Maelstrom Research Software Suite

Software development started in 2007
$3,800,000 CAD of investment so far

Onyx: Collection
Opal: Storage, Management, Harmonization
Mica: Publication
DataSHIELD: Analysis

Some User Stories

Name | Type | Activities | Tools
The Canadian Longitudinal Study on Aging (CLSA) | Single study, 50,000 participants | Collection, management, portal | Onyx, Opal, Mica
The Canadian Partnership for Tomorrow Project (CPTP) | Study consortium, 5 studies, 300,000 participants | Collection, harmonization, portal | Onyx, Opal, Mica
BBMRI-LPC | Network, >30 studies | Cataloguing | Mica
Maelstrom Research | Research project | Cataloguing, harmonization | Opal, Mica
Interconnect | Network | Cataloguing, (harmonization, federated data analysis) | Mica
BioSHaRE | Network | Cataloguing, harmonization, federated data analysis | Opal, Mica, DataSHIELD

1 - Data Harmonization with Opal

The Canadian Partnership for Tomorrow Project (CPTP)

5 cohorts with baseline data on ~300,000 participants
• 5 different legislations, questionnaires, data access policies, languages, etc.

Project’s objectives
• To create harmonized datasets across the 5 cohorts
• To create a data portal to browse harmonized datasets and request access to them

Phase 1
The baseline Health and Risk Factor questionnaire (CoreQx)
• 716 harmonized variables

Opal Software

A database application for integrating and storing data from multiple and heterogeneous sources
• Used by studies to create central data repositories

Metadata in Opal
• Projects -> tables -> variables
• Tables are defined by customizable dictionaries in Excel format
• Variables are annotated with an arbitrary number of attributes
• Controlled vocabularies (taxonomies), e.g. ICD-10
• Maelstrom Research variable classification: more than 130 terms in 17 classes (e.g. Reproduction, Physical Measures)

Variable Name | Attribute Name | Attribute Value
Cancer_type | Diseases | Neoplasm
Asthma_ever | Diseases | Respiratory system (J00-J99)
Ever_smoke | Question label | [EN] Have you ever smoked? [FR] Avez-vous déjà fumé?
Ever_smoke | Health behaviors | Tobacco
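
The annotation rows above can be read as (variable, attribute, value) triples. The following is a minimal Python sketch of how such triples might support lookup by attribute and simple facet counts. It is an illustration only and does not use Opal's storage model or API; the names `ANNOTATIONS`, `variables_with` and `facet_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical in-memory copy of the annotation rows shown on the slide:
# (variable name, attribute name, attribute value).
ANNOTATIONS = [
    ("Cancer_type", "Diseases", "Neoplasm"),
    ("Asthma_ever", "Diseases", "Respiratory system (J00-J99)"),
    ("Ever_smoke", "Question label", "[EN] Have you ever smoked?"),
    ("Ever_smoke", "Question label", "[FR] Avez-vous déjà fumé?"),
    ("Ever_smoke", "Health behaviors", "Tobacco"),
]


def variables_with(attribute, value=None):
    """Return names of variables carrying the attribute (and value, if given)."""
    return sorted({
        name
        for name, attr, val in ANNOTATIONS
        if attr == attribute and (value is None or val == value)
    })


def facet_counts():
    """Count distinct variables per attribute name, as a faceted search might."""
    pairs = {(name, attr) for name, attr, _ in ANNOTATIONS}
    return Counter(attr for _, attr in pairs)


print(variables_with("Diseases"))
print(facet_counts())
```

Because a variable can carry any number of attributes, the same variable (here `Ever_smoke`) can appear under several facets at once.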

Data Derivation

Opal derives new variables by executing custom JavaScript code
• Useful for data validation, curation and harmonisation
• User-friendly interfaces for recoding variables
• JavaScript API for more advanced derivation
• JavaScript code executed by Opal when needed
• Derived data is not persisted: views, or virtual tables
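
The idea of a derived variable that is computed on read rather than stored can be sketched as follows. Opal itself runs JavaScript for derivation; this Python sketch only illustrates the logic and does not reproduce Opal's JavaScript API. The study names, source codes, `RECODE_RULES` and `DerivedView` are all hypothetical.

```python
# Hypothetical study-specific codes for "ever smoked", mapped to a
# harmonized 0/1 variable. None marks a value that cannot be harmonized.
RECODE_RULES = {
    "study_a": {"Y": 1, "N": 0},
    "study_b": {1: 1, 2: 0, 9: None},
}


class DerivedView:
    """Virtual table: derived values are computed on read, never stored."""

    def __init__(self, study, source_rows, source_variable):
        self.study = study
        self.source_rows = source_rows  # reference to the source data, not a copy
        self.source_variable = source_variable

    def ever_smoke(self):
        rules = RECODE_RULES[self.study]
        for row in self.source_rows:
            # Codes missing from the rules map to None rather than raising.
            yield row["id"], rules.get(row[self.source_variable])


rows_b = [{"id": "p1", "SMK": 1}, {"id": "p2", "SMK": 2}, {"id": "p3", "SMK": 9}]
view = DerivedView("study_b", rows_b, "SMK")
print(list(view.ever_smoke()))

# Nothing is persisted, so a change in the source shows up on the next read.
rows_b[1]["SMK"] = 1
print(list(view.ever_smoke()))
```

Keeping the recoding rules separate from the data is one way to let each study keep its own coding while still exposing a common harmonized variable.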

Deriving the CoreQx datasets with Opal


How to query and access these harmonized datasets?

The Mica Software

Software to create web data portals for individual studies or for study consortia

Study catalogue
• MR standard description of longitudinal studies
• Publication workflow

Datasets
• Data dictionaries, data harmonization, database federation

Data Access
• Online forms, request management workflow with roles

Mica2: New client-server architecture

Components: Mica Server, Opal Server, Data Persistence (MongoDB)

The CPTP Data Portal

Study Catalogue

Querying Opal Servers for Metadata and Aggregated Data

Dictionary Faceted Search

Variable Page

Real time summary statistics

Harmonization Result

Data Access Requests

• Researcher account registration
• Customized application form
• Application review workflow
• Email notifications
• Multi-language support

2 - Advanced Cataloguing with Mica

Maelstrom-research.org

The Maelstrom Research web site is powered by Mica. It includes a catalogue of international networks and studies with annotated dictionaries.

Current version
• 6 networks
• 129 studies
• 222 datasets
• 182,622 variables

Search Harmonisation Potential

Multi-dimensional Search Tool

3 - Data Analysis

The BioSHaRE Healthy Obese Project
• 10 studies from 7 European countries
• 200,000 subjects
• The HOP dataset: 103 harmonized variables

How to analyze these datasets
» without pooling data
» without accessing individual-level data?

A Federated Approach
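
One way to picture a federated analysis is that each study computes a few aggregates locally and only those aggregates are combined centrally. The Python sketch below illustrates this for a pooled mean and sample variance. It is a simplified illustration under that assumption, not DataSHIELD's API or its disclosure controls; `StudySummary`, `summarize`, `pooled_mean_and_variance` and the values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StudySummary:
    """Aggregates a study releases instead of individual-level records."""
    n: int           # number of participants with a value
    total: float     # sum of the values
    total_sq: float  # sum of the squared values


def summarize(values):
    """Runs inside each study; only the three aggregates leave the site."""
    return StudySummary(len(values), sum(values), sum(v * v for v in values))


def pooled_mean_and_variance(summaries):
    """Combine per-study aggregates centrally, without pooling raw data."""
    n = sum(s.n for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)  # sample variance
    return mean, variance


# Hypothetical values of one harmonized variable held at two separate sites.
site_1 = [22.0, 31.0, 27.0]
site_2 = [24.0, 26.0]
mean, variance = pooled_mean_and_variance([summarize(site_1), summarize(site_2)])
print(mean, variance)
```

The result matches what would be obtained on the pooled records, yet the central step never sees an individual value.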

Real Time Cross Tabulation on Harmonized Data
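
A cross tabulation over several studies can be assembled from per-study count tables, since counts add cell by cell. The Python sketch below shows that idea only; it is not how Mica or Opal implement the feature, and the function names, variable names and records are hypothetical.

```python
from collections import Counter


def local_crosstab(rows, row_var, col_var):
    """Runs at each study: count participants per pair of categories."""
    return Counter((r[row_var], r[col_var]) for r in rows)


def combine(tables):
    """Add per-study count tables cell by cell."""
    combined = Counter()
    for table in tables:
        combined.update(table)
    return combined


# Hypothetical harmonized records held by two studies.
study_1 = [
    {"sex": "F", "ever_smoke": 1},
    {"sex": "M", "ever_smoke": 0},
    {"sex": "F", "ever_smoke": 1},
]
study_2 = [
    {"sex": "F", "ever_smoke": 1},
    {"sex": "M", "ever_smoke": 1},
]

table = combine(local_crosstab(s, "sex", "ever_smoke") for s in (study_1, study_2))
for cell, count in sorted(table.items()):
    print(cell, count)
```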

New Improved Version

Real Time Advanced Queries on Harmonized Data

More Advanced Analyses with R

R Studio Web Console
rstudio.bioshare.eu

More Information

www.maelstrom-research.org
www.obiba.org
Code available at github.com/obiba

If you are using our software, let us know and acknowledge Maelstrom Research; it is important for our funding and our ability to provide support.

Acknowledgement

Yannick Marcon and his software developer team
The Maelstrom Research scientific team

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n°261433 (Biobank Standardisation and Harmonisation for Research Excellence in the European Union - BioSHaRE-EU)
