bioshare: opal and mica: a software suite for data harmonization and federation - vincent ferretti -...
Post on 16-Apr-2017
A SOFTWARE SUITE FOR DATA HARMONIZATION AND FEDERATION
Vincent Ferretti
Ontario Institute for Cancer Research
The Maelstrom Research Software Suite
Software development started in 2007
$3,800,000 CAD of investment so far
• Onyx: Collection
• Opal: Storage, Management, Harmonization
• Mica: Publication
• DataSHIELD: Analysis
Some User Stories
Name | Type | Activities | Tools
The Canadian Longitudinal Study on Aging (CLSA) | Single study, 50,000 participants | Collection, management, portal | Onyx, Opal, Mica
The Canadian Partnership for Tomorrow Project (CPTP) | Study consortium, 5 studies, 300,000 participants | Collection, harmonization, portal | Onyx, Opal, Mica
BBMRI-LPC | Network, >30 studies | Cataloguing | Mica
Maelstrom Research | Research project | Cataloguing, harmonization | Opal, Mica
Interconnect | Network | Cataloguing, (harmonization, federated data analysis) | Mica
BioSHaRE | Network | Cataloguing, harmonization, federated data analysis | Opal, Mica, DataSHIELD
1 - Data Harmonization with Opal
The Canadian Partnership for Tomorrow Project (CPTP)
5 cohorts with baseline data on ~300,000 participants
• 5 different legislations, questionnaires, data access policies, languages, etc.
Project's objectives
• To create harmonized datasets across the 5 cohorts
• To create a data portal to browse harmonized datasets and request access to them
Phase 1
The baseline Health and Risk Factor questionnaire (CoreQx)
• 716 harmonized variables
Opal Software
A database application for integrating and storing data from multiple heterogeneous sources
• Used by studies to create central data repositories
Metadata in Opal
• Projects -> tables -> variables
• Tables are defined by customizable dictionaries in Excel format
• Variables are annotated with an arbitrary number of attributes
Controlled vocabularies (taxonomies), e.g. ICD-10
Maelstrom Research variable classification
• More than 130 terms in 17 classes (e.g. Reproduction, Physical Measures)
Variable Name | Attribute Name | Attribute Value
Cancer_type | Diseases | Neoplasm
Asthma_ever | Diseases | Respiratory system (J00-J99)
Ever_smoke | Question label | [EN] Have you ever smoked? [FR] Avez-vous déjà fumé?
Ever_smoke | Health behaviors | Tobacco
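The projects -> tables -> variables hierarchy, with variables carrying free-form attributes, can be pictured with a small data model. What follows is only an illustrative Python sketch, not Opal's actual schema or API: the class names, the `find_by_attribute` helper, and the "demo" and "baseline" names are all assumptions made for the example. Only the three variables and their annotations come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """A variable with free-form annotations (attribute name -> value)."""
    name: str
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class Table:
    """A table groups the variables defined by one data dictionary."""
    name: str
    variables: list[Variable] = field(default_factory=list)

    def find_by_attribute(self, attr_name: str, attr_value: str) -> list[str]:
        """Return names of variables annotated with the given attribute value."""
        return [v.name for v in self.variables
                if v.attributes.get(attr_name) == attr_value]


@dataclass
class Project:
    """A project groups tables."""
    name: str
    tables: list[Table] = field(default_factory=list)


# Hypothetical table holding the three example variables from the slide.
baseline = Table("baseline", [
    Variable("Cancer_type", {"Diseases": "Neoplasm"}),
    Variable("Asthma_ever", {"Diseases": "Respiratory system (J00-J99)"}),
    Variable("Ever_smoke", {
        "Question label [EN]": "Have you ever smoked?",
        "Question label [FR]": "Avez-vous déjà fumé?",
        "Health behaviors": "Tobacco",
    }),
])
project = Project("demo", [baseline])

# Classification terms make variables searchable by topic.
smoking_vars = baseline.find_by_attribute("Health behaviors", "Tobacco")
```

Because annotations are plain name/value pairs, a classification term such as "Tobacco" can be attached to variables from different studies and then used to find comparable items.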
Data Derivation
Opal derives new variables by executing custom JavaScript code
• Useful for data validation, curation and harmonisation
• User-friendly interfaces for recoding variables
• JavaScript API for more advanced derivation
• JavaScript code is executed by Opal when needed
• Derived data is not persisted: it is exposed through views (virtual tables)
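The idea of a view whose derived values are computed on demand rather than stored can be sketched as follows. This is a Python illustration of the concept only; Opal itself runs JavaScript derivation scripts, and none of the names here (`View`, `derive_ever_smoke`, `smoke_status`, the "Y"/"N" coding) come from Opal or from the CPTP dictionaries. They are hypothetical.

```python
# Hypothetical source-specific coding -> harmonized coding for a smoking item.
RECODE_EVER_SMOKE = {"Y": 1, "N": 0}


def derive_ever_smoke(row):
    """Derivation script: map a study-specific answer to the harmonized code.

    Unmapped or missing answers become None rather than a guessed value.
    """
    return RECODE_EVER_SMOKE.get(row.get("smoke_status"))


class View:
    """A virtual table: stores derivation functions, not derived values."""

    def __init__(self, source_rows, derivations):
        self._source_rows = source_rows      # reference to the stored data
        self._derivations = derivations      # harmonized name -> function

    def rows(self):
        # Values are computed each time the view is read.
        for row in self._source_rows:
            yield {name: fn(row) for name, fn in self._derivations.items()}


source = [{"smoke_status": "Y"}, {"smoke_status": "N"}, {"smoke_status": "?"}]
view = View(source, {"Ever_smoke": derive_ever_smoke})
harmonized = list(view.rows())

# A correction in the source shows up on the next read; nothing was cached.
source[2]["smoke_status"] = "N"
refreshed = list(view.rows())
```

Keeping only the script and recomputing on read means a corrected source value or an updated recoding rule takes effect without re-exporting a harmonized dataset.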
Deriving the CoreQx datasets with Opal
How to query and access these harmonized datasets?
The Mica Software
Software to create web data portals for individual studies or for study consortia
Study catalogue
• MR standard description of longitudinal studies
• Publication workflow
Datasets
• Data dictionaries, data harmonization
• Database federation
Data Access
• Online forms, request management workflow with roles
Mica2: new client-server architecture
• Mica Server
• Opal Server
• Data persistence: MongoDB
The CPTP Data Portal
Study Catalogue
Querying Opal Servers for Metadata and Aggregated Data
Dictionary Faceted Search
Variable Page
Real time summary statistics
Harmonization Result
Data Access Requests
Researcher account registration
Customized application form
Application review workflow
Email notifications
Multi-language support
2 - Advanced Cataloguing with Mica
Maelstrom-research.org
The Maelstrom Research web site is powered by Mica
Includes a catalogue of international networks and studies with annotated dictionaries
Current version
• 6 networks
• 129 studies
• 222 datasets
• 182,622 variables
Search Harmonisation Potential
Multi-dimensional Search Tool
3 - Data Analysis
The BioSHaRE Healthy Obese Project
• 10 studies from 7 European countries
• 200,000 subjects
• The HOP dataset: 103 harmonized variables
How to analyze these datasets
» without pooling data
» without accessing individual-level data?
A Federated Approach
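One way to picture analysis without pooling is that each study reduces its own records to aggregate statistics, and only those aggregates travel to the analyst. The Python sketch below illustrates that general idea under stated assumptions. It is not DataSHIELD's protocol or API; the function names and the BMI values are hypothetical, and it omits any disclosure control.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Aggregate statistics a study is willing to release."""
    n: int
    total: float
    total_sq: float


def summarize(values):
    """Runs at the study site: reduce individual values to aggregates."""
    return Summary(len(values), sum(values), sum(v * v for v in values))


def combine(summaries):
    """Runs at the analysis client: pooled mean and sample variance."""
    n = sum(s.n for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


# Hypothetical BMI values held separately by two studies.
study_a = [22.0, 24.0, 26.0]
study_b = [28.0, 30.0]

# Only the Summary objects leave the studies, never the raw values.
mean, variance = combine([summarize(study_a), summarize(study_b)])
```

The combined mean and variance equal what a pooled analysis of all five values would give, even though the client never sees an individual record.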
Real Time Cross Tabulation on Harmonized Data
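A cross tabulation over harmonized variables lends itself to the same aggregate-only pattern, since cell counts from separate studies can simply be added. The following Python sketch assumes each study returns its own contingency counts. The function names and sample records are hypothetical and do not describe how Mica or Opal implement this feature.

```python
from collections import Counter


def local_crosstab(records, row_var, col_var):
    """Runs at the study site: count records per (row, column) category."""
    return Counter((r[row_var], r[col_var]) for r in records)


def pooled_crosstab(tables):
    """Runs at the portal: add up the per-study cell counts."""
    pooled = Counter()
    for table in tables:
        pooled.update(table)
    return pooled


# Hypothetical harmonized records held by two studies.
study_a = [
    {"sex": "F", "Ever_smoke": 1},
    {"sex": "F", "Ever_smoke": 0},
    {"sex": "M", "Ever_smoke": 1},
]
study_b = [
    {"sex": "F", "Ever_smoke": 1},
    {"sex": "M", "Ever_smoke": 0},
]

pooled = pooled_crosstab([
    local_crosstab(study_a, "sex", "Ever_smoke"),
    local_crosstab(study_b, "sex", "Ever_smoke"),
])
```

This only works because both studies expose the same harmonized variable names and category codes, which is what the harmonization step provides.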
New Improved Version
Real Time Advanced Queries on Harmonized Data
More Advanced Analyses with R
RStudio Web Console
rstudio.bioshare.eu
More Information
www.maelstrom-research.org
www.obiba.org
Code available at github.com/obiba
Let us know and acknowledge Maelstrom Research if you are using our software; it is important for our funding and our ability to provide support.
Acknowledgement
Yannick Marcon and his software developer team
The Maelstrom Research scientific team
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n°261433 (Biobank Standardisation and Harmonisation for Research Excellence in the European Union - BioSHaRE-EU)