KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
Institute for Data Processing and Electronics (IPE)
www.kit.edu
NORDR
The Links between Research Data, Scientific Analysis
Workflows, Provenance and Metadata: A Researcher's
Perspective on RDA
Ajinkya Prabhune
Introduction: Nanoscopy
Nanoscopy Research
• Investigation of “aggressive B-cell lymphomas”
• Microscopy technique – Spectral Precision Distance Microscopy (SPDM)
• Novel imaging method producing datasets in the range of 150-200 TB
Community Requirements
• Community-specific data-processing algorithms
• Manage the continuously evolving scientific workflows
• Allow experiment reproducibility
• Automated provenance information management
Ajinkya Prabhune – The links between Research Data, Scientific Analysis
Workflows, Provenance and Metadata: A Researchers perspective on RDA
02.12.2015
Data Repository / Workflow / Provenance / Metadata
• Data repository for long-term storage of and access to scientific datasets
• Integrate scientific workflows and their associated provenance information
into the repository system
• Capture workflow description and execution details for
- enabling research reproducibility
- tracking workflow evolution
- assessing data quality
[Figure: data stages from raw data through intermediate results to scientific interpretations]
Generic Provenance Metadata Model (GPDM)
• Enable modelling of prospective &
retrospective provenance information
• Flexible: can be modelled as per the needs
of the community
• Interoperable: Automated conversion into
OPM/PROV model
• Extensible: Easy to integrate vocabularies
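The distinction between prospective and retrospective provenance, and what a conversion to PROV might involve, can be sketched in a few lines of Python. The record types `PlannedStep` and `ExecutedStep`, the helper `to_prov_statements` and the step and data names are all hypothetical; the slides do not show the actual GPDM schema. Only the relation names `used` and `wasGeneratedBy` come from the W3C PROV data model, and the output is PROV-N-style text rather than a validated PROV document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlannedStep:
    """Prospective provenance: what the workflow definition says should run."""
    name: str
    inputs: List[str]
    outputs: List[str]


@dataclass
class ExecutedStep:
    """Retrospective provenance: what actually ran (hypothetical fields)."""
    step: PlannedStep
    started: str  # ISO 8601 timestamp
    ended: str    # ISO 8601 timestamp


def to_prov_statements(run: ExecutedStep) -> List[str]:
    """Render one executed step as PROV-N-style statements."""
    activity = f"ex:{run.step.name}"
    statements = [f"activity({activity}, {run.started}, {run.ended})"]
    for item in run.step.inputs:
        statements.append(f"used({activity}, ex:{item}, -)")
    for item in run.step.outputs:
        statements.append(f"wasGeneratedBy(ex:{item}, {activity}, -)")
    return statements


# Invented example step; the names are placeholders, not taken from the slides.
plan = PlannedStep("localisation", inputs=["raw_images"], outputs=["positions"])
run = ExecutedStep(plan, "2015-12-02T10:00:00", "2015-12-02T10:05:00")
for line in to_prov_statements(run):
    print(line)
```

Keeping the planned step as a field of the executed step is one simple way to link the two kinds of provenance; a community-specific model could attach further fields to either record.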
Automated Metadata Management in Scientific
Repository Systems
Enable Scientific Data Reproducibility
• PROVENANCEGEN: automatic generation of provenance graphs
• Metadata modelling services integrated with metadata model registry
• Building links between data, provenance and metadata
• Digital Object (DO) available in a data repository
[Figure: workflow definition, metadata and data (raw, intermediate results, scientific interpretations), with digital objects DO1 to DO4 and elements labelled R1 to R3; PROVENANCEGEN produces the provenance]
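Generating a provenance graph automatically from workflow executions can be illustrated with a small Python sketch. This is not the PROVENANCEGEN algorithm, which the slides do not describe. It assumes that each execution is logged as a hypothetical (inputs, activity, outputs) triple, derives "was derived from" links between digital objects, and then walks those links to recover the lineage of a result. The identifiers DO1 to DO4 echo the figure; the linear chain and the activity names are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical execution log entry: (input objects, activity name, output objects).
Execution = Tuple[List[str], str, List[str]]


def build_derivations(log: Iterable[Execution]) -> Dict[str, Set[str]]:
    """Map each output object to the objects it was directly derived from."""
    derived_from: Dict[str, Set[str]] = defaultdict(set)
    for inputs, _activity, outputs in log:
        for out in outputs:
            derived_from[out].update(inputs)
    return derived_from


def lineage(obj: str, derived_from: Dict[str, Set[str]]) -> Set[str]:
    """Return every object that `obj` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = [obj]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Assumed linear chain of three executions; activity names are invented.
log: List[Execution] = [
    (["DO1"], "step_a", ["DO2"]),
    (["DO2"], "step_b", ["DO3"]),
    (["DO3"], "step_c", ["DO4"]),
]
graph = build_derivations(log)
print(sorted(lineage("DO4", graph)))
```

A lineage query of this kind is what makes a result traceable back to the raw data it came from, which is the link between data and provenance that the slide refers to.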
Metadata WGs & IGs
Metadata Standards
• Metadata directory for storing and accessing various metadata standards
• YAML-based template for submitting a metadata standard
• Well-documented list of tools for handling the metadata standards
• Provide an API for adding, searching and retrieving metadata standards
• Generic metadata template and metadata principles document available
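To make the idea of a directory with an add, search and retrieve API more concrete, here is a small in-memory sketch in Python. It is not the API of the RDA metadata directory: the class `StandardsDirectory`, its method names and the record fields are assumptions made purely for illustration. A real submission would arrive as a YAML file; a plain dictionary stands in for the parsed template so that the example needs only the standard library.

```python
from typing import Dict, List


class StandardsDirectory:
    """Hypothetical in-memory directory of metadata standard records."""

    def __init__(self) -> None:
        self._records: Dict[str, dict] = {}

    def add(self, record: dict) -> str:
        """Store a record and return its slug; the 'title' field is required."""
        slug = record["title"].lower().replace(" ", "-")
        if slug in self._records:
            raise ValueError(f"standard already registered: {slug}")
        self._records[slug] = record
        return slug

    def retrieve(self, slug: str) -> dict:
        """Return the record stored under `slug` (raises KeyError if unknown)."""
        return self._records[slug]

    def search(self, keyword: str) -> List[str]:
        """Return slugs whose title or subject list mentions `keyword`."""
        needle = keyword.lower()
        hits = []
        for slug, record in self._records.items():
            haystack = [record["title"], *record.get("subjects", [])]
            if any(needle in text.lower() for text in haystack):
                hits.append(slug)
        return sorted(hits)


# Example records; the 'subjects' field is an assumed part of the template.
directory = StandardsDirectory()
directory.add({"title": "Dublin Core", "subjects": ["general research data"]})
directory.add({"title": "OME-XML", "subjects": ["microscopy", "bioimaging"]})
print(directory.search("microscopy"))
```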
Research Data Provenance
Research Data Provenance IG
• Focus on comparison and evaluation of data provenance models
(OPM/PROV)
• Provide a recommendation on the provenance model
• Liaison with the Data Citation, Data Foundation & Terminology, and
Metadata Standards groups
Repositories WGs and IGs
Repository Platforms for Research Data IG
• Collect and analyse research data use cases in the context of repository platforms
• Matrix relating use cases with functional requirements as a deliverable
• Propose a specification for a generic API in future
New BOF group spawned
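A matrix relating use cases to functional requirements, as mentioned in the list above, can be represented very simply. The Python sketch below is illustrative only: the use case and requirement names are invented placeholders, not entries from the IG's deliverable, and `build_matrix` and `demand` are hypothetical helper names.

```python
from typing import Dict, List, Set, Tuple

# Placeholder data: the functional requirements each use case needs.
use_cases: Dict[str, Set[str]] = {
    "use_case_1": {"bulk_upload", "pid_assignment"},
    "use_case_2": {"pid_assignment", "metadata_search"},
    "use_case_3": {"metadata_search"},
}


def build_matrix(
    cases: Dict[str, Set[str]]
) -> Tuple[List[str], Dict[str, List[bool]]]:
    """Return the sorted requirement columns and one boolean row per use case."""
    requirements = sorted(set().union(*cases.values()))
    matrix = {
        name: [req in needed for req in requirements]
        for name, needed in cases.items()
    }
    return requirements, matrix


def demand(
    requirements: List[str], matrix: Dict[str, List[bool]]
) -> Dict[str, int]:
    """Count how many use cases need each requirement."""
    return {
        req: sum(row[i] for row in matrix.values())
        for i, req in enumerate(requirements)
    }


requirements, matrix = build_matrix(use_cases)
print(requirements)
for name, row in matrix.items():
    print(name, ["x" if cell else "." for cell in row])
print(demand(requirements, matrix))
```

Column totals of this kind show which requirements are shared by many use cases, which is one way such a matrix could inform the scope of a generic API.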
Domain Repositories IG
• Aim to bring together active data repositories serving scientific communities
• Provide a forum for sharing practical experience and developing joint projects
Conclusion
Nanoscopy Data Repository System available for the scientific community
• Capable of managing extremely large datasets
• Metadata management integrated in the repository
• Automated provenance information management enabled via the
PROVENANCEGEN algorithm and GPDM
• Involvement with various WGs and IGs is an additional benefit for my
research
Extra slides
Aligning Nanoscopy Repository System with RDA
[Figure: components of the Nanoscopy Open Reference Data Repository (NORDR)
Metadata: Metadata Extraction, Metadata Modelling, Metadata Processing, Metadata Preservation
Services: Scientific Workflow, Intelligent Search, Annotation Service, Data Publication Service
Data: Data Preservation, Data Curation, Data Analysis, Data Processing
Other labels: Interactive Web Portal, Knowledge Representation, Data Validation, Data Collection, Anonymization, Data Ingest, User Interface, Data Archive, Data Processing]
Data Fabric IG
• Data Management
• Data Preservation
• Data Analysis
• Data Curation/Processing
• Hardware-Infrastructure
• Reference Data Collection
Metadata SD/C WG and Metadata IG
• Metadata Model
• Metadata Management
• Metadata Store
Research Data Provenance IG
• Data Provenance Model
Repositories WGs and IGs
• Comprehensive coverage of
functional requirements
• Generic API for interoperability
Automated Metadata Management in Scientific
Repository Systems
Automated Metadata Management
• PROVENANCEGEN algorithm for
automatic generation of provenance
graphs
• Metadata modelling services
integrated with metadata model
registry
• Integrated PID system
Data Fabric IG
The Data Fabric IG aims to design a flexible and dynamic ecosystem of components,
services, tools and infrastructure that enables efficient, cost-effective and
reproducible research.
• The Data Fabric IG is the umbrella group and works together with other WGs and IGs
• Use cases submitted by various communities
Research Area                 Relation with WGs/IGs
PID assignment                PID Information Types
Scientific data repositories  Repository Platforms for Research Data, Domain Repositories, Research Data Repository Interoperability
Metadata management           Metadata Standards Directory, Metadata Standards Catalog and Metadata IG
Provenance data management    Research Data Provenance
Data management policies      Practical Policies
Introduction: RDA
WGs and IGs to support the complete
research data lifecycle
• Metadata WGs and IGs
• Repository IGs
• Research Data Provenance IG
• Data Fabric IG
• And more…
Scientific Workflows / Provenance / Metadata
Typical scientific workflow execution
[Figure: researcher, workflow definition, metadata and data (raw, intermediate results, scientific interpretations)]
Required:
Raw data repository
Workflow description
Provenance information
Consolidated Motivation
Enabling efficient management of the scientific research (meta)data lifecycle
from the perspective of the scientific community
• Comprehensive scientific data repository system
• Extensible architecture for integrating dynamic requirements
• Seamless integration of complex scientific workflows + associated provenance
data management
Active involvement with RDA
• Firsthand feedback from domain experts
• Dedicated groups focusing on specific topics (useable/adaptable outcomes)
• Regular discussion and updates via teleconferences
Scientific Workflows / Provenance / Metadata
Typical scientific workflow execution
• Raw dataset ingested and available in the data repository
[Figure: data stages from raw data through intermediate results to scientific interpretations]