dataverse opportunities

dans.knaw.nl
DANS is an institute of KNAW and NWO

Opportunities with open source data repository

Vyacheslav Tykhonov, Senior Information Scientist (DANS)
Dataverse application manager
[email protected]

Utrecht, 22.11.2016


Post on 09-Jan-2017

Why Dataverse?

• Open source project developed by IQSS at Harvard University and published on GitHub
• Great product with a very long history (since 2006)
• Very dynamic and experienced development team working in an Agile environment (community call held every two weeks)
• Clear vision and understanding of research communities' requirements, with a public roadmap
• Strong community behind Dataverse that helps to improve the basic functionality and develop it further
• Well-developed architecture with rich APIs that allows application layers to be built around Dataverse

DataverseNL services

• Federated login for Netherlands institutions
• Persistent Identifier Services (DOI and handle)
• Integration with archival systems
• DataTags for data containing sensitive information
• Modern and historical world map visualisations
• Data API and Geo API services for projects with data
    • Panel dataset constructor
    • Time series plot
    • Treemaps
    • Pie and chart visualizations
• Descriptive statistics tools
• Data widgets
• Big Data repository

Added value for research communities

The benefits of data sharing can be classified in terms of Metadata and Data access and sharing (Collection tools) and Statistical Analysis and Data Mining (Research tools):
● access to a specific case study, citing and finding data
● access to the universe of data from the Dataverse network, which can organize and display them for browsing and searching
● data filtering: researchers with proper authorization can obtain the subset of data provided by the data collector
● data analysis to run descriptive statistics and graphics, visualization, plotting on historical maps
● Data APIs to export data for further analysis by popular statistical packages (Stata, SPSS, R, IPython Notebook) and by advanced data mining tools that will be developed in the future (always up-to-date solution)

User Management and Collaboration

• Dataset restrictions on upload and download
• Groups (collaboration in research groups)

Linking Dataverses and collecting statistics

Linked Dataverses + Linked Datasets (Hierarchy)

Currently, the ability to link a dataverse to another dataverse, or a dataset to a dataverse, is a superuser-only feature.

Dataset Guestbooks

Guestbooks allow you to collect data about who is downloading the files from your datasets. You can decide to collect account information (username, given name & last name, affiliation, etc.) as well as create custom questions (e.g., What do you plan to use this data for?).

Integrations with other archival systems

• Dataverse is the local repository; the preservation infrastructure is a custom solution (EASY, Archivematica, Fedora)
• Dataverse serves as the front-end for users
• The archival system ingests datasets and metadata via the SWORD protocol and preserves them in storage after the research is finished

Bulk upload of datasets from data providers

Sharing Privacy Sensitive Data

DataTags: Harvard University Privacy Tools Project

Source: Merce Crosas, The DataTags System: Sharing Sensitive Data with Confidence

DataTags levels

Source: Merce Crosas, The DataTags System: Sharing Sensitive Data with Confidence

Data Citations

• Generated by Dataverse automatically
• Persistent identifiers (DOI, handle) resolve to dataset landing pages
• A persistent identifier applies to the dataset, not to individual files (and to all versions of files)
• Provenance data will be included in Dataverse 4.7
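
As a rough illustration of the points above, the following Python sketch assembles a citation string and maps a persistent identifier to its resolver URL. The field order in `format_citation`, the function names, and the sample DOI are assumptions made for illustration; they are not taken from the Dataverse code base. The resolver hosts (doi.org and hdl.handle.net) are the standard public resolvers.

```python
def format_citation(authors, year, title, persistent_id, publisher, version):
    """Assemble a dataset citation string (field order is an assumption)."""
    author_part = "; ".join(authors)
    return f'{author_part}, {year}, "{title}", {persistent_id}, {publisher}, {version}'


def resolver_url(persistent_id):
    """Map a 'doi:' or 'hdl:' identifier to the URL of its public resolver."""
    resolvers = {"doi": "https://doi.org/", "hdl": "https://hdl.handle.net/"}
    scheme, _, value = persistent_id.partition(":")
    if scheme not in resolvers or not value:
        raise ValueError(f"unsupported persistent identifier: {persistent_id!r}")
    return resolvers[scheme] + value


# Hypothetical dataset; 10.5072 is a prefix commonly used for test DOIs.
pid = "doi:10.5072/FK2/EXAMPLE"
citation = format_citation(
    ["Doe, Jane", "Roe, Richard"], 2016, "Example Dataset", pid, "DataverseNL", "V1"
)
landing = resolver_url(pid)
print(citation)
print(landing)
```

The identifier is attached to the dataset as a whole, so the same resolver URL serves every file in it.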

Citations in Dataverse

Source: Merce Crosas, Data Citation Implementation at Dataverse

Harvesting in Dataverse

Two ways to get metadata out of Dataverse:
• OAI-PMH protocol
• SWORD (JSON)
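
For the OAI-PMH route, a harvester issues plain HTTP requests with a `verb` parameter. The sketch below only builds such a request URL and makes no network call. The host name, the set name, and the `/oai` path are assumptions for illustration; `verb`, `metadataPrefix`, and `set` are standard OAI-PMH request parameters.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no request is sent)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Restrict harvesting to one OAI set, if the server defines any.
        params["set"] = set_spec
    # The '/oai' path is an assumption about where the installation exposes OAI-PMH.
    return f"{base_url.rstrip('/')}/oai?{urlencode(params)}"


# Hypothetical installation and set name.
url = oai_list_records_url("https://dataverse.example.org", set_spec="my_set")
print(url)
```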

SWORD stands for “Simple Web-service Offering Repository Deposit” and is a “profile” of AtomPub (RFC 5023) which is a RESTful API that allows non-Dataverse software to deposit files and metadata into a Dataverse installation. Client libraries are available in Python, Java, R, Ruby, and PHP.
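
A SWORD deposit is an Atom entry sent to a collection URL. The sketch below builds such an entry with Dublin Core terms and the collection URL for a dataverse, without sending anything. The URL path follows the pattern documented for Dataverse 4 but should be checked against the installation in use; the host, alias, and metadata values are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"


def build_atom_entry(title, creator, description):
    """Serialize a minimal Atom entry carrying Dublin Core metadata."""
    ET.register_namespace("atom", ATOM_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    for tag, value in (("title", title), ("creator", creator), ("description", description)):
        ET.SubElement(entry, f"{{{DCTERMS_NS}}}{tag}").text = value
    return ET.tostring(entry, encoding="unicode")


def sword_collection_url(base_url, dataverse_alias):
    """Collection URL to which the entry would be POSTed (path is an assumption)."""
    return (
        f"{base_url.rstrip('/')}/dvn/api/data-deposit/v1.1/swordv2"
        f"/collection/dataverse/{dataverse_alias}"
    )


# Hypothetical values for illustration only.
entry_xml = build_atom_entry("Example Dataset", "Doe, Jane", "A small test deposit.")
deposit_url = sword_collection_url("https://dataverse.example.org", "my_dataverse")
print(deposit_url)
print(entry_xml)
```

A client would then POST `entry_xml` to `deposit_url` with the content type `application/atom+xml` and its API credentials; that step is left out here so the sketch runs offline.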

Dataverse dashboard for admin (Miniverse)

Files statistics

Interactive overview of datasets

Dataverse applications (DANS store)

• Data Processing engine (Data API)
• Historical and modern maps
• Charts, graphs, pies
• Treemaps
• Panel data
• Datasets statistics
• Data quality check

Map visualisation (historical and modern)

• Integrated with the Dataverse repository
• Switch between modern boundaries and historical countries
• Show only years with available data
• Time slider to go back in time
• Export maps to JPEG, PDF, PS in high quality
• Ready for publication in research papers

Data services: historical maps

Data services: time series plot

Bar charts

Data services: panel data

Treemaps

Descriptive statistics

Data quality check

DataverseNL integration (example)

DataverseNL as Big Data repository

This approach is suitable for product development companies (industry) and organizations and institutions (education and science) looking for sustainable (Big) data archiving services.

A Big Data object in DataverseNL consists of:
• metadata with authorship and citation information
• data usage licence
• persistent DOI or handle
• information on how to obtain a key (API token) to start using the API endpoint(s)
• link to the API endpoint delivering the data
• representation of the API (interactive documentation, Swagger)
• data provenance
• controlled vocabularies to meet domain-specific community standards (optional)
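
The components listed above can be pictured as a simple record. The Python sketch below is only an illustration of that structure: the class name, field names, and all sample values are hypothetical and do not correspond to an actual DataverseNL schema.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class BigDataObject:
    """Illustrative record for a Big Data object (all names are assumptions)."""
    title: str
    authors: List[str]
    licence: str
    persistent_id: str          # DOI or handle
    token_instructions: str     # how to obtain an API token
    api_endpoint: str           # endpoint that delivers the data
    api_docs_url: str           # interactive documentation, e.g. Swagger
    provenance: str
    vocabularies: List[str] = field(default_factory=list)  # optional

    def missing_fields(self):
        """Return the names of required components that are empty."""
        optional = {"vocabularies"}
        return [
            f.name for f in fields(self)
            if f.name not in optional and not getattr(self, f.name)
        ]


# Hypothetical example record.
obj = BigDataObject(
    title="Example data hub",
    authors=["Doe, Jane"],
    licence="CC BY 4.0",
    persistent_id="hdl:10411/EXAMPLE",
    token_instructions="Request a token from the data provider.",
    api_endpoint="https://api.example.org/v1/data",
    api_docs_url="https://api.example.org/api-docs",
    provenance="",
)
print(obj.missing_fields())
```

A check like `missing_fields` is one way to confirm that a landing page has every required component before it is published.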

Data hubs as DataverseNL object (part 1)

Source: http://api.clariah-sdh.eculture.labs.vu.nl/api-docs#/

CLARIAH is a structured data hub (Linked Open Data)

Data hubs as Dataverse objects (part 2, FAIR)

DataverseNL can create a landing page with metadata and a handle for any API endpoint, allowing CLARIAH and other data hubs to make datasets:
- (F) Findable
- (A) Accessible
- (I) Interoperable
- (R) Re-usable

The FAIR principles are described here: https://www.force11.org/group/fairgroup/fairprinciples

Questions?