Dataverse opportunities
dans.knaw.nl DANS is an institute of KNAW and NWO
Opportunities with open source data repository
Vyacheslav Tykhonov Senior Information Scientist (DANS)
Dataverse application manager [email protected]
Utrecht, 22.11.2016
Why Dataverse?
• Open source project developed by IQSS at Harvard University and published on GitHub
• Great product with a very long history (since 2006)
• Very dynamic and experienced development team working in an Agile environment (community call every two weeks)
• Clear vision and understanding of research communities' requirements, public roadmap
• Strong community behind Dataverse that helps to improve the basic functionality and develop it further
• Well-developed architecture with rich APIs that allows application layers to be built around Dataverse
DataverseNL services
• Federated login for Netherlands institutions
• Persistent Identifier Services (DOI and handle)
• Integration with archival systems
• DataTags for data containing sensitive information
• Modern and historical world map visualisations
• Data API and Geo API services for projects with data
• Panel dataset constructor
• Time series plot
• Treemaps
• Pie and chart visualizations
• Descriptive statistics tools
• Data widgets
• Big Data repository
Added value for research communities
The benefits of data sharing can be classified in terms of Metadata and Data access and sharing (Collection tools) and Statistical Analysis and Data Mining (Research tools):
● access to a specific case study, citing and finding data
● access to the universe of data from the Dataverse network, which can organize and display them for browsing and searching
● data filtering: researchers with proper authorization can obtain the subset of data provided by the data collector
● data analysis to run descriptive statistics and graphics, visualization, plotting on historical maps
● Data APIs to export data for further analysis by popular statistical packages (STATA, SPSS, R, IPython Notebook) and advanced data mining tools that will be developed in the future (always up-to-date solution)
User Management and Collaboration
• Dataset restrictions on upload and download
• Groups (collaboration in research groups)
Linking Dataverses and collecting statistics
Linked Dataverses + Linked Datasets (Hierarchy)
Currently, the ability to link a dataverse to another dataverse or a dataset to a dataverse is a superuser-only feature.
Dataset Guestbooks
Guestbooks allow you to collect data about who is downloading the files from your datasets. You can decide to collect account information (username, given name & last name, affiliation, etc.) as well as create custom questions (e.g., What do you plan to use this data for?).
Integrations with other archival systems
• Dataverse is the local repository; the preservation infrastructure is a custom solution (EASY, Archivematica, Fedora)
• Dataverse serves as the front end for users
• The archival system ingests datasets and metadata via the SWORD protocol and preserves them in storage after the research is finished
Sharing Privacy Sensitive Data
DataTags: Harvard University Privacy Tools Project
Source: Merce Crosas, The DataTags System: Sharing Sensitive Data with Confidence
Data Citations
• Generated by Dataverse automatically
• Persistent identifiers (DOI, handle) resolve to dataset landing pages
• A persistent identifier applies to the dataset, not to individual files (and to all versions of files)
• Provenance data will be included in Dataverse 4.7
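To illustrate how a persistent identifier maps to a landing page and how a citation string could be assembled from dataset metadata, here is a minimal sketch. The function names, the field order of the citation, and the identifier `doi:10.5072/FK2/EXAMPLE` are assumptions made for illustration; they are not taken from Dataverse itself. Only the public resolver addresses (doi.org and hdl.handle.net) are real.

```python
# Hypothetical helpers: names and citation layout are illustrative only.

def resolver_url(pid: str) -> str:
    """Turn 'doi:...' or 'hdl:...' into the URL of the public resolver."""
    scheme, _, value = pid.partition(":")
    resolvers = {
        "doi": "https://doi.org/",
        "hdl": "https://hdl.handle.net/",
    }
    if scheme not in resolvers or not value:
        raise ValueError(f"unsupported persistent identifier: {pid!r}")
    return resolvers[scheme] + value


def format_citation(authors, year, title, pid, publisher, version):
    """Assemble a dataset-level citation (one identifier for the whole dataset)."""
    author_part = "; ".join(authors)
    return (
        f'{author_part}, {year}, "{title}", '
        f"{resolver_url(pid)}, {publisher}, V{version}"
    )


if __name__ == "__main__":
    print(format_citation(
        ["Doe, Jane"], 2016, "Example dataset",
        "doi:10.5072/FK2/EXAMPLE",  # placeholder identifier, not a real dataset
        "DataverseNL", 1,
    ))
```

The identifier is attached to the dataset as a whole, so the same resolver URL appears in the citation regardless of which file a reader uses.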
Harvesting in Dataverse
Two ways to get metadata out of Dataverse:
• OAI-PMH protocol
• SWORD (JSON)
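As a rough sketch of the OAI-PMH route, the snippet below builds a `ListRecords` request URL and pulls record identifiers out of a response. The verb, the `metadataPrefix` parameter, the `oai_dc` format and the XML namespace come from the OAI-PMH 2.0 specification; the base URL `https://dataverse.example.org/oai`, the function names, and the sample response are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def record_identifiers(xml_text):
    """Return the <identifier> of every record header in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]


# Hand-written sample response, trimmed to the parts used above.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>doi:10.5072/FK2/EXAMPLE</identifier></header>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    # Hypothetical installation address.
    print(list_records_url("https://dataverse.example.org/oai"))
    print(record_identifiers(SAMPLE_RESPONSE))
```

A harvester would fetch the generated URL and feed the response body to `record_identifiers`; paging with resumption tokens is left out of this sketch.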
SWORD stands for “Simple Web-service Offering Repository Deposit” and is a “profile” of AtomPub (RFC 5023) which is a RESTful API that allows non-Dataverse software to deposit files and metadata into a Dataverse installation. Client libraries are available in Python, Java, R, Ruby, and PHP.
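Because SWORD is a profile of AtomPub, the metadata part of a deposit is an Atom entry. The sketch below builds such an entry with Dublin Core terms using only the Python standard library. The two namespace URIs are the standard Atom and DCMI Terms namespaces; the function name and the choice of three fields are assumptions, and a real installation may require more metadata than shown here.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Use readable prefixes instead of auto-generated ns0/ns1.
ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("dcterms", DCTERMS_NS)


def build_atom_entry(title, creator, description):
    """Return an Atom entry (as a string) carrying Dublin Core metadata."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    for tag, value in (
        ("title", title),
        ("creator", creator),
        ("description", description),
    ):
        child = ET.SubElement(entry, f"{{{DCTERMS_NS}}}{tag}")
        child.text = value
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    # Illustrative values only.
    print(build_atom_entry(
        "Example dataset", "Doe, Jane", "Deposited via a SWORD client",
    ))
```

A client would send this document in an HTTP POST with the `application/atom+xml` content type to the collection URL advertised by the installation's SWORD service document, authenticating as that installation requires.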
Dataverse applications (DANS store)
• Data Processing engine (Data API)
• Historical and modern maps
• Charts, graphs, pies
• Treemaps
• Panel data
• Dataset statistics
• Data quality check
Map visualisation (historical and modern)
• Integrated with the Dataverse repository
• Switch between modern boundaries and historical countries
• Show only years with available data
• Time slider to go back in time
• Export maps to JPEG, PDF, PS in high quality
• Ready for publication in research papers
DataverseNL as Big Data repository
This approach is suitable for product development companies (industry) and organizations and institutions (education and science) looking for sustainable (Big) data archiving services.
A Big Data object in DataverseNL consists of:
• metadata with authorship and citation information
• data usage licence
• persistent DOI or handle
• information on how to obtain a key (API token) to start using the API endpoint(s)
• link to the API endpoint delivering data
• representation of the API (interactive documentation, Swagger)
• data provenance
• controlled vocabularies to meet domain-specific community standards (optional)
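The components listed above can be pictured as a simple record. The following sketch models them as a Python dataclass with a check for missing required parts; the class name, field names, and all values in the usage example are hypothetical and do not reflect an actual DataverseNL schema.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class BigDataObject:
    """Illustrative model of a Big Data object; not a real DataverseNL schema."""
    authors: List[str]             # authorship metadata
    citation: str                  # citation information
    licence: str                   # data usage licence
    persistent_id: str             # DOI or handle
    token_instructions: str        # how to obtain an API token
    endpoint_url: str              # API endpoint delivering the data
    api_docs_url: str              # interactive documentation (e.g. Swagger)
    provenance: str                # data provenance
    vocabularies: List[str] = field(default_factory=list)  # optional

    def missing_fields(self) -> List[str]:
        """Names of required components that are empty."""
        return [
            f.name for f in fields(self)
            if f.name != "vocabularies" and not getattr(self, f.name)
        ]


if __name__ == "__main__":
    # All values below are placeholders.
    obj = BigDataObject(
        authors=["Doe, Jane"],
        citation='Doe, Jane, 2016, "Example API dataset"',
        licence="CC0",
        persistent_id="hdl:12345/EXAMPLE",
        token_instructions="Request a token from the data provider",
        endpoint_url="https://api.example.org/data",
        api_docs_url="https://api.example.org/api-docs",
        provenance="",
    )
    print(obj.missing_fields())
```

Keeping the controlled vocabularies as an optional list mirrors the "(optional)" note in the component list, while every other component is treated as required.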
Data hubs as DataverseNL object (part 1)
Source: http://api.clariah-sdh.eculture.labs.vu.nl/api-docs#/
CLARIAH is a structured data hub (Linked Open Data)
Data hubs as Dataverse objects (part 2, FAIR)
DataverseNL can create a landing page with metadata and a handle for any API endpoint, which will allow CLARIAH and other data hubs to make datasets:
- (F) Findable
- (A) Accessible
- (I) Interoperable
- (R) Re-usable
The FAIR principles are described here: https://www.force11.org/group/fairgroup/fairprinciples