collaboratively creating a network of ideas, data and software
Anita de Waard, VP Research Data Collaborations
Elsevier, Jericho, VT
Some Thoughts on Collectively Creating Networks of Ideas, Data and Software
How do we unify the needs of the collective and the individual?
“Let us endeavor to build systems that allow a kid in Mali who wants to learn about proteomics to not be overwhelmed by the irrelevant and the untrue.”
- John Perry Barlow, iAnnotate 2014
Collectively create nimble and robust systems of knowledge management that interconnect ideas, data and software.
Automated caption/body text splitting & linking
Precision: 56.3; Recall: 76.0; F-score: 64.7
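As a quick check on the figures above, the F-score is the harmonic mean of precision and recall. The sketch below is illustrative only; the function name `f_score` and the optional `beta` parameter are assumptions and do not come from the slides.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; with beta = 1 this is the harmonic mean of precision and recall."""
    if precision <= 0 or recall <= 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values from the slide, in percent.
print(round(f_score(56.3, 76.0), 1))  # 64.7
```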
Statement type
Connecting Ideas: Big Mechanism
Connecting Ideas: Towards an Elsevier Knowledge Graph
14M articles from Science Direct
3.3M triples
475M triples
49M triples
p x r matrix; p x k, k x r latent factor matrices
~102 triples
920K concepts from EMMeT
• Ongoing proof-of-concept work by Paul Groth, Sujit Pal and Ron Daniel of Elsevier Labs
• Unsupervised, scalable and built with off-the-shelf technologies
• Based on recent work at University College London and University of Massachusetts Amherst
Riedel, Sebastian, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. "Relation extraction with matrix factorization and universal schemas." (2013).
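The p x r matrix and its p x k and k x r latent factors can be illustrated with a toy factorization. This is a minimal sketch under stated assumptions, not the Elsevier Labs pipeline and not the training procedure of the cited paper: it treats every unobserved cell as a negative example and fits a logistic loss with plain stochastic gradient descent. All names (`factorize`, `score`, the toy indices) are hypothetical.

```python
import math
import random


def score(P, R, i, j):
    """Predicted probability that entity pair i stands in relation j."""
    logit = sum(P[i][f] * R[f][j] for f in range(len(R)))
    return 1.0 / (1.0 + math.exp(-logit))


def factorize(observed, n_pairs, n_relations, k=2, epochs=200, lr=0.1, seed=0):
    """Learn P (p x k) and R (k x r) so that sigmoid(P R) is high on observed cells.

    observed: set of (pair_index, relation_index) cells seen in the triples.
    Assumption: all other cells are treated as negatives (a simplification).
    """
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_pairs)]
    R = [[rng.uniform(-0.1, 0.1) for _ in range(n_relations)] for _ in range(k)]
    cells = [(i, j) for i in range(n_pairs) for j in range(n_relations)]
    for _ in range(epochs):
        rng.shuffle(cells)
        for i, j in cells:
            target = 1.0 if (i, j) in observed else 0.0
            err = score(P, R, i, j) - target  # gradient of the logistic loss w.r.t. the logit
            for f in range(k):
                p_if, r_fj = P[i][f], R[f][j]
                P[i][f] -= lr * err * r_fj
                R[f][j] -= lr * err * p_if
    return P, R


# Toy data: 3 entity pairs (rows) x 3 relations (columns).
observed = {(0, 0), (0, 1), (1, 0), (2, 2)}
P, R = factorize(observed, n_pairs=3, n_relations=3)
print(score(P, R, 1, 1))  # score for a cell that was not observed
```

The appeal of this formulation is that every cell gets a score, including pairs and relations that never co-occurred in the extracted triples.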
Connecting Research Data:
https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data
Linking Papers to Data, Phase 1
• Supplementary data at PANGAEA
• Bidirectional links between PANGAEA & ScienceDirect
• Data visualized next to the article
http://www.elsevier.com/databaselinking
Linking Papers to Data, Phase 2
• ICSU/WDS/RDA Publishing Data Service Working group
• Currently creating linked-data model for exposing DOI to DOI links outside publisher’s firewall
• Merged with National Data Service pilot with the same goal
• Collaboration between CrossRef, DataCite, Europe PubMed Central, ANDS, Thomson Reuters, Elsevier
• About to deliver: http://dliservice.research-infrastructures.eu/#/api
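To make the idea of DOI-to-DOI links concrete, here is a minimal sketch of a link record and an index that can be queried from either side. It does not reflect the actual model or API of the service named above; the class names, fields, and the example DOIs are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DoiLink:
    article_doi: str
    dataset_doi: str
    provider: str  # who asserted the link (hypothetical field)


class LinkIndex:
    """Stores each link once and makes it findable from either DOI."""

    def __init__(self):
        self._by_doi = defaultdict(set)

    def add(self, link: DoiLink) -> None:
        self._by_doi[link.article_doi].add(link)
        self._by_doi[link.dataset_doi].add(link)

    def links_for(self, doi: str):
        links = self._by_doi.get(doi, set())
        return sorted(links, key=lambda l: (l.article_doi, l.dataset_doi, l.provider))


index = LinkIndex()
index.add(DoiLink("10.1234/article.1", "10.1234/dataset.9", "publisher"))
index.add(DoiLink("10.1234/article.2", "10.1234/dataset.9", "repository"))

# Query from the dataset side: both articles are found.
for link in index.links_for("10.1234/dataset.9"):
    print(link.article_doi, "<->", link.dataset_doi, "via", link.provider)
```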
Objective: move from a plethora of (mostly) bilateral arrangements between the different players to a one-for-all cross-referencing service for articles and data
[Diagram: Researchers, Funding Agency, Institution, Data Repository, Dataset, Journal, Paper]
Current Systems for Linking Data
1. Researcher creates datasets
2. Researcher writes paper & publishes in journal
3. (Sometimes,) dataset gets posted to repository
4. Researcher reports (post-hoc) to Institution and Funder
Issues with the Current Situation:
i. Too much work for researchers
ii. Data posting not mandatory
iii. No link between data and paper
iv. Funders/Institutions informed as an afterthought
A Proposal To Address These Issues:
1. Researcher creates datasets and posts to repository (under embargo)
2. Funder is automatically notified of dataset publication
3. Researcher writes paper & publishes in journal; embargo is lifted and data linked
   - NB: this also allows release of unused data, for negative results and reproducibility
4. Funder and institution get report on publication and embargo lifting
i. Less Work!
ii. More Data Stored!
iii. Better Linking!
iv. Better Tracking!
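The proposed workflow (deposit under embargo, automatic funder notification, embargo lifted and data linked on publication, then reporting) can be sketched as a small state model. This is only an illustration of the sequence of events; the class names, identifiers, and message texts are hypothetical and do not describe any existing system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    dataset_id: str
    embargoed: bool = True          # step 1: data is posted under embargo
    linked_paper: Optional[str] = None


@dataclass
class Workflow:
    notifications: List[str] = field(default_factory=list)

    def deposit(self, dataset_id: str) -> Dataset:
        """Steps 1 and 2: post the dataset and notify the funder automatically."""
        dataset = Dataset(dataset_id)
        self.notifications.append(
            f"funder: dataset {dataset_id} deposited under embargo"
        )
        return dataset

    def publish_paper(self, dataset: Dataset, paper_id: str) -> None:
        """Steps 3 and 4: lift the embargo, link data to paper, report to both parties."""
        dataset.embargoed = False
        dataset.linked_paper = paper_id
        for recipient in ("funder", "institution"):
            self.notifications.append(
                f"{recipient}: paper {paper_id} published; "
                f"embargo on {dataset.dataset_id} lifted"
            )


workflow = Workflow()
dataset = workflow.deposit("dataset-001")
workflow.publish_paper(dataset, "paper-001")
for message in workflow.notifications:
    print(message)
```

In this sketch the researcher performs only two actions (deposit and publish); notification and linking follow from those events rather than from separate post-hoc reporting.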
One piece of the puzzle: Mendeley Data: https://data.mendeley.com/datasets/xz6gv65m6d/6
Linked to published papers – or not
Linked to Github – or not
Versioning and provenance
Another Piece of the Puzzle: DataSearch: http://datasearchdemo.elsevier.com/indexed#/search/mercury
[Diagram: DataSearch architecture, with stages Data, Enrichment, Search, Rendering, Evaluation]
• Data sources: Federated; Poor API; Rich API; FTP & Index
• Enrichment: Manual; Automated
• Search: (User) Intent; Ranking; Filtering (how to mix federated & indexed, rich & poor)
• Rendering: Search all data; Faceted query/Results refinement; Store & Use results; General UI; Domain UI; Filtering
• Evaluation: Feeding user signals back into Search ranking
How Do We Evaluate Discoverability?
Birds of a Feather on Data Search: https://rd-alliance.org/bof-data-search.html
How do we pay for all this? RDA Cost Recovery WG
• Co-chair with Ingrid Dillo (DANS), Simon Hodson (CODATA)
• Goal: write a report regarding new potential funding models for data repositories, allow them to start sharing this knowledge
• Interviewed 24 repositories on their funding (current and future)
• Now summarising stories and trends – will present at RDAP7
Terms of funding for main income stream (in %)
Software As A First-Class Knowledge Object:
Working with Networks of Partners
Force11:
– Multi-stakeholder, member-driven organisation
– Unites scholars, tool developers, librarians, publishers, funding agencies etc.
– E.g. Software citation group, akin to Data Citation Group
– Will present at Force16 in Portland, OR April 17-19, 2016
National Data Service:
– Multi-stakeholder group, based around supercomputing centres
– Aims to be a ‘connective tissue’ between data creation, curation, storage etc. projects
– Inviting Pilots: two or more partners who have not worked together, interested in collaborating on a data-centric project to solve a real-world need: can include software sharing
– E.g. Datasearch, Data Linking systems
RDA:
– Co-lead Data publishing, linking group
– Co-lead Cost Recovery group
– Active in Chemistry, Earth Science groups
– Starting BoF Data Search
Anita de Waard, VP Research Data Collaborations, [email protected], @anitadewaard
In summary: Let’s collectively enable ‘an account of the present undertakings, studies and labours of the ingenious in many considerable parts of the world’, by connecting ideas, data, and software through interconnected partnerships!