Wikidata: Verifiable, Linked Open Knowledge That Anyone Can Edit

Dario Taraborelli @readermeter

National Institutes of Health • September 23, 2016



Wikimedia Research https://www.mediawiki.org/wiki/Wikimedia_Research

The altmetrics manifesto http://altmetrics.org/manifesto/

A short history of Wikipedia

A website that anyone can edit

The largest reference work on the internet

A multi-language online encyclopedia


Wikipedia: unintended outcomes

accelerate the dissemination of scholarship

provide an infrastructure for open scientific research

enable distributed fact-checking and curation of scientific knowledge

Outline

1. Wikipedia as the front matter to all research

2. A new kind of open knowledge

3. Wikidata: Collaboratively curated linked open data

4. WikiCite: Building the sum of all human citations

5. Applications and opportunities for open science

6. Concluding remarks

1. Wikipedia as the front matter to all research

“Wikipedia is not the bottom layer of authority, nor the top, but in fact the highest layer without formal vetting. In this unique role, it serves as an ideal bridge between the validated and unvalidated Web.”

Casper Grathwohl, Chronicle of Higher Education

http://chronicle.com/article/article-content/125899/

Top sources of DOI lookups

http://crosstech.crossref.org/2014/02/many-metrics-such-data-wow.html

http://blog.crossref.org/2016/05/https-and-wikipedia.html

wikipedia.org

World’s most accessed online medical resources

Heilman and West (2015) doi.org/10.2196/jmir.4069

Most visited resource on Ebola in West Africa

Heilman (2016) http://tinyurl.com/jfuyduv

Most used internet site in Liberia, Sierra Leone and Guinea for Ebola during 2014 outbreak

Greater than CNN, CDC and WHO

2. A new kind of open knowledge

Schmachtenberg et al. (2014) http://lod-cloud.net [CC BY SA]

Challenges

Biases / errors

Coverage

Diversity and inclusiveness

Verifiability

Machine-readable linked open data

Editable by anyone

Supporting human + algorithmic curation

Comprehensive

Transparently verifiable


3. Wikidata: Collaboratively curated linked open data

Wikidata

Free knowledge base that anyone can edit

Launched in 2012

Integrated with Wikipedia and other sister projects

Statistics (September 2016)

Over 20M items

Over 100M statements

Wikidata: Growth

Active editors, English Wikipedia vs. Wikidata: http://reportcard.wmflabs.org/graphs/active_editors

Very active editors, English Wikipedia vs. Wikidata: http://reportcard.wmflabs.org/graphs/very_active_editors

Wikidata’s anatomy

https://www.wikidata.org/wiki/Wikidata:Introduction

Wikidata’s anatomy

Linked data, San Francisco, Jeblad https://commons.wikimedia.org/wiki/File:Linked_Data_-_San_Francisco.svg [CC BY SA]

SPARQL: https://t.co/cDR4Lt7V6P

Birth place of people employed by MIT

Wikidata: queries
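The query behind this slide is only linked, not reproduced in the transcript. A minimal sketch of what a comparable query could look like follows, built as a Python string and sent to the public Wikidata Query Service. The property and item identifiers (P108 employer, P19 place of birth, P625 coordinate location, Q49108 for MIT), the endpoint URL, and the function names are assumptions for illustration, not taken from the slides.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the public Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def birthplace_query(employer_qid: str = "Q49108", limit: int = 100) -> str:
    """Build a SPARQL query for the birth places of people employed by one organization.

    Assumed identifiers: P108 = employer, P19 = place of birth,
    P625 = coordinate location, Q49108 = Massachusetts Institute of Technology.
    """
    return f"""
SELECT ?person ?personLabel ?place ?placeLabel ?coord WHERE {{
  ?person wdt:P108 wd:{employer_qid} ;
          wdt:P19 ?place .
  ?place wdt:P625 ?coord .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


def run_query(query: str) -> dict:
    """Send the query over HTTP and return the parsed JSON response (needs network access)."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "slides-example/0.1 (illustrative sketch)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


query = birthplace_query()
```

Calling `run_query(query)` would return the result set as JSON. It is left uncalled here so the sketch runs without network access.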

SPARQL: http://tinyurl.com/h2lqv9y

Authors with a known location and ORCID

Wikidata: queries
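Queries such as these come back from a SPARQL endpoint as nested JSON when the JSON results format is requested. The sketch below flattens that structure into one dictionary per row. The sample response is entirely made up (placeholder item IDs and an all-zero ORCID), and the variable names `author`, `orcid` and `place` are assumptions about how such a query might name its columns.

```python
from typing import Dict, List


def bindings_to_rows(response: dict) -> List[Dict[str, str]]:
    """Flatten a SPARQL JSON result into one plain dict per result row."""
    variables = response["head"]["vars"]
    rows = []
    for binding in response["results"]["bindings"]:
        # A variable left unbound in a row is simply absent from its binding.
        rows.append({var: binding[var]["value"] for var in variables if var in binding})
    return rows


# Hypothetical response, made up for illustration (not real Wikidata output).
sample = {
    "head": {"vars": ["author", "orcid", "place"]},
    "results": {
        "bindings": [
            {
                "author": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0000001"},
                "orcid": {"type": "literal", "value": "0000-0000-0000-0000"},
                "place": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0000002"},
            },
            {
                "author": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0000003"},
                "orcid": {"type": "literal", "value": "0000-0000-0000-0000"},
            },
        ]
    },
}

rows = bindings_to_rows(sample)
```

Rows keep only the variables that were actually bound, so an author without a known location simply lacks the `place` key.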

Expert curation of scientific open data

Benjamin Good (2016) Opportunities and challenges presented by Wikidata in the context of biocuration http://tinyurl.com/hk9qrmz

Sample of current biomedical content in Wikidata

● All human and mouse genes and proteins (Swiss-Prot)
● All Gene Ontology terms
● All Human Disease Ontology terms
● All FDA-approved drugs
● 109 reference microbial genomes

Mitraka et al. (2015) Semantic Web Applications for the Life Sciences

Burgstaller-Muelbacher et al. (2016) Database

Putman et al. (2016) Database


Gene Wiki: Wikidata SPARQL examples https://bitbucket.org/sulab/wikidatasparqlexamples/overview

Get all known drug-drug interactions for Methadone via its CHEMBL id

Get a list of all diseases known to be treated by Metformin

Get a list of all diseases that might be treated by Metformin
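As a rough illustration of the second example (diseases known to be treated by a drug), the sketch below builds a query string for a given drug item. It is not the Gene Wiki team's query. It assumes that P2175 is the "medical condition treated" property and that Q19484 is the Wikidata item for Metformin; both should be checked before use. The function name is hypothetical.

```python
import re


def diseases_treated_by(drug_qid: str) -> str:
    """Build a SPARQL query listing the conditions a drug item is stated to treat.

    Assumed identifier: P2175 = medical condition treated.
    """
    # Reject anything that is not a plain item ID, so the string is safe to embed.
    if not re.fullmatch(r"Q[1-9]\d*", drug_qid):
        raise ValueError(f"not a Wikidata item ID: {drug_qid!r}")
    return f"""
SELECT DISTINCT ?disease ?diseaseLabel WHERE {{
  wd:{drug_qid} wdt:P2175 ?disease .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
ORDER BY ?diseaseLabel
""".strip()


# Q19484 is assumed here to be the item for Metformin.
metformin_query = diseases_treated_by("Q19484")
```

The "might be treated" variant would need extra graph hops (for example through shared drug targets), which this sketch does not attempt.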

4. WikiCite: Building the sum of all human citations

Randall Munroe, Wikipedian protester http://tinyurl.com/p3rodlb [CC BY]

Benjamin Good (2016) Opportunities and challenges presented by Wikidata in the context of biocuration http://tinyurl.com/hk9qrmz

the disappearance of provenance

http://bit.ly/SumOfAllCitations


a provenance-preserving answer engine

The sum of all human knowledge

+

The sum of all data and sources backing human knowledge

The molecular origins of insulin go at least as far back as the simplest unicellular [[eukaryotes]].<ref name='LeRoith'>{{cite journal | vauthors = LeRoith D, Shiloach J, Heffron R, Rubinovitz C, Tanenbaum R, Roth J | title = Insulin-related material in microbes: similarities and differences from mammalian insulins | journal = Can. J. Biochem. Cell Biol. | volume = 63 | issue = 8 | pages = 839–49 | year = 1985 | pmid = 3933801 | doi = 10.1139/o85-106 }}</ref> Apart from animals, insulin-like proteins are also known to exist in Fungi and Protista kingdoms.

References in Wikipedia
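The insulin reference above shows how bibliographic metadata sits inside free-form wikitext. A minimal sketch of pulling the fields out of such a `{{cite journal}}` template with a regular expression follows. The function name is hypothetical and the parsing is deliberately naive: it assumes no nested templates and no `|` characters inside field values, which real wikitext does not guarantee.

```python
import re
from typing import Dict


def parse_cite_template(wikitext: str) -> Dict[str, str]:
    """Extract the parameters of the first {{cite journal ...}} template in a wikitext string."""
    match = re.search(
        r"\{\{\s*cite journal\s*\|(.*?)\}\}", wikitext, flags=re.IGNORECASE | re.DOTALL
    )
    if match is None:
        return {}
    fields = {}
    for part in match.group(1).split("|"):
        key, separator, value = part.partition("=")
        if separator:  # skip fragments that are not key = value pairs
            fields[key.strip()] = value.strip()
    return fields


# The reference from the slide, copied as-is.
reference = (
    "<ref name='LeRoith'>{{cite journal | vauthors = LeRoith D, Shiloach J, Heffron R, "
    "Rubinovitz C, Tanenbaum R, Roth J | title = Insulin-related material in microbes: "
    "similarities and differences from mammalian insulins | journal = Can. J. Biochem. "
    "Cell Biol. | volume = 63 | issue = 8 | pages = 839–49 | year = 1985 | pmid = 3933801 "
    "| doi = 10.1139/o85-106 }}</ref>"
)

fields = parse_cite_template(reference)
```

Once extracted, identifiers such as `fields["doi"]` and `fields["pmid"]` are what would let a citation be matched to a structured bibliographic record.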

WikiCite: goals

Build a repository of all Wikimedia citations and bibliographic metadata

Design data models and technology to improve the coverage, quality, standards-compliance and machine-readability of citations and bibliographic metadata in Wikimedia projects

@wikicite • meta.wikimedia.org/wiki/WikiCite

Vision, Technology, Community, Scale, Licensing, Independence

https://tools.wmflabs.org/sqid/#/view?id=P2860

All biomedical OA review articles of the last 5 years

The Zika corpus

Open citation graph layer

Bibliographic metadata layer

Expert annotation layer

Encyclopedic layer


Encyclopedic layer Pathogen transmission process


5. Applications

Co-author graphs for individual researchers SPARQL: http://tinyurl.com/zml3jox
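Once author lists are available per work, a co-author graph reduces to counting author pairs. The sketch below shows that step on made-up data. The work and author labels are hypothetical, and in practice the author lists would come from a query over author statements rather than a literal dictionary.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def coauthor_edges(works: Dict[str, List[str]]) -> Counter:
    """Count how often each pair of authors appears together on the same work."""
    edges = Counter()
    for authors in works.values():
        # Sort so that each pair has one canonical orientation.
        for pair in combinations(sorted(set(authors)), 2):
            edges[pair] += 1
    return edges


def ego_network(edges: Counter, author: str) -> Dict[Tuple[str, str], int]:
    """Keep only the edges that involve one researcher."""
    return {pair: weight for pair, weight in edges.items() if author in pair}


# Hypothetical works and authors, for illustration only.
works = {
    "work_1": ["author_a", "author_b", "author_c"],
    "work_2": ["author_a", "author_b"],
    "work_3": ["author_c"],
}

edges = coauthor_edges(works)
network_c = ego_network(edges, "author_c")
```

Edge weights give the number of shared works, which is a common way to size links when such a graph is drawn.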

Most cited authors in the Zika research corpus (+ filtering by journal, OA status, type of statement) SPARQL: http://tinyurl.com/jb8da68
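The linked query is not reproduced in the transcript. One way such a ranking could be expressed is sketched below as an aggregate SPARQL query built in Python. It assumes the corpus is delimited by a "main subject" statement (P921), that authorship uses P50, and that citations use P2860, the property shown on the earlier slide. The topic item is left as a parameter because the corpus definition used in the talk is not stated.

```python
def most_cited_authors_query(topic_qid: str, limit: int = 20) -> str:
    """Build a SPARQL query ranking authors by incoming citations within one topic.

    Assumed identifiers: P921 = main subject, P50 = author, P2860 = cites.
    """
    return f"""
SELECT ?author ?authorLabel (COUNT(?citing) AS ?citations) WHERE {{
  ?cited wdt:P921 wd:{topic_qid} ;
         wdt:P50 ?author .
  ?citing wdt:P2860 ?cited .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
GROUP BY ?author ?authorLabel
ORDER BY DESC(?citations)
LIMIT {limit}
""".strip()


# "Q0" is a placeholder, not a real topic item.
example_query = most_cited_authors_query("Q0", limit=10)
```

Filtering by journal, OA status, or statement type, as the slide mentions, would add further triple patterns on `?cited`.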

Semi-automated recommendation of entities, missing statements, references for unsourced statements

https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool

https://meta.wikimedia.org/wiki/Grants:Project/WikiFactMine

https://twitter.com/larswillighagen/status/774614483394236416

Tools for crowdsourcing entity matching / disambiguation

http://www.generalist.org.uk/blog/2014/wikidata-identifiers-and-the-odnb-where-next/

http://www.generalist.org.uk/blog/2014/wikidata-and-identifiers-part-2-the-matching-process/

read/write interfaces for biocuration

all statements citing a New York Times article

the most popular scholarly journals used as citations for statements in any item that is a subclass of economics

all statements citing the works of Joseph Stiglitz

all statements citing journal articles by physicists from Oxford University

all statements citing a journal article that was retracted

all statements citing a source that cites a journal article that was retracted

New opportunities for linked open knowledge curation and discovery

https://meta.wikimedia.org/wiki/WikiCite_2016/Report/Group_5
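The last two query ideas (statements citing a retracted article, and statements citing a source that in turn cites one) amount to a one-hop and a two-hop walk over the citation graph. A toy sketch of that traversal on in-memory data follows. All identifiers and the data layout are hypothetical; a real implementation would read statement references and citation links from Wikidata instead.

```python
from typing import Dict, List, Set


def retraction_exposure(
    statement_sources: Dict[str, List[str]],
    citations: Dict[str, List[str]],
    retracted: Set[str],
) -> Dict[str, str]:
    """Label statements as 'direct' or 'indirect' depending on how they reach a retracted work."""
    exposure = {}
    for statement, sources in statement_sources.items():
        if any(source in retracted for source in sources):
            # One hop: the statement's own source is retracted.
            exposure[statement] = "direct"
        elif any(cited in retracted for source in sources for cited in citations.get(source, ())):
            # Two hops: the source cites a retracted work.
            exposure[statement] = "indirect"
    return exposure


# Hypothetical data, for illustration only.
statement_sources = {
    "statement_1": ["paper_a"],
    "statement_2": ["paper_b"],
    "statement_3": ["paper_c"],
}
citations = {"paper_b": ["paper_a"], "paper_c": ["paper_d"]}
retracted = {"paper_a"}

exposure = retraction_exposure(statement_sources, citations, retracted)
```

Statements with no path to a retracted work are left out of the result, so the output doubles as a review queue.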

More reliable data for altmetrics services

https://www.altmetric.com/blog/new-source-alert-wikipedia/

6. Concluding remarks

Dominant biocuration paradigm

● Cost of ad-hoc parsing of API responses or flatfile data
● Ambiguous or non-existent xrefs
● Persistence of funding
● Too much information to curate

B. Good (2016) Opportunities and challenges presented by Wikidata in the context of biocuration http://tinyurl.com/hk9qrmz

A new paradigm for biocuration

● Reduce API/parser proliferation
● Force up-front integration
● Facilitate coordination
● Ensure that if funding is lost, data is not
● Leverage community input

B. Good (2016) Opportunities and challenges presented by Wikidata in the context of biocuration http://tinyurl.com/hk9qrmz

T. Putman (2016) Centralizing content and distributing labor: a community model for curating the very long tail of microbial genomes https://doi.org/10.6084/m9.figshare.3201796.v1

Accelerate the discoverability, reusability, and societal impact of open access

Support new forms of open curation and distributed fact-checking

Provide long-term, sustainable infrastructure to support open science

Benefit from large-scale distribution of data in the linked data ecosystem


Thank you

Acknowledgments

Daniel Mietchen, Jonathan Dugan, Lydia Pintscher, Cameron Neylon, James Hare, James Heilman, Magnus Manske, Egon Willighagen, the Gene Wiki team (especially Andra Waagmeester, Tim Putman, Benjamin Good), the ContentMine team, the University of Chicago Knowledge Lab, all WikiCite 2016 participants and Wikidata Source Metadata project contributors.

Additional image credits

Library, National Park Service Collection thenounproject.com/term/library/191/ [CC0]

Robot, Creative Stall thenounproject.com/term/robot/132360/ [CC BY]

Open Access logo commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_transparent.svg [CC0]

@readermeter • @Wikidata • @WikiCite • @WikiResearch

A short history of NIH and Wikimedia

● 2002: article National Institutes of Health started on English Wikipedia
● 2003: MEDLINE
● 2004: PubMed
● 2005: PubMed Central
  ○ along with Template:PMC
● 2007: WikiProject National Institutes of Health
  ○ along with Template:National Institutes of Health
● 2009: first Wikipedia Academy in the US took place at NIH
  ○ Susannah Fox: “Shared Kismet: Wikipedia and the NIH”
  ○ triggers Guidelines for Participating in Wikipedia from NIH
● 2012: bot imports multimedia from PMC into Wikimedia Commons
  ○ triggers formation of JATS for Reuse working group
● 2015: Template:NIH properties on Wikidata
● 2016: First papers using Wikidata queries appear in PMC