Wikidata: Verifiable, Linked Open Knowledge That Anyone Can Edit
Dario Taraborelli • @readermeter
National Institutes of Health • September 23, 2016
Wikimedia Research https://www.mediawiki.org/wiki/Wikimedia_Research
The altmetrics manifesto http://altmetrics.org/manifesto/
A short history of Wikipedia
A website that anyone can edit
The largest reference work on the internet
A multi-language online encyclopedia
Wikipedia: unintended outcomes
accelerate the dissemination of scholarship
provide an infrastructure for open scientific research
enable distributed fact-checking and curation of scientific knowledge
Outline
1. Wikipedia as the front matter to all research
2. A new kind of open knowledge
3. Wikidata: Collaboratively curated linked open data
4. WikiCite: Building the sum of all human citations
5. Applications and opportunities for open science
6. Concluding remarks
“Wikipedia is not the bottom layer of authority, nor the top, but in fact the highest layer without formal vetting. In this unique role, it serves as an ideal bridge between the validated and unvalidated Web.”
Casper Grathwohl, Chronicle of Higher Education
http://chronicle.com/article/article-content/125899/
Top sources of DOI lookups
http://crosstech.crossref.org/2014/02/many-metrics-such-data-wow.html http://blog.crossref.org/2016/05/https-and-wikipedia.html
wikipedia.org
World’s most accessed online medical resources
Heilman and West (2015) doi.org/10.2196/jmir.4069
Most visited resource on Ebola in West Africa
Heilman (2016) http://tinyurl.com/jfuyduv
Most used internet site in Liberia, Sierra Leone and Guinea for Ebola during 2014 outbreak
Greater than CNN, CDC and WHO
Schmachtenberg et al (2014) http://lod-cloud.net [CC BY SA]
Machine-readable linked open data
Editable by anyone
Supporting human + algorithmic curation
Comprehensive
Transparently verifiable
Wikidata
Free knowledge base that anyone can edit
Launched in 2012
Integrated with Wikipedia and other sister projects
Statistics (September 2016)
Over 20M items
Over 100M statements
Wikidata: Growth
http://reportcard.wmflabs.org/graphs/active_editors
English Wikipedia
Wikidata
Wikidata: Growth
http://reportcard.wmflabs.org/graphs/very_active_editors
English Wikipedia
Wikidata
Wikidata’s anatomy
https://www.wikidata.org/wiki/Wikidata:Introduction
Wikidata’s anatomy
Linked data, San Francisco, Jeblad https://commons.wikimedia.org/wiki/File:Linked_Data_-_San_Francisco.svg [CC BY SA]
SPARQL: https://t.co/cDR4Lt7V6P
Birth place of people employed by MIT
Wikidata: queries
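A query of this kind can be sketched in a few lines. This is a minimal illustration, not the query behind the shortened link above: it assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql and the identifiers P108 (employer), P19 (place of birth), P625 (coordinate location) and Q49108 (MIT). The helper names `birthplace_query` and `request_url` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed public endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def birthplace_query(employer_qid: str = "Q49108", limit: int = 100) -> str:
    """Build a SPARQL query for the birth places of people employed by an organization."""
    return f"""
SELECT ?person ?personLabel ?place ?placeLabel ?coord WHERE {{
  ?person wdt:P108 wd:{employer_qid} .   # employer
  ?person wdt:P19 ?place .               # place of birth
  OPTIONAL {{ ?place wdt:P625 ?coord . }}  # coordinates, when known
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()


def request_url(query: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    # Print the query text and the URL a client would fetch; no network call is made here.
    print(birthplace_query())
    print(request_url(birthplace_query(limit=5)))
```

Swapping the `employer_qid` argument for another organization's item ID would give the same map for a different employer, which is the point of querying a shared, linked knowledge base rather than a bespoke dataset.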
SPARQL: http://tinyurl.com/h2lqv9y
Authors with a known location and ORCID
Wikidata: queries
Expert curation of scientific open data
Benjamin Good (2016) Opportunities and challenges presented by Wikidata in the context of biocuration http://tinyurl.com/hk9qrmz
Sample of current biomedical content in Wikidata
● All human, mouse genes and proteins (swissprot)
● All Gene Ontology terms
● All Human Disease Ontology terms
● All FDA approved drugs
● 109 reference microbial genomes
Mitraka et al (2015) Semantic Web Applications for the Life Sciences
Burgstaller-Muehlbacher et al (2016) Database
Putman et al (2016) Database
Expert curation of scientific open data
Gene Wiki: Wikidata SPARQL examples https://bitbucket.org/sulab/wikidatasparqlexamples/overview
Get all known drug-drug interactions for Methadone via its CHEMBL id
Get a list of all diseases known to be treated by Metformin
Get a list of all diseases that might be treated by Metformin
4. WikiCite: Building the sum of all human citations
Randall Munroe, Wikipedian protester http://tinyurl.com/p3rodlb [CC BY]
https://twitter.com/egonwillighagen/status/718474906858582016
Benjamin Good (2016) Opportunities and challenges presented by Wikidata in the context of biocuration http://tinyurl.com/hk9qrmz
the disappearance of provenance
http://bit.ly/SumOfAllCitations
a provenance-preserving answer engine
The sum of all human knowledge
+
The sum of all data and sources backing human knowledge
https://tools.wmflabs.org/wikidata-todo/stats.php https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Source_MetaData#Sources_used_as_references_on_Wikidata
References in Wikidata (chart, 2013 to 2016; highlighted value: 77%)
The molecular origins of insulin go at least as far back as the simplest unicellular [[eukaryotes]].<ref name='LeRoith'>{{cite journal | vauthors = LeRoith D, Shiloach J, Heffron R, Rubinovitz C, Tanenbaum R, Roth J | title = Insulin-related material in microbes: similarities and differences from mammalian insulins | journal = Can. J. Biochem. Cell Biol. | volume = 63 | issue = 8 | pages = 839–49 | year = 1985 | pmid = 3933801 | doi = 10.1139/o85-106 }}</ref> Apart from animals, insulin-like proteins are also known to exist in Fungi and Protista kingdoms.
References in Wikipedia
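The citation above is structured data hidden inside wikitext, which is what makes it possible to lift references out of articles. As a rough sketch of that extraction step, the following pulls named parameters out of `{{cite journal}}` templates with a regular expression. It assumes flat templates with no nested templates or piped links inside parameter values; the function name `parse_cite_journal` is hypothetical, and a production pipeline would use a real wikitext parser.

```python
import re

# Matches a flat {{cite journal | ... }} template and captures its parameter list.
CITE_RE = re.compile(r"\{\{\s*cite journal\s*\|(.*?)\}\}", re.IGNORECASE | re.DOTALL)


def parse_cite_journal(wikitext: str) -> list[dict[str, str]]:
    """Extract the named parameters of each {{cite journal}} template as a dict."""
    citations = []
    for match in CITE_RE.finditer(wikitext):
        params = {}
        for field in match.group(1).split("|"):
            key, sep, value = field.partition("=")
            if sep:  # skip positional parameters that have no name
                params[key.strip()] = value.strip()
        citations.append(params)
    return citations


if __name__ == "__main__":
    # Shortened version of the citation shown on the slide.
    sample = (
        "<ref name='LeRoith'>{{cite journal | vauthors = LeRoith D, Shiloach J "
        "| journal = Can. J. Biochem. Cell Biol. | year = 1985 "
        "| pmid = 3933801 | doi = 10.1139/o85-106 }}</ref>"
    )
    for cite in parse_cite_journal(sample):
        print(cite["doi"], cite["pmid"])
```

Identifiers such as the DOI and PMID are the natural join keys for matching a citation found in an article to a bibliographic item.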
WikiCite: goals
Build a repository of all Wikimedia citations and bibliographic metadata
Design data models and technology to improve the coverage, quality, standards-compliance and machine-readability of citations and bibliographic metadata in Wikimedia projects
@wikicite • meta.wikimedia.org/wiki/WikiCite
https://tools.wmflabs.org/sqid/#/view?id=P2860
All biomedical OA review articles of the last 5 years
The Zika corpus
Open citation graph layer
Bibliographic metadata layer
Expert annotation layer
Encyclopedic layer
Co-author graphs for individual researchers SPARQL: http://tinyurl.com/zml3jox
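The idea behind a co-author graph can be shown without a query service. The sketch below is an in-memory illustration with made-up paper and author names, not the SPARQL behind the link above: it assumes each paper item lists its authors (on Wikidata this would be the author property, P50) and counts how often each pair of authors appears together. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(papers: dict[str, list[str]]) -> Counter:
    """Count how often each unordered pair of authors shares a paper."""
    edges = Counter()
    for authors in papers.values():
        # Sort and de-duplicate so (A, B) and (B, A) map to the same edge.
        for pair in combinations(sorted(set(authors)), 2):
            edges[pair] += 1
    return edges


def coauthors_of(author: str, edges: Counter) -> dict[str, int]:
    """Return the neighbours of one author with the number of shared papers."""
    neighbours = {}
    for (first, second), shared in edges.items():
        if author == first:
            neighbours[second] = shared
        elif author == second:
            neighbours[first] = shared
    return neighbours


if __name__ == "__main__":
    # Hypothetical corpus: paper ID -> author names.
    papers = {
        "paper-1": ["Ana", "Bo", "Cy"],
        "paper-2": ["Ana", "Bo"],
        "paper-3": ["Cy", "Di"],
    }
    print(coauthors_of("Ana", coauthor_edges(papers)))
```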
Most cited authors in the Zika research corpus (+ filtering by journal, OA status, type of statement) SPARQL: http://tinyurl.com/jb8da68
Semi-automated recommendation of entities, missing statements, references for unsourced statements
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
Semi-automated recommendation of entities, missing statements, references for unsourced statements
https://meta.wikimedia.org/wiki/Grants:Project/WikiFactMine https://twitter.com/larswillighagen/status/774614483394236416
Tools for crowdsourcing entity matching / disambiguation
http://www.generalist.org.uk/blog/2014/wikidata-identifiers-and-the-odnb-where-next/ http://www.generalist.org.uk/blog/2014/wikidata-and-identifiers-part-2-the-matching-process/
all statements citing a New York Times article
the most popular scholarly journals used as citations for statements in any item that is a subclass of economics
all statements citing the works of Joseph Stiglitz
all statements citing journal articles by physicists from Oxford University
all statements citing a journal article that was retracted
all statements citing a source that cites a journal article that was retracted
New opportunities for linked open knowledge curation and discovery
https://meta.wikimedia.org/wiki/WikiCite_2016/Report/Group_5
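Two of the questions above, statements citing a retracted article and statements citing a source that in turn cites one, reduce to set operations once references and citations are both available as linked data. The following is a minimal in-memory sketch with made-up identifiers. It assumes a mapping from statements to the sources they reference and a citation graph between sources (the kind of link the citation property P2860 expresses); the function names are hypothetical.

```python
def statements_citing(references: dict[str, set[str]], sources: set[str]) -> set[str]:
    """Statements whose references include at least one of the given sources."""
    return {stmt for stmt, refs in references.items() if refs & sources}


def sources_citing(cites: dict[str, set[str]], targets: set[str]) -> set[str]:
    """Sources that directly cite at least one of the target works."""
    return {src for src, cited in cites.items() if cited & targets}


if __name__ == "__main__":
    # Hypothetical data: statement -> referenced sources, source -> cited works.
    references = {"stmt-1": {"art-A"}, "stmt-2": {"art-B"}, "stmt-3": {"art-C"}}
    cites = {"art-B": {"art-A"}, "art-C": {"art-D"}}
    retracted = {"art-A"}

    # Statements that cite a retracted article directly.
    print(statements_citing(references, retracted))
    # Statements that cite a source which cites a retracted article.
    print(statements_citing(references, sources_citing(cites, retracted)))
```

Repeating the `sources_citing` step widens the search by one citation hop each time, which is how a retraction could be traced through a citation graph to the statements that ultimately depend on it.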
More reliable data for altmetrics services
https://www.altmetric.com/blog/new-source-alert-wikipedia/
Dominant biocuration paradigm
● Cost of ad-hoc parsing of API responses or flatfile data
● Ambiguous or non-existent xrefs
● Persistence of funding
● Too much information to curate
B. Good (2016) Opportunities and challenges presented by Wikidata in the context of biocuration http://tinyurl.com/hk9qrmz
A new paradigm for biocuration
● Reduce API/parser proliferation
● Force up-front integration
● Facilitate coordination
● Ensure that if funding is lost, data is not
● Leverage community input
B. Good (2016) Opportunities and challenges presented by Wikidata in the context of biocuration http://tinyurl.com/hk9qrmz
T. Putman (2016) Centralizing content and distributing labor: a community model for curating the very long tail of microbial genomes https://doi.org/10.6084/m9.figshare.3201796.v1
Support new forms of open curation and distributed fact-checking
Provide long-term, sustainable infrastructure to support open science
Benefit from large-scale distribution of data in the linked data ecosystem
Wikidata: Verifiable, Linked Open Knowledge That Anyone Can Edit
meta.wikimedia.org/wiki/WikiCite • @wikicite
Thank you
Acknowledgments
Daniel Mietchen, Jonathan Dugan, Lydia Pintscher, Cameron Neylon, James Hare, James Heilman, Magnus Manske, Egon Willighagen, the Gene Wiki team (especially Andra Waagmeester, Tim Putman, Benjamin Good), the ContentMine team, the University of Chicago Knowledge Lab, all WikiCite 2016 participants and Wikidata Source Metadata project contributors.
Additional image credits
Library, National Park Service Collection thenounproject.com/term/library/191/ [CC0]
Robot, Creative Stall thenounproject.com/term/robot/132360/ [CC BY]
Open Access logo commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_transparent.svg [CC0]
@readermeter • @Wikidata • @WikiCite • @WikiResearch
A short history of NIH and Wikimedia
● 2002: article National Institutes of Health started on English Wikipedia
● 2003: MEDLINE
● 2004: PubMed
● 2005: PubMed Central
  ○ along with Template:PMC
● 2007: WikiProject National Institutes of Health
  ○ along with Template:National Institutes of Health
● 2009: first Wikipedia Academy in the US took place at NIH
  ○ Susannah Fox: “Shared Kismet: Wikipedia and the NIH”
  ○ triggers Guidelines for Participating in Wikipedia from NIH
● 2012: bot imports multimedia from PMC into Wikimedia Commons
  ○ triggers formation of JATS for Reuse working group
● 2015: Template:NIH properties on Wikidata
● 2016: First papers using Wikidata queries appear in PMC