Ontologies, Knowledge Bases, Wikidata
MPRI 2.26.2: Web Data Management
Antoine Amarilli
Friday, January 11th
1/31
Reminder
• Ontology: vocabulary (classes and relations) to describe things
• Knowledge base: set of facts in one or several ontologies
→ Focus on Wikidata: a general-purpose knowledge base and ontology
2/31
Ontologies
Ontologies
• Various domain-specific vocabularies used across knowledge bases
• One general-purpose ontology used by Google, Microsoft, Yahoo, Yandex: schema.org
• Other ontologies that come together with a knowledge base
3/31
Friend of a friend (FOAF)
Describe people, relationships, profiles, activities (social network)
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<#JW>
    a foaf:Person ;
    foaf:name "Jimmy Wales" ;
    foaf:mbox <mailto:[email protected]> ;
    foaf:homepage <http://www.jimmywales.com> ;
    foaf:nick "Jimbo" ;
    foaf:depiction <http://www.jimmywales.com/aus_img_small.jpg> ;
    foaf:interest <http://www.wikimedia.org> ;
    foaf:knows [
        a foaf:Person ;
        foaf:name "Angela Beesley"
    ] .
4/31
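A minimal sketch of how part of the Turtle document above expands into triples, using plain Python tuples rather than an RDF library. The `match` helper, the `names_of_acquaintances` function, and the blank-node label `_:b0` are illustrative assumptions, not part of FOAF.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Some of the triples encoded by the Turtle document above.
# "_:b0" is a hypothetical label for the anonymous node written as [ ... ].
triples = [
    ("#JW", RDF_TYPE, FOAF + "Person"),
    ("#JW", FOAF + "name", "Jimmy Wales"),
    ("#JW", FOAF + "nick", "Jimbo"),
    ("#JW", FOAF + "knows", "_:b0"),
    ("_:b0", RDF_TYPE, FOAF + "Person"),
    ("_:b0", FOAF + "name", "Angela Beesley"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def names_of_acquaintances(triples, person):
    """Follow foaf:knows, then foaf:name, starting from `person`."""
    names = []
    for _, _, friend in match(triples, s=person, p=FOAF + "knows"):
        names += [o for _, _, o in match(triples, s=friend, p=FOAF + "name")]
    return names


print(names_of_acquaintances(triples, "#JW"))
```

The `a` keyword in Turtle is shorthand for `rdf:type`, which is why both nodes carry an `RDF_TYPE` triple here.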
Creative Commons
Describe the license and rights on documents
<div about="http://lessig.org/blog/"
     xmlns:cc="http://creativecommons.org/ns#">
  This page, by <a property="cc:attributionName"
                   rel="cc:attributionURL"
                   href="http://lessig.org/">Lawrence Lessig</a>,
  is licensed under a <a rel="license"
                         href="http://creativecommons.org/licenses/by/3.0/">Creative Commons Attribution License</a>.
</div>
• Many content providers add this kind of markup (e.g., Flickr)
• Search engines can use it (e.g., Google)
5/31
Other domain-specific ontologies
• Dublin Core (DC): describe digital resources (videos, images, etc.) and physical resources (books, CDs, etc.)
• Simple Knowledge Organization System (SKOS): describe thesauri, taxonomies, etc.
• Open Graph Protocol: metadata for Web pages to be integrated in Facebook’s social graph; also Twitter Cards for Twitter
• DOAP (Description of a Project): describe software projects
• VoID (Vocabulary of Interlinked Datasets): describe a linked dataset
• Countless others
6/31
Schema.org: a general-purpose ontology
• General-purpose ontology: 598 types and 862 properties in version 3.5
• Intended to be used on Web pages to annotate the semantics of elements
• Used by search engines for rich search results
• Used in over 10 million sites¹
¹ Source: https://schema.org/
7/31
Format: Microdata
<div class="event-wrapper" itemscope itemtype="http://schema.org/Event">
  <div class="event-date" itemprop="startDate" content="2013-09-14T21:30">Sat Sep 14</div>
  <div class="event-title" itemprop="name">Typhoon with Radiation City</div>
  <div class="event-venue" itemprop="location" itemscope itemtype="http://schema.org/Place">
    <span itemprop="name">The Hi-Dive</span>
    <div class="address" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
      <span itemprop="streetAddress">7 S. Broadway</span><br>
      <span itemprop="addressLocality">Denver</span>,
      <span itemprop="addressRegion">CO</span>
      <span itemprop="postalCode">80209</span>
    </div>
  </div>
  <div class="event-time">9:30 PM</div>
</div>
• itemscope creates an item and itemtype gives its type
• itemprop gives values for properties of the item
8/31
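To make the roles of itemscope, itemtype and itemprop concrete, here is a rough sketch of an extractor built on Python's standard `html.parser`. It assumes well-formed markup and handles only the attributes shown on the slide; the class name `MicrodataParser` and the shortened `SAMPLE` document are illustrative, and this is not a full implementation of the Microdata specification.

```python
from html.parser import HTMLParser

# Elements without an end tag; ignored in this sketch.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

SAMPLE = """
<div itemscope itemtype="http://schema.org/Event">
  <div itemprop="startDate" content="2013-09-14T21:30">Sat Sep 14</div>
  <div itemprop="name">Typhoon with Radiation City</div>
  <div itemprop="location" itemscope itemtype="http://schema.org/Place">
    <span itemprop="name">The Hi-Dive</span>
  </div>
</div>
"""


class MicrodataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.roots = []   # top-level items found in the page
        self.items = []   # stack of items whose element is still open
        self.frames = []  # stack of open elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        is_scope = "itemscope" in attrs
        if is_scope:
            # itemscope creates a new item; itemtype gives its type.
            item = {"@type": attrs.get("itemtype")}
            if prop and self.items:
                self.items[-1][prop] = item  # nested item is a property value
            else:
                self.roots.append(item)
            self.items.append(item)
        self.frames.append({"prop": prop, "scope": is_scope,
                            "content": attrs.get("content"), "text": []})

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for frame in self.frames:
            frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.frames:
            return
        frame = self.frames.pop()
        if frame["scope"]:
            self.items.pop()
        elif frame["prop"] and self.items:
            # Prefer the machine-readable content attribute over the text.
            value = frame["content"]
            if value is None:
                value = "".join(frame["text"]).strip()
            self.items[-1][frame["prop"]] = value


parser = MicrodataParser()
parser.feed(SAMPLE)
event = parser.roots[0]
print(event["name"], "at", event["location"]["name"])
```

Note how the `content` attribute on the date element overrides the human-readable text "Sat Sep 14".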
Format: RDFa
Competing format to Microdata, seems less common²
<div vocab="http://schema.org/" class="event-wrapper" typeof="Event">
  <div class="event-date" property="startDate" content="2013-09-14T21:30">Sat Sep 14</div>
  <div class="event-title" property="name">Typhoon with Radiation City</div>
  <div class="event-venue" property="location" typeof="Place">
    <span property="name">The Hi-Dive</span>
    <div class="address" property="address" typeof="PostalAddress">
      <span property="streetAddress">7 S. Broadway</span><br>
      <span property="addressLocality">Denver</span>,
      <span property="addressRegion">CO</span>
      <span property="postalCode">80209</span>
    </div>
  </div>
  <div class="event-time">9:30 PM</div>
</div>
² http://webdatacommons.org/structureddata/index.html#toc2
9/31
Format: JSON-LD
Alternative approach: give the structured data separately in JSON
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Event",
  "location": {
    "@type": "Place",
    "address": {
      "@type": "PostalAddress",
      "addressLocality": "Denver",
      "addressRegion": "CO",
      "postalCode": "80209",
      "streetAddress": "7 S. Broadway"
    },
    "name": "The Hi-Dive"
  },
  "name": "Typhoon with Radiation City",
  "startDate": "2013-09-14T21:30"
}
</script>
• The @context attribute gives the namespace for the @type.
• No longer gives any link to the page contents
• Also @id to give a URI to a node
• Many other features (editor’s draft of the spec is 167 pages)
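Because JSON-LD is ordinary JSON, a consumer can start with a standard JSON parser. The sketch below loads the event from this slide and lists every nested node that declares an @type. It treats the document as plain JSON and does not perform JSON-LD context expansion; the helper name `typed_nodes` is an assumption made for illustration.

```python
import json

DOC = """
{
  "@context": "http://schema.org",
  "@type": "Event",
  "location": {
    "@type": "Place",
    "address": {
      "@type": "PostalAddress",
      "addressLocality": "Denver",
      "addressRegion": "CO",
      "postalCode": "80209",
      "streetAddress": "7 S. Broadway"
    },
    "name": "The Hi-Dive"
  },
  "name": "Typhoon with Radiation City",
  "startDate": "2013-09-14T21:30"
}
"""


def typed_nodes(node, path=""):
    """Yield (path, @type) for every nested object that declares a type."""
    if isinstance(node, dict):
        if "@type" in node:
            yield path or "(root)", node["@type"]
        for key, value in node.items():
            if not key.startswith("@"):  # skip JSON-LD keywords
                yield from typed_nodes(value, f"{path}/{key}")
    elif isinstance(node, list):
        for child in node:
            yield from typed_nodes(child, path)


data = json.loads(DOC)
found = dict(typed_nodes(data))
for path, type_name in found.items():
    # The full type IRI is the @context namespace followed by the short name.
    print(path, "->", data["@context"] + "/" + type_name)
```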
10/31
Web Data Commons Structured Data
• Extraction of semantic content from the Common Crawl
• Also useful to measure usage of structured data:
  • In November 2017, the Common Crawl contained 66 TB (compressed), 260 TB (uncompressed), 3.2G pages
  • 39% of pages (and 28% of domains) contained semantic data
  • 9G entities and 38G triples
  • http://webdatacommons.org/structureddata/
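A quick back-of-the-envelope reading of these figures, as a sketch. It assumes that all the numbers on this slide describe the same November 2017 crawl and that the triples come only from the pages counted as containing semantic data; the variable names are illustrative.

```python
# Figures from the slide (G = 10^9).
pages = 3.2e9
share_with_data = 0.39
entities = 9e9
triples = 38e9
compressed_tb, uncompressed_tb = 66, 260

# Derived quantities.
annotated_pages = share_with_data * pages
triples_per_entity = triples / entities
triples_per_annotated_page = triples / annotated_pages
compression_ratio = uncompressed_tb / compressed_tb

print(f"annotated pages:            {annotated_pages / 1e9:.2f}G")
print(f"triples per entity:         {triples_per_entity:.1f}")
print(f"triples per annotated page: {triples_per_annotated_page:.1f}")
print(f"compression ratio:          {compression_ratio:.1f}x")
```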
11/31
Knowledge bases
Common Knowledge bases
• General-purpose: DBpedia, YAGO, Freebase (defunct), Wikidata
• Proprietary: Google Knowledge Graph, Bing Knowledge Graph (aka Satori)
• Domain-specific
• We will focus afterwards on Wikidata
12/31
DBpedia
• Started in 2007
• License: CC-BY-SA
• Code license: GPLv2
• Actors: Leipzig University, University of Mannheim, OpenLink Software
• Latest release: 2016-10
• Extracted from Wikimedia projects
• 6M entities and 10G triples in 2016-04³
³ https://blog.dbpedia.org/2016/10/19/yeah-we-did-it-again-new-2016-04-dbpedia-release/
13/31
YAGO
• Started in 2008
• License: CC-BY
• Code license: GPLv3
• Actors: Max Planck Institute for Informatics, Télécom ParisTech
• Latest release: YAGO 3.1 (2017)
• Extracted from Wikipedias and other sources; manual evaluation
• 10M entities and 120M triples⁴
⁴ http://yago-knowledge.org/
14/31
Freebase
• Started in 2007, discontinued in 2016
• License: CC-BY
• Code license: Apache2 (provided after-the-fact by Google)
• Actors: Metaweb, acquired by Google in 2010
• Initially imported from various sources
• Could be edited by anyone
• Partially imported into Wikidata (but not completely)
• Last release: 2016
• Last dump has 1.9G triples
15/31
Wikidata
• Started in 2012
• License: public domain
• Code license: GPLv2
• Actors: Wikimedia Deutschland, Wikimedia
• Last release: weekly
• Around 650M statements and 54M items
• Can be edited by anyone! Around 20k active users.
16/31
Domain-specific
• MusicBrainz, for CDs and music in general (20 million recordings)
• British National Bibliography: bibliographic details about books published in the UK since 1950
• data.bnf.fr, data from the French national library
• OpenStreetMap, and GeoNames
• Medicine and chemistry with SNOMED CT, and other databases: DrugBank, KEGG, UniProt, ChEMBL, etc.
• Linguistic resources, e.g., BabelNet
• Bibliography, e.g., DBLP, Crossref
17/31
Linked Open Data
[Figure: the Linked Open Data Cloud diagram. Legend: Cross Domain, Geography, Government, Life Sciences, Linguistics, Media, Publications, Social Networking, User Generated. The dataset labels (e.g., DBpedia, Wikidata, YAGO, Freebase, UniProt, DrugBank, BabelNet) are truncated in the source.]
The Linked Open Data Cloud from lod-cloud.net
18/31
Gathering Semantic Web Data
• Browsing online versions of KBs
• Using ad-hoc APIs to retrieve relevant triples
• Using a SPARQL endpoint
• Downloading a dump
• Crawling other knowledge bases, e.g., dereferencing Cool URIs
19/31
Systems
• RDF stores (triplestores) with relational or native backend, open-source or commercial, related to graph databases
  • Apache Jena
  • Virtuoso
  • Blazegraph, essentially acquired by Amazon
  • Amazon Neptune
• SPARQL engines, usually on top of a triplestore. http://en.wikipedia.org/wiki/SPARQL
• Tool to view semantic data in Web pages: http://www.google.com/webmasters/tools/richsnippets
20/31
Semantic Web challenges
• Complexity:
  • Writing structured content is harder than writing text!
  • Using structured content (with heterogeneous schema) is complicated!
• Discoverability problem for knowledge bases, vocabularies
• Performance:
  • Data is large
  • Running queries on graphs is tricky
  • Reasoning makes it even worse
  • Federation makes things worse again
21/31
Semantic Web challenges, cont’d
• Data quality:
  • Vagueness and modeling issues
  • Trust (anyone can add a triple)
  • Canonicity and alignment
  • Temporality, sources often complicated to represent
  • Open-world semantics: missing values vs no values
• Incentives: many data providers do not want to be eaten by others
22/31
Wikidata
Why Wikidata matters
• Backed by the Wikimedia Foundation: credible and noncommercial
• Not run by academics, but some academics are involved
• Genuine uses on Wikipedia (to some extent)
• Centralized model, which is a good idea for now
• Good tradeoffs in terms of expressiveness, scope...
• Uses the successful wiki model
23/31
Wikidata basics
• Entities: Q1, Q2, Q3, ..., Q60527475 and beyond
• Properties: P1, P2, P3, ..., P6343 and beyond
• Entities and properties have a label and short description in each language, along with aliases (search engine)
• Entities can also have sitelinks to Wikimedia projects (e.g., the corresponding Wikimedia pages)
• For each entity and property, we can have facts (or claims) with different objects
• Everyone can create and edit entities and facts
• Discussion is needed before creating a property
• Software: Wikibase, a set of extensions to MediaWiki
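The pieces listed above can be pictured as a small record type. This is a simplified sketch for intuition only, not the actual Wikibase data model or its JSON layout: the `Entity` class, its field names, and the object identifiers `Q_SCHOOL_A` and `Q_SCHOOL_B` are hypothetical placeholders. The alias shown is illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    id: str                                           # e.g. "Q42" or "P69"
    labels: dict = field(default_factory=dict)        # language -> label
    descriptions: dict = field(default_factory=dict)  # language -> description
    aliases: dict = field(default_factory=dict)       # language -> list of aliases
    sitelinks: dict = field(default_factory=dict)     # site -> page title
    claims: dict = field(default_factory=dict)        # property id -> list of objects

    def add_claim(self, prop, value):
        """Record one more object for `prop`; a property may have several."""
        self.claims.setdefault(prop, []).append(value)

    def search_terms(self, lang):
        """Label plus aliases: the strings a search box could match."""
        terms = []
        if lang in self.labels:
            terms.append(self.labels[lang])
        return terms + self.aliases.get(lang, [])


adams = Entity(
    id="Q42",
    labels={"en": "Douglas Adams"},
    aliases={"en": ["Douglas Noel Adams"]},  # illustrative alias
    sitelinks={"enwiki": "Douglas Adams"},
)
# Two facts for the same property P69 ("educated at"), with
# placeholder objects rather than real item identifiers.
adams.add_claim("P69", "Q_SCHOOL_A")
adams.add_claim("P69", "Q_SCHOOL_B")
print(adams.search_terms("en"), adams.claims)
```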
24/31
Qualifiers, references, ranks, data types
• Each fact can have qualifiers to indicate things like start/end time, details (e.g., major/degree for P69 “educated at”)
• Each fact can also have sources to indicate where it comes from (a source is a set of key–value pairs)
• Each fact can have a rank among “normal”, “preferred” (e.g., for the current value), or “deprecated”.
• Literal values can have data types https://www.wikidata.org/wiki/Special:ListDatatypes
• Also two special values
  • “unknown value” (a value exists but is unknown)
  • “no value” (it is known that there is no value)
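One way to see how ranks interact is to select the "best" statements for a property: preferred ones if any exist, otherwise the normal ones, and never the deprecated ones. The sketch below follows that rule as an assumption about how consumers typically read ranks. The `Statement` class, the sentinels, the property names `P_EXAMPLE` and `P_OTHER`, and the qualifier keys are all hypothetical.

```python
from dataclasses import dataclass, field

# Sentinels for the two special values.
NO_VALUE = object()       # it is known that there is no value
UNKNOWN_VALUE = object()  # a value exists but is unknown


@dataclass
class Statement:
    prop: str
    value: object
    rank: str = "normal"  # "normal", "preferred" or "deprecated"
    qualifiers: dict = field(default_factory=dict)
    references: list = field(default_factory=list)  # each a dict of key-value pairs


def best_statements(statements, prop):
    """Preferred statements if there are any, else the normal ones."""
    candidates = [s for s in statements
                  if s.prop == prop and s.rank != "deprecated"]
    preferred = [s for s in candidates if s.rank == "preferred"]
    return preferred or candidates


statements = [
    Statement("P_EXAMPLE", "old value", qualifiers={"end time": "2010"}),
    Statement("P_EXAMPLE", "current value", rank="preferred",
              qualifiers={"start time": "2010"}),
    Statement("P_EXAMPLE", "wrong value", rank="deprecated"),
    Statement("P_OTHER", NO_VALUE),
]
print([s.value for s in best_statements(statements, "P_EXAMPLE")])
```

Keeping `NO_VALUE` distinct from a missing statement matters under open-world semantics: an absent fact says nothing, whereas `NO_VALUE` is itself a claim.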
25/31
Constraints
• Wikidata has constraints which are only advisory (= you can create violations) and are quite simple. Main ones:
  • “single (best) value constraint”
  • “inverse constraint” (mother vs child), “symmetric constraint”
  • “type constraint”, or requiring/disallowing certain facts
  • “range constraint”, “contemporary constraint”, “format constraint”
  • “one-of/none-of constraint” (list of allowed/forbidden values)
  • Requiring/allowing qualifiers or units
  • Allowing use as a qualifier/unit
• There is a mechanism for exceptions
• Many constraint violations in practice
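Since constraints are advisory, a checker reports violations instead of rejecting edits. The sketch below illustrates two of the constraints on a toy set of facts. The function names, the fact layout, and the sample people are hypothetical, and the real constraint definitions have more options (such as the exception mechanism) than shown here.

```python
from collections import defaultdict


def single_value_violations(facts, prop):
    """Subjects that have more than one distinct value for `prop`."""
    values = defaultdict(set)
    for s, p, o in facts:
        if p == prop:
            values[s].add(o)
    return {s: sorted(v) for s, v in values.items() if len(v) > 1}


def inverse_violations(facts, prop, inverse):
    """Facts (s, prop, o) that lack the matching (o, inverse, s)."""
    present = set(facts)
    return [(s, p, o) for s, p, o in facts
            if p == prop and (o, inverse, s) not in present]


# Toy facts about made-up people.
facts = [
    ("alice", "mother", "carol"),
    ("carol", "child", "alice"),
    ("bob", "mother", "carol"),          # carol has no "child" fact for bob
    ("alice", "date of birth", "1950"),
    ("alice", "date of birth", "1951"),  # two conflicting values
]

print(single_value_violations(facts, "date of birth"))
print(inverse_violations(facts, "mother", "child"))
```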
26/31
Usage on Wikipedia
• Used for interwiki links, i.e., the links between Wikipedia pages across languages
• Used in some infoboxes on Wikipedia, e.g., to automatically populate some fields
• Can be used for other things, e.g., filling tables, or external links to other sources
• Policy depends on each Wikipedia: some communities are more welcoming than others...
27/31
Ongoing Wikidata discussions
• Project scope: what belongs in Wikidata?
• The public domain license is a strong requirement
• Concerns, e.g., about the high number of bibliographic entities (almost half of the entities)
• Some external datasets are imported, but Wikipedia (historically) gave much importance to human validation of imports
• Some support for federation in queries; and many external links
• Notability: essentially no policy currently
• Managing vandalism?
• Importance of references?
28/31
Accessing Wikidata data
• Simply by browsing
• Can retrieve in multiple formats, e.g., https://www.wikidata.org/wiki/Special:EntityData/Q42.json
• For simple queries (triple patterns), Linked data fragments https://query.wikidata.org/bigdata/ldf
• Wikimedia API, e.g., API for recent changes
• SPARQL queries, https://query.wikidata.org/ (and API)
• Weekly dumps in JSON, RDF, XML (around 50 GB compressed)
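As a sketch of two of these access paths, the helpers below build the request URLs without sending anything over the network. The entity-data pattern follows the example URL on this slide. The SPARQL endpoint path `/sparql`, the `format=json` parameter, and the sample query are assumptions that are not taken from the slides and should be checked against the query service documentation.

```python
from urllib.parse import urlencode

ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/"
# Assumed endpoint path; the slide only gives https://query.wikidata.org/.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def entity_data_url(entity_id, fmt="json"):
    """URL of the export of one entity in the given format."""
    return f"{ENTITY_DATA}{entity_id}.{fmt}"


def sparql_url(query):
    """GET URL carrying a URL-encoded SPARQL query (assumed parameters)."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


# Hypothetical query: a few items with some "educated at" (P69) value.
QUERY = "SELECT ?item WHERE { ?item wdt:P69 ?school } LIMIT 5"

print(entity_data_url("Q42"))
print(sparql_url(QUERY))
```

The resulting URLs can be fetched with any HTTP client; building them separately keeps the example runnable offline.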
29/31
Other cool Wikidata stuff
• Distributed Wikidata Game: crowdsourcing edits on Wikidata https://tools.wmflabs.org/wikidata-game/distributed/
• Reasonator: automatically generate a Wikipedia-like page from a Wikidata entity https://tools.wmflabs.org/reasonator/
• Lexemes: ongoing effort to add linguistic data to Wikidata
• OWL ontology: http://wikiba.se/ontology
• askplatyp.us: natural language question answering tool
• File captions on Wikimedia Commons to have a structured way to give labels to images (deployed on January 10)
• OpenRefine to reconcile datasets with Wikidata and add Wikidata facts https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Tutorials/Video
30/31
Slide acknowledgements
• Many thanks to Thomas Pellissier-Tanon for his helpful feedback
• Slide 4: https://en.wikipedia.org/wiki/FOAF_(ontology)
• Slide 5: https://www.w3.org/Submission/ccREL/
• Slides 8–10: https://schema.org/Event
• Slide 13: https://commons.wikimedia.org/wiki/File:DBpediaLogo.svg
• Slide 14: https://en.wikipedia.org/wiki/File:YAGO.svg
• Slide 15: https://commons.wikimedia.org/wiki/File:Freebase_Logo_optimised.svg
• Slides 16, 23: https://en.wikipedia.org/wiki/File:Wikidata-logo-en.svg
31/31