

Semantic Web 1 (2012) 1–5, IOS Press

DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia

Editor(s): Name Surname, University, Country
Solicited review(s): Name Surname, University, Country
Open review(s): Name Surname, University, Country

Jens Lehmann a,*, Robert Isele g, Max Jakob e, Anja Jentzsch d, Dimitris Kontokostas a, Pablo N. Mendes f, Sebastian Hellmann a, Mohamed Morsey a, Patrick van Kleef c, Sören Auer a, Christian Bizer b

a University of Leipzig, Institute of Computer Science, AKSW Group, Augustusplatz 10, D-04009 Leipzig, Germany. E-mail: lastname@informatik.uni-leipzig.de
b University of Mannheim, Research Group Data and Web Science, B6-26, D-68159 Mannheim, Germany. E-mail: chris@informatik.uni-mannheim.de
c OpenLink Software, 10 Burlington Mall Road, Suite 265, Burlington, MA 01803, USA. E-mail: pkleef@openlinksw.com
d Hasso-Plattner-Institute for IT-Systems Engineering, Prof.-Dr.-Helmert-Str. 2-3, D-14482 Potsdam, Germany. E-mail: mail@anjajentzsch.de
e Neofonie GmbH, Robert-Koch-Platz 4, D-10115 Berlin, Germany. E-mail: max.jakob@neofonie.de
f Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing, Wright State University, Dayton, USA. E-mail: pablo@knoesis.org
g Brox IT-Solutions GmbH, An der Breiten Wiese 9, D-30625 Hannover, Germany. E-mail: mail@robertisele.com

Abstract. The DBpedia community project extracts structured, multilingual knowledge from Wikipedia and makes it freely available on the Web using Semantic Web and Linked Data technologies. The project extracts knowledge from 111 different language editions of Wikipedia. The largest DBpedia knowledge base, which is extracted from the English edition of Wikipedia, consists of over 400 million facts that describe 3.7 million things. The DBpedia knowledge bases that are extracted from the other 110 Wikipedia editions together consist of 1.46 billion facts and describe 10 million additional things. The DBpedia project maps Wikipedia infoboxes from 27 different language editions to a single shared ontology consisting of 320 classes and 1,650 properties. The mappings are created via a world-wide crowd-sourcing effort and enable knowledge from the different Wikipedia editions to be combined. The project publishes regular releases of all DBpedia knowledge bases for download and provides SPARQL query access to 14 out of the 111 language editions via a global network of local DBpedia chapters. In addition to the regular releases, the project maintains a live knowledge base which is updated whenever a page in Wikipedia changes. DBpedia sets 27 million RDF links pointing into over 30 external data sources and thus enables data from these sources to be used together with DBpedia data. Several hundred data sets on the Web publish RDF links pointing to DBpedia themselves and thus make DBpedia one of the central interlinking hubs in the Linked Open Data (LOD) cloud. In this system report, we give an overview of the DBpedia community project, including its architecture, technical implementation, maintenance, internationalisation, usage statistics and applications.

Keywords: Knowledge Extraction, Wikipedia, Multilingual Knowledge Bases, Linked Data, RDF

1570-0844/12/$27.50 © 2012 – IOS Press and the authors. All rights reserved.


1. Introduction

Wikipedia is the 6th most popular website^1, the most widely used encyclopedia, and one of the finest examples of truly collaboratively created content. There are official Wikipedia editions in 287 different languages which range in size from a couple of hundred articles up to 3.8 million articles (English edition)^2. Besides free text, Wikipedia articles consist of different types of structured data such as infoboxes, tables, lists and categorization data. Wikipedia currently offers only free-text search capabilities to its users. Using Wikipedia search, it is thus very difficult to find all rivers that flow into the Rhine and are longer than 100 miles, or all Italian composers that were born in the 18th century.
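To illustrate how such questions become answerable once the infobox data is available as RDF, the following sketch runs the river query against the public DBpedia SPARQL endpoint. It assumes the dbo:River class and the dbo:riverMouth and dbo:length (in metres) properties of the current ontology; 100 miles are roughly 160,934 metres.

import requests

# Rivers that flow into the Rhine and are longer than 100 miles.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?river ?length WHERE {
    ?river a dbo:River ;
           dbo:riverMouth dbr:Rhine ;
           dbo:length ?length .
    FILTER (?length > 160934)
}
"""
resp = requests.get("http://dbpedia.org/sparql",
                    params={"query": QUERY,
                            "format": "application/sparql-results+json"})
for row in resp.json()["results"]["bindings"]:
    print(row["river"]["value"], row["length"]["value"])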

The DBpedia project builds a large-scale, multilingual knowledge base by extracting structured data from Wikipedia editions in 111 languages. This knowledge base can be used to answer expressive queries such as the ones outlined above. Being multilingual and covering a wide range of topics, the DBpedia knowledge base is also useful within further application domains such as data integration, named entity recognition, topic detection and document ranking.

The DBpedia knowledge base is widely used as a testbed in the research community, and numerous applications, algorithms and tools have been built around or applied to DBpedia. DBpedia is served as Linked Data on the Web. Since it covers a wide variety of topics and sets RDF links pointing into various external data sources, many Linked Data publishers have decided to set RDF links pointing to DBpedia from their data sets. Thus, DBpedia has developed into a central interlinking hub in the Web of Linked Data and has been a key factor for the success of the Linked Open Data initiative.

The structure of the DBpedia knowledge base is maintained by the DBpedia user community. Most importantly, the community creates mappings from Wikipedia information representation structures to the DBpedia ontology. This ontology – which will be explained in detail in Section 3 – unifies different template structures, both within single Wikipedia language editions and across currently 27 different languages.

* Corresponding author. E-mail: lehmann@informatik.uni-leipzig.de

^1 http://www.alexa.com/topsites, retrieved in October 2013.

^2 http://meta.wikimedia.org/wiki/List_of_Wikipedias

The maintenance of different language editions of DBpedia is spread across a number of organisations. Each organisation is responsible for the support of a certain language. The local DBpedia chapters are coordinated by the DBpedia Internationalisation Committee.

The aim of this system report is to provide a description of the DBpedia community project, including the architecture of the DBpedia extraction framework, its technical implementation, maintenance, internationalisation, usage statistics, as well as presenting some popular DBpedia applications. This system report is a comprehensive update and extension of previous project descriptions in [1] and [5]. The main advances compared to these articles are:

– The concept and implementation of the extraction based on a community-curated DBpedia ontology.
– The wide internationalisation of DBpedia.
– A live synchronisation module which processes updates in Wikipedia as well as the DBpedia ontology and allows third parties to keep their copies of DBpedia up-to-date.
– A description of the maintenance of public DBpedia services and statistics about their usage.
– An increased number of interlinked data sets which can be used to further enrich the content of DBpedia.
– The discussion and summary of novel third party applications of DBpedia.

Overall, DBpedia has undergone seven years of continuous evolution. Table 17 provides an overview of the project's timeline.

The system report is structured as follows: In the next section, we describe the DBpedia extraction framework, which forms the technical core of DBpedia. This is followed by an explanation of the community-curated DBpedia ontology with a focus on multilingual support. In Section 4, we explicate how DBpedia is synchronised with Wikipedia with just very short delays and how updates are propagated to DBpedia mirrors employing the DBpedia Live system. Subsequently, we give an overview of the external data sets that are interlinked from DBpedia or that set RDF links pointing to DBpedia themselves (Section 5). In Section 6, we provide statistics on the usage of DBpedia and describe the maintenance of a large-scale public data set. Within Section 7, we briefly describe several use cases and applications of DBpedia in a variety of different areas. Finally, we report on related work in Section 8 and give an outlook on the further development of DBpedia in Section 9.


Fig. 1. Overview of the DBpedia extraction framework.

2. Extraction Framework

Wikipedia articles consist mostly of free text, but also comprise various types of structured information in the form of wiki markup. Such information includes infobox templates, categorisation information, images, geo-coordinates, links to external web pages, disambiguation pages, redirects between pages, and links across different language editions of Wikipedia. The DBpedia extraction framework extracts this structured information from Wikipedia and turns it into a rich knowledge base. In this section, we give an overview of the DBpedia knowledge extraction framework.

2.1. General Architecture

Figure 1 shows an overview of the technical framework. The DBpedia extraction is structured into four phases:

Input: Wikipedia pages are read from an external source. Pages can either be read from a Wikipedia dump or directly fetched from a MediaWiki installation using the MediaWiki API.

Parsing: Each Wikipedia page is parsed by the wiki parser. The wiki parser transforms the source code of a Wikipedia page into an Abstract Syntax Tree.

Extraction: The Abstract Syntax Tree of each Wikipedia page is forwarded to the extractors. DBpedia offers extractors for many different purposes, for instance, to extract labels, abstracts or geographical coordinates. Each extractor consumes an Abstract Syntax Tree and yields a set of RDF statements.

Output: The collected RDF statements are written to a sink. Different formats, such as N-Triples, are supported.
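The four phases map naturally onto a small pipeline. The sketch below is a deliberately simplified, hypothetical rendering of this control flow; the actual framework is written in Scala (cf. Section 2.7), and these names are illustrative only.

# Hypothetical sketch of the four-phase extraction pipeline.
def run_extraction(source, parser, extractors, sink):
    for page in source:                # Input: dump file or MediaWiki API
        ast = parser.parse(page)       # Parsing: wiki markup -> Abstract Syntax Tree
        for extractor in extractors:   # Extraction: each extractor consumes the AST
            for triple in extractor.extract(ast):
                sink.write(triple)     # Output: e.g. an N-Triples sink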

2.2. Extractors

The DBpedia extraction framework employs various extractors for translating different parts of Wikipedia pages to RDF statements. A list of all available extractors is shown in Table 1. DBpedia extractors can be divided into four categories:

Mapping-Based Infobox Extraction: The mapping-based infobox extraction uses manually written mappings that relate infoboxes in Wikipedia to terms in the DBpedia ontology. The mappings also specify a datatype for each infobox property and thus help the extraction framework to produce high-quality data. The mapping-based extraction will be described in detail in Section 2.4.

Raw Infobox Extraction: The raw infobox extraction provides a direct mapping from infoboxes in Wikipedia to RDF. As the raw infobox extraction does not rely on explicit extraction knowledge in the form of mappings, the quality of the extracted data is lower. The raw infobox data is useful if a specific infobox has not been mapped yet and thus is not available in the mapping-based extraction.

Feature Extraction: The feature extraction uses a number of extractors that are specialized in extracting a single feature from an article, such as a label or geographic coordinates.


Statistical Extraction: Some NLP-related extractors aggregate data from all Wikipedia pages in order to provide data that is based on statistical measures of page links or word counts, as further described in Section 2.6.

2.3. Raw Infobox Extraction

The type of Wikipedia content that is most valuable for the DBpedia extraction are infoboxes. Infoboxes are frequently used to list an article's most relevant facts as a table of attribute-value pairs on the top right-hand side of the Wikipedia page (for right-to-left languages on the top left-hand side). Infoboxes that appear in a Wikipedia article are based on a template that specifies a list of attributes that can form the infobox. A wide range of infobox templates are used in Wikipedia. Common examples are templates for infoboxes that describe persons, organisations or automobiles. As Wikipedia's infobox template system has evolved over time, different communities of Wikipedia editors use different templates to describe the same type of things (e.g. Infobox city japan, Infobox swiss town and Infobox town de). In addition, different templates use different names for the same attribute (e.g. birthplace and placeofbirth). As many Wikipedia editors do not strictly follow the recommendations given on the page that describes a template, attribute values are expressed using a wide range of different formats and units of measurement. An excerpt of an infobox that is based on a template for describing automobiles is shown below:

{{Infobox automobile
| name = Ford GT40
| manufacturer = [[Ford Advanced Vehicles]]
| production = 1964-1969
| engine = 4181cc
(...)

In this infobox, the first line specifies the infobox type and the subsequent lines specify various attributes of the described entity.

An excerpt of the extracted data is as follows:

dbr:Ford_GT40
  dbp:name "Ford GT40"@en ;
  dbp:manufacturer dbr:Ford_Advanced_Vehicles ;
  dbp:engine 4181 ;
  dbp:production 1964 ;
(...)

This extraction output has weaknesses: The resource is not associated to a class in the ontology, and parsed values are cleaned up and assigned a datatype based on heuristics. In particular, the raw infobox extractor searches for values in the following order: dates, coordinates, numbers, links and strings as default. Thus, the datatype assignment for the same property in different resources is non-deterministic. The engine, for example, is extracted as a number, but if another instance of the template used "cc4181", it would be extracted as a string. This behaviour makes querying for properties in the dbp namespace inconsistent. Those problems can be overcome by the mapping-based infobox extraction presented in the next subsection.
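A minimal sketch of this ordered guessing strategy is shown below (hypothetical code, not the framework's implementation). It makes the non-determinism concrete: the same conceptual value is typed differently depending on its surface form.

import re

def guess_value(raw):
    # Try parsers in the fixed order: dates, coordinates, numbers, links, string.
    raw = raw.strip()
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", raw):
        return (raw, "xsd:date")
    if re.fullmatch(r"-?\d+(\.\d+)? -?\d+(\.\d+)?", raw):
        return (raw, "georss:point")
    m = re.match(r"-?\d+(\.\d+)?", raw)      # a leading number wins
    if m:
        return (m.group(), "xsd:double")     # "4181cc" -> 4181 as a number
    if raw.startswith("[[") and raw.endswith("]]"):
        return (raw[2:-2], "object link")
    return (raw, "xsd:string")               # "cc4181" -> plain string

print(guess_value("4181cc"))   # ('4181', 'xsd:double')
print(guess_value("cc4181"))   # ('cc4181', 'xsd:string')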

2.4. Mapping-Based Infobox Extraction

In order to homogenize the description of information in the knowledge base, in 2010 a community effort was initiated to develop an ontology schema and mappings from Wikipedia infobox properties to this ontology. The alignment between Wikipedia infoboxes and the ontology is performed via community-provided mappings that help to normalize name variations in properties and classes. Heterogeneity in the Wikipedia infobox system, like using different infoboxes for the same type of entity or using different property names for the same property (cf. Section 2.3), can be alleviated in this way. This significantly increases the quality of the raw Wikipedia infobox data by typing resources, merging name variations and assigning specific datatypes to the values.

This effort is realized using the DBpedia Mappings Wiki^3, a MediaWiki installation set up to enable users to collaboratively create and edit mappings. These mappings are specified using the DBpedia Mapping Language. The mapping language makes use of MediaWiki templates that define DBpedia ontology classes and properties as well as template/table to ontology mappings. A mapping assigns a type from the DBpedia ontology to the entities that are described by the corresponding infobox. In addition, attributes in the infobox are mapped to properties in the DBpedia ontology. In the following, we show a mapping that maps infoboxes that use the Infobox automobile template to the DBpedia ontology:

{{TemplateMapping
| mapToClass = Automobile
| mappings =
  {{PropertyMapping
  | templateProperty = name

^3 http://mappings.dbpedia.org


Table 1. Overview of the DBpedia extractors (cf. Table 16 for a complete list of prefixes).

abstract – Extracts the first lines of the Wikipedia article.
  Example: dbr:Berlin dbo:abstract "Berlin is the capital city of (...)" .
article categories – Extracts the categorization of the article.
  Example: dbr:Oliver_Twist dc:subject dbr:Category:English_novels .
category label – Extracts labels for categories.
  Example: dbr:Category:English_novels rdfs:label "English novels" .
category hierarchy – Extracts information about which concept is a category and how categories are related, using the SKOS vocabulary.
  Example: dbr:Category:World_War_II skos:broader dbr:Category:Modern_history .
disambiguation – Extracts disambiguation links.
  Example: dbr:Alien dbo:wikiPageDisambiguates dbr:Alien_(film) .
external links – Extracts links to external web pages related to the concept.
  Example: dbr:Animal_Farm dbo:wikiPageExternalLink <http://books.google.com/?id=RBGmrDnBs8UC> .
geo coordinates – Extracts geo-coordinates.
  Example: dbr:Berlin georss:point "52.5006 13.3989" .
grammatical gender – Extracts grammatical genders for persons.
  Example: dbr:Abraham_Lincoln foaf:gender "male" .
homepage – Extracts links to the official homepage of an instance.
  Example: dbr:Alabama foaf:homepage <http://alabama.gov/> .
image – Extracts the first image of a Wikipedia page.
  Example: dbr:Berlin foaf:depiction <http://.../Overview_Berlin.jpg> .
infobox – Extracts all properties from all infoboxes.
  Example: dbr:Animal_Farm dbp:date "March 2010" .
interlanguage – Extracts interwiki links.
  Example: dbr:Albedo dbo:wikiPageInterLanguageLink dbr-de:Albedo .
label – Extracts the article title as label.
  Example: dbr:Berlin rdfs:label "Berlin" .
lexicalizations – Extracts information about surface forms and their association with concepts (only N-Quad format).
  Example: dbr:Pine sptl:lexicalization lx:pine_tree ls:Pine_pine_tree . lx:pine_tree rdfs:label "pine tree" . ls:Pine_pine_tree sptl:pUriGivenSf 0.941 .
mappings – Extraction based on mappings of Wikipedia infoboxes to the DBpedia ontology.
  Example: dbr:Berlin dbo:country dbr:Germany .
page ID – Extracts page ids of articles.
  Example: dbr:Autism dbo:wikiPageID 25 .
page links – Extracts all links between Wikipedia articles.
  Example: dbr:Autism dbo:wikiPageWikiLink dbr:Human_brain .
persondata – Extracts information about persons represented using the PersonData template.
  Example: dbr:Andre_Agassi foaf:birthDate "1970-04-29" .
PND – Extracts PND (Personennamendatei) data about a person.
  Example: dbr:William_Shakespeare dbo:individualisedPnd "118613723" .
redirects – Extracts redirect links between articles in Wikipedia.
  Example: dbr:ArtificialLanguages dbo:wikiPageRedirects dbr:Constructed_language .
revision ID – Extracts the revision ID of the Wikipedia article.
  Example: dbr:Autism <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/Autism?oldid=495234324> .
thematic concept – Extracts 'thematic' concepts, the centres of discussion for categories.
  Example: dbr:Category:Music skos:subject dbr:Music .
topic signatures – Extracts topic signatures.
  Example: dbr:Alkane sptl:topicSignature "carbon alkanes atoms" .
wiki page – Extracts links to corresponding articles in Wikipedia.
  Example: dbr:AnAmericanInParis foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/AnAmericanInParis> .


| ontologyProperty = foaf:name }}
{{PropertyMapping
| templateProperty = manufacturer
| ontologyProperty = manufacturer }}
{{DateIntervalMapping
| templateProperty = production
| startDateOntologyProperty = productionStartDate
| endDateOntologyProperty = productionEndDate }}
{{IntermediateNodeMapping
| nodeClass = AutomobileEngine
| correspondingProperty = engine
| mappings =
  {{PropertyMapping
  | templateProperty = engine
  | ontologyProperty = displacement
  | unit = Volume }}
  {{PropertyMapping
  | templateProperty = engine
  | ontologyProperty = powerOutput
  | unit = Power }}
(...)

The RDF statements that are extracted from the previous infobox example are shown below. As we can see, the production period is correctly split into a start year and an end year, and the engine is represented by a distinct RDF node. It is worth mentioning that all values are canonicalized to basic units. For example, in the engine mapping we state that engine is a Volume and thus the extractor converts "4181cc" (cubic centimeters) to cubic meters ("0.004181"). Additionally, there can exist multiple mappings on the same property that search for different datatypes or different units, for example a number with "PS" as a suffix for engine:

dbr:Ford_GT40
  rdf:type dbo:Automobile ;
  rdfs:label "Ford GT40"@en ;
  dbo:manufacturer dbr:Ford_Advanced_Vehicles ;
  dbo:productionStartYear "1964"^^xsd:gYear ;
  dbo:productionEndYear "1969"^^xsd:gYear ;
  dbo:engine [
    rdf:type AutomobileEngine ;
    dbo:displacement 0.004181
  ]
(...)

The DBpedia Mappings Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is also used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 2 shows how the infobox properties author and συγγραφέας – author in Greek – are both mapped to the global identifier dbo:author. That means, in turn, that information from all language versions of DBpedia can be merged, and DBpedias for smaller languages can be augmented with knowledge from larger DBpedias, such as the English edition. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns, which is often only present in the corresponding language edition [43].

Besides hosting the mappings and the DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

– Mapping Syntax Validator: The mapping syntax validator checks for syntactic correctness and highlights inconsistencies such as missing property definitions.
– Extraction Tester: The extraction tester, linked on each mapping page, tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.
– Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users to create and edit mappings.

2.5. URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

– http://dbpedia.org/resource (prefix dbr) for representing article data. There is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin^4, DBpedia will produce the URI dbr:Berlin. Exceptions to this rule appear when intermediate nodes are extracted from the mapping-based infobox extractor as unique URIs (e.g. the engine mapping example in Section 2.4).

– http://dbpedia.org/property (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.

^4 http://en.wikipedia.org/wiki/Berlin


Fig. 2. Depiction of the mapping from the Greek (left) and English Wikipedia templates (right) about books to the same DBpedia ontology class (middle) [24].

– http://dbpedia.org/ontology (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links^5. For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach omitted valuable data [24]. Thus, starting from the DBpedia 3.7 release^6, two types of data sets were generated: The localized data sets contain all things that are described in a specific language. Within the datasets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource for article data and http://<lang>.dbpedia.org/property for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic, language-agnostic namespace http://dbpedia.org/resource.

^5 http://en.wikipedia.org/wiki/Help:Interlanguage_links

^6 A list of all DBpedia releases is provided in Table 17.
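The effect of the two flavours can be observed directly: the same entity is reachable under a localized URI at a chapter endpoint and, if an English article exists, under the canonical URI at the main endpoint. A small sketch, assuming the public chapter endpoints follow the URI scheme above:

import requests

QUERY = "SELECT ?p ?o WHERE { <%s> ?p ?o } LIMIT 5"

pairs = [("http://el.dbpedia.org/sparql",
          "http://el.dbpedia.org/resource/Βερολίνο"),   # localized (Greek)
         ("http://dbpedia.org/sparql",
          "http://dbpedia.org/resource/Berlin")]        # canonicalized

for endpoint, uri in pairs:
    rows = requests.get(endpoint,
                        params={"query": QUERY % uri,
                                "format": "application/sparql-results+json"}
                       ).json()["results"]["bindings"]
    print(uri, "->", len(rows), "sample triples")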

2.6. NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [33]. Currently, four datasets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concept. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concept extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The Lexicalization data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones and also show how ambiguous a name is with respect to all possible concepts that it can mean.

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the tf-idf weight with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [30].
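The following toy sketch reproduces the idea of this weighting; it is not the project's actual implementation, and the crude suffix-stripping merely stands in for proper stemming.

import math, re
from collections import Counter

def stem(word):
    return re.sub(r"(ing|ed|s)$", "", word.lower())   # crude stemmer stand-in

def topic_signature(texts, resource, top_k=3):
    # texts: dict mapping a DBpedia resource to its Wikipedia article text.
    docs = {r: Counter(stem(w) for w in re.findall(r"[A-Za-z]+", t))
            for r, t in texts.items()}
    tf = docs[resource]
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for d in docs.values() if term in d)
        scores[term] = freq * math.log(len(docs) / df)   # tf-idf weight
    return [t for t, s in sorted(scores.items(), key=lambda x: -x[1])[:top_k]]

texts = {"dbr:Alkane": "alkanes contain carbon atoms in carbon chains",
         "dbr:Berlin": "Berlin is the capital city of Germany"}
print(topic_signature(texts, "dbr:Alkane"))   # e.g. ['carbon', ...]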

There are two more feature extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the centre of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (subject, object, possessive adjective, possessive pronoun and reflexive), i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
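A sketch of the pronoun-counting heuristic (a simplification of what the extractor does on pages mapped to dbo:Person):

from collections import Counter
import re

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}

def grammatical_gender(article_text):
    # Count gender-specific pronouns and return the dominating gender.
    words = Counter(re.findall(r"[a-z]+", article_text.lower()))
    m = sum(words[w] for w in MASCULINE)
    f = sum(words[w] for w in FEMININE)
    if m == f:
        return None                     # no dominating gender
    return "male" if m > f else "female"

print(grammatical_gender("He was born in 1809. His early life ..."))  # male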

2.7. Summary of Other Recent Developments

In this section, we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework was rewritten in Scala in 2010^7, resulting in an improvement of the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. This is crucial for the DBpedia Live project, which will be explained in Section 4. The new, more modular framework also allows extracting data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

^7 Table 17 provides an overview of the project's evolution through time.

In addition there were several smaller improvementsand general maintenance Overall over the past fouryears the parsing of the MediaWiki markup improvedquite a lot which led to better overall coverage forexample concerning references and parser functionsIn addition the collection of MediaWiki namespaceidentifiers for many languages is now performed semi-automatically leading to a high accuracy of detectionThis concerns common title prefixes such as User FileTemplate Help Portal etc in English that indicatepages that do not contain encyclopedic content andwould produce noise in the data They are important forspecific extractors as well for instance the category hi-erarchy data set is produced from pages of the Categorynamespace Furthermore the output of the extractionsystem now supports more formats and several compli-ance issues regarding URIs IRIs N-Triples and Turtlewere fixed

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain-text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes now exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.), and for a number of classes and properties, specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the PHP framework, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links and hence increased the overall interconnectivity of resources in the DBpedia ontology.


Fig. 3. Snapshot of a part of the DBpedia ontology: the top classes (owl:Thing, dbo:Agent, dbo:Person, dbo:Athlete, dbo:Place, dbo:PopulatedPlace, dbo:Settlement, dbo:Organisation, dbo:Work, dbo:Species, dbo:Eukaryote) connected via rdfs:subClassOf, together with selected object properties (e.g. dbo:birthPlace, dbo:writer, dbo:producer, dbo:location, dbo:family) and datatype properties (e.g. dbo:birthDate, dbo:deathDate, dbo:populationTotal, dbo:areaTotal, dbo:elevation, dbo:utcOffset, dbo:areaCode, dbo:conservationStatus, dbo:releaseDate, dbo:runtime).

Finally, a new heuristic to increase the connectiveness of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.
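A sketch of this heuristic (hypothetical code; it assumes the page's hyperlinks are available as (anchor text, target) pairs):

def link_unlinked_value(infobox_value, page_links):
    # page_links: list of (anchor_text, target_article) from the same page.
    # If the plain string value matches an anchor text, replace it with the
    # link target, turning a literal into an object property assertion.
    for anchor, target in page_links:
        if anchor == infobox_value:
            return target
    return infobox_value   # keep the string if no matching anchor exists

links = [("Ford Advanced Vehicles", "dbr:Ford_Advanced_Vehicles")]
print(link_unlinked_value("Ford Advanced Vehicles", links))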

Orthogonal to the previously mentioned improvements, there have been various efforts to assess the quality of the DBpedia datasets. [26] developed a framework for estimating the quality of DBpedia, and a sample of 75 resources was analysed. A more comprehensive effort was performed in [48] by providing a distributed web-based interface [25] for quality assessment. In this study, 17 data quality problem types were analysed by 58 users, covering 521 resources in DBpedia.

3. DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances. The complete DBpedia ontology can be browsed online at http://mappings.dbpedia.org/server/ontology/classes.

Fig. 4. Growth of the DBpedia ontology: number of classes and properties per DBpedia version (3.2 to 3.8).

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing much, due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1. Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

Fig. 5. Mapping coverage statistics for all mapping-enabled languages.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarters of that year show a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Fig. 7. English property mappings occurrence frequency (both axes are in log scale).

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long-tail distribution: a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2. Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been created so far in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of popular classes from the third and fourth hierarchy levels of the ontology within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12,936 persons are described in five languages but not in six or more languages. The number 871,630 for the class Person means that all 20 language versions together describe 871,630 different persons.


Fig. 6. Mapping community activity for (a) the ontology and (b) the 10 most active language editions.

Table 2. Basic statistics about localized DBpedia editions.

Lang. | Inst. LD all | Inst. CD all | Inst. with MD CD | Raw Prop. CD | Map. Prop. CD | Raw Statem. CD | Map. Statem. CD
en | 3,769,926 | 3,769,926 | 2,359,521 | 48,293 | 1,313 | 65,143,840 | 33,742,015
de | 1,243,771 | 650,037 | 204,335 | 9,593 | 261 | 7,603,562 | 2,880,381
fr | 1,197,334 | 740,044 | 214,953 | 13,551 | 228 | 8,854,322 | 2,901,809
it | 882,127 | 580,620 | 383,643 | 9,716 | 181 | 12,227,870 | 4,804,731
es | 879,091 | 542,524 | 310,348 | 14,643 | 476 | 7,740,458 | 4,383,206
pl | 848,298 | 538,641 | 344,875 | 7,306 | 266 | 7,696,193 | 4,511,794
ru | 822,681 | 439,605 | 123,011 | 13,522 | 76 | 6,973,305 | 1,389,473
pt | 699,446 | 460,258 | 272,660 | 12,851 | 602 | 6,255,151 | 4,005,527
ca | 367,362 | 241,534 | 112,934 | 8,696 | 183 | 3,689,870 | 1,301,868
cs | 225,133 | 148,819 | 34,893 | 5,564 | 334 | 1,857,230 | 474,459
hu | 209,180 | 138,998 | 63,441 | 6,821 | 295 | 2,506,399 | 601,037
ko | 196,132 | 124,591 | 30,962 | 7,095 | 419 | 1,035,606 | 417,605
tr | 187,850 | 106,644 | 40,438 | 7,512 | 440 | 1,350,679 | 556,943
ar | 165,722 | 103,059 | 16,236 | 7,898 | 268 | 635,058 | 168,686
eu | 132,877 | 108,713 | 41,401 | 2,245 | 19 | 2,255,897 | 532,709
sl | 129,834 | 73,099 | 22,036 | 4,235 | 470 | 1,213,801 | 222,447
bg | 125,762 | 87,679 | 38,825 | 3,984 | 274 | 774,443 | 488,678
hr | 109,890 | 71,469 | 10,343 | 3,334 | 158 | 701,182 | 151,196
el | 71,936 | 48,260 | 10,813 | 2,866 | 288 | 206,460 | 113,838

This number is higher than the number of persons described in the canonicalized English infobox data set (763,643, listed in Table 3), since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195,263 persons are described in two or more languages. The large difference of this number compared to the total number of 871,630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).

3.3. Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [20] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages^8. The DBpedia 3.7 release^9 in September 2011 was the first DBpedia release to use the localized I18n (Internationalisation) DBpedia extraction framework [24].

^8 Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr), Urdu (ur).

^9 http://blog.dbpedia.org/2011/09/11/


Table 3. Number of instances per class within 10 localized DBpedia versions.

Class | en | it | pl | es | pt | fr | de | ru | ca | hu
Person | 763,643 | 145,060 | 70,708 | 65,337 | 43,057 | 62,942 | 33,122 | 18,620 | 7,107 | 15,529
  Athlete | 185,126 | 47,187 | 30,332 | 19,482 | 14,130 | 21,646 | 31,237 | 0 | 721 | 4,527
  Artist | 61,073 | 12,511 | 16,120 | 25,992 | 10,571 | 13,465 | 0 | 0 | 2,004 | 3,821
  Politician | 23,096 | 0 | 8,943 | 5,513 | 3,342 | 0 | 0 | 12,004 | 1,376 | 760
Place | 572,728 | 141,101 | 182,727 | 132,961 | 116,660 | 80,602 | 131,766 | 67,932 | 73,078 | 18,324
  PopulPlace | 387,166 | 138,077 | 167,034 | 121,204 | 109,418 | 72,252 | 79,410 | 63,826 | 72,743 | 15,535
  Building | 60,514 | 1,270 | 2,946 | 3,570 | 803 | 921 | 83 | 43 | 0 | 527
  River | 24,267 | 0 | 1,383 | 2 | 4,149 | 3,333 | 6,707 | 3,924 | 0 | 565
Organisation | 192,832 | 4,142 | 12,193 | 11,710 | 10,949 | 17,513 | 16,973 | 1,598 | 1,399 | 3,993
  Company | 44,516 | 4,142 | 2,566 | 975 | 1,903 | 5,832 | 7,200 | 0 | 440 | 618
  EducInst | 42,270 | 0 | 599 | 1,207 | 270 | 1,636 | 2,171 | 1,010 | 0 | 115
  Band | 27,061 | 0 | 2,993 | 0 | 4,476 | 3,868 | 5,368 | 0 | 263 | 802
Work | 333,269 | 51,918 | 32,386 | 36,484 | 33,869 | 39,195 | 18,195 | 34,363 | 5,240 | 9,777
  MusicWork | 159,070 | 23,652 | 14,987 | 15,911 | 17,053 | 16,206 | 0 | 6,588 | 697 | 4,704
  Film | 71,715 | 17,210 | 9,467 | 9,396 | 8,896 | 9,741 | 15,038 | 12,045 | 3,859 | 2,320
  Software | 27,947 | 5,682 | 3,050 | 4,833 | 3,419 | 5,733 | 2,368 | 0 | 606 | 857

Table 4. Cross-language overlap: number of instances that are described in multiple languages.

Class | Instances | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+
Person | 871,630 | 676,367 | 94,339 | 42,382 | 21,647 | 12,936 | 8,198 | 5,295 | 3,437 | 2,391 | 4,638
Place | 643,260 | 307,729 | 150,349 | 45,836 | 36,339 | 20,831 | 13,523 | 20,808 | 31,422 | 11,262 | 5,161
Organisation | 206,670 | 160,398 | 22,661 | 9,312 | 5,002 | 3,221 | 2,072 | 1,421 | 928 | 594 | 1,061
Work | 360,808 | 243,706 | 54,855 | 23,097 | 12,605 | 8,277 | 5,732 | 4,007 | 2,911 | 1,995 | 3,623

At the time of writing, DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish^10. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services.

While at the moment chapters are defined by ownership of the IP and server of the subdomain A record (e.g. http://ko.dbpedia.org) given by the DBpedia maintainers, the DBpedia Internationalisation Committee^11 is manifesting its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek^12 and Dutch^13), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].

^10 Accessed on 25/09/2013: http://wiki.dbpedia.org/Internationalization/Chapters

^11 http://wiki.dbpedia.org/Internationalization

^12 http://el.dbpedia.org

^13 http://nl.dbpedia.org

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with a higher impact on the coverage.

The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is to achieving total coverage (as shown in Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4. Live Synchronisation

Wikipedia articles are continuously revised at a very high rate; e.g. the English Wikipedia, in June 2013, had approximately 3.3 million edits per month, which is equal to 77 edits per minute^14. This high change frequency leads to DBpedia data quickly becoming outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [20,36]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported^15. Meanwhile, DBpedia Live for Dutch^16 was developed.

4.1. DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [27]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework, to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.
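A minimal sketch of such a consumer is shown below. It polls an OAI-PMH endpoint with the standard ListRecords verb and follows resumption tokens; the endpoint URL and the metadata prefix are illustrative, and the real proxy is a Java component.

import time
import requests
import xml.etree.ElementTree as ET

OAI = "http://example.org/wikipedia-oai"   # illustrative endpoint URL
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def poll_updates(handle_record, interval=10):
    token = None
    while True:
        params = ({"verb": "ListRecords", "resumptionToken": token} if token
                  else {"verb": "ListRecords", "metadataPrefix": "mediawiki"})
        root = ET.fromstring(requests.get(OAI, params=params).content)
        for record in root.iter(OAI_NS + "record"):
            handle_record(record)          # feed the page update to the extractor
        tok = root.find(".//" + OAI_NS + "resumptionToken")
        token = tok.text if tok is not None and tok.text else None
        if token is None:
            time.sleep(interval)           # caught up; wait before re-polling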

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

^14 http://stats.wikimedia.org/EN/SummaryEN.htm

^15 http://live.dbpedia.org

^16 http://live.nl.dbpedia.org

Fig. 8. Overview of the DBpedia Live extraction framework.

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.
– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as a secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.
– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.
– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2. Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.
– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separate data from different sources.


Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (e.g. dbo:translator) or an existing one (e.g. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages: Extraction framework improvements or the activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and another set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
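The harvesting logic can be sketched as follows (simplified; the mirror's endpoint URL is illustrative). Each changeset pair is downloaded, decompressed and replayed as SPARQL Update operations:

import gzip
import requests

ENDPOINT = "http://localhost:8890/sparql"   # illustrative mirror endpoint

def apply_changeset(removed_url, added_url):
    # Replay one changeset pair: first delete the removed triples,
    # then insert the added ones.
    for url, op in ((removed_url, "DELETE DATA"), (added_url, "INSERT DATA")):
        ntriples = gzip.decompress(requests.get(url).content).decode("utf-8")
        requests.post(ENDPOINT,
                      data={"update": "%s { %s }" % (op, ntriples)}
                      ).raise_for_status()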

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and the mapping-based infobox extraction.

5. Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing to DBpedia resources.

5.1. Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository^17 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

An example for the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia)^18:

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
  } }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

^17 https://github.com/dbpedia/dbpedia-links

^18 Endpoint: http://fts.publicdata.eu/sparql; results: http://bit.ly/1c2mIwQ


Table 5. Data sets linked from DBpedia.

Data set | Predicate | Count | Tool
Amsterdam Museum | owl:sameAs | 627 | S
BBC Wildlife Finder | owl:sameAs | 444 | S
Book Mashup | rdf:type, owl:sameAs | 9,100 |
Bricklink | dc:publisher | 10,100 |
CORDIS | owl:sameAs | 314 | S
Dailymed | owl:sameAs | 894 | S
DBLP Bibliography | owl:sameAs | 196 | S
DBTune | owl:sameAs | 838 | S
Diseasome | owl:sameAs | 2,300 | S
Drugbank | owl:sameAs | 4,800 | S
EUNIS | owl:sameAs | 3,100 | S
Eurostat (Linked Stats) | owl:sameAs | 253 | S
Eurostat (WBSG) | owl:sameAs | 137 |
CIA World Factbook | owl:sameAs | 545 | S
flickr wrappr | dbp:hasPhotoCollection | 3,800,000 | C
Freebase | owl:sameAs | 3,600,000 | C
GADM | owl:sameAs | 1,900 |
GeoNames | owl:sameAs | 86,500 | S
GeoSpecies | owl:sameAs | 16,000 | S
GHO | owl:sameAs | 196 | L
Project Gutenberg | owl:sameAs | 2,500 | S
Italian Public Schools | owl:sameAs | 5,800 | S
LinkedGeoData | owl:sameAs | 103,600 | S
LinkedMDB | owl:sameAs | 13,800 | S
MusicBrainz | owl:sameAs | 23,000 |
New York Times | owl:sameAs | 9,700 |
OpenCyc | owl:sameAs | 27,100 | C
OpenEI (Open Energy) | owl:sameAs | 678 | S
Revyu | owl:sameAs | 6 |
Sider | owl:sameAs | 2,000 | S
TCMGeneDIT | owl:sameAs | 904 |
UMBEL | rdf:type | 896,400 |
US Census | owl:sameAs | 12,600 |
WikiCompany | owl:sameAs | 8,300 |
WordNet | dbp:wordnet_type | 467,100 |
YAGO2 | rdf:type | 18,100,000 |
Sum | | 27,211,732 |

In addition to providing outgoing links on the instance level, DBpedia also sets links on the schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.

5.2. Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub^19. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI; e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: The data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia.

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub.20 However, it crawls RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with the most incoming links.

19 See http://wiki.dbpedia.org/Interlinking for details.

20 http://datahub.io


Table 6: Top 10 data sets in Sindice, ordered by the number of links to DBpedia

    Data set          Link Predicate Count   Link Count
    okaboo.com                         4      2,407,121
    tfri.gov.tw                       57        837,459
    naplesplus.us                      2        212,460
    fu-berlin.de                       7        145,322
    freebase.com                     108        138,619
    geonames.org                       3        125,734
    opencyc.org                        3         19,648
    geospecies.org                    10         16,188
    dbrec.net                          3         12,856
    faviki.com                         5         12,768

Table 7: Sindice summary statistics for incoming links to DBpedia

    Metric                          Value
    Total links                 3,960,212
    Total distinct data sets          248
    Total distinct predicates         345

Table 8: Top 10 datasets by incoming links in Sindice

    Domain                Datasets       Links
    purl.org                   498   6,717,520
    dbpedia.org                248   3,960,212
    creativecommons.org      2,483   3,030,910
    identi.ca                1,021   2,359,276
    l3s.de                      34   1,261,487
    rkbexplorer.com             24   1,212,416
    nytimes.com                 27   1,174,941
    w3.org                     405     658,535
    geospecies.org              13     523,709
    livejournal.com         14,881     366,025

Those are authorities in the network structure of the Web of Data, and DBpedia is currently ranked second in terms of incoming links.

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms. First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9: The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter (2012-Q1 to 2012-Q4).

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity as well. The top five languages with respect to download volume are English, Chinese, German, Catalan and French, in that order. The download count and download volume of the English language edition are shown in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1000 times per month.21

Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Presumably, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.
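As a hedged illustration (assuming the pagelinks dump has been loaded into a local triple store), the out-links of a single page can then be queried via the dbo:wikiPageWikiLink predicate used by this data set:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # Out-links of the Wikipedia page on Leipzig, as recorded in the
    # pagelinks data set.
    SELECT ?target
    WHERE { dbr:Leipzig dbo:wikiPageWikiLink ?target }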

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple

21 The IP address was only filtered for that specific file and month in those cases.

22 http://dbpedia.org/sparql


Fig. 10: The download count (in thousands) and download volume (in GB) of the DBpedia data sets.

Table 9: Hardware of the machines serving the public SPARQL endpoint

    DBpedia     Configuration
    3.3 - 3.4   AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB RAM
    3.5 - 3.7   Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB RAM
    3.8         Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB RAM

machines, or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200-second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.
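For example, a client can page through a large result set in chunks that stay below the 50,000-row cap. The following sketch retrieves the third page of 10,000 persons each; ORDER BY keeps the pagination stable across requests:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>

    # Third page of 10,000 results each.
    SELECT ?person
    WHERE { ?person rdf:type dbo:Person }
    ORDER BY ?person
    LIMIT 10000
    OFFSET 20000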

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced.

Table 11: Number of unique sites accessing DBpedia endpoints

    DBpedia   Avg/Day   Median    Stdev   Maximum
    3.3         5,824    5,977    1,046     7,865
    3.4         6,656    6,704      947     8,312
    3.5         9,392    9,432    1,643    12,584
    3.6        10,810   10,810    1,612    14,798
    3.7        17,077   17,091    6,782    33,288
    3.8        14,513   14,493    2,783    20,510

If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code23 like 509 ("Bandwidth Limit Exceeded") and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.24 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by

23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

24 http://www.webalizer.org


Table 10: Number of endpoint hits (left) and visits (right)

    Hits:
    DBpedia     Avg/Day      Median       Stdev     Maximum
    3.3         733,811     711,306     188,991   1,319,539
    3.4       1,212,549   1,165,893     351,226   2,371,657
    3.5       1,122,612   1,035,444     386,454   2,908,720
    3.6       1,328,355   1,286,750     256,945   2,495,031
    3.7       2,085,399   1,930,728   1,057,398   8,873,473
    3.8       2,910,410   2,717,775   1,085,640   7,678,490

    Visits:
    DBpedia   Avg/Day   Median    Stdev   Maximum
    3.3         9,750    9,775    2,036    13,126
    3.4        11,329   11,473    1,546    14,198
    3.5        16,592   16,902    2,969    23,129
    3.6        19,471   17,664    5,691    56,064
    3.7        23,972   22,262   10,741   127,869
    3.8        16,851   16,711    2,960    27,416

the Median and Standard Deviation. The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e., sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x that redirects the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint.

Table 12: Hits per service to http://dbpedia.org, in thousands

    Endpoint      3.3     3.4    3.5     3.6     3.7     3.8
    data        1,355   2,681  2,230   2,547   3,714   4,246
    ontology       80     168    142     178     116      97
    page        2,634   4,703  1,810   1,687   3,787   7,377
    property      231     311    137     176     176     128
    resource    2,976   4,080  2,554   2,309   4,436   7,143
    sparql      2,090   4,177  2,266   9,112  15,902  15,475
    other         252     420    342     277     579     695
    total       9,619  16,541  9,434  16,286  28,710  35,142

Fig. 11: Traffic of the Linked Data service versus the SPARQL endpoint.

The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.
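As a rough illustration of the page/data distinction, the RDF served for a resource via the data endpoint corresponds approximately to the result of a DESCRIBE query on its URI (a hedged sketch; the exact server behaviour may differ):

    # Approximation of the RDF returned by the data endpoint for Berlin.
    DESCRIBE <http://dbpedia.org/resource/Berlin>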

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focused on the calls to the sparql endpoint and counted the number of statements per type.


Table 13: Hits per statement type, in thousands

    Statement     3.3    3.4    3.5    3.6     3.7     3.8
    ask            56    269    360    326     159     677
    construct      55     31     14     11      22      38
    describe       11      8      4      7      62     111
    select      1,891  3,663  1,658  8,030  11,204  13,516
    unknown        78    206    229    738   4,455   1,134
    total       2,090  4,177  2,266  9,112  15,902  15,475

Table 14: Trends in SPARQL SELECT queries (rounded values in %)

    Statement   3.3   3.4   3.5   3.6   3.7   3.8
    distinct   19.5  11.4  17.3  19.4  13.3  25.4
    filter     45.7  13.7  31.8  25.3  29.2  36.1
    functions   8.8   6.9  23.5  21.3  25.5  25.9
    geo        27.7   7.0  39.6   6.7   9.3   9.2
    group       0.0   0.0   0.0   0.0   0.0   0.2
    limit       4.6   6.5  11.6  10.5   7.7   7.8
    optional   30.6  23.8  47.3  26.3  16.7  17.2
    order       2.3   1.3   1.9   3.2   1.2   1.6
    union       3.3   3.2  26.4  11.3  12.1  20.6

As the log files only record the full SPARQL query on a GET request, all POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, isIRI
– use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of the SPARQL PREFIX declarations geo and wgs84 and their usage in property tags. Table 14 shows the use of various keywords as a percentage of the total SELECT queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
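For illustration, the following hypothetical query combines several of the counted constructs (DISTINCT, OPTIONAL, FILTER, ORDER BY and LIMIT) and is representative of the more complex queries observed in later releases:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo: <http://dbpedia.org/ontology/>

    # Ten cities with the largest recorded population; cities without
    # a population value keep their label but sort last.
    SELECT DISTINCT ?city ?pop
    WHERE {
      ?city rdf:type dbo:City ;
            rdfs:label ?label .
      OPTIONAL { ?city dbo:populationTotal ?pop }
      FILTER (LANG(?label) = "en")
    }
    ORDER BY DESC(?pop)
    LIMIT 10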

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e., changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12: Number of daily requests sent to DBpedia Live for (a) SPARQL queries and (b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, proliferation for applications and mashups, as well as text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets.25 For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, to mention just a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have

25 http://wiki.dbpedia.org/Datasets/NLP


been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts

or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open-source tool26 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI,27 Semantic API

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com

from Ontos,28 Open Calais29 and Zemanta,30 among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project.31

A related project is ImageSnippets,32 which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation: Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge

across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets, including DBpedia.35 DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series.36 Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such

28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org
32 http://www.imagesnippets.com
33 http://www.faviki.com
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.
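To make the challenge concrete, a QA system has to translate a natural language question into a structured query. For the hypothetical question "Which German cities have more than one million inhabitants?" (an illustrative example, not taken from the QALD benchmark), a suitable SPARQL target would be:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # Cities in Germany with more than one million inhabitants.
    SELECT ?city
    WHERE {
      ?city rdf:type dbo:City ;
            dbo:country dbr:Germany ;
            dbo:populationTotal ?population .
      FILTER (?population > 1000000)
    }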

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: the broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project.37 Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia.38 These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of these tools, structured by category.

Facet-based browsers: An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and to filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and querying: The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows users to explore the neighborhood of, and connections between, resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows users to create statistical analysis queries and correlations over RDF data, and DBpedia in particular.

Spatial applications: DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org


in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains.43 Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast-changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.44 Statistics are shown in Table 15.

8 Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, the Wikimedia Germany e.V. started

the development of Wikidata.45 Wikidata is a free knowledge base about the world that can be read and

42 http://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page.
44 http://wiktionary.dbpedia.org
45 http://wikidata.org

edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions; thus, their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase47 is a graph database which also extracts

structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface

47 http://www.freebase.com


Table 15: Statistical comparison of extractions for different languages

    Language       Words      Triples    Resources   Predicates    Senses
    en         2,142,237   28,593,364   11,804,039           28   424,386
    fr         4,657,817   35,032,121   20,462,349           22   592,351
    ru         1,080,156   12,813,437    5,994,560           17   149,859
    de           701,739    5,618,508    2,966,867           16   122,362

and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows metadata for each fact to be stored efficiently. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary, since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects that pursues similar goals

to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g., infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its

48 http://www.mpi-inf.mpg.de/yago-naga/yago/

content. YAGO focuses on extracting a smaller number of relations, compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g., DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 that are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library) [49] is an open-source Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also

49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction


provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g., for word-sense disambiguation, synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service.52 Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site.53

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike.55 Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage, beyond the use of Wikipedia interlanguage links, is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common-sense knowledge base which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that, in the past four years, DBpedia matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion: An area which is still largely unexplored is the integration and

56 http://www.cs.washington.edu/research/knowitall


fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement: In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction: Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki.57 If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia: While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia: A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birth date of a person must always be before the death date, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
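A minimal sketch of such a sanity check, runnable against the DBpedia Live endpoint, flags persons whose recorded death date precedes their birth date:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>

    # Persons with an inconsistent birth/death date pair.
    SELECT ?person ?birth ?death
    WHERE {
      ?person rdf:type dbo:Person ;
              dbo:birthDate ?birth ;
              dbo:deathDate ?death .
      FILTER (?death < ?birth)
    }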

Integrate DBpedia and NLP: Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:

  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16: List of namespace prefixes

    Prefix    Namespace
    dbo       http://dbpedia.org/ontology/
    dbp       http://dbpedia.org/property/
    dbr       http://dbpedia.org/resource/
    dbr-de    http://de.dbpedia.org/resource/
    dc        http://purl.org/dc/elements/1.1/
    foaf      http://xmlns.com/foaf/0.1/
    geo       http://www.w3.org/2003/01/geo/wgs84_pos#
    georss    http://www.georss.org/georss/
    ls        http://spotlight.dbpedia.org/scores/
    lx        http://spotlight.dbpedia.org/lexicalizations/
    owl       http://www.w3.org/2002/07/owl#
    rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
    rdfs      http://www.w3.org/2000/01/rdf-schema#
    skos      http://www.w3.org/2004/02/skos/core#
    sptl      http://spotlight.dbpedia.org/vocab/
    xsd       http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17: DBpedia timeline

    Year  Month  Event
    2006  Nov    Start of infobox extraction from Wikipedia
    2007  Mar    DBpedia 1.0 release
          Jun    ESWC DBpedia article [2]
          Nov    DBpedia 2.0 release
          Nov    ISWC DBpedia article [1]
          Dec    DBpedia 3.0 release candidate
    2008  Feb    DBpedia 3.0 release
          Aug    DBpedia 3.1 release
          Nov    DBpedia 3.2 release
    2009  Jan    DBpedia 3.3 release
          Sep    JWS DBpedia article [5]
    2010  Feb    Information extraction framework in Scala;
                 Mappings Wiki release
          Mar    DBpedia 3.4 release
          Apr    DBpedia 3.5 release
          May    Start of DBpedia Internationalization effort
    2011  Feb    DBpedia Spotlight release
          Mar    DBpedia 3.6 release
          Jul    DBpedia Live release
          Sep    DBpedia 3.7 release (with I18n datasets)
    2012  Aug    DBpedia 3.8 release
          Sep    Publication of DBpedia Internationalization article [24]
    2013  Sep    DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. Journal of Web Semantics, 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Mérialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle, and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live Extraction. In Proc. of the 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15:51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A tool for crowdsourcing the quality assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP: A multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: Electronic Library and Information Systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In Proceedings of the 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. International Journal of Metadata, Semantics and Ontologies, 3:37–52, 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

View publication statsView publication stats

Page 2: DBpedia - A Large-scale, Multilingual Knowledge Base ...static.tongtianta.site/paper_pdf/38bd5a66-1521-11e9-bf90-00163e08bb… · DBpedia is served as Linked Data on the Web. Since

Semantic Web 1 (2012) 1ndash5 1IOS Press

DBpedia ndash A Large-scale MultilingualKnowledge Base Extracted from WikipediaEditor(s) Name Surname University CountrySolicited review(s) Name Surname University CountryOpen review(s) Name Surname University Country

Jens Lehmann alowast Robert Isele g Max Jakob e Anja Jentzsch d Dimitris Kontokostas aPablo N Mendes f Sebastian Hellmann a Mohamed Morsey a Patrick van Kleef c Soren Auer aChristian Bizer ba University of Leipzig Institute of Computer Science AKSW Group Augustusplatz 10 D-04009 Leipzig GermanyE-mail lastnameinformatikuni-leipzigdeb University of Mannheim Research Group Data and Web Science B6-26 D-68159 MannheimE-mail chrisinformatikuni-mannheimdec OpenLink Software 10 Burlington Mall Road Suite 265 Burlington MA 01803 USAE-mail pkleefopenlinkswcomd Hasso-Plattner-Institute for IT-Systems Engineering Prof-Dr- Helmert-Str 2-3 D-14482 Potsdam GermanyE-mail mailanjajentzschdee Neofonie GmbH Robert-Koch-Platz 4 D-10115 Berlin GermanyE-mail maxjakobneofoniedef Knoesis - Ohio Center of Excellence in Knowledge-enabled Computing Wright State University Dayton USAE-Mail pabloknoesisorgg Brox IT-Solutions GmbH An der Breiten Wiese 9 D-30625 Hannover GermanyE-Mail mailrobertiselecom

Abstract The DBpedia community project extracts structured multilingual knowledge from Wikipedia and makes it freelyavailable on the Web using Semantic Web and Linked Data technologies The project extracts knowledge from 111 differentlanguage editions of Wikipedia The largest DBpedia knowledge base which is extracted from the English edition of Wikipediaconsists of over 400 million facts that describe 37 million things The DBpedia knowledge bases that are extracted from the other110 Wikipedia editions together consist of 146 billion facts and describe 10 million additional things The DBpedia project mapsWikipedia infoboxes from 27 different language editions to a single shared ontology consisting of 320 classes and 1650 propertiesThe mappings are created via a world-wide crowd-sourcing effort and enable knowledge from the different Wikipedia editions tobe combined The project publishes regular releases of all DBpedia knowledge bases for download and provides SPARQL queryaccess to 14 out of the 111 language editions via a global network of local DBpedia chapters In addition to the regular releasesthe project maintains a live knowledge base which is updated whenever a page in Wikipedia changes DBpedia sets 27 millionRDF links pointing into over 30 external data sources and thus enables data from these sources to be used together with DBpediadata Several hundred data sets on the Web publish RDF links pointing to DBpedia themselves and thus make DBpedia one ofthe central interlinking hubs in the Linked Open Data (LOD) cloud In this system report we give an overview of the DBpediacommunity project including its architecture technical implementation maintenance internationalisation usage statistics andapplications

Keywords Knowledge Extraction Wikipedia Multilingual Knowledge Bases Linked Data RDF

1570-084412$2750 ccopy 2012 ndash IOS Press and the authors All rights reserved

2 Lehmann et al DBpedia

1 Introduction

Wikipedia is the 6th most popular website1 the mostwidely used encyclopedia and one of the finest exam-ples of truly collaboratively created content There areofficial Wikipedia editions in 287 different languageswhich range in size from a couple of hundred articlesup to 38 million articles (English edition)2 Besides offree text Wikipedia articles consist of different types ofstructured data such as infoboxes tables lists and cat-egorization data Wikipedia currently offers only free-text search capabilities to its users Using Wikipediasearch it is thus very difficult to find all rivers that flowinto the Rhine and are longer than 100 miles or allItalian composers that were born in the 18th century

The DBpedia project builds a large-scale multilin-gual knowledge base by extracting structured data fromWikipedia editions in 111 languages This knowledgebase can be used to answer expressive queries such asthe ones outlined above Being multilingual and cover-ing an wide range of topics the DBpedia knowledgebase is also useful within further application domainssuch as data integration named entity recognition topicdetection and document ranking

The DBpedia knowledge base is widely used as atestbed in the research community and numerous appli-cations algorithms and tools have been built around orapplied to DBpedia DBpedia is served as Linked Dataon the Web Since it covers a wide variety of topicsand sets RDF links pointing into various external datasources many Linked Data publishers have decidedto set RDF links pointing to DBpedia from their datasets Thus DBpedia has developed into a central inter-linking hub in the Web of Linked Data and has beena key factor for the success of the Linked Open Datainitiative

The structure of the DBpedia knowledge base is maintained by the DBpedia user community. Most importantly, the community creates mappings from Wikipedia information representation structures to the DBpedia ontology. This ontology – which will be explained in detail in Section 3 – unifies different template structures, both within single Wikipedia language editions and across currently 27 different languages.

*Corresponding author. E-mail: lehmann@informatik.uni-leipzig.de

1 http://www.alexa.com/topsites, retrieved in October 2013.

2 http://meta.wikimedia.org/wiki/List_of_Wikipedias

The maintenance of different language editions of DBpedia is spread across a number of organisations. Each organisation is responsible for the support of a certain language. The local DBpedia chapters are coordinated by the DBpedia Internationalisation Committee.

The aim of this system report is to provide a description of the DBpedia community project, including the architecture of the DBpedia extraction framework, its technical implementation, maintenance, internationalisation, usage statistics, as well as presenting some popular DBpedia applications. This system report is a comprehensive update and extension of previous project descriptions in [1] and [5]. The main advances compared to these articles are:

– The concept and implementation of the extraction based on a community-curated DBpedia ontology.
– The wide internationalisation of DBpedia.
– A live synchronisation module which processes updates in Wikipedia as well as the DBpedia ontology and allows third parties to keep their copies of DBpedia up-to-date.
– A description of the maintenance of public DBpedia services and statistics about their usage.
– An increased number of interlinked data sets which can be used to further enrich the content of DBpedia.
– The discussion and summary of novel third-party applications of DBpedia.

Overall, DBpedia has undergone 7 years of continuous evolution. Table 17 provides an overview of the project's timeline.

The system report is structured as follows: In the next section, we describe the DBpedia extraction framework, which forms the technical core of DBpedia. This is followed by an explanation of the community-curated DBpedia ontology with a focus on multilingual support. In Section 4, we explicate how DBpedia is synchronised with Wikipedia with just very short delays and how updates are propagated to DBpedia mirrors employing the DBpedia Live system. Subsequently, we give an overview of the external data sets that are interlinked from DBpedia or that set RDF links pointing to DBpedia themselves (Section 5). In Section 6, we provide statistics on the usage of DBpedia and describe the maintenance of a large-scale public data set. Within Section 7, we briefly describe several use cases and applications of DBpedia in a variety of different areas. Finally, we report on related work in Section 8 and give an outlook on the further development of DBpedia in Section 9.


Fig. 1. Overview of DBpedia extraction framework.

2 Extraction Framework

Wikipedia articles consist mostly of free text, but also comprise various types of structured information in the form of wiki markup. Such information includes infobox templates, categorisation information, images, geo-coordinates, links to external web pages, disambiguation pages, redirects between pages, and links across different language editions of Wikipedia. The DBpedia extraction framework extracts this structured information from Wikipedia and turns it into a rich knowledge base. In this section, we give an overview of the DBpedia knowledge extraction framework.

2.1 General Architecture

Figure 1 shows an overview of the technical framework. The DBpedia extraction is structured into four phases:

Input: Wikipedia pages are read from an external source. Pages can either be read from a Wikipedia dump or directly fetched from a MediaWiki installation using the MediaWiki API.

Parsing: Each Wikipedia page is parsed by the wiki parser. The wiki parser transforms the source code of a Wikipedia page into an Abstract Syntax Tree.

Extraction: The Abstract Syntax Tree of each Wikipedia page is forwarded to the extractors. DBpedia offers extractors for many different purposes, for instance to extract labels, abstracts or geographical coordinates. Each extractor consumes an Abstract Syntax Tree and yields a set of RDF statements.

Output: The collected RDF statements are written to a sink. Different formats, such as N-Triples, are supported.

2.2 Extractors

The DBpedia extraction framework employs various extractors for translating different parts of Wikipedia pages to RDF statements. A list of all available extractors is shown in Table 1. DBpedia extractors can be divided into four categories:

Mapping-Based Infobox Extraction: The mapping-based infobox extraction uses manually written mappings that relate infoboxes in Wikipedia to terms in the DBpedia ontology. The mappings also specify a datatype for each infobox property and thus help the extraction framework to produce high quality data. The mapping-based extraction will be described in detail in Section 2.4.

Raw Infobox Extraction: The raw infobox extraction provides a direct mapping from infoboxes in Wikipedia to RDF. As the raw infobox extraction does not rely on explicit extraction knowledge in the form of mappings, the quality of the extracted data is lower. The raw infobox data is useful if a specific infobox has not been mapped yet and thus is not available in the mapping-based extraction.

Feature Extraction: The feature extraction uses a number of extractors that are specialized in extracting a single feature from an article, such as a label or geographic coordinates.


Statistical Extraction: Some NLP-related extractors aggregate data from all Wikipedia pages in order to provide data that is based on statistical measures of page links or word counts, as further described in Section 2.6.

2.3 Raw Infobox Extraction

The type of Wikipedia content that is most valuable for the DBpedia extraction are infoboxes. Infoboxes are frequently used to list an article's most relevant facts as a table of attribute-value pairs on the top right-hand side of the Wikipedia page (for right-to-left languages on the top left-hand side). Infoboxes that appear in a Wikipedia article are based on a template that specifies a list of attributes that can form the infobox. A wide range of infobox templates are used in Wikipedia. Common examples are templates for infoboxes that describe persons, organisations or automobiles. As Wikipedia's infobox template system has evolved over time, different communities of Wikipedia editors use different templates to describe the same type of things (e.g. Infobox city japan, Infobox swiss town and Infobox town de). In addition, different templates use different names for the same attribute (e.g. birthplace and placeofbirth). As many Wikipedia editors do not strictly follow the recommendations given on the page that describes a template, attribute values are expressed using a wide range of different formats and units of measurement. An excerpt of an infobox that is based on a template for describing automobiles is shown below:

  {{Infobox automobile
  | name = Ford GT40
  | manufacturer = [[Ford Advanced Vehicles]]
  | production = 1964-1969
  | engine = 4181cc
  (...)
  }}

In this infobox, the first line specifies the infobox type and the subsequent lines specify various attributes of the described entity.

An excerpt of the extracted data is as follows:

  dbr:Ford_GT40
    dbp:name "Ford GT40"@en ;
    dbp:manufacturer dbr:Ford_Advanced_Vehicles ;
    dbp:engine 4181 ;
    dbp:production 1964 ;
  (...)

This extraction output has weaknesses: the resource is not associated to a class in the ontology, and parsed values are cleaned up and assigned a datatype based on heuristics. In particular, the raw infobox extractor searches for values in the following order: dates, coordinates, numbers, links, and strings as default. Thus, the datatype assignment for the same property in different resources is non-deterministic. The engine, for example, is extracted as a number, but if another instance of the template used "cc4181", it would be extracted as a string. This behaviour makes querying for properties in the dbp namespace inconsistent. Those problems can be overcome by the mapping-based infobox extraction presented in the next subsection.
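The inconsistency can be observed directly on the data. As a minimal sketch (assuming the raw infobox data set is loaded at the queried endpoint), the following query groups the values of dbp:engine by their datatype and would typically return a mix of numeric and string literals:

  PREFIX dbp: <http://dbpedia.org/property/>

  # Which datatypes did the raw extractor assign to dbp:engine?
  SELECT ?datatype (COUNT(*) AS ?count) WHERE {
    ?car dbp:engine ?value .
    BIND (datatype(?value) AS ?datatype)
  }
  GROUP BY ?datatype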

2.4 Mapping-Based Infobox Extraction

In order to homogenize the description of information in the knowledge base, in 2010 a community effort was initiated to develop an ontology schema and mappings from Wikipedia infobox properties to this ontology. The alignment between Wikipedia infoboxes and the ontology is performed via community-provided mappings that help to normalize name variations in properties and classes. Heterogeneity in the Wikipedia infobox system, like using different infoboxes for the same type of entity or using different property names for the same property (cf. Section 2.3), can be alleviated in this way. This significantly increases the quality of the raw Wikipedia infobox data by typing resources, merging name variations and assigning specific datatypes to the values.

This effort is realized using the DBpedia Mappings Wiki3, a MediaWiki installation set up to enable users to collaboratively create and edit mappings. These mappings are specified using the DBpedia Mapping Language. The mapping language makes use of MediaWiki templates that define DBpedia ontology classes and properties as well as template/table to ontology mappings. A mapping assigns a type from the DBpedia ontology to the entities that are described by the corresponding infobox. In addition, attributes in the infobox are mapped to properties in the DBpedia ontology. In the following, we show a mapping that maps infoboxes that use the Infobox automobile template to the DBpedia ontology:

  {{TemplateMapping
  | mapToClass = Automobile
  | mappings =
      {{PropertyMapping
      | templateProperty = name
      | ontologyProperty = foaf:name }}
      {{PropertyMapping
      | templateProperty = manufacturer
      | ontologyProperty = manufacturer }}
      {{DateIntervalMapping
      | templateProperty = production
      | startDateOntologyProperty = productionStartDate
      | endDateOntologyProperty = productionEndDate }}
      {{IntermediateNodeMapping
      | nodeClass = AutomobileEngine
      | correspondingProperty = engine
      | mappings =
          {{PropertyMapping
          | templateProperty = engine
          | ontologyProperty = displacement
          | unit = Volume }}
          {{PropertyMapping
          | templateProperty = engine
          | ontologyProperty = powerOutput
          | unit = Power }}
      }}
  (...)
  }}

3 http://mappings.dbpedia.org


Table 1. Overview of the DBpedia extractors (cf. Table 16 for a complete list of prefixes).

Name | Description | Example
abstract | Extracts the first lines of the Wikipedia article | dbr:Berlin dbo:abstract "Berlin is the capital city of (...)"
article categories | Extracts the categorization of the article | dbr:Oliver_Twist dc:subject dbr:Category:English_novels
category label | Extracts labels for categories | dbr:Category:English_novels rdfs:label "English novels"
category hierarchy | Extracts information about which concept is a category and how categories are related, using the SKOS Vocabulary | dbr:Category:World_War_II skos:broader dbr:Category:Modern_history
disambiguation | Extracts disambiguation links | dbr:Alien dbo:wikiPageDisambiguates dbr:Alien_(film)
external links | Extracts links to external web pages related to the concept | dbr:Animal_Farm dbo:wikiPageExternalLink <http://books.google.com/?id=RBGmrDnBs8UC>
geo coordinates | Extracts geo-coordinates | dbr:Berlin georss:point "52.5006 13.3989"
grammatical gender | Extracts grammatical genders for persons | dbr:Abraham_Lincoln foaf:gender "male"
homepage | Extracts links to the official homepage of an instance | dbr:Alabama foaf:homepage <http://alabama.gov/>
image | Extracts the first image of a Wikipedia page | dbr:Berlin foaf:depiction <http://.../Overview_Berlin.jpg>
infobox | Extracts all properties from all infoboxes | dbr:Animal_Farm dbo:date "March 2010"
interlanguage | Extracts interwiki links | dbr:Albedo dbo:wikiPageInterLanguageLink dbr-de:Albedo
label | Extracts the article title as label | dbr:Berlin rdfs:label "Berlin"
lexicalizations | Extracts information about surface forms and their association with concepts (only N-Quad format) | dbr:Pine sptl:lexicalization lx:pine_tree ls:Pine_pine_tree . lx:pine_tree rdfs:label "pine tree" ls:Pine_pine_tree . ls:Pine_pine_tree sptl:pUriGivenSf 0.941
mappings | Extraction based on mappings of Wikipedia infoboxes to the DBpedia ontology | dbr:Berlin dbo:country dbr:Germany
page ID | Extracts page ids of articles | dbr:Autism dbo:wikiPageID 25
page links | Extracts all links between Wikipedia articles | dbr:Autism dbo:wikiPageWikiLink dbr:Human_brain
persondata | Extracts information about persons represented using the PersonData template | dbr:Andre_Agassi foaf:birthDate "1970-04-29"
PND | Extracts PND (Personennamendatei) data about a person | dbr:William_Shakespeare dbo:individualisedPnd "118613723"
redirects | Extracts redirect links between articles in Wikipedia | dbr:ArtificialLanguages dbo:wikiPageRedirects dbr:Constructed_language
revision ID | Extracts the revision ID of the Wikipedia article | dbr:Autism <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/Autism?oldid=495234324>
thematic concept | Extracts 'thematic' concepts, the centres of discussion for categories | dbr:Category:Music skos:subject dbr:Music
topic signatures | Extracts topic signatures | dbr:Alkane sptl:topicSignature "carbon alkanes atoms"
wiki page | Extracts links to corresponding articles in Wikipedia | dbr:AnAmericanInParis foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/AnAmericanInParis>



The RDF statements that are extracted from the previous infobox example are shown below. As we can see, the production period is correctly split into a start year and an end year, and the engine is represented by a distinct RDF node. It is worth mentioning that all values are canonicalized to basic units. For example, in the engine mapping we state that engine is a Volume and thus the extractor converts "4181cc" (cubic centimeters) to cubic meters ("0.004181"). Additionally, there can exist multiple mappings on the same property that search for different datatypes or different units, for example a number with "PS" as a suffix for engine.

  dbr:Ford_GT40
    rdf:type dbo:Automobile ;
    rdfs:label "Ford GT40"@en ;
    dbo:manufacturer dbr:Ford_Advanced_Vehicles ;
    dbo:productionStartYear "1964"^^xsd:gYear ;
    dbo:productionEndYear "1969"^^xsd:gYear ;
    dbo:engine [
      rdf:type dbo:AutomobileEngine ;
      dbo:displacement 0.004181
    ] .
  (...)

The DBpedia Mappings Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is also used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 2 shows how the infobox properties author and συγγραφέας – author in Greek – are both being mapped to the global identifier dbo:author. That means, in turn, that information from all language versions of DBpedia can be merged, and DBpedias for smaller languages can be augmented with knowledge from larger DBpedias such as the English edition. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns, which is often only present in the corresponding language edition [43].
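As a sketch of what this merging enables: once both the English author and the Greek συγγραφέας attributes are mapped to dbo:author, a single query retrieves book authorship regardless of which language edition contributed the statement (the endpoint choice and data loading are assumed):

  PREFIX dbo: <http://dbpedia.org/ontology/>

  # Books and their authors, independent of the Wikipedia language
  # edition from which the infobox statement was extracted.
  SELECT ?book ?author WHERE {
    ?book a dbo:Book ;
          dbo:author ?author .
  }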

Besides hosting the mappings and the DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

– Mapping Syntax Validator: The mapping syntax validator checks for syntactic correctness and highlights inconsistencies such as missing property definitions.
– Extraction Tester: The extraction tester, linked on each mapping page, tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.
– Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users to create and edit mappings.

2.5 URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

– http://dbpedia.org/resource/ (prefix dbr) for representing article data. There is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin4, DBpedia will produce the URI dbr:Berlin. Exceptions to this rule appear when intermediate nodes are extracted from the mapping-based infobox extractor as unique URIs (e.g. the engine mapping example in Section 2.4).

– http://dbpedia.org/property/ (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.

4 http://en.wikipedia.org/wiki/Berlin


Fig. 2. Depiction of the mapping from the Greek (left) and English Wikipedia templates (right) about books to the same DBpedia ontology class (middle) [24].

– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links5. For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach omitted valuable data [24]. Thus, starting from the DBpedia 3.7 release6, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within the datasets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic language-agnostic namespace http://dbpedia.org/resource/.

5 http://en.wikipedia.org/wiki/Help:Interlanguage_links

6 A list of all DBpedia releases is provided in Table 17.


2.6 NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [33]. Currently, four datasets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concept. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concept extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The Lexicalization data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones, and also show how ambiguous a name is with respect to all possible concepts that it can mean.
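A minimal sketch of how this data could be queried, assuming the N-Quad files are loaded into a local quad store; the sptl: prefix URI and the graph shape (with the context element carrying the link node that holds the score) are assumptions inferred from the dbr:Pine example in Table 1:

  PREFIX dbr:  <http://dbpedia.org/resource/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  PREFIX sptl: <http://spotlight.dbpedia.org/vocab/>  # assumed prefix URI

  # Surface forms for dbr:Pine, ordered by association strength.
  SELECT ?surfaceForm ?score WHERE {
    GRAPH ?link {
      dbr:Pine sptl:lexicalization ?lex .
      ?lex rdfs:label ?surfaceForm .
    }
    ?link sptl:pUriGivenSf ?score .
  }
  ORDER BY DESC(?score)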

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the tf-idf weight, with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [30].
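For reference, the tf-idf weight referred to here has the standard form below; this is a sketch of the usual definition, as the exact variant used is not spelled out:

  \[ w_{t,r} = \mathrm{tf}(t,r) \cdot \log\frac{N}{\mathrm{df}(t)} \]

where tf(t, r) is the frequency of word stem t in the text associated with resource r, df(t) is the number of resources whose text contains t, and N is the total number of resources.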

There are two more Feature Extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the centre of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (Subject, Object, Possessive Adjective, Possessive Pronoun and Reflexive) – i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
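Following the foaf:gender example in Table 1, the resulting data can be queried as sketched below; whether the literal carries a language tag is an assumption:

  PREFIX dbo:  <http://dbpedia.org/ontology/>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>

  # Persons whose heuristically assigned grammatical gender is feminine.
  SELECT ?person WHERE {
    ?person a dbo:Person ;
            foaf:gender "female"@en .
  }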

2.7 Summary of Other Recent Developments

In this section, we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework has been rewritten in Scala in 20107, resulting in an improvement of the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. This is crucial for the DBpedia Live project, which will be explained in Section 4. The new, more modular framework also allows to extract data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki, as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

7 Table 17 provides an overview of the project's evolution through time.


In addition, there were several smaller improvements and general maintenance. Overall, over the past four years, the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal, etc. in English that indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.), and for a number of classes and properties, specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the PHP framework, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links, and hence increased the overall interconnectivity of resources in the DBpedia ontology.
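Using the predicate shown in Table 1, resolving a redirect then amounts to a one-triple lookup; the resource name follows the Table 1 example:

  PREFIX dbo: <http://dbpedia.org/ontology/>
  PREFIX dbr: <http://dbpedia.org/resource/>

  # Resolve the redirect target of a redirect page.
  SELECT ?target WHERE {
    dbr:ArtificialLanguages dbo:wikiPageRedirects ?target .
  }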


Fig. 3. Snapshot of a part of the DBpedia ontology. The figure shows top classes such as owl:Thing, dbo:Agent, dbo:Place, dbo:PopulatedPlace, dbo:Settlement, dbo:Person, dbo:Athlete, dbo:Organisation, dbo:Species, dbo:Eukaryote and dbo:Work connected by rdfs:subClassOf, together with selected object properties (e.g. dbo:birthPlace, dbo:producer, dbo:writer, dbo:location) and datatype properties with their ranges (e.g. dbo:populationTotal xsd:integer, dbo:birthDate xsd:date, dbo:areaTotal xsd:double, dbo:utcOffset xsd:string).

Finally, a new heuristic to increase the connectedness of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.

Orthogonal to the previously mentioned improvements, there have been various efforts to assess the quality of the DBpedia datasets. [26] developed a framework for estimating the quality of DBpedia, and a sample of 75 resources were analysed. A more comprehensive effort was performed in [48] by providing a distributed web-based interface [25] for quality assessment. In this study, 17 data quality problem types were analysed by 58 users covering 521 resources in DBpedia.

3 DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances. The complete DBpedia ontology can be browsed online at http://mappings.dbpedia.org/server/ontology/classes/.
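The same hierarchy can also be inspected programmatically; a minimal sketch against the public endpoint:

  PREFIX dbo:  <http://dbpedia.org/ontology/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  # Direct subclasses of dbo:Place in the DBpedia ontology.
  SELECT ?class WHERE {
    ?class rdfs:subClassOf dbo:Place .
  }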

Fig. 4. Growth of the DBpedia ontology: number of classes and properties per DBpedia version (3.2–3.8).

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much, due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1 Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.


Fig. 5. Mapping coverage statistics for all mapping-enabled languages.


It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year have a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.


Fig. 7. English property mappings occurrence frequency (both axes are in log scale).


Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long-tail distribution. Thus, a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2 Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been so far created in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of popular classes from the third and fourth hierarchy level of the ontology within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12,936 persons are described in five languages but not in six or more languages. The number 871,630 for the class Person means that all 20 language versions together describe 871,630 different persons.


Fig. 6. Mapping community activity for (a) ontology and (b) 10 most active language editions.

Table 2. Basic statistics about localized DBpedia editions.

Lang | Inst. LD all | Inst. CD all | Inst. with MD CD | Raw Prop. CD | Map. Prop. CD | Raw Statem. CD | Map. Statem. CD
en | 3769926 | 3769926 | 2359521 | 48293 | 1313 | 65143840 | 33742015
de | 1243771 | 650037 | 204335 | 9593 | 261 | 7603562 | 2880381
fr | 1197334 | 740044 | 214953 | 13551 | 228 | 8854322 | 2901809
it | 882127 | 580620 | 383643 | 9716 | 181 | 12227870 | 4804731
es | 879091 | 542524 | 310348 | 14643 | 476 | 7740458 | 4383206
pl | 848298 | 538641 | 344875 | 7306 | 266 | 7696193 | 4511794
ru | 822681 | 439605 | 123011 | 13522 | 76 | 6973305 | 1389473
pt | 699446 | 460258 | 272660 | 12851 | 602 | 6255151 | 4005527
ca | 367362 | 241534 | 112934 | 8696 | 183 | 3689870 | 1301868
cs | 225133 | 148819 | 34893 | 5564 | 334 | 1857230 | 474459
hu | 209180 | 138998 | 63441 | 6821 | 295 | 2506399 | 601037
ko | 196132 | 124591 | 30962 | 7095 | 419 | 1035606 | 417605
tr | 187850 | 106644 | 40438 | 7512 | 440 | 1350679 | 556943
ar | 165722 | 103059 | 16236 | 7898 | 268 | 635058 | 168686
eu | 132877 | 108713 | 41401 | 2245 | 19 | 2255897 | 532709
sl | 129834 | 73099 | 22036 | 4235 | 470 | 1213801 | 222447
bg | 125762 | 87679 | 38825 | 3984 | 274 | 774443 | 488678
hr | 109890 | 71469 | 10343 | 3334 | 158 | 701182 | 151196
el | 71936 | 48260 | 10813 | 2866 | 288 | 206460 | 113838

This number is higher than the number of persons described in the canonicalized English infobox data set (763,643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195,263 persons are described in two or more languages. The large difference of this number compared to the total number of 871,630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor, alongside live synchronisation approaches in [20], allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages8. The DBpedia 3.7 release9 in September 2011 was the first DBpedia release to use the localized I18n (Internationalisation) DBpedia extraction framework [24].

8 Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr), Urdu (ur).

9 http://blog.dbpedia.org/2011/09/11/


Table 3. Number of instances per class within 10 localized DBpedia versions.

Class | en | it | pl | es | pt | fr | de | ru | ca | hu
Person | 763643 | 145060 | 70708 | 65337 | 43057 | 62942 | 33122 | 18620 | 7107 | 15529
  Athlete | 185126 | 47187 | 30332 | 19482 | 14130 | 21646 | 31237 | 0 | 721 | 4527
  Artist | 61073 | 12511 | 16120 | 25992 | 10571 | 13465 | 0 | 0 | 2004 | 3821
  Politician | 23096 | 0 | 8943 | 5513 | 3342 | 0 | 0 | 12004 | 1376 | 760
Place | 572728 | 141101 | 182727 | 132961 | 116660 | 80602 | 131766 | 67932 | 73078 | 18324
  PopulPlace | 387166 | 138077 | 167034 | 121204 | 109418 | 72252 | 79410 | 63826 | 72743 | 15535
  Building | 60514 | 1270 | 2946 | 3570 | 803 | 921 | 83 | 43 | 0 | 527
  River | 24267 | 0 | 1383 | 2 | 4149 | 3333 | 6707 | 3924 | 0 | 565
Organisation | 192832 | 4142 | 12193 | 11710 | 10949 | 17513 | 16973 | 1598 | 1399 | 3993
  Company | 44516 | 4142 | 2566 | 975 | 1903 | 5832 | 7200 | 0 | 440 | 618
  EducInst | 42270 | 0 | 599 | 1207 | 270 | 1636 | 2171 | 1010 | 0 | 115
  Band | 27061 | 0 | 2993 | 0 | 4476 | 3868 | 5368 | 0 | 263 | 802
Work | 333269 | 51918 | 32386 | 36484 | 33869 | 39195 | 18195 | 34363 | 5240 | 9777
  MusicWork | 159070 | 23652 | 14987 | 15911 | 17053 | 16206 | 0 | 6588 | 697 | 4704
  Film | 71715 | 17210 | 9467 | 9396 | 8896 | 9741 | 15038 | 12045 | 3859 | 2320
  Software | 27947 | 5682 | 3050 | 4833 | 3419 | 5733 | 2368 | 0 | 606 | 857

Table 4. Cross-language overlap: number of instances that are described in multiple languages.

Class | Instances | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+
Person | 871630 | 676367 | 94339 | 42382 | 21647 | 12936 | 8198 | 5295 | 3437 | 2391 | 4638
Place | 643260 | 307729 | 150349 | 45836 | 36339 | 20831 | 13523 | 20808 | 31422 | 11262 | 5161
Organisation | 206670 | 160398 | 22661 | 9312 | 5002 | 3221 | 2072 | 1421 | 928 | 594 | 1061
Work | 360808 | 243706 | 54855 | 23097 | 12605 | 8277 | 5732 | 4007 | 2911 | 1995 | 3623

At the time of writing, DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish10. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services.

While at the moment chapters are defined by ownership of the IP and server of the sub-domain A record (e.g. http://ko.dbpedia.org) given by DBpedia maintainers, the DBpedia Internationalisation Committee11 is manifesting its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek12 and Dutch13), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].

10 Accessed on 25/09/2013: http://wiki.dbpedia.org/Internationalization/Chapters

11 http://wiki.dbpedia.org/Internationalization

12 http://el.dbpedia.org
13 http://nl.dbpedia.org


In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage.

The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is from achieving total coverage (as shown in Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.



4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute14. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [20,36]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported15. Meanwhile, DBpedia Live for Dutch16 was developed.

4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [27]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

14 http://stats.wikimedia.org/EN/SummaryEN.htm

15 http://live.dbpedia.org
16 http://live.nl.dbpedia.org

Fig. 8. Overview of DBpedia Live extraction framework.

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki, described in Section 2.4, serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.
– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separate data from different sources.


Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages: Extraction framework improvements or activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed, and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set for the added triples and another set for the deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
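Conceptually, applying one changeset boils down to a pair of SPARQL Update operations; the triples shown below are placeholders for illustration, not actual changeset content:

  PREFIX dbo: <http://dbpedia.org/ontology/>
  PREFIX dbr: <http://dbpedia.org/resource/>

  # Remove the triples listed in the changeset's 'deleted' file ...
  DELETE DATA {
    dbr:Berlin dbo:populationTotal 3499879 .
  };
  # ... then add the triples listed in its 'added' file.
  INSERT DATA {
    dbr:Berlin dbo:populationTotal 3501872 .
  }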

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and the mapping-based infobox extraction.
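Restricting a query to one of these named graphs is then a matter of a GRAPH clause; a minimal sketch against a DBpedia Live endpoint:

  PREFIX dbr: <http://dbpedia.org/resource/>

  # Query only the article-update data, ignoring static interlinks
  # and the ontology graph.
  SELECT ?p ?o WHERE {
    GRAPH <http://live.dbpedia.org> {
      dbr:Berlin ?p ?o .
    }
  }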

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing to DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository17 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third column contain the predicate that is used for linking, as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

An example for the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia)18:

  1  SELECT * {
  2    { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
  3        ?com rdf:type fts-o:Commitment .
  4        ?com fts-o:year ?year .
  5        ?year rdfs:label ?ftsyear .
  6        ?com fts-o:benefit ?benefit .
  7        ?benefit fts-o:detailAmount ?amount .
  8        ?benefit fts-o:beneficiary ?beneficiary .
  9        ?beneficiary fts-o:country ?country .
  10       ?country owl:sameAs ?ftscountry .
  11   } }
  12   { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
  13       ?dbpcountry rdf:type dbo:Country .
  14       ?dbpcountry dbp:gdpNominal ?gdpnominal .
  15       ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  16   } }
  17   FILTER ((?ftsyear = str(?gdpyear)) &&
  18           (?ftscountry = ?dbpcountry)) }

17 https://github.com/dbpedia/dbpedia-links
18 Endpoint: http://fts.publicdata.eu/sparql; results: http://bit.ly/1c2mIwQ


Table 5. Data sets linked from DBpedia.

Data set | Predicate | Count | Tool
Amsterdam Museum | owl:sameAs | 627 | S
BBC Wildlife Finder | owl:sameAs | 444 | S
Book Mashup | rdf:type, owl:sameAs | 9100 |
Bricklink | dc:publisher | 10100 |
CORDIS | owl:sameAs | 314 | S
Dailymed | owl:sameAs | 894 | S
DBLP Bibliography | owl:sameAs | 196 | S
DBTune | owl:sameAs | 838 | S
Diseasome | owl:sameAs | 2300 | S
Drugbank | owl:sameAs | 4800 | S
EUNIS | owl:sameAs | 3100 | S
Eurostat (Linked Stats) | owl:sameAs | 253 | S
Eurostat (WBSG) | owl:sameAs | 137 |
CIA World Factbook | owl:sameAs | 545 | S
flickr wrappr | dbp:hasPhotoCollection | 3800000 | C
Freebase | owl:sameAs | 3600000 | C
GADM | owl:sameAs | 1900 |
GeoNames | owl:sameAs | 86500 | S
GeoSpecies | owl:sameAs | 16000 | S
GHO | owl:sameAs | 196 | L
Project Gutenberg | owl:sameAs | 2500 | S
Italian Public Schools | owl:sameAs | 5800 | S
LinkedGeoData | owl:sameAs | 103600 | S
LinkedMDB | owl:sameAs | 13800 | S
MusicBrainz | owl:sameAs | 23000 |
New York Times | owl:sameAs | 9700 |
OpenCyc | owl:sameAs | 27100 | C
OpenEI (Open Energy) | owl:sameAs | 678 | S
Revyu | owl:sameAs | 6 |
Sider | owl:sameAs | 2000 | S
TCMGeneDIT | owl:sameAs | 904 |
UMBEL | rdf:type | 896400 |
US Census | owl:sameAs | 12600 |
WikiCompany | owl:sameAs | 8300 |
WordNet | dbp:wordnet_type | 467100 |
YAGO2 | rdf:type | 18100000 |
Sum | | 27211732 |

In addition to providing outgoing links on an instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively.

In particular, in 2011 Google, Microsoft, and Yahoo! announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
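These schema-level links are part of the ontology itself and can be listed with a query along the following lines (a sketch; it assumes the ontology triples are loaded at the queried endpoint):

  PREFIX owl: <http://www.w3.org/2002/07/owl#>

  # DBpedia ontology classes declared equivalent to schema.org terms.
  SELECT ?dboClass ?schemaClass WHERE {
    ?dboClass owl:equivalentClass ?schemaClass .
    FILTER (STRSTARTS(STR(?schemaClass), "http://schema.org/"))
  }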

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub19. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia.

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub20. However, it crawls for RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

19 See http://wiki.dbpedia.org/Interlinking for details.

20 http://datahub.io


Table 6. Top 10 data sets in Sindice ordered by the number of links to DBpedia.

Data set | Link Predicate Count | Link Count
okaboo.com | 4 | 2407121
tfri.gov.tw | 57 | 837459
naplesplus.us | 2 | 212460
fu-berlin.de | 7 | 145322
freebase.com | 108 | 138619
geonames.org | 3 | 125734
opencyc.org | 3 | 19648
geospecies.org | 10 | 16188
dbrec.net | 3 | 12856
faviki.com | 5 | 12768

Table 7. Sindice summary statistics for incoming links to DBpedia.

Metric | Value
Total links | 3960212
Total distinct data sets | 248
Total distinct predicates | 345

Table 8. Top 10 datasets by incoming links in Sindice.

Domain | Datasets | Links
purl.org | 498 | 6717520
dbpedia.org | 248 | 3960212
creativecommons.org | 2483 | 3030910
identi.ca | 1021 | 2359276
l3s.de | 34 | 1261487
rkbexplorer.com | 24 | 1212416
nytimes.com | 27 | 1174941
w3.org | 405 | 658535
geospecies.org | 13 | 523709
livejournal.com | 14881 | 366025


6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: first, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to download popularity as well. The top five languages with respect to download volume are English, Chinese, German, Catalan, and French, respectively. The download count and download volume of the English language edition is indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 is depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month21.

Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and are downloaded more often as they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round-robin HTTP front-end.

21 The IP address was only filtered for that specific file and month in those cases.

22 http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume (in GB) of the DBpedia data sets.

Table 9. Hardware of the machines serving the public SPARQL endpoint.

DBpedia | Configuration
3.3 - 3.4 | AMD Opteron 8220, 2.80GHz, 4 cores, 32GB
3.5 - 3.7 | Intel Xeon E5520, 2.27GHz, 8 cores, 48GB
3.8 | Intel Xeon E5-2630, 2.30GHz, 8 cores, 64GB

This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200-second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth.

Table 11. Number of unique sites accessing DBpedia endpoints.

DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 5824 | 5977 | 1046 | 7865
3.4 | 6656 | 6704 | 947 | 8312
3.5 | 9392 | 9432 | 1643 | 12584
3.6 | 10810 | 10810 | 1612 | 14798
3.7 | 17077 | 17091 | 6782 | 33288
3.8 | 14513 | 14493 | 2783 | 20510

Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code23, like 509 ("Bandwidth Limit Exceeded"), and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.24 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by

23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

24 http://www.webalizer.org


Table 10. Number of endpoint hits (left) and visits (right)

Hits:
DBpedia    Avg/Day    Median     Stdev      Maximum
3.3         733811     711306    188991     1319539
3.4        1212549    1165893    351226     2371657
3.5        1122612    1035444    386454     2908720
3.6        1328355    1286750    256945     2495031
3.7        2085399    1930728   1057398     8873473
3.8        2910410    2717775   1085640     7678490

Visits:
DBpedia    Avg/Day    Median     Stdev      Maximum
3.3           9750       9775      2036       13126
3.4          11329      11473      1546       14198
3.5          16592      16902      2969       23129
3.6          19471      17664      5691       56064
3.7          23972      22262     10741      127869
3.8          16851      16711      2960       27416

the Median and Standard Deviation. The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to stricter initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs pointing to the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the

Table 12. Hits per service to http://dbpedia.org in thousands

Endpoint      3.3     3.4     3.5     3.6     3.7     3.8
data         1355    2681    2230    2547    3714    4246
ontology       80     168     142     178     116      97
page         2634    4703    1810    1687    3787    7377
property      231     311     137     176     176     128
resource     2976    4080    2554    2309    4436    7143
sparql       2090    4177    2266    9112   15902   15475
other         252     420     342     277     579     695
total        9619   16541    9434   16286   28710   35142

Fig. 11. Traffic: Linked Data versus SPARQL endpoint. [Chart omitted.]

resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.
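The content negotiation on the resource endpoint can be observed directly, as in the following minimal Python sketch (the redirect target mentioned in the comment is illustrative):

import urllib.request

req = urllib.request.Request(
    "http://dbpedia.org/resource/Ford_GT40",
    headers={"Accept": "text/turtle"})
# urlopen transparently follows the 303 redirect to the data variant;
# resp.url then points at the redirect target (a /data/... URL), while
# an HTML Accept header would lead to the /page/... variant instead.
with urllib.request.urlopen(req) as resp:
    print(resp.url)
    turtle = resp.read().decode("utf-8")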

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13 we focused on the calls to the sparql endpoint and counted the number of statements per type.


Table 13. Hits per statement type in thousands

Statement      3.3     3.4     3.5     3.6     3.7     3.8
ask             56     269     360     326     159     677
construct       55      31      14      11      22      38
describe        11       8       4       7      62     111
select        1891    3663    1658    8030   11204   13516
unknown         78     206     229     738    4455    1134
total         2090    4177    2266    9112   15902   15475

Table 14. Trends in SPARQL select (rounded values in %)

Statement      3.3     3.4     3.5     3.6     3.7     3.8
distinct      19.5    11.4    17.3    19.4    13.3    25.4
filter        45.7    13.7    31.8    25.3    29.2    36.1
functions      8.8     6.9    23.5    21.3    25.5    25.9
geo           27.7     7.0    39.6     6.7     9.3     9.2
group          0.0     0.0     0.0     0.0     0.0     0.2
limit          4.6     6.5    11.6    10.5     7.7     7.8
optional      30.6    23.8    47.3    26.3    16.7    17.2
order          2.3     1.3     1.9     3.2     1.2     1.6
union          3.3     3.2    26.4    11.3    12.1    20.6

As the log files only record the full SPARQL query on a GET request, all the POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS, like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
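The per-keyword percentages can be computed with a short script along these lines (a hypothetical sketch; "queries" stands for the decoded SPARQL strings recovered from the log sample):

import re
from collections import Counter

KEYWORDS = ["DISTINCT", "FILTER", "OPTIONAL", "UNION",
            "GROUP BY", "ORDER BY", "LIMIT"]

def keyword_shares(queries):
    # Restrict the analysis to SELECT queries, as in Table 14.
    selects = [q for q in queries if re.search(r"\bSELECT\b", q, re.I)]
    counts = Counter()
    for q in selects:
        for kw in KEYWORDS:
            pattern = r"\b" + kw.replace(" ", r"\s+") + r"\b"
            if re.search(pattern, q, re.I):
                counts[kw] += 1
    total = len(selects) or 1
    # Percentage of SELECT queries using each keyword.
    return {kw: 100.0 * n / total for kw, n in counts.items()}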

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number

of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12. Number of daily requests sent to the DBpedia Live endpoint for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013. [Chart omitted.]
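A mirror's pull step might look like the following minimal Python sketch; the changeset URL layout used here is an assumption for illustration only, as the official synchronisation tool encapsulates these details:

import gzip
import urllib.request

# Hypothetical changeset URL; the actual layout is handled by the
# synchronisation tool and may differ.
CHANGESET = "http://live.dbpedia.org/changesets/2013/01/15/00/000001.added.nt.gz"

def fetch_added_triples(url):
    # Download one gzipped N-Triples changeset and return its contents.
    with urllib.request.urlopen(url) as resp:
        return gzip.decompress(resp.read()).decode("utf-8")

# triples = fetch_added_triples(CHANGESET)
# ...then INSERT the triples into the local mirror's store.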

7 Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, proliferation for applications, mashups, as well as text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets.25 For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have

25 http://wiki.dbpedia.org/Datasets/NLP


been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation: Entity Disambiguation
An important use case for NLP is annotating texts

or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refer to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool26 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
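A call to the web service can be sketched as follows; the endpoint path and the parameter names ("text", "confidence") reflect the service's REST interface at the time of writing and should be treated as assumptions:

import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "text": "Berlin is the capital of Germany.",
    "confidence": "0.4",  # minimum disambiguation confidence (assumed)
})
req = urllib.request.Request(
    "http://spotlight.dbpedia.org/rest/annotate?" + params,
    headers={"Accept": "application/json"})
with urllib.request.urlopen(req) as resp:
    annotations = json.load(resp)  # detected DBpedia resources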

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI,27 Semantic API

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com

from Ontos,28 Open Calais29 and Zemanta,30 among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project.31

A related project is ImageSnippets,32 which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation: Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering
DBpedia provides a wealth of human knowledge

across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia.35 DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series.36 Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such

28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org
32 http://www.imagesnippets.com
33 http://www.faviki.com
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


systems. Due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: the broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project.37 Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia.38 These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general; a query sketch for retrieving such links follows the footnotes below.

37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html
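The authority links can be retrieved with a query along these lines (an illustrative sketch; it assumes the links are published as owl:sameAs triples pointing to viaf.org URIs):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?person ?viaf WHERE {
  ?person a dbo:Person ;
          owl:sameAs ?viaf .
  FILTER(STRSTARTS(STR(?viaf), "http://viaf.org/"))
}
LIMIT 100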

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

Facet Based Browsers: An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying: The DBpedia Query Builder41

allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows to create statistical analysis queries and correlations in RDF data, and DBpedia in particular.
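A hand-written counterpart to what RelFinder visualises, namely direct and two-hop connections between two resources, could look as follows (an illustrative sketch):

PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?p1 ?mid ?p2 WHERE {
  { dbr:Albert_Einstein ?p1 dbr:Niels_Bohr . }
  UNION
  { dbr:Albert_Einstein ?p1 ?mid .
    ?mid ?p2 dbr:Niels_Bohr . }
}
LIMIT 50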

Spatial Applications: DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, avail-

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org


able in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains.43 Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast-changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.44 Statistics are shown in Table 15.

8 Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata
In March 2012, Wikimedia Germany e.V. started

the development of Wikidata.45 Wikidata is a free knowledge base about the world that can be read and

42 http://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page.
44 http://wiktionary.dbpedia.org
45 http://wikidata.org

edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface,46 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase
Freebase47 is a graph database which also extracts

structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47 http://www.freebase.com


Table 15. Statistical comparison of extractions for different languages

language      words      triples     resources    predicates    senses
en          2142237    28593364     11804039            28     424386
fr          4657817    35032121     20462349            22     592351
ru          1080156    12813437      5994560            17     149859
de           701739     5618508      2966867            16     122362

and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary, since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO
One of the projects that pursues similar goals

to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its

48 http://www.mpi-inf.mpg.de/yago-naga/yago

content. YAGO focuses on extracting a smaller number of relations, compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 that are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source Java-based API that allows to access information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also

49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction


provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service.52 Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site.53

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text, applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike.55 Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a Web-scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9 Conclusions and Future Work

In this system report, we presented an overview on recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that, in the past four years, DBpedia matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see, in particular, the following directions for advancing the DBpedia project.

Multilingual data integration and fusion: An area which is still largely unexplored is the integration and

56 http://www.cs.washington.edu/research/knowitall


fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement: In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction: Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki.57 If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia: While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia: A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
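The birthday check mentioned above can be expressed directly in SPARQL (an illustrative sketch using the dbo: properties for birth and death dates):

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER(?death < ?birth)
}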

Integrate DBpedia and NLP: Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (depends on training data) and update (requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors,
– all DBpedia Mappings Wiki contributors,
– OpenLink Software for providing and maintaining

the server infrastructure for the main DBpedia endpoint,

– Kingsley Idehen for SPARQL and Linked Data hosting and community support,

– Christopher Sahnwaldt for DBpedia development and release management,

– Claus Stadler for DBpedia development,


– Paul Kreis for DBpedia development,
– people who helped contributing data for certain

parts of the article:

* Instance Data Analysis: Volha Bryl (working at University of Mannheim)

* Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)

* Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16. List of namespace prefixes

Prefix    Namespace
dbo       http://dbpedia.org/ontology/
dbp       http://dbpedia.org/property/
dbr       http://dbpedia.org/resource/
dbr-de    http://de.dbpedia.org/resource/
dc        http://purl.org/dc/elements/1.1/
foaf      http://xmlns.com/foaf/0.1/
geo       http://www.w3.org/2003/01/geo/wgs84_pos#
georss    http://www.georss.org/georss/
ls        http://spotlight.dbpedia.org/scores/
lx        http://spotlight.dbpedia.org/lexicalizations/
owl       http://www.w3.org/2002/07/owl#
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs      http://www.w3.org/2000/01/rdf-schema#
skos      http://www.w3.org/2004/02/skos/core#
sptl      http://spotlight.dbpedia.org/vocab/
xsd       http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17. DBpedia timeline

Year   Month   Event
2006   Nov     Start of infobox extraction from Wikipedia
2007   Mar     DBpedia 1.0 release
       Jun     ESWC DBpedia article [2]
       Nov     DBpedia 2.0 release
       Nov     ISWC DBpedia article [1]
       Dec     DBpedia 3.0 release candidate
2008   Feb     DBpedia 3.0 release
       Aug     DBpedia 3.1 release
       Nov     DBpedia 3.2 release
2009   Jan     DBpedia 3.3 release
       Sep     JWS DBpedia article [5]
2010   Feb     Information extraction framework in Scala;
               Mappings Wiki release
       Mar     DBpedia 3.4 release
       Apr     DBpedia 3.5 release
       May     Start of DBpedia Internationalization effort
2011   Feb     DBpedia Spotlight release
       Mar     DBpedia 3.6 release
       Jul     DBpedia Live release
       Sep     DBpedia 3.7 release (with I18n datasets)
2012   Aug     DBpedia 3.8 release
       Sep     Publication of DBpedia Internationalization article [24]
2013   Sep     DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD: International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini and A. Antonuccio. LodLive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer and J. Lehmann. User-driven Quality Evaluation of DBpedia. To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.




1 Introduction

Wikipedia is the 6th most popular website,1 the most widely used encyclopedia and one of the finest examples of truly collaboratively created content. There are official Wikipedia editions in 287 different languages, which range in size from a couple of hundred articles up to 3.8 million articles (English edition).2 Besides free text, Wikipedia articles consist of different types of structured data, such as infoboxes, tables, lists and categorization data. Wikipedia currently offers only free-text search capabilities to its users. Using Wikipedia search, it is thus very difficult to find all rivers that flow into the Rhine and are longer than 100 miles, or all Italian composers that were born in the 18th century.

The DBpedia project builds a large-scale, multilingual knowledge base by extracting structured data from Wikipedia editions in 111 languages. This knowledge base can be used to answer expressive queries such as the ones outlined above. Being multilingual and covering a wide range of topics, the DBpedia knowledge base is also useful within further application domains such as data integration, named entity recognition, topic detection and document ranking.

The DBpedia knowledge base is widely used as a testbed in the research community, and numerous applications, algorithms and tools have been built around or applied to DBpedia. DBpedia is served as Linked Data on the Web. Since it covers a wide variety of topics and sets RDF links pointing into various external data sources, many Linked Data publishers have decided to set RDF links pointing to DBpedia from their data sets. Thus, DBpedia has developed into a central interlinking hub in the Web of Linked Data and has been a key factor for the success of the Linked Open Data initiative.

The structure of the DBpedia knowledge base is maintained by the DBpedia user community. Most importantly, the community creates mappings from Wikipedia information representation structures to the DBpedia ontology. This ontology – which will be explained in detail in Section 3 – unifies different template structures, both within single Wikipedia language editions and across currently 27 different languages.

*Corresponding author. E-mail: lehmann@informatik.uni-leipzig.de

1 http://www.alexa.com/topsites, retrieved in October 2013.

2 http://meta.wikimedia.org/wiki/List_of_Wikipedias

The maintenance of different language editions of DBpedia is spread across a number of organisations. Each organisation is responsible for the support of a certain language. The local DBpedia chapters are coordinated by the DBpedia Internationalisation Committee.

The aim of this system report is to provide a description of the DBpedia community project, including the architecture of the DBpedia extraction framework, its technical implementation, maintenance, internationalisation, usage statistics, as well as presenting some popular DBpedia applications. This system report is a comprehensive update and extension of previous project descriptions in [1] and [5]. The main advances compared to these articles are:

– The concept and implementation of the extraction based on a community-curated DBpedia ontology.

– The wide internationalisation of DBpedia.
– A live synchronisation module which processes

updates in Wikipedia as well as the DBpedia ontology and allows third parties to keep their copies of DBpedia up-to-date.

– A description of the maintenance of public DBpedia services and statistics about their usage.

– An increased number of interlinked data sets which can be used to further enrich the content of DBpedia.

– The discussion and summary of novel third-party applications of DBpedia.

Overall, DBpedia has undergone 7 years of continuous evolution. Table 17 provides an overview of the project's timeline.

The system report is structured as follows: In the next section, we describe the DBpedia extraction framework, which forms the technical core of DBpedia. This is followed by an explanation of the community-curated DBpedia ontology, with a focus on multilingual support. In Section 4, we explicate how DBpedia is synchronised with Wikipedia with just very short delays and how updates are propagated to DBpedia mirrors employing the DBpedia Live system. Subsequently, we give an overview of the external data sets that are interlinked from DBpedia or that set RDF links pointing to DBpedia themselves (Section 5). In Section 6, we provide statistics on the usage of DBpedia and describe the maintenance of a large-scale public data set. Within Section 7, we briefly describe several use cases and applications of DBpedia in a variety of different areas. Finally, we report on related work in Section 8 and give an outlook on the further development of DBpedia in Section 9.


Fig. 1. Overview of the DBpedia extraction framework. [Diagram omitted.]

2 Extraction Framework

Wikipedia articles consist mostly of free text, but also comprise various types of structured information in the form of wiki markup. Such information includes infobox templates, categorisation information, images, geo-coordinates, links to external web pages, disambiguation pages, redirects between pages and links across different language editions of Wikipedia. The DBpedia extraction framework extracts this structured information from Wikipedia and turns it into a rich knowledge base. In this section, we give an overview of the DBpedia knowledge extraction framework.

2.1 General Architecture

Figure 1 shows an overview of the technical framework. The DBpedia extraction is structured into four phases (a sketch of this pipeline, in code, follows the list):

Input: Wikipedia pages are read from an external source. Pages can either be read from a Wikipedia dump or directly fetched from a MediaWiki installation using the MediaWiki API.

Parsing: Each Wikipedia page is parsed by the wiki parser. The wiki parser transforms the source code of a Wikipedia page into an Abstract Syntax Tree.

Extraction: The Abstract Syntax Tree of each Wikipedia page is forwarded to the extractors. DBpedia offers extractors for many different purposes, for instance, to extract labels, abstracts or geographical coordinates. Each extractor consumes an Abstract Syntax Tree and yields a set of RDF statements.

Output: The collected RDF statements are written to a sink. Different formats, such as N-Triples, are supported.
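The sketch below illustrates the four phases; Python is used for illustration only (the actual framework is implemented in Scala, and all names here are hypothetical):

def run_extraction(pages, parser, extractors, sink):
    # Input: "pages" is an iterator over wiki source, read from a dump
    # file or fetched via the MediaWiki API.
    for page in pages:
        # Parsing: the wiki parser turns the source code of a page
        # into an Abstract Syntax Tree (AST).
        ast = parser.parse(page)
        # Extraction: each extractor consumes the AST and yields RDF
        # statements.
        for extractor in extractors:
            for statement in extractor.extract(ast):
                # Output: statements are written to a sink, e.g. an
                # N-Triples file.
                sink.write(statement)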

2.2 Extractors

The DBpedia extraction framework employs various extractors for translating different parts of Wikipedia pages to RDF statements. A list of all available extractors is shown in Table 1. DBpedia extractors can be divided into four categories:

Mapping-Based Infobox Extraction: The mapping-based infobox extraction uses manually written mappings that relate infoboxes in Wikipedia to terms in the DBpedia ontology. The mappings also specify a datatype for each infobox property and thus help the extraction framework to produce high-quality data. The mapping-based extraction will be described in detail in Section 2.4.

Raw Infobox Extraction: The raw infobox extraction provides a direct mapping from infoboxes in Wikipedia to RDF. As the raw infobox extraction does not rely on explicit extraction knowledge in the form of mappings, the quality of the extracted data is lower. The raw infobox data is useful if a specific infobox has not been mapped yet and thus is not available in the mapping-based extraction.

Feature Extraction: The feature extraction uses a number of extractors that are specialized in extracting a single feature from an article, such as a label or geographic coordinates.


Statistical Extraction: Some NLP-related extractors aggregate data from all Wikipedia pages in order to provide data that is based on statistical measures of page links or word counts, as further described in Section 2.6.

2.3 Raw Infobox Extraction

The type of Wikipedia content that is most valuable for the DBpedia extraction are infoboxes. Infoboxes are frequently used to list an article's most relevant facts as a table of attribute-value pairs on the top right-hand side of the Wikipedia page (for right-to-left languages on the top left-hand side). Infoboxes that appear in a Wikipedia article are based on a template that specifies a list of attributes that can form the infobox. A wide range of infobox templates are used in Wikipedia. Common examples are templates for infoboxes that describe persons, organisations or automobiles. As Wikipedia's infobox template system has evolved over time, different communities of Wikipedia editors use different templates to describe the same type of things (e.g. Infobox city japan, Infobox swiss town and Infobox town de). In addition, different templates use different names for the same attribute (e.g. birthplace and placeofbirth). As many Wikipedia editors do not strictly follow the recommendations given on the page that describes a template, attribute values are expressed using a wide range of different formats and units of measurement. An excerpt of an infobox that is based on a template for describing automobiles is shown below:

{{Infobox automobile
| name         = Ford GT40
| manufacturer = [[Ford Advanced Vehicles]]
| production   = 1964-1969
| engine       = 4181cc
(...)

In this infobox, the first line specifies the infobox type and the subsequent lines specify various attributes of the described entity.

An excerpt of the extracted data is as follows:

dbr:Ford_GT40
   dbp:name "Ford GT40"@en ;
   dbp:manufacturer dbr:Ford_Advanced_Vehicles ;
   dbp:engine 4181 ;
   dbp:production 1964 ;
(...)

This extraction output has weaknesses: the resource is not associated to a class in the ontology, and parsed

values are cleaned up and assigned a datatype based on heuristics. In particular, the raw infobox extractor searches for values in the following order: dates, coordinates, numbers, links and strings as default. Thus, the datatype assignment for the same property in different resources is non-deterministic. The engine, for example, is extracted as a number, but if another instance of the template used "cc4181", it would be extracted as a string. This behaviour makes querying for properties in the dbp namespace inconsistent. Those problems can be overcome by the mapping-based infobox extraction presented in the next subsection.

2.4 Mapping-Based Infobox Extraction

In order to homogenize the description of information in the knowledge base, in 2010 a community effort was initiated to develop an ontology schema and mappings from Wikipedia infobox properties to this ontology. The alignment between Wikipedia infoboxes and the ontology is performed via community-provided mappings that help to normalize name variations in properties and classes. Heterogeneity in the Wikipedia infobox system, like using different infoboxes for the same type of entity or using different property names for the same property (cf. Section 2.3), can be alleviated in this way. This significantly increases the quality of the raw Wikipedia infobox data by typing resources, merging name variations and assigning specific datatypes to the values.

This effort is realized using the DBpedia Mappings Wiki,3 a MediaWiki installation set up to enable users to collaboratively create and edit mappings. These mappings are specified using the DBpedia Mapping Language. The mapping language makes use of MediaWiki templates that define DBpedia ontology classes and properties, as well as template/table to ontology mappings. A mapping assigns a type from the DBpedia ontology to the entities that are described by the corresponding infobox. In addition, attributes in the infobox are mapped to properties in the DBpedia ontology. In the following, we show a mapping that maps infoboxes that use the Infobox automobile template to the DBpedia ontology:

{{TemplateMapping
| mapToClass = Automobile
| mappings =
    {{PropertyMapping
    | templateProperty = name
    | ontologyProperty = foaf:name }}
    {{PropertyMapping
    | templateProperty = manufacturer
    | ontologyProperty = manufacturer }}
    {{DateIntervalMapping
    | templateProperty = production
    | startDateOntologyProperty = productionStartDate
    | endDateOntologyProperty = productionEndDate }}
    {{IntermediateNodeMapping
    | nodeClass = AutomobileEngine
    | correspondingProperty = engine
    | mappings =
        {{PropertyMapping
        | templateProperty = engine
        | ontologyProperty = displacement
        | unit = Volume }}
        {{PropertyMapping
        | templateProperty = engine
        | ontologyProperty = powerOutput
        | unit = Power }}
    }}
(...)

3 http://mappings.dbpedia.org


Table 1: Overview of the DBpedia extractors (cf. Table 16 for a complete list of prefixes)

Name | Description | Example
abstract | Extracts the first lines of the Wikipedia article | dbr:Berlin dbo:abstract "Berlin is the capital city of (...)"
article categories | Extracts the categorization of the article | dbr:Oliver_Twist dc:subject dbr:Category:English_novels
category label | Extracts labels for categories | dbr:Category:English_novels rdfs:label "English novels"
category hierarchy | Extracts information about which concept is a category and how categories are related using the SKOS Vocabulary | dbr:Category:World_War_II skos:broader dbr:Category:Modern_history
disambiguation | Extracts disambiguation links | dbr:Alien dbo:wikiPageDisambiguates dbr:Alien_(film)
external links | Extracts links to external web pages related to the concept | dbr:Animal_Farm dbo:wikiPageExternalLink <http://books.google.com/?id=RBGmrDnBs8UC>
geo coordinates | Extracts geo-coordinates | dbr:Berlin georss:point "52.5006 13.3989"
grammatical gender | Extracts grammatical genders for persons | dbr:Abraham_Lincoln foaf:gender "male"
homepage | Extracts links to the official homepage of an instance | dbr:Alabama foaf:homepage <http://alabama.gov/>
image | Extracts the first image of a Wikipedia page | dbr:Berlin foaf:depiction <http://.../Overview_Berlin.jpg>
infobox | Extracts all properties from all infoboxes | dbr:Animal_Farm dbo:date "March 2010"
interlanguage | Extracts interwiki links | dbr:Albedo dbo:wikiPageInterLanguageLink dbr-de:Albedo
label | Extracts the article title as label | dbr:Berlin rdfs:label "Berlin"
lexicalizations | Extracts information about surface forms and their association with concepts (only N-Quad format) | dbr:Pine sptl:lexicalization lx:pine_tree . lx:pine_tree rdfs:label "pine tree" . ls:Pine_pine_tree sptl:pUriGivenSf 0.941
mappings | Extraction based on mappings of Wikipedia infoboxes to the DBpedia ontology | dbr:Berlin dbo:country dbr:Germany
page ID | Extracts page ids of articles | dbr:Autism dbo:wikiPageID 25
page links | Extracts all links between Wikipedia articles | dbr:Autism dbo:wikiPageWikiLink dbr:Human_brain
persondata | Extracts information about persons represented using the PersonData template | dbr:Andre_Agassi foaf:birthDate "1970-04-29"
PND | Extracts PND (Personennamendatei) data about a person | dbr:William_Shakespeare dbo:individualisedPnd "118613723"
redirects | Extracts redirect links between articles in Wikipedia | dbr:ArtificialLanguages dbo:wikiPageRedirects dbr:Constructed_language
revision ID | Extracts the revision ID of the Wikipedia article | dbr:Autism <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/Autism?oldid=495234324>
thematic concept | Extracts 'thematic' concepts, the centres of discussion for categories | dbr:Category:Music skos:subject dbr:Music
topic signatures | Extracts topic signatures | dbr:Alkane sptl:topicSignature "carbon alkanes atoms"
wiki page | Extracts links to corresponding articles in Wikipedia | dbr:AnAmericanInParis foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/AnAmericanInParis>



The RDF statements that are extracted from the previous infobox example are shown below. As we can see, the production period is correctly split into a start year and an end year, and the engine is represented by a distinct RDF node. It is worth mentioning that all values are canonicalized to basic units. For example, in the engine mapping we state that engine is a Volume, and thus the extractor converts "4181cc" (cubic centimetres) to cubic metres ("0.004181"). Additionally, there can exist multiple mappings on the same property that search for different datatypes or different units. For example, a number with "PS" as a suffix for engine is matched by the second mapping shown above and converted to the powerOutput property.

dbr:Ford_GT40
  rdf:type dbo:Automobile ;
  rdfs:label "Ford GT40"@en ;
  dbo:manufacturer dbr:Ford_Advanced_Vehicles ;
  dbo:productionStartYear "1964"^^xsd:gYear ;
  dbo:productionEndYear "1969"^^xsd:gYear ;
  dbo:engine [
    rdf:type dbo:AutomobileEngine ;
    dbo:displacement 0.004181
  ] .
(...)

The DBpedia Mappings Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 2 shows how the infobox properties author and συγγραφέας (author in Greek) are both being mapped to the global identifier dbo:author. That means, in turn, that information from all language versions of DBpedia can be merged, and DBpedias for smaller languages can be augmented with knowledge from larger DBpedias such as the English edition. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns, which is often only present in the corresponding language edition [43].
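To sketch what this merging enables (an illustrative query, not taken from the article): because both the English and the Greek infobox attributes are mapped to dbo:author, a single query over the canonicalized data sets returns books from all language editions:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?book ?author
WHERE {
  # dbo:author statements may originate from English,
  # Greek or any other mapped Wikipedia edition
  ?book rdf:type dbo:Book ;
        dbo:author ?author .
}
LIMIT 10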

Besides hosting the mappings and the DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

– Mapping Syntax Validator: The mapping syntax validator checks for syntactic correctness and highlights inconsistencies such as missing property definitions.

– Extraction Tester: The extraction tester, linked on each mapping page, tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.

– Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users in creating and editing mappings.

2.5 URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

– http://dbpedia.org/resource/ (prefix dbr) for representing article data. There is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin4, DBpedia will produce the URI dbr:Berlin. Exceptions to this rule appear when intermediate nodes are extracted from the mapping-based infobox extractor as unique URIs (e.g. the engine mapping example in Section 2.4).

– http://dbpedia.org/property/ (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.

4 http://en.wikipedia.org/wiki/Berlin


Fig. 2: Depiction of the mapping from the Greek (left) and English Wikipedia templates (right) about books to the same DBpedia ontology class (middle) [24]

– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links5. For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach omitted valuable data [24]. Thus, starting from the DBpedia 3.7 release6, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within the datasets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic language-agnostic namespace http://dbpedia.org/resource/.

5 http://en.wikipedia.org/wiki/Help:Interlanguage_links

6 A list of all DBpedia releases is provided in Table 17.

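As an illustration of the two data set flavours, consider the following sketch (the Greek URIs and the population value are hypothetical examples, not actual extraction output):

# localized Greek data set: language-specific URIs
<http://el.dbpedia.org/resource/Βερολίνο>
    <http://el.dbpedia.org/property/πληθυσμός> 3500000 .

# canonicalized Greek data set: the same statement, but the
# subject is the generic URI shared by all language editions
<http://dbpedia.org/resource/Berlin>
    <http://el.dbpedia.org/property/πληθυσμός> 3500000 .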

2.6 NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [33]. Currently, four datasets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concept. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concept extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The lexicalization data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones, and also show how ambiguous a name is with respect to all possible concepts that it can mean.
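For instance, using the predicates shown in Table 1 (the sptl prefix URI below is an assumption, and the data set is published as N-Quad dumps rather than via the main endpoint), surface forms and their labels can be retrieved along these lines:

PREFIX sptl: <http://dbpedia.org/spotlight/>  # assumed prefix URI
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbr:  <http://dbpedia.org/resource/>

SELECT ?form ?label
WHERE {
  # alternative surface forms for the entity Pine; scores such
  # as sptl:pUriGivenSf are attached to resource-surface form
  # pairs (cf. Table 1)
  dbr:Pine sptl:lexicalization ?form .
  ?form rdfs:label ?label .
}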

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the tf-idf weight, with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [30].
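In the standard tf-idf formulation (stated here for clarity; the exact variant used by the extraction framework is not spelled out in this article), the weight of a word stem w for a resource r is

    tfidf(w, r) = tf(w, r) * log(N / df(w))

where tf(w, r) is the frequency of w in the Wikipedia text associated with r, df(w) is the number of resources whose associated text contains w, and N is the total number of resources.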

There are two more feature extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the centre of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (subject, object, possessive adjective, possessive pronoun and reflexive), i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
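The resulting assignments can be inspected with a simple aggregate query, a minimal sketch using the foaf:gender property from Table 1:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo:  <http://dbpedia.org/ontology/>

SELECT ?gender (COUNT(?person) AS ?persons)
WHERE {
  # count persons per extracted grammatical gender
  ?person rdf:type dbo:Person ;
          foaf:gender ?gender .
}
GROUP BY ?gender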

2.7 Summary of Other Recent Developments

In this section we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework has been rewritten in Scala in 20107, resulting in an improvement of the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. This is crucial for the DBpedia Live project, which will be explained in Section 4. The new, more modular framework also allows the extraction of data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

7 Table 17 provides an overview of the project's evolution through time.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years the parsing of the MediaWiki markup improved considerably, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help and Portal in English, which indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain-text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.), and for a number of classes and properties specific datatypes were added (e.g. inhabitants/km2 for the population density of populated places and m3/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction, by resolving alternative template names. Moreover, in the PHP framework, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links and hence increased the overall interconnectivity of resources in the DBpedia ontology.
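Clients can exploit the redirect data themselves; a minimal sketch that resolves a redirect page (using the example from Table 1) to its target article:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?target
WHERE {
  # dbo:wikiPageRedirects points from the redirect page to the
  # article it redirects to; the transitive closure is
  # materialized, so chains are already collapsed
  dbr:ArtificialLanguages dbo:wikiPageRedirects ?target .
}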


Fig. 3: Snapshot of a part of the DBpedia ontology. The figure shows the top classes (owl:Thing with subclasses such as dbo:Agent, dbo:Place, dbo:Work, dbo:Species, and further subclasses dbo:Person, dbo:Athlete, dbo:Organisation, dbo:PopulatedPlace, dbo:Settlement, dbo:Eukaryote) connected by rdfs:subClassOf, together with selected properties, their domains and datatypes (e.g. dbo:birthPlace, dbo:populationTotal xsd:integer, dbo:areaTotal xsd:double, dbo:birthDate xsd:date).

Finally, a new heuristic to increase the connectiveness of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.

Orthogonal to the previously mentioned improvements, there have been various efforts to assess the quality of the DBpedia datasets. [26] developed a framework for estimating the quality of DBpedia, in which a sample of 75 resources was analysed. A more comprehensive effort was performed in [48] by providing a distributed web-based interface [25] for quality assessment. In this study, 17 data quality problem types were analysed by 58 users, covering 521 resources in DBpedia.

3 DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances. The complete DBpedia ontology can be browsed online at http://mappings.dbpedia.org/server/ontology/classes/.

Fig. 4: Growth of the DBpedia ontology (number of classes and properties per DBpedia version, 3.2-3.8).

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes grows only moderately, due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1 Mapping Statistics

Fig. 5: Mapping coverage statistics for all mapping-enabled languages.

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.

The mapping activity of the ontology enrichment process, along with the editing activity of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year show a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Fig. 7: English property mappings occurrence frequency (both axes are in log scale).

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long-tail distribution: a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2 Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been created so far in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of popular classes from the third and fourth hierarchy level of the ontology within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12,936 persons are described in five languages but not in six or more languages. The number 871,630 for the class Person means that all 20 language versions together describe 871,630 different persons. The number is higher than the number of persons described in the canonicalized English infobox


Fig. 6: Mapping community activity for (a) the ontology and (b) the 10 most active language editions.

Table 2: Basic statistics about localized DBpedia editions

Lang | Inst. LD all | Inst. CD all | Inst. with MD CD | Raw Prop. CD | Map. Prop. CD | Raw Statem. CD | Map. Statem. CD
en | 3,769,926 | 3,769,926 | 2,359,521 | 48,293 | 1,313 | 65,143,840 | 33,742,015
de | 1,243,771 | 650,037 | 204,335 | 9,593 | 261 | 7,603,562 | 2,880,381
fr | 1,197,334 | 740,044 | 214,953 | 13,551 | 228 | 8,854,322 | 2,901,809
it | 882,127 | 580,620 | 383,643 | 9,716 | 181 | 12,227,870 | 4,804,731
es | 879,091 | 542,524 | 310,348 | 14,643 | 476 | 7,740,458 | 4,383,206
pl | 848,298 | 538,641 | 344,875 | 7,306 | 266 | 7,696,193 | 4,511,794
ru | 822,681 | 439,605 | 123,011 | 13,522 | 76 | 6,973,305 | 1,389,473
pt | 699,446 | 460,258 | 272,660 | 12,851 | 602 | 6,255,151 | 4,005,527
ca | 367,362 | 241,534 | 112,934 | 8,696 | 183 | 3,689,870 | 1,301,868
cs | 225,133 | 148,819 | 34,893 | 5,564 | 334 | 1,857,230 | 474,459
hu | 209,180 | 138,998 | 63,441 | 6,821 | 295 | 2,506,399 | 601,037
ko | 196,132 | 124,591 | 30,962 | 7,095 | 419 | 1,035,606 | 417,605
tr | 187,850 | 106,644 | 40,438 | 7,512 | 440 | 1,350,679 | 556,943
ar | 165,722 | 103,059 | 16,236 | 7,898 | 268 | 635,058 | 168,686
eu | 132,877 | 108,713 | 41,401 | 2,245 | 19 | 2,255,897 | 532,709
sl | 129,834 | 73,099 | 22,036 | 4,235 | 470 | 1,213,801 | 222,447
bg | 125,762 | 87,679 | 38,825 | 3,984 | 274 | 774,443 | 488,678
hr | 109,890 | 71,469 | 10,343 | 3,334 | 158 | 701,182 | 151,196
el | 71,936 | 48,260 | 10,813 | 2,866 | 288 | 206,460 | 113,838

data set (763,643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195,263 persons are described in two or more languages. The large difference of this number compared to the total number of 871,630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [20] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages8. The DBpedia 3.7 release9 in September 2011 was the first DBpedia release to use the localized I18n (Internationalisation) DBpedia extraction framework [24].

8 Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr), Urdu (ur)

9 http://blog.dbpedia.org/2011/09/11


Table 3: Number of instances per class within 10 localized DBpedia versions

Class | en | it | pl | es | pt | fr | de | ru | ca | hu
Person | 763,643 | 145,060 | 70,708 | 65,337 | 43,057 | 62,942 | 33,122 | 18,620 | 7,107 | 15,529
  Athlete | 185,126 | 47,187 | 30,332 | 19,482 | 14,130 | 21,646 | 31,237 | 0 | 721 | 4,527
  Artist | 61,073 | 12,511 | 16,120 | 25,992 | 10,571 | 13,465 | 0 | 0 | 2,004 | 3,821
  Politician | 23,096 | 0 | 8,943 | 5,513 | 3,342 | 0 | 0 | 12,004 | 1,376 | 760
Place | 572,728 | 141,101 | 182,727 | 132,961 | 116,660 | 80,602 | 131,766 | 67,932 | 73,078 | 18,324
  PopulPlace | 387,166 | 138,077 | 167,034 | 121,204 | 109,418 | 72,252 | 79,410 | 63,826 | 72,743 | 15,535
  Building | 60,514 | 1,270 | 2,946 | 3,570 | 803 | 921 | 83 | 43 | 0 | 527
  River | 24,267 | 0 | 1,383 | 2 | 4,149 | 3,333 | 6,707 | 3,924 | 0 | 565
Organisation | 192,832 | 4,142 | 12,193 | 11,710 | 10,949 | 17,513 | 16,973 | 1,598 | 1,399 | 3,993
  Company | 44,516 | 4,142 | 2,566 | 975 | 1,903 | 5,832 | 7,200 | 0 | 440 | 618
  EducInst | 42,270 | 0 | 599 | 1,207 | 270 | 1,636 | 2,171 | 1,010 | 0 | 115
  Band | 27,061 | 0 | 2,993 | 0 | 4,476 | 3,868 | 5,368 | 0 | 263 | 802
Work | 333,269 | 51,918 | 32,386 | 36,484 | 33,869 | 39,195 | 18,195 | 34,363 | 5,240 | 9,777
  MusicWork | 159,070 | 23,652 | 14,987 | 15,911 | 17,053 | 16,206 | 0 | 6,588 | 697 | 4,704
  Film | 71,715 | 17,210 | 9,467 | 9,396 | 8,896 | 9,741 | 15,038 | 12,045 | 3,859 | 2,320
  Software | 27,947 | 5,682 | 3,050 | 4,833 | 3,419 | 5,733 | 2,368 | 0 | 606 | 857

Table 4: Cross-language overlap: number of instances that are described in multiple languages

Class | Instances | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+
Person | 871,630 | 676,367 | 94,339 | 42,382 | 21,647 | 12,936 | 8,198 | 5,295 | 3,437 | 2,391 | 4,638
Place | 643,260 | 307,729 | 150,349 | 45,836 | 36,339 | 20,831 | 13,523 | 20,808 | 31,422 | 11,262 | 5,161
Organisation | 206,670 | 160,398 | 22,661 | 9,312 | 5,002 | 3,221 | 2,072 | 1,421 | 928 | 594 | 1,061
Work | 360,808 | 243,706 | 54,855 | 23,097 | 12,605 | 8,277 | 5,732 | 4,007 | 2,911 | 1,995 | 3,623

At the time of writing, DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish10. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services.

While chapters are at the moment defined by ownership of the IP address and server behind the subdomain's A record (e.g. http://ko.dbpedia.org) given by the DBpedia maintainers, the DBpedia internationalisation committee11 is formalising its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek12 and Dutch13), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].

10 Accessed on 25/09/2013: http://wiki.dbpedia.org/Internationalization/Chapters

11 http://wiki.dbpedia.org/Internationalization

12 http://el.dbpedia.org
13 http://nl.dbpedia.org


In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage.

The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is to achieving total coverage (as shown in Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate; e.g. the English Wikipedia, in June 2013, had approximately 3.3 million edits per month, which is equal to 77 edits per minute14. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [20,36]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported15. Meanwhile, DBpedia Live for Dutch16 was developed.

4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia foundation kindly provided us access to their update stream using the OAI-PMH protocol [27]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component, serving as a proxy, constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

14 http://stats.wikimedia.org/EN/SummaryEN.htm

15 http://live.dbpedia.org
16 http://live.nl.dbpedia.org

Fig. 8: Overview of the DBpedia Live extraction framework.

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

– Unmodified Pages: The update of unmodified pages at regular intervals.

– Changesets Publication: The publication of triple-changesets.

– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.

– Data Isolation: Separation of data from different sources.


Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages: Extraction framework improvements or activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed, and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and another set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
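Conceptually, applying a single changeset corresponds to a pair of SPARQL Update operations against the mirror. The following is a minimal sketch (the resource and the triples shown are hypothetical placeholders, not actual changeset content):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# remove the triples listed in the changeset's 'removed' file ...
DELETE DATA {
  dbr:Example dbo:populationTotal 1000 .
} ;
# ... then insert the triples listed in the 'added' file
INSERT DATA {
  dbr:Example dbo:populationTotal 1200 .
}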

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
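A query restricted to one of these graphs might look as follows (a sketch against the live endpoint):

SELECT ?p ?o
WHERE {
  # only triples produced from the article update feed
  GRAPH <http://live.dbpedia.org> {
    <http://dbpedia.org/resource/Berlin> ?p ?o .
  }
}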

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing to DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository17 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking, as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

An example for the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia)18:

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding)
    WHERE {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
    }
    GROUP BY ?ftsyear ?ftscountry
  }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal
    WHERE {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
    }
  }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

17 https://github.com/dbpedia/dbpedia-links
18 Endpoint: http://fts.publicdata.eu/sparql

Results: http://bit.ly/1c2mIwQ


Table 5: Data sets linked from DBpedia

Data set | Predicate | Count | Tool
Amsterdam Museum | owl:sameAs | 627 | S
BBC Wildlife Finder | owl:sameAs | 444 | S
Book Mashup | rdf:type, owl:sameAs | 9,100 |
Bricklink | dc:publisher | 10,100 |
CORDIS | owl:sameAs | 314 | S
Dailymed | owl:sameAs | 894 | S
DBLP Bibliography | owl:sameAs | 196 | S
DBTune | owl:sameAs | 838 | S
Diseasome | owl:sameAs | 2,300 | S
Drugbank | owl:sameAs | 4,800 | S
EUNIS | owl:sameAs | 3,100 | S
Eurostat (Linked Stats) | owl:sameAs | 253 | S
Eurostat (WBSG) | owl:sameAs | 137 |
CIA World Factbook | owl:sameAs | 545 | S
flickr wrappr | dbp:hasPhotoCollection | 3,800,000 | C
Freebase | owl:sameAs | 3,600,000 | C
GADM | owl:sameAs | 1,900 |
GeoNames | owl:sameAs | 86,500 | S
GeoSpecies | owl:sameAs | 16,000 | S
GHO | owl:sameAs | 196 | L
Project Gutenberg | owl:sameAs | 2,500 | S
Italian Public Schools | owl:sameAs | 5,800 | S
LinkedGeoData | owl:sameAs | 103,600 | S
LinkedMDB | owl:sameAs | 13,800 | S
MusicBrainz | owl:sameAs | 23,000 |
New York Times | owl:sameAs | 9,700 |
OpenCyc | owl:sameAs | 27,100 | C
OpenEI (Open Energy) | owl:sameAs | 678 | S
Revyu | owl:sameAs | 6 |
Sider | owl:sameAs | 2,000 | S
TCMGeneDIT | owl:sameAs | 904 |
UMBEL | rdf:type | 896,400 |
US Census | owl:sameAs | 12,600 |
WikiCompany | owl:sameAs | 8,300 |
WordNet | dbp:wordnet_type | 467,100 |
YAGO2 | rdf:type | 18,100,000 |
Sum | | 27,211,732 |

In addition to providing outgoing links on an instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively.

In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
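Such schema-level links are plain OWL axioms; for illustration, one of the equivalent class links might look as follows in Turtle (the concrete pair is an assumed example, not quoted from the ontology):

@prefix dbo:    <http://dbpedia.org/ontology/> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix schema: <http://schema.org/> .

# asserts that the two classes have the same extension
dbo:Person owl:equivalentClass schema:Person .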

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub19. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI; e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia.

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub20. However, it crawls for RDFa snippets, converts microformats etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the

19 See http://wiki.dbpedia.org/Interlinking for details.

20 http://datahub.io


Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia

Data set | # Link Predicates | Link Count
okaboo.com | 4 | 2,407,121
tfri.gov.tw | 57 | 837,459
naplesplus.us | 2 | 212,460
fu-berlin.de | 7 | 145,322
freebase.com | 108 | 138,619
geonames.org | 3 | 125,734
opencyc.org | 3 | 19,648
geospecies.org | 10 | 16,188
dbrec.net | 3 | 12,856
faviki.com | 5 | 12,768

Table 7: Sindice summary statistics for incoming links to DBpedia

Metric | Value
Total links | 3,960,212
Total distinct data sets | 248
Total distinct predicates | 345

Table 8: Top 10 datasets by incoming links in Sindice

Domain | Datasets | Links
purl.org | 498 | 6,717,520
dbpedia.org | 248 | 3,960,212
creativecommons.org | 2,483 | 3,030,910
identi.ca | 1,021 | 2,359,276
l3s.de | 34 | 1,261,487
rkbexplorer.com | 24 | 1,212,416
nytimes.com | 27 | 1,174,941
w3.org | 405 | 658,535
geospecies.org | 13 | 523,709
livejournal.com | 14,881 | 366,025

web of data, and DBpedia is currently ranked second in terms of incoming links.

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9: The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity. The top five languages with respect to download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition are indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month21.

Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Presumably, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines,

21 The IP address was only filtered for that specific file and month in those cases.

22 http://dbpedia.org/sparql


Fig. 10: The download count (in thousands) and download volume (in GB) of the DBpedia data sets.

Table 9: Hardware of the machines serving the public SPARQL endpoint

DBpedia | Configuration
3.3-3.4 | AMD Opteron 8220, 2.80GHz, 4 cores, 32GB
3.5-3.7 | Intel Xeon E5520, 2.27GHz, 8 cores, 48GB
3.8 | Intel Xeon E5-2630, 2.30GHz, 8 cores, 64GB

or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200-second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10MB) are enforced.

Table 11: Number of unique sites accessing DBpedia endpoints

DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 5,824 | 5,977 | 1,046 | 7,865
3.4 | 6,656 | 6,704 | 947 | 8,312
3.5 | 9,392 | 9,432 | 1,643 | 12,584
3.6 | 10,810 | 10,810 | 1,612 | 14,798
3.7 | 17,077 | 17,091 | 6,782 | 33,288
3.8 | 14,513 | 14,493 | 2,783 | 20,510

If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code23, like 509 ("Bandwidth Limit Exceeded"), and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.2124. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by

23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

24 http://www.webalizer.org


Table 10: Number of endpoint hits and visits

Hits:
DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 733,811 | 711,306 | 188,991 | 1,319,539
3.4 | 1,212,549 | 1,165,893 | 351,226 | 2,371,657
3.5 | 1,122,612 | 1,035,444 | 386,454 | 2,908,720
3.6 | 1,328,355 | 1,286,750 | 256,945 | 2,495,031
3.7 | 2,085,399 | 1,930,728 | 1,057,398 | 8,873,473
3.8 | 2,910,410 | 2,717,775 | 1,085,640 | 7,678,490

Visits:
DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 9,750 | 9,775 | 2,036 | 13,126
3.4 | 11,329 | 11,473 | 1,546 | 14,198
3.5 | 16,592 | 16,902 | 2,969 | 23,129
3.6 | 19,471 | 17,664 | 5,691 | 56,064
3.7 | 23,972 | 22,262 | 10,741 | 127,869
3.8 | 16,851 | 16,711 | 2,960 | 27,416

the Median and Standard Deviation. The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x to redirect the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the

Table 12: Hits per service to http://dbpedia.org in thousands

Endpoint | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
data | 1,355 | 2,681 | 2,230 | 2,547 | 3,714 | 4,246
ontology | 80 | 168 | 142 | 178 | 116 | 97
page | 2,634 | 4,703 | 1,810 | 1,687 | 3,787 | 7,377
property | 231 | 311 | 137 | 176 | 176 | 128
resource | 2,976 | 4,080 | 2,554 | 2,309 | 4,436 | 7,143
sparql | 2,090 | 4,177 | 2,266 | 9,112 | 15,902 | 15,475
other | 252 | 420 | 342 | 277 | 579 | 695
total | 9,619 | 16,541 | 9,434 | 16,286 | 28,710 | 35,142

Fig. 11: Traffic: Linked Data versus SPARQL endpoint.

resource endpoint. The ontology and property endpoints return meta-information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type.


Table 13: Hits per statement type in thousands

Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
ask | 56 | 269 | 360 | 326 | 159 | 677
construct | 55 | 31 | 14 | 11 | 22 | 38
describe | 11 | 8 | 4 | 7 | 62 | 111
select | 1,891 | 3,663 | 1,658 | 8,030 | 11,204 | 13,516
unknown | 78 | 206 | 229 | 738 | 4,455 | 1,134
total | 2,090 | 4,177 | 2,266 | 9,112 | 15,902 | 15,475

Table 14: Trends in SPARQL select (rounded values in %)

Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
distinct | 19.5 | 11.4 | 17.3 | 19.4 | 13.3 | 25.4
filter | 45.7 | 13.7 | 31.8 | 25.3 | 29.2 | 36.1
functions | 8.8 | 6.9 | 23.5 | 21.3 | 25.5 | 25.9
geo | 27.7 | 7.0 | 39.6 | 6.7 | 9.3 | 9.2
group | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2
limit | 4.6 | 6.5 | 11.6 | 10.5 | 7.7 | 7.8
optional | 30.6 | 23.8 | 47.3 | 26.3 | 16.7 | 17.2
order | 2.3 | 1.3 | 1.9 | 3.2 | 1.2 | 1.6
union | 3.3 | 3.2 | 26.4 | 11.3 | 12.1 | 20.6

As the log files only record the full SPARQL query on a GET request, all the POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12: Number of daily requests sent to DBpedia Live for (a) SPARQL queries and (b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, proliferation for applications and mashups, as well as text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

25 http://wiki.dbpedia.org/Datasets/NLP



7.1.1 Annotation / Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].
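DBpedia itself provides one useful signal for this task: the disambiguation data set (cf. Table 1) enumerates the candidate meanings of an ambiguous title, which can be queried directly, e.g.:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?candidate
WHERE {
  # all resources the 'Alien' disambiguation page points to
  dbr:Alien dbo:wikiPageDisambiguates ?candidate .
}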

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open-source tool26, including a free web service, that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, Semantic API from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com


A related project is ImageSnippets32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation: Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2. Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions.

28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org
32 http://www.imagesnippets.com
33 http://www.faviki.com
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/


DBpedia is interesting as a test case for such systems. Due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2. Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information across many domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data, thus (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: the broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project37. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia38. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37 http://viaf.org
38 Accessed on 12/02/2013, http://www.oclc.org/research/news/2012/12-07a.html

7.3. Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of these tools, structured by category.

Facet Based Browsers. An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LodLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface for exploring the neighborhood of, and connections between, resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] supports the creation of statistical analysis queries and correlations in RDF data and DBpedia in particular.
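To give a flavour of the kind of queries such tools generate, the following hand-written SPARQL query is a minimal sketch consisting only of triple patterns; it reuses the Ford GT40 example data from Section 2 and the prefixes of Table 16.

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # All automobiles manufactured by Ford Advanced Vehicles.
    SELECT ?car WHERE {
      ?car a dbo:Automobile ;
           dbo:manufacturer dbr:Ford_Advanced_Vehicles .
    }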

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4. Applications of the Extraction Framework

Wiktionary Extraction. Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available in 171 languages (of which approximately 147 can be considered active42).

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org


It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains43. Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: for every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data44. Statistics are shown in Table 15.

8. Related Work

8.1. Cross-Domain Community Knowledge Bases

8.1.1. Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata45. Wikidata is a free knowledge base about the world that can be read and

42 http://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page.
44 http://wiktionary.dbpedia.org
45 http://wikidata.org

edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013 the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2. Freebase

Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47 http://www.freebase.com


Table 15: Statistical comparison of extractions for different languages

language  words      triples     resources   predicates  senses
en        2,142,237  28,593,364  11,804,039  28          424,386
fr        4,657,817  35,032,121  20,462,349  22          592,351
ru        1,080,156  12,813,437   5,994,560  17          149,859
de          701,739   5,618,508   2,966,867  16          122,362

and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows metadata to be stored efficiently for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3. YAGO

One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content.

48 http://www.mpi-inf.mpg.de/yago-naga/yago/

YAGO focuses on extracting a smaller number of relations, compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2. Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 which are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page.

49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix#data_extraction


Data is also provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation, synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38] the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage, beyond the use of Wikipedia interlanguage links, is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en/
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9. Conclusions and Future Work

In this system report, we presented an overview of recent advances in the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that in the past four years DBpedia matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project:

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions.

56 http://www.cs.washington.edu/research/knowitall/


Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birth date of a person must always be before the death date, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
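A minimal sketch of such a sanity check, written as a SPARQL query over properties that appear elsewhere in this article (dbo:birthDate, dbo:deathDate); the exact endpoint and graph setup are assumptions.

    PREFIX dbo: <http://dbpedia.org/ontology/>

    # Persons whose recorded death date precedes their birth date.
    SELECT ?person ?birth ?death WHERE {
      ?person a dbo:Person ;
              dbo:birthDate ?birth ;
              dbo:deathDate ?death .
      FILTER (?death < ?birth)
    }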

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to which entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16: List of namespace prefixes

Prefix   Namespace
dbo      http://dbpedia.org/ontology/
dbp      http://dbpedia.org/property/
dbr      http://dbpedia.org/resource/
dbr-de   http://de.dbpedia.org/resource/
dc       http://purl.org/dc/elements/1.1/
foaf     http://xmlns.com/foaf/0.1/
geo      http://www.w3.org/2003/01/geo/wgs84_pos#
georss   http://www.georss.org/georss/
ls       http://spotlight.dbpedia.org/scores/
lx       http://spotlight.dbpedia.org/lexicalizations/
owl      http://www.w3.org/2002/07/owl#
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
skos     http://www.w3.org/2004/02/skos/core#
sptl     http://spotlight.dbpedia.org/vocab/
xsd      http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17: DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, Berlin/Heidelberg, 2007.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. Journal of Web Semantics, 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini and A. Antonuccio. LodLive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernandez, E. Motta and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. International Journal of Metadata, Semantics and Ontologies, 3:37–52, 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer and J. Lehmann. User-driven Quality Evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.




Fig. 1: Overview of the DBpedia extraction framework

2. Extraction Framework

Wikipedia articles consist mostly of free text, but also comprise various types of structured information in the form of wiki markup. Such information includes infobox templates, categorisation information, images, geo-coordinates, links to external web pages, disambiguation pages, redirects between pages and links across different language editions of Wikipedia. The DBpedia extraction framework extracts this structured information from Wikipedia and turns it into a rich knowledge base. In this section, we give an overview of the DBpedia knowledge extraction framework.

2.1. General Architecture

Figure 1 shows an overview of the technical framework. The DBpedia extraction is structured into four phases (a minimal code sketch of the pipeline follows the list):

Input: Wikipedia pages are read from an external source. Pages can either be read from a Wikipedia dump or directly fetched from a MediaWiki installation using the MediaWiki API.

Parsing: Each Wikipedia page is parsed by the wiki parser. The wiki parser transforms the source code of a Wikipedia page into an Abstract Syntax Tree.

Extraction: The Abstract Syntax Tree of each Wikipedia page is forwarded to the extractors. DBpedia offers extractors for many different purposes, for instance to extract labels, abstracts or geographical coordinates. Each extractor consumes an Abstract Syntax Tree and yields a set of RDF statements.

Output: The collected RDF statements are written to a sink. Different formats, such as N-Triples, are supported.
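As announced above, the following is an illustrative sketch of these four phases. The actual framework is written in Scala; all names here are hypothetical and chosen only to mirror the phases.

    # Hypothetical pipeline sketch; the real framework is Scala-based.
    def run_extraction(source, parser, extractors, sink):
        for page in source:               # Input: dump file or MediaWiki API
            ast = parser.parse(page)      # Parsing: wiki source -> AST
            for extractor in extractors:  # Extraction: AST -> RDF statements
                for statement in extractor.extract(ast):
                    sink.write(statement) # Output: e.g. an N-Triples file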

2.2. Extractors

The DBpedia extraction framework employs various extractors for translating different parts of Wikipedia pages to RDF statements. A list of all available extractors is shown in Table 1. DBpedia extractors can be divided into four categories:

Mapping-Based Infobox Extraction: The mapping-based infobox extraction uses manually written mappings that relate infoboxes in Wikipedia to terms in the DBpedia ontology. The mappings also specify a datatype for each infobox property and thus help the extraction framework to produce high-quality data. The mapping-based extraction will be described in detail in Section 2.4.

Raw Infobox Extraction: The raw infobox extraction provides a direct mapping from infoboxes in Wikipedia to RDF. As the raw infobox extraction does not rely on explicit extraction knowledge in the form of mappings, the quality of the extracted data is lower. The raw infobox data is useful if a specific infobox has not been mapped yet and thus is not available in the mapping-based extraction.

Feature Extraction: The feature extraction uses a number of extractors that are specialized in extracting a single feature from an article, such as a label or geographic coordinates.


Statistical Extraction: Some NLP-related extractors aggregate data from all Wikipedia pages in order to provide data that is based on statistical measures of page links or word counts, as further described in Section 2.6.

2.3. Raw Infobox Extraction

The type of Wikipedia content that is most valuable for the DBpedia extraction are infoboxes. Infoboxes are frequently used to list an article's most relevant facts as a table of attribute-value pairs on the top right-hand side of the Wikipedia page (for right-to-left languages on the top left-hand side). Infoboxes that appear in a Wikipedia article are based on a template that specifies a list of attributes that can form the infobox. A wide range of infobox templates are used in Wikipedia. Common examples are templates for infoboxes that describe persons, organisations or automobiles. As Wikipedia's infobox template system has evolved over time, different communities of Wikipedia editors use different templates to describe the same type of things (e.g. Infobox city japan, Infobox swiss town and Infobox town de). In addition, different templates use different names for the same attribute (e.g. birthplace and placeofbirth). As many Wikipedia editors do not strictly follow the recommendations given on the page that describes a template, attribute values are expressed using a wide range of different formats and units of measurement. An excerpt of an infobox that is based on a template for describing automobiles is shown below:

    {{Infobox automobile
    | name = Ford GT40
    | manufacturer = [[Ford Advanced Vehicles]]
    | production = 1964-1969
    | engine = 4181cc
    (...)
    }}

In this infobox, the first line specifies the infobox type and the subsequent lines specify various attributes of the described entity.

An excerpt of the extracted data is as follows:

    dbr:Ford_GT40
        dbp:name "Ford GT40"@en ;
        dbp:manufacturer dbr:Ford_Advanced_Vehicles ;
        dbp:engine 4181 ;
        dbp:production 1964 ;
        (...)

This extraction output has weaknesses: the resource is not associated to a class in the ontology, and parsed values are cleaned up and assigned a datatype based on heuristics. In particular, the raw infobox extractor searches for values in the following order: dates, coordinates, numbers, links and strings as default. Thus, the datatype assignment for the same property in different resources is non-deterministic. The engine, for example, is extracted as a number, but if another instance of the template used "cc4181", it would be extracted as a string. This behaviour makes querying for properties in the dbp namespace inconsistent. Those problems can be overcome by the mapping-based infobox extraction presented in the next subsection.

2.4. Mapping-Based Infobox Extraction

In order to homogenize the description of information in the knowledge base, in 2010 a community effort was initiated to develop an ontology schema and mappings from Wikipedia infobox properties to this ontology. The alignment between Wikipedia infoboxes and the ontology is performed via community-provided mappings that help to normalize name variations in properties and classes. Heterogeneity in the Wikipedia infobox system, like using different infoboxes for the same type of entity or using different property names for the same property (cf. Section 2.3), can be alleviated in this way. This significantly increases the quality of the raw Wikipedia infobox data by typing resources, merging name variations and assigning specific datatypes to the values.

This effort is realized using the DBpedia Mappings Wiki3, a MediaWiki installation set up to enable users to collaboratively create and edit mappings. These mappings are specified using the DBpedia Mapping Language. The mapping language makes use of MediaWiki templates that define DBpedia ontology classes and properties, as well as template/table to ontology mappings. A mapping assigns a type from the DBpedia ontology to the entities that are described by the corresponding infobox. In addition, attributes in the infobox are mapped to properties in the DBpedia ontology. In the following, we show a mapping that maps infoboxes that use the Infobox automobile template to the DBpedia ontology:

    {{TemplateMapping
    |mapToClass = Automobile
    |mappings =
      {{PropertyMapping
      | templateProperty = name

3 http://mappings.dbpedia.org


Table 1: Overview of the DBpedia extractors (cf. Table 16 for a complete list of prefixes). Each entry gives the extractor name, its description and an example.

abstract – Extracts the first lines of the Wikipedia article.
  dbr:Berlin dbo:abstract "Berlin is the capital city of (...)" .
article categories – Extracts the categorization of the article.
  dbr:Oliver_Twist dc:subject dbr:Category:English_novels .
category label – Extracts labels for categories.
  dbr:Category:English_novels rdfs:label "English novels" .
category hierarchy – Extracts information about which concept is a category and how categories are related, using the SKOS vocabulary.
  dbr:Category:World_War_II skos:broader dbr:Category:Modern_history .
disambiguation – Extracts disambiguation links.
  dbr:Alien dbo:wikiPageDisambiguates dbr:Alien_(film) .
external links – Extracts links to external web pages related to the concept.
  dbr:Animal_Farm dbo:wikiPageExternalLink <http://books.google.com/?id=RBGmrDnBs8UC> .
geo coordinates – Extracts geo-coordinates.
  dbr:Berlin georss:point "52.5006 13.3989" .
grammatical gender – Extracts grammatical genders for persons.
  dbr:Abraham_Lincoln foaf:gender "male" .
homepage – Extracts links to the official homepage of an instance.
  dbr:Alabama foaf:homepage <http://alabama.gov/> .
image – Extracts the first image of a Wikipedia page.
  dbr:Berlin foaf:depiction <http://.../Overview_Berlin.jpg> .
infobox – Extracts all properties from all infoboxes.
  dbr:Animal_Farm dbo:date "March 2010" .
interlanguage – Extracts interwiki links.
  dbr:Albedo dbo:wikiPageInterLanguageLink dbr-de:Albedo .
label – Extracts the article title as label.
  dbr:Berlin rdfs:label "Berlin" .
lexicalizations – Extracts information about surface forms and their association with concepts (only N-Quad format).
  dbr:Pine sptl:lexicalization lx:pine_tree ls:Pine_pine_tree .
  lx:pine_tree rdfs:label "pine tree" ls:Pine_pine_tree .
  ls:Pine_pine_tree sptl:pUriGivenSf 0.941 .
mappings – Extraction based on mappings of Wikipedia infoboxes to the DBpedia ontology.
  dbr:Berlin dbo:country dbr:Germany .
page ID – Extracts page IDs of articles.
  dbr:Autism dbo:wikiPageID 25 .
page links – Extracts all links between Wikipedia articles.
  dbr:Autism dbo:wikiPageWikiLink dbr:Human_brain .
persondata – Extracts information about persons represented using the PersonData template.
  dbr:Andre_Agassi foaf:birthDate "1970-04-29" .
PND – Extracts PND (Personennamendatei) data about a person.
  dbr:William_Shakespeare dbo:individualisedPnd "118613723" .
redirects – Extracts redirect links between articles in Wikipedia.
  dbr:ArtificialLanguages dbo:wikiPageRedirects dbr:Constructed_language .
revision ID – Extracts the revision ID of the Wikipedia article.
  dbr:Autism <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/Autism?oldid=495234324> .
thematic concept – Extracts 'thematic' concepts, the centres of discussion for categories.
  dbr:Category:Music skos:subject dbr:Music .
topic signatures – Extracts topic signatures.
  dbr:Alkane sptl:topicSignature "carbon alkanes atoms" .
wiki page – Extracts links to corresponding articles in Wikipedia.
  dbr:AnAmericanInParis foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/AnAmericanInParis> .


      | ontologyProperty = foaf:name }}
      {{PropertyMapping
      | templateProperty = manufacturer
      | ontologyProperty = manufacturer }}
      {{DateIntervalMapping
      | templateProperty = production
      | startDateOntologyProperty = productionStartDate
      | endDateOntologyProperty = productionEndDate }}
      {{IntermediateNodeMapping
      | nodeClass = AutomobileEngine
      | correspondingProperty = engine
      | mappings =
        {{PropertyMapping
        | templateProperty = engine
        | ontologyProperty = displacement
        | unit = Volume }}
        {{PropertyMapping
        | templateProperty = engine
        | ontologyProperty = powerOutput
        | unit = Power }}
      }}
    (...)

The RDF statements that are extracted from the previous infobox example are shown below. As we can see, the production period is correctly split into a start year and an end year, and the engine is represented by a distinct RDF node. It is worth mentioning that all values are canonicalized to basic units. For example, in the engine mapping we state that engine is a Volume, and thus the extractor converts "4181cc" (cubic centimetres) to cubic metres ("0.004181"). Additionally, there can exist multiple mappings on the same property that search for different datatypes or different units, for example a number with "PS" as a suffix for engine.

    dbr:Ford_GT40
        rdf:type dbo:Automobile ;
        rdfs:label "Ford GT40"@en ;
        dbo:manufacturer dbr:Ford_Advanced_Vehicles ;
        dbo:productionStartYear "1964"^^xsd:gYear ;
        dbo:productionEndYear "1969"^^xsd:gYear ;
        dbo:engine [
            rdf:type dbo:AutomobileEngine ;
            dbo:displacement 0.004181
        ] .
    (...)

The DBpedia Mappings Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 2 shows how the infobox properties author and συγγραφέας – author in Greek – are both being mapped to the global identifier dbo:author. That means, in turn, that information from all language versions of DBpedia can be merged, and DBpedias for smaller languages can be augmented with knowledge from larger DBpedias, such as the English edition. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns, which is often only present in the corresponding language edition [43].

Besides the hosting of the mappings and the DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

– Mapping Syntax Validator: The mapping syntax validator checks for syntactic correctness and highlights inconsistencies, such as missing property definitions.

– Extraction Tester: The extraction tester, linked on each mapping page, tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.

– Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users in creating and editing mappings.

2.5. URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

– http://dbpedia.org/resource/ (prefix dbr) for representing article data. There is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin4, DBpedia will produce the URI dbr:Berlin. Exceptions to this rule appear when intermediate nodes are extracted from the mapping-based infobox extractor as unique URIs (e.g. the engine mapping example in Section 2.4).

– http://dbpedia.org/property/ (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.

4 http://en.wikipedia.org/wiki/Berlin


Fig. 2: Depiction of the mapping from the Greek (left) and English Wikipedia templates (right) about books to the same DBpedia ontology class (middle) [24]

– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links5. For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach omitted valuable data [24]. Thus, starting from the DBpedia 3.7 release6, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within the datasets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists.

5 http://en.wikipedia.org/wiki/Help:Interlanguage_links
6 A list of all DBpedia releases is provided in Table 17.

Within all canonicalized data sets, the same thing is identified with the same URI from the generic language-agnostic namespace http://dbpedia.org/resource/.
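As a small constructed illustration (prefixes as in Table 16, the population value invented for the example), the same fact appears under a language-specific URI in the localized German data set and under the generic URI in the canonicalized one:

    # Localized German data set (language-specific namespace).
    dbr-de:Berlin dbo:populationTotal 3500000 .

    # Canonicalized data set (generic, language-agnostic namespace).
    dbr:Berlin dbo:populationTotal 3500000 .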

2.6. NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [33]. Currently, four datasets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concepts. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concept extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The lexicalizations data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones and also show how ambiguous a name is with respect to all possible concepts that it can mean.

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a


Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the tf-idf weight, with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [30].
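A minimal sketch of this tf-idf construction is shown below; the use of scikit-learn, the toy data and the stemming-free tokenization are simplifying assumptions, not the framework's actual implementation.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # One "document" per DBpedia resource: the Wikipedia text associated
    # with it (toy data here).
    texts = {
        "dbr:Alkane": "alkanes are saturated hydrocarbons of carbon atoms",
        "dbr:Berlin": "berlin is the capital city of germany",
    }
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(texts.values()).toarray()
    terms = vectorizer.get_feature_names_out()

    # Keep the strongest-weighted terms as each resource's topic signature.
    for resource, row in zip(texts, matrix):
        top = row.argsort()[::-1][:3]
        print(resource, [terms[i] for i in top])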

There are two more feature extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the centre of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (subject, object, possessive adjective, possessive pronoun and reflexive), i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
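The heuristic can be sketched as follows; the tie-breaking behaviour is an assumption, since the text above only specifies that the dominating gender wins.

    import re

    MASCULINE = {"he", "him", "his", "himself"}
    FEMININE = {"she", "her", "hers", "herself"}

    # Count declined gender pronouns and return the dominating gender.
    def grammatical_gender(article_text):
        tokens = re.findall(r"[a-z]+", article_text.lower())
        masculine = sum(token in MASCULINE for token in tokens)
        feminine = sum(token in FEMININE for token in tokens)
        if masculine == feminine:
            return None  # no dominating gender (assumed behaviour)
        return "male" if masculine > feminine else "female"

    print(grammatical_gender("He was born in 1809; his early life ..."))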

2.7. Summary of Other Recent Developments

In this section, we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework has been rewritten in Scala in 20107, resulting in an improvement of the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. This is crucial for the DBpedia Live project, which will be explained in Section 4. The new, more modular framework also allows data to be extracted from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the

7 Table 17 provides an overview of the project's evolution through time.

creation and utilization of the DBpedia Mappings Wiki, as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal, etc. in English that indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain-text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.), and for a number of classes and properties specific datatypes were added (e.g. inhabitants/km2 for the population density of populated places and m3/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction, by resolving alternative template names. Moreover, in the PHP framework, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links and hence increased the overall interconnectivity of resources in the DBpedia ontology.
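A minimal sketch of resolving redirect chains while catching cycles, under the assumption that redirects are given as a title-to-target map:

    # Resolve each redirect chain to its final (non-redirect) target;
    # chains that run into a cycle are dropped.
    def resolve_redirects(redirects):
        resolved = {}
        for start in redirects:
            seen, current = {start}, redirects[start]
            while current in redirects and current not in seen:
                seen.add(current)
                current = redirects[current]
            if current not in seen:  # chain ended at a real article
                resolved[start] = current
        return resolved

    print(resolve_redirects({"A": "B", "B": "C", "X": "Y", "Y": "X"}))
    # {'A': 'C', 'B': 'C'} - the X/Y cycle is discarded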

Fig. 3. Snapshot of a part of the DBpedia ontology. [Figure: an excerpt of the class hierarchy under owl:Thing (dbo:Agent, dbo:Place, dbo:Person, dbo:Athlete, dbo:PopulatedPlace, dbo:Settlement, dbo:Organisation, dbo:Work, dbo:Species, dbo:Eukaryote), connected via rdfs:subClassOf, together with selected object properties (e.g. dbo:birthPlace, dbo:writer, dbo:producer, dbo:location) and datatype properties with their ranges (e.g. dbo:birthDate and dbo:deathDate with xsd:date, dbo:populationTotal with xsd:integer, dbo:areaTotal and dbo:elevation with xsd:double).]

Finally, a new heuristic to increase the connectivity of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.
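In simplified Scala, the heuristic can be summarized as follows (hypothetical types; `anchors` is assumed to map the anchor texts of an article's hyperlinks to their link targets):

    // If the article contains a hyperlink whose anchor text equals the
    // unlinked infobox value, emit an object property instead of a literal.
    def linkInfoboxValue(value: String, anchors: Map[String, String]): Either[String, String] =
      anchors.get(value.trim) match {
        case Some(target) => Right(target) // becomes an object property assertion
        case None         => Left(value)   // value stays a plain literal
      }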

Orthogonal to the previously mentioned improvements, there have been various efforts to assess the quality of the DBpedia datasets. In [26], a framework for estimating the quality of DBpedia was developed and a sample of 75 resources was analysed. A more comprehensive effort was undertaken in [48] by providing a distributed, web-based interface [25] for quality assessment. In this study, 17 data quality problem types were analysed by 58 users, covering 521 resources in DBpedia.

3 DBpedia Ontology

The DBpedia ontology consists of 320 classes, which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances. The complete DBpedia ontology can be browsed online at http://mappings.dbpedia.org/server/ontology/classes/.

Fig. 4. Growth of the DBpedia ontology. [Figure: number of ontology elements (classes and properties, y-axis from 0 to 2,000) per DBpedia version (3.2 to 3.8).]

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing much, due to the already good coverage of the initial version of the ontology, the number of properties increases over time, due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1 Mapping Statistics

Fig. 5. Mapping coverage statistics for all mapping-enabled languages.

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.

The mapping activity of the ontology enrichment process, along with the editing activity of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates.

Fig. 7. English property mappings occurrence frequency (both axes are in log scale).

For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarters of that year show very high mapping activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have established their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the occurrence frequency of the English property mappings. Both axes are in log scale and represent the number of property mappings (x-axis) that have exactly y occurrences (y-axis). The occurrence frequency follows a long-tail distribution: a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2 Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language corresponds to instances in the English version of DBpedia, and which portion of the instances is described by clean, mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been created so far in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of popular classes from the third and fourth hierarchy levels of the ontology within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to the specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described in only a single language version; the next column contains the number of instances that are contained in two languages, but not in three or more languages, etc. For example, 12,936 persons are described in five languages, but not in six or more languages. The number 871,630 for the class Person means that all 20 language versions together describe 871,630 different persons. The number is higher than the number of persons described in the canonicalized English infobox

Fig. 6. Mapping community activity for (a) the ontology and (b) the 10 most active language editions.

Table 2. Basic statistics about localized DBpedia editions.

      Inst. LD all   Inst. CD all   Inst. with MD CD   Raw Prop. CD   Map. Prop. CD   Raw Statem. CD   Map. Statem. CD
en       3,769,926      3,769,926          2,359,521         48,293           1,313       65,143,840        33,742,015
de       1,243,771        650,037            204,335          9,593             261        7,603,562         2,880,381
fr       1,197,334        740,044            214,953         13,551             228        8,854,322         2,901,809
it         882,127        580,620            383,643          9,716             181       12,227,870         4,804,731
es         879,091        542,524            310,348         14,643             476        7,740,458         4,383,206
pl         848,298        538,641            344,875          7,306             266        7,696,193         4,511,794
ru         822,681        439,605            123,011         13,522              76        6,973,305         1,389,473
pt         699,446        460,258            272,660         12,851             602        6,255,151         4,005,527
ca         367,362        241,534            112,934          8,696             183        3,689,870         1,301,868
cs         225,133        148,819             34,893          5,564             334        1,857,230           474,459
hu         209,180        138,998             63,441          6,821             295        2,506,399           601,037
ko         196,132        124,591             30,962          7,095             419        1,035,606           417,605
tr         187,850        106,644             40,438          7,512             440        1,350,679           556,943
ar         165,722        103,059             16,236          7,898             268          635,058           168,686
eu         132,877        108,713             41,401          2,245              19        2,255,897           532,709
sl         129,834         73,099             22,036          4,235             470        1,213,801           222,447
bg         125,762         87,679             38,825          3,984             274          774,443           488,678
hr         109,890         71,469             10,343          3,334             158          701,182           151,196
el          71,936         48,260             10,813          2,866             288          206,460           113,838

data set (763,643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195,263 persons are described in two or more languages. The large difference between this number and the total number of 871,630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor, alongside the live synchronisation approaches in [20], allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages8. The DBpedia 3.7 release9 in September 2011 was the first DBpedia release to use the localized I18n (Internationalisation) DBpedia extraction framework [24].

8 Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr), Urdu (ur).

9 http://blog.dbpedia.org/2011/09/11/


Table 3. Number of instances per class within 10 localized DBpedia versions.

                    en       it       pl       es       pt       fr       de       ru      ca      hu
Person         763,643  145,060   70,708   65,337   43,057   62,942   33,122   18,620   7,107  15,529
  Athlete      185,126   47,187   30,332   19,482   14,130   21,646   31,237        0     721   4,527
  Artist        61,073   12,511   16,120   25,992   10,571   13,465        0        0   2,004   3,821
  Politician    23,096        0    8,943    5,513    3,342        0        0   12,004   1,376     760
Place          572,728  141,101  182,727  132,961  116,660   80,602  131,766   67,932  73,078  18,324
  PopulPlace   387,166  138,077  167,034  121,204  109,418   72,252   79,410   63,826  72,743  15,535
  Building      60,514    1,270    2,946    3,570      803      921       83       43       0     527
  River         24,267        0    1,383        2    4,149    3,333    6,707    3,924       0     565
Organisation   192,832    4,142   12,193   11,710   10,949   17,513   16,973    1,598   1,399   3,993
  Company       44,516    4,142    2,566      975    1,903    5,832    7,200        0     440     618
  EducInst      42,270        0      599    1,207      270    1,636    2,171    1,010       0     115
  Band          27,061        0    2,993        0    4,476    3,868    5,368        0     263     802
Work           333,269   51,918   32,386   36,484   33,869   39,195   18,195   34,363   5,240   9,777
  MusicWork    159,070   23,652   14,987   15,911   17,053   16,206        0    6,588     697   4,704
  Film          71,715   17,210    9,467    9,396    8,896    9,741   15,038   12,045   3,859   2,320
  Software      27,947    5,682    3,050    4,833    3,419    5,733    2,368        0     606     857

Table 4. Cross-language overlap: number of instances that are described in multiple languages.

Class          Instances        1        2       3       4       5       6       7       8       9     10+
Person           871,630  676,367   94,339  42,382  21,647  12,936   8,198   5,295   3,437   2,391   4,638
Place            643,260  307,729  150,349  45,836  36,339  20,831  13,523  20,808  31,422  11,262   5,161
Organisation     206,670  160,398   22,661   9,312   5,002   3,221   2,072   1,421     928     594   1,061
Work             360,808  243,706   54,855  23,097  12,605   8,277   5,732   4,007   2,911   1,995   3,623

At the time of writing, DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish.10 Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services.

While at the moment chapters are defined by ownership of the IP and server of the subdomain A record (e.g. http://ko.dbpedia.org) given by the DBpedia maintainers, the DBpedia internationalisation committee11 is manifesting its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek12 and Dutch13), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].

10 Accessed on 25/09/2013: http://wiki.dbpedia.org/Internationalization/Chapters

11 http://wiki.dbpedia.org/Internationalization

12 http://el.dbpedia.org

13 http://nl.dbpedia.org

In the weeks leading up to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage.

The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is to achieving total coverage (as shown in Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading up to a release.

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate; e.g. the English Wikipedia, in June 2013, had approximately 3.3 million edits per month, which is equal to 77 edits per minute.14 This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [20,36]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported.15 Meanwhile, DBpedia Live for Dutch16 was developed.

4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [27]. This protocol allows a program to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework, in order to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is depicted in Figure 8. The major components of the system are as follows:

14 http://stats.wikimedia.org/EN/SummaryEN.htm

15 http://live.dbpedia.org

16 http://live.nl.dbpedia.org

Fig. 8. Overview of the DBpedia Live extraction framework.

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki, described in Section 2.4, serves as a secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

– Unmodified Pages: The update of unmodified pages at regular intervals.

– Changesets Publication: The publication of triple-changesets.

– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.

– Data Isolation: Separation of data from different sources.


Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages: Extraction framework improvements or the activation / deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update feed or the mapping-affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and a set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can simply download those files, decompress them and integrate them.

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
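A minimal sketch of this synchronisation step in Scala, assuming a changeset arrives as a pair of decompressed N-Triples files (removed.nt / added.nt) and the mirror accepts SPARQL 1.1 Update over HTTP; the file layout and endpoint handling are illustrative assumptions, not the tool's actual code:

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}
    import scala.io.Source

    object ChangesetSync {
      private val client = HttpClient.newHttpClient()

      private def update(endpoint: String, query: String): Unit = {
        val req = HttpRequest.newBuilder(URI.create(endpoint))
          .header("Content-Type", "application/sparql-update")
          .POST(HttpRequest.BodyPublishers.ofString(query))
          .build()
        client.send(req, HttpResponse.BodyHandlers.ofString())
      }

      // Apply one changeset pair: first delete outdated triples, then insert new ones.
      def applyChangeset(endpoint: String, removedNt: String, addedNt: String): Unit = {
        def triples(path: String) = Source.fromFile(path).getLines().mkString("\n")
        update(endpoint, s"DELETE DATA { ${triples(removedNt)} }")
        update(endpoint, s"INSERT DATA { ${triples(addedNt)} }")
      }
    }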

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds is contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) is kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and the mapping-based infobox extraction.

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing to DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository17 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking, as well as the overall number of links that are set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

An example for the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia).18

    SELECT * {
      { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
          ?com rdf:type fts-o:Commitment .
          ?com fts-o:year ?year .
          ?year rdfs:label ?ftsyear .
          ?com fts-o:benefit ?benefit .
          ?benefit fts-o:detailAmount ?amount .
          ?benefit fts-o:beneficiary ?beneficiary .
          ?beneficiary fts-o:country ?country .
          ?country owl:sameAs ?ftscountry .
        } GROUP BY ?ftsyear ?ftscountry }
      { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
          ?dbpcountry rdf:type dbo:Country .
          ?dbpcountry dbp:gdpNominal ?gdpnominal .
          ?dbpcountry dbp:gdpNominalYear ?gdpyear .
        } }
      FILTER ((?ftsyear = str(?gdpyear)) &&
              (?ftscountry = ?dbpcountry))
    }

17 https://github.com/dbpedia/dbpedia-links

18 Endpoint: http://fts.publicdata.eu/sparql; results: http://bit.ly/1c2mIwQ


Table 5. Data sets linked from DBpedia.

Data set                  Predicate                     Count   Tool
Amsterdam Museum          owl:sameAs                      627   S
BBC Wildlife Finder       owl:sameAs                      444   S
Book Mashup               rdf:type, owl:sameAs          9,100
Bricklink                 dc:publisher                 10,100
CORDIS                    owl:sameAs                      314   S
Dailymed                  owl:sameAs                      894   S
DBLP Bibliography         owl:sameAs                      196   S
DBTune                    owl:sameAs                      838   S
Diseasome                 owl:sameAs                    2,300   S
Drugbank                  owl:sameAs                    4,800   S
EUNIS                     owl:sameAs                    3,100   S
Eurostat (Linked Stats)   owl:sameAs                      253   S
Eurostat (WBSG)           owl:sameAs                      137
CIA World Factbook        owl:sameAs                      545   S
flickr wrappr             dbp:hasPhotoCollection    3,800,000   C
Freebase                  owl:sameAs                3,600,000   C
GADM                      owl:sameAs                    1,900
GeoNames                  owl:sameAs                   86,500   S
GeoSpecies                owl:sameAs                   16,000   S
GHO                       owl:sameAs                      196   L
Project Gutenberg         owl:sameAs                    2,500   S
Italian Public Schools    owl:sameAs                    5,800   S
LinkedGeoData             owl:sameAs                  103,600   S
LinkedMDB                 owl:sameAs                   13,800   S
MusicBrainz               owl:sameAs                   23,000
New York Times            owl:sameAs                    9,700
OpenCyc                   owl:sameAs                   27,100   C
OpenEI (Open Energy)      owl:sameAs                      678   S
Revyu                     owl:sameAs                        6
Sider                     owl:sameAs                    2,000   S
TCMGeneDIT                owl:sameAs                      904
UMBEL                     rdf:type                    896,400
US Census                 owl:sameAs                   12,600
WikiCompany               owl:sameAs                    8,300
WordNet                   dbp:wordnet_type            467,100
YAGO2                     rdf:type                 18,100,000

Sum                                               27,211,732

In addition to providing outgoing links on an instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki, by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively.

In particular, in 2011 Google, Microsoft and Yahoo! announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub.19 However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI; e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of the subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources it stores. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set the most links to DBpedia.
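The link criterion can be paraphrased in a few lines of Scala (a simplification; Sindice's actual graph-summary machinery is more involved):

    // A triple counts as a link when subject and object belong to
    // different data sets, i.e. different second-level domains.
    def secondLevelDomain(uri: String): String =
      new java.net.URI(uri).getHost.split('.').takeRight(2).mkString(".")

    def isLink(subjectUri: String, objectUri: String): Boolean =
      secondLevelDomain(subjectUri) != secondLevelDomain(objectUri)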

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub.20 However, it crawls for RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite this inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with the most incoming links. Those are authorities in the network structure of the

19 See http://wiki.dbpedia.org/Interlinking for details.

20 http://datahub.io


Table 6. Top 10 data sets in Sindice ordered by the number of links to DBpedia.

Data set         Link Predicates   Link Count
okaboo.com                     4    2,407,121
tfri.gov.tw                   57      837,459
naplesplus.us                  2      212,460
fu-berlin.de                   7      145,322
freebase.com                 108      138,619
geonames.org                   3      125,734
opencyc.org                    3       19,648
geospecies.org                10       16,188
dbrec.net                      3       12,856
faviki.com                     5       12,768

Table 7. Sindice summary statistics for incoming links to DBpedia.

Metric                        Value
Total links               3,960,212
Total distinct data sets        248
Total distinct predicates       345

Table 8. Top 10 datasets by incoming links in Sindice.

Domain                 Datasets       Links
purl.org                    498   6,717,520
dbpedia.org                 248   3,960,212
creativecommons.org       2,483   3,030,910
identi.ca                 1,021   2,359,276
l3s.de                       34   1,261,487
rkbexplorer.com              24   1,212,416
nytimes.com                  27   1,174,941
w3.org                      405     658,535
geospecies.org               13     523,709
livejournal.com          14,881     366,025

web of data, and DBpedia is currently ranked second in terms of incoming links.

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms. First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity. The top five languages with respect to download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition are shown in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets, which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month.21

Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Presumably, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of the parallel threads of modern multi-core CPUs on standard commodity hardware.

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple

21 The IP address was only filtered for that specific file and month in those cases.

22 http://dbpedia.org/sparql

Fig. 10. The download count (in thousands) and download volume (in GB) of the DBpedia data sets.

Table 9. Hardware of the machines serving the public SPARQL endpoint.

DBpedia     Configuration
3.3 - 3.4   AMD Opteron 8220, 2.80GHz, 4 cores, 32GB RAM
3.5 - 3.7   Intel Xeon E5520, 2.27GHz, 8 cores, 48GB RAM
3.8         Intel Xeon E5-2630, 2.30GHz, 8 cores, 64GB RAM

machines, or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200-second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10MB) are enforced. If the client software can handle

Table 11. Number of unique sites accessing DBpedia endpoints.

DBpedia   Avg./Day   Median   Stdev   Maximum
3.3          5,824    5,977   1,046     7,865
3.4          6,656    6,704     947     8,312
3.5          9,392    9,432   1,643    12,584
3.6         10,810   10,810   1,612    14,798
3.7         17,077   17,091   6,782    33,288
3.8         14,513   14,493   2,783    20,510

compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code23 like 509 ("Bandwidth Limit Exceeded") and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.24 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg./Day column represents the average number of hits (resp. visits/sites) per day, followed by

23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

24 http://www.webalizer.org


Table 10. Number of endpoint hits (left) and visits (right).

Hits:
DBpedia    Avg./Day      Median       Stdev     Maximum
3.3         733,811     711,306     188,991   1,319,539
3.4       1,212,549   1,165,893     351,226   2,371,657
3.5       1,122,612   1,035,444     386,454   2,908,720
3.6       1,328,355   1,286,750     256,945   2,495,031
3.7       2,085,399   1,930,728   1,057,398   8,873,473
3.8       2,910,410   2,717,775   1,085,640   7,678,490

Visits:
DBpedia    Avg./Day   Median    Stdev   Maximum
3.3           9,750    9,775    2,036    13,126
3.4          11,329   11,473    1,546    14,198
3.5          16,592   16,902    2,969    23,129
3.6          19,471   17,664    5,691    56,064
3.7          23,972   22,262   10,741   127,869
3.8          16,851   16,711    2,960    27,416

the Median and Standard Deviation. The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.
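The visit computation can be illustrated as follows (a sketch of the floating-window sessionization; the actual numbers were produced by Webalizer):

    // Count visits per client IP: a new visit starts whenever the gap
    // between two subsequent requests exceeds the 30-minute window.
    def countVisits(requests: Seq[(String, Long)]): Int = { // (ip, epoch seconds)
      val window = 30 * 60
      requests.groupBy(_._1).values.map { perIp =>
        val ts = perIp.map(_._2).sorted
        1 + ts.sliding(2).count(w => w.size == 2 && w(1) - w(0) > window)
      }.sum
    }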

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to stricter initial limits for bot-related traffic that were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. the blocking of apps that were abusing the DBpedia endpoint,
3. the uptake of the language-specific DBpedia endpoints and of DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the

Table 12. Hits per service to http://dbpedia.org, in thousands.

Endpoint     3.3     3.4    3.5     3.6     3.7     3.8
data        1355    2681   2230    2547    3714    4246
ontology      80     168    142     178     116      97
page        2634    4703   1810    1687    3787    7377
property     231     311    137     176     176     128
resource    2976    4080   2554    2309    4436    7143
sparql      2090    4177   2266    9112   15902   15475
other        252     420    342     277     579     695
total       9619   16541   9434   16286   28710   35142

Fig. 11. Traffic: Linked Data versus SPARQL endpoint.

resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled, from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed at the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type.


Table 13. Hits per statement type, in thousands.

Statement    3.3    3.4    3.5    3.6     3.7     3.8
ask           56    269    360    326     159     677
construct     55     31     14     11      22      38
describe      11      8      4      7      62     111
select      1891   3663   1658   8030   11204   13516
unknown       78    206    229    738    4455    1134
total       2090   4177   2266   9112   15902   15475

Table 14. Trends in SPARQL select queries (rounded values in %).

Construct    3.3    3.4    3.5    3.6    3.7    3.8
distinct    19.5   11.4   17.3   19.4   13.3   25.4
filter      45.7   13.7   31.8   25.3   29.2   36.1
functions    8.8    6.9   23.5   21.3   25.5   25.9
geo         27.7    7.0   39.6    6.7    9.3    9.2
group        0.0    0.0    0.0    0.0    0.0    0.2
limit        4.6    6.5   11.6   10.5    7.7    7.8
optional    30.6   23.8   47.3   26.3   16.7   17.2
order        2.3    1.3    1.9    3.2    1.2    1.6
union        3.3    3.2   26.4   11.3   12.1   20.6

As the log files only record the full SPARQL query for GET requests, all POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– functions like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX declarations for geo and wgs84 and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, data proliferation for applications and mashups, as well as text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets.25 For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, to mention just a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

25 http://wiki.dbpedia.org/Datasets/NLP

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool26, including a free web service, that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to disambiguate found mentions. The main advantages of this system are its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
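For illustration, a call to such an annotation web service could look as follows; the REST path and parameter names are assumptions based on common Spotlight deployments, not taken from this article:

    import java.net.{URI, URLEncoder}
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    // Ask a Spotlight-style service to annotate a text with DBpedia URIs.
    def annotate(text: String): String = {
      val base = "http://spotlight.dbpedia.org/rest/annotate" // assumed path
      val url  = base + "?text=" + URLEncoder.encode(text, "UTF-8") + "&confidence=0.4"
      val req = HttpRequest.newBuilder(URI.create(url))
        .header("Accept", "application/json")
        .GET()
        .build()
      HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString()).body()
    }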

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, the Semantic API from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.

26 https://spotlight.dbpedia.org

27 http://www.alchemyapi.com

A related project is ImageSnippets32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation: Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize on Jeopardy! and relies on several data sets, including DBpedia.35 DBpedia is also the primary target for several QA systems from the Question Answering over Linked Data (QALD) workshop series.36 Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such

28 http://www.ontos.com

29 http://www.opencalais.com

30 http://www.zemanta.com

31 http://stanbol.apache.org

32 http://www.imagesnippets.com

33 http://www.faviki.com

34 http://bbc.co.uk

35 http://www.aaai.org/Magazine/Watson/watson.php

36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/


systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, the question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and in Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project37. Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia38. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37 http://viaf.org

38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of such tools, structured by category.

Facet-Based Browsers: An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows creating complex graph structures of facets in a visually appealing interface and filtering them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying: The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows exploring the neighborhood of, and the connections between, resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows creating statistical analysis queries and correlations over RDF data, and DBpedia in particular.

Spatial Applications: DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/

40 http://dbpedia.org/fct

41 http://querybuilder.dbpedia.org


in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains43. Such descriptions provide, for a lexical word, a hierarchical disambiguation into its language and part of speech, and sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast-changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, which is part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data44. Statistics are shown in Table 15.

8 Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata45. Wikidata is a free knowledge base about the world that can be read and

42 https://s23.org/wikistats/wiktionaries_html.php

43 See http://en.wiktionary.org/wiki/semantic for a simple example page.

44 http://wiktionary.dbpedia.org

45 http://wikidata.org

edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but provides statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and can include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface

47 http://www.freebase.com


Table 15. Statistical comparison of extractions for different languages.

Language       Words       Triples    Resources   Predicates    Senses
en         2,142,237    28,593,364   11,804,039           28   424,386
fr         4,657,817    35,032,121   20,462,349           22   592,351
ru         1,080,156    12,813,437    5,994,560           17   149,859
de           701,739     5,618,508    2,966,867           16   122,362

and provide identifiers based on those for Wikipedia articles. Both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between the two projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows efficiently storing metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides higher coverage than DBpedia.

8.1.3 YAGO

One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and to provide an RDF version of its

48 http://www.mpi-inf.mpg.de/yago-naga/yago/

content. YAGO focuses on extracting a smaller number of relations compared to DBpedia, in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected; e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001 Wikipedia has alwaysbeen the target of automatic extraction of informationdue to its easy availability open license and encyclo-pedic knowledge A large number of parsers scraperprojects and publications exist In this section we re-strict ourselves to approaches that are either notable re-cent or pertinent to DBpedia MediaWikiorg maintainsan up-to-date list of software projects49 who are able toprocess wiki syntax as well as a list of data extractionextensions50 for MediaWiki

JWPL (Java Wikipedia Library [49]) is an open-source Java-based API that allows to access informa-tion provided by the Wikipedia API (redirects cate-gories articles and link structure) JWPL contains aMediaWiki Markup parser that can be used to furtheranalyze the contents of a Wikipedia page Data is also

49http://www.mediawiki.org/wiki/Alternative_parsers

50http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction


provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation or for synonyms, translations, taxonomic relations, and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38] the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby

52http://sigwp.org/en/index.php/Wikipedia_Thesaurus

53http://sigwp.org/en/
54http://baike.baidu.com
55http://www.hudong.com

KnowItAll56 is a Web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9 Conclusions and Future Work

In this system report we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular: (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that in the past four years DBpedia matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.

56http://www.cs.washington.edu/research/knowitall/

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57http://www.mediawiki.org/wiki/Extension:LightweightRDFa

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians, as sketched below.
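To illustrate, the following is a minimal sketch of such a sanity check in Scala (the language of the extraction framework, cf. Section 2.7). The DBpedia Live endpoint URL follows the conventions described in Section 4; the CSV result format parameter and the exact shape of the query are our own illustrative choices rather than an existing DBpedia tool.

    import java.net.{URL, URLEncoder}
    import scala.io.Source

    object BirthBeforeDeathCheck {
      // Persons whose recorded death date precedes their birth date are
      // almost certainly extraction errors or typos in Wikipedia.
      val query =
        """PREFIX dbo: <http://dbpedia.org/ontology/>
          |SELECT ?person ?birth ?death WHERE {
          |  ?person a dbo:Person ;
          |          dbo:birthDate ?birth ;
          |          dbo:deathDate ?death .
          |  FILTER (?death < ?birth)
          |} LIMIT 100""".stripMargin

      def main(args: Array[String]): Unit = {
        val url = new URL("http://live.dbpedia.org/sparql?format=text%2Fcsv&query=" +
          URLEncoder.encode(query, "UTF-8"))
        // Every returned row is a candidate issue to report back to Wikipedians.
        Source.fromURL(url).getLines().foreach(println)
      }
    }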

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16
List of namespace prefixes

Prefix   Namespace
dbo      http://dbpedia.org/ontology/
dbp      http://dbpedia.org/property/
dbr      http://dbpedia.org/resource/
dbr-de   http://de.dbpedia.org/resource/
dc       http://purl.org/dc/elements/1.1/
foaf     http://xmlns.com/foaf/0.1/
geo      http://www.w3.org/2003/01/geo/wgs84_pos#
georss   http://www.georss.org/georss/
ls       http://spotlight.dbpedia.org/scores/
lx       http://spotlight.dbpedia.org/lexicalizations/
owl      http://www.w3.org/2002/07/owl#
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
skos     http://www.w3.org/2004/02/skos/core#
sptl     http://spotlight.dbpedia.org/vocab/
xsd      http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives. DBpedia: A nucleus for a Web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17
DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5–7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3–5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2–4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7–9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann and S. Auer. DBpedia Live Extraction. In Proc. of the 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer and J. Lehmann. TripleCheckMate: A tool for crowdsourcing the quality assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob and C. Bizer. DBpedia for NLP: a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata, Semantics and Ontologies, 3:37–52, Nov. 10, 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner and S. Auer. LinkedGeoData: A core for a Web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer and J. Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4–6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.




– Statistical Extraction: Some NLP related extractors aggregate data from all Wikipedia pages in order to provide data that is based on statistical measures of page links or word counts, as further described in Section 2.6.

2.3 Raw Infobox Extraction

The type of Wikipedia content that is most valuable for the DBpedia extraction are infoboxes. Infoboxes are frequently used to list an article's most relevant facts as a table of attribute-value pairs on the top right-hand side of the Wikipedia page (for right-to-left languages on the top left-hand side). Infoboxes that appear in a Wikipedia article are based on a template that specifies a list of attributes that can form the infobox. A wide range of infobox templates are used in Wikipedia. Common examples are templates for infoboxes that describe persons, organisations or automobiles. As Wikipedia's infobox template system has evolved over time, different communities of Wikipedia editors use different templates to describe the same type of things (e.g. Infobox city japan, Infobox swiss town and Infobox town de). In addition, different templates use different names for the same attribute (e.g. birthplace and placeofbirth). As many Wikipedia editors do not strictly follow the recommendations given on the page that describes a template, attribute values are expressed using a wide range of different formats and units of measurement. An excerpt of an infobox that is based on a template for describing automobiles is shown below:

    {{Infobox automobile
    | name = Ford GT40
    | manufacturer = [[Ford Advanced Vehicles]]
    | production = 1964-1969
    | engine = 4181cc
    (...)
    }}

In this infobox, the first line specifies the infobox type and the subsequent lines specify various attributes of the described entity.

An excerpt of the extracted data is as follows:

    dbr:Ford_GT40
        dbp:name "Ford GT40"@en ;
        dbp:manufacturer dbr:Ford_Advanced_Vehicles ;
        dbp:engine 4181 ;
        dbp:production 1964 ;
    (...)

This extraction output has weaknesses: the resource is not associated to a class in the ontology, and parsed values are cleaned up and assigned a datatype based on heuristics. In particular, the raw infobox extractor searches for values in the following order: dates, coordinates, numbers, links, and strings as default. Thus, the datatype assignment for the same property in different resources is non-deterministic. The engine, for example, is extracted as a number, but if another instance of the template used "cc4181", it would be extracted as a string. This behaviour makes querying for properties in the dbp namespace inconsistent. Those problems can be overcome by the mapping-based infobox extraction presented in the next subsection.
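The following Scala fragment is a simplified sketch of this fall-through heuristic, not the actual framework code; coordinate parsing is omitted and the regular expressions are deliberately naive. It illustrates why "4181" and "cc4181" end up with different datatypes:

    def guessValue(raw: String): Any = {
      val date   = """(\d{4})-(\d{2})-(\d{2})""".r
      val number = """-?\d+(\.\d+)?""".r
      val link   = """\[\[([^\]|]+)(?:\|[^\]]+)?\]\]""".r
      raw.trim match {
        case date(y, m, d) => java.time.LocalDate.of(y.toInt, m.toInt, d.toInt)
        case number(_)     => BigDecimal(raw.trim)              // "4181" -> number
        case link(target)  => "dbr:" + target.replace(' ', '_') // wiki link -> resource
        case other         => other                             // "cc4181" -> plain string
      }
    }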

2.4 Mapping-Based Infobox Extraction

In order to homogenize the description of information in the knowledge base, in 2010 a community effort was initiated to develop an ontology schema and mappings from Wikipedia infobox properties to this ontology. The alignment between Wikipedia infoboxes and the ontology is performed via community-provided mappings that help to normalize name variations in properties and classes. Heterogeneity in the Wikipedia infobox system, like using different infoboxes for the same type of entity or using different property names for the same property (cf. Section 2.3), can be alleviated in this way. This significantly increases the quality of the raw Wikipedia infobox data by typing resources, merging name variations and assigning specific datatypes to the values.

This effort is realized using the DBpedia Mappings Wiki3, a MediaWiki installation set up to enable users to collaboratively create and edit mappings. These mappings are specified using the DBpedia Mapping Language. The mapping language makes use of MediaWiki templates that define DBpedia ontology classes and properties as well as template/table to ontology mappings. A mapping assigns a type from the DBpedia ontology to the entities that are described by the corresponding infobox. In addition, attributes in the infobox are mapped to properties in the DBpedia ontology. In the following, we show a mapping that maps infoboxes that use the Infobox automobile template to the DBpedia ontology:

    {{TemplateMapping
    | mapToClass = Automobile
    | mappings =
        {{PropertyMapping
        | templateProperty = name

3http://mappings.dbpedia.org


Table 1
Overview of the DBpedia extractors (cf. Table 16 for a complete list of prefixes). Each entry gives the extractor name, its description, and an example.

abstract – Extracts the first lines of the Wikipedia article.
    dbr:Berlin dbo:abstract "Berlin is the capital city of (...)"
article categories – Extracts the categorization of the article.
    dbr:Oliver_Twist dc:subject dbr:Category:English_novels
category label – Extracts labels for categories.
    dbr:Category:English_novels rdfs:label "English novels"
category hierarchy – Extracts information about which concept is a category and how categories are related, using the SKOS vocabulary.
    dbr:Category:World_War_II skos:broader dbr:Category:Modern_history
disambiguation – Extracts disambiguation links.
    dbr:Alien dbo:wikiPageDisambiguates dbr:Alien_(film)
external links – Extracts links to external web pages related to the concept.
    dbr:Animal_Farm dbo:wikiPageExternalLink <http://books.google.com/?id=RBGmrDnBs8UC>
geo coordinates – Extracts geo-coordinates.
    dbr:Berlin georss:point "52.5006 13.3989"
grammatical gender – Extracts grammatical genders for persons.
    dbr:Abraham_Lincoln foaf:gender "male"
homepage – Extracts links to the official homepage of an instance.
    dbr:Alabama foaf:homepage <http://alabama.gov/>
image – Extracts the first image of a Wikipedia page.
    dbr:Berlin foaf:depiction <http://.../Overview_Berlin.jpg>
infobox – Extracts all properties from all infoboxes.
    dbr:Animal_Farm dbo:date "March 2010"
interlanguage – Extracts interwiki links.
    dbr:Albedo dbo:wikiPageInterLanguageLink dbr-de:Albedo
label – Extracts the article title as label.
    dbr:Berlin rdfs:label "Berlin"
lexicalizations – Extracts information about surface forms and their association with concepts (only N-Quad format).
    dbr:Pine sptl:lexicalization lx:pine_tree ls:Pine_pine_tree .
    lx:pine_tree rdfs:label "pine tree" ls:Pine_pine_tree .
    ls:Pine_pine_tree sptl:pUriGivenSf 0.941
mappings – Extraction based on mappings of Wikipedia infoboxes to the DBpedia ontology.
    dbr:Berlin dbo:country dbr:Germany
page ID – Extracts page ids of articles.
    dbr:Autism dbo:wikiPageID 25
page links – Extracts all links between Wikipedia articles.
    dbr:Autism dbo:wikiPageWikiLink dbr:Human_brain
persondata – Extracts information about persons represented using the PersonData template.
    dbr:Andre_Agassi foaf:birthDate "1970-04-29"
PND – Extracts PND (Personennamendatei) data about a person.
    dbr:William_Shakespeare dbo:individualisedPnd "118613723"
redirects – Extracts redirect links between articles in Wikipedia.
    dbr:ArtificialLanguages dbo:wikiPageRedirects dbr:Constructed_language
revision ID – Extracts the revision ID of the Wikipedia article.
    dbr:Autism <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/Autism?oldid=495234324>
thematic concept – Extracts 'thematic' concepts, the centres of discussion for categories.
    dbr:Category:Music skos:subject dbr:Music
topic signatures – Extracts topic signatures.
    dbr:Alkane sptl:topicSignature "carbon alkanes atoms"
wiki page – Extracts links to corresponding articles in Wikipedia.
    dbr:AnAmericanInParis foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/AnAmericanInParis>


        | ontologyProperty = foaf:name }}
        {{PropertyMapping
        | templateProperty = manufacturer
        | ontologyProperty = manufacturer }}
        {{DateIntervalMapping
        | templateProperty = production
        | startDateOntologyProperty = productionStartDate
        | endDateOntologyProperty = productionEndDate }}
        {{IntermediateNodeMapping
        | nodeClass = AutomobileEngine
        | correspondingProperty = engine
        | mappings =
            {{PropertyMapping
            | templateProperty = engine
            | ontologyProperty = displacement
            | unit = Volume }}
            {{PropertyMapping
            | templateProperty = engine
            | ontologyProperty = powerOutput
            | unit = Power }}
        }}
    (...)
    }}

The RDF statements that are extracted from the previous infobox example are shown below. As we can see, the production period is correctly split into a start year and an end year, and the engine is represented by a distinct RDF node. It is worth mentioning that all values are canonicalized to basic units. For example, in the engine mapping we state that engine is a Volume, and thus the extractor converts "4181cc" (cubic centimeters) to cubic meters ("0.004181"). Additionally, there can exist multiple mappings on the same property that search for different datatypes or different units, for example a number with "PS" as a suffix for engine:

    dbr:Ford_GT40
        rdf:type dbo:Automobile ;
        rdfs:label "Ford GT40"@en ;
        dbo:manufacturer dbr:Ford_Advanced_Vehicles ;
        dbo:productionStartYear "1964"^^xsd:gYear ;
        dbo:productionEndYear "1969"^^xsd:gYear ;
        dbo:engine [
            rdf:type AutomobileEngine ;
            dbo:displacement 0.004181
        ]
    (...)

The DBpedia Mappings Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 2 shows how the infobox properties author and συγγραφέας – author in Greek – are both being mapped to the global identifier dbo:author. That means, in turn, that information from all language versions of DBpedia can be merged, and DBpedias for smaller languages can be augmented with knowledge from larger DBpedias, such as the English edition. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns, which is often only present in the corresponding language edition [43].

Besides hosting of the mappings and the DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

– Mapping Syntax Validator: The mapping syntax validator checks for syntactic correctness and highlights inconsistencies such as missing property definitions.
– Extraction Tester: The extraction tester, linked on each mapping page, tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.
– Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users to create and edit mappings.

2.5 URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

– http://dbpedia.org/resource/ (prefix dbr) for representing article data. There is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin4, DBpedia will produce the URI dbr:Berlin. Exceptions to this rule appear when intermediate nodes are extracted from the mapping-based infobox extractor as unique URIs (e.g. the engine mapping example in Section 2.4).
– http://dbpedia.org/property/ (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.

4http://en.wikipedia.org/wiki/Berlin


Fig. 2. Depiction of the mapping from the Greek (left) and English (right) Wikipedia templates about books to the same DBpedia ontology class (middle) [24]

– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links5. For every page in a language other than English, the page was extracted only if it contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach omitted valuable data [24]. Thus, starting from the DBpedia 3.7 release6, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within the datasets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the

5http://en.wikipedia.org/wiki/Help:Interlanguage_links

6A list of all DBpedia releases is provided in Table 17.

English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic, language-agnostic namespace http://dbpedia.org/resource/.
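A minimal sketch of this naming scheme is shown below (the helper names are ours); note that a canonicalized URI only exists when the article has an English counterpart:

    def localizedUri(lang: String, title: String): String =
      s"http://$lang.dbpedia.org/resource/" + title.replace(' ', '_')

    // englishTitle is the target of the inter-language link, if any.
    def canonicalUri(englishTitle: Option[String]): Option[String] =
      englishTitle.map(t => "http://dbpedia.org/resource/" + t.replace(' ', '_'))

    // localizedUri("de", "Berlin") == "http://de.dbpedia.org/resource/Berlin"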

2.6 NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [33]. Currently, four datasets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concept. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concept extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The Lexicalization data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones, and also show how ambiguous a name is with respect to all possible concepts that it can mean.

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the tf-idf weight, with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [30].
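A toy sketch of this weighting (stemming and normalisation omitted; the data structures are ours): each resource's k strongest word stems form its topic signature.

    // docs maps each DBpedia resource to the words of its Wikipedia article.
    def topicSignatures(docs: Map[String, Seq[String]], k: Int): Map[String, Seq[String]] = {
      val n = docs.size.toDouble
      // Document frequency: in how many articles does each word occur?
      val df = docs.values.flatMap(_.distinct).toSeq
        .groupBy(identity).map { case (w, g) => w -> g.size }
      docs.map { case (resource, words) =>
        val tf    = words.groupBy(identity).map { case (w, g) => w -> g.size }
        val tfidf = tf.map { case (w, f) => w -> f * math.log(n / df(w)) }
        resource -> tfidf.toSeq.sortBy(-_._2).take(k).map(_._1)
      }
    }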

There are two more feature extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the centre of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (subject, object, possessive adjective, possessive pronoun and reflexive), i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
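A compact sketch of this heuristic (tokenisation simplified; the pronoun lists follow the declined forms given above):

    val masculine = Set("he", "him", "his", "himself")
    val feminine  = Set("she", "her", "hers", "herself")

    // Returns the dominating gender, or None when the counts are tied.
    def guessGender(articleText: String): Option[String] = {
      val tokens = articleText.toLowerCase.split("\\W+").toSeq
      val m = tokens.count(masculine)
      val f = tokens.count(feminine)
      if (m > f) Some("male") else if (f > m) Some("female") else None
    }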

2.7 Summary of Other Recent Developments

In this section, we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework has been rewritten in Scala in 20107, resulting in an improvement of the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. This is crucial for the DBpedia Live project, which will be explained in Section 4. The new, more modular framework also allows extraction of data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the

7Table 17 provides an overview of the project's evolution through time.

creation and utilization of the DBpedia Mappings Wiki, as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years, the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal, etc. in English that indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain-text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.), and for a number of classes and properties, specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the PHP framework, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links, and hence increased the overall interconnectivity of resources in the DBpedia ontology.
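The redirect resolution can be pictured with the following in-memory sketch (the real framework operates on the extracted redirect data set, so this is illustrative only):

    // redirects maps each redirect page to its immediate target.
    def resolveRedirects(redirects: Map[String, String]): Map[String, String] = {
      def follow(page: String, seen: Set[String]): String =
        redirects.get(page) match {
          case Some(target) if !seen(target) => follow(target, seen + target)
          case _ => page // no further redirect, or a cycle was caught
        }
      redirects.keys.map(p => p -> follow(p, Set(p))).toMap
    }

    // resolveRedirects(Map("A" -> "B", "B" -> "C")) yields Map("A" -> "C", "B" -> "C")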


[Figure 3 – the original image is not recoverable from the text dump. Legend: C = owl:Class, P = owl:DatatypeProperty / owl:ObjectProperty. The figure shows top classes of the ontology (owl:Thing; dbo:Place with subclasses dbo:PopulatedPlace and dbo:Settlement; dbo:Agent with subclasses dbo:Person, dbo:Athlete and dbo:Organisation; dbo:Species with subclass dbo:Eukaryote; dbo:Work) together with selected properties and their domains and ranges, e.g. dbo:birthPlace, dbo:birthDate, dbo:deathDate, dbo:populationTotal, dbo:areaTotal, dbo:elevation, dbo:runtime, dbo:conservationStatus.]

Fig. 3. Snapshot of a part of the DBpedia ontology

Finally, a new heuristic to increase the connectivity of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology, as the sketch below illustrates.
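A sketch of this heuristic, with an illustrative signature: given an article's hyperlinks as (anchor text, link target) pairs, an unlinked infobox string is replaced by the target of a link whose anchor text matches exactly:

    def linkify(infoboxValue: String, pageLinks: Seq[(String, String)]): String =
      pageLinks.collectFirst {
        case (anchor, target) if anchor == infoboxValue => target
      }.getOrElse(infoboxValue)

    // linkify("Ford Advanced Vehicles",
    //         Seq("Ford Advanced Vehicles" -> "dbr:Ford_Advanced_Vehicles"))
    // yields "dbr:Ford_Advanced_Vehicles"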

Orthogonal to the previously mentioned improvements, there have been various efforts to assess the quality of the DBpedia datasets. [26] developed a framework for estimating the quality of DBpedia, and a sample of 75 resources was analysed. A more comprehensive effort was performed in [48] by providing a distributed web-based interface [25] for quality assessment. In this study, 17 data quality problem types were analysed by 58 users covering 521 resources in DBpedia.

3 DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances. The complete DBpedia ontology can be browsed online at http://mappings.dbpedia.org/server/ontology/classes/.

[Figure 4 – the original chart is not recoverable from the text dump. It plots the number of ontology elements (two lines: Classes and Properties) against DBpedia versions 3.2–3.8, on a scale from 0 to 2000 elements.]

Fig. 4. Growth of the DBpedia ontology

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing much, due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1 Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active.

Fig. 5. Mapping coverage statistics for all mapping-enabled languages

Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates.


Fig. 7. English property mappings occurrence frequency (both axes are in log scale)

For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year have a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the English property mapping occurrence frequency. Both axes are in log scale and represent the number of property mappings (x-axis) that have exactly y occurrences (y-axis). The occurrence frequency follows a long-tail distribution: a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2 Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been so far created in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of popular classes from the third and fourth hierarchy level of the ontology within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12936 persons are described in five languages but not in six or more languages. The number 871630 for the class Person means that all 20 language versions together describe 871630 different persons. The number is higher than the number of persons described in the canonicalized English infobox


Fig. 6. Mapping community activity for (a) the ontology and (b) the 10 most active language editions

Table 2
Basic statistics about localized DBpedia editions

Lang  Inst. LD all  Inst. CD all  Inst. with MD CD  Raw Prop. CD  Map. Prop. CD  Raw Statem. CD  Map. Statem. CD
en    3769926       3769926       2359521           48293         1313           65143840        33742015
de    1243771       650037        204335            9593          261            7603562         2880381
fr    1197334       740044        214953            13551         228            8854322         2901809
it    882127        580620        383643            9716          181            12227870        4804731
es    879091        542524        310348            14643         476            7740458         4383206
pl    848298        538641        344875            7306          266            7696193         4511794
ru    822681        439605        123011            13522         76             6973305         1389473
pt    699446        460258        272660            12851         602            6255151         4005527
ca    367362        241534        112934            8696          183            3689870         1301868
cs    225133        148819        34893             5564          334            1857230         474459
hu    209180        138998        63441             6821          295            2506399         601037
ko    196132        124591        30962             7095          419            1035606         417605
tr    187850        106644        40438             7512          440            1350679         556943
ar    165722        103059        16236             7898          268            635058          168686
eu    132877        108713        41401             2245          19             2255897         532709
sl    129834        73099         22036             4235          470            1213801         222447
bg    125762        87679         38825             3984          274            774443          488678
hr    109890        71469         10343             3334          158            701182          151196
el    71936         48260         10813             2866          288            206460          113838

data set (763643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195263 persons are described in two or more languages. The large difference of this number compared to the total number of 871630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor, alongside the live synchronisation approaches in [20], allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages8. The DBpedia 3.7 release9 in September 2011 was the first DBpedia release to use the localized I18n (Internationalisation) DBpedia extraction framework [24].

8Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr), Urdu (ur)

9http://blog.dbpedia.org/2011/09/11/


Table 3
Number of instances per class within 10 localized DBpedia versions

Class         en      it      pl      es      pt      fr     de      ru     ca     hu
Person        763643  145060  70708   65337   43057   62942  33122   18620  7107   15529
  Athlete     185126  47187   30332   19482   14130   21646  31237   0      721    4527
  Artist      61073   12511   16120   25992   10571   13465  0       0      2004   3821
  Politician  23096   0       8943    5513    3342    0      0       12004  1376   760
Place         572728  141101  182727  132961  116660  80602  131766  67932  73078  18324
  PopulPlace  387166  138077  167034  121204  109418  72252  79410   63826  72743  15535
  Building    60514   1270    2946    3570    803     921    83      43     0      527
  River       24267   0       1383    2       4149    3333   6707    3924   0      565
Organisation  192832  4142    12193   11710   10949   17513  16973   1598   1399   3993
  Company     44516   4142    2566    975     1903    5832   7200    0      440    618
  EducInst    42270   0       599     1207    270     1636   2171    1010   0      115
  Band        27061   0       2993    0       4476    3868   5368    0      263    802
Work          333269  51918   32386   36484   33869   39195  18195   34363  5240   9777
  MusicWork   159070  23652   14987   15911   17053   16206  0       6588   697    4704
  Film        71715   17210   9467    9396    8896    9741   15038   12045  3859   2320
  Software    27947   5682    3050    4833    3419    5733   2368    0      606    857

Table 4
Cross-language overlap: Number of instances that are described in multiple languages

Class         Instances  1       2       3      4      5      6      7      8      9      10+
Person        871630     676367  94339   42382  21647  12936  8198   5295   3437   2391   4638
Place         643260     307729  150349  45836  36339  20831  13523  20808  31422  11262  5161
Organisation  206670     160398  22661   9312   5002   3221   2072   1421   928    594    1061
Work          360808     243706  54855   23097  12605  8277   5732   4007   2911   1995   3623

At the time of writing, DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish10. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services.

While at the moment chapters are defined by ownership of the IP and server of the sub-domain A record (e.g. http://ko.dbpedia.org) given by DBpedia maintainers, the DBpedia internationalisation committee11 is manifesting its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek12 and Dutch13) the existence of

10Accessed on 25/09/2013: http://wiki.dbpedia.org/Internationalization/Chapters

11http://wiki.dbpedia.org/Internationalization

12http://el.dbpedia.org
13http://nl.dbpedia.org

a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage.

The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is to achieving total coverage (as shown in Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute14. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [20,36]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported15. Meanwhile, DBpedia Live for Dutch16 was developed.

4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [27]. This protocol allows a program to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework and to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is shown in Figure 8. The major components of the system are as follows:

14 http://stats.wikimedia.org/EN/SummaryEN.htm

15 http://live.dbpedia.org

16 http://live.nl.dbpedia.org

Fig. 8. Overview of the DBpedia Live extraction framework.

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki, described in Section 2.4, serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

– Unmodified Pages: The update of unmodified pages at regular intervals.

– Changesets Publication: The publication of triple-changesets.

– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.

– Data Isolation: Separation of data from different sources.


Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator), or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.
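One way to collect the affected pages is a query against the triple store for all instances of the remapped class; a minimal sketch (the endpoint and the dbo:Book example are taken from the surrounding text, the processing around it is an assumption):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Find all resources typed as dbo:Book, i.e. the entities whose
    # pages must be re-extracted after a change to the Book mapping.
    sparql = SPARQLWrapper("http://live.dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?entity WHERE { ?entity a dbo:Book }
    """)
    sparql.setReturnFormat(JSON)
    affected = [row["entity"]["value"]
                for row in sparql.query().convert()["results"]["bindings"]]
    print(len(affected), "pages queued for reprocessing")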

Unmodified Pages: Extraction framework improvements or the activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feeds, and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and a set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
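The core of such a synchronisation step can be sketched in a few lines of Python for a single changeset pair. The changeset URLs, their naming scheme, and the assumption that the local mirror's Virtuoso endpoint accepts SPARQL updates via its query parameter are all for illustration:

    import gzip
    import urllib.parse
    import urllib.request

    # Hypothetical changeset pair (the path layout is an assumption)
    # and a local mirror's endpoint with update rights granted.
    ADDED = "http://live.dbpedia.org/changesets/2013/01/01/00/000001.added.nt.gz"
    REMOVED = "http://live.dbpedia.org/changesets/2013/01/01/00/000001.removed.nt.gz"
    ENDPOINT = "http://localhost:8890/sparql"

    def fetch_ntriples(url):
        """Download and decompress one changeset file."""
        with urllib.request.urlopen(url) as resp:
            return gzip.decompress(resp.read()).decode("utf-8")

    def run_update(update_query):
        data = urllib.parse.urlencode({"query": update_query}).encode()
        urllib.request.urlopen(ENDPOINT, data).read()

    # N-Triples is a subset of Turtle, so the file contents can be
    # spliced into SPARQL Update requests directly. Deletions are
    # applied first, so that changed triples end up with new values.
    run_update("DELETE DATA { " + fetch_ntriples(REMOVED) + " }")
    run_update("INSERT DATA { " + fetch_ntriples(ADDED) + " }")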

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds is contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) is kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Future versions of DBpedia Live will also separate data from the raw infobox extraction and the mapping-based infobox extraction.
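A client can use these graph URIs to scope a query to one source; a minimal sketch using the third-party SPARQLWrapper library (the Berlin example resource is an assumption):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Query only the static interlinks graph of the DBpedia Live endpoint.
    sparql = SPARQLWrapper("http://live.dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        SELECT ?target WHERE {
          GRAPH <http://static.dbpedia.org> {
            <http://dbpedia.org/resource/Berlin> owl:sameAs ?target .
          }
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["target"]["value"])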

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as of the external data sets that set links pointing to DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository17 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release, as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking, as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links: the value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset is copied from the previous release and not regenerated.

An example for the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia)18:

    SELECT * {
      { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
          ?com rdf:type fts-o:Commitment .
          ?com fts-o:year ?year .
          ?year rdfs:label ?ftsyear .
          ?com fts-o:benefit ?benefit .
          ?benefit fts-o:detailAmount ?amount .
          ?benefit fts-o:beneficiary ?beneficiary .
          ?beneficiary fts-o:country ?country .
          ?country owl:sameAs ?ftscountry .
        } GROUP BY ?ftsyear ?ftscountry }
      { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
          ?dbpcountry rdf:type dbo:Country .
          ?dbpcountry dbp:gdpNominal ?gdpnominal .
          ?dbpcountry dbp:gdpNominalYear ?gdpyear .
        } }
      FILTER ((?ftsyear = str(?gdpyear)) &&
              (?ftscountry = ?dbpcountry))
    }

17 https://github.com/dbpedia/dbpedia-links

18 Endpoint: http://fts.publicdata.eu/sparql; results: http://bit.ly/1c2mIwQ


Table 5. Data sets linked from DBpedia

Data set                  Predicate                     Count   Tool
Amsterdam Museum          owl:sameAs                      627   S
BBC Wildlife Finder       owl:sameAs                      444   S
Book Mashup               rdf:type, owl:sameAs          9 100
Bricklink                 dc:publisher                 10 100
CORDIS                    owl:sameAs                      314   S
Dailymed                  owl:sameAs                      894   S
DBLP Bibliography         owl:sameAs                      196   S
DBTune                    owl:sameAs                      838   S
Diseasome                 owl:sameAs                    2 300   S
Drugbank                  owl:sameAs                    4 800   S
EUNIS                     owl:sameAs                    3 100   S
Eurostat (Linked Stats)   owl:sameAs                      253   S
Eurostat (WBSG)           owl:sameAs                      137
CIA World Factbook        owl:sameAs                      545   S
flickr wrappr             dbp:hasPhotoCollection    3 800 000   C
Freebase                  owl:sameAs                3 600 000   C
GADM                      owl:sameAs                    1 900
GeoNames                  owl:sameAs                   86 500   S
GeoSpecies                owl:sameAs                   16 000   S
GHO                       owl:sameAs                      196   L
Project Gutenberg         owl:sameAs                    2 500   S
Italian Public Schools    owl:sameAs                    5 800   S
LinkedGeoData             owl:sameAs                  103 600   S
LinkedMDB                 owl:sameAs                   13 800   S
MusicBrainz               owl:sameAs                   23 000
New York Times            owl:sameAs                    9 700
OpenCyc                   owl:sameAs                   27 100   C
OpenEI (Open Energy)      owl:sameAs                      678   S
Revyu                     owl:sameAs                        6
Sider                     owl:sameAs                    2 000   S
TCMGeneDIT                owl:sameAs                      904
UMBEL                     rdf:type                    896 400
US Census                 owl:sameAs                   12 600
WikiCompany               owl:sameAs                    8 300
WordNet                   dbp:wordnet_type            467 100
YAGO2                     rdf:type                 18 100 000

Sum                                               27 211 732

In addition to providing outgoing links on an instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki, by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively.

In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
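These schema-level links are part of the ontology and can be retrieved like any other triples; a minimal sketch against the public endpoint (the exact counts depend on the loaded release):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # List the owl:equivalentClass links from DBpedia classes to
    # Schema.org terms.
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        SELECT ?cls ?schemaCls WHERE {
          ?cls owl:equivalentClass ?schemaCls .
          FILTER (STRSTARTS(STR(?schemaCls), "http://schema.org/"))
        }
    """)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    print(len(rows), "equivalent class links to Schema.org")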

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub19. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources it stores. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia; 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia.

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub20. However, it crawls for RDFa snippets, converts microformats etc., which are not captured by the DataHub. Despite this inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

19 See http://wiki.dbpedia.org/Interlinking for details.

20 http://datahub.io


Table 6. Top 10 data sets in Sindice, ordered by the number of links to DBpedia

Data set         Link Predicates   Link Count
okaboo.com                     4    2 407 121
tfri.gov.tw                   57      837 459
naplesplus.us                  2      212 460
fu-berlin.de                   7      145 322
freebase.com                 108      138 619
geonames.org                   3      125 734
opencyc.org                    3       19 648
geospecies.org                10       16 188
dbrec.net                      3       12 856
faviki.com                     5       12 768

Table 7. Sindice summary statistics for incoming links to DBpedia

Metric                          Value
Total links                 3 960 212
Total distinct data sets          248
Total distinct predicates         345

Table 8. Top 10 datasets by incoming links in Sindice

Domain                Datasets       Links
purl.org                   498   6 717 520
dbpedia.org                248   3 960 212
creativecommons.org      2 483   3 030 910
identi.ca                1 021   2 359 276
l3s.de                      34   1 261 487
rkbexplorer.com             24   1 212 416
nytimes.com                 27   1 174 941
w3.org                     405     658 535
geospecies.org              13     523 709
livejournal.com         14 881     366 025


6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity. The top five languages with respect to download volume are English, Chinese, German, Catalan and French. The download count and download volume of the English language edition are shown in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month21.

Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Presumably, they are used for network analysis or for providing relevant links in user interfaces, and are downloaded more often because they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round-robin HTTP front-end.

21 The IP address was only filtered for that specific file and month in those cases.

22 http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume (in GB) of the DBpedia data sets.

Table 9. Hardware of the machines serving the public SPARQL endpoint

DBpedia     Configuration
3.3 - 3.4   AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 - 3.7   Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8         Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB

This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200-second timeout window and with a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.
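Given the result-size cap, clients typically page through large result sets with ORDER BY, LIMIT and OFFSET. A minimal Python sketch of the pattern (using the third-party SPARQLWrapper library; the page size and the dbo:City example are arbitrary choices):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Page through a result set larger than the endpoint's row cap.
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    page_size, offset = 10000, 0
    while True:
        sparql.setQuery("""
            PREFIX dbo: <http://dbpedia.org/ontology/>
            SELECT ?city WHERE { ?city a dbo:City }
            ORDER BY ?city LIMIT %d OFFSET %d
        """ % (page_size, offset))
        rows = sparql.query().convert()["results"]["bindings"]
        if not rows:
            break            # no more pages
        offset += page_size  # process `rows` here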

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced.

Table 11. Number of unique sites accessing DBpedia endpoints

DBpedia   Avg/Day   Median    Stdev   Maximum
3.3         5 824    5 977    1 046     7 865
3.4         6 656    6 704      947     8 312
3.5         9 392    9 432    1 643    12 584
3.6        10 810   10 810    1 612    14 798
3.7        17 077   17 091    6 782    33 288
3.8        14 513   14 493    2 783    20 510

If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code23, like 509 ("Bandwidth Limit Exceeded"), and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.2124. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by the median and standard deviation.

23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

24 http://www.webalizer.org


Table 10. Number of endpoint hits (left) and visits (right)

DBpedia      Avg/Day      Median       Stdev     Maximum
3.3          733 811     711 306     188 991   1 319 539
3.4        1 212 549   1 165 893     351 226   2 371 657
3.5        1 122 612   1 035 444     386 454   2 908 720
3.6        1 328 355   1 286 750     256 945   2 495 031
3.7        2 085 399   1 930 728   1 057 398   8 873 473
3.8        2 910 410   2 717 775   1 085 640   7 678 490

DBpedia   Avg/Day   Median    Stdev   Maximum
3.3         9 750    9 775    2 036    13 126
3.4        11 329   11 473    1 546    14 198
3.5        16 592   16 902    2 969    23 129
3.6        19 471   17 664    5 691    56 064
3.7        23 972   22 262   10 741   127 869
3.8        16 851   16 711    2 960    27 416

The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. the blocking of apps that were abusing the DBpedia endpoint,
3. the uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs pointing to the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint.

Table 12. Hits per service to http://dbpedia.org, in thousands

Endpoint      3.3      3.4     3.5      3.6      3.7      3.8
data         1355     2681    2230     2547     3714     4246
ontology       80      168     142      178      116       97
page         2634     4703    1810     1687     3787     7377
property      231      311     137      176      176      128
resource     2976     4080    2554     2309     4436     7143
sparql       2090     4177    2266     9112    15902    15475
other         252      420     342      277      579      695
total        9619    16541    9434    16286    28710    35142

Fig. 11. Traffic: Linked Data versus SPARQL endpoint.

The ontology and property endpoints return meta-information about the DBpedia ontology. While all of these endpoints may themselves use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.
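The content negotiation can be illustrated from the client side with a short Python sketch; that the data variant of the example resource is served as /data/Berlin.ttl is an assumption about the redirect target:

    import urllib.request

    # Ask the resource endpoint for machine-readable data; the server
    # answers with a 30x redirect, which urllib follows automatically.
    req = urllib.request.Request(
        "http://dbpedia.org/resource/Berlin",
        headers={"Accept": "text/turtle"})
    with urllib.request.urlopen(req) as resp:
        print(resp.geturl())                    # the data URL after the redirect
        print(resp.read(300).decode("utf-8"))   # first bytes of the Turtle data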

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focus on the calls to the sparql endpoint and count the number of statements per type.


Table 13. Hits per statement type, in thousands

Statement     3.3    3.4    3.5    3.6     3.7     3.8
ask            56    269    360    326     159     677
construct      55     31     14     11      22      38
describe       11      8      4      7      62     111
select       1891   3663   1658   8030   11204   13516
unknown        78    206    229    738    4455    1134
total        2090   4177   2266   9112   15902   15475

Table 14. Trends in SPARQL select (rounded values in %)

Statement    3.3    3.4    3.5    3.6    3.7    3.8
distinct    19.5   11.4   17.3   19.4   13.3   25.4
filter      45.7   13.7   31.8   25.3   29.2   36.1
functions    8.8    6.9   23.5   21.3   25.5   25.9
geo         27.7    7.0   39.6    6.7    9.3    9.2
group        0.0    0.0    0.0    0.0    0.0    0.2
limit        4.6    6.5   11.6   10.5    7.7    7.8
optional    30.6   23.8   47.3   26.3   16.7   17.2
order        2.3    1.3    1.9    3.2    1.2    1.6
union        3.3    3.2   26.4   11.3   12.1   20.6

As the log files only record the full SPARQL query on a GET request, all POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– functions like CONCAT, CONTAINS, isIRI
– use of geo objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the geo objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
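A log analysis of this kind can be sketched in a few lines of Python; the regular expressions and the sample queries are illustrative, not the exact processing used for Table 14:

    import re

    KEYWORDS = ["DISTINCT", "FILTER", "GROUP BY", "LIMIT",
                "OPTIONAL", "ORDER BY", "UNION"]

    def keyword_shares(queries):
        """Share (in %) of SELECT queries using each keyword."""
        selects = [q for q in queries if re.search(r"\bSELECT\b", q, re.I)]
        if not selects:
            return {}
        return {kw: round(100.0 * sum(bool(re.search(kw, q, re.I))
                                      for q in selects) / len(selects), 1)
                for kw in KEYWORDS}

    sample = ["SELECT DISTINCT ?s WHERE { ?s ?p ?o } LIMIT 10",
              "SELECT ?s WHERE { ?s ?p ?o . FILTER(?o > 5) }"]
    print(keyword_shares(sample))  # {'DISTINCT': 50.0, 'FILTER': 50.0, ...}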

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Applications include improving the search and exploration of Wikipedia data, data proliferation for applications and mashups, as well as text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13].

25 http://wiki.dbpedia.org/Datasets/NLP


The thematic concepts data set of resources can be used for creating a corpus from Wikipedia, to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as the key phrase extraction and entity linking tasks – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool26, including a free web service, that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
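As an illustration, the web service can be called as follows; the parameter names reflect Spotlight's public /rest/annotate interface, but the exact URL, defaults and response fields may have changed since this paper was written:

    import json
    import urllib.parse
    import urllib.request

    # Annotate a short text with DBpedia resources via DBpedia Spotlight.
    params = urllib.parse.urlencode({
        "text": "Berlin is the capital of Germany.",
        "confidence": 0.4,  # disambiguation confidence threshold
    })
    req = urllib.request.Request(
        "http://spotlight.dbpedia.org/rest/annotate?" + params,
        headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        annotations = json.load(resp)
    for res in annotations.get("Resources", []):
        print(res["@surfaceForm"], "->", res["@URI"])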

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, the Semantic API from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.

26 http://spotlight.dbpedia.org

27 http://www.alchemyapi.com


A related project is ImageSnippets32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation: Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets, including DBpedia35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions.

28 http://www.ontos.com

29 http://www.opencalais.com

30 http://www.zemanta.com

31 http://stanbol.apache.org

32 http://www.imagesnippets.com

33 http://www.faviki.com

34 http://bbc.co.uk

35 http://www.aaai.org/Magazine/Watson/watson.php

36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, the question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or as training data in this task, as was done by the Watson team [4] and others [28].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information, such as an author's demographics, a film's homepage or an image, could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and in Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project37. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia38. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37 http://viaf.org

38 Accessed on 12.02.2013: http://www.oclc.org/research/news/2012/12-07a.html

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of such tools, structured by category.

Facet-Based Browsers: An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LodLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying: The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows users to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows users to create statistical analysis queries and correlations in RDF data, and in DBpedia in particular.

Spatial Applications: DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available in 171 languages (of which approximately 147 can be considered active42).

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/

40 http://dbpedia.org/fct

41 http://querybuilder.dbpedia.org


It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains43. Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast-changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data44. Statistics are shown in Table 15.

8 Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata45. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike.

42 http://s23.org/wikistats/wiktionaries_html.php

43 See http://en.wiktionary.org/wiki/semantic for a simple example page.

44 http://wiktionary.dbpedia.org

45 http://wikidata.org

It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and can include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles.

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface

47 http://www.freebase.com


Table 15. Statistical comparison of extractions for different languages

language       words      triples    resources   predicates    senses
en         2 142 237   28 593 364   11 804 039           28   424 386
fr         4 657 817   35 032 121   20 462 349           22   592 351
ru         1 080 156   12 813 437    5 994 560           17   149 859
de           701 739    5 618 508    2 966 867           16   122 362

They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and they allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows the efficient storage of metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or of the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and to provide an RDF version of its content.

48 http://www.mpi-inf.mpg.de/yago-naga/yago

YAGO focuses on extracting a smaller number of relations compared to DBpedia, in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 that are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source, Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. The data is also provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

49 http://www.mediawiki.org/wiki/Alternative_parsers

50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction



Several different approaches to extracting knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g., for word-sense disambiguation or for synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improving the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used, large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieving better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby

52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus

53 http://sigwp.org/en

54 http://baike.baidu.com

55 http://www.hudong.com

KnowItAll56 is a web-scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common-sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that in the past four years DBpedia has matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion: An area which is still largely unexplored is the integration and fusion between different DBpedia language editions.

56 http://www.cs.washington.edu/research/knowitall


Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement: In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we have made a first step in this direction.

Inline extraction: Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension were deployed at Wikipedia installations, this would open up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia: While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and will add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Feedback for Wikipedia: A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or a typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
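A minimal sketch of such a sanity check, run against the Live endpoint (the birthday/death-day check is the example from above; the property names follow the DBpedia ontology):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Find persons whose recorded death date precedes their birth date,
    # i.e. candidates for a typo in the underlying Wikipedia article.
    sparql = SPARQLWrapper("http://live.dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?person ?birth ?death WHERE {
          ?person a dbo:Person ;
                  dbo:birthDate ?birth ;
                  dbo:deathDate ?death .
          FILTER (?death < ?birth)
        } LIMIT 100
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["person"]["value"], row["birth"]["value"],
              row["death"]["value"])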

Integrate DBpedia and NLP: Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information, such as co-occurrence, for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) or to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge, comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development


– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16. List of namespace prefixes

Prefix    Namespace
dbo       http://dbpedia.org/ontology/
dbp       http://dbpedia.org/property/
dbr       http://dbpedia.org/resource/
dbr-de    http://de.dbpedia.org/resource/
dc        http://purl.org/dc/elements/1.1/
foaf      http://xmlns.com/foaf/0.1/
geo       http://www.w3.org/2003/01/geo/wgs84_pos#
georss    http://www.georss.org/georss/
ls        http://spotlight.dbpedia.org/scores/
lx        http://spotlight.dbpedia.org/lexicalizations/
owl       http://www.w3.org/2002/07/owl#
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs      http://www.w3.org/2000/01/rdf-schema#
skos      http://www.w3.org/2004/02/skos/core#
sptl      http://spotlight.dbpedia.org/vocab/
xsd       http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17. DBpedia timeline

Year   Month   Event
2006   Nov     Start of infobox extraction from Wikipedia
2007   Mar     DBpedia 1.0 release
       Jun     ESWC DBpedia article [2]
       Nov     DBpedia 2.0 release
       Nov     ISWC DBpedia article [1]
       Dec     DBpedia 3.0 release candidate
2008   Feb     DBpedia 3.0 release
       Aug     DBpedia 3.1 release
       Nov     DBpedia 3.2 release
2009   Jan     DBpedia 3.3 release
       Sep     JWS DBpedia article [5]
2010   Feb     Information extraction framework in Scala; Mappings Wiki release
       Mar     DBpedia 3.4 release
       Apr     DBpedia 3.5 release
       May     Start of DBpedia Internationalization effort
2011   Feb     DBpedia Spotlight release
       Mar     DBpedia 3.6 release
       Jul     DBpedia Live release
       Sep     DBpedia 3.7 release (with I18n datasets)
2012   Aug     DBpedia 3.8 release
       Sep     Publication of DBpedia Internationalization article [24]
2013   Sep     DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann. DBpedia - a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini and A. Antonuccio. LodLive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: research and applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia Edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson and S. Warner. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E Oren R Delbru M Catasta R Cyganiak H Stenzhornand G Tummarello Sindicecom A document-oriented lookupindex for open linked data Int J of Metadata and Semanticsand Ontologies 337ndash52 Nov 10 2008

[40] S P Ponzetto and M Strube WikiTaxonomy A large scaleknowledge resource In M Ghallab C D SpyropoulosN Fakotakis and N M Avouris editors ECAI volume 178of Frontiers in Artificial Intelligence and Applications pages751ndash752 IOS Press 2008

[41] C Stadler J Lehmann K Hoffner and S Auer LinkedGeo-Data A Core for a Web of Spatial Open Data Semantic WebJournal 3(4)333ndash354 2012

[42] F M Suchanek G Kasneci and G Weikum YAGO A coreof semantic knowledge In C L Williamson M E ZurkoP F Patel-Schneider and P J Shenoy editors WWW pages697ndash706 ACM 2007

[43] E Tacchini A Schultz and C Bizer Experiments withWikipedia cross-language data fusion In Proceedings of the5th Workshop on Scripting and Development for the SemanticWeb ESWC Citeseer 2009

[44] C Unger L Buhmann J Lehmann A-C Ngonga NgomoD Gerber and P Cimiano Template-based Question Answer-ing over RDF Data In Proceedings of the 21st internationalconference on World Wide Web pages 639ndash648 2012

[45] Z Wang J Li Z Wang and J Tang Cross-lingual KnowledgeLinking Across Wiki Knowledge Bases In Proceedings ofthe 21st international conference on World Wide Web pages459ndash468 New York NY USA 2012 ACM

[46] F Wu and D Weld Automatically refining the WikipediaInfobox Ontology In Proceedings of the 17th World Wide WebConference 2008

[47] F Wu and D S Weld Autonomously semantifying WikipediaIn Proceedings of the 16th Conference on Information andKnowledge Management pages 41ndash50 ACM 2007

[48] A Zaveri D Kontokostas M A Sherif L BuhmannM Morsey S Auer and J Lehmann User-driven QualityEvaluation of DBpedia In To appear in Proceedings of 9thInternational Conference on Semantic Systems I-SEMANTICSrsquo13 Graz Austria September 4-6 2013 ACM 2013

[49] T Zesch C Muller and I Gurevych Extracting lexical seman-tic knowledge from Wikipedia and Wiktionary In Proceedingsof the 6th International Conference on Language Resourcesand Evaluation Marrakech Morocco May 2008 electronicproceedings




Table 1. Overview of the DBpedia extractors (cf. Table 16 for a complete list of prefixes).

Name | Description | Example
abstract | Extracts the first lines of the Wikipedia article. | dbr:Berlin dbo:abstract "Berlin is the capital city of (...)"
article categories | Extracts the categorization of the article. | dbr:Oliver_Twist dc:subject dbr:Category:English_novels
category label | Extracts labels for categories. | dbr:Category:English_novels rdfs:label "English novels"
category hierarchy | Extracts information about which concept is a category and how categories are related, using the SKOS vocabulary. | dbr:Category:World_War_II skos:broader dbr:Category:Modern_history
disambiguation | Extracts disambiguation links. | dbr:Alien dbo:wikiPageDisambiguates dbr:Alien_(film)
external links | Extracts links to external web pages related to the concept. | dbr:Animal_Farm dbo:wikiPageExternalLink <http://books.google.com/?id=RBGmrDnBs8UC>
geo coordinates | Extracts geo-coordinates. | dbr:Berlin georss:point "52.5006 13.3989"
grammatical gender | Extracts grammatical genders for persons. | dbr:Abraham_Lincoln foaf:gender "male"
homepage | Extracts links to the official homepage of an instance. | dbr:Alabama foaf:homepage <http://alabama.gov/>
image | Extracts the first image of a Wikipedia page. | dbr:Berlin foaf:depiction <http://.../Overview_Berlin.jpg>
infobox | Extracts all properties from all infoboxes. | dbr:Animal_Farm dbo:date "March 2010"
interlanguage | Extracts interwiki links. | dbr:Albedo dbo:wikiPageInterLanguageLink dbr-de:Albedo
label | Extracts the article title as label. | dbr:Berlin rdfs:label "Berlin"
lexicalizations | Extracts information about surface forms and their association with concepts (only N-Quad format). | dbr:Pine sptl:lexicalization lx:pine_tree . lx:pine_tree rdfs:label "pine tree" . ls:Pine_pine_tree sptl:pUriGivenSf 0.941
mappings | Extraction based on mappings of Wikipedia infoboxes to the DBpedia ontology. | dbr:Berlin dbo:country dbr:Germany
page ID | Extracts page IDs of articles. | dbr:Autism dbo:wikiPageID 25
page links | Extracts all links between Wikipedia articles. | dbr:Autism dbo:wikiPageWikiLink dbr:Human_brain
persondata | Extracts information about persons represented using the PersonData template. | dbr:Andre_Agassi foaf:birthDate "1970-04-29"
PND | Extracts PND (Personennamendatei) data about a person. | dbr:William_Shakespeare dbo:individualisedPnd "118613723"
redirects | Extracts redirect links between articles in Wikipedia. | dbr:ArtificialLanguages dbo:wikiPageRedirects dbr:Constructed_language
revision ID | Extracts the revision ID of the Wikipedia article. | dbr:Autism <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/Autism?oldid=495234324>
thematic concept | Extracts 'thematic' concepts, the centres of discussion for categories. | dbr:Category:Music skos:subject dbr:Music
topic signatures | Extracts topic signatures. | dbr:Alkane sptl:topicSignature "carbon alkanes atoms"
wiki page | Extracts links to corresponding articles in Wikipedia. | dbr:AnAmericanInParis foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/AnAmericanInParis>


  | ontologyProperty = foaf:name }}
{{PropertyMapping
  | templateProperty = manufacturer
  | ontologyProperty = manufacturer }}
{{DateIntervalMapping
  | templateProperty = production
  | startDateOntologyProperty = productionStartDate
  | endDateOntologyProperty = productionEndDate }}
{{IntermediateNodeMapping
  | nodeClass = AutomobileEngine
  | correspondingProperty = engine
  | mappings =
    {{PropertyMapping
      | templateProperty = engine
      | ontologyProperty = displacement
      | unit = Volume }}
    {{PropertyMapping
      | templateProperty = engine
      | ontologyProperty = powerOutput
      | unit = Power }}
}}
(...)

The RDF statements that are extracted from the previous infobox example are shown below. As we can see, the production period is correctly split into a start year and an end year, and the engine is represented by a distinct RDF node. It is worth mentioning that all values are canonicalized to basic units. For example, in the engine mapping we state that the displacement value is a Volume, and thus the extractor converts "4181cc" (cubic centimetres) to cubic metres ("0.004181"). Additionally, there can exist multiple mappings on the same property that search for different datatypes or different units, for example a number with "PS" as a suffix for engine power.

dbr:Ford_GT40
    rdf:type dbo:Automobile ;
    rdfs:label "Ford GT40"@en ;
    dbo:manufacturer dbr:Ford_Advanced_Vehicles ;
    dbo:productionStartYear "1964"^^xsd:gYear ;
    dbo:productionEndYear "1969"^^xsd:gYear ;
    dbo:engine [
        rdf:type dbo:AutomobileEngine ;
        dbo:displacement 0.004181
    ] .
(...)
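The following minimal sketch illustrates the unit canonicalisation step described above; the conversion table and function name are illustrative assumptions, not the framework's actual implementation.

import math

# Illustrative conversion factors to SI base units (assumed values table).
TO_BASE = {
    "cc": 1e-6,       # cubic centimetres -> cubic metres
    "l":  1e-3,       # litres -> cubic metres
    "PS": 735.49875,  # metric horsepower -> watts
}

def canonicalise(value: float, unit: str) -> float:
    """Convert a parsed infobox value to its SI base unit."""
    return value * TO_BASE[unit]

# "4181cc" becomes 0.004181 cubic metres, as in the example above.
assert math.isclose(canonicalise(4181, "cc"), 0.004181)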

The DBpedia Mappings Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is also used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 2 shows how the infobox properties author and συγγραφέας – author in Greek – are both mapped to the global identifier dbo:author. That means, in turn, that information from all language versions of DBpedia can be merged, and DBpedia editions for smaller languages can be augmented with knowledge from larger DBpedia editions such as the English one. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns, which is often only present in the corresponding language edition [43].

Besides hosting the mappings and the DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

– Mapping Syntax Validator: The mapping syntax validator checks for syntactic correctness and highlights inconsistencies such as missing property definitions.
– Extraction Tester: The extraction tester, linked on each mapping page, tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.
– Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users in creating and editing mappings.

2.5. URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

– http://dbpedia.org/resource/ (prefix dbr) for representing article data. There is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin⁴, DBpedia will produce the URI dbr:Berlin. Exceptions to this rule arise when intermediate nodes are extracted by the mapping-based infobox extractor as unique URIs (e.g. the engine mapping example in Section 2.4).
– http://dbpedia.org/property/ (prefix dbp) for representing properties extracted by the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.
– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.

⁴ http://en.wikipedia.org/wiki/Berlin


Fig. 2. Depiction of the mapping from the Greek (left) and English (right) Wikipedia templates about books to the same DBpedia ontology class (middle) [24].


Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links⁵: for every page in a language other than English, the page was extracted only if it contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach omitted valuable data [24]. Thus, starting from the DBpedia 3.7 release⁶, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within these data sets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic, language-agnostic namespace http://dbpedia.org/resource/.

⁵ http://en.wikipedia.org/wiki/Help:Interlanguage_links
⁶ A list of all DBpedia releases is provided in Table 17.
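A minimal sketch of this URI scheme, assuming a simplified title normalisation rule (the helper name and rule are illustrative):

def resource_uri(title: str, lang: str = "en") -> str:
    """Mint a localized or canonical DBpedia resource URI from an article title."""
    name = title.strip().replace(" ", "_")            # Wikipedia-style title
    if lang == "en":
        return f"http://dbpedia.org/resource/{name}"      # generic namespace
    return f"http://{lang}.dbpedia.org/resource/{name}"   # localized namespace

print(resource_uri("Berlin"))          # http://dbpedia.org/resource/Berlin
print(resource_uri("Berlin", "de"))    # http://de.dbpedia.org/resource/Berlin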

2.6. NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [33]. Currently, four data sets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concepts. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concept extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The lexicalizations data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones, and also show how ambiguous a name is with respect to all possible concepts that it can mean.

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the tf-idf weight, with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the most strongly related word stems for each entity and build topic signatures [30].
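A minimal sketch of this tf-idf idea, with stemming omitted and an illustrative toy corpus (all names and data are assumptions for illustration):

import math
from collections import Counter

def topic_signature(docs: dict[str, list[str]], resource: str, k: int = 3):
    """Return the k words most strongly tf-idf-associated with a resource."""
    n_docs = len(docs)
    df = Counter()                       # document frequency per word
    for words in docs.values():
        df.update(set(words))
    tf = Counter(docs[resource])         # term frequency for this resource
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = {
    "Alkane": ["carbon", "alkanes", "atoms", "carbon", "chain"],
    "Berlin": ["city", "capital", "germany", "atoms"],
    "Python": ["language", "programming", "city"],
}
print(topic_signature(docs, "Alkane"))   # e.g. ['carbon', 'alkanes', 'chain']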

There are two more feature extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the centre of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (subject, object, possessive adjective, possessive pronoun and reflexive) – i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
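A minimal sketch of this pronoun-counting heuristic (the function name is illustrative; the declined forms are the ones listed above):

import re

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}

def guess_gender(article_text: str) -> str | None:
    """Assign a grammatical gender based on the dominating pronoun set."""
    words = re.findall(r"[a-z]+", article_text.lower())
    m = sum(w in MASCULINE for w in words)
    f = sum(w in FEMININE for w in words)
    if m == f:
        return None          # no dominating gender
    return "male" if m > f else "female"

print(guess_gender("She was born in 1900. Her work made her famous."))  # female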

2.7. Summary of Other Recent Developments

In this section we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework was rewritten in Scala in 2010⁷, resulting in an improvement of the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. This is crucial for the DBpedia Live project, which is explained in Section 4. The new, more modular framework also allows extracting data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki, as described earlier. Further significant changes include the aforementioned NLP extractors and the introduction of URI schemes.

⁷ Table 17 provides an overview of the project's evolution through time.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years the parsing of the MediaWiki markup improved considerably, leading to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal etc. in English, which indicate pages that do not contain encyclopedic content and would produce noise in the data. They are also important for specific extractors; for instance, the category hierarchy data set is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain-text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes now exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger etc.), and for a number of classes and properties specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the PHP framework, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links and hence increased the overall interconnectivity of resources in the DBpedia ontology.
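A minimal sketch of resolving redirect chains to their final target while catching cycles (the data and function names are illustrative):

def resolve_redirects(redirects: dict[str, str]) -> dict[str, str]:
    """Follow each redirect chain to its final article, dropping cycles."""
    resolved = {}
    for start in redirects:
        seen, node = {start}, redirects[start]
        while node in redirects and node not in seen:  # follow the chain
            seen.add(node)
            node = redirects[node]
        if node not in seen:               # chain ended at a real article
            resolved[start] = node
        # if node is in seen, the chain is cyclic and is not resolved
    return resolved

redirects = {"Artificial_Languages": "Constructed_Language",
             "Constructed_Language": "Constructed_language"}
print(resolve_redirects(redirects))
# both keys resolve to 'Constructed_language'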


Fig. 3. Snapshot of a part of the DBpedia ontology. [The figure shows the rdfs:subClassOf hierarchy among the top ten classes – owl:Thing; dbo:Agent with dbo:Person, dbo:Athlete and dbo:Organisation; dbo:Place with dbo:PopulatedPlace and dbo:Settlement; dbo:Species with dbo:Eukaryote; dbo:Work – together with selected object properties (e.g. dbo:birthPlace, dbo:producer, dbo:writer, dbo:family, dbo:location, dbo:canton, dbo:subsequentWork) and datatype properties with their ranges (e.g. dbo:birthDate, dbo:deathDate and dbo:releaseDate as xsd:date; dbo:populationTotal as xsd:integer; dbo:areaTotal, dbo:elevation and dbo:runtime as xsd:double; dbo:utcOffset, dbo:areaCode and dbo:conservationStatus as xsd:string).]

Finally, a new heuristic to increase the connectedness of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.
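A minimal sketch of this anchor-text heuristic, assuming simplified representations of the parsed infobox and of the page's hyperlinks (names and data are illustrative):

def link_infobox_strings(infobox: dict[str, str],
                         hyperlinks: dict[str, str]) -> dict[str, str]:
    """Replace plain string values with link targets of matching anchor texts."""
    resolved = {}
    for prop, value in infobox.items():
        # hyperlinks maps anchor text -> target article on the same page
        resolved[prop] = hyperlinks.get(value, value)
    return resolved

infobox = {"manufacturer": "Ford Advanced Vehicles", "engine": "4181cc"}
hyperlinks = {"Ford Advanced Vehicles": "Ford_Advanced_Vehicles"}
print(link_infobox_strings(infobox, hyperlinks))
# {'manufacturer': 'Ford_Advanced_Vehicles', 'engine': '4181cc'}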

Orthogonal to the previously mentioned improvements, there have been various efforts to assess the quality of the DBpedia data sets. [26] developed a framework for estimating the quality of DBpedia, and a sample of 75 resources was analysed. A more comprehensive effort was performed in [48] by providing a distributed, web-based interface [25] for quality assessment. In this study, 17 data quality problem types were analysed by 58 users covering 521 resources in DBpedia.

3. DBpedia Ontology

The DBpedia ontology consists of 320 classes, which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances. The complete DBpedia ontology can be browsed online at http://mappings.dbpedia.org/server/ontology/classes/.

Fig. 4. Growth of the DBpedia ontology. [The figure plots the number of ontology elements (classes and properties, 0 to 2,000) over the DBpedia versions 3.2 to 3.8.]

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing much, due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1. Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

Fig. 5. Mapping coverage statistics for all mapping-enabled languages.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.

The mapping activity of the ontology enrichment process, along with the editing activity of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarters of that year show a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have founded their own chapters with their own release dates; thus, recent mapping activity shows less fluctuation.

Fig. 7. English property mappings occurrence frequency (both axes are in log scale).

Finally, Figure 7 shows the occurrence frequency of the English property mappings. Both axes are in log scale and represent the number of property mappings (x-axis) that have exactly y occurrences (y-axis). The occurrence frequency follows a long-tail distribution: a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2. Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = localized data sets (see Section 2.5); CD = canonicalized data sets (see Section 2.5); all = overall number of instances in the data set, including instances without infobox data; with MD = number of instances for which mapping-based infobox data exists; Raw Prop. = number of different properties that are generated by the raw infobox extractor; Map. Prop. = number of different properties that are generated by the mapping-based infobox extractor; Raw Statem. = number of statements (facts) that are generated by the raw infobox extractor; Map. Statem. = number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language corresponds to instances in the English version of DBpedia, and which portion of the instances is described by clean, mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been created in the Mappings Wiki so far for a specific language.

Table 3 reports the number of instances for a set of popular classes from the third and fourth hierarchy levels of the ontology within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the preceding non-indented superclasses. The zero values in the table indicate that no infoboxes have been mapped to that ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages, but not in three or more languages, etc. For example, 12,936 persons are described in five languages, but not in six or more languages. The number 871,630 for the class Person means that all 20 language versions together describe 871,630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763,643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195,263 persons are described in two or more languages. The large difference of this number compared to the total number of 871,630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).


Fig. 6. Mapping community activity for (a) the ontology and (b) the 10 most active language editions.

Table 2. Basic statistics about localized DBpedia editions.

Lang | Inst. LD all | Inst. CD all | Inst. with MD CD | Raw Prop. CD | Map. Prop. CD | Raw Statem. CD | Map. Statem. CD
en | 3769926 | 3769926 | 2359521 | 48293 | 1313 | 65143840 | 33742015
de | 1243771 | 650037 | 204335 | 9593 | 261 | 7603562 | 2880381
fr | 1197334 | 740044 | 214953 | 13551 | 228 | 8854322 | 2901809
it | 882127 | 580620 | 383643 | 9716 | 181 | 12227870 | 4804731
es | 879091 | 542524 | 310348 | 14643 | 476 | 7740458 | 4383206
pl | 848298 | 538641 | 344875 | 7306 | 266 | 7696193 | 4511794
ru | 822681 | 439605 | 123011 | 13522 | 76 | 6973305 | 1389473
pt | 699446 | 460258 | 272660 | 12851 | 602 | 6255151 | 4005527
ca | 367362 | 241534 | 112934 | 8696 | 183 | 3689870 | 1301868
cs | 225133 | 148819 | 34893 | 5564 | 334 | 1857230 | 474459
hu | 209180 | 138998 | 63441 | 6821 | 295 | 2506399 | 601037
ko | 196132 | 124591 | 30962 | 7095 | 419 | 1035606 | 417605
tr | 187850 | 106644 | 40438 | 7512 | 440 | 1350679 | 556943
ar | 165722 | 103059 | 16236 | 7898 | 268 | 635058 | 168686
eu | 132877 | 108713 | 41401 | 2245 | 19 | 2255897 | 532709
sl | 129834 | 73099 | 22036 | 4235 | 470 | 1213801 | 222447
bg | 125762 | 87679 | 38825 | 3984 | 274 | 774443 | 488678
hr | 109890 | 71469 | 10343 | 3334 | 158 | 701182 | 151196
el | 71936 | 48260 | 10813 | 2866 | 288 | 206460 | 113838


3.3. Internationalisation Community

The introduction of the mapping-based infobox extractor, alongside the live synchronisation approaches in [20], allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages⁸. The DBpedia 3.7 release⁹ in September 2011 was the first DBpedia release to use the localized I18n (internationalisation) DBpedia extraction framework [24].

⁸ Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr), Urdu (ur)
⁹ http://blog.dbpedia.org/2011/09/11/


Table 3. Number of instances per class within 10 localized DBpedia versions (indented classes are subclasses of the preceding superclass).

Class | en | it | pl | es | pt | fr | de | ru | ca | hu
Person | 763643 | 145060 | 70708 | 65337 | 43057 | 62942 | 33122 | 18620 | 7107 | 15529
  Athlete | 185126 | 47187 | 30332 | 19482 | 14130 | 21646 | 31237 | 0 | 721 | 4527
  Artist | 61073 | 12511 | 16120 | 25992 | 10571 | 13465 | 0 | 0 | 2004 | 3821
  Politician | 23096 | 0 | 8943 | 5513 | 3342 | 0 | 0 | 12004 | 1376 | 760
Place | 572728 | 141101 | 182727 | 132961 | 116660 | 80602 | 131766 | 67932 | 73078 | 18324
  PopulPlace | 387166 | 138077 | 167034 | 121204 | 109418 | 72252 | 79410 | 63826 | 72743 | 15535
  Building | 60514 | 1270 | 2946 | 3570 | 803 | 921 | 83 | 43 | 0 | 527
  River | 24267 | 0 | 1383 | 2 | 4149 | 3333 | 6707 | 3924 | 0 | 565
Organisation | 192832 | 4142 | 12193 | 11710 | 10949 | 17513 | 16973 | 1598 | 1399 | 3993
  Company | 44516 | 4142 | 2566 | 975 | 1903 | 5832 | 7200 | 0 | 440 | 618
  EducInst | 42270 | 0 | 599 | 1207 | 270 | 1636 | 2171 | 1010 | 0 | 115
  Band | 27061 | 0 | 2993 | 0 | 4476 | 3868 | 5368 | 0 | 263 | 802
Work | 333269 | 51918 | 32386 | 36484 | 33869 | 39195 | 18195 | 34363 | 5240 | 9777
  MusicWork | 159070 | 23652 | 14987 | 15911 | 17053 | 16206 | 0 | 6588 | 697 | 4704
  Film | 71715 | 17210 | 9467 | 9396 | 8896 | 9741 | 15038 | 12045 | 3859 | 2320
  Software | 27947 | 5682 | 3050 | 4833 | 3419 | 5733 | 2368 | 0 | 606 | 857

Table 4. Cross-language overlap: number of instances that are described in multiple languages.

Class | Instances | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+
Person | 871630 | 676367 | 94339 | 42382 | 21647 | 12936 | 8198 | 5295 | 3437 | 2391 | 4638
Place | 643260 | 307729 | 150349 | 45836 | 36339 | 20831 | 13523 | 20808 | 31422 | 11262 | 5161
Organisation | 206670 | 160398 | 22661 | 9312 | 5002 | 3221 | 2072 | 1421 | 928 | 594 | 1061
Work | 360808 | 243706 | 54855 | 23097 | 12605 | 8277 | 5732 | 4007 | 2911 | 1995 | 3623

At the time of writing, DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish¹⁰. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services.

While at the moment chapters are defined by ownership of the IP and server of the subdomain A record (e.g. http://ko.dbpedia.org) given by DBpedia maintainers, the DBpedia internationalisation committee¹¹ is manifesting its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek¹² and Dutch¹³), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].

¹⁰ Accessed on 25/09/2013: http://wiki.dbpedia.org/Internationalization/Chapters
¹¹ http://wiki.dbpedia.org/Internationalization
¹² http://el.dbpedia.org
¹³ http://nl.dbpedia.org

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage.

The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is to achieving total coverage (as shown in Figure 5), and line charts showing the progress over time, highlighting when one language overtakes another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4. Live Synchronisation

Wikipedia articles are continuously revised at a very high rate: e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute¹⁴. This high change frequency leads to DBpedia data quickly becoming outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [20,36]. It allows extracted data to stay up-to-date, with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported¹⁵. Meanwhile, DBpedia Live for Dutch¹⁶ was developed.

¹⁴ http://stats.wikimedia.org/EN/SummaryEN.htm
¹⁵ http://live.dbpedia.org
¹⁶ http://live.nl.dbpedia.org

4.1. DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [27]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework and to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is depicted in Figure 8. The major components of the system are as follows:


Fig. 8. Overview of the DBpedia Live extraction framework.

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.
– Mappings Wiki: The DBpedia Mappings Wiki, described in Section 2.4, serves as secondary input source. Changes of the Mappings Wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.
– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.
– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2. Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.
– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separation of data from different sources.


Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages: Extraction framework improvements or the activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed, and ensures that all articles reflect a recent state of the output of the extraction framework.
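A minimal sketch of such a prioritised re-processing queue; the priority values and names are illustrative, not DBpedia Live's actual internals:

import heapq
import time

LIVE_UPDATE, MAPPING_CHANGE, UNMODIFIED = 0, 1, 2  # lower value = higher priority

queue: list[tuple[int, float, str]] = []

def enqueue(page: str, priority: int) -> None:
    # the timestamp keeps FIFO order among equal priorities
    heapq.heappush(queue, (priority, time.time(), page))

enqueue("Berlin", LIVE_UPDATE)
enqueue("Autism", UNMODIFIED)        # not processed for 30 days
enqueue("Ford_GT40", MAPPING_CHANGE)
while queue:
    _, _, page = heapq.heappop(queue)
    print("reprocess", page)         # Berlin, Ford_GT40, Autism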

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and a set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress them and integrate them.

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
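A minimal sketch of such a synchronisation step, assuming gzip-compressed added/removed N-Triples changeset files and a SPARQL 1.1 Update endpoint; the file names and endpoint URL are illustrative assumptions:

import gzip
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:8890/sparql"  # assumed local mirror update endpoint

def apply_changeset(added_path: str, removed_path: str) -> None:
    """Apply one changeset: delete removed triples, then insert added ones."""
    with gzip.open(removed_path, "rt", encoding="utf-8") as f:
        removed = f.read()
    with gzip.open(added_path, "rt", encoding="utf-8") as f:
        added = f.read()
    for update in (f"DELETE DATA {{ {removed} }}",
                   f"INSERT DATA {{ {added} }}"):
        data = urllib.parse.urlencode({"update": update}).encode()
        urllib.request.urlopen(ENDPOINT, data=data)  # SPARQL 1.1 protocol POST

apply_changeset("000001.added.nt.gz", "000001.removed.nt.gz")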

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Future versions of DBpedia Live will also separate data from the raw infobox extraction and the mapping-based infobox extraction.

5. Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as of the external data sets that set links pointing to DBpedia resources.

5.1. Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository¹⁷ for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links: the value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the data set is copied from the previous releases and not regenerated.

An example for the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia)¹⁸:

SELECT * {
  {
    SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
    } GROUP BY ?ftsyear ?ftscountry
  }
  {
    SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
    }
  }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

¹⁷ https://github.com/dbpedia/dbpedia-links
¹⁸ Endpoint: http://fts.publicdata.eu/sparql; results: http://bit.ly/1c2mIwQ


Table 5. Data sets linked from DBpedia.

Data set | Predicate | Count | Tool
Amsterdam Museum | owl:sameAs | 627 | S
BBC Wildlife Finder | owl:sameAs | 444 | S
Book Mashup | rdf:type, owl:sameAs | 9 100 |
Bricklink | dc:publisher | 10 100 |
CORDIS | owl:sameAs | 314 | S
Dailymed | owl:sameAs | 894 | S
DBLP Bibliography | owl:sameAs | 196 | S
DBTune | owl:sameAs | 838 | S
Diseasome | owl:sameAs | 2 300 | S
Drugbank | owl:sameAs | 4 800 | S
EUNIS | owl:sameAs | 3 100 | S
Eurostat (Linked Stats) | owl:sameAs | 253 | S
Eurostat (WBSG) | owl:sameAs | 137 |
CIA World Factbook | owl:sameAs | 545 | S
flickr wrappr | dbp:hasPhotoCollection | 3 800 000 | C
Freebase | owl:sameAs | 3 600 000 | C
GADM | owl:sameAs | 1 900 |
GeoNames | owl:sameAs | 86 500 | S
GeoSpecies | owl:sameAs | 16 000 | S
GHO | owl:sameAs | 196 | L
Project Gutenberg | owl:sameAs | 2 500 | S
Italian Public Schools | owl:sameAs | 5 800 | S
LinkedGeoData | owl:sameAs | 103 600 | S
LinkedMDB | owl:sameAs | 13 800 | S
MusicBrainz | owl:sameAs | 23 000 |
New York Times | owl:sameAs | 9 700 |
OpenCyc | owl:sameAs | 27 100 | C
OpenEI (Open Energy) | owl:sameAs | 678 | S
Revyu | owl:sameAs | 6 |
Sider | owl:sameAs | 2 000 | S
TCMGeneDIT | owl:sameAs | 904 |
UMBEL | rdf:type | 896 400 |
US Census | owl:sameAs | 12 600 |
WikiCompany | owl:sameAs | 8 300 |
WordNet | dbp:wordnet_type | 467 100 |
YAGO2 | rdf:type | 18 100 000 |
Sum | | 27 211 732 |

In addition to providing outgoing links on an instance level, DBpedia also sets links on the schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.

5.2. Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub¹⁹. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for the analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia.
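A minimal sketch of this link notion, following the second-level-domain definition of a data set given above (function names are illustrative):

from urllib.parse import urlparse

def dataset(uri: str) -> str:
    """Identify a data set by the second-level domain of an entity URI."""
    host = urlparse(uri).hostname or ""
    return ".".join(host.split(".")[-2:])   # e.g. 'dbpedia.org'

def is_link(subject_uri: str, object_uri: str) -> bool:
    """A triple is a link if subject and object belong to different data sets."""
    return dataset(subject_uri) != dataset(object_uri)

print(is_link("http://data.nytimes.com/x",
              "http://dbpedia.org/resource/Berlin"))   # True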

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub²⁰. However, it crawls for RDFa snippets, converts microformats etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different data sets can still give us insights. Therefore, we analysed the link structure of all Sindice data sets using the Sindice cluster. Table 8 shows the data sets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

¹⁹ See http://wiki.dbpedia.org/Interlinking for details.
²⁰ http://datahub.io


Table 6. Top 10 data sets in Sindice ordered by the number of links to DBpedia.

Data set | Link Predicate Count | Link Count
okaboo.com | 4 | 2407121
tfri.gov.tw | 57 | 837459
naplesplus.us | 2 | 212460
fu-berlin.de | 7 | 145322
freebase.com | 108 | 138619
geonames.org | 3 | 125734
opencyc.org | 3 | 19648
geospecies.org | 10 | 16188
dbrec.net | 3 | 12856
faviki.com | 5 | 12768

Table 7. Sindice summary statistics for incoming links to DBpedia.

Metric | Value
Total links | 3960212
Total distinct data sets | 248
Total distinct predicates | 345

Table 8. Top 10 data sets by incoming links in Sindice.

Domain | Datasets | Links
purl.org | 498 | 6717520
dbpedia.org | 248 | 3960212
creativecommons.org | 2483 | 3030910
identi.ca | 1021 | 2359276
l3s.de | 34 | 1261487
rkbexplorer.com | 24 | 1212416
nytimes.com | 27 | 1174941
w3.org | 405 | 658535
geospecies.org | 13 | 523709
livejournal.com | 14881 | 366025


6. DBpedia Usage Statistics

DBpedia is served on the web in three forms. First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia. [The figure covers the quarters 2012-Q1 to 2012-Q4.]

6.1. Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity. The top five languages with respect to download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition are indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month²¹.

Pagelinks are the most downloaded data set, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.

6.2. Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint²² is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

²¹ The IP address was only filtered for that specific file and month in those cases.
²² http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume (in GB) of the DBpedia data sets.

Table 9. Hardware of the machines serving the public SPARQL endpoint.

DBpedia | Configuration
3.3 – 3.4 | AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 – 3.7 | Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8 | Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB


The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code²³, like 509 ("Bandwidth Limit Exceeded"), and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

²³ http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

Table 11. Number of unique sites accessing DBpedia endpoints.

DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 5824 | 5977 | 1046 | 7865
3.4 | 6656 | 6704 | 947 | 8312
3.5 | 9392 | 9432 | 1643 | 12584
3.6 | 10810 | 10810 | 1612 | 14798
3.7 | 17077 | 17091 | 6782 | 33288
3.8 | 14513 | 14493 | 2783 | 20510


6.3. Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21²⁴. Tables 10 and 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by the Median and Standard Deviation. The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.

²⁴ http://www.webalizer.org


Table 10. Number of endpoint hits (left) and visits (right).

Hits:
DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 733811 | 711306 | 188991 | 1319539
3.4 | 1212549 | 1165893 | 351226 | 2371657
3.5 | 1122612 | 1035444 | 386454 | 2908720
3.6 | 1328355 | 1286750 | 256945 | 2495031
3.7 | 2085399 | 1930728 | 1057398 | 8873473
3.8 | 2910410 | 2717775 | 1085640 | 7678490

Visits:
DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 9750 | 9775 | 2036 | 13126
3.4 | 11329 | 11473 | 1546 | 14198
3.5 | 16592 | 16902 | 2969 | 23129
3.6 | 19471 | 17664 | 5691 | 56064
3.7 | 23972 | 22262 | 10741 | 127869
3.8 | 16851 | 16711 | 2960 | 27416


Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4. Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x that redirects the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs pointing to the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.

Table 12. Hits per service to http://dbpedia.org, in thousands.

Endpoint | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
data | 1355 | 2681 | 2230 | 2547 | 3714 | 4246
ontology | 80 | 168 | 142 | 178 | 116 | 97
page | 2634 | 4703 | 1810 | 1687 | 3787 | 7377
property | 231 | 311 | 137 | 176 | 176 | 128
resource | 2976 | 4080 | 2554 | 2309 | 4436 | 7143
sparql | 2090 | 4177 | 2266 | 9112 | 15902 | 15475
other | 252 | 420 | 342 | 277 | 579 | 695
total | 9619 | 16541 | 9434 | 16286 | 28710 | 35142

Fig. 11. Traffic: Linked Data versus SPARQL endpoint.


Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focus on the calls to the sparql endpoint and count the number of statements per type.


Table 13. Hits per statement type, in thousands.

Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
ask | 56 | 269 | 360 | 326 | 159 | 677
construct | 55 | 31 | 14 | 11 | 22 | 38
describe | 11 | 8 | 4 | 7 | 62 | 111
select | 1891 | 3663 | 1658 | 8030 | 11204 | 13516
unknown | 78 | 206 | 229 | 738 | 4455 | 1134
total | 2090 | 4177 | 2266 | 9112 | 15902 | 15475

Table 14. Trends in SPARQL select (rounded values in %).

Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
distinct | 19.5 | 11.4 | 17.3 | 19.4 | 13.3 | 25.4
filter | 45.7 | 13.7 | 31.8 | 25.3 | 29.2 | 36.1
functions | 8.8 | 6.9 | 23.5 | 21.3 | 25.5 | 25.9
geo | 27.7 | 7.0 | 39.6 | 6.7 | 9.3 | 9.2
group | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2
limit | 4.6 | 6.5 | 11.6 | 10.5 | 7.7 | 7.8
optional | 30.6 | 23.8 | 47.3 | 26.3 | 16.7 | 17.2
order | 2.3 | 1.3 | 1.9 | 3.2 | 1.2 | 1.6
union | 3.3 | 3.2 | 26.4 | 11.3 | 12.1 | 20.6

As the log files only record the full SPARQL query on a GET request, all POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.

6.5. Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12. Number of daily requests sent to DBpedia Live for (a) SPARQL queries and (b) synchronisation requests, from August 2012 until January 2013

7. Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, data proliferation for applications and mashups, as well as text analysis and annotation tools.

7.1. Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have

25 http://wiki.dbpedia.org/Datasets/NLP


been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.
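As a small sketch of how such a data set can be consumed, the following query retrieves persons together with their recorded grammatical gender. It assumes the data set is loaded into the queried endpoint and that the gender is exposed via foaf:gender; the property choice is an assumption made for illustration, not a guarantee of this article:

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Sketch: look up the grammatical gender recorded for persons.
# Assumes the grammatical gender data set is loaded and exposed
# via foaf:gender (an assumption for illustration purposes).
SELECT ?person ?gender WHERE {
  ?person a dbo:Person ;
          foaf:gender ?gender .
}
LIMIT 100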

7.1.1. Annotation – Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking – refer to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool26 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
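For example, a minimal sketch of such a restriction, with an invented target set (politicians born in Germany), could be expressed as:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Sketch: describe which resources an annotation tool should
# consider as targets -- here, politicians born in Germany
# (an invented restriction for illustration).
SELECT ?target WHERE {
  ?target a dbo:Politician ;
          dbo:birthPlace dbr:Germany .
}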

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, the Semantic API

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com

from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.

A related project is ImageSnippets32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation: Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2. Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such

28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org
32 http://www.imagesnippets.com
33 http://www.faviki.com
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


systems. Due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.
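As an illustration of the task (not an official QALD benchmark item), a question such as "Which river does the Brooklyn Bridge cross?" corresponds to a simple query over the DBpedia ontology:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# "Which river does the Brooklyn Bridge cross?"
SELECT ?river WHERE {
  dbr:Brooklyn_Bridge dbo:crosses ?river .
}

The scientific challenge for QA systems lies in producing such a query automatically from the natural language question.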

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2. Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: the broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project37. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia38. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html
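A sketch of how such connections could be retrieved, assuming the harvested authority links are exposed as owl:sameAs triples pointing to viaf.org URIs (the exact modelling is an assumption here):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Sketch: find DBpedia persons with harvested VIAF authority links.
# Assumes the links are modelled as owl:sameAs triples to viaf.org.
SELECT ?person ?viaf WHERE {
  ?person a dbo:Person ;
          owl:sameAs ?viaf .
  FILTER (STRSTARTS(STR(?viaf), "http://viaf.org/"))
}
LIMIT 100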

7.3. Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

Facet Based Browsers: An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying: The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows users to explore the neighborhood of and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows users to create statistical analysis queries and correlations in RDF data and DBpedia in particular.
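A minimal sketch of the kind of triple-pattern query such tools help users assemble (the concrete patterns are invented for illustration):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Sketch: a simple set of triple patterns as produced by the
# Query Builder -- soccer players born in Leipzig (invented example).
SELECT ?player WHERE {
  ?player a dbo:SoccerPlayer ;
          dbo:birthPlace dbr:Leipzig .
}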

Spatial Applications: DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.
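A minimal sketch of such a spatial filter, assuming coordinates are exposed via the W3C wgs84_pos vocabulary (geo prefix, see Table 16) and using an invented bounding box:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

# Sketch: restrict resources to a bounding box around a location
# (coordinate values are invented for illustration).
SELECT ?place ?lat ?long WHERE {
  ?place geo:lat ?lat ;
         geo:long ?long .
  FILTER (?lat > 52.3 && ?lat < 52.7 &&
          ?long > 13.1 && ?long < 13.7)
}
LIMIT 100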

7.4. Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources,

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany
40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org


available in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words covering several domains43. Such descriptions provide for a lexical word a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data44. Statistics are shown in Table 15.

8. Related Work

8.1. Cross Domain Community Knowledge Bases

8.1.1. Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata45. Wikidata is a free knowledge base about the world that can be read and

42 http://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page
44 http://wiktionary.dbpedia.org
45 http://wikidata.org

edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2. Freebase

Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47 http://www.freebase.com


Table 15. Statistical comparison of extractions for different languages

language   words      triples      resources    predicates   senses
en         2,142,237  28,593,364   11,804,039   28           424,386
fr         4,657,817  35,032,121   20,462,349   22           592,351
ru         1,080,156  12,813,437    5,994,560   17           149,859
de           701,739   5,618,508    2,966,867   16           122,362

and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked and the system can easily revert to a previous version. This is necessary, since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3. YAGO

One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its

48 http://www.mpi-inf.mpg.de/yago-naga/yago

content. YAGO focuses on extracting a smaller number of relations compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2. Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 that are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source, Java-based API that allows to access information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also

49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction


provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38] the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a web-scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9. Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that in the past four years DBpedia matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion: An area which is still largely unexplored is the integration and

56 http://www.cs.washington.edu/research/knowitall


fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement: In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction: Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia: While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia: A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed in certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
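A minimal sketch of such a sanity check, using the dbo:birthDate and dbo:deathDate properties of the DBpedia ontology:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Sketch of a sanity check: find persons whose recorded death date
# precedes their birth date -- candidates for feedback to Wikipedians.
SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}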

Integrate DBpedia and NLP: Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information, such as co-occurrence, for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint

– Kingsley Idehen for SPARQL and Linked Data hosting and community support

– Christopher Sahnwaldt for DBpedia development and release management

– Claus Stadler for DBpedia development


– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:

* Instance Data Analysis: Volha Bryl (working at University of Mannheim)
* Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
* Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16. List of namespace prefixes

Prefix   Namespace
dbo      http://dbpedia.org/ontology/
dbp      http://dbpedia.org/property/
dbr      http://dbpedia.org/resource/
dbr-de   http://de.dbpedia.org/resource/
dc       http://purl.org/dc/elements/1.1/
foaf     http://xmlns.com/foaf/0.1/
geo      http://www.w3.org/2003/01/geo/wgs84_pos#
georss   http://www.georss.org/georss/
ls       http://spotlight.dbpedia.org/scores/
lx       http://spotlight.dbpedia.org/lexicalizations/
owl      http://www.w3.org/2002/07/owl#
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
skos     http://www.w3.org/2004/02/skos/core#
sptl     http://spotlight.dbpedia.org/vocab/
xsd      http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17. DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The semantic web: research and applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernandez, E. Motta and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10, 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer and J. Lehmann. User-driven Quality Evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.




 | ontologyProperty = foaf:name }}
{{PropertyMapping
 | templateProperty = manufacturer
 | ontologyProperty = manufacturer }}
{{DateIntervalMapping
 | templateProperty = production
 | startDateOntologyProperty = productionStartDate
 | endDateOntologyProperty = productionEndDate }}
{{IntermediateNodeMapping
 | nodeClass = AutomobileEngine
 | correspondingProperty = engine
 | mappings =
   {{PropertyMapping
    | templateProperty = engine
    | ontologyProperty = displacement
    | unit = Volume }}
   {{PropertyMapping
    | templateProperty = engine
    | ontologyProperty = powerOutput
    | unit = Power }}
}}
(...)

The RDF statements that are extracted from the previous infobox example are shown below. As we can see, the production period is correctly split into a start year and an end year, and the engine is represented by a distinct RDF node. It is worth mentioning that all values are canonicalized to basic units. For example, in the engine mapping we state that displacement is a Volume, and thus the extractor converts "4181cc" (cubic centimeters) to cubic meters ("0.004181"). Additionally, there can exist multiple mappings on the same property that search for different datatypes or different units, for example a number with "PS" as a suffix for the engine property.

dbr:Ford_GT40
    rdf:type dbo:Automobile ;
    rdfs:label "Ford GT40"@en ;
    dbo:manufacturer dbr:Ford_Advanced_Vehicles ;
    dbo:productionStartYear "1964"^^xsd:gYear ;
    dbo:productionEndYear "1969"^^xsd:gYear ;
    dbo:engine [
        rdf:type dbo:AutomobileEngine ;
        dbo:displacement 0.004181
    ] .
(...)

The DBpedia Mappings Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 2 shows how the infobox properties author and συγγραφέας – author in Greek – are both being mapped to the global identifier dbo:author. That means, in turn, that information from all language versions of DBpedia can be merged, and DBpedias for smaller languages can be augmented with knowledge from larger DBpedias, such as the English edition. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns, which is often only present in the corresponding language edition [43].

Besides hosting the mappings and the DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

– Mapping Syntax Validator: The mapping syntax validator checks for syntactic correctness and highlights inconsistencies such as missing property definitions.

– Extraction Tester: The extraction tester linked on each mapping page tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.

– Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users to create and edit mappings.

2.5. URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

– http://dbpedia.org/resource/ (prefix dbr) for representing article data. There is a one-to-one mapping between a Wikipedia page and a DBpedia resource, based on the article title. For example, for the Wikipedia article on Berlin4, DBpedia will produce the URI dbr:Berlin. Exceptions to this rule appear when intermediate nodes are extracted from the mapping-based infobox extractor as unique URIs (e.g. the engine mapping example in Section 2.4).

– http://dbpedia.org/property/ (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.

4 http://en.wikipedia.org/wiki/Berlin


Fig. 2. Depiction of the mapping from the Greek (left) and English Wikipedia templates (right) about books to the same DBpedia ontology class (middle) [24]

– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links5. For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach omitted valuable data [24]. Thus, starting from the DBpedia 3.7 release6, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within the data sets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the

5 http://en.wikipedia.org/wiki/Help:Interlanguage_links
6 A list of all DBpedia releases is provided in Table 17

English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic, language-agnostic namespace http://dbpedia.org/resource/.

2.6. NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [33]. Currently, four data sets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concepts. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concepts extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The lexicalizations data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones, and also show how ambiguous a name is with respect to all possible concepts that it can mean.

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a


Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the tf-idf weight, with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the most strongly related word stems for each entity and build topic signatures [30].
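For reference, the textbook formulation of the tf-idf weight (the concrete variant used by the extractor is not specified in this article) scores a word stem $w$ for a resource $r$ as

$\mathrm{tfidf}(w, r) = \mathrm{tf}(w, r) \cdot \log \frac{N}{\mathrm{df}(w)}$

where $\mathrm{tf}(w, r)$ is the frequency of $w$ in the text associated with $r$, $\mathrm{df}(w)$ is the number of resources whose text contains $w$, and $N$ is the total number of resources.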

There are two more feature extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the centre of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (subject, object, possessive adjective, possessive pronoun and reflexive), i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.

2.7. Summary of Other Recent Developments

In this section, we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework has been rewritten in Scala in 20107, resulting in an improvement of the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. This is crucial for the DBpedia Live project, which will be explained in Section 4. The new, more modular framework also allows to extract data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the

7 Table 17 provides an overview of the project's evolution through time

creation and utilization of the DBpedia Mappings Wiki, as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal, etc. in English, which indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.), and for a number of classes and properties specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the PHP framework, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links, and hence increased the overall interconnectivity of resources in the DBpedia ontology.


Fig. 3. Snapshot of a part of the DBpedia ontology: the top classes (owl:Thing; dbo:Place with subclasses dbo:PopulatedPlace and dbo:Settlement; dbo:Agent with subclasses dbo:Person, dbo:Athlete and dbo:Organisation; dbo:Species with subclass dbo:Eukaryote; dbo:Work), connected by object properties such as dbo:birthPlace, dbo:producer, dbo:writer, dbo:family, dbo:location, dbo:canton and dbo:subsequentWork, and datatype properties such as dbo:populationTotal (xsd:integer), dbo:areaTotal, dbo:elevation and dbo:runtime (xsd:double), dbo:birthDate, dbo:deathDate and dbo:releaseDate (xsd:date), and dbo:utcOffset, dbo:areaCode and dbo:conservationStatus (xsd:string)

Finally, a new heuristic to increase the connectiveness of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.

Orthogonal to the previously mentioned improvements, there have been various efforts to assess the quality of the DBpedia datasets. [26] developed a framework for estimating the quality of DBpedia, and a sample of 75 resources were analysed. A more comprehensive effort was performed in [48] by providing a distributed, web-based interface [25] for quality assessment. In this study, 17 data quality problem types were analysed by 58 users covering 521 resources in DBpedia.

3. DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances. The complete DBpedia ontology can be browsed online at http://mappings.dbpedia.org/server/ontology/classes/.
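As a small illustration (assuming the ontology triples are loaded at the queried endpoint), the class hierarchy itself can be inspected via SPARQL:

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# List direct subclasses of dbo:Place in the DBpedia ontology.
SELECT ?subclass WHERE {
  ?subclass rdfs:subClassOf dbo:Place .
}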

Fig. 4. Growth of the DBpedia ontology: number of ontology elements (classes and properties, y-axis from 0 to 2000) across DBpedia versions 3.2 to 3.8 (x-axis)

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much, due to the already good coverage of the initial version of the ontology, the number of properties increases over time, due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1. Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows


Fig. 5. Mapping coverage statistics for all mapping-enabled languages

statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates.

Lehmann et al DBpedia 11

Fig 7 English property mappings occurrence frequency (both axesare in log scale)

For instance the DBpedia 37 version was released onSeptember 2011 and the 2nd and 3rd quarter of thatyear have a very high activity compared to the 4th quar-ter In the last two years (2012 and 2013) most of theDBpedia mapping language communities have definedtheir own chapters and have their own release datesThus recent mapping activity shows less fluctuation

Finally, Figure 7 shows the English property mapping occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long-tail distribution: a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2. Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean, mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have so far been created in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of popular classes from the third and fourth hierarchy levels of the ontology within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to the specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12,936 persons are described in five languages but not in six or more languages. The number 871,630 for the class Person means that all 20 language versions together describe 871,630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763,643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195,263 persons are described in two or more languages. The large difference of this number compared to the total number of 871,630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).

Fig. 6. Mapping community activity for (a) ontology and (b) 10 most active language editions.

Table 2. Basic statistics about localized DBpedia editions.

lang  Inst. LD all  Inst. CD all  Inst. with MD CD  Raw Prop. CD  Map Prop. CD  Raw Statem. CD  Map Statem. CD
en    3769926       3769926       2359521           48293         1313          65143840        33742015
de    1243771       650037        204335            9593          261           7603562         2880381
fr    1197334       740044        214953            13551         228           8854322         2901809
it    882127        580620        383643            9716          181           12227870        4804731
es    879091        542524        310348            14643         476           7740458         4383206
pl    848298        538641        344875            7306          266           7696193         4511794
ru    822681        439605        123011            13522         76            6973305         1389473
pt    699446        460258        272660            12851         602           6255151         4005527
ca    367362        241534        112934            8696          183           3689870         1301868
cs    225133        148819        34893             5564          334           1857230         474459
hu    209180        138998        63441             6821          295           2506399         601037
ko    196132        124591        30962             7095          419           1035606         417605
tr    187850        106644        40438             7512          440           1350679         556943
ar    165722        103059        16236             7898          268           635058          168686
eu    132877        108713        41401             2245          19            2255897         532709
sl    129834        73099         22036             4235          470           1213801         222447
bg    125762        87679         38825             3984          274           774443          488678
hr    109890        71469         10343             3334          158           701182          151196
el    71936         48260         10813             2866          288           206460          113838

3.3. Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [20] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages8. The DBpedia 3.7 release9 in September 2011 was the first DBpedia release to use the localized I18n (Internationalisation) DBpedia extraction framework [24].

8 Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr), Urdu (ur)

9 http://blog.dbpedia.org/2011/09/11/


Table 3. Number of instances per class within 10 localized DBpedia versions.

class          en      it      pl      es      pt      fr     de      ru     ca     hu
Person         763643  145060  70708   65337   43057   62942  33122   18620  7107   15529
  Athlete      185126  47187   30332   19482   14130   21646  31237   0      721    4527
  Artist       61073   12511   16120   25992   10571   13465  0       0      2004   3821
  Politician   23096   0       8943    5513    3342    0      0       12004  1376   760
Place          572728  141101  182727  132961  116660  80602  131766  67932  73078  18324
  PopulPlace   387166  138077  167034  121204  109418  72252  79410   63826  72743  15535
  Building     60514   1270    2946    3570    803     921    83      43     0      527
  River        24267   0       1383    2       4149    3333   6707    3924   0      565
Organisation   192832  4142    12193   11710   10949   17513  16973   1598   1399   3993
  Company      44516   4142    2566    975     1903    5832   7200    0      440    618
  EducInst     42270   0       599     1207    270     1636   2171    1010   0      115
  Band         27061   0       2993    0       4476    3868   5368    0      263    802
Work           333269  51918   32386   36484   33869   39195  18195   34363  5240   9777
  MusicWork    159070  23652   14987   15911   17053   16206  0       6588   697    4704
  Film         71715   17210   9467    9396    8896    9741   15038   12045  3859   2320
  Software     27947   5682    3050    4833    3419    5733   2368    0      606    857

Table 4. Cross-language overlap: number of instances that are described in multiple languages.

Class         Instances  1       2       3      4      5      6      7      8      9      10+
Person        871630     676367  94339   42382  21647  12936  8198   5295   3437   2391   4638
Place         643260     307729  150349  45836  36339  20831  13523  20808  31422  11262  5161
Organisation  206670     160398  22661   9312   5002   3221   2072   1421   928    594    1061
Work          360808     243706  54855   23097  12605  8277   5732   4007   2911   1995   3623

At the time of writing, DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish10. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services.

While at the moment chapters are defined by ownership of the IP address and server behind the subdomain's DNS A record (e.g. http://ko.dbpedia.org), which is given out by the DBpedia maintainers, the DBpedia internationalisation committee11 is formalising its structure: each language edition has a representative with a vote in elections. In some cases (e.g. Greek12 and Dutch13), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].

10 Accessed on 25/09/2013: http://wiki.dbpedia.org/Internationalization/Chapters
11 http://wiki.dbpedia.org/Internationalization
12 http://el.dbpedia.org
13 http://nl.dbpedia.org

In the weeks leading up to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. They can therefore be used by each language edition to prioritize the properties and templates with the highest impact on coverage.

The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is to achieving total coverage (as shown in Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading up to a release.

4. Live Synchronisation

Wikipedia articles are continuously revised at a very high rate: e.g. the English Wikipedia, in June 2013, had approximately 3.3 million edits per month, which is equal to 77 edits per minute14. This high change frequency leads to DBpedia data quickly becoming outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [20,36]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported15. Meanwhile, DBpedia Live for Dutch16 has been developed.

4.1. DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [27]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework and to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.
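The proxy's polling loop can be sketched as follows. This is an illustration only: the endpoint URL below is hypothetical (access to the real Wikimedia update stream is restricted and requires authorization), and OAI-PMH resumption tokens are omitted for brevity.

    import time
    import requests
    import xml.etree.ElementTree as ET

    OAI = "https://example.org/oai"   # hypothetical OAI-PMH endpoint
    NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

    def fetch_updates(since):
        """Issue one ListRecords request; return (records, newest datestamp)."""
        params = {"verb": "ListRecords", "metadataPrefix": "mediawiki",
                  "from": since}
        root = ET.fromstring(requests.get(OAI, params=params, timeout=30).content)
        records = root.findall(".//oai:record", NS)
        stamps = [r.findtext("oai:header/oai:datestamp", since, NS)
                  for r in records]
        return records, max(stamps, default=since)

    last_seen = "2013-01-01T00:00:00Z"
    while True:
        updates, last_seen = fetch_updates(last_seen)
        for record in updates:
            pass            # hand the revised page XML to the extraction framework
        time.sleep(5)       # poll again after a short pause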

The overall architecture of DBpedia Live is depicted in Figure 8. The major components of the system are as follows:

14 http://stats.wikimedia.org/EN/SummaryEN.htm
15 http://live.dbpedia.org
16 http://live.nl.dbpedia.org

Fig. 8. Overview of the DBpedia Live extraction framework.

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki, described in Section 2.4, serves as a secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure (a minimal sketch of this bookkeeping follows after this list).

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.
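The manager's two outputs (store update and changeset publication) can be sketched in a few lines of Python. This is illustrative, not the framework's actual code: `store` and `process_page` are hypothetical names, triples are simplified to plain N-Triples strings, and the extractors are assumed to have already produced the new triple set for a page.

    import gzip

    store = {}   # page title -> set of triples currently in the triple store

    def process_page(title, new_triples, changeset_path):
        """Diff against the stored triples and publish the two disjoint sets."""
        old = store.get(title, set())
        added, removed = new_triples - old, old - new_triples
        store[title] = new_triples        # mimic the backend store update
        # Publish added and deleted triples as compressed N-Triples changesets.
        with gzip.open(changeset_path + ".added.nt.gz", "wt") as f:
            f.writelines(t + "\n" for t in sorted(added))
        with gzip.open(changeset_path + ".removed.nt.gz", "wt") as f:
            f.writelines(t + "\n" for t in sorted(removed))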

4.2. Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

– Unmodified Pages: The update of unmodified pages at regular intervals.

– Changesets Publication: The publication of triple-changesets.

– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.

– Data Isolation: Separation of data from different sources.

Lehmann et al DBpedia 15

Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages: Extraction framework improvements, or the activation or deactivation of DBpedia extractors, might otherwise never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or mapping-affected pages feeds, and it ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and another set of deleted triples. We write those two sets into N-Triples files, compress them and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress them and integrate them.

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
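On the consumer side, a minimal synchronisation step could look as follows, assuming the changeset naming scheme from the earlier sketch and a local mirror whose endpoint accepts SPARQL 1.1 Update over HTTP POST; the URLs, graph name and parameter names are illustrative:

    import gzip
    import requests

    ENDPOINT = "http://localhost:8890/sparql"   # local mirror endpoint, assumed

    def apply_changeset(base_url):
        # Deletions are applied before additions, mirroring the changeset pairs.
        for suffix, verb in ((".removed.nt.gz", "DELETE DATA"),
                             (".added.nt.gz", "INSERT DATA")):
            raw = requests.get(base_url + suffix, timeout=60).content
            triples = gzip.decompress(raw).decode("utf-8")
            update = (verb + " { GRAPH <http://live.dbpedia.org> { "
                      + triples + " } }")
            requests.post(ENDPOINT, data={"update": update}, timeout=60)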

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds is contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) is kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Future versions of DBpedia Live will also separate data from the raw infobox extraction and the mapping-based infobox extraction.
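For illustration, the graph separation can be exercised directly against the live endpoint. The sketch below (assuming the endpoint and graph URIs above are reachable; availability is not guaranteed) restricts matching to the article-update graph:

    import requests

    query = """
    SELECT ?p ?o WHERE {
      GRAPH <http://live.dbpedia.org> {
        <http://dbpedia.org/resource/Berlin> ?p ?o .
      }
    } LIMIT 5
    """
    resp = requests.get("http://live.dbpedia.org/sparql",
                        params={"query": query,
                                "format": "application/sparql-results+json"},
                        timeout=30)
    for b in resp.json()["results"]["bindings"]:
        print(b["p"]["value"], b["o"]["value"])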

5. Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as of the external data sets that set links pointing to DBpedia resources.

5.1. Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository17 for which conventions for adding linksets and linkset metadata are defined. Adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release, as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking, as well as the overall number of links set between DBpedia and the external data set. The last column names the tool that was used to generate the links: the value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset is copied from the previous release and not regenerated.

An example for the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia)18:

    SELECT * {
      { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
          ?com rdf:type fts-o:Commitment .
          ?com fts-o:year ?year .
          ?year rdfs:label ?ftsyear .
          ?com fts-o:benefit ?benefit .
          ?benefit fts-o:detailAmount ?amount .
          ?benefit fts-o:beneficiary ?beneficiary .
          ?beneficiary fts-o:country ?country .
          ?country owl:sameAs ?ftscountry .
      } }
      { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
          ?dbpcountry rdf:type dbo:Country .
          ?dbpcountry dbp:gdpNominal ?gdpnominal .
          ?dbpcountry dbp:gdpNominalYear ?gdpyear .
      } }
      FILTER ((?ftsyear = str(?gdpyear)) &&
              (?ftscountry = ?dbpcountry))
    }

17 https://github.com/dbpedia/dbpedia-links
18 Endpoint: http://fts.publicdata.eu/sparql, results: http://bit.ly/1c2mIwQ


Table 5. Data sets linked from DBpedia.

Data set                 Predicate               Count       Tool
Amsterdam Museum         owl:sameAs              627         S
BBC Wildlife Finder      owl:sameAs              444         S
Book Mashup              rdf:type, owl:sameAs    9,100
Bricklink                dc:publisher            10,100
CORDIS                   owl:sameAs              314         S
Dailymed                 owl:sameAs              894         S
DBLP Bibliography        owl:sameAs              196         S
DBTune                   owl:sameAs              838         S
Diseasome                owl:sameAs              2,300       S
Drugbank                 owl:sameAs              4,800       S
EUNIS                    owl:sameAs              3,100       S
Eurostat (Linked Stats)  owl:sameAs              253         S
Eurostat (WBSG)          owl:sameAs              137
CIA World Factbook       owl:sameAs              545         S
flickr wrappr            dbp:hasPhotoCollection  3,800,000   C
Freebase                 owl:sameAs              3,600,000   C
GADM                     owl:sameAs              1,900
GeoNames                 owl:sameAs              86,500      S
GeoSpecies               owl:sameAs              16,000      S
GHO                      owl:sameAs              196         L
Project Gutenberg        owl:sameAs              2,500       S
Italian Public Schools   owl:sameAs              5,800       S
LinkedGeoData            owl:sameAs              103,600     S
LinkedMDB                owl:sameAs              13,800      S
MusicBrainz              owl:sameAs              23,000
New York Times           owl:sameAs              9,700
OpenCyc                  owl:sameAs              27,100      C
OpenEI (Open Energy)     owl:sameAs              678         S
Revyu                    owl:sameAs              6
Sider                    owl:sameAs              2,000       S
TCMGeneDIT               owl:sameAs              904
UMBEL                    rdf:type                896,400
US Census                owl:sameAs              12,600
WikiCompany              owl:sameAs              8,300
WordNet                  dbp:wordnet_type        467,100
YAGO2                    rdf:type                18,100,000

Sum                                              27,211,732

In addition to providing outgoing links on an instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki, by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively.

In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent-class and 31 equivalent-property links pointing to http://schema.org terms.

5.2. Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub19. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of the subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources it stores. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia.
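The link criterion used here can be made concrete with a small sketch: a triple counts as a link when subject and object fall into different data sets, approximated below by the second-level domain (a crude heuristic that, for instance, ignores public suffixes like co.uk).

    from urllib.parse import urlparse

    def dataset_of(uri):
        """Approximate a Sindice-style data set by the second-level domain."""
        host = urlparse(uri).hostname or ""
        return ".".join(host.split(".")[-2:])

    def is_link(subject, obj):
        return dataset_of(subject) != dataset_of(obj)

    print(is_link("http://data.nytimes.com/N1",
                  "http://dbpedia.org/resource/Berlin"))
    # True - subject and object belong to different data sets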

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub20. However, it crawls RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite this incompleteness, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with the most incoming links. Those are authorities in the network structure of the Web of Data, and DBpedia is currently ranked second in terms of incoming links.

19 See http://wiki.dbpedia.org/Interlinking for details.
20 http://datahub.io

Table 6. Top 10 data sets in Sindice, ordered by the number of links to DBpedia.

Data set        Link Predicates  Link Count
okaboo.com      4                2,407,121
tfri.gov.tw     57               837,459
naplesplus.us   2                212,460
fu-berlin.de    7                145,322
freebase.com    108              138,619
geonames.org    3                125,734
opencyc.org     3                19,648
geospecies.org  10               16,188
dbrec.net       3                12,856
faviki.com      5                12,768

Table 7. Sindice summary statistics for incoming links to DBpedia.

Metric                     Value
Total links                3,960,212
Total distinct data sets   248
Total distinct predicates  345

Table 8. Top 10 datasets by incoming links in Sindice.

domain               datasets  links
purl.org             498       6,717,520
dbpedia.org          248       3,960,212
creativecommons.org  2,483     3,030,910
identi.ca            1,021     2,359,276
l3s.de               34        1,261,487
rkbexplorer.com      24        1,212,416
nytimes.com          27        1,174,941
w3.org               405       658,535
geospecies.org       13        523,709
livejournal.com      14,881    366,025

6. DBpedia Usage Statistics

DBpedia is served on the web in three forms. First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012.

6.1. Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages also vary with respect to their download popularity. The top five languages with respect to download volume are English, Chinese, German, Catalan and French. The download count and download volume of the English language edition are shown in Figure 9. Hosting the DBpedia dataset downloads currently requires a bandwidth of approximately 6 TB per month.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month21.

Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Presumably, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.

6.2. Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition), version 6.4, in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

21 The IP address was only filtered for that specific file and month in those cases.
22 http://dbpedia.org/sparql

Fig. 10. The download count (in thousands) and download volume (in GB) of the DBpedia data sets.

Table 9. Hardware of the machines serving the public SPARQL endpoint.

DBpedia    Configuration
3.3 - 3.4  AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 - 3.7  Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8        Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that were blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code23, like 509 ("Bandwidth Limit Exceeded"), and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

Table 11. Number of unique sites accessing DBpedia endpoints.

DBpedia  Avg/Day  Median  Stdev  Maximum
3.3      5824     5977    1046   7865
3.4      6656     6704    947    8312
3.5      9392     9432    1643   12584
3.6      10810    10810   1612   14798
3.7      17077    17091   6782   33288
3.8      14513    14493   2783   20510

6.3. Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.2124. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by the Median and Standard Deviation. The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.

23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
24 http://www.webalizer.org

Table 10. Number of endpoint hits (left) and visits (right).

DBpedia  Avg/Day  Median   Stdev    Maximum
3.3      733811   711306   188991   1319539
3.4      1212549  1165893  351226   2371657
3.5      1122612  1035444  386454   2908720
3.6      1328355  1286750  256945   2495031
3.7      2085399  1930728  1057398  8873473
3.8      2910410  2717775  1085640  7678490

DBpedia  Avg/Day  Median  Stdev  Maximum
3.3      9750     9775    2036   13126
3.4      11329    11473   1546   14198
3.5      16592    16902   2969   23129
3.6      19471    17664   5691   56064
3.7      23972    22262   10741  127869
3.8      16851    16711   2960   27416
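The floating 30-minute visit window described above can be made concrete with a short sketch, under simplified assumptions (a log already sorted by time, keyed only by IP address):

    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=30)

    def count_visits(requests_log):
        """requests_log: iterable of (ip, datetime), assumed sorted by time."""
        last_seen, visits = {}, 0
        for ip, ts in requests_log:
            if ip not in last_seen or ts - last_seen[ip] >= WINDOW:
                visits += 1              # a new session starts
            last_seen[ip] = ts
        return visits

    log = [("1.2.3.4", datetime(2013, 1, 1, 10, 0)),
           ("1.2.3.4", datetime(2013, 1, 1, 10, 20)),   # same visit
           ("1.2.3.4", datetime(2013, 1, 1, 11, 0))]    # new visit
    print(count_visits(log))  # 2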

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1 applications starting to use their own privateDBpedia endpoint

2 blocking of apps that were abusing the DBpediaendpoint

3 uptake of the language specific DBpedia end-points and DBpedia Live

6.4. Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept header sent by the client to return an HTTP 30x status code, redirecting the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs pointing to the page or data version of an article, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.

Table 12. Hits per service to http://dbpedia.org, in thousands.

Endpoint  3.3   3.4    3.5   3.6    3.7    3.8
data      1355  2681   2230  2547   3714   4246
ontology  80    168    142   178    116    97
page      2634  4703   1810  1687   3787   7377
property  231   311    137   176    176    128
resource  2976  4080   2554  2309   4436   7143
sparql    2090  4177   2266  9112   15902  15475
other     252   420    342   277    579    695
total     9619  16541  9434  16286  28710  35142

Fig. 11. Traffic: Linked Data versus SPARQL endpoint.
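The redirect behaviour of the resource endpoint described above can be observed directly; the following sketch issues live requests against the public server (the exact redirect targets are approximate and depend on server configuration):

    import requests

    for accept in ("text/html", "text/turtle"):
        r = requests.get("http://dbpedia.org/resource/Berlin",
                         headers={"Accept": accept},
                         allow_redirects=False, timeout=30)
        print(accept, "->", r.status_code, r.headers.get("Location"))
    # Expected (approximately):
    # text/html   -> 303 http://dbpedia.org/page/Berlin
    # text/turtle -> 303 http://dbpedia.org/data/Berlin.ttl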

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focused on the calls to the sparql endpoint and counted the number of statements per type.


Table 13. Hits per statement type, in thousands.

Statement  3.3   3.4   3.5   3.6   3.7    3.8
ask        56    269   360   326   159    677
construct  55    31    14    11    22     38
describe   11    8     4     7     62     111
select     1891  3663  1658  8030  11204  13516
unknown    78    206   229   738   4455   1134
total      2090  4177  2266  9112  15902  15475

Table 14. Trends in SPARQL select queries (rounded values in %).

Statement  3.3   3.4   3.5   3.6   3.7   3.8
distinct   19.5  11.4  17.3  19.4  13.3  25.4
filter     45.7  13.7  31.8  25.3  29.2  36.1
functions  8.8   6.9   23.5  21.3  25.5  25.9
geo        27.7  7.0   39.6  6.7   9.3   9.2
group      0.0   0.0   0.0   0.0   0.0   0.2
limit      4.6   6.5   11.6  10.5  7.7   7.8
optional   30.6  23.8  47.3  26.3  16.7  17.2
order      2.3   1.3   1.9   3.2   1.2   1.6
union      3.3   3.2   26.4  11.3  12.1  20.6

As the log files only record the full SPARQL query on a GET request, all POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
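The essence of this keyword analysis can be sketched as naive substring matching over the logged queries; this is an illustration, not the actual analysis script:

    KEYWORDS = ("DISTINCT", "FILTER", "GROUP BY", "LIMIT",
                "OPTIONAL", "ORDER BY", "UNION")

    def keyword_shares(queries):
        """Percentage of SELECT queries containing each construct."""
        selects = [q.upper() for q in queries if "SELECT" in q.upper()]
        return {k: 100.0 * sum(k in q for q in selects) / max(len(selects), 1)
                for k in KEYWORDS}

    sample = ["SELECT DISTINCT ?x WHERE { ?x a ?t } LIMIT 10",
              "SELECT ?x WHERE { ?x ?p ?o . FILTER(?o > 5) }"]
    print(keyword_shares(sample))
    # {'DISTINCT': 50.0, 'FILTER': 50.0, 'LIMIT': 50.0, ...}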

6.5. Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12. Number of daily requests sent to DBpedia Live for (a) SPARQL queries and (b) synchronisation requests, from August 2012 until January 2013.

7. Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Applications include improving the search and exploration of Wikipedia data, data proliferation for applications and mashups, as well as text analysis and annotation tools.

7.1. Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

25 http://wiki.dbpedia.org/Datasets/NLP

7.1.1. Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open-source tool26, including a free web service, that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
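A call to the Spotlight web service might look as follows; this is a sketch using the commonly documented endpoint and parameters, whose availability and exact response format are not guaranteed:

    import requests

    resp = requests.get("https://api.dbpedia-spotlight.org/en/annotate",
                        params={"text": "Berlin is the capital of Germany.",
                                "confidence": 0.5},
                        headers={"Accept": "application/json"},
                        timeout=30)
    for res in resp.json().get("Resources", []):
        print(res["@surfaceForm"], "->", res["@URI"])
    # Berlin  -> http://dbpedia.org/resource/Berlin
    # Germany -> http://dbpedia.org/resource/Germany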

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, the Semantic API from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com

A related project is ImageSnippets32, a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation: Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2. Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets, including DBpedia35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org
32 http://www.imagesnippets.com
33 http://www.faviki.com
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2. Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information, such as an author's demographics, a film's homepage or an image, could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project37. Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia38. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

7.3. Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of such tools, structured by category.

Facet Based Browsers: An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying: The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows users to explore the neighborhood of, and connections between, resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows users to create statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications: DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4. Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words covering several domains43. Such descriptions provide, for a lexical word, a hierarchical disambiguation into its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org

Due to its fast-changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data44. Statistics are shown in Table 15.

8. Related Work

8.1. Cross-Domain Community Knowledge Bases

8.1.1. Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata45. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

42 https://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page.
44 http://wiktionary.dbpedia.org
45 http://wikidata.org

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2. Freebase

Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those of Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows metadata for each fact to be stored efficiently. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47 http://www.freebase.com

Table 15. Statistical comparison of extractions for different languages.

language  words    triples   resources  predicates  senses
en        2142237  28593364  11804039   28          424386
fr        4657817  35032121  20462349   22          592351
ru        1080156  12813437  5994560    17          149859
de        701739   5618508   2966867    16          122362

8.1.3. YAGO

One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and to provide an RDF version of its content. YAGO focuses on extracting a smaller number of relations compared to DBpedia, in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

48 http://www.mpi-inf.mpg.de/yago-naga/yago/

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2. Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 that are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source, Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction

Several different approaches to extracting knowledge from Wikipedia are presented in [37]: given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g., for word-sense disambiguation, synonyms, translations, taxonomic relations, and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improving the quality of infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieving better cross-lingual knowledge linkage, beyond the use of Wikipedia interlanguage links, is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9. Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that in the past four years DBpedia has matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

56 http://www.cs.washington.edu/research/knowitall

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
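A minimal sketch of such a sanity check, runnable against the DBpedia Live endpoint, could look as follows (using only properties from the DBpedia ontology):

PREFIX dbo: <http://dbpedia.org/ontology/>

# Sanity check: list persons whose recorded death date lies
# before their birth date, indicating an error in Wikipedia
# or in the extraction.
SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}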

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information, such as co-occurrence, for named entity disambiguation. However, co-occurrence is not always easy to determine (depends on training data) and update (requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16
List of namespace prefixes

Prefix    Namespace
dbo       http://dbpedia.org/ontology/
dbp       http://dbpedia.org/property/
dbr       http://dbpedia.org/resource/
dbr-de    http://de.dbpedia.org/resource/
dc        http://purl.org/dc/elements/1.1/
foaf      http://xmlns.com/foaf/0.1/
geo       http://www.w3.org/2003/01/geo/wgs84_pos#
georss    http://www.georss.org/georss/
ls        http://spotlight.dbpedia.org/scores/
lx        http://spotlight.dbpedia.org/lexicalizations/
owl       http://www.w3.org/2002/07/owl#
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs      http://www.w3.org/2000/01/rdf-schema#
skos      http://www.w3.org/2004/02/skos/core#
sptl      http://spotlight.dbpedia.org/vocab/
xsd       http://www.w3.org/2001/XMLSchema#

Table 17
DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.


[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle, and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata, Semantics and Ontologies, 3:37–52, Nov. 10 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven Quality Evaluation of DBpedia. To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.




Fig. 2. Depiction of the mapping from the Greek (left) and English Wikipedia templates (right) about books to the same DBpedia ontology class (middle) [24]

– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia interlanguage links5. For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach omitted valuable data [24]. Thus, starting from the DBpedia 3.7 release6, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within the datasets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic language-agnostic namespace http://dbpedia.org/resource/.

5 http://en.wikipedia.org/wiki/Help:Interlanguage_links
6 A list of all DBpedia releases is provided in Table 17.
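The owl:sameAs links between localized and canonicalized URIs can be followed with a simple query. The sketch below, intended to be asked against a chapter endpoint (assumed here at http://de.dbpedia.org/sparql), resolves a localized resource to its canonical URI:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Resolve a localized German resource to its canonical,
# language-agnostic URI.
SELECT ?canonical WHERE {
  <http://de.dbpedia.org/resource/Berlin> owl:sameAs ?canonical .
  FILTER (STRSTARTS(STR(?canonical), "http://dbpedia.org/resource/"))
}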

2.6 NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [33]. Currently, four datasets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concept. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concept extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The Lexicalization data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones, and also show how ambiguous a name is with respect to all possible concepts that it can mean.
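Since the exact vocabulary of the Lexicalization data set is not given in this article, the following query is only a hypothetical sketch: the predicate names lx:lexicalization, lx:surfaceForm and ls:score are invented for illustration, while the prefixes come from Table 16.

PREFIX lx:  <http://spotlight.dbpedia.org/lexicalizations/>
PREFIX ls:  <http://spotlight.dbpedia.org/scores/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Hypothetical: alternative surface forms for dbr:Berlin,
# ordered by association strength (predicate names assumed).
SELECT ?name ?score WHERE {
  dbr:Berlin lx:lexicalization ?lex .
  ?lex lx:surfaceForm ?name ;
       ls:score ?score .
}
ORDER BY DESC(?score)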

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the tf-idf weight, with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [30].
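For reference, the underlying tf-idf weight can be written in its textbook form (a sketch; the article does not specify the exact normalisation used by the extractor):

\[ \mathrm{tfidf}(t, r) = \mathrm{tf}(t, r) \cdot \log \frac{N}{\mathrm{df}(t)} \]

where tf(t, r) is the frequency of word stem t in the text associated with resource r, df(t) is the number of resources whose associated text contains t, and N is the total number of resources.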

There are two more feature extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the centre of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (Subject, Object, Possessive Adjective, Possessive Pronoun and Reflexive), i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
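Assuming the resulting assignments are exposed through a gender property (foaf:gender is used below purely as an illustrative guess; the actual predicate of the data set may differ), the data could be queried as follows:

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Sample persons together with their heuristically assigned
# grammatical gender (predicate name is an assumption).
SELECT ?person ?gender WHERE {
  ?person a dbo:Person ;
          foaf:gender ?gender .
}
LIMIT 10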

2.7 Summary of Other Recent Developments

In this section we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework has been rewritten in Scala in 20107, resulting in an improvement of the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. This is crucial for the DBpedia Live project, which will be explained in Section 4. The new, more modular framework also allows to extract data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

7 Table 17 provides an overview of the project's evolution through time.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years, the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal, etc. in English, which indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.), and for a number of classes and properties, specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction, by resolving alternative template names. Moreover, in the PHP framework, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links, and hence increased the overall interconnectivity of resources in the DBpedia ontology.
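Assuming the redirect data set is loaded under a property such as dbo:wikiPageRedirects (the property name is taken from the public DBpedia data sets, but should be treated as an assumption here), redirect chains can be followed with a SPARQL 1.1 property path; the precomputed transitive closure makes a single hop sufficient in practice:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Follow redirect chains of arbitrary length from a redirect
# page (here the illustrative example dbr:NYC) to the target.
SELECT ?target WHERE {
  dbr:NYC dbo:wikiPageRedirects+ ?target .
}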


Fig. 3. Snapshot of a part of the DBpedia ontology (a class diagram showing the top classes, e.g. dbo:Person, dbo:Place, dbo:Organisation, dbo:Work and dbo:Species beneath owl:Thing, together with selected object and datatype properties such as dbo:birthPlace, dbo:birthDate and dbo:populationTotal)

Finally, a new heuristic to increase the connectivity of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.

Orthogonal to the previously mentioned improvements, there have been various efforts to assess the quality of the DBpedia datasets. In [26], a framework for estimating the quality of DBpedia was developed and a sample of 75 resources was analysed. A more comprehensive effort was performed in [48] by providing a distributed web-based interface [25] for quality assessment. In this study, 17 data quality problem types were analysed by 58 users covering 521 resources in DBpedia.

3 DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances. The complete DBpedia ontology can be browsed online at http://mappings.dbpedia.org/server/ontology/classes/.
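The subsumption hierarchy itself can be retrieved with a straightforward query against the public endpoint, for example:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

# List every DBpedia ontology class together with its direct
# superclass, which reconstructs the (shallow) hierarchy.
SELECT ?class ?super WHERE {
  ?class a owl:Class ;
         rdfs:subClassOf ?super .
  FILTER (STRSTARTS(STR(?class), "http://dbpedia.org/ontology/"))
}
ORDER BY ?super ?class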

Fig. 4. Growth of the DBpedia ontology: number of ontology elements (classes and properties, 0–2000) per DBpedia version (3.2 to 3.8)

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much, due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1 Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

Fig. 5. Mapping coverage statistics for all mapping-enabled languages

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year have a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Fig. 7. English property mappings occurrence frequency (both axes are in log scale)

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long tail distribution: a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2 Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been so far created in the Mapping Wiki for a specific language.

Table 3 reports the number of instances for a set of popular classes from the third and fourth hierarchy level of the ontology within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages, but not in three or more languages, etc. For example, 12,936 persons are described in five languages, but not in six or more languages. The number 871,630 for the class Person means that all 20 language versions together describe 871,630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763,643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195,263 persons are described in two or more languages. The large difference of this number compared to the total number of 871,630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).

Fig. 6. Mapping community activity for (a) ontology and (b) 10 most active language editions

Table 2
Basic statistics about Localized DBpedia Editions

lang  Inst. LD all  Inst. CD all  Inst. with MD CD  Raw Prop. CD  Map. Prop. CD  Raw Statem. CD  Map. Statem. CD
en    3769926       3769926       2359521           48293         1313           65143840        33742015
de    1243771       650037        204335            9593          261            7603562         2880381
fr    1197334       740044        214953            13551         228            8854322         2901809
it    882127        580620        383643            9716          181            12227870        4804731
es    879091        542524        310348            14643         476            7740458         4383206
pl    848298        538641        344875            7306          266            7696193         4511794
ru    822681        439605        123011            13522         76             6973305         1389473
pt    699446        460258        272660            12851         602            6255151         4005527
ca    367362        241534        112934            8696          183            3689870         1301868
cs    225133        148819        34893             5564          334            1857230         474459
hu    209180        138998        63441             6821          295            2506399         601037
ko    196132        124591        30962             7095          419            1035606         417605
tr    187850        106644        40438             7512          440            1350679         556943
ar    165722        103059        16236             7898          268            635058          168686
eu    132877        108713        41401             2245          19             2255897         532709
sl    129834        73099         22036             4235          470            1213801         222447
bg    125762        87679         38825             3984          274            774443          488678
hr    109890        71469         10343             3334          158            701182          151196
el    71936         48260         10813             2866          288            206460          113838

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [20] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages8. The DBpedia 3.7 release9 in September 2011 was the first DBpedia release to use the localized I18n (Internationalisation) DBpedia extraction framework [24].

8 Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr), Urdu (ur)

9 http://blog.dbpedia.org/2011/09/11


Table 3
Number of instances per class within 10 localized DBpedia versions

Class            en      it      pl      es      pt      fr      de      ru     ca     hu
Person           763643  145060  70708   65337   43057   62942   33122   18620  7107   15529
  Athlete        185126  47187   30332   19482   14130   21646   31237   0      721    4527
  Artist         61073   12511   16120   25992   10571   13465   0       0      2004   3821
  Politician     23096   0       8943    5513    3342    0       0       12004  1376   760
Place            572728  141101  182727  132961  116660  80602   131766  67932  73078  18324
  PopulPlace     387166  138077  167034  121204  109418  72252   79410   63826  72743  15535
  Building       60514   1270    2946    3570    803     921     83      43     0      527
  River          24267   0       1383    2       4149    3333    6707    3924   0      565
Organisation     192832  4142    12193   11710   10949   17513   16973   1598   1399   3993
  Company        44516   4142    2566    975     1903    5832    7200    0      440    618
  EducInst       42270   0       599     1207    270     1636    2171    1010   0      115
  Band           27061   0       2993    0       4476    3868    5368    0      263    802
Work             333269  51918   32386   36484   33869   39195   18195   34363  5240   9777
  MusicWork      159070  23652   14987   15911   17053   16206   0       6588   697    4704
  Film           71715   17210   9467    9396    8896    9741    15038   12045  3859   2320
  Software       27947   5682    3050    4833    3419    5733    2368    0      606    857

Table 4
Cross-language overlap: Number of instances that are described in multiple languages

Class         Instances  1       2       3      4      5      6      7      8      9      10+
Person        871630     676367  94339   42382  21647  12936  8198   5295   3437   2391   4638
Place         643260     307729  150349  45836  36339  20831  13523  20808  31422  11262  5161
Organisation  206670     160398  22661   9312   5002   3221   2072   1421   928    594    1061
Work          360808     243706  54855   23097  12605  8277   5732   4007   2911   1995   3623

At the time of writing, DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish10. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services.

While at the moment chapters are defined by ownership of the IP and server of the sub-domain A record (e.g. http://ko.dbpedia.org) given by DBpedia maintainers, the DBpedia internationalisation committee11 is manifesting its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek12 and Dutch13) the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].

10 Accessed on 25/09/2013: http://wiki.dbpedia.org/Internationalization/Chapters
11 http://wiki.dbpedia.org/Internationalization
12 http://el.dbpedia.org
13 http://nl.dbpedia.org

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage.

The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is from achieving total coverage (as shown in Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate: e.g. the English Wikipedia, in June 2013, had approximately 3.3 million edits per month, which is equal to 77 edits per minute14. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [20,36]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported15. Meanwhile, DBpedia Live for Dutch16 was developed.

4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [27]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework, to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

14 http://stats.wikimedia.org/EN/SummaryEN.htm

15 http://live.dbpedia.org
16 http://live.nl.dbpedia.org

Fig. 8. Overview of DBpedia Live extraction framework

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.
– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.
– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.
– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.
– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separate data from different sources.


Mapping-Affected Pages. Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages. Extraction framework improvements or activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed, and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets. Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and another set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool. The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them, and updates the target SPARQL endpoint via insert and delete operations.
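In effect, applying one changeset boils down to a pair of SPARQL 1.1 Update operations. The sketch below shows the idea with a single illustrative triple (the resource and values are made up; real changesets contain the triples read from the downloaded N-Triples files):

# Remove the triples listed in the 'deleted' file of the changeset ...
DELETE DATA {
  GRAPH <http://live.dbpedia.org> {
    <http://dbpedia.org/resource/Berlin>
      <http://dbpedia.org/ontology/populationTotal>
      "3499879"^^<http://www.w3.org/2001/XMLSchema#integer> .
  }
} ;
# ... then add the triples listed in the 'added' file.
INSERT DATA {
  GRAPH <http://live.dbpedia.org> {
    <http://dbpedia.org/resource/Berlin>
      <http://dbpedia.org/ontology/populationTotal>
      "3502000"^^<http://www.w3.org/2001/XMLSchema#integer> .
  }
}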

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
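The graph URIs above can be used directly in queries to restrict results to one source. A minimal example, querying only the continuously updated article data:

# Retrieve only the live-extracted triples for one resource,
# ignoring static interlinking data kept in other graphs.
SELECT ?p ?o WHERE {
  GRAPH <http://live.dbpedia.org> {
    <http://dbpedia.org/resource/Berlin> ?p ?o .
  }
}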

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing to DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository17 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking, as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

An example for the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia)18:

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
    } GROUP BY ?ftsyear ?ftscountry
  }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
    }
  }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

17 https://github.com/dbpedia/dbpedia-links
18 Endpoint: http://fts.publicdata.eu/sparql, results: http://bit.ly/1c2mIwQ


Table 5
Data sets linked from DBpedia

Data set                 Predicate                  Count       Tool
Amsterdam Museum         owl:sameAs                 627         S
BBC Wildlife Finder      owl:sameAs                 444         S
Book Mashup              rdf:type, owl:sameAs       9 100
Bricklink                dc:publisher               10 100
CORDIS                   owl:sameAs                 314         S
Dailymed                 owl:sameAs                 894         S
DBLP Bibliography        owl:sameAs                 196         S
DBTune                   owl:sameAs                 838         S
Diseasome                owl:sameAs                 2 300       S
Drugbank                 owl:sameAs                 4 800       S
EUNIS                    owl:sameAs                 3 100       S
Eurostat (Linked Stats)  owl:sameAs                 253         S
Eurostat (WBSG)          owl:sameAs                 137
CIA World Factbook       owl:sameAs                 545         S
flickr wrappr            dbp:hasPhotoCollection     3 800 000   C
Freebase                 owl:sameAs                 3 600 000   C
GADM                     owl:sameAs                 1 900
GeoNames                 owl:sameAs                 86 500      S
GeoSpecies               owl:sameAs                 16 000      S
GHO                      owl:sameAs                 196         L
Project Gutenberg        owl:sameAs                 2 500       S
Italian Public Schools   owl:sameAs                 5 800       S
LinkedGeoData            owl:sameAs                 103 600     S
LinkedMDB                owl:sameAs                 13 800      S
MusicBrainz              owl:sameAs                 23 000
New York Times           owl:sameAs                 9 700
OpenCyc                  owl:sameAs                 27 100      C
OpenEI (Open Energy)     owl:sameAs                 678         S
Revyu                    owl:sameAs                 6
Sider                    owl:sameAs                 2 000       S
TCMGeneDIT               owl:sameAs                 904
UMBEL                    rdf:type                   896 400
US Census                owl:sameAs                 12 600
WikiCompany              owl:sameAs                 8 300
WordNet                  dbp:wordnet_type           467 100
YAGO2                    rdf:type                   18 100 000

Sum                                                 27 211 732

In addition to providing outgoing links on an instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively.

In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub19. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia.

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub20. However, it crawls for RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

19 See http://wiki.dbpedia.org/Interlinking for details

20 http://datahub.io


Table 6
Top 10 data sets in Sindice ordered by the number of links to DBpedia

Data set         Link Predicate Count  Link Count
okaboo.com       4                     2 407 121
tfri.gov.tw      57                    837 459
naplesplus.us    2                     212 460
fu-berlin.de     7                     145 322
freebase.com     108                   138 619
geonames.org     3                     125 734
opencyc.org      3                     19 648
geospecies.org   10                    16 188
dbrec.net        3                     12 856
faviki.com       5                     12 768

Table 7
Sindice summary statistics for incoming links to DBpedia

Metric                     Value
Total links                3 960 212
Total distinct data sets   248
Total distinct predicates  345

Table 8
Top 10 datasets by incoming links in Sindice

domain               datasets  links
purl.org             498       6 717 520
dbpedia.org          248       3 960 212
creativecommons.org  2 483     3 030 910
identi.ca            1 021     2 359 276
l3s.de               34        1 261 487
rkbexplorer.com      24        1 212 416
nytimes.com          27        1 174 941
w3.org               405       658 535
geospecies.org       13        523 709
livejournal.com      14 881    366 025


6 DBpedia Usage Statistics

DBpedia is served on the web in three forms. First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferencable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter (2012-Q1 to 2012-Q4)

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition are indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month21.

Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.


21 The IP address was only filtered for that specific file and month in those cases.

22 http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume (in GB) of the DBpedia data sets

Table 9
Hardware of the machines serving the public SPARQL endpoint

DBpedia    Configuration
3.3 - 3.4  AMD Opteron 8220, 2.80GHz, 4 Cores, 32GB
3.5 - 3.7  Intel Xeon E5520, 2.27GHz, 8 Cores, 48GB
3.8        Intel Xeon E5-2630, 2.30GHz, 8 Cores, 64GB

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally use too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10MB) are enforced.

Table 11
Number of unique sites accessing DBpedia endpoints

DBpedia  Avg/Day  Median  Stdev  Maximum
3.3      5824     5977    1046   7865
3.4      6656     6704    947    8312
3.5      9392     9432    1643   12584
3.6      10810    10810   1612   14798
3.7      17077    17091   6782   33288
3.8      14513    14493   2783   20510

If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code23, like 509 ("Bandwidth Limit Exceeded"), and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3. Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.2124. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by the median and standard deviation.

23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

24 http://www.webalizer.org


Table 10. Number of endpoint hits (left) and visits (right)

DBpedia  Avg/Day   Median    Stdev   Maximum
3.3       733811   711306   188991   1319539
3.4      1212549  1165893   351226   2371657
3.5      1122612  1035444   386454   2908720
3.6      1328355  1286750   256945   2495031
3.7      2085399  1930728  1057398   8873473
3.8      2910410  2717775  1085640   7678490

DBpedia  Avg/Day  Median  Stdev  Maximum
3.3         9750    9775   2036    13126
3.4        11329   11473   1546    14198
3.5        16592   16902   2969    23129
3.6        19471   17664   5691    56064
3.7        23972   22262  10741   127869
3.8        16851   16711   2960    27416

The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.
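A minimal sketch of this visit heuristic, assuming log entries are given as (IP address, Unix timestamp) pairs: requests from the same external IP are grouped into one visit as long as consecutive hits are less than 30 minutes apart.

from collections import defaultdict

WINDOW = 30 * 60  # 30 minutes in seconds

def count_visits(log_entries):
    """log_entries: iterable of (ip_address, unix_timestamp) tuples."""
    last_seen = {}
    visits = defaultdict(int)
    for ip, ts in sorted(log_entries, key=lambda e: e[1]):
        # A new visit starts if the IP was never seen or was idle > 30 min.
        if ip not in last_seen or ts - last_seen[ip] > WINDOW:
            visits[ip] += 1
        last_seen[ip] = ts  # floating window: every hit extends the visit
    return sum(visits.values())

print(count_visits([("1.2.3.4", 0), ("1.2.3.4", 600), ("1.2.3.4", 4000)]))  # -> 2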

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4. Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub returning resources in a number of different formats. For each data set we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x to redirect the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint.

Table 12. Hits per service to http://dbpedia.org in thousands

Endpoint    3.3    3.4    3.5    3.6    3.7    3.8
data       1355   2681   2230   2547   3714   4246
ontology     80    168    142    178    116     97
page       2634   4703   1810   1687   3787   7377
property    231    311    137    176    176    128
resource   2976   4080   2554   2309   4436   7143
sparql     2090   4177   2266   9112  15902  15475
other       252    420    342    277    579    695
total      9619  16541   9434  16286  28710  35142

Fig. 11. Traffic: Linked Data versus SPARQL endpoint

The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.
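The content negotiation described above can be observed directly; the sketch below is a minimal illustration (Berlin is an arbitrary example resource, and the exact redirect targets are an assumption based on the behaviour described here).

import requests

uri = "http://dbpedia.org/resource/Berlin"

# Ask for HTML and for Turtle; the endpoint answers with a 30x redirect
# to the /page (HTML) or /data (RDF) variant of the article.
html = requests.get(uri, headers={"Accept": "text/html"}, allow_redirects=False)
rdf = requests.get(uri, headers={"Accept": "text/turtle"}, allow_redirects=False)

print(html.status_code, html.headers.get("Location"))  # e.g. 303 .../page/Berlin
print(rdf.status_code, rdf.headers.get("Location"))    # e.g. 303 .../data/Berlin.ttl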

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13 we focus on the calls to the sparql endpoint and count the number of statements per type.


Table 13. Hits per statement type in thousands

Statement    3.3    3.4    3.5    3.6    3.7    3.8
ask           56    269    360    326    159    677
construct     55     31     14     11     22     38
describe      11      8      4      7     62    111
select      1891   3663   1658   8030  11204  13516
unknown       78    206    229    738   4455   1134
total       2090   4177   2266   9112  15902  15475

Table 14. Trends in SPARQL select (rounded values in %)

Statement   3.3   3.4   3.5   3.6   3.7   3.8
distinct   19.5  11.4  17.3  19.4  13.3  25.4
filter     45.7  13.7  31.8  25.3  29.2  36.1
functions   8.8   6.9  23.5  21.3  25.5  25.9
geo        27.7   7.0  39.6   6.7   9.3   9.2
group       0.0   0.0   0.0   0.0   0.0   0.2
limit       4.6   6.5  11.6  10.5   7.7   7.8
optional   30.6  23.8  47.3  26.3  16.7  17.2
order       2.3   1.3   1.9   3.2   1.2   1.6
union       3.3   3.2  26.4  11.3  12.1  20.6

As the log files only record the full SPARQL query on a GET request, all POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

- DISTINCT
- FILTER
- FUNCTIONS like CONCAT, CONTAINS, ISIRI
- use of GEO objects
- GROUP BY
- LIMIT / OFFSET
- OPTIONAL
- ORDER BY
- UNION

For the GEO objects we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
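A rough sketch of how such keyword shares can be computed from a query log sample is given below; the regular expressions are simplified stand-ins (an exact analysis would require a SPARQL parser), and the input format is an assumption.

import re

KEYWORDS = {
    "distinct": r"\bDISTINCT\b",
    "filter":   r"\bFILTER\b",
    "group":    r"\bGROUP\s+BY\b",
    "limit":    r"\bLIMIT\b",
    "optional": r"\bOPTIONAL\b",
    "order":    r"\bORDER\s+BY\b",
    "union":    r"\bUNION\b",
    "geo":      r"\b(geo|wgs84)\w*:",  # PREFIX/property usage, as in the text
}

def keyword_shares(select_queries):
    """Return, per construct, the percentage of queries that use it."""
    counts = {name: 0 for name in KEYWORDS}
    for q in select_queries:
        for name, pattern in KEYWORDS.items():
            if re.search(pattern, q, re.IGNORECASE):
                counts[name] += 1
    total = len(select_queries)
    return {name: 100.0 * c / total for name, c in counts.items()}

sample = ["SELECT DISTINCT ?x WHERE { ?x a ?t } LIMIT 10",
          "SELECT ?x WHERE { ?x ?p ?o FILTER(?o > 5) }"]
print(keyword_shares(sample))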

6.5. Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users.

Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12. Number of daily requests sent to the DBpedia Live endpoint for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013
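For illustration, a mirror's synchronisation loop essentially fetches the published changesets (pairs of added and removed triples) and applies them to the local store. The sketch below makes several assumptions: the base URL, the date-based path layout and the file naming are illustrative only, and `store` stands in for any RDF store API; the official synchronisation tool implements the actual protocol.

import gzip
import urllib.request

CHANGESET_BASE = "http://live.dbpedia.org/changesets"  # assumed layout

def fetch_changeset(year, month, day, hour, seq):
    # Changesets are assumed to come as gzipped N-Triples files:
    # one with added and one with removed triples.
    path = f"{CHANGESET_BASE}/{year}/{month:02d}/{day:02d}/{hour:02d}/{seq:06d}"
    added = gzip.decompress(
        urllib.request.urlopen(path + ".added.nt.gz").read()).decode("utf-8")
    removed = gzip.decompress(
        urllib.request.urlopen(path + ".removed.nt.gz").read()).decode("utf-8")
    return added, removed

def apply_changeset(store, added_ntriples, removed_ntriples):
    # Apply removals first, then additions, to mirror the Wikipedia edit.
    store.delete(removed_ntriples)
    store.insert(added_ntriples)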

7. Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, proliferation for applications, mashups, as well as text analysis and annotation tools.

7.1. Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13].

25 http://wiki.dbpedia.org/Datasets/NLP


The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1. Annotation, Entity Disambiguation
An important use case for NLP is annotating texts or other content with semantic information.

Named entity recognition and disambiguation – also known as the key phrase extraction and entity linking tasks – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool26 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
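A minimal sketch of calling the Spotlight web service from the footnote above; the REST path, parameter names and JSON keys follow the publicly deployed API at the time of writing and should be treated as assumptions if the service has changed.

import requests

resp = requests.post(
    "http://spotlight.dbpedia.org/rest/annotate",
    data={"text": "Berlin is the capital of Germany.",
          "confidence": 0.4},             # disambiguation confidence threshold
    headers={"Accept": "application/json"},
)
for res in resp.json().get("Resources", []):
    print(res["@surfaceForm"], "->", res["@URI"])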

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, Semantic API from Ontos28, Open Calais29 and Zemanta30, among others.

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com

Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.

A related project is ImageSnippets32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2. Question Answering
DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches.

One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions.

28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org
32 http://www.imagesnippets.com
33 http://www.faviki.com
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2. Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

- Context information for bibliographic and archive records. Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
- Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
- A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project37. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia38. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

7.3. Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

Facet Based Browsers An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows creating complex graph structures of facets in a visually appealing interface and filtering them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion.

The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows exploring the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows creating statistical analysis queries and correlations in RDF data, and DBpedia in particular.

Spatial Applications DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.
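A sketch of the kind of location-restricted query such a client might issue (the bounding box around central Berlin is an arbitrary illustration; the geo prefix is the one listed in Table 16):

query = """
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?place ?lat ?long WHERE {
  ?place geo:lat ?lat ; geo:long ?long .
  FILTER (?lat > 52.50 && ?lat < 52.54 &&
          ?long > 13.35 && ?long < 13.42)
} LIMIT 50
"""
# The query string can be sent to the public endpoint, e.g. with the
# run_query helper sketched in Section 6.2.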

7.4. Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available in 171 languages (of which approximately 147 can be considered active42).

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org


It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains43. Such descriptions provide for a lexical word a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data44. Statistics are shown in Table 15.

8. Related Work

8.1. Cross Domain Community Knowledge Bases

8.1.1. Wikidata
In March 2012, the Wikimedia Germany e.V. started the development of Wikidata45.

Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike.

42 http://s23.org/wikistats/wiktionaries_html.php

43 See http://en.wiktionary.org/wiki/semantic for a simple example page.

44 http://wiktionary.dbpedia.org
45 http://wikidata.org

It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

- The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.
- The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
- The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013 the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2. Freebase
Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF.

Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles.

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface

47 http://www.freebase.com


Table 15. Statistical comparison of extractions for different languages

language     words    triples   resources  predicates  senses
en         2142237   28593364    11804039          28  424386
fr         4657817   35032121    20462349          22  592351
ru         1080156   12813437     5994560          17  149859
de          701739    5618508     2966867          16  122362

They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3. YAGO
One of the projects that pursues similar goals to DBpedia is YAGO48 [42].

YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content.

48 http://www.mpi-inf.mpg.de/yago-naga/yago

YAGO focuses on extracting a smaller number of relations, compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2. Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 which are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library, [49]) is an open-source, Java-based API that allows to access information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page.

49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction


Data is also provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text, applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge linkage, beyond the use of Wikipedia interlanguage links, is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties, like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby

52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus

53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a Web-scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9. Conclusions and Future Work

In this system report, we presented an overview on recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that in the past four years DBpedia matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMaps.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion An area which is still largely unexplored is the integration and fusion between different DBpedia language editions.

56 http://www.cs.washington.edu/research/knowitall


Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed in certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
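One of the mentioned sanity checks, written out as a concrete query sketch (the dbo prefix is the one from Table 16; the exact shape of the deployed checks may differ):

# Find persons whose recorded death date precedes their birth date;
# any result row points to a likely typo in the underlying article.
consistency_check = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}
"""
# Run periodically against the DBpedia Live endpoint and report hits.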

Integrate DBpedia and NLP Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information, such as co-occurrence, for named entity disambiguation. However, co-occurrence is not always easy to determine (depends on training data) and update (requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

- all Wikipedia contributors
- all DBpedia Mappings Wiki contributors
- OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
- Kingsley Idehen for SPARQL and Linked Data hosting and community support
- Christopher Sahnwaldt for DBpedia development and release management
- Claus Stadler for DBpedia development
- Paul Kreis for DBpedia development
- people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16. List of namespace prefixes

Prefix   Namespace
dbo      http://dbpedia.org/ontology/
dbp      http://dbpedia.org/property/
dbr      http://dbpedia.org/resource/
dbr-de   http://de.dbpedia.org/resource/
dc       http://purl.org/dc/elements/1.1/
foaf     http://xmlns.com/foaf/0.1/
geo      http://www.w3.org/2003/01/geo/wgs84_pos#
georss   http://www.georss.org/georss/
ls       http://spotlight.dbpedia.org/scores/
lx       http://spotlight.dbpedia.org/lexicalizations/
owl      http://www.w3.org/2002/07/owl#
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
skos     http://www.w3.org/2004/02/skos/core#
sptl     http://spotlight.dbpedia.org/vocab/
xsd      http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17. DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD: International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: research and applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia Edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer and J. Lehmann. User-driven Quality Evaluation of DBpedia. To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.




Vector Space Model (VSM), where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the tf-idf weight, with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [30].
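A minimal sketch of this construction, assuming each resource's associated Wikipedia text has already been tokenized and stemmed (the tf-idf variant and the cut-off are illustrative choices, not the framework's exact ones):

import math
from collections import Counter

def topic_signatures(docs, top_k=10):
    """docs: dict mapping DBpedia resource -> list of word stems."""
    df = Counter()
    for stems in docs.values():
        df.update(set(stems))          # document frequency per stem
    n_docs = len(docs)
    signatures = {}
    for resource, stems in docs.items():
        tf = Counter(stems)
        # tf-idf weight per stem; higher means stronger association
        weights = {s: (1 + math.log(c)) * math.log(n_docs / df[s])
                   for s, c in tf.items()}
        signatures[resource] = sorted(weights, key=weights.get,
                                      reverse=True)[:top_k]
    return signatures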

There are two more feature extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the centre of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (Subject, Object, Possessive Adjective, Possessive Pronoun and Reflexive) – i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
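The heuristic can be sketched as follows; tokenization and the dominance margin are illustrative assumptions rather than the extractor's exact settings:

import re

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}

def grammatical_gender(article_text, margin=2.0):
    tokens = re.findall(r"[a-z]+", article_text.lower())
    m = sum(t in MASCULINE for t in tokens)
    f = sum(t in FEMININE for t in tokens)
    # Only decide when one gender clearly dominates; the margin is an
    # assumption, the framework's exact threshold may differ.
    if m > margin * max(f, 1):
        return "male"
    if f > margin * max(m, 1):
        return "female"
    return None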

2.7. Summary of Other Recent Developments

In this section, we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework has been rewritten in Scala in 20107, resulting in an improvement of the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. This is crucial for the DBpedia Live project, which will be explained in Section 4. The new, more modular framework also allows to extract data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki, as described earlier.

7 Table 17 provides an overview of the project's evolution through time.

Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years, the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal etc. in English, which indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.), and for a number of classes and properties specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction, by resolving alternative template names. Moreover, in the PHP framework, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links and hence increased the overall interconnectivity of resources in the DBpedia ontology.
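The transitive redirect resolution with cycle catching can be sketched in a few lines (the input representation is an assumption):

def resolve_redirect(page, redirects):
    """redirects: dict mapping page title -> redirect target."""
    seen = {page}
    while page in redirects:
        page = redirects[page]
        if page in seen:   # cycle detected: give up on this chain
            return None
        seen.add(page)
    return page            # final, non-redirect target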


Fig. 3. Snapshot of a part of the DBpedia ontology: owl:Thing with subclasses such as dbo:Place, dbo:Agent, dbo:Species and dbo:Work; object properties such as dbo:birthPlace, dbo:location and dbo:writer; and datatype properties such as dbo:populationTotal (xsd:integer), dbo:birthDate (xsd:date) and dbo:areaTotal (xsd:double)

Finally, a new heuristic to increase the connectivity of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology (see the sketch after the next paragraph).

Orthogonal to the previously mentioned improvements, there have been various efforts to assess the quality of the DBpedia datasets. [26] developed a framework for estimating the quality of DBpedia, and a sample of 75 resources was analysed. A more comprehensive effort was performed in [48], by providing a distributed web-based interface [25] for quality assessment. In this study, 17 data quality problem types were analysed by 58 users, covering 521 resources in DBpedia.
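A minimal sketch of the anchor-text heuristic mentioned above, assuming the article's hyperlinks have been collected into a mapping from anchor text to link target:

def link_infobox_value(value, article_links):
    """article_links: dict mapping anchor text -> link target in the article."""
    # Replace a plain string value by the target of a hyperlink with the
    # same anchor text, if one exists; otherwise keep the string.
    return article_links.get(value.strip(), value)

links = {"Ludwig van Beethoven": "http://dbpedia.org/resource/Ludwig_van_Beethoven"}
print(link_infobox_value("Ludwig van Beethoven", links))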

Orthogonal to the previously mentioned improve-ments there have been various efforts to assess the qual-ity of the DBpedia datasets [26] developed a frame-work for estimating the quality of DBpedia and a sam-ple of 75 resources were analysed A more compre-hensive effort was performed in [48] by providing adistributed web-based interface [25] for quality assess-ment In this study 17 data quality problem types wereanalysed by 58 users covering 521 resources in DBpe-dia

3. DBpedia Ontology

The DBpedia ontology consists of 320 classes, which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances.

The complete DBpedia ontology can be browsed online at http://mappings.dbpedia.org/server/ontology/classes/.

Fig. 4. Growth of the DBpedia ontology: number of classes and properties per DBpedia version (3.2 to 3.8)

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much, due to the already good coverage of the initial version of the ontology, the number of properties increases over time, due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1. Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active.


Fig. 5. Mapping coverage statistics for all mapping-enabled languages

Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates.


Fig. 7. English property mappings occurrence frequency (both axes are in log scale)

For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year have a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long tail distribution: a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2. Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = overall number of instances in the data set, including instances without infobox data; with MD = number of instances for which mapping-based infobox data exists; Raw Prop. = number of different properties that are generated by the raw infobox extractor; Map. Prop. = number of different properties that are generated by the mapping-based infobox extractor; Raw Statem. = number of statements (facts) that are generated by the raw infobox extractor; Map. Statem. = number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been created so far in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of popular classes from the third and fourth hierarchy levels of the ontology within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to the specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described in only a single language version; the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12,936 persons are described in five languages but not in six or more languages. The number 871,630 for the class Person means that all 20 language versions together describe 871,630 different persons. The number is higher than the number of persons described in the canonicalized English infobox


Fig. 6. Mapping community activity for (a) the ontology and (b) the 10 most active language editions.

Table 2. Basic statistics about localized DBpedia editions.

Lang  Inst. LD all  Inst. CD all  Inst. with MD CD  Raw Prop. CD  Map. Prop. CD  Raw Statem. CD  Map. Statem. CD
en    3,769,926     3,769,926     2,359,521         48,293        1,313          65,143,840      33,742,015
de    1,243,771     650,037       204,335           9,593         261            7,603,562       2,880,381
fr    1,197,334     740,044       214,953           13,551        228            8,854,322       2,901,809
it    882,127       580,620       383,643           9,716         181            12,227,870      4,804,731
es    879,091       542,524       310,348           14,643        476            7,740,458       4,383,206
pl    848,298       538,641       344,875           7,306         266            7,696,193       4,511,794
ru    822,681       439,605       123,011           13,522        76             6,973,305       1,389,473
pt    699,446       460,258       272,660           12,851        602            6,255,151       4,005,527
ca    367,362       241,534       112,934           8,696         183            3,689,870       1,301,868
cs    225,133       148,819       34,893            5,564         334            1,857,230       474,459
hu    209,180       138,998       63,441            6,821         295            2,506,399       601,037
ko    196,132       124,591       30,962            7,095         419            1,035,606       417,605
tr    187,850       106,644       40,438            7,512         440            1,350,679       556,943
ar    165,722       103,059       16,236            7,898         268            635,058         168,686
eu    132,877       108,713       41,401            2,245         19             2,255,897       532,709
sl    129,834       73,099        22,036            4,235         470            1,213,801       222,447
bg    125,762       87,679        38,825            3,984         274            774,443         488,678
hr    109,890       71,469        10,343            3,334         158            701,182         151,196
el    71,936        48,260        10,813            2,866         288            206,460         113,838

data set (763,643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195,263 persons are described in two or more languages. The large difference between this number and the total number of 871,630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor, alongside the live synchronisation approaches in [20], allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages8. The DBpedia 3.7 release9 in September 2011 was the first DBpedia release to use the localized I18n (Internationalisation) DBpedia extraction framework [24].

8. Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr), Urdu (ur).

9. http://blog.dbpedia.org/2011/09/11/


Table 3. Number of instances per class within 10 localized DBpedia versions.

Class           en       it       pl       es       pt       fr      de       ru      ca      hu
Person          763,643  145,060  70,708   65,337   43,057   62,942  33,122   18,620  7,107   15,529
  Athlete       185,126  47,187   30,332   19,482   14,130   21,646  31,237   0       721     4,527
  Artist        61,073   12,511   16,120   25,992   10,571   13,465  0        0       2,004   3,821
  Politician    23,096   0        8,943    5,513    3,342    0       0        12,004  1,376   760
Place           572,728  141,101  182,727  132,961  116,660  80,602  131,766  67,932  73,078  18,324
  PopulPlace    387,166  138,077  167,034  121,204  109,418  72,252  79,410   63,826  72,743  15,535
  Building      60,514   1,270    2,946    3,570    803      921     83       43      0       527
  River         24,267   0        1,383    2        4,149    3,333   6,707    3,924   0       565
Organisation    192,832  4,142    12,193   11,710   10,949   17,513  16,973   1,598   1,399   3,993
  Company       44,516   4,142    2,566    975      1,903    5,832   7,200    0       440     618
  EducInst      42,270   0        599      1,207    270      1,636   2,171    1,010   0       115
  Band          27,061   0        2,993    0        4,476    3,868   5,368    0       263     802
Work            333,269  51,918   32,386   36,484   33,869   39,195  18,195   34,363  5,240   9,777
  MusicWork     159,070  23,652   14,987   15,911   17,053   16,206  0        6,588   697     4,704
  Film          71,715   17,210   9,467    9,396    8,896    9,741   15,038   12,045  3,859   2,320
  Software      27,947   5,682    3,050    4,833    3,419    5,733   2,368    0       606     857

Table 4. Cross-language overlap: number of instances that are described in multiple languages.

Class         Instances  1        2        3       4       5       6       7       8       9       10+
Person        871,630    676,367  94,339   42,382  21,647  12,936  8,198   5,295   3,437   2,391   4,638
Place         643,260    307,729  150,349  45,836  36,339  20,831  13,523  20,808  31,422  11,262  5,161
Organisation  206,670    160,398  22,661   9,312   5,002   3,221   2,072   1,421   928     594     1,061
Work          360,808    243,706  54,855   23,097  12,605  8,277   5,732   4,007   2,911   1,995   3,623

At the time of writing, DBpedia chapters have been founded for 14 languages: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish10. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services.

While at the moment chapters are defined by ownership of the IP address and server behind the subdomain A record (e.g. http://ko.dbpedia.org) assigned by the DBpedia maintainers, the DBpedia internationalisation committee11 is formalising its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek12 and Dutch13), the existence of

10. Accessed on 25/09/2013: http://wiki.dbpedia.org/Internationalization/Chapters

11. http://wiki.dbpedia.org/Internationalization

12. http://el.dbpedia.org
13. http://nl.dbpedia.org

a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].

In the weeks leading up to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. They can therefore be used by each language edition to prioritize properties and templates with higher impact on the coverage.

The mapping statistics have also been used as a way to promote healthy competition between language editions. A sprint page was created with bar charts that show how close each language is to achieving total coverage (as shown in Figure 5), and line charts showing the progress over time, highlighting when one


language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading up to a release.

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate: e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute14. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [20,36]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported15. Meanwhile, DBpedia Live for Dutch16 was developed.

4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia foundation kindly provided us access to their update stream using the OAI-PMH protocol [27]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework and to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is depicted in Figure 8. The major components of the system are as follows:

14. http://stats.wikimedia.org/EN/SummaryEN.htm

15. http://live.dbpedia.org
16. http://live.nl.dbpedia.org

Fig. 8. Overview of the DBpedia Live extraction framework.

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki, described in Section 2.4, serves as a secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.
– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separation of data from different sources.


Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator), or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.
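
The set of affected entities can be determined with a simple type query; the following is a minimal sketch for the book example above (not the framework's actual implementation, which works on page feeds):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>

# All resources typed as dbo:Book; their source pages would be
# queued for re-extraction after a mapping change.
SELECT ?resource
WHERE {
  ?resource rdf:type dbo:Book .
}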

Unmodified Pages: Extraction framework improvements or the activation / deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feeds and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and another set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
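
In terms of SPARQL 1.1 Update, applying one decompressed changeset pair amounts to a pair of operations of the following form (a minimal sketch with an invented example triple; the actual changeset contents are the published N-Triples files):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# First remove the triples listed in the 'deleted' changeset file ...
DELETE DATA {
  dbr:Berlin dbo:populationTotal "3499879"^^xsd:integer .
} ;
# ... then insert the triples listed in the 'added' changeset file.
INSERT DATA {
  dbr:Berlin dbo:populationTotal "3517424"^^xsd:integer .
}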

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds is contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) is kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Future versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
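
Given this graph layout, a client can scope a query to a single source. The following sketch, assuming the graph URIs named above, counts only the triples that stem from article updates:

# Count the triples extracted from article updates only.
SELECT (COUNT(*) AS ?articleTriples)
WHERE {
  GRAPH <http://live.dbpedia.org> {
    ?s ?p ?o .
  }
}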

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as of the external data sets that set links pointing to DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository17, for which conventions for adding linksets and linkset metadata are defined. Adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release, as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking, as well as the overall number of links set between DBpedia and the external data set. The last column names the tool that was used to generate the links: the value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset was copied from previous releases and not regenerated.

An example for the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of that country (from DBpedia)18:

SELECT * {
  {
    SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
    }
  }
  {
    SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
    }
  }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

17. https://github.com/dbpedia/dbpedia-links
18. Endpoint: http://fts.publicdata.eu/sparql; results: http://bit.ly/1c2mIwQ


Table 5. Data sets linked from DBpedia.

Data set                 Predicate               Count       Tool
Amsterdam Museum         owl:sameAs              627         S
BBC Wildlife Finder      owl:sameAs              444         S
Book Mashup              rdf:type, owl:sameAs    9,100
Bricklink                dc:publisher            10,100
CORDIS                   owl:sameAs              314         S
Dailymed                 owl:sameAs              894         S
DBLP Bibliography        owl:sameAs              196         S
DBTune                   owl:sameAs              838         S
Diseasome                owl:sameAs              2,300       S
Drugbank                 owl:sameAs              4,800       S
EUNIS                    owl:sameAs              3,100       S
Eurostat (Linked Stats)  owl:sameAs              253         S
Eurostat (WBSG)          owl:sameAs              137
CIA World Factbook       owl:sameAs              545         S
flickr wrappr            dbp:hasPhotoCollection  3,800,000   C
Freebase                 owl:sameAs              3,600,000   C
GADM                     owl:sameAs              1,900
GeoNames                 owl:sameAs              86,500      S
GeoSpecies               owl:sameAs              16,000      S
GHO                      owl:sameAs              196         L
Project Gutenberg        owl:sameAs              2,500       S
Italian Public Schools   owl:sameAs              5,800       S
LinkedGeoData            owl:sameAs              103,600     S
LinkedMDB                owl:sameAs              13,800      S
MusicBrainz              owl:sameAs              23,000
New York Times           owl:sameAs              9,700
OpenCyc                  owl:sameAs              27,100      C
OpenEI (Open Energy)     owl:sameAs              678         S
Revyu                    owl:sameAs              6
Sider                    owl:sameAs              2,000       S
TCMGeneDIT               owl:sameAs              904
UMBEL                    rdf:type                896,400
US Census                owl:sameAs              12,600
WikiCompany              owl:sameAs              8,300
WordNet                  dbp:wordnet_type        467,100
YAGO2                    rdf:type                18,100,000

Sum                                              27,211,732

In addition to providing outgoing links on the instance level, DBpedia also sets links on the schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively.

In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent-class and 31 equivalent-property links pointing to http://schema.org terms.
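
Such schema-level links can be retrieved directly from the public endpoint; the following is a minimal sketch (assuming the equivalence triples are served alongside the ontology) that lists the classes aligned with Schema.org:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# DBpedia ontology classes with a declared Schema.org equivalent.
SELECT ?dboClass ?schemaOrgClass
WHERE {
  ?dboClass owl:equivalentClass ?schemaOrgClass .
  FILTER (STRSTARTS(STR(?schemaOrgClass), "http://schema.org/"))
}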

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub19. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of its subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources it stores. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia; 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia.

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub20. However, it crawls RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the

19. See http://wiki.dbpedia.org/Interlinking for details.

20. http://datahub.io


Table 6. Top 10 data sets in Sindice, ordered by the number of links to DBpedia.

Data set        Link Predicates  Link Count
okaboo.com      4                2,407,121
tfri.gov.tw     57               837,459
naplesplus.us   2                212,460
fu-berlin.de    7                145,322
freebase.com    108              138,619
geonames.org    3                125,734
opencyc.org     3                19,648
geospecies.org  10               16,188
dbrec.net       3                12,856
faviki.com      5                12,768

Table 7. Sindice summary statistics for incoming links to DBpedia.

Metric                     Value
Total links                3,960,212
Total distinct data sets   248
Total distinct predicates  345

Table 8. Top 10 datasets by incoming links in Sindice.

Domain               Datasets  Links
purl.org             498       6,717,520
dbpedia.org          248       3,960,212
creativecommons.org  2,483     3,030,910
identi.ca            1,021     2,359,276
l3s.de               34        1,261,487
rkbexplorer.com      24        1,212,416
nytimes.com          27        1,174,941
w3.org               405       658,535
geospecies.org       13        523,709
livejournal.com      14,881    366,025

web of data, and DBpedia is currently ranked second in terms of incoming links.

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, 2012-Q1 to 2012-Q4.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to download popularity as well. The top five languages with respect to download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition are shown in Figure 9. Hosting the DBpedia dataset downloads currently requires a bandwidth of approximately 6 TB per month.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month21.

Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Presumably, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines

21. The IP address was only filtered for that specific file and month in those cases.

22. http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume (in GB) of the DBpedia data sets.

Table 9. Hardware of the machines serving the public SPARQL endpoint.

DBpedia    Configuration
3.3 - 3.4  AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 - 3.7  Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8        Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB

or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200-second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15), as well as a bandwidth limit per request (currently 10 MB), are enforced. If the client software can handle compression,

Table 11. Number of unique sites accessing DBpedia endpoints.

DBpedia  Avg/Day  Median  Stdev  Maximum
3.3      5,824    5,977   1,046  7,865
3.4      6,656    6,704   947    8,312
3.5      9,392    9,432   1,643  12,584
3.6      10,810   10,810  1,612  14,798
3.7      17,077   17,091  6,782  33,288
3.8      14,513   14,493  2,783  20,510

replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code23, like 509 ("Bandwidth Limit Exceeded"), and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.2124. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by

23. http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

24. http://www.webalizer.org


Table 10. Number of endpoint hits (left) and visits (right).

Hits:
DBpedia  Avg/Day    Median     Stdev      Maximum
3.3      733,811    711,306    188,991    1,319,539
3.4      1,212,549  1,165,893  351,226    2,371,657
3.5      1,122,612  1,035,444  386,454    2,908,720
3.6      1,328,355  1,286,750  256,945    2,495,031
3.7      2,085,399  1,930,728  1,057,398  8,873,473
3.8      2,910,410  2,717,775  1,085,640  7,678,490

Visits:
DBpedia  Avg/Day  Median  Stdev   Maximum
3.3      9,750    9,775   2,036   13,126
3.4      11,329   11,473  1,546   14,198
3.5      16,592   16,902  2,969   23,129
3.6      19,471   17,664  5,691   56,064
3.7      23,972   22,262  10,741  127,869
3.8      16,851   16,711  2,960   27,416

the Median and Standard Deviation. The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x that redirects the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs pointing to the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the

Table 12. Hits per service to http://dbpedia.org, in thousands.

Endpoint  3.3    3.4     3.5    3.6     3.7     3.8
data      1,355  2,681   2,230  2,547   3,714   4,246
ontology  80     168     142    178     116     97
page      2,634  4,703   1,810  1,687   3,787   7,377
property  231    311     137    176     176     128
resource  2,976  4,080   2,554  2,309   4,436   7,143
sparql    2,090  4,177   2,266  9,112   15,902  15,475
other     252    420     342    277     579     695
total     9,619  16,541  9,434  16,286  28,710  35,142

Fig. 11. Traffic: Linked Data versus SPARQL endpoint.

resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type.


Table 13. Hits per statement type, in thousands.

Statement  3.3    3.4    3.5    3.6    3.7     3.8
ask        56     269    360    326    159     677
construct  55     31     14     11     22      38
describe   11     8      4      7      62      111
select     1,891  3,663  1,658  8,030  11,204  13,516
unknown    78     206    229    738    4,455   1,134
total      2,090  4,177  2,266  9,112  15,902  15,475

Table 14. Trends in SPARQL select (rounded values in %).

Statement  3.3   3.4   3.5   3.6   3.7   3.8
distinct   19.5  11.4  17.3  19.4  13.3  25.4
filter     45.7  13.7  31.8  25.3  29.2  36.1
functions  8.8   6.9   23.5  21.3  25.5  25.9
geo        27.7  7.0   39.6  6.7   9.3   9.2
group      0.0   0.0   0.0   0.0   0.0   0.2
limit      4.6   6.5   11.6  10.5  7.7   7.8
optional   30.6  23.8  47.3  26.3  16.7  17.2
order      2.3   1.3   1.9   3.2   1.2   1.6
union      3.3   3.2   26.4  11.3  12.1  20.6

As the log files only record the full SPARQL query on a GET request, all POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
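
For illustration, a single logged query can contribute to several of these counters at once. The following invented example (not taken from the logs) would be counted under DISTINCT, OPTIONAL, FILTER, ORDER BY and LIMIT:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# The ten most populous cities with an English label.
SELECT DISTINCT ?city ?population
WHERE {
  ?city a dbo:City ;
        rdfs:label ?label .
  OPTIONAL { ?city dbo:populationTotal ?population . }
  FILTER (LANG(?label) = "en")
}
ORDER BY DESC(?population)
LIMIT 10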

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12. Number of daily requests sent to DBpedia Live for (a) SPARQL queries and (b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Applications include improving the search and exploration of Wikipedia data, data proliferation for applications and mashups, as well as text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have

25. http://wiki.dbpedia.org/Datasets/NLP


been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation / Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as the key phrase extraction and entity linking tasks – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open-source tool26 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as on the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, Semantic API

26. http://spotlight.dbpedia.org
27. http://www.alchemyapi.com

from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.

A related project is ImageSnippets32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation: Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy and relies on several data sets including DBpedia35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such

28. http://www.ontos.com
29. http://www.opencalais.com
30. http://www.zemanta.com
31. http://stanbol.apache.org
32. http://www.imagesnippets.com
33. http://www.faviki.com
34. http://bbc.co.uk
35. http://www.aaai.org/Magazine/Watson/watson.php
36. http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


systems. Due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project37. Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia38. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37. http://viaf.org
38. Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either use DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of such tools, structured by category.

Facet-Based Browsers: An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and to filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LodLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying: The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows users to explore the neighborhood of, and connections between, resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows users to create statistical analysis queries and correlations in RDF data, and in DBpedia in particular.

Spatial Applications: DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources,

39. http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40. http://dbpedia.org/fct
41. http://querybuilder.dbpedia.org


available in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains43. Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast-changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data44. Statistics are shown in Table 15.

8 Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata45. Wikidata is a free knowledge base about the world that can be read and

42. http://s23.org/wikistats/wiktionaries_html.php

43. See http://en.wiktionary.org/wiki/semantic for a simple example page.

44. http://wiktionary.dbpedia.org
45. http://wikidata.org

edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but rather at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.
– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other

46. http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47. http://www.freebase.com


Table 15. Statistical comparison of extractions for different languages.

Language  Words      Triples     Resources   Predicates  Senses
en        2,142,237  28,593,364  11,804,039  28          424,386
fr        4,657,817  35,032,121  20,462,349  22          592,351
ru        1,080,156  12,813,437  5,994,560   17          149,859
de        701,739    5,618,508   2,966,867   16          122,362

and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows the efficient storage of metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary, since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in the focus areas of Google and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its

48. http://www.mpi-inf.mpg.de/yago-naga/yago

content. YAGO focuses on extracting a smaller number of relations compared to DBpedia, in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 which are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library) [49] is an open-source Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also

49. http://www.mediawiki.org/wiki/Alternative_parsers
50. http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction


provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [37]: given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation or for synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improving the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort to build an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach, which achieves better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links, is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51. http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
52. http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53. http://sigwp.org/en
54. http://baike.baidu.com
55. http://www.hudong.com

KnowItAll56 is a web-scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common-sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that, in the past four years, DBpedia has matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project:

Multilingual data integration and fusion: An area which is still largely unexplored is the integration and

56. http://www.cs.washington.edu/research/knowitall


fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement: In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction: Currently, DBpedia extracts information primarily from templates. In the future, we envision also extracting semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birth date of a person always lies before the death date, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
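As a minimal sketch of such a sanity check (our own illustration, using properties from the DBpedia ontology), the following SPARQL query retrieves persons whose recorded death date precedes their birth date:

    PREFIX dbo: <http://dbpedia.org/ontology/>

    # Persons whose death date lies before their birth date --
    # likely editing mistakes that can be reported back to Wikipedians
    SELECT ?person ?birth ?death
    WHERE {
      ?person a dbo:Person ;
              dbo:birthDate ?birth ;
              dbo:deathDate ?death .
      FILTER (?death < ?birth)
    }
    LIMIT 100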

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (depends on training data) and update (requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk, and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16: List of namespace prefixes

Prefix   Namespace
dbo      http://dbpedia.org/ontology/
dbp      http://dbpedia.org/property/
dbr      http://dbpedia.org/resource/
dbr-de   http://de.dbpedia.org/resource/
dc       http://purl.org/dc/elements/1.1/
foaf     http://xmlns.com/foaf/0.1/
geo      http://www.w3.org/2003/01/geo/wgs84_pos#
georss   http://www.georss.org/georss/
ls       http://spotlight.dbpedia.org/scores/
lx       http://spotlight.dbpedia.org/lexicalizations/
owl      http://www.w3.org/2002/07/owl#
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
skos     http://www.w3.org/2004/02/skos/core#
sptl     http://spotlight.dbpedia.org/vocab/
xsd      http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17: DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
2007  Jun    ESWC DBpedia article [2]
2007  Nov    DBpedia 2.0 release
2007  Nov    ISWC DBpedia article [1]
2007  Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
2008  Aug    DBpedia 3.1 release
2008  Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
2009  Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
2010  Mar    DBpedia 3.4 release
2010  Apr    DBpedia 3.5 release
2010  May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
2011  Mar    DBpedia 3.6 release
2011  Jul    DBpedia Live release
2011  Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
2012  Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A tool for crowdsourcing the quality assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.

[31] V. López, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: Electronic Library and Information Systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata, Semantics and Ontologies, 3:37–52, 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.




Fig. 3. Snapshot of a part of the DBpedia ontology. (Figure not reproduced; it shows top classes such as owl:Thing, dbo:Agent, dbo:Person, dbo:Athlete, dbo:Place, dbo:PopulatedPlace, dbo:Settlement, dbo:Organisation, dbo:Species, dbo:Eukaryote and dbo:Work connected by rdfs:subClassOf links, together with example object properties such as dbo:birthPlace, dbo:writer, dbo:producer and dbo:location, and datatype properties such as dbo:birthDate, dbo:deathDate, dbo:populationTotal, dbo:areaTotal, dbo:elevation, dbo:utcOffset and dbo:areaCode with their XSD range types.)

Finally, a new heuristic to increase the connectiveness of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.

Orthogonal to the previously mentioned improvements, there have been various efforts to assess the quality of the DBpedia datasets. In [26], a framework for estimating the quality of DBpedia was developed and a sample of 75 resources was analysed. A more comprehensive effort was performed in [48] by providing a distributed web-based interface [25] for quality assessment. In this study, 17 data quality problem types were analysed by 58 users covering 521 resources in DBpedia.

3 DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances. The complete DBpedia ontology can be browsed online at http://mappings.dbpedia.org/server/ontology/classes/.
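As a minimal illustration of how the subsumption hierarchy can be inspected programmatically (our own sketch, runnable against the public endpoint), the following SPARQL query lists the direct subclasses of dbo:Person:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>

    # Direct subclasses of dbo:Person in the DBpedia ontology
    SELECT ?subclass
    WHERE {
      ?subclass rdfs:subClassOf dbo:Person .
    }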

Fig. 4. Growth of the DBpedia ontology. (Figure not reproduced; it plots the number of ontology elements, classes and properties, for DBpedia versions 3.2 through 3.8, on a y-axis ranging from 0 to 2,000.)

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much, due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1 Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

Fig. 5. Mapping coverage statistics for all mapping-enabled languages.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates.


Fig. 7. English property mappings occurrence frequency (both axes are in log scale).

For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year have a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long tail distribution. Thus, a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2 Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5), CD = Canonicalized data sets (see Section 2.5), all = Overall number of instances in the data set including instances without infobox data, with MD = Number of instances for which mapping-based infobox data exists, Raw Properties = Number of different properties that are generated by the raw infobox extractor, Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor, Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor, Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been so far created in the Mapping Wiki for a specific language.

Table 3 reports the number of instances for a set of popular classes from the third and fourth hierarchy level of the ontology within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages, but not in three or more languages, etc. For example, 12,936 persons are described in five languages, but not in six or more languages. The number 871,630 for the class Person means that all 20 language versions together describe 871,630 different persons. The number is higher than the number of persons described in the canonicalized English infobox


Fig. 6. Mapping community activity for (a) the ontology and (b) the 10 most active language editions.

Table 2: Basic statistics about Localized DBpedia Editions

Lang  Inst. LD all  Inst. CD all  Inst. with MD CD  Raw Prop. CD  Map Prop. CD  Raw Statem. CD  Map Statem. CD
en       3,769,926     3,769,926         2,359,521        48,293         1,313      65,143,840      33,742,015
de       1,243,771       650,037           204,335         9,593           261       7,603,562       2,880,381
fr       1,197,334       740,044           214,953        13,551           228       8,854,322       2,901,809
it         882,127       580,620           383,643         9,716           181      12,227,870       4,804,731
es         879,091       542,524           310,348        14,643           476       7,740,458       4,383,206
pl         848,298       538,641           344,875         7,306           266       7,696,193       4,511,794
ru         822,681       439,605           123,011        13,522            76       6,973,305       1,389,473
pt         699,446       460,258           272,660        12,851           602       6,255,151       4,005,527
ca         367,362       241,534           112,934         8,696           183       3,689,870       1,301,868
cs         225,133       148,819            34,893         5,564           334       1,857,230         474,459
hu         209,180       138,998            63,441         6,821           295       2,506,399         601,037
ko         196,132       124,591            30,962         7,095           419       1,035,606         417,605
tr         187,850       106,644            40,438         7,512           440       1,350,679         556,943
ar         165,722       103,059            16,236         7,898           268         635,058         168,686
eu         132,877       108,713            41,401         2,245            19       2,255,897         532,709
sl         129,834        73,099            22,036         4,235           470       1,213,801         222,447
bg         125,762        87,679            38,825         3,984           274         774,443         488,678
hr         109,890        71,469            10,343         3,334           158         701,182         151,196
el          71,936        48,260            10,813         2,866           288         206,460         113,838

data set (763,643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195,263 persons are described in two or more languages. The large difference of this number compared to the total number of 871,630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [20] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages8. The DBpedia 3.7 release9 in September 2011 was the first DBpedia release to use the localized I18n (Internationalisation) DBpedia extraction framework [24].

8 Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr), Urdu (ur)

9 http://blog.dbpedia.org/2011/09/11/


Table 3: Number of instances per class within 10 localized DBpedia versions (indented classes are subclasses of the preceding superclass)

Class             en       it       pl       es       pt       fr       de       ru      ca      hu
Person         763,643  145,060   70,708   65,337   43,057   62,942   33,122   18,620   7,107  15,529
  Athlete      185,126   47,187   30,332   19,482   14,130   21,646   31,237        0     721   4,527
  Artist        61,073   12,511   16,120   25,992   10,571   13,465        0        0   2,004   3,821
  Politician    23,096        0    8,943    5,513    3,342        0        0   12,004   1,376     760
Place          572,728  141,101  182,727  132,961  116,660   80,602  131,766   67,932  73,078  18,324
  PopulPlace   387,166  138,077  167,034  121,204  109,418   72,252   79,410   63,826  72,743  15,535
  Building      60,514    1,270    2,946    3,570      803      921       83       43       0     527
  River         24,267        0    1,383        2    4,149    3,333    6,707    3,924       0     565
Organisation   192,832    4,142   12,193   11,710   10,949   17,513   16,973    1,598   1,399   3,993
  Company       44,516    4,142    2,566      975    1,903    5,832    7,200        0     440     618
  EducInst      42,270        0      599    1,207      270    1,636    2,171    1,010       0     115
  Band          27,061        0    2,993        0    4,476    3,868    5,368        0     263     802
Work           333,269   51,918   32,386   36,484   33,869   39,195   18,195   34,363   5,240   9,777
  MusicWork    159,070   23,652   14,987   15,911   17,053   16,206        0    6,588     697   4,704
  Film          71,715   17,210    9,467    9,396    8,896    9,741   15,038   12,045   3,859   2,320
  Software      27,947    5,682    3,050    4,833    3,419    5,733    2,368        0     606     857

Table 4: Cross-language overlap: Number of instances that are described in multiple languages

Class         Instances        1        2       3       4       5       6       7       8       9    10+
Person          871,630  676,367   94,339  42,382  21,647  12,936   8,198   5,295   3,437   2,391  4,638
Place           643,260  307,729  150,349  45,836  36,339  20,831  13,523  20,808  31,422  11,262  5,161
Organisation    206,670  160,398   22,661   9,312   5,002   3,221   2,072   1,421     928     594  1,061
Work            360,808  243,706   54,855  23,097  12,605   8,277   5,732   4,007   2,911   1,995  3,623

At the time of writing, DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish10. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services.

While at the moment chapters are defined by ownership of the IP and server of the sub domain A record (e.g. http://ko.dbpedia.org) given by DBpedia maintainers, the DBpedia internationalisation committee11 is manifesting its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek12 and Dutch13), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].

10 Accessed on 25/09/2013: http://wiki.dbpedia.org/Internationalization/Chapters
11 http://wiki.dbpedia.org/Internationalization
12 http://el.dbpedia.org
13 http://nl.dbpedia.org

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage.

The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is from achieving total coverage (as shown in Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute14. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [20,36]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported15. Meanwhile, DBpedia Live for Dutch16 was developed.

4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia foundation kindly provided us access to their update stream using the OAI-PMH protocol [27]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

14 http://stats.wikimedia.org/EN/SummaryEN.htm
15 http://live.dbpedia.org
16 http://live.nl.dbpedia.org

Fig. 8. Overview of the DBpedia Live extraction framework.

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.
– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separate data from different sources.


Mapping-Affected Pages. Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.
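The kind of lookup involved can be sketched with a simple SPARQL query (our illustration; the framework's internal mechanism may differ): after a change to the dbo:Book mapping, the entities whose source pages need reprocessing are those typed as dbo:Book:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>

    # All entities currently typed as dbo:Book; their Wikipedia source
    # pages are the candidates for reprocessing after a mapping change
    SELECT ?entity
    WHERE {
      ?entity rdf:type dbo:Book .
    }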

Unmodified Pages. Extraction framework improvements or activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed, and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets. Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set for the added triples and another set for the deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool. The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
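Conceptually, applying one changeset to a mirror corresponds to a pair of SPARQL 1.1 Update operations of the following form (a hypothetical sketch: the triples shown are invented examples, and the actual tool may batch and order operations differently):

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    # Remove the triples listed in the changeset's 'removed' file ...
    DELETE DATA {
      GRAPH <http://live.dbpedia.org> {
        <http://dbpedia.org/resource/Example> dbo:populationTotal "1000"^^xsd:integer .
      }
    } ;
    # ... then add the triples listed in the changeset's 'added' file
    INSERT DATA {
      GRAPH <http://live.dbpedia.org> {
        <http://dbpedia.org/resource/Example> dbo:populationTotal "1200"^^xsd:integer .
      }
    }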

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
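For example, a query can be restricted to the article-update data alone by naming the corresponding graph (a minimal sketch):

    # Restrict matching to the graph holding data from article updates,
    # excluding the static interlinks and the ontology graph
    SELECT ?s ?p ?o
    WHERE {
      GRAPH <http://live.dbpedia.org> {
        ?s ?p ?o .
      }
    }
    LIMIT 10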

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing to DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third party data sets. The DBpedia project maintains a link repository17 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third column contain the predicate that is used for linking, as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

An example for the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia)18:

    SELECT * {
      { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
          ?com rdf:type fts-o:Commitment .
          ?com fts-o:year ?year .
          ?year rdfs:label ?ftsyear .
          ?com fts-o:benefit ?benefit .
          ?benefit fts-o:detailAmount ?amount .
          ?benefit fts-o:beneficiary ?beneficiary .
          ?beneficiary fts-o:country ?country .
          ?country owl:sameAs ?ftscountry .
      } }
      { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
          ?dbpcountry rdf:type dbo:Country .
          ?dbpcountry dbp:gdpNominal ?gdpnominal .
          ?dbpcountry dbp:gdpNominalYear ?gdpyear .
      } }
      FILTER ((?ftsyear = str(?gdpyear)) &&
              (?ftscountry = ?dbpcountry))
    }

17 https://github.com/dbpedia/dbpedia-links
18 Endpoint: http://fts.publicdata.eu/sparql; results: http://bit.ly/1c2mIwQ


Table 5: Data sets linked from DBpedia

Data set                 Predicate                    Count  Tool
Amsterdam Museum         owl:sameAs                     627  S
BBC Wildlife Finder      owl:sameAs                     444  S
Book Mashup              rdf:type, owl:sameAs         9,100
Bricklink                dc:publisher                10,100
CORDIS                   owl:sameAs                     314  S
Dailymed                 owl:sameAs                     894  S
DBLP Bibliography        owl:sameAs                     196  S
DBTune                   owl:sameAs                     838  S
Diseasome                owl:sameAs                   2,300  S
Drugbank                 owl:sameAs                   4,800  S
EUNIS                    owl:sameAs                   3,100  S
Eurostat (Linked Stats)  owl:sameAs                     253  S
Eurostat (WBSG)          owl:sameAs                     137
CIA World Factbook       owl:sameAs                     545  S
flickr wrappr            dbp:hasPhotoCollection   3,800,000  C
Freebase                 owl:sameAs               3,600,000  C
GADM                     owl:sameAs                   1,900
GeoNames                 owl:sameAs                  86,500  S
GeoSpecies               owl:sameAs                  16,000  S
GHO                      owl:sameAs                     196  L
Project Gutenberg        owl:sameAs                   2,500  S
Italian Public Schools   owl:sameAs                   5,800  S
LinkedGeoData            owl:sameAs                 103,600  S
LinkedMDB                owl:sameAs                  13,800  S
MusicBrainz              owl:sameAs                  23,000
New York Times           owl:sameAs                   9,700
OpenCyc                  owl:sameAs                  27,100  C
OpenEI (Open Energy)     owl:sameAs                     678  S
Revyu                    owl:sameAs                       6
Sider                    owl:sameAs                   2,000  S
TCMGeneDIT               owl:sameAs                     904
UMBEL                    rdf:type                   896,400
US Census                owl:sameAs                  12,600
WikiCompany              owl:sameAs                   8,300
WordNet                  dbp:wordnet_type           467,100
YAGO2                    rdf:type                18,100,000

Sum                                              27,211,732

In addition to providing outgoing links on an instance-level, DBpedia also sets links on schema-level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively.

In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
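Since these schema-level links are part of the released data, they can be retrieved directly; the following query (our own sketch) lists DBpedia ontology classes declared equivalent to Schema.org classes:

    PREFIX owl: <http://www.w3.org/2002/07/owl#>

    # DBpedia ontology classes linked to equivalent schema.org classes
    SELECT ?dboClass ?schemaClass
    WHERE {
      ?dboClass owl:equivalentClass ?schemaClass .
      FILTER (STRSTARTS(STR(?schemaClass), "http://schema.org/"))
    }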

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub19. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data set of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia.

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub20. However, it crawls for RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

19 See http://wiki.dbpedia.org/Interlinking for details.
20 http://datahub.io

Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia

Data set         Link Predicate Count  Link Count
okaboo.com                          4   2,407,121
tfri.gov.tw                        57     837,459
naplesplus.us                       2     212,460
fu-berlin.de                        7     145,322
freebase.com                      108     138,619
geonames.org                        3     125,734
opencyc.org                         3      19,648
geospecies.org                     10      16,188
dbrec.net                           3      12,856
faviki.com                          5      12,768

Table 7: Sindice summary statistics for incoming links to DBpedia

Metric                         Value
Total links                3,960,212
Total distinct data sets         248
Total distinct predicates        345

Table 8: Top 10 datasets by incoming links in Sindice

Domain               Datasets      Links
purl.org                  498  6,717,520
dbpedia.org               248  3,960,212
creativecommons.org     2,483  3,030,910
identi.ca               1,021  2,359,276
l3s.de                     34  1,261,487
rkbexplorer.com            24  1,212,416
nytimes.com                27  1,174,941
w3.org                    405    658,535
geospecies.org             13    523,709
livejournal.com        14,881    366,025

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms. First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferencable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012. (Figure not reproduced.)

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to the download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition is indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 is depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month21.

Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of links exists between two resources. Supposedly, they are used for network analysis or providing relevant links in user interfaces, and are downloaded more often as they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-nodes cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

21 The IP address was only filtered for that specific file and month in those cases.
22 http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume (in GB) of the DBpedia data sets. (Figure not reproduced.)

Table 9: Hardware of the machines serving the public SPARQL endpoint

DBpedia    Configuration
3.3 – 3.4  AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 – 3.7  Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8        Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally use too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15), as well as a bandwidth limit per request (currently 10 MB), are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code23, like 509 ("Bandwidth Limit Exceeded"), and quickly drops the connection. The system further uses an iptables based firewall for permanent blocking of clients identified by their IP addresses.

Table 11: Number of unique sites accessing DBpedia endpoints

DBpedia  Avg/Day  Median   Stdev  Maximum
3.3        5,824   5,977   1,046    7,865
3.4        6,656   6,704     947    8,312
3.5        9,392   9,432   1,643   12,584
3.6       10,810  10,810   1,612   14,798
3.7       17,077  17,091   6,782   33,288
3.8       14,513  14,493   2,783   20,510


6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.2124. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by the Median and Standard Deviation. The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.

23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
24 http://www.webalizer.org


Table 10: Number of endpoint hits (left) and visits (right)

Hits:
DBpedia    Avg/Day     Median      Stdev    Maximum
3.3        733,811    711,306    188,991  1,319,539
3.4      1,212,549  1,165,893    351,226  2,371,657
3.5      1,122,612  1,035,444    386,454  2,908,720
3.6      1,328,355  1,286,750    256,945  2,495,031
3.7      2,085,399  1,930,728  1,057,398  8,873,473
3.8      2,910,410  2,717,775  1,085,640  7,678,490

Visits:
DBpedia  Avg/Day  Median   Stdev  Maximum
3.3        9,750   9,775   2,036   13,126
3.4       11,329  11,473   1,546   14,198
3.5       16,592  16,902   2,969   23,129
3.6       19,471  17,664   5,691   56,064
3.7       23,972  22,262  10,741  127,869
3.8       16,851  16,711   2,960   27,416


Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return a HTTP status code 30x to redirect the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.

Table 12: Hits per service to http://dbpedia.org, in thousands

Endpoint     3.3     3.4    3.5     3.6     3.7     3.8
data       1,355   2,681  2,230   2,547   3,714   4,246
ontology      80     168    142     178     116      97
page       2,634   4,703  1,810   1,687   3,787   7,377
property     231     311    137     176     176     128
resource   2,976   4,080  2,554   2,309   4,436   7,143
sparql     2,090   4,177  2,266   9,112  15,902  15,475
other        252     420    342     277     579     695
total      9,619  16,541  9,434  16,286  28,710  35,142

Fig. 11. Traffic: Linked Data versus SPARQL endpoint.


Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type.


Table 13: Hits per statement type, in thousands

Statement    3.3    3.4    3.5    3.6     3.7     3.8
ask           56    269    360    326     159     677
construct     55     31     14     11      22      38
describe      11      8      4      7      62     111
select     1,891  3,663  1,658  8,030  11,204  13,516
unknown       78    206    229    738   4,455   1,134
total      2,090  4,177  2,266  9,112  15,902  15,475

Table 14: Trends in SPARQL select (rounded values in %)

Statement   3.3   3.4   3.5   3.6   3.7   3.8
distinct   19.5  11.4  17.3  19.4  13.3  25.4
filter     45.7  13.7  31.8  25.3  29.2  36.1
functions   8.8   6.9  23.5  21.3  25.5  25.9
geo        27.7   7.0  39.6   6.7   9.3   9.2
group       0.0   0.0   0.0   0.0   0.0   0.2
limit       4.6   6.5  11.6  10.5   7.7   7.8
optional   30.6  23.8  47.3  26.3  16.7  17.2
order       2.3   1.3   1.9   3.2   1.2   1.6
union       3.3   3.2  26.4  11.3  12.1  20.6

As the log files only record the full SPARQL query on a GET request, all the POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
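For illustration, a single query can fall into several of the counted categories; the following example of our own (not taken from the logs) combines DISTINCT, OPTIONAL, FILTER, ORDER BY and LIMIT:

    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Cities with more than one million inhabitants,
    # with an optional English label
    SELECT DISTINCT ?city ?label
    WHERE {
      ?city a dbo:City ;
            dbo:populationTotal ?population .
      OPTIONAL { ?city rdfs:label ?label . FILTER (LANG(?label) = "en") }
      FILTER (?population > 1000000)
    }
    ORDER BY DESC(?population)
    LIMIT 10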

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12. Number of daily requests sent to the DBpedia Live endpoint for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, proliferation for applications, mashups, as well as text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

25 http://wiki.dbpedia.org/Datasets/NLP

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool26 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base described as SPARQL queries.

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, Semantic API from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com

A related project is ImageSnippets32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2. Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such

28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org
32 http://www.imagesnippets.com
33 http://www.faviki.com
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


systems. Due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.
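For illustration, a QALD-style question such as "Which books were written by Danielle Steel?" corresponds to a query along the following lines; the concrete modelling via dbo:author is our assumption of the usual mapping, and the queries generated by the cited systems may differ.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# "Which books were written by Danielle Steel?"
SELECT DISTINCT ?book
WHERE {
  ?book a dbo:Book ;
        dbo:author dbr:Danielle_Steel .
}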

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2. Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records. Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project37. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia38. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html
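Once such links are included, connecting an authority record to DBpedia reduces to following owl:sameAs links, as in the following sketch (the VIAF identifier shown is a made-up example of the http://viaf.org/viaf/ID URI pattern):

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Find DBpedia resources linked to a given VIAF authority record.
SELECT ?resource
WHERE {
  ?resource owl:sameAs <http://viaf.org/viaf/102333412> .
}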

7.3. Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of such tools and structure them in categories.

Facet Based Browsers. An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [15]. Another facet based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL based facet explorer, which also uses a graph based visualisation of facets, is LODLive [7]. The OpenLink built-in facet based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows users to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows users to create statistical analysis queries and correlations in RDF data, and DBpedia in particular.
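While RelFinder implements its own traversal strategy, the flavour of such relationship queries can be sketched directly in SPARQL, e.g. by enumerating two-hop connections between two persons (the example resources and the fixed path length are our choices):

PREFIX dbr: <http://dbpedia.org/resource/>

# Two-hop connections between two persons, in the spirit of
# RelFinder's relationship search.
SELECT ?p1 ?mid ?p2
WHERE {
  dbr:Albert_Einstein ?p1 ?mid .
  ?mid ?p2 dbr:Niels_Bohr .
}
LIMIT 20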

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4. Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words covering several domains43. Such descriptions provide for a lexical word a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany
40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org

Due to its fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data44. Statistics are shown in Table 15.

42 http://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page.
44 http://wiktionary.dbpedia.org
45 http://wikidata.org

8. Related Work

8.1. Cross Domain Community Knowledge Bases

8.1.1. Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata45. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.
– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2. Freebase

Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows metadata for each fact to be stored efficiently. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47 http://www.freebase.com

Table 15
Statistical comparison of extractions for different languages

language   words     triples     resources   predicates   senses
en         2142237   28593364    11804039    28           424386
fr         4657817   35032121    20462349    22           592351
ru         1080156   12813437    5994560     17           149859
de         701739    5618508     2966867     16           122362

8.1.3. YAGO

One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content. YAGO focuses on extracting a smaller number of relations compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

48 http://www.mpi-inf.mpg.de/yago-naga/yago

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.
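For example, the YAGO classes that DBpedia assigns to a resource can be inspected with a query like the following; the http://dbpedia.org/class/yago/ namespace is, to the best of our knowledge, where DBpedia publishes these classes:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbr: <http://dbpedia.org/resource/>

# List the YAGO classes assigned to a DBpedia resource.
SELECT ?yagoType
WHERE {
  dbr:Berlin rdf:type ?yagoType .
  FILTER(STRSTARTS(STR(?yagoType), "http://dbpedia.org/class/yago/"))
}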

8.2. Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 that are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text, applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML Markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipedia interlanguage links is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a web scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9. Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that, in the past four years, DBpedia has matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information, in Wikipedia.
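A minimal sketch of such a comparison, assuming the German chapter's public endpoint at http://de.dbpedia.org/sparql, that both editions expose dbo:populationTotal, and that the queried store supports SPARQL 1.1 federation, is the following:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbr-de: <http://de.dbpedia.org/resource/>

# Compare the population of Berlin between the English
# and German DBpedia editions (federated query).
SELECT ?popEn ?popDe
WHERE {
  dbr:Berlin dbo:populationTotal ?popEn .
  SERVICE <http://de.dbpedia.org/sparql> {
    dbr-de:Berlin dbo:populationTotal ?popDe .
  }
}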

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision extracting semantic information also from typed links. Typed links are a feature of Semantic MediaWiki that was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed in certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
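A minimal sketch of such a sanity check, returning all persons whose recorded death date precedes their birth date, could look as follows:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Sanity check: no person may die before being born.
SELECT ?person ?birth ?death
WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}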

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16
List of namespace prefixes

Prefix   Namespace
dbo      http://dbpedia.org/ontology/
dbp      http://dbpedia.org/property/
dbr      http://dbpedia.org/resource/
dbr-de   http://de.dbpedia.org/resource/
dc       http://purl.org/dc/elements/1.1/
foaf     http://xmlns.com/foaf/0.1/
geo      http://www.w3.org/2003/01/geo/wgs84_pos#
georss   http://www.georss.org/georss/
ls       http://spotlight.dbpedia.org/scores/
lx       http://spotlight.dbpedia.org/lexicalizations/
owl      http://www.w3.org/2002/07/owl#
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
skos     http://www.w3.org/2004/02/skos/core#
sptl     http://spotlight.dbpedia.org/vocab/
xsd      http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17
DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala, Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann. DBpedia - a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini and A. Antonuccio. LodLive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15:51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata, Semantics and Ontologies, 3:37–52, Nov. 10 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer and J. Lehmann. User-driven Quality Evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

View publication statsView publication stats

Page 11: DBpedia - A Large-scale, Multilingual Knowledge Base ...static.tongtianta.site/paper_pdf/38bd5a66-1521-11e9-bf90-00163e08bb… · DBpedia is served as Linked Data on the Web. Since

10 Lehmann et al DBpedia

Fig 5 Mapping coverage statistics for all mapping-enabled languages

statistics for the coverage of these mappings in DBpe-dia Figures (a) and (c) refer to the absolute numberof template and property mappings that are defined forevery DBpedia language edition Figures (b) and (d) de-pict the percentage of the defined template and propertymappings compared to the total number of availabletemplates and properties for every Wikipedia languageedition Figures (e) and (g) show the occurrences (in-stances) that the defined template and property map-pings have in Wikipedia Finally figures (f) and (h)give the percentage of the mapped templates and prop-erties occurences compared to the total templates andproperty occurences in a Wikipedia language edition

It can be observed in the figure that the PortugueseDBpedia language edition is the most complete regard-ing mapping coverage Other language editions suchas Bulgarian Dutch English Greek Polish and Span-ish have mapped templates covering more than 50of total template occurrences In addition almost alllanguages have covered more than 20 of property oc-currences with Bulgarian and Portuguese reaching upto 70

The mapping activity of the ontology enrichmentprocess along with the editing of the ten most activemapping language communities is depicted in Figure 6It is interesting to notice that the high mapping ac-tivity peaks coincide with the DBpedia release dates

Lehmann et al DBpedia 11

Fig 7 English property mappings occurrence frequency (both axesare in log scale)

For instance the DBpedia 37 version was released onSeptember 2011 and the 2nd and 3rd quarter of thatyear have a very high activity compared to the 4th quar-ter In the last two years (2012 and 2013) most of theDBpedia mapping language communities have definedtheir own chapters and have their own release datesThus recent mapping activity shows less fluctuation

Finally Figure 7 shows the English property map-pings occurrence frequency Both axes are in log scaleand represent the number of property mappings (x axis)that have exactly y occurrences (y axis) The occurrencefrequency follows a long tail distribution Thus a lownumber of property mappings have a high number ofoccurrences and a high number of property mappingshave a low number of occurences

32 Instance Data

The DBpedia 38 release contains localized versionsof DBpedia for 111 languages which have been ex-tracted from the Wikipedia edition in the correspond-ing language For 20 of these languages we report inthis section the overall number of entities being de-scribed by the localized versions as well as the num-ber of facts (ie statements) that have been extractedfrom infoboxes describing these things Afterwards wereport on the number of instances of popular classeswithin the 20 DBpedia versions as well as the concep-tual overlap between the languages

Table 2 shows the overall number of things ontol-ogy and raw-infobox properties infobox statements andtype statements for the 20 languages The column head-ings have the following meaning LD = Localized datasets (see Section 25) CD = Canonicalized data sets

(see Section 25) all = Overall number of instances inthe data set including instances without infobox datawith MD = Number of instances for which mapping-based infobox data exists Raw Properties = Numberof different properties that are generated by the rawinfobox extractor Mapping Properties = Number ofdifferent properties that are generated by the mapping-based infobox extractor Raw Statements = Number ofstatements (facts) that are generated by the raw infoboxextractor Mapping Statements = Number of statements(facts) that are generated by the mapping-based infoboxextractor

It is interesting to see that the English version of DB-pedia describes about three times more instances thanthe second and third largest language editions (FrenchGerman) Comparing the first column of the table withthe second and third reveals which portion of the in-stances of a specific language correspond to instancesin the English version of DBpedia and which portionof the instances is described by clean mapping-basedinfobox data The difference between the number ofproperties in the raw infobox data set and the cleanermapping-based infobox data set (columns 4 and 5) re-sults on the one hand from multiple Wikipedia infoboxproperties being mapped to a single ontology propertyOn the other hand it reflects the number of mappingsthat have been so far created in the Mapping Wiki for aspecific language

Table 3 reports the number of instances for a set ofpopular classes from the third and forth hierarchy levelof the ontology within the canonicalized DBpedia datasets for each language The indented classes are sub-classes of the superclasses set in bold The zero val-ues in the table indicate that no infoboxes have beenmapped to a specific ontology class within the corre-sponding language so far Again the English version ofDBpedia covers by far the most instances

Table 4 shows for the canonicalized mapping-baseddata set how many instances are described in multi-ple languages The Instances column contains the totalnumber of instances per class across all 20 languagesthe second column contains the number of instancesthat are described only in a single language version thenext column contains the number of instances that arecontained in two languages but not in three or more lan-guages etc For example 12936 persons are describedin five languages but not in six or more languages Thenumber 871630 for the class Person means that all20 language versions together describe 871630 differ-ent persons The number is higher than the number ofpersons described in the canonicalized English infobox

12 Lehmann et al DBpedia

Fig 6 Mapping community activity for (a) ontology and (b) 10 most active language editions

Table 2Basic statistics about Localized DBpedia Editions

Inst LD all Inst CD all Inst with MD CD Raw Prop CD Map Prop CD Raw Statem CD Map Statem CD

en 3769926 3769926 2359521 48293 1313 65143840 33742015de 1243771 650037 204335 9593 261 7603562 2880381fr 1197334 740044 214953 13551 228 8854322 2901809it 882127 580620 383643 9716 181 12227870 4804731

es 879091 542524 310348 14643 476 7740458 4383206pl 848298 538641 344875 7306 266 7696193 4511794ru 822681 439605 123011 13522 76 6973305 1389473pt 699446 460258 272660 12851 602 6255151 4005527ca 367362 241534 112934 8696 183 3689870 1301868cs 225133 148819 34893 5564 334 1857230 474459hu 209180 138998 63441 6821 295 2506399 601037ko 196132 124591 30962 7095 419 1035606 417605tr 187850 106644 40438 7512 440 1350679 556943ar 165722 103059 16236 7898 268 635058 168686eu 132877 108713 41401 2245 19 2255897 532709sl 129834 73099 22036 4235 470 1213801 222447

bg 125762 87679 38825 3984 274 774443 488678hr 109890 71469 10343 3334 158 701182 151196el 71936 48260 10813 2866 288 206460 113838

data set (763643) listed in Table 3 since there areinfoboxes in non-English articles describing a personwithout a corresponding infobox in the English articledescribing the same person Summing up columns 2 to10+ for the Person class we see that 195263 personsare described in two or more languages The large dif-ference of this number compared to the total number of871630 persons is due to the much smaller size of thelocalized DBpedia versions compared to the Englishone (cf Table 2)

33 Internationalisation Community

The introduction of the mapping-based infoboxextractor alongside live synchronisation approaches

in [20] allowed the international DBpedia communityto easily define infobox-to-ontology mappings As aresult of this development there are presently mappingsfor 27 languages8 The DBpedia 37 release9 in Septem-ber 2011 was the first DBpedia release to use the lo-calized I18n (Internationalisation) DBpedia extractionframework [24]

8Arabic (ar) Bulgarian (bg) Bengali (bn) Catalan (ca) Czech(cs) German (de) Greek (el) English (en) Spanish (es) Estonian(et) Basque (eu) French (fr) Irish (ga) Hindi (hi) Croatian (hr)Hungarian (hu) Indonesian (id) Italian (it) Japanese (ja) Korean(ko) Dutch (nl) Polish (pl) Portuguese (pt) Russian (ru) Slovene(sl) Turkish (tr) Urdu (ur)

9httpblogdbpediaorg20110911

Lehmann et al DBpedia 13

Table 3Number of instances per class within 10 localized DBpedia versions

en it pl es pt fr de ru ca hu

Person 763643 145060 70708 65337 43057 62942 33122 18620 7107 15529Athlete 185126 47187 30332 19482 14130 21646 31237 0 721 4527Artist 61073 12511 16120 25992 10571 13465 0 0 2004 3821Politician 23096 0 8943 5513 3342 0 0 12004 1376 760

Place 572728 141101 182727 132961 116660 80602 131766 67932 73078 18324PopulPlace 387166 138077 167034 121204 109418 72252 79410 63826 72743 15535Building 60514 1270 2946 3570 803 921 83 43 0 527River 24267 0 1383 2 4149 3333 6707 3924 0 565

Organisation 192832 4142 12193 11710 10949 17513 16973 1598 1399 3993Company 44516 4142 2566 975 1903 5832 7200 0 440 618EducInst 42270 0 599 1207 270 1636 2171 1010 0 115Band 27061 0 2993 0 4476 3868 5368 0 263 802

Work 333269 51918 32386 36484 33869 39195 18195 34363 5240 9777MusicWork 159070 23652 14987 15911 17053 16206 0 6588 697 4704Film 71715 17210 9467 9396 8896 9741 15038 12045 3859 2320Software 27947 5682 3050 4833 3419 5733 2368 0 606 857

Table 4Cross-language overlap Number of instances that are described in multiple languages

Class Instances 1 2 3 4 5 6 7 8 9 10+

Person 871630 676367 94339 42382 21647 12936 8198 5295 3437 2391 4638Place 643260 307729 150349 45836 36339 20831 13523 20808 31422 11262 5161Organisation 206670 160398 22661 9312 5002 3221 2072 1421 928 594 1061Work 360808 243706 54855 23097 12605 8277 5732 4007 2911 1995 3623

At the time of writing DBpedia chapters for 14 lan-guages have been founded Basque Czech Dutch En-glish French German Greek Italian Japanese Ko-rean Polish Portuguese Russian and Spanish10 Be-sides providing mappings from infoboxes in the corre-sponding Wikipedia editions DBpedia chapters organ-ise a local community and provide hosting for data setsand associated services

While at the moment chapters are defined by owner-ship of the IP and server of the sub domain A record(eg httpkodbpediaorg) given by DBpe-dia maintainers the DBpedia internationalisation com-mittee11 is manifesting its structure and each languageedition has a representative with a vote in elections Insome cases (eg Greek12 and Dutch13) the existence of

10Accessed on 25092013 httpwikidbpediaorgInternationalizationChapters

11httpwikidbpediaorgInternationalization

12httpeldbpediaorg13httpnldbpediaorg

a local DBpedia chapter has had a positive effect on thecreation of localized LOD clouds [24]

In the weeks leading to a new release the DBpe-dia project organises a mapping sprint where commu-nities from each language work together to improvemappings increase coverage and detect bugs in the ex-traction process The progress of the mapping effortis tracked through statistics on the number of mappedtemplates and properties as well as the number oftimes these templates and properties occur in WikipediaThese statistics provide an estimate of the coverage ofeach Wikipedia edition in terms of how many entitieswill be typed and how many properties from those en-tities will be extracted Therefore they can be usedby each language edition to prioritize properties andtemplates with higher impact on the coverage

The mapping statistics have also been used as a wayto promote a healthy competition between languageeditions A sprint page was created with bar charts thatshow how close each language is from achieving to-tal coverage (as shown in Figure 5) and line chartsshowing the progress over time highlighting when one

14 Lehmann et al DBpedia

language is overtaking another in their race for highercoverage The mapping sprints have served as a greatmotivator for the crowd-sourcing efforts as it can benoted from the increase in the number of mapping con-tributions in the weeks leading to a release

4 Live Synchronisation

Wikipedia articles are continuously revised at a veryhigh rate eg the English Wikipedia in June 2013has approximately 33 million edits per month whichis equal to 77 edits per minute14 This high changefrequency leads to DBpedia data quickly being out-dated which in turn leads to the need for a methodologyto keep DBpedia in synchronisation with WikipediaAs a consequence the DBpedia Live system was de-veloped which works on a continuous stream of up-dates from Wikipedia and processes that stream on thefly [2036] It allows extracted data to stay up-to-datewith a small delay of at most a few minutes Since theEnglish Wikipedia is the largest among all Wikipediaeditions with respect to the number of articles and thenumber of edits per month it was the first languageDBpedia Live supported15 Meanwhile DBpedia Livefor Dutch16 was developed

41 DBpedia Live System Architecture

In order for live synchronisation to be possible weneed access to the changes made in Wikipedia TheWikimedia foundation kindly provided us access totheir update stream using the OAI-PMH protocol [27]This protocol allows a programme to pull page updatesin XML via HTTP A Java component serving as aproxy constantly retrieves new updates and feeds themto the DBpedia Live framework This proxy is nec-essary to decouple the stream from the framework tosimplify maintenance of the software The live extrac-tion workflow uses this update stream to extract newknowledge upon relevant changes in Wikipedia articles

The overall architecture of DBpedia Live is indicatedin Figure 8 The major components of the system areas follows

14httpstatswikimediaorgENSummaryENhtm

15httplivedbpediaorg16httplivenldbpediaorg

Fig 8 Overview of DBpedia Live extraction framework

ndash Local Wikipedia Mirror A local copy of aWikipedia language edition is installed which iskept in real-time synchronisation with its live ver-sion using the OAI-PMH protocol Keeping a localWikipedia mirror allows us to exceed any accesslimits posed by Wikipedia

ndash Mappings Wiki The DBpedia Mappings Wikidescribed in Section 24 serves as secondary inputsource Changes of the mappings wiki are alsoconsumed via an OAI-PMH stream Note that asingle mapping change can affect a high numberof DBpedia resources

ndash DBpedia Live Extraction Manager This is thecore component of the DBpedia Live extractionarchitecture The manager takes feeds of pages forre-processing as input and applies all the enabledextractors After processing a page the extractedtriples are a) inserted into a backend triple store(in our case Virtuoso [10]) updating the old triplesand b) saved as changesets into a compressed N-Triples file structure

ndash Synchronisation Tool This tool allows third par-ties to keep DBpedia Live mirrors up-to-date byharvesting the produced changesets

42 Features of DBpedia Live

The core components of the DBpedia Live Extractionframework provide the following features

ndash Mapping-Affected Pages The update of all pagesthat are affected by a mapping change

ndash Unmodified Pages The update of unmodifiedpages at regular intervals

ndash Changesets Publication The publication of triple-changesets

ndash Synchronisation Tool A synchronisation tool forharvesting updates to DBpedia Live mirrors

ndash Data Isolation Separate data from differentsources

Lehmann et al DBpedia 15

Mapping-Affected Pages Whenever an infobox map-ping change occurs all the Wikipedia pages that usethat infobox are reprocessed Taking Figure 2 as anexample if a new property mapping is introduced (iedbotranslator) or an existing (ie dboillustrator) isupdated or deleted then all entities belonging to theclass dboBook are reprocessed Thus upon a mappingchange we identify all the affected Wikipedia pagesand feed them for reprocessing

Unmodified Pages Extraction framework improve-ments or activation deactivation of DBpedia extractorsmight never be applied to rarely modified pages Toovercome this problem we obtain a list of the pageswhich have not been processed over a period of time(30 days in our case) and feed that list to the DBpediaLive extraction framework for reprocessing This feedhas a lower priority than the update or the mappingaffected pages feed and ensures that all articles reflect arecent state of the output of the extraction framework

Publication of Changesets Whenever a Wikipediaarticle is processed we get two disjoint sets of triples Aset for the added triples and another set for the deletedtriples We write those two sets into N-Triples filescompress them and publish the compressed files aschangesets If another DBpedia Live mirror wants tosynchronise with the DBpedia Live endpoint it can justdownload those files decompress and integrate them

Synchronisation Tool The synchronisation tool en-ables a DBpedia Live mirror to stay in synchronisationwith our live endpoint It downloads the changeset filessequentially decompresses them and updates the targetSPARQL endpoint via insert and delete operations

Data Isolation In order to keep the data isolatedDBpedia Live keeps different sources of data indifferent SPARQL graphs Data from the articleupdate feeds are contained in the graph with theURI httplivedbpediaorg static data (ielinks to the LOD cloud) are kept in httpstaticdbpediaorg and the DBpedia ontology is storedin httpdbpediaorgontology All data isalso accessible under the httpdbpediaorggraph for combined queries Next versions of DBpe-dia Live will also separate data from the raw infoboxextraction and mapping-based infobox extraction

5 Interlinking

DBpedia is interlinked with numerous external datasets following the Linked Data principles In this sec-

tion we give an overview of the number and typesof outgoing links that point from DBpedia into otherdata sets as well as the external data sets that set linkspointing to DBpedia resources

51 Outgoing Links

Similar to the DBpedia ontology DBpedia also fol-lows a community approach for adding links to otherthird party data sets The DBpedia project maintainsa link repository17 for which conventions for addinglinksets and linkset metadata are defined The adher-ence to those guidelines is supervised by a linking com-mittee Linksets which are added to the repository areused for the subsequent official DBpedia release as wellas for DBpedia Live Table 5 lists the linksets createdby the DBpedia community as of April 2013 The firstcolumn names the data set that is the target of the linksThe second and third column contain the predicate thatis used for linking as well as the overall number of linksthat is set between DBpedia and the external data setThe last column names the tool that was used to gener-ate the links The value S refers to Silk L to LIMESC to custom script and a missing entry means that thedataset is copied from the previous releases and notregenerated

An example for the usage of links is the combina-tion of data about European Union project funding(FTS) [32] and data about countries in DBpedia Thequery below compares funding per year (from FTS) andcountry with the gross domestic product of a country(from DBpedia)18

1 SELECT 2 SELECT ftsyear ftscountry (SUM(amount) AS

funding) 3 com rdftype fts-oCommitment 4 com fts-oyear year 5 year rdfslabel ftsyear 6 com fts-obenefit benefit 7 benefit fts-odetailAmount amount 8 benefit fts-obeneficiary beneficiary 9 beneficiary fts-ocountry country

10 country owlsameAs ftscountry 11 12 SELECT dbpcountry gdpyear gdpnominal 13 dbpcountry rdftype dboCountry 14 dbpcountry dbpgdpNominal gdpnominal 15 dbpcountry dbpgdpNominalYear gdpyear 16 17 FILTER ((ftsyear = str(gdpyear)) ampamp18 (ftscountry = dbpcountry))

17httpsgithubcomdbpediadbpedia-links18Endpoint httpftspublicdataeusparql

Results httpbitly1c2mIwQ

16 Lehmann et al DBpedia

Table 5: Data sets linked from DBpedia

Data set                  Predicate               Count        Tool
Amsterdam Museum          owl:sameAs              627          S
BBC Wildlife Finder       owl:sameAs              444          S
Book Mashup               rdf:type, owl:sameAs    9,100
Bricklink                 dc:publisher            10,100
CORDIS                    owl:sameAs              314          S
Dailymed                  owl:sameAs              894          S
DBLP Bibliography         owl:sameAs              196          S
DBTune                    owl:sameAs              838          S
Diseasome                 owl:sameAs              2,300        S
Drugbank                  owl:sameAs              4,800        S
EUNIS                     owl:sameAs              3,100        S
Eurostat (Linked Stats)   owl:sameAs              253          S
Eurostat (WBSG)           owl:sameAs              137
CIA World Factbook        owl:sameAs              545          S
flickr wrappr             dbp:hasPhotoCollection  3,800,000    C
Freebase                  owl:sameAs              3,600,000    C
GADM                      owl:sameAs              1,900
GeoNames                  owl:sameAs              86,500       S
GeoSpecies                owl:sameAs              16,000       S
GHO                       owl:sameAs              196          L
Project Gutenberg         owl:sameAs              2,500        S
Italian Public Schools    owl:sameAs              5,800        S
LinkedGeoData             owl:sameAs              103,600      S
LinkedMDB                 owl:sameAs              13,800       S
MusicBrainz               owl:sameAs              23,000
New York Times            owl:sameAs              9,700
OpenCyc                   owl:sameAs              27,100       C
OpenEI (Open Energy)      owl:sameAs              678          S
Revyu                     owl:sameAs              6
Sider                     owl:sameAs              2,000        S
TCMGeneDIT                owl:sameAs              904
UMBEL                     rdf:type                896,400
US Census                 owl:sameAs              12,600
WikiCompany               owl:sameAs              8,300
WordNet                   dbp:wordnet_type        467,100
YAGO2                     rdf:type                18,100,000

Sum                                               27,211,732

In addition to providing outgoing links on an instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki, by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively.

In particular, in 2011 Google, Microsoft and Yahoo! announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
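Such schema-level links are plain OWL axioms. For example, the class mapping for persons can be stated in Turtle as follows; the class axiom reflects a mapping that exists in the DBpedia ontology, while the property axiom is an illustrative guess rather than a quote from the ontology:

    @prefix owl:    <http://www.w3.org/2002/07/owl#> .
    @prefix dbo:    <http://dbpedia.org/ontology/> .
    @prefix schema: <http://schema.org/> .

    dbo:Person    owl:equivalentClass    schema:Person .
    dbo:birthDate owl:equivalentProperty schema:birthDate .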

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub.19 However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI; e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70 of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia.

It should be noted that the data in Sindice is not complete: for instance, it does not contain all data sets that are catalogued by the DataHub.20 However, it crawls RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite this inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data.

19 See http://wiki.dbpedia.org/Interlinking for details.

20 http://datahub.io


Table 6: Top 10 data sets in Sindice, ordered by the number of links to DBpedia

Data set        Link Predicates  Link Count
okaboo.com      4                2,407,121
tfri.gov.tw     57               837,459
naplesplus.us   2                212,460
fu-berlin.de    7                145,322
freebase.com    108              138,619
geonames.org    3                125,734
opencyc.org     3                19,648
geospecies.org  10               16,188
dbrec.net       3                12,856
faviki.com      5                12,768

Table 7: Sindice summary statistics for incoming links to DBpedia

Metric                     Value
Total links                3,960,212
Total distinct data sets   248
Total distinct predicates  345

Table 8: Top 10 datasets by incoming links in Sindice

Domain               Datasets  Links
purl.org             498       6,717,520
dbpedia.org          248       3,960,212
creativecommons.org  2,483     3,030,910
identi.ca            1,021     2,359,276
l3s.de               34        1,261,487
rkbexplorer.com      24        1,212,416
nytimes.com          27        1,174,941
w3.org               405       658,535
geospecies.org       13        523,709
livejournal.com      14,881    366,025

DBpedia is currently ranked second in terms of incoming links.

6. DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9: The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, by quarter (2012-Q1 to 2012-Q4).

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity as well. The top five languages with respect to download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition is shown in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets, which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 is depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1000 times per month.21

Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint (see the sketch below for how they can be analysed locally).
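A minimal sketch of such a network analysis, assuming the pagelinks dump (which uses the dbo:wikiPageWikiLink predicate) has been loaded into a local triple store; it computes the ten most linked-to pages:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?target (COUNT(?source) AS ?inDegree)
    WHERE { ?source dbo:wikiPageWikiLink ?target }
    GROUP BY ?target
    ORDER BY DESC(?inDegree)
    LIMIT 10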

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines, or by adding several separate clusters with a round-robin HTTP front-end.

21 The IP address was only filtered for that specific file and month in those cases.

22 http://dbpedia.org/sparql


Fig. 10: The download count (in thousands) and download volume (in GB) of the DBpedia data sets.

Table 9: Hardware of the machines serving the public SPARQL endpoint

DBpedia    Configuration
3.3 - 3.4  AMD Opteron 8220, 2.80GHz, 4 cores, 32GB
3.5 - 3.7  Intel Xeon E5520, 2.27GHz, 8 cores, 48GB
3.8        Intel Xeon E5-2630, 2.30GHz, 8 cores, 64GB

This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.
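A client that needs complete result sets beyond the 50,000-row cap can therefore page through them; a sketch of this pattern:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?person
    WHERE { ?person rdf:type dbo:Person }
    ORDER BY ?person            # a stable ordering keeps successive pages disjoint
    LIMIT 10000 OFFSET 20000    # third page of 10,000 results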

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10MB) are enforced.

Table 11: Number of unique sites accessing DBpedia endpoints

DBpedia  Avg/Day  Median  Stdev   Maximum
3.3      5,824    5,977   1,046   7,865
3.4      6,656    6,704   947     8,312
3.5      9,392    9,432   1,643   12,584
3.6      10,810   10,810  1,612   14,798
3.7      17,077   17,091  6,782   33,288
3.8      14,513   14,493  2,783   20,510

If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code23 like 509 ("Bandwidth Limit Exceeded") and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.24 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by the Median and Standard Deviation.

23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

24 http://www.webalizer.org


Table 10: Number of endpoint hits (left) and visits (right)

Hits:
DBpedia  Avg/Day    Median     Stdev      Maximum
3.3      733,811    711,306    188,991    1,319,539
3.4      1,212,549  1,165,893  351,226    2,371,657
3.5      1,122,612  1,035,444  386,454    2,908,720
3.6      1,328,355  1,286,750  256,945    2,495,031
3.7      2,085,399  1,930,728  1,057,398  8,873,473
3.8      2,910,410  2,717,775  1,085,640  7,678,490

Visits:
DBpedia  Avg/Day  Median  Stdev   Maximum
3.3      9,750    9,775   2,036   13,126
3.4      11,329   11,473  1,546   14,198
3.5      16,592   16,902  2,969   23,129
3.6      19,471   17,664  5,691   56,064
3.7      23,972   22,262  10,741  127,869
3.8      16,851   16,711  2,960   27,416

The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x that redirects the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint.
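A simplified illustration of this content negotiation (the exact redirect targets may differ in detail):

    GET /resource/Berlin HTTP/1.1
    Host: dbpedia.org
    Accept: application/rdf+xml

    HTTP/1.1 303 See Other
    Location: http://dbpedia.org/data/Berlin

With an HTML Accept header (e.g. text/html), the redirect would instead point to http://dbpedia.org/page/Berlin.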

Table 12: Hits per service to http://dbpedia.org, in thousands

Endpoint  3.3    3.4     3.5    3.6     3.7     3.8
data      1,355  2,681   2,230  2,547   3,714   4,246
ontology  80     168     142    178     116     97
page      2,634  4,703   1,810  1,687   3,787   7,377
property  231    311     137    176     176     128
resource  2,976  4,080   2,554  2,309   4,436   7,143
sparql    2,090  4,177   2,266  9,112   15,902  15,475
other     252    420     342    277     579     695
total     9,619  16,541  9,434  16,286  28,710  35,142

Fig. 11: Traffic of the Linked Data service versus the SPARQL endpoint.

The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type.


Table 13: Hits per statement type, in thousands

Statement  3.3    3.4    3.5    3.6    3.7     3.8
ask        56     269    360    326    159     677
construct  55     31     14     11     22      38
describe   11     8      4      7      62      111
select     1,891  3,663  1,658  8,030  11,204  13,516
unknown    78     206    229    738    4,455   1,134
total      2,090  4,177  2,266  9,112  15,902  15,475

Table 14: Trends in SPARQL select queries (rounded values in %)

Statement  3.3   3.4   3.5   3.6   3.7   3.8
distinct   19.5  11.4  17.3  19.4  13.3  25.4
filter     45.7  13.7  31.8  25.3  29.2  36.1
functions  8.8   6.9   23.5  21.3  25.5  25.9
geo        27.7  7.0   39.6  6.7   9.3   9.2
group      0.0   0.0   0.0   0.0   0.0   0.2
limit      4.6   6.5   11.6  10.5  7.7   7.8
optional   30.6  23.8  47.3  26.3  16.7  17.2
order      2.3   1.3   1.9   3.2   1.2   1.6
union      3.3   3.2   26.4  11.3  12.1  20.6

As the log files only record the full SPARQL query on a GET request, all POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of the SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
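For reference, a single hypothetical query of the kind tallied in Table 14 can exercise several of these constructs at once (here DISTINCT, OPTIONAL, FILTER, ORDER BY and LIMIT):

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT DISTINCT ?city ?population
    WHERE {
      ?city rdf:type dbo:City .
      ?city dbo:populationTotal ?population .
      OPTIONAL { ?city dbo:abstract ?abstract }
      FILTER (?population > 1000000)
    }
    ORDER BY DESC(?population)
    LIMIT 10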

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12: Number of daily requests sent to DBpedia Live for (a) SPARQL queries and (b) synchronisation requests, from August 2012 until January 2013.

7. Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, proliferation for applications and mashups, as well as text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets.25 For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization.

25 http://wiki.dbpedia.org/Datasets/NLP


They have also been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation / Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool26 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
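For instance, one could restrict annotation to politicians born in Berlin with a query along these lines (a hypothetical configuration; prefixes as in Table 16):

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?resource
    WHERE {
      ?resource a dbo:Politician ;
                dbo:birthPlace dbr:Berlin .
    }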

There are numerous other NLP APIs that link entities in text to DBpedia, such as AlchemyAPI,27 the Semantic API from Ontos,28 OpenCalais29 and Zemanta,30 among others.

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com

Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project.31

A related project is ImageSnippets,32 which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation: Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia.35 DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series.36 Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems.

28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org
32 http://www.imagesnippets.com
33 http://www.faviki.com
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.
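To give a flavour of the QALD setting, a natural language question is mapped to a SPARQL query over DBpedia. One possible, purely illustrative realisation:

    # "Which river does the Brooklyn Bridge cross?"
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?river
    WHERE { dbr:Brooklyn_Bridge dbo:crosses ?river }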

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project.37 Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia.38 These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37 http://viaf.org
38 Accessed on 12.02.2013: http://www.oclc.org/research/news/2012/12-07a.html

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of these tools and structure them in categories.

Facet Based Browsers: An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying: The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows to explore the neighborhood of and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows to create statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications: DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available in 171 languages (of which approximately 147 can be considered active42).

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org


It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words covering several domains.43 Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast-changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.44 Statistics are shown in Table 15.

8. Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Deutschland e.V. started the development of Wikidata.45 Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike.

42 http://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page.
44 http://wiktionary.dbpedia.org
45 http://wikidata.org

It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.
– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles.

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47 http://www.freebase.com


Table 15: Statistical comparison of extractions for different languages

Language  Words      Triples     Resources   Predicates  Senses
en        2,142,237  28,593,364  11,804,039  28          424,386
fr        4,657,817  35,032,121  20,462,349  22          592,351
ru        1,080,156  12,813,437  5,994,560   17          149,859
de        701,739    5,618,508   2,966,867   16          122,362

They both provide dumps of the extracted data as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content.

48 http://www.mpi-inf.mpg.de/yago-naga/yago

YAGO focuses on extracting a smaller number of relations compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 that are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source, Java-based API that allows to access information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page.

49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction


Data is also provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service.52 Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site.53

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text, applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a web-scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9. Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that in the past four years DBpedia has matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion: An area which is still largely unexplored is the integration and fusion between different DBpedia language editions.

56 http://www.cs.washington.edu/research/knowitall


Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement: In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction: Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki.57 If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia: While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Feedback for Wikipedia: A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
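One possible formulation of the birthday sanity check mentioned above, sketched as a query over the dbo:birthDate and dbo:deathDate properties:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?person ?birth ?death
    WHERE {
      ?person dbo:birthDate ?birth ;
              dbo:deathDate ?death .
      FILTER (?death < ?birth)   # death recorded before birth indicates a data error
    }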

Integrate DBpedia and NLP: Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) or to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development


– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16: List of namespace prefixes

Prefix   Namespace
dbo      http://dbpedia.org/ontology/
dbp      http://dbpedia.org/property/
dbr      http://dbpedia.org/resource/
dbr-de   http://de.dbpedia.org/resource/
dc       http://purl.org/dc/elements/1.1/
foaf     http://xmlns.com/foaf/0.1/
geo      http://www.w3.org/2003/01/geo/wgs84_pos#
georss   http://www.georss.org/georss/
ls       http://spotlight.dbpedia.org/scores/
lx       http://spotlight.dbpedia.org/lexicalizations/
owl      http://www.w3.org/2002/07/owl#
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
skos     http://www.w3.org/2004/02/skos/core#
sptl     http://spotlight.dbpedia.org/vocab/
xsd      http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17: DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.
[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.
[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.
[13] A. García-Silva, M. Jakob, P. N. Mendes and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[14] A. García-Silva, M. Szomszor, H. Alani and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[16] P. Heim, T. Ertl and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[18] P. Heim, S. Lohmann, D. Tsendragchaa and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[19] S. Hellmann, J. Brekle and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.
[20] S. Hellmann, C. Stadler, J. Lehmann and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[21] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.
[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[25] D. Kontokostas, A. Zaveri, S. Auer and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.
[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.
[27] C. Lagoze, H. V. de Sompel, M. Nelson and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.
[31] V. Lopez, M. Fernandez, E. Motta and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.
[32] M. Martin, C. Stadler, P. Frischmuth and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[33] P. N. Mendes, M. Jakob and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[34] P. N. Mendes, M. Jakob, A. García-Silva and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[35] S. M. Meyer, J. Degener, J. Giannandrea and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.
[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata, Semantics and Ontologies, 3:37–52, 2008.
[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[41] C. Stadler, J. Lehmann, K. Höffner and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.
[42] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[43] E. Tacchini, A. Schultz and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC, 2009.
[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648, 2012.
[45] Z. Wang, J. Li, Z. Wang and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer and J. Lehmann. User-driven Quality Evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[49] T. Zesch, C. Müller and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.




Fig. 7: English property mappings occurrence frequency (both axes are in log scale).

For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year show a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long-tail distribution: a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2 Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have so far been created in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of popular classes from the third and fourth hierarchy levels of the ontology within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to the specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances. A count of this kind can be reproduced with a simple query, as sketched below.
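Counts such as those in Table 3 can be obtained directly from a DBpedia SPARQL endpoint; the following is a minimal sketch (the class choice is illustrative, and live counts will differ from the release snapshots reported here):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>

# Count how many instances of dbo:Person a given endpoint describes.
SELECT (COUNT(DISTINCT ?s) AS ?instances)
WHERE {
  ?s rdf:type dbo:Person .
}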

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages, but not in three or more languages; etc. For example, 12936 persons are described in five languages, but not in six or more languages. The number 871630 for the class Person means that all 20 language versions together describe 871630 different persons. The number is higher than the number of persons described in the canonicalized English infobox


Fig. 6. Mapping community activity for (a) ontology and (b) 10 most active language editions.

Table 2. Basic statistics about Localized DBpedia Editions.

     Inst. LD all  Inst. CD all  Inst. with MD CD  Raw Prop. CD  Map. Prop. CD  Raw Statem. CD  Map. Statem. CD
en   3769926       3769926       2359521           48293         1313           65143840        33742015
de   1243771       650037        204335            9593          261            7603562         2880381
fr   1197334       740044        214953            13551         228            8854322         2901809
it   882127        580620        383643            9716          181            12227870        4804731
es   879091        542524        310348            14643         476            7740458         4383206
pl   848298        538641        344875            7306          266            7696193         4511794
ru   822681        439605        123011            13522         76             6973305         1389473
pt   699446        460258        272660            12851         602            6255151         4005527
ca   367362        241534        112934            8696          183            3689870         1301868
cs   225133        148819        34893             5564          334            1857230         474459
hu   209180        138998        63441             6821          295            2506399         601037
ko   196132        124591        30962             7095          419            1035606         417605
tr   187850        106644        40438             7512          440            1350679         556943
ar   165722        103059        16236             7898          268            635058          168686
eu   132877        108713        41401             2245          19             2255897         532709
sl   129834        73099         22036             4235          470            1213801         222447
bg   125762        87679         38825             3984          274            774443          488678
hr   109890        71469         10343             3334          158            701182          151196
el   71936         48260         10813             2866          288            206460          113838

data set (763643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195263 persons are described in two or more languages. The large difference of this number compared to the total number of 871630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).

3.3. Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [20] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages8. The DBpedia 3.7 release9 in September 2011 was the first DBpedia release to use the localized I18n (Internationalisation) DBpedia extraction framework [24].

8 Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr), Urdu (ur)

9 http://blog.dbpedia.org/2011/09/11/


Table 3. Number of instances per class within 10 localized DBpedia versions.

               en      it      pl      es      pt      fr      de      ru      ca      hu
Person         763643  145060  70708   65337   43057   62942   33122   18620   7107    15529
  Athlete      185126  47187   30332   19482   14130   21646   31237   0       721     4527
  Artist       61073   12511   16120   25992   10571   13465   0       0       2004    3821
  Politician   23096   0       8943    5513    3342    0       0       12004   1376    760
Place          572728  141101  182727  132961  116660  80602   131766  67932   73078   18324
  PopulPlace   387166  138077  167034  121204  109418  72252   79410   63826   72743   15535
  Building     60514   1270    2946    3570    803     921     83      43      0       527
  River        24267   0       1383    2       4149    3333    6707    3924    0       565
Organisation   192832  4142    12193   11710   10949   17513   16973   1598    1399    3993
  Company      44516   4142    2566    975     1903    5832    7200    0       440     618
  EducInst     42270   0       599     1207    270     1636    2171    1010    0       115
  Band         27061   0       2993    0       4476    3868    5368    0       263     802
Work           333269  51918   32386   36484   33869   39195   18195   34363   5240    9777
  MusicWork    159070  23652   14987   15911   17053   16206   0       6588    697     4704
  Film         71715   17210   9467    9396    8896    9741    15038   12045   3859    2320
  Software     27947   5682    3050    4833    3419    5733    2368    0       606     857

Table 4. Cross-language overlap: Number of instances that are described in multiple languages.

Class         Instances  1       2       3      4      5      6      7      8      9      10+
Person        871630     676367  94339   42382  21647  12936  8198   5295   3437   2391   4638
Place         643260     307729  150349  45836  36339  20831  13523  20808  31422  11262  5161
Organisation  206670     160398  22661   9312   5002   3221   2072   1421   928    594    1061
Work          360808     243706  54855   23097  12605  8277   5732   4007   2911   1995   3623

At the time of writing, DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian, and Spanish10. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services.

While at the moment chapters are defined by ownership of the IP and server of the subdomain A record (e.g. http://ko.dbpedia.org) given by DBpedia maintainers, the DBpedia internationalisation committee11 is manifesting its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek12 and Dutch13) the existence of

10 Accessed on 25/09/2013: http://wiki.dbpedia.org/Internationalization/Chapters

11 http://wiki.dbpedia.org/Internationalization

12 http://el.dbpedia.org
13 http://nl.dbpedia.org

a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage.

The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is to achieving total coverage (as shown in Figure 5), and line charts showing the progress over time, highlighting when one


language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4. Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute14. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [20,36]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported15. Meanwhile, DBpedia Live for Dutch16 was developed.

4.1. DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia foundation kindly provided us access to their update stream using the OAI-PMH protocol [27]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component, serving as a proxy, constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

14 http://stats.wikimedia.org/EN/SummaryEN.htm

15 http://live.dbpedia.org
16 http://live.nl.dbpedia.org

Fig. 8. Overview of DBpedia Live extraction framework.

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki, described in Section 2.4, serves as a secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2. Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

– Unmodified Pages: The update of unmodified pages at regular intervals.

– Changesets Publication: The publication of triple-changesets.

– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.

– Data Isolation: Separation of data from different sources.


Mapping-Affected Pages. Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages. Extraction framework improvements or activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed, and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets. Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set for the added triples and another set for the deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them, as sketched below.
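Applying a downloaded changeset then amounts to a pair of SPARQL 1.1 Update operations against the mirror's store; a minimal sketch (the triples shown are illustrative, not taken from an actual changeset):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Remove the triples listed in the "deleted" file of the changeset ...
DELETE DATA {
  dbr:Berlin dbo:populationTotal 3499879 .
} ;
# ... and add the triples listed in the "added" file.
INSERT DATA {
  dbr:Berlin dbo:populationTotal 3515473 .
}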

Synchronisation Tool. The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them, and updates the target SPARQL endpoint via insert and delete operations.

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
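Given this graph layout, a client can scope queries to a single source by naming the graph explicitly; a minimal sketch (graph URI as listed above, class choice arbitrary):

PREFIX dbo: <http://dbpedia.org/ontology/>

# Query only the data coming from the article update feeds.
SELECT ?city ?population
WHERE {
  GRAPH <http://live.dbpedia.org> {
    ?city a dbo:City ;
          dbo:populationTotal ?population .
  }
}
LIMIT 10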

5. Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing to DBpedia resources.

5.1. Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository17, for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release, as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking, as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

An example for the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia)18:

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding)
    WHERE {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
    }
    GROUP BY ?ftsyear ?ftscountry
  }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal
    WHERE {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
    }
  }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

17 https://github.com/dbpedia/dbpedia-links
18 Endpoint: http://fts.publicdata.eu/sparql,

Results: http://bit.ly/1c2mIwQ


Table 5. Data sets linked from DBpedia.

Data set                 Predicate                Count       Tool
Amsterdam Museum         owl:sameAs               627         S
BBC Wildlife Finder      owl:sameAs               444         S
Book Mashup              rdf:type, owl:sameAs     9,100
Bricklink                dc:publisher             10,100
CORDIS                   owl:sameAs               314         S
Dailymed                 owl:sameAs               894         S
DBLP Bibliography        owl:sameAs               196         S
DBTune                   owl:sameAs               838         S
Diseasome                owl:sameAs               2,300       S
Drugbank                 owl:sameAs               4,800       S
EUNIS                    owl:sameAs               3,100       S
Eurostat (Linked Stats)  owl:sameAs               253         S
Eurostat (WBSG)          owl:sameAs               137
CIA World Factbook       owl:sameAs               545         S
flickr wrappr            dbp:hasPhotoCollection   3,800,000   C
Freebase                 owl:sameAs               3,600,000   C
GADM                     owl:sameAs               1,900
GeoNames                 owl:sameAs               86,500      S
GeoSpecies               owl:sameAs               16,000      S
GHO                      owl:sameAs               196         L
Project Gutenberg        owl:sameAs               2,500       S
Italian Public Schools   owl:sameAs               5,800       S
LinkedGeoData            owl:sameAs               103,600     S
LinkedMDB                owl:sameAs               13,800      S
MusicBrainz              owl:sameAs               23,000
New York Times           owl:sameAs               9,700
OpenCyc                  owl:sameAs               27,100      C
OpenEI (Open Energy)     owl:sameAs               678         S
Revyu                    owl:sameAs               6
Sider                    owl:sameAs               2,000       S
TCMGeneDIT               owl:sameAs               904
UMBEL                    rdf:type                 896,400
US Census                owl:sameAs               12,600
WikiCompany              owl:sameAs               8,300
WordNet                  dbp:wordnet_type         467,100
YAGO2                    rdf:type                 18,100,000

Sum                                               27,211,732

In addition to providing outgoing links on an instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively.

In particular, in 2011 Google, Microsoft, and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
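Since these schema-level links are part of the ontology data itself, they can be inspected with a simple query; a minimal sketch (runnable against the public endpoint, with result counts depending on the release):

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# List DBpedia ontology terms declared equivalent to Schema.org terms.
SELECT ?dbpediaTerm ?schemaOrgTerm
WHERE {
  { ?dbpediaTerm owl:equivalentClass ?schemaOrgTerm . }
  UNION
  { ?dbpediaTerm owl:equivalentProperty ?schemaOrgTerm . }
  FILTER (STRSTARTS(STR(?schemaOrgTerm), "http://schema.org/"))
}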

5.2. Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39007478 according to the Data Hub19. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia.

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub20. However, it crawls for RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the

19 See http://wiki.dbpedia.org/Interlinking for details.

20 http://datahub.io


Table 6. Top 10 data sets in Sindice, ordered by the number of links to DBpedia.

Data set        Link Predicates  Link Count
okaboo.com      4                2407121
tfri.gov.tw     57               837459
naplesplus.us   2                212460
fu-berlin.de    7                145322
freebase.com    108              138619
geonames.org    3                125734
opencyc.org     3                19648
geospecies.org  10               16188
dbrec.net       3                12856
faviki.com      5                12768

Table 7. Sindice summary statistics for incoming links to DBpedia.

Metric                     Value
Total links                3960212
Total distinct data sets   248
Total distinct predicates  345

Table 8. Top 10 datasets by incoming links in Sindice.

Domain               Datasets  Links
purl.org             498       6717520
dbpedia.org          248       3960212
creativecommons.org  2483      3030910
identi.ca            1021      2359276
l3s.de               34        1261487
rkbexplorer.com      24        1212416
nytimes.com          27        1174941
w3.org               405       658535
geospecies.org       13        523709
livejournal.com      14881     366025

Web of Data, and DBpedia is currently ranked second in terms of incoming links.

6. DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferencable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, 2012-Q1 to 2012-Q4.

6.1. Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to the download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan, and French, respectively. The download count and download volume of the English language edition are indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets, which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1000 times per month21.

Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Presumably, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.
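As an illustration of such network-analysis use, once the page-links dump is loaded into a local store, simple graph measures reduce to aggregation queries; a minimal sketch (assumes a local load of the dump; dbo:wikiPageWikiLink is the predicate used in the page-links data set):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Out-degree of one article in the Wikipedia page-link graph.
SELECT (COUNT(?target) AS ?outDegree)
WHERE {
  dbr:Berlin dbo:wikiPageWikiLink ?target .
}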

6.2. Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines

21 The IP address was only filtered for that specific file and month in those cases.

22 http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume (in GB) of the DBpedia data sets.

Table 9. Hardware of the machines serving the public SPARQL endpoint.

DBpedia    Configuration
3.3 - 3.4  AMD Opteron 8220, 2.80GHz, 4 Cores, 32GB
3.5 - 3.7  Intel Xeon E5520, 2.27GHz, 8 Cores, 48GB
3.8        Intel Xeon E5-2630, 2.30GHz, 8 Cores, 64GB

or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15), as well as a bandwidth limit per request (currently 10MB), are enforced. If the client software can handle compression,

Table 11. Number of unique sites accessing DBpedia endpoints.

DBpedia  Avg/Day  Median  Stdev  Maximum
3.3      5824     5977    1046   7865
3.4      6656     6704    947    8312
3.5      9392     9432    1643   12584
3.6      10810    10810   1612   14798
3.7      17077    17091   6782   33288
3.8      14513    14493   2783   20510

replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address), or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code23, like 509 ("Bandwidth Limit Exceeded"), and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3. Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.2124. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by

23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

24 http://www.webalizer.org


Table 10. Number of endpoint hits (left) and visits (right).

Hits:
DBpedia  Avg/Day  Median   Stdev    Maximum
3.3      733811   711306   188991   1319539
3.4      1212549  1165893  351226   2371657
3.5      1122612  1035444  386454   2908720
3.6      1328355  1286750  256945   2495031
3.7      2085399  1930728  1057398  8873473
3.8      2910410  2717775  1085640  7678490

Visits:
DBpedia  Avg/Day  Median  Stdev  Maximum
3.3      9750     9775    2036   13126
3.4      11329    11473   1546   14198
3.5      16592    16902   2969   23129
3.6      19471    17664   5691   56064
3.7      23972    22262   10741  127869
3.8      16851    16711   2960   27416

the Median and Standard Deviation. The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4. Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the

Table 12. Hits per service to http://dbpedia.org, in thousands.

Endpoint  3.3   3.4    3.5   3.6    3.7    3.8
data      1355  2681   2230  2547   3714   4246
ontology  80    168    142   178    116    97
page      2634  4703   1810  1687   3787   7377
property  231   311    137   176    176    128
resource  2976  4080   2554  2309   4436   7143
sparql    2090  4177   2266  9112   15902  15475
other     252   420    342   277    579    695
total     9619  16541  9434  16286  28710  35142

Fig. 11. Traffic: Linked Data versus SPARQL endpoint.

resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled, from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13 we focussed on the calls to the sparql endpoint and counted the number of statements per type.


Table 13. Hits per statement type, in thousands.

Statement  3.3   3.4   3.5   3.6   3.7    3.8
ask        56    269   360   326   159    677
construct  55    31    14    11    22     38
describe   11    8     4     7     62     111
select     1891  3663  1658  8030  11204  13516
unknown    78    206   229   738   4455   1134
total      2090  4177  2266  9112  15902  15475

Table 14. Trends in SPARQL select (rounded values in %).

Statement  3.3   3.4   3.5   3.6   3.7   3.8
distinct   19.5  11.4  17.3  19.4  13.3  25.4
filter     45.7  13.7  31.8  25.3  29.2  36.1
functions  8.8   6.9   23.5  21.3  25.5  25.9
geo        27.7  7.0   39.6  6.7   9.3   9.2
group      0.0   0.0   0.0   0.0   0.0   0.2
limit      4.6   6.5   11.6  10.5  7.7   7.8
optional   30.6  23.8  47.3  26.3  16.7  17.2
order      2.3   1.3   1.9   3.2   1.2   1.6
union      3.3   3.2   26.4  11.3  12.1  20.6

As the log files only record the full SPARQL query on a GET request, all the POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
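To make the counted constructs concrete, a single query such as the following sketch (names arbitrary) would be tallied under distinct, filter, optional, order and limit in Table 14:

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Example query combining several of the counted keywords.
SELECT DISTINCT ?city ?name ?population
WHERE {
  ?city a dbo:City ;
        rdfs:label ?name .
  OPTIONAL { ?city dbo:populationTotal ?population . }
  FILTER (LANG(?name) = "en")
}
ORDER BY DESC(?population)
LIMIT 10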

6.5. Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7. Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, proliferation for applications, mashups, as well as text analysis and annotation tools.

7.1. Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have

25 http://wiki.dbpedia.org/Datasets/NLP


been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.
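As an illustration of how lightweight these data sets are to use, the grammatical gender data set boils down to one triple per person; a minimal sketch of the lookup a co-reference tool might perform (this assumes the data set is loaded into a store and that it uses the foaf:gender property, as the dumps do):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbr:  <http://dbpedia.org/resource/>

# Retrieve the gender feature for one entity.
SELECT ?gender
WHERE {
  dbr:Marie_Curie foaf:gender ?gender .
}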

7.1.1. Annotation - Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation - also known as key phrase extraction and entity linking - refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool26, including a free web service, that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types, or by more complex relationships within the knowledge base, described as SPARQL queries.
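For example, the set of resources eligible for annotation can be described by a query such as the following sketch (the restriction chosen here, German-born soccer players, is purely illustrative):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Only resources matching this pattern would be annotated.
SELECT ?resource
WHERE {
  ?resource a dbo:SoccerPlayer ;
            dbo:birthPlace dbr:Germany .
}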

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, Semantic API

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com

from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.

A related project is ImageSnippets32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2. Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets, including DBpedia35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such

28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org
32 http://www.imagesnippets.com
33 http://www.faviki.com
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/


systems. Due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property from mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2. Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project37. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia38. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

7.3. Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

Facet Based Browsers. An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows to create statistical analysis queries and correlations in RDF data, and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.
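A spatial restriction of the kind such clients issue can be expressed directly over the WGS84 coordinates DBpedia attaches to resources; a minimal sketch (the bounding box values are arbitrary):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

# Places within a small bounding box around central Berlin.
SELECT ?place ?lat ?long
WHERE {
  ?place a dbo:Place ;
         geo:lat ?lat ;
         geo:long ?long .
  FILTER (?lat > 52.49 && ?lat < 52.54 &&
          ?long > 13.35 && ?long < 13.45)
}
LIMIT 20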

7.4. Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources,

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/

40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org


available in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words covering several domains43. Such descriptions provide for a lexical word a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data44. Statistics are shown in Table 15.
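The resulting data can be explored like any other Linked Data set; a hedged sketch of a first query against the endpoint (the resource URI pattern is assumed to mirror the Wiktionary page title; the actual vocabulary is documented at http://wiktionary.dbpedia.org):

# Inspect everything extracted for the English Wiktionary page "book".
SELECT ?p ?o
WHERE {
  <http://wiktionary.dbpedia.org/resource/book> ?p ?o .
}
LIMIT 50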

8. Related Work

8.1. Cross-Domain Community Knowledge Bases

8.1.1. Wikidata

In March 2012, the Wikimedia Germany e.V. started the development of Wikidata45. Wikidata is a free knowledge base about the world that can be read and

42 http://s23.org/wikistats/wiktionaries_html.php

43 See http://en.wiktionary.org/wiki/semantic for a simple example page.

44 http://wiktionary.dbpedia.org
45 http://wikidata.org

edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items, and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2. Freebase

Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface

47 http://www.freebase.com


Table 15. Statistical comparison of extractions for different languages.

language  words    triples   resources  predicates  senses
en        2142237  28593364  11804039   28          424386
fr        4657817  35032121  20462349   22          592351
ru        1080156  12813437  5994560    17          149859
de        701739   5618508   2966867    16          122362

and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google, and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3. YAGO

One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its

48 http://www.mpi-inf.mpg.de/yago-naga/yago/

content. YAGO focuses on extracting a smaller number of relations, compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2. Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent, or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 that are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source Java-based API that allows to access information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also

49 http://www.mediawiki.org/wiki/Alternative_parsers

50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction


provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation or for synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge linkage, beyond the use of Wikipedia interlanguage links, is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/

52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus

53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a web-scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9. Conclusions and Future Work

In this system report, we presented an overview on recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that in the past four years DBpedia matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and

56 http://www.cs.washington.edu/research/knowitall


fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality, and integrating improvements into the mappings to the DBpedia ontology as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we were making a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add

57httpwwwmediawikiorgwikiExtensionLightweightRDFa

services such as Linked DataSPARQL endpoints RDFdumps linking and ontology mapping for Wikidata

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues, and are executed at certain intervals. For example, a query could check that the birth date of a person must always be before the death date, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
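A minimal sketch of such a sanity check, using the dbo:birthDate and dbo:deathDate properties of the DBpedia ontology, could look as follows; it lists persons whose recorded death date precedes their birth date:

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  # a death date preceding the birth date signals an extraction
  # error or a mistake in the underlying Wikipedia article
  FILTER (?death < ?birth)
}
LIMIT 100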

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (depends on training data) and update (requires recomputation). With DBpedia and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16: List of namespace prefixes

Prefix | Namespace
dbo    | http://dbpedia.org/ontology/
dbp    | http://dbpedia.org/property/
dbr    | http://dbpedia.org/resource/
dbr-de | http://de.dbpedia.org/resource/
dc     | http://purl.org/dc/elements/1.1/
foaf   | http://xmlns.com/foaf/0.1/
geo    | http://www.w3.org/2003/01/geo/wgs84_pos#
georss | http://www.georss.org/georss/
ls     | http://spotlight.dbpedia.org/scores/
lx     | http://spotlight.dbpedia.org/lexicalizations/
owl    | http://www.w3.org/2002/07/owl#
rdf    | http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs   | http://www.w3.org/2000/01/rdf-schema#
skos   | http://www.w3.org/2004/02/skos/core#
sptl   | http://spotlight.dbpedia.org/vocab/
xsd    | http://www.w3.org/2001/XMLSchema#

Table 17: DBpedia timeline

Year | Month | Event
2006 | Nov | Start of infobox extraction from Wikipedia
2007 | Mar | DBpedia 1.0 release
2007 | Jun | ESWC DBpedia article [2]
2007 | Nov | DBpedia 2.0 release
2007 | Nov | ISWC DBpedia article [1]
2007 | Dec | DBpedia 3.0 release candidate
2008 | Feb | DBpedia 3.0 release
2008 | Aug | DBpedia 3.1 release
2008 | Nov | DBpedia 3.2 release
2009 | Jan | DBpedia 3.3 release
2009 | Sep | JWS DBpedia article [5]
2010 | Feb | Information extraction framework in Scala; Mappings Wiki release
2010 | Mar | DBpedia 3.4 release
2010 | Apr | DBpedia 3.5 release
2010 | May | Start of DBpedia Internationalization effort
2011 | Feb | DBpedia Spotlight release
2011 | Mar | DBpedia 3.6 release
2011 | Jul | DBpedia Live release
2011 | Sep | DBpedia 3.7 release (with I18n datasets)
2012 | Aug | DBpedia 3.8 release
2012 | Sep | Publication of DBpedia Internationalization article [24]
2013 | Sep | DBpedia 3.9 release

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Mérialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle, and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: research and applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata, Semantics and Ontologies, 3:37–52, Nov. 10 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven Quality Evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.




Fig. 6: Mapping community activity for (a) ontology and (b) 10 most active language editions.

Table 2: Basic statistics about Localized DBpedia Editions

Lang. | Inst. LD all | Inst. CD all | Inst. with MD CD | Raw Prop. CD | Map. Prop. CD | Raw Statem. CD | Map. Statem. CD
en | 3769926 | 3769926 | 2359521 | 48293 | 1313 | 65143840 | 33742015
de | 1243771 | 650037 | 204335 | 9593 | 261 | 7603562 | 2880381
fr | 1197334 | 740044 | 214953 | 13551 | 228 | 8854322 | 2901809
it | 882127 | 580620 | 383643 | 9716 | 181 | 12227870 | 4804731
es | 879091 | 542524 | 310348 | 14643 | 476 | 7740458 | 4383206
pl | 848298 | 538641 | 344875 | 7306 | 266 | 7696193 | 4511794
ru | 822681 | 439605 | 123011 | 13522 | 76 | 6973305 | 1389473
pt | 699446 | 460258 | 272660 | 12851 | 602 | 6255151 | 4005527
ca | 367362 | 241534 | 112934 | 8696 | 183 | 3689870 | 1301868
cs | 225133 | 148819 | 34893 | 5564 | 334 | 1857230 | 474459
hu | 209180 | 138998 | 63441 | 6821 | 295 | 2506399 | 601037
ko | 196132 | 124591 | 30962 | 7095 | 419 | 1035606 | 417605
tr | 187850 | 106644 | 40438 | 7512 | 440 | 1350679 | 556943
ar | 165722 | 103059 | 16236 | 7898 | 268 | 635058 | 168686
eu | 132877 | 108713 | 41401 | 2245 | 19 | 2255897 | 532709
sl | 129834 | 73099 | 22036 | 4235 | 470 | 1213801 | 222447
bg | 125762 | 87679 | 38825 | 3984 | 274 | 774443 | 488678
hr | 109890 | 71469 | 10343 | 3334 | 158 | 701182 | 151196
el | 71936 | 48260 | 10813 | 2866 | 288 | 206460 | 113838

data set (763,643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195,263 persons are described in two or more languages. The large difference of this number compared to the total number of 871,630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [20] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages.8 The DBpedia 3.7 release9 in September 2011 was the first DBpedia release to use the localized I18n (Internationalisation) DBpedia extraction framework [24].

8 Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr), Urdu (ur)

9 http://blog.dbpedia.org/2011/09/11/


Table 3: Number of instances per class within 10 localized DBpedia versions

Class | en | it | pl | es | pt | fr | de | ru | ca | hu
Person | 763643 | 145060 | 70708 | 65337 | 43057 | 62942 | 33122 | 18620 | 7107 | 15529
Athlete | 185126 | 47187 | 30332 | 19482 | 14130 | 21646 | 31237 | 0 | 721 | 4527
Artist | 61073 | 12511 | 16120 | 25992 | 10571 | 13465 | 0 | 0 | 2004 | 3821
Politician | 23096 | 0 | 8943 | 5513 | 3342 | 0 | 0 | 12004 | 1376 | 760
Place | 572728 | 141101 | 182727 | 132961 | 116660 | 80602 | 131766 | 67932 | 73078 | 18324
PopulPlace | 387166 | 138077 | 167034 | 121204 | 109418 | 72252 | 79410 | 63826 | 72743 | 15535
Building | 60514 | 1270 | 2946 | 3570 | 803 | 921 | 83 | 43 | 0 | 527
River | 24267 | 0 | 1383 | 2 | 4149 | 3333 | 6707 | 3924 | 0 | 565
Organisation | 192832 | 4142 | 12193 | 11710 | 10949 | 17513 | 16973 | 1598 | 1399 | 3993
Company | 44516 | 4142 | 2566 | 975 | 1903 | 5832 | 7200 | 0 | 440 | 618
EducInst | 42270 | 0 | 599 | 1207 | 270 | 1636 | 2171 | 1010 | 0 | 115
Band | 27061 | 0 | 2993 | 0 | 4476 | 3868 | 5368 | 0 | 263 | 802
Work | 333269 | 51918 | 32386 | 36484 | 33869 | 39195 | 18195 | 34363 | 5240 | 9777
MusicWork | 159070 | 23652 | 14987 | 15911 | 17053 | 16206 | 0 | 6588 | 697 | 4704
Film | 71715 | 17210 | 9467 | 9396 | 8896 | 9741 | 15038 | 12045 | 3859 | 2320
Software | 27947 | 5682 | 3050 | 4833 | 3419 | 5733 | 2368 | 0 | 606 | 857

Table 4: Cross-language overlap: Number of instances that are described in multiple languages

Class | Instances | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+
Person | 871630 | 676367 | 94339 | 42382 | 21647 | 12936 | 8198 | 5295 | 3437 | 2391 | 4638
Place | 643260 | 307729 | 150349 | 45836 | 36339 | 20831 | 13523 | 20808 | 31422 | 11262 | 5161
Organisation | 206670 | 160398 | 22661 | 9312 | 5002 | 3221 | 2072 | 1421 | 928 | 594 | 1061
Work | 360808 | 243706 | 54855 | 23097 | 12605 | 8277 | 5732 | 4007 | 2911 | 1995 | 3623

At the time of writing, DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish.10 Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services.

While chapters are at the moment defined by ownership of the IP and server of the subdomain A record (e.g. http://ko.dbpedia.org) given by DBpedia maintainers, the DBpedia internationalisation committee11 is manifesting its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek12 and Dutch13), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].

10 Accessed on 25/09/2013: http://wiki.dbpedia.org/Internationalization/Chapters
11 http://wiki.dbpedia.org/Internationalization
12 http://el.dbpedia.org
13 http://nl.dbpedia.org

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage.

The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is from achieving total coverage (as shown in Figure 5) and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate; e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute.14 This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [20,36]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported.15 Meanwhile, DBpedia Live for Dutch16 was developed.

4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia foundation kindly provided us access to their update stream using the OAI-PMH protocol [27]. This protocol allows a program to pull page updates in XML via HTTP. A Java component, serving as a proxy, constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

14 http://stats.wikimedia.org/EN/SummaryEN.htm
15 http://live.dbpedia.org
16 http://live.nl.dbpedia.org

Fig. 8: Overview of DBpedia Live extraction framework.

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.
– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.
– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.
– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live Extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.
– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separate data from different sources.


Mapping-Affected Pages. Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example: if a new property mapping is introduced (i.e. dbo:translator), or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages. Extraction framework improvements or activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed, and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets. Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set for the added triples and another set for the deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool. The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them, and updates the target SPARQL endpoint via insert and delete operations.
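Conceptually, applying one changeset pair to a mirror amounts to a pair of SPARQL Update operations, sketched below. The changesets themselves are plain N-Triples files; the concrete triple and its values are invented placeholders for illustration:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>

# triples from the "removed" file of the changeset (placeholder values)
DELETE DATA {
  dbr:Berlin dbo:populationTotal 3499879 .
} ;
# triples from the "added" file of the changeset (placeholder values)
INSERT DATA {
  dbr:Berlin dbo:populationTotal 3501872 .
}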

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
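Under this graph layout, a query restricted to the live article data would look roughly as follows (a sketch; the property is illustrative):

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?s ?population WHERE {
  # consult only triples extracted from the article update feed
  GRAPH <http://live.dbpedia.org> {
    ?s dbo:populationTotal ?population .
  }
}
LIMIT 10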

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing to DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository17 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking, as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

An example for the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):18

SELECT * {
  {
    SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
    } GROUP BY ?ftsyear ?ftscountry
  }
  {
    SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
    }
  }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

17 https://github.com/dbpedia/dbpedia-links
18 Endpoint: http://fts.publicdata.eu/sparql; results: http://bit.ly/1c2mIwQ


Table 5: Data sets linked from DBpedia

Data set | Predicate | Count | Tool
Amsterdam Museum | owl:sameAs | 627 | S
BBC Wildlife Finder | owl:sameAs | 444 | S
Book Mashup | rdf:type, owl:sameAs | 9,100 |
Bricklink | dc:publisher | 10,100 |
CORDIS | owl:sameAs | 314 | S
Dailymed | owl:sameAs | 894 | S
DBLP Bibliography | owl:sameAs | 196 | S
DBTune | owl:sameAs | 838 | S
Diseasome | owl:sameAs | 2,300 | S
Drugbank | owl:sameAs | 4,800 | S
EUNIS | owl:sameAs | 3,100 | S
Eurostat (Linked Stats) | owl:sameAs | 253 | S
Eurostat (WBSG) | owl:sameAs | 137 |
CIA World Factbook | owl:sameAs | 545 | S
flickr wrappr | dbp:hasPhotoCollection | 3,800,000 | C
Freebase | owl:sameAs | 3,600,000 | C
GADM | owl:sameAs | 1,900 |
GeoNames | owl:sameAs | 86,500 | S
GeoSpecies | owl:sameAs | 16,000 | S
GHO | owl:sameAs | 196 | L
Project Gutenberg | owl:sameAs | 2,500 | S
Italian Public Schools | owl:sameAs | 5,800 | S
LinkedGeoData | owl:sameAs | 103,600 | S
LinkedMDB | owl:sameAs | 13,800 | S
MusicBrainz | owl:sameAs | 23,000 |
New York Times | owl:sameAs | 9,700 |
OpenCyc | owl:sameAs | 27,100 | C
OpenEI (Open Energy) | owl:sameAs | 678 | S
Revyu | owl:sameAs | 6 |
Sider | owl:sameAs | 2,000 | S
TCMGeneDIT | owl:sameAs | 904 |
UMBEL | rdf:type | 896,400 |
US Census | owl:sameAs | 12,600 |
WikiCompany | owl:sameAs | 8,300 |
WordNet | dbp:wordnet_type | 467,100 |
YAGO2 | rdf:type | 18,100,000 |
Sum | | 27,211,732 |

In addition to providing outgoing links on an instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively.

In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
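These schema-level links can be retrieved directly from the ontology, e.g. with a query along the following lines (a sketch; it assumes the equivalences are stored as regular triples in the endpoint):

PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?dboTerm ?schemaOrgTerm WHERE {
  { ?dboTerm owl:equivalentClass ?schemaOrgTerm . }
  UNION
  { ?dboTerm owl:equivalentProperty ?schemaOrgTerm . }
  # keep only links into the Schema.org namespace
  FILTER (STRSTARTS(STR(?schemaOrgTerm), "http://schema.org/"))
}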

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub.19 However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70 of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia.

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub.20 However, it crawls for RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

19 See http://wiki.dbpedia.org/Interlinking for details.
20 http://datahub.io


Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia

Data set | Link Predicate Count | Link Count
okaboo.com | 4 | 2,407,121
tfri.gov.tw | 57 | 837,459
naplesplus.us | 2 | 212,460
fu-berlin.de | 7 | 145,322
freebase.com | 108 | 138,619
geonames.org | 3 | 125,734
opencyc.org | 3 | 19,648
geospecies.org | 10 | 16,188
dbrec.net | 3 | 12,856
faviki.com | 5 | 12,768

Table 7: Sindice summary statistics for incoming links to DBpedia

Metric | Value
Total links | 3,960,212
Total distinct data sets | 248
Total distinct predicates | 345

Table 8: Top 10 datasets by incoming links in Sindice

Domain | Datasets | Links
purl.org | 498 | 6,717,520
dbpedia.org | 248 | 3,960,212
creativecommons.org | 2,483 | 3,030,910
identi.ca | 1,021 | 2,359,276
l3s.de | 34 | 1,261,487
rkbexplorer.com | 24 | 1,212,416
nytimes.com | 27 | 1,174,941
w3.org | 405 | 658,535
geospecies.org | 13 | 523,709
livejournal.com | 14,881 | 366,025


6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9: The download count and download volume of the English language edition of DBpedia.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition is indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 is depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1000 times per month.21

Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and are downloaded more often as they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round-robin HTTP front-end.

21 The IP address was only filtered for that specific file and month in those cases.
22 http://dbpedia.org/sparql


Fig. 10: The download count and download volume (in GB) of the DBpedia data sets.

Table 9: Hardware of the machines serving the public SPARQL endpoint

DBpedia | Configuration
3.3 - 3.4 | AMD Opteron 8220, 2.80GHz, 4 cores, 32GB
3.5 - 3.7 | Intel Xeon E5520, 2.27GHz, 8 cores, 48GB
3.8 | Intel Xeon E5-2630, 2.30GHz, 8 cores, 64GB

This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.
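Under these limits, a client that needs a complete result set pages through it with LIMIT/OFFSET, e.g. (a sketch; the class is illustrative):

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?person WHERE {
  ?person a dbo:Person .
}
ORDER BY ?person   # a stable order makes the pages reproducible
LIMIT 10000        # stays below the 50,000-row result cap
OFFSET 20000       # third page of 10,000 results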

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code23 like 509 ("Bandwidth Limit Exceeded") and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

Table 11: Number of unique sites accessing DBpedia endpoints

DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 5,824 | 5,977 | 1,046 | 7,865
3.4 | 6,656 | 6,704 | 947 | 8,312
3.5 | 9,392 | 9,432 | 1,643 | 12,584
3.6 | 10,810 | 10,810 | 1,612 | 14,798
3.7 | 17,077 | 17,091 | 6,782 | 33,288
3.8 | 14,513 | 14,493 | 2,783 | 20,510

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.24 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by the median and standard deviation.

23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
24 http://www.webalizer.org


Table 10: Number of endpoint hits (left) and visits (right)

Hits:
DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 733,811 | 711,306 | 188,991 | 1,319,539
3.4 | 1,212,549 | 1,165,893 | 351,226 | 2,371,657
3.5 | 1,122,612 | 1,035,444 | 386,454 | 2,908,720
3.6 | 1,328,355 | 1,286,750 | 256,945 | 2,495,031
3.7 | 2,085,399 | 1,930,728 | 1,057,398 | 8,873,473
3.8 | 2,910,410 | 2,717,775 | 1,085,640 | 7,678,490

Visits:
DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 9,750 | 9,775 | 2,036 | 13,126
3.4 | 11,329 | 11,473 | 1,546 | 14,198
3.5 | 16,592 | 16,902 | 2,969 | 23,129
3.6 | 19,471 | 17,664 | 5,691 | 56,064
3.7 | 23,972 | 22,262 | 10,741 | 127,869
3.8 | 16,851 | 16,711 | 2,960 | 27,416

The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint.

Table 12: Hits per service to http://dbpedia.org in thousands

Endpoint | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
data | 1,355 | 2,681 | 2,230 | 2,547 | 3,714 | 4,246
ontology | 80 | 168 | 142 | 178 | 116 | 97
page | 2,634 | 4,703 | 1,810 | 1,687 | 3,787 | 7,377
property | 231 | 311 | 137 | 176 | 176 | 128
resource | 2,976 | 4,080 | 2,554 | 2,309 | 4,436 | 7,143
sparql | 2,090 | 4,177 | 2,266 | 9,112 | 15,902 | 15,475
other | 252 | 420 | 342 | 277 | 579 | 695
total | 9,619 | 16,541 | 9,434 | 16,286 | 28,710 | 35,142

Fig. 11: Traffic: Linked Data versus SPARQL endpoint.

The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type.


Table 13: Hits per statement type in thousands

Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
ask | 56 | 269 | 360 | 326 | 159 | 677
construct | 55 | 31 | 14 | 11 | 22 | 38
describe | 11 | 8 | 4 | 7 | 62 | 111
select | 1,891 | 3,663 | 1,658 | 8,030 | 11,204 | 13,516
unknown | 78 | 206 | 229 | 738 | 4,455 | 1,134
total | 2,090 | 4,177 | 2,266 | 9,112 | 15,902 | 15,475

Table 14: Trends in SPARQL select (rounded values in %)

Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
distinct | 19.5 | 11.4 | 17.3 | 19.4 | 13.3 | 25.4
filter | 45.7 | 13.7 | 31.8 | 25.3 | 29.2 | 36.1
functions | 8.8 | 6.9 | 23.5 | 21.3 | 25.5 | 25.9
geo | 27.7 | 7.0 | 39.6 | 6.7 | 9.3 | 9.2
group | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2
limit | 4.6 | 6.5 | 11.6 | 10.5 | 7.7 | 7.8
optional | 30.6 | 23.8 | 47.3 | 26.3 | 16.7 | 17.2
order | 2.3 | 1.3 | 1.9 | 3.2 | 1.2 | 1.6
union | 3.3 | 3.2 | 26.4 | 11.3 | 12.1 | 20.6

As the log files only record the full SPARQL query on a GET request, all the POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
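For illustration, a single query of the kind counted above might combine several of these constructs at once, e.g. (a sketch using the geo prefix from Table 16; the class and latitude band are illustrative):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?city ?label ?lat WHERE {
  ?city a dbo:City ;
        geo:lat ?lat .
  OPTIONAL { ?city rdfs:label ?label . }
  FILTER (?lat > 50.0 && ?lat < 54.0)   # restrict to a latitude band
}
ORDER BY ?lat
LIMIT 100

Such a query would be counted under distinct, filter, geo, optional, order and limit in Table 14.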

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12: Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, data proliferation for applications, mashups, as well as text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets.25 For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

25 http://wiki.dbpedia.org/Datasets/NLP

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool26 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
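A sketch of such a resource-selecting query, restricting annotation to politicians born in Germany, is shown below. The properties are from the DBpedia ontology; whether Spotlight accepts exactly this form depends on its configuration, so this is illustrative only:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?resource WHERE {
  # annotate only politicians born in Germany
  ?resource a dbo:Politician ;
            dbo:birthPlace dbr:Germany .
}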

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, the Semantic API from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project.31

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com

A related project is ImageSnippets32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia.35 DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series.36 Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems.

28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org
32 http://www.imagesnippets.com
33 http://www.faviki.com
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project.37 Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia.38 These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html
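Once harvested, such authority links would be queryable in the usual way, e.g. (a sketch, assuming the links are published as owl:sameAs triples pointing into the viaf.org namespace):

PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?author ?viafId WHERE {
  ?author owl:sameAs ?viafId .
  # keep only links into the VIAF namespace
  FILTER (STRSTARTS(STR(?viafId), "http://viaf.org/"))
}
LIMIT 100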

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

Facet Based Browsers. An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.
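The kind of underlying query such a facet browser exposes is essentially an aggregation over a property, e.g. (a sketch; the grouping property is illustrative):

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?birthPlace (COUNT(?person) AS ?cnt) WHERE {
  ?person a dbo:Person ;
          dbo:birthPlace ?birthPlace .
}
GROUP BY ?birthPlace
ORDER BY DESC(?cnt)   # most frequent facet values first
LIMIT 20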

Search and Querying. The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows to create statistical analysis queries and correlations in RDF data, and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources,

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org


available in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains.43 Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.44 Statistics are shown in Table 15.

8 Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, the Wikimedia Germany e.V. started the development of Wikidata.45 Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike.

42 http://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page.
44 http://wiktionary.dbpedia.org
45 http://wikidata.org

It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2. Freebase
Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF.

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47 http://www.freebase.com


Table 15: Statistical comparison of extractions for different languages

language  words      triples     resources   predicates  senses
en        2142237    28593364    11804039    28          424386
fr        4657817    35032121    20462349    22          592351
ru        1080156    12813437     5994560    17          149859
de         701739     5618508     2966867    16          122362

Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3. YAGO
One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content. YAGO focuses on extracting a smaller number of relations compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

48 http://www.mpi-inf.mpg.de/yago-naga/yago/

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.
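For example, since DBpedia loads the YAGO types alongside its own ontology types, both hierarchies can be inspected side by side. A minimal sketch, assuming the YAGO classes are published under the http://dbpedia.org/class/yago/ namespace used in the DBpedia dumps:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# list only the YAGO types of a resource, filtering out other type assertions
SELECT ?type WHERE {
  <http://dbpedia.org/resource/Berlin> rdf:type ?type .
  FILTER (STRSTARTS(STR(?type), "http://dbpedia.org/class/yago/"))
}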

8.2. Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 that are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source, Java-based API that allows to access information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as XML dump and is incorporated in the lexical resource UBY51 for language tools.

49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en/
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a web-scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9. Conclusions and Future Work

In this system report, we presented an overview on recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that, in the past four years, DBpedia matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMaps.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
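A minimal sketch of such a sanity check, written against the DBpedia ontology properties for birth and death dates (a production check would likely be more elaborate, e.g. taking partial dates into account):

PREFIX dbo: <http://dbpedia.org/ontology/>

# persons whose recorded death date precedes their birth date
SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}
LIMIT 100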

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (depends on training data) and update (requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16: List of namespace prefixes

Prefix   Namespace
dbo      http://dbpedia.org/ontology/
dbp      http://dbpedia.org/property/
dbr      http://dbpedia.org/resource/
dbr-de   http://de.dbpedia.org/resource/
dc       http://purl.org/dc/elements/1.1/
foaf     http://xmlns.com/foaf/0.1/
geo      http://www.w3.org/2003/01/geo/wgs84_pos#
georss   http://www.georss.org/georss/
ls       http://spotlight.dbpedia.org/scores/
lx       http://spotlight.dbpedia.org/lexicalizations/
owl      http://www.w3.org/2002/07/owl#
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
skos     http://www.w3.org/2004/02/skos/core#
sptl     http://spotlight.dbpedia.org/vocab/
xsd      http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17: DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Mérialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle, and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata, Semantics and Ontologies, 3:37–52, 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven Quality Evaluation of DBpedia. To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.




Table 3: Number of instances per class within 10 localized DBpedia versions

Class         en      it      pl      es      pt      fr      de      ru      ca     hu
Person        763643  145060  70708   65337   43057   62942   33122   18620   7107   15529
Athlete       185126  47187   30332   19482   14130   21646   31237   0       721    4527
Artist        61073   12511   16120   25992   10571   13465   0       0       2004   3821
Politician    23096   0       8943    5513    3342    0       0       12004   1376   760
Place         572728  141101  182727  132961  116660  80602   131766  67932   73078  18324
PopulPlace    387166  138077  167034  121204  109418  72252   79410   63826   72743  15535
Building      60514   1270    2946    3570    803     921     83      43      0      527
River         24267   0       1383    2       4149    3333    6707    3924    0      565
Organisation  192832  4142    12193   11710   10949   17513   16973   1598    1399   3993
Company       44516   4142    2566    975     1903    5832    7200    0       440    618
EducInst      42270   0       599     1207    270     1636    2171    1010    0      115
Band          27061   0       2993    0       4476    3868    5368    0       263    802
Work          333269  51918   32386   36484   33869   39195   18195   34363   5240   9777
MusicWork     159070  23652   14987   15911   17053   16206   0       6588    697    4704
Film          71715   17210   9467    9396    8896    9741    15038   12045   3859   2320
Software      27947   5682    3050    4833    3419    5733    2368    0       606    857

Table 4: Cross-language overlap: Number of instances that are described in multiple languages

Class         Instances  1       2       3      4      5      6      7      8      9      10+
Person        871630     676367  94339   42382  21647  12936  8198   5295   3437   2391   4638
Place         643260     307729  150349  45836  36339  20831  13523  20808  31422  11262  5161
Organisation  206670     160398  22661   9312   5002   3221   2072   1421   928    594    1061
Work          360808     243706  54855   23097  12605  8277   5732   4007   2911   1995   3623

At the time of writing, DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish10. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services.

While at the moment chapters are defined by ownership of the IP and server of the subdomain A record (e.g. http://ko.dbpedia.org) given by DBpedia maintainers, the DBpedia internationalisation committee11 is manifesting its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek12 and Dutch13), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].

10 Accessed on 25/09/2013: http://wiki.dbpedia.org/Internationalization/Chapters
11 http://wiki.dbpedia.org/Internationalization
12 http://el.dbpedia.org
13 http://nl.dbpedia.org

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage.

The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is to achieving total coverage (as shown in Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4. Live Synchronisation

Wikipedia articles are continuously revised at a very high rate; e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute14. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [20,36]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported15. Meanwhile, DBpedia Live for Dutch16 was developed.

4.1. DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [27]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component, serving as a proxy, constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

14 http://stats.wikimedia.org/EN/SummaryEN.htm
15 http://live.dbpedia.org
16 http://live.nl.dbpedia.org

Fig. 8. Overview of DBpedia Live extraction framework.

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki, described in Section 2.4, serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2. Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

– Unmodified Pages: The update of unmodified pages at regular intervals.

– Changesets Publication: The publication of triple-changesets.

– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.

– Data Isolation: Separate data from different sources.


Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages: Extraction framework improvements or the activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and another set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress them and integrate them.

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
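For illustration, applying one changeset then amounts to a pair of SPARQL Update operations of the following shape; the triples shown here are placeholders, as the actual statements come from the downloaded "removed" and "added" N-Triples files:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# delete the triples listed in the changeset's "removed" file (placeholder triple)
DELETE DATA {
  dbr:Berlin dbo:populationTotal "3499879"^^xsd:integer .
} ;
# insert the triples listed in the changeset's "added" file (placeholder triple)
INSERT DATA {
  dbr:Berlin dbo:populationTotal "3515473"^^xsd:integer .
}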

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
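A client can therefore scope a query to one of these graphs explicitly; a small sketch with an arbitrary example resource:

SELECT ?p ?o WHERE {
  GRAPH <http://live.dbpedia.org> {
    <http://dbpedia.org/resource/Berlin> ?p ?o .
  }
}
LIMIT 50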

5. Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing to DBpedia resources.

5.1. Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository17 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release, as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking, as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

An example for the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia)18:

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
    } GROUP BY ?ftsyear ?ftscountry }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
    } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

17 https://github.com/dbpedia/dbpedia-links
18 Endpoint: http://fts.publicdata.eu/sparql, results: http://bit.ly/1c2mIwQ


Table 5: Data sets linked from DBpedia

Data set                 Predicate                 Count       Tool
Amsterdam Museum         owl:sameAs                627         S
BBC Wildlife Finder      owl:sameAs                444         S
Book Mashup              rdf:type, owl:sameAs      9,100
Bricklink                dc:publisher              10,100
CORDIS                   owl:sameAs                314         S
Dailymed                 owl:sameAs                894         S
DBLP Bibliography        owl:sameAs                196         S
DBTune                   owl:sameAs                838         S
Diseasome                owl:sameAs                2,300       S
Drugbank                 owl:sameAs                4,800       S
EUNIS                    owl:sameAs                3,100       S
Eurostat (Linked Stats)  owl:sameAs                253         S
Eurostat (WBSG)          owl:sameAs                137
CIA World Factbook       owl:sameAs                545         S
flickr wrappr            dbp:hasPhotoCollection    3,800,000   C
Freebase                 owl:sameAs                3,600,000   C
GADM                     owl:sameAs                1,900
GeoNames                 owl:sameAs                86,500      S
GeoSpecies               owl:sameAs                16,000      S
GHO                      owl:sameAs                196         L
Project Gutenberg        owl:sameAs                2,500       S
Italian Public Schools   owl:sameAs                5,800       S
LinkedGeoData            owl:sameAs                103,600     S
LinkedMDB                owl:sameAs                13,800      S
MusicBrainz              owl:sameAs                23,000
New York Times           owl:sameAs                9,700
OpenCyc                  owl:sameAs                27,100      C
OpenEI (Open Energy)     owl:sameAs                678         S
Revyu                    owl:sameAs                6
Sider                    owl:sameAs                2,000       S
TCMGeneDIT               owl:sameAs                904
UMBEL                    rdf:type                  896,400
US Census                owl:sameAs                12,600
WikiCompany              owl:sameAs                8,300
WordNet                  dbp:wordnet_type          467,100
YAGO2                    rdf:type                  18,100,000

Sum                                                27,211,732

In addition to providing outgoing links on an instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively.

In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
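A sketch of how these schema-level links can be listed (assuming the ontology statements are loaded into the queried graph of the endpoint):

PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?dbpediaTerm ?schemaOrgTerm WHERE {
  { ?dbpediaTerm owl:equivalentClass ?schemaOrgTerm }
  UNION
  { ?dbpediaTerm owl:equivalentProperty ?schemaOrgTerm }
  FILTER (STRSTARTS(STR(?schemaOrgTerm), "http://schema.org/"))
}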

5.2. Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub19. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI; e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia.

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub20. However, it crawls for RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with most incoming links.

19 See http://wiki.dbpedia.org/Interlinking for details
20 http://datahub.io


Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia

Data set        Link Predicate Count  Link Count
okaboo.com      4                     2,407,121
tfri.gov.tw     57                    837,459
naplesplus.us   2                     212,460
fu-berlin.de    7                     145,322
freebase.com    108                   138,619
geonames.org    3                     125,734
opencyc.org     3                     19,648
geospecies.org  10                    16,188
dbrec.net       3                     12,856
faviki.com      5                     12,768

Table 7: Sindice summary statistics for incoming links to DBpedia

Metric                     Value
Total links                3,960,212
Total distinct data sets   248
Total distinct predicates  345

Table 8: Top 10 datasets by incoming links in Sindice

Domain               Datasets  Links
purl.org             498       6,717,520
dbpedia.org          248       3,960,212
creativecommons.org  2,483     3,030,910
identi.ca            1,021     2,359,276
l3s.de               34        1,261,487
rkbexplorer.com      24        1,212,416
nytimes.com          27        1,174,941
w3.org               405       658,535
geospecies.org       13        523,709
livejournal.com      14,881    366,025

Those are authorities in the network structure of the Web of Data, and DBpedia is currently ranked second in terms of incoming links.

6. DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferencable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012.

6.1. Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to the download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition is indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets, which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 is depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month21.

Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.
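For network analysis, a consumer would typically load the page links dump into a local store (since, as noted, the data set is not exposed via the official endpoint) and compute graph measures over it; e.g. the out-degree of a single article, assuming the dbo:wikiPageWikiLink predicate used in the dumps:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# number of outgoing wiki links of one article in the page links data set
SELECT (COUNT(?target) AS ?outDegree) WHERE {
  dbr:Berlin dbo:wikiPageWikiLink ?target .
}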

6.2. Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round-robin HTTP front-end.

21 The IP address was only filtered for that specific file and month in those cases.

22 http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume (in GB) of the DBpedia data sets.

Table 9: Hardware of the machines serving the public SPARQL endpoint

DBpedia    Configuration
3.3 - 3.4  AMD Opteron 8220, 2.80GHz, 4 cores, 32GB
3.5 - 3.7  Intel Xeon E5520, 2.27GHz, 8 cores, 48GB
3.8        Intel Xeon E5-2630, 2.30GHz, 8 cores, 64GB

This allows the cluster setup to grow in line with desired response times for an RDF data set collection. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful, but still moderately priced, hardware, as shown in Table 9. The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15), as well as a bandwidth limit per request (currently 10MB), are enforced.

Table 11: Number of unique sites accessing DBpedia endpoints

DBpedia  Avg/Day  Median  Stdev  Maximum
3.3      5,824    5,977   1,046  7,865
3.4      6,656    6,704   947    8,312
3.5      9,392    9,432   1,643  12,584
3.6      10,810   10,810  1,612  14,798
3.7      17,077   17,091  6,782  33,288
3.8      14,513   14,493  2,783  20,510

If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code23, like 509 ("Bandwidth Limit Exceeded"), and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3. Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.2124. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted.

23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
24 http://www.webalizer.org


Table 10: Number of endpoint hits (left) and visits (right)

Hits:
DBpedia  Avg/Day    Median     Stdev      Maximum
3.3      733,811    711,306    188,991    1,319,539
3.4      1,212,549  1,165,893  351,226    2,371,657
3.5      1,122,612  1,035,444  386,454    2,908,720
3.6      1,328,355  1,286,750  256,945    2,495,031
3.7      2,085,399  1,930,728  1,057,398  8,873,473
3.8      2,910,410  2,717,775  1,085,640  7,678,490

Visits:
DBpedia  Avg/Day  Median  Stdev   Maximum
3.3      9,750    9,775   2,036   13,126
3.4      11,329   11,473  1,546   14,198
3.5      16,592   16,902  2,969   23,129
3.6      19,471   17,664  5,691   56,064
3.7      23,972   22,262  10,741  127,869
3.8      16,851   16,711  2,960   27,416

The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by the Median and Standard Deviation. The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4. Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data directly.

Table 12: Hits per service to http://dbpedia.org in thousands

Endpoint  3.3    3.4     3.5    3.6     3.7     3.8
data      1,355  2,681   2,230  2,547   3,714   4,246
ontology  80     168     142    178     116     97
page      2,634  4,703   1,810  1,687   3,787   7,377
property  231    311     137    176     176     128
resource  2,976  4,080   2,554  2,309   4,436   7,143
sparql    2,090  4,177   2,266  9,112   15,902  15,475
other     252    420     342    277     579     695
total     9,619  16,541  9,434  16,286  28,710  35,142

Fig. 11. Traffic: Linked Data versus SPARQL endpoint.

This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta-information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type.


Table 13: Hits per statement type in thousands

Statement  3.3    3.4    3.5    3.6    3.7     3.8
ask        56     269    360    326    159     677
construct  55     31     14     11     22      38
describe   11     8      4      7      62      111
select     1,891  3,663  1,658  8,030  11,204  13,516
unknown    78     206    229    738    4,455   1,134
total      2,090  4,177  2,266  9,112  15,902  15,475

Table 14: Trends in SPARQL select (rounded values in %)

Statement  3.3   3.4   3.5   3.6   3.7   3.8
distinct   19.5  11.4  17.3  19.4  13.3  25.4
filter     45.7  13.7  31.8  25.3  29.2  36.1
functions  8.8   6.9   23.5  21.3  25.5  25.9
geo        27.7  7.0   39.6  6.7   9.3   9.2
group      0.0   0.0   0.0   0.0   0.0   0.2
limit      4.6   6.5   11.6  10.5  7.7   7.8
optional   30.6  23.8  47.3  26.3  16.7  17.2
order      2.3   1.3   1.9   3.2   1.2   1.6
union      3.3   3.2   26.4  11.3  12.1  20.6

As the log files only record the full SPARQL query on a GET request, all the POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.

6.5. Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7. Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, proliferation for applications, mashups, as well as text analysis and annotation tools.

7.1. Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

25 http://wiki.dbpedia.org/Datasets/NLP

7.1.1. Annotation: Entity Disambiguation
An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool26, including a free web service, that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
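A sketch of such a restriction: the whitelist of annotation targets could be defined by a query like the following, which selects politicians born in Berlin (the exact query shape accepted by a given Spotlight version may differ):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?resource WHERE {
  ?resource a dbo:Politician ;
            dbo:birthPlace dbr:Berlin .
}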

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, Semantic API from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com
28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org

A related project is ImageSnippets32, a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

32 http://www.imagesnippets.com

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

33 http://www.faviki.com
34 http://bbc.co.uk

7.1.2. Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
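For example, a QALD-style question such as "Give me all films directed by Steven Spielberg" maps to a comparatively simple query over the DBpedia ontology (a sketch using the prefixes from Table 16; the gold standard queries of the benchmark may differ in detail):

SELECT ?film WHERE {
  # all resources whose director is Steven Spielberg
  ?film dbo:director dbr:Steven_Spielberg .
}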

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].
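A candidate fact produced by such a system can, for instance, be checked against DBpedia with an ASK query (a minimal sketch; the concrete film is illustrative, and using DBpedia as reference data is an assumption of this validation strategy):

ASK {
  # does DBpedia confirm the extracted (subject, property, value) triple?
  dbr:Saving_Private_Ryan dbo:director dbr:Steven_Spielberg .
}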

7.2. Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records. Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data. Thus, (re-)using its commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project37. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia38. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general, as sketched below.

37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html
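Once these links are included, a library system could resolve a VIAF authority record to its DBpedia description with a simple lookup (a minimal sketch; the owl:sameAs predicate and the VIAF URI pattern of the harvested links are assumptions, and the identifier is purely illustrative):

SELECT ?resource WHERE {
  # find the DBpedia resource linked to a given VIAF authority record
  ?resource owl:sameAs <http://viaf.org/viaf/102333412> .
}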

7.3. Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of these tools, structured by category.

Facet Based Browsers. An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LodLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows users to explore the neighborhood of and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows users to create statistical analysis queries and correlations in RDF data, and DBpedia in particular.
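The core of such connection finding can be approximated in plain SPARQL, e.g. by retrieving resources that two persons both link to (a simplified sketch of the length-two paths that RelFinder visualises; the tool itself issues a series of more elaborate queries):

SELECT DISTINCT ?intermediate WHERE {
  # resources reachable from both persons in one hop
  dbr:Albert_Einstein ?p1 ?intermediate .
  dbr:Niels_Bohr ?p2 ?intermediate .
}
LIMIT 20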

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4. Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains43. Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct/
41 http://querybuilder.dbpedia.org/

Due to its fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, which is part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data44. Statistics are shown in Table 15.

42 http://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page.
44 http://wiktionary.dbpedia.org
45 http://wikidata.org

8. Related Work

8.1. Cross Domain Community Knowledge Bases

8.1.1. Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata45. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2. Freebase

Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47 http://www.freebase.com

Table 15
Statistical comparison of extractions for different languages

language  words      triples     resources   predicates  senses
en        2,142,237  28,593,364  11,804,039  28          424,386
fr        4,657,817  35,032,121  20,462,349  22          592,351
ru        1,080,156  12,813,437   5,994,560  17          149,859
de          701,739   5,618,508   2,966,867  16          122,362

8.1.3. YAGO

One of the projects that pursues goals similar to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and to provide an RDF version of its content. YAGO focuses on extracting a smaller number of relations, compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

48 http://www.mpi-inf.mpg.de/yago-naga/yago/

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.
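On the public DBpedia endpoint this means that instances can be selected via YAGO classes just as via DBpedia ontology classes (a sketch; the concrete class URI, which encodes a WordNet synset identifier, is given for illustration only):

SELECT ?person WHERE {
  # use a YAGO class loaded into DBpedia instead of a dbo: class
  ?person rdf:type <http://dbpedia.org/class/yago/Scientist110560637> .
}
LIMIT 10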

8.2. Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 that are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source, Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation or for synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source, based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage, beyond the use of Wikipedia interlanguage links, is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en/
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a web scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from the categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

56 http://www.cs.washington.edu/research/knowitall/

9. Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article include in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that, in the past four years, DBpedia has matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion of different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. Once we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.
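Such a comparison can already be sketched with a federated query, e.g. contrasting the population of Berlin in the English and German editions (a minimal sketch using the prefixes from Table 16; it assumes that the German chapter's endpoint is reachable via SERVICE and uses the same dbo:populationTotal property):

SELECT ?enPopulation ?dePopulation WHERE {
  # value extracted from the English Wikipedia
  dbr:Berlin dbo:populationTotal ?enPopulation .
  # value extracted from the German Wikipedia via the German chapter
  SERVICE <http://de.dbpedia.org/sparql> {
    dbr-de:Berlin dbo:populationTotal ?dePopulation .
  }
}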

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. Future versions of DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
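A minimal sketch of such a sanity check, using the dbo:birthDate and dbo:deathDate properties and the prefixes from Table 16, could look as follows:

SELECT ?person ?birth ?death WHERE {
  # persons whose recorded death date precedes their birth date
  ?person rdf:type dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}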

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to which entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint

– Kingsley Idehen for SPARQL and Linked Data hosting and community support

– Christopher Sahnwaldt for DBpedia development and release management

– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:

∗ Instance Data Analysis: Volha Bryl (working at University of Mannheim)

∗ Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)

∗ Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16
List of namespace prefixes

Prefix   Namespace
dbo      http://dbpedia.org/ontology/
dbp      http://dbpedia.org/property/
dbr      http://dbpedia.org/resource/
dbr-de   http://de.dbpedia.org/resource/
dc       http://purl.org/dc/elements/1.1/
foaf     http://xmlns.com/foaf/0.1/
geo      http://www.w3.org/2003/01/geo/wgs84_pos#
georss   http://www.georss.org/georss/
ls       http://spotlight.dbpedia.org/scores/
lx       http://spotlight.dbpedia.org/lexicalizations/
owl      http://www.w3.org/2002/07/owl#
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
skos     http://www.w3.org/2004/02/skos/core#
sptl     http://spotlight.dbpedia.org/vocab/
xsd      http://www.w3.org/2001/XMLSchema#

Table 17
DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
2007  Jun    ESWC DBpedia article [2]
2007  Nov    DBpedia 2.0 release
2007  Nov    ISWC DBpedia article [1]
2007  Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
2008  Aug    DBpedia 3.1 release
2008  Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
2009  Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
2010  Mar    DBpedia 3.4 release
2010  Apr    DBpedia 3.5 release
2010  May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
2011  Mar    DBpedia 3.6 release
2011  Jul    DBpedia Live release
2011  Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
2012  Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. Journal of Web Semantics, 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann. DBpedia - a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD: International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini and A. Antonuccio. LodLive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann and S. Auer. DBpedia Live Extraction. In Proc. of the 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata, Semantics and Ontologies, 3:37–52, 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer and J. Lehmann. User-driven Quality Evaluation of DBpedia. To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

View publication statsView publication stats

Page 15: DBpedia - A Large-scale, Multilingual Knowledge Base ...static.tongtianta.site/paper_pdf/38bd5a66-1521-11e9-bf90-00163e08bb… · DBpedia is served as Linked Data on the Web. Since

14 Lehmann et al DBpedia

language is overtaking another in their race for highercoverage The mapping sprints have served as a greatmotivator for the crowd-sourcing efforts as it can benoted from the increase in the number of mapping con-tributions in the weeks leading to a release

4 Live Synchronisation

Wikipedia articles are continuously revised at a veryhigh rate eg the English Wikipedia in June 2013has approximately 33 million edits per month whichis equal to 77 edits per minute14 This high changefrequency leads to DBpedia data quickly being out-dated which in turn leads to the need for a methodologyto keep DBpedia in synchronisation with WikipediaAs a consequence the DBpedia Live system was de-veloped which works on a continuous stream of up-dates from Wikipedia and processes that stream on thefly [2036] It allows extracted data to stay up-to-datewith a small delay of at most a few minutes Since theEnglish Wikipedia is the largest among all Wikipediaeditions with respect to the number of articles and thenumber of edits per month it was the first languageDBpedia Live supported15 Meanwhile DBpedia Livefor Dutch16 was developed

41 DBpedia Live System Architecture

In order for live synchronisation to be possible weneed access to the changes made in Wikipedia TheWikimedia foundation kindly provided us access totheir update stream using the OAI-PMH protocol [27]This protocol allows a programme to pull page updatesin XML via HTTP A Java component serving as aproxy constantly retrieves new updates and feeds themto the DBpedia Live framework This proxy is nec-essary to decouple the stream from the framework tosimplify maintenance of the software The live extrac-tion workflow uses this update stream to extract newknowledge upon relevant changes in Wikipedia articles

The overall architecture of DBpedia Live is indicatedin Figure 8 The major components of the system areas follows

14httpstatswikimediaorgENSummaryENhtm

15httplivedbpediaorg16httplivenldbpediaorg

Fig 8 Overview of DBpedia Live extraction framework

ndash Local Wikipedia Mirror A local copy of aWikipedia language edition is installed which iskept in real-time synchronisation with its live ver-sion using the OAI-PMH protocol Keeping a localWikipedia mirror allows us to exceed any accesslimits posed by Wikipedia

ndash Mappings Wiki The DBpedia Mappings Wikidescribed in Section 24 serves as secondary inputsource Changes of the mappings wiki are alsoconsumed via an OAI-PMH stream Note that asingle mapping change can affect a high numberof DBpedia resources

ndash DBpedia Live Extraction Manager This is thecore component of the DBpedia Live extractionarchitecture The manager takes feeds of pages forre-processing as input and applies all the enabledextractors After processing a page the extractedtriples are a) inserted into a backend triple store(in our case Virtuoso [10]) updating the old triplesand b) saved as changesets into a compressed N-Triples file structure

ndash Synchronisation Tool This tool allows third par-ties to keep DBpedia Live mirrors up-to-date byharvesting the produced changesets

42 Features of DBpedia Live

The core components of the DBpedia Live Extractionframework provide the following features

ndash Mapping-Affected Pages The update of all pagesthat are affected by a mapping change

ndash Unmodified Pages The update of unmodifiedpages at regular intervals

ndash Changesets Publication The publication of triple-changesets

ndash Synchronisation Tool A synchronisation tool forharvesting updates to DBpedia Live mirrors

ndash Data Isolation Separate data from differentsources

Lehmann et al DBpedia 15

Mapping-Affected Pages Whenever an infobox map-ping change occurs all the Wikipedia pages that usethat infobox are reprocessed Taking Figure 2 as anexample if a new property mapping is introduced (iedbotranslator) or an existing (ie dboillustrator) isupdated or deleted then all entities belonging to theclass dboBook are reprocessed Thus upon a mappingchange we identify all the affected Wikipedia pagesand feed them for reprocessing

Unmodified Pages Extraction framework improve-ments or activation deactivation of DBpedia extractorsmight never be applied to rarely modified pages Toovercome this problem we obtain a list of the pageswhich have not been processed over a period of time(30 days in our case) and feed that list to the DBpediaLive extraction framework for reprocessing This feedhas a lower priority than the update or the mappingaffected pages feed and ensures that all articles reflect arecent state of the output of the extraction framework

Publication of Changesets Whenever a Wikipediaarticle is processed we get two disjoint sets of triples Aset for the added triples and another set for the deletedtriples We write those two sets into N-Triples filescompress them and publish the compressed files aschangesets If another DBpedia Live mirror wants tosynchronise with the DBpedia Live endpoint it can justdownload those files decompress and integrate them

Synchronisation Tool The synchronisation tool en-ables a DBpedia Live mirror to stay in synchronisationwith our live endpoint It downloads the changeset filessequentially decompresses them and updates the targetSPARQL endpoint via insert and delete operations

Data Isolation In order to keep the data isolatedDBpedia Live keeps different sources of data indifferent SPARQL graphs Data from the articleupdate feeds are contained in the graph with theURI httplivedbpediaorg static data (ielinks to the LOD cloud) are kept in httpstaticdbpediaorg and the DBpedia ontology is storedin httpdbpediaorgontology All data isalso accessible under the httpdbpediaorggraph for combined queries Next versions of DBpe-dia Live will also separate data from the raw infoboxextraction and mapping-based infobox extraction

5 Interlinking

DBpedia is interlinked with numerous external datasets following the Linked Data principles In this sec-

tion we give an overview of the number and typesof outgoing links that point from DBpedia into otherdata sets as well as the external data sets that set linkspointing to DBpedia resources

51 Outgoing Links

Similar to the DBpedia ontology DBpedia also fol-lows a community approach for adding links to otherthird party data sets The DBpedia project maintainsa link repository17 for which conventions for addinglinksets and linkset metadata are defined The adher-ence to those guidelines is supervised by a linking com-mittee Linksets which are added to the repository areused for the subsequent official DBpedia release as wellas for DBpedia Live Table 5 lists the linksets createdby the DBpedia community as of April 2013 The firstcolumn names the data set that is the target of the linksThe second and third column contain the predicate thatis used for linking as well as the overall number of linksthat is set between DBpedia and the external data setThe last column names the tool that was used to gener-ate the links The value S refers to Silk L to LIMESC to custom script and a missing entry means that thedataset is copied from the previous releases and notregenerated

An example for the usage of links is the combina-tion of data about European Union project funding(FTS) [32] and data about countries in DBpedia Thequery below compares funding per year (from FTS) andcountry with the gross domestic product of a country(from DBpedia)18

1 SELECT 2 SELECT ftsyear ftscountry (SUM(amount) AS

funding) 3 com rdftype fts-oCommitment 4 com fts-oyear year 5 year rdfslabel ftsyear 6 com fts-obenefit benefit 7 benefit fts-odetailAmount amount 8 benefit fts-obeneficiary beneficiary 9 beneficiary fts-ocountry country

10 country owlsameAs ftscountry 11 12 SELECT dbpcountry gdpyear gdpnominal 13 dbpcountry rdftype dboCountry 14 dbpcountry dbpgdpNominal gdpnominal 15 dbpcountry dbpgdpNominalYear gdpyear 16 17 FILTER ((ftsyear = str(gdpyear)) ampamp18 (ftscountry = dbpcountry))

17httpsgithubcomdbpediadbpedia-links18Endpoint httpftspublicdataeusparql

Results httpbitly1c2mIwQ

16 Lehmann et al DBpedia

Table 5Data sets linked from DBpedia

Data set Predicate Count Tool

Amsterdam Museum owlsameAs 627 SBBC Wildlife Finder owlsameAs 444 SBook Mashup rdftype 9 100

owlsameAsBricklink dcpublisher 10 100CORDIS owlsameAs 314 SDailymed owlsameAs 894 SDBLP Bibliography owlsameAs 196 SDBTune owlsameAs 838 SDiseasome owlsameAs 2 300 SDrugbank owlsameAs 4 800 SEUNIS owlsameAs 3 100 SEurostat (Linked Stats) owlsameAs 253 SEurostat (WBSG) owlsameAs 137CIA World Factbook owlsameAs 545 Sflickr wrappr dbphasPhoto- 3 800 000 C

CollectionFreebase owlsameAs 3 600 000 CGADM owlsameAs 1 900GeoNames owlsameAs 86 500 SGeoSpecies owlsameAs 16 000 SGHO owlsameAs 196 LProject Gutenberg owlsameAs 2 500 SItalian Public Schools owlsameAs 5 800 SLinkedGeoData owlsameAs 103 600 SLinkedMDB owlsameAs 13 800 SMusicBrainz owlsameAs 23 000New York Times owlsameAs 9 700OpenCyc owlsameAs 27 100 COpenEI (Open Energy) owlsameAs 678 SRevyu owlsameAs 6Sider owlsameAs 2 000 STCMGeneDIT owlsameAs 904UMBEL rdftype 896 400US Census owlsameAs 12 600WikiCompany owlsameAs 8 300WordNet dbpwordnet type 467 100YAGO2 rdftype 18 100 000

Sum 27 211 732

In addition to providing outgoing links on aninstance-level DBpedia also sets links on schema-level pointing from the DBpedia ontology to equiva-lent terms in other schemas Links to other schematacan be set by the community within the DBpedia Map-pings Wiki by using owlequivalentClass inclass templates and owlequivalentPropertyin datatype or object property templates respectively

In particular in 2011 Google Microsoft and Yahooannounced their collaboration on Schemaorg a col-lection of vocabularies for marking up content on webpages The DBpedia 38 ontology contains 45 equiva-lent class and 31 equivalent property links pointing tohttpschemaorg terms

52 Incoming Links

DBpedia is being linked to from a variety of datasets The overall number of incoming links ie linkspointing to DBpedia from other data sets is 39007478according to the Data Hub19 However those countsare entered by users and may not always be valid andup-to-date

In order to identify actually published and onlinedata sets that link to DBpedia we used Sindice [39]The Sindice project crawls RDF resources on the weband indexes those resources In Sindice a data set isdefined by the second-level domain name of the en-tityrsquos URI eg all resources available at the domainfu-berlinde are considered to belong to the samedata set A triple is considered to be a link if the dataset of subject and object are different Furthermore theSindice data we used for analysis only considers au-thoritative entities The data set of a subject of a triplemust match the domain it was retrieved from otherwiseit is not considered Sindice computes a graph sum-mary [8] over all resources they store With the helpof the Sindice team we examined this graph summaryto obtain all links pointing to DBpedia As shown inTable 7 Sindice knows about 248 data sets linking toDBpedia 70 of those data sets link to DBpedia viaowlsameAs but other link predicates are also verycommon as evident in this table In total Sindice hasindexed 4 million links pointing to DBpedia Table 6lists the 10 data sets which set most links to DBpedia

It should be noted that the data in Sindice is not com-plete for instance it does not contain all data sets thatare catalogued by the DataHub20 However it crawls forRDFa snippets converts microformats etc which arenot captured by the DataHub Despite the inaccuracythe relative comparison of different datasets can stillgive us insights Therefore we analysed the link struc-ture of all Sindice datasets using the Sindice clusterTable 8 shows the datasets with most incoming linksThose are authorities in the network structure of the

19See httpwikidbpediaorgInterlinking fordetails

20httpdatahubio

Lehmann et al DBpedia 17

Table 6

Top 10 data sets in Sindice ordered by the number of links to DBpedia

Data set Link Predicate Count Link Count

okaboocom 4 2407121tfrigovtw 57 837459naplesplusus 2 212460fu-berlinde 7 145322freebasecom 108 138619geonamesorg 3 125734opencycorg 3 19648geospeciesorg 10 16188dbrecnet 3 12856favikicom 5 12768

Table 7Sindice summary statistics for incoming links to DBpedia

Metric Value

Total links 3960212Total distinct data sets 248Total distinct predicates 345

Table 8Top 10 datasets by incoming links in Sindice

domain datasets links

purlorg 498 6717520dbpediaorg 248 3960212creativecommonsorg 2483 3030910identica 1021 2359276l3sde 34 1261487rkbexplorercom 24 1212416nytimescom 27 1174941w3org 405 658535geospeciesorg 13 523709livejournalcom 14881 366025

web of data and DBpedia is currently ranked second interms of incoming links

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms Firstit is provided in the form of downloadable data setswhere each data set contains the results of one of theextractors listed in Table 1 Second DBpedia is servedvia a public SPARQL endpoint and third it providesdereferencable URIs according to the Linked Data prin-ciples In this section we explore some of the statisticsgathered during the hosting of DBpedia over the lasttwo of years

0

20

40

60

80

100

120

2012-Q1 2012-Q2 2012-Q3 2012-Q4

Download Count (in Thousands) Download Volume (in TB)

Fig 9 The download count and download volume of the Englishlanguage edition of DBpedia

61 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages but thoselanguages vary with respect to the download popular-ity as well The top five languages with respect to thedownload volume are English Chinese German Cata-lan and French respectively The download count anddownload volume of the English language is indicatedin Figure 9 To host the DBpedia dataset downloadsa bandwidth of approximately 6 TB per month is cur-rently needed

Furthermore DBpedia consists of several data setswhich vary with respect to their download popularityThe download count and the download volume of eachdata set during the year 2012 is depicted in Figure 10In those statistics we filtered out all IP addresses whichrequested a file more than 1000 times per month21

Pagelinks are the most downloaded dataset althoughthey are not semantically rich as they do not revealwhich type of links exists between two resources Sup-posedly they are used for network analysis or providingrelevant links in user interfaces and downloaded moreoften as they are not provided via the official SPARQLendpoint

62 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 ishosted using the Virtuoso Universal Server (EnterpriseEdition) version 64 software in a 4-nodes cluster con-figuration This cluster setup provides parallelizationof query execution even when the cluster nodes areon the same machine as splitting a query over severalnodes allows better use of parallel threads on modernmulti-core CPUs on standard commodity hardware

Virtuoso supports horizontal scale-out either by re-distributing the existing cluster nodes onto multiple ma-

21The IP address was only filtered for that specific file and monthin those cases

22httpdbpediaorgsparql

18 Lehmann et al DBpedia

0

10

20

30

40

50

60

70Download Count (in Thousands) Download Volume (in TB)

Fig 10 The download count and download volume (in GB) of the DBpedia data sets

Table 9Hardware of the machines serving the public SPARQL endpoint

DBpedia Configuration

33 - 34 AMD Opteron 8220 280Ghz 4 Cores 32GB35 - 37 Intel Xeon E5520 227Ghz 8 Cores 48GB38 Intel Xeon E5-2630 230GHz 8 Cores 64GB

chines or by adding several separate clusters with around robin HTTP front-end This allows the clustersetup to grow in line with desired response times foran RDF data set collection As the size of the DBpediadata set increased and its use by the Linked Data com-munity grew the project migrated to increasingly pow-erful but still moderately priced hardware as shown inTable 9 The Virtuoso instance is configured to processqueries within a 1200 second timeout window and amaximum result set size of 50000 rows It providesOFFSET and LIMIT support for paging alongside theability to produce partial results

The log files used in the following analysis excludedtraffic generated by

1 clients that have been temporarily rate limitedafter a burst period

2 clients that have been banned after misuse3 applications spiders and other crawlers that are

blocked after frequently hitting the rate limit orgenerally use too many resources

Virtuoso supports HTTP Access Control Lists(ACLs) which allow the administrator to rate limit cer-tain IP addresses or whole IP ranges A maximum num-ber of requests per second (currently 15) as well asa bandwidth limit per request (currently 10MB) areenforced If the client software can handle compres-

Table 11Number of unique sites accessing DBpedia endpoints

DBpedia AvgDay Median Stdev Maximum

33 5824 5977 1046 786534 6656 6704 947 831235 9392 9432 1643 1258436 10810 10810 1612 1479837 17077 17091 6782 3328838 14513 14493 2783 20510

sion replies are compressed to further save bandwidthException rules can be configured for multiple clientshidden behind a NAT firewall (appearing as a single IPaddress) or for temporary requests for higher rate limitsWhen a client hits an ACL limit the system reports anappropriate HTTP status code23 like 509 (rdquoBandwidthLimit Exceededrdquo) and quickly drops the connectionThe system further uses an iptables based firewall forpermanent blocking of clients identified by their IPaddresses

63 Public Static Endpoint Statistics

The statistics presented in this section were extractedfrom reports generated by Webalizer v22124 Table 10and Table 11 show various DBpedia SPARQL endpointusage statistics for the DBpedia 33 to 38 releases Notethat the usage of all endpoints mentioned in Table 12 iscounted The AvgDay column represents the averagenumber of hits (resp visitssites) per day followed by

23httpenwikipediaorgwikiList_of_HTTP_status_codes

24httpwwwwebalizerorg

Lehmann et al DBpedia 19

Table 10Number of endpoint hits (left) and visits (right)

DBpedia AvgDay Median Stdev Maximum

33 733811 711306 188991 131953934 1212549 1165893 351226 237165735 1122612 1035444 386454 290872036 1328355 1286750 256945 249503137 2085399 1930728 1057398 887347338 2910410 2717775 1085640 7678490

DBpedia AvgDay Median Stdev Maximum

33 9750 9775 2036 1312634 11329 11473 1546 1419835 16592 16902 2969 2312936 19471 17664 5691 5606437 23972 22262 10741 12786938 16851 16711 2960 27416

the Median and Standard Deviation The last columnshows the maximum number of hits (resp visitssites)that was recorded on a single day for each data setversion Visits (ie sessions of subsequent queries fromthe same client) are determined by a floating 30 minutetime window All requests from behind a NAT firewallare logged under the same external IP address and aretherefore counted towards the same visit if they occurwithin the 30 minute interval

Table 10 shows the increasing popularity of DBpediaThere is a distinct dip in hits to the SPARQL endpoint inDBpedia 35 which is partially due to more strict initiallimits for bot-related traffic which were later relaxedThe sudden drop of visits between the 37 and the 38data sets can be attributed to

1 applications starting to use their own privateDBpedia endpoint

2 blocking of apps that were abusing the DBpediaendpoint

3 uptake of the language specific DBpedia end-points and DBpedia Live

64 Query Types and Trends

The DBpedia server is not only a SPARQL endpointbut also serves as a Linked Data Hub returning re-sources in a number of different formats For each dataset we randomly selected 14 days worth of log files andprocessed those in order to show the various servicescalled Table 12 shows the number of hits to the variousendpoints

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.

Table 12: Hits per service to http://dbpedia.org, in thousands

Endpoint     3.3     3.4     3.5     3.6     3.7     3.8
data        1355    2681    2230    2547    3714    4246
ontology      80     168     142     178     116      97
page        2634    4703    1810    1687    3787    7377
property     231     311     137     176     176     128
resource    2976    4080    2554    2309    4436    7143
sparql      2090    4177    2266    9112   15902   15475
other        252     420     342     277     579     695
total       9619   16541    9434   16286   28710   35142

Fig. 11: Traffic Linked Data versus SPARQL endpoint

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13 we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the POST requests are counted as unknown.

Table 13: Hits per statement type, in thousands

Statement     3.3     3.4     3.5     3.6     3.7     3.8
ask            56     269     360     326     159     677
construct      55      31      14      11      22      38
describe       11       8       4       7      62     111
select       1891    3663    1658    8030   11204   13516
unknown        78     206     229     738    4455    1134
total        2090    4177    2266    9112   15902   15475

Table 14: Trends in SPARQL select (rounded values in %)

Statement     3.3     3.4     3.5     3.6     3.7     3.8
distinct     19.5    11.4    17.3    19.4    13.3    25.4
filter       45.7    13.7    31.8    25.3    29.2    36.1
functions     8.8     6.9    23.5    21.3    25.5    25.9
geo          27.7     7.0    39.6     6.7     9.3     9.2
group         0.0     0.0     0.0     0.0     0.0     0.2
limit         4.6     6.5    11.6    10.5     7.7     7.8
optional     30.6    23.8    47.3    26.3    16.7    17.2
order         2.3     1.3     1.9     3.2     1.2     1.6
union         3.3     3.2    26.4    11.3    12.1    20.6

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base. A query of the kind analyzed here is sketched below.
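To make the counted constructs concrete, a query of the following shape (an illustrative example composed for this section, not one taken from the logs) combines DISTINCT, OPTIONAL, FILTER, ORDER BY and LIMIT:

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Cities with more than two million inhabitants, together with
# their German label where one exists, largest first.
SELECT DISTINCT ?city ?population ?label WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?population .
  OPTIONAL {
    ?city rdfs:label ?label .
    FILTER (lang(?label) = "de")
  }
  FILTER (?population > 2000000)
}
ORDER BY DESC(?population)
LIMIT 10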

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12: Number of daily requests sent to the DBpedia Live endpoint for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013

7 Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, data proliferation for applications and mashups, as well as text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets²⁵. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

²⁵ http://wiki.dbpedia.org/Datasets/NLP
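As a minimal sketch of the latter, assuming (as in the published gender data set) that the gender is attached to person resources via the foaf:gender property, a co-reference resolver could look the feature up as follows:

PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Look up the grammatical gender recorded for a person.
SELECT ?gender WHERE {
  dbr:Marie_Curie foaf:gender ?gender .
}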

7.1.1 Annotation / Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name, or surface form, may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool²⁶ including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
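For example, a deployment could be restricted to annotating only politicians born in Berlin by selecting the target resources with a query along the following lines (a hypothetical restriction sketched for illustration, not a configuration shipped with Spotlight):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Candidate annotation targets: politicians born in Berlin.
SELECT ?resource WHERE {
  ?resource a dbo:Politician ;
            dbo:birthPlace dbr:Berlin .
}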

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI²⁷, Semantic API from Ontos²⁸, Open Calais²⁹ and Zemanta³⁰, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project³¹.

A related project is ImageSnippets³², which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

²⁶ http://spotlight.dbpedia.org/
²⁷ http://www.alchemyapi.com/

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki³³ suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC³⁴ [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia³⁵. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series³⁶. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems. Due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

²⁸ http://www.ontos.com/
²⁹ http://www.opencalais.com/
³⁰ http://www.zemanta.com/
³¹ http://stanbol.apache.org/
³² http://www.imagesnippets.com/
³³ http://www.faviki.com/
³⁴ http://bbc.co.uk/
³⁵ http://www.aaai.org/Magazine/Watson/watson.php
³⁶ http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
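To illustrate the task, a QALD-style natural language question such as "Who wrote The Lord of the Rings?" has to be translated by such systems into a query like the following (a handcrafted example in the spirit of the benchmark questions, not an official QALD query):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# "Who wrote The Lord of the Rings?"
SELECT ?author WHERE {
  dbr:The_Lord_of_the_Rings dbo:author ?author .
}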

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records. Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project³⁷. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia³⁸. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.
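Once such links are available, a library application could retrieve the VIAF identifiers recorded for an author via the harvested links (a sketch, assuming the links follow the usual DBpedia linkset convention of owl:sameAs triples pointing to viaf.org URIs):

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# VIAF identifiers linked to the DBpedia resource for Douglas Adams.
SELECT ?viaf WHERE {
  dbr:Douglas_Adams owl:sameAs ?viaf .
  FILTER (STRSTARTS(STR(?viaf), "http://viaf.org/"))
}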

³⁷ http://viaf.org/
³⁸ Accessed on 12/02/2013, http://www.oclc.org/research/news/2012/12-07a.html

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of these tools and structure them in categories.

Facet Based Browsers. An award-winning³⁹ facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [15]. Another facet based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL based facet explorer, which also uses a graph based visualisation of facets, is LODLive [7]. The OpenLink built-in facet based browser⁴⁰ is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder⁴¹ allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows users to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows users to create statistical analysis queries and correlations in RDF data, and DBpedia in particular.
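The queries such a builder assembles are plain sets of triple patterns, for instance (an illustrative example using DBpedia ontology properties):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Films directed by Tim Burton, with their release dates.
SELECT ?film ?date WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Tim_Burton ;
        dbo:releaseDate ?date .
}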

Spatial Applications. DBpedia Mobile [3] is a location aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

³⁹ http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
⁴⁰ http://dbpedia.org/fct/
⁴¹ http://querybuilder.dbpedia.org/

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available in 171 languages (of which approximately 147 can be considered active⁴²). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains⁴³. Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data⁴⁴. Statistics are shown in Table 15.

⁴² https://s23.org/wikistats/wiktionaries_html.php
⁴³ See http://en.wiktionary.org/wiki/semantic for a simple example page
⁴⁴ http://wiktionary.dbpedia.org/

8 Related Work

8.1 Cross Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata⁴⁵. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects, and allows for central access to the data in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

⁴⁵ http://wikidata.org/

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.
– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface⁴⁶, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

⁴⁶ http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface

8.1.2 Freebase

Freebase⁴⁷ is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows metadata for each fact to be stored efficiently. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

⁴⁷ http://www.freebase.com/

Table 15: Statistical comparison of extractions for different languages

language      words       triples     resources    predicates    senses
en          2142237      28593364     11804039        28         424386
fr          4657817      35032121     20462349        22         592351
ru          1080156      12813437      5994560        17         149859
de           701739       5618508      2966867        16         122362

8.1.3 YAGO

One of the projects that pursues similar goals to DBpedia is YAGO⁴⁸ [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content. YAGO focuses on extracting a smaller number of relations compared to DBpedia, in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

⁴⁸ http://www.mpi-inf.mpg.de/yago-naga/yago/

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects⁴⁹ which are able to process wiki syntax, as well as a list of data extraction extensions⁵⁰ for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY⁵¹ for language tools.

⁴⁹ http://www.mediawiki.org/wiki/Alternative_parsers
⁵⁰ http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation, synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service⁵². Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site⁵³.

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text, applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used, large encyclopedias Baidu Baike⁵⁴ and Hudong Baike⁵⁵. Apart from utilizing MediaWiki and HTML Markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach, achieving better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links, is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

⁵¹ http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/
⁵² http://sigwp.org/en/index.php/Wikipedia_Thesaurus
⁵³ http://sigwp.org/en/
⁵⁴ http://baike.baidu.com/
⁵⁵ http://www.hudong.com/

KnowItAll⁵⁶ is a web scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that, in the past four years, DBpedia matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

⁵⁶ http://www.cs.washington.edu/research/knowitall/
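Such a comparison could be sketched as a federated query between the main endpoint and a language chapter, e.g. as follows (a hypothetical sketch using SPARQL 1.1 federation; it assumes the German chapter endpoint at http://de.dbpedia.org/sparql and comparable property coverage in both editions):

PREFIX dbo:    <http://dbpedia.org/ontology/>
PREFIX dbr:    <http://dbpedia.org/resource/>
PREFIX dbr-de: <http://de.dbpedia.org/resource/>

# Flag cases where the English and German editions disagree
# on the population of Berlin.
SELECT ?popEn ?popDe WHERE {
  dbr:Berlin dbo:populationTotal ?popEn .
  SERVICE <http://de.dbpedia.org/sparql> {
    dbr-de:Berlin dbo:populationTotal ?popDe .
  }
  FILTER (?popEn != ?popDe)
}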

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki⁵⁷. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

⁵⁷ http://www.mediawiki.org/wiki/Extension:LightweightRDFa

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed in certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
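A minimal sketch of such a sanity check, formulated against the birth and death date properties of the DBpedia ontology (the concrete set of checks is left open in this article):

PREFIX dbo: <http://dbpedia.org/ontology/>

# Sanity check: persons whose recorded death date precedes
# their birth date point to likely errors in Wikipedia.
SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}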

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  ∗ Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  ∗ Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  ∗ Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16: List of namespace prefixes

Prefix    Namespace
dbo       http://dbpedia.org/ontology/
dbp       http://dbpedia.org/property/
dbr       http://dbpedia.org/resource/
dbr-de    http://de.dbpedia.org/resource/
dc        http://purl.org/dc/elements/1.1/
foaf      http://xmlns.com/foaf/0.1/
geo       http://www.w3.org/2003/01/geo/wgs84_pos#
georss    http://www.georss.org/georss/
ls        http://spotlight.dbpedia.org/scores/
lx        http://spotlight.dbpedia.org/lexicalizations/
owl       http://www.w3.org/2002/07/owl#
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs      http://www.w3.org/2000/01/rdf-schema#
skos      http://www.w3.org/2004/02/skos/core#
sptl      http://spotlight.dbpedia.org/vocab/
xsd       http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17: DBpedia timeline

Year   Month   Event
2006   Nov     Start of infobox extraction from Wikipedia
2007   Mar     DBpedia 1.0 release
       Jun     ESWC DBpedia article [2]
       Nov     DBpedia 2.0 release
       Nov     ISWC DBpedia article [1]
       Dec     DBpedia 3.0 release candidate
2008   Feb     DBpedia 3.0 release
       Aug     DBpedia 3.1 release
       Nov     DBpedia 3.2 release
2009   Jan     DBpedia 3.3 release
       Sep     JWS DBpedia article [5]
2010   Feb     Information extraction framework in Scala; Mappings Wiki release
       Mar     DBpedia 3.4 release
       Apr     DBpedia 3.5 release
       May     Start of DBpedia Internationalization effort
2011   Feb     DBpedia Spotlight release
       Mar     DBpedia 3.6 release
       Jul     DBpedia Live release
       Sep     DBpedia 3.7 release (with I18n datasets)
2012   Aug     DBpedia 3.8 release
       Sep     Publication of DBpedia Internationalization article [24]
2013   Sep     DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle, and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The semantic web: research and applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia Edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven Quality Evaluation of DBpedia. In To appear in Proceedings of 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

View publication statsView publication stats

Page 16: DBpedia - A Large-scale, Multilingual Knowledge Base ...static.tongtianta.site/paper_pdf/38bd5a66-1521-11e9-bf90-00163e08bb… · DBpedia is served as Linked Data on the Web. Since

Lehmann et al DBpedia 15

Mapping-Affected Pages Whenever an infobox map-ping change occurs all the Wikipedia pages that usethat infobox are reprocessed Taking Figure 2 as anexample if a new property mapping is introduced (iedbotranslator) or an existing (ie dboillustrator) isupdated or deleted then all entities belonging to theclass dboBook are reprocessed Thus upon a mappingchange we identify all the affected Wikipedia pagesand feed them for reprocessing

Unmodified Pages Extraction framework improve-ments or activation deactivation of DBpedia extractorsmight never be applied to rarely modified pages Toovercome this problem we obtain a list of the pageswhich have not been processed over a period of time(30 days in our case) and feed that list to the DBpediaLive extraction framework for reprocessing This feedhas a lower priority than the update or the mappingaffected pages feed and ensures that all articles reflect arecent state of the output of the extraction framework

Publication of Changesets Whenever a Wikipediaarticle is processed we get two disjoint sets of triples Aset for the added triples and another set for the deletedtriples We write those two sets into N-Triples filescompress them and publish the compressed files aschangesets If another DBpedia Live mirror wants tosynchronise with the DBpedia Live endpoint it can justdownload those files decompress and integrate them

Synchronisation Tool The synchronisation tool en-ables a DBpedia Live mirror to stay in synchronisationwith our live endpoint It downloads the changeset filessequentially decompresses them and updates the targetSPARQL endpoint via insert and delete operations

Data Isolation In order to keep the data isolatedDBpedia Live keeps different sources of data indifferent SPARQL graphs Data from the articleupdate feeds are contained in the graph with theURI httplivedbpediaorg static data (ielinks to the LOD cloud) are kept in httpstaticdbpediaorg and the DBpedia ontology is storedin httpdbpediaorgontology All data isalso accessible under the httpdbpediaorggraph for combined queries Next versions of DBpe-dia Live will also separate data from the raw infoboxextraction and mapping-based infobox extraction

5 Interlinking

DBpedia is interlinked with numerous external datasets following the Linked Data principles In this sec-

tion we give an overview of the number and typesof outgoing links that point from DBpedia into otherdata sets as well as the external data sets that set linkspointing to DBpedia resources

51 Outgoing Links

Similar to the DBpedia ontology DBpedia also fol-lows a community approach for adding links to otherthird party data sets The DBpedia project maintainsa link repository17 for which conventions for addinglinksets and linkset metadata are defined The adher-ence to those guidelines is supervised by a linking com-mittee Linksets which are added to the repository areused for the subsequent official DBpedia release as wellas for DBpedia Live Table 5 lists the linksets createdby the DBpedia community as of April 2013 The firstcolumn names the data set that is the target of the linksThe second and third column contain the predicate thatis used for linking as well as the overall number of linksthat is set between DBpedia and the external data setThe last column names the tool that was used to gener-ate the links The value S refers to Silk L to LIMESC to custom script and a missing entry means that thedataset is copied from the previous releases and notregenerated

An example for the usage of links is the combina-tion of data about European Union project funding(FTS) [32] and data about countries in DBpedia Thequery below compares funding per year (from FTS) andcountry with the gross domestic product of a country(from DBpedia)18

1 SELECT 2 SELECT ftsyear ftscountry (SUM(amount) AS

funding) 3 com rdftype fts-oCommitment 4 com fts-oyear year 5 year rdfslabel ftsyear 6 com fts-obenefit benefit 7 benefit fts-odetailAmount amount 8 benefit fts-obeneficiary beneficiary 9 beneficiary fts-ocountry country

10 country owlsameAs ftscountry 11 12 SELECT dbpcountry gdpyear gdpnominal 13 dbpcountry rdftype dboCountry 14 dbpcountry dbpgdpNominal gdpnominal 15 dbpcountry dbpgdpNominalYear gdpyear 16 17 FILTER ((ftsyear = str(gdpyear)) ampamp18 (ftscountry = dbpcountry))

17httpsgithubcomdbpediadbpedia-links18Endpoint httpftspublicdataeusparql

Results httpbitly1c2mIwQ

16 Lehmann et al DBpedia

Table 5Data sets linked from DBpedia

Data set Predicate Count Tool

Amsterdam Museum owlsameAs 627 SBBC Wildlife Finder owlsameAs 444 SBook Mashup rdftype 9 100

owlsameAsBricklink dcpublisher 10 100CORDIS owlsameAs 314 SDailymed owlsameAs 894 SDBLP Bibliography owlsameAs 196 SDBTune owlsameAs 838 SDiseasome owlsameAs 2 300 SDrugbank owlsameAs 4 800 SEUNIS owlsameAs 3 100 SEurostat (Linked Stats) owlsameAs 253 SEurostat (WBSG) owlsameAs 137CIA World Factbook owlsameAs 545 Sflickr wrappr dbphasPhoto- 3 800 000 C

CollectionFreebase owlsameAs 3 600 000 CGADM owlsameAs 1 900GeoNames owlsameAs 86 500 SGeoSpecies owlsameAs 16 000 SGHO owlsameAs 196 LProject Gutenberg owlsameAs 2 500 SItalian Public Schools owlsameAs 5 800 SLinkedGeoData owlsameAs 103 600 SLinkedMDB owlsameAs 13 800 SMusicBrainz owlsameAs 23 000New York Times owlsameAs 9 700OpenCyc owlsameAs 27 100 COpenEI (Open Energy) owlsameAs 678 SRevyu owlsameAs 6Sider owlsameAs 2 000 STCMGeneDIT owlsameAs 904UMBEL rdftype 896 400US Census owlsameAs 12 600WikiCompany owlsameAs 8 300WordNet dbpwordnet type 467 100YAGO2 rdftype 18 100 000

Sum 27 211 732

In addition to providing outgoing links on aninstance-level DBpedia also sets links on schema-level pointing from the DBpedia ontology to equiva-lent terms in other schemas Links to other schematacan be set by the community within the DBpedia Map-pings Wiki by using owlequivalentClass inclass templates and owlequivalentPropertyin datatype or object property templates respectively

In particular in 2011 Google Microsoft and Yahooannounced their collaboration on Schemaorg a col-lection of vocabularies for marking up content on webpages The DBpedia 38 ontology contains 45 equiva-lent class and 31 equivalent property links pointing tohttpschemaorg terms

52 Incoming Links

DBpedia is being linked to from a variety of datasets The overall number of incoming links ie linkspointing to DBpedia from other data sets is 39007478according to the Data Hub19 However those countsare entered by users and may not always be valid andup-to-date

In order to identify actually published and onlinedata sets that link to DBpedia we used Sindice [39]The Sindice project crawls RDF resources on the weband indexes those resources In Sindice a data set isdefined by the second-level domain name of the en-tityrsquos URI eg all resources available at the domainfu-berlinde are considered to belong to the samedata set A triple is considered to be a link if the dataset of subject and object are different Furthermore theSindice data we used for analysis only considers au-thoritative entities The data set of a subject of a triplemust match the domain it was retrieved from otherwiseit is not considered Sindice computes a graph sum-mary [8] over all resources they store With the helpof the Sindice team we examined this graph summaryto obtain all links pointing to DBpedia As shown inTable 7 Sindice knows about 248 data sets linking toDBpedia 70 of those data sets link to DBpedia viaowlsameAs but other link predicates are also verycommon as evident in this table In total Sindice hasindexed 4 million links pointing to DBpedia Table 6lists the 10 data sets which set most links to DBpedia

It should be noted that the data in Sindice is not com-plete for instance it does not contain all data sets thatare catalogued by the DataHub20 However it crawls forRDFa snippets converts microformats etc which arenot captured by the DataHub Despite the inaccuracythe relative comparison of different datasets can stillgive us insights Therefore we analysed the link struc-ture of all Sindice datasets using the Sindice clusterTable 8 shows the datasets with most incoming linksThose are authorities in the network structure of the

19See httpwikidbpediaorgInterlinking fordetails

20httpdatahubio

Lehmann et al DBpedia 17

Table 6

Top 10 data sets in Sindice ordered by the number of links to DBpedia

Data set Link Predicate Count Link Count

okaboocom 4 2407121tfrigovtw 57 837459naplesplusus 2 212460fu-berlinde 7 145322freebasecom 108 138619geonamesorg 3 125734opencycorg 3 19648geospeciesorg 10 16188dbrecnet 3 12856favikicom 5 12768

Table 7Sindice summary statistics for incoming links to DBpedia

Metric Value

Total links 3960212Total distinct data sets 248Total distinct predicates 345

Table 8Top 10 datasets by incoming links in Sindice

domain datasets links

purlorg 498 6717520dbpediaorg 248 3960212creativecommonsorg 2483 3030910identica 1021 2359276l3sde 34 1261487rkbexplorercom 24 1212416nytimescom 27 1174941w3org 405 658535geospeciesorg 13 523709livejournalcom 14881 366025

web of data and DBpedia is currently ranked second interms of incoming links

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms Firstit is provided in the form of downloadable data setswhere each data set contains the results of one of theextractors listed in Table 1 Second DBpedia is servedvia a public SPARQL endpoint and third it providesdereferencable URIs according to the Linked Data prin-ciples In this section we explore some of the statisticsgathered during the hosting of DBpedia over the lasttwo of years

0

20

40

60

80

100

120

2012-Q1 2012-Q2 2012-Q3 2012-Q4

Download Count (in Thousands) Download Volume (in TB)

Fig 9 The download count and download volume of the Englishlanguage edition of DBpedia

61 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages but thoselanguages vary with respect to the download popular-ity as well The top five languages with respect to thedownload volume are English Chinese German Cata-lan and French respectively The download count anddownload volume of the English language is indicatedin Figure 9 To host the DBpedia dataset downloadsa bandwidth of approximately 6 TB per month is cur-rently needed

Furthermore DBpedia consists of several data setswhich vary with respect to their download popularityThe download count and the download volume of eachdata set during the year 2012 is depicted in Figure 10In those statistics we filtered out all IP addresses whichrequested a file more than 1000 times per month21

Pagelinks are the most downloaded dataset althoughthey are not semantically rich as they do not revealwhich type of links exists between two resources Sup-posedly they are used for network analysis or providingrelevant links in user interfaces and downloaded moreoften as they are not provided via the official SPARQLendpoint

62 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 ishosted using the Virtuoso Universal Server (EnterpriseEdition) version 64 software in a 4-nodes cluster con-figuration This cluster setup provides parallelizationof query execution even when the cluster nodes areon the same machine as splitting a query over severalnodes allows better use of parallel threads on modernmulti-core CPUs on standard commodity hardware

Virtuoso supports horizontal scale-out either by re-distributing the existing cluster nodes onto multiple ma-

21The IP address was only filtered for that specific file and monthin those cases

22httpdbpediaorgsparql

18 Lehmann et al DBpedia

0

10

20

30

40

50

60

70Download Count (in Thousands) Download Volume (in TB)

Fig 10 The download count and download volume (in GB) of the DBpedia data sets

Table 9Hardware of the machines serving the public SPARQL endpoint

DBpedia Configuration

33 - 34 AMD Opteron 8220 280Ghz 4 Cores 32GB35 - 37 Intel Xeon E5520 227Ghz 8 Cores 48GB38 Intel Xeon E5-2630 230GHz 8 Cores 64GB

chines or by adding several separate clusters with around robin HTTP front-end This allows the clustersetup to grow in line with desired response times foran RDF data set collection As the size of the DBpediadata set increased and its use by the Linked Data com-munity grew the project migrated to increasingly pow-erful but still moderately priced hardware as shown inTable 9 The Virtuoso instance is configured to processqueries within a 1200 second timeout window and amaximum result set size of 50000 rows It providesOFFSET and LIMIT support for paging alongside theability to produce partial results

The log files used in the following analysis excludedtraffic generated by

1 clients that have been temporarily rate limitedafter a burst period

2 clients that have been banned after misuse3 applications spiders and other crawlers that are

blocked after frequently hitting the rate limit orgenerally use too many resources

Virtuoso supports HTTP Access Control Lists(ACLs) which allow the administrator to rate limit cer-tain IP addresses or whole IP ranges A maximum num-ber of requests per second (currently 15) as well asa bandwidth limit per request (currently 10MB) areenforced If the client software can handle compres-

Table 11Number of unique sites accessing DBpedia endpoints

DBpedia AvgDay Median Stdev Maximum

33 5824 5977 1046 786534 6656 6704 947 831235 9392 9432 1643 1258436 10810 10810 1612 1479837 17077 17091 6782 3328838 14513 14493 2783 20510

sion replies are compressed to further save bandwidthException rules can be configured for multiple clientshidden behind a NAT firewall (appearing as a single IPaddress) or for temporary requests for higher rate limitsWhen a client hits an ACL limit the system reports anappropriate HTTP status code23 like 509 (rdquoBandwidthLimit Exceededrdquo) and quickly drops the connectionThe system further uses an iptables based firewall forpermanent blocking of clients identified by their IPaddresses

6.3. Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.24 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by the Median and Standard Deviation. The last column shows the maximum number of hits (resp. visits/sites) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.

24 http://www.webalizer.org

Table 10. Number of endpoint hits (left) and visits (right)

Hits:
DBpedia   Avg/Day    Median     Stdev      Maximum
3.3        733811     711306     188991    1319539
3.4       1212549    1165893     351226    2371657
3.5       1122612    1035444     386454    2908720
3.6       1328355    1286750     256945    2495031
3.7       2085399    1930728    1057398    8873473
3.8       2910410    2717775    1085640    7678490

Visits:
DBpedia   Avg/Day   Median   Stdev    Maximum
3.3         9750      9775     2036     13126
3.4        11329     11473     1546     14198
3.5        16592     16902     2969     23129
3.6        19471     17664     5691     56064
3.7        23972     22262    10741    127869
3.8        16851     16711     2960     27416
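The floating 30 minute window described above can be reproduced from raw logs with a few lines of code. The sketch below assumes log records are available as (ip, timestamp) pairs sorted by time, and counts a new visit whenever a client has been idle for more than 30 minutes:

    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=30)

    def count_visits(records):
        """Count visits: sessions of requests from the same client,
        separated by more than a 30 minute gap. `records` is an iterable
        of (ip, datetime) pairs sorted by timestamp."""
        last_seen = {}
        visits = 0
        for ip, ts in records:
            if ip not in last_seen or ts - last_seen[ip] > WINDOW:
                visits += 1  # idle too long (or a new client): a new visit begins
            last_seen[ip] = ts
        return visits

    # Example: two requests 45 minutes apart count as two visits.
    log = [("198.51.100.7", datetime(2012, 5, 1, 12, 0)),
           ("198.51.100.7", datetime(2012, 5, 1, 12, 45))]
    assert count_visits(log) == 2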

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language specific DBpedia endpoints and DBpedia Live.

6.4. Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Table 12. Hits per service to http://dbpedia.org, in thousands

Endpoint    3.3    3.4    3.5    3.6     3.7     3.8
data       1355   2681   2230   2547    3714    4246
ontology     80    168    142    178     116      97
page       2634   4703   1810   1687    3787    7377
property    231    311    137    176     176     128
resource   2976   4080   2554   2309    4436    7143
sparql     2090   4177   2266   9112   15902   15475
other       252    420    342    277     579     695
total      9619  16541   9434  16286   28710   35142

Fig. 11. Traffic: Linked Data versus SPARQL endpoint.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.
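This content negotiation can be observed directly from a client. A minimal sketch (the resource URI is only an example) requests a resource once with an HTML Accept header and once with a Turtle one, and prints where the 30x redirect leads:

    import requests

    RESOURCE = "http://dbpedia.org/resource/Berlin"  # example resource

    for accept in ("text/html", "text/turtle"):
        # Follow the 30x redirect issued by the resource endpoint and
        # report the page (HTML) or data (Turtle) URL it resolves to.
        response = requests.get(RESOURCE, headers={"Accept": accept},
                                allow_redirects=True, timeout=60)
        print(accept, "->", response.url)
    # Expected shape: text/html resolves to a /page/ URL,
    # text/turtle to a /data/ URL.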

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the POST requests are counted as unknown.

Table 13. Hits per statement type, in thousands

Statement   3.3    3.4    3.5    3.6     3.7     3.8
ask          56    269    360    326     159     677
construct    55     31     14     11      22      38
describe     11      8      4      7      62     111
select     1891   3663   1658   8030   11204   13516
unknown      78    206    229    738    4455    1134
total      2090   4177   2266   9112   15902   15475

Table 14. Trends in SPARQL select (rounded values in %)

Statement   3.3    3.4    3.5    3.6    3.7    3.8
distinct   19.5   11.4   17.3   19.4   13.3   25.4
filter     45.7   13.7   31.8   25.3   29.2   36.1
functions   8.8    6.9   23.5   21.3   25.5   25.9
geo        27.7    7.0   39.6    6.7    9.3    9.2
group       0.0    0.0    0.0    0.0    0.0    0.2
limit       4.6    6.5   11.6   10.5    7.7    7.8
optional   30.6   23.8   47.3   26.3   16.7   17.2
order       2.3    1.3    1.9    3.2    1.2    1.6
union       3.3    3.2   26.4   11.3   12.1   20.6

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
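A per-keyword tally like the one in Table 14 can be computed with a simple scan over the logged queries. A minimal sketch, assuming the decoded SELECT queries are available one per line in a hypothetical queries.txt:

    from collections import Counter

    KEYWORDS = ["DISTINCT", "FILTER", "GROUP BY", "LIMIT",
                "OPTIONAL", "ORDER BY", "UNION"]

    counts = Counter()
    total = 0
    with open("queries.txt", encoding="utf-8") as f:  # hypothetical log extract
        for query in f:
            total += 1
            upper = query.upper()
            for kw in KEYWORDS:
                if kw in upper:
                    counts[kw] += 1

    # Report each keyword as a percentage of all SELECT queries seen.
    for kw in KEYWORDS:
        print(f"{kw:10s} {100 * counts[kw] / max(total, 1):5.1f}%")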

6.5. Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.
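A mirror keeps itself up to date by polling these changesets and applying added and removed triples to its local store. The following sketch shows the shape of such a loop; the changeset URL layout, file names and the store interface are assumptions for illustration, not the project's documented API:

    import gzip
    import requests

    # Assumed layout: changesets published as gzipped N-Triples files,
    # one pair (added / removed) per sequence number.
    BASE = "http://live.dbpedia.org/changesets"  # assumption

    def apply_changeset(store, seq):
        """Fetch one changeset and apply it to a local triple store object
        exposing hypothetical add(triple)/remove(triple) methods."""
        for kind, action in (("removed", store.remove), ("added", store.add)):
            url = f"{BASE}/{seq}.{kind}.nt.gz"
            response = requests.get(url, timeout=120)
            if response.status_code == 404:
                continue  # no additions/removals in this changeset
            for line in gzip.decompress(response.content).splitlines():
                if line.strip():
                    action(line.decode("utf-8"))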

7. Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, proliferation for applications and mashups, as well as text analysis and annotation tools.

7.1. Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets.25 For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

25 http://wiki.dbpedia.org/Datasets/NLP

7.1.1. Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation, also known as key phrase extraction and entity linking, refer to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool26 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
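A typical interaction with the Spotlight web service looks roughly as follows; the /rest/annotate path and the confidence/support parameters are assumptions for illustration and may differ between service versions:

    import requests

    # Assumed REST endpoint of the DBpedia Spotlight web service.
    SPOTLIGHT = "http://spotlight.dbpedia.org/rest/annotate"

    def annotate(text, confidence=0.4, support=20):
        """Return DBpedia resources detected in `text`."""
        response = requests.get(
            SPOTLIGHT,
            params={"text": text, "confidence": confidence, "support": support},
            headers={"Accept": "application/json"},
            timeout=60,
        )
        response.raise_for_status()
        return response.json().get("Resources", [])

    for resource in annotate("Berlin is the capital of Germany."):
        print(resource["@URI"], resource["@surfaceForm"])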

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI,27 Semantic API from Ontos,28 Open Calais29 and Zemanta,30 among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project.31

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com

A related project is ImageSnippets,32 a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2. Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia.35 DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series.36 Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org
32 http://www.imagesnippets.com
33 http://www.faviki.com
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2. Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records. Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project.37 Recently, it was announced that VIAF added a total of 250000 reciprocal authority links to Wikipedia.38 These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

7.3. Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

Facet Based Browsers. An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [15]. Another facet based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL based facet explorer, which also uses a graph based visualisation of facets, is LodLive [7]. The OpenLink built-in facet based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows users to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows users to create statistical analysis queries and correlations in RDF data, and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4. Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words covering several domains.43 Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org

Due to its fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.44 Statistics are shown in Table 15.

8. Related Work

8.1. Cross Domain Community Knowledge Bases

8.1.1. Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata.45 Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

42 http://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page.
44 http://wiktionary.dbpedia.org
45 http://wikidata.org

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.
– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface,46 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2. Freebase

Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which can efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary, since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47 http://www.freebase.com

Table 15. Statistical comparison of extractions for different languages

language   words      triples     resources    predicates   senses
en         2142237    28593364    11804039     28           424386
fr         4657817    35032121    20462349     22           592351
ru         1080156    12813437     5994560     17           149859
de          701739     5618508     2966867     16           122362

8.1.3. YAGO

One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content. YAGO focuses on extracting a smaller number of relations compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

48 http://www.mpi-inf.mpg.de/yago-naga/yago/

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2. Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 that are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library) [49] is an open-source, Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or for synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service.52 Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site.53

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike.55 Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipedia interlanguage links is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a web scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9. Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that, in the past four years, DBpedia matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see, in particular, the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

56 http://www.cs.washington.edu/research/knowitall

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision also extracting semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki.57 If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed in certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
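A minimal sketch of such a sanity check, run against the DBpedia Live SPARQL endpoint (the endpoint URL is assumed; the dbo prefix is the one listed in Table 16):

    import requests

    # Assumed DBpedia Live SPARQL endpoint.
    ENDPOINT = "http://live.dbpedia.org/sparql"

    # Sanity check: persons whose recorded death date precedes their birth date.
    QUERY = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?person ?birth ?death WHERE {
      ?person dbo:birthDate ?birth ;
              dbo:deathDate ?death .
      FILTER (?death < ?birth)
    }
    LIMIT 100
    """

    response = requests.get(
        ENDPOINT,
        params={"query": QUERY, "format": "application/sparql-results+json"},
        timeout=120,
    )
    for row in response.json()["results"]["bindings"]:
        print(row["person"]["value"], row["birth"]["value"], row["death"]["value"])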

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16. List of namespace prefixes

Prefix    Namespace
dbo       http://dbpedia.org/ontology/
dbp       http://dbpedia.org/property/
dbr       http://dbpedia.org/resource/
dbr-de    http://de.dbpedia.org/resource/
dc        http://purl.org/dc/elements/1.1/
foaf      http://xmlns.com/foaf/0.1/
geo       http://www.w3.org/2003/01/geo/wgs84_pos#
georss    http://www.georss.org/georss/
ls        http://spotlight.dbpedia.org/scores/
lx        http://spotlight.dbpedia.org/lexicalizations/
owl       http://www.w3.org/2002/07/owl#
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs      http://www.w3.org/2000/01/rdf-schema#
skos      http://www.w3.org/2004/02/skos/core#
sptl      http://spotlight.dbpedia.org/vocab/
xsd       http://www.w3.org/2001/XMLSchema#

Table 17. DBpedia timeline

Year   Month   Event
2006   Nov     Start of infobox extraction from Wikipedia
2007   Mar     DBpedia 1.0 release
       Jun     ESWC DBpedia article [2]
       Nov     DBpedia 2.0 release
       Nov     ISWC DBpedia article [1]
       Dec     DBpedia 3.0 release candidate
2008   Feb     DBpedia 3.0 release
       Aug     DBpedia 3.1 release
       Nov     DBpedia 3.2 release
2009   Jan     DBpedia 3.3 release
       Sep     JWS DBpedia article [5]
2010   Feb     Information extraction framework in Scala; Mappings Wiki release
       Mar     DBpedia 3.4 release
       Apr     DBpedia 3.5 release
       May     Start of DBpedia Internationalization effort
2011   Feb     DBpedia Spotlight release
       Mar     DBpedia 3.6 release
       Jul     DBpedia Live release
       Sep     DBpedia 3.7 release (with I18n datasets)
2012   Aug     DBpedia 3.8 release
       Sep     Publication of DBpedia Internationalization article [24]
2013   Sep     DBpedia 3.9 release

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.


[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.
[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann. DBpedia - a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini and A. Antonuccio. LodLive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.
[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.
[13] A. García-Silva, M. Jakob, P. N. Mendes and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[14] A. García-Silva, M. Szomszor, H. Alani and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[16] P. Heim, T. Ertl and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[18] P. Heim, S. Lohmann, D. Tsendragchaa and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[19] S. Hellmann, J. Brekle and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.
[20] S. Hellmann, C. Stadler, J. Lehmann and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[21] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.
[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[25] D. Kontokostas, A. Zaveri, S. Auer and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.
[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.
[27] C. Lagoze, H. V. de Sompel, M. Nelson and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.
[31] V. Lopez, M. Fernández, E. Motta and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.
[32] M. Martin, C. Stadler, P. Frischmuth and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[33] P. N. Mendes, M. Jakob and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[34] P. N. Mendes, M. Jakob, A. García-Silva and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[35] S. M. Meyer, J. Degener, J. Giannandrea and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10, 2008.
[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[41] C. Stadler, J. Lehmann, K. Höffner and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.
[42] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[43] E. Tacchini, A. Schultz and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648, 2012.
[45] Z. Wang, J. Li, Z. Wang and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer and J. Lehmann. User-driven Quality Evaluation of DBpedia. To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[49] T. Zesch, C. Müller and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

View publication statsView publication stats

Page 17: DBpedia - A Large-scale, Multilingual Knowledge Base ...static.tongtianta.site/paper_pdf/38bd5a66-1521-11e9-bf90-00163e08bb… · DBpedia is served as Linked Data on the Web. Since

16 Lehmann et al DBpedia

Table 5Data sets linked from DBpedia

Data set Predicate Count Tool

Amsterdam Museum owlsameAs 627 SBBC Wildlife Finder owlsameAs 444 SBook Mashup rdftype 9 100

owlsameAsBricklink dcpublisher 10 100CORDIS owlsameAs 314 SDailymed owlsameAs 894 SDBLP Bibliography owlsameAs 196 SDBTune owlsameAs 838 SDiseasome owlsameAs 2 300 SDrugbank owlsameAs 4 800 SEUNIS owlsameAs 3 100 SEurostat (Linked Stats) owlsameAs 253 SEurostat (WBSG) owlsameAs 137CIA World Factbook owlsameAs 545 Sflickr wrappr dbphasPhoto- 3 800 000 C

CollectionFreebase owlsameAs 3 600 000 CGADM owlsameAs 1 900GeoNames owlsameAs 86 500 SGeoSpecies owlsameAs 16 000 SGHO owlsameAs 196 LProject Gutenberg owlsameAs 2 500 SItalian Public Schools owlsameAs 5 800 SLinkedGeoData owlsameAs 103 600 SLinkedMDB owlsameAs 13 800 SMusicBrainz owlsameAs 23 000New York Times owlsameAs 9 700OpenCyc owlsameAs 27 100 COpenEI (Open Energy) owlsameAs 678 SRevyu owlsameAs 6Sider owlsameAs 2 000 STCMGeneDIT owlsameAs 904UMBEL rdftype 896 400US Census owlsameAs 12 600WikiCompany owlsameAs 8 300WordNet dbpwordnet type 467 100YAGO2 rdftype 18 100 000

Sum 27 211 732

In addition to providing outgoing links on aninstance-level DBpedia also sets links on schema-level pointing from the DBpedia ontology to equiva-lent terms in other schemas Links to other schematacan be set by the community within the DBpedia Map-pings Wiki by using owlequivalentClass inclass templates and owlequivalentPropertyin datatype or object property templates respectively

In particular in 2011 Google Microsoft and Yahooannounced their collaboration on Schemaorg a col-lection of vocabularies for marking up content on webpages The DBpedia 38 ontology contains 45 equiva-lent class and 31 equivalent property links pointing tohttpschemaorg terms

52 Incoming Links

DBpedia is being linked to from a variety of datasets The overall number of incoming links ie linkspointing to DBpedia from other data sets is 39007478according to the Data Hub19 However those countsare entered by users and may not always be valid andup-to-date

In order to identify actually published and onlinedata sets that link to DBpedia we used Sindice [39]The Sindice project crawls RDF resources on the weband indexes those resources In Sindice a data set isdefined by the second-level domain name of the en-tityrsquos URI eg all resources available at the domainfu-berlinde are considered to belong to the samedata set A triple is considered to be a link if the dataset of subject and object are different Furthermore theSindice data we used for analysis only considers au-thoritative entities The data set of a subject of a triplemust match the domain it was retrieved from otherwiseit is not considered Sindice computes a graph sum-mary [8] over all resources they store With the helpof the Sindice team we examined this graph summaryto obtain all links pointing to DBpedia As shown inTable 7 Sindice knows about 248 data sets linking toDBpedia 70 of those data sets link to DBpedia viaowlsameAs but other link predicates are also verycommon as evident in this table In total Sindice hasindexed 4 million links pointing to DBpedia Table 6lists the 10 data sets which set most links to DBpedia

It should be noted that the data in Sindice is not com-plete for instance it does not contain all data sets thatare catalogued by the DataHub20 However it crawls forRDFa snippets converts microformats etc which arenot captured by the DataHub Despite the inaccuracythe relative comparison of different datasets can stillgive us insights Therefore we analysed the link struc-ture of all Sindice datasets using the Sindice clusterTable 8 shows the datasets with most incoming linksThose are authorities in the network structure of the

19See httpwikidbpediaorgInterlinking fordetails

20httpdatahubio

Lehmann et al DBpedia 17

Table 6

Top 10 data sets in Sindice ordered by the number of links to DBpedia

Data set Link Predicate Count Link Count

okaboocom 4 2407121tfrigovtw 57 837459naplesplusus 2 212460fu-berlinde 7 145322freebasecom 108 138619geonamesorg 3 125734opencycorg 3 19648geospeciesorg 10 16188dbrecnet 3 12856favikicom 5 12768

Table 7Sindice summary statistics for incoming links to DBpedia

Metric Value

Total links 3960212Total distinct data sets 248Total distinct predicates 345

Table 8Top 10 datasets by incoming links in Sindice

domain datasets links

purlorg 498 6717520dbpediaorg 248 3960212creativecommonsorg 2483 3030910identica 1021 2359276l3sde 34 1261487rkbexplorercom 24 1212416nytimescom 27 1174941w3org 405 658535geospeciesorg 13 523709livejournalcom 14881 366025

web of data and DBpedia is currently ranked second interms of incoming links

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms Firstit is provided in the form of downloadable data setswhere each data set contains the results of one of theextractors listed in Table 1 Second DBpedia is servedvia a public SPARQL endpoint and third it providesdereferencable URIs according to the Linked Data prin-ciples In this section we explore some of the statisticsgathered during the hosting of DBpedia over the lasttwo of years

0

20

40

60

80

100

120

2012-Q1 2012-Q2 2012-Q3 2012-Q4

Download Count (in Thousands) Download Volume (in TB)

Fig 9 The download count and download volume of the Englishlanguage edition of DBpedia

61 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages but thoselanguages vary with respect to the download popular-ity as well The top five languages with respect to thedownload volume are English Chinese German Cata-lan and French respectively The download count anddownload volume of the English language is indicatedin Figure 9 To host the DBpedia dataset downloadsa bandwidth of approximately 6 TB per month is cur-rently needed

Furthermore DBpedia consists of several data setswhich vary with respect to their download popularityThe download count and the download volume of eachdata set during the year 2012 is depicted in Figure 10In those statistics we filtered out all IP addresses whichrequested a file more than 1000 times per month21

Pagelinks are the most downloaded dataset althoughthey are not semantically rich as they do not revealwhich type of links exists between two resources Sup-posedly they are used for network analysis or providingrelevant links in user interfaces and downloaded moreoften as they are not provided via the official SPARQLendpoint

62 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint22 ishosted using the Virtuoso Universal Server (EnterpriseEdition) version 64 software in a 4-nodes cluster con-figuration This cluster setup provides parallelizationof query execution even when the cluster nodes areon the same machine as splitting a query over severalnodes allows better use of parallel threads on modernmulti-core CPUs on standard commodity hardware

Virtuoso supports horizontal scale-out either by re-distributing the existing cluster nodes onto multiple ma-

21The IP address was only filtered for that specific file and monthin those cases

22httpdbpediaorgsparql

18 Lehmann et al DBpedia

0

10

20

30

40

50

60

70Download Count (in Thousands) Download Volume (in TB)

Fig 10 The download count and download volume (in GB) of the DBpedia data sets

Table 9Hardware of the machines serving the public SPARQL endpoint

DBpedia Configuration

33 - 34 AMD Opteron 8220 280Ghz 4 Cores 32GB35 - 37 Intel Xeon E5520 227Ghz 8 Cores 48GB38 Intel Xeon E5-2630 230GHz 8 Cores 64GB

chines or by adding several separate clusters with around robin HTTP front-end This allows the clustersetup to grow in line with desired response times foran RDF data set collection As the size of the DBpediadata set increased and its use by the Linked Data com-munity grew the project migrated to increasingly pow-erful but still moderately priced hardware as shown inTable 9 The Virtuoso instance is configured to processqueries within a 1200 second timeout window and amaximum result set size of 50000 rows It providesOFFSET and LIMIT support for paging alongside theability to produce partial results

The log files used in the following analysis excludedtraffic generated by

1 clients that have been temporarily rate limitedafter a burst period

2 clients that have been banned after misuse3 applications spiders and other crawlers that are

blocked after frequently hitting the rate limit orgenerally use too many resources

Virtuoso supports HTTP Access Control Lists(ACLs) which allow the administrator to rate limit cer-tain IP addresses or whole IP ranges A maximum num-ber of requests per second (currently 15) as well asa bandwidth limit per request (currently 10MB) areenforced If the client software can handle compres-

Table 11Number of unique sites accessing DBpedia endpoints

DBpedia AvgDay Median Stdev Maximum

33 5824 5977 1046 786534 6656 6704 947 831235 9392 9432 1643 1258436 10810 10810 1612 1479837 17077 17091 6782 3328838 14513 14493 2783 20510

sion replies are compressed to further save bandwidthException rules can be configured for multiple clientshidden behind a NAT firewall (appearing as a single IPaddress) or for temporary requests for higher rate limitsWhen a client hits an ACL limit the system reports anappropriate HTTP status code23 like 509 (rdquoBandwidthLimit Exceededrdquo) and quickly drops the connectionThe system further uses an iptables based firewall forpermanent blocking of clients identified by their IPaddresses

63 Public Static Endpoint Statistics

The statistics presented in this section were extractedfrom reports generated by Webalizer v22124 Table 10and Table 11 show various DBpedia SPARQL endpointusage statistics for the DBpedia 33 to 38 releases Notethat the usage of all endpoints mentioned in Table 12 iscounted The AvgDay column represents the averagenumber of hits (resp visitssites) per day followed by

23httpenwikipediaorgwikiList_of_HTTP_status_codes

24httpwwwwebalizerorg

Lehmann et al DBpedia 19

Table 10Number of endpoint hits (left) and visits (right)

DBpedia AvgDay Median Stdev Maximum

33 733811 711306 188991 131953934 1212549 1165893 351226 237165735 1122612 1035444 386454 290872036 1328355 1286750 256945 249503137 2085399 1930728 1057398 887347338 2910410 2717775 1085640 7678490

DBpedia AvgDay Median Stdev Maximum

33 9750 9775 2036 1312634 11329 11473 1546 1419835 16592 16902 2969 2312936 19471 17664 5691 5606437 23972 22262 10741 12786938 16851 16711 2960 27416

the Median and Standard Deviation The last columnshows the maximum number of hits (resp visitssites)that was recorded on a single day for each data setversion Visits (ie sessions of subsequent queries fromthe same client) are determined by a floating 30 minutetime window All requests from behind a NAT firewallare logged under the same external IP address and aretherefore counted towards the same visit if they occurwithin the 30 minute interval

Table 10 shows the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data directly. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint.

Table 12: Hits per service to http://dbpedia.org, in thousands

Endpoint    3.3    3.4    3.5    3.6    3.7    3.8
data       1355   2681   2230   2547   3714   4246
ontology     80    168    142    178    116     97
page       2634   4703   1810   1687   3787   7377
property    231    311    137    176    176    128
resource   2976   4080   2554   2309   4436   7143
sparql     2090   4177   2266   9112  15902  15475
other       252    420    342    277    579    695
total      9619  16541   9434  16286  28710  35142

Fig. 11: Traffic, Linked Data versus SPARQL endpoint

The ontology and property endpoints return meta information about the DBpedia ontology. While all of these endpoints themselves may use SPARQL queries to generate various page content, these requests are handled by the internal Virtuoso engine directly and do not show up as extra calls to the sparql endpoint in our analysis.

Figure 11 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type.


Table 13: Hits per statement type, in thousands

Statement    3.3    3.4    3.5    3.6    3.7    3.8
ask           56    269    360    326    159    677
construct     55     31     14     11     22     38
describe      11      8      4      7     62    111
select      1891   3663   1658   8030  11204  13516
unknown       78    206    229    738   4455   1134
total       2090   4177   2266   9112  15902  15475

Table 14: Trends in SPARQL select (rounded values in %)

Statement    3.3    3.4    3.5    3.6    3.7    3.8
distinct    19.5   11.4   17.3   19.4   13.3   25.4
filter      45.7   13.7   31.8   25.3   29.2   36.1
functions    8.8    6.9   23.5   21.3   25.5   25.9
geo         27.7    7.0   39.6    6.7    9.3    9.2
group        0.0    0.0    0.0    0.0    0.0    0.2
limit        4.6    6.5   11.6   10.5    7.7    7.8
optional    30.6   23.8   47.3   26.3   16.7   17.2
order        2.3    1.3    1.9    3.2    1.2    1.6
union        3.3    3.2   26.4   11.3   12.1   20.6

As the log files only record the full SPARQL query on a GET request, all POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of the SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
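As a purely illustrative example of the constructs counted above (the class and properties are arbitrary and not drawn from the analyzed logs), a query combining DISTINCT, FILTER, OPTIONAL, LIMIT and the wgs84 geo vocabulary might look as follows:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

    # Cities with coordinates in a given latitude band;
    # the population value is optional.
    SELECT DISTINCT ?city ?lat ?long ?population
    WHERE {
      ?city a dbo:City ;
            geo:lat ?lat ;
            geo:long ?long .
      OPTIONAL { ?city dbo:populationTotal ?population . }
      FILTER (?lat > 47.0 && ?lat < 55.0)
    }
    LIMIT 100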

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12: Number of daily requests sent to the DBpedia Live endpoint for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013

7 Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Such applications include improving search and exploration of Wikipedia data, data proliferation for applications and mashups, as well as text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets^25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

25 http://wiki.dbpedia.org/Datasets/NLP

7.1.1 Annotation / Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool^26 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
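For instance, the set of annotation targets could be delimited by a SPARQL query of the following shape. This is an illustrative sketch only; the class and property used here are examples, not Spotlight defaults:

    PREFIX dbo: <http://dbpedia.org/ontology/>

    # Restrict annotation targets to politicians with a known party.
    SELECT ?resource
    WHERE {
      ?resource a dbo:Politician ;
                dbo:party ?party .
    }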

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI^27, Semantic API from Ontos^28, Open Calais^29 and Zemanta^30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project^31.

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com

A related project is ImageSnippets^32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki^33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC^34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia^35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series^36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems.

28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org
32 http://www.imagesnippets.com
33 http://www.faviki.com
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.
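To make the task concrete: a QA system has to map a natural language question such as "Who is the mayor of Berlin?" onto a structured query. One plausible, purely illustrative mapping onto the DBpedia vocabulary (assuming, for this sketch, that the infobox value is exposed via dbo:leader) would be:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # "Who is the mayor of Berlin?" -- the correct predicate
    # (dbo:leader is assumed here) must be found by the system.
    SELECT ?leader
    WHERE {
      dbr:Berlin dbo:leader ?leader .
    }

The difficulty described above lies exactly in this mapping step: with thousands of properties and classes, the correct predicate cannot simply be enumerated in a domain specific dictionary.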

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records. Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority Files (VIAF) project^37. Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia^38. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.
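Once these links are included, connecting a library's VIAF-based records to DBpedia could be approached as in the following sketch. It assumes, for illustration only, that the harvested links are published as owl:sameAs triples pointing to viaf.org URIs; the actual predicate and URI scheme may differ in a given release:

    PREFIX owl: <http://www.w3.org/2002/07/owl#>

    # DBpedia resources linked to VIAF authority records.
    SELECT ?dbpediaResource ?viafUri
    WHERE {
      ?dbpediaResource owl:sameAs ?viafUri .
      FILTER (STRSTARTS(STR(?viafUri), "http://viaf.org/"))
    }
    LIMIT 100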

37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of such tools, structured by category.

Facet Based Browsers. An award-winning^39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [15]. Another facet based browser, which allows to create complex graph structures of facets in a visually appealing interface and to filter them, is gFacet [16]. A generic SPARQL based facet explorer, which also uses a graph based visualisation of facets, is LODLive [7]. The OpenLink built-in facet based browser^40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder^41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows to create statistical analysis queries and correlations in RDF data, and DBpedia in particular.
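The kind of connection RelFinder surfaces can be approximated by hand with a fixed two-hop SPARQL pattern, as in the following simplified sketch (RelFinder itself explores paths of varying length and direction; the two resources are arbitrary examples):

    PREFIX dbr: <http://dbpedia.org/resource/>

    # Resources connecting Albert Einstein and Niels Bohr in two hops.
    SELECT ?outPredicate ?intermediate ?inPredicate
    WHERE {
      dbr:Albert_Einstein ?outPredicate ?intermediate .
      ?intermediate ?inPredicate dbr:Niels_Bohr .
    }
    LIMIT 25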

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available in 171 languages (of which approximately 147 can be considered active^42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains^43. Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, which is part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data^44. Statistics are shown in Table 15.

42 https://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page
44 http://wiktionary.dbpedia.org
45 http://wikidata.org

8 Related Work

8.1 Cross Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata^45. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects, and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.
– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes, and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface^46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase^47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary, since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47 http://www.freebase.com

Table 15: Statistical comparison of extractions for different languages

language      words      triples    resources   predicates   senses
en          2142237     28593364     11804039           28   424386
fr          4657817     35032121     20462349           22   592351
ru          1080156     12813437      5994560           17   149859
de           701739      5618508      2966867           16   122362

8.1.3 YAGO

One of the projects that pursues similar goals to DBpedia is YAGO^48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

48 http://www.mpi-inf.mpg.de/yago-naga/yago

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, while YAGO focuses on extracting a smaller number of relations, compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.
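For example, the YAGO types served by DBpedia can be queried directly. The sketch below assumes that the YAGO classes are published under the http://dbpedia.org/class/yago/ namespace and uses an arbitrary example resource:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # YAGO classes assigned to a DBpedia resource.
    SELECT ?yagoType
    WHERE {
      dbr:Albert_Einstein rdf:type ?yagoType .
      FILTER (STRSTARTS(STR(?yagoType), "http://dbpedia.org/class/yago/"))
    }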

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects^49 that are able to process wiki syntax, as well as a list of data extraction extensions^50 for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source, Java-based API that allows to access information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY^51 for language tools.

49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations, and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service^52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site^53.

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text, applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source, based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike^54 and Hudong Baike^55. Apart from utilizing MediaWiki and HTML Markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge-linkage, beyond the use of Wikipedia interlanguage links, is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll^56 is a web scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

56 http://www.cs.washington.edu/research/knowitall

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that in the past four years DBpedia has matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki^57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed in certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
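A minimal sketch of such a sanity check, using the existing dbo:birthDate and dbo:deathDate properties (in practice, the checks and their scheduling would be more elaborate), could look as follows:

    PREFIX dbo: <http://dbpedia.org/ontology/>

    # Persons whose recorded death date precedes their birth date --
    # candidates for feedback to Wikipedia editors.
    SELECT ?person ?birth ?death
    WHERE {
      ?person a dbo:Person ;
              dbo:birthDate ?birth ;
              dbo:deathDate ?death .
      FILTER (?death < ?birth)
    }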

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information, such as co-occurrence, for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors,
– all DBpedia Mappings Wiki contributors,
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint,
– Kingsley Idehen for SPARQL and Linked Data hosting and community support,
– Christopher Sahnwaldt for DBpedia development and release management,
– Claus Stadler for DBpedia development,
– Paul Kreis for DBpedia development,
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16: List of namespace prefixes

Prefix    Namespace
dbo       http://dbpedia.org/ontology/
dbp       http://dbpedia.org/property/
dbr       http://dbpedia.org/resource/
dbr-de    http://de.dbpedia.org/resource/
dc        http://purl.org/dc/elements/1.1/
foaf      http://xmlns.com/foaf/0.1/
geo       http://www.w3.org/2003/01/geo/wgs84_pos#
georss    http://www.georss.org/georss/
ls        http://spotlight.dbpedia.org/scores/
lx        http://spotlight.dbpedia.org/lexicalizations/
owl       http://www.w3.org/2002/07/owl#
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs      http://www.w3.org/2000/01/rdf-schema#
skos      http://www.w3.org/2004/02/skos/core#
sptl      http://spotlight.dbpedia.org/vocab/
xsd       http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17: DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle, and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP: A multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata, Semantics and Ontologies, 3:37–52, Nov. 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven Quality Evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.



chines or by adding several separate clusters with around robin HTTP front-end This allows the clustersetup to grow in line with desired response times foran RDF data set collection As the size of the DBpediadata set increased and its use by the Linked Data com-munity grew the project migrated to increasingly pow-erful but still moderately priced hardware as shown inTable 9 The Virtuoso instance is configured to processqueries within a 1200 second timeout window and amaximum result set size of 50000 rows It providesOFFSET and LIMIT support for paging alongside theability to produce partial results

The log files used in the following analysis excludedtraffic generated by

1 clients that have been temporarily rate limitedafter a burst period

2 clients that have been banned after misuse3 applications spiders and other crawlers that are

blocked after frequently hitting the rate limit orgenerally use too many resources

Virtuoso supports HTTP Access Control Lists(ACLs) which allow the administrator to rate limit cer-tain IP addresses or whole IP ranges A maximum num-ber of requests per second (currently 15) as well asa bandwidth limit per request (currently 10MB) areenforced If the client software can handle compres-

Table 11Number of unique sites accessing DBpedia endpoints

DBpedia AvgDay Median Stdev Maximum

33 5824 5977 1046 786534 6656 6704 947 831235 9392 9432 1643 1258436 10810 10810 1612 1479837 17077 17091 6782 3328838 14513 14493 2783 20510

sion replies are compressed to further save bandwidthException rules can be configured for multiple clientshidden behind a NAT firewall (appearing as a single IPaddress) or for temporary requests for higher rate limitsWhen a client hits an ACL limit the system reports anappropriate HTTP status code23 like 509 (rdquoBandwidthLimit Exceededrdquo) and quickly drops the connectionThe system further uses an iptables based firewall forpermanent blocking of clients identified by their IPaddresses

63 Public Static Endpoint Statistics

The statistics presented in this section were extractedfrom reports generated by Webalizer v22124 Table 10and Table 11 show various DBpedia SPARQL endpointusage statistics for the DBpedia 33 to 38 releases Notethat the usage of all endpoints mentioned in Table 12 iscounted The AvgDay column represents the averagenumber of hits (resp visitssites) per day followed by

23httpenwikipediaorgwikiList_of_HTTP_status_codes

24httpwwwwebalizerorg

Lehmann et al DBpedia 19

Table 10Number of endpoint hits (left) and visits (right)

DBpedia AvgDay Median Stdev Maximum

33 733811 711306 188991 131953934 1212549 1165893 351226 237165735 1122612 1035444 386454 290872036 1328355 1286750 256945 249503137 2085399 1930728 1057398 887347338 2910410 2717775 1085640 7678490

DBpedia AvgDay Median Stdev Maximum

33 9750 9775 2036 1312634 11329 11473 1546 1419835 16592 16902 2969 2312936 19471 17664 5691 5606437 23972 22262 10741 12786938 16851 16711 2960 27416

the Median and Standard Deviation The last columnshows the maximum number of hits (resp visitssites)that was recorded on a single day for each data setversion Visits (ie sessions of subsequent queries fromthe same client) are determined by a floating 30 minutetime window All requests from behind a NAT firewallare logged under the same external IP address and aretherefore counted towards the same visit if they occurwithin the 30 minute interval

Table 10 shows the increasing popularity of DBpediaThere is a distinct dip in hits to the SPARQL endpoint inDBpedia 35 which is partially due to more strict initiallimits for bot-related traffic which were later relaxedThe sudden drop of visits between the 37 and the 38data sets can be attributed to

1 applications starting to use their own privateDBpedia endpoint

2 blocking of apps that were abusing the DBpediaendpoint

3 uptake of the language specific DBpedia end-points and DBpedia Live

64 Query Types and Trends

The DBpedia server is not only a SPARQL endpointbut also serves as a Linked Data Hub returning re-sources in a number of different formats For each dataset we randomly selected 14 days worth of log files andprocessed those in order to show the various servicescalled Table 12 shows the number of hits to the variousendpoints

The resource endpoint uses the Accept line in theHTTP header sent by the client to return a HTTP sta-tus code 30x to redirect the client to either the page(HTML based) or data (formats like RDFXML or Tur-tle) equivalent of the article Clients also frequentlymint their own URLs to either page or data version ofan articles directly or download the raw data directlyThis explains why the count of page and data hitsin the table is larger than the number of hits on the

Table 12Hits per service to httpdbpediaorg in thousands

Endpoint 33 34 35 36 37 38

data 1355 2681 2230 2547 3714 4246ontology 80 168 142 178 116 97page 2634 4703 1810 1687 3787 7377property 231 311 137 176 176 128resource 2976 4080 2554 2309 4436 7143sparql 2090 4177 2266 9112 15902 15475other 252 420 342 277 579 695total 9619 16541 9434 16286 28710 35142

Fig 11 Traffic Linked Data versus SPARQL endpoint

resource endpoint The ontology and property end-points return meta information about the DBpedia on-tology While all of these endpoints themselves mayuse SPARQL queries to generate various page contentthese requests are handled by the internal Virtuoso en-gine directly and do not show up as extra calls to thesparql endpoint in our analysis

Figure 11 shows the percentages of traffic hits thatwere generated by the main endpoints As we can seethe usage of the SPARQL endpoint has doubled fromabout 22 percent in 2009 to about 44 percent in 2013However this still means that 56 percent of traffic hitsare directed to the Linked Data service

In Table 13 we focussed on the calls to the sparqlendpoint and counted the number of statements per type

20 Lehmann et al DBpedia

Table 13Hits per statement type in thousands

Statement 33 34 35 36 37 38

ask 56 269 360 326 159 677construct 55 31 14 11 22 38describe 11 8 4 7 62 111select 1891 3663 1658 8030 11204 13516unknown 78 206 229 738 4455 1134total 2090 4177 2266 9112 15902 15475

Table 14Trends in SPARQL select (rounded values in )

Statement 33 34 35 36 37 38

distinct 195 114 173 194 133 254filter 457 137 318 253 292 361functions 88 69 235 213 255 259geo 277 70 396 67 93 92group 00 00 00 00 00 02limit 46 65 116 105 77 78optional 306 238 473 263 167 172order 23 13 19 32 12 16union 33 32 264 113 121 206

As the log files only record the full SPARQL query ona GET request all the POST requests are counted asunknown

Finally we analyzed each SPARQL query andcounted the use of keywords and constructs like

ndash DISTINCTndash FILTERndash FUNCTIONS like CONCAT CONTAINS ISIRIndash Use of GEO objectsndash GROUP BYndash LIMIT OFFSETndash OPTIONALndash ORDER BYndash UNION

For the GEO objects we counted the use of SPARQLPREFIX geo and wgs84 declarations and usage inproperty tags Table 14 shows the use of various key-words as a percentage of the total select queries madeto the sparql endpoint for the sample sets In generalwe observed that queries became more complex overtime indicating an increasing maturity and higher ex-pectations of the user base

65 Statistics for DBpedia Live

Since its official release at the end of June 2011DBpedia Live attracted a steadily increasing number

of users Furthermore more users tend to use the syn-chronisation tool to synchronise their own DBpediaLive mirrors This leads to an increasing number of liveupdate requests ie changeset downloads Figure 12indicates the number of daily SPARQL and synchroni-sation requests sent to DBpedia Live endpoint in theperiod between August 2012 and January 2013

Fig 12 Number of daily requests sent to the DBpedia Live for a)SPARQL queries and b) synchronisation requests from August 2012until January 2013

7 Use Cases and Applications

Due to DBpediarsquos coverage of various domains aswell as its steady growth as a hub on the Web of Datathe data sets provided by DBpedia can serve many pur-poses Such applications include improving search andexploration of Wikipedia data proliferation for applica-tions mashups as well as text analysis and annotationtools

71 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

25 http://wiki.dbpedia.org/Datasets/NLP
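As an illustration of the ambiguity estimation mentioned above, the following sketch counts how many distinct resources share a surface form; lx:surfaceForm is a hypothetical predicate standing in for the actual vocabulary of the lexicalizations data set (cf. the lx namespace in Table 16):

PREFIX lx: <http://spotlight.dbpedia.org/lexicalizations/>

# lx:surfaceForm is a HYPOTHETICAL stand-in for the lexicalizations
# vocabulary; surface forms linked to many resources are highly ambiguous.
SELECT ?surfaceForm (COUNT(DISTINCT ?resource) AS ?ambiguity)
WHERE {
  ?resource lx:surfaceForm ?surfaceForm .
}
GROUP BY ?surfaceForm
ORDER BY DESC(?ambiguity)
LIMIT 20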

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open source tool26, including a free web service, that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
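For instance, a whitelist restricting annotation to a specific set of resources could be expressed along the following lines (an illustrative sketch, not an example from the Spotlight documentation):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Only the resources returned by this query are considered as
# annotation targets: politicians born in Germany.
SELECT ?resource
WHERE {
  ?resource a dbo:Politician ;
            dbo:birthPlace dbr:Germany .
}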

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, Semantic API from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com
28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org

A related project is ImageSnippets32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

32 http://www.imagesnippets.com

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

33 http://www.faviki.com
34 http://bbc.co.uk

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions.

35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald

DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.
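To make the task concrete, a QALD-style question such as "Which books were written by Dan Brown?" is typically answered by translating it into a query over the DBpedia ontology; the following mapping is an illustrative sketch of such a translation:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# "Which books were written by Dan Brown?"
SELECT ?book
WHERE {
  ?book a dbo:Book ;
        dbo:author dbr:Dan_Brown .
}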

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].
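Viewed this way, a slot filling instance such as (Albert Einstein, birth date, ?) reduces to a lookup against the knowledge base, which can also serve to validate values mined from text; a minimal sketch:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Retrieve the filler for the (entity, property) pair, e.g. to check
# a birth date extracted from a sentence.
SELECT ?birthDate
WHERE {
  dbr:Albert_Einstein dbo:birthDate ?birthDate .
}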

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records. Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system (see the query sketch after this list).
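The category graph behind such a classification system is already queryable in DBpedia: resources are linked to Wikipedia categories via dcterms:subject, and categories to broader categories via skos:broader. A minimal sketch (the dcterms prefix is declared inline, as it is not listed in Table 16):

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbr: <http://dbpedia.org/resource/>

# Subject categories of a resource and, where present, their broader
# categories; a starting point for subject indexing.
SELECT ?category ?broader
WHERE {
  dbr:Semantic_Web dcterms:subject ?category .
  OPTIONAL { ?category skos:broader ?broader . }
}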

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project37. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia38. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.
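Assuming the harvested authority links are published as owl:sameAs triples (our assumption here, not a statement from the VIAF announcement), a library could look up the VIAF identifier of an author as follows:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbr: <http://dbpedia.org/resource/>

# External identifiers of Goethe that point into viaf.org.
SELECT ?link
WHERE {
  dbr:Johann_Wolfgang_von_Goethe owl:sameAs ?link .
  FILTER (CONTAINS(STR(?link), "viaf.org"))
}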

37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of these tools and structure them in categories.

Facet Based Browsers. An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.
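The aggregations underlying such facets are plain GROUP BY queries; for example, a birth place facet over persons could be populated with a count query like the following sketch:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Facet counts: the ten most frequent birth places among persons.
SELECT ?place (COUNT(?person) AS ?count)
WHERE {
  ?person a dbo:Person ;
          dbo:birthPlace ?place .
}
GROUP BY ?place
ORDER BY DESC(?count)
LIMIT 10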

Search and Querying. The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface for exploring the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows users to create statistical analysis queries and correlations in RDF data, and DBpedia in particular.
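The principle behind such autocompletion can be sketched as a probing query: only predicates that actually occur with the partially specified pattern are suggested, so every suggestion is guaranteed to lead to a solution. This is an illustration of the idea, not the Query Builder's actual implementation:

PREFIX dbr: <http://dbpedia.org/resource/>

# Candidate predicates for extending a triple pattern whose subject
# is dbr:Berlin.
SELECT DISTINCT ?p
WHERE {
  dbr:Berlin ?p ?o .
}
LIMIT 100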

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.
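A reduced version of such a spatial filter, using the wgs84 vocabulary (geo prefix in Table 16) and a simple bounding box around the centre of Berlin; DBpedia Mobile's actual filters are more elaborate:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

# Locations within roughly 0.1 degrees of 52.52 N, 13.40 E.
SELECT ?place ?lat ?long
WHERE {
  ?place geo:lat ?lat ;
         geo:long ?long .
  FILTER (?lat > 52.42 && ?lat < 52.62 &&
          ?long > 13.30 && ?long < 13.50)
}
LIMIT 50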

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains43. Such descriptions provide for a lexical word a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

42 http://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Due to its fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows. For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data44. Statistics are shown in Table 15.

44 http://wiktionary.dbpedia.org

8 Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata45.

45 http://wikidata.org

Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface

8.1.2 Freebase

Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows metadata for each fact to be stored efficiently. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

47 http://www.freebase.com

Table 15
Statistical comparison of extractions for different languages

language      words      triples    resources   predicates    senses
en          2142237     28593364     11804039           28    424386
fr          4657817     35032121     20462349           22    592351
ru          1080156     12813437      5994560           17    149859
de           701739      5618508      2966867           16    122362

8.1.3 YAGO

One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, whereas YAGO focuses on extracting a smaller number of relations to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

48 http://www.mpi-inf.mpg.de/yago-naga/yago

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 that are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix#data_extraction

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g., for word-sense disambiguation, synonyms, translations, taxonomic relations, and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38] the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a web-scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

56 http://www.cs.washington.edu/research/knowitall

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that, in the past four years, DBpedia matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project:

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birth date of a person must always be before the death date, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
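The first of these checks can be written directly as such a query; a minimal sketch (the outlier detection would additionally require aggregates):

PREFIX dbo: <http://dbpedia.org/ontology/>

# Persons whose recorded death date precedes their birth date; each
# result points to a likely error or typo in Wikipedia.
SELECT ?person ?birth ?death
WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}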

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16
List of namespace prefixes

Prefix    Namespace
dbo       http://dbpedia.org/ontology/
dbp       http://dbpedia.org/property/
dbr       http://dbpedia.org/resource/
dbr-de    http://de.dbpedia.org/resource/
dc        http://purl.org/dc/elements/1.1/
foaf      http://xmlns.com/foaf/0.1/
geo       http://www.w3.org/2003/01/geo/wgs84_pos#
georss    http://www.georss.org/georss/
ls        http://spotlight.dbpedia.org/scores/
lx        http://spotlight.dbpedia.org/lexicalizations/
owl       http://www.w3.org/2002/07/owl#
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs      http://www.w3.org/2000/01/rdf-schema#
skos      http://www.w3.org/2004/02/skos/core#
sptl      http://spotlight.dbpedia.org/vocab/
xsd       http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives. DBpedia: A nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17
DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. Journal of Web Semantics, 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini and A. Antonuccio. LodLive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In 23rd International Workshop on Database and Expert Systems Applications (DEXA), pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Mérialdo, W. Haas, G. Thallinger and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann and S. Auer. DBpedia Live Extraction. In Proc. of the 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer and J. Lehmann. TripleCheckMate: A tool for crowdsourcing the quality assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob and C. Bizer. DBpedia for NLP: A multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In Proceedings of the 10th International Conference on The Semantic Web, Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. International Journal of Metadata, Semantics and Ontologies, 3:37–52, Nov. 10, 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner and S. Auer. LinkedGeoData: A core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer and J. Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

View publication statsView publication stats

Page 19: DBpedia - A Large-scale, Multilingual Knowledge Base ...static.tongtianta.site/paper_pdf/38bd5a66-1521-11e9-bf90-00163e08bb… · DBpedia is served as Linked Data on the Web. Since

18 Lehmann et al DBpedia

0

10

20

30

40

50

60

70Download Count (in Thousands) Download Volume (in TB)

Fig 10 The download count and download volume (in GB) of the DBpedia data sets

Table 9Hardware of the machines serving the public SPARQL endpoint

DBpedia Configuration

33 - 34 AMD Opteron 8220 280Ghz 4 Cores 32GB35 - 37 Intel Xeon E5520 227Ghz 8 Cores 48GB38 Intel Xeon E5-2630 230GHz 8 Cores 64GB

chines or by adding several separate clusters with around robin HTTP front-end This allows the clustersetup to grow in line with desired response times foran RDF data set collection As the size of the DBpediadata set increased and its use by the Linked Data com-munity grew the project migrated to increasingly pow-erful but still moderately priced hardware as shown inTable 9 The Virtuoso instance is configured to processqueries within a 1200 second timeout window and amaximum result set size of 50000 rows It providesOFFSET and LIMIT support for paging alongside theability to produce partial results

The log files used in the following analysis excludedtraffic generated by

1 clients that have been temporarily rate limitedafter a burst period

2 clients that have been banned after misuse3 applications spiders and other crawlers that are

blocked after frequently hitting the rate limit orgenerally use too many resources

Virtuoso supports HTTP Access Control Lists(ACLs) which allow the administrator to rate limit cer-tain IP addresses or whole IP ranges A maximum num-ber of requests per second (currently 15) as well asa bandwidth limit per request (currently 10MB) areenforced If the client software can handle compres-

Table 11Number of unique sites accessing DBpedia endpoints

DBpedia AvgDay Median Stdev Maximum

33 5824 5977 1046 786534 6656 6704 947 831235 9392 9432 1643 1258436 10810 10810 1612 1479837 17077 17091 6782 3328838 14513 14493 2783 20510

sion replies are compressed to further save bandwidthException rules can be configured for multiple clientshidden behind a NAT firewall (appearing as a single IPaddress) or for temporary requests for higher rate limitsWhen a client hits an ACL limit the system reports anappropriate HTTP status code23 like 509 (rdquoBandwidthLimit Exceededrdquo) and quickly drops the connectionThe system further uses an iptables based firewall forpermanent blocking of clients identified by their IPaddresses

63 Public Static Endpoint Statistics

The statistics presented in this section were extractedfrom reports generated by Webalizer v22124 Table 10and Table 11 show various DBpedia SPARQL endpointusage statistics for the DBpedia 33 to 38 releases Notethat the usage of all endpoints mentioned in Table 12 iscounted The AvgDay column represents the averagenumber of hits (resp visitssites) per day followed by

23httpenwikipediaorgwikiList_of_HTTP_status_codes

24httpwwwwebalizerorg

Lehmann et al DBpedia 19

Table 10Number of endpoint hits (left) and visits (right)

DBpedia AvgDay Median Stdev Maximum

33 733811 711306 188991 131953934 1212549 1165893 351226 237165735 1122612 1035444 386454 290872036 1328355 1286750 256945 249503137 2085399 1930728 1057398 887347338 2910410 2717775 1085640 7678490

DBpedia AvgDay Median Stdev Maximum

33 9750 9775 2036 1312634 11329 11473 1546 1419835 16592 16902 2969 2312936 19471 17664 5691 5606437 23972 22262 10741 12786938 16851 16711 2960 27416

the Median and Standard Deviation The last columnshows the maximum number of hits (resp visitssites)that was recorded on a single day for each data setversion Visits (ie sessions of subsequent queries fromthe same client) are determined by a floating 30 minutetime window All requests from behind a NAT firewallare logged under the same external IP address and aretherefore counted towards the same visit if they occurwithin the 30 minute interval

Table 10 shows the increasing popularity of DBpediaThere is a distinct dip in hits to the SPARQL endpoint inDBpedia 35 which is partially due to more strict initiallimits for bot-related traffic which were later relaxedThe sudden drop of visits between the 37 and the 38data sets can be attributed to

1 applications starting to use their own privateDBpedia endpoint

2 blocking of apps that were abusing the DBpediaendpoint

3 uptake of the language specific DBpedia end-points and DBpedia Live

64 Query Types and Trends

The DBpedia server is not only a SPARQL endpointbut also serves as a Linked Data Hub returning re-sources in a number of different formats For each dataset we randomly selected 14 days worth of log files andprocessed those in order to show the various servicescalled Table 12 shows the number of hits to the variousendpoints

The resource endpoint uses the Accept line in theHTTP header sent by the client to return a HTTP sta-tus code 30x to redirect the client to either the page(HTML based) or data (formats like RDFXML or Tur-tle) equivalent of the article Clients also frequentlymint their own URLs to either page or data version ofan articles directly or download the raw data directlyThis explains why the count of page and data hitsin the table is larger than the number of hits on the

Table 12Hits per service to httpdbpediaorg in thousands

Endpoint 33 34 35 36 37 38

data 1355 2681 2230 2547 3714 4246ontology 80 168 142 178 116 97page 2634 4703 1810 1687 3787 7377property 231 311 137 176 176 128resource 2976 4080 2554 2309 4436 7143sparql 2090 4177 2266 9112 15902 15475other 252 420 342 277 579 695total 9619 16541 9434 16286 28710 35142

Fig 11 Traffic Linked Data versus SPARQL endpoint

resource endpoint The ontology and property end-points return meta information about the DBpedia on-tology While all of these endpoints themselves mayuse SPARQL queries to generate various page contentthese requests are handled by the internal Virtuoso en-gine directly and do not show up as extra calls to thesparql endpoint in our analysis

Figure 11 shows the percentages of traffic hits thatwere generated by the main endpoints As we can seethe usage of the SPARQL endpoint has doubled fromabout 22 percent in 2009 to about 44 percent in 2013However this still means that 56 percent of traffic hitsare directed to the Linked Data service

In Table 13 we focussed on the calls to the sparqlendpoint and counted the number of statements per type

20 Lehmann et al DBpedia

Table 13Hits per statement type in thousands

Statement 33 34 35 36 37 38

ask 56 269 360 326 159 677construct 55 31 14 11 22 38describe 11 8 4 7 62 111select 1891 3663 1658 8030 11204 13516unknown 78 206 229 738 4455 1134total 2090 4177 2266 9112 15902 15475

Table 14Trends in SPARQL select (rounded values in )

Statement 33 34 35 36 37 38

distinct 195 114 173 194 133 254filter 457 137 318 253 292 361functions 88 69 235 213 255 259geo 277 70 396 67 93 92group 00 00 00 00 00 02limit 46 65 116 105 77 78optional 306 238 473 263 167 172order 23 13 19 32 12 16union 33 32 264 113 121 206

As the log files only record the full SPARQL query ona GET request all the POST requests are counted asunknown

Finally we analyzed each SPARQL query andcounted the use of keywords and constructs like

ndash DISTINCTndash FILTERndash FUNCTIONS like CONCAT CONTAINS ISIRIndash Use of GEO objectsndash GROUP BYndash LIMIT OFFSETndash OPTIONALndash ORDER BYndash UNION

For the GEO objects we counted the use of SPARQLPREFIX geo and wgs84 declarations and usage inproperty tags Table 14 shows the use of various key-words as a percentage of the total select queries madeto the sparql endpoint for the sample sets In generalwe observed that queries became more complex overtime indicating an increasing maturity and higher ex-pectations of the user base

65 Statistics for DBpedia Live

Since its official release at the end of June 2011DBpedia Live attracted a steadily increasing number

of users Furthermore more users tend to use the syn-chronisation tool to synchronise their own DBpediaLive mirrors This leads to an increasing number of liveupdate requests ie changeset downloads Figure 12indicates the number of daily SPARQL and synchroni-sation requests sent to DBpedia Live endpoint in theperiod between August 2012 and January 2013

Fig 12 Number of daily requests sent to the DBpedia Live for a)SPARQL queries and b) synchronisation requests from August 2012until January 2013

7 Use Cases and Applications

Due to DBpediarsquos coverage of various domains aswell as its steady growth as a hub on the Web of Datathe data sets provided by DBpedia can serve many pur-poses Such applications include improving search andexploration of Wikipedia data proliferation for applica-tions mashups as well as text analysis and annotationtools

71 Natural Language Processing

DBpedia can support many tasks in Natural Lan-guage Processing (NLP) [33] For that purpose DBpediaincludes a number of specialized data sets25 For in-stance the lexicalizations data set can be used to esti-mate the ambiguity of phrases to help select unambigu-ous identifiers for ambiguous phrases or to providealternative names for entities just to mention a few ex-amples Topic signatures can be useful in tasks such asquery expansion or document summarization and has

25httpwikidbpediaorgDatasetsNLP

Lehmann et al DBpedia 21

been successfully employed to classify ambiguouslydescribed images as good depictions of DBpedia enti-ties [13] The thematic concepts data set of resourcescan be used for creating a corpus from Wikipedia to beused as training data for topic classifiers among otherthings (see below) The grammatical gender data setcan for example be used to add a gender feature inco-reference resolution

711 Annotation Entity DisambiguationAn important use case for NLP is annotating texts

or other content with semantic information Namedentity recognition and disambiguation ndash also known askey phrase extraction and entity linking tasks ndash refersto the task of finding real world entities in text andlinking them to unique identifiers One of the mainchallenges in this regard is ambiguity an entity nameor surface form may be used in different contexts torefer to different concepts Many different methodshave been developed to resolve this ambiguity withfairly high accuracy [22]

As DBpedia reflects a vast amount of structured realworld knowledge obtained from Wikipedia DBpediaURIs can be used as identifiers for the majority of do-mains in text annotation Consequently interlinking textdocuments with Linked Data enables the Web of Datato be used as background knowledge within document-oriented applications such as semantic search or facetedbrowsing (cf Section 73)

Many applications performing this task of annotatingtext with entities in fact use DBpedia entities as targetsFor example DBpedia Spotlight [34] is an open sourcetool26 including a free web service that detects men-tions of DBpedia resources in text It uses the lexical-izations in conjunction with the topic signatures dataset as context model in order to be able to disambiguatefound mentions The main advantage of this system isits comprehensiveness and flexibility allowing one toconfigure it based on quality measures such as promi-nence contextual ambiguity topical pertinence and dis-ambiguation confidence as well as the DBpedia on-tology The resources that should be annotated can bespecified by a list of resource types or by more complexrelationships within the knowledge base described asSPARQL queries

There are numerous other NLP APIs that link enti-ties in text to DBpedia AlchemyAPI27 Semantic API

26httpspotlightdbpediaorg27httpwwwalchemyapicom

from Ontos28 Open Calais29 and Zemanta30 among oth-ers Furthermore the DBpedia ontology has been usedfor training named entity recognition systems (withoutdisambiguation) in the context of the Apache Stanbolproject31

A related project is ImageSnippets32 which is a sys-tem for annotating images It uses DBpedia as one ofits main datasets for unambiguously identifying entitiesdepicted within an image

Tag disambiguation Similar to linking entities in textto DBpedia user-generated tags attached to multimediacontent such as music photos or videos can also be con-nected to the Linked Data hub This has previously beenimplemented by letting the user resolve ambiguitiesFor example Faviki33 suggests a set of DBpedia entitiescoming from Zemantarsquos API and lets the user choosethe desired one Alternatively similar disambiguationtechniques as mentioned above can be utilized to chooseentities from tags automatically [14] The BBC34 [23]employs DBpedia URIs for tagging their programmesShort clips and full episodes are tagged using two dif-ferent tools while utilizing DBpedia to benefit fromglobal identifiers that can be easily integrated with otherknowledge bases

712 Question AnsweringDBpedia provides a wealth of human knowledge

across different domains and languages which makesit an excellent target for question answering and key-word search approaches One of the most prominentefforts in this area is the DeepQA project which re-sulted in the IBM Watson system [12] The Watsonsystem won a $1 million prize in Jeopardy and relieson several data sets including DBpedia35 DBpedia isalso the primary target for several QA systems in theQuestion Answering over Linked Data (QALD) work-shop series36 Several QA systems such as TBSL [44]PowerAqua [31] FREyA [9] and QAKiS [6] have beenapplied to DBpedia using the QALD benchmark ques-tions DBpedia is interesting as a test case for such

28httpwwwontoscom29httpwwwopencalaiscom30httpwwwzemantacom31httpstanbolapacheorg32httpwwwimagesnippetscom33httpwwwfavikicom34httpbbccouk35httpwwwaaaiorgMagazineWatson

watsonphp36httpgreententacletechfakuni-

bielefeldde˜cungerqald

22 Lehmann et al DBpedia

systems Due to its large schema and data size as wellas its topic diversity it provides significant scientificchallenges In particular it would be difficult to providecapable QA systems for DBpedia based only on simplepatterns or via domain specific dictionaries because ofits size and broad coverage Therefore a question an-swering system which is able to reliable answer ques-tions over DBpedia correctly could be seen as a trulyintelligent system In the latest QALD series questionanswering benchmarks also exploit national DBpediachapters for multilingual question answering

Similarly the slot filling task in natural languageprocessing poses the challenge of finding values for agiven entity and property from mining text This canbe viewed as question answering with static questionsbut changing targets DBpedia can be exploited for factvalidation or training data in this task as was done bythe Watson team [4] and others [28]

72 Digital Libraries and Archives

In the case of libraries and archives DBpedia couldoffer a broad range of information on a broad range ofdomains In particular DBpedia could provide

ndash Context information for bibliographic and archiverecords Background information such as an au-thorrsquos demographics a filmrsquos homepage or an im-age could be used to enhance user interaction

ndash Stable and curated identifiers for linking DBpediais a hub of Linked Open Data Thus (re-)usingcommonly used identifiers could ease integrationwith other libraries or knowledge bases

ndash A basis for a thesaurus for subject indexing Thebroad range of Wikipedia topics in addition tothe stable URIs could form the basis for a globalclassification system

Libraries have already invested both in Linked Dataand Wikipedia (and transitively to DBpedia) thoughthe realization of the Virtual International AuthorityFiles (VIAF) project37 Recently it was announced thatVIAF added a total of 250000 reciprocal authoritylinks to Wikipedia38 These links are already harvestedby DBpedia Live and will also be included in the nextstatic DBpedia release This creates a huge opportunityfor libraries that use VIAF to get connected to DBpediaand the LOD cloud in general

37httpviaforg38Accessed on 12022013 httpwwwoclcorg

researchnews201212-07ahtml

73 Knowledge Exploration

Since DBpedia spans many domains and has a di-verse schema many knowledge exploration tools eitherused DBpedia as a testbed or were specifically builtfor DBpedia We give a brief overview of tools andstructure them in categories

Facet Based Browsers An award-winning39 facet-based browser used the Neofonie search engine to com-bine facts in DBpedia with full-text from Wikipedia inorder to compute hierarchical facets [15] Another facetbased browser which allows to create complex graphstructures of facets in a visually appealing interface andfilter them is gFacet [16] A generic SPARQL basedfacet explorer which also uses a graph based visualisa-tion of facets is LODLive [7] The OpenLink built-infacet based browser40 is an interface which enablesdevelopers to explore DBpedia compute aggregationsover facets and view the underlying SPARQL queries

Search and Querying The DBpedia Query Builder41

allows developers to easily create simple SPARQLqueries more specifically sets of triple patterns via in-telligent autocompletion The autocompletion function-ality ensures that only URIs which lead to solutions aresuggested to the user The RelFinder [17] tool providesan intuitive interface which allows to explore the neigh-borhood and connections between resources specifiedby the user For instance the user can view the short-est paths connecting certain persons in DBpedia Sem-Lens [18] allows to create statistical analysis queriesand correlations in RDF data and DBpedia in particular

Spatial Applications DBpedia Mobile [3] is a locationaware client which renders a map of nearby locationsfrom DBpedia provides icons for schema classes andsupports more than 30 languages from various DBpedialanguage editions It can follow RDF links to otherdata sets linked from DBpedia and supports powerfulSPARQL filters to restrict the viewed data

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources.

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/

40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org

It is available in 171 languages (of which approximately 147 can be considered active42) and contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains43. Such descriptions provide for a lexical word a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: for every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data44. Statistics are shown in Table 15.
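A minimal sketch for inspecting the resulting data, assuming a SPARQL endpoint at wiktionary.dbpedia.org analogous to the main DBpedia endpoint and a resource URI pattern mirroring the Wiktionary page names (the word "semantic" follows the example page cited below):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # List the triples describing the entry for the word "semantic";
    # both the endpoint URL and the resource URI pattern are assumptions.
    sparql = SPARQLWrapper("http://wiktionary.dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT ?p ?o WHERE {
            <http://wiktionary.dbpedia.org/resource/semantic> ?p ?o .
        } LIMIT 20
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["p"]["value"], row["o"]["value"])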

8 Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata
In March 2012, Wikimedia Germany e.V. started the development of Wikidata45. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike.

42 http://s23.org/wikistats/wiktionaries_html.php

43 See http://en.wiktionary.org/wiki/semantic for a simple example page.

44 http://wiktionary.dbpedia.org
45 http://wikidata.org

It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions; thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase
Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles.

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47 http://www.freebase.com

Table 15: Statistical comparison of extractions for different languages

language  words      triples     resources   predicates  senses
en        2,142,237  28,593,364  11,804,039  28          424,386
fr        4,657,817  35,032,121  20,462,349  22          592,351
ru        1,080,156  12,813,437   5,994,560  17          149,859
de          701,739   5,618,508   2,966,867  16          122,362

They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between the two projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows metadata to be stored efficiently for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO
One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content.

48 http://www.mpi-inf.mpg.de/yago-naga/yago/

YAGO focuses on extracting a smaller number of relations compared to DBpedia in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.
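For illustration, here is a sketch that retrieves the YAGO types offered for a DBpedia resource; we assume the YAGO classes are published under the http://dbpedia.org/class/yago/ namespace used by the DBpedia releases:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # YAGO classes of dbr:Berlin, filtered by the assumed namespace.
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbr: <http://dbpedia.org/resource/>
        SELECT ?type WHERE {
            dbr:Berlin a ?type .
            FILTER (STRSTARTS(STR(?type), "http://dbpedia.org/class/yago/"))
        } LIMIT 20
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["type"]["value"])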

82 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 that are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library [49]) is an open-source Java-based API that provides access to the information offered by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page.

49 http://www.mediawiki.org/wiki/Alternative_parsers

50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction

Data is also provided as an XML dump and is incorporated into the lexical resource UBY51 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improve the quality of infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort to build an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage, beyond the use of Wikipedia interlanguage links, is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a Web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common-sense knowledge base which is now partially released as OpenCyc and is also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that, in the past four years, DBpedia has matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile having become part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions.

56 http://www.cs.washington.edu/research/knowitall

Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision also extracting semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained, non-invasive knowledge representation and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata, and will add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks: SPARQL queries on the DBpedia Live endpoint which identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
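A minimal sketch of such a sanity check, written against the dbo:birthDate and dbo:deathDate properties of the DBpedia ontology (the endpoint could equally be the DBpedia Live endpoint mentioned above):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Persons whose recorded death date precedes their birth date;
    # any result indicates an extraction error or a Wikipedia typo.
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?person ?birth ?death WHERE {
            ?person dbo:birthDate ?birth ;
                    dbo:deathDate ?death .
            FILTER (?death < ?birth)
        } LIMIT 100
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["person"]["value"], row["birth"]["value"], row["death"]["value"])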

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information, such as co-occurrence, for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) or to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to which entity a certain surface form should be mapped.
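As a sketch, entity mentions in a sentence can already be linked to DBpedia resources with the DBpedia Spotlight web service described in Section 7.1.1; the parameter and response field names below follow the public REST interface and should be treated as illustrative:

    import requests

    # Annotate a sentence with DBpedia entities via DBpedia Spotlight.
    resp = requests.get(
        "http://spotlight.dbpedia.org/rest/annotate",
        params={"text": "Berlin is the capital of Germany.",
                "confidence": 0.4},
        headers={"Accept": "application/json"},
    )
    for res in resp.json().get("Resources", []):
        print(res["@surfaceForm"], "->", res["@URI"])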

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations for DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint

– Kingsley Idehen for SPARQL and Linked Data hosting and community support

– Christopher Sahnwaldt for DBpedia development and release management

– Claus Stadler for DBpedia development

– Paul Kreis for DBpedia development
– people who helped to contribute data for certain parts of the article:

* Instance Data Analysis: Volha Bryl (working at University of Mannheim)
* Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
* Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16: List of namespace prefixes

Prefix    Namespace
dbo       http://dbpedia.org/ontology/
dbp       http://dbpedia.org/property/
dbr       http://dbpedia.org/resource/
dbr-de    http://de.dbpedia.org/resource/
dc        http://purl.org/dc/elements/1.1/
foaf      http://xmlns.com/foaf/0.1/
geo       http://www.w3.org/2003/01/geo/wgs84_pos#
georss    http://www.georss.org/georss/
ls        http://spotlight.dbpedia.org/scores/
lx        http://spotlight.dbpedia.org/lexicalizations/
owl       http://www.w3.org/2002/07/owl#
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs      http://www.w3.org/2000/01/rdf-schema#
skos      http://www.w3.org/2004/02/skos/core#
sptl      http://spotlight.dbpedia.org/vocab/
xsd       http://www.w3.org/2001/XMLSchema#

Table 17: DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala;
             Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a Web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. Journal of Web Semantics, 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings of the TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia: a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. LodLive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Mérialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live extraction. In Proc. of the 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media meets Semantic Web: how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A tool for crowdsourcing the quality assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings of the TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP: a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. International Journal of Metadata, Semantics and Ontologies, 3:37–52, 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a Web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

View publication statsView publication stats

Page 20: DBpedia - A Large-scale, Multilingual Knowledge Base ...static.tongtianta.site/paper_pdf/38bd5a66-1521-11e9-bf90-00163e08bb… · DBpedia is served as Linked Data on the Web. Since

Lehmann et al DBpedia 19

Table 10Number of endpoint hits (left) and visits (right)

DBpedia AvgDay Median Stdev Maximum

33 733811 711306 188991 131953934 1212549 1165893 351226 237165735 1122612 1035444 386454 290872036 1328355 1286750 256945 249503137 2085399 1930728 1057398 887347338 2910410 2717775 1085640 7678490

DBpedia AvgDay Median Stdev Maximum

33 9750 9775 2036 1312634 11329 11473 1546 1419835 16592 16902 2969 2312936 19471 17664 5691 5606437 23972 22262 10741 12786938 16851 16711 2960 27416

the Median and Standard Deviation The last columnshows the maximum number of hits (resp visitssites)that was recorded on a single day for each data setversion Visits (ie sessions of subsequent queries fromthe same client) are determined by a floating 30 minutetime window All requests from behind a NAT firewallare logged under the same external IP address and aretherefore counted towards the same visit if they occurwithin the 30 minute interval

Table 10 shows the increasing popularity of DBpediaThere is a distinct dip in hits to the SPARQL endpoint inDBpedia 35 which is partially due to more strict initiallimits for bot-related traffic which were later relaxedThe sudden drop of visits between the 37 and the 38data sets can be attributed to

1 applications starting to use their own privateDBpedia endpoint

2 blocking of apps that were abusing the DBpediaendpoint

3 uptake of the language specific DBpedia end-points and DBpedia Live

64 Query Types and Trends

The DBpedia server is not only a SPARQL endpointbut also serves as a Linked Data Hub returning re-sources in a number of different formats For each dataset we randomly selected 14 days worth of log files andprocessed those in order to show the various servicescalled Table 12 shows the number of hits to the variousendpoints

The resource endpoint uses the Accept line in theHTTP header sent by the client to return a HTTP sta-tus code 30x to redirect the client to either the page(HTML based) or data (formats like RDFXML or Tur-tle) equivalent of the article Clients also frequentlymint their own URLs to either page or data version ofan articles directly or download the raw data directlyThis explains why the count of page and data hitsin the table is larger than the number of hits on the

Table 12Hits per service to httpdbpediaorg in thousands

Endpoint 33 34 35 36 37 38

data 1355 2681 2230 2547 3714 4246ontology 80 168 142 178 116 97page 2634 4703 1810 1687 3787 7377property 231 311 137 176 176 128resource 2976 4080 2554 2309 4436 7143sparql 2090 4177 2266 9112 15902 15475other 252 420 342 277 579 695total 9619 16541 9434 16286 28710 35142

Fig 11 Traffic Linked Data versus SPARQL endpoint

resource endpoint The ontology and property end-points return meta information about the DBpedia on-tology While all of these endpoints themselves mayuse SPARQL queries to generate various page contentthese requests are handled by the internal Virtuoso en-gine directly and do not show up as extra calls to thesparql endpoint in our analysis

Figure 11 shows the percentages of traffic hits thatwere generated by the main endpoints As we can seethe usage of the SPARQL endpoint has doubled fromabout 22 percent in 2009 to about 44 percent in 2013However this still means that 56 percent of traffic hitsare directed to the Linked Data service

In Table 13 we focussed on the calls to the sparqlendpoint and counted the number of statements per type

20 Lehmann et al DBpedia

Table 13Hits per statement type in thousands

Statement 33 34 35 36 37 38

ask 56 269 360 326 159 677construct 55 31 14 11 22 38describe 11 8 4 7 62 111select 1891 3663 1658 8030 11204 13516unknown 78 206 229 738 4455 1134total 2090 4177 2266 9112 15902 15475

Table 14Trends in SPARQL select (rounded values in )

Statement 33 34 35 36 37 38

distinct 195 114 173 194 133 254filter 457 137 318 253 292 361functions 88 69 235 213 255 259geo 277 70 396 67 93 92group 00 00 00 00 00 02limit 46 65 116 105 77 78optional 306 238 473 263 167 172order 23 13 19 32 12 16union 33 32 264 113 121 206

As the log files only record the full SPARQL query ona GET request all the POST requests are counted asunknown

Finally we analyzed each SPARQL query andcounted the use of keywords and constructs like

ndash DISTINCTndash FILTERndash FUNCTIONS like CONCAT CONTAINS ISIRIndash Use of GEO objectsndash GROUP BYndash LIMIT OFFSETndash OPTIONALndash ORDER BYndash UNION

For the GEO objects we counted the use of SPARQLPREFIX geo and wgs84 declarations and usage inproperty tags Table 14 shows the use of various key-words as a percentage of the total select queries madeto the sparql endpoint for the sample sets In generalwe observed that queries became more complex overtime indicating an increasing maturity and higher ex-pectations of the user base

65 Statistics for DBpedia Live

Since its official release at the end of June 2011DBpedia Live attracted a steadily increasing number

of users Furthermore more users tend to use the syn-chronisation tool to synchronise their own DBpediaLive mirrors This leads to an increasing number of liveupdate requests ie changeset downloads Figure 12indicates the number of daily SPARQL and synchroni-sation requests sent to DBpedia Live endpoint in theperiod between August 2012 and January 2013

Fig 12 Number of daily requests sent to the DBpedia Live for a)SPARQL queries and b) synchronisation requests from August 2012until January 2013

7 Use Cases and Applications

Due to DBpediarsquos coverage of various domains aswell as its steady growth as a hub on the Web of Datathe data sets provided by DBpedia can serve many pur-poses Such applications include improving search andexploration of Wikipedia data proliferation for applica-tions mashups as well as text analysis and annotationtools

71 Natural Language Processing

DBpedia can support many tasks in Natural Lan-guage Processing (NLP) [33] For that purpose DBpediaincludes a number of specialized data sets25 For in-stance the lexicalizations data set can be used to esti-mate the ambiguity of phrases to help select unambigu-ous identifiers for ambiguous phrases or to providealternative names for entities just to mention a few ex-amples Topic signatures can be useful in tasks such asquery expansion or document summarization and has

25httpwikidbpediaorgDatasetsNLP

Lehmann et al DBpedia 21

been successfully employed to classify ambiguouslydescribed images as good depictions of DBpedia enti-ties [13] The thematic concepts data set of resourcescan be used for creating a corpus from Wikipedia to beused as training data for topic classifiers among otherthings (see below) The grammatical gender data setcan for example be used to add a gender feature inco-reference resolution

711 Annotation Entity DisambiguationAn important use case for NLP is annotating texts

or other content with semantic information Namedentity recognition and disambiguation ndash also known askey phrase extraction and entity linking tasks ndash refersto the task of finding real world entities in text andlinking them to unique identifiers One of the mainchallenges in this regard is ambiguity an entity nameor surface form may be used in different contexts torefer to different concepts Many different methodshave been developed to resolve this ambiguity withfairly high accuracy [22]

As DBpedia reflects a vast amount of structured realworld knowledge obtained from Wikipedia DBpediaURIs can be used as identifiers for the majority of do-mains in text annotation Consequently interlinking textdocuments with Linked Data enables the Web of Datato be used as background knowledge within document-oriented applications such as semantic search or facetedbrowsing (cf Section 73)

Many applications performing this task of annotatingtext with entities in fact use DBpedia entities as targetsFor example DBpedia Spotlight [34] is an open sourcetool26 including a free web service that detects men-tions of DBpedia resources in text It uses the lexical-izations in conjunction with the topic signatures dataset as context model in order to be able to disambiguatefound mentions The main advantage of this system isits comprehensiveness and flexibility allowing one toconfigure it based on quality measures such as promi-nence contextual ambiguity topical pertinence and dis-ambiguation confidence as well as the DBpedia on-tology The resources that should be annotated can bespecified by a list of resource types or by more complexrelationships within the knowledge base described asSPARQL queries

There are numerous other NLP APIs that link enti-ties in text to DBpedia AlchemyAPI27 Semantic API

26httpspotlightdbpediaorg27httpwwwalchemyapicom

from Ontos28 Open Calais29 and Zemanta30 among oth-ers Furthermore the DBpedia ontology has been usedfor training named entity recognition systems (withoutdisambiguation) in the context of the Apache Stanbolproject31

A related project is ImageSnippets32 which is a sys-tem for annotating images It uses DBpedia as one ofits main datasets for unambiguously identifying entitiesdepicted within an image

Tag disambiguation Similar to linking entities in textto DBpedia user-generated tags attached to multimediacontent such as music photos or videos can also be con-nected to the Linked Data hub This has previously beenimplemented by letting the user resolve ambiguitiesFor example Faviki33 suggests a set of DBpedia entitiescoming from Zemantarsquos API and lets the user choosethe desired one Alternatively similar disambiguationtechniques as mentioned above can be utilized to chooseentities from tags automatically [14] The BBC34 [23]employs DBpedia URIs for tagging their programmesShort clips and full episodes are tagged using two dif-ferent tools while utilizing DBpedia to benefit fromglobal identifiers that can be easily integrated with otherknowledge bases

712 Question AnsweringDBpedia provides a wealth of human knowledge

across different domains and languages which makesit an excellent target for question answering and key-word search approaches One of the most prominentefforts in this area is the DeepQA project which re-sulted in the IBM Watson system [12] The Watsonsystem won a $1 million prize in Jeopardy and relieson several data sets including DBpedia35 DBpedia isalso the primary target for several QA systems in theQuestion Answering over Linked Data (QALD) work-shop series36 Several QA systems such as TBSL [44]PowerAqua [31] FREyA [9] and QAKiS [6] have beenapplied to DBpedia using the QALD benchmark ques-tions DBpedia is interesting as a test case for such

28httpwwwontoscom29httpwwwopencalaiscom30httpwwwzemantacom31httpstanbolapacheorg32httpwwwimagesnippetscom33httpwwwfavikicom34httpbbccouk35httpwwwaaaiorgMagazineWatson

watsonphp36httpgreententacletechfakuni-

bielefeldde˜cungerqald

22 Lehmann et al DBpedia

systems Due to its large schema and data size as wellas its topic diversity it provides significant scientificchallenges In particular it would be difficult to providecapable QA systems for DBpedia based only on simplepatterns or via domain specific dictionaries because ofits size and broad coverage Therefore a question an-swering system which is able to reliable answer ques-tions over DBpedia correctly could be seen as a trulyintelligent system In the latest QALD series questionanswering benchmarks also exploit national DBpediachapters for multilingual question answering

Similarly the slot filling task in natural languageprocessing poses the challenge of finding values for agiven entity and property from mining text This canbe viewed as question answering with static questionsbut changing targets DBpedia can be exploited for factvalidation or training data in this task as was done bythe Watson team [4] and others [28]

72 Digital Libraries and Archives

In the case of libraries and archives DBpedia couldoffer a broad range of information on a broad range ofdomains In particular DBpedia could provide

ndash Context information for bibliographic and archiverecords Background information such as an au-thorrsquos demographics a filmrsquos homepage or an im-age could be used to enhance user interaction

ndash Stable and curated identifiers for linking DBpediais a hub of Linked Open Data Thus (re-)usingcommonly used identifiers could ease integrationwith other libraries or knowledge bases

ndash A basis for a thesaurus for subject indexing Thebroad range of Wikipedia topics in addition tothe stable URIs could form the basis for a globalclassification system

Libraries have already invested both in Linked Dataand Wikipedia (and transitively to DBpedia) thoughthe realization of the Virtual International AuthorityFiles (VIAF) project37 Recently it was announced thatVIAF added a total of 250000 reciprocal authoritylinks to Wikipedia38 These links are already harvestedby DBpedia Live and will also be included in the nextstatic DBpedia release This creates a huge opportunityfor libraries that use VIAF to get connected to DBpediaand the LOD cloud in general

37httpviaforg38Accessed on 12022013 httpwwwoclcorg

researchnews201212-07ahtml

73 Knowledge Exploration

Since DBpedia spans many domains and has a di-verse schema many knowledge exploration tools eitherused DBpedia as a testbed or were specifically builtfor DBpedia We give a brief overview of tools andstructure them in categories

Facet Based Browsers An award-winning39 facet-based browser used the Neofonie search engine to com-bine facts in DBpedia with full-text from Wikipedia inorder to compute hierarchical facets [15] Another facetbased browser which allows to create complex graphstructures of facets in a visually appealing interface andfilter them is gFacet [16] A generic SPARQL basedfacet explorer which also uses a graph based visualisa-tion of facets is LODLive [7] The OpenLink built-infacet based browser40 is an interface which enablesdevelopers to explore DBpedia compute aggregationsover facets and view the underlying SPARQL queries

Search and Querying The DBpedia Query Builder41

allows developers to easily create simple SPARQLqueries more specifically sets of triple patterns via in-telligent autocompletion The autocompletion function-ality ensures that only URIs which lead to solutions aresuggested to the user The RelFinder [17] tool providesan intuitive interface which allows to explore the neigh-borhood and connections between resources specifiedby the user For instance the user can view the short-est paths connecting certain persons in DBpedia Sem-Lens [18] allows to create statistical analysis queriesand correlations in RDF data and DBpedia in particular

Spatial Applications DBpedia Mobile [3] is a locationaware client which renders a map of nearby locationsfrom DBpedia provides icons for schema classes andsupports more than 30 languages from various DBpedialanguage editions It can follow RDF links to otherdata sets linked from DBpedia and supports powerfulSPARQL filters to restrict the viewed data

74 Applications of the Extraction FrameworkWiktionary Extraction

Wiktionary is one of the biggest collaboratively cre-ated lexical-semantic and linguistic resources avail-

39httpblogdbpediaorg20091120german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany

40httpdbpediaorgfct41httpquerybuilderdbpediaorg

Lehmann et al DBpedia 23

able in 171 languages (of which approximately 147 canbe considered active42) It contains information abouthundreds of spoken and even ancient languages In thecase of the English Wiktionary there are nearly 3 mil-lion detailed descriptions of words covering several do-mains43 Such descriptions provide for a lexical worda hierarchical disambiguation to its language part ofspeech sometimes etymologies synonyms hyponymshyperonyms example sentences and most prominentlysenses

Due to its fast changing nature together with thefragmentation of the project into Wiktionary languageeditions (WLE) with independent layout rules a config-urable mediatorwrapper approach is taken for its auto-mated transformation into a structured knowledge baseThe workflow of this dedicated Wiktionary extractorbeing part of the Wiktionary2RDF [19] project is asfollows For every WLE to be transformed an XMLconfiguration file is provided as input This configura-tion is used by the Wiktionary extractor invoked bythe DBpedia extraction framework to first generate aschema reflecting the configured page structure (wrap-per part) After this these language specific schemasare converted to a global schema (mediator part) andlater serialized to RDF

To enable non-programmers (the community ofadopters and domain experts) to tailor and maintainthe WLE wrappers themselves a simple XML dialectwas created to encode the page structure to be parsedand declare triple patterns that define how the resultingRDF should be built The described setup is run againstWiktionary dumps The resulting data set is open inevery aspect and hosted as Linked Data44 Statistics areshown in Table 15

8 Related Work

81 Cross Domain Community Knowledge Bases

811 WikidataIn March 2012 the Wikimedia Germany eV started

the development of Wikidata45 Wikidata is a freeknowledge base about the world that can be read and

42https23orgwikistatswiktionaries_htmlphp

43See httpenwiktionaryorgwikisemanticfor a simple example page

44httpwiktionarydbpediaorg45httpwikidataorg

edited by humans and machines alike It provides datain all languages of the Wikimedia projects and allowsfor central access to the data in a similar vein as Wiki-media Commons does for multimedia files Things de-scribed in the Wikidata knowledge base are called itemsand can have labels descriptions and aliases in all lan-guages Wikidata does not aim at offering a single truthabout things but providing statements given in a par-ticular context Rather than stating that Berlin has apopulation of 35 million Wikidata contains the state-ment about Berlinrsquos population being 35 million as of2011 according to the German statistical office ThusWikidata can offer a variety of statements from differ-ent sources and dates As there are potentially manydifferent statements for a given item and property rankscan be added to statements to define their status (pre-ferred normal or deprecated) The initial developmentwas divided in three phases

ndash The first phase (interwiki links) created an entitybase for the Wikimedia projects This providesa better alternative to the previous interlanguagelink system

ndash The second phase (infoboxes) gathered infobox-related data for a subset of the entities with theexplicit goal of augmenting the infoboxes that arecurrently widely used with data from Wikidata

ndash The third phase (lists) will expand the set of prop-erties beyond those related to infoboxes and willprovide ways of exploiting this data within andoutside the Wikimedia projects

At the time of writing of this article the developmentof the third phase is ongoing

Wikidata already contains 1195 million items and348 properties that can be used to describe them SinceMarch 2013 the Wikidata extension is live on allWikipedia language editions and thus their pages canbe linked to items in Wikidata and include data fromWikidata

Wikidata also offers a Linked Data interface46 aswell as regular RDF dumps of all its data The plannedcollaboration with Wikidata is outlined in Section 9

812 FreebaseFreebase47 is a graph database which also extracts

structured data from Wikipedia and makes it availablein RDF Both DBpedia and Freebase link to each other

46httpmetawikimediaorgwikiWikidataDevelopmentLinkedDataInterface

47httpwwwfreebasecom

24 Lehmann et al DBpedia

Table 15Statistical comparison of extractions for different languages

language words triples resources predicates senses

en 2142237 28593364 11804039 28 424386fr 4657817 35032121 20462349 22 592351ru 1080156 12813437 5994560 17 149859de 701739 5618508 2966867 16 122362

and provide identifiers based on those for Wikipediaarticles They both provide dumps of the extracted dataas well as APIs or endpoints to access the data andallow their communities to influence the schema of thedata There are however also major differences be-tween both projects DBpedia focuses on being an RDFrepresentation of Wikipedia and serving as a hub on theWeb of Data whereas Freebase uses several sources toprovide broad coverage The store behind Freebase isthe GraphD [35] graph database which allows to effi-ciently store metadata for each fact This graph store isappend-only Deleted triples are marked and the systemcan easily revert to a previous version This is neces-sary since Freebase data can be directly edited by userswhereas information in DBpedia can only indirectly beedited by modifying the content of Wikipedia or theMappings Wiki From an organisational point of viewFreebase is mainly run by Google whereas DBpedia isan open community project In particular in focus areasof Google and areas in which Freebase includes otherdata sources the Freebase database provides a highercoverage than DBpedia

813 YAGOOne of the projects that pursues similar goals

to DBpedia is YAGO48 [42] YAGO is identical toDBpedia in that each article in Wikipedia becomes anentity in YAGO Based on this it uses the leaf cate-gories in the Wikipedia category graph to infer typeinformation about an entity One of its key features is tolink this type information to WordNet WordNet synsetsare represented as classes and the extracted types ofentities may become subclasses of such a synset In theYAGO2 system [21] declarative extraction rules wereintroduced which can extract facts from different partsof Wikipedia articles eg infoboxes and categories aswell as other sources YAGO2 also supports spatial andtemporal dimensions for facts at the core of its system

One of the main differences between DBpedia andYAGO in general is that DBpedia tries to stay veryclose to Wikipedia and provide an RDF version of its

48httpwwwmpi-infmpgdeyago-nagayago

content YAGO focuses on extracting a smaller numberof relations compared to DBpedia to achieve very highprecision and consistent knowledge The two knowl-edge bases offer different type systems whereas theDBpedia ontology is manually maintained YAGO isbacked by WordNet and Wikipedia leaf categoriesDue to this YAGO contains many more classes thanDBpedia Another difference is that the integration ofattributes and objects in infoboxes is done via mappingsin DBpedia and therefore by the DBpedia communityitself whereas this task is facilitated by expert-designeddeclarative rules in YAGO2

The two knowledge bases are connected eg DBpediaoffers the YAGO type hierarchy as an alternative tothe DBpedia ontology and sameAs links are providedin both directions While the underlying systems arevery different both projects share similar aims andpositively complement and influence each other

82 Knowledge Extraction from Wikipedia

Since its official start in 2001 Wikipedia has alwaysbeen the target of automatic extraction of informationdue to its easy availability open license and encyclo-pedic knowledge A large number of parsers scraperprojects and publications exist In this section we re-strict ourselves to approaches that are either notable re-cent or pertinent to DBpedia MediaWikiorg maintainsan up-to-date list of software projects49 who are able toprocess wiki syntax as well as a list of data extractionextensions50 for MediaWiki

JWPL (Java Wikipedia Library [49]) is an open-source Java-based API that allows to access informa-tion provided by the Wikipedia API (redirects cate-gories articles and link structure) JWPL contains aMediaWiki Markup parser that can be used to furtheranalyze the contents of a Wikipedia page Data is also

49httpwwwmediawikiorgwikiAlternative_parsers

50httpwwwmediawikiorgwikiExtension_Matrixdata_extraction


Table 13
Hits per statement type, in thousands

Statement    3.3    3.4    3.5    3.6     3.7     3.8
ask           56    269    360    326     159     677
construct     55     31     14     11      22      38
describe      11      8      4      7      62     111
select      1891   3663   1658   8030   11204   13516
unknown       78    206    229    738    4455    1134
total       2090   4177   2266   9112   15902   15475

Table 14
Trends in SPARQL SELECT queries (rounded values in %)

Statement   3.3   3.4   3.5   3.6   3.7   3.8
distinct   19.5  11.4  17.3  19.4  13.3  25.4
filter     45.7  13.7  31.8  25.3  29.2  36.1
functions   8.8   6.9  23.5  21.3  25.5  25.9
geo        27.7   7.0  39.6   6.7   9.3   9.2
group       0.0   0.0   0.0   0.0   0.0   0.2
limit       4.6   6.5  11.6  10.5   7.7   7.8
optional   30.6  23.8  47.3  26.3  16.7  17.2
order       2.3   1.3   1.9   3.2   1.2   1.6
union       3.3   3.2  26.4  11.3  12.1  20.6

As the log files only record the full SPARQL query on a GET request, all POST requests are counted as unknown.

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– functions like CONCAT, CONTAINS, isIRI
– use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of various keywords as a percentage of the total SELECT queries made to the SPARQL endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
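To make the methodology concrete, the following Python sketch reproduces this kind of keyword counting over a query log. It is illustrative only: the log file name and the assumption of one URL-decoded SPARQL query per line are ours, not part of the described setup.

```python
import re
from collections import Counter

# Regexes approximating the keyword categories of Table 14.
PATTERNS = {
    "distinct": r"\bDISTINCT\b",
    "filter": r"\bFILTER\b",
    "functions": r"\b(CONCAT|CONTAINS|ISIRI)\b",
    "geo": r"PREFIX\s+geo|wgs84",
    "group": r"\bGROUP\s+BY\b",
    "limit": r"\b(LIMIT|OFFSET)\b",
    "optional": r"\bOPTIONAL\b",
    "order": r"\bORDER\s+BY\b",
    "union": r"\bUNION\b",
}

counts, selects = Counter(), 0
with open("queries.log") as log:  # assumed: one query per line
    for query in log:
        if not re.search(r"\bSELECT\b", query, re.I):
            continue  # Table 14 is relative to SELECT queries only
        selects += 1
        for name, pattern in PATTERNS.items():
            if re.search(pattern, query, re.I):
                counts[name] += 1

for name, n in counts.most_common():
    print(f"{name:<10} {100 * n / selects:5.1f} %")
```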

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e., changeset downloads. Figure 12 shows the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 12. Number of daily requests sent to the DBpedia Live endpoint for (a) SPARQL queries and (b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

Due to DBpedia's coverage of various domains, as well as its steady growth as a hub on the Web of Data, the data sets provided by DBpedia can serve many purposes. Applications include improving the search and exploration of Wikipedia data, data proliferation for applications and mashups, as well as text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, to mention just a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

25 http://wiki.dbpedia.org/Datasets/NLP

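As an illustration (not the lexicalizations data set itself), the ambiguity of a surface form can be approximated by counting the targets of the corresponding Wikipedia disambiguation page; the example resource is an assumption.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Count how many meanings the disambiguation page "Python" points to.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT (COUNT(?meaning) AS ?senses) WHERE {
      <http://dbpedia.org/resource/Python> dbo:wikiPageDisambiguates ?meaning .
    }
""")
sparql.setReturnFormat(JSON)
result = sparql.query().convert()
print(result["results"]["bindings"][0]["senses"]["value"])
```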

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open-source tool26 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
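A minimal sketch of calling a Spotlight-style annotation service over HTTP is shown below; the endpoint URL and parameter names follow the public web service as we know it and may differ between deployments, so treat them as assumptions.

```python
import requests

ENDPOINT = "https://api.dbpedia-spotlight.org/en/annotate"  # assumed endpoint
text = "Berlin is the capital of Germany."

response = requests.get(
    ENDPOINT,
    params={"text": text, "confidence": 0.5},
    headers={"Accept": "application/json"},
)
response.raise_for_status()
for resource in response.json().get("Resources", []):
    # Each annotation links a surface form in the text to a DBpedia URI.
    print(resource["@surfaceForm"], "->", resource["@URI"])
```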

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, the Semantic API from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.

26 http://spotlight.dbpedia.org
27 http://www.alchemyapi.com


A related project is ImageSnippets32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems.

28 http://www.ontos.com
29 http://www.opencalais.com
30 http://www.zemanta.com
31 http://stanbol.apache.org
32 http://www.imagesnippets.com
33 http://www.faviki.com
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].
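As a sketch, the following lookup retrieves candidate values for one such slot from DBpedia, against which values extracted from text could be validated; the entity and property are illustrative, not taken from the cited systems.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Slot-filling style lookup: fetch all recorded birth places of an entity.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?birthPlace WHERE {
      <http://dbpedia.org/resource/Alan_Turing> dbo:birthPlace ?birthPlace .
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["birthPlace"]["value"])
```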

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information across many domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records. Background information, such as an author's demographics, a film's homepage or an image, could be used to enhance user interaction.
– Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data; thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project37. Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia38. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html
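As a sketch of how such authority links can be consumed, the following snippet follows owl:sameAs links from a DBpedia resource and keeps those pointing to viaf.org; the example resource and the string filter are our assumptions about how the harvested links are exposed.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Discover VIAF authority links for a person via owl:sameAs.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT ?link WHERE {
      <http://dbpedia.org/resource/Johann_Sebastian_Bach> owl:sameAs ?link .
      FILTER (CONTAINS(STR(?link), "viaf.org"))
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["link"]["value"])
```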

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of these tools, structured into categories.

Facet Based Browsers. An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser40 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.
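The aggregation behind such facet views can be approximated with a single SPARQL query; the following sketch counts instances per DBpedia ontology class, the kind of query a facet browser issues behind the scenes (the namespace filter and result limit are our choices).

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Facet-style aggregation: instance counts per DBpedia ontology class.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?type (COUNT(?s) AS ?n) WHERE {
      ?s a ?type .
      FILTER (STRSTARTS(STR(?type), "http://dbpedia.org/ontology/"))
    }
    GROUP BY ?type
    ORDER BY DESC(?n)
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["type"]["value"], row["n"]["value"])
```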

Search and Querying. The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows to explore the neighborhood of and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows to create statistical analysis queries and correlations in RDF data, and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.
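A location-aware lookup in this spirit can be sketched as a bounding-box query over the WGS84 coordinates DBpedia publishes; the coordinates (central Berlin) and box size below are illustrative.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

lat, lon, delta = 52.52, 13.40, 0.05  # illustrative point and box size

# Find places whose coordinates fall inside the bounding box.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(f"""
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    SELECT ?place WHERE {{
      ?place geo:lat ?lat ;
             geo:long ?long .
      FILTER (?lat  > {lat - delta} && ?lat  < {lat + delta} &&
              ?long > {lon - delta} && ?long < {lon + delta})
    }}
    LIMIT 20
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["place"]["value"])
```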

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words, covering several domains43. Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct
41 http://querybuilder.dbpedia.org


Due to its fast-changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data44. Statistics are shown in Table 15.

8 Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata45. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases.

42 http://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page.
44 http://wiktionary.dbpedia.org
45 http://wikidata.org


– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.
– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.
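To make the statement model described above tangible, here is a small, purely illustrative Python sketch of the Berlin example. The class and field names are invented for this sketch and do not mirror Wikidata's actual data model or API.

```python
from dataclasses import dataclass, field

@dataclass
class Statement:
    # Hypothetical representation of a Wikidata-style statement.
    property: str                                     # e.g. "population"
    value: object                                     # e.g. 3_500_000
    qualifiers: dict = field(default_factory=dict)    # context, e.g. {"as of": 2011}
    references: list = field(default_factory=list)    # provenance of the claim
    rank: str = "normal"                              # "preferred", "normal" or "deprecated"

berlin_population = Statement(
    property="population",
    value=3_500_000,
    qualifiers={"as of": 2011},
    references=["German statistical office"],
    rank="preferred",
)
print(berlin_population)
```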

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface46, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase47 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface

47 http://www.freebase.com


Table 15
Statistical comparison of extractions for different languages

language   words      triples      resources    predicates   senses
en         2142237    28593364     11804039     28           424386
fr         4657817    35032121     20462349     22           592351
ru         1080156    12813437      5994560     17           149859
de          701739     5618508      2966867     16           122362


8.1.3 YAGO

One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and to provide an RDF version of its content, whereas YAGO focuses on extracting a smaller number of relations compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

48 http://www.mpi-inf.mpg.de/yago-naga/yago


The two knowledge bases are connected: e.g., DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.
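For illustration, the following sketch retrieves the YAGO classes that DBpedia serves for an entity alongside its DBpedia ontology types; the yago class namespace and the example resource are assumptions based on the public endpoint.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Retrieve the YAGO type hierarchy entries DBpedia exposes for an entity.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?type WHERE {
      <http://dbpedia.org/resource/Leipzig> a ?type .
      FILTER (STRSTARTS(STR(?type), "http://dbpedia.org/class/yago/"))
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["type"]["value"])
```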

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been a target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects49 which are able to process wiki syntax, as well as a list of data extraction extensions50 for MediaWiki.

JWPL (Java Wikipedia Library) [49] is an open-source Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY51 for language tools.

49 http://www.mediawiki.org/wiki/Alternative_parsers

50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction


Several different approaches to extract knowledge from Wikipedia are presented in [37]. Features such as anchor texts, interlanguage links, category links and redirect pages are utilized, e.g., for word-sense disambiguation, synonyms, translations, taxonomic relations, and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text, applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source, based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage, beyond the use of Wikipedia interlanguage links, is presented in [45]. Focusing on wiki knowledge bases, the authors introduce a solution based on structural properties, such as similar linkage structures, assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a web-scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common-sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that, in the past four years, DBpedia has matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There is a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

56 http://www.cs.washington.edu/research/knowitall

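A first step in this direction can be sketched by querying the same property from two chapter endpoints and comparing the values; the German chapter endpoint, the property and the example resource below are assumptions for illustration.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?pop WHERE { <%s> dbo:populationTotal ?pop . }
"""
editions = {
    "en": ("http://dbpedia.org/sparql", "http://dbpedia.org/resource/Leipzig"),
    "de": ("http://de.dbpedia.org/sparql", "http://de.dbpedia.org/resource/Leipzig"),
}
for lang, (endpoint, resource) in editions.items():
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(QUERY % resource)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        # Diverging values across editions hint at extraction errors
        # or stale infobox data in one of the Wikipedias.
        print(lang, row["pop"]["value"])
```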

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained, non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birth date of a person always lies before the death date, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
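Such a sanity check can be expressed directly as a SPARQL query. The sketch below looks for persons whose recorded death date precedes their birth date; the property names come from the DBpedia ontology, and running it against DBpedia Live, as envisioned above, would simply use the Live endpoint instead of the main one.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Sanity check: a death date must never precede the birth date.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?person ?birth ?death WHERE {
      ?person dbo:birthDate ?birth ;
              dbo:deathDate ?death .
      FILTER (?death < ?birth)
    }
    LIMIT 50
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["person"]["value"], row["birth"]["value"], row["death"]["value"])
```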

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information, such as co-occurrence, for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to which entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  ∗ Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  ∗ Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  ∗ Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16
List of namespace prefixes

Prefix   Namespace
dbo      http://dbpedia.org/ontology/
dbp      http://dbpedia.org/property/
dbr      http://dbpedia.org/resource/
dbr-de   http://de.dbpedia.org/resource/
dc       http://purl.org/dc/elements/1.1/
foaf     http://xmlns.com/foaf/0.1/
geo      http://www.w3.org/2003/01/geo/wgs84_pos#
georss   http://www.georss.org/georss/
ls       http://spotlight.dbpedia.org/scores/
lx       http://spotlight.dbpedia.org/lexicalizations/
owl      http://www.w3.org/2002/07/owl#
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
skos     http://www.w3.org/2004/02/skos/core#
sptl     http://spotlight.dbpedia.org/vocab/
xsd      http://www.w3.org/2001/XMLSchema#

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

Table 17
DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. Journal of Web Semantics, 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Mérialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A tool for crowdsourcing the quality assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: Electronic Library and Information Systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X Niu X Sun H Wang S Rong G Qi and Y Yu ZhishimeWeaving chinese linking open data In 10th International Con-ference on The semantic web - Volume Part II pages 205ndash220Berlin Heidelberg 2011 Springer-Verlag

[39] E Oren R Delbru M Catasta R Cyganiak H Stenzhornand G Tummarello Sindicecom A document-oriented lookupindex for open linked data Int J of Metadata and Semanticsand Ontologies 337ndash52 Nov 10 2008

[40] S P Ponzetto and M Strube WikiTaxonomy A large scaleknowledge resource In M Ghallab C D SpyropoulosN Fakotakis and N M Avouris editors ECAI volume 178of Frontiers in Artificial Intelligence and Applications pages751ndash752 IOS Press 2008

[41] C Stadler J Lehmann K Hoffner and S Auer LinkedGeo-Data A Core for a Web of Spatial Open Data Semantic WebJournal 3(4)333ndash354 2012

[42] F M Suchanek G Kasneci and G Weikum YAGO A coreof semantic knowledge In C L Williamson M E ZurkoP F Patel-Schneider and P J Shenoy editors WWW pages697ndash706 ACM 2007

[43] E Tacchini A Schultz and C Bizer Experiments withWikipedia cross-language data fusion In Proceedings of the5th Workshop on Scripting and Development for the SemanticWeb ESWC Citeseer 2009

[44] C Unger L Buhmann J Lehmann A-C Ngonga NgomoD Gerber and P Cimiano Template-based Question Answer-ing over RDF Data In Proceedings of the 21st internationalconference on World Wide Web pages 639ndash648 2012

[45] Z Wang J Li Z Wang and J Tang Cross-lingual KnowledgeLinking Across Wiki Knowledge Bases In Proceedings ofthe 21st international conference on World Wide Web pages459ndash468 New York NY USA 2012 ACM

[46] F Wu and D Weld Automatically refining the WikipediaInfobox Ontology In Proceedings of the 17th World Wide WebConference 2008

[47] F Wu and D S Weld Autonomously semantifying WikipediaIn Proceedings of the 16th Conference on Information andKnowledge Management pages 41ndash50 ACM 2007

[48] A Zaveri D Kontokostas M A Sherif L BuhmannM Morsey S Auer and J Lehmann User-driven QualityEvaluation of DBpedia In To appear in Proceedings of 9thInternational Conference on Semantic Systems I-SEMANTICSrsquo13 Graz Austria September 4-6 2013 ACM 2013

[49] T Zesch C Muller and I Gurevych Extracting lexical seman-tic knowledge from Wikipedia and Wiktionary In Proceedingsof the 6th International Conference on Language Resourcesand Evaluation Marrakech Morocco May 2008 electronicproceedings

been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as the key phrase extraction and entity linking tasks – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [22].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [34] is an open-source tool (http://spotlight.dbpedia.org) including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
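For illustration, a minimal sketch of such a target specification: the class and properties below are actual DBpedia ontology terms, while the concrete restriction (soccer players born in Germany) is merely an assumed example of an annotation whitelist.

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # Hypothetical annotation whitelist: restrict annotation targets
    # to soccer players born in Germany (directly or via a city).
    SELECT DISTINCT ?resource WHERE {
      ?resource a dbo:SoccerPlayer .
      { ?resource dbo:birthPlace dbr:Germany . }
      UNION
      { ?resource dbo:birthPlace ?city .
        ?city dbo:country dbr:Germany . }
    }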

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI (http://www.alchemyapi.com), the Semantic API from Ontos (http://www.ontos.com), Open Calais (http://www.opencalais.com) and Zemanta (http://www.zemanta.com), among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project (http://stanbol.apache.org).

A related project is ImageSnippets (http://www.imagesnippets.com), a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki (http://www.faviki.com) suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC (http://bbc.co.uk) [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia (http://www.aaai.org/Magazine/Watson/watson.php). DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series (http://greententacle.techfak.uni-bielefeld.de/~cunger/qald). Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [28].
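For instance, a slot filler that has extracted a birth-place candidate from text could validate it against DBpedia with a boolean query; the identifiers below are real DBpedia terms, while their use as a validation step is our illustrative assumption:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # Does DBpedia confirm the extracted "birth place" slot value?
    ASK {
      dbr:Tim_Berners-Lee dbo:birthPlace dbr:London .
    }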

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide (see the query sketch after this list):

– Context information for bibliographic and archive records. Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.
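A minimal sketch of the first item, assuming a library record for Umberto Eco is to be enriched; the properties are actual DBpedia ontology and FOAF terms, while the choice of author and fields is only an example:

    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    # Context information to enrich a bibliographic record.
    SELECT ?birthDate ?birthPlace ?abstract ?depiction WHERE {
      dbr:Umberto_Eco dbo:birthDate  ?birthDate ;
                      dbo:birthPlace ?birthPlace ;
                      dbo:abstract   ?abstract .
      OPTIONAL { dbr:Umberto_Eco foaf:depiction ?depiction . }
      FILTER (lang(?abstract) = "en")
    }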

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project (http://viaf.org). Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia (http://www.oclc.org/research/news/2012/12-07a.html, accessed on 12.02.2013). These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of these tools and structure them in categories.

Facet Based Browsers. An award-winning facet-based browser (http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany) used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser (http://dbpedia.org/fct) is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.
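Underneath, such facet counts are plain SPARQL aggregations. A minimal sketch, assuming one wants the value distribution of a birth-place facet over all persons:

    PREFIX dbo: <http://dbpedia.org/ontology/>

    # A "birth place" facet: each facet value with its match count.
    SELECT ?birthPlace (COUNT(?person) AS ?count) WHERE {
      ?person a dbo:Person ;
              dbo:birthPlace ?birthPlace .
    }
    GROUP BY ?birthPlace
    ORDER BY DESC(?count)
    LIMIT 20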

Search and Querying. The DBpedia Query Builder (http://querybuilder.dbpedia.org) allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface which allows users to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows users to create statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.
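A minimal sketch of such a spatial filter, using the W3C geo vocabulary under which DBpedia publishes coordinates; the bounding box around central Berlin is an arbitrary example:

    PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # English-labelled resources within a bounding box around Berlin.
    SELECT ?place ?label WHERE {
      ?place geo:lat ?lat ;
             geo:long ?long ;
             rdfs:label ?label .
      FILTER (?lat > 52.45 && ?lat < 52.55 &&
              ?long > 13.35 && ?long < 13.45 &&
              lang(?label) = "en")
    }
    LIMIT 50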

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources, available in 171 languages (of which approximately 147 can be considered active; see https://s23.org/wikistats/wiktionaries_html.php). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words covering several domains (see http://en.wiktionary.org/wiki/semantic for a simple example page). Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences and, most prominently, senses.

Due to its fast-changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, being part of the Wiktionary2RDF [19] project, is as follows: for every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.

To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data (http://wiktionary.dbpedia.org). Statistics are shown in Table 15.

Table 15: Statistical comparison of extractions for different languages.

language   words       triples      resources    predicates   senses
en         2,142,237   28,593,364   11,804,039   28           424,386
fr         4,657,817   35,032,121   20,462,349   22           592,351
ru         1,080,156   12,813,437    5,994,560   17           149,859
de           701,739    5,618,508    2,966,867   16           122,362

8 Related Work

8.1 Cross Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, the Wikimedia Germany e.V. started the development of Wikidata (http://wikidata.org). Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface (http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface) as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase (http://www.freebase.com) is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows it to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects that pursues similar goals to DBpedia is YAGO (http://www.mpi-inf.mpg.de/yago-naga/yago) [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content. YAGO focuses on extracting a smaller number of relations compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.
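A small sketch of using both type systems side by side on the DBpedia endpoint; the YAGO class URIs live under http://dbpedia.org/class/yago/, and the concrete class name (with its WordNet synset suffix) is an assumed example:

    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX yago: <http://dbpedia.org/class/yago/>

    # Resources typed both by the DBpedia ontology and by the
    # WordNet-backed YAGO hierarchy exposed in DBpedia.
    SELECT ?person WHERE {
      ?person a dbo:Scientist ;           # DBpedia ontology class
              a yago:Physicist110428004 . # assumed YAGO class URI
    }
    LIMIT 10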

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects that are able to process wiki syntax (http://www.mediawiki.org/wiki/Alternative_parsers), as well as a list of data extraction extensions for MediaWiki (http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction).

JWPL (Java Wikipedia Library [49]) is an open-source Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY (http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby) for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [37]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service (http://sigwp.org/en/index.php/Wikipedia_Thesaurus). Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site (http://sigwp.org/en).

An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike (http://baike.baidu.com) and Hudong Baike (http://www.hudong.com). Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll (http://www.cs.washington.edu/research/knowitall) is a web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that, in the past four years, DBpedia matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.
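A sketch of such a cross-edition comparison as a federated query, assuming both the main endpoint and the German chapter endpoint (http://de.dbpedia.org/sparql) are available and model population with dbo:populationTotal:

    PREFIX dbo:    <http://dbpedia.org/ontology/>
    PREFIX dbr:    <http://dbpedia.org/resource/>
    PREFIX dbr-de: <http://de.dbpedia.org/resource/>

    # Diverging population values between the English and German
    # editions hint at extraction errors or outdated infoboxes.
    SELECT ?popEn ?popDe WHERE {
      dbr:Berlin dbo:populationTotal ?popEn .
      SERVICE <http://de.dbpedia.org/sparql> {
        dbr-de:Berlin dbo:populationTotal ?popDe .
      }
      FILTER (?popEn != ?popDe)
    }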

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we were making a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki (http://www.mediawiki.org/wiki/Extension:LightweightRDFa). If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
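A minimal sketch of such a sanity check; dbo:birthDate and dbo:deathDate are actual DBpedia ontology properties, and only the framing as a periodically executed check is our assumption:

    PREFIX dbo: <http://dbpedia.org/ontology/>

    # Sanity check: persons whose recorded death date precedes
    # their birth date, most likely a typo in the infobox.
    SELECT ?person ?birth ?death WHERE {
      ?person dbo:birthDate ?birth ;
              dbo:deathDate ?death .
      FILTER (?death < ?birth)
    }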

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16: List of namespace prefixes.

Prefix    Namespace
dbo       http://dbpedia.org/ontology/
dbp       http://dbpedia.org/property/
dbr       http://dbpedia.org/resource/
dbr-de    http://de.dbpedia.org/resource/
dc        http://purl.org/dc/elements/1.1/
foaf      http://xmlns.com/foaf/0.1/
geo       http://www.w3.org/2003/01/geo/wgs84_pos#
georss    http://www.georss.org/georss/
ls        http://spotlight.dbpedia.org/scores/
lx        http://spotlight.dbpedia.org/lexicalizations/
owl       http://www.w3.org/2002/07/owl#
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs      http://www.w3.org/2000/01/rdf-schema#
skos      http://www.w3.org/2004/02/skos/core#
sptl      http://spotlight.dbpedia.org/vocab/
xsd       http://www.w3.org/2001/XMLSchema#

Table 17: DBpedia timeline.

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. LodLive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Mérialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle, and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP: a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. International Journal of Metadata, Semantics and Ontologies, 3:37–52, 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven Quality Evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

View publication statsView publication stats

Page 23: DBpedia - A Large-scale, Multilingual Knowledge Base ...static.tongtianta.site/paper_pdf/38bd5a66-1521-11e9-bf90-00163e08bb… · DBpedia is served as Linked Data on the Web. Since

22 Lehmann et al DBpedia

systems Due to its large schema and data size as wellas its topic diversity it provides significant scientificchallenges In particular it would be difficult to providecapable QA systems for DBpedia based only on simplepatterns or via domain specific dictionaries because ofits size and broad coverage Therefore a question an-swering system which is able to reliable answer ques-tions over DBpedia correctly could be seen as a trulyintelligent system In the latest QALD series questionanswering benchmarks also exploit national DBpediachapters for multilingual question answering

Similarly the slot filling task in natural languageprocessing poses the challenge of finding values for agiven entity and property from mining text This canbe viewed as question answering with static questionsbut changing targets DBpedia can be exploited for factvalidation or training data in this task as was done bythe Watson team [4] and others [28]

72 Digital Libraries and Archives

In the case of libraries and archives DBpedia couldoffer a broad range of information on a broad range ofdomains In particular DBpedia could provide

ndash Context information for bibliographic and archiverecords Background information such as an au-thorrsquos demographics a filmrsquos homepage or an im-age could be used to enhance user interaction

ndash Stable and curated identifiers for linking DBpediais a hub of Linked Open Data Thus (re-)usingcommonly used identifiers could ease integrationwith other libraries or knowledge bases

ndash A basis for a thesaurus for subject indexing Thebroad range of Wikipedia topics in addition tothe stable URIs could form the basis for a globalclassification system

Libraries have already invested both in Linked Dataand Wikipedia (and transitively to DBpedia) thoughthe realization of the Virtual International AuthorityFiles (VIAF) project37 Recently it was announced thatVIAF added a total of 250000 reciprocal authoritylinks to Wikipedia38 These links are already harvestedby DBpedia Live and will also be included in the nextstatic DBpedia release This creates a huge opportunityfor libraries that use VIAF to get connected to DBpediaand the LOD cloud in general

37httpviaforg38Accessed on 12022013 httpwwwoclcorg

researchnews201212-07ahtml

73 Knowledge Exploration

Since DBpedia spans many domains and has a di-verse schema many knowledge exploration tools eitherused DBpedia as a testbed or were specifically builtfor DBpedia We give a brief overview of tools andstructure them in categories

Facet Based Browsers An award-winning39 facet-based browser used the Neofonie search engine to com-bine facts in DBpedia with full-text from Wikipedia inorder to compute hierarchical facets [15] Another facetbased browser which allows to create complex graphstructures of facets in a visually appealing interface andfilter them is gFacet [16] A generic SPARQL basedfacet explorer which also uses a graph based visualisa-tion of facets is LODLive [7] The OpenLink built-infacet based browser40 is an interface which enablesdevelopers to explore DBpedia compute aggregationsover facets and view the underlying SPARQL queries

Search and Querying The DBpedia Query Builder41

allows developers to easily create simple SPARQLqueries more specifically sets of triple patterns via in-telligent autocompletion The autocompletion function-ality ensures that only URIs which lead to solutions aresuggested to the user The RelFinder [17] tool providesan intuitive interface which allows to explore the neigh-borhood and connections between resources specifiedby the user For instance the user can view the short-est paths connecting certain persons in DBpedia Sem-Lens [18] allows to create statistical analysis queriesand correlations in RDF data and DBpedia in particular

Spatial Applications DBpedia Mobile [3] is a locationaware client which renders a map of nearby locationsfrom DBpedia provides icons for schema classes andsupports more than 30 languages from various DBpedialanguage editions It can follow RDF links to otherdata sets linked from DBpedia and supports powerfulSPARQL filters to restrict the viewed data

74 Applications of the Extraction FrameworkWiktionary Extraction

Wiktionary is one of the biggest collaboratively cre-ated lexical-semantic and linguistic resources avail-

39httpblogdbpediaorg20091120german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany

40httpdbpediaorgfct41httpquerybuilderdbpediaorg

Lehmann et al DBpedia 23

able in 171 languages (of which approximately 147 canbe considered active42) It contains information abouthundreds of spoken and even ancient languages In thecase of the English Wiktionary there are nearly 3 mil-lion detailed descriptions of words covering several do-mains43 Such descriptions provide for a lexical worda hierarchical disambiguation to its language part ofspeech sometimes etymologies synonyms hyponymshyperonyms example sentences and most prominentlysenses

Due to its fast changing nature together with thefragmentation of the project into Wiktionary languageeditions (WLE) with independent layout rules a config-urable mediatorwrapper approach is taken for its auto-mated transformation into a structured knowledge baseThe workflow of this dedicated Wiktionary extractorbeing part of the Wiktionary2RDF [19] project is asfollows For every WLE to be transformed an XMLconfiguration file is provided as input This configura-tion is used by the Wiktionary extractor invoked bythe DBpedia extraction framework to first generate aschema reflecting the configured page structure (wrap-per part) After this these language specific schemasare converted to a global schema (mediator part) andlater serialized to RDF

To enable non-programmers (the community ofadopters and domain experts) to tailor and maintainthe WLE wrappers themselves a simple XML dialectwas created to encode the page structure to be parsedand declare triple patterns that define how the resultingRDF should be built The described setup is run againstWiktionary dumps The resulting data set is open inevery aspect and hosted as Linked Data44 Statistics areshown in Table 15

8 Related Work

81 Cross Domain Community Knowledge Bases

811 WikidataIn March 2012 the Wikimedia Germany eV started

the development of Wikidata45 Wikidata is a freeknowledge base about the world that can be read and

42https23orgwikistatswiktionaries_htmlphp

43See httpenwiktionaryorgwikisemanticfor a simple example page

44httpwiktionarydbpediaorg45httpwikidataorg

edited by humans and machines alike It provides datain all languages of the Wikimedia projects and allowsfor central access to the data in a similar vein as Wiki-media Commons does for multimedia files Things de-scribed in the Wikidata knowledge base are called itemsand can have labels descriptions and aliases in all lan-guages Wikidata does not aim at offering a single truthabout things but providing statements given in a par-ticular context Rather than stating that Berlin has apopulation of 35 million Wikidata contains the state-ment about Berlinrsquos population being 35 million as of2011 according to the German statistical office ThusWikidata can offer a variety of statements from differ-ent sources and dates As there are potentially manydifferent statements for a given item and property rankscan be added to statements to define their status (pre-ferred normal or deprecated) The initial developmentwas divided in three phases

ndash The first phase (interwiki links) created an entitybase for the Wikimedia projects This providesa better alternative to the previous interlanguagelink system

ndash The second phase (infoboxes) gathered infobox-related data for a subset of the entities with theexplicit goal of augmenting the infoboxes that arecurrently widely used with data from Wikidata

ndash The third phase (lists) will expand the set of prop-erties beyond those related to infoboxes and willprovide ways of exploiting this data within andoutside the Wikimedia projects

At the time of writing of this article the developmentof the third phase is ongoing

Wikidata already contains 1195 million items and348 properties that can be used to describe them SinceMarch 2013 the Wikidata extension is live on allWikipedia language editions and thus their pages canbe linked to items in Wikidata and include data fromWikidata

Wikidata also offers a Linked Data interface46 aswell as regular RDF dumps of all its data The plannedcollaboration with Wikidata is outlined in Section 9

812 FreebaseFreebase47 is a graph database which also extracts

structured data from Wikipedia and makes it availablein RDF Both DBpedia and Freebase link to each other

46httpmetawikimediaorgwikiWikidataDevelopmentLinkedDataInterface

47httpwwwfreebasecom

24 Lehmann et al DBpedia

Table 15Statistical comparison of extractions for different languages

language words triples resources predicates senses

en 2142237 28593364 11804039 28 424386fr 4657817 35032121 20462349 22 592351ru 1080156 12813437 5994560 17 149859de 701739 5618508 2966867 16 122362

and provide identifiers based on those for Wikipediaarticles They both provide dumps of the extracted dataas well as APIs or endpoints to access the data andallow their communities to influence the schema of thedata There are however also major differences be-tween both projects DBpedia focuses on being an RDFrepresentation of Wikipedia and serving as a hub on theWeb of Data whereas Freebase uses several sources toprovide broad coverage The store behind Freebase isthe GraphD [35] graph database which allows to effi-ciently store metadata for each fact This graph store isappend-only Deleted triples are marked and the systemcan easily revert to a previous version This is neces-sary since Freebase data can be directly edited by userswhereas information in DBpedia can only indirectly beedited by modifying the content of Wikipedia or theMappings Wiki From an organisational point of viewFreebase is mainly run by Google whereas DBpedia isan open community project In particular in focus areasof Google and areas in which Freebase includes otherdata sources the Freebase database provides a highercoverage than DBpedia

813 YAGOOne of the projects that pursues similar goals

to DBpedia is YAGO48 [42] YAGO is identical toDBpedia in that each article in Wikipedia becomes anentity in YAGO Based on this it uses the leaf cate-gories in the Wikipedia category graph to infer typeinformation about an entity One of its key features is tolink this type information to WordNet WordNet synsetsare represented as classes and the extracted types ofentities may become subclasses of such a synset In theYAGO2 system [21] declarative extraction rules wereintroduced which can extract facts from different partsof Wikipedia articles eg infoboxes and categories aswell as other sources YAGO2 also supports spatial andtemporal dimensions for facts at the core of its system

One of the main differences between DBpedia andYAGO in general is that DBpedia tries to stay veryclose to Wikipedia and provide an RDF version of its

48httpwwwmpi-infmpgdeyago-nagayago

content YAGO focuses on extracting a smaller numberof relations compared to DBpedia to achieve very highprecision and consistent knowledge The two knowl-edge bases offer different type systems whereas theDBpedia ontology is manually maintained YAGO isbacked by WordNet and Wikipedia leaf categoriesDue to this YAGO contains many more classes thanDBpedia Another difference is that the integration ofattributes and objects in infoboxes is done via mappingsin DBpedia and therefore by the DBpedia communityitself whereas this task is facilitated by expert-designeddeclarative rules in YAGO2

The two knowledge bases are connected eg DBpediaoffers the YAGO type hierarchy as an alternative tothe DBpedia ontology and sameAs links are providedin both directions While the underlying systems arevery different both projects share similar aims andpositively complement and influence each other

82 Knowledge Extraction from Wikipedia

Since its official start in 2001 Wikipedia has alwaysbeen the target of automatic extraction of informationdue to its easy availability open license and encyclo-pedic knowledge A large number of parsers scraperprojects and publications exist In this section we re-strict ourselves to approaches that are either notable re-cent or pertinent to DBpedia MediaWikiorg maintainsan up-to-date list of software projects49 who are able toprocess wiki syntax as well as a list of data extractionextensions50 for MediaWiki

JWPL (Java Wikipedia Library [49]) is an open-source Java-based API that allows to access informa-tion provided by the Wikipedia API (redirects cate-gories articles and link structure) JWPL contains aMediaWiki Markup parser that can be used to furtheranalyze the contents of a Wikipedia page Data is also

49httpwwwmediawikiorgwikiAlternative_parsers

50httpwwwmediawikiorgwikiExtension_Matrixdata_extraction

Lehmann et al DBpedia 25

provided as XML dump and is incorporated in the lexi-cal resource UBY51 for language tools

Several different approaches to extract knowledgefrom Wikipedia are presented in [37] Given featureslike anchor texts interlanguage links category linksand redirect pages are utilized eg for word-sense dis-ambiguations or synonyms translations taxonomic re-lations and abbreviation or hypernym resolution re-spectively Apart from this link structures are used tobuild the Wikipedia Thesaurus Web service52 Addi-tional projects that exploit the mentioned features arelisted on the Special Interest Group on Wikipedia Min-ing (SIGWP) Web site53

An earlier approach to improve the quality of theinfobox schemata and contents is described in [47]The presented methodology encompasses a three stepprocess of preprocessing classification and extractionDuring preprocessing refined target infobox schemataare created applying statistical methods and trainingsets are extracted based on real Wikipedia data Afterassigning a class and the corresponding target schema(classification) the training sets are used to extract tar-get infobox values from the documentrsquos text applyingmachine learning algorithms

The idea of using structured data from certainmarkup structures was also applied to other user-drivenWeb encyclopedias In [38] the authors describe their ef-fort building an integrated Chinese Linking Open Data(CLOD) source based on the Chinese Wikipedia andthe two widely used and large encyclopedias BaiduBaike54 and Hudong Baike55 Apart from utilizing Me-diaWiki and HTML Markup for the actual extractionthe Wikipedia interlanguage links were used to link theCLOD source to the English DBpedia

A more generic approach to achieve better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [45]. Focusing on wiki knowledge bases, the authors introduce a solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby

52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus

53 http://sigwp.org/en
54 http://baike.baidu.com
55 http://www.hudong.com

KnowItAll56 is a Web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common-sense knowledge base which is now partially released as OpenCyc and is also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

9. Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that in the past four years DBpedia has matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.
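
Cross-edition comparison of this kind can be prototyped with SPARQL 1.1 federation. The following query is a sketch only: it assumes the German chapter endpoint at http://de.dbpedia.org/sparql, assumes that both editions map the respective infobox attribute to dbo:populationTotal, and uses Berlin as an arbitrary example.

    PREFIX dbo:    <http://dbpedia.org/ontology/>
    PREFIX dbr:    <http://dbpedia.org/resource/>
    PREFIX dbr-de: <http://de.dbpedia.org/resource/>

    # Compare the population of Berlin in the English and German editions
    SELECT ?popEn ?popDe
    WHERE {
      dbr:Berlin dbo:populationTotal ?popEn .
      SERVICE <http://de.dbpedia.org/sparql> {
        dbr-de:Berlin dbo:populationTotal ?popDe .
      }
      FILTER(?popEn != ?popDe)   # report only disagreements between editions
    }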

56 http://www.cs.washington.edu/research/knowitall


Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki57. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
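
The birthday check mentioned above could be sketched roughly as follows, assuming the dbo:birthDate and dbo:deathDate properties of the DBpedia ontology (scheduling of the check and handling of the results are left out):

    PREFIX dbo: <http://dbpedia.org/ontology/>

    # Persons whose recorded death date lies before their birth date,
    # which almost certainly indicates a typo in the Wikipedia source
    SELECT ?person ?birth ?death
    WHERE {
      ?person dbo:birthDate ?birth ;
              dbo:deathDate ?death .
      FILTER(?death < ?birth)
    }
    LIMIT 100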

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to which entity a certain surface form should be mapped.

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16
List of namespace prefixes

Prefix    Namespace
dbo       http://dbpedia.org/ontology/
dbp       http://dbpedia.org/property/
dbr       http://dbpedia.org/resource/
dbr-de    http://de.dbpedia.org/resource/
dc        http://purl.org/dc/elements/1.1/
foaf      http://xmlns.com/foaf/0.1/
geo       http://www.w3.org/2003/01/geo/wgs84_pos#
georss    http://www.georss.org/georss/
ls        http://spotlight.dbpedia.org/scores/
lx        http://spotlight.dbpedia.org/lexicalizations/
owl       http://www.w3.org/2002/07/owl#
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs      http://www.w3.org/2000/01/rdf-schema#
skos      http://www.w3.org/2004/02/skos/core#
sptl      http://spotlight.dbpedia.org/vocab/
xsd       http://www.w3.org/2001/XMLSchema#

Table 17
DBpedia timeline

Year   Month   Event
2006   Nov     Start of infobox extraction from Wikipedia
2007   Mar     DBpedia 1.0 release
       Jun     ESWC DBpedia article [2]
       Nov     DBpedia 2.0 release
       Nov     ISWC DBpedia article [1]
       Dec     DBpedia 3.0 release candidate
2008   Feb     DBpedia 3.0 release
       Aug     DBpedia 3.1 release
       Nov     DBpedia 3.2 release
2009   Jan     DBpedia 3.3 release
       Sep     JWS DBpedia article [5]
2010   Feb     Information extraction framework in Scala; Mappings Wiki release
       Mar     DBpedia 3.4 release
       Apr     DBpedia 3.5 release
       May     Start of DBpedia Internationalization effort
2011   Feb     DBpedia Spotlight release
       Mar     DBpedia 3.6 release
       Jul     DBpedia Live release
       Sep     DBpedia 3.7 release (with I18n datasets)
2012   Aug     DBpedia 3.8 release
       Sep     Publication of DBpedia Internationalization article [24]
2013   Sep     DBpedia 3.9 release

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.


[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.
[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. Journal of Web Semantics, 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.
[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.
[13] A. García-Silva, M. Jakob, P. N. Mendes and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[14] A. García-Silva, M. Szomszor, H. Alani and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[16] P. Heim, T. Ertl and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Mérialdo, W. Haas, G. Thallinger and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[18] P. Heim, S. Lohmann, D. Tsendragchaa and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[19] S. Hellmann, J. Brekle and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.
[20] S. Hellmann, C. Stadler, J. Lehmann and S. Auer. DBpedia Live Extraction. In Proc. of the 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[21] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.
[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.
[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15:51–61, 2012.
[25] D. Kontokostas, A. Zaveri, S. Auer and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.
[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.
[27] C. Lagoze, H. V. de Sompel, M. Nelson and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.
[31] V. Lopez, M. Fernández, E. Motta and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.
[32] M. Martin, C. Stadler, P. Frischmuth and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[33] P. N. Mendes, M. Jakob and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[34] P. N. Mendes, M. Jakob, A. García-Silva and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[35] S. M. Meyer, J. Degener, J. Giannandrea and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: Electronic Library and Information Systems, 46:27, 2012.
[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.
[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. International Journal of Metadata, Semantics and Ontologies, 3:37–52, 2008.
[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[41] C. Stadler, J. Lehmann, K. Höffner and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.
[42] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[43] E. Tacchini, A. Schultz and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648, 2012.
[45] Z. Wang, J. Li, Z. Wang and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer and J. Lehmann. User-driven Quality Evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[49] T. Zesch, C. Müller and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

View publication statsView publication stats

Page 24: DBpedia - A Large-scale, Multilingual Knowledge Base ...static.tongtianta.site/paper_pdf/38bd5a66-1521-11e9-bf90-00163e08bb… · DBpedia is served as Linked Data on the Web. Since

Lehmann et al DBpedia 23

able in 171 languages (of which approximately 147 canbe considered active42) It contains information abouthundreds of spoken and even ancient languages In thecase of the English Wiktionary there are nearly 3 mil-lion detailed descriptions of words covering several do-mains43 Such descriptions provide for a lexical worda hierarchical disambiguation to its language part ofspeech sometimes etymologies synonyms hyponymshyperonyms example sentences and most prominentlysenses

Due to its fast changing nature together with thefragmentation of the project into Wiktionary languageeditions (WLE) with independent layout rules a config-urable mediatorwrapper approach is taken for its auto-mated transformation into a structured knowledge baseThe workflow of this dedicated Wiktionary extractorbeing part of the Wiktionary2RDF [19] project is asfollows For every WLE to be transformed an XMLconfiguration file is provided as input This configura-tion is used by the Wiktionary extractor invoked bythe DBpedia extraction framework to first generate aschema reflecting the configured page structure (wrap-per part) After this these language specific schemasare converted to a global schema (mediator part) andlater serialized to RDF

To enable non-programmers (the community ofadopters and domain experts) to tailor and maintainthe WLE wrappers themselves a simple XML dialectwas created to encode the page structure to be parsedand declare triple patterns that define how the resultingRDF should be built The described setup is run againstWiktionary dumps The resulting data set is open inevery aspect and hosted as Linked Data44 Statistics areshown in Table 15

8 Related Work

81 Cross Domain Community Knowledge Bases

811 WikidataIn March 2012 the Wikimedia Germany eV started

the development of Wikidata45 Wikidata is a freeknowledge base about the world that can be read and

42https23orgwikistatswiktionaries_htmlphp

43See httpenwiktionaryorgwikisemanticfor a simple example page

44httpwiktionarydbpediaorg45httpwikidataorg

edited by humans and machines alike It provides datain all languages of the Wikimedia projects and allowsfor central access to the data in a similar vein as Wiki-media Commons does for multimedia files Things de-scribed in the Wikidata knowledge base are called itemsand can have labels descriptions and aliases in all lan-guages Wikidata does not aim at offering a single truthabout things but providing statements given in a par-ticular context Rather than stating that Berlin has apopulation of 35 million Wikidata contains the state-ment about Berlinrsquos population being 35 million as of2011 according to the German statistical office ThusWikidata can offer a variety of statements from differ-ent sources and dates As there are potentially manydifferent statements for a given item and property rankscan be added to statements to define their status (pre-ferred normal or deprecated) The initial developmentwas divided in three phases

ndash The first phase (interwiki links) created an entitybase for the Wikimedia projects This providesa better alternative to the previous interlanguagelink system

ndash The second phase (infoboxes) gathered infobox-related data for a subset of the entities with theexplicit goal of augmenting the infoboxes that arecurrently widely used with data from Wikidata

ndash The third phase (lists) will expand the set of prop-erties beyond those related to infoboxes and willprovide ways of exploiting this data within andoutside the Wikimedia projects

At the time of writing of this article the developmentof the third phase is ongoing

Wikidata already contains 1195 million items and348 properties that can be used to describe them SinceMarch 2013 the Wikidata extension is live on allWikipedia language editions and thus their pages canbe linked to items in Wikidata and include data fromWikidata

Wikidata also offers a Linked Data interface46 aswell as regular RDF dumps of all its data The plannedcollaboration with Wikidata is outlined in Section 9

812 FreebaseFreebase47 is a graph database which also extracts

structured data from Wikipedia and makes it availablein RDF Both DBpedia and Freebase link to each other

46httpmetawikimediaorgwikiWikidataDevelopmentLinkedDataInterface

47httpwwwfreebasecom

24 Lehmann et al DBpedia

Table 15Statistical comparison of extractions for different languages

language words triples resources predicates senses

en 2142237 28593364 11804039 28 424386fr 4657817 35032121 20462349 22 592351ru 1080156 12813437 5994560 17 149859de 701739 5618508 2966867 16 122362

and provide identifiers based on those for Wikipediaarticles They both provide dumps of the extracted dataas well as APIs or endpoints to access the data andallow their communities to influence the schema of thedata There are however also major differences be-tween both projects DBpedia focuses on being an RDFrepresentation of Wikipedia and serving as a hub on theWeb of Data whereas Freebase uses several sources toprovide broad coverage The store behind Freebase isthe GraphD [35] graph database which allows to effi-ciently store metadata for each fact This graph store isappend-only Deleted triples are marked and the systemcan easily revert to a previous version This is neces-sary since Freebase data can be directly edited by userswhereas information in DBpedia can only indirectly beedited by modifying the content of Wikipedia or theMappings Wiki From an organisational point of viewFreebase is mainly run by Google whereas DBpedia isan open community project In particular in focus areasof Google and areas in which Freebase includes otherdata sources the Freebase database provides a highercoverage than DBpedia

813 YAGOOne of the projects that pursues similar goals

to DBpedia is YAGO48 [42] YAGO is identical toDBpedia in that each article in Wikipedia becomes anentity in YAGO Based on this it uses the leaf cate-gories in the Wikipedia category graph to infer typeinformation about an entity One of its key features is tolink this type information to WordNet WordNet synsetsare represented as classes and the extracted types ofentities may become subclasses of such a synset In theYAGO2 system [21] declarative extraction rules wereintroduced which can extract facts from different partsof Wikipedia articles eg infoboxes and categories aswell as other sources YAGO2 also supports spatial andtemporal dimensions for facts at the core of its system

One of the main differences between DBpedia andYAGO in general is that DBpedia tries to stay veryclose to Wikipedia and provide an RDF version of its

48httpwwwmpi-infmpgdeyago-nagayago

content YAGO focuses on extracting a smaller numberof relations compared to DBpedia to achieve very highprecision and consistent knowledge The two knowl-edge bases offer different type systems whereas theDBpedia ontology is manually maintained YAGO isbacked by WordNet and Wikipedia leaf categoriesDue to this YAGO contains many more classes thanDBpedia Another difference is that the integration ofattributes and objects in infoboxes is done via mappingsin DBpedia and therefore by the DBpedia communityitself whereas this task is facilitated by expert-designeddeclarative rules in YAGO2

The two knowledge bases are connected eg DBpediaoffers the YAGO type hierarchy as an alternative tothe DBpedia ontology and sameAs links are providedin both directions While the underlying systems arevery different both projects share similar aims andpositively complement and influence each other

82 Knowledge Extraction from Wikipedia

Since its official start in 2001 Wikipedia has alwaysbeen the target of automatic extraction of informationdue to its easy availability open license and encyclo-pedic knowledge A large number of parsers scraperprojects and publications exist In this section we re-strict ourselves to approaches that are either notable re-cent or pertinent to DBpedia MediaWikiorg maintainsan up-to-date list of software projects49 who are able toprocess wiki syntax as well as a list of data extractionextensions50 for MediaWiki

JWPL (Java Wikipedia Library [49]) is an open-source Java-based API that allows to access informa-tion provided by the Wikipedia API (redirects cate-gories articles and link structure) JWPL contains aMediaWiki Markup parser that can be used to furtheranalyze the contents of a Wikipedia page Data is also

49httpwwwmediawikiorgwikiAlternative_parsers

50httpwwwmediawikiorgwikiExtension_Matrixdata_extraction

Lehmann et al DBpedia 25

provided as XML dump and is incorporated in the lexi-cal resource UBY51 for language tools

Several different approaches to extract knowledgefrom Wikipedia are presented in [37] Given featureslike anchor texts interlanguage links category linksand redirect pages are utilized eg for word-sense dis-ambiguations or synonyms translations taxonomic re-lations and abbreviation or hypernym resolution re-spectively Apart from this link structures are used tobuild the Wikipedia Thesaurus Web service52 Addi-tional projects that exploit the mentioned features arelisted on the Special Interest Group on Wikipedia Min-ing (SIGWP) Web site53

An earlier approach to improve the quality of theinfobox schemata and contents is described in [47]The presented methodology encompasses a three stepprocess of preprocessing classification and extractionDuring preprocessing refined target infobox schemataare created applying statistical methods and trainingsets are extracted based on real Wikipedia data Afterassigning a class and the corresponding target schema(classification) the training sets are used to extract tar-get infobox values from the documentrsquos text applyingmachine learning algorithms

The idea of using structured data from certainmarkup structures was also applied to other user-drivenWeb encyclopedias In [38] the authors describe their ef-fort building an integrated Chinese Linking Open Data(CLOD) source based on the Chinese Wikipedia andthe two widely used and large encyclopedias BaiduBaike54 and Hudong Baike55 Apart from utilizing Me-diaWiki and HTML Markup for the actual extractionthe Wikipedia interlanguage links were used to link theCLOD source to the English DBpedia

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipediainterlanguage links is presented in [45] Focusing onwiki knowledge bases the authors introduce their solu-tion based on structural properties like similar linkagestructures the assignment to similar categories and sim-ilar interests of the authors of wiki documents in theconsidered languages Since this approach is language-feature-agnostic it is not restricted to certain languages

51httpwwwukptu-darmstadtdedatalexical-resourcesuby

52httpsigwporgenindexphpWikipedia_Thesaurus

53httpsigwporgen54httpbaikebaiducom55httpwwwhudongcom

KnowItAll56 is a web scale knowledge extractioneffort which is domain-independent and uses genericextraction rules co-occurrence statistics and NaiveBayes classification [11] Cyc [29] is a large com-mon sense knowledge base which is now partiallyreleased as OpenCyc and also available as an OWLontology OpenCyc is linked to DBpedia which pro-vides an ontological embedding in its comprehensivestructures WikiTaxonomy [40] is a large taxonomy de-rived from categories in Wikipedia by classifying cate-gories as instances or classes and deriving a subsump-tion hierarchy The KOG system [46] refines existingWikipedia infoboxes based on machine learning tech-niques using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Net-works KYLIN [47] is a system which autonomouslyextracts structured data from Wikipedia and uses self-supervised linking

9 Conclusions and Future Work

In this system report we presented an overview onrecent advances of the DBpedia community projectThe technical innovations described in this article in-cluded in particular (1) the extraction based on thecommunity-curated DBpedia ontology (2) the live syn-chronisation of DBpedia with Wikipedia and DBpediamirrors through update propagation and (3) the facilita-tion of the internationalisation of DBpedia As a resultwe demonstrated that in the past four years DBpediamatured and improved significantly in terms of cover-age usability and data quality

With DBpedia we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced con-tent repositories There are a large number of furthercrowd-sourced content repositories and DBpedia al-ready had an impact on their structured data publishingand interlinking Two examples are Wiktionary withthe Wiktionary extraction [19] meanwhile becomingpart of DBpedia and LinkedGeoData [41] which aimsto implement similar data extraction publishing andlinking strategies for OpenStreetMaps

In the future we see in particular the following di-rections for advancing the DBpedia project

Multilingual data integration and fusion An areawhich is still largely unexplored is the integration and

56httpwwwcswashingtoneduresearchknowitall

26 Lehmann et al DBpedia

fusion between different DBpedia language editionsNon-English DBpedia editions comprise a better anddifferent coverage of local culture When we are able toprecisely identify equivalent overlapping and comple-mentary parts in different DBpedia language editionswe can reach significantly increased coverage On theother hand comparing the values of a specific prop-erty between different language editions will help usto spot extraction errors as well as wrong or outdatedinformation in Wikipedia

Community-driven data quality improvement In thefuture we also aim to engage a larger community ofDBpedia users in feedback loops which help us toidentify data quality problems and corresponding defi-ciencies of the DBpedia extraction framework By con-stantly monitoring the data quality and integrating im-provements into the mappings to the DBpedia ontologyas well as fixes into the extraction framework we aim todemonstrate that the Wikipedia community is not onlycapable of creating the largest encyclopedia but alsothe most comprehensive and structured knowledge baseWith the DBpedia quality evaluation campaign [48] wewere making a first step in this direction

Inline extraction Currently DBpedia extracts infor-mation primarily from templates In the future weenvision to also extract semantic information fromtyped links Typed links is a feature of Semantic Me-diaWiki which was backported and implemented asa very lightweight extension for MediaWiki57 If thisextension is deployed at Wikipedia installations thisopens up completely new possibilities for more fine-grained and non-invasive knowledge representationsand extraction from Wikipedia

Collaboration between Wikidata and DBpediaWhile DBpedia provides a comprehensive and currentview on entity descriptions extracted from WikipediaWikidata offers a variety of factual statements fromdifferent sources and dates One of the richest sourcesof DBpedia are Wikipedia infoboxes which are struc-tured but at the same time heterogeneous and non-standardized (thus making the extraction error prone incertain cases) The aim of Wikidata is to populate in-foboxes automatically from a centrally managed high-quality fact database In this regard both projects com-plement each other and there are several ongoing col-laboration activities In future versions DBpedia willinclude more raw data provided by Wikidata and add

57httpwwwmediawikiorgwikiExtensionLightweightRDFa

services such as Linked DataSPARQL endpoints RDFdumps linking and ontology mapping for Wikidata

Feedback for Wikipedia A promising prospect is thatDBpedia can help to identify misrepresentations errorsand inconsistencies in Wikipedia In the future we planto provide more feedback to the Wikipedia communityabout the quality of Wikipedia This can for instancebe achieved in the form of sanity checks which areimplemented as SPARQL queries on the DBpedia Liveendpoint which identify data quality issues and are ex-ecuted in certain intervals For example a query couldcheck that the birthday of a person must always be be-fore the death day or spot outliers that differ signifi-cantly from the range of the majority of the other val-ues In case a Wikipedia editor makes a mistake or typowhen adding such information to a page this could beautomatically identified and provided as feedback toWikipedians

Integrate DBpedia and NLP Recent advances(cf Section 7) show huge potential for employingLinked Data background knowledge in various NaturalLanguage Processing (NLP) tasks One very promisingresearch avenue in this regard is to employ DBpediaas structured background knowledge for named en-tity recognition and disambiguation Currently mostapproaches use statistical information such as co-occurrence for named entity disambiguation Howeverco-occurrence is not always easy to determine (dependson training data) and update (requires recomputation)With DBpedia and in particular DBpedia Live we havecomprehensive and evolving background knowledgecomprising information on the relationship between alarge number of real-world entities Consequently wecan employ this information for deciding to what entitya certain surface form should be mapped

Acknowledgment

We would like to thank and acknowledge the supportof the following people and organisations to DBpedia

ndash all Wikipedia contributorsndash all DBpedia Mappings Wiki contributorsndash OpenLink Software for providing and maintaining

the server infrastructure for the main DBpediaendpoint

ndash Kingsley Idehen for SPARQL and Linked Datahosting and community support

ndash Christopher Sahnwaldt for DBpedia developmentand release management

ndash Claus Stadler for DBpedia development

Lehmann et al DBpedia 27

ndash Paul Kreis for DBpedia developmentndash people who helped contributing data for certain

parts of the article

lowast Instance Data Analysis Volha Bryl (working atUniversity of Mannheim)

lowast Sindice analysis Stphane Campinas SzymonDanielczyk and Gabriela Vulcu (working atDERI)

lowast Freebase Shawn Simister and Tom Morris(working at Freebase)

This work was supported by grants from the Euro-pean Unionrsquos 7th Framework Programme provided forthe projects LOD2 (GA no 257943) GeoKnow (GAno 318159) and Dicode (GA no 257184)

Appendix

Table 16List of namespace prefixes

Prefix Namespace

dbo httpdbpediaorgontologydbp httpdbpediaorgpropertydbr httpdbpediaorgresourcedbr-de httpdedbpediaorgresourcedc httppurlorgdcelements11foaf httpxmlnscomfoaf01geo httpwwww3org200301geowgs84 posgeorss httpwwwgeorssorggeorssls httpspotlightdbpediaorgscoreslx httpspotlightdbpediaorglexicalizationsowl httpwwww3org200207owlrdf httpwwww3org19990222-rdf-syntax-nsrdfs httpwwww3org200001rdf-schemaskos httpwwww3org200402skoscoresptl httpspotlightdbpediaorgvocabxsd httpwwww3org2001XMLSchema

References

[1] S Auer C Bizer G Kobilarov J Lehmann R Cyganiak andZ Ives DBpedia A Nucleus for a Web of Open Data InProceedings of the 6th International Semantic Web Conference(ISWC) volume 4825 of Lecture Notes in Computer Sciencepages 722ndash735 Springer 2008

Table 17DBpedia timeline

Year Month Event

2006 Nov Start of infobox extraction from Wikipedia2007 Mar DBpedia 10 release

Jun ESWC DBpedia article [2]Nov DBpedia 20 releaseNov ISWC DBpedia article [1]Dec DBpedia 30 release candidate

2008 Feb DBpedia 30 releaseAug DBpedia 31 releaseNov DBpedia 32 release

2009 Jan DBpedia 33 releaseSep JWS DBpedia article [5]

2010 Feb Information extraction framework in ScalaMappings Wiki release

Mar DBpedia 34 releaseApr DBpedia 35 releaseMay Start of DBpedia Internationalization effort

2011 Feb DBpedia Spotlight releaseMar DBpedia 36 releaseJul DBpedia Live releaseSep DBpedia 37 release (with I18n datasets)

2012 Aug DBpedia 38 releaseSep Publication of DBpedia Internationalization

article[24]2013 Sep DBpedia 39 release

[2] S Auer and J Lehmann What have Innsbruck and Leipzigin common extracting semantics from wiki content In Pro-ceedings of the ESWC (2007) volume 4519 of Lecture Notes inComputer Science pages 503ndash517 Berlin Heidelberg 2007Springer

[3] C Becker and C Bizer Exploring the geospatial Semantic Webwith DBpedia Mobile J Web Sem 7(4)278ndash286 2009

[4] D Bikel V Castelli R Florian and D-J Han Entity linkingand slot filling through statistical processing and inference rulesIn Proceedings TAC Workshop 2009

[5] C Bizer J Lehmann G Kobilarov S Auer C Becker R Cy-ganiak and S Hellmann DBpedia - a crystallization point forthe Web of Data Journal of Web Semantics 7(3)154ndash1652009

[6] E Cabrio J Cojan A P Aprosio B Magnini A Lavelli andF Gandon QAKiS an open domain QA system based onrelational patterns In ISWC-PD International Semantic WebConference (Posters amp Demos) volume 914 of CEUR WorkshopProceedings CEUR-WSorg 2012

[7] D V Camarda S Mazzini and A Antonuccio Lodlive ex-ploring the Web of Data In V Presutti and H S Pinto editorsI-SEMANTICS 2012 - 8th International Conference on Seman-tic Systems I-SEMANTICS rsquo12 Graz Austria September 5-72012 pages 197ndash200 ACM 2012

[8] S Campinas T E Perry D Ceccarelli R Delbru and G Tum-marello Introducing RDF graph summary with application toassisted SPARQL formulation In Database and Expert Systems

28 Lehmann et al DBpedia

Applications (DEXA) 2012 23rd International Workshop onpages 261ndash266 IEEE 2012

[9] D Damljanovic M Agatonovic and H Cunningham FREyAAn interactive way of querying linked data using natural lan-guage In The Semantic Web ESWC 2011 Workshops volume7117 pages 125ndash138 Springer 2012

[10] O Erling and I Mikhailov RDF support in the Virtuoso DBMSIn S Auer C Bizer C Muller and A V Zhdanova editorsCSSW volume 113 of LNI pages 59ndash68 GI 2007

[11] O Etzioni M Cafarella D Downey S Kok A-M PopescuT Shaked S Soderland D S Weld and A Yates Web-scaleinformation extraction in KnowitAll In Proceedings of the 13thinternational conference on World Wide Web pages 100ndash110ACM 2004

[12] D Ferrucci E Brown J Chu-Carroll J Fan D Gondek A AKalyanpur A Lally J W Murdock E Nyberg J Prager et alBuilding Watson An overview of the DeepQA project AImagazine 31(3)59ndash79 2010

[13] A Garcıa-Silva M Jakob P N Mendes and C Bizer Mul-tipedia Enriching DBpedia with multimedia information InProceedings of the sixth international conference on Knowledgecapture K-CAP rsquo11 pages 137ndash144 New York NY USA2011 ACM

[14] A Garcıa-Silva M Szomszor H Alani and O Corcho Pre-liminary results in tag disambiguation using DBpedia In 1stInternational Workshop in Collective Knowledage Capturingand Representation (CKCaR) California USA 2009

[15] R Hahn C Bizer C Sahnwaldt C Herta S RobinsonM Burgle H Duwiger and U Scheel Faceted WikipediaSearch In W Abramowicz and R Tolksdorf editors BusinessInformation Systems 13th International Conference BIS 2010Berlin Germany May 3-5 2010 Proceedings volume 47 ofLecture Notes in Business Information Processing pages 1ndash11Springer 2010

[16] P Heim T Ertl and J Ziegler Facet Graphs Complex seman-tic querying made easy In Proceedings of the 7th ExtendedSemantic Web Conference (ESWC 2010) volume 6088 of LNCSpages 288ndash302 BerlinHeidelberg 2010 Springer

[17] P Heim S Hellmann J Lehmann S Lohmann and T Stege-mann RelFinder Revealing relationships in RDF knowledgebases In T-S Chua Y Kompatsiaris B Merialdo W HaasG Thallinger and W Bailer editors Semantic Multimedia 4thInternational Conference on Semantic and Digital Media Tech-nologies SAMT 2009 Graz Austria December 2-4 2009 Pro-ceedings volume 5887 of Lecture Notes in Computer Sciencepages 182ndash187 Springer 2009

[18] P Heim S Lohmann D Tsendragchaa and T Ertl SemLensVisual analysis of semantic data with scatter plots and seman-tic lenses In C Ghidini A-C N Ngomo S N Lindstaedtand T Pellegrini editors Proceedings the 7th InternationalConference on Semantic Systems I-SEMANTICS 2011 GrazAustria September 7-9 2011 ACM International ConferenceProceeding Series pages 175ndash178 ACM 2011

[19] S Hellmann J Brekle and S Auer Leveraging the Crowd-sourcing of Lexical Resources for Bootstrapping a LinguisticData Cloud In JIST 2012

[20] S Hellmann C Stadler J Lehmann and S Auer DBpediaLive Extraction In Proc of 8th International Conference on On-tologies DataBases and Applications of Semantics (ODBASE)volume 5871 of Lecture Notes in Computer Science pages1209ndash1223 2009

[21] J Hoffart F M Suchanek K Berberich and G WeikumYAGO2 a spatially and temporally enhanced knowledge basefrom Wikipedia Artif Intell 19428ndash61 2013

[22] H Ji R Grishman H T Dang K Griffitt and J EllisOverview of the TAC 2010 knowledge base population trackIn Third Text Analysis Conference (TAC 2010) 2010

[23] G Kobilarov T Scott Y Raimond S Oliver C SizemoreM Smethurst C Bizer and R Lee Media meets SemanticWeb ndash how the BBC uses DBpedia and linked data to makeconnections In The semantic web research and applicationspages 723ndash737 Springer 2009

[24] D Kontokostas C Bratsas S Auer S Hellmann I Antoniouand G Metakides Internationalization of Linked Data Thecase of the Greek DBpedia Edition Web Semantics ScienceServices and Agents on the World Wide Web 15(0)51 ndash 612012

[25] D Kontokostas A Zaveri S Auer and J LehmannTripleCheckMate A Tool for Crowdsourcing the Quality As-sessment of Linked Data In Proceedings of the 4th Conferenceon Knowledge Engineering and Semantic Web 2013

[26] P Kreis Design of a quality assessment framework for theDBpedia knowledge base Masterrsquos thesis Freie UniversitatBerlin 2011

[27] C Lagoze H V de Sompel M Nelson and S WarnerThe open archives initiative protocol for metadata har-vesting httpwwwopenarchivesorgOAIopenarchivesprotocolhtml 2008

[28] J Lehmann S Monahan L Nezda A Jung and Y Shi LCCapproaches to knowledge base population at TAC 2010 InProceedings TAC Workshop 2010

[29] D B Lenat CYC A large-scale investment in knowledgeinfrastructure Communications of the ACM 38(11)33ndash381995

[30] C-Y Lin and E H Hovy The automated acquisition of topicsignatures for text summarization In Proceedings of the 18thconference on Computational linguistics pages 495ndash501 2000

[31] V Lopez M Fernandez E Motta and N Stieler PowerAquaSupporting users in querying and exploring the Semantic WebSemantic Web 3(3)249ndash265 2012

[32] M Martin C Stadler P Frischmuth and J Lehmann Increas-ing the financial transparency of European Commission projectfunding Semantic Web Journal Special Call for Linked Datasetdescriptions 2013

[33] P N Mendes M Jakob and C Bizer DBpedia for NLP - amultilingual cross-domain knowledge base In Proceedingsof the International Conference on Language Resources andEvaluation (LREC) Istanbul Turkey 2012

[34] P N Mendes M Jakob A Garcıa-Silva and C Bizer DBpediaSpotlight Shedding light on the web of documents In Proceed-ings of the 7th International Conference on Semantic Systems(I-Semantics) Graz Austria 2011

[35] S M Meyer J Degener J Giannandrea and B MichenerOptimizing schema-last tuple-store queries in GraphD In A KElmagarmid and D Agrawal editors SIGMOD Conferencepages 1047ndash1056 ACM 2010

[36] M Morsey J Lehmann S Auer C Stadler and S Hell-mann DBpedia and the Live Extraction of Structured Datafrom Wikipedia Program electronic library and informationsystems 4627 2012

[37] K Nakayama M Pei M Erdmann M Ito M ShirakawaT Hara and S Nishio Wikipedia mining Wikipedia as a corpus

Lehmann et al DBpedia 29

for knowledge extraction In Annual Wikipedia Conference(Wikimania) 2008

[38] X Niu X Sun H Wang S Rong G Qi and Y Yu ZhishimeWeaving chinese linking open data In 10th International Con-ference on The semantic web - Volume Part II pages 205ndash220Berlin Heidelberg 2011 Springer-Verlag

[39] E Oren R Delbru M Catasta R Cyganiak H Stenzhornand G Tummarello Sindicecom A document-oriented lookupindex for open linked data Int J of Metadata and Semanticsand Ontologies 337ndash52 Nov 10 2008

[40] S P Ponzetto and M Strube WikiTaxonomy A large scaleknowledge resource In M Ghallab C D SpyropoulosN Fakotakis and N M Avouris editors ECAI volume 178of Frontiers in Artificial Intelligence and Applications pages751ndash752 IOS Press 2008

[41] C Stadler J Lehmann K Hoffner and S Auer LinkedGeo-Data A Core for a Web of Spatial Open Data Semantic WebJournal 3(4)333ndash354 2012

[42] F M Suchanek G Kasneci and G Weikum YAGO A coreof semantic knowledge In C L Williamson M E ZurkoP F Patel-Schneider and P J Shenoy editors WWW pages697ndash706 ACM 2007

[43] E Tacchini A Schultz and C Bizer Experiments withWikipedia cross-language data fusion In Proceedings of the5th Workshop on Scripting and Development for the SemanticWeb ESWC Citeseer 2009

[44] C Unger L Buhmann J Lehmann A-C Ngonga NgomoD Gerber and P Cimiano Template-based Question Answer-ing over RDF Data In Proceedings of the 21st internationalconference on World Wide Web pages 639ndash648 2012

[45] Z Wang J Li Z Wang and J Tang Cross-lingual KnowledgeLinking Across Wiki Knowledge Bases In Proceedings ofthe 21st international conference on World Wide Web pages459ndash468 New York NY USA 2012 ACM

[46] F Wu and D Weld Automatically refining the WikipediaInfobox Ontology In Proceedings of the 17th World Wide WebConference 2008

[47] F Wu and D S Weld Autonomously semantifying WikipediaIn Proceedings of the 16th Conference on Information andKnowledge Management pages 41ndash50 ACM 2007

[48] A Zaveri D Kontokostas M A Sherif L BuhmannM Morsey S Auer and J Lehmann User-driven QualityEvaluation of DBpedia In To appear in Proceedings of 9thInternational Conference on Semantic Systems I-SEMANTICSrsquo13 Graz Austria September 4-6 2013 ACM 2013

[49] T Zesch C Muller and I Gurevych Extracting lexical seman-tic knowledge from Wikipedia and Wiktionary In Proceedingsof the 6th International Conference on Language Resourcesand Evaluation Marrakech Morocco May 2008 electronicproceedings

View publication statsView publication stats

Page 25: DBpedia - A Large-scale, Multilingual Knowledge Base ...static.tongtianta.site/paper_pdf/38bd5a66-1521-11e9-bf90-00163e08bb… · DBpedia is served as Linked Data on the Web. Since

24 Lehmann et al DBpedia

Table 15Statistical comparison of extractions for different languages

language words triples resources predicates senses

en 2142237 28593364 11804039 28 424386fr 4657817 35032121 20462349 22 592351ru 1080156 12813437 5994560 17 149859de 701739 5618508 2966867 16 122362

and provide identifiers based on those for Wikipediaarticles They both provide dumps of the extracted dataas well as APIs or endpoints to access the data andallow their communities to influence the schema of thedata There are however also major differences be-tween both projects DBpedia focuses on being an RDFrepresentation of Wikipedia and serving as a hub on theWeb of Data whereas Freebase uses several sources toprovide broad coverage The store behind Freebase isthe GraphD [35] graph database which allows to effi-ciently store metadata for each fact This graph store isappend-only Deleted triples are marked and the systemcan easily revert to a previous version This is neces-sary since Freebase data can be directly edited by userswhereas information in DBpedia can only indirectly beedited by modifying the content of Wikipedia or theMappings Wiki From an organisational point of viewFreebase is mainly run by Google whereas DBpedia isan open community project In particular in focus areasof Google and areas in which Freebase includes otherdata sources the Freebase database provides a highercoverage than DBpedia

813 YAGOOne of the projects that pursues similar goals

to DBpedia is YAGO48 [42] YAGO is identical toDBpedia in that each article in Wikipedia becomes anentity in YAGO Based on this it uses the leaf cate-gories in the Wikipedia category graph to infer typeinformation about an entity One of its key features is tolink this type information to WordNet WordNet synsetsare represented as classes and the extracted types ofentities may become subclasses of such a synset In theYAGO2 system [21] declarative extraction rules wereintroduced which can extract facts from different partsof Wikipedia articles eg infoboxes and categories aswell as other sources YAGO2 also supports spatial andtemporal dimensions for facts at the core of its system

One of the main differences between DBpedia andYAGO in general is that DBpedia tries to stay veryclose to Wikipedia and provide an RDF version of its

48httpwwwmpi-infmpgdeyago-nagayago

content YAGO focuses on extracting a smaller numberof relations compared to DBpedia to achieve very highprecision and consistent knowledge The two knowl-edge bases offer different type systems whereas theDBpedia ontology is manually maintained YAGO isbacked by WordNet and Wikipedia leaf categoriesDue to this YAGO contains many more classes thanDBpedia Another difference is that the integration ofattributes and objects in infoboxes is done via mappingsin DBpedia and therefore by the DBpedia communityitself whereas this task is facilitated by expert-designeddeclarative rules in YAGO2

The two knowledge bases are connected eg DBpediaoffers the YAGO type hierarchy as an alternative tothe DBpedia ontology and sameAs links are providedin both directions While the underlying systems arevery different both projects share similar aims andpositively complement and influence each other

82 Knowledge Extraction from Wikipedia

Since its official start in 2001 Wikipedia has alwaysbeen the target of automatic extraction of informationdue to its easy availability open license and encyclo-pedic knowledge A large number of parsers scraperprojects and publications exist In this section we re-strict ourselves to approaches that are either notable re-cent or pertinent to DBpedia MediaWikiorg maintainsan up-to-date list of software projects49 who are able toprocess wiki syntax as well as a list of data extractionextensions50 for MediaWiki

JWPL (Java Wikipedia Library [49]) is an open-source Java-based API that allows to access informa-tion provided by the Wikipedia API (redirects cate-gories articles and link structure) JWPL contains aMediaWiki Markup parser that can be used to furtheranalyze the contents of a Wikipedia page Data is also

49httpwwwmediawikiorgwikiAlternative_parsers

50httpwwwmediawikiorgwikiExtension_Matrixdata_extraction

Lehmann et al DBpedia 25

provided as XML dump and is incorporated in the lexi-cal resource UBY51 for language tools

Several different approaches to extract knowledgefrom Wikipedia are presented in [37] Given featureslike anchor texts interlanguage links category linksand redirect pages are utilized eg for word-sense dis-ambiguations or synonyms translations taxonomic re-lations and abbreviation or hypernym resolution re-spectively Apart from this link structures are used tobuild the Wikipedia Thesaurus Web service52 Addi-tional projects that exploit the mentioned features arelisted on the Special Interest Group on Wikipedia Min-ing (SIGWP) Web site53

An earlier approach to improve the quality of theinfobox schemata and contents is described in [47]The presented methodology encompasses a three stepprocess of preprocessing classification and extractionDuring preprocessing refined target infobox schemataare created applying statistical methods and trainingsets are extracted based on real Wikipedia data Afterassigning a class and the corresponding target schema(classification) the training sets are used to extract tar-get infobox values from the documentrsquos text applyingmachine learning algorithms

The idea of using structured data from certainmarkup structures was also applied to other user-drivenWeb encyclopedias In [38] the authors describe their ef-fort building an integrated Chinese Linking Open Data(CLOD) source based on the Chinese Wikipedia andthe two widely used and large encyclopedias BaiduBaike54 and Hudong Baike55 Apart from utilizing Me-diaWiki and HTML Markup for the actual extractionthe Wikipedia interlanguage links were used to link theCLOD source to the English DBpedia

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipediainterlanguage links is presented in [45] Focusing onwiki knowledge bases the authors introduce their solu-tion based on structural properties like similar linkagestructures the assignment to similar categories and sim-ilar interests of the authors of wiki documents in theconsidered languages Since this approach is language-feature-agnostic it is not restricted to certain languages

51httpwwwukptu-darmstadtdedatalexical-resourcesuby

52httpsigwporgenindexphpWikipedia_Thesaurus

53httpsigwporgen54httpbaikebaiducom55httpwwwhudongcom

KnowItAll56 is a web scale knowledge extractioneffort which is domain-independent and uses genericextraction rules co-occurrence statistics and NaiveBayes classification [11] Cyc [29] is a large com-mon sense knowledge base which is now partiallyreleased as OpenCyc and also available as an OWLontology OpenCyc is linked to DBpedia which pro-vides an ontological embedding in its comprehensivestructures WikiTaxonomy [40] is a large taxonomy de-rived from categories in Wikipedia by classifying cate-gories as instances or classes and deriving a subsump-tion hierarchy The KOG system [46] refines existingWikipedia infoboxes based on machine learning tech-niques using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Net-works KYLIN [47] is a system which autonomouslyextracts structured data from Wikipedia and uses self-supervised linking

9. Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular: (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that, in the past four years, DBpedia has matured and improved significantly in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [19] meanwhile becoming part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.
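As an illustrative sketch of such a comparison (assuming the chapter endpoint URL and the dbo:populationTotal property; this query is not part of the extraction framework), a SPARQL 1.1 federated query can contrast a property value between the English and German editions:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    # Flag cases where the English and German editions disagree
    # on Berlin's population.
    SELECT ?en ?de WHERE {
      <http://dbpedia.org/resource/Berlin> dbo:populationTotal ?en .
      SERVICE <http://de.dbpedia.org/sparql> {
        <http://de.dbpedia.org/resource/Berlin> dbo:populationTotal ?de .
      }
      FILTER (?en != ?de)
    }

Each returned binding is a candidate extraction error or a genuine discrepancy between the underlying Wikipedia editions.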

56 http://www.cs.washington.edu/research/knowitall



Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [48], we made a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki that was backported and implemented as a very lightweight extension for MediaWiki57; for example, a typed link such as [[capital of::Germany]] expresses an explicit capital-of relation between the annotated page and Germany. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other, and there are several ongoing collaboration activities. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person always lies before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
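A minimal sketch of such a sanity check, assuming the dbo:birthDate and dbo:deathDate properties of the DBpedia ontology, could be run periodically against the DBpedia Live endpoint; every returned binding points to a page worth reviewing:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    # List persons whose recorded death date precedes their birth date.
    SELECT ?person ?birth ?death WHERE {
      ?person dbo:birthDate ?birth ;
              dbo:deathDate ?death .
      FILTER (?death < ?birth)
    }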

Integrate DBpedia and NLP. Recent advances (cf. Section 7) show huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to which entity a certain surface form should be mapped.
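As a simplified sketch of this idea (a label-based candidate lookup, not the actual DBpedia Spotlight implementation), candidate entities for a surface form can be retrieved together with graph context that a disambiguation step could score; the use of dbo:wikiPageWikiLink from the wikilinks dataset as context is an assumption for illustration:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    # Candidates labelled "Washington", with their types and
    # directly linked resources as disambiguation context.
    SELECT ?candidate ?type ?context WHERE {
      ?candidate rdfs:label "Washington"@en .
      OPTIONAL { ?candidate a ?type }
      OPTIONAL { ?candidate dbo:wikiPageWikiLink ?context }
    }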

Acknowledgment

We would like to thank and acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  ∗ Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  ∗ Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  ∗ Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16
List of namespace prefixes

Prefix   Namespace
dbo      http://dbpedia.org/ontology/
dbp      http://dbpedia.org/property/
dbr      http://dbpedia.org/resource/
dbr-de   http://de.dbpedia.org/resource/
dc       http://purl.org/dc/elements/1.1/
foaf     http://xmlns.com/foaf/0.1/
geo      http://www.w3.org/2003/01/geo/wgs84_pos#
georss   http://www.georss.org/georss/
ls       http://spotlight.dbpedia.org/scores/
lx       http://spotlight.dbpedia.org/lexicalizations/
owl      http://www.w3.org/2002/07/owl#
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
skos     http://www.w3.org/2004/02/skos/core#
sptl     http://spotlight.dbpedia.org/vocab/
xsd      http://www.w3.org/2001/XMLSchema#

Table 17
DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release
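For illustration, the prefixes of Table 16 abbreviate full IRIs in SPARQL queries and RDF serializations; a sketch of a query against the main endpoint using them:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    # Retrieve the English abstract of the resource dbr:DBpedia.
    SELECT ?abstract WHERE {
      dbr:DBpedia dbo:abstract ?abstract .
      FILTER (lang(?abstract) = "en")
    }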

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.


[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5–7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3–5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2–4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7–9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernandez, E. Motta and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, November 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer and J. Lehmann. User-driven Quality Evaluation of DBpedia. To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4–6, 2013. ACM, 2013.

[49] T. Zesch, C. Müller and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

View publication statsView publication stats

Page 26: DBpedia - A Large-scale, Multilingual Knowledge Base ...static.tongtianta.site/paper_pdf/38bd5a66-1521-11e9-bf90-00163e08bb… · DBpedia is served as Linked Data on the Web. Since

Lehmann et al DBpedia 25

provided as XML dump and is incorporated in the lexi-cal resource UBY51 for language tools

Several different approaches to extract knowledgefrom Wikipedia are presented in [37] Given featureslike anchor texts interlanguage links category linksand redirect pages are utilized eg for word-sense dis-ambiguations or synonyms translations taxonomic re-lations and abbreviation or hypernym resolution re-spectively Apart from this link structures are used tobuild the Wikipedia Thesaurus Web service52 Addi-tional projects that exploit the mentioned features arelisted on the Special Interest Group on Wikipedia Min-ing (SIGWP) Web site53

An earlier approach to improve the quality of theinfobox schemata and contents is described in [47]The presented methodology encompasses a three stepprocess of preprocessing classification and extractionDuring preprocessing refined target infobox schemataare created applying statistical methods and trainingsets are extracted based on real Wikipedia data Afterassigning a class and the corresponding target schema(classification) the training sets are used to extract tar-get infobox values from the documentrsquos text applyingmachine learning algorithms

The idea of using structured data from certainmarkup structures was also applied to other user-drivenWeb encyclopedias In [38] the authors describe their ef-fort building an integrated Chinese Linking Open Data(CLOD) source based on the Chinese Wikipedia andthe two widely used and large encyclopedias BaiduBaike54 and Hudong Baike55 Apart from utilizing Me-diaWiki and HTML Markup for the actual extractionthe Wikipedia interlanguage links were used to link theCLOD source to the English DBpedia

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipediainterlanguage links is presented in [45] Focusing onwiki knowledge bases the authors introduce their solu-tion based on structural properties like similar linkagestructures the assignment to similar categories and sim-ilar interests of the authors of wiki documents in theconsidered languages Since this approach is language-feature-agnostic it is not restricted to certain languages

51httpwwwukptu-darmstadtdedatalexical-resourcesuby

52httpsigwporgenindexphpWikipedia_Thesaurus

53httpsigwporgen54httpbaikebaiducom55httpwwwhudongcom

KnowItAll56 is a web scale knowledge extractioneffort which is domain-independent and uses genericextraction rules co-occurrence statistics and NaiveBayes classification [11] Cyc [29] is a large com-mon sense knowledge base which is now partiallyreleased as OpenCyc and also available as an OWLontology OpenCyc is linked to DBpedia which pro-vides an ontological embedding in its comprehensivestructures WikiTaxonomy [40] is a large taxonomy de-rived from categories in Wikipedia by classifying cate-gories as instances or classes and deriving a subsump-tion hierarchy The KOG system [46] refines existingWikipedia infoboxes based on machine learning tech-niques using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Net-works KYLIN [47] is a system which autonomouslyextracts structured data from Wikipedia and uses self-supervised linking

9 Conclusions and Future Work

In this system report we presented an overview onrecent advances of the DBpedia community projectThe technical innovations described in this article in-cluded in particular (1) the extraction based on thecommunity-curated DBpedia ontology (2) the live syn-chronisation of DBpedia with Wikipedia and DBpediamirrors through update propagation and (3) the facilita-tion of the internationalisation of DBpedia As a resultwe demonstrated that in the past four years DBpediamatured and improved significantly in terms of cover-age usability and data quality

With DBpedia we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced con-tent repositories There are a large number of furthercrowd-sourced content repositories and DBpedia al-ready had an impact on their structured data publishingand interlinking Two examples are Wiktionary withthe Wiktionary extraction [19] meanwhile becomingpart of DBpedia and LinkedGeoData [41] which aimsto implement similar data extraction publishing andlinking strategies for OpenStreetMaps

In the future we see in particular the following di-rections for advancing the DBpedia project

Multilingual data integration and fusion An areawhich is still largely unexplored is the integration and

56httpwwwcswashingtoneduresearchknowitall

26 Lehmann et al DBpedia

fusion between different DBpedia language editionsNon-English DBpedia editions comprise a better anddifferent coverage of local culture When we are able toprecisely identify equivalent overlapping and comple-mentary parts in different DBpedia language editionswe can reach significantly increased coverage On theother hand comparing the values of a specific prop-erty between different language editions will help usto spot extraction errors as well as wrong or outdatedinformation in Wikipedia

Community-driven data quality improvement In thefuture we also aim to engage a larger community ofDBpedia users in feedback loops which help us toidentify data quality problems and corresponding defi-ciencies of the DBpedia extraction framework By con-stantly monitoring the data quality and integrating im-provements into the mappings to the DBpedia ontologyas well as fixes into the extraction framework we aim todemonstrate that the Wikipedia community is not onlycapable of creating the largest encyclopedia but alsothe most comprehensive and structured knowledge baseWith the DBpedia quality evaluation campaign [48] wewere making a first step in this direction

Inline extraction Currently DBpedia extracts infor-mation primarily from templates In the future weenvision to also extract semantic information fromtyped links Typed links is a feature of Semantic Me-diaWiki which was backported and implemented asa very lightweight extension for MediaWiki57 If thisextension is deployed at Wikipedia installations thisopens up completely new possibilities for more fine-grained and non-invasive knowledge representationsand extraction from Wikipedia

Collaboration between Wikidata and DBpediaWhile DBpedia provides a comprehensive and currentview on entity descriptions extracted from WikipediaWikidata offers a variety of factual statements fromdifferent sources and dates One of the richest sourcesof DBpedia are Wikipedia infoboxes which are struc-tured but at the same time heterogeneous and non-standardized (thus making the extraction error prone incertain cases) The aim of Wikidata is to populate in-foboxes automatically from a centrally managed high-quality fact database In this regard both projects com-plement each other and there are several ongoing col-laboration activities In future versions DBpedia willinclude more raw data provided by Wikidata and add

57httpwwwmediawikiorgwikiExtensionLightweightRDFa

services such as Linked DataSPARQL endpoints RDFdumps linking and ontology mapping for Wikidata

Feedback for Wikipedia A promising prospect is thatDBpedia can help to identify misrepresentations errorsand inconsistencies in Wikipedia In the future we planto provide more feedback to the Wikipedia communityabout the quality of Wikipedia This can for instancebe achieved in the form of sanity checks which areimplemented as SPARQL queries on the DBpedia Liveendpoint which identify data quality issues and are ex-ecuted in certain intervals For example a query couldcheck that the birthday of a person must always be be-fore the death day or spot outliers that differ signifi-cantly from the range of the majority of the other val-ues In case a Wikipedia editor makes a mistake or typowhen adding such information to a page this could beautomatically identified and provided as feedback toWikipedians

Integrate DBpedia and NLP Recent advances(cf Section 7) show huge potential for employingLinked Data background knowledge in various NaturalLanguage Processing (NLP) tasks One very promisingresearch avenue in this regard is to employ DBpediaas structured background knowledge for named en-tity recognition and disambiguation Currently mostapproaches use statistical information such as co-occurrence for named entity disambiguation Howeverco-occurrence is not always easy to determine (dependson training data) and update (requires recomputation)With DBpedia and in particular DBpedia Live we havecomprehensive and evolving background knowledgecomprising information on the relationship between alarge number of real-world entities Consequently wecan employ this information for deciding to what entitya certain surface form should be mapped

Acknowledgment

We would like to thank and acknowledge the supportof the following people and organisations to DBpedia

ndash all Wikipedia contributorsndash all DBpedia Mappings Wiki contributorsndash OpenLink Software for providing and maintaining

the server infrastructure for the main DBpediaendpoint

ndash Kingsley Idehen for SPARQL and Linked Datahosting and community support

ndash Christopher Sahnwaldt for DBpedia developmentand release management

ndash Claus Stadler for DBpedia development

Lehmann et al DBpedia 27

ndash Paul Kreis for DBpedia developmentndash people who helped contributing data for certain

parts of the article

lowast Instance Data Analysis Volha Bryl (working atUniversity of Mannheim)

lowast Sindice analysis Stphane Campinas SzymonDanielczyk and Gabriela Vulcu (working atDERI)

lowast Freebase Shawn Simister and Tom Morris(working at Freebase)

This work was supported by grants from the Euro-pean Unionrsquos 7th Framework Programme provided forthe projects LOD2 (GA no 257943) GeoKnow (GAno 318159) and Dicode (GA no 257184)

Appendix

Table 16List of namespace prefixes

Prefix Namespace

dbo httpdbpediaorgontologydbp httpdbpediaorgpropertydbr httpdbpediaorgresourcedbr-de httpdedbpediaorgresourcedc httppurlorgdcelements11foaf httpxmlnscomfoaf01geo httpwwww3org200301geowgs84 posgeorss httpwwwgeorssorggeorssls httpspotlightdbpediaorgscoreslx httpspotlightdbpediaorglexicalizationsowl httpwwww3org200207owlrdf httpwwww3org19990222-rdf-syntax-nsrdfs httpwwww3org200001rdf-schemaskos httpwwww3org200402skoscoresptl httpspotlightdbpediaorgvocabxsd httpwwww3org2001XMLSchema

References

[1] S Auer C Bizer G Kobilarov J Lehmann R Cyganiak andZ Ives DBpedia A Nucleus for a Web of Open Data InProceedings of the 6th International Semantic Web Conference(ISWC) volume 4825 of Lecture Notes in Computer Sciencepages 722ndash735 Springer 2008

Table 17DBpedia timeline

Year Month Event

2006 Nov Start of infobox extraction from Wikipedia2007 Mar DBpedia 10 release

Jun ESWC DBpedia article [2]Nov DBpedia 20 releaseNov ISWC DBpedia article [1]Dec DBpedia 30 release candidate

2008 Feb DBpedia 30 releaseAug DBpedia 31 releaseNov DBpedia 32 release

2009 Jan DBpedia 33 releaseSep JWS DBpedia article [5]

2010 Feb Information extraction framework in ScalaMappings Wiki release

Mar DBpedia 34 releaseApr DBpedia 35 releaseMay Start of DBpedia Internationalization effort

2011 Feb DBpedia Spotlight releaseMar DBpedia 36 releaseJul DBpedia Live releaseSep DBpedia 37 release (with I18n datasets)

2012 Aug DBpedia 38 releaseSep Publication of DBpedia Internationalization

article[24]2013 Sep DBpedia 39 release

[2] S Auer and J Lehmann What have Innsbruck and Leipzigin common extracting semantics from wiki content In Pro-ceedings of the ESWC (2007) volume 4519 of Lecture Notes inComputer Science pages 503ndash517 Berlin Heidelberg 2007Springer

[3] C Becker and C Bizer Exploring the geospatial Semantic Webwith DBpedia Mobile J Web Sem 7(4)278ndash286 2009

[4] D Bikel V Castelli R Florian and D-J Han Entity linkingand slot filling through statistical processing and inference rulesIn Proceedings TAC Workshop 2009

[5] C Bizer J Lehmann G Kobilarov S Auer C Becker R Cy-ganiak and S Hellmann DBpedia - a crystallization point forthe Web of Data Journal of Web Semantics 7(3)154ndash1652009

[6] E Cabrio J Cojan A P Aprosio B Magnini A Lavelli andF Gandon QAKiS an open domain QA system based onrelational patterns In ISWC-PD International Semantic WebConference (Posters amp Demos) volume 914 of CEUR WorkshopProceedings CEUR-WSorg 2012

[7] D V Camarda S Mazzini and A Antonuccio Lodlive ex-ploring the Web of Data In V Presutti and H S Pinto editorsI-SEMANTICS 2012 - 8th International Conference on Seman-tic Systems I-SEMANTICS rsquo12 Graz Austria September 5-72012 pages 197ndash200 ACM 2012

[8] S Campinas T E Perry D Ceccarelli R Delbru and G Tum-marello Introducing RDF graph summary with application toassisted SPARQL formulation In Database and Expert Systems

28 Lehmann et al DBpedia

Applications (DEXA) 2012 23rd International Workshop onpages 261ndash266 IEEE 2012

[9] D Damljanovic M Agatonovic and H Cunningham FREyAAn interactive way of querying linked data using natural lan-guage In The Semantic Web ESWC 2011 Workshops volume7117 pages 125ndash138 Springer 2012

[10] O Erling and I Mikhailov RDF support in the Virtuoso DBMSIn S Auer C Bizer C Muller and A V Zhdanova editorsCSSW volume 113 of LNI pages 59ndash68 GI 2007

[11] O Etzioni M Cafarella D Downey S Kok A-M PopescuT Shaked S Soderland D S Weld and A Yates Web-scaleinformation extraction in KnowitAll In Proceedings of the 13thinternational conference on World Wide Web pages 100ndash110ACM 2004

[12] D Ferrucci E Brown J Chu-Carroll J Fan D Gondek A AKalyanpur A Lally J W Murdock E Nyberg J Prager et alBuilding Watson An overview of the DeepQA project AImagazine 31(3)59ndash79 2010

[13] A Garcıa-Silva M Jakob P N Mendes and C Bizer Mul-tipedia Enriching DBpedia with multimedia information InProceedings of the sixth international conference on Knowledgecapture K-CAP rsquo11 pages 137ndash144 New York NY USA2011 ACM

[14] A Garcıa-Silva M Szomszor H Alani and O Corcho Pre-liminary results in tag disambiguation using DBpedia In 1stInternational Workshop in Collective Knowledage Capturingand Representation (CKCaR) California USA 2009

[15] R Hahn C Bizer C Sahnwaldt C Herta S RobinsonM Burgle H Duwiger and U Scheel Faceted WikipediaSearch In W Abramowicz and R Tolksdorf editors BusinessInformation Systems 13th International Conference BIS 2010Berlin Germany May 3-5 2010 Proceedings volume 47 ofLecture Notes in Business Information Processing pages 1ndash11Springer 2010

[16] P Heim T Ertl and J Ziegler Facet Graphs Complex seman-tic querying made easy In Proceedings of the 7th ExtendedSemantic Web Conference (ESWC 2010) volume 6088 of LNCSpages 288ndash302 BerlinHeidelberg 2010 Springer

[17] P Heim S Hellmann J Lehmann S Lohmann and T Stege-mann RelFinder Revealing relationships in RDF knowledgebases In T-S Chua Y Kompatsiaris B Merialdo W HaasG Thallinger and W Bailer editors Semantic Multimedia 4thInternational Conference on Semantic and Digital Media Tech-nologies SAMT 2009 Graz Austria December 2-4 2009 Pro-ceedings volume 5887 of Lecture Notes in Computer Sciencepages 182ndash187 Springer 2009

[18] P Heim S Lohmann D Tsendragchaa and T Ertl SemLensVisual analysis of semantic data with scatter plots and seman-tic lenses In C Ghidini A-C N Ngomo S N Lindstaedtand T Pellegrini editors Proceedings the 7th InternationalConference on Semantic Systems I-SEMANTICS 2011 GrazAustria September 7-9 2011 ACM International ConferenceProceeding Series pages 175ndash178 ACM 2011

[19] S Hellmann J Brekle and S Auer Leveraging the Crowd-sourcing of Lexical Resources for Bootstrapping a LinguisticData Cloud In JIST 2012

[20] S Hellmann C Stadler J Lehmann and S Auer DBpediaLive Extraction In Proc of 8th International Conference on On-tologies DataBases and Applications of Semantics (ODBASE)volume 5871 of Lecture Notes in Computer Science pages1209ndash1223 2009

[21] J Hoffart F M Suchanek K Berberich and G WeikumYAGO2 a spatially and temporally enhanced knowledge basefrom Wikipedia Artif Intell 19428ndash61 2013

[22] H Ji R Grishman H T Dang K Griffitt and J EllisOverview of the TAC 2010 knowledge base population trackIn Third Text Analysis Conference (TAC 2010) 2010

[23] G Kobilarov T Scott Y Raimond S Oliver C SizemoreM Smethurst C Bizer and R Lee Media meets SemanticWeb ndash how the BBC uses DBpedia and linked data to makeconnections In The semantic web research and applicationspages 723ndash737 Springer 2009

[24] D Kontokostas C Bratsas S Auer S Hellmann I Antoniouand G Metakides Internationalization of Linked Data Thecase of the Greek DBpedia Edition Web Semantics ScienceServices and Agents on the World Wide Web 15(0)51 ndash 612012

[25] D Kontokostas A Zaveri S Auer and J LehmannTripleCheckMate A Tool for Crowdsourcing the Quality As-sessment of Linked Data In Proceedings of the 4th Conferenceon Knowledge Engineering and Semantic Web 2013

[26] P Kreis Design of a quality assessment framework for theDBpedia knowledge base Masterrsquos thesis Freie UniversitatBerlin 2011

[27] C Lagoze H V de Sompel M Nelson and S WarnerThe open archives initiative protocol for metadata har-vesting httpwwwopenarchivesorgOAIopenarchivesprotocolhtml 2008

[28] J Lehmann S Monahan L Nezda A Jung and Y Shi LCCapproaches to knowledge base population at TAC 2010 InProceedings TAC Workshop 2010

[29] D B Lenat CYC A large-scale investment in knowledgeinfrastructure Communications of the ACM 38(11)33ndash381995

[30] C-Y Lin and E H Hovy The automated acquisition of topicsignatures for text summarization In Proceedings of the 18thconference on Computational linguistics pages 495ndash501 2000

[31] V Lopez M Fernandez E Motta and N Stieler PowerAquaSupporting users in querying and exploring the Semantic WebSemantic Web 3(3)249ndash265 2012

[32] M Martin C Stadler P Frischmuth and J Lehmann Increas-ing the financial transparency of European Commission projectfunding Semantic Web Journal Special Call for Linked Datasetdescriptions 2013

[33] P N Mendes M Jakob and C Bizer DBpedia for NLP - amultilingual cross-domain knowledge base In Proceedingsof the International Conference on Language Resources andEvaluation (LREC) Istanbul Turkey 2012

[34] P N Mendes M Jakob A Garcıa-Silva and C Bizer DBpediaSpotlight Shedding light on the web of documents In Proceed-ings of the 7th International Conference on Semantic Systems(I-Semantics) Graz Austria 2011

[35] S M Meyer J Degener J Giannandrea and B MichenerOptimizing schema-last tuple-store queries in GraphD In A KElmagarmid and D Agrawal editors SIGMOD Conferencepages 1047ndash1056 ACM 2010

[36] M Morsey J Lehmann S Auer C Stadler and S Hell-mann DBpedia and the Live Extraction of Structured Datafrom Wikipedia Program electronic library and informationsystems 4627 2012

[37] K Nakayama M Pei M Erdmann M Ito M ShirakawaT Hara and S Nishio Wikipedia mining Wikipedia as a corpus

Lehmann et al DBpedia 29

for knowledge extraction In Annual Wikipedia Conference(Wikimania) 2008

[38] X Niu X Sun H Wang S Rong G Qi and Y Yu ZhishimeWeaving chinese linking open data In 10th International Con-ference on The semantic web - Volume Part II pages 205ndash220Berlin Heidelberg 2011 Springer-Verlag

[39] E Oren R Delbru M Catasta R Cyganiak H Stenzhornand G Tummarello Sindicecom A document-oriented lookupindex for open linked data Int J of Metadata and Semanticsand Ontologies 337ndash52 Nov 10 2008

[40] S P Ponzetto and M Strube WikiTaxonomy A large scaleknowledge resource In M Ghallab C D SpyropoulosN Fakotakis and N M Avouris editors ECAI volume 178of Frontiers in Artificial Intelligence and Applications pages751ndash752 IOS Press 2008

[41] C Stadler J Lehmann K Hoffner and S Auer LinkedGeo-Data A Core for a Web of Spatial Open Data Semantic WebJournal 3(4)333ndash354 2012

[42] F M Suchanek G Kasneci and G Weikum YAGO A coreof semantic knowledge In C L Williamson M E ZurkoP F Patel-Schneider and P J Shenoy editors WWW pages697ndash706 ACM 2007

[43] E Tacchini A Schultz and C Bizer Experiments withWikipedia cross-language data fusion In Proceedings of the5th Workshop on Scripting and Development for the SemanticWeb ESWC Citeseer 2009

[44] C Unger L Buhmann J Lehmann A-C Ngonga NgomoD Gerber and P Cimiano Template-based Question Answer-ing over RDF Data In Proceedings of the 21st internationalconference on World Wide Web pages 639ndash648 2012

[45] Z Wang J Li Z Wang and J Tang Cross-lingual KnowledgeLinking Across Wiki Knowledge Bases In Proceedings ofthe 21st international conference on World Wide Web pages459ndash468 New York NY USA 2012 ACM

[46] F Wu and D Weld Automatically refining the WikipediaInfobox Ontology In Proceedings of the 17th World Wide WebConference 2008

[47] F Wu and D S Weld Autonomously semantifying WikipediaIn Proceedings of the 16th Conference on Information andKnowledge Management pages 41ndash50 ACM 2007

[48] A Zaveri D Kontokostas M A Sherif L BuhmannM Morsey S Auer and J Lehmann User-driven QualityEvaluation of DBpedia In To appear in Proceedings of 9thInternational Conference on Semantic Systems I-SEMANTICSrsquo13 Graz Austria September 4-6 2013 ACM 2013

[49] T Zesch C Muller and I Gurevych Extracting lexical seman-tic knowledge from Wikipedia and Wiktionary In Proceedingsof the 6th International Conference on Language Resourcesand Evaluation Marrakech Morocco May 2008 electronicproceedings

View publication statsView publication stats

Page 27: DBpedia - A Large-scale, Multilingual Knowledge Base ...static.tongtianta.site/paper_pdf/38bd5a66-1521-11e9-bf90-00163e08bb… · DBpedia is served as Linked Data on the Web. Since

26 Lehmann et al DBpedia

fusion between different DBpedia language editionsNon-English DBpedia editions comprise a better anddifferent coverage of local culture When we are able toprecisely identify equivalent overlapping and comple-mentary parts in different DBpedia language editionswe can reach significantly increased coverage On theother hand comparing the values of a specific prop-erty between different language editions will help usto spot extraction errors as well as wrong or outdatedinformation in Wikipedia

Community-driven data quality improvement In thefuture we also aim to engage a larger community ofDBpedia users in feedback loops which help us toidentify data quality problems and corresponding defi-ciencies of the DBpedia extraction framework By con-stantly monitoring the data quality and integrating im-provements into the mappings to the DBpedia ontologyas well as fixes into the extraction framework we aim todemonstrate that the Wikipedia community is not onlycapable of creating the largest encyclopedia but alsothe most comprehensive and structured knowledge baseWith the DBpedia quality evaluation campaign [48] wewere making a first step in this direction

Inline extraction Currently DBpedia extracts infor-mation primarily from templates In the future weenvision to also extract semantic information fromtyped links Typed links is a feature of Semantic Me-diaWiki which was backported and implemented asa very lightweight extension for MediaWiki57 If thisextension is deployed at Wikipedia installations thisopens up completely new possibilities for more fine-grained and non-invasive knowledge representationsand extraction from Wikipedia

Collaboration between Wikidata and DBpediaWhile DBpedia provides a comprehensive and currentview on entity descriptions extracted from WikipediaWikidata offers a variety of factual statements fromdifferent sources and dates One of the richest sourcesof DBpedia are Wikipedia infoboxes which are struc-tured but at the same time heterogeneous and non-standardized (thus making the extraction error prone incertain cases) The aim of Wikidata is to populate in-foboxes automatically from a centrally managed high-quality fact database In this regard both projects com-plement each other and there are several ongoing col-laboration activities In future versions DBpedia willinclude more raw data provided by Wikidata and add

57httpwwwmediawikiorgwikiExtensionLightweightRDFa

services such as Linked DataSPARQL endpoints RDFdumps linking and ontology mapping for Wikidata

Feedback for Wikipedia A promising prospect is thatDBpedia can help to identify misrepresentations errorsand inconsistencies in Wikipedia In the future we planto provide more feedback to the Wikipedia communityabout the quality of Wikipedia This can for instancebe achieved in the form of sanity checks which areimplemented as SPARQL queries on the DBpedia Liveendpoint which identify data quality issues and are ex-ecuted in certain intervals For example a query couldcheck that the birthday of a person must always be be-fore the death day or spot outliers that differ signifi-cantly from the range of the majority of the other val-ues In case a Wikipedia editor makes a mistake or typowhen adding such information to a page this could beautomatically identified and provided as feedback toWikipedians

Integrate DBpedia and NLP Recent advances(cf Section 7) show huge potential for employingLinked Data background knowledge in various NaturalLanguage Processing (NLP) tasks One very promisingresearch avenue in this regard is to employ DBpediaas structured background knowledge for named en-tity recognition and disambiguation Currently mostapproaches use statistical information such as co-occurrence for named entity disambiguation Howeverco-occurrence is not always easy to determine (dependson training data) and update (requires recomputation)With DBpedia and in particular DBpedia Live we havecomprehensive and evolving background knowledgecomprising information on the relationship between alarge number of real-world entities Consequently wecan employ this information for deciding to what entitya certain surface form should be mapped

Acknowledgment

We would like to thank and acknowledge the supportof the following people and organisations to DBpedia

ndash all Wikipedia contributorsndash all DBpedia Mappings Wiki contributorsndash OpenLink Software for providing and maintaining

the server infrastructure for the main DBpediaendpoint

ndash Kingsley Idehen for SPARQL and Linked Datahosting and community support

ndash Christopher Sahnwaldt for DBpedia developmentand release management

ndash Claus Stadler for DBpedia development

Lehmann et al DBpedia 27

ndash Paul Kreis for DBpedia developmentndash people who helped contributing data for certain

parts of the article

lowast Instance Data Analysis Volha Bryl (working atUniversity of Mannheim)

lowast Sindice analysis Stphane Campinas SzymonDanielczyk and Gabriela Vulcu (working atDERI)

lowast Freebase Shawn Simister and Tom Morris(working at Freebase)

This work was supported by grants from the Euro-pean Unionrsquos 7th Framework Programme provided forthe projects LOD2 (GA no 257943) GeoKnow (GAno 318159) and Dicode (GA no 257184)

Appendix

Table 16List of namespace prefixes

Prefix Namespace

dbo httpdbpediaorgontologydbp httpdbpediaorgpropertydbr httpdbpediaorgresourcedbr-de httpdedbpediaorgresourcedc httppurlorgdcelements11foaf httpxmlnscomfoaf01geo httpwwww3org200301geowgs84 posgeorss httpwwwgeorssorggeorssls httpspotlightdbpediaorgscoreslx httpspotlightdbpediaorglexicalizationsowl httpwwww3org200207owlrdf httpwwww3org19990222-rdf-syntax-nsrdfs httpwwww3org200001rdf-schemaskos httpwwww3org200402skoscoresptl httpspotlightdbpediaorgvocabxsd httpwwww3org2001XMLSchema

References

[1] S Auer C Bizer G Kobilarov J Lehmann R Cyganiak andZ Ives DBpedia A Nucleus for a Web of Open Data InProceedings of the 6th International Semantic Web Conference(ISWC) volume 4825 of Lecture Notes in Computer Sciencepages 722ndash735 Springer 2008

Table 17DBpedia timeline

Year Month Event

2006 Nov Start of infobox extraction from Wikipedia2007 Mar DBpedia 10 release

Jun ESWC DBpedia article [2]Nov DBpedia 20 releaseNov ISWC DBpedia article [1]Dec DBpedia 30 release candidate

2008 Feb DBpedia 30 releaseAug DBpedia 31 releaseNov DBpedia 32 release

2009 Jan DBpedia 33 releaseSep JWS DBpedia article [5]

2010 Feb Information extraction framework in ScalaMappings Wiki release

Mar DBpedia 34 releaseApr DBpedia 35 releaseMay Start of DBpedia Internationalization effort

2011 Feb DBpedia Spotlight releaseMar DBpedia 36 releaseJul DBpedia Live releaseSep DBpedia 37 release (with I18n datasets)

2012 Aug DBpedia 38 releaseSep Publication of DBpedia Internationalization

article[24]2013 Sep DBpedia 39 release

[2] S Auer and J Lehmann What have Innsbruck and Leipzigin common extracting semantics from wiki content In Pro-ceedings of the ESWC (2007) volume 4519 of Lecture Notes inComputer Science pages 503ndash517 Berlin Heidelberg 2007Springer

[3] C Becker and C Bizer Exploring the geospatial Semantic Webwith DBpedia Mobile J Web Sem 7(4)278ndash286 2009

[4] D Bikel V Castelli R Florian and D-J Han Entity linkingand slot filling through statistical processing and inference rulesIn Proceedings TAC Workshop 2009

[5] C Bizer J Lehmann G Kobilarov S Auer C Becker R Cy-ganiak and S Hellmann DBpedia - a crystallization point forthe Web of Data Journal of Web Semantics 7(3)154ndash1652009

[6] E Cabrio J Cojan A P Aprosio B Magnini A Lavelli andF Gandon QAKiS an open domain QA system based onrelational patterns In ISWC-PD International Semantic WebConference (Posters amp Demos) volume 914 of CEUR WorkshopProceedings CEUR-WSorg 2012

[7] D V Camarda S Mazzini and A Antonuccio Lodlive ex-ploring the Web of Data In V Presutti and H S Pinto editorsI-SEMANTICS 2012 - 8th International Conference on Seman-tic Systems I-SEMANTICS rsquo12 Graz Austria September 5-72012 pages 197ndash200 ACM 2012

[8] S Campinas T E Perry D Ceccarelli R Delbru and G Tum-marello Introducing RDF graph summary with application toassisted SPARQL formulation In Database and Expert Systems

28 Lehmann et al DBpedia

Applications (DEXA) 2012 23rd International Workshop onpages 261ndash266 IEEE 2012

[9] D Damljanovic M Agatonovic and H Cunningham FREyAAn interactive way of querying linked data using natural lan-guage In The Semantic Web ESWC 2011 Workshops volume7117 pages 125ndash138 Springer 2012

[10] O Erling and I Mikhailov RDF support in the Virtuoso DBMSIn S Auer C Bizer C Muller and A V Zhdanova editorsCSSW volume 113 of LNI pages 59ndash68 GI 2007

[11] O Etzioni M Cafarella D Downey S Kok A-M PopescuT Shaked S Soderland D S Weld and A Yates Web-scaleinformation extraction in KnowitAll In Proceedings of the 13thinternational conference on World Wide Web pages 100ndash110ACM 2004

[12] D Ferrucci E Brown J Chu-Carroll J Fan D Gondek A AKalyanpur A Lally J W Murdock E Nyberg J Prager et alBuilding Watson An overview of the DeepQA project AImagazine 31(3)59ndash79 2010

[13] A Garcıa-Silva M Jakob P N Mendes and C Bizer Mul-tipedia Enriching DBpedia with multimedia information InProceedings of the sixth international conference on Knowledgecapture K-CAP rsquo11 pages 137ndash144 New York NY USA2011 ACM

[14] A Garcıa-Silva M Szomszor H Alani and O Corcho Pre-liminary results in tag disambiguation using DBpedia In 1stInternational Workshop in Collective Knowledage Capturingand Representation (CKCaR) California USA 2009

[15] R Hahn C Bizer C Sahnwaldt C Herta S RobinsonM Burgle H Duwiger and U Scheel Faceted WikipediaSearch In W Abramowicz and R Tolksdorf editors BusinessInformation Systems 13th International Conference BIS 2010Berlin Germany May 3-5 2010 Proceedings volume 47 ofLecture Notes in Business Information Processing pages 1ndash11Springer 2010

[16] P Heim T Ertl and J Ziegler Facet Graphs Complex seman-tic querying made easy In Proceedings of the 7th ExtendedSemantic Web Conference (ESWC 2010) volume 6088 of LNCSpages 288ndash302 BerlinHeidelberg 2010 Springer

[17] P Heim S Hellmann J Lehmann S Lohmann and T Stege-mann RelFinder Revealing relationships in RDF knowledgebases In T-S Chua Y Kompatsiaris B Merialdo W HaasG Thallinger and W Bailer editors Semantic Multimedia 4thInternational Conference on Semantic and Digital Media Tech-nologies SAMT 2009 Graz Austria December 2-4 2009 Pro-ceedings volume 5887 of Lecture Notes in Computer Sciencepages 182ndash187 Springer 2009

[18] P Heim S Lohmann D Tsendragchaa and T Ertl SemLensVisual analysis of semantic data with scatter plots and seman-tic lenses In C Ghidini A-C N Ngomo S N Lindstaedtand T Pellegrini editors Proceedings the 7th InternationalConference on Semantic Systems I-SEMANTICS 2011 GrazAustria September 7-9 2011 ACM International ConferenceProceeding Series pages 175ndash178 ACM 2011

[19] S Hellmann J Brekle and S Auer Leveraging the Crowd-sourcing of Lexical Resources for Bootstrapping a LinguisticData Cloud In JIST 2012

[20] S Hellmann C Stadler J Lehmann and S Auer DBpediaLive Extraction In Proc of 8th International Conference on On-tologies DataBases and Applications of Semantics (ODBASE)volume 5871 of Lecture Notes in Computer Science pages1209ndash1223 2009

[21] J Hoffart F M Suchanek K Berberich and G WeikumYAGO2 a spatially and temporally enhanced knowledge basefrom Wikipedia Artif Intell 19428ndash61 2013

[22] H Ji R Grishman H T Dang K Griffitt and J EllisOverview of the TAC 2010 knowledge base population trackIn Third Text Analysis Conference (TAC 2010) 2010

[23] G Kobilarov T Scott Y Raimond S Oliver C SizemoreM Smethurst C Bizer and R Lee Media meets SemanticWeb ndash how the BBC uses DBpedia and linked data to makeconnections In The semantic web research and applicationspages 723ndash737 Springer 2009

[24] D Kontokostas C Bratsas S Auer S Hellmann I Antoniouand G Metakides Internationalization of Linked Data Thecase of the Greek DBpedia Edition Web Semantics ScienceServices and Agents on the World Wide Web 15(0)51 ndash 612012

[25] D Kontokostas A Zaveri S Auer and J LehmannTripleCheckMate A Tool for Crowdsourcing the Quality As-sessment of Linked Data In Proceedings of the 4th Conferenceon Knowledge Engineering and Semantic Web 2013

[26] P Kreis Design of a quality assessment framework for theDBpedia knowledge base Masterrsquos thesis Freie UniversitatBerlin 2011

[27] C Lagoze H V de Sompel M Nelson and S WarnerThe open archives initiative protocol for metadata har-vesting httpwwwopenarchivesorgOAIopenarchivesprotocolhtml 2008

[28] J Lehmann S Monahan L Nezda A Jung and Y Shi LCCapproaches to knowledge base population at TAC 2010 InProceedings TAC Workshop 2010

[29] D B Lenat CYC A large-scale investment in knowledgeinfrastructure Communications of the ACM 38(11)33ndash381995

[30] C-Y Lin and E H Hovy The automated acquisition of topicsignatures for text summarization In Proceedings of the 18thconference on Computational linguistics pages 495ndash501 2000

[31] V Lopez M Fernandez E Motta and N Stieler PowerAquaSupporting users in querying and exploring the Semantic WebSemantic Web 3(3)249ndash265 2012


– Paul Kreis for DBpedia development
– people who helped by contributing data for certain parts of the article:

  ∗ Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  ∗ Sindice analysis: Stéphane Campinas, Szymon Danielczyk, and Gabriela Vulcu (working at DERI)
  ∗ Freebase: Shawn Simister and Tom Morris (working at Freebase)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

Appendix

Table 16
List of namespace prefixes

Prefix   Namespace
dbo      http://dbpedia.org/ontology/
dbp      http://dbpedia.org/property/
dbr      http://dbpedia.org/resource/
dbr-de   http://de.dbpedia.org/resource/
dc       http://purl.org/dc/elements/1.1/
foaf     http://xmlns.com/foaf/0.1/
geo      http://www.w3.org/2003/01/geo/wgs84_pos#
georss   http://www.georss.org/georss/
ls       http://spotlight.dbpedia.org/scores/
lx       http://spotlight.dbpedia.org/lexicalizations/
owl      http://www.w3.org/2002/07/owl#
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
skos     http://www.w3.org/2004/02/skos/core#
sptl     http://spotlight.dbpedia.org/vocab/
xsd      http://www.w3.org/2001/XMLSchema#
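To show how these prefixes are used in practice, the following is a minimal example SPARQL query (not taken from the article) that could be posed to the public DBpedia endpoint at http://dbpedia.org/sparql; it assumes that the class dbo:City and the properties dbo:country and dbo:populationTotal are present in the queried ontology version:

    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # English labels of cities in Germany with more than one
    # million inhabitants (class and property names are assumed)
    SELECT ?city ?label WHERE {
      ?city a dbo:City ;
            dbo:country dbr:Germany ;
            dbo:populationTotal ?population ;
            rdfs:label ?label .
      FILTER (?population > 1000000 && lang(?label) = "en")
    }

Declaring the prefixes once at the top of the query allows the compact notation dbo:City instead of the full IRI <http://dbpedia.org/ontology/City>.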

Table 17
DBpedia timeline

Year  Month  Event
2006  Nov    Start of infobox extraction from Wikipedia
2007  Mar    DBpedia 1.0 release
      Jun    ESWC DBpedia article [2]
      Nov    DBpedia 2.0 release
      Nov    ISWC DBpedia article [1]
      Dec    DBpedia 3.0 release candidate
2008  Feb    DBpedia 3.0 release
      Aug    DBpedia 3.1 release
      Nov    DBpedia 3.2 release
2009  Jan    DBpedia 3.3 release
      Sep    JWS DBpedia article [5]
2010  Feb    Information extraction framework in Scala; Mappings Wiki release
      Mar    DBpedia 3.4 release
      Apr    DBpedia 3.5 release
      May    Start of DBpedia Internationalization effort
2011  Feb    DBpedia Spotlight release
      Mar    DBpedia 3.6 release
      Jul    DBpedia Live release
      Sep    DBpedia 3.7 release (with I18n datasets)
2012  Aug    DBpedia 3.8 release
      Sep    Publication of DBpedia Internationalization article [24]
2013  Sep    DBpedia 3.9 release

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517, Berlin/Heidelberg, 2007. Springer.

[3] C. Becker and C. Bizer. Exploring the geospatial Semantic Web with DBpedia Mobile. Journal of Web Semantics, 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – a crystallization point for the Web of Data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the Web of Data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, volume 7117, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[13] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[14] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[15] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia Search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[16] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[17] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[18] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: Visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[19] S. Hellmann, J. Brekle, and S. Auer. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. In JIST, 2012.

[20] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live Extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[21] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[22] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[23] G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore, M. Smethurst, C. Bizer, and R. Lee. Media meets Semantic Web – how the BBC uses DBpedia and linked data to make connections. In The Semantic Web: Research and Applications, pages 723–737. Springer, 2009.

[24] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of Linked Data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15:51–61, 2012.

[25] D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data. In Proceedings of the 4th Conference on Knowledge Engineering and Semantic Web, 2013.

[26] P. Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[27] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[29] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[30] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.

[31] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web, 3(3):249–265, 2012.

[32] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[33] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[34] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[35] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[36] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[37] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[38] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin/Heidelberg, 2011. Springer-Verlag.

[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. International Journal of Metadata, Semantics and Ontologies, 3:37–52, Nov. 10, 2008.

[40] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[41] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4):333–354, 2012.

[42] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[43] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[44] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based Question Answering over RDF Data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648, 2012.

[45] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[46] F. Wu and D. Weld. Automatically refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[47] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[48] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven Quality Evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013. To appear.

[49] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.
