Creating and Exploiting a Web of Semantic Data

Overview

•Introduction
•Semantic Web 101
•Recent Semantic Web trends
•Examples: DBpedia, Wikitology
•Conclusion

The Age of Big Data

•Massive amounts of data are available today
•Advances in many fields are driven by the availability of unstructured data, e.g., text, audio, images
•Increasingly, large amounts of structured and semi-structured data are also online
•Much of this is available in the Semantic Web language RDF, fostering integration and interoperability

•Such structured data is especially important for the sciences

Twenty years ago…

Tim Berners-Lee’s 1989 WWW proposal described a web of relationships among named objects unifying many information management tasks

Capsule history
•Guha’s MCF (~94)
•XML+MCF=>RDF (~96)
•RDF+OO=>RDFS (~99)
•RDFS+KR=>DAML+OIL (00)
•W3C’s SW activity (01)
•W3C’s OWL (03)
•SPARQL, RDFa (08)
•Rules (09)

http://www.w3.org/History/1989/proposal.html

Ten years ago…

•The W3C started developing standards for the Semantic Web

•The vision, technology and use cases are still evolving

•Moving from a web of documents to a web of data

Today

4.5 billion integrated facts published on the Web as RDF Linked Open Data

Tomorrow

Large collections of integrated facts published on the Web for many disciplines and domains

W3C’s Semantic Web Goal

“The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”
-- Berners-Lee, Hendler and Lassila, The Semantic Web, Scientific American, 2001

Contrast with a non-Web approach

The W3C Semantic Web approach is
•Distributed
•Open
•Non-proprietary
•Standards based

How can we share data on the Web?

•POX, Plain Old XML, is one approach, but it has deficiencies

•The Semantic Web languages RDF and OWL offer a simpler and more abstract data model (a graph) that is better for integration

•Its well-defined semantics supports knowledge modeling and inference

•Supported by a stable, funded standards organization, the World Wide Web Consortium

Simple RDF Example

[Figure: RDF graph of the example. The resource http://umbc.edu/~finin/talks/idm02/ has a dc:Title arc to “Intelligent Information Systems on the Web and in the Aether” and a dc:Creator arc to a blank node; the blank node has bib:name “Tim Finin”, bib:email “[email protected]” and a bib:Aff arc to http://umbc.edu/]

The RDF Data Model

•An RDF document is an unordered collection of statements, each with a subject, predicate and object
•Each such triple can be thought of as a labelled arc in a graph
•Statements describe properties of resources
•A resource is any object that can be referenced or denoted by a URI
•Properties themselves are also resources (URIs)
•Dereferencing a URI produces useful additional information, e.g., a definition or additional facts
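The data model described above can be sketched in a few lines of Python. This is only an illustration, not a real RDF library: the `Triple` type, the `objects` and `describe` helpers and the blank node label `_:author` are names made up for the sketch, and the data is the talk example from the earlier slide.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One RDF statement: a labelled arc from subject to object."""
    subject: str
    predicate: str
    obj: str


DC = "http://purl.org/dc/elements/1.1/"
BIB = "http://daml.umbc.edu/ontologies/bib/"
TALK = "http://umbc.edu/~finin/talks/idm02/"
AUTHOR = "_:author"  # hypothetical label for the blank node

# An RDF document is an unordered collection of statements, so a set fits:
# order does not matter and repeating a statement adds nothing.
graph: Set[Triple] = {
    Triple(TALK, DC + "creator", AUTHOR),
    Triple(AUTHOR, BIB + "Name", '"Tim Finin"'),
    Triple(AUTHOR, BIB + "Aff", "http://umbc.edu/"),
}


def objects(graph, subject, predicate):
    """Follow every arc labelled `predicate` that leaves `subject`."""
    return {t.obj for t in graph
            if t.subject == subject and t.predicate == predicate}


def describe(graph, subject):
    """Return all (predicate, object) arcs leaving a node."""
    return {(t.predicate, t.obj) for t in graph if t.subject == subject}


creators = objects(graph, TALK, DC + "creator")
```

Calling `describe(graph, AUTHOR)` returns the two arcs that leave the blank node, which is the graph view of the same statements.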

RDF is the first SW language

The RDF data model has three equivalent forms:

•XML encoding: good for machine processing

<rdf:RDF ……..> <….> <….></rdf:RDF>

•Graph: good for human viewing

•Triples: good for storage and reasoning

stmt(docInst, rdf_type, Document)
stmt(personInst, rdf_type, Person)
stmt(inroomInst, rdf_type, InRoom)
stmt(personInst, holding, docInst)
stmt(inroomInst, person, personInst)

RDF is a simple language for graph based representations

XML encoding for RDF

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:bib="http://daml.umbc.edu/ontologies/bib/">
  <rdf:Description rdf:about="http://umbc.edu/~finin/talks/idm02/">
    <dc:title>Intelligent Information … and in the Aether</dc:title>
    <dc:creator>
      <rdf:Description>
        <bib:Name>Tim Finin</bib:Name>
        <bib:Email>[email protected]</bib:Email>
        <bib:Aff rdf:resource="http://umbc.edu/" />
      </rdf:Description>
    </dc:creator>
  </rdf:Description>
</rdf:RDF>




N3 is a friendlier encoding

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix bib: <http://daml.umbc.edu/ontologies/bib/> .

<http://umbc.edu/~finin/talks/idm02/>
    dc:title "Intelligent ... and in the Aether" ;
    dc:creator [ bib:Name "Tim Finin" ;
                 bib:Email "[email protected]" ;
                 bib:Aff <http://umbc.edu/> ] .




RDFS supports simple inferences

•RDF Schema adds vocabulary for classes, properties & constraints
•An RDF ontology plus some RDF statements may imply additional RDF statements (not possible in XML)
•Note that this is part of the data model and not of the accessing or processing code

The statements

@prefix rdfs: <http://www.....>.
@prefix : <genesis.n3>.
parent a rdf:Property; rdfs:domain person; rdfs:range person.
mother rdfs:subPropertyOf parent; rdfs:domain woman; rdfs:range person.
eve mother cain.

imply (in shorthand)

person a class.
woman subClass person.
mother a property.
eve a person; a woman; parent cain.
cain a person.
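To make the inference step concrete, here is a minimal sketch of forward chaining over three RDFS rules (subPropertyOf, domain and range), applied to the genesis example. It is a toy and not a complete RDFS reasoner; the function name `rdfs_closure` and the string constants are chosen for this sketch only.

```python
SUBPROP = "rdfs:subPropertyOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
TYPE = "rdf:type"


def rdfs_closure(triples):
    """Apply three RDFS rules repeatedly until no new statement appears."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            # Look for schema statements whose subject is this predicate.
            for s2, p2, o2 in inferred:
                if s2 != p:
                    continue
                if p2 == SUBPROP:    # p subPropertyOf q  =>  s q o
                    new.add((s, o2, o))
                elif p2 == DOMAIN:   # p domain C  =>  s type C
                    new.add((s, TYPE, o2))
                elif p2 == RANGE:    # p range C  =>  o type C
                    new.add((o, TYPE, o2))
        if new <= inferred:
            return inferred
        inferred |= new


facts = {
    ("parent", DOMAIN, "person"),
    ("parent", RANGE, "person"),
    ("mother", SUBPROP, "parent"),
    ("mother", DOMAIN, "woman"),
    ("mother", RANGE, "person"),
    ("eve", "mother", "cain"),
}

closure = rdfs_closure(facts)
derived = closure - facts
```

Only `eve mother cain` was asserted about the individuals; the closure also contains that eve is a parent of cain, that eve is a woman and a person, and that cain is a person. The nested loop is quadratic in the number of statements, which is fine for a slide-sized example but not for billions of triples.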

OWL adds further richness

OWL adds richer representational vocabulary, e.g.
–parentOf is the inverse of childOf
–Every person has exactly one mother
–Every person is a man or a woman but not both
–A man is the equivalent of a person with a sex property with value “male”

OWL is based on ‘description logic’, a logic subset with efficient reasoners that are complete
–Good algorithms for reasoning about descriptions

That was then, this is now

•1996-2000: focus on RDF and data
•2000-2007: focus on OWL, developing ontologies, sophisticated reasoning
•2008-…: Integrating and exploiting large RDF data collections backed by lightweight ontologies

A Linked Data story

•Wikipedia as a source of knowledge
–Wikis are a great way to collaborate on building up knowledge resources
•Wikipedia as an ontology
–Every Wikipedia page is a concept or object
•Wikipedia as RDF data
–Map this ontology into RDF
•DBpedia as the lynchpin for Linked Data
–Exploit its breadth of coverage to integrate things

[Figure callouts on uses of Wikipedia: populating the Freebase KB; underlying Powerset’s KB; mined by TrueKnowledge]

Wikipedia as an ontology

•Using Wikipedia as an ontology
–Each article (~3M) is an ontology concept or instance
–Terms linked via category system (~200k), infobox template use, inter-article links, infobox links
–Article history contains metadata for trust, provenance, etc.
•It’s a consensus ontology with broad coverage
•Created and maintained by a diverse community for free!
•Multilingual
•Very current
•Overall content quality is high

Wikipedia as an ontology

•Uncategorized and miscategorized articles

•Many ‘administrative’ categories (e.g., articles needing revision) and useless ones (e.g., 1949 births)

•Multiple infobox templates for the same class

•Multiple infobox attribute names for same property

•No datatypes or domains for infobox attribute values

• etc.

DBpedia: Wikipedia in RDF

•A community effort to extract structured information from Wikipedia and publish it as RDF on the Web
•Effort started in 2006 with EU funding
•Data and software open sourced
•DBpedia doesn’t extract information from Wikipedia’s text, but from its structured information, e.g., links, categories, infoboxes

DBpedia: Linked Data lynchpin

http://lookup.dbpedia.org/

DBpedia uses WP structured data

DBpedia extracts structured data from Wikipedia, especially from Infoboxes

DBpedia ontology

•DBpedia 3.2 (Nov 2008) added a manually constructed ontology with
–170 classes in a subsumption hierarchy
–880K instances
–940 properties with domain and range

•A partial, manual mapping was constructed from infobox attributes to these terms

•Current domain and range constraints are “loose”

•Namespace: http://dbpedia.org/ontology/

Place     248,000
Person    214,000
Work      193,000
Species    90,000
Org.       76,000
Building   23,000

http://dbpedia.org/ontology/

Person: 56 properties
Organisation: 50 properties
Place: 110 properties

http://dbpedia.org/sparql/

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/resource/>
PREFIX dbpo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?Property ?Place
WHERE {
  dbp:Barack_Obama ?Property ?Place .
  ?Place rdf:type dbpo:Place .
}
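A query like this is normally sent to the endpoint over HTTP. The sketch below only builds the GET request URL with the standard library, using the `query` parameter defined by the SPARQL Protocol, and sends nothing over the network. The helper names `build_query_url` and `query_from_url` are made up for illustration, and whether the public endpoint still accepts exactly this URL is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://dbpedia.org/sparql/"  # endpoint URL as given on the slide

QUERY = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/resource/>
PREFIX dbpo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?Property ?Place
WHERE {
  dbp:Barack_Obama ?Property ?Place .
  ?Place rdf:type dbpo:Place .
}"""


def build_query_url(endpoint, query):
    """Return a GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def query_from_url(url):
    """Recover the SPARQL text from such a URL (handy for logging or tests)."""
    return parse_qs(urlsplit(url).query)["query"][0]


url = build_query_url(ENDPOINT, QUERY)
```

The resulting `url` can be passed to any HTTP client; the response format depends on the server and on the Accept header the client sends.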

DBpedia: Linked Data lynchpin

Consider Baltimore, MD

Looking at the RDF description

We find assertions equating DBpedia's object for Baltimore with those in other LOD datasets:

dbpedia:Baltimore%2C_Maryland

owl:sameAs census:us/md/counties/baltimore/baltimore;

owl:sameAs cyc:concept/Mx4rvVin-5wpEbGdrcN5Y29ycA;

owl:sameAs freebase:guid.9202a8c04000641f800000000004921a;

owl:sameAs geonames:4347778/ .

Since owl:sameAs is defined as an equivalence relation, the mapping works both ways
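Because owl:sameAs is reflexive, symmetric and transitive, a set of sameAs assertions partitions identifiers into equivalence classes. One common way to maintain such classes is a union-find structure; the sketch below is an illustration of that idea and not how DBpedia or any particular triple store implements it. The class name `SameAsIndex` and its methods are hypothetical, and the identifiers are the ones shown for Baltimore.

```python
class SameAsIndex:
    """Union-find over identifiers linked by owl:sameAs."""

    def __init__(self):
        self._parent = {}

    def _find(self, x):
        """Return the representative of x's equivalence class."""
        self._parent.setdefault(x, x)
        root = x
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path at the root.
        while self._parent[x] != root:
            nxt = self._parent[x]
            self._parent[x] = root
            x = nxt
        return root

    def add_same_as(self, a, b):
        """Record the assertion: a owl:sameAs b."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_a] = root_b

    def are_same(self, a, b):
        return self._find(a) == self._find(b)

    def equivalents(self, x):
        """All identifiers known to denote the same thing as x."""
        root = self._find(x)
        return {y for y in list(self._parent) if self._find(y) == root}


BALTIMORE = "dbpedia:Baltimore%2C_Maryland"
index = SameAsIndex()
for other in ("census:us/md/counties/baltimore/baltimore",
              "cyc:concept/Mx4rvVin-5wpEbGdrcN5Y29ycA",
              "freebase:guid.9202a8c04000641f800000000004921a",
              "geonames:4347778/"):
    index.add_same_as(BALTIMORE, other)
```

Although every assertion starts from the DBpedia identifier, `index.are_same("geonames:4347778/", "cyc:concept/Mx4rvVin-5wpEbGdrcN5Y29ycA")` is true, which is the "works both ways" property extended by transitivity.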

Linked Data Cloud, March 2009

Four principles for linked data

• Use URIs to identify things that you expose to the Web as resources

• Use HTTP URIs so that people can locate and look up (dereference) these things.

• When someone looks up a URI, provide useful information

• Include links to other, related URIs in the exposed data as a means of improving information discovery on the Web

-- Tim Berners-Lee, 2006

4.5 billion triples for free

•The full public LOD dataset has about 4.5 billion triples as of March 2009

•Linking assertions are spotty, but probably include on the order of 10M equivalences

•Availability:
–Download the data in RDF
–Query it via public SPARQL servers
–Load it as an Amazon EC2 public dataset
–Launch it and the required software as an Amazon public AMI

Wikitology

We’ve been exploring a different approach to derive an ontology from Wikipedia through a series of use cases:
–Identifying user context in a collaboration system from documents viewed (2006)
–Improving IR accuracy by adding Wikitology tags to documents (2007)
–ACE: cross-document co-reference resolution for named entities in text (2008)
–TAC KBP: Knowledge Base population from text (2009)
–Improving a Web search engine by tagging documents and queries (2009)

Wikitology 2.0 (2008)

[Figure: Wikitology 2.0 diagram with components labelled WordNet, Yago, human input & editing, databases, Freebase KB, RDF, text and graphs]

Wikitology tagging

•Using Serif’s output, we produced an entity document for each entity. It included the entity’s name, nominal and pronominal mentions, APF type and subtype, and words in a window around the mentions

•We tagged entity documents using Wikitology, producing vectors of (1) terms and (2) categories for the entity

•We used the vectors to compute features measuring entity pair similarity/dissimilarity
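One of the similarity features can be sketched as cosine similarity between sparse tag vectors, stored here as dictionaries from tag to weight. This is an illustration under assumptions: `cosine_similarity` is a name chosen for the sketch, the first vector uses two weights from the Webb Hubbell example in this deck, and the second and third vectors are invented.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * v[tag] for tag, weight in u.items() if tag in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Article tag weights taken from the Webb Hubbell example.
entity_a = {"Webster_Hubbell": 1.000, "Whitewater_controversy": 0.222}
# Invented vectors for two other entity documents.
entity_b = {"Webster_Hubbell": 0.9, "Whitewater_controversy": 0.3}
entity_c = {"Hubbell_Center": 1.0}

sim_ab = cosine_similarity(entity_a, entity_b)
sim_ac = cosine_similarity(entity_a, entity_c)
```

Entities whose documents are tagged with the same articles score close to 1, while entities with no tags in common score 0, so the value can be used directly as a pairwise feature.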

Wikitology Entity Document & Tags

Wikitology entity document

<DOC>
<DOCNO>ABC19980430.1830.0091.LDC2000T44-E2</DOCNO>
<TEXT>
Webb Hubbell
PER
Individual
NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell"
PRO: "he" "him" "his"
abc's accountant after again ago all alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictment inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years
</TEXT>
</DOC>

Wikitology article tag vector

Webster_Hubbell 1.000
Hubbell_Trading_Post_National_Historic_Site 0.379
United_States_v._Hubbell 0.377
Hubbell_Center 0.226
Whitewater_controversy 0.222

Wikitology category tag vector

Clinton_administration_controversies 0.204
American_political_scandals 0.204
Living_people 0.201
1949_births 0.167
People_from_Arkansas 0.167
Arkansas_politicians 0.167
American_tax_evaders 0.167
Arkansas_lawyers 0.167

[Callouts on the entity document: name; type & subtype; mention heads; words surrounding mentions]

Top Ten Features (by F1)

Prec.   Recall  F1      Feature description
90.8%   76.6%   83.1%   some NAM mention has an exact match
92.9%   71.6%   80.9%   Dice score of NAM strings (based on the intersection of NAM strings, not words or n-grams of NAM strings)
95.1%   65.0%   77.2%   the/a longest NAM mention is an exact match
86.9%   66.2%   75.1%   Similarity based on cosine similarity of Wikitology Article Medium article tag vector
86.1%   65.4%   74.3%   Similarity based on cosine similarity of Wikitology Article Long article tag vector
64.8%   82.9%   72.8%   Dice score of character bigrams from the 'longest' NAM string
95.9%   56.2%   70.9%   all NAM mentions have an exact match in the other pair
85.3%   52.5%   65.0%   Similarity based on a match of entities' top Wikitology article tag
85.3%   52.3%   64.8%   Similarity based on a match of entities' top Wikitology article tag
85.7%   32.9%   47.5%   Pair has a known alias
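Two of the features above are Dice scores. As a rough sketch, the Dice coefficient of two sets is twice the size of their intersection divided by the sum of their sizes. The code below applies it to sets of NAM strings and to sets of character bigrams; the function names, the lower-casing and the use of sets rather than multisets are assumptions of this sketch, since the slides do not give those details.

```python
def dice(a, b):
    """Dice coefficient of two sets: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def char_bigrams(text):
    """Set of adjacent character pairs in a lower-cased string."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


# NAM strings for two entities; the second set is invented for illustration.
nam_a = {"Hubbell", "Webb Hubbell"}
nam_b = {"Webb Hubbell", "Webster Hubbell"}

string_score = dice(nam_a, nam_b)  # shares one whole string out of 2 + 2
bigram_score = dice(char_bigrams("Webb Hubbell"),
                    char_bigrams("Webster Hubbell"))
```

The string-level score only rewards exact matches of whole names, while the bigram score also gives credit to near matches such as "Webb" and "Webster", which fits the higher recall and lower precision reported for the bigram feature.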

Knowledge Base Population

•The 2009 NIST Text Analysis Conference (TAC) will include a new Knowledge Base Population track

•Goal: discover information about named entities (people, organizations, places) and incorporate it into a KB

•TAC KBP has two related tasks:
–Entity linking: doc. entity mention -> KB entity
–Slot filling: given a document entity mention, find missing slot values in a large corpus

KBs and IE are Symbiotic

[Figure: a cycle between a Knowledge Base and Information Extraction from Text. KB info helps interpret text; IE helps populate KBs]

[Figure: diagram with components labelled Infobox Graph, IR collection, Relational Database, Triple Store, RDF reasoner, Page Link Graph, Category Links Graph, Articles, Wikitology Code, Application Specific Algorithms, and Linked Semantic Web data & ontologies]

Wikipedia’s social network

•Wikipedia has an implicit ‘social network’ that can help disambiguate PER mentions

•When resolving PER mentions in a short document, it is good to prefer KB people who are linked to each other in the KB

•The same can be done for the network of ORG and GPE entities

WSN Data

•We extracted 213K people from DBpedia’s Infobox dataset, ~30K of whom participate in an infobox link to another person

•We extracted 875K people from Freebase, 616K of whom were linked to Wikipedia pages, 431K of whom are in one of 4.8M person-person article links

•Consider a document that mentions two people: George Bush and Mr. Quayle

Which Bush & which Quayle?

Six George Bushes
Nine male Quayles

A simple closeness metric

Let Si = {two-hop neighbors of entity i}

Cij = |intersection(Si,Sj)| / |union(Si,Sj)|

Cij>0 for six of the 56 possible pairs

0.43 George_H._W._Bush -- Dan_Quayle

0.24 George_W._Bush -- Dan_Quayle

0.18 George_Bush_(biblical_scholar) -- Dan_Quayle

0.02 George_Bush_(biblical_scholar) -- James_C._Quayle

0.02 George_H._W._Bush -- Anthony_Quayle

0.01 George_H._W._Bush -- James_C._Quayle
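The closeness metric is the Jaccard similarity of the two-hop neighborhoods of two entities. A minimal sketch follows, assuming the link graph is an undirected adjacency dictionary and that an entity is not counted as its own neighbor; both choices, the function names and the toy graph are assumptions, not details taken from the slides.

```python
from itertools import product


def two_hop_neighbors(graph, node):
    """Entities reachable from node in one or two links, excluding node."""
    one_hop = graph.get(node, set())
    reachable = set(one_hop)
    for neighbor in one_hop:
        reachable |= graph.get(neighbor, set())
    reachable.discard(node)
    return reachable


def closeness(graph, i, j):
    """C_ij = |S_i & S_j| / |S_i | S_j| over two-hop neighbor sets."""
    s_i = two_hop_neighbors(graph, i)
    s_j = two_hop_neighbors(graph, j)
    union = s_i | s_j
    return len(s_i & s_j) / len(union) if union else 0.0


def best_pair(graph, candidates_i, candidates_j):
    """Pick the candidate pair with the highest closeness."""
    return max(product(candidates_i, candidates_j),
               key=lambda pair: closeness(graph, *pair))


# Toy undirected link graph with abstract entity names.
toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B", "E"},
    "E": {"D"},
    "X": {"Y"},
    "Y": {"X"},
}

score_ad = closeness(toy_graph, "A", "D")
score_xd = closeness(toy_graph, "X", "D")
```

In a disambiguation setting, `candidates_i` and `candidates_j` would be the KB entries that match each ambiguous mention, and `best_pair` returns the combination whose neighborhoods overlap most.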

Application to TAC KBP

•Entity network data extracted from DBpedia and Wikipedia provides evidence to support KBP tasks:

–Mapping document mentions into infobox entities

–Mapping potential slot fillers into infobox entities

–Evaluating the coherence of entities as potential slot fillers

Conclusion

•The Semantic Web offers a powerful approach to data interoperability and integration

•The research focus is shifting to a “Web of Data” perspective

•Many research issues remain: uncertainty, provenance, trust, parallel graph algorithms, reasoning over billions of triples, user-friendly tools, etc.

•Just as the Web enhances human intelligence, the Semantic Web will enhance machine intelligence

•The ideas and technology are still evolving

http://ebiquity.umbc.edu/
