The KNOWLEDGESTORE: an Integrated Framework for Ontology Population Bernardo Magnini [email protected] Fondazione Bruno Kessler Trento, Italy Joint work with: Roldano Cattoni, Francesco Corcoglioniti, Christian Girardi, Marco Rospocher, Luciano Serafini Darmstadt, October 17, 2014


Page 1

The KNOWLEDGESTORE: an Integrated Framework for Ontology Population

Bernardo Magnini [email protected]

Fondazione Bruno Kessler Trento, Italy

Joint work with: Roldano Cattoni, Francesco Corcoglioniti, Christian Girardi, Marco Rospocher, Luciano Serafini

Darmstadt, October 17, 2014

Page 2

NewsReader: recording history by processing massive streams of daily news

ICT 316404 FP7-ICT-2011-8

www.newsreader-project.eu

Page 3

HOW DID THE WORLD CHANGE YESTERDAY?

Page 4

Can we handle the news?

§  Information broker LexisNexis archives:
–  1.5 million news articles on a single working day
–  30,000 different sources

Page 5

How did the car industry change during the financial crisis?

•  6 million English articles on the car industry in the LexisNexis archive for the last 10 years

•  2 million Google hits for “Volkswagen takeover” not sorted by publication date

Page 6

A short history of VW and Porsche

Page 7

How to measure the volume of change?

[Figure: timeline from 1995 to 2015 with regions labelled "Past", "New" and "Speculation" and an axis label beginning "volume of entity". Annotations: 200k mentions per year; 10k entities per year; 6 MILLION ARTICLES; HOW MANY CHANGES?]

News snippets shown on the timeline:

•  On 16 September 2008, Porsche increased its shares by another 4.89%, in effect taking control of the company, with more than 35% of the voting rights.

•  6 Jan 2009 – Porsche has been on a quest to takeover VW for more than two years.

Page 8

DAILY NEWS TSUNAMI

•  Volume is very big: 1.5 million items each working day

•  Repeated and duplicated: we cannot distinguish new from old

•  Incomplete and piecemeal: we need to read all to get a complete picture

•  Actual and speculated events: we cannot distinguish the realis from irrealis (speculations, fears and hopes)

•  Inconsistent and contradictory: we cannot tell true from false (who to believe)

•  Opinionated and selective: we do not realize the bias of our sources

Page 9

Requirements for large-scale ontology population

•  Cope with noise in information extraction
–  E.g. event extraction is still not a consolidated technology

•  Fill the gap between structured and unstructured data
–  Knowledge in the information extraction loop
–  Take advantage of “background knowledge” (e.g. Wikipedia)

•  Larger variety of facts (e.g. relations, events)

•  Exploit cross-document co-reference

•  Ontology Population as an incremental process

Page 10

NewsReader Architecture

Page 11

KnowledgeStore: Objectives

•  Three-layer architecture:
–  resource, mention, entity

•  Allows storing massive amounts of data about:
–  entities extracted from the data
–  links to the data sources
–  background domain knowledge

•  Provides access via text-based search and semantic query languages

•  Provides reasoning services over the knowledge it contains

Page 12

The Knowledge Store: Content

•  Resource Layer
–  Textual documents (e.g. a newspaper article)
–  But also images and videos
–  Linguistic annotations of the textual resources (e.g. a whole document processed by an NLP pipeline); NAF format in NewsReader
–  Relevant metadata (e.g. date, source, author, language)

[Diagram: Resource Layer]

Page 13

The Knowledge Store: Content

•  Mention Layer
–  Relevant portions of a resource document with their linguistic annotations (e.g. names of people and places, an event mention, the area representing a person in an image)
–  All orthographical variants are stored (e.g. “B. Magnini” and “Bernardo Magnini”)
–  Coreference among mentions (intra- and cross-document)
–  Links to the data sources from which mentions have been extracted
–  Relevant metadata (e.g. frequency of occurrence, confidence of mention recognition)

[Diagram: Resource Layer, Mention Layer]

Page 14

The Knowledge Store: Content

•  The Entity Layer
–  Unique instances of mentions that have been extracted from the data, after mention coreference
–  Represented as RDF triples
–  Orthographical variants are not recorded
–  Representation based on ontological schemas (e.g. SEM for events)
–  Links to background domain knowledge (e.g. Wikipedia)
–  Relevant metadata (e.g. factuality, provenance, confidence)

[Diagram: Resource Layer, Mention Layer, Entity Layer]

Page 15

[Diagram: UML class diagram of the KnowledgeStore data model. Classes and attributes as shown:]

ks:Resource
–  uri: URI
–  ks:storedAs: ks:Representation
–  rdfs:comment: string

NAFDocument
–  version: string
–  dct:identifier: string
–  layer: NAFLayer[1..*]
–  dct:creator: NAFProcessor[1..*]
–  dct:language: dct:LinguisticSystem

News
–  dct:title: string
–  dct:publisher: dct:Agent
–  dct:creator: dct:Agent
–  dct:created: date
–  dct:spatial: dct:Location
–  dct:temporal: time:Interval
–  dct:subject: URI
–  dct:rights: dct:RightsStatement
–  dct:rightsHolder: dct:Agent
–  dct:language: dct:LinguisticSystem
–  originalFileName: string
–  originalFileFormat: string
–  originalPages: int

ks:Mention
–  uri: URI
–  nif:beginIndex: int
–  nif:endIndex: int
–  nif:anchorOf: string
–  rdfs:comment: string

TimeMention
–  value: string
–  timeType: TIMEX3Type
–  functionInDocument: FunctionInDocument
–  quant: string
–  freq: string
–  mod: TIMEX3Modifier
–  temporalFunction: bool

EventMention
–  eventClass: EventClass
–  pred: string
–  certainty: Certainty
–  factuality: Factuality
–  factualityConfidence: float
–  pos: PartOfSpeech
–  tense: Tense
–  aspect: Aspect
–  polarity: Polarity
–  modality: string
–  framenetRef: URI
–  propbankRef: URI
–  verbnetRef: URI
–  nombankRef: URI

EntityMention
–  localCorefID: string

ObjectMention
–  syntacticHead: string
–  syntacticType: SyntacticType
–  entityType: EntityType
–  entityClass: EntityClass

ValueMention
–  valueType: ValueType

Participation
–  thematicRole: string
–  framenetRef: URI
–  propbankRef: URI
–  verbnetRef: URI
–  nombankRef: URI

TLink
–  relType: TLinkType

Classes shown without attributes: RelationMention, TimeOrEventMention, SignalMention, CSignalMention, CLink, SLink, GLink

ks:Entity
–  uri: URI

ks:Axiom
–  uri: URI
–  ks:encodedBy: rdf:Statement[1..*]
–  crystallized: bool
–  dc:source: URI
–  confidence: float
–  rdfs:comment: string

ks:Context
–  uri: URI
–  sem:hasPointOfView: sem:PointOfView
–  sem:hasTimeValidity: time:Interval

Association labels and multiplicities shown in the diagram: ks:containedIn (1), annotationOf (1), ks:refersTo (0..1), ks:referredBy (0..*), ks:expresses, ks:expressedBy (0..*), ks:describes (1..*), ks:describedBy (0..*), ks:holdsIn (1), source (1) and target (1) on each link class, signal (0..1), csignal (0..1), anchorTime, beginPoint, endPoint, valueFromFunction

Page 16

The Knowledge Store: Content

Mention Layer: Mention M2

extent | Valentino Rossi
type | PER
firstname | Valentino
lastname | Rossi
start | 33

Entity Layer: Entity E1 (facts)

predicate | object | source
type | Pilot | motogp.com
firstName | Valentino | motogp.com, R3
lastName | Rossi | motogp.com, R3
gender | male | motogp.com, R3
birthDate | 1979-02-16 | wikipedia.it
birthPlace | Urbino | wikipedia.it
height | 182 | motogp.com
weight | 67 | motogp.com
team | Ducati | R3

Resource Layer: Resource R3

title | Rossi torna single (Rossi is again single)
creator | l’Adige
format | Text
date | 2012/02/27
content | GOSSIP - Il campione della Ducati Valentino Rossi e la sua fidanzata Marwa Klebi si sono lasciati. … (Ducati champion Valentino Rossi and his girlfriend Marwa Klebi break up. …)

Arrow labels in the diagram: "Occurs in" and "References".
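The three layers in this example can be sketched as plain Python records. This is only an illustration under stated assumptions, not the KnowledgeStore API: the class names, field names and the `extent` property are hypothetical, and the character offset is computed from the sample string instead of being copied from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Resource layer: a stored document plus some metadata."""
    rid: str
    title: str
    creator: str
    content: str


@dataclass
class Mention:
    """Mention layer: a span of a resource, identified by character offsets."""
    mid: str
    resource: Resource
    start: int
    end: int
    mention_type: str

    @property
    def extent(self) -> str:
        # The surface text is recovered from the resource, not stored twice.
        return self.resource.content[self.start:self.end]


@dataclass
class Entity:
    """Entity layer: facts as (predicate, object, sources) plus linked mentions."""
    eid: str
    facts: list = field(default_factory=list)
    mentions: list = field(default_factory=list)


r3 = Resource("R3", "Rossi torna single", "l'Adige",
              "GOSSIP - Il campione della Ducati Valentino Rossi e la sua "
              "fidanzata Marwa Klebi si sono lasciati.")
start = r3.content.find("Valentino Rossi")
m2 = Mention("M2", r3, start, start + len("Valentino Rossi"), "PER")
e1 = Entity("E1", facts=[("type", "Pilot", ["motogp.com"]),
                         ("team", "Ducati", ["R3"])])
e1.mentions.append(m2)  # M2 references E1 and occurs in R3

print(m2.extent)
```

Keeping only offsets on the mention means the resource text remains the single source of truth, which is one plausible reason for separating the two layers.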

Page 17

High Level Data Model

[Diagram: boxes Resource, Mention, Entity, Statement (crystallized: boolean), EntityMention and RelationMention; edge labels "part of", "refers to", "described by" and "arguments".]

Resource Layer and Mention Layer (news text):

Indonesia Hit By Earthquake

A United Nations assessment team was dispatched to the province after two quakes, measuring 7.6 and 7.4, struck west of Manokwari Jan. 4. At least five people were killed, 250 others injured and more than 800 homes destroyed by those temblors, according to the UN.

Entity Layer (dbpedia:United_Nations):

dbpedia:United_Nations rdf:type yago:PoliticalSystems

dbpedia:United_Nations rdfs:label "United Nations"@en

dbpedia:United_Nations foaf:homepage <http://www.un.org/>

Page 18

KnowledgeStore: Context

[Diagram: components around the Knowledge Store (Resource Layer, Mention Layer, Entity Layer) and their data flows]

•  Resource populators: store news (Source 1, Source 2, other, . . .)
•  Knowledge populators: store background knowledge
•  Resource processors (read resources, write annotations)
–  tokenization & lemmatization
–  part of speech tagging
–  word sense disambiguation
–  parsing (dep./constituency)
–  keyphrase extraction
•  Mention processors (read annotations & mentions, write mentions)
–  named entity recognition
–  event recognition
–  semantic role labelling
–  Tlink / Clink / Slink tagging
–  entity linking (wikification…)
•  Entity processors (read mentions & background knowledge, write entities & statements)
–  entity & event coreference
–  event chaining
–  event significance & rel.
–  narrative graph extraction
–  crystallization
•  Decision Support System: (mixed) queries
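The read/write pattern of the three processor families can be sketched with a toy in-memory store. Everything here is an assumption made for illustration: the `store` layout, the function names and the deliberately naive processing steps (regex tokenization, capitalized-token "NER", one entity per surface form) stand in for the real NLP modules.

```python
import re

# Toy in-memory store with one slot per kind of data (hypothetical layout).
store = {"resources": {}, "annotations": {}, "mentions": [], "entities": {}}


def populate(store, rid, text):
    """Resource populator: store a news text."""
    store["resources"][rid] = text


def resource_processor(store):
    """Read resources, write annotations (here: naive tokenization)."""
    for rid, text in store["resources"].items():
        store["annotations"][rid] = [
            (m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)
        ]


def mention_processor(store):
    """Read annotations, write mentions (here: runs of capitalized tokens)."""
    for rid, tokens in store["annotations"].items():
        run = []
        for tok in tokens + [("", 0, 0)]:  # sentinel flushes the last run
            if tok[0][:1].isupper():
                run.append(tok)
            elif run:
                begin, end = run[0][1], run[-1][2]
                store["mentions"].append({
                    "resource": rid, "begin": begin, "end": end,
                    "text": store["resources"][rid][begin:end],
                })
                run = []


def entity_processor(store):
    """Read mentions, write entities (here: one entity per surface form)."""
    for mention in store["mentions"]:
        store["entities"].setdefault(mention["text"], []).append(mention)


populate(store, "news_00001", "the United Nations sent a team to Manokwari")
for stage in (resource_processor, mention_processor, entity_processor):
    stage(store)
print(sorted(store["entities"]))
```

Because each stage only reads what earlier stages wrote to the store, stages can be rerun or replaced independently, which is the property the diagram emphasizes.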

Page 19

Resource annotation

News: nwr:news_00001 (file news_00001.txt)

dc:title | Indonesia’s West Papua Province Hit by Earthquake (Update2)
dc:creator | bloomberg.com
dc:language | EN
dc:issued | 2009-01-07T01:55:00-05:00
nfo:wordCount | 287
nfo:characterCount | 1778
file | news_00001.txt
… | …

Page 20

Mentions: NAF example

Toyota brought Lexus to Japan in 2005.

Page 21

Mentions: NAF example

Toyota brought Lexus to Japan in 2005.

Page 22

Mentions (organizations)

ORG Mention: nwr:orgmen_00001

nif:beginIndex | 146
nif:endIndex | 160
nif:anchorOf | United Nations
foaf:name | United Nations
… | …

ORG Mention: nwr:orgmen_00002

nif:anchorOf | the UN
head | UN
foaf:name | UN
… | …

[Tabs: Non-event entities, Events, Time expressions, TLink signals, CLink signals]
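The offset properties of a mention can be checked mechanically. The sketch below assumes that `nif:endIndex` is exclusive, so that `endIndex - beginIndex` equals the length of `nif:anchorOf`; the slide values (146, 160, "United Nations") are consistent with that reading. The helper names `make_mention` and `is_consistent` and the sample sentence are hypothetical.

```python
def make_mention(text, surface, uri):
    """Build a NIF-style mention for the first occurrence of `surface` in `text`."""
    begin = text.find(surface)
    if begin < 0:
        raise ValueError(f"{surface!r} not found")
    return {"uri": uri, "nif:beginIndex": begin,
            "nif:endIndex": begin + len(surface), "nif:anchorOf": surface}


def is_consistent(mention, text=None):
    """Check that the offsets agree with the anchor (and with the text, if given)."""
    begin, end = mention["nif:beginIndex"], mention["nif:endIndex"]
    anchor = mention["nif:anchorOf"]
    if end - begin != len(anchor):
        return False
    return text is None or text[begin:end] == anchor


# Values from the slide: 160 - 146 equals len("United Nations").
slide_mention = {"uri": "nwr:orgmen_00001", "nif:beginIndex": 146,
                 "nif:endIndex": 160, "nif:anchorOf": "United Nations"}
print(is_consistent(slide_mention))

# A short sample sentence, so the offsets differ from the slide's.
sample = "A United Nations assessment team was dispatched to the province."
print(make_mention(sample, "United Nations", "nwr:example"))
```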

Page 23

Mentions: linking to resource

News: nwr:news_00001

dc:title | Indonesia’s West Papua Province Hit by Earthquake (Update2)
dc:creator | bloomberg.com
dc:language | EN
dc:issued | 2009-01-07T01:55:00-05:00
nfo:wordCount | 287
nfo:characterCount | 1778
file | news_00001.txt
… | …

Mentions linked to the resource:

mention type | mention | dc:isPartOf
ORG Mention | nwr:orgmen_00001 | nwr:news_00001
ORG Mention | nwr:orgmen_00002 | nwr:news_00001
Event Mention | nwr:evmen_00001 | nwr:news_00001
Event Mention | nwr:evmen_00002 | nwr:news_00001
CLINK Mention | nwr:clmen_00001 | nwr:news_00001
Part. Mention | nwr:pmen_00002 | nwr:news_00001

Text spans highlighted in the article: “United Nations”, “the UN”, “quakes”, “temblors”, “destroyed by those temblors”, “according to the UN”

Page 24

Entities: SEM format

ENTITY INSTANCE

<http://dbpedia.org/resource/Toyota>
    a nwr:organization ;
    rdfs:label "Toyota" , "Toyota motor" ;
    gaf:denotedBy
        <nwr:data/cars/2013/1/1/5760-PM51-JD34-P4RM.xml#char=98,104&word=w18&term=t18> ,
        <nwr:data/cars/2013/1/1/57K5-FKK1-DYBW-2534.xml#char=44934,44940&word=w8114&term=t8114> .
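The `gaf:denotedBy` values pack a pointer back to the text into the URI fragment. A minimal sketch of unpacking one follows, assuming the fragment reads as `char=<begin>,<end>&word=<word id>&term=<term id>`; that interpretation and the function name `parse_mention_uri` are assumptions, not something the slides define.

```python
from urllib.parse import parse_qs


def parse_mention_uri(uri):
    """Split a mention URI into document, character span, word id and term id."""
    document, _, fragment = uri.partition("#")
    fields = parse_qs(fragment)
    begin, end = (int(x) for x in fields["char"][0].split(","))
    return {"document": document, "begin": begin, "end": end,
            "word": fields["word"][0], "term": fields["term"][0]}


uri = ("nwr:data/cars/2013/1/1/5760-PM51-JD34-P4RM.xml"
       "#char=98,104&word=w18&term=t18")
ref = parse_mention_uri(uri)
print(ref["document"], ref["begin"], ref["end"])
```

Under this reading the span 98 to 104 is six characters long, which matches the label "Toyota".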

Page 25

Entities: SEM format

EVENT INSTANCE

<nwr:data/cars/2013/1/1/5758-BPN1-F0J6-D2T2.xml#sellEvent>
    a sem:Event , fn:Commerce_sell ;
    rdfs:label "sell" ;
    gaf:denotedBy
        <nwr:data/cars/2013/1/1/5758-BPN1-F0J6-D2T2.xml#char=1352,1356&word=w251&term=t251> ,
        <nwr:data/cars/2013/1/1/5760-PM51-JD34-P4H7.xml#char=1536,1540&word=w275&term=t275> .

Page 26

Semantic relations as named graphs

<nwr:/data/cars/2013/1/1/5758-BPN1-F0J6-D2T2.xml#pr25,rl55> {
    <nwr:data/cars/2013/1/1/5722-S821-F0J6-D48N.xml#sellEvent>
        sem:hasActor ; fn:Commerce_sell#Seller
        <http://dbpedia.org/resource/Magyar_Suzuki> .
}

<nwr:data/cars/2013/1/1/5760-PM51-JD34-P4H7.xml#pr46,rl114> {
    <nwr:data/cars/2013/1/1/5758-BPN1-F0J6-D2T2.xml#sellEvent>
        sem:hasPlace
        <http://dbpedia.org/resource/South_Africa> .
}

<nwr:data/cars/2013/1/1/5760-PM51-JD34-P4H7.xml#docTime_26> {
    <nwr:data/cars/2013/1/1/5760-PM51-JD34-P4H7.xml#sellEvent>
        sem:hasTime
        <nwr:time/2013-01-01> .
}

Page 27

Properties of relations

PROVENANCE

<nwr:data/cars/2013/1/1/57R8-5451-F0J6-D2GH.xml#pr25,rl55>
    gaf:denotedBy <nwr:data/cars/2013/1/1/57R8-5451-F0J6-D2GH.xml#rl55> ;
    prov-o:wasAttributedTo <nwr:sourceowner/Peru_Autos_Report> .

FACTUALITY

<nwr:data/cars/2013/1/1/57K5-FKK1-DYBW-2534.xml#facValue_1125> {
    <nwr:data/cars/2013/1/1/57K5-FKK1-DYBW-2534.xml#sellEvent>
        nwr:hasFactBankValue "CT+" .
}
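Named graphs let a relation be both stated and described: the graph holds the triple, and the graph's own URI can be the subject of provenance statements. The sketch below models this with plain tuples. It is an illustration only: the quad layout, the use of `None` for a default graph, and the helper names are assumptions; the URIs and values are copied from the slides.

```python
from collections import defaultdict

DOC = "nwr:data/cars/2013/1/1/57K5-FKK1-DYBW-2534.xml"
REL = "nwr:data/cars/2013/1/1/57R8-5451-F0J6-D2GH.xml"

# Each quad is (graph, subject, predicate, object); None marks the default graph.
quads = [
    (f"{DOC}#facValue_1125", f"{DOC}#sellEvent", "nwr:hasFactBankValue", "CT+"),
    (None, f"{REL}#pr25,rl55", "gaf:denotedBy", f"{REL}#rl55"),
    (None, f"{REL}#pr25,rl55", "prov-o:wasAttributedTo",
     "nwr:sourceowner/Peru_Autos_Report"),
]


def index_by_graph(quads):
    """Group triples by the named graph that contains them."""
    graphs = defaultdict(list)
    for graph, s, p, o in quads:
        graphs[graph].append((s, p, o))
    return graphs


def describe_graph(quads, graph_uri):
    """Return the statements made about a named graph (its metadata)."""
    return {p: o for _, s, p, o in quads if s == graph_uri}


graphs = index_by_graph(quads)
print(graphs[f"{DOC}#facValue_1125"])
print(describe_graph(quads, f"{REL}#pr25,rl55"))
```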

Page 28

High-level KS architecture

[Diagram: clients, server components, data sources and configuration of the KnowledgeStore]

Clients
•  Applications: direct access to KS API; (some) linguistic processors
•  Populators: loading resources, annotations, background knowledge in specific formats (e.g., RDF, TAF)

Server
•  HTTP REST API: CRUD & bulk services, (mixed) queries, specialized access patterns, map/reduce hook
•  KS Frontend: API implementation on top of lower layers; optional transactions & data validation
•  HBase + Hadoop: distributed & replicated for scalability and fault-tolerance
•  Triple Store: possibly distributed
•  Stored data: Mention, Resource, Entity, Statement, RDF Triples

Management and configuration
•  Mgmt. Scripts: start / stop, backup / restore, configuration, statistics gathering
•  Configuration: Inference Rules, Data Model Properties, Deployment Properties

Other labels in the diagram: Data Sources; Rule-based inference (partial) Replication

Page 29

Use Case 1: Trentino Media

§  Goal: acquire and load news and background knowledge

§  Multimedia news
–  ~56 GB of textual news, images and videos in the Italian language
–  acquired from 4 news providers local to the Italian Trentino region
–  daily updated, since 1999

[Workflow: Content acquisition → Resource preprocessing → Mention extraction → Coreference resolution → Linking & entity creation]

Provider | News | Images | Videos
l’Adige | 733,738 | 21,525 | -
VitaTrentina | 33,403 | 14,198 | -
RTTR | 2,455 | - | 120 h
Fed. Cooperative | 1,402 | - | -
All | 770,998 | 35,723 | 120 h

Coop. Trentina

Page 30

Use Case 1: Trentino Media

§  Background knowledge
•  acquired from Italian Wikipedia, sport-related community sites, local and national level public administrations and government bodies
•  manual acquisition using ad-hoc Web site wrappers

Topic | Contexts | Persons | Organizations | Avg. properties | Facts (triples)
sport | 136 | 8,570 | 191 | 3.81 | 192,115
culture | 20 | 9,785 | 1 | 2.00 | 33,236
justice | 7 | 354 | 10 | 2.16 | 1,575
economy | 7 | 49 | 1,203 | 4.47 | 11,147
education | 6 | 850 | 82 | 2.35 | 3,573
politics | 535 | 8,402 | 319 | 4.64 | 98,780
religion | 3 | 1,391 | 0 | 1.67 | 12,855
total | 714 | 28,687 | 1,806 | 3.64 | 352,244

Page 31

Trentino Media: Resource Preprocessing

§  Goal: normalize and annotate news so as to ease further processing
§  Several operations:
–  Conversion of multimedia resources to common formats
–  Segmentation of complex news
•  separation of individual stories in a news broadcast
•  separation of figures and captions in complex XML news articles
–  Automatic Speech Recognition
–  Annotations of news with linguistic taggers
•  part of speech tagging, based on TextPro tool [Pianta08]
•  temporal expression tagging, based on TextPro
•  key concept extraction, based on KX tool [Pianta10]

Page 32

Trentino Media: Mention Extraction

§  Mention extraction is based on the TextPro Mention Detection module
–  system based on the supervised training of a statistical model
–  training for the Italian language based on the Italian Content Annotation Bank (I-CAB) dataset [Magnini06]; measured F1 value: 82%
§  Mention extraction statistics

News provider | PER mentions | ORG mentions | GPE/LOC mentions | Total mentions
l’Adige | 5,387,994 | 3,100,994 | 3,052,011 | 11,540,999
VitaTrentina | 144,486 | 100,789 | 136,611 | 381,886
RTTR | 19,290 | 15,493 | 27,404 | 62,187
Fed. Coop. | 14,404 | 12,731 | 8,513 | 35,648
All | 5,566,174 | 3,230,007 | 3,224,539 | 12,020,720
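The totals in the mention statistics can be recomputed from the per-provider counts. The short check below copies the figures from the table and derives row totals, column totals and each provider's share; the variable names are arbitrary.

```python
# Mention counts per provider, copied from the table: (PER, ORG, GPE/LOC).
counts = {
    "l'Adige":      (5_387_994, 3_100_994, 3_052_011),
    "VitaTrentina": (144_486, 100_789, 136_611),
    "RTTR":         (19_290, 15_493, 27_404),
    "Fed. Coop.":   (14_404, 12_731, 8_513),
}

row_totals = {name: sum(row) for name, row in counts.items()}
column_totals = tuple(sum(col) for col in zip(*counts.values()))
grand_total = sum(column_totals)

for name, total in row_totals.items():
    print(f"{name:<13} {total:>10,}  ({total / grand_total:.1%} of all mentions)")
print("All", column_totals, grand_total)
```

The recomputed sums agree with the "All" row and the "Total mentions" column, and they show that a single provider, l'Adige, accounts for the large majority of mentions.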

Page 33

Trentino Media: Coreference Resolution

§  Goal: put mentions referring to the same entity in a mention cluster
–  cross-document (more complex) as mentions may belong to different news

Entity type | PER | ORG | GPE/LOC | Total
Mention clusters | 340,147 | 16,649 | 52,478 | 409,274

Example news fragments:

•  … il presidente della Provincia di Trento Lorenzo Dellai ha lodato l'iniziativa per molteplici aspetti ... (… the president of the Province of Trento, Lorenzo Dellai, praised the initiative in many respects ...)
•  ... questa mattina il presidente Dellai ha incontrato le parti sociali sulla manovra che la Giunta sta mettendo a punto … (... this morning President Dellai met the social partners about the budget measure that the Council is finalizing …)
•  … e intanto l'Alto Adige è tra le regioni europee con il più basso tasso di disoccupazione (… and meanwhile Alto Adige is among the European regions with the lowest unemployment rate)

Resulting cluster: Lorenzo Dellai (PERSON entity)
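To make the idea of a mention cluster concrete, here is a deliberately naive baseline that groups person mentions across documents by their last token. It is not the method used in Trentino Media; the function name and the document ids are invented for the example.

```python
from collections import defaultdict


def cluster_by_surname(mentions):
    """Naive cross-document clustering: group PER mentions by their last token."""
    clusters = defaultdict(list)
    for doc_id, surface in mentions:
        key = surface.split()[-1].lower()
        clusters[key].append((doc_id, surface))
    return dict(clusters)


mentions = [
    ("news_a", "Lorenzo Dellai"),  # document ids are made up for the example
    ("news_b", "Dellai"),
    ("news_c", "S. Dellai"),
]
clusters = cluster_by_surname(mentions)
print(clusters["dellai"])
```

All three mentions land in one cluster, including "S. Dellai", who may well be a different person. Surface forms alone are not enough to separate such cases, which is why additional evidence such as background knowledge is needed.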

Page 34

Trentino Media: Linking & Entity Creation

§  Goal: link mention clusters to entities in the background knowledge and to external resources; create new entities from unlinked clusters

§  Different types of linking are performed
–  linking to GeoNames toponyms of GPE/LOC clusters, using GeoCoder
–  linking to background knowledge entities of PER and ORG clusters
–  linking to Wikipedia pages of all the entities, using the WikiMachine tool

§  New entities are created from unlinked clusters

§  Linking and entity statistics:

Entity type | PER | ORG | GPE/LOC | Total
Linked clusters | 5.03% | 7.96% | 48.64% | 10.74%
Linked mentions | 22.36% | 12.02% | 65.04% | 31.03%
Resulting entities | 321,713 | 17,129 | 52,478 | 421,320

Page 35

Trentino Media: Linking & Entity Creation

§  Linking to background knowledge is context-driven [Tamilin10]
–  exploits the contextual organization of knowledge in the KNOWLEDGESTORE
–  84.5% accuracy (i.e., correctly linked clusters) on a gold standard of 298 clusters

Example: linking the mention “Dellai” (PER)

➊ Map the textual context of each cluster mention to appropriate values of the contextual dimensions (here, topic: research; location: Trentino; time: 2009)

➋ Select a ranked list of KNOWLEDGESTORE contexts that best match the chosen dimensional values

➌ Link the mention only to entities appearing in the selected formal contexts

[Diagram: KNOWLEDGESTORE contexts labelled (topic, location, time): (-, World, -); (Research, World, -); (Research, Italy, 2009); (Politics, Trento, 2009); (Politics, Civezzano, 2009); (Politics, Trentino, 2009); (Politics, Trentino, 2004). Entities shown: EIT, L. Dellai (pres.), S. Dellai, FBK, L. Dellai (mayor).]
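The three steps can be sketched as a small ranking procedure. This is a rough illustration, not the algorithm of [Tamilin10]: the scoring rule (number of matching dimensions), the `top_k` cut-off and above all the assignment of entities to contexts are assumptions, since the slide does not say which entity belongs to which context.

```python
# Background contexts as (topic, location, year) -> entities.
# The context labels come from the slide; which entity lives in which
# context is NOT recoverable from it and is assumed here for illustration.
CONTEXTS = {
    ("research", "Italy", 2009): ["FBK", "EIT"],
    ("politics", "Trento", 2009): ["S. Dellai"],
    ("politics", "Civezzano", 2009): ["L. Dellai (mayor)"],
    ("politics", "Trentino", 2009): ["L. Dellai (pres.)"],
    ("politics", "Trentino", 2004): ["L. Dellai (pres.)"],
}


def rank_contexts(dims, contexts=CONTEXTS):
    """Step 2: order contexts by how many dimensional values they share with `dims`."""
    def score(ctx):
        return sum(1 for a, b in zip(ctx, dims) if a == b)
    return sorted(contexts, key=score, reverse=True)


def link(surname, dims, top_k=2):
    """Step 3: look for candidates only in the `top_k` best-matching contexts."""
    candidates = []
    for ctx in rank_contexts(dims)[:top_k]:
        candidates += [e for e in CONTEXTS[ctx]
                       if surname in e and e not in candidates]
    return candidates


# Step 1 (mapping the text to dimensional values) is taken as given here.
dims = ("research", "Trentino", 2009)
print(link("Dellai", dims))
```

Restricting the search to the best-matching contexts is what removes the other "Dellai" candidates before any name comparison takes place.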

Page 36

Other use cases

•  USE CASES:
–  Car Industry news (2003-2013): 63K articles, 1.7M event instances, 445K actors, 63K places, 41K DBpedia entities and 46M triples.
–  TechCrunch (2005-2013): 43K articles, 1.6M event instances, 300K actors, 28K DBpedia entities and 24M triples.
–  Fifa World Cup: 200K documents, 9M Events, 35K actors (30% pers, 70% org), 15K places, 6K dates and 136M triples.
–  Dutch House of Representatives Bank Inquiry, 1M documents (900K XML and 100K PDF), pending

•  BENCHMARK DATA:
–  WikiNews: 19K English, 8K Italian, 7K Spanish and 1K Dutch. 69 Apple news documents for annotation.
–  ECB+: 43 topics and 482 articles from GoogleNews, extended with 502 GoogleNews articles for 43+ topics (similar but different event).

Page 37

Page 38

Page 39

TOP events per year

Page 40

TOP actors per year

Page 41

TOP places per year

Page 42

CARS: WHERE & WHEN

Page 43

Use Case: Worldcup

•  We processed 212,511 news articles in about 3 weeks (sources: BBC, Guardian, Lexis Nexis)

•  We stored the resources and the result in the KnowledgeStore:

•  45GB storage of news and processed mentions

•  22GB storage for triples expressing statements on instances in RDF, e.g. who was involved, when it happened, where…

•  136,075,006 triples from news (event statements)

•  104,595,567 triples from background data (DBpedia statements)

Page 44

Use Case: Worldcup

Level | Events | Actors | Places | Time
Mention | 25,470,763 | 3,590,012 (58% org., 42% pers.) | 2,279,939 | 1,828,134
Instance | 9,387,356 | 51,283 dbp (30% org., 70% pers.) | 15,219 dbp | 5,961

33,123 events involving Blatter

Page 45

What Blatter does…

Number of events per verb:

1143 say, 523 have, 440 make, 408 add, 370 tell, 272 give, 260 take, 254 meet, 232 insist, 209 want, 208 comment, 196 suggest, 191 stand

188 go, 180 visit, 180 call, 172 get, 169 come, 162 do, 162 challenge, 160 vote, 145 write, 140 play, 139 ask, 138 run, 135 claim

133 use, 132 announce, 131 look, 129 win, 129 try, 127 support, 125 confirm, 122 attend, 121 believe, 119 describe, 117 re-elect, 116 speak, 116 seek
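A ranking like this one can be derived by counting event labels per actor. The sketch below shows the counting step only; the event tuples are made up, and the function name `top_predicates` is hypothetical. Real counts would come from the event instances stored in the KnowledgeStore.

```python
from collections import Counter


def top_predicates(events, actor, n=3):
    """Count event labels for one actor and return the n most frequent."""
    labels = Counter(label for who, label in events if who == actor)
    return labels.most_common(n)


# Made-up (actor, event label) pairs, for illustration only.
events = [("Blatter", "say"), ("Blatter", "say"), ("Blatter", "have"),
          ("Blatter", "say"), ("Blatter", "make"), ("Blatter", "have"),
          ("FIFA", "announce")]
print(top_predicates(events, "Blatter"))
```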

Page 46

KS Web site: https://knowledgestore.fbk.eu

§  Download source code, selected data
§  Documentation, demo, video

Page 47

SemEval 2015 Task T4 - TimeLine: Cross-Document Event Ordering

-  Task description: build an ordered list of events involving a specific pre-selected entity
-  Trial and test data: Wikinews
-  Evaluation: (i) time anchors + ordering, (ii) only ordering
-  Subtasks: (i) timeline on raw texts, (ii) timeline on texts annotated with events

Register to the TimeLine task Google Group: semeval-task4-timeline

Organizers: A-L. Minard, E. Agirre, I. Aldabe, M. van Erp, B. Magnini, G. Rigau, M. Speranza, R. Urizar

http://www.newsreader-project.eu
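The task output, an ordered list of events for one target entity, can be sketched in a few lines. The event records below are toy data: the ids, participants and the dictionary layout are invented for the example, and the sketch assumes every event already carries a time anchor, which is the hard part of the actual task.

```python
from datetime import date


def build_timeline(events, entity):
    """Return the events involving `entity`, ordered by their time anchor."""
    involved = [e for e in events if entity in e["participants"]]
    return sorted(involved, key=lambda e: e["anchor"])


# Toy events; ids, anchors and participants are invented for the example.
events = [
    {"id": "e2", "anchor": date(2009, 1, 6), "participants": {"Porsche", "VW"}},
    {"id": "e1", "anchor": date(2008, 9, 16), "participants": {"Porsche"}},
    {"id": "e3", "anchor": date(2008, 12, 1), "participants": {"Toyota"}},
]
print([e["id"] for e in build_timeline(events, "Porsche")])
```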

Page 48

Conclusion

•  Our contributions
–  integration of techniques and technologies from different fields (NLP, Semantic Web, Machine Learning, …)
–  large scale application (some use cases)
–  tight interlinking of knowledge and multimedia

•  Further research directions
–  Improve event extraction
–  Exploit already stored knowledge for event cross-document co-reference
–  Incremental population of the KnowledgeStore
–  Mixed retrieval, both from entities (SPARQL) and mentions (textual indexing)

Page 49

References

2012: R. Cattoni, F. Corcoglioniti, C. Girardi, B. Magnini, L. Serafini, R. Zanoli: Anchoring Background Knowledge to Rich Multimedia Contexts in the KnowledgeStore. In: New Trends of Research in Ontologies and Lexical Resources, edited by Federica Corradi Dell’Acqua, Qin Lu, Piek Vossen, Alessandro Oltramari and Eduard Hovy, Springer.

2012: R. Cattoni, F. Corcoglioniti, C. Girardi, B. Magnini, L. Serafini, R. Zanoli: TrentinoMedia: Exploiting NLP and Background Knowledge to Browse a Large Multimedia News Store. PAI 2012, Popularizing Artificial Intelligence, Rome, June 15, 2012.

2012: R. Cattoni, F. Corcoglioniti, C. Girardi, B. Magnini, L. Serafini, R. Zanoli: The KnowledgeStore: an Entity-Based Storage System. LREC 2012.

2013: F. Corcoglioniti, M. Rospocher, R. Cattoni, B. Magnini, L. Serafini: Interlinking Unstructured and Structured Knowledge in an Integrated Framework. Proceedings of the Seventh IEEE International Conference on Semantic Computing, ICSC 2013, September 16-18, 2013, Irvine, California, USA.

Page 50

References

[Buscaldi10] Buscaldi, D., Magnini, B.: Grounding toponyms in an Italian local news corpus. In: Proc. of 6th Workshop on Geographic Information Retrieval. pp. 15:1–15:5. GIR ’10 (2010)

[Homola10] Homola, M., Tamilin, A., Serafini, L.: Modeling contextualized knowledge. In: Proc. of 2nd Int. Workshop on Context, Information And Ontologies. CIAO ’10, vol. 626 (2010)

[Magnini06] Magnini, B., Pianta, E., Girardi, C., Negri, M., Romano, L., Speranza, M., Bartalesi Lenzi, V., Sprugnoli, R.: I-CAB: the Italian Content Annotation Bank. In: Proc. of 5th Int. Conf. on Language Resources and Evaluation, LREC ’06 (2006)

[Pianta08] Pianta, E., Girardi, C., Zanoli, R.: The TextPro tool suite. In: Proc. of 6th Int. Conf. on Language Resources and Evaluation. LREC ’08 (2008)

[Pianta10] Pianta, E., Tonelli, S.: KX: A flexible system for keyphrase extraction. In: Proc. of 5th Int. Workshop on Semantic Evaluation, SemEval ’10, pp. 170–173 (2010)

[Zanoli12] Zanoli, R., Corcoglioniti, F., Girardi, C.: Exploiting background knowledge for clustering person names. In: Proc. of Evalita 2011 – Evaluation of NLP and Speech Tools for Italian (2012), to appear

[Tamilin10] Tamilin, A., Magnini, B., Serafini, L.: Leveraging entity linking by contextualized background knowledge: A case study for news domain in Italian. In: Proc. of 6th Workshop on Semantic Web Applications and Perspectives. SWAP ’10 (2010)