
The KNOWLEDGESTORE: an Integrated Framework for Ontology Population

Bernardo Magnini

Fondazione Bruno Kessler Trento, Italy

Joint work with: Roldano Cattoni, Francesco Corcoglioniti,

Christian Girardi, Marco Rospocher, Luciano Serafini

Darmstadt, October 17, 2014

NewsReader: recording history by processing massive streams of daily news

ICT 316404 FP7-ICT-2011-8

www.newsreader-project.eu

HOW DID THE WORLD CHANGE YESTERDAY?

Can we handle the news?

§  Information broker LexisNexis archives:
–  1.5 million news articles on a single working day
–  30,000 different sources

How did the car industry change during the financial crisis?

•  6 million English articles on the car industry in the LexisNexis archive for the last 10 years

•  2 million Google hits for “Volkswagen takeover” not sorted by publication date

A short history of VW and Porsche

How to measure the volume of change?

[Figure: timeline from 1995 to 2015 over the 6 million articles; axis label: volume of entity; about 200k mentions and 10k entities per year; mentions labelled Speculation, Past or New. How many changes?]

On 16 September 2008, Porsche increased its shares by another 4.89%, in effect taking control of the company, with more than 35% of the voting rights.

6 Jan 2009 – Porsche has been on a quest to takeover VW for more than two years.


DAILY NEWS TSUNAMI

•  Volume is very big: 1.5 million items each working day

•  Repeated and duplicated: we cannot distinguish new from old

•  Incomplete and piecemeal: we need to read all to get a complete picture

•  Actual and speculated events: we cannot distinguish the realis from irrealis (speculations, fears and hopes)

•  Inconsistent and contradictory: we cannot tell true from false (who to believe)

•  Opinionated and selective: we do not realize the bias of our sources

Requirements for large-scale ontology population

•  Cope with noise in information extraction
–  E.g. event extraction is still not a consolidated technology

•  Fill the gap between structured and unstructured data
–  Knowledge in the information extraction loop
–  Take advantage of “background knowledge” (e.g. Wikipedia)

•  Larger variety of facts (e.g. relations, events)

•  Exploit cross-document co-reference

•  Ontology Population as an incremental process

NewsReader Architecture

KnowledgeStore: Objectives

•  Three-layer architecture:
–  resource, mention, entity

•  Stores massive amounts of data about:
–  entities extracted from the data
–  links to the data sources
–  background domain knowledge

•  Provides access via text-based search and semantic query languages

•  Provides reasoning services over the knowledge it contains

The Knowledge Store: Content

•  Resource Layer
–  Textual documents (e.g. a newspaper article)
–  But also images and videos
–  Linguistic annotations of the textual resources (e.g. a whole document processed by an NLP pipeline); NAF format in NewsReader
–  Relevant metadata (e.g. date, source, author, language)

•  Mention Layer
–  Relevant portions of a resource document with their linguistic annotations (e.g. names of people and places, an event mention, the area representing a person in an image)
–  All orthographical variants are stored (e.g. “B. Magnini” and “Bernardo Magnini”)
–  Coreference among mentions (intra- and cross-document)
–  Links to the data sources from which mentions have been extracted
–  Relevant metadata (e.g. frequency of occurrence, confidence of mention recognition)


•  Entity Layer
–  Unique instances of mentions that have been extracted from the data, after mention coreference
–  Represented as RDF triples
–  Orthographical variants are not recorded
–  Representation based on ontological schemas (e.g. SEM for events)
–  Links to background domain knowledge (e.g. Wikipedia)
–  Relevant metadata (e.g. factuality, provenance, confidence)
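The three layers can be pictured with a minimal Python sketch. The class and field names are illustrative assumptions made for this example, not the KnowledgeStore API, and the sample text is made up. The sketch shows the point made in the bullets: orthographic variants stay on the mentions, while the entity is a single instance described by triples.

```python
from dataclasses import dataclass, field

# Illustrative names only: these classes are assumptions made for this sketch,
# not the KnowledgeStore API.

@dataclass
class Resource:
    uri: str
    text: str
    metadata: dict = field(default_factory=dict)

@dataclass
class Mention:
    uri: str
    resource: Resource   # link back to the source document
    begin: int           # character offsets into resource.text
    end: int

    @property
    def extent(self) -> str:
        return self.resource.text[self.begin:self.end]

@dataclass
class Entity:
    uri: str
    triples: list = field(default_factory=list)    # (predicate, object) pairs
    mentions: list = field(default_factory=list)   # coreferring mentions

    def surface_forms(self) -> set:
        # Orthographic variants live on the mentions, not on the entity.
        return {m.extent for m in self.mentions}

doc = Resource("ex:r1", "B. Magnini spoke. Bernardo Magnini is at FBK.",
               {"language": "en"})
m1 = Mention("ex:m1", doc, 0, 10)
m2 = Mention("ex:m2", doc, 18, 34)
person = Entity("ex:e1", [("rdf:type", "ex:Person")], [m1, m2])
print(sorted(person.surface_forms()))
```

Keeping offsets rather than copied strings on the mention means the extent is always recomputed from the resource, so the link to the data source cannot drift.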



[Data model class diagram spanning the Resource Layer, Mention Layer and Entity Layer. Classes and attributes as they appear in the diagram:]

ks:Resource
  uri: URI; ks:storedAs: ks:Representation; rdfs:comment: string

News
  dct:title: string; dct:publisher: dct:Agent; dct:creator: dct:Agent; dct:created: date; dct:spatial: dct:Location; dct:temporal: time:Interval; dct:subject: URI; dct:rights: dct:RightsStatement; dct:rightsHolder: dct:Agent; dct:language: dct:LinguisticSystem; originalFileName: string; originalFileFormat: string; originalPages: int

NAFDocument
  version: string; dct:identifier: string; layer: NAFLayer[1..*]; dct:creator: NAFProcessor[1..*]; dct:language: dct:LinguisticSystem

ks:Mention
  uri: URI; nif:beginIndex: int; nif:endIndex: int; nif:anchorOf: string; rdfs:comment: string

ObjectMention
  syntacticHead: string; syntacticType: SyntacticType; entityType: EntityType; entityClass: EntityClass

EntityMention
  localCorefID: string

TimeOrEventMention

TimeMention
  value: string; timeType: TIMEX3Type; functionInDocument: FunctionInDocument; quant: string; freq: string; mod: TIMEX3Modifier; temporalFunction: bool

EventMention
  eventClass: EventClass; pred: string; certainty: Certainty; factuality: Factuality; factualityConfidence: float; pos: PartOfSpeech; tense: Tense; aspect: Aspect; polarity: Polarity; modality: string; framenetRef: URI; propbankRef: URI; verbnetRef: URI; nombankRef: URI

ValueMention
  valueType: ValueType

SignalMention, CSignalMention

RelationMention

Participation
  thematicRole: string; framenetRef: URI; propbankRef: URI; verbnetRef: URI; nombankRef: URI

TLink
  relType: TLinkType

CLink, SLink, GLink

ks:Entity
  uri: URI

ks:Axiom
  uri: URI; ks:encodedBy: rdf:Statement[1..*]; crystallized: bool; dc:source: URI; confidence: float; rdfs:comment: string

ks:Context
  uri: URI; sem:hasPointOfView: sem:PointOfView; sem:hasTimeValidity: time:Interval

Association labels (with multiplicities where shown): ks:containedIn (1); ks:describes (1..*); ks:describedBy; ks:expressedBy (0..*); ks:expresses; ks:refersTo (0..*); ks:referredBy; ks:holdsIn (1); annotationOf (1); source (1) and target (1) on the link mentions; signal (0..1); csignal (0..1); anchorTime; beginPoint; endPoint; valueFromFunction.

Resource: R3
  title     Rossi torna single (Rossi is single again)
  creator   l’Adige
  format    Text
  date      2012/02/27
  content   GOSSIP - Il campione della Ducati Valentino Rossi e la sua fidanzata Marwa Klebi si sono lasciati. … (Ducati champion Valentino Rossi and his girlfriend Marwa Klebi break up. …)

Mention: M2 (occurs in R3, references E1)
  extent     Valentino Rossi
  type       PER
  firstname  Valentino
  lastname   Rossi
  start      33

Entity: E1 (facts)
  predicate   object      source
  type        Pilot       motogp.com
  firstName   Valentino   motogp.com, R3
  lastName    Rossi       motogp.com, R3
  gender      male        motogp.com, R3
  birthDate   1979-02-16  wikipedia.it
  birthPlace  Urbino      wikipedia.it
  height      182         motogp.com
  weight      67          motogp.com
  team        Ducati      R3


High Level Data Model

[Diagram: classes Resource, Mention, EntityMention, RelationMention, Entity and Statement (crystallized: boolean); association labels: part of, refers to, described by, arguments.]

Knowledge Store example (Resource Layer, Mention Layer, Entity Layer)

Resource:
Indonesia Hit By Earthquake
A United Nations assessment team was dispatched to the province after two quakes, measuring 7.6 and 7.4, struck west of Manokwari Jan. 4. At least five people were killed, 250 others injured and more than 800 homes destroyed by those temblors, according to the UN.

Entity dbpedia:United_Nations:
dbpedia:United_Nations rdf:type yago:PoliticalSystems
dbpedia:United_Nations rdfs:label "United Nations"@en
dbpedia:United_Nations foaf:homepage <http://www.un.org/>

KnowledgeStore: Context

Resource populators (Source 1, Source 2, other): store news
Knowledge populators: store background knowledge

Resource processors (read resources, write annotations)
→ tokenization & lemmatization
→ part of speech tagging
→ word sense disambiguation
→ parsing (dep./constituency)
→ keyphrase extraction

Mention processors (read annotations & mentions, write mentions)
→ named entity recognition
→ event recognition
→ semantic role labelling
→ Tlink / Clink / Slink tagging
→ entity linking (wikification…)

Entity processors (read mentions & background knowledge, write entities & statements)
→ entity & event coreference
→ event chaining
→ event significance & rel.
→ narrative graph extraction
→ crystallization

Decision Support System: (mixed) queries
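The read/write flow between the three processor families can be sketched in Python as functions over a shared store. Everything here is a toy stand-in chosen for illustration: the store layout, the function names and the "recognizer" (a lookup in a set of known names) are assumptions, not NewsReader components.

```python
# Sketch: the three processor families as functions over a shared store.
# All names and the store layout are hypothetical.

store = {"resources": {}, "annotations": {}, "mentions": [], "entities": {}}

def resource_processor(store):
    # read resources, write annotations (here: whitespace tokenization)
    for uri, text in store["resources"].items():
        store["annotations"][uri] = text.split()

def mention_processor(store, known_names):
    # read annotations, write mentions
    for uri, tokens in store["annotations"].items():
        for position, token in enumerate(tokens):
            name = token.strip(".,")
            if name in known_names:
                store["mentions"].append((uri, position, name))

def entity_processor(store):
    # read mentions, write entities (here: one entity per distinct name)
    for uri, position, name in store["mentions"]:
        store["entities"].setdefault(name, []).append((uri, position))

store["resources"]["ex:news1"] = "Toyota brought Lexus to Japan in 2005."
resource_processor(store)
mention_processor(store, {"Toyota", "Lexus", "Japan"})
entity_processor(store)
print(store["entities"])
```

Each stage only reads what earlier stages wrote, which is what lets the layers be populated incrementally.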

Resource annotation

News nwr:news_00001 (news_00001.txt)
  dc:title            Indonesia’s West Papua Province Hit by Earthquake (Update2)
  dc:creator          bloomberg.com
  dc:language         EN
  dc:issued           2009-01-07T01:55:00-05:00
  nfo:wordCount       287
  nfo:characterCount  1778
  file                news_00001.txt
  …                   …

Mentions: NAF example

Toyota brought Lexus to Japan in 2005.

Mentions (organizations)

ORG Mention nwr:orgmen_00001
  nif:beginIndex  146
  nif:endIndex    160
  nif:anchorOf    United Nations
  foaf:name       United Nations
  …               …

ORG Mention nwr:orgmen_00002
  nif:anchorOf    the UN
  head            UN
  foaf:name       UN
  …               …
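A minimal Python sketch of how such a record relates to the text: the end offset is exclusive, so the anchor is the slice between the two indices, consistent with the slide (160 minus 146 equals the length of "United Nations"). `make_mention` is a hypothetical helper, and the text is a shortened stand-in for the article, so the offsets differ from those on the slide.

```python
# Sketch: build a mention record with NIF-style character offsets.
# `make_mention` is a hypothetical helper, not part of any library.

def make_mention(uri: str, text: str, surface: str, start_from: int = 0) -> dict:
    begin = text.index(surface, start_from)   # raises ValueError if absent
    end = begin + len(surface)                # end offset is exclusive
    return {
        "uri": uri,
        "nif:beginIndex": begin,
        "nif:endIndex": end,
        "nif:anchorOf": text[begin:end],
    }

text = "A United Nations assessment team was dispatched to the province."
mention = make_mention("nwr:orgmen_00001", text, "United Nations")
print(mention)
```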

[Legend: Non-event entities, Events, Time expressions, TLink signals, CLink signals]

Mentions: linking to resource

News nwr:news_00001 (news_00001.txt)
  dc:title            Indonesia’s West Papua Province Hit by Earthquake (Update2)
  dc:creator          bloomberg.com
  dc:language         EN
  dc:issued           2009-01-07T01:55:00-05:00
  nfo:wordCount       287
  nfo:characterCount  1778
  file                news_00001.txt
  …                   …

ORG Mention    nwr:orgmen_00001  dc:isPartOf nwr:news_00001  “United Nations”
ORG Mention    nwr:orgmen_00002  dc:isPartOf nwr:news_00001  “the UN”
Event Mention  nwr:evmen_00001   dc:isPartOf nwr:news_00001  “quakes”
Event Mention  nwr:evmen_00002   dc:isPartOf nwr:news_00001  “temblors”
CLINK Mention  nwr:clmen_00001   dc:isPartOf nwr:news_00001  “destroyed by those temblors”
Part. Mention  nwr:pmen_00002    dc:isPartOf nwr:news_00001  “according to the UN”

Entities: SEM format

ENTITY INSTANCE
<http://dbpedia.org/resource/Toyota>
    a nwr:organization ;
    rdfs:label "Toyota" , "Toyota motor" ;
    gaf:denotedBy
        <nwr:data/cars/2013/1/1/5760-PM51-JD34-P4RM.xml#char=98,104&word=w18&term=t18> ,
        <nwr:data/cars/2013/1/1/57K5-FKK1-DYBW-2534.xml#char=44934,44940&word=w8114&term=t8114> .

EVENT INSTANCE
<nwr:data/cars/2013/1/1/5758-BPN1-F0J6-D2T2.xml#sellEvent>
    a sem:Event , fn:Commerce_sell ;
    rdfs:label "sell" ;
    gaf:denotedBy
        <nwr:data/cars/2013/1/1/5758-BPN1-F0J6-D2T2.xml#char=1352,1356&word=w251&term=t251> ,
        <nwr:data/cars/2013/1/1/5760-PM51-JD34-P4H7.xml#char=1536,1540&word=w275&term=t275> .
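The gaf:denotedBy values point from an instance back to the exact span that mentions it. A short Python sketch of reading such a URI follows; the fragment layout (char=BEGIN,END&word=...&term=...) is inferred from the examples above, and `parse_mention_uri` is a hypothetical helper.

```python
# Sketch: split a gaf:denotedBy mention URI into its document and offsets.
# The fragment layout is read off the slide's examples; it is an assumption
# that every mention URI follows it.

def parse_mention_uri(uri: str) -> dict:
    document, _, fragment = uri.partition("#")
    parts = dict(item.split("=", 1) for item in fragment.split("&"))
    begin, end = (int(n) for n in parts["char"].split(","))
    return {"document": document, "begin": begin, "end": end,
            "word": parts.get("word"), "term": parts.get("term")}

uri = ("nwr:data/cars/2013/1/1/5760-PM51-JD34-P4RM.xml"
       "#char=98,104&word=w18&term=t18")
info = parse_mention_uri(uri)
print(info)
```

In this example the span is six characters long, which matches the length of the label "Toyota".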

Entities: SEM format

Semantic relations as named graphs

<nwr:/data/cars/2013/1/1/5758-BPN1-F0J6-D2T2.xml#pr25,rl55> {
    <nwr:data/cars/2013/1/1/5722-S821-F0J6-D48N.xml#sellEvent>
        sem:hasActor ; fn:Commerce_sell#Seller
        <http://dbpedia.org/resource/Magyar_Suzuki> .
}

<nwr:data/cars/2013/1/1/5760-PM51-JD34-P4H7.xml#pr46,rl114> {
    <nwr:data/cars/2013/1/1/5758-BPN1-F0J6-D2T2.xml#sellEvent>
        sem:hasPlace <http://dbpedia.org/resource/South_Africa> .
}

<nwr:data/cars/2013/1/1/5760-PM51-JD34-P4H7.xml#docTime_26> {
    <nwr:data/cars/2013/1/1/5760-PM51-JD34-P4H7.xml#sellEvent>
        sem:hasTime <nwr:time/2013-01-01> .
}

Properties of relations

PROVENANCE
<nwr:data/cars/2013/1/1/57R8-5451-F0J6-D2GH.xml#pr25,rl55>
    gaf:denotedBy <nwr:data/cars/2013/1/1/57R8-5451-F0J6-D2GH.xml#rl55> ;
    prov-o:wasAttributedTo <nwr:sourceowner/Peru_Autos_Report> .

FACTUALITY
<nwr:data/cars/2013/1/1/57K5-FKK1-DYBW-2534.xml#facValue_1125> {
    <nwr:data/cars/2013/1/1/57K5-FKK1-DYBW-2534.xml#sellEvent>
        nwr:hasFactBankValue "CT+" .
}
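The idea behind these slides is that a relation sits in its own named graph, so that properties such as provenance can be stated about the graph identifier. A plain-Python sketch of that idea, using tuples instead of an RDF library; the shortened `ex:` identifiers are hypothetical stand-ins for the long URIs above.

```python
from collections import defaultdict

# Sketch: statements live in named graphs, and metadata is attached to the
# graph identifier. Identifiers are shortened, hypothetical stand-ins.

quads = [
    # (graph, subject, predicate, object)
    ("ex:g1", "ex:sellEvent", "sem:hasActor", "dbpedia:Magyar_Suzuki"),
    ("ex:g2", "ex:sellEvent", "sem:hasPlace", "dbpedia:South_Africa"),
]
graph_metadata = [
    # (graph, predicate, value)
    ("ex:g1", "prov-o:wasAttributedTo", "ex:sourceowner/Peru_Autos_Report"),
    ("ex:g1", "gaf:denotedBy", "ex:doc1#rl55"),
]

def statements_with_metadata(quads, graph_metadata):
    meta = defaultdict(dict)
    for graph, predicate, value in graph_metadata:
        meta[graph][predicate] = value
    # Pair each triple with whatever is known about its graph.
    return [((s, p, o), dict(meta[g])) for g, s, p, o in quads]

result = statements_with_metadata(quads, graph_metadata)
for triple, meta in result:
    print(triple, meta)
```

A relation without recorded metadata simply comes back with an empty dictionary, so the triples themselves stay untouched when provenance is missing.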

Data Sources

High-level KS architecture

Clients
•  Applications: direct access to KS API; (some) linguistic processors
•  Populators: loading resources, annotations, background knowledge in specific formats (e.g., RDF, TAF)
•  Mgmt. Scripts: start / stop, backup / restore, configuration, statistics gathering

HTTP REST API: CRUD & bulk services, (mixed) queries, specialized access patterns, map/reduce hook

Server
•  KS Frontend: API implementation on top of lower layers; optional transactions & data validation
•  HBase + Hadoop: distributed & replicated for scalability and fault-tolerance
•  Triple Store: possibly distributed

Stored data: Mention, Resource, Entity, Statement, RDF Triples
Configuration: Inference Rules, Data Model Properties, Deployment Properties
Other labels in the diagram: Rule-based inference (partial), Replication

Use Case 1: Trentino Media

§  Goal: acquire and load news and background knowledge

§  Multimedia news
–  ~56 GB of textual news, images and videos in the Italian language
–  acquired from 4 news providers local to the Italian Trentino region
–  daily updated, since 1999

Processing pipeline: Content acquisition → Resource preprocessing → Mention extraction → Coreference resolution → Linking & entity creation

Provider          News     Images  Videos
l’Adige           733,738  21,525  -
VitaTrentina      33,403   14,198  -
RTTR              2,455    -       120 h
Fed. Cooperative  1,402    -       -
All               770,998  35,723  120 h

Coop. Trentina

§  Background knowledge
•  acquired from Italian Wikipedia, sport-related community sites, local and national level public administrations and government bodies
•  manual acquisition using ad-hoc Web site wrappers

Topic      Contexts  Persons  Organizations  Avg. properties  Facts (triples)
sport      136       8,570    191            3.81             192,115
culture    20        9,785    1              2.00             33,236
justice    7         354      10             2.16             1,575
economy    7         49       1,203          4.47             11,147
education  6         850      82             2.35             3,573
politics   535       8,402    319            4.64             98,780
religion   3         1,391    0              1.67             12,855
total      714       28,687   1,806          3.64             352,244

Trentino Media: Resource Preprocessing

§  Goal: normalize and annotate news so as to ease further processing
§  Several operations:
–  Conversion of multimedia resources to common formats
–  Segmentation of complex news
•  separation of individual stories in a news broadcast
•  separation of figures and captions in complex XML news articles
–  Automatic Speech Recognition
–  Annotations of news with linguistic taggers
•  part of speech tagging, based on TextPro tool [Pianta08]
•  temporal expression tagging, based on TextPro
•  key concept extraction, based on KX tool [Pianta10]

Trentino Media: Mention Extraction

§  Mention extraction is based on the TextPro Mention Detection module
–  system based on the supervised training of a statistical model
–  training for the Italian language based on the Italian Content Annotation Bank (I-CAB) dataset [Magnini06]; measured F1 value: 82%
§  Mention extraction statistics

News provider  PER mentions  ORG mentions  GPE/LOC mentions  Total mentions
l’Adige        5,387,994     3,100,994     3,052,011         11,540,999
VitaTrentina   144,486       100,789       136,611           381,886
RTTR           19,290        15,493        27,404            62,187
Fed. Coop.     14,404        12,731        8,513             35,648
All            5,566,174     3,230,007     3,224,539         12,020,720

Trentino Media: Coreference Resolution

§  Goal: put mentions referring to the same entity in a mention cluster
–  cross-document (more complex) as mentions may belong to different news

Entity type       PER      ORG     GPE/LOC  Total
Mention clusters  340,147  16,649  52,478   409,274

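… il presidente della Provincia di Trento Lorenzo Dellai ha lodato l'iniziativa per molteplici aspetti ...
(… the president of the Province of Trento, Lorenzo Dellai, praised the initiative in several respects ...)

... questa mattina il presidente Dellai ha incontrato le parti sociali sulla manovra che la Giunta sta mettendo a punto …
(... this morning president Dellai met the social partners about the budget measures the Giunta is finalizing …)

… e intanto l'Alto Adige è tra le regioni europee con il più basso tasso di disoccupazione
(… and meanwhile Alto Adige is among the European regions with the lowest unemployment rate)

Lorenzo Dellai (PERSON entity)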
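Once pairwise coreference decisions exist, turning them into mention clusters is a grouping problem. The sketch below uses union-find for that grouping step only; it is not the system's coreference method, and the mention identifiers and the single link are made up to mirror the example above.

```python
# Sketch: group mentions into clusters from pairwise coreference decisions,
# using union-find. How the pairwise decisions are made is out of scope here.

def cluster_mentions(mentions, coref_pairs):
    parent = {m: m for m in mentions}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]   # path halving
            m = parent[m]
        return m

    for a, b in coref_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return sorted(clusters.values())

mentions = ["news1:Lorenzo Dellai", "news2:Dellai", "news2:Alto Adige"]
pairs = [("news1:Lorenzo Dellai", "news2:Dellai")]   # cross-document link
clusters = cluster_mentions(mentions, pairs)
print(clusters)
```

Mentions without any link end up in singleton clusters, which is why the number of clusters can stay large even after coreference.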

Trentino Media: Linking & Entity Creation

§  Goal: link mention clusters to entities in the background knowledge and to external resources; create new entities from unlinked clusters

§  Different types of linking are performed
–  linking to GeoNames toponyms of GPE/LOC clusters, using GeoCoder
–  linking to background knowledge entities of PER and ORG clusters
–  linking to Wikipedia pages of all the entities, using the WikiMachine tool

§  New entities are created from unlinked clusters

§  Linking and entity statistics:

Entity type         PER      ORG     GPE/LOC  Total
Linked clusters     5.03%    7.96%   48.64%   10.74%
Linked mentions     22.36%   12.02%  65.04%   31.03%
Resulting entities  321,713  17,129  52,478   421,320

Trentino Media: Linking & Entity Creation

§  Linking to background knowledge is context-driven [Tamilin10]
–  exploits the contextual organization of knowledge in the KNOWLEDGESTORE
–  84.5% accuracy (i.e., correctly linked clusters) on a gold standard of 298 clusters

[Diagram: KNOWLEDGESTORE contexts as (topic, location, time) triples: (-, World, -); (Research, World, -); (Research, Italy, 2009); (Politics, Trento, 2009); (Politics, Civezzano, 2009); (Politics, Trentino, 2009); (Politics, Trentino, 2004). Entities placed in the contexts: EIT, FBK, S. Dellai, L. Dellai (pres.), L. Dellai (mayor). Mention to link: “Dellai” (PER), with textual context topic: research, location: Trentino, time: 2009.]

➊ Map the textual context of each cluster mention to appropriate values of the contextual dimensions
➋ Select a ranked list of KNOWLEDGESTORE contexts that best match the chosen dimensional values
➌ Link the mention only to entities appearing in the selected formal contexts
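The three steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the method of [Tamilin10]: the scoring rule (count of exactly matching dimension values), the assignment of entities to contexts, and the `ex:` identifiers are all made up for the example.

```python
# Sketch of context-driven linking. Scoring rule, context contents and entity
# identifiers are assumptions for illustration, not the published method.

CONTEXTS = {
    # (topic, location, time) -> entities known in that context
    ("politics", "Trento", 2009): ["ex:L_Dellai"],
    ("research", "Italy", 2009): ["ex:S_Dellai", "ex:FBK"],
    ("politics", "Trentino", 2004): ["ex:L_Dellai"],
}

def rank_contexts(dims, contexts):
    # Step 2: score each context by the number of matching dimension values.
    def score(ctx):
        return sum(1 for want, have in zip(dims, ctx) if want == have)
    ranked = sorted(contexts, key=score, reverse=True)
    return [c for c in ranked if score(c) > 0]

def link(surname, dims, contexts, top_k=1):
    # Step 3: only entities in the best-matching contexts are candidates.
    candidates = []
    for ctx in rank_contexts(dims, contexts)[:top_k]:
        candidates += [e for e in contexts[ctx] if surname in e]
    return candidates

# Step 1 (mapping text to dimension values) is assumed to have produced this:
dims = ("research", "Trentino", 2009)
linked = link("Dellai", dims, CONTEXTS)
print(linked)
```

Exact equality is a deliberate simplification: it ignores that Trentino lies inside Italy, so a realistic matcher would need some notion of containment between dimension values.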

Other use cases

• USE CASES:

– Car Industry news (2003-2013): 63K articles, 1.7M event instances, 445K actors, 63K places, 41K DBpedia entities and 46M triples.

– TechCrunch (2005-2013): 43K articles, 1.6M event instances, 300K actors, 28K DBpedia entities and 24M triples.

– FIFA World Cup: 200K documents, 9M events, 35K actors (30% pers, 70% org), 15K places, 6K dates and 136M triples.

– Dutch House of Representatives Bank Inquiry: 1M documents (900K XML and 100K PDF), pending.

• BENCHMARK DATA:

– WikiNews: 19K English, 8K Italian, 7K Spanish and 1K Dutch. 69 Apple news documents for annotation.

– ECB+: 43 topics and 482 articles from GoogleNews, extended with 502 GoogleNews articles for 43+ topics (similar but different event).

TOP events per year

TOP actors per year

TOP places per year

CARS: WHERE & WHEN

Use Case: Worldcup

•  We processed 212,511 news articles in about 3 weeks (sources: BBC, Guardian, LexisNexis)

•  We stored the resources and the result in the KnowledgeStore:

•  45GB storage of news and processed mentions

•  22GB storage for triples expressing statements on instances in RDF, e.g. who involved, when happened, where…

•  136,075,006 triples from news (event statements)

•  104,595,567 triples from background data (DBPedia statements)

          Events      Actors                               Places      Time
Mention   25,470,763  3,590,012 (58% org., 42% pers.)      2,279,939   1,828,134
Instance  9,387,356   51,283 dbp (30% org., 70% pers.)     15,219 dbp  5,961

33,123 events involving Blatter

Use Case: Worldcup. What Blatter does…

Event predicates involving Blatter, by frequency:
say 1143, have 523, make 440, add 408, tell 370, give 272, take 260, meet 254, insist 232, want 209, comment 208, suggest 196, stand 191,
go 188, visit 180, call 180, get 172, come 169, do 162, challenge 162, vote 160, write 145, play 140, ask 139, run 138, claim 135,
use 133, announce 132, look 131, win 129, try 129, support 127, confirm 125, attend 122, believe 121, describe 119, re-elect 117, speak 116, seek 116

KS Web site: https://knowledgestore.fbk.eu

§  Download source code, selected data
§  Documentation, demo, video

Semeval 2015 Task T4 - TimeLine: Cross-Document Event Ordering

- Task description: build an ordered list of events involving a specific pre-selected entity

- Trial and test data: Wikinews

- Evaluation: (i) time anchors + ordering, (ii) only ordering

- Subtasks: (i) timeline on raw texts, (ii) timeline on texts annotated with events

Register to the TimeLine task Google Group: semeval-task4-timeline

Organizers: A-L. Minard, E. Agirre, I. Aldabe, M. van Erp, B. Magnini, G. Rigau, M. Speranza, R. Urizar

http://www.newsreader-project.eu

Conclusion

•  Our contributions
–  integration of techniques and technologies from different fields (NLP, Semantic Web, Machine Learning, …)
–  large scale application (some use cases)
–  tight interlinking of knowledge and multimedia

•  Further research directions
–  Improve event extraction
–  Exploit already stored knowledge for event cross-document coreference
–  Incremental population of the KnowledgeStore
–  Mixed retrieval, both from entities (SPARQL) and mentions (textual indexing)

References

2012: R. Cattoni, F. Corcoglioniti, C. Girardi, B. Magnini, L. Serafini, R. Zanoli: Anchoring Background Knowledge to Rich Multimedia Contexts in the KnowledgeStore. In: New Trends of Research in Ontologies and Lexical Resources, edited by Federica Corradi Dell’Acqua, Qin Lu, Piek Vossen, Alessandro Oltramari and Eduard Hovy, Springer.

2012: R. Cattoni, F. Corcoglioniti, C. Girardi, B. Magnini, L. Serafini, R. Zanoli: TrentinoMedia: Exploiting NLP and Background Knowledge to Browse a Large Multimedia News Store. PAI 2012, Popularizing Artificial Intelligence, Rome, June 15, 2012.

2012: R. Cattoni, F. Corcoglioniti, C. Girardi, B. Magnini, L. Serafini, R. Zanoli: The KnowledgeStore: an Entity-Based Storage System. LREC 2012.

2013: F. Corcoglioniti, M. Rospocher, R. Cattoni, B. Magnini, L. Serafini: Interlinking Unstructured and Structured Knowledge in an Integrated Framework. Proceedings of the Seventh IEEE International Conference on Semantic Computing, ICSC 2013, September 16-18, 2013, Irvine, California, USA.

References

[Buscaldi10] Buscaldi, D., Magnini, B.: Grounding toponyms in an Italian local news corpus. In: Proc. of 6th Workshop on Geographic Information Retrieval. pp. 15:1–15:5. GIR ’10 (2010)

[Homola10] Homola, M., Tamilin, A., Serafini, L.: Modeling contextualized knowledge. In: Proc. of 2nd Int. Workshop on Context, Information And Ontologies. CIAO ’10, vol. 626 (2010)

[Magnini06] Magnini, B., Pianta, E., Girardi, C., Negri, M., Romano, L., Speranza, M., Bartalesi Lenzi, V., Sprugnoli, R.: I-CAB: the Italian Content Annotation Bank. In: Proc. of 5th Int. Conf. on Language Resources and Evaluation, LREC ’06 (2006)

[Pianta08] Pianta, E., Girardi, C., Zanoli, R.: The TextPro tool suite. In: Proc. of 6th Int. Conf. on Language Resources and Evaluation. LREC ’08 (2008)

[Pianta10] Pianta, E., Tonelli, S.: KX: A flexible system for keyphrase extraction. In: Proc. of 5th Int. Workshop on Semantic Evaluation, SemEval ’10, pp. 170–173 (2010)

[Zanoli12] Zanoli, R., Corcoglioniti, F., Girardi, C.: Exploiting background knowledge for clustering person names. In: Proc. of Evalita 2011 – Evaluation of NLP and Speech Tools for Italian (2012), to appear

[Tamilin10] Tamilin, A., Magnini, B., Serafini, L.: Leveraging entity linking by contextualized background knowledge: A case study for news domain in Italian. In: Proc. of 6th Workshop on Semantic Web Applications and Perspectives. SWAP ’10 (2010)
