Contextualized Semantic Enrichment The LiveMemories experience and future directions presenter Francesco Corcoglioniti work by Luciano Serafini, Andrei Tamilin, Mathew Joseph DKM internal seminar February 8th, 2011


Page 1: Contextualized Semantic Enrichment...A-Box facts extracted from selected Web sources • by Web page scraping, to extract structured data • by manually encoding facts based on data

Contextualized Semantic Enrichment The LiveMemories experience and future directions

presenter Francesco Corcoglioniti

work by Luciano Serafini, Andrei Tamilin, Mathew Joseph

DKM internal seminar

February 8th, 2011

Page 2

Outline

Introduction

Proposed approach

LiveMemories use case

Preliminary evaluation in LiveMemories

Conclusions and future work

Page 3

From text to knowledge and back

[Figure labels: Background Knowledge; Semantic Enrichment; … Knowledge Population]

Page 4

Semantic enrichment

Semantic Enrichment = (1) Entity Linking + (2) Knowledge Selection

[Figure labels: Background Knowledge Base; GeoNames; NLP tools; entity types ORG, LOC, PER; URI; steps 1 and 2]

Page 5

Role of context

Entity Linking – ambiguity problem

• “Boban” (Football, 2000) → Zvonimir Boban, football player

• “Boban” (Music, 2000) → Boban Marković, trumpet player

Knowledge Selection – knowledge validity problem

• Zvonimir Boban (FIFA World Cup, 1998) → Croatia Zagreb player

• Zvonimir Boban (Italian League Serie A, 2000) → AC Milan player

Semantic enrichment is context driven

Page 6

Context-driven semantic enrichment

[Figure labels: Background Knowledge Base; Contextualized Knowledge Repository with contexts Root, Culture, Sport, TV, Football, Volley; GeoNames; NLP tools; entity types ORG, LOC, PER; Detect text context]

Page 7

Outline

Introduction

Proposed approach

• contextualized knowledge representation (CKR)

• context-driven entity linking

• context-driven knowledge selection

LiveMemories use case

Preliminary evaluation in LiveMemories

Conclusions and future work

Page 8

Contextualized Knowledge Repository

Framework to represent contextualized knowledge (RDF/RDFS/OWL), supporting the enrichment procedure

Context as a box metaphor [Benerecetti et al., 2000]

• a context is a box with the knowledge base inside the box and a set of dimension-value pairs outside the box

• example dimensions: location, topic, time

Dimensions are structured:

• the values $v_i$ of each dimension $D_i$ are fixed with ontologies and structured by a partial-order coverage relation $\succeq_{D_i}$ between them

Example context:

Topic = FIFA World Cup, Location = France, Time = Jun-98 to Jul-98

Player_Of(Zvonimir_Boban, Croatia_Zagreb)

Has_Role(Zvonimir_Boban, Midfielder)

…
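The box metaphor can be sketched as a small data structure. This is only an illustration: the class name `Context`, the triple encoding of facts and the `("1998-06", "1998-07")` encoding of the time interval are assumptions, not the CKR's actual representation (which is RDF/RDFS/OWL based).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Context:
    """A 'box': dimension-value pairs outside, facts (triples) inside."""
    dimensions: tuple                      # ((dimension, value), ...)
    facts: frozenset = field(default_factory=frozenset)

    def value(self, dimension):
        """Return the value this context takes on the given dimension."""
        return dict(self.dimensions)[dimension]


# The example context from the slide (time encoding is an assumption).
world_cup_98 = Context(
    dimensions=(
        ("topic", "FIFA World Cup"),
        ("location", "France"),
        ("time", ("1998-06", "1998-07")),
    ),
    facts=frozenset({
        ("Zvonimir_Boban", "Player_Of", "Croatia_Zagreb"),
        ("Zvonimir_Boban", "Has_Role", "Midfielder"),
    }),
)
```

Keeping the dimension values outside the fact set is what later lets contexts be compared and ordered without inspecting their contents.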

Page 9

CKR – Example

Page 10

Context hierarchy

Context covering:

Given a pair of contexts $C_a$ and $C_b$ defined on the same dimensions $\{D_i\}_{i \in \{1..n\}}$, $C_a$ covers $C_b$ ($C_a \succeq C_b$) if for each dimension $i \in \{1..n\}$ we have $v_i^a \succeq_{D_i} v_i^b$

Observations:

• context covering is a partial order

• given a set of contexts, by virtue of context covering we can construct a context hierarchy

• when a new context is inserted into the repository, its position in the hierarchy is automatically determined by the values of its dimensions
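A minimal sketch of context covering and of how the position of a new context could be derived from it. The dimension ontologies in `PARENTS` and the dictionary encoding of contexts are hypothetical; each dimension is assumed here to be a tree (one parent per value), which is a simplification of a general partial order.

```python
# Hypothetical dimension ontologies: child value -> parent value.
PARENTS = {
    "topic": {"Football": "Sport", "Volley": "Sport", "Sport": "Root"},
    "location": {"Trento": "Italy", "Italy": "Europe", "Europe": "World"},
}


def value_covers(dimension, general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = PARENTS[dimension].get(current)
    return False


def covers(ca, cb):
    """Ca covers Cb iff, on every dimension, Ca's value covers Cb's value."""
    return ca.keys() == cb.keys() and all(
        value_covers(d, ca[d], cb[d]) for d in ca
    )


def parents_in_hierarchy(new, repository):
    """Most specific contexts of the repository that cover `new`."""
    covering = [c for c in repository if c != new and covers(c, new)]
    return [
        c for c in covering
        if not any(o != c and covers(c, o) for o in covering)
    ]


sport_world = {"topic": "Sport", "location": "World"}
football_italy = {"topic": "Football", "location": "Italy"}
football_trento = {"topic": "Football", "location": "Trento"}
```

With these definitions, inserting `football_trento` into a repository holding the other two contexts places it directly under `football_italy`, since `sport_world` also covers it but is more general.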

Page 11

Context hierarchy – Example

Page 12

CKR architecture and functionalities

[Architecture diagram: Contextualized Background Knowledge Repository. Recoverable labels:]

• Administration services: CRUD on Dimensions/Contexts; Materialize/Dematerialize/Index

• Application services: Local / Multi-context / Keyword-based queries; Context determination (based on lexicon)

• Management module; Contexts organization module (Dimensions structures, Contexts declarations)

• Knowledge retrieval module: Querying module (Query shifting / Local querying, SPARQL); String-based searching (Keyword search)

• Knowledge Indexing module; Index store

• RDF stores for the individual contexts (Context C1, Context C2, Context C3, …)

Page 13

Entity linking procedure (1)

Goal: given a document and an entity mention in it, find the ontological individual the mention refers to

Linking procedure:

1. identify the document context, by extracting values of the context dimensions

• this step must be adapted to the particular application, based on the chosen dimensions

– e.g. extract the subjects, time and location the document refers to

• can exploit document metadata

– e.g. publication date

• can exploit information retrieval and NLP techniques

– e.g. keyword extraction based on TF/IDF, followed by keyword mapping to values of the subject dimension (currently employed)

– e.g. hierarchical document classification (not tried)
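The TF/IDF keyword extraction mentioned above can be sketched as follows. This is a textbook formulation over pre-tokenized documents, not the LiveMemories implementation; the function name, the toy corpus and the tie-breaking by alphabetical order are assumptions.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=5):
    """Rank the terms of one document by TF-IDF against a tokenized corpus."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    term_freq = Counter(doc_tokens)
    scores = {
        term: (count / len(doc_tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
        if doc_freq[term] > 0          # skip terms unseen in the corpus
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


corpus = [
    ["boban", "scores", "goal", "for", "milan"],
    ["boban", "plays", "trumpet", "in", "concert"],
    ["council", "votes", "for", "budget"],
]
keywords = tfidf_keywords(corpus[0], corpus, top_k=3)
```

In this toy corpus the ambiguous name "boban" ranks below "goal", "milan" and "scores" because it occurs in two documents, so the top keywords are the ones that carry the subject signal.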

Page 14

Entity linking procedure (2)

Linking procedure (cont'd):

2. determine the ranked context(s) of interest in the repository

• e.g. rank contexts based on subject relevance for the document

3. match the mention against the individuals in the contexts' knowledge

• string matching of the mention against the indexed elements in the context

• stop if a match is found, otherwise shift the query to more specific contexts

• stop if a match is found, otherwise shift the query to more general contexts

4. (to improve coverage) if no match is found, perform a non-contextualized search in the CKR for the mention string

– if exactly one ontological individual is found, and it appears in a context whose time dimension value contains that of the document context, then accept it
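Steps 2 and 3 can be sketched as a search that starts from each ranked context, then shifts to more specific and finally to more general contexts. The sketch assumes a tree-shaped hierarchy given as `children` and `parent` maps and an exact, case-insensitive string match; the `lm:` identifiers are placeholders, and the step 4 fallback is left out.

```python
def link_mention(mention, ranked_contexts, index, children, parent):
    """Return (uri, context) for the first match found, or None."""
    def lookup(ctx):
        return index.get(ctx, {}).get(mention.lower())

    for start in ranked_contexts:
        candidates = [start]                      # the context of interest
        queue = list(children.get(start, []))
        while queue:                              # then more specific contexts
            ctx = queue.pop(0)
            candidates.append(ctx)
            queue.extend(children.get(ctx, []))
        ctx = parent.get(start)
        while ctx is not None:                    # then more general contexts
            candidates.append(ctx)
            ctx = parent.get(ctx)
        for ctx in candidates:
            uri = lookup(ctx)
            if uri is not None:
                return uri, ctx
    return None


# Hypothetical hierarchy and per-context name index.
parent = {"football": "sport", "music": "culture",
          "sport": "root", "culture": "root"}
children = {"root": ["sport", "culture"],
            "sport": ["football"], "culture": ["music"]}
index = {
    "football": {"boban": "lm:zvonimir_boban"},
    "music": {"boban": "lm:boban_markovic"},
}
```

The same mention string resolves differently depending on the starting context: from `sport` it reaches the football player, from `culture` the trumpet player, which is the ambiguity case described earlier in the talk.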

Page 15

Entity linking procedure (3)

Problem: the procedure is sensitive to the mention string

• entities have multiple possible mentions/surface forms, e.g. “Zvone”, “Boban”, “Boban Zvonimir”, “Zvonimir B.”…

Solution: apply NLP global (cross-document) coreference

• group/cluster together the mentions referring to the same entity in the whole corpus

• select the representative name for the cluster

• heuristic: prefer longer/frequent names

• in the news domain, this yields name and surname for people, and non-abbreviated names for organizations and locations

• execute the entity linking procedure document by document, using the computed cluster name
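One possible reading of the "prefer longer/frequent names" heuristic is sketched below. How length and frequency are actually weighted is not stated in the slides; ranking by number of tokens first and by frequency second is an assumption made for the example.

```python
from collections import Counter


def representative_name(mentions):
    """Pick a cluster label: more tokens first, then higher frequency."""
    counts = Counter(mentions)
    return max(
        counts,
        key=lambda name: (len(name.split()), counts[name], len(name), name),
    )


cluster = ["Boban", "Boban", "Zvonimir Boban", "Zvone",
           "Zvonimir Boban", "Zvonimir B."]
label = representative_name(cluster)
```

Here "Zvonimir Boban" wins over "Zvonimir B." because both have two tokens but the former is more frequent, and it wins over "Boban" because it is longer.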

Page 16

Knowledge selection procedure

Goal: given a document, its associated contexts and a linked entity occurring in it, select the knowledge about the entity that is relevant for that document and those contexts

Knowledge selection procedure:

• perform SPARQL DESCRIBE queries for the entity starting from the identified contexts

• propagate/shift the query to more specific contexts in the hierarchy

• for detailed info, e.g. championship, team Boban played for, role he played in a team

• propagate/shift the query to more general contexts

• for general info, e.g. date of birth of Boban
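The query shifting used for knowledge selection can be sketched over an in-memory store. The inner `describe` function is only a stand-in for running a SPARQL DESCRIBE query against a context's RDF store; the store layout, property names and the `<BIRTH_DATE>` placeholder are hypothetical.

```python
def select_knowledge(entity, start_contexts, store, children, parent):
    """Collect (context, triple) pairs about `entity`, nearest contexts first."""
    def describe(ctx):
        # Stand-in for `DESCRIBE <entity>` evaluated in the context's store.
        return [t for t in store.get(ctx, []) if t[0] == entity]

    visited, selected = [], []
    for start in start_contexts:
        order = [start]
        queue = list(children.get(start, []))
        while queue:                     # more specific contexts: detailed info
            ctx = queue.pop(0)
            order.append(ctx)
            queue.extend(children.get(ctx, []))
        ctx = parent.get(start)
        while ctx is not None:           # more general contexts: general info
            order.append(ctx)
            ctx = parent.get(ctx)
        for ctx in order:
            if ctx not in visited:       # query each context only once
                visited.append(ctx)
                selected.extend((ctx, t) for t in describe(ctx))
    return selected


parent = {"serie_a_2000": "football", "football": "sport"}
children = {"sport": ["football"], "football": ["serie_a_2000"]}
store = {
    "sport": [("boban", "born_on", "<BIRTH_DATE>")],
    "serie_a_2000": [("boban", "plays_for", "ac_milan"),
                     ("vieri", "plays_for", "<TEAM>")],
}
card = select_knowledge("boban", ["football"], store, children, parent)
```

Returning the context together with each triple keeps the provenance needed to present detailed facts before general ones.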

Page 17

Outline

Introduction

Proposed approach

LiveMemories use case

• knowledge base construction

• entity linking

• knowledge selection and display

Preliminary evaluation in LiveMemories

Conclusions and future work

Page 18

LiveMemories use case

[Pipeline diagram: Named Entity Recognition (ORG, LOC, PER) → Local & global (cross-document) coreference → Semantic Enrichment, using the LiveMemories background knowledge (CKR); entity URIs and identified contexts go to the Lucene index]

• 716,455 documents

• 181,734 entities (82% PER, 18% ORG)

• 5,704,669 <entity, document> occurrences

Page 19

Knowledge base construction

Goal: describe relevant named entities commonly cited in the considered news corpus

Focus on

• persons and organizations only; GeoNames is used for locations

• local knowledge (mainly related to Trento and surroundings)

• number of entities covered (breadth), more than detail of entity descriptions (depth)

Methodology

• define contextual dimensions

• T-Box design (based on available sources)

• A-Box acquisition from selected Web sources

Page 20

Contextual dimensions

Time

• interval representation, by start time and end time

• hierarchy implicitly defined by interval inclusion

• e.g. [1999-01-01, 2010-12-31] covers [2001-01-01, 2001-01-31]
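For the time dimension, covering reduces to interval inclusion. A minimal sketch, assuming closed intervals encoded as `(start, end)` pairs of dates (the encoding and names are illustrative):

```python
from datetime import date


def interval_covers(outer, inner):
    """True if the closed interval `outer` includes the closed interval `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# The example from the slide.
period = (date(1999, 1, 1), date(2010, 12, 31))
january_2001 = (date(2001, 1, 1), date(2001, 1, 31))
```

Because inclusion is reflexive, antisymmetric and transitive, no explicit hierarchy needs to be maintained for this dimension.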

Location

• hierarchy explicitly defined, manually edited

• fine-grained description of Trento province only

• GeoNames not used, too many locations missing

Subject

• hierarchy explicitly defined, manually edited

• based on the article classifications usually found in newspapers (culture, politics, sport, …)

Page 21

Contextual dimensions – Location

[Figure: location hierarchy. Recoverable labels, grouped as they appear in the figure:]

• World; America; USA; Europe; Italy; Trentino Alto Adige; province of Bolzano; province of Trento

• Valli Giudicarie, Val di Sole, Val di Non, Valle dell’Adige, Alto Garda e Ledro, Primiero, Val di Fiemme, Alta Valsugana, Bassa Valsugana e Tesino, Ladino di Fassa, Vallagarina

• municipality of Mezzolombardo, municipality of Mezzocorona, municipality of Trento, municipality of Lavis

• Vigolo Baselga, Sopramonte, Romagnano, Trento, Sardagna, Villazzano, Gardolo, Povo, Mattarello, Oltrefersina, Piedicastello, Bondone, Meano, Argentario, Cognola, Cadine, Baselga del Bond., Ravina

Page 22

Contextual dimensions – Subject (1)

[Figure: subject hierarchy, sport branch]

• root subject: sport, culture, justice, economy, education, environment, politics, religion

• sport: volley, hockey, basket, football, auto racing, motorcycle racing, horse racing, cycling, tennis, winter sports, water sports, athletics, martial arts, golf

• football: champions league, uefa cup, coppa italia, serie a, serie b, serie c1a, serie c1b, under 21, eccelenza

Page 23

Contextual dimensions – Subject (2)

[Figure: subject hierarchy, politics branch]

• root subject: sport, culture, justice, economy, education, environment, politics, religion

• labels in the politics branch: European parliament; United Nations; Italian politics; local politics; Presidency of the Republic; X, XI, XII, XIII, XIV, XV and XVI legislature; government Prodi I, government D’Alema I, government D’Alema II, government Amato II

Page 24

T-Box design

Ontology upper level: definition of the Person and Organization entity classes

At each level in the subject hierarchy

• Specialization of the Person and Organization classes

• e.g. Person → Sportsman → Football player

• e.g. Organization → Sport team → Football team

• Definition of relevant entity properties

• e.g. plays for (team), has coach, plays in (competition)…

Only concepts and properties with data available from Web sources were defined

Page 25

T-Box design – Football example

• Root subject: LiveMemories ontology upper level (lm - http://www.livememories.org)

• Sport subject: Sport domain (lms - http://www.livememories.org/sport, file sport/sport.otx)

• Football subject: Football domain (lmsc - http://www.livememories.org/sport/calcio, files sport/calcio/calcio.otx and sport/calcio/calcio-italiano.otx)

[Class diagram. Recoverable elements:]

• lm.persona (rdfs:label); lm.organizzazione (rdfs:label)

• lms.sportivo; lms.squadra; lms.tifoso

• lmsc.calciatore: lmsc.ha_nazionalita: string; lmsc.e_nato_il: string; lmsc.e_nato_a: string; lmsc.ha_altezza: string; lmsc.ha_peso: string

• lmsc.squadra_calcio: lmsc.ha_sede: string; lmsc.ha_stadio: string; lmsc.ha_colori: string; lmsc.ha_sito_web: string

• lmsc.allenatore_della_squadra: lmsc.ha_nazionalita: string; lmsc.e_nato_il: string; lmsc.e_nato_a: string

• lmsc.presidente_della_squadra; lmsc.arbitro; lmsc.campionato; lmsc.lega_calcio; lmsc.figc

• relations: lmsc.gioca_nella_squadra; lmsc.ha_allenatore; lmsc.e_allenatore_di; lmsc.ha_presidente; lmsc.e_organizzato_da; ha_ruolo; lmsc.gioca_nel_campionato

• lmsc.ruolo <<enumeration>>: lmsc.difensore, lmsc.centrocampista, lmsc.attaccante, lmsc.portiere

• lmsc.campionato_italiano <<enumeration>>: lmsc.serieA, lmsc.serieB, lmsc.serieC, lmsc.serieD, lmsc.coppa_italia, lmsc.supercoppa_italiana

• lmsc.campionato_europea <<enumeration>>: lmsc.champions_league

Page 26

A-Box acquisition

A-Box facts extracted from selected Web sources

• by Web page scraping, to extract structured data

• by manually encoding facts based on data found online

Linked Data not used

• because it is not available for many entities, in particular for entities local to Trentino Alto-Adige

• some Italian Wikipedia pages are available, but they have not been included in DBpedia

• because it is often incomplete or imprecise, e.g. for the Italian football domain

• better data is available on dedicated Web sites

Page 27

A-Box acquisition – Football example

Source pages:

• http://www.tuttocalciatori.net/index.php?mod=ct

• http://www.tuttocalciatori.net/index.php?mod=cc8&parstag=<SEASON>&pars=<SERIES>

• http://www.tuttocalciatori.net/index.php?mod=cc0&idcl=<CLUB>&stag=<SEASON>

• http://www.tuttocalciatori.net/index.php?mod=cc0&idcl=<CLUB>

• http://www.tuttocalciatori.net/<PLAYER_NAME>

Generated files and fact templates:

sport/calcio/campionato_serie<SERIES>_stagione_<SEASON>.otx
    gioca_nel_campionato(<TEAM>, <SERIES>)
    <TEAM> \olabel \string{<TEAM NAME>}
    ha_allenatore(<TEAM>, <COACH>)
    e_allenatore_di(<COACH>, <TEAM>)
    ...
    ha_ruolo(<PLAYER>, <ROLE>)
    gioca_nella_squadra(<PLAYER>, <TEAM>)
    ...

sport/calcio/giocatori.otx
    calciatore(<PLAYER>)
    <PLAYER> \olabel \string{<NAME>}
    ha_nazionalita(<PLAYER>, \string{<NATIONALITY>})
    e_nato_il(<PLAYER>, \string{<BIRTH_DATE>})
    e_nato_a(<PLAYER>, \string{<BIRTH_PLACE>})
    ha_altezza(<PLAYER>, \string{<HEIGHT>})
    ha_peso(orlando_massimo, \string{Kg 69})
    ...

sport/calcio/allenatori.otx
    allenatore_della_squadra(<COACH>)
    <COACH> \olabel \string{<NAME>}
    ha_nazionalita(<COACH>, \string{<NATIONALITY>})
    e_nato_il(<COACH>, \string{<BIRTH_DATE>})
    e_nato_a(<COACH>, \string{<BIRTH_PLACE>})
    ...

sport/calcio/squadre.otx
    squadra_calcio(<TEAM>)
    <TEAM> \olabel \string{<NAME>}
    ha_sede(<TEAM>, \string{<HEADQUARTERS>})
    ha_presidente(<TEAM>, <PRESIDENT>)
    presidente_della_squadra(<PRESIDENT>)
    <PRESIDENT> \olabel \string{<PRES_NAME>}
    ha_stadio(<TEAM>, \string{<STADIUM>})
    ha_colori(<TEAM>, \string{<COLOURS>})
    ha_sito_web(<TEAM>, \string{<WEB_SITE>})
    ...

Contexts shown in the figure (three values each, "-" where no value is given):

• sport, mondo, -

• calcio, trentino-alto-adige, -

• calcio, italia, -

• calcio_eccelenza, trentino-alto_adige, 2009-2010

• calcio_under_21, italia, -

• <SERIES>, italia, -

• <SERIES>, italia, <SEASON>

Page 28

A-Box acquisition – Data sources

Top-level subject: Sources

• Culture: Wikipedia (Italian pages)

• Sport: Wikipedia (Italian pages), Lega Italiana Hockey Ghiaccio, tuttohockey.com, tuttocalciatori.net, Lega Pallavolo Serie A, formula1.com, rallylink.it, racepilot.com, motogp.com

• Justice: Tribunale di Trento, Tribunale di Rovereto, Magistratura Democratica

• Economy: Camera di Commercio di Trento, Banca d'Italia

• Education: tuttitalia.it, Centro Studi Orientamento

• Politics: Parlamento Europeo, Camera dei Deputati, Senato della Repubblica, Ministero degli Interni, Regione Trentino Alto-Adige, Comune di Trento

• Religion: Web Diocesi

Page 29

A-Box acquisition – URI design

Subjects are associated with namespaces

• e.g. sport → http://www.livememories.org/sport

• e.g. football → http://www.livememories.org/sport/calcio

Ontological individuals extracted for a subject are given a URI under the corresponding namespace

• e.g. Christian Vieri → http://www.livememories.org/sport/calcio/christian_vieri

Instance matching problem: the same entities in different contexts are manually aligned by assigning a unique URI (simplistic!)

• e.g. Silvio Berlusconi is given two URIs, the second being manually discarded and replaced with the first one

http://www.livememories.org/politica/silvio_berlusconi

http://www.livememories.org/sport/calcio/silvio_berlusconi
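The URI scheme can be illustrated with a small sketch. The slug rule (lowercase, ASCII, underscores) is inferred from the two example URIs, and the `politics` namespace is inferred from the Berlusconi URI; the function names and the alias table are hypothetical, with the table standing in for the manual alignment step.

```python
import re
import unicodedata

NAMESPACES = {
    "sport": "http://www.livememories.org/sport",
    "football": "http://www.livememories.org/sport/calcio",
    "politics": "http://www.livememories.org/politica",   # inferred
}


def mint_uri(subject, name):
    """Build an individual's URI under the namespace of its subject."""
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    local = re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")
    return f"{NAMESPACES[subject]}/{local}"


# Manually maintained alignment: discarded URI -> URI kept.
ALIASES = {
    "http://www.livememories.org/sport/calcio/silvio_berlusconi":
        "http://www.livememories.org/politica/silvio_berlusconi",
}


def canonical_uri(uri):
    """Replace a discarded duplicate with the URI chosen during alignment."""
    return ALIASES.get(uri, uri)
```

For instance, `mint_uri("football", "Christian Vieri")` reproduces the URI shown on the slide, and `canonical_uri` maps the football URI of Silvio Berlusconi to the politics one.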

Page 30

Knowledge base statistics

Top-level subject   Contexts   PER individuals   ORG individuals   Avg. properties per entity (*)   Triples (**)
sport               136        30110             803               3.81                             192115
culture             20         9785              1                 2.00                             33236
justice             7          354               10                2.16                             1575
economy             7          51                1847              4.47                             11147
education           6          850               82                2.35                             3573
politics            535        12320             1124              4.64                             98780
religion            3          1391              0                 1.67                             12855
total               714        54861             3867              3.64                             352244

(*) This is the average number of distinct predicates specified for each individual (e.g. an individual that is the subject of lm:name, lm:surname and several lm:hasChild statements has 3 properties)

(**) This is computed by merging all the triples in all the contexts under a specified top-level domain, removing duplicates (e.g. T-Box axioms imported in multiple contexts).

Page 31

Entity Linking (1)

Context identification

• time – publication year

• location – any (currently, location not extracted from documents)

• subject: ranked list of keywords extracted from the text (TF-IDF), plus the article category when available (e.g. politics)

• keywords and category are mapped to values of the subject dimension via a manually crafted lexicon

[Figure: subject hierarchy fragment (Top; Sports, Science; Football, Basketball) with lexicon entries attached to subject values, e.g. {“sport”, “sportsmen”, “coach”, …} and {“football”, “goalkeeper”, “midfielder”, …}]
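The lexicon-based mapping can be sketched as follows. The two lexicon entries reuse the word sets shown in the figure; attaching them to `Sports` and `Football` respectively is an assumption, as are the function name and the scoring by number of matching terms.

```python
# Manually crafted lexicon: subject value -> trigger words (assumed pairing).
LEXICON = {
    "Sports": {"sport", "sportsmen", "coach"},
    "Football": {"football", "goalkeeper", "midfielder"},
}


def rank_subjects(keywords, category=None, lexicon=LEXICON):
    """Rank subject values by how many keywords (plus category) they match."""
    terms = list(keywords) + ([category] if category else [])
    hits = {
        subject: sum(1 for term in terms if term.lower() in words)
        for subject, words in lexicon.items()
    }
    ranked = sorted(hits, key=lambda s: (-hits[s], s))
    return [s for s in ranked if hits[s] > 0]


subjects = rank_subjects(["goalkeeper", "midfielder", "coach"])
```

A document whose keywords match nothing in the lexicon yields an empty ranking, which corresponds to a document for which no subject context can be identified.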

Page 32

Entity Linking (2)

Entity linking is performed offline

• exploited the Kore cluster (thanks to Roldano)

• occurrences to link are grouped by entity and evenly distributed into ~100 batches, each one being a job to execute on the cluster

• 8 hours to complete the linking

• there is evidence that the work can be distributed better

• performance can be improved by reducing the repository creation overhead

Linked URIs and identified contexts stored in the Lucene index
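Grouping occurrences by entity and spreading them evenly over batches can be sketched with a greedy balancing strategy. The slides do not say how the distribution was done; the largest-group-first heuristic and the function name are assumptions.

```python
import heapq
from collections import defaultdict


def make_batches(occurrences, n_batches=100):
    """Group (entity, document) pairs by entity, then balance batch sizes."""
    by_entity = defaultdict(list)
    for entity, document in occurrences:
        by_entity[entity].append((entity, document))
    # Min-heap of (current size, batch index): the largest groups are placed
    # first, each into the batch that is currently the smallest.
    heap = [(0, i) for i in range(n_batches)]
    batches = [[] for _ in range(n_batches)]
    for group in sorted(by_entity.values(), key=len, reverse=True):
        size, i = heapq.heappop(heap)
        batches[i].extend(group)
        heapq.heappush(heap, (size + len(group), i))
    return batches


pairs = [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("c", 1), ("c", 2)]
batches = make_batches(pairs, n_batches=2)
```

Keeping all occurrences of an entity in the same batch means per-entity work, such as resolving the cluster label, is done once per job; the price is that a few very frequent entities can still dominate a batch.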

Page 33

Knowledge selection and display

Contextualized entity card generation (online)

• concerns an entity occurrence in a given document

• only the knowledge valid/relevant for the context identified for the document is extracted

• e.g. only the current profession/role of a person is shown

Complete entity card generation (online)

• concerns an entity

• all the knowledge about an entity is extracted and shown

• knowledge is organized by context, from the most current/specific to the oldest/most general

• for persons, this implies a sort of CV is displayed

Entity description generation (offline)

Page 34

Contextualized entity card generation

Page 35

Complete entity card generation

Page 36

Entity description generation

Page 37

Outline

Introduction

Proposed approach

LiveMemories use case

Preliminary evaluation in LiveMemories

• quantitative evaluation (coverage)

• impact of global coreference

Conclusions and future work

Page 38

Quantitative evaluation

Entity linking performed offline for the whole corpus and the entities from cross-document coreference

• 716,455 documents from Adige, Vita Trentina, RTTR

• 181,734 entities (82% persons, 18% organizations)

• 5,704,669 processed <entity, document> pairs (occurrences)

• 13.16% of entities linked, 26.93% in terms of occurrences

Knowledge selection performed online

Evaluation

• computed coverage statistics (next slides)

• no precision/recall statistics computed so far

• requires the construction of a gold standard by manually linking a subset of documents

Page 39

Entity statistics (1)

Linking results in terms of linked entities

Category             Entities   Share   Meaning
Fully linked         15773      9%      all entity mentions linked to the same URI
Partially linked     7581       4%      some mentions linked, one URI
Ambiguously linked   554        0%      mentions linked to different URIs
Unlinked             4359       2%      no mention linked, there is some individual with the entity name in the repository
Unknown              153467     85%     no mention linked, no individual with the entity name in the repository

Total: 181,734 entities
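The five categories can be expressed as a small classification function, and the counts above can be checked against the reported entity coverage. The function signature is an assumption; note that the 13.16% figure is reproduced only when ambiguously linked entities are counted as linked.

```python
def classify_entity(mention_links, name_in_repository):
    """mention_links: one item per mention, a URI string or None if unlinked."""
    uris = {uri for uri in mention_links if uri is not None}
    if len(uris) > 1:
        return "ambiguously linked"
    if len(uris) == 1:
        all_linked = all(uri is not None for uri in mention_links)
        return "fully linked" if all_linked else "partially linked"
    return "unlinked" if name_in_repository else "unknown"


# Counts from the slide.
ENTITY_COUNTS = {
    "fully linked": 15773,
    "partially linked": 7581,
    "ambiguously linked": 554,
    "unlinked": 4359,
    "unknown": 153467,
}
total = sum(ENTITY_COUNTS.values())
linked = sum(ENTITY_COUNTS[k] for k in
             ("fully linked", "partially linked", "ambiguously linked"))
coverage = round(100 * linked / total, 2)
```

The counts sum to the 181,734 entities of the corpus, and `coverage` evaluates to the 13.16% reported in the quantitative evaluation.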

Page 40

Entity statistics (2)

Category             PER entities     ORG entities
Fully linked         14537 (10%)      1236 (4%)
Partially linked     7145 (5%)        436 (1%)
Ambiguously linked   508 (0%)         46 (0%)
Unlinked             4088 (3%)        271 (1%)
Unknown              122353 (82%)     31114 (94%)

Page 41

Occurrence statistics (1)

Linking results in terms of linked occurrences

We call occurrence each pair <entity, document> the linking procedure has been applied to

Category                    Occurrences   Share   Meaning
Fully (properly) linked     1508755       26%     refers to a fully or partially linked entity (1 URI)
Ambiguously linked          27745         1%      refers to an ambiguously linked entity (> 1 URI)
Unlinked                    367287        6%      refers to an unlinked entity (no link, some individual with the entity name in the repository)
Unknown                     3800882       67%     refers to an unknown entity (no link, no individual with the entity name in the repository)

Total: 5,704,669 occurrences

Page 42

Occurrence statistics (2)

Category             PER entities      ORG entities
Fully linked         1066243 (30%)     442512 (21%)
Ambiguously linked   10556 (0%)        17189 (1%)
Unlinked             167982 (5%)       199305 (9%)
Unknown              2332518 (65%)     1468364 (69%)

Page 43

Occurrence statistics (3)

Category             Adige archive     Vita Trentina archive   RTTR
Fully linked         1478186 (27%)     29513 (22%)             1056 (28%)
Ambiguously linked   27148 (0%)        579 (0%)                18 (0%)
Unlinked             359272 (6%)       7751 (6%)               264 (7%)
Unknown              3700380 (67%)     98016 (72%)             2486 (65%)

Page 44

Occurrence statistics (4)

[Chart: entity occurrences per month. Y axis: 0 to 50,000; X axis: months from 1999-01 to 2010-10; series: Adige, Vita Trentina, RTTR]

[Chart: percentage of occurrences w.r.t. monthly total. Y axis: 0% to 100%; X axis: months from 1999-01 to 2010-10; series: Fully linked, Ambiguously linked, Unlinked, Unknown]

Page 45

Impact of global coreference

Reminder

• when linking, we don't use the mention label (it can be too short)

• we exploit global (cross-document) coreference and take the representative label of the cluster the mention has been assigned to

If coreference is wrong, entity linking will likely be wrong too

• because the linking algorithm will start from a wrong label

• hence, linking precision is bounded by coreference precision

Problem: global coreference 'seems' to be often wrong…

• can we measure the amount of errors due to global coreference?

Page 46

Global coreference errors (1)

[Screenshot: Milan – Real Madrid news article; a mention is coreferred to Raul Cremona (comic artist)]

Page 47

Global coreference errors (2)

[Screenshot: football article about next week's Serie A matches; a mention is coreferred to Fabio Cannavaro (then linked), but should be Paolo Cannavaro]

Page 48

Global coreference errors (3)

[Screenshot: news article about a car accident; mentions are coreferred to San Mauro, to Mamma Lucia, and to Gabriele Albertini (then linked)]

Page 49

Outline

Introduction

Proposed approach

LiveMemories use case

Preliminary evaluation in LiveMemories

Conclusions and future work

• achievements and next LiveMemories tasks

• possible research directions to improve entity linking

• publications

Conclusions

Main outcomes

• construction of contextualized ontologies on various subject domains related to Italy and Trentino-Alto Adige

• fully implemented system for context-driven semantic enrichment

Future LiveMemories tasks

• precision evaluation

• combine syntactic and semantic features for coreference resolution (EVALITA?)

• add (some) semantic search features in the demonstrator

Improving entity linking

[Diagram: writer (writer's knowledge, Entity E1) → mention synthesis → mention → mention disambiguation → reader (reader's knowledge, Entity E2, =E1?); assumptions on reader's knowledge and disambiguation process; learning]

1. Extract more information about a mention

• human-readable, machine readable

• No global coreference!!!

2. Improve knowledge organization

• more context dimensions?

3. Improve disambiguation algorithm

4. learn?

• learn to be ignorant

Event dimension (1)

Car accident – First article (02/08/10)

Event dimension (2)

Car accident – Related article (02/08/10)

Mentions:

• Gloria Sommadossi (coref, CI)

• Bruno Serafin (coref + CI)

• Gigi Moncalvo (coref + CI)

• Jessica (coref)

More easily linkable by recognizing that the articles are about the same event

Event dimension (3)

Car accident – Main article (03/08/10)

Event dimension (4)

Car accident – Related article (03/08/10)

Mentions:

• Jessica (coref)

• Jessica Pellegrino (CI)
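
The idea behind the event dimension can be sketched as follows: once articles are known to be about the same event, a short mention such as "Jessica" can be resolved against fuller labels found in the other articles on that event. This is only an illustrative sketch, not the LiveMemories implementation. Event detection is assumed to be already done, and the event identifiers and the function name are hypothetical.

```python
from collections import defaultdict


def resolve_within_event(articles):
    """Resolve short mention labels against longer ones from the same event.

    `articles` is a list of (event_id, [mention labels]) pairs; the event_id
    is assumed to come from some external event-detection step.
    Returns {(event_id, short_label): full_label} for every short label that
    matches exactly one longer label within the same event.
    """
    labels_by_event = defaultdict(set)
    for event_id, mentions in articles:
        labels_by_event[event_id].update(mentions)

    resolved = {}
    for event_id, labels in labels_by_event.items():
        for short in labels:
            short_tokens = short.split()
            candidates = [
                full for full in labels
                if len(full.split()) > len(short_tokens)
                and all(token in full.split() for token in short_tokens)
            ]
            # Resolve only when the match is unambiguous.
            if len(candidates) == 1:
                resolved[(event_id, short)] = candidates[0]
    return resolved


if __name__ == "__main__":
    articles = [
        ("car-accident", ["Jessica"]),             # related article
        ("car-accident", ["Jessica Pellegrino"]),  # another article, same event
        ("unrelated", ["Jessica"]),                # different event: untouched
    ]
    print(resolve_within_event(articles))
```

Restricting the match to one event keeps the "Jessica" of an unrelated article from being merged with the same person.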

Finer context granularity

News article about Milan – Barcellona

• mention coreferred to Simone Inzaghi (should be Filippo)

By considering finer contexts (e.g. the Milan-Barcellona match instead of just Football), it may be possible to link it properly.
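
A minimal sketch of why a finer context helps, assuming background knowledge is organized as a set of entities per context. The toy knowledge base, its contents, and the function names below are hypothetical and only mirror the example on this slide.

```python
# Toy contextualized knowledge base (illustrative data only).
TOY_KB = {
    "Football": {"Filippo Inzaghi", "Simone Inzaghi"},
    "Milan-Barcellona match": {"Filippo Inzaghi"},
}


def candidates_in_context(surname, context, kb):
    """Return the entities of `context` whose last name equals `surname`."""
    return sorted(name for name in kb.get(context, ())
                  if name.split()[-1] == surname)


def link(surname, contexts, kb):
    """Try contexts from finest to coarsest; link only if a context
    contains exactly one candidate, otherwise return None."""
    for context in contexts:
        found = candidates_in_context(surname, context, kb)
        if len(found) == 1:
            return found[0]
    return None


if __name__ == "__main__":
    # Coarse context only: two candidates, so the mention stays ambiguous.
    print(link("Inzaghi", ["Football"], TOY_KB))
    # Finer context first: a single candidate remains.
    print(link("Inzaghi", ["Milan-Barcellona match", "Football"], TOY_KB))
```

This sketch declines to link when a context is ambiguous; a real system might instead rank the candidates, which is where the wrong choice on this slide could arise.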

Sentence / paragraph context

Football article about next week's Serie A matches

• 'Cannavaro' should be interpreted in the context of this sentence, which is about Napoli-Brescia

Possible research directions

Improve context modelling and detection

• finer context granularities

• paragraph / sentence level contexts

Improve disambiguation

• 'semantic distance' within a context?

Extract mention properties and apply entity matching techniques

• slot filling?

• synergies with uncertain reasoning

Learn missing contexts and populate them with unknown entities

Publications

A. Tamilin, B. Magnini, L. Serafini, C. Girardi, M. Joseph, R. Zanoli. Context-driven Semantic Enrichment of Italian News Archive. In Proc. of the 7th Extended Semantic Web Conference (ESWC'10), Semantic Web in Use Track, Heraklion, Greece.

http://dkm.fbk.eu/tamilin/publications/2010/eswc/paper.pdf

A. Tamilin, B. Magnini, L. Serafini. Leveraging Entity Linking by Contextualized Background Knowledge: A case study for news domain in Italian. In Proc. of the 6th Workshop on Semantic Web Applications and Perspectives (SWAP'10), Bressanone, Italy, 2010.

http://dkm.fbk.eu/tamilin/publications/2010/swap/paper.pdf