
Post on 13-Jan-2016


Entity Mention Detection using a Combination of Redundancy-Driven Classifiers

Silvana Marianela Bernaola Biggio, Manuela Speranza, Roberto Zanoli

{bernaola, manspera, zanoli}@fbk.eu

Fondazione Bruno Kessler – Irst

Trento, Italy

The present work is supported by the LiveMemories Project.

May 2010


Outline

• Entity Mention Detection: an extension of the NER task.

• The system to be presented:

- Mention levels: NAM, NOM, PRO
- Entity types: GPE, LOC, ORG, PER
- Drawing from 2 systems (ACE 2008, EVALITA 2009)
- 2 new features to recognize mentions
- Applied in LiveMemories and Italian Wikipedia
- Available as a web service, to be integrated into TextPro

Mentions: Named Entities

Venezuelan President Hugo Chavez on Saturday called for Internet regulations. He demanded that authorities crack down on a news Web site he accused of spreading false information. "The Internet cannot be something open where anything is said and done." President said, according to reports by Reuters.

Hugo Rafael Chávez Frías (28 July 1954) is the President of Venezuela.

4 mentions of type NAM (proper name): 2 PER, 1 ORG, 1 GPE

Mentions: Nominals

In the example text above: 3 nominal mentions (NOM), all of type PER

Mentions: Pronominals

In the example text above: 2 pronoun mentions (PRO), both of type PER

Nested Mentions

In the phrase "Venezuelan President Hugo Chavez":

• One-level mentions: Hugo Chavez, Venezuelan
• Two-level mention: Venezuelan President
• Three-level mention: Venezuelan President Hugo Chavez

Entities

Across the example text above, 6 different mentions refer to 1 entity of type PER

The idea …

Exploiting a large corpus to improve the detection of mentions:

- Patterns
- Data redundancy

“… Italia …”, “… Rossi …”, “… Benetton …”

Pattern Extraction

[After annotating the large corpus]

1. Candidates: the words in a window of five positions on each side of the mention:

word(n-5) word(n-4) word(n-3) word(n-2) word(n-1) [word(n) = MENTION] word(n+1) word(n+2) word(n+3) word(n+4) word(n+5)

2. TF – IDF (Term Frequency – Inverse Document Frequency):

• Pattern Frequency: the more frequently a pattern occurs with mentions that belong to a specific category, the more important it is for that category.

• Inverse Category Frequency: the more categories a pattern occurs with, the smaller its contribution to characterizing the semantics of any category it co-occurs with.
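The two weighting ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not give the exact formula, so the weight used here (pattern count times the logarithm of the number of categories divided by the number of categories the pattern occurs with) is a plain TF-IDF analogue chosen for the sketch. The positional pattern form `(offset, word)`, the function names, and the toy corpus are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def extract_patterns(tokens, mention_index, window=5):
    """Return (offset, word) context features around the mention token."""
    patterns = []
    for offset in range(-window, window + 1):
        position = mention_index + offset
        if offset != 0 and 0 <= position < len(tokens):
            patterns.append((offset, tokens[position].lower()))
    return patterns


def score_patterns(annotated_mentions, window=5):
    """Weight each pattern per category: count * log(C / categories seen with it)."""
    frequency = defaultdict(Counter)   # category -> pattern -> count
    categories_of = defaultdict(set)   # pattern -> categories it occurs with
    for tokens, mention_index, category in annotated_mentions:
        for pattern in extract_patterns(tokens, mention_index, window):
            frequency[category][pattern] += 1
            categories_of[pattern].add(category)

    total_categories = len(frequency)
    scores = {}
    for category, counts in frequency.items():
        scores[category] = {
            pattern: count * math.log(total_categories / len(categories_of[pattern]))
            for pattern, count in counts.items()
        }
    return scores


# Toy annotated corpus (invented for illustration): tokens, mention position, category.
corpus = [
    ("il sindaco di Torino ha parlato".split(), 3, "GPE"),
    ("il sindaco di Trento ha parlato".split(), 3, "GPE"),
    ("il presidente di Benetton ha parlato".split(), 3, "ORG"),
]
scores = score_patterns(corpus)

for pattern, weight in sorted(scores["GPE"].items(), key=lambda item: -item[1]):
    print(pattern, round(weight, 3))
```

In this toy corpus "sindaco" occurs only with GPE mentions and gets a positive weight, while "di" occurs with both categories and gets a weight of zero, which is the behaviour the Inverse Category Frequency bullet describes.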

Data Redundancy

1. “... La giunta Coni sostiene la candidatura di Torino per le Olimpiadi giovanili 2010. ...” (“The CONI board supports Turin's candidacy for the 2010 Youth Olympics.”) A GPE or an ORG (soccer team)?

2. Prob(“Torino”/type=“GPE”)?

• Use a classifier to recognize all mentions in a large corpus in order to obtain the probability distribution for all mentions across all possible types (PER, ORG, GPE, LOC).

Mention = “Torino”

B-GPE_NAM: 11823
B-ORG_NAM: 2950
B-LOC_NAM: 33
B-PER_NAM: 5
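As a worked example, the counts reported for “Torino” can be normalized into a distribution over types. The sketch below assumes that the redundancy feature is a simple relative frequency of the labels a classifier assigned to the mention string in the large corpus; the slides do not state the exact normalization, and the function name is hypothetical.

```python
def type_distribution(label_counts):
    """Turn per-label counts for one mention string into relative frequencies."""
    total = sum(label_counts.values())
    return {label: count / total for label, count in label_counts.items()}


# Counts for the mention "Torino" as reported on the slide.
torino_counts = {
    "B-GPE_NAM": 11823,
    "B-ORG_NAM": 2950,
    "B-LOC_NAM": 33,
    "B-PER_NAM": 5,
}

torino_distribution = type_distribution(torino_counts)
most_likely = max(torino_distribution, key=torino_distribution.get)

for label, probability in torino_distribution.items():
    print(f"{label}: {probability:.4f}")
print("most likely:", most_likely)
```

With these counts roughly 80% of the occurrences are labelled GPE and roughly 20% ORG, so the distribution favours GPE while still leaving room for the soccer-team reading.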

System Architecture

• Identifies the syntactic head of a mention and its mention level. For the extension of a mention, we use MaltParser for Italian (Lavelli et al. 2009).

• Recognizes the type of a mention.

[Figure: System Architecture, presented as a six-step walkthrough (steps 1 to 6); the diagrams are not preserved in this transcript.]

Evaluation and Feature Analysis

1. EVALITA 2009 EMD Task: value = 65.7%

2. Feature Analysis (FB1):

Class      All features   Without redundancy   Without patterns
General    79.58%         74.09%               79.28%
NAM_GPE    83.65%         78.37%               82.83%
NAM_LOC    73.02%         77.52%               73.02%
NAM_ORG    73.92%         66.81%               72.94%
NAM_PER    91.63%         88.86%               92.03%
NOM_GPE    75.86%         55.38%               75.18%
NOM_LOC    62.37%         55.10%               59.18%
NOM_ORG    71.46%         64.03%               70.41%
NOM_PER    86.32%         78.29%               86.08%
PRO_GPE    30.77%         14.29%               24.00%
PRO_ORG    29.17%         27.59%               30.56%
PRO_PER    69.58%         68.43%               69.97%
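To read the feature analysis more easily, the contribution of each feature can be computed as the difference between the FB1 with all features and the FB1 with that feature removed. The sketch below only re-uses the figures from the table; the variable and function names are hypothetical.

```python
# FB1 per class: (all features, without redundancy, without patterns), from the table.
fb1 = {
    "General": (79.58, 74.09, 79.28),
    "NAM_GPE": (83.65, 78.37, 82.83),
    "NAM_LOC": (73.02, 77.52, 73.02),
    "NAM_ORG": (73.92, 66.81, 72.94),
    "NAM_PER": (91.63, 88.86, 92.03),
    "NOM_GPE": (75.86, 55.38, 75.18),
    "NOM_LOC": (62.37, 55.10, 59.18),
    "NOM_ORG": (71.46, 64.03, 70.41),
    "NOM_PER": (86.32, 78.29, 86.08),
    "PRO_GPE": (30.77, 14.29, 24.00),
    "PRO_ORG": (29.17, 27.59, 30.56),
    "PRO_PER": (69.58, 68.43, 69.97),
}


def feature_gains(table):
    """Return, per class, the FB1 points gained by redundancy and by patterns."""
    gains = {}
    for cls, (all_features, no_redundancy, no_patterns) in table.items():
        gains[cls] = (
            round(all_features - no_redundancy, 2),
            round(all_features - no_patterns, 2),
        )
    return gains


gains = feature_gains(fb1)
for cls, (redundancy_gain, pattern_gain) in gains.items():
    print(f"{cls:8s} redundancy {redundancy_gain:+6.2f}  patterns {pattern_gain:+6.2f}")
```

The redundancy feature helps every class except NAM_LOC, where removing it raises FB1 by 4.5 points, and its largest gain is on NOM_GPE. The pattern feature changes the general FB1 by only 0.3 points and slightly lowers NAM_PER, PRO_ORG and PRO_PER.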

Applications …

1. LiveMemories Project: identifying mentions in 2 Italian corpora:

A. Articles from the local newspaper “L’Adige”

B. Blogs posted by students living in the university residence of “San Bartolomeo”

2. Semantic Wikipedia for Italian (SWiiT), http://textpro.fbk.eu/resources/SWiiT.html, annotated at 5 levels:

A. Basic NLP processing
B. Entity Mentions
C. Entity Subtypes (work in progress)
D. Entity Co-reference (work in progress)
E. Dependency parsing (work in progress)

System available as …

1. A web service: http://textpro.fbk.eu/typhoon.html

• Using Axis (open source, XML-based web service framework)

• Allows the user to submit a document and have it annotated with entity mentions using the IOB format

2. Part of TextPro: http://textpro.fbk.eu (work in progress)
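For readers unfamiliar with the IOB format, here is a minimal sketch of how mention spans map to per-token tags. It is not the web service's code, and its exact output layout is not shown in the slides. The label form (type followed by mention level, as in "B-GPE_NAM" on the Data Redundancy slide) is taken from the slides; the "I-" continuation prefix follows the usual IOB convention, and the function name and example sentence are hypothetical. Nested mentions are not handled by this flat encoding.

```python
def to_iob(tokens, spans):
    """Encode non-overlapping (start, end, label) token spans as IOB tags.

    `end` is exclusive; tokens outside every span are tagged "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for position in range(start + 1, end):
            tags[position] = f"I-{label}"
    return tags


tokens = ["Hugo", "Chavez", "spoke", "to", "Reuters"]
spans = [(0, 2, "PER_NAM"), (4, 5, "ORG_NAM")]
tags = to_iob(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```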

Conclusions and future work

1. Pronominal mentions are difficult to recognize; coreference is needed.

2. Data redundancy improves the general FB1 by around 5 percentage points, and by around 20 points for nominal mentions that refer to geopolitical entities.

3. The results for patterns were not what we expected, probably because the patterns selected for each class were not appropriate. As future work, we would like to find out how to select the right patterns for each class.

References

• Bartalesi Lenzi, V., Sprugnoli, R. (2009). EVALITA 2009: Description and Results of the Local Entity Detection and Recognition (LEDR) task. In Proceedings of Evalita 2009, workshop held at AI*IA, 12 December 2009, Reggio Emilia, Italy.

• Bernaola Biggio, S.M., Zanoli, R., Giuliano, C., Uryupina, O., Versley, Y., Poesio, M. (2009). Local Entity Detection and Recognition Task. In Proceedings of Evalita 2009, workshop held at AI*IA, 12 December 2009, Reggio Emilia, Italy.

• Bernaola Biggio, S.M., Speranza, M., Zanoli, R. (2010). Entity Mention Detection Using a Combination of Redundancy-Driven Classifiers. In Proceedings of LREC 2010, 7th Conference on Language Resources and Evaluation, Valletta, Malta.

• Lavelli, A., Hall, J., Nilsson, J., Nivre, J. (2009). MaltParser at the EVALITA 2009 Dependency Parsing Task. In Proceedings of Evalita 2009, workshop held at AI*IA, 12 December 2009, Reggio Emilia, Italy.

• Magnini, B., Cappelli, A., Pianta, E., Speranza, M., Bartalesi Lenzi, V., Sprugnoli, R., Romano, L., Girardi, C., Negri, M. (2006). Annotazione di contenuti concettuali in un corpus italiano: I-CAB. In Proceedings of SILFI 2006, Florence, Italy.

• Speranza, M. (2009). The Named Entity Recognition Task at EVALITA 2009. In Proceedings of Evalita 2009, workshop held at AI*IA, 12 December 2009, Reggio Emilia, Italy.