Entity Mention Detection using a Combination of Redundancy-Driven Classifiers
Silvana Marianela Bernaola Biggio, Manuela Speranza, Roberto Zanoli
{bernaola, manspera, zanoli}@fbk.eu
Fondazione Bruno Kessler – Irst, Trento, Italy
The present work is supported by the LiveMemories Project.
May 2010
Outline
• Entity Mention Detection: an extension of the NER task.
• The system to be presented:
  - Mention levels: NAM, NOM, PRO
  - Entity types: GPE, LOC, ORG, PER
  - Draws on 2 systems (ACE 2008, EVALITA 2009)
  - 2 new features to recognize mentions
  - Applied in LiveMemories and Italian Wikipedia
  - Available as a web service, to be integrated into TextPro
Mentions: Named Entities
Venezuelan President Hugo Chavez on Saturday called for Internet regulations. He demanded that authorities crack down on a news Web site he accused of spreading false information. "The Internet cannot be something open where anything is said and done." President said, according to reports by Reuters.
Hugo Rafael Chávez Frías (28 July 1954) is the President of Venezuela.
4 mentions of type NAM (proper name): 2 PER, 1 ORG, 1 GPE
Mentions: Nominals
Venezuelan President Hugo Chavez on Saturday called for Internet regulations. He demanded that authorities crack down on a news Web site he accused of spreading false information. "The Internet cannot be something open where anything is said and done." President said, according to reports by Reuters.
Hugo Rafael Chávez Frías (28 July 1954) is the President of Venezuela.
3 nominal mentions (NOM): 3 PER
Mentions: Pronominals
Venezuelan President Hugo Chavez on Saturday called for Internet regulations. He demanded that authorities crack down on a news Web site he accused of spreading false information. "The Internet cannot be something open where anything is said and done." President said, according to reports by Reuters.
Hugo Rafael Chávez Frías (28 July 1954) is the President of Venezuela.
2 pronoun mentions (PRO): 2 PER
Nested Mentions
Venezuelan President Hugo Chavez on Saturday called for Internet regulations. He demanded that authorities crack down on a news Web site he accused of spreading false information. "The Internet cannot be something open where anything is said and done." President said, according to reports by Reuters.
• One-level mentions: Hugo Chavez; Venezuelan
• Two-level mention: Venezuelan President
• Three-level mention: Venezuelan President Hugo Chavez
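The nesting levels on this slide can be made concrete by treating each mention as a token span. The following is a minimal sketch, not the system's implementation: the token offsets are illustrative, and `contains` and `nesting_level` are hypothetical helper names introduced here.

```python
# Tokens at the start of the example sentence.
tokens = ["Venezuelan", "President", "Hugo", "Chavez"]

# Mentions as (start, end) token offsets, end exclusive (illustrative spans).
mentions = {
    "Venezuelan": (0, 1),
    "Hugo Chavez": (2, 4),
    "Venezuelan President": (0, 2),
    "Venezuelan President Hugo Chavez": (0, 4),
}


def contains(outer, inner):
    """True if `inner` lies inside `outer` and the two spans differ."""
    return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]


def nesting_level(span, all_spans):
    """1 for a mention with no nested mention, else 1 + the deepest nested level."""
    inner = [s for s in all_spans if contains(span, s)]
    if not inner:
        return 1
    return 1 + max(nesting_level(s, all_spans) for s in inner)


all_spans = list(mentions.values())
levels = {text: nesting_level(span, all_spans) for text, span in mentions.items()}

for text, level in levels.items():
    print(f"level {level}: {text}")
```

Under these assumed spans, "Venezuelan President" comes out as a two-level mention because it contains "Venezuelan", and the full phrase as a three-level mention because it contains "Venezuelan President".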
Entities
Venezuelan President Hugo Chavez on Saturday called for Internet regulations. He demanded that authorities crack down on a news Web site he accused of spreading false information. "The Internet cannot be something open where anything is said and done." President said, according to reports by Reuters.
Hugo Rafael Chávez Frías (28 July 1954) is the President of Venezuela.
6 different mentions refer to 1 entity of type PER
The idea …
Exploiting a large corpus to improve the detection of mentions:
- Patterns
- Data redundancy
“… Italia …”
“… Rossi …”
“… Benetton …”
Pattern Extraction
[After annotating the large corpus]
1. Candidates:
   word(n-5) word(n-4) word(n-3) word(n-2) word(n-1) word(n) word(n+1) word(n+2) word(n+3) word(n+4) word(n+5)
   MENTION
2. TF-IDF (Term Frequency-Inverse Document Frequency):
   • Pattern Frequency: the more frequently a pattern occurs with mentions that belong to a specific category, the more important it is for that category.
   • Inverse Category Frequency: the more categories a pattern occurs with, the smaller its contribution to characterizing the semantics of any category it co-occurs with.
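One possible reading of the two weighting criteria, by analogy with TF-IDF, is sketched below. The slides do not give the exact formula, so the product `pf * log(|C| / cf)` is an assumption, `pattern_scores` is a hypothetical helper, and the co-occurrence counts are invented purely for illustration.

```python
import math
from collections import defaultdict

CATEGORIES = ["PER", "ORG", "GPE", "LOC"]

# Toy (pattern, category) co-occurrence counts; invented for illustration only.
counts = {
    ("il presidente", "PER"): 40,
    ("il presidente", "ORG"): 5,
    ("il sindaco di", "GPE"): 30,
    ("di", "PER"): 50,
    ("di", "ORG"): 50,
    ("di", "GPE"): 50,
    ("di", "LOC"): 50,
}


def pattern_scores(counts, categories):
    """Score each (pattern, category) pair as pf * log(|C| / cf) (assumed formula)."""
    # cf: number of distinct categories each pattern co-occurs with.
    seen = defaultdict(set)
    for (pattern, category), n in counts.items():
        if n > 0:
            seen[pattern].add(category)
    scores = {}
    for (pattern, category), n in counts.items():
        if n <= 0:
            continue  # a pattern never seen with a category gets no score
        icf = math.log(len(categories) / len(seen[pattern]))
        scores[(pattern, category)] = n * icf
    return scores


scores = pattern_scores(counts, CATEGORIES)
for (pattern, category), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{pattern!r:18} {category}  {score:.2f}")
```

With this weighting, a pattern such as "di" that occurs with every category scores zero, while a pattern seen with a single category keeps its full frequency weight.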
Data Redundancy
1. “... La giunta Coni sostiene la candidatura di Torino per le Olimpiadi giovanili 2010. ...” (English: “... The CONI board supports Torino's candidacy for the 2010 Youth Olympics. ...”) A GPE or an ORG (soccer team)?
2. Prob(“Torino”/type=“GPE”)?
   • Use a classifier to recognize all mentions in a large corpus in order to obtain the probability distribution for all mentions across all possible types (PER, ORG, GPE, LOC).

Mention = “Torino”
B-GPE_NAM: 11823
B-ORG_NAM: 2950
B-LOC_NAM: 33
B-PER_NAM: 5
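The counts for “Torino” can be turned into a distribution over types by simple normalization. This is a minimal sketch that assumes relative frequency as the estimate; the slides do not state how the distribution is computed or how it is encoded as a feature, and `type_distribution` is a hypothetical helper.

```python
# Label counts for the mention "Torino" in the large corpus (taken from the slide).
torino_counts = {"GPE": 11823, "ORG": 2950, "LOC": 33, "PER": 5}


def type_distribution(counts):
    """Normalize raw label counts into relative frequencies."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


dist = type_distribution(torino_counts)
for label, p in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {p:.4f}")
```

Under this estimate GPE receives about 80% of the probability mass and ORG about 20%, which reflects the ambiguity between the city and the soccer team.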
System Architecture
• Identifies the syntactic head of a mention and its mention level. For the extension of a mention, we use the MaltParser for Italian (Lavelli et al. 2009).
• Recognizes the type of a mention.
[Figures: System Architecture, steps 1 to 6]
Evaluation and Feature Analysis
1. EVALITA 2009 EMD Task: value = 65.7%
2. Feature Analysis (FB1):

Class      All features   NOT redundancy   NOT pattern
General    79.58%         74.09%           79.28%
NAM_GPE    83.65%         78.37%           82.83%
NAM_LOC    73.02%         77.52%           73.02%
NAM_ORG    73.92%         66.81%           72.94%
NAM_PER    91.63%         88.86%           92.03%
NOM_GPE    75.86%         55.38%           75.18%
NOM_LOC    62.37%         55.10%           59.18%
NOM_ORG    71.46%         64.03%           70.41%
NOM_PER    86.32%         78.29%           86.08%
PRO_GPE    30.77%         14.29%           24.00%
PRO_ORG    29.17%         27.59%           30.56%
PRO_PER    69.58%         68.43%           69.97%
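The contribution of each feature can be read off the table as the difference between the full system and the system without that feature. A small sketch of that arithmetic, using the FB1 values above; `feature_gain` is a hypothetical helper name.

```python
# FB1 per class: (all features, without redundancy, without patterns).
fb1 = {
    "General": (79.58, 74.09, 79.28),
    "NAM_GPE": (83.65, 78.37, 82.83),
    "NAM_LOC": (73.02, 77.52, 73.02),
    "NAM_ORG": (73.92, 66.81, 72.94),
    "NAM_PER": (91.63, 88.86, 92.03),
    "NOM_GPE": (75.86, 55.38, 75.18),
    "NOM_LOC": (62.37, 55.10, 59.18),
    "NOM_ORG": (71.46, 64.03, 70.41),
    "NOM_PER": (86.32, 78.29, 86.08),
    "PRO_GPE": (30.77, 14.29, 24.00),
    "PRO_ORG": (29.17, 27.59, 30.56),
    "PRO_PER": (69.58, 68.43, 69.97),
}


def feature_gain(row):
    """FB1 points gained by the redundancy and pattern features, respectively."""
    all_feats, no_redundancy, no_pattern = row
    return (round(all_feats - no_redundancy, 2), round(all_feats - no_pattern, 2))


gains = {cls: feature_gain(row) for cls, row in fb1.items()}
for cls, (redundancy, pattern) in gains.items():
    print(f"{cls:8} redundancy {redundancy:+6.2f}   pattern {pattern:+6.2f}")
```

A negative value means the system scored higher without the feature, as happens for redundancy on NAM_LOC and for patterns on NAM_PER, PRO_ORG and PRO_PER.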
Applications …
1. LiveMemories Project: identifying mentions in 2 Italian corpora:
   A. Articles from the local newspaper “L’Adige”
   B. Blogs posted by students living in the university residence of “San Bartolomeo”
2. Semantic Wikipedia for Italian (SWiiT), http://textpro.fbk.eu/resources/SWiiT.html, annotated at 5 levels:
   A. Basic NLP processing
   B. Entity Mentions
   C. Entity Subtypes (work in progress)
   D. Entity Co-reference (work in progress)
   E. Dependency parsing (work in progress)
System available as …
1. A web service: http://textpro.fbk.eu/typhoon.html
   • Using Axis (open source, XML-based web service framework)
   • Allows the user to submit a document and have it annotated with entity mentions using the IOB format
2. Part of TextPro: http://textpro.fbk.eu (work in progress)
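To illustrate what IOB-annotated output looks like to a consumer, the sketch below groups tagged tokens back into mentions. It assumes tags of the form `B-TYPE_LEVEL` (as in `B-GPE_NAM` on the data redundancy slide), with `I-` continuation tags and `O` for tokens outside any mention. The actual output layout of the web service is not shown in the slides, the example tagging is illustrative, `iob_to_mentions` is a hypothetical helper, and nested mentions are not handled.

```python
def iob_to_mentions(tokens, tags):
    """Collect (text, entity_type, mention_level) triples from IOB tags."""
    mentions, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # a new mention starts right after another one
                mentions.append((" ".join(current), *label))
            current, label = [token], tuple(tag[2:].split("_"))
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O" or a stray "I-" tag closes any open mention
            if current:
                mentions.append((" ".join(current), *label))
            current, label = [], None
    if current:
        mentions.append((" ".join(current), *label))
    return mentions


# Illustrative tagging, not actual system output.
tokens = ["Hugo", "Chavez", "is", "the", "President", "of", "Venezuela"]
tags = ["B-PER_NAM", "I-PER_NAM", "O", "O", "B-PER_NOM", "O", "B-GPE_NAM"]

mentions = iob_to_mentions(tokens, tags)
for text, entity_type, level in mentions:
    print(f"{text}: type={entity_type}, level={level}")
```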
Conclusions and future work
1. Pronominal mentions are difficult to recognize; coreference is needed.
2. Data redundancy improves the general FB1 by around 5 percentage points, and by around 20 points for nominal mentions that refer to geopolitical entities.
3. The results for patterns were not as expected, probably because the patterns selected for each class were not appropriate. As future work, we would like to find out how to select the right patterns for each class.
References
• Bartalesi Lenzi, V., Sprugnoli, R. (2009). EVALITA 2009: Description and Results of the Local Entity Detection and Recognition (LEDR) task. In Proceedings of Evalita 2009, workshop held at AI*IA, 12 December 2009, Reggio Emilia, Italy.
• Bernaola Biggio, S.M., Zanoli, R., Giuliano, C., Uryupina, O., Versley, Y., Poesio, M. (2009). Local Entity Detection and Recognition Task. In Proceedings of Evalita 2009, workshop held at AI*IA, 12 December 2009, Reggio Emilia, Italy.
• Bernaola Biggio, S.M., Speranza, M., Zanoli, R. (2010). Entity Mention Detection Using a Combination of Redundancy-Driven Classifiers. In Proceedings of LREC 2010, 7th Conference on Language Resources and Evaluation, Valletta, Malta.
• Lavelli, A., Hall, J., Nilsson, J., Nivre, J. (2009). MaltParser at the EVALITA 2009 Dependency Parsing Task. In Proceedings of Evalita 2009, workshop held at AI*IA, 12 December 2009, Reggio Emilia, Italy.
• Magnini, B., Cappelli, A., Pianta, E., Speranza, M., Bartalesi Lenzi, V., Sprugnoli, R., Romano, L., Girardi, C., Negri, M. (2006). Annotazione di contenuti concettuali in un corpus italiano: I-CAB. In Proceedings of SILFI 2006, Florence, Italy.
• Speranza, M. (2009). The Named Entity Recognition Task at EVALITA 2009. In Proceedings of Evalita 2009, workshop held at AI*IA, 12 December 2009, Reggio Emilia, Italy.