RESULTS
REFERENCES
S. Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague.
C. Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.
J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. 2011. Robust disambiguation of named entities in text. In the Conference on Empirical Methods in Natural Language Processing (EMNLP), Edinburgh, Scotland.
J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. 2013. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. In Artificial Intelligence Journal.
L. Ratinov and D. Roth. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL.
ACKNOWLEDGMENTS
This work has been produced with the support of the project SKATeR (TIN2012-38584-C06-01).
Tasks:
- Reduce the ambiguity of the mention by expanding the query from its context.
- Enrich the background document with information retrieved from knowledge resources.
1-1) Query Classification
1-2) Background Document Enrichment
1-3) Alternate Name Generation
TALP Research Center, UPC, Spain.
A. Naderi, H. Rodríguez, and J. Turmo
TALP EL System in TAC-KBP 2013
ABSTRACT
This poster presents our Entity Linking (EL) system, which uses a topic modeling approach and takes advantage of a large Wikipedia-based knowledge resource to enrich background documents with relevant information in order to increase linking accuracy.
{anaderi, horacio, turmo}@lsi.upc.edu
The VSM components used to rank candidates are extracted from the background
document of each query, so as many of its entities as possible need to be
disambiguated. To this end, the AIDA system (Hoffart et al., 2011) is applied.
AIDA is a framework for entity detection
and disambiguation. Given a natural-
language text or a Web table, it maps
mentions of ambiguous names onto
canonical entities registered in YAGO2
(Hoffart et al., 2013).
YAGO2 is a large semantic KB derived from Wikipedia (WP), WordNet (Fellbaum, 1998)
and GeoNames, containing more than 10 million entities and more than 120 million
facts about these entities. Each entity in YAGO2 carries several kinds of
information, including weighted keyphrases.
A keyphrase is a piece of contextual information extracted from the link anchors,
in-links, title and WP categories of the corresponding entity page; it can be used
for entity disambiguation. We use AIDA to extract keyphrases for the entities in
the background document.
Figure 2 shows an example of producing the keyphrases related to the background
document mentions “Man U”, “Liverpool”, and “Premier league” using AIDA for the
query name “Scholes”.
The detailed architecture of the system, with the sample query “Scholes”, is
depicted in Figure 4.
Task: This module sorts the retrieved candidates according to their likelihood of being the correct referent.
Our ranking approach is a Vector Space Model (VSM) inspired by Cucerzan (2007).
In our case, the vector space consists of the whole set of words within the keyphrases found in the enriched
background document, and each word is weighted by its TF-IDF computed against the set of candidate documents. We
compare vectors using cosine similarity. In addition, to reduce dimensionality we apply Latent Semantic Indexing (LSI).
A term clustering method is applied to cluster NIL queries.
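The ranking step can be illustrated with a minimal sketch, assuming whitespace tokenization and a smoothed IDF; neither detail is stated on the poster. The function names and the toy keyphrases and candidate texts are hypothetical, and the LSI dimensionality reduction is deliberately left out to keep the example dependency-free.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization.
    return text.lower().split()


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(keyphrases, candidate_docs):
    """Rank candidates by cosine similarity to the keyphrase vector."""
    # The vector space is the set of words occurring in the keyphrases.
    query_counts = Counter(w for kp in keyphrases for w in tokenize(kp))
    vocab = sorted(query_counts)
    doc_counts = {name: Counter(tokenize(text))
                  for name, text in candidate_docs.items()}
    n_docs = len(doc_counts)
    # IDF is computed against the candidate documents (smoothed, an assumption).
    idf = {}
    for w in vocab:
        df = sum(1 for counts in doc_counts.values() if w in counts)
        idf[w] = math.log((1 + n_docs) / (1 + df)) + 1.0

    def vector(counts):
        return [counts[w] * idf[w] for w in vocab]

    query_vec = vector(query_counts)
    scores = {name: cosine(query_vec, vector(counts))
              for name, counts in doc_counts.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data for the query "Scholes".
keyphrases = ["manchester united", "premier league", "england midfielder"]
candidates = {
    "Paul Scholes": "paul scholes england midfielder manchester united premier league",
    "Scholes, Illinois": "scholes is an unincorporated community in illinois",
}
ranking = rank_candidates(keyphrases, candidates)
```

Here the footballer's page shares every keyphrase word with the background document and is ranked first, while the second candidate shares none and scores zero.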
Universitat Politecnica de Catalunya
(BarcelonaTech)
Fig. 1: General architecture of EL systems
Fig. 2: Enriching background document of the query
“Scholes” to generate keyphrases using AIDA system
Fig. 3: Sample background document from the TAC-KBP data set
Fig. 4: Detailed architecture of our EL system with a sample query “Scholes”
Fig. 5: A KB candidate entity page for query “Scholes”
containing a set of facts and its informative context
In this step, a set of Alternate Names (ANs) for each
query is generated from the content of its
corresponding background document. In Figure 3, the
system uses acronym expansion to extract
“Football Association” from “FA”.
In addition, several auxiliary gazetteers are applied,
such as:
- US states (e.g., the pair <CA, California>).
- Country abbreviations (e.g., the pair <UK, United
Kingdom>).
Thus, a set of potential candidates is generated from
each AN of each query.
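A minimal sketch of alternate name generation follows. It assumes that acronym expansion means finding a run of capitalized words in the background document whose initials spell the acronym, which is one common heuristic and not necessarily the one used by the system. The function names and the two-entry gazetteer are hypothetical.

```python
import re

# Hypothetical gazetteer holding only the two pairs quoted on the poster.
GAZETTEER = {"CA": "California", "UK": "United Kingdom"}


def expand_acronym(acronym, text):
    """Return capitalized word runs in text whose initials spell the acronym."""
    words = re.findall(r"[A-Za-z]+", text)
    size = len(acronym)
    found = []
    for i in range(len(words) - size + 1):
        window = words[i:i + size]
        if all(w[0] == c and w[0].isupper() for w, c in zip(window, acronym)):
            phrase = " ".join(window)
            if phrase not in found:
                found.append(phrase)
    return found


def alternate_names(query, text):
    """Collect the query itself, acronym expansions and gazetteer entries."""
    names = [query]
    if query.isupper():
        names.extend(expand_acronym(query, text))
    if query in GAZETTEER:
        names.append(GAZETTEER[query])
    return names


# Hypothetical background text.
background = "The Football Association (FA) fined the club. FA officials confirmed it."
names = alternate_names("FA", background)
```

Each name returned by `alternate_names` would then be used as a separate lookup key during candidate generation.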
Task: Given a particular query q, a set of candidates C is found by
retrieving those entries from the KB whose names are similar
enough, according to the Dice coefficient, to one of the alternate
names of q found during query expansion.
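As a rough sketch of this retrieval step, the Dice coefficient of two sets X and Y is 2|X ∩ Y| / (|X| + |Y|). The poster does not say which units are compared or where the cut-off lies, so the character bigrams and the 0.8 threshold below are assumptions, and the function names are hypothetical.

```python
def bigrams(name):
    # Assumption: names are compared as sets of lowercase character bigrams.
    s = name.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient between the bigram sets of two names."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def generate_candidates(alternate_names, kb_names, threshold=0.8):
    """Keep KB names that are similar enough to at least one alternate name."""
    return [kb for kb in kb_names
            if any(dice(an, kb) >= threshold for an in alternate_names)]


# Hypothetical KB entry names.
kb_names = ["Paul Scholes", "Paul Simon", "Manchester United"]
candidates = generate_candidates(["Paul Scholes"], kb_names)
```

A lower threshold admits more candidates at the cost of a noisier set for the ranking step.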
In general, KB entity pages contain facts and an informative context
about the entity. We enrich the context of each KB candidate entity by
looking up the entities named in its facts as separate entries in the
reference KB and then merging their informative contexts with the
current one. With this technique, the context of each candidate
becomes more discriminative and informative.
Figure 5 shows a sample KB entity page corresponding to entity name
“Paul Scholes”. The system collects the <wiki_text> information of its
related entities “Manchester United” and “England” to enrich the
<wiki_text> of “Paul Scholes”.
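The enrichment of a candidate's context can be sketched as follows, assuming the KB is held in memory as a dictionary keyed by entity name, with a `facts` mapping and a `wiki_text` string per entry. This layout, the fact names and the texts are hypothetical; the actual reference KB is distributed as XML.

```python
def enrich_context(entity, kb):
    """Append the wiki_text of every fact value that is itself a KB entry."""
    page = kb[entity]
    parts = [page["wiki_text"]]
    for value in page["facts"].values():
        related = kb.get(value)
        # Fact values without their own KB entry are skipped.
        if related is not None and value != entity:
            parts.append(related["wiki_text"])
    return " ".join(parts)


# Hypothetical miniature KB.
kb = {
    "Paul Scholes": {
        "facts": {"club": "Manchester United",
                  "national_team": "England",
                  "position": "Midfielder"},
        "wiki_text": "Paul Scholes is an English footballer.",
    },
    "Manchester United": {
        "facts": {},
        "wiki_text": "Manchester United is a football club based in Old Trafford.",
    },
    "England": {
        "facts": {},
        "wiki_text": "England is a country that is part of the United Kingdom.",
    },
}
enriched = enrich_context("Paul Scholes", kb)
```

In this toy KB, the enriched text for "Paul Scholes" now also contains the contexts of "Manchester United" and "England", while "Midfielder" is ignored because it has no entry of its own.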
          All Docs
          All Entities   PER     ORG     GPE
Overall   0.435          0.535   0.538   0.248
In-KB     0.285          0.333   0.320   0.242
NIL       0.584          0.736   0.607   0.248
1. QUERY EXPANSION AND ENRICHMENT
2. CANDIDATE GENERATION
3. CANDIDATE RANKING AND NIL CLUSTERING
Table 1: The official TALP EL results (B-cubed+ F1) in TAC-KBP 2013
GENERAL ARCHITECTURE
As shown in Figure 1, our EL approach follows the typical state-of-the-art architecture, which comprises the following steps:
1. Query Expansion and Enrichment
2. Candidate Generation
3. Candidate Ranking and NIL Clustering
The system classifies queries into three entity types (PER, ORG, GPE) using the Illinois NERC (Ratinov and Roth, 2009).
It first classifies all entity mentions in the background document.
Among all typed mentions, those related to the query name are then selected.
Finally, the system chooses the longest such mention (e.g., the full name of the Manchester United footballer,
“Paul Aaron Scholes”, rather than the partial name “P. Scholes” for the query name “Scholes”) and assigns
its type to the query.
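The selection rule above can be sketched in a few lines. The sketch assumes that a mention is "related" to the query when it contains the query name as a case-insensitive substring, which the poster does not specify. The mention list stands in for the output of a NER tagger and is hand-written; the function name is hypothetical.

```python
def classify_query(query_name, mentions):
    """Return the longest typed mention containing the query name, or None."""
    query = query_name.lower()
    # Assumption: "related" means the mention contains the query name.
    related = [(text, etype) for text, etype in mentions
               if query in text.lower()]
    if not related:
        return None
    return max(related, key=lambda mention: len(mention[0]))


# Hand-written (mention, type) pairs standing in for NER output.
mentions = [
    ("P. Scholes", "PER"),
    ("Paul Aaron Scholes", "PER"),
    ("Manchester United", "ORG"),
    ("England", "GPE"),
]
best = classify_query("Scholes", mentions)
```

For the query "Scholes", the full name "Paul Aaron Scholes" is preferred over "P. Scholes", and its type PER becomes the query type.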