
LINKING NAMED ENTITIES TO A

STRUCTURED KNOWLEDGE BASE

By

Kranthi Reddy. B

200502008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science (by Research) in

Computer Science & Engineering

Search and Information Extraction Lab

Language Technologies Research Centre

International Institute of Information Technology

Hyderabad, India

June 2010

Copyright © 2010 Kranthi Reddy. B

All Rights Reserved

To all my dearer ones

INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY

Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Linking Named Entities

to a Structured Knowledge Base” by Kranthi Reddy. B (200502008), submitted

in partial fulfillment for the award of the degree of Master of Science (by Research)

in Computer Science & Engineering, has been carried out under my supervision

and has not been submitted elsewhere for a degree.

Date Advisor :

Dr. Vasudeva Varma, Associate Professor

IIIT, Hyderabad

Acknowledgements

I am grateful to my advisor, Dr Vasudeva Varma for his advice and for believing

in me throughout the duration of my thesis work. His regular suggestions have been

of great value. I would also like to thank Dr Prasad Pingali for his valuable insights

on research. It has been a great pleasure and joy to work with him for the whole

duration of my MS by Research studies. I have been fortunate to get timely advice

and quick feedback from Dr Prasad Pingali and Dr Vasudeva Varma in spite of their

hectic schedules. I would like to thank Mr Babji who worked tirelessly to keep the

IE lab servers running 24/7.

I would also like to acknowledge the time, help and guidance provided by Praneeth and Sai Krishna. Both have been monumental in giving shape to my thesis draft, without whom it would have been a herculean task. Along with Kiran, they not only helped me through the difficult times but also helped me to cope with the

pressure. Their confidence in me gave a lot of moral support. I have the pleasure of

working and publishing with all three of them. Thanks to them for showing

that research can be done with interest and fun.

I thank all my colleagues in Setu Software Systems Pvt. Ltd where I have been

working as an intern during the entire period of my thesis. I have had a great time

and fun working in their companionship.

A person can be defined by the social circle he is associated with. I think I had

one of the best circles of friends during my stay in IIIT. I thank Abhilash and Ambati

for their inputs and discussions on my Thesis work. A special thanks to Phani

Chaitanya, Ganesh, Girish, Gopal, Vijay, Harsha and Samrat who have been my

close knit of friends. Their frequent visits to campus during my research had lifted

my spirits many a time. Special thanks to Charan. He always gave philosophical and motivating talks whenever he saw me in a dull mood.

Last, but not the least, I would like to thank my parents and sister for having the

trust in my abilities. They gave freedom and space to grow more as an individual. I

thank them for being my invisible sources of moral and mental support.


Abstract

The World Wide Web (WWW) is a huge, widely distributed global source of information to web users. Web documents are broadly classified into unstructured and structured documents. Users prefer structured documents when looking for a piece of information. Hence, in the past decade the research community has focused on mining structured information from unstructured documents and has attempted to preserve it in the form of attribute-value pairs, tables, flow charts etc. But the focus has been only on extracting information at the document level or on particular domains like disaster, finance, medicine etc. These techniques never attempted to integrate the extracted information into common knowledge repositories like Wikipedia, DBPedia etc.

Structured databases like Wikipedia, DBPedia etc. are created through collaborative contributions from volunteers and organizations. Since they rely heavily on manual effort, the process of updating these databases is not only tedious and time consuming but is also fraught with many drawbacks. Hence, automatic updating of structured databases has become one of the hot topics of research in the past few years. Automatic updating of structured databases can be broken down into two sub problems: Entity Linking and Slot Filling. In this thesis, we address Entity Linking. Entity Linking is the task of linking named entities occurring in a document to entries in a Knowledge Base. This is a challenging task because entities can not only occur in various forms, viz: acronyms, nick names, spelling variations etc., but can also occur in various contexts.

Once named entities from documents are linked to entries in a knowledge base, information can be integrated across them. Current IE techniques can be used to extract information from documents. Person name disambiguation and Co-reference Resolution are two tasks that share a lot of similarities with Entity Linking. These tasks have attempted to link entities across documents but never attempted to integrate them into a common Knowledge Base.

Our approach to Entity Linking begins with building an Entity Repository (ER). The ER contains information about different forms of named entities and is built using Wikipedia structural information like redirect pages, disambiguation pages and bold text from the first paragraph. Our core algorithm for Entity Linking can be broken down into two steps: Candidate List Generation (CLG) and Ranking.

In the CLG phase, we use the ER, Web search results and a named entity recognizer to identify all possible variations of a given named entity. Using these variations we obtain an unordered list of candidate nodes from the KB which can be linked to the given named entity in a document. In the ranking phase, we rank the unordered list of candidate nodes using various similarity techniques. We calculate the similarity between the text of the candidate nodes and the document in which the named entity occurs. We experiment with ranking using various similarity functions like cosine similarity, Naïve Bayes, maximum entropy, Tf-idf ranking and re-ranking using pseudo relevance feedback. Our experiments show that cosine similarity and Naïve Bayes perform close to the state of the art and that the Tf-idf ranking function performs better in some cases.

Our approach was tested on the standard Entity Linking data sets provided as part of the Text Analysis Conference (TAC) Knowledge Base Population (KBP) shared task. We evaluated our approach using the Micro-Average Score (MAS), which is the standard evaluation metric. We achieved a MAS of 83% and 85% on the TAC-KBP Entity Linking 2009 and 2010 data sets respectively, which secured the top spot in these shared tasks.

Publications

• Kranthi Reddy, Karun Kumar, Sai Krishna, Prasad Pingali, Vasudeva Varma, “Linking Named Entities to a Structured Knowledge Base”, in CICLing 2010. Published in “International Journal of Computational Linguistics and Applications, ISSN 0976-0962”.

• Vasudeva Varma, Vijay Bharath Reddy, Sudheer K, Praveen Bysani, GSK Santosh, Kiran Kumar, Kranthi Reddy, Karuna Kumar, Nithin M, “IIIT Hyderabad at TAC 2009”, in the Working Notes of Text Analysis Conference (TAC), National Institute of Standards and Technology, Gaithersburg, Maryland USA, November, 2009.

• Praveen Bysani, Kranthi Reddy, Vijay Bharath Reddy, Sudheer Kovelamudi, Prasad Pingali, Vasudeva Varma, “IIIT Hyderabad in Guided Summarization and Knowledge Base Population”, in the Working Notes of Text Analysis Conference (TAC), National Institute of Standards and Technology, Gaithersburg, Maryland USA, November, 2010.

Contents

Table of Contents

List of Tables

List of Figures

1 Introduction
1.1 Structured Information Database : Knowledge Base
1.2 Challenges in Manual Maintenance of Knowledge Bases
1.3 Problem Description
1.3.1 Motivation
1.3.2 Problem Statement
1.3.3 Challenges
1.4 Background
1.4.1 Co-reference Resolution
1.4.2 Difference Between Entity Linking and Co-reference Resolution
1.5 Overview of the Proposed Methodology
1.5.1 Building Entity Repository
1.5.2 Candidate List Generation and Ranking
1.6 Thesis Organization

2 Related Work
2.1 Related Work
2.1.1 Unsupervised Person Name Disambiguation
2.1.2 Vector Space Model for Co-reference Resolution
2.2 Using Wikipedia Taxonomy for Entity Linking
2.2.1 Support Vector Machines for Entity Linking
2.2.2 A Heuristic Based Approach for Entity Linking
2.3 Approaches to Entity Linking
2.3.1 Entity Linking as Cross-document Co-reference Resolution
2.3.2 Two Stage Methodology for Entity Linking
2.3.3 Supervised Machine Learning for Entity Linking
2.4 Conclusions

3 Candidate List Generation
3.1 Building Entity Repository
3.2 Identifying Query Entity Variations
3.2.1 Using Query Document in Context
3.2.2 Using Entity Repository
3.2.3 Using Web Search Results
3.3 Candidate Nodes Identification
3.4 Adding Wikipedia Article to the Candidate List
3.5 Conclusions

4 Entity Linking as Ranking
4.1 Entity Linking as Ranking
4.2 Vector Representation of Documents
4.3 Cosine Similarity
4.4 Classification Model
4.4.1 Naïve Bayes
4.4.2 Maximum Entropy
4.5 Tf-idf Ranking
4.5.1 Term frequency and weighting
4.5.2 Inverse document frequency
4.5.3 Tf-idf Weighting
4.6 Pseudo Relevance Feedback for Re-ranking
4.6.1 Pseudo Relevance Feedback
4.6.2 Hyperspace Analogue to Language (HAL) Model
4.6.3 Re-ranked Candidate Nodes
4.7 Mapping Node Identification
4.8 Conclusions

5 Data Set
5.1 Text Analysis Conference
5.2 Data set and Evaluation Metrics
5.2.1 Structure of nodes in Knowledge Base
5.2.2 Structure of documents in Document Collection
5.2.3 Structure of an Entity Linking Query
5.3 Evaluation Metrics

6 Evaluation
6.1 Evaluation
6.2 TAC-KBP 2009 and 2010 Query Set Analysis
6.3 Candidate List Size Analysis
6.4 Candidate List Generation Phase Analysis
6.5 Entity Linking System Performance
6.6 Precision Vs Top “N” results
6.7 NIL Prediction Accuracy
6.8 Comparison with Top 5 systems at TAC-KBP
6.9 Error Analysis
6.10 Conclusions

7 Conclusion
7.1 Contributions
7.2 Future Directions
7.3 Application of Entity Linking

Bibliography

List of Tables

5.1 Percentage break down of entity types in the Knowledge Base.
5.2 Number of documents from various sources in the Document Collection.
5.3 System output for a set of query strings.

6.1 Statistics on 2009 and 2010 query sets.
6.2 Distribution of Non-Nil queries.
6.3 Sample Queries.
6.4 The number of queries (2010 query set) having a particular candidate list size.
6.5 The number of queries (2009 query set) having a particular candidate list size.
6.6 Failure to list the correct candidate node in the Candidate List even though the mapping node exists in the Knowledge Base.
6.7 Average Micro-Average Score and baseline scores obtained by various participating universities/teams for the TAC-KBP Entity Linking task on 2009 and 2010 query sets.
6.8 Micro-Average Score for individual heuristics for the 2010 query set. Google Search includes both Google spell suggestion and Google directive search.
6.9 Micro-Average Score for individual heuristics for the 2009 query set. Google Search includes both Google spell suggestion and Google directive search.
6.10 Statistics of NIL predictions and their accuracy for the 2010 query set.
6.11 Statistics of NIL predictions and their accuracy for the 2009 query set.
6.12 Performance comparison with the top 5 systems at the TAC-KBP 2010 Entity Linking sub task.
6.13 Performance comparison with the top 5 systems at the TAC-KBP 2009 Entity Linking sub task.

List of Figures

3.1 A sample article/document in Wikipedia.
3.2 A sample redirect document in Wikipedia.
3.3 A sample disambiguation document in Wikipedia.
3.4 Flow Chart of the Candidate List Generation Phase.

4.1 Cosine Similarity.

5.1 Knowledge Base Node.
5.2 Document Collection Document.
5.3 Sample Query from the Query Set.

6.1 Precision Vs Top “N” results for Non-Nil Queries from the 2010 TAC-KBP Entity Linking Query Set.
6.2 Precision Vs Top “N” results for Non-Nil Queries from the 2009 TAC-KBP Entity Linking Query Set.

7.1 An application of Entity Linking flow chart.
7.2 Possible application of Entity Linking.

Chapter 1

Introduction

The World Wide Web (WWW) is a huge, widely distributed global source of information

to web users. Web documents are broadly classified into: unstructured and structured

documents. Users read unstructured documents thoroughly in order to mine the information

they are looking for. To ease this task, the research community has focused on mining structured information from unstructured documents and attempted to preserve it in the form of

attribute-value pairs, tables, flow charts etc. In this process many Information Extraction

(IE) techniques [16, 8, 14, 46, 23] have been proposed to extract structured information

from unstructured documents. But, they have focused only on extracting information at

the document level or on particular domains like disaster, finance [66], medicine [20] etc. These techniques never attempted to integrate the extracted information into common knowledge

repositories like Wikipedia 1, DBPedia 2 etc.

Recently, there have been attempts to build and maintain global knowledge repositories (structured documents) like Wikipedia, DBPedia, Freebase 3, Uniprot 4, Medline 5 etc. These databases are created through collaborative contributions from volunteers and

1. http://en.wikipedia.org/
2. http://dbpedia.org/
3. http://www.freebase.com/
4. http://www.uniprot.org/
5. http://www.nlm.nih.gov/databases/databases_medline.html


organizations [47]. Since they rely heavily on

databases is not only tedious and time consuming but is also fraught with many drawbacks.

The research community has identified this problem and has started working towards automating the process of maintaining these databases. Hence, automatic updating of structured databases has become one of the hot topics of research in the past few years.

In this chapter, we give a brief overview of the problem we address in this thesis. In the

next section, we give an overview of a structured database a.k.a Knowledge Base (KB).

1.1 Structured Information Database : Knowledge Base

A Knowledge Base (KB) is a special kind of database for knowledge management, providing the means for the computerized collection, organization and retrieval of knowledge. In layman's terms, a KB is a semi-structured/structured database containing information about

a named entity or an event. Since a definite structure is followed while building a KB, they

are not only human readable, but also machine readable and hence can be used for a wide

range of applications.
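To make the "definite structure" above concrete, the following is a minimal Python sketch of a KB node represented as attribute-value pairs. The node id, field names and values are hypothetical illustrations, not the actual TAC-KBP schema described in chapter 5.

```python
# A hypothetical KB node represented as attribute-value pairs.
# Field names and values are illustrative only.
kb_node = {
    "node_id": "E0000123",           # unique identifier of the entry
    "name": "Sachin Tendulkar",       # canonical entity name
    "entity_type": "PER",             # person / organization / location etc.
    "facts": {                        # structured attribute-value pairs
        "date_of_birth": "24 April 1973",
        "occupation": "Cricketer",
    },
    "wiki_text": "Sachin Tendulkar is an Indian cricketer ...",  # free text
}

# Because the structure is fixed, programs can read it as easily as humans:
print(kb_node["facts"]["date_of_birth"])
```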

Knowledge Bases (KBs) like Wikipedia reduce the time and effort spent by a user in

finding a key piece of information about an event or named entity on the web, as users can

find answers to most of their questions here quickly. Since a standard structure is followed

in these KBs, it is easy to build applications that can exploit these structures. KBs have been

used in a wide range of applications in the fields of Natural Language Processing (NLP)

[19, 39], Information Extraction (IE) [70], establishing Entity Relationships [46, 23], Search [43], Named Entity Recognition (NER) [25, 57], Named Entity Disambiguation

[13], Text Mining etc.

Such highly useful resources like KBs can be created and maintained in two ways :

• Manual : Current KBs are created and maintained through collaborative contributions from volunteers and organizations [47]. Such practices have been followed


since biblical times, with scribes transcribing and at the same time often editing, updating, interpreting or reinterpreting original texts [35]. But open access large

scale public collaborative content creation projects are relatively recent phenomena

on the Web. This phenomenon of knowledge creation and sharing has been fueled by content management technologies such as the wiki 6.

• Automatic : Unlike the process followed for current KBs, the process of creating

and updating KBs with up to date information can be automated. Automating this

task overcomes many problems faced by current day KBs. The process of automating

this task can be broken down into two sub problems.

– Entity Linking (EL) : Entity Linking addresses the problem of mapping named

entities occurring in a textual document to entries/nodes 7 in the KB. The problem is complicated by the fact that entities can be referred to using multiple

name variants (e.g., aliases, acronyms, misspellings) and because many entities

share the same name (e.g., Washington might refer to a person, city, state, or

football team).

– Slot Filling (SF) : Slot Filling addresses the problem of mining structured information about entities from unstructured documents. The structured information can be in the form of attribute value pairs, tables etc. In addition to requiring that extracted information be correct, exact and supported by a document,

the information must also be previously unrecorded in the KB. Complexity in

natural language is another major problem confronted by SF.

In this thesis, we address the problem of Entity Linking. We now discuss the

problems that arise from maintaining a KB manually.

6. http://en.wikipedia.org/wiki/Wiki
7. A node is an entry in the KB which contains information and attribute-value pairs about a named entity or event.


1.2 Challenges in Manual Maintenance of Knowledge Bases

Since KBs are built manually, they face quite a few complex problems. Some of the major

problems faced by KBs like Wikipedia are :

• Inconsistency in information : Since current KBs are collaboratively maintained

by volunteers, integration of knowledge from multiple sources is an important aspect.

Under these circumstances KBs are confronted with the prospect of inconsistency.

• Incomplete information : Another key problem faced by current day KBs is that

they might not have all the pertinent information about an entity or event. This leads

to the problem of incomplete information being found about an entity/event.

• Accuracy of facts : The information provided by volunteers is not only verified by

themselves, but it is also scrutinized by the KB moderators before it is updated. Even

after taking several measures like verifying the information by multiple volunteers to

ensure the correctness of the information, sometimes the information might still be

inaccurate and error prone.

• Outdated information : Since the current set of KBs are being edited and updated

manually by volunteers, there is a very high chance that some pieces of information

about an entity/event might become outdated during the course of time.

• Manual effort is slow and time consuming : The process of knowledge acquisition

from different volunteers is a slow and time consuming process.

• Scalability : Manually scaling KBs to different domains and a large number of entries

is very time consuming and tedious. Wikipedia has taken nearly 10 years to develop

into a rich knowledge repository.

• Adaptations to new domains : Creating a KB for a new domain manually will

require a large amount of human effort and time.


Automatically updating KBs from news articles is a possible solution, because it can

overcome the above mentioned problems to a major extent. Upon automating the process of

knowledge acquisition to KBs, the major problems addressed will be scalability, adaptation to new domains, reproducibility in labs etc., and this will certainly reduce the effort put in by

humans today in maintaining the KBs. In view of this solution, a need arises to address

the task of linking named entities found in news articles to nodes/entities in the KB. This

task is referred to as Entity Linking (EL). This thesis addresses the problem of EL, its

challenges, our methodology and results.

1.3 Problem Description

To date, most of the research community has focused on extracting structured information

from unstructured documents. But none of them have focused on integrating this extracted

information into global KBs. Until relatively recently, there has been very little focus towards this direction as there were no publicly available KBs. But with the emergence of

Wikipedia, DBPedia, Freebase etc. as important repositories of information, community

efforts are focused towards integrating the information extracted from web documents like

news articles to these KBs automatically. The success and rapid growth of these KBs show

that they are very useful to web users. Wikipedia alone has around 14 million registered users 8. These KBs provide a rich source of information to the users in the form of text, tables and

flowcharts etc.

But, current day KBs face a lot of problems because of manual maintenance. We

showed that this process of knowledge acquisition and updating information into a KB

can be automated. We further discussed how EL is an important prerequisite for automatic

updating of KBs. In this thesis, we address various problems of EL, our methodology and

results. In this section, we explain our motivation behind attempting this problem. Then,

we state the problem in formal terms and finally conclude with a discussion on the major

8. http://en.wikipedia.org/wiki/Special:Statistics


challenges in EL.

1.3.1 Motivation

The rise of Web 2.0 technology has provided a platform for user generated content on the

web through blogs, forums etc. This has led to the growth of information on the web at a

staggering rate and hence the problem of information overload [27]. Information overload

refers to the difficulty a person can have in understanding an issue and making decisions

that can be caused by the presence of too much information. Some of the general causes of

information overload on the web are

• Rapidly increasing rate of novel information.

• Ease of duplication and transmission of data across the Internet.

• An increase in the available channels of incoming information.

• Large amounts of historical information to dig through.

• Contradictions and inaccuracies in available information.

Information overload is a growing problem for users in the web era. The overabundance

of information on the web has resulted in a time consuming and difficult challenge for users

searching for a key piece of information in an increasingly competitive world. Information

overload is more than an inconvenience to a user and the rate at which it is growing will only

create bigger challenges and problems in the near future. Current day KBs like Wikipedia

try to overcome this problem by providing information about named entities/events under

a single roof. With a staggering rate of information growth on the web, it is imperative to

provide users with tools for efficient and effective access to knowledge repositories. An EL

system is an important component to maintain KBs automatically.


1.3.2 Problem Statement

Given a Knowledge Base and a textual document, the task of Entity Linking is to determine

for each named entity (NE) and the document in which it appears, which KB node is being

referred to, or if the entity is a new entity and not present in the KB. This is a challenging

task because entities can not only occur in various forms, viz: acronyms, nick names,

spelling variations etc but can also occur in various contexts.

Throughout this thesis, we refer to the entity to be linked as the Query Entity and the document in which it appears as the Query Document. For entities that do not have an entry in the Knowledge Base, we return NIL and call this the NIL detection problem.

There has been a shared task, Knowledge Base Population (KBP), in the Text Analysis Conference 9 (TAC) in 2009 and 2010. Entity Linking was a sub task of the KBP track. Hence,

we evaluated our algorithm on this data set. The data set consisted of

• Query Entity : This refers to a named entity occurring in a document which is to be

linked to a node in the KB, if any.

• Query Document : It provides context for disambiguating the query entity.

• Knowledge Base : KB consists of a set of nodes, to which the query entity should

be linked.

We explain the complete data set and evaluation metrics in chapter 5. Though EL solves

many problems, it is not an easy task. In the next section we explain in detail the various

challenges of EL.
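Stated operationally, an EL system consumes the three items above and returns either a KB node identifier or NIL. The following is a minimal sketch of that interface in Python; the function name, types and placeholder logic are our own illustration, not part of the TAC-KBP specification (chapters 3 and 4 describe the actual candidate generation and ranking steps).

```python
from typing import Optional

def link_entity(query_entity: str, query_document: str, kb: dict) -> Optional[str]:
    """Return the id of the KB node the query entity refers to, or None for NIL.

    `kb` maps node ids to node records (name, text, attribute-value pairs).
    The body below is only a placeholder: it keeps exact-name matches and
    ignores the query document entirely.
    """
    candidates = [nid for nid, node in kb.items()
                  if node["name"].lower() == query_entity.lower()]
    if not candidates:
        return None          # NIL: the entity has no entry in the KB
    return candidates[0]     # placeholder: a real system ranks the candidates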

1.3.3 Challenges

Some of the major challenges involved in EL are :

9. http://www.nist.gov/tac/


• Mention Ambiguity : An instance of a named entity can refer to different real world

entities based on the context in which it occurs. This ambiguity is called mention

ambiguity and is one of the commonly faced problems on the web [13].

For example, the entity mention “Texas” refers to more than twenty different named

entities in Wikipedia. In the context “former Texas quarterback James Street”, Texas

refers to the University of Texas at Austin; in the context “in 2000, Texas released a

greatest hits album”, Texas refers to the British pop band; in the context “Texas borders

Oklahoma on the north”, Texas refers to the United States state; and in the context

“the characters in Texas include both real and fictional explorers”, Texas refers to the

novel written by James A. Michener.

• Named Entity Variations : An instance of a named entity can be referred to using

various forms like

– Acronyms : Acronyms are abbreviations that are formed using the initial components of a phrase or name. A named entity can always be referred to using its acronym and the same acronym can refer to different named entities based on the context in which it appears.

For example, the acronym “SRT” refers to “Sachin Ramesh Tendulkar” in the

context “SRT is an Indian cricketer widely regarded as one of the greatest batsmen in the history of cricket”, whereas it refers to “Street and Racing Technology” in the context “SRT is a high-performance automobile group within Chrysler LLC.” (A small code sketch of acronym generation follows this list.)

– Nick Names : Sometimes named entities are referred to using nick names,

alias names etc. The main difficulty here is that the nick name need not be a

named entity by itself.

For example, “Sachin Tendulkar”, a batsman of the Indian cricket team, is referred to using seven different nick names. They are “The God of Cricket, Little


Master, Tendlya, Master Blaster, The Master, The Little Champion, The Great

Man”. None of these seven names is a named entity by itself.

– Spelling Variations : Finally, a named entity can also be referred to using multiple spelling variations based on pronunciation.

For example, “Angela Dorothea Merkel”, the chancellor of Germany, is

referred using different spellings like “Angie Merkel, Angelika Merkel, Angela

Merkel, Angela Markel, Angel Merkel” etc.

• NIL Detection : When trying to link named entities from a large, generic collection

of documents, there is a high likelihood that a large number of entities have no mapping node in the KB. In such cases, the system is expected to predict NIL. We call this the NIL Detection problem.
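As an illustration of the first two variation types above, the toy Python sketch below generates an acronym from a name's initial components and scores spelling variants with a simple edit-similarity measure from the standard library. It is only an approximation of the variation handling described in chapter 3.

```python
import difflib

def acronym(name: str) -> str:
    """Build an acronym from the initial letters of the name's components."""
    return "".join(word[0].upper() for word in name.split())

def spelling_similarity(a: str, b: str) -> float:
    """Crude similarity score for spelling variations (1.0 = identical)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(acronym("Sachin Ramesh Tendulkar"))                      # -> "SRT"
print(spelling_similarity("Angela Merkel", "Angela Markel"))   # high score
print(spelling_similarity("Angela Merkel", "Abbott Kinney"))   # low score
```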

The combination of all these issues makes EL a challenging task. Sometimes a mention

of an entity can involve more than one of the above challenges. Consider the occurrence

of the entity “Dorothea”. An Italian might be reminded of “Dorotea Bucca”, an Italian physician, whereas for an Irish person “Dorothea” might suggest “Dorothea Jordan”, an

Irish actress. However, the mention of “Dorothea” in the textual document might refer to

the entity “Angela Dorothea Kasner”, which in turn is a name variation of “Angela Merkel”,

chancellor of Germany. Thus, an EL system must determine if either of the two “Dorothea”

is correct, even though neither is an exact match. If the system determines neither, should it return NIL or the variant “Angela Merkel”?

1.4 Background

Named entities are the fundamental constituents in the texts present on the web. The ability

to identify named entities like persons, organizations and locations; extracting knowledge

about them and identifying entity relationships has many applications. The task of identifying the named entities like persons, organizations and locations occurring in a piece of


text is referred to as Named Entity Recognition (NER). For example, an NER would recognize the mention of Sachin Tendulkar and 24 April 1973 as Person and Date respectively.

NER is a sub task of the information extraction problem and is one of the most widely explored

[65, 40, 17, 11, 71] problems in this field. A relation extraction system [3, 68, 6, 8] would

establish the relation between named entities occurring in a document. This ability to discover entity relationships embedded in the documents would be very useful not only for information retrieval but also for question answering [54, 67, 44, 30, 50, 63] and summarization [33, 4, 5, 22, 32, 29] tasks. Though information extraction algorithms are capable

of extracting such valuable information automatically, they never addressed the problem

of integrating the extracted information to KBs like Wikipedia or DBpedia. This task of

inserting the extracted knowledge into a KB has many challenges that arise from natural

language ambiguity, inconsistencies in text and lack of world knowledge. The focus of this

thesis is to establish the mapping from an entity occurring in a document to an entity in a KB, if any. The ability to disambiguate various named entities is an important prerequisite for updating an entity’s record (Node) in the KB. This task has been referred to as

Entity Linking or Named Entity Disambiguation. When performed without a KB, EL is

called Co-reference Resolution (CR).

CR shares a lot of similarities with EL. In the next section, we first explain in detail

the problem of CR and then give a brief introduction of the various tasks held in this area.

Finally, we compare how EL differs from CR.

1.4.1 Co-reference Resolution

The task of Co-reference Resolution [2] aims to determine whether two occurrences in a

document correspond to the same entity or not. Entity mentions that map to the same real

world entity are grouped into the same cluster. This task becomes more complex when we

try to determine whether the instances of two entities across different documents co-refer or

not. When CR is performed across documents, it is called Cross-document Co-reference


Resolution (CDCR).

Cross-document co-reference occurs when the same person, place, event or concept

is discussed in more than one text source. Computer recognition of this phenomenon is

important because it helps break the document boundary by allowing a user to examine information about a particular entity from multiple text sources at the same time. Resolving

cross-document co-reference allows a user to identify trends and dependencies across the

documents. Once the document barrier is broken, CR becomes a central tool for information fusion and for generating summaries from multiple documents.

CDCR differs substantially from within-document CR. There is a certain level of consistency within a document, which makes CR an easier task when compared to CDCR. CDCR

is a challenging problem because the documents can come from different sources and they

might also have different conventions and styles. In addition, the problems encountered

during within document co-reference are compounded when looking for co-references

across documents because the underlying principles of linguistics and discourse context

no longer apply across documents and the underlying assumptions in CDCR are distinct.

CR also differs from NER. In the task of NER, we try to identify phrases which might

refer to a person, location or organization. While identifying the named entities, each

entity mention is treated as unique and distinct. Whereas, in the task of CR we attempt

to determine whether entity mentions in a document are actually referring to the same real

world entity or not. Various community efforts have taken place in the form of shared tasks

viz: Message Understanding Conference 10, Tipster 11 and Web People Search 12 to address

the challenges of CR and CDCR.

10. http://www-nlpir.nist.gov/related_projects/muc/
11. http://www-nlpir.nist.gov/related_projects/tipster/
12. http://nlp.uned.es/weps/


1.4.2 Difference Between Entity Linking and Co-reference Resolution

Though the tasks of EL and CR share similarities, i.e. both aim at disambiguating named entities, there exists a slight difference between them in what the final goal of each task is. In CR, we have a set of documents, all of which mention the

same entity name. The difficulty lies in clustering these documents into sets which refer to

the same real world named entity. Whereas, in EL, the same entity name could be referred

to in different contexts and also using various forms like acronyms, nick names etc. Our

problem is to link this named entity to an entry in the KB, if present.

For example, consider the following five different contexts. We show the expected

output of EL and CR.

Context 1 : A spokeswoman for Abbott said it does not expect the guidelines to affect

approval of its Xience stent, which is expected in the second quarter.

Context 2 : Aside from items offered by the 67-year-old Fonda, the auction included

memorabilia related to Peter Frampton, Elvis Presley and Abbot and Costello.

Context 3 : Abbott, which spun off HPD in 2004, rejected the charges, insisting it has

“consistently complied with all laws and regulations.”

Context 4 : Most of his screenplays, which included several Abbott and Costello comedies, as well as scripts for television shows, were written between the 1930s and 1960s.

Context 5 : Abbott was appointed to a three year position as chairman of the California

Board of Forestry.

In context 1 and context 3, the mention of “Abbott” refers to “Abbott Laboratories” (A

pharmaceuticals and health care company), whereas in context 2 and context 4 the same

mention of “Abbott” refers to “Bud Abbott” (An American film actor) and in context 5 it

refers to “Abbott Kinney” (An American conservationist).

In the task of CR, we would form one cluster for the mentions of “Abbott” in context 1 and context 3, another cluster for the mentions of “Abbott” in context 2 and context 4, and a separate cluster for context 5. Each cluster corresponds to a unique


real world named entity. However, successful CR is insufficient for correct EL, as the co-reference chain must still be correctly mapped to its corresponding KB node. CR does not identify which real world entity the mentions of a named entity in a cluster refer to.

In the task of EL, we link the entity mention of “Abbott” in context 2 and context 4 to the “Bud Abbott” node in the KB, the entity mention in context 5 to the “Abbott Kinney” node, and the entity mentions in context 1 and context 3 to the “Abbott Laboratories” node in the

KB. Since the KB is a structured database containing information about entities, with the

linking of these new documents, new information about the corresponding named entity

can be extracted and updated into the KB node automatically. In the next section, we give

a brief overview of our proposed methodology.
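The difference can be made concrete with a small sketch: for the five Abbott contexts above, CR only produces clusters of mentions, while EL additionally maps each mention to a KB node. The node ids used here are hypothetical.

```python
# Co-reference resolution output: clusters of context ids that mention the
# same real world entity, with no link to any knowledge base.
cr_clusters = [
    {1, 3},   # Abbott Laboratories
    {2, 4},   # Bud Abbott
    {5},      # Abbott Kinney
]

# Entity Linking output: every mention is mapped to a KB node id (or None
# for NIL). The node ids below are made up for illustration.
el_links = {
    1: "KB:Abbott_Laboratories",
    2: "KB:Bud_Abbott",
    3: "KB:Abbott_Laboratories",
    4: "KB:Bud_Abbott",
    5: "KB:Abbott_Kinney",
}
```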

1.5 Overview of the Proposed Methodology

Our approach consists of building an Entity Repository (ER). ER contains information

about different forms of named entities and is built using various features from Wikipedia.

Using ER, Web search results and an NER system, a set of candidate nodes are obtained

from the KB (Candidate List Generation, CLG). These candidate nodes are ranked to identify the mapping node, if any. In the CLG phase, the query entity 13 is expanded to obtain

its variations. These variations are used to generate candidate nodes from the KB. These

candidate nodes are ranked using various similarity techniques. The top ranked node is

returned as the mapping node for the input query entity.

1.5.1 Building Entity Repository

We build an Entity Repository (ER) which contains various forms in which a named entity

could be referred to, viz: alias names, nick names, acronyms etc. We use Wikipedia, which

is the largest semi-structured KB available to the public. Wikipedia structural information

13. Query Entity refers to a named entity occurring in a document which is to be linked to a node in the Knowledge Base, if any.


(Redirection, Disambiguation, Bold Text) comes in handy in extracting a few of the variations of a named entity. A snapshot of the XML dump of Wikipedia is used to build the ER.
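A minimal sketch of how redirect information in the dump can populate such a repository is shown below. It handles only redirect pages (not disambiguation pages or bold text) and assumes the `#REDIRECT [[Target]]` convention used in the wikitext of redirect pages; the helper name and the example page are our own, so this is only an approximation of the procedure detailed in chapter 3.

```python
import re
from collections import defaultdict

# Maps a canonical article title to the set of surface forms that refer to it.
entity_repository = defaultdict(set)

redirect_re = re.compile(r"#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

def add_redirect(page_title: str, page_wikitext: str) -> None:
    """If the page is a redirect, record its title as a variant of the target."""
    match = redirect_re.search(page_wikitext)
    if match:
        target = match.group(1).strip()
        entity_repository[target].add(page_title)

# Example with made-up wikitext: the page "Barack Hussein Obama" redirecting
# to "Barack Obama" makes the longer form a known variant of the target.
add_redirect("Barack Hussein Obama", "#REDIRECT [[Barack Obama]]")
print(entity_repository["Barack Obama"])
```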

1.5.2 Candidate List Generation and Ranking

• Candidate List Generation (CLG) : In this phase, we obtain possible variations of

the query entity using various heuristics. We use the ER, an NER and Web search

engine results to obtain all the possible variations of the query entity. Using the

identified variations of the query entity, we obtain an unordered list of candidate

nodes from the KB which might be linked to the query entity.

• Entity Linking as Ranking : We rank the unordered list of candidate nodes using

various similarity techniques. We calculate the similarity between the text of the candidate nodes and the query document. We experiment with ranking using various similarity functions like cosine similarity, Naïve Bayes, maximum entropy, Tf-idf ranking and re-ranking using pseudo relevance feedback. We show that cosine similarity and Naïve Bayes perform close to the state of the art and that the Tf-idf ranking function performs better in some cases. (A minimal sketch of this ranking step appears at the end of this section.)

Our proposed approach was tested on a standard data set provided as part of Text

Analysis Conference (TAC), Knowledge Base Population (KBP), Entity Linking

shared task. We participated in the TAC-KBP, EL 2009 and 2010 shared tasks. We

used the standard evaluation metric, i.e. Micro-Average Score (MAS), for evaluating our algorithm's performance. We achieved a MAS of 83% and 85% on the TAC-KBP, EL 2009 and 2010 data sets respectively. Also, our proposed approaches performed

very well in the TAC-KBP, EL shared tasks.
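The sketch below illustrates the cosine-similarity variant of the ranking step: each candidate node's text is scored against the query document over Tf-idf vectors, and the top ranked node is taken as the mapping node. scikit-learn is used purely for brevity; the function name, the stop-word choice and the weighting details are assumptions, not the exact implementation described in chapter 4.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(query_document: str, candidates: dict) -> list:
    """Rank candidate KB nodes by the cosine similarity of their text to the
    query document. `candidates` maps node ids to node text."""
    node_ids = list(candidates)
    texts = [query_document] + [candidates[nid] for nid in node_ids]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    scores = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
    # Highest-scoring candidate first; its node id is the mapping node.
    return sorted(zip(node_ids, scores), key=lambda p: p[1], reverse=True)

# Hypothetical usage:
# rank_candidates("former Texas quarterback James Street ...",
#                 {"E1": "The University of Texas at Austin is ...",
#                  "E2": "Texas is a state in the United States ..."})
```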

1.6 Thesis Organization

The rest of the thesis is organized as follows


In chapter 2, we discuss literature work and current state-of-the-art algorithms on EL.

We also discuss algorithms developed as part of TAC-KBP, EL shared task, so that we

have a platform to compare our system. We also discuss seminal work on named entity

disambiguation and CR, as they are closely related to EL.

In chapter 3, we describe the task of building Entity Repository (ER) and the Candidate

List Generation (CLG) phase of our EL algorithm. We discuss in detail the various features

used in building the ER. We also discuss how an unordered list of candidate nodes is obtained

from the KB.

Chapter 4 describes EL as a ranking problem. We rank the unordered list of candidate nodes obtained in chapter 3 using various similarity techniques, viz: cosine similarity, Naïve Bayes, maximum entropy, Tf-idf ranking and pseudo relevance feedback for re-ranking. We also discuss how the mapping node is arrived at from the initial unordered list of candidate nodes.

In chapter 5, we describe the data set used and further explain the evaluation metric

used to evaluate the performance of an EL algorithm.

In chapter 6, we describe the experiments conducted on 2009 and 2010 TAC-KBP,

EL query sets to validate our methodology. We report the results using our methodology

and evaluate in detail the impact of various features we used in developing our algorithm.

We also compare the performance of our algorithm with the existing state-of-the-art approaches. We discuss the results and present our observations in detail.

Finally, in chapter 7, we conclude the thesis by outlining our contributions and providing some insights on how to extend this work in the future.


Chapter 2

Related Work

2.1 Related Work

In this chapter, we discuss the literature related to Entity Linking (EL). First, we discuss

seminal work on person name disambiguation and co-reference resolution (CR) as they

share a lot of similarities with EL. In what follows, we focus on the first works on EL and

finally, conclude with discussions on recent literature in EL.

Until relatively recently, there has been very little focus on EL because there was no

general purpose publicly available collection of information about named entities. However, with the emergence of Wikipedia, DBpedia and Freebase as important repositories of semi-structured information about named entities, EL has received a lot of attention from

various research communities. Accordingly, these databases have been exploited for a number of tasks ranging from named entity recognition to relation extraction, but with the passage of time it has been observed that the maintenance of these databases is time consuming

and a costly affair. Hence, EL has received a lot of attention recently because it addresses

the problem of information integration and helps in automating the task of maintaining

the databases with up-to-date information. EL has been addressed using various heuristic

based approaches and machine learning techniques. All these approaches rely heavily on

the document context as features to link the entities.


2.1.1 Unsupervised Person Name Disambiguation

Person name disambiguation is closely related to EL in the sense that it also tries to disambiguate and identify named entities. This task is also called proper noun disambiguation. The goal of this task is to cluster mentions of person entities in documents into unique entities. Simple word senses and translation ambiguity may typically have 1-10 alternative meanings that must be resolved based on the context in which they occur. Whereas, a personal name like “Jim Clark” might potentially refer to hundreds or even thousands of distinct individuals. Each unique referent typically has its own distinct contextual characteristics. These

characteristics can help distinguish and resolve the referent when they occur in documents.

The first significant contribution was made by Mann and Yarowsky in 2003 [34].

Their approach utilizes an unsupervised clustering technique over a rich feature space

of biographic facts, which are automatically extracted via a language-independent bootstrapping process. The induced clusters of named entities are then partitioned and linked to their real referents via the automatically extracted biographic data. The biographic facts extracted can be birth year, occupation, affiliation etc. They are extracted using manually written regular expressions.
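The following is a toy sketch of the kind of hand-written regular expressions involved; the patterns and the feature names are our own simplifications, not those of Mann and Yarowsky.

```python
import re

# Simplified, hand-written patterns for two biographic facts.
BIRTH_YEAR = re.compile(r"\bborn\b[^.]*?\b(1[89]\d{2}|20\d{2})\b")
OCCUPATION = re.compile(r"\b(?:is|was)\s+an?\s+([a-z]+(?:\s[a-z]+)?)\b")

def biographic_facts(sentence: str) -> dict:
    """Extract a small feature set of biographic facts from one sentence."""
    facts = {}
    if (m := BIRTH_YEAR.search(sentence)):
        facts["birth_year"] = m.group(1)
    if (m := OCCUPATION.search(sentence)):
        facts["occupation"] = m.group(1)
    return facts

print(biographic_facts("Jim Clark, born in 1944, is an entrepreneur."))
# Documents whose extracted facts agree can then be clustered together and
# linked to the same referent.
```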

2.1.2 Vector Space Model for Co-reference Resolution

Co-reference resolution (CR) is another problem which is very closely related to EL. Co-

reference [2] occurs when the same person, place, event or concept is discussed at various

points in a text. When it occurs in multiple text sources it is called Cross-document Co-reference Resolution (CDCR). Computer recognition of this phenomenon helps in breaking

the document barrier and helps in mining or examining information about an entity from

multiple sources simultaneously. In particular, resolving cross-document co-references allows a user to analyze different trends and dependencies across multiple documents. It

can be used as a central tool in generating multi document summaries and in information

fusion.


The task of CR is to determine if two occurrences of an entity in a document correspond

to the same unique real world entity. This task becomes more complex when performed

across documents, as the documents could come from different sources and might also

follow different styles and conventions. In CDCR, we have a set of documents all of which

mention the same entity name. The difficulty lies in clustering these documents into sets

which mention the same entity. Additionally, most CR data sets have never been modeled to address the named entity synonym problem. Seminal work on CDCR was done by Bagga

and Baldwin [2].

Bagga and Baldwin used a Vector Space Model (VSM) [60] to form clusters of entities. In their approach, the documents are passed through a sentence extraction module. For each document this module extracts all the sentences relevant to a particular entity of interest. In other words, the sentence extractor module produces a “summary” of the article with respect to the entity of interest. Then, for each article, the VSM disambiguation module uses the summary extracted by the sentence extractor and computes its similarity with the summaries extracted from each of the other articles. If the similarity computed between summaries is above a pre-defined threshold, then the entities of interest in the two summaries are considered to be co-referent.
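A compact sketch of this decision rule is given below, using Tf-idf vectors and cosine similarity over the per-entity summaries. The threshold value and function name are arbitrary illustrations, not the settings used by Bagga and Baldwin.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def coreferent(summary_a: str, summary_b: str, threshold: float = 0.2) -> bool:
    """Decide whether the entities of interest in two extracted summaries
    co-refer, by thresholding the cosine similarity of their VSM vectors."""
    tfidf = TfidfVectorizer().fit_transform([summary_a, summary_b])
    similarity = cosine_similarity(tfidf[0:1], tfidf[1:2])[0, 0]
    return similarity > threshold

# The summaries would be the sentences mentioning "John Smith" in each article.
print(coreferent("John Smith, chairman of General Motors, said ...",
                 "General Motors chairman John Smith announced ..."))
```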

Their algorithm was tested on a highly ambiguous test set which consisted of 197 articles from 1996 and 1997 editions of the New York Times. All the articles whose text

contained the expression John.*?Smith, i.e. contained some variation of John Smith were

included. There were a total of 35 different John Smiths in these articles out of which 24 of

them had only a single article. The remaining 11 John Smiths had 173 articles. These documents were manually grouped based on the mention of John Smith. None of the articles

had multiple occurrences and hence the annotations were done at the document level only.

The experimental results showed that the system had very high performance. The problem

with this data set is that none of the documents have a synonym mention of John Smith, that is, John Smith could have been referred to using “Mr. John, Mr. Smith, Jo Smith” etc. Most

CDCR data sets are collected in a way that they don’t address the problem of synonym


resolution. Although CR integrates the information about an entity from multiple sources,

it does not address the problem of integrating this information into a KB.

2.2 Using Wikipedia Taxonomy for Entity Linking

Seminal work on EL was done by Bunescu and Pasca [7] and Cucerzan [13]. Cucerzan uses

a heuristic based approach and exploits Wikipedia structure to derive mappings between

surface forms of entities and their Wikipedia entries. Context vectors are derived as a

prototype for each entity in Wikipedia and these vectors are compared against the context

vectors of unknown entity mentions from documents for disambiguation. In the work by

Bunescu and Pasca [7], a supervised Support Vector Machines ranking model is used for

disambiguation. Both the approaches rely heavily on Wikipedia structural information,

such as category hierarchies and disambiguation links.

2.2.1 Support Vector Machines for Entity Linking

Bunescu and Pasca [7] use a supervised Support Vector Machines (SVM) kernel ranking

model for disambiguation. The SVM kernel is trained so as to exploit the high coverage

and rich structure of information encoded in an online encyclopedia. Since there was no

manually labeled data available for evaluation, they trained and evaluated the algorithm

developed on Wikipedia’s link anchor text. A subset of inter article links were obtained

from Wikipedia for evaluation. These articles were obtained using two heuristics.

• To ensure that the article was talking about a named entity, a set of heuristics were

framed. The heuristics used were

– If the article title is multiword, all the content words were checked for capitalization, i.e. words other than prepositions, determiners, conjunctions, relative pronouns or negations. If all the content words are capitalized, it was considered as a named entity.


– If the article title is a one word title that contains at least two capital letters, then

also it is considered as a named entity. Otherwise, the next step is applied.

– A count of how many times the article title occurs in the text of the article, in

positions other than at the beginning of a sentence is calculated. If at least 75%

of these occurrences are capitalized, then it is considered as a named entity. (A small code sketch of these capitalization checks follows this list.)

• A set C2 was obtained, which includes only child categories of People by Occupation,

that are assigned to at least 200 articles. Then, if one of the categories assigned to the

article belongs to C2, the article was considered to be talking about a named entity.
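The sketch below implements the first heuristic (the capitalization checks). The stop-word list is a rough approximation of the word classes listed above, and sentence-initial occurrences are not excluded as they are in the original description, so this is only an illustration of the idea.

```python
FUNCTION_WORDS = {"of", "the", "a", "an", "and", "or", "nor", "not", "in",
                  "on", "at", "to", "that", "which", "who"}   # approximation

def looks_like_named_entity(title: str, article_text: str) -> bool:
    """Heuristically decide whether a Wikipedia article describes a named entity."""
    words = title.split()
    if len(words) > 1:
        # Multiword title: every content word must be capitalized.
        content = [w for w in words if w.lower() not in FUNCTION_WORDS]
        return bool(content) and all(w[0].isupper() for w in content)
    if sum(ch.isupper() for ch in title) >= 2:
        return True                       # one-word title such as "NASA"
    # Fall back: at least 75% of the title's occurrences in the article body
    # are capitalized (simplified: sentence-initial positions are counted too).
    occurrences = [w for w in article_text.split() if w.lower() == title.lower()]
    capitalized = [w for w in occurrences if w[:1].isupper()]
    return bool(occurrences) and len(capitalized) / len(occurrences) >= 0.75

print(looks_like_named_entity("University of Texas at Austin", ""))            # True
print(looks_like_named_entity("Apple", "The apple is a fruit. An apple a day")) # False
```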

The positive examples constituted the articles that matched the above heuristics and

had link mentions. Articles that did not match the above heuristics constituted the set of

negative examples.

The mention of an entity in the text was used to generate a set of candidates. An exact

match was done on Wikipedia article titles, redirect titles and disambiguation titles. The

articles that had the exact match of the entity were considered as candidates.

For disambiguating the candidates obtained above, Bunescu and Pasca used an SVM ranking model implemented in the SVM light toolkit 1. They used two classes of features to

train the model. The first feature used was the cosine similarity between the context in

which the named entity occurred and the text present in the Wikipedia candidate article.

The second feature was created using a 2 tuple for each combination of the candidate

categories and context words. They learned to predict NIL for queries by including NIL

candidates. This helped the system in learning a linking threshold. The experimental results

showed that the system had very high performance.

However, the drawback with this approach is that the system is heavily dependent on

Wikipedia structural information like redirect and disambiguation pages.

1. http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html


2.2.2 A Heuristic Based approach for Entity Linking

Cucerzan [13] uses a heuristics based approach to link named entities occurring in documents to entities in Wikipedia. His work assumes that all the mentions of unknown entities

have a corresponding entry in the Wikipedia. However, the assumption fails for a significant

percentage of entities present in news articles as they do not have an entry in Wikipedia.

Context vectors are derived as a prototype for each entity in Wikipedia and these vectors

are compared against the context vectors of the unknown entity mentions in a document for

disambiguation.

In the first phase, entity mentions are identified that need to be linked to Wikipedia

articles. For this, the system splits a document into sentences and true cases the beginning

of each sentence, hypothesizing whether the first word is part of an entity or whether it is capitalized

because of orthographic conventions. It also identifies all the titles and hypothesizes correct

case for all the words in the titles. This is done based on statistics obtained from a one-

billion-word corpus, with back-off to web statistics. In the second stage, a hybrid NER

based on capitalization rules, web statistics, and statistics extracted from the CoNLL 2003

shared task data [65] is used to identify the boundaries of the entity mentions in the text.

It also assigns to each set of entity mentions sharing the same surface form a probability

distribution over four labels: Person, Location, Organization, and Miscellaneous 2. Then,

in-document co-reference was performed to obtain longer surface forms for entities. It is

fairly common for one of the mentions of an entity in a document to be a long, typical

surface form of that entity (e.g., George W. Bush), while the other mentions are shorter

surface forms (e.g., Bush). Therefore, before attempting to solve the semantic ambiguity,

the system hypothesizes in-document co-references and maps short surface forms to longer

surface forms with the same dominant label (for example, Brown/PERSON can be mapped

to Michael Brown/PERSON). A similar approach is also applied to acronyms to identify

2 While the named entity labels are used only to solve in-document co-references in the current system, as described further in this section, preliminary experiments on probabilistically labeling the Wikipedia pages show that these labels could also be used successfully in the disambiguation process.


their expanded forms.

In the candidate generation phase, Cucerzan relied on an extensive pre-processing step

and used a rich set of features for aliases identification. For identifying various aliases of

a named entity, Cucerzan used Wikipedia redirect titles, disambiguation titles, link anchor

titles and truncated article titles. Longer mentions from co-reference chains were used to

replace the entities identified by the NER.

Cucerzan disambiguated the mention of the query entity with respect to document level

vectors obtained from all mentions of the entities in the document. Wikipedia contexts that

occur in the document and their category tags are aggregated into a document vector, which

is subsequently compared with the Wikipedia entity vector (of categories and contexts) of

each possible entity for disambiguation. The entities are assigned to surface forms that

maximize the similarity between the document vector and the Wikipedia entity vectors.

The main drawback of Cucerzan's approach is that it does not handle NIL entities, that is, entities that do not have an entry in Wikipedia. His work assumes that all mentions of entities in a document will surely have a mapping entry in Wikipedia. This assumption fails when news articles are considered, as they have many entity mentions that might not have an entry in Wikipedia. The query set for evaluating

this algorithm was also developed in such a way that entity mentions having no appropriate

article in Wikipedia were set aside from the evaluation set.

2.3 Approaches to Entity Linking

In this section, we discuss recent state-of-the-art work on EL. These algorithms have been

developed as part of the TAC-KBP EL shared task. We discuss two heuristic-based approaches and a machine learning approach and explain their shortcomings. All these approaches follow more or less a similar strategy, which can be broken down into

two steps: First, they obtain a small set of possible candidate nodes from the KB using

various heuristics. Second, these possible candidate nodes are ranked using various simi-


larity techniques to identify the mapping node. We now discuss each of these algorithms

in detail.

2.3.1 Entity Linking as Cross-document Co-reference Resolution

Si Li et al. [62] model the task of EL as a CDCR problem. Their approach can be broken

down into four basic steps:

• Entity Retrieval : Since a KB generally contains millions of entities, it will be a

time-consuming task to traverse the entire collection for linking an entity occurring in a document (we refer to this entity occurring in a document, which needs to be linked to a node in the KB, as the query entity3). Hence, Si Li et al. try to obtain a small set of possible candidate nodes from the KB that can be linked to the query entity. In order to arrive at this possible candidate set of nodes, they use the Indri Retrieval Toolkit4, which is based on language models and inference networks. The system carries out a

basic topic relevance retrieval to get the top 10 possible mapping nodes from the KB

for each query entity.

• Named Entity Type Recognition : The entity types may be Person, Organization

or a Geo-Political entity. If the type of a target query entity is uncertain, it is regarded as Unknown (UKN). In order to improve the accuracy of the resolution, the type of the query entity (present in a document) is identified by the Stanford NER5.

• Summarization : Since the test documents can be from various news articles and

transcripts, Si Li et al. believe that these documents might contain a lot of irrelevant content for the query entity. Hence, they generate a query-specific summary instead of using the original text for the similarity measure between two documents; different queries may also produce different summaries of the same original text.

3 Query Entity refers to a named entity occurring in a document which is to be linked to a node in the Knowledge Base, if any.
4 http://www.lemurproject.org/indri/
5 http://nlp.stanford.edu/software/CRF-NER.shtml

Intra-document

CR is performed before extracting the summary. The heuristics used to generate the

summary were

– A sentence is considered part of the summary if it contains at least one word of

the query entity.

– If the pronoun in a sentence refers to an antecedent of the previous sentence

and if it is already present in the summary sentences, the current sentence is

also added to the summary sentences. The simplified Hobbs Naive algorithm

[12] is used for pronoun resolution.

– A sentence is not a summary sentence if it does not meet the above two require-

ments.

– Sometimes, no summary might be extracted by using their algorithm if there is

no query term in the document. In such cases, the original text is used instead

of the summary.

• Similarity Metrics : Si Li et al. calculate the similarity between the candidate nodes

obtained in entity retrieval phase and the given text document using two different

methods: Vector Space Model and KL divergence method.

– Vector Space Model : Let the summary vector of a document D be $\vec{V}(S)$. The cosine similarity between two documents D1 and D2 is computed as

$$ Sim(D_1, D_2) = \frac{\vec{V}(D_1) \cdot \vec{V}(D_2)}{|\vec{V}(D_1)|\,|\vec{V}(D_2)|} = \sum_{\text{common terms } t_j} W_{1j} * W_{2j} \qquad (2.1) $$

where $t_j$ is a term present in both $D_1$ and $D_2$, $W_{1j}$ is the weight of the term $t_j$ in $D_1$ and $W_{2j}$ is the weight of $t_j$ in $D_2$. The weight of a term $t_j$ in the vector $\vec{V}(S)$ is given by:


$$ W_j = \frac{tf_j}{\sqrt{\sum_{i=1}^{M} tf_i^2}} \qquad (2.2) $$

where $tf_i$ is the frequency of the term $t_i$ in the summary.

– The KL divergence Model : In probability theory and information theory,

the Kullback-Leibler divergence is a non-symmetric measure of the difference

between two probability distributions P and Q. Here they use an improved KL divergence model to measure the similarity between two documents. It is defined to be

$$ D_{KL}(P\,\|\,Q) = \sum_i (P(i) - Q(i)) \log \frac{P(i)}{Q(i)} \qquad (2.3) $$

where P stands for the distribution of terms in the summary of document D1, Q stands for the distribution of terms in the summary of document D2, and word i occurs in D1 or D2.

Unlike the standard KL divergence, the improved KL divergence formula is symmetric and non-negative. The closer the value is to zero, the more similar the two documents are.
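To make the behaviour of Equation 2.3 concrete, the following minimal Python sketch computes the improved divergence between two summary term distributions. This is our illustration of the formula only, not code from Si Li et al.; the tokenisation and the smoothing constant are assumptions we introduce to keep the sketch runnable.

```python
from collections import Counter
import math

def term_distribution(tokens, vocab, eps=1e-9):
    """Normalized term distribution over a shared vocabulary; add-eps smoothing
    (our assumption) keeps the log ratio defined for unseen terms."""
    counts = Counter(tokens)
    total = sum(counts[w] + eps for w in vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def improved_kl(summary1_tokens, summary2_tokens):
    """Symmetric, non-negative divergence of Eq. 2.3:
    sum_i (P(i) - Q(i)) * log(P(i) / Q(i)); values near zero mean similar summaries."""
    vocab = set(summary1_tokens) | set(summary2_tokens)
    p = term_distribution(summary1_tokens, vocab)
    q = term_distribution(summary2_tokens, vocab)
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in vocab)

if __name__ == "__main__":
    s1 = "laguna beach is a coastal city in orange county".split()
    s2 = "laguna beach california is a seaside resort city".split()
    print(improved_kl(s1, s2))
```

Note that swapping the two arguments leaves the value unchanged, which is exactly the symmetry property the text highlights.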

The final similarity score between the document associated with the query entity and

candidate nodes is calculated using

$$ F = 0.4 * Sim(S_1, S_2) + 0.4 * T + 0.2 * S \qquad (2.4) $$

where Sim(S1, S2) is the score obtained from the Vector Space Model or the KL divergence method. T is a Boolean value set to 1 if the query entity and the KB entity strings match exactly, and 0 otherwise. S is the similarity between the query entity and the candidate nodes, obtained from Indri.

Finally, the output of the system is based on a co-reference decision which is made

by combining the entity type recognition and similarity measure. Two entity mentions are

co-referent to the same entity only when they have a high similarity measure and matching entity types.

The drawback with the approach of Si Li et al. is that their entity retrieval module is naive and random; it results in the retrieval of many irrelevant candidate nodes from the KB. Also, using the generated summary to calculate the similarity score is a poor choice because valuable contextual information about the query entity is lost. We show through our experiments that a good entity retrieval module is highly important for obtaining a high-performing EL system. Also, using the complete document text for calculating similarity increases the performance.

2.3.2 Two stage methodology for Entity Linking

Xianpei Han et al. [21] employed a two-stage EL method, where the two stages correspond to the two main components of their system. The first component is a multi-way entity candidate detector, which identifies all the possible nodes in the KB for a query entity based on a variety of knowledge sources, such as the Wikipedia anchor dictionary, the web, etc. The second component is an entity linker, which links an entity mention with the real-world entity (KB node) it refers to by measuring the similarity between them, based on Wikipedia semantic knowledge and a bag of words (BOW) model. We now explain the complete system in detail.

The multi-way entity candidate detector phase uses three features to obtain possible

candidate nodes from the KB that can be mapped to the query entity. The features used are

• Candidate detection using contextual information : In general, the context sur-

rounding a named entity is rich in information about the entity, especially for iden-

tifying abbreviations. Xianpei Han et al. use this intuition to obtain the variations of the given query entity, if any. They manually framed a few patterns to identify these variations. For example, a pattern (Cap∗?)(Abbr) would match text phrases like "the newly-formed All Basotho Convention (ABC) is far from certain". The expanded form of an abbreviated word is obtained in this way.

• Candidate detection using Wikipedia anchor dictionary : Entity candidates are

identified using the anchor dictionary of Wikipedia, which encodes rich information

about entities. A count of each anchor text phrase and the target Wikipedia article

title is calculated. These counts are then used to identify the candidate nodes from

the KB for the given query entity.

• Candidate detection using web : The query entity along with the surrounding con-

textual words is submitted to the Google search engine6. From the top K ranked

results they consider only the articles that belong to Wikipedia. These article titles

are also used to identify the candidate nodes.

Once the set of candidate nodes is obtained from the KB using the above heuristics, the nodes are ranked using a linear combination of two similarity metrics. Let the set of candidate nodes be $E = \{e_1, e_2, ..., e_n\}$ for the query entity m, and let the vector representations be $e_i = \{w_1, w_2, ..., w_n\}$ and $m = \{w'_1, w'_2, ..., w'_n\}$ respectively.

• BOW based similarity : Using the bag of words (BOW) model, both the query entity

mention m and the candidate nodes E are represented as a vector of word features,

and each word is weighted using the standard Tf-idf measure. The BOW based

similarity captures the word co-occurrence information. The similarity between e

and m is calculated using

$$ SIM_{BOW}(e, m) = \frac{\sum_i w_i w'_i}{\sqrt{\sum_i (w_i)^2}\,\sqrt{\sum_i (w'_i)^2}} \qquad (2.5) $$

• Wikipedia Semantic Knowledge Based (WSKB) Similarity : Wikipedia seman-

tic similarity is computed between the candidate nodes E and query entity mention

document m. This is done in three steps:

6 http://www.google.com/

– Wikipedia concept detection : The appearances of Wikipedia concepts are

detected using the method described in Milne and Witten [42]. Then, the query

entity mention document and the candidate nodes are represented as a vector of

Wikipedia concepts {c1, c2, ..., cm}.

– Wikipedia concept weighting : Since not all concepts in the representation are equally helpful, each concept is assigned a weight indicating its relatedness to the query entity mention or the candidate node. In detail, for each concept c in the representation, we assign it a weight by averaging the semantic relatedness of c to all the other Wikipedia concepts in the vector, i.e.

$$ w(c, e) = |e|^{-1} \sum_{c_i \in e,\, c_i \neq c} sr(c, c_i) \qquad (2.6) $$

where sr(c, ci) is the semantic relatedness measure between two concepts c and

ci, which is computed using the method described in Milne and Witten [42].

– Finally, Wikipedia semantic similarity is calculated using

$$ SIM_{wiki}(e, m) = \frac{\sum_{c_i \in m} \sum_{c_j \in e} w(c_i, m) * w(c_j, e) * sr(c_i, c_j)}{\sum_{c_i \in m} \sum_{c_j \in e} w(c_i, m) * w(c_j, e)} \qquad (2.7) $$

The final similarity score for each candidate node is a linear combination of BOW and

WSKB similarities.

$$ SIM_{Hybrid}(e, m) = \lambda * SIM_{BOW}(e, m) + (1 - \lambda) * SIM_{wiki}(e, m) \qquad (2.8) $$

If the similarity score of the best-ranked candidate node is greater than 0.4, it is returned as the mapping node; otherwise NIL is predicted.

The systems developed by us and by Xianpei Han et al. bear a lot of similarities. Both systems create candidate sets and then rank the sets using BOW as a feature. The difference between the systems is that we use a more fine-tuned module for generating the candidate

sets and for handling acronyms. Another key difference is in the approach to NIL detection.


We augment the KB with Wikipedia in order to predict NIL for entities that don’t have a

mapping node in the KB, whereas Xianpei Han et al. predict a mapping node or NIL based

on a fixed threshold.

The main drawback with this approach is that the manually written heuristics for candidate detection cover only a limited set of patterns. Another drawback concerns the NIL prediction methodology proposed by Xianpei Han et al.: fixing the same threshold for query entities occurring across various contexts is never a good strategy.

2.3.3 Supervised Machine Learning for Entity Linking

Fangtao Li et al. [28] use a "Learning to Rank" strategy to find the mapping node in the KB for a query entity. They employ a listwise learning-to-rank model and augment it with a Naïve Bayes binary classifier to find a mapping node. Their algorithm can be broken down into multiple steps, but the main components remain the same, i.e., candidate node generation and ranking. We now explain the algorithm in detail.

• Preprocessing : Since the KB can contain entries in the order of millions, Fangtao Li et al. index them for faster access to the documents. Also, since query entities might sometimes be misspelled, they use the query correction functions from Google, AltaVista7, etc.

• Query Expansion : Fangtao Li et al. argue that using only the given query entity is not sufficient to find the correct mapping node from the KB. Hence, they use various strategies, such as using the document associated with the query entity to find the expanded form of abbreviations, and using Wikipedia redirect, disambiguation and link information to obtain various possible variations of an entity.

• Candidate Generation : Using the obtained variations, they retrieve the top 20 documents from the KB by forming an "OR" query from the entity variations. The obtained set of candidate nodes is then ranked to identify the mapping node.

7 http://www.altavista.com/

• Listwise Learning to Rank : Using a small training set of 285 queries, they adopt ListNet, a learning-to-rank algorithm proposed by Zhe Cao [9]. The candidate nodes obtained are ranked using the trained model. Then, they use a Naïve Bayes binary classifier to decide whether the top-ranked node is correct or whether NIL should be predicted.

The drawback with the approach of Fangtao Li et al. is that it requires a large corpus of human-annotated data to train the model. Creating training data for the three categories, mainly person, location and organization, covering various contexts is a difficult and time-consuming task. McNamee et al. [37] also propose a supervised machine learning approach similar to that of Fangtao Li et al. The only difference is that McNamee considers absence as another entry to rank and selects the top-ranked node directly, unlike Fangtao Li et al., who use a Naïve Bayes binary classifier. We show that our approach scales easily to large KBs and performs better than all the above algorithms without any training data.

2.4 Conclusions

In this chapter, we presented an elaborate discussion of the literature related to EL. We discussed seminal work on Person Name Disambiguation and Co-reference Resolution, as they share a lot of similarities with EL. Then, we discussed seminal work on EL by Cucerzan, Bunescu and Pasca. Later, we explained in detail the three systems developed as part of the TAC-KBP EL shared task and explained their shortcomings. We also discussed how our ap-

proach overcomes their shortcomings. In the next chapter, we explain the first phase of our

algorithm, Candidate List Generation (CLG).


Chapter 3

Candidate List Generation

Given a KB, the task of EL is to determine for each named entity occurring in a document,

which KB node is being referred to, or if it is a new entity and not present in the KB.

As discussed in Section 1.5, we break EL into two steps. In the first step, we build an

entity repository (ER), which contains different forms of various named entities. ER is

built using various features from Wikipedia. ER is a prerequisite for identifying candidate

nodes because it contains information about various forms in which a named entity can

occur.

In the next step, query entity1 is expanded to obtain its variations. In addition to using

ER for identifying query entity variations, we use web search results and Stanford NER.

These variations are used to generate candidate nodes from the KB, referred to as the Candidate List (CL). This phase of generating the CL is referred to as the Candidate List Generation (CLG) phase. These candidate nodes are finally ranked using various similarity techniques. In this chapter, we explain the CLG phase in detail.

1 Query Entity refers to a named entity occurring in a document which is to be linked to a node in the Knowledge Base, if any.


3.1 Building Entity Repository

In the real world, a named entity can be referred to using various forms like nicknames, alias names, acronyms and spelling variations. We introduced, with examples in Chapter 1, how a named entity could be referred to using these various forms. In order to handle these variations, we build an ER which contains the various forms in which an entity could be referred to. Though the web contains various forms of named entities, it is not an ideal source from which to extract entity variations, for the following reasons.

• The web is voluminous and continues to grow at an astounding rate in both the sheer volume of traffic and size. Valuable information about entities is sparsely distributed across the web. The process of mining entity variations from such voluminous data is tedious and time-consuming.

• A large percentage of web documents are unstructured. Inferring information from such a wide range of documents is extremely difficult and not an ideal solution.

• Most of the information available on the web is not moderated. Hence, extracting information from the web can result in false and unverified data being extracted.

Hence, we use Wikipedia, the largest semi-structured database available [55], to mine various forms of named entities. The advantages of using Wikipedia are:

• It has better coverage of named entities [69]. Since the KB provided by TAC-KBP

shared task covers only named entities, Wikipedia acts as a perfect platform for build-

ing our ER.

• Articles in Wikipedia are heavily linked and structured. We use the information

encoded in redirect and disambiguation pages for extracting named entity variations.

• With over 3.5 million articles Wikipedia is rightly sized and big enough to provide

information about name variants.


• Since data on Wikipedia is moderated, we can be assured of a certain level of authenticity of the information present in it.

The existing literature [18, 41, 45] confirms the fact that valuable information can be

mined from Wikipedia. A sample Wikipedia article/document encoded in XML is shown

in the figure 3.1.

Figure 3.1 A sample article/document in Wikipedia.

A Wikipedia article contains a unique title, an ID, text carrying information about an

entity/event and some meta information. We use the title and text of an article for identify-

ing name variants.

The features we use in extracting name variants from Wikipedia are:

• Redirect Pages : A redirect page in Wikipedia is an aid to navigation; it contains no content itself, only a link to another article (the target page), and strongly relates to the concept of the target page. In layman's terms, a redirect is a page which has no content of its own but sends the reader to another article, or to a section of an article, usually from an alternative title. Redirect pages help in identifying the following name variants.

– Alternative names (for example, "Edison Arantes do Nascimento" redirects to "Pelé").

– Plurals (for example, “Greenhouse gases” redirects to “Greenhouse gas”).

– Closely related words (for example, “Symbiont” redirects to “Symbiosis”).

– Less specific forms of names, for which the article subject is still the primary

topic. For example, “Hitler” redirects to “Adolf Hitler”.

– More specific forms of names (for example, “Articles of Confederation and

Perpetual Union” redirects to “Articles of Confederation”).

– Abbreviations (for example, “DSM-IV” redirects to “Diagnostic and Statistical

Manual of Mental Disorders”).

– Alternative spellings or punctuation. For example, "Colour" redirects to "Color", and "Al-Jazeera" redirects to "Al Jazeera".

– Likely misspellings (for example, “Condoleeza Rice” redirects to “Condoleezza

Rice”).

– Likely alternative capitalizations (for example, “Natural Selection” redirects to

“Natural selection”).

A sample redirect page encoded in XML is shown in the figure 3.2. A redirect page

contains a unique title and redirect information to the original article. For example,

from figure 3.2, we obtain “Tendulkar” as a name variant of “Sachin Tendulkar”.


Figure 3.2 A sample redirect document in Wikipedia.

• Disambiguation Pages : Disambiguation pages are specifically created for ambigu-

ous entities, and consist of links to articles defining the different meanings of the

entity. They are used as a process of resolving conflicts in article titles that occur

when a single term can be associated with more than one topic, making that term

likely to be the natural title for more than one article. In other words, disambigua-

tions are paths leading to different articles which could, in principle, have the same

title. For example, the word “Mercury” can refer to an element, a planet, a Roman

god, and many other things. This feature helps in homonym resolution.

A sample disambiguation page encoded in XML is shown in the figure 3.3. From fig-

ure 3.3, we can conclude that "Sachin" is a name variant for "Sachin Tendulkar", "Sachin Pilgaonkar", etc.

• Bold Text From First Paragraph : On randomly analyzing a few pages in Wikipedia, we found that the bold text in the first paragraph of a Wikipedia article generally refers to the full/nick name of a named entity. This feature helps in identifying

full/nick names of an entity.

From figure 3.1, we can conclude that “Sachin Ramesh Tendulkar” (text in black

color) is a name variant of “Sachin Tendulkar”.


Figure 3.3 A sample disambiguation document in Wikipedia.

Using the above features from Wikipedia we obtain different variations of a named

entity. For example, variations obtained for “Sachin Tendulkar” are

• “Sachin Ramesh Tendulkar” from bold text of first paragraph, which is in fact the

full name of “Sachin Tendulkar”.

• “Tendulkar” from redirect page, which is the less specific form of “Sachin Ten-

dulkar”.

• “Sachin” from disambiguation page.

All these variations are indexed using Lucene2, a high-performance, full-featured text search engine, to enable fast retrieval of documents.

ER is important because we have information about various forms of named entities at

one place. These variations are used in our CLG phase to identify candidate nodes from

the KB.

2 http://lucene.apache.org

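As a rough illustration of how such an entity repository could be assembled, the sketch below collects name variants from redirects, disambiguation pages and the bold text of the first paragraph. This is not the thesis implementation: it assumes an iterable of (title, wikitext) pairs already parsed out of a Wikipedia XML dump (e.g., with a tool such as mwxml or wikiextractor), the regular expressions and the disambiguation-template check are simplifications, and the Lucene indexing step is omitted.

```python
import re
from collections import defaultdict

REDIRECT_RE = re.compile(r"#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)
BOLD_RE = re.compile(r"'''(.+?)'''")           # bold markup in wikitext
LINK_RE = re.compile(r"\[\[([^\]|#]+)")         # links listed on a disambiguation page

def build_entity_repository(pages):
    """Map each canonical article title to the set of name variants found via
    redirects, disambiguation pages, and the bold text of the first paragraph."""
    variants = defaultdict(set)
    for title, text in pages:
        redirect = REDIRECT_RE.search(text)
        if redirect:                               # e.g. "Tendulkar" -> "Sachin Tendulkar"
            variants[redirect.group(1).strip()].add(title)
        elif "{{disambiguation}}" in text.lower():
            base = title.replace(" (disambiguation)", "")
            for target in LINK_RE.findall(text):   # e.g. "Sachin" -> "Sachin Tendulkar", ...
                variants[target.strip()].add(base)
        else:
            first_para = text.strip().split("\n\n", 1)[0]
            for bold in BOLD_RE.findall(first_para):   # full/nick names
                variants[title].add(bold.strip())
    return variants
```

In practice the resulting variant lists would be written into a Lucene index, as described above, so that query entities can be matched against them quickly.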

3.2 Identifying Query Entity Variations

In this phase, we identify all possible variations of the query entity. We use the query document in context, web search results and the ER to identify query entity variations. The entity vari-

ants obtained are then used during the candidate list 3 (CL) identification phase to identify

mapping nodes from the KB. We now describe the various steps in identifying query entity

variations in detail.

3.2.1 Using Query Document in Context

We use the given query document for two purposes. First, We use it identify expanded

form of the query entity, if it is an acronym. Secondly, we use it to identify full name, nick

name, alias name etc if any. We use Stanford NER for the establishing the second task. We

now describe each in detail.

Acronym Expansion : Here the goal is to find the expanded form of the query entity, if

it is an acronym. For this, we check whether the query entity is an acronym, i.e., contains all uppercase characters. If the given query entity is an acronym, we try to find its expanded form from the corresponding query document, if any. We use an N-gram based approach to find the expanded form of the query entity. For this, we remove stop words from the document and check whether a continuous sequence of N tokens has the same initials as our query entity. If an expanded form is found, we use it along with the query entity (acronym) to search in

ER. The intuition behind this is that it is common for entities to be introduced in text as full forms and subsequently referred to by shorter forms or pronouns. Resolving these in-document co-reference links to retrieve the full form can thus have a substantial impact on

candidate ambiguity.

For example, given the following sentences :

• ...the newly-formed All Basotho Convention (ABC) is far from certain...

3 The unordered list of candidate nodes obtained using query entity variations is referred to as the Candidate List (CL).


• ...Abbott Laboratories (ABT:NYSE) ...

• ...the Anti-Corruption Unit (ACU) of the International Cricket Council (ICC) ...

• ...member countries of Asian Clearing Union (ACU) recorded...

We can easily identify the expanded forms of all the above acronyms using our simple

N-Gram based technique. For example, ABC refers to All Basotho Convention, the first

ACU refers to Anti-Corruption Unit and the second ACU refers to Asian Clearing Union.
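The following is a minimal sketch of the N-gram acronym expansion just described. It is an illustration under our own assumptions (a tiny stop-word set and simple whitespace tokenization), not the exact thesis code.

```python
import re

STOP_WORDS = {"the", "of", "and", "a", "an", "in", "for", "to", "on"}  # illustrative subset

def expand_acronym(acronym, document_text):
    """After stop-word removal, look for N consecutive tokens whose initials
    spell the acronym (N = len(acronym)); return the first such expansion."""
    tokens = [t for t in re.findall(r"[A-Za-z]+", document_text)
              if t.lower() not in STOP_WORDS]
    n = len(acronym)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if "".join(w[0].upper() for w in window) == acronym:
            return " ".join(window)
    return None

# expand_acronym("ABC", "the newly-formed All Basotho Convention (ABC) is far from certain")
# -> "All Basotho Convention"
```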

Stanford Named Entity Recognizer 4 : Stanford NER provides a general implemen-

tation of linear chain Conditional Random Field (CRF) sequence models, coupled with

well-engineered feature extractors for NER. It can identify Person, Location and Organiza-

tion.

We run the Stanford NER on the query document. It tokenizes and extracts named entity mentions from the text and tags them as "PERSON", "LOCATION" or "ORGANIZATION". Phrases belonging to any of the three categories and having our query entity as a substring are identified as possible variations of the query entity. This feature would

help us in identifying full name, nick name, alias name etc of the query entity, if any. The

purpose of this heuristic is to use the least ambiguous mentions in the document as the basis

for CL identification. It is common for entities to be introduced in discourse as full forms

and subsequently referred to by shorter forms or pronouns. Resolving these in-document

co-reference links to retrieve the full form can thus have a substantial impact on candidate

ambiguity, and subsequently on an EL system.

For example, the mention of “Columbus” will be co-referred to the full form “Colum-

bus, Ohio” if it is extracted as a mention from the query document.
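A small sketch of this superstring heuristic is given below. It assumes the NER output has already been collected as a list of (phrase, label) pairs; the function name and the example mentions are ours, introduced purely for illustration.

```python
def query_variations_from_ner(query_entity, ner_mentions):
    """Keep tagged PERSON/LOCATION/ORGANIZATION mentions that contain the query
    entity as a substring, e.g. "Columbus" -> "Columbus, Ohio"."""
    allowed = {"PERSON", "LOCATION", "ORGANIZATION"}
    query = query_entity.lower()
    return sorted({phrase for phrase, label in ner_mentions
                   if label in allowed
                   and query in phrase.lower()
                   and phrase.lower() != query})

# query_variations_from_ner("Columbus",
#     [("Columbus, Ohio", "LOCATION"), ("George Bush", "PERSON")])
# -> ["Columbus, Ohio"]
```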

4 http://nlp.stanford.edu/software/CRF-NER.shtml


3.2.2 Using Entity Repository

Using the ER built, we obtain all possible name variants for the given query entity. In

simple terms, the variations obtained are nothing but name variants of the query entity from Wikipedia. The given query entity is searched against the Lucene index built for the ER. The results obtained are name variants of the query entity.

For example, “George W. Bush, George H. W. Bush, George P. Bush” etc are name

variants of “George Bush” found from ER.

3.2.3 Using Web Search Results

We use Google search engine to identify query entity variations. We use Google’s spell

suggestion and Google’s site specific search feature. We now describe each in detail.

Google Spell Suggestion : Essentially, Google spell checking compares the words entered against a constantly changing list of the most common searches and detects when a user may have intended to enter a different word or words. Because it does not depend on a rigid dictionary, it is more effective at identifying words and phrases that may be commonly used but are often not included in formal dictionaries, i.e., named entities. Google's checker is

particularly good at recognizing frequently made typos, misspellings, and misconceptions.

For our purpose, although most of the query entity strings are well formed, there are still

some spelling errors, so we try to correct them using the spell suggestion feature supplied by the Google search engine. We input the query entity string to the search engine, and the search engine returns a corrected spelling of the string if the original one was wrong. Since our query entities are named entities, this returns the best

possible spelling.

Google Site Specific Search : Google allows a user to specify a single website from

which a user might want to get the results from. For example, the query [ Iraq site:nytimes.com

will return pages about Iraq, but only from nytimes.com. This feature of Google will per-


form a site-specific search on that particular website and return a ranked set of documents from it. We use this feature to obtain a ranked set of documents for our query entities from Wikipedia. This feature helps us in identifying a name variant of the query entity when Wikipedia documents are ranked using the Google search engine.

"site:en.wikipedia.org" is used to obtain a ranked set of documents from the Wikipedia domain for a query entity. From the ranked set of web search results, we consider the title of the topmost ranked result as a variation of our query entity.

For example, HDFC Bank is obtained as a variation of the query entity HDFC.

3.3 Candidate Nodes Identification

Once the set of name variants of the query entity is obtained, we need to identify the set

of possible mapping nodes from the KB. We search the name variants of the query entity in

the titles of the KB. This searching of the name variants to identify mapping nodes from the

KB is an important step because if the correct mapping node isn’t picked into the Candidate

List 5 (CL), the system will fail irrespective of how good the ranking algorithm might be.

We believe that as long as the correct mapping node is picked into the CL, the likelihood of

it being returned as a mapping node after ranking is very high. This search of name variants

on the KB titles is done in the following way.

• Token Search : The name variants of the query entity are searched on the titles of

KB nodes. Boolean “AND” search of all the tokens of each query entity variation is

done on the KB node title. If all the tokens are present, we add the KB node to CL.

For example, if the given query entity is "CCP" and we find its name variant to be "Chinese Communist Party", we would retrieve nodes with the title "Chinese Communist Party" or "Communist Party of China".
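The sketch below illustrates this Boolean "AND" token search. It is a simplified stand-in: in the actual system the search runs against a Lucene index of KB titles rather than a Python dictionary, and the node ids and titles used here are hypothetical.

```python
def token_and_search(name_variants, kb_titles):
    """A node joins the candidate list (CL) if every token of some name variant
    appears in its title. `kb_titles` maps a KB node id to its title string."""
    candidate_list = set()
    for variant in name_variants:
        variant_tokens = {t.lower() for t in variant.split()}
        for node_id, title in kb_titles.items():
            title_tokens = {t.lower() for t in title.split()}
            if variant_tokens <= title_tokens:     # all variant tokens present
                candidate_list.add(node_id)
    return candidate_list

# token_and_search(["George Bush", "George W. Bush"],
#     {"E1": "George W. Bush", "E2": "George H. W. Bush", "E3": "Bush (band)"})
# -> {"E1", "E2"}
```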

5 The unordered list of candidate nodes obtained during candidate node identification is referred to as the Candidate List.


3.4 Adding Wikipedia Article to the Candidate List

As we need to predict NIL for query entities that don’t have a mapping node in the KB, we

add Wikipedia nodes also to the CL. We search Wikipedia using the same name variants obtained for a query entity. The token search used for searching the KB is also used for searching Wikipedia. We only add Wikipedia nodes which are not present in the KB to the CL.

Adding Wikipedia articles to the CL allows us to consider strong matches against query

entities that do not have any corresponding node in the KB and hence we can return NIL.

That is, for a given query entity if the ranking function maps to the Wikipedia article from

the CL, we can confirm the non-presence of a node in the KB about the query entity. This

method of augmenting the given KB is a far better strategy than fixing a threshold value for predicting NIL.

The result of the CLG phase is an unordered list of candidate nodes. We need to rank

this unordered list in order to find the correct mapping node. We have experimented with

various similarity functions for ranking which are explained in the next chapter.

A flow chart of our CLG phase is shown in the figure 3.4.

3.5 Conclusions

In this chapter, we described the various features used to build the ER from Wikipedia. We used Wikipedia-specific structure, i.e., redirect pages, disambiguation pages and bold text from the first paragraph of an article, to build the ER. Later, we used the ER, web search results and the Stanford NER to identify query entity variations. Using these variations, we search the given KB and Wikipedia to identify an unordered list of candidate nodes, referred to as the CL. In the next

chapter, we use various similarity techniques to rank the nodes in CL to obtain the mapping

node.


Figure 3.4 Flow Chart of Candidate List Generation Phase


Chapter 4

Entity Linking as Ranking

In this chapter, we describe the core part of our approach i.e. predicting the mapping node

from the generated list of candidate nodes, the CL. We rank the candidate nodes based on their

similarity to the query document. Predicting the mapping node can be broken down into

three steps :

1. The list of candidate nodes and the query document are tokenized and represented as

token vectors.

2. We use a wide variety of similarity techniques in IR to compute the similarity be-

tween the candidate node vectors and the query document vector. The candidate node with the highest similarity score is referred to as the Best Ranked Node (BRN).

3. Mapping node or NIL is predicted based on whether BRN ∈ KB or BRN ∈ Wikipedia.

To calculate the similarity between candidate nodes and the query document, we have

experimented with cosine similarity, Naïve Bayes, maximum entropy, Tf-idf ranking and

pseudo relevance feedback ranking.


4.1 Entity Linking as Ranking

The result of the CLG phase described in Chapter 3 is an unordered list of candidate nodes. If |CL| = 0, we return NIL; otherwise we rank the candidate nodes to predict the mapping node. |CL| = 0 is the case where no name variant of the query entity is present in the KB or Wikipedia titles. We predict NIL for such cases as no candidate node could be obtained. When |CL| ≠ 0, similarity is calculated between the candidate nodes and the query document using various techniques. For this, we represent the query document Dq and the

candidate nodes C = {C1, C2, ..., Cn}, where Ci ∈ C, as vectors. Similarity is calculated

between the vector representations of the query document (Dq) and candidate nodes (Ci).

4.2 Vector Representation of Documents

In this section, we briefly describe the process of obtaining the vector representation of a document. First, the query document (Dq) and the candidate nodes (Ci) are tokenized using space as a delimiter. Tokens belonging to the stop words list2 are removed and the remaining tokens are stemmed to obtain vectors for each document. The representation of a set of documents as vectors in a common vector space is known as the Vector Space Model [60] and is fundamental to a host of information retrieval operations, ranging from scoring documents on a query to document classification and document clustering.

Let S denote the set of all stop words. Consider the document associated with the query entity as $D_q$, where $D_q = \{q_1, q_2, ..., q_n\}$ with $q_i \notin S$ and $q_i$ a stemmed word. Let $\vec{V}(D_q) = (q_1, q_2, ..., q_n)$ be the vector representation of the query document.

Similarly, let the set of candidate nodes be C, where $C = \{C_1, C_2, ..., C_n\}$. $C_i$ is a candidate node and $C_i \in C$, $C_i = \{w_1, w_2, ..., w_m\}$ with $w_i \notin S$ and $w_i$ a stemmed word. Let $\vec{C} = \{\vec{V}(D_1), ..., \vec{V}(D_n)\}$, with $\vec{V}(D_i) = (w_{i1}, w_{i2}, ..., w_{im})$, be the vector representation of the candidate nodes.

1 |CL| refers to the size of the candidate list (CL).
2 We used a list of 200 frequently occurring stop words from the web.

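A minimal sketch of this tokenization, stop-word removal and stemming pipeline is shown below. The tiny stop-word set and the crude suffix-stripping stemmer are our own placeholders (the thesis does not name the stemmer used; a Porter stemmer would be a typical choice).

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}  # illustrative subset

def simple_stem(token):
    """Crude suffix stripping, used only to keep the sketch self-contained."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def to_vector(text):
    """Tokenize on whitespace, drop stop words, stem, and count term frequencies;
    the resulting dict is the document vector used by the similarity measures below."""
    vector = {}
    for token in re.split(r"\s+", text.lower()):
        token = token.strip(".,;:!?()\"'")
        if token and token not in STOP_WORDS:
            stem = simple_stem(token)
            vector[stem] = vector.get(stem, 0) + 1
    return vector
```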

We now discuss various techniques we experimented to calculate similarity between

candidate nodes and the query document.

4.3 Cosine Similarity

In this section, we describe in detail how we identify the BRN from the CL using cosine

similarity. The model is based on the intuition that documents with a higher number of common terms are more similar. In this model, we view the set of candidate nodes as a set of vectors in a vector space, in which there is one axis for each token. We compute the similarity between the query document vector $\vec{V}(D_q)$ and the candidate node vectors $\vec{C}$ as the cosine of the angle between them.

Figure 4.1 Cosine Similarity.

The cosine similarity between the query document Dq and a candidate node Ci is com-

puted as

$$ sim(D_q, C_i) = \frac{\vec{V}(D_q) \cdot \vec{V}(C_i)}{|\vec{V}(D_q)|\,|\vec{V}(C_i)|} \qquad (4.1) $$

where the numerator represents the dot product (also known as the inner product) of the vectors $\vec{V}(D_q)$ and $\vec{V}(C_i)$, while the denominator is the product of their Euclidean lengths.

45

CHAPTER 4. ENTITY LINKING AS RANKING

The dot product $\vec{V}(D_q) \cdot \vec{V}(C_i)$ of two vectors is defined as $\sum_{j=1}^{M} D_{q,j}\, C_{i,j}$, with M representing the union of tokens representing the documents $D_q$ and $C_i$. The Euclidean length of $D_q$ is defined to be $\sqrt{\sum_{j=1}^{M} \vec{V}_j^2(D_q)}$. The Euclidean length of $C_i$ is calculated similarly.

The effect of the denominator of equation (4.1) is to length-normalize the vectors $\vec{V}(D_q)$ and $\vec{V}(C_i)$ to unit vectors $\vec{v}(D_q) = \vec{V}(D_q)/|\vec{V}(D_q)|$ and $\vec{v}(C_i) = \vec{V}(C_i)/|\vec{V}(C_i)|$. We can then rewrite (4.1) as

$$ sim(D_q, C_i) = \vec{v}(D_q) \cdot \vec{v}(C_i) \qquad (4.2) $$

Thus, (4.2) can be viewed as the dot product of the normalized versions of the two

vectors. This measure is the cosine of the angle θ between the two vectors, shown in Figure

4.1.

The candidate node Ci with the highest cosine similarity score to the query document Dq is returned as the BRN.
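The following sketch shows how the BRN could be selected with Eq. 4.1 over the term-frequency vectors of Section 4.2. It reuses the hypothetical to_vector helper from the earlier sketch and is an illustration only, not the thesis implementation.

```python
import math

def cosine_similarity(vec_q, vec_c):
    """Cosine similarity of Eq. 4.1 between two term-frequency dicts."""
    common = set(vec_q) & set(vec_c)
    dot = sum(vec_q[t] * vec_c[t] for t in common)
    norm_q = math.sqrt(sum(w * w for w in vec_q.values()))
    norm_c = math.sqrt(sum(w * w for w in vec_c.values()))
    if norm_q == 0 or norm_c == 0:
        return 0.0
    return dot / (norm_q * norm_c)

def best_ranked_node(query_vector, candidate_vectors):
    """Return the id of the candidate node with the highest cosine score (the BRN);
    `candidate_vectors` maps each candidate node id to its term-frequency dict."""
    return max(candidate_vectors,
               key=lambda node_id: cosine_similarity(query_vector, candidate_vectors[node_id]))
```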

4.4 Classification Model

In the field of IR, document classification is the task of assigning a document to one or

more classes, based on its features. This task is also referred to as text classification, text

categorization, topic classification or topic spotting. The notion of classification is very

general and has many applications within and beyond IR. In our scenario, we assume each

candidate node Ci to represent a unique class label (Li). We need to determine which class

(Li) is the closest mapping class for our query document Dq. We have experimented with

two classification techniques

• Naïve Bayes.

• Maximum Entropy.

46

CHAPTER 4. ENTITY LINKING AS RANKING

We use the implementations of Naïve Bayes and maximum entropy available in the Rainbow Text Classifier3.

Supervised classification models like Naïve Bayes and maximum entropy require labeled training data. The labeled training data is obtained using a set of features to represent a document. Selecting a set of features to represent a document is called feature selection. We now explain the importance of the feature selection process and later describe how we use features to represent the training documents (Ci).

Feature Selection : Feature selection is the process of selecting a subset of the terms

occurring in the training set (C) and using only this subset as features in text classification.

Feature selection serves two main purposes.

• First, it makes training and applying a classifier more efficient by decreasing the size

of the effective vocabulary.

• Second, feature selection often increases classification accuracy by eliminating noise

features 4.

Our representation of the candidate nodes (C) obtained in section 4.2 serves this pur-

pose. By using stop word removal and tokenization, we have obtained a subset of effective vocabulary terms which represents the candidate nodes (Ci) better.

4.4.1 Naïve Bayes

In this section, we explain how Naïve Bayes is used for identifying the BRN. Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions. It has been used for a wide range of applications like text classification [26, 56, 1, 61], word sense disambiguation [49, 15], sentiment classification [48, 64, 38], etc.

3 http://www.cs.cmu.edu/~mccallum/bow/rainbow/
4 A noise feature is one that, when added to the document representation, increases the classification error on new data.

We now describe how Naïve Bayes is used for identifying the BRN. The probability of the query document Dq being in class Li (candidate node Ci) is computed as

$$ P(L_i | D_q) \propto P(L_i) \prod_{1 \leq k \leq n} P(q_k | L_i) \qquad (4.3) $$

where P (qk|Li) is the conditional probability of term qk occurring in a candidate node

of class Li. We interpret P (qk|Li) as a measure of how much evidence qk contributes that

Li is the correct class. P (Li) is the prior probability of a candidate node occurring in class

Li. If a candidate node’s terms do not provide clear evidence for one class versus another,

we choose the one that has a higher prior probability. < q1, q2, ..., qn > are the tokens in

query document Dq that are part of the vocabulary we use for classification and n is the

number of such tokens in Dq.

Our goal is to find the best mapping class (Li) for the query document (Dq). The best class in Naïve Bayes classification is the most likely or maximum a posteriori (MAP) class $c_{map}$:

$$ c_{map} = \underset{L_i \in C}{\operatorname{argmax}}\; \hat{P}(L_i | D_q) = \underset{L_i \in C}{\operatorname{argmax}}\; \hat{P}(L_i) \prod_{1 \leq k \leq n} \hat{P}(q_k | L_i) \qquad (4.4) $$

We write P̂ for P because we do not know the true values of the parameters P (Li) and

P (qk|Li), but estimate them from the training set.

We obtain the likelihood for each candidate node (Li) and rank them accordingly. The

candidate node (Ci) with the best likelihood score is returned as the BRN.
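The sketch below illustrates this ranking step from scratch. The actual experiments used the Rainbow toolkit; here a uniform prior and add-alpha smoothing are our own assumptions, introduced only so the example runs end to end.

```python
import math
from collections import Counter

def naive_bayes_brn(query_tokens, candidate_docs, alpha=1.0):
    """Rank candidate nodes as Naïve Bayes classes (Eq. 4.4): each candidate's text
    defines P(q_k | L_i) with add-alpha smoothing, a uniform prior is assumed, and
    the candidate with the highest log posterior is returned as the BRN.
    `candidate_docs` maps a node id to its token list."""
    vocab = {t for tokens in candidate_docs.values() for t in tokens} | set(query_tokens)
    scores = {}
    for node_id, tokens in candidate_docs.items():
        counts = Counter(tokens)
        denom = len(tokens) + alpha * len(vocab)
        scores[node_id] = sum(math.log((counts[q] + alpha) / denom) for q in query_tokens)
    return max(scores, key=scores.get)
```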

4.4.2 Maximum Entropy

In this section, we describe the maximum entropy technique for identifying the BRN from the candidate node set (C). Maximum entropy has been widely used for a variety of natural language tasks like language modeling [10, 58], part-of-speech tagging [52] and prepositional phrase attachment [53]. The overriding principle in maximum entropy is that when


nothing is known, the distribution should be as uniform as possible, that is, have maximal

entropy.

Since our case is most similar to text classification, maximum entropy estimates the conditional distribution of the class label (Li) given a candidate node Ci. We use the representation of the candidate nodes (C) obtained in section 4.2 and bag of words as a feature. The labeled training data is used to estimate the expected value of the tokens on a class-by-class basis. First, we introduce how to select a feature set for setting the constraints and building the training model. Then, we move on to explain how it is used for identifying the BRN.

Constraints and Features : In maximum entropy, we use the training data (Ci belong-

ing to class Li) to set constraints on the conditional distribution. We let any real-valued

function of the candidate node Ci and the class Li be a feature, fi(Ci, Li). Maximum en-

tropy allows us to restrict the model distribution to have the same expected value for this

feature as seen in the training data, the candidate node set C. Thus, we stipulate that the learned conditional distribution P(Li|Ci) must have the property:

$$ \frac{1}{|C|} \sum_{C_i \in C} f_i(C_i, c(C_i)) = \sum_{C_i} P(C_i) \sum_{L_i} P(L_i | C_i) f_i(C_i, L_i) \qquad (4.5) $$

Thus, when using maximum entropy, the first step is to identify a set of feature functions

that will be useful for classification. Then, for each feature, measure its expected value

over the training data and take this to be a constraint for the model distribution. More

specifically, for each word-class combination we instantiate a feature as:

$$ f_{w, L'_i}(C_i, L_i) = \begin{cases} 0, & \text{if } L_i \neq L'_i \\ \dfrac{N(C_i, w)}{N(C_i)}, & \text{otherwise} \end{cases} \qquad (4.6) $$

where N(Ci, w) is the number of times word w occurs in document Ci, and N(Ci) is

the number of words in Ci. With this representation, if a word occurs often in one class,

we would expect the weight for that word-class pair to be higher than for the word paired

with other classes.


We use the representation of the documents obtained in section 4.2 to train a maximum

entropy probability distribution model and use it to classify the query document Dq. The

candidate node (Ci) which receives the highest probability estimate is returned as BRN.

4.5 Tf-idf Ranking

The Tf-idf weight (term frequency-inverse document frequency) [59] is often used for vari-

ous tasks in information retrieval and text mining. This weight is a statistical measure used

to evaluate how important a word is to a document in a collection or corpus. The impor-

tance increases proportionally to the number of times a word appears in the document but

is offset by the frequency of the word in the corpus. The intuition behind this model is that

a document that mentions a term more often has more to do with that term and therefore

should receive a higher score. Variations of the tf-idf weighting scheme are often used by

search engines as a central tool in scoring and ranking a document’s relevance given a user

query. We now explain how Tf-idf ranking is used in identifying the BRN.

4.5.1 Term frequency and weighting

Term frequency (TF) refers to how often a term appears in a specific document. Each

term in candidate node Ci is assigned a weight depending on the number of occurrences

of the term in the Ci. ∀qi, qi ∈ Dq, we compute a score between the query term qi and

candidate node Ci, based on the weight of qi in Dq. We assign the weight to be equal to the

number of occurrences of term qi in document Ci. This weighting scheme is referred to as term frequency and is denoted $tf_{q_i, C_i}$, with the subscripts denoting the query term and the candidate node in order. The ordering of the terms in Ci is ignored; only the number of occurrences of each qi is retained.


4.5.2 Inverse document frequency

Inverse Document Frequency (IDF) is a measure of the general importance of a term.

The above-mentioned raw term frequency suffers from a critical problem: all terms are con-

sidered equally important when it comes to assessing relevancy on a qi. In fact certain qi

have little or no discriminating power in determining relevance. To this end, we introduce

a mechanism for attenuating the effect of qi that occur too often in candidate nodes C to

be meaningful for relevance determination. An immediate idea is to scale down the term

weights of qi with high collection frequency, defined to be the total number of occurrences

of qi in C. The idea is to reduce the tf weight of qi by a factor that grows with

its frequency in candidate nodes C. By using this document-level statistic (the number of

documents containing qi) we discriminate between Ci for the purpose of scoring. IDF is

given by

$$ idf_{q_i} = \log \frac{|C|}{1 + |q_i \in C_i|} \qquad (4.7) $$

where |C| is the total number of candidate nodes. |qi ∈ Ci| is the number of candidate

nodes in which qi appears. If qi does not appear in any candidate node in C, this would lead to a division by zero. Hence, we use 1 + |qi ∈ Ci|.

4.5.3 Tf-idf Weighting

Combining the definitions of TF and IDF, we produce a composite weight for each qi in

each Ci. The tf-idf weighting scheme assigns to each qi a weight in document Ci and is

given by

$$ tf\text{-}idf_{q_i, C_i} = tf_{q_i, C_i} \times idf_{q_i} \qquad (4.8) $$

In other words, $tf\text{-}idf_{q_i, C_i}$ assigns to qi a weight in Ci that is

• highest when qi occurs many times within a small number of candidate nodes C (thus


lending high discriminating power to those candidate nodes);

• lower when the qi occurs fewer times in Ci, or occurs in many candidate nodes C

(thus offering a less pronounced relevance signal);

• lowest when the qi occurs in virtually all candidate nodes C.

Finally, the similarity between the query document Dq and a candidate node Ci, Ci ∈ C, is given by

$$ Similarity(C_i, D_q) = \sum_{q_i \in D_q} tf(q_i, C_i) * idf(q_i) \qquad (4.9) $$

The candidate nodes Ci are ranked in descending order and the candidate node with the highest tf-idf score is returned as the Best Ranked Node (BRN).
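A compact sketch of this scoring is shown below, following Eqs. 4.7-4.9. It assumes each candidate node is available as a plain token list; in the actual system these would be the stemmed, stop-word-filtered vectors of Section 4.2.

```python
import math

def tfidf_rank(query_tokens, candidate_docs):
    """Score each candidate with Eq. 4.9: sum over query terms of tf(q, C_i) * idf(q),
    with idf as in Eq. 4.7 (log(|C| / (1 + df))). Returns node ids sorted by
    descending score, so the first element is the BRN."""
    num_candidates = len(candidate_docs)
    df = {}
    for tokens in candidate_docs.values():
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    idf = {q: math.log(num_candidates / (1 + df.get(q, 0))) for q in set(query_tokens)}
    scores = {
        node_id: sum(tokens.count(q) * idf[q] for q in set(query_tokens))
        for node_id, tokens in candidate_docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```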

4.6 Pseudo Relevance Feedback for Re-ranking

In this section, we give a brief overview of the Hyperspace Analogue to Language (HAL) model. Later, we show how HAL is used to re-rank the ranked set of candidate nodes obtained by Tf-idf ranking to identify the BRN.

4.6.1 Pseudo Relevance Feedback

Pseudo relevance feedback, also known as blind relevance feedback, provides a method for

automatic local analysis. It automates the manual part of relevance feedback, so that the

user gets improved retrieval performance without an extended interaction. The method is to

do normal retrieval to find an initial set of most relevant documents, to then assume that the

top k ranked documents are relevant, and finally to do relevance feedback as before under

this assumption. Following this intuition, the top k documents are used to generate a language model using the HAL model, which is used to re-rank the candidate nodes.


4.6.2 Hyperspace Analogue to Language (HAL) Model

The Hyperspace Analogue to Language model [31] constructs the dependencies of a word w on

other words based on their occurrence in the context of w in a sufficiently large corpus. The

intuition underlying HAL spaces is that when humans encounter a new concept, they derive

its meaning from accumulated experience of the context in which the concept appears. Thus

the meaning of the new concept can be learnt from its usage with other concepts within the same context. Lund and Burgess [31] discuss the use of lexical co-occurrence to construct high-dimensional semantic spaces in which a word can be represented as a point.

The representational model of this space can be constructed automatically from a corpus of

text.

The construction of HAL space can be seen as a vector representation of each word w,

occurring in the vocabulary T, in a high dimensional space spanned by different words in

the vocabulary. This process results in a |T| × |T| HAL matrix, where |T| is the number

of different words in the vocabulary. The HAL matrix is constructed by taking a window

of length K words and moving it across the corpus at one term increments. All words in

the window are said to co-occur with the first word, with strengths inversely proportional

to the distance between them. In our approach we have considered the co-occurrence to be

bidirectional, because in general it is agreed that preserving the word order is not useful for

IR. The weights assigned to each co-occurrence of terms are accumulated over the entire

corpus. That is, if n(w, k, w′) denotes the number of times word w′ occurs at a distance k ≤ K from w when considering a window of length K, and W(k) = K − k + 1 denotes the

strength of this co-occurrence between the two words, then

$$ HAL(w'/w) = \sum_{k=0}^{K} W(k)\, n(w, k, w') \qquad (4.10) $$

The length of the window size will invariably influence the quality of the associations

between a pair of terms. For instance, as the size of the window increases, so does the chance of representing spurious associations between terms. Various window sizes, from 2 to 10, have been used. However, it is unclear what the best window size is; experiments [31] suggest a window of 4 or 8 for the purposes of IR. The original HAL space

is direction sensitive because it records the co-occurrence information for terms preceding

every term. In general, it was found that preserving this term order was not useful for IR

and the combination of the row and column vectors for a term (thus a bidirectional win-

dow) was more effective. For instance, with the sentences "The black cat ..." and "The cat is black.", while the ordering is different, the notion that the cat is a particular color, black, is preserved when taking both directions into account.
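A minimal sketch of building such a bidirectional HAL space and extracting expansion terms is given below. It is an illustration under our own assumptions (distances start at 1 rather than 0, and the top-n cutoff for expansion terms is arbitrary), not the thesis implementation.

```python
from collections import defaultdict

def build_hal(tokens, window=4):
    """Bidirectional HAL space in the spirit of Eq. 4.10: within a sliding window,
    two words co-occur with strength W(k) = window - k + 1, where k is their
    distance; strengths are accumulated symmetrically over the corpus."""
    hal = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for k in range(1, window + 1):
            if i + k >= len(tokens):
                break
            w_prime = tokens[i + k]
            strength = window - k + 1
            hal[w][w_prime] += strength    # count both directions
            hal[w_prime][w] += strength
    return hal

def expansion_terms(hal, query_terms, top_n=10):
    """Words most strongly associated with the query terms in the HAL space;
    these are the terms added to the query before re-ranking."""
    assoc = defaultdict(float)
    for q in query_terms:
        for w, strength in hal.get(q, {}).items():
            assoc[w] += strength
    for q in query_terms:
        assoc.pop(q, None)
    return sorted(assoc, key=assoc.get, reverse=True)[:top_n]
```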

4.6.3 Re-ranked Candidate Nodes :

Using the popular tf-idf weighting for ranking results in the most important candidate nodes being ranked at the top. Though the most important candidate nodes might appear in the top-k results, we are still left with the problem of choosing a single node as the BRN. For this, we re-rank the candidate nodes using a pseudo relevance feedback approach.

We build an HAL matrix over the top-k ranked candidate nodes. From the HAL matrix,

we use all the co-occurring words around our query entity within a window of size four and

expand the query. Experiments for various window sizes for HAL showed that fixing it at

four captures sufficient context. We re-rank the candidate nodes using the expanded query and obtain a re-ranked score for each candidate node. Based on experimental results, we choose to consider the top-5 ranked candidate nodes for building the HAL matrix.

The final score of each candidate node is a weighted linear combination of its rank score

and re-rank score. The final score is given by

$$ Final\;Score = \lambda * Ranking\;Score + (1 - \lambda) * Re\text{-}ranked\;Score \qquad (4.11) $$

λ is the weight for each of the scores. Experimental results show that setting λ to 0.7

gave the best results.
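A small sketch of the score combination in Eq. 4.11 is shown below; it assumes both score dictionaries map the same candidate node ids and are on comparable scales, which is an assumption of this illustration rather than something the text specifies.

```python
def final_scores(ranking_scores, reranked_scores, lam=0.7):
    """Combine the tf-idf ranking score and the HAL re-ranked score per Eq. 4.11,
    with lambda = 0.7 as reported in the text; inputs map node ids to scores."""
    return {
        node_id: lam * ranking_scores[node_id] + (1 - lam) * reranked_scores.get(node_id, 0.0)
        for node_id in ranking_scores
    }
```

The candidate node with the highest combined score is then taken as the BRN.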

We illustrate how pseudo relevance feedback works with an example. For a query

entity "Laguna Beach" from the query set, the correct mapping node is "Laguna Beach,

California”. The query document contains terms like “show, MTV, Jessica, Jason” etc and

this results in the tf-idf ranking function assigning a higher rank to "Laguna Beach: The Real Orange County", an MTV reality show.

After query expansion using HAL, words such as "lifeguards", "coastal", "land" and "geography" are added to the query. This results in a higher re-ranked score for the candidate node "Laguna Beach, California". The final score (a linear combination of the ranking and re-ranking scores) results in "Laguna Beach, California" being returned as the BRN.

4.7 Mapping Node Identification

Using the above five techniques, we obtain a ranked set of candidate nodes from the initially unordered set of candidate nodes. From this ranked list, the candidate node with the highest similarity score to the query document is returned as the BRN. The BRN could be from either the KB or Wikipedia. If BRN ∈ KB, we return it as the mapping node for the query entity, and NIL otherwise.

The output of our system for a query entity is summarized in equation 4.12:

$$ Mapping\ Node = \begin{cases} NIL, & \text{if } |CL| = 0 \\ NIL, & \text{if } |CL| \geq 1 \text{ and } BRN \in \text{Wikipedia} \\ Node\ Id, & \text{if } |CL| \geq 1 \text{ and } BRN \in KB \end{cases} \qquad (4.12) $$
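The decision rule of Eq. 4.12 amounts to a few lines of code; the sketch below is our illustration, assuming the set of KB node ids is known so that a BRN drawn from the appended Wikipedia articles can be recognized.

```python
def predict_mapping(candidate_list, best_ranked_node, kb_node_ids):
    """Eq. 4.12: NIL when no candidates were found or when the best ranked node is
    one of the Wikipedia articles added to the CL; otherwise return the KB node id."""
    if not candidate_list:
        return "NIL"
    if best_ranked_node not in kb_node_ids:   # BRN came from Wikipedia
        return "NIL"
    return best_ranked_node
```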

4.8 Conclusions

In this chapter, we discussed various similarity techniques to rank the unordered list of

candidate nodes. We experimented with cosine similarity, Naïve Bayes, maximum entropy, TF-IDF ranking and pseudo relevance feedback ranking. The node with the highest similarity to the query document was returned as the BRN. The mapping node or NIL was predicted based

on BRN ∈ KB or BRN ∈ Wikipedia. In the next chapter, we discuss the structure of the

data set used to evaluate our algorithm. We also describe the evaluation metric.


Chapter 5

Data Set

In this chapter, we give the background of the Text Analysis Conference (TAC). We then give a brief overview of the data set that is required for evaluating an EL algorithm. We explain in detail the general structure of a Knowledge Base, a Document Collection (the set of query documents) and Query Entities as encoded in XML. We conclude the chapter with an overview of the

evaluation metric.

5.1 Text Analysis Conference

Recently there has been widespread interest in community-wide evaluations for research

in information technologies. The Text Analysis Conference (TAC) is a series of evaluation

workshops organized to encourage research in Natural Language Processing (NLP) and

related applications, by providing a large test collection, common evaluation procedures,

and a forum for organizations to share their results. TAC comprises sets of tasks known

as “tracks”, each of which focuses on a particular sub problem of NLP. TAC tracks focus

on end-user tasks, but also include component evaluations situated within the context of

end-user tasks.

Question answering and information extraction have been studied over the past decade; however, evaluation has generally been limited to isolated targets or small scopes (i.e., single documents). The Knowledge Base Population (KBP) Track at TAC was proposed to

explore extraction of information about entities with reference to an external knowledge

source. Using basic schema for persons, organizations, and locations, nodes in an ontology

must be created and populated using unstructured information found in text. This task has

been broken down into two sub-problems: Entity Linking, where names must be aligned to entities in the KB, and Slot Filling, which involves mining information about entities from text. The EL sub-task was present in both TAC-KBP 2009 (http://www.nist.gov/tac/2009/) and 2010 (http://www.nist.gov/tac/2010/).

Compared to previous information extraction evaluations such as the Message Under-

standing Conference (MUC) and Automatic Content Extraction (ACE), KBP is different in

the following respects:

• Extraction at large scale (e.g. 1 million documents).

• Using a representative collection (not selected for relevance).

• Cross-document entity resolution (extending the limited effort in ACE).

• Linking the facts in text to KB.

• Rapid adaptation to new relations.

We have evaluated the performance of our algorithm against the TAC-KBP 2009 and

2010 EL data sets. In the next section, we explain in detail the data set provided for the TAC-KBP track. We then give a brief overview of the evaluation metrics used to evaluate an EL

system.


5.2 Data set and Evaluation Metrics

For evaluating an EL system, we require a KB which contains nodes/entries having in-

formation about named entities and a set of documents which contain instances of and

information about named entities. We would also require a query which contains a named

entity and the document in which it occurs. We are required to link the named entity in

the query, present in a document, to a node in the KB. The data set provided by TAC-KBP

consists of a KB, a document collection and a query set. First, we give a brief overview of the structure of the KB nodes.

5.2.1 Structure of nodes in Knowledge Base

The KB is a structured database containing nodes, each describing a named entity. The KB provided for the TAC-KBP track is derived from Wikipedia. Each KB entry (also referred to as a node) contains:

• A unique identifier (ID, like “E101”).

• A name string and a title.

• An assigned entity type of Person (PER), Organization (ORG), Geo-political Entity

(GPE) or Unknown (UKN).

• An automatically parsed version of the data from the infobox in the entity’s Wikipedia

article i.e. a set of slot names and values.

• A stripped version of the text from the Wikipedia article.

The title and name are canonical forms derived from Wikipedia. A sample KB node

encoded in XML is shown in Fig. 5.1.

There are a total of 818,741 nodes in the KB. The KB was the same for both the 2009 and 2010 data sets. Table 5.1 shows the breakdown of the number of nodes for each entity type.


Figure 5.1 Knowledge Base Node

Type Count Percentage

Person (PER) 114,523 14.0%

Organization (ORG) 55,813 6.8%

Geo Political Entity (GPE) 116,499 14.2%

Unknown (UKN) 531,907 65%

All 818,741 100%

Table 5.1 Percentage breakdown of entity types in the Knowledge Base.

5.2.2 Structure of documents in Document Collection

The document collection contains a set of documents obtained from various sources like

news wire, newsgroup, conversational telephone speech transcripts etc. These articles con-

tain mentions of, and information about target query entities. They provide context for

disambiguating the query entity. A document in the document collection consists of:

• A unique document id.

• The source from which the document was obtained.

• A headline.

• Disambiguation text, which contains an instance of or information about an entity or event.


The document collection consists of a total of 1,287,292 documents. This collection of

documents formed the document collection for the 2009 TAC-KBP data set. An additional

490,596 blog articles were added to the 2009 document collection to form the 2010 TAC-

KBP document collection. A sample document from the document collection, encoded in XML, is shown in Figure 5.2.

Figure 5.2 Document Collection Document

The number of documents from various sources in the document collection is shown in Table 5.2.

Genre # documents

Broadcast Conversation 17

Broadcast News 665

Conversational Telephone Speech 1

News wire 1,286,609

Blog Articles 490,596

Table 5.2 Number of documents from various sources in the Document Collection.


5.2.3 Structure of an Entity Linking Query

The query set contains a set of queries, where each query consists of a query entity and an

associated document-id from the document collection. This document provides the context

for the query entity. Query entities can occur as multiple queries using different name

variants or in multiple documents. Each query must be processed independently. Since the

documents can come from different sources, various name variations such as acronyms and nicknames could refer to the same query entity. They might also occur in different contexts. A sample query encoded in XML is shown in Figure 5.3.

Figure 5.3 Sample Query from the Query Set.

5.3 Evaluation Metrics

In this section, we give an overview of the standard evaluation metric used to evaluate

an EL algorithm. Micro-Average Score (MAS) is the standard evaluation metric used

for evaluating an EL system. In short, MAS is the precision over all the queries and is

calculated using

\[ \text{Micro Average Score} = \frac{\text{No. of correct responses}}{\text{No. of queries}} \qquad (5.1) \]

For example, Table 5.3 shows the query entities occurring in queries, the correct mapping node from the KB, and the output of a system. The system was able to predict the correct mapping node for 3 out of the 6 queries. Hence the MAS is 3/6 = 0.5.

Query string KB-id System output
Abbott 1 1
Abbott 1 101
Abbott 1 1
Abbott Labs 2 101
Abbott Laboratories 2 nil
Abbott Labs 2 2

Table 5.3 System output for a set of query strings.

Another metric that can be used to evaluate an EL system is the Macro-Average Score. In this metric, precision is calculated for each entity (NIL and non-NIL) and an average is taken across the entities. The main problem with such a metric is that it might be biased

towards the system’s output. It would be unstable with respect to low-mention-count query

entities. The example below explains the calculation of Macro-Average Score.

From Table 5.3, the entity corresponding to the KB node with ID=1 was linked correctly

2 of 3 times for a precision of 0.67. The entity with ID=2 was linked correctly 1 of 3 times for a precision of 0.33. The macro-averaged precision is therefore (0.67 + 0.33)/2 = 0.5.
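Both metrics are easy to compute from (gold KB-id, system output) pairs. The sketch below reproduces the worked example of Table 5.3; it is a minimal illustration in which each gold answer (including NIL, if present) is treated as its own entity for the macro average.

```python
from collections import defaultdict

def micro_average(gold, predicted):
    """Precision over all queries (Equation 5.1)."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

def macro_average(gold, predicted):
    """Average of per-entity precisions (entities identified by gold answer)."""
    per_entity = defaultdict(lambda: [0, 0])       # gold id -> [correct, total]
    for g, p in zip(gold, predicted):
        per_entity[g][1] += 1
        per_entity[g][0] += int(g == p)
    return sum(c / t for c, t in per_entity.values()) / len(per_entity)

# The six queries of Table 5.3.
gold      = ["1", "1", "1", "2", "2", "2"]
predicted = ["1", "101", "1", "101", "nil", "2"]
print(micro_average(gold, predicted))   # 3/6 = 0.5
print(macro_average(gold, predicted))   # (0.67 + 0.33) / 2 = 0.5
```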

In the next chapter, we explain in detail the experiments we have conducted and evaluate

the performance of our system on the TAC-KBP 2009 and 2010 data sets. We also present an error analysis.


Chapter 6

Evaluation

6.1 Evaluation

Thus far, we have discussed the algorithm we developed for linking named entities occurring in a text document to nodes in a KB. We evaluate our algorithm on two standard data sets, viz. the 2009 and 2010 TAC-KBP EL track data. The structure of the data set was described in Chapter 5. First, we give a brief overview of the two query sets and analyze them. Later, we analyze the impact of each feature we have used in building our system and perform an error analysis. Finally, we conclude this chapter with a comparison of our system's performance with that of the top five systems submitted at the 2009 and 2010 TAC-KBP EL shared tasks.

6.2 TAC-KBP 2009 and 2010 Query Set Analysis

In this section, we introduce a few statistics about the 2009 and 2010 TAC-KBP EL track query sets. We compare how the queries are distributed over three categories, i.e. Person, Location and Organization. Table 6.1 shows the total number of query entities present in the 2009 and 2010 query sets respectively and the distribution of query types. The 2010

query set contains 2250 query entity mentions for 403 unique entities. The 2009 query


set contains 3904 entity mentions for 560 unique entities. It is evident from Table 6.1 that the 2010 query set has an even distribution of query entity types, whereas in the 2009 query set most of the queries are of the type Organization. We feel that the 2010 query set provides a better base for conducting our evaluations and experiments. However, we have also conducted experiments on the 2009 query set so as to test the robustness of the system.

Year No. of Queries Unique Entities Person Location Organization NIL
2009 3904 560 627 567 2710 57%
2010 2250 403 751 749 750 54.6%

Table 6.1 Statistics on the 2009 and 2010 query sets.

There is almost the same percentage of queries in both query sets for which NIL should be predicted. In the 2009 query set, 57% of the queries had no entry in the KB, whereas the 2010 query set had 54.6%.

Year No. of Queries #1 #2 #3 #4 #5
2009 3904 43 17 8 14 6
2010 2250 208 80 32 22 12

Table 6.2 Distribution of non-NIL queries: the number of unique KB nodes referred to by 1, 2, 3, 4 and 5 query variations.

Query entities can occur as multiple queries using different name variants or in multiple

documents (providing different contexts). An example of how a query entity can occur in

multiple queries using different name variants and in multiple documents was shown in

Section 5.2.3. A single KB node can be the output (mapping node) for multiple queries.

Table 6.2 shows the number of unique KB nodes mapped from the 2009 and 2010 query sets (query entities). An input query entity finally gets mapped to one node in the KB. As mentioned in Section 5.2.3, an input query entity can take multiple forms (nicknames, aliases, acronyms, etc.) and could also occur in various contexts. In spite of these variations, the entities would ideally be linked to the same KB node. From Table 6.2, it can be seen that the 2010 query set has 80 such KB nodes which were referred to by two query variations.

To understand the complexity of the task, consider Table 6.3, which contains sample queries taken from the TAC-KBP 2009 query set. It shows that there are 15 queries with

“Abbott/Abbot” as the query entity, but they refer to different KB nodes and belong to

different entity types. The same query entity is associated with 15 different documents

showing how varied the context is.

Query string KB-id KB title No. of Queries Unique query documents Entity Type
Abbot E0064214 Bud Abbott 1 1 Person
Abbott E0064214 Bud Abbott 4 4 Person
Abbott E0272065 Abbott Laboratories 9 9 Unknown
Abbott E0003813 Abbot, Texas 1 1 Geo-political entity

Table 6.3 Sample Queries

The following two examples show how varied the context can be.

Context 1: A spokeswoman for Abbott said it does not expect the guidelines to affect

approval of its Xience stent, which is expected in the second quarter.

Context 2: Aside from items offered by the 67-year-old Fonda, the auction included

memorabilia related to Peter Frampton, Elvis Presley and Abbott and Costello.

In context 1 “Abbott” refers to “Abbott Laboratories” whereas in context 2 it refers to

“Bud Abbott”.


6.3 Candidate List Size Analysis

As our EL system can be broken down into two phases, we analyze each phase and its im-

pact on the overall performance of the system. We have used various heuristics to identify

named entity variations. Using the obtained name variants we identify the candidate nodes

from the KB and Wikipedia to form our CL.

Table 6.4 shows the number of query entities with a specific CL size for various experiments (Runs) on the 2010 query set. For example, we obtained only one candidate node in the CLG phase for 908 queries in Run No. 6. Queries requiring little or no disambiguation generally resulted in a CL of size less than 5. For highly ambiguous queries, the CLG phase returns a large number of variations of the query entity, resulting in a very sharp increase in the CL size. The average CL size per query is highest (8.01) when all the heuristics (Run No. 6) are used in the CLG phase and reduces drastically to 0.63 when only redirect pages from Wikipedia are used (Run No. 3). This is because redirect pages result in either a single name variant or none.

Run No. Heuristics Used |CL|=0 |CL|=1 |CL|=2 |CL|=3 |CL|=4 |CL|=5 Average |CL|
1 Disambiguation pages* 1203 538 19 22 35 28 5.99
2 Bold text** 921 740 99 82 55 27 3.49
3 Redirect pages* 972 1256 16 1 0 1 0.63
4 Run Nos. 1+2+3 630 935 84 61 22 30 7.91
5 Run No. 4 + Stanford NER 626 924 99 61 22 30 7.93
6 Run No. 5 + Google Search 535 908 161 90 38 30 8.01

Table 6.4 The above table indicates the number of queries (2010 query set) having a particular candidate list size. (* indicates from Wikipedia and ** indicates from the first paragraph of the Wikipedia article. Google Search includes both Google spell suggestion and Google directive search. The same notation is followed for the rest of this chapter.)


Run No. Heuristics Used |CL|=0 |CL|=1 |CL|=2 |CL|=3 |CL|=4 |CL|=5 Average |CL|
1 Disambiguation pages* 1468 1525 69 81 82 30 5.38
2 Bold text** 1016 1886 246 182 93 122 2.72
3 Redirect pages* 1234 2516 70 54 0 0 0.95
4 Run Nos. 1+2+3 679 1872 242 135 101 145 6.68
5 Run No. 4 + Stanford NER 679 1872 242 135 101 145 6.69
6 Run No. 5 + Google Search 602 1898 267 157 102 121 6.75

Table 6.5 The above table indicates the number of queries (2009 query set) having a particular candidate list size.

Similarly, Table 6.5 shows the number of query entities with a specific CL size for various experiments (Runs) on the 2009 query set. We can clearly see the same trend of the average CL size increasing with the number of heuristics used for identifying the name variants. It can also be seen that when only redirect pages from Wikipedia are used, the average CL size is 0.95. The major difference between the two data sets is the percentage of queries for which the CL size is one for each heuristic. A large percentage (48.6%) of queries in the 2009 query set resulted in a CL of size one, compared to 40.3% in the 2010 query set, for Run No. 6. Using the redirect pages feature from Wikipedia will result in either a single name variant or none. This redirect feature has a very high impact on the performance of our EL system, as shown in Section 6.5.

Another key difference is the impact of Stanford NER and Google search for identify-

ing the name variants. Both data sets show that by using Stanford NER and Google Search, there was only a marginal increase in the CL size. The reason is that generally the name

variant obtained from Stanford NER and Google search might already be present in our

ER. Though both Stanford NER and Google Search result in only a small increase in CL size, the impact of these two heuristics is very high, as discussed in Section 6.5.

6.4 Candidate List Generation Phase Analysis

The failure to list the correct mapping node in the CLG phase will result in failure of the

system irrespective of the ranking algorithm used. We believe that as long as the correct

mapping node is present in the CL, the context of the query entity will help in linking

it correctly. The column “Wrong map” in Table 6.6 indicates the failure to list the correct

candidate node in the CL even though the mapping node exists in the KB. The probability of

identifying the correct mapping node in the CL increases as we add more heuristics to identify

named entity variations.

Run No. Heuristics Used Wrong Map - 2009 Wrong Map - 2010
1 Disambiguation pages* 555 468
2 Bold text** 525 458
3 Redirect pages* 609 422
4 Run Nos. 1+2+3 279 212
5 Run No. 4 + Stanford NER 266 195
6 Run No. 5 + Google Search 241 117

Table 6.6 The above table indicates the failure to list the correct candidate node in the Candidate List even though the mapping node exists in the Knowledge Base.

The Google Search heuristic had more impact on the 2010 query set than on the 2009 query set for identifying name variants not present in our ER. These name variants in turn resulted in the correct mapping node being picked into the CL. This is evident from the reduction of wrong maps from 195 to 117 for Run No. 6, i.e. for only 117 queries out of the 2250 queries we could not pick the correct candidate node into the CL.

6.5 Entity Linking System Performance

In this section we evaluate the performance of our EL system. We use the Micro-Average Score (MAS), the standard metric proposed by the TAC-KBP EL track, for evaluating system performance.

Table 6.7 gives a brief overview of the number of participants for the EL task at the 2009 and 2010 TAC-KBP tracks. There was a slight increase in the number of participants for the 2010 EL task when compared to 2009. Each participating team is entitled to submit a maximum of three runs. The TAC-KBP organizers evaluate each run against the gold standard data and report each team's performance. The baseline score, which is obtained by predicting NIL for all the query entities, is 57% and 54.6% for the 2009 and 2010 EL query sets respectively. The average Micro-Average Score obtained over the 35 runs submitted at TAC-KBP for the 2009 EL task is 71.08%, and 68.36% for 2010.

Year No. of participating teams Total runs submitted Baseline Best Average MAS
2009 13 35 57% 82.17% 71.08%
2010 16 46 54.6% 86.80% 68.36%

Table 6.7 Average Micro-Average Score and baseline scores obtained by various participating universities/teams for the TAC-KBP Entity Linking task on the 2009 and 2010 query sets.

Table 6.9 and Table 6.8 show the MAS obtained by our EL system on the 2009 and 2010 EL query sets respectively. Our best system achieved an MAS of 84.76% on the 2010 query set and 83.12% on the 2009 query set. Our system performs close to current state-of-the-art algorithms on EL. In fact, our system outperforms all the systems submitted at the TAC-KBP 2009 EL task and is only marginally behind the best system submitted at TAC-KBP 2010.

This shows the robustness of our algorithm; the performance is also very high when compared to the baseline score or the average MAS over all the runs submitted at the 2009 and 2010 TAC-KBP EL tasks.

It is evident from Table 6.9 and Table 6.8 that pseudo relevance feedback for re-ranking has performed very well. This shows that using co-occurrence statistics of a named entity with other words helps in disambiguation and efficient ranking of candidate nodes. Re-ranking has worked significantly well with all the CLG heuristics except in the case of redirects, where the increase in performance is comparatively low. Cosine similarity and Naïve Bayes performed almost equally using bag of words as the feature. This shows that a simple bag of words approach is sufficient to build a fairly well performing EL system. This simple approach outperforms the baseline (54.6%) and the average (68.36%) across all the 46 runs submitted for the EL task at 2010 TAC-KBP, as well as for 2009. Maximum entropy didn't fare well as the data available for training the model was not sufficient, i.e. certain candidate nodes had sufficient text to describe an entity whereas others didn't; hence, maximum entropy couldn't perform well.

Run No. Heuristics Used Maxent Cosine Sim Naïve Bayes tf-idf Ranking Re-ranking
1 Disambiguation pages* 67.96 71.02 71.29 71.96 72.49
2 Bold text** 69.96 73.07 73.87 74.71 75.16
3 Redirect pages* 74.36 78.40 78.53 78.44 78.53
4 Run Nos. 1+2+3 75.69 79.73 79.82 81.02 81.38
5 Run No. 4 + Stanford NER 76.27 80.40 80.53 81.56 82.00
6 Run No. 5 + Google Search 77.11 81.51 81.59 82.89 84.76

Table 6.8 Micro-Average Score for individual heuristics on the 2010 query set. Google Search includes both Google spell suggestion and Google directive search.


Run No. Heuristics Used Maxent Cosine Sim Naïve Bayes tf-idf Ranking Re-ranking
1 Disambiguation pages* 74.03 76.36 76.54 76.95 77.09
2 Bold text** 74.85 77.36 77.48 78.41 78.76
3 Redirect pages* 77.66 80.43 80.56 80.58 80.78
4 Run Nos. 1+2+3 78.64 81.25 81.32 81.92 82.02
5 Run No. 4 + Stanford NER 78.76 81.58 81.66 82.12 82.79
6 Run No. 5 + Google Search 79.02 81.81 81.86 82.69 83.12

Table 6.9 Micro-Average Score for individual heuristics on the 2009 query set. Google Search includes both Google spell suggestion and Google directive search.

6.6 Precision Vs Top “N” results

In this section, we plot Precision vs Top “N” results for non-NIL queries for the five techniques. Figure 6.1 and Figure 6.2 show the plots for the 2010 and 2009 TAC-KBP EL query sets respectively. It can be seen clearly that as we consider a higher number of hits, the probability of finding the correct map for the query entity in the hits list increases.

Figure 6.1 Precision Vs Top “N” results for Non-Nil Queries from the 2010 TAC-KBP Entity Linking Query Set.

Figure 6.2 Precision Vs Top “N” results for Non-Nil Queries from the 2009 TAC-KBP Entity Linking Query Set.

From both figures it is evident that the tf-idf technique ranks the mapping node higher (in the ranked list) when compared to the others. Further, simple techniques like cosine similarity and Naïve Bayes perform consistently better than maximum entropy, which shows that word occurrence statistics are sufficient for building a decently performing EL system. This is also reflected in the results presented in Section 6.5.

The pseudo relevance feedback re-ranking strategy results in picking the mapping node as the BRN. Re-ranking will work only as long as the mapping node is present in the top 5 ranked nodes, because we consider only the top 5 ranked nodes for query expansion. If the mapping node is present in the top 5 ranked nodes, there is a very good probability that it will be the BRN after re-ranking. There can be at best only one mapping node in these top 5 ranked nodes. If the mapping node isn’t present in the top 5 ranked nodes, query expansion using pseudo relevance feedback will result in the addition of irrelevant tokens and hence in performance degradation, which is evident from Figures 6.1 and 6.2.

6.7 NIL Prediction Accuracy

In total there are 1230 (54.6%) queries in the 2010 TAC-KBP EL query set and 2229 (57%) queries in the 2009 TAC-KBP EL query set for which there is no mapping node in the KB. Table 6.10 and Table 6.11 show the number of queries for which NIL was predicted when |CL| = 0 and when |CL| ≥ 1, for the various approaches. The tables also show the correct NIL prediction count and accuracy. Since a query entity can occur in any of its variations, it is very important to search the KB with all possible variations. Therefore the approach which extracts the major variations is likely to have better NIL accuracy. Experimental results for Run No. 6 on the 2009 and 2010 EL query sets support this intuition.

Run No. |CL|=0 (Predicted / Correct predictions / Accuracy) |CL| ≥ 1 and BRN ∈ Wikipedia (Predicted / Correct predictions / Accuracy)
1 417 / 304 / 72.9% 1203 / 859 / 71.4%
2 638 / 499 / 78.2% 921 / 665 / 72.2%
3 577 / 424 / 73.4% 972 / 747 / 76.8%
4 704 / 575 / 81.7% 630 / 553 / 87.7%
5 694 / 574 / 82.7% 626 / 553 / 88.3%
6 654 / 565 / 86.4% 535 / 513 / 95.8%

Table 6.10 Statistics of NIL predictions and their accuracy for the 2010 query set.


Run No. |CL|=0 (Predicted / Correct predictions / Accuracy) |CL| ≥ 1 and BRN ∈ Wikipedia (Predicted / Correct predictions / Accuracy)
1 1005 / 855 / 85.07% 1468 / 1112 / 75.74%
2 1388 / 1203 / 86.67% 1016 / 759 / 74.70%
3 1403 / 1135 / 80.89% 1234 / 959 / 77.7%
4 1468 / 1338 / 91.14% 679 / 572 / 84.24%
5 1468 / 1338 / 91.14% 679 / 572 / 84.24%
6 1410 / 1282 / 90.92% 602 / 532 / 88.37%

Table 6.11 Statistics of NIL predictions and their accuracy for the 2009 query set.

6.8 Comparison with Top 5 systems at TAC-KBP

We compare the MAS of our best system with the top 5 runs submitted at the 2009 and 2010 TAC-KBP EL tasks [36] [24]. Siel is the team name under which we participated. Our system performed the best at the 2009 TAC-KBP EL task and was the runner-up at the 2010 TAC-KBP EL task. Table 6.12 and Table 6.13 compare the performance of our system against the best ranked systems developed by other teams. Some of the participating teams are IBM Research Labs (http://www.watson.ibm.com/index.shtml), Johns Hopkins University (http://www.jhu.edu/), Stanford University (http://www.stanford.edu/), etc.

Our system obtained an MAS of 83.73% and 82.17% at the TAC-KBP 2010 and 2009 EL shared tasks respectively. After post-analysis and improving the algorithm, we obtained an MAS of 84.76% and 83.12% for the 2010 and 2009 EL data sets respectively.


Team Micro-Average Score

LCC 86.80%

Siel 83.73%

CMCRC 81.9%

hltcoe 81.47%

Stanford UBC 80.00%

Table 6.12 Performance comparison with the top 5 systems at the TAC-KBP 2010 Entity Linking sub-task.

Team Micro-Average Score

Siel 82.17%

QUANTA1 80.33%

hltcoe1 79.84%

Stanford UBC2 78.84%

NLPR KBP1 76.72%

Table 6.13 Performance comparison with the top 5 systems at the TAC-KBP 2009 Entity Linking sub-task.

6.9 Error Analysis

In this section, we give a few example queries for which our EL System has failed. Our

system can fail either in the CLG phase or the ranking phase. In the CLG phase, our system

failed for queries like “Air Group Inc., Marufu, LULAC”, etc. The correct mapping nodes are “Midwest-airlines, Grace Mugabe, Texas’s 21st Congressional District” respectively. This is because our heuristics in the CLG phase couldn’t identify the latter as variations of the query entities. As these variations could not be identified, we could not pick those nodes

from the KB into the CL.

In the ranking phase, a query entity might be wrongly mapped to a KB node, as the document context in which the query entity occurs might not be sufficient for disambiguating it. For the four techniques, i.e. cosine similarity, maximum entropy, Naïve Bayes and tf-idf ranking, once the wrong node is mapped we can’t correct it. But in the case of the pseudo relevance feedback re-ranking strategy, we make use of the ranked results to expand the query for re-ranking. Here we found that certain generic and ambiguous query entities which were wrongly mapped during the ranking phase were correctly mapped after

re-ranking. For example, generic and ambiguous queries like “Cleveland, George Bush,

UC” were correctly mapped to “Cleveland, Ohio, George W. Bush, University of Cincin-

nati” respectively, when the contextual information from HAL was used for re-ranking.

(They were wrongly mapped to “Grover Cleveland, George H. W. Bush, Xavier University

(Cincinnati)” respectively when only ranking was done to predict the mapping node).

Our manual examination of the 2010 TAC-KBP EL gold standard data showed that 5 queries had been wrongly mapped. We raised these issues with the TAC organizing committee and our suggestions were deemed correct. For example, for the query entity “Jeff Fiser” the gold standard result was “2006 Tennessee Titans season”, whereas the correct answer is “NIL”. Jeff Fisher was the head coach for the “2006 Tennessee Titans season”, but linking them is wrong. The other errors were along similar lines. On incorporating these changes into the gold standard data, our best system score of 84.76% would become 84.98%.

6.10 Conclusions

In this chapter, we did a detailed comparison of the TAC-KBP EL query sets for the years 2009 and 2010. Later, we discussed in detail the impact of each feature we used during the CLG phase. Further, we described the performance of our algorithm on the TAC-KBP EL data sets. We also compared the performance of our algorithm against the top 5 participants at the TAC-KBP EL shared tasks. In the next chapter, we state the contributions of this thesis and conclude with a real-world application of an EL system.


Chapter 7

Conclusion

Structured KBs are a rich source of data for various NLP, IE and IR tasks. Recently, with

the emergence of publicly available databases, they have been exploited for a number of IE

tasks ranging from NER to relation extraction systems. But KBs face quite a few problems, such as inconsistency in the information present, incompleteness, inaccuracy of facts and the presence of outdated information. These problems arise from the fact that KBs are maintained manually. In this thesis, we addressed the problem of linking named entities from a document to nodes in a KB, a key component for the automatic updating of KBs. In the last decade, many techniques were proposed to extract structured information from unstructured documents, but they never focused on integrating this extracted information into globally available KBs. This motivated us to work on methodologies that can be used to

link entities in textual documents to KB nodes. We believe that research on EL will help

reduce the manual effort put in by contributors across the world in keeping the information

up to date in public KBs. This new area of research moves beyond the problems of NER,

CR and CDCR. EL breaks the document barrier and helps in automating the task of updat-

ing KBs. It opens up a range of applications from information aggregation to automated

reasoning over extracted information.

We showed that the process of creating and updating KBs can be automated. The

process of automating this task can be broken down into two sub-problems:


• Entity Linking

• Slot Filling

In this thesis, we addressed the problem of EL. We discussed in detail current approaches to EL and their shortcomings. Most of the current approaches are either too rigid, cannot scale to large KBs or require huge amounts of training data. We discussed various challenges involved in EL such as mention ambiguity, variations in named entities (viz. acronyms, nicknames, spelling variations) and NIL detection. We proposed a robust solution which

addresses the above issues and scales to large KBs with millions of entries.

Our proposed technique uses Wikipedia syntax to find variants of various named entities. Wikipedia-specific features like redirect pages, disambiguation pages and bold text from the first paragraph were used to identify synonyms, homonyms, etc. Google spell suggestion and Google site-specific search were also used to obtain name variants from the web. Additionally, an NER was used to find name variants of the query entity from the given query document context. Using the variations obtained, a Boolean “AND” search was done on the KB node titles. A subset of nodes, referred to as candidate nodes (Candidate List, CL), that can be linked to the query entity was obtained from the KB. Similarly, nodes from Wikipedia were also added to the CL. Adding Wikipedia articles to the CL allows us to consider strong matches against query entities that do not have any corresponding node in the KB, and hence we can return NIL. That is, for a given query entity, if the ranking function maps to a Wikipedia article from the CL, we can confirm the absence of a node in the KB about the query entity. The identification of these candidate nodes from the KB and Wikipedia was referred to as the Candidate List Generation (CLG) phase.
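A minimal sketch of the Boolean “AND” lookup over node titles is given below. It treats each title as a token set and keeps a node as a candidate if every token of some name variant occurs in the title; the data layout and function name are illustrative assumptions, not our exact indexing code.

```python
def candidate_list(name_variants, nodes):
    """Boolean "AND" search over node titles (Candidate List Generation sketch).

    name_variants: name variants of the query entity,
                   e.g. ["Abbott Labs", "Abbott Laboratories"].
    nodes:         iterable of (node_id, title, source) triples, where
                   source is "KB" or "Wikipedia".
    Returns the candidate list (CL).
    """
    cl = []
    for node_id, title, source in nodes:
        title_tokens = set(title.lower().split())
        for variant in name_variants:
            # Every token of the variant must appear in the node title.
            if all(tok in title_tokens for tok in variant.lower().split()):
                cl.append((node_id, title, source))
                break
    return cl
```

For the variant “Abbott Laboratories”, for example, any node whose title contains both tokens would be retained in the CL.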

Once the list of candidate nodes was obtained, the candidate nodes and the query document were tokenized and represented as token vectors. Using these vectors, a similarity score was calculated between the query document and each candidate node, using five techniques: cosine similarity, Naïve Bayes, maximum entropy, tf-idf and pseudo relevance feedback for re-ranking. The candidate node with the highest similarity score was returned as the Best Ranked Node (BRN). If BRN ∈ KB, we return it as the map for the query entity; otherwise we return NIL.
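As an illustration of the simplest of the five ranking techniques, the sketch below computes a bag-of-words cosine similarity between the query document and each candidate node and returns the Best Ranked Node; tokenization, stop word removal and stemming are reduced to plain whitespace splitting here to keep the example short.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_ranked_node(query_document, candidates):
    """Return (node_id, source, score) of the candidate most similar to the query document.

    candidates: iterable of (node_id, node_text, source) triples, where
                source is "KB" or "Wikipedia".
    """
    q = Counter(query_document.lower().split())
    scored = [(node_id, source, cosine(q, Counter(text.lower().split())))
              for node_id, text, source in candidates]
    return max(scored, key=lambda x: x[2])
```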

Our algorithm was evaluated on standard data sets obtained from the TAC-KBP EL shared task. Evaluation was done against the TAC-KBP 2009 and 2010 EL data sets. The Micro-Average Score (MAS) was used to evaluate our algorithm’s performance. We obtained a very impressive MAS of 83% and 85% on the 2009 and 2010 EL data sets respectively. Our results in Chapter 6 show that simple techniques like cosine similarity, Naïve Bayes, etc. perform close to the state of the art. Pseudo relevance feedback performed close to state-of-the-art algorithms and performed the best on the 2009 EL data set.

In this chapter, we discuss the contributions of this thesis and possible future directions.

We conclude with a discussion of possible real-world applications of an EL system.

7.1 Contributions

Most of the research community has focused on extracting structured information from unstructured documents. But using this extracted information to update KBs has received very little focus. In this thesis, we attempted to fill this gap by trying to link entities occurring in textual documents to nodes in a large KB. Once entities are linked to nodes in a KB, the document barrier is broken and information can be integrated across documents. We approached the problem of EL as a two-stage problem. Throughout, we focused on developing algorithms which are robust and can scale to large KBs. We experimented with various similarity techniques and showed that simple approaches can perform close to state-of-the-art algorithms and sometimes better. This was the major contribution of the thesis. Some of the other contributions are:

• Identifying Named Entity Variations: We proposed three different methodologies to identify named entity variations. We used Wikipedia-specific syntax, i.e. redirect pages, disambiguation pages and bold text from the first paragraph, for identifying synonyms, homonyms, nicknames, alias names, etc. Additionally, web search results and an NER were also used to identify the various forms in which a named entity could occur. The web search results and NER features generate very few variations of an entity, but their prediction accuracy is very high, which shows that they are highly important features. We used the Google spell suggestion feature and Google site-specific search for identifying spelling errors and entity variations as well. We used the obtained variations to identify candidate nodes from the KB.

• Robust Candidate Nodes Generation: Our system is flexible enough to find name variants but sufficiently restrictive to produce a manageable candidate list despite a large-scale KB. We used Boolean “AND” search to identify candidate nodes from the KB and Wikipedia. Table 6.6 shows that our system was able to identify the mapping node in the CL for a high percentage of queries. We firmly believe that as long as the correct mapping node is present in the CL, the likelihood of it being returned as the mapping node is very high, which is reflected in our results. Furthermore, our system can scale to large KBs with millions of entries.

• Features for Entity Disambiguation and Ranking: We developed a rich and extensible set of features based on the query entity mention, the query document, and the KB nodes. We used tokenization, stop word removal and stemming to represent the documents as vectors. This basic feature set had a high impact on the final performance of the system because of the cleaner representation of the documents. We also experimented with various similarity techniques to rank the candidate nodes. To the best of our knowledge, no other work has experimented with so many different similarity techniques. This is one of the major contributions of this thesis. We showed that simple techniques like cosine similarity, Naïve Bayes, tf-idf ranking, etc. perform close to state-of-the-art approaches, without any training data.

• NIL Detection: We proposed a technique of augmenting the given KB with Wikipedia documents in order to identify NIL-mapping entities, which obviates hand tuning. From Table 6.10 and Table 6.11, it is clearly evident that this is a very useful feature. Unlike other current approaches, this technique obviates the need to fix a threshold for predicting NIL.

Our experiments were conducted on standard data sets for EL, provided by TAC-KBP. We evaluated our approach on both the TAC-KBP 2009 and 2010 EL data sets. The data set consisted of a KB, a DC and a query set. The DC consisted of news and blog articles, providing real-world documents and contexts to test our approach. Evaluation of our methodology conformed to the standard evaluation metric for the EL task: MAS was used to evaluate our approach.

Results of our experiments were reported in Chapter 6. Our algorithm achieved good

accuracy values while linking named entities to nodes in a large KB. Our results were on par with or better than the state-of-the-art approaches and systems developed as part of the TAC-KBP shared task. Our system was ranked first and second in the TAC-KBP EL shared tasks in 2009 and 2010 respectively.

7.2 Future Directions

Our approach can be considered a building block for future research on EL. In this thesis, we have explored simple techniques like cosine similarity, Naïve Bayes, tf-idf ranking, etc. and showed that they perform close to the state of the art and sometimes better. We feel that now, with training data available from TAC-KBP, machine learning techniques can be explored. Another area that can be looked into is refining the document context to capture only terms that describe the query entity. This would result in better ranking of the candidate nodes and hence higher accuracy.

We firmly believe that as long as the correct candidate node is present in the CL, there is a very high likelihood of it being identified as the mapping node. The current research community has been focusing more on the ranking algorithms, as it is an interesting field. However, the performance of an EL system is highly dependent on the CL identification phase: the higher the accuracy with which candidate nodes are identified, the higher the probability of identifying the mapping node. It is worthwhile to consider candidate generation strategies carefully.

Also, in our current approach we have exploited Wikipedia, an NER and Web search

results for identifying name variants. This is another area of research, where we need to separate this module and make it independent of any particular resource.

NIL node clustering is an area where the focus of the research community is needed. Current EL systems either predict a mapping node or NIL if no mapping node is present in the KB. If NIL node clustering is done, the data for a single named entity could be integrated and hence used to create new nodes in the KB.

A cross-lingual EL system would be a very promising area of research, because work in this area will help in building KBs for local languages. With the growth of new websites and blogs in the local languages of different regions, this would give less resourced languages the opportunity to have a KB of their own, which in due time will help grow the number of users of local languages.

7.3 Application of Entity Linking

We discuss some real-world applications of EL.

Metadata Integration: A possible application could be the importing of metadata from KBs by linking the named entities in a document. On successful linkage, metadata from the KBs could be imported into the document, which otherwise might not be explicitly present in the document. The imported metadata can also contain property-value pairs like Age:35, Name:Sachin, etc., and complex queries like those of SPARQL [51] can be fired. The information flow of such a system is shown in Figure 7.1. From Figure 7.2, we can see how a document would look when information about the entities “Assange, WikiLeaks, Elmers” is imported from a KB. With this integration of information into a document, the search space for a user is increased, because this document is retrieved even for keyword searches like “whistle blowers, swiss people, online archives”, etc., as this metadata is imported from the KB.

Figure 7.1 An application of Entity Linking flow chart.

Figure 7.2 Possible application of Entity Linking.
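As a hedged illustration of how such imported property-value pairs could be queried, the sketch below uses the rdflib library (one possible choice, not one used in this thesis) to attach two illustrative properties to a linked entity and fire a small SPARQL query over them; the namespace, property names and values are made up for the example.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/kb/")       # illustrative namespace

g = Graph()
entity = EX["Sachin"]                          # entity linked in the document
g.add((entity, EX.name, Literal("Sachin")))    # imported property-value pairs
g.add((entity, EX.age, Literal(35)))

# A complex query over the imported metadata, expressed in SPARQL.
results = g.query(
    """
    SELECT ?who ?age WHERE {
        ?who <http://example.org/kb/age> ?age .
        FILTER(?age > 30)
    }
    """
)
for row in results:
    print(row.who, row.age)
```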

Financial Domain: In the finance domain, EL can be used to identify company names in a textual document and link them to a KB of tradable company names listed on the stock markets. This can be used to aggregate company information into the document with respect to stock market codes, analysis of the relationship between news and share prices, etc.

Search Feature Enhancement: Current-day search engines return documents relevant to the query posted by a user. The search engine results in general are a list of ranked documents, where each result typically contains a title, a URL, a snippet, etc. We can use an EL system to link the named entities present in the snippets/titles to a publicly available KB. By doing this, we can import information from the KB into the search results, enhancing the user experience and also providing the user with structured information.


Bibliography

[1] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos, and P. Stamatopoulos. Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach. Arxiv preprint cs/0009009, 2000.

[2] A. Bagga and B. Baldwin. Entity-based cross-document coreferencing using the vec-tor space model. In Proceedings of the 17th international conference on Computa-tional linguistics-Volume 1, pages 79–85. Association for Computational Linguistics,1998.

[3] M. Banko, O. Etzioni, and T. Center. The tradeoffs between open and traditionalrelation extraction. Proceedings of ACL-08: HLT, pages 28–36, 2008.

[4] R. Barzilay and M. Elhadad. Using lexical chains for text summarization. In Proceed-ings of the ACL Workshop on Intelligent Scalable Text Summarization, volume 17.Madrid, spain, 1997.

[5] R. Barzilay, K.R. McKeown, and M. Elhadad. Information fusion in the context ofmulti-document summarization. In Proceedings of the 37th annual meeting of theAssociation for Computational Linguistics on Computational Linguistics, pages 550–557. Association for Computational Linguistics, 1999.

[6] R. Bunescu and R. Mooney. Subsequence kernels for relation extraction. Advancesin Neural Information Processing Systems, 18:171, 2006.

[7] R. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disam-biguation. In Proceedings of EACL, volume 6, 2006.

[8] R.C. Bunescu and R.J. Mooney. A shortest path dependency kernel for relation ex-traction. In Proceedings of the conference on Human Language Technology and Em-pirical Methods in Natural Language Processing, pages 724–731. Association forComputational Linguistics, 2005.

[9] Z. Cao, T. Qin, T.Y. Liu, M.F. Tsai, and H. Li. Learning to rank: from pairwiseapproach to listwise approach. In Proceedings of the 24th international conferenceon Machine learning, pages 129–136. ACM, 2007.


[10] S.F. Chen and R. Rosenfeld. A Gaussian prior for smoothing maximum entropy models. Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1999.

[11] H.L. Chieu and H.T. Ng. Named entity recognition: a maximum entropy approachusing global information. In Proceedings of the 19th international conference onComputational linguistics-Volume 1, pages 1–7. Association for Computational Lin-guistics, 2002.

[12] S.P. Converse. Resolving pronominal references in Chinese with the Hobbs algorithm.In Proceedings of the 4th SIGHAN workshop on Chinese language processing, pages116–122, 2005.

[13] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. InProceedings of EMNLP-CoNLL, volume 2007, pages 708–716, 2007.

[14] A. Culotta and J. Sorensen. Dependency tree kernels for relation extraction. In Pro-ceedings of the 42nd Annual Meeting on Association for Computational Linguistics,pages 423–es. Association for Computational Linguistics, 2004.

[15] G. Escudero, L. Marquez, and G. Rigau. Naive Bayes and exemplar-based approachesto word sense disambiguation revisited. Arxiv preprint cs/0007011, 2000.

[16] O. Etzioni, M. Cafarella, D. Downey, A.M. Popescu, T. Shaked, S. Soderland, D.S.Weld, and A. Yates. Unsupervised named-entity extraction from the web: An experi-mental study. Artificial Intelligence, 165(1):91–134, 2005.

[17] R. Florian, A. Ittycheriah, H. Jing, and T. Zhang. Named entity recognition throughclassifier combination. In Proceedings of the seventh conference on Natural languagelearning at HLT-NAACL 2003-Volume 4, pages 168–171. Association for Computa-tional Linguistics, 2003.

[18] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Con-ference on Artificial Intelligence, pages 6–12, 2007.

[19] E. Gabrilovich and S. Markovitch. Wikipedia-based semantic interpretation for natu-ral language processing. Journal of Artificial Intelligence Research, 34(1):443–498,2009.

[20] C. Giuliano, A. Lavelli, and L. Romano. Exploiting shallow linguistic informationfor relation extraction from biomedical literature. In Proceedings of the EleventhConference of the European Chapter of the Association for Computational Linguistics(EACL-2006), pages 5–7, 2006.

[21] X. Han and J. Zhao. NLPR KBP in TAC 2009 KBP Track: A Two-Stage Method to Entity Linking. In Proceedings of Test Analysis Conference 2009 (TAC 09).


[22] E. Hovy and C.Y. Lin. Automated text summarization in SUMMARIST. Advancesin Automatic Text Summarization, 94, 1999.

[23] A. Iftene and A. Balahur-Dobrescu. Named entity relation mining using Wikipedia. In Proceedings of the Sixth International Language Resources and Evaluation (LREC’08), 2008.

[24] H. Ji, R. Grishman, H.T. Dang, and K. Griffitt. Overview of the TAC 2010 Knowledge Base Population Track [DRAFT].

[25] J. Kazama and K. Torisawa. Exploiting Wikipedia as external knowledge for named entity recognition. In Proc. EMNLP-CoNLL, pages 698–707, 2007.

[26] S.B. Kim, K.S. Han, H.C. Rim, and S.H. Myaeng. Some effective techniques for naivebayes text classification. IEEE Transactions on Knowledge and Data Engineering,pages 1457–1466, 2006.

[27] M. Knights. Web 2.0. Communications Engineer, 5(1):30–35, 2007.

[28] F. Li, Z. Zhang, F. Bu, Y. Tang, X. Zhu, and M. Huang. THU QUANTA at TAC 2009KBP and RTE Track. In Text Analysis Conference (TAC), 2009.

[29] C.Y. Lin and E. Hovy. From single to multi-document summarization: A prototypesystem and its evaluation. In Proceedings of the 40th Annual Meeting on Associa-tion for Computational Linguistics, pages 457–464. Association for ComputationalLinguistics, 2002.

[30] V. Lopez, M. Pasin, and E. Motta. Aqualog: An ontology-portable question answeringsystem for the semantic web. The Semantic Web: Research and Applications, pages546–562, 2005.

[31] K. Lund and C. Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2):203–208, 1996.

[32] I. Mani and E. Bloedorn. Multi-document summarization by graph search and match-ing. Arxiv preprint cmp-lg/9712004, 1997.

[33] I. Mani and M.T. Maybury. Advances in automatic text summarization. the MITPress, 1999.

[34] G.S. Mann and D. Yarowsky. Unsupervised personal name disambiguation. In Pro-ceedings of the seventh conference on Natural language learning at HLT-NAACL2003-Volume 4, pages 33–40. Association for Computational Linguistics, 2003.

[35] T. McArthur. Worlds of reference: lexicography, learning and language from the claytablet to the computer. 1986.


[36] P. McNamee and H.T. Dang. Overview of the TAC 2009 knowledge base population track. In Text Analysis Conference (TAC), 2009.

[37] P. McNamee, M. Dredze, A. Gerber, N. Garera, T. Finin, J. Mayfield, C. Piatko,D. Rao, D. Yarowsky, and M. Dreyer. HLTCOE approaches to knowledge base pop-ulation at TAC 2009. In Text Analysis Conference (TAC), 2009.

[38] P. Melville, W. Gryc, and R.D. Lawrence. Sentiment analysis of blogs by combininglexical knowledge with text classification. In Proceedings of the 15th ACM SIGKDDinternational conference on Knowledge discovery and data mining, pages 1275–1284.ACM, 2009.

[39] R. Mihalcea. Using wikipedia for automatic word sense disambiguation. In Proceed-ings of NAACL HLT, volume 2007, 2007.

[40] A. Mikheev, M. Moens, and C. Grover. Named entity recognition without gazetteers.In Proceedings of the ninth conference on European chapter of the Association forComputational Linguistics, pages 1–8. Association for Computational Linguistics,1999.

[41] D. Milne, O. Medelyan, and I.H. Witten. Mining domain-specific thesauri fromwikipedia: A case study. In Proceedings of the 2006 IEEE/WIC/ACM InternationalConference on Web Intelligence, pages 442–448. IEEE Computer Society, 2006.

[42] D. Milne and I.H. Witten. Learning to link with wikipedia. In Proceeding of the 17thACM conference on Information and knowledge management, pages 509–518. ACM,2008.

[43] D.N. Milne, I.H. Witten, and D.M. Nichols. A knowledge-based search engine pow-ered by wikipedia. In Proceedings of the sixteenth ACM conference on Conferenceon information and knowledge management, pages 445–454. ACM, 2007.

[44] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Girju, R. Goodrum, andV. Rus. The structure and performance of an open-domain question answering sys-tem. In Proceedings of the 38th Annual Meeting on Association for ComputationalLinguistics, pages 563–570. Association for Computational Linguistics, 2000.

[45] K. Nakayama, T. Hara, and S. Nishio. Wikipedia mining for an association webthesaurus construction. Web Information Systems Engineering–WISE 2007, pages322–334, 2007.

[46] D.P.T. Nguyen, Y. Matsuo, and M. Ishizuka. Relation extraction from wikipedia usingsubtree mining. In Proceedings of the National Conference on Artificial Intelligence,volume 22, page 1414. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MITPress; 1999, 2007.


[47] F. Ortega, J.M. Gonzalez-Barahona, and G. Robles. On the inequality of contributionsto Wikipedia. In Hawaii International Conference on System Sciences, Proceedingsof the 41st Annual, page 304. IEEE, 2008.

[48] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classification usingmachine learning techniques. In Proceedings of the ACL-02 conference on Empiricalmethods in natural language processing-Volume 10, pages 79–86. Association forComputational Linguistics, 2002.

[49] T. Pedersen. A simple approach to building ensembles of Naive Bayesian classifiersfor word sense disambiguation. In Proceedings of the 1st North American chapterof the Association for Computational Linguistics conference, pages 63–69. MorganKaufmann Publishers Inc., 2000.

[50] W.J. Plath. REQUEST: a natural language question-answering system. IBM Journalof Research and Development, 20(4):326–335, 1976.

[51] E. Prud’hommeaux, A. Seaborne, et al. SPARQL query language for RDF. W3C working draft, 2006.

[52] A. Ratnaparkhi et al. A maximum entropy model for part-of-speech tagging. InProceedings of the conference on empirical methods in natural language processing,volume 1, pages 133–142, 1996.

[53] A. Ratnaparkhi, J. Reynar, and S. Roukos. A maximum entropy model for prepo-sitional phrase attachment. In Proceedings of the workshop on Human LanguageTechnology, pages 250–255. Association for Computational Linguistics, 1994.

[54] D. Ravichandran and E. Hovy. Learning surface text patterns for a question answeringsystem. In Proceedings of the 40th Annual Meeting on Association for ComputationalLinguistics, pages 41–47. Association for Computational Linguistics, 2002.

[55] M. Remy. Wikipedia: The free encyclopedia. Reference Reviews, 16(6):5, 2002.

[56] J.D.M. Rennie. Improving multi-class text classification with naive Bayes. PhD thesis,Citeseer, 2001.

[57] A.E. Richman and P. Schone. Mining wiki resources for multilingual named entityrecognition. In Proceedings of the 46th Annual Meeting of the Association for Com-putational Linguistics: Human Language Technologies, pages 1–9. Citeseer, 2008.

[58] R. Rosenfeld. Adaptive statistical language modeling: a maximum entropy approach.PhD thesis, Citeseer, 2005.

[59] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval* 1.Information processing & management, 24(5):513–523, 1988.


[60] G. Salton, A. Wong, and C.S. Yang. A vector space model for information retrieval.Journal of the American Society for information Science, 18(11):613–620, 1975.

[61] K.M. Schneider. A comparison of event models for Naive Bayes anti-spam e-mail fil-tering. In Proceedings of the tenth conference on European chapter of the Associationfor Computational Linguistics-Volume 1, pages 307–314. Association for Computa-tional Linguistics, 2003.

[62] S. Li, S. Gao, Z. Zhang, X. Li, J. Guan, W. Xu, and J. Guo. PRIS at TAC 2009: Experiments in KBP Track. In Proceedings of Test Analysis Conference 2009 (TAC 09).

[63] R. Srihari and W. Li. A question answering system supported by information extrac-tion. In Proceedings of the sixth conference on Applied natural language processing,pages 166–172. Association for Computational Linguistics, 2000.

[64] S. Tan, X. Cheng, Y. Wang, and H. Xu. Adapting naive bayes to domain adaptationfor sentiment analysis. Advances in Information Retrieval, pages 337–349, 2009.

[65] E.F. Tjong Kim Sang and F. De Meulder. Introduction to the CoNLL-2003 sharedtask: Language-independent named entity recognition. In Proceedings of the seventhconference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 142–147. Association for Computational Linguistics, 2003.

[66] M. Vela and T. Declerck. Concept and relation extraction in the finance domain.In Proceedings of the Eighth International Conference on Computational Semantics,pages 346–350. Association for Computational Linguistics, 2009.

[67] D.L. Waltz. An English language question answering system for a large relationaldatabase. Communications of the ACM, 21(7):526–539, 1978.

[68] D. Zelenko, C. Aone, and A. Richardella. Kernel methods for relation extraction. TheJournal of Machine Learning Research, 3:1083–1106, 2003.

[69] T. Zesch, I. Gurevych, and M. Mühlhäuser. Analyzing and accessing Wikipedia as a lexical semantic resource. Data Structures for Linguistic Resources and Applications, pages 197–205, 2007.

[70] T. Zesch, C. Muller, and I. Gurevych. Extracting lexical semantic knowledge fromwikipedia and wiktionary. In Proceedings of the Conference on Language Resourcesand Evaluation (LREC), pages 1646–1652. Citeseer, 2008.

[71] G.D. Zhou and J. Su. Named entity recognition using an HMM-based chunk tag-ger. In Proceedings of the 40th Annual Meeting on Association for ComputationalLinguistics, pages 473–480. Association for Computational Linguistics, 2002.
