
EFFECTIVE PASSAGE RETRIEVAL IN

QUESTION ANSWERING SYSTEMS

By

Surya Ganesh V

200402042

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science (by Research) in

Computer Science & Engineering

Search and Information Extraction Lab

Language Technologies Research Centre

International Institute of Information Technology

Hyderabad, India

June 2010


Copyright © 2010 Surya Ganesh V

All Rights Reserved


Dedicated to my family and my friends.


INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY

Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Effective Passage Retrieval in Question Answering Systems”, by Surya Ganesh (200402042), submitted in partial fulfillment for the award of the degree of Master of Science (by Research) in Computer Science & Engineering, has been carried out under my supervision and has not been submitted elsewhere for a degree.

Date Advisor :

Dr. Vasudeva Varma, Associate Professor

IIIT, Hyderabad


Acknowledgements

Foremost, I would like to express my sincere gratitude to my advisor Dr. Vasudeva Varma for all his guidance, support, encouragement and patience throughout my research work. I am grateful to him for providing me an opportunity to work in SIEL in an open and friendly research environment, giving me exposure to current research problems and involving me in industry projects.

I am thankful to Dr. Prasad Pingali for his in-depth discussions with me on my research work and for providing his valuable feedback and suggestions. I thank my thesis committee members, Prof. Kamlakar Karlapalem and Dr. Soma Paul, for sparing their valuable time to evaluate the progress of my research work. I will forever remember the wonderful time I had with my friends.

Last but not least, I would like to thank my father, Mr. Rama Krishna, my mother, Mrs. Lakshmi, and my sister, Veena, for their endless love and support. Without their continuous support and encouragement I would not have completed this work.


Abstract

Information Retrieval systems like Web search engines are often used by people to find information of interest. Despite the success of these systems, they are not sophisticated enough to provide precise information for users' requests. To address this problem, complex Information Access systems called Question Answering systems have been developed. They aim at finding exact answers to natural language questions in a large collection of documents (such as the World Wide Web). One of the most obvious limitations in the performance of many Question Answering systems is their inability to find the text passages where candidate answers can be found. Earlier research on the poor performance of passage retrieval highlighted the terminological gap problem, i.e., passages holding the answer to a question contain semantic alterations of the original terms in the question. In this thesis, we propose two different techniques to reduce this problem.

Query expansion is a widely used technique in Information Retrieval to reduce the terminological gap problem. First, we present a novel passage retrieval methodology which expands the query inherently. This methodology leverages the Statistical Machine Translation model for Information Retrieval to retrieve a ranked set of passages given a question. Retrieval within this model includes two steps: estimation and ranking. In the estimation step, multiple translation models are constructed using a statistical alignment model. We perceive each such translation model as an answer type profile. During ranking, based on the answer type of the question, its corresponding answer type profile is used to retrieve relevant passages. Our experimental analysis of the performance of this retrieval methodology showed significant improvements over different standard retrieval models including TFIDF, Okapi BM25, Indri and KL-divergence. We found that simple statistical alignment models like IBM model 1 are more suited for passage retrieval. We also showed that our methodology addresses the problems of synonymy and polysemy.

Previous studies have shown that explicit query expansion methods like pseudo relevance feedback, and methods based on external knowledge sources like WordNet, Wikipedia or the Web, improve the performance of Information Retrieval systems. We propose a novel query expansion method using Wikipedia. Our methodology uses the text content, category structure, and link structure of Wikipedia to generate a set of terms semantically related to the question. As the Boolean model allows fine-grained control over query expansion, these semantically related terms are added to the original query to form an expanded Boolean query. Our experimental analysis of the performance of these expanded queries on Lucene, an open source retrieval system, showed significant improvements over seed queries. We also analyzed the performance of expanded queries based on the different scoring methods utilized in selecting query expansion terms and for different query expansion lengths.

In addition to the above contributions, which focused on reducing the terminological gap between query and passage, we explored the necessity of passage priors in ranking passages. Document retrieval assumes that a document is independent of its relevance and non-relevance. The same assumption is carried forward in passage retrieval systems in the context of Question Answering. We relax this assumption and explore the necessity of the priors of a passage being relevant and non-relevant in ranking passages given a query. We describe a mutual information measure for estimating these priors, and a simple method for identifying text relevant and non-relevant to a question using the Web and the AQUAINT corpus as information sources. Our experimental analysis of using passage priors as a re-ranking step on top of language models including Indri and KL-divergence showed that passage priors are necessary in ranking passages.


Publications

• Surya Ganesh and Vasudeva Varma, “Passage Retrieval Using Answer Type Profiles in Question Answering”, In Proceedings of the Pacific Asia Conference on Language, Information and Computation (PACLIC), Hong Kong, December 2009.

• Surya Ganesh and Vasudeva Varma, “Exploiting structure and content of Wikipedia for Query Expansion in the context of Question Answering”, In Proceedings of Recent Advances in Natural Language Processing (RANLP), Bulgaria, September 2009.

• Surya Ganesh and Vasudeva Varma, “Exploiting the use of Prior Probabilities for Passage Retrieval in Question Answering”, In Proceedings of Recent Advances in Natural Language Processing (RANLP), Bulgaria, September 2009.

• Vasudeva Varma, Prasad Pingali, Rahul Katragadda, Sai Krishna, Surya Ganesh, Kiran Sarvabhotla, Harish Garapati, Hareen Gopisetty, Vijay Bharath Reddy, Kranthi Reddy, Praveen Bysani and Rohit Bharadwaj, “IIIT Hyderabad at TAC 2008”, In the Working Notes of the Text Analysis Conference (TAC) at the joint meeting of the annual conferences of TAC and TREC, USA, November 2008.

• Rohit Bharadwaj, Surya Ganesh and Vasudeva Varma, “A Naive Approach for Monolingual Question Answering”, In the Working Notes of the Cross Language Evaluation Forum (CLEF) Workshop, Greece, October 2009.


Contents

1 Introduction
  1.1 Question Answering
    1.1.1 History of QA
    1.1.2 Dimensions of QA
    1.1.3 A General Architecture
  1.2 Problem Description
    1.2.1 Motivation
    1.2.2 Problem Scope Definition
  1.3 Solution Outline
    1.3.1 Passage Retrieval using Answer Type Profiles
    1.3.2 Query Expansion using Wikipedia
    1.3.3 Effect of Passage Priors in Passage Retrieval
  1.4 Outline of the Thesis

2 Review of Passage Retrieval
  2.1 Passage
    2.1.1 Discourse Passages
    2.1.2 Semantic Passages
    2.1.3 Window-based Passages
    2.1.4 Arbitrary Passages
  2.2 Previous Work
    2.2.1 Passage Retrieval based on Vector Space Models
    2.2.2 Passage Retrieval based on density of Query terms
    2.2.3 Passage Retrieval based on NLP techniques
    2.2.4 Passage Retrieval based on Probabilistic Models and Language Modeling
  2.3 Query Expansion

3 Passage Retrieval using Answer Type Profiles
  3.1 Language modeling for Information Retrieval
  3.2 Statistical Machine Translation Model for Information Retrieval
  3.3 Passage Retrieval
    3.3.1 Parallel corpus
    3.3.2 Question Classification
    3.3.3 Learning Answer type profiles
    3.3.4 Passage Ranking
  3.4 Conclusion

4 Evaluation
  4.1 Evaluation Metrics
    4.1.1 Average precision at 1
    4.1.2 Mean Reciprocal Rank
    4.1.3 Total Document Reciprocal Rank
  4.2 Data Set
    4.2.1 AQUAINT Corpus
    4.2.2 FACTOID Questions
    4.2.3 Answer Judgements
  4.3 Experiments
    4.3.1 Retrieval Models
    4.3.2 Answer Types
    4.3.3 Alignment Models
  4.4 Discussion
  4.5 Conclusion

5 Query Expansion Using Wikipedia
  5.1 Methodology
    5.1.1 Proximity score
    5.1.2 Outlink score
  5.2 Evaluation
    5.2.1 Evaluation Metrics
    5.2.2 Data Set
    5.2.3 Experiments
  5.3 Discussion
  5.4 Conclusion

6 Effect of Passage Priors in Passage Retrieval
  6.1 Background
  6.2 Estimation of prior probability
  6.3 Identifying relevant and non-relevant text
    6.3.1 Relevant text
    6.3.2 Non-relevant text
  6.4 Evaluation
    6.4.1 Evaluation Metrics and Data Set
    6.4.2 Experiments
  6.5 Conclusion

7 Conclusions
  7.1 Contributions
    7.1.1 Passage Retrieval Using Answer Type Profiles
    7.1.2 Query Expansion Using Wikipedia
    7.1.3 Effect of Passage Priors in Passage Retrieval
    7.1.4 Opinion Question Answering System
    7.1.5 Monolingual Question Answering System
  7.2 Future Work
    7.2.1 Passage Retrieval Using Answer Type Profiles
    7.2.2 Query Expansion Using Wikipedia
    7.2.3 Effect of Passage Priors in Passage Retrieval

Bibliography


List of Tables

3.1 Quantitative overview of QASP parallel corpus
3.2 The coarse and fine grained answer types
3.3 Translations for word born in LOCATION profile
3.4 Translations for word born in NUMBER profile
4.1 Quantitative overview of TREC 2002-2006 question sets
4.2 Strict evaluation scores for different Passage Retrieval methodologies
4.3 Lenient evaluation scores for different Passage Retrieval methodologies
4.4 Strict evaluation scores for different categories of questions
4.5 Lenient evaluation scores for different categories of questions
4.6 Strict and lenient evaluation scores for different statistical alignment models
5.1 Targets from TREC 2006 QA test set
5.2 Top 10 expansion terms for the question “Which position did Warren Moon play in professional football?”
5.3 Strict and lenient evaluation results for seed queries and expanded queries
5.4 Statistical analysis of proximity scoring and outlink scoring methods
6.1 Redundancy scores for the passages retrieved from AQUAINT corpus
6.2 Results for Indri retrieval model under strict and lenient criteria
6.3 Results for KL divergence retrieval model under strict and lenient criteria


List of Figures

1.1 Pipeline architecture of a typical Question Answering system
1.2 Passage Retrieval in a Question Answering system
3.1 Passage retrieval using Answer Type Profiles
4.1 A sample document from AQUAINT corpus
4.2 A sample series of questions from TREC 2006 question set
4.3 A sample segment from TREC 2006 answer judgements
5.1 Performance of passage retrieval for different query expansion lengths
6.1 Performance of passage retrieval for different α values ranging from 0.0 to 1.0


Chapter 1

Introduction

Document retrieval systems like Web search engines have become common interfaces to find information of users' interest. Users typically transform their information need into a keyword query, and then issue these queries to a document retrieval system. The system then provides users with a set of ranked documents. Current Web search engines like Google, Bing, Yahoo etc. are sophisticated enough to produce relevant documents for most queries. But users have to skim through these documents to find their target information, which obviously consumes their valuable time. Instead, users could better access the information if they could issue their information need in a natural way, i.e., through a natural language question, and in turn receive a precise answer as a result. For instance, given the information need “When was Mahatma Gandhi born?”, a system which gives the result “2nd October, 1869” serves the purpose better than a document retrieval system. So, developing systems which can automatically answer a natural language question has been a long-standing goal.

1.1 Question Answering

The aim of Question Answering (QA) systems is to provide a precise answer for a given natural language question from a large collection of documents, such as the World Wide Web.


The first QA systems were developed in the 1960s, and they were basically front ends or natural language interfaces to structured databases of specific domains. In contrast, current QA systems use text documents as their underlying knowledge source, and building these systems requires insights from a wide range of disciplines, including Natural Language Processing, Information Retrieval, Information Extraction and Artificial Intelligence. Here we present a brief history of the QA field, the dimensions of QA, and the pipeline architecture of a typical QA system.

1.1.1 History of QA

The field of QA has a history of about 50 years. The first noted QA system, called “The Conversation Machine” [26], was developed in 1959. It answered questions related to weather from a structured database. During the 1960s, about 15 QA systems [84] were developed, of which the most prominent was the BASEBALL system [27]. It answered questions related to results, locations and dates of baseball games. Other influential QA systems from that decade are ORACLE [65], ALA [89], and PROTOSYNTHEX [70]. Until the early 1990s, there were few other research efforts in this field, and most of the systems developed during this period were either natural language interfaces to structured databases or domain specific. More recently, QA has witnessed a true renaissance with the introduction of the QA track in the Text REtrieval Conference (TREC)¹ [93]. The rapid interchange of ideas between participating groups and a common evaluation procedure are the main reasons which fostered research in the field of open domain textual QA systems.

¹ Text REtrieval Conference, http://trec.nist.gov

1.1.2 Dimensions of QA

The task of building a QA system is quite complex because of its number of different dimensions [9, 29, 31]. Over the years, many QA systems have been developed by targeting a subset of these dimensions. The dimensions can roughly be divided into: question types, the type of data used as a knowledge source for extracting answers, and whether the system is domain dependent or domain independent.

Question Types

QA research attempts to deal with a wide range of question types including: factoid, list, definition, procedure, reason, purpose, opinion, and cross-lingual questions. TREC focuses solely on the first three types, which are collectively called FACT-based questions. Different question types require the use of different strategies to find their respective answers. In general, question types are arranged hierarchically in taxonomies.

Type of data

The type of data used as a knowledge source for extracting answers distinguishes QA systems into two types: systems which answer questions by accessing structured information from a database, and systems which answer questions by utilizing unstructured information such as plain text. The main challenge within QA systems that use databases is to transform a natural language question into a database query. These systems are often referred to as front ends or natural language interfaces to database systems. The other type of QA systems are text based systems. In order to find an answer to a question, these systems analyze documents in plain text format, such as newspaper articles, manuals, and encyclopedias. The main challenge within these systems is to match the question with text units, such as phrases or sentences, in the document collection, and among those units identify the answer.

Generality

Every QA system can be classified into one of two types: closed domain systems and open domain systems. Closed domain QA systems deal with questions in a specific domain (for example, cricket or medicine). Building these systems is relatively easy because they can exploit domain-specific knowledge, frequently formalized in ontologies. Open domain QA systems deal with questions about nearly everything. These systems rely on general ontologies and world knowledge, and they typically have much more data available from which answers can be extracted.

1.1.3 A General Architecture

Over the years, many QA systems have been developed. In TREC 2007 [18], 21 research teams participated in the QA track. Each group implemented its own system, following different architectures and techniques. Nevertheless, most QA systems consist of four core components, namely, Question Analysis, Document Retrieval, Passage Retrieval, and Answer Extraction. The pipeline architecture of a typical QA system with all the above components is shown in Figure 1.1. A detailed description of the functioning of the four core components is given below.

Figure 1.1 Pipeline architecture of a typical Question Answering system


Question Analysis

Analyzing the natural language question given as input to the system is the first step towards finding the answer. The main purpose of this component is to identify the expected answer type, and to construct a query that is used in the document and passage retrieval components. The process of identifying the expected answer type of a question is known as Question Classification; e.g., for the question “When was Mahatma Gandhi born?”, the answer type is “DATE”. Several approaches, ranging from rule-based [22, 23] to statistical machine learning approaches [17, 28], have been developed for this purpose. Additionally, this component constructs the queries which are posed to the retrieval components. There are several ways to construct these queries, depending on the type of queries the retrieval engine allows. For instance, some engines allow bag-of-words queries, where a query is an unordered list of unique key terms in the question, and some engines allow Boolean queries, where key terms in the question are connected by logical operators such as AND, OR etc. The effectiveness of this component is critical to the overall performance of a QA system. This is because, if a question is incorrectly classified, the answer extraction component will try to find answer candidates of the wrong type.
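As a concrete illustration of these two query styles, consider the following sketch (ours, not from the thesis; the stopword list and term normalization are assumptions of the example):

    # Hypothetical sketch: a bag-of-words query and a Boolean query
    # built from the same natural language question.
    STOPWORDS = {"when", "was", "is", "the", "a", "an", "of", "did"}

    def bag_of_words_query(question):
        # Unordered list of unique key terms in the question.
        terms = (t.strip("?.,!").lower() for t in question.split())
        return sorted({t for t in terms if t and t not in STOPWORDS})

    def boolean_query(question):
        # Key terms connected by the logical operator AND.
        return " AND ".join(bag_of_words_query(question))

    print(bag_of_words_query("When was Mahatma Gandhi born?"))
    # ['born', 'gandhi', 'mahatma']
    print(boolean_query("When was Mahatma Gandhi born?"))
    # born AND gandhi AND mahatma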

Document Retrieval

The function of document retrieval is to identify documents that are likely to contain an answer, by utilizing the query provided by the question analysis component. Many document retrieval frameworks [12, 61, 79] have been developed over the years, resulting in sophisticated ways to compute the similarity between a document and a query. As the information need in QA is much more specific than in traditional Information Retrieval systems, many document retrieval engines used in QA are based on Boolean models [88]. These models provide better options for constructing a query. In the initial stages of QA, recall is more important, because an answer once lost is lost forever. So, the retrieved documents should contain the answer, irrespective of the ranks at which it occurs.


Passage Retrieval

We now turn to the passage retrieval component, which is the main topic of this thesis. The main function of this component is to reduce the search space for finding the answer from a set of documents to a fixed number of passages (say the top 20). Typically, answers are expressed very locally in a document. So, the short text excerpts returned by this component are easier to process by the later components of the QA system, which are otherwise computationally expensive. On the other hand, if the selected set of passages does not contain the answer, then the QA system cannot answer the question even if the answer extraction component is 100% accurate. So, high performance of passage retrieval is desired to improve the success rate of a QA system. Most often, passage retrieval suffers from the terminological gap, i.e., passages holding the answer to a question contain semantic alterations of the original terms in the question. Moldovan et al. [59] showed that their system failed to answer 25.7% of questions solely because of the terminological gap. This problem is normally addressed by the use of query expansion techniques. In this thesis, we propose two different techniques to reduce the terminological gap between questions and passages.

Answer Extraction

The function of the answer extraction component is to identify answer candidates and rank them to select the final answer. In some QA systems, this component simply returns a ranked list of answers, where answers are ordered with respect to the confidence the system has in each of them. To identify answer candidates, this component utilizes the top ranked passages provided by the passage retrieval component and applies natural language processing techniques such as syntactic parsing, dependency parsing, and named entity recognition on these passages. Several approaches have been proposed for ranking answer candidates, including redundancy based approaches [21], pattern matching [95], proximity based approaches [32], probabilistic methods [42] and other related approaches.


1.2 Problem Description

The different dimensions of a QA system, and its multiple components, make QA a complex task. It is difficult to investigate each and every component, and the impact of the different techniques they employ on the overall performance of a QA system. So, within the field of QA, researchers have focused on developing effective techniques for a single component, and on evaluating its performance in the context of QA. In this thesis, we focus on designing effective methodologies for passage retrieval in the context of QA.

As explained earlier, passage retrieval reduces the search space for finding the answer from a set of documents to a fixed number of passages (say the top 20). It is an intermediate step between document retrieval and answer extraction. The input to this component is a set of documents produced by the document retrieval component, and a query formulated from the question. The output is a list of passages from these documents, ranked based on similarity scores between the query and the passages. Figure 1.2 illustrates this process.

Figure 1.2 Passage Retrieval in a Question Answering system


1.2.1 Motivation

Current QA systems like START, Answers.com, AnswerBus etc. produce a passage as a response instead of the actual answer phrase. This is in accordance with the experiments conducted by Lin et al. [50] on designing effective interfaces for QA systems. They showed that users prefer passage level answers over short phrases, since passages contain rich context information that helps users justify a correct answer. For instance, given the question “When was Mahatma Gandhi born?”, a QA system with the response “2 October 1869” answers the question precisely, but a system with the response “Mohandas Karamchand Gandhi (2 October 1869 - 30 January 1948) was the pre-eminent political and spiritual leader of India during the Indian independence movement.” would better satisfy the user, because it provides contextual information that helps in judging the answer as correct or wrong. As a result, a highly relevant passage can sufficiently answer a question. So, passage retrieval has attracted a lot of interest in the QA research community over the recent past.

Another major factor which has led to the need for effective passage retrieval methodologies is the large drop in performance from document retrieval to passage retrieval. Derczynski et al. [20] investigated the possibility of failure of the document retrieval and passage retrieval components within a QA system. They compared the performance of these two retrieval components by using a common retrieval engine. For the retrieval engine Terrier [54], a framework specially designed to deal with corpora in the terabyte range, they reported 98.3% coverage for the top 20 documents and 63.8% coverage for the top 20 passages. Here, coverage measures the percentage of questions for which at least one answer bearing text appears in the retrieved document/passage set. As this decrease in performance is significantly large, effective and sophisticated retrieval methodologies have to be designed for passage retrieval, instead of using a common framework for both document retrieval and passage retrieval.

Tellex et al. [88] conducted a thorough quantitative evaluation of the passage retrieval algorithms employed by state-of-the-art QA systems. One of the main sources of error they identified is the mismatch of crucial syntactic relations between terms in a question and the same terms in a passage. Another major source of error is the terminological gap problem, i.e., passages holding the answer to a question contain semantic variants of terms in the question. Moldovan et al. [59] showed that 25.7% of questions are left unanswered because of this problem. In this thesis, we propose techniques which are capable of answering these questions. Typically, in Information Retrieval, such a problem is addressed by the use of query expansion techniques. In this thesis, we present a passage retrieval methodology which expands the query inherently, and an explicit query expansion method leveraging the structure and content of Wikipedia.

1.2.2 Problem Scope Definition

This thesis focuses on effective passage retrieval methodologies in Question Answering systems. A significant portion of questions is left unanswered because of the terminological gap problem. In Information Retrieval, query expansion is a widely used technique to reduce this problem. Our aim is to develop passage retrieval methodologies which expand queries either explicitly or implicitly. A detailed empirical analysis of our approaches should be performed to show the improvements in performance over a wide range of parameter values.

1.3 Solution Outline

In this thesis, we focus on reducing the terminological gap problem in passage retrieval by relying on the query expansion strategy. To this end, we propose a passage retrieval methodology which expands queries inherently, and an explicit query expansion method using Wikipedia. In addition to these two, we also show the necessity of passage priors in ranking passages given a query. A brief description of these methodologies is given below.


1.3.1 Passage Retrieval using Answer Type Profiles

First, we present a novel passage retrieval methodology using answer type profiles. This methodology leverages the Statistical Machine Translation (SMT) model for Information Retrieval [3], which expands input queries inherently during the process of retrieval. Within this model, the query generation process is viewed as a translation or distillation from a passage. The relevance of a passage to a query is determined by the probability that the query would have been generated as a translation of that passage. So, for a given query, passages in the collection are ranked according to these probabilities. More specifically, the mapping from a passage term to a query term is achieved by estimating a translation model. The translation model consists of triples: the query word, the passage word and the probability of translation. In our approach, we build multiple translation models, one for each category (answer type) of questions. As the translation model is unique given an answer type, we view each such model as an answer type profile. In order to build these profiles, we utilize statistical alignment models which maximize the probability of the observed (question, sentence) text pairs using the Expectation Maximization algorithm. After the maximization process is completed, the word level alignments are set to the maximum posterior predictions of the model. This entire process of constructing answer type profiles is done offline. During the on-line phase, that is during retrieval, answer type profiles are incorporated into the SMT framework to retrieve a ranked set of passages given a question. Our analysis of the performance of this retrieval methodology showed significant improvements over different standard retrieval models including TFIDF, Okapi BM25, Indri, and Kullback-Leibler divergence (KL divergence). We found that simple statistical alignment models like IBM model 1 are more suited for passage retrieval in QA. We also showed that our methodology addresses the problems of synonymy and polysemy.
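To make the ranking step concrete, the following minimal sketch (our illustration, not the thesis implementation; the toy translation table and the smoothing floor are invented, whereas a real answer type profile is learned with a statistical alignment model such as IBM model 1) scores a passage by the probability that each query word is generated as a translation of some passage word:

    import math
    from collections import Counter

    def translation_score(query_terms, passage_terms, t_table, eps=1e-6):
        # log P(query | passage) under a word translation model:
        # P(q | passage) = sum over passage words w of t(q | w) * P(w | passage).
        # eps is an assumed smoothing floor for query words with no translation.
        counts, length = Counter(passage_terms), len(passage_terms)
        score = 0.0
        for q in query_terms:
            p_q = sum(t_table.get((q, w), 0.0) * c / length
                      for w, c in counts.items())
            score += math.log(p_q + eps)
        return score

    # Toy LOCATION profile: "born" also translates to "birthplace", so the
    # first passage wins even though it never contains the word "born".
    t_table = {("born", "born"): 0.6, ("born", "birthplace"): 0.3,
               ("gandhi", "gandhi"): 0.9}
    query = ["gandhi", "born"]
    p1 = "porbandar was the birthplace of gandhi".split()
    p2 = "gandhi led the salt march in 1930".split()
    print(translation_score(query, p1, t_table) >
          translation_score(query, p2, t_table))   # True

Because the translation table is conditioned on the answer type, a question classified as LOCATION consults a different profile than one classified as NUMBER (compare Tables 3.3 and 3.4), which is how the method expands queries inherently.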


1.3.2 Query Expansion using Wikipedia

As an alternative approach for reducing the terminological gap problem, we present a novel query expansion method using Wikipedia, the leading open encyclopedia with wide coverage of diverse topics, events, entities, etc. Our methodology uses the text content, link structure, and category structure of Wikipedia to generate a set of terms semantically related to a question. First, we identify a set of potentially related sentences, and then keywords from these sentences are ranked based on two scoring methods: proximity scoring and outlink scoring. The assumption behind the selection of terms based on proximity scores is that semantically related terms are usually located in proximity, and the distance between two terms can indicate the strength of their association. Outlink scoring, on the other hand, prioritizes keywords from the anchor text of outlinks whose category information matches the question. The top N scoring terms, where N varies linearly with the size of the seed query, are mixed with the seed query terms to form an expanded Boolean query, which allows fine-grained control over query expansion. Our analysis of the performance of these expanded queries on Lucene, an open source retrieval system, showed significant improvements over seed queries. We also analyzed the performance of expanded queries based on the different scoring methods utilized in selecting terms and for different expansion lengths.
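The final mixing step can be sketched as follows (our illustration only: the Boolean structure, the expansion ratio and the toy expansion terms are assumptions; the exact query form and the scoring methods are described in Chapter 5):

    def expand_boolean_query(seed_terms, scored_terms, per_seed=1):
        # N varies linearly with the seed query size: here, an assumed
        # ratio of `per_seed` expansion terms per seed term.
        n = per_seed * len(seed_terms)
        top = [t for t, _ in sorted(scored_terms, key=lambda x: x[1],
                                    reverse=True)[:n]]
        # One plausible recall-oriented Boolean form: a semantically
        # related term can match in place of a missing seed term.
        return " OR ".join(seed_terms + top)

    seed = ["warren", "moon", "position", "football"]
    scored = [("quarterback", 0.91), ("nfl", 0.74), ("oilers", 0.55),
              ("cfl", 0.41), ("edmonton", 0.33)]
    print(expand_boolean_query(seed, scored))
    # warren OR moon OR position OR football OR quarterback OR nfl OR oilers OR cfl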

1.3.3 Effect of Passage Priors in Passage Retrieval

In the case of document retrieval, the language modeling decomposition of the probability ranking principle assumes that a document is independent of its relevance and non-relevance. So, during ranking, the two document priors, the probability of a document being relevant and the probability of a document being non-relevant, cancel each other. A detailed description of this ranking process is given in section 6.1. Effectively, this assumption states that document ranking is independent of the priors. The same assumption is carried forward in passage retrieval systems in the context of QA. We relax this assumption and explore the necessity of the priors of a passage being relevant and non-relevant in ranking passages given a query. For estimating these priors, we describe a mutual information measure called KL divergence, which is often used in Information Retrieval to measure the distance between two language models. We also present a simple method for identifying text relevant and non-relevant to a question, using the Web and the AQUAINT corpus as information sources. Our experimental analysis of using passage priors as a re-ranking step on top of language models including Indri and KL divergence showed that passage priors are necessary in ranking passages.
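As a minimal illustration of the measure involved (our sketch; the add-alpha smoothing and the toy texts are assumptions of the example, and the actual estimation procedure is described in Chapter 6), the KL divergence between a passage's unigram language model and those of relevant and non-relevant text can be computed as:

    import math
    from collections import Counter

    def kl_divergence(p_terms, q_terms, alpha=1.0):
        # D(P || Q) between unigram language models estimated from two
        # term lists, with add-alpha smoothing so Q never assigns zero
        # probability. Smaller values mean the two texts are closer.
        vocab = set(p_terms) | set(q_terms)
        cp, cq = Counter(p_terms), Counter(q_terms)
        zp = len(p_terms) + alpha * len(vocab)
        zq = len(q_terms) + alpha * len(vocab)
        return sum(((cp[w] + alpha) / zp) *
                   math.log(((cp[w] + alpha) / zp) / ((cq[w] + alpha) / zq))
                   for w in vocab)

    passage = "gandhi was born in porbandar in 1869".split()
    relevant = "mahatma gandhi born 2 october 1869 porbandar".split()
    non_relevant = "stock markets fell sharply on monday".split()
    print(kl_divergence(passage, relevant) <
          kl_divergence(passage, non_relevant))   # True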

1.4 Outline of the Thesis

The rest of this thesis is organized as follows.

Chapter 2

In this chapter, we outline different techniques to define passages, and discuss some of the earlier approaches to passage retrieval in QA. As our main goal is to reduce the terminological gap problem by query expansion, we also discuss different query expansion techniques that are applied for passage retrieval in QA.

Chapter 3

This chapter describes a new passage retrieval method leveraging statistical machine translation models for information retrieval. We overview the emergence of statistical language modeling for information retrieval, and derive the translation model as an extension of language modeling. Using the translation model, we describe a passage retrieval method which can solve the problems of synonymy and polysemy in information retrieval.


Chapter 4

In this chapter, we describe the experiments conducted to evaluate the passage retrieval approach described in chapter 3. First, we give a detailed description of the evaluation metrics and data sets used in the experiments. Then, we describe the experimental setup and all three experiments conducted to analyze the performance of our approach. Finally, we discuss the effectiveness of our approach based on a detailed analysis of the obtained results.

Chapter 5

This chapter introduces a novel query expansion method using Wikipedia. We define the query expansion term space from the text content of Wikipedia, and present two scoring methods, namely proximity scoring and outlink scoring, to select query expansion terms from this space. We describe the experiments conducted to evaluate this method, and discuss all the important observations made from the results of these experiments.

Chapter 6

This chapter shows the necessity of passage priors in ranking passages for a given question. We describe a mutual information measure called KL divergence to compute this prior. We also present a simple method for identifying text relevant and non-relevant to a question using different information sources. We then describe the experiments conducted to evaluate the necessity of priors in passage retrieval, and discuss all the important observations made from these experimental results.

Chapter 7

In this chapter, we present an overall summary of our work and its contributions. Finally, we conclude with possible directions for future work.


Chapter 2

Review of Passage Retrieval

Passage retrieval is an intermediate step between document retrieval and answer extraction in a QA system. The goal of passage retrieval is to reduce the search space for finding the answer from a set of documents to a fixed number of passages (say the top 20). In this chapter, we define what a passage is, and outline different techniques to define passages. We then discuss some of the earlier approaches to passage retrieval in QA. As our main goal is to reduce the terminological gap problem by query expansion, we also discuss different query expansion techniques that are applied for passage retrieval in QA.

2.1 Passage

A passage is a contiguous block of text from a document. For the purpose of passage retrieval, several ideas have been proposed to define passages. These ideas are based on logical divisions (such as sentences, paragraphs, or sections), topical structure, fixed length blocks, or variable length blocks of text in a document. According to Kaszkiel and Zobel [38], a passage can be defined in the following four ways.


2.1.1 Discourse Passages

Passages can be defined based on structural or logical divisions [78, 94], such as sentences, paragraphs, or sections of documents. These divisions can be identified by occurrences of periods, indentation, empty lines, etc. as their boundaries. We followed this technique to define passages in evaluating our approaches. The text enclosed between <P> and </P> tags, i.e., a paragraph, is considered a passage in our case.

2.1.2 Semantic Passages

Documents can be segmented into semantic passages based on the topical structure [67, 71] of documents. The principal idea is to divide documents into coherent units, where each unit represents a topic or a subtopic. Several approaches have been designed to derive these coherent units from documents.

2.1.3 Window-based Passages

The principal idea behind window-based passages [8, 86] is to divide a document into units of equal length in words or bytes. Each such unit, that is, a passage here, may or may not follow the logical structure of the document. Windows can be overlapping or non-overlapping, and can be bounded if paragraph boundaries are known. A minimal sketch of half-overlapping windows is given below.
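As an illustration (our sketch; the window size and overlap are arbitrary example values), a document can be split into half-overlapping word windows as follows:

    def window_passages(text, size=30, overlap=15):
        # Split a document into half-overlapping window-based passages
        # of a fixed number of words; the last window may be shorter.
        words = text.split()
        step = size - overlap
        return [" ".join(words[i:i + size])
                for i in range(0, max(len(words) - overlap, 1), step)]

    doc = " ".join(f"w{i}" for i in range(70))
    print([len(p.split()) for p in window_passages(doc)])   # [30, 30, 30, 25]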

2.1.4 Arbitrary Passages

Unlike discourse, semantic, and window-based passages, which are typically predefined at indexing time, arbitrary passages [37, 39] are defined at query time. Here, the word arbitrary means that a passage can start at any word in the document. Based on their length, these passages are distinguished into two types: fixed length arbitrary passages and variable length arbitrary passages. Fixed length arbitrary passages are similar to overlapped windows, but with an arbitrary starting point.


2.2 Previous Work

A significant amount of work has been done on passage retrieval in the context of QA. Here we briefly overview some of the available approaches. We categorize these approaches into four classes based on their retrieval models, as described below.

2.2.1 Passage Retrieval based on Vector Space Models

Vector space models have been widely used in the field of Information Retrieval to compute the similarity between a document and a query. Within this model, all retrieval units (e.g., documents, passages, or sentences) and queries are represented as vectors. So, given a collection of documents, an n-dimensional vector is generated for each document and each query from sets of terms with associated weights, where n is the number of unique terms in the collection. Then, a vector similarity function, such as cosine similarity, can be used to compute the similarity between a document and a query. This model has been used for passage retrieval in the context of QA to find a set of answer-containing passages.
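As a minimal illustration (our sketch, using raw term-frequency weights; the systems below typically add idf weighting), cosine similarity between a query and a passage can be computed as:

    import math
    from collections import Counter

    def cosine(a_terms, b_terms):
        # Cosine similarity between two texts under raw term-frequency
        # weighting over their shared vocabulary.
        a, b = Counter(a_terms), Counter(b_terms)
        dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
        norm = (math.sqrt(sum(v * v for v in a.values())) *
                math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    query = "mahatma gandhi born".split()
    passage = "gandhi was born in porbandar".split()
    print(round(cosine(query, passage), 3))   # 0.516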

Light et al. [48] proposed a word overlap algorithm which simply counts the number of terms a passage has in common with the query. They used a discourse-based passaging technique, and defined each sentence as a passage. This naive approach to passage retrieval can act as a good baseline in comparisons. Gonzalez et al. [25] presented a passage retrieval algorithm which ranks passages based on the non-length-normalized cosine similarity scores between the query terms and the passage. Each term in the passage is weighted based on its number of occurrences in the passage and its idf value, and a query term is weighted based on its number of occurrences in the query. They used a discourse-based passaging technique, and defined n continuous sentences in a document as a passage. In their experiments, they reported that a passage of two sentences produced optimum results. Tellex et al. [88], while evaluating different passage retrieval algorithms, tested the effect of the Okapi BM25 retrieval model for ranking passages. They defined a passage as a sliding window, and from the experiments conducted, they reported comparable results with the other state-of-the-art passage retrieval algorithms.

2.2.2 Passage Retrieval based on density of Query terms

The density-based retrieval method ranks passages based on how close the matched words appear to each other in the passage. The higher the density of question-word distribution in a passage, the higher the score it receives. Tellex et al. [88] showed that density-based measures work well for passage retrieval in QA. The following are some of the existing approaches.

Clarke et al. [11] described a passage retrieval algorithm known as MultiText. This algorithm ranks passages based on the length of the passage and on the weights assigned to the query terms they match. Intuitively, this algorithm favors short passages containing many query terms with high weights. They defined arbitrary passages, where each passage starts and ends with a query term. Ittycheriah et al. [34] presented a passage retrieval algorithm which ranks passages based on a linear combination of five density measures: matching words measure, thesaurus match measure, mismatch words measure, dispersion measure and cluster words measure. The matching words measure sums the idf values of words that appear in both the query and the passage, the thesaurus match measure sums the idf values of words in the query whose WordNet synonyms appear in the passage, the mismatch words measure sums the idf values of words that appear in the query but not in the passage, the dispersion measure counts the number of words in the passage between matching query terms, and the cluster words measure counts the number of words that occur adjacently in both the question and the passage. This methodology has been shown to perform better than retrieval methodologies based on vector space models. Lee et al. [47] present a passage retrieval algorithm which weights query terms based on their part-of-speech. They used a discourse-based passaging technique, and defined n continuous sentences in a document as a passage. The final score of a passage is computed by adding the scores of individual sentences. In their experiments, they reported that a passage of three sentences produced optimum results.
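Two of the five measures can be sketched as follows (our illustration; the document frequencies are toy values, and the actual method combines all five measures in a learned linear combination):

    import math

    def idf(term, doc_freq, n_docs):
        # Inverse document frequency with add-one damping in the denominator.
        return math.log(n_docs / (1 + doc_freq.get(term, 0)))

    def matching_words(query, passage, doc_freq, n_docs):
        # Sum of idf values of words appearing in both query and passage.
        return sum(idf(w, doc_freq, n_docs) for w in set(query) & set(passage))

    def mismatch_words(query, passage, doc_freq, n_docs):
        # Sum of idf values of query words missing from the passage.
        return sum(idf(w, doc_freq, n_docs) for w in set(query) - set(passage))

    df = {"gandhi": 120, "born": 5000, "mahatma": 150}
    q = "mahatma gandhi born".split()
    p = "gandhi was born in porbandar".split()
    print(round(matching_words(q, p, df, 100000), 2))   # 9.71
    print(round(mismatch_words(q, p, df, 100000), 2))   # 6.5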


2.2.3 Passage Retrieval based on NLP techniques

The Natural Language Processing (NLP) techniques for passage retrieval have been developed to overcome the limitations of lexical matching based retrieval techniques like vector space models and density based models. The main idea behind NLP based retrieval is that considering crucial relations between words can avoid false positives which are otherwise retrieved by the latter techniques. This is because many irrelevant passages contain the question terms, yet the relations between these terms in the passage differ from those in the question. Here, we briefly overview some of the existing approaches.

Cui et al. [16] explored the use of a fuzzy dependency relation matching method to enhance passage retrieval. They used a discourse-based passaging technique, and defined each paragraph enclosed between <P> and </P> tags as a passage. They examined dependency relations between query terms and key terms within passages by employing Minipar [49], a fast and robust dependency parser, to accomplish dependency parsing. In order to match these relations, they proposed fuzzy matching instead of strict matching, because the latter fails when semantically equivalent relationships are phrased differently. This approach produced significant improvements when compared to density based passage retrieval approaches. Similarly, Wu et al. [53] extracted surface relation patterns from both the query and the passages to perform relation based matching. Even though both approaches reported significant improvements in precision, they are typically ineffective for short questions, which have very few query terms and relation paths. Ofoghi et al. [64] presented a passage retrieval algorithm based on the observation that passage retrieval fails when there is a syntactic mismatch between the words inside passages and queries. To overcome this limitation, they presented a query formulation technique which exploits intra-frame term-level relations inside FrameNet [2] for retrieving semantically related passages given a question. FrameNet is a lexical resource for English whose infrastructure is based on frame semantics [52]. However, on evaluating this approach, they obtained only marginal improvements in performance over a density based passage retrieval algorithm.

2.2.4 Passage Retrieval based on Probabilistic Models and Language Modeling

Classical probabilistic models of Information Retrieval [72] state that a retrieval system should rank documents in decreasing order of their probability of relevance to the query. The primary obstacle in these models is the need to estimate a relevance model, that is, the probability distribution of words in the relevant class. In order to overcome this constraint, Ponte and Croft [68] introduced a new conceptual view of Information Retrieval called language modeling. According to this model, documents are ranked by the probability that the query would be observed as a random sample from the respective document model. This paradigm has been adopted for passage retrieval in the context of QA. Here, we briefly overview some of the existing probabilistic and language modeling based approaches.
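As a minimal illustration of this query likelihood ranking applied to passages (our sketch; Jelinek-Mercer smoothing, one of the smoothing methods discussed below, is used here with an arbitrary interpolation weight):

    import math
    from collections import Counter

    def query_likelihood(query, passage, collection, lam=0.5):
        # log P(query | passage) under a unigram passage language model,
        # linearly interpolated (Jelinek-Mercer) with a collection model.
        p_counts, p_len = Counter(passage), len(passage)
        c_counts, c_len = Counter(collection), len(collection)
        score = 0.0
        for q in query:
            p_ml = p_counts[q] / p_len                # passage model
            p_bg = (c_counts[q] + 1) / (c_len + 1)    # floored background model
            score += math.log(lam * p_ml + (1 - lam) * p_bg)
        return score

    collection = "gandhi was born in porbandar gandhi led india stock markets fell".split()
    p1 = "gandhi was born in porbandar".split()
    p2 = "stock markets fell sharply".split()
    print(query_likelihood(["gandhi", "born"], p1, collection) >
          query_likelihood(["gandhi", "born"], p2, collection))   # True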

Emmanuel and Croft [13] exploited the appearance of answer patterns in factoid ques-

tions by constructing answer models. These are language models that were trained on a

parallel corpus consisting of questions and passages with the correct answers. In language

modeling based retrieval, prior of a passage is taken as constant. But, they relaxed this

assumption and used answer models to compute this prior value. In their experiments, they

have reported improvements in performance over a query likelihood baseline. Zhang and

Lee [102] proposed a language modeling approach for passage retrieval. In their approach,

first, a set of relevant passages was retrieved from the local document collection. They
defined a passage as a half-overlapped window with a maximum length of 30 words,
restricted not to cross paragraph boundaries. Then, they constructed a question-topic

language model by extracting relevant data from the Web. With additional constraints
on this model, such as answer type and answer context information, the passages retrieved
initially from the local document collection are re-ranked based on their KL-divergence
scores with the question-topic language model. Merkel and Klakow [57] experimented with standard


language model based smoothing methods [101] like Jelinek-Mercer linear interpolation,

Bayesian smoothing with Dirichlet priors and absolute discounting. They also proposed

new models based on refinements such as ignoring query words, dynamic stopword lists

and stemming, and also modeled the expected answer type of a question into the language

modeling approach. Using these models, they have reported significant improvements over

standard methods.

Murdock and Croft [62] used the Statistical Machine Translation model for sentence

retrieval in QA. Their approach used IBM model 1 [6] to build a translation model for all the

question-sentence pairs in the training corpus. The constructed translation model is used in

the language modeling framework to retrieve a ranked set of passages. Their experimental

results on TREC data showed that their approach performed better than retrieval based on

query likelihood. Our approach for passage retrieval is very similar to the above approach,

but we construct multiple, more sophisticated translation models, which we perceive as
answer type profiles, i.e., questions from distinct categories (answer types) have distinct
translation models. During retrieval, the profile corresponding to the answer type of the
question is used to retrieve relevant passages. We

show that this methodology addresses the terminological gap problem by expanding the

query inherently with contextually related synonyms of query words.

2.3 Query Expansion

As our main strategy to overcome the terminological gap problem in passage retrieval is

query expansion, we will briefly look into the available literature on the same. Query ex-

pansion is the process of reformulating a seed query to enhance retrieval performance in

information retrieval operations. In the context of passage retrieval in QA, query expan-

sion methodologies are intended to improve the recall of passages that are relevant to the

question. Improving passage retrieval in this way would provide the best possible input

to a downstream answer extraction component in a pipeline QA system. Here, we briefly


overview some of the existing query expansion techniques that have been used to improve

the performance of passage retrieval.

Monz [60] tested blind relevance feedback, a widely used query expansion technique, in

the context of QA. Their approach selected query expansion terms based on standard Roc-

chio term weighting from top ranked documents. From the empirical evaluation, they ob-
served a reduction in performance compared to the original queries, whereas
the same technique was found effective for the ad hoc retrieval task. Pizzato et al. [66] em-

ployed a similar technique which uses named entities of the expected answer type from

top ranked documents as query expansion terms. Their empirical evaluation on PERSON

type factoid questions has shown only marginal improvement in performance. Moldovan
et al. [59] enhanced the performance of the retrieval component in their QA system by using a

feedback loop with lexico-semantic alternations from WordNet as query expansion terms.

Also, Pasca and Harabagiu [80] reported substantial improvements when lexico-semantic

information from WordNet was used for query expansion. Van der Plas and Tiedemann [91]

have investigated the use of five different types of lexico-semantic information for the task of
query expansion for passage retrieval. Of these five types, three are corpus based methods,

in which expansion terms are selected based on proximity, syntax and alignment methods.

In the other two methods, categorized named entities, and synsets from European Word-

Net (EWN) are used as expansion terms. From the empirical evaluation, they found
that out of the three corpus based methods only the proximity based query expansion method
enhances passage retrieval performance, and only marginally. Among the other two

methods, categorized named entities produced better results.

Yang et al. [98] used WordNet and the Web to expand queries for QA. Only marginal

improvements were attained when the Web was used to extract expansion terms, and when
WordNet was used to rank these extracted terms the improvement was reduced. However, the
best results were obtained by semantically grouping the candidate expansion terms based
on the relations between them. Bilotti et al. [4] studied the effect of stemming and explicit query

expansion using inflectional variants on document retrieval in the context of QA. Their


experimental results showed high recall for explicit query expansion and comparably low

recall when stemming was used. Sun et al. [87] studied two query expansion techniques

which make use of dependency relation analysis to extract contextual terms and relations
from external corpora. These techniques were used to enhance the performance of density
based and relation based passage retrieval frameworks. Their experimental results showed
that the relation based term expansion method with a density based passage retrieval system
outperformed the local content analysis method for query expansion, and that the relation
expansion method outperformed the relation based passage retrieval system.

The above survey shows that widely used query expansion techniques like relevance

feedback, WordNet and other knowledge based approaches have resulted either in only

marginal improvements or a reduction in performance. Effectively, these techniques could
only identify terms semantically related to query terms (like synonyms in WordNet as
keyword alterations). But in the case of passage retrieval, where the goal is to identify answer

containing passages, these terms should also discriminate an answer containing passage

from others. This suggests that not all the query expansion techniques, which have been

successful in ad hoc retrieval, will have a similar impact in QA. In this thesis, we propose

a novel query expansion technique which identifies query expansion terms with both these

properties. Our empirical evaluation also suggests the same.

Arguello et al. [1] described a technique for mining the links and anchor text in Wikipedia

for query expansion terms and phrases. The technique yielded consistent and significant

improvements in both recall and precision for blog recommendation. Similar to this ap-

proach, we present a query expansion technique using Wikipedia as a knowledge base. In

our approach, we use text content, category structure, and link structure of Wikipedia to

generate a set of terms semantically related to a question. Finally, the top N scoring terms,
where N varies linearly with the size of the seed query, and the seed query terms are
mixed together to form an expanded Boolean query. We evaluate this approach on factoid
questions and show that, on the Okapi BM25 retrieval engine, the use of expanded queries leads
to significant improvements in performance.
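To illustrate the final mixing step, the following minimal sketch (our own illustration,
not the actual system; the function name, the toy terms and the linear factor are
hypothetical) forms an expanded Boolean query from seed terms and pre-scored
expansion terms:

```python
# A minimal sketch, not the thesis system: mix seed query terms with the top N
# scoring expansion terms (scored_terms is assumed sorted by score, descending).
# N grows linearly with the seed query size; the factor 2 is illustrative.
def expand_query(seed_terms, scored_terms, factor=2):
    n = factor * len(seed_terms)
    expansion = [term for term, score in scored_terms[:n]]
    return " OR ".join(seed_terms + expansion)

seed = ["mahatma", "gandhi", "born"]
scored = [("birthplace", 0.9), ("porbandar", 0.7), ("hometown", 0.5)]
print(expand_query(seed, scored))
# mahatma OR gandhi OR born OR birthplace OR porbandar OR hometown
```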


Chapter 3

Passage Retrieval using Answer Type Profiles

In this chapter, we describe a passage retrieval methodology leveraging the Statistical Ma-

chine Translation (SMT) model for Information Retrieval. This model has its roots
in statistical language modeling [68], and was first proposed by Berger and Laf-

ferty [3] for monolingual document retrieval. A notable feature of the SMT model is an

inherent query expansion component and its capability of handling the issues of synonymy

(multiple terms having similar meanings) and polysemy (the same term having multiple

meanings). The aim of our passage retrieval approach is that, during retrieval, query words

should be expanded inherently with only their contextually related synonyms, where the

context is determined by the answer type of the question. For instance, given the question

“Where was Mahatma Gandhi born?”, our approach aims at searching for contextually re-
lated synonyms of the word “born” (such as “birthplace” or “hometown”, instead of
“birthdate”, which is also a synonym of the same word) during retrieval. Our approach

includes two phases: one is off-line and the other is on-line. The off-line phase constructs

Answer Type Profiles (ATPs) from question-answer sentence pairs parallel corpus using a

statistical alignment model. The on-line phase uses ATPs within the SMT framework to

retrieve a ranked set of relevant passages given a question.


The rest of this chapter is organized as follows: Section 3.1 describes language model-
ing for Information Retrieval; Section 3.2 describes the emergence of the statistical
machine translation model for information retrieval from the roots of language modeling;
Section 3.3 describes our passage retrieval methodology; and Section 3.4 concludes the
chapter.

3.1 Language modeling for Information Retrieval

Statistical language modeling, or more simply, language modeling, refers to the task of

estimating a probability distribution that captures statistical regularities of natural language

use. The root of statistical language modeling dates back to the beginning of the 20th cen-

tury, when Markov tried to model letter sequences in works of Russian literature [55]. Zipf
studied statistical properties of text and discovered that the frequency of a word decays as
a power function of its rank. However, it was Shannon’s work [82] that inspired later re-

search in this area. In exploring the application of his newly founded theory of information

to human language, Shannon considered language as a statistical source, and measured how

well simple n-gram models predicted or, equivalently, compressed natural text. To do this,

he estimated the entropy of English through experiments with human subjects, and also es-

timated the cross-entropy of the n-gram models on natural text. For many years, statistical

language models have been used primarily for automatic speech recognition. Since 1980

when the first significant language model was proposed [76], statistical language modeling

has become a fundamental component of speech recognition, machine translation, spelling

correction, and so forth. It has also proven useful for natural language processing tasks

such as natural language generation and summarization. In 1998, it was introduced to

information retrieval and has opened up new ways of thinking about the retrieval process.

A statistical language model is a probability distribution over all possible sentences or

other linguistic units in a language [76]. It can also be viewed as a statistical model for

generating text. The task of language modeling, in general, answers the question: how


likely the ith word in a sequence would occur given the identities of the preceding i − 1

words? In most applications of language modeling, such as speech recognition and infor-

mation retrieval, the probability of a sentence is decomposed into a product of n-gram
probabilities.

Let us assume that S denotes a specified sequence of k words,

S = w_1, w_2, w_3, \ldots, w_k

An n-gram language model considers the word sequence S to be a Markov process
with probability

P_n(S) = \prod_{i=1}^{k} P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-n+1})

Where n refers to the order of the Markov process. When n = 2 we call it a bigram

language model which is estimated using information about the co-occurrence of pairs of

words. In the case of n = 1, we call it a unigram language model which uses only estimates

of the probabilities of individual words. For applications such as speech recognition and

machine translation, word order is important and higher order (usually trigram) models are

used. In Information Retrieval, the role of word order is less clear and unigram models

have been used extensively.

To establish the word n-gram language model, probability estimates are typically de-

rived from frequencies of n-gram patterns in the training data. It is common that many

possible word n-gram patterns would not appear in the actual data used for estimation,

even if the size of the data is huge and the value of n is small. As a consequence, for rare or

unseen events the likelihood estimates that are directly based on counts become problem-

atic. This is often referred to as the data sparseness problem. Smoothing is used to address

this problem and has been an important part in any language model.
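To make the data sparseness problem concrete, the following minimal sketch (our own
illustration) estimates bigram probabilities by maximum likelihood; any word pair unseen
in the training text receives probability zero, which is exactly what smoothing must
repair:

```python
from collections import Counter

# Minimal sketch of maximum-likelihood bigram estimation, illustrating the
# data sparseness problem: any n-gram unseen in training gets probability 0.
def bigram_mle(tokens):
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    def prob(word, prev):
        return bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0
    return prob

p = bigram_mle("the cat sat on the mat".split())
print(p("cat", "the"))  # 0.5 -- "the" is followed by "cat" once out of twice
print(p("dog", "the"))  # 0.0 -- unseen bigram; this is what smoothing fixes
```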

The basic approach for using language models for information retrieval assumes that

the user has a reasonable idea of the terms that are likely to appear in the “ideal” document


that can satisfy his/her information need, and that the query terms the user chooses can

distinguish the “ideal” document from the rest of the collection [68]. The query is thus

generated as a piece of text representative of the “ideal” document. The task of the system

is then to estimate, for each of the documents in the collection, which is most likely to be

the ideal document. That is, we calculate:

\arg\max_{D} P(D \mid Q) = \arg\max_{D} P(Q \mid D) \, P(D)

Where Q is a query and D is a document. The prior probability P(D) is usually as-
sumed to be uniform and a language model P(Q|D) is estimated for every document. In

other words, we estimate a probability distribution over words for each document and cal-

culate the probability that the query is a sample from that distribution. Documents are

ranked according to this probability. This is generally referred to as the query-likelihood

retrieval model and was first proposed by Ponte and Croft [68]. In their work, Ponte and

Croft take a multi-variate Bernoulli approach to approximate P(Q|D). They represent a

query as a vector of binary attributes, one for each unique term in the vocabulary, indicating

the presence or absence of terms in the query. The number of times that each term occurs

in the query is not captured. There are a couple of assumptions behind this approach: 1)

the binary assumption: all attributes are binary. If a term occurs in the query, the attribute

representing the term takes the value of 1. Otherwise, it takes the value of 0. And, 2) the
independence assumption: terms occur independently of one another in a document. These
assumptions are the same as those underlying the binary independence model proposed in
earlier probabilistic information retrieval work [73, 92]. Based on these assumptions, the
query likelihood P(Q|D) is thus formulated as the product of two probabilities - the prob-
ability of producing the query terms and the probability of not producing other terms.

P(Q \mid D) = \prod_{w \in Q} P(w \mid D) \prod_{w \notin Q} \bigl( 1 - P(w \mid D) \bigr)

Where P(w|D) is calculated by a non-parametric method that makes use of the aver-


age probability of w in documents containing it and a risk factor. For non-occurring terms,

the global probability of w in the collection is used instead. It is worth mentioning that

collection statistics such as term frequency and document frequency are integral parts of

the language model and not used heuristically as in traditional probabilistic and other ap-

proaches. In addition, document length normalization does not have to be done in an ad hoc

manner as it is implicit in the calculation of the probabilities. This approach to retrieval,

although very simple, has demonstrated superior performance to traditional probabilistic

retrieval using the Okapi-style tf-idf weighting [74] on TREC test collections. An 8.7%

improvement in performance (measured in average precision) is reported. This finding is

important because with few heuristics the simple language model can do at least as well as

one of the most successful probabilistic retrieval models previously available with heuristic

tf-idf weighting.

In contrast to Ponte and Croft’s approach, Hiemstra [30], Miller et al. [58], and Song

and Croft [85] employed a multinomial view of the query generation process. They treat

the query Q as a sentence of independent terms (i.e. Q = q_1, q_2, q_3, \ldots, q_m) taking into

account possibly multiple occurrences of the same term. The “ordered sequence of terms

assumption” behind this approach states that both queries and documents are defined by an

ordered sequence of terms [30]. A query of length k is modeled by an ordered sequence of

k random variables, one for each term occurrence in the query. While this assumption is not

usually made in traditional probabilistic information retrieval work, it has been essential for

many statistical natural language processing tasks (e.g. speech recognition). Based on this

assumption, the query generation probability can be obtained by multiplying the individual

term probabilities.

P(Q \mid D) = \prod_{q_i \in Q} P(q_i \mid D)

Where q_i is the ith term in the query. Though arrived at through different theoretical deriva-
tions, these models all compute P(w|D) (with w denoting any term) in a similar way -


combining a component estimated from the document and one from the collection by linear

interpolation.

P(Q \mid D) = \prod_{q_i \in Q} \bigl( \alpha P(q_i \mid D) + (1 - \alpha) P(q_i \mid C) \bigr)

Where P(q_i|C) is the probability that q_i appears in the collection and α is a weighting
parameter which lies between 0 and 1. This can also be viewed as a combination of infor-
mation from a local source, i.e. the document, and a global source, i.e. the collection. The
differences between these models reside in how P(q_i|D) and P(q_i|C) are estimated.
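As an illustration of this family of models, the following minimal sketch (our own,
with illustrative names) scores and ranks documents by Jelinek-Mercer smoothed query
likelihood, assuming every query term occurs at least once in the collection:

```python
from collections import Counter
from math import log

# Minimal sketch of query-likelihood ranking with Jelinek-Mercer smoothing,
# following the interpolation formula above. We assume every query term
# occurs at least once in the collection, so the smoothed probability is > 0.
def ql_score(query_terms, doc_tokens, col_counts, col_len, alpha=0.95):
    doc_counts, doc_len = Counter(doc_tokens), len(doc_tokens)
    score = 0.0
    for q in query_terms:
        p_doc = doc_counts[q] / doc_len if doc_len else 0.0
        p_col = col_counts[q] / col_len
        score += log(alpha * p_doc + (1 - alpha) * p_col)  # log for stability
    return score

def rank(query_terms, docs, col_counts, col_len):
    return sorted(docs, reverse=True,
                  key=lambda d: ql_score(query_terms, d, col_counts, col_len))
```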

The basic model has been extended in a variety of ways. For example, documents have
been modeled as mixtures of topics [33] and phrases have been considered [85]. Progress has also

been made in understanding the formal underpinnings of the statistical language modeling

approach, and comparing it to traditional probabilistic approaches. Connections were found

and differences identified. Recent work has seen more sophisticated models developed that

are more closely related to the traditional approaches. For example, a language model that

explicitly models relevance [46] has been proposed, and a risk minimization framework

based on Bayesian decision theory has been developed [44]. Successful applications of

the language modeling approach to a number of retrieval tasks have also been reported,

including cross-lingual retrieval [45, 97] and distributed retrieval [83, 96]. Research car-

ried out by a number of groups has confirmed that the language modeling approach is a

theoretically attractive and potentially very effective probabilistic framework for studying

information retrieval problems.

3.2 Statistical Machine Translation Model for Information Retrieval

Berger and Lafferty [3] have extended the multinomial language model as a translation

model for information retrieval. Within this model the query generation process is viewed


as a translation or distillation from a document. To determine the relevance of a document

to a query, this model estimates the probability that the query would have been generated as

a translation of that document. So for a given query, documents in the collection are ranked

according to these probabilities. More specifically, the mapping from a document term w
to a query term q_i is achieved by estimating translation models P(q_i|w). Using translation

models, the retrieval model becomes

P(Q \mid D) = \prod_{q_i \in Q} \Bigl( \alpha \sum_{w \in D} P(q_i \mid w) \, P(w \mid D) + (1 - \alpha) \, P(q_i \mid C) \Bigr)

Where P(q_i|w) is an entry in the translation model, which is typically learned from a par-
allel corpus consisting of queries and documents relevant to those queries. The learned
translation model consists of triples: the query word, the document word and the probabil-
ity of translation. So, the translation model is a quantified mapping between query words

and document words. A notable feature of this model is an inherent query expansion com-

ponent and its capability of handling the issues of synonymy (multiple terms having similar

meanings) and polysemy (the same term having multiple meanings). However, as the trans-

lation models are context independent, their ability to handle the ambiguity of word senses

is only limited. This retrieval model has been adopted in several information retrieval tasks.
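The following minimal sketch (our own illustration, not a reference implementation)
makes the ranking formula above concrete; the translation table is assumed to hold the
learned triples as a mapping from (query word, document word) pairs to probabilities:

```python
from collections import Counter

# Minimal sketch of the translation-based ranking formula above; `table` holds
# the learned triples as {(query_word, doc_word): P(query_word | doc_word)}.
def smt_score(query_terms, doc_tokens, table, col_counts, col_len, alpha=0.95):
    doc_counts, doc_len = Counter(doc_tokens), len(doc_tokens)
    score = 1.0
    for q in query_terms:
        # sum over document words: P(q|w) * P(w|D)
        p_trans = sum(table.get((q, w), 0.0) * c / doc_len
                      for w, c in doc_counts.items())
        p_col = col_counts.get(q, 0) / col_len   # collection background model
        score *= alpha * p_trans + (1 - alpha) * p_col
    return score
```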

Jin et al. [81] constructed language models of document titles and determined the relevance

of a document to a query by estimating the likelihood that the query would have been the ti-

tle for the document. The title of a document is viewed as a translation from that document,

and the title language model is regarded as an approximate language model of the query.

They first estimated a translation model by using all the document-title pairs in a collection.

The translation model is then used for mapping a regular document language model to a ti-

tle language model. In the final step, the title language model estimated for each document

is used to compute the query likelihood, and documents are ranked accordingly. Similarly

for sentence retrieval in QA, Murdock and Croft [62] estimated a translation model for all

the question-sentence pairs in a collection. This model is used to rank the sentences given


a question. However, in our passage retrieval approach we construct multiple, more sophis-
ticated translation models, which we perceive as answer type profiles, i.e., questions from
distinct categories (answer types) have distinct translation models. As described above, the
aim of this approach is that, during retrieval, query words should be expanded inherently
with only their contextually related synonyms, where the context is determined by the
answer type of the question. Hence, during the process of retrieval, based on the answer
type of the question, its corresponding answer type profile is used to retrieve relevant
passages.

3.3 Passage Retrieval

Typically, query expansion has been used to reduce the terminological gap problem in

passage retrieval, which ultimately improves the recall of passages that are relevant to a

given question. Query expansion is the process of reformulating a seed query to improve

retrieval performance in information retrieval operations. In general, queries are expanded

either explicitly, or they are expanded inherently within a passage retrieval methodology. In

explicit query expansion, new terms are added to the original query to bridge terminological

gap between the question and answer containing passages. Different methodologies have

been proposed to expand queries by utilizing the top N ranked passages (pseudo-relevance
feedback) [24] or by utilizing external knowledge sources like WordNet, encyclopedias or the
Web [98]. In implicit query expansion, the original query remains unchanged, but during

the process of retrieval semantic variants of original query terms like their stems [4] or

morphological root forms are considered. However, most of these techniques do not take
the context of the question into consideration while expanding the seed queries.

In this section, we describe how we performed passage retrieval leveraging the de-

scribed SMT model for information retrieval. The aim of this approach is that, during


retrieval, query words should be expanded inherently with only their contextually related

synonyms, where the context is determined by the answer type of the question. Our

methodology includes two phases: offline phase and online phase. In the offline phase,

multiple translation models are constructed, each one for a category (i.e. answer type) of

questions. We perceive each such translation model as an answer type profile. In order

to build these profiles, we utilized statistical alignment models which maximize the
probability of the observed (question, sentence) text pairs using the Expectation
Maximization algorithm. After the maximization process is completed, the word level alignments are set

to maximum posterior predictions of the model. This phase includes the following steps:

construction of parallel corpus, semantic categorization of questions based on their answer

types, and building answer type profiles. During the online phase, answer type profiles are

incorporated into the SMT framework to retrieve a ranked set of passages given a ques-

tion. This entire retrieval process is illustrated in Figure 3.1 (passage retrieval using Answer
Type Profiles), and a detailed description of the individual steps in estimation and ranking
is given below.


3.3.1 Parallel corpus

In SMT systems, a bilingual parallel corpus, typically aligned at the word, sentence, para-
graph or document level, is used to build translation models. On this aligned bilingual corpus, a

statistical model, which maximizes the probability of the observed source and target lan-

guage text pairs, is used to learn translations. Similarly, translation models for monolingual

information retrieval are learned based on the following notion - queries and documents are

from different languages. Here, queries are considered as samples from concise language

and documents as samples from verbose language. In the case of a natural language ques-

tion as a query, where the information is focused, sentences with the answer to the question

are better translation samples than full documents. This is because there can be a lot of

noisy terms in the document which need not be right in the context of the question. So,

a parallel corpus consisting of questions and sentences with answers to those questions is

required to learn answer type profiles.

Each year, at the conclusion of the Question Answering track at TREC, NIST 1 releases
a set of (question, document id, answer) triples for all the questions in the test set. Using

this resource, Kaisser and Lowe [36] developed a Question Answer Sentence Pair (QASP)

corpus to foster research in QA. They identified sentences which contain answers using

Amazon’s Mechanical Trunk, an “artificial artificial intelligence” web service. The corpus

consists of questions from TREC Question Answering track test sets for the years 2002 to

2006, and sentences consisting of answers from the AQUAINT corpus. Table 3.1 shows a
quantitative overview of the QASP parallel corpus.

3.3.2 Question Classification

The goal of question classification is to identify the answer type of a given question. In our

approach, this is a key component in both the offline and online phases. In the offline
phase, it helps in categorizing questions from the training set based on

1http://trec.nist.gov/data/qamain.html


Year     No. factoid questions    No. sentence pairs    Average no. sentences

2002     429                      2,006                 4.67
2003     354                      1,448                 4.09
2004     204                      865                   4.24
2005     319                      1,456                 4.56
2006     352                      1,405                 3.99
TOTAL    1,911                    7,180                 4.33

Table 3.1 Quantitative overview of the QASP parallel corpus

their answer types. In the online phase, given a question, it helps in identifying the answer

type. Based on this answer type, the corresponding ATP is used to rank passages.

Different approaches have been proposed for the task of question classification. Ear-
lier approaches to this task used manually constructed sets of rules and heuristics to map
a question to an answer type. These approaches range from using only the surface form
of questions to using tagging, parsing and semantics. Obviously, these approaches require
a tremendous amount of tedious work to achieve reasonable accuracy. So, the focus has
shifted towards machine learning approaches which can automatically construct a

high performance question classifier. Zhang and Lee [103] experimented with five classifi-

cation algorithms -

1. Nearest Neighbours: Given a question, this algorithm [99] finds its nearest neigh-

bours among the training examples, and uses the dominant class label of these nearest

neighbours as its class label.

2. Naive Bayes: The basic idea of this algorithm [56] is to estimate the parameters

of a multinomial generative model for instances, then find the most probable class
for a given instance using Bayes’ rule and the Naive Bayes assumption that the
features occur independently of each other inside a class.


3. Decision Tree: This algorithm [69] is a method for approximating a discrete valued
target function, in which the learned function is represented by a tree of arbitrary

degree that classifies instances.

4. Sparse Network of Winnows: This algorithm [77] is specifically tailored for learn-

ing in the presence of a very large number of features, and the learned model is a
sparse network of linear functions.

5. Support Vector Machines: The main idea of Support Vector Machines (SVM) [15]

is to find a decision surface that separates the positive and negative examples while

maximizing the minimum margin. The margin is defined as the distance between the

decision surface and the nearest positive and negative training examples.

Their experimental results showed that SVM outperformed the other four methods.

Coarse    Fine

ABBR      abbreviation, expansion
DESC      definition, description, manner, reason
ENTY      animal, body, color, creation, currency, disease/medical, event,
          food, instrument, language, letter, other, plant, product,
          religion, sport, substance, symbol, technique, term, vehicle, word
HUM       description, group, individual, title
LOC       city, country, mountain, other, state
NUM       code, count, date, distance, money, order, other, percent,
          period, speed, temperature, size, weight

Table 3.2 The coarse and fine grained answer types

For the evaluation of our passage retrieval methodology, we built a question clas-
sifier using SVM. The classifier is trained on a standard data set provided by UIUC [17].


It has about 5,500 questions for training and 500 questions for testing which are manually

labeled into 6 coarse grained and 50 fine grained answer types in a two level taxonomy [17]

as shown in Table 3.2. The classifier, when evaluated for coarse grained classification on
the 500 test questions, produced an accuracy of 86.8% using bag-of-words features.
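As an illustration only, a bag-of-words SVM question classifier in the spirit of the one
described above can be sketched with scikit-learn (the toy training data and library
choice are ours, not the actual UIUC-trained classifier):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Minimal sketch of a bag-of-words SVM question classifier; the toy data is
# illustrative and far smaller than the 5,500-question UIUC training set.
questions = ["Where was Mahatma Gandhi born ?",
             "When was Paul Krugman born ?",
             "Who discovered penicillin ?"]
labels = ["LOC", "NUM", "HUM"]           # coarse grained answer types

classifier = make_pipeline(CountVectorizer(), LinearSVC())
classifier.fit(questions, labels)
print(classifier.predict(["Where is the Taj Mahal ?"]))   # expected: ['LOC']
```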

3.3.3 Learning Answer type profiles

A translation model is learned for every category (answer type) of questions using the

QASP parallel corpus described above. We perceive each such translation model as an

ATP. For instance, a translation model learned over a parallel corpus with only NUMBER
type questions and their corresponding answer containing sentences is perceived as NUM-

BER type profile. In order to learn these profiles, statistical alignment models which maxi-

mize the probability of the observed (question, sentence) text pairs using Expectation Max-

imization algorithm are used. After the maximization process is completed, the word level

alignments are set to maximum posterior predictions of the model to produce triples: ques-

tion word, sentence word, probability of translating the sentence word into the question

word.

Several statistical alignment models like IBM Models 1-5, Hidden Markov Model

alignment model etc. have been proposed to build translation models. Out of these models,

earlier works [3] have shown that IBM model 1 [7] is more suited for information retrieval.

In generating translations, IBM model 1 considers all alignments equally likely, and ig-

nores subtler aspects of the language being used. We used GIZA++ [63], an implementation
of the IBM alignment models [7], for building ATPs. Sample profiles for LOCATION and

NUMBER types are shown in Table 3.3 and Table 3.4 respectively.
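For illustration, the following minimal sketch implements IBM Model 1 EM training on
(question, sentence) word pairs; the thesis system uses GIZA++ for this step, so the
code below is only a toy stand-in:

```python
from collections import defaultdict

# Minimal sketch of IBM Model 1 EM training on (question, answer-sentence)
# word pairs. Returns t[(q, s)] ~ P(question word q | sentence word s).
def ibm_model1(pairs, iterations=10):
    t = defaultdict(lambda: 1.0)              # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)            # expected co-occurrence counts
        total = defaultdict(float)
        for question, sentence in pairs:      # E-step
            for q in question:
                norm = sum(t[(q, s)] for s in sentence)
                for s in sentence:
                    c = t[(q, s)] / norm
                    count[(q, s)] += c
                    total[s] += c
        for (q, s), c in count.items():       # M-step: renormalise
            t[(q, s)] = c / total[s]
    return t

# With many (question, sentence) pairs, t sharpens toward true translations.
pairs = [(["where", "born"], ["his", "birthplace", "was", "porbandar"])]
profile = ibm_model1(pairs)
```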

Word          Probability

hometown      0.081337
immigrant     0.0322747
birthplace    0.0244121
competitor    0.0244121
career        0.0242433
birthday      0.0108326

Table 3.3 Translations for the word “born” in the LOCATION profile

Word          Probability

born          0.330707
youngest      0.0147641
grandson      0.0147641
nursing       0.0134934
biography     0.00987116
birthdate     0.00492135

Table 3.4 Translations for the word “born” in the NUMBER profile

3.3.4 Passage Ranking

This is the online phase of our approach. In this phase, using the ATPs, passages that are
relevant to a question are retrieved. To determine the relevance of a passage to a question,

the probability that a question would have been generated as a translation of that passage is

estimated. Passages are ranked according to these probabilities. The relevance of a passage

A returned for question Q with answer type t_j (where 1 ≤ j ≤ 6 for coarse grained
classification; 1 ≤ j ≤ 50 for fine grained classification) is computed using its profile
ATP_j as shown in the equation below.

P(Q \mid A, t_j) = \prod_{q_i \in Q} \Bigl( \alpha \sum_{w \in A} P(q_i \mid w, ATP_j) \, P(w \mid A) + (1 - \alpha) \, P(q_i \mid C) \Bigr)    (3.1)


Where P(q_i|w, ATP_j) is an entry in ATP_j, P(w|A) is the probability of word w in the
passage A, P(q_i|C) is the probability that q_i appears in the AQUAINT collection and α is
the weighting parameter which lies between 0 and 1. In general, passage retrieval depends

heavily on the overlap between the query and passage vocabularies. As the aim of our

approach is to reduce the terminological gap problem, we accommodate a special condition

which Murdock and Croft [62] have used for ranking sentences given a question. According

to this condition, translations of passage terms to a query term are only considered when

the query term is not present in that passage. This is based on the assumption that when a
passage already contains a query term, that term is not a source of the terminological gap
problem. The mathematical representation of this condition is given in the equation

below.

\sum_{w \in A} P(q_i \mid w, ATP_j) \, P(w \mid A) = t_i \, P(q_i \mid A) + (1 - t_i) \sum_{w \in A} P(q_i \mid w, ATP_j) \, P(w \mid A)

Where t_i = 1 when q_i appears in the passage A, and 0 otherwise. This condition states that
the probability of a query term translating to itself is equal to 1, while ensuring that the
translation probabilities sum to one. Passages are finally ranked by accommodating this
special condition into Equation 3.1.

3.4 Conclusion

Passage retrieval is a key component in a QA system. Unlike typical passage retrieval

methodologies, which match the exact query terms on to the passages, our methodology
leverages the SMT framework. In this framework, a precomputed mapping between query
terms and passage terms is used to rank passages given a question. Our methodology
does not rely on any external knowledge sources like WordNet, encyclopedias or the Web to
enhance passage retrieval performance. Instead, it uses previously answered questions
and their answering sentences to rank passages given a question. So, this can be


considered as an alternative passage retrieval methodology. An empirical evaluation of this

methodology is described in the next chapter.


Chapter 4

Evaluation

In this chapter, we describe the experiments conducted to evaluate our passage retrieval
approach, described in the previous chapter. First, we give a detailed description

of the evaluation metrics and data sets used in the experiments. Then, we describe the

experimental setup and all the three experiments conducted to analyze the performance of

our approach. Finally, we discuss the effectiveness of our approach based on a detailed

analysis of the obtained results.

4.1 Evaluation Metrics

Many evaluation metrics like precision, recall, F-measure, mean average precision, bpref,

mean reciprocal rank, total document reciprocal rank etc. have been proposed to measure

the performance of information retrieval systems. All these measures assume a ground

truth notion of relevancy: every document is known to be either relevant or non-relevant

to a given query. In the context of passage retrieval for QA, a relevant passage refers to

an answer containing passage. Hence, given a question, every passage is marked either

relevant or non-relevant depending on whether it contains the answer or not. Even though
there are many metrics, based on the nature of the IR system, only a subset of them is
used during evaluation. In the context of QA, the following three metrics are widely


used to evaluate the passage retrieval component.

4.1.1 Average precision at 1

Given a question, precision at 1 is 0 if the passage retrieved at rank one is non-relevant,

and 1 if it is relevant. During evaluation, this score is averaged over all questions before

comparing different approaches. Let |Q| be the total number of questions in the test set,

and n be the total number of questions whose relevant passage appears at rank one. Then,

average precision at rank one (Prec@1) is defined as follows.

\mathrm{Prec@1} = \frac{n}{|Q|}

Average precision at 1 measures the proportion of questions for which a correct answer ap-

pears in the first retrieved passage. Alternatively, if passage retrieval is the last component

in the pipeline architecture of a QA system, then this metric reflects the performance of the

entire QA system.

4.1.2 Mean Reciprocal Rank

Mean Reciprocal Rank (MRR) at N is the mean of the inverse rank of the highest ranked
correct answer, if that answer appears in the top N. Each question receives a score that is the

reciprocal of the rank (i.e. 1/2 if the rank is 2, 1/3 if the rank is 3 etc.) at which the first

relevant passage is found or 0 if no relevant passage is found within top N ranked passages.

Let |Q| be the total number of questions in the test set, and ri be the rank of the first relevant

passage for the ith question. Then, MRR at N is defined as follows.

\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{r_i}

Where r_i ≤ N. This metric is typically used to measure the performance of complete QA

systems. Earlier, TREC used MRR@5 scores, and in recent years it used MRR@1
scores to rank the different QA systems which participated in the QA task. However, in our


experiments we measure MRR for top 20 passages, which is the standard adopted by most

of the previous works on passage retrieval.

4.1.3 Total Document Reciprocal Rank

Total Document Reciprocal Rank (TDRR) extends MRR with a notion of recall. It is the

sum of all reciprocal ranks of all answer bearing passages per question (averaged over all

questions) and attains its maximum if all retrieved passages are relevant. Similar to MRR, it

is measured over the top N ranked passages for a given question. Let |Q| be the total number
of questions in the test set, and let j_rel denote a rank j at which a relevant passage appears.
Then, TDRR at N is defined as follows.

\mathrm{TDRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \sum_{j=1}^{N} \frac{1}{j_{rel}}

This measure favors a retrieval approach that ranks more than one relevant passage higher

than all non-relevant passages. Similar to MRR, TDRR is also measured for top 20 passages

in our experiments.
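For concreteness, the three metrics can be computed from per-question relevance
judgements as in the following minimal sketch (our own illustration; rels[i][j] marks
whether the passage at rank j+1 for question i is relevant, truncated to the top N):

```python
# Minimal sketch of the three metrics used in our evaluation.
def prec_at_1(rels):
    return sum(1 for r in rels if r and r[0]) / len(rels)

def mrr(rels):
    # reciprocal rank of the first relevant passage, or 0 if none in top N
    first = lambda r: next((1 / (j + 1) for j, x in enumerate(r) if x), 0.0)
    return sum(first(r) for r in rels) / len(rels)

def tdrr(rels):
    # sum of reciprocal ranks of all relevant passages, averaged over questions
    return sum(sum(1 / (j + 1) for j, x in enumerate(r) if x)
               for r in rels) / len(rels)

rels = [[False, True, True], [True, False, False]]    # two toy questions
print(prec_at_1(rels), mrr(rels), tdrr(rels))         # 0.5 0.75 0.9166...
```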

4.2 Data Set

We used TREC 2002 to 2006 QA data sets to test the effectiveness of our passage retrieval

approach. These data sets consist of: AQUAINT corpus, factoid questions, and answer

judgements provided by NIST for these questions. Answers for all the questions in these

data sets have to be drawn from the AQUAINT corpus. Answer judgements help in judg-

ing whether a passage is relevant to a given question or not. A detailed description of

AQUAINT corpus, factoid questions and answer judgements is given below.

4.2.1 AQUAINT Corpus

The AQUAINT corpus includes 1,033,461 documents taken from Associated Press newswire,

the New York Times newswire and the English portion of the Xinhua News Agency newswire.


It was prepared by the Linguistic Data Consortium (LDC) for the AQUAINT project, and

covers Associated Press and New York Times news articles from June 1998 to September

2000, and Xinhua news articles from January 1996 to September 2000. All available ar-

ticles from these periods, which comprise about 3 gigabytes of data, are included in the

corpus. It was used in official benchmark evaluations conducted by NIST.

The entire corpus is divided into three directories (apw, nyt, xie) based on the three

different news sources. Within each directory, the data is further categorized based on

the year the news came from, and within each year, there is one file per date
of collection, where each file contains a stream of SGML-tagged text data presenting

the series of news stories reported on a given date. Each such story is enclosed between

<DOC> and </DOC> tags, and within these tags, the news content is enclosed
between <TEXT> and </TEXT> tags. The news content contains paragraph
markers (i.e. each paragraph is enclosed between <P> and </P> tags) which are

used as passage level boundaries in our experiments. A sample document, that is, a news

story from this corpus is shown in Figure 4.1.
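For illustration, the following minimal sketch splits an AQUAINT-style SGML document
into passages using the paragraph markers described above (a real pipeline would use a
proper SGML parser; the regular expressions assume well-formed tags):

```python
import re

# Minimal sketch: extract <P>-delimited passages from AQUAINT-style SGML.
def extract_passages(sgml):
    passages = []
    for doc in re.findall(r"<DOC>(.*?)</DOC>", sgml, re.S):
        text = re.search(r"<TEXT>(.*?)</TEXT>", doc, re.S)
        if text:
            for p in re.findall(r"<P>(.*?)</P>", text.group(1), re.S):
                passages.append(" ".join(p.split()))   # normalise whitespace
    return passages

sample = "<DOC><TEXT><P>First paragraph.</P><P>Second one.</P></TEXT></DOC>"
print(extract_passages(sample))   # ['First paragraph.', 'Second one.']
```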

4.2.2 FACTOID Questions

The question sets from TREC, for the years 2004-2006, consist of a series of questions for

each of a set of targets. The targets include people, organizations, events and other entities,
and each target has 4-5 factoid questions. The questions from TREC 2002 and 2003

sets are not grouped based on targets, that is, every question is independent of every other

question and their targets reside within them. A quantitative overview of these question

sets is shown in Table 4.1.

The question sets for the years 2004-2006 are in XML format, and the format explicitly

tags the target. Each question is assigned an id of the form X.Y, where X is the
target id and Y is the number of the question in the series. A sample series of questions from
the TREC 2006 question set is shown in Figure 4.2. The question sets for the years 2002 and


Year     No. targets    No. factoid questions    No. questions with answers

2002     -              500                      429
2003     -              413                      354
2004     65             231                      204
2005     75             363                      319
2006     75             404                      352

Table 4.1 Quantitative overview of TREC 2002-2006 question sets. Column 4 denotes
the number of questions for which NIST has provided answer judgements.

Figure 4.2 A sample series of questions from TREC 2006 question set

2003 are in plain text format. Not all the questions in these question sets have answers
in the AQUAINT corpus, so only those questions which have answers in this corpus are
selected during evaluation.


Figure 4.3 A sample segment from TREC 2006 answer judgements

4.2.3 Answer Judgements

Answer judgements released by NIST for the questions in the test set consist of triples:

1. question ID.

2. answer pattern.

3. IDs of the documents in which the answer appears.

A sample segment from TREC 2006 answer judgements file is shown in Figure 4.3.

These answer judgements are essential in determining whether a passage is relevant or

not for a given question. The presence of document IDs in the judgments creates two

scenarios for evaluation: strict and lenient. For strict scoring, the answer pattern must

occur in the passage, and the passage must be from one of the documents listed as relevant

in the answer judgments. For lenient scoring, the answer pattern must occur in the passage.

However, neither evaluation strategy can determine the true performance of a passage

retrieval approach. Strict scoring suffers from false negatives i.e., valid answer containing

passages are scored as incorrect, since the list of document IDs supplemented in answer

judgments is not exhaustive, and lenient scoring suffers from false positives i.e., wrong

answer containing passages are scored as correct, since some of the answer patterns are

not discriminating enough. So, the actual performance of a passage retrieval approach is

somewhere between the scores of these two evaluations. Hence, in our experiments we
compute scores for both cases.
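The strict and lenient decisions for a single retrieved passage can be sketched as
follows (our own illustration, given one answer-judgement triple):

```python
import re

# Minimal sketch of strict vs. lenient judging of a retrieved passage, given
# an answer-judgement triple (answer regex pattern, set of relevant doc IDs).
def judge(passage_text, passage_doc_id, answer_pattern, relevant_doc_ids):
    lenient = re.search(answer_pattern, passage_text) is not None
    strict = lenient and passage_doc_id in relevant_doc_ids
    return strict, lenient

print(judge("Gandhi was born in Porbandar.", "APW19980601.0001",
            r"Porbandar", {"APW19980601.0001"}))   # (True, True)
```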


4.3 Experiments

We conducted three experiments to test the effectiveness of our approach. In the first ex-

periment, we compared the performance of our approach against standard retrieval models.

In the second experiment, we analyzed the performance of our approach on questions from

different categories (answer types). Finally, in the third experiment, we tested the effect

of different statistical alignment models on our approach. Detailed explanations for all the

three experiments are given below.

4.3.1 Retrieval Models

In this experiment we compared the performance of our passage retrieval methodology

against standard retrieval methodologies including vector space models and language mod-

els. Two vector space models, TFIDF and Okapi BM25, and two language modeling
based models, KL-divergence and Indri, were selected for comparing
the results.

TFIDF: The TFIDF weighting scheme is often used in information retrieval. Many

variations of the TFIDF weighting scheme are being used by search engines as a

central tool in computing the relevance between a document and a user query. We

have used a variant of the TFIDF model based on the Okapi TF formula [74].

Okapi BM25: In information retrieval, Okapi BM25 represents the state-of-the-art

retrieval model and is based on the probabilistic retrieval framework developed by

Robertson [75]. It is a ranking function used by search engines to rank matching

documents according to their relevance to a given search query.

KL-divergence: The KL-divergence retrieval model [100] implements the cross en-

tropy of the query model with respect to the document model. It is a standard metric

for comparing distributions, which has proved to work well in IR systems.


Indri: The Indri retrieval model is based on a combination of the language modeling

and inference network [90] retrieval frameworks. Both frameworks, on their own,

have been widely studied, applied, and found to be very effective for a wide range

of retrieval tasks. Indri combines the benefits of these two frameworks to further

enhance retrieval effectiveness of IR systems.

Lemur, a language modeling toolkit, provides implementations of all the above re-
trieval models. Parameters in all these models were set to the default values provided by the
toolkit. Lemur as such does not support passage retrieval, so we segmented documents
into passages using the paragraph markers. Each such passage is considered as an individual

document and indexed separately using the toolkit. A total of five runs were conducted and

in each run, questions from one of the TREC 2002-2006 years were used for testing and the

questions from the rest of the years were used to construct ATPs. IBM model 1, which as-

sumes all possible alignments between source sentence and target sentence equally likely,

was used to construct ATPs. Similar to earlier works [3], we set the α (weighting
parameter) value to 0.95. The average scores of all the five runs are shown in Tables 4.2

and 4.3.

Method          Prec@1    MRR      TDRR

TFIDF           0.172     0.255    0.381
Okapi BM25      0.159     0.235    0.348
KL-divergence   0.175     0.255    0.369
Indri           0.177     0.254    0.376
ATP             0.210     0.287    0.430

Table 4.2 Strict evaluation scores for different passage retrieval methodologies.
ATP denotes the passage retrieval methodology proposed by us.


Method          Prec@1    MRR      TDRR

TFIDF           0.259     0.348    0.705
Okapi BM25      0.227     0.313    0.626
KL-divergence   0.284     0.373    0.750
Indri           0.297     0.381    0.807
ATP             0.311     0.395    0.809

Table 4.3 Lenient evaluation scores for different passage retrieval methodologies.
ATP denotes the passage retrieval methodology proposed by us.

Ans. Type    Prec@1           MRR              TDRR

ABBR         0.125 (0.125)    0.156 (0.178)    0.190 (0.220)
DESC         0.200 (0.144)    0.274 (0.225)    0.390 (0.343)
ENTY         0.165 (0.157)    0.235 (0.226)    0.331 (0.323)
HUM          0.198 (0.193)    0.280 (0.272)    0.432 (0.409)
LOC          0.208 (0.215)    0.311 (0.300)    0.494 (0.467)
NUM          0.243 (0.169)    0.309 (0.244)    0.455 (0.348)

Table 4.4 Strict evaluation scores for different categories (answer types) of questions.
Scores for the Indri retrieval model are enclosed in parentheses.

4.3.2 Answer Types

In our methodology, we build a translation model for every category (answer type) of ques-

tions. Each such translation model is termed an ATP and is used in the SMT framework

to retrieve a ranked set of passages given a question. In this experiment we analyzed the

performance of our methodology on different categories (answer types) of questions using

a similar setup as that of the previous experiment. Among the four retrieval models consid-

ered in the previous experiment, the Indri retrieval model performed best. So, we compared
its results with the results obtained using our methodology. Tables 4.4 and 4.5 show the


average scores for strict and lenient evaluation; scores for the Indri retrieval model are
enclosed in parentheses.

Ans. Type    Prec@1           MRR              TDRR

ABBR         0.250 (0.125)    0.250 (0.198)    0.468 (0.443)
DESC         0.256 (0.322)    0.343 (0.401)    0.683 (0.836)
ENTY         0.287 (0.278)    0.368 (0.368)    0.747 (0.802)
HUM          0.325 (0.315)    0.421 (0.408)    0.931 (0.927)
LOC          0.369 (0.362)    0.467 (0.444)    1.054 (1.045)
NUM          0.304 (0.262)    0.376 (0.341)    0.688 (0.625)

Table 4.5 Lenient evaluation scores for different categories (answer types) of questions.
Scores for the Indri retrieval model are enclosed in parentheses.

4.3.3 Alignment Models

In this experiment we tested the effect of different statistical alignment models on our passage retrieval approach: the first experiment was repeated with different statistical alignment models used to construct the ATPs. We used IBM model 1 and GIZA++ alignment with default parameters to construct these ATPs. IBM model 1 assumes that all possible alignments between the source and target sentences are equally likely, while the GIZA++ alignment model is a mixture of IBM model 1, the HMM alignment model, IBM model 3 and IBM model 4. The average strict and lenient scores for each alignment model are shown in Table 4.6.


Strict Evaluation
Model         Prec@1   MRR     TDRR
IBM Model 1   0.210    0.287   0.430
GIZA++        0.210    0.286   0.431

Lenient Evaluation
Model         Prec@1   MRR     TDRR
IBM Model 1   0.311    0.395   0.809
GIZA++        0.304    0.388   0.802

Table 4.6: Strict and lenient evaluation scores for different statistical alignment models.

4.4 Discussion

Typical retrieval methodologies in information retrieval, such as vector space models and language modeling, directly match exact query terms against documents. In contrast, a statistical machine translation model for information retrieval leverages a precomputed mapping between query terms and document/passage/sentence terms to quantify the relevance of a document given a query. Such a mapping addresses the problem of synonymy in information retrieval. Consequently, applying a statistical machine translation model to passage retrieval resulted in significant improvements over standard retrieval models like TFIDF.
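As an illustration, the following Python sketch shows how a passage might be scored under an IBM Model 1 style translation model for retrieval. The function, the layout of the translation table (e.g., an ATP read into a dictionary keyed by term pairs), and the smoothing constant are our illustrative assumptions, not the exact implementation used in this thesis.

    import math
    from collections import Counter

    def translation_score(question_terms, passage_terms, trans_prob, eps=1e-9):
        # log p(Q|A) = sum over q of log [ sum over a of t(q|a) * p(a|A) ],
        # where trans_prob maps (query_term, passage_term) -> t(q|a),
        # e.g. loaded from an answer type profile (ATP).
        counts = Counter(passage_terms)
        total = float(sum(counts.values()))
        log_score = 0.0
        for q in question_terms:
            p_q = sum(trans_prob.get((q, a), 0.0) * (c / total)
                      for a, c in counts.items())
            log_score += math.log(p_q + eps)  # eps smooths terms with no mapping
        return log_score

    # Passages are then ranked in decreasing order of translation_score.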

Along with synonymy, our methodology also addresses the problem of polysemy, i.e., a term having different meanings in different contexts. We solve this problem by constructing distinct translation models for distinct categories (answer types) of questions. For example, given the questions Q1: When was Paul Krugman born? and Q2: Where was Paul Krugman born?, our methodology uses the NUMBER profile for Q1 and the LOCATION profile for Q2. Looking at the LOCATION and NUMBER profiles for the word born in Tables 3.3 and 3.4 respectively, we can observe that the word born is mapped to location-related terms


like “hometown” and “birthplace” with high probabilities in the LOCATION profile, and to date-related terms like “birthdate” in the NUMBER profile. This indicates that our methodology of using multiple translation models addresses the problem of polysemy.

Results from the first experiment showed that our approach outperformed other standard retrieval models, including vector space models and language modeling, especially under strict evaluation. Among the retrieval models considered in the first experiment, the best performer was Indri, a state-of-the-art retrieval methodology used for both document and passage retrieval. These improvements can be attributed to the ability of our methodology to overcome the problems of synonymy and polysemy to some extent.

In the second experiment, we analyzed our methodology on different coarse-grained categories of questions. It showed larger improvements in retrieval performance for NUM type questions than for questions from the rest of the categories. We believe this is because a large fraction of the questions in the TREC 2002-2006 data sets are NUM type questions, which facilitated the construction of a highly accurate profile. Also, under lenient evaluation, the performance of our methodology for DESC, ENTY, and LOC type questions is either similar to or slightly lower than the Indri retrieval model. But the strict evaluation scores for the same categories show otherwise: there our methodology performed considerably better than the Indri retrieval model. Even for questions from the other three types, the differences in performance between our methodology and the Indri retrieval model are larger under strict evaluation than under lenient evaluation. Obtaining similar performance under lenient evaluation and better performance under strict evaluation shows that our approach reduces false positives in the set of retrieved passages.

From the third experiment we found that simple alignment models like IBM model 1 performed better than GIZA++ alignment with default parameters. We believe this is because IBM model 1 is more suited to information retrieval, as the subtler aspects of language needed for machine translation can be ignored in information retrieval. Question classification is a component common to both the offline and online phases. Improving the performance of this component could further enhance the effectiveness of our methodology.


This task of investigating the impact of question classification accuracy on the retrieval

performance is left for future work.

4.5 Conclusion

We conducted experiments on the TREC 2002-2006 QA data sets to evaluate our passage retrieval approach using answer type profiles. These experiments showed that our methodology outperformed standard retrieval methodologies including TFIDF, Okapi BM25, KL-divergence and Indri. We found that simple statistical alignment models like IBM model 1 are more suited for passage retrieval in QA. We also showed that our methodology addresses the problems of synonymy and polysemy in information retrieval.


Chapter 5

Query Expansion Using Wikipedia

Query expansion is a widely used technique in information retrieval to enhance retrieval performance by reformulating the seed query, either by adding new terms or by re-weighting the original terms. Previous works have shown that substantial improvements can be achieved by expanding short and incomplete queries. In the scenario of passage retrieval for QA systems, the aim of query expansion is to reduce the query/passage mismatch by expanding the query using words or phrases with a similar meaning or some other statistical relation to the set of relevant passages. The hypothesis is that cleverly designed query expansion techniques will improve the recall of passages that are relevant to the query: with query expansion there will be more relevant passages in the list of retrieved passages, and they will be better ranked, than without it. Improving passage retrieval in this way would provide the best possible input to a downstream answer extraction component in a pipeline QA system. Typical query expansion techniques like relevance feedback and knowledge-based techniques have performed well in many IR applications. But, as described by Derczynski et al. [20], most of these techniques were not successful in the context of QA.

In this chapter, we describe a novel query expansion method using Wikipedia. Wikipedia is the leading open encyclopedia, with wide coverage of diverse topics, events, entities, etc. It is a reliable data source and has found use in many applications [1]. Another


factor that motivated us to use it is a simple experiment we conducted using the TREC 2006 QA test set [19]. The test set consists of question series, where each series asks for information regarding a particular target. The targets in the test set include people, organizations, events and other entities. Because of low data redundancy in Wikipedia, the coverage of its articles is directly proportional to the size of their text content. So, in this experiment we examined the size of the text content present in Wikipedia for each target; the results are shown in Table 5.1.

Target           Count
Rich Content     64
Partial Content  8
Zero Content     3

Table 5.1: Targets from the TREC 2006 QA test set.

As seen in Table 5.1, every target in the test set is classified into one of three classes: rich content, partial content and zero content. Targets whose Wikipedia articles have significant text content are classified as rich content, and targets with only brief text content are classified as partial content. The remaining targets, which do not have articles of their own but are described briefly within a related article, are classified as zero content targets. This classification was done manually for all 75 targets in the test set. From Table 5.1, we can observe that most of the targets (64 out of 75) have rich content in Wikipedia. So, we used it as a knowledge source for our query expansion method. Apart from the content of Wikipedia, we also used its structured information in our method. Different forms of structured information exist in Wikipedia, including link structure, tables, and category structure. In our query expansion method, we use the link structure and category structure of Wikipedia. Each article in Wikipedia belongs to one or more categories, and the links between articles signify a semantic relationship between the source and target articles.


The remainder of this chapter is organized as follows: Section 1 describes our query expansion method using Wikipedia; Section 2 describes the experiments conducted; Section 3 discusses the experimental results; and Section 4 concludes the chapter.

5.1 Methodology

Our query expansion method first defines a Query Expansion Term Space (QETS) and

then selects terms in this space based on proximity between terms and category information

of outlink pages in Wikipedia. The query expansion term space consists of terms which

could enhance the performance of passage retrieval for a given question. Ideally, these

terms should have the following two properties.

1. Be semantically related to the terms in the seed query.

2. Be good at discriminating between relevant and non-relevant passages.

Thus, defining the QETS plays a major role in query expansion methods, and it depends on different factors. In the case of document retrieval, query expansion methods are intended to bridge the gap between a high-level general topic (expressed by the query) and the more nuanced facets of that topic likely to be written about in the documents. So, they use terms from top-ranked documents or user-selected documents to form the QETS for a given query. But in the case of QA, query expansion methods are intended to rank the answer-containing passages higher, as only a fixed number of top-ranked passages are considered to find the answer. So, constructing the QETS with terms that are semantically related to the question could help in better ranking of the answer-containing passages.

We use the content of Wikipedia to define the QETS for a given question in the following way. First, the Wikipedia article (A) corresponding to the question target is found, and then a set of sentences (S) from this article which contain question keywords is found. This process of retrieving sentences relevant to a question is similar to that of passage


retrieval. The terms in these sentences, excluding stopwords and question keywords, constitute the QETS for a given question. It also includes terms in the anchor text of outlinks from the relevant sentences. Each term in this QETS is weighted based on its semantic relatedness to the question, and the strength of this semantic relation is captured using a linear combination of a proximity score and an outlink score, as shown in the equation below.

score(t ∈ QETS) = ps(t, Q) + ls(t, C) (5.1)

where t is a term in the QETS, Q is the string of keywords in the question, C is the category information of an outlink page, and ps(t, Q) and ls(t, C) are the proximity and outlink scores of term t. The significance and computation of the proximity and outlink scores are described below.

5.1.1 Proximity score

Term proximity has been extensively studied for ranking documents. The basic idea underlying these studies is that, in a relevant document, query terms appear relatively close to each other. Based on this simple idea, several document scoring techniques have been proposed. For example, Clarke et al. [11] described a document scoring technique based on term proximity and density. Their document scoring function prioritizes documents which have either the shortest block of text containing all query terms or many blocks of text containing all query terms. They showed that this scoring function performed better than some standard scoring methods when evaluated on TREC data sets. Also, proximity-based document ranking is more suitable for distributed information retrieval systems because it does not rely on collection-dependent statistics such as inverse document frequency and average document length.

Similarly, term proximity has also been exploited for selecting new terms in query expansion techniques. The assumption behind selecting terms based on proximity scores is that semantically related terms are usually located in proximity, and the distance between two terms could indicate the strength of their association. Arguably, the semantic


relatedness between terms weakens as the distance separating them increases. In general, proximity between two terms is computed over contiguous blocks of text, where each block is known as a window. Here, we compute the proximity score for a term in the QETS by combining its frequency and its minimum distance to any keyword in the question over a fixed window of a single sentence. Normally, within a sentence most terms occur only once. So, effectively, the proximity score of a term is a summation, over all the relevant sentences (S) found in Wikipedia, involving its minimum distance to a keyword in the question. Each term in the QETS is weighted using the equation below.

ps(t ∈ QETS, Q) = \sum_{i=1}^{|S|} tf_{s_i}(t) \cdot \frac{1}{d_{s_i}(t, Q)}

where tf_{s_i}(t) is the term frequency of t in sentence s_i and d_{s_i}(t, Q) is the minimum distance between term t and any keyword from Q.
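A minimal Python sketch of this computation, assuming the relevant sentences S are given as lists of tokens; the names and the guard against zero distance are our own assumptions:

    def proximity_score(term, question_keywords, sentences):
        # ps(t, Q): over each relevant sentence, add tf(t) * 1/d(t, Q),
        # where d is the minimum token distance from t to any question keyword.
        score = 0.0
        for tokens in sentences:
            t_pos = [i for i, w in enumerate(tokens) if w == term]
            q_pos = [i for i, w in enumerate(tokens) if w in question_keywords]
            if not t_pos or not q_pos:
                continue
            d = min(abs(i - j) for i in t_pos for j in q_pos)
            score += len(t_pos) * (1.0 / max(d, 1))  # guard against d = 0
        return score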

5.1.2 Outlink score

The links between articles in Wikipedia signify a semantic relationship between the source and target articles, and each article belongs to one or more categories. The outlink scoring method exploits both the link and category structures of Wikipedia to rank the terms in the QETS. Not all the outlinks present in the relevant sentence set (S) are semantically related to the question. In order to find only the outlinks semantically related to the question, the category information of the outlink pages is used. Only those outlinks whose category information matches the question are considered semantically relevant. For example, given the question “Which position did Warren Moon play in professional football?”, only the outlinks that fall into any one of the categories “position/play/football/professional” are considered semantically relevant to the question. Finally, all the terms from the anchor texts of these relevant outlinks are weighted based on their frequencies in the relevant sentences (S), as shown in the equation below. For the rest of the terms in the QETS, the outlink score is zero.

ls(t ∈ QETS, C) = tf_S(t)
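In code, the outlink score reduces to a frequency count gated by the category check. The category-matching heuristic below (keyword overlap with category labels) is our assumption of how "matching the question" could be implemented, not a detail stated in the thesis:

    def is_relevant_outlink(outlink_categories, question_keywords):
        # Keep an outlink when any term of its page's category labels
        # overlaps the question keywords.
        cat_terms = {w for c in outlink_categories for w in c.lower().split()}
        return bool(cat_terms & {k.lower() for k in question_keywords})

    def outlink_score(term, relevant_anchor_terms, sentences):
        # ls(t, C) = tf_S(t) if t comes from the anchor text of a
        # semantically relevant outlink, and 0 otherwise.
        if term not in relevant_anchor_terms:
            return 0.0
        return float(sum(tokens.count(term) for tokens in sentences))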


The final score of every term in the QETS is computed using Equation 5.1, and the terms are sorted by these scores. After sorting, the top N terms are picked for query expansion. The top 10 query expansion terms for the sample question “Which position did Warren Moon play in professional football?” from the TREC 2006 QA data set are shown in Table 5.2. One of the query expansion terms, “quarterback”, is the name of a position in football, and the other terms are also semantically related to the keywords in the question. In the rest of this chapter, we use the term expansion length (el) to denote the number of terms considered for query expansion. To balance the length of the original query against the expansion length, the latter is made proportional to the number of terms in the former.

el = α × |Q|    (5.2)

where α is a constant and |Q| is the number of terms in the query. So, for short queries the expansion length will be small, and for long queries it will be large.
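Putting Equations 5.1 and 5.2 together, term selection can be sketched as follows; the score functions are the sketches above, and the function and parameter names are ours:

    def select_expansion_terms(qets, question_keywords, sentences,
                               relevant_anchor_terms, alpha=8):
        # score(t) = ps(t, Q) + ls(t, C); keep the top el = alpha * |Q| terms.
        scores = {t: proximity_score(t, question_keywords, sentences) +
                     outlink_score(t, relevant_anchor_terms, sentences)
                  for t in qets}
        el = alpha * len(question_keywords)
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[:el]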

quarterback    surpassed
league         canadian
american       record
completions    attempts
touchdowns     unmatched

Table 5.2: Top 10 expansion terms for the question “Which position did Warren Moon play in professional football?”.

The Boolean model allows fine-grained control over query expansion. Tellex et al. [88], in their study of different passage retrieval algorithms, found that Boolean querying schemes perform well in the QA task. So, we use the Boolean model to form the expanded query from the original query, with appropriate weights. The expanded Boolean query is a combination of the question target, the keywords in the question, and the expansion terms from Wikipedia. Finally, the expanded Boolean query is given to the passage retrieval engine, which searches for


relevant paragraphs that are likely to contain the answer.
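A sketch of how such an expanded Boolean query could be assembled for Lucene's query syntax is shown below; the boost weights are illustrative assumptions, not values from the thesis, while '^' (boost) and OR are standard Lucene query operators:

    def build_expanded_query(target, keywords, expansion_terms,
                             w_target=3.0, w_keyword=2.0, w_expansion=1.0):
        # Combine target, question keywords and Wikipedia expansion terms
        # into one weighted Boolean (OR) query.
        clauses = ['"%s"^%.1f' % (target, w_target)]
        clauses += ['%s^%.1f' % (k, w_keyword) for k in keywords]
        clauses += ['%s^%.1f' % (t, w_expansion) for t in expansion_terms]
        return ' OR '.join(clauses)

    # e.g. build_expanded_query("Warren Moon", ["position", "play"],
    #                           ["quarterback", "league"]) gives
    # '"Warren Moon"^3.0 OR position^2.0 OR play^2.0 OR quarterback^1.0 OR league^1.0'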

5.2 Evaluation

In this section, we describe the experiments conducted to evaluate our query expansion

technique.

5.2.1 Evaluation Metrics

In the context of QA, the following three metrics are widely used to evaluate the passage

retrieval component.

1. Average precision at 1 (prec@1)

2. Mean Reciprocal Rank (MRR)

3. Total Document Reciprocal Rank (TDRR)

In all our experiments, we measure both MRR and TDRR for the top 20 passages. A detailed description of these evaluation metrics is given in section 4.1.

5.2.2 Data Set

We used the TREC 2006 QA data set to test the effectiveness of our query expansion technique. A detailed description of this data set is given in section 4.2. TREC also provides the top 1000 documents for every target in the question set. These documents are retrieved from the AQUAINT collection using the Prise¹ search engine. In our experiments we used the Wikipedia dump as of October 13, 2008. The dump consists of about 4.0 million articles in XML format. We used Lucene² (a freely available open-source IR engine) for indexing and searching the Wikipedia articles. Lucene supports a Boolean query language, although it performs ranked retrieval using BM25. So, we used Lucene for retrieving relevant passages from the top 1000 document set in our experiments.

¹ http://www-nlpir.nist.gov/works/papers/zp2/psearch design.html
² http://lucene.apache.org/

5.2.3 Experiments

We conducted three experiments to test the effectiveness of our query expansion technique. In the first experiment, we compared the performance of passage retrieval using expanded queries against that using the seed/original queries. The length of these expanded queries varies dynamically with the length of their seed queries, as shown in Equation 5.2. The value of α in the equation, which determines the expansion length, is set to 8. The results of this experiment over all the factoid questions in the test set are shown in Table 5.3. These results show improvements of about 24.6%, 11.1% and 12.4% in precision at 1, MRR at 20 and TDRR scores respectively under strict criteria, and improvements of about 18.4%, 10.5% and 13.8% under lenient criteria.

Criteria   Metric   SQ      EQ
Strict     Prec@1   0.158   0.197
           MRR@20   0.252   0.280
           TDRR     0.330   0.371
Lenient    Prec@1   0.282   0.334
           MRR@20   0.387   0.428
           TDRR     0.742   0.845

Table 5.3: Strict and lenient evaluation results for seed queries (SQ) and expanded queries (EQ).

In the second experiment, we analyzed the two scoring methods to determine how much each of the two scores contributes to the overall performance of passage retrieval. For each question in the test set, two expanded queries were constructed, where the expansion terms for the first and second queries were selected from the QETS using the proximity and outlink scores,


respectively. Here too, the value of α is set to 8. The performance of passage retrieval using these two queries is shown in Table 5.4. Comparing these results with the seed and expanded query results in Table 5.3 shows an improvement over the former, but not as large as that of the latter. So, the linear combination of the proximity and outlink scores results in a better ranking of terms in the QETS. Between the two scoring methods, outlink scoring performs better under strict criteria and proximity scoring performs better under lenient criteria.

Criteria   Metric   PS      OS
Strict     Prec@1   0.174   0.192
           MRR@20   0.262   0.279
           TDRR     0.351   0.373
Lenient    Prec@1   0.313   0.298
           MRR@20   0.413   0.396
           TDRR     0.825   0.786

Table 5.4: Statistical analysis of the proximity scoring (PS) and outlink scoring (OS) methods.

Finally, we tested our technique for different expansion lengths by varying the α value in Equation 5.2. Figure 5.1 shows the performance of passage retrieval for different expansion lengths under strict and lenient criteria. Under both criteria, the performance of our technique improved over the baseline (α = 0) for all expansion lengths corresponding to α values ranging from 1 to 10, and it attains its maximum for the expansion length with α = 8.


[Figure: six panels plotting Prec@1, MRR and TDRR against α (0 to 10) under strict and lenient criteria.]

Figure 5.1: Performance of passage retrieval for different query expansion lengths corresponding to α values ranging from 1 to 10 under strict and lenient criteria.


5.3 Discussion

Ideally, query expansion terms should have the following properties.

1. Be semantically related to the terms in the seed query.

2. Be good at discriminating between relevant and non-relevant passages.

The basic idea behind proximity scoring is that semantically related terms appear close to each other. So, to identify terms semantically related to the terms in the seed query, proximity scoring is used in our query expansion method. This idea was empirically evaluated in the second experiment, which showed considerable improvements over the baseline results in Table 5.3. The results from the second experiment also show that proximity scoring performs better under lenient evaluation, and outlink scoring performs better under strict evaluation. In strict evaluation, a passage is judged relevant if it contains the answer and comes from a document marked as relevant by human assessors. So, this evaluation avoids false positives which would otherwise be judged relevant under lenient evaluation, where a passage is judged relevant merely by the presence of the answer pattern. The better scores under strict evaluation suggest that the approach is good at discriminating relevant from non-relevant passages. Hence, the outlink scoring method identifies query expansion terms which are good at discriminating between relevant and non-relevant passages. And the linear combination of proximity scoring and outlink scoring combines both of the ideal properties of query expansion terms. This can be verified from the results in Tables 5.3 and 5.4, where expanded queries produced from the linear combination of both scoring methods resulted in better passage retrieval performance than expanded queries produced from the individual scoring methods.

The length of the expanded queries also plays a major role in determining retrieval effectiveness. If this length is too large, there is a chance of noisy terms being added to the query, which degrades retrieval performance; if it is too small, the expanded queries might not produce the expected improvements. In our third experiment, we


tested our query expansion approach over a broad range of expansion lengths. The results showed improvements in performance over the seed queries even for larger query expansion lengths (10 × |Q|, where |Q| is the length of the seed query). This suggests that our query expansion technique is less vulnerable to noise.

5.4 Conclusion

Query expansion techniques are often used to improve the performance of information retrieval systems. In this chapter, we have described a novel query expansion method which aims to rank the answer-containing passages better. It uses the text content, link structure and category structure of Wikipedia to generate a set of terms semantically related to the question. An empirical evaluation using the TREC 2006 QA data set showed significant improvements using our query expansion method. We also analyzed the performance of expanded queries based on the different scoring methods used in selecting terms and for different expansion lengths.


Chapter 6

Effect of Passage Priors in Passage Retrieval

The Probability Ranking Principle [72] states that a retrieval system should rank the documents in decreasing order of their probability of relevance to the query. According to the Language Modeling [68] decomposition [43] of this ranking principle, the documents should be ranked using the following equation:

\log rank(D) = \log p(Q|D, R) + \log \frac{p(D|R)}{p(D|N)}    (6.1)

Here the first term, p(Q|D, R), measures the likelihood of the query given a document that is relevant, and Language Modeling is used to estimate this value. The second term measures the prior probabilities of the document being relevant and non-relevant. But document retrieval assumes that a document is independent of its relevance and non-relevance, so documents are ranked based only on Language Modeling, i.e., the probability of the query being generated by the document. Previous works [51, 62] show that the same approach is applied to passage retrieval in the context of QA.

Previously, Jagadeesh et al. [35] used prior probabilities in the Query-Based Multi-Document Summarization task. They defined an entropy-based measure called Information Measure to capture the prior of a sentence. This information measure was computed using external


information sources like the Web and Wikipedia. Their experimental results showed that prior probabilities are necessary for ranking sentences in the summarization task. We use a similar approach to exploit prior probabilities for passage retrieval in QA.

In this chapter we describe the use of a mutual information measure, the Kullback-Leibler divergence (KL divergence) [14], to compute the prior of a passage. We also describe a simple method for identifying text relevant and non-relevant to a question, using the Web and the AQUAINT corpus (used in the TREC QA evaluations) as information sources. The rest of this chapter is organized as follows: Section 1 shows the derivation of Equation 6.1 from the probability ranking principle; Section 2 describes the estimation of passage priors; Section 3 describes the identification of relevant and non-relevant text for a question; Section 4 describes the experiments conducted and their results; and Section 5 concludes the chapter.

6.1 Background

The Probability Ranking Principle [72] states that a retrieval system should rank the documents in decreasing order of their probability of relevance to the query. It suggests ranking the documents by the odds of the probability of relevance (R) given the document (D) and query (Q) over the probability of non-relevance (N) given the document and query:

rank(D) = \frac{p(R|D, Q)}{p(N|D, Q)}

Applying Bayes' rule to the above equation gives:

rank(D) = \frac{p(D, Q|R)}{p(D, Q|N)} \cdot \frac{p(R)}{p(N)}

As p(R) and p(N) are independent of the document D, they are ignored in the document ranking process. Using the chain rule, p(D, Q|R) can be factored as p(Q|D, R) p(D|R). Here the first term, p(Q|D, R), measures the likelihood of the query given a document that is relevant, and the second term, p(D|R), measures the prior probability of a document being relevant. Similarly, p(D, Q|N) can be factored as p(Q|D, N) p(D|N) to give the following equation:

rank(D) = \frac{p(Q|D, R)}{p(Q|D, N)} \cdot \frac{p(D|R)}{p(D|N)}

Language Modeling is used to estimate p(Q|D, R) and p(Q|D, N). But Language Modeling makes the assumption that, conditioned on the event of non-relevance, the query is independent of the document. So, the above equation changes to:

rank(D) = \frac{p(Q|D, R)}{p(Q|N)} \cdot \frac{p(D|R)}{p(D|N)}

Here p(Q|N) is independent of the document, so it is ignored in the document ranking process. Using the fact that the logarithm is a monotonic function, the document ranking function can also be written as log rank(D). So, documents are finally ranked based on:

\log rank(D) = \log p(Q|D, R) + \log \frac{p(D|R)}{p(D|N)}

In the case of document retrieval, it is assumed that a document and its relevance or non-relevance are independent. So, the last term in the above equation goes to zero (as p(D|R) = p(D|N) = p(D)) and only Language Modeling is used in the document ranking process. But in the context of QA we relax this assumption and explore the necessity of the prior probabilities of a passage (A) given text relevant and non-relevant to a question. So, in the context of QA, passages are ranked using:

\log rank(A) = \log p(Q|A, R) + \log \frac{p(A|R)}{p(A|N)}    (6.2)

In this chapter, we focus on estimating the prior probabilities of passages, which is described in detail in the next section.

6.2 Estimation of prior probability

In this section we assume that the relevant (R) and non-relevant (N) text for a given question has been identified. In Information Retrieval, KL divergence is often used to measure the distance between two language models [10, 100]. We use this mutual information measure


to estimate the prior probabilities of passages. Let U_A denote the unigram language model of passage A, and let U_R and U_N denote the unigram language models of the relevant and non-relevant text respectively. The KL divergences between U_A, U_R and U_A, U_N are computed as follows:

D(U_A || U_R) = \sum_{v \in V} U_A(v) \log \frac{U_A(v)}{U_R(v)}

D(U_A || U_N) = \sum_{v \in V} U_A(v) \log \frac{U_A(v)}{U_N(v)}

where v is a term in the vocabulary V and U_A(v), U_R(v), U_N(v) are the unigram probabilities of v in the passage, the relevant text and the non-relevant text respectively. As the divergence between the passage and the relevant text increases, the probability of the passage being relevant decreases. So, the prior probabilities are estimated as follows:

p(A|R) = \frac{1}{1 + D(U_A || U_R)}

p(A|N) = \frac{1}{1 + D(U_A || U_N)}

As KL divergence is always non-negative, both p(A|R) and p(A|N) always lie in the range [0, 1]. This satisfies the basic law of probability, i.e., that the probability of an event should always lie in the range [0, 1]. p(A|R) = 1 when U_A = U_R, as the divergence of two identical distributions is zero. Similarly, p(A|N) = 1 when U_A = U_N. Substituting the above estimates for the prior probabilities in Equation 6.2 gives the final ranking function for passage retrieval:

\log rank(A) = \log p(Q|A, R) - \log \frac{1 + D(U_A || U_R)}{1 + D(U_A || U_N)}
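The prior computation above can be sketched in a few lines of Python, assuming the unigram models are plain term-to-probability dictionaries; the small epsilon used to guard against zero probabilities is our assumption:

    import math

    def kl_divergence(p, q, eps=1e-9):
        # D(P||Q) = sum over v of P(v) * log(P(v) / Q(v)); terms with
        # P(v) = 0 contribute nothing, and eps guards unseen terms in Q.
        return sum(pv * math.log(pv / (q.get(v, 0.0) + eps))
                   for v, pv in p.items() if pv > 0.0)

    def log_prior_ratio(u_a, u_r, u_n):
        # log [ p(A|R) / p(A|N) ] with p(A|.) = 1 / (1 + D(U_A || U_.)).
        d_r = kl_divergence(u_a, u_r)
        d_n = kl_divergence(u_a, u_n)
        return math.log((1.0 + d_n) / (1.0 + d_r))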

6.3 Identifying relevant and non-relevant text

In the previous section we assumed that the relevant and non-relevant text for a given question is known. Here we discuss a method to extract this text based on different query formulation strategies.


6.3.1 Relevant text

Breck et al. [5] noticed a correlation between the number of times an answer appeared in the TREC corpus and the average performance of TREC systems on that particular question. They showed that the more times an answer appears in the text collection, the easier it is to find. As a text collection, the Web is larger than any research corpus by several orders of magnitude. An important implication of this size is the amount of data redundancy inherent in the Web, i.e., each item of information is stated in a variety of ways in different documents on the Web.

Data redundancy in the Web indicates that the answer to a given natural language question exists in many different forms in different documents. So, our methodology for extracting relevant text relies on Web search engines. Currently, the Yahoo search engine is used to retrieve this text from the Web. Assuming that an answer is likely to be found within the vicinity of the set of keywords in the question, a query composed of those keywords is given to the search engine. For example, given the question “Which position did Warren Moon play in professional football?”, the query “position warren moon play professional football” is given to the search engine. The top N snippets/summaries provided by the search engine are extracted to form the relevant text.
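A sketch of this query construction; the stopword list and the web_search stand-in (any search API returning snippets) are our assumptions:

    STOPWORDS = {"which", "what", "who", "when", "where", "did", "do", "does",
                 "is", "was", "the", "a", "an", "in", "of", "to"}

    def keyword_query(question):
        # "Which position did Warren Moon play in professional football?"
        # -> "position warren moon play professional football"
        words = question.lower().rstrip("?").split()
        return " ".join(w for w in words if w not in STOPWORDS)

    # relevant_text = " ".join(web_search(keyword_query(question), n=10))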

Most of the snippets provided by the search engine consist of broken sentences. These broken sentences may miss part of the answer pattern, or the entire answer pattern that was originally present in them. In either case, an automatic evaluation using a set of questions and their corresponding answer patterns will fail to show the actual quality of the snippets. So, we manually examined the snippets for a set of 50 randomly selected questions from the TREC 2006 test set [19]. We observed that, on average, about 6 of the top 10 snippets provided by the search engine are relevant to the question. As the quality of the snippets is considerably high, we use them as the relevant text for a given question.


6.3.2 Non-relevant text

Unlike the methodology for relevant text, the methodology for extracting non-relevant text is independent of the size of the text collection. Here the structure of the question is used to extract the required information. An input question is parsed to get the POS tags of all its terms. We used the Stanford parser [40, 41] to get the POS tag sequence corresponding to a question. Based on the POS tags, all the keywords in a question are divided into two sets: Topic and Keyword.

Topic and Keyword.

Topic: Typically, questions ask for specific information within a broad topic. For example, the question “Which position did Warren Moon play in professional football?” asks for specific information regarding “Warren Moon”. A topic can be a person, location, organization, event or any other entity, which are proper nouns. So, the topic set consists of all the proper nouns within a question. In questions with no proper nouns, like “Which country is the leading producer of rice?”, the nouns “rice” and “country” are considered individual topics, and these terms form the topic set.

Keywords: This set contains all the keywords in a question which are not members of the topic set. So, for the question “Which position did Warren Moon play in professional football?”, the constituents of this set are “position”, “play”, “professional” and “football”.

Using the above two sets, two distinct queries are formulated, each intended to retrieve text non-relevant to the question.

QUERY I: It is formulated using the topic set terms alone, based on the idea that text which covers general information regarding a topic in the question can be considered non-relevant to it. So, for the above example question, “warren moon” is expected to retrieve non-relevant text.


QUERY II: It is formulated using terms from both the topic and keyword sets. The idea behind this query formulation is that text which covers information about a topic in the question but does not contain any of its keywords can be considered non-relevant to it. So, for the above example question, “warren moon -position -play -professional -football” is expected to retrieve non-relevant text. The negative operator (-) in the above query restricts the Information Retrieval system to retrieving only text that does not contain the terms following the ‘-’ operator.
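Given a POS-tagged question, both query formulations can be derived as sketched below; the tag tests follow Penn Treebank conventions, the small stopword set is illustrative, and the fallback for questions without proper nouns mirrors the Topic-set rule above (names are ours):

    STOP = {"did", "do", "does", "is", "was", "which", "what"}

    def non_relevant_queries(tagged_question):
        # tagged_question: list of (word, tag) pairs from a POS tagger.
        topic = [w for w, t in tagged_question if t in ("NNP", "NNPS")]
        if not topic:  # questions without proper nouns: nouns become topics
            topic = [w for w, t in tagged_question if t.startswith("NN")]
        keywords = [w for w, t in tagged_question
                    if t[0] in "NVJ" and w not in topic and w.lower() not in STOP]
        query_1 = " ".join(topic).lower()                                # QUERY I
        query_2 = query_1 + "".join(" -" + k.lower() for k in keywords)  # QUERY II
        return query_1, query_2

    # For the example question this yields "warren moon" and
    # "warren moon -position -play -professional -football".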

As the methodology is independent of the size of the corpus, two text collections, the Web and the AQUAINT corpus, are used to extract the required information. An empirical evaluation using the TREC 2006 QA test set was performed to test the quality of the text extracted using the two queries described above. Redundancy, a passage retrieval evaluation metric, is used to measure the average number of answer-bearing passages found within the top N passages retrieved for each query formulation. The quality of the text is inversely proportional to redundancy, i.e., the lower the redundancy value, the better the quality of the extracted text. All the factoid questions from the test set were used to measure redundancy. Table 6.1 shows the average redundancy scores for the top N passages retrieved from the AQUAINT corpus for the questions in the test set. QUERY I and QUERY II are the query formulations from a question as described above, and QUERY is the keyword query formulated for retrieving relevant snippets from the Web. These results show that QUERY II produces better quality non-relevant text than QUERY I. And, compared to QUERY, both QUERY I and QUERY II have significantly lower redundancy scores. A similar evaluation could not be performed on the snippets retrieved from the Web because of the broken sentences described in the previous section.

As the extracted relevant and non-relevant text is not truly relevant and non-relevant to a question, a linear interpolation of the Language Modeling score and the prior probabilities is used to rank passages, as shown in the equation below:

\log rank(A) = (1 - \alpha) \log p(Q|A, R) - \alpha \log \frac{1 + D(U_A || U_R)}{1 + D(U_A || U_N)}

where α is a weighting parameter which lies between 0 and 1.

Query      Top 1   Top 10   Top 20   Top 100
QUERY      0.222   0.844    1.202    2.227
QUERY I    0.020   0.116    0.236    0.597
QUERY II   0.006   0.057    0.122    0.270

Table 6.1: Redundancy scores for the passages retrieved from the AQUAINT corpus using different queries.

6.4 Evaluation

In this section, we describe the experiments conducted to test the effectiveness of passage

priors in ranking passages.

6.4.1 Evaluation Metrics and Data Set

In the context of QA, the following three metrics are widely used to evaluate the passage

retrieval component.

1. Average precision at 1 (prec@1)

2. Mean Reciprocal Rank (MRR)

3. Total Document Reciprocal Rank (TDRR)

In all our experiments, we measure both MRR and TDRR for the top 20 passages. A detailed description of these evaluation metrics is given in section 4.1.


We used the TREC 2006 QA data set to test the effectiveness of passage priors in ranking passages. A detailed description of this data set is given in section 4.2.

6.4.2 Experiments

As our aim is to test the effect of passage priors within a language modeling framework for passage retrieval, we use two language-modeling-based retrieval models, Indri and KL divergence. The Indri retrieval model is a state-of-the-art retrieval model that combines the merits of language modeling and inference networks [90]. The KL-divergence retrieval model [100] implements the cross entropy of the query model with respect to the document model; it is a standard metric for comparing distributions which has proved to work well in information retrieval systems. Implementations of both retrieval models are provided in Lemur, a toolkit for language modeling and information retrieval. In our experiments, we incorporated our approach as a re-ranking step on top of these retrieval models, with the model parameters set to the default values provided by the Lemur toolkit. After a retrieval model produces a ranked set of passages for a given question, the top 200 passages are re-ranked, of which the top 20 passages are considered for evaluation. The original scores of the top 20 passages returned in the initial retrieval act as the baseline. These results are compared against the re-ranked results of our approach.
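The re-ranking step itself is straightforward; the following is a sketch under the assumption that each retrieved passage carries its log language-model score and its log prior ratio (computed as in Section 6.2), with names and defaults of our choosing:

    def rerank(passages, alpha=0.5, n_rerank=200, n_eval=20):
        # passages: list of (passage_id, log_lm_score, log_prior_ratio),
        # sorted by the initial retrieval score. The combined score follows
        # log rank(A) = (1 - alpha) * log p(Q|A,R) + alpha * log [p(A|R)/p(A|N)].
        head = passages[:n_rerank]
        rescored = [(pid, (1 - alpha) * lm + alpha * prior)
                    for pid, lm, prior in head]
        rescored.sort(key=lambda x: x[1], reverse=True)
        return rescored[:n_eval]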

We performed two experiments in which QUERY and QUERY II were used to extract the relevant and non-relevant text respectively. In the first experiment, we compared the re-ranked and baseline results from the two retrieval models; they are shown in Tables 6.2 and 6.3. The value of the weighting parameter (α) was set to 0.5. Only the Web was used to extract relevant text, but both AQUAINT and the Web were used to extract non-relevant text. So, to analyze the effect of the two text collections on computing the prior of a passage, we show results for both of them. The results listed under AQUAINT and Web show considerable improvements over the baseline, and between the two, scores are better when the Web was used as the knowledge base for extracting non-relevant text. The relevant and non-relevant


text extracted from the knowledge bases in our approach is not truly relevant and non-relevant for a given question; it is only pseudo-relevant and pseudo-non-relevant text. Even so, it produced considerable improvements in the ranking of passages, which suggests that passage priors are necessary in ranking passages.

Criteria   Metric   Indri   AQUAINT   Web
Strict     Prec@1   0.196   0.182     0.216
           MRR      0.268   0.268     0.286
           TDRR     0.386   0.402     0.424
Lenient    Prec@1   0.318   0.344     0.338
           MRR      0.394   0.427     0.419
           TDRR     0.820   0.899     0.888

Table 6.2: Results for the Indri retrieval model under strict and lenient criteria.

Criteria   Metric   KL Div.   AQUAINT   Web
Strict     Prec@1   0.216     0.213     0.233
           MRR      0.293     0.290     0.302
           TDRR     0.429     0.433     0.443
Lenient    Prec@1   0.335     0.369     0.364
           MRR      0.415     0.446     0.440
           TDRR     0.866     0.931     0.927

Table 6.3: Results for the KL divergence retrieval model under strict and lenient criteria.

In the second experiment we tested our methodology for different α values ranging from 0.0 to 1.0 in the ranking function. Figure 6.1 shows the performance of passage retrieval for different α values under strict and lenient criteria. In all cases, the performance of passage retrieval improves over the baseline (α = 0.0) for α values between 0.0 and 0.6,


and beyond that it falls below the baseline. The performance reaches its maximum when the α value is 0.4, which shows that the ranking is biased towards the language modeling scores. This could be because the text used for computing the prior of a passage is not strictly relevant and non-relevant.

6.5 Conclusion

In this chapter, we have explored the necessity of the prior probabilities of a passage being relevant and non-relevant to a question in the process of ranking passages. We described a method for estimating these prior probabilities using Kullback-Leibler divergence, and a method for extracting relevant and non-relevant text for a question. Our experiments on the factoid questions from the TREC 2006 test set showed that, in the context of QA, the use of prior probabilities improves the performance of passage retrieval. The experimental results also showed that the performance is biased towards the language modeling scores.


[Figure: six panels plotting Prec@1, MRR and TDRR against α (0.0 to 1.0) under strict and lenient criteria.]

Figure 6.1: Performance of passage retrieval for different α values ranging from 0.0 to 1.0 under strict and lenient criteria. In all the cases ‘(—*—)’ and ‘(· · · *· · · )’ denote the re-ranked scores of the KL divergence and Indri retrieval models respectively.


Chapter 7

Conclusions

In this thesis, we have studied the problem of passage retrieval in Question Answering systems. Passage retrieval is an intermediate step between document retrieval and answer extraction. If the answer to a question does not appear in the set of retrieved passages, then it is impossible for any QA system to answer that question. Moreover, a passage acts as a natural response for a QA system because it also includes the context surrounding the answer. So, effective and sophisticated passage retrieval methodologies have to be designed to improve the success rate of QA systems.

On evaluating different passage retrieval approaches, it was found that one of the major sources of error is the terminological gap problem. Our goal is to reduce this gap and enhance the performance of passage retrieval. We focused on query expansion, a widely used technique in Information Retrieval, to reduce this gap. We proposed two different solutions: first, a passage retrieval methodology which expands queries inherently, and then an explicit query expansion method using Wikipedia. Our empirical evaluations of both solutions have shown significant improvements over standard approaches. In addition, we also showed the necessity of passage priors in ranking passages given a query.


7.1 Contributions

In this thesis, we have made the following contributions:

7.1.1 Passage Retrieval Using Answer Type Profiles

The aim of this work is that, during passage retrieval, query words should be expanded

inherently with only their contextually related synonyms, where the context is determined

by the answer type of the question. The following are the contributions of this work.

• We showed how the Statistical Machine Translation model for Information Retrieval reduces the terminological gap problem.

• By constructing multiple translation models based on the semantic categories of questions, we showed that query terms are mostly mapped to synonyms whose semantics are similar to those of the given question.

• We showed that this approach outperforms standard retrieval models including TFIDF, Okapi BM25, Indri and KL-divergence.

• As this approach does not rely on any external knowledge sources like WordNet, encyclopedias or the Web to enhance passage retrieval performance, it can be considered an alternative passage retrieval methodology.

7.1.2 Query Expansion Using Wikipedia

This work aims at extracting query expansion terms from Wikipedia by utilizing its text content and structure. The following are the contributions of this work.

• We showed how the text content and structure of Wikipedia can be exploited for query expansion in the context of QA.

• We showed that a linear combination of proximity and outlink scoring results in a better ranking of query expansion terms.


• Our empirical evaluation showed that the query expansion terms resulting from this approach possess both properties of ideal query expansion terms.

• We showed that, on the Okapi BM25 retrieval engine, the use of expanded queries led to significant improvements in performance over the original queries.

• Even for large query expansion lengths, this approach produced improvements in performance over the seed queries, which shows that the approach is less vulnerable to noise.

7.1.3 Effect of Passage Priors in Passage Retrieval

This work aims at exploring the necessity of passage priors in ranking passages given a query. The following are the contributions of this work.

• We showed why passage priors are necessary in ranking passages given a question.

• We described the use of a mutual information measure, KL divergence, to compute passage priors, and a simple method for identifying relevant and non-relevant text.

• We presented different query construction techniques to extract relevant and non-relevant text from the Web and the AQUAINT corpus as knowledge sources.

• We showed that using passage priors as a re-ranking step on top of language models including Indri and KL-divergence improves the ranking of passages.

7.1.4 Opinion Question Answering System

The Text Analysis Conference (TAC) has provided a common platform, the Question Answering track, for researchers to devise techniques for answering opinion questions. The track aims at answering rigid list and squishy list questions by mining opinions from blog posts. Rigid list questions (e.g., Which countries would like to build nuclear power plants?) ask for exact strings containing a list item, and squishy list questions (e.g., What features


do people like in vista?) ask for strings containing an answer. We have developed a QA system to answer these two types of questions. The following are the contributions of this work.

• We used the standard pipeline architecture (as shown in Figure 1.1) to answer rigid list questions. The implementations of all the components were adjusted to handle the opinions expressed in these questions.

• Our system produces a ranked set of sentences as the answer for a given squishy list question. These sentences are ranked based on three features: query dependent, query independent, and opinion. Given a sentence, the first feature quantifies the query-relatedness of the sentence, the second quantifies the general importance of the sentence, and the third measures the closeness of the opinions expressed in the question and the sentence.

• Out of all the participating systems, our system produced the best results for squishy list questions and the second-best results for rigid list questions.

7.1.5 Monolingual Question Answering System

The Cross Language Evaluation Forum (CLEF) has provided a common platform to evaluate QA systems which take questions in one language and provide answers either in the same language or in a different language. We have developed a monolingual English QA system; that is, both questions and answers are in English. The following are the contributions of this work.

• Our system answers five different types of questions, namely Factoid, Definitive, Reason, Procedure, and Purpose questions.

• We showed that just by effectively combining naive techniques from the Information Retrieval, Information Extraction, and Natural Language Processing areas, a QA system with reasonable performance can be developed.


• From a total of 95 questions, our system answered 54 questions correctly and 37 incorrectly, and left 4 questions unanswered.

7.2 Future Work

The work presented in this thesis can be extended and improved in certain aspects. The discussion of the evaluation results for the three works described in chapters 3, 5 and 6 gives insights for future research. Possible research directions are given below:

7.2.1 Passage Retrieval Using Answer Type Profiles

• In our approach, we used a question classifier based on Support Vector Machines (SVMs), with an accuracy of 86.8%, to find the answer type of a given question. There are several other approaches to question classification whose accuracies are equivalent to or better than the one reported above, but their impact on our passage retrieval approach is uncertain. The effect of these question classification techniques on the performance of our retrieval approach can therefore be investigated (a sketch of such a classifier appears after this list).

• In addition to IBM Model 1 and the GIZA++ alignment model, other alignment models, such as IBM Models 2-5 and the HMM alignment model, can be explored for constructing answer type profiles.
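
A minimal sketch of an SVM question classifier in the spirit of the one above, built with scikit-learn; the tiny training set and coarse answer-type labels are illustrative stand-ins for a real question-classification corpus such as the Li and Roth dataset:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy training data: (question, coarse answer type).
    questions = [
        "Who invented the telephone?", "Who is the president of France?",
        "When did World War II end?", "When was the telescope invented?",
        "Where is the Eiffel Tower?", "Where was Mozart born?",
    ]
    answer_types = ["PERSON", "PERSON", "DATE", "DATE", "LOCATION", "LOCATION"]

    # Word unigram/bigram features feeding a linear-kernel SVM.
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    clf.fit(questions, answer_types)

    print(clf.predict(["Who wrote Hamlet?"]))  # -> ['PERSON'] on this toy data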

7.2.2 Query Expansion Using Wikipedia

• In addition to the outlink structure and the category structure, Wikipedia contains other structured information, including tables and inlinks. The use of this additional structured information for generating query expansion terms can be explored (a sketch of outlink-based expansion appears after this list).

• Our current approach uses only Wikipedia as the knowledge source for expanding seed queries. In addition to Wikipedia, similar knowledge sources such as Knol can be exploited for extracting query expansion terms.
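
A minimal sketch of harvesting outlink titles as candidate expansion terms through the public MediaWiki API; the thesis's own extraction and term-weighting steps are not reproduced here, so this only illustrates access to the outlink structure:

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def outlink_expansion_terms(seed_title, limit=20):
        # Fetch up to `limit` article-namespace outlinks of the seed page;
        # a real system would score and filter these before using them
        # as expansion terms.
        params = {
            "action": "query", "titles": seed_title,
            "prop": "links", "plnamespace": 0,
            "pllimit": limit, "format": "json",
        }
        pages = requests.get(API, params=params, timeout=10).json()["query"]["pages"]
        page = next(iter(pages.values()))
        return [link["title"] for link in page.get("links", [])]

    print(outlink_expansion_terms("Nuclear power"))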

7.2.3 Effect of Passage Priors in Passage Retrieval

• Our approach for extracting relevant and non-relevant text from the different knowledge sources is very simple. For instance, when extracting relevant text from the Web, it does not even consider whether answer candidates are present in the snippets that are added to the text. Designing pruning strategies with constraints such as this one could produce better relevant and non-relevant texts.

• Typically, passage priors denote the probabilities of classifying a passage into the relevant and non-relevant classes. So, in addition to KL divergence, other text classification algorithms, such as Naive Bayes and SVMs, can be explored to analyze their impact on passage retrieval performance.
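
A minimal sketch of such a classifier-based prior, using a multinomial Naive Bayes model from scikit-learn; the toy training texts below stand in for the relevant and non-relevant text actually extracted from the Web and the AQUAINT corpus:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy stand-ins for the extracted relevant / non-relevant texts.
    texts = [
        "the capital of france is paris",
        "paris is the largest city in france",
        "click here to subscribe to our newsletter",
        "copyright notice all rights reserved",
    ]
    labels = ["relevant", "relevant", "non-relevant", "non-relevant"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)

    # P(relevant | passage) can then serve as the passage prior in
    # re-ranking, in place of the KL-divergence-based prior.
    passage = "paris has been the capital of france for centuries"
    idx = list(model.classes_).index("relevant")
    print(model.predict_proba([passage])[0][idx])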

