Answering List Questions using Co-occurrence and Clustering Majid Razmara and Leila Kosseim Concordia University [email protected]


Page 1: Answering List Questions  using  Co-occurrence and Clustering

Answering List Questions using Co-occurrence and

Clustering

Majid Razmara and Leila Kosseim
Concordia University
[email protected]

Page 2: Answering List Questions  using  Co-occurrence and Clustering


Introduction

• Question Answering
• TREC QA track
• Question Series
• Corpora

Target: American Girl dolls
FACTOID: In what year were American Girl dolls first introduced?
LIST: Name the historical dolls.
LIST: Which American Girl dolls have had TV movies made about them?
FACTOID: How much does an American Girl doll cost?
FACTOID: How many American Girl dolls have been sold?
FACTOID: What is the name of the American Girl store in New York?
FACTOID: What corporation owns the American Girl company?
OTHER: Other

Page 3: Answering List Questions  using  Co-occurrence and Clustering


Hypothesis

• Answer Instances

1. Have the same semantic entity class

2. Co-occur within sentences, or

3. Occur in different sentences sharing similar context

Based on the Distributional Hypothesis: "Words occurring in the same contexts tend to have similar meanings" [Harris, 1954].

Page 4: Answering List Questions  using  Co-occurrence and Clustering

Ltw_Eng_20050712.0032 (AQUAINT-2): United, which operates a hub at Dulles, has six luggage screening machines in its basement and several upstairs in the ticket counter area. Delta, Northwest, American, British Airways and KLM share four screening machines in the basement.

Ltw_Eng_20060102.0106 (AQUAINT-2): Independence said its last flight Thursday will leave White Plains, N.Y., bound for Dulles Airport. Flyi suffered from rising jet fuel costs and the aggressive response of competitors, led by United and US Airways.

New York Times (Web): Continental Airlines sued United Airlines and the committee that oversees operations at Washington Dulles International Airport yesterday, contending that recently installed baggage-sizing templates inhibited competition.

Wikipedia (Web): At its peak of 600 flights daily, Independence, combined with service from JetBlue and AirTran, briefly made Dulles the largest low-cost hub in the United States.

Target 232: "Dulles Airport"
Question 232.6: "Which airlines use Dulles?"

Page 5: Answering List Questions  using  Co-occurrence and Clustering


Our Approach

1. Create an initial candidate list
  • Answer Type Recognition
  • Document Retrieval
  • Candidate Answer Extraction
  • Candidates may also be imported from an external source (e.g. a factoid QA system)

2. Extract co-occurrence information

3. Cluster candidates based on their co-occurrence

Page 6: Answering List Questions  using  Co-occurrence and Clustering


Answer Type Recognition

• 9 types: Person, Country, Organization, Job, Movie, Nationality, City, State, and Other

• Lexical patterns, e.g.:
  ^ (Name | List | What | Which) (persons | people | men | women | players | contestants | artists | opponents | students) → PERSON
  ^ (Name | List | What | Which) (countries | nations) → COUNTRY

• Syntagmatic (POS-tag) patterns for Other types:
  ^ (WDT | WP | VB | NN) (DT | JJ)* (NNS | NNP | NN | JJ)* (NNS | NNP | NN | NNPS) (VBN | VBD | VBZ | WP | $)
  ^ (WDT | WP | VB | NN) (VBD | VBP) (DT | JJ | JJR | PRP$ | IN)* (NNS | NNP | NN)* (NNS | NNP | NN)

• Type Resolution
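A minimal sketch of how such lexical patterns might be applied; only two of the patterns above are included, and the function name and pattern list are illustrative, not the authors' code:

```python
import re

# Simplified subset of the slide's lexical patterns: question prefix -> type.
LEXICAL_PATTERNS = [
    (re.compile(r"^(Name|List|What|Which)\s+(persons|people|men|women|players|"
                r"contestants|artists|opponents|students)\b", re.I), "PERSON"),
    (re.compile(r"^(Name|List|What|Which)\s+(countries|nations)\b", re.I), "COUNTRY"),
]

def recognize_answer_type(question: str) -> str:
    """Return the expected answer type, or OTHER if no pattern matches."""
    for pattern, answer_type in LEXICAL_PATTERNS:
        if pattern.search(question):
            return answer_type
    return "OTHER"  # later refined by type resolution

print(recognize_answer_type("Which countries were affected?"))  # COUNTRY
print(recognize_answer_type("Name the historical dolls."))      # OTHER
```

Questions that fall through to OTHER are handled by the type resolution step described on the next slide.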

Page 7: Answering List Questions  using  Co-occurrence and Clustering


Type Resolution

• Resolves the answer subtype to one of the main types
  e.g. "List previous conductors of the Boston Pops."
  Type: OTHER, Subtype: Conductor → PERSON

• WordNet's Hypernym Hierarchy
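The hypernym walk can be sketched as follows; a tiny hand-coded hypernym map stands in for WordNet's real hierarchy here, so the chains and the `resolve_type` helper are illustrative assumptions:

```python
# Toy stand-in for WordNet's hypernym hierarchy (the system uses real WordNet).
HYPERNYMS = {
    "conductor": "musician",
    "musician": "person",
    "senator": "politician",
    "politician": "person",
    "person": None,
}

# Hypernym nodes that correspond to the main answer types.
MAIN_TYPES = {"person": "PERSON", "country": "COUNTRY", "organization": "ORGANIZATION"}

def resolve_type(subtype: str) -> str:
    """Walk up the hypernym chain until a main answer type is reached."""
    node = subtype.lower()
    while node is not None:
        if node in MAIN_TYPES:
            return MAIN_TYPES[node]
        node = HYPERNYMS.get(node)  # unknown words fall out of the chain
    return "OTHER"

print(resolve_type("Conductor"))  # PERSON
```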

Page 8: Answering List Questions  using  Co-occurrence and Clustering


Document Retrieval

• Document collections:
  • Source document collection: few documents, used to extract candidates
  • Domain document collection: many documents, used to extract co-occurrence information

• Query generation:
  • Google query on the Web
  • Lucene query on the corpora

Page 9: Answering List Questions  using  Co-occurrence and Clustering


Candidate Answer Extraction

• Term extraction: extract all terms that conform to the expected answer type
  • Person, Organization, Job: intersection of several NE taggers (LingPipe, the Stanford tagger and GATE NE), for better precision
  • Country, State, City, Nationality: a gazetteer, for better precision
  • Movie, Other: capitalized and quoted terms

• Verification of Movie: numHits(GoogleQuery intitle:Term site:www.imdb.com)

• Verification of Other: numHits("SubType Term" OR "Term SubType") / numHits("Term")
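The precision-oriented intersection idea might look like this; the tagger outputs below are hard-coded stand-ins for what LingPipe, the Stanford tagger and GATE would return:

```python
# Invented tagger outputs: term -> NE label, one dict per tagger.
lingpipe = {"Majid Razmara": "PERSON", "Concordia": "ORGANIZATION", "Boston Pops": "ORGANIZATION"}
stanford = {"Majid Razmara": "PERSON", "Boston Pops": "ORGANIZATION", "Dulles": "LOCATION"}
gate     = {"Majid Razmara": "PERSON", "Boston Pops": "PERSON"}

def intersect_candidates(expected_type, *tagger_outputs):
    """Keep only terms that every tagger labels with the expected type."""
    per_tagger = [{term for term, t in out.items() if t == expected_type}
                  for out in tagger_outputs]
    return set.intersection(*per_tagger)

print(intersect_candidates("PERSON", lingpipe, stanford, gate))  # {'Majid Razmara'}
```

Requiring agreement among all taggers trades recall for precision, which suits a pipeline whose later clustering stage can only filter, not add, candidates.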

Page 10: Answering List Questions  using  Co-occurrence and Clustering


Co-occurrence Information Extraction

• Domain collection documents are split into sentences

• Each sentence is checked as to whether it contains candidate answers

[Figure: a candidates × sentences occurrence matrix, from which candidate-candidate co-occurrence counts are derived]
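A minimal sketch of this co-occurrence pass, with toy sentences and candidates (the data and variable names are illustrative):

```python
from collections import Counter
from itertools import combinations

# Toy domain collection, already split into sentences, and candidate terms.
sentences = [
    "united and delta share screening machines at dulles",
    "independence made dulles a low-cost hub with jetblue",
    "united and us airways responded aggressively",
]
candidates = ["united", "delta", "independence", "jetblue"]

# For each candidate, the set of sentence ids that contain it.
occurs = {c: {i for i, s in enumerate(sentences) if c in s.split()}
          for c in candidates}

# For each candidate pair, the number of sentences containing both.
cooccur = Counter()
for a, b in combinations(candidates, 2):
    cooccur[(a, b)] = len(occurs[a] & occurs[b])

print(cooccur[("united", "delta")])  # 1
```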

Page 11: Answering List Questions  using  Co-occurrence and Clustering

Hierarchical Agglomerative Clustering

• Steps:

1. Put each candidate term t_i in a separate cluster C_i

2. Compute the similarity between each pair of clusters (average linkage)

3. Merge the two clusters with the highest inter-cluster similarity

4. Update all relations between this new cluster and the other clusters

5. Go to step 3 until there are only N clusters, or the similarity is less than a threshold

similarity(C_i, C_j) = (1 / (|C_i| · |C_j|)) · Σ_{t_m ∈ C_i} Σ_{t_n ∈ C_j} similarity(t_m, t_n)
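The steps and the average-linkage formula above can be sketched as follows; the similarity values and stopping parameters are assumed toy data, not the system's actual settings:

```python
def average_linkage(c1, c2, sim):
    """Mean pairwise similarity between two clusters of terms."""
    return sum(sim[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

def hac(terms, sim, threshold=0.5, n_clusters=1):
    clusters = [[t] for t in terms]                 # step 1: singleton clusters
    while len(clusters) > n_clusters:
        # steps 2-3: score every cluster pair, pick the most similar one
        pairs = [(average_linkage(clusters[i], clusters[j], sim), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = max(pairs)
        if best < threshold:                        # stopping criterion
            break
        clusters[i] = clusters[i] + clusters[j]     # steps 3-4: merge, update
        del clusters[j]
    return clusters

# Toy symmetric similarity matrix (invented values):
sim = {"a": {"b": 0.9, "c": 0.1},
       "b": {"a": 0.9, "c": 0.2},
       "c": {"a": 0.1, "b": 0.2}}
print(hac(["a", "b", "c"], sim))  # [['a', 'b'], ['c']]
```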

Page 12: Answering List Questions  using  Co-occurrence and Clustering

The Similarity Measure

• Similarity between each pair of candidates

• Based on co-occurrence within sentences

• Using chi-square (χ²)

• Shortcoming: χ² does not work well with sparse data

χ² = N · (O11·O22 - O12·O21)² / ((O11 + O12) · (O21 + O22) · (O11 + O21) · (O12 + O22))

           term_i       ¬term_i      Total
term_j     O11          O21          O11 + O21
¬term_j    O12          O22          O12 + O22
Total      O11 + O12    O21 + O22    N
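The χ² score can be computed directly from the four cells of the contingency table above; the cell values in the usage example are invented:

```python
def chi_square(o11, o12, o21, o22):
    """Chi-square association between term_i and term_j.

    o11: sentences containing both terms, o12: term_i only,
    o21: term_j only, o22: neither term.
    """
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    return num / den if den else 0.0  # guard against empty marginals

# Terms that nearly always co-occur score higher than weakly related ones:
print(chi_square(10, 1, 1, 88))  # strong association
print(chi_square(2, 5, 5, 88))   # weaker association
```

The guard on a zero denominator is one simple way to handle the sparse-data cases noted above; the slides later suggest Yates' correction as a more principled fix.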

Page 13: Answering List Questions  using  Co-occurrence and Clustering


Pinpointing the Right Cluster

• Question and target keywords are used as “spies”

• Spies are:
  • inserted into the list of candidate answers
  • treated as candidate answers: their similarity to one another and to the candidate answers is computed, and they are clustered along with the candidate answers

• The cluster with the largest number of spies is returned, and the spies are removed

• Other approaches
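The spy mechanism can be sketched as follows, with invented clusters and spies modeled on the example on the next slide:

```python
def pick_answer_cluster(clusters, spies):
    """Return the cluster containing the most spies, minus the spies."""
    best = max(clusters, key=lambda c: sum(term in spies for term in c))
    return [term for term in best if term not in spies]

# Toy clustering output and spy terms (question/target keywords, stemmed).
clusters = [
    ["spain", "japan", "germany", "iran"],
    ["pakistan", "afghanistan", "india", "2005", "octob", "earthquak", "affect"],
    ["oman"],
]
spies = {"pakistan", "2005", "octob", "earthquak", "affect"}
print(pick_answer_cluster(clusters, spies))  # ['afghanistan', 'india']
```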

Page 14: Answering List Questions  using  Co-occurrence and Clustering

[Figure: example clustering output (Cluster-2, Cluster-9, Cluster-31)]
• Cluster containing the spies (returned): pakistan, 2005, afghanistan, octob, u.s, india, affect, earthquak
• Another cluster: spain, bangladesh, japan, germany, haiti, nepal, china, sweden, iran, mexico, vietnam, belgium, lebanon, iraq, russia, turkey
• Singleton cluster: oman

Target 269: "Pakistan earthquakes of October 2005"
Question 269.2: "What countries were affected by this earthquake?"

Recall = 2/3
Precision = 2/3
F-score = 2/3

Page 15: Answering List Questions  using  Co-occurrence and Clustering


TREC 2007 Results (F-measure)

[Figure: bar chart of list-question F-measures for the TREC 2007 runs: LymbaPA07, LCCFerret, ILQUA1, QASCU3, Ephyra3, UofL, FDUQAT16B, IITDIBM2007T, pronto07run3, lsv2007a, pircs07qa1, QUANTA, csail1, Intellexer7A, asked07c, Dal07t, uams07atch, MITRE2007B, DrexelRun2, eduFsc05, iiitqa07]

Best: 0.479, Median: 0.085, Worst: 0.000
Our run: F = 0.145

Page 16: Answering List Questions  using  Co-occurrence and Clustering


Evaluation of Clustering

• Baseline List of candidate answers prior to clustering

• Our Approach List of candidate answers filtered by the clustering

• Theoretical Maximum The best possible output of clustering based on the initial list

TREC 2004-2006 (237 questions)
                   Precision   Recall   F-score
  Baseline           0.064     0.407     0.098
  Our Approach       0.141     0.287     0.154
  Theoretical Max    1.000     0.407     0.472

TREC 2007 (85 questions)
                   Precision   Recall   F-score
  Baseline           0.075     0.388     0.106
  Our Approach       0.165     0.248     0.163
  Theoretical Max    1.000     0.388     0.485

Page 17: Answering List Questions  using  Co-occurrence and Clustering


Percentage of each Question Type in Training Set

Other 36%, Person 32%, Country 15%, Organization 5%, Movie 5%, City 3%, Job 2%, State 1%, Nationality 1%

Evaluation of each Question Type

F-score of each type in training and test sets

[Figure: F-score of each question type (Person, Other, Country, State, Organization, Job, Movie, Nationality, City) on the training and test sets]

Page 18: Answering List Questions  using  Co-occurrence and Clustering


Future Work

• Developing a module that verifies whether each candidate is a member of the answer type, for the Movie and Other types

• Using co-occurrence at the paragraph level rather than the sentence level; anaphora resolution can be used

• Using another similarity measure: χ² does not work well with sparse data; for example, Yates' correction for continuity (Yates' χ²)

• Using different clustering approaches

• Using different similarity measures, e.g. mutual information

Page 19: Answering List Questions  using  Co-occurrence and Clustering


Questions?