Slide 1: Question Answering

Sept 20, 2004
Paul Kantor
Prepared for Rutgers LIS558
© Paul Kantor

Slide 2: Why try QA?

• Users want answers, not documents
  – What is the population of Brazil?
  – a world almanac
  – a web page about Brazil
• The organizations behind the users want answers
  – Where is Usama bin Ladin?
  – Does Israel have nuclear weapons?
  – What is John Kerry's real position on the war?

Slide 3: QA as a kind of IR

• Current approach
  – take the question
  – ignore all semantics and syntax
  – retrieve documents with the same words, or at least some of them (vector methods; quorum methods), as in the sketch below
  – hope the user finds the answer in them
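A minimal sketch of this bag-of-words approach; the documents and the shared-term count are invented stand-ins for real vector or quorum scoring:

```python
import re

# Toy bag-of-words retrieval: score each document by how many
# question terms it shares, ignoring syntax and semantics.
# Documents and the overlap score are illustrative only.

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

docs = {
    "doc34": "The military budget grew during the war.",
    "doc52": "Brazil has a population of about 170 million people.",
}

def retrieve(question):
    q_terms = tokenize(question)
    scored = [(len(q_terms & tokenize(text)), name)
              for name, text in docs.items()]
    return sorted(scored, reverse=True)

print(retrieve("What is the population of Brazil?"))
# doc52 ranks first: it shares 'population', 'of', 'brazil' with the question.
```

Note that the top document merely contains the answer; nothing in this scheme extracts it.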

Slide 4: From Documents to Passages

• Break all documents into short pieces; this means that an index says
  – military: doc 34, part 5; doc 52, part 3 ...
  – nuclear: doc 34, part 7; doc 52, part 3 ...
• We retrieve doc 52, part 3 first (see the sketch below)
  – and hope it contains the answer
  – overhead = size of part 3
  – cost: the user's time spent scanning
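A minimal sketch of a passage-level inverted index of the kind described, with hypothetical document parts:

```python
from collections import defaultdict

# Passage-level inverted index: each posting names a (doc, part)
# rather than a whole document. The passages below are made up.
passages = {
    ("doc34", 5): "military spending rose sharply",
    ("doc34", 7): "the nuclear program was suspended",
    ("doc52", 3): "military and nuclear cooperation expanded",
}

index = defaultdict(set)
for key, text in passages.items():
    for term in text.split():
        index[term].add(key)

# A passage containing both query terms is retrieved first;
# the user then scans only that short part, not the whole document.
hits = index["military"] & index["nuclear"]
print(hits)  # {('doc52', 3)}
```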

Slide 5: How to do better?

• A. Don't need to
  – people scan pretty fast
  – serendipity
• B. We have to
  – the FBI has 120,000 hours of terrorism-related tapes
  – imagine they are transcribed
  – what do we want to look for?

Slide 6: Sponsors care enough

• TREC (the Text REtrieval Conference)

• http://trec.nist.gov

• Originally intended to advance the state of the art of Information Retrieval.

• Developed a Question Answering Track

• Controlled Evaluation

Slide 7: TREC and the Cranfield Model

• How do we know if one system is better than another?
• Set problems for which we know the correct answers, and see whether the systems find them
• Works for small collections, or for outrageously unique answers.
• The needle in the ocean, or "scavenger hunt".

Slide 8: For large collections

• Pooled evaluation

• Set the task for many systems

• Pool the candidate results

• decide which ones are correct

• there may be more than one

• give credit to any system that finds one of them (see the sketch below)
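A minimal sketch of pooling; the system outputs and the assessors' judgment are invented for illustration:

```python
# Pooled evaluation: pool the candidates submitted by all systems,
# judge the pool once, then credit each system that returned any
# judged-correct answer. All data here is invented.

submissions = {
    "systemA": ["April 15, 1865", "in a theatre"],
    "systemB": ["April 14, 1865", "April 15, 1865"],
    "systemC": ["during the civil war"],
}

pool = set()
for answers in submissions.values():
    pool.update(answers)

# Human assessors judge each pooled candidate once.
judged_correct = {"April 15, 1865"}

for system, answers in submissions.items():
    credit = any(a in judged_correct for a in answers)
    print(system, "credited" if credit else "no credit")
```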

Slide 9: What constitutes an answer?

• "How did Abraham Lincoln die?"
• Possible answers:
  – "bullet"
  – "suddenly"
  – "with people around"
• "When did Lincoln die?"
  – "before the election"
  – April 14, 1865

Slide 10: Can you find this on the Web?

• The World Almanac 1997.

• 1865-Apr. 14. U.S. Pres. Abraham Lincoln shot by John Wilkes Booth in Washington D.C. Died Apr. 15.

• Provides enough context to be sure that the given information is intended to answer the question

Slide 11: Typical window

• For factoid questions
  – Where is the Taj Mahal?
  – Who killed Lincoln?
  – …

• Long window: 250 characters including spaces

• Short window: 50 characters including spaces
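A minimal sketch of cutting such a fixed-size answer window; the passage and the centering rule are assumptions for illustration, with the 50- and 250-character limits taken from the slide:

```python
# Cut a fixed-size character window (spaces included) around a
# candidate answer, as in TREC's 50- and 250-character windows.
# The passage is invented for illustration.

def answer_window(passage, answer, size):
    start = passage.find(answer)
    if start < 0:
        return None
    # Center the window on the candidate answer, clipped to the passage.
    pad = (size - len(answer)) // 2
    begin = max(0, start - pad)
    return passage[begin:begin + size]

passage = ("U.S. Pres. Abraham Lincoln was shot by John Wilkes Booth "
           "in Washington D.C. on April 14, 1865, and died the next day.")
print(answer_window(passage, "John Wilkes Booth", 50))
```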

Slide 12: How to keep score?

Candidate answers, as ranked by a system:

• 1. During the civil war
• 2. In the springtime
• 3. At a theatre
• 4. April 15, 1865 *** (correct)
• 5. In April, 1865

• Score is the "reciprocal rank" of the first correct answer
  – = 1/4 = 0.25
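A minimal sketch of that computation, scoring a ranked list by the reciprocal rank of the first correct answer:

```python
# Reciprocal rank: 1 / (rank of the first correct answer), 0 if none.
def reciprocal_rank(ranked_answers, correct):
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer in correct:
            return 1.0 / rank
    return 0.0

ranked = ["During the civil war", "In the springtime",
          "At a theatre", "April 15, 1865", "In April, 1865"]
print(reciprocal_rank(ranked, {"April 15, 1865"}))  # 0.25
```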

Slide 13: For many questions

• Score is the "mean reciprocal rank" (MRR)
  – the average over many questions of the RR
  – if the system were perfect, it would be 1
• Secondary measure:
  – how many times is the right answer in the top position?
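Continuing the sketch above, MRR and the top-position count over several questions (the per-question reciprocal ranks here are invented):

```python
# Mean reciprocal rank over many questions, plus the secondary
# measure: how often the correct answer came first.
rr_per_question = [1.0, 0.25, 0.0, 1.0, 0.5]

mrr = sum(rr_per_question) / len(rr_per_question)
top_position = sum(1 for rr in rr_per_question if rr == 1.0)

print(f"MRR = {mrr:.2f}")  # MRR = 0.55
print(f"Answer ranked first: {top_position} of {len(rr_per_question)}")
```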

Slide 14: State of the art is amazing

• http://trec.nist.gov/pubs/trec11/papers/lcc.moldovan.pdf

• Correctly answered 415 out of 500.

• Confidence weighting
  – get more points if you have high confidence in the answer and it is correct (see the sketch below)
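TREC 2002 scored this with a confidence-weighted measure: systems sort their answers from most to least confident, and the score averages, over each prefix of that list, the fraction of answers so far that are correct. A minimal sketch, assuming that definition:

```python
# Confidence-weighted score: answers sorted by the system's
# confidence; average over positions i of (correct in first i) / i.
# Being right on high-confidence answers earns more points.
def confidence_weighted_score(correct_flags_by_confidence):
    q = len(correct_flags_by_confidence)
    running_correct = 0
    total = 0.0
    for i, is_correct in enumerate(correct_flags_by_confidence, start=1):
        running_correct += is_correct
        total += running_correct / i
    return total / q

# Invented judgments, sorted from most to least confident.
print(confidence_weighted_score([True, True, False, True, False]))
```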

Slide 15: How do they do that?

• Very complicated
  – retrieve candidate passages
  – match candidate passages to templates that would contain the answer (transformations); see the sketch below
• When did Lincoln die?
  – Lincoln died on XXXX
  – Lincoln was killed on XXXX
  – YYY killed Lincoln on XXXX, etc.
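A minimal sketch of template matching with regular expressions; the patterns below are illustrative stand-ins for the hand-built templates the slide describes:

```python
import re

# Hand-built answer templates for "When did Lincoln die?", written
# as regular expressions. XXXX (the date slot) becomes a capture group.
templates = [
    re.compile(r"Lincoln died on ([A-Z][a-z]+ \d{1,2}, \d{4})"),
    re.compile(r"Lincoln was killed on ([A-Z][a-z]+ \d{1,2}, \d{4})"),
    re.compile(r"\w+ killed Lincoln on ([A-Z][a-z]+ \d{1,2}, \d{4})"),
]

passage = "According to the almanac, Lincoln died on April 15, 1865."

for pattern in templates:
    match = pattern.search(passage)
    if match:
        print("answer:", match.group(1))  # answer: April 15, 1865
        break
```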

Slide 16: Is the work done?

• Answer patterns are hand-built.

• Scores of person-years of effort have been invested.

• There can always be new types of factoid questions.
  – these require new patterns
  – but we know how to build them
• The problem is "solved".

Slide 17: But there is Analytic Question Answering

• The answer cannot be contained in a single passage.
• Ideal goal: software that
  – assembles the several passages that define the answer
  – extracts the facts and relations from them
  – synthesizes the answer as a readable English statement

Slide 18: Current state of the art

Measured against that ideal goal:

• Software that assembles the several passages that define the answer
  – we can find passages that are likely to be relevant
• Software that extracts the facts and relations from them
  – we cannot extract most facts from free-form sentences
  – some success (McKeown, Columbia) dealing with narrow domains (sports; financial)

Slide 19: Relation to the ideal

• Software that synthesizes the answer as a readable English statement

• there are some model programs that can synthesize a description of a sports event (basketball) from a numerical summary of the game

• In short: this problem is not yet solved!

Slide 20: Current good performance

• There are many components of QA programs being built under ARDA sponsorship in the AQUAINT program.
• The three end-to-end systems are:
  – GINKO (Cycorp)
    • millions of elementary propositions ("you can't push on a rope"), all translated into a rigorous representation language

Slide 21: End-to-end (continued)

  – LCC: system has not been evaluated
  – HITIQA (SUNY Albany/Rutgers)
    • retrieves passages, assembles them into clusters, and represents the clusters by FRAMES such as TRANSFER:
      – FROM: US
      – TO: France
      – WHAT: Movies
      – MODE: Sale
    • then matches the frames of candidate answers to those of the question, and lets the user browse among them (see the sketch below)
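A minimal sketch of frame matching in the HITIQA style; the frame layout and the slot-overlap score are assumptions for illustration, not the system's actual data structures:

```python
from dataclasses import dataclass, field

# A TRANSFER-style frame with the slots named on the slide.
# Scoring by counting matching filled slots is an invented,
# illustrative stand-in for HITIQA's real matching logic.
@dataclass
class Frame:
    kind: str
    slots: dict = field(default_factory=dict)

def match_score(question_frame, answer_frame):
    if question_frame.kind != answer_frame.kind:
        return 0
    return sum(1 for slot, value in question_frame.slots.items()
               if answer_frame.slots.get(slot) == value)

question = Frame("TRANSFER", {"FROM": "US", "TO": "France", "WHAT": "Movies"})
answer = Frame("TRANSFER",
               {"FROM": "US", "TO": "France", "WHAT": "Movies", "MODE": "Sale"})
print(match_score(question, answer))  # 3: all question slots match
```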

Slide 22: How are these evaluated?

• Evolving art
• Multiple measures
  – NASA TLX
  – questionnaires / usability perspective
  – cross evaluation / quality of product
  – Glass Box logging tools: capture every keystroke and mouse movement/action

Slide 23: Current situation

• Performance scale (from a chart on the original slide):
  – 100% = 20 hours of human effort (approx.)
  – 25% = real system performance so far
  – 15%
  – 0% = just use Google
• Morse et al., to be published.

Slide 24: Hoped-for goal

• To be able to express system quality not in terms of "percent of questions answered" (because that is not applicable to analytic questions).
• Perhaps we can find an economically meaningful expression, so that the sponsor (or buyer) can say: "having system X on the desktop saves us Y hours of work per report we prepare."

Slide 25: Prospects

• This is a very important research area

• Systems that can answer questions relieve the tedious burden of reference librarianship

• Permit librarians to focus on the most interesting questions

• Our principal weapon is not increased understanding of languages.

Slide 26: Role of Language

• We knew enough 40 years ago

• Secret weapon is Moore’s Law

• we are able to compute thousands of potential patterns across hundreds of thousands of texts (using inverted-index techniques)

• So there is reason to be hopeful, but ….

Slide 27: Thank you

• "It is very hard to predict, especially about the future." (Yogi Berra)