Slide 1: Question Answering

Sept 20, 2004
Paul Kantor
Prepared for Rutgers LIS558
© Paul Kantor

Slide 2: Why try QA?

• Users want answers, not documents
  – What is the population of Brazil?
  – a world almanac
  – a web page about Brazil
• The organizations behind the users want answers
  – Where is Usama bin Ladin?
  – Does Israel have nuclear weapons?
  – What is John Kerry's real position on the war?

Slide 3: QA as a kind of IR

• Current approach
  – take the question
  – ignore all semantics and syntax
  – retrieve documents with the same words, or at least some of them (vector methods; quorum methods), as in the sketch below
  – hope the user finds the answer in them
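A minimal sketch of this bag-of-words approach; the documents and the shared-term count are invented stand-ins for real vector or quorum scoring:

```python
import re

# Toy bag-of-words retrieval: score each document by how many
# question terms it shares, ignoring syntax and semantics.
# Documents and the overlap score are illustrative only.

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

docs = {
    "doc34": "The military budget grew during the war.",
    "doc52": "Brazil has a population of about 170 million people.",
}

def retrieve(question):
    q_terms = tokenize(question)
    scored = [(len(q_terms & tokenize(text)), name)
              for name, text in docs.items()]
    return sorted(scored, reverse=True)

print(retrieve("What is the population of Brazil?"))
# doc52 ranks first: it shares 'population', 'of', 'brazil' with the question.
```

Note that the top document merely contains the answer; nothing in this scheme extracts it.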

Slide 4: From Documents to Passages

• Break all documents into short pieces; this means that an index says
  – military: doc 34, part 5; doc 52, part 3 ...
  – nuclear: doc 34, part 7; doc 52, part 3 ...
• We retrieve doc 52, part 3 first (see the sketch below)
  – and hope it contains the answer
  – overhead = size of part 3
  – cost: the user's time spent scanning
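A minimal sketch of a passage-level inverted index of the kind described, with hypothetical document parts:

```python
from collections import defaultdict

# Passage-level inverted index: each posting names a (doc, part)
# rather than a whole document. The passages below are made up.
passages = {
    ("doc34", 5): "military spending rose sharply",
    ("doc34", 7): "the nuclear program was suspended",
    ("doc52", 3): "military and nuclear cooperation expanded",
}

index = defaultdict(set)
for key, text in passages.items():
    for term in text.split():
        index[term].add(key)

# A passage containing both query terms is retrieved first;
# the user then scans only that short part, not the whole document.
hits = index["military"] & index["nuclear"]
print(hits)  # {('doc52', 3)}
```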

Slide 5: How to do better?

• A. Don't need to
  – people scan pretty fast
  – serendipity
• B. We have to
  – the FBI has 120,000 hours of terrorism-related tapes
  – imagine they are transcribed
  – what do we want to look for?

Slide 6: Sponsors care enough

• TREC (the Text REtrieval Conference)

• http://trec.nist.gov

• Originally intended to advance the state of the art of Information Retrieval.

• Developed a Question Answering Track

• Controlled Evaluation

Slide 7: TREC and the Cranfield Model

• How do we know if one system is better than another?
• Set problems for which we know the correct answers, and see whether the systems find them
• Works for small collections, or for outrageously unique answers.
• The needle in the ocean, or "scavenger hunt".

Slide 8: For large collections

• Pooled evaluation

• Set the task for many systems

• Pool the candidate results

• decide which ones are correct

• there may be more than one

• give credit to any system that finds one of them (see the sketch below)
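A minimal sketch of pooling; the system outputs and the assessors' judgment are invented for illustration:

```python
# Pooled evaluation: pool the candidates submitted by all systems,
# judge the pool once, then credit each system that returned any
# judged-correct answer. All data here is invented.

submissions = {
    "systemA": ["April 15, 1865", "in a theatre"],
    "systemB": ["April 14, 1865", "April 15, 1865"],
    "systemC": ["during the civil war"],
}

pool = set()
for answers in submissions.values():
    pool.update(answers)

# Human assessors judge each pooled candidate once.
judged_correct = {"April 15, 1865"}

for system, answers in submissions.items():
    credit = any(a in judged_correct for a in answers)
    print(system, "credited" if credit else "no credit")
```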

Slide 9: What constitutes an answer?

• "How did Abraham Lincoln die?"
• Possible answers:
  – "bullet"
  – "suddenly"
  – "with people around"
• "When did Lincoln die?"
  – "before the election"
  – April 14, 1865

Slide 10: Can you find this on the Web?

• The World Almanac 1997.

• 1865-Apr. 14. U.S. Pres. Abraham Lincoln shot by John Wilkes Booth in Washington D.C. Died Apr. 15.

• Provides enough context to be sure that the given information is intended to answer the question

Slide 11: Typical window

• For factoid questions
  – Where is the Taj Mahal?
  – Who killed Lincoln?
  – …

• Long window: 250 characters including spaces

• Short window: 50 characters including spaces
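A minimal sketch of cutting such a fixed-size answer window; the passage and the centering rule are assumptions for illustration, with the 50- and 250-character limits taken from the slide:

```python
# Cut a fixed-size character window (spaces included) around a
# candidate answer, as in TREC's 50- and 250-character windows.
# The passage is invented for illustration.

def answer_window(passage, answer, size):
    start = passage.find(answer)
    if start < 0:
        return None
    # Center the window on the candidate answer, clipped to the passage.
    pad = (size - len(answer)) // 2
    begin = max(0, start - pad)
    return passage[begin:begin + size]

passage = ("U.S. Pres. Abraham Lincoln was shot by John Wilkes Booth "
           "in Washington D.C. on April 14, 1865, and died the next day.")
print(answer_window(passage, "John Wilkes Booth", 50))
```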

Slide 12: How to keep score?

Candidate answers, as ranked by a system:

• 1. During the civil war
• 2. In the springtime
• 3. At a theatre
• 4. April 15, 1865 *** (correct)
• 5. In April, 1865

• Score is the "reciprocal rank" of the first correct answer
  – = 1/4 = 0.25
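A minimal sketch of that computation, scoring a ranked list by the reciprocal rank of the first correct answer:

```python
# Reciprocal rank: 1 / (rank of the first correct answer), 0 if none.
def reciprocal_rank(ranked_answers, correct):
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer in correct:
            return 1.0 / rank
    return 0.0

ranked = ["During the civil war", "In the springtime",
          "At a theatre", "April 15, 1865", "In April, 1865"]
print(reciprocal_rank(ranked, {"April 15, 1865"}))  # 0.25
```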

Slide 13: For many questions

• Score is the "mean reciprocal rank" (MRR)
  – the average over many questions of the RR
  – if the system were perfect, it would be 1
• Secondary measure:
  – how many times is the right answer in the top position?
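Continuing the sketch above, MRR and the top-position count over several questions (the per-question reciprocal ranks here are invented):

```python
# Mean reciprocal rank over many questions, plus the secondary
# measure: how often the correct answer came first.
rr_per_question = [1.0, 0.25, 0.0, 1.0, 0.5]

mrr = sum(rr_per_question) / len(rr_per_question)
top_position = sum(1 for rr in rr_per_question if rr == 1.0)

print(f"MRR = {mrr:.2f}")  # MRR = 0.55
print(f"Answer ranked first: {top_position} of {len(rr_per_question)}")
```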

Slide 14: State of the art is amazing

• http://trec.nist.gov/pubs/trec11/papers/lcc.moldovan.pdf

• Correctly answered 415 out of 500.

• Confidence weighting
  – get more points if you have high confidence in the answer and it is correct (see the sketch below)
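TREC 2002 scored this with a confidence-weighted measure: systems sort their answers from most to least confident, and the score averages, over each prefix of that list, the fraction of answers so far that are correct. A minimal sketch, assuming that definition:

```python
# Confidence-weighted score: answers sorted by the system's
# confidence; average over positions i of (correct in first i) / i.
# Being right on high-confidence answers earns more points.
def confidence_weighted_score(correct_flags_by_confidence):
    q = len(correct_flags_by_confidence)
    running_correct = 0
    total = 0.0
    for i, is_correct in enumerate(correct_flags_by_confidence, start=1):
        running_correct += is_correct
        total += running_correct / i
    return total / q

# Invented judgments, sorted from most to least confident.
print(confidence_weighted_score([True, True, False, True, False]))
```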

Slide 15: How do they do that?

• Very complicated
  – retrieve candidate passages
  – match candidate passages to templates that would contain the answer (transformations); see the sketch below
• When did Lincoln die?
  – Lincoln died on XXXX
  – Lincoln was killed on XXXX
  – YYY killed Lincoln on XXXX, etc.
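A minimal sketch of template matching with regular expressions; the patterns below are illustrative stand-ins for the hand-built templates the slide describes:

```python
import re

# Hand-built answer templates for "When did Lincoln die?", written
# as regular expressions. XXXX (the date slot) becomes a capture group.
templates = [
    re.compile(r"Lincoln died on ([A-Z][a-z]+ \d{1,2}, \d{4})"),
    re.compile(r"Lincoln was killed on ([A-Z][a-z]+ \d{1,2}, \d{4})"),
    re.compile(r"\w+ killed Lincoln on ([A-Z][a-z]+ \d{1,2}, \d{4})"),
]

passage = "According to the almanac, Lincoln died on April 15, 1865."

for pattern in templates:
    match = pattern.search(passage)
    if match:
        print("answer:", match.group(1))  # answer: April 15, 1865
        break
```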

Slide 16: Is the work done?

• Answer patterns are hand-built.

• Scores of person-years of effort have been invested.

• There can always be new types of factoid questions.
  – these require new patterns
  – but we know how to build them
• The problem is "solved".

Slide 17: But there is Analytic Question Answering

• The answer cannot be contained in a single passage.
• Ideal goal: software that
  – assembles the several passages that define the answer
  – extracts the facts and relations from them
  – synthesizes the answer as a readable English statement

Slide 18: Current state of the art

Measured against that ideal goal:

• Software that assembles the several passages that define the answer
  – we can find passages that are likely to be relevant
• Software that extracts the facts and relations from them
  – we cannot extract most facts from free-form sentences
  – some success (McKeown, Columbia) dealing with narrow domains (sports; financial)

Slide 19: Relation to the ideal

• Software that synthesizes the answer as a readable English statement

• there are some model programs that can synthesize a description of a sports event (basketball) from a numerical summary of the game

• In short: this problem is not yet solved!

Slide 20: Current good performance

• There are many components of QA programs being built under ARDA sponsorship in the AQUAINT program.
• The three end-to-end systems are:
  – GINKO (Cycorp)
    • millions of elementary propositions ("you can't push on a rope"), all translated into a rigorous representation language

Slide 21: End-to-end (continued)

  – LCC: system has not been evaluated
  – HITIQA (SUNY Albany/Rutgers)
    • retrieves passages, assembles them into clusters, and represents the clusters by FRAMES such as TRANSFER:
      – FROM: US
      – TO: France
      – WHAT: Movies
      – MODE: Sale
    • then matches the frames of candidate answers to those of the question, and lets the user browse among them (see the sketch below)
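A minimal sketch of frame matching in the HITIQA style; the frame layout and the slot-overlap score are assumptions for illustration, not the system's actual data structures:

```python
from dataclasses import dataclass, field

# A TRANSFER-style frame with the slots named on the slide.
# Scoring by counting matching filled slots is an invented,
# illustrative stand-in for HITIQA's real matching logic.
@dataclass
class Frame:
    kind: str
    slots: dict = field(default_factory=dict)

def match_score(question_frame, answer_frame):
    if question_frame.kind != answer_frame.kind:
        return 0
    return sum(1 for slot, value in question_frame.slots.items()
               if answer_frame.slots.get(slot) == value)

question = Frame("TRANSFER", {"FROM": "US", "TO": "France", "WHAT": "Movies"})
answer = Frame("TRANSFER",
               {"FROM": "US", "TO": "France", "WHAT": "Movies", "MODE": "Sale"})
print(match_score(question, answer))  # 3: all question slots match
```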

Slide 22: How are these evaluated?

• Evolving art
• Multiple measures
  – NASA TLX
  – questionnaires / usability perspective
  – cross evaluation / quality of product
  – Glass Box logging tools: capture every keystroke and mouse movement/action

Slide 23: Current situation

• Performance scale (from a chart on the original slide):
  – 100% = 20 hours of human effort (approx.)
  – 25% = real system performance so far
  – 15%
  – 0% = just use Google
• Morse et al., to be published.

Slide 24: Hoped-for goal

• To be able to express system quality not in terms of "percent of questions answered" (because that is not applicable to analytic questions).
• Perhaps we can find an economically meaningful expression, so that the sponsor (or buyer) can say: "having system X on the desktop saves us Y hours of work per report we prepare."

Slide 25: Prospects

• This is a very important research area

• Systems that can answer questions relieve the tedious burden of reference librarianship

• Permit librarians to focus on the most interesting questions

• Our principal weapon is not increased understanding of languages.

Slide 26: Role of Language

• We knew enough 40 years ago

• Secret weapon is Moore’s Law

• we are able to compute thousands of potential patterns across hundreds of thousands of texts (using inverted-index techniques)

• So there is reason to be hopeful, but ….

Slide 27: Thank you

• "It is very hard to predict, especially about the future." (Yogi Berra)