CLEF – Cross Language Evaluation Forum
Question Answering at CLEF 2003 (http://clef-qa.itc.it)
The Multiple Language
Question Answering Track at CLEF 2003
Bernardo Magnini*, Simone Romagnoli*, Alessandro Vallin*
Jesús Herrera**, Anselmo Peñas**, Víctor Peinado**, Felisa Verdejo**
Maarten de Rijke***
* ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Trento - Italy
{magnini,romagnoli,vallin}@itc.it
** UNED, Spanish Distance Learning University, Madrid – Spain
{jesus.herrera,anselmo,victor,felisa}@lsi.uned.es
*** Language and Inference Technology Group, ILLC, University of Amsterdam - The Netherlands
Outline
Overview of the Question Answering track at CLEF 2003
• Report on the organization of QA tasks
• Present and discuss the participants’ results
• Perspectives for future QA campaigns
Question Answering
• QA: find the answer to an open domain question in a large collection of documents
INPUT: questions (instead of keyword-based queries)
OUTPUT: answers (instead of documents)
• QA track at TREC
– Mostly fact-based questions
Question: Who invented the electric light?
Answer: Edison
• Scientific Community
– NLP and IR
– AQUAINT program in USA
• QA as an application scenario
Multilingual QA
Purposes:
• Answers may be found in languages different from the language of the question
• Interest in QA systems for languages other than English
• Force the QA community to design real multilingual systems
• Check/improve the portability of the technologies implemented in current English QA systems
• Creation of reusable resources and benchmarks for further multilingual QA evaluation
QA at CLEF 2003 - Organization
“QA@CLEF” WEB SITE ( http://clef-qa.itc.it )
CLEF QA MAILING LIST ( [email protected] )
GUIDELINES FOR THE TRACK (following the model of TREC 2001)
Tasks at CLEF 2003
Each task: 200 questions posed against a target corpus; answers returned either as exact strings or as 50-byte strings.
QA Tasks at CLEF 2003
Language  Monolingual Q-set  Monolingual assessment  Bilingual (English) Q-set  Bilingual assessment
Italian   ITC-irst           ITC-irst                ITC-irst                   NIST
Dutch     U. Amsterdam       U. Amsterdam            ITC-irst, U. Amsterdam     NIST
Spanish   UNED               UNED                    ITC-irst, UNED             NIST
French    -                  -                       ITC-irst, U. Montreal      NIST
German    -                  -                       ITC-irst, DFKI             NIST
Tasks at CLEF 2003 - Participation
Number of participating groups per task:
Monolingual: Italian 1, Dutch 1, Spanish 1
Bilingual against English: Italian 1, Dutch 0, Spanish 1, French 3, German 1
Bilingual against English
Pipeline: English questions were extracted from the English text collection, translated into the source language (e.g. Italian questions), submitted to the QA systems, and the resulting English answers were assessed.
Estimated manual effort: 1 person/month to extract 200 questions, 2 person/days to translate 200 questions, and 4 person/days to assess one run (600 answers).
Document Collections
Corpora licensed by CLEF in 2002:
• Dutch: Algemeen Dagblad and NRC Handelsblad (1994 and 1995) - monolingual task
• Italian: La Stampa and SDA press agency (1994) - monolingual task
• Spanish: EFE press agency (1994) - monolingual task
• English: Los Angeles Times (1994) - bilingual tasks
Creating the Test Collection
Starting from the CLEF topics, 150 question/answer pairs were produced in each of Dutch, Italian and Spanish (the monolingual test sets) and translated into English (150 Dutch/English, 150 Italian/English and 150 Spanish/English pairs).
The English questions were then shared among ILLC, ITC-irst and UNED: each group received the 300 questions originating in the other two languages (ILLC the 300 Italian+Spanish questions, ITC-irst the 300 Dutch+Spanish, UNED the 300 Italian+Dutch) and searched for answers in its own target language.
Merging all the data produced the DISEQuA corpus.
Questions
200 fact-based questions for each task:
- queries related to events that occurred in 1994 and/or 1995, i.e. the years covered by the target corpora
- coverage of different categories of questions: date, location, measure, person, object, organization, other
- questions were not guaranteed to have an answer in the corpora: 10% of each test set required the answer string “NIL”
Excluded question types: definition questions (“Who/What is X”), Yes/No questions and list questions.
Answers
Participants were allowed to submit up to three answers per question and up to two runs:
- answers must be either exact (i.e. contain just the minimal information) or 50-byte strings
- answers must be supported by a document
- answers must be ranked by confidence
Answers were judged by human assessors according to four categories:
• CORRECT (R)
• UNSUPPORTED (U)
• INEXACT (X)
• INCORRECT (W)
Judging the Answers
Question 1: What museum is directed by Henry Hopkins?
W 1 irstex031bi 1 3253 LA011694-0094 Modern Art
U 1 irstex031bi 2 1776 LA011694-0094 UCLA
X 1 irstex031bi 3 1251 LA042294-0050 Cultural Center
Comment: the second answer was correct, but the retrieved document did not support it. The third response missed bits of the name and was judged inexact.

Question 2: Where did the Purussaurus live before becoming extinct?
W 2 irstex031bi 1 9 NIL
Comment: the system erroneously “believed” that the query had no answer in the corpus, or could not find one.

Question 3: When did Shapour Bakhtiar die?
R 3 irstex031bi 1 484 LA012594-0239 1991
W 3 irstex031bi 2 106 LA012594-0239 Monday
Comment: in questions that asked for the date of an event, the year alone was often regarded as sufficient.

Question 4: Who is John J. Famalaro accused of having killed?
W 4 irstex031bi 1 154 LA072294-0071 Clark
R 4 irstex031bi 2 117 LA072594-0055 Huber
W 4 irstex031bi 3 110 LA072594-0055 Department
Comment: the second answer, which returned the victim’s last name, was considered sufficient and correct, since the retrieved document mentioned no other people named “Huber”.
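The judged lines above share a fixed field layout. Below is a minimal parsing sketch in Python; the field meanings (judgment, question number, run tag, answer rank, a numeric score, document id, answer string) are inferred from the examples, not taken from an official run-format specification:

```python
from dataclasses import dataclass

@dataclass
class JudgedAnswer:
    judgment: str   # R, U, X or W (inferred from the assessment categories above)
    question: int   # question number
    run_tag: str    # e.g. "irstex031bi"
    rank: int       # answer rank within the run (1-3)
    score: int      # numeric field as it appears in the line (meaning assumed)
    docid: str      # supporting document id, or "NIL"
    answer: str     # answer string (may contain spaces); empty for NIL responses

def parse_judged_line(line: str) -> JudgedAnswer:
    # Split off the six fixed fields; whatever remains is the answer string.
    parts = line.split(None, 6)
    judgment, question, run_tag, rank, score, docid = parts[:6]
    answer = parts[6] if len(parts) > 6 else ""
    return JudgedAnswer(judgment, int(question), run_tag,
                        int(rank), int(score), docid, answer)

print(parse_judged_line("R 3 irstex031bi 1 484 LA012594-0239 1991"))
```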
Evaluation Measures
The score for each question was the reciprocal of the rank of the first answer to be found correct; if no correct answer was returned, the score was 0.
The total score, or Mean Reciprocal Rank (MRR), was the mean score over all questions.
In STRICT evaluation only correct (R) answers scored points.
In LENIENT evaluation the unsupported (U) answers were considered correct, as well.
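A minimal sketch of this scoring in Python (illustrative only; judgments is a per-question list of judgment letters in rank order, using the R/U/X/W categories above):

```python
def question_score(judgments, lenient=False):
    """Reciprocal rank of the first correct answer, else 0."""
    correct = {"R", "U"} if lenient else {"R"}  # lenient also counts unsupported
    for rank, judgment in enumerate(judgments, start=1):
        if judgment in correct:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(all_judgments, lenient=False):
    """Mean of the per-question scores over all questions."""
    return sum(question_score(j, lenient) for j in all_judgments) / len(all_judgments)

# Example: first question answered correctly at rank 2, second not answered.
print(mean_reciprocal_rank([["W", "R", "X"], ["W", "W", "W"]]))  # (0.5 + 0) / 2 = 0.25
```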
Participants
GROUP                                           TASK                 RUN NAMES
DLSI-UA (University of Alicante, Spain)         Monolingual Spanish  alicex031ms, alicex032ms
UVA (University of Amsterdam, The Netherlands)  Monolingual Dutch    uamsex031md, uamsex032md
ITC-irst (Italy)                                Monolingual Italian  irstex031mi, irstst032mi
                                                Bilingual Italian    irstex031bi, irstex032bi
ISI (University of Southern California, USA)    Bilingual Spanish    isixex031bs, isixex032bs
DFKI (Germany)                                  Bilingual German     dfkist031bg
CS-CMU (Carnegie Mellon University, USA)        Bilingual French     lumoex031bf, lumoex032bf
DLTG (University of Limerick, Ireland)          Bilingual French     dltgex031bf, dltgex032bf
RALI (University of Montreal, Canada)           Bilingual French     udemst031bf, udemex032bf

No runs were submitted for the Bilingual Dutch task.
Participants in past QA tracks
Comparison between the number and origin of the participants in the past TREC QA tracks and in this year's CLEF QA track:

Track      US/Canada  Europe  Asia  Australia  Total participants  Submitted runs
TREC-8     13         3       3     1          20                  46
TREC-9     14         7       6     -          27                  75
TREC-10    19         8       8     -          35                  67
TREC-11    16         10      6     -          32                  67
CLEF 2003  3          5       -     -          8                   17
Performances at TREC-QA
• Evaluation metric: Mean Reciprocal Rank (MRR); each question scores 1 / rank of the first correct answer

                    TREC-8  TREC-9  TREC-10
Best result         66%     58%     67%
Average (all runs)  25%     24%     23%
Results - EXACT ANSWERS RUNS
MONOLINGUAL TASKS
                                            MRR             Q. with >=1 right answer  NIL questions
GROUP     TASK                 RUN          strict lenient  strict lenient            returned correct
DLSI-UA   Monolingual Spanish  alicex031ms  .307   .320     80     87                 21       5
                               alicex032ms  .296   .317     70     77                 21       5
ITC-irst  Monolingual Italian  irstex031mi  .422   .442     97     101                4        2
UVA       Monolingual Dutch    uamsex031md  .298   .317     78     82                 200      17
                               uamsex032md  .305   .335     82     89                 200      17
Results - EXACT ANSWERS RUNS
MONOLINGUAL TASKS
[Bar chart: strict and lenient MRR (0-0.6) per run: alicex031ms, alicex032ms, irstex031mi, uamsex031md, uamsex032md]
Results - EXACT ANSWERS RUNS
CROSS-LANGUAGE TASKS
                                          MRR             Q. with >=1 right answer  NIL questions
GROUP     TASK               RUN          strict lenient  strict lenient            returned correct
ISI       Bilingual Spanish  isixex031bs  .302   .328     69     77                 4        0
                             isixex032bs  .271   .307     68     78                 4        0
ITC-irst  Bilingual Italian  irstex031bi  .322   .334     77     81                 49       6
                             irstex032bi  .393   .400     90     92                 28       5
CS-CMU    Bilingual French   lumoex031bf  .153   .170     38     42                 92       8
                             lumoex032bf  .131   .149     31     35                 91       7
DLTG      Bilingual French   dltgex031bf  .115   .120     23     24                 119      10
                             dltgex032bf  .110   .115     22     23                 119      10
RALI      Bilingual French   udemex032bf  .140   .160     38     42                 3        1
Results - EXACT ANSWERS RUNS
CROSS-LANGUAGE TASKS
[Bar chart: strict and lenient MRR (0-0.6) per run: isixex031bs, isixex032bs, irstex031bi, irstex032bi, lumoex031bf, lumoex032bf, dltgex031bf, dltgex032bf, udemex032bf]
Results - 50 BYTES ANSWERS RUNS
MONOLINGUAL TASKS
                                            MRR             Q. with >=1 right answer  NIL questions
GROUP     TASK                 RUN          strict lenient  strict lenient            returned correct
ITC-irst  Monolingual Italian  irstst032mi  .449   .471     99     104                5        2
Results - 50 BYTES ANSWERS RUNS
CROSS-LANGUAGE TASKS
                                       MRR             Q. with >=1 right answer  NIL questions
GROUP  TASK              RUN           strict lenient  strict lenient            returned correct
DFKI   Bilingual German  dfkist031bg   .098   .103     29     30                 18       0
RALI   Bilingual French  udemst031bf   .213   .220     56     58                 4        1
Average Results in Different Tasks
[Two bar charts - EXACT ANSWERS - MONOLINGUAL (5 runs) and EXACT ANSWERS - BILINGUAL (9 runs): number of questions (0-200) whose first correct answer came 1st, 2nd, 3rd, or was not found, under strict and lenient evaluation]
Approaches in CL QA
Two main approaches were used in Cross-Language QA systems:
1. Translation of the question into the target language (i.e. into the language of the document collection), followed by question processing and answer extraction.
2. Preliminary question processing in the source language to retrieve information (such as keywords, question focus, expected answer type, etc.), then translation and expansion of the retrieved data, followed by answer extraction.
Cross-language participants: ITC-irst, RALI, DFKI, ISI, CS-CMU, University of Limerick.
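A schematic sketch of the two pipelines in Python, using stub components (every function here is an illustrative placeholder, not any participant's actual system):

```python
# Illustrative stubs; each stands in for a full component of a real CL QA system.
def translate(text, target_lang):
    return text  # stand-in for a machine translation component

def analyze_question(question):
    # Stand-in for question processing: keywords, focus, expected answer type.
    return {"keywords": question.rstrip("?").split(), "answer_type": "unknown"}

def extract_answer(analysis, collection):
    # Stand-in for retrieval plus answer extraction over the target collection.
    return next((doc for doc in collection
                 if all(k in doc for k in analysis["keywords"])), "NIL")

def approach_1(question, collection, target_lang="en"):
    """Approach 1: translate the whole question, then process it in the target language."""
    translated = translate(question, target_lang)
    return extract_answer(analyze_question(translated), collection)

def approach_2(question, collection, target_lang="en"):
    """Approach 2: process the question in the source language, then translate the data."""
    analysis = analyze_question(question)
    analysis["keywords"] = [translate(k, target_lang) for k in analysis["keywords"]]
    return extract_answer(analysis, collection)
```

The practical difference: approach 1 exposes the question analyzer to machine-translation errors in the full question, while approach 2 only translates the extracted keywords and query data, at the cost of needing source-language analysis tools.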
Conclusions
A pilot evaluation campaign for multiple language Question Answering systems has been carried out.
Five European languages were considered: three monolingual tasks and five bilingual tasks against an English collection were set up.
Considering the differences between the tasks, the results are comparable with QA at TREC.
A corpus of 450 questions, each in four languages, with at least one known answer in the respective text collection, has been built.
This year's experience was very positive: we intend to continue with QA at CLEF 2004.
Perspectives for Future QA Campaigns
• Organization issues:
• Promote larger participation
• Collaboration with NIST
• Financial issues:
• Find a sponsor: ELRA, the new CELCT center, …
• Tasks (to be discussed)
• Align with TREC 2003: definition questions, list questions
• Consider only “exact answer” responses: the 50-byte format found little favor
• Introduce new languages: in the cross-language task this is easy to do
• New steps toward multilinguality: English questions against other language collections; a small set of full cross-language tasks (e.g. Italian/Spanish).
Creation of the Question Set
1. Find 200 questions for each language (Dutch, Italian, Spanish), based on CLEF 2002 topics, with at least one answer in the respective corpus.
2. Translate each question into English, and from English into the other two languages.
3. Find answers in the corpora of the other languages (e.g. a Dutch question was translated and processed against the Italian text collection).
4. Select the questions with at least one answer in all the corpora for the final question set.
5. The result is a corpus of 450 questions, each in four languages, with at least one known answer in the respective text collection. More details in the paper and in the poster.
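A minimal sketch of the selection step (4) in Python, assuming each candidate question is tagged with the languages whose corpora were found to contain an answer (the data layout and example entries are hypothetical):

```python
# Hypothetical layout: question text mapped to the languages whose corpora
# contain at least one answer for it.
candidates = {
    "Who invented the electric light?": {"nl", "it", "es", "en"},
    "Where did the Purussaurus live?":  {"it", "en"},
}

required = {"nl", "it", "es", "en"}  # all four corpora must answer the question

# Keep only the questions answerable in every corpus.
final_question_set = [q for q, langs in candidates.items() if required <= langs]
print(final_question_set)
```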