Implementation of a QA system in a real context
Carlos Amaral (Priberam, Portugal)
Dominique Laurent (Synapse Développement, France)
Workshop TellMeMore, November 24, 2006, C.Amaral, D.Laurent
1. The Question-Answering system
What is a QA system?
• A system that extracts one or several answers to a request (a question) from a corpus
• The issue of the "type of the question"
• One or several answers, possibly a list drawn from one or several documents, or a Yes/No answer…
• Over a corpus in one or several languages…
1.1. QA and Language Processing
• A QA system appears to be a language processing (LP) application "par excellence"
• However, certain systems rely solely on pattern matching (cf. Soubotine & Soubotine, TREC 2003)
• These systems seem to have reached their limits
• And, while they can process everything that is factual, complex questions/queries are far beyond their capabilities
• The best systems validated at TREC and CLEF are based on Automated Language Processing
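To make the contrast concrete, here is a minimal sketch of a purely pattern-based answer extractor. It is not the approach of any system cited above: the two surface patterns, the `answer` function and the toy corpus are all assumptions chosen for illustration.

```python
import re

# Hypothetical surface patterns: a question shape paired with an answer template.
# "{x}" is filled with the entity captured from the question.
PATTERNS = [
    (re.compile(r"^Who wrote (?P<x>.+?)\s*\?$", re.I),
     r"(?P<answer>[A-Z][\w.]*(?: [A-Z][\w.]*)*) wrote {x}"),
    (re.compile(r"^When was (?P<x>.+?) born\s*\?$", re.I),
     r"{x} was born in (?P<answer>\d{{4}})"),
]


def answer(question, corpus):
    """Return the first string matching a known pattern, or None."""
    for question_re, answer_template in PATTERNS:
        match = question_re.match(question.strip())
        if not match:
            continue
        entity = re.escape(match.group("x"))
        answer_re = re.compile(answer_template.format(x=entity))
        for sentence in corpus:
            hit = answer_re.search(sentence)
            if hit:
                return hit.group("answer")
    return None  # no pattern covers this question shape


corpus = [
    "William Shakespeare wrote Hamlet around 1600.",
    "Mozart was born in 1756.",
]
factual = answer("Who wrote Hamlet?", corpus)
complex_question = answer("Why does Hamlet hesitate?", corpus)
```

The factual question is covered by a pattern, while the "why" question falls through every pattern and returns `None`, which illustrates the limit mentioned in the bullets.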
1.2. OUR QA SYSTEM
• First developed (1999 - 2001) within a French innovation project (Anvar)
• Then (end 2001- end 2003) within the European project TRUST (FP5)
• Currently (2005/06), within the European project M-CAST (FP6)
• Main features: targets B2B and B2C, multilingual, NLP-based and NLP-intensive.
A modular design
[Diagram: six language modules (French, Italian, Portuguese, Polish, English, Czech) around an indexation engine and a text extraction engine. Other elements shown: Documents, Index, Visualization of Results.]
1.3. Evaluations of the QA system
• Professional benchmarking contests and campaigns such as EQueR (2004) and CLEF (2005 & 2006),
• Evaluations of the French, English, Portuguese and Spanish language modules, in monolingual and multilingual settings.
CLEF 2005
[Bar chart: CLEF 2005 results, scale 0% to 70%]
• French monolingual: 64%
• English - French: 39.5%
• Portuguese - French: 36.5%
• Italian - French: 25.5%
• Portuguese monolingual: 64.5%
CLEF 2006
[Bar chart: CLEF 2006 results, scale 0% to 80%]
Series: French monolingual, English1 - French, Portuguese - French, English2 - French, Portuguese monolingual, Spanish monolingual, Portuguese - Spanish, Spanish - Portuguese
Values shown (not in series order): 68%, 67%, 36%, 44.5%, 47.5%, 32.5%, 52.5%, 33.5%
• In CLEF 2005 and CLEF 2006, our systems were the best monolingual engines for Portuguese and French, and the best multilingual systems for English-French, Portuguese-French, Spanish-Portuguese and Portuguese-Spanish.
• Synapse Développement and Priberam are now partners of the project Quaero.
2. Implementation in M-CAST Project
• Tests carried out on books in the National Czech library and the Torun library in Poland
• Processing of several million digitized documents
• Management of metadata and UDC classification
• Support for questions and answers in English, French, Italian, Portuguese, Polish and Czech
• Implemented on both library portals
2.1. Adaptation to Digital Libraries Resources
• Scanned texts: poor quality
  -> Spell checker to improve the quality of documents
• One book, lots of pages:
  -> Management of multi-part documents during semantic analysis
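As an illustration of the first point, the sketch below corrects OCR-damaged tokens against a lexicon using the standard-library `difflib` module. The slides do not say how the actual spell checker works; the tiny lexicon, the 0.8 similarity cutoff and the function names are assumptions.

```python
import difflib
import re

# Hypothetical lexicon; a real system would use a full language dictionary.
LEXICON = sorted(
    ["the", "library", "holds", "several", "million", "digitized", "documents"]
)


def correct_token(token, lexicon=LEXICON, cutoff=0.8):
    """Replace a token by its closest lexicon entry, if one is close enough."""
    lower = token.lower()
    if lower in lexicon:
        return token  # already a known word
    candidates = difflib.get_close_matches(lower, lexicon, n=1, cutoff=cutoff)
    if not candidates:
        return token  # unknown word (e.g. a proper noun): leave untouched
    fixed = candidates[0]
    return fixed.capitalize() if token[0].isupper() else fixed


def correct_text(text):
    """Apply token-level correction to every alphabetic run in the text."""
    return re.sub(r"[A-Za-z]+", lambda m: correct_token(m.group(0)), text)


cleaned = correct_text("The libnary in Torun holds docurnents")
```

Tokens that are not close to any lexicon entry, such as the proper noun "Torun", are left unchanged, so the correction stays conservative.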
2.2. Integration of Dublin Core document attributes
• Storage of Dublin Core attributes as metadata
• QA: Who is the author of Hamlet?
  – Adaptation of the system to search in metadata
  – Use of those metadata as filters
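The two uses of metadata listed above can be sketched as follows. This is only an illustration: the `Document` class, the question pattern and the role-to-element mapping are assumptions, and only the element names (`title`, `creator`, `publisher`, `language`) come from the Dublin Core element set.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)  # Dublin Core element -> value


# Hypothetical mapping from a word in the question to a Dublin Core element.
ROLE_TO_ELEMENT = {"author": "creator", "publisher": "publisher"}

QUESTION_RE = re.compile(r"^Who is the (?P<role>\w+) of (?P<title>.+?)\s*\?$", re.I)


def answer_from_metadata(question, docs):
    """Answer 'Who is the <role> of <title>?' by looking only at metadata."""
    match = QUESTION_RE.match(question.strip())
    if not match:
        return []
    element = ROLE_TO_ELEMENT.get(match.group("role").lower())
    title = match.group("title").lower()
    return sorted({
        d.metadata[element]
        for d in docs
        if element in d.metadata and d.metadata.get("title", "").lower() == title
    })


def filter_by_metadata(docs, **constraints):
    """Keep only documents whose metadata match every constraint."""
    return [
        d for d in docs
        if all(d.metadata.get(k) == v for k, v in constraints.items())
    ]


docs = [
    Document("To be, or not to be...",
             {"title": "Hamlet", "creator": "William Shakespeare", "language": "en"}),
    Document("Habe nun, ach! Philosophie...",
             {"title": "Faust", "creator": "Johann Wolfgang von Goethe", "language": "de"}),
]
authors = answer_from_metadata("Who is the author of Hamlet ?", docs)
english_only = filter_by_metadata(docs, language="en")
```

Here the question is answered without reading the book text at all, and the same metadata can restrict a full-text search to, for example, English documents.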
2.3. Universal Decimal Classification
• Storage of UDC codes for each document
• Search through UDC codes
• Filtering through UDC codes
• Semantic disambiguation through UDC codes
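Because UDC is hierarchical, filtering and disambiguation can both be approximated with prefix matching on the code. The sketch below assumes simple codes without auxiliary signs; the sense inventory, the sample catalogue and the function names are invented for illustration and are not taken from the system.

```python
def under_udc(code, prefix):
    """True if a (simple) UDC code falls under the given class prefix."""
    return code.startswith(prefix)


def filter_by_udc(docs, prefix):
    """Keep the (title, udc_code) pairs classified under the prefix."""
    return [(title, code) for title, code in docs if under_udc(code, prefix)]


# Hypothetical sense inventory: word -> {UDC class prefix: sense}.
SENSES = {
    "mouse": {"004": "pointing device", "59": "rodent"},
}


def disambiguate(word, udc_code):
    """Pick the sense whose UDC prefix is the longest match for the document."""
    senses = SENSES.get(word, {})
    matching = [p for p in senses if udc_code.startswith(p)]
    if not matching:
        return None  # the classification gives no hint for this word
    return senses[max(matching, key=len)]


# Illustrative catalogue entries; the codes are examples, not real records.
catalogue = [
    ("Electrical machines", "621.3"),
    ("Engineering handbook", "62"),
    ("Field guide to mammals", "599"),
]
engineering = filter_by_udc(catalogue, "62")
sense_in_computing_book = disambiguate("mouse", "004.5")
sense_in_zoology_book = disambiguate("mouse", "599")
```

The same word thus receives a different reading depending on the class of the document it appears in, which is the idea behind the last bullet.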
Technical architecture
[Diagram: a Java Portal (Axis / Apache) and an ATL Web Service (IIS) exchange SOAP XML messages: documents to index, indexed documents, queries and result lists. Components shown: Indexer, Searcher, Linguistic processor, Language Modules, Document Parser, Language Detector and several Semantic Indexes. Flows shown: Document to Indexed Document, Question to Answer.]
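The SOAP XML exchange between portal and web service could look roughly like the sketch below. Only the SOAP 1.1 envelope namespace is standard; the `urn:example:qa` namespace and the `Query`, `Question`, `Language` and `Answer` element names are hypothetical, since the slides do not give the actual message schema.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
QA_NS = "urn:example:qa"  # hypothetical namespace for the QA service


def build_query(question, language):
    """Serialize a question as a SOAP envelope (hypothetical body schema)."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    query = ET.SubElement(body, f"{{{QA_NS}}}Query")
    ET.SubElement(query, f"{{{QA_NS}}}Question").text = question
    ET.SubElement(query, f"{{{QA_NS}}}Language").text = language
    return ET.tostring(envelope, encoding="unicode")


def parse_results(xml_text):
    """Extract the answer strings from a SOAP response (hypothetical schema)."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(f"{{{QA_NS}}}Answer")]


request_xml = build_query("Who is the author of Hamlet?", "en")

# A hand-written response standing in for what the web service might return.
response_xml = (
    f'<soap:Envelope xmlns:soap="{SOAP_NS}" xmlns:qa="{QA_NS}">'
    "<soap:Body><qa:ResultList>"
    "<qa:Answer>William Shakespeare</qa:Answer>"
    "</qa:ResultList></soap:Body></soap:Envelope>"
)
answers = parse_results(response_xml)
```

Keeping the exchange in plain XML is what lets a Java portal and a web service built on a different stack interoperate.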
End of presentation. I would appreciate your questions!
Thank you - Merci!