Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Doctorate Course: Web Information Retrieval. Speaker: Gaia Trecarichi

Upload: yasuo

Post on 13-Jan-2016



TRANSCRIPT

Page 1: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Doctorate Course: Web Information Retrieval

Speaker: Gaia Trecarichi

Page 2: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Outline

• Goal
• What is Multilingual Information Retrieval (MLIR)
• Basic Approaches to MLIR
• Resource Requirements for MLIR
• Xerox Experimental Approach
• Experimental Results
• Detailed Query Analysis
• Sample Query Profile
• Conclusions and Future Extensions

Page 3: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Goal

The goal IS NOT to build a fully functional MLIR system (too much time and resources needed).

The goal IS to explore the most important factors in making MLIR effective.

Page 4: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

5 Definitions for MLIR

1. IR in any language other than English

2. IR on a parallel document collection or on a multilingual document collection where the search space is restricted to the query language

3. IR on a monolingual document collection which can be queried in multiple languages

4. IR on a multilingual document collection, where queries can retrieve documents in multiple languages

5. IR on multilingual documents, i.e. more than one language can be present in the individual documents

Page 5: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Basic Approaches to MLIR

• IR systems rank documents according to statistical similarity measures based on the co-occurrence of terms in queries and documents
• MLIR therefore needs a mechanism for query or document translation
  • Query translation is easier but doesn't provide much context
  • Document translation could be better but is costly (time, storage resources)
• Techniques for the problem of interlingual term correspondence
  • Text Translation
  • Term Vector Translation
  • Latent Semantic Coindexing

Page 6: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Text Translation

• High-end approach to MLIR (NLP and text generation techniques)
• Direct mapping of the query from the source language into one or more target languages by using an MT system
• Direct resolution of ambiguity by using structural information from the source language text

PRO
• Extensive body of research on MT
• Commercial products available

CONS
• Low performance of current MT systems [Radwan, 1994]

Page 7: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Term Vector Translation

• Direct mapping of each word in the query written in the source language into all of its possible definitions in the target languages
• Uses transfer dictionaries or a parallel aligned corpus for the direct mapping
• Vector Space Models can be used as retrieval strategies
• Issues related to term weighting strategies
  • Should each term be weighted according to the number of translations?
  • Should more common translations be weighted proportionally higher?
  • What resources do we use to obtain this information?
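
One possible answer to the first weighting question, sketched in Python for illustration only: each source term spreads a total weight of 1 over its translations, so a term with many translations does not dominate the translated query. The dictionary entries are taken from the sample query profile later in the talk. The function name `translate_weighted` and the uniform 1/n scheme are assumptions, not the method used in the experiments (which ignored specialized term weighting).

```python
from collections import defaultdict

# Toy French => English transfer dictionary (entries from the sample query profile).
TRANSFER_DICT = {
    "intention": ["intention", "benefit"],
    "constitution": ["formation", "settlement", "constitution"],
    "interpretation": ["interpretation"],
}

def translate_weighted(source_terms, transfer_dict):
    """Map each source term to all of its translations, weighting each by 1/n."""
    weights = defaultdict(float)
    for term in source_terms:
        # Assumption: terms missing from the dictionary are passed through unchanged.
        translations = transfer_dict.get(term, [term])
        for target in translations:
            weights[target] += 1.0 / len(translations)
    return dict(weights)

query_vector = translate_weighted(
    ["intention", "constitution", "interpretation"], TRANSFER_DICT
)
```

With this scheme the unambiguous term "interpretation" keeps its full weight, while each of the three translations of "constitution" carries a third of it. Weighting more common translations higher would need translation probabilities, which a plain transfer dictionary does not supply.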

Page 8: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Latent Semantic Coindexing

• Indirect derivation of query translation by using a training corpus
• Uses Singular Value Decomposition of a parallel document collection to obtain term vector representations
• Creates a reduced-dimension semantic space in which related terms are near each other
• Term vector representations are comparable across all the languages of the collection (documents are represented as language-independent numerical vectors)
• A query can retrieve a relevant document even if they have no words in common
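
The idea can be sketched with NumPy on a deliberately tiny example. This is a minimal illustration, assuming each training document is a text merged with its translation; the vocabulary, the two training documents, and the helper names `to_vector`, `fold_in` and `cosine` are all made up for the sketch.

```python
import numpy as np

# Hypothetical bilingual vocabulary: three English terms, three French terms.
VOCAB = ["wine", "grape", "car", "vin", "raisin", "voiture"]

# Each training document merges an English text with its French translation.
TRAINING = [
    ["wine", "grape", "vin", "raisin"],
    ["car", "voiture"],
]

def to_vector(words):
    """Raw term-frequency vector over VOCAB."""
    return np.array([float(words.count(term)) for term in VOCAB])

# Term-by-document matrix and its singular value decomposition.
A = np.column_stack([to_vector(doc) for doc in TRAINING])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of dimensions kept in the semantic space
U_k, s_k = U[:, :k], s[:k]

def fold_in(words):
    """Project a monolingual query or document into the k-dimensional space."""
    return to_vector(words) @ U_k / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

french_query = fold_in(["vin"])
sim_wine_doc = cosine(french_query, fold_in(["wine", "grape"]))
sim_car_doc = cosine(french_query, fold_in(["car"]))
```

Here the French query "vin" lands next to the English document "wine grape" although the two share no word, and stays orthogonal to the document about cars. A realistic collection would need far more training text and a much larger k.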

Page 9: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

LSI vs Standard Vector Model

Standard Vector Model
• Treats words as if they are independent
• Represents documents as linear combinations of orthogonal terms

LSI
• Term-term inter-relationships are automatically modeled and used to improve retrieval by numerically analysing existing texts (no need for external dictionaries, thesauri or knowledge bases)
• Represents terms as continuous values on each of the k orthogonal indexing dimensions

Page 10: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Resource Requirements

• Support for the character set of each language is needed
• Facilities for automatic language recognition
• Morphological Analyzer (PoS recognition, stemming algorithms, inflectional analyzers)
  • Crucial to find term entries in bilingual dictionaries
  • Ex: the German word Weingärtnergenossenschaften is analyzed as the feminine plural noun Wein#Gärtner#Genosse(n)#schaft
• Resources for query translation
  • Machine Translation System
  • Transfer Dictionaries
  • Parallel texts and/or monolingual domain-specific corpora
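
To make the German compound example concrete, here is a toy decomposition routine. It is only a sketch: the four-entry lexicon, the list of linking elements, and the plural-stripping rule are assumptions chosen so that this one word splits, and a real morphological analyzer is far more involved.

```python
# Hypothetical mini-lexicon of German roots; a real analyzer uses a full lexicon.
LEXICON = {"wein", "gärtner", "genosse", "schaft"}
LINKERS = ("", "n", "s")    # linking elements allowed between parts (assumption)
PLURAL_SUFFIXES = ("en",)   # inflectional endings stripped before splitting (assumption)

def split_compound(word):
    """Return the lexicon roots that make up `word`, or None if no split exists."""
    if word == "":
        return []
    for root in sorted(LEXICON, key=len, reverse=True):  # try longer roots first
        if not word.startswith(root):
            continue
        rest = word[len(root):]
        for linker in LINKERS:
            if rest.startswith(linker):
                tail = split_compound(rest[len(linker):])
                if tail is not None:
                    return [root] + tail
    return None

def analyze(word):
    """Lowercase the word, try it as is, then retry with a plural suffix removed."""
    word = word.lower()
    candidates = [word] + [word[: -len(s)] for s in PLURAL_SUFFIXES if word.endswith(s)]
    for candidate in candidates:
        parts = split_compound(candidate)
        if parts is not None:
            return parts
    return None

parts = analyze("Weingärtnergenossenschaften")
```

Each recovered root can then be looked up separately in the bilingual dictionary, which is why this step matters for finding term entries.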

Page 11: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Resources for Query Translation

MT System
• For direct query translation

Transfer dictionaries (Bilingual Thesauri)
• For direct term vector translation
• Extracted from bilingual general dictionaries which include lots of “noise” vocabulary

Parallel Texts
• To extract relationships between terms for term vector translation or to get indirect query translation (ex. LSI)

Domain-specific monolingual corpora
• Source of terminology to be used when parallel texts are not available

Page 12: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Transfer Dictionaries vs Parallel Texts

Transfer Dictionaries
• Conversion from bilingual dictionaries is a non-trivial effort
• Provide broad but shallow coverage of the language
• Translation probabilities are not available
• Most technical terminology is missing

Parallel Corpora
• Needed in large quantity to train statistical models of great sophistication
• Generate term translation vectors with probabilities [Brown, 1993]
• Provide narrow but deep coverage (probabilities are domain specific)

Page 13: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Xerox Experimental Approach 1

Evaluation in Multilingual IR

• Uses queries with known relevance judgments
• Start with queries, documents, and relevance judgments in a single language
• The queries are translated into another language by human translators
• The translated queries are retranslated by the MLIR system
• Results are compared to those of the original queries to get a good sense of the relative performance of the MLIR system

Page 14: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Xerox Experimental Approach 2

Experimental Setting

• Translated French queries and English documents
• TIPSTER text collection and queries 51-100 from TREC experiments [Harman, 1995]
• Short version of queries (average length of 7 words)
• Term vector translation model
• Bilingual Transfer Dictionary to generate the model
• Conversion of an online bilingual French => English dictionary to a WORD-BASED transfer dictionary suitable for text retrieval

Page 15: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Xerox Experimental Approach 3

MLIR Process

1. The query is morphologically analyzed and each term is replaced by its inflectional root

2. Each root is looked up in the bilingual transfer dictionary, and a translated query is built by concatenating all term translations

3. The translated query is sent to a traditional monolingual IR system

Notes

• The Vector Space Model is used to measure similarity between the query and each document
• Specialized term weighting and resolving ambiguity in translation are ignored
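
The three steps can be sketched end to end as follows. Everything concrete here is a stand-in: `ROOTS` plays the role of the morphological analyzer, `TRANSFER_DICT` holds two hypothetical entries, the three documents are invented, and similarity is a plain cosine over raw term frequencies rather than whatever weighting the real IR system used.

```python
import math
from collections import Counter

# Hypothetical stand-ins for the morphological analyzer and the transfer dictionary.
ROOTS = {"amendements": "amendement", "constitutions": "constitution"}
TRANSFER_DICT = {
    "amendement": ["amendment", "enrichment"],
    "constitution": ["formation", "settlement", "constitution"],
}

def translate_query(french_query):
    """Steps 1 and 2: reduce each word to its root, then concatenate all translations."""
    translated = []
    for word in french_query.lower().split():
        root = ROOTS.get(word, word)
        translated.extend(TRANSFER_DICT.get(root, []))  # words without an entry are dropped
    return translated

def cosine(a, b):
    """Cosine similarity between two bags of words (raw term frequencies)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def rank(french_query, documents):
    """Step 3: score English documents against the translated query, best first."""
    query = translate_query(french_query)
    scores = {doc_id: cosine(query, text.lower().split()) for doc_id, text in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)

DOCS = {
    "d1": "an amendment to the constitution was debated",
    "d2": "soil enrichment improves crop yield",
    "d3": "the settlement ended the strike",
}
ranking = rank("amendements constitutions", DOCS)
```

The relevant document d1 is ranked first, but d2 and d3 receive non-zero scores only because unresolved ambiguity added "enrichment" and "settlement" to the translated query.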

Page 16: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Experimental Results

• Comparing the original English queries to three retranslations generated by different versions of the transfer dictionary
• Three transfer dictionary versions: automatic word-based, manual word-based and manual multi-word transfer dictionary

Average precision at 5, 10, 15 and 20 documents retrieved for the original English queries and the translations given by the different transfer dictionaries (TD):

Query version                               Average precision
Original English                            0.393
Automatic word-based transfer dictionary    0.235
Manual word-based transfer dictionary       0.269
Manual multi-word transfer dictionary       0.357
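
For reference, a measure of this kind can be computed per query as below. This is a sketch under an assumption: it reads "average precision at 5, 10, 15 and 20 documents" as the mean of precision at those four cutoffs, and the ranking and relevance judgments are invented for illustration.

```python
def precision_at(k, ranked_doc_ids, relevant_ids):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def average_precision_at_cutoffs(ranked_doc_ids, relevant_ids, cutoffs=(5, 10, 15, 20)):
    """Mean of precision@k over the given cutoffs for one query."""
    return sum(precision_at(k, ranked_doc_ids, relevant_ids) for k in cutoffs) / len(cutoffs)

# Hypothetical ranking of 20 documents; the relevant ones are made up.
ranking = [f"doc{i}" for i in range(1, 21)]
relevant = {"doc1", "doc3", "doc8", "doc12"}
score = average_precision_at_cutoffs(ranking, relevant)
```

Averaging `score` over all queries, once for the original queries and once for each retranslated version, would give one figure per column of the comparison above.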

Page 17: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Detailed Query Analysis 1

Comparison of the performance of the translated (Tr) and original (Orig) English queries. Values given are the number of queries in each category.

Performance          Automatic word-based   Manual word-based     Manual multi-word
                     transfer dictionary    transfer dictionary   transfer dictionary
Tr > Orig            1                      3                     4
Tr ~ Orig            19                     22                    26
Tr < Orig            22                     17                    12
  0.0 < Tr < Orig    10                     9                     9
  Tr = 0.0           12                     8                     3

• Performance improves as more manual effort is applied to the dictionary construction process
• Some queries perform much better in their translated versions

Page 18: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Detailed Query Analysis 2

Detailed Failure Analysis

• Carried out on the worst 17 queries when using the word-based dictionary
• 9 queries lost information as a result of the failure to translate multi-word expressions correctly, 8 had problems due to ambiguity in translation (i.e. extraneous definitions added to the query), and 4 suffered from a loss in retranslation (meaning decays with repeated translations)
• Recognizing and translating multi-word expressions is crucial to success in MLIR (in contrast to monolingual IR)
• Individual components of phrases often have very different meanings in translation, so the entire sense of the phrase is often lost

Page 19: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Sample Query Profile 1

English: original intent or interpretation of amendments to the U.S. Constitution

French: l’intention première ou une interprétation d’un amendement de la constitution des USA

Term vector retranslation

• intention - intention benefit

• premier - first initial bottom early front top leading basic primary original

• interpretation - interpretation

• amendment - amendment enrichment enriching agent

• constitution - formation settlement constitution

• USA - USA

Page 20: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Sample Query Profile 2

Version      Precision   Reasons for decay
Orig Eng     0.54
LR           0.34        intent => intention, U.S. => USA
TA1          0.19        constitution, amendement
TA2          0.10        original, intention
Trans Eng    0.05

The decay in performance of query 76 from the original English (Orig Eng) to the translated English (Trans Eng) due to translation ambiguity (TA) and loss in retranslation (LR).

Page 21: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Conclusions

• Two primary sources of error in the current MLIR system: missing translations of multi-word expressions and unresolved ambiguity in word-based translation
• Additional loss-in-retranslation errors are due to the experimental design and cannot be avoided (i.e. the ambiguity introduced by the human translator)

Future Extensions

• Improving automatically generated transfer dictionaries
• Extracting MWE (gathering terminology lists from various specialized domains, performing terminology extraction from corpora)
• Resolving ambiguity (using target language texts, term weighting strategies, user interactive tools)
• Using models other than the vector space model (e.g. weighted boolean model)

Page 22: Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

THANK YOU!