our ir strategy

35
Multilingual Search using Query Translation and Collection Selection Jacques Savoy, Pierre-Yves Berger University of Neuchatel, Switzerland www.unine.ch/info/clef/

Upload: paxton

Post on 15-Jan-2016

48 views

Category:

Documents


1 download

DESCRIPTION

Multilingual Search using Query Translation and Collection Selection Jacques Savoy, Pierre-Yves Berger University of Neuchatel, Switzerland www.unine.ch/info/clef/. 1. Effective monolingual IR system (combining indexing schemes?) 2. Combining query translation tools - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Our IR Strategy

Multilingual Search using

Query Translation and

Collection Selection

Jacques Savoy, Pierre-Yves Berger

University of Neuchatel, Switzerland

www.unine.ch/info/clef/

Page 2: Our IR Strategy

Our IR Strategy

1. Effective monolingual IR system1. Effective monolingual IR system

(combining indexing schemes?)(combining indexing schemes?)

2. Combining query translation tools2. Combining query translation tools

3. Effective collection selection and 3. Effective collection selection and

merging strategiesmerging strategies

Page 3: Our IR Strategy

Effective Monolingual IR

1.1. Define a stopword listDefine a stopword list

2.2. Have a « good » stemmerHave a « good » stemmer

Removing only the feminineRemoving only the feminineand plural suffixes (inflections)and plural suffixes (inflections)Focus on Focus on nounsnouns and and adjectivesadjectives see www.unine.ch/info/clef/see www.unine.ch/info/clef/

Page 4: Our IR Strategy

Indexing Units

For Indo-European languages : Set of significant words

CJK: bigramswith stoplistwithout stemming

For Finnish, we used 4-grams.

Page 5: Our IR Strategy

IR Models

Probabilistic Okapi Prosit or

deviation from randomness

Vector-space Lnu-ltc tf-idf (ntc-ntc) binary (bnn-bnn)

Page 6: Our IR Strategy

Monolingual Evaluation

Model English French Portug.

Q=TD word word word

Okapi 0.5422 0.4685 0.4835

Prosit 0.5313 0.4568 0.4695

Lnu-ltc 0.4979 0.4349 0.4579

tf-idf 0.3706 0.3309 0.3708

binary 0.3005 0.2017 0.1834

Page 7: Our IR Strategy

Monolingual Evaluation

Model Finnish Russian

Q=TD word word

Okapi 0.4773 0.3800

Prosit 0.4620 0.3448

Lnu-ltc 0.4643 0.3794

tf-idf 0.3862 0.2716

binary 0.1859 0.1512

Page 8: Our IR Strategy

Monolingual Evaluation

Model Finnish Russian

Q=TD word 4-gram word 4-gram

Okapi 0.4773 0.5386 0.3800 0.2890

Prosit 0.4620 0.5357 0.3448 0.2879

Lnu-ltc 0.4643 0.5022 0.3794 0.2852

tf-idf 0.3862 0.4466 0.2716 0.1916

binary 0.1859 0.2387 0.1512 0.0373

Page 9: Our IR Strategy

Problems with Finnish

Finnish can be characterized by:

1. A large number of cases (12+),

2. Agglutinative (saunastani)

& compound construction (rakkauskirje)

3. Variations in the stem

(matto, maton, mattoja, …).

Do we need a deeper morpholgical analysis?

Page 10: Our IR Strategy

Improvement using…

Relevance feedback (Rocchio)?

Page 11: Our IR Strategy

Monolingual Evaluation

Q=TD English French Finnish Russian

Okapi 0.5422 0.4685 0.5386 0.3800

+PRF 0.5704 0.4851 0.5308 0.3945

+5.2% +3.5% -1.5% +3.8%

Prosit 0.5313 0.4568 0.5357 0.3448

+PRF 0.5742 0.4643 0.5684 0.3736

+8.1% +1.6% +6.1% +8.4%

Page 12: Our IR Strategy

Relevance FeedbackMAP after blind query-expansion

0 10 15 20 30 40 50 60 75 100# of terms added (from 3 best-ranked doc.)

MAP

0.36

0.38

0.4

0.42

0.44

0.46

0.48

0.5

Prosit Okapidtu-dtn Lnu-ltc

Page 13: Our IR Strategy

Improvement using…

Data fusion approaches (see the paper)

Improvement with the English, Finnish and Portuguese corpora

Hurt the MAP with the French and Russian collections

Page 14: Our IR Strategy

Bilingual IR

1. Machine-readable dictionaries (MRD)e.g., Babylon

2. Machine translation services (MT)e.g., FreeTranslation BabelFish

3. Parallel and/or comparable corpora(not used in this evaluation campaign)

Page 15: Our IR Strategy

Bilingual IR

“Tour de France WinnerWho won the Tour de France in 1995?”

(Babylon1) “France, vainqueurOrganisation Mondiale de la Santé, le, France,* 1995”

(Right) “Le vainqueur du Tour de FranceQui a gagné le tour de France en 1995 ?”

Page 16: Our IR Strategy

Bilingual EN->FR/FI/RU/PT

TDOkapi

Frenchword

Finnish4-gram

Russianword

Portugword

Manual 0.4685 0.5386 0.3800 0.4835

Baby1 0.3706 0.1965 0.2209 0.3071

FreeTra 0.3845 N/A 0.3067 0.4057

InterTra 0.2664 0.2653 0.1216 0.3277

Reverso 0.3830 N/A 0.2960 N/A

Combi. 0.4066 0.3042 0.3888 0.4204

Page 17: Our IR Strategy

Bilingual EN->FR/FI/RU/PT

TDOkapi

Frenchword

Finnish4-gram

Russianword

Portugword

Manual 0.4685 0.5386 0.3800 0.4835

Baby1 79.1% 36.5% 58.1% 63.5%

FreeTra 82.1% N/A 80.7% 83.9%

InterTra 56.9% 49.3% 32.0% 67.8%

Reverso 81.8% N/A 77.9% N/A

Combi. 86.8% 56.5% 102.3% 86.9%

Page 18: Our IR Strategy

Bilingual IR

Improvement … Query combination (concatenation) Relevance feedback

(yes for PT, FR, RU, no for FI)Data fusion (yes for FR, no for RU, FI, PT)

Page 19: Our IR Strategy

Multilingual EN->EN FI FR RU

1. Create a common index Document translation (DT)

2. Search on each language and merge the result lists (QT)

3. Mix QT and DT

4. No translation

Page 20: Our IR Strategy

Merging Problem

EN

<–– Merging<–– Merging

RUFRFI

Page 21: Our IR Strategy

Merging Problem

1 EN120 1.22 EN200 1.03 EN050 0.74 EN705 0.6…

1 FR043 0.82 FR120 0.753 FR055 0.654 …

1 RU050 1.62 RU005 1.13 RU120 0.94 …

Page 22: Our IR Strategy

Merging Strategies

Round-robin (baseline)

Raw-score merging

Normalize

Biased round-robin

Z-score

Logistic regression

Page 23: Our IR Strategy

Merging Problem

1 EN120 1.22 EN200 1.03 EN050 0.74 EN705 0.6…

1 FR043 0.82 FR120 0.753 FR055 0.654 …

1 RU050 1.62 RU005 1.13 RU120 0.94 …

1 RU… 1 RU… 2 EN…2 EN…3 RU…3 RU…4 ….4 ….

Page 24: Our IR Strategy

Merging Strategies

Round-robin (baseline)

Raw-score merging

Normalize

Biased round-robin

Z-score

Logistic regression

Page 25: Our IR Strategy

Test-Collection CLEF-2004

EN FI FR RU

size 154 MB 137 MB 244 MB 68 MB

doc 56,472 55,344 90,261 16,716

topic 42 45 49 34

rel./query 8.9 9.2 18.7 3.6

Page 26: Our IR Strategy

Merging Strategies

Round-robin (baseline)

Raw-score merging

Normalize

Biased round-robin

Z-score

Logistic regression

Page 27: Our IR Strategy

Z-Score Normalization

1 EN120 1.22 EN200 1.03 EN050 0.74 EN765 0.6… …

Compute the mean and standard deviation

New score = ((old score-) / ) +

1 EN120 7.02 EN200 5.03 EN050 2.04 EN765 1.0…

Page 28: Our IR Strategy

MLIR Evaluation

Condition A (meta-search)

Prosit/Okapi (+PRF, data fusion for

FR, EN)

Condition C (same SE)

Prosit only (+PRF)

Page 29: Our IR Strategy

Multilingual Evaluation

EN->EN FR FI RU Cond. A Cond. C

Round-robin 0.2386 0.2358

Raw-score 0.0642 0.3067

Norm 0.2899 0.2646

Biased RR 0.2639 0.2613

Z-score wt 0.2669 0.2867

Logistic 0.3090 0.3393

Page 30: Our IR Strategy

Collection Selection

General idea:1. Use the logistic regression to compute

a probability of relevance for each document;

2. Sum the document scores from the first 15 best ranked documents;

3. If this sum ≥ , consider all the list,otherwise select only the first m doc.

Page 31: Our IR Strategy

Multilingual Evaluation

EN->EN FR FI RU Cond. A Cond. C

Round-robin 0.2386 0.2358

Logistic reg. 0.3090 0.3393

Optimal select 0.3234 0.3558

Select (m=0) 0.2957 0.3405

Select (m=3) 0.2953 0.3378

Page 32: Our IR Strategy

Conclusion (monolingual)

Finnish is still a difficult language

Blind-query expansion: different patterns with different IR models

Data fusion (?)

Page 33: Our IR Strategy

Conclusion (bilingual)

Translation resources not available for all languages pairs

Improvement bycombining query translations yes, but can we have a better combination scheme?

Page 34: Our IR Strategy

Conclusion (multilingual)

Selection and merging are still hard problems

Overfitting is not always the best choice (see results with Condition C)

Page 35: Our IR Strategy

Multilingual Search using

Query Translation and

Collection Selection

Jacques Savoy, Pierre-Yves Berger

University of Neuchatel, Switzerland

www.unine.ch/info/clef/