-
Multilingual Information Retrieval
Doug Oard
College of Information Studies and UMIACS
University of Maryland, College Park, USA
January 14, 2019, AFIRM
-
Global Trade
[Figure: exports vs. imports (trillions of USD, both axes 0.0 to 2.5) for USA, Japan, South Korea, Hong Kong, China, and the EU. Source: Wikipedia (mostly 2017 estimates)]
-
Most Widely-Spoken Languages
[Figure: bar chart of number of speakers in millions (0 to 1,200), split into L1 speakers and L2 speakers. Languages shown: English, Mandarin Chinese, Hindi, Spanish, French, Modern Standard Arabic, Russian, Bengali, Portuguese, Indonesian, Urdu, German, Japanese, Swahili, Western Punjabi, Javanese, Wu Chinese, Telugu, Turkish, Korean, Marathi, Tamil, Yue Chinese, Vietnamese, Italian, Hausa, Thai, Persian, Southern Min. Source: Ethnologue (SIL), 2018]
-
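Global Internet Users
[Figure: two pie charts of shares by language. Legend: English, Chinese, Spanish, Japanese, Portuguese, German, Arabic, French, Russian, Korean. Slice labels on the first chart: 64%, 5%, 4%, 6%, 2%, 8%, 2%, 4%, 5%, 0%; on the second chart: 33%, 28%, 9%, 6%, 5%, 5%, 4%, 4%, 4%, 2%]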
-
What Does “Multilingual” Mean?
• Mixed-language document
  – Document containing more than one language
• Mixed-language collection
  – Collection of documents in different languages
• Multi-monolingual systems
  – Can retrieve from a mixed-language collection
• Cross-language system
  – Query in one language finds document in another
• (Truly) multilingual system
  – Queries can find documents in any language
-
A Story in Two Parts
• IR from the ground up in any language
  – Focusing on document representation
• Cross-Language IR
  – To the extent time allows
-
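[Figure: retrieval architecture. The Query and the Documents each pass through a Representation Function, giving a Query Representation and a Document Representation (stored in an Index); a Comparison Function matches the two to produce Hits]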
-
ASCII
• American Standard Code for Information Interchange
• ANSI X3.4-1968
| 0 NUL | 32 SPACE | 64 @ | 96 ` |
| 1 SOH | 33 ! | 65 A | 97 a |
| 2 STX | 34 " | 66 B | 98 b |
| 3 ETX | 35 # | 67 C | 99 c |
| 4 EOT | 36 $ | 68 D | 100 d |
| 5 ENQ | 37 % | 69 E | 101 e |
| 6 ACK | 38 & | 70 F | 102 f |
| 7 BEL | 39 ' | 71 G | 103 g |
| 8 BS | 40 ( | 72 H | 104 h |
| 9 HT | 41 ) | 73 I | 105 i |
| 10 LF | 42 * | 74 J | 106 j |
| 11 VT | 43 + | 75 K | 107 k |
| 12 FF | 44 , | 76 L | 108 l |
| 13 CR | 45 - | 77 M | 109 m |
| 14 SO | 46 . | 78 N | 110 n |
| 15 SI | 47 / | 79 O | 111 o |
| 16 DLE | 48 0 | 80 P | 112 p |
| 17 DC1 | 49 1 | 81 Q | 113 q |
| 18 DC2 | 50 2 | 82 R | 114 r |
| 19 DC3 | 51 3 | 83 S | 115 s |
| 20 DC4 | 52 4 | 84 T | 116 t |
| 21 NAK | 53 5 | 85 U | 117 u |
| 22 SYN | 54 6 | 86 V | 118 v |
| 23 ETB | 55 7 | 87 W | 119 w |
| 24 CAN | 56 8 | 88 X | 120 x |
| 25 EM | 57 9 | 89 Y | 121 y |
| 26 SUB | 58 : | 90 Z | 122 z |
| 27 ESC | 59 ; | 91 [ | 123 { |
| 28 FS | 60 < | 92 \ | 124 | |
| 29 GS | 61 = | 93 ] | 125 } |
| 30 RS | 62 > | 94 ^ | 126 ~ |
| 31 US | 63 ? | 95 _ | 127 DEL |
-
The Latin-1 Character Set
• ISO 8859-1 8-bit characters for Western Europe
  – French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English
[Figure: code charts of the printable characters of 7-bit ASCII and the additional defined characters of ISO 8859-1]
-
Other ISO-8859 Character Sets
[Figure: code charts for ISO 8859-2, -3, -4, -5, -6, -7, -8, and -9]
-
East Asian Character Sets
• More than 256 characters are needed
  – Two-byte encoding schemes (e.g., EUC) are used
• Several countries have unique character sets
  – GB in People's Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam
• Many characters appear in several languages
  – Research Libraries Group developed EACC
    • Unified “CJK” character set for USMARC records
-
Unicode
• Single code for all the world’s characters
  – ISO Standard 10646
• Separates “code space” from “encoding”
  – Code space extends Latin-1
    • The first 256 positions are identical
  – UTF-7 encoding will pass through email
    • Uses only the 64 printable ASCII characters
  – UTF-8 encoding is designed for disk file systems
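As a minimal sketch of the split between code space and encoding, the Python snippet below encodes one character three ways. The sample character and the variable names are illustrative assumptions, not part of the slides; the codecs are those shipped with Python's standard library.

```python
# One code point, several byte encodings (illustrative sample character).
ch = "é"                        # U+00E9, inside the Latin-1 range

code_point = ord(ch)            # position in the Unicode code space
latin1 = ch.encode("latin-1")   # one byte, same value as the code point
utf8 = ch.encode("utf-8")       # two bytes for this character
utf7 = ch.encode("utf-7")       # ASCII-only bytes, suitable for 7-bit channels

print(hex(code_point), latin1, utf8, utf7)
```

The same code point yields different byte sequences depending on the encoding, which is why a retrieval system needs to know (or guess) the encoding before it can index text.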
-
Limitations of Unicode
• Produces larger files than Latin-1
• Fonts may be hard to obtain for some characters
• Some characters have multiple representations
  – e.g., accents can be part of a character or separate
• Some characters look identical when printed
  – But they come from unrelated languages
• Encoding does not define the “sort order”
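The two representation problems above can be sketched with Python's standard `unicodedata` module. The sample characters (an accented "e", and Latin vs. Cyrillic "A") are illustrative choices, not taken from the slides.

```python
import unicodedata

composed = "\u00e9"        # "é" as a single precomposed code point
decomposed = "e\u0301"     # "e" followed by a combining acute accent

# The two strings render alike but compare unequal until normalized.
same_raw = composed == decomposed
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))

# Look-alikes from different scripts stay distinct even after normalization.
latin_a = "A"              # LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"      # CYRILLIC CAPITAL LETTER A
lookalike_equal = (unicodedata.normalize("NFC", latin_a)
                   == unicodedata.normalize("NFC", cyrillic_a))

print(same_raw, same_nfc, lookalike_equal)  # False True False
```

Normalizing both documents and queries to one form before indexing is one common way to keep the first problem from causing missed matches; the second problem is not solved by normalization.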
-
Strings and Segments
• Retrieval is (often) a search for concepts
  – But what we actually search are character strings
• What strings best represent concepts?
  – In English, words are often a good choice
    • Well-chosen phrases might also be helpful
  – In German, compounds may need to be split
    • Otherwise queries using constituent words would fail
  – In Chinese, word boundaries are not marked
    • Thissegmentationproblemissimilartothatofspeech
-
Tokenization
• Words (from linguistics):
  – Morphemes are the units of meaning
  – Combined to make words
    • Anti (disestablishmentarian) ism
• Tokens (from computer science)
  – Doug ’s running late !
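A minimal sketch of a tokenizer that produces the token sequence shown above. The regular expression is an assumption made for illustration (it handles only the English clitic "’s", words, and single punctuation marks), not a general-purpose tokenizer.

```python
import re

# Clitic "'s" (curly or straight apostrophe), then words, then single punctuation marks.
TOKEN = re.compile(r"[’']s\b|\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN.findall(text)

tokens = tokenize("Doug’s running late!")
print(tokens)
```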
-
Morphological Segmentation
Swahili Example
a + li + ni + andik + ish + a
he + past-tense + me + write + causer-effect + Declarative-mode
Credit: Ramy Eskander
-
Morphological Segmentation
Somali Example
cun + t + aa
eat + she + present-tense
Credit: Ramy Eskander
-
Stemming
• Conflates words, usually preserving meaning
  – Rule-based suffix-stripping helps for English
    • {destroy, destroyed, destruction}: destr
  – Prefix-stripping is needed in some languages
    • Arabic: {alselam}: selam [Root: SLM (peace)]
• Imperfect: goal is to usually be helpful
  – Overstemming
    • {centennial,century,center}: cent
  – Understemming:
    • {acquire,acquiring,acquired}: acquir
    • {acquisition}: acquis
• Snowball: rule-based system for making stemmers
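To make the understemming example concrete, here is a toy suffix stripper. The suffix list and the `min_stem` parameter are hypothetical and chosen only to reproduce that example; this is not the Porter or Snowball algorithm.

```python
# Toy suffix list, longest first; real stemmers use far richer rule sets.
SUFFIXES = ["ition", "ing", "ed", "e"]

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = {w: toy_stem(w) for w in ["acquire", "acquiring", "acquired", "acquisition"]}
print(stems)
```

The first three words conflate to one stem while "acquisition" ends up with a different one, so a query containing "acquire" would not match a document containing only "acquisition".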
-
Longest Substring Segmentation
• Greedy algorithm based on a lexicon
• Start with a list of every possible term
• For each unsegmented string
  – Remove the longest single substring in the list
  – Repeat until no substrings are found in the list
-
Longest Substring Example
• Possible German compound term (!):
  – washington
• List of German words:
  – ach, hin, hing, sei, ton, was, wasch
• Longest substring segmentation
  – was-hing-ton
  – Roughly translates as “What tone is attached?”
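The greedy procedure can be sketched as a short recursive function. The function name, the recursion on the left and right remainders, and the choice to keep unmatched remainders as-is are assumptions made for illustration.

```python
def longest_substring_segment(text, lexicon):
    """Greedily split text around the longest lexicon entry it contains."""
    if not text:
        return []
    candidates = [word for word in lexicon if word in text]
    if not candidates:
        return [text]                      # nothing matches: keep the remainder whole
    best = max(candidates, key=len)        # longest matching lexicon entry
    start = text.find(best)
    left = longest_substring_segment(text[:start], lexicon)
    right = longest_substring_segment(text[start + len(best):], lexicon)
    return left + [best] + right

german_words = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
segments = longest_substring_segment("washington", german_words)
print("-".join(segments))
```

On the example above, "hing" is the longest lexicon entry found in "washington", which leaves "was" and "ton" on either side.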
-
[Figure: candidate segmentations of a Chinese string, with English glosses of the candidate segments: “oil, petroleum”; “probe, survey, take samples”; “restrain”; “cymbidium goeringii”]
-
Probabilistic Segmentation
• For an input string c1 c2 c3 … cn
• Try all possible partitions into w1 w2 w3 …
  – c1 | c2 c3 … cn
  – c1 c2 | c3 … cn
  – c1 | c2 | c3 … cn
  – etc.
• Choose the highest probability partition
  – Compute Pr(w1 w2 w3 …) using a language model
• Challenges: search, probability estimation
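A minimal sketch of one way to handle the search challenge: dynamic programming over split points, which avoids enumerating every partition. It assumes a unigram language model (the slide says only “a language model”), and the toy probabilities are hypothetical.

```python
import math

def best_segmentation(text, probs):
    """Return the highest-probability split of text under a unigram model, or None."""
    max_len = max(len(w) for w in probs)
    n = len(text)
    # best[i] = (log-probability of the best split of text[:i], start of its last word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None                        # no partition uses only known words
    words, end = [], n
    while end > 0:                         # follow back-pointers from the end
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Hypothetical unigram probabilities, chosen only for illustration.
toy_probs = {"no": 0.2, "now": 0.2, "where": 0.1, "here": 0.2, "nowhere": 0.01}
print(best_segmentation("nowhere", toy_probs))
```

With these toy numbers, "now" + "here" (0.04) beats "no" + "where" (0.02) and the single word "nowhere" (0.01). Estimating such probabilities well is the second challenge and is not addressed here.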
-
Non-Segmentation: N-gram Indexing
• Consider a Chinese document c1 c2 c3 … cn
• Don’t segment (you could be wrong!)
• Instead, treat every character bigram as a term
  c1 c2 , c2 c3 , c3 c4 , … , cn-1 cn
• Break up queries the same way
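Overlapping character bigrams are easy to generate. In the sketch below, the function name and the sample strings are illustrative assumptions; the same function is applied to the document and to the query.

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams, dropping whitespace first."""
    chars = "".join(text.split())
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]

doc_terms = char_ngrams("信息检索系统")     # sample document text
query_terms = char_ngrams("检索")           # sample query, broken up the same way
overlap = [t for t in query_terms if t in doc_terms]
print(doc_terms, query_terms, overlap)
```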
-
A “Term” is Whatever You Index
• Word sense
• Token
• Word
• Stem
• Character n-gram
• Phrase
-
Summary
• A term is whatever you index
  – So the key is to index the right kind of terms!
• Start by finding fundamental features
  – We have focused on character coded text
  – Same ideas apply to handwriting, OCR, and speech
• Combine characters into easily recognized units
  – Words where possible, character n-grams otherwise
• Apply further processing to optimize results
  – Stemming, phrases, …
-
A Story in Two Parts
• IR from the ground up in any language
  – Focusing on document representation
• Cross-Language IR
  – To the extent time allows
-
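Query-Language CLIR
[Figure: the Somali Document Collection passes through a Translation System to give an English Document Collection; English queries go to a Retrieval Engine over that collection, which returns Results for the user to select and examine]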
-
Document-Language CLIR
[Figure: English queries pass through a Translation System to give Somali queries; a Retrieval Engine runs them against the Somali Document Collection and returns Somali documents as Results for the user to select and examine]
-
Query vs. Document Translation
• Query translation
  – Efficient for short queries (not relevance feedback)
  – Limited context for ambiguous query terms
• Document translation
  – Rapid support for interactive selection
  – Need only be done once (if query language is same)
-
Indexing Time: Statistical Document Translation
[Figure: indexing time (sec, 0 to 500) vs. thousands of documents (0 to 45), for monolingual and cross-language indexing]
-
Language-Neutral Retrieval
[Figure: Somali Query Terms undergo Query “Translation” and English Document Terms undergo Document “Translation”; “Interlingual” Retrieval matches the two and returns a ranked list (1: 0.91, 2: 0.57, 3: 0.36)]
-
Translation Evidence
• Lexical Resources
  – Phrase books, bilingual dictionaries, …
• Large text collections
  – Translations (“parallel”)
  – Similar topics (“comparable”)
• Similarity
  – Similar writing (if the character set is the same)
  – Similar pronunciation
• People
  – May be able to guess topic from lousy translations
-
Types of Lexical Resources
• Ontology
  – Organization of knowledge
• Thesaurus
  – Ontology specialized to support search
• Dictionary
  – Rich word list, designed for use by people
• Lexicon
  – Rich word list, designed for use by a machine
• Bilingual term list
  – Pairs of translation-equivalent terms
-
[Figure: comparison of four conditions: Full Query; Named entities added; Named entities from term list; Named entities removed]
-
Backoff Translation
• Lexicon might contain stems, surface forms, or some combination of the two.

| Document | Translation Lexicon |
| surface form | surface form |
| stem | surface form |
| surface form | stem |
| stem | stem |

[Figure: French example matching the document term “mangez” (stem “mange”) against lexicon entries such as “mange” and “mangent”, with English translations “eat” and “eats”]
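The four matching stages can be sketched as an ordered series of lookups. The function name, the stem table, and the one-entry lexicon below are hypothetical and serve only to show the order in which the stages are tried.

```python
def backoff_translate(term, lexicon, stem):
    """Try the four matching stages in order; return (translations, stage)."""
    # Index the lexicon a second time by the stem of each entry.
    stemmed_lexicon = {}
    for entry, translations in lexicon.items():
        stemmed_lexicon.setdefault(stem(entry), []).extend(translations)

    stages = [
        (term, lexicon),                # 1: surface form vs. surface form
        (stem(term), lexicon),          # 2: stem vs. surface form
        (term, stemmed_lexicon),        # 3: surface form vs. stem
        (stem(term), stemmed_lexicon),  # 4: stem vs. stem
    ]
    for stage, (key, table) in enumerate(stages, start=1):
        if key in table:
            return table[key], stage
    return [], None

# Hypothetical stem table and one-entry lexicon, for illustration only.
STEMS = {"mangez": "mange", "mangent": "mange", "mange": "mange"}

def toy_stem(word):
    return STEMS.get(word, word)

toy_lexicon = {"mangent": ["eat"]}
result = backoff_translate("mangez", toy_lexicon, toy_stem)
print(result)
```

Here the document term "mangez" matches nothing until both sides are stemmed, so the translation is found only at the last stage.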
-
[Figure: the Rosetta Stone, with its three scripts labeled: Hieroglyphic, Egyptian Demotic, and Greek]
-
Types of Bilingual Corpora
• Parallel corpora: translation-equivalent pairs
  – Document pairs
  – Sentence pairs
  – Term pairs
• Comparable corpora: topically related
  – Collection pairs
  – Document pairs
-
Some Modern Rosetta Stones
• News:
  – DE-News (German-English)
  – Hong-Kong News, Xinhua News (Chinese-English)
• Government:
  – Canadian Hansards (French-English)
  – Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
  – UN Treaties (Russian, English, Arabic, …)
• Religion
  – Bible, Koran, Book of Mormon
-
Word-Level Alignment
English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform

English: Madam President , I had asked the administration …
Spanish: Señora Presidenta, había pedido a la administración del Parlamento …
-
A Translation Model
• From word-aligned bilingual text, we induce a translation model $p(f_i \mid e)$, where $\sum_{f_i} p(f_i \mid e) = 1$
• Example:
  p(探测|survey) = 0.4
  p(试探|survey) = 0.3
  p(测量|survey) = 0.25
  p(样品|survey) = 0.05
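One simple way to induce such a model, sketched here under the assumption that word alignments are already available as (e, f) pairs, is relative-frequency estimation. The slides do not say which estimator was used, and the alignment counts below are hypothetical, chosen so that the output matches the example probabilities.

```python
from collections import Counter, defaultdict

def induce_translation_model(aligned_pairs):
    """Estimate p(f | e) by relative frequency over aligned (e, f) word pairs."""
    counts = defaultdict(Counter)
    for e, f in aligned_pairs:
        counts[e][f] += 1
    return {
        e: {f: c / sum(fs.values()) for f, c in fs.items()}
        for e, fs in counts.items()
    }

# Hypothetical alignment counts (8, 6, 5, and 1 links out of 20).
pairs = ([("survey", "探测")] * 8 + [("survey", "试探")] * 6
         + [("survey", "测量")] * 5 + [("survey", "样品")] * 1)
model = induce_translation_model(pairs)
print(model["survey"])
```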
-
Using Multiple Translations
• Weighted Structured Query Translation
  – Takes advantage of multiple translations and translation probabilities
• TF and DF of query term e are computed using TF and DF of its translations:

$$TF(e, D_k) = \sum_{f_i} p(f_i \mid e) \times TF(f_i, D_k)$$

$$DF(e) = \sum_{f_i} p(f_i \mid e) \times DF(f_i)$$
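The two sums translate directly into code. In this sketch the function names and the term and document counts are hypothetical; the translation probabilities are the ones from the “survey” example in the translation model slide.

```python
def translated_tf(translations, doc_tf):
    """TF(e, D_k): translation-probability-weighted sum of TF(f_i, D_k)."""
    return sum(p * doc_tf.get(f, 0) for f, p in translations.items())

def translated_df(translations, df):
    """DF(e): translation-probability-weighted sum of DF(f_i)."""
    return sum(p * df.get(f, 0) for f, p in translations.items())

# p(f_i | e) for e = "survey".
p_f_given_e = {"探测": 0.4, "试探": 0.3, "测量": 0.25, "样品": 0.05}
# Hypothetical counts for one document D_k and for the whole collection.
doc_tf = {"探测": 2, "测量": 4}
df = {"探测": 100, "试探": 40, "测量": 200, "样品": 20}

tf_e = translated_tf(p_f_given_e, doc_tf)   # 0.4*2 + 0.25*4
df_e = translated_df(p_f_given_e, df)       # 0.4*100 + 0.3*40 + 0.25*200 + 0.05*20
print(tf_e, df_e)
```

The resulting TF and DF values are generally fractional, so the ranking function that consumes them has to accept non-integer counts.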
-
BM-25

$$\sum_{e \in Q} \left[\log \frac{N - df(e) + 0.5}{df(e) + 0.5}\right] \left[\frac{2.2 \times tf(e, d_k)}{0.3 + 0.9 \times \frac{dl(d_k)}{avdl} + tf(e, d_k)}\right] \left[\frac{8 \times qtf(e)}{7 + qtf(e)}\right]$$

• document frequency: df(e)
• term frequency: tf(e, d_k)
• document length: dl(d_k)
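As a rough sketch, the Okapi BM25 formula with the constants that appear on this slide (2.2, 0.3, 0.9, 8, and 7) can be computed as below. The function names and the sample statistics are hypothetical; tf and df may be fractional, as they are when computed from translation probabilities.

```python
import math

def bm25_term(tf, df, qtf, dl, avdl, n_docs):
    """Score contribution of one query term e for one document d_k."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))       # document frequency part
    tf_part = (2.2 * tf) / (0.3 + 0.9 * dl / avdl + tf)    # term frequency, length-normalized
    qtf_part = (8 * qtf) / (7 + qtf)                       # query term frequency part
    return idf * tf_part * qtf_part

def bm25_score(query_terms, n_docs, dl, avdl):
    """Sum per-term contributions; query_terms holds (tf, df, qtf) triples."""
    return sum(bm25_term(tf, df, qtf, dl, avdl, n_docs)
               for tf, df, qtf in query_terms)

# Hypothetical statistics: one query term, a document of average length.
score = bm25_score([(1.8, 103.0, 1)], n_docs=10_000, dl=120, avdl=120)
print(round(score, 3))
```

For a term that occurs once in an average-length document and once in the query, the second and third factors both equal 1, so the score reduces to the document frequency factor alone.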
-
Retrieval Effectiveness
[Figure: CLEF French. MAP relative to monolingual (CLIR/Monolingual, 40% to 110%) vs. Cumulative Probability Threshold (0.0 to 1.0), for DAMM, IMM, and PSQ]
1270.89520.83880.91750.0564-0.0223
700.77380.8030.7999-0.0292-0.0261
730.32260.3830.3512-0.0604-0.0286
950.30840.39030.3395-0.0819-0.0311
1320.37430.45830.4088-0.084-0.0345
770.30290.30460.3389-0.0017-0.036
480.04370.03120.08090.0125-0.0372
1060.34610.44730.3844-0.1012-0.0383
720.21950.25150.2581-0.032-0.0386
490.44080.44380.4807-0.003-0.0399
680.56170.59090.607-0.0292-0.0453
910.0870.17370.1425-0.0867-0.0555
940.31010.58760.3701-0.2775-0.06
880.83440.89960.8995-0.0652-0.0651
790.70560.63870.77230.0669-0.0667
1390.48340.51170.552-0.0283-0.0686
1760.88050.94890.9618-0.0684-0.0813
560.58380.70560.6747-0.1218-0.0909
1710.10890.19530.2025-0.0864-0.0936
1450.42280.43470.529-0.0119-0.1062
1070.2230.21750.33130.0055-0.1083
530.2840.24260.39440.0414-0.1104
450.39070.5340.5043-0.1433-0.1136
1860.1240.08910.25930.0349-0.1353
1170.33020.23490.46570.0953-0.1355
1310.32650.51830.4722-0.1918-0.1457
1050.30880.450.75-0.1412-0.4412
Topic
AP: DAMM-PSQ
[email protected]@[email protected]@[email protected]@[email protected]@[email protected]@[email protected]@[email protected]@0.9-MONO
110.8333110.83331110.833310.833310.83330.833341-0.1667410
0.48550.49530.52180.48550.53940.52180.53150.53940.54450.52180.53180.52180.53790.53870.5218420.0363420.046
111111111111111430430
0.48650.63410.63310.48650.49080.63310.64860.49080.61850.63310.64820.63310.60260.31330.6331440.1466440.1621
0.5340.46370.50430.5340.45450.50430.39070.45450.39670.50430.40120.50430.39910.48960.504345-0.029745-0.1433
0.07440.22570.22380.07440.2330.22380.26730.2330.25320.22380.25050.22380.25630.23760.2238460.1494460.1929
0.36670.43990.44230.36670.45910.44230.44180.45910.44230.44230.44210.44230.44230.44230.4423470.0756470.0751
0.03120.10890.08090.03120.13340.08090.04370.13340.04670.08090.04410.08090.04750.04990.0809480.0497480.0125
0.44380.41260.48070.44380.39570.48070.44080.39570.43950.48070.44680.48070.46390.47940.4807490.036949-0.003
0.46080.46460.44060.46080.45390.44060.45290.45390.45690.44060.45690.44060.45780.44590.440650-0.020250-0.0079
0.040.0240.03850.040.00670.03850.03820.00670.03570.03850.03650.03850.03840.04340.038551-0.001551-0.0018
0.010.01010.00990.010.00670.00990.00560.00670.01010.00990.00780.00990.00950.010.009952-0.000152-0.0044
0.24260.24850.39440.24260.49140.39440.2840.49140.34550.39440.29630.39440.26980.39740.3944530.1518530.0414
0.53480.04280.01040.53480.06310.01040.08620.06310.17090.01040.17680.01040.08310.0670.010454-0.524454-0.4486
0.50450.50320.51520.50450.56540.51520.59410.56540.51280.51520.54440.51520.55210.52340.5152550.0107550.0896
0.70560.67580.67470.70560.61080.67470.58380.61080.67490.67470.65230.67470.67220.67820.674756-0.030956-0.1218
0.5490.21320.16750.5490.37580.16750.17880.37580.16080.16750.1330.16750.19770.22470.167557-0.381557-0.3702
0.92340.92090.7880.92340.93410.7880.93310.93410.92240.7880.92290.7880.92340.91880.78858-0.1354580.0097
0.568710.91670.56870.68060.91670.91670.680610.916710.9167110.9167590.348590.348
0.19090.18110.21640.19090.12010.21640.22680.12010.21830.21640.22350.21640.22010.21850.2164600.0255600.0359
0.10870.48190.5850.10870.42570.5850.69940.42570.71840.5850.73280.5850.71030.59850.585610.4763610.5907
0.72970.71040.71080.72970.76740.71080.73570.76740.77110.71080.7210.71080.77110.72030.710862-0.0189620.006
0.62360.7990.78520.62360.79370.78520.79690.79370.79750.78520.79750.78520.79750.80020.7852630.1616630.1733
0.27260.00310.00390.27260.00660.00390.08750.00660.08940.00390.0890.00390.08940.09630.003965-0.268765-0.1851
0.91690.8930.89540.91690.90910.89540.87460.90910.8930.89540.89470.89540.89550.89690.895466-0.021566-0.0423
0.30320.26860.33770.30320.51220.33770.58520.51220.48250.33770.61330.33770.50570.49920.3377670.0345670.282
0.59090.62050.6070.59090.72910.6070.56170.72910.5920.6070.56210.6070.58350.61260.607680.016168-0.0292
0.4410.34960.29460.4410.41030.29460.32930.41030.29530.29460.28680.29460.2930.29380.294669-0.146469-0.1117
0.8030.73570.79990.8030.74420.79990.77380.74420.75880.79990.76280.79990.76890.75780.799970-0.003170-0.0292
0.0540.0360.04220.0540.51490.04220.13420.51490.0430.04220.10270.04220.05840.05310.042271-0.0118710.0802
0.25150.25150.25810.25150.21270.25810.21950.21270.24140.25810.24480.25810.24830.24880.2581720.006672-0.032
0.3830.39490.35120.3830.29280.35120.32260.29280.3930.35120.330.35120.3770.37610.351273-0.031873-0.0604
0.23410.47430.54320.23410.72070.54320.6870.72070.69580.54320.70290.54320.69930.56030.5432740.3091740.4529
111111111111111750750
0.43230.42030.21190.43230.43760.21190.36370.43760.420.21190.38740.21190.40960.41220.211976-0.220476-0.0686
0.30460.30610.33890.30460.20.33890.30290.20.30680.33890.30710.33890.30550.31350.3389770.034377-0.0017
0.1320.09930.07220.1320.19790.07220.29670.19790.23790.07220.28330.07220.2590.13440.072278-0.0598780.1647
0.63870.67190.77230.63870.39320.77230.70560.39320.70560.77230.70560.77230.70560.77640.7723790.1336790.0669
0.27920.29120.23950.27920.24430.23950.27190.24430.27920.23950.28230.23950.29830.25380.239580-0.039780-0.0073
0.57780.23440.25450.57780.1360.25450.27260.1360.26020.25450.27350.25450.28340.22850.254581-0.323381-0.3052
0.68750.57290.64090.68750.65920.64090.64820.65920.64350.64090.65690.64090.63640.64460.640982-0.046682-0.0393
111111111111111830830
111111111111111840840
0.24670.22330.22810.24670.21150.22810.23550.21150.240.22810.23840.22810.23940.21970.228185-0.018685-0.0112
0.27730.07040.06660.27730.23040.06660.19720.23040.07460.06660.09020.06660.2120.20070.066686-0.210786-0.0801
0.41770.41910.38520.41770.45180.38520.39510.45180.37740.38520.38410.38520.39030.38340.385287-0.032587-0.0226
0.89960.8450.89950.89960.86240.89950.83440.86240.90030.89950.87880.89950.89020.91660.899588-0.000188-0.0652
0.61330.48020.47550.61330.52410.47550.5780.52410.58880.47550.58830.47550.57150.58910.475589-0.137889-0.0353
0.12960.12920.10120.12960.13910.10120.15510.13910.12370.10120.13770.10120.14530.1290.101290-0.0284900.0255
0.17370.15620.14250.17370.080.14250.0870.080.12220.14250.05920.14250.13830.17680.142591-0.031291-0.0867
0.40090.40780.40570.40090.41110.40570.4250.41110.41930.40570.43820.40570.4080.36080.4057920.0048920.0241
10.96670.9667110.96670.966710.96670.96670.96670.96670.96670.96670.966793-0.033393-0.0333
0.58760.40970.37010.58760.08440.37010.31010.08440.27110.37010.28120.37010.28180.34490.370194-0.217594-0.2775
0.39030.33710.33950.39030.43760.33950.30840.43760.33850.33950.33230.33950.36320.34130.339595-0.050895-0.0819
0.57660.32090.48520.57660.2290.48520.49670.2290.35050.48520.36630.48520.52420.50260.485296-0.091496-0.0799
0.30730.40090.39580.30730.40450.39580.41010.40450.40090.39580.42150.39580.40090.40050.3958970.0885970.1028
0.85410.850.86340.85410.85040.86340.86510.85040.86110.86340.87330.86340.87350.86060.8634980.0093980.011
0.30790.15310.18450.30790.23560.18450.24850.23560.28420.18450.27650.18450.32970.18080.184599-0.123499-0.0594
0.32960.29920.33170.32960.25980.33170.47570.25980.26060.33170.44090.33170.26710.41550.33171000.00211000.1461
0.24750.12520.11360.24750.4870.11360.4730.4870.13590.11360.42320.11360.46180.3460.1136101-0.13391010.2255
0.55210.57240.56880.55210.58690.56880.57270.58690.57680.56880.57390.56880.57670.57420.56881020.01671020.0206
0.53470.61350.62960.53470.4620.62960.60940.4620.61790.62960.62150.62960.61710.62790.62961030.09491030.0747
0.26830.27420.27180.26830.10090.27180.26940.10090.26960.27180.26940.27180.26970.27080.27181040.00351040.0011
0.450.36110.750.450.83330.750.30880.83330.31670.750.30880.750.31670.32690.751050.3105-0.1412
0.44730.38790.38440.44730.350.38440.34610.350.33690.38440.34590.38440.35950.39970.3844106-0.0629106-0.1012
0.21750.33630.33130.21750.29240.33130.2230.29240.19090.33130.18670.33130.23220.37930.33131070.11381070.0055
0.27930.27930.28480.27930.27760.28480.27260.27760.27720.28480.28330.28480.27840.28710.28481080.0055108-0.0067
0.05340.05650.08010.05340.090.08010.09920.090.07350.08010.11660.08010.07310.08620.08011090.02671090.0458
0.64350.71270.71770.64350.7030.71770.73140.7030.73660.71770.73530.71770.72670.73310.71771100.07421100.0879
0.15810.02290.01450.15810.09860.01450.01850.09860.01580.01450.01790.01450.01760.01470.0145111-0.1436111-0.1396
0.52040.5210.48960.52040.59010.48960.55840.59010.55760.48960.54960.48960.56830.50770.4896112-0.03081120.038
0.08080.00650.00720.08080.02830.00720.10210.02830.00930.00720.010.00720.03060.00910.0072113-0.07361130.0213
0.48460.47170.54180.48460.58550.54180.54840.58550.52260.54180.52960.54180.53810.54440.54181140.05721140.0638
0.70550.71980.7390.70550.81560.7390.77390.81560.720.7390.720.7390.720.7390.7391150.03351150.0684
0.67140.6240.61010.67140.59540.61010.59370.59540.63590.61010.62330.61010.63590.64010.6101116-0.0613116-0.0777
0.23490.36780.46570.23490.03870.46570.33020.03870.30460.46570.30780.46570.30290.41030.46571170.23081170.0953
0.04530.09640.12240.04530.14460.12240.12980.14460.09570.12240.12180.12240.12560.12520.12241180.07711180.0845
0.88560.82960.82340.88560.83080.82340.88230.83080.81880.82340.82070.82340.85810.85250.8234119-0.0622119-0.0033
0.00940.02390.02390.00940.00550.02390.02120.00550.02390.02390.02390.02390.02390.02390.02391200.01451200.0118
11111111111111112101210
0.28350.20550.2590.28350.24920.2590.35730.24920.26090.2590.28120.2590.25380.26250.259122-0.02451220.0738
11111111111111112301230
0.30670.30050.30670.30670.29580.30670.42980.29580.3210.30670.37820.30670.5360.47490.306712401240.1231
0.47560.50830.51340.47560.38110.51340.49870.38110.52750.51340.50730.51340.54540.51360.51341250.03781250.0231
0.79630.79320.79960.79630.78980.79960.81160.78980.80480.79960.80480.79960.79850.80620.79961260.00331260.0153
0.83880.89740.91750.83880.91090.91750.89520.91090.91860.91750.92270.91750.92070.91750.91751270.07871270.0564
0.26370.00260.08130.26370.15760.08130.30350.15760.09930.08130.10510.08130.12670.07730.0813128-0.18241280.0398
0.24110.25280.13890.24110.27710.13890.24740.27710.25140.13890.24650.13890.25020.25830.1389129-0.10221290.0063
0.65190.48280.47240.65190.53090.47240.53850.53090.52980.47240.53760.47240.53360.5610.4724130-0.1795130-0.1134
0.51830.40.47220.51830.48090.47220.32650.48090.4160.47220.36190.47220.4080.45870.4722131-0.0461131-0.1918
0.45830.40560.40880.45830.4960.40880.37430.4960.40880.40880.41250.40880.40560.52760.4088132-0.0495132-0.084
0.19680.12570.13770.19680.2120.13770.15950.2120.23940.13770.26670.13770.17860.18830.1377133-0.0591133-0.0373
11111111111111113401340
0.15970.14290.12130.15970.03290.12130.29510.03290.18690.12130.30460.12130.16170.12260.1213135-0.03841350.1354
11111111111111113601360
0.18570.01620.01390.18570.05930.01390.03120.05930.02480.01390.02210.01390.03240.01850.0139137-0.1718137-0.1545
0.31350.05710.08190.31350.00120.08190.08680.00120.06410.08190.07130.08190.05040.06390.0819138-0.2316138-0.2267
0.51170.55590.5520.51170.49490.5520.48340.49490.54960.5520.55250.5520.51320.58170.5521390.0403139-0.0283
0.6690.78190.65560.6690.74750.65560.78290.74750.80060.65560.79580.65560.78360.80540.6556140-0.01341400.1139
00000000000000014101410
0.38460.38460.38460.38460.38460.38460.38460.38460.38460.38460.38460.38460.38460.38460.384614201420
0.10950.1070.10510.10950.10080.10510.10140.10080.11350.10510.11180.10510.11570.11680.1051143-0.0044143-0.0081
11111111111111114401440
0.43470.49350.5290.43470.60750.5290.42280.60750.52190.5290.42810.5290.52380.52190.5291450.0943145-0.0119
0.18560.2060.24290.18560.45520.24290.49510.45520.28950.24290.41710.24290.31760.24860.24291470.05731470.3095
0.06550.02990.03140.06550.02410.03140.02620.02410.03150.03140.02690.03140.02730.03390.0314148-0.0341148-0.0393
00000000000000014901490
0.21080.21380.15280.21080.17940.15280.14020.17940.14460.15280.1410.15280.13740.1540.1528150-0.058150-0.0706
0.06610.000600.06610.130500.33880.13050.252100.31500.261700151-0.06611510.2727
0.26290.27280.28090.26290.31280.28090.26090.31280.2750.28090.25680.28090.27860.28920.28091520.018152-0.002
0.40430.39520.39520.40430.38310.39520.49060.38310.42290.39520.44360.39520.42990.39490.3952153-0.00911530.0863
0.1250.04170.04170.1250.050.04170.04550.050.04170.04170.04170.04170.04170.04550.0417154-0.0833154-0.0795
0.00820.02630.02780.00820.03850.02780.02780.03850.02780.02780.04170.02780.02380.02270.02781550.01961550.0196
0.0016000.00160.0025000.00250000000156-0.0016156-0.0016
0.51610.0740.16960.51610.26410.16960.15610.26410.17130.16960.17040.16960.17270.26020.1696157-0.3465157-0.36
00000000000000015801580
0.01920.01930.01920.01920.05570.01920.01470.05570.0190.01920.01440.01920.0190.0190.01921590159-0.0045
0.20380.20970.21540.20380.21050.21540.20470.21050.22270.21540.24030.21540.20230.20340.21541620.01161620.0009
0.35330.36530.4180.35330.61610.4180.48280.61610.33640.4180.41270.4180.46080.43620.4181630.06471630.1295
0.35780.18430.1850.35780.2780.1850.1690.2780.1950.1850.19310.1850.19020.18640.185164-0.1728164-0.1888
00000000000000016501650
11111111111111116701670
00000000000000016801680
0.50.06670.07690.50.03230.07690.33330.03230.08330.07690.10.07690.16670.11110.0769170-0.4231170-0.1667
0.19530.21020.20250.19530.13860.20250.10890.13860.10940.20250.10970.20250.10940.13390.20251710.0072171-0.0864
0.66670.66670.66670.66670.66670.66670.66670.66670.66670.66670.66670.66670.66670.66670.666717301730
00000000000000017401740
00000000000000017501750
0.94890.95340.96180.94890.89320.96180.88050.89320.94530.96180.92760.96180.92760.94530.96181760.0129176-0.0684
0.18010.18130.18320.18010.19170.18320.16140.19170.1820.18320.17370.18320.1810.1820.18321770.0031177-0.0187
0.44440.50.55560.44440.27780.55560.55560.27780.55560.55560.55560.55560.55560.50.55561780.11121780.1112
00000000000000017901790
00000000000000018001800
00000000000000018101810
0.50620.40250.38960.50620.52630.38960.58130.52630.59160.38960.58570.38960.56510.53260.3896182-0.11661820.0751
0.29330.24670.0610.29330.23330.0610.29330.23330.29330.0610.28330.0610.30.29330.061183-0.23231830
0.55130.32420.25880.55130.30190.25880.41250.30190.42270.25880.43260.25880.39540.37640.2588184-0.2925184-0.1388
00000000000000018501850
0.08910.14660.25930.08910.13870.25930.1240.13870.13170.25930.13250.25930.13370.24790.25931860.17021860.0349
0.18710.1510.16920.18710.1470.16920.15010.1470.15160.16920.15170.16920.15160.15150.1692187-0.0179187-0.037
0.33330.33330.33330.33330.33330.33330.33330.33330.33330.33330.33330.33330.33330.33330.333318801880
0.250.250.250.250.250.250.250.250.250.250.250.250.250.250.2518901890
0.13390.06770.07530.13390.05020.07530.0760.05020.07460.07530.07710.07530.07570.07410.0753190-0.0586190-0.0579
00000000000000019201920
0.27440.14650.16590.27440.1410.16590.23910.1410.18140.16590.20340.16590.24840.23540.1659193-0.1085193-0.0353
0.17220.25050.25270.17220.20960.25270.24390.20960.21660.25270.22420.25270.22270.24180.25271950.08051950.0717
00000000000000019601960
00000000000000019701970
00000000000000019801980
00000000000000019901990
00000000000000020002000
0.38560.36030.36580.38560.37350.36580.38660.37350.37570.36580.38230.36580.38390.37950.3658
p=0.007p=0.075p=0.050p=0.176p=0.162p=0.017p=0.184p=0.104p=0.0180.003
[email protected]@[email protected]@[email protected]@[email protected]@[email protected]@[email protected]@0.5
tmp
-
Bilingual Query Expansion
• Pre-translation expansion
– Source language query → Source Language IR (source language collection) → expanded source language query
• Query Translation
• Post-translation expansion
– Target Language IR (target language collection) → expanded target language terms → results
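The expansion flow on this slide can be sketched roughly as follows. This is a minimal illustration, not the method from any cited paper: the term-overlap scoring, the blind-feedback rule in `expand`, the toy English and Dutch collections, and the tiny dictionary are all assumptions made up for the example.

```python
from collections import Counter

def search(query_terms, collection, k=2):
    """Toy ranking: score each document by how often it contains query terms."""
    scored = []
    for doc in collection:
        tokens = doc.split()
        score = sum(tokens.count(t) for t in set(query_terms))
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def expand(query_terms, collection, k=2, n_new=2):
    """Blind feedback: add the n_new most frequent new terms from the top-k documents."""
    counts = Counter()
    for doc in search(query_terms, collection, k):
        counts.update(t for t in doc.split() if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(n_new)]

def translate(terms, dictionary):
    """Dictionary lookup; terms without an entry are dropped in this sketch."""
    translated = []
    for term in terms:
        translated.extend(dictionary.get(term, []))
    return translated

# Hypothetical toy data (source language: English, target language: Dutch).
source_collection = [
    "election vote parliament vote",
    "election parliament coalition",
    "football match goal",
]
target_collection = [
    "parlement stem debat",
    "verkiezing uitslag",
    "voetbal wedstrijd",
]
dictionary = {"election": ["verkiezing"], "vote": ["stem"], "parliament": ["parlement"]}

query = ["election"]
baseline = search(translate(query, dictionary), target_collection, k=3)

pre = expand(query, source_collection)           # pre-translation expansion
translated = translate(pre, dictionary)          # query translation
post = expand(translated, target_collection)     # post-translation expansion
results = search(post, target_collection, k=3)
```

In this toy setup the unexpanded query only reaches the document that contains "verkiezing", while the expanded query also reaches the document that discusses the same topic in other words.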
-
Query Expansion Effect
Paul McNamee and James Mayfield, SIGIR-2002
Mean Average Precision by number of Unique Dutch Terms

| Unique Dutch Terms | Both | Post | Pre | None |
| 15,591 | 0.3171 | 0.1514 | 0.2718 | 0.1153 |
| 14,032 | 0.3167 | 0.1427 | 0.2657 | 0.1005 |
| 12,473 | 0.3223 | 0.1056 | 0.2618 | 0.0966 |
| 10,932 | 0.3127 | 0.0937 | 0.2497 | 0.0931 |
| 9,355 | 0.2627 | 0.0904 | 0.2269 | 0.0841 |
| 7,795 | 0.2575 | 0.1059 | 0.2195 | 0.0794 |
| 6,236 | 0.2331 | 0.1082 | 0.2104 | 0.0809 |
| 4,667 | 0.2245 | 0.1003 | 0.2019 | 0.0783 |
| 3,118 | 0.2196 | 0.0906 | 0.1931 | 0.0766 |
| 1,559 | 0.2189 | 0.0705 | 0.1951 | 0.0696 |
| 0 | 0.2113 | 0.0697 | 0.1832 | 0.0623 |
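As a rough worked example, the relative gain of each expansion condition over no expansion can be computed from the mean average precision values above. The sketch assumes the column order given in the slide data (Both, Post, Pre, None) and uses only the first and last rows; the function name `relative_gain` is made up for illustration.

```python
# MAP values copied from the slide data, keyed by number of unique Dutch terms.
MAP = {
    15591: {"Both": 0.3171, "Post": 0.1514, "Pre": 0.2718, "None": 0.1153},
    0: {"Both": 0.2113, "Post": 0.0697, "Pre": 0.1832, "None": 0.0623},
}

def relative_gain(row, condition):
    """Relative improvement of a condition over the no-expansion baseline."""
    return row[condition] / row["None"] - 1.0

gains = {
    size: {cond: relative_gain(row, cond) for cond in ("Pre", "Post", "Both")}
    for size, row in MAP.items()
}

for size, by_condition in gains.items():
    for cond, gain in by_condition.items():
        print(f"{size:>6} terms  {cond:<4}  {gain:+.0%}")
```

On these two rows, pre-translation expansion gives a larger relative gain than post-translation expansion, and combining both gives the largest.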
-
Cognate Matching
• Dictionary coverage is inherently limited
– Translation of proper names
– Translation of newly coined terms
– Translation of unfamiliar technical terms
• Strategy: model derivational translation
– Orthography-based
– Pronunciation-based
-
Matching Orthographic Cognates
• Retain untranslatable words unchanged
– Often works well between European languages
• Rule-based systems
– Even off-the-shelf spelling correction can help!
• Subword (e.g., character-level) MT
– Trained using a set of representative cognates
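The first two strategies can be sketched with Python's standard `difflib` module standing in for an off-the-shelf spelling corrector. This is an assumption-laden illustration: the function `translate_term`, the toy German-to-English dictionary, the target vocabulary, and the 0.8 similarity cutoff are all hypothetical.

```python
import difflib

def translate_term(term, dictionary, target_vocab, cutoff=0.8):
    """Translate one query term, falling back to orthographic cognate matching."""
    term = term.lower()
    if term in dictionary:
        return dictionary[term]          # normal dictionary translation
    if term in target_vocab:
        return [term]                    # untranslatable word retained unchanged
    # Spelling-correction-style fallback: closest indexed term above the cutoff.
    close = difflib.get_close_matches(term, target_vocab, n=1, cutoff=cutoff)
    return close if close else [term]    # no good match: keep the word as is

# Hypothetical toy resources.
dictionary = {"haus": ["house"], "wahl": ["election"]}
target_vocab = ["election", "house", "nato", "parliament"]

translated = [translate_term(t, dictionary, target_vocab)
              for t in ["Haus", "NATO", "Parlament", "xyzzy"]]
```

Here "Parlament" has no dictionary entry but lands on the indexed term "parliament" through string similarity, while "NATO" is matched simply by leaving it unchanged.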
-
Matching Phonetic Cognates
• Forward transliteration
– Generate all potential transliterations
• Reverse transliteration
– Guess source string(s) that produced a transliteration
• Match in phonetic space
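Two of these ideas can be sketched in a few lines: forward transliteration as enumeration over per-letter alternatives, and phonetic-space matching through a crude consonant-class key. Both the letter table `TOY_TRANSLIT` and the key function are hypothetical toys (the key is Soundex-like but is not standard Soundex), meant only to show the shape of the technique.

```python
from itertools import product

# Hypothetical table: each Cyrillic letter maps to plausible Latin renderings.
TOY_TRANSLIT = {
    "е": ["ye", "e", "je"],
    "л": ["l"],
    "ь": [""],
    "ц": ["ts", "z", "c"],
    "и": ["i"],
    "н": ["n"],
}

def forward_transliterations(word, table=TOY_TRANSLIT):
    """Generate every combination of per-letter renderings."""
    options = [table.get(ch, [ch]) for ch in word.lower()]
    return {"".join(parts) for parts in product(*options)}

# Crude consonant classes; vowels and h, w, y are ignored.
CONSONANT_CLASS = {}
for letters, code in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                      ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CONSONANT_CLASS[letter] = code

def phonetic_key(word):
    """Map a Latin-script word to its consonant classes, collapsing repeats."""
    codes = []
    for ch in word.lower():
        code = CONSONANT_CLASS.get(ch)
        if code is None:
            continue
        if not codes or codes[-1] != code:
            codes.append(code)
    return "".join(codes)

# Forward transliteration, then exact match against indexed terms.
candidates = forward_transliterations("ельцин")
index_vocabulary = {"yeltsin", "moscow"}
exact_hits = candidates & index_vocabulary

# Matching in phonetic space: variant spellings share one key.
keys = {phonetic_key(s) for s in ["gaddafi", "qadhafi", "kadafi"]}
```

Enumeration grows multiplicatively with the number of alternatives per letter, which is one reason a compact phonetic key can be attractive when many spellings of a name are in circulation.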
-
Cross-Language “Retrieval”
[Diagram: Query → Query Translation → Translated Query → Search → Ranked List]
-
Uses of “MT” in CLIR
[Diagram labels]
• Search process: Query Formulation → Query → Query Translation → Translated Query → Search → Ranked List → Selection → Document → Examination → Document → Use
• Feedback loop: Query Reformulation
• Translation uses: Term Translation, Term Matching, Snippet Translation, Indicative Translation, Informative Translation
-
Interactive Cross-Language Question Answering
[Bar chart: Users with Correct Answers (0 to 8) by Question Number, in the order 8, 11, 13, 4, 16, 6, 14, 7, 2, 10, 15, 12, 1, 3, 9, 5. Source: iCLEF 2004]
-
Questions, Grouped by Difficulty
8 Who is the managing director of the International Monetary Fund?
11 Who is the president of Burundi?
13 Of what team is Bobby Robson coach?
4 Who committed the terrorist attack in the Tokyo underground?
16 Who won the Nobel Prize for Literature in 1994?
6 When did Latvia gain independence?
14 When did the attack at the Saint-Michel underground station in Paris occur?
7 How many people were declared missing in the Philippines after the typhoon “Angela”?
2 How many human genes are there?
10 How many people died of asphyxia in the Baku underground?
15 How many people live in Bombay?
12 What is Charles Millon's political party?
1 What year was Thomas Mann awarded the Nobel Prize?
3 Who is the German Minister for Economic Affairs?
9 When did Lenin die?
5 How much did the Channel Tunnel cost?
-
For Further Reading
• Multilingual IR
– Paul McNamee et al., Addressing Morphological Variation in Alphabetic Languages, SIGIR, 2009
• African-Language IR
– Open CLIR Challenge (Swahili), IARPA, 2018
– Nkosana Malumba et al., AfriWeb: A Search Engine for a Marginalized Language, ICADL, 2015
• Cross-Language IR
– Jian-Yun Nie, Cross-Language Information Retrieval, Synthesis Lectures in HLT, Morgan & Claypool, 2010
– Jianqiang Wang and Douglas W. Oard, Matching Meaning for Cross-Language Information Retrieval, Information Processing and Management, 2012