Multilingual Information Retrieval Doug Oard College of Information Studies and UMIACS University of Maryland, College Park USA January 14, 2019 AFIRM


  • Multilingual Information Retrieval

    Doug Oard
    College of Information Studies and UMIACS
    University of Maryland, College Park
    USA

    January 14, 2019 AFIRM

  • Global Trade

    [Figure: plot of exports against imports, both in trillions of USD and both on axes from 0.0 to 2.5, with labeled points for the USA, Japan, South Korea, Hong Kong, China, and the EU.]

    Source: Wikipedia (mostly 2017 estimates)

  • Most Widely-Spoken Languages

    [Figure: chart of L1 and L2 speakers per language. The axis is labeled "Billions of Speakers" with ticks from 0 to 1,200. Languages, in the order listed on the chart: Southern Min, Persian, Thai, Hausa, Italian, Vietnamese, Yue Chinese, Tamil, Marathi, Korean, Turkish, Telugu, Wu Chinese, Javanese, Western Punjabi, Swahili, Japanese, German, Urdu, Indonesian, Portuguese, Bengali, Russian, Modern Std Arabic, French, Spanish, Hindi, Mandarin Chinese, English.]

    Source: Ethnologue (SIL), 2018

  • Global Internet Users

    [Figure: pie charts of per-language shares, with the legend English, Chinese, Spanish, Japanese, Portuguese, German, Arabic, French, Russian, Korean.]

  • What Does “Multilingual” Mean?

    • Mixed-language document
      – Document containing more than one language
    • Mixed-language collection
      – Collection of documents in different languages
    • Multi-monolingual systems
      – Can retrieve from a mixed-language collection
    • Cross-language system
      – Query in one language finds document in another
    • (Truly) multilingual system
      – Queries can find documents in any language

  • A Story in Two Parts

    • IR from the ground up in any language
      – Focusing on document representation

    • Cross-Language IR
      – To the extent time allows

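  • [Figure: retrieval system diagram. The query and the documents each pass through a representation function, giving a query representation and a document representation (the index); a comparison function matches the two to produce hits.]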

  • ASCII

    • American Standard Code for Information Interchange

    • ANSI X3.4-1968

    | 0 NUL | 32 SPACE | 64 @ | 96 ` |
    | 1 SOH | 33 ! | 65 A | 97 a |
    | 2 STX | 34 " | 66 B | 98 b |
    | 3 ETX | 35 # | 67 C | 99 c |
    | 4 EOT | 36 $ | 68 D | 100 d |
    | 5 ENQ | 37 % | 69 E | 101 e |
    | 6 ACK | 38 & | 70 F | 102 f |
    | 7 BEL | 39 ' | 71 G | 103 g |
    | 8 BS | 40 ( | 72 H | 104 h |
    | 9 HT | 41 ) | 73 I | 105 i |
    | 10 LF | 42 * | 74 J | 106 j |
    | 11 VT | 43 + | 75 K | 107 k |
    | 12 FF | 44 , | 76 L | 108 l |
    | 13 CR | 45 - | 77 M | 109 m |
    | 14 SO | 46 . | 78 N | 110 n |
    | 15 SI | 47 / | 79 O | 111 o |
    | 16 DLE | 48 0 | 80 P | 112 p |
    | 17 DC1 | 49 1 | 81 Q | 113 q |
    | 18 DC2 | 50 2 | 82 R | 114 r |
    | 19 DC3 | 51 3 | 83 S | 115 s |
    | 20 DC4 | 52 4 | 84 T | 116 t |
    | 21 NAK | 53 5 | 85 U | 117 u |
    | 22 SYN | 54 6 | 86 V | 118 v |
    | 23 ETB | 55 7 | 87 W | 119 w |
    | 24 CAN | 56 8 | 88 X | 120 x |
    | 25 EM | 57 9 | 89 Y | 121 y |
    | 26 SUB | 58 : | 90 Z | 122 z |
    | 27 ESC | 59 ; | 91 [ | 123 { |
    | 28 FS | 60 < | 92 \ | 124 | |
    | 29 GS | 61 = | 93 ] | 125 } |
    | 30 RS | 62 > | 94 ^ | 126 ~ |
    | 31 US | 63 ? | 95 _ | 127 DEL |

  • The Latin-1 Character Set

    • ISO 8859-1 8-bit characters for Western Europe
      – French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English

    [Figure: Latin-1 code chart. Legend: printable characters, 7-bit ASCII; additional defined characters, ISO 8859-1.]

  • Other ISO-8859 Character Sets

    [Figure: code charts for ISO 8859-2, -3, -4, -5, -6, -7, -8, and -9.]

  • East Asian Character Sets

    • More than 256 characters are needed
      – Two-byte encoding schemes (e.g., EUC) are used

    • Several countries have unique character sets
      – GB in People’s Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam

    • Many characters appear in several languages
      – Research Libraries Group developed EACC
        • Unified “CJK” character set for USMARC records

  • Unicode

    • Single code for all the world’s characters
      – ISO Standard 10646

    • Separates “code space” from “encoding”
      – Code space extends Latin-1
        • The first 256 positions are identical
      – UTF-7 encoding will pass through email
        • Uses only the 64 printable ASCII characters
      – UTF-8 encoding is designed for disk file systems

  • Limitations of Unicode

    • Produces larger files than Latin-1
    • Fonts may be hard to obtain for some characters
    • Some characters have multiple representations
      – e.g., accents can be part of a character or separate
    • Some characters look identical when printed
      – But they come from unrelated languages
    • Encoding does not define the “sort order”
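    Three of these limitations can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch; the sample strings are illustrative choices, and the look-alike pair shown (Latin and Cyrillic capital A) is one example of the phenomenon rather than the one the slide had in mind.

```python
import unicodedata

# Multiple representations: a precomposed accent vs. a separate combining accent.
composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

same_before = composed == decomposed
same_after = (unicodedata.normalize("NFC", composed)
              == unicodedata.normalize("NFC", decomposed))

# Look-alikes: identical on screen, different code points.
latin_a = "A"            # U+0041 LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"    # U+0410 CYRILLIC CAPITAL LETTER A
lookalikes_equal = latin_a == cyrillic_a

# Sort order: the default string sort follows code points,
# not any language's alphabetical order.
by_code_point = sorted(["zebra", "Zebra", "éclair", "eclair"])
```

    Normalizing both queries and documents to one form (NFC here) before indexing is one common way to keep the first problem from causing missed matches.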

  • Strings and Segments

    • Retrieval is (often) a search for concepts
      – But what we actually search are character strings

    • What strings best represent concepts?
      – In English, words are often a good choice
        • Well-chosen phrases might also be helpful
      – In German, compounds may need to be split
        • Otherwise queries using constituent words would fail
      – In Chinese, word boundaries are not marked
        • Thissegmentationproblemissimilartothatofspeech

  • Tokenization

    • Words (from linguistics):
      – Morphemes are the units of meaning
      – Combined to make words
        • Anti (disestablishmentarian) ism

    • Tokens (from computer science)
      – Doug ’s running late !
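    The token example above can be reproduced with a small regular-expression tokenizer. This is only a sketch: the pattern is an assumption chosen to match the slide's example, and real tokenizers handle many more cases (hyphens, numbers, abbreviations).

```python
import re

# Words, then clitics that start with an apostrophe, then single punctuation marks.
TOKEN = re.compile(r"\w+|['’]\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN.findall(text)

tokens = tokenize("Doug’s running late!")
print(tokens)
```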

  • Morphological Segmentation

    Swahili Example

    a + li + ni + andik + ish + a

    he + past-tense + me + write + causer-effect + Declarative-mode

    Credit: Ramy Eskander

  • Morphological Segmentation

    Somali Example

    cun + t + aa

    eat + she + present-tense

    Credit: Ramy Eskander

  • Stemming

    • Conflates words, usually preserving meaning
      – Rule-based suffix-stripping helps for English
        • {destroy, destroyed, destruction}: destr
      – Prefix-stripping is needed in some languages
        • Arabic: {alselam}: selam [Root: SLM (peace)]

    • Imperfect: goal is to usually be helpful
      – Overstemming
        • {centennial, century, center}: cent
      – Understemming
        • {acquire, acquiring, acquired}: acquir
        • {acquisition}: acquis

    • Snowball: rule-based system for making stemmers
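    A toy suffix-stripper is enough to reproduce the understemming example. The rule list below is hypothetical and far smaller than what a Snowball stemmer uses; it is meant only to show how rule-based stripping leaves related words with different stems.

```python
# Hypothetical toy rule set, not the rules of any real stemmer.
SUFFIXES = ("ition", "ing", "ed", "e", "s")

def toy_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = {w: toy_stem(w)
         for w in ["acquire", "acquiring", "acquired", "acquisition"]}
print(stems)
```

    The three verb forms conflate to `acquir`, while `acquisition` ends up as `acquis`, so a query for one will not match documents containing the other.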

  • Longest Substring Segmentation

    • Greedy algorithm based on a lexicon

    • Start with a list of every possible term

    • For each unsegmented string
      – Remove the longest single substring in the list
      – Repeat until no substrings are found in the list

  • Longest Substring Example

    • Possible German compound term (!):
      – washington

    • List of German words:
      – ach, hin, hing, sei, ton, was, wasch

    • Longest substring segmentation
      – was-hing-ton
      – Roughly translates as “What tone is attached?”
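    The greedy procedure and the example above can be sketched in a few lines of Python. The function name and the recursive formulation are choices made for this sketch; leftover pieces that match nothing in the list are kept as their own segments.

```python
def longest_substring_segment(text, lexicon):
    """Greedy segmentation: pull out the longest lexicon entry found anywhere
    in the string, then repeat on the pieces to its left and right."""
    if not text:
        return []
    best_word, best_pos = None, -1
    for word in lexicon:
        if not word:
            continue  # an empty entry would never shrink the string
        pos = text.find(word)
        if pos != -1 and (best_word is None or len(word) > len(best_word)):
            best_word, best_pos = word, pos
    if best_word is None:
        return [text]  # nothing in the lexicon matches; keep the piece whole
    left = text[:best_pos]
    right = text[best_pos + len(best_word):]
    return (longest_substring_segment(left, lexicon)
            + [best_word]
            + longest_substring_segment(right, lexicon))

german_words = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
segments = longest_substring_segment("washington", german_words)
print("-".join(segments))
```

    The longest match is `hing`, which leaves `was` and `ton` on either side, so the greedy method confidently produces the wrong answer for a proper name.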

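  • [Figure: alternative segmentations of a string, shown by their English glosses: “oil, petroleum” + “probe, survey, take samples”; “restrain” + “oil, petroleum” + “probe, survey, take samples”; “cymbidium goeringii”.]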

  • Probabilistic Segmentation

    • For an input string c1 c2 c3 … cn

    • Try all possible partitions into w1 w2 w3 …
      – c1 | c2 | c3 | … | cn
      – c1 c2 | c3 | … | cn
      – c1 | c2 c3 | … | cn
      – etc.

    • Choose the highest probability partition
      – Compute Pr(w1 w2 w3 …) using a language model

    • Challenges: search, probability estimation
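    One way to handle the search challenge is dynamic programming, which finds the best partition without enumerating all of them. The sketch below assumes a unigram language model (word probabilities multiply independently), which is a simplification; the probabilities and the penalty for unknown single characters are made-up values for illustration.

```python
import math

def best_segmentation(text, unigram, max_len=12, unk_logp=-20.0):
    """Return the highest-probability partition of text under a unigram model.

    best[i] holds (log-probability, start of last word) for the prefix text[:i].
    Unknown single characters get a fixed penalty so every string can be segmented.
    """
    n = len(text)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in unigram:
                logp = math.log(unigram[word])
            elif len(word) == 1:
                logp = unk_logp
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    words, end = [], n
    while end > 0:  # follow the back-pointers from the end of the string
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Hypothetical probabilities, chosen only to make the example concrete.
toy_model = {"was": 0.02, "hing": 0.001, "ton": 0.005,
             "washing": 0.004, "washington": 0.01}
with_name = best_segmentation("washington", toy_model)
without_name = best_segmentation(
    "washington", {w: p for w, p in toy_model.items() if w != "washington"})
```

    With the full word in the model the string stays whole; without it, the model prefers `washing` + `ton` over `was` + `hing` + `ton`, because two moderately likely words beat three less likely ones.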

  • Non-Segmentation:N-gram Indexing

    • Consider a Chinese document c1 c2 c3 … cn

    • Don’t segment (you could be wrong!)

    • Instead, treat every character bigram as a term
      – c1 c2 , c2 c3 , c3 c4 , … , cn-1 cn

    • Break up queries the same way
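    Character bigram indexing takes only a line of Python. The Chinese strings below are illustrative examples chosen for this sketch, not taken from the slides.

```python
def char_ngrams(text, n=2):
    """Return every overlapping character n-gram of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

doc_terms = char_ngrams("信息检索系统")   # "information retrieval system"
query_terms = char_ngrams("检索")         # "retrieval"
overlap = set(query_terms) & set(doc_terms)
print(doc_terms, query_terms, overlap)
```

    Because the query is broken up the same way as the document, the query bigram matches a document bigram without any segmentation decision having been made.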

  • A “Term” is Whatever You Index

    • Word sense
    • Token
    • Word
    • Stem
    • Character n-gram
    • Phrase

  • Summary

    • A term is whatever you index
      – So the key is to index the right kind of terms!

    • Start by finding fundamental features
      – We have focused on character coded text
      – Same ideas apply to handwriting, OCR, and speech

    • Combine characters into easily recognized units
      – Words where possible, character n-grams otherwise

    • Apply further processing to optimize results
      – Stemming, phrases, …

  • A Story in Two Parts

    • IR from the ground up in any language
      – Focusing on document representation

    • Cross-Language IR
      – To the extent time allows

  • Query-Language CLIR

    [Figure: block diagram with the components English queries, Somali Document Collection, Translation System, English Document Collection, Retrieval Engine, and Results (select, examine).]

  • Document-Language CLIR

    [Figure: block diagram with the components English queries, Translation System, Somali queries, Somali Document Collection, Retrieval Engine, Somali documents, and Results (select, examine).]

  • Query vs. Document Translation

    • Query translation
      – Efficient for short queries (not relevance feedback)
      – Limited context for ambiguous query terms

    • Document translation
      – Rapid support for interactive selection
      – Need only be done once (if query language is same)

  • Indexing Time: Statistical Document Translation

    [Figure: indexing time in seconds (0 to 500) against thousands of documents (0 to 45), with one series for monolingual and one for cross-language indexing.]

  • Language-Neutral Retrieval

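    [Figure: “interlingual” retrieval diagram. Somali query terms pass through query “translation” and English document terms pass through document “translation”; retrieval then produces a ranked list with scores (1: 0.91, 2: 0.57, 3: 0.36).]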

  • Translation Evidence

    • Lexical Resources
      – Phrase books, bilingual dictionaries, …

    • Large text collections
      – Translations (“parallel”)
      – Similar topics (“comparable”)

    • Similarity
      – Similar writing (if the character set is the same)
      – Similar pronunciation

    • People
      – May be able to guess topic from lousy translations

  • Types of Lexical Resources

    • Ontology
      – Organization of knowledge
    • Thesaurus
      – Ontology specialized to support search
    • Dictionary
      – Rich word list, designed for use by people
    • Lexicon
      – Rich word list, designed for use by a machine
    • Bilingual term list
      – Pairs of translation-equivalent terms

  • [Figure: chart comparing four conditions: named entities removed, named entities from term list, named entities added, and full query.]

  • Backoff Translation

    • Lexicon might contain stems, surface forms, or some combination of the two.

    | Document form | Lexicon form | Terms shown | Translation shown |
    | surface form | surface form | mangez, mangez | eat |
    | stem | surface form | mangez, mange, mange | eats, eat |
    | surface form | stem | mangez, mange, mange | eat |
    | stem | stem | mangez, mange, mangent, mange | eat |
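    The four match types can be read as a sequence of progressively looser lookups. The sketch below tries them in the order listed on the slide; the function names, the tiny stemmer, and the one-entry lexicons are all hypothetical, written only to make the four stages concrete.

```python
def toy_stem(word):
    """Hypothetical stemmer: maps mangez and mangent to mange."""
    for suffix in ("nt", "z"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def backoff_translate(term, lexicon, stem=toy_stem):
    """Return (translation, stage) for the first stage that matches.

    Stages, as document form / lexicon form:
      1 surface/surface, 2 stem/surface, 3 surface/stem, 4 stem/stem.
    Returns (None, 0) when no stage matches.
    """
    stemmed_lexicon = {}
    for source, target in lexicon.items():
        stemmed_lexicon.setdefault(stem(source), target)
    stages = [
        (term, lexicon),
        (stem(term), lexicon),
        (term, stemmed_lexicon),
        (stem(term), stemmed_lexicon),
    ]
    for number, (form, table) in enumerate(stages, start=1):
        if form in table:
            return table[form], number
    return None, 0

print(backoff_translate("mangez", {"mangez": "eat"}))   # exact surface match
print(backoff_translate("mangez", {"mangent": "eat"}))  # only the stems agree
```

    Trying the strictest match first keeps precise translations when they exist and falls back to stem matching only when nothing better is available.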

  • [Figure: the Rosetta Stone, with its three scripts labeled: Hieroglyphic, Egyptian Demotic, Greek.]

  • Types of Bilingual Corpora

    • Parallel corpora: translation-equivalent pairs
      – Document pairs
      – Sentence pairs
      – Term pairs

    • Comparable corpora: topically related
      – Collection pairs
      – Document pairs

  • Some Modern Rosetta Stones

    • News:
      – DE-News (German-English)
      – Hong-Kong News, Xinhua News (Chinese-English)

    • Government:
      – Canadian Hansards (French-English)
      – Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
      – UN Treaties (Russian, English, Arabic, …)

    • Religion
      – Bible, Koran, Book of Mormon

  • Word-Level Alignment

    English: Diverging opinions about planned tax reform
    German: Unterschiedliche Meinungen zur geplanten Steuerreform

    English: Madam President , I had asked the administration …
    Spanish: Señora Presidenta, había pedido a la administración del Parlamento …

  • A Translation Model

    • From word-aligned bilingual text, we induce a translation model

    • Example: $p(f_i \mid e)$, where $\sum_{f_i} p(f_i \mid e) = 1$

      p(探测|survey) = 0.4
      p(试探|survey) = 0.3
      p(测量|survey) = 0.25
      p(样品|survey) = 0.05
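    If the word alignments are already available, the simplest way to induce such a model is to count aligned pairs and normalize per source word. The sketch below assumes exactly that; the alignment step itself is not shown, and the pair counts are invented so that the result reproduces the example probabilities above.

```python
from collections import Counter, defaultdict

def estimate_translation_probs(aligned_pairs):
    """Relative-frequency estimate of p(f | e) from (e, f) aligned word pairs."""
    counts = defaultdict(Counter)
    for e, f in aligned_pairs:
        counts[e][f] += 1
    return {
        e: {f: c / sum(f_counts.values()) for f, c in f_counts.items()}
        for e, f_counts in counts.items()
    }

# Hypothetical alignment counts for the English word "survey".
pairs = ([("survey", "探测")] * 40 + [("survey", "试探")] * 30
         + [("survey", "测量")] * 25 + [("survey", "样品")] * 5)
model = estimate_translation_probs(pairs)
print(model["survey"])
```

    Normalizing within each source word guarantees that the probabilities for that word sum to 1, as the definition requires.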

  • Using Multiple Translations

    • Weighted Structured Query Translation
      – Takes advantage of multiple translations and translation probabilities

    • TF and DF of query term e are computed using TF and DF of its translations:

      $$TF(e, D_k) = \sum_{f_i} p(f_i \mid e) \times TF(f_i, D_k)$$

      $$DF(e) = \sum_{f_i} p(f_i \mid e) \times DF(f_i)$$
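    Both sums are probability-weighted averages over the translations of the query term. A minimal sketch follows, reusing the translation probabilities for "survey" from the earlier example; the term and document frequencies are invented for illustration.

```python
def weighted_tf(translation_probs, doc_tf):
    """TF(e, D_k): sum over translations f_i of p(f_i | e) * TF(f_i, D_k)."""
    return sum(p * doc_tf.get(f, 0) for f, p in translation_probs.items())

def weighted_df(translation_probs, collection_df):
    """DF(e): sum over translations f_i of p(f_i | e) * DF(f_i)."""
    return sum(p * collection_df.get(f, 0) for f, p in translation_probs.items())

p_survey = {"探测": 0.4, "试探": 0.3, "测量": 0.25, "样品": 0.05}

# Hypothetical counts for one document and for the whole collection.
doc_tf = {"探测": 2, "测量": 4}
collection_df = {"探测": 100, "试探": 50, "测量": 200, "样品": 400}

tf_survey = weighted_tf(p_survey, doc_tf)          # 0.4*2 + 0.25*4
df_survey = weighted_df(p_survey, collection_df)   # 40 + 15 + 50 + 20
print(tf_survey, df_survey)
```

    The resulting TF and DF are generally not integers, but they can be plugged into a ranking function in place of the monolingual counts.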

  • BM-25

    $$\sum_{e \in Q} \left[\log \frac{N - df(e) + 0.5}{df(e) + 0.5}\right] \left[\frac{2.2 \cdot tf(e, d_k)}{0.3 + 0.9 \cdot \frac{dl(d_k)}{avdl} + tf(e, d_k)}\right] \left[\frac{8 \cdot qtf(e)}{7 + qtf(e)}\right]$$

    The formula is shown three times, highlighting in turn:

    • document frequency
    • term frequency
    • document length
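    The scoring function can be written directly from its three factors. The sketch below hard-codes the constants that appear on the slide (2.2, 0.3, 0.9, 8, 7); the function name, argument layout, and the sample numbers are assumptions made for this example. Term and document frequencies may be fractional, as they are when they come from weighted translation.

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, num_docs):
    """Sum the BM25 contribution of each query term e that occurs in the document."""
    score = 0.0
    for e, qtf in query_tf.items():
        tf = doc_tf.get(e, 0.0)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        idf_part = math.log((num_docs - df[e] + 0.5) / (df[e] + 0.5))
        tf_part = (2.2 * tf) / (0.3 + 0.9 * doc_len / avg_doc_len + tf)
        qtf_part = (8 * qtf) / (7 + qtf)
        score += idf_part * tf_part * qtf_part
    return score

# Hypothetical collection statistics and a document of average length.
score = bm25_score(
    query_tf={"survey": 1},
    doc_tf={"survey": 1.8},
    doc_len=100, avg_doc_len=100,
    df={"survey": 125}, num_docs=10000,
)
longer_doc_score = bm25_score(
    query_tf={"survey": 1},
    doc_tf={"survey": 1.8},
    doc_len=300, avg_doc_len=100,
    df={"survey": 125}, num_docs=10000,
)
print(score, longer_doc_score)
```

    With the same term frequency, the longer document scores lower, which is the effect of the document length term in the denominator.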

  • Retrieval Effectiveness

    [Figure: CLEF French. MAP as a fraction of monolingual MAP (y-axis, 40% to 110%) against cumulative probability threshold (x-axis, 0.0 to 1.0), with series for DAMM, IMM, and PSQ.]

    Chart data (MAP: CLIR/Monolingual, rounded to four decimals):

    | Cumulative Probability Threshold | DAMM | IMM | PSQ |
    | 0 | 0.9108 | 0.9178 | 0.9113 |
    | 0.1 | 0.9108 | 0.9178 | 0.9297 |
    | 0.2 | 0.9098 | 0.9170 | 0.9315 |
    | 0.3 | 0.9225 | 0.9178 | 0.9409 |
    | 0.4 | 0.9380 | 0.9193 | 0.9442 |
    | 0.5 | 0.9538 | 0.9321 | 0.9487 |
    | 0.6 | 0.9725 | 0.9396 | 0.9461 |
    | 0.7 | 0.9886 | 0.9557 | 0.9468 |
    | 0.8 | 0.9974 | 0.9686 | 0.9416 |
    | 0.9 | 1.0026 | 0.9743 | 0.9196 |
    | 0.99 | 0.6675 | 0.4038 | 0.9012 |
    | 0.999 | 0.6678 | 0.4038 | 0.8504 |


    4311

    440.36410.4865

    450.56610.534

    460.0620.0744

    470.44590.3667

    480.03480.0312

    490.43310.4438

    500.58110.4608

    510.0670.04

    520.01190.01

    530.24140.2426

    540.66630.5348

    550.54880.5045

    560.71070.7056

    570.58360.549

    580.9420.9234

    590.34030.5687

    600.2170.1909

    610.10230.1087

    620.76990.7297

    630.77920.6236

    650.27690.2726

    660.9070.9169

    670.5070.3032

    680.69670.5909

    690.52850.441

    700.82710.803

    710.04690.054

    720.27630.2515

    730.37550.383

    740.35910.2341

    7511

    760.47020.4323

    770.32730.3046

    780.26440.132

    790.59470.6387

    800.29660.2792

    810.66080.5778

    820.73910.6875

    8311

    8411

    850.30390.2467

    860.34870.2773

    870.46070.4177

    880.93420.8996

    890.71160.6133

    900.16660.1296

    Avg.0.5008940.470018

    [email protected]@0.5DAMM-MONODAMM-PSQ

    1010.4730.24750.11360.22550.3594

    1510.33880.066100.27270.3388

    1700.33330.50.0769-0.16670.2564

    1470.49510.18560.24290.30950.2522

    670.58520.30320.33770.2820.2475

    1830.29330.29330.06100.2323

    780.29670.1320.07220.16470.2245

    1280.30350.26370.08130.03980.2222

    1820.58130.50620.38960.07510.1917

    1350.29510.15970.12130.13540.1738

    41110.833300.1667

    1840.41250.55130.2588-0.13880.1537

    760.36370.43230.2119-0.06860.1518

    580.93310.92340.7880.00970.1451

    1000.47570.32960.33170.14610.144

    740.6870.23410.54320.45290.1438

    860.19720.27730.0666-0.08010.1306

    1400.78290.6690.65560.11390.1273

    1240.42980.30670.30670.12310.1231

    610.69940.10870.5850.59070.1144

    1290.24740.24110.13890.00630.1085

    890.5780.61330.4755-0.03530.1025

    1220.35730.28350.2590.07380.0983

    1530.49060.40430.39520.08630.0954

    1130.10210.08080.00720.02130.0949

    710.13420.0540.04220.08020.092

    650.08750.27260.0039-0.18510.0836

    550.59410.50450.51520.08960.0789

    540.08620.53480.0104-0.44860.0758

    1930.23910.27440.1659-0.03530.0732

    1120.55840.52040.48960.0380.0688

    1300.53850.65190.4724-0.11340.0661

    1630.48280.35330.4180.12950.0648

    990.24850.30790.1845-0.05940.064

    1190.88230.88560.8234-0.00330.0589

    900.15510.12960.10120.02550.0539

    460.26730.07440.22380.19290.0435

    1150.77390.70550.7390.06840.0349

    690.32930.4410.2946-0.11170.0347

    800.27190.27920.2395-0.00730.0324

    620.73570.72970.71080.0060.0249

    1330.15950.19680.1377-0.03730.0218

    920.4250.40090.40570.02410.0193

    1090.09920.05340.08010.04580.0191

    810.27260.57780.2545-0.30520.0181

    1370.03120.18570.0139-0.15450.0173

    440.64860.48650.63310.16210.0155

    970.41010.30730.39580.10280.0143

    1100.73140.64350.71770.08790.0137

    500.45290.46080.4406-0.00790.0123

    1260.81160.79630.79960.01530.012

    630.79690.62360.78520.17330.0117

    960.49670.57660.4852-0.07990.0115

    570.17880.5490.1675-0.37020.0113

    600.22680.19090.21640.03590.0104

    870.39510.41770.3852-0.02260.0099

    420.53150.48550.52180.0460.0097

    1180.12980.04530.12240.08450.0074

    850.23550.24670.2281-0.01120.0074

    820.64820.68750.6409-0.03930.0073

    1140.54840.48460.54180.06380.0066

    1380.08680.31350.0819-0.22670.0049

    1110.01850.15810.0145-0.13960.004

    1020.57270.55210.56880.02060.0039

    1540.04550.1250.0417-0.07950.0038

    980.86510.85410.86340.0110.0017

    1900.0760.13390.0753-0.05790.0007

    4311100

    590.91670.56870.91670.3480

    7511100

    8311100

    8411100

    930.966710.9667-0.03330

    12111100

    12311100

    13411100

    13611100

    14100000

    1420.38460.38460.384600

    14411100

    14900000

    1550.02780.00820.02780.01960

    15600.00160-0.00160

    15800000

    16500000

    16711100

    16800000

    1730.66670.66670.666700

    17400000

    17500000

    1780.55560.44440.55560.11120

    17900000

    18000000

    18100000

    18500000

    1880.33330.33330.333300

    1890.250.250.2500

    19200000

    19600000

    19700000

    19800000

    19900000

    20000000

    510.03820.040.0385-0.0018-0.0003

    470.44180.36670.44230.0751-0.0005

    1040.26940.26830.27180.0011-0.0024

    1200.02120.00940.02390.0118-0.0027

    1430.10140.10950.1051-0.0081-0.0037

    520.00560.010.0099-0.0044-0.0043

    1590.01470.01920.0192-0.0045-0.0045

    1480.02620.06550.0314-0.0393-0.0052

    1950.24390.17220.25270.0717-0.0088

    1620.20470.20380.21540.0009-0.0107

    1080.27260.27930.2848-0.0067-0.0122

    1500.14020.21080.1528-0.0706-0.0126

    1570.15610.51610.1696-0.36-0.0135

    1250.49870.47560.51340.0231-0.0147

    1640.1690.35780.185-0.1888-0.016

    1160.59370.67140.6101-0.0777-0.0164

    1870.15010.18710.1692-0.037-0.0191

    1520.26090.26290.2809-0.002-0.02

    1030.60940.53470.62960.0747-0.0202

    660.87460.91690.8954-0.0423-0.0208

    1770.16140.18010.1832-0.0187-0.0218

    1270.89520.83880.91750.0564-0.0223

    700.77380.8030.7999-0.0292-0.0261

    730.32260.3830.3512-0.0604-0.0286

    950.30840.39030.3395-0.0819-0.0311

    1320.37430.45830.4088-0.084-0.0345

    770.30290.30460.3389-0.0017-0.036

    480.04370.03120.08090.0125-0.0372

    1060.34610.44730.3844-0.1012-0.0383

    720.21950.25150.2581-0.032-0.0386

    490.44080.44380.4807-0.003-0.0399

    680.56170.59090.607-0.0292-0.0453

    910.0870.17370.1425-0.0867-0.0555

    940.31010.58760.3701-0.2775-0.06

    880.83440.89960.8995-0.0652-0.0651

    790.70560.63870.77230.0669-0.0667

    1390.48340.51170.552-0.0283-0.0686

    1760.88050.94890.9618-0.0684-0.0813

    560.58380.70560.6747-0.1218-0.0909

    1710.10890.19530.2025-0.0864-0.0936

    1450.42280.43470.529-0.0119-0.1062

    1070.2230.21750.33130.0055-0.1083

    530.2840.24260.39440.0414-0.1104

    450.39070.5340.5043-0.1433-0.1136

    1860.1240.08910.25930.0349-0.1353

    1170.33020.23490.46570.0953-0.1355

    1310.32650.51830.4722-0.1918-0.1457

    1050.30880.450.75-0.1412-0.4412

    Topic

    AP: DAMM-PSQ

    [email protected]@[email protected]@[email protected]@[email protected]@[email protected]@[email protected]@[email protected]@0.9-MONO

    110.8333110.83331110.833310.833310.83330.833341-0.1667410

    0.48550.49530.52180.48550.53940.52180.53150.53940.54450.52180.53180.52180.53790.53870.5218420.0363420.046

    111111111111111430430

    0.48650.63410.63310.48650.49080.63310.64860.49080.61850.63310.64820.63310.60260.31330.6331440.1466440.1621

    0.5340.46370.50430.5340.45450.50430.39070.45450.39670.50430.40120.50430.39910.48960.504345-0.029745-0.1433

    0.07440.22570.22380.07440.2330.22380.26730.2330.25320.22380.25050.22380.25630.23760.2238460.1494460.1929

    0.36670.43990.44230.36670.45910.44230.44180.45910.44230.44230.44210.44230.44230.44230.4423470.0756470.0751

    0.03120.10890.08090.03120.13340.08090.04370.13340.04670.08090.04410.08090.04750.04990.0809480.0497480.0125

    0.44380.41260.48070.44380.39570.48070.44080.39570.43950.48070.44680.48070.46390.47940.4807490.036949-0.003

    0.46080.46460.44060.46080.45390.44060.45290.45390.45690.44060.45690.44060.45780.44590.440650-0.020250-0.0079

    0.040.0240.03850.040.00670.03850.03820.00670.03570.03850.03650.03850.03840.04340.038551-0.001551-0.0018

    0.010.01010.00990.010.00670.00990.00560.00670.01010.00990.00780.00990.00950.010.009952-0.000152-0.0044

    0.24260.24850.39440.24260.49140.39440.2840.49140.34550.39440.29630.39440.26980.39740.3944530.1518530.0414

    0.53480.04280.01040.53480.06310.01040.08620.06310.17090.01040.17680.01040.08310.0670.010454-0.524454-0.4486

    0.50450.50320.51520.50450.56540.51520.59410.56540.51280.51520.54440.51520.55210.52340.5152550.0107550.0896

    0.70560.67580.67470.70560.61080.67470.58380.61080.67490.67470.65230.67470.67220.67820.674756-0.030956-0.1218

    0.5490.21320.16750.5490.37580.16750.17880.37580.16080.16750.1330.16750.19770.22470.167557-0.381557-0.3702

    0.92340.92090.7880.92340.93410.7880.93310.93410.92240.7880.92290.7880.92340.91880.78858-0.1354580.0097

    0.568710.91670.56870.68060.91670.91670.680610.916710.9167110.9167590.348590.348

    0.19090.18110.21640.19090.12010.21640.22680.12010.21830.21640.22350.21640.22010.21850.2164600.0255600.0359

    0.10870.48190.5850.10870.42570.5850.69940.42570.71840.5850.73280.5850.71030.59850.585610.4763610.5907

    0.72970.71040.71080.72970.76740.71080.73570.76740.77110.71080.7210.71080.77110.72030.710862-0.0189620.006

    0.62360.7990.78520.62360.79370.78520.79690.79370.79750.78520.79750.78520.79750.80020.7852630.1616630.1733

    0.27260.00310.00390.27260.00660.00390.08750.00660.08940.00390.0890.00390.08940.09630.003965-0.268765-0.1851

    0.91690.8930.89540.91690.90910.89540.87460.90910.8930.89540.89470.89540.89550.89690.895466-0.021566-0.0423

    0.30320.26860.33770.30320.51220.33770.58520.51220.48250.33770.61330.33770.50570.49920.3377670.0345670.282

    0.59090.62050.6070.59090.72910.6070.56170.72910.5920.6070.56210.6070.58350.61260.607680.016168-0.0292

    0.4410.34960.29460.4410.41030.29460.32930.41030.29530.29460.28680.29460.2930.29380.294669-0.146469-0.1117

    0.8030.73570.79990.8030.74420.79990.77380.74420.75880.79990.76280.79990.76890.75780.799970-0.003170-0.0292

    0.0540.0360.04220.0540.51490.04220.13420.51490.0430.04220.10270.04220.05840.05310.042271-0.0118710.0802

    0.25150.25150.25810.25150.21270.25810.21950.21270.24140.25810.24480.25810.24830.24880.2581720.006672-0.032

    0.3830.39490.35120.3830.29280.35120.32260.29280.3930.35120.330.35120.3770.37610.351273-0.031873-0.0604

    0.23410.47430.54320.23410.72070.54320.6870.72070.69580.54320.70290.54320.69930.56030.5432740.3091740.4529

    111111111111111750750

    0.43230.42030.21190.43230.43760.21190.36370.43760.420.21190.38740.21190.40960.41220.211976-0.220476-0.0686

    0.30460.30610.33890.30460.20.33890.30290.20.30680.33890.30710.33890.30550.31350.3389770.034377-0.0017

    0.1320.09930.07220.1320.19790.07220.29670.19790.23790.07220.28330.07220.2590.13440.072278-0.0598780.1647

    0.63870.67190.77230.63870.39320.77230.70560.39320.70560.77230.70560.77230.70560.77640.7723790.1336790.0669

    0.27920.29120.23950.27920.24430.23950.27190.24430.27920.23950.28230.23950.29830.25380.239580-0.039780-0.0073

    0.57780.23440.25450.57780.1360.25450.27260.1360.26020.25450.27350.25450.28340.22850.254581-0.323381-0.3052

    0.68750.57290.64090.68750.65920.64090.64820.65920.64350.64090.65690.64090.63640.64460.640982-0.046682-0.0393

    111111111111111830830

    111111111111111840840

    0.24670.22330.22810.24670.21150.22810.23550.21150.240.22810.23840.22810.23940.21970.228185-0.018685-0.0112

    0.27730.07040.06660.27730.23040.06660.19720.23040.07460.06660.09020.06660.2120.20070.066686-0.210786-0.0801

    0.41770.41910.38520.41770.45180.38520.39510.45180.37740.38520.38410.38520.39030.38340.385287-0.032587-0.0226

    0.89960.8450.89950.89960.86240.89950.83440.86240.90030.89950.87880.89950.89020.91660.899588-0.000188-0.0652

    0.61330.48020.47550.61330.52410.47550.5780.52410.58880.47550.58830.47550.57150.58910.475589-0.137889-0.0353

    0.12960.12920.10120.12960.13910.10120.15510.13910.12370.10120.13770.10120.14530.1290.101290-0.0284900.0255

    0.17370.15620.14250.17370.080.14250.0870.080.12220.14250.05920.14250.13830.17680.142591-0.031291-0.0867

    0.40090.40780.40570.40090.41110.40570.4250.41110.41930.40570.43820.40570.4080.36080.4057920.0048920.0241

    10.96670.9667110.96670.966710.96670.96670.96670.96670.96670.96670.966793-0.033393-0.0333

    0.58760.40970.37010.58760.08440.37010.31010.08440.27110.37010.28120.37010.28180.34490.370194-0.217594-0.2775

    0.39030.33710.33950.39030.43760.33950.30840.43760.33850.33950.33230.33950.36320.34130.339595-0.050895-0.0819

    0.57660.32090.48520.57660.2290.48520.49670.2290.35050.48520.36630.48520.52420.50260.485296-0.091496-0.0799

    0.30730.40090.39580.30730.40450.39580.41010.40450.40090.39580.42150.39580.40090.40050.3958970.0885970.1028

    0.85410.850.86340.85410.85040.86340.86510.85040.86110.86340.87330.86340.87350.86060.8634980.0093980.011

    0.30790.15310.18450.30790.23560.18450.24850.23560.28420.18450.27650.18450.32970.18080.184599-0.123499-0.0594

    0.32960.29920.33170.32960.25980.33170.47570.25980.26060.33170.44090.33170.26710.41550.33171000.00211000.1461

    0.24750.12520.11360.24750.4870.11360.4730.4870.13590.11360.42320.11360.46180.3460.1136101-0.13391010.2255

    0.55210.57240.56880.55210.58690.56880.57270.58690.57680.56880.57390.56880.57670.57420.56881020.01671020.0206

    0.53470.61350.62960.53470.4620.62960.60940.4620.61790.62960.62150.62960.61710.62790.62961030.09491030.0747

    0.26830.27420.27180.26830.10090.27180.26940.10090.26960.27180.26940.27180.26970.27080.27181040.00351040.0011

    0.450.36110.750.450.83330.750.30880.83330.31670.750.30880.750.31670.32690.751050.3105-0.1412

    0.44730.38790.38440.44730.350.38440.34610.350.33690.38440.34590.38440.35950.39970.3844106-0.0629106-0.1012

    0.21750.33630.33130.21750.29240.33130.2230.29240.19090.33130.18670.33130.23220.37930.33131070.11381070.0055

    0.27930.27930.28480.27930.27760.28480.27260.27760.27720.28480.28330.28480.27840.28710.28481080.0055108-0.0067

    0.05340.05650.08010.05340.090.08010.09920.090.07350.08010.11660.08010.07310.08620.08011090.02671090.0458

    0.64350.71270.71770.64350.7030.71770.73140.7030.73660.71770.73530.71770.72670.73310.71771100.07421100.0879

    0.15810.02290.01450.15810.09860.01450.01850.09860.01580.01450.01790.01450.01760.01470.0145111-0.1436111-0.1396

    0.52040.5210.48960.52040.59010.48960.55840.59010.55760.48960.54960.48960.56830.50770.4896112-0.03081120.038

    0.08080.00650.00720.08080.02830.00720.10210.02830.00930.00720.010.00720.03060.00910.0072113-0.07361130.0213

    0.48460.47170.54180.48460.58550.54180.54840.58550.52260.54180.52960.54180.53810.54440.54181140.05721140.0638

    0.70550.71980.7390.70550.81560.7390.77390.81560.720.7390.720.7390.720.7390.7391150.03351150.0684

    0.67140.6240.61010.67140.59540.61010.59370.59540.63590.61010.62330.61010.63590.64010.6101116-0.0613116-0.0777

    0.23490.36780.46570.23490.03870.46570.33020.03870.30460.46570.30780.46570.30290.41030.46571170.23081170.0953

    0.04530.09640.12240.04530.14460.12240.12980.14460.09570.12240.12180.12240.12560.12520.12241180.07711180.0845

    0.88560.82960.82340.88560.83080.82340.88230.83080.81880.82340.82070.82340.85810.85250.8234119-0.0622119-0.0033

    0.00940.02390.02390.00940.00550.02390.02120.00550.02390.02390.02390.02390.02390.02390.02391200.01451200.0118

    11111111111111112101210

    0.28350.20550.2590.28350.24920.2590.35730.24920.26090.2590.28120.2590.25380.26250.259122-0.02451220.0738

    11111111111111112301230

    0.30670.30050.30670.30670.29580.30670.42980.29580.3210.30670.37820.30670.5360.47490.306712401240.1231

    0.47560.50830.51340.47560.38110.51340.49870.38110.52750.51340.50730.51340.54540.51360.51341250.03781250.0231

    0.79630.79320.79960.79630.78980.79960.81160.78980.80480.79960.80480.79960.79850.80620.79961260.00331260.0153

    0.83880.89740.91750.83880.91090.91750.89520.91090.91860.91750.92270.91750.92070.91750.91751270.07871270.0564

    0.26370.00260.08130.26370.15760.08130.30350.15760.09930.08130.10510.08130.12670.07730.0813128-0.18241280.0398

    0.24110.25280.13890.24110.27710.13890.24740.27710.25140.13890.24650.13890.25020.25830.1389129-0.10221290.0063

    0.65190.48280.47240.65190.53090.47240.53850.53090.52980.47240.53760.47240.53360.5610.4724130-0.1795130-0.1134

    0.51830.40.47220.51830.48090.47220.32650.48090.4160.47220.36190.47220.4080.45870.4722131-0.0461131-0.1918

    0.45830.40560.40880.45830.4960.40880.37430.4960.40880.40880.41250.40880.40560.52760.4088132-0.0495132-0.084

    0.19680.12570.13770.19680.2120.13770.15950.2120.23940.13770.26670.13770.17860.18830.1377133-0.0591133-0.0373

    11111111111111113401340

    0.15970.14290.12130.15970.03290.12130.29510.03290.18690.12130.30460.12130.16170.12260.1213135-0.03841350.1354

    11111111111111113601360

    0.18570.01620.01390.18570.05930.01390.03120.05930.02480.01390.02210.01390.03240.01850.0139137-0.1718137-0.1545

    0.31350.05710.08190.31350.00120.08190.08680.00120.06410.08190.07130.08190.05040.06390.0819138-0.2316138-0.2267

    0.51170.55590.5520.51170.49490.5520.48340.49490.54960.5520.55250.5520.51320.58170.5521390.0403139-0.0283

    0.6690.78190.65560.6690.74750.65560.78290.74750.80060.65560.79580.65560.78360.80540.6556140-0.01341400.1139

    00000000000000014101410

    0.38460.38460.38460.38460.38460.38460.38460.38460.38460.38460.38460.38460.38460.38460.384614201420

    0.10950.1070.10510.10950.10080.10510.10140.10080.11350.10510.11180.10510.11570.11680.1051143-0.0044143-0.0081

    11111111111111114401440

    0.43470.49350.5290.43470.60750.5290.42280.60750.52190.5290.42810.5290.52380.52190.5291450.0943145-0.0119

    0.18560.2060.24290.18560.45520.24290.49510.45520.28950.24290.41710.24290.31760.24860.24291470.05731470.3095

    0.06550.02990.03140.06550.02410.03140.02620.02410.03150.03140.02690.03140.02730.03390.0314148-0.0341148-0.0393

    00000000000000014901490

    0.21080.21380.15280.21080.17940.15280.14020.17940.14460.15280.1410.15280.13740.1540.1528150-0.058150-0.0706

    0.06610.000600.06610.130500.33880.13050.252100.31500.261700151-0.06611510.2727

    0.26290.27280.28090.26290.31280.28090.26090.31280.2750.28090.25680.28090.27860.28920.28091520.018152-0.002

    0.40430.39520.39520.40430.38310.39520.49060.38310.42290.39520.44360.39520.42990.39490.3952153-0.00911530.0863

    0.1250.04170.04170.1250.050.04170.04550.050.04170.04170.04170.04170.04170.04550.0417154-0.0833154-0.0795

    0.00820.02630.02780.00820.03850.02780.02780.03850.02780.02780.04170.02780.02380.02270.02781550.01961550.0196

    0.0016000.00160.0025000.00250000000156-0.0016156-0.0016

    0.51610.0740.16960.51610.26410.16960.15610.26410.17130.16960.17040.16960.17270.26020.1696157-0.3465157-0.36

    00000000000000015801580

    0.01920.01930.01920.01920.05570.01920.01470.05570.0190.01920.01440.01920.0190.0190.01921590159-0.0045

    0.20380.20970.21540.20380.21050.21540.20470.21050.22270.21540.24030.21540.20230.20340.21541620.01161620.0009

    0.35330.36530.4180.35330.61610.4180.48280.61610.33640.4180.41270.4180.46080.43620.4181630.06471630.1295

    0.35780.18430.1850.35780.2780.1850.1690.2780.1950.1850.19310.1850.19020.18640.185164-0.1728164-0.1888

    00000000000000016501650

    11111111111111116701670

    00000000000000016801680

    0.50.06670.07690.50.03230.07690.33330.03230.08330.07690.10.07690.16670.11110.0769170-0.4231170-0.1667

    0.19530.21020.20250.19530.13860.20250.10890.13860.10940.20250.10970.20250.10940.13390.20251710.0072171-0.0864

    0.66670.66670.66670.66670.66670.66670.66670.66670.66670.66670.66670.66670.66670.66670.666717301730

    00000000000000017401740

    00000000000000017501750

    0.94890.95340.96180.94890.89320.96180.88050.89320.94530.96180.92760.96180.92760.94530.96181760.0129176-0.0684

    0.18010.18130.18320.18010.19170.18320.16140.19170.1820.18320.17370.18320.1810.1820.18321770.0031177-0.0187

    0.44440.50.55560.44440.27780.55560.55560.27780.55560.55560.55560.55560.55560.50.55561780.11121780.1112

    00000000000000017901790

    00000000000000018001800

    00000000000000018101810

    0.50620.40250.38960.50620.52630.38960.58130.52630.59160.38960.58570.38960.56510.53260.3896182-0.11661820.0751

    0.29330.24670.0610.29330.23330.0610.29330.23330.29330.0610.28330.0610.30.29330.061183-0.23231830

    0.55130.32420.25880.55130.30190.25880.41250.30190.42270.25880.43260.25880.39540.37640.2588184-0.2925184-0.1388

    00000000000000018501850

    0.08910.14660.25930.08910.13870.25930.1240.13870.13170.25930.13250.25930.13370.24790.25931860.17021860.0349

    0.18710.1510.16920.18710.1470.16920.15010.1470.15160.16920.15170.16920.15160.15150.1692187-0.0179187-0.037

    0.33330.33330.33330.33330.33330.33330.33330.33330.33330.33330.33330.33330.33330.33330.333318801880

    0.250.250.250.250.250.250.250.250.250.250.250.250.250.250.2518901890

    0.13390.06770.07530.13390.05020.07530.0760.05020.07460.07530.07710.07530.07570.07410.0753190-0.0586190-0.0579

    00000000000000019201920

    0.27440.14650.16590.27440.1410.16590.23910.1410.18140.16590.20340.16590.24840.23540.1659193-0.1085193-0.0353

    0.17220.25050.25270.17220.20960.25270.24390.20960.21660.25270.22420.25270.22270.24180.25271950.08051950.0717

    00000000000000019601960

    00000000000000019701970

    00000000000000019801980

    00000000000000019901990

    00000000000000020002000

    0.38560.36030.36580.38560.37350.36580.38660.37350.37570.36580.38230.36580.38390.37950.3658

    p=0.007p=0.075p=0.050p=0.176p=0.162p=0.017p=0.184p=0.104p=0.0180.003

    [email protected]@[email protected]@[email protected]@[email protected]@[email protected]@[email protected]@0.5

    [email protected]

    tmp

  • Bilingual Query Expansion

    source language query → Source Language IR (over the source language collection) → expanded source language query → Query Translation → Target Language IR (over the target language collection) → expanded target language terms → results

    Pre-translation expansion happens before Query Translation; post-translation expansion happens after it.
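The two expansion points can be sketched in a few lines of Python. This is a toy illustration, not the method used in the experiments cited on these slides: the term-count scorer, the blind-feedback rule in `expand`, and the names `SRC_COLLECTION`, `TGT_COLLECTION`, and `LEXICON` are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical toy data: English source collection, Dutch target collection.
SRC_COLLECTION = {
    "e1": ["deluge", "flood", "river", "flood"],
    "e2": ["election", "vote"],
}
TGT_COLLECTION = {
    "n1": ["overstroming", "rivier", "dijk"],
    "n2": ["verkiezing", "stem"],
}
# Deliberately incomplete bilingual lexicon: "deluge" has no entry.
LEXICON = {"flood": ["overstroming"], "river": ["rivier"]}


def search(query_terms, collection):
    """Rank documents by total query-term occurrences (toy scorer)."""
    scored = []
    for doc_id, tokens in collection.items():
        counts = Counter(tokens)
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def expand(query_terms, collection, top_docs=1, new_terms=2):
    """Blind feedback: add the most frequent new terms from the top documents."""
    ranked = search(query_terms, collection)[:top_docs]
    counts = Counter(
        t for d in ranked for t in collection[d] if t not in query_terms
    )
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:new_terms]
    return list(query_terms) + [t for t, _ in best]


def translate(query_terms, lexicon):
    """Use every known translation; keep untranslatable terms unchanged."""
    translated = []
    for term in query_terms:
        translated.extend(lexicon.get(term, [term]))
    return translated


def clir(query_terms, pre=True, post=True):
    """Query translation with optional pre- and post-translation expansion."""
    terms = expand(query_terms, SRC_COLLECTION) if pre else list(query_terms)
    terms = translate(terms, LEXICON)
    if post:
        terms = expand(terms, TGT_COLLECTION)
    return search(terms, TGT_COLLECTION)
```

In this toy setup the query `["deluge"]` finds nothing without expansion because the lexicon cannot translate it, while pre-translation expansion adds "flood" and "river", both of which are translatable.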

  • Query Expansion Effect

    Paul McNamee and James Mayfield, SIGIR-2002

    Mean Average Precision by number of Unique Dutch Terms, for four expansion conditions:

    Unique Dutch Terms   Both     Post     Pre      None
    15,591               0.3171   0.1514   0.2718   0.1153
    14,032               0.3167   0.1427   0.2657   0.1005
    12,473               0.3223   0.1056   0.2618   0.0966
    10,932               0.3127   0.0937   0.2497   0.0931
    9,355                0.2627   0.0904   0.2269   0.0841
    7,795                0.2575   0.1059   0.2195   0.0794
    6,236                0.2331   0.1082   0.2104   0.0809
    4,667                0.2245   0.1003   0.2019   0.0783
    3,118                0.2196   0.0906   0.1931   0.0766
    1,559                0.2189   0.0705   0.1951   0.0696
    0                    0.2113   0.0697   0.1832   0.0623

  • Cognate Matching

    • Dictionary coverage is inherently limited
        – Translation of proper names
        – Translation of newly coined terms
        – Translation of unfamiliar technical terms

    • Strategy: model derivational translation
        – Orthography-based
        – Pronunciation-based

  • Matching Orthographic Cognates

    • Retain untranslatable words unchanged
        – Often works well between European languages

    • Rule-based systems
        – Even off-the-shelf spelling correction can help!

    • Subword (e.g., character-level) MT
        – Trained using a set of representative cognates
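As a rough illustration of the first two bullets, the sketch below keeps an untranslatable term unchanged when it already occurs in the target-language vocabulary, and otherwise falls back on approximate string matching. Python's standard `difflib.get_close_matches` stands in here for an off-the-shelf spelling corrector; the function name `cognate_candidates`, the cutoff of 0.75, and the tiny Dutch vocabulary are assumptions for the example only.

```python
import difflib

# Hypothetical target-language (Dutch) index vocabulary.
TARGET_VOCABULARY = {"parlement", "regering", "verkiezing", "internet"}


def cognate_candidates(term, vocabulary, cutoff=0.75, limit=3):
    """Return target-language terms that may be cognates of `term`.

    An exact match means the word can be retained unchanged. Otherwise
    close spellings are proposed, best match first.
    """
    if term in vocabulary:
        return [term]
    return difflib.get_close_matches(
        term, sorted(vocabulary), n=limit, cutoff=cutoff
    )


if __name__ == "__main__":
    for word in ["internet", "parliament", "zzz"]:
        print(word, "->", cognate_candidates(word, TARGET_VOCABULARY))
```

The cutoff trades recall for precision: a lower value proposes more candidates but admits more false cognates, so in practice it would need tuning per language pair.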

  • Matching Phonetic Cognates

    • Forward transliteration
        – Generate all potential transliterations

    • Reverse transliteration
        – Guess source string(s) that produced a transliteration

    • Match in phonetic space
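The third option, matching in phonetic space, can be sketched by mapping each spelling to a coarse sound key and comparing keys instead of strings. The key below is a crude Soundex-like code written for this example (it is not standard Soundex, and it only suits Latin-script transliterations); `phonetic_key`, `phonetic_matches`, and the consonant classes are all assumptions.

```python
# Consonants grouped into rough sound classes; vowels and h, w, y are dropped.
_SOUND_CLASSES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def phonetic_key(word):
    """Map a word to a string of sound-class digits.

    Letters outside the table contribute nothing, and a letter whose
    class equals that of the immediately preceding letter is skipped.
    """
    key = []
    previous = None
    for ch in word.lower():
        code = _SOUND_CLASSES.get(ch)
        if code is not None and code != previous:
            key.append(code)
        previous = code
    return "".join(key)


def phonetic_matches(name, index_terms):
    """Return the index terms whose phonetic key equals that of `name`."""
    target = phonetic_key(name)
    return [term for term in index_terms if phonetic_key(term) == target]


if __name__ == "__main__":
    print(phonetic_matches("Gaddafi", ["Qadhafi", "Kadafi", "Kennedy"]))
```

Variant romanizations of the same name collapse to one key, so neither forward nor reverse transliteration has to be enumerated explicitly; the cost is that unrelated words can collide on the same key.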

  • Cross-Language “Retrieval”

    Query → Query Translation → Translated Query → Search → Ranked List

  • Uses of “MT” in CLIR

    Search process: Query Formulation → Query → Query Translation → Translated Query → Search → Ranked List → Selection → Document → Examination → Document → Use, with Query Reformulation as a feedback step.

    Translation roles labeled on the diagram: Term Translation, Term Matching, Snippet Translation, Indicative Translation, Informative Translation

  • Interactive Cross-Language Question Answering

    [Bar chart, iCLEF 2004: Users with Correct Answers (0 to 8) by Question Number, with questions ordered 8, 11, 13, 4, 16, 6, 14, 7, 2, 10, 15, 12, 1, 3, 9, 5]

  • Questions, Grouped by Difficulty

    8 Who is the managing director of the International Monetary Fund?
    11 Who is the president of Burundi?
    13 Of what team is Bobby Robson coach?
    4 Who committed the terrorist attack in the Tokyo underground?
    16 Who won the Nobel Prize for Literature in 1994?
    6 When did Latvia gain independence?
    14 When did the attack at the Saint-Michel underground station in Paris occur?
    7 How many people were declared missing in the Philippines after the typhoon “Angela”?
    2 How many human genes are there?
    10 How many people died of asphyxia in the Baku underground?
    15 How many people live in Bombay?
    12 What is Charles Millon's political party?
    1 What year was Thomas Mann awarded the Nobel Prize?
    3 Who is the German Minister for Economic Affairs?
    9 When did Lenin die?
    5 How much did the Channel Tunnel cost?

  • For Further Reading

    • Multilingual IR
        – Paul McNamee et al., Addressing Morphological Variation in Alphabetic Languages, SIGIR, 2009

    • African-Language IR
        – Open CLIR Challenge (Swahili), IARPA, 2018
        – Nkosana Malumba et al., AfriWeb: A Search Engine for a Marginalized Language, ICADL, 2015

    • Cross-Language IR
        – Jian-Yun Nie, Cross-Language Information Retrieval, Synthesis Lectures in HLT, Morgan & Claypool, 2010
        – Jianqiang Wang and Douglas W. Oard, Matching Meaning for Cross-Language Information Retrieval, Information Processing and Management, 2012
