![Page 1: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/1.jpg)
Using Alignment for Multilingual-Text Compression
Ehud S. Conley and Shmuel T. Klein
![Page 2: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/2.jpg)
Outline
• Multilingual text
• Problem definition
• Multilingual-text alignment
• Compression of multilingual texts using alignment
– Algorithm
– Results
• Future work
![Page 3: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/3.jpg)
Multilingual text
• Same contents in two or more (natural) languages
– Legislative texts of the European Union in all EU languages
Subject: Supplies of military equipment to Iraq
Objet: Livraisons de matériel militaire à l’Irak
![Page 4: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/4.jpg)
Problem definition
• How can multilingual texts be compressed more efficiently relative to compression of each language separately?
– Can semantic equivalence be exploited to reduce aggregate corpus size?
![Page 5: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/5.jpg)
Multilingual-text alignment (1)
• Mapping of equivalent text fragments to each other
– Paragraph/sentence and word/phrase levels
– Algorithms for both levels
  • Tokenization, lemmatization, shallow parsing
– Alignment possibly partial
![Page 6: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/6.jpg)
Multilingual-text alignment (2)
Subject : Supplies of military equipment to Iraq
Objet : Livraisons de matériel militaire à l’ Irak
![Page 7: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/7.jpg)
Linear alignment
• Given two parallel fragments S and T, the linear alignment of a token $t_j$ in T is the token $s_i$ in S such that:
$$i = \left\lfloor j \cdot \frac{|S|}{|T|} + 0.5 \right\rfloor$$
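A minimal Python sketch of this rule, assuming 1-based token positions and that the rule rounds j·|S|/|T| to the nearest position (floor after adding 0.5). The function name `linear_alignment` is hypothetical; integer arithmetic is used only to avoid floating-point rounding.

```python
def linear_alignment(j: int, len_s: int, len_t: int) -> int:
    """Position i in S that is linearly aligned with position j in T (1-based).

    Computes floor(j * |S| / |T| + 0.5) exactly, using
    floor(a / b + 1/2) == (2a + b) // (2b) for positive integers.
    """
    return (2 * j * len_s + len_t) // (2 * len_t)


# The slides' example has |S| = 8 and |T| = 9.
positions = [linear_alignment(j, 8, 9) for j in range(1, 10)]
print(positions)
```

When S is much shorter than T the rule can yield position 0 for small j; the slides do not say how that case is handled.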
![Page 8: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/8.jpg)
Correct vs. linear alignment
$|S| = 8,\ |T| = 9$
$$i = \left\lfloor j \cdot \frac{8}{9} + 0.5 \right\rfloor$$
1 2 3 4 5 6 7 8 9
Subject : Supplies of military equipment to Iraq
Objet : Livraisons de matériel militaire à l’ Irak
(Figure legend: correct alignment, linear alignment)
![Page 9: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/9.jpg)
Offset from linear alignment
• Signed distance between correct and linear alignments
– Usually very small values (mostly [-10, 10])
offset = 2
1 2 3 4 5 6 7 8 9
Subject : Supplies of military equipment to Iraq
Objet : Livraisons de matériel militaire à l’ Irak
(Figure legend: correct alignment, linear alignment)
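The offsets for this example can be recomputed with a short Python sketch. It assumes 1-based positions, the rounding rule floor(j·|S|/|T| + 0.5), and a `correct` alignment table read off the slides' own encoding example; the names `linear_alignment`, `correct` and `offsets` are hypothetical.

```python
def linear_alignment(j, len_s, len_t):
    # floor(j * |S| / |T| + 0.5), computed with integers only
    return (2 * j * len_s + len_t) // (2 * len_t)


S = "Subject : Supplies of military equipment to Iraq".split()
T = "Objet : Livraisons de matériel militaire à l’ Irak".split()

# Assumed correct alignment: position in T -> position in S (1-based).
# Unaligned tokens (":", "de", "à", "l’") are simply absent.
correct = {1: 1, 3: 3, 5: 6, 6: 5, 9: 8}

# Offset = signed distance between the correct and the linear alignment.
offsets = {j: i - linear_alignment(j, len(S), len(T)) for j, i in correct.items()}
print(offsets)
```

Only "matériel" (position 5 in T) gets a nonzero offset: its linear alignment is position 4 ("of"), while its translation "equipment" sits at position 6.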
![Page 10: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/10.jpg)
Compression of multilingual texts using alignment: Basic idea (1)
• Compress by replacing words/phrases with pointers to their translations within the other text
– Original text restored using bilingual dictionary
• Store offsets relative to linear alignment
– Small values → small number of values → efficient encoding
![Page 11: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/11.jpg)
Compression of multilingual texts using alignment: Basic idea (2)
• Store number of words in pointed fragment
– Might be a multi-word phrase
– bilan → balance sheet
• Single pointer may replace multi-word phrase
– matériel militaire → pointer to military equipment
– chemin de fer → railway
![Page 12: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/12.jpg)
Basic scheme: Example (option 1)
• Prefixes: 0 - word, 1 - pointer
• 1(offset, length)
offset = 2
1 2 3 4 5 6 7 8 9
Subject : Supplies of military equipment to Iraq
Objet : Livraisons de matériel militaire à l’ Irak
(Figure legend: correct alignment, linear alignment)
1(0, 1) 0(:) 1(0, 1) 0(de) 1(2, 1) 1(0, 1) 0(à) 0(l’) 1(0, 1)
(Pointers stand for, in order: Objet, Livraisons, matériel, militaire, Irak)
![Page 13: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/13.jpg)
Basic scheme: Example (option 2)
• matériel militaire → pointer to military equipment
• Offset relative to first words
offset = 1
1 2 3 4 5 6 7 8 9
Subject : Supplies of military equipment to Iraq
Objet : Livraisons de matériel militaire à l’ Irak
(Figure legend: correct alignment, linear alignment)
1(0, 1) 0(:) 1(0, 1) 0(de) 1(1, 2) 0(à) 0(l’) 1(0, 1)
(Pointers stand for, in order: Objet, Livraisons, matériel militaire, Irak)
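Both options can be checked with a small decoder sketch in Python. It is an illustration under assumptions, not the authors' implementation: the decoder is assumed to know |T| in advance, the toy `DICTIONARY` holds exactly one translation per entry, and all names (`decode`, `OPTION_1`, `OPTION_2`) are hypothetical.

```python
def linear_alignment(j, len_s, len_t):
    # floor(j * |S| / |T| + 0.5), computed with integers only
    return (2 * j * len_s + len_t) // (2 * len_t)


def decode(stream, source, len_t, dictionary):
    """Rebuild T from a stream of (0, word) and (1, offset, length) items."""
    target = []
    for item in stream:
        if item[0] == 0:                      # prefix 0: literal word
            target.append(item[1])
        else:                                 # prefix 1: pointer into S
            _, offset, length = item
            j = len(target) + 1               # 1-based position of next T token
            i = linear_alignment(j, len(source), len_t) + offset
            phrase = " ".join(source[i - 1:i - 1 + length])
            target.extend(dictionary[phrase].split())
    return target


S = "Subject : Supplies of military equipment to Iraq".split()
T = "Objet : Livraisons de matériel militaire à l’ Irak".split()

# Toy bilingual dictionary (assumption: one translation per entry).
DICTIONARY = {
    "Subject": "Objet", "Supplies": "Livraisons", "Iraq": "Irak",
    "military": "militaire", "equipment": "matériel",
    "military equipment": "matériel militaire",
}

OPTION_1 = [(1, 0, 1), (0, ":"), (1, 0, 1), (0, "de"), (1, 2, 1),
            (1, 0, 1), (0, "à"), (0, "l’"), (1, 0, 1)]
OPTION_2 = [(1, 0, 1), (0, ":"), (1, 0, 1), (0, "de"), (1, 1, 2),
            (0, "à"), (0, "l’"), (1, 0, 1)]

print(decode(OPTION_1, S, len(T), DICTIONARY))
print(decode(OPTION_2, S, len(T), DICTIONARY))
```

In option 2 the single pointer 1(1, 2) yields two target words, so the position counter in T advances by two before the next item is decoded.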
![Page 14: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/14.jpg)
Complication: Words with multiple possible translations
• Sometimes more than one possible translation per word
– equipment
1. équipement
2. matériel
• Must encode correct translation within pointer
– Store index of translation
![Page 15: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/15.jpg)
Complication: Morphological variants (1)
• Bilingual dictionary must use one morphological form (lemma)
– go → aller stands for: {go, went, gone, going} → {aller, vais, vas, va, etc.}
![Page 16: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/16.jpg)
Complication: Morphological variants (2)
• Texts include inflected forms
– More than one possible lemma (bound → {bind, bound}) → must indicate correct lemmas for S to enable dictionary lookup
– Several variants per lemma → must indicate correct inflections of translation words to enable restoration of T
![Page 17: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/17.jpg)
Complication: Morphological variants (3)
LEMMA DICTIONARY
lower: 0. low (adj.) 1. lower (verb)
bound: 0. bound 1. bind
BILINGUAL DICTIONARY
low: 0. bas 1. déprimé 2. grave 3. ignoble 4. inférieur 5. …
bound: 0. bondir 1. limite 2. borne 3. bond 4. …
VARIANT DICTIONARY
borne: 0. borne (sing.) 1. bornes (pl.)
inférieur: 0. inférieur (masc.) 1. inférieure (fem.) 2. inférieurs 3. inférieures
lower bound
borne inférieure
1(1,1,0,2,0) 1(-1,1,0,4,1)
(Pointers stand for, in order: borne, inférieure)
• 1(offset, length, lemma(s), translation, variant(s))
• Multiple values for multiple words
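The lookup chain behind the two pointers on this slide can be traced with a Python sketch. It handles single-word pointers only (length = 1), assumes 1-based positions and the rounding rule floor(j·|S|/|T| + 0.5), and uses only the dictionary entries listed on the slide (the elided "…" entries are left out). The names `resolve`, `LEMMAS`, `TRANSLATIONS` and `VARIANTS` are hypothetical.

```python
def linear_alignment(j, len_s, len_t):
    # floor(j * |S| / |T| + 0.5), computed with integers only
    return (2 * j * len_s + len_t) // (2 * len_t)


LEMMAS = {"lower": ["low", "lower"], "bound": ["bound", "bind"]}
TRANSLATIONS = {
    "low": ["bas", "déprimé", "grave", "ignoble", "inférieur"],
    "bound": ["bondir", "limite", "borne", "bond"],
}
VARIANTS = {
    "borne": ["borne", "bornes"],
    "inférieur": ["inférieur", "inférieure", "inférieurs", "inférieures"],
}


def resolve(pointer, j, source, len_t):
    """Restore the T word at position j from a single-word pointer."""
    offset, length, lemma_idx, translation_idx, variant_idx = pointer
    assert length == 1, "sketch covers single-word pointers only"
    i = linear_alignment(j, len(source), len_t) + offset
    word = source[i - 1]                               # pointed word in S
    lemma = LEMMAS[word][lemma_idx]                    # lemma dictionary
    translation = TRANSLATIONS[lemma][translation_idx] # bilingual dictionary
    return VARIANTS[translation][variant_idx]          # variant dictionary


S = ["lower", "bound"]
restored = [
    resolve((1, 1, 0, 2, 0), 1, S, 2),
    resolve((-1, 1, 0, 4, 1), 2, S, 2),
]
print(restored)
```

The first pointer walks bound → lemma "bound" → translation "borne" → variant "borne"; the second walks lower → lemma "low" → translation "inférieur" → variant "inférieure".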
![Page 18: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/18.jpg)
Optimizations
• No encoding for single option
– Relevant for all 3 dictionaries
• Sort options by descending order of frequencies
– Large number of small values → better encoding
• Encode length as (length – 1)
– length never 0
![Page 19: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/19.jpg)
Binary encoding (1)
• Use 3 Huffman codes
– H1: words + pointer prefix
– H2: absolute values of offsets
  • sign bit follows, except for 0
– H3: lengths + indices
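A sketch of how H2 and the sign bit might fit together, in Python. The Huffman construction is the textbook one; the sign convention (0 for positive, 1 for negative), the sample `offsets` list, and the names `huffman_code` and `encode_offset` are assumptions made for illustration.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(freqs):
    """Return {symbol: bitstring} for a frequency table with >= 2 symbols."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, code0 = heapq.heappop(heap)   # two least frequent subtrees
        f1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in code0.items()}
        merged.update({s: "1" + b for s, b in code1.items()})
        heapq.heappush(heap, (f0 + f1, next(tiebreak), merged))
    return heap[0][2]


def encode_offset(offset, h2):
    """H2(|offset|), followed by a sign bit unless the offset is 0."""
    bits = h2[abs(offset)]
    if offset != 0:
        bits += "1" if offset < 0 else "0"   # assumed sign convention
    return bits


offsets = [0, 0, 2, 0, 0, 1, -1, 0]          # made-up sample, mostly zeros
h2 = huffman_code(Counter(abs(o) for o in offsets))
encoded = "".join(encode_offset(o, h2) for o in offsets)
print(h2, encoded)
```

Coding absolute values keeps +k and -k under one codeword, so the frequent small magnitudes share short codes and the sign costs one extra bit only when it carries information.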
![Page 20: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/20.jpg)
Binary encoding (2)
• Words:
H1(lemma) [H3(variant)]
• Pointers (l = length, m = # of words in translation):
H1(ptr_prefix) H2(offset) [sign_bit] H3(l – 1) [H3(lemma_0)] … [H3(lemma_{l–1})] [H3(translation)] [H3(variant_0)] … [H3(variant_{m–1})]
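The pointer layout, with its optional fields, can be spelled out symbolically in Python. This sketch only lists which codewords would be emitted and in what order; it does no bit-level coding. Using `None` to mark "only one option, nothing encoded" and the name `pointer_fields` are assumptions for illustration.

```python
def pointer_fields(offset, lemma_idxs, translation_idx, variant_idxs):
    """List the codewords emitted for one pointer, in order.

    lemma_idxs:   one entry per pointed source word (l entries)
    variant_idxs: one entry per word of the translation (m entries)
    An index of None means the dictionary offers a single option,
    so no codeword is emitted for it.
    """
    fields = ["H1(ptr_prefix)", f"H2({abs(offset)})"]
    if offset != 0:                                   # sign bit only for nonzero
        fields.append("sign(-)" if offset < 0 else "sign(+)")
    fields.append(f"H3({len(lemma_idxs) - 1})")       # length stored as l - 1
    fields += [f"H3({k})" for k in lemma_idxs if k is not None]
    if translation_idx is not None:
        fields.append(f"H3({translation_idx})")
    fields += [f"H3({k})" for k in variant_idxs if k is not None]
    return fields


# Pointer 1(1,1,0,2,0) from the "lower bound" example.
full = pointer_fields(1, [0], 2, [0])
# A pointer whose lemma, translation and variant are all unambiguous.
minimal = pointer_fields(0, [None], None, [None])
print(full)
print(minimal)
```

In the best case a pointer costs just three codewords: the prefix, a zero offset, and a zero for length – 1.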
![Page 21: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/21.jpg)
Empirical results
• English-French responsa collection of the European Parliament (ARCADE project)
• Sizes do not include the codes for HWORD and TRANS, nor the dictionaries for TRANS
– Dictionaries exist anyway in large IR systems
– Heaps' law: dictionary size is $\alpha N^{\beta}$ for a text of size N, where $0.4 \le \beta \le 0.6$
  • For large corpora, size negligible
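A short worked step behind the "size negligible" claim, assuming Heaps' law holds with $\beta < 1$. Here $D(N)$ is a label introduced for the dictionary size and does not appear on the slides:

$$\frac{D(N)}{N} = \frac{\alpha N^{\beta}}{N} = \alpha N^{\beta - 1} \longrightarrow 0 \quad (N \to \infty,\ \beta < 1)$$

For instance, with $\beta = 0.5$, a corpus 100 times larger needs a dictionary only 10 times larger.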
![Page 22: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/22.jpg)
Empirical results (2)
![Page 23: Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein](https://reader035.vdocument.in/reader035/viewer/2022062423/56649e245503460f94b11ba6/html5/thumbnails/23.jpg)
Future work
• Other test corpora
– Other languages
• Compress target using lemmatized source
• Improve encoding
• Bidirectional scheme
• Pattern matching within compressed text
• Improved model for k languages