Using Alignment for Multilingual Text Compression
Ehud S. Conley and Shmuel T. Klein
Department of Computer Science
Bar Ilan University
Presented by Jason Lustig, 11.29.06
Multilingual and Aligned
Two texts, S and T
T is a translation of S and the two are aligned
Example:
English - The dog jumped over the cat
Hebrew - הכלב קפץ מעליו החתול
We can use one text to compress the other
Sliding Window Drawbacks
Encoding for pointers can get very large
Our window has a limited size
You also have to decode the file from the beginning
Aligning solves some...
Because words and phrases only need to be matched within a very short range (the same block or paragraph), the pointer encodings are very small
You can decode a particular document without decoding the whole corpus
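To make the pointer-size point concrete, here is a rough sketch of how many bits a fixed-width offset needs. The window size (32 KiB) and the local alignment range (16 positions) are illustrative assumptions, not figures from the talk, and `bits_for_offset` is a hypothetical helper.

```python
def bits_for_offset(num_positions: int) -> int:
    """Bits needed for a fixed-width code that distinguishes num_positions offsets."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, without floating point.
    return max(1, (num_positions - 1).bit_length())

# Assumed sizes, chosen only for illustration.
SLIDING_WINDOW = 32 * 1024  # a pointer may reach anywhere in a 32 KiB window
ALIGNED_RANGE = 16          # an aligned word is expected within 16 nearby positions

window_bits = bits_for_offset(SLIDING_WINDOW)
aligned_bits = bits_for_offset(ALIGNED_RANGE)
print(f"sliding-window offset: {window_bits} bits")
print(f"aligned offset:        {aligned_bits} bits")
```

Real coders typically use variable-length codes rather than fixed-width fields, so this only shows the order of magnitude of the saving.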
Similarities with VQ
Multilingual text compression is very similar to Vector Quantization
VQ uses a corpus of images to create vectors which approximate different colors in an image
Multilingual text compression uses a bilingual dictionary and the text alignment to create “vectors” to code where sections of one text can be automatically translated from the other
How to do it, simply
Take a word- and phrase-alignment of the two texts
Find the longest connected blocks between the two texts
Compress the translated text using these blocks instead of compressing the whole text
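The three steps above can be sketched as a greedy segmentation of the translated text. This is a minimal illustration under assumptions, not the authors' implementation: the alignment is taken to be a word-level map from target position to source position, a "block" is a run of consecutive target words whose aligned source words are also consecutive, and the French sentence and the name `segment_into_blocks` are hypothetical.

```python
from typing import Dict, List

def segment_into_blocks(alignment: Dict[int, int], target_len: int) -> List[tuple]:
    """Split target positions 0..target_len-1 into aligned blocks and literals.

    Returns ("block", target_start, source_start, length) for each aligned run
    and ("literal", target_pos) for each word with no aligned source word.
    """
    ops = []
    current = 0
    while current < target_len:
        if current not in alignment:
            # No connection to the source text: this word must be coded on its own.
            ops.append(("literal", current))
            current += 1
            continue
        # Extend the block while target and source positions advance together.
        length = 1
        while alignment.get(current + length) == alignment[current] + length:
            length += 1
        ops.append(("block", current, alignment[current], length))
        current += length
    return ops

# Hypothetical example pair (0-based positions).
source = "the dog jumped over the cat".split()
target = ["le", "chien", "a", "sauté", "par-dessus", "le", "chat"]
alignment = {0: 0, 1: 1, 3: 2, 4: 3, 5: 4, 6: 5}  # "a" has no aligned source word

ops = segment_into_blocks(alignment, len(target))
for op in ops:
    print(op)
```

Each block can then be coded as one short pointer into the source text, and only the literals need to be compressed in the usual way.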
How to do it, more complex
current = 1
Lt = length of translated text
while current < Lt
    find the longest block
    if it is not found, output a pointer to the translation of the word at position current
end while

“The longest block”
found = false
while m < Lt - current (max length of block at this pt. in the text)
    if there exists a connection btwn. translation and original:
        compute/output items for encoding of pointer, including:
        * difference in position of word and its aligned translation
        * stemmed versions of all words in the string
        * a translation of the string
        * variants of the words in the string
        break!
end while
The role of encoding
There is a lot of information in each pointer
This is because language is finicky: a word does not always translate the same way
For example, we aren’t actually including all the variants of the French “mineral” (mineral, minerals, minerale, minerales) but really just pointers to a dictionary of these variants
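A small sketch of what "pointers to a dictionary of these variants" could look like. The variant list is copied from the slide; everything else (the `VARIANTS` layout, the `WordPointer` fields, the helper names, and the example positions) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Toy variant dictionary: source-language stem -> target-language surface forms.
VARIANTS: Dict[str, List[str]] = {
    "mineral": ["mineral", "minerals", "minerale", "minerales"],
}

@dataclass(frozen=True)
class WordPointer:
    offset: int      # aligned source position minus target position
    variant_id: int  # index into VARIANTS[stem of the aligned source word]

def encode_word(target_word: str, target_pos: int,
                source_stems: List[str], source_pos: int) -> WordPointer:
    """Replace a target word by its position offset and a variant index."""
    stem = source_stems[source_pos]
    return WordPointer(source_pos - target_pos, VARIANTS[stem].index(target_word))

def decode_word(ptr: WordPointer, target_pos: int, source_stems: List[str]) -> str:
    """Recover the target word from the source stems and the shared dictionary."""
    stem = source_stems[target_pos + ptr.offset]
    return VARIANTS[stem][ptr.variant_id]

# Hypothetical positions: the target word at position 2 aligns with source position 1.
source_stems = ["natural", "mineral", "water"]
ptr = encode_word("minerale", 2, source_stems, 1)
decoded = decode_word(ptr, 2, source_stems)
print(ptr, "->", decoded)
```

The pointer stores two small integers instead of the word itself, which only pays off if encoder and decoder already share the dictionary.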
Dictionaries
This means that the structure of our dictionaries is extremely important
Large dictionaries would reduce compression - or would they?
The algorithm is not meant for simply compressing files on your computer, but for a multilingual information retrieval system, where these dictionaries already exist
Dictionary-specific compression can be used to reduce size
How does it match up?
Compressed EU JOC corpus, collection of questions and answers on various topics
English-to-French
Trans (us!) seems to win out overall
Full size   Gzip    Bzip    Hword   Trans
7551550     0.307   0.214   0.225   0.212
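To put the ratios in absolute terms, the sketch below converts them to approximate compressed sizes. It assumes the full size is in bytes and that each figure is compressed size divided by full size; the slide states neither, so treat the numbers as rough.

```python
FULL_SIZE = 7_551_550  # from the table; assumed to be bytes
RATIOS = {"Gzip": 0.307, "Bzip": 0.214, "Hword": 0.225, "Trans": 0.212}

def compressed_size(ratio: float, full_size: int = FULL_SIZE) -> int:
    """Approximate compressed size, assuming ratio = compressed / full size."""
    return round(full_size * ratio)

sizes = {name: compressed_size(ratio) for name, ratio in RATIOS.items()}
best = min(RATIOS, key=RATIOS.get)

for name, size in sizes.items():
    print(f"{name:6s}{size:>12,d}")
print("smallest:", best)
print("margin over Bzip:", sizes["Bzip"] - sizes["Trans"])
```

Under these assumptions the gap between Trans and Bzip is about 15,000 units on a 7.5-million-unit corpus, so "wins out" here means a narrow margin.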
Caveats
English and French are very similar
What would happen when we use a different pair such as English and Hebrew?
English and French have had similar historical experiences (WWII, etc.), so they have similar vocabularies of ideas. What about languages from very different cultures?
If translation is more difficult, compression would be too
?