using alignment for multilingual text compressiondilant/cs175/talks_2/[j.lustig_t2].pdf ·...

Using Alignment for Multilingual Text Compression

Ehud S. Conley and Shmuel T. Klein

Department of Computer Science, Bar Ilan University

Presented by Jason Lustig, 11.29.06

Multilingual and Aligned

Two texts, S and T

T is a translation of S and the two are aligned

Example:

English - The dog jumped over the cat

Hebrew - הכלב קפץ מעליו החתול

We can use one text to compress the other

Sliding Window Drawbacks

Encoding for pointers can get very large

Our window has a limited size

You also have to decode the file from the beginning

Aligning solves some...

Because you need to match up words and phrases within a very short range - in the same block or paragraph - the encodings are very small

You can decode a particular document without doing the whole corpus
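A rough illustration of why short-range pointers are cheap. This sketch assumes a plain fixed-width binary offset, which is a simplification chosen here and not the encoding from the paper; the window and paragraph sizes are hypothetical numbers.

```python
def offset_bits(max_distance: int) -> int:
    """Bits needed for a fixed-width offset in the range 0..max_distance."""
    return max(1, max_distance.bit_length())

# Hypothetical ranges: a byte-level sliding window vs. a word offset
# that never leaves one aligned paragraph.
window_range = 32767
paragraph_range = 63

print("sliding window offset bits:", offset_bits(window_range))
print("aligned paragraph offset bits:", offset_bits(paragraph_range))
```

The smaller the range a pointer has to cover, the fewer bits each pointer costs, which is the point of restricting matches to the same block or paragraph.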

Similarities with VQ

Multilingual text compression is very similar to Vector Quantization

VQ uses a corpus of images to create vectors which approximate different colors in an image

Multilingual text compression uses a bilingual dictionary and the text alignment to create “vectors” to code where sections of one text can be automatically translated from the other

How to do it, simply

Take a word- and phrase-alignment of the two texts

Find the longest connected blocks between the two texts

Compress the translated text using these blocks instead of compressing the whole text
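The steps above can be sketched as follows. The sketch assumes the alignment is given as a mapping from word positions in the translated text to word positions in the original, and it treats a "connected block" as a run where consecutive translated words map to consecutive original words. Both are assumptions made for illustration; the function name and the sample alignment are hypothetical.

```python
def connected_blocks(alignment):
    """Group a word alignment into runs of consecutive aligned positions.

    alignment: dict {target_pos: source_pos}
    returns:   list of (target_start, source_start, length)
    """
    blocks = []
    for t in sorted(alignment):
        s = alignment[t]
        if blocks:
            t0, s0, n = blocks[-1]
            # Extend the last block if both sides advance by exactly one word.
            if t == t0 + n and s == s0 + n:
                blocks[-1] = (t0, s0, n + 1)
                continue
        blocks.append((t, s, 1))
    return blocks

# Hypothetical alignment: target words 0-2 follow source words 0-2,
# target words 3-4 follow source words 5-6.
alignment = {0: 0, 1: 1, 2: 2, 3: 5, 4: 6}
blocks = connected_blocks(alignment)
longest = max(blocks, key=lambda b: b[2])
print(blocks, longest)
```

Each block can then be replaced by a single pointer instead of coding its words one by one.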

How to do it, more complex

current = 1
Lt = length of translated text
while current <= Lt
    find the longest block starting at position current
    if it is not found, output a pointer to the translation of the word at position current
    advance current past the block (or the single word) just encoded
end while

“The longest block”

found = false
while m < Lt - current   (Lt - current is the maximum length of a block at this point in the text)
    if there exists a connection between translation and original:
        compute/output the items for the encoding of the pointer, including:
            * difference in position of the word and its aligned translation
            * stemmed versions of all words in the string
            * a translation of the string
            * variants of the words in the string
        found = true
        break
end while
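A minimal runnable sketch of the greedy loop, under stated assumptions. It is not the authors' implementation: the `lexicon` is a hypothetical one-to-one word dictionary, stemming and variants are left out, a block is matched by exact lookup, and a word with no match is emitted as a literal rather than as a dictionary pointer as on the slide.

```python
def encode(target, source, lexicon):
    """Greedily replace runs of target words with pointers into the source.

    target, source: lists of words
    lexicon:        hypothetical dict mapping a target word to a source word
    returns:        list of ("PTR", position_difference, length) or ("LIT", word)
    """
    out = []
    cur = 0
    while cur < len(target):
        best = None  # (length, source_start)
        # Try the longest possible block first, then shorter ones.
        for length in range(len(target) - cur, 0, -1):
            wanted = [lexicon.get(w) for w in target[cur:cur + length]]
            if None in wanted:
                continue
            for s in range(len(source) - length + 1):
                if source[s:s + length] == wanted:
                    best = (length, s)
                    break
            if best:
                break
        if best:
            length, s = best
            out.append(("PTR", s - cur, length))  # difference in position
            cur += length
        else:
            out.append(("LIT", target[cur]))
            cur += 1
    return out

# Toy data, invented for illustration.
source = ["the", "dog", "jumped"]
target = ["le", "chien", "sauta", "vite"]
lexicon = {"le": "the", "chien": "dog", "sauta": "jumped"}
tokens = encode(target, source, lexicon)
print(tokens)
```

Because `cur` always advances by at least one word, the loop is guaranteed to terminate.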

The role of encoding

Each pointer clearly carries a lot of information

This is because language is finicky: words don’t always mean the same thing

For example, we aren’t actually including all the variants of the French “mineral” (mineral, minerals, minerale, minerales) but really just pointers to a dictionary of these variants
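A small sketch of the idea of pointing into a dictionary of variants instead of spelling the variant out. The table layout and function names are assumptions for illustration; the word list is the one from the example above.

```python
# Hypothetical variants dictionary: stem -> ordered list of surface forms.
variants = {
    "mineral": ["mineral", "minerals", "minerale", "minerales"],
}

def encode_variant(stem, surface):
    """Return the index of the surface form within the stem's variant list."""
    return variants[stem].index(surface)

def decode_variant(stem, index):
    """Recover the surface form from the stem and the stored index."""
    return variants[stem][index]

idx = encode_variant("mineral", "minerale")
print(idx, decode_variant("mineral", idx))
```

With four variants, the index fits in two bits, which is far less than storing the word itself.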

Dictionaries

This means that the structure of our dictionaries is extremely important

Large dictionaries would reduce compression - or would they?

The algorithm is intended not for simply compressing files on your computer but for a multilingual information retrieval system, where you will already have these dictionaries

Dictionary-specific compression can be used to reduce size

How does it match up?

Compressed the EU JOC corpus, a collection of questions and answers on various topics

English-to-French

Trans (us!) seems to win out overall

Full size    Gzip     Bzip     Hword    Trans

7551550      0.307    0.214    0.225    0.212
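To put the figures in absolute terms, the sketch below multiplies them out. It assumes that each figure is the compressed size as a fraction of the full size and that the full size is in bytes; the slide states neither, so treat the results as indicative only.

```python
full_size = 7551550  # from the table; unit assumed to be bytes

# Figures from the table, assumed to be compressed size / full size.
ratios = {"Gzip": 0.307, "Bzip": 0.214, "Hword": 0.225, "Trans": 0.212}

sizes = {name: full_size * ratio for name, ratio in ratios.items()}
best = min(sizes, key=sizes.get)

for name, size in sizes.items():
    print(f"{name}: about {size:,.0f}")
print("smallest:", best)
```

Under these assumptions Trans comes out smallest, though its margin over Bzip is narrow.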

Caveats

English and French are very similar

What would happen when we use a different pair such as English and Hebrew?

English and French have had similar historical experiences (WWII, etc.), so they have similar vocabularies of ideas. What about languages from very different cultures?

Translation would be more difficult, and so would compression
