Using Mutual Information Technique in Cross Language Information Retrieval
Syandra Sari (Trisakti University), Mirna Adriani (University of Indonesia)
ICADL 2008, Bali, 3 Sept 2008
Syandra Sari 2
Outline
1. Introduction
2. CLIR
3. Related Research
4. Our Work
5. Mutual Information & QE
6. Experiment
7. Result and Analysis
8. Conclusion
Introduction
• The explosive growth of the World Wide Web
• A multilingual world on the Internet
→ Stimulated the CLIR research area
CLIR
• The challenge in this area is to overcome the language barrier between the query and the document collection.
• CLIR needs to transform queries and documents into a common representation so that monolingual IR techniques can be applied.
CLIR
Two approaches in CLIR:
• translate the query into the language of the documents or
• translate the documents into the language of the query.
CLIR
Techniques for the translation process:
1. Machine translation
2. Bilingual dictionary
3. Parallel or comparable corpus
Related Research
• Yiming Yang et al. (1998):
– Created a corpus-based term-equivalence matrix extracted automatically from bilingual corpora
– English-Chinese CLIR and English-Japanese CLIR
• Lavrenko et al. (2002):
– Applied a language model for CLIR using a parallel corpus
– Chinese-English CLIR
• Martin Braschler (2004):
– Used similarity thesauri for query translation
– CLIR for several European languages (English, Italian, German, French, Spanish)
CLIR for Indonesian Language
• In our earlier study (Hayurani et al., 2006):
– Indonesian-English CLIR
– Machine translation → 84.82% of monolingual performance
– Bilingual dictionary → 51.98% of monolingual performance
– Parallel corpus (using bilingual dictionary) → 8% of monolingual performance
Our Work
• Our work performs the translation process for Indonesian-English CLIR using
– Parallel corpus
  • Pseudo-translation
  • Mutual information technique
– Dictionary
– Machine translation
• It is aimed at evaluating the language resources and tools available for the Indonesian-English pair
Mutual Information
• In monolingual IR, mutual information has been used to find word associations (Church, 1990).
• In our work:
– It measures the degree of association between an Indonesian word and an English word.
– The word pair with the highest mutual information value is considered the best word pair.
Mutual Information
• Mutual information of two points (words), x and y, is defined to be:

$$I(x,y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}$$

• P(x) is the occurrence probability of word x
• P(y) is the occurrence probability of word y
• P(x,y) is the probability that the words x and y occur together
• I(x,y) is the mutual information value
Mutual Information
• In (Church, 1990) and (Myung-Gill, 1999), the mutual information value is computed from word co-occurrence statistics and can be defined as follows:

$$I(x,y) = \log_2 \frac{N \cdot f(x,y)}{f(x)\,f(y)}$$

• f(x) is the number of documents containing x in the corpus;
• f(y) is the number of documents containing y in the corpus;
• f(x,y) is the number of documents containing both x and y in the corpus;
• N is the number of items or words in the corpus
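As a short worked step (a sketch, assuming each probability is estimated as a relative frequency over the N counting units, which the slides do not state explicitly), the count-based form follows from the probability-based definition:

```latex
P(x) \approx \frac{f(x)}{N}, \qquad
P(y) \approx \frac{f(y)}{N}, \qquad
P(x,y) \approx \frac{f(x,y)}{N}

I(x,y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}
       \approx \log_2 \frac{f(x,y)/N}{\bigl(f(x)/N\bigr)\bigl(f(y)/N\bigr)}
       = \log_2 \frac{N \cdot f(x,y)}{f(x)\,f(y)}
```

Under this estimate, a pair that co-occurs exactly as often as chance predicts gets I(x,y) = 0, and pairs that co-occur more often than chance get positive values.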
Mutual Information
• We adapted the formula for the Indonesian-English parallel corpus:

$$I(x,y) = \log_2 \left( \frac{D(x,y)}{(D(x)+1)\,(D(y)+1)} \cdot \left(N_{\text{Indonesian}} + N_{\text{English}}\right) \right)$$
Mutual Information
• D(x) is the number of Indonesian documents that contain Indonesian word x (exclusive);
• D(y) is the number of English documents that contain English word y (exclusive);
• D(x,y) is the number of Indonesian-English document pairs that contain Indonesian word x and English word y;
• N_Indonesian is the number of items or words in the Indonesian corpus;
• N_English is the number of items or words in the English corpus
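The adapted measure can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names, the toy three-document parallel corpus, and the tie-breaking rule are all assumptions, and the "(exclusive)" qualifier on D(x) and D(y) is not modeled.

```python
import math
from collections import Counter
from itertools import product


def build_mi_table(pairs):
    """Score every (Indonesian word, English word) pair seen in aligned documents.

    pairs: list of (indonesian_tokens, english_tokens), one tuple per document pair.
    """
    d_x = Counter()    # D(x): Indonesian documents containing x
    d_y = Counter()    # D(y): English documents containing y
    d_xy = Counter()   # D(x,y): document pairs containing both x and y
    n_indonesian = 0   # number of word tokens in the Indonesian corpus
    n_english = 0      # number of word tokens in the English corpus

    for id_tokens, en_tokens in pairs:
        n_indonesian += len(id_tokens)
        n_english += len(en_tokens)
        id_set, en_set = set(id_tokens), set(en_tokens)
        d_x.update(id_set)
        d_y.update(en_set)
        d_xy.update(product(id_set, en_set))

    n_total = n_indonesian + n_english
    return {
        (x, y): math.log2(count * n_total / ((d_x[x] + 1) * (d_y[y] + 1)))
        for (x, y), count in d_xy.items()
    }


def top_translations(mi, word, k=5):
    """Return up to k English candidates for an Indonesian word, best first."""
    candidates = [(y, score) for (x, y), score in mi.items() if x == word]
    # Assumption: ties are broken alphabetically to keep the output deterministic.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [y for y, _ in candidates[:k]]


# Hypothetical toy parallel corpus (already tokenized and lowercased).
pairs = [
    (["merek", "nestle"], ["nestle", "trademark"]),
    (["merek", "dunia"], ["world", "trademark"]),
    (["visa", "dunia"], ["world", "visa"]),
]
mi = build_mi_table(pairs)
print(top_translations(mi, "merek"))
```

Taking the first-ranked candidate corresponds to choosing the word pair with the highest mutual information value; keeping the top five corresponds to the ranked lists examined in the analysis slides.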
Query Expansion
• Query expansion is the process of adding to the query words found in a certain number of the top-ranked English documents retrieved.
• We used a language model formula to choose the best words to add to the query.
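The expansion step can be illustrated with a minimal sketch. The slides do not give the language model formula, so the scoring below (summing each term's maximum-likelihood probability P(w|D) over the top documents) is an assumption chosen for illustration, as are the function name, the parameter names, and the toy documents.

```python
from collections import Counter


def expand_query(query_terms, top_docs, n_terms=2):
    """Append the n_terms best-scoring new words from the top retrieved documents.

    query_terms: list of query words.
    top_docs: list of tokenized documents (the top-ranked retrieval results).
    """
    scores = Counter()
    for doc in top_docs:
        if not doc:
            continue
        term_freq = Counter(doc)
        for word, count in term_freq.items():
            if word not in query_terms:
                # Assumed scoring: unsmoothed unigram estimate P(w|D) = tf / |D|.
                scores[word] += count / len(doc)
    # Highest score first; alphabetical order breaks ties deterministically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [word for word, _ in ranked[:n_terms]]


# Hypothetical example: three top-ranked documents for a translated query.
top_docs = [
    ["nestle", "chocolate", "brands", "chocolate"],
    ["nestle", "coffee", "chocolate"],
    ["brands", "coffee", "market"],
]
expanded = expand_query(["nestle", "brands"], top_docs, n_terms=2)
print(expanded)
```

The number of top documents and the number of added words are tuning choices; the slides say only that "a certain number" of top documents is used.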
Experiment (1)
• Queries
– 50 Indonesian queries from CLEF 2006
  • Indonesian: Merek Nestle. Merek-merek apa yang dipasarkan oleh Nestle di seluruh dunia
  • English: Nestlé Brands. What brands are marketed by Nestlé around the world
• Collection
– 169,478 English documents from CLEF 2006 (Glasgow Herald 1995, Los Angeles Times 1994)
Experiment (2)
Building the Indonesian-English Parallel Corpus
[Diagram: building the parallel corpus. Labels: English corpus; Machine Translation; Indonesian corpus; Indonesian-English parallel corpus.]
Experiment (3)
[Diagram: query translation and retrieval. Labels: Indonesian queries; building the bilingual word list using mutual information and the translation process; Indonesian-English parallel corpus; English queries; IRS; relevant documents.]
Experiment (4)
Queries were also translated using:
• Bilingual dictionary
• Machine translations:
– Toggletext
– Transtool
• Pseudo-translation based on parallel corpus
Result and Analysis
• The Mean Average Precision (MAP) of
– the monolingual English queries
– the Indonesian queries translated using
  • bilingual dictionary
  • Toggletext machine translation
  • Transtool machine translation

Technique                      MAP
Monolingual                    0.3242
Dictionary                     0.1685 (-48.02%)
M. Translation (Toggletext)    0.2750 (-15.18%)
M. Translation (Transtool)     0.2529 (-22.00%)
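For readers unfamiliar with the measure, MAP can be sketched as follows. This uses the standard textbook definition of average precision; the document identifiers and relevance judgments are made up, and the slides do not say which evaluation tool produced the reported figures.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids), one entry per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical two-query run.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```

In the experiment, the same kind of average would be taken over the 50 CLEF 2006 queries for each translation technique.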
Result and Analysis
• The Mean Average Precision (MAP) of the Indonesian queries translated using the parallel corpus: pseudo-translation and the mutual information technique

Technique                                                   MAP
Parallel-Pseudo Translation (PT)                            0.2245 (-30.75%)
Parallel-Mutual Information (MI)                            0.1085 (-66.53%)
Parallel-Mutual Information with query expansion (MI-QE)    0.1357 (-58.14%)
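The percentages in parentheses appear to be the relative change in MAP against the monolingual baseline of 0.3242. The arithmetic can be checked with a short sketch; the function and variable names are illustrative, and the MAP values are copied from the result tables.

```python
def relative_change(score, baseline):
    """Percentage change of score relative to baseline."""
    return (score - baseline) / baseline * 100.0


MONOLINGUAL_MAP = 0.3242

# MAP values as reported on the result slides.
reported_map = {
    "Dictionary": 0.1685,
    "M. Translation (Toggletext)": 0.2750,
    "M. Translation (Transtool)": 0.2529,
    "Parallel-Pseudo Translation (PT)": 0.2245,
    "Parallel-Mutual Information (MI)": 0.1085,
    "Parallel-MI with query expansion (MI-QE)": 0.1357,
}

changes = {
    name: relative_change(score, MONOLINGUAL_MAP)
    for name, score in reported_map.items()
}
for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
```

Because the MAP values are themselves rounded to four decimals, the recomputed percentages can differ from the reported ones in the last digit.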
Result and Analysis
• The English word is taken from the highest MI value (first rank)
• 110 of the 260 words

Indonesian word    English word from MI (first rank)
merek              trademark
keadaan            circumstance
pengangguran       unemployment
visa               visa
bahaya             danger
Result and Analysis
• The English word is taken from the 2nd to 5th MI value
• 51 words

Indonesian word    English words from MI (1st to 5th rank)                Correct English word    Rank
africa             afrikan, african, kwazulu, AFRICA, gatsha              AFRICA                  4
perawatan          aftercare, surgery, TREATMENT, carefree, arthritis     TREATMENT               3
bijaksana          discreet, PRUDENT, wiser, wisdom, indiscreet           PRUDENT                 2
main               romp, overplay, PLAY, playground, nut                  PLAY                    3
Result and Analysis
• Examples of Indonesian words that get wrong English words as translations
• 44 words

Indonesian word    English words from MI (first to fifth rank)
pengaruh           relentless, unsuspected, mindful, favor, abscond
produksi           latch, lug, lubricant, rapacity, sewage
universitas        conniption, petal, ulna, grantee, packager
Result and Analysis
• 55 Indonesian words did not get the best or correct English word because of the stemming process. For example:
1. Some Indonesian words get an antonym in English.
Example: “adil” was translated into “unjustice” using MI (“unjustice” is the antonym of “justice”).
2. Some Indonesian words get a related word in English.
Example: “matahari” was translated into “sunlight”, “solar” using MI (“sunlight” and “solar” are related to “sun”).
Result and Analysis
3. Some Indonesian words do not get the best translation.
Example: “putri” was translated into “daughter” using MI; the best English translation for “putri” is “princess”.
4. Some Indonesian words get an English word that is a translation of another variant of the Indonesian word.
Example: “acara” was translated into “lawyer” using MI. “Lawyer” means “pengacara” in Indonesian, and “pengacara” is one of the variants of the word “acara”.
Result and Analysis
• There are 9 phrases, but only 3 of them get the correct English translation

Indonesian phrase      In English        Using Mutual Information
bahan bakar            fuel              firewood explosive
energi atom            atomic energy     atomic energy
gerhana matahari       solar eclipse     solar eclipse
gempa bumi             earthquake        quake
lintas alam            cross country     undergo straightaway
perdana menteri        prime minister    morihiro
sidang pengadilan      trial             unjustice convocation
uang sekolah           tuition           tuition
undang-undang dasar    constitution      basic invitation
Conclusion
• We find that mutual information
– can rank candidate words by their mutual information value so that the best translation is ranked first; however, it sometimes gives an incorrect translation.
• Machine translation is the best translation method so far.
• Based on the results of our evaluation, there is still room to improve the use of the parallel corpus, which we will explore in future work.
Future Research
• Explore better techniques than mutual information for finding bilingual word pairs.
• Apply a better query expansion technique to improve the results.