Master Thesis
Bilingual Dictionary Construction Based on Interlingual Alignment
between Wordnets in Different Languages
Supervisor Professor Toru ISHIDA
Department of Social Informatics
Graduate School of Informatics
Kyoto University
Taketo SASAKI
February 12, 2016
Bilingual Dictionary Construction Based on Interlingual
Alignment between Wordnets in Different Languages
Taketo SASAKI
Abstract
Machine-readable bilingual dictionaries are essential for machine translation and
inter-lingual information retrieval. Existing approaches to bilingual dictionary
construction fall into two classes: induction with multiple bilingual
dictionaries and statistical approaches with parallel corpora.
In the former case, the issue is how to remove incorrect translation pairs
that arise from the ambiguity of pivot words. In the latter case, similarity
between languages is measured by modeling the languages based on corpora.
However, because it depends on orthographic features of languages, this approach
is less effective when constructing dictionaries for distant language pairs.
Therefore, bilingual dictionary construction needs correct alignment of words
based on their concepts, without being restricted by the features of languages.
In this research, the author introduces an approach that constructs bilingual
dictionaries through inter-lingual alignment of words, using a conceptual
dictionary called WordNet. Concretely, this thesis uses English WordNet and
wordnets of other languages that were constructed based on English WordNet.
Since English WordNet and the wordnets of other languages share the
same conceptual structure, translation pairs can be extracted by aligning the
concepts of words between wordnets through English WordNet. In this research,
the author focuses on the following problems.
Definition of Conceptual Granularity Based on Graph Structure
In the wordnet of a language, words are associated with the conceptual
structure of English because the wordnet is constructed based on English
WordNet. However, some languages have words that express a broader or
narrower concept than English words. In order to associate a source language
word with a target language word based on their concepts, a graph-based
measure that expresses the conceptual granularity of a word is needed.
Calculation of Confidence for Alignment Considering Conceptual Granularity
Words in different languages may express the same concept with different
conceptual granularity. Therefore, in order to link words based on their
concepts, we first need to calculate a confidence value for inter-lingual
word alignment based on conceptual granularity, and then we need an
algorithm that extracts translation pairs based on that confidence value.
For the first problem, graphs are used to represent relations between words
and concepts. The conceptual granularity of a word is defined as the diameter
of the graph that consists of the concepts neighboring the word. In order to
examine the effects of conceptual granularity, a dictionary constructed by a
baseline method was analyzed. Based on the analysis, the cause of incorrect
translations is defined as matching between words that have different
conceptual granularity and share only part of their concepts. For the second
problem, structural equivalence in a graph is used to calculate confidence,
because structurally equivalent nodes in the WordNet graph represent words
that have the same conceptual granularity and are associated with the same
concepts. In order to calculate structural equivalence, an algorithm that
extracts the subgraph related to bilingual dictionary construction is proposed.
To evaluate the accuracy of a bilingual dictionary, we extract sample pairs
from the dictionary and classify them into three grades. The results showed
that the proposed method can construct dictionaries with an accuracy of about
82%. In addition, the effect of the threshold value in the proposed algorithm
is evaluated. The evaluation revealed that complete structural equivalence is
too strict for bilingual dictionary construction using wordnets. The main
contributions of this research are as follows.
Definition of Conceptual Granularity Based on Graph Structure
The conceptual granularity of words is defined in order to align words based
on their concepts. An analysis of the distribution of the conceptual
granularity of words in several wordnets revealed that the relations between
concepts and words are distorted in wordnets in different languages.
Calculation of Confidence for Alignment Considering Conceptual Granularity
In order to align words while considering conceptual granularity, the author
defined confidence for alignment as structural equivalence in a graph, and
proposed a bilingual dictionary construction algorithm using wordnets.
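Bilingual Dictionary Construction Based on Interlingual Alignment
between Wordnets in Different Languages
Taketo SASAKI
Summary
Machine-readable bilingual dictionaries are indispensable for machine
translation and cross-lingual information retrieval. Methods for creating such
dictionaries automatically fall into two broad classes. One connects two
existing bilingual dictionaries through a resource-rich language such as
English and induces translations. The other generates translations
statistically from language resources other than bilingual dictionaries, such
as parallel corpora.
In the former, the two dictionaries are linked through words of the pivot
language rather than through word senses, so the issue is to remove the
inappropriate translation pairs caused by the polysemy of pivot-language
words. In the latter, models of the two languages are built from corpora and
their similarity is measured with a seed dictionary. However, because this
depends heavily on the orthography of the languages, it has been shown to be
less effective for distant language pairs. To create a bilingual dictionary
between different languages, the concepts expressed by the words of each
language must be aligned accurately, without being constrained by the features
of the languages.
In this research, therefore, the interlingual alignment of concepts in a
conceptual dictionary called WordNet is used to extract translation relations
between the words that express those concepts, and a bilingual dictionary is
generated automatically. Specifically, the English WordNet and the wordnets
built for other languages on its conceptual structure (hereafter, wordnets in
different languages) are used. Because the English WordNet and the wordnets in
different languages share the same conceptual structure, concepts are aligned
between wordnets in different languages through the English WordNet, and
translation relations are extracted. This thesis addresses the following
issues.
Definition of Conceptual Granularity Based on Graph Structure
Because wordnets in different languages were created from the English
WordNet, each language is mapped onto the conceptual structure of English.
However, some languages have words that express a broader meaning than
English words, and others have words that express a narrower meaning. To
align words based on concepts, a measure is needed that expresses the
granularity of a concept based on graph structure such as hypernym and
hyponym relations.
Calculation of Confidence for Alignment Considering Conceptual Granularity
Words that express the same concept in different languages may differ in
conceptual granularity. To align words based on concepts, a confidence
value for interlingual concept alignment based on conceptual granularity is
therefore required, together with an algorithm that extracts translation
pairs based on that confidence value.
For the first issue, a graph structure was used to express the relation
between the conceptual structure and words quantitatively, and the conceptual
granularity of a word was expressed as the diameter of the subgraph formed by
the conceptual structure adjacent to the word. To investigate how conceptual
granularity affects a bilingual dictionary, a bilingual dictionary was
generated from the Japanese and Indonesian wordnets, and the mistranslations
were classified in cooperation with Indonesian speakers. Using the result, the
subgraphs that give rise to mistranslations were extracted from the graph
structure, and the causes of mistranslation were defined as matching between
words of different conceptual granularity and partial sharing of concepts
between words of large conceptual granularity. For the second issue,
confidence was calculated using structural equivalence in the graph, because
two words being structurally equivalent in the graph that expresses the
relation between the conceptual structure and words means that they are
associated with the same set of concepts at the same conceptual granularity.
To compute structural equivalence, an algorithm was proposed that extracts
from the whole graph the subgraph relevant to creating translation pairs. To
confirm the effectiveness of the proposed method, the bilingual dictionary
created with it was evaluated, and a precision of about 81% was obtained. The
effect of the threshold that decides translation pairs in the proposed
algorithm was also evaluated. The result showed that complete structural
equivalence is too strong a constraint when creating a bilingual dictionary
with WordNet. The contributions of this research are as follows.
Definition of Conceptual Granularity Based on Graph Structure
To align words based on concepts, the granularity of the concept expressed
by a word was defined as its conceptual granularity. The distribution of
the conceptual granularity of words was analyzed for wordnets of various
languages, which showed that in wordnets in different languages the
occurrence of words with large conceptual granularity distorts the relation
between words and the conceptual structure.
Calculation of Confidence for Alignment Considering Conceptual Granularity
To align words while considering conceptual granularity, the confidence of
an alignment was defined as structural equivalence in the graph structure,
and a bilingual dictionary construction algorithm using this confidence was
proposed.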
Contents
Chapter 1 Introduction
Chapter 2 Bilingual Dictionary Construction
2.1 Induction with Multiple Bilingual Dictionaries
2.2 Statistical Approach with Parallel Corpora
2.3 Induction with WordNet Synsets
Chapter 3 WordNet
3.1 Princeton WordNet
3.2 Multilingualization of WordNet
Chapter 4 Bilingual Dictionary Construction Using Wordnets in Different Languages
4.1 Interlingual Alignment between Wordnets in Different Languages using WordNet
4.2 Issues in Wordnets in Different Languages
4.3 Interlingual Alignment between Wordnets in Different Languages Using Structural Equivalence
4.3.1 Definition of Conceptual Granularity Based on Graph Structure
4.3.2 Calculation of Confidence for Alignment Considering Conceptual Granularity
Chapter 5 Evaluation
5.1 Experimental Settings
5.2 Evaluation of Interlingual Alignment Algorithm
Chapter 6 Discussion
Chapter 7 Conclusion
Acknowledgments
References
Chapter 1 Introduction
Recently, internationalization and the spread of the Internet have increased
our chances of reading web documents written in foreign languages. These
include not only documents written in a global common language such as English
but also documents written in languages that few non-native speakers speak,
such as Southeast Asian languages. Computer support is essential for reading
such documents. In the NLP (Natural Language Processing) area, a wide variety
of achievements supporting such activities have been reported. In particular,
a machine-readable bilingual dictionary, which provides pairs of words from
different languages that have the same meaning (called translation pairs), is
essential for machine translation and cross-language information retrieval.
Generally, languages with few language resources such as bilingual
dictionaries and corpora are called endangered languages, and languages with
many language resources are called high-resourced languages. High-quality
bilingual dictionaries tend to be constructed between high-resourced
languages, whereas bilingual dictionaries between endangered languages, and
between high-resourced languages and endangered languages, are either
unavailable or of low quality. Therefore, bilingual dictionary construction
methods that can be applied to any language pair are required.
A variety of studies on bilingual dictionary construction have been conducted
so far. Above all, approaches that use existing bilingual dictionaries
or other language resources are useful for less-resourced languages. Existing
approaches can be classified into two classes: induction with multiple
bilingual dictionaries and statistical approaches with parallel corpora.
First, induction with multiple bilingual dictionaries constructs bilingual
dictionaries through a third language. In this approach, a bilingual
dictionary between a source language and a target language is obtained
by using a third language as the pivot language. Its naive implementation
proceeds as follows. For each word in source language A, take its translations
into the pivot language B using bilingual dictionary A-B; then, for each such
pivot translation, take its translations into the target language C using
bilingual dictionary B-C. Bilingual dictionary A-C is obtained by associating
words in A with the induced words in C. However, a dictionary constructed by
this procedure contains incorrect translation pairs because of polysemous and
ambiguous words in the pivot language. Take Japanese-English-French as an
example. The Japanese word “party(パーティ)”, which means a social gathering,
is correctly translated into the English word “party”. However, when “party”
is translated into French, it becomes either “partie(党)”, which means a
formally constituted political group, or “parti(パーティ)”. In this case,
translating “パーティ” into “partie(政党)” is incorrect. Therefore, the issue
in this approach is to extract only correct translations while coping with the
ambiguity of the pivot language.
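The naive pivot procedure described above can be sketched in a few lines. This is only a minimal illustration: the function name induce_via_pivot and the representation of a dictionary as a mapping from a word to a set of translations are assumptions made for this sketch, and the toy entries simply repeat the party example.

```python
from collections import defaultdict


def induce_via_pivot(dict_ab, dict_bc):
    """Compose dictionary A-B with dictionary B-C into candidate A-C pairs."""
    dict_ac = defaultdict(set)
    for word_a, pivots in dict_ab.items():
        for pivot in pivots:
            # Every translation of the pivot word is accepted, whatever its
            # sense, which is exactly where incorrect pairs come from.
            dict_ac[word_a].update(dict_bc.get(pivot, ()))
    return dict(dict_ac)


# Toy entries repeating the example in the text (hypothetical dictionary contents).
ja_en = {"パーティ": {"party"}}
en_fr = {"party": {"partie", "parti"}}

ja_fr = induce_via_pivot(ja_en, en_fr)
print(ja_fr)
```

Both French candidates are returned for “パーティ”, because nothing in the composition tells the two senses of the pivot word “party” apart.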
Second, statistical approaches with parallel corpora extract bilingual
dictionaries statistically from parallel corpora. A typical implementation
aligns the texts with each other at the chunk or word level. Alternatively,
monolingual corpora are used to construct probabilistic models for
measuring the similarity between two languages. However, this approach has
been shown to be less effective for distant language pairs because of its
heavy dependence on orthographic features of languages. In summary,
constructing bilingual dictionaries requires correct alignment of words based
on their concepts, without being restricted by the features of languages.
In this research, the author constructs bilingual dictionaries through
inter-lingual alignment of words based on the concepts they express, using a
conceptual dictionary called WordNet. Concretely, English WordNet and wordnets
in different languages, which are constructed based on the conceptual
structure of English WordNet, are used. Since English WordNet and the wordnets
in different languages share the same conceptual structure, translation pairs
are extracted by aligning the concepts of words between wordnets in different
languages through English WordNet. In this research, the author works on the
following problems.
Definition of Conceptual Granularity Based on Graph Structure
In wordnets in different languages, words are associated with the
conceptual structure of English because these wordnets are constructed
based on English WordNet. However, among the words of a given language,
some express a broader concept than English words and others express a
narrower concept than English words. In order to associate source language
words with target language words based on their concepts, a measure is
needed that expresses the conceptual granularity of a word based on the
graph structure that represents hyponym and hypernym relations in WordNet.
Calculation of Confidence for Alignment Considering Conceptual Granularity
Words that belong to different languages and express the same concept may
differ in conceptual granularity. Therefore, in order to associate source
language words with target language words based on their concepts, two
things are required: a confidence value for inter-lingual word alignment
based on conceptual granularity, and an algorithm that extracts translation
pairs based on that confidence value.
This thesis is organized as follows.
In Chapter 2, previous studies relevant to bilingual dictionary construction
are introduced. These studies are classified into three groups according to
the language resources used to construct the bilingual dictionary (bilingual
dictionaries, bilingual or monolingual corpora, and WordNet), and each method
is described. In Chapter 3, WordNet, the language resource used in this
research to construct bilingual dictionaries, is described, including the kind
of data it provides. Previous studies on the multilingualization of WordNet
and its state as of 2016 are also described. In Chapter 4, the author
introduces a bilingual dictionary construction method using WordNet. The
research issues in using WordNet to construct bilingual dictionaries are
described, and methods to cope with each issue are proposed. In Chapter 5,
bilingual dictionaries are constructed with the proposed method and evaluated
from both quantitative and qualitative perspectives. The experimental
settings, such as language resources and evaluation measures, are described
and the results are shown. In Chapter 6, the results shown in Chapter 5 are
discussed. Chapter 7 presents the conclusion.
Chapter 2 Bilingual Dictionary Construction
A variety of studies on bilingual dictionary construction have been conducted
so far. Above all, approaches that use existing bilingual dictionaries or
other language resources are useful for less-resourced languages. Existing
approaches can be classified into three directions: induction with multiple
bilingual dictionaries, statistical approaches with parallel corpora, and
induction with WordNet synsets. In the following sections, previous studies
are described according to this classification.
2.1 Induction with Multiple Bilingual Dictionaries
One of the most representative approaches to constructing bilingual
dictionaries is induction with multiple bilingual dictionaries. This approach,
however, must discriminate correct equivalents from the inappropriate words
introduced by ambiguity in the pivot language.
Tanaka et al. [1] proposed the inverse consultation method, which treats the
ambiguity in the pivot language by exploiting the structures of dictionaries
and the ideograms of morphemes to measure the similarity of the meanings
of words. Figure 1 outlines this method.
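As a rough illustration of the counting step in Figure 1, the sketch below scores each target-language candidate by the number of pivot words it shares with the source word. It is a simplified reading of the figure, not the method of [1] in full: the use of morpheme ideograms is omitted, the function name inverse_consultation is an assumption, and the toy dictionaries are hypothetical entries loosely following the 競争 example.

```python
def inverse_consultation(word, src_pivot, pivot_tgt, tgt_pivot):
    """Rank target candidates by the pivot words they share with `word`."""
    pivots = set(src_pivot.get(word, ()))
    # Steps 1 and 2: collect every target word reachable through a pivot word.
    candidates = set()
    for pivot in pivots:
        candidates.update(pivot_tgt.get(pivot, ()))
    # Step 3: look each candidate back up in the target-pivot dictionary and
    # count the pivot words it has in common with the source word.
    scores = {c: len(pivots & set(tgt_pivot.get(c, ()))) for c in candidates}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy dictionaries (ja-en, en-fr, fr-en).
ja_en = {"競争": {"race", "contest", "competition"}}
en_fr = {"race": {"race"}, "contest": {"compétition"}, "competition": {"compétition"}}
fr_en = {"compétition": {"contest", "competition", "match"},
         "race": {"race", "ancestry"}}

ranked = inverse_consultation("競争", ja_en, en_fr, fr_en)
print(ranked)
```

A candidate that shares more pivot words with the source word is ranked higher, so the ambiguous pivot word "race" contributes less than the two pivot words that agree on the same French word.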
Tsuchiya et al. [2] proposed a method that expands a small existing bilingual
dictionary (the seed dictionary) into a large bilingual dictionary using a
pivot language. This method assumes that, for a given language pair A and B,
it is possible to find a pivot language C for which there are both a large
bilingual dictionary from the source language A to the pivot language C and a
large bilingual dictionary from the pivot language C to the target language B.
First, a co-occurrence vector in the target language B corresponding to an
input word is generated using both the seed dictionary and a monolingual
corpus in the source language A. Second, translation candidates are listed by
referring to both the source-pivot bilingual dictionary A-C and the
pivot-target bilingual dictionary C-B, and their co-occurrence vectors are
calculated from a monolingual corpus in the target language B. Finally, an
output is selected based on the cosine similarity score. Figure 2 shows the
whole procedure.
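The final selection step can be illustrated as follows. This is a minimal sketch that assumes the co-occurrence vectors have already been built and are given as sparse mappings from a target-language term to a weight. The function names and the toy vectors are hypothetical and are not taken from [2].

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_translation(source_vector, candidate_vectors):
    """Return the candidate whose vector is most similar to source_vector."""
    return max(candidate_vectors,
               key=lambda cand: cosine(source_vector, candidate_vectors[cand]))


# Hypothetical vectors: the input word's vector, already mapped into target
# language terms through the seed dictionary, and two candidate vectors.
source_vector = {"term_1": 2.0, "term_2": 1.0}
candidate_vectors = {
    "candidate_1": {"term_1": 1.0, "term_2": 1.0},
    "candidate_2": {"term_3": 3.0},
}

best = select_translation(source_vector, candidate_vectors)
print(best)
```

Here candidate_1 is selected because it shares context terms with the input word, while candidate_2 shares none and therefore has similarity zero.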
Figure 1: Inverse consultation method (Steps 1 to 3 for the Japanese word 競争 over ja-en and fr-en dictionaries: one French candidate shares two English words with 競争, another shares one, giving 競争→competition)
Figure 2: Expanding existing bilingual dictionary using a pivot language[2]
Another line of research models the structures of the existing input
dictionaries as a constraint satisfaction problem. Mairidan et al. [3]
proposed a constraint approach to pivot-based dictionary induction for cases
where the source language and the target language are closely related. They
created constraints from language similarity and modeled the structures of the
input dictionaries as a Boolean optimization problem, an extension of SAT.
In addition, Tokunaga et al. [4] proposed a method that extracts information
about conceptual items from translation pairs of bilingual dictionaries.
A translation circuit, which consists of a headword in each of the source and
target languages together with one word sense of each headword, is used to
represent a single concept. Figure 3 shows an example of translation circuits.
This work is related to this research in that the concepts of words play an
important role.
Figure 3: Translation circuits for a word party (party and reception linked to 政党, 関係者 and パーティ through English-Japanese and Japanese-English dictionary entries)
2.2 Statistical Approach with Parallel Corpora
Another line of research constructs bilingual dictionaries from large parallel
corpora. This approach is generally used in two ways: using cross-lingual
co-occurrence obtained from parallel or monolingual corpora to improve the
bilingual dictionaries constructed with a pivot language as described in
section 2.1, and extracting translations from parallel or monolingual corpora
without using a pivot language.
As an example of the first approach, Tanaka et al. [5] proposed a method that
extracts translations from non-aligned corpora. They assume that the
translations of two words that co-occur in the source language also co-occur
in the target language, and model this assumption with a stochastic matrix.
In addition, a translation matrix, which provides the co-occurrence
information translated from the source into the target language, is used to
measure the ambiguity of translation. Finally, they showed that the best
translation matrix is the one whose translated co-occurrence information most
closely resembles the co-occurrence information of the target language.
As described above, using parallel corpora to extract bilingual dictionaries
is a mature and promising approach if adequate amounts of data are available.
However, it cannot be applied to constructing bilingual dictionaries between
distant languages because of its heavy dependence on orthographic features of
languages.
2.3 Induction with WordNet Synsets
There are also approaches that use conceptual dictionaries for bilingual
dictionary construction, and WordNet is often used as the conceptual
dictionary. Mohanty et al. [6] constructed a multilingual dictionary using
synsets of WordNet in place of pivot words. Unlike existing bilingual
dictionaries, in which a word has several words as translations, they proposed
the dictionary model shown in figure 4, in which words from several languages
are related to a synset. Using this model, they constructed a multilingual
dictionary that covers traditional languages of India.
Figure 4: Multilingual dictionary model in which words are related to a synset[6]
Their method is the most closely related to this research in that it
constructs bilingual dictionaries using only WordNet. However, their method is
restricted to languages with similar features, because it focuses on
traditional languages of India. In this research, the author proposes a
general method that does not depend on language features.
Chapter 3 WordNet
In this chapter, WordNet[7, 8], the language resource used in this research to
construct bilingual dictionaries, is described, together with previous studies
on the multilingualization of WordNet and its state as of 2016.
3.1 Princeton WordNet
In this section, WordNet, the language resource used to construct bilingual
dictionaries in this research, is described. WordNet is a conceptual
dictionary of English made by Princeton University
(http://wordnet.princeton.edu/). In WordNet, English nouns, verbs, adjectives,
and adverbs are organized into sets of synonyms (called synsets), and each
synset has a definition. In addition, semantic relations link words to synsets
and synsets to each other. An example of a synset entry is as follows. In this
example, good, right and ripe are the words that belong to the synset, and the
text enclosed in parentheses is the gloss, which consists of the definition of
the synset and usage examples.
    good, right, ripe -- (most suitable or right
    for a particular purpose; "a good time to plant
    tomatoes"; "the right time to act"; "the time is
    ripe for great sociological changes")
In addition, semantic relationships with other synsets are defined. WordNet
defines the following semantic relationships.
Synonymy
Synonymy is WordNet’s basic relation, because WordNet uses sets of
synonyms (synsets) to represent word senses. Synonymy is a symmetric
relation between word forms.
Antonymy
Antonymy (opposing-name) is also a symmetric semantic relation between
word forms, especially important in organizing the meanings of adjectives
and adverbs.
Hyponymy
Hyponymy (sub-name) and its inverse, hypernymy (super-name), are
transitive relations between synsets. Because there is usually only one
hypernym, this semantic relation organizes the meanings of nouns into a
hierarchical structure.
Meronymy
Meronymy (part-name) and its inverse, holonymy (whole-name), are
complex semantic relations. WordNet distinguishes component parts,
substantive parts, and member parts.
Troponymy
Troponymy (manner-name) is for verbs what hyponymy is for nouns,
although the resulting hierarchies are much shallower.
Entailment
Entailment relations between verbs are also coded in WordNet.
Table 1 shows the parts of speech for which each semantic relation is defined.
Table 1: Semantic relations in WordNet
Semantic Relation         Syntactic Category                 Examples
Synonymy (similar)        Noun, Verb, Adjective, Adverb      pipe, tube
Antonymy (opposite)       Adjective, Adverb, (Noun, Verb)    wet, dry
Hyponymy (subordinate)    Noun                               sugar maple, maple
Meronymy (part)           Noun                               brim, hat
Troponymy (manner)        Verb                               march, walk
Entailment                Verb                               drive, ride
In WordNet, both nouns and verbs are organized into hierarchies defined
by hypernym and hyponym relationships. Words listed together at the same level
are members of one synset. In the following example, red and redness belong to
the same synset, and crimson, ruby and deep red also belong to one synset.
Each synset has a unique index and a brief definition.
    red, redness
        => sanguine
        => chrome red
        => Turkey red, alizarine red
        => cardinal, carmine
        => crimson, ruby, deep red
        => dark red
        => purplish red, purplish-red
        => cerise, cherry, cherry red
        => scarlet, vermilion, orange red
As shown in table 2, WordNet includes a great deal of semantic information
on English words. In particular, WordNet covers a large amount of data on
nouns, including about 120,000 words and 80,000 synsets.
Table 2: Statistics of Princeton WordNet
Part of speech    Words      Synsets    Word-Synset Pairs
Noun              117,798    82,115     146,312
Verb              11,529     13,767     25,047
Adjective         21,479     18,156     30,002
Adverb            4,481      3,621      5,580
Total             155,287    117,659    206,941
In this research, the author uses graphs to represent the hierarchies
in WordNet and the relations between synsets and words. Figure 5 represents
the relations described above as a graph in which green nodes represent
synsets and red nodes represent words. Relations such as hyponym are defined
between synsets, and a relation between a synset and a word is labeled with a
language name such as english, which indicates that the word is an instance of
the synset in that language. In the following sections, conceptual structure
and relations between words and concepts are described using graphs.
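One possible in-memory form of such a graph is sketched below, as an illustration only. The class name WordnetGraph, the tuple encoding of nodes and the edge labels are assumptions made for this sketch. The identifier 04962784-n is the synsetID quoted in Chapter 4, and the identifier used for the crimson synset is a hypothetical placeholder.

```python
from collections import defaultdict


class WordnetGraph:
    """Synset nodes and word nodes joined by labeled edges (illustrative only)."""

    def __init__(self):
        # node -> set of (edge label, neighbouring node)
        self.edges = defaultdict(set)

    def add_hyponym(self, hypernym_id, hyponym_id):
        # Synset-to-synset edge, stored in both directions.
        upper, lower = ("synset", hypernym_id), ("synset", hyponym_id)
        self.edges[upper].add(("hyponym", lower))
        self.edges[lower].add(("hypernym", upper))

    def add_word(self, synset_id, word, language):
        # Synset-to-word edge labeled with the language of the word.
        synset, node = ("synset", synset_id), ("word", language, word)
        self.edges[synset].add((language, node))
        self.edges[node].add((language, synset))

    def words_of(self, synset_id, language):
        neighbours = self.edges.get(("synset", synset_id), set())
        return sorted(node[2] for label, node in neighbours
                      if label == language and node[0] == "word")


graph = WordnetGraph()
RED = "04962784-n"          # synsetID quoted in this thesis
CRIMSON = "crimson-synset"  # hypothetical placeholder, not a real synsetID
graph.add_hyponym(RED, CRIMSON)
for word in ("red", "redness"):
    graph.add_word(RED, word, "english")
graph.add_word(RED, "赤", "japanese")
graph.add_word(RED, "merah", "indonesian")
for word in ("crimson", "ruby", "deep red"):
    graph.add_word(CRIMSON, word, "english")

print(graph.words_of(RED, "english"))
```

Keeping words and synsets as two kinds of node in a single graph makes it possible to ask, for any synset, which words of a given language are attached to it.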
Figure 5: An example of representation of conceptual structure by graph (legend: word nodes and synset nodes)
3.2 Multilingualization of WordNet
As described above, WordNet is an English language resource made by Princeton
University. Because it is a very useful language resource, many projects
(about 76 as of 2016; see http://globalwordnet.org/wordnets-in-the-world/)
are constructing wordnets in other languages based on Princeton WordNet.
Focusing on Asian countries that are poor in language resources, almost all
official languages and some ethnic languages are covered by Open Multilingual
WordNet[9], Asian WordNet, WordNet Bahasa[10] and Hindi WordNet.
The leading WordNet projects and the languages they cover are as follows.
Arabic WordNet
Arabic
Open Multilingual WordNet [11]
More than 200 languages including Albanian, Arabic, Bulgarian, Chinese,
Danish, Greek, English, Persian, Finnish, French, Hebrew, Croatian,
Italian, Japanese, Spanish, Indonesian, Malay, Polish, Portuguese, Slovenian,
Swedish, Thai, etc.
Asian WordNet
Hindi, Indonesian, Japanese, Lao, Mongolian, Brunei, Nepali, Thai,
Sinhala, Vietnamese
WordNet Bahasa [10]
Malay, Indonesian
IndoWordNet
Hindi, Assamese, Bengali, Bodo, Gujarati, Kannada, Kashmiri, Konkani,
Malayalam, Meitei, Marathi, Nepali, Sanskrit, Tamil, Telugu, Punjabi,
Urdu, Oriya
EuroWordNet
Danish, English, French, German, Italian, Spanish
MultiWordNet
Italian, Spanish, Portuguese, Hebrew, Romanian, Latin
In this way, WordNet is available as a language resource in several languages,
although the quality varies. Therefore, wordnets can be used to construct
bilingual dictionaries between several languages.
Figure 6: Linking with multiple WordNets[12]
Bond et al.[12] proposed a method that bootstraps a wordnet from multiple
existing wordnets in order to deal with the ambiguity inherent in translation.
They use the construction of the Japanese WordNet as an example.
Although the obvious way to add Japanese to the English WordNet is to
translate the entries using an English-Japanese dictionary, bilingual
dictionaries are not marked with WordNet senses. For example, the English word
seal has several translation candidates in a Japanese-English bilingual
dictionary, including 判子 (stamp) and 海軍特殊部隊 (Navy Seal). Translation
candidates need to be associated with the appropriate WordNet senses.
Wordnets in multiple languages are used to disambiguate the senses of the
translations. As shown in figure 6, looking up bat yields 蝙蝠 (bat),
associated with synsetID n#1, and バット (bat), associated with synsetID n#5.
However, translating these words with bilingual dictionaries alone does not
give translations associated with the correct synsetID. Therefore, words that
belong to the same synsetID in wordnets in multiple languages are used to
acquire correct translations. Many wordnets constructed by this method are
published as the Open Multilingual Wordnet
(http://compling.hss.ntu.edu.sg/omw/).
Table 3: Statistics of Japanese WordNet and WordNet Bahasa
WordNet name        Part of speech    Words     Synsets    Word-Synset Pairs
Japanese WordNet    Noun              66,003    42,737     99,419
Japanese WordNet    Verb              15,346    6,819      33,673
Japanese WordNet    Adjective         8,540     5,798      17,799
Japanese WordNet    Adverb            4,085     1,830      7,157
Japanese WordNet    Total             91,936    57,184     158,048
WordNet Bahasa      Noun              28,185    24,829     51,087
WordNet Bahasa      Verb              7,728     7,802      42,942
WordNet Bahasa      Adjective         4,549     4,804      11,132
WordNet Bahasa      Adverb            685       650        1,413
WordNet Bahasa      Total             36,604    38,085     106,574
In this research, Japanese WordNet[13] (http://nlpwww.nict.go.jp/wn-ja/) and
WordNet Bahasa[14] (http://wn-msa.sourceforge.net), which is an Indonesian
wordnet, are used to build a bilingual dictionary. Words in both wordnets are
associated with synsets in Princeton WordNet. Japanese and Indonesian are
selected as the target languages because few paper or electronic bilingual
dictionaries exist between them and because the two languages are
linguistically distant. Statistics of Japanese WordNet and WordNet Bahasa are
shown in table 3. Both wordnets cover a large amount of data on nouns, so the
bilingual dictionaries constructed from them are expected to include many
headwords.
Chapter 4 Bilingual Dictionary Construction Using Wordnets in Different Languages
In this chapter, the author proposes a method to construct bilingual
dictionaries using the wordnets in different languages described in Chapter 3.
4.1 Interlingual Alignment between Wordnets in Different Languages using WordNet
In this section, the author describes problems in existing bilingual
dictionary construction methods and gives an overview of interlingual
alignment between wordnets in different languages.
With the methods that induce from multiple bilingual dictionaries, introduced
in section 2.1, simply connecting Japanese-English and English-Indonesian
dictionaries with English as the intermediate language to create a
Japanese-Indonesian dictionary quickly causes problems. As shown in figure 7,
if an English word has several meanings, such as meaning A and meaning B, a
Japanese word may be translated into that English word with meaning A while
the English word is translated into Indonesian with meaning B. In this case,
the Japanese word and the Indonesian word in the resulting translation pair
share no common concept. To resolve this kind of problem, Tanaka et al. [1]
utilized translation graphs that represent the structure of dictionaries and
morphemes, and Mairidan et al. [3] utilized linguistic characteristics, such
as similarity between languages, as constraints when modeling the task as a
constraint satisfaction problem (CSP). In either case, the concepts expressed
by words must be aligned correctly.
The statistical methods using corpora, introduced in section 2.2, were shown
to be less effective for distant language pairs because of their heavy
dependence on orthographic features of languages.
In order to construct a bilingual dictionary, the concepts of words from
different languages must be aligned correctly. However, as mentioned above,
existing methods need to utilize characteristics of the source and target
languages to improve quality, so these characteristics affect the quality of
the resulting bilingual dictionary. Therefore, bilingual dictionary
construction needs a method that aligns the concepts of words from different
languages correctly without being affected by the features of the languages.
Figure 7: Problems in the pivot-based method (the Japanese word 運命 is translated into the English word fortune with meaning A, and fortune is translated into the Indonesian words nasib (destiny, meaning A) and kekayaan (wealth, meaning B))
The author therefore proposes a bilingual dictionary construction method based on interlingual alignment between concept dictionaries in different languages. Concretely speaking, Wordnets constructed in different languages on the basis of Princeton WordNet (Japanese WordNet and WordNet Bahasa) are used to construct a bilingual dictionary (nouns only) by extracting word pairs that express the same concept. Using only Wordnets as language resources ensures the versatility and practicality of the method.
In this research, words are aligned based on conceptual structure. This policy is illustrated in figure 8. To realize this method, a language resource that expresses conceptual structure is required. WordNet is used as the concept dictionary. In WordNet, English words are categorized into groups of synonymous words, which are called synsets. Each synset has a simple definition and various relations to other synsets, as described in Chapter 3.
Each synset has an identifier (called a synsetID). Words from each language are associated using the synsetID as a pivot. SynsetID alignment makes it possible to associate source language words with target language words based on concepts. As shown in figure 8, the synset with synsetID 04962784-n, defined as (red color or pigment; the chromatic color resembling the hue of blood), is associated with the words merah (red) and kemerahan (red) in Indonesian WordNet and 赤色 (red) and 赤 (red) in Japanese WordNet. Therefore, translation pairs can be extracted based on concepts by taking the Cartesian product of the two word sets, each of which consists of the words of one language associated with the concept identified by a synsetID. As described in section 2.3, Mohanty et al. constructed a multilingual dictionary in the same way. In this research, this method is used as the baseline method.
Figure 8: Align words based on synset (the Indonesian words merah and kemerahan and the Japanese words 赤 and 赤色 are attached to the same synset 04962784-n of English WordNet)
The outline of these procedures is as follows.
1. Extract one synset from WordNet
2. From source language WordNet and target language WordNet, extract sets of words associated with the extracted synset
3. Get the Cartesian product of the two word sets as translation pairs
Algorithm 1 presents the formalization of this baseline method for bilingual dictionary construction. In the 9th line, get related words(si, sl) is the function that returns the set of words associated with synset si in the WordNet of language sl. In the 13th line, make translation pair(wsl,j, wtl,k) is the function that makes a translation pair from word wsl,j, which belongs to source language sl, and word wtl,k, which belongs to target language tl.
Algorithm 1 Baseline method
1: si /* i-th Synset */
2: S /* Set of synsets (S = s1, · · · , s|S|) */
3: sl /* Source language */
4: tl /* Target language */
5: wl,j /* j-th Word in language l */
6: Wl /* Words set in language l */
7: R /* Set of translation pairs */
8: for all si in S do
9:   Wsl,i ← get related words(si, sl) /* get the set of words associated with synset si in WordNet of language sl */
10:   Wtl,i ← get related words(si, tl)
11:   for all wsl,j in Wsl,i do
12:     for all wtl,k in Wtl,i do
13:       R ← R ∪ {make translation pair(wsl,j, wtl,k)} /* make a translation pair from word wsl,j of source language sl and word wtl,k of target language tl */
14:     end for
15:   end for
16: end for
17: return R
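As a minimal sketch of Algorithm 1, the baseline can be written with plain dictionaries that map a synsetID to the set of words attached to it. The function name baseline_pairs and this data layout are assumptions made for illustration; the word sets are those of figure 8.

```python
from itertools import product


def baseline_pairs(src_words_by_synset, tgt_words_by_synset):
    """Return the Cartesian product of source and target words per synset."""
    pairs = set()
    for synset_id, src_words in src_words_by_synset.items():
        # Synsets without any target language word contribute no pairs.
        tgt_words = tgt_words_by_synset.get(synset_id, ())
        pairs.update(product(src_words, tgt_words))
    return pairs


# Word sets of synset 04962784-n as shown in figure 8.
jpn = {"04962784-n": {"赤", "赤色"}}
ind = {"04962784-n": {"merah", "kemerahan"}}

red_pairs = baseline_pairs(jpn, ind)
```

For this single synset the sketch yields four pairs, one for each combination of a Japanese and an Indonesian word.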
The bilingual dictionary made by the baseline method was evaluated in a preliminary experiment. Table 4 presents the result. The evaluation was conducted by the author and a native Indonesian speaker: 100 translation pairs were sampled from the constructed bilingual dictionary and examined to determine how many can be used as correct translations. The resulting precision was about 66%.
This result is not satisfactory. The preliminary experiment shows that simply associating words based on concepts is not enough to construct a bilingual dictionary. In the next section, characteristics and issues of multilingualized WordNet are described.
Table 4: Result of preliminary experiment
Method            Precision   Japanese headwords   Indonesian headwords
Baseline method   66%         42,585               23,900
4.2 Issues in Wordnets in Different Languages
In section 4.1, the author showed that simply associating words through synsets of WordNet is not enough to extract correct translation pairs. In this section, issues of multilingualized WordNet for bilingual dictionary construction are described.
First, characteristics of multilingualized WordNet are described. As described in section 3.2, a multilingualized WordNet is constructed by associating words with the English conceptual structure, which was built for Princeton WordNet by defining synsets as groups of synonyms together with relations between synsets. However, even if a non-English word expresses the same concept as an English word, some non-English words cover more abstract concepts than the English word and others cover more concrete concepts. Figure 9 illustrates this situation. The figure shows English words (left side) and Japanese words (right side) associated with the same conceptual system of hyponym and hypernym relations. English words have one-to-one relationships with synsets, while some Japanese words have one-to-many relationships with synsets. Such one-to-one words and one-to-many words should not be matched when extracting translation pairs.
In addition, a gap may exist between the concept ranges of two words even if their conceptual granularities are equal. Figure 10 illustrates this situation. Like figure 9, this figure shows Indonesian words (left side) and Japanese words (right side) associated with the same conceptual system of hyponym and hypernym relations. In Japanese WordNet, the word “生き物 (living things)” is associated with all three concepts. In Indonesian WordNet, on the other hand, the word “manusia” is also associated with more than one concept, but only with the lower two. These words share part of the range of concepts they express. However, they should not be extracted as a translation pair, because one can be used to express the more abstract concept while the other cannot.
Figure 9: Example of vagueness in mapping words to English conceptual structure (Left: English WordNet, right: Japanese WordNet)
The above discussion of issues in bilingual dictionary construction using Wordnets in different languages is summarized as follows.
Definition of conceptual granularity based on graph structure
Words in Wordnets in different languages are associated with the English conceptual structure because these Wordnets are constructed based on English WordNet. However, among the words of a given language, some express broader concepts than English words and others express narrower concepts. In order to associate a source language word with a target language word based on their concepts, a measure is required that expresses the conceptual granularity of a word based on the graph structure formed by hyponym and hypernym relations in WordNet.
Figure 10: Example that shows difference of conceptual granularity between languages (Left: Indonesian WordNet, right: Japanese WordNet)
Calculation of confidence for alignment considering conceptual granularity
Words that belong to different languages and express the same concept may differ in conceptual granularity. Therefore, in order to associate source language words with target language words based on their concepts, a confidence value for interlingual word alignment based on conceptual granularity is required, together with an algorithm that extracts translation pairs based on this confidence value.
Because of the difference in conceptual granularity of words, concept alignment between words in different languages becomes ambiguous.
In the following section, the author proposes methods to cope with each issue.
4.3 Interlingual Alignment between Wordnets in Different Languages Using Structural Equivalence
In this section, the author proposes methods to cope with the issues described in section 4.2.
4.3.1 Definition of Conceptual Granularity Based on Graph Structure
Section 4.2 described that incorrect translation pairs may be extracted because of non-English words with large conceptual granularity, which have one-to-many relationships with the conceptual structure. The conceptual granularity of a word can be expressed as the distance from the most abstract concept to the most concrete concept in the conceptual structure, consisting of concepts linked by hyponym relations, that is associated with the word. Therefore, the conceptual granularity of a word wl,i belonging to language l is defined as the diameter of the graph consisting of the synset nodes that are neighbors of node wl,i.
The outline of the procedure to get the conceptual granularity of a word is as follows.
1. Get the set of synset nodes that neighbor a word node
2. Extract, from the whole WordNet graph, the subgraph consisting of the nodes in the set obtained in step 1 and the edges between them
3. Calculate the diameter of the extracted subgraph
Algorithm 2 presents the formalization of the procedure above. In the 6th line, neighbor of(wl,i) is the function that returns the set of synset nodes neighboring wl,i. In the 7th line, subgraph(G,S) is the function that extracts from graph G the subgraph consisting of the nodes in node set S and the edges between them. In the 8th line, diam(SG) is the function that returns the diameter of graph SG.
Using the procedure above, for example, the conceptual granularity of the Japanese word 「生き物 (living things)」 in figure 11 is 2.
Algorithm 2 Conceptual Granularity(wl,i)
1: wl,i /* i-th Word in language l */
2: S /* Set of synsets */
3: G /* Graph consisting of words and synsets in Wordnets */
4: SG /* Subgraph of G */
5: d /* Diameter of a graph */
6: S ← neighbor of(wl,i) /* get the set of synset nodes neighboring wl,i */
7: SG ← subgraph(G,S) /* extract from graph G the subgraph consisting of the nodes in node set S and the edges between them */
8: d ← diam(SG) /* get the diameter of graph SG */
9: return d
Figure 11: Example illustrating conceptual granularity of a word “生き物 (living things)” (the word node is linked by jpn edges to the synsets 00004258-n, 00004475-n and 00015388-n, which are connected by two hypo edges)
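Algorithm 2 can be sketched with the standard library alone. The sketch below makes three assumptions that the text does not state: hyponym edges are treated as undirected when measuring distance, the diameter of a disconnected subgraph is taken as the largest finite shortest-path distance, and the two hypo edges of figure 11 form a chain in the order listed. The function name conceptual_granularity and the extra synset ID used to show the filtering are hypothetical.

```python
from collections import deque


def conceptual_granularity(word_synsets, hypo_edges):
    """Diameter of the subgraph induced by the synsets attached to one word."""
    nodes = set(word_synsets)
    adj = {s: set() for s in nodes}
    for a, b in hypo_edges:
        # Keep only edges whose both ends are attached to the word.
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    diameter = 0
    for start in nodes:
        # Breadth-first search gives shortest-path distances from start.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        diameter = max(diameter, max(dist.values()))
    return diameter


# Synsets attached to 生き物 in figure 11; the chain order is an assumption.
ikimono = ["00004258-n", "00004475-n", "00015388-n"]
hypo = [
    ("00004258-n", "00004475-n"),
    ("00004475-n", "00015388-n"),
    ("00015388-n", "99999999-n"),  # hypothetical edge outside the subgraph
]

granularity = conceptual_granularity(ikimono, hypo)
```

With these inputs the diameter is 2, in agreement with the value given for figure 11, and a word attached to a single synset gets granularity 0.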
The author analyzed the effect of multilingualization of WordNet using Algorithm 2. If the conceptual granularity of words varies from language to language, the distribution of conceptual granularity should also vary from language to language, and the proportion of words with non-zero conceptual granularity should be larger in non-English Wordnets. Based on this hypothesis, the author calculated the conceptual granularity of the words included in English (https://wordnet.princeton.edu/), Japanese1), Indonesian2), Chinese3) and Malaysian4) WordNet, and created a histogram of conceptual granularity for each language. Table 5 shows statistics of each WordNet for nouns, and figure 12 shows the result.
Table 5: Statistics of WordNet in each language (only nouns)
WordNet              Words     Synsets   Word-Synset Pairs
English WordNet      117,798   82,115    146,312
Japanese WordNet     66,003    42,737    99,419
Indonesian WordNet   28,185    24,829    51,087
Chinese WordNet      38,978    27,888    46,229
Malaysian WordNet    25,338    23,466    49,903
Figure 12 confirms that words with conceptual granularity 0 account for almost all words in English WordNet, while in non-English Wordnets the proportion of words with conceptual granularity of 1 or more is larger. Table 6 shows, for each WordNet, the proportion of words with each conceptual granularity among all words. These results confirm that multilingualization of WordNet causes the conceptual granularity of words to vary from language to language even if the words are associated with the same concept. It can be concluded that distortion between words and the conceptual structure arises from multilingualizing WordNet based on the English conceptual structure.
1) http://nlpwww.nict.go.jp/wn-ja/
2) http://wn-msa.sourceforge.net/
3) http://compling.hss.ntu.edu.sg/cow/
4) http://wn-msa.sourceforge.net/
Figure 12: Distribution of conceptual granularity for each language
Table 6: Proportions of words with each conceptual granularity to the entire words included in each language WordNet
WordNet              Words     Granularity 0   1       2       3       4
English WordNet      117,798   99.96%          0.04%   0.00%   0.00%   0.00%
Japanese WordNet     66,003    85.4%           13.2%   1.40%   0.02%   0.00%
Indonesian WordNet   28,185    71.5%           20.7%   6.61%   1.06%   0.07%
Chinese WordNet      38,978    89.3%           9.58%   0.99%   0.06%   0.01%
Malaysian WordNet    25,338    68.2%           23.5%   7.04%   1.18%   0.06%
4.3.2 Calculation of Confidence for Alignment Considering Conceptual Granularity
Section 4.3.1 described that many words with large conceptual granularity exist in multilingualized WordNet, and the following were indicated as causes of extracting incorrect translation pairs.
1. Matching a word with words that have different conceptual granularity
2. Matching a word with words that share only a part of its associated synsets, even if their conceptual granularities are equal
Considering these two causes, an ideal translation pair with respect to conceptual granularity is 'a pair of words which have the same conceptual granularity and are associated with the same concepts'. Figure 13 illustrates this situation.
Figure 13: Example of an ideal translation pair (a Japanese word and an Indonesian word are linked by jpn and ind edges to the same synsets, which are connected by hypo edges)
In order to find such ideal translation pairs by calculation on the graph structure, the author introduces structural equivalence, which is used in social network analysis to express similarity of position. If node A and node B in the same network have exactly the same relations with the other nodes in the network, node A and node B are called structurally equivalent. In other words, a set of nodes is structurally equivalent if exchanging the labels attached to the nodes does not change any relation.
Figure 14 shows a graph consisting of five nodes. In this graph, nodes A, C, D and E each have a relation only with node B. Therefore, if their labels are exchanged, the network structure does not change. In this graph, only node B has a unique position, and the other nodes A, C, D and E are structurally equivalent.
Under this definition, however, two nodes can only be in one of two states: structurally equivalent or not. Which relations are defined between nodes determines whether the nodes are structurally equivalent.
Figure 14: Example of a graph that contains structurally equivalent nodes (nodes A, C, D and E are each connected only to node B)
In a complex network such as one consisting of two Wordnets in different languages, however, there are nodes that are not strictly structurally equivalent but are very close to being so. Therefore, structural equivalence needs to be expressed as a continuous quantity.
The author uses the correlation coefficient (Pearson product-moment correlation coefficient) as the index of structural equivalence between two nodes in a graph. Basically, the correlation coefficient is applied to the row elements and column elements of the adjacency matrix. The correlation coefficient rij between node i and node j is defined as follows.
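\[
r_{ij} = \frac{\sum_{k \neq i,j} (x_{ki} - x_{\ast i})(x_{kj} - x_{\ast j}) + \sum_{k \neq i,j} (x_{ik} - x_{i\ast})(x_{jk} - x_{j\ast})}{\sqrt{\sum_{k \neq i,j} (x_{ki} - x_{\ast i})^2 + \sum_{k \neq i,j} (x_{ik} - x_{i\ast})^2}\,\sqrt{\sum_{k \neq i,j} (x_{kj} - x_{\ast j})^2 + \sum_{k \neq i,j} (x_{jk} - x_{j\ast})^2}} \tag{1}
\]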
In equation 1, xki is the (k, i) element of the adjacency matrix (and likewise for the others), x∗i is the average of column i and xi∗ is the average of row i (and likewise for the others), and k ≠ i, j. In other words, this correlation coefficient is an index of the degree of association between node i and node j in terms of their relations with the other nodes k. It takes the maximum value 1 when they are perfectly positively correlated, the minimum value −1 when they are perfectly negatively correlated, and 0 when they are uncorrelated. When constructing a bilingual dictionary, a value close to 1 is desirable.
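Equation 1 can be checked on the five-node graph of figure 14 with a short sketch. Two points are assumptions rather than statements from the text: the row and column averages are taken over the same entries as the sums (k ≠ i, j), and the coefficient is set to 0.0 when a denominator is zero, since equation 1 is undefined in that case. The function name structural_equivalence is illustrative.

```python
import math


def structural_equivalence(adj, i, j):
    """Correlation of equation 1 between nodes i and j of a 0/1 adjacency matrix."""
    ks = [k for k in range(len(adj)) if k != i and k != j]
    if not ks:
        return 0.0

    def centered(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    col_i = centered([adj[k][i] for k in ks])  # x_ki - x_*i
    col_j = centered([adj[k][j] for k in ks])  # x_kj - x_*j
    row_i = centered([adj[i][k] for k in ks])  # x_ik - x_i*
    row_j = centered([adj[j][k] for k in ks])  # x_jk - x_j*

    num = sum(a * b for a, b in zip(col_i + row_i, col_j + row_j))
    den = math.sqrt(
        sum(a * a for a in col_i + row_i) * sum(b * b for b in col_j + row_j)
    )
    # Assumed convention: report 0.0 when a node has no variation at all.
    return num / den if den else 0.0


# Graph of figure 14: A, C, D and E are each connected only to B.
names = ["A", "B", "C", "D", "E"]
star = [[0] * 5 for _ in names]
for leaf in (0, 2, 3, 4):
    star[leaf][1] = star[1][leaf] = 1

r_ac = structural_equivalence(star, 0, 2)
r_ab = structural_equivalence(star, 0, 1)
```

Nodes A and C relate to the remaining nodes in exactly the same way, so their coefficient is 1, whereas the pair A and B falls under the zero-denominator convention.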
Using figure 15 as an example, the procedure to extract translation pairs using structural equivalence is described. The graph in figure 15 is a subgraph of the graph made from Japanese WordNet and Indonesian WordNet; it consists of the synset nodes s1, s2 and s3 associated with a Japanese word node j1, and the Indonesian word nodes i1 and i2 associated with those synset nodes. After this graph (a directed graph) is converted into an adjacency matrix, the correlation coefficients shown at the top left of the figure are calculated using equation 1. Based on the resulting correlation coefficients, target language words (Indonesian in the figure) whose correlation value is at or above the threshold value are selected as translations of the source word (Japanese in the figure).
Figure 15: Example for calculation of structural equivalence value
The outline of bilingual dictionary construction using two Wordnets in different languages is as follows.
1. Extract one word from the source language
2. Extract the set of synsets associated with the source language word extracted in step 1
3. Extract the set of target words associated with the synsets in the set extracted in step 2
4. Extract, from the whole graph made from the source language WordNet and the target language WordNet, the subgraph consisting of the nodes extracted in steps 1 to 3 and the edges between them
5. Calculate structural equivalence values for the extracted subgraph using the correlation coefficient
6. Extract the target language words whose correlation value is at or above the threshold value as translations of the source language word
7. Repeat steps 1 to 6 for each source language word
Algorithm 3 presents the formalization of the whole procedure to construct a bilingual dictionary.
In the 17th line, neighbor of(wsl,i) is the function that returns the set of synset nodes neighboring wsl,i. In the 20th line, get neighbor target words(sj) is the function that returns the set of target word nodes neighboring synset node sj. In the 25th line, subgraph(G,NC) is the function that extracts from graph G the subgraph consisting of the nodes in node set NC and the edges between them. In the 26th line, calc correlation coefficient(SG) is the function that calculates the correlation coefficients for graph SG using equation 1. In the 28th line, get value(CM,wsl,i, n) is the function that returns the (wsl,i, n) element of the correlation coefficient matrix. In the 29th line, n belongs to Wtl returns true if node n belongs to the word set Wtl of the target language and false otherwise, and in the 30th line, make translation pair(wsl,i, n) is the function that makes a translation pair from words wsl,i and n.
Algorithm 3 Proposed method
1: sl /* Source language */
2: tl /* Target language */
3: wl,i /* i-th Word in language l */
4: Wl /* Words set in language l */
5: NC /* Set of words and synsets */
6: sj /* j-th Synset */
7: S /* Set of synsets (S = s1, · · · , s|S|) */
8: TW /* Set of words */
9: G /* Graph consisting of words and synsets in Wordnets */
10: SG /* Subgraph of G */
11: CM /* Correlation coefficient matrix */
12: cv /* Value in correlation coefficient matrix */
13: th /* Threshold value */
14: R /* Set of translation pairs */
15: for all wsl,i in Wsl do
16:   NC ← {wsl,i}
17:   S ← neighbor of(wsl,i) /* get the set of synset nodes neighboring wsl,i */
18:   for all sj in S do
19:     NC ← NC ∪ {sj}
20:     TW ← get neighbor target words(sj) /* get the set of target word nodes neighboring synset node sj */
21:     for all wtl,k in TW do
22:       NC ← NC ∪ {wtl,k}
23:     end for
24:   end for
25:   SG ← subgraph(G,NC) /* extract from graph G the subgraph consisting of the nodes in node set NC and the edges between them */
26:   CM ← calc correlation coefficient(SG) /* calculate correlation coefficients for graph SG */
27:   for all n in SG do
28:     cv ← get value(CM,wsl,i, n) /* get the (wsl,i, n) element of the correlation coefficient matrix */
29:     if cv ≥ th and n belongs to Wtl then
30:       R ← R ∪ {make translation pair(wsl,i, n)} /* make a translation pair from words wsl,i and n */
31:     end if
32:   end for
33: end for
34: return R
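The whole procedure can be sketched as below under several stated assumptions. Edge directions are not given in the text, so the sketch assumes edges run from word to synset and from hypernym to hyponym. It also assumes that averages are taken over k ≠ i, j, that a zero denominator yields 0.0, and that a small tolerance eps absorbs floating-point error when the threshold is exactly 1. All function names, the placeholder synset IDs s1 to s3 and the word "makhluk" are hypothetical; the words 生き物 and manusia follow the description of figure 10.

```python
import math


def correlation(adj, i, j):
    """Equation 1 on a 0/1 adjacency matrix; k ranges over nodes other than i and j."""
    ks = [k for k in range(len(adj)) if k not in (i, j)]
    if not ks:
        return 0.0

    def centered(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    vec_i = centered([adj[k][i] for k in ks]) + centered([adj[i][k] for k in ks])
    vec_j = centered([adj[k][j] for k in ks]) + centered([adj[j][k] for k in ks])
    num = sum(a * b for a, b in zip(vec_i, vec_j))
    den = math.sqrt(sum(a * a for a in vec_i) * sum(b * b for b in vec_j))
    return num / den if den else 0.0


def build_dictionary(src, tgt, hypo_edges, th, eps=1e-9):
    """src and tgt map each word to its set of synset IDs; returns translation pairs."""
    pairs = set()
    for word, synsets in src.items():
        # Target words sharing at least one synset with the source word.
        candidates = sorted(t for t, ss in tgt.items() if ss & synsets)
        nodes = [("src", word)]
        nodes += [("syn", s) for s in sorted(synsets)]
        nodes += [("tgt", t) for t in candidates]
        index = {node: k for k, node in enumerate(nodes)}
        adj = [[0] * len(nodes) for _ in nodes]
        for s in synsets:
            adj[0][index[("syn", s)]] = 1  # assumed direction: word -> synset
        for t in candidates:
            for s in tgt[t] & synsets:
                adj[index[("tgt", t)]][index[("syn", s)]] = 1
        for a, b in hypo_edges:
            if ("syn", a) in index and ("syn", b) in index:
                adj[index[("syn", a)]][index[("syn", b)]] = 1
        for t in candidates:
            if correlation(adj, 0, index[("tgt", t)]) >= th - eps:
                pairs.add((word, t))
    return pairs


# Toy graph in the spirit of figure 10, with placeholder synset IDs.
src = {"生き物": {"s1", "s2", "s3"}}
tgt = {"makhluk": {"s1", "s2", "s3"}, "manusia": {"s2", "s3"}}
hypo = [("s1", "s2"), ("s2", "s3")]

strict = build_dictionary(src, tgt, hypo, th=1.0)
loose = build_dictionary(src, tgt, hypo, th=0.5)
```

In this toy graph only the target word attached to exactly the same three synsets survives th = 1.0, while the word that covers only the lower two synsets is accepted once the threshold is lowered to 0.5.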
Chapter 5 Evaluation
In this chapter, the author evaluates the proposed method described in Chapter 4. After describing the experimental settings, the author describes the evaluation results.
5.1 Experimental Settings
In this experiment, the author constructs a Japanese-Indonesian dictionary using the method proposed in this research and evaluates the degree of improvement.
Japanese WordNet1) and WordNet Bahasa2), which are provided in Open Multi-lingual WordNet3), are used in this experiment as language resources. “Weblio Indonesian Dictionary” and its harmonized dictionary[1] are used as the answer Japanese-Indonesian dictionary and the answer Indonesian-Japanese dictionary, respectively. Table 7 shows the number of noun headwords in each dictionary.
Table 7: The number of headwords of an answer dictionary (only nouns)
Dictionary Name                            Language Pair         Headwords
Weblio Indonesian Dictionary               Japanese-Indonesian   9,832
Harmonized Weblio Indonesian Dictionary    Indonesian-Japanese   10,079
In addition, a Japanese Web corpus with appearance frequency data for 89,357 words, provided by Kurohashi Laboratory in Kyoto University, is used as Japanese test data. Table 8 shows the statistics of each WordNet (only nouns).
1) http://compling.hss.ntu.edu.sg/omw/
2) http://wn-msa.sourceforge.net/index.eng.html
3) http://compling.hss.ntu.edu.sg/omw/
Table 8: Statistics of Japanese WordNet and WordNet Bahasa (only nouns)
WordNet Name       Words    Synsets   Word-Synset Pairs
Japanese WordNet   66,003   42,737    99,419
WordNet Bahasa     28,185   24,829    51,087
Each WordNet provides a synsetID, a classification and content in the format shown in table 9. The classification is one of lemma, which marks a word associated with the synset, def, which marks the definition of the concept, and exe, which marks an example sentence using the lemma. In addition, data that define relations between synsets are provided in XML format. In the experiment, the lemma entries of nouns and the hyponym relations in the relation definitions are used to construct the WordNet graph.
Table 9: Format of data of WordNet
SynsetID     Classification   Content
00001740-a   jpn:lemma        可能
00001740-n   jpn:lemma        実体
00001837-r   jpn:lemma        西暦
...
11820323-n   jpn:def          南アフリカ産の茎のない多肉植物の属
04239436-n   jpn:def          傷ついた前腕を支持する包帯
11427067-n   jpn:def          太陽の大気圏の最も外側の領域
...
07316856-n   jpn:exe          彼はとうとう大きなチャンスをつかんだ
08438384-n   jpn:exe          彼らは、木の小台を切った
07082198-n   jpn:exe          彼は重い口を開いた
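Selecting the noun lemma entries from data in the format of table 9 can be sketched as follows. The sketch assumes that the three fields are separated by tab characters and that noun synsets are those whose synsetID ends in "-n"; the function name load_noun_lemmas is illustrative.

```python
def load_noun_lemmas(lines, lang):
    """Map each noun lemma of the given language to the set of its synsetIDs."""
    word_to_synsets = {}
    tag = f"{lang}:lemma"
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip comments and malformed lines
        synset_id, classification, content = parts
        # Keep lemma entries of noun synsets; def and exe entries are ignored.
        if classification == tag and synset_id.endswith("-n"):
            word_to_synsets.setdefault(content, set()).add(synset_id)
    return word_to_synsets


# Three rows taken from table 9, joined with the assumed tab separator.
sample = [
    "00001740-a\tjpn:lemma\t可能",
    "00001740-n\tjpn:lemma\t実体",
    "11820323-n\tjpn:def\t南アフリカ産の茎のない多肉植物の属",
]

noun_lemmas = load_noun_lemmas(sample, "jpn")
```

Only the second row survives the filter, because the first belongs to an adjective synset and the third is a definition.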
5.2 Evaluation of Interlingual Alignment Algorithm
In this section, a Japanese-Indonesian bilingual dictionary is constructed by the proposed method under the settings described in section 5.1 and evaluated. First, the method is evaluated with the threshold value th of Algorithm 3, described in section 4.3.2, set to its maximum value 1.
First, the number of headwords extracted by the proposed method is shown. Table 10 presents the numbers of Japanese-Indonesian word pairs, Japanese headwords and Indonesian headwords included in the constructed Japanese-Indonesian and Indonesian-Japanese bilingual dictionaries. Compared with the headwords of the answer dictionaries shown in table 7, about three times as many Japanese headwords and about twice as many Indonesian headwords are acquired.
Table 10: The number of word pairs, Japanese headwords and Indonesian headwords of constructed dictionaries
Dictionary Name       Word Pairs   Japanese headwords   Indonesian headwords
Japanese-Indonesian   73,279       31,788               20,011
Indonesian-Japanese   40,849       23,886               18,238
Figure 16 shows the numbers of translation pairs in the constructed dictionary and the answer dictionary, and their intersection. The intersection is small because the two dictionaries differ in character. Therefore, the ordinary evaluation method using the F-measure cannot be applied to the whole dictionary. Instead, the translation pairs in the intersection in figure 16 and the translation pairs included only in the constructed dictionary are evaluated separately.
Figure 16: The number of translation pairs in the constructed dictionary and the answer dictionary, and their intersection
First, the efficiency of the constructed bilingual dictionaries is estimated. Concretely speaking, the relations between recall, precision, F-measure and the threshold value th are evaluated using Weblio Indonesian Dictionary, focusing on the translation pairs included in the intersection in figure 16. Based on this result, the most suitable threshold for bilingual dictionary construction is discussed. Recall, precision and F-measure are defined as follows.
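\[
\mathrm{Recall} = \frac{|\text{Translations of Headwords} \cap \text{Answer Translations of Headwords}|}{|\text{Answer Translations of Headwords}|} \tag{2}
\]
\[
\mathrm{Precision} = \frac{|\text{Translations of Headwords} \cap \text{Answer Translations of Headwords}|}{|\text{Translations of Headwords}|} \tag{3}
\]
\[
F\text{-measure} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{4}
\]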
Translations of Headwords in equations 2 and 3 refers to the translations that the constructed bilingual dictionaries give for the source language headwords of the translation pairs included in the intersection in figure 16. Answer Translations of Headwords in equations 2 and 3 refers to the translations that the answer dictionary gives for the same source language headwords. In both cases, |·| counts the number of translations.
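Equations 2 to 4 can be computed directly on sets of translation pairs, as in the following sketch. The function name evaluate and the placeholder pairs (j1, i1) and so on are hypothetical and do not come from the experiment; returning 0.0 for an empty denominator is likewise an assumed convention.

```python
def evaluate(constructed, answer):
    """Return (precision, recall, F-measure) for two sets of translation pairs."""
    common = constructed & answer
    precision = len(common) / len(constructed) if constructed else 0.0
    recall = len(common) / len(answer) if answer else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Placeholder pairs: four extracted translations, two answer translations.
constructed = {("j1", "i1"), ("j1", "i2"), ("j1", "i3"), ("j1", "i4")}
answer = {("j1", "i1"), ("j1", "i5")}

precision, recall, f_measure = evaluate(constructed, answer)
```

Here one of four extracted pairs is correct and one of two answer pairs is found, so precision is 0.25, recall is 0.5 and the F-measure is one third.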
The relations between the threshold and recall, precision and F-measure are calculated using Weblio Indonesian Dictionary as the answer dictionary. The result is shown in figure 17a. With a threshold higher than 0.7, precision was about twice that of the baseline method (which is equivalent to a threshold of 0.0), but recall deteriorated remarkably. However, the F-measure rose by a factor of up to 1.4, so overall quality can be said to have improved. In addition, as shown in figure 17a, the F-measure is maximized when the threshold is 0.6. Focusing only on precision, a threshold between 0.8 and 1.0 maximizes precision but lowers recall.
(a) Relations between threshold and recall, precision and F-measure in Japanese-Indonesian bilingual dictionary
(b) Relations between threshold and recall, precision and F-measure in Indonesian-Japanese bilingual dictionary
Figure 17: Relations between threshold and recall, precision and F-measure
In a similar fashion, the relations between the threshold and recall, precision and F-measure are calculated using harmonized Weblio Indonesian Dictionary as the answer dictionary. The result is shown in figure 17b. With a threshold higher than 0.7, precision was about four times that of the baseline method (which is equivalent to a threshold of 0.0), but recall deteriorated remarkably. However, the F-measure rose by a factor of up to 2.2, so overall quality can be said to have improved. In addition, as shown in figure 17b, the F-measure is maximized when the threshold is 0.7. Focusing only on precision, a threshold between 0.8 and 1.0 maximizes precision but lowers recall. From the discussion above, extracting translation pairs with structural equivalence taken into account helps select correct translation pairs, but complete structural equivalence is too strict for bilingual dictionary construction. In addition, the threshold value th is estimated to be best set between 0.6 and 0.7.
Based on the analysis above, the translation pairs included only in the constructed dictionary in figure 16 are evaluated for accuracy with the threshold set to 0.7. Each translation pair in the constructed bilingual dictionary is scored on a three-grade scale: Acceptable, Partly Acceptable and Not Acceptable. The three grades are defined below.
Acceptable
It can be used as a translation.
Partly Acceptable
It can be used to tell what it means, but the meaning is not completely matched.
Not Acceptable
It cannot be used as a translation.
This evaluation was conducted by sampling translation pairs from the constructed bilingual dictionary and judging, together with a native Indonesian speaker, whether they can be used as translations according to the grades described above. In addition, recall was estimated using answer translation pairs made by the native Indonesian speaker. Precision and recall are defined in equations 5 and 6. Table 11 presents the evaluation result.
\[
\mathrm{Precision} = \frac{|\text{Acceptable Pairs}|}{|\text{Extracted Pairs}|} \tag{5}
\]
\[
\mathrm{Recall} = \frac{|\text{Acceptable Pairs}|}{|\text{Answer Translation Pairs}|} \tag{6}
\]
Table 11: Evaluation result
Method            Precision   Recall
Baseline method   66%         66%
Proposed method   81%         61%
As shown in Table 11, precision improved by about 24.2% compared with the baseline method described in section 4.1. This result indicates that extracting translation pairs using structural equivalence is effective.
So far, the efficiency of the constructed dictionaries has been evaluated. Next, the structure of the constructed Japanese-Indonesian dictionaries is evaluated quantitatively, and its relation to the threshold is analyzed. First, the distribution of the number of Indonesian translations per Japanese headword is compared with that of Weblio Indonesian Dictionary. The result is shown in figure 18a. All dictionaries, including Weblio Indonesian Dictionary, showed similar distributions resembling an exponential function. This result confirms that the constructed bilingual dictionaries have the same kind of translation relations as general dictionaries. Moreover, as the threshold rises, the number of words that have many translations decreases. This follows directly from the nature of structural equivalence. In addition, the constructed bilingual dictionaries have more translations than Weblio Indonesian Dictionary.
In addition, figure 20a shows how the numbers of Japanese headwords and translation pairs of the constructed dictionaries change with the threshold, together with the numbers of Japanese headwords and translation pairs of Weblio Indonesian Dictionary. The number of translation pairs in the constructed Japanese-Indonesian dictionary decreases sharply when the threshold is between 0.6 and 0.7. Therefore, the lost translation pairs were examined using graphs. As a result, it was revealed that words belonging to one common graph structure are no longer extracted as translations. Figure 19 shows one example of the extracted common graph. In this graph, the Japanese word 資本金 (capital) and the Indonesian word kapital (capital) (a red node in figure 19) share more nodes than the others. In the proposed method, such nodes tend to have a high structural equivalence value. However, these nodes are no longer extracted as translations when the threshold is over 0.6. From this result, it can be said that words with several independent meanings are increasingly excluded from the translations as the threshold becomes higher.
(a) Distribution of the number of translations for one Japanese word in the answer dictionary and constructed dictionaries of each threshold
(b) Distribution of the number of translations for one Indonesian word in the answer dictionary and constructed dictionaries of each threshold
Figure 18: Distribution of the number of translations for one word in the answer dictionary and constructed dictionaries of each threshold
Figure 19: Example of common graph structure extracted by analysis
Besides, when the threshold is 1.0, which is the strictest case, there are 35,983 Japanese headwords that are included only in the constructed dictionaries. This result implies that constructing bilingual dictionaries using WordNet can extend the number of headwords of existing bilingual dictionaries.
(a) Relations between threshold and the number of Japanese headwords of constructed bilingual dictionaries
(b) Relations between threshold and the number of Indonesian headwords of constructed bilingual dictionaries
Figure 20: Relations between threshold and the number of headwords of constructed bilingual dictionaries
In a similar fashion, the structure of the constructed Indonesian-Japanese dictionaries is evaluated quantitatively, and its relation to the threshold is analyzed. First, the distribution of the number of Japanese translations per Indonesian headword is compared with that of harmonized Weblio Indonesian Dictionary. The result is shown in figure 18b. All dictionaries, including harmonized Weblio Indonesian Dictionary, showed similar distributions resembling an exponential function. This result confirms that the constructed bilingual dictionaries have the same kind of translation relations as general dictionaries. Moreover, as the threshold rises, the number of words that have many translations decreases. This follows directly from the nature of structural equivalence. However, unlike the Japanese-Indonesian dictionary, the difference from harmonized Weblio Indonesian Dictionary is small.
In addition, Figure 20b shows how the number of Indonesian headwords and the
number of translation pairs of the constructed dictionaries change with the
threshold, together with the number of Indonesian headwords and translation
pairs of the harmonized Weblio Indonesian Dictionary. The result showed a
tendency similar to that of the Japanese-Indonesian case. The cause of the
decrease in the number of translation pairs turned out to be the same as in the
Japanese-Indonesian case.
Finally, a headword coverage rate, which shows how many headwords of the
constructed bilingual dictionary appear in a list of frequently used words, is
calculated based on the Japanese Web corpus with appearance frequency data
provided by Kurohashi Laboratory at Kyoto University. In this word list, an
appearance frequency value freq(w_i) is given to each word w_i. Concretely
speaking, the accumulative frequency of appearance (hereafter referred to as
AFA) and the accumulative accuracy rate (hereafter referred to as AAR) are
calculated for the word list. These values are formalized as follows.
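\[
\text{Accumulative Frequency of Appearance}(i) = \frac{\sum_{j=1}^{i} \mathrm{freq}(w_j)}{\sum_{j} \mathrm{freq}(w_j)} \tag{7}
\]
\[
\text{Accumulative Accuracy Rate}(i) = \frac{|\text{Words until rank } i \cap \text{Headwords}|}{|\text{Words until rank } i|} \tag{8}
\]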
Figure 21 shows the results for these two values.
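
The two values in equations (7) and (8) can be computed in a single pass over
the ranked word list. The following is a minimal sketch, assuming the word list
is given as (word, frequency) pairs sorted by descending frequency and the
headwords as a set. The function name coverage_curves and the toy data are
hypothetical and are not taken from the corpus used in this research.

```python
def coverage_curves(ranked_words, headwords):
    """Return (afa, aar), where index i-1 holds the value for rank i.

    ranked_words: list of (word, freq) pairs, most frequent word first.
    headwords:    set of headwords of the constructed dictionary.
    """
    total_freq = sum(freq for _, freq in ranked_words)
    afa, aar = [], []
    running_freq = 0
    hits = 0
    for rank, (word, freq) in enumerate(ranked_words, start=1):
        running_freq += freq
        if word in headwords:
            hits += 1
        # Equation (7): share of all occurrences covered by ranks 1..i.
        afa.append(running_freq / total_freq)
        # Equation (8): share of the words at ranks 1..i that are headwords.
        aar.append(hits / rank)
    return afa, aar


# Toy data (hypothetical values, for illustration only).
ranked = [("a", 50), ("b", 30), ("c", 15), ("d", 5)]
heads = {"a", "c", "d"}
afa, aar = coverage_curves(ranked, heads)
for rank, (f, r) in enumerate(zip(afa, aar), start=1):
    print(rank, round(f, 3), round(r, 3))
```

Plotting aar against afa for the dictionary built at each threshold would give
curves of the kind described for Figure 21.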
As shown in Figure 21, AAR is about 75% when AFA is about 75% and the threshold
is between 0.0 and 0.3, but AAR becomes lower as the threshold value becomes
higher. From this result, it can be said that the Japanese headwords of the
constructed Japanese-Indonesian bilingual dictionary cover most of the
frequently used Japanese words. However, as the threshold value is raised to
improve accuracy, the headword coverage rate becomes lower. There is a clear
trade-off between accuracy and word coverage rate in the proposed method.
Figure 21: Headword coverage rate calculated using Japanese Web corpus with
appearance frequency data
Chapter 6 Discussion
In this chapter, the effectiveness and problems of the proposed method are
discussed based on the results shown in Chapter 5.
In this research, bilingual dictionaries that include only nouns were
constructed. However, Wordnets also contain verbs, adjectives and adverbs. In
the conceptual structure of WordNet, both nouns and verbs are organized into
hierarchies in which all synsets are linked to a unique beginner synset. Noun
hierarchies are far deeper than verb hierarchies, and verb hierarchies are more
complex than noun hierarchies. Moreover, for adjectives, two central antonyms
form binary poles, while satellite synonyms connect to their respective poles
via similarity relations. Adverbs are organized into an adjective-like
structure because they are defined based on the adjectives from which they
originate.
In the proposed method, in order to calculate a structural equivalence matrix
that is not affected by the whole graph structure, subgraphs are extracted with
a focus on the concepts associated with the words to be translated. Therefore,
the proposed method is considered not to be affected by parts of speech. In
addition, in the area of social network analysis, structural equivalence is
used to find employees who have similar roles. Thus, the proposal is very
likely to work well in hierarchical structures such as trees. Therefore, if
structural equivalence is calculated for the whole graph, it is likely to work
well for nouns and verbs. However, in that case, computational complexity will
become an issue because huge adjacency matrices have to be handled. For
adjectives and adverbs, it can be predicted that a clustering approach is
likely to improve accuracy, because the discussion above indicates that
clusters exist in their graphs.
In this research, relations between words are simplified because only hyponym
relations are used to define graph structures. Therefore, structural
equivalence is considered to work well. However, it is possible that not all
words in Wordnets can be included in the constructed bilingual dictionaries
because of this restriction on relations between words. As shown in Tables 8
and 10, half of the Japanese words in WordNet are not included in the
constructed bilingual dictionaries. Therefore, in order to improve recall and
enrich headwords, relations other than hyponymy need to be considered.
In the experiment, the F-measure was highest when the threshold value th was
between 0.6 and 0.7 for the Japanese-Indonesian language pair; in terms of the
correlation coefficient, this range corresponds to a moderately strong
correlation. This can be considered to correlate with the language distance
from English to each language. Previous research[3] revealed that one-to-one
word mapping is valid when the source language and the target language are
close. In other words, the conceptual granularity of words is equal when
languages are closely related. Therefore, it can be predicted that the
F-measure is highest when the threshold value th is close to 1.0 for similar
languages. These predictions have to be verified. In addition, the experiment
results showed that there is a trade-off between accuracy and word coverage
rate. Therefore, it is necessary to decide which index is more important for
the desired dictionary. Moreover, the nature of the translation pairs changes
according to the threshold value. As the threshold rises, the polysemy of the
words in translation pairs decreases. Thus, the threshold should be set to a
higher value if polysemous words need to be excluded.
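
The dependence of the F-measure on the threshold th can be illustrated with a
small sketch. It assumes that each candidate translation pair carries a
structural equivalence score, that a pair is extracted when its score is
greater than or equal to th (whether the comparison is inclusive is an
assumption), and that an answer dictionary is available as a set of pairs. The
function name f_measure_at, the placeholder words and all scores are
hypothetical and are not experimental results.

```python
def f_measure_at(scored_pairs, answer_pairs, th):
    """Return (precision, recall, f) for the pairs with score >= th.

    scored_pairs: dict mapping (source_word, target_word) to a score.
    answer_pairs: set of (source_word, target_word) judged correct.
    """
    extracted = {pair for pair, score in scored_pairs.items() if score >= th}
    if not extracted or not answer_pairs:
        return 0.0, 0.0, 0.0
    correct = len(extracted & answer_pairs)
    precision = correct / len(extracted)
    recall = correct / len(answer_pairs)
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Placeholder words and made-up scores (hypothetical, illustration only).
scored = {
    ("j1", "i1"): 0.55,  # correct pair with a moderate score
    ("j1", "i2"): 1.0,
    ("j2", "i3"): 1.0,
    ("j2", "i4"): 0.2,   # incorrect pair with a low score
}
answer = {("j1", "i1"), ("j1", "i2"), ("j2", "i3")}

for th in (0.0, 0.5, 1.0):
    p, r, f = f_measure_at(scored, answer, th)
    print(th, round(p, 3), round(r, 3), round(f, 3))
```

On this toy input a low threshold keeps the incorrect pair and lowers
precision, a threshold of 1.0 drops a correct pair and lowers recall, and the
middle threshold gives the highest F-measure.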
Finally, this research is likely to enrich language resources between a
highly-resourced language and a less-resourced language, and between
less-resourced languages, because, unlike other existing approaches, the
proposed method needs only the Wordnets of both languages. This is a
contribution to society.
Chapter 7 Conclusion
Machine-readable bilingual dictionaries are essential for machine translation
and inter-lingual information retrieval. Existing approaches to bilingual
dictionary construction are classified into two classes: induction with
multiple bilingual dictionaries and statistical approaches with parallel
corpora.
In the former case, the issue is how to remove incorrect translation pairs
that arise from the ambiguity of pivot words. In the latter case, similarity
between languages is measured by modeling languages based on corpora. However,
due to its dependence on orthographic features of languages, this approach has
low efficiency with distant language pairs. Correct alignment of words based on
their concepts, without being restricted by features of languages, is needed
for bilingual dictionary construction.
In this research, the author constructs bilingual dictionaries using
inter-lingual alignment of words based on the concepts expressed by words,
using a conceptual dictionary called WordNet. Concretely speaking, English
WordNet and Wordnets in different languages, which are constructed based on the
conceptual structure of English WordNet, are used. Since the English WordNet
and the wordnets in different languages share the same conceptual structure,
translation pairs are extracted by aligning the concepts of words between
wordnets in different languages through English WordNet. However, in that case,
there were issues such as the definition of conceptual granularity based on
graph structure and the calculation of confidence for alignment considering
conceptual granularity. The contributions of this research are as follows.
Definition of Conceptual Granularity Based on Graph Structure
The conceptual granularity of words is defined in order to align words based on
concepts. An analysis of the distribution of the concept density of words in
several wordnets revealed that relations between concepts and words are
distorted by the occurrence of words with large concept density in wordnets in
different languages.
Concretely speaking, graphs are used to represent relations between words and
concepts, and conceptual granularity is defined as the diameter of the graph
consisting of the neighboring concepts of a word. In order to examine the
effects of concept density on the resulting bilingual dictionaries, the author,
in cooperation with a native Indonesian speaker, extracted incorrect
translations from a bilingual dictionary constructed by the baseline method
using Japanese WordNet and WordNet Bahasa. Based on this result, the sub-graphs
that cause incorrect translations are extracted, and the causes of incorrect
translations are identified as matching between words with different conceptual
granularity and sharing of part of a concept among words with large conceptual
granularity.
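
The definition of conceptual granularity as a graph diameter can be sketched as
follows. This is only an illustration under stated assumptions: the graph of a
word is taken to consist of the concepts attached to the word plus the concepts
one hyponym link away from them, links are treated as undirected, and the
diameter is the longest shortest path among reachable node pairs. The function
names, the concept identifiers and the toy links are hypothetical, and the
exact node set used in this research may differ.

```python
from collections import deque


def bfs_distances(adj, start):
    """Shortest-path lengths (in links) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def diameter(adj):
    """Longest shortest path over reachable pairs; 0 for an empty graph."""
    return max((max(bfs_distances(adj, n).values()) for n in adj), default=0)


def conceptual_granularity(word, word_to_concepts, hyponym_edges):
    """Diameter of the graph around the concepts of one word (assumed form)."""
    own = set(word_to_concepts.get(word, ()))
    # Assumption: include every concept that shares a hyponym link
    # with one of the word's own concepts.
    nodes = set(own)
    for a, b in hyponym_edges:
        if a in own or b in own:
            nodes.update((a, b))
    adj = {n: set() for n in nodes}
    for a, b in hyponym_edges:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    return diameter(adj)


# Toy concept graph (hypothetical identifiers, illustration only).
edges = [("c1", "c2"), ("c2", "c3"), ("c3", "c4"), ("c5", "c6")]
concepts = {"narrow_word": ["c5"], "broad_word": ["c1", "c3"]}
narrow = conceptual_granularity("narrow_word", concepts, edges)
broad = conceptual_granularity("broad_word", concepts, edges)
print(narrow, broad)
```

In this toy graph the word attached to two distant concepts obtains a larger
diameter than the word attached to a single concept, which is the kind of
difference in granularity that the cause analysis above refers to.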
Calculation of Confidence for Alignment considering Conceptual Granularity
In order to align words considering conceptual granularity, the author defined
the confidence for alignment as structural equivalence in the graph structure
and proposed a bilingual dictionary construction algorithm using this
confidence value. Concretely speaking, the correlation coefficient between
elements of the adjacency matrix is used to calculate structural equivalence in
graphs. In addition, in order to calculate structural equivalence, an algorithm
to extract the subgraphs related to bilingual dictionary construction is
proposed.
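
Structural equivalence based on the correlation coefficient can be sketched as
the Pearson correlation between two rows of an adjacency matrix. The sketch
below makes several assumptions that are not stated in this summary: the graph
is undirected, the columns of the two compared nodes themselves are skipped,
and an undefined correlation (a row with zero variance) is treated as 0.0. The
function names and the toy graph are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: undefined correlation counts as no equivalence
    return cov / sqrt(var_x * var_y)


def structural_equivalence(adj_matrix, i, j):
    """Correlate rows i and j, skipping columns i and j (assumed convention)."""
    cols = [k for k in range(len(adj_matrix)) if k not in (i, j)]
    row_i = [adj_matrix[i][k] for k in cols]
    row_j = [adj_matrix[j][k] for k in cols]
    return pearson(row_i, row_j)


# Toy graph (hypothetical): node 0 is a Japanese word, nodes 1 and 2 are
# Indonesian words, and nodes 3 to 5 are concepts.
# Word 0 -> concepts 3, 4; word 1 -> concepts 3, 4; word 2 -> concepts 4, 5.
A = [
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
]
se_01 = structural_equivalence(A, 0, 1)
se_02 = structural_equivalence(A, 0, 2)
print(se_01, se_02)
```

In the toy graph, words 0 and 1 are linked to exactly the same concepts and are
therefore perfectly correlated, while words 0 and 2 share only one concept and
obtain a lower value, so a threshold on this value would keep the first pair
and drop the second.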
The quality of the bilingual dictionary extracted by the proposed method is
evaluated by applying the algorithm to Japanese WordNet and WordNet Bahasa. As
a result, a Japanese-Indonesian dictionary including 73,279 word pairs and an
Indonesian-Japanese dictionary including 40,849 were extracted when the
threshold value was 1.0. The accuracy of the bilingual dictionary is evaluated
by extracting sample pairs from the dictionary and classifying them into three
grades: Acceptable, Partly Acceptable and Not Acceptable. The accuracy was
about 82%. In addition, the effect of the threshold value is evaluated by
measuring the number of headwords, the distribution of the number of
translations for each word, the F-measure, and the Japanese headword coverage
rate. As a result, it was revealed that a threshold value of 1.0, which means
complete structural equivalence, is too strict for bilingual dictionary
construction using Wordnets.
In this research, bilingual dictionaries that include only nouns are
constructed using only hyponym relations in WordNet. However, as several other
relations are defined in WordNet, a method to utilize these relations is
required. Moreover, a method to construct bilingual dictionaries that include
not only nouns is also required. These problems need to be addressed in future
research. In addition, evaluation for other language pairs needs to be carried
out in future research, because WordNet is available as a language resource in
many languages.
Acknowledgments
The author would like to express sincere gratitude to the supervisor, Professor
Toru Ishida at Kyoto University, for his continuous guidance, valuable advice,
and helpful discussions.
The author would like to express his appreciation to the advisers, Associate
Professor Yohei Murakami at Kyoto University Design School and Associate
Professor Hiroaki Ohshima at Kyoto University, for their valuable advice.
The author would like to thank all members of the project, especially Arbi Haza
Nasution, for their technical advice.
Finally, the author would like to thank all members of Ishida and Matsubara
laboratory for their support and discussions.
References
[1] Tanaka, K. and Umemura, K.: Construction of a bilingual dictionary
intermediated by a third language, Proceedings of the 15th conference on
Computational linguistics-Volume 1, Association for Computational
Linguistics, pp. 297–303 (1994).
[2] Tsuchiya, M., Purwarianti, A., Wakita, T. and Nakagawa, S.: Expanding
Indonesian-Japanese small translation dictionary using a pivot language,
Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster
and Demonstration Sessions, Association for Computational Linguistics,
pp. 197–200 (2007).
[3] Wushouer, M., Lin, D., Ishida, T. and Hirayama, K.: A Constraint
Approach to Pivot-based Bilingual Dictionary Induction, Asian Language
Information Processing, 2015 ACM Transactions on, ACM (2015).
[4] Tokunaga, T. and Tanaka, H.: The automatic extraction of conceptual
items from bilingual dictionaries, PRICAI-90, pp. 304–309 (1990).
[5] Tanaka, K. and Iwasaki, H.: Extraction of lexical translations from
non-aligned corpora, Proceedings of the 16th conference on Computational
linguistics-Volume 2, Association for Computational Linguistics,
pp. 580–585 (1996).
[6] Mohanty, R., Bhattacharyya, P., Pande, P., Kalele, S., Khapra, M. and
Sharma, A.: Synset based multilingual dictionary: Insights, applications
and challenges, Global Wordnet Conference, pp. 22–25 (2008).
[7] Miller, G. A.: WordNet: a lexical database for English, Communications
of the ACM, Vol. 38, No. 11, pp. 39–41 (1995).
[8] Fellbaum, C.: WordNet, Wiley Online Library (1998).
[9] Boyd-Graber, J., Fellbaum, C., Osherson, D. and Schapire, R.: Adding
dense, weighted connections to WordNet, Proceedings of the Third Global
WordNet Meeting, Jeju (2006).
[10] Mohamed Noor, N., Sapuan, S. and Bond, F.: Creating the Open Wordnet
Bahasa, Proceedings of the 25th Pacific Asia Conference on Language,
Information and Computation (PACLIC 25), Singapore, pp. 258–267 (2011).
[11] Bond, F. and Foster, R.: Linking and Extending an Open Multilingual
Wordnet, Sofia (2013).
[12] Bond, F., Isahara, H., Kanzaki, K. and Uchimoto, K.: Boot-strapping a
WordNet using multiple existing WordNets (2008).
[13] Isahara, H., Bond, F., Uchimoto, K., Utiyama, M. and Kanzaki, K.:
Development of the Japanese WordNet, LREC (2008).
[14] Lim, L. T. and Hussein, N.: Fast prototyping of a Malay wordnet
system, Proceedings of the Language, Artificial Intelligence and Computer
Science for Natural Language Processing (LAICS-NLP) Summer School
Workshop, pp. 13–16 (2006).
[15] Genzel, D.: Inducing a multilingual dictionary from a parallel multitext
in related languages, Proceedings of the conference on Human Language
Technology and Empirical Methods in Natural Language Processing,
Association for Computational Linguistics, pp. 875–882 (2005).
[16] Lafourcade, M.: Multilingual dictionary construction and services-case
study with the fe* projects, Proc. of PACLING, pp. 289–306 (1997).