
Master Thesis

Bilingual Dictionary Construction Based on Interlingual Alignment between Wordnets in Different Languages

Supervisor Professor Toru ISHIDA

Department of Social Informatics

Graduate School of Informatics

Kyoto University

Taketo SASAKI

February 12, 2016


Abstract

Machine-readable bilingual dictionaries are essential for machine translation and inter-lingual information retrieval. Existing approaches to bilingual dictionary construction fall into two directions: induction with multiple bilingual dictionaries and statistical approaches with parallel corpora.

In the former, the issue is how to remove incorrect translation pairs that arise from the ambiguity of pivot words. In the latter, the similarity between languages is measured by modeling the languages from corpora. However, because it depends on orthographic features of the languages, this approach is less effective for constructing dictionaries of distant language pairs. Therefore, bilingual dictionary construction needs a way to align words correctly based on their concepts, without being restricted by the features of particular languages.

In this research, the author introduces an approach that constructs bilingual dictionaries through inter-lingual alignment of words, using a conceptual dictionary called WordNet. Concretely, this thesis uses the English WordNet and wordnets of other languages that were constructed based on the English WordNet. Since the English WordNet and the wordnets of other languages share the same conceptual structure, translation pairs can be extracted by aligning the concepts of words between wordnets through the English WordNet. This research focuses on the following problems.

Definition of Conceptual Granularity Based on Graph Structure

In the wordnet of a language, words are associated with the conceptual structure of English because the wordnet is constructed based on the English WordNet. However, some languages have words that express a broader or narrower concept than English words. To associate a source language word with a target language word based on their concepts, a graph-based measure of the conceptual granularity of a word is needed.

Calculation of Confidence for Alignment Considering Conceptual Granularity

Words in different languages may express the same concept but differ in conceptual granularity. Therefore, to link words based on their concepts, we first need to calculate a confidence for inter-lingual word alignment based on conceptual granularity, and then need an algorithm that extracts translation pairs based on the confidence value.

For the first problem, graphs are used to represent the relations between words and concepts. Conceptual granularity is defined as the diameter of the graph that consists of the neighboring concepts of a word. To examine the effects of conceptual granularity, a dictionary constructed by a baseline method was analyzed. Based on the analysis, the cause of incorrect translations is defined as matching between words that have different conceptual granularity and share only part of their concepts. For the second problem, structural equivalence in the graph is used to calculate confidence, because structurally equivalent nodes in the WordNet graph represent words that have the same conceptual granularity and are associated with the same concepts. To calculate structural equivalence, an algorithm that extracts the subgraph relevant to bilingual dictionary construction is proposed. To evaluate the accuracy of a bilingual dictionary, we extract sample pairs from the dictionary and classify them into three grades. The results showed that the proposed method can construct dictionaries with an accuracy of about 82%. In addition, the effect of the threshold value in the proposed algorithm was evaluated. The results revealed that complete structural equivalence is too strict for bilingual dictionary construction using wordnets. The main contributions of this research are as follows.


Definition of Conceptual Granularity Based on Graph Structure

The conceptual granularity of words is defined in order to align words based on their concepts. An analysis of the distribution of conceptual granularity in several wordnets revealed that the relations between concepts and words are distorted in wordnets in different languages.

Calculation of Confidence for Alignment Considering Conceptual Granularity

To align words while considering conceptual granularity, the author defined the confidence for alignment as structural equivalence in the graph and proposed a bilingual dictionary construction algorithm using wordnets.


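Abstract (English translation of the Japanese abstract)

Machine-readable bilingual dictionaries are indispensable for machine translation and cross-lingual information retrieval. Methods for constructing such dictionaries automatically fall broadly into two groups. One connects two existing bilingual dictionaries through a resource-rich language such as English and generates translation pairs inductively; the other generates translation pairs statistically from language resources other than bilingual dictionaries, such as parallel corpora.

In the former, the two dictionaries are connected through the words of the intermediate language rather than through word senses, so the challenge is to remove the inappropriate translation pairs caused by the polysemy of intermediate-language words. In the latter, corpora are used to build models of the two languages, and their similarity is measured using a seed dictionary. However, because this approach depends heavily on the orthography of the languages, it has been shown to be less effective for distant language pairs. To construct bilingual dictionaries between different languages, the concepts expressed by the words of each language must be aligned accurately, without being constrained by the features of the languages.

This research therefore uses the interlingual alignment of concepts in a conceptual dictionary called WordNet to extract translation relations between the words that express those concepts and to generate bilingual dictionaries automatically. Specifically, it uses the English WordNet and the WordNets built for other languages on the basis of its conceptual structure (hereafter, wordnets in different languages). Because the English WordNet and the wordnets in different languages share the same conceptual structure, concepts are aligned between wordnets in different languages through the English WordNet, and translation relations are extracted. This thesis addresses the following problems.

Definition of Conceptual Granularity Based on Graph Structure

Because wordnets in different languages were built from the English WordNet, each language is mapped onto the conceptual system of English. However, depending on the language, some words express a broader meaning than English words and others express a narrower one. To align words based on concepts, a measure is needed that expresses the granularity of concepts based on graph structure such as hypernym and hyponym relations.

Calculation of Confidence for Alignment Considering Conceptual Granularity

Words in different languages that express the same concept may differ in conceptual granularity. To align words based on concepts, a confidence for interlingual concept alignment based on conceptual granularity is therefore needed, together with an algorithm that extracts translation pairs based on that confidence.

For the first problem, a graph structure was used to express the relations between the conceptual system and words quantitatively, and the conceptual granularity of a word was expressed as the diameter of the subgraph formed by the word's neighboring concepts. To investigate the influence of conceptual granularity on bilingual dictionaries, a bilingual dictionary was generated from the Japanese and Indonesian WordNets, and mistranslations were classified in cooperation with Indonesian speakers. Using the results, the subgraphs that give rise to mistranslations were extracted from the graph, and the causes of mistranslation were defined as matching between words of different conceptual granularity and partial sharing of concepts between words of large conceptual granularity. For the second problem, confidence was calculated using structural equivalence in graphs, because two words being structurally equivalent in the graph that represents the relations between the conceptual system and words means that they are associated with the same set of concepts at the same conceptual granularity. To compute structural equivalence, an algorithm was proposed that extracts from the whole graph the subgraph relevant to creating translation pairs. To confirm the effectiveness of the proposed method, a bilingual dictionary created with it was evaluated, yielding a precision of about 81%. The effect of the threshold that determines translation pairs in the proposed algorithm was also evaluated; the results showed that complete structural equivalence is too strong a constraint for constructing bilingual dictionaries with WordNet. The contributions of this research are as follows.

Definition of Conceptual Granularity Based on Graph Structure

To align words based on concepts, the granularity of the concept expressed by a word was defined as conceptual granularity. An analysis of the distribution of conceptual granularity in the WordNets of various languages showed that, in wordnets in different languages, the occurrence of words with large conceptual granularity distorts the relations between words and the conceptual system.

Calculation of Confidence for Alignment Considering Conceptual Granularity

To align words while taking conceptual granularity into account, the confidence of an alignment was defined as structural equivalence in the graph structure, and a bilingual dictionary generation algorithm using this confidence was proposed.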

Contents

Chapter 1 Introduction
Chapter 2 Bilingual Dictionary Construction
  2.1 Induction with Multiple Bilingual Dictionaries
  2.2 Statistical Approach with Parallel Corpora
  2.3 Induction with WordNet Synsets
Chapter 3 WordNet
  3.1 Princeton WordNet
  3.2 Multilingualization of WordNet
Chapter 4 Bilingual Dictionary Construction Using Wordnets in Different Languages
  4.1 Interlingual Alignment between Wordnets in Different Languages using WordNet
  4.2 Issues in Wordnets in Different Languages
  4.3 Interlingual Alignment between Wordnets in Different Languages Using Structural Equivalence
    4.3.1 Definition of Conceptual Granularity Based on Graph Structure
    4.3.2 Calculation of Confidence for Alignment Considering Conceptual Granularity
Chapter 5 Evaluation
  5.1 Experimental Settings
  5.2 Evaluation of Interlingual Alignment Algorithm
Chapter 6 Discussion
Chapter 7 Conclusion
Acknowledgments
References

Chapter 1 Introduction

Recently, internationalization and the spread of the Internet have been increasing our chances of reading web documents written in foreign languages. These include not only documents written in a global common language such as English but also documents written in languages that few non-native speakers speak, such as Southeast Asian languages. Computer support is essential for reading such documents. In the field of NLP (Natural Language Processing), a wide variety of achievements supporting such activities have been reported. In particular, a machine-readable bilingual dictionary, which provides pairs of words from different languages that express the same meaning (called translation pairs), is essential for machine translation and cross-language information retrieval.

Generally, languages with few language resources such as bilingual dictionaries and corpora are called endangered languages, and languages with many language resources are called high-resourced languages. High-quality bilingual dictionaries tend to be constructed between high-resourced languages, whereas bilingual dictionaries between endangered languages, or between a high-resourced language and an endangered language, are unavailable or of low quality. Therefore, bilingual dictionary construction methods that can be applied to any language pair are required.

A variety of studies on bilingual dictionary construction have been conducted. Above all, approaches using existing bilingual dictionaries or other language resources are useful for less-resourced languages. Existing approaches can be classified into two classes: induction with multiple bilingual dictionaries and statistical approaches with parallel corpora.


First, induction with multiple bilingual dictionaries constructs a bilingual dictionary through a third language. In this approach, a bilingual dictionary between a source language and a target language is obtained by using a third language as the pivot language. Its naive implementation proceeds as follows. For each word in source language A, take its translations into the pivot language B using bilingual dictionary A-B; then, for each such pivot translation, take its translations into the target language C using bilingual dictionary B-C. Bilingual dictionary A-C is obtained by associating the words in A with the induced words in C. However, a dictionary produced by this procedure contains incorrect translation pairs because of polysemous and ambiguous words in the pivot language. Take Japanese-English-French as an example. The Japanese word “パーティ” (party), which means a social gathering, is correctly translated into the English word “party”. However, when “party” is translated into French, it becomes either “partie (政党)”, which means a formally constituted political group, or “parti (パーティ)”. In this case, translating “パーティ” into “partie (政党)” is incorrect. Therefore, the issue in this approach is to extract only the correct translations while coping with the ambiguity of the pivot language.
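The naive procedure above can be sketched in a few lines of Python. This is only an illustration: the function name `induce_via_pivot` and the toy dictionary entries are assumptions made for the example (they use the Japanese-English-Indonesian words 運命, fortune, nasib and kekayaan), not data or code from this thesis.

```python
from collections import defaultdict


def induce_via_pivot(src_to_pivot, pivot_to_tgt):
    """Pair each source word with every target word reachable through a pivot word."""
    induced = defaultdict(set)
    for src_word, pivot_words in src_to_pivot.items():
        for pivot_word in pivot_words:
            # Every sense of the pivot word is followed blindly, which is
            # where the incorrect translation pairs come from.
            induced[src_word] |= pivot_to_tgt.get(pivot_word, set())
    return dict(induced)


# Toy entries (illustrative only): "fortune" can mean destiny or wealth.
ja_en = {"運命": {"fortune"}}                # 運命 means destiny
en_id = {"fortune": {"nasib", "kekayaan"}}  # nasib: destiny, kekayaan: wealth

ja_id = induce_via_pivot(ja_en, en_id)
print(ja_id)
```

Both Indonesian words are returned for 運命, although only nasib shares its meaning; the pair with kekayaan is the kind of error that has to be filtered out.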

Second, statistical approaches with parallel corpora extract bilingual dictionaries statistically from parallel corpora. A typical implementation aligns the texts to each other at the chunk or word level. Alternatively, monolingual corpora are used to construct probabilistic models for measuring the similarity between two languages. However, this approach has been shown to be less effective for distant language pairs because of its heavy dependence on orthographic features of the languages. In summary, constructing bilingual dictionaries requires aligning words correctly based on their concepts, without being restricted by the features of the languages.

In this research, the author constructs bilingual dictionaries through inter-lingual alignment of words based on the concepts they express, using a conceptual dictionary called WordNet. Concretely, the English WordNet and wordnets in different languages, which are constructed based on the conceptual structure of the English WordNet, are used. Since the English WordNet and the wordnets in different languages share the same conceptual structure, translation pairs are extracted by aligning the concepts of words between wordnets in different languages through the English WordNet. This research addresses the following problems.


Definition of Conceptual Granularity Based on Graph Structure

In wordnets in different languages, words are associated with the conceptual structure of English because the wordnets are constructed based on the English WordNet. However, some words in a given language express a broader concept than English words, and others express a narrower concept. To associate source language words with target language words based on their concepts, a measure is needed that expresses the conceptual granularity of a word based on the graph structure formed by the hyponym and hypernym relations in WordNet.


Calculation of Confidence for Alignment Considering Conceptual Granularity

Words that belong to different languages and express the same concept may differ in conceptual granularity. Therefore, to associate source language words with target language words based on their concepts, a confidence for inter-lingual word alignment based on conceptual granularity is required, together with an algorithm that extracts translation pairs based on the confidence value.

This thesis is organized as follows.

Chapter 2 introduces previous studies relevant to bilingual dictionary construction. These studies are classified into three groups according to the language resources used to construct bilingual dictionaries (bilingual dictionaries, bilingual or monolingual corpora, and WordNet), and each method is described. Chapter 3 describes WordNet, the language resource used in this research, and the kind of data it provides. It also describes previous studies on the multilingualization of WordNet and its state as of 2016. Chapter 4 introduces a bilingual dictionary construction method using WordNet; the research issues in using WordNet to construct bilingual dictionaries are described, and a method to cope with each issue is proposed. In Chapter 5, bilingual dictionaries are constructed with the proposed method and evaluated from both quantitative and qualitative perspectives; the experimental settings, such as language resources and evaluation measures, are described and the results are shown. Chapter 6 discusses the results shown in Chapter 5. Chapter 7 presents the conclusion.

Chapter 2 Bilingual Dictionary Construction

A variety of studies on bilingual dictionary construction have been conducted. Above all, approaches using existing bilingual dictionaries or other language resources are useful for less-resourced languages. Existing approaches can be classified into three directions: induction with multiple bilingual dictionaries, statistical approaches with parallel corpora, and induction with WordNet synsets. The following sections describe previous studies according to this classification.

2.1 Induction with Multiple Bilingual Dictionaries

One of the most representative approaches to constructing bilingual dictionaries is induction with multiple bilingual dictionaries. However, it is necessary to distinguish true equivalents from inappropriate words caused by ambiguity in the pivot language.

Tanaka et al. [1] proposed the inverse consultation method, which treats the ambiguity in the pivot language by exploiting the structures of dictionaries and the ideograms of morphemes to measure the similarity of word meanings. Figure 1 outlines this method.
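The counting idea behind inverse consultation can be sketched as follows. This is a simplified illustration under stated assumptions, not the full method of Tanaka et al. [1]: it only counts how many pivot words a target candidate shares with the source word when the candidate is looked up back into the pivot language, and it ignores the use of ideograms. The function name and the toy dictionary entries (including the French words) are hypothetical.

```python
def inverse_consultation_scores(word, src_to_pivot, pivot_to_tgt, tgt_to_pivot):
    """Score each target candidate by the pivot words it shares with the source word."""
    pivot_words = src_to_pivot.get(word, set())
    # Collect all target words reachable through any pivot word.
    candidates = set()
    for pivot_word in pivot_words:
        candidates |= pivot_to_tgt.get(pivot_word, set())
    # Look each candidate up "inversely" and count the overlap.
    return {
        cand: len(pivot_words & tgt_to_pivot.get(cand, set()))
        for cand in candidates
    }


# Toy dictionaries (illustrative entries only).
ja_en = {"競争": {"race", "contest", "competition"}}
en_fr = {"race": {"race"}, "contest": {"compétition"}, "competition": {"compétition"}}
fr_en = {
    "compétition": {"contest", "competition", "match"},
    "race": {"race", "ancestry"},
}

scores = inverse_consultation_scores("競争", ja_en, en_fr, fr_en)
best = max(scores, key=scores.get)
```

In this toy data the candidate that shares two pivot words with 競争 is ranked above the one that shares only one, so a larger intersection is taken as evidence of a closer meaning.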

Tsuchiya et al. [2] proposed a method that expands a small existing bilingual dictionary into a large one using a pivot language. The method assumes that, for a given language pair A and B, a pivot language C can be found for which there are both a large bilingual dictionary from the source language A to the pivot language C and a large bilingual dictionary from the pivot language C to the target language B. First, a co-occurrence vector in the target language B corresponding to an input word is generated using both the seed dictionary and a monolingual corpus in the source language A. Second, translation candidates are listed by consulting both the source-pivot bilingual dictionary A-C and the pivot-target bilingual dictionary C-B, and their co-occurrence vectors are calculated from a monolingual corpus in the target language B. Finally, an output is selected based on the cosine similarity score. Figure 2 shows the whole procedure.
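The final selection step can be illustrated with a small sketch. It assumes that the co-occurrence vectors have already been built and are stored as sparse dictionaries from context word to weight; the function names, candidate labels, and vector values are hypothetical and are not taken from Tsuchiya et al. [2].

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_translation(input_vector, candidate_vectors):
    """Return the candidate whose co-occurrence vector is closest to the input's."""
    return max(
        candidate_vectors,
        key=lambda cand: cosine_similarity(input_vector, candidate_vectors[cand]),
    )


# Hypothetical vectors in the target language B.
input_vector = {"sport": 2.0, "win": 1.0}
candidate_vectors = {
    "candidate_a": {"sport": 1.0, "win": 1.0},
    "candidate_b": {"animal": 3.0},
}
chosen = select_translation(input_vector, candidate_vectors)
```

Here `candidate_a` is selected because it shares context words with the input, while `candidate_b` has no overlap and therefore a similarity of zero.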

Figure 1: Inverse consultation method (Steps 1 to 3, illustrated for the Japanese word 競争 with English as the pivot and French as the target; the ja-en translations race, contest and competition are compared with the fr-en translations of each candidate, and the number of words in the intersection, 2 or 1, is counted)

Figure 2: Expanding existing bilingual dictionary using a pivot language[2]

Another line of research models the structures of the existing input dictionaries as a constraint satisfaction problem. Mairidan et al. [3] proposed a constraint-based approach to pivot-based dictionary induction for cases where the source language and the target language are closely related. They created constraints from language similarity and modeled the structures of the input dictionaries as a Boolean optimization problem, an extension of SAT.

In addition, Tokunaga et al. [4] proposed a method to extract information about conceptual items from translation pairs in bilingual dictionaries. A translation circuit, which consists of a headword in each of the source and target languages and one word sense of each headword, is used to represent a single concept. Figure 3 shows an example of translation circuits. This work is related to this research in that the concepts of words play an important role.

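Figure 3: Translation circuits for the word party (the English words party and reception and the Japanese words 政党, 関係者 and パーティ, linked by English-Japanese (英-日) and Japanese-English (日-英) dictionary entries)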

2.2 Statistical Approach with Parallel Corpora

Another line of research constructs bilingual dictionaries from large parallel corpora. There are generally two ways to do this: using cross-lingual co-occurrence obtained from a parallel corpus or monolingual corpora to improve the bilingual dictionaries constructed with the pivot language described in section 2.1, and extracting translations from a parallel corpus or monolingual corpora without using a pivot language.

As an example of the first approach, Tanaka et al. [5] proposed a method to extract translations from non-aligned corpora. Their work assumes that the translations of two co-occurring words in a source language also co-occur in the target language, and uses a stochastic matrix to model this assumption. In addition, a translation matrix, which provides the co-occurrence information translated from the source into the target, is used to measure the ambiguity of translation. Finally, they showed that the best translation matrix is the one whose translated co-occurrence information most closely resembles that of the target language.

As described above, using parallel corpora to extract bilingual dictionaries is a mature and promising approach if adequate amounts of data are available. However, it cannot be applied to constructing bilingual dictionaries between distant languages because of its heavy dependence on orthographic features of the languages.

2.3 Induction with WordNet Synsets

There are also approaches that use conceptual dictionaries for bilingual dictionary construction, and WordNet is often used as the conceptual dictionary. Mohanty et al. [6] constructed a multilingual dictionary using WordNet synsets in place of pivot words. Unlike existing bilingual dictionaries, in which a word has several words as its translations, they proposed a dictionary model in which words from several languages are related to a synset, as shown in figure 4. Using this model, they constructed a multilingual dictionary that covers traditional languages of India.

Figure 4: Multilingual dictionary model in which words are related to a synset[6]

Their method is the most closely related to this research in that it constructs bilingual dictionaries using only WordNet. However, their method is restricted to languages with similar features because it focuses on traditional languages of India. In this research, the author proposes a general method that does not depend on language features.

Chapter 3 WordNet

This chapter describes WordNet [7, 8], the language resource used in this research to construct bilingual dictionaries, together with previous studies on the multilingualization of WordNet and its state as of 2016.

3.1 Princeton WordNet

This section describes WordNet, the language resource used to construct bilingual dictionaries in this research. WordNet is a concept dictionary of English developed at Princeton University 1). In WordNet, English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms (called synsets), and each synset has a definition. In addition, semantic relations link words to synsets and synsets to each other. An example of a synset entry follows. In this example, good, right, and ripe are the member words of the synset, and the text enclosed in parentheses is the definition of the synset (its gloss) followed by usage examples.

    good, right, ripe -- (most suitable or right for a particular purpose; "a good time to plant tomatoes"; "the right time to act"; "the time is ripe for great sociological changes")

In addition, semantic relationships with other synsets are defined. WordNet defines the following semantic relationships.

Synonymy

Synonymy is WordNet’s basic relation, because WordNet uses sets of synonyms (synsets) to represent word senses. Synonymy is a symmetric relation between word forms.

Antonymy

Antonymy (opposing-name) is also a symmetric semantic relation between word forms, especially important in organizing the meanings of adjectives and adverbs.

Hyponymy

Hyponymy (sub-name) and its inverse, hypernymy (super-name), are transitive relations between synsets. Because there is usually only one hypernym, this semantic relation organizes the meanings of nouns into a hierarchical structure.

Meronymy

Meronymy (part-name) and its inverse, holonymy (whole-name), are complex semantic relations. WordNet distinguishes component parts, substantive parts, and member parts.

Troponymy

Troponymy (manner-name) is for verbs what hyponymy is for nouns, although the resulting hierarchies are much shallower.

Entailment

Entailment relations between verbs are also coded in WordNet.

1) http://wordnet.princeton.edu/

Table 1 shows the parts of speech for which each semantic relation is defined.

Table 1: Semantic relations in WordNet

Semantic Relation         Syntactic Category                 Examples
Synonymy (similar)        Noun, Verb, Adjective, Adverb      pipe, tube
Antonymy (opposite)       Adjective, Adverb, (Noun, Verb)    wet, dry
Hyponymy (subordinate)    Noun                               sugar maple, maple
Meronymy (part)           Noun                               brim, hat
Troponymy (manner)        Verb                               march, walk
Entailment                Verb                               drive, ride

In WordNet, both nouns and verbs are organized into hierarchies defined by hypernym and hyponym relationships. Words listed on the same line are members of the same synset. In the following example, red and redness belong to the same synset, and crimson, ruby, and deep red also belong to the same synset. Each synset has a unique index and a brief definition.

    red, redness
        => sanguine
        => chrome red
        => Turkey red, alizarine red
        => cardinal, carmine
        => crimson, ruby, deep red
        => dark red
        => purplish red, purplish-red
        => cerise, cherry, cherry red
        => scarlet, vermilion, orange red

As shown in table 2, WordNet includes a great deal of semantic information on English words. In particular, WordNet contains a large amount of data on nouns, including about 120,000 words and 80,000 synsets.

Table 2: Statistics of Princeton WordNet

Part of speech    Words      Synsets    Word-Synset Pairs
Noun              117,798    82,115     146,312
Verb              11,529     13,767     25,047
Adjective         21,479     18,156     30,002
Adverb            4,481      3,621      5,580
Total             155,287    117,659    206,941

In this research, the author uses graphs to represent the hierarchies in WordNet and the relations between synsets and words. Figure 5 represents the relations described above as a graph in which green nodes represent synsets and red nodes represent words. Relations such as hyponym are defined between synsets, and a relation between a synset and a word, labeled with a language such as english, indicates that the word is an instance of the synset in that language. In the following sections, conceptual structures and the relations between words and concepts are described using graphs.
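One possible in-memory form of such a graph is sketched below. It is a minimal illustration, assuming that synsets are identified by plain string labels and that each word edge carries a language label; the class name `WordnetGraph`, its methods, and the labels `red.n` and `crimson.n` are hypothetical and do not come from WordNet or from this thesis. The words are taken from the red example above.

```python
from collections import defaultdict


class WordnetGraph:
    """Synset nodes linked by hyponym edges, plus language-labeled word edges."""

    def __init__(self):
        self.hyponyms = defaultdict(set)  # synset -> narrower synsets
        self.members = defaultdict(set)   # synset -> {(language, word)}

    def add_hyponym(self, hypernym, hyponym):
        self.hyponyms[hypernym].add(hyponym)

    def add_word(self, synset, language, word):
        self.members[synset].add((language, word))

    def words_of(self, synset, language):
        """Words that are instances of the synset in the given language."""
        return {w for lang, w in self.members.get(synset, set()) if lang == language}

    def synsets_of(self, language, word):
        """Synsets that the word is attached to in the given language."""
        return {s for s, m in self.members.items() if (language, word) in m}


graph = WordnetGraph()
graph.add_hyponym("red.n", "crimson.n")
for word in ("red", "redness"):
    graph.add_word("red.n", "english", word)
for word in ("crimson", "ruby", "deep red"):
    graph.add_word("crimson.n", "english", word)
```

Keeping the synset-to-synset edges separate from the synset-to-word edges mirrors the two node types in Figure 5 and makes it easy to ask either which words express a concept or which concepts a word expresses.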

Figure 5: An example of representation of conceptual structure by graph (legend: Word, Synset)

3.2 Multilingualization of WordNet

Although WordNet itself is an English language resource developed at Princeton University, it is so useful that many projects (about 76 as of 2016) 1) exist to construct wordnets in other languages based on Princeton WordNet. Focusing on Asian countries, which are poor in language resources, almost all official languages and some ethnic languages are covered by Open Multilingual WordNet [9], Asian WordNet, WordNet Bahasa [10], and Hindi WordNet.

The leading WordNet projects and the languages they cover are as follows.

Arabic WordNet

Arabic

Open Multilingual WordNet [11]

More than 200 languages including Albanian, Arabic, Bulgarian, Chinese, Danish, Greek, English, Persian, Finnish, French, Hebrew, Croatian, Italian, Japanese, Spanish, Indonesian, Malay, Polish, Portuguese, Slovenian, Swedish, Thai, etc.

Asian WordNet

Hindi, Indonesian, Japanese, Lao, Mongolian, Brunei, Nepali, Thai, Sinhala, Vietnamese

WordNet Bahasa [10]

Malay, Indonesian

IndoWordNet

Hindi, Assamese, Bengali, Bodo, Gujarati, Kannada, Kashmiri, Konkani, Malayalam, Meitei, Marathi, Nepali, Sanskrit, Tamil, Telugu, Punjabi, Urdu, Oriya

EuroWordNet

Danish, English, French, German, Italian, Spanish

MultiWordNet

Italian, Spanish, Portuguese, Hebrew, Romanian, Latin

1) http://globalwordnet.org/wordnets-in-the-world/

In this way, WordNet is available as a language resource in several languages, although the quality differs among them. Therefore, WordNet can be used to construct bilingual dictionaries between several languages.

Figure 6: Linking with multiple WordNets[12]

Bond et al. [12] proposed a method to bootstrap a wordnet using multiple existing wordnets in order to deal with the ambiguity inherent in translation. They use the construction of the Japanese WordNet as an example. Although the obvious way to add Japanese to the English WordNet is to translate the entries using an English-Japanese dictionary, bilingual dictionaries are not marked with WordNet senses. For example, the English word seal has several translation candidates in a Japanese-English bilingual dictionary, including 判子 (stamp) and 海軍特殊部隊 (Navy Seal). Translation candidates need to be associated with the appropriate WordNet senses, so wordnets in multiple languages are used to disambiguate the senses of the translations. As shown in figure 6, looking up bat yields 蝙蝠 (bat) associated with synsetID n#1 and バット (bat) associated with synsetID n#5. However, translating these words with bilingual dictionaries alone does not yield translations associated with the correct synsetID. Therefore, words that belong to the same synsetID in wordnets in multiple languages are used to acquire correct translations. Many wordnets constructed by this method are published as Open Multilingual Wordnet 1).

Table 3: Statistics of Japanese WordNet and WordNet Bahasa

WordNet name        Part of speech    Words     Synsets    Word-Synset Pairs
Japanese WordNet    Noun              66,003    42,737     99,419
Japanese WordNet    Verb              15,346    6,819      33,673
Japanese WordNet    Adjective         8,540     5,798      17,799
Japanese WordNet    Adverb            4,085     1,830      7,157
Japanese WordNet    Total             91,936    57,184     158,048
WordNet Bahasa      Noun              28,185    24,829     51,087
WordNet Bahasa      Verb              7,728     7,802      42,942
WordNet Bahasa      Adjective         4,549     4,804      11,132
WordNet Bahasa      Adverb            685       650        1,413
WordNet Bahasa      Total             36,604    38,085     106,574

1) http://compling.hss.ntu.edu.sg/omw/


In this research, the Japanese WordNet [13] 1) and WordNet Bahasa [14] 2), which is an Indonesian wordnet, are used to construct a bilingual dictionary. Words in both wordnets are associated with synsets in Princeton WordNet. Japanese and Indonesian are selected as the target languages because sufficient paper or electronic bilingual dictionaries between them do not exist and because the two languages are linguistically distant. Statistics of the Japanese WordNet and WordNet Bahasa are shown in table 3. Both wordnets contain a large amount of data on nouns. Therefore, bilingual dictionaries with many headwords are expected to be constructed.

1) http://nlpwww.nict.go.jp/wn-ja/
2) http://wn-msa.sourceforge.net

Chapter 4 Bilingual Dictionary Construction Using Wordnets in Different Languages

In this chapter, the author proposes a method to construct bilingual dictionaries using the wordnets in different languages described in Chapter 3.

4.1 Interlingual Alignment between Wordnets in Different Languages using WordNet

This section describes problems in existing bilingual dictionary construction methods and gives an overview of interlingual alignment between wordnets in different languages.

In the methods based on induction with multiple bilingual dictionaries introduced in section 2.1, simply connecting Japanese-English and English-Indonesian dictionaries through English as an intermediate language to create a Japanese-Indonesian dictionary quickly causes problems. As shown in figure 7, if an English word has several meanings, such as meaning A and meaning B, a Japanese word may be translated into the English word with meaning A while the English word is translated into Indonesian with meaning B. In this case, the Japanese word and the Indonesian word in the resulting translation pair do not share any concept. To resolve this kind of problem, Tanaka et al. [1] utilized translation graphs that represent the structure of dictionaries and morphemes, and Mairidan et al. [3] utilized linguistic characteristics, such as similarity between languages, as constraints when modeling the task as a constraint satisfaction problem (CSP). Thus, the concepts expressed by words need to be aligned correctly.

The statistical methods using corpora introduced in section 2.2 have been shown to be less effective for distant language pairs because of their heavy dependence on orthographic features of the languages.

To construct a bilingual dictionary, the concepts of words from different languages need to be aligned correctly. However, as mentioned above, existing methods rely on characteristics of the source and target languages to improve quality, so these characteristics affect the quality of the resulting bilingual dictionary. Therefore, constructing bilingual dictionaries requires a method that aligns the concepts of words from different languages correctly without being affected by the features of the languages.

Figure 7: Problems in the pivot-based method (example with the Japanese word 運命, the English word fortune, and the Indonesian words nasib (destiny) and kekayaan (wealth); the word-level links of the bilingual dictionaries connect meaning A and meaning B)

Therefore, the author proposes a bilingual dictionary construction method based on interlingual alignment between concept dictionaries in different languages. Concretely, Wordnets constructed in different languages based on Princeton WordNet (Japanese WordNet and WordNet Bahasa) are used to construct a bilingual dictionary (nouns only) by extracting word pairs that express the same concepts. Using only Wordnets as language resources ensures the versatility and practicality of the method.

In this research, words are aligned based on conceptual structure. This policy is illustrated in figure 8. To realize this method, a language resource that expresses conceptual structure is required, and WordNet is used as this concept dictionary. In WordNet, English words are categorized into groups of synonymous words, which are called synsets. Each synset has a simple definition, and synsets have various relations to other synsets, as described in Chapter 3.

Each synset has an identifier (called synsetID). Words from each language are associated using the synsetID as a pivot. SynsetID alignment makes it possible to associate source language words with target language words based on concepts. As shown in figure 8, the synset with synsetID 04962784-n, defined as "red color or pigment; the chromatic color resembling the hue of blood", is associated with the words merah (red) and kemerahan (red) in Indonesian WordNet and 赤色 (red) and 赤 (red) in Japanese WordNet. Therefore, translation pairs based on concepts can be extracted by taking the Cartesian product of the two word sets, each of which consists of the words of one language associated with the concept identified by the synsetID. As described in section 2.3, Mohanty et al. constructed a multilingual dictionary in the same way. In this research, this method is used as the baseline method.

Figure 8: Align words based on synset
[Figure content: the Indonesian WordNet words merah (red) and kemerahan (red) and the Japanese WordNet words 赤 and 赤色 are words of the same English WordNet synset 04962784-n.]

The outline of these procedures is as follows.

1. Extract one synset from WordNet

2. From the source language WordNet and the target language WordNet, extract the sets of words associated with the extracted synset

3. Get the Cartesian product of the two word sets as translation pairs

Algorithm 1 presents the formalization of this baseline method for bilingual dictionary construction. In the 9th line, get_related_words(si, sl) is the function that returns the set of words associated with synset si in the WordNet of language sl. In the 13th line, make_translation_pair(wsl,j, wtl,k) is the function that makes a translation pair from word wsl,j, which belongs to source language sl, and word wtl,k, which belongs to target language tl.

Algorithm 1 Baseline method

1: si /* i-th synset */
2: S /* Set of synsets (S = {s1, ..., s|S|}) */
3: sl /* Source language */
4: tl /* Target language */
5: wl,j /* j-th word in language l */
6: Wl /* Set of words in language l */
7: R /* Set of translation pairs (initially empty) */
8: for all si in S do
9:   Wsl,i ← get_related_words(si, sl) /* get the set of words associated with synset si in the WordNet of language sl */
10:   Wtl,i ← get_related_words(si, tl)
11:   for all wsl,j in Wsl,i do
12:     for all wtl,k in Wtl,i do
13:       R ← R ∪ {make_translation_pair(wsl,j, wtl,k)} /* make a translation pair from word wsl,j of source language sl and word wtl,k of target language tl */
14:     end for
15:   end for
16: end for
17: return R
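As a minimal sketch of Algorithm 1, the baseline can be written in Python as follows. It assumes that each WordNet has already been loaded into a dictionary that maps a synsetID to the set of words associated with it. This data layout and the name baseline_pairs are assumptions for illustration; the sample data are the words of figure 8.

```python
from itertools import product


def baseline_pairs(src_wordnet, tgt_wordnet):
    """Return all (source word, target word) pairs that share a synset."""
    pairs = set()
    # Synsets missing from either language contribute no pairs,
    # so only the shared synsetIDs need to be visited.
    for synset_id in src_wordnet.keys() & tgt_wordnet.keys():
        pairs.update(product(src_wordnet[synset_id], tgt_wordnet[synset_id]))
    return pairs


# Example data taken from figure 8.
japanese = {"04962784-n": {"赤", "赤色"}}
indonesian = {"04962784-n": {"merah", "kemerahan"}}

pairs = baseline_pairs(japanese, indonesian)
print(sorted(pairs))
```

For the synset of figure 8, the Cartesian product yields four translation pairs, one for each combination of a Japanese word and an Indonesian word.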

The bilingual dictionary made by the baseline method was evaluated. Table 4 presents the evaluation result. The precision was about 66%. The evaluation was conducted by the author and a native Indonesian speaker: 100 translation pairs were sampled from the constructed bilingual dictionary, and the number of pairs usable as correct translations was counted.

This result is not satisfactory. The preliminary experiment shows that simply associating words based on concepts is not enough to construct a bilingual dictionary. In the next section, characteristics and issues of multilingualized WordNet are described.

Table 4: Result of preliminary experiment

                   Precision   Japanese headwords   Indonesian headwords
Baseline method    66%         42,585               23,900

4.2 Issues in Wordnets in Different Languages

In section 4.1, the author described that simply associating words through synsets of WordNet is not enough to extract correct translation pairs. In this section, issues of multilingualized WordNet for bilingual dictionary construction are described.

First, characteristics of multilingualized WordNet are described. As described in section 3.2, a multilingualized WordNet is constructed by associating words with the English conceptual structure, which was made by defining synsets as groups of synonyms and relations between synsets when Princeton WordNet was constructed. However, in non-English languages, even if a word expresses the same concept as an English word, some words cover more abstract concepts than the English word and others cover more concrete concepts. Figure 9 illustrates this situation. In this figure, English words (left side) and Japanese words (right side) associated with the same conceptual system with hyponym and hypernym relationships are shown. English words have one-to-one relationships with synsets, while some Japanese words have one-to-many relationships with synsets. Such one-to-one words and one-to-many words should not be matched when extracting translation pairs.

Figure 9: Example of vagueness in mapping words to English conceptual structure (Left: English WordNet, right: Japanese WordNet)

In addition, a gap may exist between the concept ranges of words even if their conceptual granularity is equal. Figure 10 illustrates this situation. In this figure, as in figure 9, Indonesian words (left side) and Japanese words (right side) associated with the same conceptual system with hyponym and hypernym relationships are shown. In Japanese WordNet, the word “生き物 (living things)” is associated with all three concepts. In Indonesian WordNet, the word “manusia” is also associated with more than one concept, but only with the lower two concepts. These words share part of the range of concepts they express. However, they should not be extracted as a translation pair, because one can be used to express the more abstract concept while the other cannot.

Figure 10: Example that shows difference of conceptual granularity between languages (Left: Indonesian WordNet, right: Japanese WordNet)

The above discussion of issues in bilingual dictionary construction using Wordnets in different languages is summarized as follows.

Definition of conceptual granularity based on graph structure
    Words in Wordnets in different languages are associated with the English conceptual structure because these Wordnets are constructed based on English WordNet. However, among the words of a language, some express broader concepts than English words and others express narrower concepts. To associate source language words with target language words based on their concepts, a measure is required that expresses the conceptual granularity of a word based on the graph structure formed by hyponym and hypernym relations in WordNet.

Calculation of confidence for alignment considering conceptual granularity
    Words that belong to different languages and express the same concept may differ in conceptual granularity. Because of this difference, concept alignment between words in different languages becomes ambiguous. Therefore, to associate source language words with target language words based on their concepts, a confidence value for interlingual word alignment based on conceptual granularity is required, together with an algorithm that extracts translation pairs based on this confidence value.

In the following section, the author proposes methods to cope with each issue.

4.3 Interlingual Alignment between Wordnets in Different Languages Using Structural Equivalence

In this section, the author proposes methods to cope with the issues described in section 4.2.

4.3.1 Definition of Conceptual Granularity Based on Graph Structure

In section 4.2, it was described that incorrect translation pairs may be extracted because of non-English words with large conceptual granularity, which have one-to-many relationships with the conceptual structure. The conceptual granularity of a word can be expressed as the distance from the most abstract concept to the most concrete concept in the conceptual structure, consisting of concepts linked by hyponym relationships, that is associated with the word. Therefore, the conceptual granularity of a word wl,i belonging to language l is defined as the diameter of the graph consisting of the synset nodes that are neighbors of node wl,i.

The outline of the procedure to get the conceptual granularity of a word is as follows.

1. Get the set of synset nodes neighboring the word node

2. Extract from the whole WordNet graph the subgraph consisting of the nodes in the set obtained in step 1 and the edges between them

3. Calculate the diameter of the extracted subgraph

Algorithm 2 presents the formalization of the procedure above. In the 6th line, neighbor_of(wl,i) is the function that returns the set of synset nodes neighboring wl,i. In the 7th line, subgraph(G, S) is the function that extracts from graph G the subgraph consisting of the nodes in node set S and the edges between them. In the 8th line, diam(SG) is the function that returns the diameter of graph SG.

Using the procedure above, for example, the conceptual granularity of the Japanese word 生き物 (living things) in figure 11 is 2.

Algorithm 2 Conceptual Granularity(wl,i)

1: wl,i /* i-th word in language l */
2: S /* Set of synsets */
3: G /* Graph consisting of words and synsets in Wordnets */
4: SG /* Subgraph of G */
5: d /* Diameter of a graph */
6: S ← neighbor_of(wl,i) /* get the set of synset nodes neighboring wl,i */
7: SG ← subgraph(G, S) /* extract from graph G the subgraph consisting of the nodes in node set S and the edges between them */
8: d ← diam(SG) /* get the diameter of graph SG */
9: return d

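Figure 11: Example illustrating the conceptual granularity of the word “生き物 (living things)”
[Figure content: the word node 生き物 is linked by jpn edges to the synset nodes 00004258-n, 00004475-n and 00015388-n, and the synset nodes are linked by two hypo edges.]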
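Algorithm 2 can be sketched in Python with a breadth-first search, as below. The sketch makes three assumptions that the text does not state: hyponym edges are treated as undirected when measuring distance, unreachable synset pairs are ignored, and the three synsets of figure 11 form a chain connected by the two hypo edges. The function name and data layout are also hypothetical.

```python
from collections import deque


def conceptual_granularity(word_synsets, hyponym_edges):
    """Diameter of the subgraph induced by the synsets of one word."""
    nodes = set(word_synsets)
    neighbors = {s: set() for s in nodes}
    # Keep only edges whose two endpoints both belong to the word's synsets.
    for parent, child in hyponym_edges:
        if parent in nodes and child in nodes:
            neighbors[parent].add(child)
            neighbors[child].add(parent)
    diameter = 0
    for start in nodes:
        # Breadth-first search gives shortest path lengths from start.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in neighbors[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        diameter = max(diameter, max(dist.values()))
    return diameter


# Synsets of 生き物 in figure 11, assumed to form a chain of hypo edges.
edges = [("00004258-n", "00004475-n"), ("00004475-n", "00015388-n")]
ikimono = {"00004258-n", "00004475-n", "00015388-n"}
print(conceptual_granularity(ikimono, edges))
```

Under these assumptions the sketch returns 2 for 生き物, which agrees with the value stated above, and 0 for any word associated with a single synset.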

The author analyzes the effect of multilingualization of WordNet using Algorithm 2. If the conceptual granularity of words varies from language to language, the distribution of conceptual granularity should also vary from language to language, and the proportion of words with non-zero conceptual granularity should be larger in non-English WordNets. Based on this hypothesis, the author calculated the conceptual granularity of the words included in the English 1), Japanese 2), Indonesian 3), Chinese 4) and Malaysian 5) WordNets, and created a histogram of conceptual granularity for each language. Table 5 shows the statistics of each WordNet for nouns, and figure 12 shows the result.

1) https://wordnet.princeton.edu/
2) http://nlpwww.nict.go.jp/wn-ja/
3) http://wn-msa.sourceforge.net/
4) http://compling.hss.ntu.edu.sg/cow/
5) http://wn-msa.sourceforge.net/

Table 5: Statistics of WordNet in each language (only nouns)

                      Words     Synsets   Word-Synset Pairs
English WordNet       117,798   82,115    146,312
Japanese WordNet      66,003    42,737    99,419
Indonesian WordNet    28,185    24,829    51,087
Chinese WordNet       38,978    27,888    46,229
Malaysian WordNet     25,338    23,466    49,903

Figure 12: Distribution of conceptual granularity for each language

Figure 12 confirms that words with conceptual granularity 0 account for almost all words in English WordNet, and that the proportions of words with conceptual granularity of 1 or more are larger in the non-English WordNets. Table 6 shows the proportion of words with each conceptual granularity among all words included in each language WordNet. These results confirm that multilingualization of WordNet causes the conceptual granularity of words to vary from language to language even if the words are associated with the same concept. It can be concluded that distortion between words and conceptual structure arises from multilingualization of WordNet based on the English conceptual structure.

Table 6: Proportions of words with each conceptual granularity among all words included in each language WordNet

                                Granularity
                      Words     0        1       2       3       4
English WordNet       117,798   99.96%   0.04%   0.00%   0.00%   0.00%
Japanese WordNet      66,003    85.4%    13.2%   1.40%   0.02%   0.00%
Indonesian WordNet    28,185    71.5%    20.7%   6.61%   1.06%   0.07%
Chinese WordNet       38,978    89.3%    9.58%   0.99%   0.06%   0.01%
Malaysian WordNet     25,338    68.2%    23.5%   7.04%   1.18%   0.06%

4.3.2 Calculation of Confidence for Alignment Considering Conceptual Granularity

In section 4.3.1, it was described that many words with large conceptual granularity exist in multilingualized WordNet, and the following were indicated as the causes of extracting incorrect translation pairs.

1. Matching a word with words that have different conceptual granularity

2. Matching a word with words that share only a part of the associated synsets, even if their conceptual granularities are equal

Considering these two causes, an ideal translation pair with respect to conceptual granularity is "a pair of words which have the same conceptual granularity and are associated with the same concepts". Figure 13 illustrates this situation.

Figure 13: Example of an ideal translation pair
[Figure content: a Japanese word (jpn edges) and an Indonesian word (ind edges) are each linked to the same three synsets, which are connected by hypo edges.]

To find such ideal translation pairs by calculation on the graph structure, the author introduces structural equivalence, which is used in social network analysis to express similarity of position. If node A and node B in the same network have exactly the same relations with the other nodes in the network, node A and node B are called structurally equivalent. In other words, a set of nodes is structurally equivalent if exchanging the labels attached to the nodes does not change any relation.

Figure 14 shows a graph consisting of five nodes. In this graph, node B has a unique position. On the other hand, nodes A, C, D and E have a relation only with node B. Therefore, if their labels are exchanged, no change occurs in the network structure. In this graph, only node B has a unique position, and the other nodes A, C, D and E are structurally equivalent.

Figure 14: Example of a graph that contains structurally equivalent nodes
[Figure content: nodes A, C, D and E are each connected only to node B.]

Under this definition, a pair of nodes is in one of only two states: structurally equivalent or not. Which state holds is determined by the relations the nodes have.

However, in a complex network such as one consisting of two Wordnets in different languages, there exist nodes that are not strictly structurally equivalent but are very close to being so. Therefore, structural equivalence needs to be expressed as a continuous quantity.

The author uses the correlation coefficient (Pearson product-moment correlation coefficient) as the index of structural equivalence between two nodes in a graph. Basically, the correlation coefficient is applied to the row elements and column elements of the adjacency matrix. The correlation coefficient $r_{ij}$ between node i and node j is defined as follows.

\[
r_{ij} = \frac{\sum_{k}(x_{ki} - x_{*i})(x_{kj} - x_{*j}) + \sum_{k}(x_{ik} - x_{i*})(x_{jk} - x_{j*})}{\sqrt{\sum_{k}(x_{ki} - x_{*i})^{2} + \sum_{k}(x_{ik} - x_{i*})^{2}}\;\sqrt{\sum_{k}(x_{kj} - x_{*j})^{2} + \sum_{k}(x_{jk} - x_{j*})^{2}}} \tag{1}
\]

In equation 1, $x_{ki}$ is the (k, i) element of the adjacency matrix (likewise for the others), $x_{*i}$ is the average of column i and $x_{i*}$ is the average of row i (likewise for the others), and the sums run over $k \neq i, j$. In other words, this correlation coefficient is an index that indicates the degree of association between node i and node j in terms of their relations with every other node k. It takes the maximum value 1 when the two nodes are perfectly positively correlated, the minimum value −1 when they are perfectly negatively correlated, and 0 when they are independent. When constructing a bilingual dictionary, values close to 1 are desirable.
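Equation 1 can be checked on the star graph of figure 14 with the following Python sketch. Two points are assumptions of the sketch rather than statements of the text: the row and column averages are taken over the same nodes k ≠ i, j as the sums, and the value 0.0 is returned when a denominator is zero (the coefficient is undefined in that case). The function names are hypothetical.

```python
import math


def _centered(values):
    """Deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def structural_equivalence(adj, i, j):
    """Correlation of equation 1 between nodes i and j of an adjacency matrix."""
    others = [k for k in range(len(adj)) if k != i and k != j]
    if not others:
        return 0.0
    # Column part (x_ki, x_kj) followed by row part (x_ik, x_jk).
    dev_i = _centered([adj[k][i] for k in others]) + _centered([adj[i][k] for k in others])
    dev_j = _centered([adj[k][j] for k in others]) + _centered([adj[j][k] for k in others])
    numerator = sum(a * b for a, b in zip(dev_i, dev_j))
    denominator = math.sqrt(sum(a * a for a in dev_i) * sum(b * b for b in dev_j))
    return numerator / denominator if denominator > 0 else 0.0


# Star graph of figure 14, node order A, B, C, D, E; B is linked to all others.
star = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
r_ac = structural_equivalence(star, 0, 2)
r_ab = structural_equivalence(star, 0, 1)
print(r_ac, r_ab)
```

In this sketch the structurally equivalent nodes A and C obtain a coefficient of 1 up to floating-point rounding. For the pair A and B, the relations with the remaining nodes do not vary, so the denominator is zero and the fallback value 0.0 is returned.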

Using figure 15 as an example, the procedure to extract translation pairs using structural equivalence is described. The graph in figure 15 is a subgraph of the graph made from Japanese WordNet and Indonesian WordNet. It consists of the synset nodes s1, s2 and s3 associated with a Japanese word node j1, and the Indonesian word nodes i1 and i2 associated with these synset nodes. After this graph (a directed graph) is converted into an adjacency matrix, the correlation coefficients shown at the top left of the figure are calculated using equation 1. Based on the resulting correlation coefficients, target language words (Indonesian in the figure) whose correlation value is equal to or greater than the threshold value are selected as translations of the source word (Japanese in the figure).

Figure 15: Example for calculation of structural equivalence value

The outline of bilingual dictionary construction using two Wordnets in different languages is as follows.

1. Extract one word from the source language

2. Extract the set of synsets associated with the source language word extracted in step 1

3. Extract the set of target language words associated with the synsets in the set extracted in step 2

4. Extract, from the whole graph made from the source language WordNet and the target language WordNet, the subgraph consisting of the nodes extracted in steps 1 to 3 and the edges between them

5. Calculate structural equivalence values for the extracted subgraph using the correlation coefficient

6. Extract the target language words whose correlation value is equal to or greater than the threshold value as translations of the source language word

7. Repeat steps 1 to 6 for each source language word

Algorithm 3 presents the formalization of the whole procedure to construct a bilingual dictionary.

In the 17th line, neighbor_of(wsl,i) is the function that returns the set of synset nodes neighboring wsl,i. In the 20th line, get_neighbor_target_words(sj) is the function that returns the set of target word nodes neighboring synset node sj. In the 25th line, subgraph(G, NC) is the function that extracts from graph G the subgraph consisting of the nodes in node set NC and the edges between them. In the 26th line, calc_correlation_coefficient(SG) is the function that calculates the correlation coefficients for graph SG using equation 1. In the 28th line, get_value(CM, wsl,i, n) is the function that returns the (wsl,i, n) element of the correlation coefficient matrix. In the 29th line, "n belongs to Wtl" is true if node n belongs to the word set Wtl of the target language and false otherwise. In the 30th line, make_translation_pair(wsl,i, n) is the function that makes a translation pair from words wsl,i and n.

Algorithm 3 Proposed method

1: sl /* Source language */
2: tl /* Target language */
3: wl,i /* i-th word in language l */
4: Wl /* Set of words in language l */
5: NC /* Set of words and synsets */
6: sj /* j-th synset */
7: S /* Set of synsets (S = {s1, ..., s|S|}) */
8: TW /* Set of words */
9: G /* Graph consisting of words and synsets in Wordnets */
10: SG /* Subgraph of G */
11: CM /* Correlation coefficient matrix */
12: cv /* Value in correlation coefficient matrix */
13: th /* Threshold value */
14: R /* Set of translation pairs (initially empty) */
15: for all wsl,i in Wsl do
16:   NC ← {wsl,i}
17:   S ← neighbor_of(wsl,i) /* get the set of synset nodes neighboring wsl,i */
18:   for all sj in S do
19:     NC ← NC ∪ {sj}
20:     TW ← get_neighbor_target_words(sj) /* get the set of target word nodes neighboring synset node sj */
21:     for all wtl,k in TW do
22:       NC ← NC ∪ {wtl,k}
23:     end for
24:   end for
25:   SG ← subgraph(G, NC) /* extract from graph G the subgraph consisting of the nodes in node set NC and the edges between them */
26:   CM ← calc_correlation_coefficient(SG) /* calculate correlation coefficients for graph SG */
27:   for all n in SG do
28:     cv ← get_value(CM, wsl,i, n) /* get the (wsl,i, n) element of the correlation coefficient matrix */
29:     if cv ≥ th and n belongs to Wtl then
30:       R ← R ∪ {make_translation_pair(wsl,i, n)} /* make a translation pair from words wsl,i and n */
31:     end if
32:   end for
33: end for
34: return R
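The whole procedure of Algorithm 3 can be sketched in Python as follows. Several simplifications are assumptions of this sketch: each WordNet is given as a dictionary from a word to its set of synsetIDs, word-synset edges are stored symmetrically, and edges between synsets are omitted because equation 1 only reads the rows and columns of the two word nodes being compared. The toy data (j1, i1, i2, s1 to s3) are hypothetical and do not reproduce the actual graph of figure 15.

```python
import math


def _centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def correlation(adj, i, j):
    """Equation 1; averages over k != i, j, and 0.0 if undefined (assumptions)."""
    others = [k for k in range(len(adj)) if k != i and k != j]
    if not others:
        return 0.0
    dev_i = _centered([adj[k][i] for k in others]) + _centered([adj[i][k] for k in others])
    dev_j = _centered([adj[k][j] for k in others]) + _centered([adj[j][k] for k in others])
    den = math.sqrt(sum(a * a for a in dev_i) * sum(b * b for b in dev_j))
    return sum(a * b for a, b in zip(dev_i, dev_j)) / den if den > 0 else 0.0


def extract_translations(src_synsets, tgt_synsets, threshold):
    """Select target words whose structural equivalence reaches the threshold."""
    synset_to_tgt = {}
    for word, synsets in tgt_synsets.items():
        for s in synsets:
            synset_to_tgt.setdefault(s, set()).add(word)
    pairs = set()
    for src_word, synsets in src_synsets.items():
        targets = set()
        for s in synsets:
            targets |= synset_to_tgt.get(s, set())
        # Node set NC: the source word, its synsets, and their target words.
        nodes = [("src", src_word)]
        nodes += [("syn", s) for s in sorted(synsets)]
        nodes += [("tgt", t) for t in sorted(targets)]
        index = {node: k for k, node in enumerate(nodes)}
        adj = [[0] * len(nodes) for _ in nodes]
        for s in synsets:
            for word_node in [("src", src_word)] + [("tgt", t) for t in synset_to_tgt.get(s, ())]:
                a, b = index[word_node], index[("syn", s)]
                adj[a][b] = adj[b][a] = 1
        for t in targets:
            if correlation(adj, index[("src", src_word)], index[("tgt", t)]) >= threshold:
                pairs.add((src_word, t))
    return pairs


japanese = {"j1": {"s1", "s2", "s3"}}
indonesian = {"i1": {"s1", "s2", "s3"}, "i2": {"s1"}}
strict = extract_translations(japanese, indonesian, 0.7)
loose = extract_translations(japanese, indonesian, 0.0)
print(strict, loose)
```

With a threshold of 0.7 only (j1, i1) is selected, because i1 is linked to exactly the same synsets as j1, while i2 shares only one of the three synsets. Because the coefficient is computed in floating point, a threshold of exactly 1.0 may need a small tolerance in practice.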


Chapter 5 Evaluation

In this chapter, the author evaluates the proposed method described in Chapter 4. After describing the experimental settings, the author describes the evaluation results.

5.1 Experimental Settings

In this experiment, the author constructs a Japanese-Indonesian dictionary using the method proposed in this research and evaluates the degree of improvement.

Japanese WordNet 1) and WordNet Bahasa 2), which are provided in Open Multi-lingual WordNet 3), are used in this experiment as language resources. “Weblio Indonesian Dictionary” and its harmonized dictionary [1] are used as the answer Japanese-Indonesian dictionary and the answer Indonesian-Japanese dictionary. Table 7 shows the number of noun headwords in each dictionary.

1) http://compling.hss.ntu.edu.sg/omw/
2) http://wn-msa.sourceforge.net/index.eng.html
3) http://compling.hss.ntu.edu.sg/omw/

Table 7: The number of headwords of an answer dictionary (only nouns)

Dictionary Name                            Language Pair         Headwords
Weblio Indonesian Dictionary               Japanese-Indonesian   9,832
Harmonized Weblio Indonesian Dictionary    Indonesian-Japanese   10,079

In addition, a Japanese Web corpus with appearance frequency data for 89,357 words, provided by Kurohashi Laboratory in Kyoto University, is used as Japanese test data. Table 8 shows the statistics of each WordNet (nouns only).

Table 8: Statistics of Japanese WordNet and WordNet Bahasa (only nouns)

WordNet Name        Words    Synsets   Word-Synset Pairs
Japanese WordNet    66,003   42,737    99,419
WordNet Bahasa      28,185   24,829    51,087

In each WordNet, synsetID, category and data are included in the format shown in Table 9. The categories of data are lemma, which expresses a word associated with a synset, def, which expresses the definition of a concept, and exe, which expresses an example sentence using the lemma. In addition, data that define relations between synsets are provided in XML format. In the experiment, the lemma part of noun entries and the hyponym relations in the relation definitions are used to construct the WordNet graph.

Table 9: Format of data of WordNet

SynsetID     Classification   Content
00001740-a   jpn:lemma        可能
00001740-n   jpn:lemma        実体
00001837-r   jpn:lemma        西暦
...
11820323-n   jpn:def          南アフリカ産の茎のない多肉植物の属
04239436-n   jpn:def          傷ついた前腕を支持する包帯
11427067-n   jpn:def          太陽の大気圏の最も外側の領域
...
07316856-n   jpn:exe          彼はとうとう大きなチャンスをつかんだ
08438384-n   jpn:exe          彼らは、木の小台を切った
07082198-n   jpn:exe          彼は重い口を開いた

5.2 Evaluation of Interlingual Alignment Algorithm

In this section, a Japanese-Indonesian bilingual dictionary is constructed by the method proposed in this research under the settings described in section 5.1, and it is evaluated. First, the method is evaluated with the threshold value th in Algorithm 3, described in section 4.3.2, set to its maximum value 1.

First, the number of headwords extracted by the proposed method is shown. Table 10 presents the numbers of Japanese-Indonesian word pairs, Japanese headwords and Indonesian headwords included in the constructed Japanese-Indonesian dictionary and Indonesian-Japanese dictionary. Compared with the headwords of the answer dictionaries shown in table 7, about three times as many Japanese headwords and about twice as many Indonesian headwords are acquired.

Table 10: The number of word pairs, Japanese headwords and Indonesian headwords of constructed dictionaries

Dictionary Name        Word Pairs   Japanese   Indonesian
Japanese-Indonesian    73,279       31,788     20,011
Indonesian-Japanese    40,849       23,886     18,238

Figure 16 represents the numbers of translation pairs in the constructed dictionary and the answer dictionary, and their intersection. The intersection is small because the two dictionaries have different characteristics. Therefore, the ordinary evaluation method using the F-measure cannot be applied to the whole dictionary. Thus, the translation pairs in the intersection in figure 16 and the translation pairs included only in the constructed dictionary are evaluated separately.

Figure 16: The number of translation pairs in the constructed dictionary and the answer dictionary, and their intersection

First, the efficiency of the constructed bilingual dictionaries is estimated. Concretely, the relations between recall, precision, F-measure and the threshold value th are evaluated for the translation pairs included in the intersection in figure 16, using Weblio Indonesian Dictionary. Based on this result, the most suitable threshold for bilingual dictionary construction is discussed. The definitions of recall, precision and F-measure are as follows.

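\[
\text{Recall} = \frac{|\text{Translations of Headwords} \cap \text{Answer Translations of Headwords}|}{|\text{Answer Translations of Headwords}|} \tag{2}
\]

\[
\text{Precision} = \frac{|\text{Translations of Headwords} \cap \text{Answer Translations of Headwords}|}{|\text{Translations of Headwords}|} \tag{3}
\]

\[
F\text{-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4}
\]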

"Translations of Headwords" in equations 2 and 3 denotes the translations, in the constructed bilingual dictionary, of the source language headwords of the translation pairs included in the intersection in figure 16. "Answer Translations of Headwords" in equations 2 and 3 denotes the translations, in the answer dictionary, of the same source language headwords.
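Equations 2 to 4 can be sketched in Python as below. The sketch follows one reading of the definitions above: both dictionaries are first restricted to the headwords that occur in at least one shared translation pair, and the measures are then computed on the restricted pair sets. The function name, the pair representation and the toy data are assumptions for illustration.

```python
def evaluate(constructed, answer):
    """Precision, recall and F-measure over headwords shared by both dictionaries.

    Both arguments are sets of (headword, translation) pairs.
    """
    # Headwords of the translation pairs in the intersection (figure 16).
    headwords = {head for head, _ in constructed & answer}
    ours = {pair for pair in constructed if pair[0] in headwords}
    gold = {pair for pair in answer if pair[0] in headwords}
    common = ours & gold
    precision = len(common) / len(ours) if ours else 0.0
    recall = len(common) / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical toy dictionaries.
constructed = {("a", "x"), ("a", "y"), ("b", "z")}
answer = {("a", "x"), ("a", "w"), ("c", "v")}
result = evaluate(constructed, answer)
print(result)
```

In the toy data only headword "a" occurs in a shared pair, so the pairs of "b" and "c" are left out of the calculation, and precision, recall and F-measure are all 0.5.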

The relations between the threshold and recall, precision and F-measure were calculated using Weblio Indonesian Dictionary as the answer dictionary. The result is shown in figure 17a. Selecting translation pairs with a threshold higher than 0.7 doubled the precision compared with the baseline method (which is equivalent to a threshold of 0.0), but recall deteriorated remarkably. However, the F-measure rose by up to 1.4 times, so the overall quality can be said to have improved. In addition, as shown in figure 17a, the F-measure is maximized when the threshold is 0.6. In terms of precision alone, a threshold between 0.8 and 1.0 maximizes precision but lowers recall.

Similarly, the relations between the threshold and recall, precision and F-measure were calculated using the harmonized Weblio Indonesian Dictionary as the answer dictionary. The result is shown in figure 17b. Selecting translation pairs with a threshold higher than 0.7 quadrupled the precision compared with the baseline method (which is equivalent to a threshold of 0.0), but recall deteriorated remarkably. However, the F-measure rose by up to 2.2 times, so the overall quality can be said to have improved. In addition, as shown in figure 17b, the F-measure is maximized when the threshold is 0.7. In terms of precision alone, a threshold between 0.8 and 1.0 maximizes precision but lowers recall. From the discussion above, extracting translation pairs considering structural equivalence is beneficial for selecting correct translation pairs, but complete structural equivalence is too strict for bilingual dictionary construction. In addition, it is estimated that the threshold value th should be between 0.6 and 0.7.

(a) Relations between threshold and recall, precision and F-measure in Japanese-Indonesian bilingual dictionary
(b) Relations between threshold and recall, precision and F-measure in Indonesian-Japanese bilingual dictionary
Figure 17: Relations between threshold and recall, precision and F-measure

Based on the analysis above, the translation pairs included only in the constructed dictionary in figure 16 were evaluated by accuracy with the threshold set to 0.7. Each translation pair in the constructed bilingual dictionary was scored on three grades: Acceptable, Partly Acceptable and Not Acceptable. The three grades are defined below.

Acceptable
    It can be used as a translation.

Partly Acceptable
    It can be used to tell what it means, but the meaning is not completely matched.

Not Acceptable
    It cannot be used as a translation.

This evaluation was conducted by sampling translation pairs from the constructed bilingual dictionary and judging, together with a native Indonesian speaker, whether they can be used as translations according to the grades described above. In addition, recall was estimated using answer translation pairs made by the native Indonesian speaker. The definitions of precision and recall are shown in equations 5 and 6. Table 11 presents the evaluation result.

\[
\text{Precision} = \frac{|\text{Acceptable Pairs}|}{|\text{Extracted Pairs}|} \tag{5}
\]

\[
\text{Recall} = \frac{|\text{Acceptable Pairs}|}{|\text{Answer Translation Pairs}|} \tag{6}
\]

Table 11: Evaluation result

                   Precision   Recall
Baseline method    66%         66%
Proposed method    81%         61%

As shown in Table 11, the precision improved by about 24.2% compared with the baseline method described in section 4.1 (from 66% to 81% in the rounded figures of Table 11). This result indicates that extracting translation pairs using structural equivalence is effective.

So far, the efficiency of the constructed dictionaries has been evaluated. Next, the structure of the constructed Japanese-Indonesian dictionaries is evaluated quantitatively, and its relation to the threshold is analyzed. First, the distribution of the number of Indonesian translations per Japanese headword is compared with that of Weblio Indonesian Dictionary. The result is shown in figure 18a. All dictionaries, including Weblio Indonesian Dictionary, showed similar distributions resembling an exponential function. This result confirms that the constructed bilingual dictionaries have the same kind of translation relations as general dictionaries. Moreover, as the threshold rises, the number of words that have many translations decreases, which follows from the nature of structural equivalence. In addition, the constructed bilingual dictionaries have more translations than Weblio Indonesian Dictionary.

(a) Distribution of the number of translations for one Japanese word in the answer dictionary and constructed dictionaries of each threshold
(b) Distribution of the number of translations for one Indonesian word in the answer dictionary and constructed dictionaries of each threshold
Figure 18: Distribution of the number of translations for one word in the answer dictionary and constructed dictionaries of each threshold

The relations between the threshold and the numbers of Japanese headwords and translation pairs of the constructed dictionaries are shown in figure 20a, together with the numbers of Japanese headwords and translation pairs of Weblio Indonesian Dictionary. The number of translation pairs in the constructed Japanese-Indonesian dictionary decreases sharply when the threshold is between 0.6 and 0.7. Therefore, the lost translation pairs were examined using graphs. As a result, it was revealed that words belonging to one common graph structure are not extracted as translations. Figure 19 shows one example of the extracted common graph structure. In this graph, the Japanese word 資本金 (capital) and the Indonesian word kapital (capital) (a red node in figure 19) share more nodes than the others. In the proposed method, such nodes tend to have a high structural equivalence value. However, these nodes are no longer extracted as translations when the threshold is over 0.6. From this result, it can be said that words that have several independent meanings are no longer extracted as translations as the threshold becomes higher.

Figure 19: Example of common graph structure extracted by analysis

Besides, 35,983 Japanese headwords are included only in the constructed dictionaries when the threshold is 1.0, which is the strictest case. This result implies that constructing bilingual dictionaries using WordNet can extend the number of headwords of existing bilingual dictionaries.

(a) Relations between threshold and the number of Japanese headwords of
constructed bilingual dictionaries

(b) Relations between threshold and the number of Indonesian headwords
of constructed bilingual dictionaries

Figure 20: Relations between threshold and the number of headwords of
constructed bilingual dictionaries

Also, in a similar fashion, the structure of the constructed
Indonesian-Japanese dictionaries is evaluated quantitatively, and its relation
to the threshold is analyzed. First of all, the distribution of the number of
Japanese translations for one Indonesian headword is compared with that of the
harmonized Weblio Indonesian Dictionary. The result is shown in figure 18b. All
dictionaries, including the harmonized Weblio Indonesian Dictionary, showed
similar distributions shaped like an exponential function. This result confirms
that the constructed bilingual dictionaries have the same kind of translation
relations as general dictionaries. Moreover, as the threshold rises, the number
of words that have many translations decreases. This follows directly from the
definition of structural equivalence. However, unlike the Japanese-Indonesian
dictionary, the difference from the harmonized Weblio Indonesian Dictionary is
small.

In addition, the relation between the threshold and the numbers of Indonesian
headwords and translation pairs in the constructed dictionaries is shown in
figure 20b, together with the numbers of Indonesian headwords and translation
pairs in the harmonized Weblio Indonesian Dictionary. The result showed a
tendency similar to that of the Japanese-Indonesian dictionaries. The cause of
the decrease in the number of translation pairs turned out to be the same as in
the Japanese-Indonesian case.

Finally, a headword coverage rate, which shows how many headwords of the
constructed bilingual dictionary appear in a list of frequently used words, is
calculated based on the Japanese Web corpus with appearance frequency data
provided by Kurohashi Laboratory at Kyoto University. In this word list, an
appearance frequency value freq(wi) is given to each word wi. Concretely
speaking, the accumulative frequency of appearance (hereafter referred to as
AFA) and the accumulative accuracy rate (hereafter referred to as AAR) are
calculated for the word list. These values are formalized as follows.

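\[
\text{Accumulative Frequency of Appearance}(i) =
  \frac{\sum_{j=1}^{i} \mathrm{freq}(w_j)}{\sum_{j} \mathrm{freq}(w_j)}
  \tag{7}
\]

\[
\text{Accumulative Accuracy Rate}(i) =
  \frac{|\text{Words until rank } i \cap \text{Headwords}|}{|\text{Words until rank } i|}
  \tag{8}
\]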
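
Equations (7) and (8) can be illustrated with a short Python sketch. The word
list, frequencies, and headword set below are hypothetical toy data rather than
the Japanese Web corpus, and the function names are chosen only for this
example. The list is assumed to be sorted by descending frequency, so that
rank i refers to the i-th most frequent word.

```python
from itertools import accumulate


def accumulative_frequency_of_appearance(freqs):
    """AFA(i) for i = 1..n: share of the total frequency covered by ranks 1..i."""
    total = sum(freqs)
    return [partial / total for partial in accumulate(freqs)]


def accumulative_accuracy_rate(ranked_words, headwords):
    """AAR(i) for i = 1..n: share of the top-i words that are headwords."""
    rates = []
    hits = 0
    for i, word in enumerate(ranked_words, start=1):
        if word in headwords:
            hits += 1
        rates.append(hits / i)
    return rates


# Hypothetical toy data: (word, frequency), sorted by descending frequency.
ranked = [("w1", 50), ("w2", 30), ("w3", 15), ("w4", 5)]
# Hypothetical headwords of a constructed dictionary.
headwords = {"w1", "w2", "w4"}

afa = accumulative_frequency_of_appearance([f for _, f in ranked])
aar = accumulative_accuracy_rate([w for w, _ in ranked], headwords)
```

Plotting `aar` against `afa` for each threshold would give curves of the kind
described for the headword coverage rate.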

Figure 21 shows the results for these two measures.

Figure 21: Headword coverage rate calculated using Japanese Web corpus with
appearance frequency data

As shown in figure 21, AAR is about 75% when AFA is about 75% and the threshold
is between 0.0 and 0.3, but AAR becomes lower as the threshold becomes higher.
From this result, it can be said that the Japanese headwords of the constructed
Japanese-Indonesian bilingual dictionary cover most of the frequently used
Japanese words. However, as the threshold value is raised to improve accuracy,
the headword coverage rate becomes lower. There is a clear trade-off between
accuracy and word coverage rate in the proposed method.


Chapter 6 Discussion

In this chapter, the effectiveness and problems of the proposed method are
discussed based on the results shown in Chapter 5.

In this research, bilingual dictionaries that include only nouns are
constructed. However, wordnets also contain verbs, adjectives and adverbs. In
the conceptual structure of WordNet, both nouns and verbs are organized into
hierarchies in which all are linked to a unique beginner synset. Noun
hierarchies are far deeper than verb hierarchies, and verb hierarchies are more
complex than noun hierarchies. Moreover, for adjectives, two central antonyms
form binary poles, while satellite synonyms connect to their respective poles
via similarity relations. Adverbs are organized into an adjective-like
structure because they are defined based on the adjectives from which they
derive.

In the proposed method, in order to calculate a structural equivalence matrix
that is not affected by the whole graph structure, subgraphs are extracted with
a focus on the concepts associated with the words to be translated. Therefore,
the proposed method is considered not to be affected by parts of speech. In
addition, in the area of social network analysis, structural equivalence is
used to find persons who have similar roles among employees. Thus, the proposal
is very likely to work well in hierarchical structures such as trees.
Therefore, if structural equivalence is calculated for the whole graph, it is
likely to work well for nouns and verbs. However, in that case, computational
complexity becomes an issue because huge adjacency matrices have to be handled.
For adjectives and adverbs, it can be predicted that a clustering approach is
likely to improve accuracy because, as discussed above, clusters exist in the
graph.
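
As a rough illustration of the structural equivalence calculation discussed
above, the following Python sketch computes the Pearson correlation coefficient
between two rows of a small matrix. It is only a sketch under stated
assumptions: the word-by-concept matrix is hypothetical toy data, only rows are
compared, the threshold of 0.6 is used purely as an example value, and the
function names are not taken from the thesis implementation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def structural_equivalence(adjacency, i, j):
    """Correlation between the connection patterns (rows) of nodes i and j."""
    return pearson(adjacency[i], adjacency[j])


# Hypothetical toy subgraph: rows are words, columns are concepts c0..c3.
adjacency = [
    [1, 1, 0, 0],  # word 0, e.g. a Japanese word
    [1, 1, 0, 0],  # word 1, e.g. an Indonesian word linked to the same concepts
    [0, 1, 1, 1],  # word 2, sharing only one concept with word 0
]

same = structural_equivalence(adjacency, 0, 1)
partial = structural_equivalence(adjacency, 0, 2)

# Keep only candidate pairs whose value reaches a hypothetical threshold.
threshold = 0.6
pairs = [(i, j) for i, j in [(0, 1), (0, 2)]
         if structural_equivalence(adjacency, i, j) >= threshold]
```

Because each comparison touches a full row, the cost grows with the size of the
matrix, which is consistent with the concern about handling huge adjacency
matrices for the whole graph.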

In this research, relations between words are simplified because only hyponym
relations are used to define graph structures. Therefore, structural
equivalence is considered to work well. However, because of this restriction on
the relations between words, some words in the wordnets may not be included in
the constructed bilingual dictionaries. As shown in tables 8 and 10, half of
the Japanese words in WordNet are not included in the constructed bilingual
dictionaries. Therefore, in order to improve recall and enrich headwords,
relations other than hyponymy need to be considered.

In the experiment, the F-measure was highest when the threshold value th is
between 0.6 and 0.7, which corresponds to a moderately strong correlation in
terms of the correlation coefficient, for the Japanese-Indonesian language
pair. This can be considered to correlate with the language distance from
English to each language. Previous research [3] revealed that one-to-one word
mapping is valid when the source language and the target language are close. In
other words, the conceptual granularity of words is equal when languages are
closely related. Therefore, it is predicted that the F-measure is highest when
the threshold value th is close to 1.0 for similar languages. These hypotheses
have yet to be verified. In addition, the experimental results showed that
there is a trade-off between accuracy and word coverage rate. Therefore, it is
necessary to decide which index is more important for the desired dictionary.
Moreover, the characteristics of translation pairs change according to the
threshold value. As the threshold rises, the polysemy of words in translation
pairs decreases. Thus, the threshold should be set to a higher value if
polysemous words need to be removed.
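
The dependence of the F value on the threshold can be sketched with a small
threshold sweep. The candidate pairs, their scores, and the answer dictionary
below are hypothetical toy data, and the F value is assumed here to be the
usual harmonic mean of precision and recall; the sketch does not reproduce the
experimental results.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(scored_pairs, gold_pairs, threshold):
    """Precision, recall and F value of the pairs whose score is >= threshold."""
    extracted = {pair for pair, score in scored_pairs.items() if score >= threshold}
    correct = extracted & gold_pairs
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical candidate pairs with structural equivalence values.
scored_pairs = {
    ("ja_a", "id_a"): 1.0,
    ("ja_b", "id_b"): 0.7,
    ("ja_c", "id_c"): 0.65,
    ("ja_d", "id_x"): 0.4,
    ("ja_e", "id_y"): 0.2,
}
# Hypothetical answer dictionary.
gold_pairs = {("ja_a", "id_a"), ("ja_b", "id_b"), ("ja_c", "id_c"), ("ja_f", "id_f")}

results = {th: evaluate(scored_pairs, gold_pairs, th) for th in (0.0, 0.6, 1.0)}
best_threshold = max(results, key=lambda th: results[th][2])
```

In this toy setting the lowest threshold keeps incorrect pairs and the highest
threshold discards correct ones, so an intermediate threshold gives the best
balance, mirroring the trade-off described above.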

Finally, this research is likely to enrich language resources between a
highly-resourced language and a less-resourced language, and between
less-resourced languages, because the proposed method needs only wordnets in
both languages, unlike other existing approaches. This point is a contribution
to society.


Chapter 7 Conclusion

Machine readable bilingual dictionaries are essential for machine translation
and inter-lingual information retrieval. Existing approaches to bilingual
dictionary construction are classified into two classes: induction with
multiple bilingual dictionaries and statistical approach with parallel corpora.

In the former case, the issue is how to remove incorrect translation pairs
which arise from ambiguity of pivot words. In the latter case, similarity
between languages is measured by modeling languages based on corpora. However,
due to its dependence on orthographic features of languages, this approach has
low efficiency with distant language pairs. Correct alignment of words based on
their concepts, without being restricted by features of languages, is needed
for bilingual dictionary construction.

In this research, the author constructs bilingual dictionaries using
inter-lingual alignment of words based on the concepts expressed by words,
using a conceptual dictionary called WordNet. Concretely speaking, English
WordNet and wordnets in different languages, which are constructed based on the
conceptual structure of English WordNet, are used. Since English WordNet and
the wordnets in different languages share the same conceptual structure,
translation pairs are extracted by aligning concepts of words between wordnets
in different languages through English WordNet. However, in that case, there
were issues such as the definition of conceptual granularity based on graph
structure and the calculation of confidence for alignment considering
conceptual granularity. The contributions of this research are as follows.


Definition of Conceptual Granularity Based on Graph Structure

Conceptual granularity of words is defined in order to align words based on
concepts. An analysis of the distribution of the concept density of words in
several wordnets revealed that relations between concepts and words are
distorted by the occurrence of words with large concept density in wordnets in
different languages.

Concretely speaking, graphs are used to represent relations between words and
concepts, and conceptual granularity is defined as the diameter of the graph
consisting of the neighboring concepts of a word. In order to examine the
effects of concept density on the resulting bilingual dictionaries, the author,
in cooperation with a native Indonesian speaker, extracted incorrect
translations from a bilingual dictionary constructed by the baseline method
using Japanese WordNet and WordNet Bahasa. Based on this result, subgraphs that
cause incorrect translations were extracted, and the cause of incorrect
translations was identified as matching between words with different conceptual
granularity and sharing of part of a concept among words with large conceptual
granularity.

Calculation of Confidence for Alignment considering Conceptual Granularity

In order to align words considering conceptual granularity, the author defined
confidence for alignment as structural equivalence in graph structure and
proposed a bilingual dictionary construction algorithm using this confidence
value. Concretely speaking, the correlation coefficient between elements in the
adjacency matrix is used to calculate structural equivalence in graphs. In
addition, in order to calculate structural equivalence, an algorithm to extract
the subgraphs related to bilingual dictionary construction was proposed.
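
As an illustration of the first contribution above, the diameter of a small
concept graph can be computed with breadth-first search. This is a minimal
sketch, assuming a connected, undirected graph given as adjacency lists; the
two graphs and all names are hypothetical and are not taken from Japanese
WordNet or WordNet Bahasa.

```python
from collections import deque


def eccentricity(graph, start):
    """Longest shortest-path distance from start, found by breadth-first search."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return max(distance.values())


def diameter(graph):
    """Largest eccentricity over all nodes of a connected undirected graph."""
    return max(eccentricity(graph, node) for node in graph)


# Hypothetical concept graphs around two words (undirected adjacency lists).
narrow_word = {"c1": ["c2"], "c2": ["c1"]}
broad_word = {
    "c1": ["c2"],
    "c2": ["c1", "c3"],
    "c3": ["c2", "c4"],
    "c4": ["c3"],
}

granularity_narrow = diameter(narrow_word)
granularity_broad = diameter(broad_word)
```

Under this reading, a word whose neighboring concepts span a longer chain has a
larger conceptual granularity than a word whose concepts are tightly connected.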

The quality of the bilingual dictionaries extracted by the proposed method is
evaluated by applying the algorithm to Japanese WordNet and WordNet Bahasa. As
a result, a Japanese-Indonesian dictionary including 73,279 word pairs and an
Indonesian-Japanese dictionary including 40,849 word pairs are extracted when
the threshold value is 1.0. The accuracy of the bilingual dictionary is
evaluated by extracting sample pairs from the dictionary and classifying them
into three grades: Acceptable, Partly Acceptable and Not Acceptable. The
accuracy was about 82%. In addition, the effect of the threshold value is
evaluated by measuring the number of headwords, the distribution of the number
of translations for each word, the F-measure, and the Japanese headword
coverage rate. As a result, it was revealed that a threshold value of 1.0,
which means complete structural equivalence, is too strict for bilingual
dictionary construction using wordnets.

In this research, bilingual dictionaries that include only nouns are
constructed using only hyponym relations in WordNet. However, as several other
relations are defined in WordNet, a method to utilize these relations is
required. Moreover, a method to construct bilingual dictionaries that include
not only nouns is also required. These problems need to be addressed in future
research. In addition, evaluation for other language pairs needs to be carried
out in future research because there are many languages for which a wordnet is
available as a language resource.


Acknowledgments

The author would like to express sincere gratitude to the supervisor, Professor

Toru Ishida at Kyoto University, for his continuous guidance, valuable advice,

and helpful discussions.

The author would like to express his appreciation to the advisers, Associate
Professor Yohei Murakami at Kyoto University Design School and Associate
Professor Hiroaki Ohshima at Kyoto University, for their valuable advice.

The author would like to thank all members of the project, especially Arbi Haza
Nasution, for their technical advice.

Finally, the author would like to thank all members of the Ishida and Matsubara
laboratory for their support and discussions.


References

[1] Tanaka, K. and Umemura, K.: Construction of a bilingual dictionary in-

termediated by a third language, Proceedings of the 15th conference on

Computational linguistics-Volume 1 , Association for Computational Lin-

guistics, pp. 297–303 (1994).

[2] Tsuchiya, M., Purwarianti, A., Wakita, T. and Nakagawa, S.: Expanding

Indonesian-Japanese small translation dictionary using a pivot language,

Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster

and Demonstration Sessions , Association for Computational Linguistics,

pp. 197–200 (2007).

[3] Wushouer, M., Lin, D., Ishida, T. and Hirayama, K.: A Constraint Ap-

proach to Pivot-based Bilingual Dictionary Induction, Asian Language In-

formation Processing, 2015 ACM Transactions on, ACM (2015).

[4] Tokunaga, T. and Tanaka, H.: The automatic extraction of conceptual

items from bilingual dictionaries, PRICAI-90 , pp. 304–309 (1990).

[5] Tanaka, K. and Iwasaki, H.: Extraction of lexical translations from non-

aligned corpora, Proceedings of the 16th conference on Computational

linguistics-Volume 2 , Association for Computational Linguistics, pp. 580–

585 (1996).

[6] Mohanty, R., Bhattacharyya, P., Pande, P., Kalele, S., Khapra, M. and

Sharma, A.: Synset based multilingual dictionary: Insights, applications

and challenges, Global Wordnet Conference, pp. 22–25 (2008).

[7] Miller, G. A.: WordNet: a lexical database for English, Communications

of the ACM , Vol. 38, No. 11, pp. 39–41 (1995).

[8] Fellbaum, C.: WordNet , Wiley Online Library (1998).

[9] Boyd-Graber, J., Fellbaum, C., Osherson, D. and Schapire, R.: Adding

dense, weighted connections to WordNet, Proceedings of the Third Global

WordNet Meeting , Jeju (2006).

[10] Mohamed Noor, N., Sapuan, S. and Bond, F.: Creating the Open Wordnet

Bahasa, Proceedings of the 25th Pacific Asia Conference on Language, In-

formation and Computation (PACLIC 25), Singapore, pp. 258–267 (2011).


[11] Bond, F. and Foster, R.: Linking and Extending an Open Multilingual

Wordnet, Sofia (2013).

[12] Bond, F., Isahara, H., Kanzaki, K. and Uchimoto, K.: Boot-strapping a

WordNet using multiple existing WordNets. (2008).

[13] Isahara, H., Bond, F., Uchimoto, K., Utiyama, M. and Kanzaki, K.: De-

velopment of the Japanese WordNet., LREC (2008).

[14] Lim, L. T. and Hussein, N.: Fast prototyping of a Malay wordnet sys-

tem, Proceedings of the Language, Artificial Intelligence and Computer

Science for Natural Language Processing (LAICS-NLP) Summer School

Workshop, pp. 13–16 (2006).

[15] Genzel, D.: Inducing a multilingual dictionary from a parallel multitext

in related languages, Proceedings of the conference on Human Language

Technology and Empirical Methods in Natural Language Processing , Asso-

ciation for Computational Linguistics, pp. 875–882 (2005).

[16] Lafourcade, M.: Multilingual dictionary construction and services-case

study with the fe* projects, Proc. of PACLING , pp. 289–306 (1997).
