a tutorial on machine translation

Man to Machine: A tutorial on the art of Machine Translation

Jaganadh [email protected]

http://jaganadhg.freeflux.net/blog

1 Introduction

Machine Translation (MT) is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. It is interesting to think about an MT system that can translate literary works from one language to another. To enjoy the novel 'Anything for you, ma'am' (http://www.raheja.org/), just feed the novel into an MT system and get it translated into your language. Such MT systems are supposed to break the language barrier. MT can help us to overcome the technological barrier too. The drastic developments in Information and Communication Technology (ICT) have led to an overflow of information on the internet. But this information is available only in a very small subset of languages, so it is not reachable for a significant portion of users. This phenomenon is called the digital divide. A lot of information is available on the internet in English, but the same information may not be available in our vernaculars like Hindi or Malayalam. In the case of India, only 3% of the population can understand English (Sinha, R.M.K. and A. Jain, 2003, 'AnglaHindi: An English to Hindi machine translation system', Proceedings of the MT Summit IX, New Orleans, LA, pp. 23-27). In a country like India, achievements in MT research and development therefore have great significance. In short, MT helps the world to be united both intellectually and culturally. To achieve this we have to do a lot of work in language and linguistics as well as in computer science. The present tutorial is an introduction to the art of MT. The material is compiled with the help of already published literature in the field, and its main sources are listed in the reference section. The tutorial is only a theoretical overview of the field.

1.1. History of MT

The history of MT starts in the early 1950s, but some hypothetical concepts existed before that period. In the 17th century two philosophers, Leibniz and Descartes, put forward proposals for codes which could relate words between languages. These proposals remained theory only. The first proposal for developing MT was put forward in 1949 by Warren Weaver, a researcher at the Rockefeller Foundation (http://en.wikipedia.org/wiki/History_of_machine_translation). A few years later, actual research in MT started at many universities in the United States. The first public demonstration of an MT system was held on 7 January 1954 at the head office of IBM. It is known as the 'Georgetown-IBM experiment'. The system was a kind of toy system, having just 250 words and translating just 49 carefully selected Russian sentences into English. Many institutions in the US were very active in R&D related to MT, and the government was very supportive. In 1964 the US government constituted a committee to evaluate the progress of MT research, called the Automatic Language Processing Advisory Committee (ALPAC). The committee concluded that MT was more expensive, less accurate and slower than human translation, and that, despite the expense, MT was not likely to reach the quality of a human translator in the near future. It did recommend that tools to aid translators, such as automatic dictionaries, should be developed and that research in Computational Linguistics (CL) should be continued. The report had a deep impact on MT researchers, and MT research was abandoned for a period. But the field rose again like a phoenix, and significant developments have followed. MT research is very active in Indian Languages (IL) too.

2 Machine Translation

Translation can be defined as the act or process of translating, especially from one language into another. We know that producing high quality translation is difficult even for human translators. A translator should possess knowledge of the Source Language (SL) and the Target Language (TL), including their grammar and culture. Even a person who possesses all such knowledge is not guaranteed to produce a high quality translation, because natural language is ambiguous. Even the term 'translation' has four different meanings in different contexts. So selecting the right word meaning while translating from SL to TL requires knowledge of the context. Let's see how this can be made possible with computers.

2.1 Approaches in MT

Approaches in MT can be classified into four categories:

1) Direct MT

2) Rule-based MT

3) Corpus-based MT

4) Knowledge-based MT

Fig.1. Machine Translation Approaches (Machine Translation branches into Direct MT; Rule based MT, with Transfer based MT and Interlingua based MT; Corpus based MT, with Statistical MT and Example based MT; and Knowledge-based MT)

Each of the approaches mentioned above has its own advantages and disadvantages. A brief note on each approach is given below.

2.1.1 Direct Machine Translation

As the name suggests, direct MT systems provide direct translation. No intermediate representation or complex architecture is involved. The system carries out word by word translation with the help of a bilingual dictionary, usually followed by some minimal syntactic rearrangement. It involves little analysis of the SL text and no parsing, and it relies mainly on the quality of the bilingual dictionary. The general flow of a direct MT system is:

1) Remove morphological inflection from the SL words

2) Look up a bilingual dictionary to get the corresponding TL word

3) Perform the necessary syntactic rearrangements

Figure 2. Direct Machine Translation System (SL text -> morphological analysis -> SL words -> bilingual dictionary lookup, using the SL-TL dictionary -> TL words -> syntactic rearrangement -> TL text)
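The three steps of the direct approach can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the lemma table, the word classes and the romanized Hindi glosses in the dictionary are hypothetical entries made up for this one sentence, and the rearrangement rule (verb to the end, preposition after its noun) is a deliberately crude stand-in for real reordering.

```python
# Toy direct MT pipeline: morphological analysis -> dictionary lookup -> rearrangement.
# Every lexicon entry below is a hypothetical, romanized gloss chosen for illustration.

LEMMAS = {"slept": "sleep"}                      # inflected SL form -> SL root
BILINGUAL = {"sita": "Sita", "sleep": "soyi",    # SL root -> TL word
             "in": "mein", "garden": "bagicha"}
ARTICLES = {"a", "an", "the"}                    # dropped, since Hindi has no articles
PREPOSITIONS = {"in"}
VERBS = {"sleep"}


def analyse(sentence):
    """Step 1: strip inflection (and articles) from the SL words."""
    words = sentence.lower().rstrip(".").split()
    return [LEMMAS.get(w, w) for w in words if w not in ARTICLES]


def lookup(roots):
    """Step 2: pair every SL root with its TL word from the bilingual dictionary."""
    return [(root, BILINGUAL.get(root, root)) for root in roots]


def rearrange(pairs):
    """Step 3: move verbs to the end and put each preposition after its noun."""
    verbs = [p for p in pairs if p[0] in VERBS]
    rest = [p for p in pairs if p[0] not in VERBS]
    ordered, i = [], 0
    while i < len(rest):
        if rest[i][0] in PREPOSITIONS and i + 1 < len(rest):
            ordered.extend([rest[i + 1], rest[i]])   # noun first, then postposition
            i += 2
        else:
            ordered.append(rest[i])
            i += 1
    return [tl for _, tl in ordered + verbs]


def translate(sentence):
    return " ".join(rearrange(lookup(analyse(sentence))))


print(translate("Sita slept in the garden."))
```

Note that the output still carries the dictionary form of the noun; adjusting such forms is the job of the idiomatization step described in the text.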

Consider the example 'Sita slept in the garden'. Let's see how it will be translated to Hindi with a direct MT system.

Input (English sentence): Sita slept in the garden.

Word translation:

Syntactic rearrangement:

Besides simple word translation and reordering, suffix handling and preposition handling are needed to make the translation acceptable. This is called idiomatization.

Consider the example:

English sentence: The boy gave the girl a flower.

Word translation:

Syntactic rearrangement:

Idiomatization:

Modification of the verb and adjective according to the gender of the subject is also required if the TL has such constraints. In languages like Hindi, such grammatical phenomena have to be taken care of to produce quality translation.

E.g.

English sentence: She saw stars in the sky.

Word translation:

Syntactic rearrangement:

Idiomatization:

Attaining such quality with direct MT is very difficult if the SL and TL do not share similar syntactic and morphological phenomena. For a Hindi to English or English to Hindi translation system, word by word replacement and idiomatization alone will not produce understandable MT output. Such MT output is called 'word salad'.

The major limitations of this MT approach are:

1) It does not consider the structure of the sentence or the relationships between words.

2) There is no attempt to disambiguate word senses, although the majority of words in natural language are ambiguous. For example, a Hindi verb that denotes the activity of eating can take on a totally different meaning when an adjective precedes it.

3) No adaptability: a system developed for a particular language pair will not be suitable for another language pair.

2.1.2. Rule-based MT(RBMT)

The rule based approach in MT is considerably more advanced than the direct MT approach. The system relies on hand made linguistic rules for performing the MT process. There are two types of rule-based MT approaches: 1) Transfer-based MT and 2) Interlingua based MT.

2.1.2.a. Transfer-based MT

In this approach the SL text is analyzed to produce a representation, which is then transformed to match the rules of the target language. This requires an understanding of the differences between the SL and the TL. A typical flow of transfer-based MT is:

1) Analysis: analyze the SL text [syntactical].

2) Transfer: transfer the SL syntactic structure to the TL syntactic structure.

3) Generation: generate the TL text with defined rules.

Figure 3. Diagram of transfer-based MT (SL text -> analysis, using the SL grammar -> SL representation -> transfer, using the SL-TL dictionary -> TL representation -> synthesis, using the TL grammar -> TL text)

We can work through the system with our previous example, 'Sita slept in the garden'.

Input: Sita slept in the garden

Analysis output: (S (NP (NNP Sita)) (VP (VBD slept) (PP (IN in) (NP (DT the) (NN garden)))))

After syntactic transfer: (S (NP (NNP Sita)) (VP (PP (NP (DT the) (NN garden)) (IN in)) (VBD slept)))

Hindi lexicalization: (S (NP (NNP )) (VP (PP (NP (NN )) (IN )) (VBD )))

Hindi sentence:
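The syntactic transfer step in this example can be sketched as a small tree rewrite in Python. The sketch assumes only two hand-written transfer rules, both inferred from the example above rather than taken from any real system: inside a PP the preposition and its noun phrase swap places, and inside a VP the verb moves to the end. The function names are hypothetical.

```python
# Toy syntactic transfer on a bracketed parse tree.
# The two rules below are assumptions inferred from the worked example.

def parse(text):
    """Read a bracketed tree such as '(NP (NNP Sita))' into (label, children)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        label = tokens[pos + 1]          # tokens[pos] is the opening bracket
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)
                children.append(child)
            else:
                children.append(tokens[pos])   # a leaf word
                pos += 1
        return (label, children), pos + 1

    tree, _ = read(0)
    return tree


def transfer(node):
    """Apply the SL -> TL reordering rules bottom-up."""
    if isinstance(node, str):
        return node
    label, children = node
    children = [transfer(child) for child in children]
    if label == "PP" and len(children) == 2:
        children = [children[1], children[0]]        # preposition becomes postposition
    elif label == "VP":
        verbs = [c for c in children
                 if not isinstance(c, str) and c[0].startswith("VB")]
        others = [c for c in children if c not in verbs]
        children = others + verbs                     # verb-final order
    return (label, children)


def show(node):
    """Write a tree back out in bracketed form."""
    if isinstance(node, str):
        return node
    label, children = node
    return "(" + label + " " + " ".join(show(c) for c in children) + ")"


source = "(S (NP (NNP Sita)) (VP (VBD slept) (PP (IN in) (NP (DT the) (NN garden)))))"
print(show(transfer(parse(source))))
```

Lexicalization would then replace each English leaf with its Hindi equivalent from the SL-TL dictionary and drop the determiner.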

The main advantage of the system is its modular structure: the analysis of the SL text is independent of the TL text generator. Another notable advantage is its capability to resolve ambiguity even at the lexical level. For example, the English word 'book' falls into two part of speech (POS) categories, noun and verb. This approach can handle such lexical ambiguity to a certain extent. The major disadvantage of the system is its limited adaptability or extensibility to a group of language pairs. If we are trying to develop a system for English to Hindi and for Malayalam to Hindi, we need two SL analyzers and two sets of transfer rules.

2.1.2.b Interlingua-based MT

In the interlingua based approach, the SL text is converted into a language independent meaning representation called an 'interlingua'. From this interlingual representation, the TL text can be generated. In short, translation in this approach is a two-stage process: analysis and synthesis.

Figure 4. Model of interlingua based MT (SL text -> analysis -> interlingua representation -> TL synthesis -> TL text)

The flow of the system is clear from the diagram above. The system receives the input and performs SL analysis, which is SL specific. The effort required to develop an interlingua based machine translation system is much greater than for the transfer based approach. The major source of difficulty in this approach is defining a universal and abstract interlingual representation. A sample interlingua representation for the sentence 'Sita slept in the garden' is given below.

(*sleep
  (tense past)
  (mood declarative)
  (punctuation period)
  (subject (*Sita (number singular)))
  (location (*garden (reference definite) (number singular))))

Sample interlingua for the sentence 'Sita slept in the garden'
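To make the two-stage idea concrete, the frame above can be held in a small data structure from which a TL sentence is synthesized. The sketch below is an assumption-laden illustration: the class names, the past-tense table and the English generator are all hypothetical, and English is used as the target only so that the output is easy to check. A generator for another TL would read the same frame and apply its own word order and vocabulary.

```python
from dataclasses import dataclass, field

# Hypothetical data structure for the sample interlingua frame.

@dataclass
class Concept:
    head: str
    features: dict = field(default_factory=dict)


@dataclass
class Event:
    predicate: str
    tense: str
    mood: str
    subject: Concept
    location: Concept


frame = Event(
    predicate="sleep",
    tense="past",
    mood="declarative",
    subject=Concept("Sita", {"number": "singular"}),
    location=Concept("garden", {"reference": "definite", "number": "singular"}),
)

PAST_FORMS = {"sleep": "slept"}   # toy morphological table (assumption)


def noun_phrase(concept):
    """A definite reference is realized with the article 'the' in English."""
    article = "the " if concept.features.get("reference") == "definite" else ""
    return article + concept.head


def generate_english(event):
    """TL synthesis: reads only the frame, never the original SL sentence."""
    if event.tense == "past":
        verb = PAST_FORMS[event.predicate]
    else:
        verb = event.predicate + "s"
    body = f"{noun_phrase(event.subject)} {verb} in {noun_phrase(event.location)}"
    return body + ("." if event.mood == "declarative" else "?")


print(generate_english(frame))
```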

2.1.3. Corpus Based MT

A corpus is a large collection of text or speech in a language. In recent years there has been increased interest in corpus based MT systems, because they need less effort from language and linguistics experts. On the other hand, they require a large sentence aligned parallel corpus. The corpus based approach can be classified into two types: 1) statistical MT (SMT) and 2) example based MT (EBMT).

2.1.3.a. SMT

SMT is inspired by the noisy channel model used in Automatic Speech Recognition (ASR). In that model the channel introduces noise, which makes it difficult to recognize the input word. A recognition system based on this idea builds a model of the channel to identify how it modifies the input, and uses it to recover the original word.

An SMT system models a TL sentence T, given an SL sentence S, as the product of the translation probability P(S|T) and the TL probability P(T). The translation probability P(S|T) accounts for the adequacy of the translated content, whereas P(T) accounts for the fluency of the target construction. The basic view behind SMT is that every sentence in one language is a possible translation of a sentence in the other language; a sentence can be translated into another language in many ways, and the choice among them is specific to the translator.

Figure 5. Noisy channel model for English to Hindi MT (the language model P(T) produces T; the translation model P(S|T) turns T into S; the decoder recovers T from S)

Let's consider the example of an English to Hindi SMT system. Every Hindi sentence h is a possible translation of an English sentence e, but not an equally probable one: for 'Murthy eats apple', some Hindi sentences have a much lower probability of being its translation than others. Every pair of sentences (e,h) is assigned a probability P(h|e), which is the probability that a translator, when presented with the English sentence e, will produce h as its Hindi translation. We can assume that when a native speaker of Hindi produces an English sentence, he has a Hindi sentence in mind and translates it into English mentally. The goal of SMT is to find the sentence h that the native speaker had in mind when he produced e. The noisy channel model can be described as

P(h|e) = P(e,h) / P(e) = P(h) x P(e|h) / P(e)

Since P(e) is fixed for a given input e, the best translation is the h that maximizes P(h) x P(e|h).

The two components in SMT are the Language Model (LM) and the Translation Model (TM). A language model gives the probability of a sentence. These probabilities are calculated with N-gram techniques (http://en.wikipedia.org/wiki/N-gram). The translation model helps to compute the conditional probability P(e|h). It is trained from a parallel corpus of English/Hindi sentence pairs. This section is just a bird's eye view of SMT techniques; due to time constraints it concludes with these introductory remarks. Some Free and Open Source (FOSS) tools are now available for experimenting with MT, such as Moses for SMT (www.statmt.org/moses) and the rule-based Apertium platform (www.apertium.org).
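The decision rule of the noisy channel model, choosing the h that maximizes P(h) x P(e|h), can be illustrated with a toy decoder. Everything concrete in this sketch is an assumption made for illustration: the three-sentence romanized Hindi corpus, the hand-set word translation table, the add-one smoothed bigram language model, and the word-based translation model (a simplified form in the style of IBM Model 1, which the tutorial itself does not describe). A real decoder searches a huge space of hypotheses; here it only ranks a given list of candidates.

```python
import math
from collections import Counter

# Hypothetical romanized Hindi corpus for the language model P(h).
LM_CORPUS = ["Murthy seb khata hai", "Ravi seb khata hai", "Sita aam khati hai"]

# Hypothetical word translation table t(e_word | h_word) for P(e|h).
T_TABLE = {("Murthy", "Murthy"): 0.9, ("eats", "khata"): 0.8, ("apple", "seb"): 0.9}
FLOOR = 1e-6   # probability given to word pairs missing from the table


def train_bigram_lm(corpus):
    """Return a function giving log P(h) under an add-one smoothed bigram model."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), {"</s>"}
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))

    def log_prob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigram_counts[(a, b)] + 1) / (context_counts[a] + len(vocab)))
            for a, b in zip(tokens, tokens[1:])
        )

    return log_prob


def translation_log_prob(e_sentence, h_sentence):
    """log P(e|h): each English word is explained by the average over all h words."""
    h_words = h_sentence.split()
    total = 0.0
    for e_word in e_sentence.split():
        mass = sum(T_TABLE.get((e_word, h_word), FLOOR) for h_word in h_words)
        total += math.log(mass / len(h_words))
    return total


def decode(e_sentence, candidates, lm):
    """Pick the candidate h that maximizes log P(h) + log P(e|h)."""
    return max(candidates, key=lambda h: lm(h) + translation_log_prob(e_sentence, h))


lm = train_bigram_lm(LM_CORPUS)
candidates = ["Murthy khata seb hai", "Sita aam khati hai", "Murthy seb khata hai"]
print(decode("Murthy eats apple", candidates, lm))
```

The first candidate contains the right words in the wrong order, so the translation model scores it as well as the third and only the language model separates them (fluency). The second candidate is fluent but the translation model rejects it (adequacy).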

2.1.3.b. Example-based MT(EBMT)

An EBMT system uses past translation examples to generate the translation for a given SL text. It maintains an example-base consisting of translation examples between the source and target languages. When an SL sentence is given to the system, the system retrieves a similar SL sentence and its translation from the example-base. It then adapts the example to generate the TL sentence for the input. EBMT rests on the idea that similar sentences will have similar translations. The system has two main modules: 1) retrieval and 2) adaptation.

Figure 6. Example based MT (SL sentence -> retrieval, using the example base and example patterns -> adaptation, using adaptation rules and the SL-TL dictionary -> TL sentence)

The task of the retrieval module is to retrieve translation examples from the stored example-base. This module tries to retrieve the example that is most similar to the input sentence. The adaptation module is responsible for carrying out the necessary modifications to the retrieved example to generate the TL sentence. This modification may involve addition, deletion or insertion of morphological words, constituent words or suffixes.

Let's elaborate the concept with the help of an example. Consider English to Hindi translation of the following input sentence:

Input: Santhosh is writing a letter.

Example base:

(1) Vikram wrote a poem.

(2) Anand is writing.

(3) Ravi is writing an essay.

(4) Mukesh writes a Malayalam poem.

Selection by the retriever: Ravi is writing an essay.

Using this retrieved pair, the system will replace Ravi with Santhosh, and the Hindi word for 'essay' with the Hindi word for 'letter', in the TL translation.
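The retrieval and adaptation modules can be sketched as follows. This is a minimal illustration under several assumptions: the romanized Hindi sides of the examples and the dictionary entries are hypothetical, the similarity score (position-wise word matches minus the difference in length) is just one simple choice that happens to prefer structurally parallel examples, and adaptation is reduced to dictionary-driven word substitution.

```python
# Toy EBMT: retrieve the most similar stored example, then adapt its translation.
# The TL sides and dictionary entries are hypothetical romanized glosses.

EXAMPLE_BASE = [
    ("Vikram wrote a poem", "Vikram ne kavita likhi"),
    ("Anand is writing", "Anand likh raha hai"),
    ("Ravi is writing an essay", "Ravi nibandh likh raha hai"),
    ("Mukesh writes a Malayalam poem", "Mukesh Malayalam kavita likhta hai"),
]

DICTIONARY = {"Ravi": "Ravi", "Santhosh": "Santhosh",
              "essay": "nibandh", "letter": "patra"}


def tokenize(sentence):
    return sentence.rstrip(".").split()


def similarity(a, b):
    """Words matching at the same position, penalized by the length difference."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches - abs(len(a) - len(b))


def retrieve(input_tokens):
    """Retrieval module: return the (SL, TL) example closest to the input."""
    return max(EXAMPLE_BASE,
               key=lambda pair: similarity(input_tokens, tokenize(pair[0])))


def adapt(input_tokens, example_sl, example_tl):
    """Adaptation module: substitute TL words wherever the SL words differ."""
    tl_tokens = example_tl.split()
    for old, new in zip(tokenize(example_sl), input_tokens):
        if old != new and old in DICTIONARY and new in DICTIONARY:
            tl_tokens = [DICTIONARY[new] if t == DICTIONARY[old] else t
                         for t in tl_tokens]
    return " ".join(tl_tokens)


def translate(sentence):
    tokens = tokenize(sentence)
    example_sl, example_tl = retrieve(tokens)
    return adapt(tokens, example_sl, example_tl)


print(translate("Santhosh is writing a letter."))
```

The pair 'an'/'a' is skipped during adaptation because neither word is in the dictionary, which fits the fact that Hindi has no articles.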

2.1.4 Knowledge-based MT (KBMT)

The MT systems we have seen so far use morphological or syntactic knowledge, or some amount of semantic knowledge, to translate SL text into TL. Even though the interlingua based system uses some semantics, its central concept is syntactic analysis. Semantics based language analysis was introduced by Artificial Intelligence (AI) researchers. This approach requires a large amount of ontological and lexical knowledge. The KBMT approach includes semantic parsing, lexical decomposition into semantic networks, and resolution of ambiguities and uncertainties by reference to a knowledge-base.

person ::= ('person' ('isa' creature) ('agent-of' (Eat, Drink, Move, Attack, Love ....)) ('consists-of' (Hand, Foot, ....)))

computer-user ::= ('computer-user' ('isa' person) ('agent-of' (+(Operate))) ('subworld' computer-world))

Example of an ontology for KBMT system
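One way to see how such an ontology helps resolve ambiguity is to encode it as a lookup structure and follow the 'isa' links. The sketch below is a minimal illustration: the dictionary layout and the function name are hypothetical, the 'creature' entry is an assumed placeholder for the parent concept that the example only mentions, and the elided members ('....') are simply left out.

```python
# Hypothetical encoding of the ontology fragment, with inheritance over 'isa' links.

ONTOLOGY = {
    "creature": {"isa": None, "agent-of": set()},          # assumed root concept
    "person": {
        "isa": "creature",
        "agent-of": {"Eat", "Drink", "Move", "Attack", "Love"},
        "consists-of": {"Hand", "Foot"},
    },
    "computer-user": {
        "isa": "person",
        "agent-of": {"Operate"},        # '+' in the source: added to inherited actions
        "subworld": "computer-world",
    },
}


def can_be_agent_of(concept, action):
    """Walk up the 'isa' chain until the action is found or the root is passed."""
    while concept is not None:
        entry = ONTOLOGY[concept]
        if action in entry.get("agent-of", set()):
            return True
        concept = entry.get("isa")
    return False


print(can_be_agent_of("computer-user", "Eat"))       # inherited from person
print(can_be_agent_of("person", "Operate"))          # specific to computer-user
```

A KBMT analyzer could use a check like this to reject readings in which the subject is not a plausible agent of the verb.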

3. Machine Translation Evaluation

Many online MT systems are available to the general public. One of the most famous online MT services is Google Translate (http://translate.google.com/). Have you ever tried its Hindi to English or English to Hindi translation? If not, try it out and have fun!

Evaluating MT is at least as hard as developing an MT system. Why is MT evaluation crucial? Because what a consumer expects from a commercial MT product is high quality translation. The aim of MT evaluation is to measure how accurately an MT system can handle the phenomena involved in translating from SL to TL. Suppose you give the sentence 'I like milk' as input to an MT system and it produces a wrong Hindi sentence instead of the correct one. What will your reaction be? You will certainly say that the MT system is useless! At the same time, an MT system may translate this sentence into Hindi in several different ways, some of which are acceptable and some of which are not.

Researchers have developed many MT evaluation techniques. Among them, the BLEU (http://en.wikipedia.org/wiki/Bilingual_evaluation_understudy), METEOR (http://en.wikipedia.org/wiki/METEOR) and NIST (http://en.wikipedia.org/wiki/NIST_(metric)) metrics are widely used. These are automatic MT evaluation methods. Besides these, the most effective method is human evaluation, but its disadvantage is that it is time consuming and costly. The automatic metrics are not equally effective for all language pairs. The adaptability of the BLEU metric to English to Indian language translation is under study, and some results and observations are already available (http://www.cse.iitb.ac.in/~pb/papers/icon07-bleu.pdf).
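To give a feel for how an automatic metric such as BLEU works, the sketch below computes a sentence-level score from clipped n-gram precision and a brevity penalty. It is a simplified illustration and not a reference implementation: BLEU is normally computed over a whole test corpus with n-grams up to length 4, whereas this sketch scores a single sentence and assumes a maximum n-gram length of 2 so that short toy sentences do not score zero. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped precision: a candidate n-gram is credited at most as many times
    as it occurs in the reference that contains it most often."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=2):
    """Geometric mean of the n-gram precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    c = len(candidate)
    # Reference length closest to the candidate length (shorter one on ties).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "the cat is on the mat".split()
print(bleu("the cat is on the mat".split(), [reference]))
print(bleu("the cat sat on the mat".split(), [reference]))
print(modified_precision("the the the the the the the".split(), [reference], 1))
```

The last line shows why the counts are clipped: a candidate that only repeats a common word gets credit for that word no more often than the reference actually contains it.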

4 MT Research in India

MT research in India started in the 1970s and the beginning of the 1980s. The major MT system development projects are carried out at IIT Kanpur, Central University of Hyderabad, IIIT Hyderabad, AU-KBC Research Center Chennai, C-DAC, IISC Kolkata and Tamil Virtual University Thanjavur. The earliest systems developed for English to Hindi are Anglabharati and Anusaarak from IIT Kanpur. A list of MT projects in India is given below.

Name of the MT Project | Name of R&D center | Language pair

Anglabharati | IIT Kanpur | English to Indian languages

Anubharati | IIT Kanpur | Indian languages to English

Anusaarak | IIT Kanpur, Central Univ. of Hyderabad, IIIT Hyderabad | English to Hindi, IL to IL

MaTra | C-DAC Mumbai | English to Indian languages

Mantra | C-DAC Pune | English to Hindi

UNL based MT | IIT Bombay | English to Hindi, Marathi

Tamil Hindi anusaarak | AU-KBC Chennai | Tamil to Hindi

English Tamil MT | AU-KBC Chennai | English-Tamil

Shakti | IIIT Hyderabad | English-Hindi

Sampark | IIIT Hyderabad | IL to IL

Beyond these projects, industry giants like IBM and Microsoft are also engaged in English to Hindi MT system development.

5. References

[1] Natural Language Processing and Information Retrieval, Tanveer Siddiqui, U S Tiwary, Oxford University Press, Delhi, India, 2008.

[2] Speech and Language Processing, Daniel Jurafsky and James H. Martin, Prentice Hall, 2009.

[3] Foundations of Statistical Natural Language Processing, Chris Manning and Hinrich Schütze, MIT Press, Cambridge, MA, May 1999.

[4] Statistical MT tutorial, www.isi.edu/natural-language/mt/wkbk.rtf, Accessed on 12-02-2010.

[5] Automatic Translation of Languages, http://www.mt-archive.info/Bar-Hillel-1960.pdf, Accessed on 15-02-2010.

[6] An Introduction to Machine Translation, http://www.hutchinsweb.me.uk/IntroMT-TOC.htm, Accessed on 01-02-2010.

Note: Some of the examples and diagrams used in this document are adapted from the book Natural Language Processing and Information Retrieval [1], either directly or with some modifications.

National Conference on Translation, Dept. of Hindi, University of Kerala, February 23-24, 2010
