machine translation across indian languages

Machine Translation across Indian Languages Dipti Misra Sharma LTRC, IIIT Hyderabad Patiala 15-11-2013



  • Machine Translation across Indian Languages

    Dipti Misra Sharma
    LTRC, IIIT Hyderabad
    Patiala, 15-11-2013

  • Outline

    Introduction
    Information Dynamics in language
    Machine Translation (MT)
    Approaches to MT
    Practical MT systems
    Challenges in MT
    Ambiguities
    Syntactic differences in L1 and L2
    MT efforts in India
    Sampark : IL to IL MT systems
    Objective
    Design
    Issues
    Conclusions

  • Introduction

    Natural Language Processing (NLP) involves processing information contained in natural languages
    Natural as opposed to formal/artificial
      Formal languages : programming languages, logic, mathematics etc.
      Artificial : Esperanto

  • Natural Language Processing (NLP)

    Helps in communication between
      Man-machine : question answering systems, e.g. interactive railway reservation
      Man-man : machine translation

  • Communication

    Transfer of information from one to the other
    Language is a means of communication

    Therefore, one can say
      It encodes what is communicated

    We apply the processes of
      Analysis (decoding) for understanding
      Synthesis (encoding) for expression (speaking)

  • What do we communicate?

    Information
      Spain delivered a football masterclass at Euro 2012

    Intention
    Emphasis/focus
      Euro 2012 bagged/won by Spain
      Spain bags Euro 2012

    Introduces variation

  • How do we communicate?

    We use linguistic elements such as
    Words
      (country, park, the, is, Bandipur, of, as, and, considered, National, a, spot, beautiful, tourist, life, in, best, wild, sanctuaries, the, one)

    Arrangement of the words (Sentences)
      Words are related to each other to provide the composite meaning
      (Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country)

  • How do we communicate? Contd.

    Arrangement of sentences (Discourse)
      Sentences or parts of sentences are related to each other to provide a cohesive meaning

      *Considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km. Bandipur National park is a beautiful tourist spot.

      Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km.

    Languages differ in the way they organise information in these entities
    All of these interact in the organisation of information

  • Information Dynamics in Language (1/4) Languages encode information

    Hindi: cuuhe     maarate     haiM       kutte
           'rat-pl'  'kill-hab'  'pres-pl'  'dog-pl'
    English: rats kill dogs

    The Hindi sentence is ambiguous. Possible interpretations:
      Dogs kill rats
      Rats kill dogs

    However, the English sentence is not ambiguous

  • Information Dynamics in Language (2/4)

    Ambiguity in Hindi is resolved if,

    cuuhe  maarate   haiM     kuttoM    ko
    rats   kill-hab  pres-pl  dogs-obl  acc

    Hindi encodes information in morphemes
    English encodes information in positions

    Languages encode information differently
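
The contrast between the two encoding strategies can be sketched in a few lines of Python. This is only a toy illustration, not part of the original slides: the function names, the two-word verb list and the reordered test sentence are assumptions, and the only case marker handled is the accusative 'ko' from the example above.

```python
# Toy illustration; names and the tiny verb list are assumptions.
HINDI_VERB_WORDS = {"maarate", "haiM"}  # verb and auxiliary from the example


def roles_by_position(tokens):
    """English-style reading of a simple clause: subject, verb, object by position."""
    subject, verb, obj = tokens
    return {"subject": subject, "verb": verb, "object": obj}


def roles_by_case_marker(tokens):
    """Hindi-style reading: the noun followed by 'ko' is the object, wherever it stands."""
    nouns = [t for t in tokens if t not in HINDI_VERB_WORDS and t != "ko"]
    marked = [tokens[i - 1] for i, t in enumerate(tokens) if t == "ko" and i > 0]
    if not marked:
        return None  # no case marker, so the clause stays ambiguous
    obj = marked[0]
    subject = next((n for n in nouns if n != obj), None)
    return {"subject": subject, "object": obj}


print(roles_by_position("rats kill dogs".split()))
print(roles_by_case_marker("cuuhe maarate haiM kuttoM ko".split()))
print(roles_by_case_marker("kuttoM ko cuuhe maarate haiM".split()))  # constructed reordering
print(roles_by_case_marker("cuuhe maarate haiM kutte".split()))      # unmarked: ambiguous
```

Swapping the nouns changes the English reading, while the Hindi reading follows the marker regardless of word order. A real system would need a morphological analyser rather than a token match on 'ko'.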

  • English does not explicitly mark accusative case (except in pronouns): no morpheme

    No lexical item/morpheme for yes-no questions (Eng: Is he coming? Hindi: kyaa vah aa rahaa hai?)

    Position plays an important role in encoding information in English

    Subject is sacrosanct

    Hindi encodes information morphologically

  • Information Dynamics in Language (3/4)

    Another example,

    This chair has been sat on

    The chair has been used for sitting
    Someone sat on this chair, and it is known
    The sentence does not mention someone

    Languages encode information partially

  • Information Dynamics in Language (4/4)

    English pronouns: he, she, it
    Hindi pronoun: vaha

    He is going to Delhi ==> vaha dillii jaa rahaa hai
    She is going to Delhi ==> vaha dillii jaa rahii hai
    It broke ==> vaha TuuTa ??

    Information does not always map fully from one language into another
    Conceptual worlds may be different

    Gender Information

  • Information in Language Languages encode information differently

    Languages code information only partially

    Tension between BREVITY and PRECISION

  • Human beings use
      World knowledge
      Context (both linguistic and extra-linguistic)
      Cultural knowledge and
      Language conventions

    to resolve ambiguities

    Can all this knowledge be provided to the machine ?

  • Languages differ
      Script (for written language)
      Vocabulary
      Grammar

    These differences can be considered as a measure of language distance

  • Language Distance

    Script              Vocabulary          Grammar
    ------------------  ------------------  ------------------
    Urdu -> Hindi       Telugu -> Hindi     Telugu -> Hindi
    English -> Hindi    English -> Hindi    English -> Hindi

  • Machine Translation

    Machine translation aims at automatic translation of a text in the source language to a text in the target language.
    Mohan gave Hari a book -> Mohan ne Hari ko kitAba dI

  • Machine Translation

    Let us view MT as a problem of
      Language encoding (source) - analysis
      Decoding (target) - synthesis

  • English to Hindi : An Example

    SL (Eng) sentence: I met a boy who plays cricket with you everyday

    Mapped to TL (Hin): I a boy met who everyday with you cricket plays

    TL synthesis:
      mEM eka laDake se milA jo roza tumhAre sAtha kriketa khelatA hE
      OR mEM roza tumhAre sAtha kriketa khelanevAle eka laDake se milA
      OR mEM eka Ese laDake se milA jo roza tumhAre sAtha kriketa khelatA hE
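
The intermediate "mapped" word order in this example can be reproduced with a very small reordering rule. The sketch below is an illustration only: the hand-chunked clause dictionaries, the role names and the rule itself (subject, then adjuncts in reverse order, then object, then verb) are assumptions chosen to fit this one sentence, not the transfer grammar used by any actual system.

```python
def verb_final(clause):
    """Reorder one hand-chunked English clause into a Hindi-like, verb-final order.

    Assumed order: subject, adjuncts in reverse, object, verb.
    """
    parts = [clause["subject"]]
    parts += reversed(clause.get("adjuncts", []))
    if "object" in clause:
        parts.append(clause["object"])
    parts.append(clause["verb"])
    return " ".join(parts)


# Hand-built analysis of the example sentence (hypothetical representation).
main_clause = {"subject": "I", "verb": "met", "object": "a boy"}
relative_clause = {
    "subject": "who",
    "verb": "plays",
    "object": "cricket",
    "adjuncts": ["with you", "everyday"],
}

mapped = verb_final(main_clause) + " " + verb_final(relative_clause)
print(mapped)  # I a boy met who everyday with you cricket plays
```

Lexical substitution and the choice among the three synthesis variants are left out; the sketch covers only the reordering step.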

  • Machine Translation : Challenges

    Languages encode information differently
    Language codes information only partially
    Tension between BREVITY and PRECISION
    Brevity wins, leading to inherent ambiguity at different levels

  • Linguistic Issues in MT (1/2)

    Look at the word 'plot' in the following examples
      (a) The plot having rocks and boulders is not good.
      (b) The plot having twists and turns is interesting.
    'plot' in (a) means 'a piece of land' and in (b) 'an outline of the events in a story'
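
One simple way to pick between the two senses is to count overlapping context words. The following sketch is an assumption-laden toy, not the disambiguation method of any system described in this talk: the cue-word sets are hand-picked from the two example sentences, and the Hindi equivalents 'zamiin' and 'kathanak' are the ones given later in the talk for 'plot'.

```python
# Hypothetical sense inventory: glosses from the slide, cue words hand-picked.
SENSES = {
    "plot": [
        {"gloss": "a piece of land", "hindi": "zamiin",
         "cues": {"rocks", "boulders"}},
        {"gloss": "an outline of the events in a story", "hindi": "kathanak",
         "cues": {"twists", "turns"}},
    ]
}


def choose_sense(word, sentence):
    """Return the sense whose cue words overlap most with the sentence.

    On a tie, max() keeps the first sense listed.
    """
    context = {w.strip(".,").lower() for w in sentence.split()}
    return max(SENSES[word], key=lambda sense: len(sense["cues"] & context))


print(choose_sense("plot", "The plot having rocks and boulders is not good.")["hindi"])
print(choose_sense("plot", "The plot having twists and turns is interesting.")["hindi"])
```

The approach breaks down as soon as the context contains none of the listed cues, which is one reason real systems rely on larger lexical resources or statistical models.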

  • Linguistic Issues in MT (2/2)

    Ambiguity in language
      Lexical level
      Sentence level
    Structural differences between SL and TL

  • Lexical ambiguity

    Lexical ambiguity can be both for
      Content words : nouns, verbs etc.
      Function words : prepositions, TAMs etc.

    Content word ambiguity is of two types
      Homonymy
      Polysemy

  • Homonymy A word has two or more unrelated senses

    Example : I was walking on the bank (river-bank)

    I deposited the money in the bank (money-bank)

  • Polysemy

    'Act', an English noun
      1. It was a kind act to help the blind man across the road (kArya)
      2. The hero died in Act four, scene three (aMka)
      3. Don't take her seriously, it's all an act (aBinaya)
      4. The parliament has passed an Act (dhArA)

  • Function words can also pose problems (1/5)

    Prepositions
      English prepositions in the target language

    Tense Aspect Modality (TAM)
      Lexical correspondence of TAM

  • Function words can also pose problems (2/5)

    Function words can also be ambiguous
    For example, English preposition 'in'
      (a) I met him in the garden
          mEM usase bagIce meM milA
      (b) I met him in the morning
          mEM usase subaha 0 milA

    'Ambiguity' here refers to the 'appropriate correspondence' in the target language.

  • Function words can also pose problems (3/5)

    He bought a shirt with tiny collars.
      usane chote kOlaroM vAlI kamIza kharIdI
      he tiny collars with shirt bought
      'with' gets translated as 'vAlI' in Hindi

    He washed a shirt with soap.
      usane sAbuna se kamIza dhoI
      he soap with shirt washed
      'with' gets translated as 'se'
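
The 'in' and 'with' examples suggest that the right target form depends on what kind of noun the preposition governs. A minimal table-lookup sketch of that idea follows. The semantic classes (place, time, part, instrument) and the tiny noun lexicon are assumptions introduced for illustration; the Hindi forms are the ones shown in the examples, with the empty string standing for the zero marker.

```python
# Hypothetical semantic classes for the nouns in the slide examples.
NOUN_CLASS = {
    "garden": "place",
    "morning": "time",
    "collars": "part",
    "soap": "instrument",
}

# (English preposition, class of its noun) -> Hindi form from the examples.
POSTPOSITION = {
    ("in", "place"): "meM",
    ("in", "time"): "",          # zero marker, shown as 0 on the slide
    ("with", "part"): "vAlI",    # form as it appears with 'kamIza'
    ("with", "instrument"): "se",
}


def translate_preposition(preposition, noun):
    """Look up the Hindi correspondence for a preposition given its noun."""
    return POSTPOSITION[(preposition, NOUN_CLASS[noun])]


for prep, noun in [("in", "garden"), ("in", "morning"),
                   ("with", "collars"), ("with", "soap")]:
    print(prep, noun, "->", repr(translate_preposition(prep, noun)))
```

A lookup like this raises `KeyError` for any noun or pairing outside the table, which makes the coverage problem of hand-written rules easy to see.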

  • Function words can also pose problems (4/5)

    TAM markers
      Mark tense, aspect and modality
      Consist of inflections and/or auxiliary verbs in Hindi
      An important source of information
      Narrow down the meaning of a verb (e.g. lied, lay)


  • Function words can also pose problems (5/5)

    English Simple Past vs Habitual
      1a. He stayed in the guest house during his visit to our University in Jan (rahA)
      1b. He stayed in the guest house whenever he visited us (rahatA thA)

      2a. He went to the school just now (gayA)
      2b. He went to the school everyday (jAtA thA)
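
In these four sentences the habitual reading is signalled by a cue word ('whenever', 'everyday'). The sketch below turns that observation into a rule. It is a toy under stated assumptions: the cue list covers only these examples, the function name is hypothetical, and the Hindi forms are copied from the slide without handling agreement.

```python
# Cue words taken from examples 1b and 2b; not an exhaustive list.
HABITUAL_CUES = {"whenever", "everyday"}

# Hindi forms as given on the slide for each English past-tense verb.
PAST_FORMS = {
    "stayed": {"simple": "rahA", "habitual": "rahatA thA"},
    "went": {"simple": "gayA", "habitual": "jAtA thA"},
}


def hindi_past(verb, sentence):
    """Pick the simple or habitual past form based on cue words in the sentence."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    reading = "habitual" if words & HABITUAL_CUES else "simple"
    return PAST_FORMS[verb][reading]


print(hindi_past("stayed", "He stayed in the guest house whenever he visited us"))
print(hindi_past("went", "He went to the school just now"))
```

Sentences without an explicit cue default to the simple past here, which will be wrong whenever the habitual reading comes from wider context.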

  • Sentence level ambiguity

    o I met the girl in the store
      + Possible readings
        a) I met the girl who works in the store
        b) I met the girl while I was in the store

    o Time flies like an arrow.
      + Possible parses:
        a) Time flies like an arrow (N V Prep Det N)
        b) Time flies like an arrow (N N V Det N)
        c) Time flies like an arrow (V N Prep Det N) (flies are like an arrow)
        d) Time flies like an arrow (V N Prep Det N) (manner of timing)

  • Differences in SL and TL

    Lexical level
      (a) One word may translate into different words in different contexts (WSD)
          English 'plot' : zamiin, kathanak
      (b) A SL word may not have a corresponding word in the TL (Gaps)
          English 'reads' in 'This book reads very well'
      (c) Pronouns across Indian languages
          Hindi 'vaha' : Telugu 'adi', 'atanu', 'aame'

  • Differences in SL and TL

    Structural differences
      (a) word order (English - Hindi)
      (b) nominal modification (Hindi - Tamil, Telugu etc.)
          (i) relative clause vs relative participles
              Telugu : 'nenu tinnina camcaa'
              Hindi : *meraa khaayaa cammaca
                      Maine jis cammaca se khaayaa hai vah cammac
          (ii) missing copula (Hindi - Telugu, Bengali, Tamil etc.)
              Telugu : raamudu mancivaadu
              Hindi : Ram acchaa ladakaa hai

  • Human beings use
      World knowledge
      Context
      Cultural knowledge and
      Language conventions

    to resolve ambiguities and interpret meaning

  • What to do for the machine?

    Challenging problem!
    Providing all the knowledge may:
      - take too much time and effort
      - be difficult/become complex
      - not be possible (world knowledge acquired from experience)

    Therefore,
      Break the problem into smaller problems
      Choose the solution as per the nature of the problem
      Build language resources to the extent possible and continue to add to them

    Engineer knowledge efficiently

  • Approaches to MT (1/2)

    Rule-based or Transfer based
      Uses linguistic rules to map SL and TL, such as
        Rules that map grammatical structures
        Disambiguation rules
    Knowledge-based
      Extensive knowledge of the domain
      Concepts in the language
      Ability to reason

  • Approaches to MT (2/2)

    Example-based
      Mapping is based on stored example translations
    Translation memory based
      Uses phrases/words from earlier translations as examples
    Statistical
      Does not formulate explicit linguistic knowledge
      Develops rules based on probabilities
    Hybrid
      Mixes two or more techniques

  • A Glance at MT Efforts in India (1/4)

    Domain Specific
      Mantra system (C-DAC, Pune)
        Translation of govt. appointment letters
        Uses Tree Adjoining Grammar
      Public health campaign documents
        Angla Bharati approach (C-DAC Noida & IIT Kanpur)

  • A Glance at MT Efforts in India (2/4)

    Application Specific
      Matra (Human aided MT) (NCST, now C-DAC, Mumbai)

    General Purpose (not yet in use)
      Angla Bharati approach (IIT Kanpur)
      UNL based MT (IIT Bombay)
      Shiva : EBMT (IIIT Hyderabad/IISc Bangalore)
      Shakti : English-Hindi MT system (IIIT Hyderabad)

  • MT Efforts in India (3/4)

    Major Government funded MT projects in consortium mode
      Indian Language to Indian Language Machine Translation (ILMT) (Lead institute : IIIT, Hyderabad)
      English to Indian Language Machine Translation
        Mantra, Shakti etc. (Lead institute : C-DAC, Pune)
        Anglabharati (Lead institute : IIT, Kanpur)
      Sanskrit to Hindi MT System (Lead institute : University of Hyderabad)

  • MT Efforts in India (4/4)

    Anusaaraka : Language Accessor cum MT System (IIIT, Hyderabad; Chinmaya Shodh Sansthan)

  • Our Focus

    Sampark : Indian Language to Indian Language MT systems

  • Sampark : Indian Language to Indian Language MT Systems

    Consortium mode project
    Funded by DeitY
    11 participating institutes
    Nine language pairs
    18 systems

  • Participating institutions

    IIIT, Hyderabad (Lead institute)
    University of Hyderabad
    IIT, Bombay
    IIT, Kharagpur
    AUKBC, Chennai
    Jadavpur University, Kolkata
    Tamil University, Thanjavur
    IIIT, Trivandrum
    IIIT, Allahabad
    IISc, Bangalore
    CDAC, Noida

  • Objectives

    Develop general purpose MT systems from one IL to another for 9 language pairs
      Bidirectional

    Deliver domain specific versions of the MT systems. Domains are:
      Tourism and pilgrimage
      One additional domain (health/agriculture, box office reviews, electronic gadgets instruction manuals, recipes, cricket reports)

    By-products: basic tools and lexical resources for Indian languages
      POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers etc.
      Bidirectional bilingual dictionaries, annotated corpora, etc.

  • Language Pairs (Bidirectional)

    Tamil-Hindi
    Telugu-Hindi
    Marathi-Hindi
    Bengali-Hindi
    Tamil-Telugu
    Urdu-Hindi
    Kannada-Hindi
    Punjabi-Hindi
    Malayalam-Tamil

  • User Scenario

    Web based system for tourism/pilgrimage domain
    A common traveler/tourist/pilgrim to access info in his language
    Access to selected Government portals in agriculture/health
    Automatic MT in domain
    General purpose web based translation
    Potential to attach to major search engines such as Google, Yahoo, Microsoft, Web-duniya

  • Design and Approach

    Largely transfer based
      Analysis, Transfer, Generate
    Modular (module could be ...)
    Pipeline architecture
    Hybrid : some modules statistical, some rule based
    Analysis : Shallow parser
      No deep parsing in the first phase

  • Approach

    Largely transfer based
      Analysis, Transfer, Generate
    Modular
      Modules could be statistical or rule based depending on the nature of the problem (Hybrid)
    Pipeline architecture
    Analysis : Shallow parsing followed by a simple parser

  • Design

    o Design decisions based on
      - the commonality in Indian languages
      - easy to extend to other languages

    o Phase the development
      - Phase 1
        o Analysis at sentence level
        o Shallow parser
        o Simple parser
        o Transfer : map lexicon, structures, script
        o Generate the target
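
The analyse, transfer, generate sequence lends itself to a pipeline of small modules, each consuming the previous module's output. The sketch below is a deliberately tiny illustration and not the Sampark implementation: it reuses the English-to-Hindi sentence from earlier in the talk ('Mohan gave Hari a book') rather than an Indian-language pair, and the four-word lexicon, the positional role assignment and all function names are assumptions.

```python
# Toy lexicon built from the example "Mohan gave Hari a book".
LEXICON = {"Mohan": "Mohan", "Hari": "Hari", "book": "kitAba", "gave": "dI"}
DETERMINERS = {"a", "an", "the"}


def analyse(sentence):
    """Analysis: drop determiners and assign roles by position (assumed pattern)."""
    words = [w for w in sentence.split() if w.lower() not in DETERMINERS]
    subject, verb, recipient, obj = words
    return {"subject": subject, "verb": verb, "recipient": recipient, "object": obj}


def transfer(analysis):
    """Transfer: map each source word to its target word via the lexicon."""
    return {role: LEXICON[word] for role, word in analysis.items()}


def generate(target):
    """Generation: add case markers and emit verb-final order."""
    return (f'{target["subject"]} ne {target["recipient"]} ko '
            f'{target["object"]} {target["verb"]}')


def translate(sentence, modules=(analyse, transfer, generate)):
    """Run the modules in sequence, feeding each one the previous output."""
    data = sentence
    for module in modules:
        data = module(data)
    return data


print(translate("Mohan gave Hari a book"))  # Mohan ne Hari ko kitAba dI
```

Because the modules only meet at their inputs and outputs, any one of them could be swapped for a statistical component without touching the others, which is the point of the modular, hybrid design.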

  • Design Contd.

    Phase 2
      Extend the analysis to discourse level
        Anaphora resolution
        Relations between clauses (discourse connectives)
      Word Sense Disambiguation (WSD)
      Named Entity Recognition (NER)
      Multi Word Expressions (MWE)
      Explore SMT for transfer rules

  • Transfer based MT

    Source Sentence -> [Analysis] -> Source Analysis -> [Transfer] -> Analysis in Target Language -> [Generation] -> Target Sentence

  • Form (Input sentence/text) -> [Analysis] -> Meaning -> [Generation] -> Form

    Diagram labels: L1, L1

    Various types of linguistic information help in arriving from form to meaning
    It is complex
    Modularization helps in simplifying it

  • Modularize

    Word
      Structure : Morph Analyser
      In context
        Syntactic : what it functions as (POS tagger)
        Semantic : what it means (WSD)
    Relations between words
      Local (local word grouping / chunking)
      Non-local (subject, object / karaka)

  • Form (Input sentence/text) -> [Analysis] -> Meaning -> [Generation] -> Form

    Diagram labels: Semantic analysis, POS, Chunking, parsing, Morph Analysis, Formal semantics

    All this information is implicit in language. How to make it explicit?
    Build resources : Dictionaries, Verb frames, Treebanks

  • Sampark Architecture

  • Details

    Standards
      Annotation standards : POS and Chunk
      Input output of each module
      Representation - SSF
      Data format
      Dictionaries

    Emphasis on proper software engineering
      Development environment
      Dashboard
      Blackboard architecture
      CVS for version control
      etc.

  • Machine Learning: Separating engines from language data Module for Task (T) Sentence in Language (L)

  • Horizontal Tasks

    H1 POS Tagging & Chunking engine
    H2 Morph analyser engine
    H3 Generator engine
    H4 Lexical disambiguation engine
    H5 Named entity engine
    H6 Dictionary standards
    H7 Corpora annotation standards
    H8 Evaluation of output (comprehensibility)
    H9 Testing & integration

  • Vertical Tasks for Each Language

    V1 POS tagger & chunker
    V2 Morph analyzer
    V3 Generator
    V4 Named entity recognizer
    V5 Bilingual dictionary (bidirectional)
    V6 Transfer grammar
    V7 Annotated corpus
    V8 Evaluation
    V9 Co-ordination


  • An Example : Hindi to Panjabi System

    [example text not preserved]

  • Panjabi to Hindi

    [example text not preserved]

  • Panjabi to Hindi

    [example text not preserved; surviving labels: (NER), (WSD), (Agreement), (word generation), (function word substitution)]

  • Evaluation

    Testing, system integration, and evaluation team
    Involvement of industry
    Regular in-house subjective evaluation
    Third party evaluation on system submission

  • Achievements of ILMT Project Phase I

    18 MT systems built among Indian languages
    Shallow parser for all 9 Indian languages
    Lexical resources for all 9 languages

    Largely built from scratch
    Developed standards for all stages
    Developed open architecture

  • Achievements - Deployment

    Deployed and running over web : 8 systems (sampark.org.in)

    Others deployed over ILMT test site
      4 more ready to go to Sampark soon
      Rest are being evaluated and tested internally (require a few more months to go to the Sampark site after reaching quality levels)
    Constant quality improvement going on for various existing modules
    New modules are under testing and will soon be integrated

  • Future Tasks

    Enhance the quality of MT output
      Enhancing dictionaries
      Increasing coverage of grammar
    Adding new technology to ILMT systems
      Full sentence parsing
      Discourse processing - anaphora
    Target some users

  • Some Possibilities

    Possible tie up with search engine companies
    Possible tie up with content companies such as Dainik Jagran, Web duniya, Rediff, Yahoo
    Identify translation bureaus and agencies
    Build MT workbench for their use, their domains, etc.
    Poised for major public impact with a unique technology.

  • Future Systems

    Add language pairs
      Gujarati-Hindi
      Kashmiri-Hindi
      Manipuri-Hindi
      Oriya-Hindi
      Etc.


  • CONCLUSION Developing MT systems, though a challenging task, is a useful effort particularly in the multilingual context of India
