CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 31: Introduction to Machine Translation) Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 29th March, 2012


  • CS460/626: Natural Language Processing/Speech, NLP and the Web

    (Lecture 31: Introduction to Machine Translation)

    Pushpak Bhattacharyya, CSE Dept., IIT Bombay

    29th March, 2012

  • Introduction to MT

    Machine Translation (MT) is a technique for translating text from one natural language to another using a machine

    Translated text should have two desired properties:

    Adequacy: Meaning should be conveyed correctly

    Fluency: Text should be fluent in the target language

    Translation between distant languages is a difficult task

    Handling Language Divergence is a major challenge

  • Kinds of MT Systems (point of entry from source to the target text)

    [Figure: diagram of MT levels; the labels recovered from it are listed below]

    Levels: Deep understanding level; Interlingual level; Logico-semantic level; Syntactico-functional level; Morpho-syntactic level; Syntagmatic level; Graphemic level

    Representations: Ontological interlingua; Semantico-linguistic interlingua; SPA-structures (semantic & predicate-argument); Multilevel description (mixing levels); F-structures (functional); C-structures (constituent); Tagged text; Text

    Transfers: Conceptual transfer; Semantic transfer; Multilevel transfer; Syntactic transfer (deep); Syntactic transfer (surface); Semi-direct translation; Direct translation; Ascending transfers; Descending transfers

  • Why is MT difficult: Language Divergence

    One of the main complexities of MT: Language Divergence

    Languages have different ways of expressing meaning

    Lexico-Semantic Divergence

    Structural Divergence

    English-IL Language Divergence with illustrations from Hindi (Dave, Parikh, Bhattacharyya, Journal of MT, 2002)

  • Language Divergence Theory: Lexico-Semantic Divergences

    Conflational divergence

    F: vomir; E: to be sick

    E: stab; H: churaa se maaranaa (knife-with hit)

    S: Utrymningsplan; E: escape plan

    Structural divergence

    E: SVO; H: SOV

    Categorial divergence

    Change is in POS category

    Not communicative (Adj → Verb)

  • Language Divergence Theory: Lexico-Semantic Divergences (contd.)

    Head swapping divergence

    E: Prime Minister of India; H: bhaarat ke pradhaanmantrii (India-of Prime Minister)

    Lexical divergence

    E: advise; H: paraamarsh denaa (advice give): Noun Incorporation, a very common Indian language phenomenon

  • Language Divergence Theory: Syntactic Divergences

    Constituent Order divergence

    E: Singh, the PM of India, will address the nation today; H: bhaarat ke pradhaan mantrii, singh, (India-of PM, Singh)

    Adjunction Divergence

    E: She will visit here in the summer; H: vah yahaa garmii meM aayegii (she here summer-in will come)

    Preposition-Stranding divergence

    E: Who do you want to go with?; H: kisake saath aap jaanaa chaahate ho? (who with)

  • Language Divergence Theory: Syntactic Divergences

    Null Subject Divergence

    E: I will go; H: jaauMgaa (subject dropped)

    Pleonastic Divergence

    E: It is raining; H: baarish ho rahii hai (rain happening is: no translation of "it")

  • Language Divergence exists even for close cousins (Marathi-Hindi-English: case marking and postpositions do not transfer easily)

    (immediate past) When did you come? Just now (I came) (E); kakhan ele? ei elaam/eshchhi (B)

    (certainty in future) He is in for a thrashing. (E); ekhan o maar khaabei/khaachhei (B)

    (assurance) I will see you tomorrow. (E); aami kaal tomaar saathe dekhaa korbo/korchhi (B)

    [The Marathi (M) and Hindi (H) versions of each example were in Devanagari and are missing from this transcript]

  • Can MT make a difference

    Yes: MT is ubiquitous

    http://www.cfilt.iitb.ac.in/judicial/

    http://www.cfilt.iitb.ac.in:11080/XeroxIITBSMT/Translate

    http://translate.google.com

  • Helping a person in distress (1/3)

    French Lady (FL): (in French, translated into English by Google Translate) Do you know how far it is to the Bangalore main bus stand from the airport?

    Self: (in English, translated into French by GT) It is about 10 KM.

    FL: Can I take a bus?

  • Helping a person in distress (2/3)

    FL: I think that is expensive.

    Self: Yes, about 3000 Rupees, about 50 Euros.

    FL: No, I will take a bus.

    Self: Did you hurt your ankle? Do you need medical help?

  • Helping a person in distress (3/3)

    Self: Should I look for a doctor?

    FL: No, thank you, I will consult in Mysore.

    This entire conversation took place via Google Translate. The lady and I took turns typing sentences into the respective translation boxes.

  • Translation is Ubiquitous

    Between Languages

    Delhi is the capital of India

    Between dialects

    Example next slide

    Between registers

    My mom not well.

    My mother is unwell (in a leave application)

  • Between dialects (1/3)

    Lage Raho Munnabhai: an excellent example

    Scene: Munnabhai (Sanjay Dutt), as Prof. Murli Prasad Sharma, is being interviewed, with some citizens asking questions in the presence of Jahnavi (Vidya Balan)

    Question by citizen: [Hindi text missing from this transcript]

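  • Between dialects (2/3)

    Bapu, from behind, invisible to others:

    [Alternating lines by Bapu and Munnabhai; the Hindi text is missing from this transcript. Surviving fragment: "full country"]

  • Between dialects (3/3)

    [Alternating lines by Bapu and Munnabhai; the Hindi text is missing from this transcript. Surviving fragment: "heart heart"]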

  • Interlingua Based MT


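  • MT: EnConversion + Deconversion

    [Figure: English, Hindi, French and Chinese connected through a central Interlingua (UNL); Analysis leads into the interlingua (enconversion) and generation leads out of it (deconversion)]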
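
    The enconversion plus deconversion arrangement can be sketched as a hub-and-spoke pipeline. The sketch below is illustrative only: the `Interlingua` type, the toy word lists, and the `ANALYSERS`/`GENERATORS` tables are assumptions made for this example and are not part of UNL. It also counts components under the common assumption that a pairwise design needs one translator per ordered language pair.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for an interlingua: a tuple of language-neutral concept IDs.
Interlingua = Tuple[str, ...]

# Toy lexicons, assumed for illustration. Real analysers and generators
# also handle ambiguity, word order and morphology.
EN = {"thanks": "C_THANKS", "friend": "C_FRIEND"}
FR = {"merci": "C_THANKS", "ami": "C_FRIEND"}

def make_analyser(lexicon: Dict[str, str]) -> Callable[[str], Interlingua]:
    """Enconversion: source text -> interlingua."""
    return lambda text: tuple(lexicon[w] for w in text.lower().split())

def make_generator(lexicon: Dict[str, str]) -> Callable[[Interlingua], str]:
    """Deconversion: interlingua -> target text."""
    inverse = {concept: word for word, concept in lexicon.items()}
    return lambda meaning: " ".join(inverse[c] for c in meaning)

ANALYSERS = {"en": make_analyser(EN), "fr": make_analyser(FR)}
GENERATORS = {"en": make_generator(EN), "fr": make_generator(FR)}

def translate(text: str, src: str, tgt: str) -> str:
    """Translate by going through the interlingua, never directly."""
    return GENERATORS[tgt](ANALYSERS[src](text))

def components_needed(n_languages: int) -> Tuple[int, int]:
    """(interlingua design, pairwise design) component counts."""
    interlingua = 2 * n_languages                 # one analyser + one generator each
    pairwise = n_languages * (n_languages - 1)    # one system per ordered pair
    return interlingua, pairwise

print(translate("thanks friend", "en", "fr"))
print(components_needed(4))
```

    With the four languages in the figure, the interlingua design needs 8 components where a pairwise design would need 12, and the gap widens as languages are added.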

  • Challenges of interlingua generation

    Mother of NLP problems - Extract meaning from a sentence!

    Almost all NLP problems are sub-problems

    Named Entity Recognition

    POS tagging

    Chunking

    Parsing

    Word Sense Disambiguation

    Multiword identification

    and the list goes on...

  • UNL: a United Nations project

    Started in 1996

    10 year program

    15 research groups across continents

    First goal: generators

    Next goal: analysers (needs solving various ambiguity problems)

    Current active language groups

    UNL_French (GETA-CLIPS, IMAG)

    UNL_Hindi (IIT Bombay with additional work on UNL_English)

    UNL_Italian (Univ. of Pisa)

    UNL_Portuguese (Univ. of Sao Paulo, Brazil)

    UNL_Russian (Institute of Linguistics, Moscow)

    UNL_Spanish (UPM, Madrid)

  • World-wide Universal Networking Language (UNL) Project

    [Figure: UNL at the centre as a language independent meaning representation, linked to English, Russian, Marathi, Japanese, Hindi, Spanish and others]

  • UNL represents knowledge: John eats rice with a spoon

    [Figure: UNL graph of the sentence, with its semantic relations, attributes and universal words labelled]

    Repository of 42 semantic relations and 84 attribute labels
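
    A UNL expression of this kind can be held as a small labelled graph. The sketch below is a minimal illustration, not the UNL specification: the `UNLGraph` class is hypothetical, the universal words are written without their restrictions, and the choice of `agt`, `obj` and `ins` for this sentence and of the attributes `@entry` and `@present` is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class UNLGraph:
    """A UNL-style expression: labelled binary relations over universal words."""
    relations: List[Tuple[str, str, str]] = field(default_factory=list)  # (label, head, dependent)
    attributes: Dict[str, Set[str]] = field(default_factory=dict)        # word -> attribute labels

    def add(self, label: str, head: str, dependent: str) -> None:
        self.relations.append((label, head, dependent))

    def dependents(self, head: str) -> Set[Tuple[str, str]]:
        """All (relation label, dependent) pairs attached to a head word."""
        return {(label, dep) for label, h, dep in self.relations if h == head}

# "John eats rice with a spoon" (simplified universal words, assumed labels)
g = UNLGraph()
g.add("agt", "eat", "John")   # agent of the eating
g.add("obj", "eat", "rice")   # object eaten
g.add("ins", "eat", "spoon")  # instrument used
g.attributes["eat"] = {"@entry", "@present"}

for label, head, dep in g.relations:
    print(f"{label}({head}, {dep})")
```

    The three parts named on the slide map onto the three fields used here: relation labels, attribute labels, and the universal words that serve as nodes.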

  • Hindi Generation from Interlingua (UNL)

    (Joint work with S. Singh, M. Dalal, V. Vachhani, Om Damani; MT Summit 2007)

  • HinD Architecture

    Deconversion = Transfer + Generation

  • Step-through the Deconverter

    Module: Output

    UNL Expression: obj(contact(

    Lexeme Selection: contact farmer this you region taluka manchar khatav

    Case Identification: contact farmer* this you region* taluka* manchar khatav

    Morphology Generation: contact.@imperative farmer.@pl this you region taluka manchar Khatav

    Function Word Insertion: contact farmers this for you region or taluka of Manchar Khatav

    Linearization: This for you manchar region or khatav taluka of farmers contact

    [Figure: UNL graph with nodes contact, you, this, farmer, region, taluka, manchar, khatav; relations agt, obj, pur, plc, or, nam; scope :01]

  • Lexeme Selection

    [ ]{}"contact(icl>communicate(agt>person,obj>person))" (V,VOA,VOA-ACT,VOA-COMM,VLTN,TMP,CJNCT,N-V,link,Va)

    [ ]{}"contact(icl>representative)" (N,ANIMT,FAUNA,MML,PRSN,Na)

    Lexical choice is unambiguous

    [Figure: UNL graph fragment with contact linked to you (agt), farmer (obj) and this (pur)]

  • Case Marking

    Rule table columns: Relation, Parent+, Parent-, Child+, Child-

    Obj: V, VINT, N

    Agt: V, @past, N

    Depends on the UNL relation and the properties of the nodes

    Case gets transferred from head to modifiers

    [Figure: UNL graph fragment with contact linked to you (agt), farmer (obj, marked *) and this (pur)]
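
    A rule table of this shape can be read as a set of feature tests on the parent and child of a relation. The sketch below is one possible reading, not the HinD implementation: the `Rule` class and `needs_case_mark` are hypothetical names, "+" columns are taken as properties that must be present and "-" columns as properties that must be absent, and only the Obj row is encoded because the column alignment of the Agt row is not certain in this transcript.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

@dataclass(frozen=True)
class Rule:
    relation: str
    parent_has: FrozenSet[str] = frozenset()    # Parent+ column
    parent_lacks: FrozenSet[str] = frozenset()  # Parent- column
    child_has: FrozenSet[str] = frozenset()     # Child+ column
    child_lacks: FrozenSet[str] = frozenset()   # Child- column

    def matches(self, relation: str, parent: set, child: set) -> bool:
        return (relation == self.relation
                and self.parent_has <= parent
                and not (self.parent_lacks & parent)
                and self.child_has <= child
                and not (self.child_lacks & child))

# Obj row: parent is a verb (V) that is not intransitive (VINT); child is a noun (N).
OBJ_RULE = Rule("obj",
                parent_has=frozenset({"V"}),
                parent_lacks=frozenset({"VINT"}),
                child_has=frozenset({"N"}))

def needs_case_mark(relation: str, parent: Iterable[str], child: Iterable[str],
                    rules=(OBJ_RULE,)) -> bool:
    """True if any rule fires for this relation and node-property pair."""
    parent, child = set(parent), set(child)
    return any(rule.matches(relation, parent, child) for rule in rules)

# contact (a transitive verb) with farmer (a noun) as its object
print(needs_case_mark("obj", {"V"}, {"N"}))
```

    Keeping the positive and negative conditions as separate sets mirrors the four property columns of the table, so adding a row is a matter of adding one `Rule`.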

  • Morphology Generation: Nouns

    The boy saw me.

    Boys saw me.

    The King saw me.

    Kings saw me.

    Suffix: Attribute values

    uoM: @N,@NU,@M,@pl,@oblique

    U: @N,@NU,@M,@sg,@oblique

    I: @N,@NI,@F,@sg,@oblique

    iyoM: @N,@NI,@F,@pl,@oblique

    oM: @N,@NA,@NOTCH,@F,@pl,@oblique
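
    The suffix table above amounts to a lookup from attribute values to a suffix. The sketch below shows one way to express that lookup; the function name `select_suffix` and the first-match policy are assumptions, and how the suffix is attached to the noun stem is left out because the slide does not show it.

```python
from typing import Iterable, Optional

# Rows copied from the slide: (suffix, attribute values that must all hold).
NOUN_SUFFIXES = [
    ("uoM",  {"@N", "@NU", "@M", "@pl", "@oblique"}),
    ("U",    {"@N", "@NU", "@M", "@sg", "@oblique"}),
    ("I",    {"@N", "@NI", "@F", "@sg", "@oblique"}),
    ("iyoM", {"@N", "@NI", "@F", "@pl", "@oblique"}),
    ("oM",   {"@N", "@NA", "@NOTCH", "@F", "@pl", "@oblique"}),
]

def select_suffix(attrs: Iterable[str]) -> Optional[str]:
    """Return the suffix of the first row whose attributes are all present."""
    attrs = set(attrs)
    for suffix, required in NOUN_SUFFIXES:
        if required <= attrs:
            return suffix
    return None  # no row applies

print(select_suffix({"@N", "@NI", "@F", "@pl", "@oblique"}))
```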

  • Verb Morphology

    Suffix | Tense | Aspect | Mood | N | Gen | P | VE

    -e rahaa thaa | @past | @progress | - | @sg | @male | 3rd | e

    -taa hai | @present | @custom | - | @sg | @male | 3rd | -

    -iyaa thaa | @past | @complete | - | @sg | @male | 3rd | I

    saktii hain | @present | - | @ability | @pl | @female | 3rd | A

  • After Morphology Generation

    [Figure: the UNL graph fragment (contact with you, this and farmer; relations agt, obj, pur) shown twice, before and after morphology generation]

  • Function Word Insertion

    Rule table columns: Rel, Par+, Par-, Chi+, Chi-, Ch/FW

    Obj: V, VINT, N#ANIMT, @topic

    [Figure: UNL graph fragment with contact linked to you (agt), farmer (obj) and this (pur)]

  • Linearization

    This you manchar region khatav taluka farmers contact

    [Figure: UNL graph with nodes contact, you, this, farmer, region, taluka, manchar, khatav; relations agt, obj, pur, plc, or, nam; scope :01]

  • Syntax Planning: Assumptions

    The relative word order of a UNL relation's relata does not depend on:

    Semantic Independence: the semantic properties of the relata.

    Context Independence: the rest of the expression.

    The relative word order of various relations sharing a relatum does not depend on:

    Local Ordering: the rest of the expression.

    [Figure: UNL graph fragment with contact linked to you (agt), farmer (obj) and this (pur)]

  • Syntax Planning: Strategy

    Divide a node's relations into:

    Before_set = {obj, pur, agt}

    After_set = {}

    Pairwise precedence matrix (rows and columns: agt, aoj, obj, ins):

    Agt: L L L

    Aoj: L L

    Obj: R

    Ins:

    Topo sort each group: pur agt obj (this you farmer)

    Final order: this you farmer contact

    [Figure: UNL graph fragment with contact linked to you (agt), farmer (obj) and this (pur); subgraph with region, khatav, manchar, taluka]
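
    The topological sort step can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is an illustration under assumptions, not the system's code: the precedence pairs in `PRECEDENCE` are chosen to reproduce the ordering shown on the slide rather than read from the matrix, and `order_relations` and `CHILD` are hypothetical names.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def order_relations(relations, before_pairs):
    """Order relation labels so that a precedes b for every (a, b) pair."""
    sorter = TopologicalSorter({r: set() for r in relations})
    for earlier, later in before_pairs:
        sorter.add(later, earlier)  # 'later' must come after 'earlier'
    return list(sorter.static_order())

BEFORE_SET = ["obj", "pur", "agt"]                 # from the slide
PRECEDENCE = [("pur", "agt"), ("agt", "obj")]      # assumed pairwise constraints
CHILD = {"agt": "you", "obj": "farmer", "pur": "this"}

ordered = order_relations(BEFORE_SET, PRECEDENCE)
final = [CHILD[r] for r in ordered] + ["contact"]  # After_set is empty, so the head comes last

print(ordered)  # ['pur', 'agt', 'obj']
print(final)    # ['this', 'you', 'farmer', 'contact']
```

    Because the assumed constraints form a single chain, the ordering is unique; with a partial set of constraints, `static_order` would return one of several valid orders.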

  • Syntax Planning Algo

    [Table: step-by-step trace with columns Stack, Before Current, After Current, Output; the cell layout is not recoverable from this transcript. Nodes appearing in the trace: contact, farmer, region, taluka, manchar, this, you]

    [Figure: UNL graph with nodes contact, you, this, farmer, region, taluka, manchar, khatav; relations agt, obj, pur, plc, or, nam; scope :01]
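
    The before/current/after traversal can be sketched as a recursive walk over the graph. This is a simplified illustration, not the algorithm from the slide: it uses recursion in place of an explicit stack, flattens the :01 scope into a plain tree, and the `TREE` shape, the `AFTER` set and the `RANK` values are all assumptions chosen so that the output matches the word order on the Linearization slide.

```python
from typing import Dict, List, Tuple

# Assumed, simplified tree: head -> list of (relation, child).
TREE: Dict[str, List[Tuple[str, str]]] = {
    "contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")],
    "farmer":  [("plc", "region")],
    "region":  [("nam", "manchar"), ("or", "taluka")],
    "taluka":  [("nam", "khatav")],
}

AFTER = {"or"}  # assumed: relations whose child follows its head
RANK = {"pur": 0, "agt": 1, "plc": 2, "nam": 3, "obj": 4, "or": 5}  # assumed ordering

def linearize(node: str, tree: Dict[str, List[Tuple[str, str]]] = TREE) -> List[str]:
    """Emit the before-children, then the node itself, then the after-children."""
    children = sorted(tree.get(node, []), key=lambda rc: RANK[rc[0]])
    before = [child for rel, child in children if rel not in AFTER]
    after = [child for rel, child in children if rel in AFTER]
    output: List[str] = []
    for child in before:
        output.extend(linearize(child, tree))
    output.append(node)
    for child in after:
        output.extend(linearize(child, tree))
    return output

print(" ".join(linearize("contact")))
# this you manchar region khatav taluka farmer contact
```

    Each node is ordered using only its own relations, which is what the Local Ordering assumption permits.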

  • All Together (UNL -> Hindi)

    Module: Output

    UNL Expression: obj(contact(..

    Lexeme Selection: contact farmer this you region taluka manchar khatav

    Case Identification: contact farmer* this you region* taluka* manchar khatav

    Morphology Generation: contact.@imperative farmer.@pl this you region taluka manchar Khatav

    Function Word Insertion: contact farmers this for you region or taluka of Manchar Khatav

    Syntax Linearization: This for you manchar region or khatav taluka of farmers contact

  • How to Evaluate UNL Deconversion

    UNL -> Hindi

    Reference Translation needs to be generated from UNL

    Needs expertise

    Compromise: Generate reference translation from original English sentence from which UNL was generated

    Works if you assume that UNL generation was perfect

    Note that fidelity is not an issue

  • Manual Evaluation Guidelines

    Fluency of the given translation is:

    (4) Perfect: Good grammar

    (3) Fair: Easy-to-understand but flawed grammar

    (2) Acceptable: Broken - understandable with effort

    (1) Nonsense: Incomprehensible

    Adequacy: How much meaning of the reference sentence is conveyed in the translation:

    (4) All: No loss of meaning

    (3) Most: Most of the meaning is conveyed

    (2) Some: Some of the meaning is conveyed

    (1) None: Hardly any meaning is conveyed

  • Sample Translations

    Columns: Hindi Output, Fluency, Adequacy

    [Five Hindi sample outputs, missing from this transcript; the (Fluency, Adequacy) scores that survive are (2, 3), (3, 4), (1, 1), (4, 4), (2, 3)]

  • Results

    Measure | BLEU | Fluency | Adequacy

    Geometric Average | 0.34 | 2.54 | 2.84

    Arithmetic Average | 0.41 | 2.71 | 3.00

    Standard Deviation | 0.25 | 0.89 | 0.89

    Pearson Cor. BLEU | 1.00 | 0.59 | 0.50

    Pearson Cor. Fluency | 0.59 | 1.00 | 0.68

    [Figure: Fluency vs Adequacy; bar chart of number of sentences (axis 0 to 200) at each Fluency score (1 to 4), with separate bars for Adequacy 1, 2, 3 and 4]

    Good correlation between Fluency and BLEU

    Strong correlation between Fluency and Adequacy

    Can do large scale evaluation using Fluency alone

    Caution: Domain diversity, Speaker diversity
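
    The summary statistics reported in the results table can be computed as in the sketch below. The ratings are made-up illustrative values, not the study's data, and the use of the population standard deviation is an assumption since the slides do not say which variant was used.

```python
import math
from typing import Sequence

def arithmetic_mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)

def geometric_mean(xs: Sequence[float]) -> float:
    """Exponential of the mean log; requires strictly positive values."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def std_dev(xs: Sequence[float]) -> float:
    """Population standard deviation (assumed variant)."""
    mean = arithmetic_mean(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = arithmetic_mean(xs), arithmetic_mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-sentence ratings on the 1 to 4 scales.
fluency = [1, 2, 3, 4, 2, 3]
adequacy = [1, 3, 3, 4, 2, 4]

print(round(arithmetic_mean(fluency), 2), round(geometric_mean(fluency), 2))
print(round(std_dev(fluency), 2))
print(round(pearson(fluency, adequacy), 2))
```

    The geometric mean never exceeds the arithmetic mean, which is consistent with the ordering of the two averages in the table.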

  • Summary: interlingua based MT

    English to Hindi

    Rule governed

    High level of linguistic expertise needed

    Takes a long time to build (since 1996)

    But produces great insight, resources and tools
