
Upload: phamduong

Post on 29-May-2018


CSC401/2511 – Spring 2017

Aside – The Rosetta stone

Ancient Egyptian (c. 3000 BCE)
• Few writers
• Stone tablets
• Many (>1500) symbols representing ideas (e.g., apple)
• A few (~140) symbols representing sounds (e.g., gah)

Demotic (c. 650 BCE)
• Many writers
• Papyrus sheets
• More purposes (e.g., recipes, contracts)
• Fewer symbols
• Higher proportion of symbols representing sounds

The Rosetta stone
• The Rosetta stone dates from 196 BCE.
• It was re-discovered by French soldiers during Napoleon’s invasion of Egypt in 1799 CE.
• It contains three parallel texts in different languages, only the last of which was understood:
  1. Ancient Egyptian hieroglyphs
  2. Egyptian Demotic
  3. Ancient Greek
• By 1799, ancient Egyptian had been forgotten.

Writing systems
• Logographic: adj. Describes writing systems whose symbols denote semantic ideas.
• Phonographic: adj. Describes writing systems whose symbols denote sounds. E.g., in English the symbols ‘sh’ mean the single sound /ʃ/.
• Some writing systems are a mix of these qualities:
  • 妈 mā ‘mother’, formed from:
    • 女 nǚ (means like) ‘woman’
    • 马 mǎ (sounds like) ‘horse’

Writing systems
• Logographic: Symbols refer to ideas.
• Phonographic: Symbols refer to sounds.
• English carries logographic heritage.

Is ancient Egyptian logographic or phonographic?

[Figure: letter forms in Proto-Sinaitic, Phoenician, and Cyrillic (A b K M O P). Proto-Sinaitic names: “alph” (ox), “bet” (house), “kaf” (palm), “mem” (water), “en” (eye), “ro” (head).]

Deciphering Rosetta
• During 1822–1824, Jean-François Champollion worked on the Rosetta stone. He noticed:
  1. The circled Egyptian symbols appeared in roughly the same positions as the word ‘Ptolemy’ in the Greek.
  2. The number of Egyptian hieroglyph tokens was much larger than the number of Greek words → Egyptian seemed to have been partially phonographic.
  3. Cleopatra’s cartouche was written: [cartouche image]

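Aside – deciphering Rosetta
• So if [cartouche image] was ‘Ptolemy’ and [cartouche image] was ‘Cleopatra’ and the symbols corresponded to sounds – can we match up the symbols?

[Figure: the symbols of the two cartouches labelled with the letters of PTOLEMY and CLEOPATRA; the letters P, L, O, and E are paired across the two names.]

• This approach demonstrated the value of working from parallel texts to decipher an unknown language:
  • It would not have been possible without aligning unknown words (hieroglyphs) to known words (Greek)…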

Today
• Introduction to statistical machine translation (SMT).
• What we want is a system to take utterances/sentences in one language and transform them to another:

    Ne mange pas ce chat !  →  Don’t eat that cat!

Direct translation
• A bilingual dictionary that aligns words across languages can be helpful, but only for simple cases.

    ¿ Dónde está la biblioteca ?
    Where is the library ?
    Où est la bibliothèque ?

    Mi nombre es T-bone
    My name is T-bone
    Mon nom est T-bone

Challenge 1: lexical ambiguity
• A word token in one language may have many possible translations in another:
• E.g.,
    book the flight → reservar
    read the book → libro

    the chair in the chair → président, chaise

    kill the queen → tuer la reine
    kill the Queen → éteindre la musique de Queen

Challenge 2: differing word orders
• English: subject – (trans.) verb – object
  Japanese: subject – object – (trans.) verb
    e.g., English: IBM bought Lotus
          Japanese: ~ IBM Lotus bought
• English: determiner – adjective – noun
  French: determiner – noun – adjective
    e.g., English: the fast zombie
          French: le zombie rapide

Challenge 3: unpreserved syntax
• Differences in syntax between languages are felt over longer distances than simple word alternations.
• E.g.,
    The bottle floated into the cave
    La botella entró a la cueva flotando
    (the bottle entered to the cave floating)
• This implies that we’d need high-level grammars of the source and target languages.

Challenge 4: syntactic ambiguity
• Syntactic ambiguity in the source makes it difficult to produce a single sentence in the target language.
• E.g.,
    Rick hit the zombie with the stick
    → Rick golpeó el zombie con el palo (the stick was used)
    → Rick golpeó el zombie que tenía el palo (the zombie had the stick)

Challenge 5: idiosyncrasies
• Languages have their own idioms, and “feel”.
• E.g.,
    We have to burn the midnight oil
      idiomatic: Il faut travailler tard
      word-for-word: Il faut brûler l’huile de minuit

    Estie de sacramouille
      idiomatic: By golly!
      word-for-word: Host of the sacrament

Classical MT: Dictionaries
• Early MT involved merely looking up each word in a bilingual dictionary of rules.
• E.g., translate ‘much’ or ‘many’ into Russian:

    if preceding word is how
        return skol’ko
    else if preceding word is as
        return stol’ko zhe
    else if word is much
        if preceding word is very
            return nil
        else if following word is a noun
            return mnogo
    else (word is many)
        if preceding word is a preposition and next word is a noun
            return mnogii
        else
            return mnogo

From Jurafsky & Martin
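
The rule table above can be sketched as a small function. This is only an illustration: the function name, its arguments, and the part-of-speech labels `"NOUN"` and `"PREP"` are assumptions introduced here, and a real system would need a tagger to supply them.

```python
from typing import Optional


def translate_much_many(
    word: str,
    prev_word: Optional[str],
    prev_pos: Optional[str],
    next_pos: Optional[str],
) -> Optional[str]:
    """Apply the 'much'/'many' rules; returns a transliterated Russian word or None."""
    if prev_word == "how":
        return "skol'ko"
    if prev_word == "as":
        return "stol'ko zhe"
    if word == "much":
        if prev_word == "very":
            return None  # 'nil' in the rule table: emit nothing
        if next_pos == "NOUN":
            return "mnogo"
        return None  # the rule table gives no fallback for this case
    # Otherwise the word is 'many'.
    if prev_pos == "PREP" and next_pos == "NOUN":
        return "mnogii"
    return "mnogo"


print(translate_much_many("many", "how", "ADV", "NOUN"))
print(translate_much_many("many", "in", "PREP", "NOUN"))
```

Even this tiny rule set needs the neighbouring words and their parts of speech, which hints at how quickly hand-written rules grow.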

Classical MT: Dictionaries
• This approach causes some problems, e.g.,
  • It’s difficult/impossible to capture long-range re-orderings:
      English: Sources said that IBM bought Lotus yesterday
      Japanese: ~ Sources yesterday IBM Lotus bought that said
  • It’s difficult to disambiguate parts-of-speech:
      English: They said that I punched that zombie
      French: Ils ont dit que j'ai frappé ce zombie
  • Having experts write lots of rules can become unruly.
    • …and expensive... and full of mistakes…

Classical MT: Transfer-based approach
• Transfer-based MT involves three phases:
  • Analysis: e.g., build syntactic parse trees of the source sentence.
  • Transfer: e.g., convert the source-language parse tree to a target-language parse tree.
  • Generation: e.g., produce an output sentence from the target-language parse tree.
• These systems can involve fairly deep analysis, often including semantic analysis.

Example of syntactic transfer
[Figure: parse-tree example, from Regina Barzilay at MIT.]
• See csc485/2501 for more on computational approaches to parse trees.
• Transformations are defined at the syntactic level.

Classical MT: Transfer-based approach
• Transferring between parse trees allows us to encode more general rules with long-term dependencies.
• However, if we want to translate between N languages, we’d need O(N²) sets of transformation rules.
  • This would involve lots of experts in each language ($$).
  • This can be somewhat mitigated by abstracting beyond syntax into an interlingua: a conceptual space common to all languages.
    • We might need a workable theory of neurolinguistics to do this properly, but ‘hacks’ are getting some good results.
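
A quick count illustrates the quadratic growth. The sketch assumes one rule set per ordered (source, target) pair for direct transfer, and one analyser plus one generator per language when an interlingua is used; both function names are made up for this example.

```python
def pairwise_transfer_rule_sets(n_languages: int) -> int:
    """Rule sets needed if every ordered (source, target) pair gets its own."""
    return n_languages * (n_languages - 1)


def interlingua_components(n_languages: int) -> int:
    """One analyser into, and one generator out of, the interlingua per language."""
    return 2 * n_languages


for n in (3, 10, 50):
    print(n, pairwise_transfer_rule_sets(n), interlingua_components(n))
```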

Statistical machine translation
• Machine translation seemed to be an intractable problem until a change in perspective…

    When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’
    Warren Weaver, March 1947

[Figure: the noisy channel (Claude Shannon, July 1948). A transmitter, $P(X)$, sends $X$ through a noisy channel; the receiver, $P(Y \mid X)$, observes $Y$.]

How not to use the noisy channel
• The model $P(E, F)$ tells us how likely an English sentence $E$ and a French sentence $F$ are to correspond to each other.
• Imagine that you’re given a French sentence, $F$, and you want to convert it to the best corresponding English sentence, $E^*$
  • i.e., $E^* = \operatorname*{argmax}_E P(E, F)$
• Others may be tempted to model this as

    $E^* = \operatorname*{argmax}_E P(E \mid F)\,P(F)$

  (The $P(F)$ term is useless if you’re always given $F$.)

How not to use the noisy channel
• If $P(E \mid F)$ is a model that translates word-to-word, then we cannot account for differing word orders across languages.
  • E.g., Source French: le zombie rapide
          Target English: the zombie fast
• If $P(E \mid F)$ includes syntax, it becomes very difficult to learn without experts or specially-annotated data.

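The noisy channel

[Figure: the noisy channel for translation. The source is a language model, $P(E)$, producing $E'$; the channel is a translation model, $P(F \mid E)$, producing $F'$; the decoder takes the observed $F$ and outputs $E^*$.]

$E^* = \operatorname*{argmax}_E P(F \mid E)\,P(E)$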

How to use the noisy channel
• How does this work?

    $E^* = \operatorname*{argmax}_E P(F \mid E)\,P(E)$

• $P(E)$ is a language model (e.g., N-gram) and encodes knowledge of word order.
• $P(F \mid E)$ is a word-level translation model that encodes only knowledge on an unordered word-by-word basis.
• Combining these models can give us naturalness and fidelity, respectively.
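
For readers who want the missing step, the decomposition follows from Bayes' rule. The derivation below is standard and is not taken from the slides; it relies only on $F$ being fixed during the search over $E$.

```latex
\begin{align*}
E^* &= \operatorname*{argmax}_E P(E \mid F) \\
    &= \operatorname*{argmax}_E \frac{P(F \mid E)\,P(E)}{P(F)} \\
    &= \operatorname*{argmax}_E P(F \mid E)\,P(E)
\end{align*}
```

The last step drops $P(F)$ because it does not depend on $E$. The same maximiser is obtained from the joint, since $P(E, F) = P(F \mid E)\,P(E)$.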

How to use the noisy channel
• Example from Koehn and Knight using only conditional likelihoods of Spanish words given English words.
• Que hambre tengo yo →
    What hunger have I      $P(S \mid E)$ = 1.4E-5
    Hungry I am so          $P(S \mid E)$ = 1.0E-6
    I am so hungry          $P(S \mid E)$ = 1.0E-6
    Have I that hunger      $P(S \mid E)$ = 2.0E-5

How to use the noisy channel
• …and with the English language model
• Que hambre tengo yo →
    What hunger have I      $P(S \mid E)\,P(E)$ = 1.4E-5 × 1.0E-6
    Hungry I am so          $P(S \mid E)\,P(E)$ = 1.0E-6 × 1.4E-6
    I am so hungry          $P(S \mid E)\,P(E)$ = 1.0E-6 × 1.0E-4
    Have I that hunger      $P(S \mid E)\,P(E)$ = 2.0E-5 × 9.8E-7
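
The effect of the language model can be checked by ranking the four candidates with and without it. The probabilities below are the ones listed on these two slides; the dictionary layout and the function name `rank` are illustrative choices, not part of the original example.

```python
# (P(S|E), P(E)) for each English candidate, as listed on the slides.
CANDIDATES = {
    "What hunger have I": (1.4e-5, 1.0e-6),
    "Hungry I am so": (1.0e-6, 1.4e-6),
    "I am so hungry": (1.0e-6, 1.0e-4),
    "Have I that hunger": (2.0e-5, 9.8e-7),
}


def rank(candidates, use_language_model=True):
    """Return the candidates sorted from best to worst score."""

    def score(item):
        p_s_given_e, p_e = item[1]
        # Noisy-channel score P(S|E) * P(E), or the translation model alone.
        return p_s_given_e * p_e if use_language_model else p_s_given_e

    ordered = sorted(candidates.items(), key=score, reverse=True)
    return [sentence for sentence, _ in ordered]


print(rank(CANDIDATES, use_language_model=False)[0])
print(rank(CANDIDATES, use_language_model=True)[0])
```

With the translation model alone the word-for-word candidates win; multiplying by $P(E)$ moves the fluent sentence to the top.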

How to learn $P(F \mid E)$?
• Solution: collect statistics on vast parallel texts
• e.g., the Canadian Hansards: bilingual Parliamentary proceedings

    …citizen of Canada has the right to vote in an election of members of the House of Commons or of a legislative assembly and to be qualified for membership…

    …citoyen canadien a le droit de vote et est éligible aux élections législatives fédérales ou provinciales…

Bilingual data
From Chris Manning’s course at Stanford
• Data from Linguistic Data Consortium at University of Pennsylvania.

Alignment
• In practice, words and phrases can be out of order.

    French: Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d’adeptes. En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment

    English: According to our survey 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products. Cola drink manufacturers in particular achieved above average growth rates

    [Figure: the alignment between the French and English phrases.]

From Manning & Schütze

Alignment
• Also in practice, we’re usually not given the alignment.

    [The same French and English passages as on the previous slide, from Manning & Schütze.]

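Sentence alignment
• Sentences can also be unaligned across translations.
• E.g., He was happy. [E1] He had bacon. [E2] → Il était heureux parce qu'il avait du bacon. [F1]

[Figure: two alignments of English sentences $E_1 \ldots E_7$ with French sentences $F_1 \ldots F_7$: a strictly one-to-one pairing ($E_1$ ↔ $F_1$, $E_2$ ↔ $F_2$, …), and a pairing in which some sentences are merged or split across the two languages.]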

Sentence alignment
• We often need to align sentences before we can align words.
• We’ll look at two broad classes of methods:
  1. Methods that only look at sentence length,
  2. Methods based on lexical matches, or “cognates”.

1. Sentence alignment by length (Gale and Church, 1993)
• Assuming the paragraph alignment is known,
  • $\mathcal{L}_E$ is the # of words in an English sentence,
  • $\mathcal{L}_F$ is the # of words in a French sentence.
• Assume $\mathcal{L}_E$ and $\mathcal{L}_F$ have Gaussian/normal distributions with $\mu = c\,\mathcal{L}_E$ and $\sigma^2 = s^2\,\mathcal{L}_E$.
  • Empirical constants $c$ and $s$ set ‘by hand’.
  • The penalty, $\mathrm{Cost}(\mathcal{L}_E, \mathcal{L}_F)$, of aligning sentences with different lengths is based on the divergence of these Gaussians.
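
One way to make the penalty concrete, in the spirit of Gale and Church's paper, is to standardise the length difference and penalise unlikely values. This is a sketch under the Gaussian assumption stated above; $\delta$ and $\Phi$ (the standard normal CDF) are notation introduced here, and the exact form used in the course may differ.

```latex
\delta = \frac{\mathcal{L}_F - c\,\mathcal{L}_E}{\sqrt{s^2\,\mathcal{L}_E}},
\qquad
\mathrm{Cost}(\mathcal{L}_E, \mathcal{L}_F)
  = -\log\Bigl( 2\bigl(1 - \Phi(|\delta|)\bigr) \Bigr)
```

When the French length matches its expected value $c\,\mathcal{L}_E$, $\delta = 0$ and the cost is zero; the cost grows as the lengths diverge.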

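1. Sentence alignment by length
• We can associate costs with different types of alignments.
• $C_{i,j}$ is the prior cost of aligning $i$ sentences to $j$ sentences.

[Figure: English sentences $E_1 \ldots E_6$ aligned with French sentences $F_1 \ldots F_6$: $E_1, E_2$ ↔ $F_1$; $E_3$ ↔ $F_2$; $E_4$ ↔ $F_3$; $E_5$ ↔ $F_4, F_5$; $E_6$ ↔ $F_6$.]

$$
\begin{aligned}
\mathrm{Cost} ={} & \mathrm{Cost}(\mathcal{L}_{E_1} + \mathcal{L}_{E_2},\ \mathcal{L}_{F_1}) + C_{2,1} \\
 & + \mathrm{Cost}(\mathcal{L}_{E_3},\ \mathcal{L}_{F_2}) + C_{1,1} \\
 & + \mathrm{Cost}(\mathcal{L}_{E_4},\ \mathcal{L}_{F_3}) + C_{1,1} \\
 & + \mathrm{Cost}(\mathcal{L}_{E_5},\ \mathcal{L}_{F_4} + \mathcal{L}_{F_5}) + C_{1,2} \\
 & + \mathrm{Cost}(\mathcal{L}_{E_6},\ \mathcal{L}_{F_6}) + C_{1,1}
\end{aligned}
$$

• Find distribution of sentence breaks with minimum cost using dynamic programming.
• It’s a bit more complicated – see paper on course webpage.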
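
The dynamic program can be sketched in a few dozen lines. Everything numeric here is an assumption made for illustration: the constants `c` and `s2`, the prior costs in `PRIOR_COST`, and the particular length penalty (negative log of a two-tailed normal probability) are placeholders, not the values or exact formula from the paper or the course.

```python
import math

# Hypothetical prior costs C[i, j] for aligning i English to j French sentences.
PRIOR_COST = {(1, 1): 0.0, (1, 0): 4.5, (0, 1): 4.5, (2, 1): 2.5, (1, 2): 2.5}


def length_cost(len_e, len_f, c=1.0, s2=6.8):
    """Penalty for pairing an English span of len_e words with a French span of len_f."""
    variance = s2 * max(len_e, 1)  # avoid a zero variance for empty spans
    delta = (len_f - c * len_e) / math.sqrt(variance)
    # Two-tailed probability 2 * (1 - Phi(|delta|)) equals erfc(|delta| / sqrt(2)).
    p = max(math.erfc(abs(delta) / math.sqrt(2)), 1e-300)
    return -math.log(p)


def align(lens_e, lens_f):
    """Return the minimum-cost list of (english_indices, french_indices) groups."""
    n, m = len(lens_e), len(lens_f)
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == math.inf:
                continue
            # Try extending the alignment with each allowed group type.
            for (di, dj), prior in PRIOR_COST.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + prior + length_cost(
                    sum(lens_e[i:ni]), sum(lens_f[j:nj])
                )
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the groups.
    groups, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return groups[::-1]


# Sentence lengths in words: two short English sentences match one French sentence.
print(align([10, 12, 30, 8], [21, 31, 9]))
```

In the usage line, the first two English sentences (10 and 12 words) are grouped against the 21-word French sentence because paying the 2:1 prior is cheaper than forcing two badly matched 1:1 pairs.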

2. Sentence alignment by cognates
• Cognates: n.pl. Words that have a common etymological origin.
• Etymological: adj. Pertaining to the historical derivation of a word. E.g., porc → pork
• The intuition is that words that are related across languages have similar spellings.
  • e.g., zombie/zombie, government/gouvernement
  • Not always: son (male offspring) vs. son (sound)
• Cognates can “anchor” sentence alignments between related languages.

2. Sentence alignment by cognates
• Cognates should be spelled similarly…
• N-graph: n. Similar to N-grams, but computed at the character-level, rather than at the word-level.
    E.g., $\mathrm{Count}(s, h, i)$ is a trigraph model
• Church (1993) tracks all 4-graphs which are identical across two texts.
  • He calls this a ‘signal-based’ approximation to cognate identification.
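
A minimal sketch of character-level N-graphs and of finding the 4-graphs two strings share. The function names are made up for this example, and single words stand in for whole texts.

```python
def char_ngraphs(text, n=4):
    """All overlapping character sequences of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def shared_ngraphs(text_a, text_b, n=4):
    """The set of n-graphs that occur in both texts."""
    return set(char_ngraphs(text_a, n)) & set(char_ngraphs(text_b, n))


print(char_ngraphs("zombie"))
print(sorted(shared_ngraphs("government", "gouvernement")))
```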

2a. Church’s method
1. Concatenate paired texts.
2. Place a ‘dot’ where the $i$-th French and the $j$-th English 4-graph are equal.
3. Search for a short path ‘near’ the bilingual diagonals.

[Figure: dot plot with the concatenated English and French texts on both axes, from Manning & Schütze. E.g., a dot marks where the $i$-th French 4-graph is equal to the $j$-th English 4-graph.]
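
Step 2 can be sketched as follows. This simplified version computes only the French-versus-English quadrant of the dot plot rather than concatenating the texts, and it does not attempt the path search of step 3; the function name and the toy inputs are assumptions.

```python
from collections import defaultdict


def dot_positions(english, french, n=4):
    """Pairs (i, j) where the i-th French n-graph equals the j-th English n-graph."""
    # Index every English n-graph by the positions at which it starts.
    index = defaultdict(list)
    for j in range(len(english) - n + 1):
        index[english[j:j + n]].append(j)
    dots = []
    for i in range(len(french) - n + 1):
        for j in index.get(french[i:i + n], []):
            dots.append((i, j))
    return dots


print(dot_positions("the government", "le gouvernement"))
```

On real texts the dots coming from true cognates cluster near the diagonal, which is what the path search exploits.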

2a. Church’s method
• Each point along this path is considered to represent a match between languages.
• The relevant English and French sentences are ∴ aligned.

[Figure: the same dot plot with the chosen path, from Manning & Schütze. E.g., the $j$-th French sentence is aligned to the $k$-th English sentence.]

2b. Melamed’s method
• $\mathrm{LCS}(a, b)$ is the longest common subsequence of characters (with gaps allowed) in words $a$ and $b$.
• Melamed (1993) measures similarity of words $a$ and $b$ with the ‘LCS Ratio’:

    $\mathrm{LCSR}(a, b) = \dfrac{\mathrm{length}(\mathrm{LCS}(a, b))}{\max(\mathrm{length}(a), \mathrm{length}(b))}$

• e.g., $\mathrm{LCSR}(\text{government}, \text{gouvernement}) = \dfrac{10}{12}$
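
The ratio can be computed with the textbook dynamic program for longest common subsequence. This is a generic sketch rather than Melamed's own code; the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """LCS length divided by the length of the longer word."""
    if not a and not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


print(lcs_length("government", "gouvernement"))
print(lcsr("government", "gouvernement"))
```

For the slide's example the common subsequence is the whole of "government" (10 characters), and the longer word has 12, giving 10/12.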

2b. Melamed’s method
• Excludes stop words from both languages. (e.g., the, a, le, un)
• Melamed empirically declared that cognates occur when $\mathrm{LCSR} \geq 0.58$ (i.e., there’s a lot of overlap in those words).
  • ∴ 25% of words in Canadian Hansard are cognates.
• As with Church, construct a “bitext” graph.
  • Put a point at position $(i, j)$ ≡ $\mathrm{LCSR}(i, j) \geq 0.58$.
  • Find a near-diagonal alignment, as before.

From sentences to words
• We’ve computed the sentence alignments.
• What about word alignments?

Word alignment
• Word alignments can be 1:1, N:1, 1:N, 0:1, 1:0, … E.g.,
  • “zero fertility” word: not translated (1:0)
  • “spurious” words: generated from ‘nothing’ (0:1)
  • One word translated as several words (1:N)
• Note that this is only one possible alignment.

Intuition of statistical MT
• The words ‘the’ and ‘maison’ co-occur frequently, but not as frequently as ‘the’ and ‘la’.
• $P(\text{la} \mid \text{the})$ should be higher than $P(\text{fleur} \mid \text{the})$, $P(\text{bleue} \mid \text{the})$, and even $P(\text{maison} \mid \text{the})$.
• Note: we consider all possible word alignments….
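
This intuition can be tried out with a few rounds of expectation-maximisation in the style of IBM Model 1. The sketch below makes several assumptions: the three-sentence corpus is a hypothetical toy chosen to contain the words on the slide (the slide's own figure is not preserved), there is no NULL word, and the function name `ibm1_em` is made up.

```python
from collections import defaultdict


def ibm1_em(pairs, iterations=10):
    """Estimate t[(f, e)], an approximation of P(f | e), from sentence pairs."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Start from a uniform distribution over the French vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in pairs:
            for f in fs:
                # E-step: share f's unit of mass over every English word in the pair,
                # which implicitly sums over all possible word alignments.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalise the expected counts for each English word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


corpus = [
    ("la maison".split(), "the house".split()),
    ("la maison bleue".split(), "the blue house".split()),
    ("la fleur".split(), "the flower".split()),
]
t = ibm1_em(corpus)
for f in ("la", "maison", "bleue", "fleur"):
    print(f, round(t[(f, "the")], 3))
```

Because ‘la’ appears with ‘the’ in every pair while the other words do not, its share of the probability mass for ‘the’ grows with each iteration.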

Assignment 2 – content
• Build N-gram language models, with smoothing.
• Learn word-level alignments with the IBM-1 model using data from the Canadian Hansard.
• Combine the language and alignment models into a simple French-to-English translator.
• There are some bonus marks available for substantially going beyond the minimal requirements.

Assignment 2 – languages
• Sentences have already been split and aligned for you.
  • Words have not been aligned.
• You don’t need to know French for this assignment.
  • French is more ‘rigid’ than English, so its use of contractions, e.g., is more regular.
• You have to do some pre-processing of French sentences, but those rules are given to you explicitly.

Assignment 2 – practical
• Will be posted by Monday 13 February.
• Will be programmed in Matlab.
  • Various support functions for this assignment will be available on CDF.
• Hamed Heydari will give a tutorial on Matlab on 10 February.
• Marks will be given more for understanding the algorithms and concepts than for specific results.

Puzzle: Machine translation
• Puzzle (for fun): Translate this Centauri phrase to Arcturan:

    “farok crrrok hihok yorok clok kantok ok-yurp”

    1a. ok-voon ororok sprok .
    1b. at-voon bichat dat .

    2a. ok-drubel ok-voon anok plok sprok .
    2b. at-drubel at-voon pippat rrat dat .

    3a. erok sprok izok hihok ghirok .
    3b. totat dat arrat vat hilat .

    4a. ok-voon anok drok brok jok .
    4b. at-voon krat pippat sat lat .

    5a. wiwok farok izok stok .
    5b. totat jjat quat cat .

    6a. lalok sprok izok jok stok .
    6b. wat dat krat quat cat .

    7a. lalok farok ororok lalok sprok izok enemok .
    7b. wat jjat bichat wat dat vat eneat .

    8a. lalok brok anok plok nok .
    8b. iat lat pippat rrat nnat .

    9a. wiwok nok izok kantok ok-yurp .
    9b. totat nnat quat oloat at-yurp .

    10a. lalok mok nok yorok ghirok clok .
    10b. wat nnat gat mat bat hilat .

    11a. lalok nok crrrok hihok yorok zanzanok .
    11b. wat nnat arrat mat zanzanat .

    12a. lalok rarok nok izok hihok mok .
    12b. wat nnat forat arrat vat gat .

Puzzle: Machine translation
• Hint to get started: [the same phrase and twelve sentence pairs as on the previous slide; the hint’s highlighting is not preserved]

Readings
• Manning & Schütze: Sections 13.0 and 13.2
• (optional) Gale & Church “A Program for Aligning Sentences in Bilingual Corpora” (on course website)