Hebrew-to-English XFER MT Project - Update Alon Lavie June 2, 2004



The Team

• Alon Lavie
• Shuly Wintner (Faculty at Haifa Univ.)
• Yaniv Eytani (MS student at Haifa Univ.)
• Erik Peterson and Kathrin Probst…


[Diagram: system pipeline]

Hebrew Input: בשורה הבאה
→ Preprocessing
→ Morphology
→ Transfer Engine, using:
    Transfer Rules
      {NP1,3}
      NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
      ((X3::Y1)
       (X1::Y2)
       ((X1 def) = +)
       ((X1 status) =c absolute)
       ((X1 num) = (X3 num))
       ((X1 gen) = (X3 gen))
       (X0 = X1))
    Translation Lexicon
      N::N |: ["$WR"] -> ["BULL"]
      ((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "BULL"))
      N::N |: ["$WRH"] -> ["LINE"]
      ((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "LINE"))
→ Translation Output Lattice
    (0 1 "IN" @PREP)
    (1 1 "THE" @DET)
    (2 2 "LINE" @N)
    (1 2 "THE LINE" @NP)
    (0 2 "IN LINE" @PP)
    (0 4 "IN THE NEXT LINE" @PP)
→ Decoder, using the English Language Model
→ English Output: in the next line
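One way to read the NP transfer rule on this slide: a definite, absolute-state NP followed by the marker "H" and an adjective that agrees in number and gender is reordered so that the adjective precedes the NP on the English side. The Python sketch below illustrates that reading only and is not the XFER engine. The dict-based feature structures, the function names, and the sample feature values (including the adjective form "BAH") are assumptions, and treating "=" as unification and "=c" as a strict check is an interpretation rather than something the slides state.

```python
def unifies(a, b):
    """'=' in the rule: succeeds when either value is missing or both are equal."""
    return a is None or b is None or a == b


def apply_np_adj_rule(np1, marker, adj):
    """Return the English-side order [ADJ, NP1], or None if the rule does not apply."""
    if marker != "H":                                # literal "H" in the source side
        return None
    if not unifies(np1.get("def"), "+"):             # ((X1 def) = +)
        return None
    if np1.get("status") != "absolute":              # ((X1 status) =c absolute)
        return None
    if not unifies(np1.get("num"), adj.get("num")):  # ((X1 num) = (X3 num))
        return None
    if not unifies(np1.get("gen"), adj.get("gen")):  # ((X1 gen) = (X3 gen))
        return None
    return [adj, np1]                                # X3::Y1, X1::Y2


# Illustrative (assumed) feature values for "the next line".
np1 = {"lex": "$WRH", "def": "+", "status": "absolute", "num": "s", "gen": "f"}
adj = {"lex": "BAH", "num": "s", "gen": "f"}
result = apply_np_adj_rule(np1, "H", adj)
```

With these values the rule applies and the adjective comes first. Changing the adjective's number to plural makes the agreement check fail, and the function returns None.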


Main Tasks in Month-1

• Hebrew Encoding Issues
• Hebrew Language Resources:
  – H-to-E Translation Lexicon
  – Morphological Analyzer
• Putting together a front-end to the XFER engine: morphology, format conversions
• Elicitation for Hebrew (two versions of the Elicitation Corpus, EC)
• Installing system on local server in Haifa


Main Tasks in Month-2

• Improving Hebrew Language Resources:
  – H-to-E Translation Lexicon: “full” spelling, reverse dictionary, compounds, enhanced English side
  – Morphological Analyzer: all analyses, lattice representation
• Manual Transfer Grammar
• Collecting development and testing data (and their reference translations)
• Development based on small dev-set
• Evaluation on test data


Hebrew Encoding Issues

• Input texts are (mostly) in the standard Windows encoding for Hebrew
• Morphology analyzer and other resources are already set up to work in an “ASCII-like” representation
  – A converter script converts the input into the ASCII representation
• All further processing is done in the ASCII representation
• Lexicon and grammar rules are also in the ASCII representation
• Elicitation is done in UTF-8 Hebrew; output is converted to the ASCII representation
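The conversion step can be sketched in Python as a decode followed by a letter-by-letter transliteration. This is a minimal illustration, not the project's converter script. It assumes that "standard Windows encoding" means Windows-1255 (Python codec `cp1255`). Only the mappings ב→B, ש→$, ו→W, ר→R, ה→H are attested in these slides (בשורה → B$WRH); the rest of the table, and the names `HEBREW_TO_ASCII` and `to_ascii`, are assumptions.

```python
# Assumed transliteration table. Only ב, ש, ו, ר, ה are attested in the slides;
# every other entry is an assumption made for illustration.
HEBREW_TO_ASCII = {
    "א": "A", "ב": "B", "ג": "G", "ד": "D", "ה": "H", "ו": "W",
    "ז": "Z", "ח": "X", "ט": "@", "י": "I", "כ": "K", "ך": "K",
    "ל": "L", "מ": "M", "ם": "M", "נ": "N", "ן": "N", "ס": "S",
    "ע": "&", "פ": "P", "ף": "P", "צ": "C", "ץ": "C", "ק": "Q",
    "ר": "R", "ש": "$", "ת": "T",
}


def to_ascii(raw: bytes, encoding: str = "cp1255") -> str:
    """Decode Hebrew bytes and transliterate each letter; other characters pass through."""
    text = raw.decode(encoding)
    return "".join(HEBREW_TO_ASCII.get(ch, ch) for ch in text)


sample = "בשורה הבאה".encode("cp1255")   # simulate Windows-encoded input
converted = to_ascii(sample)
```

Passing `encoding="utf-8"` covers the elicitation output, which the slide says is produced in UTF-8 before conversion.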


Translation Lexicon

• “Dahan” H-to-E and E-to-H dictionary available to us
• Excel spreadsheet format (from a previous project)
• Coverage is not great but not bad
  – H-to-E is about 15K translation pairs
  – E-to-H is about 7K translation pairs
• POS information on both sides
• No proper names or named entities
• Issue with the “KTIB XSR” (deficient spelling) convention


Translation Lexicon

• Yaniv wrote scripts that
  – Extract the relevant fields from the Excel file
  – Extract words in “deficient spelling” and transform them into “full spelling”
  – Extract compound nouns and treat them specially
  – Merge with added lexicons (i.e. names)
  – Sort and remove duplicate entries
  – Convert to the XFER lexicon format
• Kathrin adapted a script that “enhances” the lexicon for English generation (plurals of nouns, tensed verb forms)
• [Show portion of full lexicon…]
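The last two script steps (sort, remove duplicates, convert to the XFER lexicon format) might look roughly like the Python sketch below. The output layout copies the two lexicon entries shown on the architecture slide. The row layout `(pos, hebrew, english, num)` and the function names are assumptions, and the spelling, compound, and merge steps are not modelled.

```python
def format_entry(pos, hebrew, english, num=None):
    """Render one entry in the lexicon layout shown on the architecture slide."""
    constraints = ["(X1::Y1)"]
    if num is not None:
        constraints.append(f"((X0 NUM) = {num})")
    constraints.append(f'((Y0 lex) = "{english}")')
    head = f'{pos}::{pos} |: ["{hebrew}"] -> ["{english}"]'
    return head + "\n(" + " ".join(constraints) + ")"


def build_lexicon(rows):
    """Drop exact duplicate rows, sort the rest, and format each one."""
    unique_rows = sorted(set(rows))
    return [format_entry(*row) for row in unique_rows]


# Hypothetical extracted rows: (pos, hebrew, english, num).
rows = [
    ("N", "$WRH", "LINE", "s"),
    ("N", "$WR", "BULL", "s"),
    ("N", "$WRH", "LINE", "s"),  # duplicate, removed by build_lexicon
]
lexicon = build_lexicon(rows)
```

Deduplicating on whole rows keeps distinct translations of the same Hebrew word, which a translation lexicon needs.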


Morphological Analyzer

• Morphology is a big deal for Hebrew
• Not just inflections and derivations, but also
  – Different words due to omission of vowels from the script
  – Attached prefixes for conjunctions, determiners, and prepositions, and some attached possessive suffixes
• Analyzer program from an MS student at the Technion is already available; it works on Windows and, with minimal adaptation, on Linux
• Coverage is reasonable…
• Produces all analyses or a disambiguated analysis for each word
• Entire sentence passed as input to morpher (not word-by-word)


Morphological Processing

• Split attached prefixes and suffixes into separate words for translation
• Produce f-structures as output
• Convert feature-value codes to our conventions
• Install morpher as a server running on our Linux machines
• Yaniv wrote Java scripts to handle input-output from the morpher
• Erik integrated a wrapper for running morpher as a server on our Linux machines
• “All analyses mode”: all possible analyses for each input word are returned, represented in the form of an input lattice


Morphology Example

• Input word: B$WRH

  0     1    2    3     4
  |--------B$WRH--------|
  |-----B-----|$WR|--H--|
  |--B--|-H--|--$WRH---|
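To make the lattice idea concrete, the Python sketch below stores the three rows of the diagram as edges and enumerates every segmentation that spans the whole word. It is an illustration, not the morpher's or the XFER engine's code. The edge positions are read off the drawing, and the positions for the middle row in particular are an assumption.

```python
from collections import defaultdict

# Edges (start, end, token) read off the three rows of the diagram.
# The node positions of the middle row are an assumption.
EDGES = [
    (0, 4, "B$WRH"),
    (0, 2, "B"), (2, 3, "$WR"), (3, 4, "H"),
    (0, 1, "B"), (1, 2, "H"), (2, 4, "$WRH"),
]


def all_paths(edges, start, end):
    """Enumerate every token sequence that spans the lattice from start to end."""
    outgoing = defaultdict(list)
    for s, e, token in edges:
        outgoing[s].append((e, token))

    def walk(node):
        if node == end:
            yield []
            return
        for nxt, token in outgoing[node]:
            for rest in walk(nxt):
                yield [token] + rest

    return list(walk(start))


paths = all_paths(EDGES, 0, 4)
```

Because the rows share nodes, the lattice under these assumed positions admits more segmentations than the three rows drawn (for example B followed by $WRH). That compactness is one reason to pass a lattice downstream instead of a flat list of analyses.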


Morphology Example

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))
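A flat f-structure of this kind can be read into a dictionary with a few lines of Python. This is a sketch only: it handles flat feature-value pairs like the ones on this slide and nothing nested, and the names `PAIR` and `parse_fstructure` are made up for the example.

```python
import re

# One "(NAME value)" pair; values contain no spaces or parentheses.
PAIR = re.compile(r"\(([A-Z]+) ([^()\s]+)\)")


def parse_fstructure(text):
    """Turn '((SPANSTART 0) (LEX B$WRH) ...)' into a dict; numeric values become ints."""
    features = {}
    for name, value in PAIR.findall(text):
        features[name] = int(value) if value.isdigit() else value
    return features


y0 = parse_fstructure(
    "((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) "
    "(GEN F) (NUM S) (STATUS ABSOLUTE))"
)
```

The SPANSTART and SPANEND values are what place each analysis in the input lattice, so converting them to integers makes them directly usable as node indices.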


Manual Transfer Grammar

• Written by Alon in a couple of days…
• Current grammar has 36 rules:
  – 21 NP rules
  – 1 PP rule
  – 6 verb-complex and VP rules
  – 8 higher-phrase and sentence-level rules
• Captures the most common (mostly local) structural differences between Hebrew and English
• [show portion of grammar…]


Elicitation for Hebrew

• Erik made sure the Elicitation Tool works for Hebrew
• Various versions of EC used:
  – Two reduced versions of full EC
  – Two versions of Structural EC
• Shuly and Yaniv translated and aligned a substantial portion of both
• Kathrin trained an initial learned grammar


Decoding

• Strong Decoder for H-to-E:
  – Kathrin and Alon adapted a script for running Stephan’s decoder.
  – No real amounts of parallel text, so no translation model scores for the edges…
  – Kathrin constructed a new English LM for decoding the Hebrew-to-English system
    • 160 million words
    • Includes English side of our translation lexicon
• [show portion of lattice…]
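The decoder's input is a translation lattice in the edge format shown on the architecture slide, `(start end "WORDS" @LABEL)`. The Python sketch below only parses that format into tuples. It does not reproduce the decoder or its language-model scoring, and the names `EDGE` and `parse_lattice` are made up for the example.

```python
import re

# One lattice edge: (start end "WORDS" @LABEL)
EDGE = re.compile(r'\((\d+) (\d+) "([^"]*)" @(\w+)\)')


def parse_lattice(text):
    """Parse '(start end "WORDS" @LABEL)' edges into (start, end, words, label) tuples."""
    return [
        (int(start), int(end), words, label)
        for start, end, words, label in EDGE.findall(text)
    ]


lattice = parse_lattice(
    '(0 1 "IN" @PREP)(1 1 "THE" @DET)(2 2 "LINE" @N)'
    '(1 2 "THE LINE" @NP)(0 2 "IN LINE" @PP)(0 4 "IN THE NEXT LINE" @PP)'
)
# Purely illustrative: pick the edge covering the widest span. The real decoder
# scores competing paths with the English LM instead.
widest = max(lattice, key=lambda edge: edge[1] - edge[0])
```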


Sample Output (dev-data)

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat

a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police

in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money


Evaluation Results

• Test set of 62 sentences from Haaretz newspaper, 2 reference translations

System    BLEU    NIST    Precision  Recall  METEOR
No Gram   0.0616  3.4109  0.3830     0.4153  0.3668
Learned   0.0774  3.5451  0.3938     0.4219  0.3775
Manual    0.1026  3.7789  0.4085     0.4241  0.3834
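The absolute scores are small, so relative gains over the no-grammar baseline are easier to compare. The Python sketch below computes them from the table values. The helper name `relative_gain` is made up for the example, and nothing here says whether differences on a 62-sentence test set are statistically significant.

```python
# Scores copied from the table above.
SCORES = {
    "No Gram": {"BLEU": 0.0616, "NIST": 3.4109, "METEOR": 0.3668},
    "Learned": {"BLEU": 0.0774, "NIST": 3.5451, "METEOR": 0.3775},
    "Manual": {"BLEU": 0.1026, "NIST": 3.7789, "METEOR": 0.3834},
}


def relative_gain(system, metric, baseline="No Gram"):
    """Percentage improvement of `system` over `baseline` on `metric`, to one decimal."""
    base = SCORES[baseline][metric]
    return round((SCORES[system][metric] - base) / base * 100, 1)


bleu_gains = {name: relative_gain(name, "BLEU") for name in ("Learned", "Manual")}
```

On BLEU this gives roughly a 26% relative gain for the learned grammar and roughly 67% for the manual grammar. The same helper can be applied to NIST and METEOR.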


Further Issues

• Transfer: the XFER engine can no longer handle the construction of full lattices (too many entries); we need a pruning mechanism
• Further improvements in the translation lexicon and morphological analyzer
• Decoding:
  – Adding a source-language LM
  – Can we train a translation model?
• Manual Grammar development…
• Improved grammar learning…