
Evaluation of Context-Dependent Phrasal Translation Lexicons for Statistical Machine Translation

Marine CARPUAT and Dekai WU

Human Language Technology Center, Department of Computer Science and Engineering

HKUST

New resources for SMT: context-dependent phrasal translation lexicons

A key new resource for Phrase Sense Disambiguation (PSD) for SMT [Carpuat & Wu 2007]

Entirely automatically acquired
Consistently improves 8 translation quality metrics [EMNLP 2007]
Fully phrasal, just like conventional SMT lexicons [TMI 2007]

But… much larger than conventional lexicons!

Why is this extremely large resource necessary?
Is its contribution observably useful?
Is it used by the SMT system differently than conventional SMT lexicons?


Our finding: context-dependent lexicons directly improve lexical choice in SMT

Exploit the available vocabulary better for phrasal segmentation
more and longer phrases are used in decoding
consistent with other findings [TMI 2007]: fully phrasal context-dependent lexicons yield more reliable improvements than single-word lexicons

Select better translation candidates
even after compensating for differences in phrasal segmentation

Improvements in BLEU, TER, METEOR, etc. really reflect improved lexical choice


Problems with current SMT systems

Input: 张教授给一群人就“中国和印度”上课。
Ref.: Prof. Zhang gave a lecture on “China and India” to a packed audience.
SMT1: Prof. Zhang to a group of people on `China and India` class.
SMT2: Prof. Zhang and a group of people go into class on “China and India”.

(Figure: the lexicon's candidate translations (up, go into, climb, …, attend, gave, …), contrasting SMT2's incorrect choice with the correct reference translation.)


Translation lexicons in SMT are independent of context!

张教授给一群人就“中国和印度”上课。

欢迎大家明天来上课,题目是“中国和印度”。

Prof. Zhang gave a lecture on “China and India” to a packed audience.

Everyone is welcome to attend class tomorrow, on the topic “China and India”.

(Figure: the same static lexicon entry (up .25, go into .25, climb .20, …, attend .10, gave .05, …) is applied to both sentences, regardless of context.)


Phrasal lexicons in SMT are independent of context too!

张教授给一群人就“中国和印度”上课。

欢迎大家明天来上课,题目是“中国和印度”。

Prof. Zhang gave a lecture on “China and India” to a packed audience.

Everyone is welcome to attend class tomorrow, on the topic “China and India”.

(Figure: the same static phrasal entry (attend class .45, gave a lecture .15, …) is applied to both sentences, regardless of context.)


Current SMT systems are hurt by very weak models of context

Translation disambiguation models are too simplistic:

Phrasal lexicon translation probabilities are static, so not sensitive to context

Context in input language is only modeled weakly by phrase segments

Context in output language is only modeled weakly by n-grams

Error analysis reveals many lexical choice errors

Yet, few attempts at directly modeling context


Today’s SMT systems ignore the contextual features that would help lexical choice

No full sentential context: merely local n-gram context

No POS information: merely surface form of words

No structural information: merely word n-gram identities



Correct translation disambiguation requires rich context features

张教授给一群人就“中国和印度”上课。

欢迎大家明天来上课,题目是“中国和印度”。

Prof. Zhang gave a lecture on “China and India” to a packed audience.

Everyone is welcome to attend class tomorrow, on the topic “China and India”.

(Figure: rich context features over both sentences, POS tag sequences and SUBJ dependency links, turn the static scores (attend class .45, gave a lecture .15) into context-dependent ones: gave a lecture rises to .80 for the first sentence while attend class drops to .15, and attend class rises to .70 for the second while gave a lecture drops to .20.)


Today’s SMT systems ignore context in their phrasal translation lexicons

In conventional phrase-based SMT, the decoder maximizes a log-linear combination of static models, so the lexicon probabilities do not depend on context:

$$ e^{*} = \operatorname*{argmax}_{e,t,a} \Big[ \lambda_{1}\log P(e) + \lambda_{2}\log P(f \mid e) + \cdots \Big] $$

The context-dependent lexicon instead conditions each phrase translation probability on $c_{j}(f)$, the entire input sentence context:

$$ e^{*} = \operatorname*{argmax}_{e,t,a} \Big[ \lambda_{1}\log P(e) + \lambda_{2}\sum_{j=1}^{m} \log P(t_{j} \mid f_{a_{j}}, c_{j}(f)) + \cdots \Big] $$


But context-dependent lexical choice does not necessarily improve translation quality

Early pilot study [Brown et al. 1991]
used the single most discriminative feature to disambiguate between 2 English translations of a French word
WSD improves French-English translation quality, but only on a limited vocabulary and allowing only 2 senses

Context-dependent lexical choice helps word alignment, but not really translation quality [Garcia Varea et al. 2001, 2002]
a maximum-entropy trained bilexicon replaces IBM-4/5 translation probabilities
improves AER on Canadian Hansards and Verbmobil tasks
small improvement in WER and PER by rescoring n-best lists, but not statistically significant [Garcia Varea & Casacuberta 2005]


Context-dependent modeling improves quality of Statistical MT [Carpuat & Wu 2007]

Introduced context-dependent phrasal lexicons for SMT
leverage WSD techniques for SMT lexical choice
generalize conventional WSD to Phrase Sense Disambiguation

Context-dependent modeling always improves SMT accuracy
on all tasks: 3 different IWSLT06 datasets, NIST04
on all 8 common automatic metrics: BLEU, NIST, METEOR, METEOR+synsets, TER, WER, PER, CDER


No other WSD for SMT approach improves translation quality as consistently

Until recently, using WSD to improve SMT quality has met with mixed or disappointing results

Carpuat & Wu [ACL-2005], Cabezas & Resnik [unpub]

Last year, for the first time, different approaches showed that WSD can help translation quality

WSD improved BLEU (but how about other metrics??) on 3 Chinese-English tasks [Carpuat et al. IWSLT-2006]

WSD improved BLEU (but how about other metrics??) on Chinese-English NIST task [Chan et al. ACL-2007]

WSD improved METEOR (but not BLEU!) on Spanish-English Europarl task [Giménez & Màrquez WMT-2007]

Phrasal WSD improves BLEU, NIST, METEOR (but how about error rates??) on Italian-English and Chinese-English IWSLT tasks [Stroppa et al. TMI-2007]

But no other approach improves on 8 metrics on 4 different tasks


But how useful are the context-dependent lexicons as resources?

Improving translation quality is great, but…
Metrics aggregate the impact of many different factors
Metrics ignore how translation hypotheses are generated

Context-dependent lexicons are more expensive to train, so…

Are their contributions observably useful?

Direct analysis needed: how do SMT systems use context-dependent vs. conventional lexicons?


Learning context-dependent vs. conventional lexicons for SMT

Learned from the same word-aligned parallel data:
cover the same phrasal input vocabulary
know the same phrasal translation candidates

Only difference: an additional context-dependent parameter
dynamically computed vs. static conventional scores
uses WSD modeling vs. MLE in conventional lexicons


Word Sense Disambiguation provides appropriate models of context

WSD has long targeted the questions of
how to design context features
how to combine contextual evidence into a sense prediction

Senseval/SemEval have extensively evaluated WSD systems
with different feature sets
with different machine learning classifiers

Senseval multilingual lexical sample tasks use observable lexical translations as senses, just like lexical choice in SMT
e.g. Senseval-2003 English-Hindi, SemEval-2007 Chinese-English


Leveraging a Senseval WSD system

Top Senseval-3 Chinese Lexical Sample system [Carpuat et al. 2004]

standard classification models: maximum entropy, SVM, boosted decision stumps, naïve Bayes

rich lexical and syntactic features:
bag-of-words sentence context
position-sensitive co-occurring words and POS tags
basic syntactic dependency features
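As a rough sketch of how such a feature set can feed a maximum entropy learner (scikit-learn's LogisticRegression standing in for the paper's maxent classifier; the feature names and window size are illustrative assumptions, not the authors' exact design):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def context_features(tokens, pos_tags, i):
    """Illustrative WSD features for the target word at position i:
    bag-of-words sentence context plus position-sensitive
    neighboring words and POS tags."""
    feats = {}
    for tok in tokens:                      # bag-of-words context
        feats["bow=" + tok] = 1
    for off in (-2, -1, 1, 2):              # position-sensitive window
        j = i + off
        if 0 <= j < len(tokens):
            feats["w%+d=%s" % (off, tokens[j])] = 1
            feats["p%+d=%s" % (off, pos_tags[j])] = 1
    return feats

# X_train: list of feature dicts; y_train: observed translations as senses
vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)     # maxent stand-in
# clf.fit(vec.fit_transform(X_train), y_train)
# probs = clf.predict_proba(vec.transform([context_features(toks, tags, i)]))
```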


Generalizing WSD to PSD for context-dependent phrasal translation lexicons

One PSD model per input language phrase
regardless of POS, length, etc.
a generalization of standard WSD models

Sense candidates are the phrase translation candidates seen in training

The sense candidates are extracted just like the conventional SMT phrasal lexicon

typically, output language phrases consistent with the intersection of bidirectional IBM alignments
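The consistency criterion behind this extraction (Koehn et al. 2003) can be sketched as follows; the function name and the data layout, alignment links as (source index, target index) pairs, are assumptions for illustration:

```python
def consistent(align, s_lo, s_hi, t_lo, t_hi):
    """True if the phrase pair (src[s_lo:s_hi], tgt[t_lo:t_hi]) is
    consistent with the word alignment: every link touching either
    span stays entirely inside the phrase pair."""
    linked = [(s, t) for (s, t) in align
              if s_lo <= s < s_hi or t_lo <= t < t_hi]
    if not linked:
        return False                      # require at least one link
    return all(s_lo <= s < s_hi and t_lo <= t < t_hi for (s, t) in linked)

# Example: source word 4 aligned to target word 7
align = {(4, 7)}
print(consistent(align, 4, 5, 7, 8))      # True
print(consistent(align, 4, 5, 6, 7))      # False: link leaves the target span
```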


Extracting PSD senses and training examples from word-aligned parallel text

is there a new - age music concert within the next few days ?
在 最近 一段 时间 里 有 流行音乐 会 吗 ?

Extracted PSD training instances:
在 最近 一段 时间 <t sense=“within”> 里 </t> 有 流行音乐 会 吗 ?
在 最近 一段 时间 里 有 <t sense=“new - age music”> 流行音乐 </t> 会 吗 ?
在 <t sense=“within the next few days”> 最近 一段 时间 里 </t> 有 流行音乐 会 吗 ?
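A minimal sketch of how the tagged instances above could be enumerated: every alignment-consistent phrase pair yields one PSD training instance whose sense label is the target phrase, with the whole source sentence as context. Only a few of the example's alignment links are included here; the rest are omitted for brevity.

```python
def extract_psd_instances(src, tgt, align, max_len=10):
    """Yield (source_phrase, sense) PSD training instances, one per
    alignment-consistent phrase pair; the full source sentence
    serves as the instance's context."""
    def consistent(s_lo, s_hi, t_lo, t_hi):
        linked = [(s, t) for (s, t) in align
                  if s_lo <= s < s_hi or t_lo <= t < t_hi]
        return bool(linked) and all(
            s_lo <= s < s_hi and t_lo <= t < t_hi for (s, t) in linked)

    for s_lo in range(len(src)):
        for s_hi in range(s_lo + 1, min(s_lo + max_len, len(src)) + 1):
            for t_lo in range(len(tgt)):
                for t_hi in range(t_lo + 1, min(t_lo + max_len, len(tgt)) + 1):
                    if consistent(s_lo, s_hi, t_lo, t_hi):
                        yield " ".join(src[s_lo:s_hi]), " ".join(tgt[t_lo:t_hi])

src = "在 最近 一段 时间 里 有 流行音乐 会 吗 ?".split()
tgt = "is there a new - age music concert within the next few days ?".split()
align = {(4, 8), (6, 3), (6, 4), (6, 5), (6, 6)}   # 里->within, 流行音乐->new - age music
for f, e in extract_psd_instances(src, tgt, align):
    print(f, "-> sense:", e)
```

Note that words left unaligned at a phrase boundary let multiple overlapping pairs be extracted, which is exactly why the same sentence pair above yields senses for 里, 流行音乐, and the longer phrase 最近 一段 时间 里.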

Integrating the context-dependent lexicon into phrase-based SMT architectures

The context-dependent phrasal lexicon probabilities
are conditional translation probabilities
can naturally be added as a feature in log-linear translation models

Unlike conventional translation probabilities, they are dynamically computed, dependent on full-sentence context

Decoding can make full use of context-dependent phrasal lexicon predictions at all stages of decoding, unlike in n-best reranking
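A minimal sketch of this integration, with illustrative feature names, values, and weights: the dynamically computed PSD probability enters the standard log-linear score as just one more weighted log feature.

```python
import math

def hypothesis_score(features, weights):
    """Standard log-linear model score: weighted sum of log feature values."""
    return sum(weights[name] * math.log(val) for name, val in features.items())

# For each phrase pair used in a hypothesis, the PSD model is queried with
# the *current* input sentence, unlike the static phrase-table scores:
features = {
    "p(f|e)": 0.30,    # static phrase-table probability
    "p(e|f)": 0.25,    # static phrase-table probability, other direction
    "lm":     0.0008,  # language-model probability of the output
    "psd":    0.80,    # dynamic p(t | f, c(f)) from the WSD model
}
weights = {"p(f|e)": 0.2, "p(e|f)": 0.2, "lm": 0.5, "psd": 0.3}
print(hypothesis_score(features, weights))
```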


Evaluating context-dependent phrasal translation lexicons

lexical choice only vs. translation quality [Carpuat & Wu EMNLP 2007]

integrated evaluation in SMT vs. stand-alone as in Senseval [Carpuat et al. 2004]

fully phrasal lexicons only vs. single-word context-dependent lexicon [Carpuat & Wu TMI 2007]

Translation task
Test set: NIST-04 Chinese-English text translation
1788 sentences, 4 reference translations

Standard phrase-based SMT decoder (Moses)


Experimental setup

Learning the lexicons

Standard conventional lexicon learning
Newswire Chinese-English corpus, ~2M sentences
Standard word-alignment methodology: GIZA++, intersection grown using the “grow-diag” heuristic [Koehn et al. 2003] (see the sketch after this slide)
Standard Pharaoh/Moses phrase-table
maximum phrase length = 10
translation probabilities in both directions, lexical weights

Context-dependent lexicons
use the exact same word-aligned parallel data
train a WSD model for each known phrase
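The “grow-diag” symmetrization referenced above can be sketched as follows; this is a minimal reimplementation after Koehn et al. (2003), not the toolkit's code.

```python
NEIGHBORS = [(-1, 0), (0, -1), (1, 0), (0, 1),
             (-1, -1), (-1, 1), (1, -1), (1, 1)]

def grow_diag(intersection, union):
    """Grow-diag symmetrization: start from the intersection of the
    two directional alignments, then repeatedly add union links
    adjacent (incl. diagonally) to a current link, provided one of
    the two words involved is still unaligned."""
    align = set(intersection)
    added = True
    while added:
        added = False
        for (s, t) in sorted(align):
            for ds, dt in NEIGHBORS:
                cand = (s + ds, t + dt)
                if cand in union and cand not in align:
                    s2, t2 = cand
                    if all(a != s2 for a, _ in align) or \
                       all(b != t2 for _, b in align):
                        align.add(cand)
                        added = True
    return align

# sym = grow_diag(links_src2tgt & links_tgt2src, links_src2tgt | links_tgt2src)
```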


Step 1: Evaluating phrasal segmentation with context-dependent vs. conventional lexicons

Goal: compare the phrasal segmentation of the input sentence used to produce the top hypothesis

Method:

We do not evaluate accuracy: there is no gold-standard phrasal segmentation!

Instead, we analyze how the input phrases available in lexicons are used

SMT uses longer input phrases with context-dependent lexicons

Context-dependent lexicons help use longer, less ambiguous phrases


SMT uses more input phrase types with context-dependent lexicons

26% of phrase types used with context-dependent lexicon are not used with conventional lexicon

96% of those lexicon entries are truly phrasal (not single words)

Context-dependent lexicons make better use of available input language vocabulary

SMT uses more rare phrases with context-dependent lexicons

With context modeling, less training data is needed for phrases to be used
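A hedged sketch of the bookkeeping behind these comparisons, assuming each decoder run logs the input-phrase segmentation of its best hypothesis; the toy data stands in for the two decoders' actual logs.

```python
def segmentation_stats(segmentations):
    """segmentations: list of decoded sentences, each a list of the
    input phrases used. Returns average phrase length in words and
    the set of distinct phrase types used."""
    phrases = [p for sent in segmentations for p in sent]
    avg_len = sum(len(p.split()) for p in phrases) / len(phrases)
    return avg_len, set(phrases)

# toy stand-ins for the two decoders' logged segmentations
conv_runs = [["上", "课"], ["中 国", "和", "印 度"]]      # conventional lexicon
ctx_runs  = [["上 课"], ["中 国 和 印 度"]]               # context-dependent

avg_conv, types_conv = segmentation_stats(conv_runs)
avg_ctx,  types_ctx  = segmentation_stats(ctx_runs)
new_types = types_ctx - types_conv
print("avg phrase length: %.2f vs %.2f" % (avg_conv, avg_ctx))
print("%.0f%% of phrase types are new" % (100 * len(new_types) / len(types_ctx)))
```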


Step 2: Comparing translation selection

Goal: compare translation selection only

Method:

We compare accuracy of translation selection for identical segments only

Because different lexicons yield different phrasal segmentations

A translation is considered accurate if it matches any of the reference translations

Because input sentence and references are not word-aligned
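A minimal sketch of this matching criterion; the function name and case handling are illustrative assumptions.

```python
def matches_any_reference(translation, references):
    """Accurate iff the chosen target phrase occurs verbatim in at
    least one reference translation (case-insensitive), since no
    word alignment to the references is available."""
    t = translation.lower()
    return any(t in ref.lower() for ref in references)

refs = ['Prof. Zhang gave a lecture on "China and India" to a packed audience.']
print(matches_any_reference("gave a lecture", refs))   # True
print(matches_any_reference("attend class", refs))     # False
```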


Context-dependent lexicon predictions match references better

Context-dependent lexicons yield more matches than conventional lexicons

48% of errors made with conventional lexicons are corrected with context-dependent lexicons

                               Conventional lexicon
                               Match      No match
Context-dependent   Match       1435        2139
lexicon             No match     683        2272
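The 48% figure follows from the right-hand column of the table: of the 2139 + 2272 segments the conventional lexicon gets wrong, the context-dependent lexicon corrects 2139, i.e. 2139 / (2139 + 2272) ≈ 0.485 ≈ 48%.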

Conclusion: context-dependent phrasal translation lexicons are useful resources for SMT

A key new resource for Phrase Sense Disambiguation (PSD) for SMT [Carpuat & Wu 2007]

Entirely automatically acquired
Consistently improves 8 translation quality metrics [EMNLP 2007]
Fully phrasal, just like conventional SMT lexicons [TMI 2007]

But… much larger than conventional lexicons!

Why is this extremely large resource necessary?
Is its contribution observably useful?
Is it used by the SMT system differently than conventional SMT lexicons?


Conclusion: context-dependent phrasal translation lexicons are useful resources for SMT

Improve phrasal segmentation
Exploit the available input vocabulary better
More phrases, longer phrases, and more rare phrases are used in decoding
Consistent with other findings: fully phrasal context-dependent lexicons yield more reliable improvements than single-word lexicons [Carpuat & Wu TMI 2007]

Improve translation candidate selection, even after compensating for differences in phrasal segmentation

Genuinely improve lexical choice, not just BLEU and other metrics!


Translation quality evaluation: not just BLEU, but 8 automatic metrics

N-gram matching metrics: BLEU4, NIST, METEOR, METEOR+synsets (augmented with WordNet synonym matching)

Edit distances: TER, WER, PER, CDER


Context-dependent modeling consistently improves translation quality

Test set  Experiment  BLEU   NIST   METEOR  METEOR(no syn)  TER    WER    PER    CDER
IWSLT 1   SMT         42.21  7.888  65.40   63.24           40.45  45.58  37.80  40.09
IWSLT 1   SMT+WSD     42.38  7.902  65.73   63.64           39.98  45.30  37.60  39.91
IWSLT 2   SMT         41.49  8.167  66.25   63.85           40.95  46.42  37.52  40.35
IWSLT 2   SMT+WSD     41.97  8.244  66.35   63.86           40.63  46.14  37.25  40.10
IWSLT 3   SMT         49.91  9.016  73.36   70.70           35.60  40.60  32.30  35.46
IWSLT 3   SMT+WSD     51.05  9.142  74.13   71.44           34.68  39.75  31.71  34.58
NIST      SMT         20.41  7.155  60.21   56.15           76.76  88.26  61.71  70.32
NIST      SMT+WSD     20.92  7.468  60.30   56.79           71.34  83.37  57.29  67.38


Results are statistically significant

NIST results are statistically significant at the 95% level

Tested using paired bootstrap resampling
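A sketch of paired bootstrap resampling, under the simplifying assumption that the metric decomposes into per-sentence scores (BLEU, strictly, does not; a corpus-level scorer over each resampled set would be substituted in practice).

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """scores_a/scores_b: per-sentence metric scores of two systems
    on the same test set. Returns the fraction of resampled test
    sets on which system B outperforms system A."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / n_samples

# B is significantly better at the 95% level if it wins on >= 95% of samples
```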


Translations with context-dependent phrasal lexicons often differ from SMT translations

Test set  Translations changed by context modeling
IWSLT 1   25.49%
IWSLT 2   30.40%
IWSLT 3   29.25%
NIST      95.74%


Context-dependent modeling helps even for small and single-domain IWSLT

IWSLT is a single-domain task with very short sentences

Even in these conditions, context-dependent phrasal lexicons are helpful

there are genuine sense ambiguities, e.g. “turn” vs. “transfer”

context features are available: 19 observed features per occurrence of a Chinese phrase


The most useful context features are not available in standard SMT

The 3 most useful context feature types are:
POS tag of the word preceding the target phrase
POS tag of the word following the target phrase
bag-of-words context

We use the weights learned by the maximum entropy classifier to determine the most useful features:
we normalize the feature weights for each WSD model, then compute the average weight of each feature type (see the sketch below)
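A sketch of this analysis as described; the convention that a feature's type is the prefix of its name is an assumption for illustration.

```python
from collections import defaultdict

def avg_weight_by_type(models):
    """models: list of {feature_name: weight} dicts, one per WSD
    model. Feature type is taken to be the prefix before '=',
    e.g. 'p-1=NN' has type 'p-1'. Weights are normalized within
    each model, then averaged per feature type."""
    totals, counts = defaultdict(float), defaultdict(int)
    for weights in models:
        norm = sum(abs(w) for w in weights.values()) or 1.0
        for name, w in weights.items():
            ftype = name.split("=", 1)[0]
            totals[ftype] += abs(w) / norm
            counts[ftype] += 1
    return {t: totals[t] / counts[t] for t in totals}

print(avg_weight_by_type([{"p-1=NN": 2.0, "bow=class": 1.0},
                          {"p-1=VV": 1.0, "bow=topic": 1.0}]))
```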


Dynamic context-dependent sense predictions are better than static predictions

Context-dependent modeling often helps rank the correct translation first

Even when context-dependent modeling picks the same translation candidate, the WSD scores are more discriminative than baseline translation probabilities

better at overriding incorrect LM predictions

give higher confidence to translate longer input phrases when appropriate


Context-dependent modeling improves phrasal lexical choice examples


Context-dependent modeling prefers longer phrases

Input

Ref. No parliament members voted against him .

SMT Without any congressmen voted against him .

SMT+WSD No congressmen voted against him .



Context-dependent modeling prefers longer phrases

Average length of Chinese phrases used is higher with context-dependent phrasal lexicon

This confirms that
context-dependent predictions for all phrases are useful
context-dependent predictions should be available at all stages of decoding

This explains why using WSD for single words only, as in Cabezas & Resnik [2005] and Carpuat et al. [2006], has a less reliable impact on translation quality


Context-dependent lexicons should be phrasal to always help translation

Test set  Experiment     BLEU   NIST   METEOR  METEOR(no syn)  TER    WER    PER    CDER
# 1       SMT            42.21  7.888  65.40   63.24           40.45  45.58  37.80  40.09
# 1       +word lex.     41.94  7.911  65.55   63.52           40.59  45.61  37.75  40.09
# 1       +phrasal lex.  42.38  7.902  65.73   63.64           39.98  45.30  37.60  39.91
# 2       SMT            41.49  8.167  66.25   63.85           40.95  46.42  37.52  40.35
# 2       +word lex.     41.31  8.161  66.23   63.72           41.34  46.82  37.98  40.69
# 2       +phrasal lex.  41.97  8.244  66.35   63.86           40.63  46.14  37.25  40.10
# 3       SMT            49.91  9.016  73.36   70.70           35.60  40.60  32.30  35.46
# 3       +word lex.     49.73  9.017  73.32   70.82           35.72  40.61  32.10  35.30
# 3       +phrasal lex.  51.05  9.142  74.13   71.44           34.68  39.75  31.71  34.58



Context-dependent modeling improves quality of Statistical MT

Presented context-dependent phrasal lexicons for SMT
leverage WSD techniques for SMT lexical choice

Context-dependent modeling always improves SMT accuracy
on all tasks: 3 different IWSLT06 datasets, NIST04
on all 8 common automatic metrics: BLEU, NIST, METEOR, METEOR+synsets, TER, WER, PER, CDER

Why?
The most useful context features are unavailable to current SMT systems
Better phrasal segmentation
Better phrasal lexical choice: more accurate rankings, more discriminative scores


Maxent-based sense disambiguation in Candide [Berger 1996]

No evaluation of impact on translation quality
only 2 example sentences; no contrastive evaluation by human judgment nor any automatic metric
the extension by Garcia Varea et al. does not significantly improve translation quality

Still does not model input language context
overly simplified context model
does not use full sentential context: only 3 words to the left, 3 words to the right
does not generalize over word identities: only words, no POS tags
does not generalize to phrasal disambiguation: targets only words

Does not augment the existing SMT model: only replaces the context-independent translation probability