TRANSCRIPT
1
Novel Inference, Training and Decoding Methods over Translation Forests
Zhifei Li
Center for Language and Speech Processing
Computer Science Department
Johns Hopkins University
Advisor: Sanjeev Khudanpur
Co-advisor: Jason Eisner
2
Statistical Machine Translation Pipeline
[Pipeline diagram] Generative training on bilingual data produces the translation models; generative training on monolingual English data produces the language models; discriminative training on held-out bilingual data produces the optimal weights; decoding applies the models and weights to unseen sentences to produce translation outputs.
3
Training a Translation Model
垫子 上 的 猫
dianzi shang de mao
a cat on the mat
dianzi shang → the mat
4
Training a Translation Model
垫子 上 的 猫
dianzi shang de mao
a cat on the mat
dianzi shang → the mat,  mao → a cat
5
Training a Translation Model
垫子 上 的 猫
dianzi shang de
on the mat
dianzi shang de → on the mat,  dianzi shang → the mat,  mao → a cat
6
Training a Translation Model
垫子 上 的 猫
de mao
a cat on
de mao → a cat on,  dianzi shang de → on the mat,  dianzi shang → the mat,  mao → a cat
7
Training a Translation Model
垫子 上 的 猫
de
on
de → on,  de mao → a cat on,  dianzi shang de → on the mat,  dianzi shang → the mat,  mao → a cat
8
Decoding a Test Sentence
垫子 上 的 狗
dianzi shang de gou
the dog on the mat
Translation is easy?
Derivation Tree
9
Translation Ambiguity
垫子 上 的 猫
dianzi shang de mao
a cat on the mat
zhongguo de shoudu → capital of China
wo de mao → my cat
zhifei de mao → zhifei 's cat
10
dianzi shang de mao
Joshua (chart parser)
11
dianzi shang de mao
Joshua (chart parser)
Candidate translations: the mat a cat,  a cat on the mat,  a cat of the mat,  the mat 's a cat
12
dianzi shang de mao
Joshua (chart parser)
hypergraph
13
A hypergraph is a compact data structure to encode exponentially many trees.
[Figure labels: hyperedge vs. edge, node, FSA]
Packed Forest
14
A hypergraph is a compact data structure to encode exponentially many trees.
15
16
17
Structure sharing
18
Why Hypergraphs?
• Contains a much larger hypothesis space than a k-best list
• General compact data structure (a minimal data-structure sketch follows below)
  • special cases include finite state machines (e.g., lattices), and/or graphs, and packed forests
• Can be used for speech, parsing, tree-based MT systems, and many more
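To make the data structure concrete, here is a minimal sketch of a packed forest as nodes and hyperedges; the class names and fields are illustrative only, not Joshua's actual API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Hyperedge:
    head: "Node"          # the node this hyperedge derives
    tails: List["Node"]   # antecedent nodes (empty for leaf/axiom edges)
    rule: str             # e.g. an SCFG rule such as "X -> (dianzi shang, the mat)"
    weight: float = 1.0   # model score of applying the rule

@dataclass
class Node:
    label: str                                                # e.g. "X[0,2]" (category plus source span)
    incoming: List[Hyperedge] = field(default_factory=list)   # alternative ways to build this node

# A derivation tree is obtained by choosing one incoming hyperedge per node,
# recursively.  Because subtrees are shared, exponentially many derivations
# fit in space linear in the number of hyperedges.
```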
19
Weighted Hypergraph
Linear model: score(x, d) = Θ · f(x, d)   (weights Θ, features f, derivation d, foreign input x)
Derivation weights in the figure: p=2, p=1, p=3, p=2
20
Probabilistic Hypergraph
Log-linear model: p(d | x) = exp(Θ · f(x, d)) / Z(x)
Z = 2 + 1 + 3 + 2 = 8
Derivation probabilities in the figure: p=2/8, p=1/8, p=3/8, p=2/8
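A minimal sketch of this normalization on the slide's toy weights; the derivation names d1..d4 are made up for illustration.

```python
# Toy example matching the slide: four derivations with weights 2, 1, 3, 2.
# Under the log-linear model each weight would be exp(theta . f(x, d)); here
# we start directly from the already-exponentiated weights.
weights = {"d1": 2.0, "d2": 1.0, "d3": 3.0, "d4": 2.0}

Z = sum(weights.values())                     # Z = 2 + 1 + 3 + 2 = 8
p = {d: w / Z for d, w in weights.items()}    # p(d|x) = weight(d) / Z
print(p)   # {'d1': 0.25, 'd2': 0.125, 'd3': 0.375, 'd4': 0.25}
```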
21
Probabilistic Hypergraph
The hypergraph defines a probability distribution over trees!
The distribution is parameterized by Θ.
Derivation probabilities in the figure: p=2/8, p=1/8, p=3/8, p=2/8
22
Probabilistic Hypergraph
The hypergraph defines a probability distribution over trees; the distribution is parameterized by Θ.
Training: how do we set the parameters Θ?
Decoding: which translation do we present to a user?
Atomic Inference: what atomic operations do we need to perform?
Why are these problems difficult?
• brute force is too slow, since there are exponentially many trees, so we need sophisticated dynamic programs
• some problems are intractable and require approximations
23
Inference, Training and Decoding on Hypergraphs
• Atomic Inference (a one-best sketch follows below)
  • finding one-best derivations
  • finding k-best derivations
  • computing expectations (e.g., of features)
• Training
  • Perceptron, conditional random field (CRF), minimum error rate training (MERT), minimum risk, and MIRA
• Decoding
  • Viterbi decoding, maximum a posteriori (MAP) decoding, and minimum Bayes risk (MBR) decoding
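As an illustration of the first atomic operation, here is a sketch of Viterbi one-best search over the illustrative Node/Hyperedge classes sketched earlier; it assumes the nodes are visited bottom-up (tails before heads) and that derivation scores multiply along a tree, as in the weighted-hypergraph slides.

```python
def viterbi(nodes):
    """Return, for every node, the score of its best derivation and the
    hyperedge that achieves it (for reading off the one-best tree).

    `nodes` must be in bottom-up (topological) order, so every tail node is
    scored before any head node that uses it.
    """
    best = {}
    back = {}
    for node in nodes:
        if not node.incoming:            # leaf / axiom node
            best[node.label] = 1.0
            continue
        for edge in node.incoming:
            score = edge.weight
            for tail in edge.tails:
                score *= best[tail.label]
            if node.label not in best or score > best[node.label]:
                best[node.label] = score
                back[node.label] = edge
    return best, back
```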
24
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
‣ minimum imputed risk
‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
main focus
25
Training Setup
• Each training example consists of
  • a foreign sentence (from which a hypergraph is generated to represent many possible translations)
  • a reference translation
  x: dianzi shang de mao
  y: a cat on the mat
• Training
  • adjust the parameters Θ so that the reference translation is preferred by the model
27
Supervised: Minimum Empirical Risk
• Minimum Empirical Risk Training
  (x, y) pairs come from the empirical distribution; x is fed to the MT decoder and the MT output is scored against y under a loss such as negated BLEU
  related methods: MERT, CRF, Perceptron
• Uniform Empirical Distribution
What if the input x is missing?
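Written out in generic notation (the slide's exact formula was lost in extraction, so this is a sketch of the standard form), minimum empirical risk training chooses

```latex
\hat{\Theta} \;=\; \arg\min_{\Theta} \;\sum_{i}
    \mathbb{E}_{p_{\Theta}(y \mid x_i)}\!\big[\, L(y,\, y_i^{*}) \,\big]
% (x_i, y_i^*): training pairs;  L: a loss such as negated BLEU.
% MERT-style training replaces the expectation with the loss of the
% decoder's one-best output.
```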
28
Unsupervised: Minimum Imputed Risk
• Minimum Empirical Risk Training
• Minimum Imputed Risk Training
  labels in the formula: loss, reverse model, imputed input, forward system
Speech recognition?
Round-trip translation
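One way to write the round-trip objective (a sketch in illustrative notation, not necessarily the exact form used in the talk): with only English references available, inputs are imputed from a reverse model and the forward system is trained to minimize the resulting expected loss.

```latex
\hat{\Theta} \;=\; \arg\min_{\Theta} \sum_{i}
    \mathbb{E}_{q(x \mid y_i^{*})}\!\Big[\,
        \mathbb{E}_{p_{\Theta}(y \mid x)}\big[ L(y,\, y_i^{*}) \big]
    \Big]
% y_i^*: observed English reference;  q: fixed reverse (English-to-foreign)
% model that imputes inputs x;  p_Theta: the forward system being trained.
```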
29
Training the Reverse Model
The forward and reverse models are parameterized and trained separately.
Our goal is to train a good forward system; the reverse model is held fixed while the forward system is trained.
30
Approximating
• exponentially many imputed inputs x, stored in a hypergraph
• Approximations
  - k-best
  - sampling
  - lattice: variational approximation + lattice decoding (Dyer et al., 2008)
CFG is not closed under composition! (SCFG composed with SCFG)
31
The Forward System
• Deterministic Decoding: use the one-best translation
  the objective is not differentiable ☹
• Randomized Decoding: use a distribution over translations
  the expected-loss objective is differentiable ☺
33
Experiments
• Supervised Training: requires bitext
• Unsupervised Training: requires monolingual English
• Semi-supervised Training: interpolation of supervised and unsupervised
34
Semi-supervised Training
Adding unsupervised data helps!
40K sent. pairs
551 features
36
Supervised vs. Unsupervised
Unsupervised training performs as well as (and often better than) the supervised one!
Unsupervised uses 16 times as much data as supervised. For example:

          Chinese     English
  Sup     100         16*100
  Unsup   16*100      16*16*100

But, fair comparison!
• More experiments
  • different k-best sizes
  • different reverse models
41
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
‣ minimum imputed risk
‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
42
Language Modeling
• Language Model
  • assigns a probability to an English sentence y
  • typically an n-gram model (locally normalized)
• Global Log-linear Model (whole-sentence maximum-entropy LM) (Rosenfeld et al., 2001)
  • features: the set of n-grams occurring in y
  • globally normalized over all English sentences of any length!
  • sampling ☹ slow
43
Contrastive Estimation
• Global Log-linear Model (whole-sentence maximum-entropy LM) (Rosenfeld et al., 2001)
• Contrastive Estimation (CE) (Smith and Eisner, 2005)
  • normalizes over a neighborhood or contrastive set: a set of alternate English sentences of y, produced by a neighborhood function
  • improves both speed and accuracy
  • originally not proposed for language modeling
  • train to recover the original English as much as possible (under a loss)
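In generic notation (a sketch after Smith and Eisner, 2005, not necessarily the exact form on the slide), CE maximizes the conditional likelihood of each observed sentence against its own neighborhood:

```latex
\hat{\Theta} \;=\; \arg\max_{\Theta} \prod_{i}
    \frac{u_{\Theta}(y_i^{*})}
         {\sum_{y \in \mathcal{N}(y_i^{*})} u_{\Theta}(y)},
\qquad u_{\Theta}(y) = \exp\!\big(\Theta \cdot f(y)\big)
% N(y_i^*): the contrastive set (neighborhood) of the observed sentence;
% normalizing over N instead of over all English sentences is what makes
% the objective tractable.
```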
44
Contrastive Language Model Estimation
• Step-1: extract a confusion grammar (CG)
  • an English-to-English SCFG
  • acts as the neighborhood function (paraphrase, insertion, re-ordering)
• Step-2: for each English sentence, generate a contrastive set (or neighborhood) using the CG
• Step-3: discriminative training
45
Step-1: Extracting a Confusion Grammar (CG)
• Deriving a CG from a bilingual grammar (a pivoting sketch follows below)
  • use the Chinese side as a pivot
  • Bilingual Rule → Confusion Rule
Our neighborhood function is learned and MT-specific.
The CG captures the confusion an MT system will have when translating an input.
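A minimal sketch of the pivoting idea: any two English sides that share a Chinese side become an English-to-English confusion rule. The toy rules and the flat-string representation are illustrative only; real confusion rules are SCFG rules with nonterminals.

```python
from collections import defaultdict
from itertools import permutations

# Toy bilingual rules: (Chinese side, English side).
bilingual_rules = [
    ("X de Y", "Y on X"),
    ("X de Y", "Y of X"),
    ("X de Y", "X 's Y"),
]

# Group English sides by their Chinese pivot.
by_source = defaultdict(set)
for src, tgt in bilingual_rules:
    by_source[src].add(tgt)

# Every ordered pair of English sides sharing a pivot yields a confusion rule.
confusion_rules = set()
for src, targets in by_source.items():
    for a, b in permutations(sorted(targets), 2):
        confusion_rules.add((a, b))

for a, b in sorted(confusion_rules):
    print(f"{a}  ->  {b}")
```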
46
Step-2: Generating Contrastive Sets
a cat on the mat → CG →
Contrastive set: the cat the mat,  the cat 's the mat,  the mat on the cat,  the mat of the cat
Translating "dianzi shang de mao"?
47
Step-3: Discriminative Training
• Training Objective
  • CE maximizes the conditional likelihood of the reference within its contrastive set
  • alternatively, minimize the expected loss over the contrastive set
• Iterative Training
  • Step-2: for each English sentence, generate a contrastive set (or neighborhood) using the CG
  • Step-3: discriminative training
48
Applying the Contrastive Model
• We can use the contrastive model as a regular language model
• We can incorporate the contrastive model into an end-to-end MT system as a feature
• We may also use the contrastive model to generate paraphrase sentences (if the loss function measures semantic similarity)
• the rules in CG are symmetric
50
Test on Synthesized Hypergraphs of English Data
[Pipeline: monolingual English → parsing with the confusion grammar → hypergraph (neighborhood) → training the contrastive LM → rank an English sentence's hypergraph → one-best English → BLEU score?]
51
Results on Synthesized Hypergraphs
Features: baseline LM (5-gram), word penalty, target side of a confusion rule, rule bigram features
All the features look only at the target sides of confusion rules.
The contrastive LM recovers the original English better than a regular n-gram LM.
52
Results on MT Test Set
Add CLM as a feature
The contrastive LM helps to improve MT performance.
53
Adding Features on the CG Itself
• On the English set
• On the MT set
Paraphrasing model
glue rules or regular confusion rules?
one big feature
61
Summary for Discriminative Training
• Supervised: Minimum Empirical Risk (requires bitext)
• Unsupervised: Minimum Imputed Risk (requires monolingual English)
• Unsupervised: Contrastive LM Estimation (requires monolingual English)
62
Summary for Discriminative Training
• Supervised Training (requires bitext; can have both TM and LM features)
• Unsupervised: Minimum Imputed Risk (requires monolingual English and a reverse model; can have both TM and LM features)
• Unsupervised: Contrastive LM Estimation (requires monolingual English; can have LM features only)
63
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
‣ minimum imputed risk
‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
64
Variational Decoding
• We want to do inference under p, but it is intractable (exact MAP decoding is NP-hard; Sima'an, 1996)
• Instead, we derive a simpler distribution q*
• Then, we use q* as a surrogate for p in inference
[Figure: p(y|x) in family P (intractable MAP decoding); q*(y) in family Q (tractable estimation, tractable decoding)]
65
Variational Decoding for MT: an Overview
Sentence-specific decoding: foreign sentence x → SMT system
MAP decoding under p is intractable: p(y | x) = ∑_{d ∈ D(x,y)} p(d | x)
Three steps:
1. Generate a hypergraph for the foreign sentence, defining p(d | x)
66
Three steps:
1. Generate a hypergraph, defining p(d | x)
2. Estimate a model q*(y | x) from the hypergraph by minimizing KL; q* is an n-gram model over output strings, with q*(y | x) ≈ ∑_{d ∈ D(x,y)} p(d | x)
3. Decode using q* on the hypergraph
Approximate a hypergraph with a lattice!
67
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
‣ minimum imputed risk
‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
68
Probabilistic Hypergraph
• "Decoding" quantities: Viterbi, k-best, counting, ...
• First-order expectations: expectation, entropy, expected loss, cross-entropy, KL divergence, feature expectations, first-order gradient of Z
• Second-order expectations: expectation of a product, interaction between features, Hessian matrix of Z, second-order gradient descent, gradient of an expectation, gradient of expected loss or entropy
A semiring framework computes all of these.
Recipe to compute a quantity:
• choose a semiring
• specify a semiring weight for each hyperedge
• run the inside algorithm
69
Applications of Expectation Semirings: a Summary
70
Inference, Training and Decoding on Hypergraphs
• Unsupervised Discriminative Training (In Preparation)
  ‣ minimum imputed risk
  ‣ contrastive language model estimation
• Variational Decoding (Li et al., ACL 2009)
• First- and Second-order Expectation Semirings (Li and Eisner, EMNLP 2009)
71
My Other MT Research
• Training methods (supervised)
  • Discriminative forest reranking with Perceptron (Li and Khudanpur, GALE book chapter 2009)
  • Discriminative n-gram language models (Li and Khudanpur, AMTA 2008)
• Algorithms
  • Oracle extraction from hypergraphs (Li and Khudanpur, NAACL 2009)
  • Efficient intersection between an n-gram LM and a CFG (Li and Khudanpur, ACL SSST 2008)
• Others
  • System combination (Smith et al., GALE book chapter 2009)
  • Unsupervised translation induction for Chinese abbreviations (Li and Yarowsky, ACL 2008)
72
Research other than MT
• Information extraction
• Relation extraction between formal and informal phrases (Li and Yarowsky, EMNLP 2008)
• Spoken dialog management
• Optimal dialog in consumer-rating systems using a POMDP (Li et al., SIGDial 2008)
73
Joshua Project
• An open-source parsing-based MT toolkit (Li et al., 2009)
  • supports Hiero (Chiang, 2007) and SAMT (Venugopal et al., 2007)
• Team members
  • Zhifei Li, Chris Callison-Burch, Chris Dyer, Sanjeev Khudanpur, Wren Thornton, Jonathan Weese, Juri Ganitkevitch, Lane Schwartz, and Omar Zaidan
All the methods presented have been implemented in Joshua!
The only external dependencies are a word aligner and the SRI LM toolkit!
74
Thank you! Xie xie! (谢谢!)
75
76
77
78
Decoding over a Hypergraph
Given a hypergraph of possible translations (generated for a given foreign sentence by an already-trained model), pick a single translation to output.
(Why not just pick the tree with the highest weight?)
79
Spurious Ambiguity
• Statistical models in MT exhibit spurious ambiguity
  • many different derivations (e.g., trees or segmentations) generate the same translation string
• Tree-based MT systems: derivation tree ambiguity
• Regular phrase-based MT systems: phrase segmentation ambiguity
80
Spurious Ambiguity in Derivation Trees
机器 翻译 软件  (jiqi fanyi ruanjian)
• Three different derivation trees, built from rules such as
  S -> (机器, machine), S -> (翻译, translation), S -> (软件, software)
  S -> (S0 S1, S0 S1)
  S -> (S0 翻译 S1, S0 translation S1)
• Same output: "machine translation software"
Another translation: machine transfer software
81
MAP, Viterbi, and N-best Approximations
• Viterbi approximation
• N-best approximation (crunching) (May and Knight, 2006)
• Exact MAP decoding: NP-hard (Sima'an, 1996)
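As a sketch of the three decision rules (generic notation; D(x, y) denotes the derivations of x whose yield is the string y, and Y(d) the string yielded by d):

```latex
\hat{y}_{\mathrm{MAP}}     \;=\; \arg\max_{y} \sum_{d \in D(x,y)} p(d \mid x) \\
\hat{y}_{\mathrm{Viterbi}} \;=\; Y\big( \arg\max_{d}\; p(d \mid x) \big)      \\
\hat{y}_{\mathrm{crunch}}  \;=\; \arg\max_{y} \sum_{d \in \mathrm{NBest}(x)\,\cap\, D(x,y)} p(d \mid x)
```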
82
MAP vs. Approximations
Eight derivations with probabilities 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10, yielding three translation strings:

  translation string   Viterbi   4-best crunching   MAP
  red translation      0.16      0.16               0.28
  blue translation     0.14      0.28               0.28
  green translation    0.13      0.13               0.44
• Our goal: develop an approximation that considers all the derivations but still allows tractable decoding
• Viterbi and crunching are efficient, but ignore most derivations
• Exact MAP decoding under spurious ambiguity is intractable on HG
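On a real hypergraph exact MAP needs a sum over exponentially many derivations, but on this slide's eight-derivation toy example the three decision rules can be computed directly (the color names stand for the three translation strings, matching the table above):

```python
from collections import defaultdict

# (translation string, derivation probability) for the eight toy derivations.
derivations = [
    ("red", 0.16), ("blue", 0.14), ("blue", 0.14), ("green", 0.13),
    ("red", 0.12), ("green", 0.11), ("green", 0.10), ("green", 0.10),
]

# Viterbi: the string of the single highest-probability derivation.
viterbi = max(derivations, key=lambda d: d[1])[0]             # 'red'  (0.16)

# 4-best crunching: sum probabilities within the top-4 derivations only.
top4 = defaultdict(float)
for y, p in sorted(derivations, key=lambda d: -d[1])[:4]:
    top4[y] += p
crunch = max(top4, key=top4.get)                              # 'blue' (0.28)

# Exact MAP: sum over all derivations of each string.
total = defaultdict(float)
for y, p in derivations:
    total[y] += p
map_best = max(total, key=total.get)                          # 'green' (0.44)

print(viterbi, crunch, map_best)
```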
83
Variational Decoding
Decoding using a variational approximation: decoding using a sentence-specific approximate distribution.
84
Variational Decoding for MT: an Overview
Sentence-specific decoding: foreign sentence x → SMT system
MAP decoding under p(y | x) is intractable
Three steps:
1. Generate a hypergraph for the foreign sentence, defining p(y, d | x)
85
Three steps:
1. Generate a hypergraph, defining p(d | x)
2. Estimate a model q*(y | x) from the hypergraph, with q*(y | x) ≈ ∑_{d ∈ D(x,y)} p(d | x); q* is an n-gram model over output strings
3. Decode using q* on the hypergraph
86
Variational Inference
• We want to do inference under p, but it is intractable
• Instead, we derive a simpler distribution q*
• Then, we use q* as a surrogate for p in inference
[Figure: distribution p in family P; q* in the tractable family Q]
87
Variational Approximation
• q*: the member of a family of distributions having minimum distance to p (in minimizing KL(p||q), the term that depends only on p is constant)
• Three questions
  • how to parameterize q?  as an n-gram model
  • how to estimate q*?  compute expected n-gram counts and normalize
  • how to use q* for decoding?  score the hypergraph with the n-gram model
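A minimal sketch of the estimation step for a bigram q*. In the real system p(y|x) is only implicit in the hypergraph and the expected counts are computed by dynamic programming; here the posterior is simply enumerated, and the toy strings and probabilities are made up for illustration.

```python
from collections import defaultdict

# Toy posterior over translation strings.
posterior = {
    ("a", "cat", "on", "the", "mat"): 0.5,
    ("a", "cat", "of", "the", "mat"): 0.3,
    ("the", "mat", "'s", "a", "cat"): 0.2,
}

# Expected bigram counts and expected context counts under p.
bigram, context = defaultdict(float), defaultdict(float)
for y, p in posterior.items():
    for u, v in zip(("<s>",) + y, y + ("</s>",)):
        bigram[(u, v)] += p
        context[u] += p

# q*(v | u) = expected count of (u, v) / expected count of u.
def q(v, u):
    return bigram[(u, v)] / context[u] if context[u] else 0.0

print(q("on", "cat"))   # 0.5: half the posterior mass has "cat" followed by "on"
```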
88
KL divergences under different variational models
•Larger n ==> better approximation q_n ==> smaller KL divergence from p
•The reduction of KL divergence happens mostly when switching from unigram to bigram
89
BLEU Results on Chinese-English NIST MT 2004 Tasks
• variational decoding (new!) improves over Viterbi, MBR (Kumar and Byrne, 2004), and crunching (May and Knight, 2006)
90
Variational Inference
• We want to do inference under p, but it is intractable
• Instead, we derive a simpler distribution q*
• Then, we use q* as a surrogate for p in inference
[Figure: p in family P (intractable); q* in family Q (tractable estimation, tractable decoding)]
91
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
‣ minimum imputed risk
‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
92
Probabilistic Hypergraph
• "Decoding" quantities: Viterbi, k-best, counting, ...
• First-order quantities: expectation, entropy, Bayes risk, cross-entropy, KL divergence, feature expectations, first-order gradient of Z
• Second-order quantities: expectation of a product, interaction between features, Hessian matrix of Z, second-order gradient descent, gradient of an expectation, gradient of entropy or Bayes risk
A semiring framework computes all of these.
93
Compute Quantities on a Hypergraph: a Recipe
• Three steps (a semiring-weighted inside algorithm):
  ‣ choose a semiring: a set with plus and times operations, e.g., the integers with regular + and ×
  ‣ specify a weight for each hyperedge: each weight is a semiring member
  ‣ run the inside algorithm: complexity is O(hypergraph size)
94
Semirings
• "Decoding"-time semirings (Goodman, 1999)
  • counting, Viterbi, k-best, etc.
• "Training"-time semirings
  • first-order expectation semirings (Eisner, 2002)
  • second-order expectation semirings (new)
• Applications of the semirings (new)
  • entropy, risk, their gradients, and many more
95
dianzi0 shang1 de2 mao3
How many trees?  four ☺
Can we compute it using a semiring?
96
Compute the Number of Derivation Trees
Three steps:
‣ choose a semiring: the counting semiring, ordinary integers with regular + and ×
‣ specify a weight for each hyperedge: k_e = 1 for every edge
‣ run the inside algorithm
97
Bottom-up process for computing the number of trees (every edge weight k_e = 1):
k(v1) = k(e1) = 1
k(v2) = k(e2) = 1
k(v3) = k(e3)·k(v1)·k(v2) + k(e4)·k(v1)·k(v2) = 2
k(v4) = k(e5)·k(v1)·k(v2) + k(e6)·k(v1)·k(v2) = 2
k(v5) = k(e7)·k(v3) + k(e8)·k(v4) = 4
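The recipe generalizes to any semiring. Below is a minimal sketch of the generic inside algorithm over the illustrative Node/Hyperedge classes from earlier (not Joshua's actual code); instantiated with the counting semiring and unit edge weights it reproduces the counts above.

```python
def inside(nodes, plus, times, one, edge_weight):
    """Generic semiring inside algorithm over a hypergraph.

    `nodes` must be in bottom-up (topological) order; `edge_weight(e)` maps
    each hyperedge to a semiring value.  Runs in time linear in the
    hypergraph size.
    """
    value = {}
    for node in nodes:
        total = None
        for edge in node.incoming:
            v = edge_weight(edge)
            for tail in edge.tails:
                v = times(v, value[tail.label])
            total = v if total is None else plus(total, v)
        value[node.label] = one if total is None else total
    return value

# Counting semiring: ordinary integers with + and *, every edge weighted 1.
# counts = inside(nodes,
#                 plus=lambda a, b: a + b,
#                 times=lambda a, b: a * b,
#                 one=1,
#                 edge_weight=lambda e: 1)
```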
98
Four translations with lengths and probabilities:
  the mat a cat       length 4   p = 2/8
  a cat on the mat    length 5   p = 1/8
  a cat of the mat    length 5   p = 3/8
  the mat 's a cat    length 5   p = 2/8
Expected translation length?  2/8 × 4 + 6/8 × 5 = 4.75
Variance?  2/8 × (4 - 4.75)^2 + 6/8 × (5 - 4.75)^2 ≈ 0.19
99
First- and Second-order Expectation Semirings
• First-order (Eisner, 2002): each member is a 2-tuple
• Second-order: each member is a 4-tuple
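A sketch of the first-order operations on pairs ⟨p, r⟩ (following Eisner, 2002 and Li and Eisner, 2009): p carries probability mass and r carries the unnormalized expectation being accumulated.

```python
# First-order expectation semiring over pairs <p, r>.
def eplus(a, b):
    (p1, r1), (p2, r2) = a, b
    return (p1 + p2, r1 + r2)

def etimes(a, b):
    (p1, r1), (p2, r2) = a, b
    return (p1 * p2, p1 * r2 + p2 * r1)

# For expected translation length, weight each edge by <p_e, p_e * (number of
# target words the edge adds)>.  Running the inside algorithm with eplus and
# etimes yields <Z, r_bar> at the root, and the expectation is r_bar / Z
# (4.75 on the toy example above).
```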
100
First-order: each semiring member is a 2-tuple (fake numbers)
Inside recursion:
k(v1) = k(e1),  k(v2) = k(e2)
k(v3) = k(e3) ⊗ k(v1) ⊗ k(v2) ⊕ k(e4) ⊗ k(v1) ⊗ k(v2)
k(v4) = k(e5) ⊗ k(v1) ⊗ k(v2) ⊕ k(e6) ⊗ k(v1) ⊗ k(v2)
k(v5) = k(e7) ⊗ k(v3) ⊕ k(e8) ⊗ k(v4)
Example edge weights: 〈1, 2〉, 〈1, 2〉, 〈2, 0〉, 〈3, 3〉, 〈1, 1〉, 〈2, 2〉, 〈1, 0〉, 〈1, 0〉
Resulting node values: k(v1) = 〈1, 2〉, k(v2) = 〈1, 2〉, k(v3) = 〈0, 1〉, k(v4) = 〈1, 2〉, k(v5) = 〈8, 4.75〉
101
Second-order: each semiring member is a 4-tuple (fake numbers)
Inside recursion:
k(v1) = k(e1),  k(v2) = k(e2)
k(v3) = k(e3) ⊗ k(v1) ⊗ k(v2) ⊕ k(e4) ⊗ k(v1) ⊗ k(v2)
k(v4) = k(e5) ⊗ k(v1) ⊗ k(v2) ⊕ k(e6) ⊗ k(v1) ⊗ k(v2)
k(v5) = k(e7) ⊗ k(v3) ⊕ k(e8) ⊗ k(v4)
Example edge weights: 〈1,2,2,4〉, 〈1,2,2,4〉, 〈2,0,0,0〉, 〈3,3,3,3〉, 〈1,1,1,1〉, 〈2,2,2,2〉, 〈1,0,0,0〉, 〈1,0,0,0〉
Resulting node values: k(v1) = 〈1,2,2,4〉, k(v2) = 〈1,2,2,4〉, k(v3) = 〈1,1,1,1〉, k(v4) = 〈1,2,1,3〉, k(v5) = 〈8,4.5,4.5,5〉
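For reference, a sketch of the second-order operations on 4-tuples ⟨p, r, s, t⟩ (following Li and Eisner, 2009); the fourth component accumulates the cross term needed for expectations of products:

```latex
\langle p_1, r_1, s_1, t_1\rangle \oplus \langle p_2, r_2, s_2, t_2\rangle
  \;=\; \langle p_1{+}p_2,\; r_1{+}r_2,\; s_1{+}s_2,\; t_1{+}t_2\rangle \\
\langle p_1, r_1, s_1, t_1\rangle \otimes \langle p_2, r_2, s_2, t_2\rangle
  \;=\; \langle p_1 p_2,\; p_1 r_2 + p_2 r_1,\; p_1 s_2 + p_2 s_1,\;
               p_1 t_2 + p_2 t_1 + r_1 s_2 + r_2 s_1\rangle
```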
102
Expectations on Hypergraphs
• r(d) is a function over a derivation d, e.g., the length of the translation yielded by d
• Expectation over a hypergraph: a sum over exponentially many derivations
• r(d) is additively decomposed over hyperedges, e.g., translation length is additively decomposed!
103
Second-order Expectations on Hypergraphs
• Expectation of a product r(d)·s(d) over a hypergraph: again a sum over exponentially many derivations
• r and s are additively decomposed; they can be identical or different functions.
104
Entropy
Why can we use an expectation semiring? Entropy is an expectation.
Compute the expectation using the expectation semiring: p_e is the transition probability or log-linear score at edge e; what is r_e?
log p(d) is additively decomposed!
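Concretely, a sketch of how the pieces fit: with edge values k_e = ⟨p_e, p_e log p_e⟩ the inside algorithm returns ⟨Z, r̄⟩ over the unnormalized derivation weights, and the entropy follows as

```latex
H(p) \;=\; -\sum_{d} p(d)\,\log p(d)
     \;=\; \log Z \;-\; \frac{\bar{r}}{Z},
\qquad \bar{r} \;=\; \sum_{d} \bar{p}(d)\,\log \bar{p}(d)
% pbar(d): the unnormalized weight of derivation d, so p(d) = pbar(d) / Z.
```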
105
Cross-entropy
Why? Cross-entropy is an expectation too.
Compute the expectation using the expectation semiring: p_e is the transition probability or log-linear score at edge e; what is r_e?
log q(d) is additively decomposed!
106
Bayes risk
Why? Bayes risk is an expectation too.
Compute the expectation using the expectation semiring: p_e is the transition probability or log-linear score at edge e; what is r_e?
L(Y(d)) is additively decomposed! (Tromble et al., 2008)
107
Applications of Expectation Semirings: a Summary
108
Probabilistic Hypergraph
• "Decoding" quantities: Viterbi, k-best, counting, ...
• First-order quantities: expectation, entropy, Bayes risk, cross-entropy, KL divergence, feature expectations, first-order gradient of Z
• Second-order quantities: expectation of a product, interaction between features, Hessian matrix of Z, second-order gradient descent, gradient of an expectation, gradient of entropy or Bayes risk
A semiring framework computes all of these.