TRANSCRIPT
1
Novel Inference, Training and Decoding Methods over Translation Forests
Zhifei Li
Center for Language and Speech Processing
Computer Science Department
Johns Hopkins University
Advisor: Sanjeev Khudanpur
Co-advisor: Jason Eisner
2
Statistical Machine Translation Pipeline
[Pipeline diagram] Generative training on bilingual data produces the translation models; generative training on monolingual English data produces the language models; discriminative training on held-out bilingual data produces the optimal weights; decoding applies the models and weights to unseen sentences to produce translation outputs.
3
Training a Translation Model
垫子 上 的 猫
dianzi shang de mao
a cat on the mat
dianzi shang → the mat
4
Training a Translation Model
垫子 上 的 猫
dianzi shang de mao
a cat on the mat
dianzi shang → the mat,  mao → a cat
5
Training a Translation Model
垫子 上 的 猫
dianzi shang de
on the mat
dianzi shang de → on the mat,  dianzi shang → the mat,  mao → a cat
6
Training a Translation Model
垫子 上 的 猫
de mao
a cat on
de mao → a cat on,  dianzi shang de → on the mat,  dianzi shang → the mat,  mao → a cat
7
Training a Translation Model
垫子 上 的 猫
de
on
de → on,  de mao → a cat on,  dianzi shang de → on the mat,  dianzi shang → the mat,  mao → a cat
8
Decoding a Test Sentence
垫子 上 的 狗
dianzi shang de gou
the dog on the mat
Translation is easy?
Derivation Tree
9
Translation Ambiguity
垫子 上 的 猫
dianzi shang de mao
a cat on the mat
zhongguo de shoudu → capital of China
wo de mao → my cat
zhifei de mao → zhifei 's cat
10
dianzi shang de mao
Joshua (chart parser)
11
dianzi shang de mao
Joshua (chart parser)
Candidate translations: the mat a cat,  a cat on the mat,  a cat of the mat,  the mat 's a cat
12
dianzi shang de mao
Joshua (chart parser)
hypergraph
13
A hypergraph is a compact data structure to encode exponentially many trees.
[Figure labels: hyperedge vs. edge, node, FSA]
Packed Forest
14
A hypergraph is a compact data structure to encode exponentially many trees.
15
16
17
Structure sharing
18
Why Hypergraphs?
• Contains a much larger hypothesis space than a k-best list
• General compact data structure (a minimal data-structure sketch follows below)
  • special cases include finite state machines (e.g., lattices), and/or graphs, and packed forests
• Can be used for speech, parsing, tree-based MT systems, and many more
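To make the data structure concrete, here is a minimal sketch of a packed forest as nodes and hyperedges; the class names and fields are illustrative only, not Joshua's actual API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Hyperedge:
    head: "Node"          # the node this hyperedge derives
    tails: List["Node"]   # antecedent nodes (empty for leaf/axiom edges)
    rule: str             # e.g. an SCFG rule such as "X -> (dianzi shang, the mat)"
    weight: float = 1.0   # model score of applying the rule

@dataclass
class Node:
    label: str                                                # e.g. "X[0,2]" (category plus source span)
    incoming: List[Hyperedge] = field(default_factory=list)   # alternative ways to build this node

# A derivation tree is obtained by choosing one incoming hyperedge per node,
# recursively.  Because subtrees are shared, exponentially many derivations
# fit in space linear in the number of hyperedges.
```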
19
Weighted Hypergraph
Linear model: score(x, d) = Θ · f(x, d)   (weights Θ, features f, derivation d, foreign input x)
Derivation weights in the figure: p=2, p=1, p=3, p=2
20
Probabilistic Hypergraph
Log-linear model: p(d | x) = exp(Θ · f(x, d)) / Z(x)
Z = 2 + 1 + 3 + 2 = 8
Derivation probabilities in the figure: p=2/8, p=1/8, p=3/8, p=2/8
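A minimal sketch of this normalization on the slide's toy weights; the derivation names d1..d4 are made up for illustration.

```python
# Toy example matching the slide: four derivations with weights 2, 1, 3, 2.
# Under the log-linear model each weight would be exp(theta . f(x, d)); here
# we start directly from the already-exponentiated weights.
weights = {"d1": 2.0, "d2": 1.0, "d3": 3.0, "d4": 2.0}

Z = sum(weights.values())                     # Z = 2 + 1 + 3 + 2 = 8
p = {d: w / Z for d, w in weights.items()}    # p(d|x) = weight(d) / Z
print(p)   # {'d1': 0.25, 'd2': 0.125, 'd3': 0.375, 'd4': 0.25}
```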
21
Probabilistic Hypergraph
The hypergraph defines a probability distribution over trees!
The distribution is parameterized by Θ.
Derivation probabilities in the figure: p=2/8, p=1/8, p=3/8, p=2/8
22
Probabilistic Hypergraph
The hypergraph defines a probability distribution over trees; the distribution is parameterized by Θ.
Training: how do we set the parameters Θ?
Decoding: which translation do we present to a user?
Atomic Inference: what atomic operations do we need to perform?
Why are these problems difficult?
• brute force is too slow, since there are exponentially many trees, so we need sophisticated dynamic programs
• some problems are intractable and require approximations
23
Inference, Training and Decoding on Hypergraphs
• Atomic Inference (a one-best sketch follows below)
  • finding one-best derivations
  • finding k-best derivations
  • computing expectations (e.g., of features)
• Training
  • Perceptron, conditional random field (CRF), minimum error rate training (MERT), minimum risk, and MIRA
• Decoding
  • Viterbi decoding, maximum a posteriori (MAP) decoding, and minimum Bayes risk (MBR) decoding
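As an illustration of the first atomic operation, here is a sketch of Viterbi one-best search over the illustrative Node/Hyperedge classes sketched earlier; it assumes the nodes are visited bottom-up (tails before heads) and that derivation scores multiply along a tree, as in the weighted-hypergraph slides.

```python
def viterbi(nodes):
    """Return, for every node, the score of its best derivation and the
    hyperedge that achieves it (for reading off the one-best tree).

    `nodes` must be in bottom-up (topological) order, so every tail node is
    scored before any head node that uses it.
    """
    best = {}
    back = {}
    for node in nodes:
        if not node.incoming:            # leaf / axiom node
            best[node.label] = 1.0
            continue
        for edge in node.incoming:
            score = edge.weight
            for tail in edge.tails:
                score *= best[tail.label]
            if node.label not in best or score > best[node.label]:
                best[node.label] = score
                back[node.label] = edge
    return best, back
```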
24
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
‣ minimum imputed risk
‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
main focus
25
Training Setup
• Each training example consists of
  • a foreign sentence (from which a hypergraph is generated to represent many possible translations)
  • a reference translation
  x: dianzi shang de mao
  y: a cat on the mat
• Training
  • adjust the parameters Θ so that the reference translation is preferred by the model
27
Supervised: Minimum Empirical Risk
• Minimum Empirical Risk Training
  (x, y) pairs come from the empirical distribution; x is fed to the MT decoder and the MT output is scored against y under a loss such as negated BLEU
  related methods: MERT, CRF, Perceptron
• Uniform Empirical Distribution
What if the input x is missing?
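Written out in generic notation (the slide's exact formula was lost in extraction, so this is a sketch of the standard form), minimum empirical risk training chooses

```latex
\hat{\Theta} \;=\; \arg\min_{\Theta} \;\sum_{i}
    \mathbb{E}_{p_{\Theta}(y \mid x_i)}\!\big[\, L(y,\, y_i^{*}) \,\big]
% (x_i, y_i^*): training pairs;  L: a loss such as negated BLEU.
% MERT-style training replaces the expectation with the loss of the
% decoder's one-best output.
```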
28
Unsupervised: Minimum Imputed Risk
• Minimum Empirical Risk Training
• Minimum Imputed Risk Training
  labels in the formula: loss, reverse model, imputed input, forward system
Speech recognition?
Round-trip translation
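One way to write the round-trip objective (a sketch in illustrative notation, not necessarily the exact form used in the talk): with only English references available, inputs are imputed from a reverse model and the forward system is trained to minimize the resulting expected loss.

```latex
\hat{\Theta} \;=\; \arg\min_{\Theta} \sum_{i}
    \mathbb{E}_{q(x \mid y_i^{*})}\!\Big[\,
        \mathbb{E}_{p_{\Theta}(y \mid x)}\big[ L(y,\, y_i^{*}) \big]
    \Big]
% y_i^*: observed English reference;  q: fixed reverse (English-to-foreign)
% model that imputes inputs x;  p_Theta: the forward system being trained.
```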
29
Training the Reverse Model
The forward and reverse models are parameterized and trained separately.
Our goal is to train a good forward system; the reverse model is held fixed while the forward system is trained.
30
Approximating
• exponentially many imputed inputs x, stored in a hypergraph
• Approximations
  - k-best
  - sampling
  - lattice: variational approximation + lattice decoding (Dyer et al., 2008)
CFG is not closed under composition! (SCFG composed with SCFG)
31
The Forward System
• Deterministic Decoding: use the one-best translation
  the objective is not differentiable ☹
• Randomized Decoding: use a distribution over translations
  the expected-loss objective is differentiable ☺
33
Experiments
• Supervised Training: requires bitext
• Unsupervised Training: requires monolingual English
• Semi-supervised Training: interpolation of supervised and unsupervised
34
Semi-supervised Training
Adding unsupervised data helps!
40K sent. pairs
551 features
36
Supervised vs. Unsupervised
Unsupervised training performs as well as (and often better than) the supervised one!
Unsupervised uses 16 times as much data as supervised. For example:

          Chinese     English
  Sup     100         16*100
  Unsup   16*100      16*16*100

But, fair comparison!
• More experiments
  • different k-best sizes
  • different reverse models
41
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
‣ minimum imputed risk
‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
42
Language Modeling
• Language Model
  • assigns a probability to an English sentence y
  • typically an n-gram model (locally normalized)
• Global Log-linear Model (whole-sentence maximum-entropy LM) (Rosenfeld et al., 2001)
  • features: the set of n-grams occurring in y
  • globally normalized over all English sentences of any length!
  • sampling ☹ slow
43
Contrastive Estimation
• Global Log-linear Model (whole-sentence maximum-entropy LM) (Rosenfeld et al., 2001)
• Contrastive Estimation (CE) (Smith and Eisner, 2005)
  • normalizes over a neighborhood or contrastive set: a set of alternate English sentences of y, produced by a neighborhood function
  • improves both speed and accuracy
  • originally not proposed for language modeling
  • train to recover the original English as much as possible (under a loss)
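In generic notation (a sketch after Smith and Eisner, 2005, not necessarily the exact form on the slide), CE maximizes the conditional likelihood of each observed sentence against its own neighborhood:

```latex
\hat{\Theta} \;=\; \arg\max_{\Theta} \prod_{i}
    \frac{u_{\Theta}(y_i^{*})}
         {\sum_{y \in \mathcal{N}(y_i^{*})} u_{\Theta}(y)},
\qquad u_{\Theta}(y) = \exp\!\big(\Theta \cdot f(y)\big)
% N(y_i^*): the contrastive set (neighborhood) of the observed sentence;
% normalizing over N instead of over all English sentences is what makes
% the objective tractable.
```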
44
Contrastive Language Model Estimation
• Step-1: extract a confusion grammar (CG)
  • an English-to-English SCFG
  • acts as the neighborhood function (paraphrase, insertion, re-ordering)
• Step-2: for each English sentence, generate a contrastive set (or neighborhood) using the CG
• Step-3: discriminative training
45
Step-1: Extracting a Confusion Grammar (CG)
• Deriving a CG from a bilingual grammar (a pivoting sketch follows below)
  • use the Chinese side as a pivot
  • Bilingual Rule → Confusion Rule
Our neighborhood function is learned and MT-specific.
The CG captures the confusion an MT system will have when translating an input.
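A minimal sketch of the pivoting idea: any two English sides that share a Chinese side become an English-to-English confusion rule. The toy rules and the flat-string representation are illustrative only; real confusion rules are SCFG rules with nonterminals.

```python
from collections import defaultdict
from itertools import permutations

# Toy bilingual rules: (Chinese side, English side).
bilingual_rules = [
    ("X de Y", "Y on X"),
    ("X de Y", "Y of X"),
    ("X de Y", "X 's Y"),
]

# Group English sides by their Chinese pivot.
by_source = defaultdict(set)
for src, tgt in bilingual_rules:
    by_source[src].add(tgt)

# Every ordered pair of English sides sharing a pivot yields a confusion rule.
confusion_rules = set()
for src, targets in by_source.items():
    for a, b in permutations(sorted(targets), 2):
        confusion_rules.add((a, b))

for a, b in sorted(confusion_rules):
    print(f"{a}  ->  {b}")
```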
46
Step-2: Generating Contrastive Sets
a cat on the mat → CG →
Contrastive set: the cat the mat,  the cat 's the mat,  the mat on the cat,  the mat of the cat
Translating "dianzi shang de mao"?
47
Step-3: Discriminative Training
• Training Objective
  • CE maximizes the conditional likelihood of the reference within its contrastive set
  • alternatively, minimize the expected loss over the contrastive set
• Iterative Training
  • Step-2: for each English sentence, generate a contrastive set (or neighborhood) using the CG
  • Step-3: discriminative training
48
Applying the Contrastive Model
• We can use the contrastive model as a regular language model
• We can incorporate the contrastive model into an end-to-end MT system as a feature
• We may also use the contrastive model to generate paraphrase sentences (if the loss function measures semantic similarity)
• the rules in CG are symmetric
50
Test on Synthesized Hypergraphs of English Data
[Pipeline: monolingual English → parsing with the confusion grammar → hypergraph (neighborhood) → training the contrastive LM → rank an English sentence's hypergraph → one-best English → BLEU score?]
51
Results on Synthesized Hypergraphs
Features: baseline LM (5-gram), word penalty, target side of a confusion rule, rule bigram features
All the features look only at the target sides of confusion rules.
The contrastive LM recovers the original English better than a regular n-gram LM.
52
Results on MT Test Set
Add CLM as a feature
The contrastive LM helps to improve MT performance.
53
Adding Features on the CG Itself
• On the English set
• On the MT set
Paraphrasing model
glue rules or regular confusion rules?
one big feature
61
Summary for Discriminative Training
• Supervised: Minimum Empirical Risk (requires bitext)
• Unsupervised: Minimum Imputed Risk (requires monolingual English)
• Unsupervised: Contrastive LM Estimation (requires monolingual English)
62
Summary for Discriminative Training
• Supervised Training (requires bitext; can have both TM and LM features)
• Unsupervised: Minimum Imputed Risk (requires monolingual English and a reverse model; can have both TM and LM features)
• Unsupervised: Contrastive LM Estimation (requires monolingual English; can have LM features only)
63
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
‣ minimum imputed risk
‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
64
Variational Decoding
• We want to do inference under p, but it is intractable (exact MAP decoding is NP-hard; Sima'an, 1996)
• Instead, we derive a simpler distribution q*
• Then, we use q* as a surrogate for p in inference
[Figure: p(y|x) in family P (intractable MAP decoding); q*(y) in family Q (tractable estimation, tractable decoding)]
65
Variational Decoding for MT: an Overview
Sentence-specific decoding: foreign sentence x → SMT system
MAP decoding under p is intractable: p(y | x) = ∑_{d ∈ D(x,y)} p(d | x)
Three steps:
1. Generate a hypergraph for the foreign sentence, defining p(d | x)
66
Three steps:
1. Generate a hypergraph, defining p(d | x)
2. Estimate a model q*(y | x) from the hypergraph by minimizing KL; q* is an n-gram model over output strings, with q*(y | x) ≈ ∑_{d ∈ D(x,y)} p(d | x)
3. Decode using q* on the hypergraph
Approximate a hypergraph with a lattice!
67
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
‣ minimum imputed risk
‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
68
Probabilistic Hypergraph
• "Decoding" quantities: Viterbi, k-best, counting, ...
• First-order expectations: expectation, entropy, expected loss, cross-entropy, KL divergence, feature expectations, first-order gradient of Z
• Second-order expectations: expectation of a product, interaction between features, Hessian matrix of Z, second-order gradient descent, gradient of an expectation, gradient of expected loss or entropy
A semiring framework computes all of these.
Recipe to compute a quantity:
• choose a semiring
• specify a semiring weight for each hyperedge
• run the inside algorithm
69
Applications of Expectation Semirings: a Summary
70
Inference, Training and Decoding on Hypergraphs
• Unsupervised Discriminative Training (In Preparation)
  ‣ minimum imputed risk
  ‣ contrastive language model estimation
• Variational Decoding (Li et al., ACL 2009)
• First- and Second-order Expectation Semirings (Li and Eisner, EMNLP 2009)
71
My Other MT Research
• Training methods (supervised)
  • Discriminative forest reranking with Perceptron (Li and Khudanpur, GALE book chapter 2009)
  • Discriminative n-gram language models (Li and Khudanpur, AMTA 2008)
• Algorithms
  • Oracle extraction from hypergraphs (Li and Khudanpur, NAACL 2009)
  • Efficient intersection between an n-gram LM and a CFG (Li and Khudanpur, ACL SSST 2008)
• Others
  • System combination (Smith et al., GALE book chapter 2009)
  • Unsupervised translation induction for Chinese abbreviations (Li and Yarowsky, ACL 2008)
72
Research other than MT
• Information extraction
• Relation extraction between formal and informal phrases (Li and Yarowsky, EMNLP 2008)
• Spoken dialog management
• Optimal dialog in consumer-rating systems using a POMDP (Li et al., SIGDial 2008)
73
Joshua Project
• An open-source parsing-based MT toolkit (Li et al., 2009)
  • supports Hiero (Chiang, 2007) and SAMT (Venugopal et al., 2007)
• Team members
  • Zhifei Li, Chris Callison-Burch, Chris Dyer, Sanjeev Khudanpur, Wren Thornton, Jonathan Weese, Juri Ganitkevitch, Lane Schwartz, and Omar Zaidan
All the methods presented have been implemented in Joshua!
The only external dependencies are a word aligner and the SRI LM toolkit!
74
Thank you! Xie xie! (谢谢!)
75
76
77
78
Decoding over a Hypergraph
Given a hypergraph of possible translations (generated for a given foreign sentence by an already-trained model), pick a single translation to output.
(Why not just pick the tree with the highest weight?)
79
Spurious Ambiguity
• Statistical models in MT exhibit spurious ambiguity
  • many different derivations (e.g., trees or segmentations) generate the same translation string
• Tree-based MT systems: derivation tree ambiguity
• Regular phrase-based MT systems: phrase segmentation ambiguity
80
Spurious Ambiguity in Derivation Trees
机器 翻译 软件  (jiqi fanyi ruanjian)
• Three different derivation trees, built from rules such as
  S -> (机器, machine), S -> (翻译, translation), S -> (软件, software)
  S -> (S0 S1, S0 S1)
  S -> (S0 翻译 S1, S0 translation S1)
• Same output: "machine translation software"
Another translation: machine transfer software
81
MAP, Viterbi, and N-best Approximations
• Viterbi approximation
• N-best approximation (crunching) (May and Knight, 2006)
• Exact MAP decoding: NP-hard (Sima'an, 1996)
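As a sketch of the three decision rules (generic notation; D(x, y) denotes the derivations of x whose yield is the string y, and Y(d) the string yielded by d):

```latex
\hat{y}_{\mathrm{MAP}}     \;=\; \arg\max_{y} \sum_{d \in D(x,y)} p(d \mid x) \\
\hat{y}_{\mathrm{Viterbi}} \;=\; Y\big( \arg\max_{d}\; p(d \mid x) \big)      \\
\hat{y}_{\mathrm{crunch}}  \;=\; \arg\max_{y} \sum_{d \in \mathrm{NBest}(x)\,\cap\, D(x,y)} p(d \mid x)
```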
82
MAP vs. Approximations
Eight derivations with probabilities 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10, yielding three translation strings:

  translation string   Viterbi   4-best crunching   MAP
  red translation      0.16      0.16               0.28
  blue translation     0.14      0.28               0.28
  green translation    0.13      0.13               0.44
• Our goal: develop an approximation that considers all the derivations but still allows tractable decoding
• Viterbi and crunching are efficient, but ignore most derivations
• Exact MAP decoding under spurious ambiguity is intractable on HG
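On a real hypergraph exact MAP needs a sum over exponentially many derivations, but on this slide's eight-derivation toy example the three decision rules can be computed directly (the color names stand for the three translation strings, matching the table above):

```python
from collections import defaultdict

# (translation string, derivation probability) for the eight toy derivations.
derivations = [
    ("red", 0.16), ("blue", 0.14), ("blue", 0.14), ("green", 0.13),
    ("red", 0.12), ("green", 0.11), ("green", 0.10), ("green", 0.10),
]

# Viterbi: the string of the single highest-probability derivation.
viterbi = max(derivations, key=lambda d: d[1])[0]             # 'red'  (0.16)

# 4-best crunching: sum probabilities within the top-4 derivations only.
top4 = defaultdict(float)
for y, p in sorted(derivations, key=lambda d: -d[1])[:4]:
    top4[y] += p
crunch = max(top4, key=top4.get)                              # 'blue' (0.28)

# Exact MAP: sum over all derivations of each string.
total = defaultdict(float)
for y, p in derivations:
    total[y] += p
map_best = max(total, key=total.get)                          # 'green' (0.44)

print(viterbi, crunch, map_best)
```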
83
Variational Decoding
Decoding using a variational approximation: decoding using a sentence-specific approximate distribution.
84
Variational Decoding for MT: an Overview
Sentence-specific decoding: foreign sentence x → SMT system
MAP decoding under p(y | x) is intractable
Three steps:
1. Generate a hypergraph for the foreign sentence, defining p(y, d | x)
85
Three steps:
1. Generate a hypergraph, defining p(d | x)
2. Estimate a model q*(y | x) from the hypergraph, with q*(y | x) ≈ ∑_{d ∈ D(x,y)} p(d | x); q* is an n-gram model over output strings
3. Decode using q* on the hypergraph
86
Variational Inference
• We want to do inference under p, but it is intractable
• Instead, we derive a simpler distribution q*
• Then, we use q* as a surrogate for p in inference
[Figure: distribution p in family P; q* in the tractable family Q]
87
Variational Approximation
• q*: the member of a family of distributions having minimum distance to p (in minimizing KL(p||q), the term that depends only on p is constant)
• Three questions
  • how to parameterize q?  as an n-gram model
  • how to estimate q*?  compute expected n-gram counts and normalize
  • how to use q* for decoding?  score the hypergraph with the n-gram model
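A minimal sketch of the estimation step for a bigram q*. In the real system p(y|x) is only implicit in the hypergraph and the expected counts are computed by dynamic programming; here the posterior is simply enumerated, and the toy strings and probabilities are made up for illustration.

```python
from collections import defaultdict

# Toy posterior over translation strings.
posterior = {
    ("a", "cat", "on", "the", "mat"): 0.5,
    ("a", "cat", "of", "the", "mat"): 0.3,
    ("the", "mat", "'s", "a", "cat"): 0.2,
}

# Expected bigram counts and expected context counts under p.
bigram, context = defaultdict(float), defaultdict(float)
for y, p in posterior.items():
    for u, v in zip(("<s>",) + y, y + ("</s>",)):
        bigram[(u, v)] += p
        context[u] += p

# q*(v | u) = expected count of (u, v) / expected count of u.
def q(v, u):
    return bigram[(u, v)] / context[u] if context[u] else 0.0

print(q("on", "cat"))   # 0.5: half the posterior mass has "cat" followed by "on"
```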
88
KL divergences under different variational models
•Larger n ==> better approximation q_n ==> smaller KL divergence from p
•The reduction of KL divergence happens mostly when switching from unigram to bigram
89
BLEU Results on Chinese-English NIST MT 2004 Tasks
• variational decoding (new!) improves over Viterbi, MBR (Kumar and Byrne, 2004), and crunching (May and Knight, 2006)
90
Variational Inference
• We want to do inference under p, but it is intractable
• Instead, we derive a simpler distribution q*
• Then, we use q* as a surrogate for p in inference
[Figure: p in family P (intractable); q* in family Q (tractable estimation, tractable decoding)]
91
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
‣ minimum imputed risk
‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
92
Probabilistic Hypergraph
• "Decoding" quantities: Viterbi, k-best, counting, ...
• First-order quantities: expectation, entropy, Bayes risk, cross-entropy, KL divergence, feature expectations, first-order gradient of Z
• Second-order quantities: expectation of a product, interaction between features, Hessian matrix of Z, second-order gradient descent, gradient of an expectation, gradient of entropy or Bayes risk
A semiring framework computes all of these.
93
Compute Quantities on a Hypergraph: a Recipe
• Three steps (a semiring-weighted inside algorithm):
  ‣ choose a semiring: a set with plus and times operations, e.g., the integers with regular + and ×
  ‣ specify a weight for each hyperedge: each weight is a semiring member
  ‣ run the inside algorithm: complexity is O(hypergraph size)
94
Semirings
• "Decoding"-time semirings (Goodman, 1999)
  • counting, Viterbi, k-best, etc.
• "Training"-time semirings
  • first-order expectation semirings (Eisner, 2002)
  • second-order expectation semirings (new)
• Applications of the semirings (new)
  • entropy, risk, their gradients, and many more
95
dianzi0 shang1 de2 mao3
How many trees?  four ☺
Can we compute it using a semiring?
96
Compute the Number of Derivation Trees
Three steps:
‣ choose a semiring: the counting semiring, ordinary integers with regular + and ×
‣ specify a weight for each hyperedge: k_e = 1 for every edge
‣ run the inside algorithm
97
Bottom-up process for computing the number of trees (every edge weight k_e = 1):
k(v1) = k(e1) = 1
k(v2) = k(e2) = 1
k(v3) = k(e3)·k(v1)·k(v2) + k(e4)·k(v1)·k(v2) = 2
k(v4) = k(e5)·k(v1)·k(v2) + k(e6)·k(v1)·k(v2) = 2
k(v5) = k(e7)·k(v3) + k(e8)·k(v4) = 4
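The recipe generalizes to any semiring. Below is a minimal sketch of the generic inside algorithm over the illustrative Node/Hyperedge classes from earlier (not Joshua's actual code); instantiated with the counting semiring and unit edge weights it reproduces the counts above.

```python
def inside(nodes, plus, times, one, edge_weight):
    """Generic semiring inside algorithm over a hypergraph.

    `nodes` must be in bottom-up (topological) order; `edge_weight(e)` maps
    each hyperedge to a semiring value.  Runs in time linear in the
    hypergraph size.
    """
    value = {}
    for node in nodes:
        total = None
        for edge in node.incoming:
            v = edge_weight(edge)
            for tail in edge.tails:
                v = times(v, value[tail.label])
            total = v if total is None else plus(total, v)
        value[node.label] = one if total is None else total
    return value

# Counting semiring: ordinary integers with + and *, every edge weighted 1.
# counts = inside(nodes,
#                 plus=lambda a, b: a + b,
#                 times=lambda a, b: a * b,
#                 one=1,
#                 edge_weight=lambda e: 1)
```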
98
Four translations with lengths and probabilities:
  the mat a cat       length 4   p = 2/8
  a cat on the mat    length 5   p = 1/8
  a cat of the mat    length 5   p = 3/8
  the mat 's a cat    length 5   p = 2/8
Expected translation length?  2/8 × 4 + 6/8 × 5 = 4.75
Variance?  2/8 × (4 - 4.75)^2 + 6/8 × (5 - 4.75)^2 ≈ 0.19
99
First- and Second-order Expectation Semirings
• First-order (Eisner, 2002): each member is a 2-tuple
• Second-order: each member is a 4-tuple
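A sketch of the first-order operations on pairs ⟨p, r⟩ (following Eisner, 2002 and Li and Eisner, 2009): p carries probability mass and r carries the unnormalized expectation being accumulated.

```python
# First-order expectation semiring over pairs <p, r>.
def eplus(a, b):
    (p1, r1), (p2, r2) = a, b
    return (p1 + p2, r1 + r2)

def etimes(a, b):
    (p1, r1), (p2, r2) = a, b
    return (p1 * p2, p1 * r2 + p2 * r1)

# For expected translation length, weight each edge by <p_e, p_e * (number of
# target words the edge adds)>.  Running the inside algorithm with eplus and
# etimes yields <Z, r_bar> at the root, and the expectation is r_bar / Z
# (4.75 on the toy example above).
```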
100
First-order: each semiring member is a 2-tuple (fake numbers)
Inside recursion:
k(v1) = k(e1),  k(v2) = k(e2)
k(v3) = k(e3) ⊗ k(v1) ⊗ k(v2) ⊕ k(e4) ⊗ k(v1) ⊗ k(v2)
k(v4) = k(e5) ⊗ k(v1) ⊗ k(v2) ⊕ k(e6) ⊗ k(v1) ⊗ k(v2)
k(v5) = k(e7) ⊗ k(v3) ⊕ k(e8) ⊗ k(v4)
Example edge weights: 〈1, 2〉, 〈1, 2〉, 〈2, 0〉, 〈3, 3〉, 〈1, 1〉, 〈2, 2〉, 〈1, 0〉, 〈1, 0〉
Resulting node values: k(v1) = 〈1, 2〉, k(v2) = 〈1, 2〉, k(v3) = 〈0, 1〉, k(v4) = 〈1, 2〉, k(v5) = 〈8, 4.75〉
101
Second-order: each semiring member is a 4-tuple (fake numbers)
Inside recursion:
k(v1) = k(e1),  k(v2) = k(e2)
k(v3) = k(e3) ⊗ k(v1) ⊗ k(v2) ⊕ k(e4) ⊗ k(v1) ⊗ k(v2)
k(v4) = k(e5) ⊗ k(v1) ⊗ k(v2) ⊕ k(e6) ⊗ k(v1) ⊗ k(v2)
k(v5) = k(e7) ⊗ k(v3) ⊕ k(e8) ⊗ k(v4)
Example edge weights: 〈1,2,2,4〉, 〈1,2,2,4〉, 〈2,0,0,0〉, 〈3,3,3,3〉, 〈1,1,1,1〉, 〈2,2,2,2〉, 〈1,0,0,0〉, 〈1,0,0,0〉
Resulting node values: k(v1) = 〈1,2,2,4〉, k(v2) = 〈1,2,2,4〉, k(v3) = 〈1,1,1,1〉, k(v4) = 〈1,2,1,3〉, k(v5) = 〈8,4.5,4.5,5〉
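For reference, a sketch of the second-order operations on 4-tuples ⟨p, r, s, t⟩ (following Li and Eisner, 2009); the fourth component accumulates the cross term needed for expectations of products:

```latex
\langle p_1, r_1, s_1, t_1\rangle \oplus \langle p_2, r_2, s_2, t_2\rangle
  \;=\; \langle p_1{+}p_2,\; r_1{+}r_2,\; s_1{+}s_2,\; t_1{+}t_2\rangle \\
\langle p_1, r_1, s_1, t_1\rangle \otimes \langle p_2, r_2, s_2, t_2\rangle
  \;=\; \langle p_1 p_2,\; p_1 r_2 + p_2 r_1,\; p_1 s_2 + p_2 s_1,\;
               p_1 t_2 + p_2 t_1 + r_1 s_2 + r_2 s_1\rangle
```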
102
Expectations on Hypergraphs
• r(d) is a function over a derivation d, e.g., the length of the translation yielded by d
• Expectation over a hypergraph: a sum over exponentially many derivations
• r(d) is additively decomposed over hyperedges, e.g., translation length is additively decomposed!
103
Second-order Expectations on Hypergraphs
• Expectation of a product r(d)·s(d) over a hypergraph: again a sum over exponentially many derivations
• r and s are additively decomposed; they can be identical or different functions.
104
Entropy
Why can we use an expectation semiring? Entropy is an expectation.
Compute the expectation using the expectation semiring: p_e is the transition probability or log-linear score at edge e; what is r_e?
log p(d) is additively decomposed!
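Concretely, a sketch of how the pieces fit: with edge values k_e = ⟨p_e, p_e log p_e⟩ the inside algorithm returns ⟨Z, r̄⟩ over the unnormalized derivation weights, and the entropy follows as

```latex
H(p) \;=\; -\sum_{d} p(d)\,\log p(d)
     \;=\; \log Z \;-\; \frac{\bar{r}}{Z},
\qquad \bar{r} \;=\; \sum_{d} \bar{p}(d)\,\log \bar{p}(d)
% pbar(d): the unnormalized weight of derivation d, so p(d) = pbar(d) / Z.
```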
105
Cross-entropy
Why? Cross-entropy is an expectation too.
Compute the expectation using the expectation semiring: p_e is the transition probability or log-linear score at edge e; what is r_e?
log q(d) is additively decomposed!
106
Bayes risk
Why? Bayes risk is an expectation too.
Compute the expectation using the expectation semiring: p_e is the transition probability or log-linear score at edge e; what is r_e?
L(Y(d)) is additively decomposed! (Tromble et al., 2008)
107
Applications of Expectation Semirings: a Summary
108
Probabilistic Hypergraph
• "Decoding" quantities: Viterbi, k-best, counting, ...
• First-order quantities: expectation, entropy, Bayes risk, cross-entropy, KL divergence, feature expectations, first-order gradient of Z
• Second-order quantities: expectation of a product, interaction between features, Hessian matrix of Z, second-order gradient descent, gradient of an expectation, gradient of entropy or Bayes risk
A semiring framework computes all of these.