Novel Inference, Training and Decoding Methods over Translation Forests


Page 1: Novel Inference, Training and Decoding Methods over Translation Forests

1

Novel Inference, Training and Decoding Methods over Translation Forests

Zhifei Li

Center for Language and Speech Processing

Computer Science Department

Johns Hopkins University

Advisor: Sanjeev Khudanpur

Co-advisor: Jason Eisner

Page 2: Novel Inference, Training and Decoding Methods over Translation Forests

2

Statistical Machine Translation Pipeline

(Pipeline diagram: bilingual data → generative training → translation models; monolingual English data → generative training → language models; held-out bilingual data → discriminative training → optimal weights; unseen sentences → decoding, using the translation models, language models, and optimal weights → translation outputs.)

Page 3: Novel Inference, Training and Decoding Methods over Translation Forests

3

垫子 上 的 猫

Training a Translation Model

dianzi shang de mao

a cat on the mat

dianzi shang → the mat

Page 4: Novel Inference, Training and Decoding Methods over Translation Forests

4

垫子 上 的 猫

Training a Translation Model

dianzi shang de mao

a cat on the mat

Extracted rules: dianzi shang → the mat, mao → a cat

Page 5: Novel Inference, Training and Decoding Methods over Translation Forests

5

垫子 上 的 猫

Training a Translation Model

dianzi shang de → on the mat

Extracted rules: dianzi shang → the mat, mao → a cat, dianzi shang de → on the mat

Page 6: Novel Inference, Training and Decoding Methods over Translation Forests

6

Training a Translation Model

垫子 上 的 猫

de mao → a cat on

Extracted rules so far: dianzi shang → the mat, mao → a cat, dianzi shang de → on the mat, de mao → a cat on

Page 7: Novel Inference, Training and Decoding Methods over Translation Forests

7

Training a Translation Model

垫子 上 的 猫

de → on

Extracted rules so far: dianzi shang → the mat, mao → a cat, dianzi shang de → on the mat, de mao → a cat on, de → on

Page 8: Novel Inference, Training and Decoding Methods over Translation Forests

8

Decoding a Test Sentence

dianzi shang de gou (垫子 上 的 狗)

the dog on the mat

Derivation Tree

Translation is easy?

Page 9: Novel Inference, Training and Decoding Methods over Translation Forests

9

dianzi shang de mao

a cat on the mat

垫子 上 的 猫

zhongguo de shoudu → capital of China

wo de mao → my cat

zhifei de mao → zhifei ’s cat

Translation Ambiguity

Page 10: Novel Inference, Training and Decoding Methods over Translation Forests

10

dianzi shang de mao

Joshua (chart parser)


Page 11: Novel Inference, Training and Decoding Methods over Translation Forests

11

dianzi shang de mao

the mat a cat
a cat on the mat
a cat of the mat
the mat ’s a cat

Joshua (chart parser)

Page 12: Novel Inference, Training and Decoding Methods over Translation Forests

12

dianzi shang de mao

hypergraph

Joshua (chart parser)

Page 13: Novel Inference, Training and Decoding Methods over Translation Forests

13

A hypergraph is a compact data structure to encode exponentially many trees.

(Figure: a packed forest / hypergraph with its nodes, edges, and hyperedges labeled; an FSA is shown for comparison.)

Page 14: Novel Inference, Training and Decoding Methods over Translation Forests

14

A hypergraph is a compact data structure to encode exponentially many trees.

Page 15: Novel Inference, Training and Decoding Methods over Translation Forests

15

A hypergraph is a compact data structure to encode exponentially many trees.

Page 16: Novel Inference, Training and Decoding Methods over Translation Forests

16

A hypergraph is a compact data structure to encode exponentially many trees.

Page 17: Novel Inference, Training and Decoding Methods over Translation Forests

17

A hypergraph is a compact data structure to encode exponentially many trees. (Structure sharing)

Page 18: Novel Inference, Training and Decoding Methods over Translation Forests

18

Why Hypergraphs?

• Contains a much larger hypothesis space than a k-best list

• General compact data structure

• special cases include

• finite state machine (e.g., lattice),

• and/or graph

• packed forest

• can be used for speech, parsing, tree-based MT systems, and many more
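As a concrete illustration of the data structure (a minimal sketch with assumed names, not Joshua's actual classes): a hypergraph is a set of nodes plus hyperedges, where each hyperedge has one head node and a list of tail (antecedent) nodes; an ordinary edge is just a hyperedge with a single tail, so lattices and packed forests fit the same structure.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str                      # e.g., a nonterminal plus a source span
    incoming: List["Hyperedge"] = field(default_factory=list)

@dataclass
class Hyperedge:
    rule: str                       # e.g., "X -> (dianzi shang, the mat)"
    head: Node
    tails: List[Node]               # empty for purely lexical rules

def add_edge(rule: str, head: Node, tails: List[Node]) -> Hyperedge:
    e = Hyperedge(rule, head, tails)
    head.incoming.append(e)
    return e

# Structure sharing: both root hyperedges below reuse the same two leaf nodes,
# which is why a hypergraph can encode exponentially many trees compactly.
mat = Node("X[0,2]"); cat = Node("X[3,4]"); root = Node("S[0,4]")
add_edge("X -> (dianzi shang, the mat)", mat, [])
add_edge("X -> (mao, a cat)", cat, [])
add_edge("S -> (X0 de X1, X1 on X0)", root, [mat, cat])
add_edge("S -> (X0 de X1, X1 of X0)", root, [mat, cat])
print(len(root.incoming))   # 2 alternative hyperedges at the root node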

Page 19: Novel Inference, Training and Decoding Methods over Translation Forests

19

(Derivation weights in the example hypergraph: p = 2, p = 1, p = 3, p = 2.)

Weighted Hypergraph

Linear model: each derivation of the foreign input is scored by a dot product of the feature weights and the features.

Page 20: Novel Inference, Training and Decoding Methods over Translation Forests

20

Probabilistic Hypergraph

Log-linear model: Z = 2 + 1 + 3 + 2 = 8

Derivation probabilities: p = 2/8, 1/8, 3/8, 2/8
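The linear and log-linear scoring formulas on these two slides were images in the original; a standard reconstruction, consistent with the weights and Z shown (the feature notation f is an assumption), is:

score(x, d) = Θ · f(x, d)

p(d | x) = exp(Θ · f(x, d)) / Z(x),   where   Z(x) = ∑_{d'∈D(x)} exp(Θ · f(x, d'))

With the four example derivation scores 2, 1, 3 and 2 (treated as already exponentiated), Z = 2 + 1 + 3 + 2 = 8, giving p = 2/8, 1/8, 3/8 and 2/8.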

Page 21: Novel Inference, Training and Decoding Methods over Translation Forests

21

Probabilistic Hypergraph

The hypergraph defines a probability distribution over trees! The distribution is parameterized by Θ.

Derivation probabilities: p = 2/8, 1/8, 3/8, 2/8

Page 22: Novel Inference, Training and Decoding Methods over Translation Forests

22

Probabilistic Hypergraph

The hypergraph defines a probability distribution over trees! The distribution is parameterized by Θ.

Training: How do we set the parameters Θ?

Decoding: Which translation do we present to a user?

Atomic Inference: What atomic operations do we need to perform?

Why are these problems difficult?
- brute force would be too slow, since there are exponentially many trees, so sophisticated dynamic programs are required
- some problems are intractable and require approximations

Page 23: Novel Inference, Training and Decoding Methods over Translation Forests

23

Inference, Training and Decoding on Hypergraphs

• Atomic Inference
  • finding one-best derivations
  • finding k-best derivations
  • computing expectations (e.g., of features)

• Training
  • Perceptron, conditional random field (CRF), minimum error rate training (MERT), minimum risk, and MIRA

• Decoding
  • Viterbi decoding, maximum a posteriori (MAP) decoding, and minimum Bayes risk (MBR) decoding

Page 24: Novel Inference, Training and Decoding Methods over Translation Forests

24

Outline

• Hypergraph as Hypothesis Space

• Unsupervised Discriminative Training

‣ minimum imputed risk

‣ contrastive language model estimation

• Variational Decoding

• First- and Second-order Expectation Semirings

main focus

Page 25: Novel Inference, Training and Decoding Methods over Translation Forests

25

Training Setup

• Each training example consists of

• a foreign sentence (from which a hypergraph is generated to represent many possible translations)

• a reference translation

x: dianzi shang de mao

y: a cat on the mat

• Training

• adjust the parameters Θ so that the reference translation is preferred by the model

Page 26: Novel Inference, Training and Decoding Methods over Translation Forests

27

Supervised: Minimum Empirical Risk

• Minimum Empirical Risk Training
  (the MT decoder translates each training input x, and the loss (negated BLEU) of the MT output against the reference y is minimized under the empirical distribution)
  - MERT
  - CRF
  - Perceptron

• Uniform Empirical Distribution

What if the input x is missing?

Page 27: Novel Inference, Training and Decoding Methods over Translation Forests

28

• Minimum Empirical Risk Training

Unsupervised: Minimum Imputed Risk

• Minimum Imputed Risk Training: a reverse model produces an imputed input, the forward system translates it, and the loss against the observed English is minimized (round-trip translation; cf. speech recognition).
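The two training objectives were given as formulas (images) on these slides; roughly, with L the loss (e.g., negated BLEU), ŷ(x; Θ) the forward system's output, and q(x | y) the reverse model, they can be written as:

Minimum empirical risk (supervised):   Θ* = argmin_Θ ∑_i L(ŷ(x_i; Θ), y_i)

Minimum imputed risk (unsupervised):   Θ* = argmin_Θ ∑_j ∑_x q(x | y_j) L(ŷ(x; Θ), y_j)

That is, each monolingual English sentence y_j is round-trip translated: the reverse model imputes possible inputs x, the forward system translates them back, and the expected loss against y_j is minimized.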

Page 28: Novel Inference, Training and Decoding Methods over Translation Forests

29

Training the Reverse Model

The forward and reverse models are parameterized and trained separately.

Our goal is to train a good forward system; the reverse model is held fixed when training the forward one.

Page 29: Novel Inference, Training and Decoding Methods over Translation Forests

30

Approximating

• Approximations: there are exponentially many imputed inputs x, stored in a hypergraph
  - k-best
  - sampling
  - lattice (variational approximation + lattice decoding; Dyer et al., 2008)

CFG is not closed under composition (SCFG composed with SCFG)!

Page 30: Novel Inference, Training and Decoding Methods over Translation Forests

31

The Forward System

• Deterministic Decoding
  • use the one-best translation
  (the objective is not differentiable ☹)

• Randomized Decoding
  • use a distribution over translations
  (expected loss is differentiable ☺)

Page 31: Novel Inference, Training and Decoding Methods over Translation Forests

33

Experiments

• Supervised Training
  • requires bitext

• Unsupervised Training
  • requires monolingual English

• Semi-supervised Training
  • interpolation of supervised and unsupervised

Page 32: Novel Inference, Training and Decoding Methods over Translation Forests

34

Semi-supervised Training

Adding unsupervised data helps!

40K sent. pairs

551 features

Page 33: Novel Inference, Training and Decoding Methods over Translation Forests

36

Supervised vs. Unsupervised

Unsupervised training performs as well as (and often better than) supervised training!

Unsupervised training uses 16 times as much data as supervised training. For example:

        Chinese    English
Sup     100        16*100
Unsup   16*100     16*16*100

• More experiments
  • different k-best sizes
  • different reverse models

But it is a fair comparison!

Page 34: Novel Inference, Training and Decoding Methods over Translation Forests

41

Outline

• Hypergraph as Hypothesis Space

• Unsupervised Discriminative Training

‣ minimum imputed risk

‣ contrastive language model estimation

• Variational Decoding

• First- and Second-order Expectation Semirings

Page 35: Novel Inference, Training and Decoding Methods over Translation Forests

42

Language Modeling

• Language Model
  • assigns a probability to an English sentence y
  • typically an n-gram model (locally normalized)

• Global Log-linear Model (whole-sentence maximum-entropy LM) (Rosenfeld et al., 2001)
  • features: a set of n-grams occurring in y
  • globally normalized: the normalizer sums over all English sentences of any length!
  • training requires sampling, which is slow ☹
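The Rosenfeld et al. (2001) whole-sentence maximum-entropy LM referred to above has roughly the following form (notation assumed):

p(y) = exp(∑_i θ_i f_i(y)) / Z,   where   Z = ∑_{y'} exp(∑_i θ_i f_i(y'))

Here the f_i are, for example, counts of the n-grams occurring in y, and Z sums over all English sentences of any length. This global normalization is what makes training require slow sampling, in contrast to a locally normalized n-gram model p(y) = ∏_t p(y_t | y_{t-n+1} … y_{t-1}).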

Page 36: Novel Inference, Training and Decoding Methods over Translation Forests

43

• Global Log-linear Model (whole-sentence maximum-entropy LM) (Rosenfeld et al., 2001)

Contrastive Estimation

• Contrastive Estimation (CE) (Smith and Eisner, 2005)
  • neighborhood or contrastive set: a set of alternate English sentences, produced by a neighborhood function
  • train to recover the original English as much as possible
  • improves both speed and accuracy
  • was not proposed for language modeling

Page 37: Novel Inference, Training and Decoding Methods over Translation Forests

44

Contrastive Language Model Estimation

• Step-1: extract a confusion grammar (CG)
  • an English-to-English SCFG
  • its rules include paraphrase, insertion, and re-ordering rules

• Step-2: for each English sentence, generate a contrastive set (or neighborhood) using the CG
  • the CG acts as the neighborhood function

• Step-3: discriminative training

Page 38: Novel Inference, Training and Decoding Methods over Translation Forests

45

Step-1: Extracting a Confusion Grammar (CG)

• Deriving a CG from a bilingual grammar
  • use the Chinese side as a pivot

Bilingual Rule ⇒ Confusion Rule

Our neighborhood function is learned and MT-specific: the CG captures the confusion an MT system will have when translating an input.
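A toy sketch of the pivoting step (the rules below are made up for illustration; the real CG is extracted from the full bilingual grammar): pair up the English sides of any two bilingual rules that share the same Chinese source side.

from itertools import product

# Hypothetical bilingual rules, written as (chinese_side, english_side).
bilingual_rules = [
    ("X0 de X1", "X1 of X0"),
    ("X0 de X1", "X1 on X0"),
    ("X0 de X1", "X0 's X1"),
    ("X0 de X1", "X0 X1"),
]

# Pivot on the Chinese side: any two rules with the same source side
# yield an English-to-English confusion rule.
confusion_rules = set()
for (src1, tgt1), (src2, tgt2) in product(bilingual_rules, repeat=2):
    if src1 == src2 and tgt1 != tgt2:
        confusion_rules.add((tgt1, tgt2))

for lhs, rhs in sorted(confusion_rules):
    print(lhs, "->", rhs)
# e.g. "X1 of X0 -> X0 's X1": exactly the kind of confusion an MT system
# makes when translating "de" constructions; note the rules come in symmetric pairs.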

Page 39: Novel Inference, Training and Decoding Methods over Translation Forests

46

Step-2: Generating Contrastive Sets

a cat on the mat

CG

Contrastive set:
  the cat the mat
  the cat ’s the mat
  the mat on the cat
  the mat of the cat

Translating “dianzi shang de mao”?

Page 40: Novel Inference, Training and Decoding Methods over Translation Forests

47

Step-3: Discriminative Training

• Training Objective
  • CE maximizes the conditional likelihood of the original English within its contrastive set
  • (alternatively, an expected-loss objective)

• Iterative Training

Page 41: Novel Inference, Training and Decoding Methods over Translation Forests

48

Applying the Contrastive Model

• We can use the contrastive model as a regular language model

• We can incorporate the contrastive model into an end-to-end MT system as a feature

• We may also use the contrastive model to generate paraphrase sentences (if the loss function measures semantic similarity)

• the rules in CG are symmetric

Page 42: Novel Inference, Training and Decoding Methods over Translation Forests

50

Test on Synthesized Hypergraphs of English Data

(Diagram: an English sentence is parsed with the confusion grammar into a hypergraph, i.e., its neighborhood; the contrastive LM, trained on monolingual English, ranks the hypergraph; the one-best English is compared to the original sentence by BLEU score.)

Page 43: Novel Inference, Training and Decoding Methods over Translation Forests

51

Results on Synthesized Hypergraphs

Features: baseline LM (5-gram), word penalty, target side of a confusion rule, rule bigram features.

All the features look only at the target sides of confusion rules.

The contrastive LM better recovers the original English than a regular n-gram LM.

Page 44: Novel Inference, Training and Decoding Methods over Translation Forests

52

Results on MT Test Set

Add CLM as a feature

The contrastive LM helps to improve MT performance.

Page 45: Novel Inference, Training and Decoding Methods over Translation Forests

53

Adding Features on the CG itself

• On English Set

• On MT Set

Paraphrasing model

glue rules or regular confusion rules?

one big feature

Page 46: Novel Inference, Training and Decoding Methods over Translation Forests

61

Summary for Discriminative Training

• Supervised: Minimum Empirical Risk (requires bitext)

• Unsupervised: Minimum Imputed Risk (requires monolingual English)

• Unsupervised: Contrastive LM Estimation (requires monolingual English)

Page 47: Novel Inference, Training and Decoding Methods over Translation Forests

62

Summary for Discriminative Training

• Supervised Training: requires bitext; can have both TM and LM features

• Unsupervised: Minimum Imputed Risk: requires monolingual English and a reverse model; can have both TM and LM features

• Unsupervised: Contrastive LM Estimation: requires monolingual English; can have LM features only

Page 48: Novel Inference, Training and Decoding Methods over Translation Forests

63

Outline

• Hypergraph as Hypothesis Space

• Unsupervised Discriminative Training

‣ minimum imputed risk

‣ contrastive language model estimation

• Variational Decoding

• First- and Second-order Expectation Semirings

Page 49: Novel Inference, Training and Decoding Methods over Translation Forests

64

Variational Decoding

• We want to do inference under p, but it is intractable

• Instead, we derive a simpler distribution q*

• Then, we will use q* as a surrogate for p in inference

(Figure: p(y|x) lies in the family P, where MAP decoding is intractable (Sima’an 1996); q*(y) lies in a simpler family Q, for which estimation and decoding are tractable.)

Page 50: Novel Inference, Training and Decoding Methods over Translation Forests

65

Variational Decoding for MT: an Overview

Sentence-specific decoding.

Foreign sentence x → SMT system.

MAP decoding under p is intractable: p(y | x) = ∑_{d∈D(x,y)} p(d | x)

Three steps:
1. Generate a hypergraph for the foreign sentence, representing p(d | x).

Page 51: Novel Inference, Training and Decoding Methods over Translation Forests

66

Three steps:
1. Generate a hypergraph, representing p(d | x).
2. Estimate a model q*(y | x) ≈ ∑_{d∈D(x,y)} p(d | x) from the hypergraph by minimizing KL divergence; q* is an n-gram model over output strings.
3. Decode using q* on the hypergraph.

Approximate a hypergraph with a lattice!
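A much-simplified sketch of steps 2 and 3 (the real method computes expected n-gram counts over the whole hypergraph with dynamic programming; here the distinct translation strings and their posteriors are simply listed, using the running example and an assumed pairing of strings with probabilities):

from collections import defaultdict

posterior = {                       # p(y|x), summed over the derivations of y
    "the mat a cat": 2/8,
    "a cat on the mat": 1/8,
    "a cat of the mat": 3/8,
    "the mat 's a cat": 2/8,
}

def bigrams(y):
    toks = ["<s>"] + y.split() + ["</s>"]
    return list(zip(toks, toks[1:]))

# Step 2: expected bigram counts under p, normalized into a bigram model q*.
num, den = defaultdict(float), defaultdict(float)
for y, p in posterior.items():
    for u, v in bigrams(y):
        num[(u, v)] += p
        den[u] += p
q_star = {uv: c / den[uv[0]] for uv, c in num.items()}

# Step 3: decode with q*: score each candidate string by its q* probability.
def q_score(y):
    score = 1.0
    for uv in bigrams(y):
        score *= q_star.get(uv, 1e-9)
    return score

print(max(posterior, key=q_score))  # q* favors strings whose n-grams are shared
                                    # by many high-probability alternatives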

Page 52: Novel Inference, Training and Decoding Methods over Translation Forests

67

Outline

• Hypergraph as Hypothesis Space

• Unsupervised Discriminative Training

‣ minimum imputed risk

‣ contrastive language model estimation

• Variational Decoding

• First- and Second-order Expectation Semirings

Page 53: Novel Inference, Training and Decoding Methods over Translation Forests

68

Probabilistic Hypergraph

• “Decoding” quantities: Viterbi, k-best, counting, ...

• First-order expectations: expectation, entropy, expected loss, cross-entropy, KL divergence, feature expectations, first-order gradient of Z

• Second-order expectations: expectation over a product, interaction between features, Hessian matrix of Z (second-order gradient descent), gradient of an expectation, gradient of expected loss or entropy

A semiring framework to compute all of these.

Recipe to compute a quantity:
• Choose a semiring
• Specify a semiring weight for each hyperedge
• Run the inside algorithm

Page 54: Novel Inference, Training and Decoding Methods over Translation Forests

69

Applications of Expectation Semirings: a Summary

Page 55: Novel Inference, Training and Decoding Methods over Translation Forests

70

• Unsupervised Discriminative Training

‣ minimum imputed risk

‣ contrastive language model estimation

• Variational Decoding

• First- and Second-order Expectation Semirings

Inference, Training and Decoding on Hypergraphs

(Li et al., ACL 2009)

(Li and Eisner, EMNLP 2009)

(In Preparation)

Page 56: Novel Inference, Training and Decoding Methods over Translation Forests

71

My Other MT Research

• Training methods (supervised)
  • Discriminative forest reranking with Perceptron (Li and Khudanpur, GALE book chapter 2009)
  • Discriminative n-gram language models (Li and Khudanpur, AMTA 2008)

• Algorithms
  • Oracle extraction from hypergraphs (Li and Khudanpur, NAACL 2009)
  • Efficient intersection between n-gram LM and CFG (Li and Khudanpur, ACL SSST 2008)

• Others
  • System combination (Smith et al., GALE book chapter 2009)

• Unsupervised translation induction for Chinese abbreviations (Li and Yarowsky, ACL 2008)

Page 57: Novel Inference, Training and Decoding Methods over Translation Forests

72

Research other than MT

• Information extraction

• Relation extraction between formal and informal phrases (Li and Yarowsky, EMNLP 2008)

• Spoken dialog management

• Optimal dialog in consumer-rating systems using a POMDP (Li et al., SIGDial 2008)

Page 58: Novel Inference, Training and Decoding Methods over Translation Forests

73

Joshua project

• An open-source parsing-based MT toolkit (Li et al., 2009)
  • supports Hiero (Chiang, 2007) and SAMT (Venugopal et al., 2007)

• Team members
  • Zhifei Li, Chris Callison-Burch, Chris Dyer, Sanjeev Khudanpur, Wren Thornton, Jonathan Weese, Juri Ganitkevitch, Lane Schwartz, and Omar Zaidan

All the methods presented have been implemented in Joshua!

It relies only on a word aligner and the SRI LM toolkit!

Page 59: Novel Inference, Training and Decoding Methods over Translation Forests

74

Thank you! Xiexie! (谢谢!)

Page 60: Novel Inference, Training and Decoding Methods over Translation Forests

75

Page 61: Novel Inference, Training and Decoding Methods over Translation Forests

76

Page 62: Novel Inference, Training and Decoding Methods over Translation Forests

77

Page 63: Novel Inference, Training and Decoding Methods over Translation Forests

78

Decoding over a hypergraph

Given a hypergraph of possible translations (generated for a given foreign sentence by an already-trained model), pick a single translation to output.

(Why not just pick the tree with the highest weight?)

Page 64: Novel Inference, Training and Decoding Methods over Translation Forests

79

Spurious Ambiguity

• Statistical models in MT exhibit spurious ambiguity

• Many different derivations (e.g., trees or segmentations) generate the same translation string

•Tree-based MT systems

• derivation tree ambiguity

•Regular phrase-based MT systems

• phrase segmentation ambiguity

Page 65: Novel Inference, Training and Decoding Methods over Translation Forests

80

Spurious Ambiguity in Derivation Trees

jiqi fanyi ruanjian (机器 翻译 软件)

• Same output: “machine translation software”
• Three different derivation trees, built from the rules
  S → (机器, machine), S → (翻译, translation), S → (软件, software),
  S → (S0 S1, S0 S1), and S → (S0 翻译 S1, S0 translation S1)

Another translation: “machine transfer software”

Page 66: Novel Inference, Training and Decoding Methods over Translation Forests

81

MAP, Viterbi and N-best Approximations

• Viterbi approximation

• N-best approximation (crunching) (May and Knight 2006)

• Exact MAP decoding: NP-hard (Sima’an 1996)

Page 67: Novel Inference, Training and Decoding Methods over Translation Forests

82

MAP vs. Approximations

Eight derivations with probabilities 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10, yielding three translation strings:

translation string    MAP (sum over all derivations)    Viterbi (best single derivation)    4-best crunching
red translation       0.28                              0.16                                0.16
blue translation      0.28                              0.14                                0.28
green translation     0.44                              0.13                                0.13

• Our goal: develop an approximation that considers all the derivations but still allows tractable decoding

• Viterbi and crunching are efficient, but ignore most derivations

• Exact MAP decoding under spurious ambiguity is intractable on HG
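A small sketch that reproduces the three columns above (the assignment of the eight derivation probabilities to the red/blue/green strings is the one implied by the totals):

from collections import defaultdict

derivations = [
    (0.16, "red"), (0.12, "red"),
    (0.14, "blue"), (0.14, "blue"),
    (0.13, "green"), (0.11, "green"), (0.10, "green"), (0.10, "green"),
]

# Viterbi: pick the string of the single best derivation.
viterbi = max(derivations)[1]                       # "red" (0.16)

# Crunching: sum derivation probabilities within the k-best list only.
k_best = sorted(derivations, reverse=True)[:4]
crunch = defaultdict(float)
for p, y in k_best:
    crunch[y] += p
crunching = max(crunch, key=crunch.get)             # "blue" (0.28)

# Exact MAP: sum over *all* derivations of each string (intractable in general).
total = defaultdict(float)
for p, y in derivations:
    total[y] += p
map_best = max(total, key=total.get)                # "green" (0.44)

print(viterbi, crunching, map_best)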

Page 68: Novel Inference, Training and Decoding Methods over Translation Forests

83

Variational Decoding

Decoding using a variational approximation: decoding using a sentence-specific approximate distribution.

Page 69: Novel Inference, Training and Decoding Methods over Translation Forests

84

Variational Decoding for MT: an Overview

Sentence-specific decoding.

Foreign sentence x → SMT system.

MAP decoding under p is intractable: p(y | x)

Three steps:
1. Generate a hypergraph for the foreign sentence, representing p(y, d | x).

Page 70: Novel Inference, Training and Decoding Methods over Translation Forests

85

Three steps:
1. Generate a hypergraph, representing p(d | x).
2. Estimate a model q*(y | x) ≈ ∑_{d∈D(x,y)} p(d | x) from the hypergraph; q* is an n-gram model over output strings.
3. Decode using q* on the hypergraph.

Page 71: Novel Inference, Training and Decoding Methods over Translation Forests

86

Variational Inference

• We want to do inference under p, but it is intractable

• Instead, we derive a simpler distribution q*

•Then, we will use q* as a surrogate for p in inference

(Figure: p lies in the family P; q* lies in the simpler family Q.)

Page 72: Novel Inference, Training and Decoding Methods over Translation Forests

87

Variational Approximation

• q*: an approximation, from a family of distributions Q, having minimum distance to p (up to a constant)

• Three questions (and answers)
  • how to parameterize q? as an n-gram model
  • how to estimate q*? compute expected n-gram counts and normalize
  • how to use q* for decoding? score the hypergraph with the n-gram model
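Concretely, for the n-gram parameterization, minimizing the KL divergence from p has a closed-form solution by relative frequency of expected counts (the notation here is an assumption, but the idea matches "compute expected n-gram counts and normalize"):

q*(w | h) = E_p[c_{hw}(Y)] / E_p[c_h(Y)]

i.e., the expected count of the n-gram hw under the hypergraph distribution p, divided by the expected count of its (n-1)-gram history h; both expectations are computed on the hypergraph with dynamic programming, and the resulting n-gram model q* is then used to score the hypergraph.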

Page 73: Novel Inference, Training and Decoding Methods over Translation Forests

88

KL divergences under different variational models

•Larger n ==> better approximation q_n ==> smaller KL divergence from p

•The reduction of KL divergence happens mostly when switching from unigram to bigram

Page 74: Novel Inference, Training and Decoding Methods over Translation Forests

89

BLEU Results on Chinese-English NIST MT 2004 Tasks

• variational decoding (new!) improves over Viterbi, MBR (Kumar and Byrne, 2004), and crunching (May and Knight, 2006)

Page 75: Novel Inference, Training and Decoding Methods over Translation Forests

90

Variational Inference

• We want to do inference under p, but it is intractable

• Instead, we derive a simpler distribution q*

•Then, we will use q* as a surrogate for p in inference

(Figure: p lies in the family P and is intractable; q* lies in the simpler family Q, which is tractable to estimate and to decode with.)

Page 76: Novel Inference, Training and Decoding Methods over Translation Forests

91

Outline

• Hypergraph as Hypothesis Space

• Unsupervised Discriminative Training

‣ minimum imputed risk

‣ contrastive language model estimation

• Variational Decoding

• First- and Second-order Expectation Semirings

Page 77: Novel Inference, Training and Decoding Methods over Translation Forests

92

Probabilistic Hypergraph

• “Decoding” quantities: Viterbi, k-best, counting, ...

• First-order quantities: expectation, entropy, Bayes risk, cross-entropy, KL divergence, feature expectations, first-order gradient of Z

• Second-order quantities: expectation over a product, interaction between features, Hessian matrix of Z (second-order gradient descent), gradient of an expectation, gradient of entropy or Bayes risk

A semiring framework to compute all of these.

Page 78: Novel Inference, Training and Decoding Methods over Translation Forests

93

Compute Quantities on a Hypergraph: a Recipe

• Three steps (the semiring-weighted inside algorithm):
  ‣ choose a semiring: a set with plus and times operations, e.g., the integers with regular + and ×
  ‣ specify a weight for each hyperedge: each weight is a semiring member
  ‣ run the inside algorithm: complexity is O(hypergraph size)
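A minimal sketch of this recipe (assumed node/edge names, not Joshua code): the inside algorithm is written once, parameterized by the semiring's plus and times, and instantiated here with the counting semiring on a small hypergraph shaped like the running example.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Hyperedge:
    head: str
    tails: List[str]     # antecedent nodes (empty for leaf edges)
    weight: object       # a member of the chosen semiring

def inside(edges: List[Hyperedge], topo_order: List[str],
           plus: Callable, times: Callable, zero) -> Dict[str, object]:
    """Semiring-weighted inside algorithm; complexity O(hypergraph size)."""
    by_head: Dict[str, List[Hyperedge]] = {}
    for e in edges:
        by_head.setdefault(e.head, []).append(e)
    beta: Dict[str, object] = {}
    for v in topo_order:                    # children before parents
        total = zero
        for e in by_head.get(v, []):
            prod = e.weight                 # k(e)
            for u in e.tails:               # times over the tail nodes
                prod = times(prod, beta[u])
            total = plus(total, prod)       # plus over incoming hyperedges
        beta[v] = total
    return beta

# Counting semiring: ordinary integers with regular + and x; every k(e) = 1.
edges = [
    Hyperedge("v1", [], 1), Hyperedge("v2", [], 1),
    Hyperedge("v3", ["v1", "v2"], 1), Hyperedge("v3", ["v1", "v2"], 1),
    Hyperedge("v4", ["v1", "v2"], 1), Hyperedge("v4", ["v1", "v2"], 1),
    Hyperedge("v5", ["v3"], 1), Hyperedge("v5", ["v4"], 1),
]
counts = inside(edges, ["v1", "v2", "v3", "v4", "v5"],
                plus=lambda a, b: a + b, times=lambda a, b: a * b, zero=0)
print(counts["v5"])   # 4 derivation trees, as in the example on the later slides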

Page 79: Novel Inference, Training and Decoding Methods over Translation Forests

94

Semirings

• “Decoding” time semirings (Goodman, 1999)
  • counting, Viterbi, k-best, etc.

• “Training” time semirings
  • first-order expectation semirings (Eisner, 2002)
  • second-order expectation semirings (new)

• Applications of the Semirings (new)
  • entropy, risk, gradients of them, and many more

Page 80: Novel Inference, Training and Decoding Methods over Translation Forests

95

dianzi0 shang1 de2 mao3

How many trees? Four ☺

How do we compute it? Use a semiring?

Page 81: Novel Inference, Training and Decoding Methods over Translation Forests

96

Compute the Number of Derivation Trees

Three steps:
  ‣ choose a semiring: the counting semiring, i.e., ordinary integers with regular + and ×
  ‣ specify a weight for each hyperedge: k_e = 1 for every hyperedge
  ‣ run the inside algorithm

Page 82: Novel Inference, Training and Decoding Methods over Translation Forests

97

Bottom-up process in computing the number of trees (k_e = 1 for every hyperedge):

k(v1) = k(e1) = 1
k(v2) = k(e2) = 1
k(v3) = k(e3)·k(v1)·k(v2) + k(e4)·k(v1)·k(v2) = 2
k(v4) = k(e5)·k(v1)·k(v2) + k(e6)·k(v1)·k(v2) = 2
k(v5) = k(e7)·k(v3) + k(e8)·k(v4) = 4

Page 83: Novel Inference, Training and Decoding Methods over Translation Forests

98

Four translations in the example hypergraph (lengths 4, 5, 5, 5; probabilities p = 2/8, 1/8, 3/8, 2/8):

the mat a cat
a cat on the mat
a cat of the mat
the mat ’s a cat

Expected translation length? 2/8×4 + 6/8×5 = 4.75

Variance? 2/8×(4-4.75)^2 + 6/8×(5-4.75)^2 ≈ 0.19

Page 84: Novel Inference, Training and Decoding Methods over Translation Forests

99

First- and Second-order Expectation Semirings (Eisner, 2002)

• First-order: each member is a 2-tuple

• Second-order: each member is a 4-tuple
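A sketch of the first-order case, assuming the usual expectation-semiring operations (Eisner, 2002): a member is a pair ⟨p, r⟩, plus adds componentwise, and times is ⟨p1·p2, p1·r2 + p2·r1⟩. Summing the pair ⟨p_d, p_d·r_d⟩ over the four example derivations recovers the expected translation length 4.75 from the earlier slide.

def e_plus(a, b):
    (p1, r1), (p2, r2) = a, b
    return (p1 + p2, r1 + r2)

def e_times(a, b):
    (p1, r1), (p2, r2) = a, b
    return (p1 * p2, p1 * r2 + p2 * r1)

# e_times multiplies the probabilities and adds the r values, as expected:
assert e_times((0.5, 0.5 * 2), (0.5, 0.5 * 3)) == (0.25, 0.25 * 5)

# The example's four derivations: (unnormalized weight, translation length).
derivations = [(2.0, 4), (1.0, 5), (3.0, 5), (2.0, 5)]

total = (0.0, 0.0)
for p_d, r_d in derivations:
    total = e_plus(total, (p_d, p_d * r_d))   # each derivation contributes <p, p*r>

Z, r_bar = total
print(Z, r_bar / Z)   # 8.0 4.75 -- the expected translation length
# On a hypergraph, the same plus/times pair is used inside the inside algorithm,
# so the sum over exponentially many derivations stays cheap.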

Page 85: Novel Inference, Training and Decoding Methods over Translation Forests

100

First-order: each semiring member is a 2-tuple (fake numbers)

Edge weights: ⟨1, 2⟩, ⟨1, 2⟩, ⟨2, 0⟩, ⟨3, 3⟩, ⟨1, 1⟩, ⟨2, 2⟩, ⟨1, 0⟩, ⟨1, 0⟩

k(v1) = k(e1) = ⟨1, 2⟩
k(v2) = k(e2) = ⟨1, 2⟩
k(v3) = k(e3) ⊗ k(v1) ⊗ k(v2) ⊕ k(e4) ⊗ k(v1) ⊗ k(v2) = ⟨0, 1⟩
k(v4) = k(e5) ⊗ k(v1) ⊗ k(v2) ⊕ k(e6) ⊗ k(v1) ⊗ k(v2) = ⟨1, 2⟩
k(v5) = k(e7) ⊗ k(v3) ⊕ k(e8) ⊗ k(v4) = ⟨8, 4.75⟩

Page 86: Novel Inference, Training and Decoding Methods over Translation Forests

101

Second-order: each semiring member is a 4-tuple (fake numbers)

Edge weights: ⟨1, 2, 2, 4⟩, ⟨2, 0, 0, 0⟩, ⟨3, 3, 3, 3⟩, ⟨1, 1, 1, 1⟩, ⟨2, 2, 2, 2⟩, ⟨1, 0, 0, 0⟩, ⟨1, 0, 0, 0⟩, ⟨1, 2, 2, 4⟩

k(v1) = k(e1) = ⟨1, 2, 2, 4⟩
k(v2) = k(e2) = ⟨1, 2, 2, 4⟩
k(v3) = k(e3) ⊗ k(v1) ⊗ k(v2) ⊕ k(e4) ⊗ k(v1) ⊗ k(v2) = ⟨1, 1, 1, 1⟩
k(v4) = k(e5) ⊗ k(v1) ⊗ k(v2) ⊕ k(e6) ⊗ k(v1) ⊗ k(v2) = ⟨1, 2, 1, 3⟩
k(v5) = k(e7) ⊗ k(v3) ⊕ k(e8) ⊗ k(v4) = ⟨8, 4.5, 4.5, 5⟩

Page 87: Novel Inference, Training and Decoding Methods over Translation Forests

102

Expectations on Hypergraphs

• r(d) is a function over a derivation d (e.g., the length of the translation yielded by d)

• Expectation over a hypergraph: a sum over exponentially many derivations

• r(d) is additively decomposed over hyperedges (e.g., translation length is additively decomposed!)

Page 88: Novel Inference, Training and Decoding Methods over Translation Forests

103

Second-order Expectations on Hypergraphs

• Expectation of products over a hypergraph: a sum over exponentially many derivations

• r and s are additively decomposed; r and s can be identical or different functions.

Page 89: Novel Inference, Training and Decoding Methods over Translation Forests

104

Compute the expectation using an expectation semiring:

Why? Entropy is an expectation.

p_e: transition probability or log-linear score at edge e; what should r_e be?

log p(d) is additively decomposed!
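Spelling this out (a standard derivation; the edge notation is assumed): entropy is H(p) = −∑_d p(d) log p(d) = E_p[r(d)] with r(d) = −log p(d). If p(d) = ∏_{e∈d} p_e, then r(d) = ∑_{e∈d} (−log p_e), so r is additively decomposed with r_e = −log p_e, and one inside pass with the first-order expectation semiring computes H. (With unnormalized edge scores p̂, compute ⟨Z, ∑_d p̂(d) log p̂(d)⟩ and use H = log Z − (1/Z) ∑_d p̂(d) log p̂(d).)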

Page 90: Novel Inference, Training and Decoding Methods over Translation Forests

105

Why? Cross-entropy is an expectation.

p_e: transition probability or log-linear score at edge e; what should r_e be?

log q(d) is additively decomposed!

Compute the expectation using an expectation semiring.

Page 91: Novel Inference, Training and Decoding Methods over Translation Forests

106

Why? Bayes risk is an expectation (Tromble et al. 2008).

p_e: transition probability or log-linear score at edge e; what should r_e be?

L(Y(d)) is additively decomposed!

Compute the expectation using an expectation semiring.

Page 92: Novel Inference, Training and Decoding Methods over Translation Forests

107

Applications of Expectation Semirings: a Summary

Page 93: Novel Inference, Training and Decoding Methods over Translation Forests

108

Probabilistic Hypergraph

• “Decoding” quantities: Viterbi, k-best, counting, ...

• First-order quantities: expectation, entropy, Bayes risk, cross-entropy, KL divergence, feature expectations, first-order gradient of Z

• Second-order quantities: expectation over a product, interaction between features, Hessian matrix of Z (second-order gradient descent), gradient of an expectation, gradient of entropy or Bayes risk

A semiring framework to compute all of these