
Page 1:

Distilling Knowledge for Search-based Structured Prediction

Yijia Liu*, Wanxiang Che, Huaipeng Zhao, Bing Qin, Ting Liu

Research Center for Social Computing and Information Retrieval

Harbin Institute of Technology

Page 2:

Complex Model Wins

[ResNet, 2015]

[He+, 2017]

Page 3:

[Bar charts comparing Baseline, search SOTA, Distillation, and Ensemble: Dependency Parsing (LAS, roughly 90–93) and NMT (BLEU, roughly 20.5–26.5).]

Page 4:

[The same bar charts, with the margins between systems highlighted: 0.6 and 1.3 points on Dependency Parsing, 0.8 and 2.6 points on NMT.]

Page 5:

Classification vs. Structured Prediction

Classifier: x → y

Structured Predictor: x → y_1, y_2, …, y_n

Page 6:

Classification vs. Structured Prediction

Classifier: I like this book → y

Structured Predictor: I like this book → y_1, y_2, …, y_n

Page 7:

Search-based Structured Prediction

[Search-space figure: the sentence "I like this book".]

Page 8:

A Classifier p(y | s) that Controls the Search Process

[Search-space figure: the sentence "I like this book".]

[Bar chart: p(y | I, like) over the candidate actions {book, I, like, love, the, this}.]
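As a rough illustration of this setup, here is a minimal sketch (not the authors' code) of search-based structured prediction with greedy decoding; `model.init_state` and `model.step` are hypothetical APIs standing in for whatever encoder and per-state scorer the predictor uses.

```python
# Minimal sketch: a classifier p(y | state) controls the search by
# scoring the next action at every state. Hypothetical API, for
# illustration only.
import torch

def greedy_decode(model, src_tokens, max_len=50, eos_id=2):
    """Incrementally build the output, picking the best-scoring action."""
    state = model.init_state(src_tokens)    # hypothetical: encode the input
    output = []
    for _ in range(max_len):
        logits = model.step(state, output)  # hypothetical: scores over actions
        y = int(torch.argmax(logits))       # greedy: argmax of p(y | state)
        if y == eos_id:                     # stop at end-of-sequence
            break
        output.append(y)
    return output
```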

Page 9:

Generic p(y | s) Learning Algorithm

[Search-space figure: the sentence "I like this book".]

argmax Σ_y δ(y = this) · log p(y | I, like)

[Bar chart: the one-hot target δ(y = this) over {book, I, like, love, the, this}.]
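A one-line sketch of this generic objective, assuming per-state logits from a PyTorch model (names are illustrative): cross-entropy against the gold action is exactly the negative of Σ_y δ(y = gold) log p(y | state).

```python
import torch
import torch.nn.functional as F

def nll_step(logits: torch.Tensor, gold_action: int) -> torch.Tensor:
    """Negative log-likelihood at one state: -log p(y = gold | state),
    i.e. cross-entropy against the one-hot target delta(y = gold)."""
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([gold_action]))
```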

Page 10:

Problems of the Generic Learning Algorithm

[Search-space figure: "I like this book", with "the" as an alternative to "this".]

Ambiguities in training data

"Both this and the seem reasonable."

Page 11:

Problems of the Generic Learning Algorithm

[Search-space figure: "I like this book", with "the" as an alternative to "this" and a wrongly explored branch through "love".]

Ambiguities in training data

"Both this and the seem reasonable."

Training and test discrepancy

"What if I made a wrong decision?"

Page 12:

Solutions in Previous Work

[Search-space figure: "I like this book", with the "the" and "love" branches.]

Ambiguities in training data → Ensemble (Dietterich, 2000)

Training and test discrepancy → Explore (Ross and Bagnell, 2010)

Page 13:

Where We Are

[Search-space figure: "I like this book", with the "the" and "love" branches.]

Knowledge Distillation: one technique targeting both problems, ambiguities in training data and the training/test discrepancy.

Page 14:

Knowledge Distillation

Learning from negative log-likelihood vs. learning from knowledge distillation:

NLL: argmax Σ_y δ(y = this) · log p(y | I, like)

KD: argmax Σ_y q(y) · log p(y | I, like)

q(y | x, y_1..t−1) is the output distribution of a teacher model (e.g. an ensemble).

[Bar charts over {book, I, like, love, the, this}: the one-hot NLL target δ(y = this) vs. the teacher's soft distribution q(y).]

On supervised data, the two objectives are interpolated:

argmax (1 − α) · Σ_y δ(y = this) log p(y | I, like) + α · Σ_y q(y) log p(y | I, like)
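A sketch of this interpolated objective in PyTorch (our reading of the slide, not the released code): the KD term is a soft cross-entropy against the teacher's distribution q, mixed with ordinary NLL by the coefficient α.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_probs, gold_action, alpha=0.5):
    """(1 - alpha) * NLL + alpha * KD at a single state."""
    log_p = F.log_softmax(student_logits, dim=-1)  # log p(y | state)
    kd = -(teacher_probs * log_p).sum()            # -sum_y q(y) log p(y | state)
    nll = -log_p[gold_action]                      # -log p(gold | state)
    return (1.0 - alpha) * nll + alpha * kd
```

With alpha = 1 the loss reduces to pure KD, the setting probed in the analysis on Page 22.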

Page 15:

Knowledge Distillation: from Where?

Learning from knowledge distillation:

argmax Σ_y q(y) · log p(y | I, like)

[Bar chart: the teacher's soft distribution q(y) over {book, I, like, love, the, this}.]

Ambiguities in training data → Ensemble (Dietterich, 2000)

We use an ensemble of M structured predictors as the teacher q.
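A minimal sketch of such a teacher, assuming the same hypothetical `step` API as above: the teacher distribution is the average of the M members' output distributions.

```python
import torch
import torch.nn.functional as F

def ensemble_q(models, state, output_so_far):
    """q(y | state) = (1/M) * sum_m p_m(y | state) over M predictors."""
    probs = [F.softmax(m.step(state, output_so_far), dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)
```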

Page 16:

KD on Supervised (Reference) Data

[Search-space figure: "I like this book", with "the" as an alternative to "this".]

(1 − α) · Σ_y δ(y = this) log p(y | I, like) + α · Σ_y q(y) log p(y | I, like)

[Bar charts: the one-hot target δ(y = this) and the teacher distribution q(y) over {book, I, like, love, the, this}.]

Page 17:

KD on Explored Data

[Search-space figure: the teacher explores the branch "I like the …" rather than the reference "this".]

Σ_y q(y) · log p(y | I, like, the)

[Bar chart: the teacher distribution q(y) at the explored state, over {book, I, like, love, the, this}.]

Training and test discrepancy → Explore (Ross and Bagnell, 2010)

We use the teacher q to explore the search space and learn from KD on the explored data.
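A sketch of one way to realize this (the sampling choice is our assumption, with the same hypothetical APIs as above): let the teacher's distribution drive the search by sampling actions, and accumulate the KD loss at each state the search visits.

```python
import torch

def explore_and_distill_loss(student, teacher_q, src_tokens,
                             max_len=50, eos_id=2):
    """Sum of KD losses along a trajectory explored with the teacher q."""
    state = student.init_state(src_tokens)   # hypothetical API, as above
    output, losses = [], []
    for _ in range(max_len):
        q = teacher_q(state, output)         # teacher distribution at this state
        log_p = torch.log_softmax(student.step(state, output), dim=-1)
        losses.append(-(q * log_p).sum())    # KD term at the explored state
        y = int(torch.multinomial(q, 1))     # explore: sample an action from q
        if y == eos_id:
            break
        output.append(y)
    return torch.stack(losses).sum()
```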

Page 18:

We combine KD on reference and explored data.

Page 19:

Experiments

Transition-based Dependency Parsing

Penn Treebank (Stanford dependencies), LAS:

  Baseline                                   90.83
  Ensemble (20)                              92.73
  Distill (reference, α = 1.0)               91.99
  Distill (exploration)                      92.00
  Distill (both)                             92.14
  Ballesteros et al. (2016) (dyn. oracle)    91.42
  Andor et al. (2016) (local, B=1)           91.02

Neural Machine Translation

IWSLT 2014 de-en, BLEU:

  Baseline                                   22.79
  Ensemble (10)                              26.26
  Distill (reference, α = 0.8)               24.76
  Distill (exploration)                      24.64
  Distill (both)                             25.44
  MIXER (Ranzato et al., 2015)               20.73
  Wiseman and Rush (2016) (local, B=1)       22.53
  Wiseman and Rush (2016) (global, B=1)      23.83

Page 20:

Analysis: Why Does the Ensemble Work Better?

• Examining the ensemble on the "problematic" states.

[Search-space figure: "I like this book"; the "the" branch marks an optimal-yet-ambiguous state, the "love" branch a non-optimal one.]

Page 21:

Analysis: Why Does the Ensemble Work Better?

• Examining the ensemble on the "problematic" states.

• Testbed: transition-based dependency parsing.

• Tools: a dynamic oracle, which returns a set of reference actions for one state.

• Evaluate the output distributions against the reference actions (see the sketch after the table).

             optimal-yet-ambiguous   non-optimal
  Baseline          68.59               89.59
  Ensemble          74.19               90.90
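One plausible reading of this evaluation (an assumption on our part; the slide does not spell out the metric): the fraction of problematic states where the model's top-scoring action falls inside the dynamic oracle's reference set. A sketch with hypothetical `model` and `oracle` APIs:

```python
import torch

def oracle_agreement(model, oracle, states):
    """Fraction of states whose argmax action is in the dynamic
    oracle's reference set (hypothetical APIs, illustrative only)."""
    hits = 0
    for state, output_so_far in states:
        probs = torch.softmax(model.step(state, output_so_far), dim=-1)
        if int(torch.argmax(probs)) in oracle.reference_actions(state):
            hits += 1
    return hits / len(states)
```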

Page 22:

Analysis: Is It Feasible to Fully Learn from KD w/o NLL?

Fully learning from KD is feasible.

[Curves of score vs. the interpolation coefficient α, recovered below; α = 1 is pure KD, α = 0 is pure NLL. Left: Transition-based Parsing (LAS); right: Neural Machine Translation (BLEU).]

  α      1      0.9    0.8    0.7    0.6    0.5    0.4    0.3    0.2    0.1    0
  LAS    92.07  92.04  91.93  91.90  91.70  91.72  91.55  91.49  91.30  91.10  90.90
  BLEU   26.96  27.04  27.13  26.95  26.60  26.64  26.37  26.21  26.09  25.90  24.93

Page 23:

Analysis: Is Learning from KD Stable?

[Figures: Transition-based Parsing and Neural Machine Translation.]

Page 24:

Conclusion

• We propose to distill an ensemble into a single model, from both reference and exploration states.

• Experiments on transition-based dependency parsing and machine translation show that our distillation method significantly improves the single model's performance.

• Our analysis provides empirical support for the distillation method.

Page 25:

Thanks and Q/A