JHU MT Class: Automated Evaluation
© 2010 IBM Corporation
IBM Research
What It Takes to Compete against Top Human Jeopardy! Players
Our Analysis Reveals the Winner’s Cloud
[Figure: each dot is an actual historical human Jeopardy! game. Labeled regions: Winning Human Performance, Grand Champion Human Performance, 2007 QA Computer System. Axis runs from More Confident to Less Confident.]
Computers? Not So Good.
DeepQA: Incremental Progress in Answering Precision on the Jeopardy Challenge: 6/2007-11/2010
IBM Watson: Playing in the Winners Cloud
[Figure: system versions plotted: Baseline 12/06, v0.1 12/07, v0.2 05/08, v0.3 08/08, v0.4 12/08, v0.5 05/09, v0.6 10/09, v0.7 04/10, v0.8 11/10.]
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
Edit distance = 16: 3 substitutions, 8 deletions, 5 insertions
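\[
\mathrm{ed}(i, j) = \min
\begin{cases}
\mathrm{ed}(i-1, j) + \mathrm{del}(w_i) \\
\mathrm{ed}(i, j-1) + \mathrm{ins}(w'_j) \\
\mathrm{ed}(i-1, j-1) + \mathrm{sub}(w_i, w'_j)
\end{cases}
\]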
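The recurrence above can be sketched as a small dynamic program over word tokens. This is a minimal illustration, not the course's code: the function name `edit_distance` and the default unit costs are assumptions, and the total for any sentence pair depends on which cost functions are plugged in.

```python
def edit_distance(hyp, ref,
                  del_cost=lambda w: 1,
                  ins_cost=lambda w: 1,
                  sub_cost=lambda a, b: 0 if a == b else 1):
    """Word-level edit distance between token lists `hyp` and `ref`.

    ed[i][j] is the cheapest way to turn the first i hypothesis words
    into the first j reference words. Unit costs are an assumption.
    """
    m, n = len(hyp), len(ref)
    ed = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # delete every hypothesis word
        ed[i][0] = ed[i - 1][0] + del_cost(hyp[i - 1])
    for j in range(1, n + 1):          # insert every reference word
        ed[0][j] = ed[0][j - 1] + ins_cost(ref[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ed[i][j] = min(
                ed[i - 1][j] + del_cost(hyp[i - 1]),
                ed[i][j - 1] + ins_cost(ref[j - 1]),
                ed[i - 1][j - 1] + sub_cost(hyp[i - 1], ref[j - 1]),
            )
    return ed[m][n]


# One substitution (is -> remained) plus one insertion (very).
print(edit_distance("the sky is clear".split(),
                    "the sky remained very clear".split()))
```

The table has (m+1) x (n+1) cells and each cell looks at three neighbours, so the running time is proportional to the product of the two sentence lengths.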
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
Precision: 7/15 tokens = 47%
Recall: 7/12 tokens = 58%
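The two figures above can be reproduced with a few lines of Python. This is a sketch under the assumption that tokens are whitespace-separated, case-sensitive, and that a repeated word only counts as often as it appears in both sentences; the function name is hypothetical.

```python
from collections import Counter


def unigram_overlap(hyp, ref):
    """Return (matches, precision, recall) over token multisets."""
    # Counter & Counter keeps the minimum count of each shared token.
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return matches, matches / len(hyp), matches / len(ref)


hyp = ("Although the northern wind shrieked across the sky , "
       "it was still very clear .").split()
ref = "However , the sky remained clear under the strong north wind .".split()

matches, precision, recall = unigram_overlap(hyp, ref)
print(f"{matches}/{len(hyp)} = {precision:.0%}, {matches}/{len(ref)} = {recall:.0%}")
```

Precision divides by the hypothesis length (15 tokens) and recall by the reference length (12 tokens), which is why the same 7 matches give two different percentages.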
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 11/15 tokens
sky very northern shrieked clear wind Although across the the , still was it .
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 11/15 tokens
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 11/15 tokens, 4/14 bigrams, 1/13 trigrams
sky very northern shrieked clear wind Although across the the , still was it .
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 11/15 tokens, 0/14 bigrams, 0/13 trigrams
very clear .
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 3/3 tokens, 2/2 bigrams, 1/1 trigrams
very clear . shrieked was still Although wind , across it northern the the sky
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 11/15 tokens, 4/14 bigrams, 1/13 trigrams
a north . the was and was the the the though the , the sky
However , the sky remained clear under the strong north wind .
Despite the strong northerly winds , the sky remains very clear .
The sky was still crystal clear , though the north wind was howling .
Although a north wind was howling , the sky remained clear and blue .
Precision: 11/15 tokens, 4/14 bigrams, 1/13 trigrams
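The n-gram counts in these examples can be checked with a short sketch. It assumes whitespace tokenization, case-sensitive matching, and clipping: an n-gram in the hypothesis is credited at most as many times as it occurs in any single reference. The helper names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hyp, refs, n):
    """Return (matched, total) n-gram counts for hyp against several refs."""
    hyp_counts = Counter(ngrams(hyp, n))
    max_ref = Counter()
    for ref in refs:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each hypothesis count by the most any one reference allows.
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())


refs = [s.split() for s in [
    "However , the sky remained clear under the strong north wind .",
    "Despite the strong northerly winds , the sky remains very clear .",
    "The sky was still crystal clear , though the north wind was howling .",
    "Although a north wind was howling , the sky remained clear and blue .",
]]
hyp = ("Although the northern wind shrieked across the sky , "
       "it was still very clear .").split()
word_salad = "a north . the was and was the the the though the , the sky".split()

for n in (1, 2, 3):
    print(n, clipped_precision(hyp, refs, n), clipped_precision(word_salad, refs, n))
```

Without clipping, the word-salad hypothesis would be credited for all six occurrences of "the"; with clipping it gets two, which is how it lands on 11/15.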
BLEU-1, BLEU-4, BLEU-v11b, BLEU-v12, METEOR-v0.6, NIST-v11b, TER-v0.7.25, 4-GRR, ATEC1, Amber, ATEC3, ATEC4, Meteor-v0.7, TerrorCat, BEwT-E, Badger, BadgerLite, Bleu-sbp, BleuSP, CDer, DP-Or, DP-Orp, DR-Or, EDPM, LET, METEOR-ranking, MaxSim, RTE, Rose, SEPIA1, SEPIA2, SNR, SR-Or, SVM-Rank, TERp, ULCh, ULCopt, invWer, mBLEU, mTER
Although the northern wind shrieked across the sky , it was still very clear .
However , the sky remained clear under the strong north wind .
TER: Translation (Error|Edit) Rate
Basically edit distance with swaps
How hard is it to compute this?
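\[
\mathrm{ter}(i, j) = \min
\begin{cases}
\mathrm{ter}(i-1, j) + \mathrm{del}(w_i) \\
\mathrm{ter}(i, j-1) + \mathrm{ins}(w'_j) \\
\mathrm{ter}(i-1, j-1) + \mathrm{sub}(w_i, w'_j) \\
\max_k \mathrm{ter}(i-1, [1, \ldots, k-1, k+1, \ldots, j]) + 1
\end{cases}
\]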
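The last case in the recurrence indexes over arbitrary subsets of positions, so the table no longer has a polynomial number of cells. For contrast, here is a sketch of a much more restricted variant that does stay tractable: edit distance where the only extra operation is swapping two adjacent words (the "optimal string alignment" distance). This is a swapped-in technique chosen for illustration, not TER itself, and the function name is hypothetical.

```python
def adjacent_swap_distance(hyp, ref):
    """Edit distance with unit-cost insert, delete, substitute, and
    adjacent-word transposition (optimal string alignment).

    Unlike the swaps in the recurrence above, a swap here can only
    exchange two neighbouring words.
    """
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
            if (i > 1 and j > 1
                    and hyp[i - 1] == ref[j - 2]
                    and hyp[i - 2] == ref[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[m][n]


# Two neighbouring words exchanged: one swap.
print(adjacent_swap_distance("the sky clear was".split(),
                             "the sky was clear".split()))
# A word moved across two others is not an adjacent swap, so it is
# paid for as a deletion plus an insertion.
print(adjacent_swap_distance("north wind strong".split(),
                             "strong north wind".split()))
```

The second call shows the limitation: moving a word or phrase over a longer distance is exactly the kind of swap this restricted table cannot represent in one step.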
Why Not Use All Translations?
el primer ministro italiano Silvio Berlusconi
〈PM〉 〈IT〉 〈SB〉
〈PM〉 → prime-minister
〈PM〉 → PM
〈PM〉 → prime minister
〈PM〉 → head of government
〈PM〉 → premier
〈IT〉 → Italian
〈SB〉 → Silvio Berlusconi
〈SB〉 → Berlusconi
〈S〉 → 〈SB〉 , 〈IT〉 〈PM〉
〈S〉 → 〈IT〉 〈PM〉 〈SB〉
〈S〉 → the 〈IT〉 〈PM〉 , 〈SB〉
〈S〉 → the 〈PM〉 of Italy
(Dreyer & Marcu ’12)
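To see how quickly a handful of rules multiplies into many translations, the grammar above can be expanded exhaustively. This is a minimal sketch: writing nonterminals as `<S>`, `<PM>`, `<IT>`, `<SB>` and storing each right-hand side as a space-separated string are encoding choices made here, not something the slides specify.

```python
from itertools import product

# Rules copied from the slide; anything in angle brackets is a nonterminal.
GRAMMAR = {
    "<S>": ["<SB> , <IT> <PM>",
            "<IT> <PM> <SB>",
            "the <IT> <PM> , <SB>",
            "the <PM> of Italy"],
    "<PM>": ["prime-minister", "PM", "prime minister",
             "head of government", "premier"],
    "<IT>": ["Italian"],
    "<SB>": ["Silvio Berlusconi", "Berlusconi"],
}


def expand(symbol):
    """Return the set of all word strings derivable from `symbol`."""
    if symbol not in GRAMMAR:          # a plain word expands to itself
        return {symbol}
    results = set()
    for rhs in GRAMMAR[symbol]:
        choices = [expand(token) for token in rhs.split()]
        for combo in product(*choices):
            results.add(" ".join(combo))
    return results


translations = expand("<S>")
print(len(translations))
for t in sorted(translations)[:5]:
    print(t)
```

Eleven rules for a five-word source phrase already yield dozens of distinct translations; the count multiplies as more phrases and alternatives are added.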
HyTER
•Entire set is exponential, but finite.
•Can be encoded as an FST.
•Then compute edit distance as FST composition!
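The idea of scoring against a whole set without enumerating it can be sketched with plain dynamic programming over a word lattice. This is not the FST machinery the slide refers to: it is a hand-rolled stand-in, the lattice below is a hypothetical fragment covering only the 〈IT〉 〈PM〉 〈SB〉 word order, and it assumes unit edit costs and arcs that always go from a lower-numbered state to a higher-numbered one.

```python
def lattice_edit_distance(hyp, arcs, num_states, final):
    """Minimum word-level edit distance from `hyp` to any path 0 -> final.

    Assumes every arc (src, word, dst) has src < dst, so state numbers
    are already a topological order of the lattice.
    """
    n = len(hyp)
    inf = float("inf")
    out = [[] for _ in range(num_states)]
    for src, word, dst in arcs:
        out[src].append((word, dst))
    # d[s][i]: cheapest way to be in state s after reading i hypothesis words
    d = [[inf] * (n + 1) for _ in range(num_states)]
    d[0][0] = 0
    for s in range(num_states):
        for i in range(n + 1):
            cur = d[s][i]
            if cur == inf:
                continue
            if i < n:  # hypothesis word with no counterpart on the path
                d[s][i + 1] = min(d[s][i + 1], cur + 1)
            for word, t in out[s]:
                # path word with no counterpart in the hypothesis
                d[t][i] = min(d[t][i], cur + 1)
                if i < n:  # match (free) or substitution (cost 1)
                    cost = 0 if word == hyp[i] else 1
                    d[t][i + 1] = min(d[t][i + 1], cur + cost)
    return d[final][n]


# Hypothetical lattice: Italian (premier | PM | prime minister) [Silvio] Berlusconi
ARCS = [
    (0, "Italian", 1),
    (1, "prime", 2), (2, "minister", 3),
    (1, "premier", 3), (1, "PM", 3),
    (3, "Silvio", 4), (4, "Berlusconi", 5),
    (3, "Berlusconi", 5),
]

print(lattice_edit_distance("the Italian premier Berlusconi".split(), ARCS, 6, 5))
```

The work grows with the number of lattice arcs times the hypothesis length, not with the number of paths, which is what makes scoring against an exponentially large set feasible.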
HyTER statistics
•3-4 annotators per sentence.
•2-3 hours per annotator per sentence.
•>1M translations per annotator per sentence.
•>1B translations per sentence (combined).
•Shockingly low overlap between annotators (~10K).
Parting Thoughts
•Evaluating machine translation is really, really hard.
•Human evaluation: expensive, slow, unreproducible. But arguably what we want.
•Automatic evaluation: fast, cheap, consistent. But might not have anything to do with what we want.
•It’s also really, really important.
•It’s easier to improve what you measure.
•Research funding often driven by evaluation.
•What should we be measuring?