
  • WMT 2011

    Sixth Workshop on

    Statistical Machine Translation

    Proceedings of the Workshop

    July 30–31, 2011

• With support from EUROMATRIXPLUS, an EU Framework Programme 7 funded project.

©2011 The Association for Computational Linguistics

    Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]

    ISBN 978-1-937284-12-1/1-937284-12-3


  • Introduction

The EMNLP 2011 Workshop on Statistical Machine Translation (WMT-2011) took place on Saturday and Sunday, July 30–31 in Edinburgh, Scotland, immediately following the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP), which was hosted by the University of Edinburgh.

This is the seventh time this workshop has been held. The first time was in 2005 as part of the ACL 2005 Workshop on Building and Using Parallel Texts. In the following years the Workshop on Statistical Machine Translation was held at HLT-NAACL 2006 in New York City, US, ACL 2007 in Prague, Czech Republic, ACL 2008 in Columbus, Ohio, US, EACL 2009 in Athens, Greece, and ACL 2010 in Uppsala, Sweden.

The focus of our workshop was the use of parallel corpora for machine translation. Recent experimentation has shown that the performance of SMT systems varies greatly with the source language. In this workshop we encouraged researchers to investigate ways to improve the performance of SMT systems for diverse languages, including morphologically more complex languages, languages with partially free word order, and low-resource languages.

Prior to the workshop, in addition to soliciting relevant papers for review and possible presentation, we conducted a shared task that brought together machine translation systems for an evaluation on previously unseen data. The results of the shared task were announced at the workshop, and these proceedings also include an overview paper for the shared task that summarizes the results, as well as provides information about the data used and any procedures that were followed in conducting or scoring the task. In addition, there are short papers from each participating team that describe their underlying system in some detail.

As in previous years, we received a far larger number of submissions than we could accept for presentation. This year we received 42 full paper submissions (not counting withdrawn submissions) and 47 shared task submissions. In total, WMT-2011 featured 18 full paper oral presentations and 47 shared task poster presentations.

The invited talk was given by William Lewis (Microsoft Research), Robert Munro (Stanford University), and Stephan Vogel (Carnegie Mellon University).

We would like to thank the members of the Program Committee for their timely reviews. We also would like to thank the participants of the shared task and all the other volunteers who helped with the evaluations.

    Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F. Zaidan


  • WMT 5-year Retrospective Best Paper Award

Since this is the Sixth Workshop on Statistical Machine Translation, we have decided to create a WMT 5-year Retrospective Best Paper Award, to be given to the best paper that was published at the first Workshop on Statistical Machine Translation, which was held at HLT-NAACL 2006 in New York. The goals of this retrospective award are to recognize high-quality work that has stood the test of time, and to highlight the excellent work that appears at WMT.

The WMT11 program committee voted on the best paper from a list of six nominated papers. Five of these were nominated by high citation counts, which we defined as having 10 or more citations in the ACL Anthology Network (excluding self-citations), and more than 30 citations on Google Scholar. We also opened the nomination process to the committee, which yielded one further nomination for a paper that did not reach the citation threshold but was deemed to be excellent.

    The program committee decided to award the WMT 5-year Retrospective Best Paper Award to:

Andreas Zollmann and Ashish Venugopal. 2006. Syntax Augmented Machine Translation via Chart Parsing. In Proceedings of the Workshop on Statistical Machine Translation, pages 138–141.

This short paper described Zollmann and Venugopal’s entry in the WMT06 shared translation task. Their entry introduced a parsing-based machine translation system. Like David Chiang’s Hiero system, Zollmann and Venugopal’s system used synchronous context-free grammars (SCFGs). Instead of using a single non-terminal symbol, X, Zollmann and Venugopal’s SCFG rules contained linguistically informed non-terminal symbols that were extracted from a parsed parallel corpus.

This paper was one of the first publications to demonstrate that syntactically informed approaches to statistical machine translation could achieve translation quality comparable to – or even better than – state-of-the-art phrase-based and hierarchical phrase-based approaches to machine translation. Zollmann and Venugopal’s approach has influenced a number of researchers, and has been integrated into open source translation software like the Joshua and Moses decoders.

In many ways this paper represents the ideals of the WMT workshops. It introduced a novel approach to machine translation and demonstrated its value empirically by comparing it to other state-of-the-art systems on a public data set.

    Congratulations to Andreas Zollmann and Ashish Venugopal for their excellent work!


  • Organizers:

Chris Callison-Burch (Johns Hopkins University)
Philipp Koehn (University of Edinburgh)
Christof Monz (University of Amsterdam)
Omar F. Zaidan (Johns Hopkins University)

    Invited Talk:

William Lewis (Microsoft Research)
Robert Munro (Stanford University)
Stephan Vogel (Carnegie Mellon University)

    Program Committee:

Lars Ahrenberg (Linköping University)
Nicola Bertoldi (FBK)
Graeme Blackwood (University of Cambridge)
Michael Bloodgood (University of Maryland)
Ondrej Bojar (Charles University)
Thorsten Brants (Google)
Chris Brockett (Microsoft)
Nicola Cancedda (Xerox)
Marine Carpuat (National Research Council Canada)
Simon Carter (University of Amsterdam)
Francisco Casacuberta (University of Valencia)
Daniel Cer (Stanford University)
Mauro Cettolo (FBK)
Boxing Chen (National Research Council Canada)
Colin Cherry (National Research Council Canada)
David Chiang (ISI)
Jon Clark (Carnegie Mellon University)
Stephen Clark (University of Cambridge)
Christophe Costa Florencio (University of Amsterdam)
Michael Denkowski (Carnegie Mellon University)
Kevin Duh (NTT)
Chris Dyer (Carnegie Mellon University)
Marc Dymetman (Xerox)
Marcello Federico (FBK)


Andrew Finch (NICT)
Jose Fonollosa (University of Catalonia)
George Foster (National Research Council Canada)
Alex Fraser (University of Stuttgart)
Michel Galley (Microsoft)
Niyu Ge (IBM)
Dmitriy Genzel (Google)
Ulrich Germann (University of Toronto)
Kevin Gimpel (Carnegie Mellon University)
Adria de Gispert (University of Cambridge)
Spence Green (Stanford University)
Nizar Habash (Columbia University)
Keith Hall (Google)
Greg Hanneman (Carnegie Mellon University)
Kenneth Heafield (Carnegie Mellon University)
John Henderson (MITRE)
Howard Johnson (National Research Council Canada)
Doug Jones (Lincoln Labs MIT)
Damianos Karakos (Johns Hopkins University)
Maxim Khalilov (University of Amsterdam)
Roland Kuhn (National Research Council Canada)
Shankar Kumar (Google)
Philippe Langlais (University of Montreal)
Adam Lopez (Johns Hopkins University)
Wolfgang Macherey (Google)
Nitin Madnani (Educational Testing Service)
Daniel Marcu (Language Weaver)
Yuval Marton (IBM)
Lambert Mathias (Nuance)
Spyros Matsoukas (Raytheon BBN Technologies)
Arne Mauser (RWTH Aachen)
Arul Menezes (Microsoft)
Bob Moore (Google)
Smaranda Muresan (Rutgers University)
Kemal Oflazer (Carnegie Mellon University)
Miles Osborne (University of Edinburgh)
Matt Post (Johns Hopkins University)
Chris Quirk (Microsoft)
Antti-Veikko Rosti (Raytheon BBN Technologies)
Salim Roukos (IBM)
Anoop Sarkar (Simon Fraser University)
Holger Schwenk (University of Le Mans)


Jean Senellart (Systran)
Khalil Simaan (University of Amsterdam)
Michel Simard (National Research Council Canada)
David Smith (University of Massachusetts Amherst)
Matthew Snover (Raytheon BBN Technologies)
Joerg Tiedemann (Uppsala University)
Christoph Tillmann (IBM)
Dan Tufis (Romanian Academy)
Jakob Uszkoreit (Google)
Masao Utiyama (NICT)
David Vilar (RWTH Aachen)
Clare Voss (Army Research Labs)
Taro Watanabe (NTT)
Andy Way (Dublin City University)
Jinxi Xu (Raytheon BBN Technologies)
Sirvan Yahyaei (Queen Mary, University of London)
Daniel Zeman (Charles University)
Richard Zens (Google)
Bing Zhang (Raytheon BBN Technologies)


  • Table of Contents

A Grain of Salt for the WMT Manual Evaluation
    Ondřej Bojar, Miloš Ercegovčević, Martin Popel and Omar Zaidan . . . . 1

A Lightweight Evaluation Framework for Machine Translation Reordering
    David Talbot, Hideto Kazawa, Hiroshi Ichikawa, Jason Katz-Brown, Masakazu Seno and Franz Och . . . . 12

Findings of the 2011 Workshop on Statistical Machine Translation
    Chris Callison-Burch, Philipp Koehn, Christof Monz and Omar Zaidan . . . . 22

Evaluate with Confidence Estimation: Machine ranking of translation outputs using grammatical features
    Eleftherios Avramidis, Maja Popović, David Vilar and Aljoscha Burchardt . . . . 65

AMBER: A Modified BLEU, Enhanced Ranking Metric
    Boxing Chen and Roland Kuhn . . . . 71

TESLA at WMT 2011: Translation Evaluation and Tunable Metric
    Daniel Dahlmeier, Chang Liu and Hwee Tou Ng . . . . 78

Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems
    Michael Denkowski and Alon Lavie . . . . 85

Approximating a Deep-Syntactic Metric for MT Evaluation and Tuning
    Matouš Macháček and Ondřej Bojar . . . . 92

Evaluation without references: IBM1 scores as evaluation metrics
    Maja Popović, David Vilar, Eleftherios Avramidis and Aljoscha Burchardt . . . . 99

Morphemes and POS tags for n-gram based evaluation metrics
    Maja Popović . . . . 104

E-rating Machine Translation
    Kristen Parton, Joel Tetreault, Nitin Madnani and Martin Chodorow . . . . 108

TINE: A Metric to Assess MT Adequacy
    Miguel Rios, Wilker Aziz and Lucia Specia . . . . 116

Regression and Ranking based Optimisation for Sentence Level MT Evaluation
    Xingyi Song and Trevor Cohn . . . . 123

MAISE: A Flexible, Configurable, Extensible Open Source Package for Mass AI System Evaluation
    Omar Zaidan . . . . 130


MANY improvements for WMT’11
    Loïc Barrault . . . . 135

The UPV-PRHLT combination system for WMT 2011
    Jesús González-Rubio and Francisco Casacuberta . . . . 140

CMU System Combination in WMT 2011
    Kenneth Heafield and Alon Lavie . . . . 145

The RWTH System Combination System for WMT 2011
    Gregor Leusch, Markus Freitag and Hermann Ney . . . . 152

Expected BLEU Training for Graphs: BBN System Description for WMT11 System Combination Task
    Antti-Veikko Rosti, Bing Zhang, Spyros Matsoukas and Richard Schwartz . . . . 159

The UZH System Combination System for WMT 2011
    Rico Sennrich . . . . 166

Description of the JHU System Combination Scheme for WMT 2011
    Daguang Xu, Yuan Cao and Damianos Karakos . . . . 171

Multiple-stream Language Models for Statistical Machine Translation
    Abby Levenberg, Miles Osborne and David Matthews . . . . 177

KenLM: Faster and Smaller Language Model Queries
    Kenneth Heafield . . . . 187

Wider Context by Using Bilingual Language Models in Machine Translation
    Jan Niehues, Teresa Herrmann, Stephan Vogel and Alex Waibel . . . . 198

A Minimally Supervised Approach for Detecting and Ranking Document Translation Pairs
    Kriste Krstovski and David A. Smith . . . . 207

Agreement Constraints for Statistical Machine Translation into German
    Philip Williams and Philipp Koehn . . . . 217

Fuzzy Syntactic Reordering for Phrase-based Statistical Machine Translation
    Jacob Andreas, Nizar Habash and Owen Rambow . . . . 227

Filtering Antonymous, Trend-Contrasting, and Polarity-Dissimilar Distributional Paraphrases for Improving Statistical Machine Translation
    Yuval Marton, Ahmed El Kholy and Nizar Habash . . . . 237

Productive Generation of Compound Words in Statistical Machine Translation
    Sara Stymne and Nicola Cancedda . . . . 250

SampleRank Training for Phrase-Based Machine Translation
    Barry Haddow, Abhishek Arun and Philipp Koehn . . . . 261


Instance Selection for Machine Translation using Feature Decay Algorithms
    Ergun Bicici and Deniz Yuret . . . . 272

Investigations on Translation Model Adaptation Using Monolingual Data
    Patrik Lambert, Holger Schwenk, Christophe Servan and Sadaf Abdul-Rauf . . . . 284

Topic Adaptation for Lecture Translation through Bilingual Latent Semantic Models
    Nick Ruiz and Marcello Federico . . . . 294

Personal Translator at WMT2011
    Vera Aleksic and Gregor Thurmair . . . . 303

LIMSI @ WMT11
    Alexandre Allauzen, Hélène Bonneau-Maynard, Hai-Son Le, Aurélien Max, Guillaume Wisniewski, François Yvon, Gilles Adda, Josep Maria Crego, Adrien Lardilleux, Thomas Lavergne and Artem Sokolov . . . . 309

Shallow Semantic Trees for SMT
    Wilker Aziz, Miguel Rios and Lucia Specia . . . . 316

RegMT System for Machine Translation, System Combination, and Evaluation
    Ergun Bicici and Deniz Yuret . . . . 323

Improving Translation Model by Monolingual Data
    Ondřej Bojar and Aleš Tamchyna . . . . 330

The CMU-ARK German-English Translation System
    Chris Dyer, Kevin Gimpel, Jonathan H. Clark and Noah A. Smith . . . . 337

Noisy SMS Machine Translation in Low-Density Languages
    Vladimir Eidelman, Kristy Hollingshead and Philip Resnik . . . . 344

Stochastic Parse Tree Selection for an Existing RBMT System
    Christian Federmann and Sabine Hunsicker . . . . 351

Joint WMT Submission of the QUAERO Project
    Markus Freitag, Gregor Leusch, Joern Wuebker, Stephan Peitz, Hermann Ney, Teresa Herrmann, Jan Niehues, Alex Waibel, Alexandre Allauzen, Gilles Adda, Josep Maria Crego, Bianka Buschbeck, Tonio Wandmacher and Jean Senellart . . . . 358

CMU Syntax-Based Machine Translation at WMT 2011
    Greg Hanneman and Alon Lavie . . . . 365

The Uppsala-FBK systems at WMT 2011
    Christian Hardmeier, Jörg Tiedemann, Markus Saers, Marcello Federico and Mathur Prashant . . . . 372

The Karlsruhe Institute of Technology Translation Systems for the WMT 2011
    Teresa Herrmann, Mohammed Mediani, Jan Niehues and Alex Waibel . . . . 379


CMU Haitian Creole-English Translation System for WMT 2011
    Sanjika Hewavitharana, Nguyen Bach, Qin Gao, Vamshi Ambati and Stephan Vogel . . . . 386

Experiments with word alignment, normalization and clause reordering for SMT between English and German
    Maria Holmqvist, Sara Stymne and Lars Ahrenberg . . . . 393

The Value of Monolingual Crowdsourcing in a Real-World Translation Scenario: Simulation using Haitian Creole Emergency SMS Messages
    Chang Hu, Philip Resnik, Yakov Kronrod, Vladimir Eidelman, Olivia Buzek and Benjamin B. Bederson . . . . 399

The RWTH Aachen Machine Translation System for WMT 2011
    Matthias Huck, Joern Wuebker, Christoph Schmidt, Markus Freitag, Stephan Peitz, Daniel Stein, Arnaud Dagnelies, Saab Mansour, Gregor Leusch and Hermann Ney . . . . 405

ILLC-UvA translation system for EMNLP-WMT 2011
    Maxim Khalilov and Khalil Sima’an . . . . 413

UPM system for the translation task
    Verónica López-Ludeña and Rubén San-Segundo . . . . 420

Two-step translation with grammatical post-processing
    David Mareček, Rudolf Rosa, Petra Galuščáková and Ondřej Bojar . . . . 426

Influence of Parser Choice on Dependency-Based MT
    Martin Popel, David Mareček, Nathan Green and Zdenek Zabokrtsky . . . . 433

The LIGA (LIG/LIA) Machine Translation System for WMT 2011
    Marion Potet, Raphaël Rubino, Benjamin Lecouteux, Stéphane Huet, Laurent Besacier, Hervé Blanchon and Fabrice Lefèvre . . . . 440

Factored Translation with Unsupervised Word Clusters
    Christian Rishøj and Anders Søgaard . . . . 447

The BM-I2R Haitian-Créole-to-English translation system description for the WMT 2011 evaluation campaign
    Marta R. Costa-jussà and Rafael E. Banchs . . . . 452

The Universitat d’Alacant hybrid machine translation system for WMT 2011
    Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez and Juan Antonio Pérez-Ortiz . . . . 457

LIUM’s SMT Machine Translation Systems for WMT 2011
    Holger Schwenk, Patrik Lambert, Loïc Barrault, Christophe Servan, Sadaf Abdul-Rauf, Haithem Afli and Kashif Shah . . . . 464

Spell Checking Techniques for Replacement of Unknown Words and Data Cleaning for Haitian Creole SMS Translation
    Sara Stymne . . . . 470


Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor
    Jonathan Weese, Juri Ganitkevitch, Chris Callison-Burch, Matt Post and Adam Lopez . . . . 478

DFKI Hybrid Machine Translation System for WMT 2011 - On the Integration of SMT and RBMT
    Jia Xu, Hans Uszkoreit, Casey Kennington, David Vilar and Xiaojun Zhang . . . . 485

CEU-UPV English-Spanish system for WMT11
    Francisco Zamora-Martinez and Maria Jose Castro-Bleda . . . . 490

Hierarchical Phrase-Based MT at the Charles University for the WMT 2011 Shared Task
    Daniel Zeman . . . . 496

Crisis MT: Developing A Cookbook for MT in Crisis Situations
    William Lewis, Robert Munro and Stephan Vogel . . . . 501

Generative Models of Monolingual and Bilingual Gappy Patterns
    Kevin Gimpel and Noah A. Smith . . . . 512

Extraction Programs: A Unified Approach to Translation Rule Extraction
    Mark Hopkins, Greg Langmead and Tai Vo . . . . 523

Bayesian Extraction of Minimal SCFG Rules for Hierarchical Phrase-based Translation
    Baskaran Sankaran, Gholamreza Haffari and Anoop Sarkar . . . . 533

From n-gram-based to CRF-based Translation Models
    Thomas Lavergne, Alexandre Allauzen, Josep Maria Crego and François Yvon . . . . 542


  • Conference Program

    Saturday, July 30, 2011

    Session 1: Shared Tasks and Evaluation

9:00–9:20   A Grain of Salt for the WMT Manual Evaluation
            Ondřej Bojar, Miloš Ercegovčević, Martin Popel and Omar Zaidan

9:20–9:40   A Lightweight Evaluation Framework for Machine Translation Reordering
            David Talbot, Hideto Kazawa, Hiroshi Ichikawa, Jason Katz-Brown, Masakazu Seno and Franz Och

9:40–10:30  Findings of the 2011 Workshop on Statistical Machine Translation
            Chris Callison-Burch, Philipp Koehn, Christof Monz and Omar Zaidan

    10:30–11:00 Coffee

    Session 2: Shared Metrics and System Combination Tasks

    11:00–12:40 Poster Session Metrics

Evaluate with Confidence Estimation: Machine ranking of translation outputs using grammatical features
    Eleftherios Avramidis, Maja Popović, David Vilar and Aljoscha Burchardt

AMBER: A Modified BLEU, Enhanced Ranking Metric
    Boxing Chen and Roland Kuhn

TESLA at WMT 2011: Translation Evaluation and Tunable Metric
    Daniel Dahlmeier, Chang Liu and Hwee Tou Ng

Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems
    Michael Denkowski and Alon Lavie

Approximating a Deep-Syntactic Metric for MT Evaluation and Tuning
    Matouš Macháček and Ondřej Bojar

Evaluation without references: IBM1 scores as evaluation metrics
    Maja Popović, David Vilar, Eleftherios Avramidis and Aljoscha Burchardt


  • Saturday, July 30, 2011 (continued)

Morphemes and POS tags for n-gram based evaluation metrics
    Maja Popović

E-rating Machine Translation
    Kristen Parton, Joel Tetreault, Nitin Madnani and Martin Chodorow

TINE: A Metric to Assess MT Adequacy
    Miguel Rios, Wilker Aziz and Lucia Specia

Regression and Ranking based Optimisation for Sentence Level MT Evaluation
    Xingyi Song and Trevor Cohn

MAISE: A Flexible, Configurable, Extensible Open Source Package for Mass AI System Evaluation
    Omar Zaidan

    11:00–12:40 Poster Session System Combination

MANY improvements for WMT’11
    Loïc Barrault

The UPV-PRHLT combination system for WMT 2011
    Jesús González-Rubio and Francisco Casacuberta

CMU System Combination in WMT 2011
    Kenneth Heafield and Alon Lavie

The RWTH System Combination System for WMT 2011
    Gregor Leusch, Markus Freitag and Hermann Ney

Expected BLEU Training for Graphs: BBN System Description for WMT11 System Combination Task
    Antti-Veikko Rosti, Bing Zhang, Spyros Matsoukas and Richard Schwartz

The UZH System Combination System for WMT 2011
    Rico Sennrich

Description of the JHU System Combination Scheme for WMT 2011
    Daguang Xu, Yuan Cao and Damianos Karakos


  • Saturday, July 30, 2011 (continued)

    12:40–14:00 Lunch

    Session 3: Language Models and Document Alignment

14:00–14:25  Multiple-stream Language Models for Statistical Machine Translation
             Abby Levenberg, Miles Osborne and David Matthews

14:25–14:50  KenLM: Faster and Smaller Language Model Queries
             Kenneth Heafield

14:50–15:15  Wider Context by Using Bilingual Language Models in Machine Translation
             Jan Niehues, Teresa Herrmann, Stephan Vogel and Alex Waibel

15:15–15:40  A Minimally Supervised Approach for Detecting and Ranking Document Translation Pairs
             Kriste Krstovski and David A. Smith

    15:40–16:10 Coffee

    Session 4: Syntax and Semantics

16:10–16:35  Agreement Constraints for Statistical Machine Translation into German
             Philip Williams and Philipp Koehn

16:35–17:00  Fuzzy Syntactic Reordering for Phrase-based Statistical Machine Translation
             Jacob Andreas, Nizar Habash and Owen Rambow

17:00–17:25  Filtering Antonymous, Trend-Contrasting, and Polarity-Dissimilar Distributional Paraphrases for Improving Statistical Machine Translation
             Yuval Marton, Ahmed El Kholy and Nizar Habash

17:25–17:50  Productive Generation of Compound Words in Statistical Machine Translation
             Sara Stymne and Nicola Cancedda


  • Sunday, July 31, 2011

    Session 5: Machine Learning and Adaptation

9:00–9:25    SampleRank Training for Phrase-Based Machine Translation
             Barry Haddow, Abhishek Arun and Philipp Koehn

9:25–9:50    Instance Selection for Machine Translation using Feature Decay Algorithms
             Ergun Bicici and Deniz Yuret

9:50–10:15   Investigations on Translation Model Adaptation Using Monolingual Data
             Patrik Lambert, Holger Schwenk, Christophe Servan and Sadaf Abdul-Rauf

10:15–10:40  Topic Adaptation for Lecture Translation through Bilingual Latent Semantic Models
             Nick Ruiz and Marcello Federico

    10:40–11:00 Coffee

    Session 6: Shared Translation Task

    11:00–12:40 Poster Presentations

Personal Translator at WMT2011
    Vera Aleksic and Gregor Thurmair

LIMSI @ WMT11
    Alexandre Allauzen, Hélène Bonneau-Maynard, Hai-Son Le, Aurélien Max, Guillaume Wisniewski, François Yvon, Gilles Adda, Josep Maria Crego, Adrien Lardilleux, Thomas Lavergne and Artem Sokolov

Shallow Semantic Trees for SMT
    Wilker Aziz, Miguel Rios and Lucia Specia

RegMT System for Machine Translation, System Combination, and Evaluation
    Ergun Bicici and Deniz Yuret

Improving Translation Model by Monolingual Data
    Ondřej Bojar and Aleš Tamchyna


  • Sunday, July 31, 2011 (continued)

The CMU-ARK German-English Translation System
    Chris Dyer, Kevin Gimpel, Jonathan H. Clark and Noah A. Smith

Noisy SMS Machine Translation in Low-Density Languages
    Vladimir Eidelman, Kristy Hollingshead and Philip Resnik

Stochastic Parse Tree Selection for an Existing RBMT System
    Christian Federmann and Sabine Hunsicker

Joint WMT Submission of the QUAERO Project
    Markus Freitag, Gregor Leusch, Joern Wuebker, Stephan Peitz, Hermann Ney, Teresa Herrmann, Jan Niehues, Alex Waibel, Alexandre Allauzen, Gilles Adda, Josep Maria Crego, Bianka Buschbeck, Tonio Wandmacher and Jean Senellart

CMU Syntax-Based Machine Translation at WMT 2011
    Greg Hanneman and Alon Lavie

The Uppsala-FBK systems at WMT 2011
    Christian Hardmeier, Jörg Tiedemann, Markus Saers, Marcello Federico and Mathur Prashant

The Karlsruhe Institute of Technology Translation Systems for the WMT 2011
    Teresa Herrmann, Mohammed Mediani, Jan Niehues and Alex Waibel

CMU Haitian Creole-English Translation System for WMT 2011
    Sanjika Hewavitharana, Nguyen Bach, Qin Gao, Vamshi Ambati and Stephan Vogel

Experiments with word alignment, normalization and clause reordering for SMT between English and German
    Maria Holmqvist, Sara Stymne and Lars Ahrenberg

The Value of Monolingual Crowdsourcing in a Real-World Translation Scenario: Simulation using Haitian Creole Emergency SMS Messages
    Chang Hu, Philip Resnik, Yakov Kronrod, Vladimir Eidelman, Olivia Buzek and Benjamin B. Bederson

The RWTH Aachen Machine Translation System for WMT 2011
    Matthias Huck, Joern Wuebker, Christoph Schmidt, Markus Freitag, Stephan Peitz, Daniel Stein, Arnaud Dagnelies, Saab Mansour, Gregor Leusch and Hermann Ney

ILLC-UvA translation system for EMNLP-WMT 2011
    Maxim Khalilov and Khalil Sima’an


  • Sunday, July 31, 2011 (continued)

UPM system for the translation task
    Verónica López-Ludeña and Rubén San-Segundo

Two-step translation with grammatical post-processing
    David Mareček, Rudolf Rosa, Petra Galuščáková and Ondřej Bojar

Influence of Parser Choice on Dependency-Based MT
    Martin Popel, David Mareček, Nathan Green and Zdenek Zabokrtsky

The LIGA (LIG/LIA) Machine Translation System for WMT 2011
    Marion Potet, Raphaël Rubino, Benjamin Lecouteux, Stéphane Huet, Laurent Besacier, Hervé Blanchon and Fabrice Lefèvre

Factored Translation with Unsupervised Word Clusters
    Christian Rishøj and Anders Søgaard

The BM-I2R Haitian-Créole-to-English translation system description for the WMT 2011 evaluation campaign
    Marta R. Costa-jussà and Rafael E. Banchs

The Universitat d’Alacant hybrid machine translation system for WMT 2011
    Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez and Juan Antonio Pérez-Ortiz

LIUM’s SMT Machine Translation Systems for WMT 2011
    Holger Schwenk, Patrik Lambert, Loïc Barrault, Christophe Servan, Sadaf Abdul-Rauf, Haithem Afli and Kashif Shah

Spell Checking Techniques for Replacement of Unknown Words and Data Cleaning for Haitian Creole SMS Translation
    Sara Stymne

Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor
    Jonathan Weese, Juri Ganitkevitch, Chris Callison-Burch, Matt Post and Adam Lopez

DFKI Hybrid Machine Translation System for WMT 2011 - On the Integration of SMT and RBMT
    Jia Xu, Hans Uszkoreit, Casey Kennington, David Vilar and Xiaojun Zhang

CEU-UPV English-Spanish system for WMT11
    Francisco Zamora-Martinez and Maria Jose Castro-Bleda


  • Sunday, July 31, 2011 (continued)

Hierarchical Phrase-Based MT at the Charles University for the WMT 2011 Shared Task
    Daniel Zeman

12:40–14:00  Lunch

Invited Talk

14:00–15:40  Crisis MT: Developing A Cookbook for MT in Crisis Situations
             William Lewis, Robert Munro and Stephan Vogel

    15:40–16:10 Coffee

    Session 7: Translation Models

16:10–16:35  Generative Models of Monolingual and Bilingual Gappy Patterns
             Kevin Gimpel and Noah A. Smith

16:35–17:00  Extraction Programs: A Unified Approach to Translation Rule Extraction
             Mark Hopkins, Greg Langmead and Tai Vo

17:00–17:25  Bayesian Extraction of Minimal SCFG Rules for Hierarchical Phrase-based Translation
             Baskaran Sankaran, Gholamreza Haffari and Anoop Sarkar

17:25–17:50  From n-gram-based to CRF-based Translation Models
             Thomas Lavergne, Alexandre Allauzen, Josep Maria Crego and François Yvon


Proceedings of the 6th Workshop on Statistical Machine Translation, pages 1–11, Edinburgh, Scotland, UK, July 30–31, 2011. ©2011 Association for Computational Linguistics

    A Grain of Salt for the WMT Manual Evaluation∗

Ondřej Bojar, Miloš Ercegovčević, Martin Popel
Charles University in Prague
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
{bojar,popel}@ufal.mff.cuni.cz
[email protected]

Omar F. Zaidan
Department of Computer Science
Johns Hopkins University
[email protected]

    Abstract

The Workshop on Statistical Machine Translation (WMT) has become one of ACL’s flagship workshops, held annually since 2006. In addition to soliciting papers from the research community, WMT also features a shared translation task for evaluating MT systems. This shared task is notable for having manual evaluation as its cornerstone. The Workshop’s overview paper, playing a descriptive and administrative role, reports the main results of the evaluation without delving deep into analyzing those results. The aim of this paper is to investigate and explain some interesting idiosyncrasies in the reported results, which only become apparent when performing a more thorough analysis of the collected annotations. Our analysis sheds some light on how the reported results should (and should not) be interpreted, and also gives rise to some helpful recommendations for the organizers of WMT.

    1 Introduction

The Workshop on Statistical Machine Translation (WMT) has become an annual feast for MT researchers. Of particular interest is WMT’s shared translation task, featuring a component for manual evaluation of MT systems. The friendly competition is a source of inspiration for participating teams, and the yearly overview paper (Callison-Burch et al., 2010) provides a concise report of the state of the art. However, the amount of interesting data collected every year (the system outputs and, most importantly, the annotator judgments) is quite large, exceeding what the WMT overview paper can afford to analyze with much depth.

∗ This work has been supported by the grants EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic), P406/10/P259, MSM 0021620838, and DARPA GALE program under Contract No. HR0011-06-2-0001. We are grateful to our students, colleagues, and the three reviewers for various observations and suggestions.

In this paper, we take a closer look at the data collected in last year’s workshop, WMT10,1 and delve a bit deeper into analyzing the manual judgments. We focus mainly on the English-to-Czech task, as it included a diverse portfolio of MT systems, was a heavily judged language pair, and also illustrates interesting “contradictions” in the results. We try to explain such points of interest, and analyze what we believe to be the positive and negative aspects of the currently established evaluation procedure of WMT.

Section 2 examines the primary style of manual evaluation: system ranking. We discuss the interpretation of the collected judgments and the computation of annotator agreement, and document that annotators’ individual preferences may render two systems effectively incomparable. Section 3 is devoted to the impact of embedding reference translations, while Section 4 and Section 5 discuss some idiosyncrasies of other WMT shared tasks and of manual evaluation in general.

    2 The System Ranking Task

At the core of the WMT manual evaluation is the system ranking task. In this task, the annotator is presented with a source sentence, a reference translation, and the outputs of five systems over that source sentence. The instructions are kept minimal: the annotator is to rank the presented translations from best to worst. Ties are allowed, but the scale provides five rank labels, allowing the annotator to give a total order if desired.

The five assigned rank labels are submitted at once, making the 5-tuple a unit of annotation. In the following, we will call this unit a block. The blocks differ from each other in the choice of the

    1http://www.statmt.org/wmt10


Language Pair     Systems   Blocks   Labels   Comparisons   Ref ≥ others   Intra-annot. κ   Inter-annot. κ
German-English    26        1,050    5,231    10,424        0.965          0.607            0.492
English-German    19        1,407    6,866    13,694        0.976          0.560            0.512
Spanish-English   15        1,140    5,665    11,307        0.989          0.693            0.508
English-Spanish   17        519      2,591    5,174         0.935          0.696            0.594
French-English    25        837      4,156    8,294         0.981          0.722            0.452
English-French    20        801      3,993    7,962         0.917          0.636            0.449
Czech-English     13        543      2,691    5,375         0.976          0.700            0.504
English-Czech     18        1,395    6,803    13,538        0.959          0.620            0.444
Average           19        962      4,750    9,471         0.962          0.654            0.494

Table 1: Statistics on the collected rankings, quality of references and kappas across language pairs. In general, a block yields a set of five rank labels, which yields a set of $\binom{5}{2} = 10$ pairwise comparisons. Due to occasional omitted labels, the Comparisons/Blocks ratio is not exactly 10.

source sentence and the choice of the five systems being compared. A couple of tricks are introduced in the sampling of the source sentences, to ensure that a large enough number of judgments is repeated across different screens for meaningful computation of inter- and intra-annotator agreement. As for the sampling of systems, it is done uniformly – no effort is made to oversample or undersample a particular system (or a particular pair of systems together) at any point in time.

In terms of the interface, the evaluation utilizes the infrastructure of Amazon’s Mechanical Turk (MTurk),2 with each MTurk HIT3 containing three blocks, corresponding to three consecutive source sentences.

Table 1 provides a brief comparison of the various language pairs in terms of the number of MT systems compared (including the reference), the number of blocks ranked, the number of pairwise comparisons extracted from the rankings (one block with 5 systems ranked gives 10 pairwise comparisons, but occasional unranked systems are excluded), the quality of the reference (the percentage of comparisons where the reference was judged better than or equal to another system), and the κ statistic, which is a measure of agreement (see Section 2.2 for more details).4

We see that English-to-Czech, the language pair on which we focus, is not far from the average in all those characteristics except for the number of collected comparisons (and blocks), making it the second most evaluated language pair.

2 http://www.mturk.com/
3 “HIT” is an acronym for human intelligence task, which is the MTurk term for a single screen presented to the annotator.
4 We only use the “expert” annotations of WMT10, ignoring the data collected from paid annotators on MTurk, since they were not part of the official evaluation.

    2.1 Interpreting the Rank Labels

The description in the WMT overview paper says: “Relative ranking is our official evaluation metric. [Systems] are ranked based on how frequently they were judged to be better than or equal to any other system.” (Emphasis added.) The WMT overview paper refers to this measure as “≥ others”, with a variant of it called “> others” that does not reward ties.

We first note that this description is somewhat ambiguous, and an uninformed reader might interpret it in one of two different ways. For some system A, each block in which A appears includes four implicit pairwise comparisons (against the other presented systems). How is A’s score computed from those comparisons?

The correct interpretation is that A is rewarded once for each of the four comparisons in which A wins (or ties).5 In other words, A’s score is the number of pairwise comparisons in which A wins (or ties), divided by the total number of pairwise comparisons involving A. We will use “≥ others” (resp. “> others”) to refer to this interpretation, in keeping with the terminology of the overview paper.

The other interpretation is that A is rewarded only if A wins (or ties) all four comparisons. In other words, A’s score is the number of blocks in which A wins (or ties) all comparisons, divided by the number of blocks in which A appears. We will use “≥ all in block” (resp. “> all in block”) to refer to this interpretation.6

5 Personal communication with WMT organizers.
6 There is yet a third interpretation, due to a literal reading of the description, where A is rewarded at most once per block if it wins (or ties) any one of its four comparisons. This is probably less useful: it might be good at identifying the bottom tier of systems, but would fail to distinguish between all other systems.
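To make the two interpretations concrete, here is a minimal sketch that computes both scores from a list of blocks, each represented as a dictionary from system name to rank label (1 = best). The data structure and function names are illustrative assumptions, not part of the WMT tooling; passing strict=True gives the “>” variants.

```python
def score_ge_others(blocks, system, strict=False):
    """">= others" (or "> others" when strict=True): fraction of pairwise
    comparisons involving `system` that it wins (or wins/ties)."""
    good = total = 0
    for block in blocks:
        if system not in block:
            continue
        for other, rank in block.items():
            if other == system:
                continue
            total += 1
            if block[system] < rank or (not strict and block[system] == rank):
                good += 1
    return good / total if total else 0.0

def score_ge_all_in_block(blocks, system, strict=False):
    """">= all in block" (or "> all in block"): fraction of blocks containing
    `system` in which it wins (or wins/ties) every one of its comparisons."""
    wins = appearances = 0
    for block in blocks:
        if system not in block:
            continue
        appearances += 1
        others = [rank for name, rank in block.items() if name != system]
        if all(block[system] < r or (not strict and block[system] == r) for r in others):
            wins += 1
    return wins / appearances if appearances else 0.0

# Toy data: two blocks of five rank labels each (lower label = better).
blocks = [
    {"REF": 1, "A": 2, "B": 2, "C": 4, "D": 5},
    {"REF": 1, "A": 3, "B": 2, "C": 3, "D": 4},
]
print(score_ge_others(blocks, "A"))        # 0.625: A wins or ties 5 of its 8 comparisons
print(score_ge_all_in_block(blocks, "A"))  # 0.0: A never wins/ties against every other system
```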


                  REF    CU-BOJAR   CU-TECTO   EUROTRANS   ONLINEB   PC-TRANS   UEDIN
≥ others          95.9   65.6       60.1       54.0        70.4      62.1       62.2
> others          90.5   45.0       44.1       39.3        49.1      49.4       39.6
≥ all in block    93.1   32.3       30.7       23.4        37.5      32.5       28.1
> all in block    81.3   13.6       19.0       13.3        15.6      18.7       10.6

Table 2: Sentence-level ranking scores for the WMT10 English-Czech language pair. The “≥ others” and “> others” scores reproduced here exactly match the numbers published in the WMT10 overview paper. A boldfaced score marks the best system in a given row (besides the reference).

For quality control purposes, the WMT organizers embed the reference translations as a ‘system’ alongside the actual entries (the idea being that an annotator clicking randomly would be easy to detect, since they would not consistently rank the reference ‘system’ highly). This means that the reference is as likely as any other system to appear in a block, and when the score for a system A is computed, pairwise comparisons with the reference are included.

We use the publicly released human judgments7 to compute the scores of systems participating in the English-Czech subtask, under both interpretations. Table 2 reports the scores, with our “≥ others” (resp. “> others”) scores exactly matching those reported in Table 21 of the WMT overview paper. (For clarity, Table 2 is abbreviated to include only the top six systems of twelve.)

Our first suggestion is that both measures could be reported in future evaluations, since each tells us something different. The first interpretation gives partial credit for an MT system, hence distinguishing systems from each other at a finer level. This is especially important for a language pair with relatively few annotations, since “≥ others” would produce a larger number of data points (four per system per block) than “≥ all in block” (one per system per block). Another advantage of the official “≥ others” is greater robustness towards various factors like the number of systems in the competition, the number of systems in one block or the presence of the reference in the block (however, see Section 3).

As for the second interpretation, it helps identify whether or not a single system (or a small group of systems) is strongly dominant over the other systems. For the systems listed in Table 2,

    7http://statmt.org/wmt10/results.html

[Figure 1: scatter plot of “≥ All in Block” (y-axis) against “≥ Others” (x-axis), one point per system, with a separate series for each language pair and a fitted line.]

Figure 1: “≥ all in block” and “≥ others” provide very similar ordering of systems.

“> all in block” suggests its potential in the context of system combination: CU-TECTO and PC-TRANS win almost one fifth of the blocks in which they appear, despite the fact that either a reference translation or a combination system already appears alongside them. (See also Table 4 below.)

Also, note that if the ranking task were designed specifically to cater to the “≥ all in block” interpretation, it would only have two ‘rank’ labels (basically, “top” and “non-top”). In that case, annotators would spend considerably less time per block than they do now, since all they need to do is identify the top system(s) per block, without distinguishing non-top systems from each other.

Even for those interested in distinguishing non-state-of-the-art systems from each other, we point out that the “≥ all in block” interpretation ultimately gives a system ordering that is very similar to that of the official “≥ others” interpretation, even for the lower-tier systems (Figure 1).

2.2 Annotator Agreement

The WMT10 overview paper reports inter- and intra-annotator agreement over the pairwise comparisons, to show the validity of the evaluation setup and the “≥ others” metric. Agreement is quantified using the following formula:

    \kappa = \frac{P(A) - P(E)}{1 - P(E)}        (1)

where P(A) is the proportion of times two annotators are observed to agree, and P(E) is the expected proportion of times two annotators would agree by chance. Note that κ has a value of at most 1, with higher κ values indicating higher rates of agreement. The κ measure is more meaningful


[Figure 2: kappa (y-axis) plotted against source sentence length (x-axis), with curves for intra- and inter-annotator agreement, each including and excluding the reference, and a “moderate agreement” baseline.]

Figure 2: Intra-/inter-annotator agreement with/without references, across various source sentence lengths (lengths of n and n + 1 are used to plot the point at x = n). This figure is based on all language pairs.

than reporting P(A) as is, since it takes into account, via P(E), how ‘surprising’ it is for annotators to agree in the first place.

In the context of pairwise comparisons, an agreement between two annotators occurs when they compare the same pair of systems (S1, S2), and both agree on their relative ranking: either S1 > S2, S1 = S2, or S1 < S2. P(E) is then:

    P(E) = P^2(S_1 > S_2) + P^2(S_1 = S_2) + P^2(S_1 < S_2)        (2)

The value of P(E) depends on how the three outcome probabilities are estimated: the S measure assumes a uniform distribution, Scott’s π estimates a single distribution empirically from the pooled annotations, and Cohen’s κ estimates a distribution per annotator.8 Since the WMT10 overview paper assumes that all three outcomes (S1 > S2, S1 = S2, and S1 < S2) are equally likely, it is using the S measure version of Equation 1, though it does not explicitly say so – it simply calls it “the kappa coefficient” (K).

Regardless of what the measure should be called, we believe that the uniform distribution itself is not appropriate, even though it seems to model a “random clicker” adequately. In particular, and given the design of the ranking interface, 1/3 is an overestimate of P(S1 = S2) for a random clicker, and should in fact be 1/5: each system receives one of five rank labels, and for two systems to receive the same rank label, there are only five (out of 25) label pairs that satisfy S1 = S2. Therefore, with P(S1 = S2) = 1/5,

8 These three measures were later generalized to more than two annotators (Fleiss, 1971; Bartko and Carpenter, 1976). Thus, without loss of generality, our examples involve two annotators.


“≥ Others”              S       π
Inter   incl. ref.     0.487   0.454
        excl. ref.     0.439   0.403
Intra   incl. ref.     0.633   0.609
        excl. ref.     0.601   0.575

Table 3: Summary of two variants of kappa for “≥ others”: S (or K, as it is reported in the WMT10 paper) and our proposed Scott’s π. We report inter- vs. intra-annotator agreement, collected either from all comparisons (“incl. ref.”) or only from comparisons without the reference (“excl. ref.”), because it is generally easier to agree that the reference is better than the other systems. This table is based on all language pairs.

we have P(S1 > S2) = P(S1 < S2) = 2/5, and therefore P(E) = 0.36 rather than 0.333.
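Written out, the expected agreement under this five-label random-clicker model (merely restating the numbers above) is:

```latex
P(E) = P^2(S_1{>}S_2) + P^2(S_1{=}S_2) + P^2(S_1{<}S_2)
     = \left(\tfrac{2}{5}\right)^{2} + \left(\tfrac{1}{5}\right)^{2} + \left(\tfrac{2}{5}\right)^{2}
     = \tfrac{9}{25} = 0.36,
\qquad\text{compared with}\quad
3 \cdot \left(\tfrac{1}{3}\right)^{2} = \tfrac{1}{3} \approx 0.333
\quad\text{under the uniform three-outcome assumption.}
```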

Taking the discussion a step further, we actually advocate following the idea of Scott’s π, whereby the distribution of each category is estimated empirically from the actual annotation, rather than assuming a random annotator – these frequencies are easy to compute, and reflect a more meaningful P(E).9

Under this interpretation, P(S1 = S2) is calculated to be 0.168, reflecting the fraction of pairwise comparisons that correspond to a tie. (Note that this further supports the claim that setting P(S1 = S2) = 1/3 for a random clicker, as used in the WMT overview paper, is an overestimate.) This results in P(E) = 0.374, yielding, for instance, π = 0.454 for “≥ others” inter-annotator agreement, somewhat lower than κ = 0.487 (reported in Table 3).
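A small sketch of the two chance-corrected agreement computations. The observed agreement P(A) below is back-derived from the reported κ = 0.487 purely for illustration, and splitting the non-tie mass evenly between wins and losses is our own assumption:

```python
def chance_corrected(p_agree, p_expected):
    """Generic kappa-style agreement: (P(A) - P(E)) / (1 - P(E))."""
    return (p_agree - p_expected) / (1.0 - p_expected)

# S measure: the three outcomes (>, =, <) are assumed equally likely.
p_e_uniform = 3 * (1 / 3) ** 2                          # = 1/3

# Scott's pi: outcome probabilities estimated from the annotations themselves.
p_tie = 0.168                                           # empirical tie rate quoted above
p_win = p_lose = (1.0 - p_tie) / 2.0                    # assumed symmetric split
p_e_empirical = p_win ** 2 + p_tie ** 2 + p_lose ** 2   # ~0.374

# Back out P(A) from the reported inter-annotator kappa of 0.487 (illustrative only).
p_a = 0.487 * (1 - p_e_uniform) + p_e_uniform           # ~0.658
print(round(chance_corrected(p_a, p_e_uniform), 3))     # 0.487 (S measure / "K")
print(round(chance_corrected(p_a, p_e_empirical), 3))   # ~0.453, close to the 0.454 reported
```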

We do note that the difference is rather small, and that our aim is to be mathematically sound above all. Carefully defining P(E) would be important when comparing kappas across different tasks with different P(E), or when attempting to satisfy certain thresholds (as the cited 0.4 and 0.67). Furthermore, if one is interested in measuring agreement for individual annotators, such as identifying those who have unacceptably low intra-annotator agreement, the question of P(E) is quite important, since annotation behavior varies noticeably from one annotator to another. A ‘conservative’ annotator who prefers to rank systems as being tied most of the time would have a high

9 We believe that P(E) should not reflect the chance that two random annotators would agree, but the chance that two actual annotators would agree randomly. The two sound subtly related but are actually quite different.

P(E), whereas an annotator using ties moderately would have a low P(E). Hence, two annotators with equal agreement rates (P(A)) are not necessarily equally proficient, since their P(E) might differ considerably.10

2.3 The ≥ variant vs. the > variant

Even within the same interpretation of how systems could be scored, there is a question of whether or not to reward ties. The overview paper reports both variants of its measure, but does not note that there are non-trivial differences between the two orderings. Compare for example the “≥ others” ordering vs. the “> others” ordering of CU-BOJAR and PC-TRANS (Table 2), showing an unexpected swing of 7.9%:

               ≥ others   > others
CU-BOJAR       65.6       45.0
PC-TRANS       62.1       49.4

CU-BOJAR seems better under the ≥ variant, but loses out when only strict wins are rewarded. Theoretically, this could be purely due to chance, but the total number of pairwise comparisons in “≥ others” is relatively large (about 1,500 pairwise comparisons for each system), and ought to cancel such effects.

A similar pattern could be seen under the “all in block” interpretation as well (e.g. for CU-TECTO and ONLINEB). Table 4 documents this effect by looking at how often a system is the sole winner of a block. Comparing PC-TRANS and CU-BOJAR again, we see that PC-TRANS is up there with CU-TECTO and DCU-COMBO as the most frequent sole winners, winning 71 blocks, whereas CU-BOJAR is the sole winner of only 53 blocks. This is in spite of the fact that PC-TRANS actually appeared in slightly fewer blocks than CU-BOJAR (385 vs. 401).

One possible explanation is that the two variants (“≥” and “>”) measure two subtly different things about MT systems. Digging deeper into Table 2’s values, we find that CU-BOJAR is tied with another system 65.6 − 45.0 = 20.4% of the time, while PC-TRANS is tied with another system only 62.1 − 49.4 = 12.7% of the time. So it seems that PC-TRANS’s output is noticeably different from another system’s more frequently than CU-BOJAR’s, which reduces the number of times that annotators

10 Who’s more impressive: a psychic who correctly predicts the result of a coin toss 50% of the time, or a psychic who correctly predicts the result of a die roll 50% of the time?


Blocks   Sole Winner
305      Reference
73       CU-TECTO
71       PC-TRANS
70       DCU-COMBO
57       RWTH-COMBO
54       ONLINEB
53       CU-BOJAR
46       EUROTRANS
41       UEDIN
41       UPV-COMBO
175      One of eight other systems
409      No sole winner
1,395    Total English-to-Czech blocks

Table 4: A breakdown of the 1,395 blocks for the English-Czech task, according to which system (if any) is the sole winner. On average, a system appears in 388 blocks.

mark PC-TRANS as tied with another system.11 In that sense, the “≥” ranking is hurting PC-TRANS, since it does not benefit from its small number of ties. On the other hand, the “>” variant would not reward CU-BOJAR for its large number of ties.

The “≥ others” score may be artificially boosted if several very similar systems (and therefore likely to be “tied”) take part in the evaluation.12

One possible solution is to completely disregard ties and calculate the final score as wins/(wins + losses). We recommend using this score instead of “≥ others” ((wins + ties)/(wins + ties + losses)), which is biased toward often-tied systems, and “> others” (wins/(wins + ties + losses)), which is biased toward systems with few ties.
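A compact way to see how the three variants treat ties; the win/tie/loss counts below are hypothetical, loosely modeled on the CU-BOJAR and PC-TRANS percentages in Table 2:

```python
def ge_others(wins, ties, losses):
    return (wins + ties) / (wins + ties + losses)   # ">= others": ties count as wins

def gt_others(wins, ties, losses):
    return wins / (wins + ties + losses)            # "> others": ties are effectively losses

def ignore_ties(wins, ties, losses):
    return wins / (wins + losses)                   # proposed: ties are simply dropped

# Hypothetical counts per 1,000 comparisons for an often-tied and a rarely-tied system.
for name, c in [("often tied",  dict(wins=450, ties=204, losses=346)),
                ("rarely tied", dict(wins=494, ties=127, losses=379))]:
    print(name, round(ge_others(**c), 3), round(gt_others(**c), 3), round(ignore_ties(**c), 3))
```

With these made-up counts, “≥ others” favours the often-tied system, “> others” favours the rarely-tied one, and the tie-free score ranks the two almost identically.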

2.4 Surprise? Does the Number of Evaluations Affect a System’s Score?

When examining the system scores for the English-Czech task, we noticed a surprising pattern: it seemed that the more times a system is sampled to be judged, the lower its “≥ others” score (“≥ all in block” behaving similarly). A scatter plot of a system’s score vs. the number of blocks in which it appears (Figure 3) makes the pattern obvious.

We immediately wondered if the pattern holds in other language pairs. We measured Pearson’s correlation coefficient within each language pair, reported in Table 5. As it turns out, English-

11 Indeed, PC-TRANS is a commercial system (manually) tuned over a long period of time and based on resources very different from what other participants in WMT use.
12 In the preliminary WMT11 results, this seems to happen to four Moses-like systems (UEDIN, CU-BOJAR, CU-MARECEK and CU-TAMCHYNA) which have a better “≥ others” score but a worse “> others” score than CU-TECTO.

Source    Target    Correlation of Block Count vs. “≥ Others”
English   Czech     -0.558
English   Spanish   -0.434
Czech     English   -0.290
Spanish   English   -0.240
English   French    -0.227
English   German    -0.161
French    English   -0.024
German    English    0.146
Overall             -0.092

Table 5: Pearson’s correlation between the number of blocks where a system was ranked and the system’s “≥ others” score. (The reference itself is not included among the considered systems.)
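The correlation itself is routine to compute; a minimal sketch, assuming per-system pairs of (blocks judged, “≥ others” score) are available — the numbers below are made up:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up per-system (blocks judged, ">= others" score) pairs for one language pair.
block_counts = [401, 385, 392, 410, 376, 388]
scores = [65.6, 62.1, 60.1, 54.0, 70.4, 62.2]
print(round(pearson(block_counts, scores), 3))
```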

[Figure 3: scatter plot of “≥ Others” score (y-axis) against number of judgments (x-axis), one point per English-Czech system, with a fitted line.]

Figure 3: A plot of “≥ others” system score vs. times judged, for English-Czech.

Czech happened to be the one language pair where the ‘correlation’ is strongest, with only English-Spanish also having a somewhat strong correlation. Overall, though, there is a consistent trend that can be seen across the language pairs. Could it really be the case that the more often a system is judged, the worse its score gets?

Examining plots for the other language pairs makes things a bit clearer. Consider for example the plot for English-Spanish (Figure 4). As one would hope, the data points actually come together to form a cloud, indicating a lack of correlation. The reason that a hint of a correlation exists is the presence of two outliers in the bottom right corner. In other words, the very worst systems are, indeed, the ones judged quite often. We observed this pattern in several other language pairs as well.

The correlation naturally does not imply causation. We are still not sure how to explain the artifact. A subtle possibility lies in the MTurk interface: annotators have the choice to accept a HIT or skip it before actually providing their labels.

[Scatter plot omitted; x-axis: Number of judgments (≈130–170), y-axis: “≥ Others” (10–80); one point per English-Spanish system and a fitted line a*x+b.]

Figure 4: A plot of “≥ others” system score vs. times judged, for English-Spanish.

A subtle possibility lies in the MTurk interface: annotators have the choice to accept a HIT or skip it before actually providing their labels. It might be the case that some annotators are more willing to accept HITs when there is an obviously poor system (since that would make their task somewhat easier), and more prone to skipping HITs where the systems seem hard to distinguish from each other. So there might be a causation effect after all, but in the reverse order: a system gets judged more often if it is a bad system.13 A suggestion from the reviewers is to run a pilot annotation with deliberate inclusion of a poor system among the ranked ones.

    2.5 Issues of Pairwise Judgments

The WMT overview paper also provides pairwise system comparisons: each cell in Table 6 indicates the percentage of pairwise comparisons between the two systems where the system in the column was ranked better (>) than the system in the row. For instance, there are 81 ranking responses where both CU-TECTO and CU-BOJAR were present and indeed ranked14 among the 5 systems in the block. In 37 (45.7%) of the cases, CU-TECTO was ranked better, in 29 (35.8%) CU-BOJAR was ranked better, and there was a tie in the remaining 15 (18.5%) cases. The ties are not explicitly shown in Table 6 but they are implied by the total of 100%. The cell is in bold where there was a win in the pairwise comparison, so 45.7 is bold in our example.
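The extraction of such pairwise statistics from the 5-way ranking blocks can be sketched as follows; the block representation (a dict from system name to rank) is our own illustration, not the WMT data format:

```python
from collections import Counter

def pairwise_stats(blocks, sys_a, sys_b):
    """Count how often sys_a beats sys_b (lower rank = better) across blocks.

    Each block is a dict {system_name: rank}; blocks where either system is
    missing a rank are skipped, mirroring footnote 14."""
    counts = Counter()
    for block in blocks:
        if sys_a in block and sys_b in block:
            ra, rb = block[sys_a], block[sys_b]
            key = "a_wins" if ra < rb else "b_wins" if rb < ra else "ties"
            counts[key] += 1
    total = sum(counts.values())
    pct = {k: round(100.0 * v / total, 1) for k, v in counts.items()} if total else {}
    return pct, total

# Toy blocks, not WMT data:
blocks = [{"cu-tecto": 1, "cu-bojar": 2},
          {"cu-tecto": 3, "cu-bojar": 3},
          {"cu-tecto": 4, "cu-bojar": 2}]
print(pairwise_stats(blocks, "cu-tecto", "cu-bojar"))
```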

An interesting “discrepancy” in Table 6 is that CU-TECTO wins pairwise comparisons with CU-BOJAR and UEDIN but it scores worse than them in the official “≥ others”, cf. Table 2.

13 No pun intended!
14 The users sometimes did not fill in any rank for a system. Such cases are ignored.

             REF   CU-BOJAR  CU-TECTO  EUROTRANS  ONLINEB  PC-TRANS  UEDIN
REF            -       4.3       4.3        5.1      3.8       3.6    2.3
CU-BOJAR    87.1         -      45.7       28.3     44.4      39.5   41.1
CU-TECTO    88.2      35.8         -       38.0     55.8      44.0   36.0
EUROTRANS   88.5      60.9      46.8          -     50.7      53.8   48.6
ONLINEB     91.2      31.1      29.1       32.8        -      43.8   39.3
PC-TRANS    88.0      45.3      42.9       28.6     49.3         -   36.6
UEDIN       94.3      39.3      44.2       31.9     32.1      49.5      -

Table 6: Pairwise comparisons extracted from sentence-level rankings of the WMT10 English-Czech News Task. Re-evaluated to reproduce the numbers published in the WMT10 overview paper. Bold in column A and row B means that system A is pairwise better than system B.

Similarly, UEDIN outperformed ONLINEB in the pairwise comparisons but it was ranked worse in both the > and the ≥ official comparisons.

In the following, we focus on the CU-BOJAR (B) and CU-TECTO (T) pair because they are interesting competitors on their own. They both use the same parallel corpus for lexical mapping but operate very differently: CU-BOJAR is based on Moses while CU-TECTO transfers at a deep syntactic layer and generates target text which is more or less grammatically correct but suffers in lexical choice.

2.5.1 Different Set of Sentences

The mismatch in the outcomes of “≥ others” and pairwise comparisons could be caused by a different set of sentences. The pairwise ranking is collected from the set of blocks where both CU-BOJAR and CU-TECTO appeared (and were indeed ranked). Each of the systems however competes in other blocks as well, which contributes to the official “≥ others”.

The set of sentences underlying the comparison is thus very different and, more importantly, the basis for pairwise comparisons is much smaller than the basis of the official “≥ others” interpretation. The outcome of the official interpretation however depends on the random set of systems your system was compared to. In our case, it is impossible to distinguish whether CU-TECTO simply had bad luck with the sentences and systems it was compared to when CU-BOJAR was not in the block, and/or whether the 81 blocks do not provide a reliable picture.

2.5.2 Pairwise Judgments Unreliable


                       Author of B says:
                   B>T   T>B  both fine  both wrong  Total
Author of T says:
  B>T                9     -          1           1     11
  T>B                2    13          -           3     18
  both fine          2     -          2           3      7
  both wrong        10     5          1          11     27
  Total             23    18          4          18     63

Table 7: Additional annotation of 63 CU-BOJAR (B) vs. CU-TECTO (T) sentences by two annotators.

            Better        Both
Annotator    B     T   fine  wrong
A           24    23      5     11
C           10    12      5     36
D           32    20      2      9
M           11    18      7     27
O           23    18      4     18
Z           25    27      2      9
Total      125   118     25    110

Table 8: Blurry picture of pairwise rankings of CU-BOJAR vs. CU-TECTO. Wins in bold.

To complement the WMT10 rankings for the two systems and avoid the possibly lower reliability due to 5-fold ranking instead of a targeted comparison, we asked the main authors of both CU-BOJAR and CU-TECTO to carry out a blind pairwise comparison on the exact set of 63 sentences appearing across the 81 blocks in which both systems were ranked. As the totals in Table 7 would suggest, each author unwittingly recognized his system and slightly preferred it. The details however reveal a subtler reason for the low agreement: one of the annotators was less picky about MT quality and accepted 10+5 sentences completely rejected by the other annotator. In total, these two annotators agreed on 9 + 13 + 2 + 11 = 35 (56%) of cases and their pairwise κ is 0.387.

A further annotation of these 63 sentences by four more people completes the blurry picture: the pairwise κ for each pair of our six annotators ranges from 0.242 to 0.615 with the average 0.407±0.106. The multi-annotator κ (Fleiss, 1971) is 0.394 and all six annotators agree on a single label only in 24% of cases. The agreement is not better even if we merge the categories “Both fine” and “Both wrong” into a single one: the pairwise κ ranges from 0.212 to 0.620 with the average 0.405±0.116, and the multi-annotator κ is 0.391. Individual annotations are given in Table 8.
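For reference, the multi-annotator agreement figure here is Fleiss' κ; a compact sketch of the computation (with a made-up toy label matrix rather than the actual 63-sentence annotations) is given below:

```python
from collections import Counter

def fleiss_kappa(items):
    """Fleiss' kappa.  `items` is a list; each element is a dict mapping a label
    (e.g. 'B>T', 'T>B', 'both fine', 'both wrong') to the number of annotators
    who chose it for that item.  Every item must have the same number of raters."""
    n_items = len(items)
    n_raters = sum(items[0].values())

    # Observed agreement per item, then averaged.
    p_i = [(sum(c * c for c in item.values()) - n_raters)
           / (n_raters * (n_raters - 1)) for item in items]
    p_bar = sum(p_i) / n_items

    # Chance agreement from the marginal label distribution.
    totals = Counter()
    for item in items:
        totals.update(item)
    p_e = sum((count / (n_items * n_raters)) ** 2 for count in totals.values())

    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 items labeled by 6 annotators each (not the real data).
toy = [{"B>T": 4, "T>B": 2},
       {"both wrong": 6},
       {"B>T": 1, "T>B": 3, "both fine": 2}]
print(fleiss_kappa(toy))
```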

Naturally, the set of these 63 sentences is not a representative sample.

SRC        It's not completely ideal.
REF        Není to úplně ideální.            Ranks
PC-TRANS   To není úplně ideální.             2   5
CU-BOJAR   To není úplně ideální.             5   4

Table 9: Two rankings by the same annotator.

SRC    FCC awarded a tunnel in Slovenia for 64 million
REF    FCC byl přidělen tunel ve Slovinsku za 64 milionů
Gloss  FCC was awarded a tunnel in Slovenia for 64 million

HYP1   FCC přidělil tunel ve Slovinsku za 64 miliónů
HYP2   FCC přidělila tunel ve Slovinsku za 64 milionů
Gloss  FCC awarded (masc./fem.) a tunnel in Slovenia for 64 million

Figure 5: A poor reference translation confuses human judges. The SRC and REF differ in the active/passive form, attributing completely different roles to “FCC”.

Even if one of the systems actually won, such an observation could not have been generalized to other test sets. The purpose of the exercise was to check whether we are at all able to agree on which of the systems translates this specific set of sentences better. As it turns out, even a simple pairwise ranking can fail to provide an answer because different annotators simply have different preferences.

Finally, Table 9 illustrates how poor the WMT10 rankings can be. The exact same string produced by two systems was ranked differently each time – by the same annotator. (The hypothesis is a plausible translation; only the information structure of the sentence is slightly distorted, so the translation may not fit well in the surrounding context.)

3 The Impact of the Reference Translation

    3.1 Bad Reference Translations

Figure 5 illustrates the impact of a poor reference translation on manual ranking as carried out in Section 2.5.2. Of our six independent annotations, three annotators marked the hypotheses as “both fine” given the match with the source and three annotators marked them as “both wrong” due to the mismatch with the reference. Given the construction of the WMT test set, this particular sentence comes from a Spanish original and it was most likely translated directly to both English and Czech.


Source    Target     Correlation of Reference vs. “≥ others”
Spanish   English       0.341
English   French        0.164
French    English       0.098
German    English       0.088
Czech     English      -0.041
English   Czech        -0.145
English   Spanish      -0.411
English   German       -0.433
Overall                -0.107

Table 10: Pearson's correlation of the relative percentage of blocks where the reference was included in the ranking and the final “≥ others” of the system (the reference itself is not included among the considered systems).

[Scatter plot omitted; x-axis: Relative presence of the reference (≈0.19–0.26), y-axis: “≥ Others” (25–75); one point per English-German system and a fitted line a*x+b.]

Figure 6: Correlation of the presence of the reference and the official “≥ others” for the English-German evaluation.

3.2 Reference Can Skew Pairwise Comparisons

The exact set of competing systems in each 5-fold ranking in the WMT10 evaluation is random. The “≥ others” however is affected by this: a system may suffer more losses if often compared to the reference, and similarly it may benefit from being compared to a poor competitor.

To check this, we calculate the correlation between the relative presence of the reference among the blocks where a system was judged and the system's official “≥ others” score. Across languages, there is almost no correlation (Pearson's coefficient: −0.107). However, for some language pairs the correlation is apparent, as listed in Table 10. Negative correlation means: the more often the system was compared along with the reference, the worse the score of the system.

Figure 6 plots the extreme case of the English-German evaluation.

Source    Target     Min   Avg±StdDev   Max
English   Czech       40      65±19     115
English   French      40      66±17     110
English   German      10      40±16      80
English   Spanish     30      54±15      85
Czech     English      5      38±13      60
French    English      5      37±15      70
German    English     10      32±12      65
Spanish   English     35      56±11      70

Table 11: The number of post-edits per system for each language pair, to complement Figure 3 (page 12) of the WMT10 overview paper.

    4 Other WMT10 Tasks

4.1 Blind Post-Editing Unreliable

WMT often carries out one more type of manual evaluation: “Editing the output of systems without displaying the source or a reference translation, and then later judging whether edited translations were correct.” (Callison-Burch et al., 2010). We call the evaluation “blind post-editing” for short.

We feel that blind post-editing is more informative than system ranking. First, it constitutes a unique comprehensibility test, and after all, MT should aim at comprehensible output in the first place. Second, blind post-editing can be further analyzed to search for specific errors in system output; see Bojar (2011) for a preliminary study.

Unfortunately, the amount of post-edits collected in WMT10 varied a lot across systems and language pairs. Table 11 provides the minimum, average and maximum number of post-edits of outputs of a particular MT system. We see that e.g. while English-to-Czech has many judgments of this kind per system, Czech-to-English is one of the worst supported directions.

It is not surprising that conclusions based on 5 observations can be extremely deceiving. For instance, CU-BOJAR seems to produce 60% of outputs comprehensible (and thus wins in Figure 3 on page 12 in the WMT overview paper), far better than CMU. This is not in line with the ranking results, where both rank equally (Table 5 on page 10 in the WMT overview paper). In fact, CU-BOJAR was post-edited 5 times and 3 of these post-edits were acceptable, while CMU was post-edited 30 times and 5 of these post-edits were acceptable.
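One way to see how little five observations tell us is to attach a confidence interval to each acceptability rate. The Wilson score interval below is our own illustration, not part of the WMT methodology:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)

# 3 acceptable post-edits out of 5 vs. 5 out of 30:
print(wilson_interval(3, 5))    # roughly (0.23, 0.88) -- almost uninformative
print(wilson_interval(5, 30))   # roughly (0.07, 0.34)
```

The two intervals overlap heavily, so the apparent 60% vs. 17% gap says very little.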

4.2 A Remark on System Combination Task

One result of WMT10 not observed in previous years was that system combinations indeed performed better than individual systems.


             Dev Set      Test Set     Diff
Sentences        455          2034
GOOGLE     17.32±1.25    16.76±0.60      ↘
BOJAR      16.00±1.15    16.90±0.61      ↗
TECTOMT    11.48±1.04    13.19±0.58      ↗
PC-TRANS   10.24±0.92    10.84±0.46      ↗
EUROTRAN    9.64±0.92    11.04±0.48      ↗

Table 12: BLEU scores of a sample of five systems in the English-to-Czech combination task.

Previous years failed to show this clearly, because Google Translate used to be included among the combined systems, making it hard to improve on. In WMT10, Google Translate was excluded from the system combination task (except for translations involving Czech, where it was accidentally included).

Our Table 12 provides an additional explanation why the presence of Google among the combined systems leads to inconclusive results. While the test set was easier (based on BLEU) than the development set for most systems, it was much harder for Google. All system combinations were thus likely to overfit and select Google n-grams most often. Without access to Google's powerful language models, the combination systems were likely to underperform Google in the final fluency of the output.

    5 Further Issues of Manual Evaluation

We have already seen that the comprehensibility test by blind post-editing provides a different picture of the systems than the official ranking. Berka et al. (2011) introduced a third, “quiz-based evaluation”. The quiz-like evaluation used the English-to-Czech WMT10 systems, applied to different texts: short text snippets were translated and annotators were asked to answer three yes/no questions complementing each snippet. The order of the systems was rather different from the official WMT10 results: CU-TECTO won the quiz-based evaluation despite being the fourth in WMT10.

Because the texts were different in WMT10 and the quiz-based evaluation, we asked a small group of annotators to apply the ranking technique on the text snippets. While the setting is not exactly comparable, the WMT10 ranking was confirmed: CU-TECTO was again among the lowest-scoring systems and Google won the ranking.

Bojar (2011) applies the error-flagging manual evaluation by Vilar et al. (2006) to four systems of the WMT09 English-to-Czech task. Again, the overall order of the systems is somewhat different when ranked by the number of errors flagged.

Farrús et al. (2010) use a coarser but linguistically motivated error classification for Catalan-Spanish and suggest that differences in ranking are caused by annotators treating some types of errors as more serious.

In short, different types of manual evaluations lead to different results even when identical systems and texts are evaluated.

    6 Conclusion

We took a deeper look at the results of the WMT10 manual evaluation, and based on our observations, we have some recommendations for future evaluations:

• We propose to use a score which ignores ties instead of the official “≥ others” metric which rewards ties and “> others” which penalizes ties. Another score, “≥ all in block”, could help identify which systems are more dominant.

• Inter-annotator agreement decreases dramatically with sentence length; we recommend including fewer sentences per block, at least for longer sentences.

• We suggest agreement be measured based on an empirical estimate of P(E), or at least using the more correct random-clicking P(E) = 0.36 (a short sketch of where 0.36 comes from follows this list).

• There is evidence of a negative correlation between the number of times a system is judged and its score; we recommend a deeper analysis of this issue.

• We recommend the reference be sampled at a lower rate than other systems, so as to play a smaller role in the evaluation. We also recommend better quality control over the production of the references.
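For readers wondering where the 0.36 figure plausibly comes from, here is a small sketch; it assumes (our reading, not spelled out in this section) that two annotators independently assign uniformly random ranks 1–5 to a pair of systems:

```python
# Chance agreement for pairwise comparisons under uniformly random rank clicking.
# Assumption: each annotator picks a rank 1-5 uniformly at random for each of
# the two systems being compared.
p_tie    = 5 / 25    # both systems receive the same rank
p_better = 10 / 25   # system A strictly better (symmetrically 10/25 for worse)

p_chance_agreement = p_better ** 2 + p_better ** 2 + p_tie ** 2
print(p_chance_agreement)  # 0.36, versus 1/3 for three equally likely outcomes
```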

And to the readers of the WMT overview paper, we point out:

• Pairwise comparisons derived from 5-fold rankings are sometimes unreliable. Even a targeted pairwise comparison of two systems can shed little light as to which is superior.

• The acceptability of post-edits is sometimes very unreliable due to the low number of observations.


References

R. Artstein and M. Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.

John J. Bartko and William T. Carpenter. 1976. On the methods and theory of reliability. Journal of Nervous and Mental Disease, 163(5):307–317.

E. M. Bennett, R. Alpert, and A. C. Goldstein. 1954. Communications through limited questioning. Public Opinion Quarterly, 18(3):303–308.

Jan Berka, Martin Černý, and Ondřej Bojar. 2011. Quiz-Based Evaluation of Machine Translation. Prague Bulletin of Mathematical Linguistics, 95:77–86, March.

Ondřej Bojar. 2011. Analyzing Error Types in English-Czech Machine Translation. Prague Bulletin of Mathematical Linguistics, 95:63–76, March.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan. 2010. Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53, Uppsala, Sweden, July. Association for Computational Linguistics.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Barbara Di Eugenio and Michael Glass. 2004. The kappa statistic: A second look. Computational Linguistics, 30(1):95–101.

J. L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.

Klaus Krippendorff. 1980. Content Analysis: An Introduction to Its Methodology. Sage Publications, Beverly Hills, CA. Chapter 12.

J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33:159–174.

Mireia Farrús, Marta R. Costa-jussà, José B. Mariño, and José A. R. Fonollosa. 2010. Linguistic-based evaluation criteria to identify statistical machine translation errors. In Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT'10), pages 167–173, May.

William A. Scott. 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3):321–325.

David Vilar, Jia Xu, Luis Fernando D'Haro, and Hermann Ney. 2006. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697–702, Genoa, Italy, May.



    A Lightweight Evaluation Framework for Machine Translation Reordering

David Talbot1 and Hideto Kazawa2 and Hiroshi Ichikawa2
Jason Katz-Brown2 and Masakazu Seno2 and Franz J. Och1

1 Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043
{talbot, och}@google.com

2 Google Japan, Roppongi Hills Mori Tower, 6-10-1 Roppongi, Tokyo 106-6126
{kazawa, ichikawa}@google.com, {jasonkb, seno}@google.com

    Abstract

Reordering is a major challenge for machine translation between distant languages. Recent work has shown that evaluation metrics that explicitly account for target language word order correlate better with human judgments of translation quality. Here we present a simple framework for evaluating word order independently of lexical choice by comparing the system's reordering of a source sentence to reference reordering data generated from manually word-aligned translations. When used to evaluate a system that performs reordering as a preprocessing step, our framework allows the parser and reordering rules to be evaluated extremely quickly without time-consuming end-to-end machine translation experiments. A novelty of our approach is that the translations used to generate the reordering reference data are generated in an alignment-oriented fashion. We show that how the alignments are generated can significantly affect the robustness of the evaluation. We also outline some ways in which this framework has allowed our group to analyze reordering errors for English to Japanese machine translation.

    1 Introduction

Statistical machine translation systems can perform poorly on distant language pairs such as English and Japanese. Reordering errors are a major source of poor or misleading translations in such systems (Isozaki et al., 2010). Unfortunately the standard evaluation metrics used by the statistical machine translation community are relatively insensitive to the long-distance reordering phenomena encountered when translating between such languages (Birch et al., 2010).

The ability to rapidly evaluate the impact of changes on a system can significantly accelerate the experimental cycle. In a large statistical machine translation system, we should ideally be able to experiment with separate components without retraining the complete system. Measures such as perplexity have been successfully used to evaluate language models independently in speech recognition, eliminating some of the need for end-to-end speech recognition experiments. In machine translation, alignment error rate has been used with some mixed success to evaluate word-alignment algorithms, but no standard evaluation frameworks exist for other components of a machine translation system (Fraser and Marcu, 2007).

Unfortunately, BLEU (Papineni et al., 2001) and other metrics that work with the final output of a machine translation system are both insensitive to reordering phenomena and relatively time-consuming to compute: changes to the system may require the realignment of the parallel training data, extraction of phrasal statistics and translation of a test set. As training sets grow in size, the cost of end-to-end experimentation can become significant. However, it is not clear that measurements made on any single part of the system will correlate well with human judgments of the translation quality of the whole system.

Following Collins et al. (2005a) and Wang (2007), Xu et al. (2009) showed that when translating from English to Japanese (and to other SOV languages such as Korean and Turkish) applying reordering as a preprocessing step that manipulates a source sentence parse tree can significantly outperform state-of-the-art phrase-based and hierarchical machine translation systems. This result is corroborated by Birch et al. (2009), whose results suggest that both phrase-based and hierarchical translation systems fail to capture long-distance reordering phenomena.

In this paper we describe a lightweight framework for measuring the quality of the reordering components in a machine translation system. While our framework can be applied to any translation system in which it is possible to derive a token-level alignment from the input source tokens to the output target tokens, it is of particular practical interest when applied to a system that performs reordering as a preprocessing step (Xia and McCord, 2004). In this case, as we show, it allows for extremely rapid and sensitive analysis of changes to the parser, reordering rules and other reordering components.

In our framework we evaluate the reordering proposed by a system separately from its choice of target words by comparing it to a reference reordering of the sentence generated from a manually word-aligned translation. Unlike previous work (Isozaki et al., 2010), our approach does not rely on the system's output matching the reference translation lexically. This makes the evaluation more robust as there may be many ways to render a source phrase in the target language and we would not wish to penalize one that simply happens not to match the reference.

In the next section we review related work on reordering for translation between distant language pairs and automatic approaches to evaluating reordering in machine translation. We then describe our evaluation framework including certain important details of how our reference reorderings were created. We evaluate the framework by analyzing how robustly it is able to predict improvements in subjective translation quality for an English to Japanese machine translation system. Finally, we describe ways in which the framework has facilitated development of the reordering components in our system.

    2 Related Work

    2.1 Evaluating Reordering

The ability to automatically evaluate machine translation output has driven progress in statistical machine translation; however, shortcomings of the dominant metric, BLEU (Papineni et al., 2001), particularly with respect to reordering, have long been recognized (Callison-Burch and Osborne, 2006). Reordering has also been identified as a major factor in determining the difficulty of statistical machine translation between two languages (Birch et al., 2008), hence BLEU scores may be most unreliable precisely for those language pairs for which statistical machine translation is most difficult (Isozaki et al., 2010).

There have been many results showing that metrics that account for reordering are better correlated with human judgements of translation quality (Lavie and Denkowski, 2009; Birch and Osborne, 2010; Isozaki et al., 2010). Examples given in Isozaki et al. (2010), where object and subject arguments are reversed in a Japanese to English statistical machine translation system, demonstrate how damaging reordering errors can be, and it should therefore not come as a surprise that word order is a strong predictor of translation quality; however, there are other advantages to be gained by focusing on this specific aspect of the translation process in isolation.

One problem for all automatic evaluation metrics is that multiple equally good translations can be constructed for most input sentences, and typically our reference data will contain only a small fraction of these. Equally good translations for a sentence may differ both in terms of lexical choice and word order. One of the potential advantages of designing a metric that looks only at word order is that it may, to some extent, factor out variability along the dimension of lexical choice. Previous work on automatic evaluation metrics that focus on reordering, however, has not fully exploited this.

The evaluation metrics proposed in Isozaki et al. (2010) compute a reordering score by comparing the ordering of unigrams and bigrams that appear in both the system's translation and the reference. These scores are therefore liable to overestimate the reordering quality of sentences that were poorly translated. While Isozaki et al. (2010) does propose a work-around to this problem which combines the reordering score with a lexical precision term, this clearly introduces a bias in the metric whereby poor translations are evaluated primarily on their lexical choice and good translations are evaluated more on the basis of their word order. In our experience word order is particularly poor in those sentences that have the lowest lexical overlap with reference translations; hence we would like to be able to compute the quality of reordering in all sentences independently of the quality of their lexical choice.

Birch and Osborne (2010) are closer to our approach in that they use word alignments to induce a permutation over the source sentence. They compare a source-side permutation generated from a word alignment of the reference translation with one generated from the system's, using various permutation distances. However, Birch and Osborne (2010) only demonstrate that these metrics are correlated with human judgements of translation quality when combined with BLEU score and hence take lexical choice into account.

Birch et al. (2010) present the only results we are aware of that compute the correlation between human judgments of translation quality and a reordering-only metric independently of lexical choice. Unfortunately, the experimental set-up there is somewhat flawed. The authors ‘undo’ reorderings in their reference translations by permuting the reference tokens and presenting the permuted translations to human raters. While many machine translation systems (including our own) assume that reordering and translation can be factored into separate models, e.g. (Xia and McCord, 2004), and perform these two operations in separate steps, the latter conditioned on the former, Birch et al. (2010) are making a much stronger assumption when they perform these simulations: they are assuming that lexical choice and word order are entirely independent. It is easy to find cases where this assumption does not hold, and we would in general be very surprised if a similar change in the reordering component in our system did not also result in a change in the lexical choice of the system; an effect which their experiments are unable to model.

Another minor difference between our evaluation framework and (Birch et al., 2010) is that we use a reordering score that is based on the minimum number of chunks into which the candidate and reference permutations can be concatenated, similar to the reordering component of METEOR (Lavie and Denkowski, 2009). As we show, this is better correlated with human judgments of translation quality than Kendall's τ. This may be due to the fact that it counts the number of ‘jumps’ a human reader has to make in order to parse the system's order if they wish to read the tokens in the reference word order. Kendall's τ, on the other hand, penalizes every pair of words that are in the wrong order and hence has a quadratic (all-pairs) flavor, which in turn might explain why Birch et al. (2010) found that the square-root of this quantity was a better predictor of translation quality.
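The following sketch contrasts the two quantities on toy permutations of source positions, assuming for illustration that the reference order is simply 0..n−1; it is a simplified stand-in, not the authors' implementation, and the Kendall figure is reported as a normalized distance (fraction of discordant pairs) rather than the τ coefficient itself:

```python
from itertools import combinations

def chunk_count(candidate):
    """Number of maximal runs of consecutive reference positions in the
    candidate permutation (fewer chunks = fewer 'jumps' for the reader)."""
    chunks = 1
    for prev, cur in zip(candidate, candidate[1:]):
        if cur != prev + 1:
            chunks += 1
    return chunks

def kendall_tau_distance(candidate):
    """Fraction of position pairs that are out of order w.r.t. the reference."""
    n = len(candidate)
    discordant = sum(1 for i, j in combinations(range(n), 2)
                     if candidate[i] > candidate[j])
    return discordant / (n * (n - 1) / 2)

# Candidate reorderings of a 6-token source, expressed as reference positions:
for cand in ([0, 1, 2, 3, 4, 5],    # identical order: 1 chunk, distance 0.0
             [3, 4, 5, 0, 1, 2],    # one long-distance swap: 2 chunks, 0.6
             [5, 4, 3, 2, 1, 0]):   # fully reversed: 6 chunks, distance 1.0
    print(cand, chunk_count(cand), kendall_tau_distance(cand))
```

The long-distance swap illustrates the difference: it costs only one extra chunk (one jump) but a large fraction of discordant pairs under the Kendall-style measure.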

    2.2 Evaluation Reference Data

To create the word-aligned translations from which we generate our reference reordering data, we used a novel alignment-oriented translation method. The method (described in more detail below) seeks to generate reference reorderings that a machine translation system might reasonably be expected to achieve. Fox (2002) has analyzed the extent to which translations seen in a parallel corpus can be broken down into clean phrasal units: they found that most sentence pairs contain examples of reordering that violate phrasal cohesion, i.e. the corresponding words in the target language are not completely contiguous or solely aligned to the corresponding source phrase. These reordering phenomena are difficult for current statistical translation models to learn directly. We therefore deliberately chose to create reference data that avoids these phenomena as much as possible by having a single annotator generate both the translation and its word alignment. Our word-aligned translations are created with a bias towards