taus guidelines postediting productivity

7/28/2019 TAUS Guidelines Postediting Productivity


TAUS Best Practice Guidelines

Measuring Post-Editing Productivity

December 2012

Small-scale, Short-term Post-Editing Productivity Tests

Post-editing productivity measurement applies to scenarios where you might wish to use MT as a translator productivity tool. Generally, small-scale productivity tests should be used if you are thinking about getting started with MT in a particular language pair, and wondering whether you should invest further effort. The productivity tests will help you understand your potential ROI per language pair/content type.

You may also undertake such tests periodically, say annually, to get an indication of improvements (or not) in productivity.

Small-scale, short-term tests, by their nature, gather less data and are less rigorous than the larger-scale, long-term ones for which we provide a separate set of best practice guidelines.

TAUS members are able to use productivity testing and human evaluation tools at tauslabs.com, a neutral and independent environment. These ensure that the best practices outlined below are applied, with the additional benefit of automated reporting.

Design

Compare apples with apples, not apples with oranges

Productivity measures are flawed when they don't compare like with like. When comparing post-editing against translation, it should be clear whether or not "translation" means "TEP" (Translation, Edit, Proof) and whether or not post-editing includes final proofing. If translators have access to terminology, so too should post-editors.



Employ at least three post-editors per target language

Productivity will vary by individual. By including a number of post-editors with varying translation and post-editing experience, you will get a more accurate average measurement.

Exercise control over your participant profile

Engage the people who will actually do the post-editing in live projects.

Include appropriate content type

MT engines will have varying degrees of success with different content types. Test each content type you wish to machine translate. For SMT engines the training data should not form part of the test set. Out-of-domain test data should not be tested on domain-specific engines.
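The requirement that training data should not form part of the test set can be checked mechanically before a test begins. The following is a minimal sketch, assuming both sets are available as lists of source segments. The function names and the normalization step (lowercasing, collapsing whitespace) are illustrative assumptions, not something the guidelines prescribe.

```python
def normalize(segment: str) -> str:
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(segment.lower().split())


def find_leaked_segments(training_segments, test_segments):
    """Return the test segments that also occur in the training data."""
    seen = {normalize(s) for s in training_segments}
    return [s for s in test_segments if normalize(s) in seen]


# Hypothetical example data.
training = ["Press the power button.", "Open the File menu."]
test = ["Open  the file menu.", "Remove the battery cover."]

leaked = find_leaked_segments(training, test)
print(leaked)
```

Any segment this reports would be removed from the test set before measurement. Exact matching after normalization only catches verbatim reuse; near-duplicates would need a fuzzy comparison.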

Include a sufficient number of words

For each content type and target language, we recommend including at least 300 segments of MT output for short-term tests. The more words included, the more reliable the results will be.

Provide clear guidelines

If post-editors should adhere to your standard style-guide, then this should be clearly stated. In addition, guidelines for post-editing should be provided so that all participants understand what is/is not required. See the TAUS Machine Translation Post-Editing Guidelines. It has been noted that people sometimes do not adhere to guidelines, so the clearer and more succinct they are, the better.

Measures

Measure actual post-editing effort and avoid measures of perceived effort

When people are asked to rate which segment might be easier to post-edit, they are only rating perceived, not actual effort. It has been frequently shown that agreement between participants in such exercises is only low to moderate.

Measure the delta between translation productivity and post-editing productivity

The delta between the two activities is the important measurement. Individuals will benefit to a greater or lesser extent from MT, so the delta should be calculated per individual and the average delta should then be calculated.
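The per-individual calculation described above can be sketched as follows. This is an illustration only: the guidelines do not say whether the delta is absolute or relative, so the relative form used here is an assumption, as are the participant names and throughput figures.

```python
from statistics import mean


def productivity_delta(translation_wph: float, post_editing_wph: float) -> float:
    """Relative gain of post-editing over translation for one person.

    A value of 0.25 means post-editing was 25% faster than translation.
    """
    return (post_editing_wph - translation_wph) / translation_wph


# Hypothetical words-per-hour figures: (translation, post-editing).
throughput = {
    "editor_a": (500.0, 650.0),
    "editor_b": (400.0, 440.0),
    "editor_c": (600.0, 570.0),
}

# Delta per individual first, then the average of those deltas.
deltas = {name: productivity_delta(t, pe) for name, (t, pe) in throughput.items()}
average_delta = mean(deltas.values())

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1%}")
print(f"average: {average_delta:+.1%}")
```

Averaging the individual deltas, rather than comparing pooled group totals, keeps a single very fast participant from dominating the result.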

Extrapolate daily productivity with caution

For short-term measures, numbers of post-editors and words are often limited and so potential daily productivity should be extrapolated with some caution as you may have outliers who will skew results in a small group. The TAUS productivity testing reports show average and individual post-editor productivity, enabling users to determine the influence of outliers on results.
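One way to see how much an outlier skews a small group is to compare the mean with the median and to recompute the mean with each participant left out in turn. The sketch below uses made-up throughput figures. The leave-one-out check is an illustrative choice, not a method taken from the TAUS reports.

```python
from statistics import mean, median


def leave_one_out_means(values):
    """Mean of the group with each participant removed in turn."""
    return [mean(values[:i] + values[i + 1:]) for i in range(len(values))]


# Hypothetical post-editing throughput (words per hour) for four post-editors.
words_per_hour = [620.0, 580.0, 600.0, 1200.0]

group_mean = mean(words_per_hour)
group_median = median(words_per_hour)
loo = leave_one_out_means(words_per_hour)

print(f"mean={group_mean:.0f} median={group_median:.0f}")
print("leave-one-out means:", [round(m) for m in loo])
```

A large gap between mean and median, or one leave-one-out mean far from the others, suggests that a single participant is driving the average and that any daily extrapolation from it should be treated carefully.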



Measure final quality

Increased efficiencies are meaningless if the desired level of quality is not reached. Below are two examples of techniques for quality measurement.

Human evaluation: For example, using a company's standard error typology or adequacy/fluency evaluation.

Edit distance measures: Some research has shown the TER (Translation Edit Rate) and GTM (General Text Matcher) measures to correlate fairly well with human assessments of quality.

The TAUS quality evaluation tools enable you to undertake adequacy/fluency evaluation and error typology review. Again, reports show average and individual post-editor productivity, enabling users to determine the influence of outliers on results.
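To give a feel for what an edit distance measure computes, the sketch below counts word-level insertions, deletions and substitutions between raw MT output and its post-edited version, divided by the length of the post-edited text. This is a simplification and not the TER metric itself: TER also counts block shifts as single edits, which is omitted here. The function names are hypothetical.

```python
def word_edit_distance(hypothesis, reference):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    previous = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_word in enumerate(reference, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(previous[j] + 1,          # delete hyp_word
                               current[j - 1] + 1,       # insert ref_word
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def simple_edit_rate(mt_output: str, post_edited: str) -> float:
    """Word edits needed to reach the post-edited text, per post-edited word."""
    hyp, ref = mt_output.split(), post_edited.split()
    if not ref:
        raise ValueError("post-edited text must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)


rate = simple_edit_rate("the cat sat in mat", "the cat sat on the mat")
print(f"{rate:.2f}")
```

A lower rate means the post-editor changed less of the MT output. For real measurements an established TER or GTM implementation would be preferable to this sketch.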

Large-Scale, Longer-Term Post-Editing Productivity Tests

The larger-scale, longer-term tests would be used if you are already reasonably committed to MT, or at least to testing it on a large scale, and looking to create a virtuous cycle towards operational excellence, guided by such tests.

Design

Compare apples with apples, not apples with oranges

Productivity measures are flawed when they don't compare like with like. When comparing post-editing against translation, it should be clear whether or not "translation" means "TEP" (Translation, Edit, Proof) and whether or not post-editing includes final proofing. If translators have access to terminology, so too should post-editors.

Employ a sufficient number of post-editors per target language

Productivity will vary by individual. By including a broad group of post-editors with varying translation and post-editing experience, you will get a more accurate average measurement. For long-term measurements, we recommend employing more than three post-editors, preferably at least five or six.

Exercise control over your participant profile

Engage the people who will actually do the post-editing in live projects. Employing students or the crowd is not valid if they are not the actual post-editors you would employ in a live project.

Conduct the productivity measurement as you would a commercial project

You want the participants to perform the task as they would any commercial project so that the measures you take are reliable.



Include appropriate content type

MT engines will have varying degrees of success with different content types. Test each content type you wish to machine translate.

For SMT engines the training data should not form part of the test set. Out-of-domain test data should not be tested on domain-specific engines.

Include a sufficient number of words

For each content type and target language, we recommend collating post-editing throughput over a number of weeks. The more words included, the more reliable the results will be.

Use realistic tools and environments

Commonly, MT is integrated into TM tools. It is recommended that if the post-editor is to eventually work in this standard environment, then productivity tests should be done in this environment, as more realistic measures can be obtained.

Provide clear guidelines

If post-editors should adhere to your standard style-guide, then this should be clearly stated. In addition, guidelines for post-editing should be provided so that all participants understand what is/is not required. See the TAUS Machine Translation Post-Editing Guidelines. It has been noted that people sometimes do not adhere to guidelines, so the clearer and more succinct they are, the better.

Involve representatives from your post-editing community in the design and analysis

As with the TAUS Machine Translation Post-Editing Guidelines, we recommend that representatives of the post-editing community be involved in the productivity measurement. Having a stake in such a process generally leads to a higher level of consensus.

Large Scale, Long-Term Tests: Measures

Gauge the quality level of the raw MT output first

Productivity is directly related to the level of quality of the raw MT output. To understand PE productivity measurements, you need to understand the baseline quality of the MT output. Random sampling of the output is recommended.
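Random sampling of the raw output can be done reproducibly so that the same sample can be re-drawn later for review. The sketch below is an assumption-laden illustration: the sample size of 50, the fixed seed and the function name are arbitrary choices, not recommendations from these guidelines.

```python
import random


def sample_segments(segments, sample_size, seed=0):
    """Draw a reproducible simple random sample without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample can be re-drawn
    return rng.sample(segments, min(sample_size, len(segments)))


# Hypothetical raw MT output, one string per segment.
mt_output = [f"segment {i}" for i in range(1000)]

sample = sample_segments(mt_output, 50)
print(len(sample), "segments selected for baseline quality review")
```

The sampled segments would then be passed to whatever quality evaluation is in use, giving a baseline against which the productivity figures can be interpreted.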

Measure actual post-editing effort and avoid measures of perceived effort

When people are asked to rate which segment might be easier to post-edit, they are only rating perceived, not actual effort. It has been frequently shown that agreement between participants in such exercises is only low to moderate.

Measure more than words per hour

Words per hour or words per day give a simplistic view of productivity. The important question is: can production time be reduced across the life cycle of a project (without compromising quality)? Therefore, it may be more appropriate to measure the total gain in days for delivery or publication of the translated content. Spreading measurement out over time will also show whether post-editing productivity rates rise, plateau or fall over time.
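Whether productivity rates rise, plateau or fall can be read from the slope of a straight line fitted to periodic throughput figures. The following sketch fits that line by ordinary least squares. The weekly figures and the tolerance of 5 words per hour per week are hypothetical and would need to be chosen for the project at hand.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (units per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


def classify_trend(values, tolerance=5.0):
    """Label a series as rising, falling or plateau; tolerance is an assumed cut-off."""
    slope = trend_slope(values)
    if slope > tolerance:
        return "rising"
    if slope < -tolerance:
        return "falling"
    return "plateau"


# Hypothetical average post-editing throughput per week (words per hour).
weekly_words_per_hour = [500.0, 520.0, 545.0, 560.0, 580.0]

label = classify_trend(weekly_words_per_hour)
print(label, trend_slope(weekly_words_per_hour))
```

With only a handful of weekly figures the fitted slope is a rough indicator, so it is best read alongside the raw numbers rather than instead of them.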

Measure the delta between translation productivity and post-editing productivity

The delta between the two activities is the important measurement. Individuals will benefit to a greater or lesser extent from MT, so the delta should be calculated per individual and the average delta should then be calculated.

Measure final quality

Increased efficiencies are meaningless if the desired level of quality is not reached. Below are two examples of techniques for quality measurement. For longitudinal measures, average final quality can be compared to see what improvements or degradations occurred:

Human evaluation: For example, using a company's standard error typology. Note that human raters of quality often display low rates of agreement. The more raters, the better (at least three).

Edit distance measures: Some research has shown the TER (Translation Edit Rate) and GTM (General Text Matcher) measures to correlate fairly well with human assessments of quality.
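Since human raters often agree only weakly, it can help to quantify agreement before averaging their scores. The sketch below computes Cohen's kappa for each pair of raters. Choosing kappa is an assumption on our part (the guidelines do not name an agreement statistic), and the 1 to 4 adequacy scale and the scores are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical adequacy scores from three raters on the same six segments.
ratings = {
    "rater_1": [4, 3, 4, 2, 1, 3],
    "rater_2": [4, 3, 3, 2, 1, 3],
    "rater_3": [3, 3, 4, 2, 2, 3],
}

pairwise = {
    (a, b): cohen_kappa(ratings[a], ratings[b])
    for a, b in combinations(ratings, 2)
}
for pair, kappa in pairwise.items():
    print(pair, round(kappa, 2))
```

This form of kappa treats the scores as unordered categories, so a disagreement of one point counts the same as a disagreement of three. A weighted variant would be more appropriate where that distinction matters.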

Measure opinions, get feedback

Some measurement of post-editor opinion/profile can be useful in helping to interpret the quantitative measures. Gathering feedback on the most common or problematic errors can help improve the MT system over time.

Questions About Measurement

Self-report or automate?

If a realistic environment is used, it is difficult to automate productivity measurements. Self-reporting is often used instead, where post-editors fill in a table reporting the number of words they post-edited, divided by the number of hours they worked. Self-reporting is error-prone, but with a large enough group of participants under- or over-reporting should be mitigated.
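The self-report table described above reduces to a simple division per post-editor. The sketch below assumes a hypothetical record layout (`SelfReport`) with invented figures. It also shows a pooled group rate, which is one possible way to summarize the group and is not prescribed by the guidelines.

```python
from dataclasses import dataclass


@dataclass
class SelfReport:
    """One row of a hypothetical self-report table."""
    post_editor: str
    words_post_edited: int
    hours_worked: float


def words_per_hour(report: SelfReport) -> float:
    """Words post-edited divided by hours worked, as self-reported."""
    if report.hours_worked <= 0:
        raise ValueError("hours_worked must be positive")
    return report.words_post_edited / report.hours_worked


reports = [
    SelfReport("editor_a", 2400, 4.0),
    SelfReport("editor_b", 1500, 3.0),
    SelfReport("editor_c", 2100, 3.0),
]

rates = {r.post_editor: words_per_hour(r) for r in reports}
# Pooled rate: total words over total hours across the whole group.
group_rate = sum(r.words_post_edited for r in reports) / sum(r.hours_worked for r in reports)

print(rates)
print(group_rate)
```

Rejecting non-positive hours catches one common kind of data-entry slip, but it does nothing about honest under- or over-reporting, which is why a larger group of participants matters.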

What about using confidence scores as indicators of productivity?

Confidence scores that are automatically generated by the MT system are potential indicators of both quality and productivity. However, development is still in the early stages and there is not yet enough research on the potential links between confidence scores and actual post-editing productivity.

What about re-training engines over time?

If SMT engines are re-trained with quality-approved post-edited content over time, then it can be expected that the MT engine will produce higher quality raw output as time progresses and that post-editing productivity may increase. This should be tested over the long-term.
