taus guidelines postediting productivity
7/28/2019 TAUS Guidelines Postediting Productivity
TAUS Best Practice Guidelines
Measuring Post-Editing Productivity
December 2012
Small-Scale, Short-Term Post-Editing Productivity Tests
Post-editing productivity measurement applies to scenarios where you might wish to use MT as a translator productivity tool. Generally, small-scale productivity tests should be used if you are thinking about getting started with MT in a particular language pair and wondering whether you should invest further effort. The productivity tests will help you understand your potential ROI per language pair and content type.
You may also undertake such tests periodically, say annually, to get an indication of improvements (or not) in productivity.
Small-scale, short-term tests, by their nature, gather less data and are less rigorous than the larger-scale, long-term ones, for which we provide a separate set of best practice guidelines.
TAUS members are able to use productivity testing and human evaluation tools at tauslabs.com, a neutral and independent environment. These ensure that the best practices outlined below are applied, with the additional benefit of automated reporting.
Design
Compare apples with apples, not apples with oranges
Productivity measures are flawed when they don't compare like with like. When comparing post-editing against translation, it should be clear whether or not translation means TEP (Translation, Edit, Proof) and whether or not post-editing includes final proofing. If translators have access to terminology, so too should post-editors.
Employ at least three post-editors per target language
Productivity will vary by individual. By including a number of post-editors with varying translation and post-editing experience, you will get a more accurate average measurement.
Exercise control over your participant profile
Engage the people who will actually do the post-editing in live projects.
Include appropriate content type
MT engines will have varying degrees of success with different content types. Test each content type you wish to machine translate. For SMT engines, the training data should not form part of the test set. Out-of-domain test data should not be tested on domain-specific engines.
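One way to check that training data has not leaked into a test set is sketched below. It is a minimal illustration, not a TAUS tool: the function names, the normalization rule (lowercasing and collapsing whitespace), and the sample segments are all assumptions made for the example.

```python
def normalize(segment: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(segment.lower().split())


def split_leaked_segments(test_segments, training_segments):
    """Return (clean, leaked): test segments absent from / present in the training data."""
    seen = {normalize(s) for s in training_segments}
    clean, leaked = [], []
    for segment in test_segments:
        if normalize(segment) in seen:
            leaked.append(segment)
        else:
            clean.append(segment)
    return clean, leaked


# Hypothetical data, for illustration only.
training = ["Click the Save button.", "Restart the device."]
test = ["Click  the save button.", "Open the settings menu."]
clean, leaked = split_leaked_segments(test, training)
print("keep:", clean)
print("drop:", leaked)
```

Exact matching after normalization only catches verbatim overlap; near-duplicates would need a fuzzy-match threshold, which is left out here.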
Include a sufficient number of words
For each content type and target language, we recommend including at least 300 segments of MT output for short-term tests. The more words included, the more reliable the results will be.
Provide clear guidelines
If post-editors should adhere to your standard style guide, then this should be clearly stated. In addition, guidelines for post-editing should be provided so that all participants understand what is and is not required. See the TAUS Machine Translation Post-Editing Guidelines. It has been noted that people sometimes do not adhere to guidelines, so the clearer and more succinct they are, the better.
Measures
Measure actual post-editing effort and avoid measures of perceived effort
When people are asked to rate which segment might be easier to post-edit, they are only rating perceived, not actual, effort. It has been frequently shown that agreement between participants in such exercises is only low to moderate.
Measure the delta between translation productivity and post-editing productivity
The delta between the two activities is the important measurement. Individuals will benefit to a greater or lesser extent from MT, so the delta should be calculated per individual and the average delta should then be calculated.
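The per-individual delta and its average can be sketched as follows. The guidelines do not say whether the delta is absolute or relative, so this example assumes a relative gain in words per hour; the function names and all figures are hypothetical.

```python
from statistics import mean


def words_per_hour(words: int, hours: float) -> float:
    """Throughput for one task; hours must be positive."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return words / hours


def productivity_delta(translation_wph: float, post_editing_wph: float) -> float:
    """Relative gain of post-editing over translation (0.5 means 50% faster)."""
    return (post_editing_wph - translation_wph) / translation_wph


# Hypothetical (words, hours) pairs per participant and activity.
results = {
    "editor_a": {"translation": (1200, 4.0), "post_editing": (1800, 4.0)},
    "editor_b": {"translation": (1000, 4.0), "post_editing": (1100, 4.0)},
    "editor_c": {"translation": (900, 3.0), "post_editing": (1350, 3.0)},
}

# Delta per individual first, then the average of those deltas.
deltas = {
    name: productivity_delta(
        words_per_hour(*tasks["translation"]),
        words_per_hour(*tasks["post_editing"]),
    )
    for name, tasks in results.items()
}
average_delta = mean(deltas.values())

for name, delta in deltas.items():
    print(f"{name}: {delta:+.0%}")
print(f"average: {average_delta:+.1%}")
```

Averaging the individual deltas, rather than pooling all words and hours, keeps one fast or slow participant from dominating the result through sheer volume.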
Extrapolate daily productivity with caution
For short-term measures, numbers of post-editors and words are often limited, and so potential daily productivity should be extrapolated with some caution, as you may have outliers who will skew results in a small group. The TAUS productivity testing reports show average and individual post-editor productivity, enabling users to determine the influence of outliers on results.
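The effect of a single outlier on a small group can be made visible with a leave-one-out comparison, sketched below. This is not how the TAUS reports are computed; the function name and the words-per-hour figures are assumptions for illustration.

```python
from statistics import mean, median


def outlier_influence(wph_by_editor):
    """For each editor, how far the group mean shifts when that editor is left out."""
    if len(wph_by_editor) < 2:
        raise ValueError("need at least two editors")
    overall = mean(wph_by_editor.values())
    influence = {}
    for editor in wph_by_editor:
        others = [v for name, v in wph_by_editor.items() if name != editor]
        influence[editor] = overall - mean(others)
    return influence


# Hypothetical words-per-hour figures for a three-person test.
rates = {"editor_a": 400.0, "editor_b": 420.0, "editor_c": 900.0}
influence = outlier_influence(rates)

print(f"mean:   {mean(rates.values()):.1f}")
print(f"median: {median(rates.values()):.1f}")
for editor, shift in influence.items():
    print(f"{editor} shifts the mean by {shift:+.1f} words/hour")
```

A large gap between mean and median, or one editor with a much larger shift than the others, is a signal to treat any daily extrapolation as tentative.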
Measure final quality
Increased efficiencies are meaningless if the desired level of quality is not reached. Below are two examples of techniques for quality measurement.
Human evaluation: For example, using a company's standard error typology or adequacy/fluency evaluation.
Edit distance measures: Some research has shown the TER (Translation Edit Rate) and GTM (General Text Matcher) measures to correlate fairly well with human assessments of quality.
The TAUS quality evaluation tools enable you to undertake adequacy/fluency evaluation and error typology review. Again, reports show average and individual results per post-editor, enabling users to determine the influence of outliers on results.
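To give a feel for edit distance measures, the sketch below computes a word-level Levenshtein distance between raw MT output and its post-edited version, normalized by the length of the post-edited text. It is a simplification and an assumption of this example: real TER also counts block shifts as single edits, which this sketch does not, and GTM is a different measure altogether.

```python
def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Minimum number of word insertions, deletions and substitutions."""
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def edit_rate(mt_output: str, post_edited: str) -> float:
    """Edits per word of the post-edited text (shift-free approximation of TER)."""
    reference_length = len(post_edited.split())
    if reference_length == 0:
        raise ValueError("post-edited text is empty")
    return word_edit_distance(mt_output, post_edited) / reference_length


# Hypothetical segment pair.
mt = "the cat sat in mat"
pe = "the cat sat on the mat"
print(word_edit_distance(mt, pe), round(edit_rate(mt, pe), 3))
```

A lower rate means the post-editor changed less of the MT output; it says nothing by itself about whether the final text reached the desired quality.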
Large-Scale, Longer-Term Post-Editing Productivity Tests
The larger-scale, longer-term tests would be used if you are already reasonably committed to MT, or at least to testing it on a large scale, and looking to create a virtuous cycle towards operational excellence, guided by such tests.
Design
Compare apples with apples, not apples with oranges
Productivity measures are flawed when they don't compare like with like. When comparing post-editing against translation, it should be clear whether or not translation means TEP (Translation, Edit, Proof) and whether or not post-editing includes final proofing. If translators have access to terminology, so too should post-editors.
Employ a sufficient number of post-editors per target language
Productivity will vary by individual. By including a broad group of post-editors with varying translation and post-editing experience, you will get a more accurate average measurement. For long-term measurements, we recommend employing more than three post-editors, preferably at least five or six.
Exercise control over your participant profile
Engage the people who will actually do the post-editing in live projects. Employing students or the crowd is not valid if they are not the actual post-editors you would employ in a live project.
Conduct the productivity measurement as you would a commercial project
You want the participants to perform the task as they would any commercial project so that the measures you take are reliable.
Include appropriate content type
MT engines will have varying degrees of success with different content types. Test each content type you wish to machine translate. For SMT engines, the training data should not form part of the test set. Out-of-domain test data should not be tested on domain-specific engines.
Include a sufficient number of words
For each content type and target language, we recommend collating post-editing throughput over a number of weeks. The more words included, the more reliable the results will be.
Use realistic tools and environments
Commonly, MT is integrated into TM tools. If the post-editor is eventually to work in this standard environment, then productivity tests should be done in this environment, as more realistic measures can be obtained.
Provide clear guidelines
If post-editors should adhere to your standard style guide, then this should be clearly stated. In addition, guidelines for post-editing should be provided so that all participants understand what is and is not required. See the TAUS Machine Translation Post-Editing Guidelines. It has been noted that people sometimes do not adhere to guidelines, so the clearer and more succinct they are, the better.
Involve representatives from your post-editing community in the design and analysis
As with the TAUS Machine Translation Post-Editing Guidelines, we recommend that representatives of the post-editing community be involved in the productivity measurement. Having a stake in such a process generally leads to a higher level of consensus.
Large-Scale, Long-Term Tests: Measures
Gauge the quality level of the raw MT output first
Productivity is directly related to the level of quality of the raw MT output. To understand PE productivity measurements, you need to understand the baseline quality of the MT output. Random sampling of the output is recommended.
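Drawing a random sample of raw MT segments for a baseline quality check can be sketched as below. The function name, the fixed seed (used so the same sample can be redrawn later), and the sample size are assumptions of this example; the guidelines do not prescribe them.

```python
import random


def sample_segments(segments, sample_size, seed=0):
    """Draw a reproducible random sample of MT output segments without replacement."""
    pool = list(segments)
    if sample_size >= len(pool):
        return pool
    return random.Random(seed).sample(pool, sample_size)


# Hypothetical raw MT output: 1,000 placeholder segments.
mt_output = [f"segment {i}" for i in range(1000)]
baseline_sample = sample_segments(mt_output, sample_size=50, seed=42)
print(len(baseline_sample), "segments selected for baseline evaluation")
```

Sampling uniformly across the whole output avoids judging the engine only on the first or easiest files in a project.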
Measure actual post-editing effort and avoid measures of perceived effort
When people are asked to rate which segment might be easier to post-edit, they are only rating perceived, not actual, effort. It has been frequently shown that agreement between participants in such exercises is only low to moderate.
Measure more than words per hour
Words per hour or words per day give a simplistic view of productivity. The important question is: can production time be reduced across the life cycle of a project (without compromising quality)? Therefore, it may be more appropriate to measure the total gain in days for delivery or publication of the translated
content. Spreading measurement out over time will also show whether post-editing productivity rates rise, plateau or fall over time.
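One simple way to label a series of weekly throughput figures as rising, flat or falling is to fit a least-squares line and compare its slope with the average rate. The sketch below does this; the function name, the 2% per-week tolerance and the sample figures are assumptions chosen for illustration, not values from the guidelines.

```python
def weekly_trend(weekly_wph, tolerance=0.02):
    """Classify throughput as 'rise', 'plateau' or 'fall' from a least-squares slope."""
    n = len(weekly_wph)
    if n < 2:
        raise ValueError("need at least two weeks of data")
    mean_x = (n - 1) / 2
    mean_y = sum(weekly_wph) / n
    if mean_y == 0:
        raise ValueError("average throughput is zero")
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(weekly_wph))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    # Slope expressed as a fraction of the average rate, per week.
    relative_slope = (covariance / variance) / mean_y
    if relative_slope > tolerance:
        return "rise"
    if relative_slope < -tolerance:
        return "fall"
    return "plateau"


# Hypothetical words-per-hour averages for consecutive weeks.
print(weekly_trend([400, 450, 500, 550]))
print(weekly_trend([500, 505, 498, 502]))
print(weekly_trend([600, 550, 500]))
```

With only a handful of weeks the fitted slope is sensitive to a single unusual week, so the label is best read alongside the raw figures.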
Measure the delta between translation productivity and post-editing productivity
The delta between the two activities is the important measurement. Individuals will benefit to a greater or lesser extent from MT, so the delta should be calculated per individual and the average delta should then be calculated.
Measure final quality
Increased efficiencies are meaningless if the desired level of quality is not reached. Below are two examples of techniques for quality measurement. For longitudinal measures, average final quality can be compared to see what improvements or degradations occurred:
Human evaluation: For example, using a company's standard error typology. Note that human raters of quality often display low rates of agreement. The more raters, the better (at least three).
Edit distance measures: Some research has shown the TER (Translation Edit Rate) and GTM (General Text Matcher) measures to correlate fairly well with human assessments of quality.
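Agreement among human raters can be quantified before their scores are averaged. The sketch below uses Cohen's kappa for each pair of raters and then takes the mean over all pairs; this choice of statistic is an assumption of the example (the guidelines do not name one), and the rater names and scores are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same segments."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings_by_rater):
    """Average Cohen's kappa over every pair of raters (needs at least two raters)."""
    pairs = list(combinations(ratings_by_rater.values(), 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical quality scores (1 = acceptable, 2 = not acceptable) for four segments.
ratings = {
    "rater_1": [1, 1, 2, 2],
    "rater_2": [1, 1, 2, 2],
    "rater_3": [1, 2, 2, 2],
}
agreement = mean_pairwise_kappa(ratings)
print(f"mean pairwise kappa: {agreement:.2f}")
```

A value near 1 indicates strong agreement and a value near 0 indicates agreement no better than chance, in which case the averaged quality scores deserve less weight.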
Measure opinions, get feedback
Some measurement of post-editor opinion and profile can be useful in helping to interpret the quantitative measures. Gathering feedback on the most common or problematic errors can help improve the MT system over time.
Questions About Measurement
Self-report or automate?
If a realistic environment is used, it is difficult to automate productivity measurements. Self-reporting is often used instead, where post-editors fill in a table reporting the number of words they post-edited, divided by the number of hours they worked. Self-reporting is error-prone, but with a large enough group of participants, under- or over-reporting should be mitigated.
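A self-report table of this kind can be turned into throughput figures with a few lines of code. The sketch below assumes each row holds an editor name, words post-edited and hours worked; that layout, the function name and the sample rows are illustrative assumptions.

```python
from collections import defaultdict


def throughput_from_reports(rows):
    """Sum self-reported words and hours per editor and return words per hour."""
    words = defaultdict(int)
    hours = defaultdict(float)
    for editor, reported_words, reported_hours in rows:
        if reported_words < 0 or reported_hours <= 0:
            raise ValueError(f"implausible report for {editor}")
        words[editor] += reported_words
        hours[editor] += reported_hours
    return {editor: words[editor] / hours[editor] for editor in words}


# Hypothetical self-report rows: (editor, words post-edited, hours worked).
reports = [
    ("editor_a", 2400, 6.0),
    ("editor_a", 2600, 6.5),
    ("editor_b", 3000, 8.0),
]
throughput = throughput_from_reports(reports)
for editor, rate in throughput.items():
    print(f"{editor}: {rate:.0f} words/hour")
```

Rejecting rows with zero or negative hours catches only the most obvious reporting slips; systematic under- or over-reporting still has to be offset by group size, as noted above.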
What about using confidence scores as indicators of productivity?
Confidence scores that are automatically generated by the MT system are potential indicators of both quality and productivity. However, development is still in the early stages and there is not yet enough research on the potential links between confidence scores and actual post-editing productivity.
What about re-training engines over time?
If SMT engines are re-trained with quality-approved post-edited content over time, then it can be expected that the MT engine will produce higher quality raw output as time progresses and that post-editing productivity may increase. This should be tested over the long term.