taus guidelines postediting productivity

Upload: taus

Post on 03-Apr-2018

  • 7/28/2019 TAUS Guidelines Postediting Productivity

    TAUS Best Practice Guidelines

    Measuring Post-Editing Productivity

    December 2012

    Small-scale, Short-term Post-Editing Productivity Tests

    Post-editing productivity measurement applies to scenarios where you might wish to use MT as a translator productivity tool. Generally, small-scale productivity tests should be used if you are thinking about getting started with MT in a particular language pair, and wondering whether you should invest further effort. The productivity tests will help you understand your potential ROI per language pair/content type.

    You may also undertake such tests periodically, say annually, to get an indication of improvements (or not) in productivity.

    Small-scale, short-term tests, by their nature, gather less data and are less rigorous than the larger-scale, long-term ones for which we provide a separate set of best practice guidelines.

    TAUS members are able to use productivity testing and human evaluation tools at tauslabs.com, a neutral and independent environment. These ensure that the best practices outlined below are applied, with the additional benefit of automated reporting.

    Design

    Compare apples with apples, not apples with oranges

    Productivity measures are flawed when they don't compare like with like. When comparing post-editing against translation, it should be clear whether or not translation means TEP (Translation, Edit, Proof) and whether or not post-editing includes final proofing. If translators have access to terminology, so too should post-editors.

    Employ at least three post-editors per target language

    Productivity will vary by individual. By including a number of post-editors with varying translation and post-editing experience, you will get a more accurate average measurement.

    Exercise control over your participant profile

    Engage the people who will actually do the post-editing in live projects.

    Include appropriate content type

    MT engines will have varying degrees of success with different content types. Test each content type you wish to machine translate. For SMT engines, the training data should not form part of the test set. Out-of-domain test data should not be tested on domain-specific engines.
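
One way to check that training data has not leaked into a test set is to compare normalised source segments. The sketch below is only an illustration: the function names and sample segments are hypothetical, and it catches exact matches after lower-casing and whitespace cleanup, not near-duplicates.

```python
def normalise(segment: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(segment.lower().split())


def find_overlap(training_segments, test_segments):
    """Return the test segments whose normalised form also occurs in the training data."""
    seen = {normalise(s) for s in training_segments}
    return [s for s in test_segments if normalise(s) in seen]


# Hypothetical sample data, for illustration only.
training = ["Click the Save button.", "Open the file menu."]
test = ["click the  save button.", "Restart the device."]

overlap = find_overlap(training, test)
print(overlap)
```

Any segment reported here would be removed from the test set before the productivity test is run.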

    Include a sufficient number of words

    For each content type and target language, we recommend including at least 300 segments of MT output for short-term tests. The more words included, the more reliable the results will be.

    Provide clear guidelines

    If post-editors should adhere to your standard style guide, then this should be clearly stated. In addition, guidelines for post-editing should be provided so that all participants understand what is/is not required. See the TAUS Machine Translation Post-Editing Guidelines. It has been noted that people sometimes do not adhere to guidelines, so the clearer and more succinct they are, the better.

    Measures

    Measure actual post-editing effort and avoid measures of perceived effort

    When people are asked to rate which segment might be easier to post-edit, they are only rating perceived, not actual, effort. It has been frequently shown that agreement between participants in such exercises is only low to moderate.

    Measure the delta between translation productivity and post-editing productivity

    The delta between the two activities is the important measurement. Individuals will benefit to a greater or lesser extent from MT, so the delta should be calculated per individual and the average delta should then be calculated.
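
The per-individual calculation can be sketched as follows. The guidelines do not say whether the delta is absolute or relative, so this example assumes a relative delta (the change in words per hour as a fraction of the translation rate). The participant names and figures are hypothetical.

```python
from statistics import mean

# Hypothetical throughput in words per hour for each participant.
throughput = {
    "editor_a": {"translation": 400, "post_editing": 600},
    "editor_b": {"translation": 500, "post_editing": 550},
    "editor_c": {"translation": 450, "post_editing": 405},
}


def individual_deltas(data):
    """Relative change from translation to post-editing, computed per person."""
    return {
        name: (rates["post_editing"] - rates["translation"]) / rates["translation"]
        for name, rates in data.items()
    }


deltas = individual_deltas(throughput)
average_delta = mean(deltas.values())

for name, delta in deltas.items():
    print(f"{name}: {delta:+.0%}")
print(f"average delta: {average_delta:+.1%}")
```

Averaging the individual deltas, instead of comparing two pooled averages, keeps one very fast or very slow participant from hiding how the others fared.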

    Extrapolate daily productivity with caution

    For short-term measures, numbers of post-editors and words are often limited and so potential daily productivity should be extrapolated with some caution as you may have outliers who will skew results in a small group. The TAUS productivity testing reports show both average and individual post-editor productivity, enabling users to determine the influence of outliers on results.
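
To see how much one outlier can move a small-group average, it may help to compare the mean with the median and to recompute the mean with each participant left out in turn. This is a minimal sketch with hypothetical figures; it is not how the TAUS reports are produced.

```python
from statistics import mean, median

# Hypothetical post-editing throughput in words per hour.
words_per_hour = {"editor_a": 520, "editor_b": 560, "editor_c": 1400}


def summary(rates):
    """Mean and median throughput across all participants."""
    values = list(rates.values())
    return {"mean": mean(values), "median": median(values)}


def leave_one_out_means(rates):
    """Mean throughput with each participant removed in turn."""
    return {
        name: mean(v for other, v in rates.items() if other != name)
        for name in rates
    }


overall = summary(words_per_hour)
without = leave_one_out_means(words_per_hour)

print(overall)
print(without)
```

A large gap between mean and median, or a leave-one-out mean that shifts sharply when one person is removed, suggests the group average should not be extrapolated to a daily rate without qualification.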

    Measure final quality

    Increased efficiencies are meaningless if the desired level of quality is not reached. Below are two examples of techniques for quality measurement.

    Human evaluation: For example, using a company's standard error typology or adequacy/fluency evaluation.

    Edit distance measures: Some research has shown the TER (Translation Edit Rate) and GTM (General Text Matcher) measures to correlate fairly well with human assessments of quality.

    The TAUS quality evaluation tools enable you to undertake adequacy/fluency evaluation and error typology review. Again, reports show both average and individual post-editor productivity, enabling users to determine the influence of outliers on results.
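
As a rough illustration of an edit distance measure, the sketch below computes a word-level Levenshtein distance between raw MT output and its post-edited version, divided by the length of the post-edited segment. This is a simplification and not TER itself: TER also counts block shifts as single edits, which this version does not. The function names and sample sentences are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_edit_rate(raw_mt, post_edited):
    """Edits needed to turn raw MT into the post-edited text, per reference word."""
    hyp, ref = raw_mt.split(), post_edited.split()
    if not ref:
        raise ValueError("post-edited segment is empty")
    return word_edit_distance(hyp, ref) / len(ref)


rate = simple_edit_rate("the cat sat on mat", "the cat sat on the mat")
print(round(rate, 3))
```

A lower rate means fewer word-level changes were needed to reach the final text.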

    Large-Scale, Longer-Term Post-Editing Productivity Tests

    The larger-scale, longer-term tests would be used if you are already reasonably committed to MT, or at least to testing it on a large scale, and looking to create a virtuous cycle towards operational excellence, guided by such tests.

    Design

    Compare apples with apples, not apples with oranges

    Productivity measures are flawed when they don't compare like with like. When comparing post-editing against translation, it should be clear whether or not translation means TEP (Translation, Edit, Proof) and whether or not post-editing includes final proofing. If translators have access to terminology, so too should post-editors.

    Employ a sufficient number of post-editors per target language

    Productivity will vary by individual. By including a broad group of post-editors with varying translation and post-editing experience, you will get a more accurate average measurement. For long-term measurements, we recommend employing more than three post-editors, preferably at least five or six.

    Exercise control over your participant profile

    Engage the people who will actually do the post-editing in live projects. Employing students or the crowd is not valid if they are not the actual post-editors you would employ in a live project.

    Conduct the productivity measurement as you would a commercial project

    You want the participants to perform the task as they would any commercial project so that the measures you take are reliable.

    Include appropriate content type

    MT engines will have varying degrees of success with different content types. Test each content type you wish to machine translate. For SMT engines, the training data should not form part of the test set. Out-of-domain test data should not be tested on domain-specific engines.

    Include a sufficient number of words

    For each content type and target language, we recommend collating post-editing throughput over a number of weeks. The more words included, the more reliable the results will be.

    Use realistic tools and environments

    Commonly, MT is integrated into TM tools. It is recommended that if the post-editor is to eventually work in this standard environment, then productivity tests should be done in this environment, as more realistic measures can be obtained.

    Provide clear guidelines

    If post-editors should adhere to your standard style guide, then this should be clearly stated. In addition, guidelines for post-editing should be provided so that all participants understand what is/is not required. See the TAUS Machine Translation Post-Editing Guidelines. It has been noted that people sometimes do not adhere to guidelines, so the clearer and more succinct they are, the better.

    Involve representatives from your post-editing community in the design and analysis

    As with the TAUS Machine Translation Post-Editing Guidelines, we recommend that representatives of the post-editing community be involved in the productivity measurement. Having a stake in such a process generally leads to a higher level of consensus.

    Large-Scale, Long-Term Tests: Measures

    Gauge the quality level of the raw MT output first

    Productivity is directly related to the level of quality of the raw MT output. To understand PE productivity measurements, you need to understand the baseline quality of the MT output. Random sampling of the output is recommended.
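
Random sampling of raw MT output for a baseline quality review could look like the sketch below. The sample size, the seed, and the function name are assumptions made for illustration; the guidelines do not prescribe them.

```python
import random


def sample_segments(segments, k, seed=0):
    """Draw up to k segments without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    k = min(k, len(segments))
    return rng.sample(segments, k)


# Hypothetical raw MT output, one entry per segment.
mt_output = [f"segment {i}" for i in range(1, 1001)]

review_sample = sample_segments(mt_output, k=50, seed=42)
print(len(review_sample))
```

Fixing the seed means the same sample can be handed to several reviewers or regenerated later for audit.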

    Measure actual post-editing effort and avoid measures of perceived effort

    When people are asked to rate which segment might be easier to post-edit, they are only rating perceived, not actual, effort. It has been frequently shown that agreement between participants in such exercises is only low to moderate.

    Measure more than words per hour

    Words per hour or words per day give a simplistic view of productivity. The important question is: can production time be reduced across the life cycle of a project (without compromising quality)? Therefore, it may be more appropriate to measure the total gain in days for delivery or publication of the translated content. Spreading measurement out over time will also show whether post-editing productivity rates rise, plateau or fall over time.

    Measure the delta between translation productivity and post-editing productivity

    The delta between the two activities is the important measurement. Individuals will benefit to a greater or lesser extent from MT, so the delta should be calculated per individual and the average delta should then be calculated.

    Measure final quality

    Increased efficiencies are meaningless if the desired level of quality is not reached. Below are two examples of techniques for quality measurement. For longitudinal measures, average final quality can be compared to see what improvements or degradations occurred:

    Human evaluation: For example, using a company's standard error typology. Note that human raters of quality often display low rates of agreement. The more raters, the better (at least three).

    Edit distance measures: Some research has shown the TER (Translation Edit Rate) and GTM (General Text Matcher) measures to correlate fairly well with human assessments of quality.

    Measure opinions, get feedback

    Some measurement of post-editor opinion/profile can be useful in helping to interpret the quantitative measures. Gathering feedback on the most common or problematic errors can help improve the MT system over time.

    Questions About Measurement

    Self-report or automate?

    If a realistic environment is used, it is difficult to automate productivity measurements. Self-reporting is often used instead, where post-editors fill in a table reporting the number of words they post-edited, divided by the number of hours they worked. Self-reporting is error-prone, but with a large enough group of participants under- or over-reporting should be mitigated.
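
A self-report table of this kind can be turned into words per hour per post-editor with a few lines of code. The sketch below assumes a hypothetical table layout (one row per working session with editor, words and hours) and sums each editor's words and hours before dividing, so that long sessions carry more weight than short ones.

```python
# Hypothetical self-report rows: one per working session.
reports = [
    {"editor": "editor_a", "words": 3200, "hours": 6.0},
    {"editor": "editor_a", "words": 2800, "hours": 4.0},
    {"editor": "editor_b", "words": 2500, "hours": 5.0},
]


def words_per_hour(rows):
    """Total words divided by total hours for each editor."""
    totals = {}
    for row in rows:
        words, hours = totals.get(row["editor"], (0, 0.0))
        totals[row["editor"]] = (words + row["words"], hours + row["hours"])
    # Skip editors who reported zero hours to avoid dividing by zero.
    return {editor: w / h for editor, (w, h) in totals.items() if h > 0}


rates = words_per_hour(reports)
print(rates)
```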

    What about using confidence scores as indicators of productivity?

    Confidence scores that are automatically generated by the MT system are potential indicators of both quality and productivity. However, development is still in the early stages and there is not yet enough research on the potential links between confidence scores and actual post-editing productivity.

    What about re-training engines over time?

    If SMT engines are re-trained with quality-approved post-edited content over time, then it can be expected that the MT engine will produce higher quality raw output as time progresses and that post-editing productivity may increase. This should be tested over the long term.