
User Performance Indicators In Task-Based Data Collection Systems

Jon Chamberlain, Cliff O’Reilly
School of Computer Science and Electronic Engineering
University of Essex, Wivenhoe Park, CO4 3SQ, England
{jchamb,coreila}@essex.ac.uk

Abstract

When attempting to analyse and improve a system interface it is often the performance of system users that measures the success of different iterations of design. This paper investigates the importance of sensory and cognitive stages in human data processing, using data collected from Phrase Detectives, a text-based game for collecting language data, and discusses its application for interface design.

1 Introduction

When attempting to analyse and improve a system interface it is often the performance of system users that measures the success of different iterations of design. The metric of performance depends on the context of the task and which outputs the system owners consider most important; for example, one system may desire high-quality output from users, whereas another might want fast output from users [RC10].

When quality is the performance measure it is essential to have a trusted gold standard with which to judge the user’s responses. A common problem for natural language processing applications, such as co-reference resolution, is that there are insufficient resources available and creating them is both time-consuming and costly [PCK+13].

Using user response time as a performance indicator presents a different set of problems and it may not necessarily be assumed that speed correlates to quality. A fast response may indicate a highly trained user responding to a simple task and conversely a slow response might indicate a difficult task that requires more thought.

Copyright © 2014 for the individual papers by the paper’s authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

In: U. Kruschwitz, F. Hopfgartner and C. Gurrin (eds.): Proceedings of the MindTheGap’14 Workshop, Berlin, Germany, 4-March-2014, published at http://ceur-ws.org

Figure 1: Stages of processing in human cognition.

It is therefore important to understand what is happening to the user during the response and whether there is anything that can be done to the system to improve performance.

This paper investigates the importance of sensory and cognitive stages in human data processing, using data collected from Phrase Detectives, a text-based game for collecting language data, and attempts to isolate the effect of each stage. Furthermore, we discuss the implications and their application to interface design.

2 Related Work

The analysis of timed decision making has been a key experimental model in Cognitive Psychology. Studies in Reaction (or Response) Time (RT) show that the human interaction with a system can be divided into discrete stages: incoming stimulus; mental response; and behavioural response [Ste69]. Although traditional psychological theories follow this model of progression from perception to action, recent studies are moving more towards models of increasing complexity [HMU08].

For our investigation we distinguish between 3 stages of processing required from the user to elicit an output response from input stimuli (see also Figure 1):



1. input processing (sensory processing), where the user reads the text and comprehends it;

2. decision making (cognitive processing), where the user makes a choice about how to complete the task;

3. taking action (motor response) to enter the response into the system interface (typically using a keyboard or mouse).
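If we make the simplifying assumption that the three stages are sequential and non-overlapping (the Discussion notes that in practice users can overlap them), the observed response time can be summarised as

    RT ≈ t_input + t_decision + t_action

where t_input is the sensory/reading time, t_decision is the cognitive processing time and t_action is the motor response time. Each analysis in Section 4 attempts to hold two of these terms constant (or evenly distributed) while varying the third.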

This model demonstrates how a user responds to a task and can be seen in many examples of user interaction in task-based data collection systems. In crowdsourcing systems, such as Amazon’s Mechanical Turk,[1] or data collection games, a user is given an input (typically a section of text or an image) and asked to complete a task using that input, such as to identify a linguistic feature in the text or to categorise objects in an image [KCS08]. The model can also be seen in security applications such as reCAPTCHA, where the response of the user proves they are human and not an automated machine [vAMM+08]. As a final example, the model can be seen in users’ responses to a search results page, with the list of results being the input and the click to the target document being the response [MTO12].

The relationship between accuracy in completing a task and the time taken is known as the Speed-Accuracy Trade-off. Evidence from studies in ecological decision-making shows clear indications that difficult tasks can be guessed where the costs of error are low. This results in lower accuracy but faster completion time [CSR09, KBM06]. Whilst studies using RT as a measure of performance are common, it has yet to be incorporated into more sophisticated models predicting data quality from user behaviour [RYZ+10, WRfW+09, KHH12, MRZ05].

[1] http://www.mturk.com

3 Data Collection

Phrase Detectives is a game-with-a-purpose designed to collect data on anaphoric co-reference[2] in English documents [CPKs08, PCK+13].

[2] Anaphoric co-reference is a type of linguistic reference where one expression depends on another referential element. An example would be the relation between the entity ‘Jon’ and the pronoun ‘his’ in the text ‘Jon rode his bike to school’.

The game uses 2 modes for players to complete a linguistic task. Initially text is presented in Annotation Mode (called Name the Culprit in the game; see Figure 2), where the player makes an annotation decision about a highlighted markable (section of text). If different players enter different interpretations for a markable then each interpretation is presented to more players in Validation Mode (called Detectives Conference in the game; see Figure 3). The players in Validation Mode have to agree or disagree with the interpretation.

Figure 2: A task presented in Annotation Mode.

Figure 3: A task presented in Validation Mode.

The game was released as 2 interfaces: in 2008 as an independent website system (PD)[3] and in 2011 as an embedded game within the social network Facebook (PDFB).[4] Both versions of the Phrase Detectives game were built primarily in PHP, HTML, CSS and JavaScript, employ the same overall game architecture and run simultaneously on the same corpus of documents.

[3] http://www.phrasedetectives.com
[4] https://apps.facebook.com/phrasedetectives

One of the differences between Phrase Detectives and other data collection games is that it uses pre-processing to offer the players a restricted choice of options. In Annotation Mode the text has embedded code that shows all selectable markables; in Validation Mode the player is offered a binary choice of agreeing or disagreeing with an interpretation. This makes the interface more game-like and allows the data to be analysed in a more straightforward way, as all responses are clicks rather than keyboard typing. In this sense it makes the findings more comparable to search result tasks than reCAPTCHA typing tasks.
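The analyses in the next section are easier to follow against a concrete picture of a logged response. The sketch below (in Python) is purely illustrative; the field names are our assumptions for exposition, not the actual Phrase Detectives logging schema.

    # Illustrative response record; field names are assumptions, not the
    # game's actual logging schema.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Response:
        interface: str            # "PD" or "PDFB"
        mode: str                 # "annotation", "validation_agree" or "validation_disagree"
        markable_start: int       # character offset of the markable in the document
        clicks: int               # minimum clicks needed to enter this response type
        rt_seconds: float         # response time in seconds
        correct: Optional[bool] = None  # only set where a gold standard exists
        ignored: bool = False     # markable flagged as ignored by administrators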

4 Analysis

In order to investigate the human data processing in the Phrase Detectives game the RT was analysed in different ways. All data analysed in this paper is from the first 2 years of data collection from each interface and does not include data from markables that are flagged as ignored.[5] Responses of 0 seconds were not included because they were more likely to indicate a problem with the system than a sub-0.5-second response. Responses over 512 seconds (8:32 minutes)[6] were also not included; outliers do not represent more than 0.5% of the total responses.

[5] System administrators manually correct pre-processing errors by tagging redundant markables to be ignored.

[6] The upper time limit is set at 512 seconds because the data is part of a larger investigation that used RT grouped by a power function and it is assumed no task would take longer than this.
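A minimal sketch of the filtering described above, assuming the illustrative Response records from Section 3:

    # Keep responses with 0s < RT <= 512s and drop ignored markables, as
    # described above; the record type is the illustrative sketch from Section 3.
    def filter_responses(responses):
        return [r for r in responses
                if not r.ignored and 0 < r.rt_seconds <= 512]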

An overview of the total responses from each interface shows the PDFB interface had proportionately fewer annotations to validations than the PD interface, indicating that the players in the latter disagreed with each other more (see Table 1). A random sample of 50,000 responses per response type (annotation, agreeing validation, and disagreeing validation) shows that users respond differently between the 2 interfaces (see Table 2). The data was also plotted as a proportional frequency of RT, with a focus on the first 15 seconds (see Figure 4).

Table 1: Total responses for the 2 modes in the 2 interfaces of Phrase Detectives.

                                      PD       PDFB
  Total Annotations            1,096,575    520,434
  Total Validations (Agree)      123,224    115,280
  Total Validations (Disagree)   278,896    199,197

Table 2: Minimum, median and mean RT from a random sample of 50,000 responses of each response type from PD and PDFB.

                                  PD      PDFB
  Annotation RT (min)           1.0s      2.0s
  Annotation RT (med)           3.0s      6.0s
  Annotation RT (mean)          7.2s     10.2s
  Validation (Agr) RT (min)     1.0s      1.0s
  Validation (Agr) RT (med)     5.0s      6.0s
  Validation (Agr) RT (mean)   10.0s     10.5s
  Validation (Dis) RT (min)     1.0s      2.0s
  Validation (Dis) RT (med)     3.0s      6.0s
  Validation (Dis) RT (mean)    8.4s      9.9s

Figure 4: Proportional frequency of RT in the 2 modes of the 2 interfaces of Phrase Detectives.

There is a significant difference in the RT between interfaces (p<0.05, unpaired t-test). This may indicate a higher level of cheating and spam in PD; however, PDFB may be slower because it had to load the Facebook wrapper in addition to the interface. This is supported by the minimum RT for PDFB being 2.0s in Annotation and Validation (Disagree) Modes, where it could be assumed that this is the system’s maximum speed. The 2 interfaces differ in the proportion of responses of 2 seconds or less (almost a third of all responses in PD but a negligible amount in PDFB). One of the motivations for this research is to understand the threshold where responses can be excluded based on predicted RT rather than by comparison to a gold standard.
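The per-interface comparison could be reproduced along the following lines; scipy is an assumption on our part, as the paper does not state which statistics package was used.

    import random
    import statistics
    from scipy import stats

    def summarise(rts):
        # Minimum, median and mean RT (seconds) for one response type.
        return min(rts), statistics.median(rts), statistics.mean(rts)

    def compare_interfaces(pd_rts, pdfb_rts, sample_size=50_000):
        # Unpaired t-test on RT samples from the two interfaces.
        pd_sample = random.sample(pd_rts, min(sample_size, len(pd_rts)))
        pdfb_sample = random.sample(pdfb_rts, min(sample_size, len(pdfb_rts)))
        _, p = stats.ttest_ind(pd_sample, pdfb_sample)
        return summarise(pd_sample), summarise(pdfb_sample), p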

The RT for validations was slower than for annotations in the PD interface. This is counter-intuitive, as Annotation Mode has more options for the user to choose from and requires a more complex motor response. One of the assumptions in the original game design was that a Validation Mode would be faster than an Annotation Mode and that it would make data collection more efficient.

The data was further analysed to investigate the 3 stages of user processing. Different data models were used to isolate the effect of the stage in question and negate the influence of the 2 other stages.

4.1 Input processing

A random sample of 100,000 validation (agree and disagree) responses was taken from the PDFB corpus. The RT and the character distance to the start of the markable were tested for a linear correlation, the hypothesis being that more input data (i.e., a longer text) will require more time for the player to read and comprehend. Validation Mode was used because it always displays the same number of choices to the player (i.e., 2) no matter what the length of the text, so the action and decision making stages should be constant and any difference observed in RT would be due to input processing.

There was a significant correlation between RT and the amount of text displayed on the screen (p<0.05, Pearson’s correlation), which supports the hypothesis that processing a larger input takes more time.
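A sketch of this test, again assuming scipy and the illustrative record structure:

    from scipy.stats import pearsonr

    def input_processing_correlation(validations):
        # Correlate RT with the markable's starting character offset, used
        # here as a proxy for how much text the player has to read.
        offsets = [v.markable_start for v in validations]
        rts = [v.rt_seconds for v in validations]
        r, p = pearsonr(offsets, rts)
        return r, p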

4.2 Decision making

The decision making stage was investigated using an analysis of 5 documents in the PD corpus that had a double gold standard (i.e., had been marked by 2 language experts), excluding markables that were ambiguous (i.e., the 2 experts did not agree on the best answer) or where there was complete consensus. The comparison of paired responses of individual markables minimises the effect of processing time, and the action time is assumed to be evenly distributed.

The analysis shows that an incorrect response takes longer, significantly so in the case of making an annotation or agreeing with an annotation (p<0.05, paired t-test); see Table 3. Given that this dataset is from PD, where there are a high number of fast spam responses, it is feasible that the true incorrect RT is higher. Taking longer to make an incorrect response is indicative of a user who does not have a good understanding of the task, or that the task is more difficult than usual.

Table 3: Mean RT for aggregated correct and incorrect responses in the 2 modes from 122 gold standard markable observations (80 in the case of Validation Disagree). * indicates p<0.05.

                           Correct   Incorrect
  Annotation*                10.1s       12.8s
  Validation (Agree)*        13.5s       17.7s
  Validation (Disagree)      14.5s       15.0s

Mean RT is slower than in the general dataset (Table 2). One explanation is that the gold standard was created from some of the first documents to be completed, and the user base at that time would mostly have been interested early adopters, beta testers and colleagues of the developers, rather than the more general crowd that developed over time, including spammers making fast responses.
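The paired comparison itself can be sketched as follows, assuming correct and incorrect RTs have been aggregated per gold-standard markable (the pairing unit used above):

    from scipy.stats import ttest_rel

    def correct_vs_incorrect(correct_rts, incorrect_rts):
        # Paired t-test over per-markable mean RTs; the two lists must be
        # aligned so that index i refers to the same markable in both.
        t, p = ttest_rel(correct_rts, incorrect_rts)
        return t, p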

4.3 Taking action

A random sample of 100,000 markables and associated annotations was taken from completed documents from both interfaces where the markable starting character was greater than 1,000 characters. Annotations were grouped on the minimum number of clicks that would be required to make the response (any markables that had no responses in any group were excluded). Thus the effect of input processing speed was minimised in the selected markables and decision making time is assumed to be evenly distributed. The click groups were:

• 1 click response, including Discourse-New (DN) and Non-Referring (NR);

• 2 click response, including Discourse-Old (DO) where 1 antecedent was chosen;

• 3 click response, including DO where 2 antecedents were chosen and Property (PR) where 1 antecedent was chosen.

Table 4: Minimum, median and maximum RT for clicking actions in Annotation Mode from 6,176 markables (p<0.01).

                          Min     Med      Max
  1 click (DN, NR)       1.0s    5.0s   123.3s
  2 clicks (DO1)         1.0s    9.8s   293.0s
  3 clicks (DO2, PR1)    2.0s   12.0s   509.0s

There is a significant difference between each group (p<0.01, paired t-test), implying that the motor response per click is between 2 and 4 seconds, although for some tasks it is clearly faster, as can be seen in the minimum RT. This makes the filtering of responses below a threshold RT important, as in some cases the user would not have enough time to process the input, make a decision and take action. This will be dependent on how difficult the task is to respond to.
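As a rough illustration of the per-click estimate, the cost can be read off the differences between the click-group medians; the grouping layout below is an assumption, not the paper’s data format.

    import statistics

    def per_click_estimate(rts_by_clicks):
        # rts_by_clicks maps the minimum click count (1, 2, 3) to a list of RTs.
        # Estimate seconds per click from median differences between adjacent groups.
        medians = {k: statistics.median(v) for k, v in rts_by_clicks.items()}
        keys = sorted(medians)
        steps = [(medians[b] - medians[a]) / (b - a)
                 for a, b in zip(keys, keys[1:])]
        return statistics.mean(steps)

Applied to the medians in Table 4 (5.0s, 9.8s and 12.0s), this gives steps of 4.8s and 2.2s per click.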

Here the actions require the user to click on a link or button, but this methodology can be extended to cover different styles of input, for example free-text entry. Free text is a more complicated response because the same decision can be expressed in different ways, and automatic text processing and normalisation would be required. However, when a complex answer might be advantageous, it is useful to have an unrestricted way of collecting data allowing novel answers to be recorded. To this end the Phrase Detectives game allowed free-text comments to be added to markables.

5 Discussion

By understanding the way users interact with a system, each task response time can be predicted. In the case of the Phrase Detectives game we can use a prediction of what the user should do for a given size of input to process, task difficulty and data entry mode. The same could be applied to any task-driven system such as search, where the system returns a set of results from a query of known complexity with a set of actionable areas that allow a response to be predicted even when the user is unknown.

When the system is able to predict a response time for a given input, task and interface combination, user performance can be measured, with users that perform as predicted being used as a pseudo-gold standard so the system can learn from new data. Outlier data can be filtered: a response that is too fast may indicate the user is clicking randomly or that it is an automated or spam response; a response that is too slow may indicate the user is distracted, fatigued or does not understand the task, and therefore the quality of their judgement is likely to be poor.
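As a deliberately simplified illustration of this kind of prediction and filtering, one could fit an additive model over input size, task difficulty and click count and flag responses that fall far from it; every coefficient and threshold below is a placeholder, not a value derived from the Phrase Detectives data.

    def predicted_rt(chars_on_screen, difficulty, clicks,
                     read_rate=0.01, decision_cost=3.0, click_cost=3.0):
        # Toy additive prediction (seconds) following the three-stage model;
        # all coefficients are illustrative placeholders.
        return chars_on_screen * read_rate + difficulty * decision_cost + clicks * click_cost

    def flag_response(observed_rt, expected_rt, low=0.25, high=4.0):
        # Flag responses far from the predicted RT; thresholds are placeholders.
        if observed_rt < low * expected_rt:
            return "too fast"   # possible random clicking, automation or spam
        if observed_rt > high * expected_rt:
            return "too slow"   # possible distraction, fatigue or misunderstanding
        return "ok"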

The significant results uncovered by the analysis of the Phrase Detectives data should be treated with some caution. Independent analysis of each processing stage is not entirely possible for log data because users are capable of performing each stage simultaneously, i.e., by making decisions and following the text with the mouse cursor whilst reading the text. A more precise model could be achieved with eye-tracking and GOMS (Goals, Operators, Methods, and Selection) rule modelling [CNM83], using a test group to establish baselines for comparison to the log data, or by using implicit user feedback from more detailed logs [ABD06]. Without using more precise measures of response time this method is most usefully employed as a way to detect and filter spam and very poor responses, rather than as a way to evaluate and predict user performance.

Modelling the system and measuring user performance allows designers to benchmark proposed changes to see if they have the desired effect, either an improvement in user performance or a negligible detriment when, for example, monetising an interface by adding more advertising. Sensory and motor actions in the system can be improved by changes to the interface; for example, in the case of search results, ensuring the results list page contains enough data so the user is likely to find their target but not so much that it slows the user down with input processing. Even simple changes such as increasing the contrast or size of the text might allow faster processing of the input text and hence improve user performance. Decision making can be improved through user training, either explicitly with instructions and training examples or implicitly by following interface design conventions so the user is pre-trained in how the system will work.

Predicting a user response is an imprecise science and other human factors should be considered as potentially overriding factors in any analysis. A user’s expectations of how an interface should operate, combined with factors beyond measurement, may negate careful design efforts.

6 Conclusion

Our investigation has shown that all three stages of user interaction within task-based data collection systems (processing the input, making a decision and taking action) have a significant effect on the response time of users, and this has an impact on how interface design elements should be applied. Using response time to evaluate users from log data may only be accurate enough to filter outliers rather than predict performance; however, this is the subject of future research.

Acknowledgements

The authors would like to thank the reviewers and Dr Udo Kruschwitz for their comments and suggestions. The creation of the original game was funded by EPSRC project AnaWiki, EP/F00575X/1.

References

[ABD06] Eugene Agichtein, Eric Brill, and Susan Dumais. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’06, pages 19–26, New York, NY, USA, 2006. ACM.

[CNM83] Stuart K. Card, Allen Newell, and Thomas P. Moran. The Psychology of Human-Computer Interaction. L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 1983.

[CPKs08] Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. Phrase Detectives: A web-based collaborative annotation game. In Proceedings of the International Conference on Semantic Systems (I-Semantics’08), 2008.

[CSR09] Lars Chittka, Peter Skorupski, and Nigel E. Raine. Speed–accuracy trade-offs in animal decision making. Trends in Ecology & Evolution, 24(7):400–407, 2009.

[HMU08] Hauke R. Heekeren, Sean Marrett, and Leslie G. Ungerleider. The neural systems that mediate human perceptual decision making. Nature Reviews Neuroscience, 9(6):467–479, June 2008.

[KBM06] Leslie M. Kay, Jennifer Beshel, and Claire Martin. When good enough is best. Neuron, 51(3):277–278, 2006.

[KCS08] Aniket Kittur, Ed H. Chi, and Bongwon Suh. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’08, pages 453–456, New York, NY, USA, 2008. ACM.

[KHH12] Ece Kamar, Severin Hacker, and Eric Horvitz. Combining human and machine intelligence in large-scale crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS ’12, pages 467–474, Richland, SC, 2012. International Foundation for Autonomous Agents and Multiagent Systems.

[MRZ05] Nolan Miller, Paul Resnick, and Richard Zeckhauser. Eliciting informative feedback: The Peer-Prediction method. Management Science, 51(9):1359–1373, September 2005.

[MTO12] Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. Learning to predict response times for online query scheduling. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, pages 621–630, New York, NY, USA, 2012. ACM.

[PCK+13] Massimo Poesio, Jon Chamberlain, Udo Kruschwitz, Livio Robaldo, and Luca Ducceschi. Phrase Detectives: Utilizing collective intelligence for internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems, 2013.

[RC10] Filip Radlinski and Nick Craswell. Comparing the sensitivity of information retrieval metrics. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 667–674. ACM, 2010.

[RYZ+10] Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, August 2010.

[Ste69] Saul Sternberg. The discovery of processing stages: Extensions of Donders’ method. Acta Psychologica, 30:276–315, 1969.

[vAMM+08] Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum. reCAPTCHA: Human-based character recognition via web security measures. Science, 321(5895):1465–1468, 2008.

[WRfW+09] Jacob Whitehill, Paul Ruvolo, Ting-fan Wu, Jacob Bergsma, and Javier Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 2035–2043, December 2009.