Breaking Bad - Understanding Behavior of Crowd Workers
in Categorization Microtasks
Ujwal Gadiraju, Ricardo Kawase, Patrick Siehndel and Besnik Fetahu
METU NCC, 2nd September 2015
Outline
● Motivation
● Categorization Tasks
● Analysis & Results
● Conclusions
What is the problem?
● Increase in the number of new task requesters on AMT (1000 per month) [Difallah et al., WWW’15].
○ Not all task requesters are familiar with task-specific settings.
○ No tangible guidelines exist for task design:
■ task length
■ monetary incentive
■ task completion time
Worker Behavior in Categorization Tasks
● Categorization tasks are one of the most common types of crowdsourced tasks. [Gadiraju et al., A taxonomy of Microtasks on the Web, HT’14]
● Experimental Setup:
○ 9 tasks deployed on CrowdFlower
○ Task length: 20, 30, 40 units
○ Monetary reward: 1, 2, 3 USD cents
Task Design
● Clear instructions and help snippets.
● Workers have to select the most suitable category in each Set (1-5), each consisting of 10 different categories.
● Category options were manually tailored to avoid ambiguity.
● Set-1 was made compulsory, Set-2 through Set-5 were optional.
● Tasks were deployed non-concurrently, and the order of units was randomized within each task.
● Tasks designed to facilitate 100% accuracy in responses (with an aim to study worker behavior).
Data Collection
● Responses gathered from 100 workers in each task ; 900 workers in total.
● We collected 27,000 unit judgments in total. In 88% of the cases, workers provided responses for all sets (incl. optional).
● Average Task Completion Time:
○ Tasks with a length of 20 units: 11.3 mins
○ Tasks with a length of 30 units: 16.4 mins
○ Tasks with a length of 40 units: 18.6 mins
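The reported totals can be cross-checked with a short sketch. It assumes the 9 tasks are the full crossing of the three task lengths with the three reward levels, and that every worker judged every unit of their task; the slides do not state either point explicitly, but the numbers are consistent with it.

```python
from itertools import product

TASK_LENGTHS = (20, 30, 40)     # units per task, from the experimental setup
REWARDS_USD_CENTS = (1, 2, 3)   # reward levels, from the experimental setup
WORKERS_PER_TASK = 100          # from the data collection slide

# Assumption: one task per (length, reward) combination, i.e. 3 x 3 = 9 tasks.
tasks = list(product(TASK_LENGTHS, REWARDS_USD_CENTS))

total_workers = WORKERS_PER_TASK * len(tasks)
# Assumption: each worker judges every unit in the task they take.
total_judgments = sum(length * WORKERS_PER_TASK for length, _reward in tasks)

print(len(tasks), total_workers, total_judgments)
```

Under these assumptions the sketch yields 9 tasks, 900 workers and 27,000 unit judgments, matching the figures above.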
Definitions
● Tipping Point: The first point (unit-index) at which a worker provides an unacceptable response after having provided at least one acceptable response. [Gadiraju et al., CHI’2015]
● Beaver Workers: Workers who exert additional effort by answering optional questions in order to help task requesters.
Consistency of Units within Tasks
● Average accuracy of around 90%, with a small standard deviation.
● We tolerate 10% incorrect responses from workers, owing to possible drifts in attention spans / boredom.
● Bad Workers : Workers who answer 10% or more of the units within a categorization task incorrectly.
● Poor Starters : Workers whose first 2 responses within a categorization task are incorrect.
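The definitions of Tipping Point, Bad Workers and Poor Starters can be sketched as small functions over one worker's responses to the compulsory set. This is a minimal sketch, not the authors' implementation: it assumes responses are given as a list of booleans (True for a correct response) in the order the worker saw the units, that "unacceptable" means "incorrect", and that unit indices are 1-based. The function names are hypothetical.

```python
from typing import Optional, Sequence


def tipping_point(correct: Sequence[bool]) -> Optional[int]:
    """1-based index of the first incorrect response that follows
    at least one correct response, or None if there is none."""
    seen_correct = False
    for index, ok in enumerate(correct, start=1):
        if ok:
            seen_correct = True
        elif seen_correct:
            return index
    return None


def is_bad_worker(correct: Sequence[bool], threshold_percent: int = 10) -> bool:
    """True if at least threshold_percent of the units are answered incorrectly."""
    if not correct:
        return False
    incorrect = sum(1 for ok in correct if not ok)
    # Integer comparison avoids floating-point issues at exactly 10%.
    return incorrect * 100 >= threshold_percent * len(correct)


def is_poor_starter(correct: Sequence[bool]) -> bool:
    """True if the first 2 responses are both incorrect."""
    return len(correct) >= 2 and not correct[0] and not correct[1]


# Hypothetical worker on a 20-unit task: 18 correct, then 2 incorrect.
late_errors = [True] * 18 + [False] * 2
print(tipping_point(late_errors), is_bad_worker(late_errors), is_poor_starter(late_errors))
```

Note that a worker who starts with two incorrect responses and then answers everything else correctly is a Poor Starter without a Tipping Point, since no incorrect response follows a correct one.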
Poor Starters, Bad Workers, & Tipping Point
Task Completion Time vs Worker Accuracy
Worker Behavior Within a Task
Key Findings
● A worker’s accuracy decreases through the course of a task (optional sets are not considered).
○ This is more prominent as the task length increases.
● Workers that exert additional effort project higher accuracies within tasks.
● The additional effort that workers exert decreases through the course of a task.
○ This is more prominent as the task length increases.
Scrutiny of Additional Responses
● % Correct Additional Responses gradually decreases from Set-1 to Set-5.
● On average, workers skip more optional sets as they proceed from Set-2 to Set-5.
Workers Breaking Bad
Adjusted Tipping Point (ATP): A worker who responds incorrectly to a run of consecutive units amounting to at least 10% of the units in a task is said to have an ATP. The index of the first unit at which this is observed is called the ATP of the worker. Such a worker is called a BREAKER.
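One way to read the ATP definition in code is shown below. This is a sketch under stated assumptions, not the authors' implementation: responses are a list of booleans (True for correct), indices are 1-based, the required run length is 10% of the task length rounded up, and the ATP is taken to be the first unit of the first qualifying run. The definition could also be read as the unit at which the run reaches the 10% mark. The function name is hypothetical.

```python
from typing import Optional, Sequence


def adjusted_tipping_point(correct: Sequence[bool],
                           fraction_percent: int = 10) -> Optional[int]:
    """1-based index where the first run of consecutive incorrect responses
    covering at least fraction_percent of the task begins, or None."""
    n = len(correct)
    if n == 0:
        return None
    # Ceiling of n * fraction_percent / 100 using integer arithmetic, at least 1.
    min_run = max(1, -(-(n * fraction_percent) // 100))
    run_start = None
    run_length = 0
    for index, ok in enumerate(correct, start=1):
        if ok:
            run_start, run_length = None, 0
            continue
        if run_length == 0:
            run_start = index
        run_length += 1
        if run_length >= min_run:
            return run_start
    return None


# Hypothetical 20-unit task: a run of 2 consecutive errors starts at unit 6.
responses = [True] * 5 + [False] * 2 + [True] * 13
print(adjusted_tipping_point(responses))
```

Under this reading, a worker with an ATP is a BREAKER, while a worker whose errors are scattered and never consecutive has no ATP even if they qualify as a Bad Worker.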
Conclusions & Future Work
To achieve good quality in categorization tasks…
● It is better to err on the lower side of monetary incentives offered.
● Use the minimum time required as a filter, but give ample time for task completion. It is better to err on the higher side of maximum task completion time.
● It is better to err on the shorter side of task length.
● We can gauge worker intentions through the nature of their responses to optional questions.
● We plan to quantify these limits and guidelines in the near future.
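The minimum-time guideline can be illustrated with a small filter. This is only a sketch: the `Submission` type, the function name and the 300-second threshold are hypothetical, and the slides do not give a concrete minimum time.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Submission:
    worker_id: str
    seconds_taken: float


def filter_by_minimum_time(submissions: List[Submission],
                           min_seconds: float) -> Tuple[List[Submission], List[Submission]]:
    """Split submissions into (kept, flagged), flagging those completed
    faster than the requester-chosen minimum time."""
    kept = [s for s in submissions if s.seconds_taken >= min_seconds]
    flagged = [s for s in submissions if s.seconds_taken < min_seconds]
    return kept, flagged


# Hypothetical data: the 300-second minimum is an illustrative value only.
submissions = [Submission("w1", 680.0), Submission("w2", 95.0), Submission("w3", 1010.0)]
kept, flagged = filter_by_minimum_time(submissions, min_seconds=300.0)
print([s.worker_id for s in kept], [s.worker_id for s in flagged])
```

The filter only applies a lower bound; in line with the guideline above, no upper bound on completion time is enforced here.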
Removal of Ineligible Workers
Ineligible workers: Workers who do not conform to the previously stated prerequisites.
● We found 9 ineligible workers who used browser-embedded translator tools in order to participate in the task.
● Ineligible workers were not considered in the further analysis.