Leveraging Side Information to Improve Label Quality Control in Crowd-sourcing

Yuan Jin and Mark Carman
Faculty of Information Technology, Monash University
{yuan.jin, mark.carman}@monash.edu

Dongwoo Kim and Lexing Xie
College of Engineering and Computer Science, Australian National University
{dongwoo.kim, lexing.xie}@anu.edu.au

Abstract

We investigate the possibility of leveraging side information for improving quality control over crowd-sourced data. We extend the GLAD model, which governs the probability of correct labeling through a logistic function in which worker expertise counteracts item difficulty, by systematically encoding different types of side information, including worker information drawn from demographics and personality traits, item information drawn from item genres and content, and contextual information drawn from worker responses and labeling sessions. Modeling side information allows for better estimation of worker expertise and item difficulty in sparse data situations and accounts for worker biases, leading to better prediction of posterior true label probabilities. We demonstrate the efficacy of the proposed framework with overall improvements in both true label prediction and unseen worker response prediction based on different combinations of the various types of side information across three new crowd-sourcing datasets. In addition, we show that the framework exhibits the potential to identify salient side information features for predicting the correctness of responses without needing to know any true label information.

Introduction

Crowd-sourcing, the process of outsourcing human intelligence tasks to an undefined, generally large group of people to seek answers via an open call (Sheng, Provost, and Ipeirotis 2008), has gained much popularity in machine learning communities in recent years for generating and collecting labeled data, thanks to the development of the corresponding online service providers, such as Amazon Mechanical Turk¹ and CrowdFlower². These platforms facilitate large-scale online data labeling and collection processes in an inexpensive and timely manner. However, they are also constantly confronted by workers with various motives and abilities, who end up producing conflicting labels for the same items. Moreover, an ever-growing number of unlabeled items versus limited budgets in most crowd-sourcing projects often results in a small number of responses per item. Aggregating such small numbers of conflicting responses using majority vote to infer the true label of each item is often unreliable.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

¹ https://www.mturk.com/
² https://www.crowdflower.com/

To overcome the above problem, the quality of responses must be controlled in a principled manner such that the influence of "high-quality" responses can outweigh that of "low-quality" ones when aggregated for the true labels. This activity is known as Quality Control for Crowd-sourcing (QCCs) (Lease 2011). The QCCs methods, largely based on statistical modeling and machine learning, consider the abilities of workers to govern the quality of the labels they produce, with greater abilities indicating higher quality (Dawid and Skene 1979; Raykar et al. 2010a). Some of them also consider the difficulty of items, which can counteract the worker abilities and undermine the label quality (Rasch 1960; Whitehill et al. 2009; Bachrach et al. 2012). These methods have overall achieved superior performance over conventional majority vote and over basic pre-task or in-task incompetent-worker filtering using control questions. In fact, it is now typical for QCCs methods to be applied after such filtering in a pipeline to enhance quality control performance.

However, there exists one major pitfall of the current QCCs models: they are vulnerable to the label sparsity problem, which happens frequently in real-world crowd-sourcing scenarios where only a few labels get collected for each item or from each worker for a task (Jung and Lease 2012). Consequently, parameter estimation for these models (e.g. estimates of worker abilities and item difficulty) becomes unreliable, causing the models' QCCs performance to deteriorate. As an example, due to the paucity of her provided labels, an expert worker can be considered inaccurate by the QCCs models if most of her labels happen to disagree with the majority, while a novice worker can be considered accurate if she happens to have made some fortunate guesses. In this case, extra side information about the demographics of these workers, the questions they have answered, the time they have taken to respond, or even their situated environments can possibly help to improve the estimation of their parameters in the models when the labels they provide are scarce. Following the previous example, if we know that the demographics of the expert (e.g. her education, interests related to the particular crowd-sourcing task) are very similar to those of the other experts who have been labeling correctly (e.g. by agreeing with the majority), then the belief of her being a novice due to her poor performance so far can be counteracted by her similarity with the other experts. As a result, her next label will be more trusted by the models.

In most crowd-sourcing tasks, there is always extra side information that is easy to collect (e.g. labeling prices, worker comments or feedback), or that can be collected with minor effort (e.g. designing simple surveys to collect demographics, or programming to collect them silently). The side information can be elicited from the items, the workers and the contexts in which the worker-item interaction takes place during the crowd-sourcing process. Relevant work in crowd-sourcing thus far has focused on exploiting only very specific types of side information for improving the performance of QCCs. To the best of our knowledge, a study that develops a scalable framework able to integrate and utilize arbitrary types of side information for improving true label prediction is still missing.

The corresponding contributions of this paper are summarized as follows:

• We have developed a probabilistic framework, by extending the basic GLAD model (Whitehill et al. 2009), that seamlessly integrates various types of side information while modeling the interaction between the worker expertise and the item difficulty;

• Overall improvements in both the true label prediction and the unseen (held-out) worker response prediction have been found in our experiments over three new crowd-sourcing datasets, all of which experience the label sparsity problem for their items. The experiments involve (1) the comparison between the different instantiations of our framework with incremental amounts of side information and three baseline methods in terms of the above two prediction tasks, and (2) their comparison in terms of the same prediction tasks by learning from small random response subsets sampled from the workers of the three datasets, where they are exposed to escalated sparsity across both the workers and the items;

• We show the framework has the potential of automatically identifying side information features important for predicting the correctness of responses in an unsupervised manner.

Related Work

Research that considers differences in worker abilities while inferring the true labels for items dates back to the work of Dawid and Skene (1979), who integrated parameters modeling workers' abilities and biases into a single confusion matrix accommodating conditional probabilities of all possible response labels given all possible true labels. Since then, many models have been proposed to prevent the learning of the confusion matrix for each worker from over-fitting the corresponding labeled data. They have succeeded via either simplifying the confusion matrix setting by making it symmetric (Whitehill et al. 2009; Raykar et al. 2010b; Liu, Peng, and Ihler 2012; Wauthier and Jordan 2011), or grouping workers with similar confusion matrices together to smooth the worker-specific confusion matrices with the ones learned at the group level (Venanzi et al. 2014; Moreno et al. 2014). Sometimes modeling worker abilities alone is not enough for accurate estimation of label quality, which might also be affected by variations in labeling accuracy across items. Accordingly, there has been research (Whitehill et al. 2009; Bachrach et al. 2012) taking into account another set of model parameters known as item difficulty, with others further considering the multi-dimensional interactions between the two entities (Welinder et al. 2010; Ruvolo, Whitehill, and Movellan 2013).

Among all the prior work, research investigating the use of side information to improve QCCs is limited. Kamar, Kapoor, and Horvitz (2015) studied utilizing observed side information, in particular features of the items, to estimate item-side confusion matrices to account for task-dependent biases. Kajino, Tsuboi, and Kashima (2012) developed convex optimization techniques using worker-specific classifiers centered on a base classifier which takes in item features for inferring their true labels. Ruvolo, Whitehill, and Movellan (2013) built a multi-dimensional logit model for predicting the correct label probability based on observed worker features. Ma et al. (2015) took into account "bag-of-words" information for learning the topical expertise of individual workers. Venanzi et al. (2016) considered response delay information to better distinguish spammers from genuine workers. We can see that each of these relevant works has focused on exploiting only one specific type of side information for improving the label quality control performance.

Proposed Framework

We are interested in developing a unified framework that is able to predict item true labels based on both worker responses and various types of side information about workers, items and contexts. The framework should be scalable for incorporating new types of side information. Table 1 summarizes the inputs, parameters and hyper-parameters of the proposed framework.

Basic Framework

Our proposed framework extends the GLAD model (Whitehill et al. 2009), which is shown in Figure 1a. GLAD applies a logistic function to the product between worker expertise and item difficulty variables to calculate the probability of correct labeling. More precisely, GLAD defines the probability of a response r_uv being correct (i.e. equal to the true label l_v) as follows:

$$p(r_{uv} = l_v) = f^H_{uv} = \frac{1}{1 + \exp(-z^H_{uv})}, \qquad z^H_{uv} = f^H_{uu} f^H_{vv}, \qquad f^H_{uu} = e_u, \qquad f^H_{vv} = \exp(d_v) \qquad (1)$$

where H = {e_u, d_v}_{u∈U, v∈V} denotes the set of model parameters excluding the latent true labels L of items V. According to GLAD, e_u ∈ ℝ models the expertise of worker u, and 1/exp(d_v) with d_v ∈ ℝ models the difficulty of item v. For clarity, we call d_v the easiness of item v for the rest of the paper. Moreover, GLAD assumes normal priors over e_u and d_v, with μ_u, σ²_u and μ_v, σ²_v being their respective prior means and variances. Whitehill et al. (2009) further simplify the GLAD model so that, instead of needing to infer a |K| × |K| matrix of parameters for each worker as done by Dawid and Skene (1979), the model makes use of the following conditional statements:

$$p(r_{uv} \mid l_v) = f^H_{uv} \ \text{if}\ r_{uv} = l_v; \qquad \text{otherwise} \quad \frac{1 - f^H_{uv}}{|K| - 1} \qquad (2)$$

We choose GLAD as the basis of our framework because of its factorized nature, which easily allows for the linear combination of different factors encoding arbitrary information about workers, items and their dyadic relations.
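To make Equations 1 and 2 concrete, here is a minimal Python sketch (ours, purely illustrative; variable names are not from the paper) of the GLAD correct-labeling probability and response likelihood:

```python
import numpy as np

def correct_prob(e_u, d_v):
    """Eq. (1): probability that worker u labels item v correctly,
    given expertise e_u and (log-)easiness d_v."""
    z_uv = e_u * np.exp(d_v)             # worker factor times item factor
    return 1.0 / (1.0 + np.exp(-z_uv))   # logistic link

def response_likelihood(r_uv, l_v, e_u, d_v, num_classes):
    """Eq. (2): likelihood of response r_uv given true label l_v; the
    incorrect-label mass is spread uniformly over the other classes."""
    f_uv = correct_prob(e_u, d_v)
    return f_uv if r_uv == l_v else (1.0 - f_uv) / (num_classes - 1)

# A fairly expert worker (e_u = 2) on an easy item (d_v = 0.5)
print(correct_prob(2.0, 0.5))                  # approx. 0.96
print(response_likelihood(1, 0, 2.0, 0.5, 3))  # approx. 0.02 for a wrong label
```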

Incorporating Worker Information

We start with encoding information about workers into the basic framework, as shown in Figure 1b. In this case, the worker-side expression f^H_uu is changed to:

$$f^H_{uu} = e_u + \mathbf{x}_u^T \boldsymbol{\beta}^U \qquad (3)$$

Here the dot product between the multi-dimensional feature vector x_u of worker u and the weight vector β^U forms a global regression across all the workers, with β^U learned to bring the expertise offsets of similar workers closer together. This helps to smooth the irregular expertise estimates that result from the sparse labels across workers. Moreover, we assume a normal prior over the m-th component of β^U with mean μ^U_m and standard deviation σ^U_m.

Incorporating Item Information

The item information is incorporated into the basic framework as shown in Figure 1c. The item-side expression f^H_vv now has the following form:

$$f^H_{vv} = \exp(d_v + \mathbf{x}_v^T \boldsymbol{\beta}^V) \qquad (4)$$

where the dot product between the multi-dimensional feature vector x_v of item v and the weight vector β^V forms a global regression over all the items. β^V serves the same purpose as its worker-side counterpart. We assume a normal prior over the m-th component of β^V with mean μ^V_m and standard deviation σ^V_m.

Incorporating Response and Session Information

We consider contextual information at both the response level and the session level. The former type of contextual information is specific to each response given by a worker to an item, encoded by features including response delay and order, while the latter is specific to each labeling session of a worker, which we define to start when a task page is loaded and to end once the page is submitted with no time-out in between. In this case, session features can include labeling devices (e.g. a personal computer), rendering browsers and the time periods (e.g. the day of the week) of the labeling. The session features capture much more variation in the labeling across different workers than within each worker, as most of the workers remain situated in the same environments throughout the entire crowd-sourcing tasks. In contrast, the response features should provide much more insight into the variation within each worker's labeling behavior.

Symbol              Description
--------------------------------------------------------------------------
Inputs
U                   set of workers
V                   set of items
K                   set of label categories
r_uv ∈ K            response of worker u for item v
R                   set of responses {r_uv | (u ∈ U) ∧ (v ∈ V)}
x_u                 feature vector for worker u
x_v                 feature vector for item v
x_us                feature vector for a labeling session s of worker u
x_uv                feature vector for the response given by worker u to item v

Parameters
L = {l_v}_{v∈V}     set of latent true label variables for items in V
H                   set of model parameters excluding L
θ                   probability vector over item true label categories
e_u                 expertise variable e_u ∈ (−∞, +∞) of worker u
d_v                 "easiness" variable d_v ∈ [0, +∞) of item v
β^U                 weight vector for worker features x_u
β^V                 weight vector for item features x_v
β^S                 weight vector for session features x_us
α_u                 weight vector for the responses given by worker u

Hyper-parameters
H_0                 set of hyper-parameters for the model parameters
γ                   Dirichlet prior for item true label l_v
μ_u, σ_u            Normal prior for e_u
μ_v, σ_v            Normal prior for d_v
μ^U, σ^U            Normal prior for each weight component of β^U
μ^V, σ^V            Normal prior for each weight component of β^V
μ^S, σ^S            Normal prior for each weight component of β^S
μ^α, σ^α            Normal prior for each weight component of α_u

Table 1: List of notation used.

To leverage the advantages of both types of contextual information, we incorporate them into the basic framework as shown in Figures 1d and 1e, where S is the set of labeling sessions, each corresponding to a task page containing a number of label-collecting questions.

The corresponding changes made to the worker-side expression and to the entire expression are, respectively, the following:

$$f^H_{uu} = e_u + \mathbf{x}_{us}^T \boldsymbol{\beta}^S, \qquad z^H_{uv} = f^H_{uu} f^H_{vv} + \mathbf{x}_{uv}^T \boldsymbol{\alpha}_u \qquad (5)$$

Here the dot product between the multi-dimensional feature vector x_us of worker u within session s and the weight vector β^S forms a global regression over all the sessions of all the workers, while α_u is specific to worker u, serving as the weight vector of a local linear regression over the feature vectors of all the responses made by worker u. Such local regressions aim at addressing worker-specific biases that GLAD fails to handle properly (Welinder et al. 2010). We again assume normal priors over the m-th components of both β^S and α_u, with means μ^S, μ^α and standard deviations σ^S, σ^α, respectively.
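The sketch below (ours, not the authors' code) assembles the extended log-odds of Equations 3-5 for the variant that uses all feature types; we assume here that the worker and session offsets combine additively, and all feature dimensions are invented for illustration.

```python
import numpy as np

def extended_log_odds(e_u, d_v, x_u, x_v, x_us, x_uv,
                      beta_U, beta_V, beta_S, alpha_u):
    """Eqs. (3)-(5): log-odds of a correct response with side information.
    Worker side:   e_u + x_u . beta_U + x_us . beta_S   (global regressions)
    Item side:     exp(d_v + x_v . beta_V)
    Response bias: x_uv . alpha_u   (worker-specific local regression)"""
    f_uu = e_u + x_u @ beta_U + x_us @ beta_S
    f_vv = np.exp(d_v + x_v @ beta_V)
    return f_uu * f_vv + x_uv @ alpha_u

def correct_prob(z_uv):
    """Logistic link of Eq. (1) applied to the extended log-odds."""
    return 1.0 / (1.0 + np.exp(-z_uv))

# Toy example with made-up feature dimensions; zero weights recover plain GLAD.
rng = np.random.default_rng(0)
z = extended_log_odds(e_u=2.0, d_v=0.0,
                      x_u=rng.normal(size=3), x_v=rng.normal(size=2),
                      x_us=rng.normal(size=2), x_uv=rng.normal(size=2),
                      beta_U=np.zeros(3), beta_V=np.zeros(2),
                      beta_S=np.zeros(2), alpha_u=np.zeros(2))
print(correct_prob(z))  # equals the basic GLAD probability for e_u = 2, d_v = 0
```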

Figure 1: (a) GLAD; (b) GLAD + worker features; (c) GLAD + item features; (d) GLAD + labeling session features; (e) GLAD + response features. The subfigures show the GLAD model and the GLAD model with the observed features about workers, items, sessions and responses, respectively.

Inference

In this section, we describe a stochastic parameter estimator to obtain posterior probabilities of the latent true label variables L. More specifically, in each iteration of the estimation, we alternate between collapsed Gibbs sampling (Griffiths and Steyvers 2004) for L given the current estimates of the model parameters H, and one-step gradient descent for updating H given the sampled L.

Collapsed Gibbs Sampling for L

At this stage, we employ a collapsed Gibbs sampler to obtain posterior samples for L = {l_v}_{v∈V} given the current estimates of H. The conditional probability of true label l_v is obtained by marginalizing out the multinomial probability vector θ, which ends up being:

$$P(l_v = k \mid L_{\setminus v}, R_v, H, \boldsymbol{\gamma}) \propto \frac{N^{\setminus v}_k + \gamma_k}{\sum_{j \in K} (N^{\setminus v}_j + \gamma_j)} \times \prod_{u \in U_v} \left( (f^H_{uv})^{\mathbb{1}\{r_{uv} = l_v\}} \left( \frac{1 - f^H_{uv}}{|K| - 1} \right)^{\mathbb{1}\{r_{uv} \neq l_v\}} \right) \qquad (6)$$

where U_v is the set of workers who responded to item v with a set of responses R_v, L_{\v} is the set of current true label assignments to all the items excluding item v, and N^{\v}_k is the number of items excluding v whose true labels are currently inferred to be k.
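The following short sketch (ours) implements the sampling step of Equation 6 for one item, assuming the per-response correct-label probabilities f_uv have already been computed from the current parameter estimates:

```python
import numpy as np

def sample_true_label(responses_v, f_v, label_counts, gamma, num_classes, rng):
    """Draw l_v from Eq. (6).
    responses_v : responses r_uv given to item v
    f_v         : current correct-label probabilities f_uv, aligned with responses_v
    label_counts: counts N_k of the other items currently assigned to each label
    gamma       : Dirichlet hyper-parameter vector"""
    log_post = np.log(label_counts + gamma) - np.log(np.sum(label_counts + gamma))
    for k in range(num_classes):
        agree = (responses_v == k)
        # f_uv where the response agrees with candidate label k, else (1 - f_uv)/(K - 1)
        log_post[k] += np.sum(np.log(np.where(agree, f_v,
                                              (1.0 - f_v) / (num_classes - 1))))
    post = np.exp(log_post - log_post.max())    # normalize in a numerically stable way
    return rng.choice(num_classes, p=post / post.sum())

rng = np.random.default_rng(1)
print(sample_true_label(np.array([0, 0, 1]), np.array([0.8, 0.7, 0.6]),
                        np.array([10.0, 12.0]), np.ones(2), 2, rng))
```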

Gradient Descent for H

The conditional probability distributions of the model parameters H = (θ, e, d, β^U, β^V, β^S, α) are hard to compute analytically due to the presence of the logistic function. Instead, we run gradient descent for one step with respect to H on the negative logarithm of its joint conditional probability distribution:

$$Q(H) = -\sum_{v \in V} \sum_{u \in U_v} \log \left( (f^H_{uv})^{\mathbb{1}\{r_{uv} = l_v\}} \left( \frac{1 - f^H_{uv}}{|K| - 1} \right)^{\mathbb{1}\{r_{uv} \neq l_v\}} \right) - \log p(H \mid H_0). \qquad (7)$$

The first term of Equation 7 is the negative log-likelihood of the response data, and the second term is the negative log-prior, with H_0 denoting the set of hyper-parameters of H. To minimize Q(H), we take the partial derivative of Equation 7 with respect to each element in H.

Estimate e_u and d_v   In this case, the relevant prior terms (from −log p(H | H_0)) are $\frac{(d_v - \mu_v)^2}{2\sigma^2_v} + \frac{(e_u - \mu_u)^2}{2\sigma^2_u}$. The gradients with respect to d_v and e_u are thus:

$$\frac{\partial Q}{\partial d_v} = -\sum_{u \in U_v} \left( \delta_{uv} f^H_{uu} f^H_{vv} \right) + \frac{d_v - \mu_v}{\sigma^2_v} \qquad (8)$$

$$\frac{\partial Q}{\partial e_u} = -\sum_{v \in V_u} \left( \delta_{uv} f^H_{vv} \right) + \frac{e_u - \mu_u}{\sigma^2_u} \qquad (9)$$

where $\delta_{uv} = \mathbb{1}\{r_{uv} = l_v\}(1 - f^H_{uv}) - \mathbb{1}\{r_{uv} \neq l_v\} f^H_{uv}$, and V_u is the set of items responded to by worker u.
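As a concrete illustration (ours), the following sketch performs one gradient-descent step on e_u and d_v for a single response using Equations 8 and 9; in the full estimator the sums run over all responses of the worker and of the item.

```python
import numpy as np

def grad_step_worker_item(e_u, d_v, delta_uv, mu_u, var_u, mu_v, var_v, step):
    """One gradient step on Q(H) for a single (worker u, item v) response.
    delta_uv = 1{r_uv = l_v}(1 - f_uv) - 1{r_uv != l_v} f_uv, as defined above."""
    f_uu = e_u            # worker-side factor of the basic model (Eq. 1)
    f_vv = np.exp(d_v)    # item-side factor
    grad_d = -delta_uv * f_uu * f_vv + (d_v - mu_v) / var_v   # Eq. (8), one summand
    grad_e = -delta_uv * f_vv + (e_u - mu_u) / var_u          # Eq. (9), one summand
    return e_u - step * grad_e, d_v - step * grad_d

# A correct response (positive delta_uv) nudges both e_u and d_v upward.
print(grad_step_worker_item(2.0, 0.0, 0.1,
                            mu_u=2.0, var_u=0.1, mu_v=0.0, var_v=0.1, step=0.001))
```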

Estimate β^U_m and β^V_m   Taking derivatives w.r.t. β^U_m and β^V_m, the m-th components of β^U and β^V respectively, yields similar equations:

$$\frac{\partial Q}{\partial \beta^U_m} = -\sum_{v \in V} \sum_{u \in U_v} \left( \delta_{uv} f^H_{vv} x_{mu} \right) + \frac{\beta^U_m - \mu^U}{(\sigma^U)^2} \qquad (10)$$

$$\frac{\partial Q}{\partial \beta^V_m} = -\sum_{v \in V} \sum_{u \in U_v} \left( \delta_{uv} f^H_{uu} f^H_{vv} x_{mv} \right) + \frac{\beta^V_m - \mu^V}{(\sigma^V)^2} \qquad (11)$$

Equation 10 is also applied to $\frac{\partial Q}{\partial \beta^S_m}$, with x_mus and β^S_m respectively replacing x_mu and β^U_m.

Estimate α_mu   The gradient w.r.t. α_mu, the m-th component of α_u, is given by:

$$\frac{\partial Q}{\partial \alpha_{mu}} = -\sum_{v \in V_u} \left( \delta_{uv} x_{muv} \right) + \frac{\alpha_{mu} - \mu^{\alpha}}{\sigma^2_{\alpha}} \qquad (12)$$

where x_muv denotes the m-th component of the response feature vector x_uv.
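The weight gradients of Equations 10-12 vectorize naturally over the response list; a sketch (ours) under the assumption that delta, f_uu and f_vv are per-response arrays aligned with stacked feature matrices:

```python
import numpy as np

def grads_for_weights(delta, f_uu, f_vv, X_worker, X_item, X_resp,
                      beta_U, beta_V, alpha_u, mu, var):
    """Gradients of Q w.r.t. the regression weights (Eqs. 10-12).
    delta, f_uu, f_vv : per-response arrays of length N
    X_worker, X_item, X_resp : per-response feature rows, shape (N, m).
    For alpha_u (Eq. 12) the rows should be restricted to worker u's responses."""
    g_beta_U = -(delta * f_vv) @ X_worker + (beta_U - mu) / var        # Eq. (10)
    g_beta_V = -(delta * f_uu * f_vv) @ X_item + (beta_V - mu) / var   # Eq. (11)
    g_alpha = -delta @ X_resp + (alpha_u - mu) / var                   # Eq. (12)
    return g_beta_U, g_beta_V, g_alpha

N, m = 5, 3
rng = np.random.default_rng(2)
print(grads_for_weights(rng.normal(size=N), np.ones(N), np.ones(N),
                        rng.normal(size=(N, m)), rng.normal(size=(N, m)),
                        rng.normal(size=(N, m)),
                        np.zeros(m), np.zeros(m), np.zeros(m), mu=0.0, var=0.01))
```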

Experiments

We present experiments that study the performance of our framework in combining different types of side information to improve the quality control for crowd-sourcing on three real-world datasets.

Datasets

The three datasets were collected in separate crowd-sourcing tasks on CrowdFlower, with three responses collected for each item. As a basic quality control measure, we filtered out workers who did not achieve 88% accuracy on pre-defined control questions. The qualified workers were also asked for additional information, including demographics and personality traits³. Each qualified worker is allowed to label a certain number of items for each task and is free to quit labeling at any time. Nineteen items were randomly selected and shown to a worker on each task page. Table 2 provides a summary of the three datasets.

Dataset             # Workers   # Items   # Responses
Stack Overflow      505         14,021    42,063
Evergreen Webpage   434         7,336     22,008
TREC 2011           160         1,826     5,478

Table 2: Dataset summary.

TREC 2011 Crowd-sourcing Track   Each CrowdFlower worker was asked to judge the relevance levels (i.e. highly relevant, relevant and non-relevant) of 38 Web-pages to their corresponding queries in the TREC 2011 Crowd-sourcing Track dataset⁴. Figure 2a shows a question for collecting the relevance judgments for a pair of query (i.e. "french lick resort and casino") and document.

Stack Overflow Post Status Judgment   Each CrowdFlower worker was asked to judge the status of 95 archived questions from Stack Overflow⁵. The status of a question can be either open, meaning it is regarded suitable to stay active (i.e. visible, answerable and editable) on Stack Overflow, or closed, meaning the opposite, for reasons including that it is not a real question (i.e. questions that are ambiguous, too broad or "show no efforts" in seeking answers), not constructive to the Website (i.e. questions that are subjective and have no correct answers), or too localized (i.e. questions that are not reproducible, and thus useless to other users in the future). Figure 2b shows a question for collecting the status judgments for a Stack Overflow post.

³ This was done by mixing the demographic survey questions with the control questions in the quiz for the workers to answer prior to starting the crowd-sourcing task. As a result, there exist a small number of missing values in the demographic data collected, as not all the survey questions were chosen by CrowdFlower to appear on the quiz page.
⁴ https://sites.google.com/site/treccrowd/2011
⁵ The set of questions judged is a random subset of the training dataset used in the Kaggle competition "Predict Closed Questions on Stack Overflow" (https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow).

Worker (all tasks): 1. age, 2. gender, 3. education, 4. personality traits
Worker (Stack-Overflow): self-appraisal about 1. programming experience, 2. frequency of searching/posting/editing/answering questions, 3. diversity of questions dealt with
Worker (Evergreen): 1. mother tongue; self-appraisal about 2. frequency of online search, 3. frequency of bookmarking Web-pages, 4. frequency of revisiting Web-pages
Worker (TREC): 1. mother tongue; self-appraisal about 2. frequency of online search, 3. diversity of online search topics, 4. average # SERPs checked
Item (all tasks): 1. content length, 2. item genre
Item (Evergreen only): 1. Web-page features
Response (all tasks): 1. response time/delays, 2. response order
Session (all tasks): 1. weekends or weekdays, 2. time of the day, 3. labeling devices (e.g. PC, tablets, etc.)

Table 3: Features encoding the different types of side information.

Evergreen Webpage Judgment   Each CrowdFlower worker was asked to judge 57 Web-pages⁶ on whether they think these Web-pages have a timeless quality or, in other words, will still be considered by average workers as valuable or relevant in the future.

Feature Collection

We collected various types of side information, as summarized in Table 3.

Worker Features   The worker features include both the demographic and personality trait features, which are common to all three tasks, and the "self-appraisal" features, which vary from task to task. The ages of workers (starting from 18) were discretized into 9 groups, with the first 8 groups each covering a 5-year span and the last being age 60 or over. The education backgrounds of workers were divided into 5 categories, from "Less than high school" to "Master degree or above". To collect information about the personality traits of workers, we directly employed the 10 survey questions used by Kazai, Kamps, and Milic-Frayling (2012), based on the so-called five personality trait dimensions (John, Naumann, and Soto 2008). As for the "self-appraisal" features, these were designed to capture the possible nuances in workers' expertise levels on different tasks from their own perspectives.

Item Features   The item features include both the common features, which are the content length and item genres (already known for each item of each dataset), and those unique to the Evergreen dataset, which are the original Web-page features⁷ provided in the Kaggle competition.

⁶ We used the training set from the Kaggle competition "StumbleUpon Evergreen Classification Challenge" (https://www.kaggle.com/c/stumbleupon).

Figure 2: (a) a question for relevance judgment; (b) a question for Stack Overflow post status judgment.

Response and Session Features   The contextual features were collected at both the response and the session levels, and were the same across all the tasks. The response-level feature "response delay" records the amount of time each worker took to label each item. Its value was calculated by subtracting the click time of the previous item (or the page load time if it was more recent) from the click time of the particular item. We also computed the "response order" of each item by ordering items by their "last click time". As for the session feature "time of the day", we set its value to either "daytime", "night" or "late night", corresponding to the periods [6am, 7pm), [7pm, 11pm) and [11pm, 6am), respectively.

Feature Normalization   The pre-processing of the features involved binarizing the non-numeric features and normalizing the numeric features using a Z-score transformation (except for the numeric feature "response order", for which we used a min-max normalization as its values were always uniformly distributed). For numeric features with highly skewed empirical distributions, a log-transformation⁸ was applied prior to normalizing with the Z-score transformation. Finally, we normalized the response-level feature "response delay" on a per-worker basis (i.e. using worker-specific mean and standard deviation values), in order to facilitate the local linear regression with the worker-specific weight vector α_u as specified in Equation 5.
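A minimal sketch (ours) of this normalization pipeline; only the log-transform constant and the per-worker z-scoring of "response delay" follow the text, everything else is illustrative:

```python
import numpy as np

def log_transform(x):
    """log(c + x) with c = min(0.1, min(x)), for highly skewed numeric features."""
    c = min(0.1, float(np.min(x)))
    return np.log(c + x)

def z_score(x):
    return (x - x.mean()) / x.std()

def min_max(x):
    return (x - x.min()) / (x.max() - x.min())

def normalize_response_delay(delays, worker_ids):
    """Per-worker z-scoring of response delay, supporting the worker-specific
    local regression with alpha_u (Eq. 5)."""
    delays = np.asarray(delays, dtype=float)
    worker_ids = np.asarray(worker_ids)
    out = np.empty(len(delays))
    for w in np.unique(worker_ids):
        idx = worker_ids == w
        out[idx] = z_score(delays[idx])
    return out

print(min_max(np.array([1.0, 2.0, 3.0, 4.0])))     # e.g. the "response order" feature
print(normalize_response_delay([2.0, 4.0, 6.0, 1.0, 3.0], ["a", "a", "a", "b", "b"]))
```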

Experiment Setup

Baselines   We verify the efficacy of our model by comparing it with the following three baselines:

• Majority Vote: the predicted true label for an item is the response given by the majority of the workers.

• GLAD: the probability of a correct response is a logistic function over the product between the worker expertise and the item "easiness" variables.

• Community-based Dawid-Skene (DS) (Venanzi et al. 2014): to smooth out unreliable estimates of the confusion matrix entries due to label sparsity, the matrix is drawn (row-wise) from one that is shared by a community to which the worker is inferred to belong.

⁷ https://www.kaggle.com/c/stumbleupon/data
⁸ log(c + x) where c = min(0.1, min(x))

Evaluation Metrics   We use the following metrics to evaluate the performance of the baseline methods and our proposed framework in terms of the item true label prediction:

• Predictive accuracy (Accu): $\frac{1}{|V|} \sum_{v \in V} \mathbb{1}\{\hat{l}_v = l_v\}$, where $\hat{l}_v = \arg\max_{l \in K} P(l_v = l \mid R, \text{Model})$

• Log-Loss (Log): $-\frac{1}{|V|} \sum_{v \in V} \sum_{l \in K} \mathbb{1}\{l_v = l\} \log\left( P(l_v = l \mid R, \text{Model}) \right)$

We use the following metric to evaluate the performance of the baselines and our proposed framework with respect to the unseen worker response prediction:

• Mean Absolute Error (MAE): $\frac{1}{|R_h|} \sum_{r_{uv} \in R_h} \mathbb{1}\{\hat{r}_{uv} \neq r_{uv}\}$, where $\hat{r}_{uv} = \arg\max_{l \in K} P(r_{uv} = l \mid R_{\setminus h}, \text{Model})$

The first measure (Accu) tells us the performance in terms of the number of correctly predicted true labels. The second measure (Log) tells us how confident the model is in its predictions of the true labels; it is often more sensitive than the first and can therefore be useful for comparing similarly performing models. The third measure (MAE) informs us of the performance with respect to the average number of correctly predicted held-out responses R_h from the workers, based on the posterior mode r̂_uv given the training responses R_{\h} and the specific model being tested. In the experiment using the entire set of responses, the proportion of held-out test data R_h is set to 30%, and the accuracy of the unseen worker response prediction of a model is obtained by averaging its performance over 10 such held-out tests.
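All three metrics can be computed directly from the posterior class probabilities; a small sketch (ours), assuming the model exposes a matrix post[v, l] = P(l_v = l | R, Model):

```python
import numpy as np

def accuracy(true_labels, post):
    """Accu: fraction of items whose posterior mode matches the true label."""
    return np.mean(post.argmax(axis=1) == true_labels)

def log_loss(true_labels, post, eps=1e-12):
    """Log: negative mean log posterior probability assigned to the true labels."""
    return -np.mean(np.log(post[np.arange(len(true_labels)), true_labels] + eps))

def mae(held_out_responses, post_resp):
    """MAE: fraction of held-out responses whose posterior mode is wrong."""
    return np.mean(post_resp.argmax(axis=1) != held_out_responses)

post = np.array([[0.7, 0.3], [0.2, 0.8], [0.6, 0.4]])
true = np.array([0, 1, 1])
print(accuracy(true, post), log_loss(true, post), mae(true, post))
```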

Hyper-parameter Setup   For GLAD, Whitehill et al. (2009) impose normal priors over both e_u and d_v, with means and standard deviations both set to 1. This setting assumes that a priori approximately 75% of workers are reliable and 75% of items are relatively easy for average workers. Although straightforward, we believe that the above hyper-parameter settings for GLAD are not particularly effective for modeling common crowd-sourcing scenarios, for two main reasons.

First, crowd-sourcing tasks conducted on CrowdFlower can, by default, only be attempted by leveled⁹ (in other words, experienced and high-quality) workers. We adopted this default setting for all our tasks. Moreover, control questions are very often used to vet and remove low-performing workers before and during the tasks. Thus, we believe that a priori no workers should be considered unreliable (that is, e_u below or close to 0). Instead, all workers should be expected to have abilities close to the prior mean.

Secondly, it is common to collect only a small number of responses for each item during crowd-sourcing. Such few responses can hardly provide sufficient information for GLAD to reliably estimate d_v. Moreover, GLAD applies an exponential transformation to d_v to ensure its non-negativity, which is likely to further "inflate" its inaccurate estimation. As an example, when d_v = 3 (i.e. two positive standard deviations from its prior mean), exp(d_v) = 20.1, much larger than e_u = 3, which is also two positive standard deviations from its prior mean. As a result, the log-odds of correct labeling, modeled as their product in GLAD, is likely to be dominated by the potentially inaccurate estimates of d_v. Thus, we suggest a stronger regularization penalty for d_v to suppress these problems. Based on the above analysis, we set the prior means for e_u and d_v to 2 and 0 respectively. Note that assigning the latter to zero imposes an uninformative setting for the prior of d_v. We set the prior variance for both to 0.1, in accordance with the arguments above (to limit the variance of e_u and to increase the regularization on d_v).

Each component of the Dirichlet prior vector γ is set to 1. For each regression weight vector (e.g. β^U), we set the prior means of its components to 0, and the prior standard deviations to (0.01 / #features), so that the influence brought by each global/local regression is comparable to that brought by its affecting factor (e.g. e_u). As for the gradient descent step size, we define a default step size of η = 0.001 and calculate a parameter-specific size based on the number of data instances available for estimating the parameter, that is (η / #datapoints). Discounting η is necessary as parameters are estimated at different (global or local) levels in our framework. Except for the Dirichlet prior for the class proportions, which is fixed at 1, the other hyper-parameters of the community-based DS, including the number of communities, are set through 10-fold cross-validation repeated and averaged over 5 times, evaluated upon the likelihood of the validation responses.
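The settings above amount to a small configuration; the sketch below (ours) simply encodes the stated values, with the parameter-specific step size computed as η discounted by the number of data points informing each parameter:

```python
# Hyper-parameter values stated in this section (illustrative encoding, not the authors' code).
priors = {
    "e_u":   {"mean": 2.0, "var": 0.1},  # workers assumed a priori reliable
    "d_v":   {"mean": 0.0, "var": 0.1},  # uninformative mean, stronger regularization
    "gamma": 1.0,                        # each component of the Dirichlet prior
}

def weight_prior_std(num_features, base=0.01):
    """Prior standard deviation for each regression weight component: 0.01 / #features."""
    return base / num_features

def step_size(num_datapoints, eta=0.001):
    """Parameter-specific gradient step: the default eta discounted by the data count."""
    return eta / num_datapoints

print(weight_prior_std(20), step_size(3))  # e.g. 20 worker features; an item with 3 responses
```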

Prediction with Subsampled Responses   While the default evaluation task checks the efficacy of our framework under item-side label sparsity, as only 3 responses were collected for each item, we would like to further investigate whether the framework can handle even greater degrees of label sparsity, occurring not only on the item side but also on the worker side. To do this, we randomly subsampled the same number of responses from each worker. By merging all the subsampled responses from each worker, we obtained a data subset with far fewer items for each of the three datasets. We varied the number of responses subsampled per worker from 1 to 12 (after which we observed only marginal differences in model performance), and ran all the models for the true label prediction as well as the unseen (held-out) worker response prediction at each subsampling point. The unseen response prediction is evaluated on the remaining responses. Both prediction tasks are evaluated using the same metrics as in the experiments with the full responses. The whole subsampling procedure was repeated 5 times before we obtained the average predictive accuracy of each model. The hyper-parameter setup in this case remained unchanged: we employed the same hyper-parameter settings for our framework and GLAD, and the same 10-fold cross-validation for finding the optimal number of communities.

⁹ http://crowdflowercommunity.tumblr.com/post/80598014542/introducing-contributor-performance-levels
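A short sketch (ours) of the per-worker subsampling used here, assuming responses are stored as (worker, item, label) triples; whatever is left out forms the unseen-response evaluation set:

```python
import random
from collections import defaultdict

def subsample_per_worker(responses, n_per_worker, seed=0):
    """Keep at most n_per_worker randomly chosen responses from each worker."""
    random.seed(seed)
    by_worker = defaultdict(list)
    for worker, item, label in responses:
        by_worker[worker].append((worker, item, label))
    kept, held_out = [], []
    for worker, triples in by_worker.items():
        random.shuffle(triples)
        kept.extend(triples[:n_per_worker])
        held_out.extend(triples[n_per_worker:])
    return kept, held_out

data = [("w1", "i1", 0), ("w1", "i2", 1), ("w1", "i3", 0), ("w2", "i1", 1)]
train, test = subsample_per_worker(data, n_per_worker=2)
print(len(train), len(test))  # 3 responses kept (2 from w1, 1 from w2), 1 held out
```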

                    Stackoverflow        Evergreen            TREC
                    Accu     Log         Accu     Log         Accu     Log
MV                  0.6083   0.9748      0.7630   0.6635      0.4720   1.124
4-Community DS      0.6088   0.9292      0.7633   0.6248      0.4818   1.112
GLAD                0.6084   0.9323      0.7631   0.6385      0.4747   1.126
GLAD+I              0.6085   0.9306      0.7632   0.6280      0.4765   1.122
GLAD+L              0.6084   0.9311      0.7631   0.6296      0.4751   1.126
GLAD+R              0.6087   0.9294      0.7633   0.6271      0.4766   1.122
GLAD+S              0.6085   0.9306      0.7631   0.6292      0.4760   1.123
GLAD+I+L+R          0.6088   0.9294      0.7633   0.6256      0.4808   1.116
GLAD+I+L+S          0.6087   0.9301      0.7633   0.6252      0.4804   1.116
GLAD+I+R+S          0.6090   0.9289      0.7634   0.6245      0.4819   1.112
GLAD+L+R+S          0.6088   0.9298      0.7633   0.6258      0.4802   1.118
GLAD+I+L+R+S        0.6090   0.9288      0.7634   0.6238      0.4820   1.108

Table 4: True label predictive accuracy of the models across the three datasets. We denote the side information about the workers, the items, the sessions and the responses respectively with the capital letters "L", "I", "S" and "R".

                    Stackoverflow   Evergreen   TREC
                    MAE             MAE         MAE
4-Community DS      0.2306          0.1858      0.5176
GLAD                0.2438          0.1901      0.5216
GLAD+I              0.2347          0.1874      0.5176
GLAD+L              0.2396          0.1898      0.5195
GLAD+R              0.2288          0.1854      0.5173
GLAD+S              0.2356          0.1901      0.5211
GLAD+I+L+R          0.2245          0.1801      0.5162
GLAD+I+L+S          0.2325          0.1854      0.5187
GLAD+I+R+S          0.2276          0.1812      0.5169
GLAD+L+R+S          0.2318          0.1860      0.5184
GLAD+I+L+R+S        0.2226          0.1801      0.5160

Table 5: Unseen (held-out) response predictive accuracy (MAE) of the models across the 30% held-out response data from the three datasets.

Figure 3: Changes of the true label predictive accuracy by varying the number of responses subsampled from each worker, across the three datasets.

Results

The results of the experiment using the entire response data are summarized in Tables 4 and 5. For clarity, the side information of the workers, the items, the sessions and the responses is abbreviated to "L", "I", "S" and "R" respectively. From Table 4, our framework with all the features outperforms the three baselines (with the community-based DS optimized at 4 communities) on all the datasets. The largest improvement in predictive accuracy is seen over Majority Vote (MV) on the TREC dataset (i.e. by 1%). Marginal improvements in accuracy have been observed over the three baselines on the Stackoverflow and the Evergreen datasets. We believe the reason for observing only marginal improvements on these datasets is that all workers have exhibited similar ability for these tasks, producing labels of similar quality across the items.

We note from Table 4 that incorporating the observed features about items appears to produce a larger reduction in the log-loss than that achieved by adding worker or session features. This is in line with our expectation for the first experiment, namely that the item features help to reduce the uncertainty in the item easiness d_v, which has likely suffered from label sparsity across the three crowd-sourcing tasks with only three labels collected for each item. In contrast, reduction in the uncertainty of worker expertise e_u, attributed to the addition of the worker and the session features into our framework, appears far less beneficial given that there is already abundant response data available for the estimation. Moreover, it appears that incorporating response information brings systematic improvements in both accuracy and log-loss. This result confirms that our framework is able to consistently utilize such information to mitigate the bias specific to each worker.

From Table 5, we can see that when equipped with all the side information features, our framework again defeats the baseline models GLAD and community-based DS (majority vote being intrinsically unsuitable for predicting worker responses), with the minimum mean absolute error for the 30% held-out response data across all the datasets.

The results of the experiment using the randomly subsampled response data from workers are summarized in Figures 3 and 4. We observe from Figure 3 that when the number of responses subsampled per worker is below 6, our framework significantly outperforms GLAD and Majority Vote across all three datasets by leveraging only one type of side information. When the number is 12, our framework with combined types of side information still distinctly exceeds the performance of the two baselines on the TREC dataset. The community-based DS, whose optimal number of communities is 2 in this case, is clearly beaten by our framework with (1) a single side-information type on the Evergreen dataset when the number of subsampled responses is below 4, and (2) the combined types on the TREC and the Stackoverflow datasets when the number is below 3.

When applied to the task of predicting unseen (held-out) responses of workers, as shown in Figure 4, our framework still holds clear advantages over the GLAD model by leveraging only single types of side information, and over the community-based DS model (with its optimal number of communities being 2) by leveraging multiple (in our experiments at least 3) types of side information.

Overall, our framework has made much more significant improvements over the baselines when the items are far fewer and the response data are far more scarce, compared to its performance in the previous experiments based on the entire set of responses.

Figure 4: Changes of the unseen (held-out) response predictive accuracy (MAE on the unseen responses) by varying the number of responses subsampled from each worker across the three datasets (TREC, Stackoverflow, Evergreen). The models compared are 2-community DS, GLAD, GLAD+L, GLAD+S, GLAD+I, GLAD+R, GLAD+I+L+R, GLAD+I+L+S, GLAD+I+R+S, GLAD+L+R+S and GLAD+I+L+R+S.

Statistical Analysis of Feature Importance

To get a clear idea of which features might be important for predicting the accuracy of the responses, we conducted one-way ANOVA, where the resulting P-values indicate the significance of the correlation between the features and the correct responses (Kazai, Kamps, and Milic-Frayling 2013; Li, Zhao, and Fuxman 2014). We compared the P-values with the feature weight estimates from our framework with all the types of side information. We list the top-5 features with respect to the P-values and the absolute feature weight values inferred from the Evergreen dataset in Table 6. We also obtained similar results on the Stack-Overflow and TREC datasets. Although our framework works in a fully unsupervised manner whereas one-way ANOVA is supervised, the results show that our framework is equally capable of identifying salient features for predicting the accuracy of each response.
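For reference, a sketch (ours, using SciPy's one-way ANOVA) of the supervised side of this comparison; the unsupervised ranking simply sorts the absolute values of the learned weight components:

```python
import numpy as np
from scipy.stats import f_oneway

def anova_pvalue(feature_values, is_correct):
    """One-way ANOVA: does mean response correctness differ across the levels of a
    categorical feature? A smaller p-value indicates a more predictive feature."""
    groups = [is_correct[feature_values == level] for level in np.unique(feature_values)]
    return f_oneway(*groups).pvalue

def rank_by_weight(feature_names, weights):
    """Unsupervised ranking used by the framework: sort features by |weight|."""
    order = np.argsort(-np.abs(weights))
    return [feature_names[i] for i in order]

correct = np.array([1, 0, 1, 1, 0, 0, 1, 0])
device = np.array(["PC", "PC", "PC", "PC", "phone", "phone", "tablet", "tablet"])
print(anova_pvalue(device, correct))
print(rank_by_weight(["use_PC", "use_phone", "use_tablet"], np.array([0.02, -0.35, 0.10])))
```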

     Supervised Prediction (P-value)        Unsupervised Prediction (weight)
I    1. num alphaNumeric chars              1. num links
     2. num links                           2. num alphaNumeric chars
     3. domain business                     3. content length
     4. content length                      4. frame tag ratio
     5. frame tag ratio                     5. domain business
L    1. searchOnline sometimes              1. searchOnline veryOften
     2. artist agreeModerately              2. lazy slightlyAgree
     3. revisit topic diversity level3      3. revisit topic diversity level3
     4. education Bachelor                  4. thorough stronglyAgree
     5. search topic diversity level3       5. age 45 to 49
S    1. use mobilePhone, 2. use PC,         1. use mobilePhone, 2. use tablet,
     3. use tablet, 4. weekends,            3. use PC, 4. weekends,
     5. latenight                           5. latenight

Table 6: Comparison between the top 5 most predictive features under the supervised (P-value) and unsupervised (feature weight) settings, on the Evergreen dataset.

Conclusions and Future Work

In this paper we have developed a probabilistic framework for improving the quality control over crowd-sourced data by leveraging its associated side information from items, workers, labeling sessions and responses. The respective source features include item genres and content features, worker demographics and personality traits, labeling devices and labeling time periods, and response delays and response orders. The efficacy of the framework has been demonstrated on three new crowd-sourcing datasets, where we have observed overall consistent improvements in predictive accuracy and log-loss, as well as in unseen (held-out) response prediction, when the response data is scarce across both workers and items. Moreover, response-level information was found to be particularly useful for helping the framework account for worker-specific biases. In addition, our framework is found to be promising at identifying salient source features without having to know any true-label information.

Future work includes extending the current framework with a matrix factorization component, to allow domain-specific expertise to be estimated for each worker and the level of domain membership to be estimated for each item.

Acknowledgments

The authors thank Professor Wray Buntine for useful discussions regarding this work. Mark Carman acknowledges research funding through a Collaborative Research Project award from CSIRO's Data61.

References

Bachrach, Y.; Graepel, T.; Minka, T.; and Guiver, J. 2012. How to grade a test without knowing the answers—a Bayesian graphical model for adaptive crowdsourcing and aptitude testing. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), 1183–1190.

Dawid, A. P., and Skene, A. M. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28(1):20–28.

Griffiths, T. L., and Steyvers, M. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101(suppl 1):5228–5235.

John, O. P.; Naumann, L. P.; and Soto, C. J. 2008. Paradigm shift to the integrative big five trait taxonomy. Handbook of Personality: Theory and Research 3:114–158.

Jung, H. J., and Lease, M. 2012. Improving quality of crowdsourced labels via probabilistic matrix factorization. In Proceedings of the 4th Human Computation Workshop (HCOMP) at AAAI, 101–106.

Kajino, H.; Tsuboi, Y.; and Kashima, H. 2012. A convex formulation for learning from crowds. In AAAI.

Kamar, E.; Kapoor, A.; and Horvitz, E. 2015. Identifying and accounting for task-dependent bias in crowdsourcing. In Third AAAI Conference on Human Computation and Crowdsourcing.

Kazai, G.; Kamps, J.; and Milic-Frayling, N. 2012. The face of quality in crowdsourcing relevance labels: Demographics, personality and labeling accuracy. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2583–2586. ACM.

Kazai, G.; Kamps, J.; and Milic-Frayling, N. 2013. An analysis of human factors and label accuracy in crowdsourcing relevance judgments. Information Retrieval 16(2):138–178.

Lease, M. 2011. On quality control and machine learning in crowdsourcing.

Li, H.; Zhao, B.; and Fuxman, A. 2014. The wisdom of minority: discovering and targeting the right group of workers for crowdsourcing. In Proceedings of the 23rd International Conference on World Wide Web, 165–176. ACM.

Liu, Q.; Peng, J.; and Ihler, A. T. 2012. Variational inference for crowdsourcing. In Advances in Neural Information Processing Systems, 692–700.

Ma, F.; Li, Y.; Li, Q.; Qiu, M.; Gao, J.; Zhi, S.; Su, L.; Zhao, B.; Ji, H.; and Han, J. 2015. FaitCrowd: Fine grained truth discovery for crowdsourced data aggregation. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 745–754. ACM.

Moreno, P. G.; Teh, Y. W.; Perez-Cruz, F.; and Artes-Rodriguez, A. 2014. Bayesian nonparametric crowdsourcing. arXiv preprint arXiv:1407.5017.

Rasch, G. 1960. Probabilistic Models for Some Intelligence and Attainment Tests. MESA Press.

Raykar, V. C.; Yu, S.; Zhao, L. H.; Valadez, G. H.; Florin, C.; Bogoni, L.; and Moy, L. 2010a. Learning from crowds. Journal of Machine Learning Research 11:1297–1322.

Raykar, V. C.; Yu, S.; Zhao, L. H.; Valadez, G. H.; Florin, C.; Bogoni, L.; and Moy, L. 2010b. Learning from crowds. Journal of Machine Learning Research 11(Apr):1297–1322.

Ruvolo, P.; Whitehill, J.; and Movellan, J. R. 2013. Exploiting commonality and interaction effects in crowdsourcing tasks using latent factor models. In Neural Information Processing Systems, Workshop on Crowdsourcing: Theory, Algorithms and Applications.

Sheng, V. S.; Provost, F.; and Ipeirotis, P. G. 2008. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, 614–622. New York, NY, USA: ACM.

Venanzi, M.; Guiver, J.; Kazai, G.; Kohli, P.; and Shokouhi, M. 2014. Community-based Bayesian aggregation models for crowdsourcing. In Proceedings of the 23rd International Conference on World Wide Web, WWW '14, 155–164. New York, NY, USA: ACM.

Venanzi, M.; Guiver, J.; Kohli, P.; and Jennings, N. R. 2016. Time-sensitive Bayesian information aggregation for crowdsourcing systems. Journal of Artificial Intelligence Research 56:517–545.

Wauthier, F. L., and Jordan, M. I. 2011. Bayesian bias mitigation for crowdsourcing. In 25th Annual Conference on Neural Information Processing Systems 2011, 1800–1808.

Welinder, P.; Branson, S.; Perona, P.; and Belongie, S. J. 2010. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems, 2424–2432.

Whitehill, J.; Ruvolo, P.; Wu, T.; Bergsma, J.; and Movellan, J. R. 2009. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In 23rd Annual Conference on Neural Information Processing Systems, NIPS'09, 2035–2043.