
How To Grade a Test Without Knowing the Answers

Yoram Bachrach, Tom Minka, John Guiver, and Thore Graepel

Paper Review by David Carlson


Introduction

- Crowdsourcing services allow us to collect the opinions of many people, but how to aggregate the collected data is an open problem.
- We would like to obtain information from multiple sources (participants) and aggregate it into a single complete solution (each classification task is denoted a question).
- Given the correct answer to each item, it is easy to find the skill of each participant and to determine which questions are easy and which are hard.
- How can we probabilistically infer the answer to each question, the skill of each participant, and the difficulty of each question?
- How can we determine the questions that best test the ability of a participant?


The DARE Model

The DARE (Difficulty-Ability-REsponse) model can be used to infer the latent ability of each participant, the latent difficulty and discrimination of each question, and the correct answer to each question.


Probability of Correctly Answering a Question

Suppose participant p has ability a_p ∈ ℝ and question q has difficulty d_q ∈ ℝ. The probability that participant p correctly answers question q can be written as a function of the quantity t_pq = a_p − d_q:

P(c_pq = T | t_pq, τ_q) = ∫_{−∞}^{∞} φ(√τ_q (x − t_pq)) θ(x) dx = Φ(√τ_q t_pq)   (1)

The precision τ_q determines how discriminative question q is.
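To make Eq. (1) concrete, here is a minimal Python sketch (my own illustration, not code from the paper); the parameter values are made up:

```python
# Response-correctness probability of Eq. (1): P(c_pq = T) = Phi(sqrt(tau_q) * t_pq).
from scipy.stats import norm

def prob_correct(ability: float, difficulty: float, precision: float) -> float:
    """Probability that a participant answers a question correctly."""
    t = ability - difficulty                 # t_pq = a_p - d_q
    return norm.cdf(precision ** 0.5 * t)    # Phi(sqrt(tau_q) * t_pq)

# A participant slightly stronger than the question is difficult; higher
# precision (a more discriminative question) pushes the probability further
# from 1/2.
print(prob_correct(ability=1.0, difficulty=0.5, precision=4.0))  # ~0.84
```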


The DARE Model

Let y_q be the true answer for question q, let R be the set of possible answers, and let R_q be the set of answers other than the true answer for q.

y_q ∼ DiscreteUniform(R)   (2)

r_pq = y_q if c_pq = T, and r_pq ∼ DiscreteUniform(R_q) if c_pq ≠ T   (3)

a_p ∼ N(μ_p, σ_p²)   (4)

d_q ∼ N(μ_q, σ_q²)   (5)

t_pq = a_p − d_q   (6)

P(c_pq = T | t_pq, τ_q) = Φ(√τ_q t_pq)   (7)

τ_q ∼ Gamma(k, θ)   (8)
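The generative story in Eqs. (2)-(8) is easy to simulate. Below is a hedged Python sketch that samples a full response table; the hyperparameter values (unit-Gaussian priors, Gamma(2, 0.5)) are illustrative assumptions, not the paper's settings:

```python
# Minimal sketch of the DARE generative process, Eqs. (2)-(8).
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
P, Q, A = 120, 60, 8                                  # participants, questions, answers

ability    = rng.normal(0.0, 1.0, size=P)             # a_p ~ N(mu_p, sigma_p^2)    (4)
difficulty = rng.normal(0.0, 1.0, size=Q)             # d_q ~ N(mu_q, sigma_q^2)    (5)
precision  = rng.gamma(shape=2.0, scale=0.5, size=Q)  # tau_q ~ Gamma(k, theta)     (8)
true_answer = rng.integers(0, A, size=Q)              # y_q ~ DiscreteUniform(R)    (2)

def Phi(x: float) -> float:
    """Standard normal CDF (avoids a scipy dependency)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

responses = np.empty((P, Q), dtype=int)
for p in range(P):
    for q in range(Q):
        t = ability[p] - difficulty[q]                # t_pq = a_p - d_q            (6)
        correct = rng.random() < Phi(sqrt(precision[q]) * t)  # c_pq via Eq. (7)
        if correct:                                   # r_pq = y_q when correct     (3)
            responses[p, q] = true_answer[q]
        else:                                         # else uniform over wrong answers
            wrong = [r for r in range(A) if r != true_answer[q]]
            responses[p, q] = int(rng.choice(wrong))
```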


Graphical Model

[Gates] can be used to represent context-dependent conditional independence relations, and are suited for implementations of approximate message passing inference.

In order to do inference in the model, we need to define prior distributions for the variables of interest. We assume factorizing Gaussian priors for the abilities, a_p ∼ Normal(μ_p, σ_p²), and difficulties, d_q ∼ Normal(μ_q, σ_q²). We choose a Gaussian prior as it lets us specify a range of plausible values based on two parameters (mean and variance) per variable, and admits relatively simple approximate inference. The factorization assumption reflects the belief that a priori knowing the difficulty of one question would not be informative about the difficulty of another question, and similarly for the abilities of participants. We also assume factorizing discrete uniform priors for the true answers, y_q ∼ DiscreteUniform(R_q), and for the responses, r_pq ∼ DiscreteUniform(R_q), for participant-question pairs. Finally, we define factorizing Gamma priors for the precision parameters, τ_q ∼ Gamma(k, θ). The Gamma prior is conveniently parameterized by a shape parameter k and a scale parameter θ, and is the conjugate prior for the precision parameter τ := σ⁻² of the normal distribution when the mean μ is known. This choice simplifies inference by approximate message passing because the posterior also takes the functional form of the Gamma distribution.

Based on the above specification we define a joint probability distribution p(a_p, d_q, t_pq, τ_q, c_pq, r_pq, y_q) for specific pairs of question q and participant p. Assuming exchangeability of questions q and participants p, we get a model with two plates as depicted in Figure 1, where one plate runs over participants p ∈ P and the other over questions q ∈ Q (plates denote a replication of the fraction of the graphical model they contain). From a generative point of view, this models a table with |P| rows (one for each participant p) and |Q| columns (one for each question q), whose entries are the responses r_pq of participants to questions.

3.2. Probabilistic Inference

We show how the model can infer quantities of interest. Generally, the data is given in the form of two incomplete sets: a set of m participant-question-response triples R := {r_{p1,q1}, …, r_{pm,qm}} and a set of n ground-truth question-answer pairs y := {y_{q1}, …, y_{qn}}. One special case is when all the ground-truth question-answer pairs are known; this is the traditional test scenario used in aptitude tests, including school tests, the GMAT, IQ tests, etc. Another special case is crowdsourcing domains, where we may not have any ground truth available but can obtain participant-question-response triples depending on budget and time constraints. As shown in Section 4, providing some ground-truth question-answer pairs (a "gold-set") can improve the accuracy of the inferred answers y_q, because it can be used to assess the abilities a_p of the participants p more accurately, leading to more accurate inference on the answers y_q. Generally, for every observed response r*_pq we set the discrete prior distribution p(r_pq) to a point distribution concentrated on the observed response r*_pq, and similarly for every known ground-truth question-answer pair y*_q.

Figure 1. Factor graph for the joint difficulty-ability-response estimation (DARE) model.

Given the data R and y, we wish to infer several approximate marginal (posterior) distributions: the discrete distribution p(y_q|R, y) over correct answers y_q, which assigns a probability π_qr ∈ [0, 1] to each of the possible responses r ∈ R_q; the Gaussian density p(a_p|R, y) over abilities a_p of participants p, with means μ_p and variances σ_p²; the Gaussian density p(d_q|R, y) over difficulties d_q of questions q, with means μ_q and variances σ_q²; the Bernoulli distribution p(c_pq|R, y) over correctness c_pq of participant p's response to question q, given by probabilities π_pq; the discrete distribution p(r_pq|R, y) over responses r_pq of participant p to question q, which assigns a probability π_pqr ∈ [0, 1] to each of the possible responses r ∈ R_q; and the Gamma distribution p(τ_q|R, y) over the precision/discrimination parameter τ_q, with scale parameters θ_q and shape parameters k_q.

Inference in the model is done using approximate message passing (see Koller & Friedman, 2009). We used Infer.NET (Minka et al., 2010), a package for probabilistic inference; specifically, the expectation propagation (EP) algorithm presented in (Minka, 2001). EP allows us to calculate marginal distributions of interest on a given factor graph by iteratively calculating messages along edges that propagate information across the factor graph. In our case, EP provides only an approximation to the exact solution.
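The paper performs this inference with EP in Infer.NET. As a rough intuition for why joint inference works, the following self-contained sketch (my simplification, not the authors' algorithm) alternates between estimating answers by weighted vote and estimating each participant's weight from agreement with the current answer estimates:

```python
# Crude alternating scheme for intuition only -- NOT the paper's EP inference.
# It captures the circularity DARE resolves: better answer estimates give
# better ability estimates, and vice versa. `responses` is a (P, Q) int array.
import numpy as np

def infer_answers(responses: np.ndarray, n_answers: int, iters: int = 20):
    """Alternate between weighted-vote answers and agreement-based weights."""
    P, Q = responses.shape
    weight = np.ones(P)                    # crude stand-in for ability a_p
    for _ in range(iters):
        votes = np.zeros((Q, n_answers))
        for p in range(P):                 # weighted vote per question
            votes[np.arange(Q), responses[p]] += weight[p]
        answers = votes.argmax(axis=1)
        weight = (responses == answers).mean(axis=1)  # agreement rate per person
    return answers, weight
```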


Adaptive Testing

Often there is a high cost to obtaining additional data points, so we would like to ask the questions that give the most information. If we denote the parameters of interest by w, then we can define the posterior entropy S(w):

S(w) = ∫⋯∫ p(w) log( m(w) / p(w) ) dw   (9)

m(w) is an arbitrary base measure that doesn’t affect theoutcome.


Adaptive Testing

The posterior distribution over person p's ability before and after obtaining an additional sample is:

p_m(a_p) = N(a_p; μ_{p,m}, σ²_{p,m})   (10)

p_{m+1}(a_p | r_pq) = N(a_p; μ_{p,m+1}(r_pq), σ²_{p,m+1}(r_pq))   (11)

Using these distributions, the entropy reduction from the new sample is simply:

ΔS(r_pq) = (1/2) log(σ²_{p,m} / σ²_{p,m+1}(r_pq))   (12)

For example, a response that halves the posterior variance gives ΔS = (1/2) log 2 ≈ 0.35 nats.


Adaptive Testing

To determine the best question q to ask participant p, we can calculate the expected reduction in entropy. Let π_pq be the probability of participant p choosing answer r_pq. Then the expected reduction in entropy is:

(1/2) Σ_{r_pq ∈ R} π_pq log(σ²_{p,m} / σ²_{p,m+1}(r_pq))   (13)

where R is the set of all possible answers. We then choose the question q* that reduces the expected entropy the most.
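As an illustration of this selection rule, the sketch below (my own grid-based approximation, not the paper's EP computation) represents the posterior over a_p on a grid and scores candidate questions by Eq. (13). It collapses responses to correct/incorrect, since under the model every wrong answer carries the same likelihood for a_p, and the (d_q, τ_q) values are made up:

```python
# Grid-based sketch of the expected-entropy-reduction rule in Eq. (13).
import numpy as np
from scipy.stats import norm

grid = np.linspace(-4.0, 4.0, 801)            # grid over the ability a_p

def expected_entropy_drop(post: np.ndarray, d_q: float, tau_q: float) -> float:
    """Expected (1/2) log(var_m / var_{m+1}) from asking question q."""
    mean_m = np.average(grid, weights=post)
    var_m = np.average((grid - mean_m) ** 2, weights=post)
    p_correct = norm.cdf(np.sqrt(tau_q) * (grid - d_q))  # Eq. (1) per grid point
    drop = 0.0
    for lik in (p_correct, 1.0 - p_correct):  # outcomes: correct / incorrect
        marg = np.average(lik, weights=post)  # predictive prob. of the outcome
        new_post = post * lik
        new_post /= new_post.sum()            # Bayes update on the grid
        mean = np.average(grid, weights=new_post)
        var = np.average((grid - mean) ** 2, weights=new_post)
        drop += marg * 0.5 * np.log(var_m / var)
    return drop

# Choose q* = argmax of the expected entropy reduction over candidates.
posterior = norm.pdf(grid)
posterior /= posterior.sum()
candidates = [(-1.0, 1.0), (0.0, 4.0), (2.0, 1.0)]    # (d_q, tau_q), illustrative
q_star = max(range(len(candidates)),
             key=lambda i: expected_entropy_drop(posterior, *candidates[i]))
```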


Results

The DARE model was applied to a standard intelligence test, Raven's Standard Progressive Matrices (SPM), which consists of 60 questions with 8 possible answers each. The sample consisted of 120 participants.

The experiments examine learning the correct answers to the questions, the helpfulness of partial information (a set of known true answers), and active learning to judge a participant's ability.

Inference was performed using Infer.NET via (loopy) EP updates.


Results: IQ Estimates versus "Known" Results

Figure 2. Item similar to those in the SPM test

Inference in the model allows us to compute the probability p(y_q|R, y) that a given answer y_q is correct. To minimize the probability of error we select the mode of that distribution as the model's answer for that question. When provided with the responses of all 120 participants, the DARE model correctly infers the answers to 46 of the 60 questions. The number of errors is not too surprising, because some items in the test were very difficult and few participants answered them correctly; the highest raw score was fifty, so even the top-scoring participant answered ten items incorrectly.

We can calculate a participant's raw IQ score with respect to the true correct answers y*_q or with respect to the predicted "correct" answers y_q, and we refer to the latter score as the model raw score. In crowdsourcing situations where the correct answer for each question is unknown, one can use the model raw scores as an estimate of participants' abilities. Figure 3 shows a scatter plot in which each point represents a participant: the x-axis gives the participant's raw IQ score, and the y-axis their model raw score. As Figure 3 indicates, there is a very strong correlation between the true raw IQ scores and the model raw scores (R² = 0.7243), and the difference between the two scores is rather small across all participants.

Figure 3. Estimates of skill levels given missing information regarding the correct answers to the questions.


Results: Effect of Crowd Size

One can think of DARE as an aggregator that receives the responses of a crowd of participants and outputs the inferred answer for each question. Existing work (Lyle, 2008; Bachrach et al., 2012) tests simpler aggregators on IQ test data: the former uses majority voting, and the latter considers the ability levels of participants but assumes all items are of equal difficulty. Another possible simplifying assumption, not examined in this earlier work, is that all participants have equal ability. In contrast, in the DARE model the probability of a participant knowing the correct answer depends both on the difficulty d_q of the question and on the ability a_p of the participant, with the above scenarios as special cases.

We refer to the model with differing question difficulties as the question model, and the model with differing participant abilities as the participant model. We examine how such simplifications affect the model's ability to infer correct answers as a function of the amount of available data. Figure 4 shows how well the question, participant, and DARE models perform in this regard. For any given crowd size, shown on the x-axis, we randomly selected 10,000 subsets of participants of that size. For each such crowd we inferred the correct answer y_q to each question using the model, and used the number of questions for which the inferred answer y_q equaled the true correct answer y*_q as a measure of the model's performance. The y-axis is the quality of each model, averaged over the 10,000 sampled crowds. Figure 4 shows that the ability of all models to infer correct answers increases with the amount of data, and that DARE outperforms the simpler models. Interestingly, modeling only participants' abilities is better than modeling only question difficulty (which is equivalent to majority voting).

Figure 4. Effect of crowd size on correct responses inferred.


Results: Effect of Partial Information

Crowdsourcing Data: To examine the appli-cability of our model to crowdsourcing, we testedour model on the TREC 2011 Crowdsourcing Trackdataset (Lease & Kazai, 2011), generated by crowd-sourced workers which classified search engine re-sponses for queries (relevant / irrelevant). Each query-document pair is a “question as workers must deter-mine if the document is relevant for the query. Thisdataset is sparse, as most workers only examined few“questions, and all “questions have at most 10 answersin total, and includes ground-truth judgements. Weisolated the 369 “questions with the most answers (8per “question), and the 84 workers who answered themost “questions (at least 30 answers each). Our anal-ysis shows that DARE slightly outperforms majorityvoting on this dataset. Majority voting gets 206 ques-tions correct, while DARE gets 210 correct. We alsotested how well DARE estimate participants skills,similarly to Figure 3. Although for this crowdsourcingdataset the resulting scatter plot is quite noisy (r2 of0.79), it is similar to the one in Figure 3.
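For reference, the majority-vote baseline compared against here is trivial to compute. A minimal sketch (my illustration; it assumes a dense (P, Q) response table, whereas the TREC data is sparse):

```python
# Majority-vote baseline for comparison with DARE.
import numpy as np

def majority_vote(responses: np.ndarray, n_answers: int) -> np.ndarray:
    """Most common response per question; responses is a (P, Q) int array."""
    counts = np.stack([(responses == r).sum(axis=0) for r in range(n_answers)])
    return counts.argmax(axis=0)          # shape (Q,)

def accuracy(pred: np.ndarray, truth: np.ndarray) -> float:
    """Fraction of questions whose predicted answer matches the gold answer."""
    return float((pred == truth).mean())
```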

4.2. Partial Information on Correct Answers

We now examine situations where participants are first tested on a "gold-set" of questions for which the correct answer is known. Consider choosing i questions and making the correct answer to these questions observable to the model. This does not reveal the correct answer to the remaining |Q| − i questions, but it does allow the model to better estimate the ability levels of the participants, which in turn allows the model to better infer the correct answers to those remaining items. Figure 5 shows this effect in DARE. The x-axis represents the number of "revealed" questions, and the y-axis the proportion of the remaining questions for which the model inferred the right answer. For each number i of "revealed" items, we sampled 100,000 crowds of 20 participants and i revealed questions (uniformly at random); the location on the y-axis is the average, over this sample, of the proportion of remaining questions for which the model inferred the right answer. As the figure shows, a larger "gold-set" increases the model's ability to infer the correct response for the remaining items.

Figure 5. Effect of partial information on correct answers.
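In the alternating sketch from earlier, a gold-set would simply pin the known answers during each iteration (my own extension, not the paper's mechanism, which sets point-mass priors inside EP):

```python
# Sketch: incorporating a "gold-set" into the alternating scheme above.
# `gold` maps question index -> known correct answer y_q*.
import numpy as np

def infer_with_gold(responses: np.ndarray, n_answers: int,
                    gold: dict[int, int], iters: int = 20):
    P, Q = responses.shape
    weight = np.ones(P)
    for _ in range(iters):
        votes = np.zeros((Q, n_answers))
        for p in range(P):
            votes[np.arange(Q), responses[p]] += weight[p]
        answers = votes.argmax(axis=1)
        for q, y in gold.items():         # pin revealed answers as point masses
            answers[q] = y
        weight = (responses == answers).mean(axis=1)  # anchored ability estimate
    return answers, weight
```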

4.3. Adaptive Skill Testing

We now show how DARE can be used for adaptive skill testing. Given a budget of b questions to ask, our goal is to infer the participants' ability levels. We use DARE to estimate a participant's raw IQ score after observing only that participant's responses to a set of "asked" questions (revealed responses). In the static approach, for each budget b we choose a specific question set used for all the participants, and measure the RMSE in estimated raw IQ across all participants. To choose the best set of questions of size b for the static approach, one would have to enumerate over all (|Q| choose b) question sets of size b and use the one minimizing the error. This is intractable when |Q| is large, so we chose the question set heuristically: for a given budget b we selected questions that equally partition the participant population in terms of the fraction of participants who solved the question.² For example, with a budget b = 2 we selected one question that roughly two thirds of the participants solved correctly and one that roughly one third solved correctly.
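A small sketch of this partitioning heuristic as I read it (the data layout and helper name are my own assumptions):

```python
# Static question selection: pick questions whose observed solve rates best
# match the equal-partition targets k/(b+1).
import numpy as np

def static_question_set(correct: np.ndarray, b: int) -> list[int]:
    """`correct` is a (P, Q) boolean array of who solved what."""
    solve_rate = correct.mean(axis=0)                  # fraction solving each q
    targets = [(k + 1) / (b + 1) for k in range(b)]    # e.g. b=2 -> 1/3, 2/3
    chosen: list[int] = []
    for t in targets:
        order = np.argsort(np.abs(solve_rate - t))     # nearest solve rates first
        q = next(int(i) for i in order if int(i) not in chosen)
        chosen.append(q)
    return chosen
```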

We also implemented the adaptive testing scheme of Section 3.3 and compared it to the baseline static approach. Under the adaptive approach, the next question to ask depends on the participant's responses to earlier questions, so we reveal the participant's responses one at a time. To measure the RMSE for a given budget b, we simulated the adaptive process for each of the participants and averaged the errors across all participants. Figure 6 shows the RMSEs of the static and adaptive approaches for different budget levels: the adaptive approach has a smaller error in its inferred ability levels at every budget.³

² The static approach essentially has access to information regarding the difficulty of the questions, which would not normally be available. As our analysis shows, the active approach beats the static approach even when the static approach can use such information.

³ The standard deviation of the RMSEs is 1.07 for the adaptive scheme and 0.99 for the static scheme.


Results: Static versus Adaptive Methods

Figure 6. Static and adaptive skill testing.

5. Conclusions and Limitations

We presented the DARE model for inferring the correct answers, the difficulty levels of questions, and the ability levels of participants in multiple problem domains. Our evaluation of the model shows that joint inference of these quantities is possible to a high level of accuracy, and that it is indeed possible to grade a test without knowing the answers. We showed that in our setting modeling participants' ability levels is more important than modeling questions' difficulty levels, that including a "gold-set" helps, and that active learning leads to more efficient testing.

Our approach is subject to several limitations. Our evaluation used an IQ dataset, whereas crowdsourcing tasks may exhibit different properties, such as greater homogeneity in task difficulty levels. Also, we assume that participants answer to the best of their ability, but participants may be selfish agents with varying motives; for a game-theoretic treatment of such issues see (DiPalantino & Vojnovic, 2009; Gao et al., 2012).

Many questions are open for future research. Are there better models for aggregating responses, or models better tailored to other domains? How can one tractably compute the optimal non-adaptive test for a given population? Can we use similar models to infer the ability levels of individuals when only their performance within the context of a group is known?

References

Anastasi, A., Urbina, S., et al. Psychological Testing. Prentice Hall, Upper Saddle River, NJ, 1997.

Bachrach, Y., Graepel, T., Kasneci, G., Kosinski, M., and Van Gael, J. Crowd IQ - aggregating opinions to boost performance. In AAMAS, 2012.

de Caritat, M.J.A.N. et al. Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. L'Imprimerie Royale, 1785.

DiPalantino, D. and Vojnovic, M. Crowdsourcing and all-pay auctions. In ACM EC, 2009.

Gao, X.A., Bachrach, Y., Key, P., and Graepel, T. Quality expectation-variance tradeoffs in crowdsourcing contests. In AAAI, 2012.

Getoor, L., Friedman, N., Koller, D., Pfeffer, A., and Taskar, B. Probabilistic relational models. In Statistical Relational Learning, pp. 129, 2007.

Hambleton, R.K., Swaminathan, H., and Rogers, H.J. Fundamentals of Item Response Theory, volume 2. 1991.

Herbrich, R., Minka, T., and Graepel, T. TrueSkill: A Bayesian skill rating system. In NIPS, 2007.

Kasneci, G., Van Gael, J., Herbrich, R., and Graepel, T. Bayesian knowledge corroboration with logical rules and user feedback. In ECML/PKDD, 2010.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. 2009.

Kosinski, M., Bachrach, Y., Kasneci, G., Van Gael, J., and Graepel, T. Crowd IQ: Measuring the intelligence of crowdsourcing platforms. In ACM Web Sciences, 2012.

Lease, M. and Kazai, G. Overview of the TREC 2011 Crowdsourcing Track (conference notebook). In TREC, 2011.

Lyle, J.A. Collective problem solving: Are the many smarter than the few? 2008.

MacKay, D.J.C. Information-based objective functions for active data selection. Neural Computation, 1992.

Minka, T., Winn, J.M., Guiver, J.P., and Knowles, D.A. Infer.NET 2.4, 2010.

Minka, T. and Winn, J. Gates. In NIPS, 21, 2008.

Minka, T.P. A family of algorithms for approximate Bayesian inference. PhD thesis, 2001.

Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988.

Pennock, D.M. and Sami, R. Computational aspects of prediction markets, 2007.

Raven, J.C. Standard Progressive Matrices Plus.

Raykar, V.C., Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., and Moy, L. Learning from crowds. JMLR, 2010.

Sen, A. Social choice theory. Handbook of Mathematical Economics, 3:1073-1181, 1986.

Welinder, P., Branson, S., Belongie, S., and Perona, P. The multidimensional wisdom of crowds. In NIPS, 2010.

Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., and Movellan, J. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. NIPS, 22:2035-2043, 2009.

Woolley, A.W., Chabris, C.F., Pentland, A., Hashmi, N., and Malone, T.W. Evidence for a collective intelligence factor in the performance of human groups. Science, 330(6004):686, 2010.

Yan, Y., Rosales, R., Fung, G., and Dy, J. Active learning from crowds. In ICML, 2011.


Conclusions

- The DARE model is able to infer the correct answers to the questions without knowing those answers, as well as the ability of each participant and the difficulty of each question.
- The model cannot handle all situations (e.g., trick questions), but it does well in many.
- All inference is done using Infer.NET (http://research.microsoft.com/en-us/um/cambridge/projects/infernet/).
