
Page 1: Consensus Relevance with Topic and Worker Conditional Models

Consensus Relevance with Topic and Worker Conditional Models

Paul N. Bennett, Microsoft Research

Joint with Ece Kamar, Microsoft Research

Gabriella Kazai, Microsoft Research Cambridge

Page 2: Motivation for Consensus Task

• Recover the actual relevance of a topic-document pair based on noisy predictions from multiple labelers.

• Obtain a more reliable signal from the crowd and/or benefit from scale (expert quality from inexperienced assessors).

• A variety of approaches have been proposed in the literature and in the competition:
– Supervised: classification models.
– Semi-supervised: EM-style algorithms.
– Unsupervised: majority vote.

Page 3: Common Axes of Generalization

[Figure: a 2×2 grid. One axis: topics observed vs. not observed in training; the other: document relevance observed vs. not observed in training.]

• Known topics, new documents: compute consensus for "new documents" on known topics.
• New topics, documents with known relevance on other topics: compute consensus on the new topics.
• New topics and new documents: use rules or observed worker accuracies on other topics/documents to compute consensus.
• Note the hidden axis of observed workers.

Page 4: Our Approach

• Supervised:
– Given gold-truth judgments on a topic set and worker responses, learn a consensus model that generalizes to new documents on the same topic set.
– Must be able to generalize to new workers.

• Want a well-founded probabilistic method:
– Needs to handle the major sources of worker error: worker skill/accuracy and topic difficulty.
– Needs to handle correlation in labels; correlation is expected because of the underlying label.

• Note: we use "assessor" for the ground-truth labeler and "worker" for the noisy labelers.

Page 5: Basic Model

The probability of relevance should depend on the document, the worker response vector, and the topic:

$$P(R_{i,j} \mid t_i, d_j, \{w_1, \ldots, w_n \mid w_k \text{ is elicited for } (i,j)\})$$

where $t_i$ is a particular topic, $d_j$ is a particular document, $R_{i,j}$ is the event that $d_j$ is relevant to topic $t_i$, and $w_k$ is the response of worker $k$ for the $(i,j)$th pair. We abbreviate the set of elicited responses as $\vec{w}_{i:j}$.

Page 6: Exchangeability-Related Assumptions

• Given two identical sets of voting history, we assume two workers have the same response distribution.

• Whether or not a worker’s opinion is elicited is not informative.

• The ordering of responses/elicitation is not informative.

Page 7: Relevance Conditional Independence

• Assume worker responses are conditionally independent given document relevance.
– Implies workers have comparable accuracies across tasks.
• Assume a single topic-independent prior on relevance.
• Referred to as naive Bayes.

$$P(r_{i,j} \mid t_i, d_j, \vec{w}_{i:j}) \propto P(r_{i,j}) \prod_{w \in \vec{w}_{i:j}} P(w \mid r_{i,j})$$

The first factor is the probability of relevance across all topics; each remaining factor is the probability of a random worker's response given relevance (across all topics).
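A minimal sketch of this naive Bayes consensus rule, assuming binary responses and simple maximum-likelihood counts with add-one smoothing (the data and names here are illustrative, not from the paper):

```python
from collections import Counter

# Training data: each example is (worker responses, gold relevance label),
# with responses and labels both binary (1 = relevant, 0 = not relevant).
train = [
    ([1, 1, 0], 1),
    ([0, 0, 1], 0),
    ([1, 0, 1], 1),
]

# Estimate P(r) and P(w | r) by counting, with add-one smoothing.
label_counts = Counter()
resp_counts = Counter()   # keyed by (response, relevance)
for responses, r in train:
    label_counts[r] += 1
    for w in responses:
        resp_counts[(w, r)] += 1

def p_label(r):
    return (label_counts[r] + 1) / (sum(label_counts.values()) + 2)

def p_resp(w, r):
    total = resp_counts[(0, r)] + resp_counts[(1, r)]
    return (resp_counts[(w, r)] + 1) / (total + 2)

def consensus(responses):
    """P(relevant | responses) under the naive Bayes model."""
    scores = {}
    for r in (0, 1):
        score = p_label(r)
        for w in responses:
            score *= p_resp(w, r)   # workers independent given relevance
        scores[r] = score
    return scores[1] / (scores[0] + scores[1])

print(consensus([1, 1, 0]))  # posterior probability of relevance
```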

Page 8: Topic and Relevance Conditional Independence

• Assume responses are conditionally independent given topic and relevance.
– Implies workers have comparable accuracy within a topic, but varying across topics.
• Assume a topic-dependent prior on relevance.
• Referred to as nB Topic.

$$P(r_{i,j} \mid t_i, d_j, \vec{w}_{i:j}) \propto P(r_{i,j} \mid t_i) \prod_{w \in \vec{w}_{i:j}} P(w \mid r_{i,j}, t_i)$$

The first factor is the probability of relevance for this topic; each remaining factor is the probability of a random worker's response given relevance for this topic.
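The nB Topic variant only changes which counts are pooled: the prior and the response model are estimated per topic rather than globally. A sketch of the modified scoring, reusing the structure of the sketch above (again with illustrative data, not the authors' code):

```python
from collections import Counter

# Training examples now carry the topic id: (topic, responses, gold label).
train = [
    ("t1", [1, 1, 0], 1),
    ("t1", [0, 0, 1], 0),
    ("t2", [1, 0, 1], 1),
]

label_counts = Counter()  # keyed by (topic, relevance)
resp_counts = Counter()   # keyed by (topic, response, relevance)
for t, responses, r in train:
    label_counts[(t, r)] += 1
    for w in responses:
        resp_counts[(t, w, r)] += 1

def consensus_topic(t, responses):
    """P(relevant | topic, responses) under nB Topic (add-one smoothing)."""
    scores = {}
    n_t = label_counts[(t, 0)] + label_counts[(t, 1)]
    for r in (0, 1):
        score = (label_counts[(t, r)] + 1) / (n_t + 2)           # P(r | t)
        total = resp_counts[(t, 0, r)] + resp_counts[(t, 1, r)]
        for w in responses:
            score *= (resp_counts[(t, w, r)] + 1) / (total + 2)  # P(w | r, t)
        scores[r] = score
    return scores[1] / (scores[0] + scores[1])

print(consensus_topic("t1", [1, 1, 0]))
```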

Page 9: Worker and Relevance Conditional Independence

• Each worker has a particular skill/accuracy in making relevance judgments.
• This can be estimated by aggregating a history of accuracy across all tasks.
• Responses are independent conditional on historical accuracy and relevance.
• Referred to as nB Worker.

$$P(r_{i,j} \mid t_i, d_j, \vec{w}_{i:j}) \propto P(r_{i,j}) \prod_{w_k \in \vec{w}_{i:j}} P(w_k \mid r_{i,j}, h_k)$$

The first factor is the probability of relevance across all topics; each remaining factor is the probability of this worker's response given relevance (across all topics), conditioned on the worker's historical accuracy $h_k$.
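One simple instantiation of $P(w_k \mid r, h_k)$ is a symmetric accuracy model, where worker $k$ agrees with the true label with probability $h_k$; the paper does not spell out the parameterization, so this is an assumption for illustration. Note that add-one smoothing gives an unseen worker $h_k = 0.5$, which addresses the need to generalize to new workers:

```python
from collections import Counter

# Training examples: (responses as {worker_id: vote}, gold label).
train = [
    ({"w1": 1, "w2": 1, "w3": 0}, 1),
    ({"w1": 0, "w2": 0, "w3": 1}, 0),
    ({"w1": 1, "w2": 0, "w3": 1}, 1),
]

# Historical accuracy h_k: fraction of a worker's past votes that
# matched the gold label (add-one smoothed).
correct = Counter()
seen = Counter()
label_counts = Counter()
for votes, r in train:
    label_counts[r] += 1
    for k, w in votes.items():
        seen[k] += 1
        correct[k] += int(w == r)

def h(k):
    return (correct[k] + 1) / (seen[k] + 2)

def consensus_worker(votes):
    """P(relevant | votes), with P(w_k | r, h_k) = h_k if w_k == r else 1 - h_k."""
    scores = {}
    for r in (0, 1):
        score = (label_counts[r] + 1) / (sum(label_counts.values()) + 2)
        for k, w in votes.items():
            score *= h(k) if w == r else 1.0 - h(k)
        scores[r] = score
    return scores[1] / (scores[0] + scores[1])

print(consensus_worker({"w1": 1, "w2": 0, "w4": 1}))  # w4 unseen: h = 0.5
```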

Page 10: Evaluation

• Which label is truth?
– Gold: evaluate using the expert assessor's label as truth.
– Consensus: evaluate using the consensus of participants' responses as truth.
– Other Participant: evaluate using a particular participant's responses as truth.

• Methodology
– Use the development validation set as the test set to decide which method to submit.
– Split the development training set 80/20 into train/validation by topic-docID pair (i.e., for a given topic, all responses for a docID were either completely in or completely out of the validation set); see the sketch after this list.
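A minimal sketch of such a grouped split, where the unit of assignment is the (topic, docID) pair so that all worker responses for a pair land on one side. Field and function names are illustrative, not from the paper:

```python
import random

def split_by_pair(rows, val_frac=0.2, seed=0):
    """rows: dicts with 'topic' and 'doc_id' keys plus worker-response fields.
    Assigns whole (topic, doc_id) groups to train or validation."""
    pairs = sorted({(row["topic"], row["doc_id"]) for row in rows})
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_val = int(len(pairs) * val_frac)
    val_pairs = set(pairs[:n_val])
    train = [r for r in rows if (r["topic"], r["doc_id"]) not in val_pairs]
    val = [r for r in rows if (r["topic"], r["doc_id"]) in val_pairs]
    return train, val
```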

Page 11: Development Set

Model          TruePos  TrueNeg  FalsePos  FalseNeg  Accuracy  DefaultAcc  Prec    Recall   Specificity
Majority Vote    101        8       17        19      75.2%      82.8%     85.6%    84.2%     32.0%
naive Bayes      120        0       25         0      82.8%      82.8%     82.8%   100.0%      0.0%
nB Topic         115        7       18         5      84.1%      82.8%     86.5%    95.8%     28.0%
nB Worker        117        1       24         3      81.4%      82.8%     83.0%    97.5%      4.0%

• The skew and scarcity of the development set made model selection challenging.

• Chose nB Topic since it was the only method that outperformed the baseline of predicting the most common class.

Page 12: Results

Team          Accuracy  SoftAcc  Recall  Precision  Specificity  LogLoss   RMSE   AccRank  SoftAccRank
MSRC            69.3%    64.0%    79.0%    66.2%       59.6%      610.28   44.9%      3         6
uogTr           36.7%    44.1%    13.6%    25.3%       59.8%      931.74   58.8%     10        10
LingPipe        67.6%    66.2%    76.2%    65.0%       59.0%      975.88   49.7%      5         4
GeAnn           60.7%    57.7%    88.4%    56.9%       33.0%     1150.45   51.3%      7         8
UWaterlooMDS    69.4%    67.4%    80.2%    66.0%       58.6%     1435.79   50.1%      2         3
uc3m            69.9%    69.9%    75.4%    67.9%       64.4%     2772.38   54.9%      1         1
BUPT-WILDCAT    68.5%    68.5%    78.6%    65.4%       58.4%     2901.33   56.1%      4         2
TUD_DMIR        66.2%    66.2%    76.4%    63.5%       56.0%     3113.16   58.1%      6         5
UTaustin        60.4%    60.4%    90.8%    56.5%       30.0%     3647.36   62.9%      8         7
qirdcsuog       52.9%    52.9%    82.4%    51.8%       23.4%     4338.12   68.6%      9         9

• Methods that report probabilities did better on the probability measures in almost all cases, and almost always improve when using a decision-theoretic threshold (see the sketch below).

• The outlier's log-loss performance, combined with its accuracy after thresholding, implies its probabilities are poorly calibrated with respect to the decision threshold but likely good overall.

• Our method was best on the probability measures and near the top in general.
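For reference, a sketch of the two kinds of measures being contrasted: a probability measure (log loss, which the table appears to report summed over items rather than averaged) and accuracy after a decision threshold. The function names and the 0.5 threshold are illustrative assumptions:

```python
import math

def log_loss_sum(probs, labels):
    """Summed log loss over items; probs are P(relevant), labels are 0/1."""
    eps = 1e-15  # clip to avoid log(0) on overconfident predictions
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total

def accuracy_at_threshold(probs, labels, threshold=0.5):
    """Accuracy after converting probabilities to hard decisions."""
    hits = sum(int((p >= threshold) == bool(y)) for p, y in zip(probs, labels))
    return hits / len(labels)
```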

Page 13: Conclusions

• A simple topic-and-relevance conditional-independence model produces:
– The best performance on the probability measures on the gold set.
– Nearly the best performance on accuracy.

• Topic-level effects explain the majority of the variability in judgments (on this data and over the set of submissions).

• Future work:
– Worker-relevance conditioning on the test set.
– A worker-topic-relevance conditional independence model.
– Method performance versus the best/median individual worker (is there sufficient data to evaluate this?).

Page 14: Thoughts for Future Crowdsourcing Tracks

• Is consensus independent of elicitation?
– Can consensus be studied independently of the design for worker response collection?
– Probably okay if the development and test sets are collected with the same methodology.

• Collection-design factors likely worth analyzing:
– Number of gold-standard labels in the "training set" per topic.
– Number of labels per worker.
– Number of labels per item.
– Number of worker responses on observed items.
– Stability of the topic-conditional prior on relevance.

Page 15: Questions?