TRANSCRIPT
Ground Truth, Machine Learning, and the Mechanical Turk
Bob Carpenter (w. Emily Jamison, Breck Baldwin)
Alias-i, Inc.
1
Supervised Machine Learning
1. Define coding standard mapping inputs to outputs, e.g.:
• English word→ stem
• newswire text→ person name spans
• biomedical text→ genes mentioned
2. Collect inputs and code “gold standard” training data
3. Develop and train statistical model using data
4. Apply to unseen inputs
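The four steps above can be sketched end-to-end on a toy stemming task. Everything here is invented for illustration: the gold pairs, the trivial suffix-stripping "model", and the test words.

```python
# Toy run of the pipeline: coding standard (word -> stem), gold data,
# a trivial trained model, and application to unseen inputs.
from collections import Counter

# 2. gold-standard training pairs (hypothetical)
gold = [("running", "run"), ("jumped", "jump"), ("walking", "walk"),
        ("talked", "talk")]

# 3. "train": count which suffixes the gold data strips
suffixes = Counter()
for word, s in gold:
    if word.startswith(s):
        suffixes[word[len(s):]] += 1

def stem(word):
    # 4. apply: strip the most frequent known suffix that matches
    for suf, _ in suffixes.most_common():
        if suf and word.endswith(suf):
            return word[:-len(suf)]
    return word

print(stem("coding"))  # strips "ing" -> "cod"
```

Real stemmers are far richer, but the shape — standard, gold data, trained model, unseen inputs — is the same.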
2
Coding Bottleneck
• Bottleneck is collecting training corpus
• Commercial data's expensive (e.g. LDC, ELRA)
• Academic corpora typically restrictively licensed
• Limited to existing corpora
• For new problems, use: self, grad students, temps, interns, . . .
• Mechanical Turk to the rescue
3
Named Entities Worked
• Conveying the coding standard
– official MUC-6 standard dozens of pages
– examples are key
– (maybe a qualifying exam)
• User Interface Problem
– highlighting with mouse too fiddly (cf. Fitts's Law)
– one entity type at a time (vs. pulldown menus)
– checkboxes (vs. highlighting spans)
6
Discussion: Named Entities
• 190K tokens, 64K capitalized, 4K names
• Less than a week at 2 cents/400 tokens
• Turkers overall better than LDC data
– Correctly Rejected: Webster’s, Seagram, Du Pont,
Buick-Cadillac, Moon, erstwhile Phineas Foggs
– Incorrectly Accepted: Tass
– Missed Punctuation: J E. "Buster" Brown
• Many Turkers no better than chance (cf. social psych by Yochai Benkler, Harvard)
7
Morphological Stemming Worked
• Three iterations on coding standard
– simplified task to one stem
• Four iterations on final standard instructions
– added previously confusing examples
• Added qualifying test
9
Gene Linkage Failed
• Could get Turkers to pass qualifier
• Could not get Turkers to take task even at $1/hit
• Doing coding ourselves (5-10 minutes/HIT)
• How to get Turkers to do these complex tasks?
– Low concentration tasks done quickly
– Compatible with studies of why Turkers Turk
11
Some Labeled Data
• Seed the data with cases with known labels
• Use known cases to estimate coder accuracy
• Vote with adjustment for accuracy
• Requires a relatively large number of items for
– estimating accuracies well
– liveness for new items
• Gold may not be as pure as requesters think
• Some preference tasks have no “right” answer
– e.g. Bing vs. Google, Facestat, Colors, ...
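The seed-then-weight scheme above can be sketched as follows. Coder names, labels, and the add-1 smoothing are all hypothetical; the point is estimating each coder's accuracy on the known seeds, then voting with log-odds weights.

```python
import math

# Hypothetical annotations: item -> {coder: binary label}
seed_truth = {"s1": 1, "s2": 0, "s3": 1}        # seeded gold cases
codes = {
    "s1": {"a": 1, "b": 1, "c": 0},
    "s2": {"a": 0, "b": 1, "c": 0},
    "s3": {"a": 1, "b": 0, "c": 1},
    "x1": {"a": 1, "b": 0, "c": 1},             # item with unknown label
}

# Estimate coder accuracy on the seeds (add-1 smoothing)
acc = {}
for coder in {c for item in codes.values() for c in item}:
    right = sum(codes[i][coder] == t for i, t in seed_truth.items())
    acc[coder] = (right + 1) / (len(seed_truth) + 2)

def infer(item):
    # Vote with adjustment for accuracy: log-odds-weighted sum
    score = 0.0
    for coder, label in codes[item].items():
        w = math.log(acc[coder] / (1 - acc[coder]))
        score += w if label == 1 else -w
    return 1 if score > 0 else 0
```

Note how coder "b", who is below chance on the seeds, ends up with a negative weight: their vote counts against the label they chose.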
14
Estimate Everything
• Gold standard labels
• Coder accuracies
– sensitivity (true-positive rate; its complement is the miss rate)
– specificity (true-negative rate; its complement is the false-alarm rate)
– imbalance between the two indicates bias; high values indicate accuracy
• Coding standard difficulty
– average accuracies
– variation among coders
• Item difficulty (important, but not enough data)
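Jointly estimating labels and coder accuracies can be sketched with a Dawid–Skene-style EM for the binary case. This is a stand-in, not the talk's actual method (which is fully Bayesian, in R/BUGS); the annotations and initial values are invented.

```python
# EM sketch: alternately infer label posteriors (E-step) and
# re-estimate prevalence, sensitivity, specificity (M-step).
anno = {  # hypothetical: item -> {coder: 0/1}
    0: {"a": 1, "b": 1, "c": 1},
    1: {"a": 0, "b": 0, "c": 1},
    2: {"a": 1, "b": 1, "c": 0},
    3: {"a": 0, "b": 0, "c": 0},
}
coders = ["a", "b", "c"]

pi = 0.5                              # prevalence of label 1
sens = {c: 0.7 for c in coders}       # P(code 1 | true label 1)
spec = {c: 0.7 for c in coders}       # P(code 0 | true label 0)

for _ in range(50):
    # E-step: posterior probability each item's true label is 1
    post = {}
    for i, labels in anno.items():
        p1, p0 = pi, 1 - pi
        for c, x in labels.items():
            p1 *= sens[c] if x == 1 else 1 - sens[c]
            p0 *= 1 - spec[c] if x == 1 else spec[c]
        post[i] = p1 / (p1 + p0)
    # M-step: re-estimate parameters (add-1 smoothing)
    pi = sum(post.values()) / len(post)
    for c in coders:
        n1 = sum(post[i] for i in anno)
        hit1 = sum(post[i] for i in anno if anno[i][c] == 1)
        n0 = sum(1 - post[i] for i in anno)
        hit0 = sum(1 - post[i] for i in anno if anno[i][c] == 0)
        sens[c] = (hit1 + 1) / (n1 + 2)
        spec[c] = (hit0 + 1) / (n0 + 2)
```

The output `post` is exactly the probabilistic "gold standard" the later slides discuss, and `sens`/`spec` are the coder accuracies.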
15
Benefits of Estimation
• Full Bayesian posterior inference
– probabilistic “gold standard”
– compatible with Bayesian machine learning
• Works better than voting with threshold
– largest benefit with few Turkers/item
– evaluated with known “gold standard”
• Compatible with adding gold standard cases
16
Why we Need Task Difficulty
• What’s your estimate for:
– a baseball player who goes 12 for 20?
– a market that goes down 9 out of 10 days?
– a coin that lands heads 3 out of 10 times?
– . . .
• Smooth estimates for coders with few items
• Hierarchical model of accuracy prior
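A minimal illustration of the smoothing the slide asks for: a Beta prior pulls small-sample proportions toward a population mean. The prior parameters below are invented by hand, not fit hierarchically as in the talk's model.

```python
def posterior_mean(successes, trials, alpha, beta):
    # Beta(alpha, beta) prior + binomial likelihood -> posterior mean
    return (successes + alpha) / (trials + alpha + beta)

# Baseball: batting averages cluster tightly, so use a strong prior
batter = posterior_mean(12, 20, alpha=81, beta=219)   # raw 0.600
# Coin: strong prior belief in fairness
coin = posterior_mean(3, 10, alpha=50, beta=50)       # raw 0.300
```

The raw estimates (0.600 for the batter, 0.300 for the coin) are implausible; the priors shrink them toward what we know about batters and coins, which is what a hierarchical accuracy prior does for coders with few items.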
17
Soft Gold Standard
• Is 24 karat gold standard even possible?
• Some items are really marginal
• Traditional approach
– censoring disagreements
– adjudicating disagreements (revise standard)
– adjudication may not converge
• Posterior uncertainty can be modeled
18
Active Learning
• Choose most useful items to code next
• Balance two criteria
– high uncertainty
– high typicality (how to measure?)
• Can get away with fewer coders/item
• May introduce sampling bias
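The uncertainty half of the criterion can be sketched as maximum-entropy selection over posterior label probabilities (the probabilities here are hypothetical model outputs; typicality is omitted, since the slide flags it as the open question).

```python
import math

# Hypothetical posterior P(label = 1) per item from the annotation model
posterior = {"i1": 0.95, "i2": 0.55, "i3": 0.20, "i4": 0.50}

def entropy(p):
    # binary entropy in bits; maximal at p = 0.5
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# choose the item the model is least sure about
next_item = max(posterior, key=lambda i: entropy(posterior[i]))
```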
19
Code-a-Little, Learn-a-Little
• Semi-automated coding
• System suggests labels
• Coders correct labels
• Much faster coding
• But may introduce bias
20
5 Dentists Diagnosing Cavities

Each pattern is the five dentists' binary diagnoses for one item; the count is the number of items with that pattern.

Pattern Count   Pattern Count   Pattern Count   Pattern Count
00000   1880    01000    188    10000     22    11000      2
00001    789    01001    191    10001     26    11001     20
00010     43    01010     17    10010      6    11010      6
00011     75    01011     67    10011     14    11011     27
00100     23    01100     15    10100      1    11100      3
00101     63    01101     85    10101     20    11101     72
00110      8    01110      8    10110      2    11110      1
00111     22    01111     56    10111     17    11111    100
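The counts can be checked mechanically. This sketch totals the items and computes how many a simple majority vote (3+ of 5 dentists) would call cavities — the baseline the model-based posteriors are later compared against.

```python
# Dentist response-pattern counts, transcribed from the table
counts = {
    "00000": 1880, "00001": 789, "00010": 43, "00011": 75,
    "00100": 23, "00101": 63, "00110": 8, "00111": 22,
    "01000": 188, "01001": 191, "01010": 17, "01011": 67,
    "01100": 15, "01101": 85, "01110": 8, "01111": 56,
    "10000": 22, "10001": 26, "10010": 6, "10011": 14,
    "10100": 1, "10101": 20, "10110": 2, "10111": 17,
    "11000": 2, "11001": 20, "11010": 6, "11011": 27,
    "11100": 3, "11101": 72, "11110": 1, "11111": 100,
}
total = sum(counts.values())
# items a simple majority vote would label "cavity"
majority = sum(n for pat, n in counts.items() if pat.count("1") >= 3)
```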
22
Posteriors for Dentist Accuracies

[Figure: posterior histograms of annotator specificities (a.0, range 0.6–1.0) and annotator sensitivities (a.1, range 0.3–1.0), keyed by annotator 1–5]
• Posterior density vs. point estimates (e.g. mean)
23
Posteriors for Dentistry Data Items

[Figure: posterior density on [0, 1] of the true label for each of the 32 response patterns 00000–11111]
Accounts for bias, so very different from simple vote!
24
Beta-Binomial “Random Effects”
[Plate diagram: hyperpriors (α0, β0) and (α1, β1) generate per-coder accuracies θ0,j and θ1,j on plate J; prevalence π generates true labels ci on plate I; the labels and coder accuracies jointly generate annotations xk on plate K]
25
Sampling Notation
Coders don’t all code the same itemsci ∼ Bernoulli(π)
a0,j ∼ Beta(α0, β0)
a1,j ∼ Beta(α1, β1)
xk ∼ Bernoulli(cika1,jk
+ (1− cik)(1− a0,jk
))
π ∼ Beta(1, 1)
α0/(α0 + β0) ∼ Beta(1, 1)
α0 + β0 ∼ Polynomial(−5/2)
α1/(α1 + β1) ∼ Beta(1, 1)
α1 + β1 ∼ Polynomial(−5/2)
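One way to make the generative story concrete is to simulate data forward from it. The hyperparameter values below are invented purely to produce a dataset; in the actual model they are estimated, not fixed.

```python
import random

random.seed(0)
pi = 0.3                                 # prevalence of positive labels
J, I, K = 5, 200, 1000                   # coders, items, annotations

# per-coder accuracies drawn from (fixed, invented) Beta hyperpriors
a0 = [random.betavariate(40, 8) for _ in range(J)]   # specificities
a1 = [random.betavariate(20, 8) for _ in range(J)]   # sensitivities

c = [random.random() < pi for _ in range(I)]         # true labels
annotations = []
for _ in range(K):
    i = random.randrange(I)              # item for this annotation
    j = random.randrange(J)              # coder for this annotation
    if c[i]:
        x = random.random() < a1[j]      # hit with prob. sensitivity
    else:
        x = random.random() >= a0[j]     # false alarm with prob. 1 - spec
    annotations.append((i, j, int(x)))
```

Note that each annotation picks its own (item, coder) pair, matching the slide's point that coders don't all code the same items.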
26
Posterior for Coding Difficulty/Variance
These are posteriors from a simulation, with cross-hairs giving the "true" answer:
[Figure: scatterplot of posterior draws, "Posterior: Sensitivity Mean & Scale" — x-axis alpha.1 / (alpha.1 + beta.1) from 0.60 to 0.90, y-axis alpha.1 + beta.1 from 0 to 100]
[Figure: scatterplot of posterior draws, "Posterior: Specificity Mean & Scale" — x-axis alpha.0 / (alpha.0 + beta.0) from 0.60 to 0.90, y-axis alpha.0 + beta.0 from 0 to 100]
27
Extending Model
• Task difficulty
– Logistic Item-Response models
– Need more than 5-10 coders/item for tight posterior
• Easy to extend coding types
– Multinomial responses
– Ordinal responses (e.g. Likert 1-5 scale)
– Scalar responses
28
The End
• References
– http://lingpipe-blog.com
• Contact
• R/BUGS Code Subversion Repository
– http://alias-i.com/lingpipe/sandbox
29