
Post on 30-Dec-2015


Spam? No, thanks!

Panos Ipeirotis – New York University

ProPublica, Apr 1st 2010 (Disclaimer: No jokes included)

“A Computer Scientist in a Business School”

http://behind-the-enemy-lines.blogspot.com/

Email: panos@nyu.edu


Panos Ipeirotis - Introduction

New York University, Stern School of Business

Example: Build an Adult Web Site Classifier

– Need a large number of hand-labeled sites
– Get people to look at sites and classify them as: G (general), PG (parental guidance), R (restricted), X (porn)

Cost/Speed Statistics
– Undergrad intern: 200 websites/hr, cost: $15/hr
– MTurk: 2500 websites/hr, cost: $12/hr


Bad news: Spammers!

Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience)


Improve Data Quality through Repeated Labeling

– Get multiple, redundant labels using multiple workers
– Pick the correct label based on majority vote

Probability of correctness increases with number of workers

Probability of correctness increases with quality of workers

1 worker: 70% correct

11 workers: 93% correct
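The jump from 70% to roughly 93% is consistent with a simple model: a two-way decision, an odd number of workers, and every worker independently correct with the same probability. Those assumptions are not stated on the slide, and the function name below is illustrative. Under them, the majority is correct whenever more than half of the votes are correct:

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent workers is correct.

    Assumes a binary task, an odd n (so no ties), and that every worker
    is correct with the same probability p.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of workers to avoid ties")
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

print(round(majority_accuracy(0.7, 1), 3))   # 0.7
print(round(majority_accuracy(0.7, 11), 3))  # 0.922
```

With 11 workers at 70% each, this model gives about 92%, in line with the figure on the slide. Real workers are neither independent nor equally accurate, so treat the number as a rough guide.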

But Majority Voting is Expensive

Single Vote Statistics
– MTurk: 2500 websites/hr, cost: $12/hr
– Undergrad: 200 websites/hr, cost: $15/hr

11-vote Statistics
– MTurk: 227 websites/hr, cost: $12/hr
– Undergrad: 200 websites/hr, cost: $15/hr
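The arithmetic behind these figures can be checked in a few lines. The sketch below only uses the numbers from the slides; the helper name is illustrative.

```python
def cost_per_site(hourly_cost: float, sites_per_hour: float) -> float:
    """Dollars spent per labeled site."""
    return hourly_cost / sites_per_hour

votes = 11
single_vote_rate = 2500                    # MTurk sites/hr, one vote per site
redundant_rate = single_vote_rate / votes  # about 227 sites/hr with 11 votes

print(f"undergrad:       ${cost_per_site(15, 200):.4f} per site")
print(f"MTurk, 1 vote:   ${cost_per_site(12, single_vote_rate):.4f} per site")
print(f"MTurk, 11 votes: ${cost_per_site(12, redundant_rate):.4f} per site")
```

With 11 votes per site the per-site cost on MTurk rises from about half a cent to about five cents, which is most of the way to the undergrad's 7.5 cents.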

Using redundant votes, we can infer worker quality

Look at our spammer friend ATAMRO447HWJQ together with 9 other workers

Our “friend” ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer…

We can compute error rates for each worker

Error rates for ATAMRO447HWJQ
P[X → X]=9.847%   P[X → G]=90.153%
P[G → X]=0.053%   P[G → G]=99.947%
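The slides do not say how these rates are estimated. One simple way, shown here as an assumption rather than as the talk's method, is to treat the majority label for each site as the reference answer and count how often each worker's label agrees with it. The data and worker names below are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_error_rates(labels):
    """labels: {site: {worker: label}} -> {worker: {(reference, given): rate}}.

    Uses the per-site majority label as a stand-in for the true label.
    Ties are broken arbitrarily, so use an odd number of votes per site.
    """
    counts = defaultdict(Counter)  # worker -> Counter of (reference, given)
    for votes in labels.values():
        reference = Counter(votes.values()).most_common(1)[0][0]
        for worker, given in votes.items():
            counts[worker][(reference, given)] += 1

    rates = {}
    for worker, pairs in counts.items():
        totals = Counter()  # how many sites of each reference class
        for (reference, _), n in pairs.items():
            totals[reference] += n
        rates[worker] = {pair: n / totals[pair[0]] for pair, n in pairs.items()}
    return rates

# Hypothetical toy data: worker "w3" answers G no matter what.
toy = {
    "site1": {"w1": "X", "w2": "X", "w3": "G"},
    "site2": {"w1": "G", "w2": "G", "w3": "G"},
    "site3": {"w1": "X", "w2": "X", "w3": "G"},
}
rates = estimate_error_rates(toy)
print(rates["w3"])
```

Here `rates["w3"]` reports that X sites were always labeled G, the same pattern as the worker on the slide. Majority vote is itself noisy, so estimates from a handful of sites should not be trusted.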

Rejecting spammers and Benefits

Random answers error rate = 50%

Average error rate for ATAMRO447HWJQ: 45.2%
P[X → X]=9.847%   P[X → G]=90.153%
P[G → X]=0.053%   P[G → G]=99.947%

Action: REJECT and BLOCK
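A minimal sketch of the reject decision, assuming the average error rate is the plain mean of the per-class error rates. The rejection threshold is a hypothetical value chosen for illustration; the talk does not give one.

```python
def average_error_rate(confusion):
    """confusion: {true_label: {given_label: probability}}.

    Returns the mean, over true labels, of the probability of a wrong answer.
    """
    per_class = [1.0 - row.get(true, 0.0) for true, row in confusion.items()]
    return sum(per_class) / len(per_class)

# Error rates for ATAMRO447HWJQ as shown on the slide.
spammer = {
    "X": {"X": 0.09847, "G": 0.90153},
    "G": {"X": 0.00053, "G": 0.99947},
}
REJECT_THRESHOLD = 0.40  # hypothetical cutoff, not from the talk

rate = average_error_rate(spammer)
print(f"{rate:.1%}", "REJECT" if rate > REJECT_THRESHOLD else "keep")
```

The unweighted mean comes out at about 45.1%, close to the 45.2% on the slide, which may have been rounded or weighted differently. Either way it sits near the 50% expected from random answers on a two-class task.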

Results:
– Over time you block all spammers
– Spammers learn to avoid your HITs
– You can decrease redundancy, as quality of workers is higher

After rejecting spammers, quality goes up
– Spam keeps quality down
– Without spam, workers are of higher quality
– Need less redundancy for same quality
– Same quality of results for lower cost

With spam: 1 worker, 70% correct

With spam: 11 workers, 93% correct

Without spam: 1 worker, 80% correct

Without spam: 5 workers, 94% correct
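The saving can be made concrete by asking how many votes are needed to reach a fixed accuracy. The sketch below assumes a binary task with independent, equally accurate workers, and a target of 92% that is chosen here for illustration and is not from the talk. The price per label comes from the single-vote MTurk figures ($12/hr for 2500 sites).

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a majority of n independent workers is correct."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

def workers_needed(p, target, max_workers=99):
    """Smallest odd number of workers whose majority vote reaches target."""
    for n in range(1, max_workers + 1, 2):
        if majority_accuracy(p, n) >= target:
            return n
    return None  # target not reachable within max_workers

PRICE_PER_LABEL = 12 / 2500  # dollars per single vote

for p in (0.70, 0.80):
    n = workers_needed(p, target=0.92)
    print(f"worker accuracy {p:.0%}: {n} votes, ${n * PRICE_PER_LABEL:.4f} per site")
```

Under these assumptions, 70% workers need 11 votes and 80% workers need 5, so the per-site cost drops by more than half.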

Correcting biases

Classifying sites as G, PG, R, X

Sometimes workers are careful but biased

Classifies G → P and P → R

Average error rate for ATLJIK76YH1TF: 45.0%

Is ATLJIK76YH1TF a spammer?

Error Rates for Worker: ATLJIK76YH1TF

P[G → G]=20.0%   P[G → P]=80.0%   P[G → R]=0.0%     P[G → X]=0.0%
P[P → G]=0.0%    P[P → P]=0.0%    P[P → R]=100.0%   P[P → X]=0.0%
P[R → G]=0.0%    P[R → P]=0.0%    P[R → R]=100.0%   P[R → X]=0.0%
P[X → G]=0.0%    P[X → P]=0.0%    P[X → R]=0.0%     P[X → X]=100.0%


Correcting biases

For ATLJIK76YH1TF, we simply need to compute the “non-recoverable” error-rate (technical details omitted)

Non-recoverable error-rate for ATLJIK76YH1TF: 9%
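The talk omits the details, so the following is only one plausible reading of "non-recoverable": use Bayes' rule to turn each label the worker assigns into a probability distribution over the true classes, then measure how much uncertainty is left. The uniform prior and the 0/1 cost are assumptions made here, and this sketch is not expected to reproduce the 9% figure, which depends on priors and costs not given in the transcript.

```python
LABELS = ["G", "P", "R", "X"]

# Rows: true class. Columns: label the worker assigns (from the slide).
biased = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}
# A hypothetical spammer who answers G for everything.
always_g = {t: {"G": 1.0, "P": 0.0, "R": 0.0, "X": 0.0} for t in LABELS}

def residual_error(confusion, prior):
    """Expected error left after undoing the worker's systematic bias."""
    total = 0.0
    for given in LABELS:
        # Joint probability of (true class, assigned label == given).
        joint = {t: prior[t] * confusion[t][given] for t in LABELS}
        p_given = sum(joint.values())
        if p_given == 0:
            continue  # the worker never uses this label
        posterior = {t: joint[t] / p_given for t in LABELS}
        # Chance that a class drawn from the posterior differs from the truth.
        total += p_given * (1.0 - sum(q * q for q in posterior.values()))
    return total

uniform = {t: 0.25 for t in LABELS}  # assumed prior, not from the talk
print(residual_error(biased, uniform))    # 0.25
print(residual_error(always_g, uniform))  # 0.75
```

Both workers look bad on raw error rate, but the biased worker's labels still identify G and X sites exactly, and only P and R remain confused. The always-G worker's labels carry no information at all, which is what separates a biased worker from a spammer.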


Too much theory?

Open source implementation available at: http://code.google.com/p/get-another-label/

Input:
– Labels from Mechanical Turk
– Cost of incorrect labelings (e.g., X → G costlier than G → X)

Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality
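To illustrate why the cost of incorrect labelings matters, here is a small sketch of cost-sensitive label selection. It is not the tool's input format or API. The cost values are hypothetical; only the ordering (an X site labeled G is worse than a G site labeled X) comes from the slide.

```python
# Hypothetical costs: COST[true][assigned].
COST = {
    "G": {"G": 0.0, "X": 1.0},
    "X": {"G": 10.0, "X": 0.0},
}

def cheapest_label(posterior, cost=COST):
    """Pick the label with the lowest expected misclassification cost."""
    def expected_cost(assigned):
        return sum(p * cost[true][assigned] for true, p in posterior.items())
    return min(cost, key=expected_cost)

# A site that is probably G, with a sizeable chance of being X.
print(cheapest_label({"G": 0.6, "X": 0.4}))  # X
```

Even though G is the more likely class, the expected cost of answering G (0.4 × 10 = 4) exceeds that of answering X (0.6 × 1 = 0.6), so the cautious label wins.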

Alpha version, more improvements to come! Suggestions and collaborations welcomed!

Thank you!

Questions?

