DESCRIPTION

Examples of crowdsourcing experiments for research data forensics, talk at the USEWOD workshop at ESWC2014

TRANSCRIPT

Page 1: A brief introduction to crowdsourcing for data collection

A BRIEF INTRODUCTION TO CROWDSOURCED DATA COLLECTION

ELENA SIMPERL, UNIVERSITY OF SOUTHAMPTON

25-May-14, USEWOD@ESWC2014

Page 2: A brief introduction to crowdsourcing for data collection

EXECUTIVE SUMMARY

• Crowdsourcing can help with research data forensics
• But:
  • There are things computers do better than humans → hybrid approaches are the ultimate solution
  • There is crowdsourcing and crowdsourcing → pick your faves and mix them
  • Human intelligence is a valuable resource → experiment design is key

Page 3: A brief introduction to crowdsourcing for data collection

CROWDSOURCING: PROBLEM SOLVING VIA OPEN CALLS

"Simply defined, crowdsourcing represents the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call. This can take the form of peer-production (when the job is performed collaboratively), but is also often undertaken by sole individuals. The crucial prerequisite is the use of the open call format and the large network of potential laborers."

[Howe, 2006]

Page 4: A brief introduction to crowdsourcing for data collection

CROWDSOURCING COMES IN DIFFERENT FORMS AND FLAVORS

Page 5: A brief introduction to crowdsourcing for data collection

DIMENSIONS OF CROWDSOURCING

Page 6: A brief introduction to crowdsourcing for data collection

DIMENSIONS OF CROWDSOURCING

WHAT IS OUTSOURCED
• Tasks based on human skills not easily replicable by machines:
  • Visual recognition
  • Language understanding
  • Knowledge acquisition
  • Basic human communication
  • ...

WHO IS THE CROWD
• Open call (crowd accessible through a platform)
• Call may target specific skills and expertise (qualification tests)
• Requester typically knows less about the ‘workers’ than in other ‘work’ environments

See also [Quinn & Bederson, 2012]

Page 7: A brief introduction to crowdsourcing for data collection

USEWOD EXPERIMENT: TASK AND CROWD

WHAT IS OUTSOURCED
• Annotating research papers with data set information
• Alternative representations of the domain:
  • Bibliographic reference
  • Abstract + title
  • Paragraph
  • Full paper
• What if the domain is not known in advance or is infinite?
• Do we know the list of potential answers?
• Is there only one correct solution to each atomic task?
• How many people would solve the same task?

WHO IS THE CROWD
• People who know the papers or the data sets
• Experts in the (broader) field
• Casual gamers
• Librarians
• Anyone (knowledgeable of English, with a computer/cell phone…)
• Combinations thereof…

Page 8: A brief introduction to crowdsourcing for data collection

CROWDSOURCING AS 'HUMAN COMPUTATION'

Outsourcing tasks that machines find difficult to solve to humans

(Tutorial@ISWC2013)

Page 9: A brief introduction to crowdsourcing for data collection

DIMENSIONS OF CROWDSOURCING (2)

HOW IS THE TASK OUTSOURCED
• Explicit vs. implicit participation
• Tasks broken down into smaller units undertaken in parallel by different people (see the sketch below)
• Coordination required to handle cases with more complex workflows
• Partial or independent answers consolidated and aggregated into a complete solution

See also [Quinn & Bederson, 2012]
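To make the decomposition bullet concrete, here is a minimal Python sketch of splitting a paper-annotation job into micro-units and assigning each unit redundantly to several workers. The paper IDs, question text, and redundancy level are illustrative assumptions; a real deployment would push these units to a crowdsourcing platform.

```python
import itertools

# Hypothetical inputs: each paper becomes one micro-task,
# and every micro-task is assigned to several workers (redundancy).
papers = ["paper-001", "paper-002", "paper-003"]
REDUNDANCY = 3  # how many independent workers see each unit

def decompose(papers):
    """One micro-task per paper: 'which data set (and version) does it use?'"""
    for paper_id in papers:
        yield {
            "unit_id": f"unit-{paper_id}",
            "question": f"Which data set does {paper_id} use, and which version?",
        }

def assign(units, redundancy=REDUNDANCY):
    """Emit (unit, assignment) pairs; a platform would hand each to a distinct worker."""
    for unit, k in itertools.product(units, range(redundancy)):
        yield {**unit, "assignment": k}

if __name__ == "__main__":
    for task in assign(decompose(papers)):
        print(task)  # in practice: publish the task to a crowdsourcing platform
```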

Page 10: A brief introduction to crowdsourcing for data collection

EXAMPLE: CITIZEN SCIENCE

WHAT IS OUTSOURCED
• Object recognition, labeling, categorization in media content

WHO IS THE CROWD
• Anyone

HOW IS THE TASK OUTSOURCED
• Highly parallelizable tasks
• Every item is handled by multiple annotators
• Every annotator provides an answer
• Consolidated answers solve scientific problems

Page 11: A brief introduction to crowdsourcing for data collection

EXPLICIT VS. IMPLICIT CONTRIBUTION AFFECTS MOTIVATION AND ENGAGEMENT

Explicit: users are aware of how their input contributes to the achievement of the application's goal (and identify themselves with it).

vs.

Implicit: tasks are hidden behind the application narrative; engagement is ensured through other incentives.

Page 12: A brief introduction to crowdsourcing for data collection

USEWOD EXPERIMENT: TASK DESIGN

HOW IS THE TASK OUTSOURCED: ALTERNATIVE MODELS
• Use the data collected here to train an IE algorithm
• Use paid microtask workers to do a first screening, then an expert crowd to sort out challenging cases (see the sketch below)
• What if you have very long documents potentially mentioning different/unknown data sets?
• Competition via Twitter
  • ‘Which version of DBpedia does this paper use?’
  • One question a day, prizes
  • Needs a gold standard to bootstrap, plus redundancy
• Involve the authors
  • Use crowdsourcing to find out Twitter accounts, then launch a campaign on Twitter
  • Write an email to the authors…
• Change the task
  • Which papers use DBpedia 3.x?
  • Competition to find all papers
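A minimal sketch of the two-stage model mentioned above (screening by paid microtask workers, escalation of disagreements to an expert crowd). The agreement threshold and the example answers are illustrative assumptions, not values from the experiment.

```python
from collections import Counter

def screen_then_escalate(crowd_answers, expert_answer, agreement_threshold=0.8):
    """Stage 1: accept the majority answer from microtask workers if agreement is high.
    Stage 2: otherwise fall back to an expert judgement (passed in here as a callable)."""
    counts = Counter(crowd_answers)
    answer, votes = counts.most_common(1)[0]
    agreement = votes / len(crowd_answers)
    if agreement >= agreement_threshold:
        return answer, "crowd"
    return expert_answer(), "expert"

# Illustrative usage with made-up labels:
result = screen_then_escalate(
    ["DBpedia 3.8", "DBpedia 3.8", "DBpedia 3.7"],
    expert_answer=lambda: "DBpedia 3.8",
)
print(result)  # ('DBpedia 3.8', 'expert') -- 2/3 agreement is below the 0.8 threshold
```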

Page 13: A brief introduction to crowdsourcing for data collection

EXAMPLE: SOYLENT AND COMPLEX WORKFLOWS

WHAT IS OUTSOURCED
• Text shortening, proofreading, open editing

WHO IS THE CROWD
• MTurk

HOW IS THE TASK OUTSOURCED
• Text divided into paragraphs
• Find-Fix-Verify pattern (see the sketch below)
• Multiple workers in each step

http://www.youtube.com/watch?v=n_miZqsPwsc

See also [Bernstein et al., 2010]
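A minimal sketch of how a Find-Fix-Verify pipeline could be wired together. The worker-facing callables (`ask_workers_to_find`, `ask_workers_to_fix`, `ask_workers_to_verify`) are hypothetical stand-ins for calls to a microtask platform, not part of Soylent's actual API.

```python
def find_fix_verify(paragraph, ask_workers_to_find, ask_workers_to_fix, ask_workers_to_verify):
    """Find: workers mark problem spans; Fix: other workers propose rewrites;
    Verify: a third group of workers accepts or rejects each proposed rewrite."""
    spans = ask_workers_to_find(paragraph)                      # e.g. character ranges to shorten
    patches = [ask_workers_to_fix(paragraph, s) for s in spans]  # one rewrite per flagged span
    return [p for p in patches if ask_workers_to_verify(paragraph, p)]

# Illustrative dry run with stubbed 'crowds':
edits = find_fix_verify(
    "This sentence is somewhat unnecessarily and rather verbosely long.",
    ask_workers_to_find=lambda text: [(17, 52)],
    ask_workers_to_fix=lambda text, span: (span, "quite"),
    ask_workers_to_verify=lambda text, patch: True,
)
print(edits)  # [((17, 52), 'quite')]
```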

Page 14: A brief introduction to crowdsourcing for data collection

DIMENSIONS OF CROWDSOURCING (3)

HOW ARE THE RESULTS VALIDATED
• Solution space: closed vs. open
• Performance measurements/ground truth
• Statistical techniques employed to predict accurate solutions (see the sketch below)
• May take into account confidence values of algorithmically generated solutions

HOW CAN THE PROCESS BE OPTIMIZED
• Incentives and motivators
• Assigning tasks to people based on their skills and performance (as opposed to random assignments)
• Symbiotic combinations of human- and machine-driven computation, including combinations of different forms of crowdsourcing

See also [Quinn & Bederson, 2012]
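As a concrete (and deliberately simple) instance of such statistical consolidation, the sketch below takes redundant worker answers plus an optional machine-generated answer with a confidence value and returns the label with the highest combined weight. The weighting scheme and example answers are illustrative assumptions, not a method prescribed by the slides.

```python
from collections import defaultdict

def consolidate(worker_answers, machine_answer=None, machine_confidence=0.0):
    """Weighted vote: each worker answer counts 1.0; an algorithmic answer,
    if present, contributes its confidence value as extra weight."""
    weights = defaultdict(float)
    for answer in worker_answers:
        weights[answer] += 1.0
    if machine_answer is not None:
        weights[machine_answer] += machine_confidence
    return max(weights.items(), key=lambda kv: kv[1])

# Illustrative usage with made-up answers:
print(consolidate(["DBpedia 3.7", "DBpedia 3.8", "DBpedia 3.8"],
                  machine_answer="DBpedia 3.7", machine_confidence=1.5))
# ('DBpedia 3.7', 2.5) -- the confident machine answer tips the vote
```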

Page 15: A brief introduction to crowdsourcing for data collection

USEWOD EXPERIMENT: VALIDATION

• Domain is fairly restricted
• Spam and obviously wrong answers can be detected easily
• When are two answers the same? Can there be more than one correct answer per question?
• Redundancy may not be the final answer
• Most people will be able to identify the data set, but sometimes the actual version is not trivial to reproduce
• Make an educated version guess based on time intervals and other features (see the sketch below)
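A minimal sketch of such an educated guess: given a paper's publication date, pick the latest data set release that was already available. The release dates in `RELEASES` are placeholders to be replaced with the actual DBpedia release history, not asserted facts, and the lag parameter is an illustrative assumption.

```python
from datetime import date

# Placeholder release dates -- replace with the real DBpedia release history.
RELEASES = [
    ("DBpedia 3.6", date(2011, 1, 1)),
    ("DBpedia 3.7", date(2011, 9, 1)),
    ("DBpedia 3.8", date(2012, 8, 1)),
]

def guess_version(paper_date, releases=RELEASES, lag_days=0):
    """Return the most recent release available before the paper appeared, minus an
    optional lag accounting for the gap between running experiments and publishing."""
    candidates = [(name, d) for name, d in releases
                  if (paper_date - d).days >= lag_days]
    if not candidates:
        return None
    return max(candidates, key=lambda nd: nd[1])[0]

print(guess_version(date(2012, 11, 10)))  # 'DBpedia 3.8' with the placeholder dates above
```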

Page 16: A brief introduction to crowdsourcing for data collection

ALIGNING INCENTIVES IS ESSENTIAL

Motivation: the driving force that makes humans achieve their goals.

Incentives: ‘rewards’ assigned by an external ‘judge’ to a performer for undertaking a specific task.

• Common belief (among economists): incentives can be translated into a sum of money for all practical purposes.
• Incentives can be related to both extrinsic and intrinsic motivations.
• Extrinsic motivation applies if the task is considered boring, dangerous, useless, socially undesirable, or dislikable by the performer.
• Intrinsic motivation is driven by an interest or enjoyment in the task itself.

Page 17: A brief introduction to crowdsourcing for data collection

EXAMPLE: DIFFERENT CROWDS FOR DIFFERENT TASKS

Find (contest)
• Linked Data experts
• Difficult task
• Final prize
• TripleCheckMate [Kontokostas et al., 2013]

Verify (microtasks)
• Workers
• Easy task
• Micropayments
• MTurk (http://mturk.com)

Adapted from [Bernstein et al., 2010]; see also [Acosta et al., 2013]

Page 18: A brief introduction to crowdsourcing for data collection

IT'S NOT ALWAYS JUST ABOUT MONEY

http://www.crowdsourcing.org/editorial/how-to-motivate-the-crowd-infographic/
http://www.oneskyapp.com/blog/tips-to-motivate-participants-of-crowdsourced-translation/

[Kaufmann, Schulze, Veit, 2011]

Page 19: A brief introduction to crowdsourcing for data collection

USEWOD EXPERIMENT: OTHER INCENTIVE MODELS

• Twitter-based contest
  • ‘Which version of DBpedia does this paper use?’
  • One question a day, prizes
  • If a question is not answered correctly, increase the prize
  • If participation is low, re-focus the audience or change the incentive
• Altruism: for every ten papers annotated, we send a student to ESWC…

Page 20: A brief introduction to crowdsourcing for data collection

PRICING ON MTURK: AFFORDABLE, BUT SCALE OF EXPERIMENTS DOES MATTER

[Ipeirotis, 2008]
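To make the scale argument tangible, here is a back-of-the-envelope cost estimate. The per-assignment price, redundancy, and fee rate are illustrative assumptions, not figures from the talk or from [Ipeirotis, 2008].

```python
def mturk_cost(num_items, redundancy=3, price_per_assignment=0.05, fee_rate=0.20):
    """Total cost = items x redundancy x price per assignment, plus the platform fee."""
    assignments = num_items * redundancy
    base = assignments * price_per_assignment
    return base * (1 + fee_rate)

# 500 papers, 3 workers each, $0.05 per judgement, 20% platform fee:
print(f"${mturk_cost(500):.2f}")  # $90.00
```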

Page 21: A brief introduction to crowdsourcing for data collection

USEWOD EXPERIMENT: HYBRID APPROACH

• Use an IE algorithm to select the best candidates
• Use different types of crowds
• Publish results as Linked Data (see the sketch below)

See also [Demartini et al., 2012]
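A minimal sketch of the last step, publishing a consolidated annotation as Linked Data with rdflib. The namespace, property names, and identifiers are hypothetical choices for illustration, not a vocabulary used in the experiment.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/usewod/")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)

paper = URIRef(EX["paper-001"])  # hypothetical paper identifier
g.add((paper, RDF.type, EX.Paper))
g.add((paper, EX.usesDataset, URIRef("http://dbpedia.org/")))
g.add((paper, EX.datasetVersion, Literal("3.8")))
g.add((paper, EX.annotationSource, Literal("crowd+IE consolidation")))

print(g.serialize(format="turtle"))
```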

Page 22: A brief introduction to crowdsourcing for data collection

SUMMARY

Page 23: A brief introduction to crowdsourcing for data collection

SUMMARY AND FINAL REMARKS

• There are things computers do better than humans → hybrid approaches are the ultimate solution
• There is crowdsourcing and crowdsourcing → pick your faves and mix them
• Human intelligence is a valuable resource → experiment design is key