crowdsourcing the assembly of concept hierarchies

Kai Eckert¹, Mathias Niepert¹, Christof Niemann¹, Cameron Buckner², Colin Allen², Heiner Stuckenschmidt¹ (¹ University of Mannheim, Germany; ² Indiana University, USA). Joint Conference on Digital Libraries (JCDL), Brisbane, Australia, 2010. Presentation: Kai Eckert, Wednesday, June 23, 2010.


DESCRIPTION

How to create a taxonomy by a paid workforce provided by Amazon Mechanical Turk. Evaluative comparison to an existing community of motivated students and domain experts. Presentation held at JCDL 2010, Brisbane, Australia (http://www.jcdl2010.org).

TRANSCRIPT

Page 1: Crowdsourcing the Assembly of Concept Hierarchies

Crowdsourcing the Assembly of Concept Hierarchies

Kai Eckert¹, Mathias Niepert¹, Christof Niemann¹, Cameron Buckner², Colin Allen², Heiner Stuckenschmidt¹

¹ University of Mannheim, Germany
² Indiana University, USA

Joint Conference on Digital Libraries (JCDL), Brisbane, Australia, 2010

Presentation: Kai Eckert, Wednesday, June 23, 2010

Page 2: Crowdsourcing the Assembly of Concept Hierarchies

Motivation

● Various types of Concept Hierarchies:

● Thesauri
● Taxonomies
● Classifications
● Ontologies
● ...

● Manual creation is expensive.

● Automatic creation lacks quality.

Page 3: Crowdsourcing the Assembly of Concept Hierarchies

Could the users do the work?

● Divide the work between a lot of users.

● Motivate them to be part of a community.

● Achieve quality control by means of redundancy.

● Can a concept hierarchy be created collaboratively, the way Wikipedia is?

Page 4: Crowdsourcing the Assembly of Concept Hierarchies

● InPhO: The Indiana Philosophy Ontology Project.

● A browsable taxonomy of philosophical ideas.

● Ideas are extracted from the Stanford Encyclopedia of Philosophy (SEP).

● Intuitive access to the SEP via the InPhO taxonomy.

● Entry point for other philosophical resources on the web.

Page 5: Crowdsourcing the Assembly of Concept Hierarchies

From the SEP to InPhO

Extraction of new ideas and relationships

Gathering community feedback about ideas and relationships

Process feedback and infer positions in the classification tree

Start with a hand-built formal ontology describing major topics and sub-topics.

Page 6: Crowdsourcing the Assembly of Concept Hierarchies

Gathering community feedback

Page 7: Crowdsourcing the Assembly of Concept Hierarchies

Gathering community feedback

Relatedness

Page 8: Crowdsourcing the Assembly of Concept Hierarchies

Gathering community feedback

Relatedness

Relative Generality

is more specific than

Page 9: Crowdsourcing the Assembly of Concept Hierarchies
Page 10: Crowdsourcing the Assembly of Concept Hierarchies

Great stuff, but...

● What if you do not have a motivated community of expert users?

● Well,...

● Like almost everything, you can buy it at Amazon...

● Amazon Mechanical Turk

Page 11: Crowdsourcing the Assembly of Concept Hierarchies

Amazon Mechanical Turk (AMT)

● Platform for placing and taking Human Intelligence Tasks (HITs).

● 100,000 – 400,000 HITs available.

● Number of workers: unknown (100,000 in 100 countries according to the New York Times, 2007).

Page 12: Crowdsourcing the Assembly of Concept Hierarchies

HIT Definition

Time allotted per assignment: Maximum time a worker can work on a single task.

Worker restrictions: Approval Rate, Location

Reward per assignment: How much do you pay for each HIT?

Number of assignments per HIT: How many unique workers do you want to work on each HIT?
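
The four settings above can be collected in one small record. The following is a minimal sketch only: `HitDefinition` and its field names are hypothetical and are not part of any AMT client library, the values are illustrative rather than the settings used in the experiment, and platform fees are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HitDefinition:
    """Hypothetical container for the four HIT settings (not an AMT API type)."""
    time_allotted_s: int       # maximum time a worker may spend on one assignment
    min_approval_rate: float   # worker restriction: minimum approval rate in percent
    allowed_locations: tuple   # worker restriction: country codes, empty = anywhere
    reward_usd: float          # reward per assignment
    assignments: int           # number of unique workers per HIT

    def worker_payments_usd(self, num_hits: int) -> float:
        """Rewards paid if every assignment of every HIT is completed and approved."""
        return round(self.reward_usd * self.assignments * num_hits, 2)


# Illustrative values only.
hit = HitDefinition(
    time_allotted_s=600,
    min_approval_rate=95.0,
    allowed_locations=("US",),
    reward_usd=0.10,
    assignments=5,
)
print(hit.worker_payments_usd(num_hits=10))  # 5.0
```

The reward and the number of assignments multiply, so redundancy is the main cost driver: asking five workers instead of one makes the same HIT five times as expensive.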

Page 13: Crowdsourcing the Assembly of Concept Hierarchies

HIT Result

Answer of each worker for each HIT

Accept Time, Submit Time, Work Time In Seconds

Worker ID

Page 14: Crowdsourcing the Assembly of Concept Hierarchies

Our questions

Can we replace the InPhO community by means of Amazon Mechanical Turk?

How much does it cost and what is the resulting quality?

Page 15: Crowdsourcing the Assembly of Concept Hierarchies

Experimental Setup

Minimum overlap i       1       2     3     4     5
Number of pairs     3,237   1,154   370   187    92

● We wanted some overlap among the experts:

We chose the 1,154 pairs with a minimum overlap of 2.

● Each pair was evaluated by 5 different workers.

● Each worker evaluated at least 12 pairs (1 HIT).

● 87 distinct workers.

● The HITs were completed in 20 hours.
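
The overlap table can be derived from the raw expert evaluations by counting, for each pair, how many distinct experts rated it. The sketch below assumes that "minimum overlap i" means "evaluated by at least i experts"; the function name and the input layout are hypothetical, and the toy data is made up.

```python
from collections import Counter


def pairs_by_min_overlap(expert_evaluations, max_i=5):
    """Count pairs evaluated by at least i distinct experts, for i = 1..max_i.

    expert_evaluations: iterable of (expert_id, pair) tuples.
    """
    experts_per_pair = {}
    for expert_id, pair in expert_evaluations:
        experts_per_pair.setdefault(pair, set()).add(expert_id)
    # overlap[k] = number of pairs rated by exactly k experts
    overlap = Counter(len(experts) for experts in experts_per_pair.values())
    return {
        i: sum(count for k, count in overlap.items() if k >= i)
        for i in range(1, max_i + 1)
    }


# Made-up toy data: one pair rated by two experts, one pair rated by one.
toy = [
    ("e1", ("dualism", "philosophy of mind")),
    ("e2", ("dualism", "philosophy of mind")),
    ("e1", ("computer ethics", "ethics")),
]
print(pairs_by_min_overlap(toy, max_i=3))  # {1: 2, 2: 1, 3: 0}
```

Raising the threshold trades coverage for a more reliable reference: every additional required expert shrinks the set of usable pairs.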

Page 16: Crowdsourcing the Assembly of Concept Hierarchies

Measuring Agreement

● Calculation of the distance between two answers:

● Relatedness: Absolute value of the difference

● Relative Generality: Match: 0, otherwise: 1

● The evaluation deviation is the mean distance of a user to the users in a reference group.
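
The two distance measures and the evaluation deviation can be written down in a few lines. This is a sketch under assumptions: the slides do not say how the mean is taken, so the code pools all (pair, reference user) combinations in which both sides gave an answer; the function names and the dictionary layout are hypothetical.

```python
from statistics import mean


def relatedness_distance(a, b):
    """Relatedness: absolute value of the difference between two ratings."""
    return abs(a - b)


def generality_distance(a, b):
    """Relative generality: 0 if the two answers match, otherwise 1."""
    return 0 if a == b else 1


def evaluation_deviation(user_answers, reference_answers, distance):
    """Mean distance of one user to the users in a reference group.

    user_answers: {pair: answer}
    reference_answers: {reference_user: {pair: answer}}
    Returns None when the user shares no pair with the reference group.
    """
    distances = [
        distance(answer, ref[pair])
        for pair, answer in user_answers.items()
        for ref in reference_answers.values()
        if pair in ref
    ]
    return mean(distances) if distances else None


# Made-up answers for illustration.
worker = {"p1": 3, "p2": 1}
experts = {"x": {"p1": 4, "p2": 1}, "y": {"p1": 1}}
print(evaluation_deviation(worker, experts, relatedness_distance))  # 1
```

For relative generality the same function is called with `generality_distance`, so the deviation becomes the fraction of comparisons in which the user disagrees with the reference group.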

Page 17: Crowdsourcing the Assembly of Concept Hierarchies

Comparison with Experts (Relative Generality)

[Figure: fraction of users in % (y-axis, up to 30) for InPhO users and AMT users; x-axis from 0 to 1.0, annotated "Follow Experts" and "Own Opinion".]

Page 18: Crowdsourcing the Assembly of Concept Hierarchies

Comparison with Experts (Relative Generality)

[Figure: fraction of users in % (y-axis, up to 30) for InPhO users and AMT users; x-axis from 0 to 1.0, annotated "Follow Experts", "Own Opinion" and "Random Clicker".]

Page 20: Crowdsourcing the Assembly of Concept Hierarchies

Comparison with Experts (Relative Generality)

[Figure: fraction of users in % (y-axis, up to 30) for InPhO users and AMT users; x-axis from 0 to 1.0, annotated "Follow Experts" and "Own Opinion".]

InPhO Users are quite consistent.

Page 21: Crowdsourcing the Assembly of Concept Hierarchies

Comparison with Experts (Relative Generality)

[Figure: fraction of users in % (y-axis, up to 30) for InPhO users and AMT users; x-axis from 0 to 1.0, annotated "Follow Experts" and "Own Opinion".]

InPhO Users are quite consistent.

AMT Users are not consistent. → Are there good ones?

Page 22: Crowdsourcing the Assembly of Concept Hierarchies

Comparison with Experts (Relative Generality)

[Figure: fraction of users in % (y-axis, up to 30) for InPhO users and AMT users; x-axis from 0 to 1.0, annotated "Follow Experts" and "Own Opinion".]

InPhO Users are quite consistent.

AMT Users are not consistent. → Are there good ones?

Yes, there are! → But which ones?

Page 24: Crowdsourcing the Assembly of Concept Hierarchies

Mixed Results...

Can we just use the good ones?

Page 25: Crowdsourcing the Assembly of Concept Hierarchies

Telling the good from the bad

● First approach: Filtering by working time

● Hypothesis 1: Workers who take some time to think before they answer give better answers.

● Hypothesis 2: Probably there are workers who give quick random responses.
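
One way to probe both hypotheses is to keep only the workers whose average working time per HIT exceeds a threshold and to look at the mean deviation of the remaining workers. The sketch below assumes that per-worker average times and deviations are already available as dictionaries; the function name and all numbers are made up for illustration.

```python
from statistics import mean


def filter_by_working_time(avg_work_time_s, deviation, threshold_s):
    """Keep workers whose average time per HIT exceeds threshold_s.

    avg_work_time_s: {worker_id: average seconds per HIT}
    deviation: {worker_id: deviation from the experts}
    Returns (number of remaining workers, their mean deviation or None).
    """
    kept = [w for w, t in avg_work_time_s.items() if t > threshold_s]
    kept_deviations = [deviation[w] for w in kept if w in deviation]
    return len(kept), (mean(kept_deviations) if kept_deviations else None)


# Made-up workers, not data from the experiment.
times = {"w1": 60.0, "w2": 150.0, "w3": 400.0}
deviations = {"w1": 1.5, "w2": 1.0, "w3": 0.5}
for threshold in (80, 140, 200):
    print(threshold, filter_by_working_time(times, deviations, threshold))
```

If the hypotheses held, the mean deviation would drop steadily as the threshold rises; the price is that fewer and fewer workers remain.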

Page 26: Crowdsourcing the Assembly of Concept Hierarchies

Filtering by working time

[Figure: number of users (y-axis, 0 to 100) whose average working time for one HIT (12 pairs) exceeds a threshold (x-axis, >80 s to >800 s). The plotted counts fall from 84 to 3 as the threshold rises.]

Page 27: Crowdsourcing the Assembly of Concept Hierarchies

Filtering by working time

[Figure: number of users (right y-axis, 0 to 100) and deviation from the experts (left y-axis, 0 to 1.5) against the average working time for one HIT (12 pairs) (x-axis, >80 s to >800 s). The user counts fall from 84 to 3; the plotted deviations lie between 1.06 and 1.48, except for the last value of 0.64.]

Page 28: Crowdsourcing the Assembly of Concept Hierarchies

Telling the good from the bad

● Second approach: Filtering by comparison with a hidden gold standard.

● Test pairs:

● Social Epistemology – Epistemology (P1)

● Computer Ethics – Ethics (P2)

● Chinese Room Argument – Chinese Philosophy (P3)

● Dualism – Philosophy of Mind (P4)

Page 29: Crowdsourcing the Assembly of Concept Hierarchies

Applying filters

● Test pairs:

● Social Epistemology – Epistemology (P1)

● Computer Ethics – Ethics (P2)

● Chinese Room Argument – Chinese Philosophy (P3)

● Dualism – Philosophy of Mind (P4)

● Filters:

1) P1 and P2 are correct (Common Sense)

2) Like 1), additionally P4 is correct (+Background)

3) Like 1), additionally P3 is correct (+Lexical)

4) All have to be correct (All)
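
The four filters are subset tests on the test pairs a worker answered correctly. A minimal sketch, assuming the correctness of each worker's answers on P1 to P4 has already been determined (the slides do not spell out the expected answers); `FILTERS`, `passing_workers` and the toy workers are hypothetical.

```python
# Test pairs each filter requires to be answered correctly.
FILTERS = {
    "Common Sense (1)": {"P1", "P2"},
    "+Background (2)": {"P1", "P2", "P4"},
    "+Lexical (3)": {"P1", "P2", "P3"},
    "All (4)": {"P1", "P2", "P3", "P4"},
}


def passing_workers(correct_pairs, filter_name):
    """Return the sorted ids of workers who pass the given filter.

    correct_pairs: {worker_id: set of test pairs answered correctly}
    """
    required = FILTERS[filter_name]
    return sorted(w for w, ok in correct_pairs.items() if required <= ok)


# Made-up workers for illustration.
toy_workers = {
    "w1": {"P1", "P2"},
    "w2": {"P1", "P2", "P4"},
    "w3": {"P1", "P2", "P3", "P4"},
    "w4": {"P1"},
}
for name in FILTERS:
    print(name, passing_workers(toy_workers, name))
```

Filters 2 and 3 both extend filter 1 but are independent of each other, so a worker can pass "+Background" without passing "+Lexical" and vice versa; only filter 4 requires both.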

Page 30: Crowdsourcing the Assembly of Concept Hierarchies

Filter results for relatedness

Filter              Users   Deviation   Max. Dev.
All (4)                 7        0.60        1.00
+Lexical (3)           10        0.87        1.78
+Background (2)        23        0.84        1.41
Common Sense (1)       40        1.11        1.96
All AMT                87        1.39        2.96
All InPhO              25        0.77        1.75
Random                ---        1.8          ---

Page 31: Crowdsourcing the Assembly of Concept Hierarchies

Filter results for relative generality

Filter              Users     Deviation   Max. Dev.
All (4)              7 (5)         0.12        0.22
+Lexical (3)        10 (8)         0.14        0.27
+Background (2)     23 (20)        0.15        0.45
Common Sense (1)    40 (35)        0.21        0.59
All AMT             87 (78)        0.45        1.00
All InPhO           25             0.23        0.47
Random             ---             0.75         ---

Page 32: Crowdsourcing the Assembly of Concept Hierarchies

Financial considerations

Filter              Pairs   Evaluations   Cost per Pair   Cost per Evaluation
---                 1,138         5,690       US$ 0.111             US$ 0.022
Common Sense (1)    1,074         1,909       US$ 0.117             US$ 0.066
+Background (2)     1,018         1,558       US$ 0.124             US$ 0.081
+Lexical (3)          215           215       US$ 0.586             US$ 0.586
All (4)               183           183       US$ 0.689             US$ 0.689

● Overall payments: US$ 126

● Estimated cost for all pairs with filter "All (4)": US$ 784

● Estimated cost for all pairs with redundancy (5x): US$ 3,920
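
The figures in the table are consistent with dividing the overall payment of US$ 126 by the number of pairs (or evaluations) that survive each filter, and the two estimates with scaling the "All (4)" cost per pair up to all 1,138 pairs. The sketch below reproduces that arithmetic; reading the table this way is an assumption inferred from the numbers, and the function names are hypothetical.

```python
TOTAL_PAYMENT_USD = 126  # overall payments

# filter name: (pairs covered, evaluations kept), copied from the table
ROWS = {
    "---": (1138, 5690),
    "Common Sense (1)": (1074, 1909),
    "+Background (2)": (1018, 1558),
    "+Lexical (3)": (215, 215),
    "All (4)": (183, 183),
}


def cost_table(total=TOTAL_PAYMENT_USD, rows=ROWS):
    """Return {filter: (cost per pair, cost per evaluation)}, rounded to 3 decimals."""
    return {
        name: (round(total / pairs, 3), round(total / evaluations, 3))
        for name, (pairs, evaluations) in rows.items()
    }


def estimate_all_pairs(filter_name, redundancy=1, total=TOTAL_PAYMENT_USD, rows=ROWS):
    """Estimated cost of covering every pair at the filter's cost per pair."""
    all_pairs = rows["---"][0]
    cost_per_pair = round(total / rows[filter_name][0], 3)
    return round(cost_per_pair * all_pairs) * redundancy


def coverage_percent(filter_name, rows=ROWS):
    """Share of all pairs that keep at least one evaluation after filtering."""
    return round(100 * rows[filter_name][0] / rows["---"][0])


print(cost_table()["All (4)"])                      # (0.689, 0.689)
print(estimate_all_pairs("All (4)"))                # 784
print(estimate_all_pairs("All (4)", redundancy=5))  # 3920
print(coverage_percent("+Background (2)"))          # 89
```

The stricter the filter, the fewer evaluations remain to carry the fixed payment, which is why the cost per pair rises from about 11 cents without filtering to about 69 cents with all four test pairs required.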

Page 33: Crowdsourcing the Assembly of Concept Hierarchies

Conclusion

AMT answers are of varying quality. But this is true for many communities, too.

With moderate filtering ("+Background"), we achieved a quality comparable to the InPhO community.

With 5 evaluations per pair, we still covered 89% of all pairs with this filter.

The resulting InPhO taxonomy is online: http://inpho.cogs.indiana.edu/amt_taxonomy

No need for existing data, gold standards or training data (besides the filter pairs).

No need for a community?

Page 34: Crowdsourcing the Assembly of Concept Hierarchies

"Computer ethics doesn't exist. Blue is black and red is blood on the internet. Nobody cares, because they are lonely."

Anonymous Mechanical Turk Worker

Thank you

Questions?

Kai Eckert
[email protected]
http://www.slideshare.net/kaiec

Page 35: Crowdsourcing the Assembly of Concept Hierarchies

Photo Credits

● Michal Zacharzewski (Title Crowd), http://www.sxc.hu/profile/mzacha

● Peter Suneson (Crowd silhouette), http://www.sxc.hu/profile/CMSeter

● Alaa Hamed (Egyptian Coins), http://www.sxc.hu/profile/alaasafei

● Piotr Lewandowski (Money), http://www.sxc.hu/profile/LeWy2005

● Asif Akbar (Clock), http://www.sxc.hu/profile/asifthebes

● Zern Liew (Traffic Cone), http://www.sxc.hu/profile/eidesign

● Peter Gustafson (Counting Fingers), http://www.sxc.hu/profile/liaj

● Kostya Kisleyko (Yes No), http://www.sxc.hu/profile/dlnny

● Sergio Roberto Bichara (Barcode), http://www.sxc.hu/profile/srbichara

● Maggie Molloy (Icons), http://www.sxc.hu/profile/agthabrown

● Sanja Gjenero (World with Crowd), http://www.sxc.hu/profile/lusi

● Wikimedia Commons (The Turk), http://en.wikipedia.org/wiki/File:Kempelen_chess1.jpg