TRANSCRIPT
Crowdsourcing:(a bit of) theory and ((quite) some) practice
Karen Fort
enetCollect, September 7th, 2017
1 / 56
Where I'm talking from: Main research interests (http://karenfort.org/)
- Language resources creation for natural language processing
- Ethics in natural language processing (NLP)
2 / 56
Where I'm talking from: Related actions
Games (with a purpose) I participated in creating:
Language games portal and recurring workshop: Games4NLP
3 / 56
Crowdsourcing: back to basics
Games with a purpose (GWAPs)
GWAP-ing in practice: ZombiLingo
Conclusion
4 / 56
Crowdsourcing: back to basics
  Definition: Crowdsourcing
  Beyond the myths: "Crowdsourcing is recent"
  Beyond the myths: "Crowdsourcing implies a crowd"
  Beyond the myths: "Crowdsourcing implies non-experts"
Games with a purpose (GWAPs)
GWAP-ing in practice: ZombiLingo
Conclusion
5 / 56
Crowdsourcing
Crowdsourcing is "the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call." [Howe, 2006]
- no a priori identification or selection of the participants ("open call")
- massive (in production and participation)
- (relatively) cheap
6 / 56
Some remarkable achievements
Wikipedia (1) (September 2017):
- more than 45 million articles in 241 languages
- more than 8 million views per hour for the English version (2014; could not find more recent data)
Distributed Proofreaders (Gutenberg Project) (2):
- nearly 40,000 digitized and corrected books
JeuxDeMots [Lafourcade, 2007] (3):
- more than 150 million relations in the lexical network
- more than 2 million terms added by the players
1. http://stats.wikimedia.org/EN/Sitemap.htm
2. http://www.pgdp.net/c/stats/stats_central.php
3. http://www.jeuxdemots.org
9 / 56
A simplified taxonomy (more in [Geiger et al., 2011])
[figure: 2x2 taxonomy crossing remunerated vs. not remunerated with direct vs. indirect crowdsourcing]
10 / 56
Myth #1: "Crowdsourcing is recent"
Instructions pour les voyageurs et les employés des colonies (instructions for travellers and colonial employees)
Citizen science:
- published by the Muséum National d'Histoire Naturelle (Paris, France)
- for the travellers and the colonies' employees: "to share the results of their own experiences, to benefit themselves and the scientific world"
- first published in 1824
14 / 56
Other examples
- Ligue de Protection des Oiseaux (bird protection league) (4):
  - monitoring of bird populations
  - created more than a century ago
  - 5,000 active participants
- Longitude Prize (5) (1714):
  - awarded by the British government to whoever would invent a simple and practical method to determine a ship's longitude
  - still exists: in 2014 the theme was "Global antibiotics resistance"
4. https://www.lpo.fr
5. https://longitudeprize.org/
15 / 56
Myth #2: ”Crowdsourcing implies a crowd of participants”
[figure: number of points per player on Phrase Detectives, players ranked according to their score (February 2011 - February 2012)]
16 / 56
A crowd of participants? JeuxDeMots
[figure: number of points per player on JeuxDeMots, players ranked according to their score (source: http://www.jeuxdemots.org/generateRanking-4.php)]
17 / 56
A crowd of participants? ZombiLingo
18 / 56
A crowd of workers? [Fort et al., 2011]
Number of active Turkers on Amazon Mechanical Turk:
- number of workers registered on the website: more than 500,000
- 80% of the tasks (HITs) are performed by the 20% most active Turkers [Deneme, 2009]
⇒ really active workers: between 15,059 and 42,912
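This kind of 80/20 concentration can be checked directly on any contribution log; a minimal sketch with hypothetical per-worker task counts (not the actual AMT figures):

```python
def top_share(contributions, top_fraction=0.2):
    """Share of the total work done by the top `top_fraction` of contributors."""
    ranked = sorted(contributions, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

# Hypothetical, strongly skewed task counts per worker
counts = [1000, 800, 50, 30, 20, 10, 5, 5, 3, 2]
print(round(top_share(counts), 2))  # 0.94: the top 20% do ~94% of the work
```

With a heavy-tailed distribution like the one reported for AMT, the top fifth of workers dominates the output.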
19 / 56
Experts vs non-experts
Example of the annotation of named entities in a microbiology corpus:
- experts of the domain?
- of the corpus (microbiology)?
- of the application (NLP)?
→ experts of the task
21 / 56
Crowdsourcing
Using a crowd of ”non-experts”?
→ Finding/training experts (of the task) in the crowd
23 / 56
Crowdsourcing: back to basics
Games with a purpose (GWAPs)
  Using the (basic) knowledge of the crowd
  Using the basic education of the crowd
  Using the learning capabilities of the crowd
GWAP-ing in practice: ZombiLingo
Conclusion
24 / 56
Games with a purpose (GWAPs)
GWAPs make it possible to benefit from:
1. the knowledge of the "world"
2. the basic education
3. the learning capabilities
of the participants (seldom a crowd)
25 / 56
JeuxDeMots: playing association of ideas...
...to create a lexical network [Lafourcade and Joubert, 2008]
More than 154 million relations (created by approx. 6,000 players), constantly updated:
- play in pairs
- more and more complex, typed relations
- challenges
- lawsuits
- etc.
26 / 56
Phrase Detectives: playing detective...
...to annotate co-reference [Chamberlain et al., 2008]
3.5M decisions from 45k players:
- pre-annotated corpus
- detailed instructions
- training
- 2 different playing modes:
  - annotation
  - validation (correction of annotations)
27 / 56
FoldIt: playing protein folding...
...to solve scientific issues [Khatib et al., 2011]
Solution to the crystal structure of a monomeric retroviral protease (simian AIDS-causing monkey virus), an issue that had remained unsolved for over a decade:
- found in a couple of weeks
- by a team of players
- will allow for the creation of antiretroviral drugs
28 / 56
FoldIt: playing protein folding...
...without any prior knowledge of biochemistry [Cooper et al., 2010]
Step-by-step training:
- tutorial decomposed by concept
- puzzles for each concept
- access to the following puzzles is granted only if your level is sufficient
29 / 56
Crowdsourcing: back to basics
Games with a purpose (GWAPs)
GWAP-ing in practice: ZombiLingo
  Overview of the game
  Motivating players
  Behind the curtain
  Ensuring quality
  Results
Conclusion
30 / 56
A complex annotation task
- annotation guidelines:
  - 29 relation types
  - approx. 50 pages
- counter-intuitive decisions (not school grammar, but linguistics): aobj = au
  [...] avoir recours au type de mesures [...] ("to resort to the type of measures")
  i.e. the head of the PP is the preposition
→ decompose the complexity of the task [Fort et al., 2012], do not simplify it!
31 / 56
General features
Bring the fun through:
- zombie design
- use of (crazy) objects
- regular challenges (specific corpus and design) on a trendy topic:
  - Star Wars (when the movie was playing)
  - soccer (during the Euro)
  - Pokemon (well...)
36 / 56
Leaderboards (for achievers)
Criteria:
- number of annotations or points
- in total, during the month, during the challenge
37 / 56
Hidden features (for explorers)
- appearing randomly
- with different effects: objects, another game, etc.
38 / 56
Duels (for socializers (and killers?))
- select an enemy
- challenge them on a specific type of relation
39 / 56
Badges (?) (for collectors)
- play all the sentences for a relation type, for a corpus
- play all the sentences from a corpus
40 / 56
Organizing quality assurance
[diagram: quality-assurance pipeline. An unannotated corpus (Wikipedia) and a reference corpus (Sequoia) feed a training phase and a play phase; raw text is pre-annotated with 2 parsers; TRAINING and CONTROL items (with feedback) and EVAL items (no feedback) are mixed into the game, from which the player's confidence is derived]
41 / 56
Preprocessing data (freely available corpora)
42 / 56
Preprocessing data (freely available corpora)
Pre-annotation with two parsers
1. a statistical parser: Talismane [Urieli, 2013]
2. a symbolic parser, based on graph rewriting: FrDep-Parse [Guillaume and Perrier, 2015]
→ play the items for which the two parsers give different annotations
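This disagreement-based selection can be sketched as follows; the parse representation (token id mapped to a (head, label) pair) and the example analyses are invented for illustration:

```python
def disagreements(parse_a, parse_b):
    """Return the token ids where the two parsers assign different (head, label) pairs."""
    return [tok for tok in parse_a if parse_a[tok] != parse_b.get(tok)]

# Hypothetical analyses of one sentence: token id -> (head id, dependency label)
talismane = {1: (2, "suj"), 2: (0, "root"), 3: (2, "obj"), 4: (3, "det")}
frdep     = {1: (2, "suj"), 2: (0, "root"), 3: (2, "aobj"), 4: (3, "det")}
print(disagreements(talismane, frdep))  # [3]: only token 3 is sent to the players
```

Items on which both parsers agree are assumed reliable; the players' effort is spent where the automatic annotations conflict.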
43 / 56
Training, control and evaluation
Reference: 3,099 sentences of the Sequoia corpus [Candito and Seddah, 2012]

             REFTrain&Control   REFEval   Unused
Share        50%                25%       25%
Sentences    1,549              776       774

- REFTrain&Control is used to train the players
- REFEval is used like a raw corpus, to evaluate the produced annotations
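The 50/25/25 split can be sketched as a deterministic shuffle-and-slice; the seed and the exact slicing are assumptions (the slides only give the resulting sizes, 1,549 / 776 / 774, which integer slicing reproduces only approximately):

```python
import random

def split_corpus(sentences, seed=0):
    """Shuffle and split into train&control (50%), eval (25%) and unused (25%)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    half, three_quarters = n // 2, (3 * n) // 4
    return shuffled[:half], shuffled[half:three_quarters], shuffled[three_quarters:]

train_control, ref_eval, unused = split_corpus(list(range(3099)))
print(len(train_control), len(ref_eval), len(unused))  # 1549 775 775
```

Keeping a quarter of the reference unused leaves room for a future, untouched evaluation set.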
44 / 56
Training the players
Compulsory for each dependency relation:
- sentences are taken from the REFTrain&Control corpus
- feedback is given in case of error
45 / 56
Dealing with cognitive fatigue and long-term players: Control mechanism
Sentences from the REFTrain&Control corpus are proposed regularly:
1. if the player fails to find the right answer, feedback with the solution is given
2. after a given number of failures on the same relation, the player cannot play anymore and has to redo the corresponding training
→ we deduce a level of confidence for the player on this relation
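The confidence level can be estimated as the player's per-relation accuracy on control sentences; the slides do not give the actual formula, so this is only an illustrative sketch:

```python
def confidence(answers):
    """answers: list of (relation, is_correct) pairs from control sentences.
    Returns relation -> accuracy of the player on that relation's control items."""
    totals, correct = {}, {}
    for relation, ok in answers:
        totals[relation] = totals.get(relation, 0) + 1
        correct[relation] = correct.get(relation, 0) + int(ok)
    return {rel: correct[rel] / totals[rel] for rel in totals}

# Hypothetical control log for one player
log = [("aobj", True), ("aobj", True), ("aobj", False), ("suj", True)]
print(confidence(log))
```

Such a score can then weight the player's answers when aggregating annotations, or trigger retraining when it drops too low.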
48 / 56
Production: game corpus size, compared to other existing French dependency syntax corpora
As of July 10, 2016:
- 647 players (1,150 as of Sept. 5th, 2017)
- who produced 107,719 annotations (more than 390,000 as of Sept. 5th, 2017)

           Sequoia 7.0   UD-French 1.3          FTB-UC       FTB-SPMRL    Game
License    free          free                   not "free"   not "free"   free
Status     validated     after ZL (6) + errors  validated    validated    validated
Sent.      3,099         16,448                 12,351       18,535       5,221
Tok.       67,038        401,960                350,947      557,149      128,046

+ (ever-)growing resource!
6. ZL 1.0, July 2014 vs UD 1.0, January 2015.
49 / 56
Evaluating quality on the REFEval corpus
[figure: F-measure per dependency relation for Talismane, FrDep-Parse and the Game]
NB: left part of the figure = density of annotation > 1
50 / 56
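The F-measure in the figure compares the produced annotations to the REFEval reference; a minimal sketch over sets of dependency tuples (sentence id, governor, dependent, label), with invented example data:

```python
def f_measure(produced, reference):
    """F1 of a set of produced dependency annotations against a gold set."""
    tp = len(produced & reference)  # exact tuple matches
    precision = tp / len(produced) if produced else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: (sentence id, governor, dependent, label)
gold = {(1, 2, 1, "suj"), (1, 2, 3, "obj"), (1, 3, 4, "det")}
game = {(1, 2, 1, "suj"), (1, 2, 3, "aobj"), (1, 3, 4, "det")}
print(round(f_measure(game, gold), 2))  # 0.67
```

The same computation can be run per relation label to reproduce the per-relation bars of the figure.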
Annotation density on the REFEval corpus
[figure: number of answers per annotation, per dependency relation]
→ need more annotations on some relations
51 / 56
Next steps: Expand to new languages and new annotation types
New languages:
- English
- less-resourced languages
New annotation types:
- part-of-speech (POS)
- corpus building
- etc.
Alice Millour (PhD student)
52 / 56
Crowdsourcing: back to basics
Games with a purpose (GWAPs)
GWAP-ing in practice: ZombiLingo
Conclusion
53 / 56
Crowdsourcing for language resources creation
Achievements:
- surprisingly good results in terms of quantity and quality
- we demonstrated that we can train people on a complex task
Difficulties:
- motivating people → language learning could work (7)
- communication / advertisement
- gamification
7. There is no scientific evidence yet that it does.
54 / 56
https://github.com/zombilingo
http://zombilingo.org/export
55 / 56
ZombiLingo's team and funding
Bruno Guillaume (researcher)
Nicolas Lefebvre (engineer)
56 / 56
Appendix
  Amazon Mechanical Turk: a platform of legends
  Ethical issues
  Varying costs
  Complexity of the tasks
Bibliography
History: von Kempelen's Mechanical Turk
A mechanical chess player created by J. W. von Kempelen in 1770
In fact, a human chess master was hiding inside to operate the machine
It's "artificial artificial intelligence"!
And Amazon created AMT
Amazon created a microworking platform for its own needs and opened it to everyone in 2005
(taking 10% of the transactions; 20% now)
AMT: the dream come true for NLP?
[Snow et al., 2008]
It's very cheap, fast, good quality work, and it's a hobby for the Turkers!
Amazon Mechanical Turk (AMT)
MTurk is a crowdsourcing, microworking system: work is outsourced via the Web and done by many people (the crowd), here the Turkers; tasks are cut into small pieces (HITs) and their execution is paid for by the Requesters.
Is AMT ethical and/or legal?
Ethics:
- No identification: no relation between Requesters and Turkers, and none among Turkers
- No possibility to unionize, to protest against wrongdoings, or to go to court
- No minimum wage (< $2/hr on average)
- Possibility to refuse to pay the Turkers
AMT: a hobby for the Turkers?
[Ross et al., 2010, Ipeirotis, 2010] show that:
- Turkers are primarily financially motivated (91%):
  - 20% use AMT as their primary source of income;
  - 50% as their secondary source of income;
  - leisure is important for only a minority (30%).
- 20% of the Turkers spend more than 15 hours a week on AMT, and contribute 80% of the tasks.
- the observed mean hourly wage is below US$2.
[Gupta et al., 2014]: given that training the workers is impossible on AMT, an important part of the Turkers' work is hidden
Turkers are not anonymous [Lease et al., 2013]
The Turkers' ID corresponds to their Amazon client ID
AMT makes it possible to reduce costs
Very low wages ⇒ low costs? Yes, but...
- UI development
- creation of protections against spammers
- validation and post-processing costs
Some tasks (for example, in less-resourced languages) generate costs similar to the usual costs in the domain, due to a lack of qualified Turkers [Novotney and Callison-Burch, 2010].
Depending on an external platform
Impossible to control:
- the costs
- the participants' working conditions
- the selection of participants
- the conditions of the experiment
HITs (Human Intelligence Tasks): simplified tasks
AMT does not allow for the training of the workers:
- quality is not satisfactory for complex tasks (for example, summarizing [Gillick and Liu, 2010])
- for some simple tasks, NLP tools produce better results than AMT [Wais et al., 2010].
⇒ Simplification of the tasks:
- a real textual entailment annotation task (inference, neutral, contradiction) is reduced to 2 phases and 1 question: "Would most people say that if the first sentence is true, then the second sentence must be true?" [Bowman et al., 2015]
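This simplification can be sketched as a label mapping: the three-way decision collapses into the expected answer to the single yes/no question (the mapping itself is an assumption for illustration, using the label names from the slide):

```python
def expected_answer(label):
    """Map a 3-way entailment label to the yes/no question's expected answer.
    Assumed mapping: only 'inference' (entailment) yields 'yes'."""
    assert label in {"inference", "neutral", "contradiction"}
    return "yes" if label == "inference" else "no"

print(expected_answer("inference"))      # yes
print(expected_answer("contradiction"))  # no
```

The price of the simplification is visible here: "neutral" and "contradiction" become indistinguishable in the workers' answers.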
Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. (2015). A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
Candito, M. and Seddah, D. (2012). Le corpus Sequoia : annotation syntaxique et exploitation pour l'adaptation d'analyseur par pont lexical. In Proceedings of Traitement Automatique des Langues Naturelles (TALN), Grenoble, France.
Chamberlain, J., Poesio, M., and Kruschwitz, U. (2008). Phrase Detectives: a web-based collaborative annotation game. In Proceedings of the International Conference on Semantic Systems (I-Semantics'08), Graz, Austria.
Cooper, S., Treuille, A., Barbero, J., Leaver-Fay, A., Tuite, K., Khatib, F., Snyder, A. C., Beenen, M., Salesin, D., Baker, D., and Popovic, Z. (2010). The challenge of designing scientific discovery games. In Proceedings of the Fifth International Conference on the Foundations of Digital Games, FDG '10, pages 40-47, New York, NY, USA. ACM.
Deneme (2009). How many Turkers are there? http://groups.csail.mit.edu/uid/deneme/.
Fort, K., Adda, G., and Cohen, K. B. (2011). Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics (editorial), 37(2):413-420.
Fort, K., Nazarenko, A., and Rosset, S. (2012). Modeling the complexity of manual annotation tasks: a grid of analysis. In International Conference on Computational Linguistics (COLING), pages 895-910, Mumbai, India.
Geiger, D., Seedorf, S., Schulze, T., Nickerson, R. C., and Schader, M. (2011). Managing the crowd: Towards a taxonomy of crowdsourcing processes. In AMCIS 2011 Proceedings.
Gillick, D. and Liu, Y. (2010). Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, CSLDAMT '10, pages 148-151, Stroudsburg, PA, USA. Association for Computational Linguistics.
Guillaume, B. and Perrier, G. (2015). Dependency parsing with graph rewriting. In Proceedings of IWPT 2015, 14th International Conference on Parsing Technologies, pages 30-39, Bilbao, Spain.
Gupta, N., Martin, D., Hanrahan, B. V., and O'Neill, J. (2014). Turk-life in India. In Proceedings of the 18th International Conference on Supporting Group Work, GROUP '14, pages 1-11, New York, NY, USA. ACM.
Howe, J. (2006). The rise of crowdsourcing. Wired Magazine, 14(6).
Ipeirotis, P. (2010). The new demographics of Mechanical Turk. http://behind-the-enemy-lines.blogspot.com/2010/03/new-demographics-of-mechanical-turk.html.
Khatib, F., DiMaio, F., Cooper, S., Kazmierczyk, M., Gilski, M., Krzywda, S., Zabranska, H., Pichova, I., Thompson, J., Popovic, Z., et al. (2011). Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nature Structural & Molecular Biology, 18(10):1175-1177.
Lafourcade, M. (2007). Making people play for lexical acquisition. In 7th Symposium on Natural Language Processing (SNLP 2007), Pattaya, Thailand.
Lafourcade, M. and Joubert, A. (2008). JeuxDeMots : un prototype ludique pour l'émergence de relations entre termes. In Proceedings of the Journées internationales d'Analyse statistique des Données Textuelles (JADT), Lyon, France.
Lease, M., Hullman, J., Bigham, J. P., Bernstein, M. S., Kim, J., Lasecki, W., Bakhshi, S., Mitra, T., and Miller, R. C. (2013). Mechanical Turk is not anonymous. Technical report.
Novotney, S. and Callison-Burch, C. (2010). Cheap, fast and good enough: automatic speech recognition with non-expert transcription. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), HLT '10, pages 207-215, Stroudsburg, PA, USA. Association for Computational Linguistics.
Ross, J., Irani, L., Silberman, M. S., Zaldivar, A., and Tomlinson, B. (2010). Who are the crowdworkers?: shifting demographics in Mechanical Turk. In Proceedings of the 28th international conference extended abstracts on Human factors in computing systems, CHI EA '10, pages 2863-2872, New York, NY, USA. ACM.
Snow, R., O'Connor, B., Jurafsky, D., and Ng, A. Y. (2008). Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP 2008, pages 254-263.
Urieli, A. (2013). Robust French syntax analysis: reconciling statistical methods and linguistic knowledge in the Talismane toolkit. PhD thesis, Université de Toulouse II le Mirail, France.
Wais, P., Lingamneni, S., Cook, D., Fennell, J., Goldenberg, B., Lubarov, D., Marin, D., and Simons, H. (2010). Towards building a high-quality workforce with Mechanical Turk. In Proceedings of Computational Social Science and the Wisdom of Crowds (NIPS).