TRANSCRIPT
Data and text mining workshop: The role of crowdsourcing
Anna Noel-Storr
Wellcome Trust, London, Friday 6th March 2015
What is crowdsourcing?
“…the practice of obtaining needed services, ideas, or content by soliciting contributions
from a large group of people, and especially from an online community, rather than from traditional employees…”
Image credit: DesignCareer
What is crowdsourcing?
Brabham's problem-focused crowdsourcing typology: 4 types
- Knowledge discovery and management
- Broadcast search
- Peer-vetted creative production
- Distributed human intelligence tasking
Micro-tasking: process
Breaking down a large corpus of data into smaller units and distributing those units to a large online crowd
“the distribution of small parts of a problem”
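To make the idea concrete, here is a minimal sketch of how a large set of citations might be broken into small screening batches and handed out to a crowd. This is not the project's actual implementation: the batch size, the `Citation` structure and the round-robin assignment are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Citation:
    """One record from a bibliographic search (illustrative structure)."""
    record_id: str
    title: str
    abstract: str

def make_batches(citations: list[Citation], batch_size: int = 25) -> Iterator[list[Citation]]:
    """Split a large corpus into small units ('micro-tasks') of batch_size records."""
    for start in range(0, len(citations), batch_size):
        yield citations[start:start + batch_size]

def assign_batches(batches: Iterable[list[Citation]],
                   screener_ids: list[str]) -> dict[str, list[list[Citation]]]:
    """Distribute the batches across a crowd of screeners, round-robin."""
    assignments: dict[str, list[list[Citation]]] = {sid: [] for sid in screener_ids}
    for i, batch in enumerate(batches):
        assignments[screener_ids[i % len(screener_ids)]].append(batch)
    return assignments

# Example: 1,000 citations and 40 crowd members -> 40 batches of 25 records each
corpus = [Citation(f"rec-{i}", f"Title {i}", "…") for i in range(1000)]
work = assign_batches(make_batches(corpus), [f"screener-{n}" for n in range(40)])
print(sum(len(batch) for batch in work["screener-0"]))  # records given to the first screener
```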
Human computation
Humans remain better than machines at certain tasks: e.g. identifying pizza toppings from a picture of a pizza, or recognising that a title such as "preventing obesity without eating like a rabbit" is not an animal study, even though an automatic tagger might label it 'Animal study'.
Tools and platforms
What platforms and tools exist and how do they work?
Image credit: ThinkStock
The Zooniverse
“each project uses the efforts and ability of volunteers to help scientists and researchers deal with the flood of data that confronts them”
Classification and annotation
Galaxy Zoo
Operation War Diary
Health-related evidence production
Can we use crowdsourcing to identify the evidence in a more timely way?
- A known pressure point within review production
- Between 2,000 and 5,000 citations per new review, but can be much more
- A not much loved task
Trial identification
The Embase project
Cochrane's Central Register of Controlled Trials: CENTRAL
Step 1: Run a very sensitive search for studies in the largest biomedical database (Embase)
Step 2: Use a crowd to screen thousands of search results from Embase and feed the identified reports of RCTs into CENTRAL
How will the crowd do this?
The screening tool
- Three choices
- You are not alone! (and you can't go back)
- Progress bar
- Yellow highlights to indicate a likely RCT
- Red highlights
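The highlighting lives inside the web tool itself; the sketch below only illustrates the general idea of marking up terms for screeners. The term lists, the meaning given to the red highlights (terms that might argue against inclusion) and the HTML mark-up are assumptions for illustration, not the tool's real rules.

```python
import re

# Illustrative term lists (assumptions): words that often signal an RCT,
# and words that often signal a non-randomised design.
LIKELY_RCT_TERMS = ["randomised", "randomized", "placebo", "double-blind", "allocation"]
CAUTION_TERMS = ["cohort", "retrospective", "case report"]

def highlight(text: str) -> str:
    """Wrap likely-RCT terms in a 'yellow' span and cautionary terms in a 'red' span."""
    def wrap(terms: list[str], css_class: str, s: str) -> str:
        pattern = r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b"
        return re.sub(pattern, rf'<span class="{css_class}">\1</span>', s, flags=re.IGNORECASE)
    return wrap(CAUTION_TERMS, "red", wrap(LIKELY_RCT_TERMS, "yellow", text))

print(highlight("A randomised, double-blind, placebo-controlled trial of memantine"))
```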
The Embase project: recruitment
- 900+ people have signed up to screen citations in 12 months
- 110,000+ citations have been collectively screened
- 4,000 RCTs/q-RCTs identified by the crowd
(Chart: number of participants signed up, month by month from Feb 2014 to Mar 2015, rising to around 900)
Why do people do it?
- Made it very easy to participate (and equally easy to stop!)
- Gain experience (bulk up the CV)
- Provide feedback, both to the individual and to the community (people are more likely to come back)
- Wanting to do something to contribute (healthcare is a strong hook)
(Workflow: each citation is screened by several crowd members; three 'RCT' decisions send it to CENTRAL, three 'Reject' decisions send it to the bin, and 'Unsure' or conflicting decisions go to a resolver)
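A minimal sketch of an agreement rule along these lines, assuming each citation collects three crowd decisions as in the diagram; the routing labels and function name are illustrative, not taken from the project's code.

```python
from collections import Counter

REQUIRED_SCREENERS = 3  # each citation is shown to three crowd members (per the diagram)

def route(decisions: list[str]) -> str:
    """Route a citation based on its crowd decisions ('RCT', 'Reject', 'Unsure')."""
    if len(decisions) < REQUIRED_SCREENERS:
        return "pending"                      # still waiting for more screeners
    counts = Counter(decisions)
    if counts["RCT"] == REQUIRED_SCREENERS:
        return "CENTRAL"                      # unanimous RCT -> into CENTRAL
    if counts["Reject"] == REQUIRED_SCREENERS:
        return "bin"                          # unanimous reject -> binned
    return "resolver"                         # any 'Unsure' or disagreement -> expert resolver

print(route(["RCT", "RCT", "RCT"]))            # CENTRAL
print(route(["Reject", "Reject", "Reject"]))   # bin
print(route(["RCT", "Unsure", "Reject"]))      # resolver
```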
How accurate is the crowd?
(Around 5% of screened citations go to the resolver)
Crowd accuracy
The crowd is the index test; the information specialist(s) provide the reference standard.

Validation 1: TP 1565, FP 9, FN 2, TN 2888 – Sensitivity: 99.9%, Specificity: 99.7%
(enriched sample; blinded to crowd decision; dual independent screeners as reference standard)

Validation 2: TP 415, FP 5, FN 1, TN 2649 – Sensitivity: 99.8%, Specificity: 99.8%
(enriched sample; blinded to crowd decision; single independent expert screener (me!) as reference standard; possibility of incorporation bias)
Individual screener accuracy is also carefully monitored
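For reference, the reported sensitivity and specificity follow directly from the confusion matrices above; the short calculation below reproduces them.

```python
def sens_spec(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

for name, counts in {"Validation 1": (1565, 9, 2, 2888),
                     "Validation 2": (415, 5, 1, 2649)}.items():
    sens, spec = sens_spec(*counts)
    print(f"{name}: sensitivity {sens:.1%}, specificity {spec:.1%}")
# Validation 1: sensitivity 99.9%, specificity 99.7%
# Validation 2: sensitivity 99.8%, specificity 99.8%
```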
How fast is the crowd?
(Chart: length of time to screen one month's worth of records – around 6 weeks in Jan 2014, 5 weeks in Jul 2014, and 2 weeks by Jan 2015)
More screeners, and more screeners screening more quickly
More of the same, and more tasks
As the crowd becomes more efficient, we plan to do two things:
1. Increase the databases we search – feed in more citations
2. Offer other 'micro-tasks'
Feed in more citations – from other databases
(Diagram: incoming citations are screened – Y or N, with N going to the bin – and can then be annotated and appraised)
And in these tasks the machine plays a vital and complementary role…
e.g. is the healthcare condition Alzheimer’s disease? Y, N, Unsure
Perfect partnership
Machine driven probability + Collective human decision-making
It's not one or the other; the ideal is both.
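One way to picture that partnership: a machine-derived probability decides which records need human eyes at all, and the crowd's collective decision settles the rest. The thresholds, routing labels and interface below are illustrative assumptions, not the project's actual pipeline.

```python
def triage(record_id: str, rct_probability: float,
           auto_reject_below: float = 0.02, auto_accept_above: float = 0.99) -> str:
    """Route a record using a machine-derived probability that it reports an RCT.

    Very unlikely records are rejected automatically, very likely ones can be
    fast-tracked, and everything in between goes to the crowd for collective
    human decision-making (thresholds here are purely illustrative).
    """
    if rct_probability < auto_reject_below:
        return "auto-reject"
    if rct_probability > auto_accept_above:
        return "fast-track to CENTRAL"
    return "send to crowd"

for rid, p in [("rec-1", 0.005), ("rec-2", 0.62), ("rec-3", 0.997)]:
    print(rid, triage(rid, p))
```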
In summary, crowdsourcing:
• Effective method for large-scale study identification
• Identify more studies, more quickly
• No compromise on quality or accuracy
• Offers meaningful ways to contribute
• Feasible to recruit a crowd
• Highly functional tool
• Complements data and text mining
And enables the move towards the living review