ucsd / dbmi seminar 2015-02-6

Crowdsourcing and Citizen Science for Biology Andrew Su, Ph.D. @andrewsu [email protected] http://sulab.org February 6, 2015 UCSD Slides: slideshare.net/andrewsu

Upload: andrew-su

Post on 17-Jul-2015


TRANSCRIPT

Crowdsourcing and Citizen Science for Biology

Andrew Su, Ph.D.

@andrewsu

[email protected]

http://sulab.org

February 6, 2015

UCSD

Slides: slideshare.net/andrewsu

Few genes are well annotated…

[Figure: GO annotation counts per gene for 20,473 protein-coding genes, with genes sorted by decreasing counts. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Callouts: 41%, 65%.]

Data: NCBI, February 2013

… because the literature is sparsely curated?

[Figure: Number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000]

… because the literature is sparsely curated?

[Figure: Average capacity of human scientist, 1983 to 2013; y-axis from 0 to 40]

311,696 articles (1.5% of PubMed) have been cited by GO annotations

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

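The Long Tail is a prolific source of content

[Figure: Content produced versus Contributors (sorted), with the Short Head and the Long Tail marked]

News: Newspapers (Short Head), Blogs (Long Tail)

Video: TV/Hollywood (Short Head), YouTube (Long Tail)

Product reviews: Consumer reports (Short Head), Amazon reviews (Long Tail)

Food reviews: Food critics (Short Head), Yelp (Long Tail)

Talent judging: Olympics (Short Head), American Idol (Long Tail)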

Wikipedia is reasonably accurate

Wikipedia has breadth and depth

[Figure: Articles and Words (millions), Wikipedia versus Britannica Online]

http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008

We can harness the Long Tail of scientists to directly participate in the gene annotation process.

From crowdsourcing to structured data

The Gene Wiki

Mark2Cure

Filtering, extracting, and summarizing PubMed

[Diagram labels: Documents, Concepts, Review article]

Wiki success depends on a positive feedback

[Diagram: cycle linking Gene wiki page utility, Number of users, and Number of contributors]

1001

2002

10,000 gene “stubs” within Wikipedia

[Figure: annotated Gene Wiki stub page. Callouts: Protein structure; Symbols and identifiers; Tissue expression pattern; Gene Ontology annotations; Links to structured databases; Gene summary; Protein interactions; Linked references]

Huss, PLoS Biol, 2008

[Feedback cycle: Utility, Users, Contributors]

Gene Wiki has a critical mass of readers

Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011

[Feedback cycle: Utility, Users, Contributors]

Gene Wiki has a critical mass of editors

Increase of ~10,000 words / month from >1,000 edits

Currently 1.42 million words

Approximately equal to 230 full-length articles

[Figure: Editor count and Edit count; series: Editors, Edits]

Good, NAR, 2011

[Feedback cycle: Utility, Users, Contributors]

A review article for every gene is powerful

References to the literature

Hyperlinks to related concepts

Reelin: 98 editors, 703 edits since July 2002

Heparin: 358 editors, 654 edits since June 2003

AMPK: 109 editors, 203 edits since March 2004

RNAi: 394 editors, 994 edits since October 2002

Making the Gene Wiki more computable

[Diagram, built up over several slides. Labels: Free text, Text-mining, Structured annotations, Analyses, Databases]

http://fiehnlab.ucdavis.edu/projects/rice_metabolome/

Wikidata

Provide a database of the world’s knowledge that anyone can edit

- Denny Vrandečić

Centralizing key data storage

Source: http://commons.wikimedia.org/wiki/File:Wikidata_slides_Magnus_Manske,_Cambridge,_2014-02-27.pdf

287 language editions of Wikipedia

Bioinformatics community

Loading biological data into Wikidata

Entrez Gene, Ensembl, UniProt, UCSC, PDB, RefSeq

Wikidata for biology

Reelin (Q414043)

is a (Property:P31): Protein (Q8054), Glycoprotein (Q187126)

regulates (Property:P128): Neural development (Q1345738)

interacts with (Property:P129): VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)

http://www.wikidata.org/wiki/Q414043
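The Reelin example above can be read as a set of subject-property-object triples. The following is a minimal Python sketch of that reading, not the project's actual data model: the `Statement` type and the `describe` helper are hypothetical names introduced here, the IDs and labels are copied from the slide, and the pairing of each property with its values is inferred from the slide layout rather than fetched from Wikidata.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One subject-property-object triple (hypothetical helper type)."""
    subject: str  # Wikidata item ID of the subject
    prop: str     # Wikidata property ID
    obj: str      # Wikidata item ID of the object


# Labels exactly as they appear on the slide.
LABELS = {
    "Q414043": "Reelin",
    "Q8054": "Protein",
    "Q187126": "Glycoprotein",
    "Q1345738": "Neural development",
    "Q1979313": "VLDL receptor",
    "Q423510": "Amyloid precursor protein",
    "P31": "is a",
    "P128": "regulates",
    "P129": "interacts with",
}

# Assumed pairing of properties and values, read off the slide layout.
REELIN = [
    Statement("Q414043", "P31", "Q8054"),
    Statement("Q414043", "P31", "Q187126"),
    Statement("Q414043", "P128", "Q1345738"),
    Statement("Q414043", "P129", "Q1979313"),
    Statement("Q414043", "P129", "Q423510"),
]


def describe(stmt: Statement) -> str:
    """Render a triple using the human-readable labels."""
    return f"{LABELS[stmt.subject]} {LABELS[stmt.prop]} {LABELS[stmt.obj]}"


if __name__ == "__main__":
    for stmt in REELIN:
        print(describe(stmt))
```

Storing only identifiers in the triples and keeping labels in a separate lookup mirrors the idea that the same statement can be rendered in any language edition.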

Wikidata for biology

[Same graph shown with identifiers only. Item: Q414043. Properties: Property:P31, Property:P128, Property:P129. Values: Q8054, Q187126, Q1345738, Q1979313, Q423510]

http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
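The API call above can be assembled and its response inspected with the Python standard library. This is a sketch under stated assumptions: the endpoint and the `action`, `ids`, and `languages` parameters are taken from the slide, `format=json` is added here so the response is machine-readable, and the helper names `entity_url`, `claim_targets`, and `fetch` are hypothetical. `claim_targets` only handles item-valued claims and assumes the usual `entities` / `claims` / `mainsnak` / `datavalue` layout of the JSON response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "http://wikidata.org/w/api.php"  # endpoint as written on the slide


def entity_url(item_id, language="en"):
    """Build the wbgetentities URL for one item."""
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "languages": language,
        "format": "json",  # assumption: not on the slide
    }
    return API + "?" + urlencode(params)


def claim_targets(response, item_id, prop):
    """Return the item IDs that `item_id` points to via property `prop`."""
    claims = response["entities"][item_id].get("claims", {}).get(prop, [])
    targets = []
    for claim in claims:
        value = claim["mainsnak"].get("datavalue", {}).get("value")
        # Item-valued claims carry a numeric ID; other value types are skipped.
        if isinstance(value, dict) and "numeric-id" in value:
            targets.append("Q%d" % value["numeric-id"])
    return targets


def fetch(item_id):
    """Download and decode the entity record (requires network access)."""
    with urlopen(entity_url(item_id)) as resp:
        return json.load(resp)
```

For example, `claim_targets(fetch("Q414043"), "Q414043", "P31")` would list the items Reelin is declared an instance of, whatever the live record holds at the time of the call.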

Current progress

• All human and mouse genes and proteins loaded

• All diseases (Human Disease Ontology) loaded

• Dataset of all drugs in preparation

• Datasets for gene-disease, drug-disease, and drug-protein relationships in preparation

The Long Tail of scientists is a valuable source of information on gene function

From crowdsourcing to structured data

The Gene Wiki

Mark2Cure

The biomedical literature is growing fast…

[Figure: Number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000]

… but it is very hard to query and compute

Imatinib, Crizotinib, Erlotinib, Gefitinib, Sorafenib, Lapatinib, Dasatinib

AND

Acute myeloid leukemia, Acute lymphoblastic leukemia, Chronic myelogenous leukemia, Chronic lymphocytic leukemia, Hodgkin lymphoma, Non-Hodgkin lymphoma, Myeloma
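To make the query problem concrete, the two lists above can be combined into a single Boolean search string. The sketch below assumes the intended query is "any of the drugs AND any of the diseases", so terms are joined with OR inside each list; that OR is an assumption, since the slide only shows the AND. The helper names `or_group` and `build_query` are hypothetical.

```python
DRUGS = [
    "Imatinib", "Crizotinib", "Erlotinib", "Gefitinib",
    "Sorafenib", "Lapatinib", "Dasatinib",
]

DISEASES = [
    "Acute myeloid leukemia", "Acute lymphoblastic leukemia",
    "Chronic myelogenous leukemia", "Chronic lymphocytic leukemia",
    "Hodgkin lymphoma", "Non-Hodgkin lymphoma", "Myeloma",
]


def or_group(terms):
    """Join terms with OR, quoting multi-word phrases."""
    quoted = ['"%s"' % t if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(drugs, diseases):
    """Combine the two groups with AND, as on the slide."""
    return or_group(drugs) + " AND " + or_group(diseases)


if __name__ == "__main__":
    print(build_query(DRUGS, DISEASES))
    # Number of distinct drug-disease pairs the single query stands for.
    print(len(DRUGS) * len(DISEASES))
```

Even this small example covers 49 drug-disease pairs with one string, and a keyword match still says nothing about how the drug and the disease are related in each article.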

Information Extraction

1. Find mentions of high level concepts in text

2. Map mentions to specific terms in ontologies

3. Identify relationships between concepts
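The three steps above can be illustrated with a deliberately naive Python sketch. Everything in it is an assumption made for illustration: the lexicon is a hypothetical five-entry dictionary, the `EX:` identifiers are placeholders rather than real ontology terms, and step 3 uses plain co-occurrence, which is far weaker than real relationship extraction.

```python
import re

# Hypothetical mini-lexicon: surface form -> (concept type, placeholder ID).
LEXICON = {
    "cdk9": ("GENE", "EX:G1"),
    "acute myeloid leukemia": ("DISEASE", "EX:D1"),
    "aml": ("DISEASE", "EX:D1"),
    "chronic lymphocytic leukemia": ("DISEASE", "EX:D2"),
    "cll": ("DISEASE", "EX:D2"),
}


def find_mentions(text):
    """Step 1: return (start, end, surface) for every lexicon hit."""
    hits = []
    for surface in LEXICON:
        pattern = r"\b%s\b" % re.escape(surface)
        for m in re.finditer(pattern, text, re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0)))
    return sorted(hits)


def normalize(mention):
    """Step 2: map a mention to its (concept type, ID) pair."""
    return LEXICON[mention[2].lower()]


def relate(text):
    """Step 3 (naive): pair every gene with every disease in the same text."""
    concepts = {normalize(m) for m in find_mentions(text)}
    genes = sorted(c for c in concepts if c[0] == "GENE")
    diseases = sorted(c for c in concepts if c[0] == "DISEASE")
    return [(g[1], "co-occurs with", d[1]) for g in genes for d in diseases]
```

Note how the abbreviation and the full name map to the same placeholder ID in step 2, which is what lets step 3 report one relationship instead of two.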

Disease mentions in PubMed abstracts

NCBI Disease corpus

• 793 PubMed abstracts (100 development, 593 training, 100 test)

• 12 expert annotators (2 annotate each abstract)

6,900 “disease” mentions

Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.

Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?

The Mechanical Turk

http://en.wikipedia.org/wiki/The_Turk

Amazon Mechanical Turk (AMT)

Requester

For each task, specify:

• a qualification test

• how many workers per task

• how much we will pay per task

Amazon

Manages:

• parallel execution of jobs

• worker access to tasks via qualification tests

• payments

• task advertising

Workers

1. Create tasks

2. Execute

3. Aggregate

Instructions to workers

• Highlight all diseases and disease abbreviations

– “...are associated with Huntington disease ( HD )... HD patients received...”

– “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”

• Highlight the longest span of text specific to a disease

– “... contains the insulin-dependent diabetes mellitus locus …”

• Highlight disease conjunctions as single, long spans.

– “... a significant fraction of familial breast and ovarian cancer , but undergoes…”

• Highlight symptoms - physical results of having a disease

– “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”

Qualification test

Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) in trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”

Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.”

Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”

26 yes / no questions

Qualification test results

[Figure labels: Workers, Threshold for passing, qualified workers]

33/194 passed (17%)

Simple annotation interface

[Screenshot callouts: Click to see instructions; Highlight disease mentions]

Experimental design

• Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus

– $0.06 per Human Intelligence Task (HIT)

– HIT = annotate one abstract from PubMed

– 5 workers annotate each abstract

Aggregation function based on simple voting

[Figure: the abstract below shown in four panels, highlighted at vote thresholds K=1 (1 or more votes), K=2, K=3, and K=4; color scale from 0 to 5 votes]

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
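The voting threshold K on this slide can be sketched in a few lines of Python. This is an illustration under assumptions, not the talk's implementation: it counts votes per character and keeps every run of characters that at least K workers highlighted, whereas the actual aggregation may have operated on whole spans. The function name `aggregate` and the `(start, end)` span convention (end exclusive) are choices made here.

```python
def aggregate(worker_spans, k, length):
    """Keep text regions highlighted by at least k workers.

    worker_spans: one list of (start, end) spans per worker, end exclusive.
    length: number of characters in the annotated text.
    Returns merged (start, end) spans that meet the vote threshold.
    """
    votes = [0] * length
    for spans in worker_spans:
        covered = set()
        for start, end in spans:
            covered.update(range(start, end))
        for i in covered:  # each worker votes at most once per character
            votes[i] += 1

    result, start = [], None
    for i, v in enumerate(votes + [0]):  # trailing 0 closes an open span
        if v >= k and start is None:
            start = i
        elif v < k and start is not None:
            result.append((start, i))
            start = None
    return result
```

With three workers who marked `[(0, 5)]`, `[(0, 5), (10, 15)]`, and `[(2, 5)]`, raising K from 1 to 3 shrinks the output from two spans to the small core that all three agree on, which is the precision-for-recall trade the K panels illustrate.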

Comparison to gold standard

F = 0.81, k = 2

• 593 documents

• 5 users / doc

• 7 days

• $192.90

[Figure axes: Precision, Recall]

Comparison to gold standard

F = 0.87, k = 6

• 593 documents

• 15 users / doc

• 9 days

• $630.96

[Figure axes: Precision, Recall]
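The F values on these slides combine precision and recall. The sketch below assumes the standard balanced F-measure, F = 2PR / (P + R), and assumes that a predicted span counts as correct only if it matches a gold-standard span exactly; the slides do not state the matching criterion, so treat both as assumptions. The function name `prf` is hypothetical.

```python
def prf(predicted, gold):
    """Return (precision, recall, F) for exact-match span comparison.

    predicted, gold: iterables of hashable spans, e.g. (start, end) tuples.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

Sweeping the vote threshold K and calling a function like this at each value is one way to trace out a precision-recall curve and pick the K with the highest F.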

Comparison to gold standard

[Figure: Maximum F-score (y-axis, 0 to 1) versus Workers per document (x-axis, 0 to 16)]

Comparisons to text-mining algorithms

[Figure: F score for text-mining tools (BANNER, NCBO Annotator) versus Mechanical Turk]

Comparisons to human annotators

Average level of agreement between expert annotators (stage 1): F = 0.76

Average level of agreement between expert annotators (stage 2): F = 0.87

In aggregate, our worker ensemble is faster, cheaper and as accurate as a single expert annotator for disease concept recognition.

Information Extraction

1. Find mentions of high level concepts in text

2. Map mentions to specific terms in ontologies

3. Identify relationships between concepts

Annotating the relationships

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.

[Annotation labels: subject (GENE), predicate (therapeutic target), object (DISEASE)]

Does Mechanical Turk scale?

1,000,000 articles per year × 10 annotators / article × 4 tasks / doc × $0.06 / task = $2,400,000 / year
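The estimate above is a straight product of the four quantities on the slide. A minimal Python check, with the price held in whole cents to avoid floating-point rounding, is sketched below; the function name `annual_cost_usd` is hypothetical, and the calculation covers worker payments only, since the slide lists no other costs.

```python
def annual_cost_usd(articles_per_year, annotators_per_article,
                    tasks_per_doc, cents_per_task):
    """Yearly worker payments in dollars, computed in integer cents."""
    total_cents = (articles_per_year * annotators_per_article
                   * tasks_per_doc * cents_per_task)
    return total_cents / 100


if __name__ == "__main__":
    # Values from the slide: 1,000,000 articles, 10 annotators, 4 tasks, $0.06.
    print(annual_cost_usd(1_000_000, 10, 4, 6))
```

Because the cost is linear in every factor, halving the number of annotators per article or the number of tasks per document halves the yearly bill.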

http://mark2cure.org

Key stats

• Launched Jan 19, 2015

• In 2.5 weeks

– 1984 document annotations

– 80 unique users

– 22% complete

[Figure: Document annotations]

The Long Tail of citizen scientists can collaboratively annotate biomedical text.

Gene Wiki / Wikidata

Ben Good
Andra Waagmeester
Lynn Schriml, U Maryland
Elvira Mitraka, U Maryland
Gang Fu, NCBI
Evan Bolton, NCBI
Paul Pavlidis, U British Columbia
Peter Robinson, Charite
Many Wikipedia and Wikidata editors
WP:MCB Project

Other Group members

Ramya Gamini
Louis Gioia
Salvatore Loguercio
Adam Mark
Erick Scott
Greg Stupp
Kevin Xin

Mark2Cure

Ben Good
Max Nanis
Ginger Tsueng
Chunlei Wu
Next slide!

Funding and Support

BioGPS: GM83924
Gene Wiki: GM089820
BD2K COE: GM114833

Contact

http://sulab.org
[email protected]
@andrewsu
+Andrew Su

Why do I Mark2Cure?

I am retired, have a doctorate in medical humanities, and have two children with Gaucher disease. I am just looking for some way to put my education to use. Sounds like a perfect situation for me.

My 4 year old daughter Phoebe is living with and battling rare disease.

I have Ehlers Danlos Syndrome. I hope to help people learn about this painful and debilitating disorder, so that others like me can receive more effective medical care.

Take part in something that helps humanity.

I Mark2Cure in memory of my son Mike who had type 1 diabetes.

Studied biology in college and I really miss it!

In memory of my daughter who had Cystic Fibrosis

Give back

Worker demographics: gender

First HIT was a survey

Age

Occupation

Education

Why?