How will we efficiently understand the interactions of ~20,000 genes, with ~200 million potential pairwise interactions? Minimally, we need to use the information that exists

Upload: ophelia-jackson

Post on 12-Jan-2016


TRANSCRIPT

Page 1: How will we efficiently understand the interactions of ~20,000 genes, with ~200 million potential pairwise interactions? Minimally, we need to use the

How will we efficiently understand the interactions of ~20,000 genes, with ~200 million potential pairwise interactions?

Minimally, we need to use the information that exists
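The ~200 million figure follows from counting unordered pairs of distinct genes. A quick check, assuming exactly 20,000 genes (the slide gives only the approximate count, and `N_GENES` is a name introduced here for illustration):

```python
import math

N_GENES = 20_000  # assumed round number; the slide says "~20,000"

# Number of unordered pairs of distinct genes: C(n, 2) = n * (n - 1) / 2
n_pairs = math.comb(N_GENES, 2)
print(f"{n_pairs:,} potential pairwise interactions")
```

With this round number the count is 199,990,000, which is the "~200 million" quoted on the slide.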

Page 2:

June 1979: 2 relevant papers

S. Brenner (Genetics 1974) The genetics of Caenorhabditis elegans

J. Sulston & R. Horvitz (Developmental Biology 1977) Post-embryonic cell lineages of the nematode, Caenorhabditis elegans

Jan 2008: >200,000 relevant papers

Page 3:

Prioritizing high resolution genetic interaction tests by knowledge mining

1. Full text information retrieval: Hans-Michael Muller, Arun Rangarajan, Tracy Teal, Kimberly Van Auken, Juancarlos Chan

2. Predicting Gene Interactions from information available in public databases: Weiwei Zhong

Page 4:

Scientists spend more time skimming for information than reading papers.

Much of the information is detail hidden in the full text, neither stated in the abstract nor captured in MeSH terms.

We designed Textpresso to do automated skimming for researchers and database curators.

The output can be used for more sophisticated Natural Language Processing.

www.textpresso.org

Textpresso Literature Search Engine

Page 5:

Can we do better than PubMed and Google Scholar?

                 Full Text   Sentence   Ontology
PubMed           (-)         -          MeSH, Taxonomy
Google Scholar   +           -          -
Textpresso       +           +          Gene Ontology, Customized, Neuroscience Information Framework

Page 6:

Categories are “bags of words”

GENE: FOXO, HOXA1, pax2, PKD1

PATHWAY: precursor, upstream, cascade, descendants

Drosophila anatomy: denticle, wing, MP2 neuron

Reporter Genes: GFP, EGFP, YFP, lacZ, CFP, Green Fluorescent Protein, reporter gene, dsRed, mCherry

Page 7:

Individual sentences in full text are marked up with Categories

ARTICLE TEXT: egl-38 regulates lin-3 transcription in vulF in L3 larvae

TEXTPRESSO CATEGORIES: egl-38 (gene), regulates (regulation), lin-3 (gene), transcription (process), vulF (anatomy), L3 larvae (life stage)

Automatically mark up the whole corpus of papers with category terms, and index it for rapid searching
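The markup-and-index idea can be sketched as follows. This is a minimal illustration, not Textpresso's implementation: the tiny lexicon holds only the terms from the example sentence, and `mark_up` and `build_index` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon: each category is a "bag of words".
CATEGORIES = {
    "gene": {"egl-38", "lin-3"},
    "regulation": {"regulates"},
    "process": {"transcription"},
    "anatomy": {"vulF"},
    "life stage": {"L3 larvae"},
}

def mark_up(sentence):
    """Return (term, category) pairs for lexicon terms, in order of appearance."""
    hits = []
    for category, terms in CATEGORIES.items():
        for term in terms:
            # The lookarounds keep "lin-3" from matching inside "lin-31".
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            for m in re.finditer(pattern, sentence):
                hits.append((m.start(), term, category))
    return [(term, category) for _, term, category in sorted(hits)]

def build_index(sentences):
    """Inverted index: category -> ids of sentences containing a term of that category."""
    index = defaultdict(set)
    for sid, sentence in enumerate(sentences):
        for _, category in mark_up(sentence):
            index[category].add(sid)
    return index

sentences = ["egl-38 regulates lin-3 transcription in vulF in L3 larvae"]
print(mark_up(sentences[0]))
print(dict(build_index(sentences)))
```

Indexing by category rather than by literal word is what lets a query ask for "any gene" plus "any anatomy term" in the same sentence.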

Page 8:

What Arabidopsis genes are expressed in the meristem based on reporter genes?

14,930 A.t. papers

www.textpresso.org/arabidopsis

Page 9:

Is a nicotinic receptor associated with Drugs of Abuse other than nicotine?

www.textpresso.org/neuroscience 15,786 papers

Page 10:

The problem with clever fly names

Gene name     abbreviation
forager       for
ascute        as
wee           we
Washed eye    We

Train system to recognize gene names by context

use italics from PDF ~70%

~85%

Michael Müller, Arun Rangarajan
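To see why context is needed, consider a naive exact-match lookup over the abbreviations listed on this slide. This is only an illustration of the problem, not the recognition system described here; `naive_gene_hits` and the example sentence are invented.

```python
# Abbreviations and names taken from the slide; the lookup itself is illustrative.
FLY_GENE_ABBREVIATIONS = {
    "for": "forager",
    "as": "ascute",
    "we": "wee",
    "We": "Washed eye",
}

def naive_gene_hits(sentence):
    """Flag every token that exactly matches a gene abbreviation (case-sensitive)."""
    return [tok for tok in sentence.split() if tok in FLY_GENE_ABBREVIATIONS]

# An ordinary English sentence that mentions no genes at all.
text = "We used RNAi as a control for the screen"
print(naive_gene_hits(text))
```

All three flagged tokens are false positives, and `we` versus `We` shows that even capitalization alone cannot settle which gene is meant.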

Page 11:

What reporter genes have been used with Drosophila genes to study human disease?

20,099 full-text fly papers

www.textpresso.org/fly

Page 12:

Find all sentences that contain ≥2 gene names and ≥1 association or regulation word:

26,000 sentences out of 4,400 articles

Simple interface to “check off” sentences

100 sentences per hour

Database curation: e.g. Gene-Gene Interactions

output into database
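The sentence filter described on this slide can be sketched as below. The word lists are hypothetical stand-ins for the gene and association/regulation categories, and counting distinct gene names is an assumption; the slide only says "≥2 gene names".

```python
import re

# Hypothetical word lists; a real lexicon would be far larger.
GENE_NAMES = {"egl-38", "lin-3", "let-23", "let-60"}
ASSOCIATION_WORDS = {"regulates", "interacts", "binds", "suppresses", "activates"}

# A token starts with a letter or digit and may contain hyphens and dots (e.g. "egl-38").
TOKEN = re.compile(r"[A-Za-z0-9][\w.-]*")

def is_candidate(sentence):
    """True if the sentence has >= 2 distinct gene names and >= 1 association word."""
    tokens = [t.rstrip(".-") for t in TOKEN.findall(sentence)]
    genes = {t for t in tokens if t in GENE_NAMES}
    has_association = any(t.lower() in ASSOCIATION_WORDS for t in tokens)
    return len(genes) >= 2 and has_association

examples = [
    "egl-38 regulates lin-3 transcription in vulF.",
    "lin-3 is expressed in the anchor cell.",
    "let-23 and let-60 were both examined.",
]
print([is_candidate(s) for s in examples])
```

Only the first example passes. At the quoted rate of 100 sentences per hour, checking off all 26,000 candidate sentences would take about 260 curator-hours.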

Page 13:

Prioritizing high resolution genetic interaction tests by knowledge mining

1. Full text information retrieval: Hans-Michael Muller, Arun Rangarajan, Tracy Teal, Kimberly Van Auken, Juancarlos Chan

2. Predicting Gene Interactions from information available in public databases: Weiwei Zhong

Page 14:

Training Set

Training set
  4775 Positive Interactions
    Genetic, Literature curation (1909)
    Yeast two-hybrid screen (2933)
  3296 Negative Genetic Interactions
    cis doubles in genetic mapping

Benchmark
  5515 Positives: KEGG database
  5000 Negatives: Randomly selected

Page 15:

Algorithm

[Flow diagram]

Ortholog mapping: worm gene pair, fly orthologs, yeast orthologs

Scoring: worm score, fly score, yeast score

Score integration: total score

Predictor sets shown in the diagram:
  interaction, GO, expression, phenotype, microarray
  GO, expression, phenotype, microarray
  interaction, GO, localization, phenotype, microarray

Page 16:

Scoring and score integration

Likelihood ratio of a predictor taking value v:

$$L = \frac{p(v \mid \text{pos})}{p(v \mid \text{neg})}$$

p(v | pos): probability of the predictor having value v if two genes interact
p(v | neg): probability of the predictor having value v if two genes do not interact

[Figure: likelihood ratio L (y-axis, 0 to 7) plotted against term usage (% of annotated genes associated with the term; x-axis, 0 to 25) for C. elegans expression]

Score integration: sum the logs of the L's

$$\text{score} = \sum_{i=1}^{n} \ln L_i$$

n: number of predictors
L_i: likelihood ratio of each predictor
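A minimal sketch of the likelihood ratio and the log-sum score defined on this slide. The training counts, the predictor names, and the add-one pseudocount are all assumptions made for illustration; the slide does not say how the probabilities were estimated.

```python
import math
from collections import Counter

def likelihood_ratios(pos_values, neg_values, pseudocount=1.0):
    """Estimate L(v) = p(v | pos) / p(v | neg) for each observed predictor value v.

    pseudocount (add-one smoothing) is an assumption to avoid division by zero.
    """
    pos, neg = Counter(pos_values), Counter(neg_values)
    values = set(pos) | set(neg)
    k = len(values)
    ratios = {}
    for v in values:
        p_pos = (pos[v] + pseudocount) / (len(pos_values) + pseudocount * k)
        p_neg = (neg[v] + pseudocount) / (len(neg_values) + pseudocount * k)
        ratios[v] = p_pos / p_neg
    return ratios

def score(observed, ratio_tables):
    """score = sum over predictors i of ln L_i, for the values observed for one gene pair."""
    return sum(math.log(ratio_tables[name][value]) for name, value in observed.items())

# Toy training data, invented for illustration (10 positive and 10 negative pairs each).
ratio_tables = {
    "shared_GO_term": likelihood_ratios([True] * 8 + [False] * 2, [True] * 2 + [False] * 8),
    "coexpressed": likelihood_ratios([True] * 6 + [False] * 4, [True] * 4 + [False] * 6),
}

pair = {"shared_GO_term": True, "coexpressed": True}
print(score(pair, ratio_tables))
```

Summing the logs is equivalent to multiplying the likelihood ratios, so a pair scores high when several predictors each favor interaction and low when they favor non-interaction.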

Page 20:

[Figure: genes shown are lin-3, let-23, sem-5, sos-1, let-60, lin-45, mek-2, mpk-1, lip-1, ksr-1, gap-1; legend: v1.6, v1.4 & v1.6]

Page 21:

Testing let-60 ras Interactors

87 genes have score >0.9; 17 confirmed from literature

Inactivating genes in a gain-of-function (gf) let-60 mutant by RNAi

Assay vulval precursor cell (VPC) induction

Genotype                   WT%   Muv%   average   Image label
N2                         100   0      3.0       not Multivulva
let-60(gf)                 0     100    4.3       strong Multivulva
let-60(gf); tax-6(RNAi)    40    60     3.4       weak Multivulva

Page 22:

let-60(gf) VPC Induction Under Various RNAi

[Figure: bar chart of VPC induction index (y-axis, 0 to 6) for the control and each RNAi treatment, grouped into Score > 0.9 and Score < 0.6, with significance marked at p < 0.01 and p < 0.05]

RNAi treatments on the x-axis: control, tax-6, csn-5, qua-1, C01G8.9, pfn-3, nhr-41, C05D10.3, Y48G10A.3, dlg-1, tag-22, grd-11, W03F11.6, mig-15, taf-6.1, taf-1, lin-32, unc-55, Y59A8B.23, Y48G10A.3, wrt-8, sqv-7, wrt-4, evl-20, C07H6.3, glp-1, unc-59, grd-1, wrt-7, hog-1, cdc-25.3, che-1, mom-5, Y53C12C.1, rnt-1, cki-1, let-413, taf-4, tig-2, tag-117, psa-4, T24H10.7, lin-48, src-2, B0353.1, R05G6.10, T18D3.7, grd-2, ZC84.3, cdc-42, cki-2, F59A2.4, K10H10.1, C04C3.3, F34D6.4, F34D10.2, C25H3.4, H27A23.1, Y54G11A.1, B0035.16, M03C11.4, C41C4.8, M01F1.5, ZK945.8, ZK643.2, F26E4.12, C16A3.7, C53A3.2, H14N18.4, W02D3.6, F08A8.4, C37H5.3, F28H6.3, R10E11.3, R04B5.5, B0491.1, C06A8.6

12 hits (p<0.05) in 49 genes; 1 hit in 26 randomly selected genes

Combined with literature, 29/66 (44%) predictions confirmed
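One way to compare the two hit rates quoted on this slide is a one-sided Fisher's exact test on the counts 12 of 49 versus 1 of 26. This is an added illustration and not an analysis stated in the slides; the per-gene "p<0.05" on the slide refers to individual RNAi results, which is a different question. `fisher_one_sided` is a hypothetical helper written with the standard library only.

```python
import math

def fisher_one_sided(hits_a, n_a, hits_b, n_b):
    """One-sided Fisher's exact test.

    Probability that group A receives at least hits_a of the hits when the
    hits are distributed at random across both groups (hypergeometric tail).
    """
    total = n_a + n_b
    total_hits = hits_a + hits_b
    denom = math.comb(total, n_a)
    p = 0.0
    for k in range(hits_a, min(total_hits, n_a) + 1):
        p += math.comb(total_hits, k) * math.comb(total - total_hits, n_a - k) / denom
    return p

# Counts from the slide: 12 hits in 49 tested genes, 1 hit in 26 randomly selected genes.
rate_tested = 12 / 49
rate_random = 1 / 26
p_value = fisher_one_sided(12, 49, 1, 26)
print(f"hit rate: {rate_tested:.1%} vs {rate_random:.1%}; one-sided p = {p_value:.3f}")
```

The same arithmetic confirms the last line of the slide: 29 of 66 is 43.9%, which rounds to the quoted 44%.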

Page 23:

let-60 ras interactors (suppressors)

tax-6 calcineurin

csn-5 COP-9 signalosome

qua-1 hedgehog-related protein

C01G8.9 SWI/SNF-related (eyelid)

C05D10.3 ABC transporter (white)

pfn-3 profilin

nhr-4 transcription factor

Page 24:

C. elegans Interactions

Input 4,726 known interactions among 2,713 genes

Predict additional 18,863 for total of 23,589 interactions among 4,408 genes

Page 25:

for Drosophila

Page 27:

D. melanogaster interactions

Input 4,180 known interactions among 1,262 genes

Predict 13,126 for 17,306 interactions among 6,044 genes

Page 28:

Automated, Quantitative Phenotyping

Chris Cronin: movement analysis (BMC Genetics 2005)

generative graphics

locomotion

plate demographics (Weiwei Zhong)

morphology

sexual behavior

E. Fontaine, A. Whittaker, Joel Burdick

Page 29:

Prioritizing high resolution genetic interaction tests by knowledge mining

1. Full text information retrieval: Hans-Michael Muller, Arun Rangarajan, Tracy Teal, Kimberly Van Auken, Juancarlos Chan

2. Predicting Gene Interactions from information available in public databases: Weiwei Zhong