Daejeon, 2005 Alfonso Valencia CNB-CSIC Text mining in Bioinformatics The First International Symposium on Languages in Biology and Medicine (LMB2005) KAIST, Daejeon, South Korea Nov, 2005 Alfonso Valencia Centro Nacional de Biotecnología - CSIC


Page 1:

Daejeon, 2005 Alfonso Valencia CNB-CSIC

Text mining in Bioinformatics

The First International Symposium on Languages in Biology and Medicine (LMB2005)

KAIST, Daejeon, South Korea

Nov, 2005

Alfonso Valencia

Centro Nacional de Biotecnología - CSIC

Page 2:

Proteomics

Predicted networks

literature

Functional Genomics

Page 3:

(mouse model) Cdks are not essential; Cdk1 can replace others

SPECIFIC PROTEINS

Cdks, cyclins, kinases

Page 4:

Residues determinant of the dimerization of Chemokine receptors

Hernanz-Falcon et al., Nat Immunol. 2004; Bioinform. 2005

Two residues are able to control CCR5 chemokine receptor dimerization, both in vitro and in vivo blocking CCL5-induced responses in human cell lines and in primary T cells.

Figure 3. FRET experiment for CCR5. Cyan fluorescent protein (CFP) fluorescence lifetime images (calculated from the phase shift) of HEK-293 cells expressing CCR5wt-CFP/CCR5wt-YFP (lower) or CCR5mut-CFP/CCR5mut-YFP (upper). The pseudocolor scale ranges from 0 (black) to 4.0 ns (white).

Del Sol, Pazos, Valencia JMB 2003

Page 5:

Buchnera aphidicola genome (http://www.pdg.cnb.uam.es/fabascal/Buch_ORFand_www/)

Page 6:

Query Protein

Similar Proteins

Standard Sequence Searches

Protein groups

rab (M. musculus)

rab (C. elegans)

rab (H. sapiens)

ras (H. sapiens)

ras (M. musculus)

ras (C. elegans)

ras2 (H. sapiens)

Homologous Proteins

Similar Functions?

by F. Abascal

Page 7:

Genequiz flowchart

by The Genequiz consortium 1995-2002

Page 8:

Annotation workflows (INB)

Page 9:

Sequence Based function prediction

Del Pozo, Valencia 2004

[Plot: % conservation of the 4th E.C. digit (y axis, 0 to 100) versus identity class (%) (x axis, 10 to 100)]

Valencia Curr. Op Struc Biol 05

Page 10:

SLIDE WINDOW APPROACH

Krallinger Valencia Drug Discovery Today 2005

Page 11:

BioCreAtIvE

- C. Blaschke and A. Valencia: CNB-CSIC
- L. Hirschman and A. Yeh: MITRE
- R. Apweiler, E. Camon, et al., GOA (EBI)
- C. Wu: PIR
- J. Blake (MGI)
- J. Wilbur and L. Tanabe (NCBI)
- L. Grivell (EMBO)
- Full-text access (HighWire Press)

EMBO Evaluation Workshop, April 2004, Granada, Spain

http://www.pdg.cnb.uam.es/BioLINK

Task 1: Extraction of gene or protein names from text, and their mapping into standardized gene identifiers for fly, mouse, yeast.

1a.- Gene list annotation (creating a list of genes mentioned in abstracts). Useful for indexing. “A number of systems (4) were able to extract general gene names from sentences of MEDLINE abstracts at over 80% balanced precision and recall.”

1b.- Gene name mentions. Corresponds to the “named entity” task in natural language processing. “The results ranged from a high for yeast of 92% balanced precision and recall, to somewhat lower scores for fly (82%) and mouse (79%).”

BioCreAtIvE: results, methods, and evaluation papers published in BMC Bioinformatics 2005.

Page 12:

Evolution of gene names

[Plot axis: Years]

Hoffmann, Valencia TIGs 2003

Gene names

The evolution of gene names over time is a “scale free” process:
- “critical state” system
- the evolution of a gene name cannot be predicted
- some gene names act as attractors of other names

Page 13:

Text mining in a nutshell

1. Protein / gene names
   - Interspecies
   - Linking to DBs
2. Relations
   - Protein protein
   - Others (regulation, drugs)
   - Function
3. Type of Relation
   - Proteins
   - Metabolic pathways

1. 80% prec/recall (BioCreative)
   - Far less than that
   - Essential (not NLP)
2. Easy on the surface
   - Best known one (accessible?)
   - Dictionaries
   - Very difficult (i.e. GO in BioCreative)
3. Semantic
   - Summaries very difficult
   - New challenge, unexplored

Hoffmann et al., Science STKE 2005; Krallinger et al., Genome Biology 2005; Krallinger et al., DDToday 2005

Page 14:

SUISEKI

Extraction of the interactions

Human expert manipulation

PubMed: 12M entries

Extraction of protein names

Rules (frames) to identify the interactions

* [protein A] ... verb indicating an action ... [protein B]

“After extensive purification, Cdk2 was still bound to cyclin D1”

Selecting terms that indicate interaction

Action words are, for example: activate, associated with, bind, interact, phosphorylate, regulate

Selection of the text corpus
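The frame "[protein A] ... verb indicating an action ... [protein B]" can be illustrated with a small sketch. This is not the SUISEKI implementation: the name list `PROTEIN_NAMES`, the regular expressions in `ACTION_WORDS`, and the function names are all hypothetical, and a real system would rely on a curated protein dictionary and many more frames.

```python
import re

# Hypothetical mini-dictionary of protein names (assumption, not from the slides).
PROTEIN_NAMES = ["Cdk2", "cyclin D1", "PCNA", "MSH2"]

# Action words listed on the slide, with loose inflection patterns (assumption).
ACTION_WORDS = {
    "activate": r"activat\w*",
    "associated with": r"associat\w* with",
    "bind": r"(?:bind\w*|bound)",
    "interact": r"interact\w*",
    "phosphorylate": r"phosphorylat\w*",
    "regulate": r"regulat\w*",
}


def find_mentions(sentence, names):
    """Return sorted (start, end, name) tuples for dictionary names in the sentence."""
    mentions = []
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, sentence, re.IGNORECASE):
            mentions.append((m.start(), m.end(), name))
    return sorted(mentions)


def extract_interactions(sentence, names=PROTEIN_NAMES, actions=ACTION_WORDS):
    """Apply the frame [protein A] ... action word ... [protein B] to one sentence."""
    mentions = find_mentions(sentence, names)
    results = []
    for i, (_, end_a, name_a) in enumerate(mentions):
        for start_b, _, name_b in mentions[i + 1:]:
            between = sentence[end_a:start_b]  # text separating the two names
            for action, pattern in actions.items():
                if re.search(r"\b" + pattern + r"\b", between, re.IGNORECASE):
                    results.append((name_a, action, name_b))
    return results


example = "After extensive purification, Cdk2 was still bound to cyclin D1"
print(extract_interactions(example))
```

On the slide's example sentence the sketch reports a "bind" relation between Cdk2 and cyclin D1. It ignores negation, word distance, and sentence structure, which is where the human expert step would come in.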

Page 15:

Blaschke Valencia IEEE 2002

Page 16:

Hoffmann et al., Science STKE 2005; Krallinger et al., Genome Biology 2005; Krallinger et al., Drug Disc. Today 2005

Page 17:

Hoffmann Valencia Nat Genet 2004

VISIT: iHOP

Page 18:

Page 19:

Page 20:

Page 21:

Page 22:

| ???update | humans | mice | Drosophila | zebrafish | C.elegans | Arabidopsis | yeast | E.coli | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Relevant docs (RD) (Gold std. LocusLink) | 30760 | 7357 | 12711 | 119 | 170 | - | - | - | 10223.4 |
| RD: Recall (%) | 84.3 | 76.6 | 79.6 | 74 | 95.3 | - | - | - | 81.96 |
| Nr of gene-article references (GA) (Gold std. LocusLink) | 36816 | 8471 | 27994 | 147 | 279 | - | - | - | 14741.4 |
| GA: Recall | 26327 | 5639 | 14495 | 105 | 251 | - | - | - | 9363.4 |
| GA: Recall (%) | 71.5 | 66.6 | 51.8 | 71.4 | 90 | - | - | - | 70.26 |
| Exact gene localisations (GL) (Gold std. manual) | 403 | 381 | 438 | 360 | 597 | 295 | 536 | 476 | 435.75 |
| GL: True positives (Gold std. manual) | 351 | 354 | 409 | 352 | 584 | 271 | 533 | 442 | 412 |
| GL: Precision, exact localisation (%) | 87.1 | 92.9 | 93.4 | 97.8 | 97.8 | 91.9 | 99.4 | 92.9 | 94.15 |
| F-measure (%) | 78.5 | 77.6 | 66.6 | 82.5 | 93.7 | - | - | - | 80 |
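As a worked check of how the evaluation figures relate, the sketch below recomputes precision, recall, and the balanced F-measure for the "humans" column. Pairing the GL precision with the GA recall is an assumption on my part; it is chosen because it reproduces the F-measure reported in the table, not because the slides state it. The function names are illustrative.

```python
def precision(true_positives, predicted):
    """Fraction of predicted items that are correct."""
    return true_positives / predicted


def recall(found, gold):
    """Fraction of gold-standard items that were found."""
    return found / gold


def f_measure(p, r):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Counts from the "humans" column.
p = precision(351, 403)     # GL: true positives / exact gene localisations
r = recall(26327, 36816)    # GA: recalled / gold-standard gene-article references
f = f_measure(p, r)

print(round(100 * p, 1), round(100 * r, 1), round(100 * f, 1))
```

Under that assumption the three values round to 87.1, 71.5, and 78.5, matching the humans column.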

iHOP inside

Hoffmann Valencia Bioinform. 2005

Page 23:

Hermjakob et al., IntAct: an open source molecular interaction database. Nucleic Acids Res. 2004

Page 24:

Page 25:

HCAD Chromosomal Translocations Database

Hoffmann et al., NAR 2004

Page 26:

BioCreAtIvE

- C. Blaschke and A. Valencia: CNB-CSIC
- L. Hirschman and A. Yeh: MITRE
- R. Apweiler, E. Camon, et al., GOA (EBI)
- C. Wu: PIR
- J. Blake (MGI)
- J. Wilbur and L. Tanabe (NCBI)
- L. Grivell (EMBO)
- Full-text access (HighWire Press)

EMBO Evaluation Workshop, April 2004, Granada, Spain

http://www.pdg.cnb.uam.es/BioLINK

Task 1: Extraction of gene / protein names from text, mapping to identifiers (fly, mouse, yeast)

1a.- Identification of a list of genes in text. Useful for indexing. 4 systems reached over 80% balanced precision and recall.

1b.- Gene name mentions linked to DB entries. Corresponds to the entity identification task in NLP. Yeast: 92% balanced precision and recall; fly: 82%; mouse: 79%.

Task 2: GO to protein via text for a collection of human genes.

2a.- Identification of the text piece for a given GO term and protein. Best systems with large coverage: approx. 23% correct identification.

2b.- Find the GO term and text for a list of proteins. Best systems cover most proteins with 20% correct identification.

BioCreAtIvE: results, methods, and evaluation papers published in BMC Bioinformatics 2005.

Page 27:

Biocreative Task 2a

Krallinger et al., BMC Bioinfo. 05

Page 28:

Krallinger, Padron, et al., 2005

Correlation GO - Protein spaces (sub-tags)

Page 29:

Protein names sub tag:

1- Original protein name
2- Heuristic typographical variants
3- Variants from external links to db
4- Protein name forming word types
5- External links forming word types
6- GOBO sequence terms
7- GOBO mutation event terms

GO term sub tag:

1- Original GO term
2- NL variants of GO term
3- GO term forming word types
4- GO term definition word types
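To give a feel for what "heuristic typographical variants" of a protein name might look like, here is a minimal sketch. The specific heuristics (separator and case changes) and the function name `typographical_variants` are assumptions for illustration; the slides do not say which rules were actually used.

```python
import re


def typographical_variants(name):
    """Generate simple spelling variants of a protein name (illustrative heuristics only)."""
    variants = {name}
    # Split the name into runs of letters and runs of digits, e.g. "Cdk2" -> ["Cdk", "2"].
    tokens = re.findall(r"[A-Za-z]+|\d+", name)
    # Rejoin the tokens with no separator, a space, or a hyphen.
    for sep in ("", " ", "-"):
        variants.add(sep.join(tokens))
    # Add lower-case and upper-case forms of every variant collected so far.
    for v in list(variants):
        variants.add(v.lower())
        variants.add(v.upper())
    return sorted(variants)


print(typographical_variants("Cdk2"))
```

For "Cdk2" this yields forms such as "CDK2", "Cdk-2", and "cdk 2", which could then be matched against text alongside the original name.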

Page 30:

Krallinger, Padron, et al., 2005

Correlation GO - Protein spaces (sub-tags)

Page 31:

Page 32:

Interface for the EBI GO team during the BioCreative evaluation

Page 33:

Text mining in a nutshell

1. Protein / gene names
   1. Interspecies
   2. Linking to DBs
2. Relations
   1. Protein protein
   2. Others (regulation, drugs)
   3. Function
3. Type of Relation
   1. Proteins
   2. Metabolic pathways
4. Concepts for groups of genes
   1. Existing
   2. Creating new ones

1. 80% prec/recall (BioCreative)
   1. Far less than that
   2. Essential (not NLP)
2. Easy on the surface
   1. Best known one (accessible?)
   2. Dictionaries
   3. Very difficult (to GO, BioCreative)
3. Semantic
   1. Summaries very difficult
   2. New challenge, unexplored
4. Knowledge discovery
   1. Summaries and generalization
   2. Not yet

Hoffmann et al., Science STKE 2005; Krallinger et al., Genome Biology 2005; Krallinger et al., DDToday 2005

Page 34:

Experiment: Iyer et al. (1999) Science 283, 83-87

17 genes (Cell cycle): PCNA, CDC2, MSH2, LBR, TOP2A, ...
Words: Meiosis, Cyclin, Checkpoint, Interphase, Nucleoplasma, Division, Histone, Replication, Chromatid
GO codes: DNA replication, DNA metabolism, Cell Cycle control
Sentences:
- PCNA-MSH2: The binding of PCNA to MSH2 may reflect linkage between mismatch repair and replication.
- LBR-CDC2: LBR undergoes mitotic phosphorylation mediated by p34(cdc2) protein kinase.

24 genes (Unknown): ABCA5, CAT, ELF2, PIM1, WNT2, ...
Words: Dipeptidyl, Prolyl, nmr, Collagen-binding

Blaschke, et al., Funct. Integ. Genomics 2001

Page 35:

Page 36:

SOTA clustering versus significance of Geisha terms.

Oliveros, Blaschke, GIW 2000 ©

Page 37:

SOTA and GEISHA mixed information

Blaschke, Herrero, Dopazo, Valencia 2002

Expression based clustering

Weight (expression) + Weight (text)

Term (text) based clustering
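The idea of mixing "Weight (expression) + Weight (text)" can be sketched as a weighted sum of two gene-to-gene distance matrices. This is only an illustration of the weighting step, not SOTA clustering itself; the toy distance values, the weights, and the function names are invented for the example.

```python
def combine_distances(expr_dist, text_dist, w_expr=0.5, w_text=0.5):
    """Weighted sum of two square distance matrices given as nested lists."""
    n = len(expr_dist)
    return [
        [w_expr * expr_dist[i][j] + w_text * text_dist[i][j] for j in range(n)]
        for i in range(n)
    ]


def closest_pair(dist):
    """Return the index pair (i, j), i < j, with the smallest distance."""
    n = len(dist)
    pairs = [(dist[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    _, i, j = min(pairs)
    return (i, j)


# Toy distances between three genes (hypothetical values).
expression = [[0.0, 0.1, 0.9],
              [0.1, 0.0, 0.8],
              [0.9, 0.8, 0.0]]
text = [[0.0, 0.9, 0.2],
        [0.9, 0.0, 0.7],
        [0.2, 0.7, 0.0]]

# Equal weights: genes 0 and 1 are closest.
print(closest_pair(combine_distances(expression, text)))
# Favouring text: genes 0 and 2 become closest.
print(closest_pair(combine_distances(expression, text, w_expr=0.2, w_text=0.8)))
```

Shifting the weights changes which genes end up grouped together, which is the effect the mixed clustering explores; any clustering method that accepts a distance matrix could consume the combined result.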

Page 38:

www.pdg.cnb.uam.es

- 14th ISMB: Fortaleza, Sept 06. http://www.iscb.org/ Text mining SIGs, Biolink
- The European School on Bioinformatics. BioSapiens http://www.biosapiens.info
- Winter symposium, Bologna, Feb 2006
- Master in Bioinformatics. U. Complutense, January - July 2006. bbm1.ucm.es/masterbioinfo

www.cab.inta.es
www.inba.org
www.bioalma.com

Page 39:

Stable clusters > central processes where expression and functional information agree

Unstable groups > contradictory information

“Jumping” genes: divergent expression and functional classifications.

(Genes with very unstable behavior > related to insufficient information)