Daejeon, 2005 Alfonso Valencia CNB-CSIC
Text mining in Bioinformatics
The First International Symposium on Languages in Biology and Medicine (LBM2005)
KAIST, Daejeon, South Korea
Nov, 2005
Alfonso Valencia
Centro Nacional de Biotecnología - CSIC
Proteomics
Predicted networks
literature
Functional Genomics
(mouse model) Cdks are not essential; Cdk1 can replace the others
SPECIFIC PROTEINS
Cdks, cyclins, kinases
Residues determinant of the dimerization of Chemokine receptors
Hernanz-Falcon et al., Nat Immunol. 2004; Bioinform. 2005
Two residues are able to control CCR5 chemokine receptor dimerization, both in vitro and in vivo, blocking CCL5-induced responses in human cell lines and in primary T cells.
Figure 3. FRET experiment for CCR5. Cyan fluorescent protein (CFP) fluorescence lifetime images (calculated from the phase shift) of HEK-293 cells expressing CCR5wt-CFP/CCR5wt-YFP (lower) or CCR5mut-CFP/CCR5mut-YFP (upper). The pseudocolor scale ranges from 0 (black) to 4.0 ns (white).
Del Sol, Pazos, Valencia JMB 2003
Buchnera aphidicola genome (http://www.pdg.cnb.uam.es/fabascal/Buch_ORFand_www/)
Query Protein
Similar Proteins
Standard Sequence Searches
Protein groups
rab (M. musculus)
rab (C. elegans)
rab (H. sapiens)
ras (H. sapiens)
ras (M. musculus)
ras (C. elegans)
ras2 (H. sapiens)
Homologous Proteins
Similar Functions ?
by F. Abascal
Genequiz flowchart
by The Genequiz consortium 1995-2002
Annotation workflows (INB)
Sequence Based function prediction
Del Pozo, Valencia 2004
[Figure: % conservation of the 4th E.C. digit (y-axis, 0 to 100) against identity class (%) (x-axis, 10 to 100)]
Valencia Curr. Op Struc Biol 05
SLIDE WINDOW APPROACH
Krallinger Valencia Drug Discovery Today 2005
BioCreAtIvE
C. Blaschke and A. Valencia: CNB-CSIC
L. Hirschman and A. Yeh: MITRE
R. Apweiler, E. Camon, et al.: GOA (EBI)
C. Wu: PIR
J. Blake (MGI)
J. Wilbur and L. Tanabe (NCBI)
L. Grivell (EMBO)
Full-text access (HighWire Press)
EMBO Evaluation Workshop, April 2004, Granada, Spain
http://www.pdg.cnb.uam.es/BioLINK
Task 1: Extraction of gene or protein names from text, and their mapping into standardized gene identifiers for fly, mouse, yeast.
1a.- Gene list annotation (creating a list of genes mentioned in abstracts). Useful for indexing. “a number of systems (4) were able to extract general gene names from sentences of MEDLINE abstracts at over 80% balanced precision and recall”
1b.- Gene name mentions. Corresponds to the “named entity” task in natural language processing. “the results ranged from a high for yeast of 92% balanced precision and recall, to somewhat lower scores for fly (82%) and mouse (79%)”
BioCreAtIvE © Results, methods, and evaluation papers published in BMC Bioinformatics 2005.
Evolution of gene names
[Figure axis: Years]
Hoffmann, Valencia TIGs 2003
Gene names
The evolution of gene names over time is a “scale free” process
- “critical state” system
- the evolution of a gene name cannot be predicted
- some gene names act as attractors of other names
Text mining in a nutshell
1. Protein / gene names
   Interspecies
   Linking to DBs
2. Relations
   Protein protein
   Others (regulation, drugs)
   Function
3. Type of Relation
   Proteins
   Metabolic pathways

1. 80% prec/recall (BioCreative)
   Far less than that
   Essential (not NLP)
2. Easy on the surface
   Best known one (accessible?)
   Dictionaries
   Very difficult (i.e. GO in BioCreative)
3. Semantic
   Summaries very difficult
   New challenge, unexplored

Hoffmann et al., Science STKE 2005
Krallinger et al., Genome Biology 2005
Krallinger et al., DDToday 2005
SUISEKI
Selection of the text corpus
PubMed, 12M entries
Extraction of protein names
Selecting terms that indicate interaction
Action words are, for example: activate, associated with, bind, interact, phosphorylate, regulate
Rules (frames) to identify the interactions
* [protein A] ... verb indicating an action ... [protein B]
“After extensive purification, Cdk2 was still bound to cyclin D1”
Extraction of the interactions
Human expert manipulation
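The frame on this slide, [protein A] ... action verb ... [protein B], can be illustrated with a minimal Python sketch. The protein dictionary, the regular expressions for the action words, and the function name `find_interactions` are assumptions made for illustration. This is not the SUISEKI implementation, which is not described in enough detail here to reproduce.

```python
import re
from itertools import combinations

# Hypothetical mini-dictionary of protein names; a real system would
# detect names in the text instead of relying on a fixed list.
PROTEINS = ["Cdk2", "cyclin D1", "PCNA", "MSH2"]

# Action words from the slide, written as loose regex stems so that
# inflected forms ("bound", "binds", "phosphorylates") also match.
ACTION_PATTERNS = [
    r"activat\w*", r"associated with", r"b(?:ind\w*|ound)",
    r"interact\w*", r"phosphorylat\w*", r"regulat\w*",
]
ACTION_RE = re.compile(
    r"\b(?:" + "|".join(ACTION_PATTERNS) + r")\b", re.IGNORECASE
)


def find_interactions(sentence):
    """Return (protein A, action, protein B) triples found in one sentence."""
    # Locate every dictionary name together with its position.
    hits = []
    for name in PROTEINS:
        for m in re.finditer(re.escape(name), sentence):
            hits.append((m.start(), m.end(), name))
    hits.sort()

    triples = []
    for (_, end_a, a), (start_b, _, b) in combinations(hits, 2):
        if a == b:
            continue
        # The frame requires an action word between the two names.
        verb = ACTION_RE.search(sentence, end_a, start_b)
        if verb:
            triples.append((a, verb.group(0).lower(), b))
    return triples
```

For the example sentence on the slide, `find_interactions("After extensive purification, Cdk2 was still bound to cyclin D1")` returns `[("Cdk2", "bound", "cyclin D1")]`. A pattern this simple ignores negation and sentence structure, which is one reason the pipeline ends with human expert manipulation.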
Blaschke Valencia IEEE 2002
Hoffmann et al., Science STKE 2005
Krallinger et al., Genome Biology 2005
Krallinger et al., Drug Disc. Today 2005
Hoffmann Valencia Nat Genet 2004
VISIT: iHOP
???update | humans | mice | Drosophila | zebrafish | C.elegans | Arabidopsis | yeast | E.coli | Average
Relevant docs (RD) (Gold std. LocusLink) | 30760 | 7357 | 12711 | 119 | 170 | - | - | - | 10223.4
RD: Recall (%) | 84.3 | 76.6 | 79.6 | 74 | 95.3 | | | | 81.96
Nr of gene-article references (GA) (Gold std. LocusLink) | 36816 | 8471 | 27994 | 147 | 279 | - | - | - | 14741.4
GA: Recall | 26327 | 5639 | 14495 | 105 | 251 | - | - | - | 9363.4
GA: Recall (%) | 71.5 | 66.6 | 51.8 | 71.4 | 90 | - | - | - | 70.26
Exact gene localisations (GL) (Gold std. manual) | 403 | 381 | 438 | 360 | 597 | 295 | 536 | 476 | 435.75
GL: True positives (Gold std. manual) | 351 | 354 | 409 | 352 | 584 | 271 | 533 | 442 | 412
GL: Precision, exact localisation (%) | 87.1 | 92.9 | 93.4 | 97.8 | 97.8 | 91.9 | 99.4 | 92.9 | 94.15
F-measure (%) | 78.5 | 77.6 | 66.6 | 82.5 | 93.7 | - | - | - | 80
iHOP inside
Hoffmann Valencia Bioinform. 2005
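The percentages in the iHOP table follow from the raw counts. The sketch below recomputes recall, precision, and the F-measure for the human column. It assumes that the F-measure row is the harmonic mean of the GA recall and the GL precision; this is consistent with the reported values but is not stated on the slide. The function names are illustrative.

```python
def ratio_percent(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Human column of the table (counts copied from the slide).
ga_gold, ga_recalled = 36816, 26327   # gene-article references
gl_found, gl_true = 403, 351          # exact gene localisations

ga_recall = ratio_percent(ga_recalled, ga_gold)
gl_precision = ratio_percent(gl_true, gl_found)
f = f_measure(gl_precision, ga_recall)

print(f"GA recall:    {ga_recall:.1f}%")
print(f"GL precision: {gl_precision:.1f}%")
print(f"F-measure:    {f:.1f}%")
```

Rounded to one decimal, the three values are 71.5, 87.1, and 78.5, matching the human column of the table.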
Hermjakob et al., IntAct: an open source molecular interaction database. Nucleic Acids Res. 2004
HCAD Chromosomal Translocations Database
Hoffmann et al., NAR 2004
BioCreAtIvE
C. Blaschke and A. Valencia: CNB-CSIC
L. Hirschman and A. Yeh: MITRE
R. Apweiler, E. Camon, et al.: GOA (EBI)
C. Wu: PIR
J. Blake (MGI)
J. Wilbur and L. Tanabe (NCBI)
L. Grivell (EMBO)
Full-text access (HighWire Press)
EMBO Evaluation Workshop, April 2004, Granada, Spain
http://www.pdg.cnb.uam.es/BioLINK
Task 1: Extraction of gene / protein names from text, mapping to identifiers (fly, mouse, yeast)
1a.- Identification of a list of genes in text (indexing). 4 systems at 80% balanced precision and recall
1b.- Gene name mentions linked to DB entries (entity identification task in NLP). Yeast: 92% balanced precision and recall; fly: 82%; mouse: 79%
Task 2: GO to protein via text for a collection of human genes.
2a.- Identification of the text piece for a given GO term and protein. Best systems with large coverage: approx. 23% correct identification
2b.- Find the GO term and text for a list of proteins. Best systems cover most proteins with 20% correct identification
BioCreAtIvE © Results, methods, and evaluation papers published in BMC Bioinformatics 2005.
Biocreative Task 2a
Krallinger et al., BMC Bioinfo. 05
Krallinger, Padron, et al., 2005
Correlation GO - Protein spaces (sub-tags)
Protein names sub tag:
1- Original protein name
2- Heuristic typographical variants
3- Variants from external links to db
4- Protein name forming word types
5- External links forming word types
6- GOBO sequence terms
7- GOBO mutation event terms
GO term sub tag:
1- Original GO term
2- NL variants of GO term
3- GO term forming word types
4- GO term definition word types
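Sub-tag 2 of the protein name list, heuristic typographical variants, can be illustrated with a small sketch. The rules used here (varying the separator between letter and digit chunks, and varying the case) are assumptions chosen for illustration; the slide does not say which heuristics were actually applied, and `typographical_variants` is a hypothetical name.

```python
import re


def typographical_variants(name):
    """Generate simple spelling variants of a protein name.

    Assumed heuristics: break the name into alphabetic and numeric
    chunks, rejoin them with no separator, a space, or a hyphen, and
    add lower-case and upper-case forms of each result.
    """
    # Split into letter runs and digit runs; hyphens, spaces, and other
    # punctuation are dropped at this step.
    parts = re.findall(r"[A-Za-z]+|\d+", name)

    variants = {name}  # always keep the original spelling
    for sep in ("", " ", "-"):
        joined = sep.join(parts)
        variants.update({joined, joined.lower(), joined.upper()})
    return variants


for variant in sorted(typographical_variants("Cdk2")):
    print(variant)
```

For `Cdk2` this produces spellings such as `CDK2`, `cdk-2`, and `Cdk 2`. Such a crude generator over-produces for multi-word names, so in practice the variants would need filtering against a dictionary or the text collection.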
Krallinger, Padron, et al., 2005
Correlation GO - Protein spaces (sub-tags)
Interface for the EBI GO team during the BioCreative evaluation
Text mining in a nutshell
1. Protein / gene names
   1. Interspecies
   2. Linking to DBs
2. Relations
   1. Protein protein
   2. Others (regulation, drugs)
   3. Function
3. Type of Relation
   1. Proteins
   2. Metabolic pathways
4. Concepts for groups of genes
   1. Existing
   2. Creating new ones

1. 80% prec/recall (BioCreative)
   1. Far less than that
   2. Essential (not NLP)
2. Easy on the surface
   1. Best known one (accessible?)
   2. Dictionaries
   3. Very difficult (to GO, BioCreative)
3. Semantic
   1. Summaries very difficult
   2. New challenge, unexplored
4. Knowledge discovery
   1. Summaries and generalization
   2. Not yet

Hoffmann et al., Science STKE 2005
Krallinger et al., Genome Biology 2005
Krallinger et al., DDToday 2005
Experiment: Iyer et al. (1999) Science 283, 83-87
17 genes (PCNA, CDC2, MSH2, LBR, TOP2A, ...): Cell cycle
Words: Meiosis, Cyclin, Checkpoint, Interphase, Nucleoplasma, Division, Histone, Replication, Chromatid
GO codes: DNA replication, DNA metabolism, Cell Cycle control
Sentences:
PCNA-MSH2: The binding of PCNA to MSH2 may reflect linkage between mismatch repair and replication.
LBR-CDC2: LBR undergoes mitotic phosphorylation mediated by p34(cdc2) protein kinase.
24 genes (ABCA5, CAT, ELF2, PIM1, WNT2, ...): Unknown
Words: Dipeptidyl, Prolyl, nmr, Collagen-binding
Blaschke et al., Funct. Integ. Genomics 2001
SOTA clustering versus significance of Geisha terms.
Oliveros, Blaschke, GIW 2000 ©
SOTA and GEISHA mixed information
Blaschke, Herrero, Dopazo, Valencia 2002
Expression based clustering
Weight (expression) + Weight (text)
Term (text) based clustering
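The idea of mixing the two sources, Weight (expression) + Weight (text), can be sketched as a weighted sum of two gene-to-gene distances that a clustering method would then consume. The choice of cosine distance, the default weights, and all names below are assumptions for illustration; the slide does not specify how the two weights were defined or combined.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def combined_distance(expr_a, expr_b, text_a, text_b, w_expr=0.5, w_text=0.5):
    """Weighted sum of expression-profile and term-vector distances.

    expr_*: expression values of a gene across conditions (hypothetical).
    text_*: term weights of a gene from the literature (hypothetical).
    """
    return (w_expr * cosine_distance(expr_a, expr_b)
            + w_text * cosine_distance(text_a, text_b))


# Two genes with identical expression profiles but no shared terms.
d = combined_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1, 0], [0, 1])
print(round(d, 3))
```

Setting `w_text=0` gives purely expression-based clustering and `w_expr=0` gives purely term-based clustering, the two extremes named on the slide.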
www.pdg.cnb.uam.es
- 14th ISMB: Fortaleza, Sept 06. http://www.iscb.org/ Text mining SIGs, Biolink
- The European School on Bioinformatics. BioSapiens http://www.biosapiens.info
- Winter symposium Bologna Feb 2006
- Master in Bioinformatics. U. Complutense, January - July 2006. bbm1.ucm.es/masterbioinfo
www.cab.inta.es
www.inba.org
www.bioalma.com
Stable clusters > central processes where expression and functional information agree
Unstable groups > contradictory information
“Jumping” genes: divergent expression and functional classifications.
(Genes with very unstable behavior > related to insufficient information)