Michigan, 2005 Alfonso Valencia CNB-CSIC Text Mining ISMB05 Alfonso Valencia CNB-CSIC


Text Mining, ISMB05

Alfonso Valencia

CNB-CSIC

Michigan, 2005


SLIDE WINDOW APPROACH

Krallinger Valencia Drug Discovery Today 2005

ISMB-Biolink


BioLINK SIG: Linking Literature, Information and Knowledge for Biology

A Joint Meeting of the ISMB BioLINK Special Interest Group on Text Data Mining and the ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics

Christian Blaschke, Hagit Shatkay, Kevin B. Cohen, Lynette Hirschman

1. InTex: a Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text. S. T. Ahmed, D. Chidambaram, H. Davulcu, C. Baral

2. Corpus Design for Biomedical Natural Language Processing. K. B. Cohen, L. Fox, P. V. Ogren, L. Hunter

3. Unsupervised Gene/Protein Named Entity Normalization using Automatically Extracted Dictionaries. A. M. Cohen

4. Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions. A. Ramani, E. Marcotte, R. Bunescu, R. Mooney

5. MedTag: a Collection of Biomedical Annotations. L.H Smith, L. Tanabe, T. Rindflesch, W. John Wilbur

6. A Machine Learning Approach to Acronym Generation. Y. Tsuruoka, S. Ananiadou, J. Tsujii

7. Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization Data. B. Wellner

8. Adaptive String Similarity Metrics for Biomedical Reference Resolution. B. Wellner, J. Castaño, J. Pustejovsky

9. A Cross-Domain Application of Natural Language Processing in Biology. I. Chiu, L. H. Shu

10. Functional Annotation of Genes Using Hierarchical Text Categorization. S. Kiritchenko, S. Matwin, A. F. Famili

11. Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing. P. Nakov, A. Schwartz, B. Wolf, M. Hearst

12. Searching for High-Utility Text in the Biomedical Literature. H. Shatkay, A. Rzhetsky, W. J. Wilbur

13. Automatic Highlighting of Bioscience Literature. H. Wang, S. Bradshaw, M. Light

BioLINK SIG / BioOntologies in ECCB05 Madrid Sept. www.eccb05.org


Competitions

- BioCreAtIvE
  Task 1: Extraction of gene / protein names from text, mapping to identifiers (fly, mouse, yeast)
  Task 2: GO to protein via text for a collection of human genes.
- TREC I, II
- KDD
- JNLPBA
- others
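The mapping step in Task 1 (gene / protein names in text to identifiers) can be pictured as a dictionary lookup. The following is only a minimal sketch under stated assumptions: the identifiers `GENE:0001` and `GENE:0002`, the synonym table, and the function names are hypothetical, and real systems also have to handle ambiguity, spelling variants, and multi-word names.

```python
import re

# Hypothetical synonym dictionary: identifiers are illustrative, not from any real database.
SYNONYMS = {
    "GENE:0001": ["cdc2", "p34cdc2", "cdk1"],
    "GENE:0002": ["pcna"],
}

def build_lookup(synonyms):
    """Invert the identifier -> names table into a lowercase name -> identifier map."""
    lookup = {}
    for gene_id, names in synonyms.items():
        for name in names:
            lookup[name.lower()] = gene_id
    return lookup

def find_mentions(text, lookup):
    """Return (token, identifier) pairs for every token found in the dictionary."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [(tok, lookup[tok.lower()]) for tok in tokens if tok.lower() in lookup]

lookup = build_lookup(SYNONYMS)
mentions = find_mentions("The binding of PCNA to MSH2 and Cdc2 activity", lookup)
print(mentions)
```

MSH2 is not returned here because it is missing from the toy dictionary, which is the kind of recall gap a dictionary-only approach tends to show.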

Text Mining vs. Curation

• Text Mining supports curation

• Curators build and maintain ontologies and databases

• Text Mining profits from data from different resources: ontologies, databases

BioCreAtIvE ©


Text mining in a nutshell

1. Protein / gene names
   - Interspecies
   - Linking to DBs
2. Relations between entities
   - Protein-protein
   - Other entities (regulation, drugs)
   - Function
3. Type of Relation
   - Proteins
   - Metabolic pathways

1. 80% prec/recall (BioCreative)
   - Far less than that
   - Essential (Bioinformatics not NLP)
2. Easy on the surface
   - Best known one (accessible?)
   - Dictionaries
   - Very difficult (i.e. GO in BioCreative)
3. Semantic
   - Summaries very difficult
   - New challenge, unexplored

Hoffmann et al., Science STKE 2005
Krallinger et al., Genome Biology 2005
Krallinger et al., DDToday 2005
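The "80% prec/recall" figure quoted for gene name recognition refers to precision and recall. As a reminder of how those two numbers are computed, here is a small sketch; the function name and the toy gene sets are assumptions for illustration and are not BioCreative data.

```python
def precision_recall(predicted, gold):
    """Precision = correct predictions / all predictions; recall = correct predictions / all gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy example: five predicted names, five gold names, four in common.
predicted = {"PCNA", "MSH2", "LBR", "CDC2", "CAT"}
gold = {"PCNA", "MSH2", "LBR", "CDC2", "TOP2A"}
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```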


Krallinger et al., Genome Biology 2005


Text mining in a nutshell

1. Protein / gene names
   1. Interspecies
   2. Linking to DBs
2. Relations
   1. Protein-protein
   2. Others (regulation, drugs)
   3. Function
3. Type of Relation
   1. Proteins
   2. Metabolic pathways
4. Concepts for groups of genes
   1. Existing
   2. Creating new ones

1. 80% prec/recall (BioCreative)
   1. Far less than that
   2. Essential (not NLP)
2. Easy on the surface
   1. Best known one (accessible?)
   2. Dictionaries
   3. Very difficult (to GO BioCreative)
3. Semantic
   1. Summaries very difficult
   2. New challenge, unexplored
4. Knowledge discovery
   1. Summaries and generalization
   2. Not yet

Hoffmann et al., Science STKE 2005
Krallinger et al., Genome Biology 2005


17 genes (PCNA, CDC2, MSH2, LBR, TOP2A, ...): Cell cycle
  Words: Meiosis, Cyclin, Checkpoint, Interphase, Nucleoplasma, Division, Histone, Replication, Chromatid
  GO codes: DNA replication, DNA metabolism, Cell cycle control
  Sentences:
    PCNA-MSH2: The binding of PCNA to MSH2 may reflect linkage between mismatch repair and replication.
    LBR-CDC2: LBR undergoes mitotic phosphorylation mediated by p34(cdc2) protein kinase.

24 genes (ABCA5, CAT, ELF2, PIM1, WNT2, ...): Unknown
  Words: Dipeptidyl, Prolyl, nmr, Collagen-binding

Blaschke, et al., Funct. Integ. Genomics 2001


AC Intro 1:30-1:45pm Text Mining: Dietrich Rebholz-Schuhmann

7. High-recall Protein Entity Recognition Using a Dictionary. Kou, Cohen, Murphy 1:45-2:10pm

9. Beyond The Clause: Extraction of Phosphorylation Information from Medline Abstracts. Narayanaswamy, Ravikumar, Vijay-Shanker 2:10-2:35pm


Exponential Growth in Data

EMBL: Total Entries / year

Medline: Total Articles / year

Medline: New Articles / year


OFFICIAL 62542 44.46 %

ALIAS 51749 36.79 %

PROTEIN 26363 18.74 %

The 2492 selected genes in the year 2002 were cited 140654 times

Tamames et al., 2005
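The three counts above add up to the 140654 citations quoted on the slide, and the percentages are each count's share of that total. A short check of the arithmetic, using only the numbers from the slide (the variable names are illustrative):

```python
# Citation counts by name type, as listed on the slide.
counts = {"OFFICIAL": 62542, "ALIAS": 51749, "PROTEIN": 26363}

total = sum(counts.values())
shares = {name: 100 * count / total for name, count in counts.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share:.2f} %")
```

The computed shares agree with the listed percentages to within 0.01 percentage points; small differences in the last digit depend on whether values are rounded or truncated.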


Leon et al., 2004

- 98 pathways with more than one step (information available for 73)

- 2111 individual steps

Protein-compound links in abstracts

Pathway                             Steps   Linked
Total                               2111    856 (40 %)
Bacterial chemotaxis                  19     17 (89 %)
Glutathione metabolism                 7      6 (85 %)
Fatty acid biosynthesis -path 1-       9      7 (78 %)

Protein-compound links in sentences

Pathway                             Steps   Linked
Total                               2111    611 (29 %)
Bacterial chemotaxis                  19     13 (65 %)
Two-component system                  85     52 (61 %)
Citrate cycle -TCA cycle-             27     17 (63 %)

KEGG links to literature
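The distinction between links "in abstracts" and links "in sentences" can be illustrated with a simple co-occurrence check. This is a sketch under assumptions, not the procedure of Leon et al.: the sentence splitter is naive, the matching is plain case-insensitive word matching, and `ProtA` / `cmpX` are placeholder names.

```python
import re

def split_sentences(abstract):
    """Naive splitter: break after ., ! or ? followed by whitespace (an assumption for illustration)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]

def contains(text, term):
    """Case-insensitive whole-word match."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE) is not None

def link_level(abstract, protein, compound):
    """Return 'sentence' if both terms share a sentence, 'abstract' if they only share the abstract, else None."""
    if not (contains(abstract, protein) and contains(abstract, compound)):
        return None
    for sentence in split_sentences(abstract):
        if contains(sentence, protein) and contains(sentence, compound):
            return "sentence"
    return "abstract"

same_sentence = link_level("ProtA converts cmpX in this pathway. The reaction is fast.", "ProtA", "cmpX")
same_abstract = link_level("ProtA was purified. Levels of cmpX were measured.", "ProtA", "cmpX")
no_link = link_level("ProtA was purified.", "ProtA", "cmpX")
print(same_sentence, same_abstract, no_link)
```

Sentence-level co-occurrence is the stricter criterion, which is consistent with fewer steps being linked in sentences than in abstracts.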


Evolution of gene names (plot axes: Years, Gene names)

Hoffmann, Valencia TIGs 2003

The evolution of gene names over time is a “scale free” process
- “critical state” system
- the evolution of a gene name cannot be predicted
- some gene names act as attractors of other names


Hoffmann Valencia Nat Genet 2004


SOTA clustering versus significance of Geisha terms.

Oliveros, Blaschke, GIW 2000 ©


SOTA and GEISHA mixed information

Blaschke, Herrero, Dopazo, Valencia 2002

Expression based clustering

Weight (expression) + Weight (text)

Term (text) based clustering
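One way to read "Weight (expression) + Weight (text)" is as a weighted sum of an expression-based and a text-based dissimilarity between two genes. The slide does not say how the two weights or distances are defined, so everything below is an assumption: Euclidean distance for expression profiles, one minus Jaccard overlap for term sets, and equal default weights. It is a sketch of the idea, not the SOTA algorithm.

```python
import math

def expression_distance(profile_a, profile_b):
    """Euclidean distance between two expression profiles (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(profile_a, profile_b)))

def text_distance(terms_a, terms_b):
    """1 - Jaccard overlap between two sets of literature terms (assumed measure)."""
    union = terms_a | terms_b
    if not union:
        return 0.0
    return 1.0 - len(terms_a & terms_b) / len(union)

def combined_distance(profile_a, profile_b, terms_a, terms_b, w_expr=0.5, w_text=0.5):
    """Weighted sum of the expression and text distances; the weights are hypothetical parameters."""
    return (w_expr * expression_distance(profile_a, profile_b)
            + w_text * text_distance(terms_a, terms_b))

d = combined_distance([0.0, 0.0], [3.0, 4.0], {"cyclin", "mitosis"}, {"mitosis", "histone"})
print(d)
```

Setting `w_text=0` reduces this to expression based clustering and `w_expr=0` to term based clustering, the two extremes named on the slide.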


Stable clusters > central processes where expression and functional information agree

Unstable groups > contradictory information

“jumping” genes, divergent expression and functional classifications.

(Genes with very unstable behavior > related to insufficient information)