Keynote conference talk - June 17th 2015
TRANSCRIPT
Current challenges and opportunities for the text mining of interactions
Raul Rodriguez-Esteban
Data Science, pRED Informatics
Roche Innovation Center Basel
pRED Informatics – your scientific informatics experts
We are:
Your scientific data experts within Roche pRED
Connecting research, knowledge and people across Roche pRED
Scientists and informatics professionals united in a single organization
Information technology scouts for Roche pRED
Data Science at Roche
Applies the concept of mixed informatics capability teams to retrieve and analyze data to support drug project decision-making.
Capabilities
• Cheminformatics
• Bioinformatics
• Text mining
• Information science
• Genomics
• Genetics
• ...
Text mining in Data Science
Main goal
Supporting the decision making of R&D projects with tools and expertise in text mining.
Some strategic themes
• Maintaining a text mining infrastructure for R&D.
• Testing new text mining products that could be beneficial to R&D projects.
• Proposing and implementing initiatives to improve our text mining capabilities.
• Increasing the value of our open-domain and licensed content.
The flow of the talk
1. Curation
2. Name recognition
3. Interactions
CURATION
Curation drives our text mining strategies
overhead = false positives / true positives = (1 - precision) / precision
With precision of 75%, you get 1 false positive for every 3 true positives. Overhead is 1/3=33%.
Acceptable overhead
With precision of 50%, overhead is 100%, you get one false positive for every true positive.
Unacceptable overhead
It all depends on how much curation time is available.
1. In our work, we almost always need curation: curation is a crucial constraint.
2. Overhead determines the need for curation.
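The overhead arithmetic on this slide can be checked with a few lines of code. This is a minimal sketch, assuming overhead is defined as false positives per true positive, which is what the 75% and 50% examples imply; the function name `curation_overhead` is made up for illustration.

```python
def curation_overhead(precision: float) -> float:
    """False positives a curator must discard per true positive kept.

    Assumes overhead = (1 - precision) / precision, as implied by the
    75% -> 33% and 50% -> 100% examples on the slide.
    """
    if not 0.0 < precision <= 1.0:
        raise ValueError("precision must be in (0, 1]")
    return (1.0 - precision) / precision


# Print the overhead for a few precision levels.
for p in (0.9, 0.75, 0.5, 0.25):
    print(f"precision {p:.0%} -> overhead {curation_overhead(p):.0%}")
```

Because precision sits in the denominator, the overhead grows faster and faster as precision drops, which is why a modest loss of precision can make curation impractical.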
Crowdsourcing: no, we can’t
Burger et al. (2014)
Curation
Mining
Corollaries
overhead = false positives / true positives = (1 - precision) / precision
1. We probably don’t need text mining systems with precision above 75%-80%.
2. However, when precision goes down, overhead goes up very quickly.
3. But how does this all apply in practice?
We use multiple strategies so that we can choose one depending on the curation effort required or available. We adapt our text mining strategy to our curation resources.
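One way to picture "adapting the strategy to the curation resources" is as a small selection problem. The sketch below is only an illustration: the three strategies, their precision and recall figures, and the helper names `results_to_review` and `pick_strategy` are all hypothetical, not measurements or tools from the talk.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Strategy:
    name: str
    precision: float  # assumed estimate, not a measured value
    recall: float     # assumed estimate, not a measured value


def results_to_review(strategy: Strategy, true_facts: int) -> float:
    """Expected number of results a curator must read (true + false positives)."""
    true_positives = strategy.recall * true_facts
    return true_positives / strategy.precision


def pick_strategy(strategies: Sequence[Strategy], true_facts: int,
                  review_budget: float) -> Optional[Strategy]:
    """Pick the highest-recall strategy whose output fits the curation budget."""
    affordable = [s for s in strategies
                  if results_to_review(s, true_facts) <= review_budget]
    return max(affordable, key=lambda s: s.recall) if affordable else None


# Hypothetical operating points along the precision-recall trade-off.
STRATEGIES = [
    Strategy("high precision", precision=0.90, recall=0.30),
    Strategy("compromise", precision=0.75, recall=0.50),
    Strategy("high recall", precision=0.40, recall=0.80),
]

chosen = pick_strategy(STRATEGIES, true_facts=100, review_budget=100)
print(chosen.name if chosen else "no strategy fits the budget")
```

With a larger budget the same call would move toward the high-recall strategy; with a very small one it returns `None`, meaning no listed strategy is affordable.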
The multiple strategy approach: many tools for the same problem
The multiple strategy approach
[Figure: recall versus precision, with three strategy regions: high recall, compromise, high precision]
Example for gene names
[Figure: recall versus precision for gene names, with three strategies: dictionary-based; gene name identification (machine learning); gene name normalization (machine learning)]
Name recognition
Our current approach to name recognition
Multiple Ontologies
Machine Learning
Open Source and Proprietary
• BANNER for genes / proteins named entity recognition (CBioC: Collaborative Bio Curation, Arizona State University)
• ChemSpot for chemical named entity recognition (Institut für Informatik, Humboldt-Universität zu Berlin)
• ChemAxon for chemical name-to-structure (ChemAxon)
• DNorm for disease named entity recognition + normalization (MeSH, OMIM) (NCBI: National Center for Biotechnology Information, National Institutes of Health)
Open source packages come from a research environment and were not easy to use in a production environment.
I2E from Linguamatics
• General-purpose rule-based
– Ontologies
– Regular expressions
– Shallow parsing
– Boolean logic
• Interactive
– Pre-indexed
– Graphical interface
– Highlighted, compact output
I2E = Interactive Information Extraction
Integration through UIMA
Give the components the same interface for seamless usage:
• XML file for component description
• Parameter configuration
• Shared resources initialization
• Parallel computing (dedicated cluster or multiple servers)
• Data processing
• Access results through indexes
Interactions
Current state for protein-protein interactions
GOOD
BAD
First, the bad news
Krallinger et al. (2008) BioCreative II, full text and gene name normalization, F-score of 35%
Pyysalo et al. (2008) Change of corpus lowers F-score
Kabiljo et al. (2009) Change of corpus lowers F-score, keywords + entity recognition is competitive with machine learning
Tikk et al. (2010) Change of corpus lowers F-score, rule-based (RelEx) is competitive with machine learning
Why is performance bad? The leaky and noisy pipeline for PPIs
1. Identify articles that contain interactions
2. Identify and normalize gene names within those articles
3. Identify interactions between those gene names
Along the pipeline: loss of true positives, addition of false positives
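A rough way to see why a leaky pipeline hurts so much is to multiply the per-stage figures. The sketch below assumes that each stage only sees what the previous stage kept and that stage errors are independent, which is a simplification; the stage values are invented for illustration and are not results from any of the cited papers.

```python
from math import prod
from typing import Sequence


def end_to_end(stage_rates: Sequence[float]) -> float:
    """Compound per-stage rates, assuming independent stages.

    Used for recall: a true interaction survives only if every stage keeps it.
    Under the same assumption it also approximates precision: an extracted
    interaction is correct only if every stage got its part right.
    """
    return prod(stage_rates)


# Hypothetical per-stage recall for: article triage, gene name
# normalization, interaction detection.
stage_recall = [0.8, 0.8, 0.8]
print(f"end-to-end recall: {end_to_end(stage_recall):.1%}")
```

Even when every stage looks respectable on its own, the compounded figure ends up well below any single stage.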
Why is performance bad? The problem is ill-defined
Ill-defined problems are those that do not have clear goals, solution paths, or expected solution. (Wikipedia)
• Every gold standard corpus defines interactions somewhat differently.
• “Interactions” is a concept coined for bioNLP. Outside of bioNLP it means something else.
• Interactions might be too broad a concept.
And now for some good news
• Tikk et al. (2010): “[…] we think that there is also a need to complement the currently predominant approach, treating all interactions as equally important, with more specific extraction tasks. To this end, it is important to create specialized corpora, such as those for the extraction of regulation events or for protein complex formation.”
• Some specialized systems perform better than PPI systems:
– Phosphorylation (Tudor et al., 2015)
• Some other interactions besides PPI work better:
– Drug-drug: DDIExtraction 2013, F-score of 65.1%
– Expression (Neves et al., 2013)
• More generally, it makes sense to have multiple strategies for extracting interactions, both for PPIs and for other types of interactions.
Our current multiple strategies
1 - High precision, specialized for biomarkers (DiMA)
2 - Flexible, all-purpose (Linguamatics I2E)
3 - High recall for protein-protein interactions (University of Zürich)
Rule-based system for biomarkers: Disease Marker Associations Database (DiMA)
Generation of the Knowledge Base
Extracting Gene-Disease-Relationships
Genes / Proteins (Linguamatics)
Relationship (query patterns)
Diseases (Linguamatics)
Standardized Relationships: Altered Expression Genetic Variation Role Marker Response Marker Regulatory Modification Negative Association
Query Development
• Multiple variations for each corpus
• 50+ sub-queries
• Variations of linguistic patterns
Sentence: The ERBB2 gene (HER2/neu) is overexpressed in many human breast cancers.
Standardized Relationships:
Gene: ERBB2 (EntrezID: 2064, Score: 87)
Relationship: Altered Expression (“is overexpressed”)
Disease: Breast Neoplasms (“breast cancers”)
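The ERBB2 example can be imitated with a toy rule to show the general shape of pattern-based extraction. This is not how DiMA or Linguamatics I2E is implemented: the three tiny dictionaries and the `extract` function are hypothetical stand-ins for real ontologies and the 50+ sub-queries mentioned above.

```python
import re

# Tiny illustrative dictionaries (assumptions, not the real ontologies).
GENES = {"ERBB2": "2064"}                          # surface form -> Entrez ID
TRIGGERS = {"is overexpressed": "Altered Expression"}
DISEASES = {"breast cancers": "Breast Neoplasms"}  # surface form -> preferred term


def extract(sentence: str) -> list:
    """Return (gene, Entrez ID, relationship, disease) tuples found by one rule.

    The rule is: gene mention, then trigger phrase, then disease mention,
    in that order within the same sentence.
    """
    results = []
    for gene, entrez_id in GENES.items():
        for trigger, relationship in TRIGGERS.items():
            for mention, disease in DISEASES.items():
                pattern = (rf"\b{re.escape(gene)}\b.*?"
                           rf"\b{re.escape(trigger)}\b.*?"
                           rf"\b{re.escape(mention)}\b")
                if re.search(pattern, sentence):
                    results.append((gene, entrez_id, relationship, disease))
    return results


sentence = ("The ERBB2 gene (HER2/neu) is overexpressed "
            "in many human breast cancers.")
print(extract(sentence))
```

A single ordered pattern like this is precise but brittle, which is one reason a production system needs many variations of each linguistic pattern.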
Flexible strategy: rule-based + named-entity recognition
BANNER, DNorm
ChemSpot, ChemAxon
Multiple ontologies
Machine learning named-entity recognition
Rules
Ontologies
High recall strategy: Ontogene
High recall systems are typically very noisy.
Strategies to reduce the need for curation:
1 - Fact-centric
2 - Collection-wide
3 - Ranked
Collection-wide and fact-centric view
• Focusing more on facts than on mentions
– The same fact can be redundantly mentioned many times in many documents.
– However, mentions of facts may also reinforce each other.
Rzhetsky et al. (2006)
• Aggregate view of the literature rather than document view.
– We are interested in facts across documents, the “literature-wide” view of a fact.
Ranked, not filtered
Ranking has been somewhat disregarded in text mining.
But not all results are created equal if you have to curate them.
The goal is to present first the results with highest quality and best biological evidence. Users are warned that they will find noisy results.
Biological evidence
Mining quality
Our simple approach for biological evidence: based on number of document mentions.
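The fact-centric, collection-wide ranking can be sketched as a simple aggregation. This is a minimal illustration, assuming a fact is an unordered gene pair and that the evidence score is the number of distinct documents mentioning it; the function name `rank_facts` and the document identifiers are made up, and the actual system may combine this with other signals such as mining quality.

```python
from collections import defaultdict


def rank_facts(mentions):
    """Collapse mentions into facts and rank them by document support.

    mentions: iterable of (doc_id, gene_a, gene_b) tuples.
    Returns a list of ((gene_x, gene_y), n_documents), best supported first.
    """
    docs_per_fact = defaultdict(set)
    for doc_id, gene_a, gene_b in mentions:
        fact = tuple(sorted((gene_a, gene_b)))  # unordered pair = one fact
        docs_per_fact[fact].add(doc_id)         # repeats in one document count once
    return sorted(
        ((fact, len(docs)) for fact, docs in docs_per_fact.items()),
        key=lambda item: (-item[1], item[0]),   # most documents first, then by name
    )


# Made-up document identifiers, for illustration only.
mentions = [
    ("doc1", "TP53", "MDM2"),
    ("doc2", "MDM2", "TP53"),
    ("doc2", "TP53", "MDM2"),
    ("doc3", "BRCA1", "BARD1"),
]
for fact, n_docs in rank_facts(mentions):
    print(fact, n_docs)
```

Nothing is filtered out: poorly supported facts simply sink to the bottom of the list, where a curator with limited time may never need to look.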
Bonus: trained using a database, not a corpus
Gold-standard corpora are typically small and expensive to develop.
Using a biological database we can dramatically increase the amount of training data.
BioGRID is a leading biological database which includes information of the type “in a certain article, proteins X and Y are said to interact”.
A training set based on BioGRID covers interactions from 20,928 PubMed abstracts describing physical interactions in humans.
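Training from a database instead of a corpus can be pictured as automatic labeling of candidate pairs. The sketch below is a generic illustration of that idea, not the procedure actually used with BioGRID: the record layout `(pmid, gene_a, gene_b)`, the function name `label_candidates`, and the identifiers are assumptions.

```python
def label_candidates(candidates, curated):
    """Label text-mined candidate pairs using curated database records.

    candidates: (pmid, gene_a, gene_b) co-occurrences found in abstracts.
    curated:    (pmid, gene_a, gene_b) records stating that the article
                reports an interaction between the two genes.
    Returns (pmid, gene_a, gene_b, is_positive) tuples.
    """
    known = {(pmid, frozenset((a, b))) for pmid, a, b in curated}
    return [
        (pmid, a, b, (pmid, frozenset((a, b))) in known)
        for pmid, a, b in candidates
    ]


# Made-up identifiers, for illustration only.
curated = [("111", "TP53", "MDM2")]
candidates = [
    ("111", "MDM2", "TP53"),   # same article, same pair: positive
    ("111", "TP53", "EP300"),  # same article, pair not curated: negative
    ("222", "TP53", "MDM2"),   # pair curated, but for another article: negative
]
for row in label_candidates(candidates, curated):
    print(row)
```

Labels obtained this way are noisy, since a pair missing from the database is not necessarily a non-interaction, but they are available at a scale that hand-annotated corpora cannot match.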
The interface of the system: gene-centric
Gene A
– Interacting gene B1
• Example mention B1.1
• Example mention B1.2
• …
– Interacting gene B2
• Example mention B2.1
• Example mention B2.2
• …
– Interacting gene B3
• Example mention B3.1
• Etc.
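The gene-centric layout corresponds to a two-level grouping: gene, then interacting gene, then example mentions. The sketch below is one possible way to build and print such a view; the function names and the ordering of partners by number of mentions are assumptions, not a description of the actual interface.

```python
from collections import defaultdict


def gene_centric_view(mentions):
    """Group mentions as gene -> interacting gene -> example sentences.

    mentions: iterable of (gene_a, gene_b, sentence) tuples.
    """
    view = defaultdict(lambda: defaultdict(list))
    for gene_a, gene_b, sentence in mentions:
        view[gene_a][gene_b].append(sentence)
        if gene_b != gene_a:  # index the pair from both sides, once each
            view[gene_b][gene_a].append(sentence)
    return view


def render(view, gene):
    """Render one gene's partners, most-mentioned partner first."""
    lines = [gene]
    partners = sorted(view[gene], key=lambda p: -len(view[gene][p]))
    for partner in partners:
        lines.append(f"  - {partner}")
        lines.extend(f"      * {s}" for s in view[gene][partner])
    return "\n".join(lines)


example = [
    ("A", "B1", "mention B1.1"),
    ("A", "B1", "mention B1.2"),
    ("B2", "A", "mention B2.1"),
]
print(render(gene_centric_view(example), "A"))
```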
The interface of the system
Take home
• Current state of protein-protein interaction extraction is poor but strategies to cope are available.
• A multiple strategy approach makes it feasible to address different types of questions.
• We have recently developed a high-recall strategy together with University of Zürich. The approach is collection-wide, fact-centric, ranked and trained on BioGRID.
• Future approaches may increasingly hinge on specialization in certain types of interactions and certain tasks.
Acknowledgements
Roche pREDi: Hermann Biller, Barbara Endler-Jobst, Ralf Jaeger, Aurélien Ooms, Martin Romacker
Roche Diagnostics: Martin Baron
Uni Zürich: Simon Clematide, Lenz Furrer, Hernani Marques, Fabio Rinaldi
NCBI: Robert Leaman
Doing now what patients need next