Chemical Entity Extraction Using the chemicalize.org Technology

Josef Scheiber

Novartis Pharma AG – NITAS/TMS

Where the story of this project started ...

Dreirosenbrücke, Novartis Campus

A day in October 2008

Some time around 7:45 in the morning ...

Vision for text mining

Integration of chemical and biological knowledge

Mining for Chemical Knowledge - Rationale

- Make text corpora searchable for chemistry

- Generate chemistry databases for use in research based on Scientific Papers or Patents

- Link Chemical Information with further annotation in an automated way for e.g. Chemogenomics applications

- Patent analysis for MedChem projects

Connection table

Mining for Chemical Knowledge - Rationale

Information on compounds targeting GPCRs

2005: >14,000 publications

1992: 256 articles & 34 patents

1988: 9 journal articles

HELP

Information explosion

Source: Banville, Debra L. “Mining chemical structural information from the drug literature.” Drug Discovery Today, Number 1/2 Jan. 2006, p.35-42

http://www.mosaicsoftware.co.uk/WebsiteImages/Pile%20of%20paper.jpg


Example: Project Prospect – Royal Society of Chemistry

Enhancing Journal Articles with Chemical Features

This helps you identify other articles that discuss the same molecule

Mining for Chemical Knowledge – Focus for today

- Make text corpora searchable for chemistry

- Generate chemistry databases for use in research based on Scientific Papers or Patents

- Link Chemical Information with further annotation in an automated way for e.g. Chemogenomics applications

- Patent analysis for MedChem projects

Connection table

A use case for successful patent mining (molecules you sometimes find in your inbox ;-) )

Vardenafil (2003, Bayer) – € 1.24 billion (USD 1.6 billion)

Sildenafil (1998, Pfizer) – € 11.7 billion (USD 15.1 billion)

Slide inspired by an example from Steve Boyer/IBM; sales data from Prous Integrity database

http://avalonws.eu.novartis.net:8080/avalon/renderer/test.mol?smiles=CCCC1=NC(=C2N1NC(=NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)CC)OCC)C

http://avalonws.eu.novartis.net:8080/avalon/renderer/test.mol?smiles=CCCC1=NN(C2=C1NC(=NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)C)OCC)C

Conventional Database Building

Facts – current standard

... (ACS) owes most of its wealth to its two 'information services' divisions — the publications arm and the Chemical Abstracts Service (CAS), a rich database of chemical information and literature. Together, in 2004, these divisions made about $340 million — 82% of the society's revenue — and accounted for $300 million (74%) of its expenditure. Over the past five years, the society has seen its revenue and expenditure grow steadily ...

Source: ACS homepage

Facts

- Established application
- Straightforward use
- De facto gold standard
- Unique data source

- Very costly
- No structure export at a reasonable price
- Very limited in large-scale follow-up analysis
- Most recent patents not available

Not data (search), but integration, analysis and insight, leading to decisions and discovery

Now – What would be the perfect solution?

All patent offices require applicants to provide all claimed structures in a machine-readable version, available for one-click download

Text extraction

Definition: Extract all molecules that are mentioned in a patent text of interest, convert them to structures and make them available in machine-readable format
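The three steps in this definition (find names, convert to structures, export) can be illustrated with a minimal sketch. It is not how chemicalize.org or any other listed tool works: the lookup table `NAME_TO_SMILES` and the function names are hypothetical stand-ins, and a real system would use a proper name recognizer and name-to-structure converter instead of a hand-written dictionary.

```python
import re

# Hypothetical lookup table standing in for a real name-to-structure converter.
NAME_TO_SMILES = {
    "ethanol": "CCO",
    "acetic acid": "CC(=O)O",
    "benzene": "c1ccccc1",
}


def extract_structures(text):
    """Return (name, SMILES) pairs for known names, in order of appearance."""
    hits = []
    for name, smiles in NAME_TO_SMILES.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), name, smiles))
    hits.sort()  # order by position in the text
    return [(name, smiles) for _, name, smiles in hits]


def to_smiles_lines(pairs):
    """Serialise the pairs as tab-separated 'SMILES<TAB>name' lines."""
    return "\n".join(f"{smiles}\t{name}" for name, smiles in pairs)


sample = "The residue was dissolved in ethanol and washed with acetic acid."
print(to_smiles_lines(extract_structures(sample)))
```

SMILES is used here only because it is a compact machine-readable format; a connection table (e.g. a MOL file) would serve the same purpose.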

Mining for Chemical Knowledge – Technologies from providers

Text entity recognition

(a) Extractors (IUPAC names)

- TEMIS Chemical Entity Relationships Skill Cartridge
- Accelrys Pipeline Pilot extractor (Notiora)
- Fraunhofer (ProMiner Chemistry)
- Chemaxon (chemicalize.org)
- Oscar (Corbett, Murray-Rust et al.)
- SureChem
- IBM ChemFrag Annotator

(b) Converters (names → connection table)

- CambridgeSoft name=struct
- Openeye Lexichem
- Chemaxon

Image recognition

- OSRA (NIH)
- Clide Pro (Keymodule Ltd.)
- Fraunhofer chemoCR
- ChemReader

The objective

To provide a tool that offers sophisticated text analysis methods to NIBR scientists and thereby leverages the methods of TMS

Mining for Chemical Knowledge – Novartis Tools – the chemicalize-technology is working under the hood!

Clipboard Analysis

Patent text

Identified structures

View structure onMouseOver

Export to other applications

Mining for Knowledge – Novartis Tools

Input example: J Med Chem paper

Mining for Chemical Knowledge – Use Case

A medicinal chemist wants to synthesize a competitor compound as a tool compound for their own project

- Identification of core scaffold
- Analysis of substitution patterns

This enables the identification of the compounds most representative of a competitor patent

Example – A text-based patent

Automated text extraction: 452 compounds

Reference: 636 compounds

71%

A patent example

Example – An image-based patent

Text extraction is not suitable for this case: it finds only a meager 40 molecules, against 1129 in the reference. Why?
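The coverage figures for the two patent examples can be recomputed as a quick check. This sketch assumes that every extracted compound also appears in the reference set, which the slides do not state explicitly; the function name `coverage` is illustrative.

```python
def coverage(extracted, reference):
    """Fraction of reference compounds recovered by automated extraction."""
    if reference <= 0:
        raise ValueError("reference count must be positive")
    return extracted / reference


text_based = coverage(452, 636)    # text-based patent example
image_based = coverage(40, 1129)   # image-based patent example

print(f"text-based patent:  {text_based:.1%}")   # 71.1%
print(f"image-based patent: {image_based:.1%}")  # 3.5%
```

The gap between roughly 71% and under 4% is what motivates image recognition as a complement to text extraction.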

An entirely image-based patent example

Language issues – e.g. Japanese patents

Encountered problems

OCR (Optical Character Recognition)!!

USPTO and WIPO patents are now available as full text in most cases

Typos!

Name2Struct problems (less of an issue here)
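One common way to soften OCR errors and typos is fuzzy matching against a list of known names. The sketch below uses Python's standard `difflib` for this; the name list `KNOWN_NAMES`, the function `correct_name` and the cutoff of 0.8 are assumptions for illustration, not part of the tools described in this talk.

```python
import difflib

# Hypothetical dictionary of known compound names.
KNOWN_NAMES = ["sildenafil", "vardenafil", "tadalafil"]


def correct_name(token, cutoff=0.8):
    """Return the closest known name for a possibly misread token, or None."""
    matches = difflib.get_close_matches(
        token.lower(), KNOWN_NAMES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


print(correct_name("si1denafil"))  # OCR read the letter "l" as the digit "1"
print(correct_name("benzene"))     # not close to any known name
```

A high cutoff keeps false corrections rare, at the price of missing heavily damaged names.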

IBM initiative Patent Mining / ChemVerse database (Steve Boyer)

The objective is to automatically extract all molecules from all patents available and make them searchable in a database

They leverage cloud computing and have access to all full-text patents

This is absolutely going in the right direction

They annotate the molecules with information from freely available databases

Future ideas: Patent Analysis

Markush translation, Image+Target

Ranking capabilities of outcome for User

“Blurred” dictionaries for translating terms like aryl, cycloalkyl etc.
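One possible reading of such a dictionary is a mapping from a generic term to a few concrete example substituents. The sketch below is an assumption about what this could look like, not a description of an existing tool: the entries are illustrative and far from exhaustive, and `*` marks the attachment point in each SMILES fragment.

```python
import re

# Illustrative, non-exhaustive mapping of generic terms to example substituents.
BLURRED_DICTIONARY = {
    "aryl": ["*c1ccccc1"],  # phenyl
    "cycloalkyl": [
        "*C1CC1",      # cyclopropyl
        "*C1CCC1",     # cyclobutyl
        "*C1CCCC1",    # cyclopentyl
        "*C1CCCCC1",   # cyclohexyl
    ],
}


def find_generic_terms(text):
    """Return the generic terms found in text with their example substituents."""
    found = {}
    for term, examples in BLURRED_DICTIONARY.items():
        # Word boundaries prevent "aryl" from matching inside "heteroaryl".
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE):
            found[term] = examples
    return found


print(find_generic_terms("wherein R1 is aryl or cycloalkyl"))
```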

Select and annotate as entity: on-the-fly error correction

Result goes into a database: crowdsourcing efforts to improve and store results

Suggest functionality

To enable true Patinformatics analyses ...

Definition by Tony Trippe:

Acknowledgements

Alex Fromm Katia Vella Olivier Kreim

Therese Vachon Daniel Cronenberger Pierre Parisot Martin Romacker Nicolas Grandjean

NITAS/TMS Clayton Springer Naeem Yusuff Bharat Lagu

And many other people in different divisions of NIBR for their support
