Closing the gap between chemistry and biology: Joining between text tombs and databases




TRANSCRIPT

Page 1: Closing the gap between chemistry and biology: Joining between text tombs and databases

[1]

Closing the gap between chemistry and biology: Joining between text tombs and databases

Presentation for Uppsala University Department of Neuroscience, Sept 2013

By Christopher Southan

Curator for IUPHARdb, http://www.guidetopharmacology.org/

Queen's Medical Research Institute, University of Edinburgh

Email: [email protected]

Twitter: http://twitter.com/#!/cdsouthan

Blog: http://cdsouthan.blogspot.com/

LinkedIn: http://www.linkedin.com/in/cdsouthan

TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm

Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications

Presentations: http://www.slideshare.net/cdsouthan

[2]

Abstract

• Progress in the biomedical sciences is critically dependent on explicit chemical structures and bioactivity results described in text. This applies across drug discovery, pharmacology, chemical biology, and metabolomics. However, the entombing of the majority of these structures and associated data within patents, papers, abstracts and web pages has been a major barrier to progress. This presentation introduces the current public information flow from documents and its associated barriers, such as inadequate author specification of structures, journal paywalls precluding text mining and the patchiness of MeSH chemistry annotation for PubMed-to-PubChem connectivity. It then reviews trends that are lowering these barriers. These include the Google merge of over 50 million InChIKeys from PubChem, ChemSpider and UniChem, ChEMBL containing SAR for 0.8 million structures from 50K medicinal chemistry papers, over 20 million abstracts in PubMed, and full-text open patent chemistry in SureChemOpen bringing PubChem patent-extracted structures to 15 million. In addition, options such as Open Lab Books and figshare are expanding the choices for surfacing new structures. Methods will be outlined for establishing document-to-document and document-to-database links via chemical structures. These include the PubChem toolbox, protein targets in UniProt, PubChem BioAssay, ChEMBL indexing in UK PMC, SureChemOpen, chemicalize.org for text name-to-structure conversion, OSRA for image-to-structure conversion, Venny for set comparisons and InChIKey searching in Google [1]. Combined use of these approaches to make joins between patents, papers, abstracts, chemical database entries, SAR data and drug target protein sequences will be illustrated with recent novel antimalarial lead compounds, patent-only BACE2 inhibitors and company code numbers in the NCATS repurposing list.

[3]

The Chem < - > Bio Join

• Chemistry that does something: drug discovery, drug development, toxicology, pharmacology, systems chemical biology (probes), structural biology, metabolomics, chemical ecology, etc.

• With the exception of some PubChem BioAssays, the majority of data is still primarily archived in documents

[4]

Getting chemistry out of text is difficult

[5]

That’s why we used to have to pay

73 million

4,059,232

5.1 million

~ 20,000

[6]

The Chemical Representational Hextet: Different usage between documents and databases

?

[7]

A recent NRDD article

• Just images and code numbers

• No PubChem or ChemSpider IDs

• No SMILES or InChIs

• No molfiles for download

• No links in or out

• No MeSH > PubChem substances

• Some cited sources might have IUPAC names

[8]

You can dig out structures from text for free, but it's hard work

[9]

What’s out there for free

• InChIKey in Google ~ 50 million

• PubChem = 48 million

• PubChem ROF + 250-800 Mw (lead-like) = 31 million

• ChemSpider = 28 million

• PubChem all docs (papers & patents) = 16 million

• PubChem patents = 15 million

• SureChemOpen = 14.5 million

• PubChem journal sources (PubMed + ChEMBL) = 1 million

[10]

Medicinal chemistry patents (tombs with lids off)

• WO, C07D = 72,737 (assignee vs. year plots below)

• ~ 50 novel structures with SAR per patent = ~ 3.5 million bioactives

• Paradoxically now completely open for chemistry or any mining
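The bioactives estimate above is a simple product of the two figures on the slide. A quick check, assuming the slide's rough average of about 50 structures per patent (the variable names are illustrative only):

```python
# Back-of-envelope check of the slide's estimate; both inputs come from the slide.
wo_c07d_patents = 72_737      # WO publications classified under C07D
structures_per_patent = 50    # rough average of novel structures with SAR

estimated_bioactives = wo_c07d_patents * structures_per_patent
print(f"{estimated_bioactives:,} structures "
      f"(~{estimated_bioactives / 1e6:.1f} million)")
```

The product is in line with the slide's rounded figure of ~ 3.5 million.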

[11]

PubMed: ~ 10% with chemistry (guarded tombs)

“Free full text” = 575,513 (24%)

[12]

Growth (escaping the tombs)

• Patent “big bang” (SureChem & SCRIPDB in 2012)

• Literature “slow burn” (ChEMBL 2009 jump)

• Paradox: patents:papers 15:1 (both sets of CIDs cumulative)

[13]

Databases <> structures <> documents: links, but few reciprocal

Abstracts

Patents

Papers

15 mill

0.2 mill (mainly MeSH)

0.8 mill (ChEMBL)

12K

[14]

Triaging document or webpage chemistry

• Identify the structure specification types, e.g.

– Semantic names (all sources)

– Code names (press releases, papers and abstracts)

– IUPAC names (papers, patents and abstracts)

– Images (papers, patents, & Google images)

– SMILES (open lab books)

– InChI strings (open lab books)

– SDF files (open lab books, & github)

• Convert these to a structure (e.g. SDF, SMILES, InChI) then:

– Search InChIKey in Google

– Search major databases

– Search SureChemOpen

– Compare extracted sets for intersects and diffs

– Extend exact match connectivity with similarity searching
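The first triage step, identifying which specification type a string is, can be roughed out with a few pattern checks. This is a minimal sketch, not a validated parser: the regular expressions and the function name `classify_specification` are assumptions made for illustration, and unspaced IUPAC names containing digits can be misread as SMILES candidates.

```python
import re

# Heuristic patterns (illustrative assumptions, not a validated chemistry parser).
INCHI_RE = re.compile(r"^InChI=1S?/")
# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# Company-style code: short upper-case prefix plus digits, e.g. MMV390048.
CODE_NAME_RE = re.compile(r"^[A-Z]{2,5}[- ]?\d{3,7}[A-Z]?$")
# Characters that may appear in a SMILES string.
SMILES_CHARS_RE = re.compile(r"^[A-Za-z0-9@+\-\[\]\(\)=#$/\\%.:*]+$")


def classify_specification(text: str) -> str:
    """Guess which structure specification type a string represents."""
    token = text.strip()
    if INCHI_RE.match(token):
        return "InChI"
    if INCHIKEY_RE.match(token):
        return "InChIKey"
    if CODE_NAME_RE.match(token):
        return "code name"
    # Require at least one bond, branch, bracket or ring-closure symbol so that
    # plain words are not reported as SMILES.
    if SMILES_CHARS_RE.match(token) and re.search(r"[=#()\[\]]|\d", token):
        return "SMILES (candidate)"
    return "name (semantic or IUPAC)"


for example in ("MMV390048", "InChI=1S/CH4/h1H4", "aspirin"):
    print(example, "->", classify_specification(example))
```

Anything labelled a SMILES candidate or a name still needs conversion by a real tool before searching; the sketch only decides which tool to reach for.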

[15]

Triage example: a new antimalarial

The MMV390048 code name is linked to an image in press reports but has no matches in PubChem or PubMed

[16]

Images: convert and search

Real chemists sketch them in a jiffy;

the rest of us can use OSRA: Optical Structure Recognition Application

(after editing, CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3)

[17]

Making connections: image > structure > database > documents

CID 53311393 > ChEMBL > PubMed

SureChem or chemicalize.org > patent
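Once a CID is in hand, its InChIKey is the handle for the onward joins. The slides use the PubChem web pages; as an assumption about how the same lookup could be scripted, the sketch below uses PubChem's PUG REST interface. The helper names are hypothetical, and `fetch_inchikey` needs network access, so only the URL builder is exercised here.

```python
import json
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def inchikey_url(cid: int) -> str:
    """Build the PUG REST URL that returns the InChIKey for a PubChem CID."""
    return f"{PUG_REST}/compound/cid/{cid}/property/InChIKey/JSON"


def fetch_inchikey(cid: int) -> str:
    """Fetch the InChIKey for a CID (requires network access)."""
    with urllib.request.urlopen(inchikey_url(cid), timeout=30) as response:
        payload = json.load(response)
    # The key sits in the first row of the returned property table.
    return payload["PropertyTable"]["Properties"][0]["InChIKey"]


print(inchikey_url(53311393))
```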

[18]

Patent SAR from WO2011086531: Collating activities via SureChemOpen

CID 53311393 >

[19]

Patent SAR results: top-20 from 39 IC50s

[20]

Results > figshare

http://figshare.com/articles/Patent_SAR_for_MMV390048/657979

[21]

Structures > MyNCBI

http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1zWhcobieZbIouGfUdsdbHek5/.

[22]

SAR Table: iOS app from Molecular Materials Informatics

SureChemOpen strucs -> manual data collation -> PubChem CIDs -> SDF -> Dropbox -> SAR Table -> edit in data, R-group decompose -> share

[23]

InChIKey in Google: instant orthogonal joining
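An InChIKey is a fixed-format 27-character string, which is what makes it work as an exact-match search term. The sketch below checks the format and builds a quoted Google query URL; the function names are illustrative, and searching on the first 14-character block alone (the connectivity layer) is shown as one possible way to loosen the match to related forms of the same skeleton.

```python
import re
from urllib.parse import urlencode

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def connectivity_block(inchikey: str) -> str:
    """Return the first 14-character block, which encodes the connectivity."""
    key = inchikey.strip().upper()
    if not INCHIKEY_RE.match(key):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return key.split("-")[0]


def google_query_url(term: str) -> str:
    """Build a Google search URL for an exact (quoted) match on the term."""
    return "https://www.google.com/search?" + urlencode({"q": f'"{term}"'})


aspirin_key = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # aspirin, used only as a format example
print(google_query_url(aspirin_key))
print(google_query_url(connectivity_block(aspirin_key)))
```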

[24]

Chemicalize.org: 413 strucs from WO2011086531

CID 53311393 ->

[25]

Using OPSIN and chemicalize.org to fix recalcitrant IUPACs from WO2011086532

Can quasi-manually extract ~ 10 more “split IUPAC” examples

[26]

Clustering document extraction sets: CheS-Mapper

WO2011086531 -> chemicalize.org -> 413 cpds download -> CheS-Mapper -> cluster 8 -> export 53 cpds

[27]

PubChem -> ChEMBL -> PMID -> assay -> strucs

• CHEMBL2041980 (structure)

• PMID 22390538 (paper)

• CHEMBL2045642 (assay for 32 strucs from paper)

• The 32 CIDs all have patent matches

[28]

Venny: intersects, diffs, de-dupes and merges

1) WO2011086531 matches in PubChem

2) CheS-Mapper cluster 8 from WO2011086531

3) ChEMBL assayed cpds from PMID 22390538

(handles any regular strings e.g. db IDs, SMILES, InChI or InChIKey)
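The same intersects, diffs, de-dupes and merges can be reproduced on any identifier lists with plain set operations. This is a sketch of the idea rather than of Venny itself; the function name `compare_sets` and the `ID1` to `ID5` strings are placeholders, not real compound identifiers.

```python
from typing import Dict, Iterable, Set


def compare_sets(named_lists: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Intersect, merge and diff several lists of identifier strings."""
    # De-dupe each list by converting it to a set; blank entries are dropped.
    sets = {
        name: {item.strip() for item in items if item.strip()}
        for name, items in named_lists.items()
    }
    result = {
        "common_to_all": set.intersection(*sets.values()),
        "merged": set.union(*sets.values()),
    }
    # For each list, keep the entries that appear in none of the other lists.
    for name, members in sets.items():
        others = set().union(*(m for n, m in sets.items() if n != name))
        result[f"only_in_{name}"] = members - others
    return result


# Placeholder identifiers standing in for CIDs, SMILES or InChIKeys.
example = compare_sets({
    "patent": ["ID1", "ID2", "ID3", "ID3"],
    "cluster": ["ID2", "ID3", "ID4"],
    "chembl": ["ID3", "ID5"],
})
for label, members in example.items():
    print(label, sorted(members))
```

Comparing on InChIKeys rather than SMILES avoids false differences caused by alternative SMILES strings for the same structure.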

[29]

[30]

NCATS/MRC: the joy of codes with no structures

http://cdsouthan.blogspot.se/2012/09/mrc-22-vs-ncats-58-repurposing-lists.html

[31]

Code name-to-structure mapping:

Dig out the code names

PubChem Substance

PubChem Compound

PubMed/MeSH

Google Scholar

Google Images

Google open (filtered)
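Code names are written inconsistently across sources (with or without a hyphen or space), so it can help to try each spelling in every source listed above. A minimal sketch, assuming the common "letters then digits" pattern; the function name and the choice of three separators are illustrative, not a rule from the slides.

```python
import re
from typing import List


def code_name_variants(code: str) -> List[str]:
    """Return common spellings of a company code name for searching."""
    # Split into an alphabetic prefix and a numeric part with optional suffix letter.
    match = re.fullmatch(r"\s*([A-Za-z]+)[\s-]*(\d+[A-Za-z]?)\s*", code)
    if match is None:
        # Unrecognised pattern: search on the string as given.
        return [code.strip()]
    prefix, number = match.group(1).upper(), match.group(2)
    return [f"{prefix}{separator}{number}" for separator in ("", "-", " ")]


print(code_name_variants("MMV390048"))
```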

[32]

Sometimes the system works

[33]

PubMed > ChEMBL

[34]

Sometimes you get missing and cryptic links

[35]

NVP-Bxd552: Google results

[36]

BACE2: Almost no chemistry in papers

[37]

BACE2

1. WO2013054291 > chemicalize.org

2. Download 450 structures

3. Upload to PubChem search

[38]

Scibite > Alerts for new chemistry

[39]

Conclusions

• The ability to extract chemical structures from text and web sources has been transformed by an expansion of the public toolbox

• The PubChem big bang increases the probability that extracted structures will have exact or similarity matches in databases

• Paradoxically, the patent corpus is now completely open while access to journal text is still restricted

• However, ChEMBL has extracted ~ 0.8 mill. SAR-linked and target-mapped structures from ~ 50K papers

• The submission of ~ 15 mill. patent structures to PubChem ensures at least some representation from the majority of medicinal chemistry patents (many of which spawned the subsequent ChEMBL papers)

• Those who want to share their structures globally (e.g. OSDD) have an expanding set of options for surfacing their results.

[40]

References