Patent chemistry big bang: utilities for SMEs


Upload: chris-southan

Post on 14-Jan-2017


TRANSCRIPT

Page 1: Patent chemistry big bang: utilities for SMEs

www.guidetopharmacology.org

The open patent chemistry “big bang”: large opportunities for small enterprises

Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative Physiology, University of Edinburgh

ACS, Mon, Mar 14
CINF: Division of Chemical Information, 79
SESSION: Chemical Information for Small Businesses & Startups, 1:00 PM - 4:55 PM, Room 24C
4:25 PM - 4:50 PM

http://www.slideshare.net/cdsouthan/patent-chemisty-big-bang-utilities-for-smes

Page 2: Patent chemistry big bang: utilities for SMEs

Abstract (will be skipped for presentation)

In 2012, after the first IBM open deposition of 2.5 million structures, few would have predicted that PubChem compounds that include patent-extracted submissions would approach 20 million by 2015 (PMID 26194581). The current major open patent chemistry feeds (in ascending size order) are NextMove, SCRIPDB, Thomson Pharma, IBM and SureChEMBL. Comparative statistics of these sources will be presented, along with arguments that the probability of covering lead compound prior-art structures is now very high. The consequence is that the academic community and small companies can now patent-mine extensively in PubChem and SureChEMBL, possibly even without needing commercial sources to support their own filings. Other recent major enabling aspects for small institutions include a) the open availability of patent full text for querying, b) a range of free tools for DIY chemistry extraction (PMID 23618056) and c) automatic bioentity mark-up in patent text (e.g. protein names) from the SureChEMBL/SciBite collaboration. Examples of DIY analysis of newly published patents will be shown. Even for small enterprises not filing directly, open patent chemistry presents a big expansion in accessible SAR space, and aspects of mining this will be exemplified. However, open chemistry extraction does bring in a variety of artefacts that add confounding structural “noise”. These include a) permutations of mixtures and chiral exemplifications, b) virtual structures, c) the fact that extractions from documents cannot directly indicate IP status and d) “common chemistry” swamping. These problems and some partial solutions using PubChem filters will be discussed.

Page 3: Patent chemistry big bang: utilities for SMEs

Encouraging preface

Page 4: Patent chemistry big bang: utilities for SMEs

Outline

• Balancing IP against bioactivity mining
• Source coverage for patent extraction
• Caveats with automated extraction
• The example of US9056843
• Source extraction comparisons
• DIY extraction
• Questions on open searching
• Conclusions
• References

Page 5: Patent chemistry big bang: utilities for SMEs

IP vs SAR from open patent mining

IP assessment

• Essential source of prior art chemistry
• De facto adjunct to commercial sources
• Improved portals (EPO, WIPO, FPOL)
• SureChEMBL, TRP & BindingDB active
• PubChem content is chemistry from patents, not patented chemistry
• CNER brainless compared to expert IP-relevance selection
• Claim section extraction often weak
• Extracted artefacts confounding (e.g. mixtures & virtuals)
• Dense image tables still a coverage gap
• IBM and SCRIPDB static in PubChem
• Asian chemistry shortfall
• The “common chemistry” problem
• Patent blitzing for drug candidates

Bioactivity data mining

• Circa 5x more SAR than literature
• Patent families collapse to < 100K C07D primary documents
• Advanced query options in SureChEMBL
• Bulk synthesis extraction (NextMove)
• Valuable intersects with papers, authors and targets via ChEMBL
• Easy intersecting with DIY chemistry extraction from any document
• Obfuscation in example > assay data
• Challenge of judging scientific quality
• Only ~ 5 mil structures potentially linkable to bioactivity data
• Thus ~ 15 million have marginal utility
• CNER > structural multiplexing

Page 6: Patent chemistry big bang: utilities for SMEs

Big chemistry: prior art statistics

March 2016 snapshots

• GDB-13: 907 million virtual structures (similarity search)
• Google InChIKey: 120+? million (exact match search)
• EBI UniChem: 110.7 million, 27 sources (exact match search)
• CAS: 109 million substances (commercial, similarity search)
• PubChem: 89 million, 390 sources (similarity search)
• ChemSpider: 43 million, 510 sources (similarity search)
• SureChEMBL: 16.8 million (similarity search)
• GVKBio: 6.2 million (commercial bioactivity capture from patents and papers, similarity search)

Page 7: Patent chemistry big bang: utilities for SMEs

History of patent chemistry feeds into PubChem

• 2006 - Thomson (Reuters) Pharma (TRP) manual extractions from patents and papers (now 4.3 mil, ~40% patents)
• 2011 - IBM phase 1 chemical named entity recognition (CNER) 2.5 mil; SLING Consortium EPO extraction 0.1 mil
• 2012 - SCRIPDB, CNER plus Complex Work Units (CWU) 4.0 mil
• 2013 - SureChem, CNER + image, 9.0 mil
• 2014 - BindingDB USPTO assay extraction (now 0.08 mil)
• 2015 - (CNER + images + CWU)
    • SureChEMBL 13.0 mil
    • IBM phase 2, 7.0 mil
    • NextMove Software 1.4 mil synthesis mapping
• 2016 - SureChEMBL 15.8 mil
• CIDs from CNER extractions 19.1 mil (from 88.8 mil, 4th March)
• Total patent chemistry with estimate from TRP ~ 20.5 mil

Page 8: Patent chemistry big bang: utilities for SMEs

CNER patent sources vs. patent and paper curation: corroboration and divergence

[Venn diagram of three sets. Set totals: CNER sources (IBM + SCRIPDB + SureChEMBL + NextMove) = 19.01; ChEMBL20 = 1.45; Thomson Pharma = 4.3. Region values as extracted, without their positions: 17.3, 0.18, 1.4, 2.5, 0.12, 0.25, 0.9.]

Counts are PubChem Compound Identifiers (CIDs) in millions
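The transcript loses the positions of the seven smaller values on this slide. Assuming they are the seven regions of a three-set Venn diagram (an assumption, not something the transcript states), a minimal Python sketch can test which labellings agree with the three set totals by simple addition. The region names, the `consistent_assignments` function and the 0.05 tolerance are illustrative choices, not part of the talk.

```python
from itertools import permutations

# Set totals from the slide (millions of CIDs).
TOTALS = {"CNER": 19.01, "ChEMBL20": 1.45, "TRP": 4.3}

# The seven unlabelled values, ASSUMED here to be the Venn regions.
REGION_COUNTS = [17.3, 0.18, 1.4, 2.5, 0.12, 0.25, 0.9]

# Hypothetical region labels for a three-set diagram.
REGION_NAMES = [
    "CNER only", "ChEMBL20 only", "TRP only",
    "CNER+ChEMBL20", "CNER+TRP", "ChEMBL20+TRP", "all three",
]


def consistent_assignments(tolerance=0.05):
    """Return every labelling whose region sums match the set totals."""
    hits = []
    for perm in permutations(REGION_COUNTS):
        r = dict(zip(REGION_NAMES, perm))
        sums = {
            "CNER": r["CNER only"] + r["CNER+ChEMBL20"]
                    + r["CNER+TRP"] + r["all three"],
            "ChEMBL20": r["ChEMBL20 only"] + r["CNER+ChEMBL20"]
                        + r["ChEMBL20+TRP"] + r["all three"],
            "TRP": r["TRP only"] + r["CNER+TRP"]
                   + r["ChEMBL20+TRP"] + r["all three"],
        }
        if all(abs(sums[k] - TOTALS[k]) <= tolerance for k in TOTALS):
            hits.append(r)
    return hits


if __name__ == "__main__":
    for labelling in consistent_assignments():
        print(labelling)
```

With this tolerance the totals pin down five regions but leave the 0.18 and 0.12 values interchangeable between the CNER/ChEMBL20 overlap and the three-way overlap, so any labelling should be checked against the original figure.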

Page 9: Patent chemistry big bang: utilities for SMEs

CNER caveats (I) fragmentation: Mw plots

Can be partially ameliorated by using Mw ranking as a filter
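As a rough illustration of Mw ranking as a filter, the sketch below sorts extracted records by molecular weight and optionally drops those under a cut-off. The record layout, the `rank_by_mw` name and the 150 threshold are assumptions made for the example; the slide does not specify a cut-off.

```python
def rank_by_mw(records, min_mw=None):
    """Sort records by descending molecular weight.

    If min_mw is given, records lighter than it (likely fragments)
    are dropped before ranking.
    """
    kept = [r for r in records if min_mw is None or r["mw"] >= min_mw]
    return sorted(kept, key=lambda r: r["mw"], reverse=True)


# Hypothetical extraction output; the first entry is a made-up placeholder.
example = [
    {"name": "trifluoroacetic acid", "mw": 114.02},
    {"name": "example compound", "mw": 412.4},
    {"name": "methanol", "mw": 32.04},
]

if __name__ == "__main__":
    for record in rank_by_mw(example, min_mw=150):
        print(record["name"], record["mw"])
```

Ranking first and choosing the cut-off by eye keeps the filter transparent, which matters because a fixed threshold will also discard genuinely small actives.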

Page 10: Patent chemistry big bang: utilities for SMEs

CNER caveats (II) the bioactivity-gap: majority of patent chemistry has no linked assay data

Page 11: Patent chemistry big bang: utilities for SMEs

CNER caveats (III): strange patent-unique structures

• Weird stuff is generally non-biological chemistry (i.e. not A61)
• For the record: C07D = 10.9, A61K = 0.9, (C07D + A61K) = 0.81 mill CIDs

Page 12: Patent chemistry big bang: utilities for SMEs

CNER caveats (IV): mixture extractions (a mixed blessing)

• Mostly TFA or HCl salts
• Includes combination claims and reactant mixtures
• Causes sources to appear more divergent by exact match statistics
• PubChem splits to component CIDs while maintaining the back-mapping
• Can normalise with “CovalentUnitCount = 1” filter
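For structures held locally as SMILES, a similar single-component filter can be approximated by counting dot-separated parts, since "." marks disconnected components in SMILES. This is a simplified stand-in for PubChem's own CovalentUnitCount property, not its implementation: the function names are hypothetical, and the rare case of ring-closure digits bonding across a dot is ignored.

```python
def covalent_unit_count(smiles: str) -> int:
    """Count dot-separated components in a SMILES string.

    Approximation: assumes each '.'-separated part is one covalent unit.
    """
    return len([part for part in smiles.split(".") if part])


def single_unit_only(smiles_list):
    """Keep only SMILES strings that describe a single covalent unit."""
    return [s for s in smiles_list if covalent_unit_count(s) == 1]


if __name__ == "__main__":
    extracted = [
        "CCO",                 # ethanol, one unit
        "CCN.Cl",              # amine written with HCl, two units
        "CCN.OC(=O)C(F)(F)F",  # amine written with TFA, two units
    ]
    print(single_unit_only(extracted))
```

A stricter variant could split each mixture and keep the largest component instead of discarding the record, which mirrors the idea of mapping back to component structures.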

Page 13: Patent chemistry big bang: utilities for SMEs

An example

“Trifluoromethyl-oxadiazole derivatives and their use in the treatment of disease” (Novartis)

PCT for the patent family WO2013008162, 2013-01-17

Page 14: Patent chemistry big bang: utilities for SMEs

SAR table

All three data sets extracted and example-numbered in BindingDB

Page 15: Patent chemistry big bang: utilities for SMEs

PubChem retrieval by patent number -> series cluster

Page 16: Patent chemistry big bang: utilities for SMEs

Extraction splits by source, date and isomeric connectivity (it can get complicated...)

Different sources (SIDs) for same structure (CID)

Different CID isomers with same core connectivity

Page 17: Patent chemistry big bang: utilities for SMEs

Impressive SureChEMBL family extraction

4830 rows, 648 IDs mapped to 511 PubChem CIDs

Page 18: Patent chemistry big bang: utilities for SMEs

Extraction source selectivity

• 151 BindingDB CIDs direct from PubChem
• 93 Thomson Pharma CIDs (within the 151 above)
• 296 SDFs from SciFinder > 269 CIDs
• 648 SureChEMBL IDs > 511 CIDs
• Numbers are not absolute because of “round tripping” mapping issues but they illustrate the selectivity and extent of open coverage
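Comparisons of this kind reduce to set operations on CID lists once each source has been mapped to PubChem CIDs. The sketch below uses made-up toy CIDs rather than the real lists for this patent, and the `coverage` function name is an assumption for illustration.

```python
def coverage(reference, candidate):
    """Fraction of reference CIDs that also appear in candidate."""
    reference, candidate = set(reference), set(candidate)
    if not reference:
        return 0.0
    return len(reference & candidate) / len(reference)


# Toy CID sets, invented for the example (not the real patent data).
bindingdb_cids = {1, 2, 3, 4}
thomson_cids = {2, 3}
surechembl_cids = {1, 2, 3, 5, 6, 7}

if __name__ == "__main__":
    # A fully nested source gives 1.0, as with the Thomson Pharma CIDs
    # that all fall within the BindingDB set on the slide.
    print(coverage(thomson_cids, bindingdb_cids))
    print(coverage(bindingdb_cids, surechembl_cids))
```

The measure is asymmetric by design: a selective curated source can be fully covered by a broad automated one while the reverse coverage stays low.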

Page 19: Patent chemistry big bang: utilities for SMEs

Orthogonal entity mark-up (I): Ferret (Chrome plug-in)

Page 20: Patent chemistry big bang: utilities for SMEs

Orthogonal entity mark-up (II): SciBite’s Termite (within SureChEMBL)

Page 21: Patent chemistry big bang: utilities for SMEs

Roll-your-own extraction (II): OSRA

Page 22: Patent chemistry big bang: utilities for SMEs

Roll-your-own extraction (I): ChemAxon chemicalize.org

Page 23: Patent chemistry big bang: utilities for SMEs

Recent comparative analysis

• Compared SureChEMBL and IBM with SciFinder and Reaxys for a small patent set (i.e. open vs commercial)

• Concluded: “50–66 % of the relevant content from the latter was also found in the former”

• Equivalent comparisons executed in the latest PubChem with all patent sources would probably record a higher overlap

Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents, Senger, et al. J. Cheminf. 2015, 7:49 doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL)

http://www.ncbi.nlm.nih.gov/pubmed/26457120

Page 24: Patent chemistry big bang: utilities for SMEs

First 64K$ Q: can you search your novel chemistry in open dbs?

• The InChIKey connectivity layer already facilitates blinded exact match (isomer-agnostic) searching anywhere, including Google
• PubChem and SureChEMBL default to https, so searching is secure
• There is no (and never will be?) patent case law where novelty was challenged in court based on structures intercepted from public servers
• Without metadata (e.g. target & disease), interception per se is not much use
• As for sequence data, hard evidence of serious competitive damage via query interception remains zero (after 20+ years)
• Commercial dbs cannot capture all prior art, so an open check is needed anyway
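The isomer-agnostic search in the first bullet relies on the first 14-character block of a standard InChIKey, which is shared by stereoisomers of the same skeleton. A minimal sketch of extracting that block and grouping keys by it follows; the function names are hypothetical, and the example keys are assumed here to be those of L-alanine, D-alanine and aspirin.

```python
from collections import defaultdict


def connectivity_block(inchikey: str) -> str:
    """Return the first block of an InChIKey (the connectivity hash)."""
    block = inchikey.strip().upper().split("-")[0]
    if len(block) != 14:
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return block


def group_by_connectivity(inchikeys):
    """Group full InChIKeys under their shared connectivity block."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[connectivity_block(key)].append(key)
    return dict(groups)


if __name__ == "__main__":
    keys = [
        "QNAYBMKLOCPYGJ-REOHZTLVSA-N",  # assumed: L-alanine
        "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # assumed: D-alanine
        "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # assumed: aspirin
    ]
    for block, members in group_by_connectivity(keys).items():
        print(block, len(members))
```

Because only the hashed block is submitted as a query string, the structure itself is never sent, which is the sense in which the search is "blinded".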

Page 25: Patent chemistry big bang: utilities for SMEs

Second 64K$ Q: Can you file based on open-only diligence?

If convinced your novel series < billion$ drug, maybe not - but consider

• Chances of completely missing an overlapping chemical series in open sources from a competing patent are diminishing
• Prior art is confounded anyway by the 18-month publication shadow and Markush enumeration
• Filing a 12-month provisional is a low-cost option
• Portal queries allow you to find relevant patents (e.g. by target name) even if open chemistry extraction was limited
• The searches that really count are the ones the patent examiner does for you (on payment) using all their sources (including PubChem)
• However, attorney costs for drafting applications need balancing against savings on commercial patent resources

Page 26: Patent chemistry big bang: utilities for SMEs

Conclusions

• The “Big Bang” of open chemistry and full text from patents now makes these an essential part of IP and bioactivity assessments for SMEs
• The combination of SureChEMBL and other sources within PubChem provides over 20 million patent-extracted structures and powerful analysis options
• The gap between open and commercial has narrowed to the point where you can at least consider doing without the latter
• Note also that the former has functionality absent from the latter
• Bioactivity identification, mining and target mapping are still challenging but becoming easier
• It is important to understand the quirks, artefacts and pitfalls of automated patent chemistry extraction so you can filter these

Page 27: Patent chemistry big bang: utilities for SMEs

References and questions

http://cdsouthan.blogspot.com/ (19 posts have the tag “patents”)

http://www.ncbi.nlm.nih.gov/pubmed/26194581

http://www.ncbi.nlm.nih.gov/pubmed/23506624 (with PubMed Commons data link)

http://www.ncbi.nlm.nih.gov/pubmed/25415348

http://www.ncbi.nlm.nih.gov/pubmed/23399051

http://www.ncbi.nlm.nih.gov/pubmed/23618056