20 million public patent structures: looking at the gift horse
![Page 1: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/1.jpg)
1
www.guidetopharmacology.org
20 million public patent-extracted chemical structures: a look at the gift horse
Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative Physiology, University of Edinburgh
http://www.guidetopharmacology.org/index.jsp
Prepared for the Global Health Compound Design webinar, 30th Nov
Recording should become available below:
http://www.mmv.org/research-development/computational-chemistry/global-health-compound-design-webinars
http://www.slideshare.net/cdsouthan/20-mill-public-patent-structures-looking-at-the-gift-horse
![Page 2: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/2.jpg)
2
Outline
• Good and bad news about chemistry from patents
• Chemical Named Entity Recognition, pros and cons
• Major submitters to PubChem
• New WIPO initiative
• Overlaps between sources
• Examples of CNER caveats
• Roll your own extractions
• Curated activity-to-target mappings
• MMV example
• Conclusions
• References
![Page 3: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/3.jpg)
3
Looking at informatics gift horses
• We will look at just patent chemistry here
• But any source repays detailed analysis
• What are the statistics of entity and relationship capture?
• Can we assess real-world comparative utility?
• No source is free of caveats, overlaps, complexities, quirks and errors
• So can we ameliorate these during exploitation?
• PubChem submitters can be sliced, diced and compared in detail
• Public sources welcome feedback but may not have resources to implement
• The example below shows the analysis of four “horses” at once
![Page 4: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/4.jpg)
4
Medicinal chemistry from patents: good news, part I
• This presentation will focus on bioactivity value, not IP assessments (but I can try to address IP-related questions)
• Patents are a Cinderella scientific data source whose utility is underestimated by academics
• They are typically published two to five years before a paper containing some of the same examples
• They may contain anywhere from 2x to 10x as much SAR as the eventual paper
• For some filings from world-class medicinal chemistry teams (academic or commercial), the SAR never appears anywhere else
![Page 5: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/5.jpg)
5
Good news, part II
• Paradoxically, patent documents are more “open” than papers (e.g. for text mining)
• The non-redundant primary med. chem. data corpus (first-filings with composition of matter, classified as C07+A61) is well below 100K
• Examiners' search reports and inventiveness assessments are public
• Citations of papers and other patents are usually extensive
• Massive synthetic protocol and analytical data archive
• Estimated total bioactive compounds ~ 4-6 million
• A treasure trove for compound design, chemical property extraction (see slides from previous speaker, Igor Tetko) and many other uses
![Page 6: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/6.jpg)
6
Bad news: part I
• Data mining is more difficult than for papers
• Access historically dominated by commercial products
• Need to engage with quirks of patent family redundancy, Kind Codes, patent classifications, 100s of pages of turgid legal text, Markush nests
• Major portals pushing towards 50 million documents
• Some applicants are guilty of varying degrees of obfuscation to make data mining more difficult (e.g. the “Novel Compound” titles)
• What gets into public databases are not patented structures, merely structures extracted from patents
![Page 7: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/7.jpg)
7
Bad news: part II
• Finding first-filings can be difficult
• Judging data quality is a challenge
• Few journal authors cite their patents
• A large proportion of SAR data is “binned” rather than given as discrete values
• Some applicants don’t declare data values at all
• From public extractions so far, the ratio of bioactive examples to “other” (including non-med. chem. and artefacts) is ~ 5:15 million
• Comparing sources indicates constitutive divergence of extraction
• Automated extraction has inadvertently contaminated public databases with a variety of artefactual structures, running into millions
![Page 8: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/8.jpg)
8
Chemical Named Entity Recognition (CNER)
• Automated process: documents in > structures out
• SureChEMBL pipeline shown above, other sources similar
• Name-to-struc (n2s) by look-up and/or IUPAC translation, image-to-struc (i2s) and mol files from USPTO Complex Work Units (CWUs)
• Indexing usually added, e.g. abstract, descriptions, claims
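To make the look-up half of name-to-struc concrete, here is a minimal sketch in Python. It is not the SureChEMBL pipeline: the three-entry lexicon `NAME_TO_SMILES` and the function `extract_structures` are hypothetical, and a real system would combine a large curated dictionary with IUPAC name translation and image conversion.

```python
import re

# Hypothetical toy lexicon (assumption for illustration only); real pipelines
# use large curated dictionaries plus an IUPAC name-to-structure converter.
NAME_TO_SMILES = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
    "ethanol": "CCO",
}

def extract_structures(text, lexicon=NAME_TO_SMILES):
    """Return (name, SMILES) pairs for lexicon names found in text, in order of appearance."""
    hits = []
    # Tokenise on runs of letters, digits and hyphens, then look each token up.
    for token in re.findall(r"[A-Za-z][A-Za-z0-9\-]*", text):
        key = token.lower()
        if key in lexicon:
            hits.append((key, lexicon[key]))
    return hits

claims = "A composition comprising aspirin and caffeine dissolved in ethanol."
hits = extract_structures(claims)
for name, smiles in hits:
    print(name, smiles)
```

Note that a look-up like this happily extracts solvents and reagents alongside the claimed compounds, which is one reason common chemistry keeps being re-extracted.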
![Page 9: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/9.jpg)
9
History of patent chemistry feeds into PubChem
• 2006 - Thomson (Reuters) Pharma (TRP) manual extraction of patents and papers; 4.3 mil by 2016, ~40% from patents (guess ~1.5 mil), now static :)
• 2011 - IBM phase 1, CNER, 2.5 mil; SLING Consortium EPO extraction, 0.1 mil (static)
• 2012 - SCRIPDB, CNER, 4.0 mil (static)
• 2013 - SureChem, CNER, 9.0 mil (> SureChEMBL)
• 2014 - BindingDB USPTO manual assay mapping, 0.1 mil (active)
• 2015 - CNER
  • SureChEMBL 13.0 mil (active)
  • IBM phase 2, 7.0 mil (static)
  • NextMove Software 1.4 mil synthesis mapping (static)
• 2016 (Nov) all large sources above = 19.46 mil + ~1.5 mil Thomson
![Page 10: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/10.jpg)
10
CNER: good news and bad news
• SureChEMBL is by far the major contribution to public patent chemistry
• 17.51 million cpds in UniChem on 22 Nov
• 16.25 million in PubChem up to August
• 8.43 million are novel (i.e. source-unique CIDs)
• In situ chemistry is indexed and downloadable within days of publication
• Complemented by SciBite's automated “bio-entity” indexing (on the fly)
• Powerful query interface
• UniChem cross-indexing (e.g. to PubChem and/or ChEMBL)
But
• SureChEMBL remains the only active CNER source; the others are static
• Current feed hiccups are being addressed
• Extraction performance is compromised by poor OCR quality in WO documents and instances of very dense image tables
• Some types of CNER artefacts are introduced in subsequent slides
![Page 11: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/11.jpg)
11
Major PubChem CNER patent sources at the CID level: corroboration but also divergence
[Three-set Venn diagram]
SCRIPDB = 4.0 (SID:CID 1.5)
IBM = 7.9 (SID:CID 1.2)
SureChEMBL = 14.6 (SID:CID 1.0)
Region counts shown on the diagram: 0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95
Compound Identifiers (CIDs) in millions with a union of 17.8 (in 2015)
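For anyone wanting to reproduce this kind of overlap analysis on their own CID lists, the sketch below shows the set arithmetic behind a three-way Venn diagram. The three small CID sets are made-up toy values, and `venn_regions` is a hypothetical helper, not a PubChem function. The last lines simply check that the seven region counts printed on this slide add up to the stated union of about 17.8 million.

```python
# Toy CID sets standing in for three PubChem sources (hypothetical values).
surechembl = {1, 2, 3, 4, 5, 6}
ibm = {4, 5, 6, 7, 8}
scripdb = {6, 8, 9}

def venn_regions(a, b, c):
    """Return the seven mutually exclusive region sizes of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b": len((a & b) - c),
        "a_c": len((a & c) - b),
        "b_c": len((b & c) - a),
        "a_b_c": len(a & b & c),
    }

regions = venn_regions(surechembl, ibm, scripdb)
# The exclusive regions partition the union, so their sizes must add up to it.
assert sum(regions.values()) == len(surechembl | ibm | scripdb)

# Same consistency check on the seven counts from the slide (millions of CIDs).
slide_regions = [0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95]
slide_union = round(sum(slide_regions), 2)
print(regions)
print(slide_union)
```

The same partition check is a quick sanity test for any Venn figure: if the regions do not sum to the union, a count has been mis-transcribed.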
![Page 12: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/12.jpg)
12
Patent CNER vs. manual bioactivity sources in PubChem: corroboration along with (expected) divergence
[Three-set Venn diagram]
SCRIPDB + IBM + SureChEMBL = 17.8
Thomson (Reuters) Pharma = 4.3
ChEMBL = 1.4
Region counts shown on the diagram: 16.13, 0.18, 0.12, 0.90, 1.35, 0.26, 2.55
Counts (2015) are CIDs in millions
![Page 13: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/13.jpg)
13
A “new horse” (Oct 2016)
• ~ 7 million structures so far from WO and US from 1978
• WIPO collaboration with InfoChem and NextMove
![Page 14: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/14.jpg)
14
CNER fragmentation
• Mainly split IUPAC strings but some authentic intermediates
• Compare with selective manual extraction by Thomson/Derwent
![Page 15: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/15.jpg)
15
Bioactivity-gap: most patent chemistry has no linked data
Comparing the total CNER patent set with a bioactivity-centric source e.g. Guide to PHARMACOLOGY (GtoPdb) at 6037 CIDs (2015 numbers)
![Page 16: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/16.jpg)
16
Patent-unique structures: strange big things
https://www.blogger.com/blogger.g?blogID=2155351992730855318#editor/target=post;postID=8959213643856200429;onPublishedMenu=allposts;onClosedMenu=allposts;postNum=2;src=postname post on “chessbordane”
![Page 17: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/17.jpg)
17
Mixtures from patents: more confounding than useful
PubChem ameliorates the issue by splitting SID mixtures to component CIDs while maintaining the mappings
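As a rough illustration of the splitting idea (not PubChem's actual standardisation code), a mixture written as a dot-disconnected SMILES string can be separated into components while keeping a record of which mixture each component came from. The function `split_mixture`, the `SID_demo_1` identifier and the example mixture are all hypothetical.

```python
def split_mixture(smiles):
    """Split a dot-disconnected SMILES into its unique components, preserving order."""
    components = []
    for part in smiles.split("."):
        # Skip empty fragments and exact string duplicates.
        if part and part not in components:
            components.append(part)
    return components

# Hypothetical mixture record: aspirin + caffeine + ethanol in one "substance".
mixture_records = {
    "SID_demo_1": "CC(=O)OC1=CC=CC=C1C(=O)O.CN1C=NC2=C1C(=O)N(C)C(=O)N2C.CCO",
}

# Keep the mixture-to-component mapping so the provenance is not lost.
component_map = {sid: split_mixture(smi) for sid, smi in mixture_records.items()}
for sid, parts in component_map.items():
    print(sid, len(parts), parts)
```

This works on strings only. A real workflow would canonicalise each component with a cheminformatics toolkit before comparing or de-duplicating, since the same structure can be written as many different SMILES.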
![Page 18: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/18.jpg)
18
Continual re-extraction of common chemistry
![Page 19: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/19.jpg)
19
US6589997: missing punctuation > CNER > mixtures
![Page 20: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/20.jpg)
20
Virtuals I: stereo enumerations from US 20080085923
260 CIDs > 581 SIDs from IBM, SureChEMBL, SCRIPDB, Thomson Pharma and Discovery Gate
![Page 21: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/21.jpg)
21
Virtuals II: deuterated enumerations from US20080045558
986 deuterated CIDs > 2818 SIDs from IBM, SureChEMBL and SCRIPDB
http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs
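The SID and CID counts on the two "virtuals" slides can be turned into a simple redundancy measure: the average number of submitted substance records (SIDs) per unique compound (CID). The helper name `sid_per_cid` is an assumption for illustration; the counts are the ones quoted on the slides.

```python
def sid_per_cid(sids, cids):
    """Average number of substance records (SIDs) per unique compound (CID)."""
    return sids / cids

stereo = sid_per_cid(581, 260)       # stereo enumerations, US 20080085923
deuterated = sid_per_cid(2818, 986)  # deuterated enumerations, US20080045558

print(f"stereo: {stereo:.2f} SIDs per CID")
print(f"deuterated: {deuterated:.2f} SIDs per CID")
```

Ratios above 2 mean that, on average, each of these virtual structures was submitted by more than two sources, i.e. the same enumerated names were independently extracted several times.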
![Page 22: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/22.jpg)
22
Some good news: supplementing CNER with DIY extraction
Either for unprocessed patent documents (e.g. on publication day) or where the extraction of examples by CNER is clearly gapped
![Page 23: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/23.jpg)
23
More good news: expert activity-to-target patent mapping complements CNER
![Page 24: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/24.jpg)
24
Expert activity-to-target mapping II
http://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=2331
![Page 25: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/25.jpg)
25
Utility example from MMV
Pick up from the SureChEMBL interface with MMV as applicant or C07 + malaria
![Page 26: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/26.jpg)
26
Following through: SureChEMBL > PubChem
• CID > “similar compounds” (Tanimoto 90% neighbours) 58 CIDs > cluster
• Generally picks out analogue series from same patent (i.e. the 118s)
• But note structures from other sources nesting into the cluster (e.g. 426, 509, 920, 280 and 308)
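The "Tanimoto 90% neighbours" step can be sketched as follows. This is a toy version under stated assumptions: fingerprints are represented as plain Python sets of "on" bit positions with made-up values, and `tanimoto` and `neighbours` are hypothetical helpers rather than PubChem code, which works on its own 2D substructure fingerprints.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def neighbours(query, library, threshold=0.9):
    """Return sorted names of library entries at or above the similarity threshold."""
    return sorted(name for name, fp in library.items()
                  if tanimoto(query, fp) >= threshold)

# Made-up fingerprints: bit positions that are switched on.
query_fp = set(range(20))
library = {
    "analogue_1": set(range(19)) | {25},  # 19 shared bits out of 21 in the union
    "analogue_2": set(range(20)),         # identical fingerprint
    "unrelated": set(range(10, 30)),      # 10 shared bits out of 30 in the union
}

hits = neighbours(query_fp, library, threshold=0.9)
print(hits)
```

At a 0.9 threshold only close analogues survive, which is consistent with the observation that the neighbours are mostly the analogue series from the same patent.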
![Page 27: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/27.jpg)
27
Conclusions
• The open patent chemistry “Big Bang” value massively outweighs the caveats (i.e. it’s a very nice horse - thanks…)
• The majority of med. chem. exemplifications are now out there
• All contributing sources are to be congratulated, and PubChem for wrangling most of them
• But, it is important to look closely at the gift horse
• We can then resolve and understand quirks, artefacts and pitfalls
• PubChem slicing and filtering can partially ameliorate these
• Activity-to-target mapping for SAR extraction is the main pinch point
• Those without commercial sources are now more enabled for patent mining
• Those with commercial sources can now synergise with open ones
![Page 28: 20 million public patent structures: looking at the gift horse](https://reader036.vdocument.in/reader036/viewer/2022070512/588aaeab1a28ab4c308b6c4f/html5/thumbnails/28.jpg)
28
References
http://cdsouthan.blogspot.com/ 19 posts have the tag “patents”
http://www.ncbi.nlm.nih.gov/pubmed/26194581
http://www.ncbi.nlm.nih.gov/pubmed/23506624
N.b. from the reproducibility aspect, anyone needing technical tips to reproduce or extend the PubChem queries used for these slides is welcome to contact me
http://www.ncbi.nlm.nih.gov/pubmed/25415348
http://nar.oxfordjournals.org/content/early/2015/10/11/nar.gkv1037
Southan C: Examples of SAR-centric patent mining using open resources, in Elsevier COMPREHENSIVE MEDICINAL CHEMISTRY III, July 2017, in press