![Page 1: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/1.jpg)
Leveraging IP data for its scientific content
• The future of IP in the era of machine learning / cognitive computing / AI
• Computer curation & finding dark data
Stephen Boyer
![Page 2: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/2.jpg)
To be accomplished in the next hour
An appreciation of:
• the utility of IP data for advancing science
• emerging technologies [machine learning]
• the potential to be realized
• the challenges & opportunities
![Page 3: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/3.jpg)
“There’s gold in ‘them’ documents”
What’s in them ?
What is it good for ?
What are people doing to mine the information ?
How are they using it ?
![Page 4: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/4.jpg)
A bit of history
How we got to now !
What technologies do we have to work with ?
![Page 5: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/5.jpg)
Evolving technologies relevant to patents
The Past (1990 - 2005) | Recent Past (2006 - 2009) | Present / Future (2010 - 2018)
• Easy Web Access
• Text Searching
• Keyword
• Boolean
• BRS (open text)
• Verity
• Lucene / Solr
• Image Downloads
• Tiff --> PDF
• Text Analytics
• IBM UIMA
• Natural Language Processing (NLP)
• Entity Identification
• Co-occurrence Analysis
• Visualization Tools
• Citation Mapping
• W3C Standards
• Federated Search
• Unique Entity IDs
• InChI
• GeneIDs
• other
• Data Availability
• Integration of open source
• Google Patents
• Contextual analysis
• Semantic search
• Network graphs
• Relationship detection
• Advanced grammar analysis
• Machine Learning
• Google Patents
• Neural Networks
• Image Analysis
• OSRA / CLiDE
• Automated analysis
• Machine translation
![Page 6: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/6.jpg)
Evolving Analytics, Visualization & Knowledge
Availability of bulk machine-readable data
Understanding the content of the documents
Why bother ?
Making documents “machine-readable”
Understanding the format of the documents
• Sections
• Tables
• Citations
• Data types
• Etc.
![Page 7: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/7.jpg)
3.1 million patent applications worldwide in 2016
Source = Francis Gurry, WIPO, Ambassador Briefing 2018
How many patent documents are there ?
![Page 8: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/8.jpg)
Distribution of Global Patenting has Shifted in Recent Decades
Source = Francis Gurry, WIPO, Ambassador Briefing 2018
![Page 9: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/9.jpg)
Machine learning to analyze & interpret the components of a document
Example: Work done by Peter Starr, IBM Zurich labs
Step 1: Making documents “machine-readable”
![Page 10: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/10.jpg)
PDF parser → PDF interpretation → Semantic representation
PDF is the pervasive language of the enterprise
Step 1: Making documents “machine-readable”
![Page 11: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/11.jpg)
Cross-mapping citation data between publications and patents
PatentsO-References
Publications
• PubMed ID #
• Author ORCIDs
• Impact factor
• Medical subject codes, ontologies, etc.
journal articles cited in patents
![Page 12: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/12.jpg)
Cross-mapping citation data between publications and patents
Citation mapping
~175,000,000 patent citations map to other patents
~6,000,000 / 40,000,000 patent citations map to journal articles other than patents, in ~10,500 unique journals
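Citation mapping of this kind can be pictured as a key lookup. A minimal sketch, assuming each non-patent citation has already been parsed into journal, year, volume, and first page; the index contents, the placeholder PubMed IDs, and the function names are hypothetical and are not data from the talk:

```python
from typing import Optional


def citation_key(journal: str, year: int, volume: str, first_page: str) -> tuple:
    """Build a normalized key so small formatting differences still match."""
    normalized_journal = "".join(ch for ch in journal.lower() if ch.isalnum())
    return (normalized_journal, year, volume.strip(), first_page.strip())


# Hypothetical article index: normalized key -> PubMed ID (placeholder values).
ARTICLE_INDEX = {
    citation_key("J. Med. Chem.", 2005, "48", "1234"): "PMID-0000001",
    citation_key("Nature", 2010, "465", "100"): "PMID-0000002",
}


def map_citation(journal: str, year: int, volume: str, first_page: str) -> Optional[str]:
    """Return the PubMed ID for a parsed patent citation, or None if unmapped."""
    return ARTICLE_INDEX.get(citation_key(journal, year, volume, first_page))


def mapped_fraction(citations) -> float:
    """Fraction of parsed citations that map to an indexed journal article."""
    if not citations:
        return 0.0
    hits = sum(1 for c in citations if map_citation(*c) is not None)
    return hits / len(citations)
```

Once a citation resolves to an article ID, the article-side metadata listed on the slide (ORCIDs, impact factor, subject codes) can be attached to the citing patent.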
![Page 13: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/13.jpg)
?
Cross-mapping citation data between publications and patents
![Page 14: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/14.jpg)
?
Cross-mapping citation data between publications and patents
Funding ($)
• NIH
• NSF
• EU
![Page 15: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/15.jpg)
Using computers to understand what’s in the documents
Annotating the documents – NLP, entity identification
Visualizing the content
A brief review of work done at IBM & by a host of others
Step 2 : Computer Curation of Content
![Page 16: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/16.jpg)
Evolving Analytics, Visualization & Knowledge
Availability of bulk machine-readable data
Understanding the content of the documents
Why bother ?
Making documents machine-readable
Understanding the format of the documents
• Sections
• Tables
• Citations
• Data types
• Etc.
Analysis of the content
• NLP
• Entity identification
• Contextual Analysis
• Table data
• Relationships
• Normalization
![Page 17: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/17.jpg)
Early Text Mining Technologies
entity identification
a) (2R/4S)-4-[4-Amino-5-(4-benzyloxy-phenyl)pyrrolo[2,3-d]pyrimidin-7-yl]-2-hydroxymethyl-pyrrolidine-1-carboxylic acid tert-butyl ester prepared analogously to Example 18 starting from (2R/4S)-4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester 2-ethyl ester (Example 20a). 1H-NMR (CDCl3, ppm): 8.52 (s, 1H), 7.52-7.32 (m, 7H), 7.1 (d, 2H), 6.95 (d, 1H), 5.50 (m, 1H), 5.13 (s, 2H), 4.62-4.42 (m, 2H), 4.28 (m, 2H), 4.10 (m, 1H), 3.95-3.70 (m, 1H), 2.75 (m, 1H), 2.50 (m, 1H), 1.49 (s, 9H).
b) (2R/4S)-{4-[4-Amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidin-2-yl}-methanol: 0.100 g of (2R/4S)-4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester is dissolved in 4 ml of tetrahydrofuran; 10 ml of 4M hydrogen chloride in diethyl ether are added, and stirring is carried out for 1 hour at room temperature. The product is filtered off and dried under a high vacuum. The dihydrochloride of the title compound is obtained. 1H-NMR (CD3OD, ppm): 8.4 (s, 1H); 7.60 (s, 1H), 7.5-7.10 (m, 9H), 5.65 (m, 1H), 5.18 (s, 2H), 4.32 (m, 1H), 4.00-3.65 (m, 4H), 2.60 (m, 2H).
EXAMPLE 24
(2R/4S)-4-(4-Amino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester 0.130 g of (2R/4S)-4-(4-benzyloxycarbonylamino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester is dissolved in 8 ml of methanol, and the solution is hydrogenated over 0.030 g of palladium-on-carbon (10%) for 1 hour at normal pressure. The catalyst is removed by filtration, the filtrate is concentrated by
What is this compound ??
[Figure: 2D chemical structure drawing of the compound]
![Page 18: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/18.jpg)
Summary of overall text analysis operations for chemistry: HMM, CRF, CFG
Paper Words → Chemical Names
Dictionary of the English language – minus – the dictionary of desired entities
Name=Structure: toluene, methyl benzene
SMILES String: [CC1=CC=CC=C1]
2D Structure
3D structure compute
300 properties per molecule
Computational Resources: Blue Gene – enabled
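The dictionary idea and the Name=Structure step can be sketched in a few lines. This is a toy illustration only: the word list, the lookup table, and the function names are assumptions made for the example, and a production system would rely on the statistical and grammar-based methods named on the slide (HMM, CRF, CFG) plus a real name-to-structure converter.

```python
import re

# Tiny stand-in for a dictionary of the English language (assumption).
ENGLISH_WORDS = {"the", "product", "was", "dissolved", "in", "and", "stirred", "then"}

# Tiny stand-in for a name-to-structure converter; both names map to the
# SMILES string for toluene shown on the slide.
NAME_TO_SMILES = {
    "toluene": "CC1=CC=CC=C1",
    "methyl benzene": "CC1=CC=CC=C1",
}


def candidate_entities(text: str) -> list:
    """Keep the tokens that are not ordinary English words."""
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9\-]*", text)
    return [t for t in tokens if t.lower() not in ENGLISH_WORDS]


def name_to_smiles(name: str):
    """Look up a SMILES string for a chemical name, or None if unknown."""
    return NAME_TO_SMILES.get(name.lower())


if __name__ == "__main__":
    for name in candidate_entities("The product was dissolved in toluene and stirred."):
        print(name, name_to_smiles(name))
```

Synonyms resolving to the same SMILES string is what lets downstream steps (2D structure, 3D structure, computed properties) work on one normalized entity.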
![Page 19: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/19.jpg)
Patent document’s chemical report molecular timeline & chemical name-to-structure mouse-over
![Page 20: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/20.jpg)
Text-mining technologies identify in-document properties
• Chemical yields
• Quantities
• Physical attributes: melting points, boiling points
• Solvents and temperatures
• Spectral data
• NMR data
• IR data
• Mass spec data
• Assay data
Source courtesy of Dr. Roger Sayle
![Page 21: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/21.jpg)
Example of text mining from patent & scientific literature
Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK
> 175K Compound-value associations
![Page 22: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/22.jpg)
Signals / Triggers for identifying specific entities
Step 7: To a solution of 4-(1-(4-cyanophenyl)-1H-benzo[d]imidazol-6-yl)benzoic acid 6 (300 mg, 0.88 mmol) in DMF (5 mL) was added NMM (0.178 mL, 1.76 mmol) followed by HATU (501 mg, 1.32 mmol) at rt and the solution was stirred for 30 min. Morpholine (0.84 mL, 0.968 mmol) was added to the reaction mixture and stirring was continued for 16 h. The reaction mixture was diluted with EtOAc and washed with water and brine solution. The organic layer was dried over anhydrous Na2SO4 and concentrated under reduced pressure to obtain crude product. The crude product was purified by preparative HPLC to afford 4-(6-(4-(morpholine-4-carbonyl)phenyl)-1H-benzo[d]imidazol-1-yl)benzonitrile (100 mg, 27%, AUC HPLC 95.9%) as an off-white solid; m.p. 207-210° C.; 1H NMR (400 MHz, CDCl3) δ (ppm): 8.18 (s, 1H), 7.98-7.92 (m, 3H), 7.72 (d, J=6.6 Hz, 3H), 7.66-7.61 (m, 3H), 7.51 (d, J=7.9 Hz, 2H), 3.90-3.40 (m, 8H); MS (ESI) m/z 409.15 [C25H20N4O2+H]+.
Example of extracting NMR & MS data from US patents
What about BP?
Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK
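The signals / triggers idea can be sketched with regular expressions keyed on phrases such as "m.p.", "m/z", and "NMR". This is a minimal sketch written for this one passage, not the extraction system behind the slide; the pattern names and the output field names are assumptions.

```python
import re

# Passage abridged from the Step 7 example above.
SAMPLE = (
    "The crude product was purified by preparative HPLC to afford "
    "4-(6-(4-(morpholine-4-carbonyl)phenyl)-1H-benzo[d]imidazol-1-yl)benzonitrile "
    "(100 mg, 27%, AUC HPLC 95.9%) as an off-white solid; m.p. 207-210° C.; "
    "1H NMR (400 MHz, CDCl3) δ (ppm): 8.18 (s, 1H), 7.98-7.92 (m, 3H), "
    "7.72 (d, J=6.6 Hz, 3H), 7.66-7.61 (m, 3H), 7.51 (d, J=7.9 Hz, 2H), "
    "3.90-3.40 (m, 8H); MS (ESI) m/z 409.15 [C25H20N4O2+H]+."
)

# Each trigger phrase anchors one pattern.
MELTING_POINT = re.compile(r"m\.p\.\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*°\s*C")
YIELD = re.compile(r"\(\d+(?:\.\d+)?\s*mg,\s*(\d+(?:\.\d+)?)%")
MASS_SPEC = re.compile(r"m/z\s*(\d+(?:\.\d+)?)")
NMR_HEADER = re.compile(r"1H NMR \((\d+)\s*MHz,\s*(\w+)\)")
NMR_PEAK = re.compile(
    r"(\d+\.\d+(?:-\d+\.\d+)?)\s*\(([a-z]+),\s*(?:J=[\d.]+ Hz,\s*)?(\d+)H\)"
)


def extract_properties(text: str) -> dict:
    """Collect the properties whose trigger phrases appear in the text."""
    props = {}
    mp = MELTING_POINT.search(text)
    if mp:
        props["melting_point_c"] = (float(mp.group(1)), float(mp.group(2)))
    yld = YIELD.search(text)
    if yld:
        props["yield_pct"] = float(yld.group(1))
    ms = MASS_SPEC.search(text)
    if ms:
        props["ms_mz"] = float(ms.group(1))
    header = NMR_HEADER.search(text)
    if header:
        props["nmr_mhz"] = int(header.group(1))
        props["nmr_solvent"] = header.group(2)
    # Each peak is (shift or shift range in ppm, multiplicity, number of H).
    props["nmr_peaks"] = [(s, m, int(n)) for s, m, n in NMR_PEAK.findall(text)]
    return props


if __name__ == "__main__":
    print(extract_properties(SAMPLE))
```

A boiling point ("What about BP?") would need its own trigger pattern; none is attempted here because this passage does not report one.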
![Page 23: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/23.jpg)
1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
Re-creating spectral data from text data
text input
spectral output
Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK
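A rough sketch of the text-to-spectrum idea: parse each reported shift and proton count, then draw one Lorentzian line per signal. This is a simplification under stated assumptions, not the method used for the slide: multiplet splitting and J couplings are ignored, a reported range is collapsed to its midpoint, and the line width is an arbitrary choice.

```python
import re

NMR_TEXT = (
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), "
    "4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), "
    "6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)"
)

# shift, optional end of range, multiplicity, number of protons
PEAK = re.compile(r"(\d+\.\d+)(?:[–-](\d+\.\d+))?\s*\(([a-z]+),\s*(\d+)H")


def parse_peaks(text: str) -> list:
    """Return (center_ppm, n_protons) for every reported signal."""
    peaks = []
    for low, high, _multiplicity, n_h in PEAK.findall(text):
        center = (float(low) + float(high)) / 2 if high else float(low)
        peaks.append((center, int(n_h)))
    return peaks


def simulate_spectrum(peaks, start=0.0, stop=10.0, points=1001, width=0.02):
    """Sum one Lorentzian per signal, scaled by its proton count."""
    half = width / 2
    spectrum = []
    for i in range(points):
        ppm = start + i * (stop - start) / (points - 1)
        intensity = sum(
            n_h * half ** 2 / ((ppm - center) ** 2 + half ** 2)
            for center, n_h in peaks
        )
        spectrum.append((ppm, intensity))
    return spectrum


if __name__ == "__main__":
    for ppm, intensity in simulate_spectrum(parse_peaks(NMR_TEXT)):
        if intensity > 0.5:
            print(f"{ppm:5.2f} ppm  {intensity:6.2f}")
```

The resulting (ppm, intensity) pairs can be handed to any plotting tool to draw the reconstructed trace.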
![Page 24: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/24.jpg)
NMR data extracted by year of publication
[Figure: cumulative distinct NMR extracted (0 to 2,500,000) vs. year of publication (1976 to 2014); series: USPTO grants, USPTO applications]
Documenting the increase in data with time
From 1976-2014 USPTO NMR data, by nucleus:
H 975543
C 56536
unknown 44306
F 9429
P 3241
B 91
Si 62
Sn 22
Se 11
N 8
Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK
![Page 25: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/25.jpg)
Tracking technology improvement with time: example of NMR
[Figure: 1H-NMR frequency (0 MHz to 450 MHz) vs. year of patent filing (1976 to 2014)]
Source = Dr. Valery Tkachenko, Royal Society of Chemistry, UK
![Page 26: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/26.jpg)
Extracting chemical reactions from text
Results from Drs. Roger Sayle & Daniel Lowe, NextMove
![Page 27: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/27.jpg)
Step 7: To a solution of 4-(1-(4-cyanophenyl)-1H-benzo[d]imidazol-6-yl)benzoic acid 6 (300 mg, 0.88 mmol) in DMF (5 mL) was added NMM (0.178 mL, 1.76 mmol) followed by HATU (501 mg, 1.32 mmol) at rt and the solution was stirred for 30 min. Morpholine (0.84 mL, 0.968 mmol) was added to the reaction mixture and stirring was continued for 16 h. The reaction mixture was diluted with EtOAc and washed with water and brine solution. The organic layer was dried over anhydrous Na2SO4 and concentrated under reduced pressure to obtain crude product. The crude product was purified by preparative HPLC to afford 4-(6-(4-(morpholine-4-carbonyl)phenyl)-1H-benzo[d]imidazol-1-yl)benzonitrile (100 mg, 27%, AUC HPLC 95.9%) as an off-white solid; m.p. 207-210° C.; 1H NMR (400 MHz, CDCl3) δ (ppm): 8.18 (s, 1H), 7.98-7.92 (m, 3H), 7.72 (d, J=6.6 Hz, 3H), 7.66-7.61 (m, 3H), 7.51 (d, J=7.9 Hz, 2H), 3.90-3.40 (m, 8H); MS (ESI) m/z 409.15 [C25H20N4O2+H]+.
US20150038506
Reaction Extraction System
Making dark data useful. Example: extracting chemical reactions from text
Results from Drs. Roger Sayle & Daniel Lowe, NextMove
![Page 28: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/28.jpg)
US20150038506
Results from Drs. Roger Sayle & Daniel Lowe, NextMove
Making dark data useful
![Page 29: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/29.jpg)
The Growing Number of Chemical Reactions Derived from the Patent Literature
https://bitbucket.org/dan2097/patent-reaction-extraction/downloads/
Millions of reaction SMILES made publicly available thanks to Daniel Lowe & Roger Sayle
[Figure: number of chemical reactions vs. year of patent filing]
Source = Roger Sayle & Daniel Lowe, NextMove
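For readers new to the format, a reaction SMILES has three fields separated by ">": reactants, agents, products, with "." separating the components inside each field. A minimal parsing sketch follows; the example reaction (acid-catalysed esterification of acetic acid with ethanol) is illustrative and is not taken from the patent data set, and the handling of trailing annotations is an assumption.

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    reactants: List[str]
    agents: List[str]
    products: List[str]


def parse_reaction_smiles(line: str) -> Reaction:
    """Split 'reactants>agents>products' into lists of component SMILES."""
    # Assumption: anything after the first whitespace is annotation, so drop it.
    rxn = line.split()[0]
    fields = rxn.split(">")
    if len(fields) != 3:
        raise ValueError(f"expected three '>'-separated fields, got {len(fields)}")
    # Note: '.' also separates the ions of a salt, so a component is not
    # always a whole reagent.
    reactants, agents, products = (
        [s for s in field.split(".") if s] for field in fields
    )
    return Reaction(reactants, agents, products)


# Illustrative example: acetic acid + ethanol -> ethyl acetate + water.
EXAMPLE = "CC(=O)O.OCC>[H+]>CC(=O)OCC.O"

if __name__ == "__main__":
    print(parse_reaction_smiles(EXAMPLE))
```

The agents field may be empty ("A.B>>C"), which the list comprehension handles by returning an empty list.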
![Page 30: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/30.jpg)
Computer curation Classifying patents from their technical content
What does this enable that could not be done before ?
![Page 31: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/31.jpg)
Categorization of chemical reactions from patents
Results from Drs. Roger Sayle & Daniel Lowe, NextMove
10 most frequent reactions
Classifying patents via their scientific content
https://bitbucket.org/dan2097/patent-reaction-extraction/downloads/
![Page 32: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/32.jpg)
What does this enable that could not be done before ? Analyze scale-vs-yield
16,355 Suzuki coupling reactions extracted from 2001 – 2013 US applications
[Figure: % yield (20% to 80%) vs. mass of product (grams)]
Reactions of greatest interest for manufacturing: high yield, large scale
Results from Drs. Roger Sayle & Daniel Lowe, NextMove
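Once yield and product mass are extracted per reaction, the scale-vs-yield question becomes a simple filter. A minimal sketch with hypothetical records; the field names, thresholds, and values are assumptions for illustration and are not results from the extracted data set:

```python
# Hypothetical extracted records; field names and values are illustrative only.
REACTIONS = [
    {"id": "rxn-1", "yield_pct": 92.0, "product_mass_g": 250.0},
    {"id": "rxn-2", "yield_pct": 95.0, "product_mass_g": 0.05},
    {"id": "rxn-3", "yield_pct": 35.0, "product_mass_g": 500.0},
    {"id": "rxn-4", "yield_pct": 81.0, "product_mass_g": 120.0},
]


def manufacturing_candidates(reactions, min_yield_pct=80.0, min_mass_g=100.0):
    """Keep high-yield, large-scale reactions, largest scale first."""
    selected = [
        r for r in reactions
        if r["yield_pct"] >= min_yield_pct and r["product_mass_g"] >= min_mass_g
    ]
    return sorted(selected, key=lambda r: r["product_mass_g"], reverse=True)


if __name__ == "__main__":
    for reaction in manufacturing_candidates(REACTIONS):
        print(reaction)
```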
![Page 33: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/33.jpg)
What does this enable that could not be done before ? Analyzing frequency-vs-time
16,355 Suzuki coupling reactions extracted from 2001 – 2013 US applications
[Figure: Suzuki couplings as a percentage of reactions per year vs. year of patent filing]
Results from Drs. Roger Sayle & Daniel Lowe, NextMove
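The frequency-vs-time view is a per-year share of one reaction class. A minimal sketch, assuming each extracted reaction carries a filing year and a class label; the records below are made up for illustration and the class names other than Suzuki coupling are hypothetical:

```python
from collections import Counter

# Hypothetical (filing year, reaction class) pairs, for illustration only.
RECORDS = [
    (2001, "Suzuki coupling"),
    (2001, "Amide formation"),
    (2001, "Amide formation"),
    (2001, "Ester hydrolysis"),
    (2002, "Suzuki coupling"),
    (2002, "Amide formation"),
]


def percentage_by_year(records, reaction_class):
    """Share of each year's reactions that belong to reaction_class, in percent."""
    totals = Counter(year for year, _ in records)
    hits = Counter(year for year, name in records if name == reaction_class)
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}


if __name__ == "__main__":
    print(percentage_by_year(RECORDS, "Suzuki coupling"))
```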
![Page 34: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/34.jpg)
Relationships
![Page 35: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/35.jpg)
Entity types identified that were associated with structures derived from patents
Source = Roger Sayle – NextMove
![Page 36: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/36.jpg)
The Number of Biological Activities Derived from Patents vs the Scientific Literature
Source = Roger Sayle
![Page 37: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/37.jpg)
Evolving Analytics, Visualization & Knowledge
Availability of bulk machine-readable data
Understanding the Documents
Understanding what’s in the documents
Why bother ?
The format of the document
• Sections
• Tables
• Citations
• Data types
• Etc…
Analysis of the content
• NLP
• Entity identification
• Contextual Analysis
• Table data
• Relationships
• Normalization
• Integration with Other Data
• Development of feature spaces
• Seeing the unobvious
• Learning
• Predicting
![Page 38: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/38.jpg)
Patent data alone is insufficient
![Page 39: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/39.jpg)
PubChem: CID, SID, AID, InChIKey, CAS, Synonyms, PubMed, Patents
NLM MeSH: Chemical (Synonyms, MeSH DUI); Disease (MeSH DUI)
FDA SRS: Drug (FDA SPL, FDA NDC); Ingredient (UNII, InChIKey)
NCBI: Protein, Gene, CDD, Taxonomy, PubMed, BioSystems
NLM HSDB: Pharmacology, Toxicity, Metabolism, Properties, Manufacture
VA NDF-RT
NLM RxNorm
FDA/NLM DailyMed
NCI Metathesaurus
Disease Ontology
Protein Ontology
Gene Ontology
DrugBank: Drug (PubChem, ATC); Target (Uniprot, GeneCard)
KEGG: Drug (PubChem, ATC); Target (Gene); Disease (OMIM)
ChEMBL: Drug (ATC, ChEBI, TradeName); Compound (Pharmacology)
ChEBI: Source (IntEnz, KEGG, PDBeChem, ChEMBL)
IUPHAR-DB: Drug (Classification); Target (Nomenclature, Pharmacology)
IBM: Patent, PubMed
Legend: Terminology/Ontology; Public Database; Database + Terminology
Integration with Open Source Data
Drs. Evan Bolton & Gang Fu, NIH
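Integration across sources depends on shared identifiers such as the InChIKey. A minimal sketch of joining a patent-derived record with a public database record on that key; the record layout, field names, and the placeholder patent ID are assumptions for illustration, while the InChIKey and PubChem CID shown are those of toluene:

```python
from collections import defaultdict

TOLUENE_KEY = "YXFVVABEGXRONW-UHFFFAOYSA-N"  # standard InChIKey for toluene

# Hypothetical record from patent text mining (placeholder patent ID).
patent_records = [
    {"inchikey": TOLUENE_KEY, "patent": "EXAMPLE-PATENT-1", "role": "solvent"},
]

# Hypothetical record shaped like a public database entry.
database_records = [
    {"inchikey": TOLUENE_KEY, "pubchem_cid": 1140, "name": "toluene"},
]


def integrate(*sources):
    """Merge records from all sources that share the same InChIKey."""
    merged = defaultdict(dict)
    for source in sources:
        for record in source:
            fields = {k: v for k, v in record.items() if k != "inchikey"}
            merged[record["inchikey"]].update(fields)
    return dict(merged)


if __name__ == "__main__":
    print(integrate(patent_records, database_records))
```

The same join works for any pair of sources in the diagram that expose a common identifier, which is why unique entity IDs matter for integration.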
![Page 40: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/40.jpg)
NIH PubChem RDF – Triple & Entity Counts
https://pubchem.ncbi.nlm.nih.gov/rdf/ Drs. Evan Bolton & Gang Fu, NIH
Integration with Open Source Data
![Page 41: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/41.jpg)
What are “Cognitive Technologies” ?
![Page 42: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/42.jpg)
“Big Data, (Machine Learning, Neural Networks, Cognitive Computing, AI) is like teenage sex:
Everyone talks about it, nobody really knows how to do it,
Everyone thinks everyone else is doing it.
So everyone claims they are doing it….”
Source: Dan Ariely , Duke University
Machine Learning, Neural Networks, Cognitive Computing, AI
![Page 43: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/43.jpg)
Google “A mostly complete chart of Neural Networks”
[Figure: A mostly complete chart of Neural Networks]
![Page 44: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/44.jpg)
![Page 45: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/45.jpg)
Making IP Data Accessible and Useful: What Google is doing
Accessible Information: Usefulness starts with access to the information.
Transformation: Apply business logic, human curation, and/or machine learning
Useful Information: Solving user problems
The critical first step in making patent information useful is open access to machine-readable bulk data
Slide courtesy of Ian Wetherbee, Google
![Page 46: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/46.jpg)
• Machine Classification
• Document Similarity
• Machine Translation
• ….
http://media.epo.org/play/gsgoogle2017
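As a point of reference for the document-similarity item, the simplest baseline is cosine similarity over word counts. This sketch is a generic textbook technique and is not a description of what Google does; the function names and sample strings are assumptions.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Lower-case the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two term-count vectors (0.0 to 1.0)."""
    a, b = term_counts(text_a), term_counts(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    query = "palladium catalysed suzuki coupling of aryl halides"
    print(cosine_similarity(query, "suzuki coupling of aryl boronic acids"))
    print(cosine_similarity(query, "neural machine translation of patents"))
```

Learned embeddings replace the raw counts in modern systems, but the comparison step is often still a cosine between two vectors.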
![Page 47: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/47.jpg)
Google’s machine translation
[Figure: translation quality (0 to 6, where 6 = perfect translation) by translation model (human, neural (GNMT), phrase-based (PBMT)) for English>Spanish, English>French, English>Chinese, Spanish>English, French>English, Chinese>English]
47,710,923 patents full-text translated
Slide courtesy of Ian Wetherbee, Google
![Page 48: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/48.jpg)
Accessible Information
Usefulness starts with access to the information.
The advantages of making patents accessible & useful
Slide courtesy of Ian Wetherbee, Google
• Enables the private sector to transform and improve information, benefitting the patent system
• Improves transparency into patent quality and the patent system
• Improves transparency into legal rights
• Empowers the public to obtain the full benefits of the disclosure
“Open machine-readable data is the critical first step in making patent information useful” *
![Page 49: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/49.jpg)
An Example
Finding compounds that might fight cancer
What are people doing with this data ?
![Page 50: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/50.jpg)
Pharma asks
1. What genes regulate xyz condition ?
2. What compounds regulate those xyz genes ?
An approach to answering these questions : chemical ontologies
Other approaches include
• Computational chemical modeling
• Similarity Ensemble Approach (SEA)
• Literature-based discovery
• Experimental high-throughput screening
![Page 51: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/51.jpg)
Chemical Ontologies
But first some chemistry
Work done in collaboration with:
University of Alberta: Prof David Wishart & Yannick Feunang
OntoChem: Prof Lutz Weber
![Page 52: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/52.jpg)
Molecules have different types of attributes
Physical • Examples: Molecular Weight, Melting Point, Boiling Point
Molecular • Examples: Steroid, Prostaglandin, Amino Acid, Alkene, Imidazole
Functional • Examples: Anti-Inflammatory, Explosive, Refrigerant, Pesticide
Legal attributes • Patented for a purpose
![Page 53: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/53.jpg)
Example of a chemical ontology
Consider this molecule
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
![Page 54: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/54.jpg)
Carboxylic acid
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
![Page 55: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/55.jpg)
Carboxylic acid
Benzoic acid
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
![Page 56: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/56.jpg)
Carboxylic acid
Benzoic acid
Phenol
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
![Page 57: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/57.jpg)
Carboxylic acid
Benzoic acid
Phenol
Hydroxy group
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
![Page 58: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/58.jpg)
Carboxylic acid
Benzoic acid
Phenol
Hydroxy group
Azo
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
![Page 59: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/59.jpg)
Carboxylic acid
Benzoic acid
Phenol
Hydroxy group
Azo
Pyridine
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
![Page 60: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/60.jpg)
Carboxylic acid
Benzoic acid
Phenol
Hydroxy group
Azo
Pyridine
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
![Page 61: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/61.jpg)
Carboxylic acid
Benzoic acid
Phenol
Hydroxy group
Azo
Pyridine
Sulfone
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
![Page 62: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/62.jpg)
Carboxylic acid
Benzoic acid
Phenol
Hydroxy group
Azo
Pyridine
Sulfone
Sulfonamide
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
![Page 63: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/63.jpg)
Carboxylic acid
Benzoic acid
Phenol
Hydroxy group
Azo
Pyridine
Sulfone
Sulfonamide
Azobenzene
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
![Page 64: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/64.jpg)
Carboxylic acid
Benzoic acid
Phenol
Hydroxy group
Azo
Pyridine
Sulfone
Sulfonamide
Azobenzene
Benzene
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
![Page 65: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/65.jpg)
Carboxylic acid
Benzoic acid
Phenol
Hydroxy group
Azo
Pyridine
Sulfone
Sulfonamide
Azobenzene
Benzene
Molecular attributes (labels)
• Is a Benzoic acid
• Is a Carboxylic acid
• Is a Carbonyl compound
• Is a Phenol
• Is an Azobenzene
• Is an Azo compound
• Is a Sulfone
• Is a Sulfonamide
• Is a Pyridine
• Is a Benzene
• Is a Hydroxy
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
![Page 66: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/66.jpg)
Carboxylic acid
Benzoic acid
Phenol
Hydroxy group
Azo
Pyridine
Sulfone
Sulfonamide
Azobenzene
Benzene
Functional attributes
• Is used for the treatment of Crohn's disease
• Is used for the treatment of rheumatoid arthritis
• Is used for the treatment of ulcerative colitis
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
Molecular attributes (labels)
• Is a Benzoic acid
• Is a Carboxylic acid
• Is a Carbonyl compound
• Is a Phenol
• Is an Azobenzene
• Is an Azo compound
• Is a Sulfone
• Is a Sulfonamide
• Is a Pyridine
• Is a Benzene
• Is a Hydroxy
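The "is a" labels can be read as a small hierarchy: a specific label implies the more general ones. A minimal sketch in Python, assuming a hand-written parent map (the `PARENT` links and the `expand_labels` helper are illustrative assumptions, not the actual ClassyFire or OntoChem ontology):

```python
# Hypothetical parent links between labels; illustrative only.
PARENT = {
    "benzoic acid": "carboxylic acid",
    "carboxylic acid": "carbonyl compound",
    "azobenzene": "azo compound",
    "phenol": "hydroxy compound",
}

def expand_labels(labels):
    """Return the given labels plus every ancestor reachable through PARENT."""
    expanded = set()
    for label in labels:
        current = label
        # Walk up the hierarchy until the top or an already-visited label.
        while current is not None and current not in expanded:
            expanded.add(current)
            current = PARENT.get(current)
    return expanded

specific = {"benzoic acid", "azobenzene", "phenol", "pyridine", "sulfonamide"}
all_labels = expand_labels(specific)
```

Here five specific labels expand to nine, because benzoic acid also counts as a carboxylic acid and a carbonyl compound, and so on.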
![Page 67: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/67.jpg)
[H][C@@]12C[C@@]3([H])[C@]4([H])C[C@]([H])(F)C5=CC(=O)C=C[C@]5(C)[C@@]4(F)[C@@H](O)C[C@]3(C)[C@@]1(OC(C)(C)O2)C(=O)COC(C)=O
SMILES String
ClassyFire OntoChem
ClassyFire: Halogenated steroids (6); Fluorohydrins (7); Halohydrins (7); 1,3-dioxolanes (9); 11-beta-hydroxysteroids (9); Dioxolanes (9); 3-oxo delta-1,4-steroids (10); Alpha-acyloxy ketones (10); Delta-1,4-steroids (10); 11-hydroxysteroids (12); Gluco/mineralocorticoids, progestogins and derivatives (13); Pregnane steroids (13); 20-oxosteroids (15); Acetate salts (22); 3-oxosteroids (26); Oxosteroids (27); Carboxylic acid salts (30); Hydroxysteroids (32); Cyclic ketones (45); Alpha amino acid amides (73); Pyrrolidines (80); D-alpha-amino acids (85); Cyclic ketones (45); Acetals (50); Steroids and steroid derivatives (51); Alkyl fluorides (53); Alkyl halides (67); Cyclic alcohols and derivatives (86); Ketones (101); Organofluorides (128); Carboxylic acid esters (139); Secondary alcohols (187); Oxacyclic compounds (192); Lipids and lipid-like molecules (209); Organohalogen compounds (272); Ethers (393); Alcohols and polyols (395); Carboxylic acid derivatives (423); Carboxylic acids and derivatives (548); Carbonyl compounds (598); Organic acids and derivatives (633); Organoheterocyclic compounds (651); Organooxygen compounds (856); Organic compounds (978); Chemical entities (989); Hydrocarbon derivatives (995);
OntoChem: 17-deoxy-prednisolones (6); halohydrins (6); prednisolones (6); ethanoic acid esters (20); methyl esters (20); acetals (37); alkyl fluorides (56); cyclic ketones (61); natural product derivatives (92); fluorine compounds (126); alkene derivatives (172); polycyclic compounds (184); oxacyclic compounds (190); secondary alcohols (202); carboxylic acids (249); formic acid derivatives (559); lipophilic molecules (642); lipinski molecules (785); bioavailable molecules (867); oxygen compounds (891); small molecules (949); carbon compounds (974); hetero compounds (978);
Generating molecular attributes via SMILES
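To compare what the two classifiers say about the same molecule, the label lists can be parsed and intersected. The sketch below is in Python; `parse_labels` is a hypothetical helper, and the slide does not state what the number in parentheses means, so it is kept only as an opaque value.

```python
import re

def parse_labels(text):
    """Parse 'Label (n); Label (n); ...' into a dict of lowercase label -> number."""
    labels = {}
    for item in text.split(";"):
        match = re.fullmatch(r"\s*(.+?)\s*\((\d+)\)\s*", item)
        if match:  # skips the empty piece after the trailing ';'
            labels[match.group(1).lower()] = int(match.group(2))
    return labels

# Short excerpts copied from the two lists above.
classyfire = parse_labels(
    "Halohydrins (7); Acetals (50); Alkyl fluorides (53); Ketones (101);"
)
ontochem = parse_labels(
    "halohydrins (6); acetals (37); alkyl fluorides (56); carboxylic acids (249);"
)

# Labels that both tools assign, compared case-insensitively.
shared = sorted(set(classyfire) & set(ontochem))
```

Exact string matching is a deliberate simplification; the two vocabularies also contain near-synonyms that this comparison would miss.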
![Page 68: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/68.jpg)
Obtain a database of chemical compounds & their SAR
1. ChEMBL dB of 1.4 M compounds and their bioactivity towards targets (this database was provided by EBI)
2. SMILES strings are passed to the U of Alberta ClassyFire (CF) software and the OntoChem (OC) software, which return CF labels and OC labels (this processing was provided by U of Alberta and by OntoChem, respectively)
3. ChEMBL dB of 1.4 M compounds and their bioactivity towards targets, including CF + OC chemical labels
This processing was provided by IBM Research
We call this the CHEMBL ontology dB
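The combining step can be pictured as a join on the compound identifier. The Python sketch below assumes in-memory dictionaries; the record layout, the `build_ontology_db` function, and all sample values are hypothetical and are not the actual IBM Research pipeline.

```python
def build_ontology_db(compounds, cf_labels, oc_labels):
    """Attach CF and OC label sets to each compound record, keyed by compound ID."""
    merged = {}
    for compound_id, record in compounds.items():
        merged[compound_id] = {
            **record,
            # A compound missing from a label source gets an empty set.
            "cf_labels": set(cf_labels.get(compound_id, ())),
            "oc_labels": set(oc_labels.get(compound_id, ())),
        }
    return merged

# Made-up records for illustration; IDs and activities are not real data.
compounds = {
    "CPD-1": {"smiles": "c1ccccc1", "activities": {"MDM2": 12.0}},
    "CPD-2": {"smiles": "CCO", "activities": {}},
}
cf = {"CPD-1": ["Benzene"]}
oc = {"CPD-1": ["carbon compounds"], "CPD-2": ["oxygen compounds"]}
db = build_ontology_db(compounds, cf, oc)
```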
![Page 69: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/69.jpg)
MDM2 Raw output
Out of 1.4 M molecules, ~558 had activity towards MDM2, but only 27 had activity less than 30 nM
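Selecting the most potent compounds is a simple threshold filter. A minimal Python sketch, assuming activity values in nM where lower means more potent; the `potent_compounds` helper and the sample values are made up for illustration.

```python
def potent_compounds(activities_nm, cutoff_nm=30.0):
    """Return the IDs whose activity value is strictly below the cutoff."""
    return sorted(cid for cid, value in activities_nm.items() if value < cutoff_nm)

# Made-up MDM2 activity values in nM, for illustration only.
mdm2_activity = {"CPD-1": 4.5, "CPD-2": 29.9, "CPD-3": 30.0, "CPD-4": 850.0}
training_set = potent_compounds(mdm2_activity)
```

A value of exactly 30 is excluded here, matching "less than 30".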
![Page 70: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/70.jpg)
Scoring of molecular labels for MDM2-produced training set of 27 compounds
[label cutoff = 20, activity cutoff = 30, corpus count cutoff = 200K]
$\text{Score} = \dfrac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$
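The score is a chi-square style statistic computed per label. The slide does not say how the expected count is obtained; the Python sketch below assumes it is the training-set size times the label's frequency in the whole corpus. The function names and the label counts are hypothetical; only the sizes 27 and 1.4 M come from the slides.

```python
def expected_count(label_corpus_count, corpus_size, training_size):
    """Expected number of training compounds with a label, assuming the label
    occurs in the training set at its corpus-wide rate (an assumption)."""
    return training_size * label_corpus_count / corpus_size

def label_score(observed, expected):
    """Score = (observed - expected)^2 / expected."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (observed - expected) ** 2 / expected

corpus_size = 1_400_000
training_size = 27
# Made-up counts: how often each label occurs in the corpus and in the training set.
corpus_counts = {"label A": 2_800, "label B": 700_000}
observed_counts = {"label A": 20, "label B": 14}

scores = {
    label: label_score(
        observed_counts[label],
        expected_count(corpus_counts[label], corpus_size, training_size),
    )
    for label in corpus_counts
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

A rare label seen in most of the training set scores far higher than a common label seen about as often as chance predicts. Because the deviation is squared, strongly under-represented labels also score highly, so a real pipeline would probably keep only labels with observed above expected.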
![Page 71: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/71.jpg)
MDM2 Raw output
Classyfire (CF) OntoChem (OC)
![Page 72: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/72.jpg)
Comparison of the 100 compounds identified by CF with the 100 compounds identified by OC for MDM2
[label cutoff = 10 labels, assay minimum = 30, corpus count cutoff = 300K]
57 of the predicted compounds are in common
Overlap based on ChEMBL IDs of predicted compounds
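The overlap between the two prediction lists is a set intersection on compound IDs. A minimal Python sketch; the `overlap` helper and the placeholder IDs are assumptions for illustration.

```python
def overlap(cf_ids, oc_ids):
    """Return the IDs predicted by both methods and the share of the smaller list they cover."""
    cf_set, oc_set = set(cf_ids), set(oc_ids)
    common = cf_set & oc_set
    smaller = min(len(cf_set), len(oc_set))
    share = len(common) / smaller if smaller else 0.0
    return common, share

# Placeholder IDs, not real ChEMBL identifiers.
cf_predicted = ["ID-1", "ID-2", "ID-3", "ID-4"]
oc_predicted = ["ID-3", "ID-4", "ID-5", "ID-6"]
common, share = overlap(cf_predicted, oc_predicted)
```

With the numbers on the slide, 57 shared compounds out of two lists of 100 gives a share of 0.57.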
![Page 73: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/73.jpg)
A sample from the 57 compounds identified to have potential MDM2 Activity by both CF & OC
![Page 74: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/74.jpg)
![Page 75: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/75.jpg)
IC50 value for Mdm2/P53 binding was calculated by sigmoid fitting using Prism (GraphPad Software). The results are shown below.
US 2009/0312310 A1: this [240 page] patent application had 26 compounds with reported assay data for MDM2
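For readers unfamiliar with sigmoid fitting, the idea can be sketched in Python with a simplified model. This is not the algorithm Prism uses: it assumes a two-parameter Hill curve running from 0% to 100% inhibition, which can be linearized and fitted by ordinary least squares. The `fit_ic50` function and the data are hypothetical.

```python
import math

def fit_ic50(concentrations, inhibitions):
    """Estimate IC50 and Hill slope from percent-inhibition data.

    Assumes inhibition = 100 * c^h / (c^h + IC50^h), so that
    ln(y / (100 - y)) = h * ln(c) - h * ln(IC50) is a straight line in ln(c).
    Every inhibition value must lie strictly between 0 and 100.
    """
    xs = [math.log(c) for c in concentrations]
    ys = [math.log(y / (100.0 - y)) for y in inhibitions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    hill = sxy / sxx                    # slope of the fitted line
    intercept = mean_y - hill * mean_x  # equals -h * ln(IC50)
    return math.exp(-intercept / hill), hill

# Synthetic data generated from IC50 = 30 nM and Hill slope 1; not from the patent.
concs = [3.0, 10.0, 30.0, 100.0, 300.0]
inhib = [100.0 * c / (c + 30.0) for c in concs]
ic50, hill = fit_ic50(concs, inhib)
```

On noise-free synthetic data the fit recovers the generating parameters; real assay data would call for a nonlinear fit with free top and bottom plateaus.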
![Page 76: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/76.jpg)
IC50 value for Mdm2/P53 binding was calculated by sigmoid fitting using Prism (GraphPad Software). The results are shown below.
US 2009/0312310 A1
Example 18
Example 39
Example 93
Example 97
Example 111
Example 155
Example 126
Example 180
Example 220
![Page 77: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/77.jpg)
A sample from the 57 compounds identified to have potential MDM2 Activity by both CF & OC
![Page 78: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/78.jpg)
Compound Attributes
Feature vector for each compound (compound 1, compound 2, compound 3) over attributes A B C D E … Y Z, with example values 1 2 0 0 4 5 2 0 7 3 0 1 1 2
[Diagram labels: Structure (Mol File / SMILES); Physical Related Attributes (Lc, pKa, Log P, …); Functional Attributes (EC50, Ki, Target-Assay Pair); Other Attributes (LD50, Anti-target-Assay Pair); Primary Assays and Secondary Assays against targets such as MDM2, JAK3, SGLT2]
Target Attributes
Feature vector for each target (Target 1, Target 2, Target 3) over attributes A B C D E … Y Z, with example values 1 2 0 0 4 5 2 0 7 3 0 1 1 2
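A feature vector of this kind can be built by giving every attribute a fixed column. The Python sketch below uses 0/1 entries, whereas the example values on the slide are integers whose meaning is not stated. The helper names and the compound-to-label assignments are assumptions for illustration.

```python
def build_vocabulary(label_sets):
    """Assign each distinct attribute a fixed column index (sorted for reproducibility)."""
    return {label: i for i, label in enumerate(sorted(set().union(*label_sets)))}

def to_feature_vector(labels, vocabulary):
    """Return a 0/1 vector with a 1 in each column whose attribute the item carries."""
    vector = [0] * len(vocabulary)
    for label in labels:
        if label in vocabulary:
            vector[vocabulary[label]] = 1
    return vector

# Attribute names come from the ontology example; the assignments are made up.
compound_labels = {
    "compound 1": {"phenol", "pyridine"},
    "compound 2": {"sulfonamide"},
    "compound 3": {"phenol", "sulfonamide", "benzene"},
}
vocab = build_vocabulary(compound_labels.values())
vectors = {
    name: to_feature_vector(labels, vocab)
    for name, labels in compound_labels.items()
}
```

The same encoding applies to targets, which lets compounds and targets be compared in a shared representation.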
![Page 79: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/79.jpg)
Goal oriented learning
Cost / reward
Act
Predict an action which will reduce cost and/or increase reward
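In its simplest form, goal-oriented selection means picking the action with the best predicted balance of reward and cost. A minimal Python sketch; the `choose_action` helper, the candidate actions, and the numbers are all hypothetical.

```python
def choose_action(predicted):
    """Pick the action with the highest predicted net value (reward minus cost)."""
    return max(predicted, key=lambda action: predicted[action][0] - predicted[action][1])

# Hypothetical (reward, cost) predictions for three candidate actions.
predicted = {
    "synthesize analogue": (8.0, 5.0),
    "run secondary assay": (6.0, 1.0),
    "do nothing": (0.0, 0.0),
}
best = choose_action(predicted)
```

A learning system would also update these predictions from the outcome of each action, which this sketch leaves out.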
![Page 80: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/80.jpg)
BIG ISSUES
1) Obfuscation
2) Access to & integration of worldwide data
• Open access to bulk machine-readable data
3) Incentives & quotas
4) Algorithms and Bias
![Page 81: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/81.jpg)
BIG ISSUES: OBFUSCATION
WHAT IS THIS?
= a soccer ball
= a spherical recreational device
![Page 82: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/82.jpg)
Markush structures are daunting and the situation is getting worse
Source: Dr. George Papadatos, EMBL-EBI (European Molecular Biology Laboratory, European Bioinformatics Institute)
![Page 83: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/83.jpg)
BIG ISSUES
Access to and integration of worldwide data
![Page 84: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/84.jpg)
The Importance of Foreign Patent Filings
• Chinese, Japanese and Korean (CJK) patents now account for over half of all national patent filings and hence are of increasing importance to patent informatics.
To demonstrate the importance of this …
• 1,740,040 distinct compounds were extracted from ~63,000 Korean patent applications, spanning 1990 to March 2015
• Of these, ~230,770 compounds were novel to Korean patents when compared to compounds derived from US data (spanning 1976 to March 2015)
• In the period 2006-2014, 46% of compounds appeared in a KIPO filing before a USPTO filing.
Notes from Drs. D. Low & R. Sayles
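The novelty figure is a set difference between the Korean-derived and US-derived compound sets. A short Python sketch using the counts quoted on the slide; the `novel_ids` helper is hypothetical.

```python
korean_compounds = 1_740_040  # distinct compounds from Korean patent applications
novel_to_korean = 230_770     # approximate count not found in the US-derived set

novel_share = novel_to_korean / korean_compounds
print(f"{novel_share:.1%} of the Korean-derived compounds were absent from the US data")

def novel_ids(korean_ids, us_ids):
    """Compounds present in the Korean set but not in the US set."""
    return set(korean_ids) - set(us_ids)
```

Roughly one compound in eight extracted from the Korean filings would be missed by a search limited to US data.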
![Page 85: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/85.jpg)
An Example of extracting chemical entities from CJK patents
Notes from Drs. D. Low & R. Sayles
![Page 86: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/86.jpg)
Chemicals from Chinese Patents
Attempts to process Chinese Patent Documents
Extracting chemical structures from Chinese patents…
Work done in collaboration with Dr R Sayles
![Page 87: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/87.jpg)
Final Thoughts
Thanks to
• everyone in this room
• the scientific community
• especially those whose data was presented
• society in general
for providing us with these important
“Adjacent Possibilities”
![Page 88: Leveraging IP data for its scientific content...2018/04/03 · Leveraging IP data for its scientific content • The future of IP in the era of machine learning / cognitive computing](https://reader030.vdocument.in/reader030/viewer/2022041003/5ea52c520db24b75ad5c2d27/html5/thumbnails/88.jpg)
Source – J Kreulen
IBM Almaden Research Center, San Jose, California