-
The Integration of Chemistry with Everything Else
ACS National Meeting, August 2020
Ian Wetherbee, Lutz Weber, Evan Bolton, Vihang Mehta, Stephen Boyer, Jane Frommer
Contributions directed toward the identification, organization and FAIR availability of the world's molecular content
-
Previously at ACS San Diego...
Google Patents
Patent corpus
Google Scholar
Google Books
computer curation processes
-
Google Patents
Worldwide patent corpus
~ 50 Billion mentions of scientific entities
New today: Donating annotated data to the public
Google Translate
computer curation processes
Google BigQuery Dataset
NIH
Other
SciWalker
(Your application)
Data
CC BY 4.0
“Google Patents Research Data” by Google, based on data provided by IFI CLAIMS Patent Services and OntoChem, is licensed under a Creative Commons Attribution 4.0 International License.
Natural language processing
Name to structure
Image to structure
Map to ontology
-
Table
# of rows: 47,676,364,297
Schema
publication_number, ocid, preferred_name, domain, source, confidence, character_offset_start, character_offset_end, inchi, inchi_key, smiles
Data
count           distinct    domain        samples
14,187,900,687  8,512       substances    Fluoxastrobin|ougon|Sulfosulfuron
6,245,936,990   6,935       methods       electrochemical analysis|ultrasonication|slugging
4,863,317,264   10,770      effects       pro-activator|glycemic|vasopressic
3,891,972,531   13,629,554  chemCompound  sulfur|3-(3-methoxypropoxy)propan-1-ol|Iprindole
1,921,140,121   20,124      inorgmat      Tc alloys|Ferromolybdenum|Li[Ni1/3Co1/3Mn1/3]O2 (NCM 111)
1,711,449,095   2,652,600   chemGroup     stabilizer|1,1,1-trifluoroethane|3-pyrrolidinyl group
1,636,858,964   5,996       polymers      4-(hydroxymethyl)-1,3-dioxolan-2-one|dodecyl group|NoRC associated RNA
1,376,553,092   571,320     chemClass     α-D-xylose|Pyrylium salt|Isophorone
1,364,950,542   141,500     proteins      GLP1R family|C1QT3 family|Amphiregulin
1,167,892,259   10,704      nutrition     Pteridium aquilinum subsp pseudocaudatum|vanilla syrup|Basella
1,074,276,820   108,928     humangenes    AP3D1|IARS|EHD4
936,819,010     2,668       anatomy       Megakaryocytes|Neurofibrillary Tangles|Satellite Cells, Skeletal Muscle
906,061,800     29,624      drugs         Guanine|atovaquone|aminoxytriphene
835,712,433     199,070     species       Geotrichum candidum|Nezara|Abutilon
696,195,364     22,159      diseases      Fibromyalgia|Social avoidant behaviour|Adenocarcinoma pancreas
271,303,308     8,808       natprod       Vinblastine|Physagulin M|Indol-3-propionic acid
73,346,722      936         toxicity      aquatic toxicology|Ames test|ribotoxic
42,923,635      26,160,939  chem          N1(*)*N2C(C1=O)C(C(c1c2nc(Nc2**c*c2)nc1)(*)*)(*)*|C(N(I)CC)([C](#N)CCCCCCCC)CCC(=*)*|OC(=O)[C@]1(C[C@@H](C(*)C(C2(C(C2)C(C)CC)[C@@H](CO)O)O1)C)F
-
Google Patents extracts data from both text & images
Table
# of rows: 47,676,364,297
publication_number, ocid, preferred_name, domain, source, confidence, character_offset_start, character_offset_end, inchi, inchi_key, smiles
Data
count          distinct    domain        source
3,891,972,531  13,629,554  chemCompound  title,abstract,claims,description
1,711,449,095  2,652,600   chemGroup     title,abstract,claims,description
1,376,553,092  571,320     chemClass     title,abstract,claims,description
21,291,227     14,912,387  chem          pdf *
16,987,431     11,183,893  chem          image *
4,644,977      2,711,658   chem          mol
* still processing
-
Total number of chemistry entities extracted
36 full-text countries
Data
country_code description claims abstract pdf * image * title mol Grand Total
US 1,409,208,212 145,488,707 10,651,425 6,504,233 6,640,303 1,425,365 4,644,977 1,584,563,222
JP 1,059,614,185 93,795,657 14,312,034 3,136,333 3,051,411 1,319,277 1,175,228,897
CN 880,335,706 130,838,122 25,630,032 1,869,214 1,180,255 2,573,546 1,042,426,875
EP 680,356,821 87,966,443 5,377,765 1,595,481 1,972,898 840,088 778,109,496
WO 522,361,876 71,891,448 4,733,108 2,332,337 2,367,645 537,893 604,224,307
KR 445,605,239 49,543,964 5,263,342 1,299,386 1,118,158 516,001 503,346,090
CA 292,648,762 45,117,787 2,855,808 1,196,171 431,672 342,250,200
AU 258,294,420 46,518,099 441,824 1,224,705 193,271 327,160 306,999,479
TW 173,408,881 19,402,002 1,259,332 947,090 51,152 144,676 195,213,133
DE 78,908,305 11,580,392 1,213,910 162,209 460,756 92,325,572
ES 65,087,386 7,723,456 2,081,943 269,099 38,847 141,399 75,342,130
RU 41,438,558 11,637,908 2,211,258 151,414 169,750 146,552 55,755,440
HU 29,630,553 4,311,374 375,805 104,216 68,077 34,490,025
BR 26,326,881 3,065,815 658,663 103,517 26,257 144,139 30,325,272
GB 17,906,597 3,136,854 6,741,940 40,357 532 299,648 28,125,928
FR 21,874,630 3,797,415 672,698 53,555 13,107 154,800 26,566,205
CZ 18,422,163 3,750,093 247,458 65,715 25,433 22,510,862
DK 19,854,711 1,309,738 22,895 96,009 144 84,597 21,368,094
EA 14,802,787 1,759,257 336,897 86,134 27,593 17,012,668
PT 14,069,417 1,881,382 73,238 48,035 48,439 16,120,511
SK 10,616,113 2,204,289 128,483 35,653 13,119 12,997,657
FI 8,985,958 1,029,199 24,540 26,695 40 33,119 10,099,551
SU 5,638,012 1,197,129 410,553 7,567 90,371 7,343,632
CH 5,229,765 902,985 107,427 14,020 13 19,554 6,273,764
DD 4,547,354 742,071 183,256 12,317 17,961 5,502,959
NL 4,386,277 728,270 72,673 9,146 372 35,251 5,231,989
BG 3,142,521 665,065 66,290 14,842 9,970 3,898,688
CS 3,131,663 486,895 127,412 7,119 23,594 3,776,683
OA 2,938,019 627,075 68,314 13,236 4,853 3,651,497
AT 1,567,123 282,784 19,963 2,250 5 101,882 1,974,007
RO 1,439,864 233,016 107,070 5,079 14,563 1,799,592
BE 1,453,407 233,066 45,029 3,660 931 10,387 1,746,480
SI 899,347 274,808 19,133 3,729 14,236 1,211,253
LU 806,646 160,527 3,934 1,978 131 3,614 976,830
LV 588,528 124,227 9,173 5,151 2,177 729,256
AP 551,743 94,321 6,079 1,792 5,036 658,971
MX 485,287 65,869 551,156
LT 449,403 82,608 10,841 1,455 4,008 548,315
-
Start OCID    End OCID      Domain    Sub-domain
229910000000  229914999999  chem      inorganic materials
229915000000  229915999999  chem      alloys
229920000000  229929999999  chem      polymers
229930000000  229939999999  chem      natural products
229940000000  229969999999  chem      drugs
239000000000  239999999999  chem      substances
200000000000  209999999999  diseases  main
200000000000  200999999999  diseases  diseases::OIS
201000000000  201999999999  diseases  diseases::hdo
202000000000  202999999999  diseases  diseases::icd9
203000000000  203999999999  diseases  diseases::icd10
204000000000  204999999999  diseases  diseases::snomed
205000000000  205999999999  diseases  diseases::elsevier
206000000000  206999999999  diseases  diseases::MedDRA
208999000000  208999999999  diseases  diseases::MeSH
209000000000  209999999999  diseases  diseases::syno_ocids
Table
# of rows: 47,571,142,272
publication_number, ocid, preferred_name, domain, source, confidence, character_offset_start, character_offset_end, inchi, inchi_key, smiles
OCID : a unique identifier for every entity
OCIDs provided by OntoChem
Data
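Because every OCID falls in a numeric block reserved for a domain, the domain of an annotation can be recovered from the identifier alone. The following is a minimal sketch of such a lookup in Python, using only the ranges listed on this slide. The function name classify_ocid and the rule "narrowest matching range wins" (needed because the diseases "main" block contains the per-vocabulary blocks) are assumptions for illustration, not something OntoChem documents here.

```python
from typing import Optional, Tuple

# (start, end, domain, sub_domain) rows copied from the OCID range table on the slide.
OCID_RANGES = [
    (229910000000, 229914999999, "chem", "inorganic materials"),
    (229915000000, 229915999999, "chem", "alloys"),
    (229920000000, 229929999999, "chem", "polymers"),
    (229930000000, 229939999999, "chem", "natural products"),
    (229940000000, 229969999999, "chem", "drugs"),
    (239000000000, 239999999999, "chem", "substances"),
    (200000000000, 209999999999, "diseases", "main"),
    (200000000000, 200999999999, "diseases", "diseases::OIS"),
    (201000000000, 201999999999, "diseases", "diseases::hdo"),
    (202000000000, 202999999999, "diseases", "diseases::icd9"),
    (203000000000, 203999999999, "diseases", "diseases::icd10"),
    (204000000000, 204999999999, "diseases", "diseases::snomed"),
    (205000000000, 205999999999, "diseases", "diseases::elsevier"),
    (206000000000, 206999999999, "diseases", "diseases::MedDRA"),
    (208999000000, 208999999999, "diseases", "diseases::MeSH"),
    (209000000000, 209999999999, "diseases", "diseases::syno_ocids"),
]


def classify_ocid(ocid: int) -> Optional[Tuple[str, str]]:
    """Return (domain, sub_domain) for the narrowest listed range containing ocid.

    Hypothetical helper: returns None for identifiers outside the listed ranges.
    """
    matches = [r for r in OCID_RANGES if r[0] <= ocid <= r[1]]
    if not matches:
        return None
    # Assumption: when ranges nest, the most specific (narrowest) one applies.
    _, _, domain, sub_domain = min(matches, key=lambda r: r[1] - r[0])
    return domain, sub_domain


alloy = classify_ocid(229915000001)
icd10 = classify_ocid(203500000000)
unassigned_disease = classify_ocid(207000000000)
unknown = classify_ocid(1)
```

An identifier such as 207000000000 sits inside the diseases block but in none of the listed vocabularies, so this sketch reports it under the "main" sub-domain.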
-
Website
● Easy to use
● Limited feature scope/breadth
● Limit to configurability/extensibility
Publication
Patent
Compound
Disease
Clinical trial
Assay
Classification
Target
Side effect
Entity graph
Previously at ACS San Diego...
Database
● More complex configuration (uploading, maintenance)
● Private data, other data outside of scope
Dataset scope
Google BigQuery
-
Project 2
Project 1
SQL interface
1 TB query free / month
10 GB storage free / month
Google BigQuery
No server setup
Analyze, download results
A single huge, relational database
Compound -> MW, LogP, etc.
Patent -> Compound, Target
Patent -> Company, Class, Text, etc.
ACLs
etc.
Compound -> Toxicity
-
Example of use in cloud
BigQuery
-
1,074 patents in the D06M15/00 patent classification mention perfluorooctanoic acid
-
Example of use in web
PubChem
-
PubChem is a chemical information resource
• 100s of data fields about chemicals
• Biological activities
• Programmatic access interfaces
• FTP site for bulk downloads
• Extensive integration with chemistry-related websites
• Millions of monthly users
https://pubchem.ncbi.nlm.nih.gov
-
Google Patent contribution to PubChem
• +45B CSV rows available on the PubChem FTP site ‘as-is’
• +16K gzip compressed CSV files (subdirectories each with up to 1000 files)
• +4TB content uncompressed
• Structures via SMILES added to PubChem Substance (+9M)
• Association of patent to structure (billions of links)
-
Google Patent contribution to PubChem
Accessible via:
https://ftp.ncbi.nlm.nih.gov/pubchem/Other/GooglePatents/
ftp://ftp.ncbi.nlm.nih.gov/pubchem/Other/GooglePatents/
-
Google Patent contribution to PubChem
Processing of Google Patent contribution
1. Extract SMILES
2. Add as PubChem Substance records
3. Associate patent identifiers to records
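The three processing steps can be sketched in Python against one of the gzip-compressed CSV files. This is only an illustration under stated assumptions: it assumes each file carries a header row with the publication_number and smiles columns from the table schema, the function name patents_by_structure is hypothetical, and the sample rows and publication numbers are invented. It also treats each distinct SMILES string as one candidate substance record without canonicalizing it, which a real pipeline would presumably do with a cheminformatics toolkit.

```python
import csv
import gzip
import io
from collections import defaultdict
from typing import Dict, Set


def patents_by_structure(gz_bytes: bytes) -> Dict[str, Set[str]]:
    """Map each SMILES string to the set of patents that mention it."""
    links: Dict[str, Set[str]] = defaultdict(set)
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            # Step 1: extract SMILES; rows without a structure are skipped.
            smiles = (row.get("smiles") or "").strip()
            if not smiles:
                continue
            # Step 2: each distinct SMILES key stands in for one substance record.
            # Step 3: associate the patent identifier with that record.
            links[smiles].add(row["publication_number"])
    return dict(links)


# Invented sample data, shaped like the annotation table (not real records).
sample_csv = (
    "publication_number,preferred_name,smiles\n"
    "US-0000001-A,ethanol,CCO\n"
    "EP-0000002-A1,ethanol,CCO\n"
    "US-0000003-A,stabilizer,\n"
)
result = patents_by_structure(gzip.compress(sample_csv.encode("utf-8")))
```

Here the two ethanol rows collapse into a single structure linked to two patents, while the "stabilizer" row has no SMILES and is dropped.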
-
Google Patent contribution to PubChem
• To be integrated with other patent links within PubChem
• Searchable collection
• Patent section of compounds
• Accessible by programmatic interfaces
• Downloadable per record
• Associated to metadata about a given patent
-
Example: What types of assays are in PubChem?
‘starting material’ = PubChem tables
‘reagents’ = SQL query
product = answers to queries
Table: aid_sid_cid_acname_acvalue_aidname
Rows: 243,326,686
Size: 30.2 GB
Columns: aid, sid, cid, acname, acvalue, aidname
SQL in BQ
SELECT DISTINCT(acname), COUNT(acname) AS count_acn
FROM `ncbi-research-pubchem.pubchem.aid_sid_cid_acname_acvalue_aidname`
GROUP BY acname
ORDER BY count_acn DESC
Result (partial list)
acname      count_acn
Potency     51,654,071
IC50        12,947,417
EC50        2,664,101
Ki          559,067
IC90        334,633
AC50        209,578
GI50        132,714
Kd          88,048
CC50        75,027
MIC         64,718
LC50        38,073
TGI         33,396
Activity    27,139
AbsAC40_uM  21,374
ED50        16,745
LD50        3,018
554.3 MB processed, could run ~1800 times (= 1 TB) per month for free
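The "~1800 times" figure follows directly from dividing the free monthly query allowance by the bytes processed per run. A minimal Python sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the constant and function names are illustrative only:

```python
# Assumption: the 1 TB free query allowance is counted in decimal bytes.
FREE_QUERY_BYTES_PER_MONTH = 1_000_000_000_000


def free_runs_per_month(bytes_processed: int,
                        monthly_quota: int = FREE_QUERY_BYTES_PER_MONTH) -> int:
    """Number of complete query runs that fit inside the monthly free quota."""
    return monthly_quota // bytes_processed


# 554.3 MB processed by the assay-type query on this slide.
runs = free_runs_per_month(554_300_000)
```

If the allowance is instead counted in binary units, the result would be somewhat higher, so the slide's round figure is a conservative estimate.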
-
Kaggle +
Example: What compounds have assays for targets MDM2, MEK, KRAS, & IL17?
[Figure: Assay Counts by Target; axes: Assay type, Assay # (log)]
Free Python notebooks
5TB / month free BQ quota
-
Acknowledgements
We would like to acknowledge our colleagues who help make this work possible.
● Vihang Mehta
● Robert Frommer
● Brodrick Arneson
● Stephen S. Walker
● Igor Filippov
● Members of the OntoChem team
● The Intramural Research Program of the National Library of Medicine, National Institutes of Health.
● The entire PubChem team and all PubChem contributors and collaborators
-
Thank you!
BigQuery: "Google Patents Public Datasets"
patents-public-data.google_patents_research.publications
PubChem download: https://ftp.ncbi.nlm.nih.gov/pubchem/Other/GooglePatents/
Google Patents: patents.google.com
PubChem: pubchem.ncbi.nlm.nih.gov
SciWalker: www.sciwalker.com
-
Backup
-
This Kaggle script will query PubChem to answer the following question: What bioassay data is available for a specific target or list of targets (e.g. MDM2, MEK, KRAS, IL17)?
Example
-
PubChem
ChEMBL
Target Central
EPA
Clinical Trials
FDA
Medline
Drug Central
~ 250 tables from ~ 15 dbs
cluster
machine-learning
AI
visualization
post processing
How the data are used
Kaggle + BQ
-
Table: aid_sid_cid_acname_acvalue_aidname
Rows: 243,326,686
Columns: aid, sid, cid, acname, acvalue, aidname, inchikey, smiles
How many molecules have EC50 assays for SGLT2 ? What are they ?
or Kaggle Table 1
Answer: 143 unique molecules, 20 duplicate compounds
output
SELECT cid, acname, acvalue, aidname FROM `google.com:skb1-228101.NIH_Data.PUBCHEM_AID_CID_SMILES_EC50_Data` WHERE acname = 'EC50' AND UPPER(aidname) LIKE '%SGLT2%'
Example
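The query on this slide returns one row per assay result, so a compound tested in several SGLT2 assays appears more than once. A minimal Python sketch of separating unique from repeated compounds once the rows have been downloaded, assuming they arrive as (cid, acname, acvalue, aidname) tuples. The function name and the sample rows are invented for illustration, and this is only one plausible reading of how the "unique" and "duplicate" counts were obtained.

```python
from collections import Counter
from typing import Iterable, Tuple

Row = Tuple[int, str, float, str]  # (cid, acname, acvalue, aidname)


def summarize_compounds(rows: Iterable[Row]) -> Tuple[int, int]:
    """Return (number of distinct cids, number of cids seen more than once)."""
    counts = Counter(cid for cid, *_ in rows)
    unique = len(counts)
    repeated = sum(1 for n in counts.values() if n > 1)
    return unique, repeated


# Invented rows shaped like the query output (not real PubChem data).
rows = [
    (101, "EC50", 0.5, "SGLT2 assay A"),
    (101, "EC50", 0.7, "SGLT2 assay B"),
    (202, "EC50", 1.2, "SGLT2 assay A"),
]
summary = summarize_compounds(rows)
```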
-
Kaggle in BQ+
Results for 3,443 assays for MDM2
Output file opened with DataWarrior
Example
-
Biomarkers co-occurring in the same document or clinical trial
Most frequently mentioned pairs of biomarkers within the same document
Example
-
Opportunities for partners
Google
• Patents
• Scholar
• Books
BigQuery
Middleware
to be developed
Front end
Integrity
GWAS
PMC
MedLine
COSMIC
Patents
Text sources Databases
Your Applications
Data Studio
Client Data
Looker
Knime
Other
Open Data
-
Example