acs 2013 indianapolis_cvsp
![Page 1: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/1.jpg)
Karen Karapetyan, Colin Batchelor, Jonathan Steele, David Sharpe,
Valery Tkachenko, Antony Williams
ACS Indianapolis September 2013
Building support for the semantic web for chemistry
at the Royal Society of Chemistry
![Page 2: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/2.jpg)
![Page 3: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/3.jpg)
http://www.openphacts.org
Open PHACTS is an Innovative Medicines Initiative (IMI) project, aiming to reduce the barriers to
drug discovery in industry, academia and for small businesses.
The semantic web is one of its cornerstones.
![Page 4: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/4.jpg)
RDF Export
Data:
• ChEMBL
• HMDB
• DrugBank
Chemistry Validation and Standardization Platform (CVSP) at cvsp.chemspider.com
• Validation
• Standardization
• Parent generation
• Run on Hadoop-based farm
![Page 5: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/5.jpg)
CVSP : chemical validation
A free chemistry validation platform that performs:
• Structure validation
  • Atoms
  • Bonds
  • Valence
  • Stereo
  • If aromatic, check that the structure can be uniquely dearomatized
  • Strongest acid not ionized first in a partially ionized system
• Cross-matching of SDF fields
  • Synonyms
  • InChIs
  • SMILES
![Page 6: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/6.jpg)
Input formats supported:
• CDX, Mol, Sdf
• Zip
• Gz
• Tab-delimited text files
![Page 7: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/7.jpg)
CVSP: standardization modules
• Custom processing lets the user put together a workflow from a list of predefined standardization modules
![Page 8: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/8.jpg)
![Page 9: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/9.jpg)
Data set examples
• ChemSpider (passed 100K records)
  • All records are planned to pass through CVSP
• DrugBank (~6.5K records)
• ChEMBL (~1.2 million records)
![Page 10: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/10.jpg)
ChemSpider issues
![Page 11: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/11.jpg)
DrugBank dataset (6516 records)
~60 records that can’t be dearomatized unambiguously
DB04283 DB04462
![Page 12: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/12.jpg)
~30 records with bonds that do not make sense
DB04283
DB04009
![Page 13: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/13.jpg)
2 records where SMILES, InChI, and name did not match the structure
DB00611 DB01547
![Page 14: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/14.jpg)
~40 records where InChIs did not match the structure
DrugBank ID: DB00755
InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+
DrugBank ID: DB00614
![Page 15: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/15.jpg)
7 records with 2 stereo bonds at chiral atoms
DB08128 DB06287
J. Brecher, IUPAC, Graphical representation of stereochemical configuration, Section ST-1.1.10
![Page 16: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/16.jpg)
CVSP validation of ChEMBL 16 (~1.3 million records)
• Overall 0.7% of records had validation issues
• Stereo problems (~82%)
  • Directions of bonds do not make sense (~63%)
  • Ambiguous stereo: 2 stereo bonds at chiral center (~19%)
![Page 17: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/17.jpg)
“Direction of bond makes no sense” – 63%
![Page 18: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/18.jpg)
“Stereo types of the opposite bonds mismatch” – 15%
http://www.iupac.org/publications/pac/2006/pdf/7810x1897.pdf
![Page 19: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/19.jpg)
“Stereo types of non-opposite bonds match” – 2%
![Page 20: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/20.jpg)
“Atom not recognized” – 3% (isotopes)
In molfile:
• Should be atom from periodic table
• No mass difference in atom line
• No “M ISO” in connection table
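One reading of the molfile conditions listed above can be sketched in Python. This is a minimal sketch, not CVSP's implementation: it assumes V2000 molfiles with fixed-width atom lines (symbol in columns 32-34, mass difference in columns 35-36) and "M  ISO" property lines, and the names `atom_line` and `unrecognized_atoms` are hypothetical.

```python
# Symbols of the 118 named elements; anything else (e.g. "D" for deuterium)
# is reported as "atom not recognized" by this sketch.
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn
Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La
Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po
At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg
Cn Nh Fl Mc Lv Ts Og
""".split())


def atom_line(symbol, mass_diff=0, x=0.0, y=0.0, z=0.0):
    """Format the leading fields of a V2000 atom line (hypothetical helper)."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3}{mass_diff:2d}{0:3d}"


def unrecognized_atoms(molblock):
    """Return atoms whose symbol is not an element, with any isotope info found."""
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])            # counts line: first 3 chars = atom count
    atom_lines = lines[4:4 + n_atoms]

    # Collect atom numbers that carry an explicit "M  ISO" entry.
    iso_atoms = set()
    for line in lines[4 + n_atoms:]:
        if line.startswith("M  ISO"):
            fields = line[6:].split()
            count = int(fields[0])
            iso_atoms.update(int(a) for a in fields[1:1 + 2 * count:2])

    issues = []
    for index, line in enumerate(atom_lines, start=1):
        symbol = line[31:34].strip()
        if symbol in ELEMENTS:
            continue
        mass_field = line[34:36].strip()
        issues.append({
            "atom": index,
            "symbol": symbol,
            "mass_difference": int(mass_field) if mass_field else 0,
            "has_m_iso": index in iso_atoms,
        })
    return issues


# Illustrative record: heavy water drawn with "D" atoms and no isotope data.
heavy_water = "\n".join([
    "heavy water", "  sketch", "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    atom_line("O"), atom_line("D"), atom_line("D"),
    "  1  2  1  0", "  1  3  1  0",
    "M  END",
])

for issue in unrecognized_atoms(heavy_water):
    print(issue)
```

Each reported atom has a symbol outside the periodic table and shows whether a mass difference or an "M  ISO" entry accompanies it, which mirrors the three conditions on the slide.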
![Page 21: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/21.jpg)
CVSP : standardization
• Standardization workflow was developed for Open PHACTS’s registration system
• Workflow includes modules like
  • SMIRKS rules derived from FDA SRS manual
  • Resetting symmetric stereo
  • Dearomatize
  • Layout
  • Fix “fixable” stereo issues
  • Disconnect all metals from N, O, F
  • Fold non-stereo hydrogens
  • Handle partial ionization of acid-base
  • etc.
![Page 22: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/22.jpg)
Open PHACTS chemical registry system: what do we use as chemical identity?
• Standard InChI/InChIKey (currently used by ChemSpider)
• Absolute SMILES (isomeric canonical)
Drawbacks
• SMILES: many flavors
• Standard InChI
  • does not include unknown/undefined stereo unless at least one defined stereo is present
  • does not distinguish between undefined and unknown stereo (always “?”)
  • does some basic tautomer canonicalization, which we wanted to prevent in order to distinguish between all tautomers (sometimes useful for linking spectral data to a specific tautomer)
  • assumes absolute stereo or no stereo at all
Path we took: non-standard InChI with options SUU SLUUD FixedH SUCF
• Always include unknown/undefined stereo (‘u’, ‘?’)
• Add Fixed H layer (to distinguish between tautomers)
• Use chiral flag in MOL/SD record (ON: absolute stereo, OFF: relative)
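The option set named above can be kept as a small configuration, sketched here in Python. The option names come from the slide; the per-option glosses paraphrase the slide and should be checked against the InChI documentation, and the `prefix` parameter is an assumption (InChI tools are commonly given options with "-" on Unix-like systems and "/" on Windows). The function name is hypothetical.

```python
# Non-standard InChI options listed on the slide, with short glosses.
INCHI_OPTIONS = {
    "SUU": "always include unknown/undefined stereo",
    "SLUUD": "label unknown ('u') and undefined ('?') stereo differently",
    "FixedH": "add the Fixed H layer to distinguish tautomers",
    "SUCF": "use the chiral flag of the MOL/SD record (ON: absolute, OFF: relative)",
}


def inchi_option_string(options=INCHI_OPTIONS, prefix="-"):
    """Join option names into one string; the prefix character is an assumption."""
    return " ".join(f"{prefix}{name}" for name in options)


print(inchi_option_string())            # options for a Unix-style command line
print(inchi_option_string(prefix="/"))  # options for a Windows-style command line
```

Keeping the options in one place makes it harder for two parts of a registration pipeline to generate identifiers with different settings, which would silently break identity matching.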
![Page 23: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/23.jpg)
For each compound (CSID), parent generation is attempted

| Parent | Description |
| --- | --- |
| Charge-Insensitive | An attempt is made to neutralize ionized acids and bases. Envisioned as an ongoing improvement as new cases appear. |
| Isotope-Insensitive | Isotopes are replaced by the common weight |
| Stereo-Insensitive | Stereo is stripped |
| Tautomer-Insensitive | Tautomer canonicalization attempts to generate a “reasonable” tautomer |
| Super-Insensitive | This parent is all of the above |

No fragment-insensitive parent: we treat all fragments as equal entities

“Tautomerism in large databases”, Sitzmann et al., J. Comput. Aided Mol. Des. (2010)
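The parent types above can be modeled as a set of ignored features, with the super parent as their union. The following Python sketch is only an illustration of that data model; the class and function names are hypothetical, and no chemistry is performed.

```python
from enum import Flag, auto


class Insensitive(Flag):
    """Structural features that a given parent ignores."""
    CHARGE = auto()
    ISOTOPE = auto()
    STEREO = auto()
    TAUTOMER = auto()
    # The super parent combines all of the above.
    SUPER = CHARGE | ISOTOPE | STEREO | TAUTOMER
    # Deliberately no FRAGMENT member: all fragments are treated as equal entities.


def ignored_features(parent):
    """List the names of the single features that a parent type ignores."""
    singles = (Insensitive.CHARGE, Insensitive.ISOTOPE,
               Insensitive.STEREO, Insensitive.TAUTOMER)
    return [feature.name for feature in singles if feature in parent]


print(ignored_features(Insensitive.STEREO))
print(ignored_features(Insensitive.SUPER))
```

Two records that share a stereo parent differ at most in stereo; two records that share a super parent may differ in any of the four features.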
![Page 24: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/24.jpg)
Deposited SDF record: CTAB, REGID1, DataSource, Synonym1, Synonym2, XRef1, etc.
Standardized entity: OPS_ID1
Fragments
• Fragment (OPS_ID2)
• Fragment (OPS_ID3)
Parents
• Stereo Parent (OPS_ID4)
• Isotope Parent (OPS_ID5)
• Tautomer Parent (OPS_ID6)
• Charge Parent (OPS_ID7)
• Super Parent (OPS_ID8)
![Page 25: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/25.jpg)
Chemistry Validation and Standardization Platform (CVSP) at cvsp.chemspider.com
• Validation
• Standardization
• Parent generation
RDF Export
Data
![Page 26: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/26.jpg)
Data is being imported from ChemSpider to Open PHACTS in RDF/Turtle
![Page 27: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/27.jpg)
RDF/VoID
– VoID is an RDF Schema vocabulary for expressing metadata about RDF datasets. It is intended as a bridge between the publishers and users of RDF data. http://www.w3.org/TR/void
  • skos:exactMatch (SKOS: Simple Knowledge Organisation System). E.g. to link compounds in OPS with compounds in ChEBI.
  • skos:closeMatch. E.g. to link Stereo Insensitive Parents to their Children within OPS.
  • skos:relatedMatch. E.g. to link Parent compounds that contain others as Fragments.
– Recommendations on how to create the VoID have been specified by Manchester here: http://www.cs.man.ac.uk/~graya/ops/2012/ED-datadesc/
![Page 28: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/28.jpg)
[Figure: structure drawings for DrugBank ID DB07241 and the derived records OPS1, OPS2, OPS3, OPS4, OPS5 and OPS6]
ops:OPS1 skos:exactMatch <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB07241> .
ops:OPS2 skos:relatedMatch ops:OPS1 .
ops:OPS3 skos:relatedMatch ops:OPS1 .
ops:OPS3 skos:closeMatch ops:OPS4 .
ops:OPS3 skos:closeMatch ops:OPS5 .
ops:OPS4 skos:closeMatch ops:OPS6 .
ops:OPS5 skos:closeMatch ops:OPS6 .
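The triples above can be produced from a plain list of links. The Python sketch below serializes the same seven links as Turtle and groups them by predicate, which is one plausible first step toward a VoID description, since VoID characterizes a linkset by a single link predicate. The `ops:` namespace IRI is a placeholder (the slide does not give it), and the function names are hypothetical.

```python
PREFIXES = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    # Placeholder: the real ops: namespace is not shown on the slide.
    "ops": "http://example.org/ops/",
}

DRUGBANK = "<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB07241>"

# (subject, predicate, object) links exactly as listed on the slide.
LINKS = [
    ("ops:OPS1", "skos:exactMatch", DRUGBANK),
    ("ops:OPS2", "skos:relatedMatch", "ops:OPS1"),
    ("ops:OPS3", "skos:relatedMatch", "ops:OPS1"),
    ("ops:OPS3", "skos:closeMatch", "ops:OPS4"),
    ("ops:OPS3", "skos:closeMatch", "ops:OPS5"),
    ("ops:OPS4", "skos:closeMatch", "ops:OPS6"),
    ("ops:OPS5", "skos:closeMatch", "ops:OPS6"),
]


def to_turtle(links, prefixes=PREFIXES):
    """Serialize links as Turtle: prefix declarations, a blank line, then triples."""
    header = [f"@prefix {name}: <{iri}> ." for name, iri in prefixes.items()]
    body = [f"{s} {p} {o} ." for s, p, o in links]
    return "\n".join(header + [""] + body) + "\n"


def links_by_predicate(links):
    """Group (subject, object) pairs under their link predicate."""
    grouped = {}
    for s, p, o in links:
        grouped.setdefault(p, []).append((s, o))
    return grouped


print(to_turtle(LINKS))
for predicate, pairs in links_by_predicate(LINKS).items():
    print(predicate, len(pairs))
```

String formatting is enough for a fixed, trusted set of identifiers like this one; data with arbitrary IRIs or literals would need proper escaping or an RDF library.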
![Page 29: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/29.jpg)
![Page 30: Acs 2013 indianapolis_cvsp](https://reader033.vdocument.in/reader033/viewer/2022052619/555083adb4c905a85c8b4805/html5/thumbnails/30.jpg)
Future work
Enabling full semantic web capabilities:
• Establish an RDF server with all relationships (including parent-child relationships)
• Develop SPARQL capability for querying the RDF
Validate all records in ChemSpider by passing them through CVSP