Open PHACTS - Chemistry Platform Update and learnings
Antony Williams and Valery Tkachenko
ORCID ID:0000-0002-2668-4821
OpenPHACTS and CRS Diagram
The Chemical Registration Service
Chemistry processing
• Validation
• Standardization
• Properties generation
• Properties retrieval
Export
• RDF
• SDF
API
• Domain-specific searches
• Chemical visualization
• Properties
• Conversions
Subsystems
• “CVSP” (frontend, backend, database)
• Compounds (frontend, database)
• OpenPHACTS API (frontend, database)
• Datasources registry (frontend, database)
• Processing farm (optional)
Structure-Based Database linking
• Open PHACTS, and many other projects requiring the linking of structure databases, depend on mappings
• Different databases use different processes for standardization prior to deposition
• Examples: PubChem, EBI databases, ChemSpider, etc.
DrugBank
• ~60 records can’t be dearomatized unambiguously
• ~40 records where InChIs did not match structure
• 2 records where SMILES, InChI and name did not match the structure
• 7 records with 2 stereo bonds at chiral atoms
DB04283 DB04462
Standardizers
• EBI Standardizer: https://wwwdev.ebi.ac.uk/chembl/extra/francis/standardiser/
• PubChem Standardizer: https://pubchem.ncbi.nlm.nih.gov/standardize/standardize.cgi
• NCGC Standardizer: https://tripod.nih.gov/?p=61
• The CVSP Standardizer work in Open PHACTS http://cvsp.chemspider.com/
Standardization Rules
• Available from: http://tinyurl.com/hwapem3
• Use the SRS as guidance for standardization
• Adjust as necessary to our needs
Nitro groups
Salt and Ionic Bonds
The CVSP System
http://cvsp.chemspider.com
Supports various file formats
Comptox Chemistry Dashboard
Prior to deposition check a deposition…
>3450 compounds in one SDF
98 Errors, 1571 Warnings
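An SDF file is a concatenation of molfile records, each terminated by a line containing only `$$$$`, so the number of compounds in a deposition can be counted before upload. The following is a minimal sketch that is independent of CVSP; the function name `split_sdf` is illustrative.

```python
def split_sdf(text):
    """Split SDF text into individual records (the $$$$ delimiter is dropped)."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            # Delimiter reached: close the current record.
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record that is missing its final $$$$ delimiter.
    if any(line.strip() for line in current):
        records.append("\n".join(current))
    return records
```

Calling `len(split_sdf(open(path).read()))` on a deposition file gives the record count to compare against the number of compounds expected.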
Review Errors
Validation Rule Set
Various Rule Sets Available
CVSP – My own custom rules
ChEMBL Validation Review (of 1.3 million records)
• 11,020 records with 4 bonds and zero charge, e.g. CHEMBL501101 or CHEMBL501973
• 271 records with hypervalent oxygen (e.g. CHEMBL2219679), carbon (e.g. 1005895), boron, chlorine, iodine or phosphine
• 6,177 records where direction of bond makes no sense, e.g. CHEMBL12760 and CHEMBL34704
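Valence checks of this kind can be sketched as a comparison of each atom's bond-order sum against the largest valence allowed for its element and formal charge. The table below is a small illustrative subset chosen for this sketch, not the CVSP rule set; the `Atom` type and the function name are hypothetical, and a neutral nitrogen with four bonds is used as an assumed example of the "4 bonds and zero charge" case.

```python
from dataclasses import dataclass

# Illustrative maximum valences keyed by (element, formal charge).
# Assumption: a deliberately small subset, not the CVSP rule set.
MAX_VALENCE = {
    ("C", 0): 4,
    ("N", 0): 3,
    ("N", 1): 4,
    ("O", 0): 2,
    ("O", 1): 3,
    ("O", -1): 1,
}


@dataclass
class Atom:
    element: str
    charge: int
    bond_order_sum: int  # explicit bonds plus implicit hydrogens


def hypervalent_atoms(atoms):
    """Return (index, message) pairs for atoms exceeding their allowed valence."""
    issues = []
    for index, atom in enumerate(atoms):
        limit = MAX_VALENCE.get((atom.element, atom.charge))
        if limit is None:
            continue  # element/charge combination not covered by this sketch
        if atom.bond_order_sum > limit:
            issues.append(
                (index, f"{atom.element} with {atom.bond_order_sum} bonds "
                        f"and charge {atom.charge:+d}")
            )
    return issues
```

A real validator would derive the bond-order sum from the parsed structure and cover every element; the point here is only that the check needs both the bond count and the formal charge.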
Chemical Validation first… Standardization Second
• Chemical Validation detects errors – Standardization FIXES them according to rules
• SMIRKS transformations are based on both InChI Normalization and FDA SRS rules
Standardization SMIRKS
Examples of InChI normalization
[*;H+:1]>>[*;H:1]
[O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
[N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
Examples of FDA SRS rules
[n:1]=[O:2]>>[n+:1][O-:2]
[*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
Thiopurine
[H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
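Each rule is written as a reactant pattern, `>>`, and a product pattern, with atom-map numbers (`:1`, `:2`, …) tying atoms on the left to atoms on the right. Before a rule is added to a rule set, a quick sanity check is to compare the map numbers on both sides. The sketch below does this with a regular expression rather than a cheminformatics toolkit, so it checks only the text of the rule and not its chemistry; all function names are illustrative.

```python
import re

# An atom-map number is the ":<digits>" immediately before a closing bracket.
MAP_NUMBER = re.compile(r":(\d+)\]")


def split_rule(smirks):
    """Split 'reactants>>products' into its two sides."""
    reactants, sep, products = smirks.partition(">>")
    if not sep or ">>" in products:
        raise ValueError(f"expected exactly one '>>' in {smirks!r}")
    return reactants, products


def map_numbers(pattern):
    """Return the set of atom-map numbers used in one side of a rule."""
    return {int(n) for n in MAP_NUMBER.findall(pattern)}


def unmatched_maps(smirks):
    """Return (only_in_reactants, only_in_products) as sorted lists."""
    left, right = split_rule(smirks)
    lhs, rhs = map_numbers(left), map_numbers(right)
    return sorted(lhs - rhs), sorted(rhs - lhs)
```

For the amine and carboxylic acid rule above, `unmatched_maps` reports map number 6 as present only on the reactant side: the mapped hydrogen `[H:6]` is dropped from the acid while the nitrogen goes from `H3` to `H4`. A map number that appears only on the product side would more likely indicate a typo in the rule.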
Examples of Standardization
Double bond with adjacent wiggly single bond
Collapse hydrogen atoms with no stereo bonds
Examples of Standardization
Remove symmetric stereocenters
Turn off chiral flag if no up or down bonds
Chiral flag is set
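The rule "turn off chiral flag if no up or down bonds" can be sketched directly on a V2000 molfile, where the chiral flag occupies columns 13-15 of the counts line and each bond line carries its stereo code in columns 10-12 (1 = up, 6 = down). This is a minimal sketch, not the CVSP implementation; it assumes a well-formed V2000 record, does not handle V3000, and the function names are illustrative.

```python
def _field(text):
    """Parse a fixed-width integer field; a blank field counts as 0."""
    return int(text.strip() or 0)


def clear_unsupported_chiral_flag(molblock):
    """Set the V2000 chiral flag to 0 when no bond is marked up (1) or down (6)."""
    lines = molblock.split("\n")
    counts = lines[3]  # line 4 of a molfile is the counts line
    n_atoms = _field(counts[0:3])
    n_bonds = _field(counts[3:6])
    chiral_flag = _field(counts[12:15])
    # The bond block follows the atom block.
    first_bond = 4 + n_atoms
    bond_lines = lines[first_bond:first_bond + n_bonds]
    has_wedge = any(_field(bond[9:12]) in (1, 6) for bond in bond_lines)
    if chiral_flag == 1 and not has_wedge:
        lines[3] = counts[:12] + "  0" + counts[15:]
    return "\n".join(lines)
```

Records that do contain an up or down bond are returned unchanged, so the function can be applied to every record in a deposition.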
Defining a Community Rule Set
• There are multiple standardizers, each with its own rule set
• Can we decide on a default community rule set, like Standard InChI, that could be used by ALL Standardizers?
• A joint meeting between the Research Data Alliance (RDA), IUPAC and ACS Division of Chemical Information discussed the value and possibilities of this approach (July 2016)
EPA is investigating CVSP
• EPA is investigating CVSP as a validation and standardization platform
• Considering the API aspects of CVSP for integration with our registration system
• CVSP is a reference implementation and “starting point” for a community rule set
CVSP code is now Open Source
• Open Source CVSP code now released
• Code is hosted on Open PHACTS GitHub: https://github.com/openphacts/ops-crs
• Valery Tkachenko will offer future support
• Hoping for additional community engagement and support
• Some details of availability….
Virtual Machines
• OPS_FRONT (all websites and API)
• OPS_BACK (all heavy-lifting)
• OPS_DB (databases)
• VMs are VMware images
• Can be converted to other hypervisors