text mining to produce large chemistry datasets for community access

Post on 27-Jan-2017

Text-mining to produce large chemistry datasets for community access

Valery Tkachenko (1), Aileen Day (1), Daniel Lowe (2), Igor Tetko (3), Carlos Coba (4), Antony Williams (5)

(1) Royal Society of Chemistry, UK
(2) NextMove Software, UK
(3) HelmholtzZentrum München, Germany
(4) Mestrelab Research, Santiago de Compostela, Spain
(5) EPA, US

ACS Fall 2015, Boston, MA, August 17th 2015

ChemSpider

Refs - we live in a linked world

Properties

ChemSpider spectra

Knowledge systems

Datastore

Raw data

"Data in" process

"Data out" process

UI, API, Services, etc

RSC Archive – since 1841

Prospecting RSC articles

Further work – properties and spectra mining

Text mining of chemical documents

Term: Examples of text matched
FromLiterature: “lit.”
MeltingPoint: “mpt”, “melting point”, “m.p.”
Qualifier: “>”; “approximately”
Value: “75° C”, “200° F”, “one hundred degrees Celsius”
Range: “184-186° C”, “191.5 to 192.4° C”
MeasurementError: “50±° C”
OutcomeQualifier: “decomp.”, “with decomposition”, “subl.”

FromLiterature? MeltingPoint Qualifier? (Value | Range | MeasurementError) OutcomeQualifier?
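The grammar above can be approximated with a regular expression. This is only a minimal sketch, not the parser used by the authors: the names `MP_PATTERN` and `parse_melting_point` are hypothetical, the vocabulary is limited to the example strings on the slide, and spelled-out numbers ("one hundred degrees Celsius") and negative temperatures are not handled.

```python
import re

NUM = r"\d+(?:\.\d+)?"

# Hypothetical pattern: one named group per term of the slide's grammar.
MP_PATTERN = re.compile(
    rf"""
    (?P<from_lit>lit\.\s*)?                        # FromLiterature?
    (?P<mp>m\.p\.|mpt|melting\ point)\s*[:=]?\s*   # MeltingPoint
    (?P<qualifier>>|<|~|approximately)?\s*         # Qualifier?
    (?:
        (?P<low>{NUM})\s*(?:-|–|to)\s*(?P<high>{NUM})   # Range
      | (?P<center>{NUM})\s*±\s*(?P<error>{NUM})?       # MeasurementError
      | (?P<value>{NUM})                                # Value
    )
    \s*°\s*(?P<unit>[CF])                          # unit shared by all three forms
    (?:\s*\(?\s*(?P<outcome>decomp\.|with\ decomposition|subl\.))?  # OutcomeQualifier?
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_melting_point(text):
    """Return the first melting point found in text as a dict, or None."""
    m = MP_PATTERN.search(text)
    if m is None:
        return None
    if m["low"] is not None:
        low, high = float(m["low"]), float(m["high"])
    else:
        # A single value is stored as a zero-width range.
        low = high = float(m["center"] or m["value"])
    return {
        "from_literature": m["from_lit"] is not None,
        "qualifier": m["qualifier"],
        "low": low,
        "high": high,
        "error": float(m["error"]) if m["error"] else None,
        "unit": m["unit"].upper(),
        "outcome": m["outcome"],
    }
```

Values are returned in the unit found in the text; converting °F to °C is left to the caller.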

Why MP?

Used for water solubility prediction

Yalkowsky equation:

logS = 0.5 – 0.01(MP-25) – log Kow
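As a worked example, the equation can be evaluated directly. This sketch implements the formula exactly as written on the slide, assuming MP is in °C and Kow is the octanol-water partition coefficient; the function name is hypothetical.

```python
def log_solubility(mp_celsius, log_kow):
    """logS = 0.5 - 0.01 (MP - 25) - log Kow, as given on the slide."""
    return 0.5 - 0.01 * (mp_celsius - 25.0) - log_kow
```

For MP = 125 °C and log Kow = 2.0 this gives logS = 0.5 - 1.0 - 2.0 = -2.5. Because of the 0.01 coefficient, an error of 10 °C in the melting point shifts the predicted logS by 0.1 log units, which is why melting point data quality matters for solubility prediction.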

Detecting suspicious melting points

• Value was greater than 500° C

• Value was a range wider than 50° C

• Value was a range where the second temperature was lower than the first temperature
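The three checks above can be sketched as a small filter. The function name and the choice to apply the 500 °C limit to the higher end of a range are assumptions; temperatures are taken to be in °C already.

```python
def suspicious_melting_point(low_c, high_c=None):
    """Return the list of reasons a melting point looks suspicious (empty if none)."""
    if high_c is None:
        high_c = low_c  # single value: treat as a zero-width range
    reasons = []
    if max(low_c, high_c) > 500:
        reasons.append("value greater than 500 °C")
    if high_c - low_c > 50:
        reasons.append("range wider than 50 °C")
    if high_c < low_c:
        reasons.append("second temperature lower than the first")
    return reasons
```

A value such as 184-186 °C passes all checks, while 186-184 °C is flagged as a reversed range.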

300k Melting Point Datasets

Dataset: Number of data points
Bergström: 277
Bradley: 2886
OCHEM: 22404
Enamine: 21883
Patents: 228079

Tetko et al., J. Cheminformatics, in preparation

Melting point model: data distribution

Some modeling highlights

• LibSVM grid search was used to select parameters (ca. 1.5 years of CPU time spent on optimization)

• Largest model: 668k descriptors (MolPrint), ~0.2 trillion entries

• Biggest model: 618 Mb (Dragon descriptors)

• Most accurate model: Consensus, average of 5 models

• RMSE < 32 °C for the drug-like region, MP [50, 250] °C
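The consensus and the region-restricted RMSE can be illustrated with a few lines of code. This is a sketch under assumptions, not the authors' pipeline: the consensus is taken as an unweighted mean of the individual model predictions, and the drug-like region is selected by the observed melting point. Both function names are hypothetical.

```python
import math


def consensus(predictions_per_model):
    """Average the predictions of several models, compound by compound."""
    return [sum(values) / len(values) for values in zip(*predictions_per_model)]


def rmse_in_range(observed, predicted, low=50.0, high=250.0):
    """RMSE over compounds whose observed MP lies in [low, high] °C."""
    pairs = [(o, p) for o, p in zip(observed, predicted) if low <= o <= high]
    if not pairs:
        raise ValueError("no compounds in the requested range")
    return math.sqrt(sum((o - p) ** 2 for o, p in pairs) / len(pairs))
```

For example, `consensus([[100, 200], [110, 190]])` averages two models over two compounds, and `rmse_in_range` ignores any compound melting outside 50 to 250 °C.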

Prediction error

NMR data

• Extracted from 1976-2014 USPTO applications

Nucleus: Number of spectra
H: 975543
C: 56536
unknown*: 44306
F: 9429
P: 3241
B: 91
Si: 62
Sn: 22
Se: 11
N: 8

*unknown – starts off with NMR: peak list (no nucleus)

NMR text mining

• We can find and index text spectra:

13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
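A text spectrum like the one above can be split into nucleus, solvent, frequency and a peak list with a pair of regular expressions. This is a simplified sketch, not the extraction code behind the slides: `HEADER`, `FREQ`, `PEAK` and `parse_nmr` are hypothetical names, and annotations containing nested parentheses, such as "C(5a)H", would be truncated.

```python
import re

# Header such as "13C NMR (CDCl3, 100 MHz): δ = " or "1H-NMR (400 MHz, CDCl3): "
HEADER = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?)[\s-]*NMR\s*\((?P<conditions>[^)]*)\)\s*:?\s*(?:δ\s*=?\s*)?"
)
FREQ = re.compile(r"(\d+(?:\.\d+)?)\s*MHz", re.IGNORECASE)
# A shift (or shift range) followed by an optional parenthesised annotation.
PEAK = re.compile(r"(\d+(?:\.\d+)?(?:\s*[-–]\s*\d+(?:\.\d+)?)?)\s*(?:\(([^)]*)\))?")


def parse_nmr(text):
    """Parse one text spectrum into a dict, or return None if no header is found."""
    m = HEADER.search(text)
    if m is None:
        return None
    solvent, frequency = None, None
    for part in (p.strip() for p in m["conditions"].split(",")):
        f = FREQ.fullmatch(part)
        if f:
            frequency = float(f[1])
        elif part:
            solvent = part
    peaks = [
        {"shift": shift, "annotation": annotation or None}
        for shift, annotation in PEAK.findall(text[m.end():])
    ]
    return {
        "nucleus": m["nucleus"],
        "solvent": solvent,
        "frequency_mhz": frequency,
        "peaks": peaks,
    }
```

Shifts are kept as text so that ranges such as "7.18–7.94" survive unchanged.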

NMR extracted by year of publication

[Chart: cumulative distinct NMR extracted (axis 0 to 2,500,000) by year of publication, 1976-2014; series: USPTO grants, USPTO applications]

NMR solvents

[Pie chart. Shares, in slide order: 48.5%, 38.3%, 8.7%, 1.1%, 1.0%, 1.0%, 1.4%. Legend, in slide order: CDCl3, DMSO-d6, CD3OD, D2O, Acetone-d6, MeOD, Others]

Others: CD2Cl2, CD3CN-d3, C6D6, Pyridine-d5, THF-d8, CD3Cl, dimethylformamide-d7, d1-trifluoroacetic acid, methanol-d3, acetic acid-d4, toluene-d8, sulfuric acid-d2, 1,1,2,2-tetrachloroethane-d2, CD3OCD3, dioxane-d8, 1,2-dichloroethane-d4

1H-NMR frequency over time

[Chart: 1H-NMR frequency (axis 0 to 450 MHz) by year of patent filing, 1976-2014]

Mestrelab Mnova NMR

1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)

13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)

Detecting suspicious NMR spectra

• Last peak of NMR spectra is unannotated and:
  – All other peaks are annotated
  – Spectrum has 1 peak and is proton or unknown NMR
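The rule above can be sketched as a predicate over a parsed peak list. The reading used here is an interpretation, not something the slide states: for spectra with several peaks, the spectrum is flagged when only the last peak lacks an annotation (which suggests truncated text); for single-peak spectra, it is flagged only when the nucleus is proton or unknown. The function name and the `(shift, annotation)` tuple layout are hypothetical.

```python
def is_suspicious_spectrum(peaks, nucleus=None):
    """peaks: list of (shift, annotation) tuples; annotation is None if absent.

    nucleus: e.g. "1H" or "13C"; None means the nucleus is unknown.
    """
    if not peaks:
        return False
    *others, last = peaks
    if last[1]:
        return False  # last peak is annotated: nothing suspicious by this rule
    if others:
        # Several peaks: suspicious only if every earlier peak is annotated.
        return all(annotation for _, annotation in others)
    # Single unannotated peak: suspicious for proton or unknown NMR only.
    return nucleus in (None, "1H")
```

A fully unannotated 13C peak list is not flagged, since listing bare shifts is normal for that nucleus.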

[Structure image; visible atom labels: O, O, OH, Br]

> <SuspiciousValue>
true

> <Value>
1H-NMR (400 MHz, d6-Acetone): 11.8-10.8 (brs, 1H), 7.78

Comments: Only the labile proton is reported in the spectrum. The other aromatic and aliphatic protons are completely missing in the spectrum.

[Structure image; visible atom labels: H2N, NH2, O, O]

> <SuspiciousValue>
true

> <Value>
1H-NMR (400 MHz, CDCl3): 6.85 (1H, d, J=7.8 Hz), 6.10 (1H, dd, J=7.8 and 2.2 Hz), 6.06 (1H, d, J=2.2 Hz), 4.66 (1H, m), 3.75 (4H, br s), 3.40 (2H, s), 1.97

Comments: There are only 11 protons reported in the spectrum whilst the molecule contains more than 50 protons.

Knowledge systems

Datastore

Raw data

"Data in" process

"Data out" process

UI, API, Services, etc

Synthetic chemistry article

• Compounds
• Reaction
• Analytical Data
• Text and References

RSC Databases

• RSC Compounds
• RSC Reactions
• RSC Spectra
• RSC Crystals
• RSC Polymers
• RSC Materials
• RSC Assays
• RSC Algorithms
• RSC Models
• …and on…

Input pipeline

Deposition Gateway

Staging databases

Compounds Reactions Spectra Crystals

Materials

Compounds Module

Spectra Module

Reactions Module

Materials Module

Textmining Module

… Module

Web UI for unified depositions

DropBox, Google Drive, SkyDrive, etc

ELNs, templated data input

Documents

API, FTP, etc

Raw data

Validated data

Staging databases

All databases are sliced by data source/data collection and have a simple security model in which each data slice/source is private, public or embargoed
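One way to picture this security model is a per-slice visibility flag. The sketch below is purely illustrative and not the RSC implementation: the class and field names, the notion of a single owner, and the rule that an embargoed slice becomes readable on a release date are all assumptions.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass(frozen=True)
class DataSlice:
    source: str                           # data source / data collection name
    owner: str                            # assumed: one owning depositor per slice
    visibility: Visibility
    embargo_until: Optional[date] = None  # assumed: release date for embargoed slices


def can_read(data_slice, user, today):
    """Decide whether user may read the slice on the given day."""
    if data_slice.visibility is Visibility.PUBLIC:
        return True
    if user == data_slice.owner:
        return True
    if (data_slice.visibility is Visibility.EMBARGOED
            and data_slice.embargo_until is not None):
        return today >= data_slice.embargo_until
    return False
```

Attaching the flag to the slice rather than to individual records keeps the check to a single lookup per data source.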

Etc

Experiments

Research

Output pipeline

Compounds Reactions Spectra Crystals Documents

Compounds API, Reactions API, Spectra API, Crystals API, Documents API

Compounds Widgets, Reactions Widgets, Spectra Widgets, Crystals Widgets, Documents Widgets

Data layer

Data access layer

User interface widgets layer

Analytical Laboratory application

User interface layer (examples)

Electronic Laboratory Notebook

Paid 3rd party integrations (various platforms – SharePoint, Google, etc)

Chemical Inventory application

ChemSpider 2.0

Cross-database links

Compounds domain

Data quality issue and CVSP

– Robochemistry

– Proliferation of errors in public and private databases

• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL

– Automated quality control system

Chemistry Validation and Standardization Platform

Reactions domain


Analytical data domain

Crystallography domain

3D printable structures

New Repository Architecture

doi: 10.1007/s10822-014-9784-5

Thank you

Email: tkachenkov@rsc.org

Slides: http://www.slideshare.net/valerytkachenko16
