Text-mining to produce large chemistry datasets for community access Valery Tkachenko 1 , Aileen Day 1 , Daniel Lowe 2 , Igor Tetko 3 , Carlos Coba 4 , Antony Williams 5 1 Royal Society of Chemistry, UK 2 NextMove Software, UK 3 HelmholtzZentrum München, Germany 4 Mestrelab Research, Santiago de Compostela, Spain 5 EPA, US ACS Fall 2015 Boston, MA August 17 th 2015

Upload: valery-tkachenko

Post on 27-Jan-2017


TRANSCRIPT

Page 1: Text mining to produce large chemistry datasets for community access

Text-mining to produce large chemistry datasets for community access

Valery Tkachenko 1, Aileen Day 1, Daniel Lowe 2, Igor Tetko 3, Carlos Coba 4, Antony Williams 5

1 Royal Society of Chemistry, UK
2 NextMove Software, UK
3 HelmholtzZentrum München, Germany
4 Mestrelab Research, Santiago de Compostela, Spain
5 EPA, US

ACS Fall 2015
Boston, MA
August 17th 2015

Page 2: Text mining to produce large chemistry datasets for community access
Page 3: Text mining to produce large chemistry datasets for community access

ChemSpider

Page 4: Text mining to produce large chemistry datasets for community access

References: we live in a linked world

Page 5: Text mining to produce large chemistry datasets for community access

Properties

Page 6: Text mining to produce large chemistry datasets for community access

ChemSpider spectra

Page 7: Text mining to produce large chemistry datasets for community access

Knowledge systems

Datastore

Raw data

"Data in" process

"Data out" process

UI, API, Services, etc.

Page 8: Text mining to produce large chemistry datasets for community access

RSC Archive – since 1841

Page 9: Text mining to produce large chemistry datasets for community access

Prospecting RSC articles

Page 10: Text mining to produce large chemistry datasets for community access

Further work – properties and spectra mining

Page 11: Text mining to produce large chemistry datasets for community access

Text mining of the chemical documents

Term: Examples of text matched

FromLiterature: “lit.”
MeltingPoint: “mpt”, “melting point”, “m.p.”
Qualifier: “>”; “approximately”
Value: “75° C”, “200° F”, “one hundred degrees Celsius”
Range: “184-186° C”, “191.5 to 192.4° C”
MeasurementError: “50±° C”
OutcomeQualifier: “decomp.”, “with decomposition”, “subl.”

Grammar:

FromLiterature? MeltingPoint Qualifier? (Value | Range | MeasurementError) OutcomeQualifier?
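The grammar above can be approximated with a regular expression. The following is only a minimal sketch, not the extraction code used for this work: the pattern, the function name `parse_melting_point` and the returned dictionary keys are all assumptions made for illustration. It handles numeric values only (spelled-out numbers such as "one hundred degrees Celsius" and negative temperatures are out of scope).

```python
import re

NUMBER = r"\d+(?:\.\d+)?"

# Hypothetical pattern mirroring the grammar on the slide, term by term.
MP_PATTERN = re.compile(
    rf"""
    (?P<from_literature>lit\.\s*)?                          # FromLiterature?
    (?P<melting_point>mpt|melting\s+point|m\.p\.)\s*:?\s*   # MeltingPoint
    (?P<qualifier>>|approximately)?\s*                      # Qualifier?
    (?:
        (?P<low>{NUMBER})\s*(?:-|to)\s*(?P<high>{NUMBER})   # Range
      | (?P<mean>{NUMBER})\s*±\s*(?P<error>{NUMBER})?       # MeasurementError
      | (?P<value>{NUMBER})                                 # Value
    )
    \s*°\s*(?P<unit>[CF])
    (?:\s*\(?(?P<outcome>decomp\.|with\s+decomposition|subl\.)\)?)?  # OutcomeQualifier?
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_melting_point(text):
    """Return a dict describing the first melting point found, or None."""
    m = MP_PATTERN.search(text)
    if m is None:
        return None
    if m["low"] is not None:
        kind, values = "range", (float(m["low"]), float(m["high"]))
    elif m["mean"] is not None:
        kind, values = "measurement_error", (float(m["mean"]),)
    else:
        kind, values = "value", (float(m["value"]),)
    return {
        "kind": kind,
        "values": values,
        "unit": m["unit"].upper(),
        "from_literature": m["from_literature"] is not None,
        "qualifier": m["qualifier"],
        "outcome": m["outcome"],
    }
```

For example, `parse_melting_point("lit. m.p. 184-186° C (decomp.)")` yields a range flagged as a literature value with the outcome qualifier "decomp.". The alternatives are ordered Range, MeasurementError, Value so that the longest form is tried first.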

Page 12: Text mining to produce large chemistry datasets for community access

Why MP?

Used for water solubility prediction

Yalkowsky equation:

logS = 0.5 – 0.01(MP-25) – log Kow
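As a worked illustration, the equation can be written as a one-line function. The function and parameter names are assumptions for this sketch; MP is taken in °C, as on the slide.

```python
def yalkowsky_log_s(mp_celsius, log_kow):
    """Yalkowsky equation as given on the slide:
    logS = 0.5 - 0.01 * (MP - 25) - log Kow, with MP in degrees Celsius."""
    return 0.5 - 0.01 * (mp_celsius - 25.0) - log_kow


# Illustrative inputs only (not data from the talk): MP = 125 °C, log Kow = 2.0
print(yalkowsky_log_s(125.0, 2.0))
```

Because the melting point enters with a coefficient of 0.01, every 10 °C of error in MP shifts the predicted logS by 0.1 log units, which is why accurate melting points matter for this use.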

Page 13: Text mining to produce large chemistry datasets for community access

Detecting suspicious melting points

• Value was greater than 500° C

• Value was a range wider than 50° C

• Value was a range where the second temperature was lower than the first temperature
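The three checks above translate directly into a small filter. This is a sketch under assumptions: the function name, the reason codes and the choice to take temperatures already converted to °C are illustrative, not taken from the talk.

```python
def suspicious_melting_point(low_c, high_c=None):
    """Return a list of reason codes; an empty list means nothing was flagged.

    low_c is the single value, or the first temperature of a range;
    high_c is the second temperature of a range, or None for a single value.
    """
    reasons = []
    # Rule 1: value greater than 500 °C
    if low_c > 500 or (high_c is not None and high_c > 500):
        reasons.append("too_high")
    if high_c is not None:
        # Rule 2: range wider than 50 °C
        if high_c - low_c > 50:
            reasons.append("range_too_wide")
        # Rule 3: second temperature lower than the first
        if high_c < low_c:
            reasons.append("range_reversed")
    return reasons
```

A record such as "184-186° C" passes with no flags, while "186-184° C" is reported as `range_reversed`.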

Page 14: Text mining to produce large chemistry datasets for community access

300k Melting Point Datasets

Data points per dataset:

Bergström: 277
Bradley: 2,886
OCHEM: 22,404
Enamine: 21,883
Patents: 228,079

Tetko et al., J. Cheminformatics, in preparation

Page 15: Text mining to produce large chemistry datasets for community access

Melting point model: data distribution

Page 16: Text mining to produce large chemistry datasets for community access

Some modeling highlights

LibSVM grid search was used to select parameters (ca. 1.5 years of CPU time spent on optimization)

Largest model: 668k descriptors (MolPrint), ~0.2 trillion entries

Biggest model: 618 Mb (Dragon descriptors)

Most accurate model: consensus, average of 5 models

RMSE < 32 °C for the drug-like region, MP [50, 250] °C
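To make the last two points concrete, the sketch below averages the predictions of several models and computes RMSE over a melting point window. It is an illustration only: the function names are hypothetical, and restricting the window by the observed (rather than predicted) value is an assumption, since the slide does not say how the drug-like region was selected.

```python
import math


def consensus(predictions_per_model):
    """Average per-compound predictions across models (one inner list per model)."""
    n_models = len(predictions_per_model)
    return [sum(column) / n_models for column in zip(*predictions_per_model)]


def rmse(predicted, observed, lo=None, hi=None):
    """RMSE over compounds whose observed value lies in [lo, hi] (assumed convention)."""
    pairs = [
        (p, o)
        for p, o in zip(predicted, observed)
        if (lo is None or o >= lo) and (hi is None or o <= hi)
    ]
    if not pairs:
        raise ValueError("no compounds in the requested window")
    return math.sqrt(sum((p - o) ** 2 for p, o in pairs) / len(pairs))
```

With the window set to `lo=50, hi=250`, `rmse` reports the error for the drug-like region quoted above; the talk's consensus used five models, but `consensus` accepts any number.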

Page 17: Text mining to produce large chemistry datasets for community access

Prediction error

Page 18: Text mining to produce large chemistry datasets for community access

NMR data

• Extracted from 1976-2014 USPTO applications

Spectra by nucleus:

H: 975,543
C: 56,536
unknown*: 44,306
F: 9,429
P: 3,241
B: 91
Si: 62
Sn: 22
Se: 11
N: 8

*unknown – text starts off with "NMR:" followed by a peak list (no nucleus given)

Page 19: Text mining to produce large chemistry datasets for community access

NMR text mining

• We can find and index text spectra:

13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
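A text spectrum like the one above has a regular enough shape to be split into nucleus, conditions and peaks. The following is a minimal sketch, not the parser used for this work: the regular expressions, the function names and the output layout are assumptions, and real patent text has many more variants than this covers.

```python
import re

# Optional isotope label ("13C", "1H-"), then "NMR", optional "(conditions)".
HEADER = re.compile(
    r"^\s*(?:(?P<mass>\d+)\s*(?P<element>[A-Z][a-z]?)[\s-]*)?NMR\s*"
    r"(?:\((?P<conditions>[^)]*)\))?\s*:?\s*(?:δ\s*=?\s*)?"
)
# A shift (single value or range), optionally followed by "(annotation)".
PEAK = re.compile(
    r"^(?P<shift>\d+(?:\.\d+)?(?:\s*[-–]\s*\d+(?:\.\d+)?)?)\s*(?:\((?P<annotation>.*)\))?$"
)


def split_top_level(text):
    """Split on commas that are not inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def parse_nmr(text):
    """Return nucleus, solvent, frequency and (shift, annotation) peaks, or None."""
    header = HEADER.match(text)
    if header is None:
        return None
    nucleus = header["mass"] + header["element"] if header["element"] else "unknown"
    solvent, frequency_mhz = None, None
    for cond in (header["conditions"] or "").split(","):
        cond = cond.strip()
        number = re.search(r"\d+(?:\.\d+)?", cond)
        if "mhz" in cond.lower() and number:
            frequency_mhz = float(number.group())
        elif cond:
            solvent = cond
    peaks = []
    for token in split_top_level(text[header.end():]):
        m = PEAK.match(token)
        if m:
            peaks.append((m["shift"], m["annotation"]))
    return {"nucleus": nucleus, "solvent": solvent,
            "frequency_mhz": frequency_mhz, "peaks": peaks}
```

Shifts are kept as strings so that ranges such as "7.18–7.94" survive unchanged, and a peak without a parenthesised annotation gets `None`, which is the property the suspicious-spectrum checks later in the talk rely on.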

Page 20: Text mining to produce large chemistry datasets for community access

NMR extracted by year of publication

[Chart: cumulative distinct NMR extracted (y-axis, 0 to 2,500,000) by year of publication (x-axis, 1976 to 2014); series: USPTO grants, USPTO applications]

Page 21: Text mining to produce large chemistry datasets for community access

NMR solvents

CDCl3: 48.5%
DMSO-d6: 38.3%
CD3OD: 8.7%
D2O: 1.1%
Acetone-d6: 1.0%
MeOD: 1.0%
Others: 1.4%

Others: CD2Cl2, CD3CN-d3, C6D6, Pyridine-d5, THF-d8, CD3Cl, dimethylformamide-d7, d1-trifluoroacetic acid, methanol-d3, acetic acid-d4, toluene-d8, sulfuric acid-d2, 1,1,2,2-tetrachloroethane-d2, CD3OCD3, dioxane-d8, 1,2-dichloroethane-d4

Page 22: Text mining to produce large chemistry datasets for community access

1H-NMR frequency over time

[Chart: 1H-NMR frequency (y-axis, 0 to 450 MHz) by year of patent filing (x-axis, 1976 to 2014)]

Page 23: Text mining to produce large chemistry datasets for community access

Mestrelab Mnova NMR

Page 24: Text mining to produce large chemistry datasets for community access

1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)

Page 25: Text mining to produce large chemistry datasets for community access

13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)

Page 26: Text mining to produce large chemistry datasets for community access

Detecting suspicious NMR spectra

• Last peak of NMR spectra is unannotated and:

– All other peaks are annotated

– Spectrum has 1 peak and is proton or unknown NMR
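One possible reading of this rule as code is sketched below. It is an interpretation, not the authors' implementation: the function name and the peak representation (a list of `(shift, annotation)` pairs, with `None` for a missing annotation) are assumptions, and the two sub-conditions are treated as alternatives under the shared "last peak is unannotated" condition.

```python
def suspicious_nmr(nucleus, peaks):
    """Flag a text spectrum whose last peak has no annotation.

    nucleus: e.g. "1H", "13C", or "unknown" when no nucleus was given.
    peaks:   list of (shift, annotation) pairs; annotation is None if absent.
    """
    if not peaks or peaks[-1][1] is not None:
        return False  # the rule only applies when the last peak is unannotated
    # Sub-condition 1: every peak before the last one carries an annotation.
    others_annotated = len(peaks) > 1 and all(a is not None for _, a in peaks[:-1])
    # Sub-condition 2: a lone peak in a proton or unknown-nucleus spectrum.
    single_peak = len(peaks) == 1 and nucleus in ("1H", "unknown")
    return others_annotated or single_peak
```

A 13C peak list in which several shifts are left unannotated is not flagged by this reading, because unannotated 13C shifts are routine; only a lone bare value at the end of an otherwise annotated list is treated as a sign that the text was cut off.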

Page 27: Text mining to produce large chemistry datasets for community access

[Structure drawing; atom labels shown: O, O, OH, Br]

> <SuspiciousValue>
true

> <Value>
1H-NMR (400 MHz, d6-Acetone): 11.8-10.8 (brs, 1H), 7.78

Comments: Only the labile proton is reported in the spectrum. The other aromatic and aliphatic protons are completely missing in the spectrum.

Page 28: Text mining to produce large chemistry datasets for community access

[Structure drawing; atom labels shown: H2N, NH2, O, O]

> <SuspiciousValue>
true

> <Value>
1H-NMR (400 MHz, CDCl3): 6.85 (1H, d, J=7.8 Hz), 6.10 (1H, dd, J=7.8 and 2.2 Hz), 6.06 (1H, d, J=2.2 Hz), 4.66 (1H, m), 3.75 (4H, br s), 3.40 (2H, s), 1.97

Comments: There are only 11 protons reported in the spectrum whilst the molecule contains more than 50 protons.

Page 29: Text mining to produce large chemistry datasets for community access

Knowledge systems

Datastore

Raw data

"Data in" process

"Data out" process

UI, API, Services, etc.

Page 30: Text mining to produce large chemistry datasets for community access

Synthetic chemistry article

Compounds
Reaction
Analytical Data
Text and References

Page 31: Text mining to produce large chemistry datasets for community access

RSC Databases

RSC Compounds
RSC Reactions
RSC Spectra
RSC Crystals
RSC Polymers
RSC Materials
RSC Assays
RSC Algorithms
RSC Models
…and on…

Page 32: Text mining to produce large chemistry datasets for community access

Input pipeline

Deposition Gateway

Deposition routes: Web UI for unified depositions; DropBox, Google Drive, SkyDrive, etc.; ELNs, templated data input; API, FTP, etc.

Processing modules: Compounds Module, Spectra Module, Reactions Module, Materials Module, Textmining Module, … Module

Data flow: raw data → validated data

Staging databases (labels in the diagram): Compounds, Reactions, Spectra, Crystals, Materials, Documents, Experiments, Research, etc.

All databases are sliced by data source/data collection and have a simple security model in which each data slice/source is private, public or embargoed.
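The three-state security model for data slices could be represented along the following lines. This is purely a hypothetical sketch: the class and function names are invented for illustration, and the access rule (private and embargoed slices readable only by their owner, with embargo release not modelled) is an assumption, not a description of the RSC system.

```python
from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass(frozen=True)
class DataSlice:
    source: str            # data source / data collection the slice belongs to
    owner: str             # hypothetical owner identifier
    visibility: Visibility


def can_read(data_slice, user):
    """Assumed rule: public slices are open; others are visible to the owner only."""
    if data_slice.visibility is Visibility.PUBLIC:
        return True
    return user == data_slice.owner
```

Attaching the visibility to the slice rather than to individual records keeps the check to a single lookup per data source, which fits the "simple security model" wording on the slide.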

Page 33: Text mining to produce large chemistry datasets for community access

Output pipeline

Data layer: Compounds, Reactions, Spectra, Crystals, Documents

Data access layer: Compounds API, Reactions API, Spectra API, Crystals API, Documents API

User interface widgets layer: Compounds Widgets, Reactions Widgets, Spectra Widgets, Crystals Widgets, Documents Widgets

User interface layer (examples): Analytical Laboratory application; Electronic Laboratory Notebook; Paid 3rd party integrations (various platforms – SharePoint, Google, etc); Chemical Inventory application; ChemSpider 2.0

Page 34: Text mining to produce large chemistry datasets for community access

Cross-database links

Page 35: Text mining to produce large chemistry datasets for community access

Compounds domain

Page 36: Text mining to produce large chemistry datasets for community access

Data quality issue and CVSP

– Robochemistry

– Proliferation of errors in public and private databases

• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL

– Automated quality control system

Page 37: Text mining to produce large chemistry datasets for community access

Chemistry Validation and Standardization Platform

Page 38: Text mining to produce large chemistry datasets for community access

Reactions domain

Page 39: Text mining to produce large chemistry datasets for community access

Reactions domain

Page 40: Text mining to produce large chemistry datasets for community access

Analytical data domain

Page 41: Text mining to produce large chemistry datasets for community access

Crystallography domain

Page 42: Text mining to produce large chemistry datasets for community access

3D printable structures

Page 43: Text mining to produce large chemistry datasets for community access

New Repository Architecture

doi: 10.1007/s10822-014-9784-5

Page 44: Text mining to produce large chemistry datasets for community access

Thank you

Email: [email protected]

Slides: http://www.slideshare.net/valerytkachenko16