Transcript
Page 1: ChemSpider as an integration hub for interlinked chemistry data

ChemSpider as an Integration Hub for Interlinked Chemistry Data

Antony Williams

SETAC

November 18th 2013

Page 2: ChemSpider as an integration hub for interlinked chemistry data

How Much Data Online?

• How much data regarding environmental toxicology and chemistry is online?

• How can it all be mapped together?

Page 3: ChemSpider as an integration hub for interlinked chemistry data

A Grand Challenge….

• Let’s map together all historical chemistry data and build systems to integrate new data

• Let’s integrate chemistry, toxicology and biology data and add in disease data too

• Let’s model the data and see if we can extract new relationships, both quantitative and qualitative

• Let’s make it all available on the web

Page 4: ChemSpider as an integration hub for interlinked chemistry data
Page 5: ChemSpider as an integration hub for interlinked chemistry data

What about this….

• We’re going to map the world

• We’re going to take photos of as many places as we can and link them together

• We’ll let people annotate and curate the map

• Then let’s make it available free on the web

• We’ll make it available for decision making

• Put it on Mobile Devices, Give it Away

Page 6: ChemSpider as an integration hub for interlinked chemistry data

The World of Online Chemistry

• Property databases

• Compound aggregators

• Screening assay results

• Scientific publications

• Encyclopedic articles (Wikipedia)

• Metabolic pathway databases

• ADME/Tox data – eTOX for example

• Blogs/Wikis and Open Notebook Science

Page 7: ChemSpider as an integration hub for interlinked chemistry data

How to Map Data Together

• Download the structure representations and map together at the structure level

• Integrate and mesh chemical names, chemical properties, analytical data

• Carry URL links and retain external links to original data sets (assume no link decay)

• It sounds easy….
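The mapping steps above can be sketched in a few lines. This is a minimal illustration, not ChemSpider's actual pipeline: the source names, field names and example.org URLs are all made up, and the InChIKey is assumed to be the structure-level key. Records that share a structure key are merged into one hub entry, and each entry keeps links back to the original data sets instead of copying their data.

```python
from collections import defaultdict

# Hypothetical records from three made-up sources; field names and URLs are assumptions.
RECORDS = [
    {"source": "source_a", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
     "name": "aspirin", "url": "https://example.org/a/1"},
    {"source": "source_b", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
     "name": "acetylsalicylic acid", "url": "https://example.org/b/42"},
    {"source": "source_c", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
     "name": "ethanol", "url": "https://example.org/c/7"},
]


def merge_by_structure(records):
    """Group source records under one hub entry per structure key."""
    hub = defaultdict(lambda: {"names": set(), "links": []})
    for rec in records:
        entry = hub[rec["inchikey"]]
        # Mesh the names reported by different sources for the same structure.
        entry["names"].add(rec["name"])
        # Keep the link to the original data set rather than harvesting its data.
        entry["links"].append((rec["source"], rec["url"]))
    return dict(hub)


hub = merge_by_structure(RECORDS)
for key, entry in hub.items():
    print(key, sorted(entry["names"]), len(entry["links"]), "link(s)")
```

The sketch only works when every source supplies a correct structure and the links stay alive, which is where "it sounds easy" stops being true.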

Page 8: ChemSpider as an integration hub for interlinked chemistry data

ChemSpider

• Build a HUB connecting as many data sources as possible

• NOT to harvest all data from each data source

• Today we have >29 million unique chemicals from >500 data sources

• Focus on improving data quality

• Allow users to enhance, curate and annotate

Page 9: ChemSpider as an integration hub for interlinked chemistry data

RSC’s ChemSpider

Page 10: ChemSpider as an integration hub for interlinked chemistry data

Identifiers are very useful! But what about when they are “closed”?

Page 11: ChemSpider as an integration hub for interlinked chemistry data

CAS Numbers Validation?
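One check that can be run without access to the registry is the CAS check digit. The sketch below assumes the usual published rule: the last digit equals the weighted sum of the other digits (weights 1, 2, 3, ... counted from the right) modulo 10. The function name is hypothetical.

```python
import re


def is_valid_cas(cas: str) -> bool:
    """Return True if the string has CAS format and a correct check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


print(is_valid_cas("7732-18-5"))  # water
print(is_valid_cas("7732-18-4"))  # same number with a wrong check digit
```

A passing check digit only shows that the number is well formed. It does not show that the number is assigned to the structure it is attached to, which is the harder validation problem.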

Page 12: ChemSpider as an integration hub for interlinked chemistry data

Various Registration Numbers

Page 13: ChemSpider as an integration hub for interlinked chemistry data

Mappings and Inconsistencies

PubChem, DrugBank, ChemSpider

Imatinib Mesylate

Page 14: ChemSpider as an integration hub for interlinked chemistry data

The InChI Identifier

Page 15: ChemSpider as an integration hub for interlinked chemistry data

InChI Strings Hash to InChIKeys
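The resulting key has a fixed 27-character layout, which a short sketch can make concrete. The code below does not compute the hash itself; it only splits an existing InChIKey into its parts, assuming the standard layout of a 14-letter connectivity block, an 8-letter block for the remaining layers, a standard/non-standard flag, a version letter and a protonation letter. The class and function names are hypothetical.

```python
import re
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    skeleton: str      # 14 letters hashed from the connectivity layers
    details: str       # 8 letters hashed from the remaining layers (e.g. stereo)
    is_standard: bool  # True for flag "S" (standard InChI), False for "N"
    version: str       # InChI version letter
    protonation: str   # protonation indicator ("N" means neutral)


_INCHIKEY = re.compile(r"([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])")


def parse_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its blocks, or raise ValueError if malformed."""
    match = _INCHIKEY.fullmatch(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, details, flag, version, protonation = match.groups()
    return InChIKeyParts(skeleton, details, flag == "S", version, protonation)


# Ethanol, used here only as a familiar example.
print(parse_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
```

Because the key is a hash, it cannot be converted back into a structure; it serves as a compact, web-searchable lookup string for the full InChI.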

Page 16: ChemSpider as an integration hub for interlinked chemistry data

Vancomycin – Search the Internet

Page 17: ChemSpider as an integration hub for interlinked chemistry data

Vancomycin

Search Molecular SKELETON

Search Full Molecule
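The difference between the two searches can be sketched with InChIKeys, assuming the skeleton search matches on the first block of the key only and the full-molecule search matches on the whole key. The tiny record set is illustrative: the keys are believed to be those of L-alanine, D-alanine, alanine without stereo, and ethanol, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative records; the keys are believed to be correct but are not verified here.
RECORDS = {
    "QNAYBMKLOCPYGJ-REOHZIJLSA-N": "L-alanine",
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N": "D-alanine",
    "QNAYBMKLOCPYGJ-UHFFFAOYSA-N": "alanine (stereo unspecified)",
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N": "ethanol",
}


def build_skeleton_index(records):
    """Index record names by the first (connectivity) block of the InChIKey."""
    index = defaultdict(list)
    for key, name in records.items():
        index[key.split("-")[0]].append(name)
    return index


def search_full(records, key):
    """Full-molecule search: the whole key must match."""
    return [records[key]] if key in records else []


def search_skeleton(index, key):
    """Skeleton search: only the first block of the key must match."""
    return list(index.get(key.split("-")[0], []))


index = build_skeleton_index(RECORDS)
query = "QNAYBMKLOCPYGJ-REOHZIJLSA-N"
print(search_full(RECORDS, query))
print(search_skeleton(index, query))
```

The skeleton search always returns at least as many hits as the full search, because it also picks up records with different or missing stereochemistry.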

Page 18: ChemSpider as an integration hub for interlinked chemistry data

Full Skeleton Search: 529 Hits

Page 19: ChemSpider as an integration hub for interlinked chemistry data

Full Molecule Search: 294 Hits

Page 20: ChemSpider as an integration hub for interlinked chemistry data

Historical Data for reference

• As evidence that InChI is proliferating and data is improving:

• Three years ago there were only 104 hits on the complete InChI online

• Only 4 were correct

Page 21: ChemSpider as an integration hub for interlinked chemistry data

What you might not know about Chemistry Databases on the Internet

Page 22: ChemSpider as an integration hub for interlinked chemistry data

NCGC Pharma Collection

Page 23: ChemSpider as an integration hub for interlinked chemistry data

NCGC Pharma Collection

Page 24: ChemSpider as an integration hub for interlinked chemistry data

NCGC Pharma Collection

Page 25: ChemSpider as an integration hub for interlinked chemistry data

PHYSPROP Database

• The freely downloadable database under the EPI Suite prediction software

• Very basic filters suggest data quality issues

Page 26: ChemSpider as an integration hub for interlinked chemistry data

The Stereochemistry Challenge

12,500 chemicals with “missed” stereo

Page 27: ChemSpider as an integration hub for interlinked chemistry data

NIST Webbook

Page 28: ChemSpider as an integration hub for interlinked chemistry data

PubChem

Page 29: ChemSpider as an integration hub for interlinked chemistry data

Patents

Page 30: ChemSpider as an integration hub for interlinked chemistry data

Patents

Page 31: ChemSpider as an integration hub for interlinked chemistry data

But ChemSpider is curated, right?

Page 32: ChemSpider as an integration hub for interlinked chemistry data

Originally 15 compounds “called” Yohimbine

54 Skeletons for Yohimbine

Page 33: ChemSpider as an integration hub for interlinked chemistry data

Crowdsourced Curation

• Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

Page 34: ChemSpider as an integration hub for interlinked chemistry data

Search “Vitamin H”

Page 35: ChemSpider as an integration hub for interlinked chemistry data

“Curate” Identifiers

Page 36: ChemSpider as an integration hub for interlinked chemistry data

“Curate” Identifiers

Page 37: ChemSpider as an integration hub for interlinked chemistry data

“Curate” Identifiers

Page 38: ChemSpider as an integration hub for interlinked chemistry data

Chemical name dictionaries for:

• Text-mining (publications, patents)

  • Used to index PubMed and link Google Patents

• Linking to other databases – think Biology!

  • When structures are not available names link

• Searching the web

  • Names link to structures link to InChIs
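A dictionary-based lookup of this kind can be sketched as follows. The dictionary is a tiny hypothetical one: the record identifiers are placeholders, not real ChemSpider IDs, and a production dictionary would hold curated synonyms for millions of records. Names are matched case-insensitively, longest first, so that multi-word names are not split.

```python
import re

# Hypothetical curated dictionary: name or synonym -> placeholder record identifier.
NAME_TO_RECORD = {
    "vitamin h": "rec-biotin",
    "biotin": "rec-biotin",
    "coenzyme r": "rec-biotin",
    "vincristine": "rec-vincristine",
}


def build_matcher(names):
    """Compile one regex that matches any dictionary name as a whole word."""
    # Longest names first so multi-word names win over shorter overlapping ones.
    alternatives = sorted(names, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in alternatives)
    return re.compile(r"\b(?:" + pattern + r")\b", re.IGNORECASE)


def find_chemicals(text, name_to_record=NAME_TO_RECORD):
    """Return (matched text, record identifier) pairs in order of appearance."""
    matcher = build_matcher(name_to_record)
    return [(m.group(0), name_to_record[m.group(0).lower()])
            for m in matcher.finditer(text)]


print(find_chemicals("Vitamin H (biotin) and vincristine were assayed."))
```

Two different names resolving to the same record is the point of the exercise: the quality of the linking depends directly on how well the synonyms have been curated.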

Page 39: ChemSpider as an integration hub for interlinked chemistry data

I want to know about “Vincristine”

Page 40: ChemSpider as an integration hub for interlinked chemistry data

Vincristine: Identifiers to link

Page 41: ChemSpider as an integration hub for interlinked chemistry data

Vincristine: Vendors and Sources

Linked by Structure

Page 42: ChemSpider as an integration hub for interlinked chemistry data

Vincristine: Patents

Linked by Name

Page 43: ChemSpider as an integration hub for interlinked chemistry data

Vincristine: Articles

Linked by Name

Page 44: ChemSpider as an integration hub for interlinked chemistry data

What needs to happen?

• Standards

  • Standardization of structures

• More sharing of data – downloadable data collections for mapping, meshing and integration

• InChI adoption

• Collaboration

  • Stop reinventing the wheel

  • Share data, share efforts and speed the process

Page 45: ChemSpider as an integration hub for interlinked chemistry data

Adopting Modified FDA Rules

Page 46: ChemSpider as an integration hub for interlinked chemistry data

Nitro groups

Page 47: ChemSpider as an integration hub for interlinked chemistry data

Salt and Ionic Bonds

Page 48: ChemSpider as an integration hub for interlinked chemistry data

Ammonium salts

Page 49: ChemSpider as an integration hub for interlinked chemistry data

What if we could capture it all?

Digitally Enhancing the RSC Archive

Page 50: ChemSpider as an integration hub for interlinked chemistry data

Start with data in publications

Page 51: ChemSpider as an integration hub for interlinked chemistry data

Text Mining

The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) were charged into a glass reaction vessel equipped with a mechanical stirrer, thermometer and reflux condenser.

The reaction mixture was heated at reflux with stirring, for a period of about one-half hour.

After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue.
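As a rough illustration of what text mining pulls out of a passage like this, the sketch below extracts reagent names and amounts with a regular expression. It is a deliberately naive assumption-laden example, not the text-mining system used for the RSC archive: it only handles plain-letter names directly followed by a parenthesised amount, and the unit list and function name are made up for the example.

```python
import re

# Plain-letter name, then "(amount unit)"; tolerant of spaces inside the brackets.
_QUANTITY = re.compile(
    r"([A-Za-z][A-Za-z ]*?)\s*\(\s*(\d+(?:\.\d+)?)\s*(ml|mg|g|mmol|mol)\s*\)"
)


def extract_quantities(text):
    """Return (reagent name, amount, unit) tuples found in the text."""
    results = []
    for name, amount, unit in _QUANTITY.findall(text):
        # Drop a leading conjunction that the name pattern may have swallowed.
        name = re.sub(r"^(?:and|or)\s+", "", name.strip())
        results.append((name, float(amount), unit))
    return results


sentence = ("thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged "
            "into a glass reaction vessel")
print(extract_quantities(sentence))
```

Systematic names such as the urea derivatives in the passage contain digits, brackets and Greek letters, so they need a dedicated chemical name recognizer and a name-to-structure step rather than a pattern like this one.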

Page 52: ChemSpider as an integration hub for interlinked chemistry data

ChemSpider Reactions

Page 53: ChemSpider as an integration hub for interlinked chemistry data

Turn “Figures” Into Data

FIGURE

EXTRACTED DATA

Page 54: ChemSpider as an integration hub for interlinked chemistry data

Conclusions

• There are some amazing online resources for environmental toxicology and chemistry already!

• ChemSpider has an important role in quality data and linking resources

• Crowdsourced deposition, validation and curation works

• Standards are an important part of data linking

• MORE collaboration and data sharing can benefit us all

Page 55: ChemSpider as an integration hub for interlinked chemistry data

Thank you

Email: [email protected]

Twitter: @ChemConnector

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams

