ChemSpider - Connecting and Curating Online Chemistry Resources

Antony Williams, EBI, November 30th 2010

Upload: antony-williams-chemconnector-orcid-0000-0002-2668-4821

Post on 25-Jun-2015


DESCRIPTION

This is a presentation given at the European Bioinformatics Institute (EBI) in Cambridge on December 1st 2010, at an EMBL-EBI Industry Program Workshop on "Chemical Structure Resources". This is where I unveiled details of the intra/inter-validation studies of drug structures across multiple public domain chemistry databases. I also unveiled early results from the SurveyMonkey study of the "trust" that the community has in public domain chemistry resources.

TRANSCRIPT

Page 1: ChemSpider -Connecting and Curating Online Chemistry Resources

ChemSpider -Connecting and Curating Online Chemistry Resources

Antony Williams, EBI, November 30th 2010

Page 2: ChemSpider -Connecting and Curating Online Chemistry Resources

Chemistry on the Internet

100s of websites serving up chemistry data, SDF files of structures and data
Some primary resources: PubChem, ChEBI, DrugBank, ChemIDPlus, Wikipedia
ChemSpider “links” chemistry on the internet
Almost 25 million compounds, 400 data sources
Allows community deposition, curation, annotation
Integrating properties, publications, patents, media
Text, structure, substructure (in testing) searching

Page 3: ChemSpider -Connecting and Curating Online Chemistry Resources

www.chemspider.com

Page 4: ChemSpider -Connecting and Curating Online Chemistry Resources

Search for a Chemical

Page 5: ChemSpider -Connecting and Curating Online Chemistry Resources

Available Information…

Linked to vendors, safety data, toxicity, metabolism

Page 6: ChemSpider -Connecting and Curating Online Chemistry Resources

We Have Delivered the Vision

“Build a Structure Centric Community to Serve Chemists”

Integrate chemical structure data on the web
Create a “structure-based hub” to information, data and algorithmic predictions
Let chemists contribute their own data
Allow the community to curate/correct data

Page 7: ChemSpider -Connecting and Curating Online Chemistry Resources

How Did We Build It?

We deal in Molfiles or SDF files – including coordinates
We do rudimentary filtering – valence checking, charge imbalance – prior to deposition
We have our own “business logic” to standardize
We use InChI to “aggregate tautomers” to one record
Link out to external sites where possible using IDs
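The deposition steps on this slide can be sketched roughly as follows. This is a minimal Python illustration, not ChemSpider's actual pipeline: the record layout (`source`, `source_id`, `inchi`, `charges`) and the function names are hypothetical, the InChI strings are placeholders assumed to have been generated upstream, and treating any non-zero net charge as an "imbalance" is a simplifying assumption.

```python
from collections import defaultdict


def net_charge(charges):
    """Sum the formal charges listed for one deposited structure."""
    return sum(charges)


def aggregate_by_inchi(records):
    """Group deposited records that share an InChI string.

    Records whose formal charges do not sum to zero are set aside
    for review instead of being merged (assumed filtering rule).
    """
    merged = defaultdict(list)
    flagged = []
    for rec in records:
        if net_charge(rec["charges"]) != 0:
            flagged.append(rec)
            continue
        # Keep the external identifier so the merged record can link out.
        merged[rec["inchi"]].append((rec["source"], rec["source_id"]))
    return dict(merged), flagged


# Hypothetical deposits; the InChI values are placeholders, not real identifiers.
records = [
    {"source": "SourceA", "source_id": "A-1", "inchi": "InChI=PLACEHOLDER-1", "charges": [0, 0, 0]},
    {"source": "SourceB", "source_id": "B-7", "inchi": "InChI=PLACEHOLDER-1", "charges": [0, 1, -1]},
    {"source": "SourceC", "source_id": "C-3", "inchi": "InChI=PLACEHOLDER-2", "charges": [0, 1, 0]},
]

merged, flagged = aggregate_by_inchi(records)
for inchi, links in merged.items():
    print(inchi, "->", links)
print("flagged for review:", [rec["source_id"] for rec in flagged])
```

Keying the merged record on the InChI string means two sources that drew the same compound differently still land on one record, while each source's own ID is kept for linking out.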

Page 8: ChemSpider -Connecting and Curating Online Chemistry Resources

Inherited Errors

We have inherited errors from every database… all public compound databases, including ours, have errors

“Incorrect” structures – assertions, timelines etc
“Incorrect” names associated with structures
Properties
Links
Publications

ENORMOUS CHALLENGE

Page 9: ChemSpider -Connecting and Curating Online Chemistry Resources

What is the Structure of Vitamin K?

Page 10: ChemSpider -Connecting and Curating Online Chemistry Resources

MeSH

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K

Page 11: ChemSpider -Connecting and Curating Online Chemistry Resources

What is the Structure of Vitamin K1?

Page 12: ChemSpider -Connecting and Curating Online Chemistry Resources

What is the Structure of Vitamin K1?

Page 13: ChemSpider -Connecting and Curating Online Chemistry Resources

CAS’s Common Chemistry

Page 14: ChemSpider -Connecting and Curating Online Chemistry Resources

Wikipedia

Page 15: ChemSpider -Connecting and Curating Online Chemistry Resources
Page 16: ChemSpider -Connecting and Curating Online Chemistry Resources
Page 17: ChemSpider -Connecting and Curating Online Chemistry Resources

ChEBI – Manual Curation

Page 18: ChemSpider -Connecting and Curating Online Chemistry Resources
Page 19: ChemSpider -Connecting and Curating Online Chemistry Resources
Page 20: ChemSpider -Connecting and Curating Online Chemistry Resources

PubChem

Page 21: ChemSpider -Connecting and Curating Online Chemistry Resources
Page 22: ChemSpider -Connecting and Curating Online Chemistry Resources

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”

Variants of systematic names on PubChem

2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
2-methyl-3-(3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl

Page 23: ChemSpider -Connecting and Curating Online Chemistry Resources

Public Domain Chemistry Databases

Our databases are a mess…

Non-curated databases are proliferating errors
We source and deposit data between databases
Original sources of errors hard to determine
Curation is time-consuming, challenging and exacting

An examination of quality in databases – inter/intra lab comparison of processes for 150 drugs

Page 24: ChemSpider -Connecting and Curating Online Chemistry Resources
Page 25: ChemSpider -Connecting and Curating Online Chemistry Resources

Vytorin: Ezetimibe/Simvastatin

Page 26: ChemSpider -Connecting and Curating Online Chemistry Resources

Vytorin: Ezetimibe/Simvastatin

Page 27: ChemSpider -Connecting and Curating Online Chemistry Resources

Vytorin: Ezetimibe/Simvastatin

Page 28: ChemSpider -Connecting and Curating Online Chemistry Resources

Vytorin: Ezetimibe/Simvastatin

Page 29: ChemSpider -Connecting and Curating Online Chemistry Resources

Vytorin: Ezetimibe/Simvastatin

Page 30: ChemSpider -Connecting and Curating Online Chemistry Resources

Symbicort: Budesonide + Formoterol

Page 31: ChemSpider -Connecting and Curating Online Chemistry Resources

Symbicort: Budesonide + Formoterol

ChemIDPlus

Wikipedia

Page 32: ChemSpider -Connecting and Curating Online Chemistry Resources

DrugBank: Search Symbicort…

Page 33: ChemSpider -Connecting and Curating Online Chemistry Resources

Symbicort: Budesonide + Formoterol

PubChem

8 structures called Budesonide. 1 “correct”
6 structures called Formoterol. 1 “correct”
Search on “Symbicort” gives 1 structure.

Page 34: ChemSpider -Connecting and Curating Online Chemistry Resources

Taxol: Paclitaxel 44 structures

Page 35: ChemSpider -Connecting and Curating Online Chemistry Resources

Taxol: Paclitaxel Bioassay Data

Page 36: ChemSpider -Connecting and Curating Online Chemistry Resources

Taxol: Paclitaxel Bioassay Data

Most Bioassay data associated with structure with one ambiguous stereocenter

Page 37: ChemSpider -Connecting and Curating Online Chemistry Resources

Data on the Web – Good or Bad??

Taken from: Rafael Sidi’s Blog

Page 38: ChemSpider -Connecting and Curating Online Chemistry Resources

Data on the Registry

Page 39: ChemSpider -Connecting and Curating Online Chemistry Resources

Data on the Registry

Page 40: ChemSpider -Connecting and Curating Online Chemistry Resources

Data on the Registry

Page 41: ChemSpider -Connecting and Curating Online Chemistry Resources

How are data handled in Pharma?

Algorithms for “collapsing” data?
Skeletons only?
Processing structure-name pairs?
Manual curation?
Does it matter relative to the noise in the measurements?

Do correct structure representations matter, and to whom?
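One possible reading of "collapsing" to "skeletons only" is to compare structures with their stereochemistry ignored. The sketch below does this by dropping the stereo layers (`/b`, `/t`, `/m`, `/s`) from an InChI string. It is an illustration only, not a method described in the talk: the function name `skeleton_key` is hypothetical, and the two InChI strings, written out by hand for the enantiomers of alanine, should be treated as illustrative input.

```python
# InChI layer prefixes that carry stereochemistry:
# b = double-bond stereo, t/m = tetrahedral stereo, s = stereo type.
STEREO_PREFIXES = ("b", "t", "m", "s")


def skeleton_key(inchi):
    """Return the InChI string with its stereochemistry layers removed."""
    version, formula, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join([version, formula, *kept])


# Illustrative inputs: the two enantiomers differ only in the /m layer.
l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(skeleton_key(l_alanine))
print(skeleton_key(l_alanine) == skeleton_key(d_alanine))
```

Collapsing this way trades precision for recall: records that differ only in stereo assignment are pooled, which is exactly the case where the slide asks whether the difference matters relative to measurement noise.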

Page 42: ChemSpider -Connecting and Curating Online Chemistry Resources

NLM’s DailyMed

Page 43: ChemSpider -Connecting and Curating Online Chemistry Resources

NLM’s DailyMed

Page 44: ChemSpider -Connecting and Curating Online Chemistry Resources

NLM’s DailyMed

Page 45: ChemSpider -Connecting and Curating Online Chemistry Resources

Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.

Page 46: ChemSpider -Connecting and Curating Online Chemistry Resources
Page 47: ChemSpider -Connecting and Curating Online Chemistry Resources

Columns: Drug Name, Generic Name, ChEBI, ChemSpider, CAS Com. Chem, ChemIDPlus, DailyMed, DrugBank, PubChem, Wikipedia

Spiriva, Tiotropium Bromide: No Hits; No Hits; 4/0
Depakote, Valproate semisodium: No Structure
Basen, Voglibose: No Hits; No Hits; 2/1
Symbicort 1), Budesonide: 8/1
Symbicort 2), Formoterol: WRONG; No Hits; 6/1
Vytorin 1), Ezetimibe: No Hits
Vytorin 2), Simvastatin: 2/1
Taxol, Paclitaxel: 44/1
Thalidomid, Thalidomide: No Hits
Zocor, Simvastatin: 2/1
Crestor, Rosuvastatin: No Hits; 2/1
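The "N/M" cells in this table can be read programmatically. The sketch below assumes that "N/M" means "N structures returned, M of them judged correct", an interpretation inferred from the earlier slide stating "8 structures called Budesonide. 1 “correct”". The function name `parse_hits` is hypothetical, and the cell values are copied from the table without assigning them to a particular database column.

```python
def parse_hits(cell):
    """Parse an 'N/M' cell into (found, correct).

    Returns None for non-numeric cells such as 'No Hits' or 'WRONG'.
    """
    parts = [part.strip() for part in cell.split("/")]
    if len(parts) != 2 or not all(part.isdigit() for part in parts):
        return None
    return int(parts[0]), int(parts[1])


# Cell values as they appear in the table above.
cells = {
    "Tiotropium Bromide": "4/0",
    "Budesonide": "8/1",
    "Formoterol": "6/1",
    "Paclitaxel": "44/1",
    "Ezetimibe": "No Hits",
}

for drug, cell in cells.items():
    parsed = parse_hits(cell)
    if parsed is None:
        print(f"{drug}: {cell}")
    else:
        found, correct = parsed
        print(f"{drug}: {correct} of {found} structures judged correct")
```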

Page 48: ChemSpider -Connecting and Curating Online Chemistry Resources

Why Curated Dictionaries Matter

Page 49: ChemSpider -Connecting and Curating Online Chemistry Resources

Success Depends on Dictionaries

Page 50: ChemSpider -Connecting and Curating Online Chemistry Resources

Online Curation

Online databases generally do NOT allow curation or annotation
If you find errors they stay there!
ChemSpider allows immediate curation

Page 51: ChemSpider -Connecting and Curating Online Chemistry Resources

Crowdsourcing Works

Over 100 people have deposited data (structures, spectra, etc.) and participated in data curation
Different level curators check each other's work
Wikipedia is the modern primary example
Some curators are “madmen”…

Page 52: ChemSpider -Connecting and Curating Online Chemistry Resources

Crowdsourcing Works

Over 100 people have deposited data (structures, spectra, etc.) and participated in data curation
Different level curators check each other's work
Wikipedia is the modern primary example
Some curators are “madmen”… The Oxford English Dictionary

Page 53: ChemSpider -Connecting and Curating Online Chemistry Resources

Collaborative Data Curation

How can we COLLECTIVELY clean online data?

ChemSpider has inherited junk from >400 data sources. Some of this has proliferated into PubChem. We should deprecate it.

We need to develop a way to share curation actions back to original data sources

A mindset of bigger is better is problematic. How many “real chemicals” are in the public databases?

Page 54: ChemSpider -Connecting and Curating Online Chemistry Resources

ChemSpider

ChemSpider is free to use.
Multiple web services are available.
New data added daily.
Curation and data validation ongoing every day.
Provided by the RSC.

www.chemspider.com

Page 55: ChemSpider -Connecting and Curating Online Chemistry Resources

Thank you

Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams