ChemSpider – A Crowdsourcing Environment for Hosting and Validating Chemistry Resources (and lessons from President Bush) Antony Williams 5th Meeting on U.S. Government Chemical Databases and Open Chemistry August 2011

ChemSpider – A Crowdsourcing Environment for Hosting and Validating Chemistry Resources (and lessons from President Bush)

Antony Williams

5th Meeting on U.S. Government Chemical Databases and Open Chemistry

August 2011

I want to know about “Vincristine”

Vincristine: Identifiers and Properties

Vincristine: Vendors and Sources

Vincristine: Patents

Vincristine: Articles

Vincristine: RSC Databases

Searches: The INTERNET

Validated Names for Searching…

And InChIs…

ChemSpider

The Free Chemical Database

A central hub for chemists to source information
  >26 million unique chemical records
  Aggregated from >400 data sources
  Chemicals, spectra, CIF files, movies, images, podcasts, links to patents, publications, predictions

A central hub for chemists to deposit & curate data

Essential aspects of ChemSpider

ChemSpider is a BIG database...and growing

Our focus has increasingly become QUALITY over quantity

Data curation and validation is our strength – crowdsourcing is contributing, more is required

Validated data has enabled linking of the internet

There are NO errors in ChemSpider

There are NO errors in ChemSpider

“All That Glisters is Not Gold”

What is the structure of Discodermolide?

How to distinguish…who’s wrong?

Neither is wrong

Data Curation…long torturous task

Data curation – JUST structure-name validation is a long, torturous, iterative task.

How about validating “data” – PhysChem data such as logP data, boiling points, melting points (J.C. Bradley’s talk), spectra?

Hand on my heart….

Hand on my heart

No offence meant by what follows! We ALL have quality issues!

PHYSPROP Database

The freely downloadable database under the EPI Suite prediction software

Very Basic filters suggest data quality issues
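
As an illustration of what "very basic filters" could look like, here is a minimal Python sketch. The record fields (`name`, `mp_c`, `bp_c`, `logp`), the logP range, and the sample records are all hypothetical; none of them come from PHYSPROP or from the talk. The checks only mark entries as worth a second look, they do not prove an entry is wrong.

```python
from dataclasses import dataclass
from typing import List, Optional

ABSOLUTE_ZERO_C = -273.15


@dataclass
class Record:
    """Hypothetical physchem record; temperatures are in degrees Celsius."""
    name: str
    mp_c: Optional[float] = None   # melting point
    bp_c: Optional[float] = None   # boiling point
    logp: Optional[float] = None   # octanol-water partition coefficient


def basic_flags(rec: Record) -> List[str]:
    """Return human-readable reasons why a record deserves review."""
    flags = []
    for label, value in (("melting point", rec.mp_c), ("boiling point", rec.bp_c)):
        if value is not None and value < ABSOLUTE_ZERO_C:
            flags.append(f"{label} below absolute zero")
    if rec.mp_c is not None and rec.bp_c is not None and rec.mp_c > rec.bp_c:
        flags.append("melting point above boiling point")
    # The -10..15 window is an assumed plausibility range, not a standard.
    if rec.logp is not None and not (-10.0 <= rec.logp <= 15.0):
        flags.append("logP outside assumed plausible range")
    return flags


# Made-up sample records for demonstration only.
records = [
    Record("example-ok", mp_c=5.5, bp_c=80.1, logp=2.1),
    Record("example-swapped", mp_c=120.0, bp_c=40.0, logp=1.0),
    Record("example-typo", mp_c=-300.0, logp=42.0),
]

suspect = {r.name: basic_flags(r) for r in records if basic_flags(r)}
```

Filters of this kind are cheap to run over a whole download and give curators a short list to start from.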

The Stereochemistry challenge

12500 chemicals with “missed” stereo

NIST Webbook

NLM’s DailyMed

NLM’s DailyMed

NLM’s DailyMed

PubChem

Linking

Patents

Patents

WYSIWYG compounds

WYSIWYG compounds

Data Curation…long torturous task

Data curation – JUST structure-name validation is a long, torturous, iterative task.

How about validating “data” – PhysChem data such as logP data, boiling points, melting points (J.C. Bradley’s talk), spectra?

The crowd in crowdsourcing is… generally small

Which of the large databases are doing careful curation? How can we share the workload? Hmm...

Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.

Columns: Drug Name, Generic Name, ChEBI, ChemSpider, CAS Com. Chem, ChemIDPlus, DailyMed, DrugBank, PubChem, Wikipedia

Surviving entries per row (column positions not recoverable):

Spiriva (Tiotropium Bromide): No Hits, No Hits, 4/0
Depakote (Valproate semisodium): No Structure
Basen (Voglibose): No Hits, No Hits, 2/1
Symbicort 1) Budesonide: 8/1
Symbicort 2) Formoterol: WRONG, No Hits, 6/1
Vytorin 1) Ezetimibe: No Hits
Vytorin 2) Simvastatin: 2/1
Taxol (Paclitaxel): 44/1
Thalidomid (Thalidomide): No Hits
Zocor (Simvastatin): 2/1
Crestor (Rosuvastatin): No Hits, 2/1

Who does the Curation?

ChemSpider can “do it” for us

ChemSpider has built a curation interface used by the community and ourselves for curating.

All curation activities are available for review, online immediately, iteratively checked.

Curators have different abilities based on their profile: There are only a few “Master Curators”.

Can we “share” the curation workload?

Proof of Concept Data Curation Sharing

Identifier Dictionaries

Reciprocal curation processes…share curation with each other.

If a database already has a compound, then use InChIKeys to match “suggested” validation against the compound.

A series of “added” and “removed” synonyms against InChIKeys for matching.

Who will participate???
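
The exchange of “added” and “removed” synonyms against InChIKeys could be sketched roughly as follows. This is only an assumption about how such a feed might look: the `(InChIKey, action, synonym)` tuple layout, the `apply_changes` function, and the sample dictionary are hypothetical and do not describe ChemSpider's actual interface. The wrong synonym in the sample data is deliberate, so that there is something to remove.

```python
from typing import Dict, Iterable, List, Set, Tuple

# One shared curation decision: (InChIKey, action, synonym).
# The action is either "added" or "removed". This layout is hypothetical.
Change = Tuple[str, str, str]


def apply_changes(local: Dict[str, Set[str]], changes: Iterable[Change]) -> List[Change]:
    """Apply changes for compounds the local database already holds.

    Returns the changes that were skipped because the InChIKey is unknown locally.
    """
    skipped: List[Change] = []
    for inchikey, action, synonym in changes:
        names = local.get(inchikey)
        if names is None:
            skipped.append((inchikey, action, synonym))
        elif action == "added":
            names.add(synonym)
        elif action == "removed":
            names.discard(synonym)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return skipped


ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"   # standard InChIKey of aspirin
CAFFEINE = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"  # standard InChIKey of caffeine

# Illustrative local dictionary containing one deliberately wrong synonym.
local_db = {ASPIRIN: {"Aspirin", "Acetylsalicylic acid", "Salicylic acid"}}

shared = [
    (ASPIRIN, "removed", "Salicylic acid"),
    (ASPIRIN, "added", "2-Acetoxybenzoic acid"),
    (CAFFEINE, "added", "Caffeine"),  # not held locally, so it is skipped
]

skipped = apply_changes(local_db, shared)
```

Keying every change on the InChIKey means a receiving database never has to guess which compound a name belongs to; it only has to decide whether to trust the suggestion.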

Proof of Concept Data Curation Sharing

Lessons Learned: Big vs Good!

15 compounds called Yohimbine

54 Skeletons for Yohimbine

Aggregators suffer dilution…

User Understanding of Data

Users searching “Yohimbine” expect to find it…not labeled versions of it, not ambiguous stereochemistries, not partial stereochemistries.

Data “aggregation” into a meaningful form is a major challenge. e.g. Assays for radiolabeled compounds linked to actual drugs.

Data curation efforts such as ChEMBL are essential!
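
One way an aggregator might pull labeled and partial-stereo records together with the parent compound is to group on the first block of the InChIKey, which hashes the connectivity, while the second block carries stereochemistry and isotopic labeling. The sketch below assumes that approach; it is not taken from the talk. The helper names are hypothetical, and the first blocks `AAAAAAAAAAAAAA` and `CCCCCCCCCCCCCC` and the second block `BBBBBBBBSA` are placeholders, not real hashes for Yohimbine or any other compound.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def skeleton(inchikey: str) -> str:
    """Return the 14-letter first block of an InChIKey (the connectivity hash)."""
    first, _, _ = inchikey.partition("-")
    if len(first) != 14 or not (first.isalpha() and first.isupper()):
        raise ValueError(f"not a valid InChIKey first block: {inchikey!r}")
    return first


def group_by_skeleton(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (name, InChIKey) pairs by skeleton, preserving input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, inchikey in records:
        groups[skeleton(inchikey)].append(name)
    return dict(groups)


# Placeholder keys for illustration only; these are NOT real InChIKeys.
records = [
    ("Yohimbine (full stereo)", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    ("Yohimbine (no stereo)", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),
    ("Unrelated compound", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"),
]

groups = group_by_skeleton(records)
```

Grouping like this does not decide which record a user searching by name should land on; it only makes the family of related records visible so that a curator, or a ranking rule, can pick the fully specified one.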

SciMobileApps.com

SciDBs.com (Coming soon)

Open PHACTS: partnership between European Community and EFPIA

Freely accessible for knowledge discovery and verification.
Data on small molecules
Pharmacological profiles
Pharmacokinetics
ADMET data
Biological targets and pathways
Proprietary and public data sources.

Standardization and Quality

Our initial approaches to standardization were imperfect. We are revisiting them to support Open PHACTS.

We were highly dependent on InChI, with not enough standardization prior to InChI generation.

InChI is excellent, though acknowledged to be imperfect. Way better than SMILES for linking the internet!

Conclusions

ChemSpider is one of many important chemistry resources on the internet

We have assumed an important role of curating and validating data – specifically name-structure dictionaries are of high importance but data validation is also key

We are a part of the federation of internet databases serving chemistry. MORE collaboration can serve us all better…how?

A Plea to Gov’t DBs…

Please improve gov’t DB communications

A Plea to Gov’t DBs…

Please improve gov’t DB communications

Please buddy up and get closer together

A Plea to Gov’t DBs…

Please improve gov’t DB communications

Please buddy up and get closer together

Get into deep conversations

Acknowledgments

Our development team – headed by THAT man...

Many in this room: InChI, PubChem, DSSTox, FDA, ChEBI/ChEMBL, SureChem, many more

Curators – special gratitude to Barrie Walker!

Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)

Thank you

Email: [email protected]
Twitter: ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams