online public compound databases
DESCRIPTION
This is a workshop I gave on "Online Public Compound Databases" at the BCCE in Dallas, Texas on August 3rd, 2010. It is an overview of online resources, InChI, linking data, online data quality, searching and ChemSpider.
TRANSCRIPT
Online Public Compound Databases
Antony Williams
Introductions….
Hi… I’m Antony Williams, ChemSpiderman
NMR spectroscopist by training
Worked in gov’t lab, academia, Fortune 500, start-up; founded ChemSpider; now work for the Royal Society of Chemistry
I am the host of ChemSpider…
25 million compounds
Linked to 400 data sources
You’ll hear more…
What’s your interest in Public Compound DBs?
What public compound databases do you use?
What are you looking to find?
What proprietary databases do you presently use?
What do you trust?
Why are for-fee databases not enough?
What issues do you have with free chemistry databases/resources online?
What would the ideal solution provide????
Content is King and Quality Costs
Chemistry “content” is big money
Patent searching
Structures and properties
Drug databases
Literature databases
Chemical Abstracts Service (CAS), the “Gold Standard” in chemistry-related information
103 years of content
$260 million revenue (2006)
>50 million substances
>60 million sequences
What’s the Status of Chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases (eMolecules)
Metabolic pathway databases (WikiPathways)
Virtual Screening databases (ZINC DB)
Property databases (Beilstein)
Screening assay results (PubChem)
Patents with chemical structures (SureChem)
ADME/Tox data (OEChem)
Scientific publications (Many publishers)
Compound aggregators (ChemSpider)
Blogs/Wikis and Open Notebook Science (Many)
Synthesis Blogs…TotallySynthetic.com
Org Prep Daily
Molbank (Open Access Journal)
Synthetic Pages (Website)
Lots of “Public Compound” Databases
PubChem
DrugBank
ChEBI/ChEMBL
KEGG
LIPID MAPS
ChemIDplus
eMolecules
ZINC
ChemSpider
Lots of chemical vendors
What’s missing???
What do you use online?
Where Would You look? What Do You Trust?
Linked Data on the Web
Taken from: Rafael Sidi’s blog
What is a compound?
Connections Can Lead Anywhere
PubChem
PubChem is “a repository of screening data”
BUT it also holds data from publishers, vendors and lots of other chemical data holders: almost 30 million compounds
Properties, 3D optimized structures, links to various databases, chemical names, registry numbers, synonyms, trade names
PubChem is a repository – non-curated, no way to annotate or clean data. Data are free, not “open”
LIVE DEMO of PubChem
Name a chemical compound – search and review
Next slides: Methane… Diamond, Vancomycin, Taxol, Cholesterol
Chemistry on The Internet Is Messy
It’s Methane…
What’s Methane?
What ELSE is Methane???
The Challenges of Internet Data
Text-based searches commonly will get you to “representative data”
Accurate chemical structures are hard to find!
Wikipedia IS a good source of accurate chemistry data: not perfect, but good. See Tacrolimus.
Tell the story of Domoic Acid (next slide)
The EXPERTS must get it right?!
Wikipedia, C&E News, PubChem C&E News (from ACS)
Feedback from Steve Ritter
“As for where we source our structures, our primary source is the researcher and peer-reviewed papers, because many compounds are novel.
…we always double check them against one or more primary sources, typically Merck Index and SciFinder.
Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”
Feedback from Steve Ritter
“As a rule, we at C&EN don’t use Wikipedia as a primary source for structures or chemical information, and I recommend that policy to anyone.”
“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”
Unfortunately, question everything
Question Everything online: www.dhmo.org
The FDA’s DailyMed
Structures on DailyMed
Lack of Stereochemistry
Does Stereochemistry Matter?
Does one stereocenter matter?
Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon are all trade names for one drug: Thalidomide
Incorrect Structures
Wow!
Collaborative Knowledge Management
Taxol on PubChem
DrugBank
Digitonin? More Crowdsourcing…
Comments on the Blog
September 15th, 2009 at 1:57 pm
It looks like both ChEBI and Wikipedia structures are wrong as far as aglycon is concerned. According to http://www3.interscience.wiley.com/journal/20330/abstract
“…for the first time to confirm beyond all doubt the structure suggested by Tschesche and Wulff for digitonin by means of modern NMR techniques, and to assign all proton and carbon resonances.”
Structure 1 shows methyl group at C-20 going UP, i.e. 20β (while by default spirostan is 20α).
CAS as an authority
The Blogging Community Participates
Will it ever end?
The community says the structure of digitonin has an “up” 20-methyl group.
If so, then multiple substances related to digitonin have the OPPOSITE stereo at the 20-methyl.
The spirostane skeleton is considered to have a “down” methyl group, so all spirostane-related structures would be wrong.
The ACD/Dictionary has 24 structures with a close skeleton, and all have the “down” methyl group.
Chemistry is REALLY Messy
Vancomycin
Who will curate?
How would you clean such a large dataset?
Assertions!!!
An Introduction to the InChI Identifier
Multiple Layers
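To make the "layers" idea concrete, here is a minimal Python sketch that splits an InChI string on its "/" separators and labels each layer by its prefix letter. The function name `split_inchi_layers`, the human-readable layer names and the L-alanine example are illustrative choices, not taken from the slides. The sketch also assumes a simple InChI in which each prefix occurs only once, so it is a reading aid rather than a full parser.

```python
# Illustrative sketch only: label the layers of a simple InChI string.
# The layer names below are informal descriptions chosen for readability.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo parity flag",
    "s": "stereo type",
    "i": "isotopes",
}


def split_inchi_layers(inchi):
    """Return a dict mapping layer names to layer contents.

    Assumes the second field is the molecular formula and that no
    prefix repeats (true for the small examples used here).
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if not part:
            continue  # skip empty fields
        name = LAYER_NAMES.get(part[0], "other:" + part[0])
        layers[name] = part[1:]
    return layers


methane = split_inchi_layers("InChI=1S/CH4/h1H4")
alanine = split_inchi_layers(
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
)
for name, value in alanine.items():
    print(name, "=", value)
```

Methane has only a formula and a hydrogen layer, while L-alanine adds connectivity and stereo layers. A structure drawn without stereochemistry simply lacks the stereo layers, which is one way two records for the "same" compound end up with different identifiers.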
InChI Strings Hash to InChIKeys
InChIs for Taxol
Back to Taxol
DrugBank: RCINICONZNJXQF-CLDWUXIMDD
ChEBI: RCINICONZNJXQF-GXKQXQCDDN
Wikipedia: RCINICONZNJXQF-MZXODVADBJ
Which one is correct???
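One way to read these three keys: the block before the hyphen is a hash of the molecular skeleton (connectivity), and the block after it covers the remaining layers such as stereochemistry. The short Python sketch below groups the keys from the slide by their first block. The helper names are hypothetical and the code is plain string handling, not a call into any InChI library.

```python
from collections import defaultdict

# InChIKeys for Taxol exactly as listed on the slide.
TAXOL_KEYS = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def split_key(key):
    """Split an InChIKey into (skeleton block, remainder) at the first hyphen."""
    skeleton, _, remainder = key.partition("-")
    return skeleton, remainder


def group_by_skeleton(keys):
    """Map each skeleton block to {source: remainder of the key}."""
    groups = defaultdict(dict)
    for source, key in keys.items():
        skeleton, remainder = split_key(key)
        groups[skeleton][source] = remainder
    return dict(groups)


groups = group_by_skeleton(TAXOL_KEYS)
for skeleton, sources in groups.items():
    distinct = set(sources.values())
    print(skeleton, "->", len(sources), "sources,", len(distinct), "distinct full keys")
```

All three sources share the skeleton block RCINICONZNJXQF but disagree in the second block, so they agree on how the atoms are connected and differ in the other layers (most plausibly stereochemistry). Matching on the first block alone is the idea behind a skeleton search, while matching on the full key finds only identical records.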
Vancomycin – Search the Internet
Full Molecule Search: 4 Hits
Full Skeleton Search: 104 Hits
Vancomycin on ChemSpider 1 compound – 3 days
Assertion and Chemical Entities
Who says what Taxol is?
What is the “timeline” for a molecule?
How do we clean up the Public data?
Structure validation is tough work!
Who is validating chemistry data online???
Bio-Break
Next Up – QUALITY CHOICES for online data
An introduction to ChemSpider
Crowdsourced Participation and Curation
Tony’s Quality Choices For Data
Chemical Abstracts Service and Reaxys: not free but definitely high quality!
Wikipedia Chemistry is good
ChEBI (look for “starred” compounds to indicate manual curation)
DSSTox: manually curated EPA database. Very high quality
ChemIDplus: ongoing curation and good quality
The databases of David Wishart (DrugBank, HMDB, FooDB and others): manually curated. Good but not perfect
ChEBI: http://www.ebi.ac.uk/chebi/
Chemical Entities of Biological Interest from the European Bioinformatics Institute
DSSTox: http://www.epa.gov/comptox/dsstox/
Distributed Structure-Searchable Toxicity (DSSTox) database from Ann Richard at the Environmental Protection Agency
ChemIDplus: http://chem.sis.nlm.nih.gov/chemidplus/
350,000 compounds
DrugBank: http://www.drugbank.ca/
And Our Own Work... ChemSpider
ChemSpider is:
Building a Structure Centric Community for Chemists
>25 million compounds, >400 data sources
A deposition and curation platform
A publishing platform for the community
Grows daily – more depositions, more links, more data sources
How Was ChemSpider Built?
ChemSpider was a “hobby project”
Housed in a basement and running off three servers – one bought, two built
Sensitive to weather and power stability
Went live at ACS Spring 2007 in Chicago
Search Cholesterol
Live DEMO
ChemSpider demo…
Link off a structure in ChemSpider
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”
Answering Questions for Chemists
Questions a chemist might ask…
What is the melting point of n-butanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?
Complex Data and Information
Crowd-sourcing Chemistry Curation
Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
Multi-level Curation and Approval
ChemSpider SyntheticPages
ChemSpider Synthesis will be a home for all things “synthetic”
An online resource for synthetic procedures from blogs, other online resources, RSC supplementary info, other publishers etc.
Public peer-review and feedback for synthetic procedures
ChemSpider Everywhere: Embed
ChemSpider Everywhere: Spectral Game
ChemSpider Everywhere: Crowdsourced Curation of Spectra
ChemSpider Everywhere: ChemMobi
Where is ChemSpider Lacking?
More databases coming online monthly
Quality of data remains the primary issue
ChemSpider is limited to “defined chemicals”. No support for:
Polymers
Minerals
Markush structures
It’s a long road ahead…
Conclusions
The internet enables chemistry, at a reduced cost
Web 2.0 is here and improving quality
Question Quality!
Crowdsourcing to expand, curate and integrate
InChIs are enabling chemistry on the internet
Thank you
[email protected]: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams