RSC ChemSpider – Building an Internet Based Community for Chemists

Posted on 10-May-2015

Category:

Technology

DESCRIPTION

This is a general presentation about our efforts to build an internet-based community for chemists using ChemSpider. It gives an overview of data quality online, of crowdsourced deposition and curation, and of our progress toward delivering a solution to the community for resourcing data.

TRANSCRIPT

RSC ChemSpider – Building an Internet Based Community for Chemists

Where is chemistry online?

Encyclopedic articles (Wikipedia)

Chemical vendor databases

Metabolic pathway databases

Property databases

Patents with chemical structures

Drug Discovery data

Scientific publications

Compound aggregators

Blogs/Wikis and Open Notebook Science

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Poor quality and little curation/validation work

Too many searches required to resource data

What do humans want?

As few interfaces as possible

media.obsessable.com

Chemistry on the Internet FUTURE

Search by chemical structure and substructure

Chemistry articles indexed and searchable

Reduced number of searches to find data

Data are integrated – compounds, vendors, syntheses, data, publications and patents

For Synthesis… TotallySynthetic.com

Org Prep Daily (Blog)

Lots of “Public Compound” Databases

PubChem, DrugBank, ChEBI/ChEMBL, KEGG, LIPID MAPS, ChemIDplus, eMolecules, ZINC, lots of chemical vendors, ChemSpider

Where Would You look? What Do You Trust?

Linked Data on the Web

Taken from: Rafael Sidi’s Blog

What is a compound?

What is ChemSpider?

ChemSpider is:

Building a Structure Centric Community for Chemists

>23 million compounds, >300 data sources

A deposition and curation platform

A publishing platform for the community

Grows daily – more depositions, more links, more data sources

How Was ChemSpider Built?

ChemSpider was a “hobby project”

Housed in a basement and running off three servers – one bought, two built

Sensitive to weather and power stability

Went live at ACS Spring 2007 in Chicago

Search Cholesterol

Linked across the internet

Kyoto Encyclopedia of Genes and Genomes

Link off a structure in ChemSpider

Chemical suppliers, other publications, analytical data, related reactions, Wikipedia, patents, “everything”

Links to Patents based on structure

Clickthrough to Patents

Articles Linked

Answering Questions for Chemists

Questions a chemist might ask…

What is the melting point of n-butanol?

What is the chemical structure of Xanax?

Chemically, what is phenolphthalein?

What are the stereocenters of cholesterol?

Where can I find publications about xylene?

What are the different trade names for Ketoconazole?

What is the NMR spectrum of Aspirin?

What are the safety handling issues for Thymol Blue?

Complex Data and Information

ChemSpider is a structure-centric hub

ChemSpider aggregates and links out across the internet

Data aggregate based on “structures and links”

What defines a chemical compound?

What is a compound?

Question Everything online: www.dhmo.org

Di-Hydrogen Monoxide

2H

2H + 1O

H2O

Water

It’s all on Wikipedia…

Chemistry on The Internet Is Messy

It’s Methane…

What’s Methane?

What ELSE is Methane???

PubChem

Truly “I Love You”

Chemistry is REALLY Messy

Vancomycin

Who will curate?

How would you clean such a large dataset?

Assertions!!!

Vancomycin on ChemSpider

1 compound – 3 days

The EXPERTS must get it right?!

Wikipedia, C&E News (from ACS), PubChem

What About Digitonin?

CAS as an authority

The Blogging Community Participates

The FDA’s DailyMed

Structures on DailyMed

Lack of Stereochemistry

Incorrect Structures

Wow!

The InChI Identifier

Multiple Layers
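As a rough illustration of the "multiple layers" idea, the sketch below splits an InChI string into its slash-separated layers. The helper name `split_inchi_layers` and the human-readable layer labels are illustrative shorthand, not taken from the talk or from any InChI library. The example string is the standard InChI for L-alanine. The sketch assumes a simple InChI that has a formula layer and no repeated layer prefixes (fixed-H or isotopic sections can repeat prefixes, which this does not handle).

```python
# Illustrative only: a naive splitter for simple InChI strings.
# Labels are informal shorthand for the one-letter layer prefixes.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
    "f": "fixed hydrogens",
}


def split_inchi_layers(inchi):
    """Return a dict mapping layer labels to layer contents."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        # Assumption: the second field is the molecular formula.
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if not part:
            continue
        label = LAYER_NAMES.get(part[0], "other (%s)" % part[0])
        layers[label] = part[1:]
    return layers


l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
for label, value in split_inchi_layers(l_alanine).items():
    print(label, "->", value)
```

Dropping the stereo layers from such a string leaves a description of the same skeleton without stereochemistry, which is why two sources can agree on connectivity and still disagree on the full structure.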

InChI Strings Hash to InChIKeys
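To make the idea of hashing concrete, here is a toy sketch that turns an InChI string of any length into a short, fixed-length key of uppercase letters. It is deliberately NOT the InChIKey algorithm: real InChIKeys are also derived from SHA-256 but use their own block structure and encoding, defined by the official InChI software. The function name `toy_structure_key` and the 14-letter length are assumptions made for this illustration. The two InChI strings are the standard InChIs for methane and water.

```python
import hashlib
import string


def toy_structure_key(inchi, length=14):
    """Hash an InChI string to a fixed-length key of uppercase letters.

    Illustration only: this does not reproduce real InChIKeys.
    """
    digest = hashlib.sha256(inchi.encode("utf-8")).digest()
    letters = string.ascii_uppercase
    # Map the first `length` digest bytes onto A-Z (length must be <= 32).
    return "".join(letters[b % 26] for b in digest[:length])


methane = "InChI=1S/CH4/h1H4"
water = "InChI=1S/H2O/h1H2"

print(toy_structure_key(methane))
print(toy_structure_key(water))
```

The properties that matter carry over to the real thing: the same string always gives the same key, different strings give different keys with overwhelming probability, and the key has a fixed length that is easy to index and to search for on the web.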

InChIs for Taxol

Back to Taxol

DrugBank: RCINICONZNJXQF-CLDWUXIMDD

ChEBI: RCINICONZNJXQF-GXKQXQCDDN

Wikipedia: RCINICONZNJXQF-MZXODVADBJ

Which one is correct???

InChIKeys for Taxol

DrugBank: RCINICONZNJXQF-CLDWUXIMDD

ChEBI: RCINICONZNJXQF-GXKQXQCDDN

Wikipedia: RCINICONZNJXQF-MZXODVADBJ

ChEBI and Wikipedia are the SAME structure

DrugBank has a DIFFERENT structure – it differs at ONE stereocenter

Does one stereocenter matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon

Assertion and Chemical Entities

Who says what Taxol is?

What is the “timeline” for a molecule?

How do we clean up the Public data?

The Quality source is Chemical Abstracts Service…

ChemSpider Searches

ChemSpider Complex Searches

Vancomycin – Search the Internet

Full Molecule Search: 4 Hits

Full Skeleton Search: 104 Hits
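One plausible way to get a "full molecule" versus "full skeleton" distinction is to compare InChIKeys, since the first block of an InChIKey encodes the connectivity (the skeleton) and the rest encodes stereochemistry and other details. The sketch below is an assumption about how such a search could work, not a description of ChemSpider's implementation. The function names are hypothetical. The first three records reuse the Taxol keys quoted earlier in this talk, and the last is the standard InChIKey for water, added as an unrelated record.

```python
# Sketch: exact-key matching vs. first-block ("skeleton") matching.
RECORDS = [
    ("DrugBank", "RCINICONZNJXQF-CLDWUXIMDD"),
    ("ChEBI", "RCINICONZNJXQF-GXKQXQCDDN"),
    ("Wikipedia", "RCINICONZNJXQF-MZXODVADBJ"),
    ("water (unrelated record)", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"),
]


def skeleton_block(inchikey):
    """Return the first block of an InChIKey (the connectivity hash)."""
    return inchikey.split("-", 1)[0]


def full_molecule_search(query_key, records):
    """Match only records whose whole key equals the query key."""
    return [source for source, key in records if key == query_key]


def full_skeleton_search(query_key, records):
    """Match records sharing the query's first block, ignoring the rest."""
    target = skeleton_block(query_key)
    return [source for source, key in records if skeleton_block(key) == target]


query = "RCINICONZNJXQF-CLDWUXIMDD"
print("full molecule:", full_molecule_search(query, RECORDS))
print("full skeleton:", full_skeleton_search(query, RECORDS))
```

A skeleton search returns more hits than a full molecule search because every stereoisomer, and every record with missing or wrong stereochemistry, shares the same first block.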

The InChI “Resolver”

Citizen Scientists

Crowd-sourcing Chemistry Curation

Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

Multi-level Curation and Approval

Citizens as Data Sources

Entity-Extraction, Mark-up, Annotate

Success Depends on Dictionaries

Project Prospect

ChemMantis and CJOC

Name-Structure Pairs

Species – linked to Wikipedia

Semantic Linking of Structures

What would you want to link off a structure?

Chemical suppliers, other publications, analytical data, related reactions, Wikipedia, patents, “everything”

ChemSpider Everywhere

Linked from Wikipedia

Linked from Open Notebook Science sites using EMBED

Linked from blogs using Structure/Spectra EMBED

Integrated into structure drawing packages such as ACD/ChemSketch, Symyx Draw, Open Source applets

Integrated into software offerings from Thermo, Waters, Agilent, Bruker

ChemSpider Everywhere: Embed

ChemSpider Everywhere: What do computers want?

Web services

flickr.com/photos/microcosmos

ChemSpider Everywhere: Spectral Game

ChemSpider Everywhere: Crowdsourced Curation of Spectra

ChemSpider Everywhere: ChemMobi

There are always gaps...

What ChemSpider doesn’t deal with yet...

Markush structures and other “non-defineds”, materials, minerals, polymers, biological macromolecules

What’s next?

Continue the curation effort and keep cleaning

Finish depositions – millions left to deposit

Layer on RDF to allow the semantic web to benefit from our efforts

Integrate RSC content – a massive archive!

Integrate RSC publishing workflows and databases

Thank you

antony.williams@chemspider.com

Twitter: ChemSpiderman

www.chemspider.com/blog

SLIDES: www.slideshare.net/AntonyWilliams