Hosting a Compound Centric Community Resource for Chemistry Data
DESCRIPTION
Laboratories around the world continue to generate immense amounts of data that are non-proprietary and of value to the community. If made available, these data could dramatically reduce costs by minimizing rework and ultimately facilitate faster research. High-quality reference data collections of chemical compound dictionaries, properties and spectra have been generated over many decades. With the advent of social networking tools and platforms such as Wikipedia, the community has an opportunity to contribute. The ChemSpider platform, hosted by the Royal Society of Chemistry, is a compound-centric database with associated data. It is already populated with almost 25 million unique compounds, and the community can deposit and host their own data and curate and annotate existing data, including those generated in Open Notebook Science efforts. This presentation will provide an overview of progress to date and outline the vision for this community platform for chemistry and for ensuring the longevity of chemistry reference data.
TRANSCRIPT
Hosting a Compound Centric Community Resource for Chemistry Data
Antony Williams, ACS Anaheim March 28th 2011
Data Archiving, e-Science and Primary Data
How much data generated in a lab, that COULD go public, is lost forever? Public Domain reference databases of value?
Syntheses Properties Spectra CIFs Images
Much of chemistry is chemical structure-based – where and how could we host these data?
The Social Network
Career-wise, within the next few years NOT having a personal presence online will be a detriment
Self-marketing Establishing a profile Getting on the record Collaborative Science Demonstrating a skill set Measured using alternative metrics Contributing to the public peer review process
Social Networking Tools
A growing number of social networking tools:
Facebook Twitter LinkedIn Flickr YouTube Blogs Communities Collaborative environments
Collaborative Knowledge Management
TotallySynthetic.com
Contributing Chemistry online
Property databases Compound aggregators Screening assay results Scientific publications Encyclopedic articles (Wikipedia) Metabolic pathway databases ADME/Tox data – eTOX for example Blogs/Wikis and Open Notebook Science Contributing Open Source code to projects
Chemistry Social Networking
Methods of sharing MY chemistry online include:
Wikis or blogs SlideShare for presentations YouTube for videos Flickr, Wikimedia etc. for images (and FigShare) PubChem for assay data NMRShiftDB for NMR assignments Google Docs for data (and FigShare)
FigShare
Chemistry Social Networking
Methods of sharing MY chemistry online include:
Wikis or blogs SlideShare for presentations YouTube for videos Flickr, Wikimedia etc. for images (and FigShare) PubChem for assay data NMRShiftDB for NMR assignments Google Docs for data (and FigShare)
In what other online environments can you immediately share chemistry data?
ChemSpider
ChemSpider is a chemistry database: >25 million compounds, >400 data sources
A deposition platform
Structure(s) Identifiers Links to internet resources, articles and DOIs Experimental data (spectra, images, CIFs) Multimedia (videos, MP3s)
A curation and annotation platform
Remove “bad data” Annotate existing data
A publishing platform for the community
Search for a Chemical by name
Available Information…
Linked to vendors, safety data, toxicity, metabolism
Available Information…
Crowdsourced “Annotations”
Users can add
Descriptions/Syntheses/Commentaries Links to PubMed articles Links to articles via DOIs Add spectral data Add Crystallographic Information Files Add photos Add MP3 files Add Videos
Spectra
Inherited Errors
Inherited errors from every database… all public compound databases, including ours, have errors
“Incorrect” structures – assertions, timelines etc
“Incorrect” names associated with structures
ENORMOUS CHALLENGE
Crowdsourced Curation
Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
Search “Vitamin H”
“Curate” Identifiers
Crowdsourcing Works
>130 people have deposited data and participated in data curation
Curators at different levels check each other's work
More curators and depositors are encouraged!
Molbank (Open Access Journal)
ChemSpider SyntheticPages
Many syntheses are not published but are of value
CSSP: A database of synthesis procedures built for the community, by the community.
Peer-reviewed by the community
Each contribution has a DOI – of value to the submitter?
Vandalism
Vandalism of ChemSpider is VERY rare…
Three acts of vandalism ever
Someone tried to “sell a house!” A vendor posted their logo against a chemical A student, Katie Crow, posted a “personal photo”
But poor data quality can look like vandalism!
Drivers in the Social Network
Anonymity is a choice in the social networks
Many people on Wikipedia are anonymous Many blogs are anonymous Comments on blogs can be anonymous
Anonymity in peer-review will likely become less important and may be generational
I may want acknowledgment if…
I share my data I review a paper I share my expertise
The Alt-Metrics Manifesto
http://altmetrics.org/manifesto/
Enabled by ORCID…
Who declares data as Open?
Data licensing is very interesting and can spark “interesting” conversations. Opinions differ:
Are images data? Are assertions data? What on a ChemSpider record is data? Is PubChem or PubMed Open Data?
We allow people to declare their data as Open and add an Open Data button at upload
A lot of data on ChemSpider are free but not Open
Pragmatism: Our focus is a community resource
Licensing “My Work” Online
Is it “my” chemistry once it’s online?
The complex nature of licensing “my” chemistry
Blogs - copyrighted and Creative Commons Wikis - mixed licensing, depends on the host(s) Data – much value in sharing data as “Open Data”
Often, people can make money from your work!
Police your own “licensing” – how many people have read the Facebook and Twitter agreements?!
ChemSpider: A Structure Centric Host
An established community resource
>25 million compounds from >400 data sources Thousands of users per day Approaching a million transactions per day A crowdsourced deposition and curation platform Grows daily – more depositions, more data A publishing platform for the community Contributions welcome! Learn how…
ChemSpider Training Session
ChemSpider: A Community Resource for Chemical Data
Wednesday, March 30th
8:30-11:00 AM
Anaheim Convention Center, Room 211 A
Acknowledgments
RSC|ChemSpider team The “Crowd” of curators All Data Source providers
GGA Software Services ACD/Labs OpenEye Accelrys
Thank you
Email: [email protected]
Twitter: ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams