Hosting a Compound Centric Community Resource for Chemistry Data

Antony Williams, ACS Anaheim March 28th 2011

Upload: orcid-0000-0002-2668-4821

Post on 10-May-2015

DESCRIPTION

Laboratories around the world continue to generate immense amounts of data that are non-proprietary and of value to the community. If made available, these data could dramatically reduce costs by minimizing rework and ultimately facilitate faster research. High-quality reference data collections of chemical compound dictionaries, properties and spectra have been generated over many decades. With the advent of social networking tools and platforms such as Wikipedia, the community has an opportunity to contribute. The ChemSpider platform, hosted by the Royal Society of Chemistry, is a compound-centric database with associated data. It is already populated with almost 25 million unique compounds, and the community can deposit and host their own data, and curate and annotate existing data, including those generated in Open Notebook Science efforts. This presentation will provide an overview of progress to date and outline the vision for this community platform for chemistry and for ensuring the longevity of chemistry reference data.

TRANSCRIPT

Page 1: Hosting a compound centric community resource for chemistry data

Hosting a Compound Centric Community Resource for Chemistry Data

Antony Williams, ACS Anaheim March 28th 2011

Page 2

Data Archiving, e-Science and Primary Data

How much data generated in a lab, that COULD go public, is lost forever?

Public Domain reference databases of value?

Syntheses
Properties
Spectra
CIFs
Images

Much of chemistry is chemical structure-based – where and how could we host these data?

Page 3

The Social Network

Career-wise, within the next few years NOT having a personal presence online will be a detriment

Self-marketing
Establishing a profile
Getting on the record
Collaborative Science
Demonstrating a skill set
Measured using alternative metrics
Contributing to the public peer review process

Page 4

Social Networking Tools

A growing number of social networking tools:

Facebook
Twitter
LinkedIn
Flickr
YouTube
Blogs
Communities
Collaborative environments

Page 5

Collaborative Knowledge Management

Page 6

TotallySynthetic.com

Page 7

Contributing Chemistry Online

Property databases
Compound aggregators
Screening assay results
Scientific publications
Encyclopedic articles (Wikipedia)
Metabolic pathway databases
ADME/Tox data (eTOX, for example)
Blogs/Wikis and Open Notebook Science
Contributing Open Source code to projects

Page 8

Chemistry Social Networking

Methods of sharing MY chemistry online include:

Wikis or blogs
Slideshare for presentations
YouTube for videos
Flickr, Wikimedia etc. for images (and FigShare)
PubChem for assay data
NMRShiftDB for NMR assignments
GoogleDocs for data (and FigShare)

Page 9

FigShare

Page 10

FigShare

Page 11

Chemistry Social Networking

Methods of sharing MY chemistry online include:

Wikis or blogs
Slideshare for presentations
YouTube for videos
Flickr, Wikimedia etc. for images (and FigShare)
PubChem for assay data
NMRShiftDB for NMR assignments
GoogleDocs for data (and FigShare)

In what other online environments can you immediately share chemistry data?

Page 12

ChemSpider

ChemSpider is a chemistry database
  >25 million compounds, >400 data sources

A deposition platform
  Structure(s)
  Identifiers
  Links to internet resources, articles and DOIs
  Experimental data (spectra, images, CIFs)
  Multimedia (videos, MP3s)

A curation and annotation platform
  Remove “bad data”
  Annotate existing data

A publishing platform for the community

Page 13

Search for a Chemical by name

Page 14

Available Information…

Linked to vendors, safety data, toxicity, metabolism

Page 15

Available Information…

Page 16

Crowdsourced “Annotations”

Users can add:

Descriptions/Syntheses/Commentaries
Links to PubMed articles
Links to articles via DOIs
Spectral data
Crystallographic Information Files
Photos
MP3 files
Videos

Page 17
Page 18

Spectra

Page 19

Spectra

Page 20

Inherited Errors

Inherited errors from every database… all public compound databases, including ours, have errors

“Incorrect” structures – assertions, timelines etc

“Incorrect” names associated with structures

ENORMOUS CHALLENGE

Page 21

Crowdsourced Curation

Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
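
The curation actions listed on this slide (tag errors, edit names and synonyms, deprecate records) can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the class names, status values and curator names below are hypothetical and do not describe ChemSpider's actual data model or API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    """Hypothetical curation states for a name attached to a compound."""
    UNREVIEWED = "unreviewed"
    CONFIRMED = "confirmed"
    FLAGGED = "flagged"        # tagged as a suspected error
    DEPRECATED = "deprecated"  # hidden after review


@dataclass
class Synonym:
    name: str
    status: Status = Status.UNREVIEWED
    # Each entry is (curator, old_status, new_status), so changes stay auditable.
    history: list = field(default_factory=list)


@dataclass
class CompoundRecord:
    record_id: int  # illustrative id, not a real ChemSpider ID
    synonyms: dict = field(default_factory=dict)

    def add_synonym(self, name: str) -> None:
        """Attach a name to the record if it is not already present."""
        self.synonyms.setdefault(name, Synonym(name))

    def curate(self, name: str, curator: str, new_status: Status) -> None:
        """Record who changed the status of a name, and update it."""
        syn = self.synonyms[name]
        syn.history.append((curator, syn.status, new_status))
        syn.status = new_status

    def visible_names(self) -> list:
        """Names still shown to users (everything except deprecated ones)."""
        return sorted(s.name for s in self.synonyms.values()
                      if s.status is not Status.DEPRECATED)


# Usage: "Vitamin H" is a valid synonym for biotin; "Vitamin C" is not.
record = CompoundRecord(record_id=1)
for name in ("Biotin", "Vitamin H", "Vitamin C"):
    record.add_synonym(name)

record.curate("Vitamin H", "curator_a", Status.CONFIRMED)
record.curate("Vitamin C", "curator_a", Status.FLAGGED)      # first curator tags the error
record.curate("Vitamin C", "curator_b", Status.DEPRECATED)   # second curator confirms it

print(record.visible_names())  # ['Biotin', 'Vitamin H']
```

Keeping a per-name history rather than overwriting the status is one way to let curators review each other's changes.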

Page 22

Search “Vitamin H”

Page 23

“Curate” Identifiers

Page 24

“Curate” Identifiers

Page 25

“Curate” Identifiers

Page 26

Crowdsourcing Works

>130 people have deposited data and participated in data curation

Curators at different levels check each other's work

More curators and depositors are encouraged!

Page 27

Molbank (Open Access Journal)

Page 28

ChemSpider SyntheticPages

Many syntheses are not published but are of value

CSSP: A database of synthesis procedures built for the community, by the community.

Peer-reviewed by the community

Each contribution has a DOI – of value to the submitter?

Page 29
Page 30

Vandalism

Vandalism of ChemSpider is VERY rare…

Three acts of vandalism ever
  Someone tried to “sell a house!”
  A vendor posted their logo against a chemical
  A student, Katie Crow, posted a “personal photo”

But poor data quality can look like vandalism!

Page 31

Drivers in the Social Network

Anonymity is a choice in the social networks
  Many people on Wikipedia are anonymous
  Many blogs are anonymous
  Comments on blogs can be anonymous

Anonymity in peer-review will likely become less important and may be generational

I may want acknowledgment if…
  I share my data
  I review a paper
  I share my expertise

Page 32

The Alt-Metrics Manifesto

http://altmetrics.org/manifesto/

Page 33

Enabled by ORCID…

Page 34

Who declares data as Open?

Data licensing is very interesting and can spark “interesting” conversations. Opinions differ:
  Are images data?
  Are assertions data?
  What on a ChemSpider record is data?
  Is PubChem or PubMed Open Data?

We allow people to declare their data as Open and add an Open Data button at upload

A lot of data on ChemSpider are free but not Open

Pragmatism: Our focus is a community resource

Page 35

Licensing “My Work” Online

Is it “my” chemistry once it’s online?

The complex nature of licensing “my” chemistry
  Blogs: copyrighted and Creative Commons
  Wikis: mixed licensing, depends on the host(s)
  Data: much value in sharing data as “Open Data”

Often, people can make money from your work!

Police your own “licensing” – how many people have read the Facebook and Twitter agreements?!

Page 36

ChemSpider: A Structure Centric Host

An established community resource

>25 million compounds from >400 data sources
Thousands of users per day
Approaching a million transactions per day
A crowdsourced deposition and curation platform
Grows daily: more depositions, more data
A publishing platform for the community
Contributions welcome! Learn how…

Page 37

ChemSpider Training Session

ChemSpider: A Community Resource for Chemical Data

Wednesday, March 30th

8:30-11:00 AM

Anaheim Convention Center, Room 211 A

Page 38

Acknowledgments

RSC|ChemSpider team
The “Crowd” of curators
All Data Source providers

GGA Software Services
ACD/Labs
OpenEye
Accelrys

Page 39

Thank you

Email: [email protected]
Twitter: ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams