Experiences in Hosting Big Chemistry Data Collections for the Community. Antony Williams, July 30th 2014, NIST

Upload: orcid-0000-0002-2668-4821

Post on 10-May-2015

DESCRIPTION

Access to scientific information has changed dramatically as a result of the web and its underpinning technologies. The quantities of data, the array of tools available to search and analyze, the devices, and the shift in community participation continue to expand, while the pace of change does not appear to be slowing. RSC hosts a number of chemistry data resources for the community, including ChemSpider, one of the community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day. The platform offers crowdsourcing capabilities that enable the community to deposit and curate data. This presentation will provide an overview of the expanding reach of this cheminformatics platform and the nature of the solutions that it helps to enable, including structure validation, text mining, and semantic markup. ChemSpider is limited in scope as a chemical compound database, and we are presently architecting the RSC Data Repository, a platform that will enable us to extend our reach to include chemical reactions, analytical data, and diverse data depositions from chemists across various domains. We will also discuss the possibilities it offers in terms of supporting data modeling and sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community.

TRANSCRIPT

Page 1: Experiences in Hosting Big Chemistry Data Collections for the Community

Experiences in Hosting Big Chemistry Data Collections for the Community

Antony Williams

July 30th 2014, NIST

Page 2: Experiences in Hosting Big Chemistry Data Collections for the Community

Overview of Our Activities

• The Royal Society of Chemistry as a provider of chemistry for the community:
  • As a charity
  • As a scientific publisher
  • As a host of commercial databases
  • As a partner in grant-based projects
  • As the host of ChemSpider
  • And now in development: the RSC Data Repository for Chemistry

Page 3: Experiences in Hosting Big Chemistry Data Collections for the Community

• ~30 million chemicals and growing

• Data sourced from >500 different sources

• Crowdsourced curation and annotation

• Ongoing deposition of data from our journals and our collaborators

• Structure-centric hub for web-searching

• …and a really big dictionary!!!

Page 4: Experiences in Hosting Big Chemistry Data Collections for the Community

ChemSpider

Page 5: Experiences in Hosting Big Chemistry Data Collections for the Community

ChemSpider

Page 6: Experiences in Hosting Big Chemistry Data Collections for the Community

ChemSpider

Page 7: Experiences in Hosting Big Chemistry Data Collections for the Community

Experimental/Predicted Properties

Page 8: Experiences in Hosting Big Chemistry Data Collections for the Community

Literature references

Page 9: Experiences in Hosting Big Chemistry Data Collections for the Community

Patent references

Page 10: Experiences in Hosting Big Chemistry Data Collections for the Community

RSC Books

Page 11: Experiences in Hosting Big Chemistry Data Collections for the Community

Google Books

Page 12: Experiences in Hosting Big Chemistry Data Collections for the Community

Vendors and data sources

Page 13: Experiences in Hosting Big Chemistry Data Collections for the Community

Crowdsourced “Annotations”

• Users can add
  • Descriptions, Syntheses and Commentaries
  • Links to PubMed articles
  • Links to articles via DOIs
• Add spectral data
• Add Crystallographic Information Files
• Add photos
• Add MP3 files
• Add Videos

Page 14: Experiences in Hosting Big Chemistry Data Collections for the Community

APIs

Page 15: Experiences in Hosting Big Chemistry Data Collections for the Community

APIs

Page 16: Experiences in Hosting Big Chemistry Data Collections for the Community

WebBook and ChemSpider

Page 17: Experiences in Hosting Big Chemistry Data Collections for the Community

WebBook and ChemSpider

Page 18: Experiences in Hosting Big Chemistry Data Collections for the Community

WebBook and ChemSpider

Page 19: Experiences in Hosting Big Chemistry Data Collections for the Community

WebBook and ChemSpider

Page 20: Experiences in Hosting Big Chemistry Data Collections for the Community

WebBook and ChemSpider

Page 21: Experiences in Hosting Big Chemistry Data Collections for the Community

JavaScript viewer: NMR, MS, IR

Page 22: Experiences in Hosting Big Chemistry Data Collections for the Community

Aspirin on ChemSpider

Page 23: Experiences in Hosting Big Chemistry Data Collections for the Community

Many Names, One Structure

Page 24: Experiences in Hosting Big Chemistry Data Collections for the Community

What is the Structure of Vitamin K?

Page 25: Experiences in Hosting Big Chemistry Data Collections for the Community

MeSH

• A lipid cofactor that is required for normal blood clotting.

• Several forms of vitamin K have been identified:
  • VITAMIN K 1 (phytomenadione) derived from plants,
  • VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins,
  • VITAMIN K 3 (menadione).

Page 26: Experiences in Hosting Big Chemistry Data Collections for the Community

What is the Structure of Vitamin K?

Page 27: Experiences in Hosting Big Chemistry Data Collections for the Community

The ultimate “dictionary”

• Search all forms of structure IDs

• Systematic name(s)

• Trivial Name(s)

• SMILES

• InChI Strings

• InChIKeys

• Database IDs

• Registry Number
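A search box that accepts all of these identifier types first has to guess what kind of string it was given. The following is a minimal sketch of such a guess, not ChemSpider's actual logic: the function names and returned labels are assumptions made for illustration. It relies only on well-known formats: standard InChI strings start with "InChI=", InChIKeys have a fixed 14-10-1 uppercase pattern, and CAS Registry Numbers carry a check digit.

```python
import re

# InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter (all uppercase).
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# CAS Registry Number: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(text: str) -> bool:
    """Check the CAS Registry Number format and its check digit."""
    match = CAS_RE.match(text)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def classify_identifier(text: str) -> str:
    """Guess which kind of identifier a query string is (hypothetical labels)."""
    query = text.strip()
    if query.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(query):
        return "inchikey"
    if is_valid_cas(query):
        return "registry_number"
    # Names and SMILES cannot be told apart reliably without a chemistry toolkit.
    return "name_or_smiles"


if __name__ == "__main__":
    for query in ["aspirin", "50-78-2", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
                  "CC(=O)Oc1ccccc1C(=O)O"]:
        print(query, "->", classify_identifier(query))
```

Distinguishing a trivial name from a SMILES string needs a real parser, so the sketch deliberately leaves those two in one bucket.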

Page 28: Experiences in Hosting Big Chemistry Data Collections for the Community

Linking Names to Structures

Page 29: Experiences in Hosting Big Chemistry Data Collections for the Community

Semantic Mark-up of Articles

Page 30: Experiences in Hosting Big Chemistry Data Collections for the Community

Data Quality Issues

Williams and Ekins, DDT, 16: 747-750 (2011)

Science Translational Medicine 2011

Page 31: Experiences in Hosting Big Chemistry Data Collections for the Community

Data quality is a known issue

Page 32: Experiences in Hosting Big Chemistry Data Collections for the Community

Standardize

• Use the SRS as a guidance document for standardization

• Adjust as necessary to our needs

Page 33: Experiences in Hosting Big Chemistry Data Collections for the Community

Nitro groups

Page 34: Experiences in Hosting Big Chemistry Data Collections for the Community

Salt and Ionic Bonds

Page 35: Experiences in Hosting Big Chemistry Data Collections for the Community

Ammonium salts

Page 36: Experiences in Hosting Big Chemistry Data Collections for the Community

CVSP Filtering and Flagging

Page 37: Experiences in Hosting Big Chemistry Data Collections for the Community

Openness and Quality Issues

Williams and Ekins, DDT, 16: 747-750 (2011)

Science Translational Medicine 2011

Page 38: Experiences in Hosting Big Chemistry Data Collections for the Community

Substructure    # of Hits   # of Correct Hits   No stereochemistry   Incomplete stereochemistry   Complete but incorrect stereochemistry
Gonane          34          5                   8                    21                           0
Gon-4-ene       55          12                  3                    33                           7
Gon-1,4-diene   60          17                  10                   23                           10
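To make the scale of the problem concrete, the counts in this table can be turned into a fraction of correct hits per substructure. The short sketch below does only that arithmetic; the variable and function names are illustrative, and the numbers are copied from the table.

```python
# Counts from the table: (hits, correct, no stereo, incomplete stereo, incorrect stereo)
ROWS = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}


def percent_correct(rows):
    """Return the percentage of correct hits per substructure, to one decimal."""
    result = {}
    for name, (hits, correct, none, incomplete, incorrect) in rows.items():
        # The four categories should account for every hit.
        if correct + none + incomplete + incorrect != hits:
            raise ValueError(f"categories do not sum to hits for {name}")
        result[name] = round(100 * correct / hits, 1)
    return result


if __name__ == "__main__":
    for name, pct in percent_correct(ROWS).items():
        print(f"{name}: {pct}% correct")
```

In every row the four categories sum to the total number of hits, and fewer than 30% of the hits are fully correct for any of the three substructures.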

Page 39: Experiences in Hosting Big Chemistry Data Collections for the Community

Crowdsourced Enhancement

• The community can clean and enhance the database by providing feedback and direct curation

• Tens of thousands of edits made

Page 40: Experiences in Hosting Big Chemistry Data Collections for the Community

Data Quality is Work

• Cholesterol

• Taxol

Page 41: Experiences in Hosting Big Chemistry Data Collections for the Community

Maybe we can help?

• Is there an interest in data checking the WebBook or other NIST data sources?

Page 42: Experiences in Hosting Big Chemistry Data Collections for the Community

Publications: summary of work

• Scientific publications are a summary of work
  • Is all work reported?
  • How much science is lost to pruning?
  • What of value sits in notebooks and is lost?
  • Publications offering access to “real data”?

• How much data is lost?
  • How many compounds never reported?
  • How many syntheses fail or succeed?
  • How many characterization measurements?

Page 43: Experiences in Hosting Big Chemistry Data Collections for the Community

What are we building?

• We are building the “RSC Data Repository”

• Containers for compounds, reactions, analytical data, tabular data

• Algorithms for data validation and standardization

• Flexible indexing and search technologies

• A platform for modeling data and hosting existing models and predictive algorithms

Page 44: Experiences in Hosting Big Chemistry Data Collections for the Community

Deposition of Data

Page 45: Experiences in Hosting Big Chemistry Data Collections for the Community

Compounds

Page 46: Experiences in Hosting Big Chemistry Data Collections for the Community

Reactions

Page 47: Experiences in Hosting Big Chemistry Data Collections for the Community

Analytical data

Page 48: Experiences in Hosting Big Chemistry Data Collections for the Community

Crystallography data

Page 49: Experiences in Hosting Big Chemistry Data Collections for the Community

Can we get historical data?

• Text and data can be mined

• Spectra can be extracted and converted

• SO MUCH Open Source Code available

Page 50: Experiences in Hosting Big Chemistry Data Collections for the Community

Text Mining

The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) were charged into a glass reaction vessel equipped with a mechanical stirrer, thermometer and reflux condenser.

The reaction mixture was heated at reflux with stirring, for a period of about one-half hour.

After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue.
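As a toy illustration of what mining such a paragraph involves, the sketch below pulls the parenthesized reagent quantities out of the text with a regular expression. It is an assumption-laden example, not the text-mining pipeline used for this work: the unit list and the function name are invented for illustration, and real systems also have to recognize the chemical names, roles and conditions.

```python
import re

# Matches quantities written as "(5 ml)" or "( 50 ml )"; the unit list is an
# illustrative assumption and far from complete.
QUANTITY_RE = re.compile(r"\(\s*(\d+(?:\.\d+)?)\s*(ml|mL|g|mg|mmol|mol)\s*\)")


def extract_quantities(text):
    """Return (value, unit) pairs for every parenthesized quantity in the text."""
    return [(float(value), unit) for value, unit in QUANTITY_RE.findall(text)]


SAMPLE = (
    "thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into "
    "a glass reaction vessel"
)

if __name__ == "__main__":
    print(extract_quantities(SAMPLE))
```

The pattern tolerates the extra spaces that tokenized text often carries around brackets, which is why it accepts both "(5 ml)" and "( 5 ml )".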

Page 51: Experiences in Hosting Big Chemistry Data Collections for the Community

Text Mining

The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) were charged into a glass reaction vessel equipped with a mechanical stirrer, thermometer and reflux condenser.

The reaction mixture was heated at reflux with stirring, for a period of about one-half hour.

After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue.

Page 52: Experiences in Hosting Big Chemistry Data Collections for the Community

Text spectra?

13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
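A text spectrum like this one can be converted into a simple peak list. The following is a minimal sketch under stated assumptions: the `PeakList` class and `parse_nmr_text` function are hypothetical names, and the parsing relies on the shifts being the only decimal numbers after the δ symbol, which holds for this 13C string but not in general.

```python
import re
from dataclasses import dataclass

# Header such as "13C NMR (CDCl3, 100 MHz)": nucleus, solvent, frequency.
HEADER_RE = re.compile(r"(\d+[A-Z][a-z]?) NMR \(([^,]+), (\d+) MHz\)")
# Chemical shifts are assumed to be the decimal numbers after the delta sign.
SHIFT_RE = re.compile(r"\d+\.\d+")


@dataclass
class PeakList:
    nucleus: str
    solvent: str
    frequency_mhz: int
    shifts_ppm: list


def parse_nmr_text(text):
    """Parse a one-line text spectrum into a sorted list of shifts (ppm)."""
    header = HEADER_RE.search(text)
    if header is None or "δ" not in text:
        raise ValueError("not a recognizable text spectrum")
    _, _, peaks_part = text.partition("δ")
    shifts = sorted(float(s) for s in SHIFT_RE.findall(peaks_part))
    return PeakList(header.group(1), header.group(2).strip(),
                    int(header.group(3)), shifts)


SAMPLE = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, "
    "120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), "
    "99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)"
)

if __name__ == "__main__":
    peaks = parse_nmr_text(SAMPLE)
    print(peaks.nucleus, peaks.solvent, peaks.frequency_mhz, len(peaks.shifts_ppm))
```

A 1H string would need more care, because coupling constants such as "J = 4.8 Hz" are also decimal numbers and would be mistaken for shifts by this simple pattern.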

Page 53: Experiences in Hosting Big Chemistry Data Collections for the Community

1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)

Page 54: Experiences in Hosting Big Chemistry Data Collections for the Community

Turn “Figures” Into Data

Page 55: Experiences in Hosting Big Chemistry Data Collections for the Community

Make it interactive

Page 56: Experiences in Hosting Big Chemistry Data Collections for the Community

SO MANY reactions!

Page 57: Experiences in Hosting Big Chemistry Data Collections for the Community

Extracting our Archive

• What could we get from our archive?
  • Find chemical names and generate structures
  • Find chemical images and generate structures
  • Find reactions
  • Find data (MP, BP, LogP) and deposit
  • Find figures and database them
  • Find spectra (and link to structures)

Page 58: Experiences in Hosting Big Chemistry Data Collections for the Community

Models published from data

Page 59: Experiences in Hosting Big Chemistry Data Collections for the Community

Text-mining Data to compare

Page 60: Experiences in Hosting Big Chemistry Data Collections for the Community

How is DERA going?

• We have text-mined all 21st century articles… >100k articles from 2000-2013

• Marked up with XML and published onto the HTML forms of the articles

• Required multiple iterations based on dictionaries, markup, text mining iterations

• New visualization tools in development – not just chemical names. Add chemical and biomedical terms markup also!

Page 61: Experiences in Hosting Big Chemistry Data Collections for the Community

Work in Progress

Page 62: Experiences in Hosting Big Chemistry Data Collections for the Community

Work in Progress

Page 63: Experiences in Hosting Big Chemistry Data Collections for the Community

Work in Progress

Page 64: Experiences in Hosting Big Chemistry Data Collections for the Community

Work in Progress

Page 65: Experiences in Hosting Big Chemistry Data Collections for the Community

Is It Easy?

[Workflow diagram; recoverable labels:]

• Dictionary (ontologies): RSC ontologies (methods, reactions)

• Dictionary (chemistry): curated dictionaries for known names

• Unknown names: automated name to structure conversion (ACD N2S, OPSIN)

• Text-mining

• Marked-up XML

• Production processes

• XML ready for publication

• CDX integration (coming soon)

• Chemical structures SD file
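The name-handling part of this workflow (curated dictionaries for known names, automated conversion for unknown ones) can be sketched as a lookup with a fallback. This is an illustrative assumption about how the two steps fit together, not the production code: `CURATED`, `resolve_name` and the `converter` argument are hypothetical, and no real name-to-structure tool such as OPSIN or ACD N2S is called here.

```python
from typing import Callable, Optional, Tuple

# A tiny stand-in for a curated name dictionary (name -> SMILES).
CURATED = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "benzene": "c1ccccc1",
    "thionyl chloride": "O=S(Cl)Cl",
}


def resolve_name(
    name: str,
    converter: Optional[Callable[[str], Optional[str]]] = None,
) -> Tuple[Optional[str], str]:
    """Resolve a chemical name: curated dictionary first, converter second.

    `converter` is a placeholder for an automated name-to-structure tool;
    it should return a SMILES string, or None if it cannot parse the name.
    """
    key = " ".join(name.lower().split())  # normalize case and whitespace
    if key in CURATED:
        return CURATED[key], "dictionary"
    if converter is not None:
        structure = converter(name)
        if structure:
            return structure, "name-to-structure"
    return None, "unresolved"


if __name__ == "__main__":
    print(resolve_name("Aspirin"))
    print(resolve_name("ethanol", converter=lambda n: "CCO" if n == "ethanol" else None))
    print(resolve_name("unknownium"))
```

Checking the curated dictionary first means that hand-verified names always win over algorithmic conversion, and the returned label records which route produced each structure.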

Page 66: Experiences in Hosting Big Chemistry Data Collections for the Community

Acknowledgments

• Regarding InChI – Steve Stein, Steve Heller, Dmitrii Tchekhovskoi, Igor Pletnev

Page 67: Experiences in Hosting Big Chemistry Data Collections for the Community

Email: [email protected]
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams

Thank you