heavenly conjunctions in chemical information

[1] Chemicalize.org, SureChemOpen, PubChem and the InChIKey: Heavenly conjunctions with transformative utility. Christopher Southan, TW2Informatics, Göteborg, Sweden. ChemAxon UGM, Budapest, May 2013. Video: http://www.youtube.com/watch?feature=player_embedded&v=OKLw9BaQzY0#t=0s Related posters: http://www.slideshare.net/cdsouthan/the-patent-chemistry-big-bang-in-pubchem http://www.slideshare.net/cdsouthan/cs-cax-bioitchemicalizeposter03apr

Upload: chris-southan

Post on 25-May-2015


DESCRIPTION

ChemAxon UGM 2013. The ChemAxon name-to-structure functionality is not only a component of the SureChem patent extraction pipeline but also powers chemicalize.org. Both operations now submit to PubChem as sources. The former has deposited structures that bring the patent-extracted total in PubChem to 14.5 million CIDs. The deposition from chemicalize.org is ~0.3 million, but it has been actively selected by users and is 20% unique. The final conjunction is that all three sources generate the InChIKey (IK), which turns Google into a de facto merge of PubChem and ChemSpider covering ~50 million structures. Chemicalize.org users can convert new patents, other external or internal documents, and web-based text. Individual results can be Googled or searched against SureChemOpen, and bulk extractions can be triaged against PubChem. It thus becomes possible to connect chemistry between patents, papers, abstracts and database records via exact match or similarity searching. When SureChem and chemicalize.org update their submissions, relationships with the other ~200 PubChem sources (including ChEMBL and vendor databases) are recomputed and new CID links are made. The synergy between SureChem and chemicalize.org is powerful because matches between them (~0.15 million), found via SureChemOpen, give occurrence statistics and the location of the structure within patents. The applications of chemicalize.org are extended by web tools such as Venny, for determining intersects from multiple extractions, and CheS-Mapper, for cluster visualization. These utility expansions will be illustrated by documents specifying BACE1 inhibitors for Alzheimer’s disease.

TRANSCRIPT

Page 1: Heavenly Conjunctions in Chemical Information

[1]

Chemicalize.org, SureChemOpen, PubChem and the InChIKey: Heavenly conjunctions with transformative utility

Christopher Southan, TW2Informatics, Göteborg, Sweden,

ChemAxon UGM, Budapest, May 2013

Video http://www.youtube.com/watch?feature=player_embedded&v=OKLw9BaQzY0#t=0s

Related Posters: http://www.slideshare.net/cdsouthan/the-patent-chemistry-big-bang-in-pubchem

http://www.slideshare.net/cdsouthan/cs-cax-bioitchemicalizeposter03apr

Image credit: http://www.eso.org/public/images/yb_vlt_moon_cnn_cc/

Page 2: Heavenly Conjunctions in Chemical Information

[2]

Dr Christopher Southan, Ph.D., M.Sc., B.Sc.
TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm
Mobile: +46(0)702-530710
Skype: cdsouthan
Email: [email protected]
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/
LinkedIn: http://www.linkedin.com/in/cdsouthan
Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications
Slideshare: http://www.slideshare.net/cdsouthan
Figshare: http://figshare.com/authors/Christopher%20Southan/97432

Page 3: Heavenly Conjunctions in Chemical Information

[3]

Abstract

Page 4: Heavenly Conjunctions in Chemical Information

[4]

Auspicious Conjunctions 2012-13

• PubChem: structures to slice ‘n dice (48 mill)
• SureChemOpen: majority of patent chemistry opened up (14.5 mill)
• Chemicalize.org: chemistry extractable from any text tombs (0.3 mill)
• Chemical images: patents extracted in SureChemOpen, OSRA handles papers
• InChIKey indexing in Google (50 mill +)
• ChemSpider: crowdsourcing chemistry quality (28 mill)
• Expanding toolbox, e.g. OPSIN, Venny, CheS-Mapper
• SciBite alerts
• Expanding preview and surfacing options, e.g. ChEMBLntd, GitHub, OSDD, Open Lab Books, figshare etc.
• Rise of mobile chemistry

Page 5: Heavenly Conjunctions in Chemical Information

[5]

Databases <> structures <> documents

[Figure: diagram linking databases, structures and documents. Document types: Abstracts, Patents, Papers. Counts shown: 15 mill, 0.2 mill (MeSH), 0.8 mill (ChEMBL), 12K]

Google InChIKey ~ 50 million (47m PubChem + 33m UniChem + 28m ChemSpider)

Page 6: Heavenly Conjunctions in Chemical Information

[6]

Triaging chemistry from text

• Identify the structure specification types, e.g.
– Semantic names (all sources)
– Code names (press releases, papers and abstracts)
– IUPAC names (papers, patents and abstracts)
– Images (papers, patents, & Google images)
– SMILES (open lab books)
– InChI strings (open lab books)
– SDF files (open lab books, & GitHub)

• Convert these to a structure (e.g. SDF, SMILES, InChI) then:
– Search InChIKey in Google
– Search major databases
– Search SureChemOpen
– Compare extracted sets for intersects and diffs
– Extend exact match connectivity with similarity searching
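The "search InChIKey in Google" and "search major databases" steps can be sketched in a few lines of Python. This is only an illustration, not part of the talk: the function names are hypothetical, the regular expression checks the shape of a standard InChIKey only, and the PubChem address follows the PUG REST pattern for InChIKey lookups. The example key is the one for aspirin.

```python
import re
from urllib.parse import quote, quote_plus

# Shape of an InChIKey: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_inchikey(text):
    """Return True if the text has the shape of an InChIKey."""
    return bool(INCHIKEY_RE.match(text.strip()))


def google_query_url(inchikey):
    """Build a Google search URL for the key as an exact quoted phrase."""
    if not is_inchikey(inchikey):
        raise ValueError("not an InChIKey: %r" % inchikey)
    return "https://www.google.com/search?q=" + quote_plus('"%s"' % inchikey.strip())


def pubchem_cids_url(inchikey):
    """Build a PubChem PUG REST URL that returns the CIDs for the key."""
    if not is_inchikey(inchikey):
        raise ValueError("not an InChIKey: %r" % inchikey)
    return ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/"
            + quote(inchikey.strip()) + "/cids/JSON")


if __name__ == "__main__":
    key = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # aspirin
    print(google_query_url(key))
    print(pubchem_cids_url(key))
```

Quoting the key in the Google query keeps the search to exact string matches, which is what makes the InChIKey usable as a cross-database join.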

Page 7: Heavenly Conjunctions in Chemical Information

[7]

PubChem Composition

Page 8: Heavenly Conjunctions in Chemical Information

[8]

SureChemOpen Composition (in PubChem)

Page 9: Heavenly Conjunctions in Chemical Information

[9]

Chemicalize.org Composition (in PubChem)

Page 10: Heavenly Conjunctions in Chemical Information

[10]

BACE2 Conjunctions

Page 11: Heavenly Conjunctions in Chemical Information

[11]

BACE2 Conjunctions

Page 12: Heavenly Conjunctions in Chemical Information

[12]

Chemicalize.org Triage

Page 13: Heavenly Conjunctions in Chemical Information

[13]

BACE2 Conjunctions

1. WO2013054291 > chemicalize.org
2. Download 450 structures
3. Upload to PubChem search
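The last step, triaging an extracted set against PubChem, amounts to splitting the structures into those that already have a CID and those that do not. A minimal sketch follows, assuming the extracted structures are available as InChIKeys and that some lookup function returns the CIDs for a key. Both `triage` and the stub lookup are hypothetical names for illustration; the slide itself uses the PubChem web upload rather than code, and the second key in the example is a made-up placeholder.

```python
def triage(inchikeys, lookup):
    """Split keys into those with database matches and those without.

    `lookup` is any callable that takes an InChIKey and returns a list
    of CIDs (empty if there is no match). Duplicate keys are processed once.
    """
    matched = {}
    unmatched = []
    seen = set()
    for key in inchikeys:
        if key in seen:
            continue
        seen.add(key)
        cids = lookup(key)
        if cids:
            matched[key] = list(cids)
        else:
            unmatched.append(key)
    return matched, unmatched


if __name__ == "__main__":
    # Stub standing in for a real PubChem query (assumption for the demo).
    known = {"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": [2244]}  # aspirin, CID 2244

    def stub_lookup(key):
        return known.get(key, [])

    extracted = [
        "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
        "AAAAAAAAAAAAAA-UHFFFAOYSA-N",  # made-up placeholder key
        "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # duplicate from the same document
    ]
    matched, unmatched = triage(extracted, stub_lookup)
    print(len(matched), "matched;", len(unmatched), "unmatched")
```

Passing the lookup in as a parameter keeps the triage logic independent of any one database, so the same split could be run against another source.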

Page 14: Heavenly Conjunctions in Chemical Information

[14]

Clustering document extraction sets: CheS-Mapper

Page 15: Heavenly Conjunctions in Chemical Information

[15]

Venny: intersects, diffs, de-dupes and merges
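The four operations named on this slide map directly onto set arithmetic. The sketch below is a Python analogue for two extraction sets, not a description of how Venny itself is implemented. The function name and the placeholder identifiers are assumptions; in practice the list items would be InChIKeys or similar identifiers.

```python
def compare_sets(list_a, list_b):
    """Return intersect, diffs and merge for two identifier lists.

    Converting to sets de-duplicates each list; results are sorted so
    the output is stable between runs.
    """
    set_a, set_b = set(list_a), set(list_b)
    return {
        "intersect": sorted(set_a & set_b),  # in both extractions
        "only_a": sorted(set_a - set_b),     # diff: unique to the first
        "only_b": sorted(set_b - set_a),     # diff: unique to the second
        "merged": sorted(set_a | set_b),     # de-duplicated union
    }


if __name__ == "__main__":
    # Placeholder identifiers standing in for extracted structures.
    patent_set = ["KEY-1", "KEY-2", "KEY-2", "KEY-3"]
    paper_set = ["KEY-3", "KEY-4"]
    for name, members in compare_sets(patent_set, paper_set).items():
        print(name, members)
```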

Page 16: Heavenly Conjunctions in Chemical Information

[16]

Conclusions

• Transformative opening up of chemistry > biology via structure > document connectivity
• Open mining of patent metadata and data
• Expanding toolbox
• Inexorable expansion of open-access publishing

But:

• Journal chemistry extraction > database records still slow
• Text mining of journals still restricted
• Author annotation and direct db submission rare
• Pharmaceutical research publications are still blinding structures (see PMID: 23159359)