semantic chemical publishing
DESCRIPTION
Semantic Chemical Publishing. Nick Day*, Peter Corbett, Peter Murray-Rust Unilever Centre for Molecular Informatics, University of Cambridge, UK. March 27 th , 2007. All software Open Source * [email protected]. Overview. What is ‘semantic chemistry’ and markup? - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/1.jpg)
Semantic Chemical Publishing
Nick Day*, Peter Corbett, Peter Murray-RustUnilever Centre for Molecular Informatics, University of Cambridge, UK.
March 27th, 2007
• All software Open Source* [email protected]
![Page 2: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/2.jpg)
Overview
• What is ‘semantic chemistry’ and markup?
• OSCAR3 – robotic analysis of chemistry in free text• recognition of chemical names• name-2-structure• chemical verbs, adjectives and reaction names• terminologies (e.g. techniques)• RSC Project Prospect …
• CrystalEye – creating semantic chemistry from crystallography:
• High-throughput robotic harvesting • Re-use using CIF2CML• Dissemination through CMLRSS
![Page 3: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/3.jpg)
The Semantic Web
“People keep asking what Web 3.0 is. I think maybe when you've got an overlay of scalable vector graphics […] on Web 2.0 and access to a semantic Web integrated across a huge space of data, you'll have access to an unbelievable data resource.”
- Tim Berners-Lee, A 'more revolutionary' Web (2006)
![Page 4: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/4.jpg)
…Let’s change the vision to chemistry…
![Page 5: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/5.jpg)
The Chemical Semantic Web “…when you've got an overlay of [CML,
InChI and chemical ontologies - everything well-defined and marked up] - on Web 2.0 and access to a semantic Web integrated across a huge space of data, you'll have access to an unbelievable data resource.”
[our adaptations]
… what are chemical semantics and CML?…
…..
![Page 6: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/6.jpg)
Implicit and explicit semantics• Implicit semantics
“Compound 2a melted at 119oC”humans are good at interpreting this; machines see just a string.
• Explicit semantics
<cml:molecule ref=“2a”> <cml:property> <cml:scalar dictRef=“prop:mpt” units=“units:celsius” dataType=“xsd:float” >119</cml:scalar> </cml:property> </cml:molecule>
4 namespaces, 3 dictionaries
propertyDictionary
unitsDictionary
W3CSchema
CML Schema
Molecules in CML/InChI
![Page 7: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/7.jpg)
UCC’s approach to creating Semantic Chemistry
• Authoring tools for theses and collaboration with publishers• XML-ization (through FoX) of Comp. Chem. codes (MOPAC,
CASTEP, SIESTA, GULP, ABINIT, DL_POLY GAMESS…)• Capturing/conversion of CML data at source (SPECTRa)• Rich clients (Bioclipse)• Legacy Conversion (OpenBabel, CDK, JUMBO…)• Intelligent Ontologies (Golem)
(today we will cover the following…)• Chemical Linguistics and text-mining (OSCAR3) • Legacy Conversion – CIF
Many of these semantic chemical components are now deployed or prototyped…
![Page 8: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/8.jpg)
Chemical semantic framework at UCC
prototyped Under development
legacy
CIF2CML
OSCAR3
CrystalEye
Golem
SPECTRa JUMBO
FoX
OSCAR1
Units
![Page 9: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/9.jpg)
“OSCAR”
OSCAR1 + CheckCML (2003, 2004, 2005, 2006) Student projects supported by RSC SciBorg (2005-2009) EPSRC project (Computer Lab, Chemistry, Cambridge)OSCAR3 (Peter Corbett)
• Recognition of chemical entities.• Name2structure, chemical diagrams, canonical identifiers• Chemical heuristics to parse article full-text• Links to ontologies and molecular databases.• Open source• High-throughput – 500, 000 PubMed abstracts parsed• Substructure and similarity search on corpora
![Page 10: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/10.jpg)
OSCAR3 Concepts - Example
All markup is automatic
![Page 11: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/11.jpg)
…how can this be used for publishing?...
UCC and RSC have been collaborating on transferring this technology to journal articles…
Project Prospect (2007) adds semantics…
![Page 12: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/12.jpg)
Project Prospect (RSC)
Typical HTML paper Semantics confined to hyperlinks
Prospect adds more semantics…
![Page 13: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/13.jpg)
… Prospect markup includes:
• CML (Chemical Markup Language)
• InChI
• IUPAC Gold Book
• Gene Ontology
• …and more coming …
…the marked-up semantic paper…
![Page 14: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/14.jpg)
Project Prospect RSC
… but not all chemistry is in free text …
![Page 15: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/15.jpg)
Gaussian
CIF
Chemistry is also “Data”
… some examples of data taken from theses…
![Page 16: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/16.jpg)
Semantic Chemistry and the Datument
The document is only part of the scientific recordWe can transform the experimental data to CML.The integrated result is a datument…
1
2 3
4
Experimental
Text / HTML
Compounds andAnalytical data
Crystallography
Reactions
Comp chem output
Measurements and properties
Supplemental andOther data
+
• P. Murray-Rust, The complete chemical E-publication, 216th ACS National Meeting, Boston, August 23-27 (1998), CINF-033.
• Peter Murray-Rust, Henry S. Rzepa and Michael Wright, Development of Chemical Markup Language (CML) as a System for Handling Complex Chemical Content, New J. Chem., 2001, 618-634.
• P. Murray-Rust and H. S. Rzepa, "The Next Big Thing: From Hypermedia to Datuments", J. Digital Inf., 2004, 5, article 248, 2004-03-18.
<xhtml/><mathml/><svg/><cml/><animl/><thermoml/>
XML Datument
![Page 17: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/17.jpg)
The Datument
1
2 3
4
Experimental
CMLCore,InChI
CMLReact
CMLSpect
AnIML
CMLCryst
CMLComp
CMLTable
graphs
O
OO
O
OSCAR3
CrystalEye
![Page 18: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/18.jpg)
Chemical Crystallography
Universally published as CIF.• Complete output of structure experiment,• standard supplementary data for article full-text.• > 10, 000 CIFs published online per year• > 30, 000 unpublished per year (e.g. theses)
![Page 19: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/19.jpg)
CrystalEye
The aim:
To automatically create semantic chemistry from crystallography (CIFs) published on the Web.
![Page 20: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/20.jpg)
CIFCIFCIFCIF
CrystalEye
CIFCIFCML
CML
Aggregation
• Web spider checks publishers and repositories every day.• Currently over 60,000 validated CIF files.
TOC
Article Article Article
CIF
IUCr, RSC
publishers
OAI-PMHThesis
Record Record
CIF
Cambridge, Imperial
repositories
CML
Dept.Crystallographic
ServiceCIF
CIF
Browse/SearchInterface
SPECTRa
SPECTRa-T
![Page 21: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/21.jpg)
Marking up and ValidationCIFDOI
InChI SMILES
CDKjni-InChI
CML*
CheckCIF
CheckCIF HTML
CheckCIF XML
Raw crystalstructure
CML
CIFDOM
Legacy2CML
JUMBO
• Validated,• Disorder resolved,• Unique molecules,• Bond orders and charges.
2D image
Stereochemistry added
![Page 22: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/22.jpg)
CML*
JUMBO
Ring nuclei Metal centres
CrystalEye: Re-Use through XML/CML• Automatic generation of fragments
• ca. 1 million fragments with 50,000 different chemical types• Open Access via automatically generated HTML
Here are examples of a ring-nucleus and a metal centre…
Ligands Metal ClustersLinkers
Cu centre
Cu(OH2)(C2HO2Cl2)2(C6H6N2O)2
![Page 23: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/23.jpg)
CrystalEye for humans
Generatedcontent Fragments Entries Annotations
CrystalEye
Data sources Repository
Publisher
CIF
CIF
Services Substruct. search Data searchBrowse
Browsing links back to
publications
HTML
![Page 24: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/24.jpg)
CrystalEye Webpage DemoLet’s assume we’re interested in Cu-N bonds:
• Browse to title page
• View structure data
• Explore fragments
• Inspect bond lengths
![Page 25: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/25.jpg)
CMLRSS newsfeeds• How can the chemist find every Cu-N bond immediately?
• CrystalEye uses RSS 1, RSS 2 and Atom 1.0 to create both RSS and CMLRSS feeds.
Web browsing Using RSS
= contains Cu-N
= no Cu-N
What we’ve just been doing
CrystalEyeStructure:
TOC Cu-N feed
*read*
RSSReader
…
![Page 26: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/26.jpg)
CML-RSS feeds
• Feeds for:• Journal …Acta Cryst. E, Dalton• compound class …organometallic• atoms in structures …Os, Ir, Pt• bonds in structures …Zn-N, Cu-N
• Thousands of feeds, but the robot does the work.• Client-side software can filter and reorganize the feeds
For a chemist who specialises in Cu-N chemistry, CrystalEye can alert them EVERY DAY to new examples.
RSS CMLRSS
![Page 27: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/27.jpg)
CrystalEye: KnowledgeBase not DataBase
• Aggregation by robots, not humans• All types of chemistry (organic, inorganic, etc…)• Social computing - aggregates COD*, theses• Software validation by robots, not humans,• Open and free;
• Goes live in April…
* Crystallographic Open Database
![Page 28: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/28.jpg)
Chemical semantic framework at UCC
prototyped Under development
CIF2CML
FoX
GOLEM
SPECTRa
CrystalEye
JUMBO
OSCAR3
legacy
![Page 29: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/29.jpg)
Cambridge Semantic Chemistry• Released:
• CML – XML for chemistry• JUMBO – library for CML• OSCAR1/CheckCML – data validation• OSCAR3 – text mining• FoX – Fortran XML
• Soon:• CrystalEye – crystallographic knowledgebase• SPECTRa – chemical repositories
• Later…• Golem - ontologies• CMLUnits
![Page 30: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/30.jpg)
Acknowledgements• CML - Henry Rzepa• OSCAR1 – Sam Adams, Joe Townsend, Fraser Norton, Justin
Davies, Richard Marsh, Jonathan Goodman• OSCAR 3 – Ann Copestake, SimoneTeufel (Computer Lab)• SPECTRa – Jim Downing, Alan Tonge, Peter Morgan• Software - CDK, Jmol, jni-InChI and many Blue Obelisk
contributions• Timo Hannay (NPG), Richard Kidd (RSC), Colin Batchelor (RSC),
Brian McMahon (IUCr)
![Page 31: Semantic Chemical Publishing](https://reader035.vdocument.in/reader035/viewer/2022062217/56813fe4550346895daacf6d/html5/thumbnails/31.jpg)
Thankyou.