![Page 1: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/1.jpg)
New Software Developments on Chemical Information Extraction from Patent Documents and Markush Structure Analysis
Wei Deng (David)
PIUG Meeting
May 2nd, 2012
Denver CO
![Page 2: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/2.jpg)
ChemAxon’s Naming Technology
• Name to structure
– IUPAC, traditional and common names
– A library of existing drugs
– Support CAS Registry number
– Homology group: alkyl, aryl …
– Future: Biological names
• Structure to Name
– IUPAC Name, traditional names
• Accuracy and coverage constantly improving
• Also available from the command line
![Page 3: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/3.jpg)
ChemAxon’s “Document to Structure”
• Extract chemical information from documents
– Names: powered by the Naming Technology
– Also imports SMILES, InChI, CAS number …
– Images: OSRA
– Returns structures and their locations in the document
• Works with scanned PDFs since 5.8 (Feb 2012)
– Great for patent mining
• OCR and syntax correction constantly developed
– OCR output: 3-rnethyl-l-me- thoxynaphthalene
– Corrected: 3-methyl-1-methoxynaphthalene
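The OCR example on this slide can be reproduced with a few toy rules. The sketch below is an illustration only and is not ChemAxon's correction algorithm; the function name `correct_ocr_name`, the `FRAGMENT_FIXES` table, and the three rules are assumptions chosen to cover this one case.

```python
import re

# Hypothetical lookup of OCR misreadings inside known name fragments ("rn" read for "m").
FRAGMENT_FIXES = {"rnethyl": "methyl", "rnethoxy": "methoxy"}


def correct_ocr_name(raw: str) -> str:
    """Apply three toy correction rules to an OCR'd chemical name."""
    # Rule 1: rejoin a word hyphenated across a line break ("me- thoxy" -> "methoxy").
    name = re.sub(r"(?<=[a-z])-\s+(?=[a-z])", "", raw)
    # Rule 2: an isolated "l" or "I" with no letter on either side is read as the locant "1".
    name = re.sub(r"(?<![A-Za-z])[lI](?![A-Za-z])", "1", name)
    # Rule 3: replace known misread fragments.
    for wrong, right in FRAGMENT_FIXES.items():
        name = name.replace(wrong, right)
    return name


print(correct_ocr_name("3-rnethyl-l-me- thoxynaphthalene"))
# prints: 3-methyl-1-methoxynaphthalene
```

A production system would presumably validate each candidate against a name parser rather than rely on a fixed fragment table; the table here only keeps the example short.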
![Page 4: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/4.jpg)
From Document to Structures
Non-searchable patent (50 pages) → Structures (text + image) + location
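As a rough illustration of "structure + location" output, the sketch below finds CAS Registry Numbers in plain text and reports their character offsets. It does not use the Document to Structure API; the names `Hit`, `is_valid_cas` and `find_cas_numbers` are hypothetical. The check-digit rule (digits weighted 1, 2, 3, ... from the right, summed, modulo 10) is the standard CAS validation.

```python
import re
from typing import List, NamedTuple


class Hit(NamedTuple):
    text: str   # the matched CAS Registry Number
    start: int  # character offset where the match begins
    end: int    # character offset just past the match


CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(part1: str, part2: str, check: str) -> bool:
    """Return True if the check digit matches the weighted digit sum modulo 10."""
    digits = (part1 + part2)[::-1]  # rightmost digit gets weight 1
    total = sum(int(d) * weight for weight, d in enumerate(digits, start=1))
    return total % 10 == int(check)


def find_cas_numbers(text: str) -> List[Hit]:
    """Return every checksum-valid CAS number together with its location."""
    return [
        Hit(m.group(0), m.start(), m.end())
        for m in CAS_PATTERN.finditer(text)
        if is_valid_cas(*m.groups())
    ]


sample = "Aspirin (50-78-2) was mixed with water (7732-18-5); batch 12-34-5 was discarded."
for hit in find_cas_numbers(sample):
    print(hit.text, hit.start, hit.end)
```

The string `12-34-5` looks like a CAS number but fails the checksum, so it is not reported.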
![Page 5: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/5.jpg)
Search by Structure or Text
![Page 6: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/6.jpg)
Non-searchable PDF is now Searchable
![Page 7: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/7.jpg)
ChemAxon’s “Document to Structure”
• New Features in 5.9 (Mar 2012)
– MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt …
– Embedded structure objects (ChemDraw, Symyx, Marvin …)
– Progressive display of results
– Speed improvement
– Instant JChem integration; simplified API
• Currently in development for 5.10 (May 2012)
– Image-to-structure “Confidence”
– Fragment groups integration with Markush generation
– Biologic names
![Page 8: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/8.jpg)
Free Online Service Chemicalize.org
• Extract chemical information from web pages and documents
• Interactively display all structures and their predicted properties
• Search all structures extracted
• Gather links of interest to chemists for post processing (search, analysis, reporting, fun…)
• Recently reviewed in the Journal of Chemical Information and Modeling
![Page 9: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/9.jpg)
Webpage - chemicalized
• All chemical names are highlighted with a dotted line
• Mousing over a name pops up the structure image
• Clicking the image leads to the data page
• Links are “respected”
![Page 10: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/10.jpg)
Data Page: Extensive Predicted Properties
• Customizable report layout for calculation results. Users can move, open, close, and expand calculation boxes, and the layout is remembered on the next visit
![Page 11: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/11.jpg)
Webpage - chemicalized
• All structures are summarized above the chemicalized page
• Click on a structure to highlight all occurrences. Click again to navigate to the next occurrence
• All structures can be downloaded as MRV or SDF (useful with online patent full text)
![Page 12: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/12.jpg)
PDF File - chemicalized
![Page 13: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/13.jpg)
Searching Chemicalize.org – Structure Search
Aspirin: query highlighted in results
![Page 14: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/14.jpg)
Searching Chemicalize.org – Keyword Search
• Aspirin; web page hits - “show” related structures
• Autosuggest while typing
![Page 15: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/15.jpg)
Everything is Published
• Recently viewed
– Webpages
– Structures
– Documents
– Searched queries (structure and keyword)
![Page 16: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/16.jpg)
Availability and Customization
• Source code available
• Minor changes to the example code are required for customization, such as
– Importing extracted structures into other databases
– Post-process filtering according to properties
– Batch processing of multiple documents
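The slide does not show the example code itself, so the following is only a stand-in for what post-process filtering by a property might look like. It uses no ChemAxon API: the record layout, `molecular_weight` and `filter_by_weight` are hypothetical, and molecular weight is computed from a simple formula string using rounded standard atomic weights.

```python
import re

# Rounded standard atomic weights for a few common elements.
ATOMIC_WEIGHT = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "F": 18.998, "S": 32.06, "Cl": 35.45,
}


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C9H8O4' (no brackets or charges)."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total


def filter_by_weight(records, max_weight):
    """Keep only records whose molecular weight does not exceed max_weight."""
    return [r for r in records if molecular_weight(r["formula"]) <= max_weight]


# Hypothetical records, as if extracted from a document.
extracted = [
    {"name": "aspirin", "formula": "C9H8O4"},
    {"name": "caffeine", "formula": "C8H10N4O2"},
    {"name": "cholesterol", "formula": "C27H46O"},
]

for record in filter_by_weight(extracted, max_weight=200.0):
    print(record["name"], round(molecular_weight(record["formula"]), 2))
```

Running the same filter over a list of per-document record lists would give a simple form of batch processing.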
![Page 17: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/17.jpg)
MARKUSH TECHNOLOGY UPDATE
![Page 18: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/18.jpg)
ChemAxon - Thomson Reuters Markush project history
1987 Thomson Scientific (Derwent) starts indexing Markush structures (in collaboration with Questel & INPI)
1998 INPI & Derwent Markush databases merge to form MMS (Merged Markush Service)
2000 ChemAxon launches first version of JChem Base
2005 ChemAxon starts working on Markush technology
2008 Markush search & enumeration first release in JChem 5.0
2010 Markush DARC file format support in JChem 5.3
2011 Full MMS searchable with JChem 5.5-5.8
![Page 19: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/19.jpg)
Search the Full Patent Database
• Complete patent database from Thomson Reuters (Markush + exemplified + non-structural information, dating back to 1987)
• Data internally hosted or on Amazon Cloud
• Powerful virtual machine, secure connection and confidential search
• Useful new features:
– Export exemplified structures
– Retrieve patent document
– Enumerate Markush structures and output result
– Notation
• Batch search of multiple queries
• Constantly improving search performance
![Page 20: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/20.jpg)
New Interface, New Buttons, New Features
• All information in one place
![Page 21: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/21.jpg)
Export Exemplified Structures
![Page 22: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/22.jpg)
Retrieve Patent Document
![Page 23: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/23.jpg)
Add Notes
![Page 24: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/24.jpg)
Notation Overview
![Page 25: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/25.jpg)
Search in Instant JChem
• Search in both exemplified and Markush structures
• Various structure search options:
– Substructure or full
– Broad translation
– Stereochemistry, tautomer
– Atom/bond matching
• Text search (including dates)
• Multiple search results can be creatively combined
• Flexible visualization functions to display results, with scripting support
• Integrated with new interfaces for navigation and enumeration
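One way to picture combining multiple search results is as set operations on hit lists. The sketch below is a generic illustration, not Instant JChem's mechanism; the query names and record identifiers are made-up placeholders.

```python
from functools import reduce
from typing import Dict, Iterable, Set

# Placeholder hit lists: query name -> set of made-up record identifiers.
hits: Dict[str, Set[str]] = {
    "substructure": {"P1", "P2", "P3", "P5"},
    "text": {"P2", "P3", "P4"},
    "date_range": {"P3", "P4", "P5"},
}


def intersect_all(hit_sets: Iterable[Set[str]]) -> Set[str]:
    """Records returned by every query (logical AND)."""
    return reduce(set.intersection, hit_sets)


def union_all(hit_sets: Iterable[Set[str]]) -> Set[str]:
    """Records returned by at least one query (logical OR)."""
    return reduce(set.union, hit_sets)


in_all = intersect_all(hits.values())
in_any = union_all(hits.values())
structure_only = hits["substructure"] - hits["text"]  # structure hit, no text hit (AND NOT)

print(sorted(in_all))          # ['P3']
print(sorted(in_any))          # ['P1', 'P2', 'P3', 'P4', 'P5']
print(sorted(structure_only))  # ['P1', 'P5']
```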
![Page 26: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/26.jpg)
Improved R-group Hit Visualization
Integrated with Markush Viewer
![Page 27: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/27.jpg)
• Substructure hit visualization
– Scaffold only
– Scaffold + relevant R-groups
– Scaffold + all R-groups, with relevant R-groups colored
Query
Result in original Markush
Reduced result
Hit Expansion
Hit alignment
Hit colouring
Structure cleaning
![Page 28: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/28.jpg)
Batch Search of Multiple Queries
![Page 29: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/29.jpg)
Batch Search of Multiple Queries
![Page 30: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/30.jpg)
OTHER MARKUSH-RELATED ANALYSIS
![Page 31: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/31.jpg)
Markush Structure Features I
Atom lists, bond lists
Position variation bond
Link nodes and repeating units
R-groups
Multiple attachment points
Up to thousands of R-group definitions
Nested to any depth
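To make "nested to any depth" concrete, the sketch below counts how many specific structures a toy Markush definition covers when R-group alternatives may themselves reference other R-groups. The representation (SMILES-like strings with `{R1}`-style placeholders) and the function `library_size` are assumptions for illustration; they are not the MRV format or a ChemAxon API, and cyclic definitions are not handled.

```python
import re
from math import prod
from typing import Dict, List

# Toy definitions: R1 has three alternatives, one of which nests R2.
RGROUPS: Dict[str, List[str]] = {
    "R1": ["C", "CC", "c1ccc({R2})cc1"],
    "R2": ["F", "Cl", "Br"],
}


def library_size(fragment: str, rgroups: Dict[str, List[str]]) -> int:
    """Count the structures covered by a fragment with nested R-group placeholders.

    Each placeholder contributes the sum of the sizes of its alternatives, and
    independent placeholders in the same fragment multiply. A fragment without
    placeholders counts as one structure (the empty product is 1).
    """
    refs = re.findall(r"\{(R\d+)\}", fragment)
    return prod(
        sum(library_size(alt, rgroups) for alt in rgroups[name])
        for name in refs
    )


print(library_size("O=C(O){R1}", RGROUPS))  # 5: two plain alternatives + three via nested R2
print(library_size("{R1}N{R2}", RGROUPS))   # 15: 5 choices for R1 times 3 for R2
```

Counting this way avoids generating the structures, which matters when thousands of R-group definitions make full enumeration impractical.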
![Page 32: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/32.jpg)
Markush Structure Features II
Homology groups (“Superatoms”, “Generic definitions”)
(properties)
Easy-to-understand graphical representation
All supported in MRV file
![Page 33: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/33.jpg)
Markush Viewer
![Page 34: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/34.jpg)
Markush Enumeration
Functionality:
Full
Sequential
Random
Calculate library size
Scaffold alignment and coloring
Markush code
Homology group enumeration
Post filter
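The enumeration modes listed above can be sketched on a toy Markush structure. This is a minimal illustration under stated assumptions: the scaffold is a SMILES-like template with `{R1}`/`{R2}` placeholders, and `enumerate_full`, `enumerate_sequential`, `enumerate_random` and `post_filter` are hypothetical names, not the JChem enumeration API.

```python
import itertools
import random
from typing import Callable, Dict, Iterable, Iterator, List

SCAFFOLD = "{R1}c1ccc({R2})cc1"
RGROUPS: Dict[str, List[str]] = {"R1": ["C", "CC", "O"], "R2": ["F", "Cl"]}


def enumerate_full(scaffold: str, rgroups: Dict[str, List[str]]) -> Iterator[str]:
    """Yield every combination of R-group alternatives, in a fixed order."""
    names = sorted(rgroups)
    for combo in itertools.product(*(rgroups[n] for n in names)):
        yield scaffold.format(**dict(zip(names, combo)))


def enumerate_sequential(scaffold: str, rgroups: Dict[str, List[str]], n: int) -> List[str]:
    """Return only the first n structures of the full enumeration."""
    return list(itertools.islice(enumerate_full(scaffold, rgroups), n))


def enumerate_random(scaffold: str, rgroups: Dict[str, List[str]], n: int, seed: int = 0) -> List[str]:
    """Return n structures built from randomly chosen alternatives (repeats possible)."""
    rng = random.Random(seed)
    return [
        scaffold.format(**{name: rng.choice(alts) for name, alts in rgroups.items()})
        for _ in range(n)
    ]


def post_filter(structures: Iterable[str], keep: Callable[[str], bool]) -> List[str]:
    """Keep only the enumerated structures that satisfy the predicate."""
    return [s for s in structures if keep(s)]


full = list(enumerate_full(SCAFFOLD, RGROUPS))
print(len(full))                                    # 6 = 3 choices for R1 * 2 for R2
print(enumerate_sequential(SCAFFOLD, RGROUPS, 2))   # ['Cc1ccc(F)cc1', 'Cc1ccc(Cl)cc1']
print(post_filter(full, lambda s: "Cl" in s))       # the three chlorine-containing structures
```

String substitution only works here because the placeholders sit at simple attachment points; real enumeration operates on connection tables rather than text.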
![Page 35: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/35.jpg)
Re-designed Markush Enumeration Interface
• Markush reduction (hit expansion) according to query
• Query aligned and colored in enumerated structures
• Post-filtering and structure export
• Improved enumeration speed
![Page 36: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/36.jpg)
Other Features
• R-group decomposition in molecule tables
• Creation of Markush structure from selected rows.
![Page 37: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/37.jpg)
Future Work
• Improve search speed and accuracy
• Additional query variations
• Better visualization
• Integration with Document to Structure to extract chemical information from patent documents
• Collaboration with Linguamatics
![Page 38: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/38.jpg)
Hunting for Hidden Treasures
• A CINF Symposium regarding “chemical information in patents and other documents”
• ACS meeting in Philadelphia, August 19-23, 2012
• Current speakers from
– Content providers
– Software providers
– Pharmaceutical users
• One open slot on Markush structure analysis
![Page 39: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/39.jpg)
Acknowledgements
JChem Base, Markush and IJC
Helpers - JChem WS, Cartridge, Core & Marvin, Marketing
• Steve Hajkowski
• Brian Larner
• Don Walter
• Gez Cross
• Tony Ferns
• Tim Miller
![Page 40: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/40.jpg)
Backup Slides
![Page 41: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/41.jpg)
Markush User Community
• Markush user community
– IP experts / patent searchers / information scientists
– Patent lawyers / patent agents
– Medicinal chemists
– Computational chemists / cheminformaticians
• Goal
– Bring Markush search to a wider community
![Page 42: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/42.jpg)
Import from Thomson Reuters
![Page 43: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/43.jpg)
Additional Markush Query Features I
on Atoms
Query atom: any, metal, hetero ...
Atom topology (ring, chain)
Stereochemistry (E/Z, tetrahedral)
Aromatic, aliphatic atoms
Substitution count
Block substitution (s*)
H count
Explicit H full support
Ring bond count
Isolate ring on atoms (rb*)
![Page 44: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/44.jpg)
Additional Markush Query Features II
on Bonds
Bond topology (chain/ring)
and other ...
Equal homology translation
Broad translation switchable
Simple R-group queries
![Page 45: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/45.jpg)
Multiple Queries: Overlap Analysis
![Page 46: New Software Developments on Chemical Information](https://reader031.vdocument.in/reader031/viewer/2022012916/61c70e5a4fa9a474e55742d6/html5/thumbnails/46.jpg)
Multiple Queries: Overlap Analysis