![Page 1: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/1.jpg)
Digitizing documents to provide a public spectroscopy database
Antony Williams, Colin Batchelor, William Brouwer and Valery Tkachenko
ACS Indianapolis
![Page 2: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/2.jpg)
How can we digitize documents?
• As a publisher we would LOVE to bring data out of our historical archive
• What could we do?
  • Find chemical names and generate structures
  • Find chemical images and generate structures
  • Find reactions – and make a database!
  • Find data (MP, BP, LogP) and deposit
  • Find figures and database them
  • Find spectra (and link to structures)
![Page 3: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/3.jpg)
DERA
• Data enabling the RSC Archive
• Data extraction from the RSC Archive
• Difficult enhancements of the RSC Archive!!!
![Page 4: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/4.jpg)
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue .
![Page 5: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/5.jpg)
Text Mining
![Page 6: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/6.jpg)
Text-Mining
![Page 7: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/7.jpg)
How is DERA going?
• We are working on 21st-century articles first
• Mostly marked up with XML, more structured, easier to handle
• 8.2 GB of data, >100k articles from 2000-2013
• Markup will be published onto the HTML forms of the articles
• We will iterate based on dictionaries, markup, OSCAR extraction
![Page 8: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/8.jpg)
ChemSpider Reactions
![Page 9: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/9.jpg)
ChemSpider Reactions
![Page 10: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/10.jpg)
Structure Extraction from Images
• Structure extraction from images is old technology. It’s difficult!
• Commercial and Open Source tools
  • CLiDE
  • OSRA
  • Imago
  • Lots of others
![Page 11: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/11.jpg)
Detailed analysis and test sets
• Detailed analysis from GGA: http://ggasoftware.com/imago/report/report.html
![Page 12: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/12.jpg)
ESI – Text Spectra
![Page 13: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/13.jpg)
Lots of “Textual Spectra”
![Page 14: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/14.jpg)
Do we want to search text spectra?
What do we get when we search:
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
![Page 15: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/15.jpg)
1 Hit. Yay!
![Page 16: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/16.jpg)
Reality
• No one will ever perform a “spectral search” based on text searching!
• From sample to sample, changes in solvent, concentration and temperature will shift peak positions. The chance of getting exactly the same peak list twice is tiny.
• The real need is a “spectral database” where search algorithms deal with peak positions, intensities and, when appropriate, multiplicities
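To make the contrast with exact text matching concrete, here is a minimal sketch of a tolerance-based peak search. The function name `match_fraction` and the 0.5 ppm default tolerance are illustrative assumptions, not anything described in the talk; a real spectral database would also weigh intensities and multiplicities.

```python
from bisect import bisect_left


def match_fraction(query, reference, tol=0.5):
    """Return the fraction of query peaks (ppm) that have an unused
    reference peak within +/- tol ppm. Names and default tolerance are
    illustrative assumptions."""
    ref = sorted(reference)
    used = [False] * len(ref)
    hits = 0
    for q in sorted(query):
        # First reference peak that could lie inside the tolerance window.
        i = bisect_left(ref, q - tol)
        while i < len(ref) and ref[i] <= q + tol:
            if not used[i]:
                used[i] = True  # each reference peak may match only once
                hits += 1
                break
            i += 1
    return hits / len(query) if query else 0.0


# A re-measured sample rarely reproduces the published digits exactly,
# so the text "14.12" would not find "14.1", but the numbers still match.
published = [14.12, 30.11, 30.77, 66.12, 68.49]
remeasured = [14.1, 30.2, 66.0]
score = match_fraction(remeasured, published)
```

Ranking database entries by such a score tolerates the small shifts caused by solvent, concentration and temperature that defeat a literal string search.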
![Page 17: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/17.jpg)
Text and Image Spectra into “Real Spectra”?
• We can turn text into structures
• We can turn images into structures
• So is it possible to turn text into spectra?
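As a first step towards turning text into spectra, the reported peak list has to be parsed into numbers. The following is a minimal sketch, assuming the common "nucleus NMR (solvent, frequency MHz): δ = ..." layout used in the examples in this talk; the function name and the regular expressions are illustrative assumptions and this is not the parser used by any of the tools mentioned.

```python
import re

# Header such as "13C NMR (CDCl3, 100 MHz)"; the layout is an assumption.
HEADER = re.compile(r"(\d+[A-Z][a-z]?) NMR \(([^,]+), (\d+(?:\.\d+)?) MHz\)")


def parse_text_spectrum(text):
    """Parse a textual NMR peak list into a small dictionary."""
    header = HEADER.search(text)
    if header is None:
        raise ValueError("no NMR header found")
    nucleus, solvent, mhz = header.groups()
    body = text[header.end():]
    # Drop parenthesised assignments such as "(CH3)" so that digits inside
    # them are not mistaken for chemical shifts.
    body = re.sub(r"\([^()]*\)", "", body)
    # Only values written with a decimal point are picked up.
    shifts = [float(s) for s in re.findall(r"\d+\.\d+", body)]
    return {
        "nucleus": nucleus,
        "solvent": solvent,
        "frequency_mhz": float(mhz),
        "shifts_ppm": shifts,
    }


example = ("13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), "
           "30.11 (CH, benzylic methane), 66.12 (CH2), 117.72, 130.53 (ArCH)")
parsed = parse_text_spectrum(example)
```

This handles 13C-style lists reasonably well. 1H lists need more care: multiplet ranges such as 7.18–7.94 would be read as two separate shifts, and the multiplicities, integrals and couplings inside the parentheses are discarded here.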
![Page 18: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/18.jpg)
Mestrelab Mnova NMR Beta
![Page 19: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/19.jpg)
1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
![Page 20: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/20.jpg)
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
![Page 21: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/21.jpg)
Text Conversion Approaches
• Work in progress but early observations
• Converted spectra are NOT what would be seen in the data
• They are commonly GOOD approximations of C13 spectra (except intensity)
• They are average BUT useful approximations of H1 spectra – couplings are tough, dispersion of spectra, overlaps etc.
• We need to figure out workflows, structure associations, storage in ChemSpider
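One simple way to picture the conversion is to place a line of fixed shape at every reported shift. The sketch below sums unit-height Lorentzian lines on a regular ppm grid; the line shape, the 0.2 ppm width and the function name are assumptions chosen for illustration, not the method used by Mnova. Giving every line the same height mirrors the observation above that intensities are not reproduced.

```python
def lorentzian_spectrum(shifts_ppm, start=0.0, stop=200.0,
                        n_points=2001, width=0.2):
    """Sum unit-height Lorentzian lines (full width at half maximum
    `width`, in ppm) centred on each reported shift."""
    step = (stop - start) / (n_points - 1)
    xs = [start + i * step for i in range(n_points)]
    half = width / 2.0
    ys = [
        sum(half * half / ((x - s) ** 2 + half * half) for s in shifts_ppm)
        for x in xs
    ]
    return xs, ys


# Five of the 13C shifts quoted earlier in the talk.
xs, ys = lorentzian_spectrum([14.12, 30.11, 30.77, 66.12, 68.49])
```

For 1H data a faithful reconstruction would also have to expand each multiplet from its coupling constants, which is where the "couplings are tough" caveat applies.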
![Page 22: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/22.jpg)
It’s exactly the WRONG WAY!
• We should NOT be mining data out of future publications
• Structures should be submitted “correctly”
• Spectra should be in digital spectral formats, not images
• ESI should be RICH and interactive
![Page 23: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/23.jpg)
ESI – Text and Image Spectra
![Page 24: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/24.jpg)
ESI – Text and Image Spectra
![Page 25: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/25.jpg)
![Page 26: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/26.jpg)
Extracted JCAMP Spectrum
![Page 27: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/27.jpg)
![Page 28: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/28.jpg)
Turn “Figures” Into Data
![Page 29: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/29.jpg)
Plot2Txt (p2t)
Plot2txt.com (p2t) is a proprietary cloud-based service for fast, large-scale document content extraction
Figures in technical documents are recognized and converted into text, CSV and other formats, e.g. JCAMP, without human intervention.
Extracted data are suitable for storage, indexing and further reuse
![Page 30: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/30.jpg)
What’s the process?
Input: PDF document collection, split into pages; each page is handed to a p2t instance and processed
Output: spectra in JCAMP/CSV, molecules as BMP images
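The page-level fan-out described here can be sketched as follows. p2t itself is proprietary, so `extract_page` is a purely hypothetical stand-in that returns empty results, and the data layout (a dictionary of documents, each a list of page payloads) is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_page(page):
    """Hypothetical stand-in for one p2t instance. A real worker would
    return the spectra (JCAMP/CSV) and molecule images found on the page."""
    doc_id, page_no, _payload = page
    return {"doc": doc_id, "page": page_no, "spectra": [], "molecules": []}


def process_collection(documents, workers=4):
    """Split every document into pages and hand each page to a worker."""
    pages = [
        (doc_id, page_no, payload)
        for doc_id, doc_pages in documents.items()
        for page_no, payload in enumerate(doc_pages, start=1)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input pages.
        return list(pool.map(extract_page, pages))


results = process_collection({"a.pdf": [b"p1", b"p2"], "b.pdf": [b"p1"]})
```

Because pages are independent, the work scales out by adding workers. Threads keep the sketch simple; CPU-bound image analysis would more likely run in separate processes or on separate machines.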
![Page 31: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/31.jpg)
Test Experiments
Input: 74 supplementary data documents / 3444 pages
Output: p2t extracted content in 1069 page instances
• 578 molecules
  • ~10% false positives, e.g. classifies Bruker logo as chemical object
  • ~20% false negatives, e.g. missing some symbols from structure
• 1151 spectra
  • >80% of peaks extracted to within 1-2 decimal places (ppm)
![Page 32: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/32.jpg)
Performance
Plot2Txt output: processed on average 1.4 M pixels / second / CPU core (Intel i7, O3 optimization in compilation)
2 hours for 1069 pages, in serial
[Figure: throughput in M pixels / second (0 to 2.5) plotted against page number (0 to 1000)]
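The two figures quoted on this slide can be cross-checked with a little arithmetic. The sketch below only uses the numbers given above; the comparison with a 300 dpi letter-size page is an added assumption about how pages might have been rasterized, not something stated in the talk.

```python
pixels_per_second = 1.4e6      # average throughput per CPU core (from the slide)
elapsed_seconds = 2 * 3600     # 2 hours, serial run (from the slide)
pages = 1069                   # page instances processed (from the slide)

total_pixels = pixels_per_second * elapsed_seconds
pixels_per_page = total_pixels / pages
seconds_per_page = elapsed_seconds / pages

# For scale only: a US letter page at 300 dpi is 2550 x 3300 pixels.
# The rasterization resolution is an assumption, not given in the talk.
letter_300dpi_pixels = 2550 * 3300

print(f"{pixels_per_page / 1e6:.1f} M pixels per page, "
      f"{seconds_per_page:.1f} s per page")
```

The implied page size of roughly 9 to 10 M pixels is of the same order as a full page scanned at about 300 dpi, so the throughput and the total run time are mutually consistent.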
![Page 33: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/33.jpg)
Analysis Process
• Manual examination: viewing spectra one at a time and comparing the extracted JCAMP against the image (TIME!)
• Generally excellent results for high S/N – small/close peaks can be lost
• Spectrum is “representative enough” and way more useful than just images for indexing and searching
• Structure association MUST be checked but name-structure association can be used
![Page 34: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/34.jpg)
Prepare CONSISTENT JCAMP
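One way to keep extracted records consistent is to generate every file from the same template. The sketch below writes a JCAMP-DX style peak table; the selection of labelled data records, the unit intensities and the function name are assumptions for illustration and should be validated against the JCAMP-DX specification and the target viewer. It is not the format preparation actually used for ChemSpider.

```python
def to_jcamp_peak_table(title, nucleus, frequency_mhz, solvent, shifts_ppm,
                        origin="unknown", owner="unknown"):
    """Serialize a peak list as a JCAMP-DX style peak table.

    The chosen records and the dummy intensity of 1.00 per peak are
    illustrative assumptions."""
    peaks = sorted(shifts_ppm)
    lines = [
        f"##TITLE={title}",
        "##JCAMP-DX=5.01",
        "##DATA TYPE=NMR PEAK TABLE",
        "##DATA CLASS=PEAK TABLE",
        f"##ORIGIN={origin}",
        f"##OWNER={owner}",
        f"##.OBSERVE NUCLEUS=^{nucleus}",
        f"##.OBSERVE FREQUENCY={frequency_mhz}",
        f"##.SOLVENT NAME={solvent}",
        "##XUNITS=PPM",
        "##YUNITS=ARBITRARY UNITS",
        f"##NPOINTS={len(peaks)}",
        "##PEAK TABLE=(XY..XY)",
    ]
    # One "x, y" pair per line, always in ascending ppm order.
    lines += [f"{x:.2f}, 1.00" for x in peaks]
    lines.append("##END=")
    return "\n".join(lines) + "\n"


record = to_jcamp_peak_table("example 13C peak list", "13C", 100.0, "CDCl3",
                             [30.11, 14.12, 66.12])
```

Fixing the record order, the number formatting and the peak ordering in one place means every deposited file can be compared and indexed in the same way.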
![Page 35: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/35.jpg)
Data onto ChemSpider
![Page 36: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/36.jpg)
Summary
• Plot2txt does recognize and extract content
• Rapid and increasingly accurate process
• Fails in low-resolution cases; some fine structure in spectra is lost
• Structure recognition is NEW and needs some work in order to lower false negatives
![Page 37: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/37.jpg)
Future data checking opportunity
• How will we check data consistency?
• How do we know the structure and the spectra match? Comparing image to spectrum is NOT enough!!!
• Predict spectra, use spectral verification, use algorithmic checking.
• Flag “dodgy data” and use crowdsourcing for data checking – if 5% of 10,000 spectra online are in error, are they useful?
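A very simple example of algorithmic checking is to compare the number of 13C peaks with the number of carbon atoms in the associated structure. The sketch below is an illustrative assumption, not a check described in the talk: it handles only plain formulas without brackets, and the function names and flag strings are made up for the example.

```python
import re


def carbon_count(formula):
    """Count carbon atoms in a simple formula such as 'C2H5Cl'.
    Brackets, charges and isotopes are not handled in this sketch."""
    total = 0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol == "C":
            total += int(count) if count else 1
    return total


def flag_13c_peak_list(formula, shifts_ppm):
    """Flag a 13C peak list that reports more peaks than the structure
    has carbons; symmetry can reduce the peak count but not raise it."""
    if len(shifts_ppm) > carbon_count(formula):
        return "dodgy: more peaks than carbons"
    return "ok"


# Ethanol (C2H6O) with a third, unexplained peak is flagged for review.
status = flag_13c_peak_list("C2H6O", [18.0, 58.0, 77.0])
```

A flag is only a prompt for a human or crowdsourced review: solvent signals or splitting from coupling to nuclei such as fluorine can legitimately add lines, so this kind of check complements spectrum prediction and verification rather than replacing them.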
![Page 38: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/38.jpg)
Future Work
• We can EASILY find text spectra in articles but have work to do regarding:
  • Pipelining of work and structure association
  • Non-truncation from wordwrapping
• We can quite easily find spectra based on Figure Legends and have work to do regarding:
  • Pipelining of work and structure association
  • Validation of structure-spectrum association
  • Data curation
![Page 39: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/39.jpg)
Grand Target
• I want ALL 21st century spectra converted and in ChemSpider in one year
• I REALLY want scientists to get the value of real data over image data in terms of ESI
• I want authors to have data validation via our web services
• We will support IR, Raman, UV-Vis, 1D NMR and 2D…yet to come!
![Page 40: Digitizing documents to provide a public spectroscopy database](https://reader034.vdocument.in/reader034/viewer/2022052410/554e7e30b4c90545698b517f/html5/thumbnails/40.jpg)
Acknowledgments
• Bill Brouwer – Plot2Txt.com live in 2 weeks
• Carlos Cobas and Santi Dominguez
• Colin Batchelor and Peter Corbett – OSCAR, text mining, dictionaries, markup
• Valery Tkachenko, Alexey Pshenichnov and Richard Gay – ChemSpider Reactions
• Daniel Lowe – ChemSpider Reactions data
• ACD/Labs – Provider of spectroscopy tools