can machines understand the scientific literature
![Page 1: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/1.jpg)
Can machines “understand” the scientific literature?
Peter Murray-Rust, Reader Emeritus, Dept of Chemistry, Univ Cambridge
and Founder TheContentMine
Trinity College Science Society, Cambridge UK, 2017-02-21
contentmine.org is supported by a grant to PMR as a
![Page 2: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/2.jpg)
(2x digital music industry!)
ContentMine is an OpenLocked Non-Profit company
![Page 3: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/3.jpg)
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRYTEXT
MATH
contentmine.org tackles these
![Page 4: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/4.jpg)
AMI! Tell me what YOU know about moxonidine?
![Page 5: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/5.jpg)
Wikipedia
![Page 6: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/6.jpg)
Wikidata for moxonidine
![Page 7: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/7.jpg)
Wikidata for moxonidine
![Page 8: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/8.jpg)
Entity extraction
OPSIN says this name is wrong!
OSIRIS will interpret this structure, including the annotation
![Page 9: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/9.jpg)
Reaction Schemes
![Page 10: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/10.jpg)
Tables
![Page 11: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/11.jpg)
Tables
![Page 12: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/12.jpg)
http://chemicaltagger.ch.cam.ac.uk/
• Typical chemical synthesis
![Page 13: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/13.jpg)
Automatic semantic markup of chemistry
Could be used for analytical, crystallization, etc.
![Page 14: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/14.jpg)
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.
CLICK HERE FOR ANIMATION
(may be browser dependent)
![Page 15: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/15.jpg)
6 ContentMine Fellows for 6 months
![Page 16: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/16.jpg)
Neo Christopher Chung
Warsaw, Computational Biology
Wants to find out geographic and temporal differences in the use of genomic software tools
![Page 17: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/17.jpg)
Paola Masuzzo
Ghent, Computational Omics and Systems Biology
Wants to mine the literature on cell migration and invasion to 1) create a collection of minimum requirements, 2) check nomenclature consistency and 3) construct a knowledge map
![Page 18: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/18.jpg)
Alexandra Bannach-Brown
Edinburgh, Neuroscience
Problem: there is a huge body of work on animal studies of depression; systematic review is the main approach for gaining insight.
Wants: identify papers for a systematic review of depressive behaviour in animals: what drugs, what methods, what outcomes and signs/phenotypes. Use outcomes for document clustering.
and expedite scientific advances."
Corpus: 70,000 papers
![Page 19: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/19.jpg)
Alexandre Hannud Abdo
“Our goal is to mine facts from global health research and provide automated referenced summaries to practitioners and agents who don’t have the means or the time to navigate the literature.
From Brazil, Life Sciences; works on a project about the evolution of oncology
Wants: extract facts from cancer research conference papers and global health papers
OPEN NOTEBOOK RESEARCH
![Page 20: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/20.jpg)
Alexandre Hannud Abdo
“Our goal is to mine facts from global health research and provide automated referenced summaries to practitioners and agents who don’t have the means or the time to navigate the literature.
From Brazil, Life Sciences; works on a project about the evolution of oncology
“I am extremely happy to join this first cohort of ContentMine Fellows. I participated in a ContentMine workshop in 2014 and have been following the progress of the project ever since, looking for an opportunity to collaborate which now materializes.”
Problem: get text and metadata out of old conference proceedings and measure the evolution of ideas and practice using entity analysis, especially trends.
Wants: extract facts from cancer research conference papers and global health papers. Extract topics (innovations, developments) and compare the two types of publication. Find out which facts from conferences are later published in articles.
Has some issues with software
![Page 21: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/21.jpg)
Guanyang Zhang
Biology, Arizona
“My ContentMine Fellowship project will focus on mining weevil-plant associations from literature records.”
“Motivation. Comprising ~70,000 described and 220,000 estimated species, weevils (Curculionoidea) are one of the most diverse plant-feeding insect lineages and constitute nearly 5% of all known animals.”
“Knowledge of host plant associations is critical for pest management, conservation, and comparative biological research. This knowledge is, however, scattered in 300 years of historical literature and difficult to access.”
Weevil-plant association network graph made with Google Fusion Table. Each blue circle is a weevil tribe and each yellow circle a plant genus. The size of a circle represents the number of associations.
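The circle sizes in such a graph can be derived by counting associations per node. The following is only a sketch: the records use placeholder names rather than data from the fellowship project, `node_sizes` is a hypothetical helper, and counting each distinct tribe-genus pair once is an assumption.

```python
from collections import Counter

# Hypothetical records with placeholder names; not data from the project.
associations = [
    ("Tribe A", "Genus X"),
    ("Tribe A", "Genus Y"),
    ("Tribe B", "Genus X"),
    ("Tribe A", "Genus X"),  # repeated record, counted once (assumption)
]


def node_sizes(pairs):
    """Count distinct associations per weevil tribe and per plant genus."""
    sizes = Counter()
    for tribe, genus in set(pairs):
        sizes[("weevil", tribe)] += 1
        sizes[("plant", genus)] += 1
    return sizes


sizes = node_sizes(associations)
for (kind, name), count in sorted(sizes.items()):
    print(kind, name, count)
```

Each count would then be mapped to a circle radius by whatever graphing tool is used.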
![Page 22: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/22.jpg)
Lars Willighagen
15 years old, NL
Wants: extract data about conifers (relations to chemicals, height etc.)
Outcome: database with webpage containing conifer properties
Table Facts Visualiser DEMO Card DEMO Word Cloud
“I applied to this fellowship to learn new things and combine the ContentMine with two previous projects I never got to finish, and I got really excited by the idea and the ContentMine at large.”
![Page 23: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/23.jpg)
Multisegment diagram
![Page 24: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/24.jpg)
Multisegment diagram
Whitespace “corridors”
Superpixel
Bounding box
Semantic labels
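One simple way to find whitespace corridors and bounding boxes is a projection profile over a binarized image. The sketch below assumes a binary image stored as nested lists (1 = ink, 0 = background) and looks only for vertical corridors. The function names are hypothetical, and this is not necessarily the method AMI uses.

```python
def empty_columns(binary):
    """Indices of columns that contain no ink (all zeros)."""
    width = len(binary[0])
    return [c for c in range(width) if all(row[c] == 0 for row in binary)]


def column_segments(binary):
    """Runs of consecutive inked columns, separated by whitespace corridors."""
    blank = set(empty_columns(binary))
    width = len(binary[0])
    segments, start = [], None
    for c in range(width):
        if c not in blank and start is None:
            start = c                      # a new segment begins
        elif c in blank and start is not None:
            segments.append((start, c - 1))  # corridor reached, close segment
            start = None
    if start is not None:
        segments.append((start, width - 1))
    return segments


def bounding_box(binary, col_range):
    """(top, left, bottom, right) of the ink inside an inclusive column range."""
    left, right = col_range
    rows = [r for r, row in enumerate(binary) if any(row[left:right + 1])]
    return (rows[0], left, rows[-1], right)


# Hypothetical 3x6 diagram with a two-column corridor in the middle.
page = [
    [1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
]
segments = column_segments(page)
boxes = [bounding_box(page, seg) for seg in segments]
print(segments, boxes)
```

Applying the same idea to rows, and recursing into each box, would split a multisegment diagram into its panels.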
![Page 25: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/25.jpg)
Chemical Computer Vision
Raw mobile photo; problems: shadows, contrast, noise, skew, clipping
![Page 26: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/26.jpg)
Binarization (pixels = 0,1)
Irregular edges
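Binarization can be sketched as a fixed-threshold mapping of grayscale values to 0 or 1. The threshold of 128 and the sample patch are assumptions for illustration; the slides do not say which thresholding method is used.

```python
def binarize(gray, threshold=128):
    """Map each grayscale value (0-255) to 1 (dark ink) or 0 (light background)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# Hypothetical 3x4 grayscale patch: a dark stroke crossing a light page.
patch = [
    [250, 240,  90,  30],
    [245, 120,  40, 235],
    [110,  35, 230, 250],
]
binary = binarize(patch)
for row in binary:
    print(row)
```

Pixels whose grey value lies close to the threshold (120 and 110 here) can fall on either side of it, which is one source of irregular edges.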
![Page 27: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/27.jpg)
Posterisation
Extracted since unique posterized colour
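Posterisation reduces each colour channel to a few levels, so that an object drawn in one colour can be picked out by exact match. A minimal sketch, assuming RGB tuples in nested lists and four levels per channel (both are illustrative choices, not taken from the talk):

```python
def posterize(pixel, levels=4):
    """Snap each 0-255 channel down to the lower bound of its band."""
    step = 256 // levels
    return tuple((channel // step) * step for channel in pixel)


def extract_colour(image, target, levels=4):
    """Return (row, col) positions whose posterized colour equals target."""
    return [
        (r, c)
        for r, row in enumerate(image)
        for c, pixel in enumerate(row)
        if posterize(pixel, levels) == target
    ]


# Hypothetical 2x2 image: near-white, two slightly different reds, one blue.
image = [
    [(250, 250, 250), (200, 30, 25)],
    [(210, 20, 40), (10, 10, 240)],
]
red_positions = extract_colour(image, target=(192, 0, 0))
print(red_positions)
```

Both reds collapse to the same posterized colour, so they are extracted together even though their raw values differ.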
![Page 28: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/28.jpg)
Note Jaggy and broken pixels
NEW Bacteria must have a phylogenetic tree
Length, Weight, Binomial Name, Culture/Strain, GENBANK ID, Evolution Rate
![Page 29: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/29.jpg)
Supertree for 924 species
Tree
![Page 30: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/30.jpg)
UNITS
TICKS
QUANTITY SCALE
TITLES
DATA!! 2000+ points
VECTOR PDF
![Page 31: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/31.jpg)
Dumb PDF
CSV
Semantic Spectrum
2nd Derivative
Smoothing Gaussian Filter
Automatic extraction
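The two processing steps named on this slide can be sketched on a list of intensities, such as a column read from the extracted CSV. The slide does not say how they are combined; this sketch assumes the peak is taken where the second derivative of the smoothed signal is most negative. The function names, kernel radius and sample values are illustrative.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma=1.0, radius=3):
    """Convolve with a Gaussian kernel, clamping indices at both ends."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def second_derivative(signal):
    """Central finite difference; the two end points are left at 0.0."""
    d2 = [0.0] * len(signal)
    for i in range(1, len(signal) - 1):
        d2[i] = signal[i - 1] - 2 * signal[i] + signal[i + 1]
    return d2


def peak_index(signal, sigma=1.0):
    """Index where the smoothed signal curves downward most sharply."""
    d2 = second_derivative(smooth(signal, sigma))
    return min(range(len(d2)), key=d2.__getitem__)


# Hypothetical noisy spectrum with one main peak at index 5.
spectrum = [0, 0, 1, 0, 2, 9, 3, 0, 1, 0, 0]
print(peak_index(spectrum))
```

Smoothing first matters because finite differences amplify noise; without it the small bumps at indices 2 and 8 would compete with the real peak at a finer scale.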
![Page 32: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/32.jpg)
C) What’s the problem with this spectrum?
Org. Lett., 2011, 13 (15), pp 4084–4087
Original thanks to ChemBark
![Page 33: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/33.jpg)
After AMI2 processing…
… AMI2 has detected a square
![Page 34: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/34.jpg)
![Page 35: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/35.jpg)
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.
CLICK HERE FOR ANIMATION
(may be browser dependent)
![Page 36: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/36.jpg)
Search on publicly accessible papers on “Zika”
https://rawgit.com/ContentMine/amidemos/master/zika/full.dataTables.html
![Page 37: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/37.jpg)
![Page 38: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/38.jpg)
![Page 39: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/39.jpg)
“… simulated by 21cmFAST is in principle independent”
“it is a feature of the 21cmFAST code, and is explained in §3.1.”
SciCodes[1]: Searching for software in arXiv[2]
[1] Proposal to LJ Arnold Foundation (Alice Allen ASCL and PMR)
“Using the semi-numerical simulation, 21cmFAST, …”
[2] arxiv.org: the physics/maths/astronomy… preprint server
The language identifies the software!
arXiv has >500 mentions of “21cmFAST”
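The idea that the surrounding language identifies the software can be illustrated with a few cue patterns modelled on the three phrases quoted on this slide. The patterns and the `software_mentions` helper are hypothetical, written for this sketch; they are not the SciCodes rules.

```python
import re

# A candidate name: starts with a letter or digit, no spaces.
NAME = r"(?P<name>[0-9A-Za-z][\w.+-]*)"

# Hypothetical cue patterns, one per quoted phrase.
CUES = [
    re.compile(r"simulated by " + NAME),
    re.compile(r"feature of the " + NAME + r" code"),
    re.compile(r"simulation, " + NAME + r","),
]


def software_mentions(text):
    """Return candidate software names found next to a cue phrase."""
    found = []
    for cue in CUES:
        for match in cue.finditer(text):
            found.append(match.group("name"))
    return found


snippets = [
    "… simulated by 21cmFAST is in principle independent",
    "it is a feature of the 21cmFAST code, and is explained in §3.1.",
    "Using the semi-numerical simulation, 21cmFAST,",
]
names = [name for snippet in snippets for name in software_mentions(snippet)]
print(names)
```

A real system would need many more cues and a way to filter false positives, since phrases like "simulated by" can also be followed by ordinary words.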
![Page 40: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/40.jpg)
Questions and comments
Thanks:
• Andy Howlett, Dept Chemistry, Cambridge
• Mark Williamson, Dept Chemistry, Cambridge
• Ross Mounce, Biology, University of Bath
• Shuttleworth Foundation
PM-R has offered to mentor an MSc project this summer for anyone interested.
contentmine.org
![Page 41: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/41.jpg)
![Page 42: Can machines understand the scientific literature](https://reader034.vdocument.in/reader034/viewer/2022052606/58ea9bff1a28abe5728b5013/html5/thumbnails/42.jpg)