Post on 11-Feb-2017
Overview of Practical Content Mining
Peter Murray-Rust
JISC, London, 2014-12-01
What is Content Mining
• Mining Text, Tables and Lists, Diagrams, Images
• Born-digital documents
• High-throughput (millions of items/year)
• Formal and Informal Collaboration
• Role of UK
• Hands-on
• Everything is OPEN (OSI, CC-BY, CC0)
ContentMine
• 1-2 year Shuttleworth Funding from 2014-03
• Free to everyone, Open Source, updated daily
• Structured Text, and Image/Diagram Mining
• Workshops for training and training trainers
• Bottom-up community development
  – Bioscience (EuropePMC, BBSRC)
  – Disease Ebola
  – Astrophysics (Stray Toaster)
  – Chemistry (TSB, EBI, PennState - Citeseer)
• We fight for Justice and Freedom
ContentMine People
• Jenny Molloy
• Ross Mounce
• Peter Murray-Rust + volunteers (Bioscience, disease)
• Richard Smith-Unna + 20 quickscrape volunteers
• Steph Unna
• Cottage Labs (Mark MacGillivray, Emanuil Tolev, Richard Jones)
• Prof Charles Oppenheim
• Karien Bezuidenhout (Shuttleworth)
• Advisory Board RSN
ContentMine Workshops (1-hour -> full day or more)
2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• Sci DataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Wash DC, US
Upcoming
• JISC
• LIBER
• BL
• Wellcome Trust
• WHO
Ebola Collaborators (Atlanta)
Roxanne Further Moore, Jessie Gunter, April Clyburne-Sherin
Regular Expressions (Easier than Crosswords or Sudoku)
• Ebola: Ebola
• Mali (not Malicious): Mali\W (end of word)
• Bat or bat: [Bb]at (alternatives)
• bat or bats: bats? (optional letter)
• Bat or Bats or bat or bats: [Bb]ats?
• Sudden onset: [Ss]udden\s+onset (space/s)
• Panthera leo or Gorilla gorilla: [A-Z][a-z]+\s+[a-z]+ (ranges of letters)
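The patterns above can be tried directly with Python's standard `re` module. This is a minimal sketch: the patterns are copied from the slide, while the dictionary keys, the helper name `matches` and the sample sentences are invented for illustration.

```python
import re

# Patterns copied from the slide; the keys are hypothetical labels.
patterns = {
    "mali": r"Mali\W",                   # "Mali" followed by a non-word character
    "bat": r"[Bb]ats?",                  # Bat, bat, Bats or bats
    "sudden_onset": r"[Ss]udden\s+onset",
    "binomial": r"[A-Z][a-z]+\s+[a-z]+",  # Capitalised word, then a lowercase word
}

def matches(name, text):
    """Return every substring of text matched by the named pattern."""
    return re.findall(patterns[name], text)

# Made-up sample sentences.
print(matches("mali", "Cases in Mali, not Malicious code"))
print(matches("bat", "Bats and a bat"))
print(matches("sudden_onset", "Sudden  onset of fever"))
print(matches("binomial", "Panthera leo and Gorilla gorilla"))
```

Two caveats are visible when experimenting: `[Bb]ats?` on its own also matches inside longer words such as "combat", and the binomial pattern matches any capitalised word followed by a lowercase one, so it over-matches ordinary prose.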
Ebola regex
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola" case="">(Ebola)</regex>
  <regex weight="1.0" fields="marburg">(Marburg)</regex>
  <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagic\s+fever)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omiting\s+diarrho?ea)</regex>
  <regex weight="0.5" fields="guinea">(Guinea)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="liberia">(Liberia)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
  <regex weight="0.6" fields="contact_tracing">([Cc]ontact\s+tracing)</regex>
  <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex>
  <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex>
  <regex weight="0.5" fields="drc">(Democratic Republic\s*(\s*of)?(\s*the)?\s*Congo)(DRC)</regex>
  <regex weight="0.6" fields="safe_burial">([Ss]afe\s+burial\s+practice?s)</regex>
  <regex weight="1.0" fields="etu">([Ee]bola\s+treatment\s+units?)(ETU)</regex>
</compoundRegex>
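To get a feel for how a weighted set of regexes like the one above might be applied, here is a small Python sketch. It is not the AMI implementation: the slides do not say how the weights are combined, so summing the weights of the matching fields is an assumption, and the names `EBOLA_REGEXES` and `score_text` are hypothetical. Only a subset of the entries is copied.

```python
import re

# (weight, field, pattern) triples copied from a subset of the compoundRegex entries.
EBOLA_REGEXES = [
    (1.0, "ebola", r"(Ebola)"),
    (1.0, "marburg", r"(Marburg)"),
    (1.0, "hemorrhagic_fever", r"([Hh]a?emorrhagic\s+fever)"),
    (0.8, "sudden_onset", r"([Ss]udden\s+onset)"),
    (0.5, "sierra_leone", r"(Sierra\s+Leone)"),
    (0.5, "mali", r"(Mali)\W"),
    (0.6, "contact_tracing", r"([Cc]ontact\s+tracing)"),
]

def score_text(text):
    """Return (total, hits) for one piece of text.

    hits maps each matching field to the first string captured by its pattern.
    total is the sum of the weights of the matching fields; this scoring rule
    is an assumption made for the sketch, not something stated in the slides.
    """
    total = 0.0
    hits = {}
    for weight, field, pattern in EBOLA_REGEXES:
        m = re.search(pattern, text)
        if m:
            hits[field] = m.group(1)
            total += weight
    return total, hits

# Made-up sample sentence.
total, hits = score_text(
    "Ebola cases in Sierra Leone after sudden onset of haemorrhagic fever."
)
print(round(total, 2), hits)
```

A document that mentions several fields scores higher than one that only mentions a country name, which is one plausible reason for giving place names lower weights than disease terms.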
15 mins to create, 15 mins to install and test
Or run online at CottageLabs
Results of Regex on Ebola
<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <source xmlns="http://www.xml-cml.org/ami"
        name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" />
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
          lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <regex xmlns="" weight="1.0" fields="[ebola]">
          <pattern>(Ebola)</pattern>
        </regex>
        <hits xmlns="">
          <hit ebola="Ebola" />
        </hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
          lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]">
          <pattern>(Sierra\s+Leone)</pattern>
        </regex>
        <hits xmlns="">
          <hit sierra_leone="Sierra Leone" />
        </hits>
      </regex>
    </result>
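The results listing pairs each hit with the line it came from. A line-by-line scan of that kind can be sketched in Python as follows. The dictionary keys echo the attribute names in the listing, but the function name `scan_lines`, the 1-based line numbering and the two-line sample are assumptions for illustration, not AMI's actual behaviour.

```python
import re

# Two patterns taken from the results listing.
PATTERNS = {
    "ebola": r"(Ebola)",
    "sierra_leone": r"(Sierra\s+Leone)",
}

def scan_lines(lines):
    """Yield one dict per hit, recording the line number, line text, field and hit."""
    for number, line in enumerate(lines, start=1):  # 1-based numbering is an assumption
        for field, pattern in PATTERNS.items():
            for m in re.finditer(pattern, line):
                yield {
                    "lineNumber": number,
                    "lineValue": line,
                    "field": field,
                    "hit": m.group(1),
                }

# The two lineValue strings quoted in the listing, used here as a tiny input file.
sample = [
    " There have been 14 413 reported Ebola cases in eight countries since the outbreak ",
    "HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ",
]
results = list(scan_lines(sample))
for r in results:
    print(r["lineNumber"], r["field"], r["hit"])
```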
Demo of Content Mining
ChemicalTagger (Lezan Hawizy): a shallow, domain-specific, semantic parser for un/natural language.
Bacterial WP_phylogenetic tree
Our machines have read and interpreted 4300 in an hour with > 95% accuracy
Trees from http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
WP: Clostridium_butyricum
Genbank ID
American Type Culture Collection
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
[Diagram labels: Queues, Repos, Scientific literature, Science Plugins, Science Volunteers]
Collaboration with Open Access Button
AMI (extraction) architecture
[Diagram labels: PDF2SVG, Image analysis, SVG2XML, AMI; Regex, Species, Phylo, Chem; tables, sections, captioned diagrams]
Immediate Stakeholders
– Researchers (bio, EBI, chem, materials, astro)
– Funders: WT, FWF (Austria), RCUK
– Libraries (repositories, theses)
– Service providers (EuropePMC)
– Knowledge-based SMEs
– Library organisations (JISC, RLUK, LIBER, SPARC)
– Non-profits (Wikimedia, WHO, Mozilla)
Content production
• Scholarly articles
• Theses
• Repositories
• Grey scientific literature
• Grey politico-socio-legal literature
• Company output (reports, accounts, contracts) (e.g. OpenOil)
STM Publishers Licence
2012_03_15_Sample_Licence_Text_Data_Mining.pdf (Summary: PMR has NO rights)
• [cannot publish to: ] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or website”
• “Subscriber shall pay a […] fee”
Heather Piwowar: “negotiating with publishers [made me physically ill]”
WE WALKED OUT
• Brit Library
• JISC
• RLUK
• OKFN
• …
• Ross Mounce
• PM-R
Licences destroy Content Mining
Challenges
• Active opposition from content “owners” including serious lobbying and FUD
• Ignorance and apathy from universities; inappropriate reward system
• Sub-optimal technology of publishers
• Lack of common infrastructure, technology, APIs
• And it’s objectively messy anyway
Technical problems
• PDF: lacks words, tables, diagrams
• Non-Unicode character sets (or worse)
• Graphics objects largely destroyed (converted to PNG or worse)
• No communal ontology for document structure
• HTML carries PublisherJunk and Javascript
Goals of Mining
• Classification of resources
• Entity extraction and indexing
• Aggregation within discipline
• Inter-disciplinary, e.g. biodiversity, phytochemistry
• Repurposing (twitter, ePub, annotation)
• Semantification/intelligent documents
• Detection of error and fraud
What we need
• Inter/national commitment to infrastructure
• Common ontologies and APIs
• Development of community
• Go beyond academia; non-academic reward system