![Page 1: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/1.jpg)
Scalable Text Mining
Jee-Hyub Kim
Text-Mining Pipeline Builder
Literature Services Team
2 Feb 2016
![Page 2: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/2.jpg)
A Text-Mining Pipeline
Text
![Page 3: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/3.jpg)
Contents
● Text-Mining Pipeline Crisis
● Session 1: Build Your Own Pipeline
● Session 2: Build Your Own Dictionary
● Wrap Up
![Page 4: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/4.jpg)
| Use case | Semantic type | Dictionary type | Document type | Section | Metadata | Delivery method |
|---|---|---|---|---|---|---|
| OpenAIRE | accession numbers | pattern (e.g., [0-9][A-Za-z0-9]{3}) | patents | Title, Claim, Description, Abstract, Figure, Table | Pubyear, IPCR | summary table |
| ERC | grant identifiers | pattern | articles | Acknowledgements | | search index |
| CTTV | gene, disease | term (e.g., IBD) | articles, abstracts | | | json |
| ELIXIR-EXCELERATE | resource names | term | articles | | | summary table |
| 1000 Genomes | cell line names | pattern | articles | !Acknowledgements | | REST API |
| Wikipedia | accession numbers | pattern | wikipages | | | summary table |
| KEW Garden | species names (multilingual) | term | articles | | | summary table |
| ChEMBL | resource name | term | articles | | Author, Journal | summary table |
| Ensembl | genomic range | pattern | articles | | | summary table |

A long list of requests
![Page 5: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/5.jpg)
Scalable Text Mining
● For the last few years, we've been having a pipeline crisis!
● A long list of requests and our slow responses
  ○ Makes you unhappy.
● Even worse, it's a long tail!
  ○ Never the same pipeline used for each request.
  ○ Every time, we have to build a new pipeline.
  ○ We need a new approach to solve this crisis.
![Page 6: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/6.jpg)
Objective
● We want to build a LEGO-like platform that helps you to build your own text-mining pipeline and your own dictionary.
![Page 7: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/7.jpg)
A Key Block: Dictionary-Based Tagger
● Role: To identify names (e.g., proteins, species, accession numbers, etc.)
● Dictionary-based approach for mining names.
  ○ Simple
  ○ Readable
  ○ Interactive
● Building a dictionary is a VERY iterative process
  ○ 20% for building an initial dictionary and the rest for refining it.
● Good dictionaries are key to text-mining success stories.
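
A minimal sketch of what a dictionary-based tagger block could look like, assuming a dictionary that holds both literal terms and regular-expression patterns. The names `TERMS`, `PATTERNS`, `Annotation` and `tag` are hypothetical; the example entries (the term IBD and the pattern HG[0-9]{5}) are taken from the request list and the per-entry rules in these slides.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int
    end: int
    text: str
    semantic_type: str


# Hypothetical dictionary: literal terms and regex patterns,
# each mapped to a semantic type.
TERMS = {"IBD": "disease"}
PATTERNS = {r"HG[0-9]{5}": "cell line"}


def tag(text: str) -> list[Annotation]:
    """Return every dictionary hit in `text`, sorted by position."""
    hits = []
    for term, sem_type in TERMS.items():
        # Word boundaries stop "IBD" from matching inside a longer token.
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            hits.append(Annotation(m.start(), m.end(), m.group(), sem_type))
    for pattern, sem_type in PATTERNS.items():
        for m in re.finditer(r"\b(?:" + pattern + r")\b", text):
            hits.append(Annotation(m.start(), m.end(), m.group(), sem_type))
    return sorted(hits)


annotations = tag("Samples HG00096 were from IBD patients.")
for a in annotations:
    print(a.text, a.semantic_type)
```

Because the dictionary is plain data, refining it means editing entries and re-running the tagger, which fits the iterative process described above.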
![Page 8: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/8.jpg)
Agile Revision Process
![Page 9: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/9.jpg)
Session 1
Build Your Own Pipeline
As …, I want a pipeline to do ...
![Page 10: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/10.jpg)
Pipeline Stories
● CTTV
  ○ As a researcher, I want to find articles with supporting evidence from drug discovery.
● ERC
  ○ As a funder, I want funded articles to be more searchable.
● ELIXIR-EXCELERATE
  ○ As a resource manager, I want to know the impact of resources.
![Page 11: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/11.jpg)
Second, Find & Describe Blocks You Need
| When you want | You can use |
|---|---|
| to extract a sentence | Sentence splitter |
| to limit your mining to an article section | Section tagger |
| to identify disease names | Dictionary-based tagger |
| to identify database identifiers | Dictionary-based tagger |
| to find relations between genes and diseases | Relation extractor |
| to get some analytics | Summary table generator |
| to get article metadata | Europe PMC REST API |
| to produce text-mined data in RDF | RDF generator |
![Page 12: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/12.jpg)
Then, Build a Pipeline using Blocks
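
One way to picture building a pipeline from blocks is plain function composition. The sketch below is an illustration only, not the platform's actual interface: `sentence_splitter`, `make_tagger` and `build_pipeline` are hypothetical names, and a document is assumed to be a simple dict that each block enriches.

```python
import re
from functools import reduce
from typing import Callable

# Assumption: a document is a plain dict; each block takes one
# and returns an enriched copy.
Doc = dict
Block = Callable[[Doc], Doc]


def sentence_splitter(doc: Doc) -> Doc:
    # Naive split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", doc["text"].strip())
    return {**doc, "sentences": sentences}


def make_tagger(terms: set[str]) -> Block:
    def tagger(doc: Doc) -> Doc:
        # Record (sentence index, term) for every term found in a sentence.
        hits = [(i, t) for i, s in enumerate(doc["sentences"])
                for t in sorted(terms) if t in s]
        return {**doc, "annotations": hits}
    return tagger


def build_pipeline(*blocks: Block) -> Block:
    # Chain blocks left to right, like snapping bricks together.
    return lambda doc: reduce(lambda d, block: block(d), blocks, doc)


pipeline = build_pipeline(sentence_splitter, make_tagger({"IBD", "NOD2"}))
result = pipeline({"text": "We studied NOD2 in IBD patients. Water was a control."})
print(result["annotations"])
```

Swapping or adding a block (for example a section tagger before the dictionary tagger) only changes the argument list of `build_pipeline`.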
![Page 13: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/13.jpg)
Session 2
Build Your Own Dictionary
Designing filtering rules
![Page 14: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/14.jpg)
How to Revise a Dictionary?
● We want to build an expressive language for filtering.
● Global filtering rule
  ○ Term length > 2
  ○ Case sensitive
● Per-entry filtering rule
  ○ A term should be tagged when it is mentioned in the Methods section.
  ○ A pattern should be tagged when it follows the term “omim”
● Blacklist: e.g., stop words
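
The global rules and the blacklist can be sketched as two small functions. This is an assumption about how such rules might be applied, not the platform's filtering language: `STOP_WORDS`, `keep_entry`, `filter_dictionary` and `matches` are hypothetical names, and the blacklist lookup is assumed to ignore case.

```python
import re

STOP_WORDS = {"the", "and", "was"}  # hypothetical blacklist


def keep_entry(term: str, blacklist: set[str] = STOP_WORDS, min_length: int = 3) -> bool:
    """Global rules: term length > 2 and not on the blacklist."""
    if len(term) < min_length:
        return False
    # Assumption: blacklist lookup ignores case, so "The" is dropped too.
    return term.lower() not in blacklist


def filter_dictionary(terms: list[str]) -> list[str]:
    return [t for t in terms if keep_entry(t)]


def matches(term: str, text: str, case_sensitive: bool = True) -> bool:
    """Whole-word match; case sensitivity is a global switch."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(r"\b" + re.escape(term) + r"\b", text, flags) is not None


print(filter_dictionary(["IBD", "of", "The", "water"]))
print(matches("CAT", "the cat sat"), matches("CAT", "the cat sat", case_sensitive=False))
```

Case-sensitive matching matters for short acronyms: with it switched on, the entry CAT does not fire on the ordinary word "cat".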
![Page 15: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/15.jpg)
Per-Entry Rules
● A spreadsheet per entry
● Definitions
  ○ Context: should (not) be after a term.
  ○ Section: should (not) be mentioned in a section.
  ○ URI: check if http://www.ebi.ac.uk/efo/EFO_0001997 exists

Entry information: Term/Pattern, Entry ID, DB. Filtering rules: Context, Section, URI.

| Term/Pattern | Entry ID | DB | Context | Section | URI |
|---|---|---|---|---|---|
| Pattern: HG[0-9]{5} | | 1000 genomes | !(grant\|fund) | !ACK | |
| Term: basal cell | EFO_0001997 | efo | | Methods | Yes |
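
A possible reading of the Context and Section rules, sketched in code. Everything here is an assumption made for illustration: `EntryRule`, `negated` and `allowed` are hypothetical names, a leading `!` is read as negation, and "after a term" is read as "the word immediately before the match starts with that term". The URI check is left out because it needs a network call.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntryRule:
    pattern: str                    # regex for the entry itself
    context: Optional[str] = None   # regex for the word just before the match
    section: Optional[str] = None   # section name; a leading "!" negates the rule


def negated(rule: str) -> tuple[bool, str]:
    """Split an optional leading "!" from a rule string."""
    return (True, rule[1:]) if rule.startswith("!") else (False, rule)


def allowed(rule: EntryRule, text: str, section: str) -> list[str]:
    """Return matches of rule.pattern that satisfy the section and context rules."""
    if rule.section:
        neg, name = negated(rule.section)
        if (section == name) == neg:
            return []  # wrong section, or a section that is explicitly excluded
    kept = []
    for m in re.finditer(rule.pattern, text):
        if rule.context:
            neg, ctx = negated(rule.context)
            words = text[:m.start()].split()
            prev = words[-1] if words else ""
            found = re.match(ctx, prev, re.IGNORECASE) is not None
            if found == neg:
                continue  # required context missing, or forbidden context present
        kept.append(m.group())
    return kept


rule = EntryRule(r"HG[0-9]{5}", context="!(grant|fund)", section="!ACK")
print(allowed(rule, "Cell line HG00096 was used.", "Methods"))
print(allowed(rule, "Supported by grant HG00096.", "Methods"))
```

With this rule, an identifier that looks like a cell line name is dropped when it follows "grant" or appears in the acknowledgements, where it is more likely a grant number.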
![Page 16: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/16.jpg)
Analytics
● Summary table

| PMCID | Term | ID | Frequency |
|---|---|---|---|
| PMCID4698870 | Nutlin-3 | ChEBI:46742 | 16 |
| PMCID4698870 | cell cycle arrests | GO:0007050 | 6 |

● Top 100 frequent terms

| Top | Name | Document Freq. | Collection Freq. |
|---|---|---|---|
| 1 | protein | 678,987 | 1,823,783 |
| 2 | water | 563,234 | 1,233,332 |
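
The two analytics views can be derived from the same tagger output. The sketch below assumes annotations arrive as a mapping from document ID to the list of tagged terms, and that the top list is ranked by document frequency; the function names, the ranking choice and the sample IDs are hypothetical. Document frequency counts each document once per term, while collection frequency counts every mention.

```python
from collections import Counter


def summary_table(annotations: dict[str, list[str]]) -> list[tuple[str, str, int]]:
    """One row per (document, term) with the term's frequency in that document."""
    rows = []
    for doc_id, terms in annotations.items():
        for term, freq in Counter(terms).most_common():
            rows.append((doc_id, term, freq))
    return rows


def top_terms(annotations: dict[str, list[str]], n: int = 100) -> list[tuple[str, int, int]]:
    """Rank terms by document frequency, then collection frequency."""
    doc_freq, coll_freq = Counter(), Counter()
    for terms in annotations.values():
        coll_freq.update(terms)       # every mention counts
        doc_freq.update(set(terms))   # each document counts once per term
    ranked = sorted(coll_freq, key=lambda t: (-doc_freq[t], -coll_freq[t], t))
    return [(t, doc_freq[t], coll_freq[t]) for t in ranked[:n]]


# Hypothetical tagger output for two documents.
demo = {"DOC1": ["protein", "protein", "water"], "DOC2": ["protein"]}
print(summary_table(demo))
print(top_terms(demo))
```

A term with a high collection frequency but a low document frequency is concentrated in a few documents, which is a useful signal when reviewing dictionary entries.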
![Page 17: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/17.jpg)
Spreadsheet for Filtering Rules
http://tinyurl.com/zlwbx2y
![Page 18: Scalable Text Mining](https://reader031.vdocument.in/reader031/viewer/2022030311/58eee2391a28ab156b8b45bb/html5/thumbnails/18.jpg)
Wrap Up
● What is your pipeline story?
● Have you managed to create your own dictionary?
● What service blocks are missing?
● What should be the interfaces?
● How should we deliver?