scalable text mining

Scalable Text Mining Jee-Hyub Kim Text-Mining Pipeline Builder Literature Services Team 2 Feb 2016

Upload: jee-hyub-kim

Post on 13-Apr-2017


TRANSCRIPT

Page 1: Scalable Text Mining

Scalable Text Mining

Jee-Hyub Kim
Text-Mining Pipeline Builder
Literature Services Team

2 Feb 2016

Page 2: Scalable Text Mining

A Text-Mining Pipeline

Text

Page 3: Scalable Text Mining

Contents

● Text-Mining Pipeline Crisis
● Session 1: Build Your Own Pipeline
● Session 2: Build Your Own Dictionary
● Wrap Up

Page 4: Scalable Text Mining

A long list of requests

Use case           Semantic type                 Dictionary type                      Document type        Section                                             Metadata         Delivery method
OpenAIRE           accession numbers             pattern (e.g., [0-9][A-Za-z0-9]{3})  patents              Title, Claim, Description, Abstract, Figure, Table  Pubyear, IPCR    summary table
ERC                grant identifiers             pattern                              articles             Acknowledgements                                    -                search index
CTTV               gene, disease                 term (e.g., IBD)                     articles, abstracts  -                                                   -                json
ELIXIR-EXCELERATE  resource names                term                                 articles             -                                                   -                summary table
1000 Genomes       cell line names               pattern                              articles             !Acknowledgements                                   -                REST API
Wikipedia          accession numbers             pattern                              wikipages            -                                                   -                summary table
KEW Garden         species names (multilingual)  term                                 articles             -                                                   -                summary table
ChEMBL             resource name                 term                                 articles             -                                                   Author, Journal  summary table
Ensembl            genomic range                 pattern                              articles             -                                                   -                summary table

Page 5: Scalable Text Mining

Scalable Text Mining

● For the last few years, we have been having a pipeline crisis!
● A long list of requests and our slow responses
    ○ Makes you unhappy.
● Even worse, it’s a long tail!
    ○ The same pipeline is never used for two requests.
    ○ Every time, we have to build a new pipeline.
    ○ We need a new approach to solve this crisis.

Page 6: Scalable Text Mining

Objective

● We want to build a LEGO-like platform that helps you to build your own text-mining pipeline and your own dictionary.

Page 7: Scalable Text Mining

A Key Block: Dictionary-Based Tagger

● Role: To identify names (e.g., proteins, species, accession numbers)

● Dictionary-based approach for mining names.
    ○ Simple
    ○ Readable
    ○ Interactive

● Building a dictionary is a VERY iterative process.
    ○ 20% for building an initial dictionary and the rest for refining it.
● Good dictionaries are key to text-mining success stories.
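
A minimal Python sketch of what a dictionary-based tagger block might look like, assuming a dictionary holds literal terms and regular-expression patterns. The `Annotation` type, the `TERMS` and `PATTERNS` tables, and the `tag` function are hypothetical names for illustration, not the team's actual implementation; the two entries are borrowed from the per-entry rules slide.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int
    end: int
    text: str
    entry_id: str


# Hypothetical dictionary: term entries are literal strings,
# pattern entries are regular expressions.
TERMS = {"basal cell": "EFO_0001997"}
PATTERNS = {r"HG[0-9]{5}": "1000 genomes"}


def tag(text):
    """Return every dictionary hit in `text`, sorted by position."""
    hits = []
    for term, entry_id in TERMS.items():
        # Terms are matched literally, on word boundaries, case-sensitively.
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            hits.append(Annotation(m.start(), m.end(), m.group(), entry_id))
    for pattern, entry_id in PATTERNS.items():
        # Patterns are used as regular expressions as written.
        for m in re.finditer(r"\b(?:" + pattern + r")\b", text):
            hits.append(Annotation(m.start(), m.end(), m.group(), entry_id))
    return sorted(hits)


if __name__ == "__main__":
    for hit in tag("Sample HG00096 was compared with basal cell carcinoma."):
        print(hit)
```

Keeping the dictionary as plain data is what makes the approach readable and easy to revise: refining the tagger means editing entries, not code.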

Page 8: Scalable Text Mining

Agile Revision Process

Page 9: Scalable Text Mining

Session 1

Build Your Own Pipeline

As …, I want a pipeline to do ...

Page 10: Scalable Text Mining

Pipeline Stories

● CTTV
    ○ As a researcher, I want to find articles with supporting evidence from drug discovery.
● ERC
    ○ As a funder, I want funded articles to be more searchable.
● ELIXIR-EXCELERATE
    ○ As a resource manager, I want to know the impact of resources.

Page 11: Scalable Text Mining

Second, Find & Describe Blocks You Need

When you want                                   You can use
to extract a sentence                           Sentence splitter
to limit your mining to an article section      Section tagger
to identify disease names                       Dictionary-based tagger
to identify database identifiers                Dictionary-based tagger
to find relations between genes and diseases    Relation extractor
to get some analytics                           Summary table generator
to get article metadata                         Europe PMC REST API
to produce text-mined data in RDF               RDF generator

Page 12: Scalable Text Mining

Then, Build a Pipeline using Blocks
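
One way to picture the LEGO-like idea in Python: each block is a function that takes a document and returns an enriched copy, and a pipeline is just an ordered list of blocks. This is a sketch under that assumption only. `build_pipeline`, the document dictionary layout, and the deliberately naive block implementations are all hypothetical, and the example sentence is made up.

```python
import re
from collections import Counter


def sentence_splitter(doc):
    """Block: split the text into sentences (naive punctuation rule)."""
    doc = dict(doc)
    parts = re.split(r"(?<=[.!?])\s+", doc["text"])
    doc["sentences"] = [s.strip() for s in parts if s.strip()]
    return doc


def dictionary_tagger(terms):
    """Block factory: tag each sentence with the dictionary terms it contains."""
    def block(doc):
        doc = dict(doc)
        doc["hits"] = [(i, t)
                       for i, s in enumerate(doc["sentences"])
                       for t in terms if t in s]
        return doc
    return block


def summary_table(doc):
    """Block: count tagged sentences per term."""
    doc = dict(doc)
    doc["summary"] = Counter(term for _, term in doc["hits"])
    return doc


def build_pipeline(*blocks):
    """Chain blocks so that the output of one is the input of the next."""
    def run(doc):
        for block in blocks:
            doc = block(doc)
        return doc
    return run


if __name__ == "__main__":
    pipeline = build_pipeline(sentence_splitter,
                              dictionary_tagger(["IBD", "NOD2"]),
                              summary_table)
    result = pipeline({"text": "NOD2 is linked to IBD. IBD is chronic."})
    print(result["summary"])
```

Because every block shares the same interface, answering a new request means reordering or swapping blocks rather than building a new pipeline from scratch.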

Page 13: Scalable Text Mining

Session 2

Build Your Own Dictionary

Designing filtering rules

Page 14: Scalable Text Mining

How to Revise a Dictionary?

● We want to build an expressive language for filtering.
● Global filtering rule
    ○ Term length > 2
    ○ Case sensitive
● Per-entry filtering rule
    ○ A term should be tagged when it is mentioned in the Methods section.
    ○ A pattern should be tagged when it follows the term “omim”.
● Blacklist: e.g., stop words
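
The global rules and the blacklist could be sketched in Python roughly as follows. The function names and the blacklist contents are assumptions made for illustration; the slides only specify the rules themselves (term length > 2, case-sensitive matching, a blacklist such as stop words).

```python
import re

# Hypothetical blacklist; the slides only say "e.g., stop words".
BLACKLIST = {"the", "and", "was", "all"}


def apply_global_rules(terms, min_length=3, blacklist=BLACKLIST):
    """Keep terms longer than two characters that are not blacklisted."""
    return [t for t in terms if len(t) >= min_length and t not in blacklist]


def find_mentions(term, text):
    """Case-sensitive, whole-word search; returns the start offsets."""
    return [m.start()
            for m in re.finditer(r"\b" + re.escape(term) + r"\b", text)]


if __name__ == "__main__":
    print(apply_global_rules(["IBD", "ER", "was", "insulin"]))
    print(find_mentions("IBD", "IBD and ibd differ"))
```

Here the blacklist check is itself case-sensitive, so an uppercase gene symbol that happens to spell a stop word would survive the filter; whether that is desirable is a design decision for the dictionary owner.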

Page 15: Scalable Text Mining

Per-Entry Rules

● A spreadsheet per entry

● Definitions
    ○ Context: should (not) be after a term.
    ○ Section: should (not) be mentioned in a section.
    ○ URI: check if http://www.ebi.ac.uk/efo/EFO_0001997 exists

Entry information                                     Filtering rules
Term/Pattern  Entry       ID           DB             Context        Section  URI
Pattern       HG[0-9]{5}  -            1000 genomes   !(grant|fund)  !ACK     -
Term          basal cell  EFO_0001997  efo            -              Methods  Yes
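
A hedged Python sketch of how the Context and Section columns might be evaluated for one spreadsheet row. Two assumptions are not stated in the slides: a leading `!` is read as negation, and "after a term" is read as "the immediately preceding word matches". The `Rule` class and the helper names are hypothetical, and the URI check is left out because it needs network access.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    regex: str                     # the term or pattern of the entry
    context: Optional[str] = None  # e.g. "!(grant|fund)"; "!" negates
    section: Optional[str] = None  # e.g. "!ACK" or "Methods"


def _check(spec, test):
    """Evaluate one rule cell; an empty cell always passes."""
    if spec is None:
        return True
    negated = spec.startswith("!")
    ok = test(spec[1:] if negated else spec)
    return ok != negated


def accepts(rule, text, start, section):
    """Decide whether a match starting at `start` in `section` is kept."""
    preceding = text[:start].split()
    prev_word = preceding[-1] if preceding else ""
    context_ok = _check(
        rule.context,
        lambda p: re.fullmatch(p, prev_word, re.IGNORECASE) is not None)
    section_ok = _check(rule.section, lambda s: s == section)
    return context_ok and section_ok


def tag(rule, text, section):
    """Return the matches of `rule.regex` that pass the per-entry rules."""
    return [m.group() for m in re.finditer(rule.regex, text)
            if accepts(rule, text, m.start(), section)]


if __name__ == "__main__":
    cell_line = Rule(r"HG[0-9]{5}", context="!(grant|fund)", section="!ACK")
    print(tag(cell_line, "Cell line HG00096 was sequenced.", "Methods"))
    print(tag(cell_line, "Supported by grant HG00096.", "Methods"))
```

With the first row of the table, an identifier that directly follows "grant" or "fund", or that appears in the Acknowledgements section, is dropped as a likely grant number rather than a cell line name.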

Page 16: Scalable Text Mining

Analytics

● Summary table

PMCID         Term                ID           Frequency
PMCID4698870  Nutlin-3            ChEBI:46742  16
PMCID4698870  cell cycle arrests  GO:0007050   6

● Top 100 frequent terms

Top  Name     Document Freq.  Collection Freq.
1    protein  678,987         1,823,783
2    water    563,234         1,233,332
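
Both tables can be derived from the same list of tagged mentions. The Python sketch below assumes the usual information-retrieval definitions, which the slides do not spell out: document frequency is the number of documents that mention a term, and collection frequency is the total number of mentions. The function names and the sample identifiers are hypothetical.

```python
from collections import Counter


def frequencies(mentions):
    """`mentions` is an iterable of (doc_id, term) pairs, one per mention."""
    per_doc = Counter()      # (doc_id, term) -> mentions in that document
    collection = Counter()   # term -> mentions in the whole collection
    for doc_id, term in mentions:
        per_doc[(doc_id, term)] += 1
        collection[term] += 1
    # Each (doc_id, term) key counts once towards the document frequency.
    document = Counter(term for _, term in per_doc)
    return per_doc, document, collection


def top_terms(document, collection, n=100):
    """Rows of (rank, term, document frequency, collection frequency)."""
    return [(rank, term, df, collection[term])
            for rank, (term, df) in enumerate(document.most_common(n), 1)]


if __name__ == "__main__":
    mentions = [("PMC1", "protein"), ("PMC1", "protein"),
                ("PMC1", "water"), ("PMC2", "protein")]
    per_doc, document, collection = frequencies(mentions)
    for row in top_terms(document, collection):
        print(row)
```

`per_doc` corresponds to the per-article summary table, and `top_terms` to the ranked list of frequent names.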

Page 17: Scalable Text Mining

Spreadsheet for Filtering Rules

http://tinyurl.com/zlwbx2y

Page 18: Scalable Text Mining

Wrap Up

● What is your pipeline story?
● Have you managed to create your own dictionary?
● What service blocks are missing?
● What should be the interfaces?
● How should we deliver?