
Are you ready for the golden age of text mining?

John McNaught
Deputy Director, National Centre for Text Mining

University of [email protected]

London Info International

Overview

• Text mining in a nutshell
• Enriching content, enhancing search, enabling discovery, reducing costs
• Interoperability and evaluation
• The C change

McNaught


How do we (humans) discover?

• Find, read, learn, analyse a lot
• Ask “What if…?”
• Construct hypotheses, test them
  – Explore many avenues, associations
• Work collaboratively
• Share results and data with others
  – Reproducibility validation
• Integrate heterogeneous data/information/knowledge
• (vs. Serendipity: by lucky accident)


Barriers to discovery

• Find: document oriented, too many hits
• Read: too much to read, even if we find relevant hits
• Learn: growth too fast to keep up, to know most things
• Analyse: duplication of efforts, many new results to document
• Construct hypotheses: hard, can’t tell which are most promising, or if we have missed any
• Share: primary vehicles are documents and curated databases (massive curation backlog)
• Integrate: document often the key, hard to link in to different worlds of data, information, knowledge


How does TM aid discovery?

• Find: more precise, relevant information, within and across documents
• Read: much faster than human
• Learn: extracts, packages, links, synthesises, summarises, reduces burden
• Analyse: recognises duplication; clusters, classifies, drives semantic author aids
• Construct hypotheses: rapidly finds and ranks unknown associations for testing
• Share: reduces curation effort, complements and validates databases
• Integrate: links documents deeply into worlds of data, information and knowledge

Text mining in a nutshell

[Figure: text mining overview. Labels: Other data; Applications: Semantic search, Data mining]

[Figure: levels of analysis, the techniques behind them, and the questions they help answer]

Levels: Words, Terms, Entities, Relations, Events, Associations

Techniques:
• Wordform co-occurrence, pattern matching, …
• Term recognition and normalisation
• Named entity recognition
• Relation extraction
• Event extraction
• Metaknowledge extraction
• Data mining, clustering

Questions:
• Keyword search
• What is this paper about?
• What is known about this disease, protein, person?
• What is linked with X?
• {Who, what} Xed {whom, what} where, when and how?
• Is X possible, certain, probable, suggested, past, to come?
• What if…?

Increased sophistication? Increased customisation!
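As a rough illustration of how the lower levels build on each other, the sketch below goes from words to entities to candidate relations. The gazetteer, the function names and the "co-occurring in one sentence" heuristic are assumptions made for this example only; real systems use trained recognisers and extractors rather than dictionary lookup.

```python
import re
from itertools import combinations

# Hypothetical gazetteer: surface form -> entity type (illustration only).
GAZETTEER = {"Mad1": "Protein", "Mad2": "Protein", "Cdc20": "Protein"}

def tokenize(sentence):
    """Words level: split the sentence into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", sentence)

def find_entities(tokens):
    """Entities level: dictionary lookup, keeping first-seen order."""
    seen = []
    for tok in tokens:
        if tok in GAZETTEER and tok not in seen:
            seen.append(tok)
    return seen

def candidate_relations(entities):
    """Relations level: every entity pair in the sentence is a candidate."""
    return list(combinations(entities, 2))

sentence = ("Mad1, an upstream regulator of Mad2, forms a tight core complex "
            "with Mad2 and facilitates Mad2 binding to Cdc20.")
entities = find_entities(tokenize(sentence))
relations = candidate_relations(entities)
print(entities)
print(relations)
```

Co-occurrence only proposes candidates; deciding which pairs are genuinely related, and how, is the job of the relation and event extraction levels.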

A complex space

• Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …, Chinese, Hindi, Arabic, Urdu, Japanese, Korean, …
• Tasks: Translation, Information extraction, Semantic search, Question answering, Sentiment analysis, Summarization, Knowledge discovery, Database curation, Systematic reviewing, Pathway reconstruction, …
• Domains: Finance/Business, Health, Biology, Social Sciences, Humanities, …
• Text types: Scientific articles (full papers/abstracts), Social media, Patents, Clinical records, EMR, Books, theses, reports, Newswire, …
• Technology: Tokenizers, Sentence splitters, Paragraph splitters, NP chunkers, Syntactic parsers, Semantic parsers, NE recognizers, Relation extractors, Event extractors, …
• Resources (mono- and multilingual): Gazetteers, Annotated corpora, Lexicons, Terminologies, Wordnets, Thesauri, Ontologies, Grammars, …

Diversity of languages and language resources, including temporal diversity
Diversity of contexts
Diversity of applications

Europe’s Languages and Language Technology support

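[Figure: European languages placed on a scale of Language Technology support. Scale labels: “good support through Language Technology” and “weak or no support”; no language has ‘excellent’ support.]

Languages shown, grouped as in the figure:
• English
• Dutch, French, German, Italian, Spanish
• Catalan, Czech, Finnish, Hungarian, Polish, Portuguese, Swedish
• Basque, Bulgarian, Danish, Galician, Greek, Norwegian, Romanian, Slovak, Slovene
• Croatian, Estonian, Icelandic, Irish, Latvian, Lithuanian, Maltese, Serbian

http://www.meta-net.eu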


Enhancing historical collections

• If you have a domain collection going back centuries
  – How easy is it for users to find answers to research questions?
• Language evolves, terms come and go, concepts drift, …
• TM can enhance collections in many ways
  – Handling temporal aspects of language is key
  – Enabling event-based semantic search


Looking into the past

• Semantic search for historians of medicine
  – Treatment and prevention of diseases over time
  – Medical and public health perspectives
• British Medical Journal archive (from 1840)
  – Around 350K articles
• London Medical Officer of Health reports (1848-1972) (Wellcome Library)
  – Around 5,000 reports from different boroughs

In historical collections, same concept expressed by different terms across different time periods

Users miss information due to unfamiliar terminology

TM to extract/link diachronic synonyms, organize in thesaurus

Use diachronic thesaurus for time-sensitive search

Traditional search: user searches for “pulmonary tuberculosis” but doesn’t know historical synonym “pulmonary phthisis”

[Screenshot: a mock-up for user feedback, with these callouts]
• System automatically suggests related terms
• User expands query
• Narrow down results according to faceted search (facets derived both from document metadata and from text mining)
• Distribution of “pulmonary tuberculosis” and “pulmonary phthisis” across time
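A minimal sketch of thesaurus-driven query expansion, assuming the diachronic thesaurus is available as groups of synonymous terms. Only the tuberculosis/phthisis pair comes from the slides; the data layout, the function names and the toy documents are invented for illustration, and simple substring matching stands in for a real search engine.

```python
# Hypothetical diachronic thesaurus: each set groups the terms for one concept.
THESAURUS = [
    {"pulmonary tuberculosis", "pulmonary phthisis"},
]

def expand_query(term):
    """Return the term plus any diachronic synonyms, original term first."""
    for concept in THESAURUS:
        if term in concept:
            return [term] + sorted(concept - {term})
    return [term]

def search(documents, term):
    """Return ids of documents that mention the term or any of its synonyms."""
    variants = expand_query(term)
    return [doc_id for doc_id, text in documents.items()
            if any(v in text.lower() for v in variants)]

# Toy documents, invented for illustration.
docs = {
    "report-a": "Deaths from pulmonary phthisis fell in the borough.",
    "report-b": "Notifications of pulmonary tuberculosis increased.",
    "report-c": "Measles outbreak in local schools.",
}
hits = search(docs, "pulmonary tuberculosis")
print(hits)
```

Without expansion the query would match only the report that uses the modern term; with it, the report using the historical synonym is found as well.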

Analysing events of interest to historians

Type: Affect
Description: An entity or event is affected, infected, changed or transformed, possibly by another entity or event
Participants:
– Cause: of the affection
– Target: Entity or event affected
– Subject: Medical subject affected

Type: Cause
Description: An entity or event results in manifestation of another entity or event
Participants:
– Cause: of the event
– Result: Resulting entity or event
– Subject: Medical subject affected
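One way such event definitions could be represented in code is sketched below. The role names per event type follow the definitions given here; the `Event` class, its `trigger` field and the example sentence are assumptions made for illustration, not the project's actual schema.

```python
from dataclasses import dataclass, field

# Participant roles allowed for each event type, as listed in the definitions.
ROLES = {
    "Affect": {"Cause", "Target", "Subject"},
    "Cause": {"Cause", "Result", "Subject"},
}

@dataclass
class Event:
    type: str
    trigger: str  # word signalling the event (assumed field)
    participants: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject roles that the event type does not define.
        unknown = set(self.participants) - ROLES[self.type]
        if unknown:
            raise ValueError(f"roles not allowed for {self.type}: {sorted(unknown)}")

# Invented sentence: "Overcrowding caused outbreaks of typhus among children."
event = Event(
    type="Cause",
    trigger="caused",
    participants={
        "Cause": "overcrowding",
        "Result": "outbreaks of typhus",
        "Subject": "children",
    },
)
print(event)
```

Storing events in this structured form is what makes event-based search possible: a query can ask for all Cause events with a given Result rather than for a string of keywords.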


Classic case of working together

• End user (typically) not a text miner
• Text miner (typically) not a domain expert
• Requirements and evaluation: challenge for both
• Need to work together to understand
  – How TM can help, what it can and cannot do
  – What questions are of interest
  – What role human has
  – What outcomes are desirable
  – What existing resources can be exploited

Mining Biodiversity

http://miningbiodiversity.org

Aim: Transform Biodiversity Heritage Library into a next-generation social digital library
• 130,000 volumes of digitised legacy literature

A multi-disciplinary approach
1. Text Mining
2. Machine learning
3. Data visualisation
4. History of Science
5. Environmental History & Studies
6. Library and Information Science
7. Social Media

Semantic metadata extraction to support search

[Figure labels: Observation, Habitation, Nutrition]


Finding evidence

• Event extraction can drive semantic search as we’ve seen. We can go a step further…
• Example: application for Europe PubMed Central
• Deeply analyse documents
• Index relationships
• Key off search term, to dynamically generate from indexed relationships questions that have known answers
  – Not auto-completion
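The idea of generating questions from indexed relationships can be sketched as follows. This is an illustrative toy, not how EvidenceFinder is implemented: the triple format, the question templates and the function names are all assumptions, and the three triples are hand-written for the example.

```python
from collections import defaultdict

# Hand-written (subject, predicate, object) triples for illustration; a real
# index would be filled by relation and event extraction over full texts.
TRIPLES = [
    ("Mad1", "forms a complex with", "Mad2"),
    ("Mad1", "facilitates binding of", "Mad2"),
    ("Mad2", "binds to", "Cdc20"),
]

def build_index(triples):
    """Index every triple under both its subject and its object."""
    index = defaultdict(list)
    for subj, pred, obj in triples:
        index[subj.lower()].append((subj, pred, obj))
        index[obj.lower()].append((subj, pred, obj))
    return index

def suggest_questions(index, term):
    """Map each generated question to answers already present in the index."""
    questions = defaultdict(list)
    for subj, pred, obj in index.get(term.lower(), []):
        if subj.lower() == term.lower():
            questions[f"{subj} {pred} what?"].append(obj)
        else:
            questions[f"What {pred} {obj}?"].append(subj)
    return dict(questions)

index = build_index(TRIPLES)
suggestions = suggest_questions(index, "Mad2")
for question, answers in suggestions.items():
    print(question, "->", answers)
```

Because every question is built from a stored relationship, each one is guaranteed to have at least one answer, which is what separates this from auto-completion of the user's typing.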

EvidenceFinder: a new way to discover

83,717,24 sentences about genes, proteins, diseases & metabolites
2,550,328 documents

How can you tell if an article is relevant to you in your listed search results? Are there hidden gems in the full-text literature that you might be missing? Are there smarter ways to browse the biomedical literature?

Europe PMC’s EvidenceFinder enriches your literature exploration by suggesting questions alongside your search results, providing a way to find information buried in full text articles that is directly relevant to you. This helps you identify articles and research that you might have overlooked through direct key word searching.

http://europepmc.org/

Finding unknown associations

• Need massive amounts of text to find unknown associations, generate hypotheses

• Must go across collections: silos irrelevant to researcher

• Must go across disciplines: cognate and distant – all can shed light

• Information often available in literature many years before, but unsuspected as not explicitly written down

Reproducing a finding - reported (11/2011) in Nature Medicine - with FACTA+, using MEDLINE prior to date

Info=degree of surprise

http://www.nactem.ac.uk/facta-visualizer/

SGK1 gene, enzyme and symptom: high level of enzyme = infertile; low level = miscarriage
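A minimal sketch of how an association that is never stated explicitly can still be found: if A is mentioned with B, and B with C, but A is never mentioned with C, then A–C is a candidate hypothesis (the pattern often called Swanson's ABC model). This is a deliberately simple stand-in and does not reproduce FACTA+ or its information-based ranking; the concept labels and documents are hypothetical, and candidates are ranked only by the number of linking concepts.

```python
from collections import defaultdict
from itertools import combinations

# Each toy "document" is the set of concepts it mentions (hypothetical labels).
DOCS = [
    {"gene_A", "enzyme_B"},
    {"enzyme_B", "symptom_C"},
    {"gene_A", "process_D"},
    {"process_D", "symptom_C"},
    {"process_D", "symptom_E"},
]

def cooccurrence(docs):
    """Map each concept to the set of concepts it shares a document with."""
    neighbours = defaultdict(set)
    for doc in docs:
        for a, b in combinations(sorted(doc), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours

def indirect_associations(docs, start):
    """Rank concepts never seen with `start` by their number of linking concepts."""
    neighbours = cooccurrence(docs)
    direct = set(neighbours[start])
    links = defaultdict(set)
    for mid in direct:
        for target in neighbours[mid]:
            if target != start and target not in direct:
                links[target].add(mid)
    return sorted(links.items(), key=lambda kv: (-len(kv[1]), kv[0]))

ranked = indirect_associations(DOCS, "gene_A")
for target, via in ranked:
    print(target, "via", sorted(via))
```

At the scale of MEDLINE the same idea needs statistical weighting, since raw link counts favour very frequent concepts; a surprise-based measure addresses exactly that.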


Building models

• In many domains, build models to understand relationships and processes

• Rely on literature to provide evidence
• Slow, laborious work
• Example: reconstruction of biological pathways

[Figure: manually constructed pathway map, Oda & Kitano (2006) in Mol Syst Biol]

• Nodes: 652
• Links: 444
• 600 papers were read to construct the pathway: “inevitable gaps” due to manual methods

Mapping reactions and text: PathText

Link to text mining results (green icon)

Building models based on textual evidence

1. The mitotic arrest-deficient protein Mad1 forms a complex with Mad2, which is required for imposing mitotic arrest on cells in which the spindle assembly is perturbed. PMID: 18981471

2. Mad1, an upstream regulator of Mad2, forms a tight core complex with Mad2 and facilitates Mad2 binding to Cdc20. PMID: 18318601



Systematic reviews, etc.

• Systematic reviews, evidence-based public health (EBPH) reviews
  – Balanced reviews to aid policy, guideline, best practice development
• Trade-offs: cost, time available, number of hits to screen/retain, number of full texts to read
  – May miss relevant items
• EBPH reviews: complex questions, exploration of scope required
• Even basic TM can save 75% of manual effort (EPPI-Centre, IoE)
• Use of TM to identify, rank, cluster most relevant items
• NaCTeM & Univ Liverpool currently working with NICE on supporting EBPH reviewers
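To make the "rank most relevant items" idea concrete, here is a small sketch of screening prioritisation: unscreened records are ordered by bag-of-words cosine similarity to records a reviewer has already included. The similarity measure, the function names and the toy titles are assumptions for illustration; it is not the method used by the EPPI-Centre or NaCTeM, and it makes no claim about the effort saved.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts for one record."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_for_screening(included, candidates):
    """Order unscreened records by similarity to records already included."""
    profile = Counter()
    for text in included:
        profile.update(bag_of_words(text))
    scored = [(cosine(profile, bag_of_words(text)), doc_id)
              for doc_id, text in candidates.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]

# Toy titles, invented for illustration.
included = ["school based smoking prevention programme evaluation"]
candidates = {
    "c1": "crystal structure of a bacterial enzyme",
    "c2": "evaluation of a smoking prevention programme in schools",
    "c3": "community smoking cessation services",
}
ranking = rank_for_screening(included, candidates)
print(ranking)
```

Reviewers would then screen from the top of the list, so the records most likely to be relevant are seen first and the ranking can be refreshed as more decisions are made.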

Interoperability and evaluation

• TM involves many processes and resources
• May be no need to customise, just to select from repositories of available tools and resources
• But tools and resources often incompatible at linguistic/semantic levels
• Difficult to mix and match, to find best combination for task at hand
• Hence drive towards interoperability to enable users to get best out of TM


Importance of evaluating tools

A tool can show different results when trained on one corpus and tested on another, compared to training and testing on same corpus

Axes: training data and test data (the diagonal is training and testing on the same corpus)

             AIMed   GENETAG   GENIA GGP   PennBioIE   PIR
AIMed        89.5    38.5      63.3        40.8        54.7
GENETAG      58.4    75.2      43.1        31.3        56.0
GENIA GGP    66.3    31.0      90.7        34.1        42.6
PennBioIE    65.9    41.2      55.4        84.1        54.0
PIR          54.3    42.0      49.0        37.0        83.6
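The size of the gap can be read off the table directly. The short sketch below, using the scores exactly as given on the slide, compares the average same-corpus score (the diagonal) with the average cross-corpus score (all other cells); the variable names are ours, and the slide does not state which metric the scores are.

```python
CORPORA = ["AIMed", "GENETAG", "GENIA GGP", "PennBioIE", "PIR"]

# Scores as given on the slide; row i and column i refer to the same corpus.
SCORES = [
    [89.5, 38.5, 63.3, 40.8, 54.7],
    [58.4, 75.2, 43.1, 31.3, 56.0],
    [66.3, 31.0, 90.7, 34.1, 42.6],
    [65.9, 41.2, 55.4, 84.1, 54.0],
    [54.3, 42.0, 49.0, 37.0, 83.6],
]

n = len(CORPORA)
same = [SCORES[i][i] for i in range(n)]
cross = [SCORES[i][j] for i in range(n) for j in range(n) if i != j]

mean_same = sum(same) / len(same)
mean_cross = sum(cross) / len(cross)

print(f"same corpus:  {mean_same:.1f}")
print(f"cross corpus: {mean_cross:.1f}")

# For every corpus, the same-corpus cell is the best score in its row.
diagonal_is_best = all(SCORES[i][i] == max(SCORES[i]) for i in range(n))
print("diagonal is row maximum:", diagonal_is_best)
```

The drop of well over thirty points on average is the reason a tool should be evaluated on data resembling the collection it will actually be applied to.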

Text mining workflows: rapid TM development, interoperability, common data representation, sharable type system, evaluation

Kano, Y., Miwa, M., Cohen, K. B., Hunter, L., Ananiadou, S. and Tsujii, J. (2011). U-Compare: a modular NLP workflow construction and evaluation system. IBM Journal of Research and Development.

Rak, R., Rowley, A., Black, W.J. and Ananiadou, S. (2012). Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation.

U-Compare: Evaluate and Compare TM Workflows

[Figure: a library of components (Sentence Splitter A, Sentence Splitter B, POS tagger A, POS tagger B, NER) is combined into Workflow A, Workflow B and Workflow C, which are compared by F-Score A, F-Score B and F-Score C]

Components shown:
• Sentence splitters: UIMA SS, OpenNLP SS, GENIA SS
• Tokenizers: UIMA Tokenizer, OpenNLP Tokenizer, GENIA Tagger as Tokenizer
• Taggers: GENIA Tagger, Stepp Tagger, OpenNLP Tagger
• NER: ABNER, MedT-NER, GENIA Tagger as NER

• Web-based application
• Interactive creation of workflows
• Cloud and high-performance computing
• Integrated TM/NLP processing system
• GUI for workflow creation
• Library of ready-to-use processing components
• Statistics, visualizations, developer APIs
• Supports UIMA and sharable type system
• http://argo.nactem.ac.uk


Workflow Editor

Evaluation of Chemical NER workflows

[Screenshot of a workflow, with these callouts]
• Supplies gold standard corpus
• Removes gold annotations so that they can be created automatically
• Combinations of syntactic and semantic components create annotations
• Compares and reports precision, recall and F1 of the different branches against the gold standard corpus
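The comparison step can be sketched as below: each branch's annotations are scored against the gold standard with exact-match precision, recall and F1. This illustrates the standard metrics only, not Argo's own code; the `(start, end, label)` annotation format, the branch names and the offsets are invented for the example.

```python
def prf(gold, predicted):
    """Exact-match precision, recall and F1 for sets of (start, end, label)."""
    tp = len(gold & predicted)  # annotations found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented character-offset annotations for one document.
gold = {
    (0, 7, "Chemical"), (15, 22, "Chemical"),
    (30, 38, "Chemical"), (45, 50, "Chemical"),
}
branches = {
    "branch_1": {(0, 7, "Chemical"), (15, 22, "Chemical"), (60, 66, "Chemical")},
    "branch_2": {(0, 7, "Chemical"), (15, 22, "Chemical"),
                 (30, 38, "Chemical"), (45, 52, "Chemical")},
}

results = {name: prf(gold, pred) for name, pred in branches.items()}
for name, (p, r, f) in results.items():
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Exact matching is strict: in `branch_2` the annotation `(45, 52)` overlaps a gold annotation but has a different end offset, so it counts as both a false positive and a miss.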


The C change in TM in the UK

• 1/7/2014: Copyright exception for text and data mining for non-commercial purposes
• 1/10/2014: Copyright exception for quotation
• If you have lawful access to any text, you can now
  – Copy it for non-commercial text mining purposes
  – Display/communicate results (e.g., annotations, associations) of TM to others
  – Illustrate results with snippets from text (quotations)
• None of this can be overridden by contract (licence, Ts&Cs)
• https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/375954/Research.pdf


Current state in the EU

• Copyright and licensing in relation to TM is a hot topic

• “The right to read is the right to mine” (Open Knowledge Foundation)

• Hope on the horizon:
  – EC President Jean-Claude Juncker to take steps within his first 6 months to modernise copyright rules “in light of digital revolution and changed consumer behaviour”


Take home messages

• Text mining can be applied in any domain and for many tasks

• In text mining, no one size fits all
  – Text miners and users must work closely together
• Content (at least in UK) can be mined on a massive scale for non-commercial purposes
  – but even a modest collection can benefit from text mining
• Who is your text mining champion?


Contact and Acknowledgements

• www.nactem.ac.uk
• Funders and sponsors: MRC, AHRC, JISC, BBSRC, ESRC, NIH, DARPA, Europe PubMed Central funders (Wellcome Trust + 25 funders), NHS, European Commission
• Previous funding from: AstraZeneca, Pfizer, Elsevier, Nature Publishing Group, BBC
