Are you ready for the golden age of text mining?
John McNaught
Deputy Director, National Centre for Text Mining
University of [email protected]
London Info International 2
Overview
• Text mining in a nutshell
• Enriching content, enhancing search, enabling discovery, reducing costs
• Interoperability and evaluation
• The C change
How do we (humans) discover?
• Find, read, learn, analyse a lot
• Ask “What if…?”
• Construct hypotheses, test them
  – Explore many avenues, associations
• Work collaboratively
• Share results and data with others
  – Reproducibility, validation
• Integrate heterogeneous data/information/knowledge
• (vs. Serendipity: by lucky accident)
Barriers to discovery
• Find: document oriented, too many hits
• Read: too much to read, even if we find relevant hits
• Learn: growth too fast to keep up, to know most things
• Analyse: duplication of efforts, many new results to document
• Construct hypotheses: hard, can’t tell which are most promising, or if we have missed any
• Share: primary vehicles are documents and curated databases (massive curation backlog)
• Integrate: document often the key, hard to link in to different worlds of data, information, knowledge
How does TM aid discovery?
• Find: more precise, relevant information, within and across documents
• Read: much faster than human
• Learn: extracts, packages, links, synthesises, summarises, reduces burden
• Analyse: recognises duplication; clusters, classifies, drives semantic author aids
• Construct hypotheses: rapidly finds and ranks unknown associations for testing
• Share: reduces curation effort, complements and validates databases
• Integrate: links documents deeply into worlds of data, information and knowledge
Text mining in a nutshell
(Diagram: text is analysed at increasing levels of sophistication; the results, together with other data, feed applications such as semantic search and data mining.)
Levels of analysis and the techniques behind them:
• Words: wordform co-occurrence, pattern matching, …
• Terms: term recognition and normalisation
• Entities: named entity recognition
• Relations: relation extraction
• Events: event extraction
Also shown: associations; metaknowledge extraction; data mining, clustering
Questions shown alongside the levels:
• What is known about this disease, protein, person?
• What is linked with X?
• {Who, what} Xed {whom, what} where, when and how?
• What if…?
• Keyword search
• Is X possible, certain, probable, suggested, past, to come?
• What is this paper about?
Increased sophistication? Increased customisation!
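The lowest level in the diagram, wordform co-occurrence, can be illustrated with a few lines of Python. This is only a minimal sketch and not the method used by any NaCTeM tool: the function name `cooccurrence_counts`, the regular-expression tokenisation and the toy sentences are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
import re


def cooccurrence_counts(sentences):
    """Count how often two distinct wordforms appear in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and keep alphanumeric tokens; a real system would use a proper tokenizer.
        words = sorted(set(re.findall(r"[a-z0-9]+", sentence.lower())))
        # Each unordered pair is stored once, in alphabetical order.
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


# Toy sentences, invented for illustration.
sentences = [
    "Mad1 forms a complex with Mad2",
    "Mad1 facilitates Mad2 binding to Cdc20",
]
counts = cooccurrence_counts(sentences)
print(counts[("mad1", "mad2")])
```

Counting raw wordforms is cheap but knows nothing about terms, entities or events, which is why the higher levels in the diagram need more customised analysis.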
A complex space
• Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …, Chinese, Hindi, Arabic, Urdu, Japanese, Korean, …
• Tasks: translation, information extraction, semantic search, question answering, sentiment analysis, summarization, knowledge discovery, database curation, systematic reviewing, pathway reconstruction, …
• Domains: finance/business, health, biology, social sciences, humanities, …
• Text types: scientific articles (full papers/abstracts), social media, patents, clinical records, EMR, books, theses, reports, newswire, …
• Technology: tokenizers, sentence splitters, paragraph splitters, NP chunkers, syntactic parsers, semantic parsers, NE recognizers, relation extractors, event extractors, …
• Resources (mono- and multilingual): gazetteers, annotated corpora, lexicons, terminologies, wordnets, thesauri, ontologies, grammars, …
Diversity of languages and language resources, including temporal diversity
Diversity of contexts
Diversity of applications
Europe’s Languages and Language Technology support
(Figure: languages grouped by level of support through Language Technology, from good support to weak or no support; no language has ‘excellent’ support)
• Good support: English
• Dutch, French, German, Italian, Spanish
• Catalan, Czech, Finnish, Hungarian, Polish, Portuguese, Swedish
• Basque, Bulgarian, Danish, Galician, Greek, Norwegian, Romanian, Slovak, Slovene
• Weak or no support: Croatian, Estonian, Icelandic, Irish, Latvian, Lithuanian, Maltese, Serbian
http://www.meta-net.eu
Enhancing historical collections
• If you have a domain collection going back centuries
  – How easy is it for users to find answers to research questions?
• Language evolves, terms come and go, concepts drift, …
• TM can enhance collections in many ways
  – Handling temporal aspects of language is key
  – Enabling event-based semantic search
Looking into the past
• Semantic search for historians of medicine
  – Treatment and prevention of diseases over time
  – Medical and public health perspectives
• British Medical Journal archive (from 1840)
  – Around 350K articles
• London Medical Officer of Health reports (1848-1972) (Wellcome Library)
  – Around 5,000 reports from different boroughs
• In historical collections, the same concept is expressed by different terms across different time periods
• Users miss information due to unfamiliar terminology
• TM to extract/link diachronic synonyms, organize in thesaurus
• Use diachronic thesaurus for time-sensitive search
• Traditional search: user searches for “pulmonary tuberculosis” but doesn’t know the historical synonym “pulmonary phthisis”
• User expands query
• Narrow down results according to faceted search (facets derived both from document metadata and from text mining)
• System automatically suggests related terms
• Distribution of “pulmonary tuberculosis” and “pulmonary phthisis” across time
(A mock-up for user feedback)
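Time-sensitive query expansion with a diachronic thesaurus might look roughly like the sketch below. It is an illustration only: the thesaurus layout, the function name `expand_query` and, in particular, the year ranges are invented for the example and do not come from the BMJ or Medical Officer of Health collections.

```python
# Hypothetical diachronic thesaurus: each concept maps to (term, first year, last year).
# The year ranges are invented placeholders, not measured usage periods.
DIACHRONIC_THESAURUS = {
    "pulmonary tuberculosis": [
        ("pulmonary phthisis", 1840, 1920),
        ("pulmonary tuberculosis", 1880, 2014),
    ],
}


def expand_query(term, start_year, end_year, thesaurus=DIACHRONIC_THESAURUS):
    """Return the query term plus synonyms that were in use during the search window."""
    expanded = [term]
    for synonym, first_seen, last_seen in thesaurus.get(term, []):
        # Two closed intervals overlap when each starts before the other ends.
        overlaps = first_seen <= end_year and last_seen >= start_year
        if overlaps and synonym not in expanded:
            expanded.append(synonym)
    return expanded


print(expand_query("pulmonary tuberculosis", 1850, 1870))
print(expand_query("pulmonary tuberculosis", 1950, 1970))
```

Tying each synonym to a period is what makes the expansion time-sensitive: a historical term is only suggested when the user is searching the years in which it was actually used.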
Analysing events of interest to historians
Type: Affect
  Description: An entity or event is affected, infected, changed or transformed, possibly by another entity or event
  Participants:
    Cause: of the affection
    Target: Entity or event affected
    Subject: Medical subject affected
Type: Cause
  Description: An entity or event results in manifestation of another entity or event
  Participants:
    Cause: of the event
    Result: Resulting entity or event
    Subject: Medical subject affected
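The two event types above could be represented in code along the following lines. This is a sketch under assumptions: the class name `Event`, the `EVENT_ROLES` mapping and the example sentence are hypothetical, and only the type and role names are taken from the table.

```python
from dataclasses import dataclass, field
from typing import Dict

# Roles permitted for each event type, as listed in the table.
EVENT_ROLES = {
    "Affect": ("Cause", "Target", "Subject"),
    "Cause": ("Cause", "Result", "Subject"),
}


@dataclass
class Event:
    event_type: str
    participants: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        allowed = EVENT_ROLES.get(self.event_type)
        if allowed is None:
            raise ValueError(f"unknown event type: {self.event_type}")
        unexpected = set(self.participants) - set(allowed)
        if unexpected:
            raise ValueError(
                f"roles not allowed for {self.event_type}: {sorted(unexpected)}"
            )


# Invented example sentence: "Overcrowding caused outbreaks of phthisis among children."
event = Event(
    "Cause",
    {"Cause": "overcrowding", "Result": "outbreaks of phthisis", "Subject": "children"},
)
print(event)
```

Validating roles against the event type keeps extracted events comparable, which matters once they are indexed for event-based search.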
Classic case of working together
• End user (typically) not a text miner
• Text miner (typically) not a domain expert
• Requirements and evaluation: challenge for both
• Need to work together to understand
  – How TM can help, what it can and cannot do
  – What questions are of interest
  – What role human has
  – What outcomes are desirable
  – What existing resources can be exploited
Mining Biodiversity
http://miningbiodiversity.org
Aim: Transform Biodiversity Heritage Library into a next-generation social digital library
• 130,000 volumes of digitised legacy literature
A multi-disciplinary approach
1. Text Mining
2. Machine learning
3. Data visualisation
4. History of Science
5. Environmental History & Studies
6. Library and Information Science
7. Social Media
Semantic metadata extraction to support search
• Observation
• Habitation
• Nutrition
Finding evidence
• Event extraction can drive semantic search as we’ve seen. We can go a step further…
• Example: application for Europe PubMed Central
• Deeply analyse documents
• Index relationships
• Key off search term, to dynamically generate from indexed relationships questions that have known answers
  – Not auto-completion
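The idea of keying off a search term to generate questions with known answers can be sketched as follows. This is not the EvidenceFinder implementation: the relationship tuples, the document identifiers, the question templates and the function names `build_index` and `suggest_questions` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical indexed relationships: (subject, verb, object, document id).
RELATIONSHIPS = [
    ("Mad1", "binds", "Mad2", "doc-1"),
    ("Mad1", "regulates", "Mad2", "doc-2"),
    ("Mad2", "binds", "Cdc20", "doc-2"),
]


def build_index(relationships):
    """Index every relationship under both of its participants."""
    index = defaultdict(list)
    for subj, verb, obj, doc_id in relationships:
        index[subj.lower()].append((subj, verb, obj, doc_id))
        index[obj.lower()].append((subj, verb, obj, doc_id))
    return index


def suggest_questions(term, index):
    """Generate only questions whose answers already exist in the index."""
    questions = defaultdict(set)
    key = term.lower()
    for subj, verb, obj, doc_id in index.get(key, []):
        if obj.lower() == key:
            questions[f"What {verb} {obj}?"].add((subj, doc_id))
        if subj.lower() == key:
            questions[f"{subj} {verb} what?"].add((obj, doc_id))
    return dict(questions)


index = build_index(RELATIONSHIPS)
for question, answers in suggest_questions("Mad2", index).items():
    print(question, sorted(answers))
```

Because every question is built from an indexed relationship, each one is guaranteed at least one answer with a supporting document, which is what separates this from auto-completion of the query string.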
EvidenceFinder: a new way to discover
83,717,24 Sentences about genes, proteins, diseases & metabolites
2,550,328 Documents
How can you tell if an article is relevant to you in your listed search results? Are there hidden gems in the full-text literature that you might be missing? Are there smarter ways to browse the biomedical literature?
Europe PMC’s EvidenceFinder enriches your literature exploration by suggesting questions alongside your search results, providing a way to find information buried in full text articles that is directly relevant to you. This helps you identify articles and research that you might have overlooked through direct key word searching.
http://europepmc.org/
Finding unknown associations
• Need massive amounts of text to find unknown associations, generate hypotheses
• Must go across collections: silos irrelevant to researcher
• Must go across disciplines: cognate and distant – all can shed light
• Information often available in literature many years before, but unsuspected as not explicitly written down
Reproducing a finding, reported (11/2011) in Nature Medicine, with FACTA+, using MEDLINE prior to that date
Info = degree of surprise
http://www.nactem.ac.uk/facta-visualizer/
SGK1 gene, enzyme and symptom: high level of enzyme = infertile; low level = miscarriage
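One generic way to surface associations that are never stated explicitly is to look for concepts linked only through shared intermediates (often called the A-B-C approach to literature-based discovery). The sketch below illustrates that generic idea; it is not a description of how FACTA+ ranks results, and the concept names and the function `indirect_associations` are placeholders invented for the example.

```python
from collections import defaultdict


def indirect_associations(documents, source):
    """Rank concepts never co-mentioned with `source` by how many intermediates link them."""
    neighbours = defaultdict(set)
    for concepts in documents:
        for concept in concepts:
            neighbours[concept] |= set(concepts) - {concept}

    direct = neighbours[source]
    bridges = defaultdict(set)
    for intermediate in direct:
        for target in neighbours[intermediate]:
            # Keep only targets that never appear together with the source.
            if target != source and target not in direct:
                bridges[target].add(intermediate)
    # Most intermediates first; ties broken alphabetically.
    return sorted(bridges.items(), key=lambda item: (-len(item[1]), item[0]))


# Placeholder concept sets, one per document; not real biomedical data.
documents = [
    {"gene X", "enzyme Y"},
    {"enzyme Y", "symptom Z"},
    {"gene X", "pathway P"},
    {"pathway P", "symptom Z"},
    {"enzyme Y", "tissue T"},
]
ranked = indirect_associations(documents, "gene X")
for target, via in ranked:
    print(target, sorted(via))
```

A production system would weight the links statistically rather than simply count intermediates; the count is used here only to keep the example short.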
Building models
• In many domains, build models to understand relationships and processes
• Rely on literature to provide evidence
• Slow, laborious work
• Example: reconstruction of biological pathways
Nodes: 652
Links: 444
600 papers were read to construct the pathway: “inevitable gaps” due to manual methods
Oda & Kitano (2006) in Mol Syst Biol
www.nactem.ac.uk
Mapping reactions and text: PathText
Link to text mining results (green icon)
Building models based on textual evidence
1. The mitotic arrest-deficient protein Mad1 forms a complex with Mad2, which is required for imposing mitotic arrest on cells in which the spindle assembly is perturbed. PMID: 18981471
2. Mad1, an upstream regulator of Mad2, forms a tight core complex with Mad2 and facilitates Mad2 binding to Cdc20. PMID: 18318601
Systematic reviews, etc.
• Systematic reviews, evidence-based public health (EBPH) reviews
  – Balanced reviews to aid policy, guideline, best practice development
• Trade-offs: cost, time available, number of hits to screen/retain, number of full texts to read
  – May miss relevant items
• EBPH reviews: complex questions, exploration of scope required
• Even basic TM can save 75% of manual effort (EPPI-Centre, IoE)
• Use of TM to identify, rank, cluster most relevant items
• NaCTeM & Univ Liverpool currently working with NICE on supporting EBPH reviewers
Interoperability and evaluation
• TM involves many processes and resources
• May be no need to customise, just to select from repositories of available tools and resources
• But tools and resources often incompatible at linguistic/semantic levels
• Difficult to mix and match, to find best combination for task at hand
• Hence drive towards interoperability to enable users to get best out of TM
A tool can show different results when trained on one corpus and tested on another, compared to training and testing on the same corpus
Training data
Test data
             AIMed   GENETAG   GENIA GGP   PennBioIE   PIR
AIMed        89.5    38.5      63.3        40.8        54.7
GENETAG      58.4    75.2      43.1        31.3        56.0
GENIA GGP    66.3    31.0      90.7        34.1        42.6
PennBioIE    65.9    41.2      55.4        84.1        54.0
PIR          54.3    42.0      49.0        37.0        83.6
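The gap between same-corpus and cross-corpus results in the table can be summarised with a short calculation. The numbers are copied from the table; the variable names are just for this example, and the code makes no assumption about which axis is training and which is test, since the diagonal holds the same-corpus cells either way.

```python
CORPORA = ["AIMed", "GENETAG", "GENIA GGP", "PennBioIE", "PIR"]

# Scores copied from the table, row by row.
SCORES = [
    [89.5, 38.5, 63.3, 40.8, 54.7],
    [58.4, 75.2, 43.1, 31.3, 56.0],
    [66.3, 31.0, 90.7, 34.1, 42.6],
    [65.9, 41.2, 55.4, 84.1, 54.0],
    [54.3, 42.0, 49.0, 37.0, 83.6],
]

n = len(CORPORA)
# Diagonal cells: trained and tested on the same corpus.
same = [SCORES[i][i] for i in range(n)]
# Off-diagonal cells: trained on one corpus, tested on another.
cross = [SCORES[i][j] for i in range(n) for j in range(n) if i != j]

same_mean = sum(same) / len(same)
cross_mean = sum(cross) / len(cross)
print(f"same corpus mean:  {same_mean:.2f}")
print(f"cross corpus mean: {cross_mean:.2f}")
```

On these figures the same-corpus mean is roughly 85 and the cross-corpus mean roughly 48, which is the point of the slide: a score reported on one corpus says little about performance on another.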
Importance of evaluating tools
Text mining workflows: rapid TM development, interoperability, common data representation, sharable type system, evaluation
• Kano, Y., Miwa, M., Cohen, K. B., Hunter, L., Ananiadou, S. and Tsujii, J. (2011). U-Compare: a modular NLP workflow construction and evaluation system. IBM Journal of Research and Development.
• Rak, R., Rowley, A., Black, W.J. and Ananiadou, S. (2012). Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation.
U-Compare: Evaluate and Compare TM Workflows
(Diagram: components from a library (Sentence Splitter A, Sentence Splitter B, POS tagger A, POS tagger B, NER) are combined into Workflow A, Workflow B and Workflow C; each workflow is evaluated to give F-Score A, F-Score B and F-Score C.)
Components shown:
• Sentence splitters (SS): UIMA SS, OpenNLP SS, GENIA SS
• Tokenizers: UIMA Tokenizer, OpenNLP Tokenizer, GENIA Tagger as Tokenizer
• Taggers: GENIA Tagger, Stepp Tagger, OpenNLP Tagger
• NER: ABNER, MedT-NER, GENIA Tagger as NER
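The combinatorial side of comparing workflows can be sketched by enumerating one component per stage. This does not use the U-Compare API (U-Compare is a UIMA-based system with its own interface); the component names are taken from the slide, while the stage keys and the function `enumerate_workflows` are assumptions for the example. The sketch also does not check whether every combination is actually compatible.

```python
from itertools import product

# Component names as listed on the slide, grouped by pipeline stage.
LIBRARY = {
    "sentence_splitter": ["UIMA SS", "OpenNLP SS", "GENIA SS"],
    "tokenizer": ["UIMA Tokenizer", "OpenNLP Tokenizer", "GENIA Tagger as Tokenizer"],
    "tagger": ["GENIA Tagger", "Stepp Tagger", "OpenNLP Tagger"],
    "ner": ["ABNER", "MedT-NER", "GENIA Tagger as NER"],
}


def enumerate_workflows(library):
    """Return every workflow formed by picking one component per stage."""
    stages = list(library)
    return [
        dict(zip(stages, combination))
        for combination in product(*(library[stage] for stage in stages))
    ]


workflows = enumerate_workflows(LIBRARY)
print(len(workflows))
print(workflows[0])
```

Even this small library yields 81 candidate workflows, which is why automated evaluation against a gold standard is needed to find the best combination for a task.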
• Web-based application
• Interactive creation of workflows
• Cloud and high-performance computing
• Integrated TM/NLP processing system
• GUI for workflow creation
• Library of ready-to-use processing components
• Statistics, visualizations, developer APIs
• Supports UIMA and sharable type system
• http://argo.nactem.ac.uk
Workflow Editor
Evaluation of Chemical NER workflows
• Supplies gold standard corpus
• Removes gold annotations so that they can be created automatically
• Combinations of syntactic and semantic components create annotations
• Compares and reports precision, recall and F1 of the different branches against the gold standard corpus
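The comparison step can be illustrated with the standard definitions of precision, recall and F1 over sets of annotations. This is a minimal sketch, not Argo's evaluation component: it assumes exact matching of (start, end, label) spans, and the gold and branch annotations are invented.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted annotations against gold ones using exact span matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented annotations: (start offset, end offset, label).
gold = {
    (0, 7, "CHEMICAL"),
    (20, 27, "CHEMICAL"),
    (40, 48, "CHEMICAL"),
    (60, 66, "CHEMICAL"),
}
branches = {
    "branch A": {(0, 7, "CHEMICAL"), (20, 27, "CHEMICAL"), (30, 35, "CHEMICAL")},
    "branch B": {(0, 7, "CHEMICAL")},
}

results = {name: precision_recall_f1(gold, predicted) for name, predicted in branches.items()}
for name, (p, r, f1) in results.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Exact matching is the strictest choice; evaluations that give credit for partially overlapping spans would report different figures for the same output.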
The C change in TM in the UK
• 1/7/2014: Copyright exception for text and data mining for non-commercial purposes
• 1/10/2014: Copyright exception for quotation
• If you have lawful access to any text, you can now
  – Copy it for non-commercial text mining purposes
  – Display/communicate results (e.g., annotations, associations) of TM to others
  – Illustrate results with snippets from text (quotations)
• None of this can be overridden by contract (licence, Ts&Cs)
• https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/375954/Research.pdf
Current state in the EU
• Copyright and licensing in relation to TM is a hot topic
• “The right to read is the right to mine” (Open Knowledge Foundation)
• Hope on the horizon:
  – EC President Jean-Claude Juncker to take steps within his first 6 months to modernise copyright rules “in light of digital revolution and changed consumer behaviour”
Take home messages
• Text mining can be applied in any domain and for many tasks
• In text mining, no one size fits all
  – Text miners and users must work closely together
• Content (at least in UK) can be mined on a massive scale for non-commercial purposes
  – but even a modest collection can benefit from text mining
• Who is your text mining champion?
Contact and Acknowledgements
• www.nactem.ac.uk
• Funders and sponsors: MRC, AHRC, JISC, BBSRC, ESRC, NIH, DARPA, Europe PubMed Central funders (Wellcome Trust + 25 funders), NHS, European Commission
• Previous funding from: AstraZeneca, Pfizer, Elsevier, Nature Publishing Group, BBC