ICIC 2013 Conference Proceedings: David Milward, Linguamatics
DESCRIPTION
Unstructured Text in Big Data: the Elephant in the Room
David Milward (Linguamatics, UK)
A traditional approach has been to extract structured data from the unstructured text. When it comes to big data, manual or semi-automated methods just don't scale. Even with fully automated methods, it is infeasible to extract all the facts and relationships buried in the text to produce a 'knowledge base of everything' that answers every possible end-user question. In recent years, there has therefore been a trend towards using text mining over unstructured data to directly answer questions, effectively creating novel databases on the fly. However, this has widened the gap between information professionals and end users. In this talk I will explain how multiple approaches are being adopted to exploit the skills of information professionals to improve decision support for end users. These include specialised real-time querying, alerting based on semantic queries, semantic enrichment of documents, and population of semantic stores or linked data.
TRANSCRIPT
Unstructured Text in Big Data: The Elephant in the Room
David Milward
ICIC, October 2013
Unstructured Big Data
• Big Data
– Volume, Variety, Velocity
• Estimated 80% of all data is unstructured
– Need to be able to make decisions based on this data
• Ever increasing amount of data makes it harder and harder to filter by hand
– Need more automation
© 2013 Linguamatics Ltd. 2
[Pie chart: 80% unstructured data, 20% structured data]
From Unstructured Data to Structured
• Traditional text mining works in batch mode
• Requires each extraction to be programmed in, or learnt from a large amount of annotated material
[Diagram: unstructured or semi-structured content sources → identify concepts and relationships → structured, relational data with normalized concept identifiers]
I2E Agile Text Mining
[Diagram labels: Querying, Results, Information Consumers]
In the beginning...
Increasing Range Of Applications
• Protein-Protein Interactions
• Vocabulary Development
• Biomarker Discovery
• Target Identification & Prioritization
• Drug Repositioning
• Conference Abstract Mining
• Patent Analysis
• Key Opinion Leader Identification
• Competitive Intelligence
• Safety/Tox
• In-licensing Opportunities
• SAR Extraction
• Systems Biology
• Mining FDA Drug Labels
• Extracting Numerical and Experimental Data
• Chemical Search
• Sentiment Analysis in Social Media
• Plant Metabolic Pathways
• Mining Electronic Medical Records
• Patent FTO
The Database Approach
• Can we predetermine what is required?
• One approach is to try to extract everything … but in practice you have to make assumptions, whether extracting by hand or automatically
– Do we include negative statements?
– Do we include speculation?
– What context do we include?
• Need a good idea of how the data will be used to know what needs to be extracted
• There are cases where this works well e.g.
– have useful structured data and want to add to it from free text
– have new records structured, but want to bring in information from legacy free text
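The assumptions listed on this slide (negation, speculation, context) can be made explicit in the schema that extracted facts are loaded into. The following is a minimal sketch only, not the schema used in the talk: the `ExtractedFact` fields, the `facts` table and the three example sentences are all hypothetical, and Python's standard `sqlite3` module stands in for whatever relational database is actually used.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class ExtractedFact:
    # All field names are hypothetical; they mirror the questions on the slide.
    doc_id: str
    subject: str
    relation: str
    obj: str
    negated: bool      # e.g. "X does not inhibit Y"
    speculative: bool  # e.g. "X may inhibit Y"
    context: str       # the sentence the fact was extracted from

def store(conn, facts):
    """Create the (hypothetical) facts table if needed and insert the facts."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts ("
        "doc_id TEXT, subject TEXT, relation TEXT, obj TEXT, "
        "negated INTEGER, speculative INTEGER, context TEXT)"
    )
    conn.executemany(
        "INSERT INTO facts VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(f.doc_id, f.subject, f.relation, f.obj,
          int(f.negated), int(f.speculative), f.context) for f in facts],
    )
    conn.commit()

def asserted_facts(conn):
    """Return only facts that are neither negated nor speculative."""
    rows = conn.execute(
        "SELECT subject, relation, obj FROM facts "
        "WHERE negated = 0 AND speculative = 0 ORDER BY rowid"
    )
    return rows.fetchall()

conn = sqlite3.connect(":memory:")
store(conn, [
    ExtractedFact("doc1", "drug A", "inhibits", "protein B", False, False,
                  "Drug A inhibits protein B."),
    ExtractedFact("doc2", "drug A", "inhibits", "protein C", True, False,
                  "Drug A does not inhibit protein C."),
    ExtractedFact("doc3", "drug A", "inhibits", "protein D", False, True,
                  "Drug A may inhibit protein D."),
])
print(asserted_facts(conn))
```

Keeping the negated and speculative statements in the table, rather than discarding them at extraction time, leaves the decision about whether to use them to the person querying the database.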
Extracting for a Relational Database
• Extracting all relevant information from pathology reports
Extracting for a Semantic Store
• Output as a set of triples
• Because all the parts have URIs, the parts are web-addressable
[Diagram labels: Psoriasis, Peptide, Infectious Disease (concepts); Treat, Express, Associate (relations); Web]
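As a rough illustration of triples whose parts all have URIs, the sketch below serializes a couple of triples in the N-Triples style (`<subject> <predicate> <object> .`). The `http://example.org/` namespace and the `uri` helper are hypothetical, and the pairing of the slide's labels into triples is for illustration only; it is not a claim taken from the talk.

```python
BASE = "http://example.org/"  # hypothetical namespace, not from the talk

def uri(kind, name):
    """Build a URI for a concept or relation (illustrative scheme only)."""
    return f"{BASE}{kind}/{name.replace(' ', '_')}"

def to_ntriples(triples):
    """Serialize (subject, predicate, object) URI triples, one per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

# Illustrative pairings of the slide's labels; not extracted facts.
triples = [
    (uri("concept", "Peptide"), uri("relation", "treat"),
     uri("concept", "Psoriasis")),
    (uri("concept", "Peptide"), uri("relation", "associate"),
     uri("concept", "Infectious Disease")),
]

print(to_ntriples(triples))
```

Because the subject, predicate and object are each a URI, any one of them can be looked up or linked to independently of the triple it appears in.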
The Ad Hoc Approach
• No need to build a database
• Regard text mining queries as ways to create a database on the fly
• Keep all the unstructured free text, and query for what you need when you need it
• Even index structured data with text mining to link information together
• Used very successfully by knowledge professionals:
– Just like a search engine, there are thousands of questions people want to ask of the data
– You can’t predict them all beforehand
Ad Hoc Querying
• Find relationships that you want for any particular information request
• Treat the text mining as an “agile” database to answer thousands of different questions e.g. metabolites from fruit
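The idea of treating text mining as an "agile" database can be sketched in a few lines: the raw text is kept as it is, and each query builds its own small table of rows at the moment it is asked. This is a toy illustration using regular expressions, not how I2E works; the documents, the `FRUITS` vocabulary and the `query` function are all hypothetical.

```python
import re

# Toy corpus kept as raw text; nothing is extracted ahead of time.
DOCS = {
    "doc1": "Quercetin is found in apples. Resveratrol is found in grapes.",
    "doc2": "Caffeine is found in coffee beans.",
}
FRUITS = {"apples", "grapes"}  # hypothetical mini-vocabulary

def query(docs, pattern, keep):
    """Run a pattern over the raw text and build result rows on the fly."""
    rows = []
    for doc_id, text in docs.items():
        for match in re.finditer(pattern, text):
            row = match.groupdict()
            if keep(row):
                rows.append({"doc": doc_id, **row})
    return rows

# One ad hoc question: which compounds are reported in fruit?
pattern = r"(?P<compound>[A-Z][a-z]+) is found in (?P<source>[a-z]+(?: [a-z]+)*)"
rows = query(DOCS, pattern, lambda row: row["source"] in FRUITS)
for row in rows:
    print(row)
```

A different question only needs a different pattern and filter; the underlying text does not have to be reprocessed into a new schema first.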
Do We Need More?
• I2E Agile Text Mining has opened up text mining to the end user, but it is still mainly used by knowledge professionals
• Given the importance of unstructured big data, can we bring the benefits of text mining to an even wider audience?
Workflow Automation
• Successful queries converted into regular workflows using Pipeline Pilot or KNIME
• Real-time workflows for up-to-date dashboards, alerts
• Further analysis, visualization, integration with other data
[Figure label: Pipeline Pilot]
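The alerting pattern described on this slide, re-running a saved query and reporting only what is new, can be sketched independently of any workflow tool. The code below is an assumption-laden illustration in plain Python and does not use the Pipeline Pilot or KNIME APIs; `new_hits`, `trial_phase_query` and the example documents are hypothetical.

```python
import re

def new_hits(saved_query, documents, seen):
    """Run a saved query over documents and return only hits not seen before."""
    alerts = []
    for doc_id, text in documents.items():
        for hit in saved_query(text):
            key = (doc_id, hit)
            if key not in seen:
                seen.add(key)  # remember the hit so it is not reported twice
                alerts.append(key)
    return alerts

def trial_phase_query(text):
    """Hypothetical saved query: find mentions of a clinical trial phase."""
    return re.findall(r"Phase (?:IV|I{1,3})\b", text)

seen = set()
first = new_hits(trial_phase_query,
                 {"t1": "A Phase II trial has started."}, seen)
second = new_hits(trial_phase_query,
                  {"t1": "A Phase II trial has started.",
                   "t2": "Phase III results are due."}, seen)
print(first)
print(second)
```

On the second run only the hit from the newly added document is reported, which is the behaviour an alert or an up-to-date dashboard needs.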
Clinical Trials Analysis from I2E Text Mining Results
[Figure: Pipeline Pilot visualization of clinical trials; axis shows dates, colours indicate phase]
Embedding I2E within Web Apps
Form-Based Querying
• Create web interfaces connecting to I2E server
• Information professionals develop sophisticated queries
• Queries are parameterized to allow end users to customize
• JavaScript allows portability to mobile devices such as iPads
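The division of labour above, where an information professional writes the query and the end user only fills in parameters, might look roughly like the following. This is a sketch under stated assumptions: the `ParameterizedQuery` class and the query wording are hypothetical and are not the I2E query format or API.

```python
from string import Template

class ParameterizedQuery:
    """A query template written by an expert, with slots end users may fill."""

    def __init__(self, template, choices):
        self.template = Template(template)
        self.choices = choices  # parameter name -> set of allowed values

    def fill(self, **params):
        """Validate the end user's choices and return the completed query."""
        for name, value in params.items():
            if name not in self.choices:
                raise ValueError(f"unknown parameter: {name}")
            if value not in self.choices[name]:
                raise ValueError(f"{value!r} is not allowed for {name}")
        return self.template.substitute(params)

# Hypothetical query text; a real form would send this to the server.
q = ParameterizedQuery(
    "find $entity associated with $disease",
    {"entity": {"genes", "drugs"}, "disease": {"psoriasis", "asthma"}},
)
print(q.fill(entity="genes", disease="psoriasis"))
```

Restricting each parameter to a fixed set of values corresponds to a drop-down in a web form: the end user customizes the query without being able to break it.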
Enhancing Enterprise Search
• Use text mining to annotate concepts to feed into a search engine to:
– provide concept search
– provide concepts for facets
– provide intelligent thumbnails for documents, pulling out key information
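As a rough sketch of how concept annotations can feed search facets, the code below tags each document with concepts from a small vocabulary and counts how many documents mention each one. The `VOCAB` dictionary, the example documents and both functions are hypothetical; a production system would use proper terminologies and send the annotations to the search engine's index rather than counting them in memory.

```python
from collections import Counter

# Hypothetical mini-vocabulary mapping terms to concept types.
VOCAB = {"psoriasis": "Disease", "asthma": "Disease", "aspirin": "Drug"}

def annotate(text):
    """Return the set of (concept type, term) pairs mentioned in the text."""
    words = {w.strip(".,;:()").lower() for w in text.split()}
    return {(VOCAB[w], w) for w in words if w in VOCAB}

def facet_counts(docs):
    """Count, per concept, the number of documents that mention it."""
    counts = Counter()
    for text in docs:
        counts.update(annotate(text))
    return counts

docs = [
    "Aspirin use in asthma.",
    "Asthma and psoriasis comorbidity.",
    "Aspirin dosing.",
]
counts = facet_counts(docs)
for (concept_type, term), n in sorted(counts.items()):
    print(concept_type, term, n)
```

The per-concept document counts are what a faceted search interface would display next to each refinement option.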
Semantic annotation of documents
[Diagram: axis labelled Value; levels Keywords, Concepts, Relationships; labels Search Engine and NLP-based Text Mining]
Provide Concepts as Facets for Enterprise Search
• Refine the search based on terminologies
• Document information and teaser
• Document preview and additional information
Dictionary Matching - Companies
Pattern Matching - Institutions
• Not a fixed list: can find previously unknown institutions
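The difference from dictionary matching is that a pattern describes what an institution name looks like rather than listing known names. A minimal sketch with a regular expression follows; the pattern is a deliberately simple assumption, far cruder than a real linguistic pattern, and the institution names in the example are invented so that no fixed list could contain them.

```python
import re

# Simplified, hypothetical pattern: "University of <Name>" or
# "<Name ...> University/Institute/Hospital", with capitalized name words.
INSTITUTION = re.compile(
    r"\b(?:University of [A-Z][a-z]+(?: [A-Z][a-z]+)*"
    r"|(?:[A-Z][a-z]+ )+(?:University|Institute|Hospital))\b"
)

def find_institutions(text):
    """Return institution-like names found by shape, not by lookup."""
    return [m.group(0) for m in INSTITUTION.finditer(text)]

# Invented names, used only to show matching without a dictionary.
text = ("Samples came from the University of Newtown and "
        "the Greenfield Research Institute.")
print(find_institutions(text))
```

A pattern this simple will miss many real names and can over-match capitalized phrases, so it only illustrates the principle of finding names that are not on any list.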
Pattern Matching - Mutations
Pattern Matching - Chemicals
• Assign structure to novel chemicals
• Distinguish exemplified compounds
• Distinguish compounds given properties
Whatever the Content...
• ... I2E can mine and extract with precision
Scientific literature
…. and Wherever
• I2E 4.1 Linked Servers
• Users can query local information or information on the cloud
• Proprietary content can reside within Enterprise
• Standard sources e.g. patents can be hosted on the cloud, and shared by all users
• Data from content providers can reside on their sites
[Diagram labels: External Content, Content Provider, Clinical Trials, Drug Labels, MEDLINE, Internal Content, Publisher, Patents, I2E Enterprise, I2E Platform, I2E OnDemand]
Bringing the Elephant Down to Size
• Data warehouses for data that you know you need
• Embedding text mining in other applications e.g. Enterprise Search to provide benefits of semantic search
• Building specialized new interfaces to provide self-service applications for end-users
• Continuing role for ad hoc text mining
– text mining can ask almost any question of the data
– you can never predict every question