text analytics in enterprise search - daniel ling

Text Analytics in Enterprise Search Daniel Ling (Findwise)

Upload: lucenerevolution

Post on 20-Jun-2015


DESCRIPTION

See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

Text analytics is a large and interesting subject, covering a wide range of topics. In the world of enterprise search, however, the usual application of text analytics rarely ranges beyond extracting semi-structured information from the source data. Yet some of the more advanced concepts in text analytics, such as automatic text categorization, can easily be leveraged to turn a search installation from a search tool into a tool for discovery.

TRANSCRIPT

Page 1: Text Analytics in Enterprise Search - Daniel Ling

Text Analytics in Enterprise Search Daniel Ling (Findwise)


What will I cover?

Intro

About Text Analytics

Benefits and possibilities

Examples

Solution Techniques to Examples

Conclusions


My Background

Daniel Ling

Findwise

Enterprise Search and Findability Consultant

Experience and expertise

5+ years of Enterprise Search Experience

20+ enterprise search implementations across a range of industries

Lucene, FAST ESP, Solr

Apache Solr is my primary search platform

Focus areas include Findability, Search Architecture and Implementation, Text Analytics, and Document Processing.


About Text Analytics


Text Analytics in the Enterprise

Challenges:

80% of data in the Enterprise is unstructured.

Reduce the time looking for information (currently 9.6 hours per week)

Reduce the time reading documents / e-mails (currently 14.5 hours per week)

Benefits:

More predictable scale and domain

Well-understood domain

Supporting content for analytics can be identified


Text Analytics

The definition

A set of linguistic, statistical and machine learning techniques used to model and structure the information content of textual sources.

- Wikipedia.org


Types of Applications

• Entity Extraction

• Document Categorization

• Sentiment Analysis

• Summarization


Frameworks and Techniques

Framework | Techniques

Solr | Statistics, Linguistics

Mallet, Classifier4j, etc. | Statistical natural language processing

Mahout (Hadoop) | Machine Learning, Statistics

GATE | General language processing framework

UIMA | Content analytics, text mining, pipeline

OpenNLP | Machine learning toolkit for NLP


Benefits and possibilities


Benefits and possibilities

Text analytics can bring some structure to the unstructured content

Enhance discovery and findability of content

• Works well together with search

Increase relevance and precision with extracted keywords and meta-data

Generating content for dynamic pages / topic pages

• Selection of documents and extracts from documents

Track and discover sentiments

Reduce the time users need to analyze content


Examples


Entity Extraction

Types of Entities for Extraction:

• Dates

• Places

• Companies

• Objects (product names, etc.)

• People

• Events


Example – Presenting the data


Example – Facets on the data

Example Solution: Entity Extraction

Rule-based entity extraction

Combination of lists and regular expressions

Works within well-understood domains.

Requires maintaining lists.

Sources for lists: country lists from the World Factbook, public companies from Google Finance, customers from the CRM.

Workflow: Document for indexing > Update Request Handler > Update Chain (lookup and match entities) > Writes to index
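The combination of lists and regular expressions described above can be sketched in a few lines. This is a minimal illustration in Python, not the custom Solr update processor from the talk: the list contents, the `entity_` field prefix, the `body` input field and the function names are all assumptions made for the example, and the date pattern only covers ISO-style dates.

```python
import re

# Hypothetical lookup lists; in practice these would be loaded from
# maintained resources (for example a country list or a CRM export).
ENTITY_LISTS = {
    "country": ["Sweden", "Spain", "United Kingdom"],
    "company": ["Findwise", "Google"],
}

# Illustrative pattern for ISO-style dates (YYYY-MM-DD) only.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def build_list_pattern(terms):
    """Compile one case-insensitive, whole-word pattern for a term list."""
    # Longest terms first so multi-word names win over shorter alternatives.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


LIST_PATTERNS = {name: build_list_pattern(terms) for name, terms in ENTITY_LISTS.items()}

# Map lower-cased matches back to the canonical spelling from the list.
CANONICAL = {name: {t.lower(): t for t in terms} for name, terms in ENTITY_LISTS.items()}


def extract_entities(text):
    """Return a dict of entity field name -> sorted unique values found in text."""
    fields = {}
    for name, pattern in LIST_PATTERNS.items():
        found = {CANONICAL[name][m.group(0).lower()] for m in pattern.finditer(text)}
        if found:
            fields["entity_" + name] = sorted(found)
    dates = set(DATE_PATTERN.findall(text))
    if dates:
        fields["entity_date"] = sorted(dates)
    return fields


def enrich_document(doc, input_field="body"):
    """Copy the document and add entity fields derived from one input field."""
    enriched = dict(doc)
    enriched.update(extract_entities(doc.get(input_field, "")))
    return enriched
```

Calling `enrich_document({"id": "1", "body": "..."})` returns a copy of the document with the extra entity fields, which mirrors the "lookup and match entities" step before the document is written to the index. The resulting multi-valued fields are what facets could then be built on.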

Diagram: Update Chain (processor) (lists | input fields | entity fields) > Lucene Index (entity fields)


Example Solution: Entity Extraction

Register a custom class that looks up resources and extracts found entities into specific Solr fields; it is set up in solrconfig.xml.


Document Categorization

To assign a label to the document / content / data.

Labels for the category or for the sentiment.

Threshold values for matching a category before labeling.

Statistics and “knowledge” from previous examples can be used.
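The threshold idea can be illustrated with a small helper. This is a sketch under assumptions: the `scores` dictionary stands in for whatever per-category scores a classifier produces, and the default threshold of 0.6 is an arbitrary value chosen for the example.

```python
def assign_label(scores, threshold=0.6, fallback=None):
    """Return the best-scoring label if its score meets the threshold.

    `scores` maps label -> score (assumed to be comparable, e.g. probabilities).
    When no label is confident enough, `fallback` is returned so the
    document can be left unlabeled instead of being mislabeled.
    """
    if not scores:
        return fallback
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score >= threshold else fallback
```

Leaving a document unlabeled when the best score is below the threshold trades recall for precision, which usually suits facets: a missing category value is less harmful than a wrong one.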


Example – Facets from Categories


Example Solution: Document Categorization

Training the component, Mallet (Machine Learning for Language Toolkit).

• Alternative components include a Lucene (TFIDF) index (MoreLikeThis), OpenNLP, Textcat, Classifier4j.

Running the new documents against the model/index of trained documents.

Training from an interface, ad hoc, or by indexing pre-categorized documents.
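The train-then-evaluate flow can be sketched with a tiny classifier. To be clear, this is not Mallet and does not use its API: it is a from-scratch multinomial Naive Bayes written in plain Python as a stand-in for the trained component, and the class and method names are assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesCategorizer:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training docs
        self.term_counts = defaultdict(Counter)  # label -> term -> count
        self.vocabulary = set()

    def train(self, text, label):
        """Add one pre-categorized document to the model."""
        tokens = tokenize(text)
        self.doc_counts[label] += 1
        self.term_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def probabilities(self, text):
        """Return label -> posterior probability for a new document."""
        if not self.doc_counts:
            return {}
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.term_counts[label]
            total_terms = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore terms never seen in training
                score += math.log((counts[token] + 1) / (total_terms + vocab_size))
            log_scores[label] = score
        # Normalize in log space to avoid underflow on long documents.
        top = max(log_scores.values())
        exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {label: v / norm for label, v in exp_scores.items()}
```

Training calls `train(text, label)` once per pre-categorized document; a new document is then run against the model with `probabilities(text)`, and the highest-scoring label becomes the candidate category.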

Figure from the book Taming Text.


Example Solution: Document Categorization

Mallet and the process of setting up and training:

Example Solution: Document Categorization

Evaluating a new document:

Diagram: Update Chain (processor) (input document) > Lucene Index (category field)

Setting the evaluated category tag on the document in the pipeline:

Document Summarization

Summarize a document, at index time or on-demand.

Leverages the knowledge and term statistics of the document and the index.

Picks the “most important” sentences based on the statistics and displays those.
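One way to pick the "most important" sentences from term statistics is sketched below. This is an illustration only, not the Solr component from the talk: the TF-IDF style weighting, the naive sentence splitter, and the `doc_freq` and `num_docs` parameters (standing in for index-level statistics) are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def summarize(text, doc_freq, num_docs, max_sentences=2):
    """Return the top-scoring sentences, kept in their original order.

    `doc_freq` maps term -> number of indexed documents containing it,
    and `num_docs` is the total number of documents in the index.
    """
    sentences = split_sentences(text)
    term_freq = Counter(tokenize(text))

    def weight(term):
        # TF-IDF with a smoothed IDF so unseen terms do not divide by zero.
        idf = math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0
        return term_freq[term] * idf

    scored = []
    for position, sentence in enumerate(sentences):
        terms = tokenize(sentence)
        if not terms:
            continue
        # Average term weight, so long sentences are not favored by length alone.
        score = sum(weight(t) for t in terms) / len(terms)
        scored.append((score, position, sentence))

    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]
```

The same function works at index time or on demand; in both cases the document's own term counts are combined with the index-wide document frequencies, so terms that are frequent in this document but rare in the index pull their sentences to the top.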



Example – Summarize content

Static Summaries

Dynamic Summaries


Example – Summarize content - 1


Example – Summarize content - 2


Example Solution: Document Summarization

Custom RequestHandler that receives the document ID and the field to summarize.

Custom SearchComponent that makes the selection of top sentences.

Selects a subset of sentences and sends these back in a field.

Diagram: RequestHandler (SearchComponent for summarization) > Lucene Index


Wrap Up

• Examples: Entity Extraction, Document Categorization, Summarization.

• Technology: You can take small steps and gain a great deal, since you can leverage features and components of Solr and Lucene (as well as other open source NLP frameworks).

• Value: Benefits of text analytics include increased discovery, findability and productivity from the solution.

Questions?

[email protected]

www.findabilityblog.com
