Search is Not Enough: Using Solr for Analytics

Steve Kearns Director of Product Management www.basistech.com Search is Not Enough Using Solr for Analytics

Upload: lucenerevolution

Post on 18-Jun-2015


Category:

Technology



DESCRIPTION

Presented by Steve Kearns, Basis Technology - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012 Search is everywhere, and it is a crucially important capability in any enterprise, application, or website. However, an increasingly sophisticated user base expects their search engine to bring them more than just document hits - they want the facts, answers, and context that connect the results with their workflow. In this talk, Steve Kearns will discuss and demonstrate how the combination of structured data, text analytics on unstructured data, and Solr can be used to power advanced analytics applications at scale. This includes integrating text analytics components into Solr, adjustments to the Solr Schema, as well as UI-level changes that support the integration of structured and unstructured data from several sources.

TRANSCRIPT

Page 1: Search is Not Enough: Using Solr for Analytics

Steve Kearns

Director of Product Management

www.basistech.com

Search is Not Enough

Using Solr for Analytics

Page 2: Search is Not Enough: Using Solr for Analytics

Agenda

• Basis Technology

• Search and Metadata

• Text + Text Analytics = Metadata

• Solr += Analytics

Configuration

Interface

• Conclusion

Page 3: Search is Not Enough: Using Solr for Analytics

About Basis Technology

• Global leader in computational linguistics as applied to

search-based applications, information discovery, and

identity resolution

• Developer of the most capable, most mature, and most

widely used platform for multilingual text analytics

• Solutions for commercial enterprises expanding globally

and for government agencies dealing with foreign

intelligence

• Offices: Boston, Washington, San Francisco, London,

Tokyo

Page 4: Search is Not Enough: Using Solr for Analytics

Search

Page 5: Search is Not Enough: Using Solr for Analytics

Metadata

Page 6: Search is Not Enough: Using Solr for Analytics

Search and Metadata

• Search alone

Helps find documents/records

May return many unnecessary results

Inefficient way to solve a specific problem

• Search with Metadata

New ways to visualize, navigate and explore

Helps enable users to take action on documents/records

Provides context to aid decision making

New ways to connect disparate data sources

New knobs to tune relevance

• Structured Data Sources

Link unstructured data against structured

Page 7: Search is Not Enough: Using Solr for Analytics

Metadata in Action

Page 8: Search is Not Enough: Using Solr for Analytics

Metadata in Action

Page 9: Search is Not Enough: Using Solr for Analytics

Metadata – Where does it come from?

• Structured Information associated with documents

Author, publish date, part number, price, provenance, etc.

• Manual Annotation

• Text Analytics

Page 10: Search is Not Enough: Using Solr for Analytics

Text Analytics

Page 11: Search is Not Enough: Using Solr for Analytics

Text Analytics

A set of automated analytical

methods designed to add

structure to unstructured

content

Page 12: Search is Not Enough: Using Solr for Analytics

Text Analytics techniques

Page 13: Search is Not Enough: Using Solr for Analytics

Categorization of Text Analytics Technology

• Document-level Analytics

Language Identification

Summarization

Categorization

• Sub-document Analysis

Lemmatization – Improved Search

Entity Extraction

Fact/Relationship Extraction

Topic Extraction

Sentiment

• Cross-Document Analysis

Document Clustering (at index or query time)

Entity Search and Co-reference Resolution

Page 14: Search is Not Enough: Using Solr for Analytics

Text Analytics in Action: E-Discovery

• Demo!

Page 15: Search is Not Enough: Using Solr for Analytics

Document Level Analysis: Language Identification

• Sub-document Lang ID is possible

[Figure: sample passages in French, Russian, and Japanese, each labeled with its identified language, followed by a mixed-language document whose German, French, Japanese, and transliterated Arabic passages are reported as German 29%, French 33%, Japanese 21%, Arabic 17%.]
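
The per-language percentages on this slide suggest a simple recipe: split a document into passages, identify each passage, and report each language's share. The sketch below is a toy version of that idea and not the method used in the talk. It assumes a Unicode script check plus tiny hand-picked stopword lists (`STOPWORDS`, `identify`, and `language_shares` are all hypothetical names); a production identifier would use a statistical model instead.

```python
import re
import unicodedata
from collections import Counter

# Tiny hand-picked stopword lists: an assumption for illustration only.
STOPWORDS = {
    "French": {"le", "la", "les", "de", "du", "et", "que", "un", "est"},
    "German": {"der", "die", "das", "und", "ich", "dass", "ist", "nicht"},
}

def identify(passage):
    """Guess the language of one passage from its script and stopwords."""
    scripts = Counter()
    for ch in passage:
        if ch.isalpha():
            # First word of the Unicode name is the script, e.g. "CYRILLIC".
            scripts[unicodedata.name(ch, "UNKNOWN").split(" ")[0]] += 1
    if not scripts:
        return "Unknown"
    if scripts["HIRAGANA"] or scripts["KATAKANA"]:
        return "Japanese"
    top = scripts.most_common(1)[0][0]
    if top == "CYRILLIC":
        return "Russian"  # assumption: the only Cyrillic language in this toy
    if top == "LATIN":
        words = set(re.findall(r"\w+", passage.lower()))
        best = max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))
        return best if words & STOPWORDS[best] else "Unknown"
    return "Unknown"

def language_shares(passages):
    """Percentage of characters attributed to each identified language."""
    totals = Counter()
    for passage in passages:
        totals[identify(passage)] += len(passage)
    whole = sum(totals.values())
    if not whole:
        return {}
    return {lang: round(100 * n / whole) for lang, n in totals.items()}
```

Calling `language_shares` on the paragraphs of a mixed document yields a per-language breakdown of the kind shown in the figure, which can then be indexed as document metadata.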

Page 16: Search is Not Enough: Using Solr for Analytics

Document Level Analysis: Categorization

• Group Documents into Pre-defined categories

http://news.google.com/

http://www.bbc.co.uk/

Page 17: Search is Not Enough: Using Solr for Analytics

Sub-Document Analysis: Linguistics

• Segmentation of Asian languages

• Lemmatization

N-Gram

Morphological

Segmentation

Stemming

Lemmatization
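
To make the contrast between these techniques concrete, here is a minimal sketch, assuming a hypothetical four-entry lexicon (`LEMMAS`) and a deliberately crude suffix stripper. It is meant only to show how the outputs differ, not how any particular product implements them.

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams, a common fallback for unsegmented text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Hypothetical mini-lexicon; a real lemmatizer uses a full morphological dictionary.
LEMMAS = {"mice": "mouse", "ran": "run", "running": "run", "better": "good"}

def naive_stem(word):
    """Strip a few English suffixes without consulting a dictionary."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look up the dictionary form, falling back to the word itself."""
    return LEMMAS.get(word, word)
```

Bigrams of 東京都庁 include 京都, so a query for Kyoto (京都) would match text about the Tokyo Metropolitan Government; morphological segmentation avoids that kind of false hit. Likewise `naive_stem("running")` gives "runn", while `lemmatize("running")` gives "run".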

Page 18: Search is Not Enough: Using Solr for Analytics

Sub-Document Analysis: Sentiment

• Sentence, paragraph, entity, aspect, emotion

http://twittersentiment.appspot.com/search?query=Lucene

http://maps.google.com/maps/place?cid=7410753351872099397

Page 19: Search is Not Enough: Using Solr for Analytics

Sub-Document Analysis: Entity Extraction

• Identify Named Concepts in Unstructured Text

Statistical, rules, lists

http://www.twitscoop.com/
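
Of the three approaches named on this slide, lists and rules are the easiest to sketch. The example below assumes a hypothetical three-entry gazetteer and a single date pattern; the statistical approach is omitted because it needs a trained model.

```python
import re

# Hypothetical gazetteer; real lists are far larger and curated per domain.
GAZETTEER = {
    "Basis Technology": "ORGANIZATION",
    "Olusegun Obasanjo": "PERSON",
    "Luxembourg": "LOCATION",
}

# One rule-based pattern: a simplistic date such as "18-Jun-2015".
DATE_RULE = re.compile(r"\b\d{1,2}-[A-Z][a-z]{2}-\d{4}\b")

def extract_entities(text):
    """Return (start, end, type, surface) tuples sorted by position."""
    found = []
    for name, etype in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), m.end(), etype, m.group()))
    for m in DATE_RULE.finditer(text):
        found.append((m.start(), m.end(), "DATE", m.group()))
    return sorted(found)
```

The extracted types and surface forms can be stored in separate multi-valued fields so that they become available for faceting and filtering.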

Page 20: Search is Not Enough: Using Solr for Analytics

Sub-Document: Fact / Relationship / Event Extraction

• Identify Facts, Link Entities, Events and Times

http://www.silobreaker.com/FlashNetwork.aspx?DrillDownItems=11_237360

Page 21: Search is Not Enough: Using Solr for Analytics

Cross-Document: Entity Co-reference Resolution

• Map extracted entities to real-world Concepts

Page 22: Search is Not Enough: Using Solr for Analytics

Cross-Document Analysis: Clustering

• Near Duplicate Detection

• Unsupervised Clustering
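
One common way to approach near-duplicate detection is word shingling with Jaccard similarity. The sketch below is an assumption about how it might be done, not the method shown in the talk. It compares every pair of documents, which only works for small sets; at scale a technique such as MinHash would normally replace the pairwise loop.

```python
def shingles(text, k=3):
    """Set of overlapping k-word sequences from the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_duplicates(docs, threshold=0.5):
    """Return id pairs whose shingle sets overlap at or above threshold."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    ids = sorted(sets)
    return [
        (x, y)
        for i, x in enumerate(ids)
        for y in ids[i + 1:]
        if jaccard(sets[x], sets[y]) >= threshold
    ]
```

The resulting pairs can be turned into a shared cluster identifier per document, which is the kind of field a search front end can collapse on.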

Page 23: Search is Not Enough: Using Solr for Analytics

Text Analytics: Entity Search

Page 24: Search is Not Enough: Using Solr for Analytics

Solr += Analytics

Page 25: Search is Not Enough: Using Solr for Analytics

Text Analytics in/around Solr

• Analyzer/Tokenizer/TokenFilter

• UpdateRequestProcessor

Run Analysis in Solr

Call External Analysis Service

• Pre-Processor to Solr

Page 26: Search is Not Enough: Using Solr for Analytics

Integration Point: Analyzer

• Good for:

Linguistics

Segmentation of Asian Languages

Customized Segmentation

• Limitations:

No access to document object

• An Analyzer is:

CharFilter

Tokenizer

Set of TokenFilters
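
The three-part structure above can be modeled in a few lines. This is a Python analogy, not the Lucene/Solr Java API: `make_analyzer` and the individual functions are hypothetical names chosen to mirror the CharFilter, Tokenizer, and TokenFilter roles.

```python
import re

def strip_html(text):
    """CharFilter role: rewrite the raw character stream before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    """Tokenizer role: turn characters into tokens."""
    return text.split()

def lowercase(tokens):
    """TokenFilter role: transform each token."""
    return [t.lower() for t in tokens]

def remove_stopwords(tokens, stop=frozenset({"the", "a", "of"})):
    """TokenFilter role: drop tokens."""
    return [t for t in tokens if t not in stop]

def make_analyzer(char_filters, tokenizer, token_filters):
    """Compose the three stages into a single text -> tokens function."""
    def analyze(text):
        for char_filter in char_filters:
            text = char_filter(text)
        tokens = tokenizer(text)
        for token_filter in token_filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

analyze = make_analyzer([strip_html], whitespace_tokenizer,
                        [lowercase, remove_stopwords])
```

Note that `analyze` receives only the text of one field. That matches the limitation listed on the slide: nothing at this stage can see the rest of the document.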

Page 27: Search is Not Enough: Using Solr for Analytics

Analyzer/Tokenizer Configuration

• schema.xml

FieldType

• Analyzer (Index)

– CharFilter

– Tokenizer

– TokenFilter

• Analyzer (Query)

Page 28: Search is Not Enough: Using Solr for Analytics

Integration Point: UpdateRequestProcessor

• Runs Before Analyzers

• Full Access to Document

• Two options:

Run the analysis directly in Solr

Call out to external analysis services

• Limitations:

Think through your indexing strategy

Page 29: Search is Not Enough: Using Solr for Analytics

Integration Point: UpdateRequestProcessor

• Run the analysis directly in Solr

Good for lightweight, stateless document analytics

Not good for cross-document analytics

• Call out to external analysis services

Web Services, UIMA, OpenPipeline, GATE, custom code

Note that these external calls are synchronous

Additional complexity / points of failure
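
The two options can be pictured as two kinds of stage in one chain. The sketch below is a Python model of the idea, not Solr's Java `UpdateRequestProcessor` API; the class names, the `process` method, and the `client` callable are all hypothetical.

```python
class LanguageIdProcessor:
    """Runs in-process: light, stateless, and able to see the whole document."""

    def process(self, doc):
        text = doc.get("body", "")
        # Toy rule standing in for a real language identifier.
        has_cyrillic = any("\u0400" <= c <= "\u04ff" for c in text)
        doc["language"] = "ru" if has_cyrillic else "en"
        return doc


class ExternalEntityProcessor:
    """Calls an analysis service; indexing waits until the call returns."""

    def __init__(self, client):
        self.client = client  # hypothetical callable: text -> list of strings

    def process(self, doc):
        try:
            doc["entities"] = self.client(doc.get("body", ""))
        except Exception:
            # The extra point of failure: decide whether to skip, retry, or reject.
            doc["entities"] = []
        return doc


def run_chain(processors, doc):
    """Pass the document through each processor in order."""
    for processor in processors:
        doc = processor.process(doc)
    return doc
```

Because each stage blocks the next, a slow external service slows every document added, which is why the indexing strategy needs thought before choosing this integration point.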

Page 30: Search is Not Enough: Using Solr for Analytics

UpdateRequestProcessor Configuration

• solrconfig.xml

RequestHandler

• update.processor = UpdateRequestProcessorChain.name

UpdateRequestProcessorChain

• Processors
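
Besides being set on the request handler, the chain can usually be selected per request with the same parameter. The sketch below only builds the URL. The host, core path, and chain name "analytics" are assumptions, and the parameter name follows the slide; to my knowledge later Solr releases renamed it to `update.chain`, so check which name your version expects.

```python
from urllib.parse import urlencode

def update_url(base, chain, param="update.processor", commit=True):
    """Build an update URL that selects a named processor chain."""
    query = {param: chain}
    if commit:
        query["commit"] = "true"
    return f"{base}/update?{urlencode(query)}"

# Hypothetical host and chain name, for illustration only.
url = update_url("http://localhost:8983/solr", "analytics")
```

Passing `param="update.chain"` produces the newer form without changing anything else.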

Page 31: Search is Not Enough: Using Solr for Analytics

Integration Point: Pre-Processor

• Index in Solr as Last Step of Analysis

• Good For:

Finer-grained control

Managing dependencies between analytic components

Scalability

• Limitations:

Complexity / New points of failure

Cannot use Solr’s content acquisition features
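
A pre-processor makes the dependency between components explicit, because the pipeline order is under your control. The sketch below assumes two placeholder stages and a Solr version whose JSON update handler accepts a plain array of documents; the stage names and field names are hypothetical, and no network call is made.

```python
import json

def detect_language(doc):
    doc["language"] = "en"  # placeholder for a real language identifier
    return doc

def tag_entities(doc):
    # Depends on detect_language: entity models are usually language-specific.
    if "language" not in doc:
        raise ValueError("language must be identified before entity tagging")
    # Toy rule: treat capitalized words as entities.
    doc["entities"] = [w for w in doc["body"].split() if w.istitle()]
    return doc

PIPELINE = [detect_language, tag_entities]  # order encodes the dependency

def preprocess(docs):
    """Run every document through the pipeline, leaving the inputs untouched."""
    out = []
    for doc in docs:
        doc = dict(doc)
        for stage in PIPELINE:
            doc = stage(doc)
        out.append(doc)
    return out

def solr_update_payload(docs):
    """JSON body to send to Solr as the last step: a list of documents."""
    return json.dumps(preprocess(docs), ensure_ascii=False)
```

Since Solr only sees the finished payload, the pipeline can be scaled or restarted independently of the index, at the cost of running and monitoring a separate system.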

Page 32: Search is Not Enough: Using Solr for Analytics

Integration Summary

• There are Many Options!

• Document-Level Analysis:

Generally safe to run in an UpdateRequestProcessor

• Sub-Document Analysis:

UpdateRequestProcessor or external

• Cross-Document Analysis:

Run external

• Multiple-Analysis Components:

Run external document processing pipeline

Page 33: Search is Not Enough: Using Solr for Analytics

Other Concerns

• Re-indexing may be expensive, so when linking against structured data:

Index RowID if structured DB allows changes

• Retrieve row details at page rendering time to enable faceting

Index content if DB is static

• FieldCollapsing of Similar Documents

Powerful way to reduce the number of results without losing information
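
The row-ID approach can be sketched with an in-memory SQLite database standing in for the structured source. The `parts` table, its columns, and the shape of `hits` are assumptions made for illustration; the point is that a database update is visible at render time without re-indexing.

```python
import sqlite3

def render_hits(hits, conn):
    """Attach current row details to each search hit at page-rendering time."""
    ids = [h["row_id"] for h in hits]
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, name, price FROM parts WHERE id IN ({marks})", ids
    ).fetchall()
    by_id = {r[0]: {"name": r[1], "price": r[2]} for r in rows}
    return [{**h, **by_id.get(h["row_id"], {})} for h in hits]

# Hypothetical structured source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.execute("INSERT INTO parts VALUES (7, 'widget', 9.5)")

# What the index returned: only the row ID was indexed, not the row content.
hits = [{"doc": "manual.pdf", "row_id": 7}]

# The database changes after indexing; no re-index is needed.
conn.execute("UPDATE parts SET price = 12.0 WHERE id = 7")
page = render_hits(hits, conn)
```

If the database is static, copying the row content into the index avoids the extra lookup on every page view.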

Page 34: Search is Not Enough: Using Solr for Analytics

Dashboard

Page 35: Search is Not Enough: Using Solr for Analytics

Search and Filter

Page 36: Search is Not Enough: Using Solr for Analytics

Detailed Document View

Page 37: Search is Not Enough: Using Solr for Analytics

Entity Search – Cross Language

Page 38: Search is Not Enough: Using Solr for Analytics

Search/Filter/Explore

Page 39: Search is Not Enough: Using Solr for Analytics

Summary

Text Analytics Enables Productive Search

Page 40: Search is Not Enough: Using Solr for Analytics

For More Information

• Visit www.basistech.com

• Write to [email protected]

• Call 617-386-2090 or 800-697-2062

Page 41: Search is Not Enough: Using Solr for Analytics

Steve Kearns

Director of Product Management

www.basistech.com

Thank You!