Search Is Not Enough: Using Solr for Analytics

Steve Kearns

Director of Product Management

www.basistech.com

Agenda

• Basis Technology

• Search and Metadata

• Text + Text Analytics = Metadata

• Solr += Analytics

Configuration

Interface

• Conclusion

About Basis Technology

• Global leader in computational linguistics as applied to search-based applications, information discovery, and identity resolution

• Developer of the most capable, most mature, and most widely used platform for multilingual text analytics

• Solutions for commercial enterprises expanding globally and for government agencies dealing with foreign intelligence

• Offices: Boston, Washington, San Francisco, London, Tokyo

Search

Metadata

Search and Metadata

• Search alone

Helps find documents/records

May return many unnecessary results

Inefficient way to solve a specific problem

• Search with Metadata

New ways to visualize, navigate and explore

Helps enable users to take action on documents/records

Provides context to aid decision making

New ways to connect disparate data sources

New knobs to tune relevance

• Structured Data Sources

Link unstructured data against structured data

Metadata in Action

Metadata – Where does it come from?

• Structured Information associated with documents

Author, publish date, part number, price, provenance, etc.

• Manual Annotation

• Text Analytics

Text Analytics

A set of automated analytical methods designed to add structure to unstructured content

Text Analytics techniques

Categorization of Text Analytics Technology

• Document-level Analytics

Language Identification

Summarization

Categorization

• Sub-document Analysis

Lemmatization – Improved Search

Entity Extraction

Fact/Relationship Extraction

Topic Extraction

Sentiment

• Cross-Document Analysis

Document Clustering (at index or query time)

Entity Search and Co-reference Resolution

Text Analytics in Action: E-Discovery

• Demo!

Document Level Analysis: Language Identification

• Sub-document Lang ID is possible

La Grande-Bretagne a de son côté jugé que l'accord de Luxembourg constituait un véritable changement dans la stratégie agricole de l'Europe, tandis que l'Irlande y a vu un gage de stabilité et de sécurité pour les agriculteurs. Le président nigérian Olusegun Obasanjo a salué cette l'engagement du G8, déclarant que "la condition majeure au développement est l'absence de conflit". La porte-parole de la présidence française, Catherine Colonna, a pour sa part qualifié la réunion d'"exceptionnelle".

Американская софтверная компания становится пользующимся спросом у спецслужб США экспертом в области лингвистики (в частности, изучения и обработки информации на арабском языке) после терактов 11 сентября 2001 г. В данный момент правительство США, обвиняющее радикальную мусульманскую группировку "Аль Каида" в терактах 2 года назад, активизирует свое внимание к арабскому языку и программам его обработки. Грамматика языков данной группы

「端末側で行単位に（あるいは一画面分）編集しておいて、送信キーによりまとめて送信する」という方式と、「端末には知能はなく、一字一字すべてがその都度送られ処理される」という方式は、究極的に前者は半二重通信、後者は全二重通信とフィットします。後者では、入力のエコーもコンピュータ側で制御されます。つまり、入力した字の表示はキー入力がコンピュータに送られ、それが送り返されて表示されます。FNPがコンピュータと端末の間にあって、実際の端末とのやりとりを制御するのです。そして、コンピュータとFNPの間の通信は、少量の転送には不向きで、大量の一括転送に向いていました。FNPによるコンピュータへの割り込み要求は高価なものだったからです。Multicsでのプロセスのwake upも高価だということもありました。私ごとになりますが、ちょうどこのころ大学院生でしたが、ACOS-6用のある言語処理系の開発を請け負って作っていました。ACOS-6はMulticsの概念に非常に近いものを持っていました、あるいは持とうとしていました。また、ハードウェアも大変似ていました。シールをはがすと、その下から別のアメリカの会社の名前が出てくるマシンでテストしたこともありました。１年間ほとんど休みなしにマシンルームにこもっていて、ここでの議論と疑問を自分のテーマとしても扱ったことがあるのです。それで、よーくわかるのです。

Программное обеспечение Basis Technology позволяет осуществлять поиск слов с близкими значениями, а также транслитерировать арабские и фарси-буквы в латинские. Продукт был разработан по специальному заказу правительства США с целью оптимизации процесса анализа арабских текстов.

French

La Grande-Bretagne a de son côté jugé que l'accord de Luxembourg constituait un véritable changement dans la stratégie

Après avoir rencontré les présidents de quatre des cinq pays africains (Afrique du Sud, Algérie, Sénégal, Nigeria) membres du comité de pilotage du

Le président nigérian Olusegun Obasanjo a salué cette l'engagement du G8, déclarant que "la condition majeure au développement est

Russian

Программное обеспечение Basis Technology позволяет осуществлять поиск слов с близкими значениями, а также транслитерировать

Американская софтверная компания становится пользующимся спросом у спецслужб США экспертом в области

В данный момент правительство США, обвиняющее радикальную мусульманскую группировку "Аль Каида" в терактах 2

Japanese

FNPがコンピュータと端末の間にあって、実際の端末とのやりとりを制御するのです。そして、コンピュータとFNPの間の通信は、少量の転送には不向きで、大量の一括転送に向いていまし

Bild vergrößern Berlin (AP) Der Kanzler strahlte: «Ich gestehe, dass ich 90 Prozent Zustimmung

EVIAN (AP) - Les membres du G8 se sont engagés dimanche soir à soutenir la

これはファンドマネージャーさんが嘘をついているというわけではありません。計算

ilHaaqa-n bikitaabinaa s-sirriyyi r-raqiimi fii yurjae ittikhaadha maa yulzamu

German 29%, French 33%, Japanese 21%, Arabic 17%
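The idea of labeling each passage of a mixed-language document can be sketched with a toy detector that looks only at Unicode scripts. This is a crude stand-in for real language identification, not the method used in the talk: it cannot tell French from German (both Latin script), and transliterated Arabic would be reported as Latin. The label names and the choice to treat all CJK ideographs as Japanese are assumptions made for this illustration.

```python
import unicodedata
from collections import Counter

# Prefixes of Unicode character names, mapped to coarse script labels.
# Assumption for this toy: CJK ideographs are counted as Japanese.
SCRIPT_PREFIXES = (
    ("CYRILLIC", "Cyrillic"),
    ("HIRAGANA", "Japanese"),
    ("KATAKANA", "Japanese"),
    ("CJK", "Japanese"),
    ("ARABIC", "Arabic"),
    ("LATIN", "Latin"),
)


def script_of(ch):
    """Map one character to a coarse script label via its Unicode name."""
    name = unicodedata.name(ch, "")
    for prefix, label in SCRIPT_PREFIXES:
        if name.startswith(prefix):
            return label
    return "Other"


def dominant_script(text):
    """Return the most frequent script among the letters of text."""
    counts = Counter(script_of(ch) for ch in text if ch.isalpha())
    return counts.most_common(1)[0][0] if counts else "Unknown"


def label_segments(segments):
    """Label each passage separately, which is the sub-document idea."""
    return [(dominant_script(segment), segment) for segment in segments]
```

Calling `label_segments` on the paragraphs of one document gives one label per paragraph rather than a single label for the whole file.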

Document Level Analysis: Categorization

• Group Documents into Pre-defined categories

http://news.google.com/

http://www.bbc.co.uk/

Sub-Document Analysis: Linguistics

• Segmentation of Asian languages

• Lemmatization

N-Gram

Morphological

Segmentation

Stemming

Lemmatization
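The contrasts named on this slide (N-gram versus morphological segmentation, stemming versus lemmatization) can be illustrated with a small sketch. Everything here is a toy: the suffix list and the lemma dictionary are hypothetical, and a real morphological analyzer is far more involved.

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams: a dictionary-free way to index unsegmented text."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def naive_stem(word):
    """Crude suffix stripping; illustrative only, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Hypothetical, tiny lemma dictionary standing in for morphological analysis.
LEMMAS = {"ran": "run", "running": "run", "mice": "mouse"}


def lemmatize(word):
    """Look up the dictionary form; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)
```

For example, `char_ngrams("東京都庁")` yields the bigrams 東京, 京都, and 都庁. The middle one, 京都 (Kyoto), is a false match that a morphological segmenter would avoid. Likewise `naive_stem("running")` gives `"runn"`, while `lemmatize("running")` gives `"run"`.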

Sub-Document Analysis: Sentiment

• Sentence, paragraph, entity, aspect, emotion

http://twittersentiment.appspot.com/search?query=Lucene

http://maps.google.com/maps/place?cid=7410753351872099397

Sub-Document Analysis: Entity Extraction

• Identify Named Concepts in Unstructured Text

Statistical, rules, lists

http://www.twitscoop.com/
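Two of the three approaches listed for entity extraction, lists and rules, are simple enough to sketch. The gazetteer entries, the type labels, and the date pattern below are assumptions chosen for illustration. The statistical approach is not shown.

```python
import re

# Hypothetical gazetteer (list-based extraction): surface form -> entity type.
GAZETTEER = {
    "Olusegun Obasanjo": "PERSON",
    "Catherine Colonna": "PERSON",
    "Luxembourg": "LOCATION",
    "G8": "ORGANIZATION",
}

# Hypothetical rule (pattern-based extraction): dates such as "11 September 2001".
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RULE = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")


def extract_entities(text):
    """Return (surface form, type) pairs in order of appearance."""
    found = []
    for name, entity_type in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), entity_type))
    for match in DATE_RULE.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    return [(surface, entity_type) for _, surface, entity_type in sorted(found)]
```

Lists are precise but only find what they already contain. Rules generalize to unseen values of a regular shape, such as dates or part numbers.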

Sub-Document Analysis: Fact / Relationship / Event Extraction

• Identify Facts, Link Entities, Events and Times

http://www.silobreaker.com/FlashNetwork.aspx?DrillDownItems=11_237360

Cross-Document: Entity Co-reference Resolution

• Map extracted entities to real-world Concepts

Cross-Document Analysis: Clustering

• Near Duplicate Detection

• Unsupervised Clustering
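One common way to approach near-duplicate detection is to compare sets of word shingles with Jaccard similarity. The sketch below uses that technique as an example; it is not necessarily what the systems in the talk use, and the shingle size and threshold are arbitrary assumptions.

```python
def shingles(text, k=3):
    """Set of overlapping k-word sequences from text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5):
    """Return pairs of document ids whose shingle sets overlap at or above threshold."""
    ids = sorted(docs)
    sets = {doc_id: shingles(docs[doc_id]) for doc_id in ids}
    return [
        (x, y)
        for n, x in enumerate(ids)
        for y in ids[n + 1:]
        if jaccard(sets[x], sets[y]) >= threshold
    ]
```

Comparing every pair is quadratic in the number of documents, which is one reason cross-document analysis is usually run outside the indexing path.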

Text Analytics: Entity Search

Solr += Analytics

Text Analytics in/around Solr

• Analyzer/Tokenizer/TokenFilter

• UpdateRequestProcessor

Run Analysis in Solr

Call External Analysis Service

• Pre-Processor to Solr

Integration Point: Analyzer

• Good for:

Linguistics

Segmentation of Asian Languages

Customized Segmentation

• Limitations:

No access to document object

• An Analyzer is:

CharFilter

Tokenizer

Set of TokenFilters
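The three-part structure of an analyzer can be modeled in a few lines. This is a plain-Python sketch of the data flow only, not the Lucene API; all function names are made up for the illustration.

```python
import re


def strip_html(text):
    """CharFilter role: rewrite the raw character stream before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer role: turn the character stream into tokens."""
    return text.split()


def lowercase_filter(tokens):
    """TokenFilter role: transform tokens one by one."""
    return [token.lower() for token in tokens]


def stopword_filter(tokens, stopwords=frozenset({"the", "a", "of"})):
    """TokenFilter role: drop tokens."""
    return [token for token in tokens if token not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run char filters, then the tokenizer, then each token filter in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens
```

Note that `analyze` receives only the text of one field. That mirrors the limitation on the slide: nothing in the chain can see the rest of the document.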

Analyzer/Tokenizer Configuration

• schema.xml

FieldType

• Analyzer (Index)

– CharFilter

– Tokenizer

– TokenFilter

• Analyzer (Query)
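As a rough illustration of the layout above, the sketch below embeds a schema.xml fragment and parses it to list the index-time and query-time chains. The field type name `text_analytics` is hypothetical, and the factory classes are stock Solr ones picked as placeholders; a linguistics plugin would substitute its own tokenizer or filter factories.

```python
import xml.etree.ElementTree as ET

# Hypothetical fieldType; the name and the choice of factories are assumptions.
SCHEMA_FRAGMENT = """
<fieldType name="text_analytics" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
"""


def analyzer_chains(fragment):
    """Map each analyzer type (index, query) to its ordered list of components."""
    field_type = ET.fromstring(fragment)
    return {
        analyzer.get("type"): [(child.tag, child.get("class")) for child in analyzer]
        for analyzer in field_type.findall("analyzer")
    }
```

Keeping separate index and query analyzers lets the two sides differ, for example expanding synonyms on only one of them.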

Integration Point: UpdateRequestProcessor

• Runs Before Analyzers

• Full Access to Document

• Two options:

Run the analysis directly in Solr

Call out to external analysis services

• Limitations:

Think through your indexing strategy

Integration Point: UpdateRequestProcessor

• Run the analysis directly in Solr

Good for lightweight, stateless document analytics

Not good for cross-document analytics

• Call out to external analysis services

Web Services, UIMA, OpenPipeline, GATE, custom code

Note that these external calls are synchronous

Additional complexity / points of failure
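The two options can be pictured as links in a chain, each of which sees the whole document and then hands it to the next link. The sketch below is a plain-Python model of that flow, not Solr's Java API; the class names, field names, and the `service` callable are all hypothetical.

```python
class Processor:
    """One link in the chain; passes the document on to the next link."""

    def __init__(self, next_processor=None):
        self.next = next_processor

    def process_add(self, doc):
        if self.next is not None:
            self.next.process_add(doc)


class WordCountProcessor(Processor):
    """Option 1: lightweight, stateless analysis run in-process."""

    def process_add(self, doc):
        doc["word_count"] = len(doc.get("body", "").split())
        super().process_add(doc)


class ExternalServiceProcessor(Processor):
    """Option 2: call out to an analysis service (a plain callable in this sketch)."""

    def __init__(self, service, next_processor=None):
        super().__init__(next_processor)
        self.service = service

    def process_add(self, doc):
        # Synchronous call: indexing of this document waits for the service.
        doc["entities"] = self.service(doc.get("body", ""))
        super().process_add(doc)


class IndexingProcessor(Processor):
    """End of the chain: stands in for the step that writes to the index."""

    def __init__(self):
        super().__init__()
        self.indexed = []

    def process_add(self, doc):
        self.indexed.append(doc)
```

Because each processor handles one document at a time and keeps no state, this shape suits document-level and sub-document analysis but not cross-document analysis.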

UpdateRequestProcessor Configuration

• solrconfig.xml

RequestHandler

• update.processor = UpdateRequestProcessorChain.name

UpdateRequestProcessorChain

• Processors
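A possible shape for this configuration is sketched below as a solrconfig.xml fragment, parsed to show how the request handler's `update.processor` parameter selects a chain by name. The chain name `analytics` and the class `com.example.EntityTaggingProcessorFactory` are hypothetical. The parameter name follows the slide; as far as I know, later Solr releases renamed it to `update.chain`, so check the version in use.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the <config> wrapper stands for the rest of solrconfig.xml.
SOLRCONFIG_FRAGMENT = """
<config>
  <updateRequestProcessorChain name="analytics">
    <processor class="com.example.EntityTaggingProcessorFactory"/>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.processor">analytics</str>
    </lst>
  </requestHandler>
</config>
"""


def processors_for_handler(fragment, handler_name):
    """Follow handler -> update.processor -> chain and list the processor classes."""
    root = ET.fromstring(fragment)
    chain_name = None
    for handler in root.iter("requestHandler"):
        if handler.get("name") != handler_name:
            continue
        for param in handler.iter("str"):
            if param.get("name") == "update.processor":
                chain_name = param.text
    for chain in root.iter("updateRequestProcessorChain"):
        if chain.get("name") == chain_name:
            return [p.get("class") for p in chain.findall("processor")]
    return []
```

The custom processor is listed before `solr.RunUpdateProcessorFactory`, since that is the step that actually performs the update.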

Integration Point: Pre-Processor

• Index in Solr as Last Step of Analysis

• Good For:

Finer-grained control

Managing dependencies between analytic components

Scalability

• Limitations:

Complexity / New points of failure

Cannot use Solr’s content acquisition features
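A pre-processor in this sense is an ordinary program that runs its analysis stages in a controlled order and only then hands finished documents to Solr. The sketch below shows that shape under several assumptions: the stages are toys, the name lists are invented, and the update URL is a guess at a typical JSON update endpoint. The request is built but never sent.

```python
import json
from urllib.request import Request

SOLR_UPDATE_URL = "http://localhost:8983/solr/update/json"  # assumed endpoint

# Hypothetical per-language name lists used by the toy entity stage.
KNOWN_NAMES = {"en": ("Basis Technology", "Solr"), "ru": ("Basis Technology", "США")}


def tag_language(doc):
    """Toy language stage: any Cyrillic letter means 'ru', otherwise 'en'."""
    is_russian = any("\u0400" <= ch <= "\u04FF" for ch in doc["text"])
    return {**doc, "language": "ru" if is_russian else "en"}


def tag_entities(doc):
    """Toy entity stage; depends on tag_language having run first."""
    names = KNOWN_NAMES.get(doc["language"], ())
    return {**doc, "entities": [name for name in names if name in doc["text"]]}


def run_pipeline(doc, stages):
    """Apply stages in order, so dependencies between them are explicit."""
    for stage in stages:
        doc = stage(doc)
    return doc


def build_index_request(docs, url=SOLR_UPDATE_URL):
    """Last step: package the analyzed documents as a JSON update request."""
    body = json.dumps(docs).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"}, method="POST")
```

Sending the request, for example with `urllib.request.urlopen`, is left out so the sketch runs without a Solr server.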

Integration Summary

• There are Many Options!

• Document-Level Analysis:

Generally safe to run in UpdateRequestProcessor

• Sub-Document Analysis:

UpdateRequestProcessor or external

• Cross-Document Analysis:

Run external

• Multiple-Analysis Components:

Run external document processing pipeline

Other Concerns

• Re-indexing may be expensive, so when linking against structured data:

Index RowID if structured DB allows changes

• Retrieve row details at page rendering time to enable faceting

Index content if DB is static

• FieldCollapsing of Similar Documents

Powerful way to reduce the number of results without losing information
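The RowID approach can be sketched as follows: the index stores only a row identifier, and the details are fetched from the database when results are rendered, so a changed row does not force a re-index. The `parts` table, its columns, and the shape of a search hit are assumptions for this illustration, with SQLite standing in for the structured database.

```python
import sqlite3


def demo_database():
    """In-memory stand-in for the structured database (hypothetical schema)."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE parts (name TEXT, price REAL)")
    connection.executemany(
        "INSERT INTO parts (name, price) VALUES (?, ?)",
        [("widget", 9.99), ("gadget", 24.5)],
    )
    return connection


def render_results(hits, connection):
    """Attach the current row details to each search hit at render time."""
    rendered = []
    for hit in hits:
        row = connection.execute(
            "SELECT name, price FROM parts WHERE rowid = ?", (hit["row_id"],)
        ).fetchone()
        name, price = row if row is not None else (None, None)
        rendered.append({"title": hit["title"], "part": name, "price": price})
    return rendered
```

The trade-off is that values fetched this way are not in the index, so they cannot be searched or faceted on; indexing the content directly is the simpler choice when the database is static.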

Dashboard

Search and Filter

Detailed Document View

Entity Search – Cross Language

Search/Filter/Explore

Summary

Text Analytics Enables Productive Search

For More Information

• Visit www.basistech.com

• Write to [email protected]

• Call 617-386-2090 or 800-697-2062

Thank You!
