2010 10-building-global-listening-platform-with-solr

28
Building a Global Listening Platform with Solr Steve Kearns Rosette Product Manager Basis Technology October 7, 2010 Monday, October 04, 2010 2

Upload: lucidworks-archived

Post on 08-May-2015

1.006 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 2010 10-building-global-listening-platform-with-solr

Building a Global Listening Platform with Solr

Steve Kearns

Rosette Product Manager

Basis TechnologyOctober 7, 2010

Monday, October 04, 2010

2

Page 2: 2010 10-building-global-listening-platform-with-solr

Agenda

• Agenda

• Who Am I?

• What is a “Listening Platform”?

• Challenges (Technical & Global)

• Details

• Demonstration

3

Page 3: 2010 10-building-global-listening-platform-with-solr

About me

• Product Manager at Basis Technology

– Rosette linguistics platform

• Language ID

• Language Support for Search

• Entity Extraction

• Entity Translation/Search

• …

• Related history

– Media Monitoring at BBN Technologies

• Video, Web content extraction: STT, MT, Search

4

Page 4: 2010 10-building-global-listening-platform-with-solr

What is a Listening Platform?

• Content aggregator for online media

• Targets:

– Social/Brand monitoring

– Government OSINT

• Functions:

– Content acquisition

– Content analysis

– Search indexing

– Search (UI)

– Visualization

5

Page 5: 2010 10-building-global-listening-platform-with-solr

Content Acquisition

• What:

– News Articles

– Social Media

• How:

– Web Crawler

• Nutch!

– RSS feed reader/aggregator

• ROME/Curn

– Pay a 3rd party aggregator

• Good option, if you can afford it.

6

Page 6: 2010 10-building-global-listening-platform-with-solr

Content Acquisition Problems

• Web Crawling

– Content Extraction!

• BoilerPipe, Readability

– Crawl History

• Duplicate detection

• Article updates?

– Crawl Control

• Crawl Depth?

• Crawl Restrictions?

• Per-site configuration doesn’t scale.

7

Page 7: 2010 10-building-global-listening-platform-with-solr

Content Acquisition: How

• CURN – Customizable Utilitarian RSS Notifier

8

CURN History

RSS

Feed

ListSolr

RSS Feed

Download Story URL

New

story?

Extract Content

(BoilerPipe || Readability)

Create Solr Message

End

CURN

Output

Plug-In

NoYes

Page 8: 2010 10-building-global-listening-platform-with-solr

What is a Listening Platform?

• Content aggregator for online media

• Targets:

– Social/Brand monitoring

– Government OSINT

• Functions:

– Content acquisition

– Content analysis

– Search indexing

– Other visualization

9

Page 9: 2010 10-building-global-listening-platform-with-solr

Content Analysis

• Language Identification

• Entity Extraction

• Relationship Extraction

• Classification

• Near-Duplicate Detection

• Story Tracking

10

Page 10: 2010 10-building-global-listening-platform-with-solr

Content Analysis: How?

• Preprocessor to Solr

– Custom

– OpenPipeline

• Solr UpdateRequestProcessors

– Chain of URP’s defined in SolrConfig

– Add, edit, remove fields

11

Page 11: 2010 10-building-global-listening-platform-with-solr

Content Analysis: How

• Custom distributed processing pipeline

– Complexity of components

– Number of components

– Some components require their own data storage

• Solr Indexing is the final processing step

12

Page 12: 2010 10-building-global-listening-platform-with-solr

Content Analysis Details

• Language Identification for:

– Indexing

– Faceting/Searching

– Entity Extraction

• Language-specific indexing for:

– Improved recall with high precision

• Entity Extraction for:

– Faceting

– Entity search

– Input to relationship extraction

13

Page 13: 2010 10-building-global-listening-platform-with-solr

Language Identification

• Detect dominant language

• Find language regions

14

Page 14: 2010 10-building-global-listening-platform-with-solr

Language-Specific Indexing

• Every language has unique challenges:

– Tokenization

• Morphological Analysis vs. N-Gram

– Stemming vs. Lemmatization

• All European and Middle Eastern languages

– Compound words

• Swedish, Danish, Norwegian, Dutch, German

15

Page 15: 2010 10-building-global-listening-platform-with-solr

Morphological Analysis vs. N-Gram

• Search Term: 東京東京東京東京 ルパン上映時間ルパン上映時間ルパン上映時間ルパン上映時間

• N-Gram:

• Morphological:

16

Page 16: 2010 10-building-global-listening-platform-with-solr

Stemming vs. Lemmatization

• Stemming:

– Set of rules for removing characters from words

– Increased recall at the expense of precision

– Example EN rule: Remove trailing “ing” or “al”

• Lemmatization:

– Complex set of approaches for producing the

dictionary form of a word

– Increased recall without hurting precision

– Uses context to disambiguate candidates

17

Page 17: 2010 10-building-global-listening-platform-with-solr

Stemming vs. Lemmatization

• English: “I have spoken at several conferences”

• Stemming:

• Lemmatization:

18

Page 18: 2010 10-building-global-listening-platform-with-solr

Stemming vs. Lemmatization

• German: “Am Samstagmorgen fliege ich zurueck nach Boston.”

• Stemming:

• Lemmatization (and decompounding!)

19

Page 19: 2010 10-building-global-listening-platform-with-solr

Stemming and Lemmatization Challenges

• Can I index text from many languages into the same field?

– Yes, but it’s not always a good idea!

• Query language ID is not accurate.

– You need a custom Query Analyzer that does

stemming/lemmatization in many languages for

the same query.

• How do I query text in multiple fields?

– Dismax parser allows you to specify multiple

fields to search.

20

Page 20: 2010 10-building-global-listening-platform-with-solr

Entity Extraction

• Process of identifying people, places, organizations, dates,

times, etc. in unstructured text.

• Methods:– List-based

– Rules-based

– Statistical-based

• Define your goals upfront!– Some extraction methods work better for certain entity types

• Rules work well for dates, email addresses, and URL’s, but not people

• Lists work well for titles, but not locations

• Statistical extractors work well for ambiguous entities like people,

locations, organizations

21

Page 21: 2010 10-building-global-listening-platform-with-solr

Entity Extraction Example

22

Page 22: 2010 10-building-global-listening-platform-with-solr

User Interface

• Google-style search results aren’t enough!

• Design UI around workflow

and/or

• Design for Exploration

23

Page 23: 2010 10-building-global-listening-platform-with-solr

Dashboard/Summary

24

Page 24: 2010 10-building-global-listening-platform-with-solr

Faceting

25

Page 25: 2010 10-building-global-listening-platform-with-solr

Link Analysis

26

Page 26: 2010 10-building-global-listening-platform-with-solr

Details

• Data Acquisition: Curn RSS Aggregator

• Analysis:

– Basis Technology:

• Language ID

• Search Enablement

• Entity Extraction

• Indexing: Solr

• UI:– JSP

– Yahoo UI (YUI)

27

• Relationship Extraction

• Entity Search

– Javascript InfoViz Toolkit (theJit.org)

– gRaphaël (g.raphaeljs.com)

Page 27: 2010 10-building-global-listening-platform-with-solr

Architecture

28

CURN(RSS Harvester)

MySQL(Long Term Datastore)

Rosette Analysis

Components• Lang ID

• Entity Extraction

• Relationship Extraction

Document

Classification and

Clustering

Indexing /

Query Service

SolrName Indexer

User Interface

Page 28: 2010 10-building-global-listening-platform-with-solr

Demo

• Listening Platform built on Solr

• I built this version in 3 months using Solr and products from Basis Technology

• I would be happy to show you the Solr config and let you try it out

29