Building a Global Listening Platform with Solr
Steve Kearns
Rosette Product Manager
Basis Technology
October 7, 2010
Monday, October 04, 2010
2
Agenda
• Agenda
• Who Am I?
• What is a “Listening Platform”?
• Challenges (Technical & Global)
• Details
• Demonstration
3
About me
• Product Manager at Basis Technology
– Rosette linguistics platform
• Language ID
• Language Support for Search
• Entity Extraction
• Entity Translation/Search
• …
• Related history
– Media Monitoring at BBN Technologies
• Video, Web content extraction: STT, MT, Search
4
What is a Listening Platform?
• Content aggregator for online media
• Targets:
– Social/Brand monitoring
– Government OSINT
• Functions:
– Content acquisition
– Content analysis
– Search indexing
– Search (UI)
– Visualization
5
Content Acquisition
• What:
– News Articles
– Social Media
• How:
– Web Crawler
• Nutch!
– RSS feed reader/aggregator
• ROME/Curn
– Pay a 3rd party aggregator
• Good option, if you can afford it.
6
Content Acquisition Problems
• Web Crawling
– Content Extraction!
• BoilerPipe, Readability
– Crawl History
• Duplicate detection
• Article updates?
– Crawl Control
• Crawl Depth?
• Crawl Restrictions?
• Per-site configuration doesn’t scale.
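The crawl-history questions above (duplicate detection, article updates) can be sketched with a content fingerprint per URL. This is a minimal in-memory illustration, not the approach used in the talk; the class name, the method names, and the choice of SHA-1 over whitespace-normalized text are all assumptions made for the example.

```python
import hashlib


class CrawlHistory:
    """Hypothetical in-memory crawl history keyed by URL and content digest."""

    def __init__(self):
        self._by_url = {}      # url -> digest of the last text seen at that URL
        self._digests = set()  # every digest seen so far, regardless of URL

    @staticmethod
    def fingerprint(text):
        # Normalize case and whitespace so trivial changes do not look like updates.
        normalized = " ".join(text.lower().split())
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def classify(self, url, text):
        """Return 'new', 'duplicate', or 'updated', and record what was seen."""
        digest = self.fingerprint(text)
        previous = self._by_url.get(url)
        if previous == digest:
            return "duplicate"          # same URL, same content
        if previous is not None:
            status = "updated"          # same URL, content changed
        elif digest in self._digests:
            status = "duplicate"        # same content under a different URL
        else:
            status = "new"
        self._by_url[url] = digest
        self._digests.add(digest)
        return status
```

Exact hashing only catches identical text; near-duplicates would need a similarity measure instead.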
7
Content Acquisition: How
• CURN – Customizable Utilitarian RSS Notifier
8
CURN History
Flowchart (CURN Output Plug-In):
• RSS Feed List → CURN → RSS Feed
• New story?
– No: End
– Yes: Download Story URL → Extract Content (BoilerPipe || Readability) → Create Solr Message → Solr
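The flow on this slide can be sketched roughly as follows. CURN and its plug-in API are Java, so this Python version only mirrors the steps; `process_feed`, `download`, and `extract` are hypothetical names, and the field names `id`, `title`, and `text` are assumed. The `<add><doc><field name="...">` shape is Solr's XML update message format.

```python
import xml.etree.ElementTree as ET


def build_solr_add_message(doc):
    """Serialize a dict of field name -> value as a Solr XML <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


def process_feed(items, seen_urls, download, extract):
    """Follow the flowchart: skip known stories, otherwise download, extract, build message."""
    messages = []
    for item in items:
        url = item["link"]
        if url in seen_urls:        # "New story?" -> No -> End
            continue
        seen_urls.add(url)
        html = download(url)        # "Download Story URL" (caller-supplied function)
        text = extract(html)        # "Extract Content" (stand-in for BoilerPipe/Readability)
        messages.append(
            build_solr_add_message({"id": url, "title": item["title"], "text": text})
        )
    return messages
```

Passing `download` and `extract` in as functions keeps the sketch independent of any particular HTTP client or extraction library.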
What is a Listening Platform?
• Content aggregator for online media
• Targets:
– Social/Brand monitoring
– Government OSINT
• Functions:
– Content acquisition
– Content analysis
– Search indexing
– Other visualization
9
Content Analysis
• Language Identification
• Entity Extraction
• Relationship Extraction
• Classification
• Near-Duplicate Detection
• Story Tracking
10
Content Analysis: How?
• Preprocessor to Solr
– Custom
– OpenPipeline
• Solr UpdateRequestProcessors
– Chain of URPs defined in solrconfig.xml
– Add, edit, remove fields
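The idea of a processor chain that adds, edits, and removes fields can be modeled in a few lines. Real UpdateRequestProcessors are Java classes wired together in the Solr configuration; the Python below is only a conceptual model, and every function and field name in it is hypothetical.

```python
def add_language(doc):
    # Add a field. The check is a crude placeholder, not a real language identifier.
    padded = f" {doc['text'].lower()} "
    doc["language"] = "en" if " the " in padded else "unknown"
    return doc


def trim_title(doc):
    # Edit an existing field.
    doc["title"] = doc["title"].strip()
    return doc


def drop_raw_html(doc):
    # Remove a field that should not reach the index.
    doc.pop("raw_html", None)
    return doc


def run_chain(chain, doc):
    """Pass the document through each processor in order, like a URP chain."""
    for processor in chain:
        doc = processor(doc)
    return doc


CHAIN = [add_language, trim_title, drop_raw_html]
```

Order matters: a processor can only rely on fields that earlier processors in the chain have already produced.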
11
Content Analysis: How
• Custom distributed processing pipeline
– Complexity of components
– Number of components
– Some components require their own data storage
• Solr Indexing is the final processing step
12
Content Analysis Details
• Language Identification for:
– Indexing
– Faceting/Searching
– Entity Extraction
• Language-specific indexing for:
– Improved recall with high precision
• Entity Extraction for:
– Faceting
– Entity search
– Input to relationship extraction
13
Language Identification
• Detect dominant language
• Find language regions
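To make the two tasks concrete, here is a toy sketch that guesses a language per sentence (the "regions") and then takes the most frequent guess as the dominant language. It scores sentences against tiny stopword lists, which is an assumption chosen for brevity; production language identifiers use much richer models, and nothing here reflects how Rosette works.

```python
import re
from collections import Counter

# Tiny illustrative stopword lists (assumed for this example only).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "at", "have"},
    "de": {"der", "die", "und", "ist", "nach", "ich", "am", "nicht"},
    "fr": {"le", "la", "et", "est", "les", "des", "je", "dans"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or 'unknown'."""
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def language_regions(text):
    """Split on sentence-ending punctuation and label each sentence."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [(guess_language(s), s) for s in sentences]


def dominant_language(text):
    """Most frequent region label, ignoring regions that could not be identified."""
    counts = Counter(lang for lang, _ in language_regions(text) if lang != "unknown")
    return counts.most_common(1)[0][0] if counts else "unknown"
```

Short regions give the scorer very little evidence, which is the same reason language ID on short queries is hard.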
14
Language-Specific Indexing
• Every language has unique challenges:
– Tokenization
• Morphological Analysis vs. N-Gram
– Stemming vs. Lemmatization
• All European and Middle Eastern languages
– Compound words
• Swedish, Danish, Norwegian, Dutch, German
15
Morphological Analysis vs. N-Gram
• Search Term: 東京 ルパン上映時間 (“Tokyo Lupin showtimes”)
• N-Gram:
• Morphological:
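The difference between the two tokenizations can be sketched on the search term from this slide. The bigram function is standard; the "morphological" side is only a greedy longest-match over a hand-made lexicon, used here as a stand-in because a real morphological analyzer needs a full dictionary and a statistical model. `TOY_LEXICON` and both function names are assumptions for the example.

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams; text shorter than n yields nothing."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hand-made lexicon for this one query (hypothetical, for illustration only).
TOY_LEXICON = {"東京", "ルパン", "上映", "時間", "上映時間"}


def longest_match_segment(text, lexicon):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    max_len = max(len(entry) for entry in lexicon)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


query = "東京 ルパン上映時間"
bigrams = [g for term in query.split() for g in char_ngrams(term)]
words = [w for term in query.split() for w in longest_match_segment(term, TOY_LEXICON)]
```

Among the bigrams is パン ("bread"), which has nothing to do with the query, so an n-gram index can match unrelated documents. The dictionary-based split keeps ルパン intact.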
16
Stemming vs. Lemmatization
• Stemming:
– Set of rules for removing characters from words
– Increased recall at the expense of precision
– Example EN rule: Remove trailing “ing” or “al”
• Lemmatization:
– Complex set of approaches for producing the
dictionary form of a word
– Increased recall without hurting precision
– Uses context to disambiguate candidates
17
Stemming vs. Lemmatization
• English: “I have spoken at several conferences”
• Stemming:
• Lemmatization:
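A toy comparison on the English sentence above. The stemmer applies a short list of suffix rules, including the "ing" and "al" rule mentioned earlier; the lemmatizer is just a lookup table. Both are illustrative assumptions: real stemmers have many more rules, and real lemmatizers use dictionaries plus context, which this table does not attempt.

```python
SUFFIX_RULES = ("ing", "al", "es", "s")  # assumed rule set for the example


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini dictionary mapping surface forms to dictionary forms.
TOY_LEMMAS = {"spoken": "speak", "conferences": "conference"}


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is not listed."""
    return TOY_LEMMAS.get(word, word)


sentence = "I have spoken at several conferences"
tokens = sentence.lower().split()
stems = [toy_stem(t) for t in tokens]
lemmas = [toy_lemmatize(t) for t in tokens]
```

With these rules "several" is cut to "sever", which collides with an unrelated verb, while "spoken" is left alone and so never matches "speak". The lookup maps "spoken" to "speak" and leaves "several" untouched.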
18
Stemming vs. Lemmatization
• German: “Am Samstagmorgen fliege ich zurueck nach Boston.” (“On Saturday morning I fly back to Boston.”)
• Stemming:
• Lemmatization (and decompounding!):
19
Stemming and Lemmatization Challenges
• Can I index text from many languages into the same field?
– Yes, but it’s not always a good idea!
• Language ID on queries is not accurate.
– You need a custom Query Analyzer that applies
stemming/lemmatization for several languages to
the same query.
• How do I query text in multiple fields?
– Dismax parser allows you to specify multiple
fields to search.
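As a sketch of the multi-field case, the request below asks the Dismax parser to search several fields at once through the `qf` parameter. `defType`, `qf`, and the `field^boost` syntax are standard Solr request parameters; the base URL and the per-language field names (`title`, `text_en`, `text_de`) are assumptions about a schema that is not shown in the talk.

```python
from urllib.parse import urlencode


def build_dismax_url(base_url, user_query, fields):
    """Build a Solr select URL that searches all given fields via Dismax."""
    params = {
        "q": user_query,
        "defType": "dismax",
        "qf": " ".join(fields),   # space-separated list, optional ^boost per field
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


url = build_dismax_url(
    "http://localhost:8983/solr",            # assumed local Solr instance
    "Boston conference",
    ["title^2", "text_en", "text_de"],       # hypothetical per-language fields
)
```

Each field in `qf` is analyzed with its own analyzer, so one user query can be matched against language-specific fields without guessing the query language.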
20
Entity Extraction
• Process of identifying people, places, organizations, dates,
times, etc. in unstructured text.
• Methods:
– List-based
– Rules-based
– Statistical
• Define your goals upfront!
– Some extraction methods work better for certain entity types
• Rules work well for dates, email addresses, and URLs, but not people
• Lists work well for titles, but not locations
• Statistical extractors work well for ambiguous entities like people,
locations, organizations
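The rules-based and list-based methods are simple enough to sketch directly. The patterns and the title list below are assumptions written for this example and are far from complete; statistical extraction is not shown because it needs a trained model.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Rules: regular expressions for entity types with a predictable shape.
RULES = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL": re.compile(r"https?://[^\s,;]+"),
    "DATE": re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b"),
}

# List: a small gazetteer of job titles (hypothetical entries).
TITLES = ["Product Manager", "Chief Executive Officer"]


def extract_entities(text):
    """Return (label, value) pairs found by the rules and the title list."""
    found = []
    for label, pattern in RULES.items():
        found += [(label, m.group()) for m in pattern.finditer(text)]
    lowered = text.lower()
    for title in TITLES:
        if title.lower() in lowered:
            found.append(("TITLE", title))
    return found


sample = ("Steve Kearns, Product Manager, spoke on October 7, 2010. "
          "Contact info@example.com or see http://example.com/talk for details.")
entities = extract_entities(sample)
```

The person name in the sample is not found by either method, which matches the point above: names are ambiguous and open-ended, so they are better handled by a statistical extractor.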
21
Entity Extraction Example
22
User Interface
• Google-style search results aren’t enough!
• Design UI around workflow
and/or
• Design for Exploration
23
Dashboard/Summary
24
Faceting
25
Link Analysis
26
Details
• Data Acquisition: Curn RSS Aggregator
• Analysis:
– Basis Technology:
• Language ID
• Search Enablement
• Entity Extraction
• Relationship Extraction
• Entity Search
• Indexing: Solr
• UI:
– JSP
– Yahoo UI (YUI)
– JavaScript InfoViz Toolkit (theJit.org)
– gRaphaël (g.raphaeljs.com)
27
Architecture
28
Diagram components:
• CURN (RSS Harvester)
• MySQL (Long Term Datastore)
• Rosette Analysis Components
– Lang ID
– Entity Extraction
– Relationship Extraction
• Document Classification and Clustering
• Indexing / Query Service
– Solr
– Name Indexer
• User Interface
Demo
• Listening Platform built on Solr
• I built this version in 3 months using Solr and products from Basis Technology
• I would be happy to show you the Solr config and let you try it out
29