content analytics for better search
Post on 16-Apr-2017
Content Analytics for Better Search
Otis Gospodnetić, Sematext International
Agenda
Intro: Otis & Sematext
Basic Search
Taming Search Results
Key Phrases
Beyond Search
About Otis Gospodnetić
Member: Apache Lucene/Solr/Nutch/Mahout
Author: Lucene in Action 1 & 2
Entrepreneur: Simpy, Lucene Consulting, Sematext Int'l since 2007
About Sematext
Consulting, development, support:
Big Data (Hadoop, HBase, Voldemort...)
Search (Lucene, Solr, Elastic Search...)
Web Crawling (Nutch)
Machine Learning (Mahout)
Basic Search
Taming Search Results
Related searches (high query volume)
Search results clustering (fuzzy)
Named Entity Recognition (NER)
Faceted search (structured input)
10 days of data (5K/min)
Example: Related Searches
Example: Results Clustering
Example: Named Entities
Sorry, no screenshot, but I know sites use this!
Really, I do!
:)
Example: Faceted Search
Content Analysis: Key Phrases
Related searches
Search results clustering
Named Entity Recognition (NER)
Faceted search
Key Phrases
Collocations
Statistically Improbable Phrases (SIPs)
10 days of data (5K/min)
Example: Key Phrases & Search
Definitions: Collocations
Collocations are phrases whose words appear together in a text more often than you would expect given how frequent each individual word is in that same text.
Source: http://sematext.com/demo/kpe/
See: http://en.wikipedia.org/wiki/Collocation
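A minimal sketch of the collocation idea, assuming pointwise mutual information (PMI) over adjacent word pairs as the "more often than expected" measure. The slides do not say which statistic the Sematext demo uses, so PMI, the function name collocations, and the min_count cutoff are all illustrative assumptions.

```python
import math
from collections import Counter

def collocations(tokens, min_count=2):
    """Rank adjacent word pairs by PMI (an assumed scoring choice).

    PMI compares how often a pair is actually seen with how often it
    would be seen if the two words occurred independently of each other.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log2(p_pair / p_independent)
    # Highest PMI first: pairs seen together far more than chance predicts
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

text = ("new york is big . new york is loud . "
        "the city is big and the city is loud").split()
ranked = collocations(text)
for pair, score in ranked:
    print(" ".join(pair), round(score, 2))
```

Pairs built around a very frequent word such as "is" score lower than pairs of words that mostly occur with each other, which is the behavior the definition describes.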
Definitions: SIPs
Statistically Improbable Phrases are phrases that appear in a text more often than you would expect given how often they appear in another text. In this demo we extract SIPs by comparing texts from two different time periods.
Source: http://sematext.com/demo/kpe/
See: http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases
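A rough sketch of SIP extraction under stated assumptions: phrases are adjacent word pairs, and the score is the ratio of a phrase's rate in the foreground text to its smoothed rate in the background text. The scoring formula, the add-one style smoothing, and the names sips and bigram_counts are hypothetical; the slides do not describe the actual scoring used.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))

def sips(foreground, background, min_count=2, smoothing=1.0):
    """Rank foreground phrases by how over-represented they are
    relative to the background text (assumed ratio-of-rates score)."""
    fg = bigram_counts(foreground)
    bg = bigram_counts(background)
    fg_total = sum(fg.values()) or 1
    bg_total = sum(bg.values()) or 1
    vocab = len(set(fg) | set(bg))
    scored = []
    for phrase, count in fg.items():
        if count < min_count:
            continue
        fg_rate = count / fg_total
        # Smoothing keeps phrases unseen in the background from dividing by zero
        bg_rate = (bg[phrase] + smoothing) / (bg_total + smoothing * vocab)
        scored.append((phrase, fg_rate / bg_rate))
    return sorted(scored, key=lambda item: item[1], reverse=True)

old_text = "the market was calm and the market was quiet".split()
new_text = ("volcanic ash closed airports and volcanic ash grounded flights "
            "but the market was calm and the market was quiet").split()
top = sips(new_text, old_text)
for phrase, score in top:
    print(" ".join(phrase), round(score, 2))
```

A phrase that is common in both texts, like "the market", scores low even though it is frequent, while a phrase that is new to the foreground text rises to the top.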
Language Models
Hybrid Key Phrases
Beyond Search
Content analysis
Trend spotting / Buzz monitoring
Social media
Customer reviews / Brand
Book Content Analysis
SIPs at Amazon
Amazon SIPs are the most distinctive phrases in the text of books in the Search Inside! program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.
SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.
News Content Analysis
Source: http://sematext.com/demo/kpe/
SIPs & News Topic Trending
The text for the new (or "current") period covers the last 7 days, from now back to 7 days ago. The text for the old (or "past") period covers the 7 days before that.
now |-- new text --| (now - 7 days) |-- old text --| (now - 14 days)
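The two periods can be sketched as a simple split of timestamped documents. This is an illustration only: the (timestamp, text) input shape, the function name split_periods, and the choice to put a document dated exactly 7 days back into the old period are assumptions, not details from the demo.

```python
from datetime import datetime, timedelta

def split_periods(docs, now, window_days=7):
    """Split (timestamp, text) pairs into new-period and old-period texts.

    New period: (now - 7 days, now]
    Old period: (now - 14 days, now - 7 days]
    Anything older than 14 days, or dated in the future, is ignored.
    """
    window = timedelta(days=window_days)
    new_start = now - window
    old_start = now - 2 * window
    new_texts, old_texts = [], []
    for ts, text in docs:
        if new_start < ts <= now:
            new_texts.append(text)
        elif old_start < ts <= new_start:
            old_texts.append(text)
    return new_texts, old_texts

now = datetime(2010, 5, 15)
docs = [
    (now - timedelta(days=2), "volcanic ash grounds flights"),
    (now - timedelta(days=10), "markets stay calm"),
    (now - timedelta(days=20), "too old to count"),
]
new_texts, old_texts = split_periods(docs, now)
print(new_texts)
print(old_texts)
```

The new-period text then serves as the foreground and the old-period text as the background when looking for phrases that have recently become more frequent.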
Customer Experience
Mindshare Technologies (MT) is a Voice of the Customer company that helps companies make operational improvements based on customer feedback. MT's client list includes many of the world's largest restaurant chains, hotels, car rental agencies, and telecommunications companies. Much of the feedback we collect is from surveys that contain open-ended questions where customers can leave comments. MT has used the Key Phrase Extractor to unlock the value contained in these comments. We are able to identify common problems experienced by customers and are even able to detect emerging topics that are starting to catch fire. Mindshare's clients are able to leverage this information and make operational changes that improve customer experiences.
Lessons
GIGO
Language-awareness (POS)
Filtering (England v)
Contact
sematext.com
blog.sematext.com
@sematext
@otisg
otis@sematext.com
Copyright 2010 Sematext Int'l. All rights reserved.