Enhancing Discovery with Solr and Mahout
1 CONFIDENTIAL |
Thinking Lucene Think Lucid
Grant Ingersoll
Chief Scientist
Lucid Imagination
Enhancing Discovery with Solr and Mahout
2 CONFIDENTIAL | Copyright Lucid Imagination
Evolution
Documents
• Models
• Feature Selection
User Interaction
• Clicks
• Ratings/Reviews
• Learning to Rank
• Social Graph
Queries
• Phrases
• NLP
Content Relationships
• Page Rank, etc.
• Organization
Minding the Intersection
Search
Discovery
Analytics
Topics
Background
– Apache Mahout
– Apache Solr and Lucene
Recommendations with Mahout
– Collaborative Filtering
Discovery with Solr and Mahout
Discussion
Apache Lucene in a Nutshell
http://lucene.apache.org/java
Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
Fast and efficient scoring and indexing algorithms
Lots of contributions to make common tasks easier:
– Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
Most widely deployed search library on the planet
Apache Solr in a Nutshell
http://lucene.apache.org/solr
Lucene-based Search Server + other features and functionality
Access Lucene over HTTP:
– Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
Most programming tasks in Lucene are taken care of in Solr
Faceting (guided navigation, filters, etc.)
Replication and distributed search support
Lucene Best Practices
Apache Mahout in a Nutshell
An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
– http://mahout.apache.org
The Three C's:
– Collaborative Filtering (recommenders)
– Clustering
– Classification
Others:
– Frequent Item Mining
– Primitive collections
– Math stuff
http://dictionary.reference.com/browse/mahout
Recommendations with Mahout
Recommenders
Collaborative Filtering (CF)
– Provide recommendations solely based on preferences expressed between users and items
– "People who watched this also watched that"
Content-based Recommendations (CBR)
– Provide recommendations based on the attributes of the items and user profile
– 'Modern Family' is a sitcom, Bob likes sitcoms
• => Suggest Modern Family to Bob
Mahout geared towards CF, can be extended to do CBR
– Classification can also be used for CBR
Aside: search engines can also solve these problems
To Rate or Not?
Example: Should we recommend Frankenstein to Bob?
        Dracula   Jane Eyre   Frankenstein   Java Programming
Bob     1         4           ???            -
Mary    5         1           4              -
In many instances, users don't provide actual ratings
– Clicks, views, etc.
Non-Boolean ratings can also often introduce unnecessary noise
– Even a low rating often has a positive correlation with highly rated items in the real world
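The rated-versus-Boolean question can be made concrete with the Bob and Mary table. The following is a minimal sketch in plain Python, not Mahout code: it compares a rating-based similarity (Pearson correlation over co-rated books) with a Boolean one (Tanimoto/Jaccard overlap of the books each user touched). The function names and the choice of these two measures are assumptions made for illustration.

```python
import math

# Ratings from the table; unrated cells are simply absent.
ratings = {
    "Bob":  {"Dracula": 1, "Jane Eyre": 4},
    "Mary": {"Dracula": 5, "Jane Eyre": 1, "Frankenstein": 4},
}


def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        return 0.0
    xs = [a[item] for item in shared]
    ys = [b[item] for item in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                    * sum((y - mean_y) ** 2 for y in ys))
    return num / den if den else 0.0


def tanimoto(a, b):
    """Boolean similarity: overlap of the item sets, ignoring rating values."""
    items_a, items_b = set(a), set(b)
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0


rated_similarity = pearson(ratings["Bob"], ratings["Mary"])
boolean_similarity = tanimoto(ratings["Bob"], ratings["Mary"])
print(rated_similarity, boolean_similarity)
```

With ratings, Bob and Mary look like opposites (correlation of -1 on the two shared books); with Boolean data they look fairly similar (two of three books in common). Which view gives better recommendations depends on the data, which is the trade-off the slide raises.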
Collaborative Filtering with Mahout
Extensive framework for collaborative filtering
Recommenders
– User based
– Item based
– Slope One
Online and Offline support
– Offline can utilize Hadoop
          Item 1   Item 2   …   Item m
User 1    -        0.5          0.9
User 2    0.1      0.3          -
…
User n    0.8      0.7          0.1
=> Recommendations for User X
User Similarity
[Diagram: Users 1 to 4 and Items 1 to 4]
What should we recommend for User 1?
Item Similarity
[Diagram: Users 1 to 4 and Items 1 to 4]
What should we recommend for User 1?
Slope One
Intuition: There is a linear relationship between rated items
– Y = mX + b where m = 1
Solve for b upfront based on existing ratings: b = (Y - X)
– Find the average difference in preference value for every pair of items
Online can be very fast, but requires up front computation and memory
User   Item 1   Item 2
A      3.5      2
B      ?        3
User A: 3.5 - 2 = 1.5
Item 1 (User B) = 3 + 1.5 = 4.5
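The Slope One arithmetic can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Mahout's implementation: it uses an unweighted average of the per-item estimates, and the function and variable names are hypothetical.

```python
from collections import defaultdict

# Ratings from the example table; User B has not rated Item 1.
ratings = {
    "A": {"Item 1": 3.5, "Item 2": 2.0},
    "B": {"Item 2": 3.0},
}


def average_differences(ratings):
    """Precompute b for every item pair: mean of (rating_i - rating_j)
    over the users who rated both items. This is the up-front step."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prefs in ratings.values():
        for item_i, rating_i in prefs.items():
            for item_j, rating_j in prefs.items():
                if item_i != item_j:
                    sums[(item_i, item_j)] += rating_i - rating_j
                    counts[(item_i, item_j)] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


def predict(ratings, diffs, user, target):
    """Estimate the user's rating for target as Y = X + b, averaged over
    every item X the user has rated that has a known difference to target."""
    estimates = [rating + diffs[(target, item)]
                 for item, rating in ratings[user].items()
                 if (target, item) in diffs]
    if not estimates:
        return None
    return sum(estimates) / len(estimates)


diffs = average_differences(ratings)
prediction = predict(ratings, diffs, "B", "Item 1")
print(prediction)  # 3 + 1.5 = 4.5, matching the slide
```

The difference table is what costs memory and up-front computation; once it exists, each prediction is a handful of additions, which is why the online step can be fast.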
Online and Offline Recommendations
Online
– Predates Hadoop
– Designed to run on a single node
• Matrix size of ~ 100M interactions
– API for integrating with your application
Offline
– Hadoop based
– Designed to run on large cluster
– Several approaches:
• RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob
RecommenderJob
Essentially does matrix multiplication using distributed techniques
$MAHOUT_HOME/bin/examples/asf-email-examples.sh
       101  102  103  104  105       User A       Recs
101     7    2    0    1    3         3.0          30
102     2    8    3    5    2         0            37
103     0    3    3    6    4    X    4.0     =    38
104     1    5    6    4    7         3.0          53
105     3    2    4    7    9         2.0          64
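The multiplication on this slide can be reproduced on a single machine in plain Python. This is only a sketch of the arithmetic, assuming the 5x5 grid is an item-to-item matrix and the column is User A's preference vector; RecommenderJob itself distributes this work over Hadoop, and the helper names below are hypothetical.

```python
# Item-to-item matrix from the slide, rows and columns in the order 101..105.
item_ids = [101, 102, 103, 104, 105]
item_matrix = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]

# User A's preferences; 0 means no preference recorded for item 102.
user_a = [3.0, 0.0, 4.0, 3.0, 2.0]


def multiply(matrix, vector):
    """Matrix-vector product: one score per item (row)."""
    return [sum(cell * pref for cell, pref in zip(row, vector))
            for row in matrix]


scores = multiply(item_matrix, user_a)

# Only items the user has no preference for are candidates to recommend.
candidates = sorted(
    ((item, score) for item, score, pref in zip(item_ids, scores, user_a)
     if pref == 0.0),
    key=lambda pair: pair[1],
    reverse=True,
)
print(scores)
print(candidates)
```

The scores match the "Recs" column (30, 37, 38, 53, 64). In this example item 102 is the only one User A has not expressed a preference for, so it is the only candidate.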
Discovery with Solr
Discovery with Solr
Goals:
– Guide users to results without having to guess at keywords
– Encourage serendipity
– Never show empty results
Out of the Box:
– Faceting
– Spell Checking
– More Like This
– Clustering (Carrot2)
Extend
– Clustering (with Mahout)
– Frequent Item Mining (with Mahout)
Clustering
Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
Solr has search result clustering
– Pluggable
– Default implementation uses Carrot2
Mahout has Hadoop based large scale clustering
– K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.
Discovery In Action
Pre-reqs:
– Apache Ant 1.7.x, Subversion (SVN)
Command Line 1:
– svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
– cd solr-trunk/solr/
– ant example
– cd example
– java -Dsolr.clustering.enabled=true -jar start.jar
Command Line 2:
– cd exampledocs; java -jar post.jar *.xml
http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
Solr + Mahout
Basics
Most Mahout tasks are offline
Solr provides many touch points for integration:
– ClusteringEngine
• Clustering results
– SearchComponent
• Suggestions – Related searches, clusters, MLT, spellchecking
– UpdateProcessor
• Classification of documents
– FunctionQuery
Example: Frequent Itemset Mining
Discover frequently co-occurring items
Use Case: Related Searches from Solr Logs
Hadoop and sequential versions
– Parallel FP Growth
Input:
– <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE
– Comma, pipe also allowed as delimiters
FIM on Solr Query Logs
Goal:
– Extract user queries from Solr logs
– Feed into FIM to generate Related Keyword Searches
Context:
– Solr Query logs
– bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
– bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
– bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
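To show what the log conversion step produces, here is a rough stand-alone sketch in Python. It is not the Mahout regexconverter: it uses a simplified pattern that captures the same q= value, URL-decodes it, and emits the space-separated token line that the frequent itemset input format expects (with no document id). The sample request URLs and function names are made up for illustration.

```python
import re
from urllib.parse import unquote_plus

# Simplified stand-in for the slide's regex: capture the value of the q
# parameter, i.e. the text after "?q=" or "&q=" up to the next "&".
QUERY_RE = re.compile(r"[?&]q=([^&]*)")


def query_tokens(request_url):
    """Return the decoded query terms from one logged request URL."""
    match = QUERY_RE.search(request_url)
    if match is None:
        return []
    return unquote_plus(match.group(1)).split()


def to_fim_line(request_url):
    """One transaction per query: tokens separated by single spaces."""
    return " ".join(query_tokens(request_url))


# Hypothetical log entries, reduced to the request URL.
sample_urls = [
    "/solr/select?q=chris+hostetter&rows=10",
    "/solr/select?rows=10&q=faceted%20search",
    "/solr/select?fq=type:webcast&q=solr",
]
fim_lines = [to_fim_line(url) for url in sample_urls]
for line in fim_lines:
    print(line)
```

Each output line is one "transaction" whose items are the query terms, so terms that users frequently search for together surface as frequent patterns.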
Output
Key: Chris: Value:
([Chris, Hostetter],870),
([Chris],870),
([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18),
([Search, Faceted, Chris, Hostetter, Webcast, Power],18),
([Search, Faceted, Chris, Hostetter],18),
([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12),
([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12),
([Solr, new, Chris, Hostetter, webcast, along, sponsors],12),
([Solr, new, Chris, Hostetter, webcast, along],12),
([Solr, new, Chris, Hostetter, webcast],12),
([Solr, new, Chris, Hostetter],12)
Resources
http://lucene.apache.org
http://mahout.apache.org
http://manning.com/owen
http://manning.com/ingersoll
http://www.lucidimagination.com
@gsingers
Appendix
Mahout Overview
Math
Vectors/Matrices/SVD
Recommenders
Clustering
Classification
Freq. Pattern Mining
Genetic
Utilities/Integration
Lucene/Vectorizer
Collections (primitives)
Apache Hadoop
Applications
Examples
See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms