Enhancing Discovery with Solr and Mahout
Post on 24-Feb-2016
1 CONFIDENTIAL |
Thinking Lucene Think Lucid
Grant Ingersoll, Chief Scientist, Lucid Imagination
Enhancing Discovery with Solr and Mahout
2 CONFIDENTIAL |Copyright Lucid ImaginationCopyright Lucid Imagination
Evolution
Documents
  - Models
  - Feature Selection
User Interaction
  - Clicks
  - Ratings/Reviews
  - Learning to Rank
  - Social Graph
Queries
  - Phrases
  - NLP
Content Relationships
  - Page Rank, etc.
  - Organization
Minding the Intersection
Search
Discovery
Analytics
Topics
Background
  - Apache Mahout
  - Apache Solr and Lucene
Recommendations with Mahout
  - Collaborative Filtering
Discovery with Solr and Mahout
Discussion
Apache Lucene in a Nutshell
http://lucene.apache.org/java
Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
Fast and efficient scoring and indexing algorithms
Lots of contributions to make common tasks easier:
  - Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
Most widely deployed search library on the planet
Apache Solr in a Nutshell
http://lucene.apache.org/solr
Lucene-based Search Server + other features and functionality
Access Lucene over HTTP:
  - Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
Most programming tasks in Lucene are taken care of in Solr
Faceting (guided navigation, filters, etc.)
Replication and distributed search support
Lucene Best Practices
Apache Mahout in a Nutshell
An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  - http://mahout.apache.org
The Three C's:
  - Collaborative Filtering (recommenders)
  - Clustering
  - Classification
Others:
  - Frequent Item Mining
  - Primitive collections
  - Math stuff
http://dictionary.reference.com/browse/mahout
Recommendations with Mahout
Recommenders
Collaborative Filtering (CF)
  - Provide recommendations solely based on preferences expressed between users and items
  - "People who watched this also watched that"
Content-based Recommendations (CBR)
  - Provide recommendations based on the attributes of the items and user profile
  - 'Modern Family' is a sitcom, Bob likes sitcoms
    • => Suggest Modern Family to Bob
Mahout is geared towards CF, can be extended to do CBR
  - Classification can also be used for CBR
Aside: search engines can also solve these problems
To Rate or Not?
        Dracula   Jane Eyre   Frankenstein   Java Programming
Bob     1         4           ???            -
Mary    5         1           4              -
In many instances, users don't provide actual ratings
  - Clicks, views, etc.
Non-Boolean ratings can also often introduce unnecessary noise
  - Even a low rating often has a positive correlation with highly rated items in the real world
Example: Should we recommend Frankenstein to Bob?
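The Bob and Mary table can be run through two similarity measures to show why the choice between ratings and Boolean preferences matters. This is a minimal sketch in plain Python, not Mahout code: the function names `pearson` and `tanimoto` are illustrative, and the choice of Pearson correlation and the Tanimoto (Jaccard) coefficient is an assumption made here, not something the slide specifies.

```python
from math import sqrt

# Ratings from the slide; Bob has not rated Frankenstein.
ratings = {
    "Bob": {"Dracula": 1, "Jane Eyre": 4},
    "Mary": {"Dracula": 5, "Jane Eyre": 1, "Frankenstein": 4},
}

def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def tanimoto(a, b):
    """Boolean similarity: shared items divided by all items either user touched."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0

rating_sim = pearson(ratings["Bob"], ratings["Mary"])
boolean_sim = tanimoto(ratings["Bob"], ratings["Mary"])
print(rating_sim, boolean_sim)
```

With ratings, Bob and Mary look like opposites (correlation of -1), which argues against Frankenstein. Ignoring the rating values, they share two of the three books either has touched, which argues for it.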
Collaborative Filtering with Mahout
Extensive framework for collaborative filtering
Recommenders
  - User based
  - Item based
  - Slope One
Online and Offline support
  - Offline can utilize Hadoop
          Item 1   Item 2   …   Item m
User 1    -        0.5          0.9
User 2    0.1      0.3          -
…
User n    0.8      0.7          0.1
Recommendations for User X
User Similarity
[Figure: User 1 to User 4 and Item 1 to Item 4; the rest of the diagram did not survive extraction]
What should we recommend for User 1?
Item Similarity
[Figure: User 1 to User 4 and Item 1 to Item 4; the rest of the diagram did not survive extraction]
What should we recommend for User 1?
Slope One
Intuition: There is a linear relationship between rated items
  - Y = mX + b where m = 1
Solve for b upfront based on existing ratings: b = (Y - X)
  - Find the average difference in preference value for every pair of items
Online can be very fast, but requires up-front computation and memory
User   Item 1   Item 2
A      3.5      2
B      ?        3
User A: 3.5 - 2 = 1.5
Item 1 (User B) = 3 + 1.5 = 4.5
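The arithmetic above can be sketched in a few lines of plain Python. This is an illustration of the Slope One idea under the slide's assumptions (m = 1, b = average pairwise difference), not Mahout's implementation. The helper names `average_diffs` and `predict` are hypothetical, and weighting each difference by the number of users who rated both items is one common variant chosen here.

```python
from collections import defaultdict

def average_diffs(ratings):
    """Precompute the mean rating difference for every ordered item pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for i, ri in user_ratings.items():
            for j, rj in user_ratings.items():
                if i != j:
                    sums[(i, j)] += ri - rj
                    counts[(i, j)] += 1
    diffs = {pair: sums[pair] / counts[pair] for pair in sums}
    return diffs, counts

def predict(user_ratings, target, diffs, counts):
    """Known rating plus average difference, weighted by how many users back each pair."""
    num = den = 0.0
    for item, rating in user_ratings.items():
        pair = (target, item)
        if pair in diffs:
            num += (rating + diffs[pair]) * counts[pair]
            den += counts[pair]
    return num / den if den else None

# Data from the slide: A rated both items, B rated only Item 2.
ratings = {"A": {"Item 1": 3.5, "Item 2": 2.0}, "B": {"Item 2": 3.0}}
diffs, counts = average_diffs(ratings)
prediction = predict(ratings["B"], "Item 1", diffs, counts)
print(prediction)  # 4.5
```

The up-front cost mentioned on the slide is visible in `average_diffs`: it stores one entry per ordered item pair, while `predict` only does lookups.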
Online and Offline Recommendations
Online
  - Predates Hadoop
  - Designed to run on a single node
    • Matrix size of ~100M interactions
  - API for integrating with your application
Offline
  - Hadoop based
  - Designed to run on large cluster
  - Several approaches:
    • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob
RecommenderJob
Essentially does matrix multiplication using distributed techniques
$MAHOUT_HOME/bin/examples/asf-email-examples.sh
Item-by-item matrix X User A's preference vector = Recs:
       101  102  103  104  105       User A       Recs
101      7    2    0    1    3        3.0          30
102      2    8    3    5    2        0            37
103      0    3    3    6    4    X   4.0     =    38
104      1    5    6    4    7        3.0          53
105      3    2    4    7    9        2.0          64
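The multiplication on this slide can be checked on a single machine. The sketch below is plain Python, not the Hadoop job. It assumes the 5x5 grid is an item-item co-occurrence matrix and that an entry of 0 in User A's vector means "no preference yet"; both readings are assumptions about the slide.

```python
items = [101, 102, 103, 104, 105]

# Item-by-item matrix from the slide (assumed to hold co-occurrence counts).
cooccurrence = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]

# User A's preferences, in the same item order; 0 is read as "not yet rated".
user_a = [3.0, 0.0, 4.0, 3.0, 2.0]

# Matrix-vector product: one score per item.
scores = [sum(c * p for c, p in zip(row, user_a)) for row in cooccurrence]

# Only items the user has no preference for are candidates, best score first.
candidates = sorted(
    ((item, score) for item, score, pref in zip(items, scores, user_a) if pref == 0),
    key=lambda pair: pair[1],
    reverse=True,
)
print(scores)      # [30.0, 37.0, 38.0, 53.0, 64.0]
print(candidates)  # [(102, 37.0)]
```

The scores reproduce the Recs column. Item 102 is the only item User A has not rated, so it is the only candidate left after filtering.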
Discovery with Solr
Discovery with Solr
Goals:
  - Guide users to results without having to guess at keywords
  - Encourage serendipity
  - Never show empty results
Out of the Box:
  - Faceting
  - Spell Checking
  - More Like This
  - Clustering (Carrot2)
Extend
  - Clustering (with Mahout)
  - Frequent Item Mining (with Mahout)
Clustering
Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
Solr has search result clustering
  - Pluggable
  - Default implementation uses Carrot2
Mahout has Hadoop based large scale clustering
  - K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.
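To make "group similar content together" concrete, here is a minimal single-machine sketch of K-Means (Lloyd's algorithm) in plain Python. It is not Mahout's Hadoop implementation. The 2-D points are invented stand-ins for document vectors, and seeding the centroids by hand is a simplification made for this example.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on tuples of floats, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared Euclidean).
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical document vectors forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

In practice the vectors would come from document text (for example term weights) and the number of clusters would need tuning.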
Discovery In Action
Pre-reqs:
  - Apache Ant 1.7.x, Subversion (SVN)
Command Line 1:
  - svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
  - cd solr-trunk/solr/
  - ant example
  - cd example
  - java -Dsolr.clustering.enabled=true -jar start.jar
Command Line 2:
  - cd exampledocs; java -jar post.jar *.xml
http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
Solr + Mahout
Basics
Most Mahout tasks are offline
Solr provides many touch points for integration:
  - ClusteringEngine
    • Clustering results
  - SearchComponent
    • Suggestions: related searches, clusters, MLT, spellchecking
  - UpdateProcessor
    • Classification of documents
  - FunctionQuery
Example: Frequent Itemset Mining
Discover frequently co-occurring items
Use Case: Related Searches from Solr Logs
Hadoop and sequential versions
  - Parallel FP Growth
Input:
  - <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE
  - Comma, pipe also allowed as delimiters
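The input format above is simple enough to produce by hand. A small sketch in plain Python, assuming one query per transaction and whitespace tokenization; the helper name `to_fpg_line` and the sample queries are made up for illustration, and no trailing space is written after the last token.

```python
def to_fpg_line(tokens, doc_id=None):
    """Render one transaction as: [doc_id TAB] token SPACE token ..."""
    body = " ".join(tokens)
    return f"{doc_id}\t{body}" if doc_id is not None else body

# Hypothetical queries, one transaction each.
queries = ["chris hostetter", "faceted search webcast"]
lines = [to_fpg_line(q.split(), doc_id=n) for n, q in enumerate(queries, start=1)]
print("\n".join(lines))
```

Since the document id is optional, `to_fpg_line(["solr", "webcast"])` would give just the space-separated tokens.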
FIM on Solr Query Logs
Goal:
  - Extract user queries from Solr logs
  - Feed into FIM to generate Related Keyword Searches
Context:
  - Solr Query logs
  - bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
  - bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
  - bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
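The regular expression passed to regexconverter can be tried out in isolation. The sketch below uses Python's `re` module with the same pattern. The sample log lines are invented, and URL-decoding the match with `unquote_plus` is only an assumption about what the `url` transformer does, not a description of Mahout's behavior.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as on the slide: whatever follows "?q=" or "&q=",
# up to the next "&" or the end of the line.
QUERY_RE = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")

def extract_query(log_line):
    """Return the decoded q parameter of a request line, or None if there is none."""
    match = QUERY_RE.search(log_line)
    return unquote_plus(match.group(0)) if match else None

# Hypothetical log lines.
first = extract_query("GET /solr/select?q=chris+hostetter&rows=10")
second = extract_query("GET /solr/select?rows=10&q=solr%20webcast")
third = extract_query("GET /solr/admin/ping")
print(first, second, third)
```

Because the lookbehind requires `?` or `&` directly before `q=`, parameters such as `fq=` are not picked up by mistake.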
Output
Key: Chris
Value:
  ([Chris, Hostetter],870),
  ([Chris],870),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power],18),
  ([Search, Faceted, Chris, Hostetter],18),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors],12),
  ([Solr, new, Chris, Hostetter, webcast, along],12),
  ([Solr, new, Chris, Hostetter, webcast],12),
  ([Solr, new, Chris, Hostetter],12)
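Each pair in the output reads as (itemset, number of transactions containing it). The brute-force sketch below shows how such counts arise on a toy input. It is deliberately not FP-Growth, which reaches the same counts without enumerating every combination. The function name, the `max_size` cap, and the sample transactions are all illustrative, and `min_support` plays the role that `-s 2` is assumed to play in the fpg command.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support=2, max_size=3):
    """Count every itemset up to max_size and keep those seen at least min_support times."""
    counts = Counter()
    for tokens in transactions:
        unique = sorted(set(tokens))  # each token counts once per transaction
        for size in range(1, max_size + 1):
            for itemset in combinations(unique, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Hypothetical tokenized queries.
transactions = [
    ["chris", "hostetter"],
    ["chris", "hostetter", "webcast"],
    ["faceted", "search"],
]
frequent = frequent_itemsets(transactions)
print(frequent)
```

For related-search suggestions, the itemsets containing a given token (such as the `Chris` key above) would be looked up and ranked by support.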
Resources
http://lucene.apache.org
http://mahout.apache.org
http://manning.com/owen
http://manning.com/ingersoll
http://www.lucidimagination.com
grant@lucidimagination.com
@gsingers
Appendix
Mahout Overview
[Architecture diagram; component labels as extracted:]
Math: Vectors/Matrices/SVD
Recommenders, Clustering, Classification, Freq. Pattern Mining
Genetic
Utilities/Integration: Lucene/Vectorizer
Collections (primitives)
Apache Hadoop
Applications
Examples
See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms