Enhancing Discovery with Solr and Mahout

1 CONFIDENTIAL | Thinking Lucene Think Lucid Grant Ingersoll Chief Scientist Lucid Imagination Enhancing Discovery with Solr and Mahout

Upload: hamlet

Post on 24-Feb-2016


DESCRIPTION

Enhancing Discovery with Solr and Mahout. Grant Ingersoll, Chief Scientist, Lucid Imagination. Slides: Evolution; Minding the Intersection; Topics. Topics covered: Background (Apache Mahout; Apache Solr and Lucene); Recommendations with Mahout (Collaborative Filtering); Discovery with Solr and Mahout; Discussion.

TRANSCRIPT

Page 1: Enhancing Discovery with  Solr and Mahout

1 CONFIDENTIAL |

Thinking Lucene Think Lucid

Grant Ingersoll, Chief Scientist, Lucid Imagination

Enhancing Discovery with Solr and Mahout

Page 2: Enhancing Discovery with  Solr and Mahout

2 CONFIDENTIAL | Copyright Lucid Imagination

Evolution

Documents
– Models
– Feature Selection

User Interaction
– Clicks
– Ratings/Reviews
– Learning to Rank
– Social Graph

Queries
– Phrases
– NLP

Content Relationships
– Page Rank, etc.
– Organization

Page 3: Enhancing Discovery with  Solr and Mahout

Minding the Intersection

Search

Discovery

Analytics

Page 4: Enhancing Discovery with  Solr and Mahout

Topics

Background
– Apache Mahout
– Apache Solr and Lucene

Recommendations with Mahout
– Collaborative Filtering

Discovery with Solr and Mahout

Discussion

Page 5: Enhancing Discovery with  Solr and Mahout

Apache Lucene in a Nutshell

http://lucene.apache.org/java

Java-based Application Programming Interface (API) for adding search and indexing functionality to applications

Fast and efficient scoring and indexing algorithms

Lots of contributions to make common tasks easier:
– Highlighting, spatial, Query Parsers, Benchmarking tools, etc.

Most widely deployed search library on the planet

Page 6: Enhancing Discovery with  Solr and Mahout

Apache Solr in a Nutshell

http://lucene.apache.org/solr

Lucene-based Search Server + other features and functionality

Access Lucene over HTTP:
– Java, XML, Ruby, Python, .NET, JSON, PHP, etc.

Most programming tasks in Lucene are taken care of in Solr

Faceting (guided navigation, filters, etc.)

Replication and distributed search support

Lucene Best Practices

Page 7: Enhancing Discovery with  Solr and Mahout

Apache Mahout in a Nutshell

An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
– http://mahout.apache.org

The Three C’s:
– Collaborative Filtering (recommenders)
– Clustering
– Classification

Others:
– Frequent Item Mining
– Primitive collections
– Math stuff

http://dictionary.reference.com/browse/mahout

Page 8: Enhancing Discovery with  Solr and Mahout


Recommendations with Mahout

Page 9: Enhancing Discovery with  Solr and Mahout

Recommenders

Collaborative Filtering (CF)
– Provide recommendations solely based on preferences expressed between users and items
– “People who watched this also watched that”

Content-based Recommendations (CBR)
– Provide recommendations based on the attributes of the items and user profile
– ‘Modern Family’ is a sitcom, Bob likes sitcoms
  • => Suggest Modern Family to Bob

Mahout geared towards CF, can be extended to do CBR
– Classification can also be used for CBR

Aside: search engines can also solve these problems
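The content-based case on this slide can be sketched in a few lines of plain Python. This is only an illustration of matching item attributes against a user profile, not Mahout code. Only "Modern Family" and Bob's liking for sitcoms come from the slide; the other catalog entries and all function names are hypothetical.

```python
# Hypothetical catalog: item -> set of attributes. Only 'Modern Family' being
# a sitcom and Bob liking sitcoms come from the slide; the rest is made up.
ITEM_ATTRIBUTES = {
    "Modern Family": {"sitcom", "family"},
    "Dracula": {"horror", "classic"},
    "Seinfeld": {"sitcom"},
}

USER_PROFILES = {"Bob": {"sitcom"}}


def content_based_recommend(user, seen=()):
    """Rank unseen items by how many attributes they share with the user's profile."""
    liked = USER_PROFILES[user]
    scored = []
    for item, attrs in ITEM_ATTRIBUTES.items():
        overlap = len(attrs & liked)
        if item not in seen and overlap > 0:
            scored.append((overlap, item))
    # Highest overlap first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored]


bob_recs = content_based_recommend("Bob")
```

No other user's behavior is consulted here, which is the practical difference from collaborative filtering: the recommendation depends only on item attributes and Bob's own profile.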

Page 10: Enhancing Discovery with  Solr and Mahout

To Rate or Not?

        Dracula   Jane Eyre   Frankenstein   Java Programming
Bob     1         4           ???            -
Mary    5         1           4              -

In many instances, users don’t provide actual ratings
– Clicks, views, etc.

Non-Boolean ratings can also often introduce unnecessary noise
– Even a low rating often has a positive correlation with highly rated items in the real world

Example: Should we recommend Frankenstein to Bob?
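One way to make the slide's question concrete is to compare a rating-based similarity with a Boolean one on the same table. The sketch below uses Pearson correlation and a Tanimoto (Jaccard) coefficient as representative measures; that choice is an assumption, and the code is plain Python rather than Mahout's implementation.

```python
import math

# Ratings from the slide; Bob has not rated Frankenstein.
RATINGS = {
    "Bob": {"Dracula": 1, "Jane Eyre": 4},
    "Mary": {"Dracula": 5, "Jane Eyre": 1, "Frankenstein": 4},
}


def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    shared = sorted(set(a) & set(b))
    xs = [a[i] for i in shared]
    ys = [b[i] for i in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mean_x) ** 2 for x in xs)) * math.sqrt(
        sum((y - mean_y) ** 2 for y in ys)
    )
    return cov / norm if norm else 0.0


def tanimoto(a, b):
    """Boolean (set-overlap) similarity: ignores the rating values entirely."""
    items_a, items_b = set(a), set(b)
    return len(items_a & items_b) / len(items_a | items_b)


rating_similarity = pearson(RATINGS["Bob"], RATINGS["Mary"])
boolean_similarity = tanimoto(RATINGS["Bob"], RATINGS["Mary"])
```

On the rated values Bob and Mary look like opposites (a correlation of about -1), so a rating-based recommender would hesitate to pass Mary's taste on to Bob. On Boolean preferences they overlap on two of three books, which argues for suggesting Frankenstein.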

Page 11: Enhancing Discovery with  Solr and Mahout

Collaborative Filtering with Mahout

Extensive framework for collaborative filtering

Recommenders
– User based
– Item based
– Slope One

Online and Offline support
– Offline can utilize Hadoop

          Item 1   Item 2   …   Item m
User 1    -        0.5          0.9
User 2    0.1      0.3          -
User n    0.8      0.7          0.1

Recommendations for User X

Page 12: Enhancing Discovery with  Solr and Mahout

User Similarity

[Figure: User 1, User 2, User 3, User 4 and Item 1, Item 2, Item 3, Item 4]

What should we recommend for User 1?
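A user-based recommender answers this by finding the users most similar to User 1 and suggesting what they have that User 1 lacks. The diagram's actual links did not survive in this transcript, so the preference data below is hypothetical, as are the function names and the choice of Jaccard similarity.

```python
# Hypothetical Boolean preferences; the slide's diagram data is not available.
PREFS = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 4"},
    "User 4": {"Item 2", "Item 3"},
}


def jaccard(a, b):
    """Set overlap between two users' item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(target, prefs, k=2):
    """Score unseen items by the summed similarity of the k nearest users who have them."""
    mine = prefs[target]
    neighbors = sorted(
        ((jaccard(mine, items), user) for user, items in prefs.items() if user != target),
        reverse=True,
    )[:k]
    scores = {}
    for sim, user in neighbors:
        for item in prefs[user] - mine:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=lambda item: (-scores[item], item))


user_based_recs = recommend_user_based("User 1", PREFS)
```

With this data the two nearest neighbors are User 2 and User 4, and both contribute Item 3.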

Page 13: Enhancing Discovery with  Solr and Mahout

Item Similarity

[Figure: User 1, User 2, User 3, User 4 and Item 1, Item 2, Item 3, Item 4]

What should we recommend for User 1?
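An item-based recommender turns the question around: it compares items by who has them, then scores candidates against the items User 1 already has. The sketch below uses simple co-occurrence counts as the item similarity. The preference data is hypothetical because the slide's diagram is not recoverable, and the function names are illustrative only.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical Boolean preferences; the slide's diagram data is not available.
PREFS = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 4"},
    "User 4": {"Item 2", "Item 3"},
}


def item_cooccurrence(prefs):
    """Count, for each ordered item pair, how many users have both items."""
    counts = defaultdict(int)
    for items in prefs.values():
        for a, b in permutations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts


def recommend_item_based(target, prefs):
    """Score each unseen item by its co-occurrence with items the target already has."""
    counts = item_cooccurrence(prefs)
    mine = prefs[target]
    all_items = set().union(*prefs.values())
    scores = {
        candidate: sum(counts[(owned, candidate)] for owned in mine)
        for candidate in all_items - mine
    }
    return sorted(
        (item for item in scores if scores[item] > 0),
        key=lambda item: (-scores[item], item),
    )


item_based_recs = recommend_item_based("User 1", PREFS)
```

Item similarities depend only on the preference data, not on the target user, so they can be computed ahead of time and reused for every request.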

Page 14: Enhancing Discovery with  Solr and Mahout

Slope One

Intuition: There is a linear relationship between rated items
– Y = mX + b where m = 1

Solve for b upfront based on existing ratings: b = (Y - X)
– Find the average difference in preference value for every pair of items

Online can be very fast, but requires up front computation and memory

User   Item 1   Item 2
A      3.5      2
B      ?        3

User A: 3.5 - 2 = 1.5

Item 1 (User B) = 3 + 1.5 = 4.5
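The arithmetic on this slide can be reproduced with a small unweighted Slope One sketch. The ratings are the ones in the table; the function names are hypothetical, and this is plain Python rather than Mahout's recommender.

```python
from collections import defaultdict

# Ratings from the slide: User A rated both items, User B only Item 2.
RATINGS = {
    "A": {"Item 1": 3.5, "Item 2": 2.0},
    "B": {"Item 2": 3.0},
}


def average_differences(ratings):
    """Precompute b = average(Y - X) for every ordered pair of items (X, Y)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for prefs in ratings.values():
        for x, rating_x in prefs.items():
            for y, rating_y in prefs.items():
                if x != y:
                    totals[(x, y)] += rating_y - rating_x
                    counts[(x, y)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


def predict(user, target, ratings, diffs):
    """Predict a rating as the mean of (known rating + average difference)."""
    estimates = [
        rating + diffs[(item, target)]
        for item, rating in ratings[user].items()
        if (item, target) in diffs
    ]
    return sum(estimates) / len(estimates) if estimates else None


DIFFS = average_differences(RATINGS)
prediction = predict("B", "Item 1", RATINGS, DIFFS)
```

The `average_differences` table is the up-front computation and memory cost the slide mentions; once it exists, each prediction is a handful of lookups and additions.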

Page 15: Enhancing Discovery with  Solr and Mahout

Online and Offline Recommendations

Online
– Predates Hadoop
– Designed to run on a single node
  • Matrix size of ~100M interactions
– API for integrating with your application

Offline
– Hadoop based
– Designed to run on large cluster
– Several approaches:
  • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob

Page 16: Enhancing Discovery with  Solr and Mahout

RecommenderJob

Essentially does matrix multiplication using distributed techniques

$MAHOUT_HOME/bin/examples/asf-email-examples.sh

        101   102   103   104   105         User A         Recs
101     7     2     0     1     3           3.0            30
102     2     8     3     5     2           0              37
103     0     3     3     6     4     X     4.0      =     38
104     1     5     6     4     7           3.0            53
105     3     2     4     7     9           2.0            64
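The multiplication on this slide is small enough to check by hand or in a few lines of single-machine Python. The sketch treats the 5x5 matrix as an item-to-item matrix keyed by item IDs 101 to 105 and the vector as User A's preferences; that reading, and the step of keeping only items the user has no preference for, are assumptions rather than a description of RecommenderJob's internals.

```python
ITEM_IDS = [101, 102, 103, 104, 105]

# Item-to-item matrix from the slide (rows and columns are items 101..105).
ITEM_MATRIX = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]

# User A's preference vector from the slide (0 means no preference yet).
USER_A = [3.0, 0.0, 4.0, 3.0, 2.0]


def multiply(matrix, vector):
    """Plain matrix-vector product: one dot product per matrix row."""
    return [sum(cell * value for cell, value in zip(row, vector)) for row in matrix]


scores = multiply(ITEM_MATRIX, USER_A)

# Keep only items the user has not expressed a preference for, best score first.
recommendations = sorted(
    ((item, score) for item, score, pref in zip(ITEM_IDS, scores, USER_A) if pref == 0),
    key=lambda pair: -pair[1],
)
```

The scores match the slide's Recs column. Item 102 is the only item User A has no preference for, so it is the only candidate left after filtering.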

Page 17: Enhancing Discovery with  Solr and Mahout


Discovery with Solr

Page 18: Enhancing Discovery with  Solr and Mahout

Discovery with Solr

Goals:
– Guide users to results without having to guess at keywords
– Encourage serendipity
– Never show empty results

Out of the Box:
– Faceting
– Spell Checking
– More Like This
– Clustering (Carrot2)

Extend
– Clustering (with Mahout)
– Frequent Item Mining (with Mahout)

Page 19: Enhancing Discovery with  Solr and Mahout

Clustering

Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content

Solr has search result clustering
– Pluggable
– Default implementation uses Carrot2

Mahout has Hadoop based large scale clustering
– K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.
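To give a feel for what K-Means does, here is a minimal single-machine sketch of Lloyd's algorithm on 2-D points. It is not Mahout's Hadoop implementation: real document clustering would run on term vectors, and the points, starting centroids, and function name below are all hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points with caller-supplied starting centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (
                sum(x for x, _ in cluster) / len(cluster),
                sum(y for _, y in cluster) / len(cluster),
            )
            if cluster
            else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Two visually separate groups of hypothetical points.
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, final_clusters = kmeans(POINTS, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Starting centroids are passed in explicitly to keep the run deterministic; choosing them well is a separate problem.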

Page 20: Enhancing Discovery with  Solr and Mahout

Discovery In Action

Pre-reqs:
– Apache Ant 1.7.x, Subversion (SVN)

Command Line 1:
– svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
– cd solr-trunk/solr/
– ant example
– cd example
– java -Dsolr.clustering.enabled=true -jar start.jar

Command Line 2:
– cd exampledocs; java -jar post.jar *.xml

http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true

Page 21: Enhancing Discovery with  Solr and Mahout


Solr + Mahout

Page 22: Enhancing Discovery with  Solr and Mahout

Basics

Most Mahout tasks are offline

Solr provides many touch points for integration:
– ClusteringEngine
  • Clustering results
– SearchComponent
  • Suggestions – Related searches, clusters, MLT, spellchecking
– UpdateProcessor
  • Classification of documents
– FunctionQuery

Page 23: Enhancing Discovery with  Solr and Mahout

Example: Frequent Itemset Mining

Discover frequently co-occurring items

Use Case: Related Searches from Solr Logs

Hadoop and sequential versions
– Parallel FP Growth

Input:
– <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE
– Comma, pipe also allowed as delimiters
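The idea of frequent itemset mining can be shown on a handful of lines in the input format above. The sketch below enumerates itemsets by brute force and keeps those that meet a minimum support; it is deliberately not FP-Growth, which reaches the same result without enumerating every combination. The sample lines and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions in the slide's input format:
# <optional document id>TAB<TOKEN1>SPACE<TOKEN2>...
LINES = [
    "1\tsolr faceting",
    "2\tsolr faceting tutorial",
    "3\tmahout clustering",
    "4\tsolr tutorial",
]


def parse_line(line):
    """Drop the optional id before the TAB and split the tokens on whitespace."""
    _, _, tokens = line.rpartition("\t")
    return set(tokens.split())


def frequent_itemsets(lines, min_support=2, max_size=2):
    """Count every itemset up to max_size and keep those seen at least min_support times."""
    counts = Counter()
    for line in lines:
        items = sorted(parse_line(line))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


frequent = frequent_itemsets(LINES)
```

Applied to query logs, a frequent pair such as `("faceting", "solr")` is the raw material for a "related searches" suggestion.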

Page 24: Enhancing Discovery with  Solr and Mahout

FIM on Solr Query Logs

Goal:
– Extract user queries from Solr logs
– Feed into FIM to generate Related Keyword Searches

Context:
– Solr Query logs
– bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
– bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
– bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
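To see what the regular expression in the regexconverter step captures, the same pattern can be tried on a few log lines in Python. This is a sketch, not a reimplementation of regexconverter: the log lines are hypothetical, the `(\?|&)` group is written as the character class `[?&]` (equivalent for matching purposes), and URL-decoding the match is an assumption about what the `url` transformer is for.

```python
import re
from urllib.parse import unquote_plus

# Same idea as the slide's --regex: capture everything after "?q=" or "&q="
# up to the next "&" or the end of the line.
QUERY_PATTERN = re.compile(r"(?<=[?&]q=).*?(?=&|$)")

# Hypothetical log lines; real Solr log formats vary by version and configuration.
LOG_LINES = [
    "GET /solr/select?q=chris+hostetter&rows=10",
    "GET /solr/select?fl=id&q=faceted%20search",
    "GET /solr/admin/ping",
]


def extract_queries(lines):
    """Return the decoded q parameter of every line that has a non-empty one."""
    queries = []
    for line in lines:
        match = QUERY_PATTERN.search(line)
        if match and match.group(0):
            queries.append(unquote_plus(match.group(0)))
    return queries


queries = extract_queries(LOG_LINES)
```

Each extracted query is already a space-separated list of tokens, which is the shape the frequent pattern mining step expects as input.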

Page 25: Enhancing Discovery with  Solr and Mahout

Output

Key: Chris: Value:
([Chris, Hostetter],870),
([Chris],870),
([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18),
([Search, Faceted, Chris, Hostetter, Webcast, Power],18),
([Search, Faceted, Chris, Hostetter],18),
([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12),
([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12),
([Solr, new, Chris, Hostetter, webcast, along, sponsors],12),
([Solr, new, Chris, Hostetter, webcast, along],12),
([Solr, new, Chris, Hostetter, webcast],12),
([Solr, new, Chris, Hostetter],12)

Page 26: Enhancing Discovery with  Solr and Mahout

Resources

http://lucene.apache.org
http://mahout.apache.org
http://manning.com/owen
http://manning.com/ingersoll
http://www.lucidimagination.com
[email protected]
@gsingers

Page 27: Enhancing Discovery with  Solr and Mahout


Appendix

Page 28: Enhancing Discovery with  Solr and Mahout

Mahout Overview

Math: Vectors/Matrices/SVD

Recommenders, Clustering, Classification, Freq. Pattern Mining

Genetic

Utilities/Integration: Lucene/Vectorizer

Collections (primitives)

Apache Hadoop

Applications

Examples

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms