practical machine learning: innovations in recommendation workshop
TRANSCRIPT
© 2014 MapR Technologies 1© 2014 MapR Technologies
Practical Machine Learning: Innovations in Recommendation
Ted Dunning22 June 2014 Big Data Everywhere Conference Tel Aviv
© 2014 MapR Technologies 2
Who I am
Ted Dunning, Chief Applications Architect, MapR TechnologiesEmail [email protected] [email protected] @Ted_Dunning
Apache Mahout https://mahout.apache.org/ Twitter @ApacheMahout
22 June 2014 Big Data Everywhere Conference #DataIsrael
© 2014 MapR Technologies 3http://www.wired.com/wiredenterprise/2012/12/mahout/
Recommendation: Widely Used Machine LearningExample: Open source Apache Mahout used in production
© 2014 MapR Technologies 4
Recommendations
– Data: interactions between people taking action (users) and items • Data used to train recommendation model
– Goal is to suggest additional interactions– Example applications: movie, music or map-based restaurant choices;
suggesting sale items for e-stores or via cash-register receipts
© 2014 MapR Technologies 5
Google maps: restaurant recommendations
© 2014 MapR Technologies 6
Google maps: tech recommendations
© 2014 MapR Technologies 7
Tutorial Part 1:How recommendation works,or “I want a pony”…
© 2014 MapR Technologies 9
First question:Are you using the right data?
© 2014 MapR Technologies 10
Recommendation
Behavior of a crowd helps us understand what individuals will do
© 2014 MapR Technologies 11
Recommendations
Alice got an apple and a puppyAlice
© 2014 MapR Technologies 12
Recommendations
Alice got an apple and a puppyAlice
Charles got a bicycleCharles
© 2014 MapR Technologies 13
Recommendations
Alice got an apple and a puppyAlice
Charles got a bicycleCharles
Bob Bob got an apple
© 2014 MapR Technologies 14
Recommendations
Alice got an apple and a puppyAlice
Charles got a bicycleCharles
Bob What else would Bob like?
© 2014 MapR Technologies 15
Recommendations
Alice got an apple and a puppyAlice
Charles got a bicycleCharles
Bob A puppy!
© 2014 MapR Technologies 16
You get the idea of how recommenders can work…
© 2014 MapR Technologies 17
By the way, like me, Bob also wants a pony…
© 2014 MapR Technologies 18
Recommendations
?
Alice
Bob
Charles
Amelia
What if everybody gets a pony?
What else would you recommend for new user Amelia?
© 2014 MapR Technologies 19
Recommendations
?
Alice
Bob
Charles
Amelia
If everybody gets a pony, it’s not a very good indicator of what to else predict...
© 2014 MapR Technologies 20
Problems with Raw Co-occurrence
• Very popular items co-occur with everything or why it’s not very helpful to know that everybody wants a pony… – Examples: Welcome document; Elevator music
• Very widespread occurrence is not interesting to generate indicators for recommendation– Unless you want to offer an item that is constantly desired, such as
razor blades (or ponies)• What we want is anomalous co-occurrence
– This is the source of interesting indicators of preference on which to base recommendation
© 2014 MapR Technologies 21
Overview: Get Useful Indicators from Behaviors
1. Use log files to build history matrix of users x items– Remember: this history of interactions will be sparse compared to all
potential combinations2. Transform to a co-occurrence matrix of items x items3. Look for useful indicators by identifying anomalous co-occurrences to
make an indicator matrix– Log Likelihood Ratio (LLR) can be helpful to judge which co-
occurrences can with confidence be used as indicators of preference– ItemSimilarityJob in Apache Mahout uses LLR
© 2014 MapR Technologies 22
Apache Mahout: Overview• Open source Apache project http://mahout.apache.org/• Mahout version is 0.9 released Feb 2014; inc Scala
– Summary 0.9 blog at http://bit.ly/1rirUUL• Library of scalable algorithms for machine learning
– Some run on Apache Hadoop distributions; others do not require Hadoop– Some can be run at small scale– Some are run in parallel; others are sequential
• Includes the following main areas:– Clustering & related techniques– Classification– Recommendation– Mahout Math Library
© 2014 MapR Technologies 24
Log Files
Alice
Bob
Charles
Alice
Bob
Charles
Alice
© 2014 MapR Technologies 25
Log Files
u1
u3
u2
u1
u3
u2
u1
t1
t4
t3
t2
t3
t3
t1
© 2014 MapR Technologies 26
History Matrix: Users x Items
Alice
Bob
Charles
✔ ✔ ✔✔ ✔
✔ ✔
© 2014 MapR Technologies 27
Co-Occurrence Matrix: Items x Items
1 2 01
1 11
1
0
002
How do you tell which co-occurrences are useful?
© 2014 MapR Technologies 28
Co-Occurrence Matrix: Items x Items
1 2 01
1 11
1
0
002
Use LLR test to turn co-occurrence into indicators…
© 2014 MapR Technologies 29
Co-occurrence Binary Matrix
11not
not
1
© 2014 MapR Technologies 30
Which one is the anomalous co-occurrence?
A not AB 13 1000
not B 1000 100,000
A not AB 1 0
not B 0 10,000
A not AB 10 0
not B 0 100,000
A not AB 1 0
not B 0 2
© 2014 MapR Technologies 31
Which one is the anomalous co-occurrence?
A not AB 13 1000
not B 1000 100,000
A not AB 1 0
not B 0 10,000
A not AB 10 0
not B 0 100,000
A not AB 1 0
not B 0 20.90 1.95
4.52 14.3
© 2014 MapR Technologies 32
Co-Occurrence Matrix: Items x Items
1 2 01
1 11
1
0
002
Recap:Use LLR test to turn co-occurrence into indicators
© 2014 MapR Technologies 33
Indicator Matrix: Anomalous Co-Occurrence
Result:Each marked row shows the indicators of what to recommend…
✔✔
© 2014 MapR Technologies 34
Indicator Matrix: Anomalous Co-Occurrence
✔✔
Why not pony + other item?
© 2014 MapR Technologies 36
How will you deliver recommendations to users?
© 2014 MapR Technologies 37
Seeking Simplicity
Innovation: Exploit search technology to deploy your recommendation system
© 2014 MapR Technologies 38
But first, a look at how search works…
© 2014 MapR Technologies 39
Apache Solr/Apache Lucene• Apache Solr/Lucene is an open-source powerful search
engine used for flexible, heavily indexed queries including data such as– Full text, geographical data, statistically weighted data
• Lucene – Provides core retrieval– Is a low-level library
• Solr – Is a web-based wrapper around Lucene – Is easy to integrate because you talk to it via a web-interface
• URL http://host machine:8888
© 2014 MapR Technologies 40
LucidWorks• Enterprise platform and collection of applications based on
Apache Solr/Lucene– Wrapper around Solr– A free version ships with MapR
• LucidWorks leaves Solr exposed but makes Solr administration much easier, which in turn makes it easier to use Lucene
• URL http://host machine:8989
© 2014 MapR Technologies 41
Solr
LucidWorks
Query Response
Index
Lucene
Query Response
Data Source
Relationship of Solr /Lucene/LucidWorks
© 2014 MapR Technologies 42
Other Options: Elastic Search• Apache Lucene library at the heart of several approaches
– It can also be used on its own• Elastic Search is a different web interface for Lucene
– Real time search and analytics– Open source (not Apache)– Big advantage is less accumulated cruft
http://www.elasticsearch.com/
© 2014 MapR Technologies 44
What is a Document?• Data is stored in collections made up of documents• Documents contain fields that can be
– Indexed • makes field searchable; don’t have to index all
– Stored• If you want Solr to return content, have Solr store content• Not all data of interest must be stored: can access via stored URL (good for very large
data set)– Multi-value
• Body field can contain more than one type of data – Facetted
• A way to refine a search or use for statistics• Example: data for country: Could return facetted as “37 from US, 23 UK, 7 Japan”
© 2014 MapR Technologies 45
Fields and How to Set Them Up• Lucene is mostly used for text
– Text has to be tokenized• Also supports other types
– Long, string, keywords, comma separated
• Fields properties such as stored, indexed, faceted can also be defined
• Defaults aren’t usually so great
© 2014 MapR Technologies 46
Example of Facetted Search
The indexed fields “Area” and “Gender” have been facetted to provide counts for the results
© 2014 MapR Technologies 48
Field-specific Searches Using Lucene Syntax• Documents often have title, author, keywords and body text• General syntax is
field:(term1 term2)• Alternatives
field:(term1 OR term2) field:(“term1 term2” ~ 5) field:(term1 AND term2)
• Default field, default interpretations work very well for text
© 2014 MapR Technologies 49
Send Data to Solr (not LucidWorks)• LucidWorks has lots of spiders that can use file extension or
mime type to trigger certain file parsers– Works best with web-ish sources and mime types– Also includes MapR and MapR high volume indexers
• More common at modest volumes or for updates to use JSON format
{"id":"book_314", "title":{"set":"The Call of the Wild"}}
• Use REST interface to send update filescurl http://localhost:8983/solr/update \
-H 'Content-type:application/json' --data-binary @file.json
© 2014 MapR Technologies 51
Back to recommendation: How do you abuse search to make recommendation easy?
© 2014 MapR Technologies 52
Collection of Documents: Insert Meta-Data
Search Technology
Item meta-data
Document for “puppy” id: t4
title: puppydesc: The sweetest little puppy ever.keywords: puppy, dog, pet
Ingest easily via NFS
© 2014 MapR Technologies 53
From Indicator Matrix to New Indicator Field
✔id: t4title: puppydesc: The sweetest little puppy ever.keywords: puppy, dog, pet
indicators: (t1)
Solr document for “puppy”
Note: data for the indicator field is added directly to meta-data for a document in Apache Solr or Elastic Search index. You don’t need to create a separate index for the indicators.
© 2014 MapR Technologies 54
Let’s look at a real example: we built a music recommender
© 2014 MapR Technologies 56
© 2014 MapR Technologies 57
User activity: Listens to classic jazz hit “Take the A Train”
© 2014 MapR Technologies 58
System delivers recommendations based on activity
© 2014 MapR Technologies 59
Let’s look inside the music recommender…
© 2014 MapR Technologies 60
Music Meta Data for Search Document Collections
• MusicBrainz data
• Data includes Artist ID, MusicBrainz ID, Name, Group/Person, From (geo locations) and Gender as seen in this sample
© 2014 MapR Technologies 62
Sample User Behavior Histories: Music Log Files
13 START 10113 2182654281
23 BEACON 10113 218265 428124 START 10113 796006
1193502834 BEACON 10113 7960061193502844 BEACON 10113 7960061193502854 BEACON 10113 7960061193502864 BEACON 10113 7960061193502874 BEACON 10113 7960061193502884 BEACON 10113 7960061193502894 BEACON 10113 79600611935028104 BEACON 10113 79600611935028109 FINISH 10113 79600611935028111 START 10113 589999
12011972121 BEACON 10113 58999912011972
Time
Event type
User ID
Artist ID
Track ID
© 2014 MapR Technologies 63
Sample Music Log Files
Artist ID for jazz musician Duke Ellington
What has user 119 done here in the highlighted lines?
© 2014 MapR Technologies 64
Internals of a Recommendation Engine
© 2014 MapR Technologies 65
Internals of a Recommendation Engine
© 2014 MapR Technologies 66
id 1710mbid 592a3b6d-c42b-4567-99c9-ecf63bd66499name Chuck Berryarea United Statesgender Maleindicator_artists 386685,875994,637954,3418,1344,789739,1460, …
id 541902mbid 983d4f8f-473e-4091-8394-415c105c4656name Charlie Winstonarea United Kingdomgender Noneindicator_artists 997727,815,830794,59588,900,2591,1344,696268, …
Lucene Documents for Music RecommendationNotice that data from indicator matrix of trained Mahout recommender model has been added to indicator field in documents of the artists collection
© 2014 MapR Technologies 68
Offline Analysis
Analysis Using Mahout
Users History Log Files Indicators
Search Technology
Item Meta-Data
© 2014 MapR Technologies 69
Log FilesMahout Analysis
Search Technology
Item Meta-Data
Ingest easily via NFS
MapR Cluster
via NFS Python
Use Python directly via NFS
Pig
Web TierRecommendations
New User History
Real-time recommendations using MapR data platform
© 2014 MapR Technologies 70
A Quick Simplification• Users who do h
• Also do
User-centric recommendations
Item-centric recommendations
© 2014 MapR Technologies 71
Architectural Advantage
User-centric recommendations
Item-centric recommendations
© 2014 MapR Technologies 72
Architectural Advantage
User-centric recommendations
With the first design, you have to do the real-time computation first (in parenthesis). No way to pre-compute. Less efficient, less fast.
With the second design, you can pre-compute offline (overnight) things that change slowly. Only the smaller computation for new user vector (h) is done in real-time, so response is very fast.
Item-centric recommendations
© 2014 MapR Technologies 74
Tutorial Part 2:How to make recommendation better
© 2014 MapR Technologies 75
Going Further: Multi-Modal Recommendation
© 2014 MapR Technologies 76
Going Further: Multi-Modal Recommendation
© 2014 MapR Technologies 77
For example• Users enter queries (A)
– (actor = user, item=query) • Users view videos (B)
– (actor = user, item=video)• ATA gives query recommendation
– “did you mean to ask for”• BTB gives video recommendation
– “you might like these videos”
© 2014 MapR Technologies 78
The punch-line• BTA recommends videos in response to a query
– (isn’t that a search engine?)– (not quite, it doesn’t look at content or meta-data)
© 2014 MapR Technologies 79
Real-life example• Query: “Paco de Lucia”• Conventional meta-data search results:
– “hombres de paco” times 400– not much else
• Recommendation based search:– Flamenco guitar and dancers– Spanish and classical guitar– Van Halen doing a classical/flamenco riff
© 2014 MapR Technologies 80
Real-life example
© 2014 MapR Technologies 81
Hypothetical Example• Want a navigational ontology?• Just put labels on a web page with traffic
– This gives A = users x label clicks• Remember viewing history
– This gives B = users x items• Cross recommend
– B’A = label to item mapping• After several users click, results are whatever users think they
should be
© 2014 MapR Technologies 82
Nice. But we can do better?
© 2014 MapR Technologies 84
Symmetry Gives Cross Recommentations
Conventional recommendations with off-line learning
Cross recommendations
© 2014 MapR Technologies 85
users
things
© 2014 MapR Technologies 86
users
thingtype 1
thingtype 2
© 2014 MapR Technologies 87
© 2014 MapR Technologies 88
Bonus Round:
When worse is better
© 2014 MapR Technologies 89
The Real Issues After First Production• Exploration• Diversity• Speed
• Not the last fraction of a percent
© 2014 MapR Technologies 90
Result Dithering• Dithering is used to re-order recommendation results
– Re-ordering is done randomly
• Dithering is guaranteed to make off-line performance worse
• Dithering also has a near perfect record of making actual performance much better
© 2014 MapR Technologies 91
Result Dithering• Dithering is used to re-order recommendation results
– Re-ordering is done randomly
• Dithering is guaranteed to make off-line performance worse
• Dithering also has a near perfect record of making actual performance much better
“Made more difference than any other change”
© 2014 MapR Technologies 92
Why Dithering Works
Real-time recommender
Overnight training
Log Files
© 2014 MapR Technologies 93
Why Use Dithering?
© 2014 MapR Technologies 94
Simple Dithering Algorithm• Synthetic score from log rank plus Gaussian
• Pick noise scale to provide desired level of mixing
• Typically
• Also… use floor(t/T) as seed
© 2014 MapR Technologies 95
Example … ε = 2
© 2014 MapR Technologies 96
Lesson:Exploration is good
© 2014 MapR Technologies 97
Thank you
Ted Dunning, Chief Applications Architect, MapR TechnologiesEmail [email protected] [email protected] @Ted_Dunning
Apache Mahout https://mahout.apache.org/ Twitter @ApacheMahout
22 June 2014 Big Data Everywhere Conference #DataIsrael