Search Engines for Machine Learning: presented by Joe Blue, MapR
TRANSCRIPT
Search Engines for Machine Learning
Joseph Blue, Data Scientist, MapR
ROADMAP
The Deployment Challenge (WANT)
All About Recommenders (BUILD)
Search Engine Delivers Results (DEPLOY)
Improving Those Results (IMPROVE)
Recommendations
• Data: interactions between people taking action (users) and items
• Used to train a recommendation model
• Goal is to suggest additional interactions
• Example applications: movie, music or map-based restaurant choices; suggesting sale items for e-stores or via cash-register receipts
WANT · BUILD · DEPLOY · IMPROVE
Spend your Cycles Wisely
[Figure: time spent on each stage, shown as DATA, DEVELOP, DEPLOY versus DATA, D & D]
Take more time to understand your data and deploy a good recommender quickly
Of bikes and ponies
[Figure: users Alice, Bob, Charles and Amelia, with a "?" marking an unknown recommendation]
What if everybody gets a pony?
What else would you recommend for new user Amelia?
Three Matrices
[Figure: three matrices for users Alice, Bob and Charles: user-item interaction, item co-occurrence, indicators]
But we need a method for identifying anomalous co-occurrence…
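As a rough illustration of how the item co-occurrence matrix follows from user-item interactions, here is a minimal sketch in Python. The user histories and item names are made up for the example and are not the values from the slide.

```python
from collections import Counter
from itertools import combinations

# Hypothetical user-item interactions (not the data shown on the slide).
history = {
    "alice": {"apple", "puppy"},
    "bob": {"apple", "puppy", "pony"},
    "charles": {"bike", "pony"},
}

def cooccurrence(history):
    """Count, for every pair of items, how many users interacted with both."""
    counts = Counter()
    for items in history.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts

counts = cooccurrence(history)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Raw counts like these favour popular items, which is why a test for anomalous co-occurrence is still needed.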
Log Likelihood Two Ways
[Figure: overlapping sets of USERS]
• Size = number of users who interact with that item
• Overlap = number of users who have two items in common
• LL = f(size, overlap, number of users)
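One way to read the three quantities above is as the inputs to a 2×2 contingency table for a pair of items. The sketch below assumes that reading; the function name and the example numbers are hypothetical.

```python
def contingency(size_a, size_b, overlap, n_users):
    """Turn item sizes, overlap and user count into 2x2 counts.

    Returns (both, only_a, only_b, neither).
    """
    both = overlap
    only_a = size_a - overlap
    only_b = size_b - overlap
    neither = n_users - size_a - size_b + overlap
    if min(both, only_a, only_b, neither) < 0:
        raise ValueError("inconsistent sizes, overlap and user count")
    return both, only_a, only_b, neither

# Hypothetical example: two items with 1,000 users each, 100 users in common,
# out of 100,000 users overall.
table = contingency(1000, 1000, 100, 100000)
print(table)
```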
Items will be shared by users, but how much is too much?
[Figure: two examples of overlap between items. Values as extracted: 10; 10,000; 0; 0; 13 and 100,000; 1,000; 1,000; 14.3; 0.90]
$$ LL = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} y_{ij} \log\left(\frac{y_{ij}}{\mu_{ij}}\right) $$
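The formula can be sketched in a few lines of Python. The slide does not define its symbols, so this assumes the usual reading: y_ij are the observed counts in a 2×2 table for a pair of items, and µ_ij are the counts expected if the two items were independent. Cells with a zero count are skipped, since they contribute nothing to the sum.

```python
import math

def log_likelihood(k11, k12, k21, k22):
    """LL = 2 * sum(y * log(y / mu)) over the four cells of a 2x2 table.

    k11: users with both items, k12: only the first item,
    k21: only the second item, k22: neither item.
    """
    observed = [[k11, k12], [k21, k22]]
    total = k11 + k12 + k21 + k22
    row_sums = [k11 + k12, k21 + k22]
    col_sums = [k11 + k21, k12 + k22]
    ll = 0.0
    for i in range(2):
        for j in range(2):
            y = observed[i][j]
            if y > 0:
                # Expected count under independence of rows and columns.
                mu = row_sums[i] * col_sums[j] / total
                ll += y * math.log(y / mu)
    return 2.0 * ll

# Independent items score zero; perfectly overlapping items score high.
print(log_likelihood(10, 10, 10, 10))
print(log_likelihood(10, 0, 0, 10))
```

A pair would then count as an indicator when its score exceeds some threshold; the threshold itself is a tuning choice not given in the slides.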
Updating the metadata = deployment
[Figure: indicator matrix, with check marks on the indicator entries]
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)
Solr document for “puppy”
Note: data for the indicator field is added directly to the meta-data for a document in an Apache Solr collection. You don’t need to create a separate index for the indicators.
Complete indicator matrix from log-likelihood…
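Since deployment amounts to updating a field on an existing document, one possible way to express it is a Solr atomic update, which uses a `set` modifier in the JSON body. The sketch below only builds that body; the helper name is hypothetical, and it assumes a multi-valued `indicators` field in a schema that permits atomic updates. Sending it would be an HTTP POST to the collection's `/update` handler.

```python
import json

def indicator_update(doc_id, indicator_ids, field="indicators"):
    """Build a Solr atomic-update payload that overwrites the indicator field."""
    return [{"id": doc_id, field: {"set": list(indicator_ids)}}]

# The "puppy" document (t4) gets item t1 as its indicator.
payload = indicator_update("t4", ["t1"])
body = json.dumps(payload)
print(body)
```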
Example Workflow
[Figure: example workflow with numbered steps 1 to 3.
OFFLINE: Log Files; Mahout Analysis; Pig; Python (use Python directly via NFS); MapR Cluster (ingest easily via NFS); Item Meta-Data; SOLR COLLECTION.
ON-LINE: Web Tier; New User History; Recommendations.]
But we can do better…
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)
The indicated items are returned when we query the collection based on user history, but not all user behaviors are created equal.
Items with opposite polarity may turn your recommendations into a spam generator.
Example: consider the difference in future purchases after viewing or purchasing razor blades vs. a Blu-ray movie…
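The query based on user history described above can be sketched as a plain Solr search against the `indicators` field. The base URL, collection name and item ids below are assumptions for illustration; the code only builds the request URL and does not contact a server.

```python
from urllib.parse import urlencode

def indicator_query_url(base_url, history, rows=10):
    """Build a Solr /select URL matching documents whose indicators
    include any item from the user's recent history."""
    terms = " OR ".join(history)
    params = {"q": f"indicators:({terms})", "rows": rows, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical collection "items" on a local Solr; the user touched t1 and t3.
url = indicator_query_url("http://localhost:8983/solr/items", ["t1", "t3"])
print(url)
```

Documents such as "puppy" (indicators: t1) would match this query and come back ranked by Solr's relevance score.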
Knowing your Data moves the Needle
[Figure: separate indicator matrices derived from purchases and from clicks]
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
purchase indicators: (t1)
click indicators: (t2)
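With one indicator field per behavior, a query can treat each behavior differently. The sketch below builds a query string using standard Lucene field grouping and `^` boosts. The field names `purchase_indicators` and `click_indicators`, the weights and the item ids are all assumptions for illustration, not values from the talk.

```python
def weighted_indicator_query(history_by_behavior, weights):
    """Combine per-behavior histories into one boosted Solr query string."""
    clauses = []
    for behavior, items in history_by_behavior.items():
        if not items:
            continue  # No history for this behavior, so no clause.
        field = f"{behavior}_indicators"
        boost = weights.get(behavior, 1.0)
        clauses.append(f"{field}:({' OR '.join(items)})^{boost}")
    return " OR ".join(clauses)

# Hypothetical history: one purchase and two clicks, purchases weighted higher.
query = weighted_indicator_query(
    {"purchase": ["t1"], "click": ["t2", "t5"]},
    {"purchase": 2.0, "click": 0.5},
)
print(query)
```

Suitable weights depend on the data; the point is only that purchases and clicks need not contribute equally to the ranking.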