Google News Personalization
Big Data reading group, November 12, 2007
Presented by Babu Pillai
Problem: finding stuff on the Internet
• Know what you want:
– content-based filtering
– search
• Don't know what you want:
– browse
• How to handle: "I don't know what I want, but show me something interesting!"
Google News
• Top Stories
• Recommendations for registered users
• Based on user click history and community clicks
Problem Scale
• Lots of users (more is good)
– Millions of clicks from millions of users
• Problem: high churn in the item set
– Several million items (clusters of news articles about the same story, as identified by Google News) per month
– Continuous addition and deletion
• Strict timing constraints (a few hundred ms)
• Existing systems not suitable
Memory-based Ratings
• General form:
  r(u_a, s_k) ∝ Σ_i w(u_a, u_i) · I(u_i, s_k)
  where r is the rating of item s_k for user u_a, w(u_a, u_i) is the similarity between users u_a and u_i, and I(u_i, s_k) = 1 if u_i clicked s_k
• Problem: scalability, even when similarity is computed offline
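As a sketch, the memory-based form above can be computed directly. The names (`histories`, `predict_rating`) and the choice of Jaccard similarity for w are illustrative, not the paper's exact weighting:

```python
# Minimal sketch of memory-based rating over binary click histories.

def jaccard(a, b):
    """Similarity w(u_a, u_i) between two users' click sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def predict_rating(user, item, histories):
    """Unnormalized rating: sum of similarities to every user who clicked `item`."""
    return sum(
        jaccard(histories[user], clicks)
        for other, clicks in histories.items()
        if other != user and item in clicks
    )

histories = {
    "alice": {"s1", "s2"},
    "bob":   {"s1", "s3"},
    "carol": {"s2", "s3"},
}
score = predict_rating("alice", "s3", histories)  # bob and carol both clicked s3
```

With millions of users, this sum over the whole user base is exactly the scalability problem the slide points at.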
Model-based techniques
• Clustering / segmentation, e.g. based on interests
• Bayesian models, Markov decision processes, …
– All are computationally expensive
What’s in this paper?
• Investigate two different ways to cluster users: MinHash and PLSI
• Implement both on MapReduce
Google News Rating Model
• 1 click = 1 positive vote
• Noisier than 1–5 ratings (e.g. Netflix)
• No explicit negatives
• Why might it work? Partly due to the fairly substantial article snippets shown before the click, so a user who clicks is likely genuinely interested
Design guidelines for a scalable rating system
• Associate users into clusters of similar users (based on prior clicks, offline)
• Users can belong to multiple clusters
• Generate ratings using much smaller sets of user clusters, rather than all users:
  r(u_a, s_k) ∝ Σ_{c ∋ u_a} w(u_a, c) · Σ_{u_i ∈ c} I(u_i, s_k)
Technique 1: MinHash
• Probabilistically assign users to clusters based on click history
• Use the Jaccard coefficient over click histories:
  S(u_i, u_j) = |C_i ∩ C_j| / |C_i ∪ C_j|
  where C_i is the set of items clicked by user u_i; 1 − S is a metric distance
• Computing this for all user pairs is too expensive, not feasible even offline
MinHash as a form of Locality Sensitive Hashing
• Basic idea: assign a hash value to each user based on click history
• How: randomly permute the set of all items; the hash value for a user is the id of the first item in this order that appears in the user's click history
• Probability that 2 users have the same hash is equal to the Jaccard coefficient
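This property can be checked empirically. A small simulation (illustrative, not from the paper) estimates the collision probability of two users' MinHash values and compares it to their Jaccard coefficient:

```python
import random

def minhash(click_set, perm):
    """Id of the first item in the permutation that the user has clicked."""
    return next(item for item in perm if item in click_set)

items = list(range(20))
a = {0, 1, 2, 3, 4, 5}
b = {4, 5, 6, 7}
true_jaccard = len(a & b) / len(a | b)  # 2 / 8 = 0.25

rng = random.Random(7)
trials = 20000
same = 0
for _ in range(trials):
    perm = rng.sample(items, len(items))  # one shared random permutation
    if minhash(a, perm) == minhash(b, perm):
        same += 1
estimate = same / trials  # ≈ 0.25
```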
Using MinHash for clusters
• Concatenate p>1 such hashes as cluster id for increased precision
• Apply q>1 in parallel (users belong to q clusters) to improve recall
• Don’t actually maintain p*q permutations: hash item id with random seed to get proxy for permutation index, for p*q different seeds
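A sketch of this construction, with a seeded hash standing in for each permutation; the hash function choice and the names are assumptions, not the paper's implementation:

```python
import hashlib

def seeded_hash(item, seed):
    """Deterministic proxy for the item's position under the seed-th permutation."""
    return int(hashlib.sha1(f"{seed}:{item}".encode()).hexdigest(), 16)

def cluster_ids(click_history, p=3, q=2):
    """Return q cluster ids, each the concatenation of p MinHash values."""
    ids = []
    for j in range(q):
        hashes = []
        for i in range(p):
            seed = j * p + i
            # MinHash under this "permutation": clicked item with smallest hash
            hashes.append(min(click_history, key=lambda s: seeded_hash(s, seed)))
        ids.append("|".join(hashes))
    return ids

ids = cluster_ids({"s1", "s2", "s3"})
```

Concatenating p hashes makes a collision require p simultaneous matches (precision); running q independent copies gives each user q chances to share a cluster with a similar user (recall).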
MinHash on MapReduce
• Generate p x q hashes for each user based on click history; generate q p-long cluster ids by concatenation
• Map using cluster ids as keys
• Reduce to form membership lists for each cluster id
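The two steps can be mimicked in memory (a toy illustration, not the actual MapReduce API):

```python
from collections import defaultdict

def map_phase(user_to_cluster_ids):
    """Emit (cluster_id, user) pairs, keyed by cluster id."""
    for user, cids in user_to_cluster_ids.items():
        for cid in cids:
            yield cid, user

def reduce_phase(pairs):
    """Collect the membership list for each cluster id."""
    members = defaultdict(list)
    for cid, user in pairs:
        members[cid].append(user)
    return dict(members)

memberships = reduce_phase(map_phase({
    "alice": ["c1", "c2"],
    "bob":   ["c1"],
    "carol": ["c2"],
}))
```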
Technique 2: PLSI clustering
• Probabilistic Latent Semantic Indexing
• Main idea: a hidden variable z that correlates users and items:
  p(s|u) = Σ_z p(s|z) p(z|u)
• Generate this clustering from the training set with the EM algorithm given by Hofmann '04
– Iterative technique: generates new probability estimates based on previous estimates
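A toy version of one EM iteration on binary click data; single-machine and illustrative, whereas the paper's version is distributed and works with counts N(z, s), N(z):

```python
import random

def em_step(clicks, p_z_u, p_s_z, Z):
    """One EM iteration for PLSI: q(z; u, s) ∝ p(s|z) p(z|u), then re-estimate."""
    # E-step: posterior over the hidden variable for each observed click
    q = {}
    for (u, s) in clicks:
        w = [p_s_z[z][s] * p_z_u[u][z] for z in range(Z)]
        tot = sum(w)
        q[(u, s)] = [x / tot for x in w]
    # M-step: p(z|u) from each user's clicks, p(s|z) from each cluster's clicks
    for u in p_z_u:
        c = [sum(q[(u2, s)][z] for (u2, s) in clicks if u2 == u) for z in range(Z)]
        p_z_u[u] = [x / sum(c) for x in c]
    for z in range(Z):
        c = {s: sum(q[(u, s2)][z] for (u, s2) in clicks if s2 == s) for s in p_s_z[z]}
        tot = sum(c.values())
        p_s_z[z] = {s: x / tot for s, x in c.items()}

clicks = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "c")]
Z = 2
rng = random.Random(0)
p_z_u = {u: [0.5, 0.5] for u in {"u1", "u2", "u3"}}
p_s_z = []
for _ in range(Z):  # random init breaks symmetry between clusters
    w = {s: rng.random() for s in {"a", "b", "c"}}
    p_s_z.append({s: v / sum(w.values()) for s, v in w.items()})
for _ in range(10):
    em_step(clicks, p_z_u, p_s_z, Z)
```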
PLSI as MapReduce
• The E-step posterior q* can be computed independently for each (u, s) pair, given the prior N(z, s), N(z), and p(z|u): map to R×K machines (R, K partitions for u and s respectively)
• Reduce is simply addition
PLSI in a dynamic environment
• Treat Z as user clusters
• On each click, update p(s|z) for all clusters the user belongs to
• This approximates PLSI, but can be updated dynamically as new items are added
• Does not allow the addition of new users
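A sketch of the per-cluster statistics such an update could maintain; the count-based approximation and the names are illustrative assumptions:

```python
from collections import defaultdict

class ClusterStats:
    """Running click counts for one cluster z: N(z, s) and N(z)."""
    def __init__(self):
        self.counts = defaultdict(float)  # N(z, s)
        self.total = 0.0                  # N(z)

    def record_click(self, item, weight=1.0):
        # weight could be the clicking user's membership probability p(z|u)
        self.counts[item] += weight
        self.total += weight

    def p_item(self, item):
        """Approximate p(s|z) = N(z, s) / N(z); new items need no retraining."""
        return self.counts[item] / self.total if self.total else 0.0

stats = ClusterStats()
for item in ["s1", "s2", "s1"]:
    stats.record_click(item)
```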
Cluster-based recommendation
• For each cluster, maintain number of clicks, decayed by time, for each item visited by a member
• For a candidate item, look up the user's clusters and add up the age-discounted visitation counts, normalized by total clicks
• Do this using both MinHash and PLSI clustering
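A sketch of the scoring step with an exponential age discount; the half-life and the exact normalization are assumptions, not details given in the slides:

```python
HALF_LIFE_HOURS = 24.0  # assumed decay rate

def decay(age_hours):
    return 0.5 ** (age_hours / HALF_LIFE_HOURS)

def score(candidate, user_clusters, cluster_clicks, now):
    """Sum age-discounted clicks on `candidate` over the user's clusters,
    each normalized by that cluster's total discounted clicks."""
    total = 0.0
    for c in user_clusters:
        clicks = cluster_clicks.get(c, [])  # list of (item, click_time) pairs
        denom = sum(decay(now - t) for _, t in clicks)
        num = sum(decay(now - t) for item, t in clicks if item == candidate)
        if denom:
            total += num / denom
    return total

cluster_clicks = {"c1": [("s1", 0.0), ("s2", 12.0), ("s1", 20.0)]}
s = score("s1", ["c1"], cluster_clicks, now=24.0)
```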
One more technique: Covisitation
• Memory-based technique
• Create an adjacency matrix between all pairs of items (can be directed)
• Increment the corresponding count when one item is visited soon after another
• Recommendation: for candidate item j, sum the counts from i to j over all items i in the user's recent click history, normalized appropriately
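A sketch of the covisitation counts and the scoring step; the window length and normalization choice are illustrative assumptions:

```python
from collections import defaultdict

WINDOW_HOURS = 2.0  # clicks closer in time than this count as covisits

covisit = defaultdict(float)  # (i, j) -> covisitation count

def record_session(session):
    """session: time-ordered list of (item, click_time) pairs."""
    for a in range(len(session)):
        for b in range(a + 1, len(session)):
            (i, ti), (j, tj) = session[a], session[b]
            if tj - ti <= WINDOW_HOURS:
                covisit[(i, j)] += 1.0  # store both directions (undirected here)
                covisit[(j, i)] += 1.0

def recommend_score(candidate, recent_items):
    """Sum counts from each recent item i to the candidate, normalized by
    the total covisits of the recent items."""
    num = sum(covisit.get((i, candidate), 0.0) for i in recent_items)
    denom = sum(v for (i, _), v in covisit.items() if i in recent_items)
    return num / denom if denom else 0.0

record_session([("s1", 0.0), ("s2", 0.5), ("s3", 3.0)])
top = recommend_score("s2", ["s1"])
```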
Whole System
• Offline clustering
• Online click history update, cluster item stats update, covisitation update
Results
Generally around 30–50% better than popularity-based recommendations.
The techniques don't work well together, though.
Discussion
• Covisitation appears to work as well as clustering
• Operational details missing: how big are cluster memberships, etc.
• All of the clustering is done offline