Offline Evaluation of Recommender Systems: All Pain and No Gain?


DESCRIPTION

Keynote for the workshop on Reproducibility and Replication in Recommender Systems at ACM RecSys, Hong Kong, 12 October 2013.

TRANSCRIPT

Offline Evaluation of Recommender Systems

All pain and no gain?

Mark Levy, Mendeley

About me

Some things I built

Something I'm building

What is a good recommendation?

One that increases the usefulness of your product in the long run¹

1. WARNING: hard to measure directly

What is a good recommendation?

● One that increased your bottom line:

– User bought item after it was recommended

– User clicked ad after it was shown

– User didn't skip track when it was played

– User added document to library...

– User connected with contact...

Why was it good?

● Maybe it was

– Relevant

– Novel

– Familiar

– Serendipitous

– Well explained

● Note: some of these are mutually incompatible

What is a bad recommendation?

(you know one when you see one)

● Maybe it was

– Not relevant

– Too obscure

– Too familiar

– I already have it

– I already know that I don't like it

– Badly explained

What's the cost of getting it wrong?

● Depends on your product and your users

– Lost revenue

– Less engaged user

– Angry user

– Amused user

– Confused user

– User defects to a rival product

Hypotheses

Good offline metrics express product goals

Most (really) bad recommendations can be caught by business logic

Issues

● Real business goals concern long-term user behaviour e.g. Netflix

“we have reformulated the recommendation problem to the question of optimizing the probability a member chooses to watch a title and enjoys it enough to come back to the service”

● Usually have to settle for short-term surrogate

● Only some user behaviour is visible

● Same constraints when collecting training data

Least bad solution?

● “Back to the future” aka historical log analysis

● Decide which logged event(s) indicate success

● Be honest about “success”

● Usually care most about precision @ small k

● Recall will discriminate once this plateaus

● Expect to have to do online testing too
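To make the metrics above concrete, here is a minimal sketch of precision@k and recall@k computed from logged success events; the function names and per-user data structures are illustrative, not from the talk.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for a single user.

    recommended: ranked list of item ids produced by the recommender
    relevant:    set of item ids flagged as a "success" in the logs
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_precision_recall_at_k(recs_by_user, relevant_by_user, k=10):
    """Averages over users: precision@k tends to plateau at small k,
    after which recall@k keeps discriminating between systems."""
    pairs = [precision_recall_at_k(recs_by_user[u], relevant_by_user[u], k)
             for u in recs_by_user]
    n = len(pairs)
    return (sum(p for p, _ in pairs) / n,
            sum(r for _, r in pairs) / n)
```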

Making metrics meaningful

● Building a test framework + data is hard

● Be sure to get best value from your work

● Don't use straw man baselines

● Be realistic – leave the ivory tower

● Make test setups and baselines reproducible

Making metrics meaningful

● Old skool k-NN systems are better than you think

– Input numbers from mining logs

– Temporal “modelling” (e.g. fake users)

– Data pruning (scalability, popularity bias, quality)

– Preprocessing (tf-idf, log/sqrt, …)

– Hand crafted similarity metric

– Hand crafted aggregation formula

– Postprocessing (popularity matching)

– Diversification

– Attention profile
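As a rough illustration of the pipeline listed above, here is a minimal item-based k-NN sketch with log/idf preprocessing, a cosine similarity pruned to k neighbours, and a weighted-sum aggregation; the specific weighting choices and neighbourhood size are placeholder assumptions, not the production systems described in the talk.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize


def item_knn_scores(user_item, k=50):
    """Item-based k-NN on a sparse user x item interaction matrix.

    user_item: csr_matrix of implicit counts mined from logs.
    Returns a dense user x item score matrix (only sensible for toy data).
    """
    # preprocessing: log damping of counts plus an idf-style item weight
    counts = user_item.astype(np.float64)
    counts.data = np.log1p(counts.data)
    item_freq = counts.getnnz(axis=0)
    idf = np.log(counts.shape[0] / (1.0 + item_freq))
    counts = csr_matrix(counts.multiply(idf))

    # hand-crafted similarity: cosine between item vectors, pruned to k neighbours
    items = normalize(counts.T)              # item x user rows, L2-normalised
    sim = (items @ items.T).toarray()
    np.fill_diagonal(sim, 0.0)
    for row in sim:
        row[np.argsort(row)[:-k]] = 0.0      # keep only the top-k neighbours

    # hand-crafted aggregation: weighted sum of similarities to the user's items
    return user_item @ sim
```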

Making metrics meaningful

● Measure preference honestly

● Predicted items may not be “correct” just because they were consumed once

● Try to capture value

– Earlier recommendation may be better

– Don't need a recommender to suggest items by same artist/author

● Don't neglect side data

– At least use it for evaluation / sanity checking

Making metrics meaningful

● Public data isn't enough for reproducibility or fair comparison

● Need to document preprocessing

● Better:

Release your preparation/evaluation code too

What's the cost of poor evaluation?

Poor offline evaluation can lead to years of misdirected research

Ex 1: Reduce playlist skips

● Reorder a playlist of tracks to reduce skips by avoiding “genre whiplash”

● Use audio similarity measure to compute transition distance, then travelling salesman

● Metric: sum of transition distances (lower is better)

● 6 months work to develop solution

Ex 1: Reduce playlist skips

● Result: users skipped more often

● Why?

Ex 1: Reduce playlist skips

● Result: users skipped more often

● When a user skipped a track they didn't like, they were played something else just like it

● Better metric: average position of skipped tracks (based on logs, lower down is better)
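A minimal sketch of that log-based metric, the average position of skipped tracks; the data layout (per-playlist lists of (track, skipped) pairs) is an assumption for illustration.

```python
def mean_skip_position(playlists):
    """Average (1-based) position of skipped tracks across logged playlists.

    playlists: iterable of playlists, each a list of (track_id, skipped) pairs
    in play order. Higher is better: skips are pushed further down the list.
    """
    positions = [pos
                 for playlist in playlists
                 for pos, (_, skipped) in enumerate(playlist, start=1)
                 if skipped]
    return sum(positions) / len(positions) if positions else float("nan")
```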

Ex 2: Recommend movies

● Use a corpus of star ratings to improve movie recommendations

● Learn to predict ratings for un-rated movies

● Metric: average RMSE of predictions for a hidden test set (lower is better)

● 2+ years work to develop new algorithms
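For reference, a minimal sketch of the RMSE metric on a hidden test set of ratings; the dict-based layout is illustrative.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error over a hidden test set of ratings (lower is better).

    predicted, actual: dicts mapping (user, item) pairs to ratings.
    """
    keys = list(actual)
    squared_error = sum((predicted[k] - actual[k]) ** 2 for k in keys)
    return math.sqrt(squared_error / len(keys))
```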

Ex 2: Recommend movies

● Result: “best” solutions were never deployed

● Why?

Ex 2: Recommend movies

● Result: “best” solutions were never deployed

● User behaviour correlates with rank not RMSE

● Side datasets an order of magnitude more valuable than algorithm improvements

● Explicit ratings are the exception not the rule

● RMSE still haunts research labs

Can contests help?

● Good:

– Great for consistent evaluation

● Not so good:

– Privacy concerns mean obfuscated data

– No guarantee that metrics are meaningful

– No guarantee that train/test framework is valid

– Small datasets can become overexposed

Ex 3: Yahoo! Music KDD Cup

● Largest music rating dataset ever released

● Realistic “loved songs” classification task

● Data fully obfuscated due to recent lawsuits

Ex 3: Yahoo! Music KDD Cup

● Result: researchers hated it

● Why?

Ex 3: Yahoo! Music KDD Cup

● Result: researchers hated it

● Research frontier focussed on audio content and metadata, not joinable to obfuscated ratings

Ex 4: Million Song Challenge

● Large music dataset with rich metadata

● Anonymized listening histories

● Simple item recommendation task

● Reasonable MAP@500 metric

● Aimed to solve shortcomings of KDD Cup

● Only obfuscation was removal of timestamps
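A hedged sketch of a MAP@k evaluation of the kind used for the challenge's MAP@500 metric; normalisation conventions differ slightly between implementations, and the names here are illustrative.

```python
def average_precision_at_k(recommended, relevant, k=500):
    """Average precision@k for one user (relevant = held-out listened tracks)."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i          # precision at each hit position
    return score / min(len(relevant), k) if relevant else 0.0


def map_at_k(recs_by_user, relevant_by_user, k=500):
    """Mean average precision@k over all users."""
    users = list(recs_by_user)
    return sum(average_precision_at_k(recs_by_user[u], relevant_by_user[u], k)
               for u in users) / len(users)
```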

Ex 4: Million Song Challenge

● Result: winning entry didn't use side data

● Why?

Ex 4: Million Song Challenge

● Result: winning entry didn't use side data

● No timestamps so test tracks chosen at random

● So “people who listen to A also listen to B”

● Traditional item similarity solves this well

● More honesty about “success” might have shown that contest data was flawed

Ex 5: Yelp RecSys Challenge

● Small business review dataset with side data

● Realistic mix of input data types

● Rating prediction task

● Informal procedure to create train/test sets

Ex 5: Yelp RecSys Challenge

● Result: baseline algorithms high up leaderboard

● Why?

Ex 5: Yelp RecSys Challenge

● Result: baseline algorithms high up leaderboard

● Train/test split was corrupt

● Competition organisers moved fast to fix this

● But left only one week before deadline

Ex 6: MIREX Audio Chord Estimation

● Small dataset of audio tracks

● Task to label with predicted chord symbols

● Human labelled data hard to come by

● Contest hosted by premier forum in field

● Evaluate frame-level prediction accuracy

● Historical glass ceiling around 80%
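A minimal sketch of frame-level accuracy for chord estimation; it assumes the predicted and reference label sequences are already aligned frame by frame, and omits the chord-vocabulary and enharmonic mapping details that the MIREX evaluation also handles.

```python
def frame_accuracy(predicted_frames, reference_frames):
    """Fraction of analysis frames whose predicted chord label matches the reference.

    predicted_frames, reference_frames: equal-length sequences of chord symbols,
    one per frame.
    """
    if len(predicted_frames) != len(reference_frames):
        raise ValueError("frame sequences must be aligned and of equal length")
    correct = sum(p == r for p, r in zip(predicted_frames, reference_frames))
    return correct / len(reference_frames)
```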

Ex 6: MIREX Audio Chord Estimation

● Result: 2011 winner ftw

● Why?

Ex 6: MIREX Audio Chord Estimation

● Result: 2011 winner ftw

● Spoof entry relying on known test set

● Protest against inadequate test data

● Other research showed weak generalisation of winning algorithms from same contest

● Next year's results dropped significantly

So why evaluate offline at all?

● Building test framework ensures clear goals

● Avoid wishful thinking if your data is too thin

● Be efficient with precious online testing

– Cut down huge parameter space

– Don't alienate users

● Need to publish

● Pursuing science as well as profit

Online evaluation is tricky too

● No off the shelf solution for services

● Many statistical gotchas

● Same mismatch between short-term and long-term success criteria

● Results open to interpretation by management

● Can make incremental improvements look good when radical innovation is needed

Ex 7: Article Recommendations

● Recommender for related research articles

● Massive download logs available

● Framework developed based on co-downloads

● Aim to improve on existing search solution

● Management “keen for it to work”

● Several weeks of live A/B testing available

● No offline evaluation
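For illustration, a minimal sketch of how co-download counts might be mined from the logs the framework was built on; the per-reader data layout is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def co_download_counts(downloads_by_reader):
    """Count how often pairs of articles were downloaded by the same reader.

    downloads_by_reader: dict mapping a reader id (here effectively an
    organisational IP range) to the set of article ids they downloaded.
    """
    counts = defaultdict(int)
    for articles in downloads_by_reader.values():
        for a, b in combinations(sorted(articles), 2):
            counts[(a, b)] += 1
    return counts
```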

Ex 7: Article Recommendations

● Result: worse than similar title search

● Why?

Ex 7: Article Recommendations

● Result: worse than similar title search

● Inadequate business rules e.g. often suggesting other articles from same publication

● Users identified only by organisational IP range so value of “big data” very limited

● Establishing an offline evaluation protocol would have shown these in advance

Isn't there software for that?

Rules of the game:

– Model fit metrics (e.g. validation loss) don't count

– Need a transparent “audit trail” of data to support genuine reproducibility

– Just using public datasets doesn't ensure this

Isn't there software for that?

Wish list for reproducible evaluation:

– Integrate with recommender implementations

– Handle data formats and preprocessing

– Handle splitting, cross-validation, side datasets

– Save everything to file

– Work from file inputs so not tied to one framework

– Generate meaningful metrics

– Well documented and easy to use
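As a sketch of the “save everything to file” / audit-trail idea from the wish list above, here is a minimal reproducible train/test split writer; the file format and function names are illustrative, not any existing tool's API.

```python
import json
import random


def write_split(interactions, test_fraction=0.2, seed=42, path="split.json"):
    """Write a reproducible train/test split to file, so the exact data behind
    every reported number can be audited later.

    interactions: list of (user, item) pairs; names and format are illustrative.
    """
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    split = {
        "seed": seed,
        "test_fraction": test_fraction,
        "train": shuffled[:cut],
        "test": shuffled[cut:],
    }
    with open(path, "w") as f:
        json.dump(split, f)
    return split
```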

Isn't there software for that?

Current offerings:

● GraphChi/GraphLab

● Mahout

● LensKit

● MyMediaLite

Isn't there software for that?

Current offerings:

● GraphChi/GraphLab

– Only reports model validation loss, doesn't count

● Mahout

– Only rating prediction accuracy, doesn't count

● LensKit

– Too hard to understand, won't use

Isn't there software for that?

Current offerings:

● MyMediaLite

– Reports meaningful metrics

– Handles cross-validation

– Data splitting not transparent

– No support for pre-processing

– No built-in support for standalone evaluation

– API is capable but current utils don't meet wishlist

Eating your own dog food

● Built a small framework around new algorithm

● https://github.com/mendeley/mrec

– Reports meaningful metrics

– Handles cross-validation

– Supports simple pre-processing

– Writes everything to file for reproducibility

– Provides API and utility scripts

– Runs standalone evaluations

– Readable Python code

Eating your own dog food

● Some lessons learned

– Usable frameworks are hard to write

– Tradeoff between clarity and scalability

– Should generate explicit validation sets

● Please contribute!

● Or use as inspiration to improve existing tools

Where next?

● Shift evaluation online:

– Contests based around online evaluation

– Realistic but not reproducible

– Could some run continuously?

● Recommender Systems as a commodity:

– Software and services reaching maturity now

– Business users can tune/evaluate themselves

– Is there a way to report results?

Where next?

● Support alternative query paradigms:

– More like this, less like that

– Metrics for dynamic/online recommenders

● Support recommendation with side data:

– LibFM, GenSGD, WARP research @google, …

– Open datasets?

Thanks for listening

mark.levy@mendeley.com

@gamboviol

https://github.com/gamboviol

https://github.com/mendeley/mrec
