Offline Evaluation of Recommender Systems: All Pain and No Gain?


DESCRIPTION

Keynote for the workshop on Reproducibility and Replication in Recommender Systems at ACM RecSys, Hong Kong, 12 October 2013.

TRANSCRIPT

Offline Evaluation of Recommender Systems

All pain and no gain?

Mark Levy, Mendeley

About me

Some things I built

Something I'm building

What is a good recommendation?

One that increases the usefulness of your product in the long run¹

1. WARNING: hard to measure directly

What is a good recommendation?

● One that increased your bottom line:

– User bought item after it was recommended

– User clicked ad after it was shown

– User didn't skip track when it was played

– User added document to library...

– User connected with contact...

Why was it good?

● Maybe it was

– Relevant

– Novel

– Familiar

– Serendipitous

– Well explained

● Note: some of these are mutually incompatible

What is a bad recommendation?

(you know one when you see one)

● Maybe it was

– Not relevant

– Too obscure

– Too familiar

– I already have it

– I already know that I don't like it

– Badly explained

What's the cost of getting it wrong?

● Depends on your product and your users

– Lost revenue

– Less engaged user

– Angry user

– Amused user

– Confused user

– User defects to a rival product

Hypotheses

Good offline metrics express product goals

Most (really) bad recommendations can be caught by business logic

Issues

● Real business goals concern long-term user behaviour e.g. Netflix

“we have reformulated the recommendation problem to the question of optimizing the probability a member chooses to watch a title and enjoys it enough to come back to the service”

● Usually have to settle for short-term surrogate

● Only some user behaviour is visible

● Same constraints when collecting training data

Least bad solution?

● “Back to the future” aka historical log analysis

● Decide which logged event(s) indicate success

● Be honest about “success”

● Usually care most about precision @ small k

● Recall will discriminate once this plateaus

● Expect to have to do online testing too
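To make the metrics above concrete, here is a minimal sketch of precision@k and recall@k computed from logged success events; the function names and per-user data structures are illustrative, not from the talk.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for a single user.

    recommended: ranked list of item ids produced by the recommender
    relevant:    set of item ids flagged as a "success" in the logs
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_precision_recall_at_k(recs_by_user, relevant_by_user, k=10):
    """Averages over users: precision@k tends to plateau at small k,
    after which recall@k keeps discriminating between systems."""
    pairs = [precision_recall_at_k(recs_by_user[u], relevant_by_user[u], k)
             for u in recs_by_user]
    n = len(pairs)
    return (sum(p for p, _ in pairs) / n,
            sum(r for _, r in pairs) / n)
```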

Making metrics meaningful

● Building a test framework + data is hard

● Be sure to get best value from your work

● Don't use straw man baselines

● Be realistic – leave the ivory tower

● Make test setups and baselines reproducible

Making metrics meaningful

● Old skool k-NN systems are better than you think

– Input numbers from mining logs

– Temporal “modelling” (e.g. fake users)

– Data pruning (scalability, popularity bias, quality)

– Preprocessing (tf-idf, log/sqrt, …)

– Hand crafted similarity metric

– Hand crafted aggregation formula

– Postprocessing (popularity matching)

– Diversification

– Attention profile
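As a rough illustration of the pipeline listed above, here is a minimal item-based k-NN sketch with log/idf preprocessing, a cosine similarity pruned to k neighbours, and a weighted-sum aggregation; the specific weighting choices and neighbourhood size are placeholder assumptions, not the production systems described in the talk.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize


def item_knn_scores(user_item, k=50):
    """Item-based k-NN on a sparse user x item interaction matrix.

    user_item: csr_matrix of implicit counts mined from logs.
    Returns a dense user x item score matrix (only sensible for toy data).
    """
    # preprocessing: log damping of counts plus an idf-style item weight
    counts = user_item.astype(np.float64)
    counts.data = np.log1p(counts.data)
    item_freq = counts.getnnz(axis=0)
    idf = np.log(counts.shape[0] / (1.0 + item_freq))
    counts = csr_matrix(counts.multiply(idf))

    # hand-crafted similarity: cosine between item vectors, pruned to k neighbours
    items = normalize(counts.T)              # item x user rows, L2-normalised
    sim = (items @ items.T).toarray()
    np.fill_diagonal(sim, 0.0)
    for row in sim:
        row[np.argsort(row)[:-k]] = 0.0      # keep only the top-k neighbours

    # hand-crafted aggregation: weighted sum of similarities to the user's items
    return user_item @ sim
```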

Making metrics meaningful

● Measure preference honestly

● Predicted items may not be “correct” just because they were consumed once

● Try to capture value

– Earlier recommendation may be better

– Don't need a recommender to suggest items by same artist/author

● Don't neglect side data

– At least use it for evaluation / sanity checking

Making metrics meaningful

● Public data isn't enough for reproducibility or fair comparison

● Need to document preprocessing

● Better:

Release your preparation/evaluation code too

What's the cost of poor evaluation?

Poor offline evaluation can lead to years of misdirected research

Ex 1: Reduce playlist skips

● Reorder a playlist of tracks to reduce skips by avoiding “genre whiplash”

● Use audio similarity measure to compute transition distance, then travelling salesman

● Metric: sum of transition distances (lower is better)

● 6 months work to develop solution

Ex 1: Reduce playlist skips

● Result: users skipped more often

● Why?

Ex 1: Reduce playlist skips

● Result: users skipped more often

● When a user skipped a track they didn't like, they were played something else just like it

● Better metric: average position of skipped tracks (based on logs, lower down is better)
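A minimal sketch of that log-based metric, the average position of skipped tracks; the data layout (per-playlist lists of (track, skipped) pairs) is an assumption for illustration.

```python
def mean_skip_position(playlists):
    """Average (1-based) position of skipped tracks across logged playlists.

    playlists: iterable of playlists, each a list of (track_id, skipped) pairs
    in play order. Higher is better: skips are pushed further down the list.
    """
    positions = [pos
                 for playlist in playlists
                 for pos, (_, skipped) in enumerate(playlist, start=1)
                 if skipped]
    return sum(positions) / len(positions) if positions else float("nan")
```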

Ex 2: Recommend movies

● Use a corpus of star ratings to improve movie recommendations

● Learn to predict ratings for un-rated movies

● Metric: average RMSE of predictions for a hidden test set (lower is better)

● 2+ years work to develop new algorithms
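For reference, a minimal sketch of the RMSE metric on a hidden test set of ratings; the dict-based layout is illustrative.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error over a hidden test set of ratings (lower is better).

    predicted, actual: dicts mapping (user, item) pairs to ratings.
    """
    keys = list(actual)
    squared_error = sum((predicted[k] - actual[k]) ** 2 for k in keys)
    return math.sqrt(squared_error / len(keys))
```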

Ex 2: Recommend movies

● Result: “best” solutions were never deployed

● Why?

Ex 2: Recommend movies

● Result: “best” solutions were never deployed

● User behaviour correlates with rank not RMSE

● Side datasets an order of magnitude more valuable than algorithm improvements

● Explicit ratings are the exception not the rule

● RMSE still haunts research labs

Can contests help?

● Good:

– Great for consistent evaluation

● Not so good:

– Privacy concerns mean obfuscated data

– No guarantee that metrics are meaningful

– No guarantee that train/test framework is valid

– Small datasets can become overexposed

Ex 3: Yahoo! Music KDD Cup

● Largest music rating dataset ever released

● Realistic “loved songs” classification task

● Data fully obfuscated due to recent lawsuits

Ex 3: Yahoo! Music KDD Cup

● Result: researchers hated it

● Why?

Ex 3: Yahoo! Music KDD Cup

● Result: researchers hated it

● Research frontier focussed on audio content and metadata, not joinable to obfuscated ratings

Ex 4: Million Song Challenge

● Large music dataset with rich metadata

● Anonymized listening histories

● Simple item recommendation task

● Reasonable MAP@500 metric

● Aimed to solve shortcomings of KDD Cup

● Only obfuscation was removal of timestamps
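A hedged sketch of a MAP@k evaluation of the kind used for the challenge's MAP@500 metric; normalisation conventions differ slightly between implementations, and the names here are illustrative.

```python
def average_precision_at_k(recommended, relevant, k=500):
    """Average precision@k for one user (relevant = held-out listened tracks)."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i          # precision at each hit position
    return score / min(len(relevant), k) if relevant else 0.0


def map_at_k(recs_by_user, relevant_by_user, k=500):
    """Mean average precision@k over all users."""
    users = list(recs_by_user)
    return sum(average_precision_at_k(recs_by_user[u], relevant_by_user[u], k)
               for u in users) / len(users)
```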

Ex 4: Million Song Challenge

● Result: winning entry didn't use side data

● Why?

Ex 4: Million Song Challenge

● Result: winning entry didn't use side data

● No timestamps so test tracks chosen at random

● So “people who listen to A also listen to B”

● Traditional item similarity solves this well

● More honesty about “success” might have shown that contest data was flawed

Ex 5: Yelp RecSys Challenge

● Small business review dataset with side data

● Realistic mix of input data types

● Rating prediction task

● Informal procedure to create train/test sets

Ex 5: Yelp RecSys Challenge

● Result: baseline algorithms high up leaderboard

● Why?

Ex 5: Yelp RecSys Challenge

● Result: baseline algorithms high up leaderboard

● Train/test split was corrupt

● Competition organisers moved fast to fix this

● But left only one week before deadline

Ex 6: MIREX Audio Chord Estimation

● Small dataset of audio tracks

● Task to label with predicted chord symbols

● Human labelled data hard to come by

● Contest hosted by premier forum in field

● Evaluate frame-level prediction accuracy

● Historical glass ceiling around 80%
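A minimal sketch of frame-level accuracy for chord estimation; it assumes the predicted and reference label sequences are already aligned frame by frame, and omits the chord-vocabulary and enharmonic mapping details that the MIREX evaluation also handles.

```python
def frame_accuracy(predicted_frames, reference_frames):
    """Fraction of analysis frames whose predicted chord label matches the reference.

    predicted_frames, reference_frames: equal-length sequences of chord symbols,
    one per frame.
    """
    if len(predicted_frames) != len(reference_frames):
        raise ValueError("frame sequences must be aligned and of equal length")
    correct = sum(p == r for p, r in zip(predicted_frames, reference_frames))
    return correct / len(reference_frames)
```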

Ex 6: MIREX Audio Chord Estimation

● Result: 2011 winner ftw

● Why?

Ex 6: MIREX Audio Chord Estimation

● Result: 2011 winner ftw

● Spoof entry relying on known test set

● Protest against inadequate test data

● Other research showed weak generalisation of winning algorithms from same contest

● Next year's results dropped significantly

So why evaluate offline at all?

● Building test framework ensures clear goals

● Avoid wishful thinking if your data is too thin

● Be efficient with precious online testing

– Cut down huge parameter space

– Don't alienate users

● Need to publish

● Pursuing science as well as profit

Online evaluation is tricky too

● No off the shelf solution for services

● Many statistical gotchas

● Same mismatch between short-term and long-term success criteria

● Results open to interpretation by management

● Can make incremental improvements look good when radical innovation is needed

Ex 7: Article Recommendations

● Recommender for related research articles

● Massive download logs available

● Framework developed based on co-downloads

● Aim to improve on existing search solution

● Management “keen for it to work”

● Several weeks of live A/B testing available

● No offline evaluation
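For illustration, a minimal sketch of how co-download counts might be mined from the logs the framework was built on; the per-reader data layout is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def co_download_counts(downloads_by_reader):
    """Count how often pairs of articles were downloaded by the same reader.

    downloads_by_reader: dict mapping a reader id (here effectively an
    organisational IP range) to the set of article ids they downloaded.
    """
    counts = defaultdict(int)
    for articles in downloads_by_reader.values():
        for a, b in combinations(sorted(articles), 2):
            counts[(a, b)] += 1
    return counts
```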

Ex 7: Article Recommendations

● Result: worse than similar title search

● Why?

Ex 7: Article Recommendations

● Result: worse than similar title search

● Inadequate business rules e.g. often suggesting other articles from same publication

● Users identified only by organisational IP range so value of “big data” very limited

● Establishing an offline evaluation protocol would have shown these in advance

Isn't there software for that?

Rules of the game:

– Model fit metrics (e.g. validation loss) don't count

– Need a transparent “audit trail” of data to support genuine reproducibility

– Just using public datasets doesn't ensure this

Isn't there software for that?

Wish list for reproducible evaluation:

– Integrate with recommender implementations

– Handle data formats and preprocessing

– Handle splitting, cross-validation, side datasets

– Save everything to file

– Work from file inputs so not tied to one framework

– Generate meaningful metrics

– Well documented and easy to use
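As a sketch of the “save everything to file” / audit-trail idea from the wish list above, here is a minimal reproducible train/test split writer; the file format and function names are illustrative, not any existing tool's API.

```python
import json
import random


def write_split(interactions, test_fraction=0.2, seed=42, path="split.json"):
    """Write a reproducible train/test split to file, so the exact data behind
    every reported number can be audited later.

    interactions: list of (user, item) pairs; names and format are illustrative.
    """
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    split = {
        "seed": seed,
        "test_fraction": test_fraction,
        "train": shuffled[:cut],
        "test": shuffled[cut:],
    }
    with open(path, "w") as f:
        json.dump(split, f)
    return split
```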

Isn't there software for that?

Current offerings:

● GraphChi/GraphLab

● Mahout

● LensKit

● MyMediaLite

Isn't there software for that?

Current offerings:

● GraphChi/GraphLab

– Only reports model validation loss, doesn't count

● Mahout

– Only rating prediction accuracy, doesn't count

● LensKit

– Too hard to understand, won't use

Isn't there software for that?

Current offerings:

● MyMediaLite

– Reports meaningful metrics

– Handles cross-validation

– Data splitting not transparent

– No support for pre-processing

– No built-in support for standalone evaluation

– API is capable but current utils don't meet wishlist

Eating your own dog food

● Built a small framework around new algorithm

● https://github.com/mendeley/mrec

– Reports meaningful metrics

– Handles cross-validation

– Supports simple pre-processing

– Writes everything to file for reproducibility

– Provides API and utility scripts

– Runs standalone evaluations

– Readable Python code

Eating your own dog food

● Some lessons learned

– Usable frameworks are hard to write

– Tradeoff between clarity and scalability

– Should generate explicit validation sets

● Please contribute!

● Or use as inspiration to improve existing tools

Where next?

● Shift evaluation online:

– Contests based around online evaluation

– Realistic but not reproducible

– Could some run continuously?

● Recommender Systems as a commodity:

– Software and services reaching maturity now

– Business users can tune/evaluate themselves

– Is there a way to report results?

Where next?

● Support alternative query paradigms:

– More like this, less like that

– Metrics for dynamic/online recommenders

● Support recommendation with side data:

– LibFM, GenSGD, WARP research @google, …

– Open datasets?

Thanks for listening

mark.levy@mendeley.com

@gamboviol

https://github.com/gamboviol

https://github.com/mendeley/mrec
