cs 572: information retrievaleugene/cs572/lectures... · the shawshank redemption 4.593 137812 lord...

97
CS 572: Information Retrieval Recommender Systems 2: Implementation and Applications Acknowledgements Many slides in this lecture are adapted from Xavier Amatriain (Netflix), Yehuda Koren (Yahoo), and Dietmar Jannach (TU Dortmund), Jimmy Lin, Foster Provost

Upload: others

Post on 16-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

CS 572: Information Retrieval

Recommender Systems 2: Implementation and Applications

Acknowledgements

Many slides in this lecture are adapted from Xavier Amatriain (Netflix), Yehuda

Koren (Yahoo), and Dietmar Jannach (TU Dortmund), Jimmy Lin, Foster Provost

Page 2: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Recap of Recommendation Approaches

Pros Cons

Collaborative No knowledge-engineering effort, serendipity of results, learns market segments

Requires some form of rating feedback, cold start for new users and new items

Content-based No community required, comparison between items possible

Content descriptions necessary, cold start for new users, no surprises

Knowledge-based Deterministic recommendations, assured quality, no cold-start, can resemble sales dialogue

Knowledge engineering effort to bootstrap, basically static, does not react to short-term trends

4/6/2016 CS 572: Information Retrieval. Spring 2016 2

Page 3: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Recap: What works (in “real world”)

4/6/2016 CS 572: Information Retrieval. Spring 2016 3

Page 4: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Today: Applications

• Products: Netflix challenge

• Information: News recommendation

• Social: Who to follow

• Human issues (how recommendations are used)

4/6/2016 CS 572: Information Retrieval. Spring 2016 4

Page 5: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

4/6/2016 CS 572: Information Retrieval. Spring 2016 5

Netflix Challenge

Page 6: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

scoremovieuser

1211

52131

43452

41232

37682

5763

4454

15685

23425

22345

5766

4566

scoremovieuser

?621

?961

?72

?32

?473

?153

?414

?284

?935

?745

?696

?836

Training data Test data

Movie rating data

4/6/2016 CS 572: Information Retrieval. Spring 2016 6

Page 7: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Netflix Prize

• Training data

– 100 million ratings

– 480,000 users

– 17,770 movies

– 6 years of data: 2000-2005

• Test data

– Last few ratings of each user (2.8 million)

– Evaluation criterion: root mean squared error (RMSE)

– Netflix Cinematch RMSE: 0.9514

• Competition

– 2700+ teams

– $1 million grand prize for 10% improvement on Cinematch result

– $50,000 2007 progress prize for 8.43% improvement

4/6/2016 CS 572: Information Retrieval. Spring 2016 7

Page 8: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Overall rating distribution

• Third of ratings are 4s

• Average rating is 3.68From TimelyDevelopment.com

4/6/2016 CS 572: Information Retrieval. Spring 2016 8

Page 9: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

#ratings per movie

• Avg #ratings/movie: 56274/6/2016 CS 572: Information Retrieval. Spring 2016 9

Page 10: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

#ratings per user

• Avg #ratings/user: 2084/6/2016 CS 572: Information Retrieval. Spring 2016 10

Page 11: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Average movie rating by movie count

More ratings to better moviesFrom TimelyDevelopment.com

4/6/2016 CS 572: Information Retrieval. Spring 2016 11

Page 12: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Most loved movies

CountAvg ratingTitle

137812 4.593 The Shawshank Redemption

133597 4.545 Lord of the Rings: The Return of the King

180883 4.306 The Green Mile

150676 4.460 Lord of the Rings: The Two Towers

139050 4.415 Finding Nemo

117456 4.504 Raiders of the Lost Ark

180736 4.299 Forrest Gump

147932 4.433 Lord of the Rings: The Fellowship of the ring

149199 4.325 The Sixth Sense

144027 4.333 Indiana Jones and the Last Crusade

4/6/2016CS 572: Information Retrieval. Spring 201612

Page 13: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Important RMSEs

Grand Prize: 0.8563; 10% improvement

BellKor: 0.8693; 8.63% improvement

Cinematch: 0.9514; baseline

Movie average: 1.0533

User average: 1.0651

Global average: 1.1296

Inherent noise: ????

Personalization

erroneous

accurate

4/6/2016 CS 572: Information Retrieval. Spring 2016 13

Page 14: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Challenges

• Size of data

– Scalability

– Keeping data in memory

• Missing data

– 99 percent missing

– Very imbalanced

• Avoiding overfitting

• Test and training data differ significantly

movie #16322

4/6/2016 CS 572: Information Retrieval. Spring 2016 14

Page 15: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

15

The Rules

• Improve RMSE metric by 10%

– RMSE= square root of the averaged squared difference between each prediction and the actual rating

– amplifies the contributions of egregious errors, both false positives ("trust busters") and false negatives ("missed opportunities").

• 100 million “training” ratings provided

– User, Rating, Movie name, date

4/6/2016 CS 572: Information Retrieval. Spring 2016

Page 16: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

4/6/2016 CS 572: Information Retrieval. Spring 2016 16

Page 17: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

4/6/2016 CS 572: Information Retrieval. Spring 2016 17

Page 18: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

movie #7614 movie #3250

movie #15868

Lessons from the Netflix Prize

Yehuda Koren

The BellKor team

(with Robert Bell & Chris Volinsky)

4/6/2016 CS 572: Information Retrieval. Spring 2016 18

Page 19: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

• Use an ensemble of complementing predictors

• Two, half tuned models worth more than a single, fully tuned model

The BellKor recommender system

4/6/2016 CS 572: Information Retrieval. Spring 2016 19

Page 20: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Effect of ensemble size

0.87

0.8725

0.875

0.8775

0.88

0.8825

0.885

0.8875

0.89

0.8925

1 8 15 22 29 36 43 50

#Predictors

Err

or

- R

MS

E

4/6/2016 CS 572: Information Retrieval. Spring 2016 20

Page 21: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

• Use an ensemble of complementing predictors

• Two, half tuned models worth more than a single, fully tuned model

• But:Many seemingly different models expose similar characteristics of the data, and won’t mix well

• Concentrate efforts along three axes...

The BellKor recommender system

4/6/2016 CS 572: Information Retrieval. Spring 2016 21

Page 22: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

The first axis:

• Multi-scale modeling of the data

• Combine top level, regional modeling of the data, with a refined, local view:

– k-NN: Extracting local patterns

– Factorization: Addressing regional effects

The three dimensions of the BellKor system

Global effects

Factorization

k-NN

4/6/2016 CS 572: Information Retrieval. Spring 2016 22

Page 23: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Multi-scale modeling – 1st tier

• Mean rating: 3.7 stars

• The Sixth Sense is 0.5 stars above avg

• Joe rates 0.2 stars below avg

Baseline estimation:Joe will rate The Sixth Sense 4 stars

Global effects:

4/6/2016 CS 572: Information Retrieval. Spring 2016 23

Page 24: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Multi-scale modeling – 2nd tier

• Both The Sixth Sense and Joe are placed high on the “Supernatural Thrillers” scale

Adjusted estimate:Joe will rate The Sixth Sense 4.5 stars

Factors model:

4/6/2016 CS 572: Information Retrieval. Spring 2016 24

Page 25: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Multi-scale modeling – 3rd tier

• Joe didn’t like related movie Signs

• Final estimate:Joe will rate The Sixth Sense 4.2 stars

Neighborhood model:

4/6/2016 CS 572: Information Retrieval. Spring 2016 25

Page 26: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

The second axis:

• Quality of modeling

• Make the best out of a model

• Strive for:

– Fundamental derivation

– Simplicity

– Avoid overfitting

– Robustness against #iterations, parameter setting, etc.

• Optimizing is good, but don’t overdo it!

The three dimensions of the BellKor system

global

local

quality4/6/2016 CS 572: Information Retrieval. Spring 2016 26

Page 27: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

• The third dimension will be discussed later...

• Next:Moving the multi-scale view along the quality axis

The three dimensions of the BellKor system

global

local

quality

???

4/6/2016 CS 572: Information Retrieval. Spring 2016 27

Page 28: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

• Earliest and most popular collaborative filtering method

• Derive unknown ratings from those of “similar” items (movie-movievariant)

• A parallel user-user flavor: rely on ratings of like-minded users (not in this talk)

Local modeling through k-NN

4/6/2016 CS 572: Information Retrieval. Spring 2016 28

Page 29: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

k-NN

121110987654321

455311

3124452

534321423

245424

5224345

423316

users

movie

s

- unknown rating - rating between 1 to 5

4/6/2016 CS 572: Information Retrieval. Spring 2016 29

Page 30: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

k-NN

121110987654321

455 ?311

3124452

534321423

245424

5224345

423316

users

movie

s

- estimate rating of movie 1 by user 5

4/6/2016 CS 572: Information Retrieval. Spring 2016 30

Page 31: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

k-NN

121110987654321

455 ?311

3124452

534321423

245424

5224345

423316

users

Neighbor selection:

Identify movies similar to 1, rated by user 5

movie

s

4/6/2016 CS 572: Information Retrieval. Spring 2016 31

Page 32: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

k-NN

121110987654321

455 ?311

3124452

534321423

245424

5224345

423316

users

Compute similarity weights:

s13=0.2, s16=0.3

movie

s

4/6/2016 CS 572: Information Retrieval. Spring 2016 32

Page 33: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

k-NN

121110987654321

4552.6311

3124452

534321423

245424

5224345

423316

users

Predict by taking weighted average:

(0.2*2+0.3*3)/(0.2+0.3)=2.6

movie

s

4/6/2016 CS 572: Information Retrieval. Spring 2016 33

Page 34: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Properties of k-NN

• Intuitive

• No substantial preprocessing is required

• Easy to explain reasoning behind a recommendation

• Accurate?

4/6/2016 CS 572: Information Retrieval. Spring 2016 34

Page 35: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Grand Prize: 0.8563

BellKor: 0.8693

Cinematch: 0.9514

Movie average: 1.0533

User average: 1.0651

Global average: 1.1296

Inherent noise: ????

0.96

0.91

k-NN on the RMSE scale

k-NN

erroneous

accurate

4/6/2016 CS 572: Information Retrieval. Spring 2016 35

Page 36: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

k-NN - Common practice

1. Define a similarity measure between items: sij

2. Select neighbors -- N(i;u): items most similar to i, that were rated by u

3. Estimate unknown rating, rui, as the weighted average:

( ; )

( ; )

ij uj ujj N i u

ui ui

ijj N i u

s r br b

s

baseline estimate for rui

4/6/2016 CS 572: Information Retrieval. Spring 2016 36

Page 37: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

k-NN - Common practice1. Define a similarity measure between items: sij

2. Select neighbors -- N(i;u): items similar to i, rated by u

3. Estimate unknown rating, rui, as the weighted average:

Problems:

1. Similarity measures are arbitrary; no fundamental justification

2. Pairwise similarities isolate each neighbor; neglect

interdependencies among neighbors

3. Taking a weighted average is restricting; e.g., when

neighborhood information is limited

( ; )

( ; )

ij uj ujj N i u

ui ui

ijj N i u

s r br b

s

4/6/2016 CS 572: Information Retrieval. Spring 2016 37

Page 38: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

(We allow )

Interpolation weights

• Use a weighted sum rather than a weighted average:

( ; )

1ij

j N i u

w

( ; )ui ui ij uj ujj N i u

r b w r b

• Model relationships between item i and its neighbors

• Can be learnt through a least squares problem from all

other users that rated i:

2

( ; )Minw vi vi ij vj vjv u j N i u

r b w r b

4/6/2016 CS 572: Information Retrieval. Spring 2016 38

Page 39: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Interpolation weights

2

( ; )Minw vi vi ij vj vjv u j N i u

r b w r b

• Interpolation weights derived based on their role;

no use of an arbitrary similarity measure

• Explicitly account for interrelationships among the

neighbors

Challenges:

• Deal with missing values

• Avoid overfitting

• Efficient implementation

Mostly

unknown

Estimate inner-products

among movie ratings

4/6/2016 CS 572: Information Retrieval. Spring 2016 39

Page 40: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Geared

towards

females

Geared

towards

males

serious

escapist

The PrincessDiaries

The Lion King

Braveheart

Lethal Weapon

Independence Day

AmadeusThe Color Purple

Dumb and Dumber

Ocean’s 11

Sense and Sensibility

Gus

Dave

Latent factor models

4/6/2016 CS 572: Information Retrieval. Spring 2016 40

Page 41: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Latent factor models

45531

312445

53432142

24542

522434

42331

item

s

.2-.4.1

.5.6-.5

.5.3-.2

.32.11.1

-22.1-.7

.3.7-1

-.92.41.4.3-.4.8-.5-2.5.3-.21.1

1.3-.11.2-.72.91.4-1.31.4.5.7-.8

.1-.6.7.8.4-.3.92.41.7.6-.42.1

~

~

item

s

users

users

A rank-3 SVD approximation

4/6/2016 CS 572: Information Retrieval. Spring 2016 41

Page 42: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Estimate unknown ratings as inner-products of factors:

45531

312445

53432142

24542

522434

42331

item

s

.2-.4.1

.5.6-.5

.5.3-.2

.32.11.1

-22.1-.7

.3.7-1

-.92.41.4.3-.4.8-.5-2.5.3-.21.1

1.3-.11.2-.72.91.4-1.31.4.5.7-.8

.1-.6.7.8.4-.3.92.41.7.6-.42.1

~

~

item

s

users

A rank-3 SVD approximation

users

?

4/6/2016 CS 572: Information Retrieval. Spring 2016 42

Page 43: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Estimate unknown ratings as inner-products of factors:

45531

312445

53432142

24542

522434

42331

item

s

.2-.4.1

.5.6-.5

.5.3-.2

.32.11.1

-22.1-.7

.3.7-1

-.92.41.4.3-.4.8-.5-2.5.3-.21.1

1.3-.11.2-.72.91.4-1.31.4.5.7-.8

.1-.6.7.8.4-.3.92.41.7.6-.42.1

~

~

item

s

users

A rank-3 SVD approximation

users

?

4/6/2016 CS 572: Information Retrieval. Spring 2016 43

Page 44: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Estimate unknown ratings as inner-products of factors:

45531

312445

53432142

24542

522434

42331

item

s

.2-.4.1

.5.6-.5

.5.3-.2

.32.11.1

-22.1-.7

.3.7-1

-.92.41.4.3-.4.8-.5-2.5.3-.21.1

1.3-.11.2-.72.91.4-1.31.4.5.7-.8

.1-.6.7.8.4-.3.92.41.7.6-.42.1

~

~

item

s

users

2.4

A rank-3 SVD approximation

users

4/6/2016 CS 572: Information Retrieval. Spring 2016 44

Page 45: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Latent factor models45531

312445

53432142

24542

522434

42331

.2-.4.1

.5.6-.5

.5.3-.2

.32.11.1

-22.1-.7

.3.7-1

-.92.41.4.3-.4.8-.5-2.5.3-.21.1

1.3-.11.2-.72.91.4-1.31.4.5.7-.8

.1-.6.7.8.4-.3.92.41.7.6-.42.1

~

Properties:• SVD isn’t defined when entries are unknown use specialized

methods

• Very powerful model can easily overfit, sensitive to regularization

• Probably most popular model among contestants

– 12/11/2006: Simon Funk describes an SVD based method

– 12/29/2006: Free implementation at timelydevelopment.com

4/6/2016 CS 572: Information Retrieval. Spring 2016 45

Page 46: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Grand Prize: 0.8563

BellKor: 0.8693

Cinematch: 0.9514

Movie average: 1.0533

User average: 1.0651

Global average: 1.1296

Inherent noise: ????

0.93

0.89

Factorization on the RMSE scale

factorization

4/6/2016 CS 572: Information Retrieval. Spring 2016 46

Page 47: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

BellKore Approach

• User factors:Model a user u as a vector pu ~ Nk(, )

• Movie factors:Model a movie i as a vector qi ~ Nk(γ, Λ)

• Ratings:Measure “agreement” between u and i: rui ~ N(pu

Tqi, ε2)

• Maximize model’s likelihood:– Alternate between recomputing user-factors, movie-

factors and model parameters

– Special cases: • Alternating Ridge regression

• Nonnegative matrix factorization

4/6/2016 CS 572: Information Retrieval. Spring 2016 47

Page 48: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Combining multi-scale views

global effects

regional effects

local effects

Residual fitting

factorization

k-NN

Weighted average

A unified model

k-NN

factorization

4/6/2016 CS 572: Information Retrieval. Spring 2016 48

Page 49: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Localized factorization model

• Standard factorization: User u is a linear function parameterized by pu

rui = puTqi

• Allow user factors – pu – to depend on the item being predicted

rui = pu(i)Tqi

• Vector pu(i) models behavior of u on items like i

4/6/2016 CS 572: Information Retrieval. Spring 2016 49

Page 50: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

RMSE vs. #Factors

0.905

0.91

0.915

0.92

0.925

0.93

0.935

0 20 40 60 80

#Factors

RM

SE

Factorization

Localized factorization

Results on Netflix Probe set

More

accurate

4/6/2016 CS 572: Information Retrieval. Spring 2016 50

Page 51: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

The third dimension of the BellKor system

• A powerful source of information:Characterize users by which movies they rated, rather than how they rated

• A binary representation of the data

45531

312445

53432142

24542

522434

42331

users

mo

vie

s

010100100101

111001001100

011101011011

010010010110

110000111100

010010010101

usersm

ovie

s

4/6/2016 CS 572: Information Retrieval. Spring 2016 51

Page 52: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

The third dimension of the BellKor system

movie #17270

Great news to recommender systems:

• Works even better on real life datasets (?)

• Improve accuracy by exploiting implicit feedback

• Implicit behavior is abundant and easy to collect:

– Rental history

– Search patterns

– Browsing history

– ...

• Implicit feedback allows predicting personalized ratings for users that never rated!

4/6/2016 CS 572: Information Retrieval. Spring 2016 52

Page 53: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord
Page 54: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord
Page 55: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Recommended Stories

Signed in, search history enabled

Page 56: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord
Page 57: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Challenges

• Scale

– Number of unique visitors in last 2 months: several millions

– Number of stories within last 2 months: several millions

• Item Churn

– News stories change every 10-15 mins

– Recency of stories is important

– Cannot build models ever so often

• Noisy ratings

– Clicks treated as noisy positive vote

Page 58: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Approach

• Content-based vs. Collaborative filtering

• Collaborative filtering based on co-visitation

– Content agnostic: Can be applied to other application domains, languages, countries

– Google’s key strength: Huge user base and their data.

• Could have used content based filtering

• Focus on algorithms that are scalable

Page 59: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Algorithm Overview

• Obtain a list of candidate stories

• For each story:

– Obtain 3 scores (y1, y2, y3)

– Final score = ∑ wi*yi

• User clustering algorithms (Minhash (y1), PLSI (y2))

– Score ~ number of times this story was clicked by other users in your cluster

• Story-story co-visitation (y3)

– Score ~ number of times this story was co-clicked with other stories in your recent click history

Page 60: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Algorithm Overview …

• User clustering done offline as batch process

– Can be run every day or 2-3 times a week

• Cluster-story counts maintained in real time

• New users

– May not be clustered

– Rely on co-visitation to generate recommendations

Page 61: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Personalization

Servers

StoryTable

(cluster +covisit

counts)

UserTable

(user clusters,

click hist)

User clustering

(Offline)

(Mapreduce)

Rank Request

Rank Reply

Click Notify

Read Stats

Update Stats

Read user profileUpdate profile

Architecture

Bigtables

News Frontend

Webserver

Page 62: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

User clustering - Minhash

||/||2121 uuuu SSSS

• Input: User and his clicked stories

• User similarity =

• Output: User clusters.

– Similar users belong to same cluster

},...,,{ 21

u

m

uu

u sssS

Page 63: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Minhash …

Randomly permute the universe of clicked stories

min defined by permutation

Pseudo-random permutation

Compute hash for each story and treat hash-value as

permutation index

Treat MinHash value as ClusterId

Probabilistic clustering

},...,,{},...,,{ ''

2

'

121 mm ssssss

)min()( u

jsuMH

||

||)}()({

21

21

21

uu

uu

SS

SSuMHuMHP

Page 64: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

MinHash as Mapreduce

• Map phase:– Input: key = userid, value = storyid

– Compute hash for each story (parallelizable across all data)

– Output: key = userid value = storyhash

• Reduce phase:– Input: key = userid value = <list of storyhash values>

– Compute min over the story hash-values (parallelizable across users)

Page 65: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

PLSI Framework

u1

u2

u3

u4

Users

s1

s2

s3

Stories

z1

z2

Latent Classes

P(s|u) = ∑z P(s|z)*P(z|u)

Page 66: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Clustering - PLSI Algorithm

• Learning (done offline)

– ML estimation • Learn model parameters that maximize the

likelihood of the sample data

• Ouput = P[zj|u]’s P[s|zj]’s

– P[zj|u]’s lead to a soft clustering of users

• Runtime: we only use P[zj|u]’s and

ignore P[s|zj]’s

u

z

s

P[z|u]

P[s|z]

Page 67: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

PLSI (EM estimation) as Mapreduce

• E step: (Map phase)

q*(z; u,s,Ø) = p(s|z)p(z|u)/(∑z p(s|z)p(z|u))

• M step: (Reduce phase)

p(s|z) = ∑u q*(z; u,s,Ø)/(∑s ∑u q*(z; u,s,Ø))

p(z|u) = ∑s q*(z; u,s,Ø)/(∑z ∑s q*(z; u,s,Ø))

• Cannot load the entire model from prev iteration in a single machine during map phase

• Trick: Partition users and stories. Each partition loads the stats pertinent to it

Page 68: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

PLSI as Mapreduce

Page 69: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Covisitation

• For each story si store the covisitation counts with other stories c(si, sj )

• Candidate story: sk

• User history: s1,…, sn

• score (si, sj ) = c(si, sj )/∑m c(si, sm )

• total_score(sk) = ∑n score(sk, sn )

Page 70: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Experimental results

Page 71: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Live traffic clickthrough evaluation

Page 72: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Summary

• Most effective approaches are often hybrid between collaborative and content-based filtering– c.f. Netflix, Google News recommendations

• Lots of data available, but algorithm scalability critical – (use Map/Reduce, Hadoop)

• Suggested rules-of-thumb (take w/ grain of salt):– If have moderate amounts of rating data for each user, use

collaborative filtering with kNN (reasonable baseline)– When little data available, use content (model-) based

recommendations

4/6/2016 CS 572: Information Retrieval. Spring 2016 72

Page 73: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

4/6/2016 CS 572: Information Retrieval. Spring 2016 73

Page 74: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

4/6/2016 CS 572: Information Retrieval. Spring 2016 74

Page 75: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

4/6/2016 CS 572: Information Retrieval. Spring 2016 75

Page 76: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

4/6/2016 CS 572: Information Retrieval. Spring 2016 76

Page 77: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

4/6/2016 CS 572: Information Retrieval. Spring 2016 77

Page 78: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

4/6/2016 CS 572: Information Retrieval. Spring 2016 78

Page 79: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

EXPLAINING RECOMMENDATIONS

4/6/2016 CS 572: Information Retrieval. Spring 2016 79

Page 80: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Explanation is Critical

4/6/2016 CS 572: Information Retrieval. Spring 2016 80

Page 81: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Explanations in recommender systems

Additional information to explain the system’s output following some objectives

Page 82: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Objectives of explanations

• Transparency

• Validity

• Trustworthiness

• Persuasiveness

• Effectiveness

Efficiency

Satisfaction

Relevance

Comprehensibility

Education

Page 83: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Explanations in general

• How? and Why? explanations in expert systems• Form of abductive reasoning

– Given: 𝐾𝐵⊨𝑅𝑆𝑖 (item i is recommended by method RS)– Find 𝐾𝐵′ ⊆ 𝐾𝐵 s.t. 𝐾𝐵′⊨𝑅𝑆𝑖

• Principle of succinctness– Find smallest subset of 𝐾𝐵′ ⊆ 𝐾𝐵 s.t. 𝐾𝐵′⊨𝑅𝑆𝑖

i.e. for all 𝐾𝐵′′ ⊂ 𝐾𝐵′ holds 𝐾𝐵′′⊭ 𝑅𝑆𝑖

• But additional filtering– Some parts relevant for

deduction, might be obviousfor humans

[Friedrich & Zanker, AI Magazine, 2011]

Page 84: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Taxonomy for generating explanations in RS

Major design dimensions of current explanation components:

• Category of reasoning model for generating explanations – White box

– Black box

• RS paradigm for generating explanations– Determines the exploitable semantic relations

• Information categories

Page 85: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

RS paradigms and their ontologies

• Classes of objects – Users

– Items

– Properties

• N-ary relations between them

• Collaborative filtering– Neighborhood based CF (a)

– Matrix factorization (b)

• Introduces additional factors as proxies for determining similarities

Page 86: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

RS paradigms and their ontologies

• Content-based

– Properties characterizing items

– TF*IDF model

• Knowledge based

– Properties of items

– Properties of user model

– Additional mediating domain concepts

Page 87: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

• Similarity between items

• Similarity between users

• Tags– Tag relevance (for

item)– Tag preference (of

user)

Page 88: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Results from testing the explanation feature

• Knowledgeable explanations significantly increase the users’

perceived utility

• Perceived utility strongly correlates with usage intention etc.

Explanation

Trust

PerceivedUtility

PositiveUsage exp.

Recommend others

Intention to repeated usage

** sign. < 1%, * sign. < 5%

+*

+**

+**

+**

+**

+

Page 89: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Probabilistic View (Foster Provost)

4/6/2016 CS 572: Information Retrieval. Spring 2016 89

Page 90: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Example

4/6/2016 CS 572: Information Retrieval. Spring 2016 90

Page 91: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Consumers Increasingly Concerned About Inferences Made About Them

4/6/2016 CS 572: Information Retrieval. Spring 2016 91

Page 92: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Example Result

4/6/2016 CS 572: Information Retrieval. Spring 2016 92

Page 93: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Critical Eveidence(“Counter-factural”)

• Models can be viewed as evidence-combining systems

• For a specific decision, can ask:

• What is a minimal set of evidence such that if it were not present, the decision would not have been made?

4/6/2016 CS 572: Information Retrieval. Spring 2016 93

Page 94: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Removing Evidence: Cloacking

4/6/2016 CS 572: Information Retrieval. Spring 2016 94

Page 95: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

How Many Likes Have to Remove?

4/6/2016 CS 572: Information Retrieval. Spring 2016 95

Page 96: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Explanations in recommender systems: Summary

• There are many types of explanations and various goals that an explanation can achieve

• Which type of explanation can be generated depends greatly on the recommender approach applied

• Explanations may be used to shape the wishes and desires of customers but are a double-edged sword– On the one hand, explanations can help the customer to make

wise buying decisions;

– On the other hand, explanations can be abused to push a customer in a direction which is advantageous solely for the seller

• As a result a deep understanding of explanations and their effects on customers is of great interest.

Page 97: CS 572: Information Retrievaleugene/cs572/lectures... · The Shawshank Redemption 4.593 137812 Lord of the Rings: The Return of the King 4.545 133597 The Green Mile 4.306 180883 Lord

Resources

• Recommender Systems Survey: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1423975

• Koren et al. on winning the NetFlix challenge:

– http://videolectures.net/kdd09_koren_cftd/

– http://www.searchenginecaffe.com/2009/09/winning-

1m-netflix-prize-methods.html

• http://www.recommenderbook.net/

4/6/2016 CS 572: Information Retrieval. Spring 2016 97