
Amancio Bouza, PhD Defense, University of Zurich, Switzerland, February 10th, 2012

Hypothesis-Based Collaborative Filtering: Retrieving Like-Minded Individuals Based on the Comparison of Hypothesized Preferences

Asian Food

Italian Food

Online Retailers vs. B&M Retailers

Revenue from the long tail: 25% at Amazon [Anderson, 2006]

Amazon.com: 2.3 million products [Brynjolfsson et al., 2003]

B&M: 40k-100k [Brynjolfsson et al., 2003]


Information Overload

Overchoice

Recommender Systems are Essential for Welfare

Mitigate negative effects [Hinz et al., 2011]

Collective wisdom: collaborative filtering

Consumer welfare increase of $731 million to $1.03 billion [Brynjolfsson et al., 2003]

Book market: consumer welfare enhanced 7-10 times more than through competition and lower prices [Brynjolfsson and Smith, 2000]


Essential for Economic and Public Welfare

Collaborative Filtering

Individuals who shared similar preferences in the past will share similar preferences in the future.

Issues of Common Rated Products: Cold-Start Problem

Significance of sparsity
Partial representation of preferences
Assessability of preference similarity
Incompleteness of preferences
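To make the cold-start problem concrete, here is a minimal sketch, assuming plain Python dictionaries as rating profiles (the data and function name are illustrative, not from the thesis): it computes the classic user-user Pearson correlation over commonly rated products only, and with sparse profiles the overlap shrinks to a handful of products or vanishes, so preference similarity cannot be assessed.

```python
import math

def pearson_on_corated(ratings_a, ratings_b):
    """Pearson correlation restricted to products rated by both individuals.

    ratings_a, ratings_b: dicts mapping product id -> rating.
    Returns None when the overlap is too small to assess similarity,
    which is the cold-start situation described above.
    """
    corated = set(ratings_a) & set(ratings_b)
    if len(corated) < 2:
        return None  # preference similarity is not assessable

    mean_a = sum(ratings_a[p] for p in corated) / len(corated)
    mean_b = sum(ratings_b[p] for p in corated) / len(corated)
    cov = sum((ratings_a[p] - mean_a) * (ratings_b[p] - mean_b) for p in corated)
    var_a = sum((ratings_a[p] - mean_a) ** 2 for p in corated)
    var_b = sum((ratings_b[p] - mean_b) ** 2 for p in corated)
    if var_a == 0 or var_b == 0:
        return None  # constant ratings on the overlap, similarity undefined
    return cov / math.sqrt(var_a * var_b)

# Two individuals with several ratings each but only one product in common:
alice = {"i1": 5, "i2": 4, "i3": 1}
bob = {"i3": 2, "i4": 5, "i5": 4}
print(pearson_on_corated(alice, bob))  # None: similarity cannot be assessed
```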

Thesis: Hypothesis-Based Collaborative Filtering

Thesis Overview

Preference Modelling
Preference Similarity
Evaluation & Analysis

Hypothesized Preference Modelling

Chapter 3. Conceptualization and Specification of Preferences

Figure 3.1: Partial preferences are encoded as branches from the root to a leaf of the decision tree. The nodes of a branch correspond to tests of product properties, and the leaf corresponds to the utility of the product.

which we described in Eq. (2.4) and recapitulate at this point:

$C \leftarrow \arg\max_{c_k} P(C = c_k) \prod_j P(A_j \mid C = c_k)$   (3.7)

This rule calculates the most probable utility based on the observed probability distributions of $P(C)$ and $P(A_j \mid C)$.

The Naïve Bayes classification rule is, in fact, a linear combination of all conditional probabilities of $P(C)$ and $P(A_j \mid C)$. Since Naïve Bayes assumes strong (naïve) independence among properties, an individual's preferences for product properties are likewise independent. Therefore, we can interpret each product $P(C) \cdot P(A_j \mid C)$ as a hypothesized partial preference whose set of constraints consists of the single constraint $A_j$ as the condition, with the most probable conclusion $c$ as the utility:

$C \leftarrow \arg\max_{c_k} P(C = c_k) \cdot P(A_j \mid C = c_k)$   (3.8)

Applying Bayes' theorem, we can simplify the definition of hypothesized partial preferences in Eq. (3.8) and write them as:

$C \leftarrow \arg\max_{c_k} P(C = c_k \mid A_j)$   (3.9)

To extract all hypothesized partial preferences, we compute the most probable utility for each attribute.
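As an illustration of this extraction step, here is a minimal sketch, assuming ratings are already discretized into utility classes and each product is described by a set of binary properties (the data and function name are illustrative, not from the thesis): for every property $A_j$ it keeps the most probable utility observed together with that property, i.e., Eq. (3.9) evaluated per attribute.

```python
from collections import Counter, defaultdict

def hypothesize_partial_preferences(rated_products):
    """Extract one hypothesized partial preference per product property.

    rated_products: list of (properties, utility) pairs, where properties is the
    set of property names present in the product and utility is a rating class.
    Returns {property: (most probable utility, P(utility | property))}.
    """
    counts = defaultdict(Counter)
    for properties, utility in rated_products:
        for prop in properties:
            counts[prop][utility] += 1  # co-occurrence counts of utility and property

    preferences = {}
    for prop, utility_counts in counts.items():
        total = sum(utility_counts.values())
        utility, count = utility_counts.most_common(1)[0]  # arg max_ck P(C = ck | Aj)
        preferences[prop] = (utility, count / total)
    return preferences

# Illustrative ratings of three movies described by genre properties:
rated = [({"comedy", "romance"}, 4), ({"comedy"}, 5), ({"horror"}, 1)]
print(hypothesize_partial_preferences(rated))
# {'comedy': (4, 0.5), 'romance': (4, 1.0), 'horror': (1, 1.0)}  (tie on 'comedy' broken arbitrarily)
```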

Grounded Theory

[Figure: MAE of HypokNN-NBCompSup as a function of the individual's effort and the product's visibility]

Ontology-based Decision Tree Learner: SemTree

4.2 SemTree extension to the decision tree model

4.2.2 Injecting concept features to generalize from features

The logic of feature generalization by injecting concept features is presented in Figure 4.1. The concepts that can be used for concept feature generation are given by an ontology. We use the rdfs:subClassOf and rdf:type properties, which connect instances with concepts and concepts with each other.

Let us denote $I_i \in \{I_1, \cdots, I_n\}$ as the feature vector representation of item $i$. The classification associated with item $i$ is denoted as $C_i \in \{C_1, \cdots, C_n\}$. Further, D, J, S, E, A, and L represent ontology instances, and U, Z, H, and X represent classes and superclasses.

Instance   D    J    S    ...  E    A    L    Associated class
I1         no   no   no   ...  yes  no   yes  C2
I2         no   no   yes  ...  yes  no   no   C1
I3         no   yes  yes  ...  no   no   no   C1
I4         yes  no   no   ...  no   yes  no   C2
In         yes  yes  no   ...  no   no   no   C1

The ontology relating these instance features to the classes U, Z, H, and X is shown alongside the table.

Figure 4.1: SemTree uses the generalization of single concepts to improve the classification.

If we look at the two instances where the feature value of feature D is set to yes (I4 and In), the instance is once classified to class C1 and the other time to C2. Therefore, feature D does not provide evidence for the classification of the instance.

The features J and S individually are similar to the previous case and do not provide evidence for the instance classification. Since J and S are instances of the ontology class U, we can combine (generalize) both features into the concept feature of U, which is used as a decision node in the tree only if it provides a greater information gain. To calculate the information gain, we have to set the values of the concept feature for each instance: if one of the member feature values of an instance is set to yes, the feature value of U is also set to yes. This is depicted in the figure with the grey background; in boxes with a grey background, the feature value is set to yes, otherwise no.

In the next case we go a step further. Again, the features E, A, and L do not provide evidence for the classification. However, the information gain of the concept features X and H is not greater than that of the individual features, so we discard the concept features and use the features E, A, and L instead. In the example in Figure 4.1, the feature or concept feature with the highest information gain is the concept feature of U, which is therefore used as the decision node in the tree.
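The injection step can be illustrated with toy data mirroring Figure 4.1; the sketch below is my own, with standard entropy and information-gain helpers rather than SemTree's actual implementation. The concept feature is the OR-aggregation of its member features (the grey boxes in the figure) and is kept only if its information gain exceeds that of every feature it generalizes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Information gain of splitting the labels on a binary feature."""
    gain = entropy(labels)
    n = len(labels)
    for v in (True, False):
        subset = [label for x, label in zip(values, labels) if x == v]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

def inject_concept_feature(instances, labels, member_features):
    """Generalize several instance features into one concept feature.

    instances: list of dicts mapping feature name -> bool.
    member_features: features that are rdf:type instances of one ontology class.
    Returns the concept feature column if it beats every member feature, else None.
    """
    concept = [any(inst[f] for f in member_features) for inst in instances]  # OR-aggregation
    member_gains = [information_gain([inst[f] for inst in instances], labels)
                    for f in member_features]
    return concept if information_gain(concept, labels) > max(member_gains) else None

# The J and S columns and classes from Figure 4.1: individually weak features,
# but their generalization to class U separates C1 from C2 perfectly.
instances = [{"J": False, "S": False}, {"J": False, "S": True}, {"J": True, "S": True},
             {"J": False, "S": False}, {"J": True, "S": False}]
labels = ["C2", "C1", "C1", "C2", "C1"]
print(inject_concept_feature(instances, labels, ["J", "S"]))
# [False, True, True, False, True] -> the concept feature of U is injected
```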

Empirical Study

[Figure: MAE versus rating sparsity (0%-90%) for Naïve Bayes, J48, SVD, PcorrkNN, HypokNN-J48CompSemSup, HypokNN-NBCompSup, HypokNN-semiJ48Utilprob, and HypokNN-NBUtilcorr]

Semantic Jaccard Index

Hypothesized Preference Similarity

Hypothesis Composition-Based Preference Similarity

Chapter 5. Hypothesized Preference Similarity

Figure 5.1: The partial preference similarity matrix S represents the similarities $s_{ij}$ between the hypothesized partial preferences $h^a_i$ and $h^b_j$ of two individuals a and b.

We define the preference similarity between the hypothesized preferences $h^a(g)$ and $h^b(g)$ as the similarity function $\mathrm{sim}(v^a \circ v^{b\top}) : \mathbb{R}^{m \times n} \to [0, 1] \subset \mathbb{R}$, which consolidates all partial preference similarities in $S$. Hence, the theoretical framework of our algorithmic framework is:

$\mathrm{sim}\big(h^a(g), h^b(g)\big) \equiv \mathrm{sim}(v^a \circ v^{b\top})$   (5.28)

Our algorithmic framework of hypothesis composition-based preference similarity consists of two components. One component is a method to compute the similarity of hypothesized partial preferences in order to determine $S$; it refers to Eq. (5.26) and is presented in Section 5.3.1. The other component is a method to consolidate the similarities in $S$; it refers to Eq. (5.28) and is presented in Section 5.3.2.

5.3.1 Similarity of Hypothesized Partial Preferences

As defined in Chapter 3.2, a hypothesized partial preference $h_i(g)$ consists of a set of constraints $p_i$ and the assigned rating $c_i$. Based on this, we define the similarity between two hypothesized partial preferences $h^a_i(g)$ and $h^b_j(g)$ as the similarity between the corresponding constraint sets $p^a_i$ and $p^b_j$, combined with the similarity between the corresponding ratings $c^a_i$ and $c^b_j$:

$\mathrm{sim}\big(h^a_i(g), h^b_j(g)\big) \equiv \mathrm{sim}(p^a_i, p^b_j) \cdot \mathrm{sim}(c^a_i, c^b_j)$   (5.29)

As depicted in Figure 5.1, the similarity $\mathrm{sim}(p^a_i, p^b_j) \cdot \mathrm{sim}(c^a_i, c^b_j)$ corresponds to the element $s_{ij}$ in the partial preference similarity matrix $S$, i.e., $s_{ij} = \mathrm{sim}(p^a_i, p^b_j) \cdot \mathrm{sim}(c^a_i, c^b_j)$.

This algorithmic framework allows for any kind of similarity metric to compute the similarity between constraint sets and the similarity between rating concepts. In the following, we propose two possible similarity metrics for computing $\mathrm{sim}(p^a_i, p^b_j)$ and a similarity metric for computing $\mathrm{sim}(c^a_i, c^b_j)$.
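A minimal sketch of both components follows, under simplifying assumptions that are mine rather than the thesis's: constraint sets are plain sets, so a Jaccard index stands in for $\mathrm{sim}(p^a_i, p^b_j)$; ratings lie on a 1-5 scale, so a normalized rating distance stands in for $\mathrm{sim}(c^a_i, c^b_j)$; and the consolidation of $S$ is a simple mean. The thesis proposes its own metrics for each of these roles.

```python
def jaccard(pa, pb):
    """Stand-in for sim(p_i^a, p_j^b): Jaccard index of two constraint sets."""
    return len(pa & pb) / len(pa | pb) if pa | pb else 0.0

def rating_similarity(ca, cb, rating_range=4.0):
    """Stand-in for sim(c_i^a, c_j^b): normalized distance on a 1-5 rating scale."""
    return 1.0 - abs(ca - cb) / rating_range

def preference_similarity(prefs_a, prefs_b):
    """Hypothesis composition-based preference similarity.

    prefs_a, prefs_b: lists of hypothesized partial preferences (constraint set, rating).
    Builds the m x n matrix S of Eq. (5.29) and consolidates it (here: mean of all s_ij).
    """
    S = [[jaccard(pa, pb) * rating_similarity(ca, cb) for (pb, cb) in prefs_b]
         for (pa, ca) in prefs_a]
    flat = [s for row in S for s in row]
    return sum(flat) / len(flat) if flat else 0.0

# Two individuals' hypothesized partial preferences (constraint set, utility):
alice = [({"comedy"}, 5), ({"horror"}, 1)]
bob = [({"comedy", "romance"}, 4), ({"horror"}, 2)]
print(round(preference_similarity(alice, bob), 3))  # 0.281
```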

Hypothesized Utility-Based Preference Similarity

User function: $u(i) = c_k$

Hypothesized user function: $h : i \mapsto c_k$, written $h(i) \to c_k$, which approximates the user function up to an error term, $h(i) + \varepsilon(i) = c_k$; per individual, $h^a(i) \to c_k$ and $h^b(i) \to c_k$

User model: $u(i) = h(i) + \varepsilon(i)$, or per individual $u^a(i) = h^a(i) + \varepsilon(i)$

User-based collaborative filtering:

$\hat{r}_{aj} = \bar{r}_a + \kappa \sum_{b \neq a}^{n} \mathrm{sim}(a, b) \cdot (r_{bj} - \bar{r}_b)$

Normalization factor $\kappa$:

$\kappa = \frac{1}{\sum_{b \neq a}^{n} \mathrm{sim}(a, b)}$

Example calculation:

$\hat{r}_{a2} = 3 + \frac{0.875 \cdot (4 - 3.66) + 0.25 \cdot (1 - 2.33)}{0.875 + 0.25} = 3 + \frac{0.298 - 0.333}{1.125} = 2.969$
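The user-based prediction formula and the example calculation above can be reproduced with a short sketch; the function and argument names are mine, the numbers are the ones from the example.

```python
def predict_rating(r_a_mean, neighbours):
    """User-based collaborative filtering prediction.

    r_a_mean: mean rating of the active individual a.
    neighbours: list of (sim(a, b), r_bj, r_b_mean) tuples for all b != a.
    Implements r_hat_aj = r_a_mean + kappa * sum(sim(a, b) * (r_bj - r_b_mean)).
    """
    kappa = 1.0 / sum(sim for sim, _, _ in neighbours)  # normalization factor
    weighted = sum(sim * (r_bj - r_b_mean) for sim, r_bj, r_b_mean in neighbours)
    return r_a_mean + kappa * weighted

# Example from the slide: mean rating 3, two neighbours with similarities 0.875
# and 0.25, ratings 4 and 1 for the target product, and rating means 3.66 and 2.33.
print(predict_rating(3, [(0.875, 4, 3.66), (0.25, 1, 2.33)]))  # ~2.969
```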

Evaluation: Empirical Study

Experimental Setting

Dataset: MovieLens 100k (quasi benchmark); 10 datasets with different degrees of rating sparsity
Method: k-fold cross-validation (k = 5); a minimal sketch of the protocol follows below
Performance metrics:
  Rating prediction accuracy: MAE, RMSE
  Relevance filtering quality: Precision, Recall, F1-score, MCC, AUC

Candidates for comparison:
  Hypothesis-based: 11 hypothesis-based collaborative filtering methods
  Baseline: 3 collaborative filtering methods (SVD, PcorrkNN, WoC) and 4 content filtering methods

Statistical test: non-parametric Wilcoxon signed-rank test for dependent samples, alpha = 0.01
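As referenced above, a minimal sketch of the evaluation protocol (k-fold cross-validation reporting MAE and RMSE) over generic (user, item, rating) triples; the recommender interface and the GlobalMean baseline are placeholders of my own, not methods from the thesis.

```python
import math
import random

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def cross_validate(ratings, recommender, k=5, seed=42):
    """k-fold cross-validation over (user, item, rating) triples.

    recommender: object with fit(train) and predict(user, item) -> rating.
    Returns the mean MAE and RMSE across the k folds.
    """
    data = ratings[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]

    maes, rmses = [], []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        recommender.fit(train)
        errors = [recommender.predict(u, it) - r for u, it, r in test]
        maes.append(mae(errors))
        rmses.append(rmse(errors))
    return sum(maes) / k, sum(rmses) / k

class GlobalMean:
    """Trivial baseline recommender: always predicts the global mean rating."""
    def fit(self, train):
        self.mean = sum(r for _, _, r in train) / len(train)
    def predict(self, user, item):
        return self.mean

ratings = [("u1", "i1", 4), ("u1", "i2", 3), ("u2", "i1", 5), ("u2", "i3", 2),
           ("u3", "i2", 4), ("u3", "i3", 1), ("u4", "i1", 3), ("u4", "i2", 5),
           ("u5", "i3", 2), ("u5", "i1", 4)]
print(cross_validate(ratings, GlobalMean()))  # (mean MAE, mean RMSE)
```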


Results

Hypothesis-based candidate: HypokNN-NBCompSup

0% sparsity: MAE 0.76, RMSE 0.97, Precision 0.84, Recall 0.54, AUC 0.66
90% sparsity: MAE 0.86, RMSE 1.09, Precision 0.75, Recall 0.52, AUC 0.56

significant

[Figure: MAE versus rating sparsity (0%-90%) for Naïve Bayes, J48, SVD, PcorrkNN, HypokNN-J48CompSemSup, HypokNN-NBCompSup, HypokNN-semiJ48Utilprob, and HypokNN-NBUtilcorr]

Analysis: Grounded Theory

Properties

Individual's effort: number of ratings
Individual's attitude: rating mean
Individual's selectivity: rating standard deviation
Product's visibility: number of ratings
Product's popularity: rating mean
Product's polarization: rating standard deviation
Performance: MAE
Performance difference between HypokNN-CF methods: difference in MAE
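These properties are straightforward to compute from raw rating data; below is a minimal sketch over (user, item, rating) triples (the function name is mine, the statistics helpers are from the standard library).

```python
import statistics
from collections import defaultdict

def grounded_theory_properties(ratings):
    """Compute the per-individual and per-product properties listed above.

    ratings: list of (user, item, rating) triples.
    Returns two dicts: user -> (effort, attitude, selectivity) and
    item -> (visibility, popularity, polarization).
    """
    by_user, by_item = defaultdict(list), defaultdict(list)
    for user, item, rating in ratings:
        by_user[user].append(rating)
        by_item[item].append(rating)

    def profile(values):
        count = len(values)              # effort / visibility: number of ratings
        mean = statistics.mean(values)   # attitude / popularity: rating mean
        std = statistics.pstdev(values)  # selectivity / polarization: rating std. dev.
        return count, mean, std

    users = {u: profile(v) for u, v in by_user.items()}
    items = {i: profile(v) for i, v in by_item.items()}
    return users, items

ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4)]
users, items = grounded_theory_properties(ratings)
print(users["u1"])  # (2, 4.0, 1.0)
```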

[Figure: MAE of HypokNN-NBCompSup as a function of the individual's effort and the product's visibility]

Comparison of CF methods

[Figure: dMAE of HypokNN-NBCompSup minus PcorrkNN and of HypokNN-J48CompSemSup minus PcorrkNN, as a function of the individual's effort and the product's visibility, at 0% and 80% sparsity]

When Hypothesis-Based Collaborative Filtering


Minor Cold-Start

[Table: effect (+/-) on hypothesis-based collaborative filtering performance of the individual's effort and selectivity (low/high) and the product's visibility and popularity (low/high)]

Verified by empirical evaluation

[Figure: MAE versus rating sparsity (0%-90%) for PcorrkNN, HypokNN-NBCompSup, GT-HypokNN-NBCompSup, HypokNN-semiJ48Utilprob, GT-HypokNN-semiJ48Utilprob, HypokNN-NBUtilprob, and GT-HypokNN-NBUtilprob]


Thank you!