Explaining the User Experience of Recommender Systems with User Experiments


DESCRIPTION

A talk I gave at the Netflix offices on July 2nd, 2012. Please do not use any of the slides or their contents without my explicit permission (bart@usabart.nl for inquiries).

TRANSCRIPT

The User Experience of Recommender Systems

Introduction: What I want to do today

Introduction: Bart Knijnenburg

- Current: Informatics PhD candidate, UC Irvine; research intern, Samsung R&D

- Past: researcher & teacher, TU Eindhoven; MSc Human Technology Interaction, TU Eindhoven; M Human-Computer Interaction, Carnegie Mellon

Introduction: Bart Knijnenburg

- UMUAI paper: “Explaining the User Experience of Recommender Systems” (first UX framework in RecSys)

- Founder of UCERSTI: workshop on user-centric evaluation of recommender systems and their interfaces

- This year: tutorial on user experiments

Introduction: My goal

- Introduce my framework for user-centric evaluation of recommender systems
- Explain how the framework can help in conducting user experiments

My approach:

- Start with “beyond the five stars”

- Explain how to improve upon this

- Introduce a 21st century approach to user experiments

Introduction

“What is a user experiment?”

“A user experiment is a scientific method to investigate how and why system aspects influence the users’ experience and behavior.”

Introduction: What I want to do today

Beyond the 5 stars: The Netflix approach

Towards user experience: How to improve upon the Netflix approach

Evaluation framework: A framework for user-centric evaluation

Measurement: Measuring subjective valuations

Analysis: Statistical evaluation of results

Some examples: Findings based on the framework

Beyond the 5 stars: The Netflix approach

Beyond the 5 stars

Offline evaluations may not give the same outcome as online evaluations

Cosley et al., 2002; McNee et al., 2002

Solution: Test with real users

Beyond the 5 stars

[Diagram: System (algorithm) → Interaction (rating)]

Beyond the 5 stars

Higher accuracy does not always mean higher satisfaction

McNee et al., 2006

Solution: consider other behaviors (“beyond the five stars”)

Beyond the 5 stars

[Diagram: System (algorithm) → Interaction (rating, consumption, retention)]

Beyond the 5 stars

The algorithm accounts for only 5% of the relevance of a recommender system

Francisco Martin - RecSys 2009 keynote

Solution: test those other aspects (“beyond the five stars”)

Beyond the 5 stars

[Diagram: System (algorithm, interaction, presentation) → Interaction (rating, consumption, retention)]

Beyond the 5 stars: Start with a hypothesis

Algorithm/feature/design X will increase member engagement and ultimately member retention

Design a test
- Develop a solution or prototype
- Think about dependent & independent variables, control, significance…

Execute the test

Beyond the 5 stars

Let data speak for itself
- Many different metrics
- We ultimately trust member engagement (e.g. hours of play) and retention

Is this good enough? I think it is not.

Towards user experience: How to improve upon the Netflix approach

User experience

“Testing a recommender against a static videoclip system, the number of clicked clips and total viewing time went down!”

User experience

[Path model: personalized recommendations (OSA) → perceived recommendation quality (SSA) → perceived system effectiveness (EXP) → choice satisfaction (EXP), with positive and negative paths to the number of clips clicked, the number of clips watched from beginning to end, and total viewing time]

Knijnenburg et al.: “Receiving Recommendations and Providing Feedback”, EC-Web 2010

User experience

Behavior is hard to interpret
- The relationship between behavior and satisfaction is not always trivial

User experience is a better predictor of long-term retention
- With behavior only, you will need to run for a long time

Questionnaire data is more robust
- Fewer participants needed

User experience

Measure subjective valuations with questionnaires
- Perception and experience

Triangulate these data with behavior
- Find out how they mediate the effect on behavior

Measure every step in your theory
- Create a chain of mediating variables

User experience

Perception: do users notice the manipulation?

- Understandability, perceived control

- Perceived recommendation diversity and quality

Experience: self-relevant evaluations of the system

- Process-related (e.g. choice difficulty)

- System-related (e.g. system effectiveness)

- Outcome-related (e.g. choice satisfaction)

User experience

[Diagram: System (algorithm, interaction, presentation) → Perception (usability, quality, appeal) → Experience (system, process, outcome) → Interaction (rating, consumption, retention)]

User experience

Personal and situational characteristics may have an important impact

Adomavicius et al., 2005; Knijnenburg et al., 2012

Solution: measure those as well

User experience

[Diagram: the same model extended with Personal Characteristics (gender, privacy, expertise) and Situational Characteristics (routine, system trust, choice goal)]

Evaluation framework: A framework for user-centric evaluation

Framework: A framework for user-centric evaluation of recommenders

Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012

[Diagram: System (algorithm, interaction, presentation) → Perception (usability, quality, appeal) → Experience (system, process, outcome) → Interaction (rating, consumption, retention); Personal Characteristics (gender, privacy, expertise) and Situational Characteristics (routine, system trust, choice goal) feed into the model]

Framework

Objective System Aspects (OSA)

- visual / interaction design

- recommender algorithm

- presentation of recommendations

- additional features

Framework: User Experience (EXP)

Different aspects may influence different things

- interface -> system evaluations

- preference elicitation method -> choice process

- algorithm -> quality of the final choice

Framework

Perception, or Subjective System Aspects (SSA)

- Link OSA to EXP (mediation)
- Increase the robustness of the effects of OSAs on EXP
- Show how and why OSAs affect EXP

Framework

Personal Characteristics (PC)

- Users’ general disposition
- Beyond the influence of the system
- Typically stable across tasks

Framework

Situational Characteristics (SC)

- How the task influences the experience
- Also beyond the influence of the system
- Here used for evaluation, not for augmenting algorithms!

Framework

Interaction (INT)

Observable behavior

- browsing, viewing, log-ins

Final step of evaluation
- But not the goal of the evaluation
- Triangulate with EXP

Framework: Has suggestions for measurement scales
- e.g. How can we measure something like “satisfaction”?

Provides a good starting point for causal relations
- e.g. How and why do certain system aspects influence the user experience?

Useful for integrating existing work
- e.g. How do recommendation list length, diversity and presentation each have an influence on user experience?

Framework

“Statisticians, like artists, have the bad habit of falling in love with their models.”

George Box

Measurement: Measuring subjective valuations

Measurement

“To measure satisfaction, we asked users whether they liked the system (on a 5-point rating scale).”

Measurement

Does the question mean the same to everyone?

- John likes the system because it is convenient

- Mary likes the system because it is easy to use

- Dave likes it because the recommendations are good

A single question is not enough to establish content validity

We need a multi-item measurement scale

Measurement

Measure at least 5 items per concept
- At least 2-3 should remain after removing bad items

It is most convenient to use statements to which the participant can agree/disagree on a 5- or 7-point scale:

“I am satisfied with this system.”

Completely disagree, somewhat disagree, neutral, somewhat agree, completely agree
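
As a quick reliability check on such a multi-item scale (a minimal sketch; the talk does not prescribe a tool, and the responses below are invented), Cronbach's alpha flags item sets that do not hang together:

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a multi-item scale: one column per item,
    one row per participant. Reverse-code negatively phrased items first."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical responses to five satisfaction items on a 5-point scale
responses = pd.DataFrame({
    "sat1": [5, 4, 2, 5, 3], "sat2": [4, 4, 1, 5, 3], "sat3": [5, 3, 2, 4, 2],
    "sat4": [4, 5, 1, 5, 3], "sat5": [5, 4, 2, 4, 3],
})
print(round(cronbach_alpha(responses), 2))  # ~0.7+ is usually considered acceptable
```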

Measurement: Use both positively and negatively phrased items

- They make the questionnaire less “leading”

- They help filter out bad participants

- They explore the “flip-side” of the scale

The word “not” is easily overlooked
- Bad: “The recommendations were not very novel”
- Better: “The recommendations were NOT very novel” (with the negation emphasized)
- Best: “The recommendations felt outdated”

Measurement

Choose simple over specialized words
- Participants may have no idea they are using a “recommender system”

Use as few words as possible
- But always use complete sentences

Measurement

Use design techniques that help to improve recall and provide appropriate time references
- Bad: “I liked the signup process” (one week ago)
- Good: Provide a screenshot
- Best: Ask the question immediately afterwards

Avoid double-barreled items
- Bad: “The recommendations were relevant and fun”

Measurement

“We asked users ten 5-point scale questions and summed the answers.”

Measurement: Is the scale really measuring a single thing?

- 5 items measure satisfaction, the other 5 convenience

- The items are not related enough to make a reliable scale

Are two scales really measuring different things?

- They are so closely related that they actually measure the same thing

We need to establish convergent and discriminant validity

This makes sure the scales are unidimensional

Measurement

Solution: factor analysis

- Define latent factors, specify how items “load” on them

- Factor analysis will determine how well the items “fit”

- It will give you suggestions for improvement

Benefits of factor analysis:

- Establishes convergent and discriminant validity

- Outcome is a normally distributed measurement scale

- The scale captures the “shared essence” of the items
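
As a hedged illustration of this step (the talk does not name a tool; the file name, column layout, and factor count are assumptions), an exploratory run with Python's factor_analyzer package shows which items load weakly or on the wrong factor:

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# One row per participant, one column per questionnaire item,
# e.g. the privacy items on the next slide (gipc…, ctrl…, collct…)
df = pd.read_csv("privacy_items.csv")  # hypothetical file

# Three hypothesized latent factors: general privacy concerns,
# control concerns, and collection concerns
fa = FactorAnalyzer(n_factors=3, rotation="oblimin")
fa.fit(df)

loadings = pd.DataFrame(fa.loadings_, index=df.columns,
                        columns=["general", "control", "collection"])
print(loadings.round(2))  # weak or cross-loading items are candidates for removal
```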

Measurement

[Factor model: gipc16, gipc17, gipc18, gipc19, gipc21 load on General privacy concerns; ctrl28-ctrl32 on Control concerns; collct22-collct27 on Collection concerns]

Measurement

[Factor model after removing bad items: gipc17, gipc18, gipc21 on General privacy concerns; ctrl28, ctrl29 on Control concerns; collct22, collct24, collct25, collct26, collct27 on Collection concerns]

Analysis: Statistical evaluation of results

Analysis

Manipulation -> perception: Do these two algorithms lead to a different level of perceived quality?

T-test

[Bar chart: perceived quality for algorithms A and B]
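
A minimal sketch of this comparison in Python (the factor scores are invented for illustration):

```python
from scipy import stats

# Perceived-quality factor scores, one per participant, by algorithm condition
quality_a = [-0.41, 0.10, -0.75, -0.22, 0.05, -0.63]
quality_b = [0.35, 0.80, 0.12, 0.55, 0.95, 0.40]

t, p = stats.ttest_ind(quality_a, quality_b)
print(f"t = {t:.2f}, p = {p:.3f}")  # small p: the algorithms differ in perceived quality
```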

Analysis

Perception -> experience: Does perceived quality influence system effectiveness?

Linear regression

[Scatter plot: system effectiveness vs. recommendation quality]
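
The corresponding regression, sketched with statsmodels (again on invented factor scores):

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per participant: factor scores for the two scales
df = pd.DataFrame({
    "quality":       [-1.2, -0.5, 0.0, 0.3, 0.9, 1.5, -0.8, 0.6],
    "effectiveness": [-1.0, -0.7, 0.1, 0.4, 1.1, 1.3, -0.4, 0.5],
})

fit = smf.ols("effectiveness ~ quality", data=df).fit()
print(fit.params)  # the quality coefficient is the perception -> experience slope
```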

Analysis

Manipulation -> perception -> experience: Does the algorithm influence effectiveness via perceived quality?

Path model

[Diagram: personalized recommendations (OSA) → perceived recommendation quality (SSA) → perceived system effectiveness (EXP)]
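
A hedged sketch of such a path model with the semopy package, on simulated data (variable names and effect sizes are assumptions):

```python
import numpy as np
import pandas as pd
from semopy import Model

# Simulate an OSA -> SSA -> EXP chain: the manipulated algorithm drives
# perceived quality, which in turn drives perceived effectiveness
rng = np.random.default_rng(0)
n = 200
algo = rng.integers(0, 2, n)                        # 0 = baseline, 1 = personalized
quality = 0.8 * algo + rng.normal(size=n)           # SSA
effectiveness = 0.6 * quality + rng.normal(size=n)  # EXP
df = pd.DataFrame({"algo": algo, "quality": quality,
                   "effectiveness": effectiveness})

model = Model("quality ~ algo\neffectiveness ~ quality")
model.fit(df)
print(model.inspect())  # path estimates with standard errors and p-values
```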

Analysis

Two manipulations -> perception:

What is the combined effect of list diversity and list length on perceived recommendation quality?

Factorial ANOVA

[Bar chart: perceived quality by list length (5, 10, 20 items) and diversification (low vs. high)]

Willemsen et al.: “Not just more of the same”, submitted to TiiS
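
A minimal two-way ANOVA sketch with statsmodels (the data layout and values are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical per-participant data: manipulated list length and
# diversification, plus the perceived-quality factor score
df = pd.DataFrame({
    "length":    [5, 5, 10, 10, 20, 20] * 4,
    "diversity": (["low"] * 6 + ["high"] * 6) * 2,
    "quality":   [0.1, 0.2, 0.4, 0.3, 0.2, 0.1,
                  0.3, 0.4, 0.5, 0.6, 0.5, 0.4] * 2,
})

fit = smf.ols("quality ~ C(length) * C(diversity)", data=df).fit()
print(anova_lm(fit, typ=2))  # main effects plus the length x diversity interaction
```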

Analysis

Manipulation x personal characteristic -> outcome

Do experts and novices rate these two interaction methods differently in terms of usefulness?

ANCOVA

[Plot: system usefulness vs. domain knowledge for case-based and attribute-based preference elicitation]

Knijnenburg & Willemsen: “Understanding the Effect of Adaptive Preference Elicitation Methods”, RecSys 2009
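
A sketch of this moderation test with statsmodels on simulated data (names and the effect structure are assumptions); the interaction term carries the expert/novice difference:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 120
df = pd.DataFrame({
    "expertise": rng.normal(size=n),  # measured domain knowledge (PC)
    "method": rng.choice(["case_based", "attribute_based"], size=n),
})
# Hypothetical effect: expertise pays off only with attribute-based PE
df["usefulness"] = (0.5 * df["expertise"] * (df["method"] == "attribute_based")
                    + rng.normal(size=n))

fit = smf.ols("usefulness ~ method * expertise", data=df).fit()
print(fit.summary())  # the method:expertise term tests the interaction
```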

Analysis

Use only one method: structural equation models

- A combination of Factor Analysis and Path Models

- The statistical method of the 21st century

- Statistical evaluation of causal effects

- Report as causal paths
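
A hedged sketch of a full structural equation model with semopy (file and variable names are assumptions): the measurement part is the factor analysis, the structural part is the path model:

```python
import pandas as pd
from semopy import Model

# One row per participant: algo is the manipulated condition (OSA),
# qual1..qual3 and eff1..eff3 are questionnaire items
df = pd.read_csv("experiment.csv")  # hypothetical file

desc = """
# Measurement part (factor analysis): items load on latent scales
quality =~ qual1 + qual2 + qual3
effectiveness =~ eff1 + eff2 + eff3

# Structural part (path model): OSA -> SSA -> EXP
quality ~ algo
effectiveness ~ quality
"""
model = Model(desc)
model.fit(df)
print(model.inspect())  # causal paths with estimates, standard errors, p-values
```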

Analysis

We compared three recommender systems

Three different algorithms

MF-I and MF-E algorithms are more effective

Why?

Analysis

The mediating variables show the entire story

Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012

Analysis

[SEM path model: Matrix Factorization recommenders with explicit (MF-E) and implicit (MF-I) feedback, each vs. generally most popular (GMP), as OSAs → perceived recommendation variety (SSA) → perceived recommendation quality (SSA) → perceived system effectiveness (EXP) → interaction (number of items rated, number of items clicked in the player, rating of clicked items, number of items rated higher than the predicted rating), with participant's age as a PC; positive, significant path coefficients shown with standard errors and p-values, e.g. 1.250 (.183), p < .001]

Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012

Analysis

“All models are wrong, but some are useful.”

George Box

Some examples: Findings based on the framework

Some examples

Giving users many options:

- increases the chance they find something nice

- also increases the difficulty of making a decision

Result: choice overload

Solution:

- Provide short, diversified lists of recommendations

Some examples

[SEM path model: Top-20 vs. Top-5 and Lin-20 vs. Top-5 recommendations (OSA), perceived recommendation variety and perceived recommendation quality (SSA), choice difficulty and choice satisfaction (EXP), with movie expertise as a PC; positive and negative path coefficients shown with standard errors and p-values]

Some examples

[Bar charts: choice difficulty, perceived quality, and choice satisfaction by list length (5, 10, 20 items) and diversification (low vs. high)]

Some examples

Experts and novices differ in their decision-making process

- Experts want to leverage their domain knowledge

- Novices make decisions by choosing from examples

Result: they want different preference elicitation methods

Solution: Tailor the preference elicitation method to the user
- A needs-based approach seems to work well for everyone

Some examples

[Plot: system usefulness vs. domain knowledge for case-based and attribute-based preference elicitation]
Some examples

[Plots: control, understandability, user interface satisfaction, perceived system effectiveness, and choice satisfaction vs. domain knowledge for five preference elicitation methods (TopN, Sort, Explicit, Implicit, Hybrid); asterisks mark significance levels]

Some examples

[Plot: control vs. domain knowledge for TopN, Sort, Explicit, Implicit, Hybrid]

Some examples

[Plot: understandability vs. domain knowledge for TopN, Sort, Explicit, Implicit, Hybrid]

Some examples

[Plot: user interface satisfaction vs. domain knowledge for TopN, Sort, Explicit, Implicit, Hybrid]

Some examples

[Plot: perceived system effectiveness vs. domain knowledge for TopN, Sort, Explicit, Implicit, Hybrid]

Some examples

[Plot: choice satisfaction vs. domain knowledge for TopN, Sort, Explicit, Implicit, Hybrid]

Some examples

Social recommenders limit the neighbors to the users’ friends

Result:

- More control: friends can be rated as well

- Better inspectability: Recommendations can be traced

We show that each of these effects exists
- Both increase overall satisfaction

Some examples

[Screenshots: “Rate likes” and “Weigh friends”]

Some examples

[SEM path model: Control (item/friend vs. no control) and Inspectability (full graph vs. list only) as OSAs → understandability (R² = .153), perceived control (R² = .311), and perceived recommendation quality (R² = .512) as SSAs → satisfaction with the system (R² = .696) as EXP → average rating (R² = .508), inspection time in minutes (R² = .092), and number of known recommendations (R² = .044) as INT; music expertise, familiarity with recommenders, and trusting propensity as PCs; coefficients shown with standard errors and significance levels]

Introduction: I discussed how to do user experiments with my framework

Beyond the 5 stars: The Netflix approach is a step in the right direction

Towards user experience: We need to measure subjective valuations as well

Evaluation framework: With the framework you can put this all together

Measurement: Measure subjective valuations with questionnaires and factor analysis

Analysis: Use structural equation models to test comprehensive causal models

Some examples: I demonstrated the findings of some studies that used this approach

See you in Dublin!

bart.k@uci.edu

www.usabart.nl

@usabart

Resources: User-centric evaluation

Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C.: Explaining the User Experience of Recommender Systems. UMUAI 2012.

Knijnenburg, B.P., Willemsen, M.C., Kobsa, A.: A Pragmatic Procedure to Support the User-Centric Evaluation of Recommender Systems. RecSys 2011.

Choice overload

Willemsen, M.C., Graus, M.P., Knijnenburg, B.P., Bollen, D.: Not Just More of the Same: Preventing Choice Overload in Recommender Systems by Offering Small Diversified Sets. Submitted to TiiS.

Bollen, D.G.F.M., Knijnenburg, B.P., Willemsen, M.C., Graus, M.P.: Understanding Choice Overload in Recommender Systems. RecSys 2010.

Resources: Preference elicitation methods

Knijnenburg, B.P., Reijmer, N.J.M., Willemsen, M.C.: Each to His Own: How Different Users Call for Different Interaction Methods in Recommender Systems. RecSys 2011.

Knijnenburg, B.P., Willemsen, M.C.: The Effect of Preference Elicitation Methods on the User Experience of a Recommender System. CHI 2010.

Knijnenburg, B.P., Willemsen, M.C.: Understanding the effect of adaptive preference elicitation methods on user satisfaction of a recommender system. RecSys 2009.

Social recommenders

Knijnenburg, B.P., Bostandjiev, S., O'Donovan, J., Kobsa, A.: Inspectability and Control in Social Recommender Systems. RecSys 2012.

Resources: User feedback and privacy

Knijnenburg, B.P., Kobsa, A: Making Decisions about Privacy: Information Disclosure in Context-Aware Recommender Systems. Submitted to TiiS.

Knijnenburg, B.P., Willemsen, M.C., Hirtbach, S.: Getting Recommendations and Providing Feedback: The User-Experience of a Recommender System. EC-Web 2010.
