
Page 1: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Maarten de Rijke

Page 2: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Joint work with Katja Hofmann and Shimon Whiteson

2

Page 3: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online 3

Growing complexity of search engines

Current methods for optimizing mostly work offline

Page 4: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Online learning to rank

No distinction between training and operating

Search engine observes users’ natural interactions with the search interface, infers information from them, and improves its ranking function automatically

Expensive data collection not required; the collected data matches target users and target setting

4

Page 5: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online 5

Users’ natural interactions with the search interface

Oard and Kim, 2001; Kelly and Teevan, 2004

Behavior category (the purpose of the observed behavior) by minimum scope (the smallest possible scope of the item being acted upon):

Behavior category | Segment                           | Object                                  | Class
Examine           | View, Listen, Scroll, Find, Query | Select                                  | Browse
Retain            | Print                             | Bookmark, Save, Delete, Purchase, Email | Subscribe
Reference         | Copy-and-paste, Quote             | Forward, Reply, Link, Cite              |
Annotate          | Mark up                           | Rate, Publish                           | Organize
Create            | Type, Edit                        | Author                                  |

Page 6: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Users’ interactions

Relevance feedback
History goes back close to forty years
Typically used for query expansion, user profiling

Explicit feedback
Users explicitly give feedback: keywords, selecting or marking documents, answering questions
Natural explicit feedback can be difficult to obtain
“Unnatural” explicit feedback through TREC assessors and crowd sourcing

6

Page 7: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Users’ interactions (2)

Implicit feedback for learning, query expansion and user profiling
Observe users’ natural interactions with the system
Reading time, saving, printing, bookmarking, selecting, clicking, …
Thought to be less accurate than explicit measures
Available in very large quantities at no cost

7

Page 8: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Learning to rank online

Using online learning to rank approaches, retrieval systems can learn directly from implicit feedback while they are running

Algorithms need to explore new solutions to obtain feedback for effective learning, and to exploit what has been learned to produce results acceptable to users

Interleaved comparison methods can use implicit feedback to detect small differences between rankers and can be used to learn ranking functions online

8

Page 9: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Agenda

Balancing exploration and exploitation
Inferring preferences from clicks

9

Page 10: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Balancing Exploitation and Exploration

10

K. Hofmann et al. (2011), Balancing exploration and exploitation. In: ECIR ’11.

Recent work

Page 11: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Challenges

Generalize over queries and documents

Learn from implicit feedback that is …
noisy
relative
rank-biased

Keep users happy while learning

11

Page 12: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Learning document pair-wise preferences

Insight: infer preferences from clicks

12

(Example: result list for the query “Vienna”)

Joachims, T. (2002). Optimizing search engines using clickthrough data. In KDD '02, pages 133-142.

Page 13: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Learning document pair-wise preferences

Input: feature vectors constructed from document pairs, (x(q, d_i), x(q, d_j)) ∈ R^n × R^n

Output: correct / incorrect order, y ∈ {−1, +1}

Learning method: supervised learning, e.g., SVM

13

Joachims, T. (2002). Optimizing search engines using clickthrough data. In KDD '02, pages 133-142.
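To make the setup concrete, here is a minimal Python sketch of the pairwise construction, assuming scikit-learn; the feature function phi(q, d) and the click-derived preference triples are placeholders for illustration, not code from the talk.

```python
# A minimal sketch (not the talk's implementation) of Joachims-style pairwise
# learning to rank. phi(q, d) is a hypothetical feature function returning a
# numpy vector; "preferences" are click-derived pairs (d_i preferred over d_j).
import numpy as np
from sklearn.svm import LinearSVC

def build_pairwise_data(preferences, phi):
    """preferences: iterable of (query, preferred_doc, other_doc) triples."""
    X, y = [], []
    for q, d_i, d_j in preferences:
        diff = phi(q, d_i) - phi(q, d_j)   # difference of the two feature vectors
        X.append(diff)
        y.append(+1)                       # correct order
        X.append(-diff)
        y.append(-1)                       # reversed order
    return np.array(X), np.array(y)

# The learned weight vector then acts as a linear ranking function:
#   score(q, d) = w . phi(q, d)
# X, y = build_pairwise_data(click_preferences, phi)
# w = LinearSVC().fit(X, y).coef_.ravel()
```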

Page 14: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Challenges

Generalize over queries and documents

Learn from implicit feedback that is …
noisy
relative
rank-biased

Keep users happy while learning

14

Page 15: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Dueling bandit gradient descent

Learns a ranking function, consisting of a weight vector w for a linear weighted combination of feature vectors, from feedback about the relative quality of rankings: s = w · x(q, d)
Outcome: weights for ranking

Approach
Maintain a current “best” ranking function
On each incoming query:
Generate a new candidate ranking function
Compare it to the current “best”
If the candidate is better, update the “best” ranking function

15

Yue, Y. and Joachims, T. (2009). Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML '09.

(Figure: weight space with axes x1 and x2, showing the current best w and a candidate w)
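A minimal Python sketch of this loop; rank_by, docs_for, phi and the interleaved comparison compare are assumed helpers, and the step sizes delta and alpha are illustrative defaults rather than values from the talk.

```python
# A minimal sketch of the Dueling Bandit Gradient Descent loop described above.
# docs_for, phi and the interleaved comparison "compare" are assumed helpers,
# not part of the original talk's code.
import numpy as np

def rank_by(w, q, phi, docs):
    """Rank the candidate documents for q by the linear score w . phi(q, d)."""
    return sorted(docs, key=lambda d: -np.dot(w, phi(q, d)))

def dbgd(queries, docs_for, phi, compare, n_features, delta=1.0, alpha=0.01, seed=42):
    """compare(q, current_list, candidate_list) -> True if the candidate wins."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)                   # current "best" ranker
    for q in queries:
        u = rng.normal(size=n_features)
        u /= np.linalg.norm(u)                 # random unit direction
        w_candidate = w + delta * u            # candidate ranker
        docs = docs_for(q)
        current = rank_by(w, q, phi, docs)
        candidate = rank_by(w_candidate, q, phi, docs)
        if compare(q, current, candidate):     # users prefer the candidate
            w = w + alpha * u                  # take a small step towards it
    return w
```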

Page 16: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Challenges

Generalize over queries and documents

Learn from implicit feedback that is …
noisy
relative
rank-biased

Keep users happy while learning

16

Page 17: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Exploration and exploitation

17

Exploration: need to learn effectively from rank-biased feedback

Exploitation: need to present high-quality results while learning

Previous approaches are either purely exploratory or purely exploitative

Page 18: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Questions

Can we improve online performance by balancing exploration and exploitation?

How much exploration is needed for effective learning?

18

Page 19: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Problem formulation

Reinforcement learning
No explicit labels
Learn from feedback from the environment in response to actions (document lists)
Contextual bandit problem

19

(Diagram: the retrieval system interacts with its environment, the user; the system tries something and gets feedback; concretely, it presents documents and observes clicks)

Page 20: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Our method

Learning based on Dueling Bandit Gradient Descent
Relative evaluations of the quality of two document lists
Infers such comparisons from implicit feedback

Balance exploration and exploitation with k-greedy comparison of document lists

20

Page 21: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

k-greedy exploration

To compare document lists, interleave

An exploration rate k influences the relative number of documents from each list

21

(Example: interleaved result list with exploration rate k = 0.5; the blue list wins the comparison)
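A minimal sketch of such a k-greedy interleaving step, under simplifying assumptions about tie handling and list exhaustion; it is an illustration of the idea on the slide, not the paper's exact algorithm.

```python
# A minimal sketch of k-greedy interleaving: at each rank, the exploratory list
# contributes a document with probability k, otherwise the exploitative list does.
# Click attribution and edge cases are simplified.
import random

def k_greedy_interleave(exploit_list, explore_list, k=0.2, length=10, rng=random):
    interleaved, origin = [], []            # origin[i]: which list placed document i
    while len(interleaved) < length:
        if rng.random() < k:
            source, label = explore_list, "explore"
        else:
            source, label = exploit_list, "exploit"
        nxt = next((d for d in source if d not in interleaved), None)
        if nxt is None:                     # chosen list exhausted; stop (simplified)
            break
        interleaved.append(nxt)
        origin.append(label)
    return interleaved, origin

def compare_by_clicks(origin, clicked_ranks):
    """The list whose documents attract more clicks wins the comparison."""
    explore = sum(origin[r] == "explore" for r in clicked_ranks)
    exploit = len(clicked_ranks) - explore
    return "explore" if explore > exploit else "exploit" if exploit > explore else "tie"
```

A function like compare_by_clicks could serve as the compare callback in the DBGD sketch shown earlier.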

Page 22: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online 22

k-greedy exploration

(Examples: interleaved result lists with exploration rate k = 0.5 and with exploration rate k = 0.2)

Page 23: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Evaluation

Simulated interactions
We need to observe clicks on arbitrary result lists and to measure online performance

Simulate clicks and measure online performance
Probabilistic click model: assume a dependent click model and define click and stop probabilities based on standard learning to rank data sets
Measure cumulative reward of the rankings displayed to the user

23
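A minimal sketch of how such simulated clicks could be generated; the default probabilities are the “navigational” values given on a later slide, and binary relevance is a simplification of the graded labels in the data sets.

```python
# A minimal sketch of the simulated user: a dependent click model that scans
# the result list top-down, clicks with probability P(c|R) or P(c|NR) depending
# on relevance, and stops after a click with probability P(s|R) or P(s|NR).
import random

def simulate_clicks(result_list, relevant, p_click=(0.95, 0.05), p_stop=(0.9, 0.2), rng=random):
    """relevant: set of relevant documents. Returns the ranks that were clicked."""
    clicked_ranks = []
    for rank, doc in enumerate(result_list):
        rel = doc in relevant
        if rng.random() < (p_click[0] if rel else p_click[1]):
            clicked_ranks.append(rank)
            if rng.random() < (p_stop[0] if rel else p_stop[1]):
                break                      # the simulated user is satisfied and stops
    return clicked_ranks
```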

Page 24: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Experiments

Vary the exploration rate k

Three click models: “perfect”, “navigational”, “informational”

Evaluate on nine data sets (LETOR 3.0 and 4.0)

24

Page 25: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

“Perfect” click model

Click model: provides an upper bound

25

P(c|R) = 1.0, P(c|NR) = 0.0, P(s|R) = 0.0, P(s|NR) = 0.0

(Figure: final performance over time for data set NP2003 and the perfect click model)

Page 26: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

“Perfect” online performance

Data set   k = 0.5   k = 0.4   k = 0.3   k = 0.2   k = 0.1
HP2003      119.91    125.71    129.99    130.55    128.50
HP2004      109.21    111.57    118.54    119.86    116.46
NP2003      108.74    113.61    117.44    120.46    119.06
NP2004      112.33    119.34    124.47    126.20    123.70
TD2003       82.00     84.24     88.20     89.36     86.20
TD2004       85.67     90.23     91.00     91.71     88.98
OHSUMED     128.12    130.40    131.16    133.37    131.93
MQ2007       96.02     97.48     98.54    100.28     98.32
MQ2008       90.97     92.99     94.03     95.59     95.14

(In the slide, dark borders indicate significant improvements over the k = 0.5 baseline, and darker shades indicate higher performance)

26

Best performance with only two exploratory documents for top-10 results

Page 27: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

“Navigational” click model

Click model: simulates realistic but reliable interaction

27

P(c|R) = 0.95, P(c|NR) = 0.05, P(s|R) = 0.9, P(s|NR) = 0.2

(Figure: final performance over time for data set NP2003 and the navigational click model)

Page 28: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

“Navigational” online performance

28

Data set   k = 0.5   k = 0.4   k = 0.3   k = 0.2   k = 0.1
HP2003      102.58    109.78    118.84    116.38    117.52
HP2004       89.61     97.08     99.03    103.36    105.69
NP2003       90.32    100.94    105.03    108.15    110.12
NP2004       99.14    104.34    110.16    112.05    116.00
TD2003       70.93     75.20     77.64     77.54     75.70
TD2004       78.83     80.17     82.40     83.54     80.98
OHSUMED     125.35    126.92    127.37    127.94    127.21
MQ2007       95.50     94.99     95.70     96.02     94.94
MQ2008       89.39     90.55     91.24     92.36     92.25

(In the slide, dark borders indicate significant improvements over the k = 0.5 baseline, and darker shades indicate higher performance)

Best performance with little exploration and lots of exploitation

Page 29: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

“Informational” click model

Click model: simulates very noisy interaction

29

P(c|R) = 0.9, P(c|NR) = 0.4, P(s|R) = 0.5, P(s|NR) = 0.1

(Figure: final performance over time for data set NP2003 and the informational click model, for k = 0.5, k = 0.2, and k = 0.1)

Page 30: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

“Informational” online performance

30

Data set   k = 0.5   k = 0.4   k = 0.3   k = 0.2   k = 0.1
HP2003       59.53     63.91     61.43     70.11     71.19
HP2004       41.12     52.88     48.54     55.88     55.16
NP2003       53.63     53.64     57.60     58.40     69.90
NP2004       60.59     63.38     64.17     63.23     69.96
TD2003       52.78     52.95     51.58     55.76     57.30
TD2004       58.49     61.43     59.75     62.88     63.37
OHSUMED     121.39    123.26    124.01    126.76    125.40
MQ2007       91.57     92.00     91.66     90.79     90.19
MQ2008       86.06     87.26     85.83     87.62     86.29

(In the slide, dark borders indicate significant improvements over the k = 0.5 baseline, and darker shades indicate higher performance)

Highest improvements with low exploration rates: interaction between noise and data set

Page 31: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Summary

What?
Developed the first method for balancing exploration and exploitation in online learning to rank
Devised an experimental framework for simulating user interactions and measuring online performance

And so?
Balancing exploration and exploitation improves online performance for all click models and all data sets
Best results are achieved with 2 exploratory documents per results list

31

Page 32: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

What’s next here?

Validate simulation assumptions
Evaluate using click logs
Develop new algorithms for online learning to rank for IR that can balance exploration and exploitation

32

Page 33: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Inferring Preferences from Clicks

33

Ongoing work

Page 34: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Interleaved ranker comparison methods

Use implicit feedback (“clicks”), not to infer absolute judgments, but to compare two rankers by observing clicks on an interleaved result list
Interleave two ranked lists (“outputs of two rankers”)
Use click data to detect even very small differences between rankers

Examine three existing methods for interleaving, identify issues with them and propose a new one

34

Page 35: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Three methods (1)

Balanced interleave method
An interleaved list is generated for each query based on the two rankers
The user’s clicks on the interleaved list are attributed to each ranker based on how it ranked the clicked documents
The ranker that obtains more clicks is deemed superior

35

Joachims, Evaluating retrieval performance using clickthrough data. In: Text Mining, 2003

Page 36: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online 36

(Example: balanced interleaving of list l1 = (d1, d2, d3, d4) and list l2 = (d2, d3, d4, d1))

1) Interleaving: two possible interleaved lists l, depending on which ranker contributes first:
d1, d2, d3, d4 and d2, d1, d3, d4

2) Comparison, given the observed clicks c:
First interleaved list: k = min(4, 3) = 3; click counts c1 = 1, c2 = 2
Second interleaved list: k = min(4, 4) = 4; click counts c1 = 2, c2 = 2

l2 wins the first comparison, and the lists tie for the second. In expectation l2 wins.
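A minimal sketch of balanced interleaving and its click-based comparison, consistent with the example above; it assumes both rankers rank the same documents and that clicked is the set of clicked documents on the interleaved list.

```python
# A minimal sketch of balanced interleaving (Joachims, 2003) and the comparison
# described above; tie-breaking and edge cases are simplified.
import random

def balanced_interleave(a, b, a_first=None, rng=random):
    """Build a list whose every prefix contains the top ka of a and top kb of b."""
    if a_first is None:
        a_first = rng.random() < 0.5
    out, ka, kb = [], 0, 0
    while ka < len(a) and kb < len(b):
        if ka < kb or (ka == kb and a_first):
            if a[ka] not in out:
                out.append(a[ka])
            ka += 1
        else:
            if b[kb] not in out:
                out.append(b[kb])
            kb += 1
    return out

def balanced_comparison(a, b, interleaved, clicked):
    """Attribute clicks to a and b; the list with more clicked docs in its top k wins."""
    lowest = max(clicked, key=interleaved.index)     # deepest clicked document
    k = min(a.index(lowest), b.index(lowest)) + 1    # k = min rank over both lists
    clicks_a = len(clicked & set(a[:k]))
    clicks_b = len(clicked & set(b[:k]))
    return "a" if clicks_a > clicks_b else "b" if clicks_b > clicks_a else "tie"
```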

Page 37: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Three methods (2)

Team draft method
Create an interleaved list following the model of “team captains” selecting their team from a set of players
For each pair of documents to be placed in the interleaved list, a coin flip determines which list gets to select a document first
Record which list contributed which document

37

Radlinski et al., How does click-through data reflect retrieval quality? 2008

Page 38: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online 38

(Example: team-draft interleaving of list l1 = (d1, d2, d3, d4) and list l2 = (d2, d3, d4, d1))

1) Interleaving: four possible interleaved lists l, with different assignments a (the number after each document indicates which list contributed it):
a) d1 (1), d2 (2), d3 (1), d4 (2)
b) d2 (2), d1 (1), d3 (2), d4 (1)
c) d1 (1), d2 (2), d3 (2), d4 (1)
d) d2 (2), d1 (1), d3 (1), d4 (2)

2) Comparison, given the observed clicks: for the interleaved lists a) and b), l1 wins the comparison; l2 wins in the other two cases.
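A minimal sketch of team-draft interleaving and its comparison, following the description above; truncation to a fixed result length and tie handling are simplified.

```python
# A minimal sketch of team-draft interleaving (Radlinski et al., 2008): in each
# round a coin flip decides which ranker picks first; each ranker then adds its
# highest-ranked document that is not yet in the interleaved list.
import random

def team_draft_interleave(a, b, rng=random):
    interleaved, team_a, team_b = [], set(), set()
    all_docs = set(a) | set(b)
    while len(interleaved) < len(all_docs):
        order = [("a", a, team_a), ("b", b, team_b)]
        if rng.random() < 0.5:                 # coin flip: who picks first this round
            order.reverse()
        for _, ranked, team in order:
            pick = next((d for d in ranked if d not in interleaved), None)
            if pick is not None:
                interleaved.append(pick)
                team.add(pick)
    return interleaved, team_a, team_b

def team_draft_comparison(team_a, team_b, clicked):
    """Each click counts for the team (ranker) that contributed the clicked document."""
    clicks_a, clicks_b = len(clicked & team_a), len(clicked & team_b)
    return "a" if clicks_a > clicks_b else "b" if clicks_b > clicks_a else "tie"
```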

Page 39: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Three methods (3)

Document-constraint method
Result lists are interleaved and clicks are observed as for the balanced interleave method
Constraints on pairs of individual documents are inferred from clicks and ranks
For each pair of a clicked document and a higher-ranked non-clicked document, a constraint is inferred that requires the former to be ranked higher than the latter
The original list that violates fewer constraints is deemed superior

39

He et al., Evaluation of methods for relative comparison of retrieval systems based on clickthroughs, 2009

Page 40: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online 40

(Example: the document-constraint method applied to list l1 = (d1, d2, d3, d4) and list l2 = (d2, d3, d4, d1))

1) Interleaving: two possible interleaved lists l:
d1, d2, d3, d4 and d2, d1, d3, d4

2) Comparison, given the observed clicks:
First interleaved list, inferred constraints: d2 ≻ d1 (violated by l1, not by l2), d3 ≻ d1 (violated by l1, not by l2)
Second interleaved list, inferred constraints: d1 ≻ d2 (violated by l2, not by l1), d3 ≻ d2 (violated by both l1 and l2)

l2 wins the first comparison, and loses the second. In expectation l2 wins.
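A minimal sketch of the document-constraint comparison described above; it assumes the clicked and non-clicked documents appear in both original lists, and skips any constraint involving a document missing from one of them.

```python
# A minimal sketch of the document-constraint comparison (He et al., 2009):
# every click implies the clicked document should outrank each non-clicked
# document shown above it; the original list violating fewer constraints wins.
def document_constraint_comparison(a, b, interleaved, clicked):
    constraints = []                             # pairs (preferred, dispreferred)
    for rank, doc in enumerate(interleaved):
        if doc in clicked:
            for above in interleaved[:rank]:
                if above not in clicked:
                    constraints.append((doc, above))

    def violations(ranked):
        return sum(ranked.index(pref) > ranked.index(other)
                   for pref, other in constraints
                   if pref in ranked and other in ranked)

    va, vb = violations(a), violations(b)
    return "a" if va < vb else "b" if vb < va else "tie"
```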

Page 41: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Assessing comparison methods

Bias: a method should not prefer either ranker when clicks are random

Sensitivity: the ability of a comparison method to detect differences in the quality of rankings

Balanced interleave and document constraint are biased
Team draft may suffer from insensitivity

41

Page 42: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

A new proposal

Briefly
Based on team draft
Instead of interleaving deterministically, model the interleaving process as random sampling from softmax functions that define probability distributions over documents
Derive an estimator that is unbiased and sensitive to small ranking changes
Marginalize over all possible assignments to make estimates more reliable

42
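A minimal sketch of the probabilistic interleaving step; the rank-based softmax with decay parameter tau is an assumption about the exact form of the distribution, and the marginalized probabilistic comparison is omitted.

```python
# A minimal sketch of probabilistic interleaving: each ranker defines a softmax
# over documents that decays with rank (tau controls the decay); at every
# position one ranker is drawn at random and a document is sampled from its
# renormalized distribution. Assumes l1 and l2 rank the same candidate set.
import random

def softmax_over_ranks(ranked, remaining, tau=3.0):
    """P(d) proportional to 1 / rank(d)^tau, renormalized over the remaining docs."""
    weights = {d: 1.0 / (ranked.index(d) + 1) ** tau for d in remaining}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}

def probabilistic_interleave(l1, l2, length=10, rng=random):
    interleaved, assignment = [], []      # assignment[i]: which softmax produced doc i
    remaining = set(l1)                   # same candidate set for both rankers
    while remaining and len(interleaved) < length:
        ranker, ranked = rng.choice([(1, l1), (2, l2)])
        probs = softmax_over_ranks(ranked, remaining)
        docs, ps = zip(*probs.items())
        d = rng.choices(docs, weights=ps, k=1)[0]
        interleaved.append(d)
        assignment.append(ranker)
        remaining.discard(d)
    return interleaved, assignment
```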

Page 43: Adapting Rankers Online, Maarten de Rijke

1) Probabilistic interleave
l1 = (d1, d2, d3, d4) defines a softmax s1; l2 = (d2, d3, d4, d1) defines a softmax s2
All permutations of the documents in D are possible
For each rank of the interleaved list l, draw one of {s1, s2} and sample a document d
Example softmax over ranks: P(d at rank 1) = 0.85, P(d at rank 2) = 0.10, P(d at rank 3) = 0.03, P(d at rank 4) = 0.02

2) Probabilistic comparison
Observe data, e.g. the interleaved list d1, d2, d3, d4 with assignments (1, 2, 1, 2) and the observed clicks
Marginalize over all 16 possible assignments a, computing the click counts o(c_i, a) and the probability P(a | l_i, q_i) for each
Result: P(c1 > c2) = 0.108, P(c1 < c2) = 0.144
s2 (based on l2) wins the comparison; s1 and s2 tie in expectation

Adapting Rankers Online 43

For an incoming query
The system generates an interleaved list
Observe clicks
Compute the probability of each possible outcome

All possible assignments are generated and the probability of each is computed
Expensive; only needs to be done up to the lowest observed click

Page 44: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online 44

Question

Do analytical differences between the methods translate into performance differences?

Page 45: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Evaluation

Set-up
Simulation based on the dependent click model
Perfect and realistic instantiations
Not binary, but with relevance levels

MSLR-WEB30k
Microsoft learning to rank data set
136 doc features (i.e., rankers)

Three experiments
Exhaustive comparison of all distinct ranker pairs (9,180 distinct pairs)
Selection of small subsets for detailed analysis
Add noise

45

Page 46: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Results (1)

Experiment 1: Accuracy
Percentage of pairs of rankers for which a comparison method identified the better ranker after 1000 queries

46

Method                 Accuracy
balanced interleave    0.881
team draft             0.898
document constraint    0.857
new                    0.914

Page 47: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Results (2): overview

“Problematic” pairs
Pairs of rankers for which all methods correctly identified the better one
Three achieved perfect accuracy within 1000 queries
For each method, the incorrectly judged pair with the highest difference in NDCG

47

Page 48: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Results (3): perfect model

48

(Figure: results as a function of the number of queries, from 1 to 10k, for balanced interleave, team draft, document constraint, and marginalized probabilities under the perfect click model; four panels, y-axis from 0 to 1)

Page 49: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Results (4): realistic model

49

(Figure: results as a function of the number of queries, from 1 to 10k, under the realistic click model; two panels, y-axis from 0 to 1)

Page 50: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Summary

What?
Methods for evaluating rankers using implicit feedback
Analysis of interleaved comparison methods in terms of bias and sensitivity

And so?
Introduced a new probabilistic interleaved comparison method that is unbiased and sensitive
Experimental analysis: more accurate, with substantially fewer observed queries, and more robust

50

Page 51: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

What’s next here?

Evaluate in a real-life setting in the future
With more reliable and faster convergence, our approach can pave the way for online learning to rank methods that require many comparisons

51

Page 52: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Wrap-up

52

Page 53: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Online learning to rank
Emphasis on implicit feedback collected during normal operation of the search engine

Balancing exploration and exploitation

Probabilistic method for inferring preferences from clicks

53

Page 54: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Information retrieval observatory

Academic experiments on online learning and implicit feedback used simulators
Need to validate the simulators

What’s really needed
Move away from artificial explicit feedback to natural implicit feedback
A shared experimental environment for observing users in the wild as they interact with systems

54

Page 55: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Adapting Rankers Online Maarten de Rijke, [email protected]

55

Page 56: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online 56

(Intentionally left blank)

Page 57: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online 57

Bias

(Backup example, identical to page 36: balanced interleaving of l1 = (d1, d2, d3, d4) and l2 = (d2, d3, d4, d1); l2 wins the first comparison, the lists tie for the second, and in expectation l2 wins.)

Page 58: Adapting Rankers Online, Maarten de Rijke

Adapting Rankers Online

Sensitivity

58

(Backup example, identical to page 38: team-draft interleaving of l1 = (d1, d2, d3, d4) and l2 = (d2, d3, d4, d1) with four possible interleaved lists and assignments; for lists a) and b) l1 wins the comparison, and l2 wins in the other two cases.)