yandex'10 kal-slides
TRANSCRIPT
Why Do Users in Real Life Use Short Queries?
Kal Järvelin, [email protected]
Joint work with: H. Keskustalo, A. Pirkola, T. Sharma, M. Lykke Nielsen
A Quick Answer
Because … they are good enough and effortless.
But how to show that?
Outline
1. Introduction
2. Study Design
   - Research Question
   - Test Environment
   - Experimental Protocol
3. Experimental Results
4. Conclusions
Introduction
Traditional test collection-based IR:
- methods compared based on result quality
- topical relevance
- one query per topic
- verbose queries
- (long) lists of retrieved documents
- often binary relevance with a low threshold
Introduction
On the contrary, real searchers:
- have various interaction strategies / expectations
- consider more than topical relevance
- use more than one query, if needed, in sessions
- use short queries (Jansen & al., 2000)
- use unstructured queries (Ruthven, 2008)
- may or may not avoid sequences of topically non-relevant documents (Azzopardi, 2007)
- want few, but good, documents (Järvelin & al., 2008)
The Present Talk ...
... brings opposing views closer together: session simulation in a test collection
- topical sessions (up to 5 queries per topic)
- idealized session strategies (3+1)
- short queries (including 1-word sequences)
- short browsing (10-document) window
- task: to find one (highly) relevant document
  § but other search goals can be assumed
Research Question
What is the effectiveness of a session of short queries compared to one verbose [TREC] query?
Test Environment
TREC 7-8 collection
- 41 topics
- 528,155 documents
- graded relevance judgments (highly, fairly, marginally relevant, and non-relevant documents)
Lemur retrieval system
Query keys collected from test persons
Experimental Protocol
- Obtaining search keys
- Session strategies
- Simulated session construction
- Retrieval protocol
Collecting Search Keys
- 7+7 test persons
- Intellectual analysis of the 41 topics
- Each topic analyzed twice: once by a student (Group A), and once by a staff member (Group B)
- The task was to identify good search keys
- Various session scenarios were employed
An Example
Topic number: 351
Description:
What information is available on petroleum exploration in the South Atlantic near the Falkland Islands?
Narrative:
Any document discussing petroleum exploration in the South Atlantic near the Falkland Islands is considered relevant.
Session Strategy S1
One-word queries only
- Jansen & al. (2000); Stenmark (2008)
- Lykke & al. (2009): employed 21 times in the 60 real-life sessions

Example: falkland → exploration → island → petroleum → …
Session Strategy S2
Incremental query extension
- One word added if the query fails
- Lykke & al.: 13 times in the 60 real-life sessions

Example: petroleum → petroleum exploration → petroleum exploration south → petroleum exploration south atlantic
Session Strategy S3
"Variations on a theme of two words"
- 2 fixed keys; the 3rd key is varied
- Lykke & al.: in 38 of the 60 real-life sessions

Example: petroleum exploration south → petroleum exploration atlantic → petroleum exploration falkland → …
Session Strategy S4
One verbose [TREC] query (title + description)
- the traditional baseline

Example: falkland petroleum exploration information available petroleum exploration south atlantic falkland island
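A minimal sketch, in Python, of how query sequences for the four strategies could be assembled from a topic's search keys; the function names and example keys are illustrative, not the study's actual code.

    # Sketch: assembling query sequences for strategies S1-S4 from a key pool.
    # Names and the example keys are illustrative, not the experimental code.

    def s1_queries(keys, max_queries=5):
        """S1: one-word queries only, one key per query."""
        return [[k] for k in keys[:max_queries]]

    def s2_queries(keys, max_queries=5):
        """S2: incremental extension; one key is added whenever a query fails."""
        return [keys[:i] for i in range(1, min(len(keys), max_queries) + 1)]

    def s3_queries(fixed_pair, varied, max_queries=5):
        """S3: two fixed keys, the third key is varied from query to query."""
        return [list(fixed_pair) + [v] for v in varied[:max_queries]]

    def s4_query(title, description):
        """S4: one verbose query (TREC title + description), the baseline."""
        return (title + " " + description).lower().split()

    # Keys for topic 351 (Falkland petroleum exploration), as an example:
    keys = ["petroleum", "exploration", "south", "atlantic", "falkland"]
    print(s1_queries(keys))
    print(s2_queries(keys))
    print(s3_queries(("petroleum", "exploration"), ["south", "atlantic", "falkland"]))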
Simulations
Instead of real interactive sessions, we performed session simulation:
- search keys chosen randomly from the pool for each topic
- chosen keys arranged into consecutive queries according to the four strategies
- person assumed to scan the first result page and stop at the first marginally/highly relevant document
Retrieval Protocol
- Construct query sessions for each strategy
- Retrieve the top-10 documents for each individual query (top-50 for S4)
- Determine whether / how rapidly each query sequence succeeds or fails
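As a rough illustration of this protocol, the loop below simulates one session under stated assumptions: retrieve() stands in for the Lemur run used in the study, is_relevant() encodes the chosen relevance threshold, and all names are hypothetical.

    import random

    def build_session(key_pool, strategy, rng=random):
        """Draw keys at random from a topic's pool and arrange them into
        consecutive queries according to one strategy (e.g. s1_queries above)."""
        keys = rng.sample(key_pool, min(5, len(key_pool)))
        return strategy(keys)

    def simulate_session(queries, retrieve, is_relevant, page_size=10):
        """Issue the queries in order, scan the top page of each result list,
        and stop at the first query whose page contains a relevant document.
        Returns the 1-based ordinal of the successful query, or None if the
        whole session fails. (For S4, the same scan would be applied page by
        page over the top 50 of the single verbose query.)"""
        for ordinal, query in enumerate(queries, start=1):
            top_docs = retrieve(query, k=page_size)  # placeholder for a Lemur run
            if any(is_relevant(doc) for doc in top_docs):
                return ordinal
        return None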
Results
Table: success of strategies S1-S4 by individual topic, under liberal and stringent relevance thresholds, for test groups A and B (S1-S3) and the S4 baseline; each cell gives the ordinal of the first successful query, and an empty cell marks a failed session.
Count of successful sessions (max = 41), liberal relevance threshold
Figure: # sessions per session strategy and test group (S1 A, S1 B, S2 A, S2 B, S3 A, S3 B, S4).
Count of successful sessions (max = 38), stringent relevance threshold
Figure: # sessions per session strategy and test group (S1 A, S1 B, S2 A, S2 B, S3 A, S3 B, S4).
Statistical significance
Friedman's test on the ordinal of success. Similar results for groups A and B and for liberal and stringent relevance. Significant pairwise differences (p = 0.01) as follows:
- S1 differs from S2, S3, S4
- S2 differs from S4
- S3 does not differ significantly from S4
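A sketch of how such a test could be run, assuming each strategy's per-topic success ordinals are held in aligned lists (failures coded as a value worse than any observed success); the omnibus Friedman test below uses scipy, while the pairwise comparisons reported above would need an additional post-hoc procedure. The numbers are placeholders, not the study's data.

    from scipy.stats import friedmanchisquare

    FAIL = 6  # code a failed session as worse than any observed ordinal (1-5)

    # Success ordinal per topic for each strategy, in the same topic order
    # (placeholder values, not the actual results):
    s1 = [1, 3, 2, FAIL, 4, 2]
    s2 = [1, 1, 2, 2, 1, 1]
    s3 = [1, 1, 1, 1, 2, 1]
    s4 = [1, 1, 1, 1, 1, 1]

    stat, p = friedmanchisquare(s1, s2, s3, s4)
    print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")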
S1: Cumulative success (%)
Figure: cumulative success (%) by query ordinal (1-5), S1 Group A vs. S1 Group B.
S2: Cumulative success (%)
Figure: cumulative success (%) by query ordinal (1-5), S2 Group A vs. S2 Group B.
S3: Cumulative success (%)
Figure: cumulative success (%) by query ordinal (1-5), S3 Group A vs. S3 Group B.
S4: Cumulative success (%)
Figure: cumulative success (%) by ordinal of the 10-document page inspected (1-5), S4 baseline.
Non-session view: single best of all S1 query generations compared to the S4 baseline
Figure: P@10 (%) for session strategies S1 and S4.
Non-session view: single best of all S1 query generations compared to the S4 baseline
Figure: average precision (AP, %) over 38 topics for session strategies S1 and S4.
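For completeness, the two non-session measures used in these comparisons can be computed as sketched below (illustrative code; a set of relevant document ids under the chosen relevance threshold is assumed).

    def precision_at_k(ranked_docs, relevant, k=10):
        """P@k: fraction of the top-k retrieved documents that are relevant."""
        return sum(1 for d in ranked_docs[:k] if d in relevant) / k

    def average_precision(ranked_docs, relevant):
        """Non-interpolated average precision over the relevant set."""
        if not relevant:
            return 0.0
        hits, total = 0, 0.0
        for rank, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
                total += hits / rank
        return total / len(relevant)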
Effort: expected number of search keys, assuming various strategies
Figure: # search keys to enter, per strategy (S1-S4), for an equal level of success.
Effort: expected number of queries to launch to find one relevant document
Figure: # queries to enter, per strategy (S1-S4), for an equal level of success.
Discussion
Another way to look at the success of IR, motivated by observed user behavior:
- short query sessions
- short browsing
- to find a few good documents
Log studies justify the simulations.
Short queries are good enough and easy
- even if inferior when used individually.
Conclusions
Test collection-based IR evaluation could be extended to:
- include multiple-query sessions
- focus on how the system is used
  § querying/browsing strategies (interaction)
  § in relation to the user's specific goals
- focus, in evaluation, on the user viewpoint
  § strategies serving a particular goal
  § simulation approach for repeatability + control
Conclusions
Session simulations:
- a promising approach to study the limits of the effectiveness of various system uses
- findings can be verified with real users
  § but our results motivate the observed real user behavior
A prospect for search training:
- recognize QM patterns of users
- simulate them
- measure session success from the user's point of view for a "satisfactory result"
Acknowledgement
This research was supported by the Academy of Finland grants #120996 and #124131.
Reference:
Keskustalo, H., Järvelin, K., Pirkola, A., Sharma, T. & Lykke Nielsen, M. (2009). Test Collection-Based IR Evaluation Needs Extension Toward Sessions - A Case of Extremely Short Queries. In: Lee, G. & al. (Eds.), Proceedings of AIRS 2009, Sapporo, Japan, October 2009. Heidelberg: Springer, LNCS vol. 5839, pp. 63-74.
Thank you!