Ryen White
Microsoft Research
ryenw@microsoft.com
research.microsoft.com/~ryenw/talks/ppt/WhiteIMT542E.ppt
Overview
Short, selfish bit about me
User evaluation in IR
Case study combining two approaches
  User study
  Log-based
Introduction to Exploratory Search Systems
  Focus on evaluation
Short group activity
Wrap-up
Me, Me, Me
Interested in understanding and supporting people's search behaviors, in particular on the Web
Ph.D. in Interactive Information Retrieval from University of Glasgow, Scotland (2001 - 2004)
Post-doc at University of Maryland Human-Computer Interaction Lab (2004 - 2006)
Instructor for course on Human-Computer Interaction at UMD College of Library and Information Studies
Researcher in Text Mining, Search, and Navigation group at Microsoft Research, Redmond (2006 - present)
Overview
Short, selfish bit about me
User evaluation in IR
Case study combining two approaches
  User study
  Log-based
Introduction to Exploratory Search Systems
  Focus on evaluation
Short group activity
Wrap-up
Search Interfaces
There are lots of different search interfaces, for lots of different situations
Big question: How do we evaluate these interfaces?
Some Approaches
Laboratory Experiments
Naturalistic Studies
Longitudinal Studies
Formative (during) and Summative (after) evaluations
Traditional usability studies
  Is an interface usable? Generally not comparative.
Case Studies
  Often designer, not user, driven
Research Questions
Research questions are questions that you hope your study will answer (a formal statement of your goal)
Hypotheses are specific predictions about relationships among variables
Questions should be meaningful, answerable, concise, open-ended, and value-free
Research Questions: Example 1
For a study of advanced query syntax (e.g., +, -, "", site:), the research questions were:
Is there a relationship between the use of advanced syntax and other characteristics of a search?
Is there a relationship between the use of advanced syntax and post-query navigation behaviors?
Is there a relationship between the use of advanced syntax and measures of search success?
Research Questions: Example 2
For a study of an interface gadget that points users to popular destinations (i.e., pages that many people visit):
Are popular destinations preferable and more effective than query refinement suggestions and unaided Web search for:
  Searches that are well-defined ("known-item" tasks)?
  Searches that are ill-defined ("exploratory" tasks)?
Should popular destinations be taken from the end of query trails or the end of session trails?
More on this research question in the case study later!
Variables
Independent Variable (IV): the "cause"; this is often (but not always) controlled or manipulated by the investigator
Dependent Variable (DV): the "effect"; this is what is proposed to change as a result of different values of the independent variable
Other variables:
  Intervening variable: explains the link between the IV and DV
  Moderating variable: affects the direction/strength of the IV-to-DV relationship
  Confounding variable: not controlled for, affects the DV
Hypotheses
Alternative Hypothesis: a statement describing the relationship between two or more variables
  E.g., Search engine users that use advanced query syntax find more relevant Web pages
Null Hypothesis: a statement declaring that there is no relationship among variables; you may have heard of "reject the null hypothesis" and "failing to reject the null hypothesis"
  E.g., Search engine users that use advanced query syntax find Web pages that are no more or less relevant than other users
Experimental Design
Within and/or Between Subjects
  Within-subjects: All subjects use all systems
  Between-subjects: Subjects use only one system; different blocks of users use each system
Control:
  System with no modifications (in within-subjects)
  Group of subjects that do not use the experimental system, but instead use a baseline (in between-subjects)
Factorial Designs
  > 1 variable (factor), e.g., system × task type
Tasks
Task or topic?
  Task is the activity the user is asked to perform
  Topic is the subject matter of the task
Artificial tasks
  Subjects given task or even queries; relevance pre-determined
Simulated work tasks (Borlund, 2000)
  Subjects given task; compose queries; determine relevance
Natural tasks (Kelly & Belkin, 2004)
  Subjects construct own tasks as part of real needs
System & Task Rotation
Rotation & counterbalancing to counteract learning effects
Latin Square rotation
  n × n table filled with n different symbols so that each symbol occurs exactly once in each row and exactly once in each column
Factorial rotation
  All possible orderings of the conditions
With three conditions, the factorial rotation has twice as many orderings as the Latin square, so it needs twice as many subject groups and is twice as expensive to perform
Latin Square rotation (3 conditions, 3 orderings):
  2 1 3
  1 3 2
  3 2 1

Factorial rotation (3 conditions, all 3! = 6 orderings):
  1 2 3
  2 1 3
  1 3 2
  3 1 2
  2 3 1
  3 2 1
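The two rotation schemes above can be sketched in a few lines of Python. This is an illustrative sketch, not from the talk: the cyclic construction is one standard way to build a Latin square, and the factorial rotation simply enumerates every ordering.

```python
from itertools import permutations

def latin_square(n):
    """Cyclic Latin square: row i is the condition sequence rotated by i,
    so each condition occurs exactly once in each row and each column."""
    return [[(i + j) % n + 1 for j in range(n)] for i in range(n)]

def factorial_rotation(n):
    """All n! orderings of the n conditions."""
    return [list(p) for p in permutations(range(1, n + 1))]

square = latin_square(3)
# Verify the Latin square property
for row in square:
    assert sorted(row) == [1, 2, 3]
for col in zip(*square):
    assert sorted(col) == [1, 2, 3]

print(square)                       # [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
print(len(factorial_rotation(3)))   # 6
```

With n = 3 the Latin square needs only 3 subject groups versus 6 for the full factorial, which is the cost difference noted above.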
Data Collection
Questionnaires
Diaries
Interviews
Focus groups
Observation
Think-aloud
Logging (system, proxy & server, client)
Data Analysis: Quantitative
Descriptive Statistics
  Describe the characteristics of a sample or the relationships among variables
  Present summary information about the sample
  E.g., mean, correlation coefficient
Inferential Statistics
  Used for hypothesis testing
  Demonstrate cause/effect relationships
  E.g., t-value (from t-test), F-value (from ANOVA)
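To make the inferential-statistics point concrete, here is a minimal sketch of a two-sample comparison using a hand-rolled Welch's t statistic. The data are invented for illustration (hypothetical relevance ratings for the advanced-syntax hypothesis above); in practice you would use a statistics package such as scipy.stats.ttest_ind.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / standard_error

# Hypothetical relevance ratings: users of advanced syntax vs. other users
advanced = [4, 5, 4, 3, 5, 4, 4, 5]
baseline = [3, 3, 4, 2, 3, 4, 3, 2]
print(round(welch_t(advanced, baseline), 2))
```

A large |t| relative to the t distribution's critical value would let you reject the null hypothesis that the two groups find equally relevant pages.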
Data Analysis: Qualitative
Coding - open questions, transcribed think-aloud, ...
  Classifying or categorizing individual pieces of data
Open Coding: codes are suggested by the investigator's examination and questioning of the data
  Iterative process
Closed Coding: codes are identified before the data is collected
Each passage can have more than one code
Not every passage has to have a code
Code, code, and code some more!
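The bookkeeping side of coding can be sketched simply. This toy example (codebook and passages are invented) shows closed coding where a passage may carry several codes or none, with a tally over all codes:

```python
from collections import Counter

# Hypothetical closed-coding codebook, fixed before data collection
codebook = {"QF": "query formulation",
            "REL": "relevance judging",
            "NAV": "navigation"}

# Each passage is paired with zero or more codes
coded_passages = [
    ("Hard to pick good search terms", ["QF"]),
    ("Clicked a few results, none useful", ["REL", "NAV"]),
    ("I liked the page colors", []),  # no applicable code
]

# Tally how often each code was applied across passages
tally = Counter(code for _, codes in coded_passages for code in codes)
print(tally)
```

In open coding the codebook would instead grow iteratively as the investigator examines the data.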
Overview
Short, selfish bit about me
User evaluation in IR
Case study combining two approaches
  User study
  Log-based
Introduction to Exploratory Search Systems
  Focus on evaluation
Short group activity
Wrap-up
Case Study
Leveraging popular destinations to enhance Web search interaction
White, R.W., Bilenko, M. and Cucerzan, S. (2007). Studying the use of popular destinations to enhance web search interaction. In Proceedings of the 30th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 159-166.
Motivation
Query suggestion is a popular approach to help users better define their information needs
  Incremental: may be inappropriate for exploratory needs
In exploratory searches users rely a lot on browsing
Can we use the places others go rather than what they say?
[Screenshot: query suggestions shown for the query [hubble telescope]]
Search Trails: from user logs
Initiated with a query to a top-5 search engine
Query trails end at the next query
Session trails end at a terminating event:
  Session timeout
  Visit homepage
  Type URL
  Check Web-based email or logon to online service
[Diagram: an example trail for the query "digital cameras" branching through pages such as pmai.org, dpreview.com, canon.com, amazon.com, howstuffworks.com, and digitalcamera-hq.com, with follow-up queries ("amazon", "digital camera canon", "canon lenses"); the query-trail end is marked at canon.com and the session-trail end after the final query]
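The trail definitions above can be sketched as a small segmentation routine. This is a simplified, hypothetical sketch (the event schema and 30-minute timeout are assumptions, and only the timeout criterion for session ends is implemented; the real pipeline also handles homepage visits, typed URLs, and email/logon events):

```python
from dataclasses import dataclass

SESSION_TIMEOUT = 30 * 60  # seconds; assumed inactivity threshold

@dataclass
class Event:
    kind: str    # "query" or "page"
    value: str   # query text or visited URL
    time: float  # seconds since epoch

def query_trails(events):
    """Split a user's time-ordered log into query trails:
    each trail starts at a query and ends just before the next query."""
    trails, current = [], None
    for ev in events:
        if ev.kind == "query":
            if current:
                trails.append(current)
            current = [ev]
        elif current:
            current.append(ev)
    if current:
        trails.append(current)
    return trails

def session_trails(events):
    """Split the log into session trails: a session ends on a long gap."""
    sessions, current, last_time = [], [], None
    for ev in events:
        if last_time is not None and ev.time - last_time > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append(ev)
        last_time = ev.time
    if current:
        sessions.append(current)
    return sessions
```

For example, a log with two queries in one sitting yields two query trails but a single session trail, matching the diagram's distinction between QueryTrailEnd and SessionTrailEnd.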
Popular Destinations
Pages at which other users frequently end up after submitting the same or similar queries, and then browsing away from initially clicked search results
Popular destinations lie at the end of many users' trails
  May not be among the top-ranked results
  May not contain the queried terms
  May not even be indexed by the search engine
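A toy sketch of mining popular destinations from a trail corpus (this is an illustration, not the paper's pipeline, which also matches similar queries rather than only exact ones): count how often each page appears at the end of trails for a given query, then keep the most frequent.

```python
from collections import Counter, defaultdict

def popular_destinations(trails, top_k=3):
    """Rank trail-end pages per query by how many users end up there.
    Each trail is (query, [page1, ..., pageN]); the destination is pageN."""
    ends = defaultdict(Counter)
    for query, pages in trails:
        if pages:
            ends[query][pages[-1]] += 1
    return {query: [page for page, _ in counts.most_common(top_k)]
            for query, counts in ends.items()}

trails = [
    ("digital cameras", ["pmai.org", "dpreview.com"]),
    ("digital cameras", ["canon.com", "dpreview.com"]),
    ("digital cameras", ["amazon.com"]),
]
print(popular_destinations(trails))
```

Note that dpreview.com wins here even though neither trail reached it directly from a top-ranked result, mirroring the bullet points above: a destination need not be top-ranked, contain the query terms, or even be indexed.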
Suggesting Destinations
Can we exploit a corpus of trails to support Web search?
Research Questions
RQ1: Are destination suggestions preferable and more effective than query refinement suggestions and unaided Web search for:
  Searches that are well-defined ("known-item" tasks)?
  Searches that are ill-defined ("exploratory" tasks)?
RQ2: Should destination suggestions be taken from the end of the query trails or the end of the session trails?
User Study
Conducted a user study to answer these questions
36 subjects drawn from a subject pool within our organization
4 systems
2 task types ("known-item" and "exploratory")
Within-subject experimental design
Graeco-Latin square design
Subjects attempted 2 known-item and 2 exploratory tasks, one on each system
Systems: Unaided Web Search
Live Search backend
No direct support for query refinement
[Screenshot: results page for the query [hubble telescope]]
Systems: Query Suggestion
Suggests queries based on popular extensions of the current query typed by the user
[Screenshot: query suggestions for [hubble telescope]]

Systems: Destination Suggestion
Query Destination (unaided + page support)
  Suggests pages many users visit before the next query
Session Destination (unaided + page support)
  Same as above, but before session end rather than the next query
[Screenshot: destination suggestions for [hubble telescope]]
Tasks
Tasks taken and adapted from the TREC Interactive Track and QA communities (e.g., Live QnA, Yahoo! Answers)
Six of each task type; subjects chose without replacement
Two task types: known-item and exploratory
  Known-item: Identify three tropical storms (hurricanes and typhoons) that have caused property damage and/or loss of life.
  Exploratory: You are considering purchasing a Voice over Internet Protocol (VoIP) telephone. You want to learn more about VoIP technology and providers that offer the service, and select the provider and telephone that best suits you.
Methodology
Subjects:
  Chose two known-item and two exploratory tasks from the six
  Completed a demographic and experience questionnaire
For each of the four interfaces, subjects were:
  Given an explanation of interface functionality (2 min.)
  Asked to attempt the task on the assigned system (10 min.)
  Asked to complete a post-search questionnaire after each task
After using all four systems, subjects answered an exit questionnaire
Findings: System Ranking
Subjects were asked to rank the systems in preference order
Subjects preferred QuerySuggestion and QueryDestination
Differences not statistically significant
Overall ranking merges performance on different types of search task to produce one ranking

System:   Baseline  QuerySuggestion  QueryDestination  SessionDestination
Ranking:  2.47      2.14             1.92              2.31
Relative ranking of systems (lower = better).
Findings: Subject Comments
Responses to open-ended questions
Baseline:
+ familiarity of the system (e.g., "was familiar and I didn't end up using suggestions" (S36))
− lack of support for query formulation (e.g., "Can be difficult if you don't pick good search terms" (S20))
− difficulty locating relevant documents (e.g., "Difficult to find what I was looking for" (S13))
Findings: Subject Comments
QuerySuggestion:
+ rapid support for query formulation (e.g., “was useful in saving typing and coming up with new ideas for query expansion” (S12); “helps me better phrase the search term” (S24); “made my next query easier” (S21))
− suggestion quality (e.g., “Not relevant” (S11); “Popular queries weren’t what I was looking for” (S18))
− quality of results they led to (e.g., “Results (after clicking on suggestions) were of low quality” (S35); “Ultimately unhelpful” (S1))
Findings: Subject Comments
QueryDestination:
+ support for accessing new information sources (e.g., “provided potentially helpful and new areas / domains to look at” (S27))
+ bypassing the need to browse to these pages (“Useful to try to ‘cut to the chase’ and go where others may have found answers to the topic” (S3))
− lack of specificity in the suggested domains (“Should just link to site-specific query, not site itself” (S16); “Sites were not very specific” (S24); “Too general/vague” (S28))
− quality of the suggestions (“Not relevant” (S11); “Irrelevant” (S6))
Findings: Subject Comments
SessionDestination:
+ utility of the suggested domains (“suggestions make an awful lot of sense in providing search assistance, and seemed to help very nicely” (S5))
− irrelevance of the suggestions (e.g., “did not seem reliable, not much help” (S30); “irrelevant, not my style” (S21))
− need to include explanations about why the suggestions were offered (e.g., “low-quality results, not enough information presented” (S35))
Findings: Task Completion
Subjects felt that they were more successful for known-item searches on QuerySuggestion and more successful for exploratory searches on QueryDestination

Task type     Baseline  QSuggestion  QDestination  SDestination
Known-item    2.0       1.3          1.4           1.4
Exploratory   2.8       2.3          1.4           2.6
Perceptions of task success (lower = better, scale = 1-5).
Findings: Task Completion Time
QuerySuggestion and QueryDestination sped up known-item performance
Exploratory tasks took longer

Mean task time in seconds:
Task type     Baseline  QSuggestion  QDestination  SDestination
Known-item    348.8     272.3        232.3         359.8
Exploratory   513.7     467.8        474.2         472.2
Findings: Interaction
Known-item tasks: subjects used query suggestions most heavily
Exploratory tasks: subjects benefited most from destination suggestions
Subjects submitted fewer queries and clicked fewer search results on QueryDestination

Task type     QSuggestion  QDestination  SDestination
Known-item    35.7         33.5          23.4
Exploratory   30.0         35.2          25.3
Suggestion uptake (values are percentages).
Log Analysis
These findings are all from the laboratory
Logs from consenting users of the Windows Live Toolbar allowed us to determine the external validity of our experimental findings
  Do the behaviors observed in the study mimic those of real users in the "wild"?
Extracted search sessions from the logs that started with the same initial queries as our user study subjects
Log Analysis: Search Trails
Initiated with a query to a top-5 search engine
Query trails and session trails, defined as before
[Diagram: the same "digital cameras" search-trail example shown earlier, with the query-trail and session-trail end points marked]
Log Analysis: Trails
We extracted 2,038 trails from the logs that began with the same query as a user study session
  700 from known-item and 1,338 from exploratory tasks
In vitro group: User study subjects
Ex vitro group: Remote subjects
Compared: # query iterations, # unique query terms, # result clicks, and # unique domains visited
Log Analysis: Results
Generally the same, apart from the number of unique query terms submitted
Subjects may be taking terms from the textual task descriptions provided to them

                     Known-item               Exploratory
                     In vitro  Ex vitro       In vitro  Ex vitro
Feature                        10 min  All              10 min  All
Query iterations     1.9       2.3     2.6    3.1       3.0     3.8
Unique query terms   5.2       2.8     3.2    7.4       4.4     4.9
Result clicks        2.6       1.8     2.5    3.3       2.8     3.1
Unique domains       1.3       1.4     1.7    2.1       1.8     2.1

Note: the in vitro unique query term counts (5.2 and 7.4) are high!
Log Analysis: Results
Known-item tasks: 72% overlap between queries issued and terms appearing in the task description
Exploratory tasks: 79% overlap between queries issued and terms appearing in the task description
Could confound the experiment if we are interested in query formulation behavior - need to address!
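An overlap measure of this kind can be sketched as follows. This is an illustrative simplification (exact lowercase token matching over unique query terms; the study's precise measure may differ, e.g., in stemming or weighting), and the task/queries below are adapted from the known-item example earlier:

```python
import re

def term_overlap(queries, task_description):
    """Fraction of unique query terms that also appear in the
    task description (a rough proxy for the overlap measure)."""
    tokenize = lambda text: set(re.findall(r"[a-z0-9]+", text.lower()))
    query_terms = set()
    for query in queries:
        query_terms |= tokenize(query)
    task_terms = tokenize(task_description)
    return len(query_terms & task_terms) / len(query_terms)

task = ("Identify three tropical storms (hurricanes and typhoons) "
        "that have caused property damage and/or loss of life.")
queries = ["tropical storms damage", "hurricane deaths"]
print(term_overlap(queries, task))
```

A high value, as observed in the study, suggests subjects lifted query terms straight from the written task description rather than formulating their own.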
Conclusions
User study compared popular destinations with traditional query refinement and unaided Web search
Results revealed that:
  RQ1a: Query suggestion preferred for known-item tasks
  RQ1b: Destination suggestion preferred for exploratory tasks
  RQ2: Destinations from query trails rather than session trails
Differences in the number of unique query terms suggest that textual task descriptions may introduce some degree of experimental bias
Case Study
What did we learn?
Showed how a user evaluation can be conducted
Showed how analysis of different sources – questionnaire responses and interaction logs (both local and remote) – can be combined to answer our research questions
Showed that the findings of a user study can be generalized in some respects to the “real” world (i.e., has some external validity)
Anything else?
Overview
Short, selfish bit about me
User evaluation in IR
Case study combining two approaches
  User study
  Log-based
Introduction to Exploratory Search Systems
  Focus on evaluation
Short group activity
Wrap-up
Exploratory Search
Marchionini's definition: "exploratory search" describes:
  an information-seeking problem context that is open-ended, persistent, and multi-faceted
    commonly found in scientific discovery, learning, and decision-making contexts
  information-seeking processes that are opportunistic, iterative, and multi-tactical
    exploratory tactics are used in all manner of information seeking and reflect seeker preferences and experience as much as the goal
The first part characterizes the user's search problem; the second, the user's search strategies
Exploratory Search Systems
Support both querying and browsing activities
  Search engines generally just support querying
Help users explore complex information spaces
Help users learn about new topics: go beyond finding
Can consider user context
  E.g., task constraints, user emotion, changing needs
Overview
Short, selfish bit about me
User evaluation in IR
Case study combining two approaches
  User study
  Log-based
Introduction to Exploratory Search Systems
  Focus on evaluation
Short group activity
Wrap-up
Group Activity
Divide into two groups of 3-4 people
Each group designs an evaluation of an exploratory search system
Two systems:
  mSpace (mspace.fm): faceted spatial browser for classical music
  PhotoMesa (photomesa.com): photo browser with flexible filtering, grouping, and zooming tools
You pick the evaluation criteria, comparator systems, approach, metrics, etc.
Some questions to think about
What are the independent/dependent variables?
Which experimental design?
What task types? What tasks? What topics?
Any comparator systems?
What subjects? How many? How will you recruit?
Which instruments? (e.g., questionnaires)
Which data analysis methods (qualitative/quantitative)?
Most importantly: Which metrics?
  How do you determine user and system performance?
Overview
Short, selfish bit about me
User evaluation in IR
Case study combining two approaches
  User study
  Log-based
Introduction to Exploratory Search Systems
  Focus on evaluation
Short group activity
Wrap-up
Evaluating Exploratory Search
SIGIR 2006 workshop on Evaluating Exploratory Search Systems
Brought together around 40 experts to discuss issues in the evaluation of exploratory search systems
http://research.microsoft.com/~ryenw/eess
What metrics did they come up with? How do they compare to yours?
Metrics from the workshop
Engagement and enjoyment:
  e.g., task focus, happiness with system responses, the number of actionable events (e.g., purchases, forms filled)
Information novelty:
  e.g., the amount of new information encountered
Task success:
  e.g., reached the target document? encountered sufficient information en route?
Task time: to assess efficiency
Learning and cognition:
  e.g., cognitive load, attainment of learning outcomes, richness/completeness of post-exploration perspective, amount of topic space covered, number of insights
Activity Wrap-up
[insert summary of comments from group activity]
Conclusion
We have:
  Described aspects of user experimentation in IR
  Walked through a case study
  Introduced exploratory search
  Planned evaluation of exploratory search systems
  Related our proposed metrics to those of others interested in evaluating exploratory search systems
Acknowledgements
Although modified, a few of the earlier slides in this lecture were based on an excellent SIGIR 2006 tutorial given by Diane Kelly and David Harper – Thank you Diane and David!
Referenced Reading
Borlund, P. (2000). Experimental components for the evaluation of interactive information retrieval systems. Journal of Documentation, 56(1): 71-90.
Kelly, D. and Belkin, N.J. (2004). Display time as implicit feedback: Understanding task effects. In Proceedings of the 27th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 377-384.