Information access in context · CS276A · Information visualization · How to design user interfaces...
TRANSCRIPT
CS276A: Text Information Retrieval, Mining, and Exploitation
Lecture 10, 7 Nov 2002
2
Information Access in Context
[Diagram: the information access loop. The user starts with a high-level goal, performs information access, analyzes and synthesizes the results, and asks "Done?": if yes, stop; if no, iterate.]
3
Exercise
• Observe your own information seeking behavior
  • WWW
  • University library
  • Grocery store
• Are you a searcher or a browser?
• How do you reformulate your query?
  • Read bad hits, then minus terms
  • Read good hits, then plus terms
  • Try a completely different query
  • …
4
Correction: Address Field vs. Search Box
• Are users typing URLs into the search box ignorant?
  • .com / .org / .net / international URLs
  • cnn.com vs. www.cnn.com
  • Full URL with protocol qualifier vs. partial URL
5
Today’s Topics
• Information design and visualization
• Evaluation measures and test collections
• Evaluation of interactive information retrieval
• Evaluation gotchas
6
Information Visualization and Exploration
• Tufte
• Shneiderman
• Information foraging: Xerox PARC / PARC Inc.
7
Edward Tufte
• Information design bible: The Visual Display of Quantitative Information
• The art and science of how to display (quantitative) information visually
• Significant influence on user interface design
8
The Challenger Accident
• On January 28, 1986, the space shuttle Challenger explodes shortly after takeoff.
• Seven crew members die.
• One of the causes: an O-ring failed due to cold temperatures.
• How could this happen?
9
How O-Rings Were Presented
• Time scale is shown, instead of a temperature scale!
• "Needless junk" (rockets don't show information)
• Graphic does not help answer the question: why do O-rings fail?
10
Tufte: Principles for Information Design
• Omit needless junk
• Show what you mean
• Don't obscure the meaning and order of scales
• Make comparisons of related images possible
• Claim authorship, and think twice when others don't
• Seek truth
11
Tufte’s O-Ring Visualization
12
Tufte: Summary
• "Like poor writing, bad graphical displays distort or obscure the data, make it harder to understand or compare, or otherwise thwart the communicative effect which the graph should convey."
• Bad decisions are made based on bad information design.
• Tufte's influence on UI design
• Examples of the best and worst in information visualization: http://www.math.yorku.ca/SCS/Gallery/noframes.html
13
Shneiderman: Information Visualization
• How to design user interfaces
• How to engineer user interfaces for software
• Task by type taxonomy
14
Shneiderman on HCI
• Well-designed interactive computer systems:
  • Promote positive feelings of success, competence, and mastery.
  • Allow users to concentrate on their work, rather than on the system.
Credit: Marti Hearst
15
Task by Type Taxonomy: Data Types
• 1-D linear: SeeSoft
• 2-D map: multidimensional scaling (terms, docs, etc.)
• 3-D world: Cat-a-Cone
• Multi-dimensional: Table Lens
• Temporal: topic detection
• Tree: hierarchies à la Yahoo
• Network: network graphs of sites (KartOO)
16
Task by Type Taxonomy: Tasks
• Overview: gain an overview of the entire collection
• Zoom: zoom in on items of interest
• Filter: filter out uninteresting items
• Details-on-demand: select an item or group and get details when needed
• Relate: view relationships among items
• History: keep a history of actions to support undo and replay
• Extract: allow extraction of subcollections and the query parameters
17
Exercise
• If your project has a UI component:
  • Which data types are being displayed?
  • Which tasks are you supporting?
18
Xerox PARC: Information Foraging
• Metaphor from ecology/biology
• People looking for information = animals foraging for food
• Predictive model that allows a principled way of designing user interfaces
• The main focus is:
  • What will the user do next?
  • How can we support a good choice for the next action?
• Rather than:
  • Evaluation of a single user-system interaction
19
Foraging Paradigm
[Figure: energy-flow diagram]
Food Foraging: Biological, behavioral, and cultural designs are adaptive to the extent they optimize the rate of energy intake.
Credit: George Robertson, Microsoft
20
Information Foraging Paradigm
[Figure: information-flow diagram]
Information Foraging: Information access and visualization technologies are adaptive to the extent they optimize the rate of gain of valuable information.
Credit: George Robertson, Microsoft
21
Searching Patches
Credit: George Robertson, Microsoft
22
Information Foraging: Theory
• G: total information/food gained
• g: average gain per doc/patch
• T_B: total time between docs/patches
• t_b: average time between docs/patches
• T_W: total time within docs/patches
• t_w: average time to process a doc/patch
• λ = 1/t_b: prevalence of information/food
23
Information Foraging: Theory
• R = G / (T_B + T_W): rate of gain
• R = λ T_B g / (T_B + λ T_B t_w)
• R = λ g / (1 + λ t_w)
• Goodness measure of a UI = R = rate of gain
• Optimize the UI by increasing R:
  • Increase prevalence λ (asymptotic improvement)
  • Decrease t_w (the time it takes to absorb a doc/patch)
• Better model: different types of docs/patches
• The model can be used to find optimal UI parameters (a numeric sketch follows below)
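To make these trade-offs concrete, here is a minimal sketch of the rate-of-gain model in Python. All the numbers are hypothetical, chosen only to illustrate why raising the prevalence λ yields diminishing (asymptotic) returns while cutting the within-patch time t_w pays off more directly.

```python
# Minimal sketch of the foraging rate-of-gain model R = lambda*g / (1 + lambda*t_w),
# with lambda = 1/t_b the prevalence of docs/patches, g the average gain per doc,
# and t_w the average time to process a doc. All values below are hypothetical.

def rate_of_gain(lam: float, g: float, t_w: float) -> float:
    """Rate of gain R = lambda * g / (1 + lambda * t_w)."""
    return lam * g / (1 + lam * t_w)

g = 1.0         # average gain per document (arbitrary units)
lam = 1.0 / 30  # one useful document found every 30 seconds of searching
t_w = 60.0      # 60 seconds to read/absorb a document

base = rate_of_gain(lam, g, t_w)
print(rate_of_gain(2 * lam, g, t_w) / base)  # doubling prevalence: ~1.2x, not 2x
print(rate_of_gain(lam, g, t_w / 2) / base)  # halving reading time: 1.5x
```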
24
Cost-of-Knowledge Characteristic Function
• Improve productivity: less time or more output
Credit: Card, Pirolli, and Mackinlay
Creating Test Collections for IR Evaluation
26
Test Corpora
27
Kappa Measure
• Kappa measures:
  • Agreement among coders
  • Designed for categorical judgments
  • Corrects for chance agreement
• Kappa = [P(A) - P(E)] / [1 - P(E)]
  • P(A): proportion of the time the coders agree
  • P(E): what agreement would be by chance
• Kappa = 0 for chance agreement, 1 for total agreement.
28
Kappa Measure: Example
Number of docs   Judge 1       Judge 2
300              Relevant      Relevant
70               Nonrelevant   Nonrelevant
20               Relevant      Nonrelevant
10               Nonrelevant   Relevant

P(A)? P(E)?
29
Kappa Example
• P(A) = 370/400 = 0.925
• P(nonrelevant) = (10 + 20 + 70 + 70) / 800 = 0.2125
• P(relevant) = (10 + 20 + 300 + 300) / 800 = 0.7875
• P(E) = 0.2125² + 0.7875² = 0.665
• Kappa = (0.925 - 0.665) / (1 - 0.665) = 0.776
• For > 2 judges: average pairwise kappas (a code sketch of the computation follows below)
30
Kappa Measure
• Kappa > 0.8: good agreement
• 0.67 < Kappa < 0.8: "tentative conclusions" (Carletta 96)
• Depends on the purpose of the study
31
Interjudge Disagreement: TREC-3
32
33
Impact of Interjudge Disagreement
• Impact on an absolute performance measure can be significant (0.32 vs. 0.39)
• Little impact on the ranking of different systems or on relative performance
Evaluation Measures
35
Recap: Precision/Recall
• Evaluation of ranked results:
  • You can return any number of results, ordered by similarity
  • By taking various numbers of documents (levels of recall), you can produce a precision-recall curve
• Precision: #(correct & retrieved) / #retrieved
• Recall: #(correct & retrieved) / #correct
• The truth, the whole truth, and nothing but the truth: recall 1.0 = the whole truth; precision 1.0 = nothing but the truth. (A sketch of the curve computation follows below.)
36
Recap: Precision-recall curves
37
F Measure
• The F measure is the harmonic mean of precision and recall (strictly speaking, F1)
• 1/F = ½ (1/P + 1/R), i.e., F = 2PR / (P + R)
• Use the F measure if you need to optimize a single measure that balances precision and recall (a one-line sketch follows below).
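In code, the definition is one line; the toy values below just show how the harmonic mean punishes an imbalance between P and R.

```python
# Minimal sketch: F1 from 1/F = (1/2)(1/P + 1/R), i.e. F = 2PR / (P + R).
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

print(f1(0.5, 0.5))  # 0.50: balanced P and R
print(f1(0.9, 0.1))  # 0.18: the harmonic mean punishes imbalance
```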
38
F-Measure
[Chart: precision and F1 plotted against recall; annotation: F1(0.956) = max = 0.96]
39
[Chart: precision and F1 plotted against recall, repeated from the previous slide]
Breakeven Point
• The breakeven point is the point where precision equals recall.
• An alternative single measure of IR effectiveness.
• How do you compute it? (A sketch follows below.)
40
Area under the ROC Curve
• True positive rate = recall = sensitivity
• False positive rate = fp / (tn + fp). Related to precision: fpr = 0 ↔ p = 1
• Why is the blue line "worthless"? (See the AUC sketch below.)
41
Precision-Recall Graph vs. ROC
[Chart: precision-recall curve vs. ROC curve; x-axis: Recall = True Positive Rate (ROC mirror, PR curve); False Positive Rate (ROC)]
42
Unit of Evaluation
• We can compute precision, recall, F, and the ROC curve for different units.
• Possible units:
  • Documents (most common)
  • Facts (used in some TREC evaluations)
  • Entities (e.g., car companies)
• May produce different results. Why?
43
Critique of Pure Relevance
• Relevance vs. marginal relevance
  • A document can be redundant even if it is highly relevant
  • Duplicates
  • The same information from different sources
• Marginal relevance is a better measure of utility for the user.
• Using facts/entities as evaluation units more directly measures true relevance.
  • But it is harder to create the evaluation set
• See the Carbonell reference (a minimal MMR sketch follows below)
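The Carbonell & Goldstein reference in the resources defines Maximal Marginal Relevance (MMR). The sketch below shows the reranking criterion; `sim_q` and `sim_d` are stand-ins for whatever query-document and document-document similarities you use, and the toy scores are invented.

```python
# Minimal sketch of Maximal Marginal Relevance (MMR) reranking: at each
# step, pick the document that best trades off relevance to the query
# against redundancy with the documents already selected.
def mmr_rerank(docs, sim_q, sim_d, lam=0.7, k=5):
    selected, candidates = [], list(docs)
    while candidates and len(selected) < k:
        def score(d):
            redundancy = max((sim_d(d, s) for s in selected), default=0.0)
            return lam * sim_q(d) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy usage with made-up similarities: 'a' and 'b' are near-duplicates.
rel = {"a": 0.9, "b": 0.85, "c": 0.3}
pair = {frozenset({"a", "b"}): 0.95}
print(mmr_rerank(rel, lambda d: rel[d],
                 lambda x, y: pair.get(frozenset({x, y}), 0.0),
                 lam=0.5, k=2))  # ['a', 'c']: 'b' is demoted as redundant
```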
Evaluation of Interactive Information Retrieval
45
Evaluating Interactive IR
• Evaluating interactive IR poses special challenges
• Obtaining experimental data is more expensive
• Experiments involving humans require careful design:
  • Control for confounding variables
  • Questionnaires to collect relevant subject data
  • Ensure that the experimental setup is close to the intended real-world scenario
  • Approval for human subjects research
46
IIR Evaluation Case Study 1
• TREC-6 interactive track report
• 9 participating groups (US, Europe, Australia)
• Control system (a simple IR system)
• Each group ran their own system and the control system
• 4 users at each site
• 6 queries (= topics)
• Goal of evaluation: find the best-performing system
• Why do you need a control system for comparing groups?
47
Queries (= Topics)
48
Latin Square Design
49
Analysis of Variance
50
Analysis of Variance
51
Analysis of Variance
52
Observations
• Query effect has the largest std at each site
  • High degree of query variability
• Searcher effect negligible for 4 out of 10 sites
• Best model per site:

  Model:   M1  M2  M3  M4
  #sites:   3   4   2   1

• Interactions are small compared to overall error.
• None of the 10 sites statistically better than the control system!
53
IIR Evaluation Case Study 2
• Evaluation of relevance feedback
• Koenemann & Belkin 1996
54
Why Evaluate Relevance Feedback?
55
Questions being Investigated (Koenemann & Belkin 96)
• How well do users work with statistical ranking on full text?
• Does relevance feedback improve results?
• Is user control over the operation of relevance feedback helpful?
• How do different levels of user control affect results?
Credit: Marti Hearst
56
How much of the guts should the user see?
• Opaque (black box), like web search engines
• Transparent: see available terms after the r.f.
• Penetrable: see suggested terms before the r.f.
• Which do you think worked best?
Credit: Marti Hearst
57
Credit: Marti Hearst
58
Terms available for relevance feedback made visible (from Koenemann & Belkin)
Credit: Marti Hearst
59
Details on User Study (Koenemann & Belkin 96)
• Subjects have a tutorial session to learn the system
• Their goal is to keep modifying the query until they've developed one that gets high precision
• This is an example of a routing query (as opposed to ad hoc)
• Reweighting:
  • They did not reweight query terms
  • Instead, only term expansion:
    • Pool all terms in relevant docs
    • Take the top n terms, where n = 3 + (number of marked relevant docs × 2)
    • The more marked docs, the more terms added to the query (a sketch follows below)
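A minimal sketch of that expansion rule, under the assumption (not stated on the slide) that pooled terms are ranked by raw frequency across the marked documents:

```python
# Minimal sketch of the term-expansion rule above: pool terms from the
# marked-relevant docs and keep the top n, n = 3 + 2 * (# marked docs).
# Ranking pooled terms by raw frequency is an assumption of this sketch.
from collections import Counter

def expansion_terms(marked_relevant_docs):
    n = 3 + 2 * len(marked_relevant_docs)
    pool = Counter()
    for doc_terms in marked_relevant_docs:  # each doc as a list of terms
        pool.update(doc_terms)
    return [term for term, _ in pool.most_common(n)]

docs = [["tobacco", "ads", "teens", "brand"],
        ["tobacco", "teens", "marketing"]]
print(expansion_terms(docs))  # up to 3 + 2*2 = 7 terms
```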
Credit: Marti Hearst
60
Details on User Study (Koenemann & Belkin 96)
• 64 novice searchers (43 female, 21 male, native English)
• TREC test bed: Wall Street Journal subset
• Two search topics:
  • Automobile Recalls
  • Tobacco Advertising and the Young
• Relevance judgements from TREC and the experimenter
• System was INQUERY (vector space with some bells and whistles)
Credit: Marti Hearst
61
Sample TREC query
Credit: Marti Hearst
62
Evaluation
• Precision at 30 documents
• Baseline (Trial 1):
  • How well does the initial search go?
  • One topic has more relevant docs than the other
• Experimental condition (Trial 2):
  • Subjects get a tutorial on relevance feedback
  • Modify the query in one of four modes: no r.f., opaque, transparent, penetrable
Credit: Marti Hearst
63
Precision vs. RF condition (from Koenemann & Belkin 96)
Credit: Marti Hearst
Can we conclude from this chart that RF is better?
64
Effectiveness Results
• Subjects with r.f. performed 17-34% better than those with no r.f.
• Subjects in the penetrable case did 15% better as a group than those in the opaque and transparent cases.
Credit: Marti Hearst
65
Number of iterations in formulating queries (from Koenemann & Belkin 96)
Credit: Marti Hearst
66
Number of terms in created queries (from Koenemann & Belkin 96)
Credit: Marti Hearst
67
Behavior Results
• Search times approximately equal
• Precision increased in the first few iterations
• The penetrable case required fewer iterations to make a good query than the transparent and opaque cases
• R.F. queries were much longer
  • But fewer terms in the penetrable case: users were more selective about which terms were added in.
Credit: Marti Hearst
68
Evaluation Gotchas
• No statistical test (!)
• Lots of pairwise tests
• Wrong evaluation measure
• Query variability
• Unintentionally biased evaluation
69
Gotchas: Evaluation Measures
• KDD Cup 2002
• Optimize a model parameter: the balance factor
• Area under the ROC curve and BEP have different behaviors
• These two measures intuitively measure the same property.
[Figure: (a) breakeven points and (b) area under the ROC curve for the Yeast Gene Dataset, plotted against B (the balance factor) for three categories: narrow (1.3%), control (1.5%), and broad (2.8%); curves show the average, ± std dev, and a random baseline.]
70
Gotchas: Query variability
• Eichmann et al. claim that, for their approach to CLIR, French is harder than Spanish.
• French average precision: 0.149
• Spanish average precision: 0.173
71
Gotchas: Query variability
• Queries with Spanish > baseline: 14
• Queries with Spanish ≈ baseline: 40
• Queries with Spanish < baseline: 53
• Queries with French > baseline: 20
• Queries with French ≈ baseline: 22
• Queries with French < baseline: 64
72
Gotchas: Biased Evaluation
• Compare two IR algorithms:
  1. Send query, present results
  2. Send query, cluster results, present clusters
• The experiment was simulated (no users)
• Results were clustered into 5 clusters
• Clusters were ranked according to the percentage of relevant documents
• Documents within clusters were ranked according to similarity to the query
73
Sim-Ranked vs. Cluster-Ranked
Does this show the superiority of cluster ranking?
74
Relevance Density of Clusters
75
Summary
• Information visualization: a good visualization is worth a thousand pictures.
• But making information visualization work for text is hard.
• Evaluation measures: F measure, breakeven point, area under the ROC curve
• Evaluating interactive systems is harder than evaluating algorithms.
• Evaluation gotchas: begin with the end in mind
76
Resources
• FOA 4.3
• MIR Ch. 10.8-10.10
• Ellen Voorhees, "Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness," Proceedings of ACM SIGIR 1998.
• Harman, D.K., "Overview of the Third REtrieval Conference (TREC-3)," in Overview of The Third Text REtrieval Conference (TREC-3), Harman, D.K. (Ed.), NIST Special Publication 500-225, 1995, pp. 1-19.
• Jean Carletta, "Assessing agreement on classification tasks: the kappa statistic," Computational Linguistics 22(2):249-254, 1996.
• Marti A. Hearst and Jan O. Pedersen, "Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results," Proceedings of SIGIR-96, 1996.
• http://gim.unmc.edu/dxtests/ROC3.htm
• Pirolli, P. and Card, S.K. (1999), "Information Foraging," Psychological Review 106(4): 643-675.
• Paul Over, "TREC-6 Interactive Track Report," NIST, 1998.
77
Resources
• http://www.acm.org/sigchi/chi96/proceedings/papers/Koenemann/jk1_txt.htm
• http://otal.umd.edu/olive
• Jaime Carbonell and Jade Goldstein, "The use of MMR, diversity-based reranking for reordering documents and producing summaries," Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335-336, August 24-28, 1998, Melbourne, Australia.