Gheorghe Muresan
SCILS, Rutgers University
Document Clustering for
Mediated Information Access
– The WebCluster Project –
Gheorghe Muresan
School of Communication, Information and Library Sciences, Rutgers University
The original WebCluster project was conducted at the Robert Gordon University, Aberdeen, UK. It was supervised by Prof. David J. Harper and sponsored by Ubilab, Zurich.
Current work is being conducted in collaboration with Ph.D. student Hyuk-Jin Lee and Prof. Nicholas J. Belkin.
Exploratory Search Interfaces: Categorization, Clustering and Beyond
Workshop at HCIL 2005, University of Maryland, June 2, 2004
WebCluster - Motivation
[Diagram labels: Information need (within some subject domain); Query; Search engine; Domain; WWW search engine]
• Gulfs
  – information need ↔ query
  – structured subject domain ↔ unstructured target collection (WWW)
Interaction in the library
[Diagram labels: Information need; Information Need Formulation]
1. Select library
2. Consult catalog
3. Browse shelves
4. Use inter-library scheme
Can we simulate the library interaction?
[Diagram labels: Information need; Information Need Formulation; Structured source collections; Results]
1. Select source collection
2. Explore source collection with ClusterBook
3. Search WWW
The mediated access interaction
[Diagram labels: Information need; WebCluster; Specialised source; Query; Web search engine; Target collection (WWW); Topical documents]
Interaction model vs. prototype
• Structuring the source collection
  ◦ Document clustering
  ◦ Supervised classification
  ◦ Manual (intellectual) classification
• Exploring the structured source collection
  ◦ Metaphor – Library, book, encyclopaedia
  ◦ Visualization tool – Folder metaphor, hyperbolic tree, themescape, cone trees, thematic maps
  ◦ Search strategies supported – Best match or cluster-based searching, browsing
Model vs. prototype
• Interaction model
  ◦ Explicit (the user marks relevant documents) vs. implicit (cues on relevance are derived from user behavior/actions)
  ◦ Transparent (the user is aware) vs. opaque (the user is happy to see the effect of ‘magic’)
  ◦ Automatic vs. manual/intellectual generation of the mediated query
• Query model
  ◦ Language models (generative, Kullback-Leibler)
  ◦ Probabilistic models
  ◦ Rocchio or other RF-specific formulae
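As an illustration of the last option, the classic Rocchio formula moves the query towards documents marked relevant and away from those marked non-relevant. The sketch below is a minimal Python version, assuming queries and documents are term-to-weight dictionaries. The function name and the default values of alpha, beta and gamma are illustrative choices, not settings reported for WebCluster.

```python
from collections import Counter


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted query (dict: term -> weight).

    query: dict term -> weight
    relevant, non_relevant: lists of dicts term -> weight
    alpha, beta, gamma: illustrative defaults (assumption, not from the slides)
    """
    new_query = Counter()
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Add the mean relevant vector, subtract the mean non-relevant vector.
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)
    # Terms whose weight ends up zero or negative are dropped.
    return {term: w for term, w in new_query.items() if w > 0}


expanded = rocchio(
    {"coffee": 1.0},
    relevant=[{"coffee": 1.0, "quota": 1.0}],
    non_relevant=[{"tea": 1.0}],
)
print(expanded)
```

In this toy call the term "quota" enters the query from the relevant document, while "tea" receives a negative weight and is dropped.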
ClusterBook - Source collection
ClusterBook - Target collection
Informal experiments
- Objectives -
• Test the users’ reaction to the mediated access concept
• Test user satisfaction with the functionality of the system and with the relevance of the documents retrieved
• Formative usability testing: some volunteers were not only experienced searchers, but also had experience in evaluating IR systems
• Comparison of user-generated queries vs. system-generated queries
• Note: these experiments were run at different stages of the development
Informal experiments
- Experimental procedure -
• Subjects received an introduction to the system
• Task assigned: “You are a trainee in a newspaper. You support the journalists by providing information for the topic of their articles.”
• Sample topics:
  ◦ The history of the Brazilian debt crisis
  ◦ How are the quotas for growing coffee set and controlled on a world-wide basis?
• Source collection: a sub-collection of Reuters (newspaper articles)
• Steps followed by users (explicit scenario):
  ◦ Formulate a query and record it
  ◦ Browse the source collection, select the ‘best’ cluster, edit the query generated by the system, submit it to the search engine
  ◦ Submit the initial, self-generated query to the same search engine
  ◦ Compare the results of the two searches
Informal experiments
- Results -
• Users found the mediation useful for unfamiliar topics
• The system nearly always proposed new, good query terms
• Users were not always good at recognizing ‘good’ query terms
• The system also proposed bad query terms (not specific to the topic)
  ⇒ the opaque scenario is not viable unless the query formulation is improved
• The two-step process was questioned when:
  ◦ the query formulation was considered easy, for a familiar topic
  ◦ the documents of the source collection were considered sufficient to cover the information need
• Clustering methods: complete link, group average – OK; single link – bad
• Overall, the system is usable
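The three clustering methods compared above differ only in how the similarity between two clusters is derived from pairwise document similarities. A minimal sketch, assuming a caller-supplied similarity function `sim`; the function names and the toy one-dimensional "documents" are hypothetical, chosen only to keep the arithmetic easy to follow. Some definitions of group average also include within-cluster pairs; this sketch averages cross-cluster pairs only.

```python
def cluster_similarity(cluster_a, cluster_b, sim, method):
    """Similarity between two clusters under a given linkage criterion."""
    pair_sims = [sim(x, y) for x in cluster_a for y in cluster_b]
    if method == "single":      # most similar cross-cluster pair
        return max(pair_sims)
    if method == "complete":    # least similar cross-cluster pair
        return min(pair_sims)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pair_sims) / len(pair_sims)
    raise ValueError(f"unknown method: {method}")


def toy_sim(x, y):
    """Toy similarity on numbers: 1 for identical values, falling towards 0."""
    return 1.0 / (1.0 + abs(x - y))


cluster_a, cluster_b = [0, 1], [2, 5]
for method in ("single", "complete", "average"):
    print(method, round(cluster_similarity(cluster_a, cluster_b, toy_sim, method), 3))
```

Single link needs only one close pair to merge two clusters, which is the source of its well-known tendency to produce long, chained clusters.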
Consequences of informal experiments
• Formal experiments are needed to verify the main assumptions:
  ◦ The Cluster Hypothesis holds for a specialized collection
  ◦ Good clusters can be found with the search strategies provided
  ◦ Mediated queries can improve retrieval effectiveness
• The effect of various parameters on retrieval performance should be compared:
  ◦ Weighting schemes
  ◦ Clustering methods
  ◦ Search strategies
Critical issue: The label generation
[Figure: example cluster hierarchy with nodes Wind Energy; Power Generation; Propulsion; Fixed Plants; Portable Generators; Coastal Wind Farms; Inland Wind Farms; Pacific Rim Wind Farms; Design of Coastal Wind Farms; Design of …; Desert Wind Farms; Wind generators for yachts]
• Document representatives
  ◦ searching
• Cluster representatives
  ◦ browsing
  ◦ searching
  ◦ mediation
• Collection representatives
  ◦ collection selection
Mediation experiment - simulations
• Objectives:
  ◦ Test the potential of mediation to increase retrieval effectiveness
  ◦ Test the effect on performance of a variety of parameters
[Diagram labels: Simple query generator (baseline); Topic-based mediator (upper bound); Cluster-based mediation (realistic mediation); Source collection; Target collection; Search engine]
Experimental setup
• Interactive track of TREC-8
  ◦ Offers relevance judgments for complex topics, with a multitude of aspects
  ◦ Offers the experimental design for the user experiment
  ◦ Six topics with 12 to 56 aspects each
  ◦ Target collection: FT 1991-4, with 210,158 articles
  ◦ Source collection built from the relevance judgments: half of the relevant documents, their nearest neighbors, plus the documents judged non-relevant
Results – the cluster hypothesis
• Aspectual cluster hypothesis confirmed by an extended version of the van Rijsbergen – Sparck Jones separation test
  ◦ Similarity between pairs of docs covering the same aspect is higher than between pairs of docs covering the same topic, which in turn is higher than between arbitrary pairs of docs in the collection
• Consequence confirmed: clustering groups documents in pockets of relevance
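The ordering claimed above can be checked, in a simplified form, by comparing mean pairwise similarities for the three groups of pairs. The sketch below is an illustration under stated assumptions, not the test used in the experiments: it compares only means, uses Jaccard overlap of term sets as a stand-in similarity, and the document records with `terms`, `topic` and `aspects` fields are hypothetical.

```python
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Overlap of two term sets (a stand-in for the real similarity)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def separation_means(docs, sim=jaccard):
    """Mean similarity of same-aspect, same-topic and all document pairs."""
    same_aspect, same_topic, all_pairs = [], [], []
    for d1, d2 in combinations(docs, 2):
        s = sim(d1["terms"], d2["terms"])
        all_pairs.append(s)
        if d1["topic"] is not None and d1["topic"] == d2["topic"]:
            same_topic.append(s)
            if d1["aspects"] & d2["aspects"]:
                same_aspect.append(s)
    # statistics.mean raises StatisticsError if a group has no pairs.
    return mean(same_aspect), mean(same_topic), mean(all_pairs)


# Toy data: three documents on one topic (two sharing an aspect), one off-topic.
docs = [
    {"terms": {"wind", "coast"}, "topic": "wind", "aspects": {"coastal"}},
    {"terms": {"wind", "coast", "design"}, "topic": "wind", "aspects": {"coastal"}},
    {"terms": {"wind", "desert"}, "topic": "wind", "aspects": {"inland"}},
    {"terms": {"coffee", "quota"}, "topic": None, "aspects": set()},
]
aspect_mean, topic_mean, collection_mean = separation_means(docs)
print(aspect_mean > topic_mean > collection_mean)
```

A fuller version would compare the three similarity distributions rather than just their means.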
Results – retrieval effectiveness
• Tf-Idf > KL > RelFreq as weighting schemes for document representation
• Adding disambiguation terms to the query increases recall, but decreases precision
• Nearest-neighbor mediation (“more like this”) highly significantly improves both recall and precision, even if just one exemplary document is offered for each topic aspect
• Cosine and Dice perform similarly
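For reference, the two coefficients can be computed as follows on sparse term-weight vectors. This is a sketch under assumptions: the weighted form of Dice shown here is one common generalization of the set-based coefficient, and the slides do not say which variant was used.

```python
from math import sqrt


def dot(a, b):
    """Inner product of two sparse vectors (dict: term -> weight)."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def cosine(a, b):
    """Cosine coefficient: dot product divided by the product of the norms."""
    norm_product = sqrt(dot(a, a)) * sqrt(dot(b, b))
    return dot(a, b) / norm_product if norm_product else 0.0


def dice(a, b):
    """Weighted Dice: twice the dot product over the sum of squared norms."""
    denom = dot(a, a) + dot(b, b)
    return 2.0 * dot(a, b) / denom if denom else 0.0


doc_a = {"wind": 1.0, "farm": 1.0}
doc_b = {"wind": 1.0, "yacht": 1.0}   # same norm as doc_a
doc_c = {"wind": 2.0}                 # different norm

print(cosine(doc_a, doc_b), dice(doc_a, doc_b))
print(cosine(doc_a, doc_c), dice(doc_a, doc_c))
```

When both vectors have the same norm, the two formulas give identical values, so on length-normalized vectors the two measures coincide. That is consistent with the two performing similarly, although the slides do not state whether the vectors were normalized.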
Mediation results
• Upper-bound experiment (all relevant docs known in source)
  ◦ Both recall and precision increase with query length
  ◦ Query term weights strongly affect performance
  ◦ No evidence that uniformity of term frequency affects performance
• Clustered source mediation
  ◦ Best cluster mediation increases precision (P), decreases recall (R)
  ◦ “Fuse and search” – strong increase in R and P
  ◦ “Search and fuse” – good R, terrible P!
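The slides do not define the two fusion strategies, so the following sketch rests on an assumed reading: "fuse and search" merges the representatives of several selected clusters into one query and runs a single search, while "search and fuse" runs one search per cluster representative and merges the result lists. The toy scoring function, the function names and the data are all hypothetical.

```python
from collections import Counter


def search(query, collection, k=3):
    """Toy ranked search: a document scores the summed weight of matching terms."""
    scores = {
        doc_id: sum(w for term, w in query.items() if term in terms)
        for doc_id, terms in collection.items()
    }
    matching = [doc_id for doc_id in scores if scores[doc_id] > 0]
    # Sort by descending score, breaking ties by document id.
    return sorted(matching, key=lambda d: (-scores[d], d))[:k]


def fuse_and_search(cluster_queries, collection, k=3):
    """Sum the cluster queries into one query, then search once."""
    fused = Counter()
    for query in cluster_queries:
        fused.update(query)   # adds the weights term by term
    return search(fused, collection, k)


def search_and_fuse(cluster_queries, collection, k=3):
    """Search once per cluster query, then merge the lists without duplicates."""
    merged = []
    for query in cluster_queries:
        for doc_id in search(query, collection, k):
            if doc_id not in merged:
                merged.append(doc_id)
    return merged


collection = {
    "d1": {"wind", "farm", "coastal"},
    "d2": {"wind", "farm", "desert"},
    "d3": {"wind", "yacht"},
    "d4": {"coffee"},
}
cluster_queries = [{"wind": 1.0, "coastal": 2.0}, {"wind": 1.0, "desert": 2.0}]
print(fuse_and_search(cluster_queries, collection, k=2))
print(search_and_fuse(cluster_queries, collection, k=2))
```

Under this reading, the merged list in `search_and_fuse` can hold up to `k` documents per cluster query, so its length grows with the number of clusters searched.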
User experiment – effectiveness of mediated information retrieval for Web searches

Result presentation   | Query formulation (between subjects)
(within subjects)     | Unaided                | Mediated
----------------------+------------------------+---------------------------------
Linear (list)         | Baseline               | Source-based mediation
Structured (cluster)  | On-the-fly clustering  | Source & target-based mediation
User experiment – no mediation
User experiment – mediated access
User experiment – mediated access
Contributions of WebCluster
• Proposes and explores system-based mediated access to very large heterogeneous document collections
• Explores the use of clustering for capturing the topical, semantic structure of a problem domain (as represented by a specialized collection)
• Explores the use of language models for building cluster and document representatives
• Offers a framework for building structured portals on the WWW
• Offers a framework for building collaborative environments
WebCluster - Other applications
• CD-ROM based collections
  ◦ structured access to the collection itself
  ◦ mediated access to WWW (via CD-ROM)
• Mediated access (portals) via hierarchically structured information sources
  Examples: a large structured report (e.g. government reports), a structured collection of information (e.g. encyclopaedia), an intranet
• Multimedia information access
  ◦ cluster multimedia source, e.g. annotated photographs
  ◦ mediated access to other photographs (not annotated)
Other directions for WebCluster
• Clustered vs. categorized source collection
• Language model-based labels vs. specialized terminology based on a domain ontology / thesaurus
• Different interaction and visualization metaphors
  ◦ Spring-embedded algorithms for 2D representation of clusters
• Various inter-document similarities (faceted?)
• User profiles / personalization
  ◦ Change of interest over time
Topic representation
• What (weighted) terms best describe a topic?
  ◦ Applications:
    – Clustering – generating cluster representatives
    – Mediation – generating mediated queries
  ◦ Machine generation
    – Simulation based on test topics and relevance judgments
    – Use various weighting formulae and cut-off points
    – Which representations are more effective? What do they have in common, and what is specific to each?
  ◦ Human (manual / intellectual) generation
    – Compare queries generated by searchers in TREC tasks
      · Effectiveness
      · Keyword vs. natural language queries
Questions ?
Query formulation problems
• Vague information need
• Vocabulary mismatch
• Difficulty of query language syntax
• Lack of context, ambiguity of terms
• Lack of a search strategy
• No understanding of the underlying indexing/searching model
• Note: TREC experiments have shown that the quality of the query has a higher impact on retrieval effectiveness than weighting schemes or search algorithms.
Role of structure
[Figure: example subject hierarchy with nodes Science; Computing; Mathematics; Physics; Computer; Programming language; Screen; Keyboard; C++; Pascal; Algebra]
• Reveals the semantic structure of the domain & its concepts
• Groups (semantically?) similar documents
• Supports exploration and concept formation
• Supports term disambiguation (context)
• (Has potential for efficient retrieval)
• (Has potential for effective retrieval)
Browsing label
(relative cluster representative)
[Figure: example cluster hierarchy with nodes Wind Energy; Power Generation; Propulsion; Fixed Plants; Portable Generators; Coastal Wind Farms; Inland Wind Farms; Pacific Rim Wind Farms; Design of Coastal Wind Farms; Design of …; Desert Wind Farms; Wind generators for yachts]
$$R_i = KL_i(\mathrm{cluster}, \mathrm{parent}) = p_{i,\mathrm{cluster}} \cdot \log \frac{p_{i,\mathrm{cluster}}}{p_{i,\mathrm{parent}}}$$
Searching label
(absolute cluster representative)
[Figure: example cluster hierarchy with nodes Wind Energy; Power Generation; Propulsion; Fixed Plants; Portable Generators; Coastal Wind Farms; Inland Wind Farms; Pacific Rim Wind Farms; Design of Coastal Wind Farms; Design of …; Desert Wind Farms; Wind generators for yachts]
$$A_i = KL_i(\mathrm{cluster}, \mathrm{collection}) = p_{i,\mathrm{cluster}} \cdot \log \frac{p_{i,\mathrm{cluster}}}{p_{i,\mathrm{collection}}}$$
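The searching label weights each term by its contribution to the Kullback-Leibler divergence between the cluster and the whole collection; the browsing label does the same against the parent cluster. A minimal sketch of that weighting, assuming term probabilities are estimated as plain relative frequencies (the slides do not say how they are estimated or smoothed) and that every cluster term also occurs in the reference counts. The function names and the counts are hypothetical.

```python
from math import log


def kl_term_weights(cluster_counts, reference_counts):
    """Per-term KL contribution p_cluster * log(p_cluster / p_reference).

    Pass collection counts as the reference for the absolute (searching)
    label, or parent-cluster counts for the relative (browsing) label.
    """
    cluster_total = sum(cluster_counts.values())
    reference_total = sum(reference_counts.values())
    weights = {}
    for term, count in cluster_counts.items():
        p_cluster = count / cluster_total
        # Assumption: the cluster is part of the reference, so the term exists there.
        p_reference = reference_counts[term] / reference_total
        weights[term] = p_cluster * log(p_cluster / p_reference)
    return weights


def top_terms(weights, k=3):
    """The k highest-weighted terms, usable as a cluster label."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


cluster_counts = {"wind": 4, "coastal": 4, "the": 8}
collection_counts = {"wind": 10, "coastal": 4, "the": 50, "coffee": 36}
absolute = kl_term_weights(cluster_counts, collection_counts)
print(top_terms(absolute, k=2))
```

In this toy example the term "the" is as frequent in the cluster as in the collection, so its weight is zero and it does not make it into the label.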
Mediation label
(Expanded cluster representative)
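[Figure: example cluster hierarchy with nodes Wind Energy; Power Generation; Propulsion; Fixed Plants; Portable Generators; Coastal Wind Farms; Inland Wind Farms; Pacific Rim Wind Farms; Design of Coastal Wind Farms; Design of …; Desert Wind Farms; Wind generators for yachts]
$$E_i = (1-\omega) \cdot A_{i,0} + \omega \cdot (1-\omega) \cdot A_{i,1} + \omega^2 \cdot (1-\omega) \cdot A_{i,2} + \dots + \omega^{r-1} \cdot (1-\omega) \cdot A_{i,r-1} + \omega^r \cdot A_{i,r}$$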
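The expanded representative is a geometrically weighted sum of a term's absolute weights at levels 0 to r, with coefficients that add up to 1. A minimal sketch of that sum follows. The reading of the levels as the path between a cluster and the root of the hierarchy is an assumption, since the slide does not say which end is level 0; the function name and the sample weights are hypothetical.

```python
def expanded_weight(level_weights, omega):
    """Combine [A_0, A_1, ..., A_r] for one term into the expanded weight E.

    Levels 0..r-1 get coefficient omega**k * (1 - omega); level r gets omega**r,
    so the coefficients always sum to 1.
    """
    r = len(level_weights) - 1
    total = 0.0
    for k, weight in enumerate(level_weights):
        coeff = omega ** k * (1 - omega) if k < r else omega ** r
        total += coeff * weight
    return total


# Hypothetical absolute weights of one term at three levels of the hierarchy.
print(expanded_weight([0.8, 0.4, 0.0], omega=0.5))
```

With omega = 0 only level 0 counts; as omega approaches 1 the weight shifts towards level r.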
Topic model representations
[Diagram labels: Exemplary representation; Statistical representation; Statistical analysis; Language model; Context analysis; Typical terms, weighted; Thresholding; Keyword representation; Mediated query]
The cluster hypothesis
• Reminder: the original cluster hypothesis
  ◦ “Closely associated documents tend to be relevant to the same requests” (van Rijsbergen)
• Aspectual cluster hypothesis: Highly similar documents tend to be relevant to the same topic. However, documents relevant to the same topic may be quite dissimilar if they cover distinct aspects of the topic.
  ◦ Consequence: Clustering algorithms tend to group together documents that cover highly focused topics, or aspects of complex topics. Documents covering distinct aspects of complex topics tend to be spread over the cluster structure.
Aspects of relevance in the mediated access process
Distribution of relevant documents in clusters
[Chart: Clustering vs. Random; x-axis: Clusters; y-axis: Recall (0% to 25%); series: Clustering, Random]
WebCluster scenario #1
[Diagram labels: WebCluster; Web Search Engine; WWW; source clusters c0, c1, c2, c3, c4, c5; target clusters c’0, c’2, c’3, c’5; legend: document from the source collection, document from the target collection (WWW)]
• Name
  ◦ Transparent mediated access
• Targeted users
  ◦ Experienced searchers
• Specific
  ◦ The users are aware of the mediation process and of the separation between the source and target collections
  ◦ The users have the option to edit the query generated (proposed) by the system. They understand the indexing / searching model.
WebCluster scenario #2
[Diagram labels: WebCluster; Web Search Engine; WWW; source clusters c0, c1, c2, c3, c4, c5; target clusters c’0, c’2, c’3, c’5; legend: document from the source collection, document from the target collection (WWW)]
• Name
  ◦ Opaque mediated access
• Targeted users
  ◦ Naive / beginner searchers
• Specific
  ◦ The users explore the structure of the domain, which contains sample documents, and have the option of asking for similar documents
  ◦ The users are unaware of the mediation process: the query generation and target search are not visible
Initial user interface (Java AWT)