
SIMS 202: Information Organization and Retrieval

Prof. Marti Hearst and Prof. Ray Larson, UC Berkeley SIMS

Tues/Thurs 9:30-11:00am, Fall 2000

Last Time

Starting Points for Search
– Lists
– Overviews
  » Categories

Today and Next Time

Starting points (cont)
– Clusters
– Examples as starting points
– Automated Source Selection

UIs for Query Specification
UIs for Putting Results in Context
UIs to support the Search Process

Starting Points for Search

Faced with a prompt or an empty entry form … how to start?
– Lists of sources
– Overviews
  » Clusters
  » Category Hierarchies/Subject Codes
  » Co-citation links
– Examples, Wizards, and Guided Tours
– Automatic source selection

Category Combinations

HiBrowse

Problem:
– Search is not integrated with browsing of categories
– Only see the subset of categories selected (and the corresponding number of documents)

Cat-a-Cone:Multiple Simultaneous Categories

Key Ideas:
– Separate documents from category labels
– Show both simultaneously

Link the two for iterative feedback

Distinguish between:
– Searching for Documents vs.
– Searching for Categories

Cat-a-Cone Interface

Cat-a-Cone

Catacomb: (definition 2b, online Websters) “A complex set of interrelated things”

Makes use of earlier PARC work on 3D+animation:

– Rooms (Henderson and Card 86)
– IV: Cone Tree (Robertson, Card, Mackinlay 93)
– Web Book (Card, Robertson, York 96)

(Diagrams: the Category Hierarchy, the Collection, and Retrieved Documents, connected by browse, search, and query terms)

ConeTree for Category Labels

Browse/explore category hierarchy
– by search on label names
– by growing/shrinking subtrees
– by spinning subtrees

Affordances
– learn meaning via ancestors, siblings
– disambiguate meanings
– all cats simultaneously viewable

Virtual Book for Result Sets

– Categories on Page (Retrieved Document) linked to Categories in Tree

– Flipping through Book Pages causes some Subtrees to Expand and Contract

– Most Subtrees remain unchanged

– Book can be Stored for later Re-Use

Improvements over Standard Category Interfaces

Integrate category selection with viewing of categories

Show all categories + context

Show relationship of retrieved documents to the category structure

Text Clustering

Finds overall similarities among groups of documents

Finds overall similarities among groups of tokens

Picks out some themes, ignores others

S/G Example: query on “star”

Encyclopedia text
– 14 sports
– 8 symbols
– 47 film, tv
– 68 film, tv (p)
– 7 music
– 97 astrophysics
– 67 astronomy (p)
– 12 stellar phenomena
– 10 flora/fauna
– 49 galaxies, stars
– 29 constellations
– 7 miscellaneous

Clustering and re-clustering is entirely automated

Using Clustering in Document Ranking

Cluster entire collection

Find cluster centroid that best matches the query

This has been explored extensively
– it is expensive
– it doesn’t work well
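A minimal sketch of the centroid-matching step described above, assuming bag-of-words vectors and cosine similarity. The function names (`vectorize`, `cosine`, `centroid`, `best_cluster`) and the toy clusters are illustrative assumptions, not taken from any system named in the lecture, and the clusters are given by hand rather than computed.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(docs):
    """Average the term counts of the documents in one cluster."""
    total = Counter()
    for doc in docs:
        total.update(vectorize(doc))
    return Counter({t: c / len(docs) for t, c in total.items()})

def best_cluster(query, clusters):
    """Return the name of the cluster whose centroid best matches the query."""
    q = vectorize(query)
    return max(clusters, key=lambda name: cosine(q, centroid(clusters[name])))

# Hypothetical, hand-made clusters for illustration only.
clusters = {
    "astronomy": ["star galaxy telescope", "star constellation sky"],
    "film": ["movie star actor", "film star award"],
}
print(best_cluster("star telescope", clusters))
```

Only the documents in the winning cluster would then be returned or ranked further. The expense noted on the slide comes from clustering the whole collection beforehand, which this sketch skips.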

Two Queries: Two Clusterings

AUTO, CAR, ELECTRIC
AUTO, CAR, SAFETY

The main differences are the clusters that are central to the query

8 control drive accident …

25 battery california technology …

48 import j. rate honda toyota …

16 export international unit japan

3 service employee automatic …

6 control inventory integrate …

10 investigation washington …

12 study fuel death bag air …

61 sale domestic truck import …

11 japan export defect unite …

Another use of clustering

Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.

“Project” these onto a 2D graphical representation
– Group by doc: SPIRE/Kohonen maps
– Group by words: Galaxy of News/HotSauce/Semio

Clustering Multi-Dimensional Document Space

(image from Wise et al 95)


UWMS Data Mining Workshop

Study of Kohonen Feature Maps

H. Chen, A. Houston, R. Sewell, and B. Schatz, JASIS 49(7)

Comparison: Kohonen Map and Yahoo

Task:
– “Window shop” for interesting home page
– Repeat with other interface

Results:
– Starting with map could repeat in Yahoo (8/11)
– Starting with Yahoo unable to repeat in map (2/14)

Study (cont.)

Participants liked:
– Correspondence of region size to # documents
– Overview (but also wanted zoom)
– Ease of jumping from one topic to another
– Multiple routes to topics
– Use of category and subcategory labels

Study (cont.)

Participants wanted:
– hierarchical organization
– other ordering of concepts (alphabetical)
– integration of browsing and search
– correspondence of color to meaning
– more meaningful labels
– labels at same level of abstraction
– fit more labels in the given space
– combined keyword and category search
– multiple category assignment (sports+entertain)

Visualization of Clusters

– Huge 2D maps may be inappropriate focus for information retrieval
  » Can’t see what documents are about
  » Documents forced into one position in semantic space
  » Space is difficult to use for IR purposes
  » Hard to view titles
– Perhaps more suited for pattern discovery
  » problem: often only one view on the

Summary: Clustering

Advantages:
– Get an overview of main themes
– Domain independent

Disadvantages:
– Many of the ways documents could group together are not shown
– Not always easy to understand what they mean
– Different levels of granularity

Automated Source Selection

Compare the query against summaries of what is contained in the collection
– GLOSS (Tomasic et al. 97)
  » Predict which of several sources is most likely
  » Based on how many instances of each query term occur in the collection
– SavvySearch (Howe & Dreilinger 97, in reader)
  » Predict which of several search engines is likely to produce a good answer to a given query
  » Based on number of pages returned and amount of time users spend on retrieved pages
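The idea of comparing a query against per-source summaries can be sketched as follows. This is a deliberately simplified illustration and not the GLOSS estimator itself. It assumes each source publishes a table of term counts and that a source's score is just the sum of the counts for the query terms. The source names and counts are made up.

```python
def rank_sources(query, summaries):
    """Order sources by how many instances of the query terms their summaries report."""
    terms = query.lower().split()
    scores = {
        source: sum(counts.get(term, 0) for term in terms)
        for source, counts in summaries.items()
    }
    # Highest total count first; the search would be sent to the top source(s).
    return sorted(scores, key=lambda source: scores[source], reverse=True)

# Hypothetical summaries: term -> number of instances in that source.
summaries = {
    "medline": {"osteoporosis": 120, "bone": 300},
    "autonews": {"battery": 85, "car": 400},
}
print(rank_sources("bone osteoporosis", summaries))
```

A summary of this kind is far smaller than the collection it describes, which is what makes it practical to consult before choosing where to send the query.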

Query Specification

Interaction Styles (Shneiderman 97)
– Command Language
– Form Fillin
– Menu Selection
– Direct Manipulation
– Natural Language

Example:
– How does each apply to Boolean queries?

Command-Based Query Specification

command attribute value connector …
– find pa shneiderman and tw user#

What are the attribute names?
What are the command names?
What are allowable values?
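To make the "command attribute value connector" pattern concrete, here is a small parsing sketch. It assumes that the first token is the command, that `and`, `or`, and `not` are the only connectors, and that each clause is an attribute code followed by its value. The function name `parse_command` and the connector set are assumptions for illustration, not the grammar of any particular catalog system.

```python
CONNECTORS = {"and", "or", "not"}  # assumed connector vocabulary

def parse_command(line):
    """Split a command-language query into (command, clauses, connectors)."""
    tokens = line.split()
    command, rest = tokens[0], tokens[1:]
    clauses, connectors, current = [], [], []
    for token in rest:
        if token.lower() in CONNECTORS:
            clauses.append(current)
            connectors.append(token.lower())
            current = []
        else:
            current.append(token)
    clauses.append(current)
    parsed = []
    for clause in clauses:
        if len(clause) < 2:
            raise ValueError("each clause needs an attribute and a value")
        # First token is the attribute code, the remainder is the value.
        parsed.append((clause[0], " ".join(clause[1:])))
    return command, parsed, connectors

print(parse_command("find pa shneiderman and tw user#"))
```

The parser has no way to tell the user what `pa` or `tw` mean or which values are legal, which is exactly the usability problem the questions on the slide point to.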

Form-Based Query Specification (Altavista)

Form-Based Query Specification (Melvyl)

Form-based Query Specification (Infoseek)


Menu-based Query Specification(Young & Shneiderman 93)

Context

Putting Results in Context

Visualizations of Query Term Distribution
– KWIC, TileBars, SeeSoft

Visualizing Shared Subsets of Query Terms
– InfoCrystal, VIBE, Lattice Views

Table of Contents as Context
– Superbook, Cha-Cha, DynaCat

Organizing Results with Tables
– Envision, SenseMaker

Using Hyperlinks
– WebCutter

Putting Results in Context

Interfaces should
– give hints about the roles terms play in the collection
– give hints about what will happen if various terms are combined
– show explicitly why documents are retrieved in response to the query
– summarize compactly the subset of interest

KWIC (Keyword in Context)

An old standard, ignored by internet search engines
– used in some intranet engines, e.g., Cha-Cha
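A keyword-in-context display can be sketched in a few lines. This version assumes whitespace tokenization and a fixed window of words on each side of the hit. The function name `kwic`, the `width` parameter, and the bracket markup are illustrative choices.

```python
def kwic(text, keyword, width=3):
    """Return one context line per occurrence of keyword, with `width` words on each side."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if word.lower().strip(".,;:!?\"'") == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines

sample = "The star cluster is bright. Every movie star wants an award."
for line in kwic(sample, "star", width=2):
    print(line)
```

Each output line shows the query term together with its immediate neighbours, so a reader can tell the astronomy sense of "star" from the film sense without opening the document.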

Display of Retrieval Results

Goal: minimize time/effort for deciding which documents to examine in detail

Idea: show the roles of the query terms in the retrieved documents, making use of document structure

TileBars

Graphical Representation of Term Distribution and Overlap

Simultaneously Indicate:
– relative document length
– query term frequencies
– query term distributions
– query term overlap
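The four properties listed above can all be read off one small grid: one row per query term set, one column per text segment, and a darker cell for a higher count. The sketch below builds and prints such a grid. It assumes fixed-size word windows as segments, which is a simplification, and the names `tilebar` and `render` and the character shading are illustrative.

```python
def tilebar(doc_words, facets, segment_size=5):
    """Count hits of each facet (a set of terms) in each fixed-size segment of the document."""
    segments = [doc_words[i:i + segment_size]
                for i in range(0, len(doc_words), segment_size)]
    return [[sum(1 for word in segment if word in facet) for segment in segments]
            for facet in facets]

def render(grid, shades=" .:#"):
    """Map counts to characters of increasing darkness, one string per facet row."""
    return ["".join(shades[min(count, len(shades) - 1)] for count in row)
            for row in grid]

doc = ("dbms index dbms query log reliability crash recovery log dbms "
       "bank loan rate fee branch").split()
facets = [{"dbms"}, {"reliability", "recovery"}]
grid = tilebar(doc, facets)
for row in render(grid):
    print("|" + row + "|")
```

The number of columns reflects document length, cell darkness reflects term frequency, the pattern along a row shows distribution, and columns where both rows are dark show where the topics overlap.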

Query terms:

What roles do they play in retrieved documents?

DBMS (Database Systems)

Reliability

Mainly about both DBMS & reliability

Mainly about DBMS, discusses reliability

Mainly about, say, banking, with a subtopic discussion on DBMS/Reliability

Mainly about high-tech layoffs

Example

Exploiting Visual Properties

– Variation in gray scale saturation imposes a universal, perceptual order (Bertin et al. ‘83)

– Varying shades of gray show varying quantities better than color (Tufte ‘83)

– Differences in shading should align with the values being presented (Kosslyn et al. ‘83)

Key Aspect: Faceted Queries

Conjunct of disjuncts

Each disjunct is a concept
– osteoporosis, bone loss
– prevention, cure
– research, Mayo clinic, study

User does not have to specify which are main topics, which are subtopics

Ranking algorithm gives higher weight to overlap of topics
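One way to picture a ranking that favors topic overlap is sketched below. The scoring rule is an assumption chosen for illustration, since the slides do not give a formula: documents are ordered first by how many facets they match at all, then by total hits. Single-word terms are used so that plain word matching suffices.

```python
def score(doc_words, facets):
    """Return (number of facets matched, total hits); tuples compare facet overlap first."""
    hits = [sum(1 for word in doc_words if word in facet) for facet in facets]
    return (sum(1 for h in hits if h > 0), sum(hits))

def rank(docs, facets):
    """Order document ids so that documents covering more facets come first."""
    return sorted(docs, key=lambda d: score(docs[d].lower().split(), facets),
                  reverse=True)

# Each facet is a disjunct (any term may match); the query is their conjunction.
facets = [{"osteoporosis"}, {"prevention", "cure"}, {"research", "study"}]
docs = {
    "a": "osteoporosis prevention study",
    "b": "osteoporosis osteoporosis osteoporosis osteoporosis",
}
print(rank(docs, facets))
```

Document "b" mentions one concept more often, but "a" touches all three concepts, so it ranks first under this rule.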

Main Topic Context

Potential Problem with TileBars

Given retrieved documents in which no query terms are well-distributed, the user does not know the context in which the query terms are used

Solution: Accompany with main topic display

TileBars Summary

Compact, graphical representation of term distribution for full text retrieval results
– simultaneously display term frequency, distribution, overlap, and doc length
– allow for simple user-determined ordering strategies

Part of a larger effort: user-centric, content-sensitive information access

TileBars Summary

Preliminary User Studies
– users understand them
– find them helpful in some situations
– sometimes terms need to be disambiguated

SeeSoft: Showing Text Content using a linear representation and brushing and linking (Eick & Wills 95)

Query Term Subsets

Show which subsets of query terms occur in which subsets of retrieved documents
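The grouping behind such displays can be sketched as a single pass over the result set: for each retrieved document, record exactly which query terms it contains and bucket documents by that subset. The function name `group_by_term_subset` and the sample documents are illustrative assumptions.

```python
from collections import defaultdict

def group_by_term_subset(query_terms, docs):
    """Map each subset of query terms to the ids of documents containing exactly that subset."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        present = frozenset(term for term in query_terms if term in words)
        groups[present].append(doc_id)
    return dict(groups)

docs = {
    "d1": "dbms reliability report",
    "d2": "dbms tuning guide",
    "d3": "bank layoffs announced",
}
for subset, ids in group_by_term_subset(["dbms", "reliability"], docs).items():
    print(sorted(subset), ids)
```

With n query terms there are up to 2^n such buckets, which is one reason displays of this kind become hard to read beyond a handful of terms.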

Other Approaches

Show how often each query term occurs in retrieved documents
– VIBE (Korfhage ‘91)
– InfoCrystal (Spoerri ‘94)
– Problems:
  » can’t see overlap of terms within docs
  » quantities not represented graphically
  » more than 4 terms hard to handle
  » no help in selecting terms to begin

InfoCrystal (Spoerri 94)

VIBE (Olson et al. 93, Korfhage 93)

Superbook (Remde et al. 87)

Next-generation hyper-media book

Functions:
– Word Lookup:
  » Show a list of query words, stems, and word combinations
– Table of Contents: Dynamic fisheye view of the hierarchical topics list
  » Search words can be highlighted here too
– Page of Text: show selected page and highlighted search terms

Hypertext features linking through search words rather than page links

Superbook (http://superbook.bellcore.com/SB)

DynaCat (Pratt 97)

Decide on important question types in advance
– What are the adverse effects of drug D?
– What is the prognosis for treatment T?

Make use of MeSH categories

Retain only those types of categories known to be useful for this type of query.

DynaCat (Pratt, Hearst, & Fagan 99)

DynaCat Study

Design
– Three queries
– 24 cancer patients
– Compared three interfaces
  » ranked list, clusters, categories

Results
– Participants strongly preferred categories
– Participants found more answers using categories
– Participants took same amount of time with all three interfaces

Cha-Cha (Chen & Hearst 98)

Shows “table-of-contents”-like view, like Superbook

Takes advantage of human-created structure within hyperlinks to create the TOC

Supporting the Process

Interfaces to support the process of information seeking
– Standard Model
  » Infogrid
  » Superbook
– Berry Picking Model
  » SketchTrieve
  » DLITE
– Retaining Search History

How to Present the Search Process?

What sequence of operations is allowed?

Which GUI layout style is used?
– One window
– Overlapping windows
– Tiled windows
– Monolithic layout
  » One big window containing specialized internal windows that always occupy the same position and function

Slide by Shankar Raman

A general search interface architecture
– Itemstash -- retrieved docs
– Search Event -- current query
– History -- history of queries
– Result Item -- view selected docs + metadata

InfoGrid/Protofoil (Rao et al. 92)

Infogrid (design mockup) (Rao et al. 92)

Infogrid Design Mockups (Rao et al. 92)

Protofoil (Rao et al. 94)

Monolithic Layouts

Protofoil Layout (Hypothetical) Superbook Layout

Experimented with many variations of the layout and interaction sequence.
– Several studies have shown that too many different options are worse than an interface that is too restrictive.

Considered different screen sizes
– Monolithic layout favored, however ...
– Sequence of interactions is what matters
– Smaller screen can force designers to consider the interaction sequence carefully

SuperBook (Egan et al. 89)

Supporting the Information Seeking Process

Two recent similar approaches that focus on supporting the process
– SketchTrieve (Hendry & Harper 97)
– DLITE (Cousins 97)

Informal Interface

Informal does not necessarily mean less useful

Show how the search is
– unfolding or evolving
– expanding or contracting

Prompt the user to
– reformulate and abandon plans
– backtrack to points of task deferral
– make side-by-side comparisons
– define and discuss problems

DLITE

UI to a digital library

Direct manipulation interface to a distributed info. system
– must show network, remote server status

Workcenter approach
– lots of handy tools for one task
– experts create workcenters
– contents persistent
– concurrently shareable across sites

Web browser used to display document or collection metadata

DLITE (Cousins 97)

Drag and Drop interface
Reify queries, sources, retrieval results
Animation to keep track of activity

Components/tools in DLITE

Documents (search results, or local documents)
Collections of components (e.g. result sets)
Queries -- translator used to apply same query to many sources
Services -- search services, summarization, OCR, translation …
People (for access control, payment …)

Interaction

Pointing at object brings up tooltip -- metadata

Activating object -- component specific action
– 5 types for result set component

Drag-and-drop data onto program

Animation used to show what happens with drag-and-drop (e.g. “waggling”)

Comments

Users seem to have lots of problems with flexibility (result set icon activation)

Workcenter -- customization, acts as reminder

Animation used to track progress, (partial) results

Keeping Track of History

Examples
– List of prior queries and results (standard)
– Graphical hierarchy for web browsing
– “Slide sorter” view, snapshots of earlier interactions

PadPrints (Hightower et al. 98)

Tree-based history of recently visited web pages

History map placed to left of browser window

Zoomable, can shrink sub-hierarchies

Node = title + thumbnail
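A tree-based history of this kind can be sketched with a parent map and a child list per page. The class below is an illustrative assumption and not PadPrints code. It identifies nodes by title only and omits thumbnails and zooming.

```python
class HistoryTree:
    """Browsing history kept as a tree: a newly visited page becomes a child of the current page."""

    def __init__(self, root_title):
        self.root = root_title
        self.parent = {root_title: None}
        self.children = {root_title: []}
        self.current = root_title

    def visit(self, title):
        """Follow a link; a page seen before keeps its original place in the tree."""
        if title not in self.parent:
            self.parent[title] = self.current
            self.children[title] = []
            self.children[self.current].append(title)
        self.current = title

    def back(self):
        """Return to the parent page without discarding the branch just left."""
        if self.parent[self.current] is not None:
            self.current = self.parent[self.current]

    def outline(self, node=None, depth=0):
        """Indented text rendering of the whole tree, one line per page."""
        node = self.root if node is None else node
        lines = ["  " * depth + node]
        for child in self.children[node]:
            lines.extend(self.outline(child, depth + 1))
        return lines

history = HistoryTree("Home")
history.visit("A")
history.visit("B")
history.back()
history.back()
history.visit("C")
print("\n".join(history.outline()))
```

After backing out of A and B and visiting C, both branches are still present in the outline, whereas a linear back stack would have dropped A and B.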

PadPrints (Hightower et al. 98)

13.4% unable to find recently visited pages

only 0.1% use History button, 42% use Back

problems with history list (according to authors)
– incomplete, lose out on every branch
– textual (not necessarily a problem!)
– pull down menu cumbersome -- cannot see history along with current document

Initial User Study of PadPrints

Second User Study of Padprints

Changed the task to involve revisiting web pages
– CHI database, National Park Service website

Only correctly answered questions considered
– 20-30% fewer pages accessed
– faster response time for tasks that involve revisiting pages
– slightly better user satisfaction ratings

Summary: UIs for Information Access

The part of the system that the user sees and interacts with

Better interfaces in future should produce better search experiences

UIs for search should
– Help users keep track of what they have done
– Suggest next choices
– Support the process of search

It is very difficult to design good UIs

It is very difficult to evaluate search UIs
