Marti Hearst, SIMS 247 Lecture 20: Visualizing Text & Text Collections (cont.), April 2, 1998

Post on 21-Dec-2015


Page 1: SIMS 247 Lecture 20: Visualizing Text & Text Collections (cont.)

Marti Hearst, SIMS 247

April 2, 1998

Page 2: Today

• Visualizing Collection Overviews (cont.)
• Visualizing Query Specifications
  – Selecting Term Subsets
  – Viewing Metadata
• Visualizing Retrieval Results
  – Show Hyperlink Structure (WebCutter)
  – Term Hit Distribution (TileBars, SeeSoft)
  – Group by Shared Metadata (Cat-a-Cone)

Page 3: Showing Collection Overviews

• From last time:
  – Show documents as icons
  – Link together or place near one another according to:
    • inter-document similarity
    • hyperlink structure
    • citation structure
  – Advantages
    • can see large grouping patterns
  – Disadvantages
    • what do the groups mean?
    • documents usually belong in multiple groups
    • groups are often somewhat arbitrary

Page 4: Mapping Documents onto Landscapes (Chalmers 96)

Page 5: Visualizing Query Term Specification

• Query term intersection
  – VIBE
  – InfoCrystal
• Incremental Term Specification
  – Lyberworld
  – WSJ online interface

Page 6: Visualizing Query Term Intersection

• VIBE
  – establish points of interest (POI) on a 2D plane
  – these correspond to terms or concepts
  – position documents according to their intersections among POI
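One way to read the positioning rule, as a minimal sketch and not the VIBE system's actual algorithm: place each document at the score-weighted average of the POI coordinates. The function name, the POI coordinates, and the scores below are illustrative assumptions.

```python
def place_document(poi_positions, scores):
    """Return (x, y) as the score-weighted average of POI positions.

    poi_positions: dict mapping POI name -> (x, y) on the 2D plane.
    scores: dict mapping POI name -> non-negative score (an assumption;
            the slide does not say how document-to-POI strength is measured).
    Returns None when the document matches no POI.
    """
    total = sum(scores.get(name, 0.0) for name in poi_positions)
    if total == 0:
        return None
    x = sum(scores.get(name, 0.0) * pos[0] for name, pos in poi_positions.items()) / total
    y = sum(scores.get(name, 0.0) * pos[1] for name, pos in poi_positions.items()) / total
    return (x, y)


# Hypothetical layout: three POIs at the corners of a right triangle.
pois = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 4.0)}
only_a = place_document(pois, {"A": 2.0})             # sits on A
a_and_b = place_document(pois, {"A": 1.0, "B": 1.0})  # halfway between A and B
mostly_c = place_document(pois, {"A": 1.0, "C": 3.0})  # pulled toward C
```

Under this reading, documents that match the same POIs in the same proportions land on the same spot, which is what makes the groupings visible.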

Page 7: VIBE (Olsen et al. 93)

[Figure: VIBE display with points of interest A, B, C, and D; numeric labels 9, 22, 3, 3]

Page 8: Visualizing Query Term Intersection

• InfoCrystal
  – convert and extend Venn diagrams
  – show how many docs contain each subset of up to five terms
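The counts behind such a display can be sketched as follows. This is a hedged illustration, assuming each document is counted once under the exact set of query terms it contains; the function name and the toy documents are made up for the example and are not taken from InfoCrystal.

```python
from collections import Counter


def subset_counts(docs, query_terms):
    """Count documents by the exact subset of query terms each one contains.

    docs: list of documents, each a list of words.
    query_terms: the query terms (the slide mentions up to five).
    Documents containing none of the terms are not counted.
    """
    query = set(query_terms)
    counts = Counter()
    for words in docs:
        present = frozenset(query & set(words))
        if present:
            counts[present] += 1
    return counts


# Toy collection: single letters stand in for query terms.
docs = [
    ["a", "b", "x"],
    ["a", "x"],
    ["a", "b"],
    ["c"],
    ["y"],
]
counts = subset_counts(docs, ["a", "b", "c"])
for subset, n in sorted(counts.items(), key=lambda item: sorted(item[0])):
    print(sorted(subset), n)
```

With five query terms there are at most 2^5 - 1 = 31 non-empty subsets, which is small enough to draw as one icon per subset.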

Page 9: InfoCrystal (Spoerri 93)

[Figure: InfoCrystal for terms A, B, C, and D. Callouts: # of docs containing A; # of docs containing B and D; # of docs containing A, C, and B. Numeric labels: 1 9, 201, 34]

Page 10: Hyperlink Relevant Documents: WebCutter (Maarek & Shaul 97)

• Show documents as icons or text labels
• Choose a starting point for search
• Find documents that are linked to starting point and are most relevant to query
• Continue searching most promising links
• Show link structure graphically
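The steps above resemble a best-first traversal of the link graph. The sketch below is an illustration under stated assumptions, not WebCutter's implementation: the link graph is a small in-memory dict, relevance scores are assumed to be precomputed, and all names are hypothetical.

```python
import heapq


def best_first_crawl(links, relevance, start, max_pages):
    """Visit up to max_pages pages, always expanding the most relevant frontier page.

    links: dict page -> list of linked pages (toy link graph).
    relevance: dict page -> score against the query (assumed precomputed).
    Returns the pages in visiting order and the edges followed to reach them.
    """
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-relevance.get(start, 0.0), start)]
    parent = {start: None}
    visited, edges = [], []
    while frontier and len(visited) < max_pages:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        if parent[page] is not None:
            edges.append((parent[page], page))
        for target in links.get(page, []):
            if target not in parent:
                parent[target] = page
                heapq.heappush(frontier, (-relevance.get(target, 0.0), target))
    return visited, edges


links = {"home": ["news", "docs"], "docs": ["api", "faq"], "news": ["archive"]}
relevance = {"home": 0.1, "news": 0.2, "docs": 0.9, "api": 0.8, "faq": 0.3, "archive": 0.05}
visited, edges = best_first_crawl(links, relevance, "home", max_pages=4)
```

The returned edges form a tree rooted at the starting point, which is the structure a graphical view of the links would draw.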

Page 11: WebCutter (Maarek & Shaul 97)

Page 12: WebCutter (Maarek & Shaul 97)

Page 13: TileBars: Viewing Retrieval Results

Goal: minimize time/effort for deciding which documents to examine in detail

Idea: show the roles of the query terms in the retrieved documents, making use of document structure

Page 14: TileBars

Graphical Representation of Term Distribution and Overlap

Simultaneously Indicate:
– relative document length
– query term frequencies
– query term distributions
– query term overlap
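The four properties listed above can all be read off one small grid: one row per query term set, one column per document segment. The sketch below is a simplified illustration, assuming the document has already been split into segments and that matching is plain lowercase word equality; the function names and sample text are hypothetical.

```python
def tilebar(segments, term_sets):
    """Return one row of per-segment hit counts for each query term set.

    segments: list of strings, one per document segment (assumed given).
    term_sets: list of lists of query words, one list per row.
    """
    rows = []
    for terms in term_sets:
        wanted = {t.lower() for t in terms}
        row = [sum(1 for w in seg.lower().split() if w in wanted) for seg in segments]
        rows.append(row)
    return rows


def overlap_columns(rows):
    """Indices of segments in which every term set has at least one hit."""
    return [i for i, col in enumerate(zip(*rows)) if all(c > 0 for c in col)]


segments = [
    "dbms reliability and recovery in a dbms",
    "banking regulation overview",
    "reliability of networks",
]
rows = tilebar(segments, [["dbms"], ["reliability"]])
overlap = overlap_columns(rows)
```

The row length shows relative document length, the cell values show frequency, the pattern along a row shows distribution, and the columns where every row is non-zero show overlap.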

Page 15: TileBars Example

Query terms:
– DBMS (Database Systems)
– Reliability

What roles do they play in retrieved documents?
– Mainly about both DBMS & reliability
– Mainly about DBMS, discusses reliability
– Mainly about, say, banking, with a subtopic discussion on DBMS/Reliability
– Mainly about high-tech layoffs

Page 16

Page 17

Page 18: Exploiting Visual Properties

– Variation in gray scale saturation imposes a universal, perceptual order (Bertin et al. ’83)
– Varying shades of gray show varying quantities better than color (Tufte ’83)
– Differences in shading should align with the values being presented (Kosslyn et al. ’83)
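As a small illustration of aligning shading with values, the sketch below maps integer hit counts to a few discrete gray levels, darker meaning more. The number of levels, the 8-bit gray range, and the function name are assumptions made for the example.

```python
def gray_level(count, max_count, levels=5):
    """Map an integer hit count to an 8-bit gray value; more hits give a darker shade.

    Returns 255 (white) for zero hits and 0 (black) for max_count or more.
    Uses integer arithmetic so the result is the same on every platform.
    """
    if max_count <= 0 or count <= 0:
        return 255
    step = (levels - 1) * min(count, max_count) // max_count
    step = max(step, 1)  # any non-zero count should be visibly darker than white
    return 255 - (255 * step) // (levels - 1)


shades = [gray_level(c, max_count=8) for c in [0, 1, 4, 8, 20]]
```

Because the mapping never gets lighter as the count grows, the visual order of the shades matches the numeric order of the counts.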

Page 19: Represent Software as One-Dimensional Text: SeeSoft (Eick 94)

– Originally for software development
– Show lines of code graphically
  • how often modified
  • written by whom
  • highlight search terms
– Extend to text
  • show locations of search terms
  • show recurring features
    – e.g., characters in a story
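The idea of drawing each line as a thin row and marking where a search term occurs can be sketched in a few lines. This is a text-only approximation under simple assumptions (case-insensitive substring matching, bar length proportional to line length), not SeeSoft's rendering; the names and the sample text are hypothetical.

```python
def line_marks(text, term):
    """For each line, return (line_length, has_term) so it can be drawn as a thin row."""
    term = term.lower()
    return [(len(line), term in line.lower()) for line in text.splitlines()]


def render(marks, width=20):
    """Draw each line as a bar scaled to its length; '#' rows contain the term."""
    longest = max((n for n, _ in marks), default=0) or 1
    out = []
    for n, hit in marks:
        bar = max(1, n * width // longest) if n else 0
        out.append(("#" if hit else "-") * bar)
    return out


story = "Alice met the Rabbit.\nThe Rabbit ran.\n\nAlice followed."
marks = line_marks(story, "alice")
bars = render(marks)
print("\n".join(bars))
```

Swapping the boolean for an author name or a modification count would give the other encodings listed above, with color standing in for the '#' and '-' characters.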

Page 20: SeeSoft: Changes of Lines of Code over Time (Eick 94)

Page 21: SeeSoft: Characters in Stories (Eick 94)

Page 22: SeeDiff: Compare Differences between Two Files (Eick and Ball)

Page 23: Alternative Way to Group Documents: Category Metadata

• Last time we saw ways to visualize
  – clusters of documents
  – clusters of words taken from documents
• Clusters are data-driven
  – depend on what documents were clustered
  – can find main themes
  – sometimes are hard to understand
• Alternative: human-generated categories

Page 24: What is Category Metadata for?

• “Normalizing” natural language
  – distinguish homonyms
  – group synonyms together
• Organizing information
  – for search
  – for browsing/navigation
• Examples:
  – Yahoo directory
  – ACM keyword hierarchy

Page 25: Example: MeSH and MedLine

• MeSH Medical Category Hierarchy
  – ~18,000 labels
  – manually assigned
  – ~8 labels/article on average
  – avg depth: 4.5, max depth 9
• Top Level Categories:

  anatomy    diagnosis    related disc
  animals    psych        technology
  disease    biology      humanities
  drugs      physics

Page 26: What Categories Do

• Summarize a document according to pre-defined main topics
• Compress the many ways of representing a concept into one
• Identify which subset of attributes are salient for a collection

Page 27: Clusters vs. Categories

CLUSTERS
– Tailored to data
– Overall themes
– Require interpretation

CATEGORIES
– Pre-assigned
– Particular attributes
– Familiar terminology

Page 28: Large Category Sets

• Problems for User Interfaces
  – Too many categories to browse
  – Too many docs per category
  – Docs belong to multiple categories
  – Need to integrate search
  – Need to show the documents

Page 29: Multiple Categories per Document

  Drug    Symptom    Anatomy
  D1      S1         A1
  D2      S2         A2
  D3      S3         A3

Medical articles contain combinations of these concept types

Page 30: How to Group the Category Types? A Lattice is Infeasible

Documents: [D1 S3 A1] [D3 S2 S3] [D1 D2 S2 A2] …

Lattice, from most general to most specific:

  Dx Sx Ax
  Dx Sx A1    Dx S1 Ax    D1 Sx Ax
  Dx S1 A1    D1 S1 Ax    D1 Sx A1
  D1 S1 A1
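A rough count shows why the lattice does not scale. The sketch below assumes a simplified lattice with one slot per category type, where each slot holds either one specific label or the wildcard x; the label counts in the second call are hypothetical and chosen only to show the order of magnitude.

```python
from math import prod


def lattice_nodes(labels_per_type):
    """Number of nodes in a lattice with one slot per category type.

    Each slot holds either one specific label or the wildcard 'x'
    (as in Dx, Sx, Ax), so a type with n labels contributes n + 1 choices.
    """
    return prod(n + 1 for n in labels_per_type)


small = lattice_nodes([3, 3, 3])           # D1..D3, S1..S3, A1..A3
large = lattice_nodes([1000, 1000, 1000])  # hypothetical sizes, for scale only
print(small, large)
```

Even three labels per type give 64 nodes, and the count grows multiplicatively with every added label. Documents that carry several labels of the same type, as in the samples on this slide, only enlarge the space further.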

Page 31: Cat-a-Cone: Interactive Category Interface (Hearst & Karadi 97)

• Key: Separate representation of documents from categories
  – Place categories in 3D animated Tree
  – Collect retrieved documents into a re-usable “Book”
  – Link categories from Book to Tree
  – Innovative query specification

Page 32: Cat-a-Cone: Integrate Navigation and Search (Hearst & Karadi 97)

• Interface that smoothly integrates
  – search over multiple categories
  – search over document contents
  – browsing of multiple categories
  – browsing of retrieved documents
• Iterative, Interactive

Page 33

[Diagram labels: Collection; Retrieved Documents; Category Hierarchy; search; browse; query terms]

Page 34

[Diagram labels: Collection; Retrieved Documents; Category Hierarchy; search; browse; query terms]

Page 35: Cat-a-Cone (Hearst & Karadi 97)

Page 36: ConeTree for Category Labels

• Browse/explore category hierarchy
  – by search on label names
  – by growing/shrinking subtrees
  – by spinning subtrees
• Affordances
  – learn meaning via ancestors, siblings
  – disambiguate meanings
  – all cats simultaneously viewable

Page 37: Virtual Book for Result Sets

– Categories on Page (Retrieved Document) linked to Categories in Tree
– Flipping through Book Pages causes some Subtrees to Expand and Contract
– Most Subtrees remain unchanged
– Book can be Stored for later Re-Use
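The link between a page and the tree can be sketched as a set computation: the expanded part of the tree is every category on a path from the root to one of the page's categories, and a page flip changes only the difference between two such sets. This is an illustrative model, not the Cat-a-Cone implementation; the toy hierarchy and function names are hypothetical.

```python
def nodes_to_expand(parent, doc_categories):
    """Return every category on a path from the root to one of the document's categories.

    parent: dict mapping each category to its parent (None for the root).
    """
    expanded = set()
    for cat in doc_categories:
        node = cat
        # Stop early once we reach a node whose ancestors are already included.
        while node is not None and node not in expanded:
            expanded.add(node)
            node = parent.get(node)
    return expanded


def page_flip(parent, old_doc, new_doc):
    """Categories to contract and to expand when turning from one page to the next."""
    before = nodes_to_expand(parent, old_doc)
    after = nodes_to_expand(parent, new_doc)
    return before - after, after - before


# Toy hierarchy, not actual MeSH labels.
parent = {
    "root": None,
    "disease": "root", "drugs": "root", "anatomy": "root",
    "cancer": "disease", "flu": "disease",
    "aspirin": "drugs", "lung": "anatomy",
}
contract, expand = page_flip(parent, ["cancer", "lung"], ["flu", "lung"])
```

In this example only one node contracts and one expands, while the shared ancestors and the "lung" branch stay as they are, which matches the observation that most subtrees remain unchanged.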

Page 38: Cat-a-Cone (Hearst & Karadi 97)

• Catacomb: (definition 2b, online Webster's) “A complex set of interrelated things”
• Makes use of earlier PARC work on 3D+animation:
  – Rooms (Henderson and Card 86)
  – IV: Cone Tree (Robertson, Card, Mackinlay 93)
  – Web Book (Card, Robertson, York 96)

Page 39: Summary: Cat-a-Cone

• Interface that smoothly integrates
  – search over multiple categories
  – search over document contents
  – browsing of multiple categories
  – browsing of retrieved documents
• Iterative, Interactive
• Retain partial results in a workspace

Page 40: Summary: Visualizing Text

• Text is difficult to visualize
  – represents abstract concepts
  – many combinations of these abstract concepts
• Main visualization approaches:
  – collection overviews based on 2D or 3D views of document clusters
  – graphical displays of relationships to query terms (for information access)
  – graphical displays of relationships to category subsets
• Open Questions:
  – How to walk the border between useful and gratuitous graphics?
  – Is anything better than showing titles?