Marti Hearst, SIMS 247
SIMS 247 Lecture 20: Visualizing Text & Text Collections (cont.)
April 2, 1998
Today
• Visualizing Collection Overviews (cont.)
• Visualizing Query Specifications
  – Selecting Term Subsets
  – Viewing Metadata
• Visualizing Retrieval Results
  – Show Hyperlink Structure (WebCutter)
  – Term Hit Distribution (TileBars, SeeSoft)
  – Group by Shared Metadata (Cat-a-Cone)
Showing Collection Overviews
• From last time:
  – Show documents as icons
  – Link together or place near one another according to:
    • inter-document similarity
    • hyperlink structure
    • citation structure
  – Advantages
    • can see large grouping patterns
  – Disadvantages
    • what do the groups mean?
    • documents usually belong in multiple groups
    • groups are often somewhat arbitrary
Mapping Documents onto Landscapes (Chalmers 96)
Visualizing Query Term Specification
• Query term intersection
  – VIBE
  – InfoCrystal
• Incremental Term Specification
  – LyberWorld
  – WSJ online interface
Visualizing Query Term Intersection
• VIBE
  – establish points of interest (POI) on a 2D plane
  – these correspond to terms or concepts
  – position documents according to their intersections among POI
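One way to read "position documents according to their intersections among POI" is as a score-weighted average of the POI coordinates. The sketch below assumes that reading; the slides do not give a placement formula, and the names vibe_position, pois, and the scores are hypothetical.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]


def vibe_position(poi_xy: Dict[str, Point], scores: Dict[str, float]) -> Optional[Point]:
    """Place a document at the score-weighted average of the POI positions.

    Assumption: a document's pull toward each POI is proportional to its score
    for that POI's term or concept.
    """
    total = sum(scores.get(name, 0.0) for name in poi_xy)
    if total == 0:
        return None  # the document matches no POI, so it has no position
    x = sum(poi_xy[name][0] * scores.get(name, 0.0) for name in poi_xy) / total
    y = sum(poi_xy[name][1] * scores.get(name, 0.0) for name in poi_xy) / total
    return (x, y)


# Hypothetical POI layout on the 2D plane.
pois = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (0.0, 10.0)}

print(vibe_position(pois, {"A": 1.0, "B": 1.0}))  # (5.0, 0.0): midway between A and B
print(vibe_position(pois, {"A": 3.0, "C": 1.0}))  # (0.0, 2.5): pulled toward A
```

Documents with equal scores for two POI land midway between them, so a cluster of icons on the A-B line reads as "about A and B, not C".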
VIBE (Olsen et al. 93)
[Figure: points of interest A, B, C, and D on a 2D plane, with the numbers 9, 22, 3, and 3 shown among them]
Visualizing Query Term Intersection
• InfoCrystal
  – convert and extend Venn diagrams
  – show how many docs contain each subset of up to five terms
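The counts an InfoCrystal-style display needs can be sketched as follows. This is a minimal illustration, not Spoerri's implementation: it assumes each document is counted once, under the exact subset of query terms it contains (like a Venn diagram region), and the function name and sample documents are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def subset_counts(docs: Iterable[Set[str]], query_terms: List[str]) -> Counter:
    """Count documents by the exact subset of query terms each one contains."""
    if len(query_terms) > 5:
        raise ValueError("the slide describes subsets of up to five terms")
    wanted = set(query_terms)
    counts: Counter = Counter()
    for words in docs:
        present = frozenset(words & wanted)
        if present:  # skip documents that contain none of the query terms
            counts[present] += 1
    return counts


# Hypothetical collection: each document is reduced to its set of words.
docs = [
    {"database", "reliability", "banking"},
    {"database", "index"},
    {"database", "reliability"},
    {"layoffs"},
]

counts = subset_counts(docs, ["database", "reliability"])
for subset, n in sorted(counts.items(), key=lambda item: sorted(item[0])):
    print(sorted(subset), n)
```

Each key of the resulting counter corresponds to one interior icon of the display, and the value is the number drawn on it.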
InfoCrystal (Spoerri 93)
[Figure: InfoCrystal for terms A, B, C, and D; values shown include 1 9, 201, and 34. Callouts: # of docs containing A; # of docs containing B and D; # of docs containing A, C, and B]
Hyperlink Relevant Documents
WebCutter (Maarek & Shaul 97)
• Show documents as icons or text labels
• Choose a starting point for search
• Find documents that are linked to starting point and are most relevant to query
• Continue searching most promising links
• Show link structure graphically
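The steps above resemble a best-first traversal of the link graph. The sketch below is a generic best-first crawl over an in-memory graph, offered under that assumption; it is not WebCutter's actual algorithm, and the graph, the relevance scores, and the function name are hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple


def best_first_crawl(
    start: str,
    links: Dict[str, List[str]],
    relevance: Callable[[str], float],
    max_pages: int = 10,
) -> List[Tuple[str, str]]:
    """Follow the most promising links first; return the edges that were followed."""
    visited: Set[str] = {start}
    edges: List[Tuple[str, str]] = []
    # heapq is a min-heap, so scores are negated to pop the most relevant page first.
    frontier = [(-relevance(target), start, target) for target in links.get(start, [])]
    heapq.heapify(frontier)
    while frontier and len(visited) < max_pages:
        _, source, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        edges.append((source, page))
        for target in links.get(page, []):
            if target not in visited:
                heapq.heappush(frontier, (-relevance(target), page, target))
    return edges


# Hypothetical link graph and query-relevance scores.
links = {"home": ["db", "news"], "db": ["db-reliability"], "news": ["sports"]}
scores = {"home": 0.1, "db": 0.8, "news": 0.2, "db-reliability": 0.9, "sports": 0.0}

for source, page in best_first_crawl("home", links, lambda p: scores[p], max_pages=4):
    print(source, "->", page)
```

The returned edge list is the link structure that a display would then draw as a graph of icons or text labels.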
WebCutter (Maarek & Shaul 97)
TileBars: Viewing Retrieval Results
Goal: minimize time/effort for deciding which documents to examine in detail
Idea: show the roles of the query terms in the retrieved documents, making use of document structure
TileBars
Graphical Representation of Term Distribution and Overlap
Simultaneously Indicate:
– relative document length
– query term frequencies
– query term distributions
– query term overlap
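A text-only sketch of how one bar could carry all four signals: one row per query term, one tile per document segment, darker tiles for more hits. It assumes fixed-size word windows as a stand-in for real document structure and ASCII characters in place of gray levels; the function name, the shade characters, and the sample text are hypothetical.

```python
from typing import List

SHADES = ".:+*#"  # lightest to darkest; stands in for the gray scale


def tilebar(text: str, query_terms: List[str], segment_words: int = 5) -> List[str]:
    """Return one row of shaded tiles per query term, one tile per text segment."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    # Assumption: fixed-size word windows approximate the document's segments.
    segments = [words[i:i + segment_words] for i in range(0, len(words), segment_words)]
    rows = []
    for term in query_terms:
        hits = [segment.count(term.lower()) for segment in segments]
        # Clamp the hit count so it always indexes into the available shades.
        rows.append("".join(SHADES[min(h, len(SHADES) - 1)] for h in hits))
    return rows


sample = (
    "DBMS reliability matters. A DBMS logs writes. "
    "Banks use many systems. Reliability of a DBMS depends on recovery."
)

for term, row in zip(["dbms", "reliability"], tilebar(sample, ["dbms", "reliability"])):
    print(f"{term:12s} {row}")
```

Row length reflects document length, darkness reflects term frequency, the position of dark tiles shows distribution, and columns that are dark in both rows show where the terms overlap.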
TileBars Example
Query terms: DBMS (Database Systems), Reliability
What roles do they play in retrieved documents?
– Mainly about both DBMS & reliability
– Mainly about DBMS, discusses reliability
– Mainly about, say, banking, with a subtopic discussion on DBMS/Reliability
– Mainly about high-tech layoffs
Exploiting Visual Properties
– Variation in gray scale saturation imposes a universal, perceptual order (Bertin et al. ’83)
– Varying shades of gray show varying quantities better than color (Tufte ’83)
– Differences in shading should align with the values being presented (Kosslyn et al. ’83)
Represent Software as One-Dimensional Text
SeeSoft (Eick 94)
– Originally for software development
– Show lines of code graphically
  • how often modified
  • written by whom
  • highlight search terms
– Extend to text
  • show locations of search terms
  • show recurring features
    – e.g., characters in a story
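The line-per-row idea can be sketched in text form: each line becomes a bar scaled to its length, and lines containing the search term are drawn with a different mark. This is only an illustration of the principle, not SeeSoft itself; the function name, the marks, and the sample story are hypothetical.

```python
from typing import List


def seesoft_rows(lines: List[str], search_term: str, max_width: int = 16) -> List[str]:
    """Draw each line as a bar scaled to its length; '#' marks lines with a hit."""
    longest = max((len(line) for line in lines), default=0)
    rows = []
    for line in lines:
        # Integer scaling keeps the longest line at exactly max_width characters.
        width = len(line) * max_width // longest if longest else 0
        mark = "#" if search_term.lower() in line.lower() else "-"
        rows.append(mark * width)
    return rows


# Hypothetical story text; the search term is a character's name.
story = [
    "Alice went down the rabbit hole.",
    "The Queen shouted.",
    "Alice woke up.",
]

for row in seesoft_rows(story, "alice"):
    print(row)
```

Reading down the rows shows where in the text a character (or any recurring feature) appears without reading the text itself.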
SeeSoft: Changes of Lines of Code over Time (Eick 94)
SeeSoft: Characters in Stories (Eick 94)
SeeDiff: Compare Differences between Two Files (Eick and Ball)
Alternative Way to Group Documents: Category Metadata
• Last time we saw ways to visualize
  – clusters of documents
  – clusters of words taken from documents
• Clusters are data-driven
  – depend on what documents were clustered
  – can find main themes
  – sometimes are hard to understand
• Alternative: human-generated categories
What is Category Metadata for?
• “Normalizing” natural language
  – distinguish homonyms
  – group synonyms together
• Organizing information
  – for search
  – for browsing/navigation
• Examples:
  – Yahoo directory
  – ACM keyword hierarchy
Example: MeSH and MedLine
• MeSH Medical Category Hierarchy
  – ~18,000 labels
  – manually assigned
  – ~8 labels/article on average
  – avg depth: 4.5, max depth 9
• Top Level Categories:
  anatomy    diagnosis    related disc
  animals    psych        technology
  disease    biology      humanities
  drugs      physics
What Categories Do
• Summarize a document according to pre-defined main topics
• Compress the many ways of representing a concept into one
• Identify which subset of attributes are salient for a collection
Clusters vs. Categories
CLUSTERS
– Tailored to data
– Overall themes
– Require interpretation
CATEGORIES
– Pre-assigned
– Particular attributes
– Familiar terminology
Large Category Sets
• Problems for User Interfaces
  – Too many categories to browse
  – Too many docs per category
  – Docs belong to multiple categories
  – Need to integrate search
  – Need to show the documents
Multiple Categories per Document
Drug    Symptom    Anatomy
D1      S1         A1
D2      S2         A2
D3      S3         A3
Medical articles contain combinations of these concept types
How to Group the Category Types? A Lattice is Infeasible
[D1 S3 A1] [D3 S2 S3] [D1 D2 S2 A2] …
Dx Sx Ax
Dx Sx A1    Dx S1 Ax    D1 Sx Ax
Dx S1 A1    D1 S1 Ax    D1 Sx A1
D1 S1 A1
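A quick count shows why the lattice does not scale. The sketch below assumes each category type in a node is either a wildcard (Dx, Sx, Ax) or exactly one label, as in the lattice drawn on the slide; the function name and the label sets are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def lattice_nodes(labels: Dict[str, List[str]]) -> List[Tuple[str, ...]]:
    """Enumerate every lattice node: each type is a wildcard or one of its labels."""
    slots = [[prefix + "x"] + values for prefix, values in labels.items()]
    return list(product(*slots))


# One label per type reproduces the lattice drawn on the slide.
one_each = {"D": ["D1"], "S": ["S1"], "A": ["A1"]}
print(len(lattice_nodes(one_each)))  # 8

# Three labels per type, as in the Drug/Symptom/Anatomy table.
three_each = {
    "D": ["D1", "D2", "D3"],
    "S": ["S1", "S2", "S3"],
    "A": ["A1", "A2", "A3"],
}
print(len(lattice_nodes(three_each)))  # 64
```

With n labels per type the count is (n + 1) cubed for three types, and it grows further once a document may carry several labels of the same type, as [D3 S2 S3] does.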
Cat-a-Cone: Interactive Category Interface
(Hearst & Karadi 97)
• Key: Separate representation of documents from categories
  – Place categories in 3D animated Tree
  – Collect retrieved documents into a re-usable “Book”
  – Link categories from Book to Tree
  – Innovative query specification
Cat-a-Cone: Integrate Navigation and Search
(Hearst & Karadi 97)
• Interface that smoothly integrates
  – search over multiple categories
  – search over document contents
  – browsing of multiple categories
  – browsing of retrieved documents
• Iterative, Interactive
[Diagram: Collection, Category Hierarchy, and Retrieved Documents; arrows labeled "search", "browse", and "query terms"]
Cat-a-Cone (Hearst & Karadi 97)
ConeTree for Category Labels
• Browse/explore category hierarchy
  – by search on label names
  – by growing/shrinking subtrees
  – by spinning subtrees
• Affordances
  – learn meaning via ancestors, siblings
  – disambiguate meanings
  – all categories simultaneously viewable
Virtual Book for Result Sets
– Categories on Page (Retrieved Document) linked to Categories in Tree
– Flipping through Book Pages causes some Subtrees to Expand and Contract
– Most Subtrees remain unchanged
– Book can be Stored for later Re-Use
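The Book-to-Tree link can be sketched as a set difference: the subtrees that must be open for a page are the ancestors of that page's categories, and flipping a page only changes the subtrees that differ. This is a minimal sketch under that assumption, not the Cat-a-Cone implementation; the tree, the sub-labels, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def expanded_nodes(parent: Dict[str, Optional[str]], categories: List[str]) -> Set[str]:
    """Nodes that must be open to reveal the given categories: all their ancestors."""
    opened: Set[str] = set()
    for category in categories:
        node = parent.get(category)
        while node is not None:
            opened.add(node)
            node = parent.get(node)
    return opened


def flip_page(
    parent: Dict[str, Optional[str]], old_page: List[str], new_page: List[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (subtrees to expand, subtrees to contract) when the page changes."""
    before = expanded_nodes(parent, old_page)
    after = expanded_nodes(parent, new_page)
    return after - before, before - after


# Hypothetical category tree, stored as child -> parent.
parent = {
    "root": None,
    "disease": "root",
    "drugs": "root",
    "anatomy": "root",
    "infection": "disease",
    "antibiotics": "drugs",
    "lung": "anatomy",
}

expand, contract = flip_page(parent, ["infection", "antibiotics"], ["infection", "lung"])
print(sorted(expand), sorted(contract))  # ['anatomy'] ['drugs']
```

Here the "disease" subtree stays open across both pages, which matches the observation that most subtrees remain unchanged when a page is flipped.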
Cat-a-Cone (Hearst & Karadi 97)
• Catacomb: (definition 2b, online Webster's) “A complex set of interrelated things”
• Makes use of earlier PARC work on 3D+animation:
  – Rooms (Henderson and Card 86)
  – IV: Cone Tree (Robertson, Card, Mackinlay 93)
  – Web Book (Card, Robertson, York 96)
Summary: Cat-a-Cone
• Interface that smoothly integrates
  – search over multiple categories
  – search over document contents
  – browsing of multiple categories
  – browsing of retrieved documents
• Iterative, Interactive
• Retain partial results in a workspace
Summary: Visualizing Text
• Text is difficult to visualize
  – represents abstract concepts
  – many combinations of these abstract concepts
• Main visualization approaches:
  – collection overviews based on 2D or 3D views of document clusters
  – graphical displays of relationships to query terms (for information access)
  – graphical displays of relationships to category subsets
• Open Questions:
  – How to walk the border between useful and gratuitous graphics?
  – Is anything better than showing titles?