ENV 2006 6.1
Envisioning Information
Lecture 6 – Document Visualization
Ken Brodlie...
ENV 2006 6.2
Document Visualization - Challenges
• Large collections of electronic text
– the Web is prime example!
– E-mail archives
– Literature collections
• Can we use visualization to help us understand:
– content of groups of documents?
– relationships between documents?
• Powerful search and retrieval engines
– return documents based on some sort of keyword search
• Can we visualize the results of a query?
ENV 2006 6.3
Views of Documents – 1D View
• Documents can be viewed in different dimensions: 1D, 2D, 3D, multidimensional
• Linear text
– Sees document as 1D string of words
– Split into tiles of ‘similar’ text
• Visualization idea
– Tilebars
– Each document a bar, length proportional to document length
– Shown as set of tiles, with shading indicating strength of relevance of tile to keywords
Hearst, CHI, 1995
ENV 2006 6.4
2D Document View
• This is how we normally think of documents
– Structure on page is 2D
– Zooming interfaces have been developed
– Early one was PAD++: documents visible at different scales
– (return to zooming interfaces later)
http://www.cs.umd.edu/hcil/pad++/
ENV 2006 6.5
3D Document Views
• Innovative 3D views have been suggested
WebBook: Card et al, CHI, 1996
ENV 2006 6.6
Approach
• Generally approach is in three steps:
– Analyse to capture essential features of document (for Tilebars, relative frequency of words in a segment of text)
– Use algorithms to generate a viable representation of the documents (1D representation in Tilebars)
– Create an interactive visual representation (clicking on a tile gives a list of the corresponding text with keywords highlighted)
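The analysis step for Tilebars can be sketched roughly as below. This is a minimal illustration only, not Hearst's implementation: the fixed-size tiles, the whitespace tokenisation, and the names `tile_relevance` and `shade` are all assumptions made here (the lecture only says the text is split into tiles of 'similar' text).

```python
def tile_relevance(text, keywords, tile_size=5):
    """Return, for each tile, the relative frequency of each keyword.

    Assumptions (not from the lecture): tiles are fixed-size runs of
    words, and tokens are lower-cased, whitespace-separated words.
    """
    words = text.lower().split()
    tiles = [words[i:i + tile_size] for i in range(0, len(words), tile_size)]
    rows = []
    for tile in tiles:
        # Relative frequency = occurrences of the keyword / words in the tile.
        rows.append({kw: tile.count(kw.lower()) / len(tile) for kw in keywords})
    return rows


def shade(value):
    """Map a relative frequency in [0, 1] to a crude text 'shading'."""
    levels = " .:#"
    return levels[min(len(levels) - 1, int(value * len(levels) * 2))]


# Hypothetical 12-word document, split into two tiles of 6 words.
doc = ("maps show terrain maps guide travel "
       "text search finds text in documents")
rel = tile_relevance(doc, ["maps", "text"], tile_size=6)

# One row of 'tiles' per keyword, darker character = more relevant tile.
for kw in ["maps", "text"]:
    print(kw.ljust(5), "".join(shade(row[kw]) for row in rel))
```

Each printed row plays the part of one strip of a Tilebar: one character per tile, with denser characters standing in for darker shading.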
Analysis → Algorithms → Visualization
ENV 2006 6.7
Multidimensional Text
• Recent research sees text as multi-dimensional
• Document collection scanned for ‘distinguishing’ words
– Words distinctive to each document (keywords)
– Gives a mathematical ‘signature’ for each document as a high-dimensional vector
– Similarities between documents can then be calculated, so as to create clusters
– Clusters are mapped down to a 2D space, with similar clusters close together and dissimilar ones far apart
Galaxy – developed at PNNL, part of IN-SPIRE product
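As a small sketch of how similarity between document 'signatures' might be calculated: the lecture does not say which measure is used, so cosine similarity is an assumption here, and the vectors and names (`docA` to `docD`) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two keyword vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a document with no keywords is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical signatures: one entry per distinguishing word.
signatures = {
    "docA": [1, 0, 1],
    "docB": [1, 1, 0],
    "docC": [0, 0, 1],
    "docD": [1, 1, 0],
}

names = sorted(signatures)
for i, p in enumerate(names):
    for q in names[i + 1:]:
        sim = cosine_similarity(signatures[p], signatures[q])
        print(p, q, round(sim, 2))
```

Documents with high pairwise similarity (here `docB` and `docD`, which have identical vectors) are the ones a clustering step would group together.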
ENV 2006 6.8
How do we transform from multidimensional to 2D space?
• Self-organising feature maps (Kohonen maps)
– Form of neural network
• Inputs are the vectors for each document
• Output is a 2D grid whose nodes represent clusters of similar documents, with related clusters placed close together
How does it work?
Multilingual information retrieval documents from database
ENV 2006 6.9
Self-organising maps – A worked example
• Set of 311 documents in a database
• 40 key words extracted from titles
• Matrix of documents vs keywords created
• Set up rectangular grid (10 x 14 was used)
• Each node gets assigned a reference vector with small random values
        kw1  kw2  kw3
doc1     1    0    1
doc2     1    1    0
doc3     0    0    1
doc4     1    1    0
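The set-up above can be sketched as follows, using the toy 3-keyword matrix from the slide rather than the full 311 x 40 one. The function name `init_grid`, the range of the 'small random values' (0 to 0.1), and treating 10 as rows and 14 as columns are assumptions made for illustration.

```python
import random

# Toy document-by-keyword matrix copied from the slide (3 keywords);
# the lecture's real example has 311 documents and 40 keywords.
docs = {
    "doc1": [1, 0, 1],
    "doc2": [1, 1, 0],
    "doc3": [0, 0, 1],
    "doc4": [1, 1, 0],
}

ROWS, COLS = 10, 14  # grid size from the slide; rows vs columns is an assumption
DIM = 3              # one dimension per keyword


def init_grid(rows, cols, dim, scale=0.1, seed=0):
    """Give every grid node a reference vector of small random values."""
    rng = random.Random(seed)
    return {
        (r, c): [rng.uniform(0.0, scale) for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }


grid = init_grid(ROWS, COLS, DIM)
print(len(grid), "nodes; reference vector at (5, 7):", grid[(5, 7)])
```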
ENV 2006 6.10
Self-organising maps – Worked example
• Select a document at random
• Find the ‘nearest’ reference vector in N-dimensional space (ie 40-D here)
• Adjust the reference vector to be closer to the document…
• …and adjust all its neighbours on the grid also
• Iterate (here for 2500 iterations)
• Finally map each document to nearest node
doc2               1    1    0
Ref(5,7) before    0.6  0.6  0.1
Ref(5,7) after     0.9  0.9  0.03
Node (5,7)
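The training steps above can be sketched as below. This is a simplified illustration, not the implementation behind the lecture's example: the square neighbourhood, the linearly shrinking learning rate and radius, and all function and parameter names are assumptions. The single-step rate of 0.75 is inferred here from the slide's Ref(5,7) numbers; the lecture does not state it.

```python
import random


def nearest_node(grid, vec):
    """Grid position whose reference vector is closest to vec (squared Euclidean)."""
    return min(grid, key=lambda pos: sum((m - x) ** 2 for m, x in zip(grid[pos], vec)))


def move_towards(ref, vec, rate):
    """Move a reference vector a fraction `rate` of the way towards vec."""
    return [m + rate * (x - m) for m, x in zip(ref, vec)]


def train_som(docs, rows, cols, iterations=2500, rate0=0.5, radius0=3.0, seed=0):
    rng = random.Random(seed)
    dim = len(next(iter(docs.values())))
    # Reference vectors start as small random values.
    grid = {(r, c): [rng.uniform(0.0, 0.1) for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    names = list(docs)
    for t in range(iterations):
        frac = 1.0 - t / iterations          # shrinks from 1 towards 0 (assumed schedule)
        rate = rate0 * frac
        radius = max(1.0, radius0 * frac)
        vec = docs[rng.choice(names)]        # select a document at random
        br, bc = nearest_node(grid, vec)     # find the nearest reference vector
        for (r, c), ref in grid.items():
            # Adjust the winning node and its grid neighbours (square window).
            if abs(r - br) <= radius and abs(c - bc) <= radius:
                grid[(r, c)] = move_towards(ref, vec, rate)
    # Finally map each document to its nearest node.
    return grid, {name: nearest_node(grid, docs[name]) for name in names}


# One update step with the slide's numbers: Ref(5,7) moves towards doc2.
step = move_towards([0.6, 0.6, 0.1], [1, 1, 0], 0.75)
print(step)

docs = {"doc1": [1, 0, 1], "doc2": [1, 1, 0], "doc3": [0, 0, 1], "doc4": [1, 1, 0]}
grid, placement = train_som(docs, rows=10, cols=14)
print(placement)
```

With a rate of 0.75 the single step gives roughly (0.9, 0.9, 0.025), consistent with the slide's 0.9, 0.9, 0.03 after rounding. Documents with identical keyword vectors, such as doc2 and doc4, always land on the same node.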
ENV 2006 6.11
Self-organising map – Worked example
Concept areas are clustered: languages; technologies; tools
ENV 2006 6.12
Multidimensional Text
• The Galaxy View is extended by ThemeView
• High peaks indicate large number of documents with strong content similarity
• Peaks close together suggest themes which are related
http://in-spire.pnl.gov/
ENV 2006 6.13
Cartographic approach
• Cartographic principles are very relevant to document visualization
• Landscapes are very easy for us to recognise (cf faces)
• Level of detail well understood by cartographers (cf Google maps)
3 different zoom levels
Skupin, IEEE CG&A, 2002
2200 abstracts
Clusters formed
ENV 2006 6.14
Case Study: Visualizing results from a search query
• Case study from NIST in US
• Suppose search returns a keyword strength
– ie user enters a number of keywords
– engine returns list of documents
– each document has a score for each keyword specified (eg number of occurrences)
– most relevant document has largest total score
• How can we visualize this information?
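A minimal sketch of the scoring described above, taking 'keyword strength' to be the number of occurrences as the slide suggests. The function names, the whitespace tokenisation, and the sample collection are hypothetical, not part of the NIST system.

```python
def keyword_strengths(text, keywords):
    """Score a document for each keyword by counting occurrences."""
    words = text.lower().split()
    return {kw: words.count(kw.lower()) for kw in keywords}


def rank_documents(collection, keywords):
    """Return (name, strengths, total) tuples, largest total score first."""
    scored = []
    for name, text in collection.items():
        strengths = keyword_strengths(text, keywords)
        scored.append((name, strengths, sum(strengths.values())))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored


# Hypothetical three-document collection.
collection = {
    "a.txt": "visualization of text and more text",
    "b.txt": "text retrieval",
    "c.txt": "visualization visualization text text text",
}

ranking = rank_documents(collection, ["text", "visualization"])
for name, strengths, total in ranking:
    print(name, strengths, total)
```

The per-keyword strengths are the values the displays on the following slides use for positioning; the total gives the relevance ordering.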
ENV 2006 6.15
Document Spiral
Arrange docs in spiral, most relevant at centre
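One way such a layout could be computed is sketched below. The slide does not say which spiral the NIST display uses, so the Archimedean spiral (radius growing in proportion to angle), the parameter values, and the name `spiral_positions` are assumptions.

```python
import math


def spiral_positions(n, turn=0.6, growth=0.5):
    """Place n documents on a spiral; rank 0 (most relevant) sits at the centre.

    turn   : angle in radians between consecutive documents (assumed value)
    growth : how fast the radius grows with the angle (assumed value)
    """
    positions = []
    for rank in range(n):
        angle = rank * turn
        radius = growth * angle
        positions.append((radius * math.cos(angle), radius * math.sin(angle)))
    return positions


# Documents are assumed to be already sorted by total score, best first.
ranked_docs = ["c.txt", "a.txt", "b.txt", "d.txt", "e.txt"]
layout = spiral_positions(len(ranked_docs))
for name, (x, y) in zip(ranked_docs, layout):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```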
ENV 2006 6.16
Document Three-Keyword Axes Display
One keyword per axis
Plot docs in a scatter plot using keyword strengths to position along axes
Same keyword on all axes lines docs up on X=Y=Z line
ENV 2006 6.17
Nearest Neighbour Sequence
Choose one doc and place on circle
Find the closest in ‘keyword strength’ space and place adjacent to it.... and so on
http://zing.ncsl.nist.gov/~cugini/uicd/viz.html
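The nearest neighbour sequence can be sketched as a greedy ordering. Two points are assumptions here: distance in 'keyword strength' space is taken to be Euclidean, and "and so on" is read as repeatedly taking the unplaced doc closest to the most recently placed one. The function names and sample strengths are hypothetical.

```python
import math


def nearest_neighbour_sequence(strengths, start):
    """Order documents so each one is followed by its closest unplaced neighbour."""
    order = [start]
    remaining = set(strengths) - {start}
    while remaining:
        current = strengths[order[-1]]
        # sorted() makes tie-breaking deterministic.
        nxt = min(sorted(remaining), key=lambda n: math.dist(current, strengths[n]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def circle_positions(order, radius=1.0):
    """Spread the ordered documents evenly around a circle."""
    n = len(order)
    return {
        name: (radius * math.cos(2 * math.pi * i / n),
               radius * math.sin(2 * math.pi * i / n))
        for i, name in enumerate(order)
    }


# Hypothetical strengths for two keywords per document.
strengths = {"d1": (5, 0), "d2": (4, 1), "d3": (0, 6), "d4": (1, 5)}
order = nearest_neighbour_sequence(strengths, start="d1")
print(order)
print(circle_positions(order))
```

Because the ordering is greedy, the result depends on which doc is chosen first, and the last doc placed need not be close to the first one even though they end up adjacent on the circle.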
ENV 2006 6.18
Visualizing Web Searches
www.kartoo.co.uk