ENV 2006 6.1

Envisioning Information

Lecture 6 – Document Visualization

Ken Brodlie

ENV 2006 6.2

Document Visualization - Challenges

• Large collections of electronic text

– the Web is prime example!

– E-mail archives

– Literature collections

• Can we use visualization to help us understand:

– content of groups of documents?

– relationships between documents?

• Powerful search and retrieval engines

– return documents based on some sort of keyword search

• Can we visualize the results of a query?

ENV 2006 6.3

Views of Documents – 1D View

• Documents can be viewed in different dimensions: 1D, 2D, 3D, multidimensional

• Linear text

– Sees document as 1D string of words

– Split into tiles of ‘similar’ text

• Visualization idea

– Tilebars

– Each document a bar, length proportional to document length

– Shown as set of tiles, with shading indicating strength of relevance of tile to keywords

Hearst, CHI, 1995
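The Tilebars idea can be sketched in a few lines. This is a minimal illustration, not Hearst's implementation: it assumes fixed-size tiles of `tile_size` words (the slide describes tiles of ‘similar’ text, which needs a real segmentation step), and the names `tile_scores` and `render_bar` are hypothetical.

```python
import re

def tile_scores(text, keywords, tile_size=50):
    """Count keyword occurrences in each fixed-size tile of words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    tiles = [words[i:i + tile_size] for i in range(0, len(words), tile_size)]
    return {kw: [tile.count(kw.lower()) for tile in tiles] for kw in keywords}

def render_bar(counts, shades=" .:#"):
    """Turn per-tile counts into one character per tile; '#' is the darkest."""
    top = max(counts, default=0) or 1   # avoid dividing by zero
    return "".join(shades[round(c / top * (len(shades) - 1))] for c in counts)

# Toy document: 12 words, so three tiles of four words each.
doc = "network " * 3 + "filler " * 5 + "neural network " * 2
scores = tile_scores(doc, ["network", "neural"], tile_size=4)
for keyword, counts in scores.items():
    print(f"{keyword:8s}|{render_bar(counts)}|")
```

Each printed row is one keyword; the bar length grows with document length, and darker characters mark tiles where the keyword is more frequent.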

ENV 2006 6.4

2D Document View

• This is how we normally think of documents

– Structure on page is 2D

– Zooming interfaces have been developed

– Early one was PAD++: documents visible at different scales

– (return to zooming interfaces later)

http://www.cs.umd.edu/hcil/pad++/

ENV 2006 6.5

3D Document Views

• Innovative 3D views have been suggested

WebBook: Card et al, CHI, 1996

ENV 2006 6.6

Approach

• Generally the approach has three steps:

– Analyse to capture essential features of document (for Tilebars, relative frequency of words in a segment of text)

– Use algorithms to generate a viable representation of the documents (1D representation in Tilebars)

– Create an interactive visual representation (clicking on a tile gives a list of the corresponding text with keywords highlighted)

Analysis → Algorithms → Visualization

ENV 2006 6.7

Multidimensional Text

• Recent research sees text as multi-dimensional

• Document collection scanned for ‘distinguishing’ words

– Words distinctive to each document (keywords)

– Gives a mathematical ‘signature’ for each document as a high-dimensional vector

– Similarities between documents can then be calculated, so as to create clusters

– Clusters are mapped down to a 2D space, with similar clusters close together and dissimilar ones far apart

Galaxy – developed at PNNL, part of IN-SPIRE product
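The signature and similarity steps can be sketched as follows. The slide does not say which term weighting or similarity measure Galaxy uses, so raw term counts and cosine similarity are assumptions here, and the vocabulary and texts are made up for illustration.

```python
import math
from collections import Counter

def signature(text, vocabulary):
    """Return a term-count vector for `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocabulary = ["map", "neural", "query"]        # hypothetical keywords
d1 = signature("neural map neural", vocabulary)
d2 = signature("neural map", vocabulary)
d3 = signature("query query map", vocabulary)

print(cosine_similarity(d1, d2))   # high: both are about neural maps
print(cosine_similarity(d1, d3))   # low: they share only 'map'
```

Documents whose pairwise similarity is high would end up in the same cluster; the clustering and the projection to 2D are separate steps.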

ENV 2006 6.8

How do we transform from multidimensional to 2D space?

• Self-organising feature maps (Kohonen maps)

– Form of neural network

• Input are the vectors for each document

• Output is a 2D grid whose nodes represent clusters of similar documents, with related clusters placed close together

How does it work?

Multilingual information retrieval documents from database

ENV 2006 6.9

Self-organising maps – A worked example

• Set of 311 documents in a database

• 40 key words extracted from titles

• Matrix of documents vs keywords created

• Set up rectangular grid (10 x 14 was used)

• Each node gets assigned a reference vector with small random values

      kw1  kw2  kw3
doc1   1    0    1
doc2   1    1    0
doc3   0    0    1
doc4   1    1    0
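The set-up steps above can be sketched as follows. The titles and keywords are invented so that the result reproduces the 0/1 table on this slide; the range used for the "small random values" (0 to 0.1) is an assumption, as are the function names.

```python
import random

def keyword_matrix(titles, keywords):
    """One row per document, one 0/1 column per keyword found in the title."""
    return [[1 if kw in title.lower().split() else 0 for kw in keywords]
            for title in titles]

def init_grid(rows, cols, dim, scale=0.1, seed=0):
    """Give every grid node a reference vector of small random values."""
    rng = random.Random(seed)
    return [[[rng.uniform(0, scale) for _ in range(dim)]
             for _ in range(cols)] for _ in range(rows)]

keywords = ["retrieval", "multilingual", "database"]   # hypothetical kw1..kw3
titles = [
    "Retrieval from a database",      # doc1
    "Multilingual retrieval",         # doc2
    "Database design",                # doc3
    "Multilingual text retrieval",    # doc4
]
matrix = keyword_matrix(titles, keywords)

# Grid size and vector length as on the slide: 10 x 14 nodes, 40 keywords.
grid = init_grid(rows=10, cols=14, dim=40)
```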

ENV 2006 6.10

Self-organising maps – Worked example

• Select a document at random

• Find the ‘nearest’ reference vector in N-dimensional space (ie 40-D here)

• Adjust the reference vector to be closer to the document…

• …and adjust all its neighbours on the grid also

• Iterate (here for 2500 iterations)

• Finally map each document to nearest node

doc2                    1    1    0
Ref(5,7) before update  0.6  0.6  0.1
Ref(5,7) after update   0.9  0.9  0.03
Grid node: 5,7
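The training loop described on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used for the worked example: the Euclidean distance, the Gaussian neighbourhood, the linear decay and the default `rate` and `radius` are all choices made here. The slide's figures, Ref(5,7) moving from (0.6, 0.6, 0.1) to about (0.9, 0.9, 0.03), are consistent with a pull of roughly 0.75 toward doc2; that value is inferred, not stated.

```python
import math
import random

def adjust(ref, doc, pull):
    """Move a reference vector a fraction `pull` of the way toward `doc`."""
    return [r + pull * (d - r) for r, d in zip(ref, doc)]

def nearest_node(grid, vec):
    """Return (row, col) of the reference vector closest to `vec`."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(grid):
        for c, ref in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(ref, vec))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def train_som(docs, rows, cols, iterations=2500, rate=0.5, radius=2.0, seed=0):
    """Train a rows x cols map on `docs` and return the grid of reference vectors."""
    rng = random.Random(seed)
    dim = len(docs[0])
    grid = [[[rng.uniform(0, 0.1) for _ in range(dim)]
             for _ in range(cols)] for _ in range(rows)]
    for t in range(iterations):
        decay = 1.0 - t / iterations        # shrink the pull and radius over time
        doc = rng.choice(docs)              # select a document at random
        br, bc = nearest_node(grid, doc)    # nearest reference vector
        sigma = max(radius * decay, 0.5)
        for r in range(rows):
            for c in range(cols):
                # Neighbours on the grid are pulled too, less so when far away.
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                pull = rate * decay * math.exp(-grid_dist2 / (2 * sigma ** 2))
                grid[r][c] = adjust(grid[r][c], doc, pull)
    return grid

# One update step with the numbers from the slide (pull of 0.75 is inferred).
print(adjust([0.6, 0.6, 0.1], [1, 1, 0], 0.75))

# Small run on the four example documents, then map each one to its node.
docs = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 0]]
som = train_som(docs, rows=4, cols=4, iterations=300)
placement = [nearest_node(som, doc) for doc in docs]
```

Identical documents (doc2 and doc4 here) always land on the same node; similar ones tend to land on nearby nodes.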

ENV 2006 6.11

Self-organising map – Worked example

Concept areas are clustered: languages; technologies; tools

ENV 2006 6.12

Multidimensional Text

• The Galaxy View is extended by ThemeView

• High peaks indicate large number of documents with strong content similarity

• Peaks close together suggest themes which are related

http://in-spire.pnl.gov/
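One way to picture how a landscape could be derived from a Galaxy-style layout is to count documents per cell of a grid laid over the 2D positions, so that dense regions become peaks. This is only an assumed simplification for illustration; it is not a description of how ThemeView computes its surface, and `theme_heights` is a hypothetical name.

```python
def theme_heights(points, rows, cols):
    """Count documents per cell of a rows x cols grid over the unit square."""
    heights = [[0] * cols for _ in range(rows)]
    for x, y in points:
        r = min(int(y * rows), rows - 1)   # clamp y == 1.0 into the last row
        c = min(int(x * cols), cols - 1)   # clamp x == 1.0 into the last column
        heights[r][c] += 1
    return heights

# Hypothetical 2D document positions, already scaled to [0, 1] x [0, 1].
positions = [(0.10, 0.10), (0.15, 0.12), (0.90, 0.90)]
heights = theme_heights(positions, rows=2, cols=2)
for row in heights:
    print(row)
```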

ENV 2006 6.13

Cartographic approach

• Cartographic principles are very relevant to document visualization

• Landscapes are very easy for us to recognise (cf faces)

• Level of detail well understood by cartographers (cf Google maps)

3 different zoom levels

Skupin, IEEE CG&A, 2002

2200 abstracts

Clusters formed

ENV 2006 6.14

Case Study: Visualizing results from a search query

• Case study from NIST in US

• Suppose search returns a keyword strength

– ie user enters a number of keywords

– engine returns list of documents

– each document has a score for each keyword specified (eg number of occurrences)

– most relevant document has largest total score

• How can we visualize this information?
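The data behind this case study can be sketched as a table of per-keyword scores, ranked by total. The document names and scores below are invented for illustration.

```python
def rank_documents(scores):
    """Sort document names by total keyword score, most relevant first."""
    return sorted(scores, key=lambda doc: sum(scores[doc].values()), reverse=True)

# Hypothetical search result: number of occurrences of each keyword.
scores = {
    "docA": {"map": 3, "neural": 0},
    "docB": {"map": 1, "neural": 4},
    "docC": {"map": 0, "neural": 1},
}
ranking = rank_documents(scores)
print(ranking)
```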

ENV 2006 6.15

Document Spiral

Arrange docs in spiral, most relevant at centre
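A spiral layout of this kind can be sketched with an Archimedean spiral, where the radius grows in proportion to the angle. The choice of spiral and the `spacing` and `turn` parameters are assumptions; the NIST display may place documents differently.

```python
import math

def spiral_layout(ranked_docs, spacing=1.0, turn=0.5):
    """Place documents along a spiral; rank 0 (most relevant) sits at the centre."""
    positions = {}
    for rank, doc in enumerate(ranked_docs):
        angle = rank * turn                        # radians advanced per document
        radius = spacing * angle / (2 * math.pi)   # grows by `spacing` per full turn
        positions[doc] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions

layout = spiral_layout(["docB", "docA", "docC"])   # already sorted by relevance
for doc, (x, y) in layout.items():
    print(f"{doc}: ({x:.3f}, {y:.3f})")
```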

ENV 2006 6.16

Document Three-Keyword Axes Display

One keyword per axis

Plot docs in a scatter plot using keyword strengths to position along axes

Same keyword on all axes lines docs up on X=Y=Z line
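The positioning rule can be sketched directly: each axis is assigned a keyword, and a document's coordinate on that axis is its strength for that keyword. The function name and scores are hypothetical.

```python
def axes_position(doc_scores, x_kw, y_kw, z_kw):
    """Return (x, y, z) from a document's keyword strengths; missing keywords count as 0."""
    return tuple(doc_scores.get(kw, 0) for kw in (x_kw, y_kw, z_kw))

doc_scores = {"map": 3, "neural": 1}   # hypothetical strengths for one document

print(axes_position(doc_scores, "map", "neural", "query"))
# Assigning the same keyword to every axis gives equal coordinates,
# which is why such documents line up on the X=Y=Z line.
print(axes_position(doc_scores, "map", "map", "map"))
```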

ENV 2006 6.17

Nearest Neighbour Sequence

Choose one doc and place on circle

Find the closest in ‘keyword strength’ space and place adjacent to it.... and so on

http://zing.ncsl.nist.gov/~cugini/uicd/viz.html
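The nearest neighbour sequence can be sketched as a greedy ordering followed by even spacing around a circle. Two assumptions are made here: "closest" means Euclidean distance in keyword-strength space, and each new document is the one closest to the most recently placed document. The vectors are invented.

```python
import math

def nearest_neighbour_sequence(vectors, start):
    """Greedy ordering: repeatedly append the unplaced doc closest to the last one."""
    order = [start]
    remaining = set(vectors) - {start}
    while remaining:
        last = vectors[order[-1]]
        # sorted() makes tie-breaking deterministic (alphabetical).
        nxt = min(sorted(remaining), key=lambda d: math.dist(vectors[d], last))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def circle_positions(order, radius=1.0):
    """Space the ordered documents evenly around a circle."""
    step = 2 * math.pi / len(order)
    return {doc: (radius * math.cos(i * step), radius * math.sin(i * step))
            for i, doc in enumerate(order)}

# Hypothetical keyword strengths (two keywords per document).
vectors = {"a": (0, 0), "b": (5, 5), "c": (1, 0), "d": (4, 5)}
order = nearest_neighbour_sequence(vectors, start="a")
circle = circle_positions(order)
print(order)
```

Documents with similar keyword strengths end up next to each other on the circle, although a greedy pass does not guarantee the best overall ordering.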

ENV 2006 6.18

Visualizing Web Searches

www.kartoo.co.uk
