cse5230 - data mining, 2002 lecture 8.1
CSE5230 - Data Mining, 2002 Lecture 8.1
Data Mining - CSE5230
Information Visualization
CSE5230/DMS/2002/8
CSE5230 - Data Mining, 2002 Lecture 8.2
Lecture Outline
Overview of information visualization
The role of visualization in the process of data mining
The patterns being sought: clusters and outliers
Issues when visualizing higher dimensional relationships
Criteria for comparison
A range of visualization techniques for exploratory data analysis
CSE5230 - Data Mining, 2002 Lecture 8.3
Information Visualization
A conjunction of a number of fields:
  Data Mining
  Cognitive Science
  Graphic Design
  Interactive Computer Graphics
Information Visualization attempts to use visual approaches and dynamic controls to provide understanding and analysis of multidimensional data
The data may have no inherent 2D or 3D semantics and may be abstract in nature. There is no underlying physical model. Much of the data in databases is of this type
CSE5230 - Data Mining, 2002 Lecture 8.4
Role of Information Visualization
Acts as an exploratory tool
Useful for identifying subsets of the data
Structures, trends and outliers may be identified
Statistical tests tend to incorporate isolated instances into a broader model as they attempt to formulate global features
There is no requirement for a hypothesis, but the techniques can also support the formulation of hypotheses if wanted
CSE5230 - Data Mining, 2002 Lecture 8.5
Integrating Visualization with Data Mining
There are four possible approaches:
  Use the visualization technique to present the results of the data mining process.
  Use visualization techniques as complements to the data mining process. They complement and increase understanding in a passive way.
  Use visualization techniques to steer the data mining process. The visualization aids in deciding the appropriate data mining technique to use and appropriate subsets of the data to consider.
  Apply data mining techniques to the visualization rather than directly to the data. The idea is to capture the essential semantics visually then apply the data mining tools.
CSE5230 - Data Mining, 2002 Lecture 8.6
The Process of Knowledge Discovery in Databases (a.k.a. Data Mining)
Figure: The Knowledge Discovery in Databases (KDD) process [AdZ1996]
  Inputs: operational data, external data
  Stages: information requirement, data selection, cleaning & enrichment, coding, data mining, reporting, action
  Cleaning & enrichment: domain consistency, de-duplication, disambiguation
  Data mining: clustering, segmentation, prediction
  Feedback from the later stages returns to the earlier ones
CSE5230 - Data Mining, 2002 Lecture 8.7
Visualization in the Context of the Data Mining Process
Visualization tools can potentially be used at a number of steps in the DM process. But:
  the same tools may not be appropriate at each step
  how they will be used may be different
In general, it is not important whether data visualization is the first step in the process or not
  the feedback loop which moves the process forward may be commenced by either a visualization or a query
  some visualizations (e.g. see slide 25) require an initial query to generate a visualization
  this is an example of a complementary approach
    » questions generate visualizations, which may prompt further questions or generate hypotheses
CSE5230 - Data Mining, 2002 Lecture 8.8
Motivations for Visualization
The human visual system is extremely good at recognizing patterns
  it is quicker and easier to understand visual representations than to absorb information from language or formal notations.
Exploratory visualization assists in:
  identifying areas of interest
  identifying questions which might usefully be asked
i.e. a relevant or revealing visualization of either part or all of a data set may suggest useful questions and/or hypotheses to the analyst. These can then be confirmed by more rigorous approaches
  e.g. some clustering techniques require an initial estimate of the number of clusters present in the data
    » visualization techniques can assist in this estimation
CSE5230 - Data Mining, 2002 Lecture 8.9
Criteria for Comparison of Visualization Tools
Number of dimensions that can be represented
Number of data items that can be handled
Ability to handle categorical and other non-numeric data types
Ability to reveal patterns
Ease of use
Learning curve (to what degree is the technique intuitive)
CSE5230 - Data Mining, 2002 Lecture 8.10
Examples - Scatterplot
For each pair of features (i.e. fields of records) in a multidimensional database, every record is graphed as a point in two dimensions (2D)
  This straightforward graphing procedure produces a simple scatterplot: a projection of the multidimensional data into 2D
The scatterplots of all pair-wise combinations of features are arranged in a matrix
  The figure on the following slide illustrates a scatterplot matrix of 3D data from a study of abrasion loss in tyres. The features are hardness, tensile strength and abrasion loss [Tie1989]
Each “sub-graph” gives insight into the relationship between a pair of features
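The construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `scatterplot_panels` and the record values are made up (they are not the data from [Tie1989]), and only one panel per unordered pair of features is built, whereas a displayed matrix usually shows each pair in both orientations.

```python
from itertools import combinations

def scatterplot_panels(records, features):
    """Return {(fx, fy): [(x, y), ...]} for every unordered pair of features."""
    panels = {}
    for fx, fy in combinations(features, 2):
        # Every record contributes one 2D point to this panel.
        panels[(fx, fy)] = [(r[fx], r[fy]) for r in records]
    return panels

# Illustrative values only, using the feature names from the tyre study.
records = [
    {"hardness": 50, "tensile_strength": 150, "abrasion_loss": 300},
    {"hardness": 60, "tensile_strength": 200, "abrasion_loss": 200},
    {"hardness": 70, "tensile_strength": 220, "abrasion_loss": 150},
]
features = ["hardness", "tensile_strength", "abrasion_loss"]
panels = scatterplot_panels(records, features)
```

Each entry of `panels` holds the points for one sub-graph of the matrix and could be passed to any 2D plotting routine.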
CSE5230 - Data Mining, 2002 Lecture 8.11
Scatterplot Matrix
Scatterplot matrix of abrasion loss data [Tie1989]
CSE5230 - Data Mining, 2002 Lecture 8.12
Possible Problems with Scatterplots
Everitt [Eve1978, p. 5] gives two reasons why scatterplots can prove unsatisfactory:
  if the number of features is greater than ~10, the number of plots to be examined is very large
    » this is just as likely to lead to confusion as to knowledge of the structures in the data.
  structures existing in a multidimensional data set do not necessarily appear in the 2D projections of the features represented in scatterplots (see next slide)
Despite these potential problems, variations on the scatterplot approach are the most commonly used of all the visualization techniques
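To make the first objection concrete: with d features there are d(d-1)/2 distinct pairs of features, so the number of plots grows quadratically. A minimal sketch (the function name `n_pairwise_plots` is just for illustration):

```python
from math import comb

def n_pairwise_plots(n_features):
    """Number of distinct feature pairs, i.e. distinct 2D scatterplots."""
    return comb(n_features, 2)

for d in (3, 10, 20):
    print(d, n_pairwise_plots(d))  # prints: 3 3, then 10 45, then 20 190
```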
CSE5230 - Data Mining, 2002 Lecture 8.13
Scatterplots: recognizing high-dimensional structures - 1
A structure which appears as a cluster in a 2D projection may in fact be a “pipe” in 3D
  a pipe is a structure that looks like a rod or pipe when viewed in a 3D representation
While the pipe is easily identifiable in a 3D display, only projections of it will appear in the 2D components of the scatterplot matrix
  depending on the orientation of the pipe in 3D, it may not appear as an obvious cluster, if at all
Equivalent structures can exist in higher dimensions, e.g. a cluster in 5D might be a “pipe” in 6D
  the appearance of high-D structures in lower-D projections depends on the luck and skill of the analyst in choosing the projections, and on the alignment of the structures to the axes
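The effect can be reproduced with synthetic data. The sketch below assumes a pipe aligned with the z axis (an arbitrary choice for illustration) and compares the extent of its projections: the (x, y) projection is a tight cluster, while the (x, z) projection is a long thin strip.

```python
import random

rng = random.Random(0)  # fixed seed so the example is repeatable

# A synthetic "pipe" aligned with the z axis: narrow in x and y, long in z.
pipe = [
    (rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1), rng.uniform(0.0, 10.0))
    for _ in range(200)
]

def extent(points, axis):
    """Range (max minus min) of the points along one axis."""
    values = [p[axis] for p in points]
    return max(values) - min(values)

xy_extent = (extent(pipe, 0), extent(pipe, 1))  # looks like a small cluster
xz_extent = (extent(pipe, 0), extent(pipe, 2))  # looks like a thin strip
```

If the pipe were rotated away from the axes, none of the axis-aligned projections would show it this cleanly, which is the point made above about alignment.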
CSE5230 - Data Mining, 2002 Lecture 8.14
Scatterplots: recognizing high-dimensional structures - 2
Random (uniform): may be a plane in 3D
A cluster in 2D: may be a pipe in 3D (or a cluster in 3D)
CSE5230 - Data Mining, 2002 Lecture 8.15
Example Tool: Spotfire http://www.spotfire.com/
CSE5230 - Data Mining, 2002 Lecture 8.16
Example Tool: Spotfire http://www.spotfire.com/
The user interacts with data by choosing which features will form the horizontal and vertical axes
Other features can be represented by color
  this is an example of using the richness of visual representations to provide more information to the user. As well as 2D spatial position, other modes such as colour, size, shape and even sound can be used to convey information about high-dimensional data
On the previous slide, the data set contains a 3D cluster in a 4D space (i.e. there are four features)
  There are also some background “noise” instances
  The cluster can be seen, with its centre at around (20, 74)
  all the points in the cluster are red, showing that it’s a 3D cluster
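Encoding an extra feature as colour can be sketched as a simple linear mapping. This is not how Spotfire does it (the slides do not say); the function `value_to_rgb` and the blue-to-red scale are assumptions chosen for illustration.

```python
def value_to_rgb(value, lo, hi):
    """Linearly map a value in [lo, hi] to an (r, g, b) tuple from blue to red."""
    if hi == lo:
        t = 0.0
    else:
        t = (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))  # clamp values outside the range
    return (round(255 * t), 0, round(255 * (1 - t)))

# A point is then drawn at (feature_1, feature_2) with a colour taken from feature_3.
low_colour = value_to_rgb(0.0, 0.0, 10.0)    # pure blue
high_colour = value_to_rgb(10.0, 0.0, 10.0)  # pure red
```

Points whose third feature takes similar values receive similar colours, so a group that is compact in position and uniform in colour is compact in three features at once.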
CSE5230 - Data Mining, 2002 Lecture 8.17
Example Tool: DBMiner http://www.dbminer.com/
CSE5230 - Data Mining, 2002 Lecture 8.18
Example Tool: DBMiner http://www.dbminer.com/
DBMiner is an integrated data mining tool
It employs a data visualization known as a “data cube” (see On-Line Analytic Processing - OLAP)
After creating a data cube, the user can apply a variety of data mining techniques to analyze the data further, including:
  association, classification, prediction and clustering, etc.
The figure on the preceding slide shows a data cube for a data set which has a 3D cluster of data instances in a 3D space
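The slides do not describe how DBMiner builds its cube, so the following is only a generic OLAP-style sketch: a cube is treated as a count of records per cell, where a cell is one combination of values along the chosen dimensions. The names `build_cube` and `roll_up` and the sample records are hypothetical.

```python
from collections import Counter

def build_cube(records, dims):
    """Count records per cell; a cell is a tuple of values along dims."""
    return Counter(tuple(r[d] for d in dims) for r in records)

def roll_up(cube, keep):
    """Aggregate away every dimension except those at the positions in keep."""
    out = Counter()
    for cell, count in cube.items():
        out[tuple(cell[i] for i in keep)] += count
    return out

# Made-up records with three dimensions.
sales = [
    {"region": "north", "product": "A", "year": 2001},
    {"region": "north", "product": "A", "year": 2001},
    {"region": "north", "product": "B", "year": 2002},
    {"region": "south", "product": "A", "year": 2002},
]
cube = build_cube(sales, ["region", "product", "year"])
by_region = roll_up(cube, keep=[0])
```

A dense group of neighbouring cells with high counts is what a cluster looks like in this representation.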
CSE5230 - Data Mining, 2002 Lecture 8.19
Examples: Parallel Coordinates - 1
Uses the idea of mapping a point in a multidimensional feature space on to a number of parallel axes
Each feature is mapped to one axis
  as many axes as needed can be lined up side by side
  there is no limit to the number of dimensions that can be represented
A single polygonal line connects the individual coordinate mappings for each point
The technique has been applied in air traffic control, robotics, computer vision and computational geometry
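A minimal sketch of the mapping, assuming each axis is scaled to [0, 1] from the minimum and maximum of its feature (the scaling and the name `to_polyline` are choices made for this example, not part of the technique's definition):

```python
def to_polyline(point, mins, maxs, axis_spacing=1.0):
    """Map an n-D point to n (x, y) vertices, one on each parallel axis."""
    vertices = []
    for i, (c, lo, hi) in enumerate(zip(point, mins, maxs)):
        # Position along axis i, normalised to [0, 1].
        y = 0.5 if hi == lo else (c - lo) / (hi - lo)
        vertices.append((i * axis_spacing, y))
    return vertices

# Two made-up points in 4D.
points = [(1.0, 20.0, 300.0, 4.0), (3.0, 10.0, 100.0, 8.0)]
mins = [min(col) for col in zip(*points)]
maxs = [max(col) for col in zip(*points)]
lines = [to_polyline(p, mins, maxs) for p in points]
```

Joining the vertices of each entry in `lines` with straight segments gives the polygonal line for that point.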
CSE5230 - Data Mining, 2002 Lecture 8.20
Examples: Parallel Coordinates - 2
Figure: parallel axes $X_1, X_2, X_3, \ldots, X_{i-1}, \ldots, X_n$ for $R^n$. The polygonal line shown represents the point $C = (C_1, \ldots, C_{i-1}, C_i, C_{i+1}, \ldots, C_n)$
CSE5230 - Data Mining, 2002 Lecture 8.21
Examples: Parallel Coordinates - 3
The Parallel Coordinates visualization technique is employed in the software WinViz http://www.computer.org/intelligent/ex1996/x5069abs.htm
The main advantage of the technique is that it can represent unlimited numbers of dimensions
When many points are represented using the parallel coordinates, the overlap of the polygonal lines can make it difficult to identify structures in the data.
Certain structures, such as clusters, can often be identified but others are hidden due to the overlap.
CSE5230 - Data Mining, 2002 Lecture 8.22
Two Clusters In WinViz
CSE5230 - Data Mining, 2002 Lecture 8.23
Examples: Stick Figures
The stick figure technique is intended to make use of the user’s low-level perceptual processes [PGL1995], such as perception of: texture, color, motion, and depth
The hope is that the user will “automatically” try to make physical sense of the pictures of the data created
Visualizations which represent multidimensional feature spaces by using a number of subspaces of 3D or less (e.g. scatterplots) rely more on our cognitive abilities than our perceptual abilities
Stick figures avoid this, and present all variables and data points in a single representation.
CSE5230 - Data Mining, 2002 Lecture 8.24
Iconographic display using stick figures - US Census Data
http://ivpr.cs.uml.edu/gallery/
CSE5230 - Data Mining, 2002 Lecture 8.25
Examples: Pixel-based techniques http://www.dbs.informatik.uni-muenchen.de/dbs/projekt/visdb/visdb.html
Query-Dependent Pixel-based Techniques
  based on a query, a “semantic distance” is calculated between each of the query feature values and the features of each instance in the DB
  Distance is mapped to colour for each attribute
  Overall distance between the data values for a specific instance and the data attribute values used in the predicate of the query is also calculated
Instances are arranged on the screen, with the data items with highest relevance in the centre of the display, and then proceeding outwards in a spiral
the values for each of the attributes are presented in separate subwindows
the arrangement inside the subwindows is according to the overall distance
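The ordering and placement steps can be sketched as follows. The slides do not give the distance functions used in VisDB, so absolute difference per attribute and an unweighted sum for the overall distance are assumptions, as are all the function names; the colour mapping is left out.

```python
def attribute_distances(instance, query):
    """Per-attribute distance to the query value (absolute difference, an assumption)."""
    return {a: abs(instance[a] - target) for a, target in query.items()}

def overall_distance(distances):
    """Unweighted sum of the per-attribute distances (an assumption)."""
    return sum(distances.values())

def spiral_positions(n):
    """Yield n grid cells spiralling outwards from the centre cell (0, 0)."""
    x = y = 0
    dx, dy = 1, 0
    step, produced = 1, 0
    while produced < n:
        for _ in range(2):           # two legs for each leg length
            for _ in range(step):
                if produced == n:
                    return
                yield (x, y)
                produced += 1
                x, y = x + dx, y + dy
            dx, dy = -dy, dx         # turn 90 degrees
        step += 1

def layout(instances, query):
    """Place the most relevant instance in the centre, the rest outwards."""
    scored = sorted(
        instances,
        key=lambda inst: overall_distance(attribute_distances(inst, query)),
    )
    return list(zip(spiral_positions(len(scored)), scored))

db = [{"age": 30, "income": 50}, {"age": 45, "income": 52}, {"age": 31, "income": 50}]
query = {"age": 30, "income": 50}
placed = layout(db, query)
```

Each attribute's subwindow would reuse the same positions and colour the pixel at each position by that attribute's own distance.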
CSE5230 - Data Mining, 2002 Lecture 8.26
Query-Dependent Pixel-based Techniques
Result of a complex query [KeK1994]
Overall Distance
CSE5230 - Data Mining, 2002 Lecture 8.27
Examples: Worlds within Worlds http://www.cs.columbia.edu/graphics/projects/AutoVisual/AutoVisual.html
Employs virtual reality devices to represent an nD virtual world in 3D or 4D-Hyperworlds
  basic approach to reducing the complexity of a multidimensional function is to hold one or more of its independent variables constant
    » equivalent to taking an infinitely thin slice of the world perpendicular to the constant variable’s axis
  can be repeated until there are 3 dimensions and the resulting slice can be manipulated and displayed with conventional 3D graphics hardware
After reducing the higher-dimensional space to 3 dimensions the additional dimensions can be added back, by adding additional 3D worlds within the first 3D world
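Holding variables constant amounts to partially applying the function. In the sketch below the five-variable function `f` and the helper `hold_constant` are hypothetical; fixing `u` and `v` leaves a function of `x`, `y` and `z` that could be displayed as an ordinary 3D world.

```python
def hold_constant(f, fixed):
    """Return f restricted to the slice where the variables in `fixed` are constant."""
    def sliced(**free):
        return f(**fixed, **free)
    return sliced

def f(x, y, z, u, v):
    """A made-up function of five independent variables."""
    return x + 2 * y + 3 * z + 4 * u + 5 * v

# Fix u and v: the remaining function depends on x, y and z only.
inner_world = hold_constant(f, {"u": 1.0, "v": 2.0})
value = inner_world(x=0.0, y=1.0, z=1.0)
```

Choosing different constants for `u` and `v` selects a different slice, which corresponds to moving the inner world to a different position inside the outer one.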
CSE5230 - Data Mining, 2002 Lecture 8.28
Worlds within Worlds
CSE5230 - Data Mining, 2002 Lecture 8.29
Dynamic Techniques
Allow interaction with the visualization to explore the data more effectively.
Can potentially be applied to all visualization techniques
  Dynamic linking of the data attributes to the parameters of the visualization.
  Filtering
  Linking and “brushing” between multiple visualizations
  Zooming
  Details on demand
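Linking and brushing can be sketched as a selection made in one view and carried over to another by record index. The function names, feature names and records below are all made up for illustration.

```python
def brush(records, fx, fy, x_range, y_range):
    """Return the indices of records inside a rectangular brush in the (fx, fy) view."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return {
        i for i, r in enumerate(records)
        if x_lo <= r[fx] <= x_hi and y_lo <= r[fy] <= y_hi
    }

def linked_view(records, fx, fy, selected):
    """Return (x, y, is_selected) for every record in a second view."""
    return [(r[fx], r[fy], i in selected) for i, r in enumerate(records)]

records = [
    {"a": 1, "b": 1, "c": 10, "d": 5},
    {"a": 2, "b": 2, "c": 20, "d": 6},
    {"a": 8, "b": 9, "c": 30, "d": 7},
]
selected = brush(records, "a", "b", (0, 3), (0, 3))
view2 = linked_view(records, "c", "d", selected)
```

Records brushed in the (a, b) view are flagged in the (c, d) view, so the analyst can see where the same instances fall under a different pair of features.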
CSE5230 - Data Mining, 2002 Lecture 8.30
Other Techniques
Keim and Kriegel’s query independent approach
Chernoff faces
  http://www.fas.harvard.edu/~stats/Chernoff/Hcindex.htm
Cone trees
Perspective walls
Visualization Spreadsheet
A number of techniques especially developed for web pages and their links
CSE5230 - Data Mining, 2002 Lecture 8.31
References
[AdZ1996] P. Adriaans and D. Zantinge, Data Mining, Addison-Wesley, 1996
[BeS1997] A. Berson and S. J. Smith, Data Warehousing, Data Mining and OLAP, McGraw-Hill, 1997
[Eve1978] B. S. Everitt, Graphical Techniques for Multivariate Data, Heinemann Educational Books Ltd., London, 1978
[KeK1994] D. A. Keim and H.-P. Kriegel, VisDB: Database Exploration using Multidimensional Visualization, Computer Graphics & Applications, Sept. 1994, pp. 40-49
[LeG1993] Database issues for data visualization, Proceedings of the IEEE Visualization '93 Workshop, J. P. Lee and G. G. Grinstein (eds), San Jose, California, USA, October 26, 1993
[PGL1995] R. M. Pickett, G. Grinstein, H. Levkowitz and S. Smith, Harnessing Preattentive Perceptual Processes in Visualization, pp. 9-21 in Perceptual Issues in Visualization (eds G. Grinstein & H. Levkowitz), Springer-Verlag, Berlin, 1995
[Thu1999] B. Thuraisingham, Data Mining: Technologies, Techniques, Tools, and Trends, CRC Press LLC, Boca Raton, Florida 1999
[Tie1989] L. Tierney, XLISP-STAT: A Statistical Environment Based on the XLISP Language (Version 2.0), University of Minnesota School of Statistics, Technical Report Number 528, July 1989
[WGL1996] Database issues for data visualization, Proceedings of the IEEE Visualization '95 Workshop, A. Wierse, G. G. Grinstein and U. Lang, (eds), Atlanta, Georgia, USA, October 28, 1995