
A Plot for Visualizing Multivariate Data

Rida E. A. Moustafa¹

Email: [email protected], [email protected]

April 8, 2003

¹Dr. Rida Moustafa is the Director of the Data Mining and Knowledge Discovery Group at AAL.

Abstract

Due to rapid growth in data set sizes, in both dimensions and observations, there is an urgent need for sophisticated techniques that allow the analyst to visually discover hidden patterns that might exist in a given data set. In this paper, we introduce a new technique for visualizing large multivariate data by means of dimension reduction. This technique allows the user to visually detect hidden structures in any given large, complex multivariate data set in an efficient and interactive way compared to widely used dimension reduction techniques.

Keywords: visual data mining, pattern recognition, nonlinear transformation.

1 Introduction

Visualizing multivariate data is an indirect approach in data mining and knowledge discovery processes. It is considered one of the most important steps in exploring hidden structures and attaining a better understanding of the relationships in a data set. High-dimensional data are hard to visualize. Moreover, most conventional multivariate visualization techniques do not scale well to large numbers of observations; hence, it is difficult to interactively visualize large and complex data sets. Therefore, more attention must be devoted to developing new techniques and software for visual data mining that can deal with complex, large multivariate data.

It is worth mentioning some of the dimension reduction techniques that are widely used to visualize multivariate data: multidimensional scaling (MDS), principal component analysis (PCA), Fisher discriminant analysis (FDA), projection pursuit, and self-organizing maps (SOM).

In this paper, we introduce an efficient algorithm for visualizing large multivariate data sets. It has strong mathematical and statistical foundations that assist the analyst not only in visually distinguishing classes in the data set but also in understanding its underlying geometric structures.

Our algorithm is constructed of two consecutive projections. The first is a linear projection (m), produced by computing the distance of each observation from the origin. The second projection (v) is produced by constructing a nonlinear function of the data and the first projection, and is computed by the following steps:

• find the deviations of the observations from the first projection

• compute the distance of these deviated observations from the origin

• examine the relationship between the two projections in a scatter-plot view

We demonstrate the efficiency of this mv-method in visualizing multivariate data and its application in several areas of statistical data mining and pattern recognition.

The rest of the article consists of the following main sections: Section 2 discusses the theory of the introduced technique, with illustrations using simulated data. In Section 3, we demonstrate the power of our method on three real data sets and compare our results with well-known existing methods for visualizing multivariate data by means of dimension reduction.

2 The theory of the mv-algorithm

The mv-plot is a 2-dimensional plot of a large multivariate data set. It is designed to assist the analyst in visually exploring the underlying structure(s) of a given data set, and it can be considered a technique for visual clustering and classification. Furthermore, it allows us to recognize geometric objects such as hyper-planes, hyper-spheres, etc. The algorithm for the mv-plot consists of linear and nonlinear invariant transformation processes that map points in $R^d$ into points in $R^1$.

The mapping processes can be described as follows. Given an observation $x \in R^d$, $x = (x_1, x_2, \cdots, x_d)$, then

$$m = f(x) = \frac{1}{d} \sum_{j=1}^{d} |x_j| \quad \text{and} \quad v = \sqrt{g(x, f(x))} = \sqrt{\frac{1}{d} \sum_{j=1}^{d} |x_j - f(x)|^2}.$$

In general, for a multivariate data set of $n$ observations in $d$-dimensional space, say $X = \{x_{ij} \mid i = 1, 2, \cdots, n;\; j = 1, 2, \cdots, d\}$, the mapping process will produce the vectors $m$ and $v$, both of length $n$. This is represented by the following equations:

$$m_i = \frac{1}{d} \sum_{j=1}^{d} |x_{ij}|, \qquad v_i = \left( \frac{1}{d} \sum_{j=1}^{d} (x_{ij} - m_i)^2 \right)^{1/2}.$$
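For concreteness, the whole mapping is a single pass over the data matrix. Below is a minimal NumPy sketch of these two equations (the function name mv_transform is our own label, not from the paper):

```python
import numpy as np

def mv_transform(X):
    """Map an (n, d) data matrix to the mv-plane:
    m_i = (1/d) sum_j |x_ij|,  v_i = sqrt((1/d) sum_j (x_ij - m_i)^2)."""
    X = np.asarray(X, dtype=float)
    m = np.abs(X).mean(axis=1)                          # linear projection
    v = np.sqrt(((X - m[:, None]) ** 2).mean(axis=1))   # deviation from m
    return m, v
```

Plotting v against m gives the mv-plot; each observation contributes one point regardless of d, which is what makes the method scale to large dimensions.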

Understanding the mathematical and statistical relationship between $m$ and $v$ is essential in order to discover any pattern and to explain the results. Therefore, we investigate the relationship from several mathematical and statistical perspectives on simulated data in two, three and hyper-dimensional spaces. Furthermore, we consider several real data sets that have been frequently studied for the same purposes of classification, clustering and visual data mining. Finally, we compare the mv-plot results with other effective techniques that have been used for the same purposes, such as principal components, multidimensional scaling and Fisher discriminant analysis.

2.1 The mv plot in two dimensions

In this part, we discuss the situation where the data $X \in R^{n \times 2}$. This allows us to show and understand the relationship between the 2-dimensional original space and its projection into the 2-dimensional mv-space. We assume the data are scaled to be positive real numbers; the same holds for all cases considered in this study. In this case, the $m$ and $v$ values are computed as follows:

$$m_i = \frac{1}{2}(x_{i1} + x_{i2}),$$

$$v_i^2 = \frac{1}{2}\left(|x_{i1} - m_i|^2 + |x_{i2} - m_i|^2\right) = \frac{1}{4}\,|x_{i1} - x_{i2}|^2 \;\Rightarrow\; v_i = \frac{1}{2}\,|x_{i2} - x_{i1}|.$$

This is exactly like computing the principal component in two dimensions. Assume our data are drawn from a linear model of the form $x_{i2} = w_1 x_{i1} + w_0$; then:

$$m_i = \frac{1}{2}\left((w_1 + 1)x_{i1} + w_0\right), \qquad v_i = \frac{1}{2}\left|(w_1 - 1)x_{i1} + w_0\right|.$$

Consider weights large enough that $w_1 - 1 \approx w_1 + 1 = a_1$ and $w_0 = a_2$, both positive reals. Then the relationship between $m$ and $v$ is linear. Therefore, a linear pattern in the original space is mapped into a linear pattern in the mv-space by scaling the data set.

As an example, we consider three cases: a single line, intersecting lines, and parallel lines, shown at the top of Figure 1. Their mv-plots are shown at the bottom of the same figure. One can notice that the structure is preserved under the projection.
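A short script in the spirit of Figure 1 shows this preservation directly (the slopes and intercepts below are our own choices, not the paper's); it reuses mv_transform from the sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2, 2, 200)
fig, (ax0, ax1) = plt.subplots(1, 2)
for w1, w0 in [(5.0, 10.0), (5.0, 20.0)]:   # two parallel lines y = w1*x + w0
    X = np.column_stack([x, w1 * x + w0])
    ax0.plot(X[:, 0], X[:, 1])              # original (x, y) space
    m, v = mv_transform(X)
    ax1.scatter(m, v, s=4)                  # mv-space: two linear traces
ax0.set(xlabel="x", ylabel="y"); ax1.set(xlabel="m", ylabel="v")
plt.show()
```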

2.2 The mv plot in three dimensions

In this subsection, we study the situation where the data $X \in R^{n \times 3}$. This, of course, helps in the generalization process and shows the relationship between the projection of three-dimensional structures in the data set and their two-dimensional counterparts. Computing the $m$ and $v$ values for the data in $X$, we get:

$$m_i = \frac{1}{3}(x_{i1} + x_{i2} + x_{i3}),$$

[Figure: a single line, intersecting lines and parallel lines in (x, y) (top row) and their mv-plots in (m, v) (bottom row).]

Figure 1: mv-plot of line(s).

$$v_i^2 = \frac{1}{3}\left(|x_{i1} - m_i|^2 + |x_{i2} - m_i|^2 + |x_{i3} - m_i|^2\right)$$

$$\phantom{v_i^2} = \frac{1}{27}\left(|2x_{i1} - (x_{i2} + x_{i3})|^2 + |2x_{i2} - (x_{i3} + x_{i1})|^2 + |2x_{i3} - (x_{i2} + x_{i1})|^2\right).$$

Assume a linear model of the form $x_{i3} = w_1 x_{i1} + w_2 x_{i2} + w_0$; then:

$$m_i = \frac{1}{3}\left((w_1 + 1)x_{i1} + (w_2 + 1)x_{i2} + w_0\right);$$

$$v_i^2 = \frac{1}{27}\Big(|(w_1 - 2)x_{i1} + (w_2 + 1)x_{i2} + w_0|^2 + |(w_2 - 2)x_{i2} + (w_1 + 1)x_{i1} + w_0|^2 + |(2w_1 - 1)x_{i1} + (2w_2 - 1)x_{i2} + 2w_0|^2\Big).$$

Suppose the weights are large enough that $w_1 + 1 \approx w_1 - 2 \approx w_1 - 0.5 = a_1$, $w_2 + 1 \approx w_2 - 2 \approx w_2 - 0.5 = a_2$, and $w_0 = a_3$. Then:

$$m_i \approx a_1 x_{i1} + a_2 x_{i2} + a_3, \qquad v_i \approx a_1 x_{i1} + a_2 x_{i2} + a_3.$$

From these two cases, we expect the relationship between $m_i$ and $v_i$ to be nearly linear. We show this with an example: we consider random data that fit plane(s) in 3-d. In Figure 2, at top-left and bottom-left respectively, we show a plane and two planes in the original space. The corresponding line and two lines in mv-space are shown, respectively, at top-right and bottom-right of the figure.
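The following sketch mirrors the two-planes case (the plane coefficients are our own illustration, not the paper's):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x1, x2 = rng.uniform(1, 5, size=(2, 500))       # positive predictors
fig, ax = plt.subplots()
for w0 in (10.0, 40.0):                         # two parallel planes x3 = 8*x1 + 6*x2 + w0
    X = np.column_stack([x1, x2, 8 * x1 + 6 * x2 + w0])
    m = np.abs(X).mean(axis=1)
    v = np.sqrt(((X - m[:, None]) ** 2).mean(axis=1))
    ax.scatter(m, v, s=4, label=f"w0 = {w0}")   # each plane maps to a linear trace
ax.set(xlabel="m", ylabel="v"); ax.legend(); plt.show()
```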

[Figure: a plane (top-left) and two planes (bottom-left) in (x, y, z), with the corresponding line(s) in (m, v) space at top-right and bottom-right.]

Figure 2: mv-plot of 3-d plane(s).

2.3 The mv plot in d-dimensions

In this part, we consider data $X \in R^{n \times d}$ drawn from a linear model given by the relationship $x_{id} = w_0 + \sum_{j=1}^{d-1} w_j x_{ij}$. After some restrictions on the weights, we can express $m$ and $v$ as:

$$m_i = \frac{1}{d}\left[\sum_{j=1}^{d-1} (w_j + 1)\,x_{ij} + w_0\right],$$

$$v_i = \frac{d-1}{d^2}\left[\sum_{j=1}^{d-1} \big((d-1)w_j - 1\big)\,x_{ij} + (d-1)\,w_0\right].$$

Of course, we can adjust the weights to be high enough, as in the 3-d situation, and we will have a linear relationship between $m$ and $v$:

$$m_i = \sum_{j=1}^{d-1} a_j x_{ij} + a_d, \qquad v_i = \sum_{j=1}^{d-1} a_j x_{ij} + a_d.$$

Therefore, the linear relationship can be detected, even for very large dimensions, by scaling the weights and, on real data, by scaling the variables.
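As a quick numerical check of this claim (the weights and ranges below are arbitrary choices of ours), one can draw data from such a linear model and verify that m and v are nearly collinear:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
w = rng.uniform(50, 100, size=d - 1)        # large positive weights
X = rng.uniform(1, 10, size=(n, d - 1))     # positive predictors
X = np.column_stack([X, X @ w + 5.0])       # x_id = w0 + sum_j w_j * x_ij

m = np.abs(X).mean(axis=1)
v = np.sqrt(((X - m[:, None]) ** 2).mean(axis=1))
print(np.corrcoef(m, v)[0, 1])              # prints a value close to 1
```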


2.4 Detecting nonlinear data with mv-plot

In this subsection, we show that it is possible to detect nonlinear patterns using the mv-technique. Let us consider nonlinear data sampled from the sin and cos functions on $[0, 2\pi]$. Their plots in the original space are shown at the top (left and right) of Figure 3, and their plots in the mv-space are shown at the bottom (left and right). Notice that, in the mv-space (ignoring the factors of $\frac{1}{2}$ and the absolute values in the definitions), for $(x, \cos(x))$ we have $(m, v) = (x + \cos(x),\, |x - \cos(x)|)$, and for $(x, \sin(x))$ we have $(m, v) = (x + \sin(x),\, |x - \sin(x)|)$. This explains why the produced plots are rotated sin and cos curves in the mv-space.
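A small script mirroring Figure 3 (our own reconstruction of the example):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 400)
for f, name in [(np.sin, "sin"), (np.cos, "cos")]:
    X = np.column_stack([x, f(x)])
    m = np.abs(X).mean(axis=1)                          # (|x| + |f(x)|) / 2
    v = np.sqrt(((X - m[:, None]) ** 2).mean(axis=1))   # |x - f(x)| / 2 where f(x) >= 0
    plt.scatter(m, v, s=3, label=name)
plt.xlabel("m-value"); plt.ylabel("v-value"); plt.legend(); plt.show()
```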

[Figure: sin(x) and cos(x) on [0, 2π] (top row) and their mv-plots, m-value vs. v-value (bottom row).]

Figure 3: mv-plot of sin and cos functions.

In fact, we can employ the mv-technique to detect sphere(s) in hyper-dimensions. Let the given data be distributed on a sphere with radius $r$, centered at the origin. Computing the v-values based on the $L_2$ norm, we get:

$$v_i^2 = \frac{1}{d}\sum_{j=1}^{d}(x_{ij} - m_i)^2 = \frac{1}{d}\left(\sum_{j=1}^{d} x_{ij}^2 - d\, m_i^2\right) \;\Rightarrow\; v_i^2 + m_i^2 = \frac{r^2}{d},$$

using $\sum_j x_{ij} = d\, m_i$, which holds since the data are scaled to be positive.


[Figure: sphere(s) in (x, y, z) and their corresponding circle(s) in mv-space, plotted as m vs. sqrt(v).]

Figure 4: mv-plot of 3-d sphere(s).

In general, for non-zero centered spheres, we have:

$$v_i^2 = \frac{1}{d}\sum_{j=1}^{d}\big((x_{ij} - x_c) + (x_c - m_i)\big)^2 = \frac{1}{d}\left(\sum_{j=1}^{d}(x_{ij} - x_c)^2 - d\,(x_c - m_i)^2\right) \;\Rightarrow\; v_i^2 + (m_i - x_c)^2 = \frac{r^2}{d},$$

where $x_c$ denotes the common center coordinate.

Therefore, sphere(s) in the hyper-dimensional original space become circle(s) in the mv-space. Let us consider data points that are randomly distributed on a sphere, shown in 3 dimensions at the top-left of Figure 4. Its plot in mv-space is a circle, shown at the top-right of the figure. At bottom-right, we have two separated spheres with different radii; the corresponding circles in mv-space are shown at the bottom-left of the figure.
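To see the circle numerically, here is a sketch of the non-zero-centered case in d = 5 (the center and radius are our own choices; note the derivation assumes positive coordinates):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
d, n, r, c = 5, 2000, 1.0, 3.0
U = rng.normal(size=(n, d))
S = r * U / np.linalg.norm(U, axis=1, keepdims=True)  # points on a radius-r sphere
X = S + c                                             # shift into the positive orthant

m = np.abs(X).mean(axis=1)
v = np.sqrt(((X - m[:, None]) ** 2).mean(axis=1))
# Theory: v^2 + (m - c)^2 = r^2 / d, a circle of radius r/sqrt(d) centered at (c, 0)
plt.scatter(m, v, s=2); plt.xlabel("m"); plt.ylabel("v"); plt.axis("equal"); plt.show()
```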


3 Results and Discussions

3.1 The Iris data set

In this subsection, we present our results on the well-known Fisher's Iris data, which can be obtained from http://www.cmu.edu. It consists of 150 observations containing four measurements based on the petals and sepals of three species of Iris: Iris setosa, Iris virginica, and Iris versicolor. This data set is an appropriate choice, as it is frequently used to illustrate classification, clustering or visualization techniques (Venables, Ripley 1998), (Martinez 2002).

The visualization shown in Figure 5 demonstrates that the mv-plot achieves as satisfactory a result as the three well-known plots, namely the principal component plot, the multidimensional scaling plot, and Fisher's discriminant plot. All the plots, including ours, show that the data consist of three groups and that one of the three (Iris setosa) is highly separated from the other two. Moreover, according to the mv-plot theory, we can confirm that the three groups lie on three hyper-planes in $R^4$.
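To reproduce the MV panel, a minimal sketch follows (using scikit-learn's bundled copy of the Iris data, which may differ slightly from the CMU source):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target                         # 150 observations, 4 measurements
m = np.abs(X).mean(axis=1)
v = np.sqrt(((X - m[:, None]) ** 2).mean(axis=1))
for k, name in enumerate(iris.target_names):          # setosa, versicolor, virginica
    plt.scatter(m[y == k], v[y == k], s=10, label=name)
plt.xlabel("m-values"); plt.ylabel("v-values"); plt.legend(); plt.show()
```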

[Figure: four panels, PC (first vs. second principal component), MDS (MDS x- vs. y-values), FLDA (first vs. second discriminant variable) and MV (m-values vs. v-values), each showing the three Iris species coded s, c, v.]

Figure 5: mv-plot of the Iris data: comparison.

3.2 The process control data set

The second data set is the Synthetic Control Chart Time Series. It contains 600 examples of control charts synthetically generated by the process in Alcock and Manolopoulos (1999). The data set lies in a 60-dimensional space. There are six different classes of control charts: 1-normal, 2-cyclic, 3-increasing trend, 4-decreasing trend, 5-upward shift, 6-downward shift. This time series data set was listed in the Knowledge Discovery Cup 1999 as a challenging problem for the data mining community and can be obtained from http://kdd.ics.uci.edu. The data set is generally used for testing classification, clustering and visualization techniques. Several clustering techniques have been tested on this particular data set and did not reveal the existing clusters (Pham, Chan 1998). Our goal here is to visually explore the existing six classes using the mv-algorithm and to compare our results with other techniques. In Figure 6, we can see that the PCP and FLDA methods are able to distinguish the first and second classes properly, but there is no clear discrimination between classes (three, five) and (four, six). The MDS method did not show groups at all. Comparing this with the mv-plot, there is no doubt that one can see six classes in this particular data set. Moreover, the computational effort is negligible compared with the other methods studied here.
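(A single pass computes m and v in O(nd) time, with no pairwise distance matrix or eigendecomposition, which explains the negligible cost.) A sketch of the MV panel, assuming a local copy of the UCI file synthetic_control.data with 600 rows of 60 values and the six classes in consecutive blocks of 100:

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.loadtxt("synthetic_control.data")      # assumed shape: (600, 60)
labels = np.repeat(np.arange(1, 7), 100)      # classes 1-6 in blocks of 100 rows
m = np.abs(X).mean(axis=1)
v = np.sqrt(((X - m[:, None]) ** 2).mean(axis=1))
for k in range(1, 7):
    plt.scatter(m[labels == k], v[labels == k], s=5, label=f"class {k}")
plt.xlabel("m-values"); plt.ylabel("v-values"); plt.legend(); plt.show()
```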

[Figure: four panels, PC (first vs. second principal component), MDS (MDS x- vs. y-values), FLDA (first vs. second discriminant variable) and MV (m-values vs. v-values), each showing the six control-chart classes coded 1-6.]

Figure 6: mv-plot of the process control time series data: comparison.

3.3 The Pollen data set

This data set consists of 3,848 observations in 5-dimensional space. It is a challenging data set for statisticians, offered by the American Statistical Association. It is worth mentioning here that most of the existing techniques, including the ones we report on, need more computer power (storage and speed) to complete the process; therefore, we report only on our plot. In Figure 7, we see a filled circular shape and a line coming out of it. According to the mv-plot theory, one can tell that the circular shape is a hyper-sphere in 5 dimensions and the linear shape is a hyper-plane in 5 dimensions. We extract the linear pattern from the data set and visualize it in Figure 8, which shows the word (EUREKA). In fact, our discoveries with the mv-plot agree with those found in (Wegman 2002) and with others that have successfully analyzed this data set.
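The paper does not spell out how the linear pattern was extracted; one plausible sketch is to brush the mv-plane, keeping the observations whose (m, v) points fall along the linear branch (the file name, threshold and viewing axes below are all our own assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.loadtxt("pollen.data")                 # assumed local copy, shape (3848, 5)
m = np.abs(X).mean(axis=1)
v = np.sqrt(((X - m[:, None]) ** 2).mean(axis=1))

line = v > np.quantile(v, 0.95)               # assumed: the linear branch has extreme v
plt.scatter(X[line, 0], X[line, 1], s=2)      # view the brushed observations
plt.xlabel("variable 1"); plt.ylabel("variable 2"); plt.show()
```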

Figure 7: mv-plot of the Pollen data: detecting both linear and nonlinear structure.

4 Summary and Future work

We introduced an efficient algorithm for visualizing hyper-dimensional data sets. We demonstrated the efficiency of the algorithm on several complex simulated and real data sets in very high-dimensional spaces and were able to detect the existing patterns visually. We compared the algorithm on the same data sets with principal component analysis, Fisher discriminant analysis, and multidimensional scaling. In most cases, our algorithm achieved better results than the aforementioned algorithms; in some cases, those algorithms did not work at all because of computer speed and storage limitations.


Figure 8: mv-plot of the hidden linear structure in the Pollen data.

5 Acknowledgements

The author would like to thank the computational statistics group at George Mason University for their help and support, especially Dr. John Miller and Dr. Edward Wegman, and AAL's Advanced Data Mining Group.

This work was funded in part by the Air Force Office of Scientific Research under contract F49620-01-1-0274.

6 References

Pham, D. and Chan, A. (1998) "Control Chart Pattern Recognition using a New Type of Self Organizing Neural Network," Proc. Instn Mech. Engrs, Vol. 212, No. 1, pp. 115-127.

Mitchie, Spiegelhalter, and Taylor (1994) Machine Learning, Neural and Statistical Classification (STATLOG project).

Martinez, W. and Martinez, A. (2002) Computational Statistics Handbook with MATLAB, CRC Press.

Venables, W. and Ripley, B. (1998) Modern Applied Statistics with S-PLUS, Second Edition, Springer.

Everitt, B. and Dunn, G. (2001) Applied Multivariate Data Analysis, Second Edition.

Everitt, B. and Dunn, G. (1983) Advanced Methods of Data Exploration and Modeling, Heinemann, London.

Everitt, B. and Dunn, G. (1991) Applied Multivariate Data Analysis, John Wiley & Sons, New York.

Fienberg, S. (1979) "Graphical methods in statistics," The American Statistician, 33, 165-178.

Fishback, W. T. (1962) Projective and Euclidean Geometry, John Wiley & Sons, New York.

Kaufman, L. and Rousseeuw, P. (1990) Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York.

Mortenson, M. E. (1995) Geometric Transformation, Industrial Press Inc., New York.

Moustafa, R. E. (2001) Fast Conceptual Clustering Algorithm for Data Mining and Visualization, Ph.D. Dissertation, George Mason University, Fairfax, VA.

Wegman, E., Carr, D., and Luo, Q. (1993) "Visualizing Multivariate Data," in Multivariate Analysis: Future Directions (C. R. Rao, ed.), Amsterdam: North Holland, 423-466.

Wegman, E. and Luo, Q. (1997) "High dimensional clustering using parallel coordinates and the grand tour," Computing Science and Statistics, 28, 352-360.

Wegman, E. and Solka, J. (2002) "On some mathematics for visualizing high dimensional data," Sankhya, 64(2), 429-452.

Wegman, E. (2003) "Visual data mining," Statistics in Medicine (in press).