
Intelligent Database Systems Lab

Advisor : Dr. Hsu

Graduate : Ching-Lung Chen

Author : Victoria J. Hodge

Jim Austin

Hierarchical Growing Cell Structures: TreeGCS

國立雲林科技大學 (National Yunlin University of Science and Technology)

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 13, NO.2, MARCH/APRIL 2001


Outline

Motivation
Objective
Introduction
GCS
TreeGCS
Evaluation
  Single-pass TreeGCS
  Cyclic TreeGCS
Conclusions
Personal Opinion
Review


Motivation

The GCS network topology is susceptible to the ordering of the input vectors.

The original algorithm does not scale to visualization of dendrograms for large data sets, as there are too many leaf nodes and branches to visualize.

Parameter selection is a combinatorial problem.


Objective

To overcome the instability problem in the GCS approach.

To overcome the difficulty of visualizing dendrograms for large data sets.

To provide recommendations for effective parameter combinations for TreeGCS that are easily derived.


Introduction 1/3

Clustering algorithms have been investigated previously

However, nearly all clustering techniques suffer from at least one of the following:

1. They assume specific forms for the probability distribution, e.g., normal.

2. They require unique global minima of the input probability distribution.

3. They cannot handle identical cluster similarities.

4. They do not scale well, as the training time is often O(n²).

5. They require prior knowledge to set parameters.


Introduction 2/3

The hierarchy may be formed agglomeratively (bottom-up) by progressively merging the most similar clusters.

TreeGCS is an unsupervised, growing, self-organizing hierarchy of nodes able to form discrete clusters. In TreeGCS, high dimensional inputs are mapped onto a two-dimensional hierarchy reflecting the topological ordering of the input space.

TreeGCS is similar to HiGS.


Introduction 3/3

However, the structure of HiGS does not match our requirements.

1. The topology induced by HiGS is not a tree configuration, as the parent must be a member of a cluster of cardinality at least three.

2. The HiGS algorithm generates child clusters and periodically deletes superfluous children so, at any particular time, the tree representation may be incorrect.

Our proposal maintains the correct cluster topology at each epoch.


GCS 1/7

GCS is a two-dimensional structure of cells linked by vertices. Each cell has a neighborhood defined as those cells directly linked by a vertex to the cell.

The adaptation strength is constant over time, and only the best matching unit (bmu) and its direct topological neighbors are adapted, unlike in the SOM.

Each cell has a winner counter denoting the number of times that cell has been the bmu.

GCS 2/7

The GCS algorithm is described below. Step (1) initializes the structure; steps (2)-(7) represent one iteration.

1. Initialize a random triangular structure of connected cells, each with an attached vector (w) and a winner counter (E).

2. The next random input vector is selected from the input vector density distribution.

3. The bmu is determined for the input vector, and the bmu's winner counter is incremented.


GCS 3/7

4. The bmu and its neighbors are adapted toward the input vector by adaptation increments set by the user.

5. If the number of input signals exceeds a threshold set by the user, a new cell (w_new) is inserted between the cell with the highest winner counter (w_bmu) and its farthest neighbor (w_f) (see Fig. 2).


GCS 4/7

5. (cont.) The winner counter of all neighbors of w_new is redistributed, donating fractions of the neighboring cells' winner counters to the new cell.

The winner counter for the new cell is set to the total amount decremented from its neighbors.

6. After a user-specified number of iterations, the cell with the greatest mean Euclidean distance between itself and its neighbors is deleted and any cells within the neighborhood that would be left “dangling” are also deleted (see Fig. 3).


GCS 5/7

7. The winner counter of all cells is decreased by a user-specified factor to implement temporal decay.
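Steps (2)-(4) and (7) above can be sketched in Python. This is a minimal illustration only, not the authors' implementation; the parameter values (EPS_B, EPS_N, ALPHA) and all names are assumptions, and the insertion and deletion steps (5)-(6) are omitted for brevity.

```python
import math

# Assumed parameter values, for illustration only.
EPS_B = 0.06    # adaptation step for the bmu
EPS_N = 0.002   # adaptation step for its neighbors
ALPHA = 0.05    # temporal decay factor

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Cell:
    def __init__(self, w):
        self.w = list(w)        # attached vector
        self.E = 0.0            # winner counter
        self.neighbors = set()  # cells directly linked by a vertex

def gcs_iteration(cells, xi):
    """Steps (2)-(4) and (7) for a single input vector xi."""
    # Step 3: find the best matching unit and increment its winner counter.
    bmu = min(cells, key=lambda c: dist(c.w, xi))
    bmu.E += 1
    # Step 4: adapt the bmu and its direct neighbors toward xi.
    for i in range(len(xi)):
        bmu.w[i] += EPS_B * (xi[i] - bmu.w[i])
    for n in bmu.neighbors:
        for i in range(len(xi)):
            n.w[i] += EPS_N * (xi[i] - n.w[i])
    # Step 7: temporal decay of every cell's winner counter.
    for c in cells:
        c.E -= ALPHA * c.E
    return bmu
```

Note that, unlike in the SOM, the adaptation strengths stay constant over time and only the bmu and its direct neighbors move.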


GCS 6/7

The user-specified parameters are:

1. the dimensionality of the GCS, which is fixed;

2. the maximum number of neighbor connections per cell;

3. the maximum number of cells in the structure;

4. the adaptation step for the winning cell;

5. the adaptation step for the neighborhood;

6. the temporal decay factor;

7. the number of iterations before insertion;

8. the number of iterations before deletion.
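For illustration, the eight user-specified parameters above could be collected in a configuration like the following; the values shown are assumptions, not the settings used in the paper.

```python
# Illustrative GCS parameter set; every value here is an assumption.
gcs_params = {
    "dimensionality": 2,     # fixed dimensionality of the GCS
    "max_neighbors": 6,      # maximum neighbor connections per cell
    "max_cells": 120,        # maximum cells in the structure
    "eps_bmu": 0.06,         # adaptation step for the winning cell
    "eps_neighbor": 0.002,   # adaptation step for the neighborhood
    "decay": 0.05,           # temporal decay factor
    "insert_every": 100,     # iterations between cell insertions
    "delete_every": 500,     # iterations between cell deletions
}
print(len(gcs_params))  # → 8
```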


GCS 7/7

Fritzke has demonstrated superior performance of the GCS over SOMs, with respect to:

Topology preservation, with similar input vectors being mapped onto identical or closely neighboring neurons ensuring robustness against distortions.

Neighboring cells having similar attached vectors, ensuring robustness. If the dimensionality of the input vectors is greater than the network dimensionality, then the mapping usually preserves the similarities among the input vectors.

Lower distribution-modeling error (which is the standard deviation of all counters divided by the mean value of the counters).
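The distribution-modeling error defined above (the standard deviation of all winner counters divided by their mean) is straightforward to compute; the counter values below are illustrative only.

```python
import statistics

def distribution_modeling_error(counters):
    """Std. deviation of the winner counters divided by their mean."""
    return statistics.pstdev(counters) / statistics.fmean(counters)

# Perfectly uniform counters model the distribution exactly: zero error.
print(distribution_modeling_error([5.0, 5.0, 5.0]))  # → 0.0
```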


GCS Evaluation

The run-time complexity for GCS is O(numberCells × dimension × numberInputs × epochs).

The GCS algorithm is susceptible to the order of the input data.

In this paper, we utilize three data orderings to illustrate the initial susceptibility of the algorithm to input data order and how cycling improves the stability.


TreeGCS 1/2

When a cluster subdivides, new nodes are added to the tree to reflect the additional clusters (Fig. 4).

Only leaf nodes maintain a cluster list.

The hierarchy generation is run once after each GCS epoch


TreeGCS 2/2

If the number of clusters has decreased, a cluster has been deleted, and the associated tree node is deleted (Fig. 5).

All tree nodes except leaf nodes have only an identifier and pointers to their children.
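The node layout just described — internal nodes holding only an identifier and child pointers, leaf nodes additionally maintaining a cluster list — can be sketched as a simple data structure. Class and field names here are assumptions, and the country cluster is taken from the evaluation data.

```python
class TreeNode:
    def __init__(self, ident):
        self.ident = ident     # identifier held by every node
        self.children = []     # pointers to child nodes
        self.cluster = None    # cluster list; only leaf nodes maintain one

    def is_leaf(self):
        return not self.children

    def add_child(self, child):
        # A node with children is internal, so it drops its cluster list.
        self.cluster = None
        self.children.append(child)

root = TreeNode("root")
leaf = TreeNode("c1")
leaf.cluster = ["Den", "Fra", "Ger", "It", "UK"]
root.add_child(leaf)
print(root.is_leaf(), leaf.is_leaf())  # → False True
```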


Evaluation

The data set comprises 41 European countries, each represented by a 47-dimensional real-valued vector.

We use three different orderings of the data to evaluate stability.

1. Alphabetical order of the country names.

2. Middle to front.

3. Numerical order.


Dendrogram

If we take the dendrogram as three clusters, the clusters produced are:

1. {Den, Fra, Ger, It, UK}

2. {Lux}

3. {Alb, And, Aus, Bel, Bos, Bul, Cro, Cyp, Cze, Eir, Est, Far, Fin, Gib, Gre, Hun, Ice, Lat, Lie, Lit, Mac, Mal, Mon, NL, Nor, Pol, Rom, SM, Ser, Slk, Sln, Spa, Swe, Swi, Ukr}.

The parameter settings for TreeGCS were:

There are six permutations of the three data orders: (1,2,3), (1,3,2), (2,3,1), (2,1,3), (3,1,2), (3,2,1).
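The six orderings above are simply all permutations of the three data orders, which can be enumerated directly:

```python
from itertools import permutations

# All orderings in which the three data orders can be cycled.
orders = list(permutations((1, 2, 3)))
print(len(orders))  # → 6
print(orders[0])    # → (1, 2, 3)
```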


Single-Pass TreeGCS 1/3

Single-Pass TreeGCS 2/3

Alphabetical order of countries (see Fig. 6):

34 {Lux, Ukr}
9 {Den, Fra, Ger, It, Spa, UK}
80 {Alb, And, Aus, Bel, Bos, Bul, Cro, Cyp, Cze, Eir, Est, Far, Fin, Gib, Gre, Hun, Ice, Lat, Lie, Lit, Mac, Mal, Mon, NL, Nor, Pol, Rom, SM, Ser, Slk, Sln, Swi}
3 {Swe}

Middle to front order of countries (see Fig. 6):

11a {Den, Fra, Ger, It, Spa, UK}
11b {Aus, Bel, NL, Swe, Swi, Ukr}
12 {Cze, Fin, Gre, Nor, Rom}
13 {Bul, Eir, Hun, Pol, Slk}
14 {Lux, Ice}
65 {Alb, And, Bos, Bul, Cro, Cyp, Est, Far, Gib, Lat, Lie, Lit, Mac, Mal, Mon, SM, Ser, Sln}


Single-Pass TreeGCS 3/3

Numerical order of first attributes (see Fig. 6):

29 {Aus, Bel, Den, Fra, Ger, It, NL, Spa, Swe, Swi, UK}
69 {Alb, And, Bos, Bul, Cro, Cyp, Est, Far, Gib, Ice, Lat, Lie, Lit, Mac, Mal, Mon, Pol, Rom, SM, Slk, Sln}
8 {Hun, Lux, Ser}
20 {Cze, Eir, Fin, Gre, Nor, Ukr}


Cyclic TreeGCS 1/4

D = alphabetical data order

M = middle to front

S = sorted numerically by the first attribute


Cyclic TreeGCS 2/4

1. DMS (see Fig. 7):

18 {Den, Fra, Ger, It, NL, Spa, UK}
108 {Alb, And, Aus, Bel, Bos, Bul, Cro, Cyp, Cze, Eir, Est, Far, Fin, Gib, Gre, Hun, Ice, Lat, Lie, Lit, Lux, Mac, Mal, Mon, Nor, Pol, Rom, SM, Ser, Slk, Sln, Swe, Swi, Ukr}

2. DSM (see Fig. 7):

30 {Bel, Den, Fra, Ger, It, NL, Spa, Swe, UK}
8 {Aus, Lux, Ser, Swi, Ukr}
88 {Alb, And, Bos, Bul, Cro, Cyp, Cze, Eir, Est, Far, Fin, Gib, Gre, Hun, Ice, Lat, Lie, Lit, Mac, Mal, Mon, Nor, Pol, Rom, SM, Slk, Sln}


Cyclic TreeGCS 3/4

3. MSD (see Fig. 7):

10 {Den, Fra, Ger, It, Spa, UK}
116 {Alb, And, Aus, Bel, Bos, Bul, Cro, Cyp, Cze, Eir, Est, Far, Fin, Gib, Gre, Hun, Ice, Lat, Lie, Lit, Lux, Mac, Mal, Mon, NL, Nor, Pol, Rom, SM, Ser, Slk, Sln, Swe, Swi, Ukr}

4. MDS (see Fig. 7):

17 {Den, Fra, Ger, It, NL, Spa, UK}
32 {Aus, Bel, Cze, Fin, Gre, Nor, Rom, Swe, Swi, Ukr}
11 {Bul, Eir, Hun, Lux, Ser, Slk}
66 {Alb, And, Bos, Cro, Cyp, Est, Far, Gib, Ice, Lat, Lie, Lit, Mac, Mal, Mon, Pol, SM, Sln}


Cyclic TreeGCS 4/4

5. SDM (see Fig. 7):

18 {Den, Fra, Ger, It, NL, Spa, UK}
5 {Cze, Gre, Lux, Ser}
15 {Aus, Bel, Rom, Swe, Swi}
12 {Eir, Fin, Hun, Nor, Ukr}
76 {Alb, And, Bos, Bul, Cro, Cyp, Est, Far, Gib, Ice, Lat, Lie, Lit, Mac, Mal, Mon, Pol, SM, Slk, Sln}

6. SMD (see Fig. 7):

23 {Bel, Den, Fra, Ger, It, NL, Spa, UK}
90 {Alb, And, Bos, Bul, Cro, Cyp, Cze, Eir, Est, Far, Fin, Gib, Gre, Hun, Ice, Lat, Lie, Lit, Lux, Mac, Mal, Mon, Pol, SM, Ser, Slk, Sln}
13 {Aus, Nor, Rom, Swe, Swi, Ukr}


Parameter Settings 1/2

For the final column, a "T" indicates a static hierarchy and "F" indicates that the hierarchy never became static.

Parameter Settings 2/2



Analysis

One solution would be to maintain a list of the removed hierarchy nodes, with details of their parents and siblings.

Another solution would be a posteriori manual inspection of the run-time output.


Conclusions

TreeGCS overcomes the instability problem.

The algorithm adaptively determines the depth of the cluster hierarchy; there is no requirement to prespecify network dimensions as with most SOM-based algorithms.

The superimposed hierarchy requires no user-specified parameters.

A further advantage of our approach over dendrograms is that leaf nodes in our hierarchy represent groups of input vectors.


Personal Opinion

We can learn from TreeGCS's technique of subdividing a cluster by adding a new node to the tree in hierarchical clustering.


Review

1. GCS: the seven steps of one epoch.

2. TreeGCS

3. Parameter Settings
