Copyright 2003 by Hongyu Xu
Mahalanobis Distance Based ARTMAP Networks
Master's Thesis
Supervisor: Marko Vuskovic
Hongyu Xu
Department of Computer Science, San Diego State University
October 16, 2003
Contents

Taxonomy of pattern recognition
Gaussian mixture model
Taxonomy of ART networks
ART and ARTMAP
Gaussian ARTMAP
Mahalanobis distance based ARTMAP (MART)
Merging ellipsoids
Experiments
Approaches in Pattern Recognition

(Taxonomy figure.) Pattern recognition approaches divide into two broad families:
• Bayes' classifiers, based on probability density estimation: Parzen windows, K-nearest neighbors, and mixture models (basis functions fitted by maximum likelihood or Bayesian inference).
• Neural networks: single-layer and multilayer perceptrons, radial basis function networks, SOM, K-means, feed-forward and recurrent architectures, ART, and MART.
Gaussian Mixture Model

The mixture model is a semi-parametric method of estimating a probability density function. In such a model the density function is formed from a linear combination of M basis functions:

    p(x) = \sum_{j=1}^{M} p(x | j) P(j)

The concept of the mixture model can be applied separately to each class, so that the number of mixture models equals the number of classes. The class-conditional probability density function is then given by:

    p(x | \omega_k) = \sum_{j,\, L(j) = \omega_k} p(x | j) P(j)

where L(j) is the label of cluster j and \omega is the array of class labels indexed by k.
Gaussian Mixture Model (Cont.)

Here we consider only Gaussian mixture models. (Example figure: a 2-D, two-class Gaussian mixture model.)

In each model, the cluster-conditional pdf of x is given by:

    p(x | j) = \frac{|Q_j|^{1/2}}{(2\pi)^{D/2}} \exp\left(-\tfrac{1}{2} t_j\right)

where Q_j is the inverse covariance matrix of cluster j and

    t_j = (x - w_j)^T Q_j (x - w_j)

is the Mahalanobis distance from x to cluster j.

The prior probability of cluster j in the mixture model of class \omega_k is given by:

    P(j) = N_j / N_{\omega_k}

(the ratio of the number of patterns enclosed by cluster j to the total number of patterns represented by the mixture model of class \omega_k).
Gaussian Mixture Model (Cont.)

Classification based on the Gaussian mixture model then proceeds as follows. The class-conditional pdf of x is given by:

    p(x | \omega_k) = \sum_{j,\, L(j) = \omega_k} p(x | j) P(j)

According to Bayes' theorem, the posterior probability of class \omega_k is:

    P(\omega_k | x) = \frac{p(x | \omega_k) P(\omega_k)}{p(x)}, \qquad P(\omega_k) = N_{\omega_k} / N

where P(\omega_k) is the ratio of the number of patterns enclosed by class \omega_k to the total number of patterns. Since p(x) is the same for every class, the decision rule becomes:

    J = \arg\max_{1 \le k \le C} \sum_{j,\, L(j) = \omega_k} p(x | j) P(c_j), \qquad P(c_j) = N_j / N
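As an illustration of this decision rule, here is a minimal Python/NumPy sketch (not the thesis implementation; the function and variable names are hypothetical). It assumes the cluster centers w_j, inverse covariance matrices Q_j, pattern counts N_j and cluster labels L(j) have already been estimated.

```python
import numpy as np

def gmm_classify(x, centers, inv_covs, counts, labels):
    """Return the label whose mixture sum_j p(x|j) P(c_j) is largest (hypothetical helper)."""
    D = len(x)
    N = float(sum(counts))
    scores = {}
    for w, Q, n, lab in zip(centers, inv_covs, counts, labels):
        d = x - w
        t = d @ Q @ d                                    # Mahalanobis distance t_j
        # p(x|j) = |Q_j|^(1/2) / (2*pi)^(D/2) * exp(-t_j/2),  P(c_j) = N_j / N
        p = np.sqrt(np.linalg.det(Q)) / (2 * np.pi) ** (D / 2) * np.exp(-t / 2)
        scores[lab] = scores.get(lab, 0.0) + p * n / N
    return max(scores, key=scores.get)

# Toy usage: two classes, one Gaussian cluster each
centers  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
inv_covs = [np.eye(2), np.eye(2)]
counts   = [10, 10]
labels   = ['A', 'B']
print(gmm_classify(np.array([0.5, 0.2]), centers, inv_covs, counts, labels))  # -> 'A'
```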
ART

What is ART?
ART stands for Adaptive Resonance Theory. ART networks incorporate the leader-follower algorithm into a parallel network architecture and employ set operations in their activation and match functions.

Why ART?
Research in this area was motivated by the need for a real-time, incremental-learning classifier.
ART is characterized by self-organizing, incremental, fast learning.
ART performs well in numerous benchmarks.
The initial effort at ART research was encouraging.
Important ART Networks

ART networks (Grossberg, 1976) divide into unsupervised and supervised ART learning.

Unsupervised ART learning:
• ART1, ART2 (Carpenter & Grossberg, 1987)
• Fuzzy ART (Carpenter, Grossberg, et al., 1991)
• Simplified ART (Baraldi & Alpaydin, 1998)

Supervised ART learning:
• ARTMAP (Carpenter, Grossberg, et al., 1991)
• Fuzzy ARTMAP (Carpenter, Grossberg, et al., 1991)
• Gaussian ARTMAP (Williamson, 1992)
• Simplified ARTMAP (Kasuba, 1993)
• Mahalanobis Distance Based ARTMAP (Vuskovic & Du, 2001; Vuskovic, Xu & Du, 2002)
ART

Architecture of a simplified ART (figure): the input x, together with the bias input x_0, enters the input layer of the attentional subsystem and is broadcast to the output nodes n_1, n_2, ..., n_C; the orienting subsystem, controlled by the vigilance ρ, decides whether the winning node J is accepted as the output.

Sequential implementation of ART:

    while (X not empty)                        Learning loop
    {   get x;                                 Get a pattern from X
        new = true;
        loop j = 1,C:  t_j = T(x, w_j)         Compute activations
        loop i = 1,C                           Search for resonance
        {   J = arg max_{j ≤ C} T_j            Find candidate
            if M(x, w_J) > ρ                   If resonance occurs
            {   w_J := U(x, w_J)               Update the template
                new = false
                break;                         Stop the search
            }
            else
                T_J := 0                       Suppress the candidate
        }
        NEWNODE(new);                          Create new node
    }

    NEWNODE(new):  if new == true  {  C := C + 1;  w_C := x;  W := W ∪ w_C  }
Functions Used in ART

T(): the choice (activation) function, a measure of the degree of resemblance of an input to a template.
M(): the match function, a measure of the degree of resemblance of a template to an input. Resonance happens when M > ρ.
U(): the update function, which updates a template when it resonates with a pattern.

ART-1 (binary patterns), where ∧ is the bitwise AND operator:

    T(x, w_j) = \frac{|x \wedge w_j|}{|w_j|}, \qquad
    M(x, w_j) = \frac{|x \wedge w_j|}{|x|}, \qquad
    U(x, w_j) = (1 - \beta)\, w_j + \beta\, (x \wedge w_j)

ART-2 (fuzzy patterns): the same functions, with ∧ the fuzzy AND (element-wise minimum) operator.

Here β ∈ [0, 1] is the learning rate and |·| is the norm operator defined by |a| = \sum_{i=1}^{D} a_i.
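To make these definitions concrete, here is a minimal Python sketch of the fuzzy (ART-2 style) versions, taking the fuzzy AND as the element-wise minimum and |·| as the sum of components, as defined above. The helper names and the sample numbers are hypothetical.

```python
import numpy as np

def norm1(a):
    """|a| = sum_i a_i"""
    return float(np.sum(a))

def choice(x, w):
    """T(x, w) = |x ^ w| / |w|"""
    return norm1(np.minimum(x, w)) / norm1(w)

def match(x, w):
    """M(x, w) = |x ^ w| / |x|"""
    return norm1(np.minimum(x, w)) / norm1(x)

def update(x, w, beta=1.0):
    """U(x, w) = (1 - beta) w + beta (x ^ w); beta = 1 gives fast learning."""
    return (1 - beta) * w + beta * np.minimum(x, w)

# One resonance test against a single template
x, w, rho = np.array([0.2, 0.7, 0.8, 0.3]), np.array([0.3, 0.6, 0.9, 0.4]), 0.7
if match(x, w) > rho:          # resonance: the template resembles the input closely enough
    w = update(x, w)
print(choice(x, w), w)
```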
Geometric interpretation of fuzzy ART with fast learning and complement coding

Input (complement coding):

    I = c(x_1, x_2) = (x_1, x_2, 1 - x_1, 1 - x_2)

The weight vector is given by:

    w_j = (u_j, v_j^c)

Each category j has a geometric representation as a rectangle R_j with corners u_j and v_j; the size of R_j is defined by |R_j| = |v_j - u_j|.

(Figure: a category box R_J. When R_J has been encoded with a single pattern x^{(i)}, |R_J| = 0 and the box collapses to that point; when it has been encoded with two patterns x^{(i)} and x^{(k)}, R_J is the rectangle spanned by them.)
Geometric interpretation of fuzzy ART (Cont.)

    \wedge x_j = \min\{ x_i : x_i \text{ has been coded by category } j \}
    \vee x_j = \max\{ x_i : x_i \text{ has been coded by category } j \}

The template is given by:

    w_j = (\wedge x_j, (\vee x_j)^c)

In general, if x has dimension D, the hyper-rectangle R_j includes the two vertices \wedge x_j and \vee x_j. The following relationships can easily be derived:

    |R_j| = D - |w_j| \quad \text{and} \quad |R_j| \le D (1 - \rho)
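A tiny numeric sketch (hypothetical values) shows how the template, the rectangle R_j, and the relation |R_j| = D - |w_j| follow from fast learning with complement coding.

```python
import numpy as np

D = 2
x1 = np.array([0.2, 0.3])                # first pattern coded by the category
x2 = np.array([0.6, 0.5])                # second pattern coded by the category

lo = np.minimum(x1, x2)                  # ^x_j  (lower corner of R_j)
hi = np.maximum(x1, x2)                  # vx_j  (upper corner of R_j)
w  = np.concatenate([lo, 1.0 - hi])      # template w_j = (^x_j, (vx_j)^c)

size_R = np.sum(hi - lo)                 # |R_j|
print(size_R, D - np.sum(w))             # both print 0.6, i.e. |R_j| = D - |w_j|
```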
Sequential Algorithm of ARTMAP

    ...
    loop j = 1,C:  t_j = T(x, w_j)             Compute activations
    loop j = 1,C                               Search for resonance
    {   J = arg max_{j ≤ C} T_j                Find template with highest activation value
        if label(w_J) == label(x)              If label matches
        {   if M(x, w_J) > ρ                   And the resonance occurs
            {   w_J := U(x, w_J);              Update the template
                new = false;                   No new node needed
                break;                         Stop the search for the resonant node
            }
            else T_J := 0                      If the distance doesn't match
        }
        else                                   If the label doesn't match
        {   T_J := 0;                          Reset its activation value, so J will not be elected
            ρ := M(x, w_J) + ε;                Match tracking triggered
        }                                      Continue the search for the resonant node
    }
    NEWNODE(new);                              Create new node if needed
    ...

    NEWNODE(new):  if new == true  {  C := C + 1;  w_C := x;  W := W ∪ w_C;  label(w_C) := label(x)  }
Gaussian ARTMAP

Gaussian ARTMAP (GA), proposed by Williamson in 1995, employs the Gaussian distribution in the choice and match functions.

Choice function:

    T_j(x) = \log\left[(2\pi)^{D/2} p(x | j) P(j)\right]
           = -\frac{1}{2} \sum_{i=1}^{D} \left(\frac{w_{ji} - x_i}{\sigma_{ji}}\right)^2 - \sum_{i=1}^{D} \log \sigma_{ji} + \log P(j)

Match function:

    M(x, w_J) = -\frac{1}{2} \sum_{i=1}^{D} \left(\frac{x_i - w_{Ji}}{\sigma_{Ji}}\right)^2
              = T_J(x) + \sum_{i=1}^{D} \ln \sigma_{Ji} - \ln P(J)

Update function:

    n_J := n_J + 1
    w_{Ji} := (1 - \eta_J) w_{Ji} + \eta_J x_i, \qquad i = 1, 2, ..., D
    \sigma_{Ji} := \sqrt{(1 - \eta_J) \sigma_{Ji}^2 + \eta_J (x_i - w_{Ji})^2}

where \eta_J = 1 / n_J.
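A minimal sketch of this update step in Python (illustrative only; the helper name and the assumption that σ is recomputed from the already-updated mean, as the formula above suggests, are mine, not the thesis code):

```python
import numpy as np

def ga_update(x, w, sigma, n):
    """One Gaussian-ARTMAP style update of the winning node's count, mean, and std (a sketch)."""
    n = n + 1
    eta = 1.0 / n                               # learning rate eta_J = 1 / n_J
    w = (1 - eta) * w + eta * x                 # per-dimension mean
    var = (1 - eta) * sigma**2 + eta * (x - w)**2
    return w, np.sqrt(var), n

w, sigma, n = ga_update(np.array([0.4, 0.9]), np.array([0.5, 0.7]),
                        np.array([0.1, 0.2]), n=4)
print(w, sigma, n)
```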
Gaussian ARTMAP (Cont.)

Classification function:

    P(\omega_k | x) \propto \sum_{j,\, L(j) = \omega_k} \exp\big(T_j(x, w_j, \sigma_j)\big), \qquad class = \arg\max_k P(\omega_k | x)

Advantage: it solves the category proliferation caused by noise and gives better generalization and representation of the categories.
Disadvantage: its axis-parallel Gaussians are a less efficient representation than general ellipsoids.
Mahalanobis Distance Based ARTMAP

• Vuskovic, M. I. and Du, S. Classification of EMG patterns with simplified fuzzy ARTMAP networks. Proceedings of the 2002 International Joint Conference on Neural Networks, Honolulu, HI, May 12-17, 2002.
• Vuskovic, M. I., Xu, H., and Du, S. Simplified ARTMAP network based on Mahalanobis distance. Proceedings of the 2002 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Science, Las Vegas, Nevada, June 27-27, 2002.
• Xu, H., Valafar, F. and Vuskovic, M. Prediction of sickle cell anemia patients' response to hydroxyurea treatment using ARTMAP. Proceedings of the 2003 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Science, Las Vegas, Nevada, June 23-26, 2003.
Sequential Learning Algorithm of MART

    ...
    loop j = 1,C
    {   if label(w_j) == label(x):  t_j = T(x, w_j)      Activate only the same-label templates
    }
    J = arg min_{j ≤ C} t_j                               Find the closest template for x
    if t_J ≤ ρ                                            Resonance occurs
    {   (w_J, Q_J, N_J) := U(w_J, Q_J, N_J, x)            Update w and Q
        new = false;                                      No new node needed
    }
    NEWNODE(new);                                         Create new node if needed
    ...

    NEWNODE(new):  if new == true  {  C := C + 1;  w_C := x;  W := W ∪ w_C;  label(w_C) := label(x)  }

Notes:
• Only the templates with the same label as the input are activated.
• The vigilance ρ remains constant.
• The activation and match functions are made the same, so the searching loop is not needed.
Functions Used in MART

Activation and match function (the Mahalanobis distance from x to cluster j):

    t_j = T(x, w_j) = M(x, w_j) = (x - w_j)^T Q_j (x - w_j)

Update function:

    w_J := (1 - \eta_J) w_J + \eta_J x
    Q_J := \lambda_1 \left( Q_J - \frac{\lambda_2}{1 + \lambda_2 t_J}\, g_J g_J^T \right), \qquad g_J = Q_J (x - w_J)
    N_J := N_J + 1

where \eta_J = 1 / (N_J + 1), and \lambda_1, \lambda_2 follow from the recurrent formula for the inverse covariance matrix (next slide). (The inverse covariance matrix of a newly committed node is initialized with a scaled identity, Q_0 = \frac{1}{R^2} I.)

Classification:

    class = \arg\max_k P(\omega_k | x) = \arg\max_k \sum_{j,\, L(j) = \omega_k} p(x | j) P(j)
          = \arg\max_k \sum_{j,\, L(j) = \omega_k} \exp\left(-\tfrac{1}{2} t_j + \tfrac{1}{2} \ln|Q_j| + \ln N_j\right)
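For concreteness, here is a small Python sketch of the MART activation (the Mahalanobis distance) and of the mixture-based classification rule above. It is illustrative only; the helper names and the toy labels are hypothetical.

```python
import numpy as np

def activation(x, w, Q):
    """t_j = (x - w_j)^T Q_j (x - w_j), the Mahalanobis distance from x to cluster j."""
    d = x - w
    return float(d @ Q @ d)

def classify(x, clusters):
    """clusters: list of (w_j, Q_j, N_j, label).  Returns the label maximizing
    sum_j exp(-t_j/2 + ln|Q_j|/2 + ln N_j) over the clusters carrying that label."""
    scores = {}
    for w, Q, n, lab in clusters:
        t = activation(x, w, Q)
        scores[lab] = scores.get(lab, 0.0) + np.exp(-0.5 * t + 0.5 * np.log(np.linalg.det(Q)) + np.log(n))
    return max(scores, key=scores.get)

clusters = [(np.array([0.0, 0.0]), np.eye(2), 12, 'grasp A'),
            (np.array([2.0, 2.0]), np.eye(2),  8, 'grasp B')]
print(classify(np.array([0.3, -0.1]), clusters))   # -> 'grasp A'
```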
Recurrent Formula for Computing Q and W

The cluster center and covariance matrix are estimated by:

    w_n = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}, \qquad
    S_n = \frac{1}{n-1} \sum_{i=1}^{n} (x^{(i)} - w_n)(x^{(i)} - w_n)^T

Recurrent formulas for computing S and w:

    w_{n+1} = (1 - \eta)\, w_n + \eta\, x^{(n+1)}
    S_{n+1} = \alpha_1 S_n + \alpha_2 (x^{(n+1)} - w_n)(x^{(n+1)} - w_n)^T

where \eta = \frac{1}{n+1}, \alpha_1 = \frac{n-1}{n}, \alpha_2 = \frac{1}{n+1}.

Matrix inversion lemma:

    (A + BC)^{-1} = A^{-1} - A^{-1} B (I + C A^{-1} B)^{-1} C A^{-1}

Recurrent formula for computing the inverse of S (with Q = S^{-1}):

    Q_{n+1} = \lambda_1 \left( Q_n - \frac{\lambda_2}{1 + \lambda_2 t}\, g g^T \right)

where g = Q_n (x^{(n+1)} - w_n), t = (x^{(n+1)} - w_n)^T Q_n (x^{(n+1)} - w_n), \lambda_1 = \frac{1}{\alpha_1}, \lambda_2 = \frac{\alpha_2}{\alpha_1}.
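The following Python sketch implements exactly these recursions (the function name is hypothetical) and checks them against the direct batch estimates; the inverse covariance is maintained with a rank-one update, so S is never re-inverted.

```python
import numpy as np

def update_center_and_inverse(w, Q, n, x):
    """Recursive update of the cluster center w and of Q = S^{-1} after the (n+1)-th pattern x,
    using the matrix inversion lemma (a sketch of the formulas above)."""
    d = x - w
    a1, a2 = (n - 1.0) / n, 1.0 / (n + 1)       # S_{n+1} = a1 S_n + a2 d d^T
    lam1, lam2 = 1.0 / a1, a2 / a1
    g = Q @ d
    t = float(d @ g)                            # Mahalanobis distance of x to the old cluster
    Q_new = lam1 * (Q - (lam2 / (1.0 + lam2 * t)) * np.outer(g, g))
    w_new = w + (x - w) / (n + 1)               # w_{n+1} = (1 - eta) w_n + eta x, eta = 1/(n+1)
    return w_new, Q_new, n + 1

# Check against the direct batch estimates on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w, n = X[:5].mean(0), 5
Q = np.linalg.inv(np.cov(X[:5].T))
for x in X[5:]:
    w, Q, n = update_center_and_inverse(w, Q, n, x)
print(np.allclose(np.linalg.inv(Q), np.cov(X.T)), np.allclose(w, X.mean(0)))   # True True
```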
Architecture of MART (Training Mode)

(Figure.) In training mode the input pattern (features X with label L) is broadcast to all output neurons n_1, n_2, ..., n_C. Each neuron returns its activation value t to the net manager, which finds the maximum activation value, sends back the inhibit / update / new-node signal i, and commits new nodes when needed.

Legend: X - features, L - label, t - activation value, i - inhibit / update / new-node signal.
Architecture of MART (Classification Mode)

(Figure.) In classification mode the input pattern x (without a label) is broadcast to all output neurons n_1, n_2, ..., n_C. Each neuron returns a pair (p, L) to the net manager, which finds the maximum posterior probability over all categories and outputs the corresponding label.

Legend: p - posterior probability, L - label.
Merging Ellipsoids

Why merge clusters?
(1) To improve generalization.
(2) To reduce the number of output nodes.

(Figures: the clusters before merging, where the numbers indicate the order in which the patterns arrive, and the clusters after merging.)
Merging Criterion

Distance of cluster j to cluster i:

    d_{ji} = (w_j - w_i)^T Q_i (w_j - w_i)

Distance of cluster i to cluster j:

    d_{ij} = (w_i - w_j)^T Q_j (w_i - w_j)

To merge cluster i into cluster j:
• Clusters i and j must have the same label.
• They must be mutually close to each other: d_{ji} and d_{ij} are both below the closeness threshold.
• They must not have a common enemy: for every enemy cluster e (a cluster with a different label), at least one of the distances d_{ie}, d_{je} must exceed the threshold.
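A small Python sketch of this test (illustrative only; the helper names and the use of a single closeness threshold rho are assumptions on top of the criterion above):

```python
import numpy as np

def mdist(a, b, Q_b):
    """Distance of center a from cluster b, measured in cluster b's metric: (a - b)^T Q_b (a - b)."""
    d = a - b
    return float(d @ Q_b @ d)

def can_merge(wi, Qi, wj, Qj, enemies, rho):
    """Clusters i and j (same label) may be merged if they are mutually close
    and no enemy cluster is close to both of them."""
    mutually_close = mdist(wj, wi, Qi) <= rho and mdist(wi, wj, Qj) <= rho
    no_common_enemy = all(mdist(we, wi, Qi) > rho or mdist(we, wj, Qj) > rho
                          for we, Qe in enemies)
    return mutually_close and no_common_enemy

wi, wj = np.array([0.0, 0.0]), np.array([0.5, 0.0])
Qi = Qj = np.eye(2)
enemies = [(np.array([5.0, 5.0]), np.eye(2))]        # one cluster with a different label
print(can_merge(wi, Qi, wj, Qj, enemies, rho=1.0))   # True: mutually close, no common enemy
```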
Illustrate Merging Criterion (1)

(Figures: two ellipsoidal clusters i and j with centers w_i and w_j.)
Cluster j is close to cluster i, but cluster i is not close to cluster j. The two clusters cannot be merged.
Cluster i and cluster j are close to each other and have the same label, therefore, they can be merged.
Illustrate Merging Criterion (2)

(Figures: clusters i and j with centers w_i and w_j, and an enemy cluster with center w_e.)
Cluster i and cluster j are close to each other. Though cluster j is close to an enemy cluster, because cluster i is not close to the enemy cluster, the two clusters can be merged.
Cluster i and cluster j are close to each other, but because they are also close to a common enemy, they can not be merged.
Evaluate Merging Algorithm on Circle-in-Square Benchmark (1)

(Figures: the learned clusters before merging and after merging.)
Evaluate Merging Algorithm on Circle-in-Square Benchmark (2)

(Figures: the learned clusters before merging and after merging.)
Evaluate Merging Algorithm on Frey-Slate Letter Image Recognition Problem

11 features:
    seed      before merging: nodes / hit %    after merging: nodes / hit %    change: nodes / hit %
    1         687 / 96.65                      352 / 95.93                     -335 / -0.72
    2         687 / 96.80                      380 / 96.20                     -307 / -0.60
    3         676 / 96.73                      364 / 95.58                     -312 / -1.15
    4         657 / 96.53                      356 / 95.95                     -301 / -0.58
    Average   677 / 96.67                      363 / 95.91                     -314 / -0.76

16 features:
    seed      before merging: nodes / hit %    after merging: nodes / hit %    change: nodes / hit %
    1         578 / 97.45                      262 / 96.45                     -316 / -1.00
    2         587 / 97.40                      275 / 96.33                     -312 / -1.11
    3         538 / 96.92                      267 / 95.80                     -271 / -1.12
    4         555 / 97.28                      267 / 96.35                     -288 / -0.93
    Average   565 / 97.26                      268 / 96.23                     -297 / -1.03
Circle-in-Square Benchmark

The circle-in-square problem is specified as a benchmark in the DARPA artificial neural network technology program for measuring network performance.

Description: a square with unit side length contains a circle, centered in the square, that covers exactly half of the square's area. The task is to tell whether a data point lies inside or outside the circle.
Experimental Results on Circle-in-Square Benchmark

    Network         Samples/Epochs    Hit rate % (worst-best)    Output nodes (best-worst)
    Fuzzy ARTMAP    100/2             89.60                      12
                    1000/3            90.5  (85.9-93.4)          27
    DSAM            100/2             89.7  (80-99)              8.9  (6.0-11.0)
                    1000/3            95.0  (92.00-97.00)        11.1 (8-15.0)
    MART            100/2             90.60 (79.00-99.00)        9.16 (7.0-13.0)
                    1000/3            95.84 (92.50-97.50)        10.80 (9.0-13.0)

A two-layer back-propagation network (with 20 to 40 hidden neurons) [49] takes 5000 epochs to reach its equilibrium state and achieves about a 90% hit rate.

The results shown here for MART are averaged over 100 independent experiments, each with a data set generated from a different random seed.
Letter Recognition Problem

• The database is archived in the UCI Repository of Machine Learning Databases and Domain Theories, maintained by D. Aha and P. Murphy (ml-repository@ics.uci.edu).
• The database consists of 20,000 unique samples. The first 16,000 samples are used for training and the last 4,000 samples are used for testing.
• Each pattern consists of 16 features obtained from machine-generated images of the 26 capital letters A to Z.
• The characters are generated from 20 different fonts.
• The fonts are randomly distorted.
Experimental Results on Letter Recognition Problem

The results shown here are averaged over five independent experiments for FA, GA, and MART, each with a different training sequence. The individual results were averaged to produce the one-voter result and combined to produce the five-voter results.

    # of features    Network    # of voters    Epochs    Hit rate %    Output nodes
    16               FA         5              7         94.85         5,175
                     GA         5              20        95.95         4,208
                     NN         -              -         95.80         16,000
                     MART       1              1         97.52         580
    11               FA         5              20        95.82         5,312
                     GA         5              20        95.98         5,218
                     NN         -              -         96.55         16,000
                     MART       1              3         96.79         872
Electromyographic Signal (EMG)

• Basic types of finger grasps: cylindrical grasp, spherical grasp, precision grasp (pinch), lateral grasp (key grasp), and hook grasp. In this research the first four categories were recorded, and the cylindrical grasps were further divided into big and small grasps, as were the spherical grasps.
• The EMG signals were collected from several human subjects with healthy hands. Four electrodes (channels) were placed on their upper forearms.
• The raw EMG signal of each channel was squared, then passed through a moving-average FIR (finite impulse response) filter with a Hamming window of size 300 ms (a sketch follows this list).
• The amplitude of the first oscillation in each channel was used as a feature.
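The preprocessing step can be sketched as follows; this is illustrative only, and the 1 kHz sampling rate and the toy burst signal are assumptions, not values from the thesis (only the 300 ms Hamming window comes from the slide).

```python
import numpy as np

fs = 1000                                   # assumed sampling rate in Hz
win = np.hamming(int(fs * 0.300))           # 300 ms Hamming window from the slide
win = win / win.sum()                       # normalize: a Hamming-weighted moving average

def smooth_envelope(raw_channel):
    """Square the raw EMG of one channel, then apply the moving-average FIR filter."""
    return np.convolve(raw_channel**2, win, mode='same')

# Toy signal: a burst of activity between 0.5 s and 1.0 s in background noise
t = np.arange(0, 2, 1 / fs)
raw = 0.05 * np.random.randn(t.size) + (t > 0.5) * (t < 1.0) * np.sin(2 * np.pi * 80 * t)
env = smooth_envelope(raw)
print(env.argmax() / fs)                    # time (s) of the envelope peak, inside the burst
```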
Experimental Results on EMG Signal
    # of categories    Network           Hit rate (%)    Output nodes
    4                  Classical SFAM    85.7            9.4
                       E-SFAM            86.53           30.1
                       M-SFAM            94.6            5.2
                       MART              96.03           4.31
    6                  Classical SFAM    61.1            24.3
                       E-SFAM            60.1            53.7
                       M-SFAM            77.6            7.0
                       MART              82.67           6.16
The results shown here are averaged over 100 experiments. E-SFAM stands for Euclidean distance based SFAM. M-SFAM stands for Mahalanobis distance based SFAM.
Hierarchical Classification of EMG Patterns

Network training (figure):
• Train_set (six categories: 1, 2, 3, 4, 5, 6) is converted into Train_set1 (four categories: 1(1,3), 2(2,4), 5, 6), which is used to train MART1.
• Train_set2 (two categories: 1, 3) is extracted, passed through PCA(1,3), and used to train MART2.
• Train_set3 (two categories: 2, 4) is extracted, passed through PCA(2,4), and used to train MART3.

Six-category grasps:
1 - large cylindrical grasp
2 - large spherical grasp
3 - small cylindrical grasp
4 - small spherical grasp
5 - precision grasp
6 - lateral grasp
Hierarchical Classification of EMG Patterns

Hierarchical classification (figure):
• The pattern (one of the six categories 1-6) is first classified by MART1.
• If MART1 predicts category 5 or 6, that prediction is the output.
• Otherwise, if MART1 predicts category 1, PCA(1,3) and MART2 are activated and MART2 produces the output.
• Otherwise, PCA(2,4) and MART3 are activated and MART3 produces the output.
By using hierarchical MART, the hit rate for 6-category EMG is increased to 84.33 %
Prediction of Sickle Cell Anemia Patients' Response to HU Treatment

Sickle cell anemia is a genetic disorder. The red blood cells (RBC) of the patients are distorted into a sickle shape; they stick in narrow blood vessels, blocking the flow of blood.

Sickle cell patients experience severe painful crises, and many die before the age of 20.

A drug called hydroxyurea (HU) can alleviate the symptoms of the disease, but it can also be quite toxic.

Our task is to design a classifier that can help the physician administer the drug to those patients who are going to respond positively.

The data used in this research was obtained from the University of Georgia Structural Genomics Group; Dr. Homayoun Valafar was responsible for the data collection and preprocessing.

The data contains information from 92 patients [42], each of whom contributes 26 parameters.
Prediction of Sickle Cell Anemia Patients' Response to HU Treatment

Labeling the data:
• 15-percent rule: if the final %HbF is over 15% while the initial %HbF is under 15%, the patient is labeled as a responder.
• Double rule: if the final HbF is increased at least two times over the initial HbF value, the patient is labeled as a responder.

Data preprocessing: remove linear dependency from the features and decrease the dimensionality.

Experimental results: the performance of the MART network was evaluated using the N-fold (leave-one-out) method. Each time a different sample is extracted from the whole data set for testing, while the rest are used for training; this procedure is repeated until all samples have been tested.
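The evaluation loop can be sketched generically; the train/predict callables below are hypothetical stand-ins for MART training and classification, and the toy data is only for illustration.

```python
import numpy as np

def leave_one_out_hit_rate(X, y, train, predict):
    """Hold each sample out once, train on the rest, and count correct predictions."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        model = train(X[mask], y[mask])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)

# Toy usage with a nearest-centroid stand-in classifier
X = np.vstack([np.random.randn(20, 3), np.random.randn(20, 3) + 3])
y = np.array([0] * 20 + [1] * 20)
train   = lambda X, y: {c: X[y == c].mean(0) for c in np.unique(y)}
predict = lambda m, x: min(m, key=lambda c: np.linalg.norm(x - m[c]))
print(leave_one_out_hit_rate(X, y, train, predict))
```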
Prediction of Sickle Cell Anemia Patients' Response to HU Treatment

                                             15% Rule                          Double Rule
    Number of features                       21               8                21               3
    Accuracy of predicting responders        60.00% (27/45)   68.89% (31/45)   92.06% (58/63)   95.24% (60/63)
    Accuracy of predicting non-responders    74.19% (23/31)   64.52% (20/31)   61.11% (11/18)   77.77% (14/18)
    Global hit rate                          65.79%           67.11%           85.19%           91.36%
    Number of output nodes                   3.01             2.84             6.70             2.04

Interestingly, in the case of the double rule, using only the subset of the three most significant features (SNBRC, HbF and TotalTx) gave considerably better results. This can be explained by the fact that the other features have practically no relevance and act as noise.
The Important Contributions in this Work

• Co-authored the development of the recurrent formula for computing the inverse covariance matrix.
• Eliminated the searching loop for the winner node.
• Concluded experimentally that classification based on the Gaussian mixture model gives a better hit rate than classification based on the Mahalanobis distance alone.
• Confirmed experimentally that a variable learning rate is more beneficial than a constant learning rate.
• Introduced the idea of resetting the covariance matrix.
• Designed and evaluated an effective merging algorithm.
• Developed applications of MART.
Future Work

• Treat the plasticity/stability property of MART theoretically.
• Profile the MATLAB implementation and convert all or part of the programs into optimized C/C++.
• Extend applications of MART to image processing, acoustic data classification, microarray data analysis, etc.