Copyright 2003 by Hongyu Xu
Mahalanobis Distance Based ARTMAP Networks
Master's Thesis
Supervisor: Marko Vuskovic
Hongyu Xu
Department of Computer Science, San Diego State University
October 16, 2003
Contents

Taxonomy of pattern recognition
Gaussian mixture model
Taxonomy of ART networks
ART and ARTMAP
Gaussian ARTMAP
Mahalanobis distance based ARTMAP (MART)
Merging ellipsoids
Experiments
Approaches in Pattern Recognition

(Taxonomy figure.) Pattern recognition approaches divide into two broad families:
• Bayes' classifiers, based on probability density estimation: Parzen windows, K-nearest neighbors, and mixture models (basis functions fitted by maximum likelihood or Bayesian inference).
• Neural networks: single-layer and multilayer perceptrons, radial basis function networks, SOM, K-means, feed-forward and recurrent architectures, ART, and MART.
Gaussian Mixture Model

The mixture model is a semi-parametric method of estimating a probability density function. In such a model the density function is formed from a linear combination of M basis functions:

    p(x) = \sum_{j=1}^{M} p(x | j) P(j)

The concept of the mixture model can be applied separately to each class, so that the number of mixture models equals the number of classes. The class-conditional probability density function is then given by:

    p(x | \omega_k) = \sum_{j,\, L(j) = \omega_k} p(x | j) P(j)

where L(j) is the label of cluster j and \omega is the array of class labels indexed by k.
Gaussian Mixture Model (Cont.)

Here we consider only Gaussian mixture models. (Example figure: a 2-D, two-class Gaussian mixture model.)

In each model, the cluster-conditional pdf of x is given by:

    p(x | j) = \frac{|Q_j|^{1/2}}{(2\pi)^{D/2}} \exp\left(-\tfrac{1}{2} t_j\right)

where Q_j is the inverse covariance matrix of cluster j and

    t_j = (x - w_j)^T Q_j (x - w_j)

is the Mahalanobis distance from x to cluster j.

The prior probability of cluster j in the mixture model of class \omega_k is given by:

    P(j) = N_j / N_{\omega_k}

(the ratio of the number of patterns enclosed by cluster j to the total number of patterns represented by the mixture model of class \omega_k).
Gaussian Mixture Model (Cont.)

Classification based on the Gaussian mixture model then proceeds as follows. The class-conditional pdf of x is given by:

    p(x | \omega_k) = \sum_{j,\, L(j) = \omega_k} p(x | j) P(j)

According to Bayes' theorem, the posterior probability of class \omega_k is:

    P(\omega_k | x) = \frac{p(x | \omega_k) P(\omega_k)}{p(x)}, \qquad P(\omega_k) = N_{\omega_k} / N

where P(\omega_k) is the ratio of the number of patterns enclosed by class \omega_k to the total number of patterns. Since p(x) is the same for every class, the decision rule becomes:

    J = \arg\max_{1 \le k \le C} \sum_{j,\, L(j) = \omega_k} p(x | j) P(c_j), \qquad P(c_j) = N_j / N
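As an illustration of this decision rule, here is a minimal Python/NumPy sketch (not the thesis implementation; the function and variable names are hypothetical). It assumes the cluster centers w_j, inverse covariance matrices Q_j, pattern counts N_j and cluster labels L(j) have already been estimated.

```python
import numpy as np

def gmm_classify(x, centers, inv_covs, counts, labels):
    """Return the label whose mixture sum_j p(x|j) P(c_j) is largest (hypothetical helper)."""
    D = len(x)
    N = float(sum(counts))
    scores = {}
    for w, Q, n, lab in zip(centers, inv_covs, counts, labels):
        d = x - w
        t = d @ Q @ d                                    # Mahalanobis distance t_j
        # p(x|j) = |Q_j|^(1/2) / (2*pi)^(D/2) * exp(-t_j/2),  P(c_j) = N_j / N
        p = np.sqrt(np.linalg.det(Q)) / (2 * np.pi) ** (D / 2) * np.exp(-t / 2)
        scores[lab] = scores.get(lab, 0.0) + p * n / N
    return max(scores, key=scores.get)

# Toy usage: two classes, one Gaussian cluster each
centers  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
inv_covs = [np.eye(2), np.eye(2)]
counts   = [10, 10]
labels   = ['A', 'B']
print(gmm_classify(np.array([0.5, 0.2]), centers, inv_covs, counts, labels))  # -> 'A'
```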
ART

What is ART?
ART stands for Adaptive Resonance Theory. ART networks incorporate the leader-follower algorithm into a parallel network architecture and employ set operations in their activation and match functions.

Why ART?
Research in this area was motivated by the need for a real-time, incremental-learning classifier.
ART is characterized by self-organizing, incremental, fast learning.
ART performs well in numerous benchmarks.
The initial effort at ART research was encouraging.
Important ART Networks

ART networks (Grossberg, 1976) divide into unsupervised and supervised ART learning.

Unsupervised ART learning:
• ART1, ART2 (Carpenter & Grossberg, 1987)
• Fuzzy ART (Carpenter, Grossberg, et al., 1991)
• Simplified ART (Baraldi & Alpaydin, 1998)

Supervised ART learning:
• ARTMAP (Carpenter, Grossberg, et al., 1991)
• Fuzzy ARTMAP (Carpenter, Grossberg, et al., 1991)
• Gaussian ARTMAP (Williamson, 1992)
• Simplified ARTMAP (Kasuba, 1993)
• Mahalanobis Distance Based ARTMAP (Vuskovic & Du, 2001; Vuskovic, Xu & Du, 2002)
ART

Architecture of a simplified ART (figure): the input x, together with the bias input x_0, enters the input layer of the attentional subsystem and is broadcast to the output nodes n_1, n_2, ..., n_C; the orienting subsystem, controlled by the vigilance ρ, decides whether the winning node J is accepted as the output.

Sequential implementation of ART:

    while (X not empty)                        Learning loop
    {   get x;                                 Get a pattern from X
        new = true;
        loop j = 1,C:  t_j = T(x, w_j)         Compute activations
        loop i = 1,C                           Search for resonance
        {   J = arg max_{j ≤ C} T_j            Find candidate
            if M(x, w_J) > ρ                   If resonance occurs
            {   w_J := U(x, w_J)               Update the template
                new = false
                break;                         Stop the search
            }
            else
                T_J := 0                       Suppress the candidate
        }
        NEWNODE(new);                          Create new node
    }

    NEWNODE(new):  if new == true  {  C := C + 1;  w_C := x;  W := W ∪ w_C  }
Functions Used in ART

T(): the choice (activation) function, a measure of the degree of resemblance of an input to a template.
M(): the match function, a measure of the degree of resemblance of a template to an input. Resonance happens when M > ρ.
U(): the update function, which updates a template when it resonates with a pattern.

ART-1 (binary patterns), where ∧ is the bitwise AND operator:

    T(x, w_j) = \frac{|x \wedge w_j|}{|w_j|}, \qquad
    M(x, w_j) = \frac{|x \wedge w_j|}{|x|}, \qquad
    U(x, w_j) = (1 - \beta)\, w_j + \beta\, (x \wedge w_j)

ART-2 (fuzzy patterns): the same functions, with ∧ the fuzzy AND (element-wise minimum) operator.

Here β ∈ [0, 1] is the learning rate and |·| is the norm operator defined by |a| = \sum_{i=1}^{D} a_i.
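To make these definitions concrete, here is a minimal Python sketch of the fuzzy (ART-2 style) versions, taking the fuzzy AND as the element-wise minimum and |·| as the sum of components, as defined above. The helper names and the sample numbers are hypothetical.

```python
import numpy as np

def norm1(a):
    """|a| = sum_i a_i"""
    return float(np.sum(a))

def choice(x, w):
    """T(x, w) = |x ^ w| / |w|"""
    return norm1(np.minimum(x, w)) / norm1(w)

def match(x, w):
    """M(x, w) = |x ^ w| / |x|"""
    return norm1(np.minimum(x, w)) / norm1(x)

def update(x, w, beta=1.0):
    """U(x, w) = (1 - beta) w + beta (x ^ w); beta = 1 gives fast learning."""
    return (1 - beta) * w + beta * np.minimum(x, w)

# One resonance test against a single template
x, w, rho = np.array([0.2, 0.7, 0.8, 0.3]), np.array([0.3, 0.6, 0.9, 0.4]), 0.7
if match(x, w) > rho:          # resonance: the template resembles the input closely enough
    w = update(x, w)
print(choice(x, w), w)
```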
Geometric interpretation of fuzzy ART with fast learning and complement coding

Input (complement coding):

    I = c(x_1, x_2) = (x_1, x_2, 1 - x_1, 1 - x_2)

The weight vector is given by:

    w_j = (u_j, v_j^c)

Each category j has a geometric representation as a rectangle R_j with corners u_j and v_j; the size of R_j is defined by |R_j| = |v_j - u_j|.

(Figure: a category box R_J. When R_J has been encoded with a single pattern x^{(i)}, |R_J| = 0 and the box collapses to that point; when it has been encoded with two patterns x^{(i)} and x^{(k)}, R_J is the rectangle spanned by them.)
Geometric interpretation of fuzzy ART (Cont.)

    \wedge x_j = \min\{ x_i : x_i \text{ has been coded by category } j \}
    \vee x_j = \max\{ x_i : x_i \text{ has been coded by category } j \}

The template is given by:

    w_j = (\wedge x_j, (\vee x_j)^c)

In general, if x has dimension D, the hyper-rectangle R_j includes the two vertices \wedge x_j and \vee x_j. The following relationships can easily be derived:

    |R_j| = D - |w_j| \quad \text{and} \quad |R_j| \le D (1 - \rho)
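A tiny numeric sketch (hypothetical values) shows how the template, the rectangle R_j, and the relation |R_j| = D - |w_j| follow from fast learning with complement coding.

```python
import numpy as np

D = 2
x1 = np.array([0.2, 0.3])                # first pattern coded by the category
x2 = np.array([0.6, 0.5])                # second pattern coded by the category

lo = np.minimum(x1, x2)                  # ^x_j  (lower corner of R_j)
hi = np.maximum(x1, x2)                  # vx_j  (upper corner of R_j)
w  = np.concatenate([lo, 1.0 - hi])      # template w_j = (^x_j, (vx_j)^c)

size_R = np.sum(hi - lo)                 # |R_j|
print(size_R, D - np.sum(w))             # both print 0.6, i.e. |R_j| = D - |w_j|
```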
Sequential Algorithm of ARTMAP

    ...
    loop j = 1,C:  t_j = T(x, w_j)             Compute activations
    loop j = 1,C                               Search for resonance
    {   J = arg max_{j ≤ C} T_j                Find template with highest activation value
        if label(w_J) == label(x)              If label matches
        {   if M(x, w_J) > ρ                   And the resonance occurs
            {   w_J := U(x, w_J);              Update the template
                new = false;                   No new node needed
                break;                         Stop the search for the resonant node
            }
            else T_J := 0                      If the distance doesn't match
        }
        else                                   If the label doesn't match
        {   T_J := 0;                          Reset its activation value, so J will not be elected
            ρ := M(x, w_J) + ε;                Match tracking triggered
        }                                      Continue the search for the resonant node
    }
    NEWNODE(new);                              Create new node if needed
    ...

    NEWNODE(new):  if new == true  {  C := C + 1;  w_C := x;  W := W ∪ w_C;  label(w_C) := label(x)  }
Gaussian ARTMAP

Gaussian ARTMAP (GA), proposed by Williamson in 1995, employs the Gaussian distribution in the choice and match functions.

Choice function:

    T_j(x) = \log\left[(2\pi)^{D/2} p(x | j) P(j)\right]
           = -\frac{1}{2} \sum_{i=1}^{D} \left(\frac{w_{ji} - x_i}{\sigma_{ji}}\right)^2 - \sum_{i=1}^{D} \log \sigma_{ji} + \log P(j)

Match function:

    M(x, w_J) = -\frac{1}{2} \sum_{i=1}^{D} \left(\frac{x_i - w_{Ji}}{\sigma_{Ji}}\right)^2
              = T_J(x) + \sum_{i=1}^{D} \ln \sigma_{Ji} - \ln P(J)

Update function:

    n_J := n_J + 1
    w_{Ji} := (1 - \eta_J) w_{Ji} + \eta_J x_i, \qquad i = 1, 2, ..., D
    \sigma_{Ji} := \sqrt{(1 - \eta_J) \sigma_{Ji}^2 + \eta_J (x_i - w_{Ji})^2}

where \eta_J = 1 / n_J.
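A minimal sketch of this update step in Python (illustrative only; the helper name and the assumption that σ is recomputed from the already-updated mean, as the formula above suggests, are mine, not the thesis code):

```python
import numpy as np

def ga_update(x, w, sigma, n):
    """One Gaussian-ARTMAP style update of the winning node's count, mean, and std (a sketch)."""
    n = n + 1
    eta = 1.0 / n                               # learning rate eta_J = 1 / n_J
    w = (1 - eta) * w + eta * x                 # per-dimension mean
    var = (1 - eta) * sigma**2 + eta * (x - w)**2
    return w, np.sqrt(var), n

w, sigma, n = ga_update(np.array([0.4, 0.9]), np.array([0.5, 0.7]),
                        np.array([0.1, 0.2]), n=4)
print(w, sigma, n)
```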
Gaussian ARTMAP (Cont.)

Classification function:

    P(\omega_k | x) \propto \sum_{j,\, L(j) = \omega_k} \exp\big(T_j(x, w_j, \sigma_j)\big), \qquad class = \arg\max_k P(\omega_k | x)

Advantage: it solves the category proliferation caused by noise and gives better generalization and representation of the categories.
Disadvantage: its axis-parallel Gaussians are a less efficient representation than general ellipsoids.
Mahalanobis Distance Based ARTMAP

• Vuskovic, M. I. and Du, S. Classification of EMG patterns with simplified fuzzy ARTMAP networks. Proceedings of the 2002 International Joint Conference on Neural Networks, Honolulu, HI, May 12-17, 2002.
• Vuskovic, M. I., Xu, H., and Du, S. Simplified ARTMAP network based on Mahalanobis distance. Proceedings of the 2002 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Science, Las Vegas, Nevada, June 27-27, 2002.
• Xu, H., Valafar, F. and Vuskovic, M. Prediction of sickle cell anemia patients' response to hydroxyurea treatment using ARTMAP. Proceedings of the 2003 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Science, Las Vegas, Nevada, June 23-26, 2003.
Sequential Learning Algorithm of MART

    ...
    loop j = 1,C
    {   if label(w_j) == label(x):  t_j = T(x, w_j)      Activate only the same-label templates
    }
    J = arg min_{j ≤ C} t_j                               Find the closest template for x
    if t_J ≤ ρ                                            Resonance occurs
    {   (w_J, Q_J, N_J) := U(w_J, Q_J, N_J, x)            Update w and Q
        new = false;                                      No new node needed
    }
    NEWNODE(new);                                         Create new node if needed
    ...

    NEWNODE(new):  if new == true  {  C := C + 1;  w_C := x;  W := W ∪ w_C;  label(w_C) := label(x)  }

Notes:
• Only the templates with the same label as the input are activated.
• The vigilance ρ remains constant.
• The activation and match functions are made the same, so the searching loop is not needed.
Functions Used in MART

Activation and match function (the Mahalanobis distance from x to cluster j):

    t_j = T(x, w_j) = M(x, w_j) = (x - w_j)^T Q_j (x - w_j)

Update function:

    w_J := (1 - \eta_J) w_J + \eta_J x
    Q_J := \lambda_1 \left( Q_J - \frac{\lambda_2}{1 + \lambda_2 t_J}\, g_J g_J^T \right), \qquad g_J = Q_J (x - w_J)
    N_J := N_J + 1

where \eta_J = 1 / (N_J + 1), and \lambda_1, \lambda_2 follow from the recurrent formula for the inverse covariance matrix (next slide). (The inverse covariance matrix of a newly committed node is initialized with a scaled identity, Q_0 = \frac{1}{R^2} I.)

Classification:

    class = \arg\max_k P(\omega_k | x) = \arg\max_k \sum_{j,\, L(j) = \omega_k} p(x | j) P(j)
          = \arg\max_k \sum_{j,\, L(j) = \omega_k} \exp\left(-\tfrac{1}{2} t_j + \tfrac{1}{2} \ln|Q_j| + \ln N_j\right)
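For concreteness, here is a small Python sketch of the MART activation (the Mahalanobis distance) and of the mixture-based classification rule above. It is illustrative only; the helper names and the toy labels are hypothetical.

```python
import numpy as np

def activation(x, w, Q):
    """t_j = (x - w_j)^T Q_j (x - w_j), the Mahalanobis distance from x to cluster j."""
    d = x - w
    return float(d @ Q @ d)

def classify(x, clusters):
    """clusters: list of (w_j, Q_j, N_j, label).  Returns the label maximizing
    sum_j exp(-t_j/2 + ln|Q_j|/2 + ln N_j) over the clusters carrying that label."""
    scores = {}
    for w, Q, n, lab in clusters:
        t = activation(x, w, Q)
        scores[lab] = scores.get(lab, 0.0) + np.exp(-0.5 * t + 0.5 * np.log(np.linalg.det(Q)) + np.log(n))
    return max(scores, key=scores.get)

clusters = [(np.array([0.0, 0.0]), np.eye(2), 12, 'grasp A'),
            (np.array([2.0, 2.0]), np.eye(2),  8, 'grasp B')]
print(classify(np.array([0.3, -0.1]), clusters))   # -> 'grasp A'
```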
Recurrent Formula for Computing Q and W

The cluster center and covariance matrix are estimated by:

    w_n = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}, \qquad
    S_n = \frac{1}{n-1} \sum_{i=1}^{n} (x^{(i)} - w_n)(x^{(i)} - w_n)^T

Recurrent formulas for computing S and w:

    w_{n+1} = (1 - \eta)\, w_n + \eta\, x^{(n+1)}
    S_{n+1} = \alpha_1 S_n + \alpha_2 (x^{(n+1)} - w_n)(x^{(n+1)} - w_n)^T

where \eta = \frac{1}{n+1}, \alpha_1 = \frac{n-1}{n}, \alpha_2 = \frac{1}{n+1}.

Matrix inversion lemma:

    (A + BC)^{-1} = A^{-1} - A^{-1} B (I + C A^{-1} B)^{-1} C A^{-1}

Recurrent formula for computing the inverse of S (with Q = S^{-1}):

    Q_{n+1} = \lambda_1 \left( Q_n - \frac{\lambda_2}{1 + \lambda_2 t}\, g g^T \right)

where g = Q_n (x^{(n+1)} - w_n), t = (x^{(n+1)} - w_n)^T Q_n (x^{(n+1)} - w_n), \lambda_1 = \frac{1}{\alpha_1}, \lambda_2 = \frac{\alpha_2}{\alpha_1}.
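The following Python sketch implements exactly these recursions (the function name is hypothetical) and checks them against the direct batch estimates; the inverse covariance is maintained with a rank-one update, so S is never re-inverted.

```python
import numpy as np

def update_center_and_inverse(w, Q, n, x):
    """Recursive update of the cluster center w and of Q = S^{-1} after the (n+1)-th pattern x,
    using the matrix inversion lemma (a sketch of the formulas above)."""
    d = x - w
    a1, a2 = (n - 1.0) / n, 1.0 / (n + 1)       # S_{n+1} = a1 S_n + a2 d d^T
    lam1, lam2 = 1.0 / a1, a2 / a1
    g = Q @ d
    t = float(d @ g)                            # Mahalanobis distance of x to the old cluster
    Q_new = lam1 * (Q - (lam2 / (1.0 + lam2 * t)) * np.outer(g, g))
    w_new = w + (x - w) / (n + 1)               # w_{n+1} = (1 - eta) w_n + eta x, eta = 1/(n+1)
    return w_new, Q_new, n + 1

# Check against the direct batch estimates on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w, n = X[:5].mean(0), 5
Q = np.linalg.inv(np.cov(X[:5].T))
for x in X[5:]:
    w, Q, n = update_center_and_inverse(w, Q, n, x)
print(np.allclose(np.linalg.inv(Q), np.cov(X.T)), np.allclose(w, X.mean(0)))   # True True
```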
Architecture of MART (Training Mode)

(Figure.) In training mode the input pattern (features X with label L) is broadcast to all output neurons n_1, n_2, ..., n_C. Each neuron returns its activation value t to the net manager, which finds the maximum activation value, sends back the inhibit / update / new-node signal i, and commits new nodes when needed.

Legend: X - features, L - label, t - activation value, i - inhibit / update / new-node signal.
Architecture of MART (Classification Mode)

(Figure.) In classification mode the input pattern x (without a label) is broadcast to all output neurons n_1, n_2, ..., n_C. Each neuron returns a pair (p, L) to the net manager, which finds the maximum posterior probability over all categories and outputs the corresponding label.

Legend: p - posterior probability, L - label.
Merging Ellipsoids

Why merge clusters?
(1) To improve generalization.
(2) To reduce the number of output nodes.

(Figures: the clusters before merging, where the numbers indicate the order in which the patterns arrive, and the clusters after merging.)
Merging Criterion

Distance of cluster j to cluster i:

    d_{ji} = (w_j - w_i)^T Q_i (w_j - w_i)

Distance of cluster i to cluster j:

    d_{ij} = (w_i - w_j)^T Q_j (w_i - w_j)

To merge cluster i into cluster j:
• Clusters i and j must have the same label.
• They must be mutually close to each other: d_{ji} and d_{ij} are both below the closeness threshold.
• They must not have a common enemy: for every enemy cluster e (a cluster with a different label), at least one of the distances d_{ie}, d_{je} must exceed the threshold.
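A small Python sketch of this test (illustrative only; the helper names and the use of a single closeness threshold rho are assumptions on top of the criterion above):

```python
import numpy as np

def mdist(a, b, Q_b):
    """Distance of center a from cluster b, measured in cluster b's metric: (a - b)^T Q_b (a - b)."""
    d = a - b
    return float(d @ Q_b @ d)

def can_merge(wi, Qi, wj, Qj, enemies, rho):
    """Clusters i and j (same label) may be merged if they are mutually close
    and no enemy cluster is close to both of them."""
    mutually_close = mdist(wj, wi, Qi) <= rho and mdist(wi, wj, Qj) <= rho
    no_common_enemy = all(mdist(we, wi, Qi) > rho or mdist(we, wj, Qj) > rho
                          for we, Qe in enemies)
    return mutually_close and no_common_enemy

wi, wj = np.array([0.0, 0.0]), np.array([0.5, 0.0])
Qi = Qj = np.eye(2)
enemies = [(np.array([5.0, 5.0]), np.eye(2))]        # one cluster with a different label
print(can_merge(wi, Qi, wj, Qj, enemies, rho=1.0))   # True: mutually close, no common enemy
```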
Illustrate Merging Criterion (1)

(Figures: two ellipsoidal clusters i and j with centers w_i and w_j.)
Cluster j is close to cluster i, but cluster i is not close to cluster j. The two clusters cannot be merged.
Cluster i and cluster j are close to each other and have the same label, therefore, they can be merged.
Illustrate Merging Criterion (2)

(Figures: clusters i and j with centers w_i and w_j, and an enemy cluster with center w_e.)
Cluster i and cluster j are close to each other. Though cluster j is close to an enemy cluster, because cluster i is not close to the enemy cluster, the two clusters can be merged.
Cluster i and cluster j are close to each other, but because they are also close to a common enemy, they can not be merged.
Evaluate Merging Algorithm on Circle-in-Square Benchmark (1)

(Figures: the learned clusters before merging and after merging.)
Evaluate Merging Algorithm on Circle-in-Square Benchmark (2)

(Figures: the learned clusters before merging and after merging.)
Evaluate Merging Algorithm on Frey-Slate Letter Image Recognition Problem

11 features:
    seed      before merging: nodes / hit %    after merging: nodes / hit %    change: nodes / hit %
    1         687 / 96.65                      352 / 95.93                     -335 / -0.72
    2         687 / 96.80                      380 / 96.20                     -307 / -0.60
    3         676 / 96.73                      364 / 95.58                     -312 / -1.15
    4         657 / 96.53                      356 / 95.95                     -301 / -0.58
    Average   677 / 96.67                      363 / 95.91                     -314 / -0.76

16 features:
    seed      before merging: nodes / hit %    after merging: nodes / hit %    change: nodes / hit %
    1         578 / 97.45                      262 / 96.45                     -316 / -1.00
    2         587 / 97.40                      275 / 96.33                     -312 / -1.11
    3         538 / 96.92                      267 / 95.80                     -271 / -1.12
    4         555 / 97.28                      267 / 96.35                     -288 / -0.93
    Average   565 / 97.26                      268 / 96.23                     -297 / -1.03
Circle-in-Square Benchmark

The circle-in-square problem is specified as a benchmark in the DARPA artificial neural network technology program for measuring network performance.

Description: a square with unit side length contains a circle, centered in the square, that covers exactly half of the square's area. The task is to tell whether a data point lies inside or outside the circle.
Experimental Results on Circle-in-Square Benchmark

    Network         Samples/Epochs    Hit rate % (worst-best)    Output nodes (best-worst)
    Fuzzy ARTMAP    100/2             89.60                      12
                    1000/3            90.5  (85.9-93.4)          27
    DSAM            100/2             89.7  (80-99)              8.9  (6.0-11.0)
                    1000/3            95.0  (92.00-97.00)        11.1 (8-15.0)
    MART            100/2             90.60 (79.00-99.00)        9.16 (7.0-13.0)
                    1000/3            95.84 (92.50-97.50)        10.80 (9.0-13.0)

A two-layer back-propagation network (with 20 to 40 hidden neurons) [49] takes 5000 epochs to reach its equilibrium state and achieves about a 90% hit rate.

The results shown here for MART are averaged over 100 independent experiments, each with a data set generated from a different random seed.
Letter Recognition Problem

• The database is archived in the UCI Repository of Machine Learning Databases and Domain Theories, maintained by D. Aha and P. Murphy (ml-repository@ics.uci.edu).
• The database consists of 20,000 unique samples. The first 16,000 samples are used for training and the last 4,000 samples are used for testing.
• Each pattern consists of 16 features obtained from machine-generated images of the 26 capital letters A to Z.
• The characters are generated from 20 different fonts.
• The fonts are randomly distorted.
Experimental Results on Letter Recognition Problem

The results shown here are averaged over five independent experiments for FA, GA, and MART, each with a different training sequence. The individual results were averaged to produce the one-voter result and combined to produce the five-voter results.

    # of features    Network    # of voters    Epochs    Hit rate %    Output nodes
    16               FA         5              7         94.85         5,175
                     GA         5              20        95.95         4,208
                     NN         -              -         95.80         16,000
                     MART       1              1         97.52         580
    11               FA         5              20        95.82         5,312
                     GA         5              20        95.98         5,218
                     NN         -              -         96.55         16,000
                     MART       1              3         96.79         872
Electromyographic Signal (EMG)

• Basic types of finger grasps: cylindrical grasp, spherical grasp, precision grasp (pinch), lateral grasp (key grasp), and hook grasp. In this research the first four categories were recorded, and the cylindrical grasps were further divided into big and small grasps, as were the spherical grasps.
• The EMG signals were collected from several human subjects with healthy hands. Four electrodes (channels) were placed on their upper forearms.
• The raw EMG signal of each channel was squared, then passed through a moving-average FIR (finite impulse response) filter with a Hamming window of size 300 ms (a sketch follows this list).
• The amplitude of the first oscillation in each channel was used as a feature.
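The preprocessing step can be sketched as follows; this is illustrative only, and the 1 kHz sampling rate and the toy burst signal are assumptions, not values from the thesis (only the 300 ms Hamming window comes from the slide).

```python
import numpy as np

fs = 1000                                   # assumed sampling rate in Hz
win = np.hamming(int(fs * 0.300))           # 300 ms Hamming window from the slide
win = win / win.sum()                       # normalize: a Hamming-weighted moving average

def smooth_envelope(raw_channel):
    """Square the raw EMG of one channel, then apply the moving-average FIR filter."""
    return np.convolve(raw_channel**2, win, mode='same')

# Toy signal: a burst of activity between 0.5 s and 1.0 s in background noise
t = np.arange(0, 2, 1 / fs)
raw = 0.05 * np.random.randn(t.size) + (t > 0.5) * (t < 1.0) * np.sin(2 * np.pi * 80 * t)
env = smooth_envelope(raw)
print(env.argmax() / fs)                    # time (s) of the envelope peak, inside the burst
```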
Experimental Results on EMG Signal
    # of categories    Network           Hit rate (%)    Output nodes
    4                  Classical SFAM    85.7            9.4
                       E-SFAM            86.53           30.1
                       M-SFAM            94.6            5.2
                       MART              96.03           4.31
    6                  Classical SFAM    61.1            24.3
                       E-SFAM            60.1            53.7
                       M-SFAM            77.6            7.0
                       MART              82.67           6.16
The results shown here are averaged over 100 experiments. E-SFAM stands for Euclidean distance based SFAM. M-SFAM stands for Mahalanobis distance based SFAM.
Hierarchical Classification of EMG Patterns

Network training (figure):
• Train_set (six categories: 1, 2, 3, 4, 5, 6) is converted into Train_set1 (four categories: 1(1,3), 2(2,4), 5, 6), which is used to train MART1.
• Train_set2 (two categories: 1, 3) is extracted, passed through PCA(1,3), and used to train MART2.
• Train_set3 (two categories: 2, 4) is extracted, passed through PCA(2,4), and used to train MART3.

Six-category grasps:
1 - large cylindrical grasp
2 - large spherical grasp
3 - small cylindrical grasp
4 - small spherical grasp
5 - precision grasp
6 - lateral grasp
Hierarchical Classification of EMG Patterns

Hierarchical classification (figure):
• The pattern (one of the six categories 1-6) is first classified by MART1.
• If MART1 predicts category 5 or 6, that prediction is the output.
• Otherwise, if MART1 predicts category 1, PCA(1,3) and MART2 are activated and MART2 produces the output.
• Otherwise, PCA(2,4) and MART3 are activated and MART3 produces the output.
By using hierarchical MART, the hit rate for 6-category EMG is increased to 84.33 %
Prediction of Sickle Cell Anemia Patients' Response to HU Treatment

Sickle cell anemia is a genetic disorder. The red blood cells (RBC) of the patients are distorted into a sickle shape; they stick in narrow blood vessels, blocking the flow of blood.

Sickle cell patients experience severe painful crises, and many die before the age of 20.

A drug called hydroxyurea (HU) can alleviate the symptoms of the disease, but it can also be quite toxic.

Our task is to design a classifier that can help the physician administer the drug to those patients who are going to respond positively.

The data used in this research was obtained from the University of Georgia Structural Genomics Group; Dr. Homayoun Valafar was responsible for the data collection and preprocessing.

The data contains information from 92 patients [42], each of whom contributes 26 parameters.
Prediction of Sickle Cell Anemia Patients' Response to HU Treatment

Labeling the data:
• 15-percent rule: if the final %HbF is over 15% while the initial %HbF is under 15%, the patient is labeled as a responder.
• Double rule: if the final HbF is increased at least two times over the initial HbF value, the patient is labeled as a responder.

Data preprocessing: remove linear dependency from the features and decrease the dimensionality.

Experimental results: the performance of the MART network was evaluated using the N-fold (leave-one-out) method. Each time a different sample is extracted from the whole data set for testing, while the rest are used for training; this procedure is repeated until all samples have been tested.
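The evaluation loop can be sketched generically; the train/predict callables below are hypothetical stand-ins for MART training and classification, and the toy data is only for illustration.

```python
import numpy as np

def leave_one_out_hit_rate(X, y, train, predict):
    """Hold each sample out once, train on the rest, and count correct predictions."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        model = train(X[mask], y[mask])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)

# Toy usage with a nearest-centroid stand-in classifier
X = np.vstack([np.random.randn(20, 3), np.random.randn(20, 3) + 3])
y = np.array([0] * 20 + [1] * 20)
train   = lambda X, y: {c: X[y == c].mean(0) for c in np.unique(y)}
predict = lambda m, x: min(m, key=lambda c: np.linalg.norm(x - m[c]))
print(leave_one_out_hit_rate(X, y, train, predict))
```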
Prediction of Sickle Cell Anemia Patients' Response to HU Treatment

                                             15% Rule                          Double Rule
    Number of features                       21               8                21               3
    Accuracy of predicting responders        60.00% (27/45)   68.89% (31/45)   92.06% (58/63)   95.24% (60/63)
    Accuracy of predicting non-responders    74.19% (23/31)   64.52% (20/31)   61.11% (11/18)   77.77% (14/18)
    Global hit rate                          65.79%           67.11%           85.19%           91.36%
    Number of output nodes                   3.01             2.84             6.70             2.04

Interestingly, in the case of the double rule, using only the subset of the three most significant features (SNBRC, HbF and TotalTx) gave considerably better results. This can be explained by the fact that the other features have practically no relevance and act as noise.
The Important Contributions in this Work

• Co-authored the development of the recurrent formula for computing the inverse covariance matrix.
• Eliminated the searching loop for the winner node.
• Concluded experimentally that classification based on the Gaussian mixture model gives a better hit rate than classification based on the Mahalanobis distance alone.
• Confirmed experimentally that a variable learning rate is more beneficial than a constant learning rate.
• Introduced the idea of resetting the covariance matrix.
• Designed and evaluated an effective merging algorithm.
• Developed applications of MART.
Future Work

• Treat the plasticity/stability property of MART theoretically.
• Profile the MATLAB implementation and convert all or part of the programs into optimized C/C++.
• Extend applications of MART to image processing, acoustic data classification, microarray data analysis, etc.