
1254 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 23, NO. 8, AUGUST 2012

SOMKE: Kernel Density Estimation Over Data Streams by Sequences of Self-Organizing Maps

Yuan Cao, Student Member, IEEE, Haibo He, Senior Member, IEEE, and Hong Man, Senior Member, IEEE

Abstract— In this paper, we propose SOMKE, a novel method for kernel density estimation (KDE) over data streams based on sequences of self-organizing maps (SOMs). In many stream data mining applications, the traditional KDE methods are infeasible because of the high computational cost, processing time, and memory requirement. To reduce the time and space complexity, we propose a SOM structure in this paper to obtain well-defined data clusters to estimate the underlying probability distributions of incoming data streams. The main idea of this paper is to build a series of SOMs over the data streams via two operations, that is, creating and merging the SOM sequences. The creation phase produces the SOM sequence entries for windows of the data, which obtains clustering information of the incoming data streams. The size of the SOM sequences can be further reduced by combining the consecutive entries in the sequence based on the measure of Kullback–Leibler divergence. Finally, the probability density functions over arbitrary time periods along the data streams can be estimated using such SOM sequences. We compare SOMKE with two other KDE methods for data streams, the M-kernel approach and the cluster kernel approach, in terms of accuracy and processing time for various stationary data streams. Furthermore, we also investigate the use of SOMKE over nonstationary (evolving) data streams, including a synthetic nonstationary data stream, a real-world financial data stream, and a group of network traffic data streams. The simulation results illustrate the effectiveness and efficiency of the proposed approach.

Index Terms— Kernel density estimation, Kullback–Leibler divergence, probability density functions, self-organized maps, stream data mining.

I. INTRODUCTION

RECENTLY, a large amount of raw data in many scientific and commercial applications has been collected at an increasing pace. For instance, large organizations like the financial companies on Wall Street produce hundreds of millions of records of transactions every day, and many scientific research projects also generate gigabytes of data on a daily basis [1].

Manuscript received August 29, 2011; revised January 16, 2012 and May 14, 2012; accepted April 5, 2012. Date of publication June 14, 2012; date of current version July 16, 2012. This work was supported in part by the Defense Advanced Research Projects Agency under Grant FA8650-11-1-7148 and Grant FA8650-11-1-7152, and the National Science Foundation under Grant ECCS 1053717.

Y. Cao is with MathWorks, Inc., Natick, MA 01760 USA (e-mail: [email protected]).

H. He is with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881 USA (e-mail: [email protected]).

H. Man is with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2012.2201167

Most of these continuous data streams are characterized by fast arrival, high volume, and open-endedness. Therefore, how to process and analyze these data streams effectively and efficiently has become a challenging problem, and many efforts have been made by both the academic and industrial communities [1]–[8].

In this paper, we investigate how to estimate the underlying probability density functions (pdfs) over data streams based on self-organizing maps (SOMs). Density estimation is defined as the construction of an estimate of the density function from the observed data [9]. Generally, there are two classes of density estimation methods, parametric and nonparametric. Parametric density estimation is conducted under the assumption that the data are drawn from a known parametric type of distribution, such as a Gaussian or uniform distribution, whereas nonparametric estimation makes no such assumption. In most real-world applications, the pdfs of the data streams are unknown. Therefore, nonparametric density estimation has found a great deal of applications in many domains, such as economics [10], network security [11], human motion tracking [12], financial data modeling [13], and fluid mechanics [14], among others. Kernel density estimation (KDE) is one of the most popular nonparametric density estimation approaches and is an important statistical analysis tool for data mining [15]. For instance, it is widely used in the modeling and simulation of physical phenomena. With KDE, we can construct the probability distributions over the measurements collected from a process. Then we can generate random variables for a Monte Carlo simulation. KDE can also be used in classification problems by constructing the class-conditional pdfs that are used in a Bayesian classifier. Other applications include outlier detection [11] and nonparametric regression.

However, the inherent characteristics of the KDE method make it infeasible to tackle large-scale data sets because of the high computational cost, processing time, and intensive memory allocation requirement [16]. To overcome these obstacles, several approaches have been presented in the literature for off-line data sets or online data streams. For instance, in [17], a reduced KDE algorithm that combines KDE with SOM was proposed, and some theoretical results on binned KDE were also presented. In [18], self-organizing mixture networks were proposed for probability density estimation, and applications of this model to density profile estimation and pattern classification were presented to illustrate the efficiency of the proposed method. The fast Gauss transform technique was used in [19] to speed up KDE, and this algorithm was successfully applied to vision-tracking


problems. For data streams, multipole-accelerated online density estimation was presented in [20], which aims to provide an approximated density estimate based on multipole techniques by maintaining a multivariate Taylor series expansion of the estimator, therefore significantly reducing the computational time. A local kernel algorithm for KDE over multivariate spatial data streams was proposed in [21], which computes the density estimators by updating local statistics with a kd-tree-like structure on top of a continuously maintained sample of the stream. In [22], orthogonal series were used to construct the density estimators, in which the orthogonal series coefficients in the estimators can be updated recursively. The M-kernel algorithm was presented in [23], and the cluster kernel algorithm was presented in [16] and [24]. These two algorithms construct the density estimators over data streams by inserting, maintaining, and merging kernels. In this article, the term "merging" of two kernels or SOMs means combining the two kernels or SOMs into one in order to reduce the memory used to process the infinite stream data. Since we will compare our approach to these two methods in the empirical study, we will take a closer look at them in Section III. Note that there are other SOM-based methods in the literature; however, to the best of our knowledge, not many are used explicitly for density estimation over stream data. M-kernel and cluster kernel are the two approaches that we found explicitly targeting stream data, which is why we mainly compare against them. Of course, it would be interesting to study other SOM approaches, such as the recursive SOM and self-organizing mixture networks, for stream data density estimation, and to assess the limitations and advantages of each approach.

SOM is a powerful learning model based on competitive learning and unsupervised learning [25]–[27]. The principle of SOM is to train a network of neurons to seek similar properties for certain input patterns and to project high-dimensional input vectors onto a low-dimensional (normally less than three dimensions) discrete map in a topologically ordered pattern. Therefore, SOM can form an approximation of the distribution of the input space in a compact fashion. Generally, SOM can be used for visualization [28], dimensionality reduction [29], vector quantization [30], and clustering [31]. Some variants of SOM have been developed and proposed in the literature, such as the temporal Kohonen map [32], recurrent SOM [33], recursive SOM [34], and growing merge neural gas (GMNG) [35]. All of these variants can preserve the temporal context of the input data and can be used for stream data mining. In [36] and [37], the self-organizing mixture autoregressive (SOMAR) model and the generalized SOMAR (GSOMAR) model are presented to tackle nonlinear and nonstationary time series. This model contains a number of autoregressive models that are learned and organized in a self-organizing manner by the adaptive least mean square algorithm. In [38], Alex et al. used Neural Gas and SOM in the patch clustering method. In this method, clusters are generated in small patches, and the information in the previous patch is used in the learning of new ones. An empirical study over independent and identically distributed (i.i.d.) streams and nonstationary streams illustrates the effectiveness of the proposed algorithm. Furthermore,

parallelization is introduced into the learning procedure and can further improve the efficiency of the proposed method. Combining an ensemble of SOMs, that is, fusion of SOMs, is another interesting topic. Several combining algorithms are presented in the literature, including fusion methods based on Euclidean distance [39], Voronoi polygon similarity [40], and ordered similarity [41].

In this paper, we present a novel density estimation method, SOMKE, over data streams based on SOM. We take advantage of the learning and clustering capabilities of SOM by using only the trained SOM neurons, instead of all the input vectors, as kernels to calculate pdfs. This approach can significantly reduce the number of terms in a kernel density estimator, and thus greatly improve the efficiency of KDE analysis over data streams. SOMKE consists of three steps: 1) creating the SOM sequence entries; 2) merging the SOM sequence; and 3) final estimation. A SOM sequence entry is a structure created by applying the SOM algorithm to each window of data to obtain the clustering information of the incoming data streams. All the SOM sequence entries are collected together to form the SOM sequence. The creation operation produces the SOM sequence entries for windows of data along the stream. Thus, we can gather a sequence of SOM entries along the data streams and form the SOM list. In SOMKE, the consecutive entries in the sequence can be combined based on a similarity measure to further save storage resources. In this paper, we use the Kullback–Leibler divergence to measure the distance between the distributions that can be estimated from two SOM sequence entries. Different combining strategies for the entries are also presented. In the final step, the pdfs over arbitrary time periods along the data streams can be easily estimated by gathering the corresponding SOM sequence entries. Since all the information along the streams is retained in a compact way, SOMKE can easily capture the distribution drift in the data streams and provides great flexibility in the analysis. The structure in our method is different from another hierarchical SOM approach, the growing hierarchical SOM (GHSOM) [42]. GHSOM attempts to capture the hierarchical relations within the data and provide a dynamic architecture of neural networks, so GHSOM is a top-down approach. On the contrary, our approach starts with SOMs trained from different chunks of raw data and further merges these SOMs based on the KL divergence to save memory resources. Therefore, our algorithm presents a bottom-up approach, and the resulting SOM sequence contains only a linear structure. When the sizes of the chunks increase or hierarchical relations are present within the data, GHSOM is a good candidate to replace the traditional SOM algorithm in our framework.

The rest of this paper is organized as follows. In Section II, we briefly introduce the data stream model, KDE, SOM, and the Kullback–Leibler divergence to provide a foundation for the proposed method. In Section III, we discuss two recently developed density estimation approaches over data streams, the M-kernel method and the cluster kernel method. This will provide the necessary support when we compare SOMKE with these methods. We then present SOMKE in detail in Section IV. The simulation results are presented in Section V, and finally Section VI concludes this paper.


II. PRELIMINARIES

In this section, we first model the underlying data streams. Then, we give brief descriptions of the KDE method, the SOM learning algorithm, and the Kullback–Leibler divergence.

A. Data Stream Model

There are various data stream models and corresponding techniques in the literature. To clearly define the problem addressed in this research, we first specify the data model that our method targets. Specifically, in this paper we consider the same data stream model presented in the problem formulation addressed in [43]. Consider a d-dimensional data stream that contains input patterns $\{x_1, x_2, \ldots, x_n, \ldots\}$ with $x_k = (x_{k1}, x_{k2}, \ldots, x_{kd}) \in \mathbb{R}^d$, where d and k are integers. Assume that the data stream can be partitioned into a sequence of windows $W_i$ with respect to time, with width $T(i)$, such that all the data in $W_i$ are i.i.d. observations from the same distribution $p_i$. Therefore, the data stream can be modeled as a series of windows $\{W_1, W_2, \ldots, W_n, \ldots\}$, where $W_i = \{x_1^i, x_2^i, \ldots, x_{T(i)}^i\}$ and all the data $x_k^i \in W_i$ follow the same underlying distribution $p_i$. Based on this stream data model, we take advantage of the learning capability of SOM by using the trained SOM neurons, instead of all the input vectors, as kernels to calculate pdfs. In this way, our approach can reduce the number of required terms in a kernel density estimator, thus greatly improving the efficiency of the KDE analysis over data streams.

Note that the width T can be constant, that is, $T(i) \equiv T$ for all windows, or vary over time; it can either be determined by the characteristics of the data or decided by the user for specific analysis purposes. For instance, for a data stream of the daily rates of return of Ford stock (NYSE:F) from January 3, 2000 to December 31, 2009, we can use window widths according to the fiscal years as T = [252, 248, 252, 252, 252, 252, 251, 251, 253, 252], or simply set a fixed window width with T(i) = 250 for i = 1, 2, ..., 9, and the rest for window 10 in the analysis. Note that in some applications, overlapping windows may be generated over the data streams. In this paper we consider only nonoverlapping windows.
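As a concrete illustration of this windowing model, the following minimal NumPy sketch partitions a 1-D stream into nonoverlapping windows; the function name and variables are illustrative only and not part of the original formulation.

```python
import numpy as np

def partition_stream(stream, widths):
    """Split a 1-D stream into consecutive nonoverlapping windows W_1, W_2, ...

    stream: array of observations in arrival order.
    widths: iterable of window widths T(i).
    """
    windows, start = [], 0
    for T in widths:
        windows.append(stream[start:start + T])
        start += T
    return windows

# Example: a fixed window width T(i) = 250, as in the Ford stock example above.
stream = np.random.randn(2500)
windows = partition_stream(stream, [250] * 10)
```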

B. KDE

Various density estimation methods are summarized in [9]. In this section, we provide a brief discussion of KDE, and interested readers can refer to [9] for details.

In univariate cases, given a vector of the observed data $X = \{X_1, X_2, \ldots, X_n\}$ with size n, the kernel estimator with kernel K is defined by
$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) \qquad (1)$$
where h is the window width. Usually, K is a symmetric pdf that satisfies $\int K(x)\,dx = 1$, $\int x K(x)\,dx = 0$, and $\int x^2 K(x)\,dx \neq 0$.

There are various choices of kernel, such as the uniform, triangle, quartic, triweight, and cosine kernels [9]. In SOMKE, we adopt the popular Gaussian kernel defined as
$$K(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2} \qquad (2)$$
whereas in the cluster kernel method [16], [24], the Epanechnikov kernel is used, which is defined as
$$K(x) = \frac{3}{4}\left(1 - x^2\right) I_{|x|\le 1}. \qquad (3)$$

The window width is an important parameter in KDE. Too large a window width will lead to oversmoothing, while too small a window width will result in an undersmoothed estimate. Therefore, the window width has to be chosen carefully. A common way to find the optimal window width is by minimizing the asymptotic mean-integrated squared error. Approaches for finding the optimal window width include the plug-in method and the cross-validation method. Interested readers can refer to [44] and [45] for detailed information. In this paper, since our focus is not window width estimation but the presentation of a novel concept for KDE, we adopt the simple but efficient bandwidth strategy suggested by [46] and [9]

$$h = 1.06\,\sigma n^{-1/5} \qquad (4)$$

where σ is the standard deviation of the vector X. This rule-of-thumb bandwidth is derived by assuming that the underlying density belongs to the family of normal distributions and that the Gaussian kernel is used in the KDE. For the Epanechnikov kernel, this simple bandwidth strategy is also valid [16], and it avoids the heavy computational burden of estimating the bandwidth that may be caused by other, more complex bandwidth strategies. Therefore, in this paper, we also adopt (4) to estimate the bandwidth for the Epanechnikov kernel. We note that better estimation results may be obtained using a robust measure of spread, or further improved by considering a skewness factor if the data are heavily skewed. Interested readers may refer to [9] for further details.
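For illustration, a minimal NumPy sketch of the univariate estimator (1) with the Gaussian kernel (2) and the rule-of-thumb bandwidth (4) is given below; it is a simplified reference implementation, not the code used in the paper.

```python
import numpy as np

def rule_of_thumb_bandwidth(data):
    # h = 1.06 * sigma * n^(-1/5), cf. (4)
    return 1.06 * np.std(data) * len(data) ** (-1.0 / 5.0)

def gaussian_kde(x, data, h=None):
    """Evaluate the kernel density estimate (1) with the Gaussian kernel (2)."""
    data = np.asarray(data, dtype=float)
    h = rule_of_thumb_bandwidth(data) if h is None else h
    u = (np.asarray(x, dtype=float)[:, None] - data[None, :]) / h   # (x - X_i) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)                # kernel (2)
    return K.sum(axis=1) / (len(data) * h)

# Example: estimate the density of 500 standard normal observations on a grid.
data = np.random.randn(500)
grid = np.linspace(-4, 4, 200)
pdf_hat = gaussian_kde(grid, data)
```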

For multivariate data sets, the kernel estimator can be obtained as [9]
$$\hat{f}_H(x) = \frac{1}{n h_1 h_2 \cdots h_d}\sum_{i=1}^{n} K\left(\frac{x_1 - X_{i1}}{h_1}, \cdots, \frac{x_d - X_{id}}{h_d}\right) \qquad (5)$$
where $H = (h_1, h_2, \ldots, h_d)'$ is a vector of positive bandwidths for the input data, and the kernel function K for multivariate data is usually a radially symmetric unimodal pdf. For instance, it may be the standard multivariate normal density function
$$K(x) = (2\pi)^{-d/2}\exp\left(-\frac{1}{2}x^T x\right) \qquad (6)$$
or the multivariate Epanechnikov kernel, which is defined as
$$K(x) = \begin{cases} \frac{1}{2}c_d^{-1}(d + 2)\left(1 - x^T x\right), & x^T x < 1 \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

where $c_d$ is the volume of the unit d-dimensional sphere. The bandwidth $H = (h_1, h_2, \ldots, h_d)$ in (5) can be estimated in the following way [46]:
$$h_k = \sigma_k n^{-1/(d+4)}, \quad k = 1, 2, \ldots, d. \qquad (8)$$
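A corresponding sketch of the multivariate product-form estimator (5), using the normal kernel (6) and the per-dimension bandwidths (8); again a simplified illustration with assumed variable names.

```python
import numpy as np

def multivariate_kde(x, data):
    """Evaluate (5) at query points x with bandwidths from (8).

    data: (n, d) array of observations; x: (m, d) array of query points.
    """
    data = np.asarray(data, dtype=float)
    x = np.asarray(x, dtype=float)
    n, d = data.shape
    h = np.std(data, axis=0) * n ** (-1.0 / (d + 4))      # h_k = sigma_k * n^(-1/(d+4)), cf. (8)
    u = (x[:, None, :] - data[None, :, :]) / h             # per-dimension scaled differences
    K = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * (u ** 2).sum(axis=2))   # kernel (6)
    return K.sum(axis=1) / (n * np.prod(h))

# Example: 2-D Gaussian data with mean [1, 1] and identity covariance.
data = np.random.randn(1000, 2) + 1.0
queries = np.array([[0.0, 0.0], [1.0, 1.0]])
print(multivariate_kde(queries, data))
```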

In [47], flat-top kernels such as the chi-squared kernel are shown to outperform the Gaussian kernel in high-dimensional applications. In [48], a family of rotation-based iterative Gaussianization transforms is used to tackle the high-dimensional density estimation problem. An incremental variational method is proposed in [49] to learn the model and optimize its complexity in Gaussian mixture models. The proposed algorithm is used for probability density estimation and pattern recognition in relatively high-dimensional applications.

C. SOM

Generally, a SOM consists of a group of neurons that are organized as a low-dimensional grid, usually a 2-D grid. Consider that all the input data are d-dimensional feature vectors, $x_i = (x_i^1, x_i^2, \ldots, x_i^d) \in X_I \subset \mathbb{R}^d$, where $X_I$ is the input space. Each neuron $n_i$ in the SOM is associated with a d-dimensional feature vector or weight, $\omega_i = (\omega_i^1, \omega_i^2, \ldots, \omega_i^d)$. The weights associated with the neurons are adjusted according to the input patterns randomly sampled from $X_I$. In the training phase, three learning processes are involved: competition, cooperation, and synaptic adaptation [25]–[27].

During the competition stage, at each training step t, an input vector $x_t$ is randomly sampled from $X_I$ according to a uniform distribution D over the input set. Euclidean distances between the input vector and each neuron are calculated, and the winning neuron is the neuron $n_w(t)$ with the smallest distance (maximum similarity) to the input vector
$$\omega_w(t) = \arg\min_{\omega_i} \|x_t - \omega_i\|, \quad i = 1, 2, \ldots, \ell \qquad (9)$$
where $\ell$ is the total number of neurons in the SOM.

In the cooperation phase, a topological neighborhood around the winning neuron has to be determined. Normally, the choice of the neighborhood satisfies two conditions: 1) the topological neighborhood should be symmetric around the winning neuron, at which it attains its maximum value, and 2) the rate of learning in the topological neighborhood decreases monotonically with increasing distance on the grid of nodes between the synaptic neuron and the winning neuron. A common selection of the topological neighborhood is the Gaussian function $h_i(t)$ defined as

$$h_i(t) = \exp\left(-\frac{d_i^2}{2\sigma^2(t)}\right), \quad i = 1, 2, \ldots, \ell \qquad (10)$$
where $d_i^2$ is the squared distance on the grid of nodes between $n_w$ and $n_i$, and $\sigma(t)$ is the effective width of the topological neighborhood with initial value $\sigma_0$; they are defined, respectively, as follows:
$$d_i^2 = \|v_w - v_i\|^2, \quad i = 1, 2, \ldots, \ell \qquad (11)$$
$$\sigma(t) = \sigma_0 \exp\left(-\frac{t}{\tau_1}\right), \quad t = 0, 1, 2, \ldots \qquad (12)$$
where $\tau_1$ is a time constant.

Therefore, the initial natural boundaries of this topological neighborhood depend on the effective width σ(t). With a small effective width, the initial natural boundaries are narrow, and with a large effective width, the initial natural boundaries contain more neighbors.

Finally, in the synaptic adaptation stage, the weights of the winning neuron, as well as those of the excited neighboring neurons, are adjusted toward the input pattern $x_t$ based on the topological neighborhood function (10). The weight-updating rule of SOM can be written as
$$\omega_i(t+1) = \omega_i(t) + \eta(t)\, h_i(t)\, (x_t - \omega_i(t)) \qquad (13)$$
where $\eta(t)$ is the monotonically decreasing learning rate with initial value $\eta_0$, defined as
$$\eta(t) = \eta_0 \exp\left(-\frac{t}{\tau_2}\right), \quad t = 0, 1, 2, \ldots \qquad (14)$$

where $\tau_2$ is another time constant.

When the SOM is used as a density estimator, magnification effects are quite critical issues. Generally, for vector quantizers, one can find the following density relation after the converged learning process [50]
$$P(w) \propto \rho(w)^{\alpha} \qquad (15)$$
where $P(w)$ is the data density, $\rho(w)$ is the achieved weight vector density, and α is the magnification factor. For SOM, the magnification factor is $(1 + 12M_2(\sigma))/(3 + 18M_2(\sigma))$ for 1-D data streams, where $M_2(\sigma)$ denotes the second normalized moment of the neighborhood function depending on the neighborhood range σ [51], and for neural gas, α is $d/(d+2)$, where d is the intrinsic or Hausdorff dimension of the data [52]. The intrinsic dimension can be estimated by Grassberger–Procaccia analysis [53] or via a neural network approach [54].

In this paper, we adopt the localized learning scheme proposed in [55] to control the magnification factor. In this method, the weight-updating rule (13) is modified to
$$\omega_i(t+1) = \omega_i(t) + \eta_{n_w}(t)\, h_i(t)\, (x_t - \omega_i(t)) \qquad (16)$$
where $n_w$ is the winning neuron. The learning rate $\eta_{n_w}(t)$ in (16) is defined as
$$\eta_{n_w}(t) = \eta_0 \left(\frac{1}{t \cdot |x_t - \omega_w|^d}\right)^{m} \qquad (17)$$
where d is the effective dimension of the data streams, and m is a free parameter that allows us to control the magnification exponent. For the 1-D data streams in this paper, m is set to 0.5, t is set to 1, and d = 1. Interested readers may refer to [55] for further information.
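To make the training procedure concrete, the following sketch implements the standard online updates (9)-(14) for a SOM with a 1-D grid topology; parameter defaults, names, and the random initialization are illustrative assumptions rather than the exact implementation used in this paper.

```python
import numpy as np

def train_som(data, n_neurons=100, n_iter=3000,
              eta0=0.3, sigma0=25.0, tau1=None, tau2=1000.0, rng=None):
    """Train a SOM with a 1-D grid (line topology) on (n, d)-shaped data."""
    rng = np.random.default_rng() if rng is None else rng
    tau1 = 1000.0 / np.log(sigma0) if tau1 is None else tau1
    grid = np.arange(n_neurons)                                    # neuron positions v_i on the grid
    w = data[rng.integers(0, len(data), n_neurons)].astype(float)  # initialize weights from samples
    for t in range(n_iter):
        x = data[rng.integers(0, len(data))]                       # sample an input vector
        win = np.argmin(np.linalg.norm(w - x, axis=1))             # competition, cf. (9)
        sigma_t = sigma0 * np.exp(-t / tau1)                       # shrinking neighborhood width, cf. (12)
        h = np.exp(-((grid - grid[win]) ** 2) / (2 * sigma_t ** 2))  # neighborhood, cf. (10)-(11)
        eta_t = eta0 * np.exp(-t / tau2)                           # decaying learning rate, cf. (14)
        w += eta_t * h[:, None] * (x - w)                          # synaptic adaptation, cf. (13)
    return w
```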

For other magnification factor control schemes, such as winner-relaxing learning and convex-concave learning, interested readers can refer to [56] for detailed information.

D. Kullback–Leibler Divergence

The Kullback–Leibler divergence was introduced in [57]; it measures the distance between the distributions of random variables. Consider P and Q that are probability measures


defined on the measure space $(\Omega, \mathcal{F})$. The Kullback–Leibler divergence can be formally expressed by
$$D_{KL}(P\|Q) = \int_{\Omega} dP \log\frac{dP}{dQ} = \int_{\Omega} p(x) \log\frac{p(x)}{q(x)}\,dx \qquad (18)$$
where $p(x)$ and $q(x)$ are the density functions for P and Q, respectively, and they are Lebesgue measurable.

In this paper, since the underlying densities of the observations, $p(x)$ and $q(x)$, are unknown, we use the estimated density functions, $\hat{p}(x)$ and $\hat{q}(x)$, to calculate the estimated Kullback–Leibler divergence. We can calculate the estimated divergence between P and Q as
$$\hat{D}_{KL}(P\|Q) = \int_{\Omega} \hat{p}(x) \log\frac{\hat{p}(x)}{\hat{q}(x)}\,dx. \qquad (19)$$

For example, in 1-D cases, we use a Riemann sum approximation and rewrite (19) as
$$\hat{D}_{KL}(P\|Q) = \sum_{i=1}^{T} f\left(\frac{x_i + x_{i-1}}{2}\right)(x_i - x_{i-1}) \qquad (20)$$
where $f(x) = \hat{p}(x)\log(\hat{p}(x)/\hat{q}(x))$, and $x_0 < x_1 < \cdots < x_T$ is a partition of the real line $\mathbb{R}$. Other approaches to calculate the Kullback–Leibler divergence can be found in [58] and [59].
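The following sketch evaluates the estimated divergence (19) by the Riemann-sum approximation (20) for two estimated 1-D densities given as callables; the interval, partition size, and the small constant guarding against log(0) are illustrative choices.

```python
import numpy as np

def estimated_kl(p_hat, q_hat, lo, hi, n_points=1000, eps=1e-12):
    """Approximate (20): Riemann sum of p_hat * log(p_hat / q_hat) over [lo, hi]."""
    x = np.linspace(lo, hi, n_points + 1)        # partition x_0 < x_1 < ... < x_T
    mid = 0.5 * (x[1:] + x[:-1])                 # midpoints (x_i + x_{i-1}) / 2
    dx = np.diff(x)                              # widths (x_i - x_{i-1})
    p = p_hat(mid) + eps                         # eps avoids log(0) outside the support
    q = q_hat(mid) + eps
    return float(np.sum(p * np.log(p / q) * dx))
```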

III. M-KERNEL METHOD AND CLUSTER KERNEL METHOD

M-kernels are kernel density estimators over 1-D data streams [23]. The M-kernel method is based on KDE (also known as the Parzen window method) and is designed to tackle the density estimation problem over 1-D data streams; it modifies the original Parzen window method by using merged kernels, instead of the raw data sets, to estimate the underlying densities of the data streams. In the M-kernel method, a kernel list is maintained. Each kernel entry is of the form $(X_i^*, h_i^*, \rho_i^*, \text{m\_cost}_{i,i+1})$, and the list is sorted by $X^*$. Here $X_i^*$ is the value of the kernel, $h_i^*$ is the bandwidth, $\rho_i^*$ is the weight, and $\text{m\_cost}_{i,i+1}$ is the cost of merging the kernel with the next kernel in the list. At each time t, a new kernel formed by the newly arriving datum $x_t$ is inserted into the list. If the number of kernels exceeds a maximum number, the pair of kernels with the minimum merging cost is merged. The merging cost is calculated by

$$\text{m\_cost}_{i,j} = \int \left|\rho_i K_{h_i}(x - X_i) + \rho_j K_{h_j}(x - X_j) - (\rho_i + \rho_j) K_{h_m^*}(x - X_m^*)\right| dx. \qquad (21)$$

Since the combination of the two kernels is a lossy approximation of the original two kernels, the merging cost is used here as a measure of the information that is discarded in the combination. Therefore, in this method, we always choose the pair of kernels with the minimum merging cost to merge, in order to minimize the information loss caused by the merge.
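For Gaussian kernels, the merging cost (21) can be approximated numerically as in the sketch below; the integration range and grid size are assumptions, and the downhill simplex search over $(X_m^*, h_m^*)$ used in [23] is not reproduced here.

```python
import numpy as np

def gauss_kernel(x, mu, h):
    return np.exp(-0.5 * ((x - mu) / h) ** 2) / (np.sqrt(2 * np.pi) * h)

def merging_cost(Xi, hi, rho_i, Xj, hj, rho_j, Xm, hm, n_points=2000):
    """Numerically approximate the L1 merging cost (21) for Gaussian kernels."""
    span = 6 * max(hi, hj, hm)
    x = np.linspace(min(Xi, Xj, Xm) - span, max(Xi, Xj, Xm) + span, n_points)
    pair = rho_i * gauss_kernel(x, Xi, hi) + rho_j * gauss_kernel(x, Xj, hj)
    merged = (rho_i + rho_j) * gauss_kernel(x, Xm, hm)
    return float(np.trapz(np.abs(pair - merged), x))
```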

In [23], the optimization problem of finding $(X_m^*, h_m^*)$ to minimize $\text{m\_cost}_{i,j}$ is solved by the downhill simplex method. Finally, the density estimator can be obtained by
$$\hat{f}_n(x) = \frac{1}{n}\sum_{j=1}^{m} \frac{\rho_j^*}{h_j^*} K\left(\frac{x - X_j^*}{h_j^*}\right). \qquad (22)$$

The idea of the cluster kernel is quite similar to that of the M-kernel. The major operations in the cluster kernel algorithm are also inserting, maintaining, and merging kernels. However, the cluster kernel method redesigns the idea of the M-kernel based on a specific kernel, that is, the Epanechnikov kernel. Instead of computing a different bandwidth for each kernel, the cluster kernel assigns a global bandwidth to all kernels, which is calculated based on the estimated mean and variance of the data stream. Because of the global bandwidth and the bounded support of the Epanechnikov kernel, a closed-form solution of the merging cost optimization problem for the 1-D case can be derived, instead of using a numerical method to approximate the optimal solution of (21) as in the M-kernel. Therefore, the cluster kernel can achieve better performance than the M-kernel in terms of accuracy. Two implementations of the cluster kernel are provided: a list-based and a tree-based approach. In the list-based cluster kernel, a list of kernels is maintained. When a new element arrives, since the global bandwidth changes, the merging cost of each kernel pair has to be recalculated. On the other hand, the tree-based cluster kernel is an approximation of the list-based approach. It uses a binary search tree to organize the kernel entries and a priority search tree to organize the merging costs. When a new element arrives, instead of updating all kernels in the list, only the merging costs between the new kernel entry and its neighbors are recomputed. Therefore, the tree-based implementation substantially reduces the time complexity compared to the list-based approach [16], [24].

IV. SOMKE ALGORITHM

In this section, we present the SOMKE algorithm for KDE over data streams based on SOMs. SOMKE consists of three basic operations: developing the SOM sequence, merging the SOM sequence, and estimation of pdfs. The system architecture of the SOMKE algorithm is illustrated with a 2-D example in Fig. 1. In this example, the user's objective is to estimate the pdf of the data stream over windows $W_i$ to $W_{i+5}$. Each step in the algorithm is presented in detail in the following sections.

A. Developing SOM Sequence

Fig. 1. SOMKE: the proposed system architecture of the SOM-based KDE method (example of 2-D data).

Let $W_t$ be the new chunk of data at time t. On the arrival of $W_t$, a structure of the form $\langle SOM_i, c_i, kl_i, r_i\rangle$ is built to summarize the information of $W_t$ needed for KDE analysis, where $SOM_i$ is the trained SOM, $c_i$ is a vector of the numbers of input data in the Voronoi regions of the reference neurons in $SOM_i$, $kl_i$ is the estimated Kullback–Leibler divergence, and $r_i$ is the range of windows summarized by this structure. We call this structure a SOM sequence entry, denoted by $SL_i$. $SL$ denotes the whole SOM sequence, $SL = \{SL_1, SL_2, \ldots, SL_i\}$. Since some entries in $SL$ may have been combined, i is always less than or equal to t.

More specifically, all the input data in $W_t$ are used to train a SOM, $SOM_i = \{\omega_i^1, \omega_i^2, \ldots, \omega_i^{\ell}\}$, where $\omega_i^k$, $k = 1, 2, \ldots, \ell$, is the weight associated with the neuron $n_i^k$ in $SOM_i$, and $\ell$ is the total number of neurons in $SOM_i$. For each neuron $n_i^k$, we calculate the number of input data in the Voronoi region of the reference neuron $n_i^k$ as $c_i^k = |X_i^k|$, where
$$X_i^k = \left\{x : \omega_i^k = \arg\min_{\omega_i^j} \left\|x - \omega_i^j\right\|,\ x \in W_t,\ j = 1, 2, \ldots, \ell\right\} \qquad (23)$$
and $|X_i^k|$ is the number of elements in $X_i^k$. In the structure, $kl_i$ is the estimated Kullback–Leibler divergence between $SL_{i-1}$ and $SL_i$, which can be obtained using (19) based on the estimated pdfs from $SOM_{i-1}$ and $SOM_i$ by (28). Initially, $SL_i$ is built to summarize the information in window $W_t$. If $SL_i$ is combined with other SOM sequence entries, $SL_i$ will be associated with multiple windows. The range of windows associated with $SL_i$ is recorded in a two-element vector $r_i$ that contains the lower and upper time indexes of the windows that $SL_i$ summarizes. The initial value of $r_i$ for window $W_t$ is set to $[t, t]$. After combination with other entries, $r_i$ is changed correspondingly, as shown in Section IV-B.
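As an illustration of this creation step, the sketch below builds one SOM sequence entry ⟨SOM_i, c_i, kl_i, r_i⟩ for a window arriving at time t. It assumes the train_som sketch given in Section II-C; all names are illustrative, and kl_i is left to be filled in from the estimated pdfs of the adjacent entries via (19).

```python
import numpy as np

def build_entry(window, t, n_neurons=100):
    """Create a SOM sequence entry for window W_t (1-D or multivariate data)."""
    W = window.reshape(-1, 1) if window.ndim == 1 else window
    som = train_som(W, n_neurons=n_neurons)                          # SOM_i
    # c_i: number of inputs in the Voronoi region of each reference neuron, cf. (23)
    nearest = np.argmin(np.linalg.norm(W[:, None, :] - som[None, :, :], axis=2), axis=1)
    c = np.bincount(nearest, minlength=n_neurons)
    # kl_i is computed later against SL_{i-1} using (19) and the pdfs from (28)
    return {"som": som, "c": c, "kl": None, "r": [t, t]}
```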

Since the number of neurons in a SOM, $\ell$, is always much smaller than the size of the data window, $\ell \ll T(i)$, the amount of required memory is greatly reduced compared to storing the original data streams.

B. Merging SOM Sequence

To further reduce the memory resources required to store the information, the SOM sequence entries are merged based on the measure of Kullback–Leibler divergence. Here, we introduce two SOM sequence merging strategies: the fixed memory and fixed threshold approaches.

Suppose that the SOM sequence entries $SL_j$ and $SL_{j+1}$ have to be merged. All the neurons in $SOM_j$ and $SOM_{j+1}$ are collected together to form an input set $X = \{n_j^1, \ldots, n_j^{\ell}, n_{j+1}^1, \ldots, n_{j+1}^{\ell}\}$ associated with $c = \{c_j^1, \ldots, c_j^{\ell}, c_{j+1}^1, \ldots, c_{j+1}^{\ell}\}$. A SOM is trained on the input set $X = \{n_1, n_2, \ldots, n_{2\ell}\}$ associated with $c = \{c_1, c_2, \ldots, c_{2\ell}\}$. Instead of using the uniform distribution to sample examples from the input space as in the traditional SOM algorithm, we use c, the numbers of raw input data associated with the neurons, to construct the sampling distribution $D'$ as
$$D'_i = \frac{c_i}{\sum_{k=1}^{2\ell} c_k}, \quad i = 1, 2, \ldots, 2\ell. \qquad (24)$$

Equation (24) ensures that the input neuron with the most associated data has the best chance of being selected in the learning process. Then we obtain the combined SOM sequence entry $SL_m = \langle SOM_m, c_m, kl_m, r_m\rangle$, where $SOM_m = \{\omega_m^1, \omega_m^2, \ldots, \omega_m^{\ell}\}$ and $c_m = \{c_m^1, c_m^2, \ldots, c_m^{\ell}\}$. Here, $c_m^i$, $i = 1, 2, \ldots, \ell$, is the sum of the $c_i$'s associated with the input neurons in the Voronoi region of the reference neuron $n_m^i$, calculated by $c_m^i = \sum_{n_k \in X_m^i} c_k$, where
$$X_m^i = \left\{n_j : \omega_m^i = \arg\min_{\omega_m^k} \left\|\omega_j - \omega_m^k\right\|,\ \omega_j \in X,\ k = 1, \ldots, \ell\right\}. \qquad (25)$$

In $SL_m$, $r_m$ is the range of windows covered by both $SL_j$ and $SL_{j+1}$. Now $SL_j$ and $SL_{j+1}$ can be removed from the sequence, and $SL_m$ is inserted as the new $SL_j$. In addition, the KL divergences between the new $SL_j$ and its adjacent entries $SL_{j-1}$ and $SL_{j+1}$ (which was $SL_{j+2}$ in the old sequence before the combination) need to be updated to obtain new $kl_j$ and $kl_{j+1}$, respectively.

Fig. 2. Example of the combination of two SOM sequence entries.

The example in Fig. 2 illustrates the combination of two SOMs, that is, $SOM_j$ and $SOM_{j+1}$ are combined into $SOM_c$. In this example, only the input neurons $n_j^1$ from $SOM_j$ and $n_{j+1}^1$ from $SOM_{j+1}$ lie in the Voronoi region of the reference neuron $n_c^1$ in the combined SOM, that is, $X_m^1 = \{n_j^1, n_{j+1}^1\}$. Then $c_m^1 = c_j^1 + c_{j+1}^1 = 9$. Since $r_j = [j, j]$ and $r_{j+1} = [j+1, j+1]$, we have $r_c = [j, j+1]$. Note that the merging operation can be performed on SOMs with different numbers of nodes. In this paper, for simplicity we assume that the number of nodes in the SOMs is fixed.
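A sketch of the merging operation is given below: the pooled neurons of both entries are resampled according to D' in (24) before retraining, and the counts are re-aggregated over the Voronoi regions of the merged SOM as in (25). It reuses the train_som sketch above; the resample size is an illustrative assumption.

```python
import numpy as np

def merge_entries(e1, e2, n_neurons=100, n_samples=5000, rng=None):
    """Merge two adjacent SOM sequence entries into one, cf. (24)-(25)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.vstack([e1["som"], e2["som"]])                     # pooled input neurons
    c = np.concatenate([e1["c"], e2["c"]]).astype(float)
    D = c / c.sum()                                           # sampling distribution D', cf. (24)
    idx = rng.choice(len(X), size=n_samples, p=D)             # emulate sampling inputs by D'
    som_m = train_som(X[idx], n_neurons=n_neurons)            # SOM_m
    # re-aggregate the counts over the Voronoi regions of SOM_m, cf. (25)
    nearest = np.argmin(np.linalg.norm(X[:, None, :] - som_m[None, :, :], axis=2), axis=1)
    c_m = np.zeros(n_neurons)
    np.add.at(c_m, nearest, c)
    r_m = [min(e1["r"][0], e2["r"][0]), max(e1["r"][1], e2["r"][1])]
    return {"som": som_m, "c": c_m, "kl": None, "r": r_m}
```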

Before presenting the merging strategies, we briefly discuss two factors that are taken into account in the algorithm and introduce the concept of the modified KL divergence with respect to time.

Comprehensiveness and relevance are two important concerns in time series analysis. To deal with the stability-plasticity dilemma that is commonly present in online learning problems, the system should be able to adaptively learn new or more relevant information while not diluting or forgetting previously learned patterns. Up to now, we have discussed how to store all the information along the data stream, that is, a comprehensive representation of the whole data stream. On the other hand, when the relevance of data is considered, it is reasonable to assume that older data are less relevant to the current analysis than the current information. In SOMKE, the basic idea for dealing with the relevance of data is to make the older entries more likely to be merged than the newer ones, other things being equal. To do so, we introduce the modified KL divergence $kl'_j$ between $SL_{j-1}$ and $SL_j$, which can be described as
$$kl'_j = g(kl_j, \bar{r}_j) \qquad (26)$$
where $g(\cdot)$ is a mapping $g: \mathbb{R} \times \mathbb{R}^+ \rightarrow \mathbb{R}$, and $\bar{r}_j$ is the average range of $SL_j$, $\bar{r}_j = \sum r_j / 2$, that is, the mean of the two elements of $r_j$. As shown in (26), the modified KL divergence $kl'_j$ not only depends on $kl_j$, which is directly calculated from $SL_{j-1}$ and $SL_j$, but also on the time indexes of the windows associated with $SL_j$. Note that $g(\cdot)$ is a user-defined monotonically increasing function with respect to $\bar{r}_j$, which means that with a fixed $kl_j$, a larger $\bar{r}_j$ (newer information) leads to a larger $kl'_j$. In other words, old data information with small time indexes will get a small $kl'$ value. In this paper, we define
$$kl'_j = g(kl_j, \bar{r}_j) = kl_j \cdot e^{-\beta(t - \bar{r}_j)} \qquad (27)$$
where t is the time index of the current window, and β is a nonnegative parameter that controls the rate at which the exponential term in (27) decreases over time. If β is set to 0, the exponential term is always 1 over time, meaning that the old data are treated equally with the new ones. When β increases, given the same kl, the old $SL$s are more likely to be merged than the new ones. Equation (27) guarantees that for the latest entry $SL_j$ with $r_j = [t, t]$, $kl'_j$ equals $kl_j$. Note that β could also be negative, which would mean that the old information is more important than the new, although this case rarely arises in practice. The choice of β depends on the application. If the old information is as important as the new, then β can be set to 0. In other scenarios, where old patterns quickly become outdated, choosing a positive β is reasonable.

1) Fixed Memory Strategy: In the fixed memory strategy, the amount of memory allocated to store the SOM sequence is fixed. In other words, the maximum number of entries in the SOM sequence is fixed at N. When a chunk of new data arrives, a SOM sequence entry $SL_i$ is built as described in Section IV-A and appended to the SOM sequence $SL$. If the number of entries is greater than N, meaning that the used memory exceeds the maximum allocated memory, the two consecutive entries with the smallest absolute value of $kl'$ calculated by (27) are merged into one entry. For instance, if $|kl'_j|$ is the smallest among all the $|kl'|$'s, that is, $|kl'_j| = \min(|kl'_2|, |kl'_3|, \ldots, |kl'_{N+1}|)$, $j \in \mathbb{Z}$, $2 \le j \le N+1$, then $SL_{j-1}$ and $SL_j$ are merged into one entry. Therefore, this is a resource-aware approach.

2) Fixed Threshold Strategy: In the fixed threshold strategy, we focus on the overall minimum $kl'$ value of all the consecutive pairs of entries in the SOM sequence. In other words, we focus more on the resolution of the analysis system. Let α denote the threshold on the modified KL divergence. When the modified KL divergence between two adjacent entries is less than α, the two entries are merged. For instance, if $kl'_j \le \alpha$, then $SL_{j-1}$ and $SL_j$ are merged.

These two strategies are not mutually exclusive. In practice, we can integrate both strategies into the analysis system. We can set both the threshold for combination and the maximum number of entries in the SOM sequence, because the maximum number of entries prevents the size of the SOM sequence from increasing indefinitely. Therefore, either the $kl'$ measure between two consecutive entries falling below the threshold or the number of entries in the SOM sequence exceeding the maximum may trigger a combination.
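The following sketch shows the fixed memory maintenance step with the modified divergence (27); the fixed threshold variant only changes the trigger condition to kl' ≤ α. It reuses merge_entries from above, and the re-estimation of kl values for the new entry and its neighbor (via (19)) is only indicated by a comment.

```python
import numpy as np

def modified_kl(entry, t, beta=0.025):
    """kl'_j = kl_j * exp(-beta * (t - mean(r_j))), cf. (26)-(27)."""
    r_bar = 0.5 * (entry["r"][0] + entry["r"][1])
    return entry["kl"] * np.exp(-beta * (t - r_bar))

def maintain_fixed_memory(sequence, t, max_entries, beta=0.025):
    """If the sequence is too long, merge the adjacent pair with the smallest |kl'|."""
    if len(sequence) > max_entries:
        scores = [abs(modified_kl(e, t, beta)) for e in sequence[1:]]   # kl'_2, ..., kl'_{N+1}
        j = int(np.argmin(scores)) + 1
        sequence[j - 1:j + 1] = [merge_entries(sequence[j - 1], sequence[j])]
        # here kl_j and kl_{j+1} of the affected entries would be re-estimated via (19)
    return sequence
```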

C. Final Estimation

The final step is to use the SOM sequence to estimate the pdfs over arbitrary time periods along the data streams. Suppose that the analysis time period is from t to T. By searching the SOM sequence, one can easily obtain the entries that summarize the data from t to T, say, $SL_i$ to $SL_{i+h}$, where $r_i(1) \le t \le r_i(2)$ and $r_{i+h}(1) \le T \le r_{i+h}(2)$.


Fig. 3. Example of the final estimation.

All the neurons from $SOM_i$ to $SOM_{i+h}$ are collected together as $X = \{\omega_1, \omega_2, \ldots, \omega_{(h+1)\ell}\}$, associated with $c = \{c_1, c_2, \ldots, c_{(h+1)\ell}\}$. Then, for univariate data streams, the pdf can be calculated by modifying (1) as
$$\hat{f}_h(x) = \sum_{i=1}^{(h+1)\ell} \kappa_i K_h(x - \omega_i) \qquad (28)$$
where $\kappa_i = c_i / \sum_{k=1}^{(h+1)\ell} c_k$.

For d-dimensional (d > 1) data streams, the pdf can be calculated by modifying (5) as
$$\hat{f}_H(x) = \sum_{i=1}^{(h+1)\ell} \frac{\kappa_i}{h_1 h_2 \cdots h_d} K\left(\frac{x_1 - \omega_{i,1}}{h_1}, \cdots, \frac{x_d - \omega_{i,d}}{h_d}\right). \qquad (29)$$

An example of the final estimation is illustrated in Fig. 3. In this figure, the solid dots represent the boundaries of the SOM sequence entries, and the circles represent the boundaries of the query, t and T. In this example, $r_i(1) < t = r_i(1) + 1 < r_i(2)$ and $r_{i+2}(1) < T = r_{i+2}(1) + 2 < r_{i+2}(2)$. Therefore, all the neurons from $SOM_i$ to $SOM_{i+2}$ are used to estimate the queried pdf between t and T using (28) or (29).
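A sketch of the final estimation step for 1-D streams is given below: the neurons and counts of the selected entries are pooled and plugged into (28). The bandwidth here is chosen by the rule of thumb (4) applied to the pooled neurons, which is an illustrative choice rather than a prescription from the paper.

```python
import numpy as np

def query_pdf(x, sequence, t_start, t_end, h=None):
    """Estimate the pdf over the windows [t_start, t_end] from the SOM sequence, cf. (28)."""
    entries = [e for e in sequence
               if e["r"][1] >= t_start and e["r"][0] <= t_end]     # SL_i, ..., SL_{i+h}
    neurons = np.vstack([e["som"] for e in entries]).ravel()       # pooled 1-D neuron weights
    c = np.concatenate([e["c"] for e in entries]).astype(float)
    kappa = c / c.sum()                                            # weights kappa_i
    if h is None:
        h = 1.06 * np.std(neurons) * len(neurons) ** (-1.0 / 5.0)  # rule of thumb (4)
    u = (np.asarray(x, dtype=float)[:, None] - neurons[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / (np.sqrt(2 * np.pi) * h)           # K_h(x - omega_i)
    return (kappa[None, :] * K).sum(axis=1)
```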

V. SIMULATIONS AND ANALYSIS

In this section, we present the simulation results of the SOMKE algorithm for various synthetic data sets and real-world data streams. We first follow the assumptions made in [16] on the data streams and compare SOMKE to the two existing algorithms described in Section III, the M-kernel method and the cluster kernel method, in terms of estimation accuracy and processing time over nine stationary data streams. For the stationary data streams, we assume that the distributions over these data streams do not change when shifted in time. We also investigate the use of the Neural Gas method in our framework instead of SOM. An experiment on a 2-D data stream is also presented. In the second part of our experiments, we investigate the use of SOMKE over three groups of nonstationary data streams, that is, streams whose underlying pdfs change over time. These nonstationary data streams include a synthetic Gaussian evolving data stream, a financial data stream, and a group of network traffic data streams. In [16] and [23], although the M-kernel method and the cluster kernel method are claimed to be capable of handling nonstationary data streams by using a weighting strategy, the historical information over the data streams is embedded in the current states of the kernels implicitly and cannot be separated from new information explicitly, so the estimation of pdfs over arbitrary periods of the streams becomes impossible. Therefore, for the nonstationary data streams, we only illustrate the performance of SOMKE.

A. Stationary Data Streams

1) Data Sets: In this paper, we use two synthetic data sets and seven real-world 1-D data sets used in [16] and [24] to compare the performance of SOMKE to the M-kernel method and the cluster kernel method. The synthetic data streams are a Gaussian-distributed data stream with mean and standard deviation both equal to 1 and a Gaussian-mixture-distributed data stream with half of the data drawn from N(−2, 1) and the rest drawn from N(2, 1). The real-world data streams cover diverse applications, including burst, earthquake, EEG_heart_rate, fluid_dynamics, networks, power data, and S&P 500, which are available online [60]. The 2-D data stream is a 2-D Gaussian distribution with 100 000 elements. The mean of the distribution is [1, 1], and the covariance matrix is [1, 0; 0, 1].

Table I summarizes the characteristics of the data streams used in this paper. Fig. 4 illustrates the off-line estimates of the pdfs over the data streams using all available data. The solid line is calculated based on the Gaussian kernel and the dotted line uses the Epanechnikov kernel.

2) Quality Metric: To evaluate the quality of the estimated pdf $\hat{f}$, besides the Kullback–Leibler divergence, we also calculate the approximated mean-integrated squared error (MISE), one of the most widely used measures of the discrepancy of the density estimate $\hat{f}$ from the true density f. For most real-world applications, since the true densities are unknown, we use the off-line estimates of the pdfs over the entire data sets as the approximated true densities. Therefore, for univariate data streams, we can calculate the quality of $\hat{f}$ as
$$MISE(\hat{f}) = E\int \left(\hat{f}(x) - f(x)\right)^2 dx \approx \frac{1}{N_t}\sum_{i=1}^{N_t} \left(\hat{f}(x_i) - f(x_i)\right)^2 \Delta x \qquad (30)$$
where $X = \{x_1, x_2, \ldots, x_{N_t}\}$ is an equidistant partition of the support of f with step $\Delta x$.
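The approximated MISE (30) can be computed as in the short sketch below, where the reference density f_ref is the off-line estimate over the entire data set; the grid bounds are assumptions supplied by the caller.

```python
import numpy as np

def approx_mise(f_hat, f_ref, lo, hi, n_points=500):
    """Approximate (30) on an equidistant partition of [lo, hi] with N_t = n_points."""
    x = np.linspace(lo, hi, n_points)
    dx = x[1] - x[0]                                  # step Delta x
    return float(np.mean((f_hat(x) - f_ref(x)) ** 2) * dx)
```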

3) Parameter Settings: For the stationary data streams, we use the fixed memory strategy and set the size of the memory to 1. The parameter settings used in this paper are summarized as follows:

1) window size is constant: T(j) = 500, for all j;
2) number of neurons: ℓ = 100;
3) number of iterations: n = 3000;
4) initial learning rate: η₀ = 3;
5) initial width of topological neighborhood: σ₀ = 25;
6) time constants: τ₁ = 1000/log σ₀ and τ₂ = 1000;
7) number of kernels in M-kernels and cluster kernels: m = 100;
8) size of X in the quality metric: Nₜ = 500.

These parameters are obtained through the cross-validation method. For the 1-D data streams, the SOM is a 1-D topological network (a line), and for the 2-D data streams, the SOM is a 10 × 10 2-D topological network (a plane).


Fig. 4. Off-line estimates of pdfs over the data streams based on Gaussian kernel (solid line) and Epanechnikov kernel (dotted line).

TABLE I
STATIONARY DATA STREAMS

Data type        Name              Size     Mean      Std.     Max      Min
Synthetic data   Gaussian          100 000  1.00      1.00     5.4289   −3.3926
Synthetic data   Gaussian Mixture  100 000  0.01      2.24     5.63     −6.25
Real-world data  Burst             9382     1.92E5    8.56E4   5.99E6   1.05E5
Real-world data  Earthquake        4096     −5.09E-8  0.10     0.5156   −0.5234
Real-world data  EEG               7200     28.10     3.96     44.78    10.76
Real-world data  fluid_dyn.        10000    −0.01     0.16     1        −1
Real-world data  networks          18000    69.02     7.64     560.083  66.4889
Real-world data  power data        35040    1.10E3    292.73   2152     614
Real-world data  S&P 500           17610    82.96     97.41    456.33   4.4

TABLE II
PERFORMANCE OF FIVE METHODS OVER NINE DATA STREAMS (PERFORMANCE CRITERION: MISE)

Kernel  Method      Gaussian  Gaussian Mixture  Burst    Earthquake  EEG      fluid_dyn.  networks  power data  S&P 500  Win
Gau     MK          2.05E-2   8.19E-3           4.83E-5  3.26E+0     6.52E-5  2.65E+0     1.58E-2   1.56E-3     3.70E-3  0
Gau     SOMKE (T)   6.01E-4   3.28E-5           2.44E-8  3.28E-3     2.20E-5  9.81E-3     1.19E-3   4.73E-7     2.09E-6  2
Gau     SOMKE (M)   2.74E-4   4.61E-5           1.29E-8  2.98E-3     2.11E-5  8.01E-3     5.27E-4   3.75E-6     1.47E-6  6
Gau     Neural Gas  2.97E-3   3.30E-4           6.25E-8  1.10E-4     8.45E-5  6.75E-3     9.74E-3   4.21E-6     6.35E-6  1
Epa     MK          2.13E-3   9.83E-4           9.70E-5  3.01E+0     1.31E-4  2.54E+0     9.99E-3   1.17E-3     3.25E-3  0
Epa     CK-List     4.31E-4   3.30E-4           1.05E-7  4.75E-3     3.08E-5  3.96E-2     1.22E-2   3.10E-5     5.97E-4  1
Epa     CK-Tree     4.47E-4   3.53E-3           9.84E-8  9.02E-3     2.39E-5  3.86E-2     1.22E-2   8.79E-6     7.45E-3  2
Epa     SOMKE (T)   6.22E-4   8.04E-5           1.25E-7  2.80E-2     1.62E-4  4.73E-2     4.64E-3   4.26E-6     1.96E-5  2
Epa     SOMKE (M)   3.81E-3   6.99E-4           8.49E-8  2.15E-2     1.47E-4  3.88E-2     2.47E-3   7.89E-6     1.93E-5  3
Epa     Neural Gas  4.07E-2   7.00E-3           3.05E-7  1.47E-3     9.06E-4  1.13E-1     7.79E-2   5.90E-5     2.33E-5  1

The following set of parameters is used for the NG method: the number of neurons ℓ = 100, λ_initial = 10, λ_final = 0.01, ε_initial = 0.5, and ε_final = 0.005. The simulations are conducted on an Intel Core Duo CPU at 2.0 GHz with 2 GB of RAM, under the MATLAB version 7.4.0.387 (R2007a) environment.

4) Simulation Results: Tables II–IV summarize the simulation results of SOMKE for the nine data streams, compared to the two other algorithms, that is, M-kernel and cluster kernel. In these tables, SOMKE (T) indicates that the traditional SOM method is used in the SOMKE algorithm, and SOMKE (M) indicates that the modified SOM method with the magnification factor control scheme is used in the proposed algorithm. We also show the simulation results when the Neural Gas method is used in our framework. For the cluster kernel method, we implement both the list-based approach and the tree-based approach. For each data set, we use both the Gaussian kernel and the Epanechnikov kernel to estimate the pdf. Because the cluster kernel is developed based on the Epanechnikov kernel, we use only the Epanechnikov kernel in that algorithm.


TABLE III
PERFORMANCE OF FIVE METHODS OVER NINE DATA STREAMS (PERFORMANCE CRITERION: KULLBACK–LEIBLER DIVERGENCE)

Kernel  Method      Gaussian  Gaussian Mixture  Burst   Earthquake  EEG     fluid_dyn.  networks  power data  S&P 500  Win
Gau     MK          0.1233    0.0510            0.2557  0.1839      0.0050  0.8631      0.2694    0.4906      0.3199   1
Gau     SOMKE (T)   0.0131    0.0028            0.0344  0.0155      0.0068  0.0576      0.0203    0.0027      0.0024   2
Gau     SOMKE (M)   0.0093    0.0037            0.0215  0.0130      0.0068  0.0491      0.0259    0.0098      0.0019   3
Gau     Neural Gas  0.0323    0.0096            0.0390  0.0095      0.0082  0.0285      0.0197    0.0102      0.0063   3
Epa     MK          0.0099    0.0439            0.5116  1.1745      0.0101  1.7673      0.1667    0.4046      0.3889   0
Epa     CK-List     0.0083    0.0026            0.0134  0.0103      0.0044  0.0116      0.1747    0.0198      0.0621   4
Epa     CK-Tree     0.0095    0.0031            0.0324  0.0185      0.0042  0.0092      0.1792    0.0110      0.8096   2
Epa     SOMKE (T)   0.0130    0.0028            0.0347  0.0172      0.0078  0.0610      0.0050    0.0061      0.0146   1
Epa     SOMKE (M)   0.0094    0.0035            0.0251  0.0162      0.0074  0.0518      0.0042    0.0115      0.0138   2
Epa     Neural Gas  0.1793    0.0712            0.1349  0.0510      0.0290  0.1208      0.0654    0.0843      0.0176   0

TABLE IV
PROCESSING TIME OF FIVE METHODS OVER NINE DATA STREAMS (IN SECONDS)

Method      Gaussian  Gaussian Mixture  Burst    Earthquake  EEG      fluid_dyn.  networks  power data  S&P 500  Win
MK          1641.3    1663.92           53.86    63.13       102.07   155.16      135.84    304.93      118.45   0
CK-List     19148.87  19001.48          1836.25  791.12      1306.92  1902.90     3331.43   6188.43     3241.23  0
CK-Tree     934.02    957.55            70.40    31.64       53.59    77.87       137.50    257.49      70.86    1
SOMKE       509.88    524.55            47.47    21.82       36.99    49.77       90.94     181.93      91.78    8
Neural Gas  3773.91   2469.76           122.71   90.08       126.73   363.35      288.15    648.02      348.54   0


the Epanechnikov kernel, we only use Epanechnikov kernel inthis algorithm. The results include the performance of the fiveapproaches in terms of the approximated MISE in Table II,Kullback–Leibler divergence in Table III, and the processingtime in Table IV. The approximated MISEs for each algorithmare calculated with respect to the off-line best estimate of thewhole data stream with the same kernel. In these tables, thebest results for each kernel are highlighted in bold face font.The rightmost columns of Tables II–IV summarize the winningtimes for the methods.

From Tables II to IV, one can see that for Gaussiankernel, SOMKE outperforms M-kernel for all data sets withless processing time. For Epanechnikov kernel, the tree-basedcluster kernel can achieve similar performance with list-basedcluster kernel with much less processing time. In addition,cluster kernel outperforms M-kernel for most data sets, which

are consistent to the results provided in [16] and [24]. SOMKEcan achieve competitive performance compared to clusterkernel method. And SOMKE requires the least processingtime among the five approaches in most cases. From ourobservation, SOM method can also achieve competitive per-formance compared to Neural Gas method in our frameworkwith less processing time. This is due to the distance orderingin each step needed in Neural Gas method, therefore theperformance of the sorting algorithms may greatly effectthe time complexity of the entire algorithm. Furthermore,compared to M-kernel and cluster kernel, SOMKE can easilybe used in multivariate applications, because SOM learningalgorithm can easily be applied to high-dimensional data sets.Considering the magnification factor issue, for most datasets, the SOMKE using the modified SOM algorithm canachieve better performance compared to the results when thetraditional SOM algorithm is used.

Fig. 5 illustrates the simulation results of SOMKE for the 2-D Gaussian data stream. The left two panels show the results obtained by SOMKE, and the right two are plotted from the underlying distribution function. The top two panels show the pdfs, and the bottom two show the contours of the estimated and actual distributions, respectively.

B. Nonstationary Data Streams

To assess the performance of SOMKE on nonstationary data streams, we conduct experiments on a synthetic evolving data stream using the fixed threshold strategy, and on two groups of real-world data streams, namely the daily rates of return of a stock and a set of network traffic data streams, using the fixed memory strategy.

1) Data Sets: We create the synthetic evolving data stream in the following way. The data stream contains 30 000 points with a fixed window size of 500, and thus 60 windows. In all windows, the data follow a normal distribution with



Fig. 6. Off-line estimates of the pdfs of the nonstationary data streams. (a) Synthetic data stream. (b) Ford stock returns.

Fig. 7. Off-line estimates of the pdfs of the network traffic data streams.

different means and standard deviations. Specifically, we have

$$
W_t \sim
\begin{cases}
N(0, 2), & t \in [1, 10]\\
N(3, 1), & t \in [11, 20]\\
N(6, 2), & t \in [21, 30]\\
N(9, 1), & t \in [31, 40]\\
N(12, 2), & t \in [41, 50]\\
N(15, 1), & t \in [51, 60]
\end{cases}
$$

where t ∈ Z.
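A minimal sketch of how such an evolving stream can be generated is given below, treating the second parameter of N(·, ·) as the standard deviation, as stated above; the random seed and generator are illustrative choices.

```python
import numpy as np

def make_evolving_stream(window_size=500, seed=0):
    """Generate the 30 000-point evolving Gaussian stream (60 windows)."""
    rng = np.random.default_rng(seed)
    # (mean, standard deviation) for windows 1-10, 11-20, ..., 51-60.
    segments = [(0, 2), (3, 1), (6, 2), (9, 1), (12, 2), (15, 1)]
    chunks = [rng.normal(mu, sigma, size=10 * window_size) for mu, sigma in segments]
    return np.concatenate(chunks)

stream = make_evolving_stream()
print(stream.shape)  # (30000,)
```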

Nonparametric density estimation plays a critical role in financial modeling, especially in computational finance. For instance, in [61] nonparametric density estimation is used to estimate the state-price density for options, and in [62] the KDE technique is used to design hedging strategies for mortgage-backed securities and to describe dynamic risk functions. In addition, KDE techniques are used in [63] to estimate the conditional mean of the stock return series and to investigate the evolution of equity market volatility. In this paper, we use the daily rates of return of Ford stock (NYSE:F) from January 4, 1977 to December 31, 2009 to construct the financial data stream. This data stream contains 8327 observations. We use the fiscal year to partition the stream into 33 windows, and assume that the data in each window follow the same

TABLE V

r_i, kl_i, AND kl′_i IN THE SOM SEQUENCE ENTRIES FOR THE SYNTHETIC EVOLVING DATA STREAM

                 β = 0                       β = 0.025
SL     r_i      kl_i    kl′_i       r_i      kl_i    kl′_i
SL1    1–10     −       −           1–20     −       −
SL2    11–20    1.38    1.38        21–40    17.47   8.35
SL3    21–30    11.79   11.79       41–50    9.00    6.27
SL4    31–40    1.43    1.43        51–60    1.42    1.27
SL5    41–50    14.45   14.45       −        −       −
SL6    51–60    1.42    1.42        −        −       −

TABLE VI

r_i, kl_i, AND kl′_i IN EACH ENTRY OF THE SOM SEQUENCE FOR THE FINANCIAL DATA STREAM

SL     r_i          kl_i    kl′_i
SL1    1977–1979    −       −
SL2    1980–1997    0.49    0.49
SL3    1998–2007    0.35    0.35
SL4    2008–2009    2.26    2.26

TABLE VII

r_i AND kl_i IN EACH ENTRY OF THE SOM SEQUENCE FOR THE NETWORK TRAFFIC DATA STREAMS

INTER                           ASOQ
SL     r_i      kl_i            SL     r_i      kl_i
SL1    1–1      −               SL1    1–1      −
SL2    2–12     2.4314          SL2    2–13     0.2276
SL3    13–13    1.9264          SL3    14–14    0.1790
SL4    14–15    3.4356          SL4    15–15    0.5842

LQAIT                           LAQIT
SL     r_i      kl_i            SL     r_i      kl_i
SL1    1–10     −               SL1    1–1      −
SL2    11–11    0.1263          SL2    2–10     0.0833
SL3    12–13    0.0839          SL3    11–14    0.0034
SL4    14–15    0.0756          SL4    14–15    0.0056

distribution. Fig. 6 illustrates the off-line estimates of the density functions of the whole data streams with the Gaussian kernel.

For the network traffic data streams, we use the data collected from the integrated network-based Ohio University’s network detective service system [64]. The parameters used in this paper include: 1) INTER, which describes the interactivity and is defined as the number of questions per second during a particular period; 2) ASOQ, the average size of questions; 3) LQAIT, the log (base 10) of the question–answer idle time (in seconds) that the server takes before responding to a question; and 4) LAQIT, the log (base 10) of the answer–question idle time (in seconds) that the client takes to ask another question. All four data streams contain 7192 samples, which are partitioned into 15 windows. Fig. 7 illustrates the off-line estimates of the density functions of the whole data streams with the Gaussian kernel.

2) Simulation Results: For the synthetic data stream, we use the fixed threshold approach with the threshold α = 1. We try two alternative values of β in (27): one choice is 0, meaning that old data are treated equally with new data; the other is 0.025. Table V summarizes the information stored in the SOM sequence, including the r_i, kl_i, and kl′_i of each entry. Figs. 8 and 9 show the pdfs estimated from each entry in the SOM


[Fig. 8 graphic: six panels of p versus x — SL1 [W1–W10 ~ N(0,2)], SL2 [W11–W20 ~ N(3,1)], SL3 [W21–W30 ~ N(6,2)], SL4 [W31–W40 ~ N(9,1)], SL5 [W41–W50 ~ N(12,2)], SL6 [W51–W60 ~ N(15,1)].]

Fig. 8. Estimates of the pdfs of the SOM sequence entries for the synthetic evolving data stream (β = 0). Solid line: the estimated pdfs represented by the entries in the SOM sequence. Dashed line: the evolving distributions over the data stream.

[Fig. 9 graphic: four panels of p versus x — SL1 [W1–W20], SL2 [W21–W40], SL3 [W41–W50] ~ N(12,2), SL4 [W51–W60] ~ N(15,1).]

Fig. 9. Estimates of the pdfs of the SOM sequence entries for the synthetic evolving data stream (β = 0.025). Solid line: the estimated pdfs represented by the entries in the SOM sequence. Dashed line: the evolving distributions over the data stream.

sequence, with β = 0 and β = 0.025, respectively. From Fig. 8 and the left part of Table V, one can see that when β = 0 the developed SOM sequence contains six entries, each associated with the windows that are drawn from the same distribution along the data stream. For instance, the first entry SL1 is associated with windows 1–10, which follow the normal distribution N(0, 2). When β = 0.025, new information is treated as more relevant to the current analysis than old information. Therefore, the new information is stored in a more detailed way, and the older entries in the SOM sequence are more likely to be combined than the newer ones; β controls the rate at which old information fades out of the analysis. In this experiment, we obtain only four entries along the stream, and the oldest four entries are combined into two entries compared

[Fig. 10 graphic: four panels of p versus return (%) — SL1 (1977–1979), SL2 (1980–1997), SL3 (1998–2007), SL4 (2008–2009).]

Fig. 10. Estimates of the pdfs of the SOM sequence entries for the financial data stream (β = 0).

Fig. 11. Estimates of the pdfs of the SOM sequence entries for the network traffic data streams (β = 0). (a) INTER. (b) ASOQ. (c) LQAIT. (d) LAQIT.

to the first scenario, as shown in Fig. 9 and the right part of Table V. Note that in this case the estimated pdfs for the first 40 windows no longer match the true underlying distributions, due to the combination of the SOM sequence entries. However, given limited memory resources, this approximation still provides a good illustration of the historical evolving patterns of the data distributions over time, and the combination also reduces the space complexity.

For the real-world data streams, we use the fixed memory approach, set the maximum number of SOM sequence entries N to 4, and set β to 0. Table VI summarizes the information of the financial data stream stored in the SOM sequence, including the r_i, kl_i, and kl′_i of each entry. Fig. 10 illustrates the pdfs estimated from the entries in the SOM


Fig. 12. Estimates of the pdfs of the INTER data stream with 25 entries.

sequence, and the evolving patterns of the distribution of returns over the 33 years. For instance, between 1977 and 1979 the returns concentrated around the neighborhoods of three values, that is, 0% and ±2%. Compared with the distribution of returns between 1980 and 1997, the returns between 1998 and 2007 show higher kurtosis. The returns between 2008 and 2009 show the highest kurtosis and the highest variance, or risk, among the four periods, which can be explained by the financial crisis that started in 2008 and substantially affected the profitability of the whole automobile industry. Table VII summarizes the information of the network traffic data streams stored in the SOM sequences. Fig. 11 illustrates the pdfs estimated for the four data streams from the entries in the SOM sequences. Note that for the LAQIT data [Fig. 11(d)], although the last two entries, SL3 and SL4, look quite similar to each other, they do not satisfy our merging criterion, that is, merging them does not provide the minimum merging cost, so these two entries are not merged in the final SOM sequence. To test a larger number of entries in the SOM sequence, we partition the INTER network traffic data stream into 72 chunks, with 100 samples in each of the first 71 chunks and 92 in the last one. Fig. 12 shows the pdfs of the INTER data stream estimated when we increase the number of entries to 25. Comparing Figs. 11(a) and 12, one can see that the first two entries in Fig. 12 correspond to the first entry in Fig. 11(a), entries 3–20 in Fig. 12 correspond to entry 2 in Fig. 11(a), entry 21 in Fig. 12 corresponds to entry 3 in Fig. 11(a), and entries 22–25 in Fig. 12 correspond to entry 4 in Fig. 11(a). From these two figures, we can see that the proposed algorithm successfully captures the two peaks in the data stream and retains them in the SOM sequence for further analysis.
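The fixed memory strategy can be pictured with the following simplified sketch. It is not the exact procedure used in SOMKE: it assumes each entry is summarized by a density evaluated on a shared grid together with its sample count, takes the merging cost of two adjacent entries to be a symmetrized Kullback–Leibler divergence, and approximates the merge itself by a sample-count-weighted average of the two densities rather than by retraining a SOM.

```python
import numpy as np

def kl_on_grid(p, q, dx, eps=1e-12):
    """Discrete approximation of KL(p || q) on a shared evaluation grid."""
    p = np.maximum(np.asarray(p, dtype=float), eps)
    q = np.maximum(np.asarray(q, dtype=float), eps)
    return float(np.sum(p * np.log(p / q)) * dx)

def fixed_memory_merge(entries, max_entries, dx):
    """Merge adjacent entries until at most `max_entries` remain.

    Each entry is a pair (density_on_grid, n_samples).  The pair of
    adjacent entries with the smallest symmetrized KL cost is merged,
    here approximated by a sample-count-weighted average of densities.
    """
    entries = list(entries)
    while len(entries) > max_entries:
        costs = [kl_on_grid(entries[i][0], entries[i + 1][0], dx)
                 + kl_on_grid(entries[i + 1][0], entries[i][0], dx)
                 for i in range(len(entries) - 1)]
        i = int(np.argmin(costs))               # cheapest adjacent pair
        (p1, n1), (p2, n2) = entries[i], entries[i + 1]
        merged = ((n1 * p1 + n2 * p2) / (n1 + n2), n1 + n2)
        entries[i:i + 2] = [merged]             # replace the pair with the merge
    return entries
```

Under this simplification, neighboring entries with similar densities are absorbed first, which matches the qualitative behavior described above.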

C. Time and Space Complexity Analysis

In this section, we analyze the time and space complexity of SOMKE when the fixed memory strategy is employed. We first define the following parameters of the algorithm:

1) k: the maximum number of entries in the SOM sequence;
2) ℓ: the number of neurons in the SOM;
3) N: the number of elements in the whole data stream;
4) N_ec: the number of elements in each chunk of the data stream;
5) M: the number of evaluation points.

The time complexity can be decomposed into two parts: the training phase and the query phase. In the training phase, a SOM is trained for each chunk of data. Training a single SOM takes O(N_ec ℓ), so training all the SOMs takes O(N ℓ). The N/N_ec SOMs are merged into k SOMs, and this combination operation takes O(((N/N_ec) − k) × N_ec ℓ). Since k ≪ N/N_ec, the time complexity of the training step is O(N ℓ). In the query step, the time complexity is O(M ℓ). Therefore, the total time complexity of SOMKE is O((N + M) ℓ). For the traditional kernel density method, the time complexity is O(M N). Since generally ℓ ≪ N, ℓ ≪ M, and thus (N + M) ℓ ≪ N M, SOMKE is much faster than the traditional KDE method.

For space complexity, the storage needed by SOMKE is O(kℓ + N_ec), while the storage needed by the traditional methods is O(N). Since kℓ + N_ec ≪ N, SOMKE greatly reduces the space complexity of the traditional KDE.
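To make the O(Mℓ) query cost concrete, the following sketch evaluates a density estimate from ℓ SOM neurons at M query points. It assumes, for illustration only, that each neuron stores its prototype and the number of samples mapped to it, and that the estimate is a hit-count-weighted Gaussian kernel sum over the prototypes; the bandwidth value is arbitrary and is not the bandwidth selection rule used in this paper.

```python
import numpy as np

def som_kde_query(prototypes, hit_counts, eval_points, bandwidth):
    """Weighted Gaussian-kernel density estimate built from SOM neurons.

    prototypes  : (ell,) neuron weight vectors (1-D case for brevity)
    hit_counts  : (ell,) number of stream samples mapped to each neuron
    eval_points : (M,)   points at which the density is queried
    Cost is O(M * ell): one kernel evaluation per (point, neuron) pair.
    """
    w = np.asarray(hit_counts, dtype=float)
    w = w / w.sum()
    x = np.asarray(eval_points, dtype=float)[:, None]         # (M, 1)
    c = np.asarray(prototypes, dtype=float)[None, :]          # (1, ell)
    z = (x - c) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels @ w                                        # (M,)

# Hypothetical example: three neurons summarizing a bimodal chunk.
pdf = som_kde_query(prototypes=[-2.0, 0.1, 2.5],
                    hit_counts=[120, 40, 140],
                    eval_points=np.linspace(-6.0, 6.0, 5),
                    bandwidth=0.8)
print(pdf)
```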

VI. CONCLUSION

In this paper, we presented a novel scheme, SOMKE, for KDE over data streams based on SOMs and the Kullback–Leibler divergence measure. SOMKE comprises three operations: 1) building the SOM sequence entries; 2) merging the SOM sequence; and 3) final estimation. In the first step, SOMs are learned from windows of data, and the SOM sequence entries are built to form the SOM sequence. In the second step, to further save memory, consecutive entries in the SOM sequence may be merged based on the Kullback–Leibler divergence measure; two combination strategies were introduced, the fixed memory approach and the fixed threshold approach. Finally, the density functions over arbitrary time periods along the data streams can be estimated using the SOM sequence. We evaluated SOMKE over both synthetic and real-world data streams. For the stationary data streams, we compared the method with two other algorithms, M-kernel and cluster kernel, over various data sets. The simulation results illustrate that SOMKE outperforms M-kernel and achieves competitive performance compared with the cluster kernel. Another important advantage of SOMKE is that it can be used effectively and efficiently in the context of nonstationary data streams. Experiments on synthetic and real-world data streams illustrate the effectiveness of SOMKE on nonstationary data streams.

There are several important future research directions along this topic. First, in this paper we compared the results of our method with those of the M-kernel, cluster kernel, and neural gas approaches. It would be important to carry out comparative studies of our method with other SOM-based time-series approaches, such as the patch clustering method [38] and the SOMAR and GSOMAR models [36], [37]. Such


comparative studies will be able to provide new insights and findings about various SOM-based techniques for data stream analysis. Second, as the magnification effect plays an important role in SOM-based density estimation, it would be interesting to study both the theoretical and empirical impacts of the magnification effect, as well as the control of the magnification factor, in the SOMKE method across various types of stream data. Third, as imbalanced learning has presented many new research challenges to the community [65], [66], it would also be interesting to study how the proposed SOMKE method can be applied to handle imbalanced data streams over time, and to assess its potential advantages and limitations across such challenging data sets. Finally, as many of today’s data-intensive applications involve large-scale stream data, it would be important to study the applications of our method across different domains. Motivated by our results in this paper, we believe this approach can provide important insights and useful techniques for density estimation over data streams in different real-world applications.

REFERENCES

[1] P. Domingos and G. Hulten, “A general framework for mining massive data streams,” J. Comput. Graphical Statist., vol. 12, no. 4, pp. 945–949, 2003.
[2] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for on-demand classification of evolving data streams,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 5, pp. 577–589, May 2006.
[3] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems,” in Proc. 21st ACM Symp. Principles Database Syst., 2002, pp. 1–16.
[4] P. Domingos and G. Hulten, “Mining high-speed data streams,” in Proc. 6th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2000, pp. 71–80.
[5] R. Elwell and R. Polikar, “Incremental learning of concept drift in nonstationary environments,” IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1517–1531, Oct. 2011.
[6] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan, “Clustering data streams,” in Proc. 41st Annu. Symp. Foundations Comput. Sci., 2000, pp. 359–366.
[7] A. A. Lazar and E. A. Pnevmatikakis, “Video time encoding machines,” IEEE Trans. Neural Netw., vol. 22, no. 3, pp. 461–473, Mar. 2011.
[8] P. Rogister, R. Benosman, S.-H. Ieng, P. Lichtsteiner, and T. Delbruck, “Asynchronous event-based binocular stereo matching,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 2, pp. 347–353, Feb. 2012.

[9] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London, U.K.: Chapman & Hall, 1986.
[10] J. DiNardo and J. L. Tobias, “Nonparametric density and regression estimation,” J. Economic Perspectives, vol. 15, no. 4, pp. 11–28, 2001.
[11] Y. Cao, H. He, H. Man, and X. Shen, “Integration of self-organizing map (SOM) and kernel density estimation (KDE) for network intrusion detection,” Proc. SPIE, vol. 7480, pp. 74800N-1–74800N-12, Sep. 2009.
[12] T. Brox, B. Rosenhahn, D. Cremers, and H. P. Seidel, “Nonparametric density estimation with adaptive, anisotropic kernels for human motion tracking,” in Proc. 2nd Conf. Human Motion Underst. Model. Capture Animation, 2007, pp. 152–165.
[13] T. Bouezmarni and J. V. K. Rombouts, “Nonparametric density estimation for multivariate bounded data,” J. Statist. Plann. Inference, vol. 140, no. 1, pp. 139–152, 2010.
[14] Ö. Egecioglu and A. Srinivasan, “Efficient nonparametric density estimation on the sphere with applications in fluid mechanics,” SIAM J. Scientific Comput., vol. 22, no. 1, pp. 152–176, 2000.
[15] W. L. Martinez and A. R. Martinez, Computational Statistics Handbook with MATLAB, 2nd ed. London, U.K.: Chapman & Hall, 2008.
[16] C. Heinz and B. Seeger, “Cluster kernels: Resource-aware kernel density estimators over streaming data,” IEEE Trans. Knowl. Data Eng., vol. 20, no. 7, pp. 880–893, Jul. 2008.

[17] A. Hämäläinen, “Self-organizing map and reduced kernel density estimation,” Ph.D. thesis, Rolf Nevanlinna Inst., Univ. Jyväskylä, Jyväskylä, Finland, 1995.
[18] H. Yin and N. M. Allinson, “Self-organizing mixture networks for probability density estimation,” IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 405–411, Mar. 2001.
[19] A. Elgammal, R. Duraiswami, and L. S. Davis, “Efficient kernel density estimation using the fast Gauss transform with applications to color modeling and tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 11, pp. 1499–1504, Nov. 2003.
[20] C. Lambert, S. Harrington, C. Harvey, and A. Glodjo, “Efficient on-line nonparametric kernel density estimation,” Algorithmica, vol. 25, no. 1, pp. 37–57, 1999.
[21] C. M. Procopiuc and O. Procopiuc, “Density estimation for spatial data streams,” in Proc. 9th Int. Conf. Adv. Spatial Temporal Databases, 2005, pp. 109–126.
[22] K. A. Caudle and E. Wegman, “Nonparametric density estimation of streaming data using orthogonal series,” Comput. Statist. Data Anal., vol. 53, no. 12, pp. 3980–3986, 2009.
[23] Z. Cai, W. Qian, L. Wei, and A. Zhou, “M-kernel merging: Toward density estimation over data streams,” in Proc. 18th Int. Conf. Database Syst. Adv. Appl., 2003, pp. 285–292.
[24] C. Heinz and B. Seeger, “Toward kernel density estimation over streaming data,” in Proc. Int. Conf. Manage. Data, 2006, pp. 1–12.

[25] T. Kohonen, “The self-organizing map,” Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990.
[26] T. Kohonen, “The self-organizing map,” Neurocomputing. New York: Elsevier, 1998.
[27] T. Kohonen, Self-Organizing Maps. New York: Springer-Verlag, 2001.
[28] G. G. Yen and Z. Wu, “Ranked centroid projection: A data visualization approach with self-organizing maps,” IEEE Trans. Neural Netw., vol. 19, no. 2, pp. 245–259, Feb. 2008.
[29] E. López-Rubio, J. M. Ortiz-de-Lazcano-Lobato, and D. López-Rodríguez, “Probabilistic PCA self-organizing maps,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1474–1489, Sep. 2009.
[30] A. Hirose and T. Nagashima, “Predictive self-organizing map for vector quantization of migratory signals and its application to mobile communications,” IEEE Trans. Neural Netw., vol. 14, no. 6, pp. 1532–1540, Nov. 2003.
[31] S. Cheng, H. Fu, and H. Wang, “Model-based clustering by probabilistic self-organizing maps,” IEEE Trans. Neural Netw., vol. 20, no. 5, pp. 805–826, Mar. 2009.

[32] G. J. Chappell and J. G. Taylor, “The temporal Kohonen map,” Neural Netw., vol. 6, no. 3, pp. 441–445, 1993.
[33] T. Koskela, M. Varsta, J. Heikkonen, and K. Kaski, “Temporal sequence processing using recurrent SOM,” in Proc. Knowl.-Based Intell. Electron. Syst. Int. Conf., 1998, pp. 290–297.
[34] T. Voegtlin, “Recursive self-organizing maps,” Neural Netw., vol. 15, nos. 8–9, pp. 979–991, 2002.
[35] A. Andreakis, N. V. Hoyningen-Huene, and M. Beetz, “Incremental unsupervised time series analysis using merge growing neural gas,” in Advances in Self-Organizing Maps (Lecture Notes in Computer Science), vol. 5629, J. C. Principe and R. Miikkulainen, Eds. Heidelberg, Germany: Springer-Verlag, 2009, pp. 10–18.
[36] H. Ni and H. Yin, “Self-organising mixture autoregressive model for non-stationary time series modelling,” Int. J. Neural Syst., vol. 18, no. 6, pp. 469–480, 2008.
[37] H. Yin and H. Ni, “Generalized self-organizing mixture autoregressive model,” in Proc. Lecture Notes Comput. Sci. Adv. Self-Organ. Maps, 2009, pp. 353–361.
[38] N. Alex, A. Hasenfuss, and B. Hammer, “Patch clustering for massive data sets,” Neurocomputing, vol. 72, nos. 7–9, pp. 1455–1469, 2009.

[39] A. Georgakis, H. Li, and M. Gordan, “An ensemble of SOM networks for document organization and retrieval,” in Proc. Int. Conf. Adaptive Knowl. Represent. Reasoning, 2005, pp. 6–12.
[40] C. Saavedra, R. Salas, S. Moreno, and H. Allende, “Fusion of self-organizing maps,” in Proc. 9th Int. Work-Conf. Artificial Neural Netw. Lecture Notes Comput. Sci., 2007, pp. 227–234.
[41] B. Baruque and E. Corchado, “A weighted voting summarization of SOM ensembles,” Data Mining Knowl. Discovery, vol. 21, no. 3, pp. 398–426, 2010.
[42] A. Rauber, D. Merkl, and M. Dittenbach, “The growing hierarchical self-organizing map: Exploratory analysis of high-dimensional data,” IEEE Trans. Neural Netw., vol. 13, no. 6, pp. 1331–1341, Nov. 2002.
[43] H. He, S. Chen, K. Li, and X. Xu, “Incremental learning from stream data,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 1901–1914, Dec. 2011.
[44] M. C. Jones, J. S. Marron, and S. J. Sheather, “A brief survey of bandwidth selection for density estimation,” J. Amer. Statist. Assoc., vol. 91, no. 433, pp. 401–407, 1996.


[45] V. C. Raykar and R. Duraiswami, “Fast optimal bandwidth selection for kernel density estimation,” in Proc. 6th SIAM Int. Conf. Data Mining, 2006, pp. 524–528.
[46] D. W. Scott and S. R. Sain, “Multi-dimensional density estimation,” in Handbook of Statistics: Data Mining and Computational Statistics, C. R. Rao and E. J. Wegman, Eds. New York: Elsevier, 2004.
[47] A. de Decker, J. A. Lee, D. François, and M. Verleysen, “Mode estimation in high-dimensional spaces with flat-top kernels: Application to image denoising,” Adv. Artificial Neural Netw. Mach. Learn. Comput. Intell., vol. 74, no. 9, pp. 1402–1410, 2010.
[48] V. Laparra, G. Camps-Valls, and J. Malo, “Iterative Gaussianization: From ICA to random rotations,” IEEE Trans. Neural Netw., vol. 22, no. 4, pp. 537–549, Feb. 2011.
[49] A. Peñalver and F. Escolano, “Entropy-based incremental variational Bayes learning of Gaussian mixtures,” IEEE Trans. Neural Netw., vol. 23, no. 3, pp. 534–540, Mar. 2012.
[50] P. L. Zador, “Asymptotic quantization error of continuous signals and the quantization dimension,” IEEE Trans. Inform. Theory, vol. 28, no. 2, pp. 149–159, Mar. 1982.
[51] D. Dersch and P. Tavan, “Asymptotic level density in topological feature maps,” IEEE Trans. Neural Netw., vol. 6, no. 1, pp. 230–236, Jan. 1995.
[52] T. M. Martinetz, S. G. Berkovich, and K. J. Schulten, “Neural-gas network for vector quantization and its application to time-series prediction,” IEEE Trans. Neural Netw., vol. 4, no. 4, pp. 558–569, Jul. 1993.

[53] P. Grassberger and I. Procaccia, “Characterization of strange attractors,” Phys. Rev. Lett., vol. 50, no. 5, pp. 346–349, 1983.
[54] J. Bruske and G. Sommer, “Intrinsic dimensionality estimation with optimally topology preserving maps,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 5, pp. 572–575, May 1998.
[55] H. U. Bauer, R. Der, and M. Herrmann, “Controlling the magnification factor of self-organizing feature maps,” Neural Comput., vol. 8, no. 4, pp. 757–771, 1996.
[56] T. Villmann and J. C. Claussen, “Magnification control in self-organizing maps and neural gas,” Neural Comput., vol. 18, no. 2, pp. 446–469, 2006.
[57] S. Kullback and R. A. Leibler, “On information and sufficiency,” Ann. Math. Statist., vol. 22, no. 1, pp. 79–86, 1951.
[58] Q. Wang, S. R. Kulkarni, and S. Verdú, “Divergence estimation of continuous distributions based on data-dependent partitions,” IEEE Trans. Inform. Theory, vol. 51, no. 9, pp. 3064–3074, Sep. 2005.
[59] Q. Wang, S. R. Kulkarni, and S. Verdú, “Divergence estimation for multidimensional densities via k-nearest-neighbor distances,” IEEE Trans. Inform. Theory, vol. 55, no. 5, pp. 2392–2405, May 2009.
[60] E. Keogh, Q. Zhu, B. Hu, Y. Hao, X. Xi, L. Wei, and C. A. Ratanamahatana. The UCR Time Series Classification/Clustering Homepage [Online]. Available: http://www.cs.ucr.edu/∼eamonn/time_series_data/

[61] Y. Aït-Sahalia and A. Lo, “Nonparametric estimation of state-price densities implicit in financial asset prices,” to be published.
[62] C. R. Harvey, “The specification of conditional expectation,” to be published.
[63] A. R. Pagan and G. W. Schwert, “Alternative models for conditional stock volatility,” J. Econometrics, vol. 45, nos. 1–2, pp. 267–290, 1990.
[64] R. Balupari, “Real-time network-based anomaly intrusion detection,” M.S. thesis, Dept. Elect. Eng. Comput. Sci., Ohio Univ., Athens, OH, 2002.
[65] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, May 2009.
[66] S. Chen, H. He, and E. A. Garcia, “RAMOBoost: Ranked minority over-sampling in boosting,” IEEE Trans. Neural Netw., vol. 21, no. 10, pp. 1624–1642, Aug. 2010.

Yuan Cao (S’05) received the B.S. and M.S. degrees from Zhejiang University, Hangzhou, China, in 2001 and 2004, respectively, the M.S. degree from Oklahoma State University, Stillwater, all in electrical engineering, and the Ph.D. degree in computer engineering from the Stevens Institute of Technology, Hoboken, NJ, in 2011.

He is currently with MathWorks, Inc., Natick, MA. His current research interests include computational intelligence, pattern recognition, machine learning and data mining, and their applications in parallel computing and multi-core systems.

Haibo He (SM’11) received the B.S. and M.S. degrees in electrical engineering from the Huazhong University of Science and Technology, Wuhan, China, in 1999 and 2002, respectively, and the Ph.D. degree in electrical engineering from Ohio University, Athens, in 2006.

He was an Assistant Professor with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, from 2006 to 2009. He is currently an Assistant Professor with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston. He has published one research book (Wiley) and edited six conference proceedings (Springer). He has authored or co-authored over 100 peer-reviewed journal and conference papers, and his work has been covered by national and international media, such as the IEEE Smart Grid Newsletter, the Wall Street Journal, and Providence Business News. His current research interests include adaptive dynamic programming, machine learning, computational intelligence, very large scale integration and field-programmable gate array design, and various applications, such as smart grid.

Dr. He is currently an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS and the IEEE TRANSACTIONS ON SMART GRID. He was a recipient of the National Science Foundation CAREER Award in 2011 and the Providence Business News Rising Star Innovator of the Year Award in 2011.

Hong Man (SM’06) received the Ph.D. degree in electrical engineering from the Georgia Institute of Technology, Atlanta, in 1999.

He joined the Stevens Institute of Technology, Hoboken, NJ, in 2000. He is currently an Associate Professor and the Director of the Computer Engineering Program with the Electrical and Computer Engineering Department, and the Director of the Visual Information Environment Laboratory, Stevens Institute of Technology. His current research interests include signal and image processing, pattern recognition, and data mining.

Dr. Man has served on the Organizing Committees of the IEEE International Conference on Multimedia and Expo in 2007 and 2009 and the IEEE International Workshop on Multimedia Signal Processing in 2002 and 2005, and on the Technical Program Committees of various IEEE conferences.