Clustering Time Series Based on Forecast Distributions Using Kullback-Leibler Divergence

Taiyeong Lee, Yongqiao Xiao, Xiangxiang Meng, David Duling
SAS Institute, Inc.

100 SAS Campus Dr., Cary, NC 27513, USA

{taiyeong.lee, yongqiao.xiao, xiangxiang.meng, david.duling}@sas.com

ABSTRACT

One of the key tasks in time series data mining is to cluster time series. However, traditional clustering methods focus on the similarity of time series patterns in past time periods. In many cases, such as retail sales, we would prefer to cluster based on the future forecast values. In this paper, we show an approach to cluster forecasts or forecast time series patterns based on the Kullback-Leibler divergences among the forecast densities. We use the same normality assumption for error terms as used in the calculation of forecast confidence intervals from the forecast model. So the method does not require any additional computation to obtain the forecast densities for the Kullback-Leibler divergences. This makes our approach suitable for mining very large sets of time series. A simulation study and two real data sets are used to evaluate and illustrate our method. It is shown that using the Kullback-Leibler divergence results in better clustering when there is a degree of uncertainty in the forecasts.

Keywords

Time Series Clustering, Time Series Forecasting, Kullback-Leibler Divergence, Euclidean Distance

1. INTRODUCTION

Time series clustering has been used in many data mining areas such as retail, energy, weather, quality control charts, stock/financial data, and sequence/time series data generated by medical devices, etc. [3, 12, 14]. Typically, the observed data is used directly or indirectly as the source for time series clustering. For example, we can cluster the CO2 emission patterns of countries based on their historical data or based on some features extracted from the historical data. Numerous similarity/dissimilarity/distance/divergence measures [4, 5, 8] have been proposed and studied. Another category of time series clustering methods is model-based clustering, which clusters time series using the parameter estimates of the models or other statistics based on the errors associated with the estimates [10, 13]. In [11], Liao summarized these time series clustering methods into three categories: raw data based, extracted feature based, and model based.

Instead of using the observed time series, some extracted features of the observations, or models fit to the past time periods, we consider the forecasts themselves at a specific future forecast time point or during a future time period. For retail stores, for example, we can cluster the stores based on their sales forecast distributions at a particular future time, instead of on the observed sales data. Alonso et al. [1] used density forecasts for time series clustering at a specific future time point. However, since that method requires bootstrap samples, nonparametric forecast density estimation, and a specific distance measure between the forecast densities, it is not an efficient approach for clustering a large number of time series.

In this paper, we use the Kullback-Leibler divergence [9] for clustering the forecasts at a future point. Under the normality assumption on the errors, the Kullback-Leibler distance can be computed directly from the forecast means and variances provided by the forecast model. We also extend our method to cluster the forecasts at all future points in the forecast horizon, to capture forecast patterns that evolve over time. For instance, in the retail industry, business decisions such as stocking up or rearranging the shelves can be made after clustering the products based on their sales forecasts. Similarly, the clustering could be carried out at the store level, so that sales or pricing policies can be set for each group of stores. Typically, the number of time series in the retail industry is very large, and the industry requires fast forecasting as well as fast clustering. The proposed method is suitable for clustering large numbers of forecasts.

The paper is organized as follows. In Sections 2 and 3, we describe the KL divergence as a distance measure of forecast densities, and explain how to cluster forecasts. Following that, a simulation study and real data analyses are presented.

2. DISTANCE MEASURE FOR CLUSTERING FORECASTS

Since forecasts are not observed values, the Euclidean distance between two forecast values may not be close to the true distance. Our proposed method uses a symmetric version of the Kullback-Leibler divergence to calculate the distance between the forecast densities under the normality assumption on the forecast error terms. In other words, both the mean (the forecast value) and the variance (the forecast error variance) are used in the calculation of the distance.

2.1 Kullback-Leibler Divergence

Suppose P0 and P1 are the probability distributions of two continuous random variables. The Kullback-Leibler divergence of P1 from P0 is defined as

KLD(P_1 \| P_0) = \int p_1(x) \log \frac{p_1(x)}{p_0(x)} \, dx    (1)

where p0 and p1 are the density functions of P0 and P1. The Kullback-Leibler divergence KLD(P1 ‖ P0) is not a symmetric measure of the difference between P0 and P1, but in clustering we need a symmetric distance measure for the items (in this paper, time series) to be grouped. A well-known symmetric version of the Kullback-Leibler divergence is the average of the two divergences KLD(P1 ‖ P0) and KLD(P0 ‖ P1),

KLD_{avg}(P_1, P_0) = \frac{1}{2} \left\{ KLD(P_1 \| P_0) + KLD(P_0 \| P_1) \right\} = \frac{1}{2} \int \left( p_1(x) - p_0(x) \right) \log \frac{p_1(x)}{p_0(x)} \, dx    (2)

This is also known as the J-divergence of P0 and P1 [7]. When P1 and P0 are two normal distributions, that is, P1 ∼ N(µ1, σ1²) and P0 ∼ N(µ0, σ0²), the divergences can be written in closed form,

KLD(P_1 \| P_0) = \frac{1}{2\sigma_0^2} \left[ (\mu_1 - \mu_0)^2 + \sigma_1^2 - \sigma_0^2 \right] + \log \frac{\sigma_0}{\sigma_1}

KLD_{avg}(P_1, P_0) = \frac{1}{4} \left( \frac{1}{\sigma_0^2} + \frac{1}{\sigma_1^2} \right) (\mu_1 - \mu_0)^2 + \frac{(\sigma_1^2 - \sigma_0^2)^2}{4\sigma_0^2\sigma_1^2}    (3)

In the rest of the paper, we refer to the symmetric version of the KL divergence in (3) as the KL distance.
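For illustration, the KL distance in (3) can be evaluated directly from the forecast means and standard errors. The following is a minimal Python sketch; the function and variable names are ours and do not refer to any particular forecasting software:

```python
import math

def kl_distance(mu0, sigma0, mu1, sigma1):
    """Symmetric (averaged) KL distance between the normal forecast
    densities N(mu0, sigma0^2) and N(mu1, sigma1^2), as in Equation (3)."""
    v0, v1 = sigma0 ** 2, sigma1 ** 2
    # KLD(P1 || P0) for two normal densities
    kld_10 = math.log(sigma0 / sigma1) + (v1 + (mu1 - mu0) ** 2) / (2.0 * v0) - 0.5
    # KLD(P0 || P1) for two normal densities
    kld_01 = math.log(sigma1 / sigma0) + (v0 + (mu0 - mu1) ** 2) / (2.0 * v1) - 0.5
    return 0.5 * (kld_10 + kld_01)

# Example: two forecasts with means 10 and 12 and standard errors 2 and 3.
print(kl_distance(10.0, 2.0, 12.0, 3.0))
```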

2.2 KL and Euclidean Distances for Clustering Forecasts

For two forecasts f0 and f1 with forecast values µ0, µ1 and standard errors σ0, σ1, the (squared) Euclidean distance between the two forecasts is defined as

EUC(f_1, f_0) = (\mu_1 - \mu_0)^2    (4)

To be consistent with the definition of the KL divergence for normal densities, we define the squared distance function here. Using the Euclidean distance for clustering forecast time series ignores the variance information (σ0² and σ1²) of the underlying forecast distributions. In contrast, under the normality assumption, the Kullback-Leibler distance between the forecast distributions of f0 and f1 considers both the mean and the variance information of the forecasts, and it has the following relationship with the Euclidean distance,

KLD_{avg}(f_1, f_0) = \frac{1}{4} \left( \frac{1}{\sigma_0^2} + \frac{1}{\sigma_1^2} \right) EUC(f_1, f_0) + \frac{1}{4} \left( K + \frac{1}{K} \right) - \frac{1}{2}    (5)

where K = σ1²/σ0² is the ratio of the noise variances of the two forecasts f1 and f0.

The following plots of normal density functions (Figures 1 and 2) show that using the forecast values without considering their distributions (that is, using the EUC distance) may not be appropriate for clustering forecasts. In these plots, the pair of forecasts that is closer under the Euclidean distance is farther apart under the KL distance, and vice versa.

Figure 1: An example of forecasts with the same mean values but different errors

Figure 2: An example of forecasts with different mean values but the same errors

When two forecast values are the same and the forecast distributions are ignored, the two forecasts are always clustered into the same category based on the mean difference (Euclidean distance). However, when we use the KL distance, clustering may produce a different result even when the mean difference is zero (Figure 1). For example, consider sales data from retail stores. The sales forecasts of two stores are both zero for the next week, but their standard deviations differ, as shown in Figure 1. When we do not consider the forecast distributions, the two stores are clustered into the same segment and may get the same sales policy for the coming week. In contrast to Figure 1, Figure 2 shows two different forecast values of sales (0 and 50) with the same large standard deviation. Based on the KL distance, the forecast sales of the two stores in Figure 2 are less different than the forecast sales of the two stores in Figure 1 (KL distance = 0.22 vs. KL distance = 1.78). In other words, the two stores in Figure 1 are less likely to be clustered into the same segment than the two stores in Figure 2, even though their forecast values are identical.
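To make the comparison concrete, the reduced forms of Equation (3) for the two scenarios can be evaluated numerically. The standard deviations below are hypothetical guesses rather than the actual parameters behind Figures 1 and 2; they are chosen only so that the resulting distances come out close to the values quoted above:

```python
# Figure 1 scenario: identical means, different errors.
# With equal means, Equation (3) reduces to (1/4) * (K + 1/K) - 1/2, K = s1^2 / s0^2.
s0, s1 = 1.0, 3.0                      # hypothetical standard errors
K = (s1 / s0) ** 2
print(0.25 * (K + 1.0 / K) - 0.5)      # roughly 1.78

# Figure 2 scenario: means 0 and 50, identical (large) errors.
# With equal errors sigma, Equation (3) reduces to (mu1 - mu0)^2 / (2 * sigma^2).
mu0, mu1, sigma = 0.0, 50.0, 75.0      # hypothetical common standard error
print((mu1 - mu0) ** 2 / (2.0 * sigma ** 2))   # roughly 0.22
```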

We also observe the following properties of the symmetric Kullback-Leibler divergence as defined in Equation (5):

Property 1. KLDavg is not scale free; that is, it depends on the forecast errors. In particular, when σ1 = σ0, KLDavg = EUC(µ1, µ0) / (2σ0²).

Property 1 is desirable for clustering the forecasts, since we want to distinguish the forecast mean values together with their errors. It indicates that when the errors of the forecasts are the same, the KL distance differs from the Euclidean distance by a factor that depends on the error.

Property 2. Suppose there is a constant c > 0 such that σ0 = c·σ1. Then, as σ0 → ∞, KLDavg converges to the constant (1/4)(c² + 1/c²) − 1/2, which does not depend on the forecast means and equals 0 when c = 1.

Property 2 implies that the KL distance cannot distinguish two forecasts when their errors are both very large. This indicates that the forecasting models are also very important when clustering the forecasts. If a poor forecasting model is fit, we may end up with few clusters, because the errors make the forecasts indistinguishable.

Property 3. Under the same condition as in Property 2 and provided that µ1 ≠ µ0, KLDavg → ∞ when σ0 = c·σ1 → 0.

Property 3 tells us that the KL distance will not group two forecasts together when their errors are very small. In theory, when we have perfect forecasts (zero errors), there is no need to consider the errors in clustering. In practice, however, this case does not arise, since the errors increase as we forecast further ahead, as shown by the example in Equation (6).
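A quick numerical check of Properties 2 and 3 for the special case c = 1 (equal errors), where Equation (3) reduces to (µ1 − µ0)²/(2σ0²); the numbers are arbitrary and serve only to show the two limits:

```python
def kl_equal_errors(mu0, mu1, sigma):
    """KL distance of Equation (3) when both forecasts have the same error sigma."""
    return (mu1 - mu0) ** 2 / (2.0 * sigma ** 2)

# Fixed mean difference of 1; let the common error shrink or grow.
for sigma in [0.01, 0.1, 1.0, 10.0, 100.0]:
    print(sigma, kl_equal_errors(0.0, 1.0, sigma))
# Tiny errors blow the distance up (Property 3); huge errors drive it toward zero (Property 2).
```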

2.3 Forecast Distributions for KL Divergence

To get the KL distance among forecasts, we need to know the forecast density. As stated before, we utilize the forecast distributions that are used in the calculation of forecast confidence intervals to compute the KL distance. Since forecast confidence intervals are readily available in any forecasting software, this saves a lot of time and computing resources compared to [1], which needs a full forecast density estimation for the calculation of the distance matrix. As an example, we show how to obtain the k-step-ahead forecast value and variance in the simple exponential smoothing model.

Under the assumption of a Gaussian white noise process,

Yt = µt + εt, t = 1, 2, ...

then the smoothing equation is

St = αYt + (1− α)St−1,

and the k-step-ahead forecast of Yt+k made at time t is St, i.e., Yt(k) = St.

The simple exponential smoothing model uses an exponentially weighted moving average of the past values. The model is equivalent to an ARIMA(0,1,1) model without a constant,

(1 − B)Yt = (1 − θB)εt,

where θ = 1 − α. Thus Yt = εt + α(εt−1 + εt−2 + · · ·). Therefore the variance of the k-step-ahead forecast, V(Yt(k)), is

V(Y_t(k)) = V(\varepsilon_t) \left[ 1 + \sum_{j=1}^{k-1} \alpha^2 \right] = V(\varepsilon_t) \left[ 1 + (k - 1)\alpha^2 \right].    (6)

Under the Gaussian white noise assumption, the future value Yt+k follows N(Yt(k), V(Yt(k))), that is, a normal distribution whose mean is the point forecast and whose variance is the forecast variance. Therefore the KL distance between two forecasts at a future time point can be easily obtained using Equation (3).
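A minimal sketch of this computation for simple exponential smoothing is given below. The smoothing weight α is assumed known, and the one-step-ahead residual variance is used as the estimate of V(εt); the function name and data are illustrative only:

```python
import numpy as np

def ses_forecast(y, alpha, lead):
    """Simple exponential smoothing: point forecast S_t and the k-step-ahead
    forecast variances V(eps) * (1 + (k - 1) * alpha**2), k = 1, ..., lead."""
    y = np.asarray(y, dtype=float)
    level = y[0]                              # initial smoothed level
    residuals = []
    for value in y[1:]:
        residuals.append(value - level)       # one-step-ahead forecast error
        level = alpha * value + (1.0 - alpha) * level
    sigma2 = np.mean(np.square(residuals))    # estimate of V(eps_t)
    ks = np.arange(1, lead + 1)
    variances = sigma2 * (1.0 + (ks - 1) * alpha ** 2)
    return level, variances                   # same point forecast for every lead k

mean_k, var_k = ses_forecast([112, 118, 132, 129, 121, 135, 148], alpha=0.3, lead=5)
print(mean_k, var_k)
```

Together with the KL distance sketch in Section 2.1, this supplies the (mean, variance) pairs needed in Equation (3).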

3. CLUSTERING THE FORECASTS

When a distance function has been defined between all pairs of forecasts, we can use available clustering algorithms to cluster the forecasts. A hierarchical clustering algorithm needs a distance matrix between all pairs, while the more scalable k-means clustering algorithm requires the distance between a group of points (typically represented by the centroid of the group) and any other single point. Thanks to the additive property of the normal distribution (the sum of two independent normal random variables is still normal, with mean and variance equal to the sums of the individual means and variances), the KL distance between a group of points and any other single point can be computed just as easily. Therefore, we can use both the hierarchical and the k-means clustering algorithms with the KL distance for clustering the forecasts.

When clustering the forecasts, we consider two scenarios: clustering forecast values at a particular future time point, and clustering the forecast series for all future time points in the forecast horizon. Clustering the forecasts at a future time point helps us understand the forecasts and their clusters at the given time point, while clustering the forecast series helps us understand the overall forecast patterns.

3.1 Clustering at One Future Point

Let Xt(k) and Yt(k) be the k-step-ahead forecasts of two time series Xt and Yt, and let σx(k) and σy(k) be the standard errors of the forecasts. The KLDavg(Xt(k), Yt(k)) between the two forecasts can be calculated using Equation (5).

Page 4: Clustering Time Series Based on Forecast Distributions

The steps of clustering the forecasts at a future time point are shown below; a small code sketch of steps 2 to 5 follows the list. In this paper, we consider hierarchical clustering, but the procedure can be easily modified for any non-hierarchical clustering algorithm such as k-means.

1. Apply forecasting models to a forecast lead time k.

2. Obtain the forecasts (Xt(k), Yt(k)) and their standard errors (σx(k), σy(k)) for each of the time series.

3. Calculate the KL distance matrix among all pairs of the time series.

4. Apply a clustering algorithm with the KL distances.

5. Obtain the clusters of the forecasts.
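The sketch below illustrates steps 2 to 5 using SciPy's hierarchical clustering. The forecast means and standard errors at the chosen lead are assumed to have already been produced in step 1 by whatever forecasting models were fit; the numeric values are invented for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def kl_distance(mu0, s0, mu1, s1):
    """Symmetric KL distance between N(mu0, s0^2) and N(mu1, s1^2), Equation (3)."""
    d10 = np.log(s0 / s1) + (s1 ** 2 + (mu1 - mu0) ** 2) / (2.0 * s0 ** 2) - 0.5
    d01 = np.log(s1 / s0) + (s0 ** 2 + (mu0 - mu1) ** 2) / (2.0 * s1 ** 2) - 0.5
    return 0.5 * (d10 + d01)

def cluster_forecasts(mu, sigma, n_clusters):
    """Steps 2-5: build the pairwise KL distance matrix, then cluster."""
    n = len(mu)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = kl_distance(mu[i], sigma[i], mu[j], sigma[j])
    # condensed distance vector -> average-linkage hierarchical clustering
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Hypothetical forecast means and standard errors for five series at lead k.
labels = cluster_forecasts(mu=[0.0, 0.2, 5.0, 5.1, 0.1],
                           sigma=[1.0, 1.1, 1.0, 1.2, 4.0],
                           n_clusters=3)
print(labels)
```

Average linkage is used here because the KL distance is not a Euclidean distance; linkage methods such as Ward assume Euclidean geometry.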

3.2 Clustering the Forecast Series

The clusters at different future time points may be different. To capture the changes of the whole forecast pattern, we can cluster the forecast series over all future time points. Given a total forecast lead h, we extend the KL distance as follows:

KLD_{avg}(X_t, Y_t) = \sum_{k=1}^{h} KLD_{avg}(X_t(k), Y_t(k)).    (7)

Note that we still define the squared distance. The steps of clustering the forecast series are listed below; a small code sketch of the extended distance follows the list.

1. Apply forecasting models with total forecast lead h.

2. Obtain the forecasts (Xt(k), Yt(k)) and their standard errors (σx(k), σy(k)) at each lead time k, k = 1, 2, . . . , h.

3. Calculate the KL distance matrix among all pairs of the time series using Equation (7).

4. Apply a clustering algorithm with the KL distances.

5. Obtain the clusters of the forecasts.
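For step 3, the only change from the single-point case is that the pairwise distance sums the per-lead KL distances as in Equation (7). A minimal sketch, reusing the same normal-density KL distance as before; the forecast arrays are hypothetical:

```python
import math

def kl_distance(mu0, s0, mu1, s1):
    """Symmetric KL distance between N(mu0, s0^2) and N(mu1, s1^2), Equation (3)."""
    d10 = math.log(s0 / s1) + (s1 ** 2 + (mu1 - mu0) ** 2) / (2.0 * s0 ** 2) - 0.5
    d01 = math.log(s1 / s0) + (s0 ** 2 + (mu0 - mu1) ** 2) / (2.0 * s1 ** 2) - 0.5
    return 0.5 * (d10 + d01)

def kl_distance_series(mu_x, sig_x, mu_y, sig_y):
    """Equation (7): sum of per-lead KL distances over the forecast horizon."""
    return sum(kl_distance(mx, sx, my, sy)
               for mx, sx, my, sy in zip(mu_x, sig_x, mu_y, sig_y))

# Hypothetical 3-step-ahead forecasts and standard errors for two series.
print(kl_distance_series([1.0, 1.2, 1.3], [0.5, 0.6, 0.7],
                         [0.8, 1.5, 2.0], [0.5, 0.7, 0.9]))
```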

4. A SIMULATION STUDY

To demonstrate the performance of the proposed KL distance for clustering the forecasts, we simulate two groups of time series with the same autoregressive AR(2) [2] structure but different intercepts. Each time series is of length 100, and there are 50 time series in each group. The model is

X_t^{(i)} = \mu_i + 0.75\, X_{t-1}^{(i)} - 0.5\, X_{t-2}^{(i)} + \varepsilon_t^{(i)}, \quad t = 1, 2, \ldots, 100, \quad i = 1, 2,

where ε_t^{(i)} is Gaussian white noise with standard deviation σi. The two groups have µ1 = 0 and µ2 = 1, respectively. The noise standard deviations of the two groups (σ1 and σ2) each vary from 0.5 to 5 in steps of 0.5, in order to examine the performance difference between the KL distance and the Euclidean distance for time series with different signal-to-noise ratios (SNR). This yields 100 combinations of (σ1, σ2), and for each combination we repeat the simulation 400 times. We fit AR(2) models to the simulated time series and obtain the forecast values and variances. For the synthetic data, since we know the group label of each series, we can easily compute the clustering error rate (CER). We report the mean clustering error rate of both distance measures for each SNR setting.
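The simulated series can be generated as in the following sketch; the burn-in length and random seed are our own choices and are not specified above:

```python
import numpy as np

def simulate_ar2(mu, sigma, length=100, burn_in=100, rng=None):
    """Simulate X_t = mu + 0.75*X_{t-1} - 0.5*X_{t-2} + eps_t, eps_t ~ N(0, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(length + burn_in)
    for t in range(2, length + burn_in):
        x[t] = mu + 0.75 * x[t - 1] - 0.5 * x[t - 2] + rng.normal(0.0, sigma)
    return x[burn_in:]                  # drop the burn-in period

rng = np.random.default_rng(2011)
group1 = [simulate_ar2(mu=0.0, sigma=0.5, rng=rng) for _ in range(50)]
group2 = [simulate_ar2(mu=1.0, sigma=0.5, rng=rng) for _ in range(50)]
print(len(group1), len(group2), group1[0][:5])
```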

Figure 3: The mean clustering error rates in 400 simulations for clustering two groups of time series with different combinations of noise standard errors (σ1, σ2). The forecast lead is 10. Top: Euclidean distance; Bottom: KL distance.

The density plots in Figure 3 show the mean CER for both methods over the 400 simulations when clustering on all future time points with forecast lead 10. It is clear that the proposed KL distance outperforms the traditional Euclidean distance over a variety of SNR combinations. In particular, when one group of time series has a relatively high SNR compared with the other group (the top-left and bottom-right corners of the density plots), the KL distance can identify the true grouping of the time series (mean CERs close to zero) while the Euclidean distance produces poor clustering results (mean CERs around 15%). When both groups of time series have high noise standard errors and the two standard errors are close (the top-right corner of the density plots), the performance of both distance measures is poor.

Page 5: Clustering Time Series Based on Forecast Distributions

Figure 4: The mean clustering error rates in 400 simulations for clustering two groups of time series with different combinations of noise standard errors (σ1, σ2). The forecast lead is 1. Top: Euclidean distance; Bottom: KL distance.

In Figure 4, we show the simulation results for the same SNR combinations as in Figure 3, but the clustering is based on a single future forecast (forecast lead 1). Clearly, the proposed KL distance still achieves better mean CERs than the traditional Euclidean measure when clustering on one future time point. Consistent with the results in Figure 3, when the noise standard errors of the two groups of time series are different (the top-left and bottom-right corners of the density plots), the KL distance yields much smaller error rates than the Euclidean distance does. Comparing Figure 3 and Figure 4, we find that for the same SNRs, clustering time series based on 10 forecast time points tends to produce better results than clustering based on just one future time point, which suggests that a sufficiently long forecast horizon is needed for reliable clustering based on the forecasts.

5. REAL LIFE DATA STUDY

Two real life data sets are investigated to further evaluate the proposed KL distance: one data set is the CO2 emissions of all the countries in the world¹, and the other is the weekly sales amounts by store of a retail chain. A summary of the data sets is shown in Table 1.

Data Set    Number of Series    Length    Frequency
CO2         146                 48        Yearly
Sales       43                  152       Weekly

Table 1: Summary of Data Sets

The CO2 data consists of the CO2 emissions (metric tons per capita) of 216 countries from 1961 to 2008. (The CO2 data from 1960 to 1999 was used in [1].) There is one CO2 emission time series for each country. For better comparison of the results, we remove the series with missing values, which leaves 146 complete series.

The sales data is a small subset of the weekly sales data of a big retail chain in the USA, which has millions of products and thousands of stores. The subset contains the aggregated sales of one department for 43 stores. Each store has 152 weeks of sales history from February 2008 to December 2010. There are 43 time series, and the data contains no exogenous variables besides the aggregated sales.

For the real life data sets, we cannot compute the clustering error rate since the true class labels are not available. In order to compare the quality of the clustering results, we perform a holdout test: we hold out the most recent h periods of data for testing, and the forecasting models are built with the data prior to the holdout periods (the training data). For example, for the sales data, we hold out the most recent 12 weeks of data, and the forecasting models are built with the 140 weekly observations prior to those 12 weeks. After building the appropriate forecasting models, we use them to obtain forecast values and variances for the holdout periods.

Since our focus is on the evaluation of different distance functions in the clustering of the forecasts rather than on the accuracy of the models, we simply fit the training data with the best ESM (exponential smoothing model) for both the sales and the CO2 data sets. The candidate ESM models include the simple, double, linear, damped-trend, seasonal, and additive and multiplicative Winters methods. We select the best model as the one with the minimum root mean squared error (RMSE). After computing the distance matrix using either the KL or the Euclidean distance, hierarchical clustering is applied.

5.1 CO2 Emission Data by Country

There are 146 complete time series after removing those with missing values. An initial exploration of the data indicates that there is an outlier series (the country Qatar), which has significantly higher CO2 emissions per capita than the rest, so we remove this outlier from further analysis. For the remaining 145 series, we set the holdout period to 5 years. After clustering the forecasts in the holdout period with the number of clusters set to 3, we plot the actual time series for each cluster of the forecasts in Figure 5.

¹ http://data.worldbank.org/indicator/EN.ATM.CO2E.PC

Page 6: Clustering Time Series Based on Forecast Distributions

Figure 5: Clusters of the time series for the CO2 data: each line shows a series of the actual CO2 emissions by a country in the holdout 5 years, while the clusters are identified based on the clustering of the forecast series in the holdout 5 years using Euclidean distance (EUC) and KL distance (KLD).

We can see that the Euclidean distance separates the countries into 3 clusters: cluster 1 with relatively low CO2 emissions (106 countries), cluster 2 with medium CO2 emissions (29 countries), and cluster 3 with relatively high CO2 emissions (10 countries). When the errors in the forecasts are considered, the clusters are different. With the KL distance, clusters 1, 2 and 3 have 65, 52 and 28 countries, respectively. Figures 6 and 7 illustrate the difference in the clustering results between the KL distance and the Euclidean distance. Note that the dashed lines show the fits and forecasts, the solid lines show the actual time series (including the holdouts), and the filled areas show the confidence intervals for the forecast periods.

Using the Euclidean distance, Afghanistan and Albania in Figure 6 are both in cluster 1. However, Albania is separated from cluster 1 into cluster 2 when the KL distance is used. This is because the forecasts for Albania have larger errors, and thus the KL distance separates that series from cluster 1. Figure 7 shows that Australia and the United Kingdom are in different clusters (3 and 2, respectively) when the Euclidean distance is used. Instead, they are both put into cluster 3 when the KL distance is used, because their forecast distributions are similar.

Figure 6: Two forecast series with different forecast errors and separated by the KL distance: Albania in cluster 2 and Afghanistan in cluster 1. When using the Euclidean distance, both countries are in cluster 1. The clusters are shown in Figure 5.

Figure 7: Two forecast series with similar forecast errors and grouped into one cluster by the KL distance: both Korea and Australia are in cluster 3. When using the Euclidean distance, Korea is in cluster 2 and Australia is in cluster 3. The clusters are shown in Figure 5.

We checked the accuracy of the forecasting models (the best ESM). It turned out that the best ESM models perform very well in forecasting the CO2 data, with a MAPE (mean absolute percentage error) of about 10% in the holdout period.

It is also interesting to observe that the cluster pattern changes over time when we cluster at each time point in the holdout. Table 2 illustrates the cluster membership at each time point in the holdout and the overall cluster membership for four countries with the KL distance. The actual values, forecast values and confidence intervals for these countries in the holdout period are shown in Figure 8. Within the five years, most countries stay in the same cluster. But the cluster membership of a country can change because the forecasts change. For example, the forecasts for China have an upward trend, and its cluster changes from 1 to 2 in 2007. The errors in the forecasts can also contribute to changes of cluster membership from one future time point to another.

Page 7: Clustering Time Series Based on Forecast Distributions

Country    2004  2005  2006  2007  2008  Overall
Kenya      1     1     1     1     1     1
China      1     1     1     2     2     2
Mexico     2     2     2     2     2     2
USA        3     3     3     3     3     3

Table 2: An illustration of the cluster changes over time for the CO2 emissions of four countries. The cluster membership for each year is obtained from the clustering of forecasts at that year using the KL distance. The overall cluster membership is obtained from the clustering of the whole forecast time series using the KL distance. For illustration, the number of clusters is set to 3.

Figure 8: The time series plots of the four countries listed in Table 2. The clustering of one future forecast is performed at each of the five years in the holdout period. Dashed lines: the forecast values; solid lines: the actual holdout values; filled areas: the forecast confidence intervals.

5.2 Retail Store Sales Data

We set the holdout period to 12 weeks. With the number of clusters set to 5, the actual time series for the clusters based on the forecasts in the holdout period are shown in Figure 9.

With the Euclidean distance, the stores are grouped into 5 clusters: clusters 1, 2, 3, 4 and 5 contain 17, 16, 5, 3 and 2 stores, respectively. With the KL distance, clusters 1, 2, 3, 4 and 5 contain 14, 6, 14, 7 and 2 stores, respectively.

We check the forecast accuracy of the models. It turns out that the best ESM models have a MAPE of 40% in the holdout period. Note that for the sales data the forecasting could be enhanced with other model classes such as ARIMA [2] and UCM [6], and with exogenous variables such as prices, holidays, promotions, etc.

Figure 9: Clusters of the time series for the Sales data: each line shows a series of the actual sales in the holdout 12 weeks, while the clusters are identified based on the clustering of the forecast series in the holdout 12 weeks using Euclidean distance (EUC) and KL distance (KLD).

We then experiment without the holdout; that is, all available data points are used to build the forecasting models. Forecast values for the next six weeks are obtained using the best ESM models, and the KL distances are calculated at three specific future weeks. The interesting forecast lead points and forecast values are marked with squares in Figure 10.

Figure 10: Forecasts for the Sales data: each line shows a series of the forecast values in the future 6 weeks. The squares mark the weeks with interesting forecast clustering pattern changes.

Figure 11: Cluster changes over time in the future for the Sales data. The cluster membership for each week is obtained based on the clustering of forecasts at that week using the KL distance.

Figure 11 shows the cluster changes over time. We can see that some stores that are grouped in the same cluster in the first week (2Jan2011) are separated into different clusters in later weeks (30Jan2011). Thus the retail company may need to adjust its sales or promotion policy for each store and may set different business goals from week to week.

6. CONCLUSIONS AND FUTURE WORK

In this paper we used the KL distance for clustering the forecast distributions of time series. The KL distance requires density functions, and clustering a large number of time series with full density estimation of the forecasts takes a lot of computing resources. We approximated the forecast density by a normal density with the forecast means and variances, which are directly provided by the forecast model. Thus, the KL distances among forecasts can be easily obtained. We demonstrated the advantage of the KL distance over the Euclidean distance with simulations from autoregressive models. We also showed that the KL distance improved the clustering results on two real life data sets.

We would like to enhance the KL distance to handle cases in which the forecast errors are extremely large or small, where the current KL distance may not perform well. Different clustering algorithms such as k-means can be applied as well. Combining Dynamic Time Warping [8] with the KL distance to detect forecast pattern changes for series of different lengths is another possible extension of this work.

7. REFERENCES

[1] A. Alonso, J. Berrendero, A. Hernandez, and A. Justel. Time series clustering based on forecast densities. Computational Statistics & Data Analysis, 51(2):762–776, 2006.

[2] G. Box, G. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting and Control. Prentice Hall, 3rd edition, 1994.

[3] P. S. P. Cowpertwait and T. F. Cox. Clustering population means under heterogeneity of variance with an application to a rainfall time series problem. Journal of the Royal Statistical Society, Series D (The Statistician), 41(1):113–121, 1992.

[4] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. Proc. VLDB Endow., 1:1542–1552, August 2008.

[5] A. W.-C. Fu, E. Keogh, L. Y. H. Lau, and C. A. Ratanamahatana. Scaling and time warping in time series querying. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05, pages 649–660, 2005.

[6] A. C. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1989.

[7] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society A, 186:453–461, 1946.

[8] E. Keogh. Exact indexing of dynamic time warping. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02, pages 406–417, 2002.

[9] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.

[10] M. Kumar and N. R. Patel. Clustering data with measurement errors. Computational Statistics & Data Analysis, 51:6084–6101, August 2007.

[11] T. W. Liao. Clustering of time series data - a survey. Pattern Recognition, 38:1857–1874, 2005.

[12] M. F. Macchiato, L. La Rotonda, V. Lapenna, and M. Ragosta. Time modelling and spatial clustering of daily ambient temperature: An application in Southern Italy. Environmetrics, 6(1):31–53, 1995.

[13] P. C. Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Science, Calcutta, 2(1):49–55, 1936.

[14] Y. Kakizawa, R. H. Shumway, and M. Taniguchi. Discrimination and clustering for multivariate time series. Journal of the American Statistical Association, 93(441):328–340, 1998.