
Principal Component Analysis based

Time Series Segmentation –

A New Sensor Fusion Algorithm

Janos Abonyi, Balazs Feil, Sandor Nemeth, Peter Arva

University of Veszprem, Department of Process Engineering

P.O.Box. 158, H-8200, Veszprem, Hungary

www.fmt.vein.hu/softcomp e-mail: [email protected]

Abstract

Segmentation is the most frequently used subroutine in clustering, indexing, sum-

marization, anomaly detection, and classification of time series. Although in many

real-life applications a lot of variables must be simultaneously monitored, most of the

segmentation algorithms are used for the analysis of only one time-variant variable.

Hence, this paper proposes Principal Component Analysis (PCA) based algorithms

that are able to detect: (i) changes in the mean; (ii) changes in the variance; and

(iii) changes in the correlation structure among several variables. The segments

obtained by bottom-up segmentation algorithms are hierarchically clustered using

a PCA similarity factor. The whole approach is applied to the monitoring of the

industrial production of high-density polyethylene.

Key words: sensor fusion, segmentation, PCA, fuzzy clustering, bottom-up

method

Preprint submitted to Elsevier Science 16 November 2004


1 Introduction

Real-life time series can be taken from business, physical, social and behavioral

science, economics, engineering [1–3], etc. Time series segmentation is often

used to extract internally homogeneous segments from a given time-series to

locate stable periods of time, to identify change points, or to simply compress

the original time-series into a more compact representation [4]. Although in

many real-life applications a lot of variables must be simultaneously tracked

and monitored, most of the time-series segmentation algorithms are based on

only one time-variant variable [1].

This paper deals with the problem of multivariate time series segmentation. A univariate time series contains time-ordered data originating from a single sensor. Such time series can be taken from several sources; in industrial processes, for example, sensors measure physical or chemical properties such as pressure, temperature, concentration, flow or mass rate, valve position, density, melt index, or grain-size distribution. However accurate and frequent the measurements are, it is often the case that even the main changes of the system cannot be detected from the signal of a single sensor. This is because sometimes it is the change in the correlation structure between the variables (sensor signals) that is of interest, since such fused information reflects the hidden change of the system. In these cases there is a need to integrate the information and data taken from different sensors, which is a typical 'sensor fusion' problem, since "sensor fusion is the combination of sensory data or data derived from sensory data such that the resulting information is in some sense better than would be possible when these sources were used individually" [5].


The aim of this paper is to develop new algorithms that are able to handle time-varying multivariate data to detect: (i) changes in the mean; (ii) changes in the variance; and (iii) changes in the correlation structure among the variables. Principal Component Analysis (PCA) is the most frequently applied tool to discover such information [6], as PCA maps the multivariate data into a lower (usually two or three) dimensional space, which is useful in the analysis and visualization of correlated high-dimensional data [2].

PCA is a widely used tool in the field of sensor fusion [7–9], and a very popular multivariate technique for developing multivariate statistical process monitoring methods [10]. In most of the related works, PCA is used to eliminate the less significant components or sensors, reducing the data representation to the most significant ones, and to plot the data in two dimensions. For example, in [11] the measurements of an electronic nose and tongue were visualized by PCA, and based on this method information was given about the mutual correlation of the sensors. Another interesting application field is quality monitoring. In [12], an artificial neural network (ANN) was used to estimate global pollution parameters in water samples, particularly the Chemical Oxygen Demand (COD), and PCA was used to select the input variables of the neural model. Cimander et al. applied PCA as a part of a real-time expert system in [13], which allowed data transmission of more than 1800 different signals from instrumentation. In another work [14], Cimander and his colleagues applied PCA and ANN for on-line monitoring of yoghurt fermentation.

Linear PCA models have two particularly desirable features: they can be understood in great detail and they are straightforward to implement. Since the PCA model defines a linear hyperplane, the proposed segmentation algorithms can be considered as the multivariate extension of the piecewise linear approximation (PLA) based time series segmentation and analysis tools developed by Keogh [15]. Keogh showed that the PLA based representation of time-series has many desirable properties:

• High rates of data compression. For example, consider Figure 3: the original time series contains 9600 × 11 points, while its segmented version contains only seven segments.

• Relative insensitivity to noise.

• Intuitiveness and ease of visualization.

Based on PLA models effective data mining algorithms have been worked

out for fast similarity search, weighted queries, and change point detection in

univariate time series [15]. Most of these algorithms utilize a simple distance

measure to compare the segments of different time series. This distance mea-

sure is calculated based on the endpoints of the linear lines used to describe

the segments [16]. Unfortunately, the distances among multivariate PCA mod-

els (i.e. hyperplanes) cannot be evaluated with this approach. Hence, for this

purpose the PCA similarity factor developed by Krzanowski [17,18] is used

to compare multivariate time series segments. This paper will show how this

PCA similarity factor can be used to search for similar segments and cluster

the detected subsequences.

The algorithms proposed in this paper are new time series segmentation tools that can be used to extract new and useful information from multivariate data. Clustering and segmentation are the most frequently used data mining algorithms, being useful in their own right as exploratory techniques, and also as subroutines in rule discovery, indexing, summarization, anomaly detection and classification. These topics belong to 'data mining' and 'knowledge discovery in databases', but they can be interpreted in a much more general way: they can be referred to as methods related to 'data fusion' and 'information fusion' [5]. The methods proposed in this paper can be seen as 'direct fusion methods' [19], because they fuse (history values of) sensor data. The applicability of the proposed algorithms is demonstrated by the analysis of real-life process data taken from an industrial polyethylene plant. It will be shown that exploiting the segmentation results requires a priori knowledge about the environment (experience of operators, knowledge of engineers and scientists), so an 'indirect information fusion' approach is also followed in this work.

The paper is organized as follows. The aim of time series segmentation is formalized in Section 2. Section 3 describes different cost functions for segmentation based on PCA. The new algorithms are presented in Section 4. Section 5 presents the application examples. Conclusions are given in Section 6.

2 Time Series Segmentation

A time-series $T = \{\mathbf{x}_k = [x_{1,k}, x_{2,k}, \ldots, x_{n,k}]^T \mid 1 \le k \le N\}$ is a finite set of $N$ $n$-dimensional samples labelled by time points $t_1, \ldots, t_N$. A segment of $T$ is a set of consecutive time points $S(a, b) = \{a \le k \le b\}$, i.e. the samples $\mathbf{x}_a, \mathbf{x}_{a+1}, \ldots, \mathbf{x}_b$. The $c$-segmentation of time-series $T$ is a partition of $T$ into $c$ non-overlapping segments $S_T^c = \{S_i(a_i, b_i) \mid 1 \le i \le c\}$, such that $a_1 = 1$, $b_c = N$, and $a_i = b_{i-1} + 1$. In other words, a $c$-segmentation splits $T$ into $c$ disjoint time intervals by segment boundaries $s_1 < s_2 < \ldots < s_c$, where the $i$-th segment is $S_i(s_i, s_{i+1} - 1)$.

The goal of the segmentation procedure is to find internally homogeneous segments from a given time-series. To formalize this goal, a cost function $\mathrm{cost}(S(a, b))$ describing the internal homogeneity of individual segments should be defined. Usually, this cost function is defined based on the distances between the actual values of the time-series and the values given by a simple function (a constant or linear function, or a polynomial of a higher but limited degree) fitted to the data of each segment. For example, in [20,21] the sum of variances of the variables in the segment was defined as $\mathrm{cost}(S(a, b))$:

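$$\mathrm{cost}(S_i(a_i, b_i)) = \frac{1}{b_i - a_i + 1} \sum_{k=a_i}^{b_i} \| \mathbf{x}_k - \mathbf{v}_i \|^2, \tag{1}$$

$$\mathbf{v}_i = \frac{1}{b_i - a_i + 1} \sum_{k=a_i}^{b_i} \mathbf{x}_k,$$

where $\mathbf{v}_i$ is the mean of the segment.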

The segmentation algorithms simultaneously determine the parameters $\theta_i$ of the models used to approximate the behavior of the system in the segments, and the borders $a_i, b_i$ of the segments, by minimizing the sum of the costs of the individual segments:

$$\mathrm{cost}(S_T^c) = \sum_{i=1}^{c} \mathrm{cost}(S_i). \tag{2}$$
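To make Eqs. (1) and (2) concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the samples are stored as the rows of a NumPy array and uses 0-based, inclusive indices (the paper counts from 1). The function names `segment_cost` and `segmentation_cost` and the toy data are illustrative only.

```python
import numpy as np

def segment_cost(X, a, b):
    """Eq. (1): mean squared distance of rows a..b (inclusive) from their mean."""
    seg = np.asarray(X, dtype=float)[a:b + 1]
    v = seg.mean(axis=0)                          # segment mean v_i
    return float(np.mean(np.sum((seg - v) ** 2, axis=1)))

def segmentation_cost(X, starts):
    """Eq. (2): sum of the segment costs for segments beginning at `starts`."""
    N = len(X)
    ends = [s - 1 for s in starts[1:]] + [N - 1]  # b_i = a_{i+1} - 1, b_c = N - 1
    return sum(segment_cost(X, a, b) for a, b in zip(starts, ends))

# Toy data: two flat regimes in two variables.
X = np.array([[0.0, 0.0]] * 3 + [[4.0, 2.0]] * 3)
print(segmentation_cost(X, [0]))      # one segment spanning both regimes
print(segmentation_cost(X, [0, 3]))   # split exactly at the regime change
```

Splitting at the regime change removes all within-segment variation in this toy case, which is the behavior the cost function is meant to reward.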

This cost function can be minimized by dynamic programming (e.g. [21]), which is unfortunately computationally intractable for many real data sets. Hence, usually one of the following heuristic approaches is followed:

• Search for inflection points: Searching for primitive episodes located between two inflection points [2].

• Sliding window: A segment is grown until it exceeds some error bound. The process repeats with the next data point not included in the newly approximated segment. For example, a linear model is fitted on the observed period and the modelling error is analyzed [15].

• Top-down method: The time-series is recursively partitioned until some stopping criterion is met [15].

• Bottom-up method: Starting from the finest possible approximation, segments are merged until some stopping criterion is met [15].

• Clustering based method: Time-series segmentation may be viewed as clustering, but with a time-ordered structure. In [22] a new fuzzy clustering algorithm has been proposed which can be effectively used to segment large, multivariate time-series.

In data mining, the bottom-up algorithm has been used extensively to support a variety of time series data mining tasks [15]; hence this approach is followed in this paper. The algorithm begins by creating a fine approximation of the time series, and iteratively merges the lowest-cost pair of segments until a stopping criterion is met. When the pair of adjacent segments $S_i$ and $S_{i+1}$ is merged, the cost of merging the new segment with its right neighbor and the cost of merging the segment $S_{i-1}$ with its new, larger neighbor must be calculated. The pseudocode of the algorithm is shown in Table 1.

This algorithm is quite powerful, since evaluating the merging costs requires only the identification of PCA models, which is easy to implement and computationally cheap. Because of this simplicity, and because PCA defines a linear hyperplane, the proposed approach can be considered as the multivariate extension of the piecewise linear approximation (PLA) based time series segmentation and analysis tools developed by Keogh [15,16].


Table 1

Bottom-up segmentation algorithm

• Create initial fine approximation.

• Find the cost of merging for each pair of segments:

    mergecost(i) = cost(S(a_i, b_{i+1}))

• while min(mergecost) < maxerror

    · Find the cheapest pair to merge:

        i = argmin_i(mergecost(i))

    · Merge the two segments, update the a_i, b_i boundary indices, and recalculate the merge costs:

        mergecost(i) = cost(S(a_i, b_{i+1}))

        mergecost(i − 1) = cost(S(a_{i−1}, b_i))

  end
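The steps of Table 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' MATLAB code: the initial approximation is taken to be consecutive blocks of `init_len` samples, indices are 0-based and inclusive, and the simple variance cost of Eq. (1) stands in for the PCA-based costs. Any function with the signature `cost(X, a, b)` could be passed instead. The names `bottom_up`, `variance_cost`, `init_len` and `max_error` are hypothetical.

```python
import numpy as np

def variance_cost(X, a, b):
    """Eq. (1) cost of rows a..b (inclusive, 0-based); a stand-in cost function."""
    seg = X[a:b + 1]
    return float(np.mean(np.sum((seg - seg.mean(axis=0)) ** 2, axis=1)))

def bottom_up(X, cost, max_error, init_len=2):
    """Merge adjacent segments while the cheapest merge costs less than max_error."""
    X = np.asarray(X, dtype=float)
    N = len(X)
    # Initial fine approximation: consecutive blocks of init_len samples.
    segs = [(a, min(a + init_len, N) - 1) for a in range(0, N, init_len)]
    # merge_cost[i] is the cost of the segment obtained by merging segs[i] and segs[i+1].
    merge_cost = [cost(X, segs[i][0], segs[i + 1][1]) for i in range(len(segs) - 1)]
    while merge_cost and min(merge_cost) < max_error:
        i = int(np.argmin(merge_cost))            # cheapest pair to merge
        segs[i] = (segs[i][0], segs[i + 1][1])    # merged segment
        del segs[i + 1]
        del merge_cost[i]
        # Only the merges that involve the new segment need to be recalculated.
        if i < len(segs) - 1:
            merge_cost[i] = cost(X, segs[i][0], segs[i + 1][1])
        if i > 0:
            merge_cost[i - 1] = cost(X, segs[i - 1][0], segs[i][1])
    return segs

# Toy univariate series with three flat regimes.
X = np.array([0.0] * 6 + [5.0] * 6 + [1.0] * 6).reshape(-1, 1)
print(bottom_up(X, variance_cost, max_error=0.5))
```

The loop stops on an error threshold as in Table 1; stopping at a prescribed number of segments, as done in the application examples, would only change the `while` condition.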

3 PCA based Segmentation Costs

Since the aim of this paper is to design a segmentation algorithm that is able to detect changes in the correlation structure among several variables, the cost function of the segmentation is based on the Principal Component Analysis of the covariance matrices $\mathbf{F}_i$ of the segments:

$$\mathbf{F}_i = \frac{1}{b_i - a_i} \sum_{k=a_i}^{b_i} (\mathbf{x}_k - \mathbf{v}_i)(\mathbf{x}_k - \mathbf{v}_i)^T. \tag{3}$$

Principal Component Analysis (PCA) is based on the decomposition of the covariance matrix $\mathbf{F}_i = \mathbf{U}_i \mathbf{\Lambda}_i \mathbf{U}_i^T$ into a matrix $\mathbf{\Lambda}_i$, which contains the eigenvalues of $\mathbf{F}_i$ on its diagonal in decreasing order, and a matrix $\mathbf{U}_i$, whose columns are the eigenvectors corresponding to these eigenvalues. With the use of the first few ($p < n$) nonzero eigenvalues and the corresponding eigenvectors, the PCA model projects the correlated high-dimensional data onto a hyperplane, which is useful for the visualization and the analysis of multivariate data:

$$\mathbf{y}_{i,k} = \mathbf{\Lambda}_{i,p}^{-\frac{1}{2}} \mathbf{U}_{i,p}^T \mathbf{x}_k \tag{4}$$

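When the PCA model has an adequate number of dimensions, the distance of the data from the $p$-dimensional hyperplane of the PCA model is caused by measurement failures, disturbances and negligible information. Hence, it is useful to analyze the reconstruction error of the projection:

$$Q_{i,k} = (\mathbf{x}_k - \hat{\mathbf{x}}_k)^T (\mathbf{x}_k - \hat{\mathbf{x}}_k) = \mathbf{x}_k^T (\mathbf{I} - \mathbf{U}_{i,p} \mathbf{U}_{i,p}^T) \mathbf{x}_k, \tag{5}$$

where $\hat{\mathbf{x}}_k = \mathbf{U}_{i,p} \mathbf{U}_{i,p}^T \mathbf{x}_k$ denotes the reconstructed sample.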

The analysis of the distribution of the projected data is also informative. The Hotelling $T^2$ measure is often used to calculate the distance of the mapped data from the center of the linear subspace:

$$T^2_{i,k} = \mathbf{y}_{i,k}^T \mathbf{y}_{i,k}. \tag{6}$$

Figure 1 illustrates these measures in case of two variables and one principal

component.
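As an illustration of Eqs. (3) to (6), the sketch below fits a PCA model to one segment and evaluates $Q$ and $T^2$ for a set of samples. It is a minimal example under assumptions: the formulas are applied to the rows exactly as they are passed in, so any centring or scaling of the data is taken to have been done beforehand (the paper does not spell this out). The function names `pca_model` and `q_and_t2` are hypothetical.

```python
import numpy as np

def pca_model(seg, p):
    """Eigen-decomposition of the segment covariance, keeping p components."""
    seg = np.asarray(seg, dtype=float)
    F = np.cov(seg, rowvar=False)            # Eq. (3): 1/(b_i - a_i) normalisation
    eigval, eigvec = np.linalg.eigh(F)       # eigenvalues returned in ascending order
    order = np.argsort(eigval)[::-1][:p]     # indices of the p largest eigenvalues
    return eigval[order], eigvec[:, order]   # diagonal of Lambda_{i,p}, and U_{i,p}

def q_and_t2(X, lam, U):
    """Reconstruction error Q (Eq. 5) and Hotelling T^2 (Eq. 6) for every row of X."""
    X = np.asarray(X, dtype=float)
    proj = X @ U                             # each row is (U^T x_k)^T
    resid = X - proj @ U.T                   # (I - U U^T) x_k
    Q = np.sum(resid ** 2, axis=1)
    Y = proj / np.sqrt(lam)                  # Eq. (4): y = Lambda^{-1/2} U^T x
    T2 = np.sum(Y ** 2, axis=1)
    return Q, T2

# Two perfectly correlated variables, so one principal component suffices.
t = np.linspace(-1.0, 1.0, 50)
seg = np.column_stack([t, 2.0 * t])
lam, U = pca_model(seg, p=1)
Q, T2 = q_and_t2(seg, lam, U)
print(Q.max())                               # samples on the model line: Q close to 0
print(q_and_t2([[2.0, -1.0]], lam, U))       # a sample orthogonal to the model line
```

A sample lying on the model hyperplane has a negligible $Q$ but may have a large $T^2$, while a sample orthogonal to it has a large $Q$ and a $T^2$ near zero, which matches the two distances shown in Figure 1.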

These $T^2$ and $Q$ measures are often used for the monitoring of multivariate systems and for the exploration of the errors and the causes of the errors. The main idea of this paper is to use these measures as the measure of the homogeneity of the segments:

$$\mathrm{cost}_{T^2}(S_i(a_i, b_i)) = \frac{1}{b_i - a_i + 1} \sum_{k=a_i}^{b_i} T^2_{i,k} \tag{7}$$

$$\mathrm{cost}_{Q}(S_i(a_i, b_i)) = \frac{1}{b_i - a_i + 1} \sum_{k=a_i}^{b_i} Q_{i,k}$$

Fig. 1. Distance measures based on the PCA model.
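The two homogeneity costs can be sketched as a single function that fits a PCA model to the segment and averages $T^2$ and $Q$ over its samples. This is an interpretation, not the authors' code: the samples are projected as written in Eq. (4), without subtracting the segment mean, because centring every sample on its own segment mean would make the average $T^2$ equal to $p(N-1)/N$ for any segment of $N$ samples. The name `pca_costs` and the toy data are hypothetical.

```python
import numpy as np

def pca_costs(X, a, b, p):
    """Return (cost_T2, cost_Q) for rows a..b (inclusive, 0-based) with p components."""
    seg = np.asarray(X, dtype=float)[a:b + 1]
    eigval, eigvec = np.linalg.eigh(np.cov(seg, rowvar=False))
    order = np.argsort(eigval)[::-1][:p]          # p largest eigenvalues
    lam, U = eigval[order], eigvec[:, order]
    proj = seg @ U                                # U^T x_k for every sample
    Q = np.sum((seg - proj @ U.T) ** 2, axis=1)   # Eq. (5)
    T2 = np.sum(proj ** 2 / lam, axis=1)          # Eqs. (4) and (6)
    return float(T2.mean()), float(Q.mean())

# Toy data: the correlation between the two variables flips sign halfway through.
t = np.linspace(-1.0, 1.0, 20)
X = np.vstack([np.column_stack([t, t]), np.column_stack([t, -t])])
print(pca_costs(X, 0, 19, p=1)[1])   # homogeneous segment: cost_Q close to 0
print(pca_costs(X, 0, 39, p=1)[1])   # segment spanning the change: larger cost_Q
```

In this toy case the means and variances of both variables are identical in the two halves, so only the change of correlation raises the $Q$-based cost of the merged segment.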

4 Hierarchical Clustering of Segments and Time-Series

4.1 Distance Measure for PCA Models

An advantage of using the PLA segment representation of the time-series is that it allows one to define a variety of distance measures to represent the similarities between two time-series. The distance measure defined for univariate piecewise linear models is calculated based on the endpoints of the lines used to describe the segments. Unfortunately, the distances among multivariate PCA models (i.e. hyperplanes) cannot be evaluated with this approach. Hence, for this purpose the PCA similarity factor, $S_{PCA}$, developed by Krzanowski [17,18] is used to compare multivariate time series segments.

Consider two segments, $S_i$ and $S_j$, of a historical data set having the same $n$ variables. Let the PCA models for $S_i$ and $S_j$ consist of $p$ PCs each. The corresponding ($n \times p$) subspaces defined by the eigenvectors of the covariance matrices are denoted by $\mathbf{U}_{i,p}$ and $\mathbf{U}_{j,p}$, respectively. The similarity between these subspaces is defined based on the sum of the squares of the cosines of the angles between each principal component of $\mathbf{U}_{i,p}$ and $\mathbf{U}_{j,p}$:

$$S_{PCA} = \frac{1}{p} \sum_{i=1}^{p} \sum_{j=1}^{p} \cos^2 \theta_{i,j} = \frac{1}{p} \mathrm{trace}\left( \mathbf{U}_{i,p}^T \mathbf{U}_{j,p} \mathbf{U}_{j,p}^T \mathbf{U}_{i,p} \right) \tag{8}$$

Because subspaces Ui,p and Uj,p contain the p most important principal com-

ponents that account for most of the variance in their corresponding data sets,

SPCA is also a measure of similarity between the segments Si and Sj.
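The trace form of Eq. (8) translates directly into a few lines of NumPy. The sketch below is illustrative only: it assumes the two bases are given as arrays with orthonormal columns and the same number of components, and the name `pca_similarity` is hypothetical.

```python
import numpy as np

def pca_similarity(U_i, U_j):
    """Eq. (8): S_PCA = trace(U_i^T U_j U_j^T U_i) / p for two (n x p) bases."""
    p = U_i.shape[1]
    M = U_i.T @ U_j              # cosines of the angles between the components
    return float(np.trace(M @ M.T)) / p

# One-component models in two dimensions.
U_a = np.array([[1.0], [0.0]])                 # first axis
U_b = np.array([[0.0], [1.0]])                 # orthogonal to U_a
U_c = np.array([[1.0], [1.0]]) / np.sqrt(2.0)  # 45 degrees from U_a
print(pca_similarity(U_a, U_a))
print(pca_similarity(U_a, U_b))
print(pca_similarity(U_a, U_c))
```

The factor lies between 0 (orthogonal subspaces) and 1 (identical subspaces), and it does not depend on the sign of the eigenvectors, which is convenient because eigenvector signs are arbitrary.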

4.2 Hierarchical Clustering of Subsequences

The clustering of time series can be broadly classified into two categories:

• Whole clustering: the notion of clustering here is similar to that of conventional clustering of discrete objects. Given a set of individual time series data, the objective is to group similar time series into the same cluster.

• Subsequence clustering: Given a single time series, individual time se-

ries (subsequences) are extracted with a sliding window. Clustering is then

performed on the extracted time series.

In the paper of Lin et al. it is shown that the subsequence clustering approach is meaningless [23]. Hence, in this paper the whole clustering approach is followed: the extracted segments are clustered as individual time series, based on the PCA similarity factor presented in the previous subsection.


One of the most widely used clustering approaches is hierarchical clustering, because the resulting dendrogram effectively shows the merging of the objects into clusters at various stages of the analysis and the similarities at each stage of the clustering procedure (see the bottom of Figure 3). The interpretation of the results is intuitive, which is the major reason for the application of this method.

The $S_{PCA}$ similarity factors are organized in the form of a matrix. The similarity matrix is then scanned for the largest value, which corresponds to the most similar pair of segments. These two segments are linked, and the rows and columns corresponding to the old segments are removed from the matrix. The rows and columns of the similarity matrix for the new group of segments are then recomputed. This process is repeated until all segments have been linked.

There are a variety of ways to compute the distances between the objects and clusters in hierarchical clustering. The single-linkage method used here assesses similarity by measuring the distance to the nearest object in the cluster. The results of a hierarchical clustering are usually displayed as a dendrogram, which is a tree-shaped map of the intersample distances in the data set.
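The linking procedure described above can be sketched as a small agglomerative loop over the similarity matrix. This is a hedged illustration, not the authors' implementation: single linkage is taken in its usual sense, so the similarity between two groups is the similarity of their most similar pair of members, and the output is a plain merge list from which a dendrogram could be drawn. The function name `agglomerate` and the example matrix are hypothetical.

```python
import numpy as np

def agglomerate(S):
    """Repeatedly link the two most similar clusters until one cluster remains.

    S is a symmetric matrix of pairwise S_PCA values. Returns a list of
    (members_a, members_b, similarity) tuples in merge order.
    """
    S = np.asarray(S, dtype=float)
    clusters = [[i] for i in range(len(S))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: similarity of the closest pair of members.
                s = max(S[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((clusters[a], clusters[b], s))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical similarity factors for four segments.
S = [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.3, 0.2],
     [0.2, 0.3, 1.0, 0.8],
     [0.1, 0.2, 0.8, 1.0]]
for left, right, sim in agglomerate(S):
    print(left, right, sim)
```

For a handful of segments this brute-force search is sufficient; a library routine for hierarchical clustering would be preferable for large numbers of segments.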

5 Application to Process Monitoring

Manual process supervision relies heavily on visual monitoring of character-

istic shapes of changes in process variables, especially their trends. Although

humans are very good at visually detecting such patterns, for a control sys-

tem software it is a difficult problem. The aim of this example is to show how

the proposed algorithms are able to detect meaningful temporal shapes from

multivariate historical process data of several sensors.


The monitoring of a medium and high-density polyethylene (MDPE, HDPE) plant is considered. HDPE is a versatile plastic used for household goods, packaging, car parts and pipe. The plant is operated by TVK Ltd., which is the largest polymer production company in Hungary (www.tvk.hu). An interesting problem with the process is that about ten product grades have to be produced according to market demand. The difficulty of the problem comes from the fact that there are more than ten process variables to consider. Measurements are available every 15 seconds (240 per hour) on the process variables $\mathbf{x}_k$, which are: $x_{k,1}$, the polymer production intensity (PE); $x_{k,2}, \ldots, x_{k,6}$, the inlet flowrates of hexene (C6in), ethylene (C2in), hydrogen (H2in), the isobutane solvent (IBin) and the catalyst (Kat); $x_{k,7}, \ldots, x_{k,9}$, the concentrations of ethylene (C2), hexene (C6) and hydrogen (H2); $x_{k,10}$, the density of the slurry in the reactor (slurry); and $x_{k,11}$, the temperature of the reactor (T).

Before the application of the presented methods two important parameters have to be selected. The first is the number of principal components, $p$. With the increase of $p$ the reconstruction error decreases. In case of $p = n = 11$ the reconstruction error becomes zero and the Hotelling $T^2$ becomes the real distance in the whole range of the data set. If $p$ is too small, the reconstruction error will be large for the entire time-series. In these two extreme cases the segmentation is not based on the internal relationships among the variables, so equidistant segments are detected. When the number of latent variables is in the range $p = 3, \ldots, 8$, reliable segments are detected and the results (borders of segments and dendrograms) are not sensitive to the exact number in this range. This is because the first $3, \ldots, 8$ eigenvalues contain $95, \ldots, 99$ % of the total variance.
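The share of the total variance carried by the first $p$ eigenvalues can be checked with a short script. The sketch below is only an illustration on synthetic data: the 95 % threshold mirrors the range quoted above, while the function names `explained_variance` and `smallest_p` and the generated data are hypothetical and unrelated to the plant measurements.

```python
import numpy as np

def explained_variance(X):
    """Cumulative fraction of the total variance held by the leading eigenvalues."""
    eigval = np.linalg.eigvalsh(np.cov(np.asarray(X, dtype=float), rowvar=False))
    eigval = np.sort(eigval)[::-1]           # decreasing order, as in Lambda_i
    return np.cumsum(eigval) / eigval.sum()

def smallest_p(X, threshold=0.95):
    """Smallest p whose first p eigenvalues reach the given variance fraction."""
    return int(np.searchsorted(explained_variance(X), threshold) + 1)

# Synthetic data: three independent variables with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])
print(explained_variance(X))
print(smallest_p(X, 0.95), smallest_p(X, 0.999))
```

Such a curve only gives a plausible range for $p$; as noted above, the segmentation results were found to be insensitive to the exact choice within that range.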

[Figure 2 panels: weighted cost based on Q (top) and relative reduction rate (bottom), both plotted against the number of segments.]

Fig. 2. The $\mathrm{cost}_Q$ and its relative reduction rate with number of segments in Example 1.

The other important parameter is the number of segments. It can be determined by the method presented by Vasko et al. in [20]. This method is based on a permutation test to determine whether the increase of the model accuracy with the increase of the number of segments is due to the underlying structure of the data or due to the noise. In this paper a similar but much simpler method has been applied for this purpose. It is based on the modelling error and the relative reduction of the modelling error when the number of segments is increased by one. The curves depend on the applied method, but similar diagrams can be obtained in the analyzed cases. The $\mathrm{cost}_Q$ (7) can be seen in Figure 2 as a function of the number of segments for Example 1; 7 segments give acceptable results (the relative reduction rate is quite inaccurate because of the noise in the data).
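The relative reduction of the modelling error can be computed as sketched below. The paper does not give the exact formula, so the definition used here, the drop in cost divided by the cost before adding a segment, is an assumption, as are the tolerance-based selection rule, the function names `relative_reduction` and `pick_segments`, and the cost values, which are made up for illustration.

```python
def relative_reduction(costs):
    """costs[c - 1] is the total cost with c segments (assumed positive).

    Returns the relative drop in cost gained by going from c to c + 1 segments.
    """
    return [(costs[c] - costs[c + 1]) / costs[c] for c in range(len(costs) - 1)]

def pick_segments(costs, tol=0.1):
    """First segment count after which one more segment improves the cost by less than tol."""
    for c, rate in enumerate(relative_reduction(costs), start=1):
        if rate < tol:
            return c
    return len(costs)

# Hypothetical total costs for 1, 2, 3, 4 and 5 segments.
costs = [8.0, 4.0, 2.0, 1.9, 1.85]
print(relative_reduction(costs))
print(pick_segments(costs))
```

On noisy data the reduction rate fluctuates, so in practice the curve is better inspected visually, as in Figure 2, than thresholded automatically.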

In the first example, consider a 40-hour long period of time with a product transition between the 15th and 20th hour. The transition can be followed well, e.g., in the temperature. As mentioned above, the algorithms searched for 7 segments with 4 principal components. The results can be seen in Figure 3. The left column contains the figures obtained by the $\mathrm{cost}_Q$ function and the right column those obtained by the $\mathrm{cost}_{T^2}$ function. The vertical lines show the borders of the segments. The dendrograms show, based on the applied PCA similarity measure (8), how similar the resulting segments are. It is important to mention that the dendrograms must not be compared because they are not related to the same segment borders. Both methods determine the change very accurately and detect two segments around the 16th hour. Both methods distinguish the irregular operation between the 26th and 29th hour. The method based on $\mathrm{cost}_Q$ splits the first 15 hours into two segments, a split that the method based on $\mathrm{cost}_{T^2}$ does not find; but if 6 segments are searched for by $\mathrm{cost}_Q$, the borders are the same as with 7 segments except that the first two segments are merged. Consequently, both methods give very similar results for the product transition.

In our second example a 35-hour long period of production of a particular product was chosen to be analyzed. Both cost functions, namely $\mathrm{cost}_Q$ and $\mathrm{cost}_{T^2}$, have been applied to this dataset as well. Based on the change of model accuracy, 8 segments seemed to be acceptable for both algorithms. As can be seen in Figure 4, the algorithms give different results. The first 5 eventful hours contain several segments by both algorithms: 3 segments based on $\mathrm{cost}_Q$ and 4 segments based on $\mathrm{cost}_{T^2}$. These first hours belong to the transition time period (see e.g. the density of slurry or the temperature) that lasts approximately up to the 7th-8th hour. The borders of the other segments are quite different; only the irregular operation around the 15th hour is detected similarly: both algorithms detect 2 segments close to this time. (The concentration of ethylene is increased, the inlet flowrates of hexene, ethylene and isobutane solvent are decreased, and the polymer production intensity changes rapidly during this short time period of approximately three-quarters of an hour.) As can be seen in Figure 1, the modelling error $Q$ is more sensitive to changes in the correlation structure of the data, while the Hotelling $T^2$ can determine more accurately the changes in the mean and variance of the data, even if the analyzed data points were generated by the same model. These changes usually happen together during product transitions but they can differ under normal operating conditions. The differences between the two approaches can be the topic of possible future work, but here it can be concluded that these methods are complementary and can be applied to a segmentation problem together rather than separately.

Humans play the most important role in the exploitation of the results of segmentation. In knowledge discovery from time series the goal is to detect interesting patterns in the series that may help to better recognize the regularities in the observed variables and thereby improve the understanding of the system. Humans are extremely good at visual observation, and are able to observe changes in as many as 10-12 variables, while this is a very difficult and challenging problem for computers. However, more variables cannot be reviewed by humans, and when extremely large databases have to be analyzed it is also worth having it done by computers. For this purpose, a new method has been developed in this paper, and humans, mainly the experts of the analyzed technology, take part in the explanation and utilization of the segmentation results because this cannot be done by computers. For example, it has to be determined why a given segment is different from another one, what its cause can be, and how it can be used in the operation of the technology to reduce the amount of off-grade products. Hence, the exploitation of the segmentation results requires a priori knowledge about the environment (experience of operators, knowledge of engineers and scientists), so an 'indirect information fusion' approach is also followed in this work.

In the current state of our project we use this tool to compare the production of different products and extract homogeneous segments, and the results show that the proposed tool is useful for extracting information about the changes of the operation regimes of the process and about process faults.

6 Conclusions

This paper presented a new algorithm for the segmentation of multivariate

time-series. The algorithm is based on the simultaneous identification of bor-

ders of the segments and the hyperplanes of the local PCA models used to

measure the homogeneity of the segments. Two homogeneity measures corresponding to the two typical applications of PCA models have been defined.

The Q reconstruction error segments the time-series according to the change

of the correlation among the variables, while the Hotelling T 2 measure seg-

ments the time-series based on the drift of the center of the operating region.

The algorithm was applied to the monitoring of the production of high-density

polyethylene. The results suggest that the proposed tool can be applied to dis-

tinguish and cluster typical operational conditions and analyze product grade

transitions of process systems.

Beside the industrial application examples synthetic datasets can be analyzed

as well to convince the readers about the usefulness of the method. For this

purpose, the MATLAB code of the algorithm is available from our website

(www.fmt.vein.hu/softcomp/segment), so the readers can easily test the pro-

posed method on their own datasets.

The application of the identified segments in an intelligent query system designed for multivariate historical process databases is an interesting and useful idea for future research.

Acknowledgements

The support of the Cooperative Research Center (2001-II-1), the Hungarian Ministry of Education (FKFP - 0073 / 2001), the Hungarian Science Foundation (T037600) and the Janos Bolyai Research Fellowship is gratefully acknowledged.

[Figure 3 plots: left and right columns of time-series panels over Time (h), 0 to 40, for C2 (w%), C6 (w%), H2 (mol%), slurry (g/cm^3), T (°C), PE (t/h), C6in (kg/h), C2in (t/h), H2in (kg/h), IBin (t/h) and Katin; each column has a dendrogram with vertical axis "Level" (leaf orders as extracted: 1 5 2 7 3 4 6 and 1 4 3 5 6 7 2).]

Fig. 3. The plots on the left show the borders of the segments and the dendrogram based on cost_Q; the plots on the right show the same based on cost_{T^2}, with product transition (Example 1).

[Figure 4 plots: left and right columns of time-series panels over Time (h), 0 to 35, for C2 (w%), C6 (w%), H2 (mol%), slurry (g/cm^3), T (°C), PE (t/h), C6in (kg/h), C2in (t/h), H2in (kg/h), IBin (t/h) and Katin; each column has a dendrogram with vertical axis "Level" (leaf orders as extracted: 1 2 3 4 7 5 8 6 and 1 2 5 8 4 7 6 3).]

Fig. 4. The plots on the left show the borders of the segments and the dendrogram based on cost_Q; the plots on the right show the same based on cost_{T^2}, without transition (Example 2).