arXiv:1610.07717v3 [cs.LG] 19 May 2017



Distributed and parallel time series feature extraction for industrial big data applications

Maximilian Christ (a), Andreas W. Kempa-Liehr (b,c), Michael Feindt (a,1)

(a) Blue Yonder GmbH, Karlsruhe, Germany
(b) Department of Engineering Science, University of Auckland, Auckland, New Zealand
(c) Freiburg Materials Research Center, University of Freiburg, Freiburg, Germany

Abstract

The all-relevant problem of feature selection is the identification of all strongly and weakly relevant attributes. This problem is especially hard to solve for time series classification and regression in industrial applications such as predictive maintenance or production line optimization, for which each label or regression target is associated with several time series and meta-information simultaneously. Here, we propose an efficient, scalable feature extraction algorithm for time series, which filters the available features in an early stage of the machine learning pipeline with respect to their significance for the classification or regression task, while controlling the expected percentage of selected but irrelevant features. The proposed algorithm combines established feature extraction methods with a feature importance filter. It has a low computational complexity, allows one to start on a problem with only limited domain knowledge available, can be trivially parallelized, is highly scalable, and is based on well-studied non-parametric hypothesis tests. We benchmark our proposed algorithm on all binary classification problems of the UCR time series classification archive as well as on time series from a production line optimization project and simulated stochastic processes with underlying qualitative change of dynamics.

Keywords: Feature Engineering, Feature Selection, Time Series Feature Extraction
2010 MSC: 37M10

The feature extraction algorithms and the FRESH algorithm itself, all of which are described in this work, have been implemented in an open source Python package called tsfresh. Its source code can be found at https://github.com/blue-yonder/tsfresh.

* Corresponding author
Email addresses: [email protected] (Maximilian Christ), [email protected] (Andreas W. Kempa-Liehr), [email protected] (Michael Feindt)
1 On leave of absence from Karlsruhe Institute of Technology

Preprint submitted to Neurocomputing May 23, 2017


1. Introduction

Utilizing the continuously increasing amount of accessible data provides tremendous benefit for the understanding of the complex systems which we are part of and contribute to. Examples range from the often counter-intuitive behavior of social systems and their macro-level collective dynamics [1] to disease dynamics [2] and precision medicine [3]. Other promising fields of application for machine learning are the Internet of Things (IoT) [4] and Industry 4.0 [5] environments. In these fields, machine learning models anticipate future device states by combining knowledge about device attributes with historic sensor time series. They permit the classification of devices (e.g. hard drives) into risk classes with respect to a specific defect [6]. Both fields are driven by the availability of cheap sensors and advancing connectivity between devices, which increases the need for machine learning on temporally annotated data.

In most cases the volume of the generated time series data forbids its transport to centralized databases [4]. Instead, algorithms for an efficient reduction of data volume by means of feature extraction and feature selection are needed [7, p. 125–136]. Furthermore, for online applications of machine learning it is important to continuously select relevant features in order to deal with concept drifts caused by qualitative changes of the underlying dynamics [8].

Therefore, for industrial and other applications, one needs to combine distributed feature extraction methods with a scalable feature selection, especially for problems where several time series and meta-information have to be considered per label/target [9]. For time series classification, it has proven efficient to apply comprehensive feature extraction algorithms and then filter the respective features [10].

Motivated by industrial applications for machine learning models [11], we extend the approach of Fulcher and Jones [10] by a highly parallel feature filtering and propose FeatuRe Extraction based on Scalable Hypothesis tests (FRESH). The algorithm characterizes time series with comprehensive and well-established feature mappings and considers additional features describing meta-information. In a second step, each feature vector is individually and independently evaluated with respect to its significance for predicting the target under investigation. The result of these tests is a vector of p-values, quantifying the significance of each feature for predicting the label/target. This vector is evaluated on the basis of the Benjamini-Yekutieli procedure [12] in order to decide which features to keep.

The proposed algorithm is evaluated on all binary classification problems of the UCR time series classification archive [13] as well as on time series data from a production line optimization project and simulated time series from a stochastic process with underlying qualitative change of dynamics [14, p. 164]. The results are benchmarked against well-established feature selection algorithms like linear discriminant analysis [10] and the Boruta algorithm [15], but also against Dynamic Time Warping [16]. The analysis shows that the proposed method outperforms Boruta based feature selection approaches as well as Dynamic Time Warping based approaches for problems with large feature sets and large time series samples.

The FRESH algorithm and the comprehensive framework to extract the features have also been implemented in a Python package called tsfresh. Its source is open, and it can be downloaded from www.github.com/blue-yonder/tsfresh. At the time of writing, tsfresh is the most comprehensive feature extraction package that is publicly available in Python. The package is designed for the fast extraction of a huge number of features while being compatible with popular Python machine learning frameworks such as scikit-learn [17], numpy [18] or pandas [19].

This work starts with an introduction to time series feature extraction in Sec. 2. Afterwards, in Sec. 3, the FRESH algorithm is introduced. In Sec. 4, the performance of its Python implementation is evaluated on the UCR time series and an industrial dataset. Afterwards, characteristics of FRESH are discussed in Sec. 5. This work closes with a summary and an outlook on future work in Sec. 6. Additionally, Appendix A contains an overview of the considered feature mappings.

2. Time series feature extraction

2.1. Time series

Temporally annotated data come in three different variants [20]: temporally invariant information (e.g. the manufacturer of a device), temporally variant information which changes irregularly (e.g. process states of a device), and temporally variant information with regularly updated values (e.g. measurements of sensors on a device). The latter describes the continuously changing states s_{i,j}(t) of a system or device s_i with respect to a specific measurement of sensor j, which is repeated in intervals of length Δt. This sampling captures the state of the system or device under investigation as a sequence of length n_t^{(j)}

$$ s_{i,j}(t_1) \to s_{i,j}(t_2) \to \ldots \to s_{i,j}(t_\nu) \to \ldots \to s_{i,j}(t_{n_t^{(j)}}) \tag{1a} $$

with $t_{\nu+1} = t_\nu + \Delta t$. Such sequences are called time series² and are abbreviated by

$$ S_{i,j} = \left(s_{i,j}(t_1), s_{i,j}(t_2), \ldots, s_{i,j}(t_\nu), \ldots, s_{i,j}(t_{n_t^{(j)}})\right)^T = \left(s_{i,j,1}, s_{i,j,2}, \ldots, s_{i,j,\nu}, \ldots, s_{i,j,n_t^{(j)}}\right)^T. \tag{1b} $$

Here, we are considering i = 1, …, m devices with recordings for j = 1, …, n different time series. The recorded time series of one type j should have the same length n_t^{(j)} for all devices. So in total we are dealing with $m \cdot \sum_{j=1}^{n} n_t^{(j)}$ values describing the input for the multi-variate time series problem under investigation.

² The FRESH algorithm was developed with time series in mind, where t_ν denotes a point in time. However, it can also be applied to other uniformly sampled signals such as spectra, i.e. signals where t_ν denotes a wavelength.


2.2. Time series classification

We assume that we are given a target vector $Y = (y_1, \ldots, y_m)^T$, where the entry y_i of Y describes a characteristic or state of device i. These values can be represented either by discrete or by continuous values. In the first case, predicting the values of Y based on the different time series $S = (S_{i,j})_{i=1,\ldots,m,\,j=1,\ldots,n}$ constitutes a multi-variate time series classification problem; in the latter, we perform a regression on multi-variate time series input.

The first problem class, which deals with assigning time series to discrete categories, is called time series classification (TSC). There are different approaches to tackle TSC problems; Bagnall et al. [21] give an overview of common approaches and recent algorithmic advances in the field of TSC.

Shape-based approaches identify similar pairs of time series in terms of their values through time. For this purpose a similarity metric like the Euclidean distance is applied, and algorithms like k-nearest-neighbors (kNN) can be used to find similar time series and their associated target values {y_1, ..., y_k}. The class label is then, for example, predicted as the majority vote across the k neighbors [23]. Dynamic Time Warping (DTW) is a variant of shape-based approaches which accommodates local (temporal) shifts in the data [22].

On the other hand, direct approaches learn a representation for the input objects. Neural networks and their deep extensions deploy a network of neurons and weights and are especially suited to applications where manual feature engineering is difficult. Common extensions like recurrent neural networks (RNN) [24], stacked restricted Boltzmann machines (RBM) [25], or convolutional neural networks (CNN) [26] have already been deployed for TSC tasks.

A third group of TSC approaches is feature-based, which will be discussed in the following section.

2.3. Feature mapping

In this work, in order to characterize a time series and reduce the data volume, a mapping $\theta_k : \mathbb{R}^{n_t^{(j)}} \to \mathbb{R}$ is introduced, which captures a specific aspect of the time series. One example for such a mapping is the maximum operator

$$ \theta_{\max}(S_{i,j}) = \max\{s_{i,j,1}, s_{i,j,2}, \ldots, s_{i,j,\nu}, \ldots, s_{i,j,n_t^{(j)}}\}, $$

which quantifies the maximal value ever recorded for time series S_{i,j}. This kind of lower dimensional representation is called a feature, i.e. a measurable characteristic of the considered time series. The feature mappings that we consider are stateless and cannot use information from other time series to derive their values (this e.g. prohibits rankings of the time series with respect to a certain metric as a feature mapping). Other examples for feature mappings θ_k of time series are their mean, the number of peaks with a certain steepness, their periodicity, a global trend, etc.
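Such stateless feature mappings are straightforward to express as plain functions over a single series. The following sketch shows three of them; the function names and the simple peak definition are illustrative and not taken from the tsfresh API:

```python
import numpy as np

def theta_max(s):
    """Maximum value ever recorded for the time series s."""
    return np.max(s)

def theta_mean(s):
    """Mean value of the time series s."""
    return np.mean(s)

def theta_n_peaks(s):
    """Number of local maxima (a simple peak count)."""
    s = np.asarray(s)
    return int(np.sum((s[1:-1] > s[:-2]) & (s[1:-1] > s[2:])))

# One short example series s_{i,j} and its feature values.
s_ij = [1.0, 3.0, 2.0, 5.0, 4.0]
features = {f.__name__: f(s_ij) for f in (theta_max, theta_mean, theta_n_peaks)}
```

Each mapping only sees the one series it is applied to, which is what makes the extraction trivially parallelizable across devices and sensor types.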

There is a comprehensive literature on the topic of time series feature extraction. Early on, authors started to discuss the extraction of basic features such as max, min, and skewness [27] or generic patterns such as peaks [28]. Apart from that, every field that investigates time series discusses specialized features for its applications, e.g. peak features to classify audio data [29], wavelet based features to monitor vibrations [30], the parameters of a fitted exponential function to estimate the residual life of bearings [31], or logarithmic periodogram features to detect arcs in the contact strips of high velocity trains [32]. Finally, Fulcher and Jones [10] and Nun et al. [33] built comprehensive collections of such time series feature mappings over different domains. Fulcher and Jones [10] even collect more than 9000 features from 1000 different feature generating algorithms that are discussed in fields such as medicine, astrophysics, finance, mathematics, climate science, and industrial applications. In this work, we will consider a smaller number of 111 features; a list of the considered mappings is given in Appendix A.

Now, consider n_f different time series feature mappings, which are applied to all m · n time series recorded from n sensors of m devices (Fig. 1). The resulting feature matrix $X \in \mathbb{R}^{m \times n_\phi}$ has m rows (one for each device) and $n_\phi = n \cdot n_f + n_i$ columns, with n_i denoting the number of features generated from device specific meta-information. Each column of X comprises a vector $X \in \mathbb{R}^m$ capturing a specific characteristic of all considered devices. The resulting feature matrix and the target vector are the basis for supervised classification algorithms such as a Random Forest classifier. With the help of the feature matrix, the TSC task becomes a supervised classification problem.

Because it is such a common approach, many authors have already tackled TSC tasks in a feature-based fashion [27, 28, 29, 31, 34, 35]. A very recent approach is the COTE algorithm [36], which computes 35 classifiers on four different data transformations capturing similarities in the time, frequency, change, and shape domains. On the UCR archive, an important benchmark collection of TSC problems, it was able to reach a higher accuracy than any other previously published TSC algorithm.

3. Feature filtering

Typically, time series are noisy and contain redundancies. Therefore, one should keep the balance between extracting meaningful but probably fragile features and robust but probably non-significant features. Some features such as the median value will not be heavily influenced by outliers, while others such as the maximal value of the time series are intrinsically fragile. The choice of the right time series feature mappings is crucial to capture the right characteristics for the task at hand.

3.1. Relevance of features

A meaningless feature describes a characteristic of the time series that is not useful for the classification or regression task at hand. Radivojac et al. [37] measure the relevance of feature X for the classification of a binary target Y as the difference between the class conditional distribution of X given Y = y_1, denoted by f_{X|Y=y_1}, and of X given Y = y_2, denoted by f_{X|Y=y_2}.



Figure 1: Data processing tiers of the filtered feature extraction algorithm. All but the Benjamini-Yekutieli procedure can be computed in parallel.

We adopt this definition and consider a feature X to be relevant for the classification of a binary target Y if the two conditional density functions are not equal. So, a feature X will be denoted relevant for predicting target Y if and only if

$$ \exists\, y_1, y_2 \text{ with } f_Y(y_1) > 0,\ f_Y(y_2) > 0 : f_{X|Y=y_1} \neq f_{X|Y=y_2}. \tag{2} $$

Having different class conditional distributions as in Eq. (2) is actually equivalent to X and Y being statistically dependent. This becomes clear in the opposite case, when the feature X is not relevant:

$$ \begin{aligned} & X \text{ is not relevant for target } Y \\ \Leftrightarrow\ & \forall\, y_1, y_2 \text{ with } f_Y(y_1) > 0,\ f_Y(y_2) > 0 : f_{X|Y=y_1} = f_{X|Y=y_2} \\ \Leftrightarrow\ & \forall\, y_1 \text{ with } f_Y(y_1) > 0 : f_{X|Y=y_1} = f_X \\ \Leftrightarrow\ & f_{X,Y} = f_{X|Y} f_Y = f_X f_Y \\ \Leftrightarrow\ & X \text{ and } Y \text{ are statistically independent} \end{aligned} \tag{3} $$

We can therefore use statistical independence to derive a shorter definition of a relevant feature:

Definition 1 (A relevant feature). A feature X_φ is relevant or meaningful for the prediction of Y if and only if X_φ and Y are not statistically independent.

There are different approaches for checking whether a given feature fulfills this definition or not. We will address the significance of a feature by means of hypothesis testing, a statistical inference technique [38].

3.2. Hypothesis tests

For every extracted feature X_1, …, X_φ, …, X_{n_φ} we deploy a single statistical test checking the hypotheses

$$ H_0^\phi = \{X_\phi \text{ is irrelevant for predicting } Y\}, \quad H_1^\phi = \{X_\phi \text{ is relevant for predicting } Y\}. \tag{4} $$

The result of each hypothesis test of H_0^φ is a so-called p-value p_φ, which quantifies the probability of observing the given feature values, or more extreme ones, under the null hypothesis that X_φ is irrelevant for predicting Y. Small p-values indicate features which are relevant for predicting the target.

Based on the vector (p_1, …, p_{n_φ})^T of all hypothesis tests, a multiple testing approach will select the relevant features (Sec. 3.3). We propose to treat every feature individually with a different statistical test, depending on whether the codomains of target and feature are binary or not. The usage of one general feature test for all constellations is not recommended. Specialized hypothesis tests yield a higher statistical power due to more assumptions about the codomains that can be used during the construction of those tests. The proposed feature significance tests are based on nonparametric hypothesis tests, which do not make any assumptions about the distribution of the variables, thus ensuring robustness of the procedure.

Exact Fisher test of independence: This feature significance test can be used if both the target and the inspected feature are binary. Fisher's exact test [39] is based on the contingency table formed by X_φ and Y. It inspects whether both variables are statistically independent, which corresponds to the hypotheses from Eq. (4). Fisher's test belongs to the class of exact tests. For such tests, the significance of the deviation from a null hypothesis (e.g., the p-value) can be calculated exactly, rather than relying on asymptotic results.
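With SciPy, such a test can be sketched as follows; the contingency table below is hypothetical and only serves to illustrate the mechanics:

```python
from scipy.stats import fisher_exact

# Hypothetical contingency table of a binary feature X_phi against a binary
# target Y: rows are the two feature values, columns the two target classes.
table = [[8, 2],
         [1, 9]]

# Fisher's exact test of independence; a small p-value rejects H0 from
# Eq. (4), i.e. the feature is considered a candidate for selection.
odds_ratio, p_value = fisher_exact(table)
```

The p-value is not used for a per-feature accept/reject decision directly; it is collected into the p-value vector that the Benjamini-Yekutieli procedure (Sec. 3.3) evaluates.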

Kolmogorov-Smirnov test (binary feature): This feature significance test assumes the feature to be binary and the target to be continuous. In general, the Kolmogorov-Smirnov (KS) test is a non-parametric and stable goodness-of-fit test, which checks if two random variables A and B follow the same distribution [40]:

$$ H_0 = \{f_A = f_B\}, \quad H_1 = \{f_A \neq f_B\}. $$

By conditionally modeling the distribution function of target Y on the two possible values x_1, x_2 of the feature X_φ we can use the KS test to check if the distribution of Y differs given different values of X_φ. Setting A = Y|X_φ = x_1 and B = Y|X_φ = x_2 results in

$$ H_0^\phi = \{f_{Y|X_\phi=x_1} = f_{Y|X_\phi=x_2}\}, \quad H_1^\phi = \{f_{Y|X_\phi=x_1} \neq f_{Y|X_\phi=x_2}\}. \tag{5} $$

The hypotheses from Eq. (4) and (5) are equivalent, as demonstrated in the chain of Eqns. (3). Hence, the KS test can address the feature relevance of X_φ.
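As a sketch with synthetic, hypothetical data, the two conditional samples of the target can be compared with SciPy's two-sample KS test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical data: a binary feature X_phi splits the continuous target Y
# into two conditional samples. Here the conditional distributions are
# deliberately shifted, so the feature should be flagged as relevant.
y_given_x1 = rng.normal(loc=0.0, scale=1.0, size=200)
y_given_x2 = rng.normal(loc=1.0, scale=1.0, size=200)

# KS test of Eq. (5): H0 is that both conditional distributions are equal.
statistic, p_value = ks_2samp(y_given_x1, y_given_x2)
```

For the binary-target variant described below, the same call is used with the roles of feature and target swapped.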


Figure 2: The Benjamini-Yekutieli procedure for a sample of simulated p-values of 250 individual feature significance tests. The rejection line aims to control an FER level q of 10%. Panel (a) shows the ordered p-values p_(φ); panel (b) shows the 40 lowest ordered p-values, distinguishing the selected from the rejected features.

Kolmogorov-Smirnov test (binary target): When the target is binary and the feature non-binary, we can deploy the Kolmogorov-Smirnov test again. We have to switch the roles of target and feature variable, resulting in the testing of the following hypotheses:

$$ H_0^\phi = \{f_{X_\phi|Y=y_1} = f_{X_\phi|Y=y_2}\}, \quad H_1^\phi = \{f_{X_\phi|Y=y_1} \neq f_{X_\phi|Y=y_2}\}. $$

This time y_1 and y_2 are the two possible values of Y and f_{X_φ|Y=y_j} is the conditional density function of X_φ given Y. This hypothesis is also equivalent to the one in Eq. (4).

Kendall rank test: This filter can be deployed if neither target nor feature is binary. Kendall's rank test [41] checks if two continuous variables may be regarded as statistically dependent, hence naturally fitting our hypotheses from Eq. (4). It is a non-parametric test based on Kendall's rank statistic τ, measuring the strength of monotonic association between X_φ and Y. The calculation of the rank statistic is more complex when ties are involved [42], i.e. when feature or target are categorical.
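A sketch with synthetic, hypothetical data, using SciPy's implementation of Kendall's τ (which handles ties internally):

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(1)

# Hypothetical continuous feature and continuous target with a monotonic
# association plus noise; Kendall's tau tests for statistical dependence.
x_phi = rng.normal(size=300)
y = 2.0 * x_phi + rng.normal(scale=0.5, size=300)

# tau measures the strength of monotonic association; the p-value tests
# H0 from Eq. (4), i.e. independence of feature and target.
tau, p_value = kendalltau(x_phi, y)
```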

3.3. Feature significance testing

In the context of time series feature extraction, a wrongly added feature is a feature X_φ for which the null hypothesis H_0^φ has been rejected by the respective feature significance test, even though H_0^φ is true. The risk of such a false positive result is, by construction of the hypothesis tests, only controlled for individual features. However, when testing multiple hypotheses and features simultaneously, errors in the inference tend to accumulate [43]. In multiple testing, the expected proportion of erroneous rejections among all rejections is called the false discovery rate (FDR).

The FDR as a measure of the accumulated statistical error was suggested by Benjamini and Hochberg [44]. Later, the non-parametric Benjamini-Yekutieli procedure was proposed. Based on the p-values, it tells which hypotheses to reject while still controlling the FDR under any dependency structure between those hypotheses [12]. It will be the last component of our filtered feature extraction algorithm.


The procedure searches for the first intersection between the ordered sequence of p-values p_(φ) (blue dots in Fig. 2) and the linear sequence (green line in Fig. 2)

$$ r_\phi = \frac{\phi\, q}{n_\phi \sum_{\mu=1}^{n_\phi} \frac{1}{\mu}}. \tag{6} $$

Here, n_φ is the number of all tested null hypotheses and q is the FDR level that the procedure controls. The procedure rejects all hypotheses belonging to p-values which are lower than the p-value at the intersection (Fig. 2(b)).

By deploying the Benjamini-Yekutieli procedure, a global inference error measure, the False Extraction Rate (FER), is controlled:

$$ \mathrm{FER} = \mathbb{E}\left[\frac{\text{number of irrelevant extracted features}}{\text{number of all extracted features}}\right] \tag{7} $$

The FRESH algorithm controls the FER asymptotically for all distributions of features and target as well as for every dependency structure.
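A minimal sketch of the Benjamini-Yekutieli step-up decision based on Eq. (6); the helper name is ours, and tsfresh's actual implementation may differ:

```python
import numpy as np

def benjamini_yekutieli(p_values, q=0.10):
    """Return a boolean mask of rejected null hypotheses (selected features).

    Step-up procedure under arbitrary dependence: reject all hypotheses whose
    ordered p-value lies below the rejection line of Eq. (6),
    r_phi = phi * q / (n_phi * sum_{mu=1}^{n_phi} 1/mu).
    """
    p = np.asarray(p_values, dtype=float)
    n = len(p)
    c = np.sum(1.0 / np.arange(1, n + 1))       # harmonic correction term
    order = np.argsort(p)
    line = np.arange(1, n + 1) * q / (n * c)    # rejection line r_phi
    below = p[order] <= line
    # Step-up: find the largest phi below the line, reject everything up to it.
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(n, dtype=bool)
    rejected[order[:k]] = True
    return rejected

p_vals = [0.0001, 0.0004, 0.02, 0.3, 0.8]
mask = benjamini_yekutieli(p_vals, q=0.10)
```

A rejected null hypothesis H_0^φ means the corresponding feature X_φ is kept.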

3.4. The proposed feature extraction algorithm

We combine the components from Secs. 2, 3.2 and 3.3 to propose FeatuRe Extraction based on Scalable Hypothesis tests (FRESH) for parameter q ∈ [0, 1], given by the following three steps:

1. Perform a set of n_φ univariate feature mappings as introduced in Sec. 2 on the m · n different time series to create the feature vectors X_φ with φ = 1, …, n_φ.

2. For each generated feature vector X_1, …, X_{n_φ} perform exactly one hypothesis test for the hypothesis H_0^φ from Eq. (4). To do so, take the corresponding feature significance test from Sec. 3.2. Calculate the p-values p_1, …, p_{n_φ} of the tests.

3. Perform the Benjamini-Yekutieli procedure under correction for dependent hypotheses [12] for an FDR level of q on the collected p-values p_1, …, p_{n_φ} in order to decide which null hypotheses H_0^φ to reject (c.f. Sec. 3.3). Only return feature vectors for which the respective hypothesis H_0^φ was rejected by the procedure.

So, the FRESH algorithm extracts all features in step 1, which are then individually investigated by the hypothesis tests in step 2; finally, in step 3, the decision about the extracted features is made.
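The three steps can be sketched end-to-end for the simplest case of a binary target, one time series per sample, and the binary-target KS test from Sec. 3.2; function and variable names are illustrative, not the tsfresh API:

```python
import numpy as np
from scipy.stats import ks_2samp

def fresh(time_series, labels, feature_mappings, q=0.10):
    """Sketch of FRESH for a binary target and one time series per sample."""
    # Step 1: apply every univariate feature mapping to every time series.
    X = np.array([[theta(s) for theta in feature_mappings]
                  for s in time_series])

    # Step 2: one hypothesis test per feature vector; here the KS test for a
    # binary target and non-binary features (Sec. 3.2).
    y = np.asarray(labels)
    p_values = np.array([ks_2samp(X[y == 0, k], X[y == 1, k]).pvalue
                         for k in range(X.shape[1])])

    # Step 3: Benjamini-Yekutieli step-up procedure at FDR level q, Eq. (6).
    n = len(p_values)
    c = np.sum(1.0 / np.arange(1, n + 1))
    order = np.argsort(p_values)
    below = p_values[order] <= np.arange(1, n + 1) * q / (n * c)
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    keep = np.zeros(n, dtype=bool)
    keep[order[:k]] = True
    return X[:, keep], keep

# Tiny usage example: two classes whose series differ by an offset, plus one
# deliberately uninformative constant-length feature.
rng = np.random.default_rng(0)
series = [rng.normal(size=50) for _ in range(20)] + \
         [rng.normal(loc=5.0, size=50) for _ in range(20)]
labels = [0] * 20 + [1] * 20
X_selected, keep = fresh(series, labels,
                         [np.max, np.mean, lambda s: float(len(s))])
```

Steps 1 and 2 operate on each feature independently, which is where the trivial parallelization discussed in Sec. 5 comes from; only step 3 needs the collected p-values.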

3.5. Variants of FRESH

A problem of filter methods such as the one we utilize in steps 2 and 3 of FRESH is the redundancy in the feature selection. As long as features are considered associated with the target, they will all be selected by the filter, even though many of them are highly correlated to each other [45]. For example, the vectors for median and mean are highly correlated in the absence of outliers in the time series, and therefore both features will have similar p-values. Hence, we expect FRESH to either select or drop both median and mean at the same time. To avoid generating a group of highly correlated features we propose to add another step to FRESH:

∗. Normalize the features and perform a principal component analysis (PCA). Keep the principal components with the highest eigenvalues describing p percent of the variance.

This step reduces the number of features, and the obtained principal components are de-correlated, orthogonal variables [46].

One could perform step ∗ between steps 1 and 2 of FRESH to get rid of the correlations between the created variables early. Then the feature significance tests in step 2 of FRESH will take principal components instead of the original features as input. We denote this variant of FRESH as FRESH_PCAb(efore). Alternatively, one could perform step ∗ after the FRESH algorithm, directly after step 3. This means that the PCA will only process those features which are found relevant by the FRESH algorithm instead of processing all features. This variant of FRESH is called FRESH_PCAa(fter).
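Step ∗ can be sketched with scikit-learn; the feature matrix below is hypothetical, built so that two of its columns are almost perfectly correlated (as mean and median would be):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))

# Hypothetical feature matrix: two almost perfectly correlated columns
# (e.g. mean and median) plus one independent column.
X = np.hstack([base,
               base + rng.normal(scale=0.01, size=(100, 1)),
               rng.normal(size=(100, 1))])

# Step (*): normalize, then keep the principal components explaining
# p = 95% of the variance. Run before step 2 this yields FRESH_PCAb,
# run after step 3 it yields FRESH_PCAa.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_decorrelated = pca.fit_transform(X_scaled)
```

The correlated pair collapses into a single component, so the decorrelated matrix has fewer columns than the original feature matrix.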

Finally, due to the selection of hypothesis tests in Sec. 3.2, FRESH and its variants are only suitable for binary classification or regression problems. However, by selecting suitable tests for step 2 of the algorithm, the approach can be extended to multi-class classification problems.

3.6. Contribution of this work

This work makes contributions to both the field of time series classification and the regression of exogenous variables from time series. The presented feature filtering process is based on already well investigated statistical techniques such as the Kendall rank test; hence, the novelty of the FRESH algorithm does not lie in the individual components but in the selection and combination of suitable univariate hypothesis tests (c.f. Sec. 3.2) with a multiple testing procedure (c.f. Sec. 3.3). This selection was made with Big Data applications in mind, so one goal was to make FRESH highly scalable (e.g. with respect to the length and number of time series or the number of extracted features); in Sec. 5 we will discuss the distributed nature of FRESH and further implications.

4. Evaluation

In the simulations presented in the following, the performance of FRESH, its two variants from Sec. 3.5, and other time series feature extraction methods is compared. We are interested in whether FRESH is able to automatically extract meaningful features and how long it takes to extract such features. Those aspects refer to both the benefit (c.f. Sec. 4.2) and the cost dimension (c.f. Sec. 4.3) of the algorithm.


During evaluation, all feature extraction methods operate on the same feature mappings (c.f. Appendix A); the differences lie only in the used feature selection process. So we extract the same features but let them be selected by different approaches. Further, the same partitions are used during folding, allowing a fair comparison between the different approaches.

4.1. Setup

In the simulations, several feature based approaches and the shape-based classifier DTW_NN [23], a nearest neighbor search under the Dynamic Time Warping distance, are compared. As discussed, we extracted 111 features for every type of time series, which are then filtered by five different feature selection approaches, the first three being the introduced FRESH, FRESH_PCAb and FRESH_PCAa (c.f. Secs. 3.4 and 3.5). Further, the random tree based Boruta feature selection algorithm [15] and a forward selection with a linear discriminant analysis classifier [10], denoted by LDA, are considered. FRESH is parameterized with q = 10% (c.f. Eq. (6)). Its variants, which apply a PCA, keep those principal components that explain p = 95% of the variance (c.f. Sec. 3.5). Lastly, the filtering of the features by Boruta is based on 50 Random Forest ensembles, each containing 10 decision trees.

The different extraction methods and DTW_NN were picked for the following reasons: DTW_NN is reported to reach the highest accuracy rates among time series classifiers [47], LDA was the first proposed algorithm to automatically extract features from time series [10], and Boruta is a promising feature selection algorithm that incorporates interactions between features, in contrast to FRESH and its variants, which evaluate features individually.

The UCR time series archive [13] is a widespread benchmark environment for the time series classification community. We picked the 31 time series datasets from the archive that state a binary classification problem. Also, in order to compare the runtime of the different methods, synthetic time series of flexible length and sample number belonging to two classes are generated by simulating the stochastic dynamics of a dissipative soliton [14, p. 164].

The third and last data source is from the production of steel billets, extracted during the German research project iPRODICT. This project demonstrates a typical application of industrial time series analysis, aiming to predict the passing or failing of product specification tests based on temporally annotated data. With such forecasts, the steel producer can adapt its business processes during or directly after the production. Without them, it has to wait for the results of the specification tests, for which the steel billets need to cool down, a process that can take several hours or days [48]. Therefore, the predictions will enable faster business process adaptions and save costs by increasing the agility and efficiency of the production. This dataset has 26 univariate meta-variables forming the baseline feature set, extended by 20 different sensor time series having up to 44 data points for each sample. The dataset contains 5000 samples of the two classes "broken" and "not broken".


4.2. The benefit - achieved accuracy

For the datasets from both the UCR time series repository and the iPRODICT project, the underlying structure and therefore the relevant features are unknown. We cannot compare the different methods on their ability to extract meaningful features, because we do not know which features are meaningful and which are not. Also, we cannot compare the extracted features to shape-based classifiers such as DTW_NN, which do not extract features. Therefore, we evaluate the performance of the feature extraction algorithms by comparing the performance of a classification algorithm on the extracted features. Hereby, we assume that more meaningful features will result in a better classification result.

So, we need to select classification algorithms that can be trained on the (filtered) feature matrices. In [49], Fernández-Delgado et al. evaluate a range of 179 classifiers from 17 families on 121 different datasets representing a wide range of different applications. In their evaluation, the Random Forest Classifiers formed the best performing group. Hence, we will also evaluate our algorithms on a Random Forest Classifier, denoted by rfc. Further, we consider an AdaBoost Classifier [50], denoted by ada, because boosting algorithms generally perform well on a wide range of problems [49, 51, 52] and AdaBoost is considered one of the best “out-of-the-box classifiers” [53]. The hyperparameters of those methods are not optimized, in order to get an unbiased view on the meaningfulness of the extracted features; instead, the default values from the python package scikit-learn version 0.18.1 were used [17].
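This setup can be sketched as follows; the feature matrix X and labels y below are hypothetical stand-ins for an extracted feature matrix, and both classifiers are instantiated with scikit-learn defaults only:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Hypothetical stand-in for an extracted feature matrix: 100 samples, 5 features.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (X[:, 0] > 0).astype(int)  # binary target driven by the first feature

# Default hyperparameters only, mirroring the unbiased evaluation setup.
rfc = RandomForestClassifier(random_state=0).fit(X, y)
ada = AdaBoostClassifier(random_state=0).fit(X, y)

print(rfc.score(X, y), ada.score(X, y))  # training accuracies
```

Keeping the defaults means any accuracy difference between pipelines reflects the quality of the extracted features rather than tuning effort.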

Combining the five feature selection algorithms with the two classifiers results in 10 different feature based approaches. We append the name of the classifier as postfix to the approach name (e.g. LDA_ada denotes a filtering of the features by the LDA algorithm with subsequent classification by ada). This gives us the pipelines FRESH_rfc, FRESH_ada, Boruta_rfc, Boruta_ada, LDA_rfc, LDA_ada, FRESH_PCAa_rfc, FRESH_PCAa_ada, FRESH_PCAb_rfc and FRESH_PCAb_ada. Further, we apply the two considered classifiers on the unfiltered feature matrix, then denoted by rfc and ada. Finally, we included trivial, a benchmark algorithm that disregards any relationship between time series and class label by always predicting the majority class of the training dataset. Algorithms that learn an informative mapping between time series and exogenous target variables should outperform this simple benchmark.

For every dataset, all available samples are used to perform the feature extraction itself. Then, the feature selection steps and classifiers are trained using a 10 fold cross validation scheme. The average accuracies of this evaluation are shown as a heatmap in Fig. 3. The value in a single cell is the accuracy for a single approach-dataset combination, averaged over the 10 folds. Furthermore, Fig. 4 reports the standard deviation of the accuracy metric over the 10 folds.
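The evaluation scheme can be sketched as below; the feature matrix is again a hypothetical stand-in, and StratifiedKFold with a fixed seed mimics giving every approach access to the same folds:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical extracted feature matrix and binary labels.
rng = np.random.RandomState(42)
X = rng.randn(120, 8)
y = (X[:, 0] + 0.5 * rng.randn(120) > 0).astype(int)

# 10 fold cross validation; the mean and standard deviation of the accuracy
# correspond to one cell of Fig. 3 and Fig. 4, respectively.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="accuracy")
mean_acc, std_acc = scores.mean(), scores.std()
```

Fixing the fold generator's random state is what makes the per-cell comparison across approaches fair.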

The best performing algorithms were ada and DTW_NN, together achieving the highest mean accuracy on 21 out of the 32 datasets, with all 21 originating from the UCR time series archive. In addition, both approaches showed similar variance in the results: comparing the standard deviations reported in Fig. 4, DTW_NN has a higher standard deviation of the achieved accuracy than ada on 9 datasets, while vice versa the standard deviation of ada is higher on 10 datasets. This shows that ada and DTW_NN were comparable with respect to the achieved accuracy.

On the dataset from the iPRODICT research project, the Boruta based approaches Boruta_ada and Boruta_rfc showed the highest mean accuracies of 72%, with plain AdaBoost ada coming close behind with a mean accuracy of 71%. The DTW_NN algorithm was only trained on 1 of the 20 sensor time series of the iPRODICT data and did not use any univariate variables (we trained DTW_NN for all 20 time series and then picked the time series with the best results), which explains its low mean accuracy of 57%.

Regarding the trivial benchmark, only DTW_NN, LDA_ada, ada, FRESH_ada, FRESH_PCAa_ada and FRESH_PCAb_ada were able to beat it on every dataset. Every Random Forest based approach reported a mean accuracy that was lower than the trivial one on at least one dataset. For example, on the Wine dataset Boruta_rfc, FRESH_rfc and FRESH_PCAb_rfc performed worse than the trivial benchmark. It seems that the Random Forest approach is not as robust as AdaBoost.

Overall, the reported accuracies and the good performance of ada show that feature based approaches are competitive with shape-based approaches such as DTW_NN with respect to prediction accuracy. Our simulations show that an out-of-the-box classifier without any hyperparameter optimization, given a generic set of time series features (cf. Appendix A), is able to be on par with a state-of-the-art shape-based approach.

Further, we were interested in the change of accuracy when deploying different feature selection techniques. For ada, the filtering by FRESH, FRESH_PCAa, FRESH_PCAb, Boruta and LDA improved or did not change the mean accuracy on 19, 12, 7, 2 and 5 out of 32 datasets, respectively. Regarding the second classifier rfc, this happened on 22, 15, 17, 13 and 20 out of the 32 datasets. The FRESH filter even increased the mean accuracy of rfc on 14 out of 32, so on nearly half of the datasets. This shows that the selection of features by FRESH had the best chance among the considered feature selection techniques of not worsening the classifier's accuracy. However, this also shows that for some datasets the feature filtering removed important information by dropping relevant features (e.g. for rfc, the mean accuracy decreased on 13 out of 32 datasets when filtering the features by FRESH).

As always, there is no best algorithm for all kinds of applications. As expected, the DTW_NN approach had one of the highest accuracies in our evaluation. However, we were surprised to see that the ada approach without any filtering of the features reached such a competitive accuracy. Regarding the feature filtering, we showed that FRESH did not worsen the accuracy of the final Random Forest Classifier on two thirds of the datasets while reducing the number of considered features. In comparison to Boruta and LDA, it was most often able to improve the performance of the Random Forest Classifiers. For AdaBoost it seems that, for a low number of features (111 on the UCR datasets and 2220 on the iPRODICT data) and estimators (the scikit-learn default is 50 decision trees in the ensemble), the feature filtering is not necessary, as the algorithm does not yet tend to overfit.

4.3. The costs - needed Runtime

The runtime of an algorithm highly depends on the efficiency of the implementation and the characteristics of the used hardware. By comparing your own algorithm against a slightly more inefficient implementation, you can skew the picture in your favor. Therefore, we decided to only inspect asymptotic runtimes and not to discuss absolute runtimes in the evaluation.

We are interested in the feature extraction methods' ability to scale with an increasing number of feature mappings, time series lengths and device numbers. As expected, Fig. 5 shows that all considered feature extraction methods – in contrast to DTW_NN – scale linearly with an increasing length of the time series or an increasing number of samples. This is due to the considered feature mappings having a linear runtime with respect to the length of the time series.

However, Fig. 6 shows that, among the feature based approaches, only FRESH and FRESH_PCAa scale linearly with an increasing number of features (e.g. due to more devices, feature mappings or types of time series). This makes FRESH and FRESH_PCAa suitable for filtering huge amounts of time series features in Big Data applications.

4.4. Selected features

We were interested in how many features the final classification algorithm was based on. Both ada and rfc are able to select features for the splits inside their decision trees, which means that they can internally select features. Fig. 7 contains boxplots of the number of final features that were picked by the two classification algorithms.

We can observe that all feature selection methods, i.e. Boruta, LDA and the FRESH variants, were able to reduce the number of selected features for both final classifiers rfc and ada. The filtering by LDA resulted in the lowest number of relevant features. This is due to LDA selecting features in a stepwise manner: the algorithm starts with no features and only adds features if they improve the linear classification rate. This greedy, forward selection process can run into local minima and will often result in a small set of considered features.

As expected, the filtering by both FRESH_PCAa and FRESH_PCAb results in a lower number of relevant features in the final classifiers than FRESH alone. This is due to the PCA working as a second filter step. Out of the three variants, FRESH_PCAb was able to reduce the number of features in the final classification algorithm the most.

4.5. Resume

We proposed FRESH as a highly scalable feature extraction algorithm. Our simulations showed that, in contrast to the other considered methods, FRESH is able to scale with the number of feature mappings and samples as well as with the


Figure 3: Average value of the accuracy: Accuracy of the different feature extraction methods and DTW_NN on the 31 two-class datasets from the UCR time series archive as well as the data from the iPRODICT research project. The reported accuracy was averaged over a 10 fold cross validation, where every approach had access to the same folds to ensure a fair comparison.


Figure 4: Standard deviation of the accuracy: This heatmap shows the standard deviation of the accuracy metric from Fig. 3 for different combinations of approaches and datasets during a 10 fold cross validation.

16

[Figure 5 plot area: pipeline runtime in minutes for FRESH, FRESH_PCAa, FRESH_PCAb, Boruta, DTW_NN, Full_X and LDA; panel (a) varies the length of the individual time series with the number of samples fixed to 1000, panel (b) varies the number of samples with the length of the time series fixed to 1000.]

Figure 5: Average pipeline runtime of time series classification concerning the nonlinear dynamics of a dissipative soliton [14, p. 164]. The reported durations are the summed runtimes of feature extraction, feature filtering and predicting in the case of the feature based approaches; for DTW_NN it just denotes the fitting and predicting. For the feature based approaches, the fitting and predicting runtimes of a group of five classifiers have been averaged: a neural network/perceptron, a logistic regression model, a Support Vector Machine, a Random Forest Classifier and an AdaBoost Classifier. Full_X denotes the pipeline without any feature filtering. The curves of all methods except DTW_NN lie on top of each other. We can observe that all feature based approaches scale linearly with the number of samples and length of time series, in contrast to DTW_NN.

[Figure 6 plot area: average feature extraction runtime in minutes vs. number of considered feature mappings (0 to 10,000); panel (a) shows all methods (FRESH, FRESH_PCAa, FRESH_PCAb, LDA, Boruta), panel (b) shows only FRESH and its variants.]

Figure 6: Average feature extraction runtime during ten feature selection runs for 10,000 samples and a variable number of feature mappings (the curves of FRESH and FRESH_PCAa are overlapping). We can observe that only FRESH and its variants scale linearly with the number of considered features. The red curve of FRESH_PCAb is due to the PCA being calculated on all features, while for FRESH_PCAa the filtered feature matrix is used.


Figure 7: Relevant features in the final classifier: This boxplot shows the number of features that were used by the final AdaBoost ada and Random Forest rfc classifiers. Every data point corresponds to the result of one of the 10 folds for each of the 32 datasets for one of the two classifiers.

amount of different types and lengths of the time series. While doing so, it is extracting meaningful features, as demonstrated by competitive accuracies.

The most compelling feature selection method was FRESH without any PCA step. When comparing the change in accuracy to the base classifier, the accuracy of a Random Forest Classifier increased on 14 out of 32 datasets and stayed the same on 8. Also, for the AdaBoost Classifier, the FRESH filtering most often did not worsen the performance with respect to accuracy.

The relatively bad performance of FRESH_PCAb seems to originate in the PCA step selecting features only based on their ability to explain the variance of the input variables and not based on their significance for predicting the target variable. Thereby, information relevant for the classification or regression task can get lost. Also, FRESH_PCAb was more effective for the Random Forest Classifier than for the AdaBoost approach.

On the other hand, the combination of FRESH with a subsequent PCA filtering to reduce the number of redundant and highly correlated features, denoted as FRESH_PCAa, showed similar performance for both classifiers; its reduced number of features resulted in a lower accuracy on 20 and 17 datasets, respectively.

5. Discussion

5.1. FRESH assists the acquisition of domain knowledge

It is common knowledge that the quality of feature engineering is a crucial success factor for supervised machine learning in general [54, p. 82] and for time series analysis in particular [55]. But comprehensive domain knowledge is needed in order to perform high quality feature engineering. Contrarily, it is quite common for machine learning projects that data scientists start with limited domain knowledge and improve their process understanding while continuously discussing their models with domain experts. This is basically the reason why dedicated time series models are very hard to build from scratch.

Our experience with data science projects in the context of IoT and Industry 4.0 applications [11] showed that it is very important to identify relevant time series features in an early stage of the project in order to engineer more specialized features in discussions with domain experts. The FRESH algorithm supports this approach by applying a huge variety of established time series feature mappings to different types of time series and meta-information simultaneously and by identifying relevant features in a robust manner.

We observe that features extracted by FRESH contribute to a deeper understanding of the investigated problem, because each feature is intrinsically related to a distinct property of the investigated system and its dynamics. This fosters the interpretation of the extracted features by domain experts and allows for the engineering of more complex, domain specific features [11], including dedicated time series models, such that their predictions in return might become a future feature mapping for FRESH.

5.2. FRESH is operational

We have already mentioned that FRESH has been developed in the course of IoT and Industry 4.0 projects [11]. Especially for predictive maintenance applications with limited numbers of samples and high levels of noise in e.g. sensor readings, it has proven crucial to filter irrelevant features in order to prevent overfitting. To ensure a robust and scalable filtering, we consider each feature importance individually. This has several implications:

• FRESH is robust in the sense of classical statistics, because the hypothesis tests and the Benjamini-Yekutieli procedure do not make any assumptions about the probability distribution or dependence structure between the features. Here, robustness refers to the insensitivity of the estimator to outliers or violations of underlying assumptions [56].

• FRESH does not consider the meaningfulness of interactions between features by design. Hence, in its discussed form it will not find meaningful feature combinations such as chessboard variables [57, Fig. 3a]. However, in our evaluation process the feature selection algorithm Boruta, which considers feature interactions, was not able to beat the performance of FRESH. Further, it is possible for FRESH to incorporate combinations of features and pre-defined interactions as new features themselves.

• FRESH is scalable due to the parallelity of the feature calculation and hypothesis tests (see the two topmost tiers in Fig. 1) and can be trivially parallelized and distributed over several computational units. Because we only deploy stateless features, i.e. the calculation of each feature for each device does not depend on the other features, it is trivial to parallelize the task of feature calculation both horizontally (over different features) and vertically (over different entities). In addition, the feature filter process has low computational costs compared to feature calculation and significance testing. Therefore, FRESH scales linearly with the number of extracted features, the length of the time series, and the number of considered time series, making it a perfect fit for (industrial) Big Data applications.

• A side effect of ensuring robustness and testing features individually is that FRESH tends to extract highly correlated features, which could result in poor classification performance. We propose to combine FRESH with a subsequent PCA, which has been discussed as FRESH_PCAa in Sec. 3.5 and indeed improved the performance significantly.

5.3. Feature selection of FRESH

Nilsson et al. [58] proposed to divide feature selection into two flavors: The minimal optimal problem is finding a set consisting of all strongly relevant attributes and a subset of weakly relevant attributes such that all remaining weakly relevant attributes contain only redundant information. The all-relevant problem is finding all strongly and weakly relevant attributes. The first problem is much harder than the second, even asymptotically intractable for strictly positive distributions [58]. Accordingly, FRESH solves the second, easier problem, as we extract every relevant feature, even though it might be a duplicate of or highly correlated to another relevant feature [15].

Yu and Liu [59] separated feature selection algorithms into two categories, the wrapper model and the filter model. While the selection of wrapper models is based on the performance of a learning algorithm on the selected set of features, filter models use general characteristics to derive a decision about which features to keep. Filter models are further divided into feature weighting algorithms, which evaluate the goodness of features individually, and subset search algorithms, which inspect subsets. According to this definition, the feature selection part of FRESH is a filter model, more precisely a feature weighting algorithm, with the weights being the p-values assigned to the corresponding features.
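The distinction can be illustrated with scikit-learn (a sketch, not part of FRESH itself): a univariate filter scores each feature on its own, while a wrapper such as recursive feature elimination queries a learning algorithm:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           random_state=0)

# Filter model / feature weighting: each feature is scored individually
# (here via the ANOVA F-test; FRESH uses hypothesis-test p-values in this role).
filter_selector = SelectKBest(f_classif, k=5).fit(X, y)

# Wrapper model: the selection is driven by a learning algorithm's performance.
wrapper_selector = RFE(LogisticRegression(max_iter=1000),
                       n_features_to_select=5).fit(X, y)
```

The filter needs only one pass over the features, which is what makes FRESH's p-value based weighting cheap and parallelizable compared to wrapper approaches.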

FRESH contains a feature selection part on the basis of hypothesis tests and the Benjamini-Yekutieli procedure, which of course can be used as a feature selection algorithm on features derived from manifold structured data such as spectra, images, videos and so on. But, due to its systematic incorporation of scalable time series feature mappings and the proposed decomposition into computing tiers (Fig. 1), it is especially applicable to the needs of mass time series feature extraction and is considered a time series feature extraction algorithm.

For a given set of attributes, feature selection techniques decide which attributes to delete and which to keep. While doing so, a possible evaluation metric is the expected ratio of deleted relevant features to all deleted features. This ratio is the False Deletion Rate (FDR), formally defined by

$$\mathrm{FDR} = E\left[\frac{\text{number of relevant but deleted features}}{\text{number of all deleted features}}\right].$$

In contrast, FRESH controls the FER as defined in Eq. (7). A control of the FDR by FRESH would require testing the following hypotheses:

$$H_0^{\phi} = \{X_\phi \text{ and } Y \text{ are dependent}\}, \qquad H_1^{\phi} = \{X_\phi \text{ and } Y \text{ are independent}\}. \quad (8)$$

Due to the topology of the hypotheses in Eq. (8), such a hypothesis test is statistically not feasible. This means that FRESH cannot be adapted to control the FDR. Hence, the feature filtering of FRESH is not suitable for feature selection jobs where the FDR has to be controlled.

Finally, by applying a multiple testing algorithm, FRESH avoids the “look-elsewhere effect” [60], i.e. a statistically significant observation arising by chance due to the high number of tested hypotheses. This effect triggered recent discussions about the use of p-values in scientific publications [61].
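For illustration, the Benjamini-Yekutieli step-up procedure itself fits in a few lines of numpy (a sketch: the threshold for the k-th smallest p-value is k·q/(m·c(m)), with c(m) the m-th harmonic number):

```python
import numpy as np

def benjamini_yekutieli(p_values, q=0.05):
    """Return a boolean mask of rejected hypotheses under BY at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    harmonic = np.sum(1.0 / np.arange(1, m + 1))  # c(m), valid under dependency
    thresholds = np.arange(1, m + 1) * q / (m * harmonic)
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest k with p_(k) <= threshold
        reject[order[:k + 1]] = True      # step-up: reject all smaller p-values
    return reject

# Only the smallest p-value survives the conservative BY correction here.
print(benjamini_yekutieli([0.001, 0.2, 0.9], q=0.05))
```

The harmonic-number factor is exactly what makes the procedure valid under arbitrary dependence between the feature p-values, at the price of being more conservative than Benjamini-Hochberg.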

5.4. Related work

There are both structural and statistical approaches to extract patterns from time series. Many statistical approaches rely on structures that allow the usage of genetic algorithms. They express the feature pattern, for example, as a tree [29, 62, 63]. While doing so, they aim for the best pattern and the most explaining features by alternating and optimizing the used feature mappings. In contrast, FRESH extracts the best fitting of a fixed set of patterns.

As an example of structured pattern extraction, Olszewski detects six morphology types [64]: constant, straight, exponential, sinusoidal, triangular, and rectangular phases. Those phases are detected by structure detectors, which then output a new time series whose values stand for the identified structure. Based on this structure, a domain-independent structural pattern recognition system is utilized to substitute the original time series signal by a known pattern. Due to its fixed patterns, FRESH can be considered a structured pattern extractor.

Of course, there are other promising approaches like the combination of nearest neighbor search with Dynamic Time Warping [23], which is specialized in considering an ensemble of exactly one dedicated time series type and cannot take meta-information into account. For binary classifications it scales with $O(n_t^2 \cdot m_{\text{train}} \cdot m_{\text{test}})$ [65], with $m_{\text{train}}$ and $m_{\text{test}}$ being the number of devices in the train and test set, respectively. This approach also has the disadvantage that all data have to be transmitted to a central computing instance.
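The quadratic factor per pair of series stems from the dynamic program at the heart of DTW; a minimal sketch with absolute-difference local cost:

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic programming DTW with |x - y| cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

print(dtw_distance([1, 2, 3], [2, 3, 4]))  # small cost despite the offset
```

A nearest neighbor classifier evaluates this quadratic-cost alignment for every train/test pair, which explains the overall scaling stated above.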

The extraction algorithm most similar to ours is presented by Fulcher and Jones [10]. It applies a linear estimator with greedy search and a constant initial model to identify the most important features, which has been considered in this paper as LDA. The evaluation has shown that FRESH outperforms the approach of Fulcher and Jones [10]. Also, FRESH provides a more general approach to time series feature extraction, because it is able to extract features for regression tasks and not only for classification.

6. Summary and future work

In this work, FeatuRe Extraction based on Scalable Hypothesis tests (FRESH) for time series classification and regression is introduced. It combines well established feature extraction methods with a scalable feature selection based on non-parametric hypothesis tests and the Benjamini-Yekutieli procedure.

FRESH is highly parallel and suitable for distributed IoT and Industry 4.0 applications like predictive maintenance or process line optimization, because it allows to consider several different time series types per label and additionally takes meta-information into account. The latter has been demonstrated on the basis of a steel billet process line optimization in project iPRODICT [48].

Our evaluation on the UCR time series classification tasks has shown that FRESH is able to filter features such that the performance of an AdaBoost or a Random Forest classifier is not worsened on the majority of datasets. Interestingly, an AdaBoost Classifier without any filtering of features proved to reach the highest accuracies among all feature based approaches; it was even able to beat a shape-based nearest neighbor search under a Dynamic Time Warping distance metric.

The parallel nature of FRESH with respect to both feature extraction and filtering makes it highly applicable in situations where data is fragmented over a widespread infrastructure and computations cannot be performed on centralized infrastructure. Due to its robustness and applicability to machine learning problems in the context of IoT and Industry 4.0, we are expecting that FRESH will find widespread application.

FRESH has been developed to extract meaningful features for classification and regression tasks. Therefore, it can easily be combined with domain specific and possibly stateful feature mappings from more specialized machine learning algorithms like e.g. Hubness-aware classifiers [66]. The considered datasets in this work only contained binary classification variables. We plan to investigate the performance of FRESH for multi-class or regression targets in the future.

Also, we are planning to investigate other hypothesis tests to assess feature significance. For example, a substitute for the Kolmogorov-Smirnov test is the Mann-Whitney U test, which checks if the medians of two variables A and B differ [67]. While the Kolmogorov-Smirnov test is sensitive to any differences between the distributions with regard to shape, spread or median, the Mann-Whitney U test is mostly sensitive to changes in the median. We will investigate how such different hypothesis tests affect the performance of FRESH.
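The contrast between the two tests can be sketched with scipy on synthetic feature values that differ in spread but share the median (the data and class labels are hypothetical):

```python
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu

# Hypothetical feature values for two classes: same median (zero),
# but class B has three times the spread of class A.
rng = np.random.RandomState(0)
feature_class_a = rng.normal(loc=0.0, scale=1.0, size=500)
feature_class_b = rng.normal(loc=0.0, scale=3.0, size=500)

# KS reacts to any distributional difference (shape, spread, median) ...
ks_p = ks_2samp(feature_class_a, feature_class_b).pvalue
# ... while the Mann-Whitney U test is mostly sensitive to median shifts.
mwu_p = mannwhitneyu(feature_class_a, feature_class_b,
                     alternative="two-sided").pvalue
print(f"KS p-value: {ks_p:.2e}, Mann-Whitney U p-value: {mwu_p:.2f}")
```

On such data the KS test flags the feature as relevant while the Mann-Whitney U test typically does not, so the choice of test determines which kinds of distributional differences FRESH can detect.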

As the only part of FRESH that depends on the structure of the data is the set of used feature mappings, one could easily inspect the performance of the algorithm in other domains such as image or video processing. We are curious to see further research in this area.


Acknowledgement

The authors would like to thank Frank Kienle and Ulrich Kerzel for fruitful discussions as well as Nils Braun and Julius Neuffer for their most valuable contributions during the implementation of the tsfresh package. This research was funded in part by the German Federal Ministry of Education and Research under grant number 01IS14004 (project iPRODICT).

References

[1] D. Helbing, D. Brockmann, T. Chadefaux, K. Donnay, U. Blanke,O. Woolley-Meza, M. Moussaid, A. Johansson, J. Krause, S. Schutte,M. Perc, Saving Human Lives: What Complexity Science and Informa-tion Systems can Contribute, Journal of Statistical Physics 158 (3) (2015)735–781, doi:10.1007/s10955-014-1024-9.

[2] Z. Wang, C. T. Bauch, S. Bhattacharyya, A. d’Onofrio, P. Manfredi,M. Perc, N. Perra, M. Salathé, D. Zhao, Statistical physics of vaccination,Physics Reports 664 (2016) 1 – 113, statistical physics of vaccination.

[3] F. S. Collins, H. Varmus, A New Initiative on Precision Medicine,New England Journal of Medicine 372 (9) (2015) 793–795, doi:10.1056/NEJMp1500523.

[4] J. Gubbi, R. Buyya, S. Marusic, M. Palaniswami, Internet of Things (IoT):A Vision, Architectural Elements, and Future Directions, Future Genera-tion Computer Systems 29 (7) (2013) 1645–1660.

[5] M. Hermann, T. Pentek, B. Otto, Design Principles for Industrie 4.0Scenarios, in: 49th Hawaii International Conference on System Sciences(HICSS), 3928–3937, 2016.

[6] R. K. Mobley, An Introduction to Predictive Maintenance, Elsevier Inc., 2edn., 2002.

[7] V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, Feature Selec-tion for High-Dimensional Data, Artificial Intelligence: Foundations, The-ory, and Algorithms, Springer International Publishing, 2015.

[8] H. Liu, R. Setiono, Some Issues on Scalable Feature Selection, Expert Sys-tems with Applications 15 (3–4) (1998) 333 – 339.

[9] A. Kusiak, W. Li, The prediction and diagnosis of wind turbine faults,Renewable Energy 36 (1) (2011) 16–23.

[10] B. D. Fulcher, N. S. Jones, Highly Comparative Feature-Based Time-SeriesClassification, Knowledge and Data Engineering, IEEE Transactions on26 (12) (2014) 3026–3037.

23

[11] M. Christ, F. Kienle, A. W. Kempa-Liehr, Time Series Analysis in In-dustrial Applications, in: Workshop on Extreme Value and Time SeriesAnalysis, KIT Karlsruhe, doi:10.13140/RG.2.1.3130.7922, 2016.

[12] Y. Benjamini, D. Yekutieli, The Control of the False Discovery Rate inMultiple Testing under Dependency, Annals of statistics (2001) 1165–1188.

[13] Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, G. Batista,The UCR Time Series Classification Archive, www.cs.ucr.edu/~eamonn/time_series_data/, 2015.

[14] A. W. Liehr, Dissipative Solitons in Reaction Diffusion Systems, vol. 70 ofSpringer Series in Synergetics, Springer, 2013.

[15] M. B. Kursa, W. R. Rudnicki, The all relevant feature selection using ran-dom forest, arXiv preprint arXiv:1106.5112 .

[16] X. Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuermann, E. Keogh,Experimental Comparison of Representation Methods and Distance Mea-sures for Time Series Data, Data Mining and Knowledge Discovery 26 (2)(2013) 275–309.

[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Pas-sos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn:Machine Learning in Python, Journal of Machine Learning Research 12(2011) 2825–2830.

[18] S. Van Der Walt, S. C. Colbert, G. Varoquaux, The NumPy Array: AStructure for Efficient Numerical Computation, Computing in Science &Engineering 13 (2) (2011) 22–30.

[19] W. McKinney, Data Structures for Statistical Computing in Python, in:S. van der Walt, J. Millman (Eds.), Proceedings of the 9th Python inScience Conference, 51 – 56, 2010.

[20] R. Elmasri, J. Y. Lee, Implementation Options for Time-Series Data, in:O. Etzion, S. Jajodia, S. Sripada (Eds.), Temporal Databases: Researchand Practice, Springer Berlin Heidelberg, 115–128, 1998.

[21] A. Bagnall, J. Lines, A. Bostrom, J. Large, E. Keogh, The great timeseries classification bake off: a review and experimental evaluation of recentalgorithmic advances, Data Mining and Knowledge Discovery (2016) 1–55.

[22] M. Müller, Dynamic Time Warping, Information retrieval for music andmotion .

[23] X. Wang, K. Smith, R. Hyndman, Characteristic-Based Clustering for TimeSeries Data, Data Mining and Knowledge Discovery 13 (3) (2006-09-25)335–364.

24

[24] M. Hüsken, P. Stagge, Recurrent neural networks for time series classifica-tion, Neurocomputing 50.

[25] H. Lee, P. Pham, Y. Largman, A. Y. Ng, Unsupervised feature learning foraudio classification using convolutional deep belief networks, in: Advancesin neural information processing systems, 2009.

[26] J. B. Yang, M. N. Nguyen, P. P. San, X. L. Li, S. Krishnaswamy, DeepConvolutional Neural Networks on Multichannel Time Series for HumanActivity Recognition, in: Proceedings of the 24th International Joint Con-ference on Artificial Intelligence (IJCAI), Buenos Aires, Argentina, 2015.

[27] A. Nanopoulos, R. Alcock, Y. Manolopoulos, Feature-based classification oftime-series data, in: Information processing and technology, Nova SciencePublishers, Inc., 49–61, 2001.

[28] P. Geurts, Pattern extraction for time series classification, in: Euro-pean Conference on Principles of Data Mining and Knowledge Discovery,Springer, 115–127, 2001.

[29] I. Mierswa, K. Morik, Automatic Feature Extraction for Classifying AudioData, Machine learning 58 (2-3) (2005) 127–149.

[30] G. G. Yen, K. C. Lin, Wavelet Packet Feature Extraction for VibrationMonitoring, IEEE Transactions on Industrial Electronics 47 (3) (2000-06)650–667.

[31] N. Gebraeel, M. Lawley, R. Liu, V. Parmeshwaran, Residual life predic-tions from vibration-based degradation signals: a neural network approach,IEEE Transactions on industrial electronics 51 (3) (2004) 694–700.

[32] S. Barmada, M. Raugi, M. Tucci, F. Romano, Arc detection in pantograph-catenary systems by the use of support vector machines-based classification,IET Electrical Systems in Transportation 4 (2) (2014) 45–52.

[33] I. Nun, P. Protopapas, B. Sim, M. Zhu, R. Dave, N. Castro, K. Pichara,FATS: Feature Analysis for Time Series, arXiv preprint arXiv:1506.00010 .

[34] B. D. Fulcher, N. S. Jones, Highly comparative feature-based time-series classification, IEEE Transactions on Knowledge and Data Engineering 26 (12) (2014) 3026–3037.

[35] Z. Xing, J. Pei, E. Keogh, A brief survey on sequence classification, ACM SIGKDD Explorations Newsletter 12 (1) (2010) 40–48.

[36] A. Bagnall, J. Lines, J. Hills, A. Bostrom, Time-Series Classification with COTE: The Collective of Transformation-Based Ensembles, IEEE Transactions on Knowledge and Data Engineering 27 (9).


[37] P. Radivojac, Z. Obradovic, A. K. Dunker, S. Vucetic, Feature Selection Filters Based on the Permutation Test, in: Machine Learning: ECML 2004, Springer, 334–346, 2004.

[38] V. K. Rohatgi, A. K. M. Saleh, An introduction to probability and statistics, Wiley, Hoboken, New Jersey, 2015.

[39] R. A. Fisher, On the Interpretation of χ² from Contingency Tables, and the Calculation of P, Journal of the Royal Statistical Society 85 (1) (1922) 87.

[40] F. J. Massey, The Kolmogorov-Smirnov Test for Goodness of Fit, Journal of the American Statistical Association 46 (253) (1951) 68.

[41] M. G. Kendall, A New Measure of Rank Correlation, Biometrika 30 (1/2) (1938) 81–93.

[42] L. M. Adler, A Modification of Kendall's Tau for the Case of Arbitrary Ties in Both Rankings, Journal of the American Statistical Association 52 (277) (1957) 33–35.

[43] D. Curran-Everett, Multiple Comparisons: Philosophies and Illustrations, American Journal of Physiology - Regulatory, Integrative and Comparative Physiology 279 (1) (2000) R1–R8.

[44] Y. Benjamini, Y. Hochberg, Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing, Journal of the Royal Statistical Society. Series B (Methodological) (1995) 289–300.

[45] K. Kira, L. A. Rendell, A Practical Approach to Feature Selection, in: Proceedings of the Ninth International Workshop on Machine Learning, ML92, Morgan Kaufmann Publishers Inc., 249–256, 1992.

[46] H. Hotelling, Analysis of a Complex of Statistical Variables into Principal Components, Journal of Educational Psychology 24 (6) (1933) 417.

[47] Z. Geler, V. Kurbalija, M. Radovanović, M. Ivanović, Comparison of different weighting schemes for the kNN classifier on time-series data, Knowledge and Information Systems 48 (2) (2016) 331–378.

[48] M. Christ, J. Krumeich, A. W. Kempa-Liehr, Integrating Predictive Analytics into Complex Event Processing by Using Conditional Density Estimations, in: IEEE 20th International Enterprise Distributed Object Computing Workshop (EDOCW), IEEE Computer Society, Los Alamitos, CA, USA, 1–8, doi:10.1109/EDOCW.2016.7584363, 2016.

[49] M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?, J. Mach. Learn. Res. 15 (1) (2014) 3133–3181.


[50] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, in: European Conference on Computational Learning Theory, Springer, 23–37, 1995.

[51] E. Alfaro, N. García, M. Gámez, D. Elizondo, Bankruptcy forecasting: An empirical comparison of AdaBoost and neural networks, Decision Support Systems 45 (1) (2008) 110–122.

[52] S. Jones, D. Johnstone, R. Wilson, An empirical evaluation of the performance of binary classifiers in the prediction of credit ratings changes, Journal of Banking & Finance 56 (2015) 72–85.

[53] B. Kégl, The return of AdaBoost.MH: multi-class Hamming trees, arXiv preprint arXiv:1312.6086.

[54] P. Domingos, A Few Useful Things to Know about Machine Learning, Communications of the ACM 55 (10) (2012) 78–87.

[55] J. Timmer, C. Gantert, G. Deuschl, J. Honerkamp, Characteristics of Hand Tremor Time Series, Biological Cybernetics 70 (1) (1993) 75–80.

[56] J. M. M. John, F. Van Lishout, E. S. Gusareva, K. Van Steen, A Robustness Study of Parametric and Non-Parametric Tests in Model-Based Multifactor Dimensionality Reduction for Epistasis Detection, BioData Mining 6 (1) (2013) 1.

[57] I. Guyon, A. Elisseeff, An Introduction to Variable and Feature Selection, The Journal of Machine Learning Research 3 (2003) 1157–1182.

[58] R. Nilsson, J. M. Peña, J. Björkegren, J. Tegnér, Consistent feature selection for pattern recognition in polynomial time, The Journal of Machine Learning Research 8 (2007) 589–612.

[59] L. Yu, H. Liu, Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution, in: ICML, vol. 3, 856–863, 2003.

[60] E. Gross, O. Vitells, Trial Factors for the Look Elsewhere Effect in High Energy Physics, The European Physical Journal C 70 (1-2) (2010) 525–530.

[61] R. L. Wasserstein, N. A. Lazar, The ASA's Statement on P-Values: Context, Process, and Purpose, The American Statistician.

[62] P. Geurts, Pattern Extraction for Time Series Classification, in: Principles of Data Mining and Knowledge Discovery, Springer, 115–127, 2001.

[63] D. R. Eads, D. Hill, S. Davis, S. J. Perkins, J. Ma, R. B. Porter, J. P. Theiler, Genetic Algorithms and Support Vector Machines for Time Series Classification, in: B. Bosacchi, D. B. Fogel, J. C. Bezdek (Eds.), International Symposium on Optical Science and Technology, 74–85, 2002.


[64] R. T. Olszewski, Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data, DTIC Document CMU-CS-01-108, Carnegie Mellon University, 2001.

[65] L. Penserini, et al., A Comparison of Two Machine-Learning Techniques to Focus the Diagnosis Task, Frontiers in Artificial Intelligence and Applications (2006) 265.

[66] N. Tomašev, K. Buza, K. Marussy, P. B. Kis, Hubness-aware classification, instance selection and feature construction: Survey and extensions to time-series, in: Feature Selection for Data and Pattern Recognition, Springer, 231–262, 2015.

[67] H. B. Mann, D. R. Whitney, On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other, The Annals of Mathematical Statistics 18 (1) (1947) 50–60.

[68] D. Joanes, C. Gill, Comparing measures of sample skewness and kurtosis, Journal of the Royal Statistical Society: Series D (The Statistician) 47 (1) (1998) 183–189.

[69] E. W. Weisstein, CRC concise encyclopedia of mathematics, CRC press,2002.

[70] W. A. Fuller, Introduction to statistical time series, vol. 428, John Wiley& Sons, 2009.

[71] J. G. MacKinnon, Approximate asymptotic distribution functions for unit-root and cointegration tests, Journal of Business & Economic Statistics 12 (2) (1994) 167–176.

[72] R. H. Jones, Maximum likelihood fitting of ARMA models to time series with missing observations, Technometrics 22 (3) (1980) 389–395.

[73] P. Du, W. A. Kibbe, S. M. Lin, Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching, Bioinformatics 22 (17) (2006) 2059–2065.

[74] N. Ricker, The form and laws of propagation of seismic wavelets, Geophysics 18 (1) (1953) 10–40.

[75] P. D. Welch, The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms, IEEE Transactions on Audio and Electroacoustics 15 (2) (1967) 70–73.


Vitae

Maximilian Christ received his M.S. degree in Mathematics and Statistics from Heinrich-Heine-Universität Düsseldorf, Germany, in 2014. In his daily work as a Data Science Consultant at Blue Yonder GmbH he optimizes business processes through data-driven decisions. Besides his business work he is pursuing a Ph.D. in collaboration with the University of Kaiserslautern, Germany. His research interest relates to how to deliver business value through Machine Learning based optimizations.

Dr. Andreas W. Kempa-Liehr is a Senior Lecturer at the Department of Engineering Science of the University of Auckland, New Zealand, and an Associate Member of the Freiburg Materials Research Center (FMF) at the University of Freiburg, Germany. Andreas received his doctorate from the University of Münster in 2004 and continued his research as head of the service group Scientific Information Processing at FMF. From 2009 to 2016 he worked in different data science roles at EnBW Energie Baden-Württemberg AG and as Senior Data Scientist at Blue Yonder GmbH. His research is focused on the synergy of data science and operations research for engineering applications.

Prof. Dr. Michael Feindt founded Blue Yonder GmbH, one of the world's leading companies for Predictive Applications in the retail sector, in 2008 and serves as its Chief Scientific Advisor and Member of the Advisory Board. He was a research fellow and staff member at the European Organization for Nuclear Research CERN from 1991 to 1997, and he also participated in the CDF II experiment at Fermilab in Chicago, USA, as well as the Belle and Belle II experiments at KEK in Japan. He is a professor at the Institute of Experimental Nuclear Physics at the Karlsruhe Institute of Technology (KIT), Germany. His main research interest lies in optimizing and automating research in many different areas by means of machine learning, especially combining Bayesian statistics, neural networks, conditional probability density estimation and causality.


Appendix A. Considered feature mappings

FRESH and the respective tsfresh package depend on feature mappings \theta_k : \mathbb{R}^{n_t} \to \mathbb{R} for capturing relevant characteristics of the respective time series. Each of these feature mappings \theta(S) takes a single time series

S = (s(t_1), s(t_2), \ldots, s(t_\nu), \ldots, s(t_{n_t}))^T = (s_1, \ldots, s_\nu, \ldots, s_{n_t})^T

as argument and returns either a real-valued feature (e.g. length) or a vector of features (e.g. binned_entropy). In the latter case, each element of the returned feature vector is treated as a separate feature. For the sake of simplicity, the following descriptions omit the indices for different devices and sensors described in Eq. (1). Feature mappings with additional parameters are denoted by \theta(S|\cdot).

Note that the number of feature mappings provided by the tsfresh package is continuously increasing. However, the following subsections describe the feature mappings which have been used for the evaluation in Section 4. The most recent list of feature mappings can be found at http://tsfresh.readthedocs.io/en/latest/.
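As an illustration of this interface, a scalar-valued and a vector-valued feature mapping can be sketched in Python as follows. The function names and the binning scheme are hypothetical and do not mirror the tsfresh identifiers:

```python
def theta_length(S):
    # scalar-valued feature mapping: one time series in, one feature out
    return len(S)

def theta_binned_counts(S, m=3):
    # vector-valued feature mapping: each element of the returned vector
    # is treated as a separate feature downstream
    lo, hi = min(S), max(S)
    width = (hi - lo) / m or 1.0          # guard against constant series
    counts = [0] * m
    for s in S:
        counts[min(int((s - lo) / width), m - 1)] += 1
    return counts

S = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
features = {"length": theta_length(S)}
for k, c in enumerate(theta_binned_counts(S)):
    features[f"binned_counts_{k}"] = c    # flatten the vector into named features
```

Flattening vector-valued mappings into individually named features is what allows the subsequent relevance filter to test each feature separately.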

Appendix A.1. Features from summary statistics

• maximum(S): Sample maximum of time series S

\mathrm{maximum}(S) = \max\{s_1, \ldots, s_\nu, \ldots, s_{n_t}\}.

• minimum(S): Sample minimum of time series S

\mathrm{minimum}(S) = \min\{s_1, \ldots, s_\nu, \ldots, s_{n_t}\}.

• mean(S): Arithmetic mean of time series S

\mathrm{mean}(S) = \bar{S} = \frac{1}{n_t} \sum_{\nu=1}^{n_t} s_\nu.

• var(S): Expectation of the squared deviation of time series S from its mean \bar{S} without bias correction

\mathrm{var}(S) = \frac{1}{n_t} \sum_{\nu=1}^{n_t} (s_\nu - \bar{S})^2.

• std(S): Uncorrected sample standard deviation of time series S

\mathrm{std}(S) = \sqrt{\mathrm{var}(S)}.


• skewness(S): Sample skewness calculated with the adjusted Fisher-Pearson standardized moment coefficient [68]:

\mathrm{skewness}(S) = \frac{n_t^2}{(n_t - 1)(n_t - 2)} \cdot \frac{\frac{1}{n_t} \sum_{\nu=1}^{n_t} (s_\nu - \bar{S})^3}{\left( \frac{1}{n_t - 1} \sum_{\nu=1}^{n_t} (s_\nu - \bar{S})^2 \right)^{3/2}}.

• kurtosis(S): Fourth central moment of time series S divided by the square of its variance

\mathrm{kurtosis}(S) = \frac{1}{n_t} \sum_{\nu=1}^{n_t} \left( \frac{s_\nu - \bar{S}}{\mathrm{std}(S)} \right)^4 - 3

as defined by Fisher [69]. The subtrahend of 3 ensures that the kurtosis is zero for normally distributed samples.

• length(S): Number of samples n_t of time series S

\mathrm{length}(S) = n_t.

• median(S): For a time series S with an odd number of samples n_t, the median is the middle value of the sorted time series. If the time series has an even number of samples, the two middle values are averaged:

\mathrm{median}(S) = \begin{cases} s_{(n_t+1)/2} & : n_t \text{ is odd}, \\ \frac{1}{2} \left( s_{n_t/2} + s_{n_t/2 + 1} \right) & : n_t \text{ is even}. \end{cases}

• quantile_of_empiric_distribution_function(S|q): q-quantile Q_q(S) of the empirical distribution function S_{n_t} of time series sample S

Q_q(S) = \inf \left\{ z \;\middle|\; S_{n_t}(z) = \frac{1}{n_t} \sum_{\nu=1}^{n_t} 1_{s_\nu \le z} \ge q \right\}.

Here, a fraction q of the ordered values from S is lower than or equal to this quantile.
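The uncorrected variance, the adjusted skewness, and the Fisher kurtosis above differ from library defaults in their normalisation, so a minimal numpy sketch may help; the function names mirror the descriptions, not the tsfresh identifiers:

```python
import numpy as np

def var_uncorrected(S):
    # population variance with 1/n_t normalisation (no bias correction)
    S = np.asarray(S, dtype=float)
    return float(np.mean((S - S.mean()) ** 2))

def skewness(S):
    # adjusted Fisher-Pearson standardized moment coefficient
    S = np.asarray(S, dtype=float)
    n = len(S)
    m3 = np.mean((S - S.mean()) ** 3)           # 1/n_t third central moment
    s2 = np.sum((S - S.mean()) ** 2) / (n - 1)  # bias-corrected variance
    return float(n**2 / ((n - 1) * (n - 2)) * m3 / s2**1.5)

def kurtosis_fisher(S):
    # fourth standardized moment minus 3, so a normal sample gives ~0
    S = np.asarray(S, dtype=float)
    return float(np.mean(((S - S.mean()) / S.std()) ** 4) - 3)
```

For a symmetric sample such as [1, 2, 3] the skewness vanishes, while the Fisher kurtosis is negative, reflecting the lighter-than-normal tails.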

Appendix A.2. Additional characteristics of sample distribution

• absolute_energy(S): Interpreting the time series S as the velocity of a particle with unit mass, the observed energy computes to

\mathrm{absolute\_energy}(S) = \sum_{\nu=1}^{n_t} s_\nu^2.

• augmented_dickey_fuller_test_statistic(S): The Augmented Dickey-Fuller test checks the hypothesis that a unit root is present in a time series sample [70]. This feature calculator returns the value of the respective test statistic calculated on S [71].


• binned_entropy(S|m): This feature calculator bins the values of the time series sample into m equidistant bins. The returned feature is

-\sum_{k=0}^{\min(m, n_t)} p_k \log(p_k) \cdot 1_{(p_k > 0)},

where p_k is the percentage of samples in bin k.
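A hedged numpy sketch of this calculation, assuming equidistant bins over the observed value range (the exact binning of the tsfresh implementation may differ):

```python
import numpy as np

def binned_entropy(S, m):
    S = np.asarray(S, dtype=float)
    # histogram into at most min(m, n_t) equidistant bins
    counts, _ = np.histogram(S, bins=min(m, len(S)))
    p = counts / len(S)          # bin occupation probabilities p_k
    p = p[p > 0]                 # the indicator 1_{p_k > 0}: skip empty bins
    return float(-np.sum(p * np.log(p)))
```

A sample spread uniformly over two bins yields the maximal two-bin entropy log 2.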

• has_large_standard_deviation(S): Boolean feature indicating whether the standard deviation std(S) is bigger than half the difference between the maximal and minimal value:

\mathrm{std}(S) > \frac{\mathrm{maximum}(S) - \mathrm{minimum}(S)}{2}.

• has_variance_larger_than_std(S): Boolean feature indicating whether the variance is greater than the standard deviation,

\mathrm{var}(S) > \mathrm{std}(S),

which is equivalent to the variance of the sample being larger than 1.

• is_symmetric_looking(S): Boolean feature indicating whether

|\mathrm{mean}(S) - \mathrm{median}(S)| < \frac{\mathrm{maximum}(S) - \mathrm{minimum}(S)}{2}.

• mass_quantile(S|q): Relative index i/n_t where q · 100% of the mass of the time series S lies left of i:

\frac{i}{n_t} \quad \text{such that} \quad i = \min \left\{ k \;\middle|\; 1 \le k \le n_t, \; \frac{\sum_{\nu=1}^{k} s_\nu}{\sum_{\nu=1}^{n_t} s_\nu} \ge q \right\}.

For example, for q = 0.5 this feature calculator will return the relative position of the mass center of the time series.
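A sketch of this calculation, under the assumption that the "mass" is the cumulative sum of the (non-negative) series values normalised by the total sum:

```python
import numpy as np

def mass_quantile(S, q):
    S = np.asarray(S, dtype=float)
    mass = np.cumsum(S) / np.sum(S)         # fraction of total mass up to index k
    i = int(np.argmax(mass >= q)) + 1       # smallest 1-based index with mass >= q
    return i / len(S)
```

For a constant series the mass accumulates linearly, so q = 0.5 returns the relative position of the middle of the series.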

• number_data_points_above_mean(S): Number of data points that are larger than the average value of the time series sample.

• number_data_points_above_median(S): Number of data points that are larger than the median value of the time series sample.

• number_data_points_below_mean(S): Number of data points that are lower than the average value of the time series sample.

• number_data_points_below_median(S): Number of data points that are lower than the median value of the time series sample.


Appendix A.3. Features derived from observed dynamics

• arima_model_coefficients(S|i, k): This feature calculator fits an autoregressive AR(k) process [72] by unconditional maximum likelihood,

s_\nu = \varphi_0 + \sum_{j=1}^{k} \varphi_j s_{\nu - j} + \varepsilon_\nu,

on the time series S. The parameter k is the maximum lag of the process and the calculated feature is the coefficient \varphi_i for index i of the fitted model.
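For illustration, the AR(k) model can be fit by ordinary least squares, a simple stand-in for the unconditional maximum-likelihood fit referenced above (for an exactly autoregressive series the two agree on the coefficients):

```python
import numpy as np

def ar_coefficients(S, k):
    S = np.asarray(S, dtype=float)
    # design matrix: intercept column plus k lagged copies of the series,
    # so row t regresses s_nu on s_{nu-1}, ..., s_{nu-k}
    X = np.column_stack([np.ones(len(S) - k)] +
                        [S[k - j:len(S) - j] for j in range(1, k + 1)])
    y = S[k:]
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    return phi   # phi[0] is the intercept phi_0, phi[i] the lag-i coefficient

# an exact AR(1) series with phi_1 = 0.5 and no noise
S = [1.0]
for _ in range(9):
    S.append(0.5 * S[-1])
phi = ar_coefficients(S, 1)
```

On this noiseless example the fit recovers the intercept 0 and the lag-1 coefficient 0.5 up to numerical precision.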

• continuous_wavelet_transformation_coefficients(S|a, b): Calculates a discretization of the continuous wavelet transformation [73] for the Ricker wavelet, also known as the Mexican hat wavelet [74], which is defined by

\psi(\nu, a, b) = \frac{2}{\sqrt{3a}\, \pi^{1/4}} \left( 1 - \frac{(\nu - b)^2}{a^2} \right) \exp \left( -\frac{(\nu - b)^2}{2 a^2} \right).

Here, a is the width and b is the location parameter. The continuous wavelet transformation performs a convolution of the time series S with the wavelet \psi(\nu, a, b):

X_w(a, b) = \sum_{\nu=-\infty}^{\infty} s_\nu \, \psi(\nu, a, b).

The features are the coefficients X_w(a, b) for the wavelet of width a at position b.
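A direct numpy sketch of one such coefficient, assuming the finite series is treated as zero outside its observed indices:

```python
import numpy as np

def ricker(nu, a, b):
    # Ricker ("Mexican hat") wavelet of width a centred at location b
    x = (nu - b) / a
    return 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25) * (1 - x**2) * np.exp(-x**2 / 2)

def cwt_coefficient(S, a, b):
    # discretized convolution of the series with the wavelet at (a, b)
    S = np.asarray(S, dtype=float)
    nu = np.arange(len(S))
    return float(np.sum(S * ricker(nu, a, b)))
```

For a unit impulse the coefficient reduces to the wavelet's centre value 2 / (sqrt(3a) π^{1/4}).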

• fast_fourier_transformation_coefficient(S|k): Calculates the one-dimensional discrete Fourier transform for real input. The real parts of the coefficients

X_k = \sum_{\nu=1}^{n_t} s_\nu \cdot \exp \left( -\frac{2 \pi i k (\nu - 1)}{n_t} \right), \quad k \in \mathbb{Z},

are returned as features. For real-valued input the real part of X_k is equal to the real part of X_{-k}, which means that this feature calculator only takes natural numbers as parameters.
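The sign and normalisation conventions above match numpy's forward FFT, so the feature can be sketched in one line (the function name is illustrative, not the tsfresh identifier):

```python
import numpy as np

def fft_coefficient_real(S, k):
    # np.fft.fft uses exp(-2*pi*i*k*n/N) with 0-based n, matching X_k above
    return float(np.fft.fft(np.asarray(S, dtype=float))[k].real)
```

For a constant series all energy sits in X_0, while the higher coefficients vanish.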

• first_index_max(S): This feature is the relative position

\frac{\min (\arg\max S)}{n_t}

at which the maximum value was observed for the first time in the time series sample.


• first_index_min(S): This feature is the relative position

\frac{\min (\arg\min S)}{n_t}

at which the minimal value was observed for the first time in the time series sample.

• lagged_autocorrelation(S|l): Calculates the autocorrelation of the time series S with its lagged version of lag l:

\frac{1}{\mathrm{var}(S)} \sum_{\nu=1}^{n_t - l} (s_\nu - \bar{S})(s_{\nu + l} - \bar{S}).
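A numpy sketch of this quantity; note that it additionally normalises by the number of summands n_t − l, an assumption that turns the value into a correlation coefficient in [−1, 1]:

```python
import numpy as np

def lagged_autocorrelation(S, l):
    S = np.asarray(S, dtype=float)
    n, mu = len(S), S.mean()
    # sum over the n_t - l overlapping pairs (s_nu, s_{nu+l})
    cov = np.sum((S[:n - l] - mu) * (S[l:] - mu))
    return float(cov / ((n - l) * S.var()))
```

An alternating series is perfectly anticorrelated at lag 1 and perfectly correlated at lag 2.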

• large_number_of_peaks(S|l, m): Boolean variable indicating whether the number of peaks of size l as defined in number_peaks_of_size(l) is greater than m.

• last_index_max(S): Relative position

\frac{\max (\arg\max S)}{n_t}

at which the maximum value was observed for the last time in the time series sample.

• last_index_min(S): Relative position

\frac{\max (\arg\min S)}{n_t}

at which the minimal value was observed for the last time in the time series sample.

• longest_strike_above_mean(S): Length of the longest consecutive subsequence in S whose values are larger than or equal to the mean of S.

• longest_strike_above_median(S): Length of the longest consecutive subsequence in S whose values are larger than or equal to the median of S.

• longest_strike_below_mean(S): Length of the longest consecutive subsequence in S whose values are smaller than or equal to the mean of S.

• longest_strike_below_median(S): Length of the longest consecutive subsequence in S whose values are smaller than or equal to the median of S.

• longest_strike_negative(S): Length of the longest consecutive subsequence in S whose values are smaller than zero.

• longest_strike_positive(S): Length of the longest consecutive subsequence in S whose values are larger than zero.

• longest_strike_zero(S): Length of the longest consecutive subsequence in S whose values are equal to zero.
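The whole longest_strike_* family reduces to one run-length computation parameterised by a predicate, which can be sketched in pure Python (the helper name is illustrative):

```python
def longest_strike(S, predicate):
    # length of the longest run of consecutive values satisfying the predicate
    best = run = 0
    for s in S:
        run = run + 1 if predicate(s) else 0
        best = max(best, run)
    return best

S = [1, 3, 3, 0, -2, -1, -4, 2]
above_mean = longest_strike(S, lambda s: s >= sum(S) / len(S))
negative = longest_strike(S, lambda s: s < 0)
```

Here the mean is 0.25, so the run 1, 3, 3 gives above_mean = 3, and the run −2, −1, −4 gives negative = 3.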

• mean_absolute_change(S): Arithmetic mean of absolute differences between subsequent time series values

\frac{1}{n_t} \sum_{\nu=1}^{n_t - 1} |s_{\nu + 1} - s_\nu|.

• mean_absolute_change_quantiles(S|q_l, q_h): This feature calculator first fixes a corridor given by the quantiles Q_{q_l} and Q_{q_h} of the empirical distribution function of S. Then it calculates the average absolute value of consecutive changes of the series S inside this corridor:

\frac{\sum_{\nu=1}^{n_t - 1} |s_{\nu + 1} - s_\nu| \, 1_{Q_{q_l} \le s_\nu \le Q_{q_h}} \, 1_{Q_{q_l} \le s_{\nu + 1} \le Q_{q_h}}}{\sum_{\nu=1}^{n_t - 1} 1_{Q_{q_l} \le s_\nu \le Q_{q_h}} \, 1_{Q_{q_l} \le s_{\nu + 1} \le Q_{q_h}}}.

• mean_autocorrelation(S): Average autocorrelation over possible lags l ranging from 1 to n_t − 1:

\frac{1}{(n_t - 1)\, \mathrm{var}(S)} \sum_{l=1}^{n_t - 1} \sum_{\nu=1}^{n_t - l} (s_\nu - \bar{S})(s_{\nu + l} - \bar{S}).

• mean_second_derivate_central(S): Average value of the central approximation of the second derivative of the time series

\frac{1}{n_t - 2} \sum_{\nu=2}^{n_t - 1} \frac{1}{2} (s_{\nu - 1} - 2 s_\nu + s_{\nu + 1}).

• number_continous_wavelet_transformation_peaks_of_size(S|l): This feature calculator detects peaks in the time series by inspecting the coefficients of a discretization of the continuous wavelet transformation of the time series S. First, it calculates the wavelet transformation X_w(a, b) for widths a ranging from 1 to l; again the Ricker wavelet is used. The peaks are then found by comparing the local signal-to-noise ratio (SNR): if a local maximum of the wavelet coefficients X_w(a, b) is observed for enough widths, it is counted as a peak [73].

• number_peaks_of_size(S|l): A peak of support l is defined as a data point s_i which is larger than its l neighboring values to the left and to the right. This feature calculates the number of peaks with support l in the time series S.
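This definition can be sketched directly in pure Python; boundary points without l neighbours on both sides are skipped, which is an assumption about how the edges are handled:

```python
def number_peaks_of_size(S, l):
    # count data points strictly larger than all l neighbours on each side
    count = 0
    for i in range(l, len(S) - l):
        if all(S[i] > S[i - j] and S[i] > S[i + j] for j in range(1, l + 1)):
            count += 1
    return count
```

Increasing l demands wider isolation, so a series can have fewer peaks of support 2 than of support 1.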

• spektral_welch_density(S|i): This feature calculator uses Welch's method [75] to compute an estimate of the power spectral density of the time series sample S. Welch's method divides the time series into overlapping segments, computes a modified periodogram for each segment, and then averages the periodograms. The frequencies f are defined as follows:

f = \left( 0, \frac{1}{n_t}, \frac{2}{n_t}, \ldots, \frac{1}{2} - \frac{1}{n_t}, -\frac{1}{2}, -\frac{1}{2} + \frac{1}{n_t}, \ldots, -\frac{2}{n_t}, -\frac{1}{n_t} \right) \quad \text{if } n_t \text{ is even},

f = \left( 0, \frac{1}{n_t}, \frac{2}{n_t}, \ldots, \frac{n_t - 1}{2 n_t}, -\frac{n_t - 1}{2 n_t}, -\frac{n_t - 1}{2 n_t} + \frac{1}{n_t}, \ldots, -\frac{2}{n_t}, -\frac{1}{n_t} \right) \quad \text{if } n_t \text{ is odd}.

This feature calculator will return the calculated power spectrum at frequency f_i as the feature.
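A hedged numpy sketch of Welch's method for real input; the segment length, the 50% overlap, and the Hann window are illustrative assumptions, not values prescribed by the text:

```python
import numpy as np

def welch_psd(S, seg_len=8):
    S = np.asarray(S, dtype=float)
    window = np.hanning(seg_len)
    step = seg_len // 2                          # 50% segment overlap
    periodograms = []
    for start in range(0, len(S) - seg_len + 1, step):
        seg = S[start:start + seg_len] * window  # modified (windowed) segment
        periodograms.append(np.abs(np.fft.rfft(seg)) ** 2)
    freqs = np.fft.rfftfreq(seg_len)             # frequencies in cycles per sample
    return freqs, np.mean(periodograms, axis=0)  # averaged periodograms
```

For a pure sine at 0.25 cycles per sample, the averaged spectrum peaks at the frequency bin 0.25.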

• time_reversal_asymmetry_statistic(S|l): Value of

\frac{1}{n_t - 2l} \sum_{\nu=1}^{n_t - 2l} s_{\nu + 2l}^2 \cdot s_{\nu + l} - s_{\nu + l} \cdot s_\nu^2

for lag l. It was proposed in [10] as a promising feature to extract from time series.
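The statistic vectorises naturally with numpy slicing (the function name is illustrative):

```python
import numpy as np

def time_reversal_asymmetry(S, l):
    S = np.asarray(S, dtype=float)
    n = len(S)
    # mean over nu = 1..n_t - 2l of s_{nu+2l}^2 * s_{nu+l} - s_{nu+l} * s_nu^2
    return float(np.mean(S[2 * l:] ** 2 * S[l:n - l] - S[l:n - l] * S[:n - 2 * l] ** 2))
```

A time-reversible series such as 1, 2, 3, 2, 1 yields zero, while a monotonically increasing series yields a positive value, which is what makes the statistic sensitive to irreversible dynamics.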
