

Missing Data in Time Series and Imputation Methods

Rantou Kleio Elissavet

MSc in Statistics and Actuarial Financial Mathematics

Department of Mathematics

University of the Aegean

Samos

February 2017


Missing Data in Time Series and Imputation Methods

Author

Rantou Kleio Elissavet

Supervisor

Karagrigoriou Alexandros

Committee

Hatjispyros Spyros

Tsimikas Ioannis


Abstract

Missing data become the first obstacle when designing predictive models, as most statistical methods are premised on complete data without missing values. Thus, it is important to be familiar with the methods employed for missing data management. Time series data exist in nearly every scientific field where data are measured, recorded and monitored, so it is understandable that missing values may occur. Therefore, when it comes to analyzing the data, the missing values should be replaced with rational values, to carry out an analysis based on a "complete" dataset; in statistics this approach of handling missing values is called Imputation. In the special case of univariate time series, the "state of the art" techniques cannot be employed, as they are based on inter-variable correlations in order to estimate missing values. Hence, time series characteristics need to be taken into consideration to develop an appropriate and efficient strategy when dealing with missing data. The main scope of this thesis is to compare and quantify the performance of imputation algorithms in the context of univariate time series data, using the statistical software R. Our approach generalizes the methodology of Moritz et al. (2015), in the sense that we explore various imputation techniques especially designed for univariate time series.


Contents

Acknowledgement

Introduction

1 Missing Data Theory

2 Imputation Methods

3 Structural Linear State Space Model

4 Interpolation

5 Preliminary Data Analysis

6 Implementation of Imputation Algorithms

7 Cross Validation of Imputation Algorithms

A R Code

Bibliography


ACKNOWLEDGEMENT

I would first like to thank my thesis advisor, Dr. Alexandros Karagrigoriou of the Department of Mathematics at the University of the Aegean. The door to Prof. Karagrigoriou's office was always open whenever I ran into a trouble spot or had a question about my research or writing. He consistently allowed this thesis to be my own work, but steered me in the right direction whenever he thought I needed it.

Finally, I must express my very profound gratitude to my parents, my brother and my partner for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you.

Author
Kleio Elissavet Rantou


Introduction

Missing data become the first obstacle when designing predictive models, as most statistical methods are premised on complete data without missing values. Thus, it is important to be familiar with the methods employed for missing data management, as the method of choice will influence the statistical power of the predictive model.

In the special case of univariate time series, the "state of the art" techniques cannot be employed, as they are based on inter-variable correlations in order to estimate missing values. Hence, time series characteristics need to be taken into consideration to develop an appropriate and efficient strategy when dealing with missing data.

Time series data exist in nearly every scientific field where data are measured, recorded and monitored, so it is understandable that missing values may occur. Therefore, when it comes to analyzing the data, the missing values should be replaced with rational values, to carry out an analysis based on a "complete" dataset; in statistics this approach of handling missing values is called Imputation. Imputation is an immense field of study, where a lot of research has already taken place. Popular techniques are Multiple Imputation (Rubin, 1987), Expectation-Maximization (Dempster et al., 1977), Nearest Neighbor (Vacek and Ashikaga, 1980) and Hot Deck methods (Ford, 1980). The method of imputation assumes that the estimation of missing values derives from a predictive distribution which is based on observed values. Thus the problem reduces to the construction of a predictive function.

The main scope of this thesis is to compare and quantify the performance of imputation algorithms in the context of univariate time series data, using the statistical software R. Our approach generalizes the methodology of Moritz et al. (2015), in the sense that we explore various imputation techniques especially designed for univariate time series.


In the first Chapter, the widely accepted framework of missing data theory is reviewed, as outlined in Rubin (1976). In the second Chapter, the imputation methods that one can use when handling missing data in a univariate time series are stated. In Chapters 3 & 4 we analyze in detail the more complex methods used in our experiments, namely Structural models and Interpolation. In Chapter 5 the datasets that we use and the preliminary analysis that we perform are presented. In Chapter 6 the considered methodology is explained, along with the R package ImputeTS (Moritz, 2016) that we use for the experiments. In Chapter 7 we demonstrate the cross-validation process, and visualize and interpret the results.


Chapter 1

Missing Data Theory

Missing Data Pattern

A missing data pattern describes where the missing values are located in the dataset. In Figure (1.1), the different types of missing data patterns are displayed; rows correspond to observations, columns to variables (Little and Rubin, 2002).

Figure 1.1: Missing Data Patterns


In the case of univariate time series, the missingness is confined to a single variable. A useful tool for exploring the missing data pattern is a missingness map. This map visualizes the data set as a grid and colors the cells by their missingness status. The columns of the grid are the variables and the rows are the observations. This tool offers a quick summary of the pattern of the missing data (Honaker et al., 2015), as sketched below.
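As an illustration, such a map can be drawn with the Amelia package cited above; this is a minimal sketch, assuming Amelia is installed (its missmap() function accepts a plain data frame, not only amelia output).

library(Amelia)
set.seed(1)
df <- data.frame(y = rnorm(100))   # toy variable
df$y[sample(100, 20)] <- NA        # remove 20 values at random
missmap(df)                        # grid colored by missingness status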

Missing Data Mechanisms

Rubin (1976) presented a classification system for missing data that is widely used today. This system describes the relationship between the data and the probability of missing values. To better understand and describe the distributions of the following mechanisms, we shall refer to the basic missing data theory and notation, which is slightly different from that of the original paper by Rubin, but commonly used in the statistics literature on missing data.

Let Y = (y1, ..., yn)T denote a vector random variable of the complete data, which contains both the observed values Yobs and the missing values Ymis, with probability density function fθ. The objective is to make inferences about the vector of unknown parameters θ.

The missing data indicator M = (M1, ..., Mn)T defines a binary variable that denotes whether the value of a variable is observed or missing (i.e. Mi = 0 if the value yi is observed and Mi = 1 if it is missing). The missing data indicator defines the pattern of missing data.

Essentially, Rubin's (1976) theory views individual observations as a two-valued vector, with the value of the observation denoted by Yobs (observed) or Ymis (missing) and the corresponding code of the missing value indicator M. Presenting the missing data as a variable implies that there is an underlying probability distribution that governs the value of the missing data indicator. In practice, it is impossible to know the exact distribution of M. Nevertheless, the nature of the relationship between the indicator M and the data is what distinguishes the missing data mechanisms, which are defined by the conditional distribution of M given the complete data Y, f(M|Y, φ), where φ denotes the vector of unknown parameters that describe the probability of missing data.


Missing Completely at Random (MCAR)

The missing completely at random mechanism requires that the probability of missing data on a variable Y is unrelated to other measured variables and is also unrelated to the values of Y itself; the observed data can be considered a random sample of the complete data. In the case of univariate time series, time is considered an implicit variable, thus the probability of a case being missing is independent of the point in time at which it was observed. With the notation referred to above, the distribution that governs the MCAR mechanism is

P(M|Yobs, Ymis, φ) = P(M|φ) for all Y, φ. (1.1)

Missing at Random (MAR)

An assumption less restrictive than MCAR is that the missing data depend only on the observations that are observed and not on the observations that are missing. The missing data mechanism is then called missing at random (MAR). Since there are no variables other than time in the case of univariate time series, it is assumed that the probability of missing data depends on the point in time at which a value is observed. The distribution of the MAR mechanism is

P(M|Yobs, Ymis, φ) = P(M|Yobs, φ) for all Ymis, φ. (1.2)

Missing Not at Random (MNAR)

Finally, data are missing not at random (MNAR) when the probability of missing data on a variable Y is related to the values of Y itself, both observed and missing. In the context of univariate time series, the MNAR mechanism can be the outcome of an underlying rule. The probability of missing data may, but need not, depend on the point in time at which values are observed. The probability distribution for the missing not at random mechanism is

P(M|Y, φ) = P(M|Yobs, Ymis, φ). (1.3)


Ignorable Missingness

Missing data theory implicates two sets of parameters: the parameters estimated for inference, θ, and the parameters that describe the probability of missing data, φ. In practice, it is impossible to know why the data are missing and to estimate φ accurately; however, in some cases φ can influence the estimation of θ. Rubin (1976) clarified the conditions that need to hold in order to assume that the cause of the missing data is ignorable.

The objective is to use Yobs to make inferences about θ, ignoring the process that causes the missing data, which is equivalent to:

• Fixing the random variable M at the observed pattern of missing data m = (m1, ..., mn).

• Assuming that the values of the observed data Yobs arose from the marginal density of the random variable,

fθ(Yobs) = ∫ fθ(Y) dYmis. (1.4)

Let gφ(m|Y) denote the probability density function of M = m = (m1, ..., mn), given Y.

Condition 1 The missing data are missing at random if for each value of φ, gφ(m|Y) takes the same value for all Ymis.

Condition 2 The observed data are observed at random if for each value of φ, gφ(m|Y) takes the same value for all Yobs.

Condition 3 The parameter φ is distinct from θ if their joint parameter space factorizes into a φ-space and a θ-space, and, when prior distributions are specified for φ and θ, if these are independent.

Missing Data and Sampling Distribution Inference

When sampling distribution inferences are made, to estimate the true value of θ, the observed value of a statistic, say S(Y), is compared to the hypothesized distribution derived from fθ. This is not the case when missing values are present, as the sampling distribution depends not only on the hypothesized fθ but also on gφ.

Theorem 1 (Rubin, 1976)

Suppose that

a The missing data are missing at random.

b The observed data are observed at random.

Then the sampling distribution of the statistic S(Y) under fθ, ignoring the process that causes missing data, i.e. calculated from density (1.4), equals the conditional sampling distribution of S(Y) given m under fθ(·)gφ(·), calculated from the density

∫ fθ(Y) gφ(m|Y) / kθ,φ(m) dYmis, (1.5)

where

kθ,φ(m) = ∫ fθ(Y) gφ(m|Y) dY,

which is the marginal probability that M = m, assuming kθ,φ(m) > 0.

Proof. By conditions (a) and (b), for each value of φ, gφ(m|Y) takes the same value for all Y. This does not imply that Y and M are independently distributed, unless it holds for all possible m. Hence kθ,φ(m) = gφ(m|Y), and thus the distribution of every statistic under density (1.4) is the same as under density (1.5).

Theorem 2 (Rubin, 1976)

The sampling distribution of S(Y) under fθ, calculated by ignoring the process that causes missing data, equals the correct conditional sampling distribution of S(Y) given m under fθ gφ for every S(Y), if and only if

E_{Ymis}[gφ(m|Y)|m, Yobs, θ, φ] = kθ,φ(m) > 0. (1.6)

Proof. The sampling distribution of every S(Y) that comes from (1.4) will be identical to that from density (1.5) if and only if these two densities are equal. This equality may be written as equation (1.6) by dividing by (1.4) and multiplying by kθ,φ(m).

The phrase "ignoring the process that causes missing data when making sampling distribution inferences" may suggest not only calculating sampling distributions with respect to density (1.5) but also interpreting the resulting sampling distribution unconditionally on m.

Theorem 3 (Rubin, 1976)

The sampling distribution of S(Y) under fθ, calculated ignoring the process that causes missing data, equals the correct unconditional sampling distribution of S(Y) under fθ(·)gφ(·) for all S(Y), if and only if

gφ(m|Y ) = 1.

Proof. The sufficiency is immediate. To establish necessity, consider the statistic that equals S(Y) when the observed pattern equals m and 0 otherwise.

Missing Data and Bayesian Inference

In the context of Bayesian inference, the parameters θ and φ are random variables whose marginal distribution is described by their prior densities p(θ), p(φ|θ). Bayesian inference for θ ignoring the process that causes missing data means choosing a prior p(θ) and supposing that the observed data Yobs derive from density (1.4). Hence, the posterior distribution of θ ignoring the process of missing data is

p(θ|Yobs) ∝ p(θ) ∫ fθ(Y) dYmis. (1.7)

Theorem 4 (Rubin, 1976)

Suppose that

a the missing data are missing at random.

b φ is distinct from θ.


Then the posterior distribution of θ ignoring the process that causes missing data, (1.7), equals the joint posterior distribution of θ and φ calculated from the density

p(θ) p(φ|θ) ∫ fθ(Y) gφ(m|Y) dYmis. (1.8)

Proof. If conditions (a) and (b) apply, equation (1.8) is equal to

p(θ) ∫ fθ(Y) dYmis · {p(φ) gφ(m|Y)}.

Theorem 5 (Rubin, 1976)

The posterior distribution of θ ignoring the process that causes missing data equals the joint posterior distribution of θ and φ if and only if

E_{φ,Ymis}[gφ(m|Y)|m, Yobs, θ] (1.9)

takes a constant positive value.

Proof. The posterior distribution of θ is proportional to (1.8) integrated over φ. This can be written as

{p(θ) ∫ fθ(Y) dYmis} ∫ E_{Ymis}[gφ(m|Y)|m, Yobs, θ, φ] p(φ|θ) dφ. (1.10)

Expressions (1.10) and (1.7) yield the same distribution for θ if and only if they are equal. Thus, the second factor in (1.10), which is expression (1.9), must take a constant positive value.

Artificially Created Missing Data in a Univariate Sample

Little and Rubin (2002) present the data structure for simulating missing values in a univariate dataset, under the MCAR mechanism.

Let Y = (y1, ..., yn)T, where yi denotes the value of a random variable for observation i, and let M = (M1, ..., Mn)T, where Mi = 0 for units that are observed and Mi = 1 for units that are missing. Suppose the joint distribution f(yi, Mi) is independent across units i = 1, 2, ..., n, so that the probability that a value is observed does not depend on the values of Y or M for the other units. Then,

f(Y, M|θ, φ) = f(Y|θ) f(M|Y, φ) = ∏_{i=1}^{n} f(yi|θ) ∏_{i=1}^{n} f(Mi|yi, φ), (1.11)

where f(yi|θ) denotes the density of yi indexed by the unknown parameters θ, and f(Mi|yi, φ) is the density of a Bernoulli distribution for the binary indicator Mi with probability P(Mi = 1|yi, φ) that yi is missing. If missingness is independent of Y, that is, P(Mi = 1|yi, φ) = φ is a constant in [0, 1] that does not depend on yi, then the missing data mechanism is MCAR or, in this case equivalently, MAR. If the mechanism depends on yi, the mechanism is MNAR, since it depends on data that are missing, assuming that there are some.

Univariate t-Test Comparison

The MCAR mechanism requires that the observed data represent a simple random sample of the complete dataset, under which both observed and missing data should share the same mean and variance, as they belong to the same population. The simplest method for assessing MCAR is to use a series of independent t-tests to compare missing data subgroups (Dixon, 1988). This method is implemented by separating the "missing" and "complete" observations; then the group means are tested, to examine if there are statistically significant differences. Under the null hypothesis, the test statistic follows Student's t distribution, since in most cases the variance is not known. It is common to use Welch's t-test (Welch, 1947), which does not assume equality of variance and sample size. The data are MCAR when the t-statistic is not significant, and MAR or MNAR if there is a statistically significant mean difference. The t-statistic is given by the formula:

t = (Ȳobs − Ȳmis) / √(S1²/N1 + S2²/N2), (1.12)

where Ȳobs, S1² and N1 are the sample mean, sample variance and sample size for the observed data, and Ȳmis, S2² and N2 the sample mean, sample variance and sample size for the missing data. The degrees of freedom ν associated with this variance estimate are approximated by the Welch–Satterthwaite equation (Welch, 1947):

ν ≈ (S1²/N1 + S2²/N2)² / ( S1⁴/(N1² ν1) + S2⁴/(N2² ν2) ). (1.13)

Here, ν1 = N1 − 1 are the degrees of freedom associated with the first variance estimate and ν2 = N2 − 1 those associated with the second variance estimate, respectively.
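In R the comparison amounts to splitting the complete series by the missingness indicator and calling t.test(), whose default var.equal = FALSE gives Welch's test; a minimal sketch, reusing y and m_mcar from the simulation above:

mcar_check <- function(y, m) {   # m: 1 = missing, 0 = observed (illustrative helper)
  t.test(y[m == 1], y[m == 0])   # Welch's two-sample t-test by default
}
mcar_check(y, m_mcar)            # a non-significant difference supports MCAR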

Little’s MCAR Test

It is important to mention that Little (1988) proposed a multivariate extension of the t-test that evaluates mean differences across subgroups of observations in the entire data set that share the same missing data pattern. The test statistic is a weighted sum of the standardized differences between the subgroup means and the overall means:

d² = ∑_{j=1}^{J} nj (µj − µj^ML) Σj⁻¹ (µj − µj^ML)ᵀ, (1.14)

where nj is the number of cases in missing data pattern j, µj the vector of means of the variables observed in pattern j, µj^ML the vector of maximum likelihood estimates of the overall means, and Σj⁻¹ the inverse of the maximum likelihood estimate of the covariance matrix. The subscript j indicates that the number of elements in the parameter matrices varies across missing data patterns. In the case where Σj is known or the sample size is large, the d² statistic is chi-square distributed with ∑_j pj − p degrees of freedom, where pj is the number of complete variables for pattern j and p is the total number of variables. The small-sample null distribution of d² is Hotelling's T² distribution (Little, 1988). When the d² statistic is statistically significant, it provides evidence against the MCAR mechanism.

Enders (2010) points out the disadvantages of both methods.

Disadvantages of t-test:

• The correlations among variables are not taken into account.

• For very small group sizes, statistical power is decreased.


• MAR and MNAR mechanisms can produce missing data with equal subgroup means.

Disadvantages of Little’s MCAR test:

• The test does not specify which variable is responsible for the violation of the MCAR mechanism.

• It assumes that the missing data patterns share a common covariance matrix.

• It has been observed that when the number of variables that violate the MCAR assumption is small, the relationship between observed data and missingness is weak and Type II errors may be produced.

• Like the t-test approach, mean comparisons do not provide a conclusive test of MCAR.


Chapter 2

Imputation Methods

Imputation is an immense field of study, where a lot of research has already taken place. Popular techniques are Multiple Imputation (Rubin, 1987), Expectation-Maximization (Dempster et al., 1977), Nearest Neighbor (Vacek and Ashikaga, 1980) and Hot Deck methods (Ford, 1980). However, the state of the art methods of imputation rely on inter-variable correlations to impute missing values. In the case of univariate time series, additional variables cannot be employed directly. Instead, algorithms need to exploit time series characteristics in order to estimate the values of the missing data (Moritz et al., 2015).

List-wise deletion

List-wise deletion is the method of handling missing data by excluding the observations which contain missing values from the analysis. Generally, this method affects the statistical power, as it reduces the sample size. Furthermore, in the case of time series, following this strategy would create more obstacles, as it would produce an irregular or unevenly spaced time series.


Mean Imputation

Mean imputation is a method in which the missing value on a certain variable is replaced by the mean of the available cases, which may be formed within cells or classes. This method maintains the sample size; however, the variability of the data is reduced, and as a result the standard deviation estimates tend to be underestimated (Enders, 2010).

Exponentially Weighted Moving Average

This approach imputes the missing values by calculating the exponentially weighted moving average (EWMA). Initially, the value of the moving average window is set; the mean is thereafter calculated from an equal number of observations on either side of a central missing value.

For example, assuming there is a missing value at position yi in the time series and the window is set to k = 2, then the observations yi−2, yi−1, yi+1, yi+2 are used to calculate the mean. If there is a longer gap of missing values, the window expands further until two non-missing values occur (Moritz, 2016).

In addition, the weighting factors decrease exponentially with distance from the central missing value. Following the previous example, the weight on the values yi−1, yi+1 would be (1 − λ)¹, the weight on the values yi−2, yi+2 would be (1 − λ)², and so on (Prins, 2012).

The general form of the EWMA is, according to Hunter (1986),

Si = λ(yi−1 + (1 − λ)yi−2 + (1 − λ)²yi−3 + ... + (1 − λ)^k y_{i−(k+1)}) + (1 − λ)^{k+1} S_{i−(k+1)}, (2.1)

where Si is the value of the EWMA at any time period i and k ∈ {0, 1, 2, ...}.
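In imputeTS this corresponds to moving average imputation with exponential weighting; a minimal sketch with the 1.x function names used in this thesis (later releases renamed na.ma to na_ma):

library(imputeTS)
x <- ts(c(4, 5, NA, NA, 7, 8))
na.ma(x, k = 2, weighting = "exponential")   # EWMA over a window of k = 2 on each side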


Structural State Space Models and Kalman Smoothing

This method is more sophisticated, in the sense that the characteristics of the time series are employed to explicitly estimate the missing values. Kalman Smoothing is used either on structural time series models or on the state space representation of an ARIMA model to impute the missing values (Moritz, 2016). In the next chapter we review in more detail the basic theory of Structural models and Kalman Smoothing.

The first step in determining such a model is to recognize the components present, i.e. trend and seasonality, then specify the state space form, use the Kalman Smoother to estimate the parameters needed, and finally calculate the missing values.

Interpolation

Interpolation is a commonly used method for missing data problems, as it creates new data points within the range of the observations. The problem of interpolation can be reduced to approximating the data by a "simpler" function, obtained by smoothing a given set of observations using linear, polynomial or piecewise spline functions. In the fourth chapter we review the properties of this method in depth.


Chapter 3

Structural Linear State Space Model

State space modelling constitutes a unified method for dealing with a wide range of problems in time series analysis. In this approach, it is assumed that the progress of the system under study is determined by an unobserved series of vectors α1, ..., αn, which are associated with a series of observations y1, ..., yn. The relation between αt and yt is specified by the state space model. The objective of state space analysis is to infer the relevant properties of αt given the observations y1, ..., yn.

Local Level Model

The simplest form of a structural time series state space model is the local level model. The local level model represents a time series in the additive form

yt = µt + γt + εt, t = 1, ..., n, (3.1)

where µt is the trend component, γt is the seasonal component and εt is the error or noise. A random walk at of size n is then used to develop a suitable model for µt, γt, determined by

at+1 = at + ηt,

where the ηt are independent and identically distributed random variables with zero mean and variance ση². Since random walks are non-stationary, the distributions of the random variables αt and yt depend on time and the model is considered non-stationary. However, most of the structural models we examine can be reduced to stationary form by differencing.

The simplest form of such a model is considered, where αt = µt, in the absence of the seasonal component:

yt = αt + εt, εt ∼ N(0, σε²) (3.2)
αt+1 = αt + ηt, ηt ∼ N(0, ση²).

Although of simple form, this model is not an artificial special case; it provides the basis for the analysis of important real problems in practical time series analysis. This model exhibits the characteristic structure of state space models, in which there is a series of unobserved values α1, ..., αn, called the states, which represent the development of the time series under study over time, together with a set of observations y1, ..., yn which are related to αt via the state space model (Durbin and Koopman, 2012).
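A short simulation of (3.2) shows the two layers of the model; the variances below are illustrative.

set.seed(42)
n     <- 200
eta   <- rnorm(n, sd = 0.5)   # state disturbances eta_t
eps   <- rnorm(n, sd = 1)     # observation noise eps_t
alpha <- cumsum(eta)          # random walk state: alpha_{t+1} = alpha_t + eta_t
y     <- alpha + eps          # observations: y_t = alpha_t + eps_t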

Univariate Structural Time Series Models

A structural time series model is one in which the trend, seasonal and error components of the basic model (3.1), plus other relevant components, are modelled explicitly.

Trend Component

By adding a slope term, generated by a random walk, to the previously considered local level model, we obtain the local linear trend model:

yt = µt + εt, εt ∼ N(0, σε²)
µt+1 = µt + νt + ξt, ξt ∼ N(0, σξ²) (3.3)
νt+1 = νt + ζt, ζt ∼ N(0, σζ²).

The form of (3.3) with σξ > 0 and σζ > 0 allows the trend level and slope to vary over time. In the case where ξt = ζt = 0, then νt+1 = νt = ν and µt+1 = µt + ν; the trend becomes linear and (3.3) reduces to the deterministic linear trend plus noise model.

Seasonal Component

To model the seasonal term γt in (3.1), the frequency s of the observations per unit of time is considered; i.e. for monthly data s = 12, for quarterly data s = 4 and for daily data, when modelling the weekly pattern, s = 7. If the seasonal pattern is fixed over time, the seasonal values for months 1, ..., s can be modelled by the constants γ*_1, ..., γ*_s, where

∑_{j=1}^{s} γ*_j = 0.

For the jth "month" in "year" i, γt = γ*_j, where t = s(i − 1) + j for i = 1, 2, ... and j = 1, ..., s, so that

∑_{j=0}^{s−1} γ_{t+1−j} = 0.

In order to allow the seasonal pattern to change over time, an error term ωt is added:

γ_{t+1} = −∑_{j=1}^{s−1} γ_{t+1−j} + ωt, ωt ∼ N(0, σω²). (3.4)

Cycle Component

Another important component in some time series is the cycle ct, which we can include by extending the basic time series model (3.1) to:

yt = µt + γt + ct + εt, t = 1, ..., n (3.5)

The form of the cycle component is given by the equation:

ct = c cos(λc t) + c* sin(λc t), (3.6)

where λc is the frequency of the cycle and 2π/λc is the period, which is much greater than the seasonal period. An error term can be added to allow the cycle to change stochastically over time. The cycle component is added to the structural time series framework and the frequency λc is treated as an unknown parameter to be estimated.


Explanatory variables

Explanatory variables can easily be incorporated into the structural model by introducing k regressors x1t, ..., xkt with regression coefficients β1, ..., βk, constant in time, so that the model takes the form:

yt = µt + γt + ct + ∑_{j=1}^{k} βj xjt + εt, t = 1, ..., n. (3.7)

State Space Form

The general linear Gaussian state space model is written in state space form, as presented by Durbin and Koopman (2012):

yt = Ztαt + εt, εt ∼ N(0, Ht) (3.8)

αt+1 = Ttαt +Rtηt, ηt ∼ N(0, Qt).

where yt is a p × 1 vector of observations, called the observation vector, and αt is the unobserved m × 1 vector called the state vector. The underlying idea of the model is that the progress of the time series is determined by αt, as seen in the second equation of (3.2). However, due to the fact that αt is not observed, the analysis must be based on the observations yt. The first equation is called the observation equation and the second the state equation. The error terms ηt and εt are considered serially independent and independent of each other.

Basic Structural Time Series Model

The components mentioned can be combined to design a structural model and then converted to state space form (3.8). The basic structural time series model (BSM) consists of the three components seen in (3.1). The state vector of such a model is:

at = (µt, νt, γt, γt−1, ..., γ_{t−s+2})T (3.9)


and the system matrices

Zt = (Z[µ], Z[γ]), Tt = diag(T[µ], T[γ]), Rt = diag(R[µ], R[γ]), Qt = diag(Q[µ], Q[γ]), (3.10)

where

Z[µ] = (1, 0), Z[γ] = (1, 0, ..., 0),

T[µ] = [ 1 1 ; 0 1 ],   T[γ] = [ −1 −1 ⋯ −1 −1 ; 1 0 ⋯ 0 0 ; 0 1 ⋯ 0 0 ; ⋮ ⋮ ⋱ ⋮ ⋮ ; 0 0 ⋯ 1 0 ],

R[µ] = I2, R[γ] = (1, 0, ..., 0)T,

Q[µ] = diag(σξ², σζ²), Q[γ] = σω².

ARIMA Models in State Space Form

Box & Jenkins (2008) considered that a time series yt consists of the same components as in structural models. However, instead of modelling the components, the trend and seasonality should be eliminated by differencing at the beginning of the analysis. Thus, the resulting time series is transformed into a stationary time series, whose means and covariances remain invariant over time.

In order to present the state space form of the ARIMA model, its properties are briefly defined. The order of the autoregressive integrated moving average model is determined by the non-negative integers (p, d, q), where d denotes the number of differences needed to achieve stationarity. Let ∆yt = yt − yt−1 and ∆²yt = ∆(∆yt) be the first and second differences to eliminate the trend component. Assuming that we have s "months" per "year", the first and second differences to eliminate the seasonal component are ∆_s yt = yt − yt−s and ∆_s² yt = ∆(∆_s yt). When stationarity is achieved, the transformed variable becomes

y*_t = ∆^d ∆_s^D yt,

where d, D = 0, 1, ...; it is then modeled as a stationary autoregressive moving average model ARMA(p, q) given by the following equation:

y*_t = φ1 y*_{t−1} + ... + φp y*_{t−p} + ζt + θ1 ζ_{t−1} + ... + θq ζ_{t−q}, ζt ∼ N(0, σζ²), (3.11)


where ζt is a serially independent series of error terms. Equation (3.11) can be written in the form

y*_t = ∑_{j=1}^{r} φj y*_{t−j} + ζt + ∑_{j=1}^{r−1} θj ζ_{t−j}, t = 1, ..., n, (3.12)

where r = max(p, q + 1) and the coefficients beyond p or q are set to zero.

Durbin & Koopman (2012) demonstrate how to represent ARIMA models in state space form. As an example, the case of an ARMA(p, q) is considered, where no differencing is needed, that is d = D = 0, so we proceed to model the series (3.12) where y*_t = yt. Usually a constant term is included, but in this example it is omitted for simplicity.

Zt = (1, 0, 0, ..., 0),

at = ( yt,
       φ2 yt−1 + ... + φr y_{t−r+1} + θ1 ζt + ... + θ_{r−1} ζ_{t−r+2},
       φ3 yt−1 + ... + φr y_{t−r+2} + θ2 ζt + ... + θ_{r−1} ζ_{t−r+3},
       ...,
       φr yt−1 + θ_{r−1} ζt )T, (3.13)

and the state equation for at+1, as given by equation (3.8), has

Tt = T = [ φ1 1 ⋯ 0 ; ⋮ ⋱ ; φ_{r−1} 0 ⋯ 1 ; φr 0 ⋯ 0 ],   Rt = R = (1, θ1, ..., θ_{r−1})T,   ηt = ζ_{t+1}. (3.14)

Together with the observation equation yt = Zt at, this is equivalent to (3.12), but in the form of (3.8).
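This representation is what makes ARIMA-based imputation possible: fit the model, run the Kalman smoother on its state space form, and read the series off the smoothed states. The sketch below mirrors what imputeTS::na.kalman(..., model = "auto.arima") did in its 1.x sources; y_mis stands for a hypothetical series containing NAs.

library(forecast)
fit <- auto.arima(y_mis)                # selects (p, d, q); tolerates NAs
ks  <- KalmanSmooth(y_mis, fit$model)   # stats::KalmanSmooth on the SSM component
idx <- which(is.na(y_mis))
y_imp <- y_mis
y_imp[idx] <- ks$smooth[idx, , drop = FALSE] %*% as.matrix(fit$model$Z)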

Kalman Filter

To better understand the process of estimating the unobserved state equation parameters of αt given all observations, called smoothing, a brief review of the Kalman Filter in general is needed. As previously stated, the state space model consists of two processes, the observed yt and the unobserved αt. Having a sample of n observations, denoted Yn = (y1, ..., yn), we aim to find the best estimate of the state αt, denoted at|n, with error covariance matrix Pt|n = E[(αt − at|n)(αt − at|n)T]. Three different types of inference are presented, depending on the relation between t and n (Tsay, 2010).

• Prediction: forecasting αt or yt given y1, ..., yt−1, where n = t − 1 is the forecast origin.

• Filtering: recovering αt in terms of y1, ..., yt, to remove measurement errors, where n = t.

• Smoothing: αt is estimated in terms of all observations y1, ..., yn.

The mean square error (MSE) is then used to determine the best estimate,

E[(αt − at|n)T(αt − at|n)].

According to this criterion the best estimate is the conditional expectation:

at|n = E(αt|Yn).

This expectation is generally nonlinear and often complicated to calculate, so results are approximated by employing linear filters.

Kalman Smoothing

The objective is to estimate the state vector αt of the structural model, given the entire sample Yn. The conditional distribution of αt|Yn is considered to be normal N(ât, Vt), where ât = E(αt|Yn) and Vt = Var(αt|Yn). We call ât the smoothed state, Vt the smoothed state variance, and the process of estimating â1, ..., ân state smoothing. The smoother given the entire sample is the most common one, but various types of smoothers exist:

• A conditional mean E(αt|yt, ..., ys) is called a fixed interval smoother, based on the time interval (t, s).

• The fixed point smoother at|n = E(αt|Yn), for fixed t and n = t + 1, t + 2, ...

• The fixed lag smoother a_{n−j|n} = E(α_{n−j}|Yn), for a fixed positive integer j and n = j + 1, j + 2, ...


The following approach by Durbin & Koopman (2012) estimates the mean vectors and variance matrices assuming normality. However, the results are proven to be valid without the assumption of normality, in the context of the minimum variance linear unbiased estimator and that of Bayesian analysis, for the first two moments.

Smoothed state

The smoothed state vector estimate is given by

ât = E(αt|Yn) = E(αt|Yt−1, vt:n) = at + ∑_{j=t}^{n} cov(αt, vj) F_j^{-1} vj, (3.15)

where vt:n = (v′t, ..., v′n)′, E(αt|Yt−1) = at for t = 1, ..., n, cov denotes the covariance of the conditional distribution given Yt−1, and

Fj = Var(vj|Yt−1).

The prediction errors v1, ..., vn are mutually independent and independent of y1, ..., yt−1, with zero means, where vt = yt − Zt at, or alternatively, in the innovation analogue of the state space model,

vt = Zt xt + εt
xt+1 = Lt xt + Rt ηt − Kt εt, (3.16)

where the state estimation errors xt are defined by xt = αt − at, with Var(xt) = Pt, t = 1, ..., n.

Having in mind equation (3.16),

cov(αt, vj) = E[αt v′j|Yt−1] = E[αt (Zj xj + εj)′|Yt−1] = E[αt x′j|Yt−1] Z′j, j = t, ..., n. (3.17)


Furthermore,

E(αt x′t|Yt−1) = E[αt (αt − at)′|Yt−1] = Pt
E(αt x′_{t+1}|Yt−1) = E[αt (Lt xt + Rt ηt − Kt εt)′|Yt−1] = Pt L′t
E(αt x′_{t+2}|Yt−1) = Pt L′t L′_{t+1} (3.18)
⋮
E(αt x′n|Yt−1) = Pt L′t ⋯ L′_{n−1},

using equation (3.16) repeatedly for t + 1, t + 2, ..., where L′t ⋯ L′_{n−1} is taken to be the identity matrix Im when t = n and L′_{n−1} when t = n − 1. From equation (3.15) the smoothed state vector can be expressed as

ât = at + Pt r_{t−1}, (3.19)

where

r_{t−1} = Z′t F_t^{-1} vt + L′t Z′_{t+1} F_{t+1}^{-1} v_{t+1} + ... + L′t L′_{t+1} ⋯ L′_{n−1} Z′n F_n^{-1} vn, (3.20)

for t = n − 2, n − 3, ..., 1. The vector r_{t−1} is a weighted sum of the innovations vj occurring after time t − 1. Similarly,

rt = Z′_{t+1} F_{t+1}^{-1} v_{t+1} + L′_{t+1} Z′_{t+2} F_{t+2}^{-1} v_{t+2} + ... + L′_{t+1} ⋯ L′_{n−1} Z′n F_n^{-1} vn, (3.21)

and rn = 0, since no observations are available after time n. Combining equations (3.20) and (3.21), we obtain

r_{t−1} = Z′t F_t^{-1} vt + L′t rt, t = n, ..., 1, (3.22)

with rn = 0.

Equations (3.19) and (3.22) provide the recursion equations for state smoothing, an efficient algorithm for estimating â1, ..., ân.

Smoothed state variance

The smoothed state variance is derived analogously:

Vt = Var(αt|Yt−1, vt:n) = Pt − ∑_{j=t}^{n} cov(αt, vj) F_j^{-1} cov(αt, vj)′, (3.23)


where cov(αt, vj) and Fj are as in equation (3.15). Following the previous methodology we have

Vt = Pt − PtNt−1Pt, (3.24)

where

N_{t−1} = Z′t F_t^{-1} Zt + L′t Z′_{t+1} F_{t+1}^{-1} Z_{t+1} Lt + ... + L′t ⋯ L′_{n−1} Z′n F_n^{-1} Zn L_{n−1} ⋯ Lt. (3.25)

The value at time t is given by

Nt = Z′_{t+1} F_{t+1}^{-1} Z_{t+1} + L′_{t+1} Z′_{t+2} F_{t+2}^{-1} Z_{t+2} L_{t+1} + ... + L′_{t+1} ⋯ L′_{n−1} Z′n F_n^{-1} Zn L_{n−1} ⋯ L_{t+1}. (3.26)

It is observed that N_{n−1} = Z′n F_n^{-1} Zn in equation (3.25), with initial value Nn = 0. Combining (3.25) and (3.26), Vt is estimated by the recursion

N_{t−1} = Z′t F_t^{-1} Zt + L′t Nt Lt
Vt = Pt − Pt N_{t−1} Pt, (3.27)

for t = n, ..., 1. Since v_{t+1}, ..., vn are independent, it follows from (3.23) and (3.26) that Nt = Var(rt).

The state smoothing recursions collectively are:

r_{t−1} = Z′t F_t^{-1} vt + L′t rt,    N_{t−1} = Z′t F_t^{-1} Zt + L′t Nt Lt,
ât = at + Pt r_{t−1},    Vt = Pt − Pt N_{t−1} Pt,

for t = n, ..., 1, initialized with rn = 0 and Nn = 0. Together they form the Kalman Smoother, which consists of the backwards recursions through the observations for estimating ât and Vt for t = 1, ..., n.
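Base R already implements these recursions: StructTS() fits a structural model by maximum likelihood (series with NAs are allowed) and tsSmooth() returns the smoothed states ât. A minimal imputation sketch for a BSM, assuming the smoothed observation at a gap is the smoothed level plus the smoothed seasonal:

y <- log(AirPassengers)
y[c(30, 31, 32, 90)] <- NA          # create a few gaps
fit <- StructTS(y, type = "BSM")
sm  <- tsSmooth(fit)                # smoothed states; column 1: level, column 3: seasonal
miss <- is.na(y)
y_imp <- y
y_imp[miss] <- sm[miss, 1] + sm[miss, 3]   # yhat_t = mu_t + gamma_t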


Chapter 4

Interpolation

The following definition of interpolation is given in the Encyclopedia of Mathematics (Kaz'min, 2011): interpolation is the approximation of a function of a certain class by its known values, or by known values of its derivatives, at given points. Suppose a set of n + 1 points {xk}, k = 0, ..., n, on the segment ∆ = [a, b], where a ≤ x0 ≤ ... ≤ xn ≤ b, and a corresponding set of values {yk}, k = 0, ..., n. Suppose that there is a known function f, belonging to a fixed class of functions K defined at least on ∆, satisfying the conditions

f ∈ K, f(xk) = yk, k = 0, ..., n. (4.1)

The points xk at which the values f(xk) = yk are given are called the interpolation nodes of f. The method by which we derive information about f(x) on the intervals (x_{k−1}, xk), k = 1, ..., n, that is, between the nodes, is called interpolation. To approximate a function f of class K, the Lagrange interpolation polynomial is employed:

L_n^f(x) = ∑_{k=0}^{n} yk · [(x − x0) ⋯ (x − x_{k−1})(x − x_{k+1}) ⋯ (x − xn)] / [(xk − x0) ⋯ (xk − x_{k−1})(xk − x_{k+1}) ⋯ (xk − xn)]. (4.2)

The polynomial L_n^f allows one to assess the behaviour of f on ∆, if it is considered that

f(x) ≈ L_n^f(x), x ∈ ∆. (4.3)

The error of the estimation is given by

R_n^f = f(x) − L_n^f(x) = [f^{(n+1)}(ξ) / (n + 1)!] ∏_{k=0}^{n} (x − xk), ξ ∈ [a, b]. (4.4)
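Formula (4.2) transcribes directly into a few lines of R; the node values below are illustrative.

lagrange <- function(x, y, x0) {
  # evaluate the Lagrange interpolation polynomial through (x, y) at x0
  sum(sapply(seq_along(x), function(k) {
    y[k] * prod((x0 - x[-k]) / (x[k] - x[-k]))
  }))
}
lagrange(x = c(0, 1, 2), y = c(1, 2, 5), x0 = 1.5)   # nodes of y = x^2 + 1, returns 3.25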


Linear Interpolation

Linear interpolation (Samarin, 2012) is a method for approximating the value of a function f(x) using a linear function. That is,

L(x) = a(x − xk) + b, (4.5)

where the parameters a, b are chosen in such a way that the values of L(x) are in accordance with the values of f(x); for example, at given points x1, x2,

L(x1) = f(x1), L(x2) = f(x2). (4.6)

These conditions are satisfied by

L(x) = [(f(x2) − f(x1)) / (x2 − x1)] (x − x1) + f(x1), (4.7)

which approximates the given function f(x) in the interval [x1, x2], with error

Rf = f(x) − L(x) = [f″(ξ)/2] (x − x1)(x − x2), ξ ∈ [x1, x2]. (4.8)

Spline Interpolation

Respectively, spline interpolation uses piecewise polynomials to approximate a function f(x). Spline interpolation is employed to solve a range of variational problems; however, the spline should satisfy further conditions at the end points. For the cubic spline S3(∆n, x), where ∆n is the partition a = x0 ≤ x1 ≤ ... ≤ xn = b, which on [a, b] is constructed from piecewise cubic polynomials and has a continuous second order derivative, it is necessary that S3(∆n, xi) = f(xi), while at the same time there is one condition at each end point, i.e.

S′3(∆n, a) = y′0, S′3(∆n, b) = y′n,
S″3(∆n, a) = y″0, S″3(∆n, b) = y″n.

If the values of f(xi) derive from a (b − a)-periodic function, then the spline function is also required to be (b − a)-periodic. The number of extra conditions at each end point a, b increases to k for polynomial splines of degree 2k + 1. For splines of degree 2k, the knots of the spline, at which the 2k-th derivative is discontinuous, are usually chosen in the middle between the points xi, and a further k conditions apply at the end points a and b (Subbotin, 2011).

Interpolation splines satisfy the relation

∫_a^b [f^{(k)}(t) − S^{(k)}_{2k−1}(∆n, t)]² dt = ∫_a^b [f^{(k)}(t)]² dt − ∫_a^b [S^{(k)}_{2k−1}(∆n, t)]² dt. (4.9)


Chapter 5

Preliminary Data Analysis

A univariate time series is considered as a sequence of single observations at successive points in time t1, ..., tn. It is assumed that the time series under study are equi-spaced, meaning that the time between successive points is equal:

|t1 − t2| = |t2 − t3| = ... = |t_{n−1} − tn|.

In order to compare the performance of the imputation algorithms, three different univariate time series were chosen.

• Simulated Data

The simulated data consist of one hundred (n = 100) observations in time series format from the Normal distribution N(25, 4), i.e. with mean 25 and standard deviation 4.

• Bank of America

The Bank of America dataset consists of monthly observations (n = 443), from 3/1980 until 1/2017, of the adjusted closing price of the Bank of America stock. Data source: Yahoo! Finance (2007).

• France Flu

The France flu dataset consists of weekly estimates (n = 620), from 28/9/2003 to 8/9/2015, of influenza-like illness (ILI) outbreaks based on Google search patterns in France. Data source: The Google Flu and Dengue Trends Team (2015).


Figure 5.1: Visualization of Time series datasets

The datasets chosen for the experiments on imputation algorithms represent various time series characteristics, which are explained in the following section.

Data Characteristics

Decomposition

The behaviour of a time series can be described by the various patterns present. Those patterns are categorized into components, each representing a certain characteristic, which are then isolated and examined separately. The original time series can be reconstructed by addition or multiplication of these components.

Before the implementation of the imputation algorithms, the time series are decomposed into the components trend, seasonality and random, to examine their characteristics, using the method of loess (non-parametric local regression) decomposition, or STL (Seasonal and Trend decomposition using Loess), in which the time series is partitioned into seasonal cycles and a local quadratic polynomial is fitted. The seasonality is detected by loess smoothing. The seasonal values are then removed and the remainder is again smoothed to determine the trend component. This process is iterated a few times. The residuals comprise the random noise (Cleveland et al., 1990).

Figures (5.2)−(5.4) display the loess decomposition of the time series under study. The stl function is available in the "stats" R package (R Core Team and contributors worldwide, 2017); a minimal call is sketched below.
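The decomposition call behind these figures, as listed in Appendix A:

f <- ts(rnorm(100, mean = 25, sd = 4), frequency = 12)
plot(stl(f, s.window = "periodic"))   # panels: data, seasonal, trend, remainder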

Figure 5.2: STL Decomposition of Simulated Data


Figure 5.3: STL Decomposition of Bank of America Data

Figure 5.4: STL Decomposition of France Flu Data


Looking at Figures (5.2)−(5.4), it becomes evident why these three datasets have been chosen for the analysis:

• The simulated dataset shows no apparent trend or seasonality, consisting mainly of random noise; typical behaviour of simulated data with a time-invariant mean and variance.

• The Bank of America stock data shows a clear trend but no regular seasonality, which is quite common for long-term observation of stock prices.

• In the France flu data series, as expected, the dominant characteristic is seasonality.

Autocorrelation

Autocorrelation (Box et al., 2008) is a measure that quantifies the relationship between the time series and a lagged version of itself. The estimate of the autocorrelation of a discrete process y1, ..., yn is obtained by equation (5.1) as

φ(k) = (1/((n − k)σ²)) ∑_{t=1}^{n−k} (yt − µ)(y_{t+k} − µ). (5.1)

In time series analysis, forecasting and imputation methods are based on the idea that the future depends on past observations. A strong autocorrelation indicates that the future is highly related to the past, and as a result more reliable predictions and imputations can be made.

The values of the autocorrelation function range within [−1, 1], with +1 indicating perfect correlation, −1 perfect anti-correlation, and a value of zero meaning that the values at hand have no correlation.

In Figures (5.5)−(5.7) the values of the autocorrelation function are displayed, using the R package "forecast" (Hyndman, 2016), for the time series that were used in our experiments. The lag is returned and plotted in units of time, not in numbers of observations. The blue line marks the confidence interval, under which the values are not statistically significant. The approximate 95% confidence interval for the autocorrelation function is

±2/√N.
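The corresponding call, as in Appendix A; base R's acf() draws these ±2/√N bounds automatically:

acf(f, main = "Simulated Data")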


Figure 5.5: Autocorrelation Function of Simulated Data

In Figure (5.5) the simulated data do not show significant autocorrelation. The simulated dataset was chosen in order to observe how imputation algorithms perform on series that exhibit random noise characteristics. Looking at Figure (5.6), it can be seen that there is a strong positive autocorrelation, decreasing slowly over time; despite the fact that we have performed a logarithmic transformation, a common approach when analyzing stock market data, the time series still appears to be non-stationary. In Figure (5.7) the autocorrelation goes from positive to negative, indicating seasonality.


Figure 5.6: Autocorrelation Function of Bank of America Data

Figure 5.7: Autocorrelation Function of France Flu Data


Chapter 6

Implementation of Imputation Algorithms

The purpose of the following experiments is to evaluate and quantify the performance of imputation algorithms for univariate time series data. The statistical package R was employed to carry out the experiments; the R code created is available in Appendix A.

It is obvious that we cannot test imputation algorithms on real missing values, because they are not actually observed and cannot be compared with the imputed values. Therefore, the need arises to simulate missing values on complete data sets, apply the imputation algorithms and compare the results to the original complete time series.

Simulation of Missing Data

In the first chapter, we reviewed the theoretical context for artificially creating missing values, as presented by Little and Rubin (2002). The challenge is to remove observed values from the complete data without violating the assumption that the data are either MCAR, or equivalently in this case MAR, which is essential for the implementation of the imputation algorithms.


Algorithm for Generating Missing Data

Following the theoretical guideline by Little and Rubin (2002), an algorithm was created for producing missing values in a complete dataset that abides by the abovementioned limitations.

The user provides a complete univariate time series and the desired level of missingness. The algorithm then computes the missing data indicator, which is a Bernoulli random variable with probability of success equal to the level of missingness inputted by the user. When the missing data indicator takes the value 1, the corresponding value in the time series is replaced by NA (Not Available) and from then on is considered to be missing. Afterwards, the algorithm groups available and missing data to perform a Welch's t-test (see Chapter 1), to examine whether the data are derived from the MCAR mechanism.
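A usage sketch of the miss.gen() generator listed in Appendix A:

set.seed(10)
y <- ts(rnorm(100, mean = 25, sd = 4), frequency = 12)
y_mis <- miss.gen(y, level.mis = 0.25)   # roughly 25% of the values become NA;
                                         # the function also prints the t-test verdict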

The characteristics of the algorithm created for the experiments are displayed collectively below:

• Missing Data Mechanism: MCAR

• Missing Data Distribution: Bernoulli

• Probability of Missing Data: adaptable.

Implementation

For the experimental part, the three complete time series (Simulated, Bank of America and France Flu) described in the previous chapter were used, creating missing values at four rates of missing data (0.1, 0.25, 0.5, 0.8). As already mentioned, the rate of missing data (level of missingness) represents the parameter p of the Bernoulli distribution. As a reminder, the probability mass function of the Bernoulli distribution is

f(x; p) = p^x (1 − p)^{1−x}, for x ∈ {0, 1}.

Consequently, in various realizations of the same series, the amount of missing values can be slightly different for the same rates of missing data. Since the results of the imputation algorithms can be influenced by the pattern of missing data, the function generating the missing data was run with 30 different random seeds, in order to randomize the results. Our results were based on experiments over 30 random seeds and 4 levels of missingness, implementing 6 imputation algorithms, that is, 720 runs for each dataset.

R package ImputeTS

The statistical software R provides a plethora of packages for handling missing data. The appropriate package to choose for our experiments is ImputeTS (Moritz, 2016), which aggregates various types of algorithms especially designed for the imputation of missing values in univariate time series. Below, the implemented algorithms are briefly described; their theoretical context has been thoroughly reviewed in the previous chapters.

• na.kalman(structTS)
Missing values are imputed by Kalman Smoothing on state space models or on the state space representation of an ARIMA model. The structTS option fits a BSM or a Local Level model, depending on the composition of the time series.

• na.kalman(arima)
The arima option uses the state space representation of an ARIMA model obtained by the function auto.arima, which returns the optimal model according to the AIC.

• na.interpolation(linear)

• na.interpolation(spline)
Missing values are replaced by points calculated via linear interpolation and spline interpolation, respectively.

• na.MA(exponential)
Missing values are imputed by the Exponentially Weighted Moving Average.

• na.mean
Missing values are replaced by the overall mean.
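Collected in one place, the six calls look as follows, with the imputeTS 1.x names used in this thesis (imputeTS 3.x renamed them to na_kalman, na_interpolation, na_ma and na_mean); y_mis is a series with artificial NAs, e.g. from miss.gen() above.

library(imputeTS)
imp_structts <- na.kalman(y_mis, model = "StructTS")
imp_arima    <- na.kalman(y_mis, model = "auto.arima")
imp_linear   <- na.interpolation(y_mis, option = "linear")
imp_spline   <- na.interpolation(y_mis, option = "spline")
imp_ewma     <- na.ma(y_mis, weighting = "exponential")
imp_mean     <- na.mean(y_mis)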


Evaluating Imputation Accuracy

For the purpose of quantifying the performance of the imputation algorithms, two error metrics were employed: the mean root square error (MRSE) and the mean absolute percentage error (MAPE). Both are popular in the time series literature for evaluating forecasting and imputation models. Using both measures is crucial in order to objectively compare and evaluate the algorithms.

For example, the higher values of a time series with an intense trend would impact the MRSE more than values on the lower part. On the contrary, MAPE has the advantage of being scale independent. However, MAPE has the disadvantage of producing extreme values when any yt is close to zero (Hyndman and Athanasopoulos, 2013).

Let yt denote the tth observation of the complete series, ŷt the imputed value and n the number of missing values in the realization of the time series for a certain random seed and rate of missing data. Then the MRSE is given by

MRSE(y, ŷ) = √( ∑_{t=1}^{n} (yt − ŷt)² / n ) (6.1)

and the MAPE is given by

MAPE(y, ŷ) = (100/n) ∑_{t=1}^{n} |(yt − ŷt)/yt|. (6.2)
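Both metrics transcribe directly into R, evaluated on the imputed positions only; mis is the logical index of the artificially removed values.

mrse <- function(y, y_imp, mis) sqrt(mean((y[mis] - y_imp[mis])^2))
mape <- function(y, y_imp, mis) 100 * mean(abs((y[mis] - y_imp[mis]) / y[mis]))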


Chapter 7

Cross Validation of Imputation Algorithms

In order to quantify the performance of the imputation algorithms, the following steps were taken:

1. Acquire a complete time series.

2. Create missing values.

3. Apply the imputation algorithms.

4. Calculate errors and compare the differences between the complete and the imputed time series.

To compare the different R functions for univariate time series imputation, two types of errors were calculated for every dataset, namely the Mean Root Squared Error (MRSE) and the Mean Absolute Percentage Error (MAPE); a condensed sketch of the experiment loop follows.
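This is a minimal sketch under the assumption that miss.gen() (Appendix A), mrse() and mape() (Chapter 6 sketch) are defined, with y the complete series; one of the six algorithms is shown.

library(imputeTS)
rates <- c(0.1, 0.25, 0.5, 0.8)
errs  <- expand.grid(seed = 1:30, rate = rates, mrse = NA, mape = NA)
for (i in seq_len(nrow(errs))) {
  set.seed(errs$seed[i])
  y_mis <- miss.gen(y, errs$rate[i])
  y_imp <- na.interpolation(y_mis)   # swap in any of the six algorithms
  mis   <- is.na(y_mis)
  errs$mrse[i] <- mrse(y, y_imp, mis)
  errs$mape[i] <- mape(y, y_imp, mis)
}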

Simulated Data

Figures (7.1), (7.4) and (7.7) were produced with the "Amelia II" R package (Honaker et al., 2015). They display the missing data pattern for different levels of missingness, in one realization of the time series.


Figure 7.1: Visualization of Missing Values of the Simulated Data set

The following figures were designed using the R package "ggplot2" (Wickham et al., 2016). Each point represents the value of the error of the imputation result for one realization of the time series, due to the different random seeds, and the colors mark the missingness rate.

In Figures (7.2) & (7.3), we can observe that the best performing algorithms for the simulated data set are Kalman Smoothing with a Structural model and Mean Imputation, followed by the EWMA and linear interpolation models. It is worth mentioning that these algorithms perform better at higher rates of missing values. Spline interpolation and the Kalman ARIMA algorithm, compared with the rest, show the opposite behavior, since their errors are relatively higher, especially for a missing rate equal to 0.8, where they escalate abruptly. The simulated data set has neither trend nor seasonality, being almost white noise, thus the algorithms are not expected to perform well, in the sense that they cannot unfold their full capabilities.


Figure 7.2: MRSE of Simulated Data


Figure 7.3: MAPE of Simulated Data


Bank of America Data

Figure 7.4: Visualization of Missing Values Bank of America Data set

In Figures (7.5) & (7.6) the MRSE and the MAPE are depicted for the Bank of America time series. It is obvious that in this data set, where there is a clear trend, Mean Imputation is the worst performing algorithm. Overall, the rest of the algorithms behave in a similar way, despite the occurrence of a few outliers in some instances. Furthermore, missing rate and errors share a clear pattern, with the errors increasing as the missing rate grows. It can be observed that the Kalman Structural model and linear interpolation produce almost identical results; this could happen because the main characteristic of the Bank of America time series is trend, which can be adequately expressed either by a linear model or by a local trend model. The EWMA model follows on nearly the same level as Kalman Structural and linear interpolation, and so does spline interpolation, with an observable weakness for a rate of missingness equal to 0.8. The ARIMA model produces a few extreme error values, even for small rates of missing values.


Figure 7.5: MRSE of Bank of America Data


Figure 7.6: MAPE of Bank of America Data


France Flu Data

Figure 7.7: Visualization of Missing Values of the France Flu Data

In Figures (7.8) & (7.9) the MRSE and MAPE are shown for the France Flu time series, the main characteristic of which is strong seasonality. Again, the Kalman Structural model and linear interpolation show the best results, with EWMA following closely. The imputation errors of the Arima model show greater variance for all rates of missing values. Mean Imputation once again shows the poorest results; given the main characteristic of the data, another imputation strategy should be followed and the mean should be avoided. It is very interesting that as the missingness rate rises, the algorithms fail to identify the seasonal pattern of the data, especially for a rate equal to 0.8, where the errors even of the best performing algorithm, Kalman Structural, explode.
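A practical caveat, which is our observation about the code in Appendix A rather than a finding of the thesis: passing the series through as.numeric() strips the ts frequency, so the default structural model fitted by na.kalman() contains no seasonal component; if the ts class is kept, stats::StructTS() switches to a basic structural model with a seasonal term for series with frequency greater than one.

require(imputeTS)
set.seed(3)
x <- ts(10 + 5 * sin(2 * pi * (1:120) / 12) + rnorm(120, sd = 0.5), frequency = 12)
x.mis <- x
x.mis[sample(120, 30)] <- NA
imp.seasonal <- na.kalman(x.mis)              # ts input: seasonal structural model
imp.plain    <- na.kalman(as.numeric(x.mis))  # numeric input: local trend only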


Figure 7.8: MRSE of France Flu Data


Figure 7.9: MAPE of France Flu Data


Conclusions

Missing data become the first obstacle when designing predictive models, as most statistical methods are premised on complete data without missing values. Thus, it is important to be familiar with the methods employed for missing data management, as the method of choice will influence the statistical power of the predictive model. In the special case of univariate time series, the "state of the art" techniques cannot be employed, as they rely on inter-variable correlations to estimate missing values. Hence, time series characteristics need to be taken into consideration in order to develop an appropriate and efficient strategy for dealing with missing data. The main scope of this thesis was to compare and quantify the performance of imputation algorithms in the context of univariate time series data. The results of our experiments indicate structural models using Kalman smoothing and linear interpolation as the best performing algorithms for handling missing data in univariate time series.


Appendix A

R Code

# Missing values generator and MCAR test

miss.gen <- function(com.ts, level.mis) {
  mis.ts <- c()
  missing.data <- c()
  # Bernoulli indicators: 1 means the observation is turned into NA
  m.index <- rbinom(length(com.ts), size = 1, prob = level.mis)
  for (i in 1:length(com.ts)) {
    if (m.index[i] == 1) {
      mis.ts[i] <- NA
      missing.data[i] <- com.ts[i]
    } else {
      mis.ts[i] <- com.ts[i]
      missing.data[i] <- NA
    }
  }
  rest.data <- na.omit(mis.ts)
  missing.data <- na.omit(missing.data)
  # Equivalent means of the removed and the retained values provide support
  # for the MCAR mechanism; Welch t-test of H0: equal means.
  MCARtest <- t.test(missing.data, rest.data,
                     alternative = "two.sided", mu = 0)
  if (MCARtest$p.value > 0.05) {
    print("The data are Missing Completely at Random")
  } else {
    print("The data are Missing at Random")
  }
  return(mis.ts)
}
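A quick illustrative usage check of the generator:

set.seed(1)
x <- rnorm(50, mean = 25, sd = 4)
x.mis <- miss.gen(x, 0.25)   # prints the MCAR test verdict as a side effect
sum(is.na(x.mis))            # on average about 0.25 * 50 values are removed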


require(imputeTS)
require(Amelia)
require(ggplot2)
require(forecast)

# Simulated data: preprocess
set.seed(10)
f <- ts(rnorm(100, mean = 25, sd = 4), frequency = 12)
plot(stl(f, s.window = "periodic"), main = "Simulated Data Decomposition")
acf(f, main = "Simulated Data")

seeds <- 30
n <- 100
miss.rate <- c(0.1, 0.25, 0.5, 0.8)
complete.ts <- array(NA, dim = c(n, seeds, length(miss.rate)))
incomplete.ts <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute1 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute2 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute3 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute4 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute5 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute6 <- array(NA, dim = c(n, seeds, length(miss.rate)))
NAs <- array(NA, dim = c(seeds, length(miss.rate)))

for (c in 1:length(miss.rate)) {
  for (i in 1:seeds) {
    set.seed(i)
    complete.ts[, i, c] <- ts(rnorm(n, mean = 25, sd = 4), frequency = 12)
    incomplete.ts[, i, c] <- miss.gen(complete.ts[, i, c], miss.rate[c])

    # count the NAs actually generated for this realization
    b <- 0
    for (a in 1:length(incomplete.ts[, i, c])) {
      if (is.na(incomplete.ts[a, i, c])) { b <- b + 1 }
    }
    NAs[i, c] <- b

    incomp.ts <- as.numeric(incomplete.ts[, i, c])
    Impute1[, i, c] <- na.kalman(incomp.ts, model = "auto.arima")
    Impute2[, i, c] <- na.kalman(incomp.ts)
    Impute3[, i, c] <- na.interpolation(incomp.ts)
    Impute4[, i, c] <- na.interpolation(incomp.ts, option = "spline")
    Impute5[, i, c] <- na.ma(incomp.ts, weighting = "exponential")
    Impute6[, i, c] <- na.mean(incomp.ts)
  }
}


# visualise the missing-data pattern for one realization (seed 5)
inc <- as.data.frame(incomplete.ts[, 5, ])
names(inc) <- c("0.1", "0.25", "0.5", "0.8")
missmap(inc)

# France Flu data
Data <- read.csv("Influenza.csv")
Timeseries <- ts(Data[, 2], start = c(2003, 9, 28), end = c(2015, 9, 8),
                 frequency = 52)
plot(stl(Timeseries, s.window = "periodic"),
     main = "France Flu Data Decomposition")
acf(Timeseries, main = "France Flu Data")

complete.ts <- Timeseries
seeds <- 30
n <- length(complete.ts)
miss.rate <- c(0.1, 0.25, 0.5, 0.8)
incomplete.ts <- array(NA, dim = c(n, seeds, length(miss.rate)))
NAs <- array(NA, dim = c(seeds, length(miss.rate)))
Impute1 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute2 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute3 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute4 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute5 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute6 <- array(NA, dim = c(n, seeds, length(miss.rate)))

for (c in 1:length(miss.rate)) {
  for (i in 1:seeds) {
    set.seed(i)
    incomplete.ts[, i, c] <- ts(miss.gen(complete.ts, miss.rate[c]))

    b <- 0
    for (a in 1:length(complete.ts)) {
      if (is.na(incomplete.ts[a, i, c])) { b <- b + 1 }
    }
    NAs[i, c] <- b

    incomp.ts <- as.numeric(incomplete.ts[, i, c])
    Impute1[, i, c] <- na.kalman(incomp.ts, model = "auto.arima")
    Impute2[, i, c] <- na.kalman(incomp.ts)
    Impute3[, i, c] <- na.interpolation(incomp.ts)
    Impute4[, i, c] <- na.interpolation(incomp.ts, option = "spline")
    Impute5[, i, c] <- na.ma(incomp.ts, weighting = "exponential")
    Impute6[, i, c] <- na.mean(incomp.ts)
  }
}

# visualization of missing values
inc <- as.data.frame(incomplete.ts[, 5, ])
names(inc) <- c("0.1", "0.25", "0.5", "0.8")
missmap(inc)

# Bank of America data
DATA <- read.csv("BAC.csv")
data <- rev(DATA$Adj.Close)


Timeseries <- ts(data, start = c(1980, 3), end = c(2017, 1), frequency = 12)
real.ts <- log(Timeseries)
acf(real.ts, main = "Bank of America")
pacf(real.ts)
plot.ts(real.ts)
plot(stl(real.ts, s.window = "periodic"),
     main = "Bank of America Data Decomposition")

complete.ts <- real.ts
seeds <- 30
n <- length(complete.ts)
miss.rate <- c(0.1, 0.25, 0.5, 0.8)
incomplete.ts <- array(NA, dim = c(n, seeds, length(miss.rate)))
NAs <- array(NA, dim = c(seeds, length(miss.rate)))
Impute1 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute2 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute3 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute4 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute5 <- array(NA, dim = c(n, seeds, length(miss.rate)))
Impute6 <- array(NA, dim = c(n, seeds, length(miss.rate)))

for (c in 1:length(miss.rate)) {
  for (i in 1:seeds) {
    set.seed(i)
    incomplete.ts[, i, c] <- ts(miss.gen(complete.ts, miss.rate[c]))

    b <- 0
    for (a in 1:length(complete.ts)) {
      if (is.na(incomplete.ts[a, i, c])) { b <- b + 1 }
    }
    NAs[i, c] <- b

    incomp.ts <- as.numeric(incomplete.ts[, i, c])
    Impute1[, i, c] <- na.kalman(incomp.ts, model = "auto.arima")
    Impute2[, i, c] <- na.kalman(incomp.ts)
    Impute3[, i, c] <- na.interpolation(incomp.ts)
    Impute4[, i, c] <- na.interpolation(incomp.ts, option = "spline")
    Impute5[, i, c] <- na.ma(incomp.ts, weighting = "exponential")
    Impute6[, i, c] <- na.mean(incomp.ts)
  }
}

# visualization of missing values
inc <- as.data.frame(incomplete.ts[, 5, ])
names(inc) <- c("0.1", "0.25", "0.5", "0.8")
missmap(inc)

# errors (simulated data)
MRSE1 <- array(NA, dim = c(seeds, length(miss.rate)))
MRSE2 <- array(NA, dim = c(seeds, length(miss.rate)))
MRSE3 <- array(NA, dim = c(seeds, length(miss.rate)))
MRSE4 <- array(NA, dim = c(seeds, length(miss.rate)))
MRSE5 <- array(NA, dim = c(seeds, length(miss.rate)))


MRSE6 <- array(NA, dim = c(seeds, length(miss.rate)))

MAPE1 <- array(NA, dim = c(seeds, length(miss.rate)))
MAPE2 <- array(NA, dim = c(seeds, length(miss.rate)))
MAPE3 <- array(NA, dim = c(seeds, length(miss.rate)))
MAPE4 <- array(NA, dim = c(seeds, length(miss.rate)))
MAPE5 <- array(NA, dim = c(seeds, length(miss.rate)))
MAPE6 <- array(NA, dim = c(seeds, length(miss.rate)))

for (l in 1:length(miss.rate)) {
  for (f in 1:seeds) {
    MRSE1[f, l] <- sqrt(sum((Impute1[, f, l] - complete.ts[, f, l])^2) / NAs[f, l])
    MRSE2[f, l] <- sqrt(sum((Impute2[, f, l] - complete.ts[, f, l])^2) / NAs[f, l])
    MRSE3[f, l] <- sqrt(sum((Impute3[, f, l] - complete.ts[, f, l])^2) / NAs[f, l])
    MRSE4[f, l] <- sqrt(sum((Impute4[, f, l] - complete.ts[, f, l])^2) / NAs[f, l])
    MRSE5[f, l] <- sqrt(sum((Impute5[, f, l] - complete.ts[, f, l])^2) / NAs[f, l])
    MRSE6[f, l] <- sqrt(sum((Impute6[, f, l] - complete.ts[, f, l])^2) / NAs[f, l])

    MAPE1[f, l] <- (100 / NAs[f, l]) * sum(abs((Impute1[, f, l] - complete.ts[, f, l]) / complete.ts[, f, l]))
    MAPE2[f, l] <- (100 / NAs[f, l]) * sum(abs((Impute2[, f, l] - complete.ts[, f, l]) / complete.ts[, f, l]))
    MAPE3[f, l] <- (100 / NAs[f, l]) * sum(abs((Impute3[, f, l] - complete.ts[, f, l]) / complete.ts[, f, l]))
    MAPE4[f, l] <- (100 / NAs[f, l]) * sum(abs((Impute4[, f, l] - complete.ts[, f, l]) / complete.ts[, f, l]))
    MAPE5[f, l] <- (100 / NAs[f, l]) * sum(abs((Impute5[, f, l] - complete.ts[, f, l]) / complete.ts[, f, l]))
    MAPE6[f, l] <- (100 / NAs[f, l]) * sum(abs((Impute6[, f, l] - complete.ts[, f, l]) / complete.ts[, f, l]))
  }
}

# errors (real data)

require(ggplot2)
MRSE1 <- array(NA, dim = c(seeds, length(miss.rate)))
MRSE2 <- array(NA, dim = c(seeds, length(miss.rate)))
MRSE3 <- array(NA, dim = c(seeds, length(miss.rate)))
MRSE4 <- array(NA, dim = c(seeds, length(miss.rate)))
MRSE5 <- array(NA, dim = c(seeds, length(miss.rate)))
MRSE6 <- array(NA, dim = c(seeds, length(miss.rate)))


MAPE1 <- array(NA, dim = c(seeds, length(miss.rate)))
MAPE2 <- array(NA, dim = c(seeds, length(miss.rate)))
MAPE3 <- array(NA, dim = c(seeds, length(miss.rate)))
MAPE4 <- array(NA, dim = c(seeds, length(miss.rate)))
MAPE5 <- array(NA, dim = c(seeds, length(miss.rate)))
MAPE6 <- array(NA, dim = c(seeds, length(miss.rate)))

for (l in 1:length(miss.rate)) {
  for (f in 1:seeds) {
    MRSE1[f, l] <- sqrt(sum((Impute1[, f, l] - complete.ts)^2) / NAs[f, l])
    MRSE2[f, l] <- sqrt(sum((Impute2[, f, l] - complete.ts)^2) / NAs[f, l])
    MRSE3[f, l] <- sqrt(sum((Impute3[, f, l] - complete.ts)^2) / NAs[f, l])
    MRSE4[f, l] <- sqrt(sum((Impute4[, f, l] - complete.ts)^2) / NAs[f, l])
    MRSE5[f, l] <- sqrt(sum((Impute5[, f, l] - complete.ts)^2) / NAs[f, l])
    MRSE6[f, l] <- sqrt(sum((Impute6[, f, l] - complete.ts)^2) / NAs[f, l])

    MAPE1[f, l] <- (100 / NAs[f, l]) * sum(abs((Impute1[, f, l] - complete.ts) / complete.ts))
    MAPE2[f, l] <- (100 / NAs[f, l]) * sum(abs((Impute2[, f, l] - complete.ts) / complete.ts))
    MAPE3[f, l] <- (100 / NAs[f, l]) * sum(abs((Impute3[, f, l] - complete.ts) / complete.ts))
    MAPE4[f, l] <- (100 / NAs[f, l]) * sum(abs((Impute4[, f, l] - complete.ts) / complete.ts))
    MAPE5[f, l] <- (100 / NAs[f, l]) * sum(abs((Impute5[, f, l] - complete.ts) / complete.ts))
    MAPE6[f, l] <- (100 / NAs[f, l]) * sum(abs((Impute6[, f, l] - complete.ts) / complete.ts))
  }
}

# visualization (MRSE panels)

missing.rate <- c()
missing.rate[1:30]   <- 0.1
missing.rate[31:60]  <- 0.25
missing.rate[61:90]  <- 0.5
missing.rate[91:120] <- 0.8


z <- cbind(as.vector(NAs), as.vector(MRSE1), missing.rate)
r <- cbind(as.vector(NAs), as.vector(MRSE2), missing.rate)
p <- cbind(as.vector(NAs), as.vector(MRSE3), missing.rate)
q <- cbind(as.vector(NAs), as.vector(MRSE4), missing.rate)
w <- cbind(as.vector(NAs), as.vector(MRSE5), missing.rate)
x <- cbind(as.vector(NAs), as.vector(MRSE6), missing.rate)
data1 <- as.data.frame(z)
data2 <- as.data.frame(r)
data3 <- as.data.frame(p)
data4 <- as.data.frame(q)
data5 <- as.data.frame(w)
data6 <- as.data.frame(x)
names(data1) <- c("NAs", "MRSE", "missing.rate")
names(data2) <- c("NAs", "MRSE", "missing.rate")
names(data3) <- c("NAs", "MRSE", "missing.rate")
names(data4) <- c("NAs", "MRSE", "missing.rate")
names(data5) <- c("NAs", "MRSE", "missing.rate")
names(data6) <- c("NAs", "MRSE", "missing.rate")

require(ggplot2)
plot1 <- ggplot(data = data1, aes(x = NAs, y = MRSE, colour = factor(missing.rate))) +
  geom_point() + ggtitle("MRSE Kalman Arima")
plot2 <- ggplot(data = data2, aes(x = NAs, y = MRSE, colour = factor(missing.rate))) +
  geom_point() + ggtitle("MRSE Kalman Structural")
plot3 <- ggplot(data = data3, aes(x = NAs, y = MRSE, colour = factor(missing.rate))) +
  geom_point() + ggtitle("MRSE linear interpolation")
plot4 <- ggplot(data = data4, aes(x = NAs, y = MRSE, colour = factor(missing.rate))) +
  geom_point() + ggtitle("MRSE spline interpolation")
plot5 <- ggplot(data = data5, aes(x = NAs, y = MRSE, colour = factor(missing.rate))) +
  geom_point() + ggtitle("MRSE EWMA")
plot6 <- ggplot(data = data6, aes(x = NAs, y = MRSE, colour = factor(missing.rate))) +
  geom_point() + ggtitle("MRSE Mean imputation")
multiplot(plot1, plot2, plot3, plot4, plot5, plot6, cols = 3)
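# Note: multiplot() is not part of ggplot2; it is commonly sourced from the
# "Cookbook for R". Assuming the gridExtra package is available, an
# equivalent panel layout would be:
# require(gridExtra)
# grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, ncol = 3)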

# visualization (MAPE panels)

z1 <- cbind(as.vector(NAs), as.vector(MAPE1), missing.rate)
r1 <- cbind(as.vector(NAs), as.vector(MAPE2), missing.rate)
p1 <- cbind(as.vector(NAs), as.vector(MAPE3), missing.rate)
q1 <- cbind(as.vector(NAs), as.vector(MAPE4), missing.rate)
w1 <- cbind(as.vector(NAs), as.vector(MAPE5), missing.rate)
x1 <- cbind(as.vector(NAs), as.vector(MAPE6), missing.rate)


data11 <- as.data.frame(z1)
data22 <- as.data.frame(r1)
data33 <- as.data.frame(p1)
data44 <- as.data.frame(q1)
data55 <- as.data.frame(w1)
data66 <- as.data.frame(x1)
names(data11) <- c("NAs", "MAPE", "missing.rate")
names(data22) <- c("NAs", "MAPE", "missing.rate")
names(data33) <- c("NAs", "MAPE", "missing.rate")
names(data44) <- c("NAs", "MAPE", "missing.rate")
names(data55) <- c("NAs", "MAPE", "missing.rate")
names(data66) <- c("NAs", "MAPE", "missing.rate")

plot11 <- ggplot(data = data11, aes(x = NAs, y = MAPE, colour = factor(missing.rate))) +
  geom_point() + ggtitle("MAPE Kalman Arima")
plot21 <- ggplot(data = data22, aes(x = NAs, y = MAPE, colour = factor(missing.rate))) +
  geom_point() + ggtitle("MAPE Kalman Structural")
plot31 <- ggplot(data = data33, aes(x = NAs, y = MAPE, colour = factor(missing.rate))) +
  geom_point() + ggtitle("MAPE linear interpolation")
plot41 <- ggplot(data = data44, aes(x = NAs, y = MAPE, colour = factor(missing.rate))) +
  geom_point() + ggtitle("MAPE spline interpolation")
plot51 <- ggplot(data = data55, aes(x = NAs, y = MAPE, colour = factor(missing.rate))) +
  geom_point() + ggtitle("MAPE EWMA")
plot61 <- ggplot(data = data66, aes(x = NAs, y = MAPE, colour = factor(missing.rate))) +
  geom_point() + ggtitle("MAPE Mean imputation")
multiplot(plot11, plot21, plot31, plot41, plot51, plot61, cols = 3)


Bibliography

Box, G. E. P., Jenkins, G. M. and Reinsel, G. C. (2008), Time Series Analysis, Fourth Edition, John Wiley & Sons, Inc.

Cleveland, R. B., Cleveland, W. S., McRae, J. E. and Terpenning, I. (1990), 'STL: A Seasonal-Trend Decomposition Procedure Based on Loess', Journal of Official Statistics 6(1), 3-73.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977), 'Maximum Likelihood from Incomplete Data via the EM Algorithm', Journal of the Royal Statistical Society. Series B (Methodological) 39(1), 1-38.

Dixon, W. J. (1988), BMDP Statistical Software Manual to Accompany the 1988 Software Release, University of California Press.

Durbin, J. and Koopman, S. J. (2012), Time Series Analysis by State Space Methods, Second Edition, Oxford University Press.

Enders, C. K. (2010), Applied Missing Data Analysis, The Guilford Press.

Ford, B. (1980), 'An overview of hot deck procedures. Draft paper for Panel on Incomplete Data', Committee on National Statistics, National Academy of Sciences.

Honaker, J., King, G. and Blackwell, M. (2015), AMELIA II: A Program for Missing Data, R Foundation for Statistical Computing.
URL: https://cran.r-project.org/web/packages/Amelia/vignettes/amelia.pdf

Hunter, J. S. (1986), 'The Exponentially Weighted Moving Average', Journal of Quality Technology 18(4), 203-210.

Hyndman, R. (2016), Forecasting Functions for Time Series and Linear Models, R Foundation for Statistical Computing.
URL: https://cran.r-project.org/web/packages/forecast/forecast.pdf


Hyndman, R. J. and Athanasopoulos, G. (2013), Forecasting: principles and practice, OTexts.org.
URL: https://www.otexts.org/fpp

Kaz'min, Y. (2011), Interpolation, Encyclopedia of Mathematics.
URL: http://www.encyclopediaofmath.org/index.php/Interpolation&oldid=19144

Little, R. J. A. (1988), 'A Test of Missing Completely at Random for Multivariate Data with Missing Values', Journal of the American Statistical Association 83(404), 1198-1202.

Little, R. J. A. and Rubin, D. B. (2002), Statistical Analysis with Missing Data, 2nd Edition, John Wiley & Sons, Inc.

Moritz, S. (2016), Time Series Missing Value Imputation, R Foundation for Statistical Computing.
URL: https://cran.r-project.org/web/packages/imputeTS/imputeTS.pdf

Moritz, S., Sarda, A., Bartz-Beielstein, T., Zaefferer, M. and Stork, J. (2015), 'Comparison of different Methods for Univariate Time Series Imputation in R'.
URL: https://arxiv.org/ftp/arxiv/papers/1510/1510.03924.pdf

Prins, J. (2012), 'NIST/SEMATECH e-Handbook of Statistical Methods: Single exponential smoothing'.
URL: http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc431.htm

R Core Team and contributors worldwide (2017), The R Stats Package, R Foundation for Statistical Computing.
URL: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html

Rubin, D. B. (1976), 'Inference and Missing Data', Biometrika 63(3), 581-592.

Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys, John Wiley & Sons, Inc.

Samarin, M. (2012), Linear Interpolation, Encyclopedia of Mathematics.
URL: http://www.encyclopediaofmath.org/index.php/Linear_interpolation&oldid=27068

Subbotin, Y. (2011), Spline interpolation, Encyclopedia of Mathematics.
URL: http://www.encyclopediaofmath.org/index.php?title=Spline_interpolation&oldid=11892


The Google Flu and Dengue Trends Team (2015), 'Google Flu Trends'.
URL: http://www.google.org/flutrends

Tsay, R. S. (2010), Analysis of Financial Time Series, John Wiley & Sons, Inc.

Vacek, P. and Ashikaga, T. (1980), 'An examination of the nearest neighbor rule for imputing missing values', ASA Proceedings of the Statistical Computing Section, pp. 326-331.

Welch, B. (1947), 'The generalization of "Student's" problem when several different population variances are involved', Biometrika 34(1-2), 28-35.

Wickham, H., Chang, W. and RStudio (2016), Create Elegant Data Visualisations Using the Grammar of Graphics, R Foundation for Statistical Computing.
URL: https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf

Yahoo! Finance (2007), 'Bank of America Corporation (BAC) Historical Prices'.
URL: http://finance.yahoo.com/quote/BAC/history?period1=322092000&period2=1487628000&interval=1mo&filter=history&frequency=1mo
