SYSTEMS ANALYSIS LABORATORY
Construction of SARIMAX-models using MATLAB
Mat-2.4108 Independent research projects in applied mathematics
Antti Savelainen, 63220J
9/25/2009
Contents
1 Introduction
2 Existing MATLAB functions for ARMAX-models
3 MATLAB implementation for SARIMAX-models
4 Numerical example
5 Comparison of MATLAB and SAS software
Results
Usability
6 Conclusion
Bibliography
1 Introduction
The course Mat-2.3132 Systems Analysis Laboratory I [12] covers time series analysis and the
implementation of seasonal autoregressive integrated moving average models with an external
input, i.e. SARIMAX-models. The basis of a SARIMAX-model is an ARMA-model, which contains
only autoregressive and moving average parts. On the course, the models are used to forecast a
company's electricity consumption. The task is first to identify an appropriate SARIMA-model [1]
that fits the data; the external data is then added and the model becomes a SARIMAX-model.
The data consists of the company's electricity consumption and the outdoor temperature,
measured at one-hour intervals over a four-week period. The outdoor temperature may be used as
an external variable (the X term) in the model if the two series are correlated. The consumption
and the temperature are plotted with SAS software [15] in Figure 1. The data shows an evident
linear trend and seasonal behavior with periods of at least 24 and 168 hours, so a SARIMAX-model
is a plausible candidate.
The purpose of this research project is to construct a MATLAB [9] implementation, based on
MATLAB's existing functions, for building, identifying, fitting and checking models for time
series, i.e. sequences of successive data points ordered in time. The implementation makes it
possible to use the Box-Jenkins methodology [1] to forecast unknown values of a stochastic time
series. The project produced MATLAB functions to difference a nonstationary time series, to
identify and build an appropriate SARIMAX-model, to check that the model is adequate and to
forecast with the finished model [1]. These tools are then applied in a numerical example to
forecast the company's electricity consumption data given in the course Mat-2.3132 Systems
Analysis Laboratory I.
At present, SAS software is used as the statistics tool for constructing SARIMAX-models. SAS
is able to compute seasonal SARMA-models, ARIMA-models for integrated data, ARMAX-models with
an external variable and all combinations of these. On this course, however, SAS has to be run
on a remote computer over an SSH connection, which is inconvenient. After this research
project, students should be able to use MATLAB to estimate the parameters of a SARIMAX-model on
their own workstations.
Furthermore, this research project compares the capability of MATLAB and SAS to build, identify
and check SARIMAX-models. The two are compared by their numerical results, and their
applicability as statistical programs for SARIMAX-modeling is analyzed. Finally, there is a
short review of alternative programs that could be used on the course Mat-2.3132 Systems
Analysis Laboratory I.
Figure 1 Original consumption (in green) and the temperature (in purple) data plotted with SAS
2 Existing MATLAB functions for ARMAX-models
MATLAB contains a System Identification Toolbox [10], which makes it possible to construct
mathematical models of dynamic systems. The toolbox lets the user fit linear and nonlinear
models to data, whereas the Box-Jenkins methodology aims to fit a suitable linear model to a
time series and then to optimize the parameter values by maximizing the likelihood function [14].
The likelihood function depends on the sample values and the unknown parameters of the model,
and the estimation algorithm finds the parameter values that would most likely have generated
the sample.
MATLAB's System Identification Toolbox contains two functions that made it possible to
implement a statistics tool for constructing SARIMAX-models. The function armax estimates the
parameters of an ARMA- or ARMAX-model; this is the essential capability that was extended to
estimate SARIMAX-models. After the parameters have been estimated, the ARMAX-model is used to
forecast future values of the time series with the function predict.
3 MATLAB implementation for SARIMAX-models
Although MATLAB can estimate the parameters of ARMA- and ARMAX-models, it is not sufficient as
such as a statistical tool on the course Mat-2.3132 Systems Analysis Laboratory I. In
particular, MATLAB lacked ready-made functions for identifying, building and checking
SARIMAX-models.
Identification requires that MATLAB can produce autocorrelation and partial autocorrelation
functions. Both are important when deciding the orders of the AR and MA parts of an ARMA-model
and the order of differencing. Additionally, a cross correlation function is implemented to
examine the correlation between the electricity consumption and the temperature at different
lags. Moreover, a differencing function is needed to difference the data not only at lag 1 but
also at other lags, such as the lengths of the seasonal periods.
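The identification tools described above can be illustrated with a minimal Python sketch. It is not the author's MATLAB code: the names sample_acf and sample_pacf are hypothetical, the autocorrelation uses the usual biased estimator, and the partial autocorrelation is obtained with the Durbin-Levinson recursion, which is one common choice among several.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations r_0..r_max_lag (biased estimator, r_0 = 1)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


def sample_pacf(x, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    rho = sample_acf(x, max_lag)
    pacf = [1.0]
    phi_prev = []  # phi_prev[j] holds phi_{k-1, j+1}
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Illustrative data only: a strictly alternating series.
series = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
acf = sample_acf(series, 3)
pacf = sample_pacf(series, 3)
```

At lag 1 the partial autocorrelation equals the autocorrelation by construction, which is a quick sanity check for any implementation.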
A spectral density function is implemented as the function specdens. The spectral density
measures how the energy of a signal is distributed over frequencies, and it is used to
characterize the properties of the signal. Mathematically, a signal's spectrum is the square of
the absolute value of its Fourier transform [7].
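The definition above (spectrum as the squared magnitude of the Fourier transform) can be sketched in a few lines of Python. The function name periodogram is hypothetical and this is not the author's specdens; it uses a plain discrete Fourier transform on the mean-removed series and no smoothing or scaling.

```python
import cmath
import math


def periodogram(x):
    """Squared magnitude of the DFT at frequency indices 0..n/2."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(centered[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


# Synthetic hourly signal with a 24-hour cycle over ten days (illustrative only).
signal = [math.sin(2 * math.pi * t / 24) for t in range(240)]
spectrum = periodogram(signal)
peak_index = max(range(len(spectrum)), key=spectrum.__getitem__)
period_hours = len(signal) / peak_index  # period = n / frequency index
```

For this synthetic signal the peak sits at frequency index 10, which corresponds to a period of 240 / 10 = 24 hours.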
As Figure 1 shows, the data is clearly nonstationary and needs to be differenced, because
ARMA-models are defined for stationary series. MATLAB lacked a function for differencing at
lags other than one, which was solved by writing a simple differencing function differ.
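A lag-k differencing function of the kind described above can be sketched as follows. The Python function borrows the name differ for readability, but it is only an assumed equivalent of the author's MATLAB function, whose exact interface is not given in the text.

```python
def differ(x, lag=1):
    """Return the lag-differenced series x[t] - x[t - lag]."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]


# Illustrative data: a linear trend plus a pattern that repeats every 4 steps.
pattern = [5, -2, 7, 0]
series = [t + pattern[t % 4] for t in range(20)]

once = differ(series, 1)    # removes the linear trend
twice = differ(once, 4)     # removes the period-4 seasonality
```

Differencing at lag 1 and then at the seasonal lag leaves only zeros for this noise-free example, and each pass shortens the series by its lag.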
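Building seasonal models with the armax function required the most work, because armax is only
capable of constructing ARMA- and ARMAX-models. The seasonal ARMA (SARMA(p,q)x(P,Q)) model is
built as an ordinary ARMA-model in which some of the parameters are locked to zero. For
instance, let the season length of the SAR-part be $S$ and that of the SMA-part be $Q$. First,
a full-length ARMA(S,Q)-model is created

$$\phi(B)\,y_t = \theta(B)\,\varepsilon_t,$$

where $\varepsilon_t$ is white noise, $B$ is the lag operator and $B^k y_t = y_{t-k}$. The
polynomials

$$\phi(B) = 1 - \phi_1 B - \dots - \phi_S B^S$$

and

$$\theta(B) = 1 - \theta_1 B - \dots - \theta_Q B^Q$$

contain the model parameters $\phi_i$ and $\theta_j$ to be estimated. This model is then
modified into a SARMA(p,q)x(P,Q) model by setting the parameters
$[\phi_{p+1},\dots,\phi_{S-1}]$ and $[\theta_{q+1},\dots,\theta_{Q-1}]$ to zero. This procedure
yields the polynomials

$$\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p - \phi_S B^S$$

and

$$\theta(B) = 1 - \theta_1 B - \dots - \theta_q B^q - \theta_Q B^Q.$$

The same approach extends to vector-valued P and Q, i.e. to several seasonal lags.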
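The idea of locking most coefficients of a long ARMA polynomial to zero can be sketched as below. The helper names, the coefficient values and the sign convention (1 minus the weighted lags) are assumptions made for illustration; they do not reproduce the armax interface.

```python
def free_lags(short_order, seasonal_lags):
    """Lags whose coefficients are estimated; every other lag is locked to zero."""
    return sorted(set(range(1, short_order + 1)) | set(seasonal_lags))


def full_polynomial(coeff_by_lag):
    """Expand {lag: coefficient} into [1, -c_1, ..., -c_max], zeros at locked lags."""
    max_lag = max(coeff_by_lag)
    poly = [1.0] + [0.0] * max_lag
    for lag, coeff in coeff_by_lag.items():
        poly[lag] = -coeff
    return poly


# Hypothetical example: two short lags plus seasonal lags 24 and 168.
lags = free_lags(2, [24, 168])
locked = [k for k in range(1, 169) if k not in lags]
ar_poly = full_polynomial({1: 0.5, 2: -0.2, 24: 0.3, 168: 0.4})
```

Here only four of the 168 lag coefficients are free, so the estimation problem stays small even though the polynomial is long.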
Since the ARMA-model is estimated on differenced data, it is in effect a SARIMA-model, and the
estimate needs to be integrated back to the original scale. Two different ways were attempted
to integrate the estimate of a model

$$\phi(B)\,y_t = \beta\,x_t + \theta(B)\,\varepsilon_t,$$

where $x_t$ is the input data and $\beta$ is the input parameter to be estimated.
The first idea is to estimate the SARMAX-model on the differenced data as usual and afterwards
to integrate the estimate by cumulatively summing its values. Intuitively this should work, but
in practice small errors at the beginning of the ex post estimate accumulate, because each
summed value inherits the errors of all the values before it. Consequently, the summing method
did not work well, even though the shape of the estimate is close to that of the original data,
as shown in Figure 2.
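The accumulation of errors under cumulative summation can be shown with a small Python sketch. The data and the constant error of 0.5 per step are made up for illustration; real estimation errors vary from step to step, but they are summed in the same way.

```python
def integrate(diffs, start):
    """Undo lag-1 differencing by cumulative summation from a known start value."""
    out = [start]
    for d in diffs:
        out.append(out[-1] + d)
    return out


original = [10.0, 12.0, 15.0, 14.0, 18.0]
true_diffs = [b - a for a, b in zip(original, original[1:])]
recovered = integrate(true_diffs, original[0])

# Assumed estimate: every difference is off by the same small amount.
biased_diffs = [d + 0.5 for d in true_diffs]
drifted = integrate(biased_diffs, original[0])
errors = [est - obs for est, obs in zip(drifted, original)]
```

With exact differences the original series is recovered, whereas the biased differences give errors that grow by 0.5 at every step.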
Figure 2 Original and estimated ex post forecast integrated by summing the values of the estimate (x-axis: Time (h); y-axis: Consumption (kWh); series: Estimated, Original)
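The other way to perform the integration is to leave the data unintegrated and to revise the
AR-part of the ARIMA-model instead. Let the data in the ARMA-model be differenced at lag $k$ by

$$w_t = (1 - B^k)\,y_t = y_t - y_{t-k}.$$

Substituting this into the ARMA-model $\phi(B)\,w_t = \theta(B)\,\varepsilon_t$, with AR
polynomial $\phi(B)$ and MA polynomial $\theta(B)$, gives
$\phi(B)(1 - B^k)\,y_t = \theta(B)\,\varepsilon_t$, so the differencing operator becomes part
of the AR-part. This is implemented by the function rearrange, which rebuilds the AR-part
accordingly. As a result, the model estimates the integrated data directly, as shown in
Figure 3, and this worked considerably better than summing the values of the estimate. Note
that the rearrangement of the differencing parameters increases the length of the AR-part.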
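Absorbing the differencing into the AR-part amounts to multiplying polynomials in the lag operator. The sketch below shows this with plain coefficient lists; poly_mul and diff_poly are hypothetical helper names, the AR coefficient 0.5 and the season length 4 are illustrative, and the code is not the author's rearrange function.

```python
def poly_mul(a, b):
    """Multiply two lag polynomials given as coefficient lists [c_0, c_1, ...]."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def diff_poly(lag):
    """Coefficients of the differencing operator 1 - B^lag."""
    return [1.0] + [0.0] * (lag - 1) + [-1.0]


ar = [1.0, -0.5]  # AR polynomial 1 - 0.5 B (illustrative value)
combined = poly_mul(poly_mul(ar, diff_poly(1)), diff_poly(4))
```

The combined polynomial has degree 1 + 1 + 4 = 6, which illustrates why the AR-part becomes longer after the rearrangement.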
Figure 3 Original and estimated ex post forecast integrated by re-arranging the AR-part of the model (x-axis: Time (h); y-axis: Consumption (kWh); series: Estimated, Original)
The checking stage of Box-Jenkins modeling is mostly based on analyzing the residuals of an ex
post estimation. The residuals should be normally distributed and uncorrelated with each other.
This is diagnosed by inspecting the autocorrelation and partial autocorrelation functions of
the residuals and a normal probability plot, or with the help of the Ljung-Box test [5],
implemented in MATLAB as the function ljungbox. The test statistic is

$$Q = n(n+2)\sum_{k=1}^{h}\frac{\hat{\rho}_k^2}{n-k},$$

where $n$ is the number of residuals, $h$ is the number of lags tested and $\hat{\rho}_k$ is
the sample autocorrelation of the residuals at lag $k$. The critical region for rejecting the
hypothesis of randomness at significance level $\alpha$ is

$$Q > \chi^2_{1-\alpha,h},$$

where $\chi^2_{1-\alpha,h}$ [3] is the $(1-\alpha)$-quantile of the chi-square distribution
with $h$ degrees of freedom. In practice, it turns out to be hard to find a SARIMAX-model in
MATLAB whose residuals are random according to the Ljung-Box test: the hypothesis of randomness
is rejected at a confidence level of at least 0.9, i.e. at a significance level of 0.1 or
smaller.
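The Ljung-Box statistic is simple to compute from the residual autocorrelations. The following Python sketch uses the standard form of the statistic; the function name ljung_box is hypothetical and the residuals are illustrative, so it should not be read as the author's ljungbox function.

```python
def ljung_box(residuals, h):
    """Ljung-Box Q statistic over lags 1..h."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    c0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, h + 1):
        rho_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        total += rho_k * rho_k / (n - k)
    return n * (n + 2) * total


# Illustrative residuals that are clearly not random: they alternate in sign.
residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
q_stat = ljung_box(residuals, 1)

# 3.841 is the 0.95 quantile of the chi-square distribution with 1 degree of freedom.
reject_at_5_percent = q_stat > 3.841
```

For these alternating residuals the lag-1 autocorrelation is -0.875 and Q = 8.75, so randomness is rejected at the 5 % level.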
4 Numerical example
Consider the same data as in Figure 1. As Figure 1 shows, the electricity consumption is
nonstationary, and there seems to be a linear trend in both the consumption and the
temperature data. This is tested by estimating the model

$$y_t = \phi_1\,y_{t-1} + \varepsilon_t,$$

where $y_t$ is the consumption of electricity. The function arfunc yields an estimate
$\hat{\phi}_1$ that is very close to one, which justifies differencing the data at lag 1: if
$|\phi_1| < 1$ the series $y_t$ is stationary, but when $\phi_1 = 1$ the time series is
nonstationary [1].
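The check described above can be sketched with a least-squares estimate of a single lag coefficient. The function name ar_coefficient is hypothetical, the model is assumed to have no intercept, and the data is synthetic; the author's arfunc may use a different estimator.

```python
def ar_coefficient(y, lag=1):
    """Least-squares estimate of phi in y[t] = phi * y[t - lag] + e[t]."""
    num = sum(y[t] * y[t - lag] for t in range(lag, len(y)))
    den = sum(y[t - lag] ** 2 for t in range(lag, len(y)))
    return num / den


# Synthetic trending series: the estimate comes out very close to one.
trend = [100.0 + t for t in range(100)]
phi_trend = ar_coefficient(trend)

# Exactly geometric series: the estimate recovers the growth factor.
geometric = [1.0, 2.0, 4.0, 8.0, 16.0]
phi_geometric = ar_coefficient(geometric)
```

The same function with lag=24 or lag=168 gives the seasonal analogue of the test, which is one way to compare candidate differencing lags.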
Figure 4 shows the power spectrum of the electricity consumption. The x-axis is the frequency
of the spectrum, scaled to the interval $[0, \pi]$. There are spikes at least at frequencies
0.1 and 0.62. This is taken to indicate periodicity at lags 24 and 168, because
$816 \cdot 0.1/\pi \approx 26$ and $816 \cdot 0.62/\pi \approx 161$, where 816 is the length of
the electricity consumption data vector. The values 26 and 161 are approximately 24 and 168
hours, which are reasonable period lengths: a day and a week.
Figure 4 Power spectrum of the electricity consumption (x-axis: Frequency)
Figure 5 Autocorrelation function of the electricity consumption differenced at lag 1 (x-axis: Lag Value; y-axis: AC)
According to the autocorrelation function of the once differenced electricity consumption data
(see Figure 5), there seems to be seasonal behavior in the consumption data with periods of 24
and 168 hours, which are therefore candidate differencing lags. The models
$y_t = \phi_{24}\,y_{t-24} + \varepsilon_t$ and $y_t = \phi_{168}\,y_{t-168} + \varepsilon_t$
are estimated, giving the parameter estimates $\hat{\phi}_{24}$ and $\hat{\phi}_{168}$. The
differencing lag 168 is chosen, because $\hat{\phi}_{168}$ is closer to one. Thus, the data is
differenced at lags 1 and 168, and from here on the differenced data means the data differenced
at lags 1 and 168.
Figure 6 Autocorrelation function of the differenced electricity consumption data (x-axis: Lag Value; y-axis: AC)
Figure 7 Partial autocorrelation function of the differenced electricity consumption data (x-axis: Lag Value; y-axis: PAC)
The autocorrelation function in Figure 6 has two adjacent spikes at lags one and two and a
spike at lag 168, which indicates seasonal behavior and an SMA-part.
The partial autocorrelation function in Figure 7 likewise has two adjacent spikes at lags one
and two, but also seasonal spikes at lag 24 (and its multiples) and at lag 168, which indicate
seasonal behavior and an SAR-part.
Thus the following SARIMA model is selected
The external variable, temperature, is likewise differenced at lags 1 and 168. The cross
correlation between the differenced series is plotted in Figure 8, and there seems to be
correlation between the electricity consumption and the outdoor temperature. The x-axis gives
the lag of the input, and the highest spikes occur at around lags 8, -11 and -16. The
electricity consumption is considered to depend on the outdoor temperature (not vice versa), so
only the positive lags are taken into consideration.
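A cross correlation at a given lag can be sketched as follows. The sign convention (a positive lag means that the input leads the output) and the function name cross_correlation are assumptions, and the two short series are made up so that the output is simply the input delayed by two steps.

```python
import math


def cross_correlation(x, y, lag):
    """Sample correlation between x[t] and y[t + lag] (positive lag: x leads y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    scale = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if lag >= 0:
        pairs = zip(dx[: n - lag], dy[lag:])
    else:
        pairs = zip(dx[-lag:], dy[: n + lag])
    return sum(a * b for a, b in pairs) / scale


# Illustrative input and an output that repeats the input two steps later.
x = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
y = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]
best_lag = max(range(-4, 5), key=lambda k: cross_correlation(x, y, k))
```

The largest value is found at lag 2, matching the delay that was built into the example.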
Figure 8 The cross correlation between differenced electricity consumption and temperature (x-axis: input lag, from -50 to 50)
Figure 9 Standard error of estimate of the model with different input lags (x-axis: Input lag; y-axis: Std Error Estimate)
The standard errors of estimate for different input lags in Figure 9 do not support the cross
correlation between the differenced series in Figure 8, because the minimum standard error of
estimate (SEE) is achieved with input lag 1 (SEE = 1348.9). The differences in SEE between the
lags are small, however. Consequently, the cross correlation function cannot be used in MATLAB
as a tool for selecting the most appropriate input lag for the model.
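The text does not define the standard error of estimate, so the sketch below assumes the common definition: the square root of the residual sum of squares divided by the number of residuals minus the number of estimated parameters. The function name, the residual values and the lag labels are illustrative only.

```python
import math


def standard_error_of_estimate(residuals, n_params):
    """sqrt(SSE / (n - n_params)); an assumed definition of the SEE."""
    sse = sum(e * e for e in residuals)
    return math.sqrt(sse / (len(residuals) - n_params))


# Hypothetical residuals from models fitted with two different input lags.
residuals_by_lag = {
    1: [2.0, -2.0, 2.0, -2.0, 2.0],
    8: [3.0, -3.0, 3.0, -3.0, 3.0],
}
see_by_lag = {lag: standard_error_of_estimate(res, 1)
              for lag, res in residuals_by_lag.items()}
best_input_lag = min(see_by_lag, key=see_by_lag.get)
```

Selecting the input lag then reduces to fitting the model once per candidate lag and keeping the lag with the smallest SEE.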
To summarize, under these circumstances the following SARIMAX model is estimated
Figure 10 Ex post forecast of the model (x-axis: Time (h); y-axis: Consumption (kWh); series: Estimated, Original)
Figure 11 Ex post forecast of the model, hours 570 to 640 (x-axis: Time (h); y-axis: Consumption (kWh); series: Estimated, Original)
Figure 12 Residuals of ex post forecast (x-axis: Time (h); y-axis: Residual (kWh))
Figure 13 Autocorrelation of residuals (x-axis: Lag Value; y-axis: AC)
Figure 14 Normal probability plot of residuals (x-axis: Data; y-axis: Probability)
Figures 12, 13 and 14 suggest that the residuals can be considered white noise: there is no
visible autocorrelation between the residuals, and the normal probability plot forms an
approximately straight line. The Ljung-Box test for the residuals, however, gives the value
4.7419 with 1 degree of freedom, which exceeds the 5 % critical value of 3.84, so the
hypothesis of randomness is rejected at the 5 % significance level (p is approximately 0.03).
It turns out to be hard to find a model in MATLAB whose residuals are random according to the
Ljung-Box test. Contrary to the Box-Jenkins methodology, we do not return to step one and build
a better model, because no other model yielded residuals that are random according to the
Ljung-Box test. The forecast given by the chosen model is shown in Figure 15.
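The tail probability of the reported Ljung-Box value can be checked directly. For one degree of freedom the chi-square tail probability has the closed form erfc(sqrt(Q/2)), so no statistics library is needed; the function name below is hypothetical and the closed form applies to one degree of freedom only.

```python
import math


def chi2_tail_1df(q):
    """P(X > q) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(q / 2.0))


p_value = chi2_tail_1df(4.7419)         # statistic reported in the text
p_at_critical = chi2_tail_1df(3.841)    # 5 % critical value, for comparison
```

The p-value falls between 0.025 and 0.05, so the hypothesis of randomness is rejected at the 5 % level but would not be rejected at the 2.5 % level.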
Figure 15 Ex ante forecast for next 24 hours of electricity consumption (x-axis: Time (h); y-axis: Consumption (kWh); series: Estimated)
5 Comparison of MATLAB and SAS software
MATLAB and SAS yield different results and differ in usability. It is hard to say which one is
better, because there is no single best model or single measure of usability.
Results
In MATLAB, the autocorrelation and partial autocorrelation functions are implemented directly
from their mathematical definitions, and, as expected, they agree with the functions computed
by SAS software.
The parameter estimates and the standard errors of estimate differ between MATLAB and SAS.
Even when the models and the differencing are exactly the same, the estimated parameters
differ, and consequently so do the standard errors of estimate. The forecasts estimated by
MATLAB seem to have lower standard errors of estimate: for example, SAS achieves a standard
error of estimate of 2396 with the same model that was chosen above. The differences between
MATLAB and SAS result from the estimation algorithms, the initial conditions or the iteration
tolerances used in estimating the model parameters.
SAS automatically produces a lot of information about the distributions of the model
parameters and their correlations with each other. In addition, SAS prints the AIC (Akaike's
information criterion [6]), the SBC (Schwarz's Bayesian information criterion [6]) and the
variance of the ex ante estimate, which increases with the forecast horizon. In MATLAB, the
user is responsible for producing the same information. None of these missing features was
programmed in MATLAB during this research project.
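As an indication of how little code the missing criteria would need, the sketch below computes AIC and SBC from residuals. It assumes Gaussian residuals and uses the maximized log-likelihood with the variance estimated as SSE/n; SAS may compute its likelihood differently, so the absolute values need not match SAS output. The function name is hypothetical.

```python
import math


def information_criteria(residuals, n_params):
    """Return (AIC, SBC) under an assumed Gaussian likelihood."""
    n = len(residuals)
    sigma2 = sum(e * e for e in residuals) / n
    neg2_loglik = n * (math.log(2.0 * math.pi * sigma2) + 1.0)
    aic = neg2_loglik + 2.0 * n_params
    sbc = neg2_loglik + n_params * math.log(n)
    return aic, sbc


# Illustrative residuals and parameter count.
aic, sbc = information_criteria([1.0, -1.0, 1.0, -1.0], 2)
```

Both criteria share the likelihood term and differ only in the penalty per parameter, 2 for AIC and ln(n) for SBC, so SBC favors smaller models once n exceeds about 7.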
In SAS, it is a built-in feature that the chosen input lag is automatically adjusted to work
with the integrated data in an ARMAX-model. In MATLAB, the cross correlation function is not as
informative as in SAS. It is calculated between the differenced series and no longer relates
directly to the real phenomenon, whereas in SAS the user observes a cross correlation function
that is comparable to the real phenomenon. The cross correlation produced by SAS, shown in
Figure 16, describes the correlation between the temperature and the electricity consumption
unambiguously. It is negative throughout, because the colder it is, the more electricity is
consumed. Moreover, the absolute value of the correlation coefficient is greatest at lag 12,
which suggests that the factory retains heat for about twelve hours.
Figure 16 Autocorrelation (upper) and cross correlation (lower) function produced by SAS
Usability
MATLAB lets the user handle all the data as arrays. Thus, the user can extract exactly the
information that is needed and plot whatever is needed. This makes the modeling process easier,
faster and more understandable. Unfortunately, the 95 % confidence interval for the ex ante
forecast is not computed in MATLAB.
6 Conclusion
A MATLAB implementation of the Box-Jenkins methodology was created, but as the comparison of
MATLAB and SAS software shows, the two differ as statistics tools. The MATLAB implementation
does not yield residuals that are random according to the Ljung-Box test, and the lag of the
external variable behaves illogically.
MATLAB's System Identification Toolbox is not specifically designed for estimating time series
models. Its point of view is different, because the toolbox is intended especially for modeling
systems from measured input-output data, as illustrated in Figure 17.
Figure 17 An ARMAX-model structure
From this point of view, the factory is a system that produces an observable signal, the
consumption of electricity $y_t$. The system is affected by external signals: the outdoor
temperature $x_t$ and a disturbance signal, white noise $\varepsilon_t$. In this case, neither
of these signals is controllable. After the data of the system have been observed, the goal is
to link the observations together into a dynamic system, which means that the current output
value depends not only on the current external stimuli and disturbance but also on their
earlier values and on earlier output values [6]. Dynamic systems are efficient tools for
identifying how the output depends on a certain property of the system; in this case, for
example, how the consumption of electricity is affected by the thickness of the walls, the size
of the windows or the heating system used to keep the factory warm.
Because of this nature of MATLAB, other possibilities for constructing SARIMAX-models were
searched for on the Internet. An ARMA-model can be built with Excel, for example, but it was
harder to find a program that is able to construct SARIMAX-models.
R [2] is a free software environment for statistical computing. R is able to compute at least
ARIMA-models, but it was not originally designed to handle models with external inputs such as
ARMAX-models. Its function arima, originally designed to compute ARIMA-models, can be adapted
to estimate the parameters of an ARMAX-model. However, this appears to be even more complicated
than in MATLAB.
Scilab [16] is free scientific software that is able to estimate an ARMAX-process. It is in the
same position as MATLAB, because its armax function needs to be adapted before it can estimate
the parameters of SARIMAX-models.
Unfortunately, it seems that only commercial software and the associated toolboxes are able to
compute SARIMAX-models and the methods needed in time series modeling out of the box. All the
free environments require additional programming before the whole Box-Jenkins methodology can
be used.
MATLAB and SAS are not the only commercial software able to compute ARMAX-models. For example,
AUTOBOX [19] offers a complete set of Box-Jenkins modeling tools. National Instruments has a
product, NI LabVIEW 2009 [4], which offers a tool for estimating the parameters of an
ARMAX-model. Additionally, Econometric Software, Inc. [17] offers model frameworks for the
Box-Jenkins methodology, whereas Timberlake Software [18] offers a time series analysis feature
that contains an estimator for ARMAX-models.
MATLAB is able to compute SARIMAX-models with the functions of this research project. The
results, such as the parameter estimates, differ from those of SAS, but the functions are
usable on the course Mat-2.3132 Systems Analysis Laboratory I. SAS prints more information in a
shorter time and is obviously more thoroughly validated than the functions made in this
research project. Consequently, some kind of software testing would be needed. No free software
for SARIMAX-modeling was found on the Internet. SAS software and the Econometrics Toolbox [8]
for MATLAB still seem to be the best alternatives to the solution of this research project,
because SAS is already in use, although over an SSH connection, and students are already
familiar with MATLAB. The toolbox is intuitively the better choice, because MATLAB is already
in use and buying a single toolbox is cheaper than buying a new program.
In the future, with a little more work, the functions of this research project could be an
appropriate tool for the course Mat-2.3132 Systems Analysis Laboratory I. The functions need to
be validated, and some properties, such as the 95 % confidence intervals of the ex ante
forecast and additional statistics on the estimated parameters, could be added to the MATLAB
implementation.
Bibliography
[1] Box, G. E. P., & Jenkins, G. M. (1970). Time Series Analysis: Forecasting and Control. San
Francisco: Holden-Day Inc.
[2] Foundation, T. R. (n.d.). The R Project for Statistical Computing. Retrieved September 25,
2009, from http://www.r-project.org
[3] Karr, A. F. (1993). Probability. New York: Springer-Verlag New York, Inc.
[4] LabVIEW, N. (n.d.). The Software that powers virtual instrumentation. Retrieved September
25, 2009, from http://www.ni.com/labview/
[5] Ljung, G. M., & Box, G. E. (1978). On a measure of lack of fit in time series models.
Biometrika , 297-303.
[6] Ljung, L. (1987). System Identification: Theory for the User. New Jersey: Prentice-Hall, Inc.
[7] Ljung, L., & Glad, T. (1994). Modeling of Dynamic Systems. New Jersey: Prentice-Hall, Inc.
[8] Mathworks. (n.d.). Econometrics Toolbox Matlab. Retrieved September 30, 2009, from
http://www.mathworks.com/products/econometrics/
[9] Mathworks, T. (n.d.). The Mathworks - MATLAB and Simulink for Technical Computing.
Retrieved September 25, 2009, from www.mathworks.com
[10] MATLAB. (n.d.). System Identification Toolbox - MATLAB. Retrieved September 29, 2009,
from www.mathworks.com/products/sysid/
[11] Milton, S. J., & Arnold, J. C. (2003). Introduction to probability and statistics. McGraw-Hill
Companies, Inc.
[12] Noppa. (2009, September 16). Noppa - Työ 2. Retrieved September 25, 2009, from
https://noppa.tkk.fi/noppa/kurssi/mat-2.3132/tyo_2
[13] Pakanen, J., & Karjalainen, S. (2002). An ARMAX-model approach for estimating static heat
flows in buildings. Espoo: VTT Publications.
[14] Pindyck, R. S., & Rubinfeld, D. L. (1998). Econometric Models and Economic Forecasts.
Singapore: McGraw-Hill Book Co.
[15] SAS. (n.d.). SAS Business Analytics and Business Intelligence Software. Retrieved
September 25, 2009, from www.sas.com
[16] Scilab. (n.d.). Scilab Home Page. Retrieved September 25, 2009, from http://www.scilab.org
[17] Econometric Software, Inc. (n.d.). Program Features - Capabilities - Time Series. Retrieved
September 25, 2009, from http://www.limdep.com/features/capabilites/time_series.php
[18] Timberlake Software (n.d.). LIMDEP & NLOGIT. Retrieved September 25, 2009, from
http://www.timberlake.co.uk/software/limdep/limdepkey.html
[19] Autobox Systems (n.d.). Autobox Overview. Retrieved September 28, 2009, from
www.autobox.com/autobox.htm