an integrated ssa-arima approach to make multiple day

12
Aerosol and Air Quality Research, 15: 208–219, 2015 Copyright © Taiwan Association for Aerosol Research ISSN: 1680-8584 print / 2071-1409 online doi: 10.4209/aaqr.2014.03.0046 An Integrated SSA-ARIMA Approach to Make Multiple Day Ahead Forecasts for the Daily Maximum Ambient O 3 Concentration Ujjwal Kumar * Chemistry & Climate Division, KNMI – Royal Netherlands Meteorological Institute, De-Bilt, The Netherlands ABSTRACT An integrated approach that combines a signal extraction method SSA (Singular Spectrum Analysis) with ARIMA (Autoregressive Integrated Moving Average) has been introduced to make multiple day ahead forecasts for daily maximum ambient O 3 concentrations. The results are compared with the forecasts obtained by more commonly used signal extraction method FFT (Fast Fourier Transforms) in combination with ARIMA. The data from six different European AIRBASE stations have been analyzed to test their skills in one-day ahead to five-day ahead forecasts. In SSA-ARIMA, the SSA has been used to extract the different structural components of the time series. The first few principal components have been selected based on their dominance in the eigenvalues spectrum. Each component has been modelled separately and their sum is called SSA-component. The residuals of the “SSA-component subtracted from the actual data” exhibited stationarity and have been modeled using ARIMA process. In the second approach called by FFT-ARIMA, FFT has been used to model the periodicities exhibited by the time series. The first few predominant frequencies have been selected based on the power-spectrum characteristics of the time series. The residuals of the “FFT-component subtracted from the actual data” were modeled using ARIMA process. The performances of both the SSA-ARIMA and FFT-ARIMA were evaluated for 30 out of sample days for one to five day ahead forecasts. The statistical indicators RMSE (root mean square error) and MAE (mean absolute error) reveal that the SSA-ARIMA consistently outperforms FFT-ARIMA. Moreover, SSA-ARIMA has also been found to perform better especially for multiple day ahead forecasts. The results clearly indicate that the SSA-ARIMA approach can be a potential alternative to make short term air quality forecasting. Keywords: Singular Spectrum Analysis (SSA); Fast Fourier Transform (FFT); Autoregressive Integrated Moving Average (ARIMA); Ambient O 3 ; Forecasting. INTRODUCTION The forecasting of ambient O 3 concentration has been a matter of great interest for most of the air quality research works, mainly, because of its critical role in the formation of urban smog and also on account of its direct adverse impact on human health, plants and materials. A number of statistical-time series modelling approaches have been applied by numerous researchers in order to forecast ambient O 3 concentrations. Slini et al. (2002) applied ARIMA (Autoregressive Integrated Moving Average) modelling technique to forecast daily maximum O 3 concentrations in Athens, Greece. Their results show a good index of agreement, accompanied by a weakness in forecasting alarms. Heo and Kim (2004) developed fuzzy expert and * Corresponding author. Tel.: +31-2206669 E-mail address: [email protected]; [email protected] neural network systems to forecast daily maximum ozone concentrations at four monitoring stations of Seoul, Korea. The accuracy of their forecast system improved continuously through verification and augmentation. Duen and Liger (2005) also used ARIMA models to predict hourly ozone concentrations in an urban and a rural area of South- eastern Spain. The prediction agrees reasonably well with the measured values at both the sampling points and the predictions at rural areas were found to be better than the urban area. Corani et al. (2005) used feed-forward neural networks, pruned neural networks and lazy learning to assess the air quality prediciton in Milan. Lin and Cobourn (2007) applied Takagi-Sugeno fuzzy system and a nonlinear regression model for ozone predictions in six Kentucky urban areas. They conclude that the performance of the fuzzy models was equivalent to that of nonlinear regression model for all practical purposes. Mintz et al. (2005) applied an automated fuzzy logic method to hourly ozone concentrations during summer months in the city of Edmonton, Canada. The model captured the trends in ozone concentrations and showed a good agreement with the measured data. Chelani and Devotta (2006) used a hybrid

Upload: others

Post on 20-Apr-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An Integrated SSA-ARIMA Approach to Make Multiple Day

Aerosol and Air Quality Research, 15: 208–219, 2015 Copyright © Taiwan Association for Aerosol Research ISSN: 1680-8584 print / 2071-1409 online doi: 10.4209/aaqr.2014.03.0046 

An Integrated SSA-ARIMA Approach to Make Multiple Day Ahead Forecasts for the Daily Maximum Ambient O3 Concentration Ujjwal Kumar* Chemistry & Climate Division, KNMI – Royal Netherlands Meteorological Institute, De-Bilt, The Netherlands

ABSTRACT

An integrated approach that combines a signal extraction method SSA (Singular Spectrum Analysis) with ARIMA (Autoregressive Integrated Moving Average) has been introduced to make multiple day ahead forecasts for daily maximum ambient O3 concentrations. The results are compared with the forecasts obtained by more commonly used signal extraction method FFT (Fast Fourier Transforms) in combination with ARIMA. The data from six different European AIRBASE stations have been analyzed to test their skills in one-day ahead to five-day ahead forecasts. In SSA-ARIMA, the SSA has been used to extract the different structural components of the time series. The first few principal components have been selected based on their dominance in the eigenvalues spectrum. Each component has been modelled separately and their sum is called SSA-component. The residuals of the “SSA-component subtracted from the actual data” exhibited stationarity and have been modeled using ARIMA process. In the second approach called by FFT-ARIMA, FFT has been used to model the periodicities exhibited by the time series. The first few predominant frequencies have been selected based on the power-spectrum characteristics of the time series. The residuals of the “FFT-component subtracted from the actual data” were modeled using ARIMA process. The performances of both the SSA-ARIMA and FFT-ARIMA were evaluated for 30 out of sample days for one to five day ahead forecasts. The statistical indicators RMSE (root mean square error) and MAE (mean absolute error) reveal that the SSA-ARIMA consistently outperforms FFT-ARIMA. Moreover, SSA-ARIMA has also been found to perform better especially for multiple day ahead forecasts. The results clearly indicate that the SSA-ARIMA approach can be a potential alternative to make short term air quality forecasting. Keywords: Singular Spectrum Analysis (SSA); Fast Fourier Transform (FFT); Autoregressive Integrated Moving Average (ARIMA); Ambient O3; Forecasting. INTRODUCTION

The forecasting of ambient O3 concentration has been a matter of great interest for most of the air quality research works, mainly, because of its critical role in the formation of urban smog and also on account of its direct adverse impact on human health, plants and materials. A number of statistical-time series modelling approaches have been applied by numerous researchers in order to forecast ambient O3 concentrations. Slini et al. (2002) applied ARIMA (Autoregressive Integrated Moving Average) modelling technique to forecast daily maximum O3 concentrations in Athens, Greece. Their results show a good index of agreement, accompanied by a weakness in forecasting alarms. Heo and Kim (2004) developed fuzzy expert and * Corresponding author.

Tel.: +31-2206669 E-mail address: [email protected]; [email protected]

neural network systems to forecast daily maximum ozone concentrations at four monitoring stations of Seoul, Korea. The accuracy of their forecast system improved continuously through verification and augmentation. Duen and Liger (2005) also used ARIMA models to predict hourly ozone concentrations in an urban and a rural area of South-eastern Spain. The prediction agrees reasonably well with the measured values at both the sampling points and the predictions at rural areas were found to be better than the urban area. Corani et al. (2005) used feed-forward neural networks, pruned neural networks and lazy learning to assess the air quality prediciton in Milan. Lin and Cobourn (2007) applied Takagi-Sugeno fuzzy system and a nonlinear regression model for ozone predictions in six Kentucky urban areas. They conclude that the performance of the fuzzy models was equivalent to that of nonlinear regression model for all practical purposes. Mintz et al. (2005) applied an automated fuzzy logic method to hourly ozone concentrations during summer months in the city of Edmonton, Canada. The model captured the trends in ozone concentrations and showed a good agreement with the measured data. Chelani and Devotta (2006) used a hybrid

Page 2: An Integrated SSA-ARIMA Approach to Make Multiple Day

Kumar, Aerosol and Air Quality Research, 15: 208–219, 2015  209

autoregressive and nonlinear model to predict nitrogen dioxide concentration at a site in Delhi. Their results indicate hybrid model outperforms the individual linear and nonlinear models. Coman et al. (2008) applied neural networks to two versions of the model “dynamic” versus “static” one for the prediction of ozone concentrations for a 24-h horizon. The results indicated a rather good applicability of these models for a short-term prediction of ozone. Tsi et al. (2009) applied two cost sensitive artificial neural network (ANN) methods to forecast O3-episode. They found that the cost sensitive ANN methods perform better than the standard ANN. Kumar and Jain (2009) applied ARIMA modelling approach to forecast daily mean ambient air pollutants (O3, CO, NO and NO2) concentration at an urban traffic site of Delhi, India. The study concluded that the one day ahead prediction is reasonably good for all of these pollutants for the study site. Carnevale et al. (2011) applied neural networks and cokriging technique to develop an integrating forecasting system for O3 and PM10 concentrations. Further, Singh et al. (2011) applied cokriging based approach to reconstruct air pollution maps for O3 and PM10.

Because of the versatility of the statistical models, they have been successfully applied to identify the sources/causes of the ambient air pollutants (see, e.g., Lu et al. (2002); Lee et al. (2002); Gomez-Sanchis et al. (2006); Lu et al. (2011); Dominick et al. (2012, 2012b); Li et al. (2013), Latif et al. (2014)). Dominick et al. (2012) applied multivariate analysis such as Hierarchical Agglomerative Cluster Analysis (HACA) to access the spatial patterns, Pricipal Component Analysis (PCA) to determine the major sources of each air pollutant. D'Urso et al. (2013) applied Fuzzy clustering in time series of pollutants to detect the possible information redundancy in the monitoring networks and then, decreasing the number of monitoring stations, to reduce the monitoring costs and then to increase the monitoring efficiency of the networks. Im et al. (2013) analyzed the diurnal, monthly and seasonal variations of O3 and nitrogen oxides (NOx) and studied its local and regional impacts.

Lu and Wang (2008) proposed standard support vector classification model (S-SVC) in order to make prediction for the high ozone days. Kumar and DeRidder (2010) applied GARCH modelling in association with FFT-ARIMA to forecast ozone episodes at four European stations. The results indicate that it has not only been able to make good forecasts for one-day ahead daily maximum O3 concentration but it also successfully predicts the possible ozone episodes and at the same time reduces the no. of false alarms. Prakash et al. (2011) applied wavelet-based neural network to predict ambient O3, NO, NO2, CO, SO2 and PM2.5. The model results demonstrated that a judicious selection of wavelet network design may be employed successfully for air quality forecasting. Moustris et al. (2012) applied Multiple Linear Regression against a forecasting model based on ANN approach. Their results suggest that the ANN model can be used to issue warning for the general population and the sensitive groups.

As evident that the statistical methods in air pollution modelling and forecasting is an active area of research and applied for the policy purpose as well, however, the multiple

day ahead forecasting of air pollutants still remains a challenging domain for the forecasters. In most of the above studies, the emphasis is mostly on one day ahead forecasting. The quality of results quickly decline as the prediction horizon increases, i.e., for multiple day ahead forecasts. In the present study, an integrated SSA-ARIMA time series forecasting approach has been applied and compared with FFT-ARIMA approach for six European AIRBASE stations for one to five day ahead forecasting of the daily maximum O3 concentrations. A time series might consist of deterministic and stochastic components. In SSA-ARIMA approach, the deterministic part of the time series has been modelled using singular spectrum analysis (SSA) while in the second approach it has been modelled using Fast Fourier Transform (FFT). The stochastic part of the time series has been modelled using autoregressive integrated moving average (ARIMA) process in both the cases. The results have been verified using 1-day ahead, 2-day ahead, 3-day ahead, 4-day ahead and 5-day ahead forecasts for the last one month that accounts for 30 out of sample days. It is pertinent to mention that one-day ahead forecast means forecast for the one day ahead of the last day of the measured data. The "out of sample" means test data set. The last 30 days of the time series has been kept as "test data set", while rest of the time series has been used as the "training data-set".

The rest of the article is organized as follows. Section 2 presents the description of the data and the stations used in the study. Singular Spectrum Analysis (SSA), Fast Fourier Transform (FFT) and ARIMA methodology have been described in section 3. Section 4 presents the results and discussion. Section 5 concludes the study. DATA AND STATIONS

The long term time-series of daily maximum ambient O3 concentrations from six European AIRBASE stations have been analyzed in the present work. The daily maximum represents the daily one hour maximum at all the stations. Table 1 presents the details of these stations: name, characteristics, source of the data and the period of the data under study. Out of the six stations, two stations are of the “Background rural” characteristics, three are of “Background urban” nature and the one site has been categorized as “traffic urban”. The purpose of choosing varied site characteristics is to assess the utility of the new approach across different site-characteristics. METHODOLOGY

Time series are realizations of underlying data-generating processes over a time span, occurring at regular points in time. As such, time series have identifiable stochastic or deterministic components (Yaffee, 1999). In this work, the deterministic components have been identified, separated and modeled by two different approaches, SSA (Singular Spectrum Analysis) and FFT (Fast Fourier Transform). The stochastic components have been modeled using the stationary stochastic model ARIMA (Autoregressive Moving Average).

Page 3: An Integrated SSA-ARIMA Approach to Make Multiple Day

Kumar, Aerosol and Air Quality Research, 15: 208–219, 2015 210

Table 1. Details of the study-stations.

S. No.

Site Name Station Code

Country Data Source Site

characteristics Period of the data studied

1. London (Brent)

__ UK UK Air Quality data archive

(http://uk-air.defra.gov.uk/data) Background

Urban 1-May-96 to 30-Sep-07

2. London

(Teddington) __ UK

UK Air Quality data archive (http://uk-air.defra.gov.uk/data)

Background Urban

1-Jan-97 to 31-07-11

3. Brussels

(Ste. Catheri) BEBB104 Belgium

AIRBASE air quality data archive (http://air-climate.eionet.europa.eu/databases)

Traffic urban

1-Jan-01 to 2-Mar-2006

4. Brussels

(Berchem S. A.) BETB011 Belgium

AIRBASE air quality data archive (http://air-climate.eionet.europa.eu/databases)

Background Urban

2-Nov-92 to 31-July-10

5. Posterholt-

Vlodropperweg NL00107 Netherlands

AIRBASE air quality data archive (http://air-climate.eionet.europa.eu/databases)

Background rural

19-Feb-91 to 31-July-2010

6. Zegveld-

Oude Meije NL00633 Netherlands

AIRBASE air quality data archive (http://air-climate.eionet.europa.eu/databases)

Background rural

1-Jan-97 to 31-July-10

SSA (Singular Spectrum Analysis) SSA is a powerful technique for nonparametric time

series analysis (Hassani et al., 2009). SSA decomposes the original time series into the sum of a small number of independent and interpretable components such as a slowly varying trend, oscillatory components and noise. This is followed by a reconstruction of the original series. In recent times, SSA has found its applications in many different areas from physics and meteorology to economics and finance (e.g., Ghil et al., 2002; Hassani et al., 2009; Hassani et al., 2010; Kumar and Jain, 2010).

Four basic steps can be identified in SSA: (I) Embedding the sampled time series in a vector space of dimension M also known as window length; (II) Computing the M × M lag-covariance matrix CD of the data; (III) Diagonalizing CD; and (IV) Recovering the time series. The procedure described in this section has been adapted after Elsner and Tsonis (1996).

(I) Embedding the sampled time series in a vector space of dimension M;

One-dimensional time series (x1, x2, …, xN) is represented as multidimensional time series as follows:

1 2 3

2 3 4 1,, 1

1 2

...

...

( ) . . . . .

. . . . .

...

M

Mk M

ij i j

k k k N

x x x x

x x x x

X x

x x x x

(1)

its dimension is known as window length (M). The matrix X is called trajectory matrix.

(II) Computing the M × M lag-covariance matrix CD of the data;

Broomhead and King (1986) proposed

1(1)

1 11

1( ) ( )( )

1

N m

ij i t j tt

D x x x xN

CM

(2)

 Alternatively, Vautard and Ghil (1989) suggested

| |(2)

| |1

1( ) ( )( )

| |

N i j

ij i j tD tt

x x x xN i

Cj

(3)

(CD)(2) is a Toeplitz matrix, meaning that all elements

along each of diagonals are the same. (III) Diagonalizing CD For an orthonormal matrix, the inverse of the matrix is

equal to its transpose. Since CD is real and symmetric, there exists a diagonalizing matrix E whose columns are orthonormal and a diagonal matrix Λ (Press et al., 2002) such that CD = EΛET (4)

or CD E = EΛ (5) which is called the spectral decomposition of CD. Also Λ is a diagonal matrix whose nonnegative entries are the eigenvalues of CD. Using the definition of CD we have XTXE = EΛ, (6) ETXTXE = Λ, (7) or (XE)T(XE) = Λ. (8)

The matrix XE is the trajectory matrix projected onto the basis E. The components of X aligned along the basis E are uncorrelated since E is composed of orthogonal vectors Ek called the singular vectors of X. The diagonal matrix Λ consists of ordered values 0 ≤ λ1 ≤ λ2 ≤…≤ λM whose square roots are called the singular values that are referred to collectively as the singular spectrum. These terms give SSA its name.

(IV) Recovering the time series: The Principal Components (PCs) of original time series

can be calculated as follows:

11

mk k

i i j jj

a x e

(9)

Page 4: An Integrated SSA-ARIMA Approach to Make Multiple Day

Kumar, Aerosol and Air Quality Research, 15: 208–219, 2015  211

for i = 1, 2, …, N and k = 1, 2, …, M where represents the jth component of the kth eigenvector.

To get back to the original time series we convolve these principal components with their associated eigenvectors, which amount to

11

mk k

i j i jk

x a e

(10)

where i = 1, 2, 3, …, N and j = 1, 2, 3, ..., M

The original time series can be filtered this way by using only a selection of the possible principal components.

In this work, principal components have been selected on the basis of their significance based upon eigenvalues and subjective evaluation, i.e., whether it represents a systematic pattern or noise (Kumar and Jain, 2010b). As long as the eigenvalue of a component is greater than 1 and itstime series appear to represent a systematic pattern, the component has been retained for forecasting purpose. The reconstructed series for each of the selected components have been forecasted individually based upon its pattern (trend or oscillatory components). The sum of all the selected reconstructed series together is called as SSA-component of the time series.

SSA is also capable of dealing in short-term time series (Golyandina et al., 2001), though in this work, long time series has been taken into account so as to make a better comparison with the FFT-ARIMA. FFT (Fast Fourier Transform)

FFT is a variant of Discrete Fourier Transform (DFT) with only difference that FFT is computationally faster (Press et al., 2002). Computationally, DFT is of the order of O(N2) while FFT is of the order of O(N.log2N). If (x0, x1, x2, …, xN–1) denote a time series, its DFT Hn is given by

12 /

0

Nikn N

n kk

H x e

(11)

which can be inverted by inverse Fourier transform as follows:

12 /

0

(1/ )N

ikn Nk n

n

x N H e

(12)

 Eq. (11) is periodic in n with period N. Thus, the frequency

range varies from –1/2 to 1/2 at discrete interval n/N. Now the periodogram estimates of the power spectrum at different frequencies are given as (Press et al., 2002)

20 02

2 22

1(0) ( ) | |

1( ) [| | | | ]k k N k

P P f HN

P f H HN

(13)

where fk(=k/N) is defined only for the zero and positive

frequencies (also Fourier transform (1) is symmetric, i.e., Hk and H–k have the same value). In the present study, we’ve plotted power vs. period.

The inspection of power spectrum helps us to identify and select the frequencies (periods) which are supposedly dominant in the time series. The dominance of a particular frequency in the time series has been decided by inspecting the power spectrum of the time series. The peak of the power spectrum identify the most dominant frequency. The next peak identifies the next dominant frequency. Those frequencies have been retained which has a clear peak and the period is less than 1/3rd of the size of time series, so that, the viability of the period can be verified. After taking only the selected number of dominant frequencies for a particular time series, we construct the Fourier transform and take inverse of it. In the present study, the time series obtained after inverse Fourier transform under selected frequencies has been called as FFT component of the time-series (Kumar and DeRidder, 2010). ARIMA Modelling

After subtracting the deterministic components (either SSA-component or the FFT-component) from the original time series, the residuals part is often found to follow a stationary process. Thus, the stationary stochastic model ARMA (Autoregressive Moving Average) is applied to forecast the residual time series. A time series {xt; t = 0, ± 1, ± 2, ...} is ARMA (p, q) if it is covariance stationary and can be represented as xt = ϕ1xt–1 + ... + ϕPxt–P + wt + θ1wt–1 + ... + θqwt–q; (14)

(Shumway and Stoffer, 2006) with ϕP ≠ 0, θq ≠ 0, and σw

2 > 0. The parameters p and q are called the autoregressive [AR(p)] and the moving average [MA(q)] orders, respectively. When a time series doesn’t appear covariance stationary, the differencing procedure may be applied to make it stationary. Then, the ARMA(p, q) model can be applied to the stationary differenced time series and model so constructed is called ARIMA(p, d, q) model where d denotes the order of differencing (Brockwell and Davis, 2002; Shumway and Stoffer, 2006). An inspection of autocorrelation function (ACF) and partial autocorrelation function (PACF) helps in identifying the orders AR(p) and MA(q). In addition, more objectively defined criterions such as Akaike information criterion (AIC), Hannon-Quinn Information Criterion (HIC) and Bayesian Information Criterion (BIC) can also be used to identify the correct orders p and q (Brockwell and Davis, 2002; Kumar and Jain, 2009). The parameters ϕ and θ  have been estimated using maximum likelihood method (Brockwell and Davis, 2002) in the present study. More details about the way ARIMA model applied above can be found in Kumar and Jain (2009). RESULTS AND DISCUSSION

The data for long term time series of daily maximum O3 concentrarions have been analysed for six European stations as detailed in Table 1. The forecasts analysis is mainly

Page 5: An Integrated SSA-ARIMA Approach to Make Multiple Day

Kumar, Aerosol and Air Quality Research, 15: 208–219, 2015 212

focussed for the spring and summer season as the observed O3 concentrations are higher and of more concern for this season. Each of the time series has been divided into training data-set and validation data-set. In each of the case, last one month (30 days) data has been kept completely out of model fitting procedure to treat them as “out of sample” or the validation data-set. Forecasts results have been analysed for one-day ahead, two-day ahead, three-day ahead, four-day ahead and five-day ahead forecasts.

Fig. 1 presents the time series of daily maximum O3 concentrations at the six selected stations. Similar techniques have been applied to all the time series although model fitting parameters differ in each case. Therefore, for sake of brevity and clarity, it has been decided to present figures for each step for one time series only, while model fitting

parameters for all the time series have been presented in Table 2.

FFT-ARIMA Modelling

The daily maximum O3 concentration time series (Fig. 1) at each station is clearly marked by seasonal cycles. The annual periodicity is quite distinct in each of the case. These periodicities can readily be extracted using FFT method. The FFT has been applied on each of the time series to extract these periodicities. Fig. 2 presents the power vs. period plot for the daily maximum O3 concentration time series at London (Brent). The predominant frequencies are marked. The first three predominant frequencies have been chosen to reconstruct the periodic (FFT) component of the time series. Fig. 5 shows the reconstructed FFT-

Fig. 1. Time series of daily maximum ambient O3 concentrations at six European stations: (clockwise from top left) London (Brent), London (Teddington), Brussels (Berchem), Brussels (Ste. Catheri), Posterholt-Vlodropperweg (Netherlands) and Zegveld-Oude Meije (Netherlands). Further details about the data and study stations are presented in Table 1.

Page 6: An Integrated SSA-ARIMA Approach to Make Multiple Day

Kumar, Aerosol and Air Quality Research, 15: 208–219, 2015  213

component alongwith the original time series. In the next step, FFT-component has been subtracted from the original time series and the resulting residual time series has been analysed. The left panel of the Fig. 6 presents the FFT-residuals time series. A visual inspection as well as unit root test reveals that this residual time series can be considered stationary. Thus, no differencing of time series is required here. The lower two panels of Fig. 6 also present the ACF and PACF pattern for the same time series. After a detailed examination of ACF and PACF pattern and the criterions like AIC, BIC and HIC, AR(1) model (Table 2) has been selected to represent the residual time series and to make forecasts in future.

FFT-components and FFT-residuals have been forecasted separately for one-day ahead, two-day ahead, three-day ahead, four-day ahead and five-day ahead for the last 30 “out of sample” days (validation data set). Final forecasts have been made by combining these individual forecasts. Fig. 7 presents the time series of validation data set together with FFT-ARIMA forecasts. Figs. 8 and 9 present the RMSE and MAE for these out of sample forecasts. SSA-ARIMA Modelling

SSA (Singular Spectrum analysis) can decompose a time series into the sum of a small number of independent and interpretable components such as a slowly varying trend, oscillatory components and noise. In SSA, one of the important parameters that need to be decided is the “window-length (L)”. The idea behind selecting an appropriate window length is that possible variations or periodicity can be covered by using this length of time series because SSA captures mainly the patterns in a time series that are contained within the window length. Therefore, it can’t be set too low, at the same time, it can’t be put arbitrarily large as then lag-covariance matrix might lose its statistical significance. Hence, care has been taken to select an appropriate window length that can be exploited by SSA to capture the periodicity and trend present in the time series. The chosen window for all the stations has been shown in Table 2. After selection of appropriate window length, principal components have been calculated and then each of the components has been reconstructed into a separate time series. First 19 reconstructed components for the time series at London (Brent) have been shown in Fig. 3. Based upon the eigen-values and the exhibited pattern of reconstructed component time series, first six components have been selected. Each of the components has been fit separately based upon their individual pattern. In general, the first component is the trend component and it has been modeled as linear trend (Fig. 4) for the London (Brent). Other components exhibit somewat oscillatory pattern with varying amplitues and periods, therefore, they’ve been modeled using a combination of sine and cosine functions just like Fourier series (Fig. 4). The fits were accompolished using nonlinear least square method (Press et al., 2002). Each of these components has been forecasted separately using their corresponding fit-functions. The sum total of all the components together is called SSA component. The time series of SSA-component at London (Brent) has alsoT

able

2.

Mod

el f

ittin

g pa

ram

eter

s at

dif

fere

nt s

tatio

ns:

AR

den

otes

the

aut

oreg

ress

ive

and

MA

den

otes

the

mov

ing

aver

age

part

of

AR

IMA

mod

el.

The

par

amet

er p

re

fers

to th

e or

der

of A

R w

hile

q r

efer

s to

the

orde

r of

MA

. The

par

amet

er d

den

otes

the

orde

r of

dif

fere

ncin

g re

quir

ed to

mak

e th

e tim

e se

ries

sta

tiona

ry.

Site

s N

(l

engt

h of

the

tim

e se

ries

)

SS

A-A

RIM

A p

aram

eter

s F

FT-A

RIM

A p

aram

eter

s

L

(Win

dow

le

ngth

)

No.

of

SS

A

com

pone

nts

sele

cted

AR

IMA

mod

el f

or S

SA r

esid

uals

N

o. o

f pr

edom

inan

t fr

eque

ncie

s se

lect

ed

AR

IMA

(p, d

, q)

AR

M

A

(p, d

, q)

AR

M

A

Lon

don

(Bre

nt)

4170

10

95

6 (1

,0,0

) A

R(1

) =

0.5

557

--

3 (1

,0,2

) A

R(1

) =

0.7

916

MA

(1)

= 0

.236

5 M

A(2

) =

0.1

096

Lon

don

(Ted

ding

ton)

53

25

1460

5

(1,0

,0)

AR

(1)

= 0

.568

4 --

3

(1,0

,2)

AR

(1)

= 0

.854

8 M

A(1

) =

0.2

879

MA

(2)

= 0

.172

B

russ

els

(Ste

. Cat

heri

) (B

EB

B10

4)

1887

73

0 3

(1,0

,0)

AR

(1)

= 0

.628

7 --

2

(1,0

,2)

AR

(1)

= 0

.714

4 M

A(1

) =

0.1

044

MA

(2)

= 0

.053

5 B

russ

els

(Ber

chem

S.A

.)

(BE

TB

001)

64

80

1095

4

(1,0

,0)

AR

(1)

= 0

.611

7 --

- 3

(1,0

,2)

AR

(1)

= 0

.840

1 M

A(1

) =

0.2

367

MA

(2)

= 0

.164

5 P

oste

rhol

t-V

lodr

oppe

rweg

(N

L00

107)

71

03

1460

4

(1,0

,0)

AR

(1)

= 0

.608

--

- 3

(1,0

,2)

AR

(1)

= 0

.863

9 M

A(1

) =

0.2

705

MA

(2)

= 0

.175

Z

egve

ld-O

ude

Mei

je

(NL

0063

3)

4960

14

60

7 (1

,0,0

) A

R(1

) =

0.6

146

---

3 (1

,0,2

) A

R(1

) =

0.8

401

MA

(1)

= 0

.236

7 M

A(2

) =

0.1

645

Page 7: An Integrated SSA-ARIMA Approach to Make Multiple Day

Kumar, Aerosol and Air Quality Research, 15: 208–219, 2015 214

Fig. 2. Power vs. period diagram for the daily maximum O3 concentration time-series at London (Brent).

Fig. 3. First 18 resconstructed SSA-components using a window length 1095 at London (Brent). The x-axes has the number of days from 1-May-96 to 31-Aug-2007. The y-axes denotes the values of the components extracted from the daily maximum O3 concentration (µg/m3) time series.

shown in Fig. 5 together with FFT-component. In the next step, the SSA-component has been subtracted from the original time series. The obtained residual time series exhibits stationarity (Fig. 6 right panel) as the visual inspection suggests that this time series lacks any trend or periodicity which has also been confirmed by the unit root test. This allows us to apply ARIMA model fitting to the SSA-residuals time series. It is to note that differencing is not required in this case, i.e., d = 0. The right side lower two panels of Fig. 6 show the ACF and PACF pattern of the SSA-residuals

time series. The ARIMA parameters p and q has been decided based upon the examination of ACF, PACF pattern and the criterions AIC, BIC and HIC. The selected ARIMA model parameters have been presented in Table 2. Thereafter, one-day to five-day ahead forecasts have been made using the fitted ARIMA(p, d, q).

As a final step, forecasts made by SSA-components fitting and the ARIMA fitting of the SSA-residuals have been summed up to make the final forecasts for one to five day ahead for the 30 out of sample days.

Page 8: An Integrated SSA-ARIMA Approach to Make Multiple Day

Kumar, Aerosol and Air Quality Research, 15: 208–219, 2015  215

Fig. 4. Curve-fitting for the first six selected SSA-components for the daily maximum O3 concentration time series at London (Brent).

Fig. 5. Plots of original time series (in blue), fitted FFT-components (in red) and fitted SSA-components (in green) for the daily maximum O3 concentration time series at London (Brent).

Comparison of SSA-ARIMA and FFT-ARIMA Forecasting

The step-wise procedures described above have been followed for each of the six stations. Although model parameters differ in each case but the basic approach essentially remains the same. Thus, applying both SSA-ARIMA and FFT-ARIMA at all the six stations, the one-day to five-day ahead forecasts have been obtained for 30 out of sample days in each case.

Fig. 7 presents the daily maximum O3 time series of observed, FFT-ARIMA and SSA-ARIMA one-day ahead forecasts with their corresponding 95% confidence intervals at all the six stations for the last 30 days which have been treated as "out of sample" in the study. A close examination of both the forecasts pattern suggests that the SSA-ARIMA one-day ahead forecasts are not only close to the observations but its 95% confidence interval ( the standard deviation) is also less as compared to FFT-ARIMA forecasts.

Page 9: An Integrated SSA-ARIMA Approach to Make Multiple Day

Kumar, Aerosol and Air Quality Research, 15: 208–219, 2015 216

Fig. 6. Left: Plot of FFT residuals at London (Brent) (top panel). Bottom two panels show the ACF and PACF pattern for the FFT-resuduals time series. Right: Plot of SSA-residuals at London (Brent) (top panel). Bottom two panels show the ACF and PACF pattern for the SSA-residuals time series.

Fig. 7. The daily maximum O3 concentration time series of observed, FFT-ARIMA forecasts and SSA-ARIMA forecasts with their corresponding 95% confidence intervals at all the six stations for the last 30 days that have been treated as "out of sample" in the study.

This suggests that SSA-ARIMA forecasts has greater reliabilty than the FFT-ARIMA. To evaluate the model forecasts further, the error analysis for both the approaches has been carried out. The error statistics RMSE and MAE have been calculated for all the forecasts at all the six stations. The RMSE and MAE are standard and reliable tool to evaluate and compare the different models performances (Willmott, 1981, 1982; Thunis et al., 2012). Figs. 8 and 9 present the RMSE and MAE for one-day to five-day ahead forecasts for 30 out of sample days obtained by applying these two different approaches at all the six stations. A RMSE comparison at London (Brent) clearly shows that SSA-ARIMA forecasts have outperformed the FFT-ARIMA. It should be noted that RMSE for SSA-ARIMA gets lowered by 13.2%. 24.4%, 30.7%, 34.9% and 38.2% for one-day, two-day, three-day, four-day and five-day ahead

forecasts, respectively. Similar trend of decrease in MAE error can also be observed (Fig. 9). An inspection of RMSE errors at all the stations (Fig. 8) clearly implies that SSA-ARIMA always has lower RMSE values. It should be noted that RMSE decrease for SSA-ARIMA gets more and more significant as one goes from one-day ahead forecasts to five-day ahead forecasts (e.g., from 13.2% to 38.2% decrease at London (Brent)). This trend is also distinctly visible in MAE pattern. Thus, the error analysis clearly suggests that SSA-ARIMA outperforms FFT-ARIMA especially for multiple day ahead forecasts. CONCLUSION

This study introduces an integrated SSA-ARIMA approach to forecast daily maximum ambient O3 concentrations that

Page 10: An Integrated SSA-ARIMA Approach to Make Multiple Day

Kumar, Aerosol and Air Quality Research, 15: 208–219, 2015  217

Fig. 8. The RMSE obtained for one-day ahead to five-days ahead forecasts for 30 out of sample days by SSA-ARIMA and FFT-ARIMA approaches at six stations: (clockwise from top-left) London (Brent), London (Teddington), Brussels (Berchem), Brussels (Ste. Catheri), Posterholt-Vlodropperweg (Netherlands) and Zegveld-Oude Meije (Netherlands).

Fig. 9. The MAE obtained for one-day ahead to five-days ahead forecasts for 30 out of sample days by SSA-ARIMA and FFT-ARIMA approaches at six stations: (clockwise from top-left) London (Brent), London (Teddington), Brussels (Berchem), Brussels (Ste. Catheri), Posterholt-Vlodropperweg (Netherlands) and Zegveld-Oude Meije (Netherlands).

has been applied at six European monitoring stations. These stations consist of “background urban”, “traffic urban” and “background rural” characteristics. At all the stations, the results obtained by SSA-ARIMA have been compared with the FFT-ARIMA approach. The out of sample forecasts error analysis using the metrics RMSE and MAE clearly reveals that SSA-ARIMA has performed consistently better at all the stations. More strikingly, SSA-ARIMA performed much better for multiple day ahead forecasts. The capability

of SSA to extract signals from the time series, whether it be trend or oscillatory component, offers more practical insight into the pattern of the time series and makes it easier to model them. The SSA is a highly capable method and can be applied to extract the signals from any time seires, therefore, the application of this method can easily be extended to forecast other pollutants such as NO2, PM10, VOCs. The results of this study suggest that the SSA can be a powerful method to extract signals from the air

Page 11: An Integrated SSA-ARIMA Approach to Make Multiple Day

Kumar, Aerosol and Air Quality Research, 15: 208–219, 2015 218

pollutants time series and the SSA-ARIMA approach can be a very good potential alternative to make short term air quality forecasting. REFERENCES Brockwell, J.B. and Davis, R.A. (2002). Introduction to

Time Series and Forecasting, Springer-Verlag New York, Inc.

Broomhead, D.S. and King, GP. (1986). Extracting Qualitative Dynamics from Experimental Data. Physica D 20: 217–236.

Carnevale, C., Finzi, G., Pisoni, E., Singh, V. and Volta, M. (2011). An Integrated Air Quality Forecast System for a Metropolitan Area. J. Environ. Monit. 13: 3437–3447

Chelani, A.B. and Devotta, S. (2006). Air Quality Forecasting Using a Hybrid Autoregressive and Nonlinear Model. Atmos. Environ. 40: 1774–1780

Coman, A., Ionescu, A. and Candau, Y. (2008). Hourly Ozone Prediction for a 24-h Horizon Using Neural Networks. Environ. Modell. Softw. 23: 1407–1421, doi: 10.1016/j.envsoft.2008.04.004.

Corani, G. (2005). Air Quality Prediction in Milan: Feed-forward Neural Networks, Pruned Neural Networks and Lazy Learning. Ecol. Modell. 185: 513–529, doi: 10.1016/ j.ecolmodel.2005.01.008.

Dominick, D, Juahir, H., Latif, M.T. and Zain, S.M. (2012). Spatial Assessment of Air Quality Patterns in Malaysia Using Multivariate Analysis. Atmos. Environ., 60: 172–181.

Dominick, D., Latif, M. T., Juahir, H., Aris, A.Z. and Zain, S.M. (2012). An Assessment of Influence of Meteorological Factors on PM10 and NO2 at Selected Stations in Malaysia. Sustainable Environ. Res. 22: 305–315.

Duen, C., Carretero, J. and Liger, E. (2005). Stochastic Model to Forecast Ground-level Ozone Concentration at Urban and Rural Areas. Chemosphere 61: 1379–1389, doi: 10.1016/j.chemosphere.2005.04.079.

D'Urso, P., Di Lallo, D. and Maharaj, E.A. (2013). Autoregressive Model-based Fuzzy Clustering and Its Application for Detecting Information Redundancy In Air Pollution Monitoring Networks. Soft Comput. 17: 83–131.

Elsner, J.B. and Tsonis, A.A. (1996). Singular Spectrum Analysis – A New Tool in Time Series Analysis, Plenum Press, New York and London.

Ghil, M., Allen, R.M., Dettinger, M.D., Ide, K., Kondrashov, D., Mann, M.E., Robertson, A.W., Saunders, A., Tian, Y., Varadi, F. and Yiou, P. (2002). Advanced Spectral Methods for Climatic Time Series. Rev. Geophys. 40: 3.1–3.41.

Golyandina, N., Nekrutkin, V. and Zhigljavsky, A. (2001). Analysis of Time Series Structure: SSA and Related Techniques, Chapman & Hall/CRC.

Gomez-Sanchis, J., Martin-Guerrero, J.D., Soria-Olivas, E., Vila-france, J., Carrasco, L. and Valle-tasco, S. (2006). Neural Networks for Analysing the Relevance of Input Variables in the Prediction of Tropospheric Ozone

Concentration. Atmos. Environ. 40: 6173–6180, doi: 10.1016/j.atmosenv.2006.04.067

Hassani, H., Heravi, S. and Zhigljavsky, A. (2009). Forecasting European Industrial Production with Singular Spectrum Analysis. Int. J. Forecasting 25: 103–118, doi: 10.1016/j.ijforecast.2008.09.007.

Hassani, H., Soofi, A.S. and Zhigljavsky, A. (2010). Predicting Daily Exchange Rate with Singular Spectrum Analysis. Nonlinear Anal. Real World Appl. 11: 2023–2034, doi: 10.1016/j.nonrwa.2009.05.008.

Heo, J.S. and Kim, D.S. (2004). A New Method of Ozone Forecasting Using Fuzzy Expert and Neural Network Systems. Sci. Total Environ. 325: 221–237, doi: 10.1016/ j.scitotenv.2003.11.009.

Im, U., Incecik, S., Guler, M., Tek, A., Topcu, S., Unal, Y.S., Yenigun, O., Kindap, T., Odman, M.T. and Tayanc, M. (2013). Analysis of Surface Ozone and Nitrogen Oxides at Urban, Semi-rural and Rural Sites in Istanbul, Turkey. Sci. Total Environ. 443: 920–931.

Kumar, U. and De Ridder, K. (2010). GARCH Modelling in Association with FFT-ARIMA to Forecast Ozone Episodes. Atmos. Environ. 44: 4252–4265, doi: 10.1016/ j.atmosenv.2010.06.055.

Kumar, U. and Jain, V.K. (2009). ARIMA Forecasting of Ambient Air Pollutants (O3, NO, NO2 and CO). Stoch. Environ. Res. Risk Assess. 24: 751–760, doi: 10.1007/ s00477-009-0361-8.

Kumar, U. and Jain, V.K. (2010). Time Series Models (Grey-Markov, Grey Model with rolling Mechanism and Singular Spectrum Analysis) to Forecast Energy Consumption in India. Energy 35: 1709–1716, doi: 10.1016/j.energy.2009.12.021.

Latif, M.T., Dominick, D., Ahamad, F., Khan, M.F., Juneng, L., Hamzah, F.M. and Nadzir, M.S.M. (2014). Long Term Assessment of Air Quality from a Background Station on the Malaysian Peninsula. Sci. Total Environ. 482–483: 336–48, doi: 10.1016/j.scitotenv.2014.02.132.

Lee, Y., Calori, G., Hills, P. and Carmichael, G. (2002). Ozone Episodes in Urban Hong Kong 1994–1999. Atmos. Environ. 36: 1957–1968, doi: 10.1016/S1352-2310(02) 00150-4.

Lin, Y. and Cobourn, W.G. (2007). Fuzzy System Models Combined with Nonlinear Regression for Daily Ground-level Ozone Predictions. Atmos. Environ. 41: 3502–3513, doi: 10.1016/j.atmosenv.2006.11.060.

Lu, W., Wang, X., Wang, W., Leung, A.Y.T. and Yuen, K. (2002). A Preliminary Study of Ozone Trend and Its Impact on Environment in Hong Kong. Environ. Int. 28: 503–512.

Lu, W.Z. and Wang, D. (2008). Ground-level Ozone Prediction by Support Vector Machine Approach with a Cost-sensitive Classification Scheme. Sci. Total Environ. 395: 109–16, doi: 10.1016/j.scitotenv.2008.01.035.

Lu, W.Z., He, H.D. and Dong, L. (2011). Performance Assessment of Air Quality Monitoring Networks Using Principal Component Analysis and Cluster Analysis. Build. Environ. 46: 577–583, doi: 10.1016/j.buildenv.20 10.09.004.

Lu, W.Z., Wang, X.K., Wang, W.J., Leung, A.Y.T. and

Page 12: An Integrated SSA-ARIMA Approach to Make Multiple Day

Kumar, Aerosol and Air Quality Research, 15: 208–219, 2015  219

Yuen, K.K. (2002). A Preliminary Study Of Ozone Trend and Its Impact on Environment in Hong Kong, Environ. Int. 28: 503–512.

Mintz, R., Young, B.R. and Svrcek, W.Y. (2005). Fuzzy Logic Modeling of Surface Ozone Concentrations. Comput. Chem. Eng. 29: 2049–2059.

Moustris, K.P., Nastos, P.T., Larissi, I.K. and Paliatsos, A.G. (2012). Application of Multiple Linear Regression Models and Artificial Neural Networks on the Surface Ozone Forecast in the Greater Athens Area, Greece. Adv. Meteorol. 2012: 894714, doi: 10.1155/2012/894714.

Prakash, A., Kumar, U., Kumar, K. and Jain, V.K. (2011). A Wavelet-based Neural Network Model to Predict Ambient Air Pollutants’ Concentration. Environ. Monit. Assess. 16: 503–517, doi: 10.1007/s10666-011-9270-6.

Press, W.H., Teukolsky, S.A., Vellerling, W.T. and Flannery, B.P. (2002). Numerical Recipes in C++: The Art of Scientific Computing, Cambridge University Press, Cambridge.

Shumway, R.H. and Stoffer, D.S. (2006). Time Series Analysis and Its Applications - With R Examples, Springer Science+Business Media, LLC.

Singh, V., Carnevale, C., Finzi, G., Pisoni, E. and Volta, M, (2011). A Cokriging Based Approach to Reconstruct Air Pollution Maps, Processing Measurement Station Concentrations and Deterministic Model Simulations. Environ. Modell. Softw. 26: 778–786.

Slini, T., Karatzas, K. and Moussiopoulos, N. (2002). Statistical Analysis of Environmental Data as the Basis of Forecasting: An Air Quality Application. Sci. Total Environ. 288: 227–237.

Thunis, P., Pederzoli, A. and Pernigotti, D. (2012). Performance Criteria to Evaluate Air Quality Modeling Applications. Atmos. Environ. 59: 476–482, doi: 10.1016/ j.atmosenv.2012.05.043.

Tsai, C.H., Chang, L.C. and Chiang, H.C. (2009).

Forecasting of Ozone Episode Days by Cost-sensitive Neural Network Methods. Sci. Total Environ. 407: 2124–2135, doi: 10.1016/j.scitotenv.2008.12.007.

Vautard, R. and Ghil, M. (1989). Singular-spectrum Analysis in Nonlinear Dynamics, with Applications to Paleoclimatic Time Series. Physica D 35: 395–424.

Vautard, R., Yiou, P. and Ghil, M. (1992). Singular-spectrum Analysis: A Toolkit for Short, Noisy Chaotic Signals. Physica D 58: 95–126.

Vlachogianni, A., Kassomenos, P., Karppinen, A., Karakitsios, S. and Kukkonen, J. (2011). Evaluation of a Multiple Regression Model for the Forecasting of the Concentrations of NOx and PM10 in Athens and Helsinki. Sci. Total Environ. 409: 1559–1571, doi: 10.1016/j.scito tenv.2010.12.040.

Voukantsis, D., Karatzas, K., Kukkonen, J., Räsänen, T., Karppinen, A. and Kolehmainen, M. (2011). Intercomparison of Air Quality Data Using Principal Component Analysis, and Forecasting of PM10 and PM2.5 Concentrations Using Artificial Neural Networks, in Thessaloniki and Helsinki. Sci. Total Environ. 409: 1266–1276, doi: 10.1016/j.scitotenv.2010.12.039.

Willmott, C.J. (1981). On the Validation of Models. Phys. Geog. 2: 184–194.

Willmott, C.J. (1982). Some Comments on the Evaluation of Model Performance. Bull. Am. Meteorol. Soc. 63: 1309–1313.

Yaffee, R.A. (1999). Introduction to Time Series Analysis and Forecasting: with Applications of SAS and SPSS, Academic Press, Inc.

Received for review, March 19, 2014 Revised, July 2, 2014

Accepted, August 27, 2014