CHAPTER 5: Box-Jenkins (ARIMA) Forecasting



Page 1: CHAPTER 5: Box-Jenkins (ARIMA) Forecasting (personal.cb.cityu.edu.hk/msawan/teaching/ms6215/MS6215Ch5.pdf)

1. Overview
2. Stationarity and Invertibility
3. Differencing and General ARIMA Representation
4. Autocorrelation and Partial Autocorrelation Functions
5. Model Identification and Estimation
6. Diagnostic Checking and Forecasting

CHAPTER 5: Box-Jenkins (ARIMA) Forecasting

Prof. Alan Wan



Table of contents

1. Overview

2. Stationarity and Invertibility
   2.1 Stationarity
   2.2 Invertibility

3. Differencing and General ARIMA Representation

4. Autocorrelation and Partial Autocorrelation Functions
   4.1 Autocorrelation Function
   4.2 Partial Autocorrelation Function

5. Model Identification and Estimation

6. Diagnostic Checking and Forecasting
   6.1 Diagnostic Checking
   6.2 Forecasting


Overview

- The Box-Jenkins methodology refers to a set of procedures for identifying and estimating time series models within the class of autoregressive integrated moving average (ARIMA) models.

- We also speak of AR models, MA models and ARMA models, which are special cases of this general class.

- These models generalise regression, but the "explanatory" variables are past values of the series itself and unobservable random disturbances.

- ARIMA models exploit information embedded in the autocorrelation pattern of the data.

- Estimation is based on maximum likelihood, not least squares.

- This method applies to both non-seasonal and seasonal data. In this chapter, we deal only with non-seasonal data.



Overview

- Let Y1, Y2, · · · , Yn denote a series of values for a time series of interest. These values are observable.

- Let ε1, ε2, · · · , εn denote a series of random disturbances. These values are unobservable.

- The random disturbances (or noises) may be thought of as a series of random shocks that are completely uncorrelated with one another.

- Usually the noises are assumed to be generated from a distribution with identical mean 0 and identical variance σ² across all periods, and to be uncorrelated with one another. They are called "white noise" (more on this later).


Overview

The three basic Box-Jenkins models for Yt are:

1. Autoregressive model of order p (AR(p)):
   Yt = δ + φ1Yt−1 + φ2Yt−2 + · · · + φpYt−p + εt
   (i.e., Yt depends on its p previous values)

2. Moving average model of order q (MA(q)):
   Yt = δ + εt − θ1εt−1 − θ2εt−2 − · · · − θqεt−q
   (i.e., Yt depends on its q previous random error terms)

3. Autoregressive moving average model of orders p and q (ARMA(p, q)):
   Yt = δ + φ1Yt−1 + φ2Yt−2 + · · · + φpYt−p + εt − θ1εt−1 − θ2εt−2 − · · · − θqεt−q
   (i.e., Yt depends on its p previous values and q previous random error terms)
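As a concrete sketch, the three defining equations can be simulated directly. The parameter values below (δ = 1, φ1 = 0.6, θ1 = 0.4) are illustrative assumptions, not values taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
delta, phi1, theta1 = 1.0, 0.6, 0.4  # illustrative parameter values
eps = rng.normal(0.0, 1.0, n)        # white noise disturbances

# AR(1): Y_t = delta + phi_1*Y_{t-1} + eps_t
y_ar = np.zeros(n)
for t in range(1, n):
    y_ar[t] = delta + phi1 * y_ar[t - 1] + eps[t]

# MA(1): Y_t = delta + eps_t - theta_1*eps_{t-1}
y_ma = np.zeros(n)
y_ma[0] = delta + eps[0]
for t in range(1, n):
    y_ma[t] = delta + eps[t] - theta1 * eps[t - 1]

# ARMA(1,1): Y_t = delta + phi_1*Y_{t-1} + eps_t - theta_1*eps_{t-1}
y_arma = np.zeros(n)
y_arma[0] = delta + eps[0]
for t in range(1, n):
    y_arma[t] = delta + phi1 * y_arma[t - 1] + eps[t] - theta1 * eps[t - 1]

# The AR(1) fluctuates around delta/(1 - phi_1); the MA(1) around delta.
print(y_ar.mean(), y_ma.mean(), y_arma.mean())
```

The AR series wanders more persistently than the MA series because each value feeds into the next, which is visible if the three series are plotted.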


Overview

- Obviously, AR and MA models are special cases of ARMA.

- εt is typically assumed to be "white noise"; i.e., it is identically and independently distributed with a common mean 0 and common variance σ² across all observations.

- We write εt ∼ i.i.d.(0, σ²).

- The white noise assumption rules out the possibility of serial autocorrelation and heteroscedasticity in the disturbances.


Stationarity

- "Stationarity" is a fundamental property underlying almost all time series statistical models.

- A time series Yt is said to be strictly stationary if the joint distribution of {Y1, Y2, · · · , Yn} is the same as that of {Y1+k, Y2+k, · · · , Yn+k}. That is, when we shift through time, the behaviour of the random variables, as characterised by the joint density function, stays the same.

- Strict stationarity is difficult to fulfill or to test in practice. Usually, when we speak of stationarity we refer to a weaker definition.


Stationarity

- A time series Yt is said to be weakly stationary if it satisfies all of the following conditions:
  1. E(Yt) = µ_Y for all t
  2. var(Yt) = E[(Yt − µ_Y)²] = σ²_Y for all t
  3. cov(Yt, Yt−k) = γk for all t

- A time series is thus weakly stationary if
  - its mean is the same at every period,
  - its variance is the same at every period, and
  - its autocovariance with respect to a particular lag is the same at every period.

- A series of outcomes from independent identical trials is stationary, while a series with a trend cannot be stationary.
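One way to make these conditions tangible is to compare sample analogues of the mean, variance and lag-1 autocovariance across two halves of a series. This is a rough diagnostic sketch, not a formal test; the series, split point and comparison are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 2000)  # i.i.d. draws: weakly stationary by construction

def moments(x):
    """Sample analogues of the three weak-stationarity quantities."""
    mean = x.mean()
    var = x.var()
    # lag-1 autocovariance: sample analogue of cov(Y_t, Y_{t-1})
    autocov1 = np.mean((x[1:] - mean) * (x[:-1] - mean))
    return mean, var, autocov1

first_half, second_half = moments(y[:1000]), moments(y[1000:])
# For a weakly stationary series, the two halves should give similar values.
print(first_half)
print(second_half)
```

For a trending series the two mean estimates would differ sharply, which is exactly what the third bullet above rules out.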


Stationarity

- With time series models, the sequence of observations is assumed to obey some sort of dependence structure. Note that observations can be independent in only one way, but they can be dependent in many different ways. Stationarity is one way of modeling the dependence structure.

- It turns out that many useful results that hold under independence (e.g., the law of large numbers, the Central Limit Theorem) also hold under the stationary dependence structure.

- Without stationarity, the results can be spurious (e.g., the maximum likelihood estimators of the unknowns are inconsistent).


Stationarity

- The white noise series εt is stationary because
  1. E(εt) = 0 for all t
  2. var(εt) = σ² for all t
  3. cov(εt, εt−k) = 0 for all t and k ≠ 0.

- Here is a time series plot of a white noise process:

[Figure: Time Series Plot of a white noise process, fluctuating randomly around a constant level]
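The three properties can also be checked on a simulated white noise series: the sample mean should be near 0, the sample variance near σ², and the autocorrelations at every lag near 0. The sample size, lags and tolerances below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0
eps = rng.normal(0.0, np.sqrt(sigma2), 5000)  # simulated white noise

# Sample autocorrelations at lags 1..5: all should be close to zero.
acf = [float(np.corrcoef(eps[k:], eps[:-k])[0, 1]) for k in range(1, 6)]

print(round(eps.mean(), 3), round(eps.var(), 3), [round(r, 3) for r in acf])
```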


Stationarity

- However, not all time series are stationary. In fact, economic and financial time series are typically non-stationary.

- Here is a time series plot of the Dow Jones Industrial Average Index. The series cannot be stationary, as the trend rules out any possibility of a constant mean over time.

[Figure: Time Series Plot of Dow-Jones Index, values between about 3500 and 4000, exhibiting a trend]


Stationarity

- Suppose that Yt follows an AR(1) process without drift (an intercept). Is Yt stationary?

- Note that

  Yt = φ1Yt−1 + εt
     = φ1(φ1Yt−2 + εt−1) + εt
     = εt + φ1εt−1 + φ1^2 εt−2 + φ1^3 εt−3 + · · · + φ1^t Y0

- Without loss of generality, assume that Y0 = 0.
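The back-substitution above can be verified numerically: with Y0 = 0, running the AR(1) recursion forward gives the same value as summing the weighted shocks φ1^j εt−j directly. The values t = 50 and φ1 = 0.7 are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
t, phi1 = 50, 0.7
eps = rng.normal(0.0, 1.0, t + 1)  # eps[1], ..., eps[t] are used

# Recursive form: Y_s = phi_1*Y_{s-1} + eps_s, starting from Y_0 = 0
y = 0.0
for s in range(1, t + 1):
    y = phi1 * y + eps[s]

# Back-substituted form: Y_t = sum_{j=0}^{t-1} phi_1^j * eps_{t-j}
y_sum = sum(phi1 ** j * eps[t - j] for j in range(t))

print(y, y_sum)  # the two forms agree up to floating-point error
```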


Stationarity

- Hence E(Yt) = 0.

- Assuming that the process started a long time ago (i.e., t is large) and |φ1| < 1, it can be shown, for t > k ≥ 0, that

  var(Yt) = σ² / (1 − φ1^2)
  cov(Yt, Yt−k) = φ1^k σ² / (1 − φ1^2) = φ1^k var(Yt)

- That is, the mean, variance and covariances are all independent of t, provided that t is large and |φ1| < 1.

- When t is large, the necessary and sufficient condition for stationarity of an AR(1) process is |φ1| < 1.
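These formulas can be checked by simulation: generate a long AR(1) path, discard an initial burn-in so the process has effectively "started a long time ago", and compare sample moments with σ²/(1 − φ1^2) and φ1·var(Yt). The values φ1 = 0.8 and σ² = 1 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
phi1, sigma2, n = 0.8, 1.0, 200_000
eps = rng.normal(0.0, np.sqrt(sigma2), n)

y = np.zeros(n)
for t in range(1, n):
    y[t] = phi1 * y[t - 1] + eps[t]
y = y[1000:]  # discard burn-in: the process "started a long time ago"

theory_var = sigma2 / (1 - phi1 ** 2)  # sigma^2 / (1 - phi_1^2)
theory_cov1 = phi1 * theory_var        # lag-1 autocovariance: phi_1 * var(Y_t)

sample_var = y.var()
sample_cov1 = np.mean((y[1:] - y.mean()) * (y[:-1] - y.mean()))
print(sample_var, theory_var, sample_cov1, theory_cov1)
```

With |φ1| ≥ 1 the sample variance would keep growing with the sample length instead of settling near a constant.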

13 / 67

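The AR(1) moments above can be checked numerically. Below is a minimal simulation sketch (not from the slides, and assuming numpy is available): it generates a long AR(1) path and compares the sample variance and lag-1 covariance with the theoretical values σ²/(1 − φ1²) and φ1·var(Yt).

```python
import numpy as np

# Hypothetical numerical check: simulate Yt = phi1*Y(t-1) + eps_t and compare
# sample moments with var(Yt) = sigma^2/(1 - phi1^2), cov(Yt, Y(t-1)) = phi1*var(Yt).
rng = np.random.default_rng(0)
phi1, sigma, n = 0.7, 1.0, 200_000
eps = rng.normal(0.0, sigma, n)
y = np.empty(n)
y[0] = 0.0                              # Y0 = 0, as on the slide
for t in range(1, n):
    y[t] = phi1 * y[t - 1] + eps[t]

burn = 1_000                            # discard the start so t is "large"
sample_var = y[burn:].var()
cov1 = np.cov(y[burn:-1], y[burn + 1:])[0, 1]
theory_var = sigma**2 / (1 - phi1**2)   # ≈ 1.961
print(sample_var, cov1, theory_var)
```

With |φ1| ≥ 1 the same loop would produce a path whose sample variance never settles down, which is the numerical face of non-stationarity.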

Stationarity

I Consider the special case of φ1 = 1, i.e., Yt = Yt−1 + εt.

I This process is known as a "random walk".

I Now, assuming that Y0 = 0, Yt may be expressed equivalently as

Yt = ∑(j=0 to t−1) εt−j

I Thus,

1. E(Yt) = 0 for all t
2. var(Yt) = tσ² for all t
3. cov(Yt, Yt−k) = (t − k)σ² for all t > k ≥ 0.


Stationarity

I Both the variance and the covariance depend on t; the time series is thus non-stationary.

I The Yt's also vary more as t increases. That means that when the data follow a random walk, the best prediction of the future is the present (a naive forecast), and the prediction becomes less accurate the further into the future we forecast.

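The growing variance of a random walk is easy to see by simulation. The sketch below (illustrative only, assuming numpy) generates many independent random-walk paths and checks that the cross-sectional variance at time t is close to tσ²:

```python
import numpy as np

# Hypothetical illustration: for Yt = Y(t-1) + eps_t with Y0 = 0,
# var(Yt) = t*sigma^2, so the spread of simulated paths grows with t.
rng = np.random.default_rng(1)
sigma = 1.0
n_paths, n_steps = 10_000, 400
eps = rng.normal(0.0, sigma, (n_paths, n_steps))
paths = eps.cumsum(axis=1)      # Yt = eps_1 + ... + eps_t along each row

var_at = {t: paths[:, t - 1].var() for t in (100, 400)}
print(var_at)                   # each value close to t*sigma^2
```

The widening fan of paths is exactly why a naive forecast degrades with the horizon.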

Stationarity

I Suppose that the model is an AR(2) without drift: Yt = φ1Yt−1 + φ2Yt−2 + εt

I The necessary and sufficient conditions of stationarity for AR(2) are:

φ1 + φ2 < 1, φ2 − φ1 < 1 and |φ2| < 1

I The algebraic complexity of the conditions increases with the order of the process. While these conditions can be generalised and do obey some kind of pattern, it is not necessary to learn their derivation. The key point to note is that AR processes are not stationary unless appropriate conditions are imposed on the coefficients.

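The three inequalities have an equivalent characteristic-root form: the AR(2) is stationary when the roots of 1 − φ1z − φ2z² = 0 lie outside the unit circle. The sketch below (hypothetical helper names, assuming numpy) checks both formulations against each other:

```python
import numpy as np

# Hypothetical helper: an AR(2) is stationary iff the roots of
# 1 - phi1*z - phi2*z^2 = 0 lie outside the unit circle, which is
# equivalent to the three inequalities on the slide.
def ar2_stationary(phi1, phi2):
    roots = np.roots([-phi2, -phi1, 1.0])   # highest power first
    return np.all(np.abs(roots) > 1)

def triangle_conditions(phi1, phi2):
    return (phi1 + phi2 < 1) and (phi2 - phi1 < 1) and (abs(phi2) < 1)

print(ar2_stationary(0.5, 0.3), triangle_conditions(0.5, 0.3))    # True True
print(ar2_stationary(1.7, -0.7), triangle_conditions(1.7, -0.7))  # False False
```

The second example (φ1 = 1.7, φ2 = −0.7) has a unit root, which is why it reappears later in the chapter as a series that needs differencing.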

Stationarity

I Now, suppose that the model of interest is an MA(1) without drift:

Yt = εt − θ1εt−1

I It can be shown, regardless of the value of θ1, that

1. E(Yt) = 0 for all t
2. var(Yt) = σ²(1 + θ1²) for all t
3. cov(Yt, Yt−k) = −θ1σ² if k = 1, and 0 otherwise, for all t > k > 0

I An MA(1) process is thus always stationary, without the need to impose any condition on the unknown coefficient.

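These MA(1) moments can also be verified by simulation. A minimal sketch (illustrative, assuming numpy), checking that the variance is σ²(1 + θ1²), the lag-1 covariance is −θ1σ², and covariances beyond lag 1 vanish:

```python
import numpy as np

# Hypothetical check: simulate Yt = eps_t - theta1*eps_(t-1) and compare
# sample moments with var = sigma^2*(1 + theta1^2), cov(lag 1) = -theta1*sigma^2.
rng = np.random.default_rng(2)
theta1, sigma, n = 0.6, 1.0, 500_000
eps = rng.normal(0.0, sigma, n + 1)
y = eps[1:] - theta1 * eps[:-1]

var_hat = y.var()
cov1_hat = np.cov(y[:-1], y[1:])[0, 1]
cov2_hat = np.cov(y[:-2], y[2:])[0, 1]
print(var_hat, cov1_hat, cov2_hat)   # ≈ 1.36, -0.6, 0
```

Note that no restriction on θ1 was needed for these moments to be finite and time-invariant, in contrast to the AR(1) case.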

Stationarity

I For an MA(2) process without drift: Yt = εt − θ1εt−1 − θ2εt−2

I 1. E(Yt) = 0 for all t
2. var(Yt) = σ²(1 + θ1² + θ2²) for all t
3. cov(Yt, Yt−k) = −θ1(1 − θ2)σ² if k = 1, −θ2σ² if k = 2, and 0 otherwise, for all t > k > 0

I Again, an MA(2) process is stationary irrespective of the values of the unknown coefficients.

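The less obvious lag-1 formula, −θ1(1 − θ2)σ², can be checked the same way. A small simulation sketch (illustrative, assuming numpy):

```python
import numpy as np

# Hypothetical check of the MA(2) autocovariances: for
# Yt = eps_t - th1*eps_(t-1) - th2*eps_(t-2),
# gamma1 = -th1*(1 - th2)*sigma^2 and gamma2 = -th2*sigma^2.
rng = np.random.default_rng(3)
th1, th2, sigma, n = 0.5, -0.3, 1.0, 500_000
eps = rng.normal(0.0, sigma, n + 2)
y = eps[2:] - th1 * eps[1:-1] - th2 * eps[:-2]

gamma1_hat = np.cov(y[:-1], y[1:])[0, 1]
gamma2_hat = np.cov(y[:-2], y[2:])[0, 1]
gamma1 = -th1 * (1 - th2) * sigma**2   # -0.5 * 1.3 = -0.65
gamma2 = -th2 * sigma**2               # 0.3
print(gamma1_hat, gamma2_hat)
```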

Invertibility

I MA processes are always stationary regardless of the values of the coefficients, but they are not necessarily invertible.

I A finite-order MA process is said to be invertible if it can be converted into a stationary AR process of infinite order.

I As an example, consider an MA(1) process:

Yt = εt − θ1εt−1
= εt − θ1(Yt−1 + θ1εt−2)
= · · ·
= εt − θ1Yt−1 − θ1²Yt−2 − θ1³Yt−3 − · · · − θ1^t ε0

I In order for the MA(1) to be equivalent to an AR(∞), the last term on the r.h.s. of the above equation has to be zero.


Invertibility

I Assume that the process started a long time ago. With |θ1| < 1, we have lim(t→∞) θ1^t ε0 = 0. So the condition |θ1| < 1 enables us to convert an MA(1) to an AR(∞).

I The conditions of invertibility for an MA(q) process are analogous to those of stationarity for an AR(q) process.

I Now consider the following two MA(1) processes:

1. Yt = εt + 2εt−1; εt ∼ i.i.d.(0, 1)
2. Yt = εt + (1/2)εt−1; εt ∼ i.i.d.(0, 4)

I The second process is invertible but the first is non-invertible. However, both processes generate the same mean, variance and covariances of the Yt's.

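That the two processes share the same second moments follows directly from the MA(1) formulas (here written for Yt = εt + θεt−1, so the lag-1 covariance is +θσ²). A quick analytic check, with an illustrative helper name:

```python
# Hypothetical check that the two MA(1) processes on the slide share the same
# second moments. For Yt = eps_t + theta*eps_(t-1) with var(eps) = s2:
#   var(Yt) = s2*(1 + theta^2),  cov(Yt, Y(t-1)) = theta*s2.
def ma1_moments(theta, s2):
    return s2 * (1 + theta**2), theta * s2

m1 = ma1_moments(2.0, 1.0)   # non-invertible: |theta| = 2 > 1
m2 = ma1_moments(0.5, 4.0)   # invertible:     |theta| = 1/2 < 1
print(m1, m2)                # both (5.0, 2.0)
```

Since first and second moments cannot distinguish the two, software resolves the ambiguity by always reporting the invertible representation.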

Invertibility

I For any non-invertible MA, there is always an equivalent invertible representation up to the second moment. The converse is also true. We prefer the invertible representation because, if we can convert an MA process to an AR process, we can recover the unobservable εt from the past values of the observable Yt. If the process is non-invertible, then in order to find the value of εt we have to know all future values of Yt, which are unobservable at time t.

I To explain, note that

εt = Yt + θ1Yt−1 + θ1²Yt−2 + θ1³Yt−3 + · · · + θ1^t ε0

or

εt = −(1/θ1)Yt+1 − (1/θ1²)Yt+2 − (1/θ1³)Yt+3 − · · · + (1/θ1^k)εt+k, where k > 0

(write Yt+1 = εt+1 − θ1εt)

Invertibility

I Also, consider expressing the most recent error as a combination of current and past observations via the AR(∞) representation. For an invertible process, |θ1| < 1, so the most recent observation has a higher weight than any observation from the past. When |θ1| > 1, the weights increase with the lag, so the more distant the observation, the greater its influence on the current error. When |θ1| = 1, the weights are constant in size, and distant observations have the same influence as the current observation. As neither of the last two cases makes much sense, we prefer the invertible process.

I Invertibility is a restriction programmed into time-series software for estimating MA coefficients. It is not something that we check for in data analysis.


Differencing

I Often a non-stationary series can be made stationary through differencing.

I For example,

1. Yt = Yt−1 + εt is non-stationary, but Wt = Yt − Yt−1 = εt is stationary
2. Yt = 1.7Yt−1 − 0.7Yt−2 + εt is non-stationary, but Wt = Yt − Yt−1 = 0.7Wt−1 + εt is stationary

I Let ∆ be the difference operator and B the backward shift (or lag) operator, such that ∆Yt = Yt − Yt−1 and BYt = Yt−1.

I Thus,

∆Yt = Yt − BYt = (1 − B)Yt

and

∆²Yt = ∆∆Yt = (Yt − Yt−1) − (Yt−1 − Yt−2) = Yt − 2Yt−1 + Yt−2 = (1 − 2B + B²)Yt = (1 − B)²Yt

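The operator algebra can be mirrored directly in code: `np.diff` applies ∆ = (1 − B), and applying it twice reproduces the expansion Yt − 2Yt−1 + Yt−2. A short sketch (illustrative data, assuming numpy):

```python
import numpy as np

# Hypothetical illustration of the operators: np.diff applies Delta = (1 - B),
# and np.diff(..., n=2) matches the expansion Yt - 2*Y(t-1) + Y(t-2).
y = np.array([3.0, 5.0, 4.0, 8.0, 7.0, 11.0])

d1 = np.diff(y)                          # (1 - B)Y,   length n - 1
d2 = np.diff(y, n=2)                     # (1 - B)^2 Y, length n - 2
expanded = y[2:] - 2 * y[1:-1] + y[:-2]  # Yt - 2*Y(t-1) + Y(t-2)
print(d2, expanded)                      # identical arrays
```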

Differencing and Order of Integration

I In general, a dth-order difference can be written as ∆^d Yt = (1 − B)^d Yt

I The differenced series has n − 1 observations after taking the first difference, n − 2 observations after taking the second difference, and n − d observations after taking d differences.

I The number of times the original series must be differenced in order to achieve stationarity is called the order of integration, denoted by d.

I In practice, it is seldom necessary to go beyond a second difference, because real data generally involve only first- or second-order non-stationarity.


Page 69: CHAPTER 5: Box-Jenkins (ARIMA) Forecastingpersonal.cb.cityu.edu.hk/msawan/teaching/ms6215/MS6215Ch5.pdf · I The Box-Jenkins methodology refers to a set of procedures for identifying

1. Overview2. Stationarity and Invertibility

3. Differencing and General ARIMA Representation4. Autocorrelation and Partial Autocorrelation Functions

5. Model Identification and Estimation6. Diagnostic Checking and Forecasting

General ARIMA representation

- If Y_t is integrated of order d, we write Y_t ∼ I(d).

- Now, suppose that Y_t ∼ I(d) and the stationary series obtained after d-th order differencing, W_t, is represented by an ARMA(p, q) model.

- Then we say that Y_t is an ARIMA(p, d, q) process, that is,

(1 − B)^d Y_t = W_t,
W_t = δ + φ_1 W_{t−1} + φ_2 W_{t−2} + ... + φ_p W_{t−p} + ε_t − θ_1 ε_{t−1} − θ_2 ε_{t−2} − ... − θ_q ε_{t−q}

25 / 67


Autocorrelation Function

The question is, in practice, how can one tell if the data are stationary?

- Define the autocovariance at lag k as γ_k = cov(Y_t, Y_{t−k}). (Hence γ_0 = var(Y_t).)

- Define the autocorrelation (AC) at lag k as ρ_k = γ_k / γ_0.

- For a random walk Y_t = Y_{t−1} + ε_t, we have γ_0 = var(Y_t) = tσ² and γ_k = cov(Y_t, Y_{t−k}) = (t − k)σ². Consequently,
ρ_1 = (t − 1)/t, ρ_2 = (t − 2)/t, ..., ρ_k = (t − k)/t.

26 / 67
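The slow decay implied by ρ_k = (t − k)/t is easy to see in a simulation (a Python/numpy sketch; the seed and sample size are arbitrary choices for the illustration):

```python
import numpy as np

# Simulate a random walk Y_t = Y_{t-1} + eps_t and inspect its sample ACF.
rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal(500))

def acf(y, k):
    """Sample autocorrelation at lag k (k >= 1)."""
    yb = y.mean()
    return np.sum((y[k:] - yb) * (y[:-k] - yb)) / np.sum((y - yb) ** 2)

# All values sit close to 1 and decline only very slowly with the lag.
print([round(acf(y, k), 3) for k in (1, 5, 10)])
```

This is the empirical counterpart of ρ_k = (t − k)/t being near 1 for small k.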


Autocorrelation Function

This produces the following autocorrelation function (ACF). The ACF dies down to zero extremely slowly as k increases.

[Figure: ACF plot, Model: ArmaRoutine(1;0;0;0); vertical axis "Autocorrelations" from −1.0 to 1.0, horizontal axis "Lag" from 0 to 41; the bars stay near 1 and decline very slowly.]

27 / 67


Autocorrelation Function

- Now, if the process is a stationary AR(1) process, i.e., |φ_1| < 1, it can easily be verified that ρ_k = φ_1^k.

- So the ACF dies down to zero relatively quickly as k increases.

- Suppose that φ_1 = 0.8; then the ACF would look as follows:

[Figure: ACF plot, Model: ArmaRoutine(0.8;0;0;0); vertical axis "Autocorrelations" from −1.0 to 1.0, horizontal axis "Lag" from 0 to 41; the bars decay geometrically toward zero.]

28 / 67
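A small sketch makes the geometric decay concrete (Python/numpy; the simulated series and seed are illustrative, not from the slides):

```python
import numpy as np

# Theoretical ACF of a stationary AR(1) with phi_1 = 0.8: rho_k = phi_1**k.
phi = 0.8
theoretical = [phi ** k for k in range(1, 6)]   # 0.8, 0.64, 0.512, ...

# Simulate the AR(1) without drift and compare the sample lag-1 AC.
rng = np.random.default_rng(1)
eps = rng.standard_normal(2000)
y = np.zeros(2000)
for t in range(1, 2000):
    y[t] = phi * y[t - 1] + eps[t]

yb = y.mean()
r1 = np.sum((y[1:] - yb) * (y[:-1] - yb)) / np.sum((y - yb) ** 2)
print(round(r1, 2))   # close to the theoretical rho_1 = 0.8
```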


Autocorrelation Function

- Now, consider an AR(2) process without drift: Y_t = φ_1 Y_{t−1} + φ_2 Y_{t−2} + ε_t.

- It can be shown that the AC coefficients are:
ρ_1 = φ_1 / (1 − φ_2),
ρ_2 = φ_2 + φ_1² / (1 − φ_2), and
ρ_k = φ_1 ρ_{k−1} + φ_2 ρ_{k−2} for k > 2.

- Hence the ACF dies down to zero according to a mixture of damped exponentials and/or damped sine waves.

- In general, the ACF of a stationary AR process dies down to zero as k increases.

29 / 67
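The recursion above can be sketched directly (Python; the coefficients φ_1 = 0.5, φ_2 = 0.3 are illustrative stationary values, not from the slides):

```python
# ACF of an AR(2) via the starting values and the recursion for k > 2.
phi1, phi2 = 0.5, 0.3
rho = [1.0]                                  # rho_0 = 1
rho.append(phi1 / (1 - phi2))                # rho_1 = phi1/(1 - phi2)
rho.append(phi2 + phi1 ** 2 / (1 - phi2))    # rho_2 = phi2 + phi1^2/(1 - phi2)
for k in range(3, 11):
    rho.append(phi1 * rho[k - 1] + phi2 * rho[k - 2])   # rho_k, k > 2

print([round(r, 4) for r in rho[:4]])  # [1.0, 0.7143, 0.6571, 0.5429]
```

The sequence decays smoothly toward zero, as the damped-exponential description predicts.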


Autocorrelation Function

- Consider an MA(1) process with no drift: Y_t = ε_t − θ_1 ε_{t−1}.

- It can easily be shown that

ρ_k = γ_k / γ_0 = −θ_1 / (1 + θ_1²) if k = 1, and 0 otherwise.

- Hence the ACF cuts off at zero after lag 1.

30 / 67
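The cut-off can be checked on a simulated series (Python/numpy sketch; θ_1 = 0.6, the seed, and the sample size are illustrative choices):

```python
import numpy as np

# Theoretical lag-1 autocorrelation of MA(1): rho_1 = -theta1/(1 + theta1^2).
theta1 = 0.6
rho1 = -theta1 / (1 + theta1 ** 2)
print(round(rho1, 4))  # -0.4412

# Simulate Y_t = eps_t - theta1 * eps_{t-1}; sample ACF should cut off after lag 1.
rng = np.random.default_rng(2)
eps = rng.standard_normal(5001)
y = eps[1:] - theta1 * eps[:-1]
yb = y.mean()
r = [float(np.sum((y[k:] - yb) * (y[:-k] - yb)) / np.sum((y - yb) ** 2))
     for k in (1, 2, 3)]
```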


Autocorrelation Function

- Similarly, for an MA(2) process, it can be shown that
ρ_1 = −θ_1(1 − θ_2) / (1 + θ_1² + θ_2²),
ρ_2 = −θ_2 / (1 + θ_1² + θ_2²), and
ρ_k = 0 for k > 2.

- The ACF of an MA(2) process thus cuts off at zero after 2 lags.

31 / 67
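Plugging illustrative invertible coefficients into these formulas (a Python sketch; θ_1 = 0.5 and θ_2 = −0.4 are made-up values, not from the slides):

```python
# Theoretical MA(2) autocorrelations for chosen theta1, theta2.
theta1, theta2 = 0.5, -0.4
denom = 1 + theta1 ** 2 + theta2 ** 2          # 1 + theta1^2 + theta2^2 = 1.41
rho1 = -theta1 * (1 - theta2) / denom          # -theta1 (1 - theta2) / denom
rho2 = -theta2 / denom                         # -theta2 / denom
print(round(rho1, 4), round(rho2, 4))          # -0.4965 0.2837; rho_k = 0 for k > 2
```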


Autocorrelation Function

- In general, if a time series is non-stationary, its ACF dies down to zero slowly and the first autocorrelation is near 1. If a time series is stationary, the ACF dies down to zero relatively quickly in the case of AR, cuts off after certain lags in the case of MA, and dies down to zero relatively quickly in the case of ARMA (as the dying-down pattern produced by the AR component would dominate the cutting-off pattern produced by the MA component).

32 / 67


Autocorrelation Function

- Question: How are the AC coefficients estimated in practice?

- The sample AC at lag k is calculated as follows:

r_k = Σ_{t=k+1}^n (Y_t − Ȳ)(Y_{t−k} − Ȳ) / Σ_{t=1}^n (Y_t − Ȳ)², where Ȳ = (Σ_{t=1}^n Y_t) / n.

- Thus, r_k measures the linear association between the time series observations separated by a lag of k time units in the sample, and is an estimator of ρ_k.

- In other words, computing the sample ACs is similar to performing a series of simple regressions of Y_t on Y_{t−1}, then on Y_{t−2}, then on Y_{t−3}, and so on. The autocorrelation coefficients reflect only the relationship between the two quantities included in the regression.

33 / 67
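The sample formula translates line by line into code (a plain-Python sketch; the toy series is invented for the example):

```python
# Direct implementation of the sample autocorrelation r_k defined above.
def sample_ac(y, k):
    n = len(y)
    ybar = sum(y) / n
    num = sum((y[t] - ybar) * (y[t - k] - ybar) for t in range(k, n))
    den = sum((v - ybar) ** 2 for v in y)
    return num / den

y = [2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 5.0, 7.0]
print(sample_ac(y, 1))  # 0.125
```

Note that the denominator always runs over the full sample, so r_0 = 1 by construction.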


Autocorrelation Function

- The standard error of r_k is s_{r_k} = √((1 + 2 Σ_{j=1}^{k−1} r_j²) / n).

- To test H_0: ρ_k = 0 vs. H_1: ρ_k ≠ 0, we use the statistic t = r_k / s_{r_k}.

- The rule of thumb is to reject H_0 at an approximately 5% level of significance if |t| > 2, or equivalently, |r_k| > 2 s_{r_k}.

34 / 67
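The standard error and the rule-of-thumb test are easy to sketch (Python; the r values below are hypothetical sample autocorrelations, chosen to resemble an AR(1) sample of n = 99):

```python
import math

def acf_std_error(r, k, n):
    """s_{r_k} = sqrt((1 + 2 * sum_{j=1}^{k-1} r_j^2) / n); r holds r_1, r_2, ..."""
    return math.sqrt((1 + 2 * sum(rj ** 2 for rj in r[:k - 1])) / n)

r = [0.51, 0.34, 0.21]   # hypothetical r_1, r_2, r_3
n = 99
for k in (1, 2, 3):
    s = acf_std_error(r, k, n)
    t = r[k - 1] / s
    print(k, round(s, 4), "reject H0" if abs(t) > 2 else "do not reject H0")
```

With these values, s_{r_1} = √(1/99) ≈ 0.1005, matching the first "Std Error" column entry in the SAS output on the next slide.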


Autocorrelation Function

- Sample ACF of a stationary AR(1) process:

The ARIMA Procedure
Name of Variable = y
Mean of Working Series     -0.08047
Standard Deviation          1.123515
Number of Observations      99

Lag   Covariance   Correlation   Std Error
  0     1.262285       1.00000    0
  1     0.643124       0.50949    0.100504
  2     0.435316       0.34486    0.123875
  3     0.266020       0.21075    0.133221
  4     0.111942       0.08868    0.136547
  5     0.109251       0.08655    0.137127
  6     0.012504       0.00991    0.137678
  7    -0.040513      -0.03209    0.137685
  8    -0.199299      -0.15789    0.137761
  9    -0.253309      -0.20067    0.139576

(The bar display of the correlations, with "." marking two standard errors, is omitted.)

35 / 67


Autocorrelation Function

- Sample ACF of an invertible MA(2) process:

The ARIMA Procedure
Name of Variable = y
Mean of Working Series      0.020855
Standard Deviation          1.168993
Number of Observations      98

Lag   Covariance   Correlation   Std Error
  0     1.366545       1.00000    0
  1    -0.345078      -0.25252    0.101015
  2    -0.288095      -0.21082    0.107263
  3    -0.064644      -0.04730    0.111411
  4     0.160680       0.11758    0.111616
  5     0.0060944      0.00446    0.112873
  6    -0.117599      -0.08606    0.112875
  7    -0.104943      -0.07679    0.113542
  8     0.151050       0.11053    0.114071
  9     0.122021       0.08929    0.115159

(The bar display of the correlations, with "." marking two standard errors, is omitted.)

36 / 67


Autocorrelation Function

- Sample ACF of a random walk process:

The ARIMA Procedure
Name of Variable = y
Mean of Working Series     16.79147
Standard Deviation          9.39551
Number of Observations      98

Lag   Covariance   Correlation   Std Error
  0    88.275614      1.00000     0
  1    85.581769      0.96948     0.101015
  2    81.637135      0.92480     0.171423
  3    77.030769      0.87262     0.216425
  4    72.573174      0.82212     0.249759
  5    68.419227      0.77506     0.275995
  6    64.688289      0.73280     0.297377
  7    61.119745      0.69237     0.315265
  8    57.932253      0.65627     0.330417
  9    55.302847      0.62648     0.343460

(The bar display of the correlations, with "." marking two standard errors, is omitted.)

37 / 67


Autocorrelation Function

- Sample ACF of a random walk process after first difference:

The ARIMA Procedure
Name of Variable = y
Period(s) of Differencing   1
Mean of Working Series      0.261527
Standard Deviation          1.160915
Number of Observations      97
Observation(s) eliminated by differencing   1

Lag   Covariance   Correlation   Std Error
  0     1.347723       1.00000    0
  1     0.712219       0.52846    0.101535
  2     0.263094       0.19521    0.126757
  3    -0.043040      -0.03194    0.129820
  4    -0.151081      -0.11210    0.129901
  5    -0.247540      -0.18367    0.130894
  6    -0.285363      -0.21174    0.133525
  7    -0.274084      -0.20337    0.136943
  8    -0.215508      -0.15991    0.140022
  9    -0.077629      -0.05760    0.141892

(The bar display of the correlations, with "." marking two standard errors, is omitted.)

38 / 67


Partial Autocorrelation Function

- The partial autocorrelation (PAC) at lag k, denoted as r_kk, measures the degree of association between Y_t and Y_{t−k} when the effects of the other time lags (1, 2, 3, ..., k−1) are removed. Intuitively, the PAC may be thought of as the AC of time series observations separated by k time units with the effects of the intervening observations eliminated. The PACs at lags 1, 2, 3, ... make up the partial autocorrelation function (PACF).

- Computing the PACs is more in the spirit of multiple regression. The PAC removes the effects of all lower-order lags before computing the autocorrelation. For example, the 2nd PAC is the effect of the observation two periods ago on the current observation, given that the effect of the observation one period ago has been removed.

39 / 67
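The multiple-regression view can be sketched directly: regress Y_t on its first two lags and read the lag-2 PAC off the coefficient on Y_{t−2} (a Python/numpy illustration; the AR(1) series, seed, and sample size are made up for the example):

```python
import numpy as np

# Simulate an AR(1): the true PAC is 0.8 at lag 1 and roughly zero beyond.
rng = np.random.default_rng(3)
eps = rng.standard_normal(3000)
y = np.zeros(3000)
for t in range(1, 3000):
    y[t] = 0.8 * y[t - 1] + eps[t]

# OLS of Y_t on an intercept, Y_{t-1}, and Y_{t-2}.
Y = y[2:]
X = np.column_stack([np.ones(len(Y)), y[1:-1], y[:-2]])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(round(float(beta[2]), 3))   # lag-2 PAC estimate: near zero for an AR(1)
```

The lag-1 coefficient lands near 0.8, while the lag-2 coefficient is close to zero, exactly the cut-off pattern described on the next slide.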


Partial Autocorrelation Function

- Thus, if the model is an AR(2), in theory the first two PACs would be non-zero and all other PACs would be zero.

- In general, the PACF of a stationary AR(p) process cuts off at zero after lag p.

- Because an invertible finite-order MA process can be written as a stationary AR(∞), the PACF of an MA process dies down to zero in the same way as the ACF of the analogous AR process dies down to zero.

- It is also clear that the AC and PAC at lag 1 of a given process are identical.

40 / 67

- The sample PAC at lag k is

  r_kk = r_1,  if k = 1;

  r_kk = [ r_k − ∑_{j=1}^{k−1} r_{k−1,j} r_{k−j} ] / [ 1 − ∑_{j=1}^{k−1} r_{k−1,j} r_j ],  if k > 1,

  where r_kj = r_{k−1,j} − r_kk r_{k−1,k−j} for j = 1, 2, · · · , k − 1.

- The standard error of r_kk is s_{r_kk} = √(1/n).

- The statistic for testing H0 : ρ_kk = 0 vs. H1 : ρ_kk ≠ 0 is

  t = r_kk / s_{r_kk}.

- The rule of thumb is to reject H0 at an approximately 5% level of significance if |t| > 2, or equivalently, |r_kk| > 2 s_{r_kk}.

41 / 67
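The recursion above can be sketched directly in code. This is a minimal illustration, assuming NumPy; the names `sample_acf` and `sample_pacf` are this sketch's own, not from the slides:

```python
import numpy as np

def sample_acf(y, nlags):
    """Sample autocorrelations r_1, ..., r_nlags of a series y."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    c0 = np.sum(d * d)
    return np.array([np.sum(d[k:] * d[:-k]) / c0 for k in range(1, nlags + 1)])

def sample_pacf(y, nlags):
    """Sample partial autocorrelations r_kk via the slide's recursion:
    r_11 = r_1; for k > 1 the ratio formula, then the update
    r_kj = r_{k-1,j} - r_kk * r_{k-1,k-j}."""
    r = sample_acf(y, nlags)        # r[i] holds r_{i+1}
    phi = {(1, 1): r[0]}            # phi[(k, j)] holds r_{kj}
    pacf = [r[0]]
    for k in range(2, nlags + 1):
        num = r[k - 1] - sum(phi[(k - 1, j)] * r[k - j - 1] for j in range(1, k))
        den = 1.0 - sum(phi[(k - 1, j)] * r[j - 1] for j in range(1, k))
        rkk = num / den
        for j in range(1, k):
            phi[(k, j)] = phi[(k - 1, j)] - rkk * phi[(k - 1, k - j)]
        phi[(k, k)] = rkk
        pacf.append(rkk)
    return np.array(pacf)

y = np.random.default_rng(0).standard_normal(500)
r, p = sample_acf(y, 5), sample_pacf(y, 5)
# By construction p[0] == r[0], matching the slide's remark that the
# AC and PAC at lag 1 of a given process are identical.
```

Note that the k = 2 case of the recursion collapses to (r_2 − r_1²)/(1 − r_1²), a useful sanity check on any implementation.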

- Sample PACF of a stationary AR(2) process (SAS output, partial autocorrelations by lag):

  Lag   Partial Autocorrelation
   1     0.23561
   2    -0.48355
   3    -0.04892
   4     0.07474
   5    -0.00086
   6    -0.03688
   7     0.00335
   8     0.05250
   9    -0.07018
  10    -0.06710

42 / 67

- Sample PACF of an invertible MA(1) process (SAS output, partial autocorrelations by lag):

  Lag   Partial Autocorrelation
   1     0.47462
   2    -0.28275
   3     0.22025
   4    -0.08660
   5     0.07201
   6    -0.02120
   7     0.06986
   8    -0.06343
   9     0.00465
  10    -0.03059

43 / 67

Behaviour of AC and PAC for specific non-seasonal ARMA models:

  Model       AC                                    PAC
  AR(1)       Dies down in a damped                 Cuts off after lag 1
              exponential fashion
  AR(2)       Dies down according to a mixture      Cuts off after lag 2
              of damped exponentials and/or
              damped sine waves
  MA(1)       Cuts off after lag 1                  Dies down in a damped
                                                    exponential fashion
  MA(2)       Cuts off after lag 2                  Dies down according to a mixture
                                                    of damped exponentials and/or
                                                    damped sine waves
  ARMA(1,1)   Dies down in a damped                 Dies down in a damped
              exponential fashion                   exponential fashion

44 / 67
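The die-down versus cut-off patterns in the table can be seen in a quick simulation. This is an illustrative sketch; the coefficient 0.7 and the series length are arbitrary choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
eps = rng.standard_normal(n)

# AR(1): Y_t = 0.7 Y_{t-1} + eps_t  -- its ACF dies down geometrically (0.7^k)
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = 0.7 * ar[t - 1] + eps[t]

# MA(1): Y_t = eps_t - 0.7 eps_{t-1} -- its ACF cuts off after lag 1
ma = eps[1:] - 0.7 * eps[:-1]

def acf(y, k):
    """Sample autocorrelation of y at lag k."""
    d = y - y.mean()
    return np.sum(d[k:] * d[:-k]) / np.sum(d * d)

ar_acf = [acf(ar, k) for k in (1, 2, 3)]  # decays steadily towards zero
ma_acf = [acf(ma, k) for k in (1, 2, 3)]  # lag 1 near -0.7/(1 + 0.7^2), then near zero
```

Swapping which function dies down and which cuts off is exactly the AR/MA duality summarised in the table: the same experiment on the PACF would show the mirror-image pattern.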

Model Identification

- The ACF and PACF give insights into what models to fit to the data.

- Refer to Class Examples 1, 2 and 3 for the sample ACF and PACF of simulated AR(2), MA(1) and ARMA(2,1) processes respectively.

45 / 67

- We work backwards in identifying the underlying ARMA model for a time series: we know the theoretical properties of the ACF and PACF of a given ARMA process; if the sample ACF and PACF from the data have recognisable patterns, then we fit the ARMA model that would produce those ACF and PACF patterns.

- Note that the ARMA model is unduly restrictive: linear in coefficients, white-noise error assumption, etc. In practice, no data series is generated exactly by an ARMA process. Hence we look only for the best ARMA approximation to the real data.

46 / 67

Parameter Estimation

- The method most frequently used for estimating the unknowns in an ARMA model is maximum likelihood (M.L.).

- Given n observations Y1, Y2, · · · , Yn, the likelihood function L is defined to be the probability of obtaining the data actually observed. It is really just the joint density function of the data.

- The maximum likelihood estimators (M.L.E.) are those values of the parameters that would lead to the highest probability of producing the data actually observed; that is, they are the values of the unknowns that maximise the likelihood function L.

47 / 67

- In an ARMA model, L is a function of δ, the φ's, the θ's and σ² given the data Y1, Y2, · · · , Yn. One would also need to make an assumption about the distribution of the data.

- The M.L.E. of these unknowns are the values that make the observation of Y1, Y2, · · · , Yn a most likely event; that is, assuming the data come from a particular distribution (e.g., Gaussian), it is most likely that the unknown parameters take on the values of the M.L.E. in order for the data Y1, Y2, · · · , Yn to be observed.

48 / 67

Estimation of MA processes. Consider an MA(1) process: Yt = δ + εt − θ1 εt−1.

- Assume ε0 = 0.

- Use δ̂ = Ȳ and r1 = −θ1/(1 + θ1²) to obtain initial estimates of δ and θ1.

- Obtain the series of εt's using the relation εt = Yt − δ + θ1 εt−1.

- Use M.L.E. to obtain improved estimates of δ and θ1.

- Repeat these steps until the changes in the estimates are small.

49 / 67
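The residual recursion in the steps above can be sketched as follows. For brevity this sketch replaces the full M.L. iteration with a conditional least-squares grid search over θ1, which is this example's own simplification rather than the slide's exact procedure; `ma1_residuals` is an illustrative name:

```python
import numpy as np

def ma1_residuals(y, delta, theta1):
    """Recover eps_t from Y_t = delta + eps_t - theta1 * eps_{t-1},
    starting from the slide's initial condition eps_0 = 0."""
    eps = np.zeros(len(y))
    for t in range(len(y)):
        prev = eps[t - 1] if t > 0 else 0.0
        eps[t] = y[t] - delta + theta1 * prev
    return eps

# Simulate an invertible MA(1) with delta = 0 and theta1 = 0.5
rng = np.random.default_rng(1)
e = rng.standard_normal(3001)
y = e[1:] - 0.5 * e[:-1]

# Conditional least squares: choose theta1 minimising the residual sum of squares,
# with the sample mean standing in for the initial estimate of delta
grid = np.linspace(-0.9, 0.9, 181)
rss = [np.sum(ma1_residuals(y, y.mean(), th) ** 2) for th in grid]
theta_hat = grid[np.argmin(rss)]  # should land near the true value 0.5
```

Because the likelihood of an MA model depends on unobserved past errors, some such recursion (or a state-space filter) sits inside every practical estimator; the grid search here just makes the iteration explicit.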

- Before estimating any ARMA model, one would typically test whether the drift term should be included. A test of

  H0 : δ = 0 vs. H1 : δ ≠ 0

  may be conducted using the statistic

  t = z̄ / (s_z / √n′),

  where z̄ is the mean of the working series z, s_z is the standard deviation of z, and n′ is the number of observations of the working series.

- For example, if the series is I(0), then n′ = n; if the series is I(1), then n′ = n − 1.

- The decision rule is to reject H0 at an approximately 5% level of significance if |t| > 2.

50 / 67

- Refer to the MA(2) example seen before. Here, t = 0.020855/(1.168993/√98) = 0.176608 < 2. Thus, the drift term should not be included.

The ARIMA Procedure. Name of Variable = y.
Mean of Working Series: 0.020855; Standard Deviation: 1.168993; Number of Observations: 98.

Autocorrelations ("." in the SAS plot marks two standard errors):

  Lag   Covariance   Correlation   Std Error
   0     1.366545     1.00000      0
   1    -0.345078    -0.25252      0.101015
   2    -0.288095    -0.21082      0.107263
   3    -0.064644    -0.04730      0.111411
   4     0.160680     0.11758      0.111616
   5     0.0060944    0.00446      0.112873
   6    -0.117599    -0.08606      0.112875
   7    -0.104943    -0.07679      0.113542
   8     0.151050     0.11053      0.114071
   9     0.122021     0.08929      0.115159

51 / 67
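The drift-test arithmetic on this slide can be checked directly:

```python
import math

# Working-series mean, standard deviation and n' from the MA(2) example
zbar, s_z, n_prime = 0.020855, 1.168993, 98

t = zbar / (s_z / math.sqrt(n_prime))  # roughly 0.1766
# |t| < 2, so H0: delta = 0 is not rejected and the drift term is dropped
```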

Diagnostic Checking

- Often it is not straightforward to determine a single ARMA model that most adequately represents the data-generating process, and it is not uncommon to identify several candidate models at the identification stage. The model finally selected is the one considered best on a set of diagnostic checking criteria. These include:

  1. t-tests for coefficient significance
  2. Residual portmanteau test
  3. AIC and BIC for model selection

52 / 67

- Consider the data series of Class Example 4.
  - First, the data appear to be stationary;
  - Second, the drift term is significant;
  - Third, the sample ACF and PACF indicate that an AR(2) model probably best fits the data.

- For purposes of illustration, suppose that an MA(1) and an ARMA(2,1) are fitted in addition to the AR(2). A priori, we expect the AR(2) to be the preferred model of the three.

53 / 67

- The SAS program for estimation is as follows (the two "." lines stand for the data values omitted on the slide):

  data example4;
  input y;
  cards;
  4.0493268
  7.0899911
  4.7896497
  .
  .
  2.2253477
  2.439893
  ;
  proc arima data=example4;
  identify var=y;
  estimate p=2 method=ml printall;
  estimate q=1 method=ml printall;
  estimate p=2 q=1 method=ml printall;
  run;

54 / 67

T-Tests for Coefficient Significance

- The AR(2) specification produces the model

  Ŷt = 4.68115 + 0.35039 Yt−1 − 0.49115 Yt−2.

- The t-statistics indicate that both φ1 and φ2 are significantly different from zero.

- Note that for any ARMA(p, q) model, the mean value and the drift term are related through the formula

  µ = δ / (1 − φ1 − φ2 − · · · − φp).

- Hence for the present AR(2) model, µ = 4.10353 = 4.68115/(1 − 0.35039 + 0.49115).

55 / 67
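The mean/drift relation can be checked against the fitted AR(2) numbers:

```python
# mu = delta / (1 - phi1 - phi2) for the fitted AR(2); phi2 is -0.49115,
# so the denominator is 1 - 0.35039 + 0.49115, as on the slide
delta, phi1, phi2 = 4.68115, 0.35039, -0.49115
mu = delta / (1 - phi1 - phi2)  # roughly 4.1035
```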

- The MA(1) specification produces the model

  Ŷt = 4.10209 + 0.48367 et−1,

  where ej = Yj − Ŷj is the prediction error for the j-th period.

- The t-statistic indicates significance of θ1.

- µ = 4.10209 = δ.

56 / 67

- The ARMA(2,1) specification produces the model

  Ŷt = 4.49793 + 0.40904 Yt−1 − 0.50516 Yt−2 − 0.07693 et−1.

- The t-statistics indicate that φ1 and φ2 are significantly different from zero but θ1 is not.

- µ = 4.10349 = 4.49793/(1 − 0.40904 + 0.50516).

57 / 67



Residual Portmanteau Test

- If an ARMA(p,q) model is an adequate representation of the data-generating process of Yt, the residuals resulting from the fitted model should be uncorrelated.
- We postulate that et = η1et−1 + η2et−2 + η3et−3 + · · · .
- If the et's are uncorrelated, then η1 = η2 = η3 = · · · = 0.

58 / 67



Residual Portmanteau Test

- Hence we test
  H0 : η1 = η2 = η3 = · · · = ηm = 0 vs. H1 : otherwise.
- The portmanteau test statistic for H0 is

  Q(m) = (n − d)(n − d + 2) Σ_{k=1}^{m} r_k² / (n − d − k),

  where rk is the sample autocorrelation of the et's at lag k. Under H0, Q(m) ∼ χ²_{m−p−q}.
- The decision rule is to reject H0 at the α level of significance if Q(m) ≥ χ²_{α, m−p−q}.

59 / 67
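The Q(m) statistic can be computed directly from the residual autocorrelations. A minimal numpy sketch (the function name `ljung_box_q` is my own; `resid` is assumed to hold the n − d residuals from an ARMA(p,q) fitted to the d-times-differenced series):

```python
import numpy as np

def ljung_box_q(resid, m, p=0, q=0):
    """Portmanteau statistic Q(m) = (n-d)(n-d+2) * sum_{k=1}^m r_k^2 / (n-d-k).

    Returns (Q, dof), where dof = m - p - q is the chi-square
    degrees of freedom under H0.
    """
    e = np.asarray(resid, dtype=float)
    e = e - e.mean()
    n_eff = len(e)                # n - d effective observations
    denom = np.sum(e ** 2)
    Q = 0.0
    for k in range(1, m + 1):
        r_k = np.sum(e[k:] * e[:-k]) / denom   # sample AC of residuals at lag k
        Q += r_k ** 2 / (n_eff - k)
    return n_eff * (n_eff + 2) * Q, m - p - q

# White noise should give a small Q relative to chi-square(m - p - q),
# while a heavily autocorrelated series (here, a random walk) gives a large Q.
rng = np.random.default_rng(0)
wn = rng.standard_normal(500)
Q_wn, dof = ljung_box_q(wn, m=6, p=2, q=0)
Q_rw, _ = ljung_box_q(np.cumsum(wn), m=6, p=2, q=0)
```

In practice statsmodels' `acorr_ljungbox` performs the same computation.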



Residual Portmanteau Test

- For example, for the AR(2) model with m = 6, Q(6) = 3.85 with 4 degrees of freedom and a p-value of 0.4268. Hence H0 cannot be rejected and the residuals are uncorrelated up to lag 6. The same conclusion is reached for m = 12, 18, 24, 30 if one sets α to 0.05.
- On the other hand, residuals from the MA(1) model are significantly correlated. Residuals from the ARMA(2,1) model are uncorrelated.

60 / 67



Model Selection Criteria

The two most commonly adopted model selection criteria are the AIC and BIC:

- Akaike Information Criterion (AIC): AIC = −2 ln(L) + 2g
- Bayesian Information Criterion (BIC): BIC = −2 ln(L) + g ln(n),
  where L is the value of the likelihood function, g the number of coefficients being estimated, and n the number of observations.
- The BIC is also known as the Schwarz Bayesian Criterion (SBC).
- Both the AIC and BIC are "penalised" versions of the log-likelihood.

61 / 67
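Both criteria are one-liners once the maximised log-likelihood is known; a small sketch:

```python
import math

def aic(loglik, g):
    """AIC = -2 ln(L) + 2g, with g estimated coefficients."""
    return -2.0 * loglik + 2.0 * g

def bic(loglik, g, n):
    """BIC (a.k.a. SBC) = -2 ln(L) + g ln(n), with n observations."""
    return -2.0 * loglik + g * math.log(n)

# The BIC penalty g*ln(n) exceeds the AIC penalty 2g once ln(n) > 2,
# i.e. for any sample of n >= 8 observations.
```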



Model Selection Criteria

- The penalty term is larger in the BIC than in the AIC; in other words, the BIC penalises additional parameters more than the AIC does, meaning that the marginal cost of adding explanatory variables is greater with the BIC than with the AIC. Hence the BIC tends to select "parsimonious" models.
- Ideally, both the AIC and BIC should be as small as possible.
- In our example:
  AR(2): AIC = 1424.660, SBC = 1437.292
  MA(1): AIC = 1507.162, SBC = 1515.583
  ARMA(2,1): AIC = 1425.727, SBC = 1442.570

62 / 67
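Selecting the model with the smallest criterion value is mechanical; using the values reported above:

```python
# AIC/SBC values for the three candidate models fitted in this chapter.
models = {
    "AR(2)":     {"AIC": 1424.660, "SBC": 1437.292},
    "MA(1)":     {"AIC": 1507.162, "SBC": 1515.583},
    "ARMA(2,1)": {"AIC": 1425.727, "SBC": 1442.570},
}
best_aic = min(models, key=lambda name: models[name]["AIC"])
best_sbc = min(models, key=lambda name: models[name]["SBC"])
# Both criteria favour AR(2); the SBC's margin over ARMA(2,1) is wider
# because the SBC penalises the extra MA coefficient more heavily.
```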



Diagnostic Checking

- Summary of model diagnostics and comparisons:

              t-test        Q-test    AIC        BIC
  AR(2)       √             √         1424.660   1437.292
  MA(1)       √             ×         1507.162   1515.583
  ARMA(2,1)   √ (partial)   √         1425.727   1442.570

- Judging from these results, the AR(2) model is the best, corroborating our conjecture.

63 / 67


Forecasting

- The h-period-ahead forecast based on an ARMA(p,q) model is given by:

  Ŷt+h = δ + φ1Yt+h−1 + φ2Yt+h−2 + · · · + φpYt+h−p − θ1et+h−1 − θ2et+h−2 − · · · − θqet+h−q,

  where quantities on the r.h.s. of the equation may be replaced by their estimates when the actual values are unavailable: future Y's by their own forecasts, and future e's by their mean, zero.

64 / 67
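The recursion above can be sketched in a few lines. `arma_forecast` is a hypothetical helper of my own, using the slide's sign convention (MA terms enter as −θj·e); the numbers reproduce the AR(2) and MA(1) examples that follow, up to coefficient rounding:

```python
def arma_forecast(y, e, delta, phi, theta, h):
    """Iterated h-step-ahead ARMA(p,q) forecasts.

    y: observed series; e: in-sample residuals (same length as y);
    phi: [phi1..phip]; theta: [theta1..thetaq].
    """
    y, e = list(y), list(e)
    forecasts = []
    for _ in range(h):
        t = len(y)
        ar = sum(phi[i] * y[t - 1 - i] for i in range(len(phi)))
        ma = -sum(theta[j] * e[t - 1 - j] for j in range(len(theta)))
        f = delta + ar + ma
        forecasts.append(f)
        y.append(f)      # unknown future value: replaced by its forecast
        e.append(0.0)    # unknown future shock: replaced by its mean, zero
    return forecasts

# AR(2) example (last two observations 2.2253477, 2.4339893):
fc = arma_forecast([2.2253477, 2.4339893], [0.0, 0.0],
                   delta=4.681115, phi=[0.35039, -0.49115], theta=[], h=2)
# fc is approximately [4.440981, 5.0417]; the slide reports 5.041784,
# the small gap coming from the rounded coefficients shown here.

# MA(1) example: the slide's Yhat = 4.10209 + 0.48367*e corresponds to
# theta1 = -0.48367 in the delta - theta1*e convention.
fc_ma = arma_forecast([0.0], [-0.7263],
                      delta=4.10209, phi=[], theta=[-0.48367], h=3)
# fc_ma is approximately [3.7508, 4.10209, 4.10209]
```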


Forecasting

- For example, for the AR(2) model,

  Ŷ498+1 = 4.681115 + 0.35039 × 2.4339893 − 0.49115 × 2.2253477 = 4.440981
  Ŷ498+2 = 4.681115 + 0.35039 × 4.440981 − 0.49115 × 2.4339893 = 5.041784

- For the MA(1) model,

  Ŷ498+1 = 4.10209 + 0.48367 × (−0.7263) = 3.7508
  Ŷ498+2 = 4.10209 + 0.48367 × 0 = 4.10209
  Ŷ498+3 = 4.10209

- Class exercise: Example 5 – Finland's quarterly construction building permits (in thousands), seasonally adjusted, 1977Q1–1987Q1.

65 / 67


Summary

- ARIMA models are regression models that use past values of the series itself and unobservable random disturbances as explanatory variables.
- Stationarity of the data is a fundamental requirement underlying ARIMA and most other techniques involving time-series data.
- AR models are not always stationary; MA models are always stationary but not necessarily invertible; ARMA models are neither necessarily stationary nor necessarily invertible.
- A differencing transformation is required for non-stationary series. The number of times a series must be differenced in order to achieve stationarity is known as the order of integration. Differencing always results in a loss of information.

66 / 67


Summary

- Tentative model identification is based on recognisable patterns of the ACF and PACF. We look for the best approximation to the data by a member of the ARIMA family. Usually more than one candidate model is identified at the initial stage.
- ARIMA models are estimated by maximum likelihood.
- Diagnostic checking comprises t-tests of coefficient significance, the residual portmanteau test, and the AIC and BIC model selection criteria.
- Sometimes the diagnostics can lead to contradictory results, and no single model is necessarily superior to all others. We can consider combining models in such situations.

67 / 67