data visualisation for time series in environmental

11
Journal of Epidemiology and Biostatistics (2001) Vol. 6, No. 6, 433–443 433 Data visualisation for time series in environmental epidemiology B ERBAS 1 and RJ HYNDMAN 2 1 Department of Public Health, The University of Melbourne, Australia 2 Department of Econometrics and Business Statistics, Monash University, Australia © 2001 Isis Medical Media Limited Background Data visualisation has become an integral part of statistical modelling. Methods We present visualisation methods for preliminary exploration of time-series data, and graphical diagnostic methods for modelling relationships between time- series data in medicine. We use exploratory graphical methods to better understand the relationship between a time-series reponse and a number of potential covari- ates. Graphical methods are also used to examine any remaining information in the residuals from these models. Results We applied exploratory graphical methods to a time-series data set consisting of daily counts of hospital admissions for asthma, and pollution and climatic variables. We provide an overview of the most recent and widely applicable data-visualisation methods for portraying and analysing epidemiological time series. Discussion Exploratory graphical analysis allows insight into the underlying structure of observations in a data set, and graphical methods for diagnostic purposes after model-fitting provide insight into the fitted model and its inadequacies. Keywords additive models, environmental epidemiology graphical methods, nonparametric smoothing seasonality. Introduction Graphics can be used to uncover and present complex trends and relationships in an informal and simple way 1 , enabling the researcher to better understand the underly- ing structure of the data 2 . Time-series data consist of observations equally spaced through time e.g. daily, monthly, or quarterly observations. In this paper we explore the use of graphical methods to improve under- standing of the underlying variation within and between time-series data. To examine the underlying structure of the pattern within a time series, we will discuss time-series plots with a nonparametric smooth-curve superimposed, and sub-series plots for studying seasonal patterns. In addi- tion, we will present decomposition plots, to detect the magnitude and strength of the seasonal and trend com- ponents within each series, and scatterplot matrices, to examine the relationships between various time series. We will consider coplots as a method suitable for identi- fying potential interactions between two time series. We will also discuss partial residual plots, as a diag- nostic graphical tool useful for identifying nonlinearities in the covariates for generalised linear models, and for examining the nonlinear relationship between the response and the covariates for generalised additive models. A natural feature of time-series data is the serial correlation within the series. We will discuss plots of the autocorrelation and partial autocorrelation functions, to examine the underlying correlation structure of the residuals from a particular model. Correct specification of the relationship between the response and the covariates, and identifying remaining autocorrelation in the residuals, are major methodologi- cal issues in daily mortality/morbidity research 3,4 . The data-visualisation methods we present will assist the investigator in resolving these issues. None of the methods we discuss are original. In recent years, they have come to be used in diverse fields, especially in the analysis of business and economic time-series data. We provide a link with this literature, and demonstrate how the techniques are transferable to epidemiological time-series data. As an illustration, we will use time-series data from a study of asthma hospital admissions in Melbourne, Australia. Asthma hospital admissions from all short- stay acute public hospitals in Melbourne, registered on a daily basis by the Health Department of Victoria, was used as the response variable for the period 1 July 1989 to 31 December 1992. International Classification of Disease (ICD) code (493) was used to define asthma. Air pollution data were obtained from the Environ- mental Protection Authority (EPA). Maximum hourly values were averaged each day across nine monitoring stations in Melbourne, for nitrogen dioxide (NO 2 ), Received 30 August 2000 Revised 14 February 2001 Accepted 29 August 2001 Correspondence to: B. Erbas, Department of Public Health, The University of Melbourne, VIC 3010 Australia.

Upload: others

Post on 24-Apr-2022

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data visualisation for time series in environmental

Journal of Epidemiology and Biostatistics (2001) Vol. 6, No. 6, 433–443

433

Data visualisation for time series in environmentalepidemiologyB ERBAS1 and RJ HYNDMAN2

1 Department of Public Health, The University of Melbourne, Australia2 Department of Econometrics and Business Statistics, Monash University, Australia

© 2001 Isis Medical Media Limited

Background Data visualisation has become an integralpart of statistical modelling.

Methods We present visualisation methods for preliminaryexploration of time-series data, and graphical diagnosticmethods for modelling relationships between time-series data in medicine. We use exploratory graphicalmethods to better understand the relationship between atime-series reponse and a number of potential covari-ates. Graphical methods are also used to examine any remaining information in the residuals from thesemodels.

Results We applied exploratory graphical methods to a

time-series data set consisting of daily counts of hospitaladmissions for asthma, and pollution and climaticvariables. We provide an overview of the most recentand widely applicable data-visualisation methods forportraying and analysing epidemiological time series.

Discussion Exploratory graphical analysis allows insightinto the underlying structure of observations in a dataset, and graphical methods for diagnostic purposes aftermodel-fitting provide insight into the fitted model andits inadequacies.

Keywords additive models, environmental epidemiologygraphical methods, nonparametric smoothing seasonality.

IntroductionGraphics can be used to uncover and present complextrends and relationships in an informal and simple way1,enabling the researcher to better understand the underly-ing structure of the data2. Time-series data consist ofobservations equally spaced through time e.g. daily,monthly, or quarterly observations. In this paper weexplore the use of graphical methods to improve under-standing of the underlying variation within and betweentime-series data.

To examine the underlying structure of the patternwithin a time series, we will discuss time-series plotswith a nonparametric smooth-curve superimposed, andsub-series plots for studying seasonal patterns. In addi-tion, we will present decomposition plots, to detect themagnitude and strength of the seasonal and trend com-ponents within each series, and scatterplot matrices, toexamine the relationships between various time series.We will consider coplots as a method suitable for identi-fying potential interactions between two time series.

We will also discuss partial residual plots, as a diag-nostic graphical tool useful for identifying nonlinearitiesin the covariates for generalised linear models, and forexamining the nonlinear relationship between theresponse and the covariates for generalised additivemodels. A natural feature of time-series data is the serialcorrelation within the series. We will discuss plots of the

autocorrelation and partial autocorrelation functions, toexamine the underlying correlation structure of theresiduals from a particular model.

Correct specification of the relationship between theresponse and the covariates, and identifying remainingautocorrelation in the residuals, are major methodologi-cal issues in daily mortality/morbidity research3,4. Thedata-visualisation methods we present will assist theinvestigator in resolving these issues.

None of the methods we discuss are original. Inrecent years, they have come to be used in diverse fields,especially in the analysis of business and economictime-series data. We provide a link with this literature,and demonstrate how the techniques are transferable toepidemiological time-series data.

As an illustration, we will use time-series data from astudy of asthma hospital admissions in Melbourne,Australia. Asthma hospital admissions from all short-stay acute public hospitals in Melbourne, registered on adaily basis by the Health Department of Victoria, wasused as the response variable for the period 1 July 1989to 31 December 1992. International Classification ofDisease (ICD) code (493) was used to define asthma.

Air pollution data were obtained from the Environ-mental Protection Authority (EPA). Maximum hourlyvalues were averaged each day across nine monitoringstations in Melbourne, for nitrogen dioxide (NO2),

Received 30 August 2000Revised 14 February 2001Accepted 29 August 2001

Correspondence to: B. Erbas, Department of Public Health, The University ofMelbourne, VIC 3010 Australia.

Page 2: Data visualisation for time series in environmental

sulphur dioxide (SO2), and ozone (O3), all measured inparts-per-hundred-million (pphm). Particular matter wasmeasured by a device that detects back-scatterng of lightby visibility-reducing particulates 0.1–1 l m in diameter.Particulates were derived from Bscat 3 1024 and theresulting variable is denoted by API.

Meteorologicaldata were obtained from the Common-wealth Bureau of Meteorology. Three hourly maximumdaily levels of relative humidity (hu), wind speed anddry bulbs temperature (db) were averaged across fourmonitoring stations in the Melbourne area.

To simplify the analysis of seasonality, we excludedthe leap day of 29 February 1992 in each series.

Exploratory graphical methodsTime-series plots and smooth curvesA time-series plot is a commonly used graphical tooluseful for examining the underlying long-term patternof a time series. Time (days, months, years, etc.) isplotted on the horizontal axis, and the series of interestis plotted on the vertical axis. Although these plots pro-vide a visual insight into the long-term trend, they donot always allow a clear visual understanding of thefluctuations of each observation from one time-periodto the next. The trend can be reinforced, and we canvisually examine specific time-periods by supplement-ing the plot with a nonparametric smooth trend-curve5–9. The smooth curve can be estimated in anumber of different ways, the most commonly usedbeing loess smoothers6, kernel smoothers7 and cubicsmoothing splines8. A brief description of thesesmoothers is provided in the Appendix.

A time-series plot of daily counts of asthma hospitaladmissions is displayed in Figure 1. Although weobserve various fluctuations in the trend throughout theseries, it is difficult to visually determine the pattern ofvariation between time periods. We fitted a loesssmooth6, with span 5 10%, to the daily counts of asthmaadmissions and superimposed the smooth curve onto thetime-series plot in Figure 1. The result is shown in Figure2 and enables a clearer visual interpretation of the long-term pattern in asthma admissions. For example, there isevidence of a decrease in admissions in December–Janu-ary each year; this decrease is less pronounced in1990–91 and almost nonexistence at the end of 1992.There is a sharp increase in admissions in autumn for1990 and 1992, but no evidence of this increase in 1991.

Seasonal subseries plotsSeasonality is a strong confounder in the analysis ofdaily hospital admissions/mortality data3,10. It isimportant, therefore, to identify the strength and mag-nitude of the seasonal component in a time series. Wecan construct a seasonal subseries plot to assess the

behaviour of each monthly subseries. A subseries plotis constructed by plotting January values for succes-sive years, then February values, and so on up to val-ues for December. For each month, the mean of thevalues is drawn by a horizontal line, and vertical linesemanating from these horizontal lines are values foreach subseries. We can assess the overall seasonal pat-tern of the series by the means of each subseries andwe can assess the variation within each subseries bythe vertical lines11,12 (Figure 3).

Figure 3 shows this style of plot (a) hu and (b) db inMelbourne, Australia from 1 July 1989 to 31 December1992. Note there are only three vertical lines emanatingfrom each mean from January to June for both series.This is because both series began in July, rather thanJanuary, for the study period. The subseries plot for huin (a) shows a large variation in the means for eachmonth, and the vertical lines emanating from each meanis indicative of a large variation within each month. Theoverall pattern in (a) is one in which June is the highestmonth for hu. The monthly variation for db in (b) is farmore stable than the monthly variation in (a). The over-all pattern for db shows a clear minimum in July and

434 B ERBAS AND RJ HYNDMAN

1/7/89 1/1/90 1/7/90 1/1/91 1/7/91 1/1/92 1/7/92 1/1/93

20

40

60

80

Ast

hma

adm

issi

ons

per d

ay

Fig. 1 A time-series plot of daily counts of asthma hospitaladmissions in Melbourne, Australia from 1 July 1989 to 31December 1992.

1/7/89 1/1/90 1/7/90 1/1/91 1/7/91 1/1/92 1/7/92 1/1/93

20

40

60

80

Ast

hma

adm

issi

ons

per

day

Fig. 2 A loess smooth of daily counts of hospital admissionsfor asthma in Melbourne, Australia, from 1 July 1989 to 31December 1992. (Smoothing span set to 10%.)

Page 3: Data visualisation for time series in environmental

August, and a relatively level period from Decemberthrough March. Although there is some variation withineach month, particularly January, the variation in eachsubseries is relatively small compared with the overallpattern in the monthly series.

Decomposition plotsA common approach to time-series in business and eco-nomics is to consider them as a mixture of several com-ponents13: trend, seasonality and an irregular randomeffect. It is important to identify the trend and seasonalcomponents, and remove them from the time serieswhen modelling relations between time-series. Whenthis is not done, highly seasonal series can appear to berelated, purely because of their seasonality rather thanbecause of any real relationship. Similarly, trendedseries can exhibit spurious collinearity. Consequently,the estimation of trend and seasonal terms, and the cal-culation of adjusted series removing their effect, areimportant issues in any time-series analysis12.

Time-series decomposition methods allow an assess-ment of the strength of the seasonal component in eachof the pollutants and climatic variables. After identifica-tion, the seasonal component is removed and the resul-tant seasonally adjusted series is used in subsequentanalysis. Extracting the seasonal component thus allowsa clearer picture of other characteristics of the data.

A number of time-series decomposition methods areavailable. Classical decomposition12 is a relatively simplemethod but has several drawbacks, including bias prob-lems near the ends of the series and an inability to allow asmoothly varying seasonal component. To overcomethese difficulties we adopt the seasonal trend decomposi-tion procedure based on loess (STL) method14.

We assume an additive decomposition:

Yt 5 Tt 1 St 1 Et (1)

where Yt denotes the time-series of interest, Tt denotesthe trend component, St denotes the seasonal componentand Et denotes the remainder (or irregular) component.The seasonally adjusted series, Y t

* is computed simplyby subtracting the estimated seasonal component fromthe original series, Y t

* 5 Yt 2 St.STL consists of a sequence of applications of the

loess smoother, to provide robust estimates of the com-ponents Tt, St and Et from Yt

12. The STL methodinvolves an iterative algorithm, to progressively refineand improve estimates of trend and seasonal compo-nents. After appropriate seasonal adjustment of the pol-lutants and climatic variables using STL, we constructdecomposition plots.

A decomposition plot normally consists of four pan-els: the original series, the trend component, the season-al component and the irregular (or random) component.The panels are arranged vertically, so that time is a com-mon horizontal axis for all panels11. To illustrate theusefulness of this method we display the decompositionof the asthma admissions data in Figure 4. Where wehave applied the STL method14 to compute the trendand seasonal components of this series. These are shownin the second and third panels respectively. In effect, weare separating the smooth curve estimated in Figure 2into two components, trend and seasonality. Note thatthe seasonal component captures the drop in Januaryand the increase in autumn each year. The trend compo-nent in Panel 2 makes it clear that the trend is increasingnear the end of the series, thus explaining the smaller

DATA VISUALISATION FOR TIME SERIES IN ENVIRONMENTAL EPIDEMIOLOGY 435

Jan Mar May Jul Sep Nov

60

65

70

75

80

Hum

idity

Jan Mar May Jul Sep Nov

10

12

14

16

20

18

Dry

bul

b te

mpe

ratu

re

Fig. 3: (a) A subseries plot for relative humidity. The variation in the subseries is large, as are the variations in the means foreach month. (b) A subseries plot for dry bulb temperature. The monthly variation is smaller than the monthly variation in (a).

a b

Page 4: Data visualisation for time series in environmental

decrease than was expected in admissions in December1989.

The bottom panel displays what remains when thetrend and seasonal components are removed from thedata. This allows unusual days or periods to be deter-mined without the confounding that occurs because ofseasonality or trend. The bars on the right of each paneldepict the amount of variation in the data and compo-nents. These are all the same length, but plotted on dif-ferent scales. Thus, a decomposition plot can be used toassess the strength and magnitude of the trend, and sea-sonal components within a series. In this example, thelength of the bar to the right of the third panel indicatesthat a small amount of variation in the original series isaccounted for by the seasonal component.

Scatterplot matricesA scatterplot is a useful graphical tool that allows avisual examination of the possible relationship betweentwo series. However, its usefulness is limited to twodimensional space15. The scatterplot matrix, firstdescribed in Chambers, et al.16 (1983), is a simplegraphical tool for displaying pairwise scatterplots forthree or more variables. Visually, it represents multi-dimensional space, thus allowing insight into the com-plex relationship between multiple series. The graphsare arranged in a matrix with a shared scale, so that eachpanel of the matrix is a scatterplot of one variableagainst another variable. If there are k variables, thereare k(k 2 1) panels in total, and each pair of variables is

plotted twice. The scatterplot matrix is symmetric, in thatthe upper left triangle consists of all k(k 2 1)/2 pairs, asdoes the lower right triangle15.

Scatterplot matrices can be especially useful inselecting covariates for modelling and detection ofmulti-collinearity5. They can also be useful in detectinga nonlinear bivariate relationship between a responsevariable and a potential covariate in modelling. Scatter-plot matrices provide a clear sense of the underlyingpatterns in the relationships, and allow multiple compar-isons on a single plot5.

A pairwise scatterplot matrix for daily counts ofasthma hospital admissions, pollutants and climaticvariables is shown in Figure 5. This clearly exhibits anonlinear bivariate relationship between asthma admis-sions and the climatic variables dry bulbs temperature,and relative humidity. This is consistent with resultsreported in previous studies17–19. Collinearity is animportant methodological issue in the analysis of dailycounts of morbidity/mortality20. Figure 5 allows avisual insight of the inherent collinearity between dband hu, NO2 and wind-speed. Although a scatterplotmatrix allows us to visually detect collinearity, theextent of collinearity should be determined by aformal test21.

CoplotsWhen modelling relationships between a responsevariable and two or more explanatory variables, weare often interested in the interaction between the

436 B ERBAS AND RJ HYNDMAN

1/7/89 1/1/90 1/7/90 1/1/91 1/7/91 1/1/92 1/7/92 1/1/93

Ð5

0

5

20

40

60

80

30

34

38

42

Ð20

0

20

40

Re

mai

nder

Sea

sona

lfc

.600

Da

ta

Fig. 4 The original daily asthma admissions are displayed in the top panel. The three components are displayed on the other pan-els. In this case, the trend and seasonal components do not account for much of the variation.

Page 5: Data visualisation for time series in environmental

explanatory variables. Interaction is a condition wherethe levels of the response variable are different at vari-ous levels of the explanatory variables21. To study how aresponse variable depends on two explanatory variableswe can use a coplot (conditioning plot), as introducedby Cleveland22 (1993) (Figure 6).

One of the explanatory variables act as the ‘givenseries’. A coplot consists of a top panel, the ‘given’panel (each point represents a value for the given series)and a bottom panel, the ‘dependence’ panel. The lattercomprises a series of scatterplots of the response vari-able with the other explanatory variable for those obser-vations, whose values of the given series are equalto each point in the given panel. (For a given series

with many unique values, the plot is constructed basedon intervals of the given series, rather than at uniquevalues.)

Smooth curves and confidence intervals (CI) can beadded to each dependence panel to detect patterns in thebivariate relationship for each panel. A coplot is usefulin detecting the presence of interaction. If the panelsexhibit relationships of different shapes, then this isindicative of an interactive effect between the explanat-ory variables on the response.

Figure 6 shows a coplot where each panel demon-strates the relationship between asthma admissionsand db, where the data are conditioned on the month ofobservation. Smoothing spline curves and 95% point-

DATA VISUALISATION FOR TIME SERIES IN ENVIRONMENTAL EPIDEMIOLOGY 437

20 40 60 80 0 1 2 3 1 2 3 4 5 5 10 15 20

40

60

80

510

20

30

2468

2

2 4 6 8 2 4 6 8 12 5 10 20 30 10 20 30

80

60

40

20

3

2

1

0

543

12

20

15

10

5

468

12

Asthma

NO2

SO2

Ozone

API

DBtemp

Windspeed

Humidity

Fig. 5 Pairwise scatterplots for asthma hospital admissions, pollutants and climatic data.

Page 6: Data visualisation for time series in environmental

wise CI are added to enable easier visualisation of thechanging relationship throughout the year. Note thatthe relationship is negative for May, but positive in thehotter months of November through February. In thewinter months of June–September, there is little rela-tionship at all.

Diagnostic graphical methodsGraphical analysis of residuals are central to the processof statistical model building10. These plots can be usedto identify any undetected tendencies in the data, out-liers, and homogeneity in the variance of the responsevariable1. In addition, a variety of exploratory graphicalmethods are available to assess the nonlinearity betweena response variable and an explanatory variable in statis-tical modelling23. This paper discusses partial residualplots as an exploratory diagnostic graphical method anddemonstrates diagnostic plots after fitting a generalisedlinear model (GLM) and generalised additive model(GAM) to asthma hospital admissions in Melbourne,Australia from 1 July 1989 to 31 December 1992.

GLM and GAMPreliminary exploratory analysis of asthma hospitaladmissions exhibited strong long-wave length patterns.To adequately model these patters we use Fourier seriesterms of sin(2 p jt/365) and cos(2p jt/365) for j 5 1, . . .,10. Fourier terms of sine and cosine pairs for j . 10were not included in the analysis because neither thesine nor cosine pair was statistically significant (p ,0.05) when included in the model. Also, the inclusion ofsine and cosine pairs for j . 10 did not contribute to areduction in the AIC24 goodness-of-fit statistic.

Dummies are used to model day-of-week patternsusing a parameterisation based on Helmert contrasts25.Five-day lagged pollutants and db and hu are also includ-ed as potential covariates. Lag combinations that con-tribute to the largest decrease in AIC are included in thefinal model. Step-wise procedures are used to evaluate thecontribution of each covariate and determine the GLMand GAM that best describes asthma hospital admissions.For this analysis, seasonally adjusted NO2, O3 and season-ally adjustedhu and db are includedin theanalysis.

438 B ERBAS AND RJ HYNDMAN

20

40

60

80

20

40

60

80

50 150 25050 150 250

50 150 250 50 150 250

80

60

40

20

dB

Ast

hma

adm

issi

ons

Sept Oct Nov Dec

AugJulJunMay

Jan Feb Mar Apr

Fig. 6 A coplot of asthma admissions against db given the month of observation.

Page 7: Data visualisation for time series in environmental

GLMFor a GLM we estimate the expected value of theresponse variable, as a function of a linear combinationof covariates:

E(Yt|Xt)5exp{ 01 1t1 2t21 3NO2,t1 4O3,t211 5dbt1 6dbt211

7dbt221 t1å10

j51[a j sin(2 jt /365)1 j cos(2 jt /365)]} (2)

where Xt denotes the vector of covariates, t denotes theday of observation (1, . . ., 1279), NO2,t denotes the sea-sonally adjusted NO2 measurement, O3,t denotes the sea-sonally adjusted O3 measurement and dbt denotes theseasonally adjusted db measurement, all on day t. Theday-of-week effect is captured using the variable c t,which takes on a different value for each day of the week.We assume Yt is pseudo-Poisson (i.e. Poisson with over-dispersion), conditional on the values of the covariates.

GAMFor a GAM, we adopt a similar modelling framework,except the linear terms in a GLM may be replaced bysmooth nonlinear functions that are estimated nonpara-metrically. The following model was fitted:

E(Yt|Xt)5exp{ 01g1(t)1 2NO2,t1 3NO2,t211 4O3,t1g5(O3,t22)1 6dbt

1g7(dbt21)1 8dbt221 t1 [å10

j51a j sin(2 jt/365)

1 j cos(2 jt/365)]} (3)

where t denotes the day of observation (1, . . ., 1279),NO2,t denotes the NO2 measurement, O3,t denotes theO3 measurement and dbt denotes the db measurement,all on day t. The summation term involving sins andcos functions models the seasonal pattern throughoutthe year using a Fourier series approximation. The dayof week effect is captured using the variable c t, whichtakes on a different value for each day of the week.Note that O3,t 22,, dbt21 and t are modelled usingsmooth nonparametric functions; the other variables areall modelled linearly. The choice of a smooth or linearterm for each covariate was made using the AIC. Weassume Yt is pseudo-Poisson (i.e. Poisson with over-dispersion) conditional on the values of the covariates.The choice of smoothing constant for the smooth terms(g1, g5 and g7) was determined by setting the equivalentdegrees of freedom for these terms to 4. While thischoice is relatively arbitrary, we felt it provided suffi-cient flexibility to model any non-linearity in the serieswithout allowing the smooth term to be unnecessarilyaffected by individual observations.

Partial residual plotsPartial residual plots are useful in detecting nonlineari-ties and for identifying the possible cause of unduly

large observations26 in a generalised linear model and ageneralised additive model. The partial residuals for avariable Xj in an ordinary linear regression model are thesame as the ordinary residuals except the jth term isadded to each value:

d jt 5 ˆ

jXt,j 1 (Yt 2 t) (4)

where t denotes the fitted value for observation t.This is equivalent to adjusting the observation Yt forevery variable in the model except for the jth term.For GLMs or GAMs, we use partial deviance residu-als instead, which are defined analogously8. Otherpossibilities are to use partial Pearson or partial quan-tile residuals27. To detect nonlinearities more easilyfrom a partial residual plot, one may superimpose asmooth curve to enhance the display and allow easierdetection of a nonlinear relationship between theresponse variable and a covariate.

We use partial residual plots as a diagnostic graphicalmethod to detect nonlinearities in generalised linearmodels, and to better understand the relationshipbetween the response variable (asthma hospital admis-sions) and the explanatory variables (pollution and cli-mate) in generalised additive models.

To demonstrate the diagnostic usefulness of a par-tial residual plot we present results from the GLManalysis of the time trend on asthma hospital admis-sions. Figure 7 shows the partial residual of the time

DATA VISUALISATION FOR TIME SERIES IN ENVIRONMENTAL EPIDEMIOLOGY 439

0 200 400 600

Time (days)

800 1000 1200

–4

–2

0

2

4

6

Par

tial r

esid

uals

Fig. 7 Partial deviance residuals for the time trend from theGLM. The fitted quadratic time trend is shown as a thick line.A smoothing spline has been fitted to the partial devianceresiduals and is shown as a fine line. The shaded area repre-sents 95% pointwise CI around the spline curve.

Page 8: Data visualisation for time series in environmental

covariate from a GLM analysis with a quadratic termfitted for the time covariate. We fit smoothing splineto these residuals to enable a clearer examination ofthe nonlinear relationship between time and asthmaadmissions. The shaded area represents 95% pointwiseCI around the fit. The plot displays a striking nonlin-ear relationship between time and asthma admissionsthat has not been adequately captured by the quadraticfit in the GLM.

Figure 8 shows the smoothing spline estimates forthe time trend, ozone term and db term in the GAMmodel. The seasonal term constructed from the Fourierapproximation is also shown. The time trend is clearlymore complicated than a simple quadratic. The seasonalstructure is also quite complex, as would be expectedwith 10 Fourier terms. The ozone and db effects aresignificantly non-linear (hence the use of smooth func-tions in the model), but only near the centre of therange of observations is the function estimated with

any degree of precision (shown by the wide confidencebands elsewhere).

Autocorrelation and partial autocorrelation plotsA characteristic feature of time series is that the obser-vations are ordered through time; a consequence of thisis autocorrelation, that is an underlying pattern betweenobservations from one time period to the next within atime series. We can define autocorrelation at lag k as thecorrelation between observations k time-periods apart.This can be estimated for lags k 5 0, 1, 2, . . ., as:

ån

t 5 k 1 1(xt 2 x– ) (xt2k 2 x– ) (5)

rk 5ån

t51(xt 2 x– )2

where xt denotes the observation at time t and x– denotesthe sample mean of the series {xt}.

440 B ERBAS AND RJ HYNDMAN

0 200 400 600 800 1000 1200

–0.1

0.0

0.1

0.2

Tim

e tre

nd

0 100 200 300

–0.8

–0.4

0.0

0.4

Sea

sona

l com

pone

nt

Day Day of year

0 2 4 6 8 10

–0.6

–0.4

–0.2

0.0

Sm

ooth

ozo

ne c

ompo

nent

50 100 150 200

–0.2

0.0

0.2

Sm

ooth

dB

tem

p co

mpo

nent

Ozone dB temp

Fig. 8 Smooth terms from the GAM. Dashed lines show 95% pointwise CI.

Page 9: Data visualisation for time series in environmental

The partial autocorrelation at lag k is defined as thepartial correlation between observations k time-periodsapart. That is, it is the correlation between xt and xt 2 kafter removing the effect of the intervening observationsxt 2 1 , . . . , xt 2 k 1 1. It can be estimated using a fast com-putational algorithm28.

The collection of autocorrelations at lags k 5 1, 2, . . .is known as the autocorrelation function (ACF). Simi-larly, the collection of partial autocorrelations is knownas the partial autocorrelation function (PACF). To helpvisualise the temporal pattern in a time series, we plotthe ACF and PACF of the series.

Figure 9 displays ACF and PACF plots of the residu-als from the GAM model fitted to the asthma admis-sions data. The dashed lines denote 95% critical valuesfor each autocorrelation. Any spike lying outside thesebounds shows a correlation that is significantly differentfrom zero. The graph shows there is strong autocorrela-tion remaining in the residuals, for which allowanceneeds to be made in the model. It is possible that a loworder autoregressive term will adequately capture thispattern3. GAM with autocorrelation are currently beingdeveloped by the authors.

A related visual diagnostic tool is the cross-correla-tion function28, which is useful for examining the lag-effect of a change in pollution on a change in theresponse variable29.

DiscussionWe have presented some simple data-visualisation meth-ods that allow an effective graphical examination ofquantitative information. Most exploratory graphicalmethods fall into two categories: displaying the originaldata, or displaying quantities associated with fitted

models. Methods in the first category method are usedto explore the data and those in the second are used toenhance statistical analysis that are based on assump-tions about relations in the data26.

We have reviewed exploratory graphical methods fortime-series analysis in environmental epidemiology, andhave applied time-series plots, subseries plots, decom-position plots, scatterplot matrices and coplots, toenable a visual assessment of long-term trend, seasonal-ity and nonlinearity in time-series data. We have alsoapplied partial residual plots and ACF and PACF resid-ual plots as diagnostic methods to assess nonlinearitiesin explanatory variables, and to assess autocorrelationpatterns in the residuals from a fitted model.

We found that a smooth plot superimposed on atime-series plot provided a more enhanced visual repre-sentation of the long-term pattern and short-term varia-tion between time periods in a time series. A seasonalsubseries plot was used to assess the variation withineach month, and between successive months in a timeseries. This allows visual insight into the seasonalbehaviour of a time series. Decomposition plots pro-vide understanding of the underlying trend-cycle, sea-sonal and random components of a time series, andtheir relative strengths.

Scatterplot matrices are particularly useful for detect-ing collinearity betwen potential explanatory variables,and nonlinearity between the response and potentialexplanatory variables. They provide an immediate visualinsight into the complex relations between each series.

Coplots were used to assess the interactive effects ofpollution and climate on asthma admissions. These plotsare particularly useful for detecting interactive effectsbetween explanatory variables.

DATA VISUALISATION FOR TIME SERIES IN ENVIRONMENTAL EPIDEMIOLOGY 441

1 3 5 7 9 11 14

–0.1

0.0

0.1

0.2

0.3

AC

F

Lag

1 3 5 7 9 11 14

–0.1

0.0

0.1

0.2

0.3

PA

CF

Lag

Fig. 9 The autocorrelation (a) and partial autocorrelation (b) function for the residuals from a GAM.

a b

Page 10: Data visualisation for time series in environmental

Partial residual plots are useful in detecting nonlinearrelationships between the response and covariates and inunderstanding the effects of each covariate on theresponse variable. Plots of the autocorrelation and par-tial autocorrelation function allowed a visual insight intoany remaining correlation structure in the residuals.

We have demonstrated that data visualisation is anintegral part of statistical modelling. Many of the diag-nostic graphical methods presented here are not specificto time series, they can be used in most applications ofstatistical methodology. Methodological issues in theanalysis of morbidity/mortality have been an importantissue for many decades, the inclusion of exploratorygraphical tools for exploratory and diagnostic purposesis long overdue.

References1 Toit SHC, Steyn AGW, Stumpf RH. Graphical exploratory

data analysis New York: Springer-Verlag, 1986.2 Tukey JW. Exploratory data analysis . Massachusetts: Addison-

Wesley, 1977.3 Schwartz J, Spix C, Touloumi G et al. Methodological issues

in studies of air pollution and daily counts of deaths or hospi-tal admissions. J Epidemiol Commun Health 1996; 50(Suppl1): s3–s11.

4 Thurston GD, Kinney PL. Air pollution epidemiology: con-siderations in time series modeling. Inhal Toxicol 1995; 7:71–83.

5 Henry TG. Graphing data: techniques for display and analy-sis. California: SAGE Publications, 1995.

6 Cleveland WS. Robust locally weighted regression andsmoothing scatterplots. J Am Statist Assoc 1979; 74: 829–36.

7 Bowman AW, Azzalini A. Applied smoothing techniques fordata analysis . New York: Oxford Press, 1997.

8 Hastie T, Tibshirani RJ. Generalized additive models. London:Chapman and Hall, 1990.

9 Simonoff JS. Smoothing methods in statistics . New York:Springer-Verlag, 1996.

10 Erbas B, Hyndman RJ. The effect of air pollution and climateon hospital admissions for chronic obstructive airways dis-ease: a nonparametric alternative. In press.

11 Cleveland WS, Terpenning IJ. Graphical methods for seasonaladjustment. J Am Statist Assoc: Theory Methods 1982; 77:52–62.

12 Makridakis S, Wheelwright SC, Hyndman RJ. Forecasting:methods and applications . 3rd edition. New York: Wiley &Sons, 1998.

13 Kendall M, Ord KJ. Time series. 3rd edition. Kent: Hodderand Stoughton Ltd, 1990.

14 Cleveland RB, Cleveland WS, McRae JE, Terpenning I. STL:a seasonal-trend decomposition procedure based on loess[with discussion]. J Oficial Stat 1990; 6: 3–73.

15 Cleveland WS. The elements of graphing data. Revised edn.New Jersey: Hobart Press, 1994.

16 Chambers JM, Cleveland WS, Kleiner B, Tukey PA. Graphicalmethods for data analysis . New York: Chapman and Hall, 1983.

17 Saez M, Sunyer J, Castellsague J et al. Relationship betweenweather temperature and mortality: a time series analysisapproach in Barcelona. Int J Epidemiol 1995; 24: 576–581.

18 Ballester F, Corella D, Perez-Hoyos S et al. Mortality as afunction of temperature. A study in Valencia, Spain,1991–1993. Int J Epidemiol 1997; 26: 551–61.

19 Morgan G, Corbett S, Wlodarcyzk J. Air pollution and hospi-tal admissions in Sydney, Australia, 1990–1994. Am J PubHealth 1998; 88: 1761–6.

20 Pitard A, Viel J. Some methods to address collinearity amongpollutants in epidemiological time series. Stat Med 1997; 16:527–44.

21 Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Appliedregression analysis and other multivariable methods. 3rd edi-tion. Pacific Grove: Duxbury Press, 1998.

22 Cleveland WS. Visualizing data. New Jersey: Hobart Press,1993.

23 Cook RD, Weisberg Residuals and influence in regression .New York: Chapman and Hall, 1982.

24 Akaike H. A new look at statistical model identification.IEEE Transac Auto Control. 1974; AC-19: 716–23.

25 Venables WN, Ripley BD. Modern applied statistics with S-PLUS, 2nd edition. New York: Springer-Verlag, 1997.

26 Chambers J, Hastie T. Statistical models in S. California:Wadsworth and Brooks/Cole Advanced Books and Software,1992.

27 Dunn P, Smyth G. Randomized quantile residuals. J ComputGraphical Stat 1996; 5: 236–44.

28 Diggle P. Time series: a biostatistical introduction . OxfordUniversity Press, 1990.

29 Campbell MJ, Tobias A. Causality and temporality in thestudy of short-term effects of air pollution on health. Int JEpidemiol 2000; 29: 271–3.

AppendixNonparametric smoothingLet (X1, Y1), . . ., (Xn, Yn) denote pairs of observations.Then a nonparametric smooth curve is a function s(x)with the same domain as the values in {Xt}:

s(x) 5 E(Yt | Xt 5 x)

The function s(x) is intended to show the mean value ofYt for given values of Xt 5 x.

A kernel smoother is a weighted average of the form:

s(x) 5 ån

j 5 1wj (x)Yj

where the weights are given by:

K (x2j2x

b— )

wj (x) 5

ån

i51

K (x2i2x

b— )

442 B ERBAS AND RJ HYNDMAN

Page 11: Data visualisation for time series in environmental

The function K is a ‘kernel function’, which is usuallynon-negative,symmetric about zero and integrates to one.A common choice for K is the standard normal densityfunction. The value of b (the ‘bandwidth’) determineshow smooth the resulting curve will be. Large band-width values will produce smoother curves and smallbandwidth values will produce more wiggly curves.

A cubic smoothing spline is defined as a functions(x) 5 g(x) that minimises the penalised residual sum ofsquares:

S (g) 5 ån

t 5 1[Yt 2 g(Xt) ] 2 1 ò (g99(x))2 dx

Thus, S is a compromise between goodness-of-fit andthe degree of smoothness. The smoothness of the fittedcurve is controlled by k (the ‘smoothing parameter’).Large values of k will produce smoother curves, andsmaller values will produce more wiggly curves.

In local linear regression, instead of fitting a straightline to the entire data set, we fit a series of straight linesto sections of the data. The resulting curve is thesmoother s(x). The estimated curve at the point x is y 5ax where ax and bx are chosen to minimise the weightedsum of squares:

ån

j 5 1wj(x)[Yj 2 b2x 2 a2x(Xj 2 x)]2

Here, wj(x) represents the weights. The weight functionwj(x) can be the same as in kernel smoothing.

Loess smoothing is a form of local linear regressionwith some additional steps to make it more robust tooutliers. The ‘span’ of a loess smoother describes thepercentage of points receiving non-zero weight in thecalculation of the smoother at each value of x.

DATA VISUALISATION FOR TIME SERIES IN ENVIRONMENTAL EPIDEMIOLOGY 443