
A review of meta-analyses of single-subject experimental designs: Methodological issues and practice

S. Natasha Beretvas & Hyewon Chung, Department of Educational Psychology, University of Texas at Austin, USA

Abstract: Several metrics have been suggested for summarizing results from single-subject experimental designs. This study briefly reviews the most commonly used metrics, noting their methodological limitations. This study also includes a synthesis of recent meta-analyses, describing which metrics were used and how meta-analysts handled dependence in the form of multiple treatments, outcomes, and participants per study. Guidelines for future methodological research and for single-subject experimental design meta-analysts are provided.

Keywords: Meta-analysis, effect sizes, single-subject experimental designs, methodology

INTRODUCTION

Meta-analytic procedures afford researchers a method to quantitatively synthesize past research results, thereby providing evidence to support best practice (Glass, 1976; Hedges & Olkin, 1985; Hunter & Schmidt, 1990). While these meta-analytic methods work well for synthesizing the results of studies with large sample sizes (large-n), there is still no consensus concerning how best to summarize results from single-subject experimental design (SSED) studies. This is a problem because a considerable amount of educational and psychological research has made use of SSEDs (Galassi & Gersh, 1993). Indeed, SSEDs are frequently employed in educational research designed to assess a treatment's effect on special populations, such as individuals with autism and related developmental disabilities.

Typically in pre-post large-n intervention studies, the focus is on change in outcome level between pre-test and post-test. While this is part of the interest with SSED studies, trends within and across baseline and treatment phases are simultaneously considered and are perhaps the most important aspect of the data to consider when evaluating the results from such studies. Numerical descriptors of these trends are difficult to estimate when the data describe a treatment's effect on an individual (Crosbie, 1993) and when the number of repeated measures is as small as is commonly found in educational SSED research (Busk & Marascuilo, 1988; Huitema, 1985). For this reason, visual analysis of graphed data is typically employed in SSED studies to assess a treatment's effect.

A perusal of a plot of results should clearly identify whether a treatment effect can be discerned from SSED results. And for the results of a visual analysis of data to support a treatment's effect, usually the magnitude of the effect must be sufficiently large that associated practical and clinical significance can be assumed. However, visual inspection of results involves the consideration of several dimensions along which data might vary, including mean shift (change in average behavior in baseline versus treatment), slope change (change in trend between baseline and treatment), and variability (of data points around the general trend). Parsonson and Baer (1992) provide a detailed summary of research conducted to investigate the relationship between characteristics of graphs and the inferences resulting from visual interpretations of the results. Unfortunately, some of this research has indicated that inferences based on visual analysis are not very reliable. Statistical summaries of results provide an alternative to visual inspection.

There are several justifications for using statistical methods to summarize results from SSEDs. Beyond the parsimony associated with use of quantitative meta-analysis, the current climate of evidence-based practice also heralds a renewed focus on methods used to meta-analyze SSED results. A statistical summary of results also allows a potentially more objective summary of studies' results through the use of meta-analytic procedures. Use of meta-analysis encourages generalization of findings across individuals. It also permits exploration of any differences identified in study results. In addition, statistical description of study results allows for the identification of a treatment effect, despite potentially unstable baseline data, and of small treatment effects that might not be visible graphically (Nourbakhsh & Ottenbacher, 1994). Lastly, the potential lack of reliability noted for interpretation of data using visual analysis could be reduced with the use of statistical summaries. The interpretation of results associated with an appropriately conducted quantitative descriptor will be more consistent.

For correspondence: S. Natasha Beretvas. E-mail: tasha.beretvas@utexas.edu

Source of funding: Preparation of this article was supported by a grant from the Institute of Education Sciences, U.S. Department of Education. However, the opinions expressed do not express the opinions of this agency.



Several quantitative descriptors have been derived as options to describe a treatment's effect in SSEDs. These descriptors include single-indicator summaries such as various versions of a standardized difference in phase (treatment and baseline) means (Busk & Serlin, 1992) and the percentage of non-overlapping data (PND; Scruggs, Mastropieri, & Casto, 1987). Others have suggested using single R²-change indicators that simultaneously describe change in the outcome's level and slope as a result of treatment (Center, Skiba, & Casey, 1985–1986; Faith, Allison, & Gorman, 1996). Other researchers have suggested a pair of R²-change indicators to describe a treatment effect, with one indicator describing change in the level of an outcome between baseline and treatment (e.g. Crosbie, 1995; Tryon, 1982) and the other indicator describing change in trend as a result of the treatment being introduced (e.g. Beretvas & Chung, 2007; Crosbie, 1995). Finally, some researchers have proposed a multi-level approach to the meta-analysis of SSEDs (van den Noortgate & Onghena, in press). Despite the need to quantitatively synthesize results from SSEDs, there is little consensus in the field about what qualifies as an appropriate descriptor.

Methodological critiques of each of the descriptors have been performed, and no single descriptor has yet been established as best. In spite of these criticisms, researchers continue to use several of these descriptors. The current paper provides a brief description of some of the more commonly used effect-size metrics, emphasizing some of the problems associated with each metric. In addition, a summary of recent applied meta-analyses of single-subject designs will be provided, along with some directions for future research in this area.

SINGLE INDICATOR EFFECT-SIZE METRICS

Standardized mean difference

The standardized mean difference used in SSEDs is calculated on the basis of the effect size of the same name used for large-n studies that compares the difference in means of two independent groups at a single time point. The basic formula for the large-n standardized mean difference is estimated using

$$\hat{\delta} = \frac{\bar{Y}_T - \bar{Y}_C}{S} \quad (1)$$

where $\hat{\delta}$ is an estimate of the difference between the mean outcome score of the treatment group, $\bar{Y}_T$, and the mean of the control group, $\bar{Y}_C$. This difference is standardized by dividing by an estimate of the population standard deviation, $S$, of outcome scores within the populations. The corresponding standardized mean difference that is frequently used as a metric for SSEDs is

$$\hat{\delta}_{SMD} = \frac{\bar{Y}_B - \bar{Y}_A}{S^*} \quad (2)$$

where the sample mean, $\bar{Y}_i$, is calculated on the basis of the mean of the outcome values in phase $i$ (A or B) for a single individual, whereas the sample mean in Equation 1 is calculated on the basis of the outcome scores for a sample of individuals. What further distinguishes $\hat{\delta}$ from $\hat{\delta}_{SMD}$ is the calculation of the standard deviation. In Equation 2, $S^*$ is the sample standard deviation for the individual's outcome scores (calculated using data just from the baseline phase or pooled across the A and B phases).

There will be less variability in measures on a single individual over time, $S^*$, than in measures on multiple individuals at a single time point, $S$. The reason for this is that an individual's scores will probably be similar to his or her score at an earlier point in time, whereas if two independent individuals had been randomly selected, then there should be no relationship between their scores. The autocorrelation introduced in SSED time-series data results in a violation of the assumption of independence that is made with most large-n group-comparison designs. Given the fundamental differences between $S$ and $S^*$, the estimates in Equations 1 and 2 should not be considered as measured on the same metric. The reduced variability in repeated measures on an individual ($S^*$) will inflate the scale of the resulting $\hat{\delta}_{SMD}$ as compared with that of $\hat{\delta}$. Thus, if the effect-size estimate in Equation 2 is used to describe results from SSEDs, it should not be combined with effect-size estimates from large-n studies. In addition, it should not be assumed that the variance typically associated with $\hat{\delta}$ (see, for example, Cooper & Hedges, 1995; Hedges & Olkin, 1985) is associated with $\hat{\delta}_{SMD}$. Mostly as a result of the potential autocorrelation resulting from repeated measures on an individual, the sampling distribution of $\hat{\delta}_{SMD}$ is unclear and, thus, the relevant variance to associate with the metric's index is also unclear.

The standardized mean difference effect size, $\hat{\delta}_{SMD}$, appears to be quite commonly used in meta-analyses of SSED research (see, for example, Busk & Serlin, 1992; Faith et al., 1996). Other researchers (Hershberger, Wallace, Green, & Marquis, 1999) have encouraged calculating an effect-size estimate similar to that in Equation 2, except that the means and standard deviations would be based on only the last three time points in each phase.


Use of only the last three time points per phase might be preferred as providing a more valid measure of baseline data once the pattern in the outcome measure has stabilized; however, all of the criticisms mentioned above with regard to $\hat{\delta}_{SMD}$ also apply to this effect size.

Apart from the caveat about the different metrics of $\hat{\delta}_{SMD}$ versus $\hat{\delta}$, it should be emphasized that use of this metric would only make sense when no trend is evident in the baseline or treatment phases (see Figure 1 for an example). However, there might be a trend in the pattern of behavior either within one or within both phases and, thus, the mean value will not be very representative of a treatment's effect. This metric will only represent the difference in the average outcome levels in each phase. No trend might be evident in baseline but the treatment might change the outcome level and introduce a trend (for example, see Figure 2). Alternatively, there might already be a slight trend in baseline (reflecting a natural development in the outcome over time). The intervention might raise the level and convert the trend to a steeper slope supporting more rapid growth in the outcome (see Figure 3). In any of the scenarios in which there is a trend,¹ a change in the average level in the outcome of each phase will not capture the treatment's impact on the trend. It is also possible that there might be a trend in baseline and that the treatment has absolutely no effect on either trend or level (for example, see Figure 4). Summarizing the results of a study for which the pattern depicted in Figure 4 applies using $\hat{\delta}_{SMD}$ will lead to (false) detection of a treatment's effect even though the change in level solely resulted from natural development.

Percentage of non-overlapping data

Scruggs et al. (1987) introduced the PND as an index of a treatment's effect. The PND provides a non-parametric descriptor of the overlap between the data in the treatment versus the baseline phases. To describe results in an AB design for a treatment designed to increase a behavior, the PND is calculated by tallying the percentage of data points in the treatment phase that exceed the highest point reached in the baseline phase. (If a treatment is designed to decrease a behavior, then the percentage of points in the treatment phase that are lower than the lowest data point in the baseline phase is tallied.) The PND does not require the assumptions of normality, linearity, and homogeneity associated with parametric descriptors of effect size. And while exact values for data points are needed for other metrics, this is not the case for PND calculation. Extensive description of problems and benefits associated with use of the PND is provided elsewhere (see, for example, the entire second issue of Remedial and Special Education, 1987, Volume 8).
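The following is a minimal sketch of the PND tally just described, assuming a simple AB comparison; the function name and example values are hypothetical.

```python
import numpy as np

def pnd(baseline, treatment, increase_expected=True):
    """Percentage of non-overlapping data for one AB comparison.

    If the treatment is meant to increase behavior, count treatment-phase
    points exceeding the highest baseline point; if it is meant to decrease
    behavior, count points below the lowest baseline point.
    """
    a = np.asarray(baseline, dtype=float)
    b = np.asarray(treatment, dtype=float)
    nonoverlap = np.sum(b > a.max()) if increase_expected else np.sum(b < a.min())
    return 100.0 * nonoverlap / len(b)

# hypothetical data: 4 of 5 treatment points exceed the baseline maximum
print(pnd([2, 3, 2, 4, 3], [5, 6, 4, 7, 6]))  # 80.0
```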

Figure 1. AB design, outcome level shift (increase) with no trend within baseline and treatment phases.

Figure 2. AB design, outcome level shift (increase) with no trend in baseline and trend (increasing behavior) introduced in treatment phase.

Figure 3. AB design, outcome level shift (increase) with slight, positive trend in baseline and change (increase) in trend introduced in treatment phase.

¹ All trends have been depicted here as positive. It is of course feasible that trends could be negative (reflecting a reduction in the behavioral outcome). The same conclusions apply for such scenarios.


Despite the caveats associated with its use, the PND is one of the most commonly used descriptors of treatment effectiveness in research syntheses of SSEDs in education and special education (see Schlosser, Lee, & Wendt, in this issue).

Several other non-overlapping metrics have been developed since the introduction of the PND, including the percentage of all non-overlapping data (Parker, Hagan-Burke, & Vannest, 2007), the percentage of data points exceeding the median (Ma, 2006), the improvement rate difference (Parker & Hagan-Burke, 2007), and the mean baseline reduction (Lundervold & Bourland, 1988), among others. The benefits of these metrics include that they are simple to use and unaffected by potential autocorrelation. Unfortunately, however, several associated problems have been noted. One of the problems with these non-parametric indices has to do with the unknown sampling distributions associated with each of them. This seriously compromises the validity of statistical tests conducted using these indices. Additional criticisms, including the potential impact of floor or ceiling effects and orthogonal slope changes, have also been noted. Another common criticism of these non-parametric descriptors is that their values can be confounded in the presence of a trend in the data. Additional procedures have been recommended to handle assessment of a treatment's effect using regression-based procedures that model the possible changes in level and trend of the outcome. Before describing the different procedures, one of the fundamental assumptions of multiple regression will be reviewed. In addition, its possible violation in SSEDs will be discussed along with the associated effect on statistical results.

With ordinary least squares (OLS) regression analyses for large-n studies, the outcome for individual i, $Y_i$, is modeled as a function of the relevant predictors. For example, if two predictors, $X_1$ and $X_2$, are hypothesized to predict Y, then the following regression model could be tested:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i \quad (3)$$

Estimation of the relevant regression coefficients (the $\beta$s) in Equation 3 is unbiased and efficient when the relevant model assumptions are met. One of the primary assumptions (beyond homoscedasticity and normality) is that of the independence of the residual terms, the $e_i$s. This assumption is approximately met in large-n studies and, thus, the standard errors estimated using OLS can be assumed to be accurate. However, when this assumption is violated, then estimation can no longer be considered efficient and the standard error estimates cannot be assumed accurate.

If residuals are (what is termed) positively autocorrelated,² as might be possible with time-series data (namely, data measured over time for an individual), then the standard error estimates will be biased underestimates (see, for example, Pankratz, 1983; Crosbie, 1993; Busk & Marascuilo, 1992; Gorman & Allison, 1996). Various researchers (Jones, Weinrott, & Vaught, 1978; Huitema, 1985; Busk & Marascuilo, 1988, and others) have debated whether there is autocorrelation in the errors of SSED data. Regardless, as recommended by Busk and Marascuilo (1992), "single-case researchers should analyze their data not assuming independence of observations" (p. 165).
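As a rough illustration of the quantity at issue, the sketch below computes a conventional lag-one autocorrelation estimate for a series of OLS residuals; it is not one of the bias-corrected small-sample tests discussed later in the text, and the residual values are hypothetical.

```python
import numpy as np

def lag1_autocorrelation(residuals):
    """Conventional lag-1 autocorrelation estimate for a residual series."""
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    return np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)

# hypothetical residuals from an OLS fit to a short SSED series
e = [0.5, 0.8, 0.4, -0.2, -0.6, -0.9, -0.3, 0.2, 0.6, 0.4]
print(lag1_autocorrelation(e))  # clearly positive: suggests positive autocorrelation
```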

Huitema and McKean (1998) demonstrated that appropriate modeling of the trends in SSED datasets can greatly reduce the autocorrelation in residuals. This should be a goal for researchers interested in describing trends in SSED studies because, if autocorrelation is not sufficiently well explained away with predictors, the autoregressive models that are needed to appropriately model the patterns are particularly complex. The use of autoregressive integrated moving average (ARIMA) time-series models as popularized by Box and Jenkins (1976) has been advocated for the more formally termed 'interrupted time series' (ITS) data that are commonly encountered in SSEDs (Glass, Willson, & Gottman, 1975). However, ARIMA models are complicated and function well only with a number of data points per phase that is much higher (i.e. 50 or more; Box & Jenkins, 1976) than is typically encountered in educational SSED research (Hartmann et al., 1980).


Figure 4. AB design, positive trend in baseline and treatment phases and no treatment effect (on level or trend).

² If residuals are negatively autocorrelated, then standard error estimates will be inflated, with overly conservative Type I error rates.


Huitema (1985) examined 881 experiments published in the Journal of Applied Behavior Analysis between 1968 and 1977 and found the modal number of data points per phase to be 3–4, with a median of 5. Busk and Marascuilo (1988) summarized this information for articles published in the same journal from 1975 to 1985 and found that the vast majority of SSED analyses involved fewer than 30 data points for the baseline and intervention phases (85% and 73%, respectively). Not only are model parameters poorly estimated in the presence of autocorrelation, but the autocorrelation (in the residuals) is also poorly estimated with small data sets.

Given the evident problems with estimating autocorrelation, Huitema and McKean's (1998) suggestion to reduce autocorrelation with appropriate model specification holds great merit. Several regression models have been suggested to explain patterns exhibited in two-phase data to explore the existence of potential treatment effects. It should be remembered that one of the strongest criticisms applied to the non-regression effect-size estimates (such as the PND and $\hat{\delta}_{SMD}$) is that they do not reflect the possible impact of a linear trend. In the presence of the simplest kind of trend, a treatment can affect both level and trend and, thus, the relevant model needs to take this into consideration.

SINGLE METRICS BASED ON CHANGE IN R²

As a first reaction to this same concern, Gorsuch (1983) had suggested calculating an effect-size estimate that was a function of the change in R² introduced by adding time as a predictor of the outcome, $Y_t$. A similar procedure was suggested by White, Rusch, Kazdin and Hartmann (1989). White et al.'s suggestion was to calculate an effect-size estimate using the following formula:

$$ES_{Y'} = \frac{Y'_B - Y'_A}{S_{Y'}} \quad (4)$$

where $Y'_i$ is the outcome score on the last day of phase i predicted using the linear regression of Y on Time using data from phase i, and $S_{Y'}$ is the pooled within-phase standard deviation estimate. The standard deviation is calculated for each phase i using:

$$S_{Y'_i} = S_i \sqrt{1 - r^2_{Yt}} \quad (5)$$

where $S_i$ is the conventional standard-deviation estimate of scores calculated within phase i (for each of phases A and B), and $r_{Yt}$ represents the correlation between Y and time. The resulting two values for $S_{Y'_i}$ are pooled together to obtain $S_{Y'}$ (in Equation 4). Given the non-independence of the values on Y, it seems necessary to correct the standard deviation describing the variability of the scores in each phase. A form of the correction used by White et al. (see Equation 5), although using r not r², is applied to standard deviations in related-samples test statistics in large-n meta-analyses (see Morris & DeShon, 2002, for example) to correct for correlated data. Yet, given that the relationship between time and the outcome has already been modeled in each phase with the regression of Y on t, use of this correction does not seem appropriate.
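A sketch of the computation in Equations 4 and 5 follows, assuming one AB series with consecutive integer time points; the reading of Equation 5 as $S_i\sqrt{1 - r^2_{Yt}}$, the sample-size-weighted pooling of the two phase standard deviations, and the function names are interpretive assumptions rather than a verified rendering of White et al.'s procedure.

```python
import numpy as np

def es_last_day(baseline, treatment):
    """Last-day regression-prediction effect size (cf. Equations 4 and 5)."""
    def phase_summary(y, t):
        slope, intercept = np.polyfit(t, y, 1)      # regression of Y on time within phase
        y_pred_last = intercept + slope * t[-1]     # predicted score for the last day
        r = np.corrcoef(y, t)[0, 1]                 # correlation of Y with time
        s_corrected = np.std(y, ddof=1) * np.sqrt(1 - r ** 2)
        return y_pred_last, s_corrected, len(y)

    tA = np.arange(1, len(baseline) + 1)
    tB = np.arange(len(baseline) + 1, len(baseline) + len(treatment) + 1)
    yA_last, sA, nA = phase_summary(np.asarray(baseline, float), tA)
    yB_last, sB, nB = phase_summary(np.asarray(treatment, float), tB)
    s_pooled = np.sqrt(((nA - 1) * sA**2 + (nB - 1) * sB**2) / (nA + nB - 2))
    return (yB_last - yA_last) / s_pooled

# hypothetical AB data
print(es_last_day([2, 3, 3, 4], [5, 6, 7, 9]))
```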

Since these ideas were introduced, several authors have suggested use of a piecewise regression model to describe ITS data (such as is found in AB designs and their extensions). The parameterizations of the piecewise regression model were designed to provide parameters describing potential changes in level and slope upon introduction of the treatment. Center et al. (1985–1986) suggested use of a metric based on a change in R² (ΔR²) for two piecewise regression models. The full piecewise regression model's parameterization,

$$Y_t = \beta_0 + \beta_1 T_t + \beta_2 D_t + \beta_3 (T_t - n_1) D_t + e_t \quad (6)$$

was designed to provide two regression coefficients ($\beta_2$ and $\beta_3$) that described the change in the level of and the trend in the outcome from baseline to treatment. (Note that the variable $T_t$ is used to identify the time point of the outcome, $Y_t$, at time t, and $D_t$ is a dummy-coded variable used to identify whether the outcome was measured in the baseline (D = 0) or the treatment (D = 1) phase.) A restricted piecewise regression model, which included neither the interaction (change in slope) nor the phase (D; change in level) predictor,

$$Y_t = \beta_0 + \beta_1 T_t + e_t \quad (7)$$

would also be estimated. The change in R² for the two models (in Equations 6 and 7) is calculated and converted into an effect size that describes the effect of an intervention on both the slope and intercept. Specifically, the effect size, $\delta_{F(2,df)}$, represents a measure of the change in the proportion of variance in the outcome explained (i.e. the ΔR²) with the simultaneous addition of the change-in-intercept and change-in-trend parameters (from Equation 7 to Equation 6). The effect size is calculated based on the F-ratio statistic testing the ΔR² of the full and restricted models:

$$F = \frac{(R^2_{full} - R^2_{rest})/(df_{rest} - df_{full})}{(1 - R^2_{full})/(n - df_{full} - 1)} \quad (8)$$

where df represents the degrees of freedom. In Equation 8, the 'rest' subscript refers to the 'restricted' regression model in Equation 7.


The 'full' subscript refers to the 'full' regression model in Equation 6 that includes the change-in-level and linear-growth coefficients. This means that the numerator of Equation 8 provides the amount of variability explained by the addition of the change-in-level and change-in-trend coefficients. The associated effect size, $\delta_{F(2,df)}$, is a function of this F-ratio and is designed to correct for the F-ratio being based on two df (Faith et al., 1996):

$$\hat{\delta}_{F(2,df)} = 2\sqrt{\frac{2F}{df_{Error}}} \quad (9)$$

The resulting $\delta_{F(2,df)}$ essentially describes the magnitude of the change in level and trend introduced by the treatment.
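A minimal OLS sketch of this procedure is given below: it fits the full (Equation 6) and restricted (Equation 7) models, forms the ΔR² F-ratio, and converts it via Equation 9. Treating df_Error as the residual degrees of freedom of the full model, and dividing the ΔR² by the two added parameters, is one reasonable reading of Equations 8 and 9; the design-matrix construction and example data are illustrative.

```python
import numpy as np

def r_squared(y, X):
    """R-squared from an OLS fit of y on design matrix X (intercept included)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def delta_F_2df(baseline, treatment):
    """Change-in-R-squared effect size for level and trend jointly (cf. Eqs. 6-9)."""
    y = np.asarray(list(baseline) + list(treatment), dtype=float)
    n, n1 = len(y), len(baseline)
    T = np.arange(1, n + 1, dtype=float)
    D = (T > n1).astype(float)                      # 0 = baseline, 1 = treatment
    X_full = np.column_stack([np.ones(n), T, D, (T - n1) * D])   # Equation 6
    X_rest = np.column_stack([np.ones(n), T])                    # Equation 7
    r2_full, r2_rest = r_squared(y, X_full), r_squared(y, X_rest)
    df_error = n - 4                                # residual df of the full model
    F = ((r2_full - r2_rest) / 2) / ((1 - r2_full) / df_error)   # Equation 8, 2 added terms
    return 2 * np.sqrt(2 * F / df_error)            # Equation 9

# hypothetical AB data
print(delta_F_2df([2, 3, 2, 3, 3, 2], [4, 5, 6, 6, 7, 8]))
```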

Faith et al. (1996) suggested a slight modification to Center et al.'s (1985–1986) suggested metric; however, there are two problems with what was suggested. A fundamental problem is that the piecewise regression used by both Center et al. (1985–1986) and Faith et al. (1996) is better parameterized using Huitema and McKean's (2000) model, in which $[T_t - (n_1 + 1)]D_t$ is used instead of $(T_t - n_1)D_t$ (in Equation 6):

$$Y_t = \beta_0 + \beta_1 T_t + \beta_2 D_t + \beta_3 [T_t - (n_1 + 1)] D_t + e_t \quad (10)$$

In this model, coefficient $\beta_0$ provides an estimate of the baseline phase's intercept (the value of the outcome when T = 0 and D = 0). Coefficient $\beta_1$ provides an estimate of the linear trend in the outcome during the baseline phase (i.e. the change in Y for a change of 1 in T). $\beta_2$ provides an estimate of the difference between the value predicted for the first intervention time point (when $T = n_1 + 1$) based on the linear trend in the intervention data and the value predicted for that time point based on the baseline data. The $\beta_3$ coefficient (of the interaction term) provides an estimate of the change in the slope for the treatment data versus the baseline data. The $\beta_2$ and $\beta_3$ coefficients thus provide valuable information that can be used to describe a treatment's effect on the level and slope, respectively.

Another fundamental problem with Center et al.'s (1985–1986) and Faith et al.'s (1996) metric is that the use of a single effect-size estimate to describe a treatment's effect on both level and slope does not seem optimal. It would seem more appropriate to calculate one metric for an intervention's effect on the slope and another metric to describe its effect on the level of the outcome. Lastly, if any of the models used do not sufficiently explain autocorrelation in residuals, then this could negatively affect the resulting metrics as accurate estimates of effect size.

Still other metrics have been suggested, although problems with these have also been identified (for example, Blumberg, 1984; Crosbie, 1995; and Gorman & Allison, 1996, all of whom have pointed out serious problems with the C statistic of Tryon, 1982). Crosbie (1993, 1995) developed a procedure designed to analyze ITS data such as is encountered in SSEDs. Crosbie wrote a program that uses an alternative estimate of the autocorrelation designed to correct for its bias with small sample sizes. The program, ITSACORR, provides two indices. It includes tests of the change in intercept between baseline and treatment phases and of the change in slope between the two phases while modeling potential autocorrelation. On the basis of the results of a simulation study, Crosbie (1993) found evidence supporting the usefulness of ITSACORR for estimating treatment effects for two-phase, short ITS data sets. More recently, however, Huitema (2004) described some potential problems with the setting of parameters for the design matrix assumed in ITSACORR. These problems arise whenever there might be a trend during the baseline phase. The comparison of intercepts under ITSACORR involves a comparison of the predicted measure at the first baseline time point with the predicted value for the first measure in the treatment phase. When there is a trend during baseline, these intercepts would be expected to differ regardless of whether there is a treatment effect. Another problem is that the value of the intercept predicted for the treatment phase will be a function of the autocorrelation. When no trends exist in the baseline data, ITSACORR has been found to function well.

Several other researchers have also explored the use of a pair of indices to describe a treatment's effect on slope and on level (Beretvas & Chung, 2007; van den Noortgate & Onghena, 2003). Beretvas and Chung used Center et al.'s (1985–1986) ΔR² metric idea paired with Huitema and McKean's (2000) piecewise regression equation (see Equation 10) to derive their pair of effect sizes. Specifically, for their effect size describing the effect of an intervention on the level of an outcome, the authors calculated the ΔR² for the full piecewise regression equation (Equation 10) versus the following restricted regression equation:

$$Y_t = \beta_0 + \beta_1 T_t + \beta_2 [T_t - (n_1 + 1)] D_t + e_t \quad (11)$$

which includes no change-in-level coefficient. The associated F-ratio statistic testing the significance of this ΔR² was then calculated using Equation 8.


The effect size describing the treatment's effect on the level was calculated by substituting the resulting F-ratio into the following equation:

$$\hat{\delta}_{F,Level} = 2\sqrt{\frac{F}{df_{Error}}} \quad (12)$$

Note that, because there is only one parameter added in the full model (Equation 10) over the restricted model (Equation 11), the df associated with this F-ratio is only 1 (see Equation 12 versus Equation 9).

The metric describing the effect of the intervention on the slope is calculated in a similar way. The change in R² is calculated for the full piecewise regression model in Equation 10 versus the following restricted piecewise regression model:

$$Y_t = \beta_0 + \beta_1 T_t + \beta_2 D_t + e_t \quad (13)$$

which assumes no change in slope as a result of the intervention. The F-ratio testing the change in R² is calculated using Equation 8, and the metric describing the effect of the intervention on the slope is calculated in the same way as in Equation 12:

$$\hat{\delta}_{F,Slope} = 2\sqrt{\frac{F}{df_{Error}}} \quad (14)$$

Beretvas and Chung (2007) recommend testing for autocorrelation in the residuals (remaining once the full model has been estimated). If significant autocorrelation is found, the authors recommend using auto-regression to estimate the relevant regression equations (Equations 10, 11, and 13); otherwise, OLS estimation could be used. The authors conducted a simulation study to assess the functioning of their metric and found that it worked well for scenarios in which the model fully explained the autocorrelation. The metric worked less well in scenarios possibly typical of some SSEDs with small data sets, in which there was residual autocorrelation even when it was modeled using auto-regression.
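A sketch of this pair of effect sizes, using OLS throughout, is shown below; the autoregressive re-estimation that the authors recommend when residual autocorrelation is detected is not implemented here, and df_Error is again taken as the full model's residual degrees of freedom.

```python
import numpy as np

def r_squared(y, X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def level_and_slope_effect_sizes(baseline, treatment):
    """delta_F,Level and delta_F,Slope (cf. Equations 10-14), estimated with OLS."""
    y = np.asarray(list(baseline) + list(treatment), dtype=float)
    n, n1 = len(y), len(baseline)
    T = np.arange(1, n + 1, dtype=float)
    D = (T > n1).astype(float)
    slope_term = (T - (n1 + 1)) * D                         # Huitema-McKean coding
    X_full = np.column_stack([np.ones(n), T, D, slope_term])    # Equation 10
    X_no_level = np.column_stack([np.ones(n), T, slope_term])   # Equation 11
    X_no_slope = np.column_stack([np.ones(n), T, D])            # Equation 13
    r2_full = r_squared(y, X_full)
    df_error = n - 4                                        # residual df of the full model
    def delta(r2_restricted):
        F = (r2_full - r2_restricted) / ((1 - r2_full) / df_error)  # Eq. 8, one added term
        return 2 * np.sqrt(F / df_error)                             # Eqs. 12 and 14
    return delta(r_squared(y, X_no_level)), delta(r_squared(y, X_no_slope))

# hypothetical AB data
print(level_and_slope_effect_sizes([2, 3, 2, 3, 3, 2], [5, 6, 6, 7, 8, 8]))
```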

Van den Noortgate and Onghena (2003) suggested calculating two metrics to describe an intervention's effect on the level and linear growth in an outcome. They encouraged estimating the full regression model in Equation 10 using OLS and suggested standardizing the change-in-level and change-in-slope coefficients (i.e. $\beta_2$ and $\beta_3$) by dividing each by the square root of the mean squared error. The authors suggested synthesizing the resulting standardized effect-size estimates using multivariate multilevel modeling (while correcting the covariance matrix for the two effect sizes by dividing its elements by the mean squared error). This procedure addresses several of the concerns associated with some of the other metrics. However, the procedure was applied only to a real data set and was not empirically evaluated.

In addition, as with Beretvas and Chung's (2007) metrics, potential autocorrelation in residuals could affect the precision and accuracy of these metrics when used with small data sets. Thus, when calculating these metrics, the full piecewise regression model (in Equation 10) could be estimated by modeling the potential autocorrelation in residuals. It is anticipated, however, that with the small data sets typically encountered in SSED studies, estimation will still be problematic.
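For completeness, a sketch of the standardization step described above is given; it simply divides the OLS estimates of $\beta_2$ and $\beta_3$ from Equation 10 by the square root of the model's mean squared error. The fitting code and example data are illustrative, and the subsequent multivariate multilevel synthesis is not shown.

```python
import numpy as np

def standardized_level_slope(baseline, treatment):
    """OLS estimates of beta2 and beta3 (Equation 10), each divided by sqrt(MSE)."""
    y = np.asarray(list(baseline) + list(treatment), dtype=float)
    n, n1 = len(y), len(baseline)
    T = np.arange(1, n + 1, dtype=float)
    D = (T > n1).astype(float)
    X = np.column_stack([np.ones(n), T, D, (T - (n1 + 1)) * D])   # Equation 10
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    mse = np.sum(resid ** 2) / (n - X.shape[1])   # residual mean squared error
    return beta[2] / np.sqrt(mse), beta[3] / np.sqrt(mse)

# hypothetical AB data
print(standardized_level_slope([2, 3, 2, 3, 3, 2], [5, 6, 6, 7, 8, 8]))
```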

There is clearly a lack of consensus about the best metric to use to summarize SSED results. And, as is evident from this review of methodological studies' discussions of these metrics, there are problems associated with each of them. However, applied meta-analyses are still being conducted on SSED results. It is important to see which metrics are most commonly used by practitioners despite their associated methodological caveats. This could help inform future research into refinements for the relevant metrics. Given the plethora of metrics that have been suggested for use in describing SSED results, a survey of relatively recent SSED meta-analyses was conducted. A description of the survey and its results is provided below.

SSED-STUDY META-ANALYSES NARRATIVE REVIEW RESULTS

The PsycINFO, MEDLINE and ERIC databases were each searched using the keywords 'meta-analysis', 'review', or 'synthesis' paired with 'single-case', 'single-subject', or 'PND' for the years 1985 to 2005. Applied meta-analyses were included that incorporated computation of some form of quantitative effect size for single-n studies. In addition to the studies identified in the search, relevant applied meta-analyses known to the authors were also surveyed. Only those studies that clearly defined the type of metric calculated were included in the review. If an applied meta-analysis was unclear about any of the other meta-analytic steps, it was still included.

Metrics used

The database searches led to the identification of 279 studies in PsycINFO, 65 in ERIC, and 405 in MEDLINE. Redundancy across databases was removed and the remaining subset of studies was assessed to identify which of them involved applied meta-analyses that synthesized results from SSEDs using clearly described quantitative metrics. The resulting 21 studies were supplemented with another 4 applied single-n meta-analyses that met criteria for inclusion. Thus, a total of 25 meta-analyses were summarized.



The most popular metric was the PND index, which was commonly used along with the percentage of zero data points (PZD). The PND and the PZD were used in 12 of the 25 meta-analyses. The next most popular metric, used in 7 of the studies, was the $\hat{\delta}_{SMD}$, in various versions. This metric was typically obtained by dividing the difference in the outcome means of the selected AB phases by the standard deviation pooled across the data points in the baseline phase. Three studies used a version of the mean baseline reduction procedure, and three other studies used results from analyses using Crosbie's ITSACORR software. Two studies used metrics based on an incorrectly specified piecewise regression model.

Of the 25 meta-analyses, 7 used multiple indices to describe their results. Of these 7, however, 4 used PZD with PND, which tend to provide very similar summaries. Campbell (2003) and Bonner and Barnett (2004) used a version of the standardized mean difference and the PND to assess treatments' effectiveness. Maughan, Christiansen and Jenson (2005) standardized the results from ITSACORR and also used $\hat{\delta}_{SMD}$. Marquis et al. (2000) used three effect sizes, including one very like the mean baseline reduction, the $\hat{\delta}_{SMD}$, and a regression-based effect size.

Results of this survey indicated that most SSED meta-analysts are using the simplest indicators to synthesize results (i.e. PND and the standardized mean difference). Given the typically small data sets involved in these meta-analyses, only simple metrics should really be used. However, use of these particular indices has been criticized, as noted above.

In addition to investigating the types of metrics that were used to describe studies' results, other steps in the meta-analytic process were also briefly summarized. The foundation of most SSEDs entails a comparison between results in a baseline phase and results in an intervention phase. Nevertheless, the simple AB design in and of itself is not used, due to the resulting validity threats (see, for example, Kazdin, 1982). Instead, more complex designs, such as multiple baseline, reversal, alternating treatment designs, and the like, are used. Interpretation of the results from these more complex designs, however, still focuses on the pattern in an intervention phase compared with the pattern in the relevant baseline phase. The problem is, of course, that there are frequently multiple treatment phases and even multiple baseline phases. In an ABAB design, for example, a researcher could calculate two indices to describe the treatment's effect on a single subject. (One index would describe the effect from the first baseline to the first treatment phase and the other would describe the effect for the second AB phase.) How are these two indices then used in the meta-analysis? Are they treated as independent data points? They should not be treated as independent because they represent the treatment's effect on the same person. In a design including the following sequence of phases, ABC, are two effect sizes calculated, one describing results in phase B versus the baseline phase (A) and another describing the pattern of outcome scores in phase C versus the same baseline phase? How do meta-analysts handle these two dependent metrics?

Typically, a primary study does not just investigate the effect of an intervention on a single subject. In a multiple-baseline study, for example, the researcher might assess a treatment's effectiveness for three participants who are part of the same multiple-baseline study. This would lead to the calculation of three outcomes. How are these three metrics used in an ensuing meta-analysis? Are they treated as independent? It seems likely that there is some degree of dependence among these three metrics, given the commonality in the participants' experiences: their being assessed by the same researcher, possibly being treated by the same caregiver, and their assessment taking place in the same setting.

It is also possible that the intervention's effect is not evaluated for just a single outcome. Quite frequently, multiple measures are used to assess the effect of an intervention. If, for example, two kinds of social behaviors are tallied for a single participant, then two outcomes could be calculated for the participant. These metrics cannot be assumed to be independent, because they describe the same participant and the measures being assessed are probably correlated.

Little methodological research addresses how SSED meta-analysts should handle multiple treatments, participants, or outcomes. In other words, it is unclear how the dependence of outcomes within each primary study is handled. It seems important to consider how to handle these issues, and a preliminary step involves assessing how they are currently being handled.

Techniques used to deal with the dependence of outcomes yielded by the same metric

Given the lack of methodological focus on the issue of dependence of multiple outcomes yielded by the same metric within studies, the majority of meta-analyses were not very clear about how this issue was dealt with.


Of the 25 meta-analyses that were reviewed, 10 did not make explicit how the inevitable multiple-treatment dependence was handled. Some seemed to have treated the multiple outcomes per primary study as independent, but it was not completely clear. The most commonly described method (in 7 meta-analyses) for handling multiple-treatment dependence was to average the multiple indices (one per treatment phase versus preceding baseline phase) together. This meant taking a simple mean of $\hat{\delta}_{SMD}$s (e.g. Swanson & Sachse-Lee, 2000) or using the mean or median of the PNDs (e.g. Algozzine, Browder, & Karvonen, 2001; Schlosser & Lee, 2000). In six of the meta-analyses, the multiple-treatment outcomes per study were treated as independent, thereby ignoring the possible dependence. In one study (Skiba, Casey, & Center, 1985–1986), only the outcome for the first treatment phase was used, to counter possible carry-over effects. And one study used the outcome associated with the largest-dosage treatment (Allison, Faith, & Franklin, 1995).

The techniques used to handle multiple measures per study were also reviewed. The majority of the studies (n = 13) summarized results separately for each of the related outcomes. For example, Skiba et al. (1985–1986) presented metrics for each of the following different, but related, types of behavior: withdrawn, noncompliant, management problem, off-task, appropriate, and social interaction behaviors. In eight of the meta-analyses, it was very unclear how the dependence among multiple related outcomes was handled. Four studies calculated simple averages across the multiple measures per participant.

When multiple participants per study were encountered, a similar set of techniques was used to handle the resulting within-study dependencies. Almost half of the studies (n = 12) ignored the dependence and treated each participant's outcome as independent. Four of the studies calculated a single outcome for each study by aggregating the participants' outcomes. Three of the studies (Carey & Matyas, 2005; Scholz & Ott, 2000; Wurthmann, Klieser, & Lehmann, 1996) used meta-analytic techniques to summarize the results gathered in their study and treated each participant's results as independent. For example, Scholz and Ott (2000) synthesized the 21 p values from ITSACORR analyses of 21 participants' data. In six studies, how the dependence resulting from multiple participants per study was handled was not clearly described.

In terms of the analyses conducted using the resulting outcomes, the majority of meta-analyses reported average (mean or median) metric values across the set of metrics (either per study or per participant, as noted above). Sample or study characteristics were explored as moderators of outcomes in several studies (n = 6), using either descriptive or inferential statistics. All studies included a table listing the outcomes by study (and participant, treatment, and outcome, as mentioned above).

DISCUSSION

As outlined in the introduction, there are multiple metrics that have been introduced and suggested for use with SSEDs. Unfortunately, there are methodological issues associated with the majority of these metrics. These issues are founded in the complex time-series nature of SSEDs and in the inevitable developmental trajectories of outcomes measured over time. A critical weakness of many of the metrics currently used in applied meta-analyses is that only a single metric, rather than multiple metrics, is used to describe a treatment's effect. Development leads to possible trends over time. If there is a trend in an outcome even without intervention, then the level of the outcome will inevitably change at a later (e.g. intervention-phase) time point. For example, a treatment might only be considered clinically significant if it produces immediate change in the dependent variable. For such studies, a summary of differences in means between baseline and treatment could be appropriate. More commonly, it is recognized that the effect of a treatment might entail more gradual improvement. For example, in developing communication behaviors in children with autism, treatment is expected to be associated with a gradual, positively accelerating trend. In this scenario, not only is the level of the outcome expected to change with the introduction of the treatment, so is the growth, or slope, of the outcome. Thus, a descriptor other than a single mean-shift metric would be needed to describe the treatment's effect. A single number could not effectively convey a change in both level and slope.

Other treatments might be designed to change outcomes that already naturally increase (or decrease) over time during baseline. Such treatments might be designed to accelerate (or decelerate) the trend in the outcome that already exists without intervention. To identify the effectiveness of these kinds of treatments, again, a single metric cannot convey the expected change in level and trend. The current paper has focused solely on the potential for linear trends. There is, of course, a possibility of curvilinear trends for certain outcomes and especially of asymptotic trends (Shadish & Rindskopf, 2007). This adds further complexity to the models needed to describe an intervention's effect, and is an important area of continued research.


It seems that more specific guidelines that outline which metric to use for which kind of treatment-effect investigation are needed. In addition, while there are a host of possible alternatives, the emphasis needs to be on matching the method with the anticipated trends in the data.

The other challenge to the identification of an appropriate metric for summarizing SSED results is autocorrelation. Autocorrelation changes the variability of the standardized metrics associated with SSEDs over what would be expected with large-n designs. Traditional tests of autocorrelation have been found to be biased when used with data sets as small as those typically encountered in the SSED literature. More recent research has identified some modified test statistics that perform well in identifying lag-one autocorrelation for small data sets (Huitema & McKean, 2000; Riviello & Beretvas, 2008). It is hoped that future research will evolve from better identification of, to potential correction for, autocorrelation. This could then lead to metric formulation that controls for autocorrelation resulting from repeated measures of individuals.

Additional challenges to the meta-analysis of results from SSEDs have also been noted and surveyed in the current study. A primary challenge involves the inevitable dependence of metrics within studies being synthesized. As noted above, this dependence can result for individual participants' data in the following ways: (a) from calculating multiple outcomes with repeated use of single-baseline-phase data (e.g. for a design incorporating a pattern such as ABC); (b) from multiple treatments per study (e.g. a design including the ABAC pattern); or (c) from multiple outcome measures per participant. Multiple dependent outcomes could also be associated with a primary study's results because the study involves multiple participants (e.g. multiple-baseline designs). From the results of the narrative review, it seems clear that a number of techniques are being used to handle this dependence. These techniques match the various techniques used to handle multiple dependent outcomes per study in large-n meta-analyses. While several ad hoc techniques are used with large-n meta-analyses (including averaging together each study's effect sizes, selecting a 'best' single effect size for each study, etc.), multivariate pooling through generalized least squares estimation (Becker, 1992) is also frequently used. Unfortunately, generalized least squares cannot yet be used with SSED results, because of the unknown sampling distributions of most of the SSED metrics. This again underscores the importance of identifying SSED metrics and their associated sampling distributions. This could then lead to better-founded techniques for handling within-study dependence. Identification of optimal metrics and their sampling distributions will also provide accurate variance estimates that could then be used in the weighting of SSED metrics for calculating pooled estimates.

CAVEATS AND RECOMMENDATIONS FOR SSED META-ANALYSTS

In the culture of evidence-based practice and accountability, many researchers and practitioners are turning to meta-analysis to provide the relevant evidence for best practice. The meta-analysis tradition was founded originally to summarize treatment effects from large-n studies, most typically involving a comparison of groups at a single time point. As the use of SSED studies increased, meta-analytic researchers tried to impose the effect sizes used with large-n studies on results from single-n studies. Unfortunately, while SSED studies typically involve a comparison of outcome scores for an individual under treatment with his or her scores during a baseline phase, the pattern of these scores is assessed over time for an individual. An effect size needed to describe changes in an individual over time will not have the same metric as an effect size comparing groups of individuals at a single time point. In fact, only for the simplest pattern of change anticipated with a treatment (in which there is no trend during baseline and treatment and the treatment is designed only to change the outcome's level) could there be some correspondence between large-n and SSED summaries. Both designs could then be used to detect a change in outcome level. However, even for this simplest scenario, the variability underlying estimates, and thus the resulting metrics of the associated effect sizes, will differ as a result of the studies' designs. This subtle difference between large-n and SSED studies lies at the root of what complicates the formulation of a useful effect size for use with single-n data. And, more importantly, researchers should not quantitatively synthesize results from large-n and single-n studies (Kavale, Mathur, Forness, Quinn, & Rutherford, 2000). Separate large-n and SSED syntheses should be conducted.

Using data from an applied meta-analysis of school-based interventions' effects on communication skills, Wendt (2008) and Beretvas, Chung, Machalicek and Riviello (2008) compared various non-parametric and parametric effect-size estimates, respectively. Different inferences about the intervention's effectiveness were made on the basis of the metric that was used to describe the studies' results. This does not seem surprising, given the differing sources of the criticisms of SSED metrics (e.g. autocorrelation, trend in baseline, need for multiple metrics).

138 META-ANALYSES OF SINGLE-SUBJECT EXPERIMENTAL DESIGNS

Dow

nloa

ded

by [

UZ

H H

aupt

bibl

ioth

ek /

Zen

tral

bibl

ioth

ek Z

üric

h] a

t 13:

24 1

0 Se

ptem

ber

2013

Page 12: A review of meta-analyses of single-subject experimental designs: Methodological issues and practice

The authors of both papers found that, when the effect-size-based inferences did not converge, it was possible to identify the source of the divergence from the primary study's data or design. This strongly supports triangulation of metrics through meta-analysts' use of multiple SSED metrics until some optimal metric has been derived. When divergent inferences result, the meta-analyst is strongly encouraged to identify the source of the differences by close examination of each primary study.
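As an illustration of such triangulation, the Python sketch below applies two non-parametric SSED metrics to the same hypothetical AB-phase series (assuming an intervention intended to increase the outcome): PND, the percentage of non-overlapping data (Scruggs, Mastropieri, & Casto, 1987), and PEM, the percent of data points exceeding the baseline median (Ma, 2006). A single extreme baseline observation pushes the two metrics to opposite conclusions, and tracing that discrepancy back to the primary study's data is exactly the kind of close examination recommended here.

# Hypothetical AB-phase data; the intervention is assumed to aim at increasing
# the outcome, so both metrics count treatment points above a baseline benchmark.
import statistics

baseline = [2, 3, 2, 9, 3]       # note the single outlying baseline point (9)
treatment = [7, 8, 6, 8, 7, 8]

# PND: share of treatment points exceeding the highest baseline point
pnd = 100 * sum(x > max(baseline) for x in treatment) / len(treatment)

# PEM: share of treatment points exceeding the baseline median
pem = 100 * sum(x > statistics.median(baseline) for x in treatment) / len(treatment)

print(f"PND: {pnd:.0f}%   PEM: {pem:.0f}%")  # here PND = 0% while PEM = 100%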

Finally, no mention has been made of criteria that could be used to categorize metric values mimicking Cohen's (1977) 'small', 'moderate' and 'large' cut-offs. Until a metric (and sampling distribution) is derived that better matches the pattern of SSED data being synthesized, numerical cut-offs are perhaps less important than the consideration of clinical significance (Shogren, Faggella-Luby, Bae, & Wehmeyer, 2004).

It is hoped that methodological research focused on synthesizing results from SSEDs will continue to evolve, and that some of these challenges will eventually be overcome. In the meantime, careful descriptive summaries of studies' results can and should be conducted.

REFERENCES

References used in the narrative literature review are marked with an asterisk.
*Algozzine, B., Browder, D., & Karvonen, M. (2001). Effects of interventions to promote self-determination for individuals with disabilities. Review of Educational Research, 71, 219–277.
*Allison, D. B., Faith, M. S., & Franklin, R. D. (1995). Antecedent exercise in the treatment of disruptive behavior: A meta-analytic review. Clinical Psychology: Science and Practice, 2, 279–304.
Becker, B. J. (1992). Using results from replicated studies to estimate linear models. Journal of Educational Statistics, 17, 341–362.
Beretvas, S. N., & Chung, H. (2007, May). R-squared change effect size estimates for single-subject meta-analyses. Paper presented at the 7th Annual International Campbell Collaboration Colloquium in London, England.
Beretvas, S. N., Chung, H., Machalicek, W. A., & Riviello, C. (2008, May). Computation of regression-based effect size measures. Paper presented at the 8th Annual International Campbell Collaboration Colloquium in Vancouver, Canada.
Blumberg, C. J. (1984). Comments on "A simplified time-series analysis for evaluating treatment interventions". Journal of Applied Behavior Analysis, 17, 539–542.
*Bonner, M., & Barnett, D. W. (2004). Intervention-based school psychology services: Training for child-level accountability; preparing for program-level accountability. Journal of School Psychology, 42, 23–43.
Box, G. E. P., & Jenkins, J. M. (1976). Time series analysis: Forecasting and control (2nd ed.). San Francisco, CA: Holden-Day.
Busk, P. L., & Marascuilo, L. A. (1988). Autocorrelation in single-subject research: A counterargument to the myth of no autocorrelation. Behavioral Assessment, 10, 229–242.
Busk, P. L., & Marascuilo, L. A. (1992). Statistical analysis in single-case research: Issues, procedures, and recommendations, with applications to multiple behaviors. In T. R. Kratochwill & J. R. Levin (Eds.), Single-case research design and analysis: New directions for psychology and education (pp. 159–185). Hillsdale, NJ: Erlbaum.
Campbell, J. M. (2003). Efficacy of behavioral interventions for reducing problem behavior in persons with autism: A quantitative synthesis of single-subject research. Research in Developmental Disabilities, 24, 120–138.
*Carey, L. M., & Matyas, T. A. (2005). Training of somatosensory discrimination after stroke: Facilitation of stimulus generalization. American Journal of Physical and Medical Rehabilitation, 84, 428–442.
Center, B. A., Skiba, R. J., & Casey, A. (1985–1986). A methodology for the quantitative synthesis of intrasubject design research. Journal of Special Education, 19, 387–400.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (rev. ed.). Hillsdale, NJ: Erlbaum.
Cooper, H., & Hedges, L. V. (Eds.). (1994). The handbook of research synthesis. New York, NY: Russell Sage Foundation.
Crosbie, J. (1993). Interrupted time-series analysis with single-subject data. Journal of Consulting and Clinical Psychology, 61, 966–974.
Crosbie, J. (1995). Interrupted time-series analysis with short series: Why it is problematic; how it can be improved. In J. M. Gottman (Ed.), The analysis of change (pp. 361–395). Mahwah, NJ: Erlbaum.
*Didden, R., Duker, P. C., & Korzilius, H. (1997). Meta-analytic study on treatment effectiveness for problem behaviors with individuals who have mental retardation. American Journal on Mental Retardation, 101, 387–399.
*DuPaul, G. J., & Eckert, T. L. (1997). The effects of school-based interventions for attention deficit hyperactivity disorder: A meta-analysis. School Psychology Review, 26, 5–27.
Faith, M. S., Allison, D. B., & Gorman, B. S. (1996). Meta-analysis of single-case research. In D. R. Franklin, D. B. Allison, & B. S. Gorman (Eds.), Design and analysis of single-case research (pp. 245–277). Hillsdale, NJ: Erlbaum.
Galassi, J. P., & Gersh, T. L. (1993). Myths, misconceptions and missed opportunity: Single-case designs and counseling psychology. Journal of Counseling Psychology, 40, 525–531.
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3–8.
Glass, G. V., Willson, V. W., & Gottman, J. M. (1975). Design and analysis of time-series experiments. Boulder, CO: Colorado Associated University Press.
Gorman, B. S., & Allison, D. B. (1996). Statistical alternatives for single-case designs. In D. R. Franklin, D. B. Allison, & B. S. Gorman (Eds.), Design and analysis of single-case research (pp. 159–214). Hillsdale, NJ: Erlbaum.
Gorsuch, R. L. (1983). Three methods for analyzing limited time-series (N of 1) data. Behavioral Assessment, 5, 141–154.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York, NY: Academic Press.
Hershberger, S. L., Wallace, D. D., Green, S. G., & Marquis, J. G. (1999). Meta-analysis of single-case designs. In R. H. Hoyle (Ed.), Statistical strategies for small sample research (pp. 107–132). Thousand Oaks, CA: Sage.

Huitema, B. E. (1985). Autocorrelation in applied behavior analysis: A myth. Behavioral Analysis, 7, 107–110.
Huitema, B. E. (2004). Analysis of interrupted time-series experiments using ITSE: A critique. Understanding Statistics, 3, 27–46.
Huitema, B. E., & McKean, J. W. (1998). Irrelevant autocorrelation in least-squares intervention models. Psychological Methods, 3, 104–116.
Huitema, B. E., & McKean, J. W. (2000). Design specification issues in time-series intervention models. Educational and Psychological Measurement, 60, 38–58.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.
Jones, R. R., Weinrott, M., & Vaught, R. S. (1978). Effects of serial dependency on the agreement between visual and statistical inference. Journal of Applied Behavior Analysis, 11, 277–283.
*Kahng, S., Iwata, B. A., & Lewin, A. B. (2002). Behavioral treatment of self-injury, 1964 to 2000. American Journal on Mental Retardation, 107, 212–221.
Kavale, K. A., Mathur, S. R., Forness, S. R., Quinn, M. M., & Rutherford, R. B. (2000). Right reason in the integration of group and single-subject research in behavioral disorders. Behavioral Disorders, 25, 142–157.
Kazdin, A. E. (1982). Single-case research designs: Methods for clinical and applied settings. New York, NY: Oxford University Press.
Lundervold, D., & Bourland, G. (1988). Quantitative analysis of treatment of aggression, self-injury, and property destruction. Behavior Modification, 12, 590–617.
Ma, H. (2006). An alternative method for quantifying synthesis of single-subject research: Percent of data points exceeding the median. Behavior Modification, 30, 598–617.
*Marquis, J. G., Horner, R. H., Carr, E. G., Turnbull, A. P., Thompson, M., Behrens, G. A., et al. (2000). A meta-analysis of positive behavior support. In R. M. Gersten, E. P. Schiller, & S. Vaughn (Eds.), Contemporary special education research: Syntheses of the knowledge base on critical instructional issues (pp. 137–178). Mahwah, NJ: Erlbaum.
*Mastropieri, M. A., & Scruggs, T. E. (1985–1986). Early intervention for socially withdrawn children. Journal of Special Education, 19, 429–441.
*Mathur, S. R., Kavale, K. A., Quinn, M. M., Forness, S. R., & Rutherford, R. B. (1998). Social skills interventions with students with emotional and behavioral problems: A quantitative synthesis of single-subject research. Behavioral Disorders, 23, 193–201.
*Maughan, D. R., Christiansen, E., & Jenson, W. R. (2005). Behavioral parent training as a treatment for externalizing behaviors and disruptive behavior disorders: A meta-analysis. School Psychology Review, 34, 267–286.
Morris, S. B., & DeShon, R. P. (2002). Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs. Psychological Methods, 7, 105–125.
Nourbakhsh, M. R., & Ottenbacher, K. J. (1994). The statistical analysis of single-subject data: A comparative examination. Physical Therapy, 74, 768–776.
Pankratz, A. (1983). Forecasting with univariate Box–Jenkins models: Concepts and cases. New York, NY: Wiley.
Parker, R. I., & Hagan-Burke, S. (2007). Single-case research results as clinical outcomes. Journal of School Psychology, 45, 637–653.
Parker, R. I., Hagan-Burke, S., & Vannest, K. (2007). Percentage of all non-overlapping data (PAND): An alternative to PND. Journal of Special Education, 40, 194–204.
Parsonson, B. S., & Baer, D. M. (1992). The visual analysis of data, and current research into the stimuli controlling it. In T. R. Kratochwill & J. R. Levin (Eds.), Single-case research design and analysis: New directions for psychology and education (pp. 15–40). Hillsdale, NJ: Erlbaum.
Riviello, C., & Beretvas, S. N. (2008). Detecting lag-one autocorrelation in interrupted time series designs with small sample sizes. Manuscript submitted for publication.
*Schlosser, R. W., & Lee, D. L. (2000). Promoting generalization and maintenance in augmentative and alternative communication: A meta-analysis of 20 years of effectiveness research. Augmentative and Alternative Communication, 16, 208–226.
Schlosser, R. W., Lee, D., & Wendt, O. (in press). The percentage of non-overlapping data (PND): A systematic review of reporting characteristics in systematic reviews and meta-analyses. Evidence-Based Communication Assessment and Intervention.
*Scholz, O. B., & Ott, R. (2000). Effect and course of tape-based hypnotherapy in subjects suffering from insomnia. Australian Journal of Clinical Hypnotherapy and Hypnosis, 21, 96–114.
*Scotti, J. R., Evans, I. M., Meyer, L. H., & Walker, P. (1991). A meta-analysis of intervention research with problem behavior: Treatment validity and standards of practice. American Journal on Mental Retardation, 96, 233–256.
Scruggs, T. E., Mastropieri, M. A., & Casto, G. (1987). The quantitative synthesis of single-subject research: Methodology and validation. Remedial and Special Education, 8, 24–33.
*Scruggs, T. E., Mastropieri, M. A., Cook, S. B., & Escobar, C. (1986). Early intervention for children with conduct disorders: A quantitative synthesis of single-subject research. Behavioral Disorders, 11, 260–271.
*Scruggs, T. E., Mastropieri, M. A., Forness, S. R., & Kavale, K. A. (1988). Early language intervention: A quantitative synthesis of single-subject research. Journal of Special Education, 22, 259–283.
*Scruggs, T. E., Mastropieri, M. A., & McEwen, I. (1988). Early intervention for developmental functioning: A quantitative synthesis of single-subject research. Journal of the Division for Early Childhood, 12, 359–367.
Shadish, W. R., & Rindskopf, D. M. (2007). Methods for evidence-based practice: Quantitative synthesis of single-subject designs. In G. Julnes & D. J. Rog (Eds.), Informing federal policies on evaluation method: Building the evidence base for method choice in government sponsored evaluation (pp. 95–109). San Francisco, CA: Jossey-Bass.
*Shogren, K. A., Faggella-Luby, M. N., Bae, S. J., & Wehmeyer, M. L. (2004). The effect of choice-making as an intervention for problem behavior: A meta-analysis. Journal of Positive Behavior Interventions, 6, 228–237.
*Skiba, R. J., Casey, A., & Center, B. A. (1985–1986). Nonaversive procedures in the treatment of classroom behavior problems. Journal of Special Education, 19, 459–481.
*Swanson, H. L., & Sachse-Lee, C. (2000). A meta-analysis of single-subject-design intervention research for students with LD. Journal of Learning Disabilities, 33, 114–136.
*Swanson, H. L., O'Shaughnessy, T. E., & McMahon, C. M. (1998). A selective synthesis of single subject design intervention research on students with learning disabilities. Advances in Learning and Behavioral Disabilities, 12, 79–126.
Tryon, W. W. (1982). A simplified time-series analysis for evaluating treatment interventions. Journal of Applied Behavior Analysis, 15, 423–429.
van den Noortgate, W., & Onghena, P. (2003). Hierarchical linear models for the quantitative integration of effect sizes in single-case research. Behavior Research Methods, Instruments & Computers, 35, 1–10.
van den Noortgate, W., & Onghena, P. (in press). A multi-level analysis of single-subject studies. Evidence-Based Communication Assessment and Intervention.

Wendt, O. (2008, May). Computation of non-regression-based effect size metrics. Paper presented at the 8th Annual International Campbell Collaboration Colloquium in Vancouver, Canada.
White, O. R., Rusch, F. R., Kazdin, A. E., & Hartmann, D. P. (1989). Applications of meta-analysis in individual subject research. Behavioral Assessment, 11, 281–296.
Wurthmann, C., Klieser, E., Lehmann, E., & Krauth, J. (1996). Single-subject experiments to determine individually differential effects of anxiolytics in generalized anxiety disorder. Neuropsychobiology, 33, 196–201.
Xin, Y. P., Grasso, E., Dipipi-Hoy, C. M., & Jitendra, A. (2005). The effects of purchasing skill instruction for individuals with developmental disabilities: A meta-analysis. Exceptional Children, 71, 379–400.
