
BEHAVIOR THERAPY 34, 189-211, 2003

Evaluating Single-Case Research Data: A Comparison of Seven Statistical Methods

RICHARD I. PARKER

DANIEL F. BROSSART

Texas A & M University

This study examined and compared the performances of seven popular or promising techniques for analyzing between-phase differences in single-case research designs. The techniques are: (a) Owen White's binomial test on extended Phase A baseline (White & Haring, 1980), (b) D. M. White, Rusch, Kazdin, and Hartmann's Last Treatment Day technique (1989), (c) Gorsuch's "trend analysis effect size" (Faith, Allison, & Gorman, 1996; Gorsuch, 1983), (d) Center's mean-only and mean-plus-trend models (Center, Skiba, & Casey, 1985-1986), and (e) Allison's mean-only and mean-plus-trend models (Allison & Gorman, 1993; Faith et al., 1996). The techniques were assayed by applying them to a set of 50 single-case AB design (baseline and intervention) data sets, constructed to represent a range of type and degree of intervention effects. From analysis of these 50 data sets, four questions were answered about the analytic techniques: (a) How much statistical power is possessed by the more promising techniques? (b) What typical R² effect sizes are evidenced for graphed data sets which, according to visual analysis, show a positive intervention effect? (c) How do the seven analytic techniques covary with one another? and (d) To what extent does each technique tend to produce autocorrelated residuals?

The debate over whether visual analysis is sufficient for single-case research data has reached stasis, if it has not been resolved. The advantages of visual analysis (Baer, 1977; Michael, 1974; Parsonson & Baer, 1978, 1986) are widely acknowledged, but statistical analysis is now recommended to supplement visual analysis in most cases. Kazdin (1982) acknowledged that statistical procedures may be of value when (a) there is no stable baseline, (b) expected treatment effects cannot be well predicted, as with a new treatment, or (c) statistical control is needed for extraneous factors in naturalistic environments. Huitema (1986), another strong proponent of visual analysis, recommends adding statistical analyses when unambiguous results must be shared with other professionals (p. 228). Franklin, Gorman, Beasley, and Allison (1996) conclude a recent book chapter on the subject by emphasizing the need to integrate visual and statistical analyses.

Address correspondence to Richard I. Parker, Department of Educational Psychology, Texas A & M University, 4225 TAMU, College Station, TX 77843-4225; e-mail: [email protected].

189    0005-7894/03/0189-0211$1.00/0    Copyright 2003 by Association for Advancement of Behavior Therapy

All rights for reproduction in any form reserved.


D. M. White et al. (1989) concur, stating that the "logic of decision making that underlies single-case investigations is compatible with statistical reasoning" (p. 283). They make the point that the challenge is to develop statistical tests that are sensitive and sophisticated enough to reflect the complex judgments involved in visual analysis.

Although most experts now call for statistical procedures to enhance visual analysis, visual analysis alone still predominates in journal articles (Busk & Marascuilo, 1992; Kratochwill & Brody, 1978). Statistical analyses have been documented in less than 10% of studies sampled (Busk & Marascuilo; Kratochwill & Brody). Our own current informal review of single-case research articles published in counseling, clinical, and school psychology journals over the past 15 years found visual analysis used alone in over 65% of articles.

Researchers wishing to supplement visual analysis of single-case research data with statistical analysis have a number of available techniques but little information on any one technique. The number of analytic techniques available for short data series has easily tripled since the early 1980s (Barlow & Hersen, 1984; Kazdin, 1982), yet promising techniques such as the regression models of Center, Skiba, and Casey (1985-1986) and of Allison and colleagues (Allison & Gorman, 1993; Faith, Allison, & Gorman, 1996) have to date been applied in only a handful of studies. They are known mainly through summaries in texts by Franklin, Allison, and Gorman (1996) and Kratochwill and Levin (1992).

Because these analytic techniques are rarely published in application, their comparative performance has not been explored. Very few studies have attempted to compare the measurement attributes of these new techniques on the same data sets. Those few comparative studies have concluded that different techniques produce quite different results. Nourbakhsh and Ottenbacher (1994) found that three supposedly similar statistical indices (Tryon's C statistic, the two-standard-deviation band method, and Owen White's split-middle method) performed very differently on the same data series.

Consequently, researchers lack evidence on how these techniques behave with various single-case data attributes and how they compare with one another. What is the statistical power of these techniques to detect noteworthy effects in the relatively short data sets available to most single-case clinicians and researchers? To what extent do the various techniques tend to agree with one another (covary) in analysis of the same data? Do the various techniques yield comparable or quite different effect sizes? To what extent is autocorrelation (serial dependency) of residuals an issue with each technique? Answers to such questions are needed for scientist-practitioners to comfortably use the statistical techniques.

Effect Size in Single-Case Analysis

Criticism of statistical significance testing (Kirk, 1996, p. 747) has refocused social science research from p values to magnitude-of-effect measures or "effect sizes" (Cohen, 1988, 1990; Kupfersmid, 1988; Rosnow & Rosenthal, 1989). These criticisms have resulted in a steady increase in effect-size reporting in published psychological research (Dar, Serlin, & Omer, 1994). Three major advantages of the effect size over statistical significance testing are described by Mitchell and Hartmann (1981). First, effect sizes provide an index of the strength of association between intervention and outcome, implying how much of the outcome variable can be explained, controlled, and predicted by the intervention (Carver, 1978; Rosnow & Rosenthal). Second, effect sizes provide a continuous (rather than dichotomous) index of treatment success, supporting decisions of degree or increment, not just those to continue versus terminate treatment. Third, effect sizes are not systematically affected by sample size, so a strong effect may be discerned even within a short data series. Busk and Serlin (1992) conclude that an effect-size measure is the "obvious choice" for summarizing single-case study effects (p. 192).

The shift in focus to effect sizes has led to a related shift in focus from statistical significance to practical significance (Shaver, 1991; Thompson, 2002), and from there to clinical significance (i.e., improvement from a dysfunctional to a functional range of activity; Jacobson, Follette, & Revenstorf, 1984). However, there does not yet seem to be consensus in the field on what process or metric should determine practical significance.

Effect-Size Types

Kirk (1996) found from a survey of four APA journals that only two of the measures of magnitude of effect (R² and η²) were used with any frequency (η² is simply R² derived from categorical predictors). The competing major effect-size family is the d statistic, or "standardized mean difference" (including Cohen's d, Glass's g, Hedges's g), promoted by Hedges and Olkin (1985) and others (Kirk), which can be readily converted to R² (Cohen, 1988; Rosenthal, 1984). Compared to the d family (Hedges & Olkin; Kirk), the R² effect-size family provides greater flexibility in interpretation of single-case research designs. Single-case researchers can select the interpretation that best fits their purpose and orientation: (a) "proportion of a client's score variance explained by phase differences," (b) "reduction in uncertainty (percent increase in prediction ability) due to phase differences," (c) "percent of nonoverlap of client scores between phases," or (d) "the percent of the scores of one phase exceeded by the upper half of the scores of the other phase" (Cohen, p. 22). Another benefit of the R² family is that most current innovations in this field seem to be regression model-based. Finally, complex models with both trend and mean difference components are difficult to conceptualize through "standardized mean differences" formulas. After reviewing most available analytic techniques for single-case data, Faith et al. (1996) concluded that regression approaches are the best available, though "no gold standard" (p. 253). In the present study, we have selected R² effect sizes as the final metric for all seven statistical techniques.
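For two phases of equal length, the conversion between the two effect-size families can be written compactly; a sketch of the standard formulas (the same d-to-R² relation is used for the LTD technique in the appendix):

```latex
R^2 = \frac{d^2}{d^2 + 4}
\qquad\text{and, inversely,}\qquad
d = \frac{2R}{\sqrt{1 - R^2}}
```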


Effect-Size Interpretation

Cohen (1988) provided ballpark descriptors of "large" (R² = .25), "medium" (R² = .09), and "small" (R² = .01) effects (his η² ballpark descriptors are slightly different), but stressed that these guidelines were derived from large group social science research. Kirk (1996) cautioned that it is important "not to sanctify" any particular effect-size numbers, given their contextual dependency (p. 756).

Effect-size interpretation in single-case research depends first on its calculation formula (Mitchell & Hartmann, 1981). Multiple regression models tend to produce larger effect sizes than simple regression models, and partialing or semipartialing effects will modify an effect size in predictable ways (Cohen, 1988). Effect sizes also are dependent on the experimental context (Rosnow & Rosenthal, 1989). Studies with different treatment intensity levels, different time frames, or different types of participants will tend to produce different effect sizes (Maxwell, Camp, & Avery, 1981; Mitchell & Hartmann). Kirk (1996) emphasizes that to interpret effect sizes from unfamiliar measurement scales, new guidelines first need to be developed, a goal of this present study.

Effect-Size Limitations

Despite their advantages, effect sizes should, according to Thompson (1998), always be presented with an index of reliability, such as confidence intervals (CIs) around the effect sizes (Fowler, 1985), though CIs are rarely published (Kirk, 1996). When conducting several R² analyses, an alternative approach to indicating R² reliability is to conduct power analyses. The resulting power curve indicates whether effects at a given level will be statistically significant, given the study's typical N and data variability (Cohen, 1988). Allison, Silverstein, and Gorman (1996) cite several surveys of research literature (e.g., Kawano, 1993; Kosciulek & Szymanski, 1993; Rossi, 1990; Sedlmeier & Gigerenzer, 1989) affirming the underuse of power analysis and the inadequate power of many statistical tests, given available sample size and data variability.

For the sake of efficiency, in place of the recommended CIs, in this study we conducted power analyses for the seven statistical tests. The CI approach would have entailed 350 confidence intervals (50 data sets × 7 analytic techniques). However, for single-case research with only a few data sets, we recommend the superior confidence interval approach.

The Problem of Autocorrelation in Single-Case Data

Autocorrelation, long a challenge for single-case research analysis, violates the requirement that residual errors be uncorrelated (Fox, 1991). When this assumption is violated and single-case data residuals are positively autocorrelated, standard errors will be deceptively small, and results will be inflated, increasing Type I (false positive) errors (Crosbie, 1987; Scheffé, 1959; Suen & Ary, 1987a).


Even small autocorrelations of r = .20 to .30 can increase the Type I error rate by a factor of 2 to 3 (Scheffé). When residuals are negatively autocorrelated, standard errors will be overestimated and p values will be too conservative, increasing Type II errors, an outcome that is not desirable but is preferred over inflated Type I errors (Gorman & Allison, 1996; Ostrom, 1990).

Debate on the existence of autocorrelation in single-case data has produced disparate evidence, depending on study designs (including number and length of phases) and the type of data being analyzed (Busk & Marascuilo, 1988; Holtzman, 1963; Kazdin, 1984). The claim by Huitema (1985) and Kazdin that autocorrelation in time series data existed only in small amounts and had little impact on analytic results met stiff criticism. Huitema's tests were believed to lack adequate power of detection (Busk & Marascuilo; Matyas & Greenwood, 1996; Sharpley & Alavosius, 1988; Suen & Ary, 1987b, 1989).

Since autocorrelation is properly calculated on residuals, and with multiple statistical techniques available, Gorman and Allison (1996) logically ask: Residuals from which analyses? Autocorrelation tests are commonly conducted on the residuals of detrended data to safeguard the "acceptable" autocorrelation due solely to linear trend. Yet trend is only one component of the more sophisticated analytic techniques, and so each technique will likely possess different autocorrelation coefficients for a given data set. Center et al. (1985-1986) found that their analytic models reduced autocorrelation greatly (by over .40) compared to autocorrelation from a standard ANOVA analysis of the same data. Considering this variability, single-case researchers need guidelines on the likelihood of autocorrelation for different analytic techniques. Because we were unable to find studies examining autocorrelation differences by analytic technique, our present study addresses this issue.

Purpose. The purpose of this study is to provide useful information on seven popular or promising techniques for analyzing between-phase differences in single-case research designs. In particular, we compared their performance when applied to 50 strategically constructed data series. Analysis of these 50 data sets permitted us to answer four questions: (a) Given graphed data that by visual analysis show a positive intervention effect, what typical R² effect sizes result? (b) How do the seven analytic techniques covary with one another? (c) To what extent is each technique burdened with autocorrelated residuals? and (d) What statistical power do the analytic techniques possess for short data series? By answering these questions, we hope to provide single-case researchers with enough information to at least begin to judge whether and how the analytic techniques may be useful in supplementing visual analysis.

Method

Creating Test Data Sets

Two hundred fifty single-case AB (baseline and intervention phase) data sets were created to represent a range of types and degrees of intervention effects.


Each data set contained a total of 20 data points, 10 per phase, reflecting the upper end of what is commonly possible in real-world applications (Busk & Marascuilo, 1988). The graphs were created from 250 random number series, which were split after the 10th data point. Each phase was then separately transformed by four levels of four statistical characteristics: (a) variability, (b) trend, (c) mean level, and (d) trend line intercept level between data points 10 and 11. The four levels of these four characteristics were selected by trial-and-error field testing among the research team of four doctoral students and two professors, ensuring a useful range of graphs with well-distributed ratings of intervention effectiveness. The test set of 50 graphs for this study was then strategically selected from the full 250. We selected graphs that showed comparatively little trend in Phase A (to mirror published research) and that well represented a mix of values for the four statistical characteristics.
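As an illustration of this kind of construction (the authors' exact transformation levels are not reported here, and the fourth manipulation, the trend-line intercept at the phase boundary, is omitted for brevity), a minimal Python sketch with assumed function names and parameter values:

```python
import numpy as np

def make_ab_series(rng, sd_scale=1.0, trend=0.0, mean_shift=0.0):
    """Build one AB data set: a 20-point random series split after point 10,
    with Phase B rescaled in variability and shifted in trend and mean level.
    All transformation levels here are illustrative assumptions."""
    base = rng.normal(0.0, 1.0, 20)
    phase_a = base[:10]
    t_b = np.arange(10)
    phase_b = base[10:] * sd_scale + trend * t_b + mean_shift
    return phase_a, phase_b

rng = np.random.default_rng(42)
phase_a, phase_b = make_ab_series(rng, sd_scale=1.2, trend=0.2, mean_shift=1.5)
```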

Judging Intervention Effectiveness

Meaningfulness of effect sizes is derived from their relations to clinical judgment or to some other external client improvement criterion. We wished to identify among the 50 test graphs those that would be judged by most professionals as representing effective interventions. This was done by using 35 of the graphs (deemed the maximum number for sustained, concentrated attention) in a scenario, and soliciting judgments from 13 faculty and 32 doctoral students in the fields of counseling psychology, educational psychology, school psychology, and special education at a large southwestern university. All judges had some experience and/or training with graphic displays of single-case data. The 45 judges were individually presented the 35 graphs in random order; the data were attributed to a student with behavior disorders in a public school classroom. We attended to critiques of past graph judgment research (Parsonson & Baer, 1992) by ensuring that the task included: (a) contextualized data (a scenario complete with a data-capturing observation instrument), (b) variation of multiple data characteristics across graphs (variability, mean levels, trends, intercepts), (c) judgments of practical (not statistical) significance, and (d) judgments of amount or degree of effectiveness (not yes/no judgments). The 45 judges rated each graph on a 1-to-5 scale according to "how certain or convinced" they were that "the child's behavior improved due to the intervention" (from not at all to very convinced). Thus, we sought ratings of intervention effect, rather than simpler ratings of improvement alone.

Selecting Statistical Techniques

To identify popular and promising analytic techniques, we first perused older texts by Kazdin (1982) and by Barlow and Hersen (1984), and then more recent texts by Kratochwill and Levin (1992) and by Franklin et al. (1996). From this review we concluded that whereas certain older, simpler techniques continue to be recommended, most promising newer techniques follow either regression (Allison et al., 1996, and Center et al., 1985-1986, techniques) or randomization models (e.g., Edgington, 1987).


The randomization models were dropped from consideration in this study, as they require random assignment of subjects and/or phases across multiple-phase designs.

We included seven techniques in this study, two of which are variations of others: (a) BINOM: the binomial test on extended Phase A baseline (O. R. White & Haring, 1980); (b) LTD: Last Treatment Day (D. M. White et al., 1989); (c) GORSUCH: Gorsuch's trend effect size (Faith et al., 1996; Gorsuch, 1983); (d) CENTER-M: Center's mean-only difference (Center et al., 1985-1986); (e) CENTER-MT: Center's mean-plus-trend difference (Center et al.); (f) ALLISON-M: Allison's mean-only difference (Allison & Gorman, 1993; Faith et al.); and (g) ALLISON-MT: Allison's mean-plus-trend difference (Allison & Gorman; Faith et al.). We had originally intended to also include Crosbie's ITSACORR model (Crosbie, 1993, 1995); however, its results bore near-zero relationships with the other techniques, so we dropped this technique pending further research. The calculation of each statistic and its conversion to an R² effect size is found in the appendix.

Results

Graph Attributes

The 50 graphs were constructed to differ systematically in four attributes: (a) trend line slope differences between phases, (b) mean differences between phases, (c) differences in trend line intercepts at the point of intervention, and (d) data variability within phases. The first three features can be expressed as R² effect sizes, and the last is described by the coefficient of variation (COV), which reflects data variability independent of mean level. A summary of these four features for the 50 graphs is found in Table 1.

Table 1 shows that effect sizes for trend and for mean differences are similar. The interquartile range for trend differences is .44 to .71, compared to .39 to .64 for mean differences. Intervention intercept differences had to be calculated differently (as point estimates, using the standard error of prediction), so we were not surprised to find a very different range of .07 to .50.

TABLE 1
EFFECT SIZES FOR 50 AB DATA SETS: MEAN DIFFERENCES, SLOPE DIFFERENCES, INTERVENTION INTERCEPT DIFFERENCES, AND VARIABILITY

                                      10th %ile  25th %ile  50th %ile  75th %ile  90th %ile
Trend line slope diffs. (R²)            .258       .443       .552       .712       .786
Mean diffs. (R²)                        .320       .396       .587       .642       .750
Intervention intercept diffs. (R²)      .018       .069       .311       .502       .637
Data variability (COV): Phase A         .135       .171       .206       .253       .275
Data variability (COV): Phase B         .167       .236       .307       .360       .420


Note the greater COV range for Phase B than for Phase A, a result of our strategic selection of graphs with less Phase A variability.

Autocorrelation

We conducted autocorrelation tests on the residuals of all 50 data sets (i.e., on detrended data) using the NCSS (Hintze, 2000) ARIMA module. The median autocorrelation was r = .33, and most (80%) coefficients were between r = .06 and r = .60. Autocorrelation values at key percentile ranks help depict the distribution: 10th = .06, 25th = .18, 50th = .33, 75th = .51, 90th = .60. Although it has been argued that autocorrelation tests should be conducted separately for each phase, that is not standard practice, and the small n = 10 per phase discouraged us from doing so. The median autocorrelation of .33 in our 50 data sets was consistent with earlier reports (e.g., Busk & Marascuilo, 1988) and large enough to be problematic.
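For readers who want to reproduce this check without NCSS, a minimal sketch of the lag-1 autocorrelation of detrended residuals (in Python with NumPy; the function names are ours, and the NCSS ARIMA routine may compute the coefficient slightly differently):

```python
import numpy as np

def detrended_residuals(y):
    """Residuals after removing the overall linear trend from a series."""
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)
    return y - (intercept + slope * t)

def lag1_autocorrelation(e):
    """Standard lag-1 autocorrelation coefficient of a residual series."""
    e = np.asarray(e, dtype=float)
    e = e - e.mean()
    return np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)

# Example: one simulated 20-point AB series (10 baseline, 10 intervention points).
rng = np.random.default_rng(1)
series = np.concatenate([rng.normal(10, 2, 10), rng.normal(14, 2, 10)])
print(lag1_autocorrelation(detrended_residuals(series)))
```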

Judgment Reliability for Intervention Effectiveness

A representative subset of 35 of the 50 graphs was then selected to create the judgment task. For these 35 graphs, the 1-to-5 intervention effectiveness ratings of the 45 judges showed internal consistency (α) of .85. The average correlation of an individual rater with the whole group was r = .51, with a range for individual raters of r = .21 to .65. Although not impressive, these agreement levels are consistent with other graph judgment research (DeProspero & Cohen, 1979; Harbst, Ottenbacher, & Harris, 1991; Ottenbacher, 1990; Park, Marascuilo, & Gaylord-Ross, 1990).

Ratings of the 35 graphs were well distributed, as follows: Eight graphs (23%) had mean ratings of 1.0 to 2.9, denoting "not effective or minimally effective interventions." Fourteen graphs (40%) had mean ratings of 3.0 to 3.5, denoting "somewhat effective interventions." Thirteen graphs (37%) had mean ratings of 3.6 to 5.0, denoting "effective or very effective interventions." We next examined the effect sizes produced by applying our seven analytic techniques to these "effective intervention" graphs.

Typical Effect-Size Levels for Effective Intervention Graphs

To determine typical effect-size ranges for "effective intervention" AB graphs, each of the seven analytic techniques was applied to each of the 14 effective intervention graphs. The resulting percentile distributions are depicted as box plots in Figure 1 and in tabular form in Table 2.

Figure 1 compares seven box plots, regarded as superior graphic tools for describing score distributions (Tukey, 1977). Box plots are interpreted as follows: The top and bottom of each box mark the 75th and 25th percentile ranks, and the box is divided by the 50th percentile (median). The extremes of the top and bottom wands or "whiskers" mark the 90th and 10th percentiles. Outliers beyond the 90th and 10th percentiles are marked by individual dots.

Notable are the large effect-size differences among the seven analytic techniques, both in median values and in dispersion or spread of their distributions.


[Figure 1: box plots of R² effect size (.00 to 1.00) for each analytic technique: a. BINOMIAL, b. LTD, c. GORSUCH, d. CENTER-M, e. CENTER-MT, f. ALLISON-M, g. ALLISON-MT.]

FIG. 1. Effect-size distributions for seven analytic techniques applied to 14 effective intervention AB data sets.

Closer comparisons are possible by referring to the complementary percentile distribution in Table 2. The mean difference techniques of CENTER-M and GORSUCH clearly produced much lower effect sizes than the other techniques, with most values below R² = .15. Very different were the high effect-size values produced by ALLISON-MT and LTD, with most values for these techniques above R² = .85. Binomial scores showed very little variability across the data sets, whereas the CENTER-M, CENTER-MT, ALLISON-M, and ALLISON-MT techniques showed interquartile ranges (IQR) of around .20 points.

TABLE 2
EFFECT-SIZE DISTRIBUTIONS FROM SEVEN ANALYTIC TECHNIQUES APPLIED TO 14 EFFECTIVE INTERVENTION AB DATA SETS

              Percentile Values
              10th    25th    50th    75th    90th
BINOMIAL      .190    .297    .333    .333    .333
LTD           .722    .868    .903    .917    .945
GORSUCH       0.0     0.0     .028    .053    .059
CENTER-M      0.0     .037    .113    .214    .239
CENTER-MT     .221    .404    .545    .665    .710
ALLISON-M     .425    .588    .662    .772    .785
ALLISON-MT    .635    .738    .862    .906    .935


These results underscore the warning of Cohen and others that old effect-size guidelines will not be adequate for new analytic techniques (Cohen, 1988; Kirk, 1996; Maxwell et al., 1981; Mitchell & Hartmann, 1981; Rosnow & Rosenthal, 1989).

Statistical Power of Analytic Techniques

Using the PASS power analysis module of the NCSS (Hintze, 2000) software, we produced power analysis graphs for each statistical technique. A power analysis relates four measurement attributes: (a) the assigned significance level (Type I alpha, or p value), (b) the assigned power level (1 − β), (c) the minimum ("critical") effect size one wishes to be able to detect, and (d) sample size. We provisionally decided that each technique should have the power to reliably detect effect sizes at the average or median of our 13 "effective intervention" graphs. So the techniques needed the power to detect the following median effect sizes (ordered from small to large): GORSUCH .03, CENTER-M .11, BINOMIAL .33, CENTER-MT .55, ALLISON-M .66, ALLISON-MT .86, LTD .90. We set a liberal one-tailed alpha level of .10 and an 80% power level (β = .20), the latter value recommended by Cohen for most studies (Cohen, 1988, p. 56).

The justification for the relatively liberal one-tailed alpha level (which increases chances for false positives) is that counterintuitive results (deteriorating performance during treatment) are unlikely in an intensive, closely monitored treatment with a single subject. Were deterioration to occur, the treatment would likely be changed or terminated. Also, single-case statistical tests would be confirmed with visual analysis. Finally, the social consequences of a wrong decision are typically not major, more likely involving treatment adjustments than diagnosis or determinations of eligibility.

The power analysis results separated the seven techniques into three groups: those with very low power, moderately low power, and strong power. We will deal briefly with the first two groups, and in more detail with the last group. BINOM stands alone as the lowest power technique. With an N of 10 per phase, BINOM has adequate power to reliably detect only the most extreme results. The most extreme split of 10/0 earns an effect size of only .33, and BINOM requires 20 data points per phase to reliably detect a point-split ratio of 9/1. Thus, for shorter data series (often found in practice), BINOM has inadequate power, and so only descriptive usefulness. Next lowest in power is LTD, a point estimate technique, with power calculated from a Student's t statistic, the error term being the combined standard errors of estimate from Phases A and B. With the amount of "bounce" typical in our data sets, standard errors of estimate were very high. Even with a liberal one-tailed alpha = .10, LTD power was generally inadequate for samples such as our N = 20.

Power analysis of the five regression techniques was calculated in the NCSS PASS multiple regression module (Hintze, 2000), which was built according to Cohen's (1988) seminal Statistical Power Analysis for the Behavioral Sciences.
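That formulation can be reproduced directly from the noncentral F distribution; a minimal sketch (in Python with SciPy; the function name and example values are ours, and PASS's exact algorithm may differ in detail):

```python
from scipy import stats

def regression_power(r2, n, n_predictors, alpha=0.10):
    """Approximate power to detect a population R^2 in multiple regression,
    following Cohen's (1988) noncentral-F approach."""
    u = n_predictors            # numerator degrees of freedom
    v = n - n_predictors - 1    # denominator degrees of freedom
    f2 = r2 / (1 - r2)          # Cohen's f^2 effect size
    nc = f2 * (u + v + 1)       # noncentrality parameter
    f_crit = stats.f.ppf(1 - alpha, u, v)
    return 1 - stats.ncf.cdf(f_crit, u, v, nc)

# Example: power to detect R^2 = .30 with a single mean-shift predictor and N = 20.
print(regression_power(r2=0.30, n=20, n_predictors=1, alpha=0.10))
```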


[Figure 2: power curves of minimum detectable effect size (.20 to .70) plotted against number of measurements, N (8 to 38).]

FIG. 2. Power analysis curves for five analytic techniques applied to AB design data sets. The bottom line represents the power curve for the ALLISON-M and GORSUCH techniques. The middle power curve is for the ALLISON-MT technique. The top line is the power curve for the CENTER-MT technique.

The bottom line of Figure 2, representing ALLISON-M and GORSUCH, shows that for N = 20 observations, ALLISON-M reliably detects effect sizes as small as approximately R² = .15. Based on our data sets, we needed ALLISON-M to have the sensitivity to detect effect sizes around R² = .66, so ALLISON-M possesses more than adequate power. In fact, the power curve shows that even with only nine data points, ALLISON-M may have sufficient power to reliably detect effects of "effective intervention" graphs. The bottom line also represents GORSUCH, which needed to be able to detect approximately R² = .03. The line shows that even with an N of 38, GORSUCH does not have sufficient power. Therefore, we conclude that GORSUCH must be used only descriptively with data sets such as ours.

The second line from the bottom of Figure 2 represents the technique with the second greatest power: ALLISON-MT. ALLISON-MT, a model with two predictors, could detect R² effect sizes as small as .25 to .30 with short data series of 15 to 20 points. We needed ALLISON-MT to have the power to reliably detect only large effect sizes of R² = .86, so ALLISON-MT possesses ample power for our data sets.

The top line represents a regression model with two predictors and one partialed variable, CENTER-MT. (The PASS module requires inputting the variance accounted for by the partialed variable, for which we entered the mean value from our data sets.) CENTER-MT shows the ability to reliably detect effect sizes of R² = .35 to .40 for the same short data series. Because we needed only enough power to detect approximately R² = .55, CENTER-MT showed adequate power for our data sets.


TABLE 3
CORRELATION MATRIX OF SEVEN STATISTICAL TECHNIQUES BASED ON 50 AB DESIGN DATA SETS

             BINOMIAL    LTD     GORSUCH   CENTER-M   CENTER-MT   ALLISON-M
BINOMIAL       1.00
LTD             .658     1.00
GORSUCH         .467      .070    1.00
CENTER-M        .418     -.027     .963      1.00
CENTER-MT       .228      .129     .322       .266      1.00
ALLISON-M       .870      .717     .640       .573       .282       1.00
ALLISON-MT      .630      .599     .377       .327       .660        .720

Intercorrelation of Analytic Techniques

To answer the question of how the seven analytic techniques intercorrelated, a Pearson r matrix was created based on the full 50 data sets (see Table 3).

The matrix in Table 3 shows several moderately high correlations, with the following percentile distribution for the 21 unique coefficients: 10th = .08, 25th = .27, 50th = .47, 75th = .66, 90th = .84. Scanning the matrix shows that the most remotely connected technique is CENTER-MT, bearing no greater than a .32 relation with the other variables, except ALLISON-MT. The most closely related cluster is composed of BINOMIAL, LTD, ALLISON-M, and ALLISON-MT, with no intercorrelation below .59 and average intercorrelations of a sizeable .70. Two pairs deserve special note for their strength of relationship, being above r = .85: the first is CENTER-M with GORSUCH, and the second is BINOMIAL with ALLISON-M.

Autocorrelation in Analytic Techniques

The present autocorrelation question focuses on the residuals produced by each analytic technique. We calculated autocorrelation for only the five regression models, omitting the BINOMIAL and LTD techniques. We tested residuals for lag-1 autocorrelation in the NCSS ARIMA module (Hintze, 2000), a total of 250 analyses (5 techniques × 50 data sets). Resulting autocorrelations are presented as percentile distributions in Table 4.

Autocorrelation clearly varied by technique. CENTER-MT and ALLISON-MT, the two techniques that include trend plus mean difference components, were relatively free of autocorrelation. GORSUCH and ALLISON-M showed low-moderate (but potentially influential) amounts of autocorrelation. ALLISON-M showed the highest amount of autocorrelation, with 50% of graphs falling between r = .18 and .48. The autocorrelation shown by GORSUCH was nearly identical to that for the detrended data, because the GORSUCH technique is simply a mean difference test on detrended data.


TABLE 4
DISTRIBUTIONS OF RESIDUAL AUTOCORRELATIONS IN FIVE ANALYTIC TECHNIQUES

              Percentiles
              10th     25th     50th     75th     90th
GORSUCH      -.066     .181     .329     .491     .595
CENTER-M     -.096     .089     .298     .388     .556
CENTER-MT    -.468    -.262    -.090     .078     .172
ALLISON-M     .035     .179     .364     .476     .574
ALLISON-MT   -.468    -.261    -.059     .088     .172

Discussion

This article sought to provide useful information for single-case researchers on the measurement attributes of seven analytic techniques for single-case data. We hoped to provide information to help researchers decide "whether, how, and when" to use these techniques to augment visual analysis of graphed data. In particular, we aimed to provide information on: (a) the statistical power of each technique, (b) the typical R² effect size of each technique, (c) the intercorrelation of the seven techniques, and (d) the tendency of each to be burdened with autocorrelation. Although our 50 sample data sets (35 visually analyzed) were only two-phase (AB) in design, each of these phases was carefully constructed to reflect ranges of four attributes that indicate shift in performance between phases.

The first question, answered through power analysis, concerned the reliability of effect sizes produced by each analytic technique. Power can be expressed as the ability to reliably detect an effect size at a given, meaningful level. Each technique had to be able to reliably detect effects at the medians for the 13 "effective intervention" graphs. These effect-size target levels varied considerably, from R² = .03 for GORSUCH to R² = .90 for LTD. The power analysis results were surprising; despite cautions against analyzing short data sets, four of the seven techniques (ALLISON-M, ALLISON-MT, CENTER-M, and CENTER-MT) showed adequate power to reliably detect effect sizes at their target levels. Two very different statistical models, BINOMIAL and LTD, lacked sufficient power, and GORSUCH also failed because it could not detect the very small target effect sizes in our graphs (median R² = .03).

Despite its attractiveness as an easily hand-calculated technique (with reference to a binomial table), BINOMIAL lacked discrimination sensitivity for our data sets. For example, the 25th to 75th percentile BINOMIAL effect sizes were identical. The problem with BINOMIAL was more severe than that with LTD. Whereas LTD differentiated among our data sets, though with unreliable effect sizes, BINOMIAL could neither differentiate nor yield reliable effect sizes.

In answer to our second question about typical effect sizes, results varied greatly by analytic technique, underscoring the danger of a single interpretational guideline for effect sizes, as stressed by Cohen (1988) and others.


Typical effect sizes (interquartile ranges) ranged from a small R² = .01 to .05 for GORSUCH to a large R² = .87 to .92 for LTD. Some of these differences can be easily explained. For example, the two MT (mean plus trend) models should produce larger effect sizes than the M-only models, as the inclusion of the trend variable in MT models typically improves prediction. The relatively high effect sizes produced by both of the ALLISON models also can be explained. ALLISON models remove (semipartial) trend prior to the main model test, so the test is actually on modified data (with reduced error variance), rather than the original data. The very high LTD effect sizes result mainly from the large spread of individual predicted scores in the distant future, at the end of the intervention phase. These large effect sizes must be interpreted in light of the very large standard errors of prediction for these future data points. The large LTD effect sizes also exemplify the danger in directly comparing effect sizes from very different statistical models, regression versus point prediction. Therefore, we recommend developing interpretational criteria only within the group of five linear regression models.

The third question we asked was about the interrelationships among the seven analytic techniques. Most intercorrelations were of low-moderate size (median r = .47), with four techniques (BINOMIAL, LTD, ALLISON-M, and ALLISON-MT) most closely related at an average r = .70, whereas the most isolated technique was CENTER-MT. We were surprised that BINOMIAL was closely related to three other techniques, and moderately related to two others, considering that its computational formula is so different from the others. This attests to the fact that, despite its statistical limitations, BINOMIAL effectively reflects the same important attributes as the analytically more sophisticated techniques. We were also surprised by the isolation of CENTER-MT (except from ALLISON-MT), even from its partner technique, CENTER-M. CENTER-MT is the only technique that includes trend differences between phases after eliminating overall data trend from the analysis. It is beyond the scope of this article to critique the conceptual approach of CENTER-MT. Here we only point out that using CENTER-MT may produce quite different findings from those obtained with most other techniques.

Noteworthy were the very close relationships of CENTER-M with GORSUCH (r = .96) and of BINOMIAL with ALLISON-M (r = .87). CENTER-M and GORSUCH are both mean difference techniques that differ only in whether the full data trend is semipartialed from Y (GORSUCH) or fully partialed from both sides of the equation (CENTER-M). Since at this point we cannot argue for the superiority of semipartialing or fully partialing trend, it is gratifying that both methods tend to agree. The close relationship between BINOMIAL and ALLISON-M undoubtedly results from their conceptual similarity, despite their differences in analytic formula. Both of these techniques identify quasi-mean differences in Phase B after controlling for only Phase A trend.


The fourth research question was about the residual autocorrelation typically produced by each analytic technique. Based on the amount of autocorrelation evident, the five techniques tested fell roughly into two groups. Very little autocorrelation (median near zero) was shown by CENTER-MT and ALLISON-MT. The other three techniques had positive median autocorrelation values of r = .30 to .36, which could dangerously inflate Type I error when inferring to a population. The autocorrelation differences among these five techniques can be explained in part. Any technique with trend as a predictor (as in ALLISON-MT and CENTER-MT) largely eliminates from the residuals the autocorrelation due to overall trend. Inclusion of mean differences (the M in MT), by interacting with T (trend), also effectively removes much autocorrelation, at least within the two separate, phase-specific trends. In general, we found autocorrelation differences to be large enough to indicate the need for technique-specific guidelines to help single-case researchers anticipate and reduce autocorrelation.

The findings from our four research questions are summarized in Table 5, and help us organize tentative guidelines for use of these seven analytic approaches.

The BINOMIAL technique was attractive in its close intercorrelations with other techniques and in the "reasonable" effect sizes it produced. However, the very low power of its binomial test makes the technique inappropriate for statistical inferences with short data sets. BINOMIAL seems useful mainly as a descriptive technique for data sets with longer, stable baselines, and for interventions with large effects. The LTD technique also covaried well with other techniques, and similarly lacked power for making statistical inferences. Furthermore, its very large effect sizes (due to long-range future predictions) could not be interpreted at face value. LTD showed the same weaknesses as BINOMIAL, and no noted advantages. Given the computational simplicity of BINOMIAL, we preferred that technique for descriptive work.

Of the mean-shift regression techniques, GORSUCH lacked power for our data sets, showed moderate autocorrelation, and did not covary well with most other analytic techniques. These disadvantages, plus the difficulty in communicating the very small effect sizes, placed GORSUCH in our disfavor. A more attractive mean-shift technique was CENTER-M, which had adequate power for our data sets, only low-moderate autocorrelation, and smallish effect sizes similar to those from group studies.

TABLE 5
SUMMARY OF PERFORMANCE BY SEVEN ANALYTIC TECHNIQUES

Technique     Power      Autocorrelation   Effect Sizes       Intercorrelations
BINOMIAL      Very low   n/a               Moderate           Clustered
LTD           Very low   n/a               Very large         Clustered
GORSUCH       Low        Moderate          Very small         Paired with CENTER-M
CENTER-M      Good       Low-moderate      Small              Paired with GORSUCH
CENTER-MT     Good       Almost none       Moderately large   Isolated
ALLISON-M     Strong     Moderate          Large              Clustered
ALLISON-MT    Strong     Almost none       Very large         Clustered


However, CENTER-M intercorrelated well only with GORSUCH. The third mean-shift technique, ALLISON-M, was a strong performer in most respects. Its only disadvantages were moderate autocorrelation and a tendency to produce larger effect sizes than are common in group research. In cases of larger autocorrelation and/or shorter, unstable baselines, CENTER-M may be the preferred mean-shift technique.

CENTER-MT and ALLISON-MT, the two competing mean-plus-trend techniques, both showed good power and little autocorrelation, and both tended to produce large effect sizes (very large for ALLISON-MT). However, the two techniques were not equally attractive, because CENTER-MT was an outlier, relating weakly with all other techniques, even the other CENTER technique. The major caution with both ALLISON techniques is that they require a longer, more stable baseline than do the CENTER techniques. Applying ALLISON techniques to data with an unstable baseline (slopes with large standard errors) will produce an R² whose unreliability is undetectable, because the error was buried in a preliminary stage of the data analysis.

Despite this study's limitations in scope, its results permit us to conclude that new guidelines for interpreting statistical results in single-case research are essential. Existing guidelines, such as Cohen's effect-size markers (Cohen, 1988), appear not to apply to single-case data. We can also conclude that guidelines will need to be tailored to each analytic technique or family of techniques. The seven analytic techniques varied greatly in effect sizes, autocorrelation, and statistical power, even by factors of 5 and 10. Although the R² effect size has been termed a universal index of the magnitude of relationship, R² values produced by different computational formulae appear to require very different interpretations.

There was a moderate amount of agreement among most of these measures of intervention effectiveness, despite very different computational models. Autocorrelation appears to be a significant concern only with the simpler mean-difference models. The Allison and Center models that include both mean and trend differences showed remarkably little autocorrelation.

Inadequate power was a concern with the two nonregression models and with the GORSUCH regression model. However, the remaining four techniques appeared to possess sufficient power for analysis with our short data sets of 20 observations or fewer.

This study comes with notable limitations. First, data sets with more than two phases and with phases of different lengths were not included. Because we examined seven different analytic techniques, we needed to construct sample data sets to represent the several combinations of data variability, mean differences, trend differences, and intercept differences. Including number and length of phases as additional variables would have made the number of graphs unmanageable. In addition, our judges showed good focused attention with no more than 35 graphs.

A second, related limitation is the danger of generalizing from our manufactured data sets to most published research. Our data sets are more representative of four key attributes of change than they are similar to graphs typically published.


Within the constraints of our four attributes, two phases, and 20 data points, we did attempt to ensure that the graphs represented convincing evidence of effective interventions. In this way, we tried to ensure that our data were meaningful, though not necessarily representative.

Extending this study to graphs with three or more phases deserves discussion. For the BINOMIAL (O. R. White & Haring, 1980) and LTD (D. M. White et al., 1989) techniques, extending to more phases would require resolving how to handle multiple pair-wise phase comparisons (Hershberger, Wallace, Green, & Marquis, 1999). For the remaining five regression techniques, extending to multiple phases is easily accomplished, but with major interpretational problems. First, such an omnibus analysis would need to differentiate among different experimenter intents or expectations for the third phase (e.g., a return-to-baseline versus a maintenance phase). Second, regression results are optimized when the score spread for the three phases is maximized, that is, when the third-phase scores shoot up or down beyond the extremes of the other two phases, which is not a common experimenter expectation. Third, a multiphase omnibus analysis would need to handle individual phases that reflect undesired and counterintuitive results, as in the case of a counterproductive intervention. To date, this obstacle has not been fully overcome by any analytic technique.

Appendix

Calculation of Seven Statistical Tests for Single-Case Data

BINOM: Binomial Test on Extended Phase A Baseline (O. R. White & Haring, 1980)

For nearly three decades, clinicians have hand-fit median-based slopes to small data sets (Kazdin, 1982). The hand-fit slope evenly splits the Phase A data (50% above and 50% below the line). After fitting the median line to Phase A, that line can be extended into Phase B. A binomial test on the Phase B data (Darlington & Carlson, 1987) indicates whether the Phase B data also have been split evenly by the extended baseline, which would be expected in the case of no intervention effect. An extreme Phase B split (e.g., 9 data points above the line and only 1 below) indicates an intervention effect. An advantage of White's technique is that it may be sensitive to multiple types of between-phase difference: in mean, slope, and intervention intercept. A known disadvantage is that the binomial test has low sensitivity with short series (e.g., fewer than 7 to 9 data points in Phase B). The test of two proportions can be conducted exactly using the Fisher Exact Test, or approximately from a standard normal table (Darlington & Carlson):

Z = \frac{p_1 - p_2}{\sqrt{\dfrac{P(1 - P)}{n_1} + \dfrac{P(1 - P)}{n_2}}}


where p_1 is always .50, p_2 is the proportion of Phase B data above the extended baseline, and P is the pooled proportion. We used the less exact Z score estimate (although it tends to inflate results) for the ease of converting Z to an R² effect size, through the formula R² = Z²/N (Rosenthal, 1991). If client performance deteriorated during an intervention, the R² was set to zero.
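A minimal computational sketch of this procedure (in Python with NumPy; the split-middle fit shown is one common median-based hand-fit, and the helper names and the assumption that higher scores mean improvement are ours):

```python
import numpy as np

def split_middle_line(y_a):
    """Median-based (split-middle) line hand-fit to the Phase A data."""
    x = np.arange(len(y_a))
    half = len(y_a) // 2
    x1, y1 = np.median(x[:half]), np.median(y_a[:half])   # first-half medians
    x2, y2 = np.median(x[half:]), np.median(y_a[half:])   # second-half medians
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

def binom_effect_size(y_a, y_b):
    """Two-proportion Z test on Phase B points above the extended Phase A line,
    converted to R^2 = Z^2 / N."""
    slope, intercept = split_middle_line(y_a)
    x_b = np.arange(len(y_a), len(y_a) + len(y_b))
    above = y_b > intercept + slope * x_b
    n1, n2 = len(y_a), len(y_b)
    p1, p2 = 0.50, above.mean()
    if p2 <= p1:                        # performance at or below baseline: no effect
        return 0.0
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)
    z = (p2 - p1) / se
    return z ** 2 / (n1 + n2)

rng = np.random.default_rng(0)
print(binom_effect_size(rng.normal(10, 2, 10), rng.normal(14, 2, 10)))
```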

LTD: Last Treatment Day (D. M. White et al., 1989)

Like the BINOM test, White et al.'s LTD technique may reflect multiple types of differences between phases. The LTD technique compares data points at the end of the treatment phase predicted from two different regression lines: one data point predicted by an extended Phase A regression line, and the other by the Phase B regression line. These two predicted values are compared using Cohen's d formula, with a standard error of prediction as the error term (Nunnally, 1978):

d = \frac{LTD_B - LTD_A}{\sqrt{s^2_{pooled}\,(1 - r^2)}}

This d was then converted to an R² effect size using R² = d²/(d² + 4). Where the Phase A predicted score exceeded the Phase B predicted score, the R² was set to zero.
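A minimal sketch of the LTD comparison (in Python with NumPy; the exact error term in the original is Nunnally's standard error of prediction, so the pooled variance deflated by the pooled squared correlation with time used below is our assumption, as are the function names):

```python
import numpy as np

def ltd_effect_size(y_a, y_b):
    """Compare last-treatment-day predictions from the Phase A and Phase B regression lines."""
    n_a, n_b = len(y_a), len(y_b)
    x_a = np.arange(n_a)
    x_b = np.arange(n_a, n_a + n_b)
    last_day = x_b[-1]

    slope_a, int_a = np.polyfit(x_a, y_a, 1)   # Phase A line extended to the last day
    slope_b, int_b = np.polyfit(x_b, y_b, 1)   # Phase B line evaluated at the last day
    ltd_a = int_a + slope_a * last_day
    ltd_b = int_b + slope_b * last_day
    if ltd_b <= ltd_a:                         # Phase A prediction exceeds Phase B: zero
        return 0.0

    # Assumed error term: pooled variance deflated by the pooled r^2 of scores with time.
    r2_a = np.corrcoef(x_a, y_a)[0, 1] ** 2
    r2_b = np.corrcoef(x_b, y_b)[0, 1] ** 2
    var_pool = ((n_a - 1) * np.var(y_a, ddof=1) + (n_b - 1) * np.var(y_b, ddof=1)) / (n_a + n_b - 2)
    r2_pool = ((n_a - 1) * r2_a + (n_b - 1) * r2_b) / (n_a + n_b - 2)
    d = (ltd_b - ltd_a) / np.sqrt(var_pool * (1 - r2_pool))
    return d ** 2 / (d ** 2 + 4)
```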

GORSUCH: Gorsuch's Trend Analysis Effect Size (Faith et al., 1996; Gorsuch, 1983)

GORSUCH is a test of mean differences between phases while controlling for overall data trend. Gorsuch's technique is conceptually analogous to ANCOVA, in which the overall trend is the covariate (Maxwell & Delaney, 1990), and is considered a technical advance over simple mean or slope tests. The overall data trend may obscure interpretation of phase differences due solely to treatment effect, so it is eliminated prior to analysis. The final analysis is not conducted on the original scores, but rather on "detrended" scores, that is, the residuals from regressing client scores on the time variable. After detrending, the residuals are simply regressed on a dummy-coded (0, 1) phase variable to yield the final R². This procedure constitutes a semipartialing of trend from the Y side only of the regression equation. Where the Phase A mean exceeded the Phase B mean, the R² was set to zero in this study.
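Because the procedure is just two regression steps, it is easy to sketch (in Python with NumPy; the function name and the assumption that higher scores mean improvement are ours):

```python
import numpy as np

def gorsuch_effect_size(y_a, y_b):
    """Mean-difference R^2 on detrended scores (overall trend semipartialed from Y only)."""
    y = np.concatenate([y_a, y_b])
    t = np.arange(len(y))
    phase = np.concatenate([np.zeros(len(y_a)), np.ones(len(y_b))])   # dummy-coded phase

    slope, intercept = np.polyfit(t, y, 1)            # step 1: remove overall linear trend from Y
    detrended = y - (intercept + slope * t)

    if detrended[phase == 1].mean() <= detrended[phase == 0].mean():  # wrong direction: zero
        return 0.0
    return np.corrcoef(phase, detrended)[0, 1] ** 2   # step 2: R^2 from regressing residuals on phase
```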

CENTER-M and CENTER-MT: Mean-Only and Mean-Plus-Trend Models (Center et al., 1985-1986)

These piecewise regression techniques were promoted for single-case research by Center et al. and by Berry and Lewis-Beck (1986), with alternative analytic formulae provided by Kromrey and Foster-Johnson (1996). CENTER-M and CENTER-MT both test for between-phase differences while controlling for overall data trend, but they do not follow the GORSUCH procedure of controlling overall data trend by detrending (i.e., by semipartialing it out of only the dependent scores).


Instead, both CENTER-M and CENTER-MT fully partial data trend from both sides of the prediction equation (i.e., from Y and all X variables). The simplest method of calculation is to use the standard partial correlation model-differencing formula (Cohen, 1988; Kromrey & Foster-Johnson), which for CENTER-M is

R^2_{CENTER\text{-}M} = \frac{R^2_{Y \cdot T,M} - R^2_{Y \cdot T}}{1 - R^2_{Y \cdot T}}

and for CENTER-MT is

R^2_{CENTER\text{-}MT} = \frac{R^2_{Y \cdot T,M,T_B} - R^2_{Y \cdot T}}{1 - R^2_{Y \cdot T}}

The subscript letters are Y, the client response variable; T, a time or linear trend variable; M, a dummy-coded (0, 1) phase vector variable that estimates mean level shifts; and T_B, a variable with trend values for Phase B only (remaining cells are blank). When visual analysis shows trend differences in the wrong direction (stronger Phase A trend), the trend effect is eliminated from the model, reverting to a mean-only effect. When a mean difference is found to be in the wrong direction, the effect size is set to zero.
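A minimal sketch of the model-differencing computation (in Python with NumPy; the helper and the coding of T_B as 1, 2, ... within Phase B and 0 elsewhere are our assumptions):

```python
import numpy as np

def r2(y, *predictors):
    """R^2 from an OLS regression of y on the given predictor columns plus an intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ beta).var() / y.var()

def center_effect_sizes(y_a, y_b):
    """CENTER-M and CENTER-MT via the partial-correlation model-differencing formula."""
    y = np.concatenate([y_a, y_b])
    n_a = len(y_a)
    t = np.arange(len(y), dtype=float)          # overall time/trend variable T
    m = (t >= n_a).astype(float)                # dummy-coded phase vector M
    tb = np.where(m == 1, t - n_a + 1, 0.0)     # Phase B-only trend values T_B

    r2_t = r2(y, t)                             # reduced (trend-only) model
    center_m = (r2(y, t, m) - r2_t) / (1 - r2_t)
    center_mt = (r2(y, t, m, tb) - r2_t) / (1 - r2_t)
    return center_m, center_mt
```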

ALLISON-M and ALLISON-MT: Allison et al.'s Mean-Only and Mean-Plus-Trend Models (Allison & Gorman, 1993; Faith et al., 1996)

The ALLISON-M and ALLISON-MT techniques are parallel to the two CENTER techniques and are similarly regression-based. However, the ALLISON techniques are intended to be improvements over the CENTER and GORSUCH techniques by controlling for Phase A trend only, rather than the trend of the full data series. The ALLISON authors argue that the GORSUCH and CENTER approaches tend to overcorrect by removing Phase B trend, which is plausibly due to the treatment (Allison & Gorman). ALLISON techniques are similar to GORSUCH (and different from CENTER) in that trend is semipartialed from Y only (client scores), rather than fully partialed from the X and Y variables.

Both ALLISON-M and ALLISON-MT require preliminary detrending steps: (a) create a temporary variable containing the scores for Phase A only, (b) regress this new "A scores" variable on trend, (c) save the predicted output, and (d) subtract these predicted values from the original scores. The resulting difference or residual scores are used instead of the original scores in the final regression formula, R²_{detY·M} for ALLISON-M and R²_{detY·M,TB} for ALLISON-MT (where detY is the detrended Y scores, M is a dummy-coded phase mean-shift vector variable, T is a time or trend variable, and T_B is a variable containing trend scores for Phase B only). In this study, we chose not to follow a recent, relatively untested recommendation to use adjusted R² rather than R² to compensate for the number of predictors (Faith et al., 1996). As was done with CENTER-MT, trend effects in the wrong direction were dropped from the model, reverting to a mean-only effect.
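A minimal sketch of the ALLISON detrending and final regression (in Python with NumPy; reading the ALLISON-MT predictor set as the phase dummy plus the Phase B-only trend term, and the T_B coding, are our assumptions):

```python
import numpy as np

def r2(y, *predictors):
    """R^2 from an OLS regression of y on the given predictor columns plus an intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ beta).var() / y.var()

def allison_effect_sizes(y_a, y_b):
    """ALLISON-M and ALLISON-MT: detrend by the Phase A trend only, then regress."""
    y = np.concatenate([y_a, y_b])
    n_a = len(y_a)
    t = np.arange(len(y), dtype=float)

    # Steps (a)-(d): fit a trend to the Phase A scores, predict for all time points,
    # and subtract the predictions from the original scores.
    slope, intercept = np.polyfit(t[:n_a], y_a, 1)
    det_y = y - (intercept + slope * t)

    m = (t >= n_a).astype(float)               # phase mean-shift dummy M
    tb = np.where(m == 1, t - n_a + 1, 0.0)    # Phase B-only trend term T_B
    return r2(det_y, m), r2(det_y, m, tb)      # ALLISON-M, ALLISON-MT
```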



References

Allison, D. B., & Gorman, B. S. (1993). Calculating effect sizes for meta-analysis: The case of the single case. Behaviour Research and Therapy, 31, 621-631.

Allison, D. B., Silverstein, J. M., & Gorman, B. S. (1996). Power, sample size estimation, and early stopping rules. In R. D. Franklin, D. B. Allison, & B. S. Gorman (Eds.), Design and analysis of single-case research. Mahwah, NJ: Lawrence Erlbaum.

Baer, D. M. (1977). Perhaps it would be better not to know everything. Journal of Applied Behavior Analysis, 10, 167-172.

Barlow, D. H., & Hersen, M. (Eds.). (1984). Single case experimental designs: Strategies for studying behavior change (2nd ed.). Oxford, England: Pergamon Press.

Berry, W. D., & Lewis-Beck, M. S. (1986). Interrupted time series. In W. D. Berry & M. S. Lewis-Beck (Eds.), New tools for social scientists: Advances and applications in research methods. Beverly Hills, CA: Sage.

Busk, P. L., & Marascuilo, L. A. (1988). Autocorrelation in single-subject research: A counterargument to the myth of no autocorrelation. Behavioral Assessment, 10, 229-242.

Busk, P. L., & Marascuilo, L. A. (1992). Statistical analysis in single-case research: Issues, procedures, and recommendations, with applications to multiple behaviors. In T. R. Kratochwill & J. R. Levin (Eds.), Single-case research design and analysis: New directions for psychology and education (pp. 159-185). Hillsdale, NJ: Lawrence Erlbaum.

Busk, P. L., & Serlin, R. C. (1992). Meta-analysis for single-case research. In T. R. Kratochwill & J. R. Levin (Eds.), Single-case research design and analysis: New directions for psychology and education (pp. 187-212). Hillsdale, NJ: Lawrence Erlbaum.

Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.

Center, B. A., Skiba, R. J., & Casey, A. (1985-1986). A methodology for the quantitative synthesis of intra-subject design research. Journal of Special Education, 19, 387-400.

Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426-443.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.

Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Crosbie, J. (1987). The inability of the binomial test to control Type I error with single-subject data. Behavioral Assessment, 9, 141-150.

Crosbie, J. (1993). Interrupted time-series analysis with brief single-subject data. Journal of Consulting and Clinical Psychology, 61, 966-974.

Crosbie, J. (1995). Interrupted time-series analysis with short series: Why it is problematic; how it can be improved. In J. M. Gottman (Ed.), The analysis of change (pp. 361-395). Mahwah, NJ: Lawrence Erlbaum.

Dar, R., Serlin, R. C., & Omer, H. (1994). Misuse of statistical tests in three decades of psychotherapy research. Journal of Consulting and Clinical Psychology, 62, 75-82.

Darlington, R. B., & Carlson, P. M. (1987). Behavioral statistics: Logic and methods. New York: Free Press.

DeProspero, A., & Cohen, S. (1979). Inconsistent visual analyses of intrasubject data. Journal of Applied Behavior Analysis, 12, 573-579.

Edgington, E. S. (1987). Randomizing single subject experiments and statistical tests. Journal of Counseling Psychology, 34, 437-442.

Faith, M. S., Allison, D. B., & Gorman, B. S. (1996). Meta-analysis of single-case research. In R. D. Franklin, D. B. Allison, & B. S. Gorman (Eds.), Design and analysis of single-case research (pp. 245-277). Mahwah, NJ: Lawrence Erlbaum.


Fowler, R. L. (1985). Point estimates and confidence intervals in measures of association. Psychological Bulletin, 98, 160-165.

Fox, J. (1991). Regression diagnostics. Newbury Park, CA: Sage.

Franklin, R. D., Allison, D. B., & Gorman, B. S. (Eds.). (1996). Design and analysis of single-case research. Mahwah, NJ: Lawrence Erlbaum.

Franklin, R. D., Gorman, B. S., Beasley, T. M., & Allison, D. B. (1996). Graphical display and visual analysis. In R. D. Franklin, D. B. Allison, & B. S. Gorman (Eds.), Design and analysis of single-case research (pp. 119-158). Mahwah, NJ: Lawrence Erlbaum.

Gorman, B. S., & Allison, D. B. (1996). Statistical alternatives for single-case designs. In R. D. Franklin, D. B. Allison, & B. S. Gorman (Eds.), Design and analysis of single-case research (pp. 159-214). Mahwah, NJ: Lawrence Erlbaum.

Gorsuch, R. L. (1983). Three methods for analyzing time-series (N of 1) data. Behavioral Assessment, 5, 141-154.

Harbst, K. B., Ottenbacher, K. J., & Harris, S. R. (1991). Interrater reliability of therapists' judgments of graphed data. Physical Therapy, 71, 107-115.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.

Hershberger, S. L., Wallace, D. D., Green, S. B., & Marquis, J. G. (1999). Meta-analysis of single-case designs. In R. H. Hoyle (Ed.), Statistical strategies for small sample research. Thousand Oaks, CA: Sage.

Hintze, J. (2000). NCSS 2000 [Computer software]. Kaysville, UT: NCSS Statistical Software.

Holtzman, W. H. (1963). Statistical methods for the study of change in the single case. In C. W. Harris (Ed.), Problems in measuring change (pp. 199-211). Madison: University of Wisconsin Press.

Huitema, B. E. (1985). Autocorrelation in applied behavior analysis: A myth. Behavioral Assessment, 7, 107-118.

Huitema, B. E. (1986). Autocorrelation in behavioral research: Wherefore art thou? In A. Poling & R. W. Fuqua (Eds.), Research methods in applied behavior analysis: Issues and advances. New York: Plenum.

Jacobson, N. S., Follette, W. C., & Revenstorf, D. (1984). Psychotherapy outcome research: Methods for reporting variability and evaluating clinical significance. Behavior Therapy, 15, 336-352.

Kawano, T. (1993). School psychology journals: Relationships with related journal and external and internal quality indices. Journal of School Psychology, 31, 407-424.

Kazdin, A. E. (1982). Single-case research designs: Methods for clinical and applied settings. New York: Oxford University Press.

Kazdin, A. E. (1984). Statistical analysis for single-case experimental designs. In D. H. Barlow & M. Hersen (Eds.), Single case experimental designs (2nd ed., pp. 285-324). New York: Pergamon.

Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational & Psychological Measurement, 56, 746-759.

Kosciulek, J. F., & Szymanski, E. M. (1993). Statistical power analysis of rehabilitation counseling research. Rehabilitation Counseling Bulletin, 36, 212-219.

Kratochwill, T. R., & Brody, G. H. (1978). Single subject designs: A perspective on the controversy over employing statistical inference and implications for research and training in behavior modification. Behavior Modification, 2, 291-307.

Kratochwill, T. R., & Levin, J. R. (Eds.). (1992). Single-case research design and analysis: New directions for psychology and education. Hillsdale, NJ: Lawrence Erlbaum Associates.

Kromrey, J. D., & Foster-Johnson, L. (1996). Determining the efficacy of intervention: The use of effect sizes for data analysis in single-subject research. The Journal of Experimental Education, 65, 73-93.


Kupfersmid, J. (1988). Improving what is published: A model in search of an editor. American Psychologist, 43, 635-642.

Matyas, T. A., & Greenwood, K. M. (1996). Serial dependency in single-case time series. In R. D. Franklin, D. B. Allison, & B. S. Gorman (Eds.), Design and analysis of single-case research (pp. 215-243). Mahwah, NJ: Lawrence Erlbaum.

Maxwell, S. E., Camp, C. J., & Arvey, R. D. (1981). Measures of strength of association: A comparative examination. Journal of Applied Psychology, 66, 525-534.

Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data: A model comparison perspective. Belmont, CA: Wadsworth.

Michael, J. L. (1974). Statistical inference for individual organism research: Mixed blessing or curse? Journal of Applied Behavior Analysis, 7, 647-653.

Mitchell, C., & Hartmann, D. P. (1981). A cautionary note on the use of Omega squared to evaluate the effectiveness of behavioral treatments. Behavioral Assessment, 3, 93-100.

Nourbakhsh, M. R., & Ottenbacher, K. J. (1994). The statistical analysis of single-subject data: A comparative examination. Physical Therapy, 74, 80-88.

Nunnally, J. C., Jr. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

Ostrom, C. W., Jr. (1990). Time series analysis: Regression techniques (2nd ed.). Beverly Hills, CA: Sage.

Ottenbacher, K. J. (1990). Visual inspection of single-subject data: An empirical analysis. Mental Retardation, 28, 283-290.

Park, H., Marascuilo, L., & Gaylord-Ross, R. (1990). Visual inspection and statistical analysis of single-case designs. Journal of Experimental Education, 58, 311-320.

Parsonson, B. S., & Baer, D. M. (1978). The analysis and presentation of graphic data. In T. R. Kratochwill (Ed.), Single-subject research: Strategies for evaluating change. New York: Academic Press.

Parsonson, B. S., & Baer, D. M. (1986). The graphic analysis of data. In A. Poling & R. W. Fuqua (Eds.), Research methods in applied behavior analysis: Issues and advances (pp. 157-186). New York: Plenum.

Parsonson, B. S., & Baer, D. M. (1992). The visual analysis of data, and current research into the stimuli controlling it. In T. R. Kratochwill & J. R. Levin (Eds.), Single-case research design and analysis (pp. 15-40). Hillsdale, NJ: Lawrence Erlbaum Associates.

Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage.

Rosenthal, R. (1991). Meta-analytic procedures for social research (Rev. ed., Vol. 6). Newbury Park, CA: Sage.

Rosnow, R., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276-1284.

Rossi, J. S. (1990). Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology, 58, 646-656.

Scheffé, H. (1959). The analysis of variance. New York: Wiley.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.

Sharpley, C. F., & Alavosius, M. P. (1988). Autocorrelation in behavioral data: An alternative perspective. Behavioral Assessment, 10, 243-251.

Shaver, J. P. (1991). Quantitative reviewing of research. In J. P. Shaver (Ed.), Handbook of research on social studies teaching and learning (pp. 83-97). New York: Macmillan.

Suen, H. K., & Ary, D. (1987a). Application of statistical power in assessing autocorrelation. Behavioral Assessment, 9, 125-130.

Suen, H. K., & Ary, D. (1987b). Autocorrelation in applied behavior analysis: Myth or reality? Behavioral Assessment, 9, 125-130.

Suen, H. K., & Ary, D. (1989). Analyzing quantitative behavioral observation data. Hillsdale, NJ: Lawrence Erlbaum.


Thompson, B. (1998). Statistical significance and effect size reporting: Portrait of a possible future. Research in the Schools, 5(2), 33-38.

Thompson, B. (2002). "Statistical," "practical," and "clinical": How many kinds of significance do counselors need to consider? Journal of Counseling and Development, 80, 64-71.

Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

White, D. M., Rusch, F. R., Kazdin, A. E., & Hartmann, D. P. (1989). Applications of meta-analysis in individual subject research. Behavioral Assessment, 11, 281-296.

White, O. R., & Haring, N. G. (1980). Exceptional teaching (2nd ed.). Columbus, OH: Merrill.

RECEIVED: October 15, 2001 ACCEPTED: August 26, 2002