


Perception & Psychophysics, 2001, 63 (8), 1314-1329
Copyright 2001 Psychonomic Society, Inc.

The psychometric function: II. Bootstrap-based confidence intervals and sampling

FELIX A. WICHMANN and N. JEREMY HILL
University of Oxford, Oxford, England

The psychometric function relates an observer’s performance to an independent variable, usually a physical quantity of an experimental stimulus. Even if a model is successfully fit to the data and its goodness of fit is acceptable, experimenters require an estimate of the variability of the parameters to assess whether differences across conditions are significant. Accurate estimates of variability are difficult to obtain, however, given the typically small size of psychophysical data sets: Traditional statistical techniques are only asymptotically correct and can be shown to be unreliable in some common situations. Here and in our companion paper (Wichmann & Hill, 2001), we suggest alternative statistical techniques based on Monte Carlo resampling methods. The present paper’s principal topic is the estimation of the variability of fitted parameters and derived quantities, such as thresholds and slopes. First, we outline the basic bootstrap procedure and argue in favor of the parametric, as opposed to the nonparametric, bootstrap. Second, we describe how the bootstrap bridging assumption, on which the validity of the procedure depends, can be tested. Third, we show how one’s choice of sampling scheme (the placement of sample points on the stimulus axis) strongly affects the reliability of bootstrap confidence intervals, and we make recommendations on how to sample the psychometric function efficiently. Fourth, we show that, under certain circumstances, the (arbitrary) choice of the distribution function can exert an unwanted influence on the size of the bootstrap confidence intervals obtained, and we make recommendations on how to avoid this influence. Finally, we introduce improved confidence intervals (bias corrected and accelerated) that improve on the parametric and percentile-based bootstrap confidence intervals previously used. Software implementing our methods is available.

The performance of an observer on a psychophysical task is typically summarized by reporting one or more response thresholds—stimulus intensities required to produce a given level of performance—and by a characterization of the rate at which performance improves with increasing stimulus intensity. These measures are derived from a psychometric function, which describes the dependence of an observer’s performance on some physical aspect of the stimulus.

Fitting psychometric functions is a variant of the more general problem of modeling data. Modeling data is a three-step process: First, a model is chosen, and the parameters are adjusted to minimize the appropriate error metric or loss function. Second, error estimates of the parameters are derived, and third, the goodness of fit between model and data is assessed. This paper is concerned with the second of these steps, the estimation of variability in fitted parameters and in quantities derived from them. Our companion paper (Wichmann & Hill, 2001) illustrates how to fit psychometric functions while avoiding bias resulting from stimulus-independent lapses, and how to evaluate goodness of fit between model and data.

We advocate the use of Efron’s bootstrap method, a particular kind of Monte Carlo technique, for the problem of estimating the variability of parameters, thresholds, and slopes of psychometric functions (Efron, 1979, 1982; Efron & Gong, 1983; Efron & Tibshirani, 1991, 1993). Bootstrap techniques are not without their own assumptions and potential pitfalls. In the course of this paper, we shall discuss these and examine their effect on the estimates of variability we obtain. We describe and examine the use of parametric bootstrap techniques in finding confidence intervals for thresholds and slopes. We then explore the sensitivity of the estimated confidence interval widths to (1) sampling schemes, (2) mismatch of the objective function, and (3) accuracy of the originally fitted parameters. The last of these is particularly important, since it provides a test of the validity of the bridging assumption on which the use of parametric bootstrap techniques relies. Finally, we recommend, on the basis of the theoretical work of others, the use of a technique called bias correction with acceleration (BCa) to obtain stable and accurate confidence interval estimates.

Part of this work was presented at the Computers in Psychology (CiP) Conference in York, England, during April 1998 (Hill & Wichmann, 1998). This research was supported by a Wellcome Trust Mathematical Biology Studentship, a Jubilee Scholarship from St. Hugh’s College, Oxford, and a Fellowship by Examination from Magdalen College, Oxford, to F.A.W., and by a grant from the Christopher Welch Trust Fund and a Maplethorpe Scholarship from St. Hugh’s College, Oxford, to N.J.H. We are indebted to Andrew Derrington, Karl Gegenfurtner, Bruce Henning, Larry Maloney, Eero Simoncelli, and Stefaan Tibeau for helpful comments and suggestions. This paper benefited considerably from conscientious peer review, and we thank our reviewers, David Foster, Stan Klein, Marjorie Leek, and Bernhard Treutwein, for helping us to improve our manuscript. Software implementing the methods described in this paper is available (MATLAB); contact F.A.W. at the address provided or see http://users.ox.ac.uk/~sruoxfor/psychofit/. Correspondence concerning this article should be addressed to F. A. Wichmann, Max-Planck-Institut für Biologische Kybernetik, Spemannstraße 38, D-72076 Tübingen, Germany (e-mail: [email protected]).

BACKGROUND

The Psychometric Function
Our notation will follow the conventions we have outlined in Wichmann and Hill (2001). A brief summary of terms follows.

Performance on K blocks of a constant-stimuli psychophysical experiment can be expressed using three vectors, each of length K. An x denotes the stimulus values used, an n denotes the numbers of trials performed at each point, and a y denotes the proportion of correct responses (in n-alternative forced-choice [n-AFC] experiments) or positive responses (single-interval or yes/no experiments) on each block. We often use N to refer to the total number of trials in the set, N = Σnᵢ.
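
In code, the three vectors and N might be represented as follows (illustrative numbers, not data from any of the experiments reported here):

```python
import numpy as np

# K = 6 blocks of a constant-stimuli 2AFC experiment (illustrative values)
x = np.array([4.0, 6.0, 8.0, 10.0, 12.0, 14.0])        # stimulus values
n = np.array([40, 40, 40, 40, 40, 40])                 # trials per block
y = np.array([0.525, 0.60, 0.70, 0.80, 0.925, 0.975])  # proportion correct

N = n.sum()  # total number of trials, N = sum of n_i
```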

The number of correct responses yᵢnᵢ in a given block i is assumed to be the sum of random samples from a Bernoulli process with probability of success pᵢ. A psychometric function ψ(x) is the function that relates the stimulus dimension x to the expected performance value p.

A common general form for the psychometric function is

ψ(x; α, β, γ, λ) = γ + (1 − γ − λ) F(x; α, β).  (1)

The shape of the curve is determined by our choice of a functional form for F and by the four parameters {α, β, γ, λ}, to which we shall refer collectively by using the parameter vector θ. F is typically a sigmoidal function, such as the Weibull, cumulative Gaussian, logistic, or Gumbel. We assume that F describes the underlying psychological mechanism of interest: The parameters γ and λ determine the lower and upper bounds of the curve, which are affected by other factors. In yes/no paradigms, γ is the guess rate and λ the miss rate. In n-AFC paradigms, γ usually reflects chance performance and is fixed at the reciprocal of the number of intervals per trial, and λ reflects the stimulus-independent error rate or lapse rate (see Wichmann & Hill, 2001, for more details).
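
As a concrete sketch (written in Python rather than the authors' MATLAB), Equation 1 with the Weibull as F might look like this:

```python
import numpy as np

def weibull(x, alpha, beta):
    """Weibull distribution function F(x; alpha, beta)."""
    return 1.0 - np.exp(-(np.asarray(x, dtype=float) / alpha) ** beta)

def psi(x, alpha, beta, gamma, lam):
    """Psychometric function, Equation 1: gamma + (1 - gamma - lam) * F."""
    return gamma + (1.0 - gamma - lam) * weibull(x, alpha, beta)
```

With the generating parameters used later in this paper ({10, 3, .5, .01}), psi(10, 10, 3, 0.5, 0.01) evaluates to roughly .81.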

When a parameter set has been estimated, we will usually be interested in measurements of the threshold (displacement along the x-axis) and slope of the psychometric function. We calculate thresholds by taking the inverse of F at a specified probability level, usually .5. Slopes are calculated by finding the derivative of F with respect to x, evaluated at a specified threshold. Thus, we shall use the notation threshold0.8, for example, to mean F⁻¹(0.8), and slope0.8 to mean dF/dx evaluated at F⁻¹(0.8). When we use the terms threshold and slope without a subscript, we mean threshold0.5 and slope0.5: In our 2AFC examples, this will mean the stimulus value and slope of F at the point where performance is approximately 75% correct, although the exact performance level is affected slightly by the (small) value of λ.
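
For the Weibull choice of F, both quantities have closed forms; a sketch (helper names are ours, not the authors'):

```python
import numpy as np

def weibull_inv(p, alpha, beta):
    """Inverse of the Weibull F: the stimulus value at which F reaches p."""
    return alpha * (-np.log(1.0 - p)) ** (1.0 / beta)

def weibull_slope(x, alpha, beta):
    """Derivative dF/dx of the Weibull F, evaluated at x."""
    return (beta / alpha) * (x / alpha) ** (beta - 1) * np.exp(-(x / alpha) ** beta)

# threshold0.5 and slope0.5 for the generating parameters used later (alpha=10, beta=3)
thr = weibull_inv(0.5, 10.0, 3.0)
slp = weibull_slope(thr, 10.0, 3.0)
```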

Where an estimate of a parameter set is required, given a particular data set, we use a maximum-likelihood search algorithm, with Bayesian constraints on the parameters based on our beliefs about their possible values. For example, λ is constrained within the range [0, .06], reflecting our belief that normal, trained observers do not make stimulus-independent errors at high rates. We describe our method in detail in Wichmann and Hill (2001).
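
A minimal maximum-likelihood fit in this spirit might be sketched as follows (a Python sketch with scipy, not the authors' implementation; a hard bound on λ stands in for their Bayesian prior, and the 2AFC γ = .5 is fixed):

```python
import numpy as np
from scipy.optimize import minimize

def weibull(x, alpha, beta):
    return 1.0 - np.exp(-(x / alpha) ** beta)

def neg_log_likelihood(params, x, n, y, gamma=0.5):
    alpha, beta, lam = params
    if alpha <= 0 or beta <= 0 or not (0 <= lam <= 0.06):  # hard bound stands in for the prior
        return np.inf
    p = gamma + (1 - gamma - lam) * weibull(x, alpha, beta)
    p = np.clip(p, 1e-10, 1 - 1e-10)
    k = y * n  # numbers of correct responses
    return -np.sum(k * np.log(p) + (n - k) * np.log(1 - p))

def fit(x, n, y):
    best = minimize(neg_log_likelihood, x0=[np.median(x), 3.0, 0.01],
                    args=(x, n, y), method="Nelder-Mead")
    return best.x  # [alpha_hat, beta_hat, lambda_hat]
```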

Estimates of Variability: Asymptotic Versus Monte Carlo Methods

In order to be able to compare response thresholds or slopes across experimental conditions, experimenters require a measure of their variability, which will depend on the number of experimental trials taken and their placement along the stimulus axis. Thus, a fitting procedure must provide not only parameter estimates, but also error estimates for those parameters. Reporting error estimates on fitted parameters is unfortunately not very common in psychophysical studies. Sometimes probit analysis has been used to provide variability estimates (Finney, 1952, 1971). In probit analysis, an iteratively reweighted linear regression is performed on the data once they have undergone transformation through the inverse of a cumulative Gaussian function. Probit analysis relies, however, on asymptotic theory: Maximum-likelihood estimators are asymptotically Gaussian, allowing the standard deviation to be computed from the empirical distribution (Cox & Hinkley, 1974). Asymptotic methods assume that the number of datapoints is large; unfortunately, however, the number of points in a typical psychophysical data set is small (between 4 and 10, with between 20 and 100 trials at each), and in these cases, substantial errors have accordingly been found in the probit estimates of variability (Foster & Bischof, 1987, 1991; McKee, Klein, & Teller, 1985). For this reason, asymptotic theory methods are not recommended for estimating variability in most realistic psychophysical settings.

An alternative method, the bootstrap (Efron, 1979, 1982; Efron & Gong, 1983; Efron & Tibshirani, 1991, 1993), has been made possible by the recent sharp increase in the processing speed of desktop computers. The bootstrap method is a Monte Carlo resampling technique relying on a large number of simulated repetitions of the original experiment. It is potentially well suited to the analysis of psychophysical data, because its accuracy does not rely on large numbers of trials, as do methods derived from asymptotic theory (Hinkley, 1988). We apply the bootstrap to the problem of estimating the variability of parameters, thresholds, and slopes of psychometric functions, following Maloney (1990), Foster and Bischof (1987, 1991, 1997), and Treutwein (1995; Treutwein & Strasburger, 1999).

The essence of Monte Carlo techniques is that a large number, B, of “synthetic” data sets y1*, …, yB* are generated. For each data set yi*, the quantity of interest ϑ (threshold or slope, e.g.) is estimated to give ϑi*. The process for obtaining ϑi* is the same as that used to obtain the first estimate ϑ̂. Thus, if our first estimate was obtained by ϑ̂ = t(θ̂), where θ̂ is the maximum-likelihood parameter estimate from a fit to the original data y, the simulated estimates ϑi* will be given by ϑi* = t(θ̂i*), where θ̂i* is the maximum-likelihood parameter estimate from a fit to the simulated data yi*.

Sometimes it is erroneously assumed that the intention is to measure the variability of the underlying ϑ itself. This cannot be the case, however, because repeated computer simulation of the same experiment is no substitute for the real repeated measurements this would require. What Monte Carlo simulations can do is estimate the variability inherent in (1) our sampling, as characterized by the distribution of sample points (x) and the size of the samples (n), and (2) any interaction between our sampling strategy and the process used to estimate ϑ—that is, assuming a model of the observer’s variability, fitting a function to obtain θ̂, and applying t(θ̂).
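
The simulation loop just described can be sketched as follows, with `fit` and `t` as placeholders for the maximum-likelihood fit and the derived quantity of interest (both names are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_distribution(p_gen, n, t, fit, B=2000):
    """Generate B synthetic data sets from generating probabilities p_gen,
    refit each one, and return the B estimates of the quantity of interest."""
    estimates = []
    for _ in range(B):
        y_star = rng.binomial(n, p_gen) / n  # simulated proportions correct
        theta_star = fit(y_star)             # refit to the simulated data
        estimates.append(t(theta_star))      # derived quantity (threshold, slope, ...)
    return np.array(estimates)
```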

Bootstrap Data Sets: Nonparametric and Parametric Generation

In applying Monte Carlo techniques to psychophysical data, we require, in order to obtain a simulated data set yi*, some system that provides generating probabilities p for the binomial variates yi1*, …, yiK*. These should be the same generating probabilities that we hypothesize to underlie the empirical data set y.

Efron’s bootstrap offers such a system. In the nonparametric bootstrap method, we would assume p = y. This is equivalent to resampling, with replacement, the original set of correct and incorrect responses on each block of observations j in y to produce a simulated sample yij*.

Alternatively, a parametric bootstrap can be performed. In the parametric bootstrap, assumptions are made about the generating model from which the observed data are believed to arise. In the context of estimating the variability of parameters of psychometric functions, the data are generated by a simulated observer whose underlying probabilities of success are determined by the maximum-likelihood fit to the real observer’s data [yfit = ψ(x; θ̂)]. Thus, where the nonparametric bootstrap uses y, the parametric bootstrap uses yfit as generating probabilities p for the simulated data sets.
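
The two regimes differ only in where the generating probabilities come from; a sketch with made-up observed proportions and hypothetical fitted parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.array([4.0, 6.0, 8.0, 10.0, 12.0, 14.0])
n = np.full(6, 40)
y = np.array([0.55, 0.60, 0.65, 0.80, 0.95, 1.00])  # observed proportions (illustrative)

# Nonparametric: the generating probabilities are the raw proportions themselves, p = y.
y_star_nonpar = rng.binomial(n, y) / n

# Parametric: the generating probabilities come from the fitted function, p = psi(x; theta_hat).
def psi(x, alpha=10.0, beta=3.0, gamma=0.5, lam=0.01):  # assumed fitted values
    return gamma + (1 - gamma - lam) * (1 - np.exp(-(x / alpha) ** beta))

y_star_par = rng.binomial(n, psi(x)) / n
```

Note that the nonparametric simulated observer can never err on the last block (its generating probability is exactly 1.0), whereas the parametric one can, because the fitted function keeps λ > 0.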

As is frequently the case in statistics, the choice of parametric versus nonparametric analysis concerns how much confidence one has in one’s hypothesis about the underlying mechanism that gave rise to the raw data, as against the confidence one has in the raw data’s precise numerical values. Choosing the parametric bootstrap for the estimation of variability in psychometric function fitting appears the natural choice for several reasons. First and foremost, in fitting a parametric model to the data, one has already committed oneself to a parametric analysis. No additional assumptions are required to perform a parametric bootstrap beyond those required for fitting a function to the data: specification of the source of variability (binomial variability) and the model from which the data are most likely to come (parameter vector θ and distribution function F). Second, given the assumption that data from psychophysical experiments are binomially distributed, we expect data to be variable (noisy). The nonparametric bootstrap treats every datapoint as if its exact value reflected the underlying mechanism.¹ The parametric bootstrap, on the other hand, allows the datapoints to be treated as noisy samples from a smooth and monotonic function, determined by θ and F.

One consequence of the two different bootstrap regimes is as follows. Assume two observers performing the same psychophysical task at the same stimulus intensities x, and assume that it happens that the maximum-likelihood fits to the two data sets yield identical parameter vectors θ̂. Given such a scenario, the parametric bootstrap returns identical estimates of variability for both observers, since it depends only on x, θ̂, and F. The nonparametric bootstrap’s estimates would, on the other hand, depend on the individual differences between the two data sets y1 and y2, something we consider unconvincing: A method for estimating variability in parameters and thresholds should return identical estimates for identical observers performing the identical experiment.²

Treutwein (1995) and Treutwein and Strasburger (1999) used the nonparametric bootstrap, and Maloney (1990) used the parametric bootstrap, to compare bootstrap estimates of variability with real-world variability in the data of repeated psychophysical experiments. All of the above studies found the bootstrap estimates to be in agreement with the human data. Keeping in mind that the number of repeats in the above-quoted cases was small, this is nonetheless encouraging, suggesting that bootstrap methods are a valid method of variability estimation for parameters fitted to psychophysical data.

Testing the Bridging Assumption
Asymptotically—that is, for large K and N—θ̂ will converge toward θ, since maximum-likelihood estimation is asymptotically unbiased³ (Cox & Hinkley, 1974; Kendall & Stuart, 1979). For the small K typical of psychophysical experiments, however, we can only hope that our estimated parameter vector θ̂ is “close enough” to the true parameter vector θ for the estimated variability in θ̂ obtained by the bootstrap method to be valid. We call this the bootstrap bridging assumption.

Whether θ̂ is indeed sufficiently close to θ depends, in a complex way, on the sampling—that is, the number of blocks of trials (K), the numbers of observations at each block of trials (n), and the stimulus intensities (x)—relative to the true parameter vector θ. Maloney (1990) summarized these dependencies for a given experimental design by plotting the standard deviation of β̂ as a function of α and β as a contour plot (Maloney, 1990, Figure 3, p. 129). Similar contour plots for the standard deviation of α̂ and for bias in both α̂ and β̂ could be obtained. If, owing to small K, bad sampling, or otherwise, our estimation procedure is inaccurate, the distribution of bootstrap parameter vectors θ̂1*, …, θ̂B*—centered around θ̂—will not be centered around the true θ. As a result, the estimates of variability are likely to be incorrect unless the magnitude of the standard deviation is similar around θ̂ and θ, despite the fact that the points are some distance apart in parameter space.

One way to assess the likely accuracy of bootstrap estimates of variability is to follow Maloney (1990) and to examine the local flatness of the contours around θ̂, our best estimate of θ. If the contours are sufficiently flat, the variability estimates will be similar, assuming that the true θ is somewhere within this flat region. However, the process of local contour estimation is computationally expensive, since it requires a very large number of complete bootstrap runs for each data set.

A much quicker alternative is the following. Having obtained θ̂ and performed a bootstrap using ψ(x; θ̂) as the generating function, we move to eight different points φ1, …, φ8 in α–β space. Eight Monte Carlo simulations are performed, using φ1, …, φ8 as the generating functions to explore the variability in those parts of the parameter space (only the generating parameters of the bootstrap are changed; x remains the same for all of them). If the contours of variability around θ̂ are sufficiently flat, as we hope they are, confidence intervals at φ1, …, φ8 should be of the same magnitude as those obtained at θ̂. Prudence should lead us to accept the largest of the nine confidence intervals obtained as our estimate of variability.

A decision has to be made as to which eight points in α–β space to use for the new set of simulations. Generally, provided that the psychometric function is at least reasonably well sampled, the contours vary smoothly in the immediate vicinity of θ̂, so that the precise placement of the sample points φ1, …, φ8 is not critical. One suggested and easy way to obtain a set of additional generating parameters is shown in Figure 1.

Figure 1 shows B = 2,000 bootstrap parameter pairs as dark filled circles plotted in α–β space. Simulated data sets were generated from ψgen with the Weibull as F and θgen = {10, 3, 0.5, 0.01}; sampling scheme s7 (triangles; see Figure 2) was used, and N was set to 480 (nᵢ = 80). The large central triangle at (10, 3) marks the generating parameter set; the solid and dashed line segments adjacent to the x- and y-axes mark the 68% and 95% confidence intervals for α and β, respectively.⁴ In the following, we shall use WCI to stand for width of confidence interval, with a subscript denoting its coverage percentage—that is, WCI68 denotes the width of the 68% confidence interval.⁵ The eight additional generating parameter pairs φ1, …, φ8 are marked by the light triangles. They form a rectangle whose sides have length WCI68 in α and β. Typically, this central rectangular region contains approximately 30%–40% of all α–β pairs and could thus be viewed as a crude joint 35% confidence region for α and β. A coverage percentage of 35% represents a sensible compromise between erroneously accepting the estimate around θ̂, very likely underestimating the true variability, and performing additional bootstrap replications too far in the periphery, where variability and, thus, the estimated confidence intervals become erroneously inflated owing to poor sampling. Recently, one of us (Hill, 2001b) performed Monte Carlo simulations to test the coverage of various bootstrap confidence interval methods and found that the above method, based on 25% of points or more, was a good way of guaranteeing that both sides of a two-tailed confidence interval had at least the desired level of coverage.
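
The construction of the eight generating points can be sketched as follows (our reading of Figure 1: the corners and edge midpoints of the WCI68 rectangle; helper names are ours):

```python
import numpy as np

def wci(samples, coverage=68):
    """Width of the central `coverage`% interval of a bootstrap sample."""
    lo, hi = np.percentile(samples, [50 - coverage / 2, 50 + coverage / 2])
    return hi - lo

def phi_points(alpha_hat, beta_hat, wci_alpha, wci_beta):
    """Eight generating points phi_1..phi_8: corners and edge midpoints of the
    rectangle with side lengths WCI68 in alpha and beta, centered on the estimate."""
    da, db = wci_alpha / 2, wci_beta / 2
    offsets = [(-da, -db), (0, -db), (da, -db), (-da, 0),
               (da, 0), (-da, db), (0, db), (da, db)]
    return [(alpha_hat + oa, beta_hat + ob) for oa, ob in offsets]
```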

MONTE CARLO SIMULATIONS

In both our papers, we use only the specific case of the 2AFC paradigm in our examples: Thus, γ is fixed at .5. In our simulations, where we must assume a distribution of true generating probabilities, we always use the Weibull function in conjunction with the same fixed set of generating parameters θgen: {αgen = 10, βgen = 3, γgen = .5, λgen = .01}. In our investigation of the effects of sampling patterns, we shall always use K = 6 and nᵢ constant—that is, six blocks of trials with the same number of trials in each block. The number of observations per point, nᵢ, could be set to 20, 40, 80, or 160, and with K = 6, this means that the total number of observations N could take the values 120, 240, 480, and 960.

We have introduced these limitations purely for the purposes of illustration, to keep our explanatory variables down to a manageable number. We have found, in many other simulations, that these restrictions entail no loss of generality of our conclusions in most cases.

Confidence intervals of parameters, thresholds, and slopes that we report in the following were always obtained using a method called bias-corrected and accelerated (BCa) confidence intervals, which we describe later in our Bootstrap Confidence Intervals section.
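
For orientation, the standard BCa construction (Efron & Tibshirani, 1993) can be sketched as follows; this is a generic textbook version, not the authors' own implementation, and their fuller description comes later in the paper:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def bca_interval(data, statistic, B=2000, coverage=0.68):
    """BCa bootstrap confidence interval for statistic(data)."""
    theta_hat = statistic(data)
    n = len(data)
    boot = np.array([statistic(rng.choice(data, n, replace=True)) for _ in range(B)])
    # Bias correction: how far the bootstrap distribution sits from theta_hat.
    z0 = norm.ppf(np.mean(boot < theta_hat))
    # Acceleration, estimated from the jackknife.
    jack = np.array([statistic(np.delete(data, i)) for i in range(n)])
    d = jack.mean() - jack
    a = (d ** 3).sum() / (6.0 * ((d ** 2).sum()) ** 1.5)
    # Adjusted percentile levels, then read off the bootstrap distribution.
    z = norm.ppf([(1 - coverage) / 2, (1 + coverage) / 2])
    adj = norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
    return np.percentile(boot, 100 * adj)
```

When the bootstrap distribution is unbiased and symmetric (z0 ≈ 0, a ≈ 0), this reduces to the simple percentile interval.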

The Effects of Sampling Schemes and Number of Trials

One of our aims in this study was to examine the effect of N and one’s choice of sample points x on both the size of one’s confidence intervals for ϑ and their sensitivity to errors in θ̂.

Seven different sampling schemes were used, each dictating a different distribution of datapoints along the stimulus axis; they are the same schemes as those used and described in Wichmann and Hill (2001), and they are shown in Figure 2. Each horizontal chain of symbols represents one of the schemes, marking the stimulus values at which the six sample points are placed. The different symbol shapes will be used to identify the sampling schemes in our results plots. To provide a frame of reference, the solid curve shows the psychometric function used—that is, 0.5 + 0.5F(x; {αgen, βgen})—with the 55%, 75%, and 95% performance levels marked by dotted lines.

As we shall see, even for a fixed number of sample points and a fixed number of trials per point, biases in parameter estimation and goodness-of-fit assessment (companion paper), as well as the width of confidence intervals (this paper), all depend markedly on the distribution of stimulus values x.

Monte Carlo data sets were generated using our seven sampling schemes shown in Figure 2, using the generating parameters θgen, as well as N and K as specified above. A maximum-likelihood fit was performed on each simulated data set to obtain bootstrap parameter vectors θ̂*, from which we subsequently derived the x values corresponding to threshold0.5 and threshold0.8, as well as the slope. For each sampling scheme and value of N, a total of nine simulations were performed: one at θgen and eight more at points φ1, …, φ8, as specified in our section on the bootstrap bridging assumption. Thus, each of our 28 conditions (7 sampling schemes × 4 values of N) required 9 × 1,000 simulated data sets, for a total of 252,000 simulations, or 1.134 × 10⁸ simulated 2AFC trials.
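
The bookkeeping is easy to verify:

```python
# 28 conditions: 7 sampling schemes x 4 values of N (120, 240, 480, 960)
schemes, n_values = 7, [120, 240, 480, 960]
datasets_per_condition = 9 * 1000  # 9 generating points x 1,000 data sets each

total_datasets = schemes * len(n_values) * datasets_per_condition  # 252,000
total_trials = schemes * datasets_per_condition * sum(n_values)    # 1.134e8 2AFC trials
```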

[Figure 1 shows B = 2,000 bootstrap parameter estimates as a scatter plot in α–β parameter space; original axis labels: parameter a, parameter b; legend: 68% confidence interval (WCI68) and 95% confidence interval.]

Figure 1. B = 2,000 data sets were generated from a two-alternative forced-choice Weibull psychometric function with parameter vector θgen = {10, 3, .5, .01} and then fit using our maximum-likelihood procedure, resulting in 2,000 estimated parameter pairs (α̂, β̂) shown as dark circles in α–β parameter space. The location of the generating α and β (10, 3) is marked by the large triangle in the center of the plot. The sampling scheme s7 was used to generate the data sets (see Figure 2 for details) with N = 480. Solid lines mark the 68% confidence interval width (WCI68) separately for α and β; broken lines mark the 95% confidence intervals. The light small triangles show the α–β parameter sets φ1, …, φ8 from which each bootstrap is repeated during sensitivity analysis while keeping the x-values of the sampling scheme unchanged.

Figures 3, 4, and 5 show the results of the simulations dealing with slope0.5, threshold0.5, and threshold0.8, respectively. The top-left panel (A) of each figure plots the WCI68 of the estimate under consideration as a function of N. Data for all seven sampling schemes are shown, using their respective symbols. The top-right panel (B) plots, as a function of N, the maximal elevation of the WCI68 encountered in the vicinity of θgen, that is, max{WCIφ1/WCIθgen, …, WCIφ8/WCIθgen}. The elevation factor is an indication of the sensitivity of our variability estimates to errors in the estimation of θ: the closer it comes to 1.0, the better. The bottom-left panel (C) plots the largest of the WCI68 measurements at θgen and φ1, …, φ8 for a given sampling scheme and N; it is, thus, the product of the elevation factor plotted in panel B and its respective WCI68 in panel A. This quantity we denote MWCI68, standing for maximum width of the 68% confidence interval; this is the confidence interval we suggest experimenters should use when they report error bounds. The bottom-right panel (D), finally, shows MWCI68 multiplied by a sweat factor, √(N/120). Ideally, that is, for very large K and N, confidence intervals should be inversely proportional to √N, and multiplication by our sweat factor should remove this dependency. Any change in the sweat-adjusted MWCI68 as a function of N might thus be taken as an indicator of the degree to which sampling scheme, N, and true confidence interval width interact in a way not predicted by asymptotic theory.
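The panel quantities just defined (WCI68, elevation, MWCI68, and the sweat-adjusted MWCI68) reduce to a few lines of percentile arithmetic. The bootstrap distributions below are stand-ins drawn from normal distributions, purely to make the sketch runnable; in practice they would be the B threshold or slope estimates from the bootstrap at θgen and at the perturbed points φ1, …, φ8.

```python
import numpy as np

def wci68(samples):
    """Width of the central 68% of a bootstrap distribution:
    the distance between its 16th and 84th percentiles."""
    lo, hi = np.percentile(samples, [16, 84])
    return hi - lo

# Stand-in bootstrap threshold distributions (B = 999 each): one at the
# fitted parameters, eight at the perturbed points of the sensitivity analysis.
rng = np.random.default_rng(1)
wci_gen = wci68(rng.normal(10.0, 0.30, 999))
wci_phi = [wci68(rng.normal(10.0, s, 999)) for s in
           (0.31, 0.35, 0.28, 0.42, 0.33, 0.30, 0.37, 0.45)]

elevation = max(w / wci_gen for w in wci_phi)   # panel B quantity
mwci68 = max([wci_gen] + wci_phi)               # panel C quantity: widest WCI68
N = 480
sweat_adjusted = mwci68 * np.sqrt(N / 120)      # panel D quantity
print(elevation, mwci68, sweat_adjusted)
```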

Figure 3A shows the WCI68 around the median estimated slope. Clearly, the different sampling schemes have a profound effect on the magnitude of the confidence intervals for slope estimates. For example, in order to ensure that the WCI68 is approximately 0.06, one requires nearly 960 trials if using sampling scheme s1 or s4. Sampling schemes s3, s5, or s7, on the other hand, require only around 200 trials to achieve a similar confidence interval width. The important difference that makes, for example, s3, s5, and s7 more efficient than s1 and s4 is the presence of samples at high predicted performance values (p ≥ .9), where binomial variability is low and, thus, the data constrain our maximum-likelihood fit more tightly. Figure 3B illustrates the complex interactions between different sampling schemes, N, and the stability of the bootstrap estimates of variability, as indicated by the local flatness of the contours around θgen. A perfectly flat local contour would result in a horizontal line at 1. Sampling scheme s6 is well behaved for N ≥ 480, its maximal elevation being around 1.7; for N < 240, however, its elevation rises to near 2.5. Other schemes, like s1, s4, or s7, never rise above an elevation of 2, regardless of N. It is important to note that the magnitude of WCI68 at θgen is by no means a good predictor of the stability of the estimate, as indicated by the sensitivity factor. Figure 3C shows the MWCI68, and here the differences in efficiency between sampling schemes are even more apparent than in Figure 3A. Sampling schemes s3, s5, and s7 are clearly superior to all other sampling schemes, particularly for N < 480; these three sampling schemes are the ones that include two sampling points at p > .92.

Figure 4, showing the equivalent data for the estimates of threshold0.5, also illustrates that some sampling schemes make much more efficient use of the experimenter's time by providing MWCI68s that are more compact than others by a factor of 3.2.

[Figure 2 plot residue: proportion correct (.1–1) vs. signal intensity (4–20, logarithmic axis), with rows of symbols marking the stimulus placements of schemes s1–s7.]

Figure 2. A two-alternative forced-choice Weibull psychometric function with parameter vector θ = {10, 3, .5, 0} on semilogarithmic coordinates. The rows of symbols below the curve mark the x values of the seven different sampling schemes, s1–s7, used throughout the remainder of the paper.

Two aspects of the data shown in Figures 4B and 4C are important. First, the sampling schemes fall into two distinct classes: Five of the seven sampling schemes are almost ideally well behaved, with elevations barely exceeding 1.5 even for small N (and MWCI68 ≤ 3); the remaining two, on the other hand, behave poorly, with elevations in excess of 3 for N < 240 (MWCI68 > 5). The two unstable sampling schemes, s1 and s4, are those that do not include at least a single sample point at p ≥ .95 (see Figure 2). It thus appears crucial to include at least one sample point at p ≥ .95 to make one's estimates of variability robust to small errors in θ, even if the threshold of interest, as in our example, has a p of only approximately .75. Second, both s1 and s4 are prone to lead to a serious underestimation of the true width of the confidence interval if the sensitivity analysis (or bootstrap bridging assumption test) is not carried out: The WCI68 is unstable in the vicinity of θgen even though WCI68 is small at θgen.

Figure 5 is similar to Figure 4, except that it shows the WCI68 around threshold0.8. The trends found in Figure 4 are even more exaggerated here: The sampling schemes without high sampling points (s1, s4) are unstable to the point of being meaningless for N < 480. In addition, their WCI68 is inflated, relative to that of the other sampling schemes, even at θgen.

[Figure 3 plot residue: four panels (A–D) of slope WCI68 measures vs. total number of observations N (120–960), one symbol per sampling scheme s1–s7.]

Figure 3. (A) The width of the 68% BCa confidence interval (WCI68) of slopes of B = 999 psychometric functions fitted to parametric bootstrap data sets generated from θgen = {10, 3, .5, .01}, as a function of the total number of observations, N. (B) The maximal elevation of the WCI68 in the vicinity of θgen, again as a function of N (see the text for details). (C) The maximal width of the 68% BCa confidence interval (MWCI68) in the vicinity of θgen, that is, the product of the respective WCI68 and elevation factors of panels A and B. (D) MWCI68 scaled by the sweat factor √(N/120), the ideally expected decrease in confidence interval width with increasing N. The seven symbols denote the seven sampling schemes as in Figure 2.

The results of the Monte Carlo simulations are summarized in Table 1. The columns of Table 1 correspond to the different sampling schemes, marked by their respective symbols. The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; similarly, the next four rows contain the MWCI68 at threshold0.8, and the following four those for slope0.5. The scheme with the lowest MWCI68 in each row is given the score of 100. The others on the same row are given proportionally higher scores, to indicate their MWCI68 as a percentage of the best scheme's value. The bottom three rows of Table 1 contain summary statistics of how well the different sampling schemes do across all 12 estimates.

[Figure 4 plot residue: four panels (A–D) of threshold (F⁻¹(0.5)) WCI68 measures vs. N, one symbol per sampling scheme s1–s7.]

Figure 4. Similar to Figure 3, except that it shows thresholds corresponding to F⁻¹(0.5) (approximately equal to 75% correct during two-alternative forced choice).

An inspection of Table 1 reveals that the sampling schemes fall into four categories. Worst by a long way are sampling schemes s1 and s4, with mean and median MWCI68 > 210%. Somewhat superior is s2, particularly for small N, with a median MWCI68 of 180% and a significantly lower standard deviation. Next come sampling schemes s5 and s6, with means and medians of around 130%. Each of these has at least one sample at p ≥ .95, and 50% at p ≥ .80. The importance of this high sample point is clearly demonstrated by a comparison of s1 and s6: s6 is identical to s1, except that one sample point was moved from .75 to .95. Still, s6 is superior to s1 on each of the 12 estimates, and often very markedly so. Finally, there are two sampling schemes with means and medians below 120%, very nearly optimal6 on most estimates. Both of these sampling schemes, s3 and s7, have 50% of their sample points at p ≥ .90 and one third at p ≥ .95.

In order to obtain stable estimates of the variability of parameters, thresholds, and slopes of psychometric functions, it appears that we must include at least one, but preferably more, sample points at large p values. Such sampling schemes are, however, sensitive to stimulus-independent lapses that could potentially bias the estimates if we were to fix the upper asymptote of the psychometric function (the parameter λ in Equation 1; see our companion paper, Wichmann & Hill, 2001).

[Figure 5 plot residue: four panels (A–D) of threshold (F⁻¹(0.8)) WCI68 measures vs. N, one symbol per sampling scheme s1–s7.]

Figure 5. Similar to Figure 3, except that it shows thresholds corresponding to F⁻¹(0.8) (approximately equal to 90% correct during two-alternative forced choice).

Somewhat counterintuitively, it is thus not sensible to place all or most samples close to the point of interest (e.g., close to threshold0.5, in order to obtain tight confidence intervals for threshold0.5), because estimation is done via the whole psychometric function, which, in turn, is estimated from the entire data set: Sample points at high p values (or near zero in yes/no tasks) have very little variance and thus constrain the fit more tightly than do points in the middle of the psychometric function, where the expected variance is highest (at least during maximum-likelihood fitting). Hence, adaptive techniques that sample predominantly around the threshold value of interest are less efficient than one might think (cf. Lam, Mills, & Dubno, 1996).
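The claim that high-p points constrain the fit rests on the binomial standard error, √(p(1 − p)/n), which shrinks toward the asymptotes. A quick numeric check, with an assumed n = 50 trials per stimulus level:

```python
import numpy as np

# Binomial standard error of an observed proportion, sqrt(p(1 - p)/n):
# it shrinks as p approaches 1, so points near the upper asymptote pin
# down the fitted curve more tightly than points mid-function.
n = 50  # assumed trials per stimulus level
ps = (0.55, 0.75, 0.90, 0.95, 0.99)
ses = [float(np.sqrt(p * (1 - p) / n)) for p in ps]
for p, se in zip(ps, ses):
    print(f"p = {p:.2f}  SE = {se:.4f}")
```

The standard error at p = .95 is roughly half that at p = .75, which is why schemes s3, s5, and s7 constrain the maximum-likelihood fit so much more tightly.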

All the simulations reported in this section were performed with the a and b parameters of the Weibull distribution function constrained during fitting; we used Bayesian priors to limit a to be between 0 and 25 and b to be between 0 and 15 (not including the endpoints of the intervals), similar to constraining λ to be between 0 and .06 (see our discussion of Bayesian priors in Wichmann & Hill, 2001). Such fairly stringent bounds on the possible parameter values are frequently justifiable given the experimental context (cf. Figure 2); it is important to note, however, that allowing a and b to be unconstrained would only amplify the differences between the sampling schemes7 but would not change them qualitatively.
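The flat Bayesian priors just described amount to adding a rectangular penalty to the log-likelihood. A minimal sketch (function names and the stand-in nll are our own, not the authors' implementation):

```python
import numpy as np

# Flat ("top hat") priors simply add an infinite penalty outside the
# permitted range and leave the likelihood untouched inside it.
BOUNDS = {"a": (0.0, 25.0), "b": (0.0, 15.0), "lam": (0.0, 0.06)}

def log_prior(a, b, lam):
    """Zero inside the open intervals above, -inf outside."""
    for val, (lo, hi) in zip((a, b, lam), BOUNDS.values()):
        if not (lo < val < hi):
            return -np.inf
    return 0.0

def neg_log_posterior(params, nll):
    """Penalized fit criterion: -log L(theta) - log prior(theta)."""
    lp = log_prior(*params)
    return np.inf if np.isinf(lp) else nll(params) - lp

# e.g. a candidate with b = 16 is rejected outright:
print(neg_log_posterior((10.0, 16.0, 0.01), nll=lambda p: 0.0))  # inf
```

Any minimizer then simply never settles outside the permitted box.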

Influence of the Distribution Function on Estimates of Variability

Thus far, we have argued in favor of the bootstrap method for estimating the variability of fitted parameters, thresholds, and slopes, since its estimates do not rely on asymptotic theory. However, in the context of fitting psychometric functions, one requires in addition that the exact form of the distribution function F (Weibull, logistic, cumulative Gaussian, Gumbel, or any other reasonably similar sigmoid) has only a minor influence on the estimates of variability. The importance of this cannot be overestimated, since a strong dependence of the estimates of variability on the precise algebraic form of the distribution function would call the usefulness of the bootstrap into question, because, as experimenters, we do not know, and never will, the true underlying distribution function or objective function from which the empirical data were generated. The problem is illustrated in Figure 6; Figure 6A shows four different psychometric functions: (1) ψW(x; θW), using the Weibull as F, with θW = {10, 3, .5, .01} (our "standard" generating function ψgen); (2) ψCG(x; θCG), using the cumulative Gaussian with θCG = {8.875, 3.278, .5, .01}; (3) ψL(x; θL), using the logistic with θL = {8.957, 2.014, .5, .01}; and, finally, (4) ψG(x; θG), using the Gumbel with θG = {10.022, 2.906, .5, .01}. For all practical purposes in psychophysics, the four functions are indistinguishable. Thus, if one of the above psychometric functions were to provide a good fit to a data set, all of them would, despite the fact that, at most, one of them is correct. The question one has to ask is whether making the choice of one distribution function over another markedly changes the bootstrap estimates of variability.8 Note that this is not trivially so: Although it can be the case that several psychometric functions with different distribution functions F are indistinguishable given a particular data set, as is shown in Figure 6A, this does not imply that the same is true for every data set generated

Table 1
Summary of Results of Monte Carlo Simulations

                                  N      s1     s2     s3     s4     s5     s6     s7
MWCI68 at x = F⁻¹(.5)            120    331    135    143    369    162    100    110
                                 240    209    163    126    339    133    100    113
                                 480    156    179    143    193    158    100    155
                                 960    124    141    140    138    152    100    119
MWCI68 at x = F⁻¹(.8)            120   5725    428    100   1761    180    263    115
                                 240   1388    205    100   1257    110    125    109
                                 480    385    181    115    478    100    116    137
                                 960    215    149    100    291    102    119    101
MWCI68 of dF/dx at x = F⁻¹(.5)   120    212    305    119    321    139    217    100
                                 240    228    263    100    254    128    185    121
                                 480    186    213    119    217    100    148    134
                                 960    186    175    113    201    100    151    118
Mean                                    779    211    118    485    130    140    119
Standard deviation                     1595     85     17    499     28     49     16
Median                                  214    180    117    306    131    122    117

Note—Columns correspond to the seven sampling schemes, marked in the original by their respective symbols (see Figure 2). The first four rows contain the MWCI68 at threshold0.5 for N = 120, 240, 480, and 960; similarly, the next eight rows contain the MWCI68 at threshold0.8 and slope0.5. (See the text for the definition of the MWCI68.) Each entry corresponds to the largest MWCI68 in the vicinity of θgen, as sampled at the points θgen and φ1, …, φ8. The MWCI68 values are expressed as percentages relative to the minimal MWCI68 per row. The bottom three rows contain summary statistics of how well the different sampling schemes perform across estimates.


from one of such similar psychometric functions during the bootstrap procedure: Figure 6B shows the fit of two psychometric functions (Weibull and logistic) to a data set generated from our "standard" generating function ψgen, using sampling scheme s2 with N = 120. Slope, threshold0.8, and threshold0.2 are quite dissimilar for the two fits, illustrating the point that there is a real possibility that the bootstrap distributions of thresholds and slopes from the B bootstrap repeats differ substantially for different choices of F, even if the fits to the original (empirical) data set were almost identical.

To explore the effect of F on estimates of variability, we conducted Monte Carlo simulations, using

ψ = ¼ [ψW + ψCG + ψL + ψG]

as the generating function, and fitted psychometric functions, using the Weibull, cumulative Gaussian, logistic, and Gumbel as the distribution functions, to each data set. From the fitted psychometric functions, we obtained estimates of threshold0.2, threshold0.5, threshold0.8, and slope0.5, as described previously. All four different values of N and our seven sampling schemes were used, resulting in 112 conditions (4 distribution functions × 7 sampling schemes × 4 N values). In addition, we repeated the above procedure 40 times to obtain an estimate of the numerical variability intrinsic to our bootstrap routines,9 for a total of 4,480 bootstrap repeats affording 8,960,000 psychometric function fits (4.032 × 10^9 simulated 2AFC trials).

[Figure 6 plot residue: two panels of proportion correct (.4–1) vs. signal intensity (2–14). Panel A legend: ψW (Weibull), θ = {10, 3, 0.5, 0.01}; ψCG (cumulative Gaussian), θ = {8.9, 3.3, 0.5, 0.01}; ψL (logistic), θ = {9.0, 2.0, 0.5, 0.01}; ψG (Gumbel), θ = {10.0, 2.9, 0.5, 0.02}. Panel B legend: ψW (Weibull), θ = {8.4, 3.4, 0.5, 0.05}; ψL (logistic), θ = {7.8, 1.0, 0.5, 0.05}.]

Figure 6. (A) Four two-alternative forced-choice psychometric functions plotted on semilogarithmic coordinates; each has a different distribution function F (Weibull, cumulative Gaussian, logistic, and Gumbel). See the text for details. (B) A fit of two psychometric functions with different distribution functions F (Weibull, logistic) to the same data set, generated from the mean of the four psychometric functions shown in panel A, using sampling scheme s2 with N = 120. See the text for details.
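The averaged generating function ψ = ¼ [ψW + ψCG + ψL + ψG] can be sketched as follows. The standard textbook forms of the four sigmoids are assumed here, so the parameterizations (the Gumbel convention in particular) may not match the paper's exactly; the parameter vectors should therefore be treated as approximate.

```python
import math
import numpy as np

def psi(F, x):
    """Map a distribution function F into 2AFC performance (gamma=.5, lam=.01)."""
    return 0.5 + 0.49 * F(x)

# Assumed standard forms of the four distribution functions.
F_weibull = lambda x: 1 - math.exp(-((x / 10.0) ** 3.0))
F_gauss   = lambda x: 0.5 * (1 + math.erf((x - 8.875) / (3.278 * math.sqrt(2))))
F_logist  = lambda x: 1 / (1 + math.exp(-(x - 8.957) / 2.014))
F_gumbel  = lambda x: 1 - math.exp(-math.exp((x - 10.022) / 2.906))

xs = np.linspace(4.0, 16.0, 121)
funcs = [F_weibull, F_gauss, F_logist, F_gumbel]
vals = np.array([[psi(F, x) for x in xs] for F in funcs])
psi_mean = vals.mean(axis=0)               # the averaged generating function
spread = np.abs(vals - psi_mean).max()     # how far any curve strays from it
print(spread)                              # the four curves stay within a few percent
```

The small spread is the point of the construction: the averaged function favors none of the four candidate shapes when they are refitted to data generated from it.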

An analysis of variance was applied to the resulting data, with the number of trials N, the sampling schemes s1 to s7, and the distribution function F as independent factors (variables). The dependent variables were the confidence interval widths (WCI68); each cell contained the WCI68 estimates from our 40 repetitions. For all four dependent measures (threshold0.2, threshold0.5, threshold0.8, and slope0.5), not only the first two factors, the number of trials N and the sampling scheme, were, as expected, significant (p < .0001), but also the distribution function F and all possible interactions: The three two-way interactions and the three-way interaction were similarly significant at p < .0001. This result in itself, however, is not necessarily damaging to the bootstrap method applied to psychophysical data, because the significance is brought about by the very low (and desirable) variability of our WCI68 estimates: Model R² is between .995 and .997, implying that virtually all the variance in our simulations is due to N, sampling scheme, F, and interactions thereof.

Rather than focusing exclusively on significance, in Table 2 we provide information about effect size, namely, the percentage of the total sum of squares of variation accounted for by the different factors and their interactions. For threshold0.5 and slope0.5 (columns 2 and 4), N, sampling scheme, and their interaction account for 98.63% and 96.39% of the total variance, respectively.10 The choice of distribution function F, despite being a significant factor, does not have a large effect on WCI68 for threshold0.5 and slope0.5.

The same is not true, however, for the WCI68 of threshold0.2. Here, the choice of F has an undesirably large effect on the bootstrap estimate of WCI68 (its influence is larger than that of the sampling scheme used), and only 84.36% of the variance is explained by N, sampling scheme, and their interaction. Figure 7, finally, summarizes the effect sizes of N, sampling scheme, and F graphically: Each of the four panels of Figure 7 plots the WCI68 (normalized by dividing each WCI68 score by the largest mean WCI68) on the y-axis as a function of N on the x-axis; the different symbols refer to the different sampling schemes. The two symbols shown in each panel correspond to the sampling schemes that yielded the smallest and largest mean WCI68 (averaged across F and N). The gray levels, finally, code the smallest (black), mean (gray), and largest (white) WCI68 for a given N and sampling scheme as a function of the distribution function F.

For threshold0.5 and slope0.5 (Figures 7B and 7D), estimates are virtually unaffected by the choice of F, but for threshold0.2, the choice of F has a profound influence on WCI68 (e.g., in Figure 7A, there is a difference of nearly a factor of two for sampling scheme s7 [triangles] when N = 120). The same is also true, albeit to a lesser extent, if one is interested in threshold0.8: Figure 7C shows the (again undesirable) interaction between sampling scheme and choice of F. WCI68 estimates for sampling scheme s5 (leftward triangles) show little influence of F, but for sampling scheme s4 (rightward triangles), the choice of F has a marked influence on WCI68. It was generally the case for threshold0.8 that those sampling schemes that resulted in small confidence intervals (s2, s3, s5, s6, and s7; see the previous section) were less affected by F than were those resulting in large confidence intervals (s1 and s4).

Two main conclusions can be drawn from these simulations. First, in the absence of any other constraints, experimenters should choose the threshold and slope measures corresponding to threshold0.5 and slope0.5, because only then are the main factors influencing the estimates of variability the number and placement of stimuli, as we would like them to be. Second, away from the midpoint of F, estimates of variability are not as independent of the distribution function chosen as one might wish, particularly for lower proportions correct (threshold0.2 is much more affected by the choice of F than threshold0.8; cf. Figures 7A and 7C). If very low (or, perhaps, very high) response thresholds must be used when comparing experimental conditions (for example, 60% or 90% correct in 2AFC) and only small differences exist between the conditions, a number of distribution functions F should be explored, to avoid finding spurious differences between conditions owing to an (arbitrary) choice of distribution function that happens to yield comparatively narrow confidence intervals.

Table 2
Summary of Analysis of Variance Effect Size (Sum of Squares [SS] Normalized to 100%)

Factor                                          Threshold0.2  Threshold0.5  Threshold0.8  Slope0.5
Number of trials N                                  76.24         87.30         65.14       72.86
Sampling schemes s1 … s7                             6.60         10.00         19.93       19.69
Distribution function F                             11.36          0.13          3.87        0.92
Error (numerical variability in bootstrap)           0.34          0.35          0.46        0.34
Interaction of N and sampling schemes s1 … s7        1.18          0.98          5.97        3.50
Sum of interactions involving F                      4.28          1.24          4.63        2.69
Percentage of SS accounted for without F            84.36         98.63         91.50       96.39

Note—The columns refer to threshold0.2, threshold0.5, threshold0.8, and slope0.5, that is, to approximately 60%, 75%, and 90% correct and the slope at 75% correct during two-alternative forced choice, respectively. Rows correspond to the independent variables, their interactions, and summary statistics. See the text for details.


[Figure 7 plot residue: four panels of normalized WCI68 (0–1.4) vs. total number of observations N (120–960); symbols mark sampling schemes, and gray levels code the smallest, mean, and largest WCI68 across the four distribution functions F.]

Figure 7. The four panels show the width of the WCI68 as a function of N and sampling scheme, as well as of the distribution function F. The symbols refer to the different sampling schemes, as in Figure 2; gray levels code the influence of the distribution function F on the width of the WCI68 for a given sampling scheme and N: Black symbols show the smallest WCI68 obtained, middle gray the mean WCI68 across the four distribution functions used, and white the largest WCI68. (A) WCI68 for threshold0.2. (B) WCI68 for threshold0.5. (C) WCI68 for threshold0.8. (D) WCI68 for slope0.5.

Bootstrap Confidence Intervals

In the existing literature on bootstrap estimates of the parameters and thresholds of psychometric functions, most studies use parametric or nonparametric plug-in estimates11 of the variability of a distribution J*. For example, Foster and Bischof (1997) estimate parametric (moment-based) standard deviations σ by the plug-in estimate s. Maloney (1990), in addition to s, uses a comparable nonparametric estimate, obtained by scaling plug-in estimates of the interquartile range so as to cover a confidence interval of 68.3%. Neither kind of plug-in estimate is guaranteed to be reliable, however: Moment-based estimates of a distribution's central tendency (such as the mean) or variability (such as s) are not robust; they are very sensitive to outliers, because a change in a single sample can have an arbitrarily large effect on the estimate. (The estimator is said to have a breakdown of 1/n, because that is the proportion of the data set that can have such an effect. Nonparametric estimates are usually much less sensitive to outliers; the median, for example, has a breakdown of 1/2, as opposed to 1/n for the mean.) A moment-based estimate of a quantity J might be seriously in error if only a single bootstrap estimate Ji* is wrong by a large amount. Large errors can and do occur occasionally, for example, when the maximum-likelihood search algorithm gets stuck in a local minimum on its error surface.12
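The breakdown argument is easy to demonstrate: replacing a single bootstrap replicate with a wild value (such as one produced by a fit stuck in a local minimum) destroys a moment-based spread estimate but barely moves a percentile-based one. A minimal sketch with simulated replicates:

```python
import numpy as np

rng = np.random.default_rng(2)
thresholds = rng.normal(10.0, 0.5, 999)      # well-behaved bootstrap replicates
corrupted = thresholds.copy()
corrupted[0] = 1e6                           # one fit stuck in a local minimum

def wci68(s):
    """Percentile-based 68% width (16th to 84th percentile)."""
    lo, hi = np.percentile(s, [16, 84])
    return hi - lo

# Moment-based spread is destroyed by the single outlier...
print(np.std(thresholds), np.std(corrupted))
# ...while the percentile-based 68% width barely moves.
print(wci68(thresholds), wci68(corrupted))
```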

Nonparametric plug-in estimates are also not without problems. Percentile-based bootstrap confidence intervals are sometimes significantly biased and converge slowly to the true confidence intervals (Efron & Tibshirani, 1993, chaps. 12–14, 22). In the psychological literature, this problem was critically noted by Rasmussen (1987, 1988).

Methods to improve convergence accuracy and avoid bias have received the most theoretical attention in the study of the bootstrap (Efron, 1987, 1988; Efron & Tibshirani, 1993; Hall, 1988; Hinkley, 1988; Strube, 1988; cf. Foster & Bischof, 1991, p. 158).

In situations in which asymptotic confidence intervals are known to apply and are correct, bias-corrected and accelerated (BCa) confidence intervals have been demonstrated to show faster convergence and increased accuracy over ordinary percentile-based methods, while retaining the desirable property of robustness (see, in particular, Efron & Tibshirani, 1993, p. 183, Table 14.2, and p. 184, Figure 14.3, as well as Efron, 1988; Rasmussen, 1987, 1988; and Strube, 1988).

BCa confidence intervals are necessary because the distribution of sampling points x along the stimulus axis may cause the distribution of bootstrap estimates θ* to be biased and skewed estimators of the generating values θ. The same applies to the bootstrap distributions of estimates J* of any quantity of interest, be it thresholds, slopes, or whatever. Maloney (1990) found that skew and bias particularly raised problems for the distribution of the b parameter of the Weibull, b* (N = 210, K = 7). We also found in our simulations that b*, and thus slopes s*, were skewed and biased for N smaller than 480, even using the best of our sampling schemes. The BCa method attempts to correct both bias and skew by assuming that an increasing transformation m exists to transform the bootstrap distribution into a normal distribution. Hence, we assume that φ̂ = m(θ̂) and φ = m(θ), resulting in

φ̂ ~ N(φ - z0·σφ, σφ²),    (2)

where

σφ = 1 + a(φ - φ0),    (3)

with φ0 any reference point on the scale of φ values. In Equation 3, z0 is the bias-correction term, and a in Equation 4 is the acceleration term. Assuming Equation 2 to be correct, it has been shown that an ε-level confidence interval endpoint of the BCa interval can be calculated as

θ̂BCa[ε] = G⁻¹(CG(z0 + (z0 + z(ε)) / (1 - a(z0 + z(ε))))),    (4)

where CG is the cumulative Gaussian distribution function, G⁻¹ is the inverse of the cumulative distribution function of the bootstrap replications θ*, z(ε) = CG⁻¹(ε), z0 is our estimate of the bias, and a is our estimate of acceleration. For details on how to calculate the bias and the acceleration term for a single fitted parameter, see Efron and Tibshirani (1993, chap. 22) and also Davison and Hinkley (1997, chap. 5); the extension to two or more fitted parameters is provided by Hill (2001b).
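A sketch of the BCa recipe of Equations 2–4 for a one-dimensional statistic, following Efron and Tibshirani (1993, chaps. 14 and 22). Nonparametric resampling of a sample mean stands in here for the paper's parametric bootstrap of fitted psychometric function parameters, and the helper name bca_interval is our own:

```python
import numpy as np
from statistics import NormalDist

nd = NormalDist()

def bca_interval(data, statistic, B=2000, level=0.68, seed=0):
    """BCa confidence interval for a scalar statistic (one-dimensional case)."""
    rng = np.random.default_rng(seed)
    theta_hat = statistic(data)
    boot = np.array([statistic(rng.choice(data, data.size, replace=True))
                     for _ in range(B)])
    # Bias correction z0: how far the bootstrap distribution sits from theta_hat.
    frac = float(np.clip(np.mean(boot < theta_hat), 1e-6, 1 - 1e-6))
    z0 = nd.inv_cdf(frac)
    # Acceleration a from the jackknife.
    jack = np.array([statistic(np.delete(data, i)) for i in range(data.size)])
    d = jack.mean() - jack
    a = (d ** 3).sum() / (6.0 * (d ** 2).sum() ** 1.5)
    # Corrected percentile levels (Equation 4), then read off the percentiles.
    endpoints = []
    for eps in ((1 - level) / 2, (1 + level) / 2):
        z = nd.inv_cdf(eps)
        adj = nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
        endpoints.append(np.percentile(boot, 100 * adj))
    return tuple(endpoints)

rng = np.random.default_rng(3)
sample = rng.normal(10.0, 2.0, 60)
lo, hi = bca_interval(sample, np.mean)
print(lo, hi)   # roughly the sample mean plus/minus one standard error
```

When the bootstrap distribution is unbiased and symmetric, z0 and a vanish and the BCa interval collapses to the ordinary percentile interval.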

For large classes of problems, it has been shown that Equation 2 is approximately correct and that the error in the confidence intervals obtained from Equation 4 is smaller than that introduced by the standard percentile approximation to the true underlying distribution, where an ε-level confidence interval endpoint is simply G⁻¹(ε) (Efron & Tibshirani, 1993). Although we cannot offer a formal proof that this is also true for the bias and skew sometimes found in bootstrap estimates from fits to psychophysical data, to our knowledge it has been shown only that BCa confidence intervals are either superior or equal in performance to standard percentiles, never that they perform significantly worse. Hill (2001a, 2001b) recently performed Monte Carlo simulations to test the coverage of various confidence interval methods, including the BCa and other bootstrap methods, as well as asymptotic methods from probit analysis. In general, the BCa method was the most reliable, in that its coverage was least affected by variations in sampling scheme and in N, and the least imbalanced (i.e., the probabilities of covering the true parameter value in the upper and lower parts of the confidence interval were roughly equal). The BCa method by itself was found to produce confidence intervals that are a little too narrow; this underlines the need for the sensitivity analysis described above.

CONCLUSIONS

In this paper, we have given an account of the procedures we use to estimate the variability of fitted parameters and of derived measures, such as the thresholds and slopes of psychometric functions.

First, we recommend the use of Efron's parametric bootstrap technique, because traditional asymptotic methods have been found to be unsuitable, given the small number of datapoints typically taken in psychophysical experiments. Second, we have introduced a practicable test of the bootstrap bridging assumption, or sensitivity analysis, that must be applied every time bootstrap-derived variability estimates are obtained, to ensure that variability estimates do not change markedly with small variations in the bootstrap generating function's parameters. This is critical because the fitted parameters θ̂ are almost certain to deviate at least slightly from the (unknown) underlying parameters θ. Third, we explored the influence of different sampling schemes (x) on both the size of one's confidence intervals and their sensitivity to errors in θ̂. We conclude that only sampling schemes including at least one sample at p ≥ .95 yield reliable bootstrap confidence intervals. Fourth, we have shown that the size of bootstrap confidence intervals is mainly influenced by x and N only if we choose, as threshold and slope, values around the midpoint of the distribution function F; particularly for low thresholds (threshold_0.2), the precise mathematical form of F exerts a noticeable and undesirable influence on the size of bootstrap confidence intervals. Finally, we have reported the use of BCa confidence intervals, which improve on parametric and percentile-based bootstrap confidence intervals, whose bias and slow convergence had previously been noted (Rasmussen, 1987).
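The recommended parametric bootstrap procedure — fit the function, simulate binomial data from the fit, and refit each simulated data set — can be sketched as follows. This is a deliberately minimal illustration (Python), not the constrained maximum-likelihood machinery used in the paper: a two-alternative forced-choice Weibull with β held fixed and α found by grid search; the stimulus values and grid limits are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
GRID = np.linspace(0.5, 2.0, 301)  # candidate alpha values (illustrative range)

def weibull(x, alpha, beta=3.0, gamma=0.5):
    """2AFC Weibull psychometric function with guess rate gamma (no lapses)."""
    return gamma + (1.0 - gamma) * (1.0 - np.exp(-(x / alpha) ** beta))

def fit_alpha(x, k, n):
    """Maximum-likelihood alpha by grid search (beta fixed for brevity)."""
    ll = [np.sum(k * np.log(weibull(x, a)) +
                 (n - k) * np.log(1.0 - weibull(x, a))) for a in GRID]
    return GRID[int(np.argmax(ll))]

def parametric_bootstrap_ci(x, k, n, B=400):
    """Percentile CI for alpha from B parametric bootstrap replications."""
    a_hat = fit_alpha(x, k, n)
    boot = [fit_alpha(x, rng.binomial(n, weibull(x, a_hat)), n)
            for _ in range(B)]
    return np.quantile(boot, [0.025, 0.975])
```

In practice one would also refit β under sensible Bayesian constraints (see note 7) and rerun the simulation at slightly perturbed generating parameters — the sensitivity analysis — before trusting the resulting interval.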

With this and our companion paper (Wichmann & Hill, 2001), we have covered the three central aspects of modeling experimental data: first, parameter estimation; second, obtaining error estimates on these parameters; and third, assessing goodness of fit between model and data.

REFERENCES

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (CBMS-NSF Regional Conference Series in Applied Mathematics). Philadelphia: Society for Industrial and Applied Mathematics.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-200.

Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36-48.

Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.

Finney, D. J. (1952). Probit analysis (2nd ed.). Cambridge: Cambridge University Press.

Finney, D. J. (1971). Probit analysis (3rd ed.). Cambridge: Cambridge University Press.

Foster, D. H., & Bischof, W. F. (1987). Bootstrap variance estimators for the parameters of small-sample sensory-performance functions. Biological Cybernetics, 57, 341-347.

Foster, D. H., & Bischof, W. F. (1991). Thresholds from psychometric functions: Superiority of bootstrap to incremental and probit variance estimators. Psychological Bulletin, 109, 152-159.

Foster, D. H., & Bischof, W. F. (1997). Bootstrap estimates of the statistical accuracy of thresholds obtained from psychometric functions. Spatial Vision, 11, 135-139.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16, 927-953.

Hill, N. J. (2001a, May). An investigation of bootstrap interval coverage and sampling efficiency in psychometric functions. Poster presented at the Annual Meeting of the Vision Sciences Society, Sarasota, FL.

Hill, N. J. (2001b). Testing hypotheses about psychometric functions: An investigation of some confidence interval methods, their validity, and their use in assessing optimal sampling strategies. Forthcoming doctoral dissertation, Oxford University.

Hill, N. J., & Wichmann, F. A. (1998, April). A bootstrap method for testing hypotheses concerning psychometric functions. Paper presented at Computers in Psychology, York, U.K.

Hinkley, D. V. (1988). Bootstrap methods. Journal of the Royal Statistical Society B, 50, 321-337.

Kendall, M. K., & Stuart, A. (1979). The advanced theory of statistics: Vol. 2. Inference and relationship. New York: Macmillan.

Lam, C. F., Mills, J. H., & Dubno, J. R. (1996). Placement of observations for the efficient estimation of a psychometric function. Journal of the Acoustical Society of America, 99, 3689-3693.

Maloney, L. T. (1990). Confidence intervals for the parameters of psychometric functions. Perception & Psychophysics, 47, 127-134.

McKee, S. P., Klein, S. A., & Teller, D. Y. (1985). Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Perception & Psychophysics, 37, 286-298.

Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.

Rasmussen, J. L. (1988). Bootstrap confidence intervals: Good or bad. Comments on Efron (1988) and Strube (1988) and further evaluation. Psychological Bulletin, 104, 297-299.

Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.

Treutwein, B. (1995, August). Error estimates for the parameters of psychometric functions from a single session. Poster presented at the European Conference of Visual Perception, Tübingen.

Treutwein, B., & Strasburger, H. (1999, September). Assessing the variability of psychometric functions. Paper presented at the European Mathematical Psychology Meeting, Mannheim.

Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.

NOTES

1. Under different circumstances and in the absence of a model of the noise and/or the process from which the data stem, this is frequently the best one can do.

2. This is true, of course, only if we have reason to believe that our model actually is a good model of the process under study. This important issue is taken up in the section on goodness of fit in our companion paper.

3. This holds only if our model is correct: Maximum-likelihood parameter estimation for two-parameter psychometric functions fitted to data from observers who occasionally lapse (that is, display nonstationarity) is asymptotically biased, as we show in our companion paper, together with a method to overcome such bias (Wichmann & Hill, 2001).

4. Confidence intervals here are computed by the bootstrap percentile method: The 95% confidence interval for α, for example, was determined simply by [α*(.025), α*(.975)], where α*(n) denotes the 100nth percentile of the bootstrap distribution α*.
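In code, the percentile method of this note amounts to nothing more than reading off two quantiles of the bootstrap distribution (Python sketch; the Gaussian sample is a stand-in for the bootstrap replications α*):

```python
import numpy as np

# stand-in for 2,000 bootstrap replications of a fitted parameter
boot = np.random.default_rng(3).normal(loc=1.0, scale=0.1, size=2000)

# 95% percentile interval: [alpha*(.025), alpha*(.975)]
ci95 = (np.quantile(boot, 0.025), np.quantile(boot, 0.975))
```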

5. 68% was chosen because this is the approximate coverage of the familiar standard error bar denoting one's original estimate ± 1 standard deviation of a Gaussian.

6. Optimal here, of course, means relative to the sampling schemes explored.

7. In our initial simulations, we did not constrain α and β other than to limit them to be positive (the Weibull function is not defined for negative parameters). Sampling schemes s1 and s4 were even worse, and s3 and s7 were, relative to the other schemes, even more superior under nonconstrained fitting conditions. The only substantial difference was the performance of sampling scheme s5: It does very well now, but its slope estimates were unstable during sensitivity analysis when β was unconstrained. This is a good example of the importance of (sensible) Bayesian assumptions constraining the fit. We wish to thank Stan Klein for encouraging us to redo our simulations with α and β constrained.

8. Of course, there might be situations in which one psychometric function using a particular distribution function provides a significantly better fit to a given data set than do others using different distribution functions. Differences in bootstrap estimates of variability in such cases are not worrisome: The appropriate estimates of variability are those of the best-fitting function.

9. WCI68 estimates were obtained using BCa confidence intervals, described in the next section.

10. Clearly, the above-reported effect sizes are tied to the ranges of the factors explored: N spanned a comparatively large range of 120–960 observations, or a factor of 8, whereas all of our sampling schemes were "reasonable"; inclusion of "unrealistic" or "unusual" sampling schemes (for example, all x values such that nominal y values are below 0.55) would have increased the percentage of variation accounted for by sampling scheme. Taken together, N and sampling scheme should be representative of most typically used psychophysical settings, however.

11. A straightforward way to estimate a quantity J, which is derived from a probability distribution F by J = t(F), is to obtain an estimate F̂ from empirical data and then use Ĵ = t(F̂) as an estimate. This is called a plug-in estimate.
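For instance (a toy illustration of the plug-in principle; the numbers are made up): if J = t(F) is the population median, the plug-in estimate applies the same functional t to the empirical distribution F̂, which yields simply the sample median.

```python
import numpy as np

sample = np.array([2.0, 3.5, 1.0, 4.0, 2.5])  # empirical data defining F-hat
J_hat = np.median(sample)                      # J-hat = t(F-hat), here the median
```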

12. Foster and Bischof (1987) report problems with local minima, which they overcame by discarding bootstrap estimates that were larger than 20 times the stimulus range (4.2% of their datapoints had to be removed). Nonparametric estimates naturally avoid having to perform post hoc data smoothing by their resilience to such infrequent but extreme outliers.

(Manuscript received June 10, 1998; revision accepted for publication February 27, 2001.)