

Risk Analysis, Vol. 34, No. 3, 2014 DOI: 10.1111/risa.12127

Evaluating Probabilistic Forecasts with Bayesian Signal Detection Models

Mark Steyvers,1,∗ Thomas S. Wallsten,2 Edgar C. Merkle,3 and Brandon M. Turner4

We propose the use of signal detection theory (SDT) to evaluate the performance of both probabilistic forecasting systems and individual forecasters. The main advantage of SDT is that it provides a principled way to distinguish the response from system diagnosticity, which is defined as the ability to distinguish events that occur from those that do not. There are two challenges in applying SDT to probabilistic forecasts. First, the SDT model must handle judged probabilities rather than the conventional binary decisions. Second, the model must be able to operate in the presence of sparse data generated within the context of human forecasting systems. Our approach is to specify a model of how individual forecasts are generated from underlying representations and use Bayesian inference to estimate the underlying latent parameters. Given our estimate of the underlying representations, features of the classic SDT model, such as the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC), follow immediately. We show how our approach allows ROC curves and AUCs to be applied to individuals within a group of forecasters, estimated as a function of time, and extended to measure differences in forecastability across different domains. Among the advantages of this method is that it depends only on the ordinal properties of the probabilistic forecasts. We conclude with a brief discussion of how this approach might facilitate decision making.

KEY WORDS: AUC; Bayesian methods; combining forecasts; evaluating forecasts; judgmental forecasting; probability forecasting; ROC analysis; signal detection theory

1. INTRODUCTION

A fundamental issue for forecasting binary events is to distinguish the occurrence of an event, E, from its nonoccurrence, ¬E.

1 Department of Cognitive Sciences, University of California, Irvine, CA, USA.

2 Department of Psychology, University of Maryland, College Park, MD, USA.

3 Department of Psychological Sciences, University of Missouri, Columbia, MO, USA.

4 Department of Psychology, Stanford University, Stanford, CA, USA.

∗ Address correspondence to Mark Steyvers, Department of Cognitive Sciences, 2316 Social & Behavioral Sciences Gateway Bldg., University of California, Irvine, CA 92697-5100, USA; [email protected].

Signal detection theory (SDT)(1–3) offers a natural framework to analyze the performance of forecasting systems and individual human forecasters in these situations. The key idea is to characterize the distributions that explain the forecasts associated with E ("signal") separately from the forecasts associated with ¬E ("noise"). Based on the underlying distributions, receiver operating characteristic (ROC) analysis can be used (as explained later) to derive measures of diagnosticity such as the area under the curve (AUC). The AUC measure assesses the ability to discriminate E from ¬E, unconfounded by response bias and base rates. In addition to characterizing forecasting performance, ROC analysis can be used to analyze performance under different decision thresholds and misclassification costs. SDT and ROC analyses have long been used to evaluate forecasting or diagnostic systems in a variety of areas (examples to follow), but have only recently been proposed in the prediction of uncertain world events.(4)

In this article, we will focus on the problem of evaluating individual human forecasters and aggregates of human forecasts related to uncertain world events such as the United Kingdom leaving the European Union, the winner of the 2012 South Korean presidential election, and Groupon declaring bankruptcy. Recent work in this area(4) assumes that the forecaster produces binary forecasts (e.g., "target event will occur"). However, forecasts for binary events are often uncertain and therefore accompanied by a confidence or probability estimate (e.g., there is a 70% chance that the event will occur within a specific time frame). One solution is to use rating-scale SDT models where a judgment is discretized into a number of ordered response bins.(3) We explore another solution in the form of a new type of SDT approach where we directly model the continuous judgments without assuming a discrete ordered scale.

Furthermore, in forecasts based on human judgment,(5) problems of data sparsity can arise that can complicate standard SDT analyses. For example, forecasters might choose to answer only a few forecasting problems or they might answer many. In addition, an individual who contributes probabilistic forecasts to only a few problems might also have an associated imbalance in the number of forecasts associated with E and ¬E. In an extreme situation, we might have a forecaster who contributes judgments to only problems that resolve as E or that resolve as ¬E. Data sparsity and imbalance can result in unstable parameter estimates, potentially leading one to draw inaccurate conclusions about the forecasters in question. Therefore, one should take care when estimating parameters and conducting ROC analyses in such cases.

Other challenges for the evaluation of human forecasts are that they might be dependent on the time of judgment relative to the resolution of the problem and/or on the forecasting domain. In the case of time, one might expect that forecast diagnosticity improves dynamically over time as the point of forecast resolution approaches. Similarly, one might expect that forecasting domains vary in inherent uncertainty, with some domains being more forecastable than others. Thus, a reasonable SDT analysis should account for differences in temporal dynamics as well as differences among forecasting domains.

To address these challenges, we introduce a new SDT modeling framework designed to handle judged probabilities rather than discrete decisions or confidence ratings. Our approach is to specify a model of how individual forecasting judgments are generated from underlying belief distributions, and use Bayesian inference to estimate the distributions. We will use the Beta distribution to model probabilistic judgments and refer to the overall approach as the Beta-SDT model. To address the data sparsity problem, we use a hierarchical Bayesian approach in which the belief distributions for individual forecasters are based on group-level belief distributions that characterize the commonalities across individuals. Hierarchical Bayesian SDT models have been used successfully to model individual differences as well as differences among items(6–8) and are useful in measuring individual performance in the context of sparse data.(9) We also show how the SDT approach can be extended to measure forecasting performance at different points in time relative to the closing date.5 Finally, we show how to account for domain-specific forecasting differences. Generally, one attractive feature of our hierarchical Beta-SDT approach is that we can flexibly incorporate forecaster-specific and problem-specific effects.

After the Beta-SDT distributions are estimated, standard ROC analysis can be applied to the inferred distributions in order to evaluate performance. ROC analysis had its roots in statistical decision theory, psychophysics, and radar engineering,(10,11) but is now used in many areas to evaluate predictive models, including artificial intelligence and machine learning,(12–14) medicine,(15–17) meteorology,(18–21) and other domains.(22) We will focus on the AUC as a summary statistic of the ROC curve. The AUC is a widely used measure for ranking and classification performance.(23) We use the AUC as a way to assess the ability of individual forecasters and forecasting systems to discriminate between E and ¬E. The advantages of the AUC as an index of diagnostic performance are outlined in Section 2.1.

An advantage of the Bayesian approach to analyzing probabilistic forecasts is that it naturally leads to distributions over the AUC value for a particular forecaster or forecasting method. These distributions can be expressed as credible intervals—the Bayesian version of a confidence interval—which is important when interpreting the performance of any predictive method.(24) For example, we might want to know which individual forecasters are performing better than chance or whether one forecaster is performing better (in terms of AUC) than another forecaster.

5 The closing date is the date at which the forecasting problem will resolve and is closely related to the forecasting horizon, i.e., the length of time into the future for which forecasts are to be prepared. In the data sets that we focus on, each forecasting problem is associated with a fixed closing date but individual forecasters can choose to forecast on any date before the closing date, effectively leading to different forecasting horizons across forecasters.

In the rest of the article, we first give a brief introduction to empirical ROC curves that can be used to calculate the AUC without using an SDT model. We discuss the potential problems of estimating the AUC through empirical ROC curves. We then introduce the Beta-SDT model for judged probabilities that gives a parametric model for ROC analysis. We discuss a number of Beta-SDT modeling variants that allow us to measure differences among individuals as well as forecasting domains, and we also specify a temporal SDT model that can measure changes in discrimination ability over time. We apply the models to data from a recent forecasting study,(5) comparing the inferred AUC values at the level of individual forecasters as well as aggregates of forecasting judgments. Finally, we compare the AUC to conventional measures of forecasting performance (i.e., the Brier score and its decompositions) and argue that the AUC is preferable to these measures if the goal is to measure diagnosticity.

2. EMPIRICAL ROC ANALYSIS

ROC analysis is used across many disciplines as a way to evaluate the diagnosticity of human judgments and artificial systems.(10–13,16,17,22,25) We will first provide a basic review of empirical ROC analysis in the context of forecasting systems, in which no assumptions are made about underlying belief distributions, and therefore no SDT assumptions are needed for the analysis. For a more in-depth tutorial on ROC analysis, we refer the reader to Refs. 23, 26, and 27.

In our forecasting context, individual forecasters and forecasting systems produce a probability judgment on a continuous scale about E. We can obtain binary classifications by comparing the probability judgments against a decision threshold. For all cases where the judgment exceeds or equals the threshold, we predict E, and ¬E otherwise. ROC analysis provides a graphical means to explore the tradeoff between hit rates and false alarm rates at various decision thresholds. The hit rate (also known as sensitivity) measures the proportion of those cases in which E occurs (or is true) that the decision maker (DM) (or system) forecasted E. The false alarm rate (also known as the complement of specificity, or 1 − specificity) measures the proportion of those cases in which E does not occur (i.e., ¬E is true) that the DM (or system) forecasted E. The ROC curve specifically traces the hit rates against the false alarm rates under decision thresholds that vary from the minimum to the maximum value (see Fig. 1). High decision thresholds lead to relatively few forecasts of E and therefore to low hit and false alarm rates; and conversely, low decision thresholds lead to many forecasts of E and correspondingly high hit and false alarm rates.
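To make this construction concrete, the sketch below (Python with NumPy; the function name and data layout are our own illustration, not code from the paper) sweeps the K + 1 thresholds over a set of forecasts and resolutions and collects the empirical hit rates, false alarm rates, and AUC.

```python
import numpy as np

def empirical_roc(forecasts, outcomes):
    """Trace an empirical ROC curve from probabilistic forecasts.

    forecasts: judged probabilities in [0, 1]
    outcomes:  resolutions (1 = E occurred, 0 = not-E occurred)
    Returns false alarm rates, hit rates, and the trapezoidal AUC.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=int)

    # K + 1 thresholds when there are K unique forecast values:
    # one threshold below everything, one above, and the midpoints in between.
    uniq = np.unique(forecasts)
    thresholds = np.concatenate(([0.0], (uniq[:-1] + uniq[1:]) / 2, [1.0 + 1e-9]))

    n_signal = outcomes.sum()
    n_noise = len(outcomes) - n_signal
    hits, fas = [], []
    for c in thresholds:
        predict_e = forecasts >= c  # predict E when the judgment reaches the threshold
        hits.append((predict_e & (outcomes == 1)).sum() / n_signal)
        fas.append((predict_e & (outcomes == 0)).sum() / n_noise)

    fas, hits = np.array(fas), np.array(hits)
    order = np.argsort(fas)
    fas, hits = fas[order], hits[order]
    auc = np.sum(np.diff(fas) * (hits[1:] + hits[:-1]) / 2)  # trapezoid rule
    return fas, hits, auc
```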

2.1. The AUC

The area under the ROC curve traced by the hit and false alarm rates as the decision threshold varies serves as a useful summary statistic of performance. A perfect AUC of 1 is achieved when all probability judgments associated with E are higher than judgments associated with ¬E. This perfect performance yields a single point at (0, 1), the resulting ROC consists of a line from (0, 0) to (0, 1) and another from (0, 1) to (1, 1), and as a consequence AUC = 1. A random guessing strategy (e.g., generating random probabilities between 0 and 1) leads to an AUC of 0.5.6 An AUC smaller than 0.5 is indicative of a system that is diagnostic but where better performance can be obtained by reporting the complement of the probability judgments. In many practical situations, the AUC lies between 0.5 and 1. A chance-corrected AUC is also known as the Gini coefficient,(28) G (i.e., G = 2AUC − 1, the area above the diagonal).

The AUC has a natural interpretation as a ranking measure: it is equivalent to the probability that a forecaster assigns a higher probability judgment to a randomly chosen E relative to a randomly chosen ¬E. Therefore, what matters are the ordinal relationships among the judged probabilities and not the numerical values of the probabilities per se. For example, if all forecasts associated with E and ¬E are judged with values of 0.51 and 0.49, respectively, this would lead to a perfect AUC value of 1 even though the forecasts are hardly separated quantitatively. In this example, what matters is that the forecasts associated with E are consistently higher than the forecasts associated with ¬E.

6 Note that while a random guessing strategy always leads to an AUC of 0.5, the converse is not necessarily true. An AUC of 0.5 together with an ROC curve close to the diagonal is required to indicate a random guessing strategy.
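Because the AUC equals the probability that a randomly chosen forecast for an E problem outranks a randomly chosen forecast for a ¬E problem, it can also be computed directly from pairwise comparisons (the Mann–Whitney statistic), without tracing the ROC curve. A minimal sketch with hypothetical array names:

```python
import numpy as np

def rank_auc(forecasts, outcomes):
    """AUC as P(forecast on a random signal trial > forecast on a random noise trial),
    counting ties as 1/2; identical to the area under the empirical ROC curve."""
    f = np.asarray(forecasts, dtype=float)
    x = np.asarray(outcomes, dtype=int)
    signal, noise = f[x == 1], f[x == 0]
    greater = (signal[:, None] > noise[None, :]).sum()
    ties = (signal[:, None] == noise[None, :]).sum()
    return (greater + 0.5 * ties) / (signal.size * noise.size)
```

Only the ordering of the judgments enters this computation, which is the sense in which the AUC depends only on the ordinal properties of the forecasts.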


Fig. 1. An example of constructing an empirical ROC curve from probabilistic forecasting judgments. The probabilistic judgments from 20 problems are arranged in order of magnitude along with the associated resolution (1 = E is true, 0 = ¬E is true). The ROC plot shows the hit rate against the false alarm rate under a variety of decision thresholds varying from 0 to 1. One practical method considers only K + 1 thresholds when there are K unique values observed in the forecasts. In this example, this method leads to 21 possible thresholds. Five example decision thresholds are shown in (a) with corresponding ROC points in (b). One way to derive the AUC value is by calculating the area under the curve shown in gray. For this example, the AUC is 0.91.

Any AUC value higher than 0.5 indicates that the forecaster is performing above chance in discriminating between E and ¬E.

One attractive feature of the AUC is that it is independent of any threshold required for making decisions (i.e., deciding whether a forecast should lead to an E or ¬E decision based on the judged probability). This feature is useful in deployment contexts where the costs of misclassifications (e.g., incorrectly predicting the occurrence of an event or missing the actual occurrence of an event) cannot be determined in advance, and a measure is needed that averages over a wide range of misclassification costs and associated decision thresholds. From another perspective, given a well-established ROC, the DM can estimate hit and false alarm probabilities associated with any forecast, combine them with her cost-benefit matrix, and make a decision accordingly. Another attractive feature of the AUC is that it is independent of the extent to which forecasts are calibrated. As we indicated above, any strictly monotonic transformation of the probabilistic forecasts will yield the same AUC. What matters are the ordinal relationships among the probabilities assigned to E and ¬E. This is important because statistical techniques can be used to calibrate individual forecasters,(29) but the fundamental measure of interest is the discrimination ability of individual forecasters. Indeed, because it is only the ordinal relationships among the judgments that are needed for the AUC, judged probabilities are not required. A discrete ordered scale also would do. For example, a seven-category rating scale might be used with ratings such as very unlikely, unlikely, somewhat unlikely, 50-50, somewhat likely, likely, and very likely. Therefore, the same measure of diagnosticity can be applied to both probabilistic forecasts based on probability judgments and judgments based on a discrete ordered scale.

3. BETA-SDT MODELS FOR PROBABILISTIC FORECASTS

In the basic SDT approach to forecasting of dichotomous outcomes, all resolved forecasting problems are separated into two classes: problems where the event of interest occurred (E, or "signal" trials) and problems where the event did not occur (¬E, or "noise" trials). Prior applications of SDT to forecasting(4,18–20,30) have assumed that the forecasting system produces discrete judgments in the form of warnings—either a warning is produced about an upcoming event or it is not. Models for this case have often taken the standard Gaussian SDT approach where internal evidence values are sampled from latent Gaussian distributions. These values are then compared against a decision threshold. For cases where the evidence exceeds the threshold, a warning is issued. It is important to note that although the vast majority of SDT models make Gaussian assumptions, the underlying theory of signal detection does not depend on the form of the distributions and other types of distributions are possible.(31) In principle, the standard SDT approach can be extended to probability judgments on a rating scale by assuming that the judgments fall into a predetermined number of ordered categories (e.g., 0–0.1, 0.1–0.2, etc.), which requires multiple internal response criteria to be estimated.

Fig. 2. Illustration of the Beta-SDT model and ROC analysis. In the SDT analysis, all resolved forecasting problems are separated into two classes: problems where the event of interest occurred (E; "signal" trials) and problems where the event did not occur (¬E; "noise" trials). The probabilistic forecasts in these two classes are modeled by two separate Beta distributions, f1 and f0 (a). The ROC plot in (b) traces the hit and false alarm rates for all possible decision thresholds. Five points are marked corresponding to the thresholds shown in (a). The AUC value corresponds to the area under the curve shown in gray. In this example, the AUC is 0.82.

The key difference in our approach is that we model the judgments as arising from a simple family of distributions based on the Beta distribution, as opposed to the normal. In addition, our approach does not require any response criteria to be estimated. Fig. 2 illustrates the basic approach. The probabilistic forecasts for the two classes of problems are modeled by two separate probability distributions. We assume that f1(y) and f0(y) describe the density functions for probabilistic forecasts y associated with signal and noise trials, respectively. Just as in standard SDT, the parameters that govern the shape and location of the signal and noise distribution are latent and need to be estimated on the basis of the observed data. The main difference is that in standard SDT, the observed data consist of discrete judgments or rating scales whereas in our approach, the observed data consist of continuous probability judgments. These judgments are assumed to be the direct result of a sampling process from the underlying signal and noise distribution. In standard SDT, the internal evidence values sampled from the signal and noise distributions are first compared to a decision or rating scale threshold in order to generate a discrete judgment.

The ROC plot, then, is simply a plot of the decumulative distribution (i.e., one minus the cumulative distribution) under f1(y) as a function of the decumulative under f0(y), calculated for each point from 0 to 1 on the probability judgment axis. The former is the hit rate, the probability of calling a signal a signal, and the latter is the false alarm rate, the probability of calling a nonsignal (noise) a signal. Note that this process is conceptually identical to that used to create ROC plots in standard SDT, in which the hit and false alarm rates are the areas to the right of the threshold under the signal and noise distributions, respectively, for any given threshold location.
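As an illustration of this construction (our own sketch; the Beta parameters are arbitrary rather than estimated from data), the model-implied ROC curve can be traced by evaluating the decumulative (survival) functions of the two fitted Beta distributions over a grid of thresholds:

```python
import numpy as np
from scipy.stats import beta

# Hypothetical signal and noise belief distributions.
f1 = beta(a=3.0, b=1.5)   # forecasts on signal trials (E occurred)
f0 = beta(a=1.5, b=3.0)   # forecasts on noise trials (not-E occurred)

thresholds = np.linspace(0.0, 1.0, 501)
hit_rate = f1.sf(thresholds)          # P(forecast >= c | signal)
false_alarm_rate = f0.sf(thresholds)  # P(forecast >= c | noise)

# Area under the model-implied ROC curve (trapezoid rule over false alarm rate).
order = np.argsort(false_alarm_rate)
fa, hr = false_alarm_rate[order], hit_rate[order]
auc = np.sum(np.diff(fa) * (hr[1:] + hr[:-1]) / 2)
print(round(float(auc), 3))
```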

In the remainder of this section, we describe how we can model the underlying belief distribution $f_k$ with a Beta distribution with separate parameters for the signal (k = 1) and the noise (k = 0) distributions. The Beta distribution7 is a natural distribution for modeling probabilistic forecasts(31) because it inherently produces values bounded between 0 and 1.

We will describe three versions of Beta-SDT models for the evaluation of probabilistic forecasts. The first model applies to situations where a number of individual forecasters are forecasting on overlapping sets of problems and the diagnosticity of each individual forecaster needs to be assessed. In this approach, we use a hierarchical Bayesian model to address data sparsity problems, in which some individual forecasters produce judgments on a small number of problems and some problems have only a small number of forecasts. The second model extends the hierarchical approach to model differences among forecasting domains and the time of judgment relative to the forecasting horizon. In this approach, the signal and noise distributions are made functionally dependent on the time of judgment as well as the forecasting domain, allowing us to estimate how fast diagnosticity improves as forecasters come closer to the forecasting horizon and more accurate information becomes available. Finally, the third model, achieved by removing the hierarchical component from the first model, applies to situations where only a single forecasting system or individual forecaster is evaluated and there are no issues of data sparsity.

7 The Beta distribution has the density function $f_k(y \mid a_k, b_k) = \frac{1}{B(a_k, b_k)}\, y^{a_k - 1} (1 - y)^{b_k - 1}$, where $B(a, b)$ is the Beta function, $B(a, b) = \int_0^1 y^{a-1}(1-y)^{b-1}\, dy$.

3.1. Basic Model

The hierarchical model handles data sparsity by associating judges with their own signal and noise distributions, pooling statistical strength across judges, and estimating the distributions at the individual judge level.

To introduce notation, let N be the number of resolved events and M be the number of judges. Let $y_{i,j}$ represent the probabilistic forecast of the ith judge for the jth event, where $i \in \{1, \ldots, M\}$ and $j \in \{1, \ldots, N\}$. Let $x_j$ represent the coded outcome for Event j, setting $x_j = 0$ if the jth event did not occur (corresponding to a noise trial) and $x_j = 1$ if the jth event did occur (corresponding to a signal trial).

Each forecast $y_{i,j}$ is assumed to arise from one of two Beta distributions (corresponding to the signal and noise distributions) depending on the outcome $x_j$:

$$
y_{i,j} \sim \begin{cases} \mathrm{Beta}(a_{i,0},\, b_{i,0}), & \text{if } x_j = 0,\\ \mathrm{Beta}(a_{i,1},\, b_{i,1}), & \text{if } x_j = 1. \end{cases} \tag{1}
$$

The parameters $a_{i,k}$ and $b_{i,k}$, where $k \in \{0, 1\}$, determine the shapes of the Beta distributions for the signal and noise conditions. In the model, we determine these parameters in terms of mean and precision parameters:

$$
a_{i,k} = \mu_{i,k} \exp(\xi_{i,k}), \qquad b_{i,k} = (1 - \mu_{i,k}) \exp(\xi_{i,k}). \tag{2}
$$

The first parameter $\mu$, $0 < \mu < 1$, is the mean of the distribution. The second parameter $\xi$ is a precision parameter that determines how concentrated the distribution is around the mean. At the next level, we assume that the means of the individual judge distributions are sampled from a logit-normal distribution:

$$
\mathrm{logit}(\mu_{i,k}) \sim N(\mu^*_k, \phi^*_k), \tag{3}
$$

where $\mu^*_k$ and $\phi^*_k$ are the mean and precision of this group-level distribution.

In Equation (2), it is convenient to use an exponential transformation on $\xi$. This allows us to model the precision parameters $\xi_{i,k}$ by a normal distribution:

$$
\xi_{i,k} \sim N(\delta^*_k, \tau^*_k). \tag{4}
$$

To complete the model, we need to specify prior probability distributions on the parameters $\mu^*_k$, $\phi^*_k$, $\delta^*_k$, and $\tau^*_k$. One approach would be to inform our priors by previous experimental research or theory to find plausible ranges for these parameters. However, we are unaware of prior research (or theory). Therefore, we set the priors with the goal to be as uninformative as possible about the means and precisions of the probabilistic forecasts consistent with computational simplicity. Specifically, we use normal priors on the means and inverse Gamma (IG) priors on the precisions:

$$
\mu^*_k \sim N(0, 0.35), \tag{5}
$$
$$
\phi^*_k \sim \mathrm{IG}(1, 1), \tag{6}
$$
$$
\delta^*_k \sim N(2, 0.01), \tag{7}
$$
$$
\tau^*_k \sim \mathrm{IG}(1, 1). \tag{8}
$$

The N(0, 0.35) prior on $\mu^*_k$ in Equation (5) was chosen such that the means $\mu_{i,k}$ in the probability scale (i.e., after the inverse logit transform in Equation (3)) are approximately uniform. The IG(1, 1) priors in Equations (6) and (8) are noninformative priors for precision parameters. The prior on $\delta^*_k$ places most density on low precisions (high variances). This is useful because previous research has found the Beta distribution's precision parameter to generally be overestimated.(32) However, this distribution also places enough density on high precisions to counteract the density on low precisions, resulting in a minimally informative prior. For similar priors in other Beta-distributed models, see Ref. 33. Taken together, these priors only provide minimal information about $\mu_{i,k}$ and $\xi_{i,k}$.
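The generative structure of Equations (1)–(8) can be made concrete with a short simulation. The sketch below is our own illustration, not the authors' estimation code (estimation was done with MCMC in JAGS); following the text, normal distributions are read as mean/precision pairs, and the group-level values passed in are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_judges(n_judges, n_events, base_rate, mu_star, phi_star, delta_star, tau_star):
    """Simulate forecasts from the hierarchical Beta-SDT model, Equations (1)-(4).

    mu_star, phi_star, delta_star, tau_star: length-2 arrays indexed by
    k = 0 (noise) and k = 1 (signal); precision = 1 / variance.
    Returns event outcomes x (length n_events) and forecasts y (n_judges x n_events).
    """
    x = rng.binomial(1, base_rate, size=n_events)                              # resolutions
    logit_mu = rng.normal(mu_star, 1 / np.sqrt(phi_star), size=(n_judges, 2))  # Eq. (3)
    mu = 1 / (1 + np.exp(-logit_mu))
    xi = rng.normal(delta_star, 1 / np.sqrt(tau_star), size=(n_judges, 2))     # Eq. (4)
    a = mu * np.exp(xi)                                                        # Eq. (2)
    b = (1 - mu) * np.exp(xi)
    y = rng.beta(a[:, x], b[:, x])                                             # Eq. (1)
    return x, y

x, y = simulate_judges(n_judges=50, n_events=176, base_rate=0.22,
                       mu_star=np.array([-1.0, 0.5]), phi_star=np.array([2.0, 2.0]),
                       delta_star=np.array([2.0, 2.0]), tau_star=np.array([1.0, 1.0]))
```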

3.2. Incorporating Temporal and Domain Effects

We can extend the hierarchical model by incorporating the effects of the time of judgment and forecasting domain. With respect to forecasting domain, we assume that it is easier to provide accurate forecasts in some domains than in others and so they will be associated with higher diagnosticity, and therefore greater area under the ROC. With respect to the role of time, we expect that forecasts made closer to the forecasting horizon will be more accurate than those made earlier in time.


We incorporate time effects by allowing the mean $\mu$ of the signal and noise distributions in Equation (3) to vary as a function of time. We consider growth functions of the form $\mathrm{logit}(\mu) = b + \exp(-\gamma t)(a - b)$, where t is the time at which the judge forecasts the event, expressed in days before resolution, a is the intercept when t = 0, b is the asymptotic value as t approaches infinity, and $\gamma$ is a scaling parameter. There are many ways to parameterize a, b, and $\gamma$ on the basis of item, person, and outcome differences.
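A small sketch of this growth function (our own; the parameter values are arbitrary) shows how the signal and noise means move apart as the number of days to resolution shrinks:

```python
import numpy as np

def growth_mu(t_days, a, b, gamma):
    """Mean of a Beta-SDT component on the probability scale as a function of days
    before resolution, using logit(mu) = b + exp(-gamma * t) * (a - b)."""
    logit_mu = b + np.exp(-gamma * t_days) * (a - b)
    return 1 / (1 + np.exp(-logit_mu))

t = np.array([60.0, 30.0, 10.0, 1.0, 0.0])
# Signal mean climbs toward its t = 0 value of logit^-1(1.5) ~ 0.82;
# noise mean drops toward logit^-1(-1.5) ~ 0.18.
print(growth_mu(t, a=1.5, b=0.2, gamma=0.1))    # signal trials
print(growth_mu(t, a=-1.5, b=-0.2, gamma=0.1))  # noise trials
```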

To introduce notation, let K be the number of forecasting domains (K < N), and $c_j \in \{1, \ldots, K\}$ represent the forecasting domain for the jth event. In the hierarchical model, we replace Equation (3) with:

$$
\mathrm{logit}(\mu_{i,j}) = b_{1,i,x_j} + b_{3,c_j,x_j} + \exp(-\gamma_{c_j} t_{i,j}) \left( b_{0,i,x_j} + b_{2,c_j,x_j} - b_{1,i,x_j} - b_{3,c_j,x_j} \right),
$$
$$
b_{0,i,x_j} \sim N(\mu_{0,x_j}, \phi_{0,x_j}), \qquad b_{1,i,x_j} \sim N(\mu_{1,x_j}, \phi_{1,x_j}), \tag{9}
$$
$$
b_{2,c_j,x_j} \sim N(\mu_{2,x_j}, \phi_{2,x_j}), \qquad b_{3,c_j,x_j} \sim N(\mu_{3,x_j}, \phi_{3,x_j}),
$$

where $t_{i,j}$ is the time at which Judge i forecasted Event j, expressed in days before resolution. The b1 and b3 parameters, respectively, account for forecaster and domain differences in the intercept of the growth function, while the b0 and b2 parameters, respectively, account for forecaster and domain differences in the asymptote of the growth function. Note that in this parameterization, the scaling variable $\gamma_k$ in the growth function depends on the forecasting domain k, allowing for differences in the temporal dynamics of diagnosticity across domains.

To complete the model, we place uninformative priors on the means and precisions of the b parameters:

$$
\mu_{0,x_j} \sim N(0, 0.35), \quad \mu_{1,x_j} \sim N(0, 0.35), \quad \mu_{2,x_j} \sim N(0, 0.35), \quad \mu_{3,x_j} \sim N(0, 0.35),
$$
$$
\phi_{0,x_j} \sim \mathrm{IG}(1, 1), \quad \phi_{1,x_j} \sim \mathrm{IG}(1, 1), \quad \phi_{2,x_j} \sim \mathrm{IG}(1, 1), \quad \phi_{3,x_j} \sim \mathrm{IG}(1, 1). \tag{10}
$$

Note that these priors are equivalent to the priors used in Equations (5) and (6) and lead to an approximately uniform distribution (a priori) over the b parameters.

We also place a uniform prior on the temporal scaling variable $\gamma_k$:

$$
\gamma_k \sim \mathrm{Uniform}(0.0001, 0.75). \tag{11}
$$

This prior allows for a flexible range in the temporal scaling of the growth function.

3.3. Nonhierarchical Model

Finally, we can also evaluate the diagnosticity of a single forecaster in situations where data sparsity is not an issue. In this case, we can create a nonhierarchical variant of the basic model. To simplify notation, we can omit the index for judges and let $y_j$ represent the probabilistic forecast for event j, where $j \in \{1, \ldots, N\}$. Each forecast $y_j$ is sampled from either the signal or the noise distribution depending on the outcome $x_j$:

$$
y_j \sim \begin{cases} \mathrm{Beta}\!\left( \mu_0 \exp(\xi_0),\; (1 - \mu_0) \exp(\xi_0) \right), & \text{if } x_j = 0,\\ \mathrm{Beta}\!\left( \mu_1 \exp(\xi_1),\; (1 - \mu_1) \exp(\xi_1) \right), & \text{if } x_j = 1. \end{cases} \tag{12}
$$

Instead of specifying group-level distributions as in the basic model, we can complete the model by placing priors on the means and precisions $\mu_k$ and $\xi_k$, respectively. As before, we place a normal prior on the logit of $\mu_k$ and a normal prior on $\xi_k$:

$$
\mathrm{logit}(\mu_k) \sim N(0, 0.35), \qquad \xi_k \sim N(2, 0.01). \tag{13}
$$
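The paper estimates this model by MCMC in JAGS; purely as an illustration of the same likelihood and priors, the sketch below computes a maximum a posteriori (MAP) point estimate with SciPy. We read N(m, p) as a mean/precision pair, consistent with the text above; the optimization setup and function names are our own.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import beta, norm

def fit_beta_sdt_map(forecasts, outcomes):
    """MAP estimates of (mu_k, xi_k), k = 0 (noise) and k = 1 (signal),
    under Equations (12)-(13). Returns Beta parameters (a_k, b_k) per class."""
    y = np.clip(np.asarray(forecasts, dtype=float), 1e-4, 1 - 1e-4)
    x = np.asarray(outcomes, dtype=int)

    def neg_log_posterior(params):
        logit_mu, xi = params[:2], params[2:]        # (noise, signal) pairs
        mu = 1 / (1 + np.exp(-logit_mu))
        a, b = mu * np.exp(xi), (1 - mu) * np.exp(xi)
        loglik = beta.logpdf(y, a[x], b[x]).sum()
        # Priors: logit(mu_k) ~ N(0, precision 0.35), xi_k ~ N(2, precision 0.01).
        logprior = (norm.logpdf(logit_mu, 0.0, 1 / np.sqrt(0.35)).sum()
                    + norm.logpdf(xi, 2.0, 1 / np.sqrt(0.01)).sum())
        return -(loglik + logprior)

    res = minimize(neg_log_posterior, x0=np.array([0.0, 0.0, 1.0, 1.0]),
                   method="Nelder-Mead")
    logit_mu, xi = res.x[:2], res.x[2:]
    mu = 1 / (1 + np.exp(-logit_mu))
    return {k: (mu[k] * np.exp(xi[k]), (1 - mu[k]) * np.exp(xi[k])) for k in (0, 1)}
```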

3.4. ROC and AUC Analysis with Beta-SDT Models

Once the signal and noise densities ($f_1$ and $f_0$, respectively) are estimated in any of the three modeling approaches we described, the procedure for calculating the hit ($F_1$) and false alarm ($F_0$) rates involves integrating over the signal and noise densities $f_1$ and $f_0$:

$$
F_1 = \int_c^1 f_1(y)\, dy, \tag{14}
$$
$$
F_0 = \int_c^1 f_0(y)\, dy, \tag{15}
$$

where c is a point on the probability forecast axis. Based on the signal and noise densities or the hit and false alarm rates, the AUC can then be computed(24) by calculating:

$$
\mathrm{AUC} = \iint_{y_1 > y_2} f_1(y_1) f_0(y_2)\, dy_1\, dy_2 = \int_0^1 F_1(y)\, dF_0(y). \tag{16}
$$

For the Beta distribution, the AUC has an analytic expression if the parameters of the Beta distribution have integer values.(31) This does not apply in our situation, and we have to use numerical integration to calculate Equations (14)–(16).
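A minimal sketch of that numerical calculation (our own; the parameter values are arbitrary), evaluating Equation (16) as P(Y1 > Y0) and cross-checking it with a Monte Carlo estimate:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

# Illustrative signal (k = 1) and noise (k = 0) Beta parameters.
a1, b1, a0, b0 = 2.3, 1.4, 1.2, 2.8

# Equation (16): AUC = P(Y1 > Y0) = integral of f0(y) * P(Y1 > y) over y in [0, 1].
auc, _ = quad(lambda y: beta.pdf(y, a0, b0) * beta.sf(y, a1, b1), 0.0, 1.0)

# Monte Carlo check with independent draws from the two distributions.
rng = np.random.default_rng(1)
y1 = rng.beta(a1, b1, size=200_000)
y0 = rng.beta(a0, b0, size=200_000)
print(round(auc, 4), round(float(np.mean(y1 > y0)), 4))
```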

4. DATA AND MODEL ESTIMATION

To evaluate the signal detection modeling approach for probabilistic forecasting judgments, we use data from a total of 1,309 participants collected by the aggregative contingent estimation system (ACES), a large-scale project for collecting and combining forecasts of many widely dispersed individuals (http://www.forecastingace.com/aces). A preliminary description of the data-collection procedure can be found in Ref. 5. Volunteer participants were asked to estimate the probability of various future events' occurrences, such as the outcome of presidential elections in Taiwan and the potential of a downgrade of Greek sovereign debt. Participants were free to log on to the website at their convenience and forecast any items of interest. A median of 52 forecasters contributed to each forecasting problem. For this article, we focused on a subset of 176 resolved binary forecasting problems. The forecasting problems were categorized a priori into K = 5 forecasting domains: politics and policy (N = 82), business and economy (N = 39), science and technology (N = 16), military and security (N = 23), and sports, health, and social (N = 16). All forecasting problems involved a standard way of framing the event and were presented in the form: Will event A happen by date B? This last constraint excluded a small number of events from the current analysis where the event was framed in terms of a deviation from the status quo (e.g., will A remain true by date B?). In this data set, 39 of the 176 events happened before the closing date ($x_j = 1$), such that the base rate of event occurrence is $\bar{x} = 0.22$.

4.1. Parameter Inference

We used JAGS(34) to estimate the joint posterior distribution of each set of model parameters in the Beta-SDT models. For each model, we obtained 1,000 samples from the joint posterior after a burn-in period of 1,000 samples, and we also collapsed across seven chains.

4.2. Performance Measures

We use a number of measures to evaluate forecasting performance, including AUC, Brier scores, and measures of (mis)calibration, as explained below. For each of these measures, we evaluate individual forecasters as well as aggregates of probabilistic forecasts. To simplify notation, we omit the indexing over individuals and the resulting value of the performance statistic refers to a specific forecaster or aggregation method.

4.2.1. AUC

One of the advantages of the posterior sampling approach in the Beta-SDT models is that we can infer distributions of the AUC value. We obtain these distributions by calculating the estimated signal and noise densities $f_0$ and $f_1$ for each posterior parameter sample. We can then calculate Equations (14)–(16) for each posterior sample in order to get distributions over the AUC value. From these distributions, we will derive the 95% credible intervals. We will also calculate the empirical AUC value as explained in Section 2 when sufficient data are available. This allows us to compare the results from the Beta-SDT procedure with the results obtained through standard empirical ROC analysis.

4.2.2. Global Calibration Offset (GCO)

A previous study using the same data found that many individual forecasters are poorly calibrated and systematically overestimate the likelihood of future events.(29) We will evaluate the calibration of forecasters with a single measure that assesses the degree to which forecasters overestimate the likelihood of future events. Our measure, which we call GCO, measures the log-difference between the mean (expected) probabilistic forecasts, derived from the Beta-SDT model, and the base rate of events. Specifically, the GCO is based on the contrast between the expectation of probabilistic forecasts $\bar{y}$ derived from the Beta-SDT model and the base rate of events $\bar{x}$:

$$
\mathrm{GCO} = \log(\bar{y}/\bar{x}) = \log\!\left( \left( \bar{x}\, \bar{y}_1 + (1 - \bar{x})\, \bar{y}_0 \right) / \bar{x} \right). \tag{17}
$$


This definition of GCO assumes that $\bar{x}$ can never be zero and that any processes that lead to over- and underestimation have multiplicative effects on the judged probabilities.8 Note also that in the second expression, the expectation of the probabilistic forecast is based on the empirical base rates of event occurrence—it is simply the average of the (inferred) means of the signal and noise distributions weighted by the empirical base rates of event occurrence. Overall, a GCO value is greater (less) than zero if the expected probabilistic forecasts are greater (smaller) than the empirical base rate. A GCO value of zero indicates that there are neither overestimation nor underestimation errors.
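A compact sketch of Equation (17) (our own; the numbers below are purely illustrative), using the inferred means of the signal and noise Beta distributions:

```python
import numpy as np

def gco(base_rate, mean_signal, mean_noise):
    """Global calibration offset, Equation (17): log ratio of the expected forecast
    (base-rate-weighted mixture of the inferred signal and noise means) to the base rate."""
    expected_forecast = base_rate * mean_signal + (1 - base_rate) * mean_noise
    return np.log(expected_forecast / base_rate)

# With a base rate of 0.22 and inferred means of 0.55 (signal) and 0.35 (noise),
# the expected forecast exceeds the base rate, so GCO > 0 (overestimation).
print(round(float(gco(0.22, 0.55, 0.35)), 3))
```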

4.2.3. Brier Scores and Mean Predictive Error (MPE)

Finally, we will also evaluate models through use of the Brier score.(35–37) Because all events involved only two outcomes, the Brier score for the jth event can be expressed as:

$$
B_j = (x_j - y_j)^2, \tag{18}
$$

where $y_j$ is the probabilistic forecast for Event j and $x_j$ is the resolution of the jth event. Thus, in this definition of the Brier score, the best score $B_j$ is zero, and the worst possible score is one. After the Brier scores are obtained for each event, we compute the MPE by averaging the Brier scores across the number of events N:

$$
\mathrm{MPE} = \frac{1}{N} \sum_{j=1}^{N} B_j. \tag{19}
$$
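For completeness, the corresponding computation is trivial (a sketch with hypothetical inputs):

```python
import numpy as np

def mpe(forecasts, outcomes):
    """Mean predictive error, Equations (18)-(19): average squared difference
    between the probabilistic forecasts and the binary resolutions."""
    y = np.asarray(forecasts, dtype=float)
    x = np.asarray(outcomes, dtype=float)
    return float(np.mean((x - y) ** 2))

print(mpe([0.7, 0.2, 0.9, 0.4], [1, 0, 1, 1]))  # 0.125
```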

4.3. Forecast Aggregation Methods

There exists a large body of work focused on the use of statistical models for combining individual subjective probability judgments into a single probability estimate.(38–42) A simple form of aggregation, namely, the unweighted linear average, has proven to be effective in many situations.(35) The goal of this article is not to propose novel aggregation methods for probabilistic forecasts. Instead, we investigate a number of simple aggregation procedures(29) that allow us to highlight the effects of aggregation on diagnosticity and GCO.

8 An alternative definition of GCO could be based on the difference between $\bar{y}$ and $\bar{x}$.

- ULinOP. The unweighted linear opinion pool (ULinOP) is simply the unweighted average of probabilistic forecasts across judges. Thus, predictions $\lambda_j$ are obtained by evaluating $\lambda_j = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{i,j}$, where $n_j$ is the number of responses obtained on event j.

- Calibrated ULinOP. The ULinOP is not necessarily calibrated(29,42) and can be associated with systematic forecasting errors. A simple aggregation method is to recalibrate the unweighted average using a monotonic transformation function f such that $\lambda_j = f\!\left( \frac{1}{n_j} \sum_{i=1}^{n_j} y_{i,j} \right)$. In this procedure, we chose the linear in log-odds transformation function(43) to recalibrate the unweighted average, where $f(p) = \delta p^{\gamma} / \left( \delta p^{\gamma} + (1 - p)^{\gamma} \right)$, and $\gamma$ and $\delta$ are parameters. We followed the procedures of Ref. 29 to estimate these parameters for our data set.

- Calibrated Time-Weighted Average. Another procedure is to take the time of judgment relative to the forecasting horizon into account. The idea is to upweight forecasts closer to the forecasting horizon as these are expected, on average, to be more accurate. We implemented this in a weighted averaging scheme $\lambda_j = f\!\left( \sum_{i=1}^{n_j} w_{i,j} y_{i,j} \,/\, \sum_{i=1}^{n_j} w_{i,j} \right)$, where the weight for each judgment is based on an exponential decay function, $w_{i,j} \propto \exp(-t_{i,j}\, c)$, $t_{i,j}$ is the time of judgment expressed in number of days before the forecasting horizon, and c is a scaling constant.

- Guess Baseline. Our final forecasting procedure does not involve aggregating the forecasts at all, but instead relies on the base rate of event occurrence such that $\lambda_j = \bar{x}$. Therefore, using this method, a constant probabilistic forecast is used across all forecasting problems. This method results in zero GCO because the average forecasted probability exactly matches the base rate. Even though this particular method is not psychologically interesting (as it does not rely on human judgment) and computationally simplistic (there is no procedure for estimating the base rate in an online fashion), it is helpful for illustration purposes because it is associated, by definition, with zero diagnosticity (i.e., an AUC value of 0.5) and has zero GCO.

A code sketch of these four procedures follows.
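The sketch below (our own; the δ, γ, and c values are placeholders rather than the estimates obtained by the procedures of Ref. 29) implements the four aggregation rules for a single event:

```python
import numpy as np

def llo(p, delta, gamma):
    """Linear in log-odds recalibration: f(p) = delta*p^gamma / (delta*p^gamma + (1-p)^gamma)."""
    return delta * p**gamma / (delta * p**gamma + (1 - p)**gamma)

def aggregate_event(forecasts, days_to_horizon, base_rate, delta=1.0, gamma=2.0, c=0.05):
    """Return the four aggregate forecasts for one event.

    forecasts:        probability judgments contributed to this event
    days_to_horizon:  time of each judgment in days before the closing date
    base_rate:        empirical base rate of event occurrence
    """
    y = np.asarray(forecasts, dtype=float)
    t = np.asarray(days_to_horizon, dtype=float)

    ulinop = y.mean()                          # unweighted linear opinion pool
    calibrated = llo(ulinop, delta, gamma)     # recalibrated ULinOP
    w = np.exp(-t * c)                         # upweight judgments made close to the horizon
    time_weighted = llo(np.sum(w * y) / np.sum(w), delta, gamma)
    guess = base_rate                          # constant baseline forecast

    return {"ULinOP": ulinop, "Calibr. ULinOP": calibrated,
            "Calibr. Weighted Average": time_weighted, "Guess Baseline": guess}

print(aggregate_event([0.6, 0.8, 0.4], [30, 5, 1], base_rate=0.22))
```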

5. RESULTS

In the sections below, we evaluate forecasting performance of individual forecasters as well as forecast aggregates using various Beta-SDT models. We first describe the results of the basic hierarchical model and show individual forecaster differences in diagnosticity and calibration. We then show how we can use the nonhierarchical model to evaluate forecast aggregates. By combining the performance measures for individual forecasters and aggregates into one visualization, we show how forecast aggregates improve on the diagnosticity of the majority of individual forecasters. Finally, we show how the hierarchical model with a temporal component allows us to evaluate the temporal changes in forecasting performance as well as differences in those dynamics across forecasting domains.

Fig. 3. Empirical results and posterior predictive distributions of the hierarchical Beta-SDT model. The top panel shows the frequency counts of the probabilistic forecasts across all judges separated into problems where the event did or did not occur (signal and noise trials, respectively). The bottom panel shows the posterior predictive distributions when sampling new judges from the population-level distributions in the model.

5.1. Evaluating Individual Forecasters

The top panel of Fig. 3 shows the frequency distribution of probabilistic forecasts across all judges. Because the majority of events did not occur, there are more probabilistic forecasts associated with noise trials than signal trials. The results show that the distribution of forecasts on noise trials is skewed toward the lower probabilities. In contrast, the distribution of forecasts on signal trials is fairly uniform with no indication that higher probabilistic forecasts are favored for events that eventually will occur.

The bottom panel of Fig. 3 shows the posterior predictive distribution of forecasts for the basic hierarchical model. These correspond to the forecast distributions for a new (simulated) judge sampled from the group-level distributions using the distributions in Equations (3) and (4). These posterior predictive distributions describe what the forecasts of an "average" judge look like according to the model. Note that the empirical distributions and predictions of the model match reasonably well, with the one exception that the model favors a U-shaped Beta distribution for the signal trials in contrast to the relatively flat empirical distribution.

Fig. 4 illustrates the results of the hierarchical model for three individual judges. The top panels show the empirical distributions of forecasts. Note that judge #133 performs reasonably well and tends to separate forecasts for signal and noise trials. Judge #935 shows poor performance and does not seem to distinguish forecasts based on eventual outcome. Judge #1198 highlights a common issue when analyzing sparse forecasts from individual users. This judge has provided only a single forecast for a problem where the event did not occur. The middle panels show the inferred distributions for these individual judges. Note that for judge #1198, the model is able to infer a signal distribution even though the judge never contributed any forecasts for this condition. The hierarchical model in this case simply uses the parameter estimates at the group level to infer a distribution. The bottom panels show the inferred ROC curves. Importantly, the ROC curves shown are the mean curves. The model actually infers a distribution over ROC curves from which a distribution over AUC values can be calculated. The 95% credible intervals of AUC values are shown in the figure. Note that the model infers the AUC for judge #935 around 0.5, consistent with the empirical distributions of forecasts not discriminating between signal and noise trials. For judge #1198, the range of AUC values is much larger, indicating large uncertainty about the diagnosticity of this judge. The range includes AUC = 0.5, suggesting the judge might be completely undiagnostic, but most of the range exceeds 0.5, which implies that it is likely that the judge performs better than chance. This inference is based on only a single forecast (and, of course, on the assumptions in the model). However, this one forecast was quite accurate, which makes it more likely than not (but by no means guaranteed) that this judge will perform well on other problems.

Fig. 4. Results of the hierarchical Beta-SDT model for three individual judges. The top row shows the frequency counts of the judged probabilities separated into problems where the event did or did not occur (filled and open squares, respectively). The middle row shows the estimated Beta distributions. The bottom row shows the ROC curves with the 95% credible interval of the AUC values. Note that the left and middle columns show the results of judges with relatively many judgments, one associated with a high AUC (left column) and the other with a chance-level AUC (middle column). The right column shows a judge with a single probability judgment associated with an event that did not occur. This judge is associated with an expected AUC better than chance but there is great uncertainty about the exact AUC value. (The 95% AUC credible intervals shown in the panels are 0.67–0.94, 0.35–0.67, and 0.48–0.87, respectively.)

Fig. 5 shows the inferred AUC distributions for individual forecasters. The results show that the majority of individual forecasters have an AUC value around 0.65, but that there are significant individual differences—some forecasters appear to be performing at or near chance whereas others are much better than average.

5.2. Evaluating Forecast Aggregates

Up to this point, we have observed the diagnosticity of individual forecasters. We can also investigate how forecasting performance, as measured by AUC, MPE, and GCO, changes when aggregating over individual forecasts. Table I and Fig. 6 show the forecasting performance for the four aggregation procedures. The results show the AUC values derived from empirical ROC analysis as well as the nonhierarchical Beta-SDT model. The values are quite similar to each other, showing that the two procedures result in the same AUC (as long as the data set is not sparse). The results also show that (model) calibration (row 2 of Table I) has no effect on the AUC value, but it does remove the overestimation bias (GCO > 0). This is not surprising as the calibration is designed to remove the systematic error (leading to zero GCO) but cannot improve diagnosticity—the calibration procedure involves a strictly monotonic transformation of the probabilistic forecast and the AUC is not sensitive to such transformations. The results also show that taking a weighted average leads to higher diagnosticity relative to an unweighted average (compare the second and third rows of Table I). As a reminder, in the weighted average procedure, recent forecasts are upweighted relative to older forecasts. The difference in diagnosticity demonstrates that forecasters are better able to discriminate between signal and noise as time approaches the forecasting horizon. Finally, the results show that the guessing strategy involving a constant baseline probability is associated with zero diagnosticity (i.e., an AUC around 0.5) and no overestimation (i.e., a GCO value of 0).

Fig. 5. Estimated AUC values of individual forecasters using the hierarchical Beta-SDT model. Panel (a) shows the mean AUC values for each forecaster ordered from worst to best along with the 95% credible interval shown in the gray area. Panel (b) shows the distribution of AUC values across forecasters.

Table I. AUC, Global Calibration Offset (GCO), and MPE (Brier) Scores for Four Aggregation Methods

Model                      Empirical AUC   AUC                    GCO                         MPE (Brier)
ULinOp                     0.834           0.835 (0.753–0.901)    0.605 (0.552–0.656)         0.153 (0.135–0.171)
Calibr. ULinOp             0.828           0.822 (0.732–0.892)    0.042 (−0.077–0.150)        0.120 (0.090–0.152)
Calibr. Weighted Average   0.931           0.932 (0.873–0.968)    0.129 (0.012–0.238)         0.072 (0.052–0.098)
Guess Baseline             0.492           0.488 (0.381–0.592)    −0.009 (−0.016 to −0.001)   0.166 (0.131–0.202)

Notes: The empirical AUC is the area under the curve derived from empirical ROC analysis. AUC is the area under the curve estimated by the nonhierarchical Beta-SDT model. The ranges provide the 95% credible intervals.

Interestingly, the MPE based on Brier scores shows significant changes across all four methods. This is because the Brier score does not separate between diagnosticity and bias components of performance. This could potentially lead to the wrong conclusions. The guessing strategy, for example, shows an MPE that is similar to the MPE of the ULinOP. Based on these results, one might question whether human forecasters exceed the performance of simple guessing strategies that do not rely on any human judgment. However, based on the AUC, it is clear that aggregates of human probabilistic forecasts carry important diagnostic value that is not present in random guessing strategies.

Overall, the results for AUC and GCO show that some aggregation procedures perform well on diagnosticity but not calibration, and vice versa. Therefore, the AUC can be used to identify aggregation procedures that perform well on diagnosticity, which is arguably the most important goal when developing aggregation methods.

5.3. Comparing Individual Forecasters and Aggregates

We can also investigate how the performance of forecast aggregates compares with the performance of individual forecasters. Fig. 7, left panel, shows the AUC plotted against GCO for individual forecasters as well as the four aggregation methods. The results show that all aggregation procedures except the guessing heuristic are associated with a much higher diagnosticity than the majority of individual forecasters. However, just taking the unweighted average (ULinOP) does not reduce the tendency to overestimate event probabilities (measured by GCO) relative to the individual forecasters. This result is not surprising because averaging is not expected to remove any systematic bias. Visualizing the aggregation procedures and individual forecasters in the AUC versus GCO space clarifies which components of forecasting performance are improved under the different aggregation procedures.

Fig. 7, right panel, shows the Brier scores for individual forecasters in the AUC versus GCO space. Note that many individual forecasters have similar Brier scores but different AUC and GCO values. Fig. 8 plots the Brier score as a function of AUC with a point for each forecaster and points with 95% credible intervals for the various aggregation methods.


Fig. 6. Results of the basic Beta-SDT model applied to four aggregation models. The top row shows the frequency counts of the aggregated judgments across forecasting problems separated into problems where the event did or did not occur (solid and dashed lines, respectively). The middle row shows the estimated Beta distributions. The bottom row shows the ROC curves with the 95% credible interval for the AUC values.

Fig. 7. Estimated global calibration offset (GCO) plotted against estimated AUC values. The left panel shows individual forecasters as well as aggregation models. The right panel indicates the Brier scores for individual forecasters with the colormap as shown on the right side. For individual forecasters, only the mean performance numbers are visualized.


Fig. 8. Brier scores plotted against estimated AUC values for individual forecasters and aggregation models.

This visualization highlights the differences between the two indices. Note the severe nonmonotonicity in the relationship. This nonmonotonicity occurs because, as Yates (1982) has demonstrated, the Brier score reflects three components: diagnosticity, calibration, and problem difficulty (base rate). For any given base rate, procedures or methods that decrease Brier scores could be either improving calibration or diagnosticity. Because calibration can always be improved via monotonic transformation, the more fundamental issue is to assess improvement in diagnosticity. AUC provides this measure uncontaminated by other factors.

Fig. 7 also shows a correlation between the estimated AUC and GCO values. Individual forecasters with higher diagnosticity tend to overestimate event probabilities to a lesser degree. This correlation stems from the fact that the base rate plays a role in both the GCO and AUC measures. In an analysis of the prior predictive distribution(44) of the model, we found that a small negative (positive) correlation can be expected, a priori, between AUC and GCO, when the base rate of events falls below 0.5 (above 0.5).

5.4. Evaluating the Effects of Time and Forecasting Domain

Finally, we evaluate how the performance of individual forecasters changes over time and how the dynamics of this change vary among forecasting domains. Fig. 9 shows the empirical mean forecasts conditioned on event occurrence and nonoccurrence, computed over a number of temporal ranges. The figure also shows the posterior predictive means (lines; collapsed across judges) from the hierarchical model that incorporates time effects. Note that there is a reasonable overlap between theory and data. For some forecasting domains, such as science and technology, there are only a few forecasting problems (N = 16), which makes the estimation of empirical means difficult. The majority of forecasting problems in the current data set fall in the politics and policy domain (N = 82), which leads to a clearer picture of the temporal dynamics. Overall, the results show that the signal and noise distributions separate as time approaches the forecasting horizon.
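A minimal sketch of the empirical quantity being plotted, assuming forecasts, outcomes, and days-to-horizon are stored as flat arrays (the variable names and bin edges are illustrative):

```python
import numpy as np

def conditional_means_by_time(forecasts, occurred, days_to_horizon, bin_edges):
    """Empirical mean forecast within each time bin, computed separately for
    problems where the event occurred ('signal') and did not occur ('noise').
    Bins with no problems of a given type yield NaN."""
    forecasts = np.asarray(forecasts, float)
    occurred = np.asarray(occurred, bool)
    bins = np.digitize(days_to_horizon, bin_edges)
    out = {}
    for b in np.unique(bins):
        in_bin = bins == b
        sig = forecasts[in_bin & occurred]
        noi = forecasts[in_bin & ~occurred]
        out[int(b)] = (sig.mean() if sig.size else np.nan,
                       noi.mean() if noi.size else np.nan)
    return out

# Toy usage: six forecasts, days before the horizon, and 20-day bins.
print(conditional_means_by_time(
    forecasts=[0.8, 0.3, 0.6, 0.2, 0.9, 0.4],
    occurred=[1, 0, 1, 0, 1, 0],
    days_to_horizon=[5, 5, 25, 25, 45, 45],
    bin_edges=[20, 40]))
```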

Fig. 10 shows the mean estimated AUC (across judges) as a function of time and forecasting domain. Note that these trends show the performance of the average individual forecaster. The figure also shows the 95% credible interval as dashed lines. The comparisons among forecasting domains reveal some interesting differences in temporal dynamics. For example, for the politics and policy domain, the AUC quickly increases when fewer than 20 days remain in the forecasting period, suggesting that the type of problems in this domain can be resolved with recent information.


[Fig. 9 appears here: six panels (Politics & Policy, Business & Economy, Science & Technology, Military & Security, Sports, Health, Social, and All Categories) plotting mean forecast probability (0–1) against days before the forecasting horizon (0–60).]

Fig. 9. Empirical and theoretical means of probabilistic forecasts given signal (event occurrence) and noise (event nonoccurrence), respectively, as a function of time (horizontal axis) and forecasting domain (panels). Note that these means are based on the average individual forecaster. Signal and noise distributions are represented by solid and dashed lines (model) and squares and circles (empirical data), respectively. Note that time is expressed as the number of days before the forecasting horizon. Thus, time from problem closure decreases to the left on the horizontal axis.

In the domain of business and economy, on the other hand, the AUC changes slowly, indicating that accurate information available to forecasters does not accumulate quickly over time. The most important message in Fig. 10, however, is the general one that AUC can quantify diagnosticity, or forecasting difficulty, as a function of domain and horizon.

6. DISCUSSION

In this article, we introduced a novel Beta-SDT modeling approach to describing and evaluating probabilistic forecasts. We showed that the Beta-SDT model can be estimated on the basis of sparse data using Bayesian hierarchical modeling techniques. The hierarchical model allows us to estimate the underlying belief distributions for individual forecasters as well as for the group of forecasters as a whole. Furthermore, the model provides estimates of diagnosticity (AUC) along with credible intervals for individuals or groups, overall or for defined domains, collapsed over time or as a function of time expressed as forecasting horizon.

Two features not emphasized earlier bear mentioning here. The first is that the GCO measure introduced above addresses a global aspect of calibration, which is forecasters' tendency to overestimate (or underestimate, if that turns out to be the case) the likelihood of events. By distinguishing AUC and GCO at the level of individual forecasters, we can compare the performance of a large number of forecasters with aggregates of the individual forecasts. Going beyond what we already know about the benefits of forecast aggregation,(29) Figs. 7 and 8 show that any of the aggregation methods improve diagnosticity (AUC) relative to the individual forecasts, some methods more than others, whereas recalibration is required to reduce the GCO. Aggregation via calibrated weighted averaging provides better AUC than does aggregation via the calibrated ULinOp model, but at the expense of GCO.


[Fig. 10 appears here: six panels (Politics & Policy, Business & Economy, Science & Technology, Military & Security, Sports, Health, Social, and All Categories) plotting AUC (0.4–1) against days before the forecasting horizon (0–40).]

Fig. 10. AUC of the average individual forecaster as a function of time (horizontal axis) and forecasting domain (panels). The dashed lines show the 95% confidence interval across individual forecasters. Note that time is expressed as the number of days before the forecasting horizon.

We argue that this tradeoff is worthwhile because, within this framework, what matters are the AUC and the hit and false alarm probabilities associated with any decision threshold, not the numerical values of the forecasts themselves.

6.1. Other Modeling Extensions

We have already discussed a number of variants of the Beta-SDT approach that allow us to investigate individual differences as well as differences in the temporal dynamics and the effect of forecasting domain. One attractive feature of the Beta-SDT approach is that it can be extended in many other ways.

For example, the model can be extended to handle ordinal judgments even when expressed nonnumerically (e.g., unlikely, perhaps, likely) by including response thresholds for the samples from the underlying evidence distributions. In fact, the same latent belief distributions could be used to model the probabilistic forecasts as well as ordinal responses, allowing a mapping between different response types.
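A minimal sketch of the thresholding idea: a sample from the latent evidence distribution is mapped to a verbal category by comparing it against ordered cutpoints. The cutpoints and labels below are illustrative placeholders; in the extension described here they would be estimated jointly with the rest of the model.

```python
import numpy as np
from scipy.stats import beta

def ordinal_response(latent_belief, cutpoints, labels):
    """Map a latent belief in [0, 1] to an ordinal verbal category by comparing
    it against an ordered set of response thresholds (cutpoints)."""
    index = np.searchsorted(cutpoints, latent_belief)
    return labels[index]

labels = ["unlikely", "perhaps", "likely"]
cutpoints = [0.35, 0.65]   # illustrative thresholds, not estimated values
rng = np.random.default_rng(2)
sample = beta.rvs(3.0, 2.0, random_state=rng)  # one draw from a latent evidence distribution
print(round(float(sample), 2), ordinal_response(sample, cutpoints, labels))
```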

We have also focused in the current work on forecasting problems with only two possible outcomes. However, it is possible to extend the ROC analysis to multiple classes,(28,45) and these alternative ROC approaches can motivate different SDT models.
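One common strategy for such an extension, in the spirit of the multiclass AUC of Hand and Till,(28) is to average pairwise AUCs computed on one pair of outcome classes at a time. The sketch below is our illustration of that general strategy applied to per-class probability forecasts; it is not the specific estimator of either cited reference.

```python
import numpy as np
from itertools import combinations

def pairwise_auc(scores, y, pos, neg):
    """AUC for distinguishing class `pos` from class `neg`, using the predicted
    probability of `pos` as the score (rank-based estimate, ties count one-half)."""
    s_pos = scores[y == pos, pos]
    s_neg = scores[y == neg, pos]
    greater = (s_pos[:, None] > s_neg[None, :]).mean()
    ties = (s_pos[:, None] == s_neg[None, :]).mean()
    return greater + 0.5 * ties

def multiclass_auc(scores, y):
    """Average of symmetrized pairwise AUCs over all class pairs."""
    classes = np.unique(y)
    vals = [0.5 * (pairwise_auc(scores, y, i, j) + pairwise_auc(scores, y, j, i))
            for i, j in combinations(classes, 2)]
    return float(np.mean(vals))

# Toy example: three possible outcomes, one predicted probability vector per problem.
scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.2, 0.3, 0.5],
                   [0.5, 0.4, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.3, 0.5, 0.2]])
y = np.array([0, 1, 2, 0, 2, 1])
print(round(multiclass_auc(scores, y), 3))
```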

Finally, we have assumed that the resolution of the binary forecasting problems leads to an unambiguous assignment of E and ¬E. However, in some situations, the definitions of E and ¬E are arbitrary. For example, a temperature forecasting problem could be framed in terms of the temperature exceeding a given value or in terms of the temperature not exceeding a given value. In these cases, the SDT model can be simplified to enforce symmetry in the signal and noise distributions, i.e., μ0 = 1 − μ1 and ξ0 = ξ1.
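Under this constraint the noise distribution is simply the mirror image of the signal distribution around 0.5, as a small sketch makes clear (again assuming the mean–precision parameterization a = μξ, b = (1 − μ)ξ):

```python
from scipy.stats import beta

def symmetric_beta_pair(mu1, xi1):
    """Signal and noise Beta distributions under the symmetry constraint
    mu0 = 1 - mu1 and xi0 = xi1: the noise density is the mirror image of the
    signal density around 0.5."""
    signal = beta(mu1 * xi1, (1 - mu1) * xi1)
    noise = beta((1 - mu1) * xi1, mu1 * xi1)
    return signal, noise

signal, noise = symmetric_beta_pair(mu1=0.7, xi1=5.0)
# Mirror-image check: the signal density at x equals the noise density at 1 - x.
print(round(signal.pdf(0.8), 3), round(noise.pdf(0.2), 3))
```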

6.2. Relationship to Brier Score and Related Measures

The relative performance of competing forecasting systems is often evaluated with Brier scores. We contend that the Brier score, or any strictly proper scoring rule, is not sufficient for evaluating forecasting systems, as these rules were developed to motivate honest reporting, not to compare systems.


The Brier score, as illustrated in our results, is sensitive to different components of forecasting performance, including diagnosticity and calibration. One possibility is to decompose the Brier score into component scores to assess different dimensions of forecasting performance.(37,46,47) Although a detailed discussion of these decompositions is outside the scope of this article, we note a few differences from the Beta-SDT approach. First, it is not clear how to separate out different components of performance in situations where only sparse forecasting data are available. There is also no obvious way to take uncertainty about the measured values of the components into account. In addition, the AUC comes with a guarantee that it is insensitive to strictly monotonic transformations of probabilistic forecasts. In contrast, there are no corresponding results for components of the Brier score. Indeed, although AUC is a principled statistic based on underlying theory, the Brier score decomposition simply reflects a convenient variance partitioning. See Ref. 46 for further discussion of issues associated with interpreting partitioned components. Finally, one of the appealing properties of the Beta-SDT approach is that it can be viewed as a model that describes the generation of probabilistic forecasts and can be extended in any number of ways to take additional covariates, different response types, and different types of forecasting problems into account. In contrast, the Brier score decomposition is based on a measurement approach that does not lend itself easily to extensions, primarily because it was not intended to provide an explanation of the process of forecasting.
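The invariance claim is easy to verify numerically: a strictly increasing transformation preserves every pairwise ordering of forecasts, and therefore the rank-based empirical AUC, while the Brier score generally changes. A small check with simulated data and an arbitrary monotone transformation:

```python
import numpy as np

def rank_auc(forecasts, outcomes):
    """Empirical AUC: probability that a randomly chosen forecast for an event
    that occurred exceeds one for an event that did not (ties count one-half)."""
    f1 = forecasts[outcomes == 1]
    f0 = forecasts[outcomes == 0]
    return float((f1[:, None] > f0[None, :]).mean()
                 + 0.5 * (f1[:, None] == f0[None, :]).mean())

rng = np.random.default_rng(3)
outcomes = (rng.random(200) < 0.4).astype(int)
forecasts = np.clip(0.4 + 0.3 * (outcomes - 0.4) + rng.normal(0, 0.2, 200), 0.01, 0.99)
squashed = forecasts ** 3  # an arbitrary strictly increasing transformation
brier = lambda f: float(np.mean((f - outcomes) ** 2))
print(rank_auc(forecasts, outcomes), rank_auc(squashed, outcomes))  # identical
print(brier(forecasts), brier(squashed))                            # generally not identical
```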

7. CONCLUSIONS AND FINAL THOUGHTS

The developments in this article highlight two main points. (1) It is possible to distinguish in a principled way two important properties of forecasts in order to assess the effects of any methods for improving them, or to quantify differences in forecastability as a function of domain and/or time, or overall. AUC is a principled, easy-to-understand index of diagnosticity; GCO, the log-difference between the mean forecast and the base rate, is an easy-to-understand index of bias. (2) The Beta-SDT model is a powerful tool for estimating AUC, along with the precision of the estimate, for individuals or groups, overall or within domains, collapsed over time or as a function of time.

An additional point not emphasized until now, but that derives directly from SDT, is that if the ROC is sufficiently well specified for any given forecaster, group, or domain, it provides the DM with the probability estimates she needs in order to do expected utility analyses on potential decisions. That is, the probability forecasts, per se, are not sufficient for the DM to make a best decision. She also requires good estimates of the hit and false alarm rates, which are obtained from the ROC. Armed with estimates of these two rates as well as of the event probability, the DM can utilize cost or utility estimates of the two kinds of errors, a miss (complement of a hit) or a false alarm, and estimate the expected utility of acting as though the event will or will not occur.

Details on how such a decision policy might be implemented await further research, but we can address one potential criticism immediately. That criticism is that the utilities of outcomes of many decisions, e.g., public policy, national security, or personal health, cannot be numerically estimated and, in that sense, expected utility is not the right decision criterion at all. To this point we agree, but nevertheless sensitivity analyses, accomplished by varying the ratios of the error costs, say from small to large, can be enormously helpful to the DM in arriving at a justifiable decision.
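The kind of calculation sketched in the last two paragraphs might look as follows: given hit and false-alarm probabilities read off the ROC at a few candidate thresholds, the event base rate, and utilities for the four possible outcomes, the expected utility of each threshold is a probability-weighted sum, and sweeping the ratio of miss to false-alarm costs provides the sensitivity analysis just described. All numerical values below are placeholders.

```python
import numpy as np

def expected_utility(p_hit, p_fa, base_rate, u_hit, u_miss, u_fa, u_cr):
    """Expected utility of acting on a threshold with the given hit and
    false-alarm probabilities: each of the four outcomes (hit, miss, false
    alarm, correct rejection) is weighted by its probability of occurring."""
    return (base_rate * (p_hit * u_hit + (1 - p_hit) * u_miss)
            + (1 - base_rate) * (p_fa * u_fa + (1 - p_fa) * u_cr))

# Hit/false-alarm pairs for a few candidate thresholds, e.g., read off an ROC curve.
roc_points = [(0.95, 0.60), (0.85, 0.35), (0.70, 0.15), (0.50, 0.05)]
base_rate = 0.3

# Sensitivity analysis: vary the miss cost relative to a fixed false-alarm cost.
for miss_cost in (1, 5, 20):
    eus = [expected_utility(h, fa, base_rate,
                            u_hit=0.0, u_miss=-miss_cost, u_fa=-1.0, u_cr=0.0)
           for h, fa in roc_points]
    best = int(np.argmax(eus))
    print(f"miss cost {miss_cost:>2}: best threshold index {best}, EU {eus[best]:.2f}")
```

As the cost of a miss grows relative to the cost of a false alarm, the preferred operating point moves toward the high-hit-rate end of the ROC, which is the qualitative pattern a sensitivity analysis would present to the DM.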

ACKNOWLEDGMENTS

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D11PC20059. The U.S. government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon.

DISCLAIMER

The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. government.

REFERENCES

1. Green DM, Swets JA. Signal Detection Theory and Psychophysics. New York: Wiley Press, 1966.

2. Egan JP. Recognition Memory and the Operating Characteristic. Bloomington, IN: Hearing and Communication Laboratory, Indiana University, Tech. Rep. AFCRC-TN-58-51, 1958.

3. Macmillan NA, Creelman CD. Detection Theory: A User's Guide. Mahwah, NJ: Lawrence Erlbaum Associates, 2005.


4. McClelland GH. Use of signal detection theory as a tool for enhancing performance and evaluating tradecraft in intelligence analysis. Pp. 83–100 in Fischhoff B, Chauvin C (eds). Intelligence Analysis: Behavioral and Social Scientific Foundations. Washington, DC: National Academies Press, 2011.

5. Warnaar D, Merkle E, Steyvers M, Wallsten T, Stone E, Budescu D, Yates J, Sieck W, Arkes H, Argenta C, Shin Y, Carter J. The aggregative contingent estimation system: Selecting, rewarding, and training experts in a wisdom of crowds approach to forecasting. Pp. 75–76 in Proceedings of the 2012 Association for the Advancement of Artificial Intelligence Spring Symposium Series. AAAI Technical Report SS-12-06, 2012.

6. DeCarlo LT. Signal detection theory with item effects. Journal of Mathematical Psychology, 2011; 55: 229–239.

7. Rouder JN, Lu J. An introduction to Bayesian hierarchical models with an application in the theory of signal detection. Psychonomic Bulletin and Review, 2005; 12: 573–604.

8. Rouder JN, Lu J, Sun D, Speckman PL, Morey RD, Naveh-Benjamin M. Signal detection models with random participant and item effects. Psychometrika, 2007; 72: 621–642.

9. Lee MD. Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin and Review, 2008; 15: 1–15.

10. Tanner WP, Swets JA. A decision-making theory of visual detection. Psychological Review, 1954; 61: 401–409.

11. Swets JA. The relative operating characteristic in psychology. Science, 1973; 182: 990–1000.

12. Marzban C. Performance measures and uncertainty. Pp. 49–75 in Haupt SE, Pasini A, Marzban C (eds). Artificial Intelligence Methods in the Environmental Sciences. Berlin Heidelberg: Springer-Verlag, 2009.

13. Provost F, Fawcett T, Kohavi R. The case against accuracy estimation for comparing induction algorithms. Pp. 445–453 in Proceedings of the Fifteenth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann, 1997.

14. Spackman KA. Signal detection theory: Valuable tools for evaluating inductive learning. Pp. 160–163 in Proceedings of the Sixth International Workshop on Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers Inc., 1989.

15. Lusted LB. Signal detectability and medical decision-making. Science, 1971; 171: 1217–1219.

16. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford, UK: Oxford University Press, 2003.

17. Zhou XH, McClish DK, Obuchowski NA. Statistical Methods in Diagnostic Medicine. New York: John Wiley and Sons, 2002.

18. Harvey LO, Hammond KR, Lusk CM, Mross EF. The application of signal detection theory to weather forecasting behavior. Monthly Weather Review, 1992; 120: 863–883.

19. Levi K. A signal detection framework for the evaluation of probabilistic forecasts. Organizational Behavior and Human Decision Processes, 1985; 36: 143–166.

20. Mason SJ. A model for assessment of weather forecasts. Australian Meteorological Magazine, 1979; 30: 291–303.

21. Mason SJ, Graham NE. Areas beneath the relative operating characteristics (ROC) and the relative operating levels (ROL) curves: Statistical significance and interpretation. Quarterly Journal of the Royal Meteorological Society, 2002; 128: 2145–2166.

22. Swets JA. Signal Detection Theory and ROC Analyses in Psychology and Diagnostics: Collected Papers. Mahwah, NJ: Erlbaum, 1996.

23. Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters, 2006; 27: 861–874.

24. Berrar D, Flach P. Caveats and pitfalls of ROC analysis in clinical microarray research (and how to avoid them). Briefings in Bioinformatics, 2012; 13: 83–97.

25. Swets JA, Dawes RM, Monahan J. Psychological science can improve diagnostic decisions. Psychological Science in the Public Interest, 2000; 1: 1–26.

26. Flach PA. ROC analysis. Pp. 869–875 in Sammut C, Webb GI (eds). Encyclopedia of Machine Learning, No. 19. Springer, 2010.

27. Wickens TD. Elementary Signal Detection Theory. New York: Oxford University Press, 2001.

28. Hand DJ, Till RJ. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 2001; 45: 171–186.

29. Turner BM, Steyvers M, Merkle EC, Budescu DV, Wallsten TS. Forecast aggregation via recalibration. Machine Learning, in press.

30. Kharin VV, Zwiers FW. On the ROC score of probability forecasts. Journal of Climate, 2003; 16: 4145–4150.

31. Marzban C. The ROC curve and the area under it as performance measures. Weather and Forecasting, 2004; 19: 1106–1114.

32. Kosmidis I, Firth D. A generic algorithm for reducing bias in parametric estimation. Electronic Journal of Statistics, 2010; 4: 1097–1112.

33. Smithson M, Merkle EC, Verkuilen J. Beta regression finite mixture models of polarization and priming. Journal of Educational and Behavioral Statistics, 2011; 36: 804–831.

34. Plummer M. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Hornik K, Leisch F, Zeileis A (eds). Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 2003.

35. Armstrong JS. Principles of Forecasting. Norwell, MA: Kluwer Academic, 2001.

36. Brier GW. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 1950; 78: 1–3.

37. Murphy AH. A new vector partition of the probability score. Journal of Applied Meteorology, 1973; 12: 595–600.

38. Ariely D, Au WT, Bender RH, Budescu DV, Dietz CB, Gu H, Wallsten TS, Zauberman G. The effects of averaging subjective probability estimates between and within judges. Journal of Experimental Psychology: Applied, 2000; 6: 130–147.

39. Clemen RT. Calibration and the aggregation of probabilities. Management Science, 1986; 32: 312–314.

40. Clemen RT, Winkler RL. Combining economic forecasts. Journal of Business and Economic Statistics, 1986; 4: 39–46.

41. Cooke RM. Experts in Uncertainty: Opinion and Subjective Probability in Science. New York: Oxford University Press, 1991.

42. Wallsten TS, Budescu DV, Erev I. Evaluating and combining subjective probability estimates. Journal of Behavioral Decision Making, 1997; 10: 243–268.

43. Gonzalez R, Wu G. On the shape of the probability weighting function. Cognitive Psychology, 1999; 38: 129–166.

44. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. New York: Chapman and Hall, 2004.

45. Wandashin MS, Mullen SJ. Multiclass ROC analysis. Weather and Forecasting, 2009; 24: 530–547.

46. Yaniv I, Yates JF, Smith JEK. Measures of discrimination skill in probabilistic judgment. Psychological Bulletin, 1991; 110: 611–617.

47. Yates JF. External correspondence: Decompositions of the mean probability score. Organizational Behavior and Human Performance, 1982; 30: 132–156.